Hi there,

On Thu, 2012-05-10 at 16:06 +0100, Caolán McNamara wrote:
> fdo#47644 big 18 meg .doc is super dooper slow to load, mostly because
> of the extra checks I added to sanity test .doc file during parsing.

        :-)

> 1st attachment gets us back to where we came from in terms of
> performance there.

        Great - I've pushed that to libreoffice-3-5.

> The second one is maybe a bit more contentious for 3-5, but I include it
> for a look-over. Basically there's a "page chain" in the ole2 format and
> to seek to a location you walk through the chain until you get to the
> correct page.

        Yep - it's a FAT structure; nasty one-slot linked list.

>  I see that in practice most documents have their chain in
> a presorted order (I have no idea if having them unsorted is actually a
> sign of a broken document or if its legal)

        It is legal, but it's highly unlikely.

> So second attachment keeps the chain if its one of the circumstances
> where we have to parse the whole thing anyway, and if not (the usual
> one) then if we're seeking an (fairly arbitrary) large number of steps
> where its likely that the cost-benefit is in favour of it then generate
> the chain once and hold on to it for reuse until there any event which
> might invalidate the chain.

        So - I'm a bit lost with the second patch. The FAT is essentially a
linked list; and we walk it to find the block we want, so:

        nRel = 3; -> go three steps along the linked list.
        A -> B -> C == C ...

        Your pageCache looks like the right approach, and turns this into a
simple vector we can rapidly look up:

        [0] = A, [1] = B, [2] = C

        And so on - hopefully I'm there so far :-) With the 512 byte block size
we expect, that turns a 100Mb OLE2 stream into a 800k vector (which is
just fine).

> Second patch knocks about 50 seconds off load time. First patch knocks
> some unknown but > 10 mins off load time. 

        My question is thus about:

[snip]
        std::vector<sal_Int32>::iterator aI;

        if (m_bSortedPageChain)
            aI = std::lower_bound(m_aPagesCache.begin(), m_aPagesCache.end(), 
nBgn);
        else
            aI = std::find(m_aPagesCache.begin(), m_aPagesCache.end(), nBgn);

        if (aI == m_aPagesCache.end())
        {
            SAL_WARN("sot", "Unknown page position");
            nBgn = STG_EOF;
        }
        else
        {
            size_t nBgnDistance = std::distance(m_aPagesCache.begin(), aI);

            size_t nIndex = nBgnDistance + nRel;

            if (nIndex > m_aPagesCache.size())
            {
                nRel = m_aPagesCache.size() - nBgnDistance;
                nIndex = m_aPagesCache.size() - 1;
            }
            else
                nRel = 0;

            nLast = nIndex ? m_aPagesCache[nIndex - 1] : STG_EOF;
            nBgn = m_aPagesCache[nIndex];
        }
[/snip]

        Surely that should read:

                nBgn = m_aPageCache[nRel];

        or did I go a bit crazy ? [ I'm trying to turn the above into that
mentally and giving up ;-].

        Otherwise the clearing of the cache looks sensible; though the:

        StgFATStrm::SetPage(); should prolly just update that vector as well as
doing the nasty fat walk so:

        m_aPageCache[nBlocks] = nNewPage;

        or somesuch (assuming it is long enough for that).

        Otherwise it looks really lovely.

        Thanks !

                Michael.

-- 
michael.me...@suse.com  <><, Pseudo Engineer, itinerant idiot

_______________________________________________
LibreOffice mailing list
LibreOffice@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/libreoffice

Reply via email to