Hi there, On Thu, 2012-05-10 at 16:06 +0100, Caolán McNamara wrote: > fdo#47644 big 18 meg .doc is super dooper slow to load, mostly because > of the extra checks I added to sanity test .doc file during parsing.
:-) > 1st attachment gets us back to where we came from in terms of > performance there. Great - I've pushed that to libreoffice-3-5. > The second one is maybe a bit more contentious for 3-5, but I include it > for a look-over. Basically there's a "page chain" in the ole2 format and > to seek to a location you walk through the chain until you get to the > correct page. Yep - it's a FAT structure; nasty one-slot linked list. > I see that in practice most documents have their chain in > a presorted order (I have no idea if having them unsorted is actually a > sign of a broken document or if its legal) It is legal, but it's highly unlikely. > So second attachment keeps the chain if its one of the circumstances > where we have to parse the whole thing anyway, and if not (the usual > one) then if we're seeking an (fairly arbitrary) large number of steps > where its likely that the cost-benefit is in favour of it then generate > the chain once and hold on to it for reuse until there any event which > might invalidate the chain. So - I'm a bit lost with the second patch. The FAT is essentially a linked list; and we walk it to find the block we want, so: nRel = 3; -> go three steps along the linked list. A -> B -> C == C ... Your pageCache looks like the right approach, and turns this into a simple vector we can rapidly look up: [0] = A, [1] = B, [2] = C And so on - hopefully I'm there so far :-) With the 512 byte block size we expect, that turns a 100Mb OLE2 stream into a 800k vector (which is just fine). > Second patch knocks about 50 seconds off load time. First patch knocks > some unknown but > 10 mins off load time. My question is thus about: [snip] std::vector<sal_Int32>::iterator aI; if (m_bSortedPageChain) aI = std::lower_bound(m_aPagesCache.begin(), m_aPagesCache.end(), nBgn); else aI = std::find(m_aPagesCache.begin(), m_aPagesCache.end(), nBgn); if (aI == m_aPagesCache.end()) { SAL_WARN("sot", "Unknown page position"); nBgn = STG_EOF; } else { size_t nBgnDistance = std::distance(m_aPagesCache.begin(), aI); size_t nIndex = nBgnDistance + nRel; if (nIndex > m_aPagesCache.size()) { nRel = m_aPagesCache.size() - nBgnDistance; nIndex = m_aPagesCache.size() - 1; } else nRel = 0; nLast = nIndex ? m_aPagesCache[nIndex - 1] : STG_EOF; nBgn = m_aPagesCache[nIndex]; } [/snip] Surely that should read: nBgn = m_aPageCache[nRel]; or did I go a bit crazy ? [ I'm trying to turn the above into that mentally and giving up ;-]. Otherwise the clearing of the cache looks sensible; though the: StgFATStrm::SetPage(); should prolly just update that vector as well as doing the nasty fat walk so: m_aPageCache[nBlocks] = nNewPage; or somesuch (assuming it is long enough for that). Otherwise it looks really lovely. Thanks ! Michael. -- michael.me...@suse.com <><, Pseudo Engineer, itinerant idiot _______________________________________________ LibreOffice mailing list LibreOffice@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/libreoffice