Looked at it quickly - very nice!

Maruan

On 27.02.2015 at 16:34, Andrea Vacondio <[email protected]> wrote:
> Hi,
> a few days ago I was profiling PDFBox while loading medium/large
> documents and I think I found something.
> If you try loading the document
> http://www.adobe.com/devnet/acrobat/pdfs/pdf_reference_1-7.pdf you'll see
> that it takes quite some time, and that most of it is spent in
> XrefTrailerResolver.getContainedObjectNumbers. The issue is that every time
> an object contained in an unparsed object stream is found, the
> XrefTrailerResolver performs a full scan of the xref entries found in the
> document, in this case hundreds of thousands. If there are many object
> streams (as in the given doc), it performs many full scans, resulting in
> poor performance.
> I'm trying to get familiar with the PDFBox code and I decided to try to
> fix this here: https://github.com/torakiki/sambox/tree/xref
> As you can see, I refactored a bit, extracting some classes, and covered
> the expected behaviour with unit tests. I tested it with a few random docs,
> loading them and saving them back, and the output is exactly the same with
> or without my changes. The pdf_reference_1-7.pdf doc loads in half the
> time, and so does this one:
> http://wwwimages.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf
> Other kinds of docs load in a comparable amount of time, and when
> profiling, memory usage seems comparable, if not a little lower.
> Maybe someone wants to take a look?
>
> I understand my changes look a bit invasive and the issue could probably be
> fixed differently. On the other hand, the BaseParser+COSParser pair looks
> like a big intimidating monster to a newcomer like me, and it's quite
> difficult to follow the expected behaviour, so I thought this might be a
> chance to start breaking them down into smaller, distilled classes...
> something a little more manageable and testable... anyway, grab what you
> like, leave what you don't :)
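
For readers following the thread, the repeated-full-scan problem described in the quoted message can be illustrated with a minimal sketch. This is not the code from the sambox branch and not PDFBox's actual API: the class name ObjectStreamIndex, its methods, and the simplified xref model (a map from object number to the number of the object stream that contains it) are all hypothetical, chosen only to show the idea of grouping the entries once instead of scanning them on every lookup.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; names and data model are assumptions,
// not taken from PDFBox or from the sambox branch.
final class ObjectStreamIndex {

    // object stream number -> numbers of the objects stored inside it
    private final Map<Long, Set<Long>> byStream = new HashMap<>();

    // Single pass over all xref entries: O(n) once, at construction time.
    // The input maps each compressed object's number to the number of the
    // object stream containing it (a simplified stand-in for the xref table).
    ObjectStreamIndex(Map<Long, Long> containingStreamByObject) {
        for (Map.Entry<Long, Long> entry : containingStreamByObject.entrySet()) {
            byStream.computeIfAbsent(entry.getValue(), k -> new TreeSet<>())
                    .add(entry.getKey());
        }
    }

    // Lookup no longer touches the full xref table.
    Set<Long> containedObjectNumbers(long streamNumber) {
        return byStream.getOrDefault(streamNumber, Collections.<Long>emptySet());
    }

    // The slow pattern, for comparison: a full scan on every call, so
    // resolving s object streams over n entries costs O(s * n) overall.
    static Set<Long> scanForContainedObjectNumbers(
            Map<Long, Long> containingStreamByObject, long streamNumber) {
        Set<Long> result = new TreeSet<>();
        for (Map.Entry<Long, Long> entry : containingStreamByObject.entrySet()) {
            if (entry.getValue() == streamNumber) {
                result.add(entry.getKey());
            }
        }
        return result;
    }
}
```

With hundreds of thousands of xref entries and many object streams, building the index once and answering each lookup from it avoids the repeated scans. Whether the actual fix works this way is an assumption; the branch linked above is the authoritative version.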

