Hi, a few days ago I was profiling PDFBox while loading medium and large documents, and I think I found something. If you try loading the document http://www.adobe.com/devnet/acrobat/pdfs/pdf_reference_1-7.pdf you'll see it takes quite some time, and most of that time is spent in XrefTrailerResolver.getContainedObjectNumbers. The issue is that every time an object contained in a not-yet-parsed object stream is found, the XrefTrailerResolver performs a full scan of the xref entries found in the document, in this case hundreds of thousands. If there are many object streams (as in the given doc), it performs many full scans, which results in poor performance.
I'm trying to get familiar with the PDFBox code, so I decided to try to fix this here: https://github.com/torakiki/sambox/tree/xref As you can see, I refactored a bit, extracting some classes, and covered the expected behaviour with unit tests. I tested it with a few random docs, loading them and saving them back, and the output is exactly the same with or without my changes.
With my changes, both pdf_reference_1-7.pdf and http://wwwimages.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf load in half the time. Other kinds of docs load in a comparable amount of time, and from profiling, memory usage seems comparable too, if not a little lower. Maybe someone wants to take a look?
I understand my changes look a bit invasive and the issue could probably be fixed differently. On the other hand, the BaseParser + COSParser pair looks like a big intimidating monster to a newcomer like me, and it's quite difficult to follow the expected behaviour, so I thought this might be a chance to start breaking them down into smaller, distilled classes, something a little more manageable and testable. Anyway, grab what you like, leave what you don't :)
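For anyone who wants the gist without reading the branch, here is a minimal sketch of the general idea: instead of rescanning every xref entry each time an object stream is looked up, group the entries by containing stream once and answer later lookups from that index. This is not the code from the branch or from PDFBox. The class name, the method names and the map layout (object number -> number of the object stream that contains it) are made up for illustration only.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch, not PDFBox or SAMBox code.
// Assumed layout: xref maps object number -> number of the object stream
// that contains it; objects stored outside object streams are not in the map.
class XrefIndexSketch {

    // Slow variant: every call walks all xref entries.
    // With S object streams and N entries, S lookups cost roughly S * N steps.
    static Set<Long> containedByFullScan(Map<Long, Long> xref, long streamNumber) {
        Set<Long> result = new TreeSet<>();
        for (Map.Entry<Long, Long> entry : xref.entrySet()) {
            if (entry.getValue() == streamNumber) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    // Faster variant, step 1: a single pass groups object numbers by stream.
    static Map<Long, Set<Long>> indexByStream(Map<Long, Long> xref) {
        Map<Long, Set<Long>> index = new HashMap<>();
        for (Map.Entry<Long, Long> entry : xref.entrySet()) {
            index.computeIfAbsent(entry.getValue(), k -> new TreeSet<>())
                 .add(entry.getKey());
        }
        return index;
    }

    // Faster variant, step 2: each lookup is a single map access.
    static Set<Long> containedByIndex(Map<Long, Set<Long>> index, long streamNumber) {
        return index.getOrDefault(streamNumber, Collections.emptySet());
    }

    public static void main(String[] args) {
        // Tiny made-up xref: objects 1 and 2 live in stream 10, object 3 in stream 20.
        Map<Long, Long> xref = new HashMap<>();
        xref.put(1L, 10L);
        xref.put(2L, 10L);
        xref.put(3L, 20L);

        Map<Long, Set<Long>> index = indexByStream(xref);
        System.out.println(containedByFullScan(xref, 10L));
        System.out.println(containedByIndex(index, 10L));
    }
}
```

The trade-off is one extra map kept in memory in exchange for avoiding the repeated scans. Whether that matches what the real resolver needs (for example with several xref sections or incremental updates) is something I have not modelled here.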

