Hi

Am 28.02.2015 um 16:47 schrieb Tilman Hausherr:
Hi Andrea,

While a speed improvement in parsing of large files would be much appreciated
(especially by the TIKA users), there are several problems with your change:
+1

- don't do changes that need JDK7 or higher even if they are cool. We use JDK6
currently.

- regressions:

Error converting file PDFBOX-2250-110264-xref-zeronumber.pdf
java.io.IOException: XREF for 3:0 points to wrong object: 1:0
     at
org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:696)
     at
org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:639)
     at 
org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:600)
     at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:346)
     at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:373)
     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:811)
     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:757)
     at 
org.apache.pdfbox.util.TestPDFToImage.doTestFile(TestPDFToImage.java:201)
     at
org.apache.pdfbox.util.TestPDFToImage.testRenderImage(TestPDFToImage.java:343)
     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
     at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
     at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

     at java.lang.reflect.Method.invoke(Method.java:606)
     at junit.framework.TestCase.runTest(TestCase.java:176)
     at junit.framework.TestCase.runBare(TestCase.java:141)
     at junit.framework.TestResult$1.protect(TestResult.java:122)
     at junit.framework.TestResult.runProtected(TestResult.java:142)
     at junit.framework.TestResult.run(TestResult.java:125)
     at junit.framework.TestCase.run(TestCase.java:129)
     at junit.framework.TestSuite.runTest(TestSuite.java:255)
     at junit.framework.TestSuite.run(TestSuite.java:250)
     at junit.textui.TestRunner.doRun(TestRunner.java:116)
     at junit.textui.TestRunner.start(TestRunner.java:183)
     at junit.textui.TestRunner.main(TestRunner.java:137)
     at org.apache.pdfbox.util.TestPDFToImage.main(TestPDFToImage.java:393)


Error converting file PDFBOX-2599.pdf
java.io.IOException: XREF for 2:0 points to wrong object: 1:0
     at
org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:696)
     at
org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:639)
     at 
org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:600)
     at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:346)
     at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:373)
     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:811)
     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:757)
     at 
org.apache.pdfbox.util.TestPDFToImage.doTestFile(TestPDFToImage.java:201)
     at
org.apache.pdfbox.util.TestPDFToImage.testRenderImage(TestPDFToImage.java:343)
     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
     at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
     at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

     at java.lang.reflect.Method.invoke(Method.java:606)
     at junit.framework.TestCase.runTest(TestCase.java:176)
     at junit.framework.TestCase.runBare(TestCase.java:141)
     at junit.framework.TestResult$1.protect(TestResult.java:122)
     at junit.framework.TestResult.runProtected(TestResult.java:142)
     at junit.framework.TestResult.run(TestResult.java:125)
     at junit.framework.TestCase.run(TestCase.java:129)
     at junit.framework.TestSuite.runTest(TestSuite.java:255)
     at junit.framework.TestSuite.run(TestSuite.java:250)
     at junit.textui.TestRunner.doRun(TestRunner.java:116)
     at junit.textui.TestRunner.start(TestRunner.java:183)
     at junit.textui.TestRunner.main(TestRunner.java:137)
     at org.apache.pdfbox.util.TestPDFToImage.main(TestPDFToImage.java:393)


- why change only one of the members of that cosobjectkey class to int?
According to the spec, both are integers. Maybe there's a good reason, but I'd
like to know.
ASFAIK there is no good reason not to change both to int.

- even if you get rid of the regressions, a remaining problem is that
    - Andreas L. is currently working on some parser stuff in PDFBOX-2527
That's not a problem. For now I'm focused on the parsing process itself and am working on one last piece, the rebuild mechanism.

    - your change is too big to evaluate (I'm speaking only for myself there).
It would be better to first submit only small refactorings in PDFBOX-2576, and

I agree. We should try to break up the patch into smaller pieces if possible. Let's start with the long -> int change

then the optimization you mention (or the other way around). The parser is
indeed a tricky part of the code (And SonarQube and Software Diagnostics have
also flagged it as too complex). I did some refactorings a few weeks ago there
(splitting methods), but stopped because I couldn't come up with names for the
new methods. I just didn't understand what they were doing.

Tilman

BR
Andreas Lehmkühler


Am 27.02.2015 um 16:34 schrieb Andrea Vacondio:
Hi,
few days ago I was profiling PDFBox when loading medium/large size
documents and I think I found something.
If you try loading the document
http://www.adobe.com/devnet/acrobat/pdfs/pdf_reference_1-7.pdf  you'll see
it takes quite some time and that's mostly spent in the
XrefTrailerResolver.getContainedObjectNumbers. The issue is that every time
an object contained in an unparsed object stream is found, the
XrefTrailerResolver performs a full scan of the xref entries found in the
document, in this case hundreds of thousands. If the object streams are
many (like in the given doc), it performs many full scans resulting in poor
performance.
I'm trying to get familiar with the PDFBox code and I decided to try and
fix this herehttps://github.com/torakiki/sambox/tree/xref
As you can see I refactored a bit extracting some classes and covered the
expect behaviour with unit tests. I tested it with few random docs, loading
and saving them back and the output is exactly the same with or without my
changes. The pdf_reference_1-7.pdf doc loads in half of the time, same as
this
http://wwwimages.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf
it takes half the time. Other kind of docs loads in a comparable amount of
time and even profiling memory usage it seems comparable if not a little
less.
Maybe someone wants to take a look?

I understand my changes look a bit invasive and the issue could probably be
fixed differently, on the other hand the couple BaseParser+COSParser looks
like a big intimidating monster to a newcomer like me and it's quite
difficult to follow the expected behaviour so I thought this might be a
chance to start breaking them down in smaller, distilled classes...
something a little more manageable and testable... anyway, grab what you
like, leave what you don't  :)



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to