Re: Xref parsing performance

Andrea Vacondio Sat, 28 Feb 2015 08:57:16 -0800

mmm... are you using the tip of the "xref" branch? Because it shouldn't use
any jdk7 stuff and it compiles and runs fine on my machine. I'm using
Ubuntu and jdk1.6.0_45 and I have:
[INFO] BUILD SUCCESS


I changed the generation number to int because in the xref table it's a 5
digit number so it fits an int. According to the spec object number and
generation number are both integer (as opposite to real numbers) but I
don't think the specs distinguish between int and long so, while the gen
number can be at most 99999, I couldn't find any limit to the object number
so I left it long.
I noticed there's currently some work going on the COSPaser but I was
already playing with this changes so I finished them. I actually posted
them here more as a starting point for discussion... see what you guys
think of these kind of refactors/patch. I'm quite new to PDFBox (not to
java and PDF) and I'm kind of trying to understand what is welcome and what
is not :)

On Sat, Feb 28, 2015 at 4:47 PM, Tilman Hausherr <[email protected]>
wrote:

> Hi Andrea,
>
> While a speed improvement in parsing of large files would be much
> appreciated (especially by the TIKA users), there are several problems with
> your change:
>
> - don't do changes that need JDK7 or higher even if they are cool. We use
> JDK6 currently.
>
> - regressions:
>
> Error converting file PDFBOX-2250-110264-xref-zeronumber.pdf
> java.io.IOException: XREF for 3:0 points to wrong object: 1:0
>     at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(
> COSParser.java:696)
>     at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(
> COSParser.java:639)
>     at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(
> COSParser.java:600)
>     at org.apache.pdfbox.pdfparser.PDFParser.initialParse(
> PDFParser.java:346)
>     at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:373)
>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:811)
>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:757)
>     at org.apache.pdfbox.util.TestPDFToImage.doTestFile(
> TestPDFToImage.java:201)
>     at org.apache.pdfbox.util.TestPDFToImage.testRenderImage(
> TestPDFToImage.java:343)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(
> NativeMethodAccessorImpl.java:57)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(
> DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:606)
>     at junit.framework.TestCase.runTest(TestCase.java:176)
>     at junit.framework.TestCase.runBare(TestCase.java:141)
>     at junit.framework.TestResult$1.protect(TestResult.java:122)
>     at junit.framework.TestResult.runProtected(TestResult.java:142)
>     at junit.framework.TestResult.run(TestResult.java:125)
>     at junit.framework.TestCase.run(TestCase.java:129)
>     at junit.framework.TestSuite.runTest(TestSuite.java:255)
>     at junit.framework.TestSuite.run(TestSuite.java:250)
>     at junit.textui.TestRunner.doRun(TestRunner.java:116)
>     at junit.textui.TestRunner.start(TestRunner.java:183)
>     at junit.textui.TestRunner.main(TestRunner.java:137)
>     at org.apache.pdfbox.util.TestPDFToImage.main(TestPDFToImage.java:393)
>
>
> Error converting file PDFBOX-2599.pdf
> java.io.IOException: XREF for 2:0 points to wrong object: 1:0
>     at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(
> COSParser.java:696)
>     at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(
> COSParser.java:639)
>     at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(
> COSParser.java:600)
>     at org.apache.pdfbox.pdfparser.PDFParser.initialParse(
> PDFParser.java:346)
>     at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:373)
>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:811)
>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:757)
>     at org.apache.pdfbox.util.TestPDFToImage.doTestFile(
> TestPDFToImage.java:201)
>     at org.apache.pdfbox.util.TestPDFToImage.testRenderImage(
> TestPDFToImage.java:343)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(
> NativeMethodAccessorImpl.java:57)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(
> DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:606)
>     at junit.framework.TestCase.runTest(TestCase.java:176)
>     at junit.framework.TestCase.runBare(TestCase.java:141)
>     at junit.framework.TestResult$1.protect(TestResult.java:122)
>     at junit.framework.TestResult.runProtected(TestResult.java:142)
>     at junit.framework.TestResult.run(TestResult.java:125)
>     at junit.framework.TestCase.run(TestCase.java:129)
>     at junit.framework.TestSuite.runTest(TestSuite.java:255)
>     at junit.framework.TestSuite.run(TestSuite.java:250)
>     at junit.textui.TestRunner.doRun(TestRunner.java:116)
>     at junit.textui.TestRunner.start(TestRunner.java:183)
>     at junit.textui.TestRunner.main(TestRunner.java:137)
>     at org.apache.pdfbox.util.TestPDFToImage.main(TestPDFToImage.java:393)
>
>
> - why change only one of the members of that cosobjectkey class to int?
> According to the spec, both are integers. Maybe there's a good reason, but
> I'd like to know.
>
> - even if you get rid of the regressions, a remaining problem is that
>    - Andreas L. is currently working on some parser stuff in PDFBOX-2527
>    - your change is too big to evaluate (I'm speaking only for myself
> there). It would be better to first submit only small refactorings in
> PDFBOX-2576, and then the optimization you mention (or the other way
> around). The parser is indeed a tricky part of the code (And SonarQube and
> Software Diagnostics have also flagged it as too complex). I did some
> refactorings a few weeks ago there (splitting methods), but stopped because
> I couldn't come up with names for the new methods. I just didn't understand
> what they were doing.
>
> Tilman
>
> Am 27.02.2015 um 16:34 schrieb Andrea Vacondio:
>
>> Hi,
>> few days ago I was profiling PDFBox when loading medium/large size
>> documents and I think I found something.
>> If you try loading the document
>> http://www.adobe.com/devnet/acrobat/pdfs/pdf_reference_1-7.pdf  you'll
>> see
>> it takes quite some time and that's mostly spent in the
>> XrefTrailerResolver.getContainedObjectNumbers. The issue is that every
>> time
>> an object contained in an unparsed object stream is found, the
>> XrefTrailerResolver performs a full scan of the xref entries found in the
>> document, in this case hundreds of thousands. If the object streams are
>> many (like in the given doc), it performs many full scans resulting in
>> poor
>> performance.
>> I'm trying to get familiar with the PDFBox code and I decided to try and
>> fix this herehttps://github.com/torakiki/sambox/tree/xref
>> As you can see I refactored a bit extracting some classes and covered the
>> expect behaviour with unit tests. I tested it with few random docs,
>> loading
>> and saving them back and the output is exactly the same with or without my
>> changes. The pdf_reference_1-7.pdf doc loads in half of the time, same as
>> this
>> http://wwwimages.adobe.com/content/dam/Adobe/en/devnet/
>> pdf/pdfs/PDF32000_2008.pdf
>> it takes half the time. Other kind of docs loads in a comparable amount of
>> time and even profiling memory usage it seems comparable if not a little
>> less.
>> Maybe someone wants to take a look?
>>
>> I understand my changes look a bit invasive and the issue could probably
>> be
>> fixed differently, on the other hand the couple BaseParser+COSParser looks
>> like a big intimidating monster to a newcomer like me and it's quite
>> difficult to follow the expected behaviour so I thought this might be a
>> chance to start breaking them down in smaller, distilled classes...
>> something a little more manageable and testable... anyway, grab what you
>> like, leave what you don't  :)
>>
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
>

Re: Xref parsing performance

Reply via email to