Hmm... are you using the tip of the "xref" branch? It shouldn't use any
JDK7 features, and it compiles and runs fine on my machine. I'm on
Ubuntu with jdk1.6.0_45 and I get:
[INFO] BUILD SUCCESS

I changed the generation number to int because in the xref table it's a
5-digit number, so it fits in an int. According to the spec, object number
and generation number are both integers (as opposed to real numbers), but I
don't think the spec distinguishes between int and long. So, while the
generation number can be at most 99999, I couldn't find any limit on the
object number, and I left it as a long.
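
To make the int/long split concrete, here is a minimal sketch of how such a key could look. It is only an illustration: the class name ObjectKeySketch, the MAX_GENERATION constant and the range checks are assumptions, not the actual COSObjectKey in PDFBox or in the xref branch. It sticks to Java 6 syntax.

```java
// Hypothetical sketch: NOT the real org.apache.pdfbox.cos.COSObjectKey.
final class ObjectKeySketch {

    // Upper bound implied by the 5-digit generation field of an xref entry.
    static final int MAX_GENERATION = 99999;

    // No upper limit assumed for the object number, so it stays a long.
    private final long number;
    // At most 99999, so an int is enough.
    private final int generation;

    ObjectKeySketch(long number, int generation) {
        if (number < 0) {
            throw new IllegalArgumentException("Negative object number: " + number);
        }
        if (generation < 0 || generation > MAX_GENERATION) {
            throw new IllegalArgumentException("Generation out of range: " + generation);
        }
        this.number = number;
        this.generation = generation;
    }

    long getNumber() {
        return number;
    }

    int getGeneration() {
        return generation;
    }

    @Override
    public boolean equals(Object other) {
        if (!(other instanceof ObjectKeySketch)) {
            return false;
        }
        ObjectKeySketch key = (ObjectKeySketch) other;
        return number == key.number && generation == key.generation;
    }

    @Override
    public int hashCode() {
        // Fold the long into an int, then mix in the generation.
        return 31 * (int) (number ^ (number >>> 32)) + generation;
    }

    @Override
    public String toString() {
        // Same "number:generation" form as in the parser error messages.
        return number + ":" + generation;
    }
}
```

With this, new ObjectKeySketch(3, 0).toString() gives 3:0, and a generation above 99999 is rejected up front instead of being silently stored.
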
I noticed there's currently some work going on in the COSParser, but I was
already playing with these changes, so I finished them. I actually posted
them here more as a starting point for discussion... to see what you guys
think of this kind of refactoring/patch. I'm quite new to PDFBox (not to
Java and PDF) and I'm trying to understand what is welcome and what
is not :)

On Sat, Feb 28, 2015 at 4:47 PM, Tilman Hausherr <[email protected]>
wrote:

> Hi Andrea,
>
> While a speed improvement in parsing of large files would be much
> appreciated (especially by the TIKA users), there are several problems with
> your change:
>
> - don't do changes that need JDK7 or higher even if they are cool. We use
> JDK6 currently.
>
> - regressions:
>
> Error converting file PDFBOX-2250-110264-xref-zeronumber.pdf
> java.io.IOException: XREF for 3:0 points to wrong object: 1:0
>     at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:696)
>     at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:639)
>     at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:600)
>     at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:346)
>     at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:373)
>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:811)
>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:757)
>     at org.apache.pdfbox.util.TestPDFToImage.doTestFile(TestPDFToImage.java:201)
>     at org.apache.pdfbox.util.TestPDFToImage.testRenderImage(TestPDFToImage.java:343)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:606)
>     at junit.framework.TestCase.runTest(TestCase.java:176)
>     at junit.framework.TestCase.runBare(TestCase.java:141)
>     at junit.framework.TestResult$1.protect(TestResult.java:122)
>     at junit.framework.TestResult.runProtected(TestResult.java:142)
>     at junit.framework.TestResult.run(TestResult.java:125)
>     at junit.framework.TestCase.run(TestCase.java:129)
>     at junit.framework.TestSuite.runTest(TestSuite.java:255)
>     at junit.framework.TestSuite.run(TestSuite.java:250)
>     at junit.textui.TestRunner.doRun(TestRunner.java:116)
>     at junit.textui.TestRunner.start(TestRunner.java:183)
>     at junit.textui.TestRunner.main(TestRunner.java:137)
>     at org.apache.pdfbox.util.TestPDFToImage.main(TestPDFToImage.java:393)
>
>
> Error converting file PDFBOX-2599.pdf
> java.io.IOException: XREF for 2:0 points to wrong object: 1:0
>     at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:696)
>     at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:639)
>     at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:600)
>     at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:346)
>     at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:373)
>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:811)
>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:757)
>     at org.apache.pdfbox.util.TestPDFToImage.doTestFile(TestPDFToImage.java:201)
>     at org.apache.pdfbox.util.TestPDFToImage.testRenderImage(TestPDFToImage.java:343)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:606)
>     at junit.framework.TestCase.runTest(TestCase.java:176)
>     at junit.framework.TestCase.runBare(TestCase.java:141)
>     at junit.framework.TestResult$1.protect(TestResult.java:122)
>     at junit.framework.TestResult.runProtected(TestResult.java:142)
>     at junit.framework.TestResult.run(TestResult.java:125)
>     at junit.framework.TestCase.run(TestCase.java:129)
>     at junit.framework.TestSuite.runTest(TestSuite.java:255)
>     at junit.framework.TestSuite.run(TestSuite.java:250)
>     at junit.textui.TestRunner.doRun(TestRunner.java:116)
>     at junit.textui.TestRunner.start(TestRunner.java:183)
>     at junit.textui.TestRunner.main(TestRunner.java:137)
>     at org.apache.pdfbox.util.TestPDFToImage.main(TestPDFToImage.java:393)
>
>
> - why change only one of the members of the COSObjectKey class to int?
> According to the spec, both are integers. Maybe there's a good reason, but
> I'd like to know.
>
> - even if you get rid of the regressions, a remaining problem is that
>    - Andreas L. is currently working on some parser stuff in PDFBOX-2527
>    - your change is too big to evaluate (I'm speaking only for myself
> there). It would be better to first submit only small refactorings in
> PDFBOX-2576, and then the optimization you mention (or the other way
> around). The parser is indeed a tricky part of the code (and SonarQube and
> Software Diagnostics have also flagged it as too complex). I did some
> refactorings a few weeks ago there (splitting methods), but stopped because
> I couldn't come up with names for the new methods. I just didn't understand
> what they were doing.
>
> Tilman
>
> On 27.02.2015 at 16:34, Andrea Vacondio wrote:
>
>> Hi,
>> a few days ago I was profiling PDFBox while loading medium/large
>> documents and I think I found something.
>> If you try loading the document
>> http://www.adobe.com/devnet/acrobat/pdfs/pdf_reference_1-7.pdf you'll see
>> it takes quite some time, and that's mostly spent in
>> XrefTrailerResolver.getContainedObjectNumbers. The issue is that every
>> time an object contained in an unparsed object stream is found, the
>> XrefTrailerResolver performs a full scan of the xref entries found in the
>> document, in this case hundreds of thousands. If there are many object
>> streams (as in the given doc), it performs many full scans, resulting in
>> poor performance.
>> I'm trying to get familiar with the PDFBox code and I decided to try to
>> fix this here: https://github.com/torakiki/sambox/tree/xref
>> As you can see, I refactored a bit, extracting some classes, and covered
>> the expected behaviour with unit tests. I tested it with a few random
>> docs, loading and saving them back, and the output is exactly the same
>> with or without my changes. The pdf_reference_1-7.pdf doc loads in half
>> the time, and so does
>> http://wwwimages.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf
>> Other kinds of docs load in a comparable amount of time, and memory usage
>> under profiling also seems comparable, if not a little lower.
>> Maybe someone wants to take a look?
>>
>> I understand my changes look a bit invasive and the issue could probably
>> be fixed differently. On the other hand, the BaseParser+COSParser pair
>> looks like a big intimidating monster to a newcomer like me, and it's quite
>> difficult to follow the expected behaviour, so I thought this might be a
>> chance to start breaking them down into smaller, distilled classes...
>> something a little more manageable and testable... anyway, grab what you
>> like, leave what you don't :)
>>
>>
>
>
>
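
For reference, the slowdown described in the original message quoted above comes from rescanning every xref entry each time an object stream is resolved. A minimal sketch of the alternative, building the reverse lookup once, might look like the following. The class name ObjectStreamIndexSketch and the plain Map<Long, Long> input (object number to containing object stream number) are assumptions made for illustration; this is not the code in the xref branch nor the real XrefTrailerResolver.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: groups compressed objects by their object stream once,
// so later lookups do not need a full scan of the xref entries.
final class ObjectStreamIndexSketch {

    // object stream number -> numbers of the objects stored inside it
    private final Map<Long, Set<Long>> containedByStream =
            new HashMap<Long, Set<Long>>();

    // streamByObject: object number -> number of the stream containing it
    // (an assumed input shape, chosen to keep the example small).
    ObjectStreamIndexSketch(Map<Long, Long> streamByObject) {
        // Single pass over all entries.
        for (Map.Entry<Long, Long> entry : streamByObject.entrySet()) {
            Set<Long> contained = containedByStream.get(entry.getValue());
            if (contained == null) {
                contained = new TreeSet<Long>();
                containedByStream.put(entry.getValue(), contained);
            }
            contained.add(entry.getKey());
        }
    }

    // Constant-time map lookup instead of a scan per object stream.
    Set<Long> getContainedObjectNumbers(long streamNumber) {
        Set<Long> contained = containedByStream.get(streamNumber);
        if (contained == null) {
            return Collections.<Long>emptySet();
        }
        return Collections.unmodifiableSet(contained);
    }
}
```

The trade-off is one extra map held in memory in exchange for doing the scan once rather than once per object stream.
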
