Hmm... are you using the tip of the "xref" branch? It shouldn't use any JDK 7 features, and it compiles and runs fine on my machine. I'm on Ubuntu with jdk1.6.0_45 and I get: [INFO] BUILD SUCCESS
I changed the generation number to int because in the xref table it's a 5-digit number, so it fits in an int. According to the spec, object number and generation number are both integers (as opposed to real numbers), but I don't think the spec distinguishes between int and long. So while the generation number can be at most 99999, I couldn't find any limit on the object number, and I left it as a long.

I noticed there's currently some work going on in the COSParser, but I was already playing with these changes so I finished them. I actually posted them here more as a starting point for discussion, to see what you guys think of this kind of refactoring/patch. I'm quite new to PDFBox (not to Java and PDF) and I'm trying to understand what is welcome and what is not :)

On Sat, Feb 28, 2015 at 4:47 PM, Tilman Hausherr <[email protected]> wrote:

> Hi Andrea,
>
> While a speed improvement in parsing of large files would be much
> appreciated (especially by the TIKA users), there are several problems with
> your change:
>
> - don't do changes that need JDK7 or higher even if they are cool. We use
>   JDK6 currently.
>
> - regressions:
>
> Error converting file PDFBOX-2250-110264-xref-zeronumber.pdf
> java.io.IOException: XREF for 3:0 points to wrong object: 1:0
>     at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:696)
>     at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:639)
>     at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:600)
>     at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:346)
>     at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:373)
>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:811)
>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:757)
>     at org.apache.pdfbox.util.TestPDFToImage.doTestFile(TestPDFToImage.java:201)
>     at org.apache.pdfbox.util.TestPDFToImage.testRenderImage(TestPDFToImage.java:343)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:606)
>     at junit.framework.TestCase.runTest(TestCase.java:176)
>     at junit.framework.TestCase.runBare(TestCase.java:141)
>     at junit.framework.TestResult$1.protect(TestResult.java:122)
>     at junit.framework.TestResult.runProtected(TestResult.java:142)
>     at junit.framework.TestResult.run(TestResult.java:125)
>     at junit.framework.TestCase.run(TestCase.java:129)
>     at junit.framework.TestSuite.runTest(TestSuite.java:255)
>     at junit.framework.TestSuite.run(TestSuite.java:250)
>     at junit.textui.TestRunner.doRun(TestRunner.java:116)
>     at junit.textui.TestRunner.start(TestRunner.java:183)
>     at junit.textui.TestRunner.main(TestRunner.java:137)
>     at org.apache.pdfbox.util.TestPDFToImage.main(TestPDFToImage.java:393)
>
> Error converting file PDFBOX-2599.pdf
> java.io.IOException: XREF for 2:0 points to wrong object: 1:0
>     at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:696)
>     at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:639)
>     at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:600)
>     at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:346)
>     at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:373)
>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:811)
>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:757)
>     at org.apache.pdfbox.util.TestPDFToImage.doTestFile(TestPDFToImage.java:201)
>     at org.apache.pdfbox.util.TestPDFToImage.testRenderImage(TestPDFToImage.java:343)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:606)
>     at junit.framework.TestCase.runTest(TestCase.java:176)
>     at junit.framework.TestCase.runBare(TestCase.java:141)
>     at junit.framework.TestResult$1.protect(TestResult.java:122)
>     at junit.framework.TestResult.runProtected(TestResult.java:142)
>     at junit.framework.TestResult.run(TestResult.java:125)
>     at junit.framework.TestCase.run(TestCase.java:129)
>     at junit.framework.TestSuite.runTest(TestSuite.java:255)
>     at junit.framework.TestSuite.run(TestSuite.java:250)
>     at junit.textui.TestRunner.doRun(TestRunner.java:116)
>     at junit.textui.TestRunner.start(TestRunner.java:183)
>     at junit.textui.TestRunner.main(TestRunner.java:137)
>     at org.apache.pdfbox.util.TestPDFToImage.main(TestPDFToImage.java:393)
>
> - why change only one of the members of that cosobjectkey class to int?
>   According to the spec, both are integers. Maybe there's a good reason, but
>   I'd like to know.
>
> - even if you get rid of the regressions, a remaining problem is that
>   - Andreas L.
>     is currently working on some parser stuff in PDFBOX-2527
>   - your change is too big to evaluate (I'm speaking only for myself
>     there). It would be better to first submit only small refactorings in
>     PDFBOX-2576, and then the optimization you mention (or the other way
>     around). The parser is indeed a tricky part of the code (and SonarQube and
>     Software Diagnostics have also flagged it as too complex). I did some
>     refactorings a few weeks ago there (splitting methods), but stopped because
>     I couldn't come up with names for the new methods. I just didn't understand
>     what they were doing.
>
> Tilman
>
> On 27.02.2015 at 16:34, Andrea Vacondio wrote:
>
>> Hi,
>> a few days ago I was profiling PDFBox while loading medium/large
>> documents and I think I found something.
>> If you try loading the document
>> http://www.adobe.com/devnet/acrobat/pdfs/pdf_reference_1-7.pdf you'll see
>> it takes quite some time, and that's mostly spent in
>> XrefTrailerResolver.getContainedObjectNumbers. The issue is that every time
>> an object contained in an unparsed object stream is found, the
>> XrefTrailerResolver performs a full scan of the xref entries found in the
>> document, in this case hundreds of thousands. If there are many object
>> streams (as in the given doc), it performs many full scans, resulting in
>> poor performance.
>> I'm trying to get familiar with the PDFBox code and I decided to try and
>> fix this here: https://github.com/torakiki/sambox/tree/xref
>> As you can see I refactored a bit, extracting some classes, and covered the
>> expected behaviour with unit tests. I tested it with a few random docs,
>> loading and saving them back, and the output is exactly the same with or
>> without my changes. The pdf_reference_1-7.pdf doc loads in half the time,
>> and so does
>> http://wwwimages.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf
>> Other kinds of docs load in a comparable amount of time, and memory
>> usage under profiling also seems comparable, if not a little lower.
>> Maybe someone wants to take a look?
>>
>> I understand my changes look a bit invasive and the issue could probably be
>> fixed differently. On the other hand, the BaseParser+COSParser pair looks
>> like a big intimidating monster to a newcomer like me, and it's quite
>> difficult to follow the expected behaviour, so I thought this might be a
>> chance to start breaking them down into smaller, distilled classes...
>> something a little more manageable and testable... anyway, grab what you
>> like, leave what you don't :)
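
For anyone following along, the full-scan problem described in the quoted message can be illustrated with a small sketch. This is not PDFBox code and not necessarily what the xref branch does: the class, the `Entry` type and all method names are made up for illustration, and it assumes each xref entry simply records which object stream (if any) contains the object. The idea is to walk the table once and build a map from object stream number to contained object numbers, instead of rescanning the whole table for every object stream. It sticks to Java 6 syntax.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; none of these names are PDFBox API. */
class XrefIndexSketch {

    /** Simplified xref entry: an object number and the object stream holding it. */
    static final class Entry {
        final long objectNumber;
        final long containingStream; // -1 when the object is not inside an object stream

        Entry(long objectNumber, long containingStream) {
            this.objectNumber = objectNumber;
            this.containingStream = containingStream;
        }
    }

    /** Slow variant: one full scan of the table for every lookup. */
    static List<Long> containedByScan(List<Entry> table, long streamNumber) {
        List<Long> result = new ArrayList<Long>();
        for (Entry e : table) {
            if (e.containingStream == streamNumber) {
                result.add(e.objectNumber);
            }
        }
        return result;
    }

    /** Faster variant: a single pass builds a stream -> contained objects index. */
    static Map<Long, List<Long>> buildIndex(List<Entry> table) {
        Map<Long, List<Long>> index = new HashMap<Long, List<Long>>();
        for (Entry e : table) {
            if (e.containingStream < 0) {
                continue; // plain object, not stored in an object stream
            }
            List<Long> contained = index.get(e.containingStream);
            if (contained == null) {
                contained = new ArrayList<Long>();
                index.put(e.containingStream, contained);
            }
            contained.add(e.objectNumber);
        }
        return index;
    }

    /** Lookup against the prebuilt index; no table scan involved. */
    static List<Long> containedByIndex(Map<Long, List<Long>> index, long streamNumber) {
        List<Long> contained = index.get(streamNumber);
        return contained == null ? Collections.<Long>emptyList() : contained;
    }

    public static void main(String[] args) {
        List<Entry> table = new ArrayList<Entry>();
        table.add(new Entry(1, -1));
        table.add(new Entry(2, 10));
        table.add(new Entry(3, 10));
        table.add(new Entry(4, 11));

        Map<Long, List<Long>> index = buildIndex(table);
        System.out.println("scan:  " + containedByScan(table, 10));
        System.out.println("index: " + containedByIndex(index, 10));
    }
}
```

With S object streams and N xref entries, the scan variant does roughly S * N comparisons while the indexed variant does one pass of N plus S map lookups, which would be consistent with the gain being visible mainly on documents with many object streams.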

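On the int-versus-long point from the reply at the top of this thread, here is a rough sketch of what such a key could look like. It is not the PDFBox COSObjectKey class; the class name, the `MAX_GENERATION` constant and the validation are assumptions made for illustration. It keeps the object number as a long (no upper bound assumed) and the generation as an int capped at what a 5-digit xref field can hold.

```java
/** Hypothetical sketch; not the PDFBox COSObjectKey class. */
final class ObjectKeySketch {

    /** Largest value that fits in the 5-digit generation field of an xref entry. */
    static final int MAX_GENERATION = 99999;

    final long number;    // object number: kept as long, no upper limit assumed
    final int generation; // generation number: always fits in an int

    ObjectKeySketch(long number, int generation) {
        if (number < 0) {
            throw new IllegalArgumentException("negative object number: " + number);
        }
        if (generation < 0 || generation > MAX_GENERATION) {
            throw new IllegalArgumentException("generation out of range: " + generation);
        }
        this.number = number;
        this.generation = generation;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) {
            return true;
        }
        if (!(other instanceof ObjectKeySketch)) {
            return false;
        }
        ObjectKeySketch that = (ObjectKeySketch) other;
        return number == that.number && generation == that.generation;
    }

    @Override
    public int hashCode() {
        // Fold the long into an int by hand; Long.hashCode(long) is not in Java 6.
        return 31 * (int) (number ^ (number >>> 32)) + generation;
    }

    /** Same "number:generation" shape as the keys in the error messages above. */
    @Override
    public String toString() {
        return number + ":" + generation;
    }

    public static void main(String[] args) {
        System.out.println(new ObjectKeySketch(3L, 0));
        System.out.println(new ObjectKeySketch(4000000000L, 0)); // beyond Integer.MAX_VALUE
    }
}
```

Whether the object number really needs a long is exactly the open question in the thread; the sketch only shows that mixing the two types in one key is mechanically unproblematic.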
