Hi, Am 28.02.2015 um 17:32 schrieb Andreas Lehmkuehler <[email protected]>:
> Hi > > Am 28.02.2015 um 16:47 schrieb Tilman Hausherr: >> Hi Andrea, >> >> While a speed improvement in parsing of large files would be much appreciated >> (especially by the TIKA users), there are several problems with your change: > +1 > >> - don't do changes that need JDK7 or higher even if they are cool. We use >> JDK6 >> currently. >> >> - regressions: >> >> Error converting file PDFBOX-2250-110264-xref-zeronumber.pdf >> java.io.IOException: XREF for 3:0 points to wrong object: 1:0 >> at >> org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:696) >> at >> org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:639) >> at >> org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:600) >> at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:346) >> at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:373) >> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:811) >> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:757) >> at >> org.apache.pdfbox.util.TestPDFToImage.doTestFile(TestPDFToImage.java:201) >> at >> org.apache.pdfbox.util.TestPDFToImage.testRenderImage(TestPDFToImage.java:343) >> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) >> at >> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) >> at >> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) >> >> at java.lang.reflect.Method.invoke(Method.java:606) >> at junit.framework.TestCase.runTest(TestCase.java:176) >> at junit.framework.TestCase.runBare(TestCase.java:141) >> at junit.framework.TestResult$1.protect(TestResult.java:122) >> at junit.framework.TestResult.runProtected(TestResult.java:142) >> at junit.framework.TestResult.run(TestResult.java:125) >> at junit.framework.TestCase.run(TestCase.java:129) >> at junit.framework.TestSuite.runTest(TestSuite.java:255) >> at junit.framework.TestSuite.run(TestSuite.java:250) >> at junit.textui.TestRunner.doRun(TestRunner.java:116) >> at junit.textui.TestRunner.start(TestRunner.java:183) >> at junit.textui.TestRunner.main(TestRunner.java:137) >> at org.apache.pdfbox.util.TestPDFToImage.main(TestPDFToImage.java:393) >> >> >> Error converting file PDFBOX-2599.pdf >> java.io.IOException: XREF for 2:0 points to wrong object: 1:0 >> at >> org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:696) >> at >> org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:639) >> at >> org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:600) >> at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:346) >> at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:373) >> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:811) >> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:757) >> at >> org.apache.pdfbox.util.TestPDFToImage.doTestFile(TestPDFToImage.java:201) >> at >> org.apache.pdfbox.util.TestPDFToImage.testRenderImage(TestPDFToImage.java:343) >> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) >> at >> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) >> at >> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) >> >> at java.lang.reflect.Method.invoke(Method.java:606) >> at junit.framework.TestCase.runTest(TestCase.java:176) >> at junit.framework.TestCase.runBare(TestCase.java:141) >> at junit.framework.TestResult$1.protect(TestResult.java:122) >> at junit.framework.TestResult.runProtected(TestResult.java:142) >> at junit.framework.TestResult.run(TestResult.java:125) >> at junit.framework.TestCase.run(TestCase.java:129) >> at junit.framework.TestSuite.runTest(TestSuite.java:255) >> at junit.framework.TestSuite.run(TestSuite.java:250) >> at junit.textui.TestRunner.doRun(TestRunner.java:116) >> at junit.textui.TestRunner.start(TestRunner.java:183) >> at junit.textui.TestRunner.main(TestRunner.java:137) >> at org.apache.pdfbox.util.TestPDFToImage.main(TestPDFToImage.java:393) >> >> >> - why change only one of the members of that cosobjectkey class to int? >> According to the spec, both are integers. Maybe there's a good reason, but >> I'd >> like to know. > ASFAIK there is no good reason not to change both to int. as the offset is a 10 digit number is that really covered being an int? BR Maruan > >> - even if you get rid of the regressions, a remaining problem is that >> - Andreas L. is currently working on some parser stuff in PDFBOX-2527 > That's not a problem. For now I'm focused on the parsing process itself and > am working on one last piece, the rebuild mechanism. > >> - your change is too big to evaluate (I'm speaking only for myself there). >> It would be better to first submit only small refactorings in PDFBOX-2576, >> and > > I agree. We should try to break up the patch into smaller pieces if possible. > Let's start with the long -> int change > >> then the optimization you mention (or the other way around). The parser is >> indeed a tricky part of the code (And SonarQube and Software Diagnostics have >> also flagged it as too complex). I did some refactorings a few weeks ago >> there >> (splitting methods), but stopped because I couldn't come up with names for >> the >> new methods. I just didn't understand what they were doing. >> >> Tilman > > BR > Andreas Lehmkühler > >> >> Am 27.02.2015 um 16:34 schrieb Andrea Vacondio: >>> Hi, >>> few days ago I was profiling PDFBox when loading medium/large size >>> documents and I think I found something. >>> If you try loading the document >>> http://www.adobe.com/devnet/acrobat/pdfs/pdf_reference_1-7.pdf you'll see >>> it takes quite some time and that's mostly spent in the >>> XrefTrailerResolver.getContainedObjectNumbers. The issue is that every time >>> an object contained in an unparsed object stream is found, the >>> XrefTrailerResolver performs a full scan of the xref entries found in the >>> document, in this case hundreds of thousands. If the object streams are >>> many (like in the given doc), it performs many full scans resulting in poor >>> performance. >>> I'm trying to get familiar with the PDFBox code and I decided to try and >>> fix this herehttps://github.com/torakiki/sambox/tree/xref >>> As you can see I refactored a bit extracting some classes and covered the >>> expect behaviour with unit tests. I tested it with few random docs, loading >>> and saving them back and the output is exactly the same with or without my >>> changes. The pdf_reference_1-7.pdf doc loads in half of the time, same as >>> this >>> http://wwwimages.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf >>> it takes half the time. Other kind of docs loads in a comparable amount of >>> time and even profiling memory usage it seems comparable if not a little >>> less. >>> Maybe someone wants to take a look? >>> >>> I understand my changes look a bit invasive and the issue could probably be >>> fixed differently, on the other hand the couple BaseParser+COSParser looks >>> like a big intimidating monster to a newcomer like me and it's quite >>> difficult to follow the expected behaviour so I thought this might be a >>> chance to start breaking them down in smaller, distilled classes... >>> something a little more manageable and testable... anyway, grab what you >>> like, leave what you don't :) >>> >> >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: [email protected] >> For additional commands, e-mail: [email protected] >> > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [email protected] > For additional commands, e-mail: [email protected] > --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]

