[ https://issues.apache.org/jira/browse/TIKA-1737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14902659#comment-14902659 ]

Tim Allison commented on TIKA-1737:
-----------------------------------

bq. dating back as far as 1992

Y, I just confirmed that I can't find any overlapping stacktraces from our govdocs1+common crawl corpus. Thank you for sharing.

> PDFBox 1.8.10 is still a basket case
> ------------------------------------
>
>                 Key: TIKA-1737
>                 URL: https://issues.apache.org/jira/browse/TIKA-1737
>             Project: Tika
>          Issue Type: Bug
>          Components: general
>    Affects Versions: 1.10
>         Environment: Linux, Solaris
>            Reporter: Alan Burlison
>         Attachments: pdfbox.txt
>
>
> In TIKA-1471 I reported OOM errors when parsing PDF files. According to that
> bug the issues were fixed in 1.7. I've just updated to Tika 1.10 and rather
> than PDFBox being better it's actually far, far worse. With the same corpus,
> Tika 1.5 (PDFBox 1.8.6) has 13 exceptions thrown by PDFBox, Tika 1.10 (PDFBox
> 1.8.10) has *453* exceptions thrown by PDFBox. Not only that, but as far as I
> can tell, the memory leaks are even worse in 1.8.10 as well.
> I've had to resort to destroying the Tika instances and starting over each
> time there's an error indexing a PDF file. It's so bad I'm going to switch to
> running pdftotext (part of Xpdf) as an external process. Note that many of
> the errors in PDFBox are clearly caused by programming errors, e.g.
> ArrayIndexOutOfBoundsException, ClassCastException, NullPointerException and
> EOFException.
> I strongly recommend that Tika either reverts back to PDFBox 1.8.6 or finds a
> replacement for PDFBox as 1.8.10 just isn't fit for purpose.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)