All,
As part of TIKA-1285, I updated Jeremy Anderson's original patch for our
wrapper for PDFBox 2.0.0 on Tika. I'm having some problems running the unit
tests because at least one of our files [0] causes heavy resource
utilization, which sends my laptop into paging. The parse does eventually
finish, and content is extracted.
I also tried this file outside of Tika with the standalone PDFBox-app (both
ExtractImages and ExtractText), and performance there is also far, far slower
than with 1.8.9.
Many apologies if this issue has already been identified.
I also noticed that the TIFF file is no longer extracted (the 2.0.0 logger
says tiff not handled, but a TIFF is extracted with 1.8.9). Is this expected?
Thank you!
Best,
Tim
[0] https://issues.apache.org/jira/secure/attachment/12743988/testPDF_childAttachments.pdf