All,

  As part of TIKA-1285, I updated Jeremy Anderson's original patch for 
Tika's wrapper around PDFBox 2.0.0.  I'm having problems running the unit 
tests because at least one of our files [0] causes heavy resource 
utilization, which sends my laptop into paging.  The parse does eventually 
finish, and content is extracted.

  I also tried this file outside of Tika with the plain PDFBox-app (both 
ExtractImages and ExtractText), and performance there is also far slower 
than with 1.8.9.

  Many apologies if this issue has already been identified.

  I also noticed that the TIFF file is no longer extracted (the 2.0.0 logger 
says TIFF is not handled, but a TIFF is extracted with 1.8.9).  Is this expected?

         Thank you!

              Best,

                     Tim

[0] https://issues.apache.org/jira/secure/attachment/12743988/testPDF_childAttachments.pdf
 

