Allow PDFBox to be used with RandomAccessFile vs RandomAccessBuffer to allow 
for a memory vs performance tradeoff
-----------------------------------------------------------------------------------------------------------------

                 Key: TIKA-818
                 URL: https://issues.apache.org/jira/browse/TIKA-818
             Project: Tika
          Issue Type: Improvement
          Components: parser
    Affects Versions: 1.0, 0.10
            Reporter: Paul Pearcy


After upgrading to Tika 0.10, began having OOM errors processing large amounts 
of PDFs in parallel. The heap dump indicated that all the memory was getting 
used up by PDFBox RandomAccessBuffers. After digging around, it looks like 
PDFBox now defaults to using RAM vs temporary files for PDF extraction. This 
can be overridden to use RandomAccessFiless. 

I propose that Tika controls file vs buffer based on the inputstream type 
received. If the TikaInputStream is a file, RandomAccessFile should be used and 
for other stream types, RandomAccessBuffer can be used. 

I believe the code to control this is here:
https://github.com/apache/tika/blob/trunk/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java

At ~line 87:
PDDocument pdfDocument =
            PDDocument.load(new CloseShieldInputStream(stream), true);

Not sure if this is the best approach and am curious if there are other ideas 
on how to control this and keep the interface clean. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Reply via email to