[ 
https://issues.apache.org/jira/browse/TIKA-818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13192072#comment-13192072
 ] 

Nick Burch commented on TIKA-818:
---------------------------------

Are you sure the scratchFile should be the real file itself, rather than a temp 
file? The javadoc says "scratchFile - A location to store temp PDFBox data for 
this document.", which makes me think it probably shouldn't be the same file.

Also, for the TikaInputStream check, can we not just use hasFile(), rather than 
adding a new method?
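
To make the two points above concrete, here is a minimal JDK-only sketch. The 
class ScratchChooser and its methods are hypothetical and are not part of Tika 
or PDFBox. The boolean parameter stands in for the result of 
TikaInputStream.hasFile(), and the temp file is what would be wrapped in a 
PDFBox RandomAccessFile and passed as the scratchFile argument, assuming the 
PDDocument.load(InputStream, RandomAccess) overload is the one being used.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical helper for illustration only; not Tika or PDFBox API. */
class ScratchChooser {

    /** Where the PDF parser would keep its working data. */
    public enum Kind { TEMP_FILE, MEMORY }

    /**
     * Pick the scratch storage. The flag is assumed to come from
     * TikaInputStream.hasFile() on the incoming stream.
     */
    public static Kind choose(boolean streamHasFile) {
        return streamHasFile ? Kind.TEMP_FILE : Kind.MEMORY;
    }

    /**
     * Create a scratch file that is separate from the source document,
     * so parser working data never lands in the original PDF.
     */
    public static File newScratchFile() {
        try {
            File scratch = File.createTempFile("tika-pdf-scratch", ".tmp");
            scratch.deleteOnExit(); // fallback cleanup if the caller forgets
            return scratch;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The caller would still be responsible for deleting the scratch file once the 
document has been closed; deleteOnExit() is only a fallback.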
                
> Allow PDFBox to be used with RandomAccessFile vs RandomAccessBuffer to allow 
> for a memory vs performance tradeoff
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-818
>                 URL: https://issues.apache.org/jira/browse/TIKA-818
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 0.10, 1.0
>            Reporter: Paul Pearcy
>         Attachments: choose_inmemory_vs_temp_file_pdf.patch, 
> choose_inmemory_vs_temp_file_pdf_passes_tests.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> After upgrading to Tika 0.10, I began having OOM errors when processing large 
> numbers of PDFs in parallel. The heap dump indicated that all the memory was 
> being used up by PDFBox RandomAccessBuffers. After digging around, it looks 
> like PDFBox now defaults to using RAM rather than temporary files for PDF 
> extraction. This can be overridden to use RandomAccessFiles. 
> I propose that Tika controls file vs buffer based on the inputstream type 
> received. If the TikaInputStream is a file, RandomAccessFile should be used 
> and for other stream types, RandomAccessBuffer can be used. 
> I believe the code to control this is here:
> https://github.com/apache/tika/blob/trunk/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java
> At ~line 87:
> PDDocument pdfDocument =
>             PDDocument.load(new CloseShieldInputStream(stream), true);
> Not sure if this is the best approach and am curious if there are other ideas 
> on how to control this and keep the interface clean. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
