[ https://issues.apache.org/jira/browse/TIKA-818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13205473#comment-13205473 ]
Nick Burch commented on TIKA-818:
---------------------------------

Temp files created through TemporaryResources are already added for closing, so we can simplify your patch a little bit.

Slightly modified version applied in r1242786, thanks!

> Allow PDFBox to be used with RandomAccessFile vs RandomAccessBuffer to allow for a memory vs performance tradeoff
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-818
>                 URL: https://issues.apache.org/jira/browse/TIKA-818
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 0.10, 1.0
>            Reporter: Paul Pearcy
>             Fix For: 1.1
>
>         Attachments: PDFParser.java.patch, choose_inmemory_vs_temp_file_pdf.patch, choose_inmemory_vs_temp_file_pdf_passes_tests.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> After upgrading to Tika 0.10, I began getting OOM errors when processing large numbers of PDFs in parallel. The heap dump indicated that all the memory was being used up by PDFBox RandomAccessBuffers. After digging around, it looks like PDFBox now defaults to using RAM rather than temporary files for PDF extraction. This can be overridden to use RandomAccessFiles.
> I propose that Tika choose file vs buffer based on the type of input stream received. If the TikaInputStream is backed by a file, RandomAccessFile should be used; for other stream types, RandomAccessBuffer can be used.
> I believe the code to control this is here:
> https://github.com/apache/tika/blob/trunk/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java
> At ~line 87:
>     PDDocument pdfDocument =
>         PDDocument.load(new CloseShieldInputStream(stream), true);
> Not sure if this is the best approach, and I am curious whether there are other ideas on how to control this and keep the interface clean.
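
For illustration, the selection rule proposed in the issue (file-backed input gets file-backed scratch storage, anything else stays in memory) could be sketched roughly as below. This is a minimal, standard-library-only sketch and not the patch applied in r1242786: Scratch, HeapScratch, FileScratch, choose() and the inputIsFileBacked flag are hypothetical names standing in for the PDFBox RandomAccessBuffer/RandomAccessFile choice and for whatever check Tika makes on the TikaInputStream.

```java
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;

// Illustration only: every type and method here is hypothetical and
// is not part of the Tika or PDFBox APIs.
final class ScratchChoice {

    /** Minimal scratch-storage abstraction used by this sketch. */
    interface Scratch extends AutoCloseable {
        void write(byte[] data);
        long length();
        boolean onDisk();
        @Override
        void close();
    }

    /** Keeps all data on the heap: fast, but memory grows with the input. */
    static final class HeapScratch implements Scratch {
        private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();

        public void write(byte[] data) { buffer.write(data, 0, data.length); }
        public long length() { return buffer.size(); }
        public boolean onDisk() { return false; }
        public void close() { buffer.reset(); }
    }

    /** Appends data to a temporary file, trading speed for lower heap use. */
    static final class FileScratch implements Scratch {
        private final File file;
        private final RandomAccessFile raf;

        FileScratch() {
            try {
                file = File.createTempFile("scratch-", ".tmp");
                raf = new RandomAccessFile(file, "rw");
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }

        public void write(byte[] data) {
            try {
                raf.seek(raf.length()); // always append at the end
                raf.write(data);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }

        public long length() {
            try {
                return raf.length();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }

        public boolean onDisk() { return true; }

        public void close() {
            try {
                raf.close();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            } finally {
                file.delete(); // remove the temporary file even if close fails
            }
        }
    }

    /** The proposed rule: file-backed input uses a file, other input uses RAM. */
    static Scratch choose(boolean inputIsFileBacked) {
        return inputIsFileBacked ? new FileScratch() : new HeapScratch();
    }

    public static void main(String[] args) {
        try (Scratch scratch = choose(true)) {
            scratch.write(new byte[4096]);
            System.out.println("onDisk=" + scratch.onDisk()
                    + " length=" + scratch.length());
        }
    }
}
```

With many PDFs parsed in parallel, the file-backed variant keeps per-document heap use roughly constant at the cost of extra disk I/O, which is the memory vs performance tradeoff named in the issue title.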