[jira] [Commented] (TIKA-818) Allow PDFBox to be used with RandomAccessFile vs RandomAccessBuffer to allow for a memory vs performance tradeoff

Paul Pearcy (Commented) (JIRA) Sun, 22 Jan 2012 22:32:36 -0800

    [ https://issues.apache.org/jira/browse/TIKA-818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13190895#comment-13190895 ]


Paul Pearcy commented on TIKA-818:
----------------------------------

Hey Nick, 
  Thanks a ton for taking a look! Apologies for the delay in response. 

What decides whether PDFBox buffers in memory or in a temporary file is the 
RandomAccess implementation passed to the load method:
http://www.jarvana.com/jarvana/view/org/apache/pdfbox/pdfbox/1.6.0/pdfbox-1.6.0-javadoc.jar!/org/apache/pdfbox/pdmodel/PDDocument.html#load(java.io.InputStream, org.apache.pdfbox.io.RandomAccess, boolean)

Here is a sample I've been hacking around with:
https://gist.github.com/1661161

The code probably isn't the best way to set things up, for a couple of reasons:
- It'd be nice to let callers pick memory or file buffers. I'm not sure what 
the right approach would be to keep the Tika interface clean.
- I think TikaInputStream has its own temporary file resource management that 
should probably be used. I haven't figured that out yet. 
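
For illustration only, here is a minimal sketch of the file-vs-buffer 
decision, separate from the gist above. ScratchPolicy, Kind, choose and 
newScratchFile are hypothetical names made up for this example, not Tika or 
PDFBox API. The sketch uses only the JDK so it runs on its own; the PDFBox 
wiring appears only in a comment and assumes the 1.6.0 
load(InputStream, RandomAccess, boolean) overload linked above.

```java
import java.io.File;
import java.io.IOException;

/** Hypothetical helper: decides where PDFBox scratch data should live. */
final class ScratchPolicy {

    /** The two buffer kinds discussed in this issue. */
    enum Kind { RANDOM_ACCESS_BUFFER, RANDOM_ACCESS_FILE }

    /**
     * Rule proposed in the issue description: file-backed input gets a
     * temporary-file scratch area, any other stream stays in memory.
     */
    static Kind choose(boolean inputIsFileBacked) {
        return inputIsFileBacked ? Kind.RANDOM_ACCESS_FILE
                                 : Kind.RANDOM_ACCESS_BUFFER;
    }

    /** Creates the temp file that would back a RandomAccessFile scratch area. */
    static File newScratchFile() throws IOException {
        File tmp = File.createTempFile("tika-pdfbox-", ".scratch");
        tmp.deleteOnExit(); // fallback only; delete explicitly after parsing
        return tmp;
    }

    // Assumption: with PDFBox 1.6.0 on the classpath, the choice would map
    // to roughly the following (not compiled here):
    //   RandomAccess scratch = (kind == Kind.RANDOM_ACCESS_FILE)
    //       ? new org.apache.pdfbox.io.RandomAccessFile(newScratchFile(), "rw")
    //       : new org.apache.pdfbox.io.RandomAccessBuffer();
    //   PDDocument doc = PDDocument.load(
    //       new CloseShieldInputStream(stream), scratch, true);

    public static void main(String[] args) throws IOException {
        Kind kind = choose(args.length > 0); // pretend an argument means "file input"
        System.out.println("scratch kind: " + kind);
        if (kind == Kind.RANDOM_ACCESS_FILE) {
            File tmp = newScratchFile();
            System.out.println("scratch file: " + tmp.getAbsolutePath());
            tmp.delete(); // clean up as soon as the parse would be done
        }
    }
}
```

Keeping the decision in one small method would make it easy to swap the 
file-backed rule for a caller-supplied setting later, if that turns out to 
be the cleaner interface.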

Thanks and Best Regards,
Paul
                
> Allow PDFBox to be used with RandomAccessFile vs RandomAccessBuffer to allow 
> for a memory vs performance tradeoff
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-818
>                 URL: https://issues.apache.org/jira/browse/TIKA-818
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 0.10, 1.0
>            Reporter: Paul Pearcy
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> After upgrading to Tika 0.10, I began getting OOM errors when processing 
> large numbers of PDFs in parallel. The heap dump indicated that all the 
> memory was being used up by PDFBox RandomAccessBuffers. After digging 
> around, it looks like PDFBox now defaults to using RAM rather than temporary 
> files for PDF extraction. This can be overridden to use RandomAccessFile. 
> I propose that Tika choose file vs buffer based on the type of input stream 
> received. If the TikaInputStream is backed by a file, RandomAccessFile should 
> be used; for other stream types, RandomAccessBuffer can be used. 
> I believe the code to control this is here:
> https://github.com/apache/tika/blob/trunk/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java
> At ~line 87:
> PDDocument pdfDocument =
>             PDDocument.load(new CloseShieldInputStream(stream), true);
> I'm not sure this is the best approach, and I'm curious whether there are 
> other ideas on how to control this and keep the interface clean. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
