[ https://issues.apache.org/jira/browse/TIKA-3642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17472245#comment-17472245 ]
Tim Allison edited comment on TIKA-3642 at 1/10/22, 7:20 PM:
-------------------------------------------------------------

Are you calling Tika on a file or on an inputstream? If memory serves, PDFBox is more efficient on a file. I know that zip-based parsers (docx, etc.) are much more memory efficient when processing files vs streams.

{noformat}
Path p = Paths.get("my.pdf");
try (InputStream is = TikaInputStream.get(p)) {...
{noformat}

Separately, I _think_ you can twiddle some of the memory configurations via PDFParserConfig ([https://tika.apache.org/2.2.1/api/org/apache/tika/parser/pdf/PDFParserConfig.html#setMaxMainMemoryBytes-long-]).

was (Author: talli...@mitre.org):
    Are you calling Tika on a file or on an inputstream? If memory serves, PDFBox is more efficient on a file. I know that zip-based parsers (docx, etc.) are much more memory efficient when processing files vs streams.

{noformat}
Path p = "my.pdf"
try (InputStream is = TikaInputStream.get(p)) {...
{noformat}

Separately, I _think_ you can twiddle some of the memory configurations via PDFParserConfig (https://tika.apache.org/2.2.1/api/org/apache/tika/parser/pdf/PDFParserConfig.html#setMaxMainMemoryBytes-long-).

> Getting java.lang.OutOfMemoryError: Java heap space when parsing PDF file
> -------------------------------------------------------------------------
>
>                 Key: TIKA-3642
>                 URL: https://issues.apache.org/jira/browse/TIKA-3642
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Tika User
>            Priority: Major
>
> When parsing large PDF files(1.65 GB) we are getting out of memory error. The
> version we are using 2.0.25(pdfbox)
> java.lang.OutOfMemoryError: Java heap space at
> org.apache.pdfbox.pdfparser.COSParser.isString

--
This message was sent by Atlassian Jira
(v8.20.1#820001)