[ https://issues.apache.org/jira/browse/TIKA-3642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17474997#comment-17474997 ]
Tim Allison commented on TIKA-3642:
-----------------------------------

Got your file. Thank you. That was critical.

What's going on is that in tika-1.x we're defaulting to 512MB for maxMainMemory. In tika-2.x, the default is -1. This is {*}bad{*}, and we should fix this quickly.

I was able to parse the file without a problem in 1.x with -Xmx1g, and when I used this config in 2.x, I got the same behavior. If I didn't use this config, I got an OOM with -Xmx2g (I didn't try higher).

{noformat}
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
    </parser>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="maxMainMemoryBytes" type="long">524288000</param>
      </params>
    </parser>
  </parsers>
</properties>
{noformat}

> Getting java.lang.OutOfMemoryError: Java heap space when parsing PDF file
> -------------------------------------------------------------------------
>
>                 Key: TIKA-3642
>                 URL: https://issues.apache.org/jira/browse/TIKA-3642
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Tika User
>            Priority: Major
>
> When parsing large PDF files(1.65 GB) we are getting out of memory error. The
> version we are using 2.0.25(pdfbox)
> java.lang.OutOfMemoryError: Java heap space at
> org.apache.pdfbox.pdfparser.COSParser.isString

--
This message was sent by Atlassian Jira
(v8.20.1#820001)