[ 
https://issues.apache.org/jira/browse/TIKA-3642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17474997#comment-17474997
 ] 

Tim Allison commented on TIKA-3642:
-----------------------------------

Got your file.  Thank you.  That was critical.  What's going on is that in 
tika-1.x we default maxMainMemoryBytes to 512MB, whereas in tika-2.x the 
default is -1.  This is *bad*, and we should fix this quickly.

 

I was able to parse the file without a problem in 1.x with -Xmx1g, and when I 
used the config below in 2.x, I got the same behavior.  Without this config, 
I got an OOM in 2.x with -Xmx2g (I didn't try higher).

 
{noformat}
<properties>
    <parsers>
        <parser class="org.apache.tika.parser.DefaultParser">
            <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
        </parser>
        <parser class="org.apache.tika.parser.pdf.PDFParser">
            <params>
                <param name="maxMainMemoryBytes" type="long">524288000</param>
            </params>
        </parser>
    </parsers>
</properties>
{noformat}
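
A note on units: the value 524288000 in the config above works out to 500 MiB (500 * 1024 * 1024), not 512 MiB. Here is a minimal sketch of the conversion, in case you want a different limit. MemoryLimits and mebibytesToBytes are hypothetical names used only for illustration; they are not part of Tika or PDFBox.

```java
// Hypothetical helper (not part of Tika): converts a size in MiB to the
// byte count expected by the maxMainMemoryBytes parameter.
final class MemoryLimits {
    private static final long BYTES_PER_MIB = 1024L * 1024L;

    // multiplyExact throws ArithmeticException on overflow instead of wrapping.
    static long mebibytesToBytes(long mebibytes) {
        return Math.multiplyExact(mebibytes, BYTES_PER_MIB);
    }

    public static void main(String[] args) {
        System.out.println(mebibytesToBytes(500)); // 524288000, the value in the config
        System.out.println(mebibytesToBytes(512)); // 536870912, exactly 512 MiB
    }
}
```

Whichever value is chosen goes into the param element as a plain decimal long, as in the config above.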

> Getting java.lang.OutOfMemoryError: Java heap space when parsing PDF file
> -------------------------------------------------------------------------
>
>                 Key: TIKA-3642
>                 URL: https://issues.apache.org/jira/browse/TIKA-3642
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Tika User
>            Priority: Major
>
> When parsing large PDF files (1.65 GB), we get an out-of-memory error. The 
> PDFBox version we are using is 2.0.25.
> java.lang.OutOfMemoryError: Java heap space at 
> org.apache.pdfbox.pdfparser.COSParser.isString



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
