Manfred Schauer created PDFBOX-5479:
---------------------------------------

             Summary: PDFTextStripper needs 1GB heap for a 3.6 MB pdf
                 Key: PDFBOX-5479
                 URL: https://issues.apache.org/jira/browse/PDFBOX-5479
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 2.0.26
         Environment: JDK11.0.2 on MacOS 12.4
            Reporter: Manfred Schauer
         Attachments: heapDump.png, x.pdf

Extracting text from the attached x.pdf:

// try-with-resources closes the document even if extraction fails
try (PDDocument pdDocument = PDDocument.load(new File("/tmp/x.pdf"))) {
    PDFTextStripper stripper = new PDFTextStripper();
    stripper.getText(pdDocument);
}

This succeeds with -Xmx1G but throws an OutOfMemoryError with -Xmx900m.

The heap dump shows 2923 instances of TrueTypeFont. PDResources.cache contains
SoftReferences to many fonts, keyed by different COSObjects.
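
To illustrate the pattern I believe the heap dump points to, here is a simplified stand-alone model. It is not PDFBox code: Key, Font, getFont and SoftCacheSketch are hypothetical names, and the assumption is that each page refers to the same font through a different key object, so the cache never gets a hit and every lookup parses the font again. The soft references are only cleared under memory pressure, so until then all parsed fonts stay on the heap.

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;

public class SoftCacheSketch {
    /** Hypothetical stand-in for an indirect object used as cache key (identity equality). */
    static final class Key {
        final int objectNumber;
        Key(int objectNumber) { this.objectNumber = objectNumber; }
    }

    /** Hypothetical stand-in for a parsed font. */
    static final class Font {
        final String name;
        Font(String name) { this.name = name; }
    }

    static final Map<Key, SoftReference<Font>> cache = new HashMap<>();
    static int parseCount = 0;

    /** Returns the cached font for this key, or "parses" a new one on a miss. */
    static Font getFont(Key key, String name) {
        SoftReference<Font> ref = cache.get(key);
        Font font = ref != null ? ref.get() : null;
        if (font == null) {
            font = new Font(name); // assumption: stands in for an expensive font parse
            parseCount++;
            cache.put(key, new SoftReference<>(font));
        }
        return font;
    }

    public static void main(String[] args) {
        // Same font name every time, but a different key object per "page":
        // no lookup ever hits, so the map grows by one entry per page.
        for (int page = 0; page < 1000; page++) {
            getFont(new Key(page), "Arial");
        }
        System.out.println(cache.size() + " entries, " + parseCount + " parses");
    }
}
```

If the keys were shared between pages, the same loop would leave one entry and one parse.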




