[ 
https://issues.apache.org/jira/browse/PDFBOX-5479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569732#comment-17569732
 ] 

Michael Klink commented on PDFBOX-5479:
---------------------------------------

Wow, some 3000 form XObjects on page 1, many of them with their own font object, 
most of which point to the same font descriptor... that adds up...
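
To illustrate why that adds up, here is a minimal, self-contained sketch. It does not use PDFBox; `Descriptor`, `FontObject`, `ParsedFont` and both cache methods are hypothetical stand-ins made up for this example. It assumes a cache keyed by the font object itself, in the spirit of the "keyed by different COSObjects" observation in the report, and compares that with a cache keyed by the shared descriptor. Keying by descriptor is shown only for contrast, not as what PDFBox does or should do.

```java
import java.util.ArrayList;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for illustration only; none of these are PDFBox classes.
class FontCacheSketch {

    /** Stands in for a shared font descriptor (which references the embedded font program). */
    static final class Descriptor { }

    /** Stands in for one font dictionary; many of them can reference the same descriptor. */
    static final class FontObject {
        final Descriptor descriptor;
        FontObject(Descriptor descriptor) { this.descriptor = descriptor; }
    }

    /** Stands in for an expensive, fully parsed font. */
    static final class ParsedFont {
        final Descriptor source;
        ParsedFont(Descriptor source) { this.source = source; }
    }

    /** Builds {@code count} distinct font objects that all share a single descriptor. */
    static List<FontObject> sampleFonts(int count) {
        Descriptor shared = new Descriptor();
        List<FontObject> fonts = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            fonts.add(new FontObject(shared));
        }
        return fonts;
    }

    /** Cache keyed by font object identity: every distinct font object gets its own parsed font. */
    static int parsedWhenKeyedByFontObject(List<FontObject> fonts) {
        Map<FontObject, ParsedFont> cache = new IdentityHashMap<>();
        for (FontObject font : fonts) {
            cache.computeIfAbsent(font, f -> new ParsedFont(f.descriptor));
        }
        return cache.size();
    }

    /** Cache keyed by descriptor identity: font objects sharing a descriptor share one parsed font. */
    static int parsedWhenKeyedByDescriptor(List<FontObject> fonts) {
        Map<Descriptor, ParsedFont> cache = new IdentityHashMap<>();
        for (FontObject font : fonts) {
            cache.computeIfAbsent(font.descriptor, ParsedFont::new);
        }
        return cache.size();
    }

    public static void main(String[] args) {
        List<FontObject> fonts = sampleFonts(3000);
        System.out.println(parsedWhenKeyedByFontObject(fonts)); // prints 3000
        System.out.println(parsedWhenKeyedByDescriptor(fonts)); // prints 1
    }
}
```

With 3000 font objects sharing one descriptor, the first cache ends up holding 3000 parsed fonts and the second holds one. If each parsed font keeps its own copy of the font data, the first layout multiplies that data by the number of font objects.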

> PDFTextStripper needs 1GB heap for a 3.6 MB pdf
> -----------------------------------------------
>
>                 Key: PDFBOX-5479
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5479
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.26
>         Environment: JDK11.0.2 on MacOS 12.4
>            Reporter: Manfred Schauer
>            Priority: Minor
>         Attachments: heapDump.png, x.pdf
>
>
> Extracting text from the attached x.pdf:
> PDDocument pdDocument = PDDocument.load(new File("/tmp/x.pdf"));
> PDFTextStripper stripper = new PDFTextStripper();
> stripper.getText(pdDocument);
> succeeds with -Xmx1G but throws OOME with -Xmx900m
> The heap dump shows 2923 instances of TrueTypeFont; PDResources.cache contains 
> SoftReferences to lots of fonts keyed by different COSObjects.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

