[jira] [Commented] (PDFBOX-6146) OutOfMemoryError when trying to extract text from pdf

Tilman Hausherr (Jira) Wed, 14 Jan 2026 03:32:20 -0800


    [ https://issues.apache.org/jira/browse/PDFBOX-6146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18051710#comment-18051710 ]


Tilman Hausherr commented on PDFBOX-6146:
-----------------------------------------

I committed your proposed changes. I also tried an alternative, which was to 
simply return if an entry occurred twice, but then unit tests failed. This means 
that there are legitimate fonts that contain duplicate entries. I'm now running 
our large regression tests (PDFBOX-6140), which will be done within a few hours.
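
To illustrate why returning on a repeated entry can break valid input, here is a minimal, self-contained sketch. It is not the PDFBox/FontBox code and not the committed change: `DuplicateEntryDemo`, `parseEntry`, and the offset values are hypothetical, and the "sharing" variant is only one conceivable way to tolerate repeats, shown here as an assumption.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class DuplicateEntryDemo {

    // Hypothetical stand-in for parsing the entry found at an offset.
    static String parseEntry(int offset) {
        return "entry@" + offset;
    }

    // The alternative described above: return as soon as an offset repeats.
    // Everything from the first repeated offset onward is lost.
    static List<String> readStopOnDuplicate(int[] offsets) {
        Set<Integer> seen = new HashSet<>();
        List<String> result = new ArrayList<>();
        for (int offset : offsets) {
            if (!seen.add(offset)) {
                return result; // give up at the first repeat
            }
            result.add(parseEntry(offset));
        }
        return result;
    }

    // Assumed tolerant variant: parse each distinct offset once and
    // reuse the parsed entry whenever the same offset appears again.
    static List<String> readSharingDuplicates(int[] offsets) {
        Map<Integer, String> parsed = new HashMap<>();
        List<String> result = new ArrayList<>();
        for (int offset : offsets) {
            result.add(parsed.computeIfAbsent(offset, DuplicateEntryDemo::parseEntry));
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up offsets in which 10 legitimately appears twice.
        int[] offsets = {10, 20, 10, 30};
        System.out.println(readStopOnDuplicate(offsets));
        System.out.println(readSharingDuplicates(offsets));
    }
}
```

With the made-up offsets, the first method drops the entry at offset 30 because it stops at the repeated 10, which is the kind of loss that would make tests on valid fonts fail. The second keeps all four slots while parsing each distinct offset only once.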

> OutOfMemoryError when trying to extract text from pdf
> -----------------------------------------------------
>
>                 Key: PDFBOX-6146
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-6146
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 3.0.6 PDFBox
>         Environment: java 17.  macos 26.2
>            Reporter: james
>            Priority: Blocker
>         Attachments: test.pdf
>
>
> I have a PDF file that causes an OutOfMemoryError when trying to extract 
> the text. Unfortunately, it is a customer file which I cannot share 
> (it is only about 5 MB). However, I am willing to work with someone on 
> debugging the issue. Before it fails with the OOME, I get the following errors:
> {{16:45:23.418 [main] ERROR o.a.f.ttf.GlyphSubstitutionTable - scriptOffsets[1680]: 10084 implausible: data.getCurrentPosition() - offset = 10088 ()}}
> {{16:45:23.419 [main] WARN  o.a.f.ttf.GlyphSubstitutionTable - FeatureRecord array not alphabetically sorted by FeatureTag: S»d» < »d»c ()}}
> The output above is from 3.0.6, which seems to have improved something 
> relevant. The extraction originally failed on 3.0.5, where I got around 
> 40,000 log entries like:
> {{16:51:52.897 [main] ERROR o.a.f.ttf.GlyphSubstitutionTable - LangSysRecords not alphabetically sorted by LangSys tag: PH'd <= PLS» ()}}
> I've tried allocating up to 6 GB to the extraction process without any 
> success. I can open the PDF file with macOS Preview without any issues.
> The relevant stack trace is:
> {{java.lang.OutOfMemoryError: Java heap space}}
> {{org.apache.fontbox.ttf.GlyphSubstitutionTable.readLookupTable(GlyphSubstitutionTable.java:341)}}
> {{org.apache.fontbox.ttf.GlyphSubstitutionTable.readLookupList(GlyphSubstitutionTable.java:292)}}
> {{org.apache.fontbox.ttf.GlyphSubstitutionTable.read(GlyphSubstitutionTable.java:115)}}
> {{org.apache.fontbox.ttf.TrueTypeFont.readTable(TrueTypeFont.java:409)}}
> {{org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:186)}}
> {{org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:165)}}
> {{org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:66)}}
> {{org.apache.pdfbox.pdmodel.font.PDCIDFontType2.<init>(PDCIDFontType2.java:123)}}
> {{org.apache.pdfbox.pdmodel.font.PDCIDFontType2.<init>(PDCIDFontType2.java:72)}}
> {{org.apache.pdfbox.pdmodel.font.PDFontFactory.createDescendantFont(PDFontFactory.java:385)}}
> {{org.apache.pdfbox.pdmodel.font.PDType0Font.<init>(PDType0Font.java:97)}}
> {{org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:173)}}
> {{org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:170)}}
> {{org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:72)}}
> {{org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:926)}}
> {{org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:559)}}
> {{org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:517)}}
> {{org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:158)}}
> {{org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:153)}}
> {{org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:380)}}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
