[ 
https://issues.apache.org/jira/browse/PDFBOX-6146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18051622#comment-18051622
 ] 

Tilman Hausherr commented on PDFBOX-6146:
-----------------------------------------

I'm able to extract from the command line:

Jan. 13, 2026 8:29:16 PM org.apache.pdfbox.pdfparser.COSParser 
validateStreamLength
WARNING: The end of the stream doesn't point to the correct offset, using 
workaround to read the stream, stream start position: 3926932, length: 1327, 
expected end position: 3928259
Jan. 13, 2026 8:29:16 PM org.apache.pdfbox.pdmodel.PDPageTree getKids
WARNING: replaced null entry with an empty page
Jan. 13, 2026 8:29:16 PM org.apache.pdfbox.pdmodel.PDPageTree getKids
WARNING: replaced null entry with an empty page
Jan. 13, 2026 8:29:16 PM org.apache.pdfbox.pdmodel.PDPageTree getKids
WARNING: replaced null entry with an empty page
Jan. 13, 2026 8:29:16 PM org.apache.pdfbox.pdmodel.PDPageTree getKids
WARNING: replaced null entry with an empty page
Jan. 13, 2026 8:29:16 PM org.apache.pdfbox.pdmodel.PDPageTree getKids
WARNING: replaced null entry with an empty page
Jan. 13, 2026 8:29:16 PM org.apache.pdfbox.pdmodel.PDPageTree getKids
WARNING: replaced null entry with an empty page
Jan. 13, 2026 8:29:16 PM org.apache.pdfbox.pdmodel.PDPageTree getKids
WARNING: replaced null entry with an empty page
Jan. 13, 2026 8:29:16 PM org.apache.pdfbox.pdmodel.PDPageTree getKids
WARNING: replaced null entry with an empty page
Jan. 13, 2026 8:29:16 PM org.apache.pdfbox.pdmodel.PDPageTree getKids
WARNING: replaced null entry with an empty page
Jan. 13, 2026 8:29:16 PM org.apache.pdfbox.pdmodel.PDPageTree getKids
WARNING: replaced null entry with an empty page
Jan. 13, 2026 8:29:16 PM org.apache.pdfbox.pdmodel.PDPageTree getKids
WARNING: replaced null entry with an empty page
Jan. 13, 2026 8:29:16 PM org.apache.pdfbox.pdmodel.PDPageTree getKids
WARNING: replaced null entry with an empty page
Jan. 13, 2026 8:29:16 PM org.apache.pdfbox.pdmodel.PDPageTree getKids
WARNING: replaced null entry with an empty page
Jan. 13, 2026 8:29:16 PM org.apache.pdfbox.pdmodel.PDPageTree getKids
WARNING: replaced null entry with an empty page
Jan. 13, 2026 8:29:16 PM org.apache.pdfbox.pdmodel.PDPageTree getKids
WARNING: replaced null entry with an empty page
Jan. 13, 2026 8:29:16 PM org.apache.pdfbox.pdmodel.PDPageTree getKids
WARNING: replaced null entry with an empty page
Jan. 13, 2026 8:29:16 PM org.apache.pdfbox.pdmodel.PDPageTree getKids
WARNING: replaced null entry with an empty page
Jan. 13, 2026 8:29:16 PM org.apache.pdfbox.pdmodel.PDPageTree getKids
WARNING: replaced null entry with an empty page
Jan. 13, 2026 8:29:16 PM org.apache.pdfbox.pdmodel.PDPageTree getKids
WARNING: replaced null entry with an empty page
Jan. 13, 2026 8:29:16 PM org.apache.pdfbox.pdmodel.PDPageTree getKids
WARNING: replaced null entry with an empty page
Jan. 13, 2026 8:29:16 PM org.apache.pdfbox.pdmodel.PDPageTree getKids
WARNING: replaced null entry with an empty page
Jan. 13, 2026 8:29:16 PM org.apache.pdfbox.pdmodel.PDPageTree getKids
WARNING: replaced null entry with an empty page
Jan. 13, 2026 8:29:16 PM org.apache.pdfbox.pdmodel.PDPageTree getKids
WARNING: replaced null entry with an empty page
Jan. 13, 2026 8:29:16 PM org.apache.pdfbox.pdmodel.PDPageTree getKids
WARNING: replaced null entry with an empty page
Jan. 13, 2026 8:29:16 PM org.apache.pdfbox.pdmodel.PDPageTree getKids
WARNING: replaced null entry with an empty page
Jan. 13, 2026 8:29:16 PM org.apache.pdfbox.pdmodel.PDPageTree getKids
WARNING: replaced null entry with an empty page
Jan. 13, 2026 8:29:16 PM org.apache.pdfbox.pdmodel.PDPageTree getKids
WARNING: replaced null entry with an empty page
Jan. 13, 2026 8:29:16 PM org.apache.pdfbox.pdmodel.PDPageTree getKids
WARNING: replaced null entry with an empty page
Jan. 13, 2026 8:29:16 PM org.apache.pdfbox.pdmodel.PDPageTree getKids
WARNING: replaced null entry with an empty page
Jan. 13, 2026 8:29:16 PM org.apache.pdfbox.pdmodel.PDPageTree getKids
WARNING: replaced null entry with an empty page
Jan. 13, 2026 8:29:16 PM org.apache.pdfbox.pdmodel.PDPageTree getKids
WARNING: replaced null entry with an empty page
Jan. 13, 2026 8:29:16 PM org.apache.pdfbox.pdmodel.PDPageTree getKids
WARNING: replaced null entry with an empty page
Jan. 13, 2026 8:29:16 PM org.apache.pdfbox.pdmodel.PDPageTree getKids
WARNING: replaced null entry with an empty page
Jan. 13, 2026 8:29:16 PM org.apache.pdfbox.pdmodel.PDPageTree getKids
WARNING: replaced null entry with an empty page
Jan. 13, 2026 8:29:16 PM org.apache.pdfbox.pdmodel.PDPageTree getKids
WARNING: replaced null entry with an empty page
Jan. 13, 2026 8:29:16 PM org.apache.pdfbox.pdmodel.PDPageTree getKids
WARNING: replaced null entry with an empty page
Jan. 13, 2026 8:29:16 PM org.apache.pdfbox.pdmodel.PDPageTree getKids
WARNING: replaced null entry with an empty page
Jan. 13, 2026 8:29:16 PM org.apache.pdfbox.pdmodel.PDPageTree getKids
WARNING: replaced null entry with an empty page
Jan. 13, 2026 8:29:16 PM org.apache.pdfbox.pdmodel.PDPageTree getKids
WARNING: replaced null entry with an empty page
Jan. 13, 2026 8:29:16 PM org.apache.pdfbox.pdmodel.PDPageTree getKids
WARNING: replaced null entry with an empty page
Jan. 13, 2026 8:29:16 PM org.apache.pdfbox.pdmodel.PDPageTree getKids
WARNING: replaced null entry with an empty page
Jan. 13, 2026 8:29:16 PM org.apache.pdfbox.pdmodel.PDPageTree getKids
WARNING: replaced null entry with an empty page
Jan. 13, 2026 8:29:16 PM org.apache.pdfbox.pdmodel.PDPageTree getKids
WARNING: replaced null entry with an empty page
Jan. 13, 2026 8:29:16 PM org.apache.pdfbox.pdmodel.PDPageTree getKids
WARNING: replaced null entry with an empty page
Jan. 13, 2026 8:29:16 PM org.apache.pdfbox.pdmodel.PDPageTree getKids
WARNING: replaced null entry with an empty page
Jan. 13, 2026 8:29:16 PM org.apache.pdfbox.pdmodel.PDPageTree getKids
WARNING: replaced null entry with an empty page
Jan. 13, 2026 8:29:16 PM org.apache.pdfbox.pdmodel.PDPageTree getKids
WARNING: replaced null entry with an empty page
Jan. 13, 2026 8:29:16 PM org.apache.pdfbox.pdmodel.PDPageTree getKids
WARNING: replaced null entry with an empty page
Jan. 13, 2026 8:29:16 PM org.apache.pdfbox.pdmodel.PDPageTree getKids
WARNING: replaced null entry with an empty page
Jan. 13, 2026 8:29:16 PM org.apache.pdfbox.pdmodel.PDPageTree getKids
WARNING: replaced null entry with an empty page
Jan. 13, 2026 8:29:16 PM org.apache.pdfbox.pdmodel.PDPageTree getKids
WARNING: replaced null entry with an empty page
Jan. 13, 2026 8:29:16 PM org.apache.pdfbox.pdmodel.PDPageTree getKids
WARNING: replaced null entry with an empty page
Jan. 13, 2026 8:29:16 PM org.apache.pdfbox.pdmodel.PDPageTree getKids
WARNING: replaced null entry with an empty page
Jan. 13, 2026 8:29:16 PM org.apache.pdfbox.pdmodel.PDPageTree getKids
WARNING: replaced null entry with an empty page
Jan. 13, 2026 8:29:16 PM org.apache.pdfbox.pdmodel.PDPageTree getKids
WARNING: replaced null entry with an empty page
Jan. 13, 2026 8:29:16 PM org.apache.pdfbox.pdmodel.PDPageTree getKids
WARNING: replaced null entry with an empty page
Jan. 13, 2026 8:29:16 PM org.apache.pdfbox.pdmodel.PDPageTree getKids
WARNING: replaced null entry with an empty page
Jan. 13, 2026 8:29:16 PM org.apache.pdfbox.pdmodel.PDPageTree getKids
WARNING: replaced null entry with an empty page
Jan. 13, 2026 8:29:16 PM org.apache.pdfbox.pdmodel.PDPageTree getKids
WARNING: replaced null entry with an empty page
Jan. 13, 2026 8:29:16 PM org.apache.pdfbox.pdmodel.PDPageTree getKids
WARNING: replaced null entry with an empty page
Jan. 13, 2026 8:29:16 PM org.apache.pdfbox.pdmodel.PDPageTree getKids
WARNING: replaced null entry with an empty page
Jan. 13, 2026 8:29:16 PM org.apache.pdfbox.pdmodel.PDPageTree getKids
WARNING: replaced null entry with an empty page
Jan. 13, 2026 8:29:16 PM org.apache.pdfbox.pdmodel.PDPageTree getKids
WARNING: replaced null entry with an empty page
Jan. 13, 2026 8:29:16 PM org.apache.pdfbox.pdmodel.PDPageTree getKids
WARNING: replaced null entry with an empty page
Jan. 13, 2026 8:29:16 PM org.apache.pdfbox.pdmodel.PDPageTree getKids
WARNING: replaced null entry with an empty page
Jan. 13, 2026 8:29:16 PM org.apache.pdfbox.pdmodel.PDPageTree getKids
WARNING: replaced null entry with an empty page
Jan. 13, 2026 8:29:16 PM org.apache.pdfbox.pdmodel.PDPageTree getKids
WARNING: replaced null entry with an empty page
Jan. 13, 2026 8:29:16 PM org.apache.pdfbox.pdfparser.COSParser 
validateStreamLength
WARNING: The end of the stream doesn't point to the correct offset, using 
workaround to read the stream, stream start position: 2268226, length: 4895, 
expected end position: 2273121
Jan. 13, 2026 8:29:16 PM org.apache.pdfbox.pdfparser.COSParser 
validateStreamLength
WARNING: The end of the stream doesn't point to the correct offset, using 
workaround to read the stream, stream start position: 3049998, length: 77098, 
expected end position: 3127096
Jan. 13, 2026 8:29:16 PM org.apache.pdfbox.filter.FlateFilterDecoderStream 
fetch
WARNING: FlateFilter: premature end of stream due to a DataFormatException = 
invalid stored block lengths
Jan. 13, 2026 8:29:17 PM org.apache.pdfbox.filter.FlateFilterDecoderStream 
fetch
WARNING: FlateFilter: premature end of stream due to a DataFormatException = 
invalid stored block lengths
Jan. 13, 2026 8:29:17 PM org.apache.fontbox.ttf.GlyphSubstitutionTable 
readScriptList
SEVERE: scriptOffsets[1680]: 10084 implausible: 
data.getCurrentPosition() - offset = 10088
Jan. 13, 2026 8:29:17 PM org.apache.fontbox.ttf.GlyphSubstitutionTable 
readFeatureList
WARNING: FeatureRecord array not alphabetically sorted by FeatureTag: S»d» < 
»d»c
Jan. 13, 2026 8:29:33 PM org.apache.pdfbox.pdmodel.font.PDCIDFontType2 
findFontOrSubstitute
WARNING: Using fallback font LiberationSans for CID-keyed TrueType font 
ABCDEE+Arial Narrow
Jan. 13, 2026 8:29:33 PM org.apache.pdfbox.pdmodel.font.PDCIDFontType2 
findFontOrSubstitute
WARNING: Using fallback font LiberationSans for CID-keyed TrueType font 
ABCDEE+Arial Narrow
Jan. 13, 2026 8:29:34 PM org.apache.pdfbox.pdmodel.font.PDCIDFontType2 
findFontOrSubstitute
WARNING: Using fallback font LiberationSans for CID-keyed TrueType font 
ABCDEE+Arial Narrow
Jan. 13, 2026 8:29:34 PM org.apache.pdfbox.filter.FlateFilterDecoderStream 
fetch
WARNING: FlateFilter: premature end of stream due to a DataFormatException = 
invalid stored block lengths
Jan. 13, 2026 8:29:34 PM org.apache.pdfbox.filter.FlateFilterDecoderStream 
fetch
WARNING: FlateFilter: premature end of stream due to a DataFormatException = 
invalid stored block lengths
Jan. 13, 2026 8:29:34 PM org.apache.fontbox.ttf.GlyphSubstitutionTable 
readScriptList
SEVERE: scriptOffsets[1680]: 10084 implausible: 
data.getCurrentPosition() - offset = 10088
Jan. 13, 2026 8:29:34 PM org.apache.fontbox.ttf.GlyphSubstitutionTable 
readFeatureList
WARNING: FeatureRecord array not alphabetically sorted by FeatureTag: S»d» < 
»d»c
Jan. 13, 2026 8:29:47 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont 
<init>
WARNING: Using fallback font Arial-BoldMT for ABCDEE+Arial Narrow,BoldItalic
Jan. 13, 2026 8:29:47 PM org.apache.pdfbox.filter.FlateFilterDecoderStream 
fetch
WARNING: FlateFilter: premature end of stream due to a DataFormatException = 
invalid stored block lengths
Jan. 13, 2026 8:29:47 PM org.apache.pdfbox.filter.FlateFilterDecoderStream 
fetch
WARNING: FlateFilter: premature end of stream due to a DataFormatException = 
invalid stored block lengths
Jan. 13, 2026 8:29:47 PM org.apache.fontbox.ttf.GlyphSubstitutionTable 
readScriptList
SEVERE: scriptOffsets[1680]: 10084 implausible: 
data.getCurrentPosition() - offset = 10088
Jan. 13, 2026 8:29:47 PM org.apache.fontbox.ttf.GlyphSubstitutionTable 
readFeatureList
WARNING: FeatureRecord array not alphabetically sorted by FeatureTag: S»d» < 
»d»c
Jan. 13, 2026 8:29:59 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont 
<init>
WARNING: Using fallback font ArialMT for ABCDEE+Arial Narrow,Italic
Jan. 13, 2026 8:29:59 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont 
<init>
WARNING: Using fallback font ArialMT for ABCDEE+Arial Narrow,Italic
Jan. 13, 2026 8:29:59 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont 
<init>
WARNING: Using fallback font ArialMT for ABCDEE+Arial Narrow,Italic
Jan. 13, 2026 8:29:59 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont 
<init>
WARNING: Using fallback font Arial-BoldMT for ABCDEE+Arial Narrow,BoldItalic
Jan. 13, 2026 8:30:00 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont 
<init>
WARNING: Using fallback font Arial-BoldMT for ABCDEE+Arial Narrow,BoldItalic
Jan. 13, 2026 8:30:00 PM org.apache.pdfbox.pdfparser.COSParser 
validateStreamLength
WARNING: The end of the stream doesn't point to the correct offset, using 
workaround to read the stream, stream start position: 1337785, length: 88200, 
expected end position: 1425985
Jan. 13, 2026 8:30:01 PM org.apache.pdfbox.filter.FlateFilterDecoderStream 
fetch
WARNING: FlateFilter: premature end of stream due to a DataFormatException = 
invalid distance too far back
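
The log above repeats a handful of messages many times, which makes it hard to see how many distinct problems were reported. As a rough aid, the following sketch tallies entries by level and first message line. It assumes the two-line java.util.logging layout shown above (a timestamp/class/method header, then a "LEVEL: message" line) and accepts any all-caps level word, localized or not. The class name `LogTally` and its method are made up for this illustration and are not part of PDFBox.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox: summarizes a console log like the one above.
final class LogTally {
    // A level line is an all-uppercase word, a colon, a space, then the message text.
    // Header lines ("Jan. 13, ...") and wrapped continuation lines do not match.
    private static final Pattern LEVEL_LINE = Pattern.compile("^(\\p{Lu}+): (.*)$");

    /** Counts entries keyed by "LEVEL: first message line", in first-seen order. */
    static Map<String, Integer> tally(List<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            Matcher m = LEVEL_LINE.matcher(line.trim());
            if (m.matches()) {
                counts.merge(m.group(1) + ": " + m.group(2), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // A short excerpt in the same layout as the log above.
        List<String> sample = List.of(
            "Jan. 13, 2026 8:29:16 PM org.apache.pdfbox.pdmodel.PDPageTree getKids",
            "WARNING: replaced null entry with an empty page",
            "Jan. 13, 2026 8:29:16 PM org.apache.pdfbox.pdmodel.PDPageTree getKids",
            "WARNING: replaced null entry with an empty page",
            "Jan. 13, 2026 8:29:17 PM org.apache.fontbox.ttf.GlyphSubstitutionTable",
            "readScriptList",
            "SEVERE: scriptOffsets[1680]: 10084 implausible:",
            "data.getCurrentPosition() - offset = 10088");
        tally(sample).forEach((message, count) -> System.out.println(count + " x " + message));
    }
}
```

Because messages are wrapped by the mail client, only the first line of each message is used as the key; entries that differ only in a wrapped continuation line are counted together.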

> OutOfMemoryError when trying to extract text from pdf
> -----------------------------------------------------
>
>                 Key: PDFBOX-6146
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-6146
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 3.0.6 PDFBox
>         Environment: java 17.  macos 26.2
>            Reporter: james
>            Priority: Blocker
>         Attachments: test.pdf
>
>
> I have a PDF file which causes an OutOfMemoryError when trying to extract 
> the text.  Unfortunately, the file is a customer file which I cannot share 
> (it is only about 5 MB).  I am willing to work with someone, however, on 
> debugging the issue.  Before it fails with the OOME, I get the following errors:
> {{16:45:23.418 [main] ERROR o.a.f.ttf.GlyphSubstitutionTable - 
> scriptOffsets[1680]: 10084 implausible: data.getCurrentPosition() - offset = 
> 10088 ()}}
> {{16:45:23.419 [main] WARN  o.a.f.ttf.GlyphSubstitutionTable - FeatureRecord 
> array not alphabetically sorted by FeatureTag: S»d» < »d»c ()}}
> The above output is from running on 3.0.6, which seems to have improved 
> something relevant.  This was originally failing on 3.0.5, in which case I 
> got around 40k log entries like:
> {{16:51:52.897 [main] ERROR o.a.f.ttf.GlyphSubstitutionTable - LangSysRecords 
> not alphabetically sorted by LangSys tag: PH'd <= PLS» ()}}
> I've tried allocating up to 6 GB to the extraction process without any 
> success.  I can open the PDF file with macOS Preview without any issues.
> The relevant stack trace is:
> {{java.lang.OutOfMemoryError: Java heap space
> org.apache.fontbox.ttf.GlyphSubstitutionTable.readLookupTable(GlyphSubstitutionTable.java:341)
> org.apache.fontbox.ttf.GlyphSubstitutionTable.readLookupList(GlyphSubstitutionTable.java:292)
> org.apache.fontbox.ttf.GlyphSubstitutionTable.read(GlyphSubstitutionTable.java:115)
> org.apache.fontbox.ttf.TrueTypeFont.readTable(TrueTypeFont.java:409)
> org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:186)
> org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:165)
> org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:66)
> org.apache.pdfbox.pdmodel.font.PDCIDFontType2.<init>(PDCIDFontType2.java:123)
> org.apache.pdfbox.pdmodel.font.PDCIDFontType2.<init>(PDCIDFontType2.java:72)
> org.apache.pdfbox.pdmodel.font.PDFontFactory.createDescendantFont(PDFontFactory.java:385)
> org.apache.pdfbox.pdmodel.font.PDType0Font.<init>(PDType0Font.java:97)
> org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:173)
> org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:170)
> org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:72)
> org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:926)
> org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:559)
> org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:517)
> org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:158)
> org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:153)
> org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:380)}}
>  
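
The trace ends in GlyphSubstitutionTable.readLookupTable, and the log shows an "implausible" check already firing on scriptOffsets. This thread does not establish what actually exhausts the heap. Purely as a general illustration of that kind of plausibility check, the sketch below rejects a record count read from untrusted data when the remaining bytes cannot hold that many records, before any array is sized from it. `CountGuard` and `readPlausibleCount` are hypothetical names and do not reflect the fontbox implementation.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration only; this is not how fontbox reads the GSUB table.
final class CountGuard {
    /**
     * Reads an unsigned 16-bit count (big-endian, the ByteBuffer default) and
     * rejects it if the buffer cannot hold that many records of recordSize bytes.
     */
    static int readPlausibleCount(ByteBuffer data, int recordSize) {
        int count = data.getShort() & 0xFFFF;
        long needed = (long) count * recordSize;
        if (needed > data.remaining()) {
            throw new IllegalArgumentException("count " + count + " implausible: needs "
                    + needed + " bytes, only " + data.remaining() + " remain");
        }
        return count;
    }

    public static void main(String[] args) {
        // The header claims 0xFFFF records of 6 bytes each, but only 4 bytes follow.
        ByteBuffer bad = ByteBuffer.wrap(new byte[] {(byte) 0xFF, (byte) 0xFF, 0, 0, 0, 0});
        try {
            readPlausibleCount(bad, 6);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Checking the count against the bytes that remain keeps a corrupt header from driving a large allocation; a caller could log the rejection and skip the table instead of failing the whole font.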



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
