Re: How to skip parsing embedded TTF inside PDF

Tilman Hausherr Mon, 02 Dec 2019 13:13:43 -0800

Send it to me, tilman at snafu dot de.

(The readLangSysTable problem should be solved in 2.0.17, so make sure you are using that one.)

Oops, I see this is the Tika list, so maybe that is a lower version. Please retry with a "freshly downloaded" PDFDebugger from the PDFBox website.


Tilman

On 02.12.2019 at 16:42, Slava G wrote:

I have a PDF that reproduces a similar problem:

java.lang.OutOfMemoryError: Java heap space
    at org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)
    at org.apache.fontbox.ttf.GlyphSubstitutionTable.readScriptTable(GlyphSubstitutionTable.java:129)
    at org.apache.fontbox.ttf.GlyphSubstitutionTable.readScriptList(GlyphSubstitutionTable.java:98)
    at org.apache.fontbox.ttf.GlyphSubstitutionTable.read(GlyphSubstitutionTable.java:78)
    at org.apache.fontbox.ttf.TrueTypeFont.readTable(TrueTypeFont.java:353)
    at org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:173)
    at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:150)
    at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:106)
    at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.<init>(PDTrueTypeFont.java:198)
    at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:75)
    at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:146)
    at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:60)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:869)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:505)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:479)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:152)
    at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
    at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
    at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:153)
    at org.apache.tika.parser.pdf.AbstractPDF2XHTML.processPages(AbstractPDF2XHTML.java:835)
    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:124)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:172)

To whom can I send the PDF for investigation? (It's from a customer, so I can't share it publicly.)



Thanks

On Sun, Nov 3, 2019 at 12:10 PM Slava G wrote:


    Well, it's not easy to provide those documents, as they are
    customer content and I need the customer's approval to share them.
    I'll try and will let you know.
    Thanks

    On Sun, Nov 3, 2019 at 11:45 AM Tilman Hausherr wrote:

        Hello,

        I'd be interested in the OOM exception. The one below aborts the
        parsing. Can you open a PDFBox issue and attach your PDF? We
        could just skip the table here instead of failing.

        Re the OOM we'd also need a PDF.

        Skipping parsing of embedded TTF will possibly have a negative
        impact on text extraction.

        Tilman
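
The idea of skipping a broken table instead of failing the whole font can be illustrated with a minimal sketch. This is not FontBox code: the `TableReader` interface, the `readAll` method and the table tags are hypothetical stand-ins, and the real parser may handle this differently. The sketch only shows the general pattern of catching the `IOException` per table and continuing with the remaining ones.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SkipBrokenTables {

    /** Hypothetical stand-in for a font table reader; not a FontBox type. */
    interface TableReader {
        void read() throws IOException;
    }

    /**
     * Reads every table in order and returns the tags of those that parsed.
     * A table that throws IOException is logged and skipped, so one
     * malformed table no longer aborts the whole font.
     */
    static List<String> readAll(Map<String, TableReader> tables) {
        List<String> loaded = new ArrayList<>();
        for (Map.Entry<String, TableReader> entry : tables.entrySet()) {
            try {
                entry.getValue().read();
                loaded.add(entry.getKey());
            } catch (IOException ex) {
                System.err.println("Skipping table '" + entry.getKey()
                        + "': " + ex.getMessage());
            }
        }
        return loaded;
    }

    public static void main(String[] args) {
        // Illustrative tables only; the "kern" reader simulates the failure
        // reported in this thread.
        Map<String, TableReader> tables = new LinkedHashMap<>();
        tables.put("cmap", () -> { });
        tables.put("kern", () -> {
            throw new IOException(
                    "Kerning sub-table too short, got 0 bytes, expect 6 or more.");
        });
        tables.put("glyf", () -> { });

        System.out.println(readAll(tables));
    }
}
```

In this sketch the kerning table is dropped while the other tables stay available. Whether that is acceptable depends on the table: losing kerning data is usually harmless for text extraction, while losing a table that maps glyphs to characters is not, which is the concern about a negative impact on text extraction raised above.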


        On 03.11.2019 at 10:38, Slava G wrote:
        > Hi,
        > When parsing some PDF files we see different errors. One is an
        > OutOfMemoryError during PDF parsing, and another is:
        >
        > WARN      - Could not read embedded TTF for font ABCDEE+Segoe UI,BoldItalic
        > java.io.IOException: Kerning sub-table too short, got 0 bytes, expect 6 or more.
        >     at org.apache.fontbox.ttf.KerningSubtable.readSubtable0(KerningSubtable.java:191)
        >     at org.apache.fontbox.ttf.KerningSubtable.read(KerningSubtable.java:70)
        >     at org.apache.fontbox.ttf.KerningTable.read(KerningTable.java:80)
        >     at org.apache.fontbox.ttf.TrueTypeFont.readTable(TrueTypeFont.java:353)
        >     at org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:173)
        >     at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:150)
        >     at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:106)
        >     at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.<init>(PDTrueTypeFont.java:198)
        >     at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:75)
        >     at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:146)
        >     at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:60)
        >     at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:869)
        >     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:505)
        >     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:479)
        >     at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:152)
        >     at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
        >     at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
        >     at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:153)
        >     at org.apache.tika.parser.pdf.AbstractPDF2XHTML.processPages(AbstractPDF2XHTML.java:835)
        >     at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
        >     at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:124)
        >     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:172)
        >
        > How can I skip parsing of an embedded TTF inside a PDF?
        >
        > Thanks
