Re: How to skip parsing embedded TTF inside PDF

Tilman Hausherr Mon, 02 Dec 2019 23:39:17 -0800

Am 03.12.2019 um 07:16 schrieb Slava G:

Hi,
I've tried to run PDFDebugger from the latest PDFBox. What is the normal expected result? In my case it just hung after printing:

Dec 03, 2019 7:58:51 AM org.apache.pdfbox.rendering.PDFRenderer suggestKCMS
INFO: use the option -Dsun.java2d.cmm=sun.java2d.cmm.kcms.KcmsServiceProvider
Dec 03, 2019 7:58:51 AM org.apache.pdfbox.rendering.PDFRenderer suggestKCMS
INFO: or call System.setProperty("sun.java2d.cmm", "sun.java2d.cmm.kcms.KcmsServiceProvider")

That means you're running an older Java version, and not using this option results in low speed. In newer Java versions (1.8 update 192 or later) it is no longer needed. But that is not related to your problems.
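
For reference, the programmatic variant named in the log message could look roughly like the sketch below. The class name KcmsSetup and the method enableKcmsIfUnset are hypothetical and only serve the illustration; the property key and value are taken from the log output. The call should presumably run early at startup, before any rendering takes place, and it is only of interest on the older Java versions mentioned above.

```java
// Minimal sketch: set the KCMS color management property unless the
// user already chose a value (for example with -Dsun.java2d.cmm=...).
// KcmsSetup and enableKcmsIfUnset are hypothetical names.
final class KcmsSetup {
    static final String CMM_KEY = "sun.java2d.cmm";
    static final String KCMS = "sun.java2d.cmm.kcms.KcmsServiceProvider";

    // Returns true if the property was set by this call,
    // false if a value was already present and left untouched.
    static boolean enableKcmsIfUnset() {
        if (System.getProperty(CMM_KEY) == null) {
            System.setProperty(CMM_KEY, KCMS);
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        boolean changed = enableKcmsIfUnset();
        System.out.println(CMM_KEY + "=" + System.getProperty(CMM_KEY)
                + (changed ? " (set now)" : " (already set)"));
    }
}
```

Passing -Dsun.java2d.cmm=sun.java2d.cmm.kcms.KcmsServiceProvider on the command line, as the first log line suggests, has the same effect without a code change.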


Tilman


Thanks

On Mon, Dec 2, 2019 at 11:13 PM Tilman Hausherr <[email protected]> wrote:


    Send it to me, tilman at snafu dot de.

    (The readLangSysTable problem should be solved in 2.0.17, so make
    sure you are using that one)

    Oops, I see this is the Tika list, so maybe that is a lower
    version. Please retry with a freshly downloaded PDFDebugger from
    the PDFBox website.

    Tilman

    Am 02.12.2019 um 16:42 schrieb Slava G:

    I have a PDF that reproduces a similar problem:

    java.lang.OutOfMemoryError: Java heap space
        at org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)
        at org.apache.fontbox.ttf.GlyphSubstitutionTable.readScriptTable(GlyphSubstitutionTable.java:129)
        at org.apache.fontbox.ttf.GlyphSubstitutionTable.readScriptList(GlyphSubstitutionTable.java:98)
        at org.apache.fontbox.ttf.GlyphSubstitutionTable.read(GlyphSubstitutionTable.java:78)
        at org.apache.fontbox.ttf.TrueTypeFont.readTable(TrueTypeFont.java:353)
        at org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:173)
        at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:150)
        at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:106)
        at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.<init>(PDTrueTypeFont.java:198)
        at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:75)
        at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:146)
        at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:60)
        at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:869)
        at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:505)
        at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:479)
        at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:152)
        at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
        at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
        at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:153)
        at org.apache.tika.parser.pdf.AbstractPDF2XHTML.processPages(AbstractPDF2XHTML.java:835)
        at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
        at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:124)
        at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:172)


    To whom can I send the PDF for investigation? (It's from a
    customer, so I can't send it publicly.)


    Thanks


    On Sun, Nov 3, 2019 at 12:10 PM Slava G <[email protected]> wrote:

        Well, it's not easy to provide those documents, as they're
        customer content and I need the customer's approval for that.
        I'll try, and will let you know.
        Thanks

        On Sun, Nov 3, 2019 at 11:45 AM Tilman Hausherr <[email protected]> wrote:

            Hello,

            I'd be interested in the OOM exception. The one below aborts
            the parsing. Can you open a PDFBox issue and attach your PDF?
            We could just skip the table here instead of failing.

            Re the OOM we'd also need a PDF.

            Skipping parsing of embedded ttf will possibly have a negative
            impact on text extraction.

            Tilman


            Am 03.11.2019 um 10:38 schrieb Slava G:
            > Hi,
            > When parsing some PDF files we see different errors.
            > One is an OutOfMemoryError during PDF parsing, and another is:
            >
            > WARN      - Could not read embedded TTF for font ABCDEE+Segoe UI,BoldItalic
            > java.io.IOException: Kerning sub-table too short, got 0 bytes, expect 6 or more.
            >     at org.apache.fontbox.ttf.KerningSubtable.readSubtable0(KerningSubtable.java:191)
            >     at org.apache.fontbox.ttf.KerningSubtable.read(KerningSubtable.java:70)
            >     at org.apache.fontbox.ttf.KerningTable.read(KerningTable.java:80)
            >     at org.apache.fontbox.ttf.TrueTypeFont.readTable(TrueTypeFont.java:353)
            >     at org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:173)
            >     at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:150)
            >     at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:106)
            >     at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.<init>(PDTrueTypeFont.java:198)
            >     at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:75)
            >     at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:146)
            >     at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:60)
            >     at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:869)
            >     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:505)
            >     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:479)
            >     at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:152)
            >     at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
            >     at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
            >     at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:153)
            >     at org.apache.tika.parser.pdf.AbstractPDF2XHTML.processPages(AbstractPDF2XHTML.java:835)
            >     at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
            >     at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:124)
            >     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:172)
            >
            > How can I skip parsing of the embedded TTF inside a PDF?
            >
            > Thanks
