[jira] [Comment Edited] (PDFBOX-3710) Text Stripper in 2.0 lost some texts - regression

Roman (JIRA) Mon, 06 Mar 2017 02:40:23 -0800

    [ 
https://issues.apache.org/jira/browse/PDFBOX-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15897067#comment-15897067
 ]


Roman edited comment on PDFBOX-3710 at 3/6/17 10:38 AM:
--------------------------------------------------------

OK, I found an ugly solution: I overrode the whole *showGlyph()* method from 
the *LegacyPDFStreamEngine* class. (I also had to copy 4 more class member 
fields, the function that calculates them, and a constructor.) So this 
solution has some performance overhead and a lot of copy-pasting. Yet it is 
intended to do very little: just to avoid the early return in this piece of 
code:

{code}
        // when there is no Unicode mapping available, Acrobat simply coerces the character code
        // into Unicode, so we do the same. Subclasses of PDFStreamEngine don't necessarily want
        // this, which is why we leave it until this point in PDFTextStreamEngine.
        if (unicode == null)
        {
            if (font instanceof PDSimpleFont)
            {
                char c = (char) code;
                unicode = new String(new char[] { c });
            }
            else
            {
                // Acrobat doesn't seem to coerce composite font's character codes, instead it
                // skips them. See the "allah2.pdf" TestTextStripper file.
                return;
            }
        }
{code}

now changed to:

{code}
        // when there is no Unicode mapping available, Acrobat simply coerces the character code
        // into Unicode, so we do the same. Subclasses of PDFStreamEngine don't necessarily want
        // this, which is why we leave it until this point in PDFTextStreamEngine.
        if (unicode == null)
        {
//            if (font instanceof PDSimpleFont)
//            {
                char c = (char) code;
                unicode = new String(new char[] { c });
//            }
//            else
//            {
//                // Acrobat doesn't seem to coerce composite font's character codes, instead it
//                // skips them. See the "allah2.pdf" TestTextStripper file.
//
//                return;
//            }
        }
{code}
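
For reference, the coercion itself can be tried in isolation. This is only a minimal sketch: the names CoerceDemo and coerce are made up for illustration and are not part of PDFBox, and it assumes the character code is an int. It shows that the cast keeps only the low 16 bits of the code, so for a composite font the resulting character is not guaranteed to be the right one:

```java
// Illustrative only: CoerceDemo and coerce are hypothetical names, not PDFBox API.
class CoerceDemo
{
    // Mirrors the two lines from the snippet above: treat the character code
    // as a single UTF-16 code unit and wrap it in a String.
    static String coerce(int code)
    {
        char c = (char) code;
        return new String(new char[] { c });
    }

    public static void main(String[] args)
    {
        // Code 0x41 is coerced to the letter "A".
        System.out.println(coerce(0x41));
        // The int-to-char cast drops everything above 16 bits: prints 65 (0x41).
        System.out.println((int) coerce(0x10041).charAt(0));
    }
}
```

So the changed version always yields exactly one char per glyph, which is enough to get a text position back, but the value is only the raw code and may not be meaningful text.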

My only remaining question: could you tweak the LegacyPDFStreamEngine class to 
be more flexible? For example, we could add a new public overridable boolean 
method *deepLegacy*, used like this:

{code}
        // when there is no Unicode mapping available, Acrobat simply coerces the character code
        // into Unicode, so we do the same. Subclasses of PDFStreamEngine don't necessarily want
        // this, which is why we leave it until this point in PDFTextStreamEngine.
        if (unicode == null)
        {
            if (deepLegacy() || font instanceof PDSimpleFont)
            {
                char c = (char) code;
                unicode = new String(new char[] { c });
            }
            else
            {
                // Acrobat doesn't seem to coerce composite font's character codes, instead it
                // skips them. See the "allah2.pdf" TestTextStripper file.
                return;
            }
        }
{code}
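
To make the idea concrete, here is a rough, self-contained sketch of how such a hook could be used from a subclass. Everything in it is an assumption for illustration: SketchEngine is a simplified stand-in and not the real LegacyPDFStreamEngine, the boolean simpleFont replaces the "font instanceof PDSimpleFont" check, and deepLegacy() is only the method proposed above, which does not exist in PDFBox:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for LegacyPDFStreamEngine (hypothetical, not PDFBox code).
class SketchEngine
{
    // Collects the text of every glyph that was not skipped.
    protected final List<String> shown = new ArrayList<>();

    // Proposed hook: the default keeps the current 2.0 behaviour.
    public boolean deepLegacy()
    {
        return false;
    }

    // simpleFont stands in for "font instanceof PDSimpleFont".
    void showGlyph(String unicode, int code, boolean simpleFont)
    {
        if (unicode == null)
        {
            if (deepLegacy() || simpleFont)
            {
                char c = (char) code;
                unicode = new String(new char[] { c });
            }
            else
            {
                // Composite font without a Unicode mapping: glyph is skipped.
                return;
            }
        }
        shown.add(unicode);
    }
}

// A subclass opts in to coercion for all fonts by overriding one method.
class DeepLegacyEngine extends SketchEngine
{
    @Override
    public boolean deepLegacy()
    {
        return true;
    }
}
```

With this shape, a subclass that wants the old behaviour overrides a single method instead of copying the whole of showGlyph(), and the default result stays unchanged for everyone else.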


was (Author: rmakarov):
OK, I found an ugly solution: I overrode the whole *showGlyph()* method from 
the *LegacyPDFStreamEngine* class. (I also had to copy 4 more class member 
fields, the function that calculates them, and a constructor.) So this 
solution has some performance overhead and a lot of copy-pasting. Yet it is 
intended to do very little: just to avoid the early return in this piece of 
code:

{code}
        // when there is no Unicode mapping available, Acrobat simply coerces the character code
        // into Unicode, so we do the same. Subclasses of PDFStreamEngine don't necessarily want
        // this, which is why we leave it until this point in PDFTextStreamEngine.
        if (unicode == null)
        {
            if (font instanceof PDSimpleFont)
            {
                char c = (char) code;
                unicode = new String(new char[] { c });
            }
            else
            {
                // Acrobat doesn't seem to coerce composite font's character codes, instead it
                // skips them. See the "allah2.pdf" TestTextStripper file.
                return;
            }
        }
{code}

now changed to:

{code}
        // when there is no Unicode mapping available, Acrobat simply coerces the character code
        // into Unicode, so we do the same. Subclasses of PDFStreamEngine don't necessarily want
        // this, which is why we leave it until this point in PDFTextStreamEngine.
        if (unicode == null)
        {
//            if (font instanceof PDSimpleFont)
//            {
                char c = (char) code;
                unicode = new String(new char[] { c });
//            }
//            else
//            {
//                // Acrobat doesn't seem to coerce composite font's character codes, instead it
//                // skips them. See the "allah2.pdf" TestTextStripper file.
//
//                return;
//            }
        }
{code}

My only remaining question: could you tweak the LegacyPDFStreamEngine class to 
be more flexible? For example, we could add a new boolean method *deepLegacy*, 
used like this:

{code}
        // when there is no Unicode mapping available, Acrobat simply coerces the character code
        // into Unicode, so we do the same. Subclasses of PDFStreamEngine don't necessarily want
        // this, which is why we leave it until this point in PDFTextStreamEngine.
        if (unicode == null)
        {
            if (deepLegacy() || font instanceof PDSimpleFont)
            {
                char c = (char) code;
                unicode = new String(new char[] { c });
            }
            else
            {
                // Acrobat doesn't seem to coerce composite font's character codes, instead it
                // skips them. See the "allah2.pdf" TestTextStripper file.
                return;
            }
        }
{code}

> Text Stripper in 2.0 lost some texts - regression
> -------------------------------------------------
>
>                 Key: PDFBOX-3710
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3710
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>            Reporter: Roman
>         Attachments: highlight19.pdf_page1-marked-1.png, 
> highlight19.pdf_page1.pdf, regression_in_blue.png
>
>
> After migrating our app from PDFBox 1.8 to 2.0, we noticed a regression: 4 
> lines of text have disappeared. These are the 3 lines of text marked with a 
> black bullet and also the word "OVERALL", which is placed above in the table.
> Problematic PDF attached - [^highlight19.pdf_page1.pdf]
> Also attached is the result of the 
> [DrawPrintTextLocations|https://apache.googlesource.com/pdfbox/+/trunk/examples/src/main/java/org/apache/pdfbox/examples/util/DrawPrintTextLocations.java]
>  example - 
> [highlight19.pdf_page1-marked-1.png|https://issues.apache.org/jira/secure/attachment/12856229/highlight19.pdf_page1-marked-1.png]
> Notice that the Unicode values and the red and blue boxes are missing for the 
> problematic text. The main problem is that these glyphs are absent from the 
> *textPositions* parameter that is passed to the *writeString* function, line 
> #275. In version 1.8 these characters ARE present, so their positions along 
> with their char codes could be extracted fine in our app.
> Also attached is a picture of the regression in our app - [^regression_in_blue.png]. 
> Here, blue boxes are drawn where text WAS present and disappeared afterwards. 
> (The purple boxes are OK and should be ignored.)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

