[jira] [Updated] (PDFBOX-3962) No unicode mapping / Text not extracting

Roman (JIRA) Thu, 12 Oct 2017 02:41:24 -0700

     [ https://issues.apache.org/jira/browse/PDFBOX-3962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Roman updated PDFBOX-3962:
--------------------------
    Description: 
In the attached [^72083_qdf.pdf] file, the following text (the large letters at 
the top) is not extracted by PDFTextStripper:
{code}
AGGIE NIGHT
AT ENRON FIELD
FRIDAY, JUNE 15, 2001 at 7:05
HOUSTON ASTROS VS. TEXAS RANGERS
{code}

Acrobat Reader does not extract it properly either. However, some other PDF 
viewers extract it correctly.

I also found a workaround that makes the extraction work; see below.

1. Find this code block in *LegacyPDFStreamEngine.java*:
{code}
        if(unicode == null) {
            if(!(font instanceof PDSimpleFont)) {
                return;
            }
            char c = (char)code;
            unicode = new String(new char[]{c});
        }
{code}

2. Insert this code block just before the one found in step 1:

{code}
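        // Relies on java.lang.reflect.Method, java.lang.reflect.InvocationTargetException,
        // PDType1CFont and org.apache.pdfbox.pdmodel.font.encoding.Encoding being imported.
        if (unicode == null) {
            if (font instanceof PDType1CFont) {
                // Glyph name that the font assigns to this character code.
                String name = ((PDType1CFont) font).codeToName(code);
                try {
                    // readEncodingFromFont() is not public, so it is called via reflection.
                    Method method = PDType1CFont.class.getDeclaredMethod("readEncodingFromFont");
                    method.setAccessible(true);
                    Encoding encoding = (Encoding) method.invoke(font);
                    // Look the glyph name up in the encoding read from the font itself.
                    Integer newCode = encoding.getNameToCodeMap().get(name);
                    if (newCode != null && newCode.intValue() != 0) {
                        // intValue() instead of byteValue(): a byte is sign-extended when
                        // cast to char, which would corrupt codes above 127.
                        unicode = new String(new char[]{(char) newCode.intValue()});
                    }
                } catch (NoSuchMethodException e) {
                    e.printStackTrace();
                } catch (IllegalAccessException e) {
                    e.printStackTrace();
                } catch (InvocationTargetException e) {
                    e.printStackTrace();
                }
            }
        }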
{code}
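
The reflection step of the workaround can be tried in isolation. The following is only a minimal sketch that runs without PDFBox: {{DemoFont}}, its two maps and {{fallbackUnicode}} are hypothetical stand-ins invented for illustration, not PDFBox classes or methods. It assumes the same idea as the workaround, namely that the glyph name obtained for a code can be looked up in an encoding that is only reachable through a non-public method.

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.util.HashMap;
import java.util.Map;

class ReflectionFallbackSketch {

    /** Hypothetical stand-in for a font class; this is not a PDFBox type. */
    static class DemoFont {
        // Character code used in the content stream -> glyph name.
        private final Map<Integer, String> codeToName = new HashMap<>();
        // Glyph name -> code in the (made-up) built-in encoding.
        private final Map<String, Integer> builtIn = new HashMap<>();

        DemoFont() {
            codeToName.put(1, "A");
            codeToName.put(2, "G");
            builtIn.put("A", 65);
            builtIn.put("G", 71);
        }

        String codeToName(int code) {
            String name = codeToName.get(code);
            return name != null ? name : ".notdef";
        }

        // Private on purpose; the sketch reaches it through reflection
        // only to mirror the shape of the workaround.
        private Map<String, Integer> readEncodingFromFont() {
            return builtIn;
        }
    }

    /** Returns a one-character string for the code, or null if no mapping is found. */
    static String fallbackUnicode(DemoFont font, int code) {
        String name = font.codeToName(code);
        try {
            Method method = DemoFont.class.getDeclaredMethod("readEncodingFromFont");
            method.setAccessible(true);
            @SuppressWarnings("unchecked")
            Map<String, Integer> nameToCode = (Map<String, Integer>) method.invoke(font);
            Integer newCode = nameToCode.get(name);
            if (newCode != null && newCode != 0) {
                // Cast through int so that codes above 127 are not sign-extended.
                return String.valueOf((char) newCode.intValue());
            }
        } catch (NoSuchMethodException | IllegalAccessException | InvocationTargetException e) {
            // Reflection failed; fall through and report "no mapping".
        }
        return null;
    }

    public static void main(String[] args) {
        DemoFont font = new DemoFont();
        System.out.println(fallbackUnicode(font, 1));   // prints "A"
        System.out.println(fallbackUnicode(font, 99));  // prints "null"
    }
}
```

Calling a non-public method this way depends on an implementation detail that may be renamed or removed, so it is probably only suitable as a local experiment rather than as a proposed patch.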

  was:
From the attached [^72083_qdf.pdf] file, this text (big letters on the top) is 
not extracted using PDFTextStripper:
{code}
AGGIE NIGHT
AT ENRON FIELD
FRIDAY, JUNE 15, 2001 at 7:05
HOUSTON ASTROS VS. TEXAS RANGERS
{code}

It does not work well in Acrobat Reader also. But, in the same time, it can be 
extracted properly by some PDF viewers.
In the same time, it i

LegacyPDFStreamEngine.java
{code}
        if(unicode == null) {
            if(!(font instanceof PDSimpleFont)) {
                return;
            }
            char c = (char)code;
            unicode = new String(new char[]{c});
        }
{code}

{code}
        if (unicode == null) {

            if (font instanceof PDType1CFont) {
                String name = ((PDType1CFont) font).codeToName(code);
                try {
                    Method method = PDType1CFont.class.getDeclaredMethod("readEncodingFromFont");
                    method.setAccessible(true);
                    Encoding encoding = (Encoding) method.invoke(font);
                    Integer newCode = encoding.getNameToCodeMap().get(name);
                    //unicode = glyphList.codePointToName(newCode);
                    if (newCode != null && newCode.intValue() != 0) {
                        unicode = new String(new char[]{(char) newCode.byteValue()});
                    }
                } catch (NoSuchMethodException e) {
                    e.printStackTrace();
                } catch (IllegalAccessException e) {
                    e.printStackTrace();
                } catch (InvocationTargetException e) {
                    e.printStackTrace();
                }
            }
        }

{code}


> No unicode mapping / Text not extracting
> ----------------------------------------
>
>                 Key: PDFBOX-3962
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3962
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>            Reporter: Roman
>         Attachments: 72083_qdf.pdf
>
>



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
