[jira] [Commented] (PDFBOX-3962) No unicode mapping / Text not extracting

Tilman Hausherr (JIRA) Thu, 12 Oct 2017 09:26:25 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-3962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16202190#comment-16202190
 ]


Tilman Hausherr commented on PDFBOX-3962:
-----------------------------------------

Even Adobe Reader isn't able to extract that one. The glyph names are 
non-standard. Your workaround will not work in JDK 9 (or will need additional 
command-line options) because it is kind of "hacky".

What also works is a change in the source code that I haven't made because 
there is no guarantee that it will work for all files. In 
PDSimpleFont.toUnicode() change this part (the first 4 lines exist):
{code}
            unicode = unicodeGlyphList.toUnicode(name);
            if (unicode != null)
            {
                return unicode;
            }
            // can't remember what issue
            if (name.matches("C\\d\\d\\d\\d"))
            {
                unicode = new String(new byte[]{ (byte) Integer.parseInt(name.substring(1)) });
                return unicode;
            }
            // PDFBOX-3962
            if (name.matches("G[A-F0-9][A-F0-9]"))
            {
                unicode = new String(new byte[]{ (byte) Integer.parseInt(name.substring(1), 16) });
                return unicode;
            }
{code}
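
For anyone who wants to try the name patterns outside of PDFBox first, here is a minimal standalone sketch of the two fallbacks above. The class name GlyphNameFallback and the method fallbackToUnicode are hypothetical and not part of PDFBox. One deliberate difference from the snippet above: it decodes the byte as ISO-8859-1 instead of the platform default charset, which is my assumption so that the result is the same on every machine; it may or may not be the right mapping for a given font.

```java
import java.nio.charset.StandardCharsets;

public class GlyphNameFallback
{
    /**
     * Hypothetical helper (not a PDFBox API): derives text from a
     * non-standard glyph name, or returns null if no pattern matches.
     */
    public static String fallbackToUnicode(String name)
    {
        if (name == null)
        {
            return null;
        }
        // "C" followed by four decimal digits, e.g. "C0065" -> code 65
        if (name.matches("C\\d\\d\\d\\d"))
        {
            int code = Integer.parseInt(name.substring(1));
            // Assumption: decode as ISO-8859-1 rather than the default charset
            return new String(new byte[]{ (byte) code }, StandardCharsets.ISO_8859_1);
        }
        // "G" followed by two uppercase hex digits, e.g. "G41" -> code 0x41
        if (name.matches("G[A-F0-9][A-F0-9]"))
        {
            int code = Integer.parseInt(name.substring(1), 16);
            return new String(new byte[]{ (byte) code }, StandardCharsets.ISO_8859_1);
        }
        return null;
    }

    public static void main(String[] args)
    {
        System.out.println(fallbackToUnicode("G41"));   // prints "A"
        System.out.println(fallbackToUnicode("C0065")); // prints "A"
        System.out.println(fallbackToUnicode("Gxy"));   // prints "null"
    }
}
```

As with the change above, there is no guarantee that these name patterns carry the same meaning in other files.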


> No unicode mapping / Text not extracting
> ----------------------------------------
>
>                 Key: PDFBOX-3962
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3962
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>            Reporter: Roman
>         Attachments: 72083_qdf.pdf
>
>
> From the attached [^72083_qdf.pdf] file, this text (big letters on the top) 
> is not extracted using PDFTextStripper:
> {code}
> AGGIE NIGHT
> AT ENRON FIELD
> FRIDAY, JUNE 15, 2001 at 7:05
> HOUSTON ASTROS VS. TEXAS RANGERS
> {code}
> It does not work well in Acrobat Reader either. However, some PDF viewers 
> extract it properly.
> I also found a workaround that makes it work; see below.
> 1. Find this code block in *LegacyPDFStreamEngine.java*
> {code}
>         if(unicode == null) {
>             if(!(font instanceof PDSimpleFont)) {
>                 return;
>             }
>             char c = (char)code;
>             unicode = new String(new char[]{c});
>         }
> {code}
> 2. Insert this code block just before the one found in step 1.
> {code}
>         if (unicode == null) {
>             if (font instanceof PDType1CFont) {
>                 String name = ((PDType1CFont) font).codeToName(code);
>                 try {
>                     Method method = PDType1CFont.class.getDeclaredMethod("readEncodingFromFont");
>                     method.setAccessible(true);
>                     Encoding encoding = (Encoding) method.invoke(font);
>                     Integer newCode = encoding.getNameToCodeMap().get(name);
>                     if (newCode != null && newCode.intValue() != 0) {
>                         unicode = new String(new char[]{(char) newCode.byteValue()});
>                     }
>                 } catch (NoSuchMethodException e) {
>                     e.printStackTrace();
>                 } catch (IllegalAccessException e) {
>                     e.printStackTrace();
>                 } catch (InvocationTargetException e) {
>                     e.printStackTrace();
>                 }
>             }
>         }
> {code}


