[ https://issues.apache.org/jira/browse/PDFBOX-3962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Roman updated PDFBOX-3962:
--------------------------
    Description:
From the attached [^72083_qdf.pdf] file, this text (big letters at the top) is not extracted using PDFTextStripper:
{code}
AGGIE NIGHT
AT ENRON FIELD
FRIDAY, JUNE 15, 2001 at 7:05
HOUSTON ASTROS VS. TEXAS RANGERS
{code}
It is not extracted correctly in Acrobat Reader either. At the same time, some PDF viewers can extract it properly.

In the same time, it i

LegacyPDFStreamEngine.java
{code}
if(unicode == null) {
    if(!(font instanceof PDSimpleFont)) {
        return;
    }
    char c = (char)code;
    unicode = new String(new char[]{c});
}
{code}
{code}
if (unicode == null) {
    if (font instanceof PDType1CFont) {
        // Resolve the glyph name for this character code.
        String name = ((PDType1CFont) font).codeToName(code);
        try {
            // readEncodingFromFont is not public, so it is called via reflection.
            Method method = PDType1CFont.class.getDeclaredMethod("readEncodingFromFont");
            method.setAccessible(true);
            Encoding encoding = (Encoding) method.invoke(font);
            // Look the glyph name up in the font's built-in encoding.
            Integer newCode = encoding.getNameToCodeMap().get(name);
            //unicode = glyphList.codePointToName(newCode);
            if (newCode != null && newCode.intValue() != 0) {
                // Note: byteValue() sign-extends, so codes above 127 yield a char in the U+FF80..U+FFFF range.
                unicode = new String(new char[]{(char) newCode.byteValue()});
            }
        } catch (NoSuchMethodException e) {
            e.printStackTrace();
        } catch (IllegalAccessException e) {
            e.printStackTrace();
        } catch (InvocationTargetException e) {
            e.printStackTrace();
        }
    }
}
{code}

> No unicode mapping / Text not extracting
> ----------------------------------------
>
>                 Key: PDFBOX-3962
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3962
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>            Reporter: Roman
>         Attachments: 72083_qdf.pdf
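
The lookup chain used in the second snippet of the description (glyph name, then the code in the font's built-in encoding, then a character) can be illustrated without PDFBox. The following is only a minimal sketch under stated assumptions: `GlyphFallbackSketch`, `fallbackUnicode` and the sample map are hypothetical stand-ins and not part of PDFBox, and treating the built-in code as a character value is the reporter's heuristic, which holds only where that encoding coincides with ASCII/Latin-1. It converts with `intValue()` instead of `byteValue()`, because the latter sign-extends codes above 127.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, PDFBox-free illustration of the fallback idea from the description.
final class GlyphFallbackSketch {

    // nameToCode stands in for the map returned by Encoding.getNameToCodeMap().
    // Returns null when the glyph name is unknown or maps to code 0.
    static String fallbackUnicode(String glyphName, Map<String, Integer> nameToCode) {
        Integer newCode = nameToCode.get(glyphName);
        if (newCode == null || newCode.intValue() == 0) {
            return null;
        }
        // Assumption: the built-in code is used directly as a character value.
        return String.valueOf((char) newCode.intValue());
    }

    public static void main(String[] args) {
        Map<String, Integer> nameToCode = new HashMap<>();
        nameToCode.put("A", 65);        // sample entries, not taken from the attached PDF
        nameToCode.put("eacute", 233);

        System.out.println(fallbackUnicode("A", nameToCode));
        System.out.println(fallbackUnicode("eacute", nameToCode));
        System.out.println(fallbackUnicode("unknown", nameToCode));

        // The pitfall with byteValue(): 233 becomes -23, which casts to U+FFE9.
        char viaByte = (char) Integer.valueOf(233).byteValue();
        System.out.println(Integer.toHexString(viaByte));
    }
}
```

Keeping the lookup in a small pure function like this makes the null and zero-code cases easy to check in isolation before wiring anything into a stream engine.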