[
https://issues.apache.org/jira/browse/PDFBOX-3962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Roman updated PDFBOX-3962:
--------------------------
Description:
In the attached [^72083_qdf.pdf] file, the following text (large letters at the
top) is not extracted by PDFTextStripper:
{code}
AGGIE NIGHT
AT ENRON FIELD
FRIDAY, JUNE 15, 2001 at 7:05
HOUSTON ASTROS VS. TEXAS RANGERS
{code}
Extraction does not work well in Acrobat Reader either. However, some other PDF
viewers extract this text properly.
I also found a workaround that makes it work; see below.
1. Find this code block in *LegacyPDFStreamEngine.java*
{code}
if(unicode == null) {
    if(!(font instanceof PDSimpleFont)) {
        return;
    }
    char c = (char)code;
    unicode = new String(new char[]{c});
}
{code}
2. Insert this code block just before the one found in step 1.
{code}
if (unicode == null) {
    if (font instanceof PDType1CFont) {
        String name = ((PDType1CFont) font).codeToName(code);
        try {
            Method method =
                    PDType1CFont.class.getDeclaredMethod("readEncodingFromFont");
            method.setAccessible(true);
            Encoding encoding = (Encoding) method.invoke(font);
            Integer newCode = encoding.getNameToCodeMap().get(name);
            if (newCode != null && newCode.intValue() != 0) {
                unicode = new String(new char[]{(char) newCode.byteValue()});
            }
        } catch (NoSuchMethodException e) {
            e.printStackTrace();
        } catch (IllegalAccessException e) {
            e.printStackTrace();
        } catch (InvocationTargetException e) {
            e.printStackTrace();
        }
    }
}
{code}
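One detail of the inserted block may be worth checking: {{newCode.byteValue()}} is signed, so a code above 127 becomes a character in the U+FF80..U+FFFF range once it is cast to {{char}}. The sketch below is a standalone illustration and not PDFBox code. The class name, the method names and the sample name-to-code map are made up for the example, and treating the code as a Unicode value is itself an assumption that happens to hold for ASCII text such as the lines quoted above.

```java
import java.util.HashMap;
import java.util.Map;

// Standalone illustration; all names here are hypothetical and not part of PDFBox.
class EncodingFallbackSketch {

    // Mirrors the workaround: look the glyph name up in a name-to-code map
    // and cast the code through byteValue(), which is signed.
    static String viaByteValue(String glyphName, Map<String, Integer> nameToCode) {
        Integer newCode = nameToCode.get(glyphName);
        if (newCode != null && newCode.intValue() != 0) {
            return new String(new char[]{(char) newCode.byteValue()});
        }
        return null; // no fallback available
    }

    // Same lookup, but casts the full int value so codes 128..255 stay intact.
    static String viaIntValue(String glyphName, Map<String, Integer> nameToCode) {
        Integer newCode = nameToCode.get(glyphName);
        if (newCode != null && newCode.intValue() != 0) {
            return String.valueOf((char) newCode.intValue());
        }
        return null; // no fallback available
    }

    public static void main(String[] args) {
        // Sample data only; a real map would come from the font's built-in encoding.
        Map<String, Integer> nameToCode = new HashMap<>();
        nameToCode.put("A", 65);       // ASCII range: both casts agree
        nameToCode.put("eacute", 233); // above 127: the casts differ

        for (String name : new String[]{"A", "eacute"}) {
            System.out.printf("%s: byteValue -> U+%04X, intValue -> U+%04X%n",
                    name,
                    (int) viaByteValue(name, nameToCode).charAt(0),
                    (int) viaIntValue(name, nameToCode).charAt(0));
        }
        // Prints:
        // A: byteValue -> U+0041, intValue -> U+0041
        // eacute: byteValue -> U+FFE9, intValue -> U+00E9
    }
}
```

For the sample file this makes no difference, since the missing text is plain ASCII, but a font with codes above 127 would be affected by the {{byteValue()}} cast.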
was:
From the attached [^72083_qdf.pdf] file, this text (big letters on the top) is
not extracted using PDFTextStripper:
{code}
AGGIE NIGHT
AT ENRON FIELD
FRIDAY, JUNE 15, 2001 at 7:05
HOUSTON ASTROS VS. TEXAS RANGERS
{code}
It does not work well in Acrobat Reader also. But, in the same time, it can be
extracted properly by some PDF viewers.
In the same time, it i
LegacyPDFStreamEngine.java
{code}
if(unicode == null) {
    if(!(font instanceof PDSimpleFont)) {
        return;
    }
    char c = (char)code;
    unicode = new String(new char[]{c});
}
{code}
{code}
if (unicode == null) {
    if (font instanceof PDType1CFont) {
        String name = ((PDType1CFont) font).codeToName(code);
        try {
            Method method =
                    PDType1CFont.class.getDeclaredMethod("readEncodingFromFont");
            method.setAccessible(true);
            Encoding encoding = (Encoding) method.invoke(font);
            Integer newCode = encoding.getNameToCodeMap().get(name);
            //unicode = glyphList.codePointToName(newCode);
            if (newCode != null && newCode.intValue() != 0) {
                unicode = new String(new char[]{(char) newCode.byteValue()});
            }
        } catch (NoSuchMethodException e) {
            e.printStackTrace();
        } catch (IllegalAccessException e) {
            e.printStackTrace();
        } catch (InvocationTargetException e) {
            e.printStackTrace();
        }
    }
}
{code}
> No unicode mapping / Text not extracting
> ----------------------------------------
>
> Key: PDFBOX-3962
> URL: https://issues.apache.org/jira/browse/PDFBOX-3962
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Reporter: Roman
> Attachments: 72083_qdf.pdf
>
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)