Soft Hyphen character not mapped to hyphen in WinAnsiEncoding (and suggested
fix)
---------------------------------------------------------------------------------
Key: PDFBOX-371
URL: https://issues.apache.org/jira/browse/PDFBOX-371
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Affects Versions: 0.7.3
Environment: Java 1.5, OSX 10.5
Reporter: Robert Baruch
Priority: Minor
When text extraction is run on a PDF file that contains the soft hyphen
character in WinAnsiEncoding (octal 255, written 0255 in Java; decimal 173),
the text extractor incorrectly maps it to a space when it should map it to a
hyphen. As the PDF Reference 1.7 says in note 5 of Table D.1:
'The hyphen character is also encoded as 255 in WinAnsiEncoding. The meaning of
this duplicate code is "soft hyphen," but it is typographically the same as
hyphen.'
A soft hyphen is typographically the same as a hyphen because it marks a point
where a hyphen MAY be placed if necessary (i.e. when breaking a word across
lines). Since a PDF producer should only emit the soft hyphen at the end of a
line, where the word is actually broken, the option to place a hyphen has
already been taken, so the character should be extracted as a hyphen.
I think I've traced the substitution to Encoding.getName: WinAnsiEncoding has
no entry in the codeToName mapping for this code, so getName returns "space"
by default.
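To illustrate the suspected behaviour, here is a minimal sketch of a
code-to-name lookup with a "space" fallback. FallbackLookupSketch is a
hypothetical stand-in, not PDFBox's Encoding class; only the codeToName and
getName names and the "space" default are taken from the description above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for an encoding table; NOT PDFBox's actual Encoding class.
public class FallbackLookupSketch {
    private final Map<Integer, String> codeToName = new HashMap<>();

    public FallbackLookupSketch() {
        // Ordinary hyphen: octal 055 = decimal 45.
        codeToName.put(055, "hyphen");
        // Deliberately no entry for octal 0255 (decimal 173), mirroring the gap described above.
    }

    // Assumed behaviour: any code without a mapping falls back to "space".
    public String getName(int code) {
        String name = codeToName.get(code);
        return name != null ? name : "space";
    }

    public static void main(String[] args) {
        FallbackLookupSketch enc = new FallbackLookupSketch();
        System.out.println(enc.getName(055));   // prints "hyphen"
        System.out.println(enc.getName(0255));  // prints "space"
    }
}
```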
The fix is not as simple as adding an addCharacterEncoding( 0255,
COSName.getPDFName("hyphen")) call to WinAnsiEncoding, because that sets both
the codeToName mapping AND the nameToCode mapping, and the latter would
overwrite the existing nameToCode entry for "hyphen" (055).
Adding this line:
codeToName.put( new Integer(0255), COSName.getPDFName("hyphen"));
to the end of the WinAnsiEncoding constructor seems to fix the issue.
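The difference between the two approaches can be sketched with a pair of plain
maps. This is only an illustration: SoftHyphenMappingSketch and
addCodeToNameOnly are hypothetical names, and the two-map layout is assumed
from the description above rather than copied from the PDFBox source.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of a two-way encoding table; NOT PDFBox's actual classes.
public class SoftHyphenMappingSketch {
    final Map<Integer, String> codeToName = new HashMap<>();
    final Map<String, Integer> nameToCode = new HashMap<>();

    // Assumed behaviour of a two-way registration: both maps are updated.
    void addCharacterEncoding(int code, String name) {
        codeToName.put(code, name);
        nameToCode.put(name, code);
    }

    // Hypothetical one-way registration: only the decoding direction is touched.
    void addCodeToNameOnly(int code, String name) {
        codeToName.put(code, name);
    }

    public static void main(String[] args) {
        // Naive fix: registering 0255 both ways replaces the reverse entry for "hyphen".
        SoftHyphenMappingSketch naive = new SoftHyphenMappingSketch();
        naive.addCharacterEncoding(055, "hyphen");
        naive.addCharacterEncoding(0255, "hyphen");
        System.out.println(naive.nameToCode.get("hyphen")); // prints 173 (octal 0255)

        // One-way fix: 0255 decodes to "hyphen", while "hyphen" still encodes to 055.
        SoftHyphenMappingSketch fixed = new SoftHyphenMappingSketch();
        fixed.addCharacterEncoding(055, "hyphen");
        fixed.addCodeToNameOnly(0255, "hyphen");
        System.out.println(fixed.codeToName.get(0255));     // prints "hyphen"
        System.out.println(fixed.nameToCode.get("hyphen")); // prints 45 (octal 055)
    }
}
```

With the one-way entry, decoding treats both codes as a hyphen, while encoding
the name "hyphen" still yields the ordinary hyphen code.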