[ https://issues.apache.org/jira/browse/PDFBOX-371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr closed PDFBOX-371.
----------------------------------
    Resolution: Duplicate

I believe that this has been solved by PDFBOX-1713 and PDFBOX-1357, which has a similar solution. If it doesn't work for you, please reopen but attach a PDF and explain what doesn't work for you.

> Soft Hyphen character not mapped to hyphen in WinAnsiEncoding (and suggested fix)
> ---------------------------------------------------------------------------------
>
>                 Key: PDFBOX-371
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-371
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 0.7.3
>         Environment: Java 1.5, OSX 10.5
>            Reporter: Robert Baruch
>            Priority: Minor
>
> When running text extraction on a PDF file that contains the soft hyphen
> character in the WinAnsiEncoding (that is, 0255), the text extractor
> incorrectly maps this as a space, when it should be a hyphen. As the PDF
> Reference 1.7 says in note 5 of table D.1:
> 'The hyphen character is also encoded as 255 in WinAnsiEncoding. The meaning
> of this duplicate code is "soft hyphen," but it is typographically the same
> as hyphen.'
> The reason that a soft hyphen is typographically the same as hyphen is that a
> soft hyphen indicates that a hyphen MAY be placed here if necessary (i.e.
> breaking a word across lines). Since the soft hyphen should only be put, by
> the PDF producer, at the end of a line to break a word, it stands to reason
> that the option to place a hyphen must be taken.
> I think I've traced the reason for the substitution to Encoding.getName,
> where because there is no mapping in the codeToName mapping for this code in
> WinAnsiEncoding, by default it returns "space".
> The fix is not as simple as adding an addCharacterEncoding( 0255,
> COSName.getPDFName("hyphen")) to WinAnsiEncoding, because that will set both
> the codeToName mapping AND the nameToCode mapping, which will overwrite the
> 055 nameToCode mapping.
> Adding this line:
>     codeToName.put( new Integer(0255), COSName.getPDFName("hyphen"));
> to the end of the WinAnsiEncoding constructor seems to fix the issue.
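
The overwrite problem described in the report can be sketched in isolation. This is a minimal illustration and not PDFBox code: the class `EncodingSketch` and the method `addDecodeOnly` are hypothetical, glyph names are plain strings rather than `COSName` objects, and the "space" fallback is taken from the report's description of `Encoding.getName`.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for an encoding table; not PDFBox's actual class.
final class EncodingSketch {
    final Map<Integer, String> codeToName = new HashMap<>();
    final Map<String, Integer> nameToCode = new HashMap<>();

    // Registers a code in both directions, as the report describes
    // for addCharacterEncoding.
    void addCharacterEncoding(int code, String name) {
        codeToName.put(code, name);
        nameToCode.put(name, code);
    }

    // Registers a duplicate code for decoding only; the reverse
    // (name -> code) mapping is left untouched.
    void addDecodeOnly(int code, String name) {
        codeToName.put(code, name);
    }

    // Unmapped codes fall back to "space", the default the report
    // attributes to Encoding.getName.
    String getName(int code) {
        return codeToName.getOrDefault(code, "space");
    }

    public static void main(String[] args) {
        // Naive fix: the second call replaces the 055 entry in nameToCode.
        EncodingSketch naive = new EncodingSketch();
        naive.addCharacterEncoding(055, "hyphen");
        naive.addCharacterEncoding(0255, "hyphen");
        System.out.println("naive nameToCode(hyphen) = 0"
                + Integer.toOctalString(naive.nameToCode.get("hyphen")));

        // Suggested fix: only the decoding direction gains an entry.
        EncodingSketch fixed = new EncodingSketch();
        fixed.addCharacterEncoding(055, "hyphen");
        fixed.addDecodeOnly(0255, "hyphen");
        System.out.println("fixed getName(0255) = " + fixed.getName(0255));
        System.out.println("fixed nameToCode(hyphen) = 0"
                + Integer.toOctalString(fixed.nameToCode.get("hyphen")));
    }
}
```

The leading zero in `055` and `0255` makes them octal literals in Java, matching the octal codes used in the PDF Reference tables. With the decode-only entry, extraction maps 0255 to a hyphen while anything that looks up the code for "hyphen" still gets 055.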