[ 
https://issues.apache.org/jira/browse/PDFBOX-371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr closed PDFBOX-371.
----------------------------------

    Resolution: Duplicate

I believe this has been solved by PDFBOX-1713 and PDFBOX-1357, which apply a 
similar fix. If it still doesn't work for you, please reopen the issue, attach 
a PDF, and explain what doesn't work.

> Soft Hyphen character not mapped to hyphen in WinAnsiEncoding (and suggested 
> fix)
> ---------------------------------------------------------------------------------
>
>                 Key: PDFBOX-371
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-371
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 0.7.3
>         Environment: Java 1.5, OSX 10.5
>            Reporter: Robert Baruch
>            Priority: Minor
>
> When running text extraction on a PDF file that contains the soft hyphen 
> character in WinAnsiEncoding (octal 0255, decimal 173), the text extractor 
> incorrectly maps it to a space when it should map it to a hyphen. As the PDF 
> Reference 1.7 says in note 5 of table D.1:
> 'The hyphen character is also encoded as 255 in WinAnsiEncoding. The meaning 
> of this duplicate code is "soft hyphen," but it is typographically the same 
> as hyphen.'
> A soft hyphen is typographically the same as a hyphen because it marks a 
> position where a hyphen MAY be placed if necessary (i.e. when breaking a 
> word across lines). Since a PDF producer should only emit the soft hyphen at 
> the end of a line to break a word, it stands to reason that the option to 
> place a hyphen has been taken.
> I think I've traced the substitution to Encoding.getName: WinAnsiEncoding 
> has no entry for this code in its codeToName mapping, so getName returns the 
> default, "space".
> The fix is not as simple as adding an addCharacterEncoding( 0255, 
> COSName.getPDFName("hyphen")) to WinAnsiEncoding, because that sets both 
> the codeToName mapping AND the nameToCode mapping, which would overwrite the 
> existing nameToCode mapping of "hyphen" to 055.
> Adding this line:
> codeToName.put( new Integer(0255), COSName.getPDFName("hyphen"));
> to the end of the WinAnsiEncoding constructor seems to fix the issue.
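
The two-map problem described in the report can be illustrated with a minimal, 
self-contained sketch. This is not the PDFBox source: the class 
SoftHyphenSketch and the method addCodeOnly are hypothetical, plain strings 
stand in for COSName, and the "space" default is taken from the report's 
description of Encoding.getName.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the encoding tables described in the report.
// Plain strings replace COSName so the sketch has no PDFBox dependency.
class SoftHyphenSketch {
    final Map<Integer, String> codeToName = new HashMap<>();
    final Map<String, Integer> nameToCode = new HashMap<>();

    // Two-way registration, as the report describes addCharacterEncoding:
    // both lookup directions are updated.
    void addCharacterEncoding(int code, String name) {
        codeToName.put(code, name);
        nameToCode.put(name, code);
    }

    // One-way registration (the suggested fix): only code -> name is added,
    // so an existing name -> code entry is left untouched.
    void addCodeOnly(int code, String name) {
        codeToName.put(code, name);
    }

    // Lookup with the "space" default that the report attributes to getName.
    String getName(int code) {
        return codeToName.getOrDefault(code, "space");
    }

    public static void main(String[] args) {
        // Naive approach: registering 0255 both ways clobbers hyphen -> 055.
        SoftHyphenSketch naive = new SoftHyphenSketch();
        naive.addCharacterEncoding(055, "hyphen");
        naive.addCharacterEncoding(0255, "hyphen");
        System.out.println(naive.nameToCode.get("hyphen")); // 173 (octal 0255)

        // Suggested fix: 0255 resolves to "hyphen", and hyphen still encodes as 055.
        SoftHyphenSketch fixed = new SoftHyphenSketch();
        fixed.addCharacterEncoding(055, "hyphen");
        fixed.addCodeOnly(0255, "hyphen");
        System.out.println(fixed.getName(0255));            // hyphen
        System.out.println(fixed.nameToCode.get("hyphen")); // 45 (octal 055)
    }
}
```

The one-way registration keeps encoding (name to code) stable while still 
letting extraction (code to name) resolve the soft hyphen to a hyphen.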



--
This message was sent by Atlassian JIRA
(v6.2#6252)
