[jira] [Commented] (PDFBOX-371) Soft Hyphen character not mapped to hyphen in WinAnsiEncoding (and suggested fix)

Ramesh (JIRA) Fri, 08 Apr 2011 04:19:47 -0700

    [ https://issues.apache.org/jira/browse/PDFBOX-371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13017400#comment-13017400 ]


Ramesh commented on PDFBOX-371:
-------------------------------

Oops, sorry. I wrongly addressed my comment to Navendu.

Hi Robert Baruch,
I saw your solution for "soft hyphen in PDF". We have this issue at our place
and I want to implement your solution, but I am not a programmer. Would you
please let me know how to implement it? I have Acrobat versions 6, 7, 8, and 9
on both PC and Macintosh platforms.

Thanks in advance 

Ramesh 


> Soft Hyphen character not mapped to hyphen in WinAnsiEncoding (and suggested 
> fix)
> ---------------------------------------------------------------------------------
>
>                 Key: PDFBOX-371
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-371
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 0.7.3
>         Environment: Java 1.5, OSX 10.5
>            Reporter: Robert Baruch
>            Priority: Minor
>
> When running text extraction on a PDF file that contains the soft hyphen 
> character in WinAnsiEncoding (that is, code 0255 octal, 173 decimal), the 
> text extractor incorrectly maps it to a space when it should map it to a 
> hyphen. As the PDF Reference 1.7 says in note 5 of table D.1:
> 'The hyphen character is also encoded as 255 in WinAnsiEncoding. The meaning 
> of this duplicate code is "soft hyphen," but it is typographically the same 
> as hyphen.'
> A soft hyphen is typographically the same as a hyphen because it marks a 
> point where a hyphen MAY be placed if necessary (i.e. when breaking a word 
> across lines). Since a PDF producer should only emit a soft hyphen at the 
> end of a line to break a word, it stands to reason that the option to place 
> a hyphen must be taken.
> I think I've traced the substitution to Encoding.getName: because the 
> codeToName mapping in WinAnsiEncoding has no entry for this code, getName 
> returns "space" by default.
> The fix is not as simple as adding addCharacterEncoding( 0255, 
> COSName.getPDFName("hyphen")) to WinAnsiEncoding, because that sets both 
> the codeToName mapping AND the nameToCode mapping, which would overwrite the 
> existing 055 nameToCode mapping for "hyphen".
> Adding this line:
> codeToName.put( new Integer(0255), COSName.getPDFName("hyphen"));
> to the end of the WinAnsiEncoding constructor seems to fix the issue.
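
To illustrate why the one-directional mapping matters, here is a minimal, self-contained sketch of the two-map arrangement the report describes. It is not PDFBox code: the class SoftHyphenMappingSketch is hypothetical, plain String glyph names stand in for COSName, and the "space" fallback in getName is modelled only on the behaviour the reporter describes.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for an encoding class; NOT the PDFBox implementation.
class SoftHyphenMappingSketch {
    // Assumed shape of the two mappings named in the report.
    final Map<Integer, String> codeToName = new HashMap<>();
    final Map<String, Integer> nameToCode = new HashMap<>();

    // Sets both directions at once, as the report says addCharacterEncoding does.
    void addCharacterEncoding(int code, String name) {
        codeToName.put(code, name);
        nameToCode.put(name, code);
    }

    // Returns "space" for unmapped codes, as the report describes.
    String getName(int code) {
        String name = codeToName.get(code);
        return name != null ? name : "space";
    }

    public static void main(String[] args) {
        // Naive fix: registering 0255 in both maps replaces the 055 entry.
        SoftHyphenMappingSketch naive = new SoftHyphenMappingSketch();
        naive.addCharacterEncoding(055, "hyphen");
        System.out.println("unmapped 0255 -> " + naive.getName(0255));
        naive.addCharacterEncoding(0255, "hyphen");
        System.out.println("naive nameToCode(hyphen) = "
                + Integer.toOctalString(naive.nameToCode.get("hyphen")));

        // Suggested fix: add only the code-to-name direction.
        SoftHyphenMappingSketch fixed = new SoftHyphenMappingSketch();
        fixed.addCharacterEncoding(055, "hyphen");
        fixed.codeToName.put(0255, "hyphen");
        System.out.println("fixed 0255 -> " + fixed.getName(0255));
        System.out.println("fixed nameToCode(hyphen) = "
                + Integer.toOctalString(fixed.nameToCode.get("hyphen")));
    }
}
```

In this sketch the naive variant leaves "hyphen" pointing at 0255 when encoding, while the one-directional put keeps 055 as the code for "hyphen" and still decodes 0255 as a hyphen.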

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
