[ 
https://issues.apache.org/jira/browse/PDFBOX-4661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16940933#comment-16940933
 ] 

Michael Klink commented on PDFBOX-4661:
---------------------------------------

{noformat}
<38ff><39ff><c6a9>{noformat}
Actually, this range is not even allowed to span 255 codes: with a destination 
string ending in 0xa9 (169), a span of at most 255 − 169 = 86 is allowed.
{panel:title=ISO 32000-1, section 9.10.3 ToUnicode CMaps}
   _n_ *beginbfrange*
   _srcCode1 srcCode2 dstString_
   *endbfrange*

 In this case, the last byte of the string shall be incremented for each 
consecutive code in the source code range.
 When defining ranges of this type, the value of the last byte in the string 
shall be less than or equal to 255 − (_srcCode2_ − _srcCode1_). This ensures 
that the last byte of the string shall not be incremented past 255; otherwise, 
the result of mapping is undefined.
{panel}
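
To make the quoted constraint concrete, here is a minimal sketch that checks it for the range above. The class and method names are hypothetical and not part of PDFBox; the only inputs are the two source codes and the destination string from the {noformat} block.

```java
// Illustrative sketch only: BfRangeCheck and its methods are hypothetical names,
// not PDFBox API.
final class BfRangeCheck {

    // Largest (srcCode2 - srcCode1) permitted for a dstString ending in lastByte.
    static int maxSpan(int lastByte) {
        return 255 - (lastByte & 0xFF);
    }

    // True if the range satisfies the constraint quoted from ISO 32000-1, 9.10.3:
    // last byte of dstString <= 255 - (srcCode2 - srcCode1).
    static boolean isValid(int srcCode1, int srcCode2, byte[] dstString) {
        int lastByte = dstString[dstString.length - 1] & 0xFF;
        return lastByte <= 255 - (srcCode2 - srcCode1);
    }

    public static void main(String[] args) {
        byte[] dst = {(byte) 0xC6, (byte) 0xA9};
        System.out.println(maxSpan(0xA9));                      // 86
        System.out.println(isValid(0x38FF, 0x39FF, dst));       // false
        System.out.println(isValid(0x38FF, 0x38FF + 86, dst));  // true
    }
}
```

The range <38ff><39ff> spans 256 codes, so it fails the check; the same start code with a span of 86 passes.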

I'd propose following the effect outlined in the spec: beyond the index at 
which the last byte of the string would be incremented past 255, the mapping 
should be undefined.
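
A rough sketch of what this proposal could look like follows. It is not the actual PDFBox code: BfRangeSketch and mapRange are hypothetical names, and it assumes the destination string is read as UTF-16BE, as in ToUnicode CMaps. Codes past the point where the last byte reaches 255 simply get no entry.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only: BfRangeSketch.mapRange is a hypothetical helper,
// not the PDFBox implementation.
final class BfRangeSketch {

    // Maps srcCode1..srcCode2 by incrementing the last byte of dstString,
    // and stops before that byte would be incremented past 255.
    static Map<Integer, String> mapRange(int srcCode1, int srcCode2, byte[] dstString) {
        Map<Integer, String> result = new LinkedHashMap<>();
        byte[] current = dstString.clone();
        int last = current.length - 1;
        for (int code = srcCode1; code <= srcCode2; code++) {
            result.put(code, new String(current, StandardCharsets.UTF_16BE));
            if ((current[last] & 0xFF) == 0xFF) {
                break; // the next increment would pass 255: remaining codes stay unmapped
            }
            current[last]++;
        }
        return result;
    }

    public static void main(String[] args) {
        Map<Integer, String> m = mapRange(0x38FF, 0x39FF, new byte[] {(byte) 0xC6, (byte) 0xA9});
        System.out.println(m.size());              // 87
        System.out.println(m.containsKey(0x3956)); // false
    }
}
```

For the range above this maps 0x38ff through 0x3955 (to U+C6A9 through U+C6FF) and leaves 0x3956 through 0x39ff without a mapping.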

> Regression No Unicode mapping with Identity-H font
> --------------------------------------------------
>
>                 Key: PDFBOX-4661
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4661
>             Project: PDFBox
>          Issue Type: Bug
>          Components: FontBox
>    Affects Versions: 2.0.16, 2.0.17
>            Reporter: Daniel Lowe
>            Assignee: Andreas Lehmkühler
>            Priority: Major
>              Labels: regression
>             Fix For: 2.0.18, 3.0.0 PDFBox
>
>         Attachments: KR1020067006547.pdf
>
>
> Running the following code in v2.0.16 or v2.0.17 produces warnings and 
> missing characters. The expected output is obtained in v2.0.15 and earlier.
> PDFTextStripper stripper = new PDFTextStripper();
>  PDDocument doc = PDDocument.load(new File("KR1020067006547.pdf"));
>  stripper.getText(doc);
> The errors look like the following:
> Sep 27, 2019 11:49:37 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
> WARNING: No Unicode mapping for CID+17679 (17679) in font 
> LHXXBJ+¹ÙÅÁ-Identity-H
> Sep 27, 2019 11:49:37 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
> WARNING: No Unicode mapping for CID+16131 (16131) in font 
> LHXXBJ+¹ÙÅÁ-Identity-H
> Sep 27, 2019 11:49:37 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
> WARNING: No Unicode mapping for CID+14802 (14802) in font 
> LHXXBJ+¹ÙÅÁ-Identity-H
> This change is likely related to PDFBOX-4549


