[ https://issues.apache.org/jira/browse/PDFBOX-4661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16940933#comment-16940933 ]
Michael Klink commented on PDFBOX-4661:
---------------------------------------

{noformat}
<38ff><39ff><c6a9>{noformat}
Actually, this range is not even allowed to be 255 long; a length of at most 86 is allowed, because the last byte of the destination string is 0xa9 = 169 and 255 − 169 = 86.

{panel:title=ISO 32000-1, section 9.10.3 ToUnicode CMaps}
_n_ *beginbfrange*
_srcCode1 srcCode2 dstString_
*endbfrange*

In this case, the last byte of the string shall be incremented for each consecutive code in the source code range.

When defining ranges of this type, the value of the last byte in the string shall be less than or equal to 255 − (_srcCode2_ − _srcCode1_). This ensures that the last byte of the string shall not be incremented past 255; otherwise, the result of mapping is undefined.
{panel}

I'd propose following the effect outlined in the spec: beyond the index for which the last byte of the string would be incremented past 255, the mapping should be undefined.

> Regression No Unicode mapping with Identity-H font
> --------------------------------------------------
>
>                 Key: PDFBOX-4661
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4661
>             Project: PDFBox
>          Issue Type: Bug
>          Components: FontBox
>   Affects Versions: 2.0.16, 2.0.17
>            Reporter: Daniel Lowe
>            Assignee: Andreas Lehmkühler
>            Priority: Major
>              Labels: regression
>             Fix For: 2.0.18, 3.0.0 PDFBox
>
>         Attachments: KR1020067006547.pdf
>
>
> In v2.0.16 or v2.0.17 running the following code. The expected output is
> obtained in v2.0.15 and earlier.
> PDFTextStripper stripper = new PDFTextStripper();
> PDDocument doc = PDDocument.load(new File("KR1020067006547.pdf"));
> stripper.getText(doc);
> results in errors like the following and missing characters
> Sep 27, 2019 11:49:37 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
> WARNING: No Unicode mapping for CID+17679 (17679) in font
> LHXXBJ+¹ÙÅÁ-Identity-H
> Sep 27, 2019 11:49:37 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
> WARNING: No Unicode mapping for CID+16131 (16131) in font
> LHXXBJ+¹ÙÅÁ-Identity-H
> Sep 27, 2019 11:49:37 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
> WARNING: No Unicode mapping for CID+14802 (14802) in font
> LHXXBJ+¹ÙÅÁ-Identity-H
> This change is likely related to PDFBOX-4549
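
For illustration, the bfrange rule quoted from ISO 32000-1 in the comment above, together with the proposal to leave codes unmapped once the last byte would pass 255, could be sketched roughly as follows. This is not PDFBox or FontBox code; the class BfRangeSketch and the methods maxAllowedSpan and expand are hypothetical names chosen for this example, and the destination string is treated as a plain byte array.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not the FontBox implementation.
public class BfRangeSketch {

    /** Largest (srcCode2 - srcCode1) the spec allows for this destination string. */
    public static int maxAllowedSpan(byte[] dst) {
        int lastByte = dst[dst.length - 1] & 0xFF;
        return 255 - lastByte;
    }

    /**
     * Expands "srcCode1 srcCode2 dstString" into individual mappings by
     * incrementing only the last byte of the destination string. Codes for
     * which the last byte would be incremented past 255 are left unmapped.
     */
    public static Map<Integer, byte[]> expand(int srcCode1, int srcCode2, byte[] dst) {
        Map<Integer, byte[]> mappings = new LinkedHashMap<>();
        int lastByte = dst[dst.length - 1] & 0xFF;
        for (int code = srcCode1; code <= srcCode2; code++) {
            int incremented = lastByte + (code - srcCode1);
            if (incremented > 255) {
                break; // the spec calls the result undefined from here on
            }
            byte[] value = dst.clone();
            value[value.length - 1] = (byte) incremented;
            mappings.put(code, value);
        }
        return mappings;
    }

    public static void main(String[] args) {
        // The range from the comment: <38ff><39ff><c6a9>
        byte[] dst = {(byte) 0xC6, (byte) 0xA9};
        Map<Integer, byte[]> mappings = expand(0x38FF, 0x39FF, dst);
        System.out.println(maxAllowedSpan(dst));            // prints 86
        System.out.println(mappings.size());                // prints 87
        System.out.println(mappings.containsKey(0x3956));   // prints false
    }
}
```

With the range from the comment, the codes 0x38ff through 0x3955 receive the values c6a9 through c6ff, and everything from 0x3956 up to 0x39ff stays unmapped.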