[jira] [Commented] (PDFBOX-5350) Regression unicode mapping in Korean document

Jira Mon, 02 Oct 2023 02:48:39 -0700


    [ 
https://issues.apache.org/jira/browse/PDFBOX-5350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17771026#comment-17771026
 ]


Andreas Lehmkühler commented on PDFBOX-5350:
--------------------------------------------

First of all, the CMap parser simply stopped parsing whenever it stumbled upon 
a malformed CMap. That led to missing mappings and consequently to wrong text 
extraction results. We fixed that in PDFBOX-4661.

PDFBOX-4661 introduced a regression because the overflow rule applies only to 
embedded CMaps. Imported CMaps don't have to follow that strict rule. 
PDFBOX-5090 fixed that by introducing a strict mode which is limited to 
embedded CMaps.

There were some other minor fixes/improvements to the CMap parser, e.g. 
PDFBOX-4720.

I have the impression that the mapping overflow itself wasn't the real cause 
at all; rather, it was the other side effects which were fixed in the above 
tickets. That said, I'd like to follow [~tilman]'s idea and remove/deactivate 
the strict mode parser. IMHO it can't get worse: a missing mapping leads to an 
extraction issue just as a wrong mapping does, and a wrong mapping is what 
ignoring range overflows may introduce.
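
To illustrate the trade-off, here is a minimal sketch of how a bfrange-style 
entry might be expanded with and without a strict overflow rule. It is not the 
actual CMapParser code; the class and method names (BfRangeSketch, increment, 
expand) and the sample range are hypothetical. It assumes the destination is a 
UTF-16BE byte string and that "overflow" means the last byte would pass 0xFF.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not PDFBox's implementation.
class BfRangeSketch {

    // Increments a big-endian byte string in place.
    // In strict mode, returns false instead of carrying out of the last byte.
    static boolean increment(byte[] data, boolean strict) {
        int pos = data.length - 1;
        if (strict && (data[pos] & 0xFF) == 0xFF) {
            return false; // last byte would overflow: stop here
        }
        // Lenient mode: carry into the higher-order bytes.
        while (pos >= 0) {
            if ((data[pos] & 0xFF) == 0xFF) {
                data[pos] = 0;
                pos--;
            } else {
                data[pos]++;
                break;
            }
        }
        return true;
    }

    // Expands the range startCode..endCode, mapping each code to the
    // destination string, which is incremented once per code.
    static Map<Integer, String> expand(int startCode, int endCode,
                                       byte[] dstStart, boolean strict) {
        Map<Integer, String> result = new LinkedHashMap<>();
        byte[] dst = dstStart.clone();
        for (int code = startCode; code <= endCode; code++) {
            result.put(code, new String(dst, StandardCharsets.UTF_16BE));
            if (code < endCode && !increment(dst, strict)) {
                break; // strict mode: remaining codes stay unmapped
            }
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] dst = {(byte) 0xAC, (byte) 0xFE}; // U+ACFE
        // Strict: mapping stops once the last byte reaches 0xFF.
        System.out.println(expand(0x10, 0x13, dst, true).size());
        // Lenient: the carry continues into U+AD00, U+AD01.
        System.out.println(expand(0x10, 0x13, dst, false).size());
    }
}
```

In this sketch the strict variant leaves the tail of the range unmapped 
(the "No Unicode mapping" case), while the lenient variant always produces a 
mapping, which may or may not be the one the PDF producer intended.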

We should run the regression tests to be on the safe side.

> Regression unicode mapping in Korean document
> ---------------------------------------------
>
>                 Key: PDFBOX-5350
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5350
>             Project: PDFBox
>          Issue Type: Bug
>          Components: FontBox
>    Affects Versions: 2.0.16, 2.0.18, 2.0.20, 2.0.25, 3.0.0 PDFBox
>            Reporter: John Mayfield
>            Priority: Major
>              Labels: regression
>         Attachments: KR1019900015076.pdf, KR1019980000128.pdf, 
> KR1019980000128_2_0_15.txt, KR1019980000128_2_0_25.txt, KR1020140140600.pdf
>
>
> The text output from Korean Patent PDFs changed in v2.0.15+ (due to unicode 
> mapping?), this was previously addressed in PDFBOX-4661 and resolved that 
> example PDF in v2.0.18 - thanks. Unfortunately v2.0.18 causes other documents 
> (included here) to now have incorrect text output.
> PDFTextStripper stripper = new PDFTextStripper();
> PDDocument doc = PDDocument.load(new File("KR1019980000128.pdf"));
> stripper.getText(doc);
> Like in PDFBOX-4661 there are numerous warnings of the form:
> WARNING: No Unicode mapping for CID+14172 (14172) in font GKYPPJ+GulimChe
> I've attached the text dump of two versions, but in brief:
> 2.0.15: 공개번호 (public number)
> 2.0.25: 공개 
> I only confirmed the issue in the versions listed above but presume the issue 
> persists >=2.0.18.
> My reading of PDFBOX-4661 is there is something funky about these PDFs? 
> PDFBOX v2.0.15 had the correct text output. Testing in PDF.js incorrectly 
> produces 공개뮈픸 so I can see there is something non-trivial here.
> Any help is much appreciated.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
