[ https://issues.apache.org/jira/browse/PDFBOX-5406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17515961#comment-17515961 ]
Tilman Hausherr commented on PDFBOX-5406:
-----------------------------------------

Yes sometimes we get trash. But there are also cases where Adobe Reader brings trash. Some files have a /ToUnicode map and still return trash. We don't have a "strict" setting because there's no simple solution. Use a word dictionary to detect whether the output is trash, and then run OCR.

> Assumption of Identity Not Valid for Text Extraction
> ----------------------------------------------------
>
>                 Key: PDFBOX-5406
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5406
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 2.0.24
>            Reporter: Michael Tighe
>            Priority: Major
>
> PDF BOX issue 1090 (closed years ago) makes an assumption that can lead to
> serious issues when the text extraction process returns garbage.
> Version: PDFBOX v2.0.24
> PDFBOX -> PDFont.java -> loadUnicodeCMap line 150
> The code distinctly KNOWS that there is no UNICODE map.
> It then makes a number of guesses - runs out of options, and explicitly makes
> an assumption that silently creates bad output.
> {{ LOG.warn("Invalid ToUnicode CMap in font " + getName());}}
> {{ ...}}
> {{ LOG.warn("Using predefined identity CMap instead");}}
> Every document that I've seen that produces that WARNING has bad text
> returned for the document when you use PDFBOX to do text extraction.
> My logic is that the CMap is being ignored by the producer of that PDF, and
> assuming that it's possible to use the reverse causes silent failure on the
> part of PDFBOX. The software package calling PDFBOX gets no warning that
> there is an issue.
> I propose that this code throw an exception rather than a warning.
> That way the extraction caller KNOWS that the text is wrong.
> I have examples identical to those shown in the original issue.
> Is there any more recent work on this issue? E.g., parameters that could be
> set to say "I want perfect extraction or no extraction"?
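
The word-dictionary check suggested in the comment above could look roughly like the sketch below. It is not part of PDFBox: the class name `GarbageTextDetector`, its methods, the tiny word list, and the 0.5 threshold are all hypothetical choices for illustration. A real check would load a full dictionary for the document's language and tune the threshold on known-good files. Text with no alphabetic tokens at all is treated as unreliable here, which is an assumption that suits scanned pages.

```java
import java.util.Locale;
import java.util.Set;

final class GarbageTextDetector {

    /** Fraction of alphabetic tokens found in the dictionary; 0.0 when there are no tokens. */
    static double dictionaryHitRatio(String text, Set<String> dictionary) {
        int total = 0;
        int hits = 0;
        // Split on any run of non-letter characters (digits, punctuation, control codes).
        for (String token : text.split("[^\\p{L}]+")) {
            if (token.isEmpty()) {
                continue; // split() can yield a leading empty string
            }
            total++;
            if (dictionary.contains(token.toLowerCase(Locale.ROOT))) {
                hits++;
            }
        }
        return total == 0 ? 0.0 : (double) hits / total;
    }

    /** True when fewer than minRatio of the tokens are known words. */
    static boolean looksLikeGarbage(String text, Set<String> dictionary, double minRatio) {
        return dictionaryHitRatio(text, dictionary) < minRatio;
    }

    public static void main(String[] args) {
        // Hypothetical toy dictionary; a real one would hold tens of thousands of words.
        Set<String> dictionary = Set.of("the", "quick", "brown", "fox");
        // In practice this string would come from PDFTextStripper.getText(document).
        String extracted = "The quick brown fox";
        if (looksLikeGarbage(extracted, dictionary, 0.5)) {
            System.out.println("Extraction looks unreliable; fall back to OCR");
        } else {
            System.out.println("Extraction looks plausible");
        }
    }
}
```

The caller would run this on the extracted text of each page or document and send only the failing ones to OCR, so files that extract cleanly are not slowed down.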