[ https://issues.apache.org/jira/browse/PDFBOX-6054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18015715#comment-18015715 ]
Marc Reichman commented on PDFBOX-6054:
---------------------------------------
Thank you [~tilman] for the background information and for the text analysis. I
may find a way to use that override just to at least flag the possibility of
downstream "broken" content. I am still puzzled as to what Chrome/Edge are
doing to make it usable for casual viewing. I did observe that, even in those
browsers, if I copied and pasted that text, it resembled the scrambled content,
not what was visually presented in the browser.
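
One way to flag possibly "broken" content downstream is a rough heuristic over
the extracted text itself. The sketch below is an assumption, not a PDFBox API:
the class ScrambleHeuristic, its methods, and the threshold are hypothetical
names chosen for illustration. It measures the share of non-whitespace code
points that rarely occur in correctly mapped text (U+FFFD, private-use,
unassigned, lone surrogates, non-whitespace control characters). It would not
catch a font whose ToUnicode table maps to ordinary but wrong letters.

```java
// Hypothetical helper, not part of PDFBox: scores already-extracted text.
final class ScrambleHeuristic {

    private ScrambleHeuristic() {
    }

    /** True for code points that rarely appear in correctly mapped text. */
    static boolean isSuspicious(int codePoint) {
        if (codePoint == 0xFFFD) {
            return true; // Unicode replacement character
        }
        switch (Character.getType(codePoint)) {
            case Character.PRIVATE_USE:
            case Character.UNASSIGNED:
            case Character.SURROGATE: // unpaired surrogate
                return true;
            case Character.CONTROL:
                // tabs and line breaks are normal in extracted text
                return !Character.isWhitespace(codePoint);
            default:
                return false;
        }
    }

    /** Fraction of non-whitespace code points that are suspicious (0.0 if none). */
    static double suspiciousRatio(String text) {
        long total = 0;
        long suspicious = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isWhitespace(cp)) {
                continue;
            }
            total++;
            if (isSuspicious(cp)) {
                suspicious++;
            }
        }
        return total == 0 ? 0.0 : (double) suspicious / total;
    }

    /** Flags text whose suspicious ratio exceeds an assumed, caller-chosen threshold. */
    static boolean looksScrambled(String text, double threshold) {
        return suspiciousRatio(text) > threshold;
    }
}
```

The text passed in would be whatever the extraction step returned; the
threshold (for example 0.1) is a guess that would need tuning against real
documents.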
I understand there is no PDFBox bug, so I am perfectly fine with this ticket
being closed.
Thanks!
Marc
> Enable API support to check when text is scrambled and/or if some of the
> unicode mapping warnings happen
> --------------------------------------------------------------------------------------------------------
>
> Key: PDFBOX-6054
> URL: https://issues.apache.org/jira/browse/PDFBOX-6054
> Project: PDFBox
> Issue Type: Improvement
> Components: Text extraction
> Affects Versions: 3.0.5 PDFBox
> Environment: Linux / JDK 21 / Docker
> Windows / JDK 21
> Reporter: Marc Reichman
> Priority: Minor
> Attachments: 7E32D4EAD8382000E24D9967C1913F6E.pdf, screenshot-1.png
>
>
> With the attached PDF, there is plenty of gibberish in the text extraction. I
> have seen other issues mention this, but in this particular case it displays
> perfectly fine in Edge or Chrome. I have opened it in the pdf debugger but
> it's hard to figure out what I'm looking at.
>
> The pdftotext tool from xpdf generates the same. Interestingly, the pdffonts
> tool does not show any fonts as "problem".
>
> I understand this will happen and it's due to pdf generation bugs, not
> including proper unicode translators, etc. but I am curious, could we check a
> property or get a specific exception when unicode mapping is not available?
> I'm not sure if that's overcorrective; i.e. Unicode mapping failures are a
> normal fact of life.