[ https://issues.apache.org/jira/browse/PDFBOX-4737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17014860#comment-17014860 ]
Tilman Hausherr commented on PDFBOX-4737:
-----------------------------------------

I have not done anything because 1) no time, 2) not my core skills, 3) I don't see this as a PDFBox problem. This is something on a higher level, i.e. deciding whether an existing text is garbage or not. As explained, even if we provided a "strict" extraction (i.e. one that ignores Unicode mappings that do not follow the PDF specification 100%), we could still get gibberish.

> Text extraction is gibberish
> ----------------------------
>
>                 Key: PDFBOX-4737
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4737
>             Project: PDFBox
>          Issue Type: Improvement
>    Affects Versions: 2.0.18
>            Reporter: Jorge Spinsanti
>            Priority: Major
>         Attachments: noUnicodeMapping.pdf, obfuscateTest_Duplicate_2_3.pdf
>
>
> As was discussed in https://issues.apache.org/jira/browse/PDFBOX-4549
> there are many PDFs where the text extraction is gibberish.
> Perhaps you can add two modes (strict/lax) to text extraction to avoid
> gibberish output when it is not useful. I have attached a file to analyze the problem.
> [^noUnicodeMapping.pdf]
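
To illustrate what a check "on a higher level" could look like, here is a minimal sketch that runs on already-extracted text, outside of PDFBox. It is not part of PDFBox and was not proposed in the issue; the class name `GibberishHeuristic`, its methods, and the threshold are all hypothetical. It only counts code points that rarely appear in real text (U+FFFD, private-use, control, unassigned, lone surrogates), so it would miss gibberish made of ordinary letters, which is exactly the limitation described in the comment.

```java
// Hypothetical helper, not a PDFBox API: scores extracted text by the share
// of code points that are unlikely to occur in genuine text.
final class GibberishHeuristic {
    private GibberishHeuristic() {}

    // Returns the fraction of non-whitespace code points that look suspicious.
    // Returns 0.0 for empty or whitespace-only input.
    static double suspiciousRatio(String text) {
        int total = 0;
        int suspicious = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isWhitespace(cp)) {
                continue; // spaces and line breaks are neutral
            }
            total++;
            int type = Character.getType(cp);
            if (cp == 0xFFFD                        // replacement character
                    || type == Character.PRIVATE_USE
                    || type == Character.CONTROL
                    || type == Character.UNASSIGNED
                    || type == Character.SURROGATE) { // unpaired surrogate
                suspicious++;
            }
        }
        return total == 0 ? 0.0 : (double) suspicious / total;
    }

    // The threshold is an assumption to be tuned per document collection.
    static boolean looksLikeGibberish(String text, double threshold) {
        return suspiciousRatio(text) > threshold;
    }
}
```

A caller could pass in the string returned by its text extraction step and, for example, treat a ratio above 0.5 as a reason to fall back to OCR or to flag the file for review. Text mapped to valid but wrong letters would still pass this check.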