[ https://issues.apache.org/jira/browse/PDFBOX-4737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17012820#comment-17012820 ]
Michael Klink commented on PDFBOX-4737:
---------------------------------------

A strict/lax mode could help prevent PDFBox from trying to extract text from broken text extraction information, but broken text extraction information is usually produced by buggy PDF generators, not by obfuscators. Obfuscators usually generate PDFs without text extraction information (like your examples) or with misleading information (like in [this stack overflow q&a|https://stackoverflow.com/a/22688775/1729265]).

> Text extraction is gibberish
> ----------------------------
>
>                 Key: PDFBOX-4737
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4737
>             Project: PDFBox
>          Issue Type: Improvement
>    Affects Versions: 2.0.18
>            Reporter: Jorge Spinsanti
>            Priority: Major
>         Attachments: noUnicodeMapping.pdf, obfuscateTest_Duplicate_2_3.pdf
>
>
> As discussed in https://issues.apache.org/jira/browse/PDFBOX-4549,
> there are many PDFs for which text extraction produces gibberish.
> Perhaps you could add two modes (strict/lax) to text extraction, so that
> gibberish is not returned when it is not useful. A file is attached to help analyze the problem.
> [^noUnicodeMapping.pdf]
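
To make the strict/lax idea discussed above more concrete, here is a minimal sketch of how a caller might post-filter already extracted text. Everything in it is an assumption made for illustration: the class `ExtractionModeSketch`, the `Mode` enum, the `suspiciousRatio` heuristic and the threshold are hypothetical names and are not part of the PDFBox API. The heuristic only counts characters that hint at a missing or broken Unicode mapping (U+FFFD, private-use, unassigned and non-whitespace control characters). As the comment points out, it would not detect PDFs with deliberately misleading mappings, since those yield well-formed but wrong text.

```java
import java.util.Optional;

/** Hypothetical sketch of a strict/lax post-filter; not part of PDFBox. */
final class ExtractionModeSketch {

    /** Hypothetical modes named after the strict/lax proposal in the issue. */
    enum Mode { STRICT, LAX }

    /** Returns the fraction of code points that hint at a missing or broken Unicode mapping. */
    static double suspiciousRatio(String text) {
        if (text.isEmpty()) {
            return 0.0;
        }
        int total = 0;
        int suspicious = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            total++;
            int type = Character.getType(cp);
            // Line breaks and tabs are control characters too, but they are expected in text.
            boolean oddControl = type == Character.CONTROL && !Character.isWhitespace(cp);
            if (cp == 0xFFFD
                    || type == Character.PRIVATE_USE
                    || type == Character.UNASSIGNED
                    || oddControl) {
                suspicious++;
            }
        }
        return (double) suspicious / total;
    }

    /**
     * In STRICT mode, returns an empty result when the suspicious ratio exceeds the
     * threshold. In LAX mode, always returns the text unchanged.
     */
    static Optional<String> filter(String text, Mode mode, double threshold) {
        if (mode == Mode.STRICT && suspiciousRatio(text) > threshold) {
            return Optional.empty();
        }
        return Optional.of(text);
    }
}
```

A caller would pass in the string obtained from the text stripper and choose the threshold empirically. The value that suits a given document set is an open question and is not something the issue specifies.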