[ https://issues.apache.org/jira/browse/PDFBOX-4737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17016209#comment-17016209 ]
Tim Allison commented on PDFBOX-4737:
-------------------------------------

The following reinforces points already made, I think.

>On the other hand of course a proper implementation of a strict mode will
>require quite a lot of work
+1

> and a half-hearted implementation is worthless.
Indications of specific types of wonkiness – e.g. missing fonts, missing unicode mappings, missing/invalid xref, many other features – would be useful to some downstream processors, and if we did a "group by" on "producer/creator tool" for a given corpus like CommonCrawl, we might be able to shame software companies and projects into fixing specific issues. We could add these incrementally... and I see some benefit from even partial information (missing unicode mappings).

As I and others point out, though, text can always be hosed, and there is no perfect "junk detector". You can try to use tika-eval's out-of-vocabulary statistic as an indicator that the text is not "languagey", but it will incorrectly categorize parts lists, ISBNs, duck phyla as "bad." More advanced machine learning (e.g. neural nets) may do a better job, but they will still be wrong some of the time. There's a reason Google is running OCR on at least some PDFs. :P

So, from an OS community perspective, I see two avenues of work:
# improving reporting of "nonstandard" features of the PDF – or helping developers understand what types of "nonstandard" features can currently be detected with PDFBox
# working together to improve a junk detector... a la Tika's

> Text extraction is gibberish
> ----------------------------
>
>                 Key: PDFBOX-4737
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4737
>             Project: PDFBox
>          Issue Type: Improvement
>    Affects Versions: 2.0.18
>            Reporter: Jorge Spinsanti
>            Priority: Major
>         Attachments: noUnicodeMapping.pdf, obfuscateTest_Duplicate_2_3.pdf
>
>
> As it was discussed on https://issues.apache.org/jira/browse/PDFBOX-4549
> there are many PDFs where the text extraction is gibberish.
> Perhaps you can add two modes (strict/lax) to text extraction to avoid
> gibberish if not useful. Add a file to analyze the problem.
> [^noUnicodeMapping.pdf]
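
To make the out-of-vocabulary idea from the comment concrete, here is a minimal sketch in Java. It is not tika-eval's implementation: the class `OovSketch`, the method `oovRatio`, and the tiny vocabulary are hypothetical names made up for illustration, and a real detector would presumably rely on a large per-language word list rather than four words.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper for illustration; not part of tika-eval or PDFBox. */
final class OovSketch {

    /**
     * Returns the fraction of alphabetic tokens in the text that are absent
     * from the vocabulary, or 0.0 if the text has no alphabetic tokens.
     */
    static double oovRatio(String text, Set<String> vocabulary) {
        int total = 0;
        int unknown = 0;
        // Lowercase, then split on anything that is not a Unicode letter.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (token.isEmpty()) {
                continue; // split() can yield a leading empty string
            }
            total++;
            if (!vocabulary.contains(token)) {
                unknown++;
            }
        }
        return total == 0 ? 0.0 : (double) unknown / total;
    }

    public static void main(String[] args) {
        // Toy vocabulary (assumption): far too small for real use.
        Set<String> vocab = new HashSet<>(Arrays.asList("the", "text", "is", "readable"));
        System.out.println(oovRatio("The text is readable", vocab));
        System.out.println(oovRatio("Xq zvb wklr", vocab));
        System.out.println(oovRatio("Anatidae Anseriformes Chordata", vocab));
    }
}
```

With this toy vocabulary the readable sentence scores 0.0 and the scrambled one scores 1.0, but the list of duck taxa also scores 1.0. That is the false-positive problem described in the comment: legitimate but unusual text looks the same as gibberish to a pure vocabulary check.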