[jira] [Commented] (PDFBOX-4737) Text extraction is gibberish

Tim Allison (Jira) Wed, 15 Jan 2020 10:34:12 -0800


    [ https://issues.apache.org/jira/browse/PDFBOX-4737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17016209#comment-17016209 ]


Tim Allison commented on PDFBOX-4737:
-------------------------------------

The following reinforces points already made, I think.

>On the other hand of course a proper implementation of a strict mode will 
>require quite a lot of work

+1

> and a half-hearted implementation is worthless.

Indications of specific types of wonkiness (e.g. missing fonts, missing 
Unicode mappings, missing/invalid xref, and many other features) would be 
useful to some downstream processors. If we did a "group by" on 
"producer/creator tool" for a given corpus like CommonCrawl, we might be able 
to shame software companies and projects into fixing specific issues.  We 
could add these indicators incrementally... and I see some benefit from even 
partial information (e.g. missing Unicode mappings alone).
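
As a rough illustration of the "group by" idea, here is a minimal Java sketch. The Finding class, its fields, and the tool names are hypothetical and exist only for this example; nothing here is an existing PDFBox or Tika API, and how each finding would actually be detected is left open.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class ProducerReport {

    // Hypothetical per-file record; PDFBox does not produce this type.
    static final class Finding {
        final String producer;               // value of the PDF's producer/creator field
        final boolean missingUnicodeMapping; // assumed to be detected elsewhere

        Finding(String producer, boolean missingUnicodeMapping) {
            this.producer = producer;
            this.missingUnicodeMapping = missingUnicodeMapping;
        }
    }

    // Counts files with missing Unicode mappings, grouped by producer tool.
    static Map<String, Long> missingMappingsByProducer(List<Finding> findings) {
        return findings.stream()
                .filter(f -> f.missingUnicodeMapping)
                .collect(Collectors.groupingBy(
                        f -> f.producer, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<Finding> corpus = Arrays.asList(
                new Finding("ToolA", true),
                new Finding("ToolA", true),
                new Finding("ToolB", false),
                new Finding("ToolC", true));
        System.out.println(missingMappingsByProducer(corpus)); // {ToolA=2, ToolC=1}
    }
}
```

Even with a single indicator, a table like this would show which tools account for most of the problem files in a corpus.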

As I and others have pointed out, though, text can always be hosed, and there 
is no perfect "junk detector".  You can try to use tika-eval's 
out-of-vocabulary statistic as an indicator that the text is not "languagey", 
but it will incorrectly categorize parts lists, ISBNs, and duck phyla as 
"bad."  More advanced machine learning (e.g. neural nets) may do a better job, 
but it will still be wrong some of the time.
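
To make the out-of-vocabulary idea concrete, here is a minimal Java sketch. It is not tika-eval's implementation: the class name, the tokenization rule, and the tiny vocabulary are assumptions made for illustration. It just computes the fraction of alphabetic tokens that are absent from a known word list.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class OovSketch {

    // Fraction of alphabetic tokens not found in the vocabulary
    // (0.0 when the text contains no alphabetic tokens at all).
    static double oovRatio(String text, Set<String> vocabulary) {
        int total = 0;
        int unknown = 0;
        // Split on anything that is not a Unicode letter, so digits and
        // punctuation act as separators.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (token.isEmpty()) {
                continue;
            }
            total++;
            if (!vocabulary.contains(token)) {
                unknown++;
            }
        }
        return total == 0 ? 0.0 : (double) unknown / total;
    }

    public static void main(String[] args) {
        // Toy vocabulary; a real one would hold common words per language.
        Set<String> vocab = new HashSet<>(Arrays.asList("the", "text", "is", "readable"));
        System.out.println(oovRatio("The text is readable", vocab));           // 0.0
        System.out.println(oovRatio("Xq zrv kkpl wq", vocab));                 // 1.0
        System.out.println(oovRatio("Anas platyrhynchos, Aix sponsa", vocab)); // 1.0
    }
}
```

The last call shows the false-positive problem: legitimate duck names score exactly like gibberish, because neither appears in the vocabulary.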

There's a reason Google is running OCR on at least some PDFs. :P

So, from an open source community perspective, I see two avenues of work:
 1. improving reporting of "nonstandard" features of the PDF, or helping 
developers understand what types of "nonstandard" features can currently be 
detected with PDFBox
 2. working together to improve a junk detector, a la Tika's

> Text extraction is gibberish
> ----------------------------
>
>                 Key: PDFBOX-4737
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4737
>             Project: PDFBox
>          Issue Type: Improvement
>    Affects Versions: 2.0.18
>            Reporter: Jorge Spinsanti
>            Priority: Major
>         Attachments: noUnicodeMapping.pdf, obfuscateTest_Duplicate_2_3.pdf
>
>
> As it was discussed on https://issues.apache.org/jira/browse/PDFBOX-4549 
> there are many PDFs where the text extraction is gibberish.
> Perhaps you can add two modes (strict/lax) to text extraction to avoid 
> gibberish if not useful. Add a file to analyze the problem.
> [^noUnicodeMapping.pdf]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org
