[jira] [Commented] (PDFBOX-4737) Text extraction is gibberish

Tilman Hausherr (Jira) Thu, 09 Jan 2020 11:55:00 -0800


    [ 
https://issues.apache.org/jira/browse/PDFBOX-4737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17012180#comment-17012180
 ]


Tilman Hausherr commented on PDFBOX-4737:
-----------------------------------------

Regarding the file mentioned in PDFBOX-4549 
([^obfuscateTest_Duplicate_2_3.pdf]): text extraction currently returns 
nothing for that one, although Adobe does extract something.

Even if we did "strict" text extraction, we could still hit files that are 
purposely obfuscated; see the comment by mkl. IMHO detecting gibberish isn't 
the job of PDFBox; it should be done by the caller.
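
A minimal sketch of what a caller-side check might look like, assuming the 
caller already has the extracted text as a String (for example from 
PDFTextStripper.getText). The class name GibberishCheck, both method names and 
the threshold are hypothetical and not part of PDFBox. The heuristic only 
counts code points that rarely occur in real text (private-use, control, 
unassigned, U+FFFD); it would not catch purposely obfuscated files whose 
glyphs map to valid but wrong letters.

```java
// Hypothetical caller-side helper; nothing here is PDFBox API.
final class GibberishCheck {
    private GibberishCheck() {}

    /** Fraction of non-whitespace code points that are unlikely in real text. */
    static double suspiciousRatio(String text) {
        int total = 0;
        int suspicious = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isWhitespace(cp)) {
                continue; // whitespace is layout only, so it is not counted
            }
            total++;
            int type = Character.getType(cp);
            if (cp == 0xFFFD
                    || type == Character.CONTROL
                    || type == Character.PRIVATE_USE
                    || type == Character.UNASSIGNED
                    || type == Character.SURROGATE) {
                suspicious++;
            }
        }
        // Empty or whitespace-only text gives no evidence either way.
        return total == 0 ? 0.0 : (double) suspicious / total;
    }

    /** True if the suspicious fraction exceeds the caller-chosen threshold. */
    static boolean looksLikeGibberish(String text, double threshold) {
        return suspiciousRatio(text) > threshold;
    }
}
```

A caller could run this on each page's text and decide for itself whether to 
discard the result, fall back to OCR, or keep it; the threshold (0.3 in the 
assumed usage) would need tuning against real files.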

> Text extraction is gibberish
> ----------------------------
>
>                 Key: PDFBOX-4737
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4737
>             Project: PDFBox
>          Issue Type: Improvement
>    Affects Versions: 2.0.18
>            Reporter: Jorge Spinsanti
>            Priority: Major
>         Attachments: noUnicodeMapping.pdf, obfuscateTest_Duplicate_2_3.pdf
>
>
> As discussed in https://issues.apache.org/jira/browse/PDFBOX-4549, 
> there are many PDFs for which text extraction returns gibberish.
> Perhaps you could add two modes (strict/lax) to text extraction, so that 
> gibberish can be avoided when it is not useful. I have attached a file to 
> help analyze the problem.
> [^noUnicodeMapping.pdf]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

