[jira] [Commented] (PDFBOX-4737) Text extraction is gibberish

Michael Klink (Jira) Fri, 10 Jan 2020 04:56:05 -0800


    [ 
https://issues.apache.org/jira/browse/PDFBOX-4737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17012820#comment-17012820
 ]


Michael Klink commented on PDFBOX-4737:
---------------------------------------

A strict/lax mode could help keep PDFBox from attempting text extraction when 
the text extraction information is broken. Broken text extraction information, 
however, is usually not what obfuscators produce; it is what buggy PDF 
generators produce.

Obfuscators usually generate PDFs without text extraction information 
(like your examples) or with misleading information (like in [this Stack 
Overflow Q&A|https://stackoverflow.com/a/22688775/1729265]).
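
To illustrate the distinction, here is a minimal Java sketch of what a 
strict/lax switch might look like. It is not PDFBox code: the class 
ExtractionModeSketch, the Mode enum, and the extract method are hypothetical 
names, and a plain Map stands in for a font's Unicode mapping. The sketch 
assumes that "strict" means giving up as soon as a character code has no 
mapping, and that "lax" means substituting U+FFFD and carrying on.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch; none of these names are part of the PDFBox API.
final class ExtractionModeSketch {

    enum Mode { STRICT, LAX }

    /**
     * Translates character codes to text via a code-to-Unicode table.
     * STRICT: returns Optional.empty() as soon as a code has no mapping.
     * LAX: substitutes U+FFFD (replacement character) for unmapped codes.
     */
    static Optional<String> extract(List<Integer> codes,
                                    Map<Integer, String> toUnicode,
                                    Mode mode) {
        StringBuilder text = new StringBuilder();
        for (int code : codes) {
            String mapped = toUnicode.get(code);
            if (mapped == null) {
                if (mode == Mode.STRICT) {
                    return Optional.empty(); // refuse rather than guess
                }
                mapped = "\uFFFD";
            }
            text.append(mapped);
        }
        return Optional.of(text.toString());
    }

    public static void main(String[] args) {
        List<Integer> codes = List.of(1, 2, 3);

        // Assumed case 1: the mapping for code 3 is missing.
        Map<Integer, String> incomplete = Map.of(1, "P", 2, "D");
        System.out.println(extract(codes, incomplete, Mode.STRICT));
        System.out.println(extract(codes, incomplete, Mode.LAX));

        // Assumed case 2: the mapping is complete but deliberately wrong.
        Map<Integer, String> misleading = Map.of(1, "x", 2, "y", 3, "z");
        System.out.println(extract(codes, misleading, Mode.STRICT));
    }
}
```

In this sketch, strict mode can refuse the input with the missing entry, but 
the misleading table is complete, so both modes return its wrong text 
unchanged. A check of this kind cannot tell misleading information from 
correct information.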

> Text extraction is gibberish
> ----------------------------
>
>                 Key: PDFBOX-4737
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4737
>             Project: PDFBox
>          Issue Type: Improvement
>    Affects Versions: 2.0.18
>            Reporter: Jorge Spinsanti
>            Priority: Major
>         Attachments: noUnicodeMapping.pdf, obfuscateTest_Duplicate_2_3.pdf
>
>
> As discussed in https://issues.apache.org/jira/browse/PDFBOX-4549, 
> there are many PDFs for which text extraction yields gibberish.
> Perhaps you could add two modes (strict/lax) to text extraction to suppress 
> gibberish when it is not useful. I have attached a file for analyzing the problem.
> [^noUnicodeMapping.pdf]



