[ 
https://issues.apache.org/jira/browse/PDFBOX-4386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700777#comment-16700777
 ] 

Tilman Hausherr commented on PDFBOX-4386:
-----------------------------------------

You could use OCR, then maybe compare the text extraction with the OCR output 
and correct the differences using a service like Amazon Mechanical Turk.
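
A minimal sketch of that comparison step, assuming both the extracted text and 
the OCR text are already available as plain strings with the same word order. 
The class and method names are made up for illustration and are not part of 
PDFBox. Real OCR output will rarely align token by token, so a proper diff 
would be needed in practice.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not a PDFBox class.
class ExtractionCheck {

    /**
     * Returns the indices of whitespace-separated tokens where the extracted
     * text and the OCR text disagree. Assumes both texts have the same word
     * order; surplus tokens on either side are reported as mismatches too.
     */
    static List<Integer> mismatchedTokens(String extracted, String ocr) {
        String[] a = extracted.trim().split("\\s+");
        String[] b = ocr.trim().split("\\s+");
        List<Integer> result = new ArrayList<>();
        int common = Math.min(a.length, b.length);
        for (int i = 0; i < common; i++) {
            if (!a[i].equals(b[i])) {
                result.add(i);
            }
        }
        for (int i = common; i < Math.max(a.length, b.length); i++) {
            result.add(i);
        }
        return result;
    }

    public static void main(String[] args) {
        // The "ff" ligature extracted as the control character U+001F,
        // as described in the issue.
        String extracted = "Senseo Ka\u001Fee-Pads mild";
        String ocr = "Senseo Kaffee-Pads mild";
        System.out.println(mismatchedTokens(extracted, ocr));
    }
}
```

The flagged token indices are the candidates you would then send for manual 
correction.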

Another idea would be to detect such fonts and supply the Unicode mapping when 
it is missing. But you're on your own there; it will probably be several days 
of work. You'd need to connect the glyph name ("f_f") with the Unicode entry.
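
As a rough starting point for the glyph name part, here is a sketch that 
follows the Adobe Glyph List naming convention: anything after the first 
period is dropped, the rest is split at underscores, and each component is 
mapped separately. The class name and the tiny lookup table are hypothetical 
stand-ins; a real version would use the full glyph list and would still have 
to be wired into the font handling, which is the hard part.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not a PDFBox class.
class GlyphNames {

    // Tiny illustrative stand-in for the full Adobe Glyph List.
    private static final Map<String, String> NAMED = new HashMap<>();
    static {
        NAMED.put("space", " ");
        NAMED.put("hyphen", "-");
        NAMED.put("germandbls", "\u00DF");
    }

    /** Maps a glyph name such as "f_f" to a Unicode string, or null if unknown. */
    static String toUnicode(String glyphName) {
        // Drop any suffix after the first period, e.g. "f_f.alt" -> "f_f".
        int dot = glyphName.indexOf('.');
        String base = dot >= 0 ? glyphName.substring(0, dot) : glyphName;

        // Underscores separate the components of a ligature.
        StringBuilder sb = new StringBuilder();
        for (String part : base.split("_")) {
            String mapped = componentToUnicode(part);
            if (mapped == null) {
                return null; // give up if any component is unknown
            }
            sb.append(mapped);
        }
        return sb.length() == 0 ? null : sb.toString();
    }

    private static String componentToUnicode(String part) {
        if (part.matches("[A-Za-z]")) {
            return part; // basic Latin letters are named after themselves
        }
        if (part.matches("uni[0-9A-F]{4}")) {
            // Simplification: only a single 4-digit "uniXXXX" group is handled.
            return String.valueOf((char) Integer.parseInt(part.substring(3), 16));
        }
        return NAMED.get(part);
    }

    public static void main(String[] args) {
        System.out.println(toUnicode("f_f"));
        System.out.println(toUnicode("f_f_i"));
    }
}
```

With this, "f_f" resolves to the two letters "ff", which is what the text 
extraction should have produced for "Kaffee-Pads".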

But this will not be the only problem you may have with text extraction. Some 
PDF files may yield nothing, or completely garbled text.

> Incorrect encoding during pdf file reading
> ------------------------------------------
>
>                 Key: PDFBOX-4386
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4386
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.12
>            Reporter: Oleksandr Skoryi
>            Priority: Major
>         Attachments: Test2.pdf, image-2018-11-26-21-06-57-022.png
>
>
> Hello everybody, I use PDFBox to scrape text from the attached PDF.
> The issue is with the double "ff" in "Kaffee-Pads".
> I downloaded the PDF debugger and found that it is a symbol with Unicode 
> code point 31; however, I think it is a bug. Sincerely waiting for your reply.
> !image-2018-11-26-21-06-57-022.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

