[ https://issues.apache.org/jira/browse/TIKA-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15418755#comment-15418755 ]

Tim Allison commented on TIKA-2054:
-----------------------------------

I don't think we want to modify our SafeContentHandler to stop converting 
control characters.

This is difficult.  If I understand correctly, PDFBox warns that it has no 
Unicode mapping for the ligature glyphs in these fonts:

{noformat}
Aug 12, 2016 8:03:21 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for f_i (31) in font XOILAG+MyriadPro-Bold
Aug 12, 2016 8:03:21 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for f_i (31) in font XOILAG+MyriadPro-Regular
Aug 12, 2016 8:03:21 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for f_f (30) in font XOILAG+MyriadPro-Regular
Aug 12, 2016 8:03:21 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for f_l (29) in font XOILAG+MyriadPro-Regular
Aug 12, 2016 8:03:22 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for f_f_i (28) in font XOILAG+MyriadPro-Regular
{noformat}

So "fi" is being mapped to "0x1f" (31), "ff" to "0x1e" (30), and, as you point 
out, you can recover these by a custom mapping in the output of PDFBox.  Tika 
via its SafeContentHandler converts most chars < 0x20 to '\ufffd'.
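
As a rough sketch of that custom mapping (my own illustration, not anything 
shipped in Tika or PDFBox, and assuming PDFBox 2.x): extract the text with 
PDFTextStripper and then translate the information-separator control codes 
back to the ligatures the warnings above suggest.  The code-to-ligature 
pairing is inferred from this particular file's fonts and would need to be 
verified per document.

{noformat}
import java.io.File;
import java.util.HashMap;
import java.util.Map;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class LigatureRecovery {

    // Assumed pairing, based on the warnings above:
    // 0x1f (US) -> fi, 0x1e (RS) -> ff, 0x1d (GS) -> fl, 0x1c (FS) -> ffi
    private static final Map<Character, String> LIGATURES = new HashMap<>();
    static {
        LIGATURES.put('\u001f', "fi");
        LIGATURES.put('\u001e', "ff");
        LIGATURES.put('\u001d', "fl");
        LIGATURES.put('\u001c', "ffi");
    }

    public static void main(String[] args) throws Exception {
        try (PDDocument doc = PDDocument.load(new File(args[0]))) {
            // Plain-text extraction, same as the ExtractText command-line tool
            String raw = new PDFTextStripper().getText(doc);
            StringBuilder fixed = new StringBuilder(raw.length());
            for (int i = 0; i < raw.length(); i++) {
                char c = raw.charAt(i);
                String mapped = LIGATURES.get(c);
                fixed.append(mapped != null ? mapped : String.valueOf(c));
            }
            System.out.print(fixed);
        }
    }
}
{noformat}

Note that this has to happen on the raw PDFBox output; once the characters 
have passed through SafeContentHandler they've already become '\ufffd'.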

Adobe Reader seems to do the same thing as PDFBox, but Microsoft Edge is able 
to correctly extract, e.g., "confidentiality"... not sure how that is 
happening?!


> Problem with ligatures converting from PDF to HTML with Tika
> ------------------------------------------------------------
>
>                 Key: TIKA-2054
>                 URL: https://issues.apache.org/jira/browse/TIKA-2054
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.11, 1.13
>            Reporter: Angela O
>         Attachments: 2482_2014_DAVIDE+CAMPARI-MILANO+SPA_SUSTY-AR.pdf
>
>
> When converting certain PDFs to HTML I am having trouble with ligature 
> characters being displayed as U+FFFD � (REPLACEMENT CHARACTER).
> I have tried Apache Tika 1.11 and 1.13, converting on the command line using 
> the .jar, and get the same results.
> If I use pdfbox-app-2.0.1.jar with 'ExtractText' and icu4j-57_1.jar on the 
> path, and I convert to text rather than HTML, then I am able to at least 
> preserve information about what each ligature was originally, even if the 
> ligatures are still represented as unprintable characters.
> I.e. if I run the following from the command line:
> java -jar pdfbox-app-1.8.12.jar ExtractText 'test.pdf' 'test.txt'
> Then the resulting test.txt, when viewed in Sublime Text 2, has "fi" 
> represented as US (the unit separator character), "ff" represented as RS, 
> "fl" represented as GS, and "ffl" represented as FS, which I could then 
> replace with the appropriate characters.
> I was under the impression that Tika uses icu4j. Is there a way to get the 
> same behaviour I see with PDFBox when converting from PDF to HTML with Tika? 



