[ 
https://issues.apache.org/jira/browse/TIKA-3858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tom hill updated TIKA-3858:
---------------------------
    Description: 
It appears that the issue in TIKA-1289 is still present. Ligatures get replaced 
by a question mark.

As a particular example, the ft ligature is replaced by the UTF-8 bytes 
ef bf bd (U+FFFD, the replacement character).

Is there any new resolution on this issue? Just returning the ft ligature would 
be great, or normalizing it to f, t.

This particular example comes from saving my Gmail inbox page as a PDF in 
Chrome. It uses the ft ligature in the word "Drafts".

There are many similar examples; it is not specific to one PDF generator.

I'm using tika-app-2.4.1.jar 

  was:
It appears that the issue in TIKA-1289 is still present. Ligatures get replaced 
by a question mark.

As a particular example, the ft ligature is replaced by the UTF-8 bytes 
ef bf bd (U+FFFD, the replacement character).

Is there any new resolution on this issue? Just returning the ft ligature would 
be great, or normalizing it to f, t.

This particular example comes from saving my Gmail inbox page as a PDF in 
Chrome. It uses the ft ligature in the word "Drafts".

There are many similar examples; it is not specific to one PDF generator.


>  Ligatures convert on text extraction
> -------------------------------------
>
>                 Key: TIKA-3858
>                 URL: https://issues.apache.org/jira/browse/TIKA-3858
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.5
>         Environment: win 8, jre 1.5
>            Reporter: tom hill
>            Priority: Major
>
> It appears that the issue in TIKA-1289 is still present. Ligatures get 
> replaced by a question mark.
> As a particular example, the ft ligature is replaced by the UTF-8 bytes 
> ef bf bd (U+FFFD, the replacement character).
> Is there any new resolution on this issue? Just returning the ft ligature 
> would be great, or normalizing it to f, t.
> This particular example comes from saving my Gmail inbox page as a PDF in 
> Chrome. It uses the ft ligature in the word "Drafts".
> There are many similar examples; it is not specific to one PDF generator.
> I'm using tika-app-2.4.1.jar 
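
For anyone triaging this, here is a minimal Python sketch of the two things mentioned in the description: checking that the reported bytes are U+FFFD, and normalizing a ligature to its plain letters. It is not Tika code, and the helper names (expand_ligatures, count_replacement_chars) are hypothetical. It assumes the extracted text is already available as a string. As far as I can tell, Unicode has no dedicated code point for an ft ligature (the Alphabetic Presentation Forms block covers ff, fi, fl, ffi, ffl, long s-t and st), so this kind of post-processing probably only helps for ligatures that do have one. Once the output contains U+FFFD, the original letters are gone.

```python
import unicodedata

REPLACEMENT_CHAR = "\ufffd"  # U+FFFD, encoded in UTF-8 as ef bf bd


def expand_ligatures(text: str) -> str:
    """Expand ligature code points such as U+FB02 into plain letters.

    NFKC also folds other compatibility characters, so this changes
    more than just ligatures.
    """
    return unicodedata.normalize("NFKC", text)


def count_replacement_chars(text: str) -> int:
    """Count characters that were replaced by U+FFFD during extraction."""
    return text.count(REPLACEMENT_CHAR)


# The bytes quoted in the report decode to the replacement character.
assert bytes.fromhex("efbfbd").decode("utf-8") == REPLACEMENT_CHAR

# A ligature with its own code point can be normalized after extraction.
print(expand_ligatures("\ufb02ow"))           # prints "flow"

# Text that already contains U+FFFD cannot be repaired this way.
print(count_replacement_chars("Dra\ufffds"))  # prints 1
```

Counting U+FFFD in the extracted text is also a quick way to tell whether a given PDF is affected.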



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
