[ https://issues.apache.org/jira/browse/TIKA-1289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alex Andrushchak updated TIKA-1289:
-----------------------------------
    Attachment: PDF_text that can be copied is over the picture.pdf

In the first sentence, "Replace this file with...", the "fi" is a ligature and it is not extracted as text. I see a "?" mark instead of "fi".

> Ligatures convert on text extraction
> ------------------------------------
>
>                 Key: TIKA-1289
>                 URL: https://issues.apache.org/jira/browse/TIKA-1289
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.5
>         Environment: win 8, jre 1.5
>            Reporter: Alex Andrushchak
>         Attachments: PDF_text that can be copied is over the picture.pdf
>
>
> According to a review of the Tika sources, Tika uses PDFBox to parse PDF files.
> I found that PDFBox itself uses ICU4J to handle ligatures.
> Unfortunately, when I added the icu4j jar to my classpath, nothing changed:
> ligatures are still not converted. A sample PDF file is attached.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
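
For reference, a minimal sketch of one possible post-processing workaround for the ligature problem reported in this issue. It assumes the extracted text still contains the Unicode ligature code point (for example U+FB01 for "fi"). If the extractor has already replaced the glyph with "?", this will not recover it. The class name LigatureExpander is hypothetical and is not part of Tika or PDFBox; only java.text.Normalizer from the JDK (available since Java 6) is used.

```java
import java.text.Normalizer;

// Hypothetical helper for illustration only; not part of Tika or PDFBox.
final class LigatureExpander {
    private LigatureExpander() {}

    /**
     * Expands Unicode compatibility characters, including the Latin
     * ligatures U+FB00..U+FB06, using NFKC normalization.
     * Returns null when the input is null.
     */
    static String expand(String extracted) {
        if (extracted == null) {
            return null;
        }
        // NFKC maps U+FB01 to the two letters "f" + "i", U+FB02 to "f" + "l", etc.
        return Normalizer.normalize(extracted, Normalizer.Form.NFKC);
    }
}
```

The string returned by the parser could be passed through LigatureExpander.expand(...) before indexing or display. Note that NFKC also rewrites other compatibility characters (full-width forms, superscript digits, and so on), so it may change more than ligatures.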