[ https://issues.apache.org/jira/browse/TIKA-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15418723#comment-15418723 ]
Angela Onslow edited comment on TIKA-2054 at 8/12/16 11:48 AM:
---------------------------------------------------------------

Here is a file which demonstrates this problem (see attachments)


was (Author: ang...@erevalue.com):

Here is a file which demonstrates this problem

> Problem with ligatures converting from PDF to HTML with Tika
> ------------------------------------------------------------
>
>                 Key: TIKA-2054
>                 URL: https://issues.apache.org/jira/browse/TIKA-2054
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.11, 1.13
>            Reporter: Angela Onslow
>         Attachments: 2482_2014_DAVIDE+CAMPARI-MILANO+SPA_SUSTY-AR.pdf
>
>
> When converting certain PDFs to HTML, I am having trouble with ligature
> characters being displayed as U+FFFD � REPLACEMENT CHARACTER.
> I have tried Apache Tika 1.11 and 1.13, converting on the command line
> using the .jar, and get the same results with both.
> If I use pdfbox-app-2.0.1.jar and 'ExtractText' with icu4j-57_1.jar in
> the path, and I convert to text rather than HTML, then I am able to at least
> preserve information about what each ligature was originally, even if the
> ligatures are still represented as unprintable characters.
> I.e. if I run the following from the command line:
> java -jar pdfbox-app-1.8.12.jar ExtractText 'test.pdf' 'test.txt'
> then the resulting test.txt, when viewed in Sublime2, has "fi" represented as
> US (the unit separator character), "ff" represented as RS, "fl" represented
> as GS and "ffl" represented as FS, which I could then replace with the
> appropriate characters.
> I was under the impression that Tika uses icu4j. Is there a way to get the
> same behaviour I see with PDFBox when converting from PDF to HTML with Tika?

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
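
The workaround described in the report (replacing the control characters in the PDFBox text output with the ligatures they stand for) could be sketched roughly as below. This is only an illustration in Python, not anything Tika or PDFBox provides: the function name restore_ligatures is made up here, and the mapping is simply the one observed for this particular PDF, so other files or fonts may well use different codes.

```python
# Mapping observed in the report for this one PDF (an assumption for any
# other file): each ASCII control character stands for one ligature.
LIGATURE_FOR_CONTROL_CHAR = {
    "\x1f": "fi",   # US, unit separator
    "\x1e": "ff",   # RS, record separator
    "\x1d": "fl",   # GS, group separator
    "\x1c": "ffl",  # FS, file separator
}

# str.maketrans accepts a dict whose keys are single characters and whose
# values may be strings of any length.
_TRANSLATION_TABLE = str.maketrans(LIGATURE_FOR_CONTROL_CHAR)


def restore_ligatures(text: str) -> str:
    """Replace the four control characters with plain-letter ligatures."""
    return text.translate(_TRANSLATION_TABLE)


if __name__ == "__main__":
    # "pro" + US + "t" comes back as "profit"
    print(restore_ligatures("pro\x1ft"))
```

The same function could be applied to the contents of test.txt after running ExtractText. It does nothing for the HTML output from Tika, where the ligatures have already been turned into U+FFFD and the original information is lost.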