[ https://issues.apache.org/jira/browse/TIKA-911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14342716#comment-14342716 ]
Tyler Palsulich commented on TIKA-911:
--------------------------------------

Still seeing this issue (question marks instead of spaces) on a Mac with Tika 1.8-SNAPSHOT.

{{mvn -version}}:
{code}
Apache Maven 3.2.3 (33f8c3e1027c3ddde99d3cdebad2656a31e8fdf4; 2014-08-11T16:58:10-04:00)
Maven home: /usr/local/Cellar/maven/3.2.3/libexec
Java version: 1.7.0_71, vendor: Oracle Corporation
Java home: /Library/Java/JavaVirtualMachines/jdk1.7.0_71.jdk/Contents/Home/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "mac os x", version: "10.10.2", arch: "x86_64", family: "mac"
{code}

> Converted PDF document contains question marks in place of spaces and inconsistent case
> ---------------------------------------------------------------------------------------
>
> Key: TIKA-911
> URL: https://issues.apache.org/jira/browse/TIKA-911
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.8
> Reporter: Matt Sheppard
> Attachments: Rust Biosecurity Brochure.pdf, Rust Biosecurity Brochure.pdf.html
>
>
> The PDF document at http://www.grdc.com.au/uploads/documents/Rust%20Biosecurity%20Brochure.pdf, when converted with tika v1.1 using
> {code}
> $ java -jar tika-app-1.1.jar Rust\ Biosecurity\ Brochure.pdf
> {code}
> Produces substantially worse output than xpdf's pdftotext program.
> Specifically, we see...
> Some 'spaces' replaced with question marks
> {noformat}
> ...
> <body><div class="page"><p/>
> <p>How can I help?
> When you're overseas:
> • ?wherever?possible,?don't?visit?crops?—?contact?with?
> </p>
> <p>growing?crops?greatly?increases?the?risk?of?contaminating?
> footwear?or?clothing;?
> ...
> {noformat}
> and some odd case conversions
> {noformat}
> <p>stem rust in wheat.
> (soURce: BRAd collIs)</p>
> <p/>
> </div>
> {noformat}
> (The original document seems to contain "SOURCE: BRAD COLLIS" all in upper case.)
> To compare that with pdftotext
> {code}
> $ ./xpdfbin-linux-3.03/bin32/pdftotext -enc UTF-8 -q ~/Rust\ Biosecurity\ Brochure.pdf
> {code}
> This does not output the question marks, and produces "Source: BRAD COLLIS" at the end there, both of which seem to be improvements. Note that it does, however, produce a number of ^G characters which are not desirable.
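
For anyone re-testing against a newer build, a rough way to check whether extracted text still shows the symptom is to count question marks wedged between two non-whitespace characters (as in "visit?crops") and compare that with the number of real spaces. The sketch below is only a heuristic under that assumption. The function names are made up for illustration, it is not part of Tika or xpdf, and it says nothing about the underlying cause.

```python
import re

# A '?' with a non-whitespace character on both sides, e.g. "visit?crops".
# A normal sentence-final '?' is followed by whitespace or end of text,
# so it is not matched.
_EMBEDDED_QMARK = re.compile(r"(?<=\S)\?(?=\S)")


def embedded_question_marks(text):
    """Count '?' characters that sit where a space would be expected."""
    return len(_EMBEDDED_QMARK.findall(text))


def looks_affected(text):
    """Heuristic (an assumption, not a Tika rule): flag the text when
    embedded '?' characters outnumber ordinary spaces."""
    return embedded_question_marks(text) > text.count(" ")


if __name__ == "__main__":
    # Sample lines taken from the output quoted in this issue.
    bad = "growing?crops?greatly?increases?the?risk?of?contaminating?"
    good = "How can I help?"
    print(embedded_question_marks(bad), looks_affected(bad))
    print(embedded_question_marks(good), looks_affected(good))
```

Running the same check over the tika-app output and the pdftotext output for the attached PDF would give a quick side-by-side number for each tool.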