[ https://issues.apache.org/jira/browse/TIKA-911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13266502#comment-13266502 ]
Matt Sheppard commented on TIKA-911:
------------------------------------

Confirmed that it still occurs for me on a different mac (with freshly downloaded PDF and tika-app-1.1.jar).

{noformat}
mercury:Downloads matt$ system_profiler SPSoftwareDataType
Software:

    System Software Overview:

      System Version: Mac OS X 10.7.3 (11D50d)
      Kernel Version: Darwin 11.3.0
      Boot Volume: Macintosh HD
      Boot Mode: Normal
      Computer Name: Mercury
      User Name: Matthew Sheppard (matt)
      Secure Virtual Memory: Enabled
      64-bit Kernel and Extensions: Yes
      Time since boot: 3 days 1:10

mercury:Downloads matt$ java -version
java version "1.6.0_31"
Java(TM) SE Runtime Environment (build 1.6.0_31-b04-415-11M3635)
Java HotSpot(TM) 64-Bit Server VM (build 20.6-b01-415, mixed mode)

mercury:Downloads matt$ java -jar tika-app-1.1.jar Rust\ Biosecurity\ Brochure.pdf
<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="xmpTPg:NPages" content="2"/>
<meta name="Creation-Date" content="2008-06-06T02:53:07Z"/>
<meta name="trapped" content="False"/>
<meta name="created" content="Fri Jun 06 12:53:07 EST 2008"/>
<meta name="Content-Length" content="755665"/>
<meta name="Last-Modified" content="2008-06-06T02:53:23Z"/>
<meta name="producer" content="Adobe PDF Library 7.0"/>
<meta name="Content-Type" content="application/pdf"/>
<meta name="resourceName" content="Rust Biosecurity Brochure.pdf"/>
<meta name="creator" content="Adobe InDesign CS2 (4.0.5)"/>
<title/>
</head>
<body><div class="page"><p/>
<p>How can I help?
When youâre overseas:
⢠�wherever�possible,�donât�visit�crops�â�contact�with�
</p>
<p>growing�crops�greatly�increases�the�risk�of�contaminating�
footwear�or�clothing;�
...[snip]...
<p>Initial detection points of exotic wheat rust incursions
</p>
<p>stem rust in wheat.
(soURce: BRAd collIs)</p>
<p/>
</div>
</body></html>
{noformat}

Note that the ?s reported appear to display differently on this machine. Will attach a copy of the output as a file for reference.
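
For comparing runs across machines, here is a rough Python sketch for counting the suspect separators in saved output. It is not part of Tika. The function name and the regular expression are assumptions made for illustration, and the input is assumed to be output from the `java -jar tika-app-1.1.jar` command redirected to a file. It only counts a `?` or U+FFFD that sits directly between two non-space characters, so a genuine question mark followed by a space (as in "How can I help?") is ignored, and so is a suspect character at the end of a line.

```python
import re
import sys

# A literal "?" or the Unicode replacement character (U+FFFD) wedged between
# two non-space characters. Both symptoms appear in the outputs reported here.
SUSPECT = re.compile(r"(?<=\S)[?\uFFFD](?=\S)")


def count_suspect_separators(text: str) -> int:
    """Count '?' or U+FFFD characters that sit between two non-space characters."""
    return len(SUSPECT.findall(text))


if __name__ == "__main__" and len(sys.argv) > 1:
    # The path is whatever file the extractor output was saved to.
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        print(count_suspect_separators(handle.read()))
```

Running it against saved output from each machine, or against pdftotext output for the same PDF, gives a single number to compare instead of eyeballing the text.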
> Converted PDF document contains question marks in place of spaces and inconsistent case
> ---------------------------------------------------------------------------------------
>
>                 Key: TIKA-911
>                 URL: https://issues.apache.org/jira/browse/TIKA-911
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.1
>            Reporter: Matt Sheppard
>         Attachments: Rust Biosecurity Brochure.pdf
>
>
> The PDF document at http://www.grdc.com.au/uploads/documents/Rust%20Biosecurity%20Brochure.pdf, when converted with tika v1.1 using
> {code}
> $ java -jar tika-app-1.1.jar Rust\ Biosecurity\ Brochure.pdf
> {code}
> produces substantially worse output than xpdf's pdftotext program.
> Specifically, we see...
> Some 'spaces' replaced with question marks
> {noformat}
> ...
> <body><div class="page"><p/>
> <p>How can I help?
> When you're overseas:
> • ?wherever?possible,?don't?visit?crops?—?contact?with?
> </p>
> <p>growing?crops?greatly?increases?the?risk?of?contaminating?
> footwear?or?clothing;?
> ...
> {noformat}
> and some odd case conversions
> {noformat}
> <p>stem rust in wheat.
> (soURce: BRAd collIs)</p>
> <p/>
> </div>
> {noformat}
> (The original document seems to contain "SOURCE: BRAD COLLIS" all in upper case.)
> Compare that with pdftotext:
> {code}
> $ ./xpdfbin-linux-3.03/bin32/pdftotext -enc UTF-8 -q ~/Rust\ Biosecurity\ Brochure.pdf
> {code}
> This does not output the question marks, and produces "Source: BRAD COLLIS" at the end there, both of which seem to be improvements. Note that it does, however, produce a number of ^G characters, which are not desirable.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira