Matt Sheppard created TIKA-911:
----------------------------------

Summary: Converted PDF document contains question marks in place of spaces and inconsistent case
Key: TIKA-911
URL: https://issues.apache.org/jira/browse/TIKA-911
Project: Tika
Issue Type: Bug
Affects Versions: 1.1
Reporter: Matt Sheppard

The PDF document at http://www.grdc.com.au/uploads/documents/Rust%20Biosecurity%20Brochure.pdf, when converted with Tika 1.1 using

{code}
$ java -jar tika-app-1.1.jar Rust\ Biosecurity\ Brochure.pdf
{code}

produces substantially worse output than xpdf's pdftotext program. Specifically, we see some spaces replaced with question marks

{noformat}
...
<body><div class="page"><p/>
<p>How can I help? When you're overseas: • ?wherever?possible,?don't?visit?crops?—?contact?with?
</p>
<p>growing?crops?greatly?increases?the?risk?of?contaminating?
footwear?or?clothing;?
...
{noformat}

and some odd case conversions

{noformat}
<p>stem rust in wheat. (soURce: BRAd collIs)</p>
<p/>
</div>
{noformat}

(The original document seems to contain "SOURCE: BRAD COLLIS" all in upper case.)

To compare that with pdftotext

{code}
$ ./xpdfbin-linux-3.03/bin32/pdftotext -enc UTF-8 -q ~/Rust\ Biosecurity\ Brochure.pdf
{code}

This does not output the question marks, and produces "Source: BRAD COLLIS" at the end there, both of which seem to be improvements. Note that it does, however, produce a number of ^G characters, which are undesirable.
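
One rough way to quantify the question-mark symptom when comparing the two tools' output is to count question marks that sit directly between two non-whitespace characters, since a genuine question mark is normally followed by a space or a line end. The following is only a sketch of that heuristic: the function name `count_suspect_question_marks` is hypothetical, it is not part of Tika or xpdf, and it will also count legitimate cases such as URL query strings.

```python
import re

# Heuristic (an assumption, not part of Tika or xpdf): a '?' with a
# non-whitespace character on both sides is likely a mis-decoded space.
SUSPECT_QMARK = re.compile(r"(?<=\S)\?(?=\S)")


def count_suspect_question_marks(text: str) -> int:
    """Count '?' characters flanked on both sides by non-whitespace."""
    return len(SUSPECT_QMARK.findall(text))


if __name__ == "__main__":
    # Sample lines taken from the Tika output quoted in this report.
    garbled = "growing?crops?greatly?increases?the?risk?of?contaminating?"
    normal = "How can I help? When you're overseas:"
    print(count_suspect_question_marks(garbled))
    print(count_suspect_question_marks(normal))
```

Running the same function over the saved text output of each tool would give a simple number to compare, although a question mark at the very start or end of a line is not counted by this pattern.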