[ 
https://issues.apache.org/jira/browse/TIKA-911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13266502#comment-13266502
 ] 

Matt Sheppard commented on TIKA-911:
------------------------------------

Confirmed that it still occurs for me on a different Mac (with a freshly 
downloaded PDF and tika-app-1.1.jar).

{noformat}
mercury:Downloads matt$ system_profiler SPSoftwareDataType
Software:

    System Software Overview:

      System Version: Mac OS X 10.7.3 (11D50d)
      Kernel Version: Darwin 11.3.0
      Boot Volume: Macintosh HD
      Boot Mode: Normal
      Computer Name: Mercury
      User Name: Matthew Sheppard (matt)
      Secure Virtual Memory: Enabled
      64-bit Kernel and Extensions: Yes
      Time since boot: 3 days 1:10

mercury:Downloads matt$ java -version
java version "1.6.0_31"
Java(TM) SE Runtime Environment (build 1.6.0_31-b04-415-11M3635)
Java HotSpot(TM) 64-Bit Server VM (build 20.6-b01-415, mixed mode)
mercury:Downloads matt$ java -jar tika-app-1.1.jar Rust\ Biosecurity\ Brochure.pdf
<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="xmpTPg:NPages" content="2"/>
<meta name="Creation-Date" content="2008-06-06T02:53:07Z"/>
<meta name="trapped" content="False"/>
<meta name="created" content="Fri Jun 06 12:53:07 EST 2008"/>
<meta name="Content-Length" content="755665"/>
<meta name="Last-Modified" content="2008-06-06T02:53:23Z"/>
<meta name="producer" content="Adobe PDF Library 7.0"/>
<meta name="Content-Type" content="application/pdf"/>
<meta name="resourceName" content="Rust Biosecurity Brochure.pdf"/>
<meta name="creator" content="Adobe InDesign CS2 (4.0.5)"/>
<title/>
</head>
<body><div class="page"><p/>
<p>How can I help?
When you’re overseas:
• �wherever�possible,�don’t�visit�crops�—�contact�with�
</p>
<p>growing�crops�greatly�increases�the�risk�of�contaminating�
footwear�or�clothing;�
...[snip]...
<p>Initial detection  
points of exotic wheat 
rust incursions
</p>
<p>stem rust in wheat.  
 (soURce: BRAd collIs)</p>
<p/>
</div>
</body></html>
{noformat}

Note that the characters reported as ?s in the issue description display 
differently on this machine (they show up as � in the output above).

Will attach a copy of the output as a file for reference.
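
One way to pin down what the ?/� characters actually are would be to list the 
non-ASCII code points in the saved output. The following is only a rough 
sketch: `non_ascii_report` is a hypothetical helper (not part of Tika), and the 
sample string merely imitates the output above.

```python
import unicodedata
from collections import Counter


def non_ascii_report(text):
    """Return (code point, Unicode name, count) for each non-ASCII character.

    Hypothetical helper for diagnosis only; not part of Tika.
    """
    counts = Counter(ch for ch in text if ord(ch) > 0x7F)
    return [
        ("U+%04X" % ord(ch), unicodedata.name(ch, "<unnamed>"), n)
        for ch, n in counts.most_common()
    ]


# Example on a short string; for the real check, read the saved output
# with open(path, encoding="utf-8") and pass its contents instead.
sample = "don\u2019t\ufffdvisit\ufffdcrops"
for code, name, n in non_ascii_report(sample):
    print(code, name, n)
```

If the report lists U+FFFD, that would suggest the original character was 
already lost somewhere before the output was written; a code point such as 
U+00A0 would instead point to a real non-breaking space in the extracted text.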
                
> Converted PDF document contains question marks in place of spaces and 
> inconsistent case
> ---------------------------------------------------------------------------------------
>
>                 Key: TIKA-911
>                 URL: https://issues.apache.org/jira/browse/TIKA-911
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.1
>            Reporter: Matt Sheppard
>         Attachments: Rust Biosecurity Brochure.pdf
>
>
> The PDF document at 
> http://www.grdc.com.au/uploads/documents/Rust%20Biosecurity%20Brochure.pdf, 
> when converted with tika v1.1 using
> {code}
> $ java -jar tika-app-1.1.jar Rust\ Biosecurity\ Brochure.pdf
> {code}
> produces substantially worse output than xpdf's pdftotext program.
> Specifically, we see some spaces replaced with question marks
> {noformat}
> ...
> <body><div class="page"><p/>
> <p>How can I help?
> When you're overseas:
> • ?wherever?possible,?don't?visit?crops?—?contact?with?
> </p>
> <p>growing?crops?greatly?increases?the?risk?of?contaminating?
> footwear?or?clothing;?
> ...
> {noformat}
> and some odd case conversions
> {noformat}
> <p>stem rust in wheat.  
>  (soURce: BRAd collIs)</p>
> <p/>
> </div>
> {noformat}
> (The original document seems to contain "SOURCE: BRAD COLLIS" all in upper 
> case.)
> To compare that with pdftotext:
> {code}
> $ ./xpdfbin-linux-3.03/bin32/pdftotext -enc UTF-8 -q ~/Rust\ Biosecurity\ Brochure.pdf
> {code}
> This does not output the question marks, and produces "Source: BRAD COLLIS" 
> at the end, both of which seem to be improvements. Note that it does, 
> however, produce a number of ^G characters, which are undesirable.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

