[jira] [Commented] (TIKA-911) Converted PDF document contains question marks in place of spaces and inconsistent case

Michael McCandless (JIRA) Wed, 02 May 2012 11:43:13 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13266801#comment-13266801
 ]


Michael McCandless commented on TIKA-911:
-----------------------------------------

So strange ... I tested on a Mac (10.6.8) with Java 1.6.0_31, and I don't see 
the ? for spaces nor the mixed case.

Hmm, my header has a different Content-Length than yours:

{noformat}
<meta name="xmpTPg:NPages" content="2"/>
<meta name="Creation-Date" content="2012-05-02T10:25:00Z"/>
<meta name="created" content="Wed May 02 06:25:00 EDT 2012"/>
<meta name="Content-Length" content="639985"/>
<meta name="Last-Modified" content="2012-05-02T10:25:00Z"/>
<meta name="producer" content="Mac OS X 10.6.8 Quartz PDFContext"/>
<meta name="Content-Type" content="application/pdf"/>
<meta name="resourceName" content="Rust Biosecurity Brochure.pdf"/>
<meta name="creator" content="Adobe InDesign CS2 (4.0.5)"/>
{noformat}

OK! If I use the PDF attached to the issue, I do indeed see these problems (I 
had downloaded mine from the web site).  Maybe the web site has since 
changed/fixed the PDF?  Hmm.

So, the extra characters (where there should be spaces) are U+FFFD (the Unicode 
replacement character); Tika outputs this whenever there is a character it 
can't safely output into the XHTML (this is done in SafeContentHandler.java).  
Tika used to (before 0.10) simply replace such characters with a space (ASCII 
32), so to get back to the pre-0.10 behaviour you can replace U+FFFD with a 
space.
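
A minimal sketch of that workaround, assuming you already have the extracted 
text as a Java String. The class and method names here are hypothetical and 
not part of Tika; this post-processes Tika's output rather than changing Tika 
itself. Note it turns every U+FFFD into a space, including any that did not 
stand in for a space in the original document.

```java
public class ReplacementCharFix {
    // U+FFFD, the Unicode replacement character that shows up as '?' in the output.
    static final char REPLACEMENT_CHAR = '\uFFFD';

    /** Replace every U+FFFD in already-extracted text with an ASCII space (code 32). */
    public static String restoreSpaces(String extracted) {
        return extracted.replace(REPLACEMENT_CHAR, ' ');
    }

    /** Count U+FFFD occurrences, to check whether a document is affected at all. */
    public static long countReplacementChars(String text) {
        return text.chars().filter(c -> c == REPLACEMENT_CHAR).count();
    }

    public static void main(String[] args) {
        // Sample modelled on the reported output, with U+FFFD where spaces belong.
        String sample = "wherever\uFFFDpossible,\uFFFDdon't\uFFFDvisit\uFFFDcrops";
        System.out.println(countReplacementChars(sample)); // prints 4
        System.out.println(restoreSpaces(sample));         // prints wherever possible, don't visit crops
    }
}
```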

Not sure about the mixed case issue...

                
> Converted PDF document contains question marks in place of spaces and 
> inconsistent case
> ---------------------------------------------------------------------------------------
>
>                 Key: TIKA-911
>                 URL: https://issues.apache.org/jira/browse/TIKA-911
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.1
>            Reporter: Matt Sheppard
>         Attachments: Rust Biosecurity Brochure.pdf, Rust Biosecurity 
> Brochure.pdf.html
>
>
> The PDF document at 
> http://www.grdc.com.au/uploads/documents/Rust%20Biosecurity%20Brochure.pdf, 
> when converted with tika v1.1 using
> {code}
> $ java -jar tika-app-1.1.jar Rust\ Biosecurity\ Brochure.pdf
> {code}
> Produces substantially worse output than xpdf's pdftotext program.
> Specifically, we see...
> Some 'spaces' replaced with question marks
> {noformat}
> ...
> <body><div class="page"><p/>
> <p>How can I help?
> When you're overseas:
> • ?wherever?possible,?don't?visit?crops?—?contact?with?
> </p>
> <p>growing?crops?greatly?increases?the?risk?of?contaminating?
> footwear?or?clothing;?
> ...
> {noformat}
> and some odd case conversions
> {noformat}
> <p>stem rust in wheat.  
>  (soURce: BRAd collIs)</p>
> <p/>
> </div>
> {noformat}
> (The original document seems to contain "SOURCE: BRAD COLLIS" all in upper 
> case.)
> To compare that with pdftotext
> {code}
> $ ./xpdfbin-linux-3.03/bin32/pdftotext -enc UTF-8 -q ~/Rust\ Biosecurity\ 
> Brochure.pdf
> {code}
> This does not output the question marks, and produces "Source: BRAD COLLIS" 
> at the end there, both of which seem to be improvements. Note that it does, 
> however, produce a number of ^G characters which are not desirable.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
