Thanks Dave!
------------------------
Chris Mattmann
[email protected]




-----Original Message-----
From: David Meikle <[email protected]>
Reply-To: <[email protected]>
Date: Monday, December 29, 2014 at 2:50 PM
To: <[email protected]>
Subject: Re: Parsing PDF files

>Hello,
>
>On 24 Dec 2014, at 20:30, A.M. Sabuncu <[email protected]> wrote:
>
>I am following the examples at http://wiki.apache.org/tika/TikaJAXRS and
>using the following curl command to test text extraction from PDF files:
>
>curl -X PUT -d @GeoSPARQL.pdf http://localhost:9998/tika --header
>"Content-type: application/pdf"
>
>On trivial PDF files (e.g. one created using Word 2010's convert-to-pdf
>functionality, containing only the text "Testing" and about 81 KB in
>size), the curl command returns nothing, and on the tika-server end I
>see the following errors:
>
>
><lots of garbage characters displayed on screen, followed by>
>
>WARNING: Did not found XRef object at specified startxref position 0
>
>
>Being new to Tika, I would like to know whether I am doing something
>wrong, or if PDF parsing is not yet an exact science.
>
>Many thanks in advance.
>
>
>Sabuncu
>
>Working through this, we discovered that we were using different commands,
>which in turn uncovered an error in the examples on the TikaJAXRS wiki
>page: all of them, regardless of the nature of the content, used the -d
>flag (effectively --data-ascii) in the curl commands.  This means that
>binary files are processed as ASCII content: curl strips carriage returns
>and newlines when it reads the file, which corrupts a PDF.
>
>Based on the above, all that was required was to change the command from:
>
>curl -X PUT -d @GeoSPARQL.pdf http://localhost:9998/tika --header
>"Content-type: application/pdf"
>
>To:
>
>curl -X PUT --data-binary @GeoSPARQL.pdf http://localhost:9998/tika
>--header "Content-type: application/pdf"
>
>I have updated the TikaJAXRS wiki page accordingly but felt it was worth
>posting back to the list for future reference.
>
>Cheers,
>Dave 
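
For anyone who wants to see the effect without a running tika-server, here
is a minimal sketch of why -d breaks binary uploads. It assumes, as the curl
manual describes, that -d @file drops carriage returns and newlines while
reading the file; tr only approximates that behaviour here, and sample.bin
is a made-up stand-in for the PDF, not a file from this thread.

```shell
#!/bin/sh
# Work in a throwaway directory so nothing is left behind.
tmpdir=$(mktemp -d)

# Hypothetical stand-in for a binary file: 9 bytes, including CR and LF,
# which real PDFs also contain.
printf 'AB\r\nCD\nEF' > "$tmpdir/sample.bin"

# Approximation of what "curl -d @file" does to the payload:
# carriage returns and newlines are removed before sending.
tr -d '\r\n' < "$tmpdir/sample.bin" > "$tmpdir/stripped.bin"

# Count bytes; the extra tr removes padding that some wc versions add.
orig_size=$(wc -c < "$tmpdir/sample.bin" | tr -d ' ')
stripped_size=$(wc -c < "$tmpdir/stripped.bin" | tr -d ' ')

# prints "original: 9 bytes, after stripping: 6 bytes"
echo "original: $orig_size bytes, after stripping: $stripped_size bytes"

rm -rf "$tmpdir"
```

A PDF locates its cross-reference table by byte offset, so removing bytes
shifts every offset in the file. That would be consistent with the XRef
warning quoted above, although this sketch does not reproduce the warning
itself. With --data-binary the file is sent unchanged.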
