Re: TextExtraction only working after uncompressing with pdftk

Tilman Hausherr Tue, 29 Apr 2014 08:43:32 -0700

Currently there seems to be a problem with the Apache build process... either wait a few hours / days, or try building from source with svn and maven, or e-mail me and tell me which jar files you need.

Tilman

On 29.04.2014 15:05, Jonas Karlsson wrote:

Great! I will check it out when the new snapshot is available,


thanks!
_jonas


On Tue, Apr 29, 2014 at 2:02 AM, Tilman Hausherr <[email protected]> wrote:

Problem solved, see

https://issues.apache.org/jira/browse/PDFBOX-2048


Tilman



On 28.04.2014 21:17, Tilman Hausherr wrote:

  Hi,

I'm afraid we won't be able to research this deeper without the PDF. Normally,
one possibility would be to decompress the PDF and alter the data so that
personal stuff is removed, but you said that the problem goes away when
decompressing the PDF with a 3rd party product :-(

It is obvious that the PDF is somehow corrupted... you could use an
editor like NOTEPAD++ to look at the stream length values and then see the
actual length. (See the PDF spec for details, but it is rather obvious when
looking in the editor anyway).

/Length nnnn/......>>stream
.....nnnn bytes of data....
endstream

But I think this isn't the only problem in that PDF.
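
To make the manual check above less tedious, here is a minimal sketch in plain Java (standard library only). It is not part of PDFBox: the class `StreamLengthCheck` and its method names are made up for illustration. It assumes direct `/Length nnnn` entries (indirect ones such as `/Length 12 0 R` are skipped), assumes the stream data does not itself contain the word `endstream`, and uses a simple regex rather than a real PDF parser, so treat its output as a hint only.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not a PDFBox class): compares /Length values with the real stream sizes. */
class StreamLengthCheck {

    // "/Length nnnn" (direct value only), then the rest of the dictionary, then the "stream" keyword.
    // The possessive \d++ keeps the regex from shortening the number to dodge the "n g R" lookahead.
    private static final Pattern STREAM_START = Pattern.compile(
            "/Length\\s+(\\d++)(?!\\s+\\d+\\s+R).*?>>\\s*stream\\r?\\n", Pattern.DOTALL);

    /** Returns one message per stream whose declared length does not lead to "endstream". */
    static List<String> findMismatches(byte[] pdf) {
        // ISO-8859-1 maps every byte to exactly one char, so string indexes equal byte offsets.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        List<String> problems = new ArrayList<>();
        Matcher m = STREAM_START.matcher(text);
        int from = 0;
        while (from <= text.length() && m.find(from)) {
            int dataStart = m.end();
            long declared = Long.parseLong(m.group(1));
            int actualEnd = text.indexOf("endstream", dataStart);
            if (actualEnd < 0) {
                problems.add("offset " + dataStart + ": no endstream keyword found");
                break;
            }
            if (!endsAtDeclaredLength(text, dataStart, declared)) {
                problems.add("offset " + dataStart + ": /Length says " + declared
                        + ", but endstream appears " + (actualEnd - dataStart)
                        + " bytes after the start of the data");
            }
            from = actualEnd + "endstream".length();
        }
        return problems;
    }

    /** True if "endstream" follows the declared number of bytes, allowing one end-of-line marker. */
    static boolean endsAtDeclaredLength(String text, int dataStart, long declared) {
        long pos = dataStart + declared;
        if (pos > text.length()) {
            return false;
        }
        int p = (int) pos;
        if (p < text.length() && text.charAt(p) == '\r') p++;
        if (p < text.length() && text.charAt(p) == '\n') p++;
        return text.startsWith("endstream", p);
    }
}
```

One way to use it would be `StreamLengthCheck.findMismatches(java.nio.file.Files.readAllBytes(path))`, printing each returned message. An empty list only means the direct length values look consistent, not that the PDF is healthy.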

Tilman



On 28.04.2014 20:56, Jonas Karlsson wrote:

Hi Tilman,
Thanks for trying to help!

With both the 1.8.5 and the 2.0.0 SNAPSHOTs, WriteDecodedDoc and
ExtractText now only give me the error

org.apache.pdfbox.pdfparser.NonSequentialPDFParser validateStreamLength
SEVERE: The end of the stream doesn't point to the correct offset, using workaround to read the stream

I'm not seeing the StreamCorruptedException anymore. However, I'm still
only getting empty text, and WriteDecodedDoc returns a PDF with blank pages.

_jonas




On Mon, Apr 28, 2014 at 2:28 PM, Tilman Hausherr <[email protected]> wrote:

Yes, but does WriteDecodedDoc now work correctly, or does it still produce
that LZW error?

About the streams issue: the error status is somewhat misleading, it
should rather be a warning, because there is a "plan B", which is to
disregard the length parameter and to read the PDF until "endstream". If
that one failed too, then there would be a new error message "Error
reading stream using length value". So I wonder if there is another
problem. Sometimes people transfer PDF files in ASCII mode from an FTP
server, which rewrites line endings and corrupts binary streams. Could
you try the text decode feature of the pdfbox app 2.0?

https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/2.0.0-SNAPSHOT/

command:

java -jar pdfbox-app-2.0.0-SNAPSHOT.jar ExtractText -nonSeq PDF.pdf
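
The "plan B" mentioned above (trust the length first, and if "endstream" is not where the length says, read until "endstream" instead) can be sketched roughly as follows. This is an illustration in plain Java, not the actual PDFBox code: the class `StreamFallbackReader`, its methods, and the exception message are all assumptions. It also assumes the caller already knows the byte offset where the stream data begins.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical sketch of a length-based read with an "endstream" scan as fallback. */
class StreamFallbackReader {

    private static final byte[] ENDSTREAM = "endstream".getBytes(StandardCharsets.US_ASCII);

    static byte[] readStream(byte[] pdf, int dataStart, int declaredLength) {
        if (dataStart < 0 || dataStart > pdf.length) {
            throw new IllegalArgumentException("dataStart outside of the file: " + dataStart);
        }
        // Plan A: use /Length, but only if "endstream" really follows the declared range.
        long declaredEnd = (long) dataStart + declaredLength;
        if (declaredLength >= 0 && declaredEnd <= pdf.length
                && keywordFollows(pdf, (int) declaredEnd)) {
            return Arrays.copyOfRange(pdf, dataStart, (int) declaredEnd);
        }
        // Plan B: ignore /Length and take everything up to the next "endstream".
        int keyword = indexOf(pdf, ENDSTREAM, dataStart);
        if (keyword < 0) {
            throw new UncheckedIOException(
                    new IOException("no endstream keyword after offset " + dataStart));
        }
        int end = keyword;
        // Drop the single end-of-line marker that normally precedes the keyword.
        if (end > dataStart && pdf[end - 1] == '\n') end--;
        if (end > dataStart && pdf[end - 1] == '\r') end--;
        return Arrays.copyOfRange(pdf, dataStart, end);
    }

    /** True if "endstream" starts at pos, after skipping at most one CR and one LF. */
    static boolean keywordFollows(byte[] pdf, int pos) {
        if (pos < pdf.length && pdf[pos] == '\r') pos++;
        if (pos < pdf.length && pdf[pos] == '\n') pos++;
        return matchesAt(pdf, ENDSTREAM, pos);
    }

    static boolean matchesAt(byte[] data, byte[] needle, int pos) {
        if (pos < 0 || pos + needle.length > data.length) {
            return false;
        }
        for (int i = 0; i < needle.length; i++) {
            if (data[pos + i] != needle[i]) return false;
        }
        return true;
    }

    static int indexOf(byte[] data, byte[] needle, int from) {
        for (int i = Math.max(from, 0); i + needle.length <= data.length; i++) {
            if (matchesAt(data, needle, i)) return i;
        }
        return -1;
    }
}
```

The fallback is only a guess at the real stream boundary: if the compressed data happens to contain the bytes "endstream", or if line endings were rewritten during transfer, the recovered bytes can still be wrong, which would fit a parser that reports success but yields blank pages.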


Tilman


On 28.04.2014 18:21, Jonas Karlsson wrote:

   Hi Tilman,

I tried the 1.8.5-SNAPSHOT and get the same result as before. No text and

Apr 28, 2014 12:20:48 PM org.apache.pdfbox.pdfparser.NonSequentialPDFParser validateStreamLength
SEVERE: The end of the stream doesn't point to the correct offset, using workaround to read the stream

_jonas

On Mon, Apr 28, 2014 at 11:04 AM, Tilman Hausherr <[email protected]> wrote:

There was a (recently fixed) bug with the LZW decoder, please try the
current snapshot and tell us what happens:
https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox/1.8.5-SNAPSHOT/

Tilman

On 28.04.2014 17:00, Jonas Karlsson wrote:

    java.io.StreamCorruptedException: Error: data is null

      at org.apache.pdfbox.filter.LZWFilter.decode(LZWFilter.java:82)
