Hi,

Very weird. The best approach is to open an issue in JIRA and attach the PDF and the extracted text. Include the description (i.e. your two postings) and the exact command line you used. Just to be sure, try again with the -nonSeq option.

If you attached the PDF here, the mailing list software has probably removed it.

Tilman

On 09.06.2014 14:54, Walter Kehl wrote:
Hi Tilman,

This is definitely not an OCR'ed file. It is an official report from a
financial institution and was created with the Adobe PDF Library. Copying
and pasting from it also works fine.

The interesting fact, however, is that some portions of text appear twice in
the output: first correctly and then corrupted. I have attached an output
created with the PDFBox command-line tool.
If you compare lines 357-365 with lines 421-429, you will see that it is the
same paragraph, first intact and then with characters missing. In the original
source this paragraph occurs only once.
The same seems to happen for the other instances where text is corrupted.
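
One way to confirm that a corrupted copy differs from the intact one only by dropped characters is a subsequence check. The following is a minimal sketch in plain Java with no PDFBox calls; the class and method names are hypothetical, and the "intact" sentence is an assumed reconstruction of the corrupted sample quoted further down in this thread.

```java
// Hypothetical diagnostic helper, not part of PDFBox.
final class DroppedCharCheck {

    // Remove all whitespace, because the damaged output shows gaps
    // at the positions where characters were lost.
    static String stripWhitespace(String s) {
        return s.replaceAll("\\s+", "");
    }

    // True if every character of 'candidate' occurs in 'reference'
    // in the same order (characters may be skipped in 'reference').
    static boolean isSubsequence(String candidate, String reference) {
        int i = 0;
        for (int j = 0; j < reference.length() && i < candidate.length(); j++) {
            if (candidate.charAt(i) == reference.charAt(j)) {
                i++;
            }
        }
        return i == candidate.length();
    }

    // True if 'damaged' is strictly shorter than 'intact' (ignoring whitespace)
    // and can be produced from it by deleting characters only.
    static boolean looksLikeDamagedCopy(String damaged, String intact) {
        String d = stripWhitespace(damaged);
        String o = stripWhitespace(intact);
        return d.length() < o.length() && isSubsequence(d, o);
    }

    public static void main(String[] args) {
        // Assumed intact wording; only the damaged form appears in the thread.
        String intact = "There is also concern that bankers may be pushed to misprice risk";
        String damaged = "There is al o con ern that b nkers may be pushed to misprice risk";
        System.out.println(looksLikeDamagedCopy(damaged, intact));
    }
}
```

Running such a check over each pair of duplicated paragraphs would show whether the second copy only loses characters or also contains substitutions, which may be useful detail for a bug report.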

Best
Walter




-----Original Message-----
From: Tilman Hausherr [mailto:[email protected]]
Sent: Monday, 9 June 2014 12:19
To: [email protected]
Subject: Re: Corrupted words when using PDFTextStripper

This could be an OCRed file. Try copy & paste from Acrobat Reader to see
whether you get the same result.

Tilman

On 09.06.2014 11:55, Walter Kehl wrote:
Hi,

I am new to the list so I don't know whether this has been asked before:

I am using PDFTextStripper (embedded in another application) to get
the raw text of PDFs, so far with good results. Recently, however, a
PDF file turned up for which the PDFTextStripper output was corrupted.
I got sentences like:

"There is al o con ern that b nkers may be pushed to misprice risk
(No. 6) by the pres ures of c mpetition and an abunda ce of central b
nk-provided liquidity."

where characters seem to be missing. Does anyone have any idea what
went wrong here and how I could prevent it?

Thanks for your help

Walter Kehl


