Hi,

On 20.07.2012 10:02, "Andreas Lehmkühler" wrote:
Hi,


Stephen Haggai <[email protected]> wrote on 20 July 2012 at 05:44:



Hi,

I have looked at the PDF file. It looks as if the text on all the pages was
scanned as images. I am fairly certain that PDFBox cannot extract text from
(text scanned as) images. Could someone correct me if I am wrong?


You are correct. The PDFs consist of scanned text, and yes, PDFBox can't extract

I have to correct myself. There is no single image containing the scanned text. The page content consists of thousands of small lines, like the following three:


319.5 3175.84 m
319.5 3175.84 l
S
353.5 3175.84 m
353.5 3175.84 l
S
376.5 3175.84 m
376.5 3175.84 l
S
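
To make the pattern explicit: each of these paths moves to a point ("m"), draws a line to the very same point ("l") and strokes it ("S"), so it has zero length. The following is only a rough sketch in plain Java, not PDFBox API; the class and method names are made up for illustration, and it assumes a whitespace-separated content stream fragment that uses nothing but numbers and simple path operators. It counts such zero-length strokes:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox: scans a content stream fragment
// for paths of the form "x y m  x y l  S" (a stroked line of zero length).
public class DotSegments {

    public static List<double[]> findZeroLengthStrokes(String contentStream) {
        List<double[]> dots = new ArrayList<>();
        List<Double> operands = new ArrayList<>();
        double[] start = null;  // point set by the last "m" (moveto)
        double[] end = null;    // point set by the last "l" (lineto)
        int lineCount = 0;      // number of "l" operators in the current path

        for (String token : contentStream.trim().split("\\s+")) {
            if (token.equals("m") || token.equals("l")) {
                double[] point = lastPoint(operands);
                operands.clear();
                if (token.equals("m")) {
                    start = point;
                    end = null;
                    lineCount = 0;
                } else {
                    end = point;
                    lineCount++;
                }
            } else if (token.equals("S")) {
                // Stroke: record the path if it is one line back to its start.
                if (start != null && end != null && lineCount == 1
                        && start[0] == end[0] && start[1] == end[1]) {
                    dots.add(start);
                }
                operands.clear();
                start = null;
                end = null;
                lineCount = 0;
            } else {
                try {
                    operands.add(Double.parseDouble(token));
                } catch (NumberFormatException e) {
                    // Simplification: any other operator abandons the path.
                    operands.clear();
                    start = null;
                    end = null;
                    lineCount = 0;
                }
            }
        }
        return dots;
    }

    // Returns the last two operands as a point, or null if there are fewer.
    private static double[] lastPoint(List<Double> operands) {
        int n = operands.size();
        if (n < 2) {
            return null;
        }
        return new double[] { operands.get(n - 2), operands.get(n - 1) };
    }

    public static void main(String[] args) {
        String sample = "319.5 3175.84 m\n319.5 3175.84 l\nS\n"
                + "353.5 3175.84 m\n353.5 3175.84 l\nS\n"
                + "376.5 3175.84 m\n376.5 3175.84 l\nS\n";
        System.out.println(findZeroLengthStrokes(sample).size()
                + " zero-length strokes");
    }
}
```

For the three paths quoted above this reports three zero-length strokes. There are no text operators in such a stream at all, so there is nothing for a text stripper to pick up.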

So if you want to feed an image to OCR software, you first have to render the pages with PDFToImage.


BR
Andreas Lehmkühler

that text, only the images. Those could be fed to OCR software to get the
text. I didn't try that, but it should work more or less precisely.

BTW: It is always a good idea to first try extracting the text with Acrobat
Reader. Just select the text, copy it and paste it into an editor. If that
doesn't work, it most likely won't work with PDFBox either.



Thanks,
Stephen

-----Original Message-----
From: Big Donkeys [mailto:[email protected]]
Sent: Friday, 20 July 2012 6:09 AM
To: [email protected]
Subject: Can't extract text Adobe-WinCharSetFFFF-UCS2

Hi, I'm having some trouble extracting text from some South Korean PDF
files using PDFTextStripper.  When I try, I get a "severe error could not parse
predefined CMAP file for 'Adobe-WinCharSetFFFF-UCS2'" message and then
get some gibberish.  The file opens and displays fine in Adobe Reader.
   I'm using pdfbox-app-1.7.0.jar.

Here is a link to an example PDF that gives me trouble:

http://eng.khoa.go.kr/inc/func/fileDownloadBlob_nori.asp?cmsCd=CM0237&ntNo=626&fNo=4

Any ideas?


Br
Andreas Lehmkühler
