Re: Fwd: Junk Characters while Extracting text from pdf file.

Peter Murray-Rust Wed, 06 Feb 2013 09:19:58 -0800

On Wed, Feb 6, 2013 at 10:24 AM, kulbhushan singh <[email protected]> wrote:


> Hi Andreas,
>
> I did the Adobe test and it gives me the same junk characters as PDFBox. I
> also tried "save as text.." but the result is the same. In the PDF properties
> I found that the encoding is Identity-H. I googled this encoding and found
> that many others also have the same problem.
>

Identity-H is a problem. We will probably have to interpret the glyph
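
To illustrate why, here is a minimal Python sketch (not PDFBox code). It
assumes the usual Identity-H behaviour: each character code in a text string
is two bytes, big-endian, and maps directly to a CID, so the bytes carry no
Unicode meaning on their own. The byte string and the `to_unicode` table are
made-up illustration data, and both function names are hypothetical. When the
PDF has no usable ToUnicode map, every lookup fails and the extractor can only
emit junk.

```python
def identity_h_to_cids(raw):
    """Split an Identity-H string into 2-byte big-endian codes (code == CID)."""
    if len(raw) % 2:
        raise ValueError("Identity-H strings use two bytes per character code")
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]


def cids_to_text(cids, to_unicode):
    """Map CIDs to text; unmapped CIDs become U+FFFD (the 'junk' case)."""
    return "".join(to_unicode.get(cid, "\ufffd") for cid in cids)


# Hypothetical string operand and ToUnicode table, for illustration only.
raw = bytes.fromhex("002b0048004f004f0052")
to_unicode = {0x002B: "H", 0x0048: "e", 0x004F: "l", 0x0052: "o"}

cids = identity_h_to_cids(raw)
with_map = cids_to_text(cids, to_unicode)  # readable text
without_map = cids_to_text(cids, {})       # no ToUnicode map: nothing recoverable

print(cids)
print(with_map)
print(without_map)
```

Without that table the only remaining evidence of what a character is is the
glyph outline itself, which is why the alternatives are glyph interpretation
or OCR.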

P.



>
> In my PDF I am not even able to search for any text. Are OCR and glyph
> interpretation my only options for extracting text from it? Or is there any
> other way to go about this?
>
> Regards, Kulbhushan
>



-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069
