You may want to try out pdfminer. It's very similar to xpdf in structure and
should give you the parsed text as Unicode directly.
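
Something along these lines might work as a starting point. It is only a
sketch and untested against your files: it assumes the pdfminer.six fork is
installed (its pdfminer.high_level.extract_text function), and the names
devanagari_ratio, pdf_to_text and convert_folder, as well as the 0.5
threshold, are made up for illustration. The ratio check is there because
embedded fonts with a custom encoding can still come out as wrong characters,
so it flags files whose output does not look like Devanagari.

```python
from pathlib import Path


def devanagari_ratio(text):
    """Fraction of non-whitespace characters in the Devanagari block (U+0900-U+097F)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    hits = sum(1 for c in chars if "\u0900" <= c <= "\u097f")
    return hits / len(chars)


def pdf_to_text(pdf_path):
    """Extract the text of one PDF as a Unicode string.

    Assumption: pdfminer.six is installed. It is imported lazily so that
    devanagari_ratio() stays usable without it.
    """
    from pdfminer.high_level import extract_text
    return extract_text(str(pdf_path))


def convert_folder(folder, threshold=0.5):
    """Write a UTF-8 .txt next to every PDF; return names of files that look garbled."""
    suspicious = []
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        text = pdf_to_text(pdf_path)
        pdf_path.with_suffix(".txt").write_text(text, encoding="utf-8")
        # Hypothetical cut-off: pages that mix Hindi with a lot of Latin text
        # (the Agenda fonts) may need a lower value.
        if devanagari_ratio(text) < threshold:
            suspicious.append(pdf_path.name)
    return suspicious
```

Calling convert_folder on the directory holding the 45 pdfs would then give
you the list of files worth checking by hand.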

On Mon, May 24, 2010 at 7:13 PM, Eknath Venkataramani <eknath.i...@gmail.com
> wrote:

> I have around 45 pdfs to convert into raw text containing text in _HINDI_ .
> When I use the xpdf package, the generated text is very weird, so I'd like
> to write a program which would convert the pdf text into Unicode text as it
> is.
>
> The fonts used in the pdfs:
> name                                 type              emb sub uni object ID
> ------------------------------------ ----------------- --- --- --- ---------
> APKAPP+Usha-Bold                     Type 1C           yes yes yes     72  0
> APKBBB+Agenda-Light                  Type 1C           yes yes yes     77  0
> APKBGF+Usha                          Type 1C           yes yes yes     41  0
> APKBKJ+Agenda-Medium                 Type 1C           yes yes yes     46  0
> APKBON+Agenda-Bold                   Type 1C           yes yes yes     49  0
>
> For example, the pdf contains: आदमी मुसाफिर है
> When I use pdftotext, I get some very weird symbols: '... .......'
> whereas I'd like 'आदमी मुसाफिर है' to be the output.
>
>
> --
> Eknath Venkataramani
> _______________________________________________
> BangPypers mailing list
> BangPypers@python.org
> http://mail.python.org/mailman/listinfo/bangpypers
>



-- 
--------------------------------------------------------
blog: http://blog.dhananjaynene.com
twitter: http://twitter.com/dnene
