Re: [BangPypers] extracting unicode text from pdfs

Dhananjay Nene Mon, 24 May 2010 07:22:07 -0700

You may want to try out pdfminer. It's very similar to xpdf in structure and
should give you the parsed data as Unicode directly.
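
As a rough sketch of what that could look like today, assuming the
pdfminer.six fork and its pdfminer.high_level.extract_text() function (the
2010 release was normally driven through its pdf2txt.py script instead): the
first function pulls the text out of one PDF, and the second, a hypothetical
helper named devanagari_ratio, reports how much of the result falls in the
Devanagari block, which is a quick way to spot garbled output. Whether the
text comes out as readable Hindi still depends on the Unicode mappings
embedded in each PDF's fonts.

```python
import sys


def extract_text_from_pdf(path):
    """Return the text of a PDF as a Python str (Unicode)."""
    # Assumption: pdfminer.six is installed (pip install pdfminer.six).
    # Imported lazily so the helper below still works without it.
    from pdfminer.high_level import extract_text
    return extract_text(path)


def devanagari_ratio(text):
    """Fraction of non-whitespace characters in the Devanagari block.

    Hypothetical sanity check: Hindi text should score close to 1.0,
    while garbled Latin or symbol output scores close to 0.0.
    """
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    # U+0900..U+097F is the Unicode Devanagari block.
    hits = sum(1 for c in chars if "\u0900" <= c <= "\u097f")
    return hits / len(chars)


if __name__ == "__main__":
    # Usage: python check_pdfs.py file1.pdf file2.pdf ...
    for pdf_path in sys.argv[1:]:
        text = extract_text_from_pdf(pdf_path)
        print(f"{pdf_path}: {devanagari_ratio(text):.0%} Devanagari")
```

Running it over all 45 files would show at a glance which ones extract
cleanly and which ones need a closer look at their fonts.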


On Mon, May 24, 2010 at 7:13 PM, Eknath Venkataramani <eknath.i...@gmail.com> wrote:

> I have around 45 PDFs containing text in _HINDI_ to convert into raw text.
> When I use the xpdf package, the generated text is very weird, so I'd like
> to write a program that converts the PDF text into Unicode text as it
> is.
>
> The fonts used in the pdfs:
> name                                 type              emb sub uni object ID
> ------------------------------------ ----------------- --- --- --- ---------
> APKAPP+Usha-Bold                     Type 1C           yes yes yes     72  0
> APKBBB+Agenda-Light                  Type 1C           yes yes yes     77  0
> APKBGF+Usha                          Type 1C           yes yes yes     41  0
> APKBKJ+Agenda-Medium                 Type 1C           yes yes yes     46  0
> APKBON+Agenda-Bold                   Type 1C           yes yes yes     49  0
>
> For example, the PDF contains: आदमी मुसाफिर है
> When I use pdftotext, I get some very weird symbols: '... .......'
> whereas I'd like the output to be 'आदमी मुसाफिर है'.
>
>
> --
> Eknath Venkataramani
> _______________________________________________
> BangPypers mailing list
> BangPypers@python.org
> http://mail.python.org/mailman/listinfo/bangpypers
>



-- 
--------------------------------------------------------
blog: http://blog.dhananjaynene.com
twitter: http://twitter.com/dnene
