Hi,
I came across this project. Its author has found a way around the encoding 
issue for Devanagari text after converting PDF to text/HTML:

https://github.com/RO-29/electoral_scraper_pdf
https://docs.google.com/document/d/1ZbY7KF4XQfJ7K3VbkcSaLW__sThIlnxcvLVN5nFKSUk/edit
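
For context, the usual way around this kind of problem is to map each legacy-font glyph code to its Unicode equivalent and then fix the character order. Below is a minimal Python sketch of that idea. The glyph table is entirely hypothetical (it is not taken from the repository above or from any real font), and the function names are my own; a real font needs its own full table, and conjuncts are not handled here.

```python
# Hypothetical glyph table: NOT from any real font, for illustration only.
LEGACY_TO_UNICODE = {
    "d": "\u0915",  # DEVANAGARI LETTER KA
    "e": "\u092E",  # DEVANAGARI LETTER MA
    "j": "\u0930",  # DEVANAGARI LETTER RA
    "k": "\u093E",  # DEVANAGARI VOWEL SIGN AA
    "f": "\u093F",  # DEVANAGARI VOWEL SIGN I
}

I_MATRA = "\u093F"


def is_consonant(ch):
    """True for the basic Devanagari consonants KA..HA (U+0915..U+0939)."""
    return "\u0915" <= ch <= "\u0939"


def legacy_to_unicode(text, table=LEGACY_TO_UNICODE):
    """Map legacy glyph codes to Unicode and reorder the short-i matra.

    Legacy fonts often store the short-i sign BEFORE its consonant
    (visual order), while Unicode stores it AFTER (logical order).
    """
    out = []
    pending_i = False
    for ch in text:
        uni = table.get(ch, ch)  # unknown characters pass through unchanged
        if uni == I_MATRA:
            pending_i = True  # hold the matra until its consonant arrives
            continue
        out.append(uni)
        if pending_i and is_consonant(uni):
            out.append(I_MATRA)
            pending_i = False
    if pending_i:  # dangling matra at end of input: keep it rather than drop it
        out.append(I_MATRA)
    return "".join(out)
```

With this made-up table, `legacy_to_unicode("jke")` gives the three code points for RA, AA sign, MA, and `legacy_to_unicode("fde")` moves the short-i sign behind KA.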

Best wishes,
Vishal Bhave

On Saturday, August 19, 2017 at 10:44:28 PM UTC+5:30, Devdatta Tengshe 
wrote:
>
> I'm attempting to read Names, Ages & Genders from Electoral Rolls, so that 
> I can create a database of Names, to figure out the General Spread of 
> Specific Names across locations, and ages.
>
> I began working with Mumbai's rolls, and am running into the following 
> issues:
>
> 1) The Electoral Rolls are not in English, but in Devanagari. This is not 
> a Major issue, because I could transliterate it into English for Comparison 
> (I need the names to be in English, so that I can use Soundex to remove 
> misspellings etc). I know libraries for transliteration that work with 
> Devanagari (Hindi & Marathi). Is there anything similar for other scripts 
> such as Kannada & Tamil etc?
>
> 2) While the Rolls are in Devanagari, the text is not actually in Unicode. 
> It is in some other font, and hence when I get the text out, it's garbage. 
> Since others have worked with the rolls before, is there a better way to 
> get the text out?
>
> 3) If it's not possible to get the text out, can we use OCR? What OCR 
> library is best at working with Indic scripts?
>
> If anyone has some experience to share on these issues, it will be much 
> appreciated.
>
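
On the Soundex step mentioned in point 1: once the names are transliterated into Latin script, the classic American Soundex code can be computed with no external library. This is only a minimal sketch, assuming plain ASCII input; the function name `soundex` is my own, and Soundex was designed for English surnames, so how well it groups Indian names is something to check on real data.

```python
def soundex(name):
    """Return the 4-character American Soundex code for an ASCII name."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""

    # Build the letter -> digit table from the standard Soundex groups.
    codes = {}
    for group, digit in (
        ("BFPV", "1"),
        ("CGJKQSXZ", "2"),
        ("DT", "3"),
        ("L", "4"),
        ("MN", "5"),
        ("R", "6"),
    ):
        for c in group:
            codes[c] = digit

    result = letters[0]
    prev = codes.get(letters[0], "")
    for c in letters[1:]:
        if c in "HW":
            continue  # H and W do not separate letters with the same code
        digit = codes.get(c, "")
        if digit and digit != prev:
            result += digit
        prev = digit  # a vowel resets prev, so a repeated code counts again
        if len(result) == 4:
            break
    return result.ljust(4, "0")  # pad short codes with zeros
```

For example, `soundex("Deshpande")` and `soundex("Deshpandey")` produce the same code, which is the kind of spelling variation this step is meant to absorb.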
