Hi, I came across this project. Its author has found a way around the Devanagari encoding issue that shows up after converting the PDF to text/HTML:
https://github.com/RO-29/electoral_scraper_pdf
https://docs.google.com/document/d/1ZbY7KF4XQfJ7K3VbkcSaLW__sThIlnxcvLVN5nFKSUk/edit

Best wishes,
Vishal Bhave

On Saturday, August 19, 2017 at 10:44:28 PM UTC+5:30, Devdatta Tengshe wrote:
> I'm attempting to read names, ages and genders from electoral rolls, so
> that I can create a database of names and figure out the general spread
> of specific names across locations and ages.
>
> I began working with Mumbai's rolls, and am running into the following
> issues:
>
> 1) The electoral rolls are not in English, but in Devanagari. This is not
> a major issue, because I could transliterate the names into English for
> comparison (I need the names to be in English so that I can use Soundex
> to remove misspellings etc.). I know libraries for transliteration that
> work with Devanagari (Hindi & Marathi). Is there anything similar for
> other scripts such as Kannada and Tamil?
>
> 2) While the rolls are in Devanagari, the text is not actually in Unicode.
> It is in some other font encoding, and hence when I get the text out,
> it's garbage. Since others have worked with the rolls before, is there a
> better way to get the text out?
>
> 3) If it's not possible to get the text out, can we use OCR? What OCR
> library is best at working with Indic scripts?
>
> If anyone has some experience to share on these issues, it will be much
> appreciated.
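
On point 1 of the quoted message (using Soundex to fold misspellings once the names are transliterated into English), here is a minimal sketch in Python of how that grouping might look. It assumes the names are already in Latin script. `soundex` and `group_by_soundex` are hypothetical helper names, not taken from the linked repository, and the rules are standard American Soundex, which was designed for English surnames and may need tuning for Indian names.

```python
from collections import defaultdict

# Standard American Soundex digit table (consonants only).
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name):
    """Return the 4-character Soundex code, or "" if name has no A-Z letters."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (result + "000")[:4]


def group_by_soundex(names):
    """Hypothetical helper: bucket names that share a Soundex code."""
    groups = defaultdict(list)
    for name in names:
        groups[soundex(name)].append(name)
    return dict(groups)


if __name__ == "__main__":
    # Made-up sample spellings, purely for illustration.
    print(group_by_soundex(["Kulkarni", "Kulkarny", "Deshpande", "Deshpandey"]))
```

Names that land in the same bucket are only candidates for being the same name; a manual check or a stricter second comparison would still be needed before merging them.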
