Hi everyone, 

For more than a year, I’ve been parsing India’s recent electoral rolls 
to make them available in analyzable formats. Earlier this 
week, I released a small portion of this dataset: the state of Haryana 
(2024 Vidhansabha), with over 20 million individual-level voter records. I 
plan to update the dataset about once a month, potentially adding one state 
at a time, with the goal of preparing a journal article introducing the 
dataset by early 2026.

As some of you may know, for the last five years or so India’s electoral 
rolls have been published only in non-machine-readable formats (image-only, 
with no OCR text layer). Unfortunately, widely used OCR models do not 
perform well on these rolls. To address this, I ran a newer OCR engine, 
Surya-OCR, on high-performance computing clusters, which meant learning a 
good deal about current machine learning tooling and supercomputing 
infrastructure.

The dataset is hosted on Harvard Dataverse 
<https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/YNEY6G>.
To request access, please fill out this Google Form 
<https://forms.gle/26spnRXCsYEoP1uQ7?_imcp=1> (also linked in the Dataverse 
page's description). The documentation for the dataset is in my GitHub repo 
<https://github.com/sharik19/India-Electoral-Rolls-2024-25>.

Please feel free to send any questions or feedback, and to share and 
circulate this. Thanks so much! 

Warmly,
Sharik

-- 
Datameet is a community of Data Science enthusiasts in India. Know more about 
us by visiting http://datameet.org