Hi everyone,

For more than a year, I’ve been working on parsing India’s recent electoral rolls to make them available in analyzable formats. Earlier this week, I released a small portion of this dataset: the state of Haryana (2024 Vidhansabha), with over 20 million individual-level voter records. I plan to update the dataset about once a month, potentially adding one state at a time, with the goal of preparing a journal article introducing the dataset by early 2026.

As some of you may know, for the last five years or so India’s electoral rolls have been made available only in non-machine-readable (non-OCR) formats. Unfortunately, widely used OCR models do not perform well on these rolls. To address this, I used a new OCR engine, Surya-OCR, on high-performance computing clusters. This required a deep dive into the growing frontier of machine learning and supercomputing infrastructure.

The dataset is hosted on Harvard Dataverse <https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/YNEY6G>. To get access, you will need to fill out this Google Form <https://forms.gle/26spnRXCsYEoP1uQ7?_imcp=1> (also linked in the Dataverse page's description). The documentation for the dataset is in my GitHub repo <https://github.com/sharik19/India-Electoral-Rolls-2024-25>.

Please feel free to send in any queries, feedback, or questions. And please feel free to share and circulate.

Thanks so much!

Warmly,
Sharik

--
Datameet is a community of Data Science enthusiasts in India. Know more about us by visiting http://datameet.org
---
To view this discussion visit https://groups.google.com/d/msgid/datameet/7e477386-c665-46f1-a549-2afa9b7e1bcdn%40googlegroups.com.
