Hi everyone,

For more than a year, I’ve been working on parsing India’s recent
electoral rolls to make them available in analyzable formats. Earlier this
week, I released a small portion of this dataset: the state of Haryana
(2024 Vidhansabha), with over 20 million individual-level voter records. I
plan to update the dataset about once a month, potentially adding one state
at a time, with the goal of preparing a journal article introducing the
dataset by early 2026.

As some of you may know, for the last five years or so India’s electoral
rolls have been made available only in non-machine-readable (non-OCR)
formats. Unfortunately, widely used OCR models do not perform well on these
rolls. To address this, I used a new OCR engine, Surya-OCR, on
high-performance computing clusters. This required delving into the
fast-growing frontier of machine learning and supercomputing infrastructure.

The dataset is hosted on Harvard Dataverse
<https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/YNEY6G>.
To get access, you will need to fill out this Google Form
<https://forms.gle/26spnRXCsYEoP1uQ7?_imcp=1> (also linked in the Dataverse
page's description). The documentation for the dataset is in my GitHub repo
<https://github.com/sharik19/India-Electoral-Rolls-2024-25>.

Please feel free to send in any questions or feedback, and to share and
circulate the dataset. Thanks so much!

Warmly,
-- 
Sharik Laliwala
PhD Candidate
Department of Political Science
University of California, Berkeley
