Hi everyone,

I’ve been trying to parse the compiled PDFs uploaded by the CGWB here 
<https://cgwb.gov.in/en/ground-water-level-monitoring> (specifically the 
ones under “4. Water Level Data”) which contain four readings per 
monitoring well per year. However, I’ve run into an issue with overlapping 
text across columns, which is leading to jumbled or misaligned outputs.

For instance, on page 5 of the file titled “August Ground Water Level
1994–2023”, the district “Dr. B.R. Ambedkar Konaseema” appears as “Dr. B.R.
Ambedkar Konaseem”, with the missing “a” attached to the start of the
following block name. Camelot (Python) detects these characters but assigns
them to the wrong cell, likely because the overlapping text in adjacent
columns has nearly identical coordinates. The same problem affects all rows
corresponding to “Dadra and Nagar Haveli and Daman and Diu”.
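
To make the symptom concrete, here is a minimal sketch of the kind of post-processing workaround I have been considering. It assumes I can supply a list of valid district names. The function name `repair_spill` and the block name in the example are hypothetical illustrations, not something from Camelot or from the actual PDF. It only handles the case where trailing characters of one cell have spilled into the start of the next cell.

```python
def repair_spill(left, right, known_names):
    """Move characters that spilled from `left` into the start of `right`.

    `left` and `right` are the text of two adjacent cells in one row.
    `known_names` is a list of valid values for the left column
    (an assumption: e.g. a hand-made list of district names).
    Returns the (possibly corrected) pair of cell texts.
    """
    if not left:
        return left, right
    for name in known_names:
        # Only consider names that `left` is a strict prefix of.
        if name.startswith(left) and len(name) > len(left):
            missing = name[len(left):]
            # The spilled characters should sit at the start of the next cell.
            if right.startswith(missing):
                return name, right[len(missing):].lstrip()
    return left, right


# Hypothetical row: the final "a" of the district has landed in the block cell.
districts = ["Dr. B.R. Ambedkar Konaseema", "Guntur"]
fixed_district, fixed_block = repair_spill(
    "Dr. B.R. Ambedkar Konaseem", "aAllavaram", districts
)
```

This would not help where a truncated name is ambiguous or where the block name could legitimately begin with the missing characters, so I see it as a stopgap rather than a real fix.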

I wanted to check:

   1. Has anyone here successfully parsed this dataset before?
   2. Am I understanding the complexity of scraping this correctly?
   3. Does anyone have a contact at CGWB who might be able to share the 
   original Excel files? The PDFs appear to have been exported via iLovePDF 
   from XLSX files. Since these files are already publicly available, I’m 
   hoping the CGWB might be open to sharing the source files directly, though
   I'm worried the turnaround could be slow or unpredictable.

Any help, advice, or pointers would be really appreciated. Thanks so much!

Best,
