Amazon Textract (paid) and MuPDF are a couple of other alternatives to consider. In my experience, Amazon Textract is the best available tool.
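
For the overlapping-column problem described in the quoted message, one thing that might be worth trying with MuPDF (through its Python bindings, PyMuPDF) is to assign each word to a column by its left edge only, so that a word overflowing into the next cell stays with the column where it starts. The sketch below is only an illustration under assumptions: the helper names, the column positions and the tolerances are hypothetical and not measured from the CGWB files, and whether it resolves this case depends on how the text runs are actually stored in the PDF.

```python
from bisect import bisect_right
from collections import defaultdict


def extract_page_words(pdf_path, page_index):
    """Return word tuples for one page using PyMuPDF (installed separately)."""
    import fitz  # PyMuPDF

    with fitz.open(pdf_path) as doc:
        # Each item is (x0, y0, x1, y1, text, block_no, line_no, word_no).
        return doc[page_index].get_text("words")


def assign_words_to_columns(words, column_starts, row_tolerance=3.0, slack=1.0):
    """Rebuild a table from word tuples.

    words: iterable of (x0, y0, x1, y1, text, ...) tuples.
    column_starts: sorted left x-coordinates of the columns
                   (hypothetical values, to be measured from the page).
    row_tolerance: rough bucket height used to group words into rows;
                   words near a bucket boundary can still be split.
    slack: allowance for words that start slightly left of a column edge.
    """
    rows = defaultdict(lambda: defaultdict(list))
    for x0, y0, x1, y1, text, *_ in words:
        # Only the left edge decides the column, so overflow to the right
        # of the cell does not move the word into the neighbouring column.
        col = max(bisect_right(column_starts, x0 + slack) - 1, 0)
        row = round(y0 / row_tolerance)
        rows[row][col].append((x0, text))

    table = []
    for row in sorted(rows):
        cells = []
        for col in range(len(column_starts)):
            # Words inside a cell are joined left to right.
            cells.append(" ".join(t for _, t in sorted(rows[row][col])))
        table.append(cells)
    return table
```

A call would look like `assign_words_to_columns(extract_page_words("file.pdf", 4), column_starts)`, where `file.pdf` and `column_starts` are placeholders to be replaced with the downloaded file and the column positions read off the page.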

On Mon, Jul 21, 2025 at 3:01 PM sreeram kandimalla <[email protected]> wrote:

> Camelot is nice and lightweight but is currently unmaintained.
> https://github.com/datalab-to/marker is a good alternative. It's a mix of
> OCR and PDF parsing and can use LLMs for correcting thorny cases. Here is
> an example of an invocation for a different dataset:
> https://github.com/publicmap/amche-atlas/issues/104#issuecomment-2842058569
>
> On Mon, Jul 21, 2025 at 2:04 PM Saloni Taneja <[email protected]> wrote:
>
>> Hi everyone,
>>
>> I’ve been trying to parse the compiled PDFs uploaded by the CGWB here
>> <https://cgwb.gov.in/en/ground-water-level-monitoring> (specifically the
>> ones under “4. Water Level Data”), which contain four readings per
>> monitoring well per year. However, I’ve run into an issue with overlapping
>> text across columns, which is leading to jumbled or misaligned outputs.
>>
>> For instance, on page 5 of the file titled “August Ground Water Level
>> 1994–2023”, the district “Dr. B.R. Ambedkar Konaseema” appears as “Dr. B.R.
>> Ambedkar Konaseem”, with the missing "a" mistakenly attached to the start
>> of the following block name. Camelot (Python) is detecting these characters
>> but struggles to resolve them correctly, likely because overlapping text
>> layers in the PDF are assigned nearly identical coordinates, causing cell
>> misassignments. Another example is all rows corresponding to "Dadra and
>> Nagar Haveli and Daman and Diu".
>>
>> I wanted to check:
>>
>> 1. Has anyone here successfully parsed this dataset before?
>> 2. Am I understanding the complexity of scraping this correctly?
>> 3. Does anyone have a contact at CGWB who might be able to share the
>> original Excel files? The PDFs appear to have been exported via iLovePDF
>> from XLSX files. Since these files are already publicly available, I’m
>> hoping the CGWB might be open to sharing the source formats directly, but
>> I'm worried the turnaround times might vary.
>>
>> Any help, advice, or pointers would be really appreciated. Thanks so much!
>>
>> Best,
>>
>> --
>> Datameet is a community of Data Science enthusiasts in India. Know more
>> about us by visiting http://datameet.org
>> ---
>> You received this message because you are subscribed to the Google Groups
>> "datameet" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> To view this discussion visit
>> https://groups.google.com/d/msgid/datameet/1e037636-d31e-4cb3-8703-433000a9a573n%40googlegroups.com
