Amazon Textract (paid) and MuPDF are a couple of other alternatives to
consider. In my experience, Amazon Textract is the best available tool.
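
Whichever extractor you end up with, the overlap problem described in the
original message below may be easier to handle if cells are assigned per
text run rather than per character. Here is a rough sketch of that idea in
plain Python. It assumes you can get each row's glyphs as (x0, x1, text)
tuples in the order the PDF emits them, and that you know the left edge of
each column. The tuple layout, the helper names, the sample coordinates,
the placeholder cell text "Block" and the max_gap value are all made up for
illustration and would need tuning against the real files.

```python
from bisect import bisect_right
from typing import List, Tuple

# Hypothetical glyph record: (x0, x1, text) for one character on a table row.
Glyph = Tuple[float, float, str]


def lay_out(text: str, start_x: float, width: float = 6.0) -> List[Glyph]:
    """Build fake fixed-width glyph boxes for the demo below."""
    return [(start_x + i * width, start_x + (i + 1) * width, ch)
            for i, ch in enumerate(text)]


def split_runs(glyphs: List[Glyph], max_gap: float = 1.5) -> List[List[Glyph]]:
    """Group glyphs, in emission order, into contiguous runs.

    A new run starts when the next glyph does not begin within max_gap
    of where the previous glyph ended (a jump backwards or forwards).
    """
    runs: List[List[Glyph]] = []
    for glyph in glyphs:
        if runs and abs(glyph[0] - runs[-1][-1][1]) <= max_gap:
            runs[-1].append(glyph)
        else:
            runs.append([glyph])
    return runs


def assign_by_glyph(glyphs: List[Glyph], column_lefts: List[float]) -> List[str]:
    """Naive approach: place every glyph by its own centre point."""
    cells = [""] * len(column_lefts)
    for x0, x1, text in glyphs:
        col = max(bisect_right(column_lefts, (x0 + x1) / 2) - 1, 0)
        cells[col] += text
    return cells


def assign_by_run(glyphs: List[Glyph], column_lefts: List[float],
                  max_gap: float = 1.5) -> List[str]:
    """Place each whole run in the column where the run starts."""
    cells = [""] * len(column_lefts)
    for run in split_runs(glyphs, max_gap):
        col = max(bisect_right(column_lefts, run[0][0]) - 1, 0)
        cells[col] += "".join(text for _, _, text in run)
    return cells


# Demo: "Konaseema" overflows a column boundary at x=100, and the next
# cell's text ("Block", a placeholder) starts at x=102.
column_lefts = [0.0, 100.0]
row = lay_out("Konaseema", 50.0) + lay_out("Block", 102.0)

print(assign_by_glyph(row, column_lefts))  # ['Konaseem', 'aBlock']
print(assign_by_run(row, column_lefts))    # ['Konaseema', 'Block']
```

The per-glyph version reproduces the symptom from the original message (the
trailing "a" lands in the next cell), while the per-run version keeps the
word together. This only works if the extractor preserves emission order
and the two overlapping strings do not happen to line up within max_gap, so
treat it as a starting point rather than a fix.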

On Mon, Jul 21, 2025 at 3:01 PM sreeram kandimalla <
[email protected]> wrote:

> Camelot is nice and lightweight but is currently unmaintained.
> https://github.com/datalab-to/marker is a good alternative. It's a mix of
> OCR and PDF parsing and can use LLMs to correct thorny cases. Here is
> an example of an invocation for a different dataset:
> https://github.com/publicmap/amche-atlas/issues/104#issuecomment-2842058569
>
> On Mon, Jul 21, 2025 at 2:04 PM Saloni Taneja <[email protected]>
> wrote:
>
>> Hi everyone,
>>
>> I’ve been trying to parse the compiled PDFs uploaded by the CGWB here
>> <https://cgwb.gov.in/en/ground-water-level-monitoring> (specifically the
>> ones under “4. Water Level Data”) which contain four readings per
>> monitoring well per year. However, I’ve run into an issue with overlapping
>> text across columns, which is leading to jumbled or misaligned outputs.
>>
>> For instance, on page 5 of the file titled “August Ground Water Level
>> 1994–2023”, the district “Dr. B.R. Ambedkar Konaseema” appears as “Dr. B.R.
>> Ambedkar Konaseem”, with the missing "a" mistakenly attached to the start
>> of the following block name. Camelot (Python) is detecting these characters
>> but struggles to resolve them correctly, likely because overlapping text
>> layers in the PDF are assigned nearly identical coordinates, causing cell
>> misassignments. Another example is all the rows corresponding to "Dadra and
>> Nagar Haveli and Daman and Diu".
>>
>> I wanted to check:
>>
>>    1. Has anyone here successfully parsed this dataset before?
>>    2. Am I understanding the complexity of scraping this correctly?
>>    3. Does anyone have a contact at CGWB who might be able to share the
>>    original Excel files? The PDFs appear to have been exported via iLovePDF
>>    from XLSX files. Since these files are already publicly available, I’m
>>    hoping the CGWB might be open to sharing the source formats directly, but
>>    I'm worried the turnaround times might vary.
>>
>> Any help, advice, or pointers would be really appreciated. Thanks so much!
>>
>> Best,
>>
>> --
>> Datameet is a community of Data Science enthusiasts in India. Know more
>> about us by visiting http://datameet.org
>> ---
>> You received this message because you are subscribed to the Google Groups
>> "datameet" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> To view this discussion visit
>> https://groups.google.com/d/msgid/datameet/1e037636-d31e-4cb3-8703-433000a9a573n%40googlegroups.com
>> <https://groups.google.com/d/msgid/datameet/1e037636-d31e-4cb3-8703-433000a9a573n%40googlegroups.com?utm_medium=email&utm_source=footer>
>> .
>>
>
