Re: [PSF-Community] Python library to extract data tables from PDF files

Vinayak Mehta Fri, 28 Sep 2018 12:38:32 -0700

The library's API is pretty simple and intuitive too! You can check it out
in the README :)
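
For readers who want a feel for the API before opening the README, here is a minimal sketch of the extraction step. It assumes Camelot is installed and relies on `camelot.read_pdf` with its `pages` and `flavor` arguments. The names `extract_frames`, `reader`, and `foo.pdf` are hypothetical, used only for illustration, and are not part of Camelot.

```python
def extract_frames(pdf_path, pages="1-end", flavor="lattice", reader=None):
    """Return one DataFrame per table detected in the PDF.

    `reader` is a hypothetical hook for swapping in a stand-in during
    testing; by default the real `camelot.read_pdf` is used.
    """
    if reader is None:
        import camelot  # third-party dependency, imported lazily

        reader = camelot.read_pdf

    # "lattice" looks for ruling lines between cells; "stream" relies on
    # whitespace between columns instead.
    tables = reader(pdf_path, pages=pages, flavor=flavor)

    # Each parsed table exposes its contents as a pandas DataFrame via `.df`.
    return [table.df for table in tables]


# Usage (assumes a local file named foo.pdf and that pandas is imported):
#   frames = extract_frames("foo.pdf")
#   combined = pandas.concat(frames, ignore_index=True)
```

Keeping the reader injectable is just a convenience for trying the function without a PDF on hand; in normal use the default is enough.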


On Sat, Sep 29, 2018 at 1:06 AM Vinayak Mehta <[email protected]> wrote:

> Hello David!
>
> Yes, I've created a wiki page comparing Camelot with other open source
> tools and libraries. tabula-py is a wrapper over tabula-java, which is used
> by Tabula. You can check out the comparison of Camelot with Tabula here
> <https://github.com/socialcopsdev/camelot/wiki/Comparison-with-other-PDF-Table-Extraction-libraries-and-tools#tabula>.
> As you can see in the comparison, it outperforms Tabula in almost all cases!
>
> While Tabula either gives good output or fails miserably, Camelot
> gives you complete control over the extraction process with various
> configuration parameters! You can check out this section of the README
> <https://github.com/socialcopsdev/camelot#why-camelot> for more
> information. Camelot also lets you plot various geometries, such as detected
> lines, intersections, and tables in the PDF, to debug and improve table
> extraction! You can check out this part of the documentation
> <https://camelot-py.readthedocs.io/en/latest/user/advanced.html#plot-geometry>
> for more information on that.
>
> Try it out!
>
> Vinayak
>
> On Sat, Sep 29, 2018 at 12:34 AM David Mertz <[email protected]> wrote:
>
>> Have you compared your tool with existing ones, such as
>> https://blog.chezo.uno/tabula-py-extract-table-from-pdf-into-python-dataframe-6c7acfa5f302
>> ?
>>
>> What notable difference in API and/or accuracy do you have?
>>
>> On Fri, Sep 28, 2018 at 2:32 PM Vinayak Mehta <[email protected]> wrote:
>>
>>> I've created a Jupyter notebook which shows an example of how Camelot makes
>>> it easy to extract tables out of PDFs.
>>>
>>>
>>> In the example, I scrape a PDF from an Indian disease outbreaks data
>>> source[1] using requests, extract tables from each page of the PDF using
>>> Camelot and then concat those tables. Here's the gist:
>>> https://gist.github.com/vinayak-mehta/e5949f7c2410a0e12f25d3682dc9e873 :)
>>>
>>> [1] http://idsp.nic.in/index4.php?lang=1&level=0&linkid=406&lid=3689
>>>
>>>
>>> On Fri, Sep 28, 2018 at 12:01 PM Vinayak Mehta <[email protected]>
>>> wrote:
>>>
>>>> Hello everyone!
>>>>
>>>> I recently released a Python library which lets users extract data
>>>> tables out of PDF files, my first open source library! Here's the link:
>>>> https://github.com/socialcopsdev/camelot
>>>>
>>>> I've created a wiki page
>>>> <https://github.com/socialcopsdev/camelot/wiki/Comparison-with-other-PDF-Table-Extraction-libraries-and-tools>
>>>> comparing it to other open source PDF table extraction tools. I'm currently
>>>> working on porting it to Python 3!
>>>>
>>>> I would be really grateful if you could check it out and see if it's
>>>> useful to you and give me any feedback that may help me improve it, by
>>>> replying here, opening an issue or a pull request!
>>>>
>>>> Looking forward to hearing from you all!
>>>>
>>>> Thanks for your time!
>>>>
>>>> Vinayak
>>>>
>>> _______________________________________________
>>> PSF-Community mailing list
>>> [email protected]
>>> https://mail.python.org/mailman/listinfo/psf-community
>>>
>>
>>
>> --
>> Keeping medicines from the bloodstreams of the sick; food
>> from the bellies of the hungry; books from the hands of the
>> uneducated; technology from the underdeveloped; and putting
>> advocates of freedom in prisons.  Intellectual property is
>> to the 21st century what the slave trade was to the 16th.
>>
>
