Re: [PSF-Community] Python library to extract data tables from PDF files

Vinayak Mehta Mon, 01 Oct 2018 02:22:40 -0700

Thanks Vasudev!

[1] xtopdf looks great! Will check it out.


[2] I've faced similar issues w.r.t. junk characters, which may happen when
the PDF contains an incorrect ToUnicode map, though I still have to dig
deeper and I'm not 100% sure. I've also faced an issue where duplicate
strings are assigned to the same cell. You can check it out on GitHub
<https://github.com/socialcopsdev/camelot/issues/103>. I suspect that since
PDF is a canvas-based model and not a text-based one, like you said, the
same text is drawn a second time at a slight offset to make it look bold.
I'll probably write a detailed blog post about the issues I faced during
development :)
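
To make the duplicate-string issue concrete, here is a rough sketch of one
way such repeats could be filtered out. It is not Camelot's code: the
TextBox type, the drop_overprinted_duplicates helper and the tolerance
value are all hypothetical names made up for illustration, and it assumes
the bold effect really does come from the same string being drawn twice at
almost the same position.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextBox:
    """Hypothetical stand-in for a text object reported by a PDF parser."""
    text: str
    x0: float  # left edge, in PDF points
    y0: float  # bottom edge, in PDF points


def drop_overprinted_duplicates(boxes, tolerance=1.0):
    """Keep one copy of each string drawn repeatedly at nearly the same spot.

    Two boxes count as duplicates when their text is identical and both
    coordinates differ by at most `tolerance` points (an assumed threshold).
    """
    kept = []
    for box in boxes:
        is_duplicate = any(
            box.text == other.text
            and abs(box.x0 - other.x0) <= tolerance
            and abs(box.y0 - other.y0) <= tolerance
            for other in kept
        )
        if not is_duplicate:
            kept.append(box)
    return kept


boxes = [
    TextBox("Total", 72.0, 700.0),
    TextBox("Total", 72.3, 700.0),  # second pass, shifted slightly to fake bold
    TextBox("Total", 72.0, 650.0),  # same string, but a different row
]
result = drop_overprinted_duplicates(boxes)
print(len(result))
```

The comparison is quadratic in the number of text objects, which should be
fine for a single page but would need a spatial index for anything larger.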

Thanks for checking it out!

On Sat, Sep 29, 2018 at 1:26 AM Vasudev Ram <vasudev...@gmail.com> wrote:

> Very interesting, and congrats, Vinayak.
>
> As a person interested in both PDF generation [1] and PDF text
> extraction [2], I'm interested to know what issues you faced w.r.t.
> accuracy of text extraction and also formatting.
>
> [1] I'm the creator of xtopdf, a Python toolkit for PDF generation
> from other file formats;
>
> http://slides.com/vasudevram/xtopdf
>
> http://bitbucket.org/vasudevram/xtopdf
>
> [2] I worked on a project to extract text from PDF files. It was done
> using a C library (xpdf), though, not a Python one. However, the text
> extraction accuracy issues (some of which are technical issues
> inherent in the PDF format, according to the vendor of xpdf, Glyph and
> Cog) are language-independent. There were things like characters
> getting transposed, missing characters, junk characters sometimes,
> etc. (I also wrote a heuristics program to detect some such issues,
> but that too could only reject the bad extracts, not make them
> correct.)
>
> So the extraction was not 100% accurate, at least in my project. Also,
> like I said, that vendor said the issues are inherent in PDF, partly
> related to it being a canvas-based model, not a text-based one.
>
> I'll try to check out your project some time later.
>
> Cheers,
> Vasudev
> --
> vi quickstart: https://gumroad.com/l/vi_quick
> Web site:      https://vasudevram.github.io
> Blog:             https://jugad2.blogspot.com
> Products:      https://gumroad.com/vasudevram
>
> > While Tabula either gives good output or fails miserably, Camelot
> > gives you complete control over the extraction process with various
> > configuration parameters! You can check out this section of the README
> > <https://github.com/socialcopsdev/camelot#why-camelot> for more
> > information. Camelot also lets you plot various geometries like detected
> > lines, intersections, tables in the PDF to debug and improve table
> > extraction! You can check out this part of the documentation
> > <https://camelot-py.readthedocs.io/en/latest/user/advanced.html#plot-geometry>
> > for more information on that.
> >
>
> >>>> Hello everyone!
> >>>>
> >>>> I recently released a Python library which lets users extract data
> >>>> tables out of PDF files, my first open source library! Here's the link:
> >>>> https://github.com/socialcopsdev/camelot
> >>>>
> >>>> I've created a wiki page
> >>>> <https://github.com/socialcopsdev/camelot/wiki/Comparison-with-other-PDF-Table-Extraction-libraries-and-tools>
> >>>> comparing it to other open source PDF table extraction tools. I'm
> >>>> currently working on porting it to Python 3!
> >>>>
> >>>> I would be really grateful if you could check it out and see if it's
> >>>> useful to you and give me any feedback that may help me improve it, by
> >>>> replying here, opening an issue or a pull request!
> >>>>
> >>>> Looking forward to hearing from you all!
> >>>>
> >>>> Thanks for your time!
> >>>>
> >>>> Vinayak
> >>>>
>

_______________________________________________
PSF-Community mailing list
PSF-Community@python.org
https://mail.python.org/mailman/listinfo/psf-community
