Very interesting, and congrats, Vinayak. Since I work on both PDF generation [1] and PDF text extraction [2], I'd like to know what issues you faced w.r.t. the accuracy of text extraction, and also w.r.t. formatting.
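To make the accuracy question concrete, here is a minimal Python sketch of the kind of heuristic check described in [2]. It is not the program from that project (which used the C library xpdf). The function names, the character categories treated as junk, the tiny vocabulary, and the 2% threshold are all assumptions chosen for illustration.

```python
import unicodedata


def junk_ratio(text):
    """Return the fraction of characters that look like extraction debris."""
    if not text:
        return 0.0
    junk = 0
    for ch in text:
        if ch in "\n\t\r":
            continue  # ordinary whitespace is not junk
        category = unicodedata.category(ch)
        # Cc = control, Co = private use, Cn = unassigned;
        # U+FFFD is the Unicode replacement character.
        if category in ("Cc", "Co", "Cn") or ch == "\ufffd":
            junk += 1
    return junk / len(text)


def is_adjacent_transposition(word, known):
    """True if `word` is `known` with exactly one adjacent pair swapped."""
    if len(word) != len(known) or word == known:
        return False
    diffs = [i for i, (a, b) in enumerate(zip(word, known)) if a != b]
    return (
        len(diffs) == 2
        and diffs[1] == diffs[0] + 1
        and word[diffs[0]] == known[diffs[1]]
        and word[diffs[1]] == known[diffs[0]]
    )


def looks_bad(text, vocabulary, max_junk=0.02):
    """Flag an extract as suspicious; this rejects text, it does not repair it."""
    if junk_ratio(text) > max_junk:
        return True
    for raw in text.split():
        word = raw.strip(".,;:!?()").lower()
        if word and word not in vocabulary:
            if any(is_adjacent_transposition(word, k) for k in vocabulary):
                return True
    return False


if __name__ == "__main__":
    vocab = {"the", "table", "total"}  # hypothetical domain vocabulary
    print(looks_bad("teh table", vocab))
    print(looks_bad("the table", vocab))
```

A check like this can only flag a suspicious extract for rejection or manual review. It does not detect missing characters, and it cannot recover the correct text.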

[1] I'm the creator of xtopdf, a Python toolkit for PDF generation from other file formats:
http://slides.com/vasudevram/xtopdf
http://bitbucket.org/vasudevram/xtopdf

[2] I worked on a project to extract text from PDF files. It was done using a C library (xpdf), though, not a Python one. However, the text extraction accuracy issues (some of which are technical issues inherent in the PDF format, according to the vendor of xpdf, Glyph and Cog) are language-independent. The problems included transposed characters, missing characters, and occasional junk characters. (I also wrote a heuristics program to detect some such issues, but it could only reject the bad extracts, not correct them.) So the extraction was not 100% accurate, at least in my project. Also, as I said, the vendor said the issues are inherent in PDF, partly because it is a canvas-based model, not a text-based one.

I'll try to check out your project some time later.

Cheers,
Vasudev
--
vi quickstart: https://gumroad.com/l/vi_quick
Web site: https://vasudevram.github.io
Blog: https://jugad2.blogspot.com
Products: https://gumroad.com/vasudevram

> While Tabula either gives good output or fails miserably, Camelot
> gives you complete control over the extraction process with various
> configuration parameters! You can check out this section of the README
> <https://github.com/socialcopsdev/camelot#why-camelot> for more
> information. Camelot also lets you plot various geometries like detected
> lines, intersections, tables in the PDF to debug and improve table
> extraction! You can check out this part of the documentation
> <https://camelot-py.readthedocs.io/en/latest/user/advanced.html#plot-geometry>
> for more information on that.
>
>>>> Hello everyone!
>>>>
>>>> I recently released a Python library which lets users extract data
>>>> tables out of PDF files, my first open source library! Here's the link:
>>>> https://github.com/socialcopsdev/camelot
>>>>
>>>> I've created a wiki page
>>>> <https://github.com/socialcopsdev/camelot/wiki/Comparison-with-other-PDF-Table-Extraction-libraries-and-tools>
>>>> comparing it to other open source PDF table extraction tools. I'm
>>>> currently working on porting it to Python 3!
>>>>
>>>> I would be really grateful if you could check it out and see if it's
>>>> useful to you and give me any feedback that may help me improve it, by
>>>> replying here, opening an issue or a pull request!
>>>>
>>>> Looking forward to hearing from you all!
>>>>
>>>> Thanks for your time!
>>>>
>>>> Vinayak
>>>> _______________________________________________
PSF-Community mailing list
PSF-Community@python.org
https://mail.python.org/mailman/listinfo/psf-community