Thanks Vasudev! [1] xtopdf looks great! Will check it out.

[2] I've faced similar issues w.r.t. junk characters. They may happen when
the PDF contains an incorrect ToUnicode map, though I still have to dig
deeper and I'm not 100% sure. I've also faced an issue where duplicate
strings are assigned to the same cell. You can check it out on GitHub
<https://github.com/socialcopsdev/camelot/issues/103>. I suspect that since
PDF is a canvas-based model and not a text-based one, like you said, the
same text is just drawn a second time, shifted slightly, to make it look
bold, which would explain the duplicates. I'll probably write a detailed
blog post about the issues I faced during development :)

Thanks for checking it out!

On Sat, Sep 29, 2018 at 1:26 AM Vasudev Ram <vasudev...@gmail.com> wrote:

> Very interesting, and congrats, Vinayak.
>
> As a person interested in both PDF generation [1] and PDF text
> extraction [2], I'm interested to know what issues you faced w.r.t.
> accuracy of text extraction and also formatting.
>
> [1] I'm the creator of xtopdf, a Python toolkit for PDF generation
> from other file formats;
>
> http://slides.com/vasudevram/xtopdf
>
> http://bitbucket.org/vasudevram/xtopdf
>
> [2] I worked on a project to extract text from PDF files. It was done
> using a C library (xpdf), though, not a Python one. However, the text
> extraction accuracy issues (some of which are technical issues
> inherent in the PDF format, according to the vendor of xpdf, Glyph and
> Cog) are language-independent. There were things like characters
> getting transposed, missing characters, junk characters sometimes,
> etc. (I also wrote a heuristics program to detect some such issues,
> but that too could only reject the bad extracts, not make them
> correct.)
>
> So the extraction was not 100% accurate, at least in my project. Also,
> like I said, that vendor said the issues are inherent in PDF, partly
> related to it being a canvas-based model, not a text-based one.
>
> I'll try to check out your project some time later.
>
> Cheers,
> Vasudev
> --
> vi quickstart: https://gumroad.com/l/vi_quick
> Web site: https://vasudevram.github.io
> Blog: https://jugad2.blogspot.com
> Products: https://gumroad.com/vasudevram
>
>
> > While Tabula either gives good output or fails miserably, Camelot
> > gives you complete control over the extraction process with various
> > configuration parameters! You can check out this section of the README
> > <https://github.com/socialcopsdev/camelot#why-camelot> for more
> > information. Camelot also lets you plot various geometries like detected
> > lines, intersections, tables in the PDF to debug and improve table
> > extraction! You can check out this part of the documentation
> > <https://camelot-py.readthedocs.io/en/latest/user/advanced.html#plot-geometry>
> > for more information on that.
> >
> >
> >>>> Hello everyone!
> >>>>
> >>>> I recently released a Python library which lets users extract data
> >>>> tables out of PDF files, my first open source library! Here's the
> >>>> link: https://github.com/socialcopsdev/camelot
> >>>>
> >>>> I've created a wiki page
> >>>> <https://github.com/socialcopsdev/camelot/wiki/Comparison-with-other-PDF-Table-Extraction-libraries-and-tools>
> >>>> comparing it to other open source PDF table extraction tools. I'm
> >>>> currently working on porting it to Python3!
> >>>>
> >>>> I would be really grateful if you could check it out and see if it's
> >>>> useful to you and give me any feedback that may help me improve it, by
> >>>> replying here, opening an issue or a pull request!
> >>>>
> >>>> Looking forward to hearing from you all!
> >>>>
> >>>> Thanks for your time!
> >>>>
> >>>> Vinayak
> >>>>
>
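
To illustrate the duplicate-string issue mentioned in the reply above (the same text drawn twice, slightly shifted, to fake bold), here is a minimal Python sketch. It only illustrates the suspected cause and one possible workaround; it is not Camelot's actual code. The names `Fragment` and `drop_fake_bold_duplicates`, the `tol` tolerance, and the sample coordinates are all hypothetical, and it assumes the extractor reports each drawn string with an (x, y) origin in points.

```python
from typing import List, NamedTuple


class Fragment(NamedTuple):
    """Hypothetical extracted text fragment: the string and its origin in points."""
    text: str
    x: float
    y: float


def drop_fake_bold_duplicates(fragments: List[Fragment], tol: float = 1.0) -> List[Fragment]:
    """Keep the first copy of any string that is redrawn within `tol` points.

    Assumption: "fake bold" shows up as the same string painted again at an
    almost identical position, so such near-coincident repeats are dropped.
    """
    kept: List[Fragment] = []
    for frag in fragments:
        is_duplicate = any(
            frag.text == k.text
            and abs(frag.x - k.x) <= tol
            and abs(frag.y - k.y) <= tol
            for k in kept
        )
        if not is_duplicate:
            kept.append(frag)
    return kept


# Made-up sample data for illustration only.
fragments = [
    Fragment("Total", 72.0, 700.0),
    Fragment("Total", 72.3, 700.0),  # same string redrawn 0.3pt to the right
    Fragment("42", 150.0, 700.0),
    Fragment("Total", 72.0, 650.0),  # genuine repeat on another row, kept
]
cleaned = drop_fake_bold_duplicates(fragments)
print([f.text for f in cleaned])
```

The tolerance is a judgment call: too small and shifted copies survive, too large and genuinely repeated values in tightly packed cells could be merged.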
_______________________________________________
PSF-Community mailing list
PSF-Community@python.org
https://mail.python.org/mailman/listinfo/psf-community