Hi Raphael,

Could you share these references here: https://github.com/tabulapdf/tabula/issues/409

I started a feature request on Tabula two years ago for cutting a PDF page into table cells and OCR'ing them one by one. Quite a few people joined the discussion. What you shared in your last email might hold the key to taking this forward, but I could not find links to what you mentioned (you wrote "see below", but there's nothing more in the email).

On 8/21/17, Raphael Susewind <[email protected]> wrote:
> Hi Nikhil and Devdatta,
>
> very useful references.
>
> Just to jump in on table conversion: there is a Python script called
> pdf-table-convert that is quite capable of detecting tables in PDFs.
> They use a graphical approach rather than a logical one, so it doesn't
> matter how bad the PDF is - even scans work in principle.
>
> Importantly, with the right options, the script gives you bounding box
> coordinates for each cell, which you can feed into ghostscript (or
> whatever you like) to extract an image of just that cell prior to OCRing
> - which indeed saves a lot of time.
>
> The whole processing chain is referenced in my GitHub scripts (see
> below), namely in most versions of pdf2list.pl, where pdf-table-convert
> is called towards the bottom and the output then fed to tesseract...
>
> Best,
> Raphael
>
> On 08/20/2017 11:00 AM, Nikhil VJ wrote:
>> Hi Devdatta,
>>
>> I had come across the legacy Devanagari fonts issue earlier when I
>> started working on budget data. The fonts are Shree-Dev, Kruti-Dev,
>> Shivaji, etc.: legacy fonts used in an era when Unicode Devanagari
>> hadn't been invented, and to get around that, they used simple
>> substitution like a = क etc. I've put up a graphic that shows this
>> mapping for a few fonts: http://i.imgur.com/ICUC6Wk.png
>>
>> I found a group named technical-hindi who have been working on simple
>> javascript pages that convert these fonts to Unicode Devanagari (and
>> back!). I used them, and with the content I had, I had to introduce some
>> extra conversions, and it worked like a charm.
>>
>> Their site where many converters are shared:
>> https://sites.google.com/site/technicalhindi/home/converters
>> Their google group:
>> https://groups.google.com/forum/#!forum/technical-hindi
>>
>> I've shared the modified converters I used here:
>> http://ourpuneourbudget.in/tools/
>> (only had those limited use cases)
>>
>> In the process of studying these, I came upon an unexpected situation:
>> If the document you are extracting data from is a PDF (which I also
>> refer to as a "digital graveyard"), then it is PREFERABLE if the fonts
>> are legacy Devanagari fonts rather than Unicode fonts!
>>
>> That's because as of today (or 2015, when I came across it), PDF
>> technology doesn't handle Unicode Devanagari well. Some distortions are
>> done to make the glyphs "print" properly, which permanently distorts the
>> original chars. The issue is described here:
>> https://stackoverflow.com/questions/30756193/unable-to-copy-exact-hindi-content-from-pdf
>>
>> ..So if the text in the PDFs you're working on is in legacy Devanagari
>> instead of Unicode Devanagari, then you're actually lucky :P .
>>
>> If it's in Unicode then that PDF is a true digital graveyard :P. OCR can
>> work, yes, but please tell me if you find a way to OCR a page table cell
>> by table cell separately instead of jumbling up everything. I had also
>> come across a project like yours a year ago but I backed out because I
>> could not get around this issue.. the fonts in the PDF were in Unicode.
>>
>> Here's an issue I filed in the Tabula project related to this, and they
>> fixed it for the legacy fonts extraction at least.
>> https://github.com/tabulapdf/tabula/issues/303
>>
>> --
>> Cheers,
>> Nikhil VJ
>> +91-966-583-1250
>> Pune / Mandangad, India
>> DataMeet Pune chapter <https://datameet-pune.github.io/>
>> Self-designed learner at Swaraj University <http://www.swarajuniversity.org>
>> Blog <http://nikhilsheth.blogspot.in>
>>
>> On Sat, Aug 19, 2017 at 11:21 PM, Raphael Susewind <[email protected]> wrote:
>>
>>     Hi Devdatta,
>>
>>     I had run into the same issue, and indeed the only workaround is OCR.
>>     It's not just a different encoding than unicode - it's actually
>>     garbled CMaps, which is much worse (ie not recoverable).
>>
>>     See my comments here for starters (and the badly written scripts):
>>
>>     https://github.com/raphael-susewind/india-religion-politics/tree/master/maharolls2014
>>
>>     As for Soundex, you might want to take a look at the IndicSoundex
>>     collection, which is more accurate than transliteration into latin
>>     followed by English soundex:
>>
>>     http://libindic.org/Soundex
>>
>>     Good news is that I have done the whole exercise for Maharashtra
>>     2014, and may be able to share depending on what your project is
>>     about. Perhaps send me a PM and we can discuss further,
>>
>>     Best,
>>     Raphael
>>
>>     On 08/19/2017 06:14 PM, Devdatta Tengshe wrote:
>>     > I'm attempting to read names, ages & genders from electoral
>>     > rolls, so that I can create a database of names, to figure out
>>     > the general spread of specific names across locations and ages.
>>     >
>>     > I began working with Mumbai's rolls, and am running into the
>>     > following issues:
>>     >
>>     > 1) The electoral rolls are not in English, but in Devanagari.
>>     > This is not a major issue, because I could transliterate it into
>>     > English for comparison (I need the names to be in English, so
>>     > that I can use Soundex to remove misspellings etc).
>>     > I know libraries for transliteration that work with Devanagari
>>     > (Hindi & Marathi). Is there anything similar for other scripts
>>     > such as Kannada & Tamil etc?
>>     >
>>     > 2) While the rolls are in Devanagari, the text is not actually
>>     > in Unicode. It is in some other font, and hence when I get the
>>     > text out, it's garbage. Since others have worked with the rolls
>>     > before, is there a better way to get the text out?
>>     >
>>     > 3) If it's not possible to get the text out, can we use OCR?
>>     > What OCR library is best at working with Indic scripts?
>>     >
>>     > If anyone has some experience to share on these issues, it will
>>     > be much appreciated.
>>     >
>>     > --
>>     > Datameet is a community of Data Science enthusiasts in India.
>>     > Know more about us by visiting http://datameet.org
>>     > ---
>>     > You received this message because you are subscribed to the
>>     > Google Groups "datameet" group.
>>     > To unsubscribe from this group and stop receiving emails from
>>     > it, send an email to [email protected].
>>     > For more options, visit https://groups.google.com/d/optout.
>

--
Cheers,
Nikhil VJ
+91-966-583-1250
Pune / Mandangad, India
DataMeet Pune chapter <https://datameet-pune.github.io/>
Self-designed learner at Swaraj University <http://www.swarajuniversity.org>
Blog <http://nikhilsheth.blogspot.in>
Contribute <https://www.instamojo.com/@nikhilvj/>
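
For anyone picking this thread up later: the chain Raphael describes (a bounding box per table cell, an image of just that cell, then OCR) could be sketched roughly as below in Python. This is not taken from pdf2list.pl or pdf-table-convert. The `Cell` tuple, the assumption that boxes arrive in PDF points with a bottom-left origin, the A4 page height, the 300 dpi figure and the `cell_N.png` file names are all hypothetical and only there for illustration. The cropping itself (ghostscript or any image tool) is left out; the sketch only covers the coordinate bookkeeping and the tesseract call.

```python
from typing import List, NamedTuple, Tuple


class Cell(NamedTuple):
    """One table cell in PDF points, origin at the bottom-left (assumed format)."""
    x0: float
    y0: float
    x1: float
    y1: float


def to_pixel_box(cell: Cell, page_height_pt: float, dpi: int) -> Tuple[int, int, int, int]:
    """Convert a cell to a (left, top, right, bottom) pixel box on the rendered page."""
    # There are 72 PDF points per inch, so points * dpi / 72 gives pixels.
    left = round(cell.x0 * dpi / 72.0)
    right = round(cell.x1 * dpi / 72.0)
    # Image rows count down from the top; PDF y counts up from the bottom.
    top = round((page_height_pt - cell.y1) * dpi / 72.0)
    bottom = round((page_height_pt - cell.y0) * dpi / 72.0)
    return left, top, right, bottom


def reading_order(cells: List[Cell]) -> List[Cell]:
    """Sort cells top-to-bottom, then left-to-right, so OCR output is not jumbled."""
    return sorted(cells, key=lambda c: (-c.y1, c.x0))


def tesseract_command(image_path: str, lang: str = "hin") -> List[str]:
    """Build a tesseract call that prints recognised text to stdout.

    "hin" selects the Hindi language data; "mar" would be Marathi.
    """
    return ["tesseract", image_path, "stdout", "-l", lang]


if __name__ == "__main__":
    page_height = 842.0  # A4 height in points (assumption)
    cells = [
        Cell(72, 680, 200, 700),
        Cell(200, 700, 400, 720),
        Cell(72, 700, 200, 720),
    ]
    for i, cell in enumerate(reading_order(cells)):
        box = to_pixel_box(cell, page_height, dpi=300)
        print(i, box, " ".join(tesseract_command(f"cell_{i}.png")))
```

Each pixel box would be handed to whatever does the cropping, and each resulting image to the tesseract command, one cell at a time. Keeping the coordinate conversion separate from the external tools makes it easy to check the flipped y-axis before any OCR is run.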
