Re: [datameet] Need some Guidance on Parsing Electoral Rolls.

Nikhil VJ Sun, 03 Sep 2017 09:31:20 -0700

Hi Raphael,

Could you share these references here:
https://github.com/tabulapdf/tabula/issues/409


I had started a feature request two years ago on Tabula for cutting a
PDF's page into table cells and OCR'ing them sequentially. Quite a few
people joined the discussion. What you shared in your last email might
hold the key to taking this forward, but I could not find links to what
you mentioned (you wrote "see below" but there's nothing more in the
email).
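
To make the "cut into cells, then OCR them one by one" idea concrete, here is a
minimal sketch of the coordinate handling only. It assumes the table detector
reports each cell as (x0, y0, x1, y1) in PDF points with the origin at the
bottom-left of the page; the actual output format of the script Raphael
mentions may well differ. The function names, the 300 dpi default and the row
tolerance are made up for illustration.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]   # (x0, y0, x1, y1) in PDF points
PixelBox = Tuple[int, int, int, int]      # (left, top, right, bottom) in pixels


def pdf_box_to_pixels(box: Box, page_height_pt: float, dpi: int = 300) -> PixelBox:
    """Convert a PDF-space cell box to a crop box on a page image.

    Assumes PDF points (72 per inch) with the origin at the bottom-left,
    and a page image rendered at `dpi` with the origin at the top-left.
    """
    x0, y0, x1, y1 = box
    scale = dpi / 72.0
    left = round(min(x0, x1) * scale)
    right = round(max(x0, x1) * scale)
    # Flip the y axis: the highest PDF y becomes the smallest pixel y.
    top = round((page_height_pt - max(y0, y1)) * scale)
    bottom = round((page_height_pt - min(y0, y1)) * scale)
    return (left, top, right, bottom)


def reading_order(boxes: List[PixelBox], row_tolerance: int = 10) -> List[PixelBox]:
    """Order crop boxes row by row, left to right within each row.

    Cells whose top edge lies within `row_tolerance` pixels of the first
    cell in the current row are treated as belonging to that row.
    """
    rows: List[List[PixelBox]] = []
    for box in sorted(boxes, key=lambda b: (b[1], b[0])):
        if rows and abs(box[1] - rows[-1][0][1]) <= row_tolerance:
            rows[-1].append(box)
        else:
            rows.append([box])
    return [b for row in rows for b in sorted(row, key=lambda b: b[0])]
```

Each resulting box could then be handed to something like Pillow's
`Image.crop` and the cropped cell passed to tesseract, so that the text of one
cell never gets mixed with its neighbours. The page image has to be rendered at
the same dpi that is used for the conversion.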



On 8/21/17, Raphael Susewind <[email protected]> wrote:
> Hi Nikhil and Devdatta,
>
> very useful references.
>
> Just to jump in on table conversion: there is a python script called
> pdf-table-convert that is quite capable of detecting tables in PDFs.
> They use a graphical approach rather than a logical one, so it doesn't
> matter how bad the PDF is - even scans work in principle.
>
> Importantly, with the right options, the script gives you bounding box
> coordinates for each cell, which you can feed into ghostscript (or
> whatever you like) to extract an image of just that cell prior to OCRing
> - which indeed saves a lot of time.
>
> The whole processing chain is referenced in my GitHub scripts (see
> below), namely in most versions of pdf2list.pl, where pdf-table-convert
> is called towards the bottom and the output then fed to tesseract...
>
> Best,
> Raphael
>
> On 08/20/2017 11:00 AM, Nikhil VJ wrote:
>> Hi Devdatta,
>>
>> I had come across the legacy Devnagri fonts issue earlier when I started
>> working on budget data. The fonts are Shree-Dev, Kruti-Dev, Shivaji, etc.:
>> legacy fonts used in an era when unicode devnagri wasn't invented, and
>> to get around that, there was simple substitution like a = क. I've put up
>> a graphic that shows this mapping for a few fonts:
>> http://i.imgur.com/ICUC6Wk.png
>>
>> I found a group named technical-hindi who have been working on simple
>> javascript pages that convert these fonts to unicode devnagri (and
>> back!). I used them, and with the content I had, I had to introduce some
>> extra conversions, and it worked like a charm.
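
The substitution idea behind such converters can be sketched roughly as below.
This is only an illustration, not the technical-hindi code: the mapping table
is a toy one, where "a" = क is the example from the message above and the
other entries are invented, so it does not match any real font. Real
converters typically also need extra rules (for instance reordering some vowel
signs), which are left out here.

```python
# Toy mapping for illustration only. "a" -> "क" is the example given in the
# message; the other entries are invented and do NOT belong to any real font.
TOY_LEGACY_TO_UNICODE = {
    "a": "क",
    "b": "ख",
    "ab'": "ग",  # hypothetical multi-character sequence
}


def legacy_to_unicode(text: str, mapping: dict) -> str:
    """Replace legacy-font character sequences with Unicode characters.

    Longer sequences are tried first so that a multi-character key is not
    broken up by a shorter key that happens to be its prefix. Characters
    with no mapping are copied through unchanged.
    """
    keys = sorted((k for k in mapping if k), key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(mapping[key])
                i += len(key)
                break
        else:
            # No key matches at this position: keep the character as it is.
            out.append(text[i])
            i += 1
    return "".join(out)
```

Adding "extra conversions" for a particular document then amounts to adding
entries to the mapping table.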
>>
>> Their site where many converters are shared :
>> https://sites.google.com/site/technicalhindi/home/converters
>> Their google group:
>> https://groups.google.com/forum/#!forum/technical-hindi
>>
>> I've shared the modified converters I used here:
>> http://ourpuneourbudget.in/tools/
>> (only had those limited use cases)
>>
>> In the process of studying these, I came upon an unexpected situation :
>> If the document you are extracting data from is a PDF (which I also
>> refer to as "digital graveyard"), then it is PREFERABLE if the fonts are
>> in legacy Devnagri font rather than Unicode font!
>>
>> That's because as of today (or 2015 when I came across it), PDF
>> technology doesn't handle unicode Devnagri well. To make the glyphs
>> "print" properly, the text gets distorted in ways that permanently
>> alter the original chars. The issue is described here:
>> https://stackoverflow.com/questions/30756193/unable-to-copy-exact-hindi-content-from-pdf
>>
>> ..So if the text in the PDFs you're working on is in legacy Devnagri
>> instead of Unicode Devnagri, then you're actually lucky :P .
>>
>> If it's in unicode then that PDF is a true digital graveyard :P. OCR can
>> work, yes, but please tell me if you find a way to OCR a page one table
>> cell at a time instead of jumbling up everything. I had also come across
>> a project like yours a year ago but I backed out because I could not get
>> around this issue: the fonts in the PDF were in Unicode.
>>
>> Here's an issue I filed in the Tabula project related to this, and they
>> fixed it for the legacy fonts extraction at least.
>> https://github.com/tabulapdf/tabula/issues/303
>>
>>
>>
>> --
>> Cheers,
>> Nikhil VJ
>> +91-966-583-1250
>> Pune / Mandangad, India
>> DataMeet Pune chapter <https://datameet-pune.github.io/>
>> Self-designed learner at Swaraj University
>> <http://www.swarajuniversity.org>
>> Blog <http://nikhilsheth.blogspot.in>
>>
>> On Sat, Aug 19, 2017 at 11:21 PM, Raphael Susewind
>> <[email protected] <mailto:[email protected]>> wrote:
>>
>>     Hi Devdatta,
>>
>>     I had run into the same issue, and indeed the only workaround is OCR.
>>     It's not just a different encoding than unicode - it's actually garbled
>>     CMaps, which is much worse (i.e. not recoverable).
>>
>>     See my comments here for starters (and the badly written scripts):
>>
>>     https://github.com/raphael-susewind/india-religion-politics/tree/master/maharolls2014
>>
>>     As for Soundex, you might want to take a look at the IndicSoundex
>>     collection, which is more accurate than transliteration into latin
>>     followed by English soundex:
>>
>>     http://libindic.org/Soundex
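
For comparison, the "transliterate into latin, then English soundex" route
boils down to something like the sketch below. This is the classic English
(American) Soundex, not the IndicSoundex algorithm from libindic, and it
assumes the names have already been transliterated into Latin letters.

```python
def soundex(name: str) -> str:
    """Classic English Soundex code: first letter plus three digits."""
    groups = (
        ("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
        ("L", "4"), ("MN", "5"), ("R", "6"),
    )
    codes = {ch: digit for letters, digit in groups for ch in letters}

    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""

    result = letters[0]
    prev = codes.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not break a run of equal codes
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (result + "000")[:4]
```

Two spellings of the same name are then treated as a match when their codes
are equal, e.g. `soundex("Sheth") == soundex("Seth")`.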
>>
>>     Good news is that I have done the whole exercise for Maharashtra 2014,
>>     and may be able to share depending on what your project is about.
>>     Perhaps send me a PM and we can discuss further,
>>
>>     Best,
>>     Raphael
>>
>>     On 08/19/2017 06:14 PM, Devdatta Tengshe wrote:
>>     > I'm attempting to read Names, Ages & Genders from Electoral Rolls, so
>>     > that I can create a database of Names, to figure out the General
>>     > Spread of Specific Names across locations, and ages.
>>     >
>>     > I began working with Mumbai's rolls, and am running into the
>>     > following issues:
>>     >
>>     > 1) The Electoral Rolls are not in English, but in Devanagari. This is
>>     > not a Major issue, because I could transliterate it into English for
>>     > Comparison (I need the names to be in English, so that I can use
>>     > Soundex to remove misspellings etc). I know libraries for
>>     > transliteration that work with Devanagari (Hindi & Marathi). Is there
>>     > anything similar for other scripts such as Kannada & Tamil etc?
>>     >
>>     > 2) While the Rolls are in Devanagari, the text is not actually in
>>     > Unicode. It is in some other font, and hence when I Get the text out,
>>     > it's garbage. Since Others have worked with the rolls before, is
>>     > there a better way to get the Text Out?
>>     >
>>     > 3) If it's not possible to get the Text out, Can we use OCR? What OCR
>>     > library is best at working with Indic Scripts?
>>     >
>>     > If anyone has some experience to share on these issues, it will be
>>     > much appreciated.
>>     >
>>     > --
>>     > Datameet is a community of Data Science enthusiasts in India. Know
>>     > more about us by visiting http://datameet.org
>>     > ---
>>     > You received this message because you are subscribed to the Google
>>     > Groups "datameet" group.
>>     > To unsubscribe from this group and stop receiving emails from it,
>>     > send an email to [email protected].
>>     > For more options, visit https://groups.google.com/d/optout.
>>
>


-- 
--
Cheers,
Nikhil VJ
+91-966-583-1250
Pune / Mandangad, India
DataMeet Pune chapter <https://datameet-pune.github.io/>
Self-designed learner at Swaraj University <http://www.swarajuniversity.org>
Blog <http://nikhilsheth.blogspot.in>
Contribute <https://www.instamojo.com/@nikhilvj/>

-- 
Datameet is a community of Data Science enthusiasts in India. Know more about 
us by visiting http://datameet.org
--- 
You received this message because you are subscribed to the Google Groups 
"datameet" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
For more options, visit https://groups.google.com/d/optout.
