Re: [datameet] Library to read tables in scanned PDFs

nikh...@gmail.com Sun, 26 Dec 2021 08:01:51 -0800

Hi All,

Replying to an old thread that matches this subject.


I recently succeeded in getting a single Python program to do the whole job: 
taking a PDF, decrypting it (apparently many PDFs are encrypted even when 
the password is blank), OCR'ing it (in Marathi/Hindi script, plus some 
English parts), extracting the tabular data, reshaping it into a proper 
table with one data item per row, and finally saving it to either an Excel 
file or a database.
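
For anyone wondering what "one data item per row" means in practice, here is 
a minimal sketch of the reshaping and saving steps. It is not the program 
described above. It uses only the Python standard library; the sample rows, 
the column names and the table name "items" are all made up for 
illustration.

```python
import sqlite3


def wide_to_long(rows, id_columns):
    """Turn each non-empty, non-id cell of a wide row into its own record."""
    long_rows = []
    for row in rows:
        ids = {col: row[col] for col in id_columns}
        for col, value in row.items():
            if col in id_columns or value in (None, ""):
                continue  # skip the id columns themselves and empty cells
            long_rows.append({**ids, "item": col, "value": value})
    return long_rows


def save_to_sqlite(long_rows, id_columns, db_path=":memory:"):
    """Write long-format records to a table called 'items' (hypothetical name)."""
    conn = sqlite3.connect(db_path)
    columns = list(id_columns) + ["item", "value"]
    col_sql = ", ".join(f'"{c}"' for c in columns)
    conn.execute(f"CREATE TABLE IF NOT EXISTS items ({col_sql})")
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(
        f"INSERT INTO items VALUES ({placeholders})",
        [tuple(r[c] for c in columns) for r in long_rows],
    )
    conn.commit()
    return conn


# Made-up rows standing in for one extracted table.
wide = [
    {"village": "A", "2019": "120", "2020": "135"},
    {"village": "B", "2019": "98", "2020": ""},
]
long_rows = wide_to_long(wide, id_columns=["village"])
conn = save_to_sqlite(long_rows, id_columns=["village"])
```

Pass a file path instead of ":memory:" to keep the database on disk.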

It uses these Python libraries, plus some dependencies that I could easily 
install on my Ubuntu system:
pikepdf <https://pypi.org/project/pikepdf/> (decryption) | ocrmypdf 
<https://pypi.org/project/ocrmypdf/> (OCR) | tabula-py 
<https://pypi.org/project/tabula-py/> (table extraction)
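
A rough sketch of how the three libraries could be chained, for anyone who 
wants a starting point. This is an assumption about the wiring, not the 
actual program. The function names and the intermediate file names are made 
up. The three libraries are imported inside the functions so that the file 
still loads on a machine where one of them is not installed yet.

```python
from pathlib import Path


def work_paths(src: Path):
    """Derive names for the intermediate files next to the source PDF."""
    return (
        src.with_name(src.stem + ".decrypted.pdf"),
        src.with_name(src.stem + ".ocr.pdf"),
    )


def decrypt_pdf(src: Path, dst: Path, password: str = "") -> None:
    """Save an unencrypted copy; pikepdf drops encryption on save by default."""
    import pikepdf

    with pikepdf.open(src, password=password) as pdf:
        pdf.save(dst)


def ocr_pdf(src: Path, dst: Path, languages=("mar", "hin", "eng")) -> None:
    """Add a text layer using the given Tesseract language codes."""
    import ocrmypdf

    ocrmypdf.ocr(src, dst, language=list(languages))


def extract_tables(src: Path, pages="all"):
    """Return a list of pandas DataFrames, one per table that tabula finds."""
    import tabula

    return tabula.read_pdf(str(src), pages=pages, multiple_tables=True)


def run_pipeline(src: Path):
    """Decrypt, OCR, then extract tables from one PDF."""
    decrypted, ocred = work_paths(src)
    decrypt_pdf(src, decrypted)
    ocr_pdf(decrypted, ocred)
    return extract_tables(ocred)
```

Calling run_pipeline(Path("scan.pdf")) would need Tesseract with the 
matching language packs, Ghostscript and Java installed, since ocrmypdf and 
tabula-py depend on them.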

(On Windows: maybe doable, but I'm not sure about those dependencies. Feels 
like they also really want people to move to Linux these days :P)

The last link in that chain, table extraction, is crucial, and by now it has 
evolved to offer several good options for targeting specific areas of a 
page. Here's the core program 
<https://github.com/tabulapdf/tabula-java#usage-examples> that the Python 
library wraps.
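
As a hedged illustration of area targeting with tabula-py: read_pdf accepts 
an area given as (top, left, bottom, right), and with relative_area=True the 
numbers are read as percentages of the page. The helper names and the 
example coordinates below are made up.

```python
def percent_area(top, left, bottom, right, page_height, page_width):
    """Convert a box in PDF points to page percentages, in tabula's order."""
    return [
        round(100 * top / page_height, 2),
        round(100 * left / page_width, 2),
        round(100 * bottom / page_height, 2),
        round(100 * right / page_width, 2),
    ]


def read_region(pdf_path, page, area_percent, ruled=True):
    """Extract tables from one region of one page.

    ruled=True uses lattice mode (tables with drawn cell borders);
    ruled=False uses stream mode (columns separated by whitespace).
    """
    import tabula  # imported here so the helper above works without it

    return tabula.read_pdf(
        pdf_path,
        pages=page,
        area=area_percent,
        relative_area=True,
        lattice=ruled,
        stream=not ruled,
    )


# A box on a US Letter page (612 x 792 points), chosen only as an example.
area = percent_area(72, 36, 396, 576, page_height=792, page_width=612)
```

Something like read_region("scan.ocr.pdf", 1, area) would then look only at 
that part of page 1.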

As for the OCR part, it handled about 80% of my target PDF, which was in 
Marathi, correctly. ocrmypdf runs on the Tesseract project 
<https://github.com/tesseract-ocr/>, which is continuously evolving, so we 
can expect it to get better over time. (And if someone knows where to 
contribute more samples for the training model, please let me know.) 
ocrmypdf also has a handy deskew option, which you can use for 
photo-scanned PDFs where everything got tilted.
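
A small sketch of the OCR step with deskew switched on, plus a check that 
the needed Tesseract language packs are installed before starting a long 
run. The helper names are made up, and the sample text in the test is only 
an imitation of what `tesseract --list-langs` prints.

```python
import subprocess


def parse_list_langs(output: str):
    """Return the language codes from `tesseract --list-langs` output."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # Drop the header line ("List of available languages ...").
    return {line for line in lines if not line.startswith("List of")}


def missing_languages(wanted, installed):
    """Language codes that are wanted but not installed, sorted."""
    return sorted(set(wanted) - set(installed))


def installed_languages():
    """Ask the local Tesseract which language packs it has."""
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Some builds print the list to stderr, so fall back to it.
    return parse_list_langs(result.stdout or result.stderr)


def ocr_scanned(src, dst, languages=("mar", "hin", "eng")):
    """OCR a tilted scan: deskew=True straightens pages before recognition."""
    import ocrmypdf  # imported here so the helpers above work without it

    absent = missing_languages(languages, installed_languages())
    if absent:
        raise RuntimeError(f"Missing Tesseract language packs: {absent}")
    ocrmypdf.ocr(src, dst, language=list(languages), deskew=True)
```

On Ubuntu the Marathi and Hindi packs are usually separate packages, so the 
check can save a failed run.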

I'm not able to share a demo program at this point, but I want to let people 
know that *Hey, it can be done!* Just follow the trail. (And get in touch if 
you want to implement something.)

Regards
Nikhil VJ
https://nikhilvj.co.in
+91-9665831250

On Tuesday, January 24, 2017 at 2:45:52 PM UTC+5:30 mohit ranjan wrote:

> Thanks Aman, Raphael
>
> Let me try these steps.
>
> - Mohit
>
> On Mon, Jan 23, 2017 at 7:31 PM, Raphael Susewind <
> li...@raphael-susewind.de> wrote:
>
>> Hi Mohit,
>>
>> just to add - a hacked-but-working workflow to extract the table
>> structure and OCR bits and pieces as needed can be found in my GitHub,
>> for instance here (at the bottom of the perl file):
>>
>>
>> https://github.com/raphael-susewind/india-religion-politics/blob/master/rajrolls2014/run-in-arc/pdf2list.pl
>>
>> It boils down to
>>
>> pdf-table-extract -i $file -p $page -r 300 -l 0.7 -t cells_xml
>>
>> for each page, parsing the results to extract cell coordinates, then
>>
>> gs -q -r300 -dFirstPage=$page -dLastPage=$page -sDEVICE=tiffgray
>> -sCompression=lzw -o $temp.tif -g".$width."x".$height." -c '<</Install
>> {-$bufferx -$buffery translate}>> setpagedevice' -f $file
>>
>> to get a TIFF of this cell, to be fed into
>>
>> tesseract -psm 4 -l hin temp.tif stdout
>>
>> (in the case of devanagari)
>>
>> Best of luck,
>> Raphael
>>
>> On 01/23/2017 09:20 AM, Amanbir Singh wrote:
>> > Hi Mohit,
>> >
>> > You'll have to use OCR on the pdf before any other method can be
>> > applied. This obviously makes it more complicated, but still manageable.
>> >
>> > You could use Tesseract, a popular OCR package
>> > (https://github.com/tesseract-ocr/tesseract), and then try Tabula
>> > or the other packages mentioned. I've also had success using Xpdf
>> > (http://www.foolabs.com/xpdf/) to convert PDFs to text and then parsing
>> > the text.
>> >
>> > Aman
>> >
>> >
>> > On Friday, 20 January 2017 18:18:59 UTC+5:30, mohit ranjan wrote:
>> >
>> >     Tried Tabula, but again it only works for PDFs that have all the
>> >     text data embedded in them.
>> >     I need it for paper-scanned PDFs/JPGs, where it fails with this message:
>> >
>> >     /"Sorry, your PDF file is image-based; it does not have any embedded
>> >     text. It might have been scanned from paper... Tabula isn't able to
>> >     extract any data from image-based PDFs. Click the Help button for
>> >     more information."/
>> >
>> >     - Mohit
>> >
>> >     On Fri, Jan 20, 2017 at 6:14 PM, Srinivasan Ramani
>> >     <sriniv...@gmail.com> wrote:
>> >
>> >         Tabula - http://tabula.technology/ works great with table
>> >         extraction from PDFs.
>> >
>> >         On Fri, Jan 20, 2017 at 5:51 PM, mohit ranjan
>> >         <shoony...@gmail.com> wrote:
>> >
>> >             Thanks for the response, Johnson.
>> >
>> >             Is this the pdf-table-extract
>> >             <https://github.com/ashima/pdf-table-extract> you are
>> >             referring to?
>> >             It says it reads table metadata from the PDF.
>> >
>> >             My query was about scanned PDF/JPG images.
>> >
>> >             - Mohit
>> >
>> >             On Fri, Jan 20, 2017 at 4:37 PM, Johnson Chetty
>> >             <johnso...@gmail.com> wrote:
>> >
>> >
>> >                     Hello,
>> >
>> >                     I have had some reasonable success with 'pdfquery'
>> >                     if you like Python. It works with regional text as
>> >                     well.
>> >                     Also, for tabular data, do try pdf-table-extract if
>> >                     quick and dirty works for you.
>> >
>> >                     Java folks should try pdfbox.
>> >
>> >
>> >
>> >
>> >
>> >                     On 20 January 2017 at 15:23, mohit ranjan
>> >                     <shoony...@gmail.com> wrote:
>> >
>> >                         Sorry if this is off-topic, but I have seen
>> >                         threads here about liberating data from PDFs.
>> >                         Most likely there will be a lot of scanned PDFs
>> >                         among them.
>> >
>> >                         Do we have any in-house expert on this? Which
>> >                         library/tool (preferably not paid) can extract
>> >                         tables from scanned PDFs/JPGs?
>> >
>> >                         CVision
>> >                         <http://www.cvisiontech.com/library/ocr/file-ocr/ocr-table-recognition.html>
>> >                         does a decent job, but it's paid.
>> >
>> >
>> >
>> >                         - Mohit
>> >
>> >                         --
>> >                         Datameet is a community of Data Science
>> >                         enthusiasts in India. Know more about us by
>> >                         visiting http://datameet.org
>> >                         ---
>> >                         You received this message because you are
>> >                         subscribed to the Google Groups "datameet" group.
>> >                         To unsubscribe from this group and stop
>> >                         receiving emails from it, send an email to
>> >                         datameet+u...@googlegroups.com.
>> >                         For more options, visit
>> >                         https://groups.google.com/d/optout
>> >                         <https://groups.google.com/d/optout>.
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >         --
>> >         Best Regards,
>> >         Srinivasan V. Ramani ,
>> >         Associate Editor,
>> >         The Hindu,
>> >         Chennai.
>> >         Ph: 07299033554
>> >
>> >
>> >
>>
>> --
>> Dr Raphael Susewind | Postdoc, Max Planck Institute for the Study of
>>                     | Religious and Ethnic Diversity (MPI-MMG)
>>                     | Hermann-Föge-Weg 11, 37073 Göttingen, Germany
>>                     | https://www.raphael-susewind.de
>>
>> Please consider PGP for encryption: https://keybase.io/raphaelsusewind
>>

-- 
Datameet is a community of Data Science enthusiasts in India. Know more about 
us by visiting http://datameet.org
--- 
You received this message because you are subscribed to the Google Groups 
"datameet" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to datameet+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/datameet/3c38920f-81ab-4e06-a9b5-9efb3e625f20n%40googlegroups.com.
