Re: [Tutor] Package which can extract data from pdf

Mats Wichmann Wed, 14 Aug 2019 11:19:38 -0700

On 8/14/19 10:10 AM, Nupur Jha wrote:
> Hi All,
> 
> I have many PDF invoices in different formats. I want to extract the line
> items from these PDF files using Python.
> 
> Could you all guide me on how I can achieve this?
>


There are many packages that attempt to extract text from PDF files.  They
have varying degrees of success on different documents: you need to be
aware that PDF wasn't intended to be used that way; it was designed to
*display* consistently.  Sometimes a PDF is full of rendering
instructions that are hard for an extraction tool to interpret, and the
text has to be pieced together in possibly unexpected ways.  My experience
is that if you can select the interesting text in a PDF viewer, paste it
into an editor, and it doesn't come out looking particularly mangled, then
reading it programmatically has a pretty good chance of working. If not,
you may be in trouble. That said...
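
The "does it look mangled?" check can also be roughed out in code, once
some tool has given you the text. This is only a sketch: the function name
looks_mangled, the token pattern, and the 0.6 threshold are all assumptions
made for illustration, not something taken from any of the packages
mentioned here, and they would need tuning against real invoices.

```python
import re

# Assumption: a "plausible" token is either a word of two or more letters
# (optionally followed by punctuation) or a number/amount such as 19.98.
PLAUSIBLE_TOKEN = re.compile(r"^[A-Za-z]{2,}[.,:;]?$|^[$€£]?\d[\d.,]*%?$")


def looks_mangled(text, min_plausible_ratio=0.6):
    """Guess whether extracted text is too garbled to be worth parsing.

    Returns True when fewer than `min_plausible_ratio` of the
    whitespace-separated tokens look like ordinary words or numbers.
    The threshold is an arbitrary starting point, not a tested value.
    """
    tokens = text.split()
    if not tokens:
        # Nothing came out at all, which is typical of scanned images.
        return True
    plausible = sum(1 for tok in tokens if PLAUSIBLE_TOKEN.match(tok))
    return plausible / len(tokens) < min_plausible_ratio
```

Calling looks_mangled() on the text you pasted into an editor, or on the
output of an extraction package, gives a quick yes/no before you invest in
writing a parser for that document.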

PyPDF2, textract, and tika all have their supporters. You can search for
all of these on PyPI, which will give you links to the projects' home pages.

(If it matters, tika is an interface to a bunch of Java code, so you're
not using Python to read the PDF, but you are using Python to control the
process.)
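
As a rough illustration of the first of those packages, here is a minimal
sketch that pulls the text out with PyPDF2 and then picks out line items
with a regular expression. Several things here are assumptions: it uses
the PdfReader / extract_text() names from PyPDF2 2.0 and later (older
releases spelled these PdfFileReader and extractText()), and the line
layout "description, quantity, unit price, line total" is a made-up
example. Real invoices in different formats will each need their own
pattern.

```python
import re
from decimal import Decimal


def extract_text(path):
    """Return the text of every page of the PDF at `path`, newline-joined."""
    # Imported here so parse_line_items() below works without PyPDF2.
    # Assumes PyPDF2 >= 2.0; older releases used PdfFileReader instead.
    from PyPDF2 import PdfReader

    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)


# Hypothetical layout: "<description> <qty> <unit price> <line total>",
# for example "Widget, large 2 9.99 19.98".
LINE_ITEM = re.compile(
    r"^(?P<description>.+?)\s+(?P<qty>\d+)"
    r"\s+(?P<unit>\d+\.\d{2})\s+(?P<total>\d+\.\d{2})$"
)


def parse_line_items(text):
    """Return a list of dicts for the lines that match LINE_ITEM."""
    items = []
    for line in text.splitlines():
        match = LINE_ITEM.match(line.strip())
        if match:
            items.append({
                "description": match.group("description"),
                "qty": int(match.group("qty")),
                "unit": Decimal(match.group("unit")),
                "total": Decimal(match.group("total")),
            })
    return items
```

With an invoice saved as, say, invoice.pdf (a placeholder name), the two
steps chain together as parse_line_items(extract_text("invoice.pdf")).
Keeping the parsing separate from the extraction means you can swap in
textract or tika for the first step without touching the second.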

There's a product called pdftables which specifically tries to be good
at spreadsheet-like data, which your invoices *might* be.  That is not a
free product, however. For that one there's a Python interface that
sends your data off to a web service and you get answers back.

There are probably dozens more... this seems to be an area with a lot of
reinvention going on.

_______________________________________________
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor
