> On Aug 14, 2019, at 2:16 PM, Mats Wichmann <m...@wichmann.us> wrote:
> 
>> On 8/14/19 10:10 AM, Nupur Jha wrote:
>> Hi All,
>> 
>> I have many pdf invoices with different formats. I want to extract the line
>> items from these pdf files using python coding.
>> 

Treat this as a two-part problem: part one is extracting the text; part two is 
parsing that text for the information you want. Unless you specifically need 
to extract the text with Python, I'd recommend solving part one with an 
image-to-text (OCR) reader. These have gotten really quite good recently (AI, 
no doubt). Parsing the resulting text with Python's string handling routines 
should then be pretty straightforward. 
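
As a rough sketch of part two, something like the following could pull line 
items out of text that has already been extracted. The layout it assumes 
(description, quantity, unit price, amount, separated by whitespace) is made 
up for illustration, as are the names LINE_ITEM, parse_line_items and 
SAMPLE_TEXT; each real invoice format would likely need its own pattern.

```python
import re
from decimal import Decimal

# Assumed (hypothetical) layout: description, quantity, unit price, amount,
# separated by runs of whitespace. Real invoices will differ.
LINE_ITEM = re.compile(
    r"^(?P<description>.+?)\s+"
    r"(?P<qty>\d+)\s+"
    r"(?P<unit_price>\d+\.\d{2})\s+"
    r"(?P<amount>\d+\.\d{2})$"
)


def parse_line_items(text):
    """Return a list of dicts, one per line that matches the assumed layout."""
    items = []
    for line in text.splitlines():
        match = LINE_ITEM.match(line.strip())
        if match is None:
            continue  # headers, totals, blank lines, and so on
        items.append({
            "description": match["description"],
            "qty": int(match["qty"]),
            # Decimal avoids float rounding surprises with money values
            "unit_price": Decimal(match["unit_price"]),
            "amount": Decimal(match["amount"]),
        })
    return items


# Made-up sample of what extracted invoice text might look like.
SAMPLE_TEXT = """\
Invoice 1001
Description          Qty   Price   Amount
Widget A               2    9.99    19.98
Shipping box, large    1    4.50     4.50
Total                               24.48
"""

if __name__ == "__main__":
    for item in parse_line_items(SAMPLE_TEXT):
        print(item)
```

Lines that do not fit the pattern are skipped silently here; with many 
different invoice formats it is probably worth logging them so you can see 
what each pattern misses.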

Bill

>> I would request you all to guide me on how I can achieve this.
>> 
> 
> There are many packages that attempt to extract text from pdf.  They
> have varying degrees of success on various different documents: you need
> to be aware that PDF wasn't intended to be used that way, it was written
> to *display* consistently.  Sometimes the pdf is full of instructions
> for rendering that are hard for a reader to figure out, and need to be
> pieced together in possibly unexpected ways.  My experience is that if
> you can select the interesting text in a pdf reader, and paste it into
> an editor, and it doesn't come out looking particularly mangled, then
> reading it programmatically has a pretty good chance of working. If not,
> you may be in trouble. That said...
> 
> PyPDF2, textract, and tika all have their supporters. You can search for
> all of these on PyPI, which will give you links to the projects' home pages.
> 
> (if it matters, tika is an interface to a bunch of Java code, so you're
> not using Python to read it, but you are using Python to control the
> process)
> 
> There's a product called pdftables which specifically tries to be good
> at spreadsheet-like data, which your invoices *might* be.  That is not a
> free product, however. For that one there's a Python interface that
> sends your data off to a web service and you get answers back.
> 
> There are probably dozens more... this seems to be an area with a lot of
> reinvention going on.
> 
> _______________________________________________
> Tutor maillist  -  Tutor@python.org
> To unsubscribe or change subscription options:
> https://mail.python.org/mailman/listinfo/tutor
