> On Aug 14, 2019, at 2:16 PM, Mats Wichmann <m...@wichmann.us> wrote:
>
>> On 8/14/19 10:10 AM, Nupur Jha wrote:
>> Hi All,
>>
>> I have many pdf invoices with different formats. I want to extract the line
>> items from these pdf files using python coding.
>>

Treat this as a two-part problem: part one is extracting the text; part two is
parsing that text for your desired information.

Unless you have a specific need to extract the text with Python, I’d recommend
solving part one with an image-to-text reader. These have gotten really quite
good recently (AI no doubt). Then parsing the text with Python’s string
handling routines should be pretty straightforward.

Bill

>> I would request you all to guide me how can i achieve this.
>>
>
> There are many packages that attempt to extract text from pdf. They
> have varying degrees of success on various different documents: you need
> to be aware that PDF wasn't intended to be used that way, it was written
> to *display* consistently. Sometimes the pdf is full of instructions
> for rendering that are hard for a reader to figure out, and need to be
> pieced together in possibly unexpected ways. My experience is that if
> you can select the interesting text in a pdf reader, and paste it into
> an editor, and it doesn't come out looking particularly mangled, then
> reading it programmatically has a pretty good chance of working. If not,
> you may be in trouble. That said...
>
> pypdf2, textract, and tika all have their supporters. You can search for
> all of these on pypi, which will give you links to the projects' home pages.
>
> (if it matters, tika is an interface to a bunch of Java code, so you're
> not using Python to read it, but you are using Python to control the
> process)
>
> There's a product called pdftables which specifically tries to be good
> at spreadsheet-like data, which your invoices *might* be. That is not a
> free product, however. For that one there's a Python interface that
> sends your data off to a web service and you get answers back.
>
> There are probably dozens more... this seems to be an area with a lot of
> reinvention going on.
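
To make part two a bit more concrete, here is a minimal sketch of the parsing
step. It assumes the text has already been extracted by whatever tool works
for your files, and that each line item comes out as one line of the form
"description, quantity, unit price, line total". That layout, the sample
text, and the names LineItem, LINE_ITEM_RE and parse_line_items are all made
up for illustration; your invoices will almost certainly differ.

```python
import re
from decimal import Decimal
from typing import List, NamedTuple


class LineItem(NamedTuple):
    description: str
    quantity: int
    unit_price: Decimal
    total: Decimal


# Assumed (hypothetical) layout of one line item in the extracted text:
#   <description>  <quantity>  <unit price>  <line total>
# Prices may carry a leading "$" and thousands separators.
LINE_ITEM_RE = re.compile(
    r"^(?P<description>.+?)\s+"
    r"(?P<quantity>\d+)\s+"
    r"\$?(?P<unit_price>[\d,]+\.\d{2})\s+"
    r"\$?(?P<total>[\d,]+\.\d{2})\s*$"
)


def to_decimal(text: str) -> Decimal:
    """Convert a price such as '1,250.50' to a Decimal."""
    return Decimal(text.replace(",", ""))


def parse_line_items(text: str) -> List[LineItem]:
    """Return every line of text that looks like a line item."""
    items = []
    for line in text.splitlines():
        match = LINE_ITEM_RE.match(line.strip())
        if match is None:
            continue  # headers, subtotals, addresses, blank lines, ...
        items.append(
            LineItem(
                description=match.group("description"),
                quantity=int(match.group("quantity")),
                unit_price=to_decimal(match.group("unit_price")),
                total=to_decimal(match.group("total")),
            )
        )
    return items


# Made-up sample standing in for the text extracted from one invoice.
SAMPLE_TEXT = """\
Invoice 0001
Description        Qty   Unit Price   Total
Widget, blue         2        10.00   20.00
Gadget               1     1,250.50   1,250.50
Subtotal                              1,270.50
"""

items = parse_line_items(SAMPLE_TEXT)
for item in items:
    print(item)
```

With several invoice formats you would probably end up with one pattern per
layout and try each in turn. If the text pasted from a PDF reader does not
come out one line item per line, this approach will not work as is.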
>
> _______________________________________________
> Tutor maillist - Tutor@python.org
> To unsubscribe or change subscription options:
> https://mail.python.org/mailman/listinfo/tutor