On 8/14/19 10:10 AM, Nupur Jha wrote:
> Hi All,
>
> I have many pdf invoices with different formats. I want to extract the line
> items from these pdf files using python coding.
>
> I would request you all to guide me how can i achieve this.
>

There are many packages that attempt to extract text from PDF files. They have varying degrees of success on different documents: you need to be aware that PDF wasn't intended to be used that way; it was designed to *display* consistently. Sometimes the PDF is full of rendering instructions that are hard for a reader to figure out and need to be pieced together in possibly unexpected ways.

My experience is that if you can select the interesting text in a PDF reader, paste it into an editor, and it doesn't come out looking particularly mangled, then reading it programmatically has a pretty good chance of working. If not, you may be in trouble.

That said... PyPDF2, textract, and tika all have their supporters. You can search for all of these on PyPI, which will give you links to the projects' home pages. (If it matters, tika is an interface to a bunch of Java code, so you're not using Python to read the PDF, but you are using Python to control the process.)

There's a product called pdftables which specifically tries to be good at spreadsheet-like data, which your invoices *might* be. It is not a free product, however. It has a Python interface that sends your data off to a web service, and you get answers back.

There are probably dozens more... this seems to be an area with a lot of reinvention going on.
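
To make the PyPDF2 route a bit more concrete, here is a rough sketch, not a recipe. It assumes a PyPDF2 release that provides `PdfReader` (older releases used `PdfFileReader` instead), and the function names `extract_pages` and `find_line_items` are made up for illustration. The regular expression assumes a purely hypothetical invoice layout of "description, quantity, unit price, line total" on one line; real invoices will differ, and you would probably need one pattern per format.

```python
import re


def extract_pages(pdf_path):
    """Return the extracted text of each page as a list of strings."""
    # Imported here so the rest of this sketch loads even without PyPDF2.
    # Assumption: a PyPDF2 release that provides PdfReader.
    from PyPDF2 import PdfReader

    reader = PdfReader(pdf_path)
    # Pages with no extractable text (e.g. scanned images) give "" here.
    return [page.extract_text() or "" for page in reader.pages]


# Hypothetical layout: description, quantity, unit price, line total.
LINE_ITEM = re.compile(
    r"^(?P<desc>.+?)\s+(?P<qty>\d+)\s+(?P<unit>\d+\.\d{2})\s+(?P<total>\d+\.\d{2})$"
)


def find_line_items(text):
    """Pick out lines of extracted text that match the assumed layout."""
    items = []
    for line in text.splitlines():
        match = LINE_ITEM.match(line.strip())
        if match:
            items.append(
                (
                    match["desc"],
                    int(match["qty"]),
                    float(match["unit"]),
                    float(match["total"]),
                )
            )
    return items
```

You would call `find_line_items` on each string returned by `extract_pages("invoice.pdf")` (the file name is a placeholder). If the copy-and-paste test described above produces mangled text for a given invoice, expect the same from `extract_text()`, and the pattern will simply not match.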