RE: Script to extract text from PDF files
It's possible (likely) that I came into this in the middle, so sorry if this was already thrown out... but have you looked at any of the following suggestions?

https://pypi.python.org/pypi?%3Aaction=search=pdf+convert=search
http://stackoverflow.com/questions/6413441/python-pdf-library
https://www.binpress.com/tutorial/manipulating-pdfs-with-python/167

-----Original Message-----
From: Python-list [mailto:python-list-bounces+d.strohl=f5@python.org] On Behalf Of Scott Werner
Sent: Friday, November 06, 2015 2:30 PM
To: python-list@python.org
Subject: Re: Script to extract text from PDF files

-- https://mail.python.org/mailman/listinfo/python-list
Re: Script to extract text from PDF files
On Tuesday, September 25, 2007 at 1:41:56 PM UTC-4, brad wrote:
> I have a very crude Python script that extracts text from some (and I
> emphasize some) PDF documents. On many PDF docs, I cannot extract text,
> but this is because I'm doing something wrong. The PDF spec is large and
> complex and there are various ways in which to store and encode text. I
> wanted to post here and ask if anyone is interested in helping make the
> script better which means it should accurately extract text from most
> any pdf file... not just some.
>
> I know the topic of reading/extracting the text from a PDF document
> natively in Python comes up every now and then on comp.lang.python...
> I've posted about it in the past myself. After searching for other
> solutions, I've resorted to attempting this on my own in my spare time.
> Using apps external to Python (pdftotext, etc.) is not really an option
> for me. If someone knows of a free native Python app that does this now,
> let me know and I'll use that instead!
>
> So, if other more experienced programmers are interested in helping make
> the script better, please let me know. I can host a website and the
> latest revision and do all of the grunt work.
>
> Thanks,
>
> Brad

As mentioned before, extracting plain text from a PDF document can be hit or miss. I have tried all of the following applications (free/open source) on Arch Linux. Note that I ran the commands with subprocess and either captured stdout or read the plain-text file created by the application.

* textract (uses pdftotext) - https://github.com/deanmalmgren/textract
* pdftotext - http://poppler.freedesktop.org/
  - cmd: pdftotext -layout "/path/to/document.pdf" -
  - cmd: pdftotext "/path/to/document.pdf" -
* Calibre - http://calibre-ebook.com/
  - cmd: ebook-convert "/path/to/document.pdf" "/path/to/plain.txt" --no-chapters-in-toc
* AbiWord - http://www.abiword.org/
  - cmd: abiword --to-name=fd://1 --to-TXT "/path/to/document.pdf"
* Apache Tika - https://tika.apache.org/
  - cmd: "/usr/bin/java" -jar "/path/to/standalone/tika-app-1.10.jar" --text-main "/path/to/document.pdf"

For my application, I saw the best results using Apache Tika. However, I still encounter strange encoding or extraction issues, e.g. "S P A C E D O U T H E A D E R S" and "\nBroken \nHeader\n". I ended up writing a lot of repairing/cleaning methods.

I would welcome an improved solution that has some intelligence, such as comparing the order of the extracted plain text to a snapshot of the PDF page using OCR.
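A minimal sketch of the workflow described above: call an external extractor through subprocess, capture stdout, then run a cleaning pass. The pdftotext arguments are the ones listed in the message; the function names and the regular expression are illustrative assumptions, not the actual repair methods mentioned there.

```python
import re
import subprocess


def pdftotext_command(pdf_path, keep_layout=True):
    """Build the pdftotext command line listed above; the trailing "-" means stdout."""
    cmd = ["pdftotext"]
    if keep_layout:
        cmd.append("-layout")
    cmd += [pdf_path, "-"]
    return cmd


def extract_with_pdftotext(pdf_path, keep_layout=True):
    """Run pdftotext and return its stdout as text (needs poppler installed)."""
    result = subprocess.run(
        pdftotext_command(pdf_path, keep_layout),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,  # raise CalledProcessError if the tool fails
    )
    return result.stdout.decode("utf-8", errors="replace")


# Assumed heuristic: four or more single characters separated by single spaces.
_SPACED_RUN = re.compile(r"\b(?:\w ){3,}\w\b")


def collapse_spaced_letters(text):
    """Turn 'H E A D E R' into 'HEADER'; word breaks inside the run are lost."""
    return _SPACED_RUN.sub(lambda m: m.group(0).replace(" ", ""), text)
```

Note the limitation: a spaced-out header of several words collapses into one word, so a real cleaner would still need a dictionary or the glyph positions to restore the word breaks.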
Re: Script to extract text from PDF files
You can try this free online PDF text extractor: http://www.online-code.net/pdf-to-word.html
Re: Script to extract text from PDF files
On Sep 26, 11:50 pm, [EMAIL PROTECTED] wrote:
> On Sep 26, 4:49 pm, Svenn Are Bjerkem [EMAIL PROTECTED] wrote:
> > I have downloaded this package and installed it and found that the
> > text-extraction is more or less useless. Looking into the code and
> > comparing with the PDF spec show a very early implementation of text
> > extraction. Luckily it is possible to overwrite the textextraction
> > method in the base class without having to fiddle with the original
> > code. I tried to contact the developer to offer some help on
> > implementing text extraction, but he didn't answer my emails.
> > -- Svenn
>
> Well, feel free to send any ideas or help to me! It seems simple... Do a
> binary read. Find 'stream' and 'endstream' sections. zlib.decompress()
> all the streams. Find BT and ET markers (Begin Text End Text) and
> finally locate the parens within those and string the text together.
> This works great on 3 out of 10 PDF documents, but my main issue seems
> to be the zlib compressed streams. Some of them don't seem to be
> FlateDecodeable (although they claim to be) or the header is somehow
> incorrect. But, once I get a good stream and decompress it, things are
> OK from that point on. Seriously, if you have ideas, please let me know.
> I'll be glad to share what I've got so far.

So far I have found that extracting text from the IEEE journal papers is not as simple as described above. The IEEE journals are typeset in typical journal style, with a two-column body, a one-column abstract, and a blob of header and author information. Add figures, formulas, and footnotes spread around the page, and you are using basically every block text layout command there is in PDF. I wanted to get pdftotext from the xpdf package to see what that tool does with the IEEE PDFs, and then decide whether I should dive into its sources to see how it gets things right. So far I have not got that far.

The purpose of my work was to extract the abstract of each paper and put it into a database for later searching, but IEEE also has a search engine on their journal DVD, so I postponed the Python work. I have got my Gentoo machine back on track, so that may change again.

-- Svenn

-- http://mail.python.org/mailman/listinfo/python-list
Re: Script to extract text from PDF files
On Sep 25, 10:19 pm, Lawrence D'Oliveiro [EMAIL PROTECTED] central.gen.new_zealand wrote:
> > Doesn't work that well...
>
> This is inherent in the nature of PDF: it's a page-description language,
> not a document-interchange language. Each text-drawing command can put a
> block of text anywhere on the page, so you have no idea, just from
> parsing the PDF content, how to join these blocks up into lines,
> paragraphs, columns etc.

So (I'm not being a wise guy) how does pdftotext do it so well? The text I can extract from PDFs is extracted as it appears in the doc. Although there are various ways to insert and encode text in PDFs, it's also well documented in the PDF specifications (http://www.adobe.com/devnet/pdf/pdf_reference.html).

Going back to pdftotext... it works well at extracting text from PDF. I'd like a native Python library that does the same. This can be done. And it can be done in Python. I've made a small start. My hope was that others would be interested in helping, but I can do it on my own too... it'll just take a lot longer :)

Brad
Re: Script to extract text from PDF files
David Boddie wrote:
> There's a little information on that online:
> http://www.glyphandcog.com/textext.html

Thanks, I'll read that.

> Just because inserting and encoding is well documented doesn't mean that
> the reverse processes are easy. :-/

Boy, that's an understatement... most of the PDF tools I come across (in fact almost all) write PDF docs... they output things to PDF. It's like anyone can generate PDF files... it's dead simple, but extracting text out of them in an accurate, reliable manner is much more difficult.

> Maybe you should look at the source code for pdftotext, if that's an
> option.

I'm not sure it's open source/free software with source available, but I'll look into that.

> Can I suggest that you approach one or more authors of the existing
> Python PDF solutions and work with them on this? There are at least four
> PDF parsers written in Python out there.

I appreciate that suggestion, but again, none of the current solutions I've seen and tried extracts text from PDF documents. I'd love to be proven wrong on this point. So if one of those four current PDF solutions you mention does that, please let me know.

Thanks,
Brad
Re: Script to extract text from PDF files
On Sep 25, 9:18 pm, [EMAIL PROTECTED] wrote:
> On Sep 25, 3:02 pm, Paul Hankin [EMAIL PROTECTED] wrote:
> > Googling for 'pdf to text python' and following the first link gives
> > http://pybrary.net/pyPdf/
>
> Doesn't work that well, I've tried it, you should too... the author even
> admits this:
>
> extractText()
>   Locate all text drawing commands, in the order they are provided in
>   the content stream, and extract the text. This works well for some PDF
>   files, but poorly for others, depending on the generator used. This
>   will be refined in the future. Do not rely on the order of text coming
>   out of this function, as it will change if this function is made more
>   sophisticated.
>
> - source http://pybrary.net/pyPdf/pythondoc-pyPdf.pdf.html

I have downloaded and installed this package and found that the text extraction is more or less useless. Looking into the code and comparing it with the PDF spec shows a very early implementation of text extraction. Luckily it is possible to override the text extraction method in the base class without having to fiddle with the original code. I tried to contact the developer to offer some help on implementing text extraction, but he didn't answer my emails.

-- Svenn
Re: Script to extract text from PDF files
On Sep 26, 4:49 pm, Svenn Are Bjerkem [EMAIL PROTECTED] wrote:
> I have downloaded this package and installed it and found that the
> text-extraction is more or less useless. Looking into the code and
> comparing with the PDF spec show a very early implementation of text
> extraction. Luckily it is possible to overwrite the textextraction
> method in the base class without having to fiddle with the original
> code. I tried to contact the developer to offer some help on
> implementing text extraction, but he didn't answer my emails.
> -- Svenn

Well, feel free to send any ideas or help to me! It seems simple:

1. Do a binary read.
2. Find the 'stream' and 'endstream' sections.
3. zlib.decompress() all the streams.
4. Find the BT and ET markers (Begin Text / End Text).
5. Locate the parens within those and string the text together.

This works great on 3 out of 10 PDF documents, but my main issue seems to be the zlib-compressed streams. Some of them don't seem to be FlateDecodeable (although they claim to be), or the header is somehow incorrect. But once I get a good stream and decompress it, things are OK from that point on. Seriously, if you have ideas, please let me know. I'll be glad to share what I've got so far.

Not many people seem to be interested. I'll stop adding to this thread... I don't want to beat a dead horse. Anyone interested in helping can contact me via email.

Thanks,
Brad
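The steps described above can be sketched roughly as follows. This is not Brad's script (which was never posted in the thread) but a minimal illustration under stated assumptions: every stream is either Flate-compressed or plain, text appears only as parenthesised literals between BT and ET, and the function name is made up. It uses zlib.decompressobj() rather than zlib.decompress() so that the end-of-line bytes before 'endstream' do not trigger an error.

```python
import re
import zlib

# Raw bytes between the 'stream' and 'endstream' keywords.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)
# Everything between a BT (begin text) and ET (end text) operator.
TEXT_BLOCK_RE = re.compile(rb"\bBT\b(.*?)\bET\b", re.DOTALL)
# A (...) string literal; tolerates backslash escapes, not nested parens.
LITERAL_RE = re.compile(rb"\(((?:\\.|[^\\()])*)\)", re.DOTALL)


def extract_text_naive(pdf_bytes):
    """Naive extraction: decompress streams, collect literals inside BT..ET."""
    chunks = []
    for match in STREAM_RE.finditer(pdf_bytes):
        raw = match.group(1)
        try:
            # decompressobj ignores trailing bytes after the zlib data
            content = zlib.decompressobj().decompress(raw)
        except zlib.error:
            content = raw  # not Flate data (or damaged): scan it as-is
        for block in TEXT_BLOCK_RE.finditer(content):
            for literal in LITERAL_RE.finditer(block.group(1)):
                # Escape sequences and font encodings are NOT resolved here.
                chunks.append(literal.group(1).decode("latin-1"))
    return "".join(chunks)
```

Usage would be along the lines of `extract_text_naive(open("document.pdf", "rb").read())`. The sketch ignores the /Filter entry, hex strings, font encodings, and text positioning, which is consistent with it working on only some documents.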
Re: Script to extract text from PDF files
On Wed Sep 26 23:50:16 CEST 2007, byte8bits wrote:
> On Sep 26, 4:49 pm, Svenn Are Bjerkem svenn.bjer... at googlemail.com wrote:
> > I have downloaded this package and installed it and found that the
> > text-extraction is more or less useless. Looking into the code and
> > comparing with the PDF spec show a very early implementation of text
> > extraction. Luckily it is possible to overwrite the textextraction
> > method in the base class without having to fiddle with the original
> > code. I tried to contact the developer to offer some help on
> > implementing text extraction, but he didn't answer my emails.

That's disappointing to hear, but it's understandable. I must have one or two outstanding requests to add features to pdftools from a year ago. I keep meaning to look into making the necessary changes, but it's not something I'm looking forward to.

> Well, feel free to send any ideas or help to me! It seems simple... Do a
> binary read. Find 'stream' and 'endstream' sections. zlib.decompress()
> all the streams.

Assuming that they're FlateEncoded...

> Find BT and ET markers (Begin Text End Text) and finally locate the
> parens within those and string the text together.

Which works fine if the generator put in space characters. Otherwise, it seems to me that you need to figure out where any spaces should go.

> This works great on 3 out of 10 PDF documents, but my main issue seems
> to be the zlib compressed streams. Some of them don't seem to be
> FlateDecodeable (although they claim to be) or the header is somehow
> incorrect. But, once I get a good stream and decompress it, things are
> OK from that point on. Seriously, if you have ideas, please let me know.
> I'll be glad to share what I've got so far.

You need to take a good parser and build a higher-level text extraction library on top of it.

> Not many people seem to be interested. I'll stop adding to this
> thread... I don't want to beat a dead horse. Anyone interested in
> helping, can contact me via email.

On the contrary, lots of people are interested in this sort of thing:

http://phaseit.net/claird/comp.text.pdf/PDF_converters.html
http://sourceforge.net/projects/pdfplayground
http://www.adaptive-enterprises.com.au/~d/software/pdffile/
http://pybrary.net/pyPdf/
http://www.boddie.org.uk/david/Projects/Python/pdftools/

I discussed working with the author of pdfplayground, but things never really got going. I'd like to be part of a team working on a PDF library for Python, but my views on software licensing mean that I'd prefer to use a strong copyleft license rather than the permissive licenses attached to most of the above libraries.

David
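To illustrate the point about missing space characters: many generators emit words through the TJ operator as an array of string fragments and numeric position adjustments, with no literal space at all. One common heuristic, sketched below, treats a large negative adjustment (which moves the next fragment to the right) as a word gap. The threshold of 200 thousandths of a text-space unit and the function name are assumptions chosen for illustration; a real extractor would derive the gap from the font's actual space width.

```python
import re

# One element of a TJ array: a (...) string literal or a numeric adjustment.
TJ_ITEM_RE = re.compile(r"\(((?:\\.|[^\\()])*)\)|(-?\d+(?:\.\d+)?)")


def join_tj_array(array_body, space_threshold=200):
    """Join the fragments of a TJ array, guessing where spaces belong.

    array_body is the text between '[' and ']' of a TJ operator.
    """
    parts = []
    for literal, number in TJ_ITEM_RE.findall(array_body):
        if number:
            # Adjustments are subtracted from the current position, so a
            # large negative number opens a visible gap before the next glyph.
            gap = -float(number)
            if gap >= space_threshold and parts and not parts[-1].endswith(" "):
                parts.append(" ")
        else:
            parts.append(literal)
    return "".join(parts)
```

For example, `join_tj_array("(Hel) 15 (lo) -280 (Wor) -10 (ld)")` yields "Hello World": the small adjustments are kerning, while -280 is wide enough to count as a space under the assumed threshold.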
Re: Script to extract text from PDF files
On Sep 25, 6:41 pm, brad [EMAIL PROTECTED] wrote:
> I have a very crude Python script that extracts text from some (and I
> emphasize some) PDF documents. On many PDF docs, I cannot extract text,
> but this is because I'm doing something wrong. The PDF spec is large and
> complex and there are various ways in which to store and encode text. I
> wanted to post here and ask if anyone is interested in helping make the
> script better which means it should accurately extract text from most
> any pdf file... not just some.
>
> I know the topic of reading/extracting the text from a PDF document
> natively in Python comes up every now and then on comp.lang.python...
> I've posted about it in the past myself. After searching for other
> solutions, I've resorted to attempting this on my own in my spare time.
> Using apps external to Python (pdftotext, etc.) is not really an option
> for me. If someone knows of a free native Python app that does this now,
> let me know and I'll use that instead!

Googling for 'pdf to text python' and following the first link gives http://pybrary.net/pyPdf/

-- Paul Hankin
Re: Script to extract text from PDF files
On Sep 25, 3:02 pm, Paul Hankin [EMAIL PROTECTED] wrote:
> Googling for 'pdf to text python' and following the first link gives
> http://pybrary.net/pyPdf/

Doesn't work that well, I've tried it, you should too... the author even admits this:

extractText()
  Locate all text drawing commands, in the order they are provided in the content stream, and extract the text. This works well for some PDF files, but poorly for others, depending on the generator used. This will be refined in the future. Do not rely on the order of text coming out of this function, as it will change if this function is made more sophisticated.

- source http://pybrary.net/pyPdf/pythondoc-pyPdf.pdf.html
Re: Script to extract text from PDF files
In message [EMAIL PROTECTED], [EMAIL PROTECTED] wrote:
> On Sep 25, 3:02 pm, Paul Hankin [EMAIL PROTECTED] wrote:
> > Googling for 'pdf to text python' and following the first link gives
> > http://pybrary.net/pyPdf/
>
> Doesn't work that well...

This is inherent in the nature of PDF: it's a page-description language, not a document-interchange language. Each text-drawing command can put a block of text anywhere on the page, so you have no idea, just from parsing the PDF content, how to join these blocks up into lines, paragraphs, columns etc.
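To make the problem concrete, here is a small sketch of the simplest possible reassembly step: given text fragments with page coordinates, group those at roughly the same height into a line and order each line left to right. The (x, y, text) tuples, the tolerance value, and the function name are hypothetical; the coordinates would have to come from a real parser that tracks the text matrix.

```python
def fragments_to_lines(fragments, y_tolerance=2.0):
    """Group (x, y, text) fragments into lines of text.

    y grows upward, as in default PDF user space, so higher y means
    nearer the top of the page. Fragments whose y differs by at most
    y_tolerance from the first fragment of a line join that line.
    """
    lines = []  # each entry: [reference_y, [(x, text), ...]]
    for x, y, text in sorted(fragments, key=lambda f: -f[1]):
        if lines and abs(lines[-1][0] - y) <= y_tolerance:
            lines[-1][1].append((x, text))
        else:
            lines.append([y, [(x, text)]])
    # Within a line, order fragments from left to right.
    return [" ".join(t for _, t in sorted(items)) for _, items in lines]
```

Called with `[(300, 700, "world"), (72, 700.5, "Hello"), (72, 680, "Second line")]`, this returns `["Hello world", "Second line"]` regardless of the order in which the fragments were drawn. It also shows why the general problem is hard: on a two-column page the same rule would splice the left and right columns into one line, so column detection has to happen first.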