Re: searching pdf files for certain info
rbt wrote:
> Not really a Python question... but here goes: Is there a way to read the content of a PDF file and decode it with Python? I'd like to read PDFs, decode them, and then search the data for certain strings.

I've had success with both:

http://www.boddie.org.uk/david/Projects/Python/pdftools/
http://www.adaptive-enterprises.com.au/~d/software/pdffile/pdffile.py

although my preference is for the latter, as it transparently handles decryption. (I previously posted an enhancement to the `pdftools` utility that adds decryption handling to it, but I now use the `pdffile` library because it handles decryption better.)

The ease of text extraction depends a lot on how the PDFs were created.

--Phil.

--
http://mail.python.org/mailman/listinfo/python-list
searching pdf files for certain info
Not really a Python question... but here goes: Is there a way to read the content of a PDF file and decode it with Python? I'd like to read PDFs, decode them, and then search the data for certain strings.

Thanks,
rbt
Re: searching pdf files for certain info
rbt wrote:
> Is there a way to read the content of a PDF file and decode it with Python? I'd like to read PDFs, decode them, and then search the data for certain strings.

There is a commercial tool, pdflib, available that might help. It has a free evaluation version and Python bindings.

If it's only about text, maybe pdf2text helps.

--
Regards,
Diez B. Roggisch
Re: searching pdf files for certain info
Aloha,

rbt wrote:
> Is there a way to read the content of a PDF file and decode it with Python? I'd like to read PDFs, decode them, and then search the data for certain strings.

First of all, http://groups.google.de/groups?selm=400CF2E3.29506EAE%40netsurf.de&output=gplain still applies here.

If you can deal with a very basic implementation of a PDF library, you might be interested in http://sourceforge.net/projects/pdfplayground

In the CVS (or the current snapshot) you can find an example of text extraction in ppg/Doc/text_extract.txt:

    import pdffile
    import pages
    import pdftool  # missing from the original post; assumed to be the pdfplayground module providing parse_content
    import zlib

    # Open the PDF and build its page list
    pf = pdffile.pdffile('../pdf-testset1/a.pdf')
    pp = pages.pages(pf)
    # Decompress the content stream of the first page
    c = zlib.decompress(pf[pp.pagelist[0]['/Contents']].stream)
    # Parse the stream and keep the operands of the text-showing operators ' and Tj
    op = pdftool.parse_content(c)
    sop = [x[1] for x in op if x[0] in ["'", "Tj"]]
    for a in sop:
        print a[0]

Wishing a happy day
LOBI
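For readers without pdfplayground installed, the idea behind the snippet above (inflate a page's content stream, then keep the operands of the text-showing operators) can be sketched with the standard library alone. This is a rough Python 3 illustration, not pdfplayground's implementation: the `extract_shown_strings` helper and its regular expression are assumptions made for this example. It only recognises simple literal strings followed by `Tj` or `'`, and ignores `TJ` arrays, hex strings, nested parentheses, and font-specific encodings.

```python
import re
import zlib

# A literal string "(...)" followed by one of the text-showing operators Tj or '.
# Escaped characters such as \( and \) inside the string are tolerated;
# unescaped nested parentheses are not handled in this sketch.
SHOW_TEXT = re.compile(rb"\(((?:\\.|[^\\()])*)\)\s*(?:Tj|')")


def extract_shown_strings(stream, compressed=True):
    """Return the raw byte strings shown by Tj or ' in a content stream.

    `stream` is the stream body as bytes; pass compressed=False if it has
    already been inflated. (Hypothetical helper, for illustration only.)
    """
    data = zlib.decompress(stream) if compressed else stream
    return [m.group(1) for m in SHOW_TEXT.finditer(data)]


# A tiny hand-written content stream, deflated the way a FlateDecode
# stream would be stored in a PDF.
sample = zlib.compress(b"BT /F1 12 Tf 72 712 Td (Hello) Tj (World) ' ET")
for text in extract_shown_strings(sample):
    print(text.decode("latin-1"))
```

How readable the result is still depends on how the PDF was produced; a file that positions every glyph separately will yield fragments rather than words.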
Re: searching pdf files for certain info
Andreas Lobinger wrote:
> If you can deal with a very basic implementation of a pdf-lib you might be interested in http://sourceforge.net/projects/pdfplayground

Thanks guys... what if I convert it to PS via printing it to a file or something? Would that make it easier to work with?
Re: searching pdf files for certain info
Andreas Lobinger wrote:
> Aloha,
>
> rbt wrote:
>> Thanks guys... what if I convert it to PS via printing it to a file or something? Would that make it easier to work with?
>
> Not really... The classical PS drivers (e.g. Acroread4-Unix print -> ps) simply define the PDF graphics and text operators as PS commands and copy the PDF content directly.
>
> Wishing a happy day
> LOBI

I downloaded Ghostscript for Win32 and added it to my PATH (C:\gs\gs8.15\lib AND C:\gs\gs8.15\bin). I found that ps2ascii works well on PDF files, and it's entirely free. Usage:

    ps2ascii PDF_file.pdf ASCII_file.txt

However, bundling a 9+ MB package with a 5K script and convincing users to install it is another matter altogether.
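Driving ps2ascii from a script might look roughly like the following Python 3 sketch. The function names are invented for this example, and it assumes that Ghostscript's ps2ascii is reachable on PATH under that name (on Windows it is typically a batch file, which may need different handling).

```python
import shutil
import subprocess


def ps2ascii_command(pdf_path, txt_path, exe="ps2ascii"):
    """Build the argument list for: ps2ascii PDF_file.pdf ASCII_file.txt"""
    return [exe, pdf_path, txt_path]


def convert_with_ps2ascii(pdf_path, txt_path, exe="ps2ascii"):
    """Run ps2ascii, failing early if it cannot be found on PATH."""
    resolved = shutil.which(exe)
    if resolved is None:
        raise FileNotFoundError(f"{exe} not found on PATH; is Ghostscript installed?")
    cmd = ps2ascii_command(pdf_path, txt_path, exe=resolved)
    # check=True raises CalledProcessError if ps2ascii exits with an error.
    subprocess.run(cmd, check=True)
    return txt_path
```

Checking for the executable first gives users a clear message when Ghostscript is missing, which matters if the script is distributed without it.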
Re: searching pdf files for certain info
I tried that for something not Python-related and I was getting sporadic spaces everywhere. I am assuming this is not the case in your experience?

On Tue, 22 Feb 2005 10:45:09 -0500, rbt wrote:
> I found that ps2ascii works well on PDF files and it's entirely free.

--
Thomas G. Willis
http://paperbackmusic.net
Re: searching pdf files for certain info
Tom Willis wrote:
> I tried that for something not python related and I was getting sporadic spaces everywhere. I am assuming this is not the case in your experience?

For my purpose, it works fine. I'm searching for certain strings that might be in the document... all I need is a readable file. Layout, fonts, and presentation are unimportant to me.
Re: searching pdf files for certain info
rbt said the following on 2/22/2005 8:53 AM:
> Is there a way to read the content of a PDF file and decode it with Python? I'd like to read PDFs, decode them, and then search the data for certain strings.

Hi,

Try pdftotext, which is part of the Xpdf project. pdftotext extracts textual information from a PDF file to an output text file of your choice. I have used it in the past (not with Python) to do what you are attempting. It is a small program, and you can invoke it from Python and search for the string/pattern you want.

You can download it for your OS from: http://www.foolabs.com/xpdf/download.html

Thanks,
-Kartic
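Invoking pdftotext from Python and searching its output could be sketched as below (Python 3). The helper names are made up for this example, and it assumes a pdftotext executable on PATH; passing `-` as the output file asks pdftotext to write the text to standard output instead of a file.

```python
import re
import subprocess


def pdftotext_command(pdf_path, exe="pdftotext"):
    # "-" as the output file asks pdftotext to write the text to stdout.
    return [exe, pdf_path, "-"]


def find_matches(text, pattern):
    """Return (line_number, line) pairs for lines matching the regex."""
    regex = re.compile(pattern)
    return [(n, line)
            for n, line in enumerate(text.splitlines(), start=1)
            if regex.search(line)]


def search_pdf(pdf_path, pattern, exe="pdftotext"):
    """Extract text with pdftotext and search it; needs pdftotext on PATH."""
    result = subprocess.run(pdftotext_command(pdf_path, exe),
                            capture_output=True, check=True)
    # Decoding as Latin-1 never fails; the real output encoding depends on
    # the pdftotext build and options, so treat this as an assumption.
    return find_matches(result.stdout.decode("latin-1"), pattern)
```

A call such as `search_pdf("report.pdf", r"invoice")` (both arguments hypothetical) would then return the matching lines with their line numbers.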
Re: searching pdf files for certain info
Well, sporadic spaces in strings would cause problems, would they not? An example:

The string: Patient Face Sheet ---> pdftotext ---> P a tie n t Face Sheet

I'm just curious whether you see anything like that, since I really have no clue about PS or PDF etc... but I have a strong desire to replace a really flaky commercial tool. And if I can do it with free stuff, all the better; my boss will love me.

On Tue, 22 Feb 2005 11:31:16 -0500, rbt wrote:
> For my purpose, it works fine. I'm searching for certain strings that might be in the document... all I need is a readable file.

--
Thomas G. Willis
http://paperbackmusic.net
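If the only goal is to find a known phrase despite stray spaces like the ones in the `P a tie n t` example, one possible workaround is to search with a pattern that tolerates whitespace between any two characters. This is a sketch in Python 3; `spaced_pattern` is a name invented here, and the approach does nothing to repair the extracted text or recover layout.

```python
import re


def spaced_pattern(phrase):
    """Compile a regex matching `phrase` with optional whitespace between characters."""
    # Drop the phrase's own spaces, escape each remaining character, and
    # allow any amount of whitespace between neighbours.
    chars = [re.escape(c) for c in phrase if not c.isspace()]
    return re.compile(r"\s*".join(chars))


pattern = spaced_pattern("Patient Face Sheet")
print(bool(pattern.search("P a tie n t Face Sheet")))
```

The trade-off is that word boundaries are ignored, so `PatientFaceSheet` would match as well; for short search terms that can produce false positives.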
Re: searching pdf files for certain info
Tom Willis wrote:
> Well sporadic spaces in strings would cause problems would it not?
>
> The String: Patient Face Sheet ---> pdftotext ---> P a tie n t Face Sheet

No, I do not see that type of behavior. I'm looking for strings that resemble SS numbers, so my strings look like this: nnn-nn-nnnn. The ps2ascii util in Ghostscript reproduces strings in the format that I expect.

BTW, I'm not using pdftotext. I'm using *ps2ascii*.
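Searching the extracted text for strings that resemble SS numbers could look like this minimal Python 3 sketch. It assumes the usual layout of three, two, and four digits separated by hyphens; the names `SSN_LIKE` and `find_ssn_like` are invented for the example, and the check is purely about format, not validity.

```python
import re

# Three digits, two digits, four digits, separated by hyphens. The \b
# anchors keep the pattern from matching inside longer digit runs.
SSN_LIKE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def find_ssn_like(text):
    """Return every substring of `text` that looks like nnn-nn-nnnn."""
    return SSN_LIKE.findall(text)


sample = "Ref 123-45-6789, order 1234-56-7890, alt 987-65-4321."
print(find_ssn_like(sample))
```

Because the pattern depends on the hyphens and digits arriving intact, it only works when the extraction tool does not insert stray spaces inside the number.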
Re: searching pdf files for certain info
Ah, that makes sense. I only see the behavior in pdftotext. ps2ascii doesn't give me the layout, which for my purposes I certainly need.

Thanks for the info. Looks like I'll keep searching for that silver bullet. :(

On Tue, 22 Feb 2005 20:07:50 -0500, rbt wrote:
> BTW, I'm not using pdftotext. I'm using *ps2ascii*.

--
Thomas G. Willis
http://paperbackmusic.net