Re: searching pdf files for certain info

2005-02-24 Thread Follower
rbt [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED]...
> Not really a Python question... but here goes: Is there a way to read
> the content of a PDF file and decode it with Python? I'd like to read
> PDF's, decode them, and then search the data for certain strings.

I've had success with both:

  http://www.boddie.org.uk/david/Projects/Python/pdftools/

  http://www.adaptive-enterprises.com.au/~d/software/pdffile/pdffile.py

although my preference is for the latter as it transparently handles
decryption. (I've previously posted an enhancement to the `pdftools`
utility that adds decryption handling to it, but now use the `pdffile`
library as it handles it better.)

The ease of text extraction depends a lot on how the PDFs have been
created.

--Phil.
-- 
http://mail.python.org/mailman/listinfo/python-list


searching pdf files for certain info

2005-02-22 Thread rbt
Not really a Python question... but here goes: Is there a way to read 
the content of a PDF file and decode it with Python? I'd like to read 
PDF's, decode them, and then search the data for certain strings.

Thanks, rbt


Re: searching pdf files for certain info

2005-02-22 Thread Diez B. Roggisch
rbt wrote:

> Not really a Python question... but here goes: Is there a way to read
> the content of a PDF file and decode it with Python? I'd like to read
> PDF's, decode them, and then search the data for certain strings.

There is a commercial tool, pdflib, available that might help. It has a free
evaluation version and Python bindings.

If it's only about text, maybe pdf2text helps.
-- 
Regards,

Diez B. Roggisch


Re: searching pdf files for certain info

2005-02-22 Thread Andreas Lobinger
Aloha,
rbt wrote:
> Not really a Python question... but here goes: Is there a way to read
> the content of a PDF file and decode it with Python? I'd like to read
> PDF's, decode them, and then search the data for certain strings.

First of all,
http://groups.google.de/groups?selm=400CF2E3.29506EAE%40netsurf.de&output=gplain
still applies here.
If you can deal with a very basic implementation of a pdf-lib you
might be interested in
http://sourceforge.net/projects/pdfplayground
In the CVS (or the current snapshot) you can find in
ppg/Doc/text_extract.txt an example for text extraction.
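  import pdffile
  import pages
  import pdftool  # assumed module name; the original listing omits this import
  import zlib

  # Open the file and build its page list.
  pf = pdffile.pdffile('../pdf-testset1/a.pdf')
  pp = pages.pages(pf)
  # Decompress the content stream of the first page.
  c = zlib.decompress(pf[pp.pagelist[0]['/Contents']].stream)
  # Parse the stream into (operator, operands) entries.
  op = pdftool.parse_content(c)
  # Keep the operands of the text-showing operators ' and Tj.
  sop = [x[1] for x in op if x[0] in ["'", 'Tj']]
  for a in sop:
      print(a[0])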
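
For readers unfamiliar with the operators filtered above: Tj and ' are the
PDF operators that show a text string. Below is a minimal sketch of the same
idea without the pdfplayground modules. It assumes Python 3, a
Flate-compressed content stream, and literal (parenthesised) strings only.
The names SHOW_TEXT and show_text_operands are made up for illustration. It
ignores TJ arrays, hex strings and font encodings, so it is no substitute
for a real parser:

```python
import re
import zlib

# A literal string "( ... )" followed by the Tj or ' operator.
# Backslash escapes such as "\(" or "\\" inside the string are allowed.
SHOW_TEXT = re.compile(rb"\(((?:\\.|[^\\()])*)\)\s*(?:Tj|')")


def show_text_operands(content):
    """Return the raw string operand of every Tj and ' operator."""
    return [m.group(1) for m in SHOW_TEXT.finditer(content)]


# A tiny hand-written content stream, compressed the way a
# FlateDecode stream would be stored in the file.
raw = b"BT /F1 12 Tf 72 720 Td (Hello) Tj 0 -14 Td (World) ' ET"
stream = zlib.compress(raw)

for operand in show_text_operands(zlib.decompress(stream)):
    print(operand.decode("latin-1"))
```

Whether the extracted bytes are readable depends on the font encoding used
by the producer, which is one reason results vary so much between PDFs.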
Wishing a happy day
LOBI


Re: searching pdf files for certain info

2005-02-22 Thread rbt
Thanks guys... what if I convert it to PS via printing it to a file or 
something? Would that make it easier to work with?


Re: searching pdf files for certain info

2005-02-22 Thread rbt
Andreas Lobinger wrote:
> Aloha,
> rbt wrote:
>> Thanks guys... what if I convert it to PS via printing it to a file or
>> something? Would that make it easier to work with?
>
> Not really...
> The classical PS drivers (e.g. Acroread4-Unix print -> ps) simply
> define the pdf graphics and text operators as PS commands and
> copy the pdf content directly.
> Wishing a happy day
> LOBI

I downloaded ghostscript for Win32 and added it to my PATH 
(C:\gs\gs8.15\lib AND C:\gs\gs8.15\bin). I found that ps2ascii works 
well on PDF files and it's entirely free.

Usage:

  ps2ascii PDF_file.pdf > ASCII_file.txt

However, bundling a 9+ MB package with a 5K script and convincing users 
to install it is another matter altogether.
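
A rough sketch of how the ps2ascii step could be driven from a Python script
and the result searched. It assumes Python 3, that ps2ascii is on the PATH
and writes the text to stdout when no output file is given, and that the
output can be decoded as Latin-1. The function names pdf_to_text and
file_contains are made up for illustration:

```python
import subprocess


def pdf_to_text(pdf_path, command=("ps2ascii",)):
    """Run a converter that writes plain text to stdout and return that text."""
    result = subprocess.run(
        list(command) + [pdf_path],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,  # raise CalledProcessError if the converter fails
    )
    # Assumption: the converter emits single-byte text; Latin-1 never fails.
    return result.stdout.decode("latin-1")


def file_contains(pdf_path, needle, command=("ps2ascii",)):
    """Return True if needle occurs in the text extracted from pdf_path."""
    return needle in pdf_to_text(pdf_path, command)
```

A call would look like file_contains("PDF_file.pdf", "some string"). On
Windows the Ghostscript distribution ships ps2ascii as a batch file, so the
command tuple may need adjusting there.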


Re: searching pdf files for certain info

2005-02-22 Thread Tom Willis
I tried that for something not Python-related and I was getting
sporadic spaces everywhere.

I am assuming this is not the case in your experience?


-- 
Thomas G. Willis
http://paperbackmusic.net


Re: searching pdf files for certain info

2005-02-22 Thread rbt
Tom Willis wrote:
> I tried that for something not python related and I was getting
> sporadic spaces everywhere.
> I am assuming this is not the case in your experience?

For my purpose, it works fine. I'm searching for certain strings that 
might be in the document... all I need is a readable file. Layout, fonts 
and/or presentation is unimportant to me.


Re: searching pdf files for certain info

2005-02-22 Thread Kartic
rbt said the following on 2/22/2005 8:53 AM:
> Not really a Python question... but here goes: Is there a way to read
> the content of a PDF file and decode it with Python? I'd like to read
> PDF's, decode them, and then search the data for certain strings.
>
> Thanks, rbt

Hi,
Try pdftotext, which is part of the Xpdf project. pdftotext extracts
textual information from a PDF file to an output text file of your
choice. I have used it in the past (not with Python) to do what you are
attempting. It is a small program, and you can invoke it from Python and
search for the string/pattern you want.

You can download for your OS from:
http://www.foolabs.com/xpdf/download.html
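
A small sketch of invoking pdftotext from Python and listing the lines that
match a pattern. It assumes Python 3 and that pdftotext is on the PATH;
passing "-" as the output file asks pdftotext to write to stdout. The UTF-8
decoding is an assumption about the tool's configured output encoding, and
the function names are made up for illustration:

```python
import re
import subprocess


def pdftotext(pdf_path):
    """Return the text of pdf_path as produced by the pdftotext tool."""
    # "-" as the text file name sends the extracted text to stdout.
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        stdout=subprocess.PIPE,
        check=True,
    )
    # Assumed encoding; undecodable bytes are replaced rather than fatal.
    return result.stdout.decode("utf-8", errors="replace")


def matching_lines(text, pattern):
    """Return (line_number, line) pairs for lines in which pattern is found."""
    regex = re.compile(pattern)
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if regex.search(line)
    ]
```

Typical use would be matching_lines(pdftotext("some.pdf"), r"invoice"),
where both the file name and the pattern are placeholders.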
Thanks,
-Kartic


Re: searching pdf files for certain info

2005-02-22 Thread Tom Willis
Well, sporadic spaces in strings would cause problems, would they not?

An example:

The string: Patient Face Sheet ---pdftotext---> P a tie n t Face Sheet

I'm just curious whether you see anything like that, since I really have no
clue about PS or PDF etc., but I have a strong desire to replace a
really flaky commercial tool. And if I can do it with free stuff, all
the better; my boss will love me.
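
If the goal is only to find a known phrase in text that has stray spaces like
the example above, one possible workaround is to search with a pattern that
tolerates whitespace between any two characters. This is a sketch in
Python 3, not something any of the tools in this thread do themselves, and
spaced_pattern is a made-up name:

```python
import re


def spaced_pattern(phrase):
    """Build a regex that matches phrase with optional whitespace between characters."""
    # Drop the phrase's own spaces, escape each character, then allow
    # any amount of whitespace between neighbouring characters.
    chars = [re.escape(c) for c in phrase if not c.isspace()]
    return re.compile(r"\s*".join(chars))


garbled = "P a tie n t Face Sheet"
match = spaced_pattern("Patient Face Sheet").search(garbled)
print(match.group(0) if match else None)
```

The price is that word boundaries are ignored, so "FaceSheet" matches as
well. It does nothing to recover the page layout.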


-- 
Thomas G. Willis
http://paperbackmusic.net


Re: searching pdf files for certain info

2005-02-22 Thread rbt
Tom Willis wrote:
> Well sporadic spaces in strings would cause problems would it not?
> an example
> The String: Patient Face Sheet ---pdftotext---> P a tie n t Face Sheet
> I'm just curious if you see anything like that, since I really have no
> clue about ps or pdf etc...but I have a strong desire to replace a
> really flaky commercial tool. And if I can do it with free stuff, all
> the better my boss will love me.

No, I do not see that type of behavior. I'm looking for strings that
resemble SS numbers. So my strings look like this: nnn-nn-nnnn.

The ps2ascii util in ghostscript reproduces strings in the format that I 
expect. BTW, I'm not using pdftotext. I'm using *ps2ascii*.
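
A sketch of the search step in Python 3, assuming "resemble SS numbers" means
three digits, two digits and four digits separated by hyphens. The names
SSN_LIKE and find_ssn_like are made up for illustration, and the sample text
is invented:

```python
import re

# Three digits, two digits, four digits, separated by hyphens and not
# embedded in a longer run of digits.
SSN_LIKE = re.compile(r"(?<!\d)\d{3}-\d{2}-\d{4}(?!\d)")


def find_ssn_like(text):
    """Return every substring of text that has the nnn-nn-nnnn shape."""
    return SSN_LIKE.findall(text)


sample = "Name: J. Doe  ID 123-45-6789  phone 555-1234  ref 9123-45-67890"
print(find_ssn_like(sample))  # prints ['123-45-6789']
```

The lookarounds keep longer digit runs such as the "ref" value from being
reported. The pattern checks shape only, not whether a number is valid.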


Re: searching pdf files for certain info

2005-02-22 Thread Tom Willis
Ah, that makes sense. I only see the behavior in pdftotext. ps2ascii
doesn't give me the layout, which for my purposes I certainly need.

Thanks for the info. Looks like I'll keep searching for that silver bullet. :(


-- 
Thomas G. Willis
http://paperbackmusic.net