On Sep 27, 12:49 pm, MRAB <pyt...@mrabarnett.plus.com> wrote:
> On 27/09/2010 01:39, flebber wrote:
>
> > On Sep 27, 9:38 am, "w.g.sned...@gmail.com" <w.g.sned...@gmail.com>
> > wrote:
> >> On Sep 26, 7:10 pm, flebber <flebber.c...@gmail.com> wrote:
> >>
> >>> I was trying to use Pypdf following a recipe from the Activestate
> >>> cookbooks. However I cannot get it to work. Unsure if it is me or it
> >>> is because sets are deprecated.
> >>>
> >>> I have placed a pdf in my C:\ drive. it is called
> >>> "Components-of-Dot-NET.pdf" You could use anything I was just
> >>> testing with it.
> >>>
> >>> I was using the last script on that page that was most recently
> >>> updated. I am using python 2.6.
> >>>
> >>> http://code.activestate.com/recipes/511465-pure-python-pdf-to-text-co...
> >>>
> >>> import pyPdf
> >>>
> >>> def getPDFContent(path):
> >>>     content = "C:\Components-of-Dot-NET.pdf"
> >>>     # Load PDF into pyPDF
> >>>     pdf = pyPdf.PdfFileReader(file(path, "rb"))
> >>>     # Iterate pages
> >>>     for i in range(0, pdf.getNumPages()):
> >>>         # Extract text from page and add to content
> >>>         content += pdf.getPage(i).extractText() + "\n"
> >>>     # Collapse whitespace
> >>>     content = " ".join(content.replace(u"\xa0", " ").strip().split())
> >>>     return content
> >>>
> >>> print getPDFContent("Components-of-Dot-NET.pdf").encode("ascii",
> >>> "ignore")
> >>>
> >>> This is my error.
> >>>
> >>> Warning (from warnings module):
> >>>   File "C:\Documents and Settings\Family\Application Data\Python\Python26\site-packages\pyPdf\pdf.py", line 52
> >>>     from sets import ImmutableSet
> >>> DeprecationWarning: the sets module is deprecated
> >>>
> >>> Traceback (most recent call last):
> >>>   File "C:/Python26/Pdfread", line 15, in <module>
> >>>     print getPDFContent("Components-of-Dot-NET.pdf").encode("ascii",
> >>> "ignore")
> >>>   File "C:/Python26/Pdfread", line 6, in getPDFContent
> >>>     pdf = pyPdf.PdfFileReader(file(path, "rb"))
> >>
> >> ---> IOError: [Errno 2] No such file or directory: 'Components-of-Dot-NET.pdf'
> >>
> >> Looks like an issue with finding the file.
> >> how do you pass the path?
> >
> > okay thanks I thought that when I set content here
> >
> > def getPDFContent(path):
> >     content = "C:\Components-of-Dot-NET.pdf"
> >
> > that i was defining where it is.
> >
> > but yeah I updated script to below and it works. That is the contents
> > are displayed to the interpreter. How do I output to a .txt file?
> >
> > import pyPdf
> >
> > def getPDFContent(path):
> >     content = "C:\Components-of-Dot-NET.pdf"
>
> That simply binds to a local name; 'content' is a local variable in the
> function 'getPDFContent'.
>
> >     # Load PDF into pyPDF
> >     pdf = pyPdf.PdfFileReader(file(path, "rb"))
>
> You're opening a file whose path is in 'path'.
>
> >     # Iterate pages
> >     for i in range(0, pdf.getNumPages()):
> >         # Extract text from page and add to content
> >         content += pdf.getPage(i).extractText() + "\n"
>
> That appends to 'content'.
>
> >     # Collapse whitespace
>
> 'content' now contains the text of the PDF, starting with
> r"C:\Components-of-Dot-NET.pdf".
>
> >     content = " ".join(content.replace(u"\xa0", " ").strip().split())
> >     return content
> >
> > print getPDFContent(r"C:\Components-of-Dot-NET.pdf").encode("ascii",
> > "ignore")
>
> Outputting to a .txt file is simple: open the file for writing using
> 'open', write the string to it, and then close it.

That's what I was trying to do with

    open('x.txt', 'w').write(content)

The rest of the script works, but it won't output the text.

-- 
http://mail.python.org/mailman/listinfo/python-list
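
For reference, here is a minimal sketch of the open / write / close steps described in the quoted reply. The helper name save_text and the output name x.txt are only illustrative, and the usage lines assume the getPDFContent function from the thread. One possible cause of the missing output (an assumption, since the full script is not shown) is that the write sits after "return content" inside the function, where it is never reached, or that it refers to 'content' outside the function, where that local name does not exist.

```python
import io


def save_text(text, out_path):
    """Write text to out_path; the with block closes the file afterwards."""
    # io.open is available from Python 2.6 onwards and expects unicode text.
    # The utf-8 encoding is an assumption made for this sketch.
    with io.open(out_path, "w", encoding="utf-8") as out_file:
        out_file.write(text)


# Hypothetical usage, assuming getPDFContent from the thread is defined:
# text = getPDFContent(r"C:\Components-of-Dot-NET.pdf")
# save_text(text, "x.txt")
```

The key point is that the value returned by the function is captured in a name at the call site and then written, rather than relying on the function's local 'content'.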