Re: [Tutor] extracting text from word files (.doc, .docx) and pdf

Walter Prins Tue, 25 Jan 2011 14:39:16 -0800

On 25 January 2011 21:52, Juan Jose Del Toro <[email protected]> wrote:


> Dear List;
>
> I am looking for a way to extract parts of the text from Word (.doc, .docx)
> files as well as PDF; the idea is to walk through the whole directory tree
> and populate a CSV file with an excerpt from each file.
> For PDF I found PyPdf <http://pybrary.net/pyPdf/>, but I have found nothing
> to read doc or docx files.
>

http://www.google.com/search?q=python+read+ms+word+file&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-GB:official&client=firefox-a

which returns this:

http://stackoverflow.com/questions/125222/extracting-text-from-ms-word-files-in-python

Additionally -- docx are, IIRC, zipped XML, so you could probably just
uncompress it and scan the XML directly...
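
Something along these lines might work as a starting point. It is only a
rough sketch using the standard library, assuming the main body sits in the
archive member word/document.xml and the text is held in <w:t> elements
(which is how the format is normally laid out). The names docx_text and
excerpt are made up for illustration, and unusual documents (text boxes,
headers, footnotes) are not handled.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml inside a .docx
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_text(path):
    """Return the plain text of a .docx file, one line per paragraph."""
    with zipfile.ZipFile(path) as archive:
        # Assumption: the main document body lives in this archive member.
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # Text runs are stored in <w:t> elements inside each paragraph.
        runs = [node.text for node in para.iter(W_NS + "t") if node.text]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)


def excerpt(path, length=200):
    """Hypothetical helper: the first `length` characters of the text."""
    return docx_text(path)[:length]
```

For the directory walk, os.walk can supply the file names and csv.writer
can take one row per file, e.g. (filename, excerpt(filename)). Note this
does nothing for the older binary .doc format, which is not zipped XML.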

Walter

