Alexander Klingenstein wrote:
> I need to take a bunch of .doc files (Word 2000) which have a little text
> including some tables/layout and mostly pictures and convert them to a PDF
> and extract the text and images separately too. If I have a PDF, I can
> create the HTML with pdftohtml called from Python with popen. However, I need
> an automated way to convert the .doc to PDF first.
>
> Is there a way to do what I want either with a Python lib, 3rd party app, or
> maybe remote controlling Word (a la VBA) by "printing" to PDF with a
> distiller?
> I already tried wvWare from GnuWin32, however it has problems with big image
> files embedded in the .doc file (looks like an mmap error).

I would try scripting OpenOffice from Python, using the Python-UNO bridge:

http://udk.openoffice.org/python/python-bridge.html

Once you have the PDF, run pdftohtml on it to get access to the image
elements you need.

-- 
pkm ~ http://paulmcnett.com
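
A rough sketch of the OpenOffice route, untested against Word 2000 files in
particular. It assumes an office instance is already running and listening
on localhost port 2002 (started with its "accept" option; the exact flag
spelling depends on the office version), and that the script runs under a
Python that can import the PyUNO module. The port number, the helper names
pdf_path_for, doc_to_pdf and pdftohtml_command, and the choice to write the
PDF next to the source file are all just assumptions for illustration.

```python
import subprocess
from pathlib import Path

# Assumed connection string; must match how the office instance was started.
UNO_URL = "uno:socket,host=localhost,port=2002;urp;StarOffice.ComponentContext"


def pdf_path_for(doc_path):
    """Return the path of the PDF to write next to the given .doc file."""
    return Path(doc_path).with_suffix(".pdf")


def pdftohtml_command(pdf_path):
    """Build the argument list for running pdftohtml on one PDF."""
    return ["pdftohtml", str(pdf_path)]


def doc_to_pdf(doc_path, uno_url=UNO_URL):
    """Convert one .doc file to PDF through a running office instance."""
    # Imported here so this module still loads where PyUNO is not installed.
    import uno
    from com.sun.star.beans import PropertyValue

    def prop(name, value):
        p = PropertyValue()
        p.Name = name
        p.Value = value
        return p

    # Connect to the listening office process and get its desktop service.
    local_ctx = uno.getComponentContext()
    resolver = local_ctx.ServiceManager.createInstanceWithContext(
        "com.sun.star.bridge.UnoUrlResolver", local_ctx)
    ctx = resolver.resolve(uno_url)
    desktop = ctx.ServiceManager.createInstanceWithContext(
        "com.sun.star.frame.Desktop", ctx)

    src = Path(doc_path).resolve()
    dst = pdf_path_for(src)

    # Load the document without showing a window, then export it as PDF.
    document = desktop.loadComponentFromURL(
        src.as_uri(), "_blank", 0, (prop("Hidden", True),))
    try:
        document.storeToURL(
            dst.as_uri(), (prop("FilterName", "writer_pdf_Export"),))
    finally:
        document.close(True)
    return dst


def convert_all(folder):
    """Convert every .doc in a folder, then run pdftohtml on each PDF."""
    for doc in sorted(Path(folder).glob("*.doc")):
        pdf = doc_to_pdf(doc)
        subprocess.run(pdftohtml_command(pdf), check=True)
```

Calling convert_all on a directory of .doc files should leave a PDF, an HTML
file and the extracted images beside each source document, provided
pdftohtml is on the PATH.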