Google won't do a good job with .doc files, but they may do PDF to HTML and back. It's done per file. I only mentioned it to make fun of them. Here is my resume, converted from a monster.com .doc file:
http://docs.google.com/View?docid=dftrj73t_3cfwjdv

[EMAIL PROTECTED] wrote:
> Alexander Klingenstein wrote:
> > I need to take a bunch of .doc files (Word 2000) which have a little text
> > including some tables/layout and mostly pictures, and convert them to a PDF
> > and extract the text and images separately too. If I have a PDF, I can
> > create the HTML with pdftohtml called from Python with popen. However, I
> > need an automated way to convert the .doc to PDF first.
>
> Is there some reason you really want to convert to PDF first? You can
> get much better HTML right from the Word doc. You'll lose a lot of info
> going from PDF to HTML.
>
> Something like this can open doc in Word, save as HTML, then close doc.
>
> import os, win32com.client
>
> # EnsureDispatch (rather than plain Dispatch) generates the type library
> # wrappers, so that win32com.client.constants.wdFormatHTML is defined.
> wdApp = win32com.client.gencache.EnsureDispatch("Word.Application")
> wdApp.Visible = 1
>
> def SaveDocAsHTML(docPath, htmlPath):
>     doc = wdApp.Documents.Open(docPath)
>     # See
>     # mk:@MSITStore:C:\Program%20Files\Microsoft%20Office\OFFICE11\1033\VBAWD10.CHM::/html/womthSaveAs1.htm
>     # in Word VBA help doc for more info.
>
>     # Saves all text and formatting with HTML tags so that the
>     # resulting document can be viewed in a Web browser.
>     doc.SaveAs(htmlPath, win32com.client.constants.wdFormatHTML)
>     # Saves text with HTML tags with minimal cascading style sheet
>     # formatting. The resulting document can be viewed in a Web browser.
>     #doc.SaveAs(htmlPath, win32com.client.constants.wdFormatFilteredHTML)
>     doc.Close()
>
> And if you aren't satisfied with the ugly HTML you're likely to get,
> you can try running µTidylib (http://utidylib.berlios.de/) on the
> output after this step also.
>
> Thank you,
> Paul
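
Since the original question was about a whole bunch of .doc files, here is a rough sketch of how the per-file function quoted above might be driven over a folder. The names plan_conversions and convert_all are made up for illustration and appear nowhere in the thread. The converter is passed in as a plain callable, so the loop itself does not depend on Word or win32com; it only assumes the callable takes (docPath, htmlPath) as SaveDocAsHTML does.

```python
from pathlib import Path


def plan_conversions(src_dir, out_dir, suffix=".html"):
    """Pair every .doc file in src_dir with an output path in out_dir.

    Hypothetical helper. Paths are made absolute because Word generally
    expects full paths when opening and saving documents via COM.
    """
    src_dir, out_dir = Path(src_dir), Path(out_dir)
    return [
        (doc.resolve(), (out_dir / doc.with_suffix(suffix).name).resolve())
        for doc in sorted(src_dir.glob("*.doc"))
    ]


def convert_all(src_dir, out_dir, convert):
    """Call convert(doc_path, html_path) for each pair; return the output paths."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    done = []
    for doc, html in plan_conversions(src_dir, out_dir):
        convert(str(doc), str(html))  # e.g. SaveDocAsHTML from the quoted message
        done.append(html)
    return done
```

On a Windows box with Word installed, something like convert_all(r"C:\docs", r"C:\docs\html", SaveDocAsHTML) would be the intended call; the folder names here are placeholders only.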