Re: parsing MS word docs -- tutorial request
Thanks everyone -- very helpful! I really appreciate your help -- that is what makes the world a wonderful place. peace. ::bp:: -- http://mail.python.org/mailman/listinfo/python-list
Re: parsing MS word docs -- tutorial request
Kay Schluehr wrote:
> On 28 Oct., 15:25, [EMAIL PROTECTED] wrote:
>> All,
>>
>> I am trying to write a script that will parse and extract data from an
>> MS Word document. Can / would anyone refer me to a tutorial on how to
>> do that? (perhaps from tables). I am aware of, and have downloaded
>> the pywin32 extensions, but am unsure of how to proceed -- I'm not
>> familiar with the COM API for Word, so help for that would also be
>> welcome.
>>
>> Any help would be appreciated. Thanks for your attention and
>> patience.
>>
>> ::bp::
>
> One can convert MS Word documents into MHTML, a single-file web archive
> of HTML with Word-specific XML mixed in. If I remember correctly those
> documents had an .mht extension. The result is a huge amount of
> (nevertheless structured) markup gibberish together with text. If one
> spends time and attention one can find patterns in the markup (it is
> XML, after all, and human readable).

A related solution is to use OpenOffice to convert the document to OpenDocument Format, a zip archive containing several XML files, and then use ODFPY to parse the XML and access the contents as linked objects.

http://opendocumentfellowship.com/development/projects/odfpy
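To illustrate the "zipped XML" point: once a document has been converted to .odt, the tables can be read with nothing but the standard library. This is a minimal sketch that uses zipfile and xml.etree in place of ODFPY (so it does not show the ODFPY API). The function name odt_tables is made up for the example, and it assumes the tables are not nested inside one another.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used for table elements in OpenDocument content.xml.
TABLE_NS = "urn:oasis:names:tc:opendocument:xmlns:table:1.0"


def odt_tables(path):
    """Return each table in an .odt file as a list of rows of cell strings.

    Hypothetical helper for illustration; assumes tables are not nested.
    """
    # An .odt file is a zip archive; the document body lives in content.xml.
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("content.xml"))

    tables = []
    for table in root.iter(f"{{{TABLE_NS}}}table"):
        rows = []
        for row in table.iter(f"{{{TABLE_NS}}}table-row"):
            cells = row.findall(f"{{{TABLE_NS}}}table-cell")
            # itertext() gathers the text of every paragraph inside the cell.
            rows.append(["".join(cell.itertext()) for cell in cells])
        tables.append(rows)
    return tables
```

Calling odt_tables("report.odt") (the file name is a placeholder) would give one list per table, each holding one list of strings per row.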
Re: parsing MS word docs -- tutorial request
On 28 Oct., 15:25, [EMAIL PROTECTED] wrote:
> All,
>
> I am trying to write a script that will parse and extract data from an
> MS Word document. Can / would anyone refer me to a tutorial on how to
> do that? (perhaps from tables). I am aware of, and have downloaded
> the pywin32 extensions, but am unsure of how to proceed -- I'm not
> familiar with the COM API for Word, so help for that would also be
> welcome.
>
> Any help would be appreciated. Thanks for your attention and
> patience.
>
> ::bp::

One can convert MS Word documents into MHTML, a single-file web archive of HTML with Word-specific XML mixed in. If I remember correctly those documents had an .mht extension. The result is a huge amount of (nevertheless structured) markup gibberish together with text. If one spends time and attention one can find patterns in the markup (it is XML, after all, and human readable).

A few years ago I used this conversion to implement roughly the following algorithm:

1. I manually highlighted one or more sections in a Word doc using a background colour marker.
2. I searched for the colour-marked section and determined its structure. The structure information was fed into a state machine.
3. With this state machine I searched for all sections that were equally structured.
4. I applied an href link to the text that was surrounded by the structure and removed the colour marker.
5. In another document I searched for the same text and set an anchor.

This way I could link two documents (those were public specifications that were originally disconnected).

Kay
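For anyone who wants to try the MHTML route, here is a rough sketch of the first two steps (getting at the HTML, then finding the colour-marked text). It is not Kay's original code. It assumes the .mht file is an ordinary MIME archive that the standard email package can read, and that the highlight shows up as a span whose style attribute mentions "background"; check the markup your Word version actually writes. The names html_from_mhtml and HighlightFinder are invented for the example.

```python
import email
from email import policy
from html.parser import HTMLParser


def html_from_mhtml(raw_bytes):
    """Return the first text/html part of an MHTML (MIME) archive as a str."""
    message = email.message_from_bytes(raw_bytes, policy=policy.default)
    for part in message.walk():
        if part.get_content_type() == "text/html":
            return part.get_content()
    return None


class HighlightFinder(HTMLParser):
    """Collect the text of spans that carry a background colour.

    Assumption: highlighted text is wrapped in <span style="...background...">.
    """

    def __init__(self):
        super().__init__()
        self.found = []      # text of each highlighted section
        self._stack = []     # one bool per open <span>: is it highlighted?
        self._depth = 0      # number of highlighted spans currently open
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        style = dict(attrs).get("style") or ""
        marked = "background" in style.lower()
        self._stack.append(marked)
        if marked:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag != "span" or not self._stack:
            return
        if self._stack.pop():
            self._depth -= 1
            if self._depth == 0:
                # Outermost highlighted span closed: store its text.
                self.found.append("".join(self._buffer).strip())
                self._buffer = []

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)
```

Read the .mht file in binary mode, pass the bytes to html_from_mhtml, then feed the result to a HighlightFinder; its found list holds the marked sections. Working out the surrounding structure and the state machine (steps 2 and 3) is left out here.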
RE: parsing MS word docs -- tutorial request
> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:python-
> [EMAIL PROTECTED] On Behalf Of
> [EMAIL PROTECTED]
> Sent: Tuesday, October 28, 2008 10:26 AM
> To: python-list@python.org
> Subject: parsing MS word docs -- tutorial request
>
> All,
>
> I am trying to write a script that will parse and extract data from a
> MS Word document. Can / would anyone refer me to a tutorial on how to
> do that? (perhaps from tables). I am aware of, and have downloaded
> the pywin32 extensions, but am unsure of how to proceed -- I'm not
> familiar with the COM API for word, so help for that would also be
> welcome.
>
> Any help would be appreciated. Thanks for your attention and
> patience.
>
> ::bp::

Word Object Model:
http://msdn.microsoft.com/en-us/library/bb244515.aspx

Google for sample code to get you started.
Re: parsing MS word docs -- tutorial request
On Oct 29, 4:32 am, Okko Willeboordsed <[EMAIL PROTECTED]> wrote:
> Get a copy of Python Programming on Win32, ISBN 1-56592-621-8.
> Use Google and VBA for help.
>
> [EMAIL PROTECTED] wrote:
> > All,
> >
> > I am trying to write a script that will parse and extract data from an
> > MS Word document. Can / would anyone refer me to a tutorial on how to
> > do that? (perhaps from tables). I am aware of, and have downloaded
> > the pywin32 extensions, but am unsure of how to proceed -- I'm not
> > familiar with the COM API for Word, so help for that would also be
> > welcome.
> >
> > Any help would be appreciated. Thanks for your attention and
> > patience.
> >
> > ::bp::

Also check out MSDN. The win32 modules are a thin wrapper, so most of the syntax shown on MSDN or in VB examples can be translated directly to Python.

There's also a PyWin32 mailing list which is quite helpful:

http://mail.python.org/mailman/listinfo/python-win32

Mike
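As an example of how directly the VB-style calls carry over, here is a minimal sketch that pulls the text out of every table in a document through COM. It assumes Windows with Word and pywin32 installed. The names cell_text, extract_tables and read_word_tables are made up for the example; Tables, Rows.Count, Columns.Count, Cell(row, column) and Range.Text come from the Word object model. It also assumes plain rectangular tables, since tables with merged cells can raise an error on Cell(row, column).

```python
def cell_text(cell):
    """Return a cell's text without Word's trailing end-of-cell marker."""
    # Word ends each cell's Range.Text with a carriage return and chr(7).
    return cell.Range.Text.rstrip("\r\x07")


def extract_tables(doc):
    """Return every table in `doc` as a list of rows of cell strings.

    Assumes rectangular tables with no merged cells.
    """
    tables = []
    for table in doc.Tables:
        rows = []
        # Word collections are 1-based, like in VB.
        for r in range(1, table.Rows.Count + 1):
            row = []
            for c in range(1, table.Columns.Count + 1):
                row.append(cell_text(table.Cell(r, c)))
            rows.append(row)
        tables.append(rows)
    return tables


def read_word_tables(path):
    """Open `path` (an absolute path) in Word and return its tables."""
    # Imported here so the helpers above can be used without pywin32.
    import win32com.client

    word = win32com.client.Dispatch("Word.Application")
    word.Visible = False
    doc = word.Documents.Open(path)
    try:
        return extract_tables(doc)
    finally:
        doc.Close(False)  # close without saving changes
        word.Quit()
```

Something like read_word_tables(r"C:\docs\spec.doc") (the path is a placeholder) would then return one list per table. The equivalent VBA reads almost the same, which is why the MSDN samples are useful even though they are not written in Python.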
Re: parsing MS word docs -- tutorial request
Get a copy of Python Programming on Win32, ISBN 1-56592-621-8.
Use Google and VBA for help.

[EMAIL PROTECTED] wrote:
> All,
>
> I am trying to write a script that will parse and extract data from an
> MS Word document. Can / would anyone refer me to a tutorial on how to
> do that? (perhaps from tables). I am aware of, and have downloaded
> the pywin32 extensions, but am unsure of how to proceed -- I'm not
> familiar with the COM API for Word, so help for that would also be
> welcome.
>
> Any help would be appreciated. Thanks for your attention and
> patience.
>
> ::bp::
parsing MS word docs -- tutorial request
All,

I am trying to write a script that will parse and extract data from an MS Word document. Can / would anyone refer me to a tutorial on how to do that? (perhaps from tables). I am aware of, and have downloaded the pywin32 extensions, but am unsure of how to proceed -- I'm not familiar with the COM API for Word, so help for that would also be welcome.

Any help would be appreciated. Thanks for your attention and patience.

::bp::