Re: parsing MS word docs -- tutorial request

2008-10-29 Thread bp . tralfamadore

Thanks everyone -- very helpful!
I really appreciate your help -- that is what makes the world a
wonderful place.

peace.

::bp::
--
http://mail.python.org/mailman/listinfo/python-list


Re: parsing MS word docs -- tutorial request

2008-10-29 Thread Terry Reedy

Kay Schluehr wrote:
> On 28 Oct., 15:25, [EMAIL PROTECTED] wrote:
> > All,
> >
> > I am trying to write a script that will parse and extract data from an
> > MS Word document.  Can / would anyone refer me to a tutorial on how to
> > do that?  (perhaps from tables).  I am aware of, and have downloaded
> > the pywin32 extensions, but am unsure of how to proceed -- I'm not
> > familiar with the COM API for Word, so help for that would also be
> > welcome.
> >
> > Any help would be appreciated.  Thanks for your attention and
> > patience.
> >
> > ::bp::
>
> One can convert MS Word documents into MHTML, an HTML-based web
> archive format with embedded XML. If I remember correctly, those
> documents have an .mht extension. The result is a huge amount of
> (nevertheless structured) markup gibberish together with the text. If
> one spends time and attention, one can find patterns in the markup
> (it is XML-like and human readable).


A related solution is to use OpenOffice to convert the document to
OpenDocument Format, a zipped bundle of XML files, and then use ODFPY
to parse the XML and access the contents as linked objects.

http://opendocumentfellowship.com/development/projects/odfpy
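
A minimal sketch of the OpenDocument route, assuming the Word file has
already been saved as .odt from OpenOffice. Note that it swaps ODFPY for
the standard library (zipfile plus ElementTree), so it shows the idea
rather than the ODFPY API; the function name odt_tables is made up for
illustration, and nested tables are not treated specially.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace URIs used by OpenDocument text files.
NS = {
    "table": "urn:oasis:names:tc:opendocument:xmlns:table:1.0",
    "text": "urn:oasis:names:tc:opendocument:xmlns:text:1.0",
}


def odt_tables(source):
    """Return every table in an .odt file as a list of rows of cell strings.

    `source` is a file name or a binary file-like object.
    (Hypothetical helper, not part of ODFPY.)
    """
    # An .odt file is a zip archive; the document body lives in content.xml.
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("content.xml"))

    tables = []
    for table in root.iter("{%s}table" % NS["table"]):
        rows = []
        for row in table.iter("{%s}table-row" % NS["table"]):
            cells = []
            for cell in row.findall("table:table-cell", NS):
                # A cell holds one or more text:p paragraphs; itertext()
                # also picks up text inside nested spans.
                paragraphs = cell.findall("text:p", NS)
                cells.append("\n".join("".join(p.itertext()) for p in paragraphs))
            rows.append(cells)
        tables.append(rows)
    return tables
```

Calling odt_tables() on a converted document returns one list per
table, each holding one list of strings per row.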



Re: parsing MS word docs -- tutorial request

2008-10-29 Thread Kay Schluehr
On 28 Oct., 15:25, [EMAIL PROTECTED] wrote:
> All,
>
> I am trying to write a script that will parse and extract data from an
> MS Word document.  Can / would anyone refer me to a tutorial on how to
> do that?  (perhaps from tables).  I am aware of, and have downloaded
> the pywin32 extensions, but am unsure of how to proceed -- I'm not
> familiar with the COM API for Word, so help for that would also be
> welcome.
>
> Any help would be appreciated.  Thanks for your attention and
> patience.
>
> ::bp::

One can convert MS Word documents into MHTML, an HTML-based web
archive format with embedded XML. If I remember correctly, those
documents have an .mht extension. The result is a huge amount of
(nevertheless structured) markup gibberish together with the text. If
one spends time and attention, one can find patterns in the markup
(it is XML-like and human readable).
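
As a rough illustration of getting at that markup: an .mht file is a
MIME message, so Python's standard email package can usually unpack it.
The sketch below assumes the document has already been saved as .mht
from Word; the helper name html_from_mht is made up for this example.

```python
import email
from email import policy


def html_from_mht(raw_bytes):
    """Return the first text/html part of an MHTML archive as a string.

    Returns None if the archive contains no HTML part.
    (Hypothetical helper name, used only for illustration.)
    """
    message = email.message_from_bytes(raw_bytes, policy=policy.default)
    for part in message.walk():
        if part.get_content_type() == "text/html":
            # get_content() undoes the transfer encoding (for example
            # quoted-printable) and decodes with the declared charset.
            return part.get_content()
    return None
```

The returned string is ordinary (if verbose) HTML that can be searched
or fed to a parser.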

A few years ago I used this conversion to implement roughly the
following algorithm:

1. I manually highlighted one or more sections in a Word doc using a
background colour marker.
2. I searched for the colour marked section and determined the
structure. The structure information was fed into a state machine.
3. With this state machine I searched for all sections that were
equally structured.
4. I applied an href link to the text that was surrounded by the
structure and removed the colour marker.
5. In another document I searched for the same text and set an anchor.

This way I could link two documents (these were public specifications
that were originally unconnected).
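
Steps 1 to 3 above can be sketched roughly as follows. This is not the
original implementation: it replaces the state machine with a much
simpler notion of "structure" (the chain of enclosing tag names), and
it assumes the highlight shows up as a style attribute containing
"background:yellow". The class and function names are invented for
illustration.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not go on the stack.
VOID_TAGS = {"br", "hr", "img", "meta", "link", "input"}


class PathCollector(HTMLParser):
    """Record (tag_path, text, highlighted) for every non-blank text node."""

    def __init__(self, marker="background:yellow"):
        super().__init__()
        self.marker = marker
        self.stack = []    # (tag, is_marker_element) for each open element
        self.records = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        style = dict(attrs).get("style") or ""
        is_marker = self.marker in style.replace(" ", "").lower()
        self.stack.append((tag, is_marker))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerate sloppy markup.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        # The marker element is left out of the path, so a highlighted
        # section and an unhighlighted twin have the same structure.
        path = tuple(tag for tag, is_marker in self.stack if not is_marker)
        highlighted = any(is_marker for _, is_marker in self.stack)
        self.records.append((path, text, highlighted))


def find_similar(html):
    """Return all texts structured like the highlighted example(s)."""
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    wanted = {path for path, _, marked in collector.records if marked}
    return [text for path, text, _ in collector.records if path in wanted]
```

The texts returned by find_similar() would then be the candidates for
the linking described in steps 4 and 5.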

Kay



RE: parsing MS word docs -- tutorial request

2008-10-29 Thread Reedick, Andrew
> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:python-
> [EMAIL PROTECTED] On Behalf Of
> [EMAIL PROTECTED]
> Sent: Tuesday, October 28, 2008 10:26 AM
> To: python-list@python.org
> Subject: parsing MS word docs -- tutorial request
> 
> All,
> 
> I am trying to write a script that will parse and extract data from an
> MS Word document.  Can / would anyone refer me to a tutorial on how to
> do that?  (perhaps from tables).  I am aware of, and have downloaded
> the pywin32 extensions, but am unsure of how to proceed -- I'm not
> familiar with the COM API for Word, so help for that would also be
> welcome.
> 
> Any help would be appreciated.  Thanks for your attention and
> patience.
> 
> ::bp::


Word Object Model:
http://msdn.microsoft.com/en-us/library/bb244515.aspx

Google for sample code to get you started.
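
For what it's worth, a sketch of reading tables through that object
model with pywin32 might look like the following. It assumes Windows
with Word and pywin32 installed, and a document without merged cells
(Cell() raises an error for those). The function names are made up for
this example; the COM members used (Documents.Open, Tables, Rows.Count,
Columns.Count, Cell, Range.Text, Close, Quit) come from the Word object
model.

```python
def clean_cell_text(raw):
    """Strip the end-of-cell marker Word appends to a cell's Range.Text."""
    return raw.rstrip("\r\x07").strip()


def read_word_tables(path):
    """Return all tables in a Word document as lists of rows of strings.

    `path` should be an absolute path. Requires Windows, Word and pywin32.
    (Hypothetical helper, for illustration only.)
    """
    # Imported here so clean_cell_text() stays usable without pywin32.
    import win32com.client

    word = win32com.client.Dispatch("Word.Application")
    word.Visible = False
    try:
        doc = word.Documents.Open(path)
        try:
            tables = []
            for table in doc.Tables:
                rows = []
                # COM collections are 1-based.
                for r in range(1, table.Rows.Count + 1):
                    row = []
                    for c in range(1, table.Columns.Count + 1):
                        row.append(clean_cell_text(table.Cell(r, c).Range.Text))
                    rows.append(row)
                tables.append(rows)
            return tables
        finally:
            doc.Close(False)  # False: do not save changes
    finally:
        word.Quit()
```

The try/finally blocks are there so that a failure part-way through
does not leave an invisible Word process running.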




Re: parsing MS word docs -- tutorial request

2008-10-29 Thread Mike Driscoll
On Oct 29, 4:32 am, Okko Willeboordsed <[EMAIL PROTECTED]>
wrote:
> Get a copy of Python Programming on Win32, ISBN 1-56592-621-8.
> Use Google and VBA for help.
>
> [EMAIL PROTECTED] wrote:
> > All,
>
> > I am trying to write a script that will parse and extract data from an
> > MS Word document.  Can / would anyone refer me to a tutorial on how to
> > do that?  (perhaps from tables).  I am aware of, and have downloaded
> > the pywin32 extensions, but am unsure of how to proceed -- I'm not
> > familiar with the COM API for Word, so help for that would also be
> > welcome.
>
> > Any help would be appreciated.  Thanks for your attention and
> > patience.
>
> > ::bp::

Also check out MSDN: the win32com module in PyWin32 is a thin wrapper,
so most of the syntax on MSDN or in VB examples can be translated
directly to Python. There's also a PyWin32 mailing list which is quite
helpful:

http://mail.python.org/mailman/listinfo/python-win32
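
To illustrate how directly VB examples tend to translate, here is a
small sketch. The VBA loop in the comment is a typical example of the
kind found on MSDN, not one taken from this thread, and the function
name paragraph_texts is made up. It expects a Word Document object such
as win32com.client.Dispatch("Word.Application").ActiveDocument.

```python
# VBA, as it might appear in an MSDN or macro-recorder example:
#
#     For Each p In ActiveDocument.Paragraphs
#         Debug.Print p.Range.Text
#     Next p
#
# Nearly the same thing in Python, given a Document object `doc`:


def paragraph_texts(doc):
    """Return the text of each paragraph without the trailing paragraph mark."""
    return [p.Range.Text.rstrip("\r") for p in doc.Paragraphs]
```

The property names (Paragraphs, Range, Text) are the same ones the VBA
code uses, which is why the object model reference on MSDN is usable
more or less as is.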

Mike


Re: parsing MS word docs -- tutorial request

2008-10-29 Thread Okko Willeboordsed

Get a copy of Python Programming on Win32, ISBN 1-56592-621-8.
Use Google and VBA for help.

[EMAIL PROTECTED] wrote:
> All,
>
> I am trying to write a script that will parse and extract data from an
> MS Word document.  Can / would anyone refer me to a tutorial on how to
> do that?  (perhaps from tables).  I am aware of, and have downloaded
> the pywin32 extensions, but am unsure of how to proceed -- I'm not
> familiar with the COM API for Word, so help for that would also be
> welcome.
>
> Any help would be appreciated.  Thanks for your attention and
> patience.
>
> ::bp::



parsing MS word docs -- tutorial request

2008-10-28 Thread bp . tralfamadore
All,

I am trying to write a script that will parse and extract data from an
MS Word document.  Can / would anyone refer me to a tutorial on how to
do that?  (perhaps from tables).  I am aware of, and have downloaded
the pywin32 extensions, but am unsure of how to proceed -- I'm not
familiar with the COM API for Word, so help for that would also be
welcome.

Any help would be appreciated.  Thanks for your attention and
patience.

::bp::