How about just doing "Save As" and selecting "text file" as your output
format? I do that a lot to get a text version of a .odt file; it should
work pretty much the same for a .doc file.
I never heard of ANYBODY having a command-line text extractor for a .doc
file. Let's just look at the specs for .doc and see how to write one ...
oh, wait a minute, there ARE NO SPECS for .doc, are there? I guess M$
forgot to publish it - oh, darn!
**IF** you can get your stuff into .odt format, there is a CLI solution
that will do MOST of the job. First run "unzip whatever.odt content.xml"
to get a file with all the text plus a lot of XML markup. Since the XML
markup LOOKS LIKE HTML markup, go to my website at
http://jsoftco.8m.com/download.html
and get my HTML to Text converter, which will get rid of MOST of the
markup - you might have to do a little cleanup by hand.
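
The unzip-and-strip-markup idea can also be sketched in a few lines of
Python using only the standard library. This is a rough illustration,
not the converter mentioned above: the function name odt_to_text is made
up for this example, and it assumes the .odt file is a normal ZIP archive
whose content.xml uses the standard OpenDocument "text" namespace. It
keeps only paragraphs and headings, drops tabs and repeated spaces that
ODF stores as separate elements, and may repeat text that sits in nested
paragraphs (text boxes, for instance), so some hand cleanup can still be
needed.

```python
import zipfile
from xml.etree import ElementTree

# Namespace used by OpenDocument for text elements such as <text:p>.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
PARAGRAPH_TAGS = (f"{{{TEXT_NS}}}p", f"{{{TEXT_NS}}}h")


def odt_to_text(path):
    """Return the paragraph and heading text of an .odt file (hypothetical helper)."""
    # An .odt file is a ZIP archive; the document body lives in content.xml.
    with zipfile.ZipFile(path) as archive:
        xml_bytes = archive.read("content.xml")

    root = ElementTree.fromstring(xml_bytes)
    lines = []
    for elem in root.iter():
        if elem.tag in PARAGRAPH_TAGS:
            # itertext() joins the text of the element and its children,
            # which discards inline markup such as <text:span>.
            lines.append("".join(elem.itertext()))
    return "\n".join(lines)
```

Calling odt_to_text("whatever.odt") returns one line of text per
paragraph or heading.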
Hope this helps, or at least gives you some ideas.
Jim Hartley
Frank Cox wrote:
On Thu, 27 Sep 2007 10:15:30 +0800
Jerry Tan <[EMAIL PROTECTED]> wrote:
Is there a command-line tool (in OpenOffice.org) to convert MS Office
documents into plain text, keeping just the content and removing all
the style information?
soffice -help tells me this:
QUOTE:
-p <documents...>
print the specified documents on the default printer.
-pt <printer> <documents...>
print the specified documents on the specified printer.
END OF QUOTE
Therefore, I suspect that you could create a "virtual printer" that redirects
to a file on your hard drive, then do something like:
soffice -pt textprinter document1.doc document2.doc document3.doc
You could set up a script to run through a directory and print everything in
that directory to separate files, if you wish.
I haven't tried this but I don't see why it wouldn't work.
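
The directory script suggested above might look something like the
following Python sketch. It is just as untested against a real printer
as the idea itself. The only soffice option it relies on is the -pt
option quoted from the help text; the printer name "textprinter" is the
hypothetical virtual printer from the example command, and the helper
names (find_documents, build_print_command, print_directory) are made up
for illustration. Whether each document ends up in its own output file
depends entirely on how that virtual printer is configured.

```python
import subprocess
from pathlib import Path


def find_documents(directory, suffix=".doc"):
    """Return the files in `directory` with the given suffix, sorted by name."""
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() == suffix
    )


def build_print_command(printer, documents):
    """Build the argument list for: soffice -pt <printer> <documents...>"""
    return ["soffice", "-pt", printer, *[str(d) for d in documents]]


def print_directory(directory, printer="textprinter", runner=subprocess.run):
    """Send each document in `directory` to `printer`, one print job per file."""
    for doc in find_documents(directory):
        # One soffice call per document keeps the print jobs separate.
        runner(build_print_command(printer, [doc]), check=True)
```

Running print_directory("/path/to/docs") would then call soffice once
per .doc file in that directory.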
--
Teen Angel - a ghost story - http://teenangel.netfirms.com