How about just doing "Save As" and selecting "text file" as your output format? I do that a lot to get a text version of an .odt file, and it should work pretty much the same for a .doc file.

I never heard of ANYBODY having a command-line text extractor for a .doc file. Let's just look at the specs for .doc and see how to write one ... oh, wait a minute, there ARE NO SPECS for .doc, are there? I guess M$ forgot to publish it - oh, darn!

**IF** you can get your stuff into .odt format, there is a CLI solution that will do MOST of the job. An .odt file is really a zip archive, so first run "unzip whatever.odt content.xml" (lowercase "unzip") to get a file with all the text plus a lot of XML markup. Since the XML markup LOOKS LIKE HTML markup, go to my website at http://jsoftco.8m.com/download.html and get my HTML to Text converter, which will get rid of MOST of the markup - you might have to do a little cleanup by hand.
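If you would rather not do the unzip and the markup stripping as two separate steps, here is a rough Python sketch of the same idea. It is NOT my HTML to Text converter; it swaps in Python's standard XML parser instead. It assumes the text you care about lives in the ODF paragraph and heading elements (text:p and text:h), and the function names are just ones I made up for illustration.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by OpenDocument for text elements such as text:p and text:h.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
BLOCK_TAGS = {f"{{{TEXT_NS}}}p", f"{{{TEXT_NS}}}h"}


def content_xml_to_text(xml_bytes):
    """Return the text of every paragraph and heading, one per line."""
    root = ET.fromstring(xml_bytes)
    lines = []
    for elem in root.iter():
        if elem.tag in BLOCK_TAGS:
            # itertext() joins the element's own text with that of nested
            # spans, links and so on, dropping the markup itself.
            lines.append("".join(elem.itertext()))
    return "\n".join(lines)


def odt_to_text(path):
    """Read content.xml out of the .odt (a zip archive) and strip the markup."""
    with zipfile.ZipFile(path) as zf:
        return content_xml_to_text(zf.read("content.xml"))


if __name__ == "__main__":
    import sys
    if len(sys.argv) > 1:
        print(odt_to_text(sys.argv[1]))
```

Like the converter route, this only does MOST of the job: tabs, repeated spaces and manual line breaks are stored as their own elements and are simply dropped here, and a paragraph nested inside another one (a text box, say) would show up twice.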

Hope this helps, or at least gives you some ideas.

Jim Hartley

Frank Cox wrote:
On Thu, 27 Sep 2007 10:15:30 +0800
Jerry Tan <[EMAIL PROTECTED]> wrote:

Is there a command-line tool (in OpenOffice.org) to convert MS Office documents into plain text,
that is, to keep just the content and remove all the style information?

soffice -help tells me this:

QUOTE:
-p <documents...>
      print the specified documents on the default printer.
-pt <printer> <documents...>
      print the specified documents on the specified printer.
END OF QUOTE

Therefore, I suspect that you could create a "virtual printer" that redirects
to a file on your hard drive, then do something like:

soffice -pt textprinter document1.doc document2.doc document3.doc

You could set up a script to run through a directory and print everything in
that directory to separate files, if you wish.

I haven't tried this but I don't see why it wouldn't work.
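The "run through a directory" script could look something like the Python sketch below. It is equally untested against a real printer setup: it assumes a virtual printer named "textprinter" already exists (that name comes from the example command, it is not something OpenOffice.org provides), and the function names are made up for illustration. The only soffice option it uses is the -pt one quoted from the help output, called once per document.

```python
import subprocess
from pathlib import Path


def build_print_commands(directory, printer="textprinter"):
    """Return one soffice command (as an argument list) per .doc file."""
    docs = sorted(Path(directory).glob("*.doc"))
    # One invocation per document, using the -pt option from "soffice -help".
    return [["soffice", "-pt", printer, str(doc)] for doc in docs]


def print_all(directory, printer="textprinter", dry_run=True):
    """Print every .doc in the directory, or just show the commands."""
    for cmd in build_print_commands(directory, printer):
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops the loop if soffice reports a failure.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Dry run by default, so nothing is sent to a printer by accident.
    print_all(".", dry_run=True)
```

Whether each document really ends up in its own output file depends entirely on how the virtual printer is configured; the script only takes care of calling soffice once per file.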



--
Teen Angel - a ghost story - http://teenangel.netfirms.com
