I have a package on my website that was designed for extracting text from HTML pages. Based on the discussion here, it might work for this too. It strips out all the "<...>" tags and tries to clean up line breaks and similar whitespace, so it might leave less of a mess to clean up by hand than the trick given here.

Or it might not; I haven't tried it on .docx files. But if somebody wants to play with it (it's GPL, written in C), you can get it from

http://jsoftco.8m.com/download.html/

Jim Hartley

Keith Bates wrote:
On Fri, 12 Oct 2007 10:49:26 -0700
"Mark Hull-Richter" <[EMAIL PROTECTED]> wrote:

On 10/12/07, Harold Fuchs <[EMAIL PROTECTED]> wrote:
Keith, I think I goofed. If you use an editor that *properly*
supports regular expressions then what I should have told you to
change was
"<^[>]*>"
without the quotes. That's less-than, circumflex,
open-square-bracket, greater-than, close-square-bracket, asterisk,
greater-than. I forgot that in a proper implementation the asterisk
is "greedy", which is why (I think) you lost your text. Even with
this you'll lose all the formatting.


Close: it's "<[^>]*>" (without the quotes, with the caret inside the
brackets).

It means "< followed by any number of 'character-that-is-not-a->'
followed by a >".

mhr
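
For anyone who would rather script this than do it in an editor, here is a minimal sketch in Python of the difference between the patterns discussed above. The sample string is made up for illustration (it is not taken from a real .docx file), and the names `TAG`, `GREEDY`, `MISTYPED` and `strip_tags` are hypothetical. The behaviour shown is that of Python's `re` module; other regex engines may treat the misplaced caret differently.

```python
import re

# Corrected pattern from the thread: "<", then any run of characters
# that are not ">", then ">".
TAG = re.compile(r"<[^>]*>")

# A greedy "anything" pattern, for comparison only. On a single line it
# swallows everything from the first "<" to the last ">".
GREEDY = re.compile(r"<.*>")

# The mistyped pattern with the caret outside the brackets. In Python's
# re module "^" is a start-of-string anchor here, so it can never match
# right after a "<" and the text comes back unchanged.
MISTYPED = re.compile(r"<^[>]*>")

# Illustrative sample only (an assumption, not real .docx content).
sample = "<w:p><w:t>Hello</w:t> <w:t>world</w:t></w:p>"


def strip_tags(text: str) -> str:
    """Remove every <...> tag, keeping the text between tags."""
    return TAG.sub("", text)


stripped = strip_tags(sample)
greedy_stripped = GREEDY.sub("", sample)
mistyped_stripped = MISTYPED.sub("", sample)

print(repr(stripped))
print(repr(greedy_stripped))
print(repr(mistyped_stripped))
```

The negated character class is what stops each match at the nearest ">", which is why the text between tags survives.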

Thanks Mark. It worked.

Then "all" I had to do was delete spurious line breaks, spaces and tabs
to recover the text... what a mess!
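
The manual cleanup described above could probably be scripted as well. The following is only a rough sketch in Python, assuming that "spurious" means runs of spaces and tabs, stray spaces around line breaks, and stacks of blank lines; the function name `tidy_whitespace` and the sample text are hypothetical, and real output may need different rules.

```python
import re


def tidy_whitespace(text: str) -> str:
    """Collapse leftover whitespace after tags have been stripped."""
    text = text.replace("\r\n", "\n")        # normalise Windows line endings
    text = re.sub(r"[ \t]+", " ", text)      # runs of spaces/tabs -> one space
    text = re.sub(r" ?\n ?", "\n", text)     # drop spaces around line breaks
    text = re.sub(r"\n{3,}", "\n\n", text)   # keep at most one blank line
    return text.strip()


# Illustrative input only (an assumption about what the mess looks like).
messy = "Hello \t world\n\n\n\n  Second   paragraph\t\n"
print(tidy_whitespace(messy))
```

Single line breaks inside a paragraph are left alone here, since there is no safe way to tell a wrapped line from an intentional break without looking at the text.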


--
Teen Angel - a ghost story - http://teenangel.netfirms.com

