I have a package on my website that was designed for extracting text
from HTML pages. Based on the discussion here, it might work for this
too. It strips all the "<...>" tags and tries to clean up line breaks
and similar whitespace, so it might leave less of a mess requiring
manual cleanup than the trick given here.
Or it might not; I haven't tried it on .docx files. But if somebody
wants to play with it (it's GPL, written in C), you can get it from
http://jsoftco.8m.com/download.html/
Jim Hartley
Keith Bates wrote:
On Fri, 12 Oct 2007 10:49:26 -0700
"Mark Hull-Richter" <[EMAIL PROTECTED]> wrote:
On 10/12/07, Harold Fuchs <[EMAIL PROTECTED]> wrote:
Keith, I think I goofed. If you use an editor that *properly*
supports regular expressions then what I should have told you to
change was
"<^[>]*>"
without the quotes. That's less-than, circumflex,
open-square-bracket, greater-than, close-square-bracket, asterisk,
greater-than. I forgot that in a proper implementation the asterisk
is "greedy", which is why (I think) you lost your text. Even with
this you'll lose all the formatting.
Close: it's "<[^>]*>" (without the quotes, with the caret inside the
brackets).
It means "< followed by any number of 'character-that-is-not-a->'
followed by a >".
mhr
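For anyone who would rather script this than do it in an editor, here is a
minimal Python sketch of the same pattern. The function name strip_tags and
the sample string are made up for illustration. The second pattern, "<.*>",
is only an assumed example of a greedy match that swallows the text; it is
not necessarily what was tried earlier in the thread.

```python
import re

# "<", then any number of characters that are not ">", then ">".
TAG = re.compile(r"<[^>]*>")

# Assumed greedy pattern for comparison: "." also matches ">",
# so the match runs from the first "<" to the last ">" on the line.
GREEDY = re.compile(r"<.*>")


def strip_tags(text):
    """Remove every <...> tag, leaving the text between tags."""
    return TAG.sub("", text)


sample = "<w:p><w:r><w:t>Hello</w:t></w:r></w:p>"  # made-up sample
print(strip_tags(sample))      # Hello
print(GREEDY.sub("", sample))  # prints an empty line: everything was matched
```

Because "[^>]" cannot cross a ">", each match stops at the end of its own
tag, so the text between tags survives.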
Thanks Mark. It worked.
Then "all" I had to do was delete spurious line breaks, spaces and tabs
to recover the text... what a mess!
--
Teen Angel - a ghost story - http://teenangel.netfirms.com
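The cleanup of spurious line breaks, spaces and tabs mentioned above could
probably be scripted as well. This is only a rough sketch: the function name
tidy_whitespace is hypothetical, and the rules (collapse runs of spaces and
tabs, keep at most one blank line) are assumptions about what "clean" means
for this kind of text.

```python
import re


def tidy_whitespace(text):
    """Collapse leftover spaces, tabs and blank lines after tag removal."""
    # Runs of spaces and tabs become a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Drop the single space left on either side of a line break.
    text = re.sub(r" ?\n ?", "\n", text)
    # Three or more newlines in a row become one blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


messy = "  Hello \t world \n\n\n\n  Next\tline  \n"  # made-up sample
print(tidy_whitespace(messy))
```

Paragraph breaks are kept as a single blank line; anything finer than that
would still need a look by hand.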