Re: [users] The Dreaded .docx format

Jim Hartley Fri, 12 Oct 2007 21:28:50 -0700

I have a package on my website that was designed for extracting text from HTML pages. Based on the discussion here, it might work for this too. It gets rid of all the "<...>" stuff, and tries to clean up line breaks and stuff like that. It might leave less of a mess requiring manual cleanup than the trick given here.

Or it might not; I haven't tried it on .docx files. But if somebody wants to play with it (it's GPL, written in C) you can get it from


http://jsoftco.8m.com/download.html/

Jim Hartley

Keith Bates wrote:

On Fri, 12 Oct 2007 10:49:26 -0700
"Mark Hull-Richter" <[EMAIL PROTECTED]> wrote:

On 10/12/07, Harold Fuchs <[EMAIL PROTECTED]> wrote:

Keith, I think I goofed. If you use an editor that *properly*
supports regular expressions then what I should have told you to
change was

"<^[>]*>"

without the quotes. That's less-than, circumflex,
open-square-bracket, greater-than, close-square-bracket, asterisk,
greater-than. I forgot that in a proper implementation the asterisk is
"greedy" which is why (I think) you lost your text. Even with this
you'll lose all the formatting.

Close: it's "<[^>]*>" (without the quotes, with the caret inside the
brackets).

It means "< followed by any number of 'character-that-is-not-a->'
followed by a >".

mhr
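
For anyone who would rather script this than do it in an editor, here is a rough Python sketch of the difference between a greedy pattern and the corrected one above. The sample string and the greedy pattern "<.*>" are assumptions made up for illustration (the thread does not quote the original pattern, and real .docx XML is far more involved), and strip_tags is a hypothetical helper name.

```python
import re

# Hypothetical sample loosely resembling the XML inside a .docx file.
sample = "<w:p><w:t>Hello</w:t></w:p><w:p><w:t>world</w:t></w:p>"

# Assumed greedy pattern: ".*" runs on to the LAST ">" on the line.
GREEDY = re.compile(r"<.*>")

# Corrected pattern from this thread: "<", any run of non-">" characters, ">".
NEGATED = re.compile(r"<[^>]*>")


def strip_tags(text, pattern=NEGATED):
    """Delete every match of pattern from text."""
    return pattern.sub("", text)


print(repr(strip_tags(sample, GREEDY)))  # the text is swallowed with the tags
print(repr(strip_tags(sample)))          # only the tags are removed
```

With the greedy pattern the single match spans the whole line, so the text between the tags goes too. The negated class stops at the first ">", so each tag is removed on its own. Note that the words from adjacent paragraphs end up run together, and all formatting is lost either way.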


Thanks Mark. It worked.

Then "all" I had to do was delete spurious line breaks, spaces and tabs
to recover the text... what a mess!
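
That cleanup step can be partly scripted as well. The following is only a sketch under assumptions: tidy_whitespace is a hypothetical helper, and deciding which line breaks are "spurious" is a judgment call, so this version only collapses runs of spaces and tabs, trims the ends of lines, and squeezes runs of blank lines down to one.

```python
import re


def tidy_whitespace(text):
    """Collapse the stray whitespace left behind after tags are removed."""
    # Runs of spaces and tabs become a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Drop the space on either side of a line break.
    text = re.sub(r" ?\n ?", "\n", text)
    # Three or more line breaks in a row become one blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    # Remove leading and trailing whitespace from the whole text.
    return text.strip()


messy = "  Hello \t world \n\n\n\n  next\tline  \n"
print(repr(tidy_whitespace(messy)))
```

Line breaks that fall in the middle of a sentence would still need to be joined by hand or with a stricter rule.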


--
Teen Angel - a ghost story - http://teenangel.netfirms.com

