"mbstevens" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED] > On Thu, 06 Jul 2006 08:32:46 +0100, Martin Evans wrote: > >> "Juho Schultz" <[EMAIL PROTECTED]> wrote in message >> news:[EMAIL PROTECTED] >>> Martin Evans wrote: >>>> Sorry, yet another REGEX question. I've been struggling with trying to >>>> get >>>> a regular expression to do the following example in Python: >>>> >>>> Search and replace all instances of "sleeping" with "dead". >>>> >>>> This parrot is sleeping. Really, it is sleeping. >>>> to >>>> This parrot is dead. Really, it is dead. >>>> >>>> >>>> But not if part of a link or inside a link: >>>> >>>> This parrot <a href="sleeping.htm" target="new">is sleeping</a>. >>>> Really, >>>> it >>>> is sleeping. >>>> to >>>> This parrot <a href="sleeping.htm" target="new">is sleeping</a>. >>>> Really, >>>> it >>>> is dead. >>>> >>>> >>>> This is the full extent of the "html" that would be seen in the text, >>>> the >>>> rest of the page has already been processed. Luckily I can rely on the >>>> formating always being consistent with the above example (the url will >>>> normally by much longer in reality though). There may though be more >>>> than >>>> one link present. >>>> >>>> I'm hoping to use this to implement the automatic addition of links to >>>> other >>>> areas of a website based on keywords found in the text. >>>> >>>> I'm guessing this is a bit too much to ask for regex. If this is the >>>> case, >>>> I'll add some more manual Python parsing to the string, but was hoping >>>> to >>>> use it to learn more about regex. >>>> >>>> Any pointers would be appreciated. >>>> >>>> Martin >>> >>> What you want is: >>> >>> re.sub(regex, replacement, instring) >>> The replacement can be a function. So use a function. >>> >>> def sleeping_to_dead(inmatch): >>> instr = inmatch.group(0) >>> if needsfixing(instr): >>> return instr.replace('sleeping','dead') >>> else: >>> return instr >>> >>> as for the regex, something like >>> (<a)?[^<>]*(</a>)? >>> could be a start. It is probaly better to use the regex to recognize >>> the links as you might have something like >>> sleeping.sleeping/sleeping/sleeping.html in your urls. Also you >>> probably want to do many fixes, so you can put them all within the same >>> replacer function. >> >> ... My first >> working attempt had been to use the regex to locate all links. I then >> looped >> through replacing each with a numbered dummy entry. Then safely do the >> find/replaces and then replace the dummy entries with the original links. >> It >> just seems overly inefficient. > > Someone may have made use of > multiline links: > > _________________________ > This parrot > <a > href="sleeping.htm" > target="new" > > is sleeping > </a>. > Really, it is sleeping. > _________________________ > > > In such a case you may need to make the page > into one string to search if you don't want to use some complex > method of tracking state with variables as you move from > string to string. You'll also have to make it possible > for non-printing characters to have been inserted in all sorts > of ways around the '>' and '<' and 'a' or 'A' > characters using ' *' here and there in he regex.
I agree, but luckily in this case the HREF will always be formated the same as it happens to be inserted by another automated system, not a user. Ok it be changed by the server but AFAIK it hasn't for the last 6 years. The text in question apart from the links should be plain (not even <b> is allowed I think). I'd read about back and forward regex matching and thought it might somehow be of use here ie back search for the "<A" and forward search for the "</A>" but of course this would easily match in between two links. -- http://mail.python.org/mailman/listinfo/python-list