"Juho Schultz" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED] > Martin Evans wrote: >> Sorry, yet another REGEX question. I've been struggling with trying to >> get >> a regular expression to do the following example in Python: >> >> Search and replace all instances of "sleeping" with "dead". >> >> This parrot is sleeping. Really, it is sleeping. >> to >> This parrot is dead. Really, it is dead. >> >> >> But not if part of a link or inside a link: >> >> This parrot <a href="sleeping.htm" target="new">is sleeping</a>. Really, >> it >> is sleeping. >> to >> This parrot <a href="sleeping.htm" target="new">is sleeping</a>. Really, >> it >> is dead. >> >> >> This is the full extent of the "html" that would be seen in the text, the >> rest of the page has already been processed. Luckily I can rely on the >> formating always being consistent with the above example (the url will >> normally by much longer in reality though). There may though be more than >> one link present. >> >> I'm hoping to use this to implement the automatic addition of links to >> other >> areas of a website based on keywords found in the text. >> >> I'm guessing this is a bit too much to ask for regex. If this is the >> case, >> I'll add some more manual Python parsing to the string, but was hoping to >> use it to learn more about regex. >> >> Any pointers would be appreciated. >> >> Martin > > What you want is: > > re.sub(regex, replacement, instring) > The replacement can be a function. So use a function. > > def sleeping_to_dead(inmatch): > instr = inmatch.group(0) > if needsfixing(instr): > return instr.replace('sleeping','dead') > else: > return instr > > as for the regex, something like > (<a)?[^<>]*(</a>)? > could be a start. It is probaly better to use the regex to recognize > the links as you might have something like > sleeping.sleeping/sleeping/sleeping.html in your urls. Also you > probably want to do many fixes, so you can put them all within the same > replacer function.
Many thanks for that; the function method looks very useful. My first working attempt used a regex to locate all the links, then looped through them, replacing each with a numbered dummy entry. I could then safely do the find/replaces and finally swap the dummy entries back for the original links. It works, but it seems needlessly inefficient.
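For comparison, the dummy-entry approach I described can be sketched roughly as below. This is a reconstruction rather than my exact code: the function name replace_outside_links and the NUL-delimited placeholder format are assumptions made for illustration, and it assumes the page text never contains NUL characters and that the keyword contains no digits.

```python
import re

# Assumption: links are never nested and are always closed.
LINK = re.compile(r'<a\b[^>]*>.*?</a>', re.DOTALL)

def replace_outside_links(text, old, new):
    links = []

    def stash(match):
        # Remember the link and leave a numbered dummy entry in its place.
        # Assumption: '\x00' never occurs in the page text itself.
        links.append(match.group(0))
        return '\x00%d\x00' % (len(links) - 1)

    masked = LINK.sub(stash, text)
    # With the links masked out, a plain find/replace is safe
    # (as long as `old` contains no digits or NUL characters).
    masked = masked.replace(old, new)
    # Swap each numbered dummy entry back for the original link.
    return re.sub(r'\x00(\d+)\x00',
                  lambda m: links[int(m.group(1))], masked)

sample = ('This parrot <a href="sleeping.htm" target="new">is sleeping</a>. '
          'Really, it is sleeping.')
print(replace_outside_links(sample, 'sleeping', 'dead'))
```

This makes three passes over the text where the replacer-function approach needs one, which is the inefficiency I was worried about.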