On Jun 20, 9:58 am, linuxprog <[EMAIL PROTECTED]> wrote:
> hello
>
> i have that string "<html>hello</a>world<anytag>ok" and i want to
> extract all the text , without html tags , the result should be some
> thing like that : helloworldok
>
> i have tried that :
>
> from re import findall
>
> chaine = """<html>hello</a>world<anytag>ok"""
>
> print findall('[a-zA-z][^(<.*>)].+?[a-zA-Z]',chaine)
>
> >>> ['html', 'hell', 'worl', 'anyt', 'ag>o']
>
> the result is not correct ! what would be the correct regex to use ?

This: [^(<.*>)] is a negated character set. It matches any single
character except "(", "<", ".", "*", ">" and ")". Inside square brackets
those characters lose their special meaning, so it most certainly doesn't
do what you want it to.

Is it absolutely necessary that you use a regular expression? There are a
few HTML parsing libraries out there.

The easiest approach using re might be to do a search and replace on all
tags: just replace each tag with the empty string.

Matt
-- 
http://mail.python.org/mailman/listinfo/python-list
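
To illustrate the two suggestions above, here is a minimal sketch, written for Python 3 rather than the Python 2 of the original post. The names TAG, strip_tags_regex, TextExtractor and extract_text are made up for this example. The first helper does the search and replace with re.sub. The second uses the standard library's html.parser as one possible parsing library, which is an assumption, since the reply does not name one.

```python
import re
from html.parser import HTMLParser

chaine = "<html>hello</a>world<anytag>ok"

# Hypothetical pattern: a "<", then any run of characters that are not ">", then ">".
TAG = re.compile(r"<[^>]*>")

def strip_tags_regex(text):
    """Replace everything that looks like a tag with the empty string."""
    return TAG.sub("", text)

class TextExtractor(HTMLParser):
    """Hypothetical helper that collects only the text between tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called by the parser for each run of text outside a tag.
        self.parts.append(data)

def extract_text(text):
    parser = TextExtractor()
    parser.feed(text)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)

print(strip_tags_regex(chaine))  # helloworldok
print(extract_text(chaine))      # helloworldok
```

The regex version is fine for simple input like the string in the question, but it can give wrong results when a ">" appears inside an attribute value, a comment or a script block. A parser is the safer choice for real-world HTML.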