Re: need help with re module

Matimus Wed, 20 Jun 2007 11:06:35 -0700

On Jun 20, 9:58 am, linuxprog <[EMAIL PROTECTED]> wrote:
> hello
>
> i have that string "<html>hello</a>world<anytag>ok" and i want to
> extract all the text , without html tags , the result should be some
> thing like that : helloworldok
>
> i have tried that :
>
>         from re import findall
>
>         chaine = """<html>hello</a>world<anytag>ok"""
>
>         print findall('[a-zA-z][^(<.*>)].+?[a-zA-Z]',chaine)
>
>        >>> ['html', 'hell', 'worl', 'anyt', 'ag>o']
>
> the result is not correct ! what would be the correct regex to use ?


This: [^(<.*>)] is a character set that matches any single character
except "(", "<", ".", "*", ">" and ")". Inside square brackets those
characters are taken literally, so it most certainly doesn't do what
you want it to. Is it absolutely necessary that you use a regular
expression? There are a few HTML parsing libraries out there. The
easiest approach using re might be to do a search and replace on all
tags: just replace each tag with the empty string.
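
Something along these lines might do it. This is only a rough sketch, written for Python 3 (the original post uses the Python 2 print statement). The pattern '<[^>]*>' is my own guess at what counts as a "tag" and will misfire on things like a ">" inside an attribute value. The second half shows the parser route using the standard library's html.parser; the class name TextCollector is made up for the example.

```python
import re
from html.parser import HTMLParser

chaine = "<html>hello</a>world<anytag>ok"

# Approach 1: search and replace, substituting every tag with nothing.
# '<[^>]*>' matches '<', then any run of characters other than '>', then '>'.
TAG = re.compile(r"<[^>]*>")
text_re = TAG.sub("", chaine)
print(text_re)  # helloworldok


# Approach 2: let an HTML parser hand over the text between tags.
class TextCollector(HTMLParser):  # hypothetical helper, not part of the stdlib
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for each run of text found outside of tags.
        self.parts.append(data)


collector = TextCollector()
collector.feed(chaine)
collector.close()  # flush any text still buffered at the end of input
text_parser = "".join(collector.parts)
print(text_parser)  # helloworldok
```

The re version is fine for a quick one-off on simple input like this; for real-world HTML a parser is usually the safer choice.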

Matt

-- 
http://mail.python.org/mailman/listinfo/python-list
