On Jul 23, 11:53 am, Paul McGuire <[email protected]> wrote:
> On Jul 22, 5:43 pm, Filip <[email protected]> wrote:
>
> # Needs re.IGNORECASE, and can have tag attributes, such as <BR
> CLEAR="ALL">
> line_break_re = re.compile('<br\/?>', re.UNICODE)
Just in case somebody actually uses valid XHTML :-) it might be a good
idea to allow for <br />
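
Something along these lines might cover all three points (case, attributes, and the XHTML form). It is only a sketch: line_break_re is the name from the quoted code, while replace_line_breaks is a hypothetical helper added for illustration, and the pattern assumes no attribute value contains a literal ">".

```python
import re

# Sketch only: matches <br>, <br/>, <br /> and <BR CLEAR="ALL">.
# \b stops it from matching other tags that merely start with "br".
# Assumption: attribute values never contain a literal '>'.
line_break_re = re.compile(r'<br\b[^>]*>', re.IGNORECASE | re.UNICODE)


def replace_line_breaks(text, replacement='\n'):
    """Hypothetical helper: swap every line-break tag for `replacement`."""
    return line_break_re.sub(replacement, text)
```

A proper HTML parser would be more robust than any regex here, but this stays close to the original approach.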
> # what about HTML entities defined using hex syntax, such as &#xxxx;
> amp_re = re.compile('\&(?![a-z]+?\;)', re.UNICODE | re.IGNORECASE)
What about the decimal syntax ones? E.g. not only &nbsp; and &#xA0;
but also &#160;
Also, entity names can contain digits e.g. &sup1; &frac34;
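
A lookahead that accepts named, decimal, and hex references could look roughly like the following. Again a sketch: amp_re is the name from the quoted code, escape_bare_ampersands is a hypothetical helper, and the pattern only checks the shape of a reference, not whether the entity name is actually defined.

```python
import re

# Sketch only: match a bare '&' that does NOT start an entity reference.
# Accepted reference shapes (each must end with ';'):
#   named   - a letter followed by letters/digits, e.g. nbsp, sup1, frac34
#   decimal - '#' followed by digits, e.g. #160
#   hex     - '#x' followed by hex digits, e.g. #xA0
amp_re = re.compile(
    r'&(?!(?:[a-z][a-z0-9]*|#[0-9]+|#x[0-9a-f]+);)',
    re.IGNORECASE | re.UNICODE,
)


def escape_bare_ampersands(text):
    """Hypothetical helper: turn stray '&' into '&amp;', leave entities alone."""
    return amp_re.sub('&amp;', text)
```

With this, "a & b" has its ampersand escaped, while references of the named, decimal, and hex forms pass through untouched.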