Paul McGuire wrote:
>> import re
>> r=re.compile('<img\s+src="(?P<image>[^"]+)"[^>]*>',re.IGNORECASE)
>> for m in r.finditer(html):
>>     print m.group('image')
>>
>
> Ouch - this fails to match any <img> tag that has some other
> attribute, such as "height" or "width", before the "src" attribute.
> www.yahoo.com has several such tags.

It also fails to match any image tag where the src attribute is quoted using single quotes, or where the src attribute is not enclosed in quotes at all. Handle all of that correctly in the regex and the Beautiful Soup or pyparsing options look even more attractive.

In fact, if anyone can write a regex which matches the source attribute in a single named group, and correctly handles double, single and unquoted attributes, I'll admit to being impressed (and probably also slightly queasy when looking at it).

Here's my best attempt at a regex that gets it right, but it still gets confused by other attributes if they contain spaces.

>>> ATTR = '''[^\s=>]+(?:=(?:"[^">]*"|'[^'>]*'|[^"'\s>][^\s>]*))?'''
>>> NOTSRC = '(?!src=)' + ATTR
>>> PAT = '''<img\s(?:'''+NOTSRC + '''\s*)*src=(?:["']?)(?P<image>(?<=")[^">]*|(?<=')[^'>]*|[^ >]*)'''
>>> htmlPage = '''<html><body><img width=42 src=fred.jpg><img src=\"freda.jpg\"> <img title='the src="silly" title' src='another'></body></html>'''
>>> for m in r.finditer(htmlPage):
	print m.group('image')

fred.jpg
freda.jpg
>>>
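
For comparison, here is a rough sketch of the parser-based route. It uses the standard library's html.parser (Python 3) rather than Beautiful Soup or pyparsing, so it is a stand-in for those options and not a demonstration of either. The names ImgSrcCollector and image_sources are made up for this example. The parser hands over attribute values with the quoting already stripped, so double-quoted, single-quoted and unquoted src values all take the same code path.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag fed to the parser.

    Hypothetical helper class, written only for this illustration.
    """

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; values arrive
        # with any surrounding quotes already removed.
        if tag == "img":
            for name, value in attrs:
                # value is None for a bare attribute such as <img src>.
                if name == "src" and value is not None:
                    self.sources.append(value)


def image_sources(html):
    """Return the src values of all <img> tags in html, in document order."""
    parser = ImgSrcCollector()
    parser.feed(html)
    parser.close()
    return parser.sources


# Same three cases as the test page above: unquoted, double-quoted,
# and single-quoted with a misleading src="..." inside another attribute.
html_page = (
    "<html><body><img width=42 src=fred.jpg><img src=\"freda.jpg\">\n"
    "<img title='the src=\"silly\" title' src='another'></body></html>"
)

for src in image_sources(html_page):
    print(src)
```

The point of the sketch is that attribute order, quoting style and spaces inside other attributes stop mattering once a tokenizer does the splitting, which is the part the regex has to fight for.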