pyparsing is cool, but using only re is also OK:

# -*- coding: UTF-8 -*-
import re
import urllib2

# fetch the page source (Python 2: urllib2)
html = urllib2.urlopen("http://www.yahoo.com/").read()

# capture the src value of each <img> tag; note that this only matches
# tags whose first attribute is a double-quoted src
r = re.compile(r'<img\s+src="(?P<image>[^"]+)"[^>]*>', re.IGNORECASE)
for m in r.finditer(html):
    print m.group('image')

I got these results:

http://us.i1.yimg.com/us.yimg.com/i/ww/beta/edit_plink.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/125.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/13441.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/136.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/beta/y3.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/ml.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/my.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/msgn.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/v5_mail_t2.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/news/2006/06/07/0607notorious_big.jpg
http://us.i1.yimg.com/us.yimg.com/i/ww/beta/wthr.gif
http://us.i1.yimg.com/us.yimg.com/i/mntl/sh/04q2/camera.gif

On 6/8/06, Paul McGuire <[EMAIL PROTECTED]> wrote:
> <[EMAIL PROTECTED]> wrote in message
> news:[EMAIL PROTECTED]
> > Hi,
> > I am new to python regular expression, I would like to use it to get an
> > attribute of an html element from an html file?
> >
> > for example, I was able to read the html file using this:
> > req = urllib2.Request(url=acaURL)
> > f = urllib2.urlopen(req)
> >
> > data = f.read()
> >
> > my question is how can I just get the src attribute value of an img
> > tag?
> > something like this:
> > (.*)<img src="href of the image source">(.*)
> >
> > I need to get the href of the image source.
> >
> > Thanks.
> >
>
> As Fredrik pointed out, re's are not the only tool out there. Here's a
> pyparsing solution.
>
> -- Paul
>
>
> import pyparsing
> import urllib
>
> # define HTML tag format using makeHTMLTags helper
> # (we don't really care about the ending </img> tag,
> # even though makeHTMLTags returns definitions for both
> # starting and ending tag patterns)
> imgStartTag, dummy = pyparsing.makeHTMLTags("img")
>
> # get HTML source from some web site
> htmlPage = urllib.urlopen("http://www.yahoo.com")
> htmlSource = htmlPage.read()
> htmlPage.close()
>
> # scan HTML source, printing SRC attribute from each <img> tag
> for tokens,start,end in imgStartTag.scanString(htmlSource):
>     print tokens.src
>
>
> Prints:
>
> http://us.i1.yimg.com/us.yimg.com/i/ww/beta/edit_plink.gif
> http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/125.gif
> http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/13441.gif
> http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/136.gif
> http://us.i1.yimg.com/us.yimg.com/i/ww/beta/y3.gif
> http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/ml.gif
> http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/my.gif
> http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/msgn.gif
> http://us.i1.yimg.com/us.yimg.com/i/ww/v5_mail_t2.gif
> http://us.i1.yimg.com/us.yimg.com/i/mntl/aut/06q2/hea_0411.gif
> http://us.i1.yimg.com/us.yimg.com/i/mntl/aut/06q2/img_0607.jpg
> http://us.i1.yimg.com/us.yimg.com/i/ww/news/2006/06/07/0607notorious_big.jpg
> http://us.i1.yimg.com/us.yimg.com/i/ww/beta/news/video.gif
> http://us.i1.yimg.com/us.yimg.com/i/buzz/2006/06/wholefoodssmall.jpg
> http://us.i1.yimg.com/us.yimg.com/i/mntl/msg/06q2/img_im.jpg
> http://us.i1.yimg.com/us.yimg.com/i/ww/trfc_bckt.gif
> http://us.i1.yimg.com/us.yimg.com/i/mntl/sh/04q2/camera.gif
>
>
> --
> http://mail.python.org/mailman/listinfo/python-list
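
For anyone reading this thread on Python 3, here is a minimal sketch of the regex approach from the top of this message, run against an inline sample string rather than a live page. The sample markup, the name extract_img_srcs, and the looser pattern (it allows other attributes before src and accepts single or double quotes) are assumptions for illustration and do not come from the posts above.

```python
import re

# Assumption: a looser pattern than the one in the post. It lets other
# attributes appear before src and accepts single or double quotes.
IMG_SRC_RE = re.compile(
    r"""<img\b[^>]*?\bsrc\s*=\s*(?P<quote>["'])(?P<image>.*?)(?P=quote)""",
    re.IGNORECASE | re.DOTALL,
)


def extract_img_srcs(html):
    """Return the src value of every <img> tag found in html, in order."""
    return [m.group("image") for m in IMG_SRC_RE.finditer(html)]


# Hypothetical sample markup standing in for a downloaded page.
SAMPLE_HTML = """
<p><img src="http://example.com/a.gif" alt="a"></p>
<IMG width="10" SRC='http://example.com/b.jpg'>
<a href="http://example.com/">not an image</a>
"""

if __name__ == "__main__":
    for src in extract_img_srcs(SAMPLE_HTML):
        print(src)
```

On the sample this should yield the two image URLs and skip the <a> tag. For messy real-world HTML, a real parser such as the pyparsing approach quoted above is likely to hold up better than any single regex.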