On Dec 21, 5:38 am, Oltmans <rolf.oltm...@gmail.com> wrote:
> Hello, everyone.
>
> I've a string that looks something like
> ----
> lksjdfls <div id ='amazon_345343'> kdjff lsdfs </div> sdjfls <div id
> = "amazon_35343433">sdfsd</div><div id='amazon_8898'>welcome</div>
> ----
>
> From above string I need the digits within the ID attribute. For
> example, required output from above string is
> - 35343433
> - 345343
> - 8898
>
> I've written this regex that's kind of working
> re.findall("\w+\s*\W+amazon_(\d+)",str)
>
The issue with using regexen for parsing HTML is that you often get
surprised by attributes that you never expected, or that appear out of
order, or that have weird or missing quotation marks, or by tags or
attributes in unexpected upper/lower case. BeautifulSoup is one tool to
use for HTML scraping; here is a pyparsing example, with hopefully
descriptive comments:

from pyparsing import makeHTMLTags, ParseException

src = """
lksjdfls <div id ='amazon_345343'> kdjff lsdfs </div> sdjfls <div id = "amazon_35343433">sdfsd</div><div id='amazon_8898'>welcome</div>
hello, my age is 86 years old and I was born in 1945. Do you know that
PI is roughly 3.1443534534534534534
"""

# use makeHTMLTags to return an expression that will match
# HTML <div> tags, including attributes, upper/lower case,
# etc. (makeHTMLTags will return expressions for both
# opening and closing tags, but we only care about the
# opening one, so just use the [0]th returned item)
div = makeHTMLTags("div")[0]

# define a parse action to filter only for <div> tags
# with the proper id form
def filterByIdStartingWithAmazon(tokens):
    if not tokens.id.startswith("amazon_"):
        raise ParseException(
            "must have id attribute starting with 'amazon_'")

# define a parse action that will add a pseudo-
# attribute 'amazon_id', to make it easier to get the
# numeric portion of the id after the leading 'amazon_'
def makeAmazonIdAttribute(tokens):
    tokens["amazon_id"] = tokens.id[len("amazon_"):]

# attach parse action callbacks to the div expression -
# these will be called during parse time
div.setParseAction(filterByIdStartingWithAmazon,
                   makeAmazonIdAttribute)

# search through the input string for matching <div>s,
# and print out their amazon_id's
for divtag in div.searchString(src):
    print(divtag.amazon_id)

Prints:

345343
35343433
8898

-- 
http://mail.python.org/mailman/listinfo/python-list
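
P.S. If installing a third-party package is not an option, the same
"use a real parser" idea can be sketched with the standard library's
html.parser module. This is only a rough illustration, not the pyparsing
approach above: it assumes Python 3, and the names AmazonIdCollector and
find_amazon_ids are made up for this example. HTMLParser lowercases tag
and attribute names and strips the quotes from attribute values, which
covers the quoting and case surprises mentioned earlier.

```python
from html.parser import HTMLParser

PREFIX = "amazon_"


class AmazonIdCollector(HTMLParser):
    """Collect the digits from id="amazon_<digits>" on <div> start tags.

    Hypothetical helper class, written only for this illustration.
    """

    def __init__(self):
        super().__init__()
        self.amazon_ids = []

    def handle_starttag(self, tag, attrs):
        # tag and attribute names arrive lowercased; values arrive unquoted
        if tag != "div":
            return
        for name, value in attrs:
            # value is None for attributes written without "=value"
            if name == "id" and value and value.startswith(PREFIX):
                digits = value[len(PREFIX):]
                if digits.isdigit():
                    self.amazon_ids.append(digits)


def find_amazon_ids(html):
    """Return the numeric id suffixes in document order."""
    collector = AmazonIdCollector()
    collector.feed(html)
    collector.close()
    return collector.amazon_ids


src = """
lksjdfls <div id ='amazon_345343'> kdjff lsdfs </div> sdjfls <div id = "amazon_35343433">sdfsd</div><div id='amazon_8898'>welcome</div>
hello, my age is 86 years old and I was born in 1945. Do you know that
PI is roughly 3.1443534534534534534
"""

for amazon_id in find_amazon_ids(src):
    print(amazon_id)
```

As with the pyparsing version, the stray numbers in the surrounding text
(86, 1945, the digits of PI) are ignored because only <div> start tags
are inspected.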