On Feb 10, 5:03 pm, "mtuller" <[EMAIL PROTECTED]> wrote:
> Alright. I have tried everything I can find, but am not getting
> anywhere. I have a web page that has data like this:
>
> <tr >
> <td headers="col1_1" style="width:21%" >
> <span class="hpPageText" >LETTER</span></td>
> <td headers="col2_1" style="width:13%; text-align:right" >
> <span class="hpPageText" >33,699</span></td>
> <td headers="col3_1" style="width:13%; text-align:right" >
> <span class="hpPageText" >1.0</span></td>
> <td headers="col4_1" style="width:13%; text-align:right" >
> </tr>
>
> What is show is only a small section.
>
> I want to extract the 33,699 (which is dynamic) and set the value to a
> variable so that I can insert it into a database. I have tried parsing
> the html with pyparsing, and the examples will get it to print all
> instances with span, of which there are a hundred or so when I use:
>
> for srvrtokens in printCount.searchString(printerListHTML):
>     print srvrtokens
>
> If I set the last line to srvtokens[3] I get the values, but I don't
> know grab a single line and then set that as a variable.
>
So what you are saying is that you need to make your pattern more
specific. I suggest adding these conditions to your matching pattern:
- only match a span if it is inside a <td> with the attribute
  headers="col2_1"
- only match if the span body is an integer (with an optional comma
  separator for thousands)

The grammar below adds these more specific tests for matching the input
HTML. Note also the use of a results name to make it easy to extract the
integer, and the parse action added to integer to convert the string
'33,699' to the integer 33699.

-- Paul


htmlSource = """<tr >
<td headers="col1_1" style="width:21%" >
<span class="hpPageText" >LETTER</span></td>
<td headers="col2_1" style="width:13%; text-align:right" >
<span class="hpPageText" >33,699</span></td>
<td headers="col3_1" style="width:13%; text-align:right" >
<span class="hpPageText" >1.0</span></td>
<td headers="col4_1" style="width:13%; text-align:right" >
</tr>"""

from pyparsing import makeHTMLTags, Word, nums, ParseException

tdStart, tdEnd = makeHTMLTags('td')
spanStart, spanEnd = makeHTMLTags('span')

def onlyAcceptWithTagAttr(attrname, attrval):
    # Build a parse action that rejects a tag unless it carries
    # the attribute attrname with the value attrval.
    def action(tagAttrs):
        if not (attrname in tagAttrs and tagAttrs[attrname] == attrval):
            raise ParseException("", 0, "")
    return action

tdStart.setParseAction(onlyAcceptWithTagAttr("headers", "col2_1"))
spanStart.setParseAction(onlyAcceptWithTagAttr("class", "hpPageText"))

# Digits with optional commas; the parse action strips the commas
# and converts the string to an int.
integer = Word(nums, nums + ',')
integer.setParseAction(lambda t: int("".join(c for c in t[0] if c != ',')))

patt = tdStart + spanStart + integer.setResultsName("intValue") + spanEnd + tdEnd

for matches in patt.searchString(htmlSource):
    print(matches.intValue)

prints:

33699
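
For comparison, the same two tests (a span inside a <td> with headers="col2_1", whose body is an integer with optional commas) can also be sketched with only the standard library's html.parser. This is not the pyparsing grammar, just a rough equivalent. The names Col2Extractor and print_count are made up for illustration, and the sketch assumes the page is no more irregular than the fragment shown. It also stores the first match in a variable, which is what the original question asked for.

```python
from html.parser import HTMLParser

html_source = """<tr >
<td headers="col1_1" style="width:21%" >
<span class="hpPageText" >LETTER</span></td>
<td headers="col2_1" style="width:13%; text-align:right" >
<span class="hpPageText" >33,699</span></td>
<td headers="col3_1" style="width:13%; text-align:right" >
<span class="hpPageText" >1.0</span></td>
<td headers="col4_1" style="width:13%; text-align:right" >
</tr>"""


class Col2Extractor(HTMLParser):
    """Hypothetical helper: collect integers found in
    <span class="hpPageText"> inside <td headers="col2_1">."""

    def __init__(self):
        super().__init__()
        self.in_td = False    # currently inside <td headers="col2_1">
        self.in_span = False  # inside <span class="hpPageText"> in that td
        self.values = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "td":
            # Every new <td> resets the flag, so an unclosed <td> is harmless.
            self.in_td = attrs.get("headers") == "col2_1"
        elif tag == "span" and self.in_td:
            self.in_span = attrs.get("class") == "hpPageText"

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_span = False
        elif tag == "td":
            self.in_td = False

    def handle_data(self, data):
        # Accept only digits with optional thousands commas, e.g. '33,699'.
        text = data.strip().replace(",", "")
        if self.in_span and text.isdigit():
            self.values.append(int(text))


extractor = Col2Extractor()
extractor.feed(html_source)
extractor.close()

# Keep the first match in a variable, ready to insert into a database.
print_count = extractor.values[0] if extractor.values else None
print(print_count)
```

With the fragment above, print_count ends up as the integer 33699; it is None if nothing matches.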