Re: [Tutor] weather scraping with Beautiful Soup

Stefan Behnel Fri, 17 Jul 2009 05:42:26 -0700

Che M wrote:
> <div class="blueBox">
>       <div id="curcondbox">
>               <div class="subG b">West of Town, Jamestown, Pennsylvania 
> (PWS)</div>
>               <div class="bm10">Updated: <span class="pwsrt" 
> pwsid="KPAJAMES1" pwsunit="english" pwsvariable="lu" value="1247814018">3:00 
> AM EDT on July 17, 2009</span></div>
>               <table cellspacing="0" cellpadding="0" class="full">
>               <tr>
>               <td class="vaT full">
>               <table cellspacing="0" cellpadding="5" class="full">
>               <tr>
>               <td class="vaM taC"><img 
> src="http://icons-pe.wxug.com/i/c/a/nt_clear.gif" width="42" height="42" 
> alt="Clear" class="condIcon" /></td>
>               <td class="vaM taC full">
>               <div style="font-size: 17px;"><span class="pwsrt" 
> pwsid="KPAJAMES1" pwsunit="english" pwsvariable="tempf" english="&deg;F" 
> metric="&deg;C" value="60.3">
>   <span class="nobr"><span class="b">60.3</span>&nbsp;&#176;F</span>
> </span></div>
>
> The 60.3 is the value I want to extract.  It appears to be down within a 
> hierarchy
> something like:
> 
> <body
> <div class="blueBox">
>     <div id="curcondbox">
>          <table 
>             <table 
>                <div>
>                    <span class="nobr">
>                          <span class="b">


You may consider using lxml's cssselect module:

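   from lxml import html
   # parse() returns an ElementTree; cssselect() lives on the root element
   doc = html.parse("http://some/url/to/parse.html").getroot()
   # class names are matched case-sensitively, so "blueBox", not "bluebox"
   spans = doc.cssselect("div.blueBox > #curcondbox span.b")
   print(spans[0].text)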

However, I'd rather go for the other "60.3", the value attribute of the
span marked pwsvariable="tempf", using XPath:

   print(doc.xpath('//span[@pwsvariable="tempf"]/@value'))
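
To try this without fetching the page, here is a minimal sketch that runs the same XPath against a trimmed copy of the markup quoted above. The SNIPPET string and the variable names are only illustrative, and the real page may well have changed since. xpath() returns a list, so the sketch checks for an empty result before converting to a number; the second query reads the displayed text instead, for comparison.

```python
from lxml import html

# Trimmed copy of the quoted markup (illustrative fragment, not the live page).
SNIPPET = """
<div class="blueBox">
  <div id="curcondbox">
    <div style="font-size: 17px;">
      <span class="pwsrt" pwsid="KPAJAMES1" pwsunit="english"
            pwsvariable="tempf" english="&deg;F" metric="&deg;C" value="60.3">
        <span class="nobr"><span class="b">60.3</span>&nbsp;&#176;F</span>
      </span>
    </div>
  </div>
</div>
"""

root = html.fromstring(SNIPPET)

# Attribute route: selecting @value yields a list of attribute strings.
values = root.xpath('//span[@pwsvariable="tempf"]/@value')
temp_f = float(values[0]) if values else None

# Text route, for comparison: the innermost <span class="b"> under #curcondbox.
texts = root.xpath('//div[@id="curcondbox"]//span[@class="b"]/text()')

print(temp_f)
print(texts)
```

The attribute route depends only on pwsvariable="tempf", so it should survive layout changes that would break a selector tied to the nesting of divs and tables.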

Stefan

_______________________________________________
Tutor maillist  -  [email protected]
http://mail.python.org/mailman/listinfo/tutor
