This may elicit an RTFM, but I hope this request isn't
too awful. Like many others on this list, I haven't
wrapped my head around the HTML parsing packages. I'm trying
to figure them out, so I have a basic test case.
I'm trying to parse the following URL:
http://www.cuisinenet.com/cnet:pub:cgi-bin/exp_search.cgi/cnet?key_region=New_York
Of interest is this:
<P CLASS="restInfo">
<SPAN CLASS="restName"><FONT FACE="Arial, Helvetica"><A
href="url"><B>name</B></A></FONT></SPAN>
<BR>
<BASEFONT SIZE="1">
<FONT FACE="Verdana, Geneva, Arial, Helvetica">
<B>Cuisine:</B> cuisinetype <BR>
<B>Price:</B> price<BR>
address (cross streets)<BR>
<B>Phone</B>: phoneno<BR>
</FONT>
</P>
There are other <P CLASS="restInfo"></P> elements in the document, but
I'm only interested in those with a <SPAN CLASS="restName"> inside.
My pseudocode is something like:
$h = new HTML::TreeBuilder;
$h->parse($url);
$h->traverse(\&callback);

sub callback {
    $p = shift;
    if $p->tag is 'p' and $p->attr('class') is 'restInfo' {
        if $p contains span class=restName {
            push @restaurants, getrestattrs $p;
        }
    }
}

sub getrestattrs {
    $p = shift;
    my %rest;
    $rest{url} = $p->span(restName)->a->attr('href');
    $rest{name} = $p->span(restName)->textcontent;
    $p->content =~ m{<B>Cuisine:</B> (.*) <BR>};
    $rest{cuisine} = $1;
    $p->content =~ m{<B>Price:</B> (.*)<BR>};
    $rest{price} = $1;
    $p->content =~ m{(.*) \((.*)\)<BR>};
    ($rest{address},$rest{crossstreets}) = ($1,$2);
    # In the sample markup the colon sits outside the <B> tag.
    $p->content =~ m{<B>Phone</B>: (.*)<BR>};
    $rest{phone} = $1;
    return \%rest;
}
There are some things that don't really look like Perl; that's where
I have no idea how to do it right.
I know $p->content doesn't actually operate the way I'm using it
here; again, it's a placeholder for the correct syntax.
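From my reading of the HTML::Element docs, my best guess at real code is
the sketch below. It has not been tried against the live page: it parses a
hardcoded copy of the markup above instead of fetching the URL, and it
assumes TreeBuilder keeps the second FONT element as a direct child of
the P and that look_down, content_list, and as_text behave the way I
understand them. The name get_rest_attrs is just my own.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hardcoded sample standing in for the fetched page (an assumption;
# the real page would have to be downloaded first, e.g. with LWP).
my $html = <<'HTML';
<P CLASS="restInfo">
<SPAN CLASS="restName"><FONT FACE="Arial, Helvetica"><A
href="url"><B>name</B></A></FONT></SPAN>
<BR>
<BASEFONT SIZE="1">
<FONT FACE="Verdana, Geneva, Arial, Helvetica">
<B>Cuisine:</B> cuisinetype <BR>
<B>Price:</B> price<BR>
address (cross streets)<BR>
<B>Phone</B>: phoneno<BR>
</FONT>
</P>
<P CLASS="restInfo">some other paragraph</P>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

my @restaurants;
# look_down finds every <p class="restInfo">; skip those without the span.
for my $p ($tree->look_down(_tag => 'p', class => 'restInfo')) {
    my $span = $p->look_down(_tag => 'span', class => 'restName') or next;
    push @restaurants, get_rest_attrs($p, $span);
}
$tree->delete;    # free the tree once the data has been copied out

sub get_rest_attrs {
    my ($p, $span) = @_;
    my %rest = (name => $span->as_text);

    my $link = $span->look_down(_tag => 'a');
    $rest{url} = $link->attr('href') if $link;

    # Assumption: the details live in a FONT that is a direct child of the P.
    my ($font) = grep { ref $_ && $_->tag eq 'font' } $p->content_list;
    return \%rest unless $font;

    # Rebuild the visible text lines by splitting the children at each <br>.
    my @lines = ('');
    for my $node ($font->content_list) {
        if (ref $node && $node->tag eq 'br') { push @lines, ''; next }
        $lines[-1] .= ref $node ? $node->as_text : $node;
    }

    for my $line (@lines) {
        $line =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
        if    ($line =~ /^Cuisine:\s*(.*)/)     { $rest{cuisine} = $1 }
        elsif ($line =~ /^Price:\s*(.*)/)       { $rest{price}   = $1 }
        elsif ($line =~ /^Phone:\s*(.*)/)       { $rest{phone}   = $1 }
        elsif ($line =~ /^(.*?)\s*\((.*)\)$/)   { @rest{qw(address crossstreets)} = ($1, $2) }
    }
    return \%rest;
}

for my $r (@restaurants) {
    print join(' | ', map { "$_=" . ($r->{$_} // '') }
        qw(name url cuisine price address crossstreets phone)), "\n";
}
```

Matching against the text of each line rather than the raw HTML sidesteps
the question of what the serialized markup looks like, which is the part
I was least sure about.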
If this is a quick job to clean up, or if someone can
point me in the right direction, I'd be grateful.
If someone does help me, I'd be happy to write this in a clear
form usable as an example.
Thanks.