As a complete tyro, I've broken my teeth on this web-page scraping 
problem. Several times I've wanted to scrape pages in which the only 
identifying elements are positional rather than syntactic -- that is, 
pages in which everything is a sibling and there's no way to predict how 
many siblings there are in each section headed by an empty named anchor. 
I've been trying to use BeautifulSoup to scrape these, and it's not clear 
to me which is worse: my grasp of Python in general or BeautifulSoup in 
particular. Here's a stripped-down example of the sort of thing I mean:

<html>
<body>
<a name="A1"></a>
<p>paragraph 1</p>
<p>paragraph 1.A</p>
<ul>
   <li>some line</li>
   <li>another line</li>
</ul>
<p>paragraph 1.B</p>

<a name="A2"></a>
<p>paragraph 2</p>
<p>paragraph 2.B</p>

<a name="A3"></a>
<p>paragraph 3</p>
<table>
   <tr><td>some</td><td>data</td></tr>
</table>
</body>
</html>

I want to end up with some container, say a list, containing something 
like this:
[
   [A1, paragraph 1, paragraph 1.A, some line, another line, paragraph 1.B],
   [A2, paragraph 2, paragraph 2.B],
   [A3, paragraph 3, some, data]
]
I've tried things like this (just using print for now; I think I'll be 
able to build the lists or whatever once I get the basic idea):

anchors = soup.findAll('a', {'name': re.compile('^A.*$')})
for x in anchors:
   print x
   x = x.next
   while x is not None and getattr(x, 'name', None) != 'a':
     print x
     x = x.next

My first tries got into endless loops, because I never advanced x inside 
the while (and getattr(x, 'name') blows up on text nodes, which have no 
name). Even with those fixed, I can't help thinking there are simple and 
obvious ways to do this, probably many, but as a rank beginner, they are 
escaping me.
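In case it helps show what I'm after: the grouping I want can apparently 
be done with just the standard library's HTMLParser (Python 3 here; the 
class name AnchorGrouper is just something I made up). I'd still like to 
understand the BeautifulSoup way of doing the same thing.

```python
from html.parser import HTMLParser

class AnchorGrouper(HTMLParser):
    """Collect text into one list per <a name="..."> anchor."""

    def __init__(self):
        super().__init__()
        self.sections = []   # list of [anchor_name, text, text, ...]
        self.current = None  # section being filled; None before first anchor

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            name = dict(attrs).get('name')
            if name:  # a named anchor starts a new section
                self.current = [name]
                self.sections.append(self.current)

    def handle_data(self, data):
        text = data.strip()
        if text and self.current is not None:
            self.current.append(text)

html_doc = """<html><body>
<a name="A1"></a>
<p>paragraph 1</p>
<p>paragraph 1.A</p>
<ul><li>some line</li><li>another line</li></ul>
<p>paragraph 1.B</p>
<a name="A2"></a>
<p>paragraph 2</p>
<p>paragraph 2.B</p>
<a name="A3"></a>
<p>paragraph 3</p>
<table><tr><td>some</td><td>data</td></tr></table>
</body></html>"""

parser = AnchorGrouper()
parser.feed(html_doc)
print(parser.sections)
```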

Can someone wise in the ways of screen scraping give me a clue?

thanks,
Jon
_______________________________________________
Tutor maillist  -  [email protected]
http://mail.python.org/mailman/listinfo/tutor
