Correct looping and extracting

Sayth Renshaw Sun, 22 Jun 2014 20:18:07 -0700

Can I have some assistance with looping and traversing links? First, I will 
start with what I know and what I am doing.


Currently I am in the scrapy shell determining which xpaths work for me to 
retrieve data. The site I am using 
is http://stats.rleague.com/rl/rl_index.html

From there I am using http://stats.rleague.com/rl/seas/2014.html as my 
starting URL.

This page contains a summary of all rounds and matches on one page, with 
nav links to jump down to the relevant round:
In [9]: sel.xpath('//body/center/center/a').extract()
Out[9]: 
[u'<a href="#1">1</a>',
 u'<a href="#2">2</a>',
 u'<a href="#3">3</a>',
 u'<a href="#4">4</a>',
 u'<a href="#5">5</a>',
 u'<a href="#6">6</a>',
 u'<a href="#7">7</a>',
 u'<a href="#8">8</a>',
 u'<a href="#9">9</a>',
...and so on to number 26, each number representing the round number.

In each round there is a link called "match details", which takes you to 
a URL such 
as http://stats.rleague.com/rl/scorers/games/2014/201403060921.html
with the 201403060921.html part being different for each link, based on the date.
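
As a rough sketch of how those links might be collected from the season page: the round nav links (#1, #2, ...) only jump within the same page, so they can be skipped, and any remaining hrefs can be resolved against the season URL. The relative href in sample_hrefs and the "/scorers/games/" filter are assumptions for illustration; the actual attribute values on the page have not been checked. The helper name match_detail_urls is hypothetical.

```python
from urllib.parse import urljoin

SEASON_URL = "http://stats.rleague.com/rl/seas/2014.html"


def match_detail_urls(base_url, hrefs):
    """Return absolute match-detail URLs, skipping in-page round links."""
    urls = []
    for href in hrefs:
        # Links such as "#1" are round anchors on the same page, not new pages.
        if href.startswith("#"):
            continue
        absolute = urljoin(base_url, href)
        # Assumption: match-detail pages live under /scorers/games/.
        if "/scorers/games/" in absolute and absolute not in urls:
            urls.append(absolute)
    return urls


# Hypothetical hrefs; the real values on the season page may differ.
sample_hrefs = ["#1", "#2", "../scorers/games/2014/201403060921.html"]
print(match_detail_urls(SEASON_URL, sample_hrefs))
```

In a spider, each URL returned this way would presumably be requested in turn, with a second callback doing the table extraction.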

If I use that link as a start URL:
sayth:~$ scrapy shell "http://stats.rleague.com/rl/scorers/games/2014/201403060921.html"

Then I can access most data in the table (excluding player names) with:
In [2]: sel.xpath('//tr/td/text()').extract()
Out[2]: 
[u'Pos',
 u'Player',
 u'T',
 u'G',
 u'FG',
 u'Pts',
 u'Pos',
 u'Player',
 u'T',
 u'G',
 u'FG',
 u'Pts',
 u'FB',
 u'3',
 u'\xa0',
 u'\xa0',
 u'12',
 u'FB',
 u'\xa0',
 u'\xa0',
 u'\xa0',
 u'\xa0',
 u'WG',
... and so on
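
Two things stand out in that output: the u'\xa0' entries are non-breaking spaces standing in for empty cells, and each data row yields only five values against six header columns, because the player name sits inside an <a> tag and is not matched by td/text(). A minimal sketch of normalising the placeholder cells, using the first data row above as sample input (the helper name clean_cell is hypothetical):

```python
def clean_cell(text):
    """Turn a non-breaking-space placeholder into an empty string."""
    # In Python 3, str.strip() treats U+00A0 as whitespace.
    return text.strip()


# First data row from the output above; the Player cell is missing
# because its text lives inside a nested <a> element.
raw = ["FB", "3", "\xa0", "\xa0", "12"]
cleaned = [clean_cell(cell) for cell in raw]
print(cleaned)
```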

And I can extract player names with:
In [5]: sel.xpath('//tr/td/a/text()').extract()
Out[5]: 
[u'Greg Inglis',
 u'Anthony Minichiello',
 u'Nathan Merritt',
 u'Daniel Tupou',
 u'Beau Champion',
 u'Michael Jennings',
 u'Bryson Goodwin',
 u'Shaun Kenny-Dowall',
 u'Lote Tuqiri',
 u'Roger Tuivasa-Sheck',
... and so on.

How, though, should I best loop from the first start URL into all subsequent 
"match details" URLs to extract the tables, and how should I correctly 
combine sel.xpath('//tr/td/text()').extract() and 
sel.xpath('//tr/td/a/text()').extract() so that the data all comes out as one 
table?
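
One possible direction for the table part, as a minimal sketch rather than a confirmed solution: instead of running two flat queries and merging them afterwards, iterate row by row and collect every cell's text, including text nested inside <a>. The sketch uses the standard library's ElementTree in place of Scrapy selectors so that it runs on its own, and the SAMPLE markup is a hypothetical, simplified stand-in for the real page (one team per row, well-formed tags). The function name table_rows is also hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real page layout is an assumption.
SAMPLE = """
<table>
  <tr><td>Pos</td><td>Player</td><td>T</td><td>G</td><td>FG</td><td>Pts</td></tr>
  <tr><td>FB</td><td><a href="#">Greg Inglis</a></td><td>3</td><td>&#160;</td><td>&#160;</td><td>12</td></tr>
  <tr><td>FB</td><td><a href="#">Anthony Minichiello</a></td><td>&#160;</td><td>&#160;</td><td>&#160;</td><td>&#160;</td></tr>
</table>
"""


def table_rows(markup):
    """Return one list of cell strings per <tr>, keeping cells aligned."""
    root = ET.fromstring(markup)
    rows = []
    for tr in root.iter("tr"):
        # itertext() gathers the cell's own text plus text in nested tags
        # such as <a>, so the player name stays in its column.
        cells = ["".join(td.itertext()).strip() for td in tr.findall("td")]
        rows.append(cells)
    return rows


header, *players = table_rows(SAMPLE)
records = [dict(zip(header, row)) for row in players]
print(records[0])
```

In the Scrapy shell the analogous idea would be to loop over sel.xpath('//tr') and run a relative XPath such as td//text() on each row, so that names and numbers from the same row stay together.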

Thanks Sayth
 

