Currently I am in the scrapy shell determining which xpaths work for me to 
retrieve data. The site I am using is 
http://stats.rleague.com/rl/rl_index.html

From there I am using the following as my start URL:
http://stats.rleague.com/rl/seas/2014.html

This page contains a summary of all rounds and matches on a single page, with 
navigation links that jump down to the relevant round.

    In [9]: sel.xpath('//body/center/center/a').extract()
    Out[9]: 
    [u'<a href="#1">1</a>',
     u'<a href="#2">2</a>',
     u'<a href="#3">3</a>',
     u'<a href="#4">4</a>',
     u'<a href="#5">5</a>',
     u'<a href="#6">6</a>',
     u'<a href="#7">7</a>',
     u'<a href="#8">8</a>',
     u'<a href="#9">9</a>',
    ...and so on to number 26, each number representing the round number.

In each round there are links called "match details" (one link per match), 
each of which takes you to a URL such as 
http://stats.rleague.com/rl/scorers/games/2014/201403060921.html
where the 201403060921.html part differs for each link, based on the date.
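
A minimal sketch of how those links might be collected from the season page, assuming the hrefs are relative paths that resolve to something containing /scorers/games/ (I have not confirmed how the page actually writes them). The function name match_detail_urls, the callback name parse_match, and the sample hrefs are made up for illustration.

```python
from urllib.parse import urljoin  # Python 3; on Python 2 this lives in urlparse

SEASON_URL = "http://stats.rleague.com/rl/seas/2014.html"


def match_detail_urls(base_url, hrefs):
    """Resolve hrefs against base_url and keep only match-detail pages.

    Assumption: a match-detail link contains '/scorers/games/' once resolved,
    as in the example URL above; in-page round anchors such as '#1' do not.
    """
    urls = []
    for href in hrefs:
        url = urljoin(base_url, href)
        if "/scorers/games/" in url and url not in urls:
            urls.append(url)
    return urls


# In the scrapy shell the hrefs could come from something like:
#   hrefs = sel.xpath('//a/@href').extract()
# and inside a spider callback each URL could then be yielded as
#   Request(url, callback=self.parse_match)  # parse_match is a hypothetical name
hrefs = ["#1", "../scorers/games/2014/201403060921.html"]  # made-up sample input
print(match_detail_urls(SEASON_URL, hrefs))
```

Filtering on the resolved URL rather than the raw href means the round anchors (#1 to #26) drop out without any special handling.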

If I use one of those match details links as a start URL:

    sayth:~$ scrapy shell "http://stats.rleague.com/rl/scorers/games/2014/201403060921.html"

Then I can access most of the data in the table (excluding player names) with:

    In [2]: sel.xpath('//tr/td/text()').extract()
    Out[2]: 
    [u'Pos',
     u'Player',
     u'T',
     u'G',
     u'FG',
     u'Pts',
     u'Pos',
     u'Player',
     u'T',
     u'G',
     u'FG',
     u'Pts',
     u'FB',
     u'3',
     u'\xa0',
     u'\xa0',
     u'12',
     u'FB',
     u'\xa0',
     u'\xa0',
     u'\xa0',
     u'\xa0',
     u'WG',
    ... and so on

And I can extract the player names with:

    In [5]: sel.xpath('//tr/td/a/text()').extract()
    Out[5]: 
    [u'Greg Inglis',
     u'Anthony Minichiello',
     u'Nathan Merritt',
     u'Daniel Tupou',
     u'Beau Champion',
     u'Michael Jennings',
     u'Bryson Goodwin',
     u'Shaun Kenny-Dowall',
     u'Lote Tuqiri',
     u'Roger Tuivasa-Sheck',
    ... and so on.

How should I best loop from the start URL through all of the subsequent 
"match details" URLs to extract the tables? And how should I correctly 
combine sel.xpath('//tr/td/text()').extract() and 
sel.xpath('//tr/td/a/text()').extract() so that the data comes out as one 
table?

So, for example, I would get something like:

POS, FB; Player, Greg Inglis; Tries, 3
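
One possible way to get output of that shape is to extract the cells row by row instead of across the whole page, so each player name stays next to its own stats. The sketch below assumes, based only on the Out[2] listing, that each player row holds two teams side by side with six cells each (Pos, Player, T, G, FG, Pts). The names clean and rows_to_records are hypothetical, and the sample row is pieced together from the two outputs above under that assumed layout.

```python
HEADER = [u"Pos", u"Player", u"T", u"G", u"FG", u"Pts"]  # column names from Out[2]


def clean(cell):
    """Replace non-breaking spaces with plain spaces and trim whitespace."""
    return cell.replace(u"\xa0", u" ").strip()


def rows_to_records(rows, header=HEADER):
    """Map every group of len(header) cells in a row onto the header names.

    Assumption: a player row has a cell count that is a multiple of
    len(header); any other row is skipped as not being a player row.
    """
    width = len(header)
    records = []
    for row in rows:
        cells = [clean(c) for c in row]
        if not cells or len(cells) % width != 0:
            continue
        for start in range(0, len(cells), width):
            group = cells[start:start + width]
            # Skip repeated header groups and groups that are entirely blank.
            if group != header and any(group):
                records.append(dict(zip(header, group)))
    return records


# In the scrapy shell, rows could be built one <tr> at a time; string(.) joins
# all text under a <td>, including the text inside an <a>:
#   rows = [[td.xpath('string(.)').extract()[0] for td in tr.xpath('td')]
#           for tr in sel.xpath('//tr')]
sample_rows = [
    HEADER + HEADER,
    [u"FB", u"Greg Inglis", u"3", u"\xa0", u"\xa0", u"12",
     u"FB", u"Anthony Minichiello", u"\xa0", u"\xa0", u"\xa0", u"\xa0"],
]
for record in rows_to_records(sample_rows):
    print(record)
```

Working per row avoids having to line up two separately extracted flat lists by position, which is where the player names would otherwise drift out of step with their stats.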
