Hi Raf,
Glad to see that you managed to crawl that site anyway.
However, I see that you used some hard-coding to handle the pagination. A
better idea is to follow the "suivante" (next) link to walk through the
pagination on the site.
To do this, scrapy.http.Request() takes a callback argument. When a response
is returned for that request, it is fed to that callback function (by default
it is the parse method itself). With this you can do something like this:
import urlparse  # Python 2; use urllib.parse on Python 3
import scrapy

def parse(self, response):
    # Follow the "Suivante" pagination link; its response comes back to parse()
    next_selector = response.xpath(
        "//div[@id='paginationControl']//a[contains(text(), 'Suivante')]/@href")
    for url in next_selector.extract():
        yield scrapy.http.Request(urlparse.urljoin(response.url, url))

    # Hand each company detail page to parse_lien()
    for href in response.css('div.lien-ville ul li a::attr("href")'):
        full_url = response.urljoin(href.extract())
        yield scrapy.Request(full_url, callback=self.parse_lien)
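A nice property of this approach: Scrapy's default duplicate filter drops
requests for URLs it has already scheduled, so even if the same link is
extracted from several pages it will only be fetched once.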
This way the pagination links will be parsed by the parse function, while the
item pages with the information you need will be parsed by your parse_lien
function. You can refer to "Learning Scrapy" by Dimitrios Kouzis-Loukas for
more details.
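As for replacing self.pagenum: rather than keeping a shared attribute on the
spider, you can pass the page number along with each request. A rough,
untested sketch (assuming the page number can be read from the ?p= query
parameter, with page 1 having no parameter):

import urlparse  # Python 2; use urllib.parse on Python 3
import scrapy

def parse(self, response):
    # Read the current page number from the ?p= parameter (page 1 has none)
    query = urlparse.urlparse(response.url).query
    page = int(urlparse.parse_qs(query).get('p', ['1'])[0])

    for href in response.css('div.lien-ville ul li a::attr("href")'):
        full_url = response.urljoin(href.extract())
        # meta travels with the request and is available on the response
        yield scrapy.Request(full_url, callback=self.parse_lien,
                             meta={'page': page})

def parse_lien(self, response):
    yield {
        'page': response.meta['page'],
        'lien': response.url,
    }

Because the page number rides along in meta, parse_lien always reports the
page the link was found on, even when several pages are crawled concurrently.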
Regards,
On Wednesday, August 17, 2016 at 4:31:08 AM UTC+8, Raf Roger wrote:
>
> So for now I took a website for testing purposes and to help me learn the
> basics of scrapy.
>
> website is "http://www.allosociete.ch/telephone-horaires-metier/Pressing"
>
> I would like to get the following output in csv:
> counter: a simple incrementing number
> page_id: the page number
> url: the URL that displays the company details
> company name: once on that URL, collect the company name
>
> For that, on http://www.allosociete.ch/telephone-horaires-metier/Pressing,
> which is page 1, I was able to collect data as follows:
>
> import scrapy
>
> class AlloSociete(scrapy.Spider):
>     name = 'allosocietepressing'
>     start_urls = ['http://www.allosociete.ch/telephone-horaires-metier/Pressing']
>     counter = 1
>     pagenum = 1
>
>     def parse(self, response):
>         for href in response.css('div.lien-ville ul li a::attr("href")'):
>             full_url = response.urljoin(href.extract())
>             yield scrapy.Request(full_url, self.parse_lien)
>
>     def parse_lien(self, response):
>         yield {
>             'count': self.counter,
>             'page': self.pagenum,
>             'lien': response.url
>         }
>         self.counter = self.counter + 1
>
> For now I have not been able to get a clear understanding of how to catch
> the pagination and replace self.pagenum with the pagination id.
> This section has only 3 pages.
>
> Thanks for helping me understand how scrapy works, as it seems to be very
> promising for collecting real-time data.
>
>
>
>
> On Monday, August 15, 2016 at 1:14:45 AM UTC+2, WANG Ruoxi wrote:
>>
>> Hi Raf,
>>
>> I'm not sure I understand your question correctly, but you can always use a
>> regex in the LinkExtractor to retrieve all the pagination links that you
>> need. Something like
>>
>> "telephone-horaires-metier\/Restaurant\?p=[0-9]+$" will match those links,
>> as long as the trailing number is always a positive integer.
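>>
>> For illustration, an untested sketch of how such a regex could be used with
>> a CrawlSpider (the spider name, the placeholder base URL, and the
>> detail-link rule are assumptions, not part of the original question):
>>
>> from scrapy.linkextractors import LinkExtractor
>> from scrapy.spiders import CrawlSpider, Rule
>>
>> class RestaurantSpider(CrawlSpider):
>>     name = 'restaurants'
>>     # replace example.com with the real base URL of the site
>>     start_urls = ['http://www.example.com/telephone-horaires-metier/Restaurant']
>>
>>     rules = (
>>         # Follow pagination links such as ...Restaurant?p=2, ?p=3, ...
>>         Rule(LinkExtractor(allow=r'telephone-horaires-metier\/Restaurant\?p=[0-9]+$')),
>>         # Send every link found under ul li a to parse_item
>>         Rule(LinkExtractor(restrict_css='ul li a'), callback='parse_item'),
>>     )
>>
>>     def parse_item(self, response):
>>         yield {'lien': response.url}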
>>
>> Regards,
>>
>>
>>
>> On Sunday, August 14, 2016 at 11:11:40 PM UTC+8, Raf Roger wrote:
>>>
>>> Hi,
>>>
>>> I'm new to scrapy and I'm looking for a way to retrieve all the links
>>> (matching ul li a).
>>> On each page there is pagination, and the first page url is like:
>>> telephone-horaires-metier/Restaurant
>>>
>>> page 2 url is:
>>> telephone-horaires-metier/Restaurant?p=2
>>>
>>> page 3 url is:
>>> telephone-horaires-metier/Restaurant?p=3
>>>
>>> etc...
>>>
>>> the "next" url is always the current page +1 so if i'm page 2 "next" url
>>> is telephone-horaires-metier/Restaurant?p=3
>>>
>>> How can I collect all the links on each page?
>>>
>>> thx
>>>
>>