Weiron, response order is not guaranteed: requests run concurrently, and the server you are crawling may answer them in a different order each time. If you make one request and then another, the second response can come back before the first, and that part is beyond your control.
Try setting `CONCURRENT_REQUESTS = 1`, but note that the downloader is also asynchronous, so response order is still not guaranteed:
http://doc.scrapy.org/en/latest/topics/settings.html#concurrent-requests

If you build the requests manually, you can also set their `priority`:
http://doc.scrapy.org/en/latest/topics/request-response.html#request-objects

Finally, you can listen for the `spider_idle` signal in an extension or middleware and feed in requests from your own queue:
http://doc.scrapy.org/en/latest/topics/signals.html#spider-idle

Rough sketches of each approach follow.
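Limiting concurrency is just a settings change (minimal sketch; this goes in your project's settings.py):

    # settings.py -- only one request in flight at a time.
    # Responses will mostly come back in order, but because the downloader
    # is asynchronous there is still no hard guarantee.
    CONCURRENT_REQUESTS = 1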
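For priorities, something like this inside your existing spider (a sketch adapted to your code, not tested against legifrance.gouv.fr): higher `priority` values are scheduled earlier, so negating the link's position roughly preserves the order the links appear in on the summary page.

    def parse(self, response):
        for i, href in enumerate(response.xpath("//a/@href").extract()):
            url = response.urljoin(href)
            # Higher priority is scheduled sooner; -i keeps (roughly)
            # the order the links appear in on the page.
            yield scrapy.Request(url, callback=self.parse_article, priority=-i)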
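And a rough sketch of the `spider_idle` route: keep your own ordered queue and feed one request at a time from an extension. The class name and URLs below are made up, and the `engine.crawl(request, spider)` call is the Scrapy 1.x signature, so adapt as needed.

    from scrapy import Request, signals
    from scrapy.exceptions import DontCloseSpider

    class OrderedQueue(object):
        """Feeds requests one by one so ordering stays under our control."""

        def __init__(self, crawler):
            self.crawler = crawler
            # Hypothetical list of URLs to crawl strictly in this order.
            self.pending = ["http://example.com/1", "http://example.com/2"]
            crawler.signals.connect(self.spider_idle, signal=signals.spider_idle)

        @classmethod
        def from_crawler(cls, crawler):
            return cls(crawler)

        def spider_idle(self, spider):
            if self.pending:
                url = self.pending.pop(0)
                self.crawler.engine.crawl(Request(url, callback=spider.parse), spider)
                # Keep the spider alive until our queue is drained.
                raise DontCloseSpider

You would register the class under the EXTENSIONS setting in settings.py so Scrapy loads it.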
On Friday, February 5, 2016 at 4:30:02 AM UTC-7, Weiron wrote:

> Hello,
>
> I have a problem with the latest Scrapy release on my Windows 10 system.
> The goal of my project is to crawl some French government web pages to
> get every article of law from a code. For example, I'm trying to crawl
> the "code général des impôts" from this page:
> http://www.legifrance.gouv.fr/affichCode.do?cidTexte=LEGITEXT000006069577&dateTexte=20160202
> I also have to crawl each article linked in this summary. In the end,
> every title of the summary has to go into a database together with its
> associated article.
>
> I tried a few things, following the latest Scrapy tutorial to write my
> script.
>
> This is the beginning of my script:
>
> import scrapy
>
> from tutorial.items import GouvItem
>
> class GouvSpider(scrapy.Spider):
>
>     name = "gouv"
>     allowed_domains = ["legifrance.gouv.fr"]
>     start_urls = [
>         "http://www.legifrance.gouv.fr/affichCode.do?cidTexte=LEGITEXT000006069577&dateTexte=20160128"
>     ]
>
> Then the part that crawls each title of the summary:
>
>     def parse(self, response):
>         for sel in response.xpath('//ul/li'):
>             item = GouvItem()
>             if len(sel.xpath('span/text()').extract()) > 0:
>                 item['title1'] = sel.xpath('span/text()').extract()
>             if len(sel.xpath('ul/li/span/text()').extract()) > 0:
>                 item['title2'] = sel.xpath('ul/li/span/text()').extract()
>             if len(sel.xpath('ul/li/ul/li/span/text()').extract()) > 0:
>                 item['title3'] = sel.xpath('ul/li/ul/li/span/text()').extract()
>             if len(sel.xpath('ul/li/ul/li/ul/li/span/text()').extract()) > 0:
>                 item['title4'] = sel.xpath('ul/li/ul/li/ul/li/span/text()').extract()
>             if len(sel.xpath('ul/li/ul/li/ul/li/ul/li/span/text()').extract()) > 0:
>                 item['title5'] = sel.xpath('ul/li/ul/li/ul/li/ul/li/span/text()').extract()
>             if len(sel.xpath('ul/li/ul/li/ul/li/ul/li/ul/li/span/text()').extract()) > 0:
>                 item['title6'] = sel.xpath('ul/li/ul/li/ul/li/ul/li/ul/li/span/text()').extract()
>             if len(sel.xpath('ul/li/ul/li/span/a/@href').extract()) > 0:
>                 item['link'] = sel.xpath('ul/li/ul/li/span/a/@href').extract()
>             yield item
>
> And now I'm trying to crawl each article linked in the summary with this:
>
>     def parse(self, response):
>         for href in response.xpath("//a/@href"):
>             url = response.urljoin(href.extract())
>             yield scrapy.Request(url, callback=self.parse_article)
>
>     def parse_article(self, response):
>         for art in response.xpath("//div[@class='corpsArt']"):
>             item = GouvItem()
>             item['article'] = art.xpath('p/text()').extract()
>             yield item
>
> To iterate faster while testing, I don't crawl the summary and the
> articles at the same time, because that takes a long time. My issue with
> the article crawl is that what it returns is random: if I run the script
> twice, I get two different results... I don't understand why.
>
> I hope you will be able to help me and that my explanation is clear =)
> I tried to describe it as much as possible.
>
> Thank you so much!!
