Weiron, response order is not guaranteed: requests run concurrently, and the server you are crawling may answer them in a different order each time. If you make one request and then another, the second response can come back before the first, and that part is beyond your control.
Try setting `CONCURRENT_REQUESTS = 1`, but note that the downloader is also asynchronous, so response order is still not guaranteed:
http://doc.scrapy.org/en/latest/topics/settings.html#concurrent-requests

If you build the requests manually, you can also set their `priority`:
http://doc.scrapy.org/en/latest/topics/request-response.html#request-objects

Finally, you can listen for the `spider_idle` signal in an extension or middleware and feed in requests from your own queue:
http://doc.scrapy.org/en/latest/topics/signals.html#spider-idle

Rough sketches of each approach follow.
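Limiting concurrency is just a settings change (minimal sketch; this goes in your project's settings.py):

    # settings.py -- only one request in flight at a time.
    # Responses will mostly come back in order, but because the downloader
    # is asynchronous there is still no hard guarantee.
    CONCURRENT_REQUESTS = 1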
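For priorities, something like this inside your existing spider (a sketch adapted to your code, not tested against legifrance.gouv.fr): higher `priority` values are scheduled earlier, so negating the link's position roughly preserves the order the links appear in on the summary page.

    def parse(self, response):
        for i, href in enumerate(response.xpath("//a/@href").extract()):
            url = response.urljoin(href)
            # Higher priority is scheduled sooner; -i keeps (roughly)
            # the order the links appear in on the page.
            yield scrapy.Request(url, callback=self.parse_article, priority=-i)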
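And a rough sketch of the `spider_idle` route: keep your own ordered queue and feed one request at a time from an extension. The class name and URLs below are made up, and the `engine.crawl(request, spider)` call is the Scrapy 1.x signature, so adapt as needed.

    from scrapy import Request, signals
    from scrapy.exceptions import DontCloseSpider

    class OrderedQueue(object):
        """Feeds requests one by one so ordering stays under our control."""

        def __init__(self, crawler):
            self.crawler = crawler
            # Hypothetical list of URLs to crawl strictly in this order.
            self.pending = ["http://example.com/1", "http://example.com/2"]
            crawler.signals.connect(self.spider_idle, signal=signals.spider_idle)

        @classmethod
        def from_crawler(cls, crawler):
            return cls(crawler)

        def spider_idle(self, spider):
            if self.pending:
                url = self.pending.pop(0)
                self.crawler.engine.crawl(Request(url, callback=spider.parse), spider)
                # Keep the spider alive until our queue is drained.
                raise DontCloseSpider

You would register the class under the EXTENSIONS setting in settings.py so Scrapy loads it.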
On Friday, February 5, 2016 at 4:30:02 AM UTC-7, Weiron wrote:

> Hello,
>
> I have a problem with the latest Scrapy release on my Windows 10 system.
> The goal of my project is to crawl some French government web pages to
> get every article of law from a code. For example, I'm trying to crawl
> the "code général des impôts" from this page:
> http://www.legifrance.gouv.fr/affichCode.do?cidTexte=LEGITEXT000006069577&dateTexte=20160202
> I also have to crawl each article linked in this summary. In the end,
> every title of the summary has to go into a database together with its
> associated article.
>
> I tried a few things, following the latest Scrapy tutorial to write my
> script.
>
> This is the beginning of my script:
>
> import scrapy
>
> from tutorial.items import GouvItem
>
> class GouvSpider(scrapy.Spider):
>
>     name = "gouv"
>     allowed_domains = ["legifrance.gouv.fr"]
>     start_urls = [
>         "http://www.legifrance.gouv.fr/affichCode.do?cidTexte=LEGITEXT000006069577&dateTexte=20160128"
>     ]
>
> Then the part that crawls each title of the summary:
>
>     def parse(self, response):
>         for sel in response.xpath('//ul/li'):
>             item = GouvItem()
>             if len(sel.xpath('span/text()').extract()) > 0:
>                 item['title1'] = sel.xpath('span/text()').extract()
>             if len(sel.xpath('ul/li/span/text()').extract()) > 0:
>                 item['title2'] = sel.xpath('ul/li/span/text()').extract()
>             if len(sel.xpath('ul/li/ul/li/span/text()').extract()) > 0:
>                 item['title3'] = sel.xpath('ul/li/ul/li/span/text()').extract()
>             if len(sel.xpath('ul/li/ul/li/ul/li/span/text()').extract()) > 0:
>                 item['title4'] = sel.xpath('ul/li/ul/li/ul/li/span/text()').extract()
>             if len(sel.xpath('ul/li/ul/li/ul/li/ul/li/span/text()').extract()) > 0:
>                 item['title5'] = sel.xpath('ul/li/ul/li/ul/li/ul/li/span/text()').extract()
>             if len(sel.xpath('ul/li/ul/li/ul/li/ul/li/ul/li/span/text()').extract()) > 0:
>                 item['title6'] = sel.xpath('ul/li/ul/li/ul/li/ul/li/ul/li/span/text()').extract()
>             if len(sel.xpath('ul/li/ul/li/span/a/@href').extract()) > 0:
>                 item['link'] = sel.xpath('ul/li/ul/li/span/a/@href').extract()
>             yield item
>
> And now I'm trying to crawl each article linked in the summary with this:
>
>     def parse(self, response):
>         for href in response.xpath("//a/@href"):
>             url = response.urljoin(href.extract())
>             yield scrapy.Request(url, callback=self.parse_article)
>
>     def parse_article(self, response):
>         for art in response.xpath("//div[@class='corpsArt']"):
>             item = GouvItem()
>             item['article'] = art.xpath('p/text()').extract()
>             yield item
>
> To iterate faster while testing, I don't crawl the summary and the
> articles at the same time, because that takes a long time. My issue with
> the article crawl is that what it returns is random: if I run the script
> twice, I get two different results... I don't understand why.
>
> I hope you will be able to help me and that my explanation is clear =)
> I tried to describe it as much as possible.
>
> Thank you so much!!
