Hello, I have a problem with the latest Scrapy release on my Windows 10 system. The goal of my project is to crawl some French government web pages in order to get every article of law from a given code. For example, I'm trying to crawl the "code général des impôts" starting from this page: http://www.legifrance.gouv.fr/affichCode.do?cidTexte=LEGITEXT000006069577&dateTexte=20160202 . I also have to crawl each article linked from this summary. In the end, I have to put every title of the summary in a database together with its associated article.
I tried a few things to do this. I followed the latest Scrapy tutorial to write my script. This is the beginning of it:

    import scrapy
    from tutorial.items import GouvItem

    class GouvSpider(scrapy.Spider):
        name = "gouv"
        allowed_domains = ["legifrance.gouv.fr"]
        start_urls = [
            "http://www.legifrance.gouv.fr/affichCode.do?cidTexte=LEGITEXT000006069577&dateTexte=20160128"
        ]

Then here is the parse method that crawls each title of the summary:

        def parse(self, response):
            for sel in response.xpath('//ul/li'):
                item = GouvItem()
                if len(sel.xpath('span/text()').extract()) > 0:
                    item['title1'] = sel.xpath('span/text()').extract()
                if len(sel.xpath('ul/li/span/text()').extract()) > 0:
                    item['title2'] = sel.xpath('ul/li/span/text()').extract()
                if len(sel.xpath('ul/li/ul/li/span/text()').extract()) > 0:
                    item['title3'] = sel.xpath('ul/li/ul/li/span/text()').extract()
                if len(sel.xpath('ul/li/ul/li/ul/li/span/text()').extract()) > 0:
                    item['title4'] = sel.xpath('ul/li/ul/li/ul/li/span/text()').extract()
                if len(sel.xpath('ul/li/ul/li/ul/li/ul/li/span/text()').extract()) > 0:
                    item['title5'] = sel.xpath('ul/li/ul/li/ul/li/ul/li/span/text()').extract()
                if len(sel.xpath('ul/li/ul/li/ul/li/ul/li/ul/li/span/text()').extract()) > 0:
                    item['title6'] = sel.xpath('ul/li/ul/li/ul/li/ul/li/ul/li/span/text()').extract()
                if len(sel.xpath('ul/li/ul/li/span/a/@href').extract()) > 0:
                    item['link'] = sel.xpath('ul/li/ul/li/span/a/@href').extract()
                yield item

And now I'm trying to crawl each article linked from the summary with this version of the spider:

        def parse(self, response):
            for href in response.xpath("//a/@href"):
                url = response.urljoin(href.extract())
                yield scrapy.Request(url, callback=self.parse_article)

        def parse_article(self, response):
            for art in response.xpath("//div[@class='corpsArt']"):
                item = GouvItem()
                item['article'] = art.xpath('p/text()').extract()
                yield item

To make testing faster, I don't crawl the summary and the articles at the same time, because that takes a long time.

With the article crawl, my issue is that the results it returns are random: if I run the script twice, I get two different outputs. I don't understand why.

I hope you will be able to help me and that my question is clear =) I tried to explain it as well as I could. Thank you so much!
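P.S. In case it helps, here is roughly how my GouvItem is declared in tutorial/items.py. I'm writing this from memory, so the exact file may differ slightly, but these are the fields the spider uses:

    # tutorial/items.py (approximately)
    import scrapy

    class GouvItem(scrapy.Item):
        # one field per level of the summary hierarchy
        title1 = scrapy.Field()
        title2 = scrapy.Field()
        title3 = scrapy.Field()
        title4 = scrapy.Field()
        title5 = scrapy.Field()
        title6 = scrapy.Field()
        # link to the article page and the article text itself
        link = scrapy.Field()
        article = scrapy.Field()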
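Also, what I am ultimately aiming for is to combine the two steps so that each summary title ends up in the same item as its article. This is only a rough, untested sketch of what I have in mind (the XPaths are simplified and the 'item' key in meta is just a name I chose), to show what I mean:

        def parse(self, response):
            for sel in response.xpath('//ul/li'):
                item = GouvItem()
                item['title1'] = sel.xpath('span/text()').extract()
                links = sel.xpath('ul/li/span/a/@href').extract()
                if links:
                    url = response.urljoin(links[0])
                    # pass the partially filled item along to the article callback
                    yield scrapy.Request(url, callback=self.parse_article,
                                         meta={'item': item})
                else:
                    yield item

        def parse_article(self, response):
            # retrieve the item that was built while parsing the summary
            item = response.meta['item']
            item['article'] = response.xpath("//div[@class='corpsArt']/p/text()").extract()
            yield item

Is that the right way to associate a title with its article, or is there a better approach?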
