Hello,

I have a problem with the latest Scrapy release on my Windows 10 system.
The goal of my project is to crawl some French government web pages to get 
every article of law from a code. For example, I'm trying to crawl the 
"code général des impôts" from this page:
http://www.legifrance.gouv.fr/affichCode.do?cidTexte=LEGITEXT000006069577&dateTexte=20160202
I also have to crawl each article linked from this summary. In the end, I 
have to put every title of the summary in a database together with its 
associated article.

I tried a few things to do this, following the latest Scrapy tutorial to 
write my script.
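
For reference, my items.py from the tutorial project defines one field per 
title level of the summary, plus the link and the article text. It looks 
roughly like this:

import scrapy

class GouvItem(scrapy.Item):
    # one field per nesting level of the summary, plus the article link and body
    title1 = scrapy.Field()
    title2 = scrapy.Field()
    title3 = scrapy.Field()
    title4 = scrapy.Field()
    title5 = scrapy.Field()
    title6 = scrapy.Field()
    link = scrapy.Field()
    article = scrapy.Field()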

This is the beginning of my script:

import scrapy

from tutorial.items import GouvItem

class GouvSpider(scrapy.Spider):
    name = "gouv"
    allowed_domains = ["legifrance.gouv.fr"]
    start_urls = [
        "http://www.legifrance.gouv.fr/affichCode.do?cidTexte=LEGITEXT000006069577&dateTexte=20160128"
    ]

Then the part of the script that crawls each title of the summary:

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            item = GouvItem()
            # each extra ul/li step in the XPath corresponds to a deeper title level
            if len(sel.xpath('span/text()').extract()) > 0:
                item['title1'] = sel.xpath('span/text()').extract()
            if len(sel.xpath('ul/li/span/text()').extract()) > 0:
                item['title2'] = sel.xpath('ul/li/span/text()').extract()
            if len(sel.xpath('ul/li/ul/li/span/text()').extract()) > 0:
                item['title3'] = sel.xpath('ul/li/ul/li/span/text()').extract()
            if len(sel.xpath('ul/li/ul/li/ul/li/span/text()').extract()) > 0:
                item['title4'] = sel.xpath('ul/li/ul/li/ul/li/span/text()').extract()
            if len(sel.xpath('ul/li/ul/li/ul/li/ul/li/span/text()').extract()) > 0:
                item['title5'] = sel.xpath('ul/li/ul/li/ul/li/ul/li/span/text()').extract()
            if len(sel.xpath('ul/li/ul/li/ul/li/ul/li/ul/li/span/text()').extract()) > 0:
                item['title6'] = sel.xpath('ul/li/ul/li/ul/li/ul/li/ul/li/span/text()').extract()
            if len(sel.xpath('ul/li/ul/li/span/a/@href').extract()) > 0:
                item['link'] = sel.xpath('ul/li/ul/li/span/a/@href').extract()
            yield item

And now I'm trying to crawl each article linked from the summary with this 
script:

    def parse(self, response):
        for href in response.xpath("//a/@href"):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_article)

    def parse_article(self, response):
        for art in response.xpath("//div[@class='corpsArt']"):
            item = GouvItem()
            item['article'] = art.xpath('p/text()').extract()
            yield item
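
Eventually I would like to chain the two steps, so that each article ends 
up attached to its titles. I imagine something roughly like this (passing 
the partly filled item along in the request meta - this is only a sketch, 
I haven't really tested it):

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            item = GouvItem()
            # ... fill title1 to title6 exactly as in the first parse() above ...
            links = sel.xpath('ul/li/ul/li/span/a/@href').extract()
            if links:
                url = response.urljoin(links[0])
                # carry the item over to the article page
                yield scrapy.Request(url, meta={'item': item}, callback=self.parse_article)
            else:
                yield item

    def parse_article(self, response):
        item = response.meta['item']
        item['article'] = response.xpath("//div[@class='corpsArt']/p/text()").extract()
        yield item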

To test things faster, I don't crawl the summary and each article at the 
same time, because it takes a long time. My issue with the crawl of the 
articles is that the results it returns are random: if I run my script 
twice, it returns two different answers... I don't understand why...
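
For what it's worth, I run the spider from the project directory with 
something like this (the output file name is just an example, so I can 
compare two runs):

scrapy crawl gouv -o articles.json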

I hope you'll be able to help me and that my question is clear =) I tried 
to explain it as well as possible.

Thank you so much!!
