I am trying to crawl a website. Some pages have this kind of URL:
http://domain/fr/detail/459340332/westsite.aspx i.e.
http://domain/fr/detail/<number>/<name>.aspx

and other pages look like this:
http://domain/fr/detail/459340332/westsite/activity.aspx i.e.
http://domain/fr/detail/<number>/<name>/activity.aspx

I only want to scrape the first type, not the second, but Scrapy tries to
parse both and crashes because I don't handle the second type of page.
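
As a stopgap against the crash, the callback could skip any response whose URL is not of the first type. This is only a minimal sketch: is_detail_url is a hypothetical helper, the pattern is my assumption about the first URL type, and the returned item is a placeholder for the real parsing.

```python
import re

# Assumed pattern for the first URL type: /detail/<number>/<name>.aspx
DETAIL_URL = re.compile(r'/detail/[0-9]+/[0-9a-z\-]+\.aspx$')


def is_detail_url(url):
    """Return True only for URLs of the first type (hypothetical helper)."""
    return DETAIL_URL.search(url) is not None


def parse_fiche(response):
    # Guard: ignore anything that is not a detail page instead of crashing.
    if not is_detail_url(response.url):
        return []
    # Placeholder for the real extraction logic.
    return [{'url': response.url}]
```

This does not explain why the second type of page reaches the callback; it only keeps the spider from failing on it.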

First I tried this:
Rule(SgmlLinkExtractor(
    allow=[r'http://trendstop\.levif\.be/fr/detail/[0-9]+/[0-9a-z\-]+\.aspx$'],
    unique=True), callback='parse_fiche')

But Scrapy still sends the activity.aspx pages to parse_fiche (my callback function).

After that, I tried:
Rule(SgmlLinkExtractor(
    allow=[r'/detail/[0-9]+/[0-9a-z\-]+\.aspx'],
    deny=[r'activity'],
    unique=True), callback='parse_fiche')

But Scrapy doesn't ignore the activity.aspx pages. How can I do this?

I have tested my regex in Notepad++ and it seems correct: it matches the
first URL type but not the second.
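
The same check can be reproduced in Python with the re module. This is a minimal sketch, assuming the real URLs use the trendstop.levif.be host from my first rule and that each pattern is applied with a search rather than a full match. The accepted function is a hypothetical, simplified model of how allow and deny are meant to combine, not Scrapy's own code.

```python
import re

# Assumption: the real URLs use the host from the first rule.
first_type = 'http://trendstop.levif.be/fr/detail/459340332/westsite.aspx'
second_type = 'http://trendstop.levif.be/fr/detail/459340332/westsite/activity.aspx'

# Patterns copied from the two rules.
strict_allow = re.compile(
    r'http://trendstop\.levif\.be/fr/detail/[0-9]+/[0-9a-z\-]+\.aspx$')
loose_allow = re.compile(r'/detail/[0-9]+/[0-9a-z\-]+\.aspx')
deny = re.compile(r'activity')


def accepted(url, allow, deny=None):
    """Keep a URL if the allow pattern matches and the deny pattern does not."""
    if allow.search(url) is None:
        return False
    return deny is None or deny.search(url) is None


for url in (first_type, second_type):
    print(url)
    print('  first rule :', accepted(url, strict_allow))
    print('  second rule:', accepted(url, loose_allow, deny))
```

Under these assumptions both rules accept only the first URL, so the patterns themselves do not seem to explain why activity.aspx pages reach parse_fiche.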

Thanks, 
Nickko
