I am trying to crawl a website. Some pages have URLs like this:

    http://domain/fr/detail/459340332/westsite.aspx
    i.e. http://domain/fr/detail/*number*/*name*.aspx

and other pages have URLs like this:

    http://domain/fr/detail/459340332/westsite/activity.aspx
    i.e. http://domain/fr/detail/*number*/*name*/activity.aspx

I want to scrape only the first type, not the second, but Scrapy tries to parse both and crashes because my callback does not handle the second type of page.

First I tried this:

    Rule(SgmlLinkExtractor(
             allow=[r'http://trendstop\.levif\.be/fr/detail/[0-9]+/[0-9a-z\-]+\.aspx$'],
             unique=True),
         callback='parse_fiche')

But Scrapy still sends the activity.aspx pages to parse_fiche (my callback function).

After that, I tried:

    Rule(SgmlLinkExtractor(
             allow=[r'/detail/[0-9]+/[0-9a-z\-]+\.aspx'],
             deny=[r'activity'],
             unique=True),
         callback='parse_fiche')

But Scrapy still does not ignore the activity.aspx pages. How can I do this?

I have tested my regex in Notepad++ and it seems correct: it matches the first type of URL and not the second.

Thanks,
Nickko
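For reference, the same patterns can be checked outside Notepad++ with Python's `re` module. This is only a minimal sketch: it assumes the link extractor applies its allow and deny patterns with `re.search`, the helper `would_follow` is hypothetical (not part of Scrapy), and the example URLs fill in the "domain" placeholder with the host from the first regex.

```python
import re

# Patterns copied from the two Rule definitions in the post.
ALLOW_STRICT = r'http://trendstop\.levif\.be/fr/detail/[0-9]+/[0-9a-z\-]+\.aspx$'
ALLOW_LOOSE = r'/detail/[0-9]+/[0-9a-z\-]+\.aspx'
DENY = r'activity'

# Example URLs (assumption: "domain" replaced by the host used in the regex).
FICHE_URL = 'http://trendstop.levif.be/fr/detail/459340332/westsite.aspx'
ACTIVITY_URL = 'http://trendstop.levif.be/fr/detail/459340332/westsite/activity.aspx'


def would_follow(url, allow, deny=()):
    """Hypothetical helper: True if url matches an allow pattern and no deny pattern."""
    allowed = any(re.search(pattern, url) for pattern in allow)
    denied = any(re.search(pattern, url) for pattern in deny)
    return allowed and not denied


for url in (FICHE_URL, ACTIVITY_URL):
    print(url)
    print('  first rule :', would_follow(url, [ALLOW_STRICT]))
    print('  second rule:', would_follow(url, [ALLOW_LOOSE], [DENY]))
```

If both rules reject the activity.aspx URL in this check, the regexes themselves are probably not what lets those pages reach parse_fiche, and the cause may lie elsewhere in the spider.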
