Re: Excluding subfolders from LinkExtractor rules

Jakob de Maeyer Wed, 13 Aug 2014 08:24:33 -0700

Hey Bobby,

If I understand you correctly, instead of accepting any characters after
"/foo/" (.*), you want to accept any characters except another slash
([^\/]*), with nothing after the closing slash ($):
        r'\/foo\/[^\/]*\/$'


This will match "/foo/bar/" but not "/foo/bar/baz" or "/foo/bar/baz/".
It will not match "/foo/" (neither does your original regexp).
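
If you want to check this outside of a crawl, here is a minimal sketch using
only Python's re module. It assumes the link extractor applies "allow"
patterns with search semantics (a match anywhere in the URL), which would
explain why the unanchored original also accepts the deeper paths. The
`matches` helper and the URL list are just for illustration, not part of
Scrapy.

```python
import re

# Pattern from Bobby's original rule: unanchored, ".*" can span slashes.
ORIGINAL = re.compile(r'\/foo\/.*\/')

# Suggested pattern: one slash-free segment after /foo/, then end of string.
PATTERN = re.compile(r'\/foo\/[^\/]*\/$')

URLS = [
    "http://domain.com/foo/",
    "http://domain.com/foo/bar/",
    "http://domain.com/foo/bar/baz",
    "http://domain.com/foo/bar/baz/",
]


def matches(pattern, url):
    # Hypothetical helper. Assumption: patterns are applied with
    # search semantics, so a match anywhere in the URL counts.
    return pattern.search(url) is not None


if __name__ == "__main__":
    for url in URLS:
        print(url, matches(ORIGINAL, url), matches(PATTERN, url))
```

With search semantics the original pattern is satisfied by the "/foo/bar/"
prefix of the longer URLs, while the anchored one requires the URL to end
right after the second segment.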

Cheers,
-Jakob


On 08/13/2014 04:49 PM, Bobby Kolba wrote:
> It seems that in Scrapy if I have a rule that looks for r'\/foo\/.*\/'
> it will match /foo/, /foo/bar/ *and* /foo/bar/baz. How do I need to
> modify my regex to match *exactly* the pattern
> http://domain.com/foo/bar/ and exclude anything like
> http://domain.com/foo/bar/baz, etc.? I tried playing around with
> end-of-string characters, which cause most regex testers I've found
> to stop matching the foo/bar/baz string, but my crawler keeps pulling
> them down.
