Thanks for the reply.
I solved it using a simpler method:
rules = [
    Rule(SgmlLinkExtractor(allow=(r"resumes/\w+/page-1?\d?/",),
                           restrict_xpaths=('//a[@title="Next"]')),
         callback="parse_items", follow=True),
]
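
As a sanity check, here is a rough sketch of how one might test which page numbers an allow pattern lets through. It assumes the link extractor applies the pattern as a regex search over the URL, and it uses made-up sample URLs of the form resumes/url1/page-N/. The names POSTED, UP_TO_30 and allowed_pages are hypothetical and not from this thread. Under those assumptions, the pattern above accepts pages 1 to 19, while the alternation-based variant accepts pages 1 to 30.

```python
import re

# Pattern from the rule above.
POSTED = re.compile(r"resumes/\w+/page-1?\d?/")

# Hypothetical variant: one digit 1-9, or 10-29, or exactly 30.
UP_TO_30 = re.compile(r"resumes/\w+/page-(?:[1-9]|[12]\d|30)/")


def allowed_pages(pattern, last=100):
    """Return the page numbers whose sample URL the pattern accepts."""
    return [n for n in range(1, last + 1)
            if pattern.search("resumes/url1/page-%d/" % n)]


posted_pages = allowed_pages(POSTED)
variant_pages = allowed_pages(UP_TO_30)

print(posted_pages)   # pages accepted by the posted pattern
print(variant_pages)  # pages accepted by the hypothetical variant
```

The trailing slash in both patterns matters: without it, page-31 would still match on its first digit.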
On Saturday, 1 March 2014 08:44:06 UTC-7, Svyatoslav Sydorenko wrote:
>
> Hi Duy,
>
> You may replace \d+ with something like:
>
> In [46]: re.search(r'[^0-9]([1-2]\d|30|\d)[^0-9]','w-29s').group(1)
> Out[46]: '29'
>
> In [47]: re.search(r'[^0-9]([1-2]\d|30|\d)[^0-9]','w-31s').group(1)
> ---------------------------------------------------------------------------
> AttributeError Traceback (most recent call last)
>
> /home/wk/src/ibcrawler/<ipython console> in <module>()
>
> AttributeError: 'NoneType' object has no attribute 'group'
>
> In [48]: re.search(r'[^0-9]([1-2]\d|30|\d)[^0-9]','w-30s').group(1)
> Out[48]: '30'
>
> In [49]: re.search(r'[^0-9]([1-2]\d|30|\d)[^0-9]','w-0s').group(1)
> Out[49]: '0'
>
> In [50]: re.search(r'[^0-9]([1-2]\d|30|\d)[^0-9]','w-4s').group(1)
> Out[50]: '4'
>
> In [51]: re.search(r'[^0-9]([1-2]\d|30|\d)[^0-9]','w-1s').group(1)
> Out[51]: '1'
>
>
> On Monday, 24 February 2014 21:16:37 UTC+2, Duy Nguyen wrote:
>>
>> Hi guys,
>>
>> I have 2 start_urls, each of which has 100 pages with the pattern
>> "resumes/url1/page-\d+"
>>
>> I am only interested in the first 30 pages of each start_url. In other words,
>> I want to crawl "resumes/url1/page-\d+" where *\d+ <= 30*
>>
>> Is there an option I can specify under "rules"?
>>
>> OK to crawl: resumes/something/page-20/
>>
>> NOT OK to crawl: resumes/something/page-31/
>>
>> start_urls = [
>> "url1", "url2"
>> ]
>>
>> rules = [
>> Rule(SgmlLinkExtractor(allow=("resumes/\w+/page-\d+",),
>> restrict_xpaths=('//a[@title="Next"]')),
>> callback="parse_items", follow=True),
>> ]
>>
>> def parse_items(self, response):
>> ........
>>
>> Thanks,
>>
>>