Re: Fetching a specific number of urls

Renato Marroquín Mogrovejo Sun, 12 May 2013 10:28:53 -0700

Hi Tejas,

Thanks for your help. I have tried the expression you suggested, and
now my url-filter file is like this:
+http://www.xyz.com/\?page=*


# skip URLs containing certain characters as probable queries, etc.
#-[?*!@=]
+.

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
#-.*(/[^/]+)/[^/]+\1/[^/]+\1/
+.

# accept anything else
+.

So after this, I run the generate command with -topN 5 -depth 5, and
then a fetch all, but I keep getting only a single page fetched. What
am I doing wrong? Thanks again for your help.
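
To sanity-check the filter file above outside of Nutch, here is a rough Python sketch of how the rules seem to be evaluated. It is not Nutch code. It assumes that the first matching rule decides (+ accepts, - rejects), that patterns are matched unanchored anywhere in the url, and that a url matching no rule is rejected. The names parse_rules, accepts and RULES_FROM_THIS_MESSAGE are made up for illustration.

```python
import re

# The active (non-comment) rules from the filter file quoted in this message.
RULES_FROM_THIS_MESSAGE = r"""
+http://www.xyz.com/\?page=*
# skip URLs containing certain characters as probable queries, etc.
#-[?*!@=]
+.
# accept anything else
+.
"""


def parse_rules(text):
    """Parse regex-urlfilter style lines into (accept, compiled_pattern) pairs."""
    rules = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no rule
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            # Assumption: this mimics the "Invalid first character" failure
            # reported earlier in the thread for a rule without + or -.
            raise ValueError("Invalid first character: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Return the verdict of the first rule whose pattern is found in the url."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False  # assumption: no matching rule means the url is dropped


rules = parse_rules(RULES_FROM_THIS_MESSAGE)
for page in (1, 2, 100):
    url = "http://www.xyz.com/?page=%d" % page
    print(url, accepts(rules, url))
```

Under those assumptions every ?page=N url passes this file, so the filter itself would not explain a single fetched page. The same sketch also rejects the url if the -[?*!@=] rule is re-enabled and placed above the +http://www.xyz.com/\?page=* line, which matches the ordering advice quoted below.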


Renato M.

2013/5/12 Tejas Patil <[email protected]>:
> FYI: You can use any one of these commands to run the regex-urlfilter rules
> against any given url:
>
> bin/nutch plugin urlfilter-regex
> org.apache.nutch.urlfilter.regex.RegexURLFilter
> OR
> bin/nutch org.apache.nutch.net.URLFilterChecker -filterName
> org.apache.nutch.urlfilter.regex.RegexURLFilter
>
> Both of them accept input urls one at a time from stdin.
> The latter one has a parameter that lets you test a given url against
> several url filters at once. See its usage for more details.
>
>
>
> On Sun, May 12, 2013 at 2:02 AM, Tejas Patil <[email protected]> wrote:
>
>> If there is no restriction on the number at the end of the url, you might
>> just use this:
>> (note that the rule must be above the one which filters urls with a "?"
>> character)
>>
>> +http://www.xyz.com/\?page=*
>>
>>
>> # skip URLs containing certain characters as probable queries, etc.
>> -[?*!@=]
>>
>>
>>
>>
>> On Sun, May 12, 2013 at 12:40 AM, Renato Marroquín Mogrovejo <
>> [email protected]> wrote:
>>
>>> Hi all,
>>>
>>> I have been trying to fetch a query similar to:
>>>
>>> http://www.xyz.com/?page=1
>>>
>>> But where the number can vary from 1 to 100. Inside the first page
>>> there are links to the next ones. So I updated the
>>> conf/regex-urlfilter file and added:
>>>
>>> ^[0-9]{1,45}$
>>>
>>> When I do this, the generate job fails with an "Invalid first
>>> character" error. I have tried generating with topN 5 and depth 5 and
>>> trying to fetch more urls, but that does not work.
>>>
>>> Could anyone advise me on how to accomplish this? I am running Nutch 2.x.
>>> Thanks in advance!
>>>
>>>
>>> Renato M.
>>>
>>
>>
