Re: Fetching a specific number of urls

Tejas Patil Sun, 12 May 2013 02:17:36 -0700

FYI: You can use either of these commands to run the regex-urlfilter rules
against any given url:


bin/nutch plugin urlfilter-regex org.apache.nutch.urlfilter.regex.RegexURLFilter
OR
bin/nutch org.apache.nutch.net.URLFilterChecker -filterName org.apache.nutch.urlfilter.regex.RegexURLFilter

Both of them read input urls one at a time from stdin.
The latter one has a param which lets you test a given url against
several url filters at once. See its usage for more details.
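
To illustrate why rule order matters in the quoted suggestion below, here is a
minimal Python sketch of first-match-wins filtering. It is a simplified
simulation, not Nutch's actual code: parse_rules and accepts are hypothetical
helper names, and the trailing "+." catch-all line is assumed from the stock
regex-urlfilter file. The sketch also shows, roughly, why a line that does not
start with "+" or "-" (such as ^[0-9]{1,45}$) is rejected as having an invalid
first character.

```python
import re


def parse_rules(text):
    """Parse regex-urlfilter style lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments are ignored
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            # Simplified stand-in for the "Invalid first character" failure.
            raise ValueError("Invalid first character: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """The first rule whose pattern is found in the url decides; no match rejects."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


# Accept rule placed above the rule that rejects urls containing "?".
GOOD_ORDER = r"""
+http://www.xyz.com/\?page=
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
+.
"""

# Same rules, but the reject rule comes first and wins.
BAD_ORDER = r"""
-[?*!@=]
+http://www.xyz.com/\?page=
+.
"""

if __name__ == "__main__":
    url = "http://www.xyz.com/?page=1"
    for name, text in (("good order", GOOD_ORDER), ("bad order", BAD_ORDER)):
        verdict = "+" if accepts(parse_rules(text), url) else "-"
        print(name + ": " + verdict + url)
```

With the accept rule first the url passes; with the reject rule first it is
filtered out before the accept rule is ever consulted. For the real behaviour,
run the commands above against your own conf/regex-urlfilter file.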



On Sun, May 12, 2013 at 2:02 AM, Tejas Patil <[email protected]> wrote:

> If there is no restriction on the number at the end of the url, you might
> just use this:
> (note that the rule must be above the one which filters urls with a "?"
> character)
>
> +http://www.xyz.com/\?page=
>
> # skip URLs containing certain characters as probable queries, etc.
> -[?*!@=]
>
>
>
>
> On Sun, May 12, 2013 at 12:40 AM, Renato Marroquín Mogrovejo <
> [email protected]> wrote:
>
>> Hi all,
>>
>> I have been trying to fetch a query similar to:
>>
>> http://www.xyz.com/?page=1
>>
>> But where the number can vary from 1 to 100. Inside the first page
>> there are links to the next ones. So I updated the
>> conf/regex-urlfilter file and added:
>>
>> ^[0-9]{1,45}$
>>
>> When I do this, the generate job fails saying that it is "Invalid
>> first character". I have tried generating with topN 5 and depth 5 and
>> trying to fetch more urls but that does not work.
>>
>> Could anyone advise me on how to accomplish this? I am running Nutch 2.x.
>> Thanks in advance!
>>
>>
>> Renato M.
>>
>
>
