Re: Fetching a specific number of urls

Renato Marroquín Mogrovejo Sun, 12 May 2013 10:30:57 -0700

I did try the commands you suggested, but I am not sure how they
work. They wait for a URL to be typed in, and then they print the URL
with a '+' at the beginning. What does that mean?


http://www.xyz.com/lanchon
+http://www.xyz.com/lanchon
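
For reference, the checker echoes each URL with a '+' when the filter accepts it and a '-' when the filter rejects it. Below is a rough Python model of that behaviour, not Nutch's actual code. It assumes rules are tried top to bottom, that the first pattern found anywhere in the URL decides, and that a URL matching no rule is rejected. The rule list mirrors the ordering suggested later in this thread; the `check` helper and the `?q=nutch` sample URL are made up for illustration.

```python
import re

# (accept?, pattern) pairs in file order. These mirror the rules from the
# thread: the page rule first, then the "probable queries" rule, then "+.".
RULES = [
    (True, re.compile(r"http://www.xyz.com/\?page=*")),
    (False, re.compile(r"[?*!@=]")),
    (True, re.compile(r".")),
]


def check(url):
    """Return the URL prefixed with '+' if accepted, '-' if rejected."""
    for accept, pattern in RULES:
        # Unanchored search: the first rule that matches decides the outcome.
        if pattern.search(url):
            return ("+" if accept else "-") + url
    # Assumption: a URL that matches no rule at all is rejected.
    return "-" + url


for url in (
    "http://www.xyz.com/lanchon",
    "http://www.xyz.com/?page=7",
    "http://www.xyz.com/?q=nutch",  # hypothetical URL, not from the thread
):
    print(check(url))
```

In this model the first two URLs come back with a '+' and the third with a '-', because the '?' in it hits the reject rule before the catch-all `+.` is reached.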

2013/5/12 Renato Marroquín Mogrovejo <[email protected]>:
> Hi Tejas,
>
> Thanks for your help. I have tried the expression you suggested, and
> now my url-filter file is like this:
> +http://www.xyz.com/\?page=*
>
> # skip URLs containing certain characters as probable queries, etc.
> #-[?*!@=]
> +.
>
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> #-.*(/[^/]+)/[^/]+\1/[^/]+\1/
> +.
>
> # accept anything else
> +.
>
> So after this, I run the generate command with -topN 5 -depth 5, and then a
> fetch all, but I keep getting only a single page fetched. What am I
> doing wrong? Thanks again for your help.
>
>
> Renato M.
>
> 2013/5/12 Tejas Patil <[email protected]>:
>> FYI: You can use any one of these commands to run the regex-urlfilter rules
>> against any given URL:
>>
>> bin/nutch plugin urlfilter-regex org.apache.nutch.urlfilter.regex.RegexURLFilter
>> OR
>> bin/nutch org.apache.nutch.net.URLFilterChecker -filterName org.apache.nutch.urlfilter.regex.RegexURLFilter
>>
>> Both of them accept input URLs one at a time from stdin.
>> The latter has a parameter that lets you test a given URL against
>> several URL filters at once. See its usage for more details.
>>
>>
>>
>> On Sun, May 12, 2013 at 2:02 AM, Tejas Patil <[email protected]> wrote:
>>
>>> If there is no restriction on the number at the end of the url, you might
>>> just use this:
>>> (note that the rule must be above the one which filters urls with a "?"
>>> character)
>>>
>>> +http://www.xyz.com/\?page=*
>>>
>>> # skip URLs containing certain characters as probable queries, etc.
>>> -[?*!@=]
>>>
>>>
>>>
>>>
>>> On Sun, May 12, 2013 at 12:40 AM, Renato Marroquín Mogrovejo <
>>> [email protected]> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I have been trying to fetch a query similar to:
>>>>
>>>> http://www.xyz.com/?page=1
>>>>
>>>> But where the number can vary from 1 to 100. Inside the first page
>>>> there are links to the next ones. So I updated the
>>>> conf/regex-urlfilter file and added:
>>>>
>>>> ^[0-9]{1,45}$
>>>>
>>>> When I do this, the generate job fails saying that it is "Invalid
>>>> first character". I have tried generating with topN 5 and depth 5 and
>>>> trying to fetch more urls but that does not work.
>>>>
>>>> Could anyone advise me on how to accomplish this? I am running Nutch 2.x.
>>>> Thanks in advance!
>>>>
>>>>
>>>> Renato M.
>>>>
>>>
>>>
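
On the "Invalid first character" error from the original question: every non-comment line in the regex-urlfilter file has to start with '+' (accept) or '-' (reject), and the line `^[0-9]{1,45}$` starts with '^'. The sketch below is a simplified Python model of that check, not the Nutch source. The `parse_rule` helper is hypothetical, and `PAGE_RULE` is only one possible way to restrict the page number to 1..100; it does not appear anywhere in the thread.

```python
import re


def parse_rule(line):
    """Return (accept, compiled_pattern), or None for blank/comment lines."""
    if not line.strip() or line.startswith("#"):
        return None
    if line[0] not in "+-":
        # Same complaint the generate job reported in this thread.
        raise ValueError("Invalid first character: " + line)
    return line[0] == "+", re.compile(line[1:])


# Hypothetical rule (an assumption, not from the thread): accept only
# page numbers 1 through 100 on the example host.
PAGE_RULE = r"+^http://www\.xyz\.com/\?page=([1-9][0-9]?|100)$"

accept, pattern = parse_rule(PAGE_RULE)
for n in (1, 42, 100, 101):
    url = "http://www.xyz.com/?page=%d" % n
    print(url, bool(pattern.search(url)))

try:
    parse_rule("^[0-9]{1,45}$")  # the line from the original question
except ValueError as err:
    print(err)
```

A rule like this would still need to sit above any `-[?*!@=]` line, since the first matching rule decides.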
