Hi Tejas,

Thanks for your help. I have tried the expression you suggested, and now
my url-filter file is like this:

+http://www.xyz.com/\?page=*

# skip URLs containing certain characters as probable queries, etc.
#-[?*!@=]
+.

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
#-.*(/[^/]+)/[^/]+\1/[^/]+\1/
+.

# accept anything else
+.

So after this, I run a generate command -topN 5 -depth 5, and then a
fetch all, but I keep on getting a single page fetched. What am I doing
wrong?

Thanks again for your help.

Renato M.

2013/5/12 Tejas Patil <[email protected]>:
> FYI: You can use anyone of these commands to run the regex-urlfilter rules
> against any given url:
>
> bin/nutch plugin urlfilter-regex
> org.apache.nutch.urlfilter.regex.RegexURLFilter
> OR
> bin/nutch org.apache.nutch.net.URLFilterChecker -filterName
> org.apache.nutch.urlfilter.regex.RegexURLFilter
>
> Both of them accept input url one at a time from stdin.
> The later one has a param which can enable you to test a given url against
> several url filters at once. See its usage for more details.
>
>
>
> On Sun, May 12, 2013 at 2:02 AM, Tejas Patil <[email protected]>wrote:
>
>> If there is no restriction on the number at the end of the url, you might
>> just use this:
>> (note that the rule must be above the one which filters urls with a "?"
>> character)
>>
>> *+http://www.xyz.com/\?page=*
>> *
>> *
>> *# skip URLs containing certain characters as probable queries, etc.*
>> *-[?*!@=]*
>>
>>
>>
>>
>> On Sun, May 12, 2013 at 12:40 AM, Renato Marroquín Mogrovejo <
>> [email protected]> wrote:
>>
>>> Hi all,
>>>
>>> I have been trying to fetch a query similar to:
>>>
>>> http://www.xyz.com/?page=1
>>>
>>> But where the number can vary from 1 to 100. Inside the first page
>>> there are links to the next ones. So I updated the
>>> conf/regex-urlfilter file and added:
>>>
>>> ^[0-9]{1,45}$
>>>
>>> When I do this, the generate job fails saying that it is "Invalid
>>> first character". I have tried generating with topN 5 and depth 5 and
>>> trying to fetch more urls but that does not work.
>>>
>>> Could anyone advise me on how to accomplish this? I am running Nutch 2.x.
>>> Thanks in advance!
>>>
>>>
>>> Renato M.
>>>
>>
>>
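
For anyone following the thread, here is a rough offline sketch of how rules like the ones above could be checked against a URL. It is not Nutch code: parse_rules and accepts are hypothetical helper names, and it assumes that each rule line starts with "+" (accept) or "-" (reject), that the first rule whose regex is found anywhere in the URL decides the outcome, and that a URL matching no rule is rejected. It uses Python's re module, which may differ from Java regexes in corner cases.

```python
import re


def parse_rules(text):
    """Parse regex-urlfilter style lines into (accept, compiled_pattern) pairs.

    Hypothetical helper; assumes every rule starts with '+' or '-'.
    """
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no rule
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            # Mirrors the kind of failure reported in the thread for a rule
            # written without a leading '+' or '-'.
            raise ValueError("Invalid first character: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Return the verdict of the first rule whose pattern is found in url."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False  # assumption: no matching rule means the URL is dropped


# The rule from the thread placed above the default '?' filter.
RULES_WITH_PAGE_RULE = r"""
+http://www.xyz.com/\?page=
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# accept anything else
+.
"""

# The same file without the extra rule, for comparison.
RULES_DEFAULT = r"""
-[?*!@=]
+.
"""

if __name__ == "__main__":
    url = "http://www.xyz.com/?page=2"
    print(accepts(parse_rules(RULES_WITH_PAGE_RULE), url))
    print(accepts(parse_rules(RULES_DEFAULT), url))
```

Under these assumptions, a line such as ^[0-9]{1,45}$ with no leading "+" or "-" is rejected at parse time, and rule order matters because the first match wins. The bin/nutch commands quoted above remain the way to test against the real filter.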

