FYI: You can use either of these commands to run the regex-urlfilter rules against any given URL:

bin/nutch plugin urlfilter-regex org.apache.nutch.urlfilter.regex.RegexURLFilter
OR
bin/nutch org.apache.nutch.net.URLFilterChecker -filterName org.apache.nutch.urlfilter.regex.RegexURLFilter

Both of them accept input URLs one at a time from stdin. The latter has a parameter that lets you test a given URL against several URL filters at once. See its usage for more details.

On Sun, May 12, 2013 at 2:02 AM, Tejas Patil <[email protected]> wrote:

> If there is no restriction on the number at the end of the url, you might
> just use this:
> (note that the rule must be above the one which filters urls with a "?"
> character)
>
> +http://www.xyz.com/\?page=
>
> # skip URLs containing certain characters as probable queries, etc.
> -[?*!@=]
>
>
> On Sun, May 12, 2013 at 12:40 AM, Renato Marroquín Mogrovejo <
> [email protected]> wrote:
>
>> Hi all,
>>
>> I have been trying to fetch a query similar to:
>>
>> http://www.xyz.com/?page=1
>>
>> But where the number can vary from 1 to 100. Inside the first page
>> there are links to the next ones. So I updated the
>> conf/regex-urlfilter file and added:
>>
>> ^[0-9]{1,45}$
>>
>> When I do this, the generate job fails saying that it is "Invalid
>> first character". I have tried generating with topN 5 and depth 5 and
>> trying to fetch more urls but that does not work.
>>
>> Could anyone advise me on how to accomplish this? I am running Nutch 2.x.
>> Thanks in advance!
>>
>>
>> Renato M.
>>
>
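
The ordering note in the quoted reply can be sketched in Python. This is only a rough emulation under stated assumptions, not Nutch code: parse_rules and accepts are hypothetical helpers, Python's re module stands in for Java regexes, and the sketch assumes the filter lets the first matching rule decide and rejects a URL that matches no rule. It also shows why a line such as ^[0-9]{1,45}$ is refused: it starts with neither "+" nor "-".

```python
import re


def parse_rules(text):
    """Parse rule lines: '+' accepts, '-' rejects; '#' and blank lines are skipped."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            # Mirrors the "Invalid first character" failure from the thread.
            raise ValueError("Invalid first character: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Return the verdict of the first rule whose pattern is found in the URL."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False  # assumption: a URL matching no rule is rejected


# Accept rule placed above the rule that rejects URLs containing "?".
GOOD_ORDER = r"""
+http://www.xyz.com/\?page=
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
"""

# Same rules in the opposite order: the "-" rule matches first.
BAD_ORDER = r"""
-[?*!@=]
+http://www.xyz.com/\?page=
"""

if __name__ == "__main__":
    url = "http://www.xyz.com/?page=1"
    print(accepts(parse_rules(GOOD_ORDER), url))
    print(accepts(parse_rules(BAD_ORDER), url))
```

With the accept rule first, the paged URL passes; with the order swapped, the "?" rule rejects it before the accept rule is ever consulted.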

