I did try the commands you suggested, but I am not sure how they work. They wait for a URL to be typed in, and then they print that URL back with a '+' at the beginning. What does that mean?
http://www.xyz.com/lanchon
+http://www.xyz.com/lanchon

2013/5/12 Renato Marroquín Mogrovejo <[email protected]>:
> Hi Tejas,
>
> Thanks for your help. I have tried the expression you suggested, and
> now my url-filter file is like this:
> +http://www.xyz.com/\?page=*
>
> # skip URLs containing certain characters as probable queries, etc.
> #-[?*!@=]
> +.
>
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> #-.*(/[^/]+)/[^/]+\1/[^/]+\1/
> +.
>
> # accept anything else
> +.
>
> So after this, I run a generate command -topN 5 -depth 5, and then a
> fetch all, but I keep on getting a single page fetched. What am I
> doing wrong? Thanks again for your help.
>
>
> Renato M.
>
> 2013/5/12 Tejas Patil <[email protected]>:
>> FYI: You can use anyone of these commands to run the regex-urlfilter rules
>> against any given url:
>>
>> bin/nutch plugin urlfilter-regex
>> org.apache.nutch.urlfilter.regex.RegexURLFilter
>> OR
>> bin/nutch org.apache.nutch.net.URLFilterChecker -filterName
>> org.apache.nutch.urlfilter.regex.RegexURLFilter
>>
>> Both of them accept input url one at a time from stdin.
>> The later one has a param which can enable you to test a given url against
>> several url filters at once. See its usage for more details.
>>
>>
>>
>> On Sun, May 12, 2013 at 2:02 AM, Tejas Patil <[email protected]> wrote:
>>
>>> If there is no restriction on the number at the end of the url, you might
>>> just use this:
>>> (note that the rule must be above the one which filters urls with a "?"
>>> character)
>>>
>>> *+http://www.xyz.com/\?page=*
>>> *
>>> *
>>> *# skip URLs containing certain characters as probable queries, etc.*
>>> *-[?*!@=]*
>>>
>>>
>>>
>>>
>>> On Sun, May 12, 2013 at 12:40 AM, Renato Marroquín Mogrovejo <
>>> [email protected]> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I have been trying to fetch a query similar to:
>>>>
>>>> http://www.xyz.com/?page=1
>>>>
>>>> But where the number can vary from 1 to 100. Inside the first page
>>>> there are links to the next ones. So I updated the
>>>> conf/regex-urlfilter file and added:
>>>>
>>>> ^[0-9]{1,45}$
>>>>
>>>> When I do this, the generate job fails saying that it is "Invalid
>>>> first character". I have tried generating with topN 5 and depth 5 and
>>>> trying to fetch more urls but that does not work.
>>>>
>>>> Could anyone advise me on how to accomplish this? I am running Nutch 2.x.
>>>> Thanks in advance!
>>>>
>>>>
>>>> Renato M.
>>>>
>>>
>>>
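
For reference, here is a rough Python sketch of how I understand the rules quoted in this thread to be evaluated. It is only an approximation for illustration, not Nutch's actual code. The function names parse_rules and check are made up. I am assuming that every rule line must begin with '+' (accept) or '-' (reject), that the first rule whose pattern is found anywhere in the URL decides the result, and that a URL matching no rule is rejected. Under those assumptions, a '+' in front of the echoed URL would mean the filter accepted it.

```python
import re

def parse_rules(text):
    """Parse regex-urlfilter style lines into (accept, compiled_regex) pairs.

    Assumption: blank lines and '#' comments are skipped; any other line
    must start with '+' or '-'.
    """
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            # Mirrors the "Invalid first character" error mentioned in the thread.
            raise ValueError("Invalid first character: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def check(url, rules):
    """Return the URL prefixed with '+' (accepted) or '-' (rejected)."""
    for accept, regex in rules:
        if regex.search(url):  # first matching rule wins
            return ("+" if accept else "-") + url
    return "-" + url  # assumption: no matching rule means rejected

# The rule order suggested earlier in the thread: the page rule sits
# above the rule that rejects URLs containing query characters.
RULES = r"""
+http://www.xyz.com/\?page=
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# accept anything else
+.
"""

if __name__ == "__main__":
    rules = parse_rules(RULES)
    for url in ("http://www.xyz.com/lanchon",
                "http://www.xyz.com/?page=7",
                "http://www.xyz.com/?q=nutch"):
        print(check(url, rules))
```

With this rule order the first two URLs come back with a '+' and the third with a '-'. A line such as ^[0-9]{1,45}$ has no leading '+' or '-', so parse_rules raises the "Invalid first character" error instead of treating it as a pattern.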

