Hi Tejas,

So I started fresh. I deleted the webpage keyspace, as I am using Cassandra as a backend, but I got the same output: after I run readdb -dump I get a bunch of URLs, just not the ones I want. I get only one fetched site and many parsed links (to be fetched in the next cycle?). Maybe it has something to do with the URLs I am trying to get? I am trying to fetch this URL and similar ones:
http://www.telelistas.net/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=1

But I have noticed that the links pointing to the next pages look like this:

<a class="resultado_roda" href="/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=2">2 </a>

So I decided to try commenting out this URL rule:

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
#-.*(/[^/]+)/[^/]+\1/[^/]+\1/

But I got the same results: a single site fetched, some URLs parsed, but not the ones I want to match with regex-urlfilter.txt. Any ideas? Thanks a ton for your help, Tejas!

Renato M.

2013/5/12 Tejas Patil <[email protected]>:
> Hi Renato,
>
> That's weird. I ran a crawl over similar URLs having a query at the end
> (http://ngs.ics.uci.edu/blog/?p=338) and it worked fine for me with 2.x.
> My guess is that something goes wrong during parsing, due to which
> outlinks are not getting into the crawldb.
>
> Start fresh. Clear everything from previous attempts (including the
> backend table named by the value of 'storage.schema.webpage').
> Run these:
> bin/nutch inject <urldir>
> bin/nutch generate -topN 50 -numFetchers 1 -noFilter -adddays 0
> bin/nutch fetch <batchID> -threads 2
> bin/nutch parse <batchID>
> bin/nutch updatedb
> bin/nutch readdb -dump <output dir>
>
> The readdb output will show whether the outlinks were extracted correctly.
>
> The commands for checking urlfilter rules accept one input URL at a time
> from the console (you need to type/paste the URL and hit enter).
> They show "+" if the URL is accepted by the current rules ("-" for
> rejection).
>
> Thanks,
> Tejas
>
> On Sun, May 12, 2013 at 10:30 AM, Renato Marroquín Mogrovejo <
> [email protected]> wrote:
>
>> And I did try the commands you told me, but I am not sure how they
>> work. They do wait for a URL to be input, but then they print the URL
>> with a '+' at the beginning. What does that mean?
>>
>> http://www.xyz.com/lanchon
>> +http://www.xyz.com/lanchon
>>
>> 2013/5/12 Renato Marroquín Mogrovejo <[email protected]>:
>> > Hi Tejas,
>> >
>> > Thanks for your help. I have tried the expression you suggested, and
>> > now my url-filter file looks like this:
>> >
>> > +http://www.xyz.com/\?page=*
>> >
>> > # skip URLs containing certain characters as probable queries, etc.
>> > #-[?*!@=]
>> > +.
>> >
>> > # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
>> > #-.*(/[^/]+)/[^/]+\1/[^/]+\1/
>> > +.
>> >
>> > # accept anything else
>> > +.
>> >
>> > So after this, I ran a generate command with -topN 5 -depth 5, and then a
>> > fetch all, but I keep getting a single page fetched. What am I
>> > doing wrong? Thanks again for your help.
>> >
>> > Renato M.
>> >
>> > 2013/5/12 Tejas Patil <[email protected]>:
>> >> FYI: You can use either of these commands to run the regex-urlfilter
>> >> rules against any given URL:
>> >>
>> >> bin/nutch plugin urlfilter-regex
>> >> org.apache.nutch.urlfilter.regex.RegexURLFilter
>> >> OR
>> >> bin/nutch org.apache.nutch.net.URLFilterChecker -filterName
>> >> org.apache.nutch.urlfilter.regex.RegexURLFilter
>> >>
>> >> Both of them accept one input URL at a time from stdin.
>> >> The latter one has a param which lets you test a given URL against
>> >> several URL filters at once. See its usage for more details.
>> >>
>> >> On Sun, May 12, 2013 at 2:02 AM, Tejas Patil <[email protected]> wrote:
>> >>
>> >>> If there is no restriction on the number at the end of the URL, you
>> >>> might just use this:
>> >>> (note that the rule must be above the one which filters URLs with a "?"
>> >>> character)
>> >>>
>> >>> +http://www.xyz.com/\?page=*
>> >>>
>> >>> # skip URLs containing certain characters as probable queries, etc.
>> >>> -[?*!@=]
>> >>>
>> >>> On Sun, May 12, 2013 at 12:40 AM, Renato Marroquín Mogrovejo <
>> >>> [email protected]> wrote:
>> >>>
>> >>>> Hi all,
>> >>>>
>> >>>> I have been trying to fetch a query similar to:
>> >>>>
>> >>>> http://www.xyz.com/?page=1
>> >>>>
>> >>>> but where the number can vary from 1 to 100. Inside the first page
>> >>>> there are links to the next ones. So I updated the
>> >>>> conf/regex-urlfilter file and added:
>> >>>>
>> >>>> ^[0-9]{1,45}$
>> >>>>
>> >>>> When I do this, the generate job fails saying "Invalid
>> >>>> first character". I have tried generating with topN 5 and depth 5,
>> >>>> trying to fetch more URLs, but that does not work.
>> >>>>
>> >>>> Could anyone advise me on how to accomplish this? I am running Nutch 2.x.
>> >>>> Thanks in advance!
>> >>>>
>> >>>> Renato M.
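A quick way to see the rule-ordering point from this thread outside of Nutch: regex-urlfilter checks rules top-down and the first matching rule decides. The sketch below is a toy, first-match-wins evaluator in Python (Nutch's RegexURLFilter itself is Java and loads the full default rule file; the two rules here are only illustrative), showing why an accept rule for the paginated URLs has to sit above the `-[?*!@=]` rule:

```python
import re

def filter_url(rules, url):
    """Toy model of Nutch's regex-urlfilter: rules are checked top-down
    and the first pattern that matches anywhere in the URL decides."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # no rule matched: the URL is rejected

# Default-style rule order: the query-character rule comes first.
default_rules = [
    ("-", r"[?*!@=]"),  # skip URLs containing probable query characters
    ("+", r"."),        # accept anything else
]
url = "http://www.telelistas.net/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=2"
print(filter_url(default_rules, url))  # False: the '?' rule rejects it first

# Putting an accept rule for the paginated URLs *above* that rule changes the outcome.
fixed_rules = [("+", r"\?pagina=[0-9]+")] + default_rules
print(filter_url(fixed_rules, url))    # True
```

This also suggests why commenting out only the repeated-segment rule changed nothing: with the `-[?*!@=]` line still active, any `?pagina=` URL is rejected before a later `+` rule is even consulted.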

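Two regex details from this thread are worth spelling out. First, a regex-urlfilter line must begin with '+' or '-'; a bare pattern like `^[0-9]{1,45}$` is the likely cause of the "Invalid first character" error. Second, in a regex a trailing `=*` means "zero or more '=' characters", not a shell-style wildcard, so `\?page=*` also matches URLs with no '=' at all; something like `\?page=[0-9]+` is closer to the intent. A small demonstration with Python's re (Nutch uses Java regex, but both engines behave the same on these points):

```python
import re

# In a regex, '=*' means "zero or more '=' signs", not a glob wildcard.
pattern = r"\?page=*"
print(bool(re.search(pattern, "http://www.xyz.com/?page=1")))    # True, as intended
print(bool(re.search(pattern, "http://www.xyz.com/?pagetypo")))  # also True: '=*' matches zero '='

# An explicit digit run expresses "page number" directly:
strict = r"\?page=[0-9]+"
print(bool(re.search(strict, "http://www.xyz.com/?page=1")))     # True
print(bool(re.search(strict, "http://www.xyz.com/?pagetypo")))   # False
```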
