Well, I have managed to get the same results as you have (I think). My
crawldb now contains links with the following structure:

+http://www.xyz.com/\?page=*

But there are also many other links. How can I get only the links in
the above format, i.e. ignore all the others and keep just the ones
with that structure? I have also noticed something interesting: if I
use

./bin/nutch generate -topN 10 -numFetchers 1 -depth 10 -noFilter -adddays 0

I only get the same seed url but no others. Is this caused by the
depth parameter?

Thanks again!

Renato M.

2013/5/16 Renato Marroquín Mogrovejo <[email protected]>:
> Hi Tejas,
>
> Thank you very much for your help again.
> Unfortunately I am still not able to get the next link into my
> crawldb. I suspect my conf/regex-urlfilter.txt file is not properly
> set up. I am sending the content of this file; could you help me
> determine what is wrong with it?
> Thanks a ton in advance!
>
>
> Renato M.
>
>
> # skip file: ftp: and mailto: urls
> -^(file|ftp|mailto):
>
> # skip image and other suffixes we can't yet parse
> # for a more extensive coverage use the urlfilter-suffix plugin
> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
>
> #+http://www.xyz.com/\?page=*
> +http://www.telelistas.net/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=*
>
> # skip URLs containing certain characters as probable queries, etc.
> #-[?*!@=]
> +.
>
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> #-.*(/[^/]+)/[^/]+\1/[^/]+\1/
> +.
>
> # accept anything else
> +.
>
> 2013/5/13 Tejas Patil <[email protected]>:
>> Hi Renato,
>>
>> The default content limit for the http protocol is 65536 bytes,
>> while the webpage is much bigger than that. The relevant config
>> needs to be updated.
>> Add this to conf/nutch-site.xml:
>>
>> <property>
>>   <name>http.content.limit</name>
>>   <value>240000</value>
>>   <description>The length limit for downloaded content using the http
>>   protocol, in bytes. If this value is nonnegative (>=0), content longer
>>   than it will be truncated; otherwise, no truncation at all. Do not
>>   confuse this setting with the file.content.limit setting.
>>   </description>
>> </property>
>>
>> I got a connection timed out error after this config change (which
>> makes sense, as there is more content to download).
>> So I added this to conf/nutch-site.xml:
>>
>> <property>
>>   <name>http.timeout</name>
>>   <value>1000000</value>
>>   <description>The default network timeout, in milliseconds.</description>
>> </property>
>>
>> After running a fresh crawl, I could see the link to the next page
>> in the crawldb:
>>
>> http://www.telelistas.net/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=2
>> key: net.telelistas.www:http/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=2
>> baseUrl: null
>> status: 1 (status_unfetched)
>> fetchTime: 1368424541731
>> prevFetchTime: 0
>> fetchInterval: 2592000
>> retriesSinceFetch: 0
>> modifiedTime: 0
>> prevModifiedTime: 0
>> protocolStatus: (null)
>> parseStatus: (null)
>> title: null
>> score: 0.0042918455
>> markers: {dist=1}
>> reprUrl: null
>> metadata _csh_ : ;���
>>
>> HTH
>>
>>
>> On Sun, May 12, 2013 at 10:21 PM, Renato Marroquín Mogrovejo <
>> [email protected]> wrote:
>>
>>> Hi Tejas,
>>>
>>> So I started fresh. I deleted the webpage keyspace, as I am using
>>> Cassandra as a backend, but I got the same output: I get a bunch of
>>> urls after I do a readdb -dump, but not the ones I want. I get only
>>> one fetched site and many links parsed (to be parsed in the next
>>> cycle?). Maybe it has something to do with the urls I am trying to
>>> get?
>>> I am trying to get this url and similar ones:
>>>
>>> http://www.telelistas.net/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=1
>>>
>>> But I have noticed that the links pointing to the next pages look
>>> like this:
>>>
>>> <a class="resultado_roda"
>>> href="/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=2">2 </a>
>>>
>>> So I decided to try commenting out this url rule:
>>>
>>> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
>>> #-.*(/[^/]+)/[^/]+\1/[^/]+\1/
>>>
>>> But I got the same results: a single site fetched, some urls
>>> parsed, but not the ones I want using the regex-urlfilter.txt.
>>> Any ideas? Thanks a ton for your help Tejas!
>>>
>>>
>>> Renato M.
>>>
>>>
>>> 2013/5/12 Tejas Patil <[email protected]>:
>>> > Hi Renato,
>>> >
>>> > That's weird. I ran a crawl over similar urls having a query at
>>> > the end (http://ngs.ics.uci.edu/blog/?p=338) and it worked fine
>>> > for me with 2.x. My guess is that there is something wrong while
>>> > parsing, due to which outlinks are not getting into the crawldb.
>>> >
>>> > Start fresh. Clear everything from previous attempts (including
>>> > the backend table named as the value of 'storage.schema.webpage').
>>> > Run these:
>>> >
>>> > bin/nutch inject <urldir>
>>> > bin/nutch generate -topN 50 -numFetchers 1 -noFilter -adddays 0
>>> > bin/nutch fetch <batchID> -threads 2
>>> > bin/nutch parse <batchID>
>>> > bin/nutch updatedb
>>> > bin/nutch readdb -dump <output dir>
>>> >
>>> > The readdb output will show whether the outlinks were extracted
>>> > correctly.
>>> >
>>> > The commands for checking urlfilter rules accept one input url at
>>> > a time from the console (you need to type/paste the url and hit
>>> > enter). They show "+" if the url is accepted by the current rules
>>> > ("-" for rejection).
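[Editor's note] The inject/generate/fetch/parse/updatedb cycle listed
above can be sketched as a small shell loop. This is only a sketch:
the `urls/` seed directory and `crawl_dump` output directory are
placeholder names, and the way the batch id is scraped from the
generate output is an assumption that may need adjusting for a given
Nutch 2.x build.

```shell
#!/bin/sh
# One inject, then a few generate/fetch/parse/updatedb rounds.
bin/nutch inject urls/

for round in 1 2 3; do
    # In Nutch 2.x, generate logs the batch id it created;
    # grab the last field of that log line (assumption: the line
    # contains the words "batch id").
    BATCH_ID=$(bin/nutch generate -topN 50 -numFetchers 1 -noFilter -adddays 0 \
               | grep "batch id" | awk '{print $NF}')
    bin/nutch fetch "$BATCH_ID" -threads 2
    bin/nutch parse "$BATCH_ID"
    bin/nutch updatedb
done

# Dump the crawldb to inspect whether outlinks were extracted.
bin/nutch readdb -dump crawl_dump
```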
>>> >
>>> > Thanks,
>>> > Tejas
>>> >
>>> > On Sun, May 12, 2013 at 10:30 AM, Renato Marroquín Mogrovejo <
>>> > [email protected]> wrote:
>>> >
>>> >> And I did try the commands you told me, but I am not sure how
>>> >> they work. They do wait for an url to be input, but then they
>>> >> print the url with a '+' at the beginning. What does that mean?
>>> >>
>>> >> http://www.xyz.com/lanchon
>>> >> +http://www.xyz.com/lanchon
>>> >>
>>> >> 2013/5/12 Renato Marroquín Mogrovejo <[email protected]>:
>>> >> > Hi Tejas,
>>> >> >
>>> >> > Thanks for your help. I have tried the expression you
>>> >> > suggested, and now my url-filter file looks like this:
>>> >> >
>>> >> > +http://www.xyz.com/\?page=*
>>> >> >
>>> >> > # skip URLs containing certain characters as probable queries, etc.
>>> >> > #-[?*!@=]
>>> >> > +.
>>> >> >
>>> >> > # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
>>> >> > #-.*(/[^/]+)/[^/]+\1/[^/]+\1/
>>> >> > +.
>>> >> >
>>> >> > # accept anything else
>>> >> > +.
>>> >> >
>>> >> > So after this, I run a generate command with -topN 5 -depth 5,
>>> >> > and then a fetch all, but I keep getting a single page fetched.
>>> >> > What am I doing wrong? Thanks again for your help.
>>> >> >
>>> >> >
>>> >> > Renato M.
>>> >> >
>>> >> > 2013/5/12 Tejas Patil <[email protected]>:
>>> >> >> FYI: You can use either of these commands to run the
>>> >> >> regex-urlfilter rules against any given url:
>>> >> >>
>>> >> >> bin/nutch plugin urlfilter-regex
>>> >> >> org.apache.nutch.urlfilter.regex.RegexURLFilter
>>> >> >> OR
>>> >> >> bin/nutch org.apache.nutch.net.URLFilterChecker -filterName
>>> >> >> org.apache.nutch.urlfilter.regex.RegexURLFilter
>>> >> >>
>>> >> >> Both of them accept input urls one at a time from stdin.
>>> >> >> The latter one has a param which enables you to test a given
>>> >> >> url against several url filters at once. See its usage for
>>> >> >> more details.
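[Editor's note] Since both checkers above read urls from stdin, it
should be possible to pipe a whole file of test urls through them
instead of typing one at a time. A sketch (`test-urls.txt` is a
hypothetical file with one url per line):

```shell
# Each input url should come back prefixed with '+' (accepted by the
# current regex-urlfilter rules) or '-' (rejected).
bin/nutch plugin urlfilter-regex \
    org.apache.nutch.urlfilter.regex.RegexURLFilter < test-urls.txt
```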
>>> >> >>
>>> >> >>
>>> >> >>
>>> >> >> On Sun, May 12, 2013 at 2:02 AM, Tejas Patil <
>>> [email protected]> wrote:
>>> >> >>
>>> >> >>> If there is no restriction on the number at the end of the
>>> >> >>> url, you might just use this (note that the rule must be
>>> >> >>> above the one which filters urls with a "?" character):
>>> >> >>>
>>> >> >>> +http://www.xyz.com/\?page=*
>>> >> >>>
>>> >> >>> # skip URLs containing certain characters as probable queries, etc.
>>> >> >>> -[?*!@=]
>>> >> >>>
>>> >> >>>
>>> >> >>> On Sun, May 12, 2013 at 12:40 AM, Renato Marroquín Mogrovejo <
>>> >> >>> [email protected]> wrote:
>>> >> >>>
>>> >> >>>> Hi all,
>>> >> >>>>
>>> >> >>>> I have been trying to fetch a query similar to:
>>> >> >>>>
>>> >> >>>> http://www.xyz.com/?page=1
>>> >> >>>>
>>> >> >>>> where the number can vary from 1 to 100. Inside the first
>>> >> >>>> page there are links to the next ones. So I updated the
>>> >> >>>> conf/regex-urlfilter file and added:
>>> >> >>>>
>>> >> >>>> ^[0-9]{1,45}$
>>> >> >>>>
>>> >> >>>> When I do this, the generate job fails saying "Invalid
>>> >> >>>> first character". I have tried generating with topN 5 and
>>> >> >>>> depth 5 and trying to fetch more urls, but that does not
>>> >> >>>> work.
>>> >> >>>>
>>> >> >>>> Could anyone advise me on how to accomplish this? I am
>>> >> >>>> running Nutch 2.x. Thanks in advance!
>>> >> >>>>
>>> >> >>>>
>>> >> >>>> Renato M.
>>> >> >>>
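[Editor's note] The behavior this thread keeps running into is that
the regex url filter applies its rules top to bottom and the first
matching rule wins ('+' accepts, '-' rejects). That can be simulated
in a few lines; this is a Python sketch, not Nutch code, and the rule
list and urls are made up for illustration. It also shows why the
leftover `+.` rules in the filter file above accept every url, and why
replacing them with a final reject-all rule restricts the crawl to the
`?page=` urls only:

```python
import re

# Toy model of Nutch's urlfilter-regex plugin: rules are tried top to
# bottom and the FIRST match decides. Nutch uses Java regexes; Python's
# re module is close enough for this sketch.
RULES = [
    ("-", r"^(file|ftp|mailto):"),
    ("+", r"http://www\.xyz\.com/\?page=.*"),  # accept rule BEFORE the query-char filter
    ("-", r"[?*!@=]"),                         # skip probable queries
    ("-", r".*"),                              # reject everything else
]

def filter_url(url):
    """Return True if the url is accepted, False if rejected."""
    for sign, pattern in RULES:
        if re.search(pattern, url):
            return sign == "+"
    return False  # no rule matched at all

print(filter_url("http://www.xyz.com/?page=2"))  # True  (accept rule fires first)
print(filter_url("http://www.xyz.com/other"))    # False (falls through to reject-all)
```

With a trailing `("+", r".*")` instead of the reject-all rule, every
url that survives the earlier filters is accepted, which matches the
"many other links" Renato sees in his crawldb.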

