Hi Renato,

The default content limit for the http protocol is 65536 bytes, while the
webpage is much bigger than that, so the relevant config needs to be
updated. Add this to conf/nutch-site.xml:
<property>
  <name>http.content.limit</name>
  <value>240000</value>
  <description>The length limit for downloaded content using the http
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the file.content.limit setting.
  </description>
</property>

After this config change I got a connection timed out error (which makes
sense, as there is now more content to download). So I also added this to
conf/nutch-site.xml:

<property>
  <name>http.timeout</name>
  <value>1000000</value>
  <description>The default network timeout, in milliseconds.</description>
</property>

After running a fresh crawl, I could see the link to the next page in the
crawldb:

http://www.telelistas.net/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=2
key: net.telelistas.www:http/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=2
baseUrl: null
status: 1 (status_unfetched)
fetchTime: 1368424541731
prevFetchTime: 0
fetchInterval: 2592000
retriesSinceFetch: 0
modifiedTime: 0
prevModifiedTime: 0
protocolStatus: (null)
parseStatus: (null)
title: null
score: 0.0042918455
markers: {dist=1}
reprUrl: null
metadata _csh_ : ;���

HTH

On Sun, May 12, 2013 at 10:21 PM, Renato Marroquín Mogrovejo
<[email protected]> wrote:
> Hi Tejas,
>
> So I started fresh. I deleted the webpage keyspace, as I am using
> Cassandra as a backend, but I got the same output. I mean, I get a
> bunch of urls after I do a readdb -dump, but not the ones I want. I get
> only one fetched site and many links parsed (to be parsed in the next
> cycle?). Maybe it has something to do with the urls I am trying to
> get?
> I am trying to get this url and similar ones:
>
> http://www.telelistas.net/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=1
>
> But I have noticed that the links pointing to the next pages are
> something like this:
>
> <a class="resultado_roda"
> href="/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=2">2 </a>
>
> So I decided to try commenting out this url rule:
>
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> #-.*(/[^/]+)/[^/]+\1/[^/]+\1/
>
> But I got the same results: a single site fetched, some urls parsed,
> but not the ones I want using the regex-urlfilter.txt. Any ideas?
> Thanks a ton for your help Tejas!
>
>
> Renato M.
>
>
> 2013/5/12 Tejas Patil <[email protected]>:
> > Hi Renato,
> >
> > That's weird. I ran a crawl over similar urls having a query at the end
> > (http://ngs.ics.uci.edu/blog/?p=338) and it worked fine for me with 2.x.
> > My guess is that there is something wrong while parsing, due to which
> > outlinks are not getting into the crawldb.
> >
> > Start fresh. Clear everything from previous attempts (including the
> > backend table named as the value of 'storage.schema.webpage').
> > Run these:
> >
> > bin/nutch inject <urldir>
> > bin/nutch generate -topN 50 -numFetchers 1 -noFilter -adddays 0
> > bin/nutch fetch <batchID> -threads 2
> > bin/nutch parse <batchID>
> > bin/nutch updatedb
> > bin/nutch readdb -dump <output dir>
> >
> > The readdb output will show whether the outlinks were extracted correctly.
> >
> > The commands for checking urlfilter rules accept one input url at a time
> > from the console (you need to type/paste the url and hit enter).
> > It shows "+" if the url is accepted by the current rules ("-" for
> > rejection).
> >
> > Thanks,
> > Tejas
> >
> > On Sun, May 12, 2013 at 10:30 AM, Renato Marroquín Mogrovejo
> > <[email protected]> wrote:
> >
> >> And I did try the commands you told me but I am not sure how they
> >> work.
> >> They do wait for an url to be input, but then it prints the url
> >> with a '+' at the beginning; what does that mean?
> >>
> >> http://www.xyz.com/lanchon
> >> +http://www.xyz.com/lanchon
> >>
> >> 2013/5/12 Renato Marroquín Mogrovejo <[email protected]>:
> >> > Hi Tejas,
> >> >
> >> > Thanks for your help. I have tried the expression you suggested, and
> >> > now my url-filter file is like this:
> >> >
> >> > +http://www.xyz.com/\?page=*
> >> >
> >> > # skip URLs containing certain characters as probable queries, etc.
> >> > #-[?*!@=]
> >> > +.
> >> >
> >> > # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> >> > #-.*(/[^/]+)/[^/]+\1/[^/]+\1/
> >> > +.
> >> >
> >> > # accept anything else
> >> > +.
> >> >
> >> > So after this, I ran a generate command with -topN 5 -depth 5, and
> >> > then a fetch all, but I keep on getting a single page fetched. What
> >> > am I doing wrong? Thanks again for your help.
> >> >
> >> > Renato M.
> >> >
> >> > 2013/5/12 Tejas Patil <[email protected]>:
> >> >> FYI: You can use either of these commands to run the regex-urlfilter
> >> >> rules against any given url:
> >> >>
> >> >> bin/nutch plugin urlfilter-regex
> >> >> org.apache.nutch.urlfilter.regex.RegexURLFilter
> >> >> OR
> >> >> bin/nutch org.apache.nutch.net.URLFilterChecker -filterName
> >> >> org.apache.nutch.urlfilter.regex.RegexURLFilter
> >> >>
> >> >> Both of them accept one input url at a time from stdin.
> >> >> The latter one has a param which lets you test a given url against
> >> >> several url filters at once. See its usage for more details.
> >> >>
> >> >> On Sun, May 12, 2013 at 2:02 AM, Tejas Patil
> >> >> <[email protected]> wrote:
> >> >>
> >> >>> If there is no restriction on the number at the end of the url, you
> >> >>> might just use this (note that the rule must be above the one which
> >> >>> filters urls with a "?" character):
> >> >>>
> >> >>> +http://www.xyz.com/\?page=*
> >> >>>
> >> >>> # skip URLs containing certain characters as probable queries, etc.
> >> >>> -[?*!@=]
> >> >>>
> >> >>> On Sun, May 12, 2013 at 12:40 AM, Renato Marroquín Mogrovejo
> >> >>> <[email protected]> wrote:
> >> >>>
> >> >>>> Hi all,
> >> >>>>
> >> >>>> I have been trying to fetch a query similar to:
> >> >>>>
> >> >>>> http://www.xyz.com/?page=1
> >> >>>>
> >> >>>> But where the number can vary from 1 to 100. Inside the first page
> >> >>>> there are links to the next ones. So I updated the
> >> >>>> conf/regex-urlfilter file and added:
> >> >>>>
> >> >>>> ^[0-9]{1,45}$
> >> >>>>
> >> >>>> When I do this, the generate job fails saying that it is "Invalid
> >> >>>> first character". I have tried generating with topN 5 and depth 5
> >> >>>> and trying to fetch more urls but that does not work.
> >> >>>>
> >> >>>> Could anyone advise me on how to accomplish this? I am running
> >> >>>> Nutch 2.x. Thanks in advance!
> >> >>>>
> >> >>>> Renato M.
> >> >>>
> >> >>>
> >> >
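A note on how the rule file discussed above is applied: regex-urlfilter.txt is walked top-down and the first rule whose pattern matches somewhere in the url decides the outcome ('+' accept, '-' reject), which is why the +http://www.xyz.com/\?page= rule has to sit above -[?*!@=]. A rough sketch of that first-match-wins behaviour, using a hypothetical check_url helper (grep -E only approximates the Java regexes Nutch's RegexURLFilter really uses):

```shell
# Sketch of regex-urlfilter.txt evaluation: try rules top-down,
# first pattern found in the url decides ('+' keep, '-' drop).
check_url() {
  url=$1
  # rule 1: keep the paginated urls (must come before the '?' filter)
  if printf '%s' "$url" | grep -Eq 'http://www\.xyz\.com/\?page='; then
    echo "+$url"; return
  fi
  # rule 2: skip urls containing probable query characters
  if printf '%s' "$url" | grep -Eq '[?*!@=]'; then
    echo "-$url"; return
  fi
  # rule 3: accept anything else
  echo "+$url"
}

check_url 'http://www.xyz.com/?page=2'    # accepted by rule 1
check_url 'http://www.xyz.com/foo?bar=1'  # rejected by rule 2
check_url 'http://www.xyz.com/plain'      # accepted by rule 3
```

The '+'/'-' prefixes it prints mirror what the URLFilterChecker commands above show on the console.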


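One more note on the http.content.limit change at the top of this thread: before raising the limit it helps to confirm the page really is larger than the 65536-byte default, because a truncated fetch silently drops outlinks near the end of the page (such as a "next page" link). A small sketch with a hypothetical over_limit helper (the curl line in the comment is only indicative; the helper just does the comparison):

```shell
# Decide whether a page of a given size would be truncated by Nutch's
# http.content.limit (default 65536 bytes). Measure the real size with e.g.
#   curl -s -o /dev/null -w '%{size_download}' '<url>'
over_limit() {
  size=$1
  limit=${2:-65536}   # default http.content.limit
  if [ "$size" -gt "$limit" ]; then
    echo "truncated: raise http.content.limit above $size"
  else
    echo "fits within the limit"
  fi
}

over_limit 240000   # a page bigger than the default limit
over_limit 40000    # a page that fits
```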