On Thu, May 16, 2013 at 11:53 AM, Renato Marroquín Mogrovejo <[email protected]> wrote:
> Well I have managed to get the same results as you have (I think). Now
> on my crawldb there are the links with the following structure:
>
> +http://www.xyz.com/\?page=*
>
> But there are also many other links. How can I get only the links in
> the above format? I mean ignoring all the others and only getting the
> ones with the same structure.
>

If you *just* want urls of the form
http://www.telelistas.net/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=1
then add an accept rule for them and reject the rest, using this:

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

+http://www.telelistas.net/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=*

# reject all other urls
-.
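You can sanity-check these rules from stdin before re-crawling. A quick
sketch (the output lines are what I would expect given the rules above,
not copied from an actual run):

$ echo "http://www.telelistas.net/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=2" \
    | bin/nutch plugin urlfilter-regex org.apache.nutch.urlfilter.regex.RegexURLFilter
+http://www.telelistas.net/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=2

$ echo "http://www.telelistas.net/rj/rio+de+janeiro/" \
    | bin/nutch plugin urlfilter-regex org.apache.nutch.urlfilter.regex.RegexURLFilter
-http://www.telelistas.net/rj/rio+de+janeiro/

A leading "+" means the url survives the rules; a leading "-" means the
final catch-all rejected it.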
> I have also noticed something interesting: if I use
>
> ./bin/nutch generate -topN 10 -numFetchers 1 -depth 10 -noFilter -adddays 0
>
> I only get the same seed url but no others. Is this caused by the
> depth parameter?
>

Weird. Depth has nothing to do with this. The topN parameter could be set
to a bigger value to see if this still happens. I vaguely remember (2-3
years back) that there was a jira about this, and it was closed as won't
fix because people won't use low topN values in a typical prod setup.

> Thanks again!
>
>
> Renato M.
>
>
> 2013/5/16 Renato Marroquín Mogrovejo <[email protected]>:
> > Hi Tejas,
> >
> > Thank you very much for your help again.
> > But I'm sorry to inform you that I am still not able to get the next link
> > into my crawldb. I am thinking that my conf/regex-urlfilter.txt file
> > is not properly set up. I am sending the content of this file; could
> > you help me determine what is wrong with it?
> > Thanks a ton in advance!
> >
> >
> > Renato M.
> >
> >
> > # skip file: ftp: and mailto: urls
> > -^(file|ftp|mailto):
> >
> > # skip image and other suffixes we can't yet parse
> > # for a more extensive coverage use the urlfilter-suffix plugin
> > -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
> >
> > #+http://www.xyz.com/\?page=*
> > +http://www.telelistas.net/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=*
> >
> > # skip URLs containing certain characters as probable queries, etc.
> > #-[?*!@=]
> > +.
> >
> > # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> > #-.*(/[^/]+)/[^/]+\1/[^/]+\1/
> > +.
> >
> > # accept anything else
> > +.
> >
> > 2013/5/13 Tejas Patil <[email protected]>:
> >> Hi Renato,
> >>
> >> The default content limit for the http protocol is 65536, while the
> >> webpage is much bigger than that. The relevant config needs to be
> >> updated. Add this to conf/nutch-site.xml:
> >>
> >> <property>
> >>   <name>http.content.limit</name>
> >>   <value>240000</value>
> >>   <description>The length limit for downloaded content using the http
> >>   protocol, in bytes. If this value is nonnegative (>=0), content longer
> >>   than it will be truncated; otherwise, no truncation at all. Do not
> >>   confuse this setting with the file.content.limit setting.
> >>   </description>
> >> </property>
> >>
> >> I got a connection timed out error after the config change above (it
> >> makes sense, as there is more content to download). So I added this to
> >> conf/nutch-site.xml:
> >>
> >> <property>
> >>   <name>http.timeout</name>
> >>   <value>1000000</value>
> >>   <description>The default network timeout, in milliseconds.</description>
> >> </property>
> >>
> >> After running a fresh crawl, I could see the link to the next page in
> >> the crawldb:
> >>
> >> http://www.telelistas.net/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=2
> >> key: net.telelistas.www:http/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=2
> >> baseUrl: null
> >> status: 1 (status_unfetched)
> >> fetchTime: 1368424541731
> >> prevFetchTime: 0
> >> fetchInterval: 2592000
> >> retriesSinceFetch: 0
> >> modifiedTime: 0
> >> prevModifiedTime: 0
> >> protocolStatus: (null)
> >> parseStatus: (null)
> >> title: null
> >> score: 0.0042918455
> >> markers: {dist=1}
> >> reprUrl: null
> >> metadata _csh_ : (unprintable bytes)
> >>
> >> HTH
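A side note on picking http.content.limit: it is worth measuring how big
the page actually is first. A rough check, assuming curl is available on
your machine:

$ curl -s 'http://www.telelistas.net/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=1' | wc -c

If that byte count is above http.content.limit, the tail of the page
(which is often where pagination links sit) gets truncated before
parsing, so the ?pagina= outlinks never reach the crawldb.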
> >> On Sun, May 12, 2013 at 10:21 PM, Renato Marroquín Mogrovejo <
> >> [email protected]> wrote:
> >>
> >>> Hi Tejas,
> >>>
> >>> So I started fresh. I deleted the webpage keyspace, as I am using
> >>> Cassandra as a backend, but I got the same output. I mean, I get a
> >>> bunch of urls after I do a readdb -dump, but not the ones I want. I
> >>> get only one fetched site and many links parsed (to be parsed in the
> >>> next cycle?). Maybe it has something to do with the urls I am trying
> >>> to get? I am trying to get this url and similar ones:
> >>>
> >>> http://www.telelistas.net/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=1
> >>>
> >>> But I have noticed that the links pointing to the next pages are
> >>> something like this:
> >>>
> >>> <a class="resultado_roda"
> >>> href="/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=2">2 </a>
> >>>
> >>> So I decided to try commenting out this url rule:
> >>> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> >>> #-.*(/[^/]+)/[^/]+\1/[^/]+\1/
> >>>
> >>> But I got the same results: a single site fetched, some urls parsed,
> >>> but not the ones I want using the regex-urlfilter.txt. Any ideas?
> >>> Thanks a ton for your help Tejas!
> >>>
> >>>
> >>> Renato M.
> >>>
> >>>
> >>> 2013/5/12 Tejas Patil <[email protected]>:
> >>> > Hi Renato,
> >>> >
> >>> > That's weird. I ran a crawl over similar urls having a query at the
> >>> > end (http://ngs.ics.uci.edu/blog/?p=338) and it worked fine for me
> >>> > with 2.x. My guess is that there is something wrong while parsing,
> >>> > due to which outlinks are not getting into the crawldb.
> >>> >
> >>> > Start fresh. Clear everything from previous attempts (including the
> >>> > backend table named by the value of 'storage.schema.webpage').
> >>> > Run these:
> >>> > bin/nutch inject <urldir>
> >>> > bin/nutch generate -topN 50 -numFetchers 1 -noFilter -adddays 0
> >>> > bin/nutch fetch <batchID> -threads 2
> >>> > bin/nutch parse <batchID>
> >>> > bin/nutch updatedb
> >>> > bin/nutch readdb -dump <output dir>
> >>> >
> >>> > The readdb output will show whether the outlinks were extracted
> >>> > correctly.
> >>> >
> >>> > The commands for checking urlfilter rules accept one input url at a
> >>> > time from the console (you need to type/paste the url and hit enter).
> >>> > It shows "+" if the url is accepted by the current rules ("-" for
> >>> > rejection).
> >>> >
> >>> > Thanks,
> >>> > Tejas
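Once the filters behave, that cycle is easy to script. A sketch, with the
caveat that I am assuming a stock 2.x build where fetch and parse accept
-all in place of an explicit batch id (check "bin/nutch fetch" usage on
your build first):

#!/bin/sh
# one Nutch 2.x crawl cycle; 'urls/' is assumed to hold your seed list
bin/nutch inject urls/
bin/nutch generate -topN 50 -numFetchers 1 -noFilter -adddays 0
bin/nutch fetch -all -threads 2     # -all: fetch every generated batch
bin/nutch parse -all                # parse what was just fetched
bin/nutch updatedb                  # fold outlinks back into the db
bin/nutch readdb -dump dump_out     # inspect dump_out for the outlinks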
> >>> > On Sun, May 12, 2013 at 10:30 AM, Renato Marroquín Mogrovejo <
> >>> > [email protected]> wrote:
> >>> >
> >>> >> And I did try the commands you told me, but I am not sure how they
> >>> >> work. They do wait for a url to be input, but then it prints the
> >>> >> url with a '+' at the beginning; what does that mean?
> >>> >>
> >>> >> http://www.xyz.com/lanchon
> >>> >> +http://www.xyz.com/lanchon
> >>> >>
> >>> >> 2013/5/12 Renato Marroquín Mogrovejo <[email protected]>:
> >>> >> > Hi Tejas,
> >>> >> >
> >>> >> > Thanks for your help. I have tried the expression you suggested,
> >>> >> > and now my url-filter file is like this:
> >>> >> > +http://www.xyz.com/\?page=*
> >>> >> >
> >>> >> > # skip URLs containing certain characters as probable queries, etc.
> >>> >> > #-[?*!@=]
> >>> >> > +.
> >>> >> >
> >>> >> > # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> >>> >> > #-.*(/[^/]+)/[^/]+\1/[^/]+\1/
> >>> >> > +.
> >>> >> >
> >>> >> > # accept anything else
> >>> >> > +.
> >>> >> >
> >>> >> > So after this, I run a generate command with -topN 5 -depth 5, and
> >>> >> > then a fetch all, but I keep on getting a single page fetched.
> >>> >> > What am I doing wrong? Thanks again for your help.
> >>> >> >
> >>> >> >
> >>> >> > Renato M.
> >>> >> >
> >>> >> > 2013/5/12 Tejas Patil <[email protected]>:
> >>> >> >> FYI: You can use either of these commands to run the
> >>> >> >> regex-urlfilter rules against any given url:
> >>> >> >>
> >>> >> >> bin/nutch plugin urlfilter-regex
> >>> >> >> org.apache.nutch.urlfilter.regex.RegexURLFilter
> >>> >> >> OR
> >>> >> >> bin/nutch org.apache.nutch.net.URLFilterChecker -filterName
> >>> >> >> org.apache.nutch.urlfilter.regex.RegexURLFilter
> >>> >> >>
> >>> >> >> Both of them accept input urls one at a time from stdin.
> >>> >> >> The latter one has a param that lets you test a given url against
> >>> >> >> several url filters at once. See its usage for more details.
> >>> >> >>
> >>> >> >> On Sun, May 12, 2013 at 2:02 AM, Tejas Patil <[email protected]> wrote:
> >>> >> >>
> >>> >> >>> If there is no restriction on the number at the end of the url,
> >>> >> >>> you might just use this:
> >>> >> >>> (note that the rule must be above the one which filters urls
> >>> >> >>> with a "?" character)
> >>> >> >>>
> >>> >> >>> +http://www.xyz.com/\?page=*
> >>> >> >>>
> >>> >> >>> # skip URLs containing certain characters as probable queries, etc.
> >>> >> >>> -[?*!@=]
> >>> >> >>>
> >>> >> >>> On Sun, May 12, 2013 at 12:40 AM, Renato Marroquín Mogrovejo <
> >>> >> >>> [email protected]> wrote:
> >>> >> >>>
> >>> >> >>>> Hi all,
> >>> >> >>>>
> >>> >> >>>> I have been trying to fetch urls like:
> >>> >> >>>>
> >>> >> >>>> http://www.xyz.com/?page=1
> >>> >> >>>>
> >>> >> >>>> where the number can vary from 1 to 100. Inside the first page
> >>> >> >>>> there are links to the next ones. So I updated the
> >>> >> >>>> conf/regex-urlfilter file and added:
> >>> >> >>>>
> >>> >> >>>> ^[0-9]{1,45}$
> >>> >> >>>>
> >>> >> >>>> When I do this, the generate job fails, saying "Invalid first
> >>> >> >>>> character". I have tried generating with topN 5 and depth 5 and
> >>> >> >>>> trying to fetch more urls, but that does not work.
> >>> >> >>>>
> >>> >> >>>> Could anyone advise me on how to accomplish this? I am running
> >>> >> >>>> Nutch 2.x. Thanks in advance!
> >>> >> >>>>
> >>> >> >>>>
> >>> >> >>>> Renato M.
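P.S. On the "Invalid first character" error at the bottom of the thread:
every non-comment, non-blank line in regex-urlfilter.txt must start with
'+' (accept) or '-' (reject), so a bare pattern like ^[0-9]{1,45}$ makes
the filter refuse to load the file. For pages 1 to 100, something along
these lines is what the parser expects (untested, and www.xyz.com is just
the placeholder from up-thread):

+^http://www\.xyz\.com/\?page=([1-9][0-9]?|100)$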

