Re: Nutch Incremental Crawl
Hi David,

Maybe what you want is an adaptive re-fetch algorithm, see [0].

[0] http://pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/

On Wed, Feb 27, 2013 at 1:20 PM, David Philip wrote:

> Hi all,
>
> Thank you very much for the replies. Very useful information to
> understand how incremental crawling can be achieved.
>
> Dear Markus:
> Can you please tell me how I can override this fetch interval, in case I
> need to fetch the page before the interval has passed?
>
> Thanks very much
> - David
>
> On Thu, Feb 14, 2013 at 2:57 PM, Markus Jelsma wrote:
>
> > If you want records to be fetched at a fixed interval it is easier to
> > inject them with a fixed fetch interval.
> >
> > nutch.fixedFetchInterval=86400
> >
> > -----Original message-----
> > > From: kemical
> > > Sent: Thu 14-Feb-2013 10:15
> > > To: user@nutch.apache.org
> > > Subject: Re: Nutch Incremental Crawl
> > >
> > > Hi David,
> > >
> > > You can also consider setting a shorter fetch interval with nutch
> > > inject. This way you set a higher score (so the url is always taken
> > > with priority when you generate a segment) and a fetch interval of
> > > one day.
> > >
> > > If you have a case similar to mine, you'll often want some homepages
> > > fetched each day but not their inlinks. What you can do is inject all
> > > your seed urls again (assuming those urls are only homepages).
> > > # Change the Nutch option so existing urls can be injected again, in
> > > # conf/nutch-default.xml or conf/nutch-site.xml:
> > > db.injector.update=true
> > >
> > > # Add metadata to update the score / fetch interval. The following
> > > # line appends the new score and new interval to each line of your
> > > # seed url files:
> > > perl -pi -e 's/^(.*)\n$/\1\tnutch.score=100\tnutch.fetchInterval=86400\n/' [your_seed_url_dir]/*
> > >
> > > # Run the inject command:
> > > bin/nutch inject crawl/crawldb [your_seed_url_dir]
> > >
> > > Now the following crawls will take your urls with top priority and
> > > crawl them once a day. I've used my situation to illustrate the
> > > concept, but I guess you can tweak the params to fit your needs.
> > >
> > > This approach is useful when you want a regular fetch of some urls;
> > > if that is needed only rarely, I guess freegen is the right choice.
> > >
> > > Best,
> > > Mike
> > >
> > > --
> > > View this message in context:
> > > http://lucene.472066.n3.nabble.com/Nutch-Incremental-Crawl-tp4037903p4040400.html
> > > Sent from the Nutch - User mailing list archive at Nabble.com.

--
Don't Grow Old, Grow Up... :-)
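[Editor's sketch] The adaptive re-fetch algorithm described in [0] is enabled through configuration. A rough sketch of the relevant nutch-site.xml entries — the property names follow Nutch's nutch-default.xml, but the values below are illustrative assumptions, so check the defaults shipped with your version:

```xml
<!-- conf/nutch-site.xml — sketch, not a drop-in config -->
<property>
  <name>db.fetch.schedule.class</name>
  <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
</property>
<property>
  <!-- shrink the interval when a page is found modified -->
  <name>db.fetch.schedule.adaptive.dec_rate</name>
  <value>0.2</value>
</property>
<property>
  <!-- grow the interval when a page is found unmodified -->
  <name>db.fetch.schedule.adaptive.inc_rate</name>
  <value>0.4</value>
</property>
<property>
  <!-- never re-fetch more often than once a day (illustrative) -->
  <name>db.fetch.schedule.adaptive.min_interval</name>
  <value>86400</value>
</property>
```

With this schedule, pages that change often are re-fetched more frequently and stable pages less frequently, which answers David's question differently from a fixed interval: the interval adapts per page instead of being overridden by hand.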
Re: Nutch Incremental Crawl
Hi all,

Thank you very much for the replies. Very useful information to understand
how incremental crawling can be achieved.

Dear Markus:
Can you please tell me how I can override this fetch interval, in case I
need to fetch the page before the interval has passed?

Thanks very much
- David

On Thu, Feb 14, 2013 at 2:57 PM, Markus Jelsma wrote:

> If you want records to be fetched at a fixed interval it is easier to
> inject them with a fixed fetch interval.
>
> nutch.fixedFetchInterval=86400
>
> -----Original message-----
> > From: kemical
> > Sent: Thu 14-Feb-2013 10:15
> > To: user@nutch.apache.org
> > Subject: Re: Nutch Incremental Crawl
> >
> > Hi David,
> >
> > You can also consider setting a shorter fetch interval with nutch
> > inject. This way you set a higher score (so the url is always taken
> > with priority when you generate a segment) and a fetch interval of
> > one day.
> >
> > If you have a case similar to mine, you'll often want some homepages
> > fetched each day but not their inlinks. What you can do is inject all
> > your seed urls again (assuming those urls are only homepages).
> >
> > # Change the Nutch option so existing urls can be injected again, in
> > # conf/nutch-default.xml or conf/nutch-site.xml:
> > db.injector.update=true
> >
> > # Add metadata to update the score / fetch interval. The following
> > # line appends the new score and new interval to each line of your
> > # seed url files:
> > perl -pi -e 's/^(.*)\n$/\1\tnutch.score=100\tnutch.fetchInterval=86400\n/' [your_seed_url_dir]/*
> >
> > # Run the inject command:
> > bin/nutch inject crawl/crawldb [your_seed_url_dir]
> >
> > Now the following crawls will take your urls with top priority and
> > crawl them once a day. I've used my situation to illustrate the
> > concept, but I guess you can tweak the params to fit your needs.
> >
> > This approach is useful when you want a regular fetch of some urls;
> > if that is needed only rarely, I guess freegen is the right choice.
> >
> > Best,
> > Mike
> >
> > --
> > View this message in context:
> > http://lucene.472066.n3.nabble.com/Nutch-Incremental-Crawl-tp4037903p4040400.html
> > Sent from the Nutch - User mailing list archive at Nabble.com.
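[Editor's sketch] The perl one-liner in the thread annotates every seed line with inject metadata. An equivalent sketch in Python — the score of 100 and the one-day interval of 86400 seconds follow the thread, and `nutch.score` / `nutch.fetchInterval` are the metadata keys kemical's command uses; adjust values to taste:

```python
import fileinput

def annotate_seeds(paths, score=100, interval=86400):
    """Append Nutch inject metadata (tab-separated key=value pairs) to each
    seed URL in-place, mirroring the perl one-liner from the thread."""
    with fileinput.input(paths, inplace=True) as lines:
        for line in lines:
            url = line.strip()
            if url:  # drop blank lines while we are at it
                print(f"{url}\tnutch.score={score}\tnutch.fetchInterval={interval}")
```

Afterwards, re-inject as in the thread with `bin/nutch inject crawl/crawldb [your_seed_url_dir]` (with `db.injector.update=true` set so existing urls are updated rather than skipped).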
Re: migrating from 1.x to 2.x
Hi kaveh,

The size of the crawl database is not an issue for migration between Nutch
versions; it is the compatibility question that you need to be concerned
about. There are no tools currently available in Nutch (as far as I know)
to read URLs from HDFS and import/inject your crawl data into your HBase
cluster. This is mostly due to the direction in which Nutch is moving,
which is to do just crawling, at scale, quickly. We don't have an immediate
necessity or passion to maintain legacy tools within the codebase and have
been trying to reduce this aspect of it. This however doesn't help, as
there was never a tool for this specific purpose anyway (as far as I know).

It is, however, becoming something I am getting interested in (the notion
of obtaining lots of data from various data stores and bootstrapping Nutch
with it). I would really like to read the data with Gora and map it
somewhere. I am interested in the Nutch injecting code and would be
interested to extend it/write new code to solve this issue.

On Tue, Feb 26, 2013 at 5:03 PM, kaveh minooie wrote:

> me again,
>
> is there any way that I can import my existing crawldb from a nutch 1.4,
> which has about 2.5 B (with a B) links in it and currently resides in a
> hdfs file system, into the webpages table in hbase?
>
> and what happened to linkdb in nutch 2.x?
>
> thanks,

--
*Lewis*
migrating from 1.x to 2.x
me again,

is there any way that I can import my existing crawldb from a nutch 1.4,
which has about 2.5 B (with a B) links in it and currently resides in a
hdfs file system, into the webpages table in hbase?

and what happened to linkdb in nutch 2.x?

thanks,
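[Editor's sketch] Since no migration tool exists, one pragmatic workaround (not an official Nutch tool) is to dump the 1.x CrawlDb as text with `bin/nutch readdb crawldb -dump out`, extract the URLs, and re-inject them into the 2.x store. A sketch of the text-parsing half — it assumes the default 1.x dump layout where each record starts with a `<url>\tVersion: ...` header line, so verify against your own dump first:

```python
import re

# Each CrawlDb record in a `readdb -dump` text file is assumed to begin
# with "<url>\tVersion: N"; status/fetch-time lines follow and are skipped.
RECORD_HEADER = re.compile(r"^(\S+)\tVersion: \d+")

def extract_urls(dump_lines):
    """Yield the URL of every CrawlDb record found in the dump lines."""
    for line in dump_lines:
        m = RECORD_HEADER.match(line)
        if m:
            yield m.group(1)
```

Write the yielded URLs to a seed directory and run `bin/nutch inject` from the 2.x install. Note that scores and fetch intervals from 1.x are lost unless you also parse them and attach them as inject metadata, and at 2.5 B links you would want to run the extraction as a MapReduce job rather than a single-machine script.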
Re: Eclipse Error
We compile and test Apache Nutch on Solaris with the Java 7 (latest) JDK
and all is good. I run Apache Nutch on some CI ubuntu servers. I do not run
it on Windows. This may be a problem, or it could be your development
environment, or it could be something else. I have not engaged in this
conversation before, so I can only assume it is something to do with your
local environment.

On Tue, Feb 26, 2013 at 11:05 AM, Danilo Fernandes <
dan...@kelsorfernandes.com.br> wrote:

> Thanks for the reply Tejas, but I tried running it with ant many times and
> it made no difference.
>
> About turning on the verbose level for ivy, I'm a novice with Eclipse and
> plugins. Can you help me do that?
>
> On Tue, 26 Feb 2013 10:46:43 -0800, Tejas Patil wrote:
>
> > Hi Lewis,
> >
> > The OP is not able to build nutch in Eclipse. So far people have been
> > suspecting this part of the log:
> >
> > C:\Users\Danilo\workspace\Nutch\build.xml:96:
> > java.lang.UnsupportedClassVersionError: com/sun/tools/javac/Main :
> > Unsupported major.minor version 51.0
> >
> > It turns out that the java version is fine (v 1.6). I am not sure, but
> > this problem might be related to ivy, as per this error:
> >
> > [*ivy:resolve*] unknown resolver main
> > [*ivy:resolve*] :: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
> >
> > I had faced this error before, but it was sporadic and used to go away
> > after invoking ant again. The OP faces it consistently.
> >
> > @Danilo: Maybe turning on the verbose level for ivy might shed some
> > light.
> >
> > Thanks,
> > Tejas Patil
> >
> > On Tue, Feb 26, 2013 at 10:32 AM, Lewis John Mcgibbney <
> > lewis.mcgibb...@gmail.com> wrote:
> >
> >> What is the problem? There is a community here that can help... if we
> >> know what is wrong!
> >>
> >> On Tue, Feb 26, 2013 at 7:44 AM, Danilo Fernandes <
> >> dan...@kelsorfernandes.com.br> wrote:
> >>
> >>> I tried both and neither one works! :(
> >>>
> >>> ----- Original message -----
> >>> From: kiran chitturi [mailto:chitturikira...@gmail.com]
> >>> Sent: Tuesday, February 26, 2013 12:32
> >>> To: user@nutch.apache.org
> >>> Cc: ferna...@gmail.com
> >>> Subject: Re: Eclipse Error
> >>>
> >>> Let's keep the discussion in the User mailing list.
> >>>
> >>> I would suggest you follow the instructions here to set up Nutch in
> >>> Eclipse [1].
> >>>
> >>> JDK 1.6+ or 1.7+ will be good enough. I would also suggest keeping
> >>> your JRE compatible with the JDK.
> >>>
> >>> [1] - http://wiki.apache.org/nutch/RunNutchInEclipse
> >>>
> >>> On Tue, Feb 26, 2013 at 10:05 AM, Danilo Fernandes wrote:
> >>>
> >>> Kiran, Do you think I need a JDK 7?
> >>>
> >>> From: kiran chitturi [mailto:chitturikira...@gmail.com]
> >>> Sent: Tuesday, February 26, 2013 11:57
> >>> To: d...@nutch.apache.org
> >>> Subject: Re: Eclipse Error
> >>>
> >>> I think Nutch requires at least Java 1.6.
> >>>
> >>> On Tue, Feb 26, 2013 at 5:33 AM, Danilo Fernandes wrote:
> >>>
> >>> What version of JDK fits with Nutch trunk? Anybody know?
> >>>
> >>> 2013/2/25 Danilo Fernandes
> >>>
> >>> Feng Lu, thanks for the fast reply. But I'm using the JavaSE-1.6
> >>> (jre6) and always get this error.
> >>>
> >>> From: feng lu [mailto:amuseme...@gmail.com]
> >>> Sent: Monday, February 25, 2013 22:35
> >>> To: d...@nutch.apache.org
> >>> Subject: Re: Eclipse Error
> >>>
> >>> Hi Danilo
> >>>
> >>> "Unsupported major.minor version 51.0" means that you compiled your
> >>> classes under a specific JDK, but then try to run them under an older
> >>> version of the JDK. So, you can't run classes compiled with JDK 6.0
> >>> under JDK 5.0. The same with classes compiled under JDK 7.0 when you
> >>> try to run them under JDK 6.0.
> >>>
> >>> On Tue, Feb 26, 2013 at 9:12 AM, Danilo Fernandes wrote:
> >>>
> >>> Hi, I want to do some changes in Nutch to get a HTML page and take
> >>> some data from it. My problem starts when I'm compiling the code in
> >>> Eclipse. I alw
> >>> d not load definitions from resource org/sonar/ant/antlib.xml. It
> >>> could not be found.
> >>>
> >>> *ivy-probe-antlib*:
> >>> *ivy-download*:
> >>>
> >>> [*taskdef*] Could not load definitions from resource
> >>> org/sonar/ant/antlib.xml. It could not be found.
> >>>
> >>> *ivy-download-unchecked*:
> >>> *ivy-init-antlib*:
> >>> *ivy-init*:
> >>> *init*:
> >>>
> >>> [*mkdir*] Created dir: C:\Users\Danilo\workspace\Nutch\build
> >>> [*mkdir*] Created dir: C:\Users\Danilo\workspace\Nutch\build\classes
> >>> [*mkdir*] Created dir: C:\Users\Danilo\workspace\Nutch\build\release
> >>> [*mkdir*] Created dir: C:\Users\Danilo\workspace\Nutch\build\test
> >>> [*mkdir*] Created dir: C:\Users\Danilo\workspace\Nutch\build\test\classes
> >>> [*copy*] Copying 8 files to C:\Users\Danilo\workspace\Nutch\conf
> >>> [*copy*] Copying C:\Users\Danilo\workspace\Nutch\conf\automaton-urlfilter.txt.template
> >>> to C:\Users\Danilo\workspace\Nutch\conf\automaton-urlfilter.txt
> >>> [*copy*] Copying C:\Users\Danilo\workspace\Nutch\conf\httpclient-auth.xml.template
> >>> to C:\Users\Danilo\workspace\Nutch\conf\httpclient-auth.xml
> >>> [*copy*] Copying C:\Us
Re: Eclipse Error
Thanks for the reply Tejas, but I tried running it with ant many times and
it made no difference.

About turning on the verbose level for ivy, I'm a novice with Eclipse and
plugins. Can you help me do that?

On Tue, 26 Feb 2013 10:46:43 -0800, Tejas Patil wrote:

> Hi Lewis,
>
> The OP is not able to build nutch in Eclipse. So far people have been
> suspecting this part of the log:
>
> C:\Users\Danilo\workspace\Nutch\build.xml:96:
> java.lang.UnsupportedClassVersionError: com/sun/tools/javac/Main :
> Unsupported major.minor version 51.0
>
> It turns out that the java version is fine (v 1.6). I am not sure, but
> this problem might be related to ivy, as per this error:
>
> [*ivy:resolve*] unknown resolver main
> [*ivy:resolve*] :: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
>
> I had faced this error before, but it was sporadic and used to go away
> after invoking ant again. The OP faces it consistently.
>
> @Danilo: Maybe turning on the verbose level for ivy might shed some light.
>
> Thanks,
> Tejas Patil
>
> On Tue, Feb 26, 2013 at 10:32 AM, Lewis John Mcgibbney <
> lewis.mcgibb...@gmail.com> wrote:
>
>> What is the problem? There is a community here that can help... if we
>> know what is wrong!
>>
>> On Tue, Feb 26, 2013 at 7:44 AM, Danilo Fernandes <
>> dan...@kelsorfernandes.com.br> wrote:
>>
>>> I tried both and neither one works! :(
>>>
>>> ----- Original message -----
>>> From: kiran chitturi [mailto:chitturikira...@gmail.com]
>>> Sent: Tuesday, February 26, 2013 12:32
>>> To: user@nutch.apache.org
>>> Cc: ferna...@gmail.com
>>> Subject: Re: Eclipse Error
>>>
>>> Let's keep the discussion in the User mailing list.
>>>
>>> I would suggest you follow the instructions here to set up Nutch in
>>> Eclipse [1].
>>>
>>> JDK 1.6+ or 1.7+ will be good enough. I would also suggest keeping your
>>> JRE compatible with the JDK.
>>>
>>> [1] - http://wiki.apache.org/nutch/RunNutchInEclipse
>>>
>>> On Tue, Feb 26, 2013 at 10:05 AM, Danilo Fernandes wrote:
>>>
>>> Kiran, Do you think I need a JDK 7?
>>>
>>> From: kiran chitturi [mailto:chitturikira...@gmail.com]
>>> Sent: Tuesday, February 26, 2013 11:57
>>> To: d...@nutch.apache.org
>>> Subject: Re: Eclipse Error
>>>
>>> I think Nutch requires at least Java 1.6.
>>>
>>> On Tue, Feb 26, 2013 at 5:33 AM, Danilo Fernandes wrote:
>>>
>>> What version of JDK fits with Nutch trunk? Anybody know?
>>>
>>> 2013/2/25 Danilo Fernandes
>>>
>>> Feng Lu, thanks for the fast reply. But I'm using the JavaSE-1.6 (jre6)
>>> and always get this error.
>>>
>>> From: feng lu [mailto:amuseme...@gmail.com]
>>> Sent: Monday, February 25, 2013 22:35
>>> To: d...@nutch.apache.org
>>> Subject: Re: Eclipse Error
>>>
>>> Hi Danilo
>>>
>>> "Unsupported major.minor version 51.0" means that you compiled your
>>> classes under a specific JDK, but then try to run them under an older
>>> version of the JDK. So, you can't run classes compiled with JDK 6.0
>>> under JDK 5.0. The same with classes compiled under JDK 7.0 when you
>>> try to run them under JDK 6.0.
>>>
>>> On Tue, Feb 26, 2013 at 9:12 AM, Danilo Fernandes wrote:
>>>
>>> Hi, I want to do some changes in Nutch to get a HTML page and take some
>>> data from it. My problem starts when I'm compiling the code in Eclipse.
>>> I alw
>>> d not load definitions from resource org/sonar/ant/antlib.xml. It could
>>> not be found.
>>>
>>> *ivy-probe-antlib*:
>>> *ivy-download*:
>>>
>>> [*taskdef*] Could not load definitions from resource
>>> org/sonar/ant/antlib.xml. It could not be found.
>>>
>>> *ivy-download-unchecked*:
>>> *ivy-init-antlib*:
>>> *ivy-init*:
>>> *init*:
>>>
>>> [*mkdir*] Created dir: C:\Users\Danilo\workspace\Nutch\build
>>> [*mkdir*] Created dir: C:\Users\Danilo\workspace\Nutch\build\classes
>>> [*mkdir*] Created dir: C:\Users\Danilo\workspace\Nutch\build\release
>>> [*mkdir*] Created dir: C:\Users\Danilo\workspace\Nutch\build\test
>>> [*mkdir*] Created dir: C:\Users\Danilo\workspace\Nutch\build\test\classes
>>> [*copy*] Copying 8 files to C:\Users\Danilo\workspace\Nutch\conf
>>> [*copy*] Copying C:\Users\Danilo\workspace\Nutch\conf\automaton-urlfilter.txt.template
>>> to C:\Users\Danilo\workspace\Nutch\conf\automaton-urlfilter.txt
>>> [*copy*] Copying C:\Users\Danilo\workspace\Nutch\conf\httpclient-auth.xml.template
>>> to C:\Users\Danilo\workspace\Nutch\conf\httpclient-auth.xml
>>> [*copy*] Copying C:\Users\Danilo\workspace\Nutch\conf\nutch-site.xml.template
>>> to C:\Users\Danilo\workspace\Nutch\conf\nutch-site.xml
>>> [*copy*] Copying C:\Users\Danilo\workspace\Nutch\conf\prefix-urlfilter.txt.template
>>> to C:\Users\Danilo\workspace\Nutch\conf\prefix-urlfilter.txt
>>> [*copy*] Copying C:\Users\Danilo\workspace\Nutch\conf\regex-normalize.xml.template
>>> to C:\Users\Danilo\workspace\Nutch\conf\regex-normalize.xml
>>> [*copy*] Copying C:\Users\Danilo\workspace\Nutch\conf\regex-urlfilter.txt.template
>>> to C:\Users\Danilo\workspace\Nutch\conf\regex-urlfilter.txt
>>> [*copy*] Copying C:\Users\Danilo\workspace\Nutch\conf\subcollections.xml.template
>>> to C:\Users\Danilo\workspace\Nutch\conf\subcollections.xml
>>> [*copy*] Copying C:\Users\Danilo\wor
Re: Nutch 2.1 - Image / Video Search
Hey Joaquin,

That seems to be an interesting tool to me. Have you integrated it with
nutch? Just curious to know things :)

Thanks,
Tejas Patil

On Mon, Feb 25, 2013 at 1:38 PM, J. Delgado wrote:

> If you're interested in pure image search you may want to use Nutch for
> crawling but something like imgseek (http://www.imgseek.net/isk-daemon)
> for indexing and search.
>
> -J
>
> On Monday, February 25, 2013, Jorge Luis Betancourt Gonzalez wrote:
>
> > Hi:
> >
> > Like Raja said, it's possible; the thing is that out of the box, nutch
> > is only able to index the metadata of the file. You can always write
> > some plugins to implement any logic you desire.
> >
> > ----- Original message -----
> > From: "Raja Kulasekaran"
> > To: user@nutch.apache.org
> > Sent: Sunday, February 24, 2013 13:31:28
> > Subject: Nutch 2.1 - Image / Video Search
> >
> > Hi,
> >
> > Is it possible to crawl images as well as videos with the latest Nutch
> > version? I am using Nutch 1.6. I would like to know whether I can go
> > ahead and use Nutch 1.6, or please suggest the appropriate version.
> >
> > Raja

--
Sent from Gmail Mobile
Re: Eclipse Error
Hi Lewis,

The OP is not able to build nutch in Eclipse. So far people have been
suspecting this part of the log:

C:\Users\Danilo\workspace\Nutch\build.xml:96:
java.lang.UnsupportedClassVersionError: com/sun/tools/javac/Main :
Unsupported major.minor version 51.0

It turns out that the java version is fine (v 1.6). I am not sure, but this
problem might be related to ivy, as per this error:

[*ivy:resolve*] unknown resolver main
[*ivy:resolve*] :: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS

I had faced this error before, but it was sporadic and used to go away
after invoking ant again. The OP faces it consistently.

@Danilo: Maybe turning on the verbose level for ivy might shed some light.

Thanks,
Tejas Patil

On Tue, Feb 26, 2013 at 10:32 AM, Lewis John Mcgibbney <
lewis.mcgibb...@gmail.com> wrote:

> What is the problem? There is a community here that can help... if we know
> what is wrong!
>
> On Tue, Feb 26, 2013 at 7:44 AM, Danilo Fernandes <
> dan...@kelsorfernandes.com.br> wrote:
>
> > I tried both and neither one works! :(
> >
> > ----- Original message -----
> > From: kiran chitturi [mailto:chitturikira...@gmail.com]
> > Sent: Tuesday, February 26, 2013 12:32
> > To: user@nutch.apache.org
> > Cc: ferna...@gmail.com
> > Subject: Re: Eclipse Error
> >
> > Let's keep the discussion in the User mailing list.
> >
> > I would suggest you follow the instructions here to set up Nutch in
> > Eclipse [1].
> >
> > JDK 1.6+ or 1.7+ will be good enough. I would also suggest keeping your
> > JRE compatible with the JDK.
> >
> > [1] - http://wiki.apache.org/nutch/RunNutchInEclipse
> >
> > On Tue, Feb 26, 2013 at 10:05 AM, Danilo Fernandes wrote:
> >
> > > Kiran,
> > >
> > > Do you think I need a JDK 7?
> > >
> > > From: kiran chitturi [mailto:chitturikira...@gmail.com]
> > > Sent: Tuesday, February 26, 2013 11:57
> > > To: d...@nutch.apache.org
> > > Subject: Re: Eclipse Error
> > >
> > > I think Nutch requires at least Java 1.6.
> > >
> > > On Tue, Feb 26, 2013 at 5:33 AM, Danilo Fernandes wrote:
> > >
> > > What version of JDK fits with Nutch trunk? Anybody know?
> > >
> > > 2013/2/25 Danilo Fernandes
> > >
> > > Feng Lu, thanks for the fast reply. But I'm using the JavaSE-1.6
> > > (jre6) and always get this error.
> > >
> > > From: feng lu [mailto:amuseme...@gmail.com]
> > > Sent: Monday, February 25, 2013 22:35
> > > To: d...@nutch.apache.org
> > > Subject: Re: Eclipse Error
> > >
> > > Hi Danilo
> > >
> > > "Unsupported major.minor version 51.0" means that you compiled your
> > > classes under a specific JDK, but then try to run them under an older
> > > version of the JDK. So, you can't run classes compiled with JDK 6.0
> > > under JDK 5.0. The same with classes compiled under JDK 7.0 when you
> > > try to run them under JDK 6.0.
> > >
> > > On Tue, Feb 26, 2013 at 9:12 AM, Danilo Fernandes wrote:
> > >
> > > Hi, I want to do some changes in Nutch to get a HTML page and take
> > > some data from it. My problem starts when I'm compiling the code in
> > > Eclipse. I always receive the following error message:
> > >
> > > Buildfile: C:\Users\Danilo\workspace\Nutch\build.xml
> > >
> > > [*taskdef*] Could not load definitions from resource
> > > org/sonar/ant/antlib.xml. It could not be found.
> > >
> > > *ivy-probe-antlib*:
> > > *ivy-download*:
> > >
> > > [*taskdef*] Could not load definitions from resource
> > > org/sonar/ant/antlib.xml. It could not be found.
> > >
> > > *ivy-download-unchecked*:
> > > *ivy-init-antlib*:
> > > *ivy-init*:
> > > *init*:
> > >
> > > [*mkdir*] Created dir: C:\Users\Danilo\workspace\Nutch\build
> > > [*mkdir*] Created dir: C:\Users\Danilo\workspace\Nutch\build\classes
> > > [*mkdir*] Created dir: C:\Users\Danilo\workspace\Nutch\build\release
> > > [*mkdir*] Created dir: C:\Users\Danilo\workspace\Nutch\build\test
> > > [*mkdir*] Created dir: C:\Users\Danilo\workspace\Nutch\build\test\classes
> > > [*copy*] Copying 8 files to C:\Users\Danilo\workspace\Nutch\conf
> > > [*copy*] Copying C:\Users\Danilo\workspace\Nutch\conf\automaton-urlfilter.txt.template
> > > to C:\Users\Danilo\workspace\Nutch\conf\automaton-urlfilter.txt
> > > [*copy*] Copying C:\Users\Danilo\workspace\Nutch\conf\httpclient-auth.xml.template
> > > to C:\Users\Danilo\workspace\Nutch\conf\httpclient-auth.xml
> > > [*copy*] Copying C:\Users\Danilo\workspace\Nutch\conf\nutch-
Re: Eclipse Error
What is the problem? There is a community here that can help... if we know
what is wrong!

On Tue, Feb 26, 2013 at 7:44 AM, Danilo Fernandes <
dan...@kelsorfernandes.com.br> wrote:

> I tried both and neither one works! :(
>
> ----- Original message -----
> From: kiran chitturi [mailto:chitturikira...@gmail.com]
> Sent: Tuesday, February 26, 2013 12:32
> To: user@nutch.apache.org
> Cc: ferna...@gmail.com
> Subject: Re: Eclipse Error
>
> Let's keep the discussion in the User mailing list.
>
> I would suggest you follow the instructions here to set up Nutch in
> Eclipse [1].
>
> JDK 1.6+ or 1.7+ will be good enough. I would also suggest keeping your
> JRE compatible with the JDK.
>
> [1] - http://wiki.apache.org/nutch/RunNutchInEclipse
>
> On Tue, Feb 26, 2013 at 10:05 AM, Danilo Fernandes wrote:
>
> > Kiran,
> >
> > Do you think I need a JDK 7?
> >
> > From: kiran chitturi [mailto:chitturikira...@gmail.com]
> > Sent: Tuesday, February 26, 2013 11:57
> > To: d...@nutch.apache.org
> > Subject: Re: Eclipse Error
> >
> > I think Nutch requires at least Java 1.6.
> >
> > On Tue, Feb 26, 2013 at 5:33 AM, Danilo Fernandes wrote:
> >
> > What version of JDK fits with Nutch trunk? Anybody know?
> >
> > 2013/2/25 Danilo Fernandes
> >
> > Feng Lu, thanks for the fast reply. But I'm using the JavaSE-1.6 (jre6)
> > and always get this error.
> >
> > From: feng lu [mailto:amuseme...@gmail.com]
> > Sent: Monday, February 25, 2013 22:35
> > To: d...@nutch.apache.org
> > Subject: Re: Eclipse Error
> >
> > Hi Danilo
> >
> > "Unsupported major.minor version 51.0" means that you compiled your
> > classes under a specific JDK, but then try to run them under an older
> > version of the JDK. So, you can't run classes compiled with JDK 6.0
> > under JDK 5.0. The same with classes compiled under JDK 7.0 when you
> > try to run them under JDK 6.0.
> >
> > On Tue, Feb 26, 2013 at 9:12 AM, Danilo Fernandes wrote:
> >
> > Hi, I want to do some changes in Nutch to get a HTML page and take some
> > data from it. My problem starts when I'm compiling the code in Eclipse.
> > I always receive the following error message:
> >
> > Buildfile: C:\Users\Danilo\workspace\Nutch\build.xml
> >
> > [*taskdef*] Could not load definitions from resource
> > org/sonar/ant/antlib.xml. It could not be found.
> >
> > *ivy-probe-antlib*:
> > *ivy-download*:
> >
> > [*taskdef*] Could not load definitions from resource
> > org/sonar/ant/antlib.xml. It could not be found.
> >
> > *ivy-download-unchecked*:
> > *ivy-init-antlib*:
> > *ivy-init*:
> > *init*:
> >
> > [*mkdir*] Created dir: C:\Users\Danilo\workspace\Nutch\build
> > [*mkdir*] Created dir: C:\Users\Danilo\workspace\Nutch\build\classes
> > [*mkdir*] Created dir: C:\Users\Danilo\workspace\Nutch\build\release
> > [*mkdir*] Created dir: C:\Users\Danilo\workspace\Nutch\build\test
> > [*mkdir*] Created dir: C:\Users\Danilo\workspace\Nutch\build\test\classes
> > [*copy*] Copying 8 files to C:\Users\Danilo\workspace\Nutch\conf
> > [*copy*] Copying C:\Users\Danilo\workspace\Nutch\conf\automaton-urlfilter.txt.template
> > to C:\Users\Danilo\workspace\Nutch\conf\automaton-urlfilter.txt
> > [*copy*] Copying C:\Users\Danilo\workspace\Nutch\conf\httpclient-auth.xml.template
> > to C:\Users\Danilo\workspace\Nutch\conf\httpclient-auth.xml
> > [*copy*] Copying C:\Users\Danilo\workspace\Nutch\conf\nutch-site.xml.template
> > to C:\Users\Danilo\workspace\Nutch\conf\nutch-site.xml
> > [*copy*] Copying C:\Users\Danilo\workspace\Nutch\conf\prefix-urlfilter.txt.template
> > to C:\Users\Danilo\workspace\Nutch\conf\prefix-urlfilter.txt
> > [*copy*] Copying C:\Users\Danilo\workspace\Nutch\conf\regex-normalize.xml.template
> > to C:\Users\Danilo\workspace\Nutch\conf\regex-normalize.xml
> > [*copy*] Copying C:\Users\Danilo\workspace\Nutch\conf\regex-urlfilter.txt.template
> > to C:\Users\Danilo\workspace\Nutch\conf\regex-urlfilter.txt
> > [*copy*] Copying C:\Users\Danilo\workspace\Nutch\conf\subcollections.xml.template
> > to C:\Users\Danilo\workspace\Nutch\conf\subcollections.xml
> > [*copy*] Copying C:\Users\Danilo\workspa
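[Editor's sketch] The `Unsupported major.minor version 51.0` error discussed in this thread means a class compiled for Java 7 (class-file major version 51) is being loaded by an older JVM; major version 50 corresponds to Java 6. A small sketch for checking which JDK a `.class` file targets, reading the standard class-file header (magic `0xCAFEBABE`, then minor and major version as big-endian 16-bit fields):

```python
import struct

def class_major_version(class_bytes):
    """Return the major version from a .class file's 8-byte header."""
    magic, minor, major = struct.unpack(">IHH", class_bytes[:8])
    if magic != 0xCAFEBABE:
        raise ValueError("not a Java class file")
    return major

def required_java(major):
    """Map a class-file major version to the Java release: 51 - 44 = 7."""
    return major - 44
```

Running this over the classes under `build/` (or simply `javap -verbose SomeClass | grep major`) shows whether ant compiled with a newer JDK than the JRE Eclipse is launching, which is exactly the mismatch feng lu describes.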
Re: nutch-2.1 with hbase - any good tool for querying results?
We will be working on better support (a gora-pig adapter) for this
functionality in Apache Gora > 0.3. For now Kiran's suggestion is by far
the best.

Thank you
Lewis

On Tue, Feb 26, 2013 at 10:17 AM, kiran chitturi wrote:

> I found apache pig [1] convenient to use with Hbase for querying and
> filtering.
>
> 1 - http://pig.apache.org/
>
> On Tue, Feb 26, 2013 at 12:18 PM, adfel70 wrote:
>
> > Anybody using a good tool for performing queries on the crawl results
> > directly from hbase? Some of the queries I want to make are: get all
> > the urls that failed fetching, get all the urls that failed parsing.
> >
> > Querying hbase directly seems more convenient than running readdb,
> > waiting for results, then parsing the readdb output to get the
> > required information.
> >
> > thanks.
> >
> > --
> > View this message in context:
> > http://lucene.472066.n3.nabble.com/nutch-2-1-with-hbase-any-good-tool-for-querying-results-tp4043109.html
> > Sent from the Nutch - User mailing list archive at Nabble.com.
>
> --
> Kiran Chitturi

--
*Lewis*
Re: nutch-2.1 with hbase - any good tool for querying results?
I found apache pig [1] convenient to use with Hbase for querying and
filtering.

1 - http://pig.apache.org/

On Tue, Feb 26, 2013 at 12:18 PM, adfel70 wrote:

> Anybody using a good tool for performing queries on the crawl results
> directly from hbase? Some of the queries I want to make are: get all the
> urls that failed fetching, get all the urls that failed parsing.
>
> Querying hbase directly seems more convenient than running readdb,
> waiting for results, then parsing the readdb output to get the required
> information.
>
> thanks.
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/nutch-2-1-with-hbase-any-good-tool-for-querying-results-tp4043109.html
> Sent from the Nutch - User mailing list archive at Nabble.com.

--
Kiran Chitturi
nutch-2.1 with hbase - any good tool for querying results?
Anybody using a good tool for performing queries on the crawl results
directly from hbase? Some of the queries I want to make are: get all the
urls that failed fetching, get all the urls that failed parsing.

Querying hbase directly seems more convenient than running readdb, waiting
for results, then parsing the readdb output to get the required
information.

thanks.

--
View this message in context:
http://lucene.472066.n3.nabble.com/nutch-2-1-with-hbase-any-good-tool-for-querying-results-tp4043109.html
Sent from the Nutch - User mailing list archive at Nabble.com.
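[Editor's sketch] For the two example queries above (urls that failed fetching / failed parsing), a rough sketch of the filtering step, operating on a tab-separated export of the crawl records rather than on HBase directly. The column names `url`, `fetch_status`, and `parse_status` are illustrative placeholders, not Nutch's actual schema — substitute whatever your export produces:

```python
import csv
import io

def failed_urls(tsv_text, status_column, ok_value="success"):
    """Return the urls whose given status column is not the OK value.

    The header names used here are hypothetical; adapt them to the
    columns of your own webpage-table export."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row["url"] for row in reader if row[status_column] != ok_value]
```

At crawl scale the same filter is what a Pig script over HBaseStorage would express; the point is that a status-based scan replaces the readdb-dump-then-grep round trip described above.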
RES: Eclipse Error
I tried both and neither one works! :(

----- Original message -----
From: kiran chitturi [mailto:chitturikira...@gmail.com]
Sent: Tuesday, February 26, 2013 12:32
To: user@nutch.apache.org
Cc: ferna...@gmail.com
Subject: Re: Eclipse Error

Let's keep the discussion in the User mailing list.

I would suggest you follow the instructions here to set up Nutch in Eclipse
[1].

JDK 1.6+ or 1.7+ will be good enough. I would also suggest keeping your JRE
compatible with the JDK.

[1] - http://wiki.apache.org/nutch/RunNutchInEclipse

On Tue, Feb 26, 2013 at 10:05 AM, Danilo Fernandes wrote:

> Kiran,
>
> Do you think I need a JDK 7?
>
> From: kiran chitturi [mailto:chitturikira...@gmail.com]
> Sent: Tuesday, February 26, 2013 11:57
> To: d...@nutch.apache.org
> Subject: Re: Eclipse Error
>
> I think Nutch requires at least Java 1.6.
>
> On Tue, Feb 26, 2013 at 5:33 AM, Danilo Fernandes wrote:
>
> What version of JDK fits with Nutch trunk? Anybody know?
>
> 2013/2/25 Danilo Fernandes
>
> Feng Lu, thanks for the fast reply. But I'm using the JavaSE-1.6 (jre6)
> and always get this error.
>
> From: feng lu [mailto:amuseme...@gmail.com]
> Sent: Monday, February 25, 2013 22:35
> To: d...@nutch.apache.org
> Subject: Re: Eclipse Error
>
> Hi Danilo
>
> "Unsupported major.minor version 51.0" means that you compiled your
> classes under a specific JDK, but then try to run them under an older
> version of the JDK. So, you can't run classes compiled with JDK 6.0 under
> JDK 5.0. The same with classes compiled under JDK 7.0 when you try to run
> them under JDK 6.0.
>
> On Tue, Feb 26, 2013 at 9:12 AM, Danilo Fernandes wrote:
>
> Hi, I want to do some changes in Nutch to get a HTML page and take some
> data from it. My problem starts when I'm compiling the code in Eclipse.
> I always receive the following error message:
>
> Buildfile: C:\Users\Danilo\workspace\Nutch\build.xml
>
> [*taskdef*] Could not load definitions from resource
> org/sonar/ant/antlib.xml. It could not be found.
>
> *ivy-probe-antlib*:
> *ivy-download*:
>
> [*taskdef*] Could not load definitions from resource
> org/sonar/ant/antlib.xml. It could not be found.
>
> *ivy-download-unchecked*:
> *ivy-init-antlib*:
> *ivy-init*:
> *init*:
>
> [*mkdir*] Created dir: C:\Users\Danilo\workspace\Nutch\build
> [*mkdir*] Created dir: C:\Users\Danilo\workspace\Nutch\build\classes
> [*mkdir*] Created dir: C:\Users\Danilo\workspace\Nutch\build\release
> [*mkdir*] Created dir: C:\Users\Danilo\workspace\Nutch\build\test
> [*mkdir*] Created dir: C:\Users\Danilo\workspace\Nutch\build\test\classes
> [*copy*] Copying 8 files to C:\Users\Danilo\workspace\Nutch\conf
> [*copy*] Copying C:\Users\Danilo\workspace\Nutch\conf\automaton-urlfilter.txt.template
> to C:\Users\Danilo\workspace\Nutch\conf\automaton-urlfilter.txt
> [*copy*] Copying C:\Users\Danilo\workspace\Nutch\conf\httpclient-auth.xml.template
> to C:\Users\Danilo\workspace\Nutch\conf\httpclient-auth.xml
> [*copy*] Copying C:\Users\Danilo\workspace\Nutch\conf\nutch-site.xml.template
> to C:\Users\Danilo\workspace\Nutch\conf\nutch-site.xml
> [*copy*] Copying C:\Users\Danilo\workspace\Nutch\conf\prefix-urlfilter.txt.template
> to C:\Users\Danilo\workspace\Nutch\conf\prefix-urlfilter.txt
> [*copy*] Copying C:\Users\Danilo\workspace\Nutch\conf\regex-normalize.xml.template
> to C:\Users\Danilo\workspace\Nutch\conf\regex-normalize.xml
> [*copy*] Copying C:\Users\Danilo\workspace\Nutch\conf\regex-urlfilter.txt.template
> to C:\Users\Danilo\workspace\Nutch\conf\regex-urlfilter.txt
> [*copy*] Copying C:\Users\Danilo\workspace\Nutch\conf\subcollections.xml.template
> to C:\Users\Danilo\workspace\Nutch\conf\subcollections.xml
> [*copy*] Copying C:\Users\Danilo\workspace\Nutch\conf\suffix-urlfilter.txt.template
> to C:\Users\Danilo\workspace\Nutch\conf\suffix-urlfilter.txt
>
> *clean-lib*:
> *resolve-default*:
>
> [*ivy:resolve*] :: Ivy 2.2.0 - 20100923230623 :: http://ant.apache.org/ivy/ ::
> [*ivy:resolve*] :: loading settings :: file =
> C:\Users\Danilo\workspace\Nutch\ivy\ivysettings.xml
> [*ivy:resolve*] :: problems summary ::
> [*ivy:resolve*] ERRORS
> [*ivy:resolve*] unknown resolver main
> [*ivy:resolve*] unknown resolver main
> [*ivy:resolve*] unknown resolver main
> [*ivy:resolve*] unknown resolver main
> [*ivy:resolve*] unknown resolver main
> [*ivy:resolve*] unknown resolver main
> [*ivy:resolve*] unknown resolver main
> [*ivy:resolve*] unkn
Re: Eclipse Error
Let's keep the discussion in the User mailing list. I would suggest you follow the instructions here to set up Nutch in Eclipse [1]. JDK 1.6+ or 1.7+ will be good enough. I would also suggest keeping your JRE compatible with the JDK.

[1] - http://wiki.apache.org/nutch/RunNutchInEclipse

On Tue, Feb 26, 2013 at 10:05 AM, Danilo Fernandes wrote:
> Kiran,
>
> Do you think I need JDK 7?
>
> From: kiran chitturi [mailto:chitturikira...@gmail.com]
> Sent: Tuesday, 26 February 2013 11:57
> To: d...@nutch.apache.org
> Subject: Re: Eclipse Error
>
> I think Nutch requires at least Java 1.6.
>
> On Tue, Feb 26, 2013 at 5:33 AM, Danilo Fernandes wrote:
> > What version of JDK fits with Nutch trunk? Anybody know?
>
> 2013/2/25 Danilo Fernandes
> > Feng Lu, thanks for the fast reply.
> >
> > But I'm using JavaSE-1.6 (jre6) and always get this error.
> >
> > From: feng lu [mailto:amuseme...@gmail.com]
> > Sent: Monday, 25 February 2013 22:35
> > To: d...@nutch.apache.org
> > Subject: Re: Eclipse Error
> >
> > Hi Danilo
> >
> > "Unsupported major.minor version 51.0" means that you compiled your classes under a specific JDK, but are trying to run them under an older version of the JDK. You can't run classes compiled with JDK 6.0 under JDK 5.0; the same goes for classes compiled under JDK 7.0 when you try to run them under JDK 6.0.
> >
> > On Tue, Feb 26, 2013 at 9:12 AM, Danilo Fernandes wrote:
> > > Hi, I want to make some changes in Nutch to get the HTML and take some data from it.
> > > My problem starts when I'm compiling the code in Eclipse.
> > > I always receive the following error message:
> > >
> > > Buildfile: C:\Users\Danilo\workspace\Nutch\build.xml
> > > [taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.
> > > ivy-probe-antlib:
> > > ivy-download:
> > > [taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml.
> > > It could not be found.
> > > ivy-download-unchecked:
> > > ivy-init-antlib:
> > > ivy-init:
> > > init:
> > > [mkdir] Created dir: C:\Users\Danilo\workspace\Nutch\build
> > > [mkdir] Created dir: C:\Users\Danilo\workspace\Nutch\build\classes
> > > [mkdir] Created dir: C:\Users\Danilo\workspace\Nutch\build\release
> > > [mkdir] Created dir: C:\Users\Danilo\workspace\Nutch\build\test
> > > [mkdir] Created dir: C:\Users\Danilo\workspace\Nutch\build\test\classes
> > > [copy] Copying 8 files to C:\Users\Danilo\workspace\Nutch\conf
> > > [copy] Copying C:\Users\Danilo\workspace\Nutch\conf\automaton-urlfilter.txt.template to C:\Users\Danilo\workspace\Nutch\conf\automaton-urlfilter.txt
> > > [copy] Copying C:\Users\Danilo\workspace\Nutch\conf\httpclient-auth.xml.template to C:\Users\Danilo\workspace\Nutch\conf\httpclient-auth.xml
> > > [copy] Copying C:\Users\Danilo\workspace\Nutch\conf\nutch-site.xml.template to C:\Users\Danilo\workspace\Nutch\conf\nutch-site.xml
> > > [copy] Copying C:\Users\Danilo\workspace\Nutch\conf\prefix-urlfilter.txt.template to C:\Users\Danilo\workspace\Nutch\conf\prefix-urlfilter.txt
> > > [copy] Copying C:\Users\Danilo\workspace\Nutch\conf\regex-normalize.xml.template to C:\Users\Danilo\workspace\Nutch\conf\regex-normalize.xml
> > > [copy] Copying C:\Users\Danilo\workspace\Nutch\conf\regex-urlfilter.txt.template to C:\Users\Danilo\workspace\Nutch\conf\regex-urlfilter.txt
> > > [copy] Copying C:\Users\Danilo\workspace\Nutch\conf\subcollections.xml.template to C:\Users\Danilo\workspace\Nutch\conf\subcollections.xml
> > > [copy] Copying C:\Users\Danilo\workspace\Nutch\conf\suffix-urlfilter.txt.template to C:\Users\Danilo\workspace\Nutch\conf\suffix-urlfilter.txt
> > > clean-lib:
> > > resolve-default:
> > > [ivy:resolve] :: Ivy 2.2.0 - 20100923230623 :: http://ant.apache.org/ivy/ ::
> > > [ivy:resolve] :: loading settings :: file = C:\Users\Danilo\workspace\Nutch\ivy\ivysettings.xml
> > > [ivy:resolve] :: problems summary ::
> > > [ivy:resolve] ERRORS
> > > [ivy:resolve] unknown resolver main (this line repeats for each dependency)
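As an aside for anyone hitting the same "Unsupported major.minor version" error: the class-file major version is stored in bytes 6-7 of every .class file (51 corresponds to Java 7, 50 to Java 6, and so on), so you can check which JDK produced a class without trying to run it. The following standalone Python sketch (not part of Nutch; the function name and version table are mine) parses that header:

```python
import struct

# Class-file major version -> JDK release (from the JVM spec; 51 = Java 7).
CLASS_MAJOR_TO_JDK = {45: "1.1", 46: "1.2", 47: "1.3", 48: "1.4", 49: "5", 50: "6", 51: "7"}

def class_file_jdk(header: bytes) -> str:
    """Given the first 8 bytes of a .class file, return the JDK that compiled it."""
    # Layout: u4 magic (0xCAFEBABE), u2 minor_version, u2 major_version.
    magic, minor, major = struct.unpack(">IHH", header[:8])
    if magic != 0xCAFEBABE:
        raise ValueError("not a Java class file")
    return CLASS_MAJOR_TO_JDK.get(major, "unknown (major %d)" % major)

# A class compiled by JDK 7 has major version 0x33 = 51:
print(class_file_jdk(b"\xca\xfe\xba\xbe\x00\x00\x00\x33"))  # -> 7
```

Running something like this over the files in build/classes would tell you whether Ant compiled with a newer JDK than the JRE Eclipse is running under.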
Re: Crawling URLs with query string while limiting only web pages
Feng Lu: Thanks for the tip. I will definitely try the approach. Appreciate your help.

Tejas: I am using the grepping approach, filtering out some keywords from the fetched log. So far I have observed that 20% of the fetched list is filled up with not-so-important URLs. I hope an optimized filter can do some good for my crawler's performance. Thanks for your directions.

Cheers, Ye

On Mon, Feb 25, 2013 at 3:31 AM, Tejas Patil wrote:
> @Ye, You need not look at each URL. Random sampling will be better. It won't be accurate, but it is the practical thing to do. Even while going through logs, extract the URLs and sort them so that all of those belonging to the same host lie in the same group.
>
> @feng lu: +1. Good trick to remove the bad URLs using normalization. The main problem for the OP would still be coming up with such rules by manually observing the logs.
>
> Thanks,
> Tejas Patil
>
> On Sun, Feb 24, 2013 at 7:16 AM, feng lu wrote:
> > Hi Ye
> >
> > Can you add this pattern to the regex-normalize.xml configuration file for the RegexURLNormalizer class?
> >
> > <regex>
> >   <pattern>([;_]?((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid|view|zoom)=.*?)(\?|&amp;|#|$)</pattern>
> >   <substitution>$4</substitution>
> > </regex>
> >
> > It removes session ids and print/zoom parameters from URLs, e.g.
> >
> > site1.com/article1/?view=printerfriendly
> > site1.com/article1/?zoom=large
> > site1.com/article1/?zoom=extralarge
> >
> > all become
> >
> > site1.com/article1
> >
> > On Sun, Feb 24, 2013 at 9:48 PM, Ye T Thet wrote:
> > > Tejas,
> > >
> > > Thanks for your pointers. They are really helpful. As of now my approach follows your directions 1, 2 and 3. Since my sites are around 10k in number, I hope it will be manageable for the near future.
> > >
> > > I might need to apply your directions 4 and 5 in the future as well, but I believe it might be out of my league to get it right, though.
> > > Some extra information about my approach: most of my target sites are using a CMS, and quite a number of them DO NOT use pretty URLs. I have been grepping the log to identify the patterns of redundant or non-important URLs and adding regex rules to regex-urlfilter. Two million URLs are quite hard to process for one man, though. Phew!
> > >
> > > I will share if I can find an approach that could benefit us all.
> > >
> > > Regards,
> > >
> > > Ye
> > >
> > > On Sat, Feb 23, 2013 at 12:22 PM, Tejas Patil wrote:
> > > > one correction in red below.
> > > >
> > > > On Fri, Feb 22, 2013 at 8:20 PM, Tejas Patil wrote:
> > > > > I think that what you have done till now is logical. Typically in Nutch crawls people don't want URLs with query strings, but nowadays things have changed. For instance, category #2 you pointed out may capture some vital pages. I once ran into a similar issue. A crawler can't be made intelligent beyond a certain point, and I had to go through crawl logs to check which URLs were being fetched and later redefine my regex rules.
> > > > >
> > > > > Some things that I had considered doing:
> > > > > 1. Start off with rules which are less restrictive and observe the logs for which URLs are visited. This will give you an idea about the bad URLs and the good ones. As you already have crawled for 10 days, you are (just!!) left with studying the logs.
> > > > > 2. After #1 is done, launch crawls with accept rules for the good URLs and put a "-." at the end to avoid the bad URLs.
> > > > > 3. Having a huge list of regexes is a bad thing, because comparing URLs against regexes is a costly operation, done for every URL.
> > > > > A URL getting a match early saves this time, so put patterns which capture a huge set of URLs at the top of the regex urlfilter file.
> > > > > 4. Sometimes you don't want the parser to extract URLs from certain areas of the page, as you know they are not going to yield anything good for you. Let's say that the "print" or "zoom" URLs come from some specific tags of the HTML source. It's better not to parse those things and thus not have those URLs in the first place. The profit here is that the regex rules to be defined are reduced.
> > > > > 5. An improvement over #4: if you know the nature of the pages being crawled, you can tweak the parsers to extract URLs from specific tags only. This reduces noise and gives a much cleaner fetch list.
> > > > >
> > > > > As far as I can tell, this problem won't have an automated solution like modifying some config/setting. There is a decent amount of human effort involved.
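To see what feng lu's normalization rule actually does before dropping it into regex-normalize.xml, the pattern can be exercised outside Nutch. This is only an illustrative sketch: Nutch applies such rules through its urlnormalizer-regex plugin, and the trailing-delimiter cleanup step below is my own simplification standing in for the follow-up rules in the stock file.

```python
import re

# feng lu's rule, rewritten for Python's re module (Java allows mid-pattern
# (?i) flags; here the whole pattern is made case-insensitive instead).
_RULE = re.compile(
    r"([;_]?(l|j|bv_)?(sid|phpsessid|sessionid|view|zoom)=.*?)(\?|&|#|$)",
    re.IGNORECASE,
)

def normalize(url):
    # Replace the matched parameter with its trailing delimiter ($4 / \4).
    url = _RULE.sub(r"\4", url)
    # The stock regex-normalize.xml has further rules that remove a dangling
    # '?' or '&'; this rstrip is a simplified stand-in for them.
    return url.rstrip("?&")

print(normalize("http://site1.com/article1/?view=printerfriendly"))  # http://site1.com/article1/
print(normalize("http://site1.com/article1/?zoom=large"))            # http://site1.com/article1/
```

A quick harness like this makes it cheap to confirm a candidate rule strips the junk parameters without mangling legitimate URLs before it goes live in the crawl.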
RE: regex-urlfilter file for multiple domains
Very nice. Thanks. My problem now is getting Nutch to run in Eclipse. I'm having some problems with JDKs. When I run Ant, I always receive a message about different versions of JDK/JRE. Do you have any idea about that?

On Tue, 26 Feb 2013 10:42:21 +0000, Markus Jelsma wrote:
> No, there is no feature for that. You would have to patch it up yourself. It shouldn't be very hard.
>
> -----Original message-----
> From: Danilo Fernandes
> Sent: Tue 26-Feb-2013 11:37
> To: user@nutch.apache.org
> Subject: RE: regex-urlfilter file for multiple domains
>
> Yes, my first option is different files for different domains. The point is how can I link the files with each domain? Do I need to make some changes in the Nutch code, or does the project have a feature for that?
>
> On Tue, 26 Feb 2013 10:33:37 +0000, Markus Jelsma wrote:
> > Yes, it will support that until you run out of memory. But having a million expressions is not going to work nicely. If you have a lot of expressions but can divide them into domains, I would patch the filter so it only executes the filters that are for a specific domain.
> >
> > -----Original message-----
> > From: Danilo Fernandes
> > Sent: Tue 26-Feb-2013 11:31
> > To: user@nutch.apache.org
> > Subject: RE: regex-urlfilter file for multiple domains
> >
> > Tejas, do you have any idea how many rules I can use in the file? Probably I will work with 1M regexes for different URLs. Will Nutch support that?
RE: regex-urlfilter file for multiple domains
No, there is no feature for that. You would have to patch it up yourself. It shouldn't be very hard.

-----Original message-----
From: Danilo Fernandes
Sent: Tue 26-Feb-2013 11:37
To: user@nutch.apache.org
Subject: RE: regex-urlfilter file for multiple domains

> Yes, my first option is different files for different domains. The point is how can I link the files with each domain? Do I need to make some changes in the Nutch code, or does the project have a feature for that?
>
> On Tue, 26 Feb 2013 10:33:37 +0000, Markus Jelsma wrote:
> > Yes, it will support that until you run out of memory. But having a million expressions is not going to work nicely. If you have a lot of expressions but can divide them into domains, I would patch the filter so it only executes the filters that are for a specific domain.
> >
> > -----Original message-----
> > From: Danilo Fernandes
> > Sent: Tue 26-Feb-2013 11:31
> > To: user@nutch.apache.org
> > Subject: RE: regex-urlfilter file for multiple domains
> >
> > Tejas, do you have any idea how many rules I can use in the file? Probably I will work with 1M regexes for different URLs. Will Nutch support that?
RE: regex-urlfilter file for multiple domains
Yes, my first option is different files for different domains. The point is how can I link the files with each domain? Do I need to make some changes in the Nutch code, or does the project have a feature for that?

On Tue, 26 Feb 2013 10:33:37 +0000, Markus Jelsma wrote:
> Yes, it will support that until you run out of memory. But having a million expressions is not going to work nicely. If you have a lot of expressions but can divide them into domains, I would patch the filter so it only executes the filters that are for a specific domain.
>
> -----Original message-----
> From: Danilo Fernandes
> Sent: Tue 26-Feb-2013 11:31
> To: user@nutch.apache.org
> Subject: RE: regex-urlfilter file for multiple domains
>
> Tejas, do you have any idea how many rules I can use in the file? Probably I will work with 1M regexes for different URLs. Will Nutch support that?
RE: regex-urlfilter file for multiple domains
Yes, it will support that until you run out of memory. But having a million expressions is not going to work nicely. If you have a lot of expressions but can divide them into domains, I would patch the filter so it only executes the filters that are for a specific domain.

-----Original message-----
From: Danilo Fernandes
Sent: Tue 26-Feb-2013 11:31
To: user@nutch.apache.org
Subject: RE: regex-urlfilter file for multiple domains

> Tejas, do you have any idea how many rules I can use in the file?
>
> Probably I will work with 1M regexes for different URLs.
>
> Will Nutch support that?
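Markus's suggested patch can be sketched conceptually: bucket the rules by host so that each URL is only tested against its own domain's (small) rule list instead of all 1M expressions. Nutch has no such feature out of the box, and the class below is a hypothetical Python model of the idea, not the actual RegexURLFilter API.

```python
import re
from urllib.parse import urlparse

class DomainPartitionedFilter:
    """Hypothetical model of a per-domain regex URL filter (not Nutch's API)."""

    def __init__(self, rules_by_domain):
        # rules_by_domain maps a host to regex-urlfilter style rules:
        # "+pattern" accepts, "-pattern" rejects; the first match wins.
        self.compiled = {
            host: [(rule[0] == "+", re.compile(rule[1:])) for rule in rules]
            for host, rules in rules_by_domain.items()
        }

    def accept(self, url):
        host = urlparse(url).netloc
        # Only this host's rules are consulted -- the whole point of the patch.
        for accepts, pattern in self.compiled.get(host, []):
            if pattern.search(url):
                return accepts
        return False  # no rules for this host, or nothing matched

f = DomainPartitionedFilter({
    "example.com": [r"+^http://example\.com/articles/", r"-."],
})
print(f.accept("http://example.com/articles/1.html"))  # True
print(f.accept("http://example.com/login?sid=42"))     # False
```

With 1M rules split across, say, 10k domains, each URL would face on the order of 100 regex tests instead of up to 1M, at the cost of one hash lookup per URL.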
RE: regex-urlfilter file for multiple domains
Tejas, do you have any idea how many rules I can use in the file? Probably I will work with 1M regexes for different URLs. Will Nutch support that?
Re: Only a small portion of URLs is indexed in Solr at the end of the crawl
On 26.02.2013 10:19, Amit Sela wrote:

Hi all, I'm running Nutch 1.6 and Solr 3.6.2, and I'm crawling with depth 1, topN 100 and 'db.update.additions.allowed' false. The idea is to fetch, parse and index only the URLs in the seed list. I seed ~120K URLs, but in Solr I see only ~20K indexed. The fetch job counters show:

moved 49,937 -> redirections, I think (not crawled; there is a Nutch property which allows following them)
robots_denied 1,149 -> forbidden by the robots.txt of the seed URL
robots_denied_maxcrawldelay 267 -> forbidden by the crawl-delay option in the robots.txt of the seed URL
hitByTimeLimit 6,072 -> response timeout
exception 4,479 -> other stuff
notmodified 2
access_denied 4 -> login needed
temp_moved 4,658 -> redirections (not crawled; there is a Nutch property which allows following them)
success 23,033 -> your 20K, which are indexed
notfound 1,658 -> 404

By the way, if you crawl just with a depth of 1, you don't need to specify a topN, because you will always crawl just the seed URLs.

and the ParserStatus success count is 22,844. What happened to all the URLs? They are all active URLs, not some old list...

Thanks, Amit.

--
Stefan Scheffler
Avantgarde Labs GmbH
Löbauer Straße 19, 01099 Dresden
Telefon: +49 (0) 351 21590834
Email: sscheff...@avantgarde-labs.de
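For what it's worth, summing the fetch-status counters above shows they only account for about 91K of the ~120K seeded URLs, so roughly another 29K never reached the fetcher at all (e.g. rejected by URL filters or deduplicated at injection; the exact seed count is approximate). A quick check:

```python
# Fetch-status counters as reported in the post above.
counters = {
    "moved": 49937,
    "robots_denied": 1149,
    "robots_denied_maxcrawldelay": 267,
    "hitByTimeLimit": 6072,
    "exception": 4479,
    "notmodified": 2,
    "access_denied": 4,
    "temp_moved": 4658,
    "success": 23033,
    "notfound": 1658,
}
accounted = sum(counters.values())
print(accounted)            # 91259 URLs accounted for by the counters
print(120_000 - accounted)  # 28741 of the ~120K seeds not covered by any counter
```

So the gap Amit asks about splits in two: ~68K URLs the fetcher saw but could not index (redirects, robots, timeouts, errors), plus ~29K that these counters never saw.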
Only a small portion of URLs is indexed in Solr at the end of the crawl
Hi all, I'm running Nutch 1.6 and Solr 3.6.2, and I'm crawling with depth 1, topN 100 and 'db.update.additions.allowed' false. The idea is to fetch, parse and index only the URLs in the seed list. I seed ~120K URLs, but in Solr I see only ~20K indexed. The fetch job counters show:

moved 49,937
robots_denied 1,149
robots_denied_maxcrawldelay 267
hitByTimeLimit 6,072
exception 4,479
notmodified 2
access_denied 4
temp_moved 4,658
success 23,033
notfound 1,658

and the ParserStatus success count is 22,844.

What happened to all the URLs? They are all active URLs, not some old list...

Thanks, Amit.