Thanks a bunch, Markus. By the way, is there some book or material on Nutch that would help me understand it better? I come from an application development background and all the crawl-and-search stuff is *very* new to me :)
On Wed, Jan 26, 2011 at 9:48 AM, Markus Jelsma <markus.jel...@openindex.io> wrote:

> These values come from the CrawlDB and have the following meanings:
>
> db_unfetched
> The number of URLs that are to be crawled when the next batch is started.
> This number is usually limited by the generate.max.per.host setting. So, if
> there are 5000 unfetched URLs and generate.max.per.host is set to 1000, the
> next batch will fetch only 1000. Note that the number of unfetched URLs will
> usually not end up at 5000 - 1000, because new URLs will have been
> discovered and added to the CrawlDB in the meantime.
>
> db_fetched
> These URLs have been fetched. Their next fetch is due after
> db.fetcher.interval. This is not always the case, though: the adaptive
> schedule algorithm can tune this interval depending on several settings.
> With these you can tune the interval based on whether a page has been
> modified or not.
>
> db_gone
> HTTP 404 Not Found
>
> db_redir_temp
> HTTP 307 Temporary Redirect
>
> db_redir_perm
> HTTP 301 Moved Permanently
>
> Code:
> http://svn.apache.org/viewvc/nutch/branches/branch-1.2/src/java/org/apache/nutch/crawl/CrawlDatum.java?view=markup
>
> Configuration:
> http://svn.apache.org/viewvc/nutch/branches/branch-1.2/conf/nutch-default.xml?view=markup
>
> > Thanks Chris, Charan and Alex.
> >
> > I am looking into the crawl statistics now, and I see fields like
> > db_unfetched, db_fetched, db_gone, db_redir_temp and db_redir_perm. What
> > do they mean?
> >
> > Also, I see that db_unfetched is far higher than db_fetched. Does it
> > mean most of the pages were not crawled at all due to some issue?
> >
> > Thanks again for your time!
> >
> > On Tue, Jan 25, 2011 at 2:33 PM, charan kumar <charan.ku...@gmail.com> wrote:
> > > db.fetcher.interval: it means that URLs which were fetched in the last
> > > 30 days (the default) will not be fetched.
> > > Or: a URL is eligible for refetch only after 30 days since its last
> > > crawl.
> > >
> > > On Mon, Jan 24, 2011 at 9:23 PM, <alx...@aim.com> wrote:
> > > > How do I use Solr to index Nutch segments?
> > > > What is the meaning of db.fetcher.interval? Does this mean that if I
> > > > run the same crawl command within 30 days it will do nothing?
> > > >
> > > > Thanks.
> > > > Alex.
> > > >
> > > > -----Original Message-----
> > > > From: Charan K <charan.ku...@gmail.com>
> > > > To: user <user@nutch.apache.org>
> > > > Cc: user <user@nutch.apache.org>
> > > > Sent: Mon, Jan 24, 2011 8:24 pm
> > > > Subject: Re: Few questions from a newbie
> > > >
> > > > Refer to NutchBean.java for the third question. You can run it from
> > > > the command line to test the index.
> > > >
> > > > If you use Solr indexing, it is going to be much simpler; they have
> > > > a Solr Java client.
> > > >
> > > > Sent from my iPhone
> > > >
> > > > On Jan 24, 2011, at 8:07 PM, Amna Waqar <amna.waqar...@gmail.com> wrote:
> > > > > 1. To crawl just 5 to 6 websites, you can use either approach, but
> > > > > an intranet crawl gives you more control and speed.
> > > > > 2. After the first crawl, the recrawl interval for the same sites
> > > > > is 30 days by default (db.fetcher.interval); you can change it to
> > > > > suit your own needs.
> > > > > 3. I have no idea about the third question, because I am also a
> > > > > newbie.
> > > > > Best of luck with learning Nutch!
> > > > >
> > > > > On Mon, Jan 24, 2011 at 9:04 PM, .: Abhishek :. <ab1s...@gmail.com> wrote:
> > > > >> Hi all,
> > > > >>
> > > > >> I am very new to Nutch and Lucene as well.
> > > > >> I am having a few questions about Nutch. I know they are very
> > > > >> basic, but I could not get clear-cut answers out of googling. The
> > > > >> questions are:
> > > > >>
> > > > >> - If I have to crawl just 5-6 web sites or URLs, should I use an
> > > > >>   intranet crawl or a whole-web crawl?
> > > > >> - How do I set up recrawls for these same web sites after the
> > > > >>   first crawl?
> > > > >> - If I have to search the results via my own Java code, which jar
> > > > >>   files, APIs or samples should I be looking into?
> > > > >> - Is there a book on Nutch?
> > > > >>
> > > > >> Thanks a bunch for your patience. I appreciate your time.
> > > > >>
> > > > >> ./Abishek
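[Editor's note] For reference, the settings discussed in this thread are ordinary Nutch configuration properties: defaults live in conf/nutch-default.xml and overrides go in conf/nutch-site.xml. A minimal sketch follows; note that the thread says "db.fetcher.interval", while in Nutch 1.2's nutch-default.xml the 30-day default is, to my recollection, held by db.fetch.interval.default (in seconds), so verify the exact names against your own nutch-default.xml. The values here are illustrative, not recommendations.

```xml
<!-- Sketch of conf/nutch-site.xml overrides. Verify property names against
     your conf/nutch-default.xml; values are illustrative only. -->
<configuration>
  <!-- Cap on URLs per host selected into one generated fetch batch
       (-1 means unlimited), per Markus's explanation of db_unfetched. -->
  <property>
    <name>generate.max.per.host</name>
    <value>1000</value>
  </property>
  <!-- Default refetch interval in seconds (2592000 s = 30 days),
       the "db.fetcher.interval" the thread refers to. -->
  <property>
    <name>db.fetch.interval.default</name>
    <value>2592000</value>
  </property>
  <!-- Swap in the adaptive schedule Markus mentions, which shortens or
       lengthens the interval depending on whether a page changes. -->
  <property>
    <name>db.fetch.schedule.class</name>
    <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
  </property>
</configuration>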
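[Editor's note] Alex's question about indexing Nutch segments into Solr, and Abhishek's about recrawling, can both be sketched as one manual crawl cycle with the Nutch 1.x command-line tools. This is a sketch under assumptions: the directory layout and the Solr URL are placeholders, and the exact bin/nutch usage should be checked against your Nutch version's output of `bin/nutch` with no arguments.

```shell
#!/bin/sh
# Sketch of one crawl/recrawl cycle with Nutch 1.x, ending with Solr
# indexing. Paths and the Solr URL are placeholders.
CRAWLDB=crawl/crawldb
LINKDB=crawl/linkdb
SEGMENTS=crawl/segments
SOLR=http://localhost:8983/solr

# Select URLs that are due (past their fetch interval) into a new segment.
bin/nutch generate $CRAWLDB $SEGMENTS
SEGMENT=$SEGMENTS/$(ls -t $SEGMENTS | head -1)   # newest segment

bin/nutch fetch $SEGMENT                 # fetch the generated URLs
bin/nutch parse $SEGMENT                 # parse (skip if fetcher.parse=true)
bin/nutch updatedb $CRAWLDB $SEGMENT     # fold results back into the CrawlDB
bin/nutch invertlinks $LINKDB $SEGMENT   # update the link database

# Push crawl data into Solr (Alex's question above).
bin/nutch solrindex $SOLR $CRAWLDB $LINKDB $SEGMENT
```

Rerunning the script before any URL's interval has elapsed makes the generate step select nothing, which is the behavior Alex asked about: the crawl does not redo work, it simply finds no URLs due for refetch.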