Re: [Nutch-dev] Nutch Crawler !!!

2005-03-07 Thread Feng Zhou
I've been reading the Nutch code recently. Below is some of my understanding; others should correct me if I'm wrong. Regards, - Feng Zhou On Mon, 7 Mar 2005 21:15:38 -0500, Daniel Drazner <[EMAIL PROTECTED]> wrote: > Hi, > > Thanks for your email. I have come across this article before. Unfortunately

RE: [Nutch-dev] Nutch Crawler !!!

2005-03-07 Thread Daniel Drazner
Hi, Thanks for your email. I have come across this article before. Unfortunately it doesn't reveal all secrets. Thanks, Daniel -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] Behalf Of Feng Zhou Sent: Monday, March 07, 2005 8:48 PM To: [EMAIL PROTECTED] Cc: nutch-dev

Re: [Nutch-dev] Nutch Crawler !!!

2005-03-07 Thread Feng Zhou
Hi Daniel, I'm no expert on Nutch but I've found the Nutch tech report is a good place to start. It has detailed discussion on how various components work. http://labs.commerce.net/wiki/images/0/06/CN-TR-04-04.pdf - Feng Zhou On Mon, 7 Mar 2005 19:45:30 -0500, Daniel Drazner <[EMAIL PROTECTED]

[Nutch-dev] Nutch Crawler details needed !!!

2005-03-07 Thread Daniel Drazner
Hello, First of all I would like to thank you for your wonderful work!! I'm interested in learning in detail how the Nutch crawler is designed and how it performs web crawling. I tried to find technical documentation that describes the fetching process, but found only one article (http://nutch.sourceforge

[Nutch-dev] Nutch Crawler !!!

2005-03-07 Thread Daniel Drazner
Hello, First of all I would like to thank you for your wonderful work!! I'm interested in learning in detail how the Nutch crawler is designed and how it performs web crawling. I tried to find technical documentation that describes the fetching process, but found only one article (http://nutch.sourceforge

[Nutch-dev] [jira] Created: (NUTCH-5) Hit limiter off-by-one bug

2005-03-07 Thread Andy Liu (JIRA)
Hit limiter off-by-one bug -- Key: NUTCH-5 URL: http://issues.apache.org/jira/browse/NUTCH-5 Project: Nutch Type: Bug Components: searcher Reporter: Andy Liu Priority: Minor When re-searching for more raw hits, the first result o

Re: [Nutch-dev] Re: NameNode scalibility

2005-03-07 Thread michael_cafarella
Hi, This is very interesting, thanks, Angel. Doug's right about the datanode startup and replication problem. I believe there's a simple fix for the problem you describe when starting up all the datanodes. He's also probably right about the namenode startup. A Namenode logs all its

[Nutch-dev] Re: NameNode scalibility

2005-03-07 Thread Doug Cutting
Thanks for the report! 400,000 is a larger number of files than I have yet tested NDFS with, and it looks like there are some issues caused by this. Mike has built the largest NDFS systems that I know of (several terabytes spread over around 20 machines) but these probably had less than a thous

[Nutch-dev] NameNode scalibility

2005-03-07 Thread Angel Faus
Hi, I have been doing some tests to find out if NDFS can be used at our company to reliably store many files (both small and big) across a cluster of cheap servers. The short summary is that right now NDFS doesn't look viable for our needs. I am sending the results of the test to the list, in c

[Nutch-dev] joining developer mailing list

2005-03-07 Thread Rohit Upadhyay
i would like to join the developer mailing list for nutch rohit upadhyay

Re: [Nutch-dev] Plugins - sum up

2005-03-07 Thread LocalSearch.HK
FYI -- I have the carrot plugin working on www.localsearch.hk - Original Message - From: "Dawid Weiss" <[EMAIL PROTECTED]> To: <[EMAIL PROTECTED]> Sent: Thursday, March 03, 2005 11:41 PM Subject: Re: [Nutch-dev] Plugins - sum up Clustering Carrot2: It is a search results clustering plugi

[Nutch-dev] Add segment/index procedure

2005-03-07 Thread Christophe Noel
Hello, Here are some easy but interesting questions... Thanks for the help. I crawled some pages with bin/nutch crawl. Some pages were not fetched (because of timeouts from a really slow web server). To get the unfetched (but existing) pages I ran "bin/nutch generate db segme