Very nice.
On 07/30/2010 05:07 PM, Andrzej Bialecki wrote:
Hi,
We have a simple crawling benchmark now in trunk. Here's how to use it:
* in one console execute 'ant proxy'. This will start a proxy server on
port 8181 that produces fake pages.
* in another console execute 'ant benchmark'. This will run 5 rounds
of fetching (~16,000 pages) using that proxy server.
There are already some interesting issues I noticed. First, on
reasonably good hardware in local mode I was able to fetch and process
16k pages in 400 sec (NOTE: this includes ALL steps, i.e. generate,
fetch, parse, crawldb update and invertlinks). That works out to a
total crawling throughput of 40 pages/sec. This is in local mode, so
in distributed mode I'd guess we would get roughly this number times
the number of fetcher tasks.
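The arithmetic above can be sanity-checked in a few lines; note that the
figures (16,000 pages, 400 sec) are from the benchmark run reported here,
while the task count is purely a hypothetical example, not a measured
distributed-mode result:

```java
// Back-of-the-envelope check of the reported benchmark numbers.
public class Throughput {
    public static void main(String[] args) {
        int pages = 16_000;      // pages fetched in the benchmark run
        int seconds = 400;       // wall-clock time for all 5 rounds

        int localRate = pages / seconds;   // local-mode throughput
        System.out.println(localRate + " pages/sec in local mode");

        // Hypothetical: with N parallel fetcher tasks, throughput would
        // scale roughly linearly (ignoring contention and skew).
        int tasks = 10;
        System.out.println(localRate * tasks
                + " pages/sec with " + tasks + " tasks (rough estimate)");
    }
}
```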
Secondly, it seems that Fetcher has a synchronization issue in its
queue management: even when other queues are non-empty, if one queue
blocks, the Fetcher spin-waits all threads until an item becomes
available on that queue, and only then starts happily consuming items
from all non-blocking queues (including that one). The process then
repeats: one queue blocks, and all threads stop getting items from the
other queues... At the moment I can't figure out where this lock-up is
happening, but the symptoms are obvious when you watch the logs in
real time.
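To make the symptom concrete, here is a minimal sketch (this is NOT
Nutch's actual FetchItemQueues code; the class and method names are
made up for illustration) of the polling behavior the Fetcher should
have: a per-host queue that is still under its crawl-delay counts as
"blocked" and gets skipped, rather than having all threads wait on it:

```java
import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;

public class QueuePolling {
    // Fixed "current time" so the example is deterministic.
    static final long NOW = 1_000L;

    // Hypothetical per-host queue: blocked while its crawl-delay
    // (represented by nextFetchTime) has not yet elapsed.
    record HostQueue(String host, long nextFetchTime, Queue<String> items) {
        boolean blocked() { return nextFetchTime > NOW; }
    }

    // Desired behavior: skip blocked queues and hand out an item from
    // the first ready, non-empty queue; never wait on a blocked one.
    static String pollNonBlocking(List<HostQueue> queues) {
        for (HostQueue q : queues) {
            if (!q.blocked() && !q.items().isEmpty()) {
                return q.items().poll();
            }
        }
        return null; // nothing fetchable right now
    }

    public static void main(String[] args) {
        List<HostQueue> queues = List.of(
            // a.example is still under crawl-delay -> blocked
            new HostQueue("a.example", NOW + 500,
                new ArrayDeque<>(List.of("http://a.example/1"))),
            // b.example is ready
            new HostQueue("b.example", NOW - 500,
                new ArrayDeque<>(List.of("http://b.example/1"))));

        // The blocked queue is skipped instead of stalling the thread.
        System.out.println(pollNonBlocking(queues));
    }
}
```

The lock-up described above corresponds to the opposite pattern: all
fetcher threads waiting on one blocked queue instead of moving on to
the queues that still have fetchable items.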
More stuff to come on this subject - at least we have a tool to
experiment with :)