Andrzej, you are the man!

Cheers,
Chris

On 7/30/10 3:07 PM, "Andrzej Bialecki" <a...@getopt.org> wrote:

Hi,

We have a simple crawling benchmark now in trunk. Here's how to use it:

* in one console execute 'ant proxy'. This will start a proxy server on port 8181 that produces fake pages.

* in another console execute 'ant benchmark'. This will run 5 rounds of fetching (~16,000 pages) using that proxy server.

There are already some interesting issues I noticed. First, on reasonably good hardware in local mode I was able to fetch and process (NOTE: this includes ALL steps, i.e. generate, fetch, parse, crawldb update and invertlinks) 16k pages in 400 sec. This means a total crawling throughput of 40 pages/sec. This is in local mode, so in distributed mode I guess we would be getting this number times the number of tasks.

Secondly, it seems that Fetcher has some synchronization issues in its queue management. Even if other queues are non-empty, when one of the queues blocks, the Fetcher will spin-wait all threads until an item becomes available on that queue, and then it starts to happily consume items from all non-blocking queues (including this one). The process then repeats - one queue blocks, and all threads stop getting items from other queues... At the moment I can't figure out where this lock-up is happening, but the symptoms are obvious when you look at the logs in real-time.

More stuff to come on this subject - at least we have a tool to experiment with :)

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.mattm...@jpl.nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
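
For reference, the throughput figure quoted in the message (16k pages in 400 sec) can be checked with a few lines of Python. This is only a back-of-the-envelope sketch, not part of the benchmark. The function names and the task counts are made up for illustration, and the distributed estimate assumes perfectly linear scaling with the number of tasks, which is the guess stated in the message rather than a measured result.

```python
def throughput(pages: int, seconds: float) -> float:
    """Pages processed per second of wall-clock time (all crawl steps included)."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return pages / seconds


def ideal_distributed_throughput(local_rate: float, tasks: int) -> float:
    """Optimistic upper bound: ASSUMES throughput scales linearly with tasks."""
    if tasks < 1:
        raise ValueError("tasks must be at least 1")
    return local_rate * tasks


if __name__ == "__main__":
    # Figures from the message: 16k pages in 400 sec, local mode.
    local_rate = throughput(16_000, 400)
    print(f"local mode: {local_rate:.1f} pages/sec")

    # Hypothetical task counts, chosen only for illustration.
    for tasks in (1, 4, 16):
        estimate = ideal_distributed_throughput(local_rate, tasks)
        print(f"{tasks:>2} tasks: up to {estimate:.1f} pages/sec (if scaling were linear)")
```

Real distributed numbers would likely come in below this bound, since job startup cost and the queue blocking described in the message both cut into it.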