Robin Haswell wrote:
> On Fri, 2006-12-08 at 11:01 +0100, Andrzej Bialecki wrote:
>   
>> Ad 1.
>>
>> I suspect that it's sorting the reduce output now ... in 0.8.x this 
>> operation has poor performance, especially when run on a single server. 
>> So, I advise patience, and giving as much CPU and RAM as possible. For 
>> the future, it's also much much better to run the fetcher in non-parsing 
>> mode and run "nutch parse" afterwards as a separate step.
>>     
>
> Okay, I'll give it a while and see what happens. Is it possible to get
> any information on what's going on? I'm running 0.8 pretty much
> out-of-the-box on a single server. I've seen people mentioning phases of
> Hadoop - can it tell me what's going on?
>   

The logs should show this: the map xx% and reduce xx% progress is 
printed there.

The reduce phase consists of copying map outputs (reduce 0-33%), then 
sorting them (33-66%), which is where most of the CPU, disk I/O and 
time is spent, and finally copying the sorted outputs to form the final 
result (66-100%).
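
As a rough illustration of the percentages above, here is a minimal 
Python sketch that maps a progress line to the reduce sub-phase. The 
log line format ("map 100% reduce 45%"), the phase labels and the 
function names are assumptions made for this example, not something 
taken from Nutch or Hadoop; adjust the pattern to whatever your logs 
actually print.

```python
import re

# Assumed shape of a progress fragment, e.g. "map 100% reduce 45%".
# The real log line may carry a timestamp and class name around it.
PROGRESS = re.compile(r"map\s+(\d+)%\s+reduce\s+(\d+)%")


def reduce_phase(reduce_pct):
    """Map a reduce percentage to the sub-phase described above.

    Boundary values (33 and 66) are assigned to the later phase;
    that is a choice made for this sketch.
    """
    if reduce_pct < 33:
        return "copy"    # copying map outputs
    if reduce_pct < 66:
        return "sort"    # sorting: most CPU, disk I/O and time
    return "output"      # writing the sorted data as the final result


def describe(line):
    """Return (map_pct, reduce_pct, phase), or None if no progress found."""
    match = PROGRESS.search(line)
    if match is None:
        return None
    map_pct, reduce_pct = int(match.group(1)), int(match.group(2))
    return map_pct, reduce_pct, reduce_phase(reduce_pct)
```

Feeding it each line of the log (for example from a tail of the log 
file) gives a quick answer to "which part of reduce is it stuck in".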

You can also run kill -SIGQUIT <pid> against the Java process to get a 
thread dump, which shows what the threads are really doing.
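
The same signal can be sent from a script, for example when you want 
to take a dump periodically. This is only a sketch for a POSIX system; 
the function name is made up for the example, and you have to supply 
the pid of the Java process yourself. A JVM normally answers SIGQUIT 
by printing the thread dump to its own standard output and carrying 
on, whereas an ordinary process is terminated by it, so make sure the 
pid really belongs to the JVM.

```python
import os
import signal


def request_thread_dump(pid):
    """Send SIGQUIT to pid, the equivalent of `kill -SIGQUIT <pid>`.

    Nothing is returned here: the dump, if any, appears wherever the
    target process writes its standard output.
    """
    os.kill(pid, signal.SIGQUIT)
```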

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
