Hello,
I'm trying to optimize Nutch crawling performance. Right now I'm testing on a 
small Hadoop cluster of just two nodes, each with 32 GB RAM and an Intel Xeon 
E3-1245 v2 CPU (4 cores / 8 threads).
My Nutch config: http://pastebin.com/bBRHpFuq
So, the problem: the fetch jobs are badly balanced. Some reduce tasks get about 
4,000 pages to fetch, while others get about 1,000,000. For example, see this 
screenshot: https://docs.google.com/file/d/0B98dgNxOqKMvT1doOVVPUU1PNXM/edit 
Some reduce tasks finished in 10 minutes, but one task has been running for 
11 hours and is still going, so it becomes a bottleneck: I have 24 reduce 
tasks, but effectively only one is working.
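
In case it helps diagnose this: I understand Nutch partitions fetch lists by host (or domain) by default, so a single host with a huge number of queued URLs can land entirely in one reducer. If that is the cause here, capping the number of URLs taken per host at generate time might spread the load; a sketch for nutch-site.xml, assuming the standard Nutch 1.x property names (the value 1000 is just an illustrative guess):

```xml
<!-- Cap how many URLs from one host/domain go into a single fetchlist. -->
<property>
  <name>generate.max.count</name>
  <value>1000</value>
</property>
<!-- Count the cap per host (could also be set to "domain"). -->
<property>
  <name>generate.count.mode</name>
  <value>host</value>
</property>
```

I haven't verified whether these settings apply to my setup, so corrections are welcome.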
Maybe someone can give useful advice, or links where I can read more about this problem.

Many thanks for any help,
Sergey
