First things first: I want to salute you all and thank you for developing a distributed engine such as Hadoop. It has certainly helped me at work. I am now writing an application that clusters users based on their historical behavior as consumers. For the clustering/classification algorithms I turned to Apache Mahout.
Here's the thing: I generated a fairly small dataset of about 62MB and set up a small cluster of 5 datanodes and a namenode/jobtracker (running on the same machine). Two of the datanodes have four-core processors and the remaining three have two cores each (fourteen cores in total), and I tend to think that's more than enough processing power to finish the task in a reasonable time, which is exactly what is not happening. Each MR job takes about 3 hours to complete, as shown by the jobtracker web UI:

Hadoop job_200907221734_0004 Finished in: 2hrs, 34mins, 3sec
Hadoop job_200907221734_0005 Finished in: 2hrs, 59mins, 34sec

The clustering algorithm runs several iterations of MR phases until it converges, and it takes more than 30 hours in total to complete. For such a small dataset this is unacceptable, and I'm quite sure it has something to do with my cluster configuration and/or how blocks and their sizes are treated in HDFS. Moreover (and this is quite puzzling to me), every core on every machine is running at full capacity almost constantly, and there doesn't seem to be any idle time in between tasks.

Here are my .xml conf files (just the relevant lines):

(mapred-site.xml)
<name>mapred.job.tracker</name>
<value>hdfs://hadoop-jobtracker:54311/</value>
<final>true</final>

<name>mapred.reduce.tasks</name>
<value>98</value>   <-- 1.75*14*4 (as suggested by Hadoop's documentation)

<name>mapred.tasktracker.reduce.tasks.maximum</name>
<value>4</value>

<name>mapred.tasktracker.map.tasks.maximum</name>
<value>4</value>

<name>mapred.map.tasks</name>
<value>17</value>   <-- With 4MB set as dfs.block.size, and having -put the 62MB dataset file with -D dfs.block.size=4194304, there should be ~16 map tasks spawned.

<name>mapred.tasktracker.tasks.maximum</name>
<value>20</value>

(hdfs-site.xml)
<name>dfs.replication</name>
<value>5</value>

<name>dfs.block.size</name>
<value>4194304</value>

(core-site.xml)
<name>hadoop.tmp.dir</name>
<value>/usr/local/hadoop/hadoop-datastore/hadoop-${user.name}</value>

<name>fs.default.name</name>
<value>hdfs://hadoop-namenode:54310/</value>
<final>true</final>

Of course, hadoop-namenode and hadoop-jobtracker are both defined in /etc/hosts and they both reference the same IP. No firewall is enabled on the network. There don't seem to be any errors in the datanode/jobtracker logs, either.

Is there something I should be taking into account that I currently am not? What could be the cause of such poor performance? Overhead due to copying small bits of data between the nodes, perhaps?

Any pointers would be greatly appreciated.

--
View this message in context: http://www.nabble.com/Hadoop-performance-using-Mahout-tp24626239p24626239.html
Sent from the Hadoop lucene-users mailing list archive at Nabble.com.
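
P.S. For reference, the arithmetic behind the task counts in the configuration can be sketched in a few lines of Python. This is only a back-of-the-envelope check, not anything Hadoop itself runs: the function names are made up for illustration, the 62MB figure is approximate, and the sketch assumes one map task per HDFS block. The second reduce count plugs in 5 (the number of tasktracker machines) rather than 14 as the node count, which is one possible reading of the documentation's formula; treat that as an assumption.

```python
import math

MB = 1024 * 1024


def expected_map_tasks(file_size_bytes, block_size_bytes):
    """Number of map tasks, assuming one input split per HDFS block."""
    return math.ceil(file_size_bytes / block_size_bytes)


def suggested_reduce_tasks(factor, nodes, reduce_slots_per_node):
    """The factor * (nodes * reduce slots per node) heuristic."""
    return round(factor * nodes * reduce_slots_per_node)


def waves(tasks, slots):
    """How many rounds are needed to run `tasks` on `slots` parallel slots."""
    return math.ceil(tasks / slots)


dataset_bytes = 62 * MB        # approximate dataset size from the post
block_bytes = 4194304          # dfs.block.size (4MB)

maps = expected_map_tasks(dataset_bytes, block_bytes)   # 16

reduces_as_posted = suggested_reduce_tasks(1.75, 14, 4)  # 98, counting 14 "nodes"
reduces_by_machine = suggested_reduce_tasks(1.75, 5, 4)  # 35, assuming 5 nodes

machines = 5
map_slots = machines * 4       # mapred.tasktracker.map.tasks.maximum = 4
reduce_slots = machines * 4    # mapred.tasktracker.reduce.tasks.maximum = 4
cores = 2 * 4 + 3 * 2          # two four-core and three two-core machines

print("map tasks:", maps)
print("reduce tasks (as posted):", reduces_as_posted)
print("reduce tasks (5 nodes):", reduces_by_machine)
print("map slots / reduce slots / cores:", map_slots, reduce_slots, cores)
print("reduce waves (as posted):", waves(reduces_as_posted, reduce_slots))
```

With these settings the cluster exposes 20 map slots and 20 reduce slots on 14 cores, and 98 reduce tasks need at least 5 waves on 20 reduce slots.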