First things first: I want to salute you all and thank you for developing a
distributed engine such as Hadoop; it has certainly helped me at work. I am
now writing an application that clusters users based on their historical
behavior as consumers, and for the clustering/classification algorithms I
turned to Apache Mahout.

Here's the thing: I generated a fairly small dataset of roughly 62MB and set
up a small cluster of 5 datanodes plus a namenode/jobtracker (running on the
same machine). Of the datanodes, two have four-core processors and the
remaining three are dual-core (fourteen cores in total)... and I tend to
think that's more than enough processing power to finish the task in a
reasonable amount of time, which is exactly what is not happening. Each MR
job is taking about 3 hours to complete, as shown by the jobtracker web UI:

Hadoop job_200907221734_0004
Finished in: 2hrs, 34mins, 3sec

Hadoop job_200907221734_0005
Finished in: 2hrs, 59mins, 34sec

The clustering algorithm runs several iterations of MR phases until it
converges, and it takes more than 30 hours in total to complete. For such a
small dataset this is unacceptable, and I'm quite sure it has something to
do with my cluster configuration and/or with how blocks and their sizes are
handled in HDFS. Moreover (and this is quite puzzling to me), every core on
every machine is running at full capacity almost constantly, and there
doesn't seem to be any idle time between tasks. Here are my .xml conf files
(just the relevant lines):

(mapred-site.xml)
<property>
  <name>mapred.job.tracker</name>
  <value>hdfs://hadoop-jobtracker:54311/</value>
  <final>true</final>
</property>

<property>
  <name>mapred.reduce.tasks</name>
  <!-- 98 = 1.75*14*4, as suggested by Hadoop's documentation -->
  <value>98</value>
</property>

<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>4</value>
</property>

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value>
</property>

<property>
  <name>mapred.map.tasks</name>
  <!-- With dfs.block.size set to 4MB and the 62MB dataset -put with
       -D dfs.block.size=4194304, there should be ~16 map tasks spawned. -->
  <value>17</value>
</property>

<property>
  <name>mapred.tasktracker.tasks.maximum</name>
  <value>20</value>
</property>
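
For reference, this is roughly how I -put the dataset into HDFS with the
smaller block size (the file name and destination path here are just
placeholders for the real ones):

hadoop fs -D dfs.block.size=4194304 -put dataset.csv /user/hadoop/input/dataset.csv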

(hdfs-site.xml)
<property>
  <name>dfs.replication</name>
  <value>5</value>
</property>

<property>
  <name>dfs.block.size</name>
  <value>4194304</value>
</property>
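
To sanity-check how the file was actually split, I can run fsck against it
(the path again is just an example); it should report around 16 blocks of
4MB, each replicated 5 times:

hadoop fsck /user/hadoop/input/dataset.csv -files -blocks -locations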

(core-site.xml)
<property>
  <name>hadoop.tmp.dir</name>
  <value>/usr/local/hadoop/hadoop-datastore/hadoop-${user.name}</value>
</property>

<property>
  <name>fs.default.name</name>
  <value>hdfs://hadoop-namenode:54310/</value>
  <final>true</final>
</property>

Of course, hadoop-namenode and hadoop-jobtracker are both defined in
/etc/hosts and both resolve to the same IP (a sample entry is included at
the end of this message). No firewall is enabled on the network. There
don't seem to be any errors in the datanode or jobtracker logs, either. Is
there something I should be taking into account that I am currently not?
What could be the cause of such poor performance? Overhead from copying
small bits of data between the nodes, perhaps? Any pointers would be
greatly appreciated.
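
For completeness, the relevant /etc/hosts entry looks roughly like this
(the IP address here is just a placeholder for the real one):

192.168.1.10    hadoop-namenode    hadoop-jobtracker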