Hi Satish
        Please find some pointers inline.

(a) How do we know the number of map tasks spawned? Can this be controlled?
We notice only 4 JVMs running on a single node - namenode, datanode,
jobtracker, tasktracker. As we understand it, one map task is spawned per
input split, so we should see that many additional JVMs.

[Bejoy] The namenode, datanode, jobtracker, tasktracker and secondaryNameNode
are the default Hadoop daemons; they run regardless of your jobs. Your own
map and reduce tasks are launched in separate JVMs on top of these. You can
control the maximum number of map tasks running on a tasktracker at any
instant by setting mapred.tasktracker.map.tasks.maximum. By default every
task (map or reduce) is executed in its own JVM, and the JVM is destroyed
once the task completes. You are right: by default one map task is launched
per input split.
Just check the jobtracker web UI (
http://nameNodeHostName:50030/jobtracker.jsp); it gives you all the details
on a job, including the number of map tasks spawned by it. If you want to
run multiple tasktracker and datanode instances on the same machine, you
need to ensure that there are no port conflicts.
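
If you want to see the split-to-map-task relationship from code, here is a
minimal sketch using the old mapred API; the class name SplitCount and the
input path argument are placeholders for whatever your job actually uses:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.InputSplit;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.TextInputFormat;

    public class SplitCount {
      public static void main(String[] args) throws Exception {
        // Picks up your Hadoop configuration (e.g. fs.default.name) from
        // the classpath, so the path below resolves against HDFS.
        JobConf conf = new JobConf(SplitCount.class);
        conf.setInputFormat(TextInputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        // One map task is launched per input split returned here
        // (the second argument is only a hint to the InputFormat).
        InputSplit[] splits = conf.getInputFormat().getSplits(conf, 1);
        System.out.println("input splits / map tasks: " + splits.length);
      }
    }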

(b) Our mapper class has to perform complex computations and has plenty of
dependent jars, so how do we add all those jars to the classpath while
running the application? Since we need to perform parallel computations, we
need many map tasks running in parallel on different data, all on the same
machine in different JVMs.

[Bejoy] If these dependent jars are used by almost all of your applications,
include them in the classpath of all your nodes (in your case just one
node). Alternatively you can use the -libjars option while submitting your
job, as in the sketch below. For more details refer to
http://www.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job/
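
For -libjars (and the other generic options) to be picked up, the job driver
has to go through GenericOptionsParser, which ToolRunner does for you. A
rough sketch with the old mapred API; MyJobDriver, the jar names and the
paths are placeholders:

    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class MyJobDriver extends Configured implements Tool {

      public int run(String[] args) throws Exception {
        // getConf() already contains whatever -libjars / -D options
        // ToolRunner parsed off the command line.
        JobConf conf = new JobConf(getConf(), MyJobDriver.class);
        conf.setJobName("my-job");
        // Plug in your own mapper/reducer classes here; the identity
        // mapper and reducer are used if you set nothing.
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
        return 0;
      }

      public static void main(String[] args) throws Exception {
        // Example invocation (jar names are placeholders):
        //   hadoop jar myjob.jar MyJobDriver -libjars dep1.jar,dep2.jar in out
        System.exit(ToolRunner.run(new MyJobDriver(), args));
      }
    }

The jars passed through -libjars are shipped via the distributed cache and
added to the classpath of every task JVM, so your mapper can use them on
any node.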

(c) How does the data split happen? JobClient does not seem to deal with
data splits. As we understand it, we format the distributed file system, run
start-all.sh and then "hadoop fs -put". Does this write data to all
datanodes? We are unable to see the physical location. How does the split
happen from this HDFS source?

[Bejoy] Input files are split into blocks while they are being copied into
HDFS; the size of each block is determined by the dfs.block.size setting of
your cluster (64 MB by default). The namenode decides which datanodes each
block and its replicas should be placed on, and these details are passed
back to the client. The client writes a block to one datanode, and from that
datanode the block is replicated to the other datanodes. This splitting of a
file into blocks happens at the HDFS client API level. The input splits that
determine the number of map tasks are computed separately by the job's
InputFormat at submission time; by default they line up with these HDFS
blocks.
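
If you want to see where the blocks of a file actually ended up, the HDFS
client API can tell you. A small sketch, assuming the file path you pass in
(for example /user/satish/input/data.txt, a placeholder) already exists in
HDFS:

    import java.util.Arrays;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShowBlocks {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);   // uses fs.default.name
        Path file = new Path(args[0]);          // e.g. /user/satish/input/data.txt
        FileStatus status = fs.getFileStatus(file);
        // Each BlockLocation is one HDFS block plus the datanodes holding it.
        BlockLocation[] blocks =
            fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
          System.out.println("offset " + block.getOffset()
              + ", length " + block.getLength()
              + ", hosts " + Arrays.toString(block.getHosts()));
        }
      }
    }

The same information is available from the command line with
hadoop fsck <path> -files -blocks -locations.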

(d) Can we control the number of reduce tasks? Is each one a separate JVM?
How are the optimal numbers of map and reduce tasks determined?

[Bejoy] You can control the number of reduce tasks running concurrently on a
tasktracker using mapred.tasktracker.reduce.tasks.maximum. You can control
the number of reduce tasks at the job level using mapred.reduce.tasks.
Unless you enable JVM reuse, all tasks are spawned in individual JVMs. The
optimal number of reduce tasks for your job is determined by the amount of
data that flows to your reducers, among other parameters. Make sure that
your tasks are not very short lived (a few seconds), as task initialization
itself is expensive. For more details refer to
http://wiki.apache.org/hadoop/HowManyMapsAndReduces
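
As a small illustration, inside a driver like the one sketched under (b) you
could set these per-job knobs; the value 4 is just an example, not a
recommendation:

    // Inside run() of the driver sketched under (b):
    conf.setNumReduceTasks(4);            // same effect as -Dmapred.reduce.tasks=4
    conf.setNumTasksToExecutePerJvm(-1);  // enable JVM reuse, -1 = no per-JVM limit
    // The per-tasktracker slot limits are cluster settings in mapred-site.xml:
    //   mapred.tasktracker.map.tasks.maximum
    //   mapred.tasktracker.reduce.tasks.maximum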


(e) Any good documentation/links that talk about the namenode, datanode,
jobtracker and tasktracker?

[Bejoy] Refer to the ASF documentation on Hadoop, http://wiki.apache.org/hadoop/ ,
the Yahoo Developer Network tutorial
http://developer.yahoo.com/hadoop/tutorial/module1.html ,
and for a one-stop reference get the book 'Hadoop - The Definitive Guide'
by Tom White.


Hope it helps

Regards
Bejoy.K.S
