Hi Satish

Please find some pointers inline.

(a) How do we know the number of map tasks spawned? Can this be controlled? We notice only 4 JVMs running on our single node - namenode, datanode, jobtracker, tasktracker. As we understand it, one map task is spawned per input split, so we would expect to see that many additional JVMs.
[Bejoy] The namenode, datanode, jobtracker, tasktracker and secondaryNameNode are the default Hadoop daemons; they are always present and have nothing to do with your tasks. Your map and reduce tasks are launched in separate JVMs: by default every task (map or reduce) runs in its own JVM, and the JVM is destroyed once the task completes. You can control the maximum number of map tasks running concurrently on each tasktracker by setting mapred.tasktracker.map.tasks.maximum. You are right that, by default, one map task is launched per input split. Check the jobtracker web UI (http://nameNodeHostName:50030/jobtracker.jsp); it gives you all the details of a job, including the number of map tasks it spawned. If you want to run multiple tasktracker and datanode instances on the same machine, you need to make sure there are no port conflicts. (A small driver sketch covering (a) and (d) is appended at the end of this mail.)

(b) Our mapper class performs complex computations and has plenty of dependent jars, so how do we add all of these jars to the classpath while running the application? Since we need to perform parallel computations, we need many map tasks running in parallel on different data, all on the same machine in different JVMs.

[Bejoy] If these dependent jars are used by almost all of your applications, include them in the classpath of all your nodes (in your case, just one node). Alternatively, you can use the -libjars option while submitting your job (a driver sketch that works with -libjars is appended below). For more details refer to http://www.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job/

(c) How does the data split happen? JobClient does not seem to deal with data splits. As we understand it, we format the distributed file system, run start-all.sh and then "hadoop fs -put". Does this write data to all datanodes? We are unable to see the physical location. How does the split happen from this HDFS source?

[Bejoy] Input files are split into blocks during the copy into HDFS itself; the size of each block is determined by your cluster's Hadoop configuration (dfs.block.size, 64 MB by default). The namenode decides which datanodes each block and its replicas should be placed on, and these details are passed back to the client. The client copies each block to one datanode, and from that datanode the block is replicated to the other datanodes. The splitting of a file happens at the HDFS API level. (A sketch for inspecting the physical block locations of a file is appended below.)

(d) Can we control the number of reduce tasks? Is this a separate JVM? How are the optimal numbers of map and reduce tasks determined?

[Bejoy] You can control the maximum number of reduce tasks running concurrently on a tasktracker using mapred.tasktracker.reduce.tasks.maximum, and the number of reduce tasks at job level using mapred.reduce.tasks. Unless you enable JVM reuse, every task is spawned in its own JVM. The optimal number of reduce tasks for your job depends on the amount of data that flows to your reducers, among other parameters. Make sure your tasks are not very short-lived (a few seconds), as task initialization itself is expensive. For more details refer to http://wiki.apache.org/hadoop/HowManyMapsAndReduces

(e) Any good documentation/links which talk about the namenode, datanode, jobtracker and tasktracker?

[Bejoy] Refer to the ASF documentation on Hadoop at http://wiki.apache.org/hadoop/ and the Yahoo Developer Network tutorial at http://developer.yahoo.com/hadoop/tutorial/module1.html, and for a one-stop reference get the book 'Hadoop: The Definitive Guide' by Tom White.

Hope it helps

Regards
Bejoy.K.S
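
P.S. A few sketches to make the pointers above concrete. They are only illustrative: they assume Hadoop 1.x with the old org.apache.hadoop.mapred API, and all class names, paths and values below are placeholders for your own.

For (a) and (d), a minimal driver showing the job-level knobs. Keep in mind that mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum are tasktracker-level settings that belong in mapred-site.xml rather than in the job, and that the map count set on the job is only a hint - the actual number follows the input splits.

    import java.io.IOException;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.IdentityMapper;
    import org.apache.hadoop.mapred.lib.IdentityReducer;

    public class TaskCountDriver {
      public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(TaskCountDriver.class);
        conf.setJobName("task-count-demo");

        // (d) job-level reduce count; same effect as -Dmapred.reduce.tasks=4
        conf.setNumReduceTasks(4);

        // (a) only a hint - the actual map count equals the number of input splits
        conf.setNumMapTasks(10);

        // let one JVM run up to 5 tasks of this job instead of one JVM per task
        conf.set("mapred.job.reuse.jvm.num.tasks", "5");

        // identity mapper/reducer just to keep the sketch self-contained;
        // replace with your own classes
        conf.setMapperClass(IdentityMapper.class);
        conf.setReducerClass(IdentityReducer.class);
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // after submission, the jobtracker UI shows the actual map/reduce task counts
        JobClient.runJob(conf);
      }
    }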
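
For (b): -libjars is handled by GenericOptionsParser, so the driver should be submitted through ToolRunner. A sketch, again with placeholder mapper/reducer classes:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.IdentityMapper;
    import org.apache.hadoop.mapred.lib.IdentityReducer;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class ComputeDriver extends Configured implements Tool {

      public int run(String[] args) throws Exception {
        // getConf() already carries whatever -libjars / -D options ToolRunner parsed
        JobConf conf = new JobConf(getConf(), ComputeDriver.class);
        conf.setJobName("compute");

        conf.setMapperClass(IdentityMapper.class);   // replace with your mapper
        conf.setReducerClass(IdentityReducer.class); // replace with your reducer
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
        return 0;
      }

      public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new ComputeDriver(), args));
      }
    }

You would then submit it with something like the following (the jar and library paths are made up):

    hadoop jar compute-job.jar ComputeDriver \
        -libjars /opt/libs/dep1.jar,/opt/libs/dep2.jar \
        /user/satish/input /user/satish/output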
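
For (c): the blocks do sit on local disk under each datanode's dfs.data.dir, and you can see exactly where each block of a file lives with "hadoop fsck /user/satish/input -files -blocks -locations" (path made up) or programmatically through the FileSystem API:

    import java.util.Arrays;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocations {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path(args[0]);              // a file already put into HDFS

        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (int i = 0; i < blocks.length; i++) {
          System.out.println("block " + i
              + " offset=" + blocks[i].getOffset()
              + " length=" + blocks[i].getLength()
              + " hosts=" + Arrays.toString(blocks[i].getHosts()));
        }
      }
    }

On a single-node cluster every block will of course report the same host. FileInputFormat then generates roughly one split per block, which is where the one-map-task-per-split figure in (a) comes from.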