Satish, can you post the values for all the parameters you get under Cluster Summary in your JobTracker web UI?
Regards
Bejoy.K.S

2012/1/12 Satish Setty (HCL Financial Services) <satish.se...@hcl.com>

> Hi Bejoy,
>
> I have put each line item in a separate file [created 10 files] and gave the
> directory as input. Now we have 10 map processes created, but at a time only
> 2 are running.
> Please see the start time / finish time. Is there any setting at cluster
> level? The map slot count says "2" and I am not sure where to change this.
> A lot of CPU is idle - out of 8 CPUs at least 2 are idle/under-utilized, so
> in theory all 10 map tasks should be running concurrently.
>
> Thanks
>
> Hadoop map task list for
> job_201201121044_0001 <http://192.168.60.12:50030/jobdetails.jsp?jobid=job_201201121044_0001>
> on localhost <http://192.168.60.12:50030/jobtracker.jsp>
> ------------------------------
> All Tasks
>
> Task                             Complete  Status                                                         Start Time            Finish Time                          Errors  Counters
> task_201201121044_0001_m_000000  100.00%   hdfs://localhost:9000/user/soruser/hellodir1/hello10.txt:0+8  12-Jan-2012 10:46:10  12-Jan-2012 10:48:55 (2mins, 45sec)           12
> task_201201121044_0001_m_000001  100.00%   hdfs://localhost:9000/user/soruser/hellodir1/hello1.txt:0+7   12-Jan-2012 10:46:10  12-Jan-2012 10:48:52 (2mins, 42sec)           12
> task_201201121044_0001_m_000002  100.00%   hdfs://localhost:9000/user/soruser/hellodir1/hello2.txt:0+7   12-Jan-2012 10:48:52  12-Jan-2012 10:51:34 (2mins, 42sec)           12
> task_201201121044_0001_m_000003  100.00%   hdfs://localhost:9000/user/soruser/hellodir1/hello3.txt:0+7   12-Jan-2012 10:48:55  12-Jan-2012 10:51:25 (2mins, 30sec)           12
> task_201201121044_0001_m_000004    0.00%   hdfs://localhost:9000/user/soruser/hellodir1/hello4.txt:0+7   12-Jan-2012 10:51:25                                                12
> task_201201121044_0001_m_000005    0.00%   hdfs://localhost:9000/user/soruser/hellodir1/hello5.txt:0+7   12-Jan-2012 10:51:34                                                12
> task_201201121044_0001_m_000006    0.00%                                                                                                                                      0
> task_201201121044_0001_m_000007    0.00%                                                                                                                                      0
> task_201201121044_0001_m_000008    0.00%                                                                                                                                      0
> task_201201121044_0001_m_000009    0.00%                                                                                                                                      0
> ------------------------------
> Go back to JobTracker <http://192.168.60.12:50030/jobtracker.jsp>
> ------------------------------
> This is Apache Hadoop <http://hadoop.apache.org/> release 0.20.203.0
>
> ------------------------------
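The task list above shows exactly two map tasks in flight at any time, which matches the shipped default of mapred.tasktracker.map.tasks.maximum = 2 per TaskTracker. A small sketch for checking the slot limits the local Hadoop configuration actually carries - the class name is invented here, and it assumes Hadoop 0.20.x with the cluster's conf directory on the classpath (e.g. run via the bin/hadoop launcher):

    import org.apache.hadoop.mapred.JobConf;

    // Prints the per-TaskTracker task-slot limits read from the local configuration.
    // Note: these properties are read by the TaskTracker at startup, so changing them
    // means editing mapred-site.xml on the TaskTracker node and restarting the daemon;
    // setting them inside a job has no effect on slot counts.
    public class PrintSlotConfig {
        public static void main(String[] args) {
            JobConf conf = new JobConf(); // loads core/mapred default and site files
            System.out.println("mapred.tasktracker.map.tasks.maximum    = "
                + conf.getInt("mapred.tasktracker.map.tasks.maximum", 2));
            System.out.println("mapred.tasktracker.reduce.tasks.maximum = "
                + conf.getInt("mapred.tasktracker.reduce.tasks.maximum", 2));
        }
    }

The Map Task Capacity figure under Cluster Summary (asked about at the top of this mail) is derived from the same map-slot setting, summed across the TaskTrackers that have joined the cluster.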
> *From:* Satish Setty (HCL Financial Services)
> *Sent:* Tuesday, January 10, 2012 8:57 AM
> *To:* Bejoy Ks
> *Cc:* mapreduce-user@hadoop.apache.org
> *Subject:* RE: hadoop
>
> Hi Bejoy,
>
> Thanks for the help. I changed the values
> mapred.min.split.size=0, mapred.max.split.size=40
> but the job counters do not reflect any other changes?
>
> For posting, kindly let me know the correct link/mail-id - at present I am
> sending directly to your account [Bejoy Ks, bejoy.had...@gmail.com], which
> has been a great help to me.
>
> Posting to the group account mapreduce-user@hadoop.apache.org bounces back.
>
> Counter                            Map     Reduce   Total
> File Input Format Counters
>   Bytes Read                        61          0      61
> Job Counters
>   SLOTS_MILLIS_MAPS                  0          0   3,886
>   Launched map tasks                 0          0       2
>   Data-local map tasks               0          0       2
> FileSystemCounters
>   HDFS_BYTES_READ                  267          0     267
>   FILE_BYTES_WRITTEN            58,134          0  58,134
> Map-Reduce Framework
>   Map output materialized bytes      0          0       0
>   Combine output records             0          0       0
>   Map input records                  9          0       9
>   Spilled Records                    0          0       0
>   Map output bytes                  70          0      70
>   Map input bytes                   54          0      54
>   SPLIT_RAW_BYTES                  206          0     206
>   Map output records                 7          0       7
>   Combine input records              0          0       0
> ------------------------------
> *From:* Bejoy Ks [bejoy.had...@gmail.com]
> *Sent:* Monday, January 09, 2012 11:13 PM
> *To:* Satish Setty (HCL Financial Services)
> *Cc:* mapreduce-user@hadoop.apache.org
> *Subject:* Re: hadoop
>
> Hi Satish
>      It would be good if you don't cross-post your queries. Just post it
> once on the right list.
>
>      What is your value for mapred.max.split.size? Try setting these
> values as well:
> mapred.min.split.size=0 (it is the default value)
> mapred.max.split.size=40
>
> Try executing your job once you apply these changes on top of the others
> you did.
>
> Regards
> Bejoy.K.S
>
> On Mon, Jan 9, 2012 at 5:09 PM, Satish Setty (HCL Financial Services) <
> satish.se...@hcl.com> wrote:
>
>> Hi Bejoy,
>>
>> Even with the settings below, map tasks never go beyond 2. Is there any
>> way to make this spawn 10 tasks? Basically it should look like a compute
>> grid - computation in parallel.
>>
>> <property>
>>   <name>io.bytes.per.checksum</name>
>>   <value>30</value>
>>   <description>The number of bytes per checksum. Must not be larger than
>>   io.file.buffer.size.</description>
>> </property>
>>
>> <property>
>>   <name>dfs.block.size</name>
>>   <value>30</value>
>>   <description>The default block size for new files.</description>
>> </property>
>>
>> <property>
>>   <name>mapred.tasktracker.map.tasks.maximum</name>
>>   <value>10</value>
>>   <description>The maximum number of map tasks that will be run
>>   simultaneously by a task tracker.</description>
>> </property>
>>
>> ------------------------------
>> *From:* Satish Setty (HCL Financial Services)
>> *Sent:* Monday, January 09, 2012 1:21 PM
>> *To:* Bejoy Ks
>> *Cc:* mapreduce-user@hadoop.apache.org
>> *Subject:* RE: hadoop
>>
>> Hi Bejoy,
>>
>> In HDFS I have set the block size to 40 bytes. The input data set is as
>> below:
>> data1 (5*8=40 bytes)
>> data2
>> ......
>> data10
>>
>> But still I see only 2 map tasks spawned; there should have been at least
>> 10 map tasks. Not sure how it works internally. Line feed does not work
>> [as you have explained below].
>>
>> Thanks
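For reference, a minimal driver sketch showing where the split-size values suggested above would sit at job level, assuming the old org.apache.hadoop.mapred API that ships with 0.20.x; the class name is invented and IdentityMapper stands in for the real mapper. Whether mapred.max.split.size takes effect depends on the input format - in 0.20.x it is generally honoured by the new-API FileInputFormat, while the old-API FileInputFormat mainly respects mapred.min.split.size and the block size:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.TextInputFormat;
    import org.apache.hadoop.mapred.lib.IdentityMapper;

    // Sketch only: a trivial map-only job carrying the split-size settings from the thread.
    public class SplitSizeDriver {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(SplitSizeDriver.class);
            conf.setJobName("split-size-demo");

            // Values suggested earlier in the thread; both are byte counts.
            conf.set("mapred.min.split.size", "0");  // 0 is the default
            conf.set("mapred.max.split.size", "40"); // upper bound on split size

            conf.setInputFormat(TextInputFormat.class);
            conf.setMapperClass(IdentityMapper.class); // placeholder for the real mapper
            conf.setNumReduceTasks(0);                 // map-only, no sort/shuffle
            conf.setOutputKeyClass(LongWritable.class);
            conf.setOutputValueClass(Text.class);

            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));
            JobClient.runJob(conf);
        }
    }

Even if the splits do multiply, how many of them run at the same time on one node is still governed by the TaskTracker's map slot count, not by the job.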
>> ------------------------------
>> *From:* Satish Setty (HCL Financial Services)
>> *Sent:* Saturday, January 07, 2012 9:17 PM
>> *To:* Bejoy Ks
>> *Cc:* mapreduce-user@hadoop.apache.org
>> *Subject:* RE: hadoop
>>
>> Thanks Bejoy - great information - will try it out.
>>
>> I meant, for the problem below, a single node with a high configuration ->
>> 8 CPUs and 8 GB memory. Hence the example of 10 data items with line
>> feeds. We want to utilize the full power of the machine - hence we want at
>> least 10 map tasks, each of which needs to perform a highly complex
>> mathematical simulation. At present it looks like file data is the only
>> way to specify the number of map tasks, via split size (in bytes) - but I
>> would prefer some criterion like a line feed or whatever.
>>
>> In the example below, 'data1' corresponds to 5*8=40 bytes, so if I have
>> data1 .... data10, in theory I should see 10 map tasks with a split size
>> of 40 bytes.
>>
>> How do I perform logging - where is the log (Apache logger) data written?
>> System.out output may not appear as it is a background process.
>>
>> Regards
>>
>> ------------------------------
>> *From:* Bejoy Ks [bejoy.had...@gmail.com]
>> *Sent:* Saturday, January 07, 2012 7:35 PM
>> *To:* Satish Setty (HCL Financial Services)
>> *Cc:* mapreduce-user@hadoop.apache.org
>> *Subject:* Re: hadoop
>>
>> Hi Satish
>>      Please find some pointers inline.
>>
>> Problem - As per the documentation, file splits correspond to the number
>> of map tasks. The file split is governed by the block size - 64 MB in
>> hadoop-0.20.203.0. Where can I find the default settings for various
>> parameters like block size and the number of map/reduce tasks?
>>
>> [Bejoy] I'd rather state it the other way round: the number of map tasks
>> triggered by an MR job is determined by the number of input splits (and
>> the input format). If you use TextInputFormat with default settings, the
>> number of input splits is equal to the number of HDFS blocks occupied by
>> the input. The size of an input split is equal to the HDFS block size by
>> default (64 MB). If you want more splits for one HDFS block itself, you
>> need to set a value less than 64 MB for mapred.max.split.size.
>>
>> You can find pretty much all default configuration values in the
>> downloaded .tar at
>> hadoop-0.20.*/src/mapred/mapred-default.xml
>> hadoop-0.20.*/src/hdfs/hdfs-default.xml
>> hadoop-0.20.*/src/core/core-default.xml
>>
>> If you want to alter some of these values, you can provide the same in
>> $HADOOP_HOME/conf/mapred-site.xml
>> $HADOOP_HOME/conf/hdfs-site.xml
>> $HADOOP_HOME/conf/core-site.xml
>>
>> The values provided in *-site.xml are taken into account only if the
>> corresponding property is not marked final in *-default.xml. If it is not
>> final, the value provided in *-site.xml overrides the value in
>> *-default.xml for that configuration parameter.
>>
>> I require at least 10 map tasks, which is the same as the number of "line
>> feeds". Each corresponds to a complex calculation to be done by a map
>> task, so I can have optimal CPU utilization - 8 CPUs.
>>
>> [Bejoy] Hadoop is a good choice for processing large amounts of data. It
>> is not wise to choose one mapper for one record/line in a file, as the
>> creation of a map task is itself expensive, with JVM spawning and all.
>> Currently you may have 10 records in your input, but I believe you are
>> just testing Hadoop in a dev environment; in production that wouldn't be
>> the case - there could be n files having m records each, and this m can be
>> in the millions (just assuming based on my experience).
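As an aside on the *-default.xml / *-site.xml precedence described a little earlier, a small sketch (class name invented; run it through the bin/hadoop launcher so $HADOOP_HOME/conf is on the classpath) that prints the values a job would actually see:

    import org.apache.hadoop.mapred.JobConf;

    // JobConf layers core-default.xml, core-site.xml, mapred-default.xml and mapred-site.xml;
    // a later resource overrides an earlier one unless the property was marked final there.
    public class ShowEffectiveConfig {
        public static void main(String[] args) {
            JobConf conf = new JobConf();
            System.out.println("io.bytes.per.checksum = " + conf.get("io.bytes.per.checksum"));
            System.out.println("io.file.buffer.size   = " + conf.get("io.file.buffer.size"));
            System.out.println("mapred.min.split.size = " + conf.get("mapred.min.split.size"));
        }
    }

If a value still shows what *-default.xml ships with, the override in the corresponding *-site.xml was either not picked up or the property is marked final.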
>> On larger data sets you may not need to split on line boundaries. There
>> can be multiple lines in a file, and if you use TextInputFormat it is just
>> one line processed by a map task at an instant. If you have n map tasks,
>> then n lines could be getting processed at an instant of the map task
>> execution time frame, one by each map task. On larger data volumes, map
>> tasks are spawned on specific nodes primarily based on data locality, then
>> on the available task slots on the data-local node, and so on. It is
>> possible that if you have a 10-node cluster and 10 HDFS blocks
>> corresponding to an input file, and all the blocks are present on only 8
>> nodes with sufficient task slots available on all 8, then the tasks for
>> your job may be executed on 8 nodes alone instead of 10. So there are
>> chances that there won't be 100% balanced CPU utilization across the nodes
>> in a cluster.
>>      I'm not really sure how you can spawn map tasks based on line feeds
>> in a file. Let us wait for others to comment on this.
>>      Also, if you are using MapReduce for parallel computation alone, then
>> make sure you set the number of reducers to zero; with that you can save a
>> lot of time that would otherwise be spent on the sort and shuffle phases
>> (-D mapred.reduce.tasks=0).
>>
>> The behaviour of map tasks looks strange to me, as sometimes if I set the
>> number of map tasks in the program via jobconf it takes 2 or 8.
>>
>> [Bejoy] There is no default value for the number of map tasks; it is
>> determined by the input splits and the input format used by your job. You
>> cannot set the number of map tasks - even if you set mapred.map.tasks at
>> your job level, it is not considered. But you can definitely specify the
>> number of reduce tasks at your job level with job.setNumReduceTasks(n) or
>> mapred.reduce.tasks. If not set, it would take the default value for
>> reduce tasks specified in the conf files.
>>
>> I see some files like part-00001...
>> Are they partitions?
>>
>> [Bejoy] The part-000* files correspond to reducers. You'd have n files if
>> you have n reducers, as one reducer produces one output file.
>>
>> Hope it helps!
>>
>> Regards
>> Bejoy.KS
>>
>> On Sat, Jan 7, 2012 at 3:32 PM, Satish Setty (HCL Financial Services) <
>> satish.se...@hcl.com> wrote:
>>
>>> Hi Bejoy,
>>>
>>> Just finished the installation and tested the sample applications.
>>>
>>> Problem - As per the documentation, file splits correspond to the number
>>> of map tasks. The file split is governed by the block size - 64 MB in
>>> hadoop-0.20.203.0. Where can I find the default settings for various
>>> parameters like block size and the number of map/reduce tasks?
>>>
>>> Is it possible to control the file split by "line feed - \n"? I tried
>>> giving sample input -> jobconf -> TextInputFormat
>>>
>>> date1
>>> date2
>>> date3
>>> .......
>>> ......
>>> date10
>>>
>>> But when I run it I see that the number of map tasks is 2 or 1.
>>> I require at least 10 map tasks, which is the same as the number of
>>> "line feeds". Each corresponds to a complex calculation to be done by a
>>> map task, so I can have optimal CPU utilization - 8 CPUs.
>>>
>>> The behaviour of map tasks looks strange to me, as sometimes if I set
>>> the number of map tasks in the program via jobconf it takes 2 or 8. I see
>>> some files like part-00001...
>>> Are they partitions?
>>>
>>> Thanks
>>> ------------------------------
>>> *From:* Satish Setty (HCL Financial Services)
>>> *Sent:* Friday, January 06, 2012 12:29 PM
>>> *To:* bejoy.had...@gmail.com
>>> *Subject:* FW: hadoop
>>>
>>> Thanks Bejoy. Extremely useful information. We will try it and come
>>> back. The web application [jobtracker web UI] - does this require
>>> deployment, or does an application server container come built in with
>>> Hadoop?
>>>
>>> Regards
>>>
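Coming back to the "one map task per line feed" requirement that runs through this thread: one option, offered only as a sketch and not as something settled above, is the old-API NLineInputFormat (org.apache.hadoop.mapred.lib), combined with the map-only setting (-D mapred.reduce.tasks=0) suggested earlier. The class name and the IdentityMapper stand-in are placeholders:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.IdentityMapper;
    import org.apache.hadoop.mapred.lib.NLineInputFormat;

    // Sketch: roughly one map task per input line, run as a map-only job.
    public class PerLineDriver {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(PerLineDriver.class);
            conf.setJobName("one-map-per-line");

            conf.setInputFormat(NLineInputFormat.class);
            conf.setInt("mapred.line.input.format.linespermap", 1); // one line per split

            conf.setMapperClass(IdentityMapper.class); // stand-in for the simulation mapper
            conf.setNumReduceTasks(0);                 // skip sort/shuffle entirely
            conf.setOutputKeyClass(LongWritable.class);
            conf.setOutputValueClass(Text.class);

            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));
            JobClient.runJob(conf);
        }
    }

Even with one split per line, the number of map tasks running concurrently on a single node is still capped by mapred.tasktracker.map.tasks.maximum on that TaskTracker, which is consistent with the two-at-a-time behaviour seen at the top of this thread.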
>>> ------------------------------
>>> *From:* Bejoy Ks [bejoy.had...@gmail.com]
>>> *Sent:* Friday, January 06, 2012 12:54 AM
>>> *To:* mapreduce-user@hadoop.apache.org
>>> *Subject:* Re: hadoop
>>>
>>> Hi Satish
>>>      Please find some pointers inline.
>>>
>>> (a) How do we know the number of map tasks spawned? Can this be
>>> controlled? We notice only 4 JVMs running on a single node - namenode,
>>> datanode, jobtracker, tasktracker. As we understand it, one map task is
>>> spawned per split, so we should see that many more JVMs.
>>>
>>> [Bejoy] namenode, datanode, jobtracker, tasktracker and secondaryNameNode
>>> are the default Hadoop processes; they do not depend on your tasks. Your
>>> custom tasks are launched in separate JVMs. You can control the maximum
>>> number of mappers on each tasktracker at an instant by setting
>>> mapred.tasktracker.map.tasks.maximum. By default all tasks (map or
>>> reduce) are executed in individual JVMs, and once a task is completed its
>>> JVM is destroyed. You are right, by default one map task is launched per
>>> input split.
>>> Just check the jobtracker web UI
>>> (http://nameNodeHostName:50030/jobtracker.jsp); it gives you all the
>>> details on the job, including the number of map tasks spawned by a job.
>>> If you want to run multiple tasktracker and datanode instances on the
>>> same machine, you need to ensure that there are no port conflicts.
>>>
>>> (b) Our mapper class has to perform complex computations - it has plenty
>>> of dependent jars, so how do we add all the jars to the classpath while
>>> running the application? Since we need to perform parallel computations,
>>> we need many map tasks running in parallel with different data, all on
>>> the same machine in different JVMs.
>>>
>>> [Bejoy] If these dependent jars are used by almost all your applications,
>>> include them in the classpath of all your nodes (in your case just one
>>> node). Alternatively, you can use the -libjars option while submitting
>>> your job. For more details refer to
>>> http://www.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job/
>>>
>>> (c) How does the data split happen? JobClient does not talk about data
>>> splits? As we understand it, we format the distributed file system, run
>>> start-all.sh and then "hadoop fs -put". Does this write data to all
>>> datanodes? We are unable to see the physical location. How does the split
>>> happen from this HDFS source?
>>>
>>> [Bejoy] Input files are split into blocks during the copy into HDFS
>>> itself; the size of each block is determined by the Hadoop configuration
>>> of your cluster. The namenode decides which datanodes each block (and its
>>> replicas) is to be placed on, and these details are passed on to the
>>> client. The client copies a block to one datanode, and from that datanode
>>> the block is replicated to the other datanodes. The splitting of a file
>>> happens at the HDFS API level.
>>>
>>> thanks
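On the -libjars option mentioned under (b): it is generally only picked up when the job driver lets Hadoop parse the generic command-line options, for example by implementing Tool and going through ToolRunner. A minimal sketch; the class, jar and path names are invented for illustration and IdentityMapper stands in for the real mapper:

    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.IdentityMapper;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    // Sketch: a Tool-based driver so that generic options such as -libjars are honoured.
    public class LibJarsDriver extends Configured implements Tool {
        @Override
        public int run(String[] args) throws Exception {
            JobConf conf = new JobConf(getConf(), LibJarsDriver.class);
            conf.setJobName("libjars-demo");
            conf.setMapperClass(IdentityMapper.class); // placeholder mapper
            conf.setNumReduceTasks(0);
            conf.setOutputKeyClass(LongWritable.class);
            conf.setOutputValueClass(Text.class);
            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));
            JobClient.runJob(conf);
            return 0;
        }

        public static void main(String[] args) throws Exception {
            // ToolRunner strips generic options (-libjars, -D, -conf, ...) before calling run().
            System.exit(ToolRunner.run(new LibJarsDriver(), args));
        }
    }

An invocation would then look something like "hadoop jar myjob.jar LibJarsDriver -libjars dep1.jar,dep2.jar /input /output" (jar, class and path names here are placeholders).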