Sorry for the blunder, guys.

Warm Regards,
Tariq
cloudfront.blogspot.com

On Sun, May 12, 2013 at 5:39 PM, Mohammad Tariq <donta...@gmail.com> wrote:

> @Rahul : I'm sorry, but I am not aware of any such document. You could,
> however, use distcp to copy from the local file system to HDFS :
>
> bin/hadoop distcp file:///home/tariq/in.txt hdfs://localhost:9000/
>
> And yes, when you use distcp from local to HDFS, you lose the benefit of
> parallelism because the source data is not stored in a distributed fashion.
>
> Warm Regards,
> Tariq
> cloudfront.blogspot.com
>
>
> On Sat, May 11, 2013 at 11:07 PM, Mohammad Tariq <donta...@gmail.com> wrote:
>
>> Hello guys,
>>
>> My 2 cents :
>>
>> Actually, the no. of mappers is primarily governed by the no. of InputSplits
>> created by the InputFormat you are using, and the no. of reducers by the no.
>> of partitions you get after the map phase. Having said that, you should
>> also keep the no. of slots available per slave in mind, along with the
>> available memory. But as a general rule you could use this approach :
>>
>> Take the no. of virtual CPUs * 0.75, and that's the no. of slots you can
>> configure. For example, if you have 12 physical cores (or 24 virtual
>> cores), you would have (24 * 0.75) = 18 slots. Now, based on your requirement,
>> you could choose how many mappers and reducers you want to use. With 18 MR
>> slots, you could have 9 mappers and 9 reducers, or 12 mappers and 6 reducers,
>> or whatever split you think is OK for you.
>>
>> I don't know if it makes much sense, but it works pretty decently for me.
>>
>> Warm Regards,
>> Tariq
>> cloudfront.blogspot.com
>>
>>
>> On Sat, May 11, 2013 at 8:57 PM, Rahul Bhattacharjee <
>> rahul.rec....@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I am also new to the Hadoop world. Here is my take on your question; if
>>> something is missing, others will surely correct it.
>>>
>>> For pre-YARN Hadoop, the slots are fixed and computed based on the
>>> processing capacity of the datanode hardware. Once the slots per datanode
>>> are determined, they are divided into map and reduce slots; that goes
>>> into the config files and remains fixed until changed. In YARN, it is
>>> decided at runtime based on the requirements of the particular task. It is
>>> quite possible that one datanode is running 10 tasks at a certain point in
>>> time while another similar datanode is running only 4.
>>>
>>> Coming to your question: the number of map tasks is decided based on the
>>> data set size, the DFS block size and the input format. Generally, for
>>> file-based InputFormats it is one mapper per data block; however, there are
>>> ways to change this using configuration settings. Reduce tasks are set using
>>> the job configuration.
>>>
>>> The general rule, as I have read in various documents, is that mappers
>>> should run for at least a minute, so you can run a sample to find a good
>>> data block size that makes your mappers run for more than a minute. It
>>> again depends on your SLA: if you are not looking for a very tight SLA,
>>> you can choose to run fewer mappers at the expense of a higher runtime.
>>>
>>> But again, it's all theory; I am not sure how these things are handled in
>>> actual prod clusters.
>>>
>>> HTH,
>>>
>>> Thanks,
>>> Rahul
>>>
>>>
>>> On Sat, May 11, 2013 at 8:02 PM, Shashidhar Rao <
>>> raoshashidhar...@gmail.com> wrote:
>>>
>>>> Hi Users,
>>>>
>>>> I am new to Hadoop and confused about task slots in a cluster. How
>>>> would I know how many task slots are required for a job? Is there an
>>>> empirical formula, or on what basis should I set the number of task slots?
>>>>
>>>> Thanks in advance
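
For anyone who wants to play with the rules of thumb in this thread, here is a minimal Python sketch of the arithmetic. It is only an illustration, not anything Hadoop provides: the function names are made up for this example, the 0.75 factor is the rule of thumb quoted above, and the 64 MB block size and 1 GB input are assumed example values. The mapper estimate assumes the "one mapper per data block" case for file-based input formats.

```python
def slots_per_node(virtual_cores, factor=0.75):
    """Rule of thumb from the thread: slots = virtual cores * 0.75.

    Hypothetical helper; the result is rounded down to a whole slot.
    """
    return int(virtual_cores * factor)


def split_slots(total_slots, map_fraction=0.5):
    """Divide a node's slots into (map_slots, reduce_slots).

    map_fraction is an assumed tuning knob, not a Hadoop setting.
    The two parts always add up to total_slots.
    """
    map_slots = round(total_slots * map_fraction)
    return map_slots, total_slots - map_slots


def estimated_map_tasks(input_bytes, block_bytes):
    """Estimate map tasks as one mapper per data block (ceiling division)."""
    return -(-input_bytes // block_bytes)


if __name__ == "__main__":
    MB = 1024 * 1024

    total = slots_per_node(24)           # 24 virtual cores, as in the thread
    print("slots per node:", total)      # 18
    print("even split:", split_slots(total))            # (9, 9)
    print("map-heavy split:", split_slots(total, 2 / 3))  # (12, 6)

    # Assumed example: 1 GB of input with a 64 MB block size.
    print("map tasks:", estimated_map_tasks(1024 * MB, 64 * MB))  # 16
```

The split is kept as a fraction on purpose, since the thread leaves the map/reduce ratio up to the requirements of the job.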