@Rahul: Sorry, I am not aware of any such document. But you could use
distcp for a local-to-HDFS copy:
bin/hadoop distcp file:///home/tariq/in.txt hdfs://localhost:9000/

And yes, when you use distcp from local to HDFS, you don't get the
benefit of parallelism, since the source data is not stored in a
distributed fashion.

Warm Regards,
Tariq
cloudfront.blogspot.com


On Sat, May 11, 2013 at 11:07 PM, Mohammad Tariq <donta...@gmail.com> wrote:

> Hello guys,
>
>              My 2 cents:
>
> Actually, the number of mappers is primarily governed by the number of
> InputSplits created by the InputFormat you are using, and the number of
> reducers by the number of partitions you get after the map phase. Having
> said that, you should also keep the number of slots available per slave in
> mind, along with the available memory. But as a general rule you could use
> this approach:
>
> Take the number of virtual CPUs times 0.75, and that's the number of slots
> you can configure. For example, if you have 12 physical cores (24 virtual
> cores), you would have 24 * 0.75 = 18 slots. Now, based on your
> requirements, you can choose how many mappers and reducers you want to
> use. With 18 MR slots, you could have 9 mappers and 9 reducers, or 12
> mappers and 6 reducers, or whatever split suits you.
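>
> To make this concrete, here is a minimal sketch of what the corresponding
> mapred-site.xml entries could look like on each TaskTracker (the property
> names are the standard MR1 ones; the 9/9 split is purely illustrative):
>
>   <property>
>     <name>mapred.tasktracker.map.tasks.maximum</name>
>     <value>9</value>
>   </property>
>   <property>
>     <name>mapred.tasktracker.reduce.tasks.maximum</name>
>     <value>9</value>
>   </property>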
>
> I don't know if it makes much sense, but it has served me pretty well.
>
> Warm Regards,
> Tariq
> cloudfront.blogspot.com
>
>
> On Sat, May 11, 2013 at 8:57 PM, Rahul Bhattacharjee <
> rahul.rec....@gmail.com> wrote:
>
>> Hi,
>>
>> I am also new to the Hadoop world, but here is my take on your question;
>> if something is missing, others will surely correct it.
>>
>> Pre-YARN, the slots are fixed: they are computed from the crunching
>> capacity of the datanode hardware, divided into map and reduce slots,
>> written into the config files, and remain fixed until changed. In YARN,
>> this is decided at runtime based on the requirements of each particular
>> task. It is quite possible that at a certain point in time one datanode is
>> running 10 tasks while another, similar datanode is running only 4.
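>>
>> Under YARN, for example, you request resources per job rather than
>> relying on fixed slots. A minimal sketch, assuming the standard MR2
>> property names (the memory values are just placeholders):
>>
>>   import org.apache.hadoop.conf.Configuration;
>>
>>   Configuration conf = new Configuration();
>>   conf.setInt("mapreduce.map.memory.mb", 1024);    // container memory per map task
>>   conf.setInt("mapreduce.reduce.memory.mb", 2048); // container memory per reduce task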
>>
>> Coming to your question: the number of map tasks is decided based on the
>> size of the data set, the DFS block size, and the InputFormat. For
>> file-based InputFormats it is generally one mapper per data block, though
>> there are ways to change this through configuration settings. The number
>> of reduce tasks is set in the job configuration.
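>>
>> For example, the reducer count is set explicitly on the job object
>> (standard MapReduce API; the count of 9 is just an illustration):
>>
>>   import org.apache.hadoop.conf.Configuration;
>>   import org.apache.hadoop.mapreduce.Job;
>>
>>   Job job = Job.getInstance(new Configuration(), "my-job");
>>   job.setNumReduceTasks(9); // reducers come from job config, not from the data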
>>
>> A general rule I have read in various documents is that mappers should
>> run for at least a minute, so you can run a sample to find a data block
>> size that makes your mappers run longer than that. It again depends on
>> your SLA; if you are not chasing a very tight SLA, you can choose to run
>> fewer mappers at the expense of a higher runtime.
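>>
>> For instance, with file-based InputFormats you can raise the minimum
>> split size so each mapper processes more data (a sketch; the 256 MB
>> figure is arbitrary, tune it against the one-minute rule of thumb):
>>
>>   import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
>>
>>   // each split, and hence each mapper, now gets at least 256 MB
>>   FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024);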
>>
>> But again, this is all theory; I am not sure how these things are handled
>> in actual production clusters.
>>
>> HTH,
>>
>>
>>
>> Thanks,
>> Rahul
>>
>>
>> On Sat, May 11, 2013 at 8:02 PM, Shashidhar Rao <
>> raoshashidhar...@gmail.com> wrote:
>>
>>> Hi Users,
>>>
>>> I am new to Hadoop and confused about task slots in a cluster. How
>>> would I know how many task slots are required for a job? Is there an
>>> empirical formula, or on what basis should I set the number of task
>>> slots?
>>>
>>> Thanks in advance
>>>
>>
>>
>
