RE: does hadoop always respect setNumReduceTasks?

2012-03-10 Thread WangRamon

Jane, i think you have mapred.tasktracker.reduce.tasks.maximum or
mapred.reduce.tasks set to 1 on your local cluster, and set to some other
values on EMR; that's why you always get one reducer locally but not on EMR.

Cheers,
Ramon
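
(To make the job-level intent explicit: a minimal driver-side sketch, assuming
the 0.20 mapreduce Job API -- the class name and job name here are made up:)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class IdAssignDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "assign-word-ids"); // hypothetical job name
    job.setNumReduceTasks(1); // overrides any mapred.reduce.tasks value for this job
    // (mapper/reducer classes, input/output paths etc. omitted in this sketch)
    // mapred.tasktracker.reduce.tasks.maximum, by contrast, is a per-node cap on
    // concurrent reduce tasks and can only be set cluster-side in mapred-site.xml.
  }
}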
  Date: Thu, 8 Mar 2012 21:30:26 -0500
 Subject: does hadoop always respect setNumReduceTasks?
 From: jane.wayne2...@gmail.com
 To: common-user@hadoop.apache.org
 
 i am wondering if hadoop always respects Job.setNumReduceTasks(int)?
 
 as i am emitting items from the mapper, i expect/desire only 1 reducer to
 get these items, because i want to assign each key of the key-value input
 pairs a unique integer id. if i had 1 reducer, i could just keep a local
 counter (local to the reducer instance) and increment it.
 
 on my local hadoop cluster, i noticed that most, if not all, of my jobs have
 only 1 reducer, regardless of whether or not i set
 Job.setNumReduceTasks(int).
 
 however, as soon as i moved the code onto amazon's elastic mapreduce (emr),
 i noticed that there are multiple reducers. if i set the number of reduce
 tasks to 1, is this always guaranteed? i ask because i don't know if there
 is a gotcha like the combiner (where it may or may not run at all).
 
 also, it looks like having just 1 reducer might not be a good idea (it
 won't scale). it is most likely better if there is more than 1 reducer, but
 in that case, i lose the ability to assign unique numbers to the key-value
 pairs coming in. is there a design pattern out there that addresses this
 issue?
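 
 (One common answer, sketched below with made-up class and field names against
 the 0.20 mapreduce API: give each of the N reducers its own disjoint ID
 stream, so reducer p emits p, p+N, p+2N, ... and the IDs are globally unique
 with no coordination, though not contiguous.)
 
 import java.io.IOException;
 import org.apache.hadoop.io.IntWritable;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.mapreduce.Reducer;
 
 public class StridedIdReducer extends Reducer<Text, IntWritable, IntWritable, Text> {
   private int nextId; // next ID this reducer will hand out
   private int stride; // total number of reducers N
 
   @Override
   protected void setup(Context context) {
     stride = context.getNumReduceTasks();
     nextId = context.getTaskAttemptID().getTaskID().getId(); // this reducer's partition p
   }
 
   @Override
   protected void reduce(Text word, Iterable<IntWritable> freqs, Context context)
       throws IOException, InterruptedException {
     context.write(new IntWritable(nextId), word); // emit (id, word)
     nextId += stride;                             // advance to this reducer's next slot
   }
 }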
 
 my mapper/reducer key-value pair signatures look something like the
 following.
 
 mapper(Text, Text, Text, IntWritable)
 reducer(Text, IntWritable, IntWritable, Text)
 
 the mapper reads a sequence file whose key-value pairs are of type Text and
 Text. i then emit Text (let's say a word) and IntWritable (let's say
 frequency of the word).
 
 the reducer gets the word and its frequencies, and then assigns the word an
 integer id. it emits IntWritable (the id) and Text (the word).
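 
 (For concreteness, the single-reducer version of that is just a counter held
 in the reducer instance; a minimal sketch with a hypothetical class name,
 safe only in combination with job.setNumReduceTasks(1):)
 
 import java.io.IOException;
 import org.apache.hadoop.io.IntWritable;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.mapreduce.Reducer;
 
 public class SingleReducerIdAssigner extends Reducer<Text, IntWritable, IntWritable, Text> {
   private int nextId = 0; // only one reducer instance exists, so this is globally unique
 
   @Override
   protected void reduce(Text word, Iterable<IntWritable> freqs, Context context)
       throws IOException, InterruptedException {
     context.write(new IntWritable(nextId++), word); // emit (id, word)
   }
 }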
 
 i remember seeing code from mahout's API where they assign integer ids to
 items. the items were already given an id of type long. the conversion they
 make is as follows.
 
 public static int idToIndex(long id) {
  return 0x7FFFFFFF & ((int) id ^ (int) (id >>> 32));
 }
 
 is there something equivalent for Text or a word? i was thinking about
 simply taking the hash value of the string/word, but of course, different
 strings can map to the same hash value.
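 
 (The direct analogue for a word would just clear the sign bit of the string
 hash -- a sketch with a made-up method name; Text is org.apache.hadoop.io.Text,
 and as noted above it inherits the same collision risk:)
 
 // hashCode() is already an int, so no XOR-fold is needed; just mask the
 // sign bit to keep the index non-negative. collisions remain possible.
 public static int wordToIndex(Text word) {
   return 0x7FFFFFFF & word.toString().hashCode();
 }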
  

how to add more than one user to hadoop with DFS permissions?

2012-03-10 Thread Austin Chungath
I have a 2 node cluster running hadoop 0.20.205. There is only one user,
username hadoop in group hadoop.
What is the easiest way to add one more user, say hadoop1, with DFS
permissions set as true?

I did the following to create a user in the master node.
sudo adduser --ingroup hadoop hadoop1

My aim is to have hadoop run in such a way that each user's input and output
data is accessible only to the owner (chmod 700).
I have played around with the configuration properties for some time now but
to no end.

It would be great if someone could tell me which configuration file
properties I should change to achieve this.

Thanks,
Austin


Re: how to add more than one user to hadoop with DFS permissions?

2012-03-10 Thread Harsh J
Austin,

1. Enable HDFS Permissions. In hdfs-site.xml, set dfs.permissions as true.

2. To commission any new user, as HDFS admin (the user who runs the
NameNode process), run:
hadoop fs -mkdir /user/username
hadoop fs -chown username:username /user/username

3. For default file/dir permissions to be 700, tweak the dfs.umaskmode property.
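
For example, a minimal hdfs-site.xml sketch covering steps 1 and 3 (the umask
value 077 is illustrative; it restricts new files and directories to the
owner, i.e. 700 for dirs and 600 for files):

<property>
  <name>dfs.permissions</name>
  <value>true</value>
</property>
<property>
  <name>dfs.umaskmode</name>
  <value>077</value>
</property>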

Much of this is also documented at the permissions guide:
http://hadoop.apache.org/common/docs/r0.20.2/hdfs_permissions_guide.html




-- 
Harsh J


Re: how to add more than one user to hadoop with DFS permissions?

2012-03-10 Thread Austin Chungath
Thanks Harsh :)




Re: mapred.tasktracker.map.tasks.maximum not working

2012-03-10 Thread Mohit Anchlia
Thanks. Looks like some parameters can be set at the client level while
others need a cluster-wide setting. Is there a place where I can see all the
config parameters, with a description of which can be changed at the client
level vs. at the cluster level?

On Fri, Mar 9, 2012 at 10:39 PM, bejoy.had...@gmail.com wrote:

 Adding on to Chen's response.

 This is a setting made at the Task Tracker level (an environment setting
 based on parameters like your CPU cores, memory, etc.), and you need to
 override it in each task tracker's mapred-site.xml and restart the TT daemon
 for the changes to take effect.

 Regards
 Bejoy K S

 From handheld, Please excuse typos.

 -Original Message-
 From: Chen He airb...@gmail.com
 Date: Fri, 9 Mar 2012 20:16:23
 To: common-user@hadoop.apache.org
 Reply-To: common-user@hadoop.apache.org
 Subject: Re: mapred.tasktracker.map.tasks.maximum not working

 Setting mapred.tasktracker.map.tasks.maximum in your job means nothing,
 because the Hadoop MapReduce platform only checks this parameter when it
 starts. It is a system configuration.

  You need to set it in your conf/mapred-site.xml file and restart your
 hadoop mapreduce daemons.
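 
 For instance, a sketch of that tasktracker-level entry in conf/mapred-site.xml
 (the value 2 simply mirrors the job setting quoted below; choose it per node
 based on cores and memory):
 
 <property>
   <name>mapred.tasktracker.map.tasks.maximum</name>
   <value>2</value>
 </property>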


 On Fri, Mar 9, 2012 at 7:32 PM, Mohit Anchlia mohitanch...@gmail.com
 wrote:

  I have mapred.tasktracker.map.tasks.maximum set to 2 in my job and I have
  5 nodes. I was expecting this to give only 10 concurrent map tasks, but I
  have 30 mappers running. Does hadoop ignore this setting when supplied
  from the job?
 




Why most of the free reduce slots are NOT used for my Hadoop Jobs? Thanks.

2012-03-10 Thread WangRamon

Sorry for the duplicate; i sent this mail to the map/reduce mailing list but
haven't gotten any useful response yet, so i think maybe i can get some
suggestions here. thanks.

Hi All
 
I'm using Hadoop-0.20-append. The cluster contains 3 nodes, and each node has
14 map and 14 reduce slots. Here is the configuration:
 
 
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>14</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>14</value>
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>73</value>
</property>

 
When I submit 5 Jobs simultaneously (the input data for each job is not so
big for the test, about 2~5M in size), I assume the Jobs will use the slots
as much as possible. Each Job did create 73 Reduce Tasks as configured above,
so there will be 5 * 73 Reduce Tasks in total, but most of them are in the
pending state; only about 12 of them are running, which is very small
compared to the total number of reduce slots, 42, for the 3-node cluster.
 
What is interesting is that it is always about 12 of them running; I tried a
few times.
 
So I thought it might be because of the scheduler. I changed it to the Fair
Scheduler and created 3 pools; the configuration is as below:
 
<?xml version="1.0"?>
<allocations>
 <pool name="pool-a">
  <minMaps>14</minMaps>
  <minReduces>14</minReduces>
  <weight>1.0</weight>
 </pool>
 <pool name="pool-b">
  <minMaps>14</minMaps>
  <minReduces>14</minReduces>
  <weight>1.0</weight>
 </pool>
 <pool name="pool-c">
  <minMaps>14</minMaps>
  <minReduces>14</minReduces>
  <weight>1.0</weight>
 </pool>
</allocations>
 
Then I submitted the 5 Jobs simultaneously to these pools randomly again. I
can see the jobs were assigned to different pools, but it's still the same
problem: only about 12 of the reduce tasks across the different pools are
running. Here is the output I copied from the Fair Scheduler monitor GUI:
 
Pool    Running Jobs  Min Maps  Min Reduces  Running Maps  Running Reduces
pool-a  2             14        14           0             9
pool-b  0             14        14           0             0
pool-c  2             14        14           0             3
 
pool-a and pool-c have a total of 12 reduce tasks running, but I have at
least about 11 more reduce slots available in my cluster.
 
So can anyone please give me some suggestions: why are NOT all my REDUCE
SLOTS working, and why is it always about 12? Thanks in advance. Btw, here is
the information from the job tracker GUI:
Cluster Summary (Heap Size is 481.88 MB/1.74 GB)
Maps: 0, Reduces: 6, Total Submissions: 11, Nodes: 3, Map Task Capacity: 42,
Reduce Task Capacity: 42, Avg. Tasks/Node: 28.00, Blacklisted Nodes: 0

 
Cheers 
Ramon