RE: does hadoop always respect setNumReduceTasks?
Jane, I think you have mapred.tasktracker.reduce.tasks.maximum or mapred.reduce.tasks set to 1 locally, and set to some other value on EMR. That's why you always get one reducer locally but not on EMR.

Cheers,
Ramon

Date: Thu, 8 Mar 2012 21:30:26 -0500
Subject: does hadoop always respect setNumReduceTasks?
From: jane.wayne2...@gmail.com
To: common-user@hadoop.apache.org

I am wondering: does Hadoop always respect Job.setNumReduceTasks(int)?

As I emit items from the mapper, I expect/desire only 1 reducer to get these items, because I want to assign each key of the key-value input pair a unique integer id. If I had 1 reducer, I could just keep a local counter (with respect to the reducer instance) and increment it.

On my local Hadoop cluster, I noticed that most, if not all, of my jobs have only 1 reducer, regardless of whether or not I set Job.setNumReduceTasks(int). However, as soon as I moved the code onto Amazon's Elastic MapReduce (EMR), I noticed that there are multiple reducers.

If I set the number of reduce tasks to 1, is this always guaranteed? I ask because I don't know if there is a gotcha like the combiner (which may or may not run at all).

Also, it looks like having just 1 reducer might not be a good idea (it won't scale). It is most likely better to have more than 1 reducer, but in that case I lose the ability to assign unique numbers to the incoming key-value pairs. Is there a design pattern out there that addresses this issue?

My mapper/reducer key-value pair signatures look something like the following:

  mapper(Text, Text, Text, IntWritable)
  reducer(Text, IntWritable, IntWritable, Text)

The mapper reads a sequence file whose key-value pairs are of type Text and Text. I then emit Text (let's say a word) and IntWritable (let's say the frequency of the word). The reducer gets the word and its frequencies, and then assigns the word an integer id. It emits IntWritable (the id) and Text (the word).

I remember seeing code in Mahout's API where they assign integer ids to items. The items were already given an id of type long, and the conversion they make is as follows:

  public static int idToIndex(long id) {
    return 0x7FFFFFFF & ((int) id ^ (int) (id >>> 32));
  }

Is there something equivalent for Text or a word? I was thinking about simply taking the hash value of the string/word, but of course, different strings can map to the same hash value.
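One well-known pattern for the design question above (unique ids with more than one reducer): stride a per-reducer counter by the number of reducers and offset it by the reducer's partition number, so ids from different reducers can never collide. Below is a minimal sketch against the new org.apache.hadoop.mapreduce API; the class name UniqueIdReducer is illustrative, not something from the thread.

  import java.io.IOException;

  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Reducer;

  public class UniqueIdReducer extends Reducer<Text, IntWritable, IntWritable, Text> {

    private int numReducers; // stride between consecutive ids from this reducer
    private int nextId;      // next id this reducer will hand out

    @Override
    protected void setup(Context context) {
      numReducers = context.getNumReduceTasks();
      // Start at this reducer's partition number: with R reducers,
      // reducer 0 emits 0, R, 2R, ... and reducer 1 emits 1, R+1, 2R+1, ...
      nextId = context.getTaskAttemptID().getTaskID().getId();
    }

    @Override
    protected void reduce(Text word, Iterable<IntWritable> frequencies, Context context)
        throws IOException, InterruptedException {
      context.write(new IntWritable(nextId), word);
      nextId += numReducers; // ids from different reducers never overlap
    }
  }

Ids produced this way are globally unique, though not contiguous. If contiguity does not matter and occasional collisions are tolerable, the String analogue of Mahout's idToIndex is simply 0x7FFFFFFF & word.hashCode(), with exactly the collision caveat raised above.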
how to add more than one user to hadoop with DFS permissions?
I have a 2-node cluster running Hadoop 0.20.205. There is only one user, username: hadoop, in group: hadoop. What is the easiest way to add one more user, say hadoop1, with DFS permissions set to true? I did the following to create the user on the master node:

  sudo adduser --ingroup hadoop hadoop1

My aim is to have Hadoop run in such a way that each user's input and output data is accessible only to its owner (chmod 700). I have played around with the configuration properties for some time now, but to no avail. It would be great if someone could tell me which configuration file properties I should change to achieve this.

Thanks,
Austin
Re: how to add more than one user to hadoop with DFS permissions?
Austin,

1. Enable HDFS permissions. In hdfs-site.xml, set dfs.permissions to true.

2. To commission any new user, as the HDFS admin (the user who runs the NameNode process), run:

  hadoop fs -mkdir /user/username
  hadoop fs -chown username:username /user/username

3. For default file/dir permissions to be 700, tweak the dfs.umaskmode property.

Much of this is also documented in the permissions guide:
http://hadoop.apache.org/common/docs/r0.20.2/hdfs_permissions_guide.html

On Sat, Mar 10, 2012 at 9:59 PM, Austin Chungath austi...@gmail.com wrote: [...]

--
Harsh J
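For reference, a minimal sketch of the hdfs-site.xml entries behind steps 1 and 3. The 077 umask here is an illustrative choice, not from the thread, and the NameNode must be restarted to pick these up:

  <property>
    <name>dfs.permissions</name>
    <value>true</value>
  </property>
  <property>
    <!-- umask 077: new directories default to 700, new files to 600 -->
    <name>dfs.umaskmode</name>
    <value>077</value>
  </property>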
Re: how to add more than one user to hadoop with DFS permissions?
Thanks Harsh :)

On Sat, Mar 10, 2012 at 10:12 PM, Harsh J ha...@cloudera.com wrote: [...]
Re: mapred.tasktracker.map.tasks.maximum not working
Thanks. It looks like there are some parameters I can set at the client (job) level, while others need a cluster-wide setting. Is there a place where I can see all the config parameters, with a description of which can be changed at the client level vs. at the cluster level?

On Fri, Mar 9, 2012 at 10:39 PM, bejoy.had...@gmail.com wrote:

Adding on to Chen's response. This is a setting meant at the TaskTracker level (an environment setting based on parameters like your CPU cores, memory, etc.), and you need to override it in each TaskTracker's mapred-site.xml and restart the TT daemon for the changes to take effect.

Regards
Bejoy K S

From handheld, please excuse typos.

-----Original Message-----
From: Chen He airb...@gmail.com
Date: Fri, 9 Mar 2012 20:16:23
To: common-user@hadoop.apache.org
Reply-To: common-user@hadoop.apache.org
Subject: Re: mapred.tasktracker.map.tasks.maximum not working

Setting mapred.tasktracker.map.tasks.maximum in your job means nothing, because the Hadoop MapReduce platform only checks this parameter when it starts. It is a system configuration: you need to set it in your conf/mapred-site.xml file and restart your Hadoop MapReduce daemons.

On Fri, Mar 9, 2012 at 7:32 PM, Mohit Anchlia mohitanch...@gmail.com wrote:

I have mapred.tasktracker.map.tasks.maximum set to 2 in my job, and I have 5 nodes. I was expecting at most 10 concurrent map tasks, but I have 30 mappers running. Does Hadoop ignore this setting when it is supplied from the job?
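To make the distinction concrete: a slot-count setting like this one lives in conf/mapred-site.xml on every TaskTracker node and is only read when the TT daemon starts, so changing it requires a TT restart. A minimal sketch (the value 2 is illustrative):

  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>2</value>
  </property>

Per-job properties such as mapred.reduce.tasks, by contrast, are read from the submitted job's configuration, which is why they can be set from the client.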
Why are most of the free reduce slots NOT used for my Hadoop jobs? Thanks.
Sorry for the duplicate; I sent this mail to the map/reduce mailing list but haven't got any useful response yet, so I think maybe I can get some suggestions here. Thanks.

Hi All,

I'm using Hadoop-0.20-append. The cluster contains 3 nodes; on each node I have 14 map and 14 reduce slots. Here is the configuration:

  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>14</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>14</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>73</value>
  </property>

When I submit 5 jobs simultaneously (the input data for each job is not so big for the test, about 2~5 MB in size), I assume the jobs will use the slots as much as possible. Each job did create 73 reduce tasks as configured above, so there are 5 * 73 reduce tasks in total. But most of them are in the pending state; only about 12 are running, which is very small compared to the total number of reduce slots (42) in the 3-node cluster. What's interesting is that it is always about 12 of them running; I tried a few times.

So I thought it might be because of the scheduler. I changed it to the Fair Scheduler and created 3 pools. The configuration is as below:

  <?xml version="1.0"?>
  <allocations>
    <pool name="pool-a">
      <minMaps>14</minMaps>
      <minReduces>14</minReduces>
      <weight>1.0</weight>
    </pool>
    <pool name="pool-b">
      <minMaps>14</minMaps>
      <minReduces>14</minReduces>
      <weight>1.0</weight>
    </pool>
    <pool name="pool-c">
      <minMaps>14</minMaps>
      <minReduces>14</minReduces>
      <weight>1.0</weight>
    </pool>
  </allocations>

Then I submitted the 5 jobs simultaneously to these pools randomly again. I can see the jobs were assigned to different pools, but it's still the same problem: only about 12 of the reduce tasks, across the different pools, are running. Here is the output I copied from the Fair Scheduler monitor GUI:

  pool-a  2  14  14  0  9
  pool-b  0  14  14  0  0
  pool-c  2  14  14  0  3

pool-a and pool-c have a total of 12 reduce tasks running, but I do have at least about 11 more reduce slots available in my cluster. So can anyone please give me some suggestions: why are NOT all my REDUCE SLOTS working, and why is it always the number 12? Thanks in advance.

Btw, here is the information from the job tracker GUI:

  Cluster Summary (Heap Size is 481.88 MB/1.74 GB)
  Maps: 0    Reduces: 6    Total Submissions: 11    Nodes: 3
  Map Task Capacity: 42    Reduce Task Capacity: 42    Avg. Tasks/Node: 28.00    Blacklisted Nodes: 0

Cheers,
Ramon