Map and Reduce numbers are not restricted by setNumMapTasks and setNumReduceTasks, JobConf related?
Dear all,

Sorry, I did not mean to cross-post, but the previous article was accidentally posted to the HBase user list. I would like to bring it back to the Hadoop user list, since it is confusing me a lot and it is mainly MapReduce related.

I am currently running hadoop-0.18.1 on 25 nodes; the Map and Reduce Task Capacity is 92. This is what I do in my MapReduce program:

= SAMPLE CODE =

    JobConf jconf = new JobConf(conf, TestTask.class);
    jconf.setJobName("my.test.TestTask");
    jconf.setOutputKeyClass(Text.class);
    jconf.setOutputValueClass(Text.class);
    jconf.setOutputFormat(TextOutputFormat.class);
    jconf.setMapperClass(MyMapper.class);
    jconf.setCombinerClass(MyReducer.class);
    jconf.setReducerClass(MyReducer.class);
    jconf.setInputFormat(TextInputFormat.class);
    try {
      jconf.setNumMapTasks(5);
      jconf.setNumReduceTasks(3);
      JobClient.runJob(jconf);
    } catch (Exception e) {
      e.printStackTrace();
    }

= = =

When I run the job, I always get 300 mappers and 1 reducer on the JobTracker web page (running on the default port 50030). No matter what numbers I pass to setNumMapTasks and setNumReduceTasks, I get the same result. Does anyone know why this is happening? Am I missing or misunderstanding something in the picture? =(

For reference, these are the parameters we have overridden in hadoop-site.xml:

    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>4</value>
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>4</value>
    </property>

All other parameters are defaults from hadoop-default.xml. Any idea how this is happening? Any input is appreciated.

Thanks,
-Andy
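A note on why setNumMapTasks often has no visible effect: in Hadoop of this era it is only a hint, and the actual number of map tasks is derived from the number of input splits computed by the InputFormat (roughly one per HDFS block for TextInputFormat). The sketch below is a simplified stand-in for FileInputFormat's split-size math, not the actual Hadoop source: the requested map count only sets a goal size, which is then clamped by the block size, so ~300 blocks of input yield 300 splits even when 5 maps are requested. (setNumReduceTasks normally does take effect, so a single reducer is worth double-checking separately, e.g. that the JobConf passed to JobClient.runJob is the one that was configured.)

```java
public class SplitMath {
    // Simplified approximation of the split-size computation used by
    // 0.18-era FileInputFormat (not the actual Hadoop source): the
    // requested map count only sets a goal size, which is clamped by
    // the block size before splits are counted.
    static long splitCount(long totalSize, long blockSize, long minSize, int requestedMaps) {
        long goalSize = totalSize / Math.max(requestedMaps, 1);
        long splitSize = Math.max(minSize, Math.min(goalSize, blockSize));
        return (totalSize + splitSize - 1) / splitSize; // ceiling division
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // 300 blocks of 64 MB input, 5 maps requested: still 300 splits.
        System.out.println(splitCount(300 * 64 * mb, 64 * mb, 1, 5)); // prints 300
    }
}
```

With the numbers from the post (an input of roughly 300 blocks), this reproduces the "always 300 mappers" behavior regardless of the requested count.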
Re: [some bugs] Re: file permission problem
I think this is the same problem discussed in this mail thread: http://www.mail-archive.com/[EMAIL PROTECTED]/msg02759.html

A JIRA has been filed; please see HADOOP-2915.

On Fri, Mar 14, 2008 at 2:08 AM, Stefan Groschupf [EMAIL PROTECTED] wrote:

Hi, is there any magic we can do with hadoop.dfs.umask? Or is there any other off switch for the file security? Thanks. Stefan

On Mar 13, 2008, at 11:26 PM, Stefan Groschupf wrote:

Hi Nicholas, hi all,

I can definitely reproduce the problem Johannes describes. From debugging through the code, it is clearly a bug from my point of view. This is the call stack:

    SequenceFile.createWriter
      FileSystem.create
        DFSClient.create
          namenode.create

In NameNode I found this:

    namesystem.startFile(src,
        new PermissionStatus(Server.getUserInfo().getUserName(), null, masked),
        clientName, clientMachine, overwrite, replication, blockSize);

In getUserInfo is this comment:

    // This is to support local calls (as opposed to rpc ones) to the name-node.
    // Currently it is name-node specific and should be placed somewhere else.
    try {
      return UnixUserGroupInformation.login();

The login javadoc says:

    /**
     * Get current user's name and the names of all its groups from Unix.
     * It's assumed that there is only one UGI per user. If this user already
     * has a UGI in the ugi map, return the ugi in the map.
     * Otherwise get the current user's information from Unix, store it
     * in the map, and return it.
     */

Besides that, I made an interesting observation: if I have permission to write to a folder A, I can delete folder A, and also a file B inside A, even if I have no permissions on B itself.
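As a side note on that last observation: being able to delete file B given only write permission on its parent folder A actually matches plain POSIX semantics, where unlink requires write permission on the directory, not on the file itself (absent the sticky bit). A quick local-filesystem check of this, independent of HDFS (plain java.io, run as a non-root user; file names are made up for the demo):

```java
import java.io.File;
import java.io.IOException;

public class ParentDirDelete {
    public static void main(String[] args) throws IOException {
        File a = new File(System.getProperty("java.io.tmpdir"), "permtest-A");
        a.mkdirs();
        File b = new File(a, "B");
        b.createNewFile();
        // Remove all permissions on the file itself.
        b.setReadable(false, false);
        b.setWritable(false, false);
        // The delete still succeeds: only write permission on the
        // parent directory A matters, as with POSIX unlink.
        System.out.println("deleted B: " + b.delete());
        a.delete();
    }
}
```

So the folder-A/file-B behavior by itself may be by design; the ownership mix-up in the listing below is the part that looks like a genuine bug.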
Also I noticed the following in my dfs:

    [EMAIL PROTECTED] hadoop]$ bin/hadoop fs -ls /user/joa23/myApp-1205474968598
    Found 1 items
    /user/joa23/myApp-1205474968598/VOICE_CALL  dir  2008-03-13 16:00  rwxr-xr-x  hadoop  supergroup
    [EMAIL PROTECTED] hadoop]$ bin/hadoop fs -ls /user/joa23/myApp-1205474968598/VOICE_CALL
    Found 1 items
    /user/joa23/myApp-1205474968598/VOICE_CALL/part-0  r 3  27311  2008-03-13 16:00  rw-r--r--  joa23  supergroup

Am I missing something, or was I able to write as user joa23 into a folder owned by hadoop where I should have no permissions? :-O Should I open some JIRA issues?

Stefan

On Mar 13, 2008, at 10:55 AM, [EMAIL PROTECTED] wrote:

Hi Johannes,

 i'm using the 0.16.0 distribution.

I assume you mean the 0.16.0 release (http://hadoop.apache.org/core/releases.html) without any additional patch. I just tried it but cannot reproduce the problem you described. I did the following:

1) start a cluster as user tsz
2) run a job as user nicholas

The output directory and files are owned by nicholas. Am I doing the same thing you did? Could you try again?

Nicholas

----- Original Message -----
From: Johannes Zillmann [EMAIL PROTECTED]
To: core-user@hadoop.apache.org
Sent: Wednesday, March 12, 2008 5:47:27 PM
Subject: file permission problem

Hi,

I have a question regarding file permissions. I have a kind of workflow where I submit a job from my laptop to a remote Hadoop cluster. After the job finishes, I do some file operations on the generated output. The cluster user is different from the laptop user. As output I specify a directory inside the user's home. This output directory, created by the map-reduce job, has cluster-user permissions, so it does not allow me to move or delete the output folder as my laptop user.
So it looks as follows:

    /user/jz/         rwxrwxrwx  jz      supergroup
    /user/jz/output   rwxr-xr-x  hadoop  supergroup

I tried different things to achieve what I want (moving/deleting the output folder):
- jobConf.setUser("hadoop") on the client side
- System.setProperty("user.name", "hadoop") before JobConf instantiation on the client side
- adding a user.name node to hadoop-site.xml on the client side
- setPermission(777) on the home folder on the client side (does not work recursively)
- setPermission(777) on the output folder on the client side (permission denied)
- creating the output folder before running the job ("Output directory already exists" exception)

None of the things I tried worked. Is there a way to achieve what I want? Any ideas appreciated!

cheers
Johannes

--
~~~~~~~~~~~~~~~~~~~~~~~~~~~
101tec GmbH
Halle (Saale), Saxony-Anhalt, Germany
http://www.101tec.com

~~~~~~~~~~~~~~~~~~~~~~~~~~~
101tec Inc.
Menlo Park, California, USA
http://www.101tec.com
Re: long write operations and data recovery
What about a hot standby namenode?

Regarding a write-ahead log to survive crash and recovery: I think this is fine for small I/O. For large volumes, the write-ahead log will itself consume a lot of the system's I/O resources, since it means two I/Os per block (the log plus the actual data). This falls back to how current database designs implement recovery and crash handling.

Another thing I don't see in the picture is how Hadoop manages file-system instructions. Each OS implements its file system differently, and I believe that calling 'write' or 'flush' does not necessarily flush the data all the way to disk. I am not sure whether this is inevitable and OS dependent, but I cannot find any documents describing how Hadoop handles this.

P.S. I handle HA and fail-over mechanisms in my own application, but I think for a framework this should be transparent (or semi-transparent) to the user.

-annndy

On Fri, Feb 29, 2008 at 1:54 PM, Joydeep Sen Sarma [EMAIL PROTECTED] wrote:

I would agree with Ted. You should easily be able to get 100 MBps write throughput on a standard Netapp box (with read bandwidth left over, since the peak write throughput rating is more than twice that). Even at an average write throughput of 50 MBps, the daily data volume would be (drumroll...) 4+ TB! So buffer to a decent box and copy stuff over.

-----Original Message-----
From: Ted Dunning [mailto:[EMAIL PROTECTED]]
Sent: Friday, February 29, 2008 11:33 AM
To: core-user@hadoop.apache.org
Subject: Re: long write operations and data recovery

Unless your volume is MUCH higher than ours, I think you can get by with a relatively small farm of log consolidators that collect and concatenate files. If each log line is 100 bytes after compression (that is huge, really) and you have 10,000 events per second (also pretty danged high), then you are only writing 1 MB/s. If you need a day of buffering (~100,000 seconds), then you need 100 GB of buffer storage. These are very, very moderate requirements for your ingestion point.
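Ted's sizing numbers check out; as a sanity check of the arithmetic, using the message's own assumptions of 100-byte compressed lines, 10,000 events/s, and roughly 100,000 seconds of buffering per day:

```java
public class BufferSizing {
    public static void main(String[] args) {
        long bytesPerLine = 100;       // generous post-compression line size
        long eventsPerSec = 10000;     // high event rate
        long bufferSeconds = 100000;   // roughly one day

        long ingestRate = bytesPerLine * eventsPerSec;   // bytes per second
        long bufferBytes = ingestRate * bufferSeconds;   // total buffer needed

        System.out.println(ingestRate / (1000 * 1000) + " MB/s ingest");        // 1 MB/s
        System.out.println(bufferBytes / (1000L * 1000 * 1000) + " GB buffer"); // 100 GB
    }
}
```

So even with deliberately pessimistic inputs, the ingestion point only needs about 1 MB/s of write bandwidth and 100 GB of buffer space.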
On 2/29/08 11:18 AM, Steve Sapovits [EMAIL PROTECTED] wrote:

Ted Dunning wrote:

 In our case, we looked at the problem and decided that Hadoop wasn't feasible for our real-time needs in any case. There were several issues. First of all, map-reduce itself didn't seem very plausible for real-time applications. That left hbase and hdfs as the capabilities offered by hadoop (for real-time stuff).

We'll be using map-reduce batch mode, so we're okay there.

 The upshot is that we use hadoop extensively for batch operations, where it really shines. The other nice effect is that we don't have to worry all that much about HA (at least not real-time HA), since we don't do real-time with hadoop.

What I'm struggling with is the write side of things. We'll have a huge amount of data to write that's essentially a log format. It would seem that writing it outside of HDFS and then trying to batch-import it would be a losing battle: you would need the distributed nature of HDFS to do very large-volume writes directly, and wouldn't easily be able to take some other flat storage model and feed it in as a secondary step without having the HDFS side start to lag behind. The realization is that the Name Node could go down, so we'll have to have a backup store that might be used during temporary outages, but most of the writes would be direct HDFS updates. The alternative would seem to be ending up with a set of distributed files without a unifying distributed file system (e.g., lots of Apache web logs on many, many individual boxes) and then having to come up with some way to funnel those back into HDFS.
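One concrete shape for that "funnel" step is the log-consolidator approach Ted mentions: concatenate many small local logs into one large file, then push it into HDFS in a single operation (e.g. a single hadoop fs -put). A minimal sketch of the concatenation half in plain java.io, with no Hadoop dependency; the class and file names here are hypothetical:

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class LogConcatenator {
    // Append each input file to the output in order, producing one
    // large file suitable for a single bulk copy into HDFS.
    public static void concatenate(File[] inputs, File output) throws IOException {
        OutputStream out = new BufferedOutputStream(new FileOutputStream(output));
        try {
            byte[] buf = new byte[64 * 1024];
            for (File in : inputs) {
                InputStream is = new BufferedInputStream(new FileInputStream(in));
                try {
                    int n;
                    while ((n = is.read(buf)) > 0) {
                        out.write(buf, 0, n);
                    }
                } finally {
                    is.close();
                }
            }
        } finally {
            out.close();
        }
    }
}
```

A small fleet of boxes running something like this, each flushing its consolidated file into HDFS on a schedule, keeps the per-writer HDFS load down to the modest rates computed above.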
Re: Questions regarding configuration parameters...
Try these 2 parameters to utilize all the cores per node/host:

    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>7</value>
      <description>The maximum number of map tasks that will be run
      simultaneously by a task tracker.
      </description>
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>7</value>
      <description>The maximum number of reduce tasks that will be run
      simultaneously by a task tracker.
      </description>
    </property>

The default value is 2, so you might see only 2 cores used by Hadoop per node/host. If each machine has 4 cores (dual dual-core), then you can change them to 3. Hope this works for you.

-Andy

On Wed, Feb 20, 2008 at 9:30 AM, C G [EMAIL PROTECTED] wrote:

Hi All:

The documentation for the configuration parameters mapred.map.tasks and mapred.reduce.tasks discusses these values in terms of the number of available hosts in the grid. This description strikes me as a bit odd, given that a host could be anything from a uniprocessor to an N-way box, where N could vary from 2 to 16 or more. The documentation is also vague about computing the actual value; for example, for mapred.map.tasks the doc says "a prime number several times greater". I'm curious how people are interpreting these descriptions and what values they are using. Specifically, I'm wondering if I should be using core count instead of host count to set these values.

In the specific case of my system, we have 24 hosts where each host is a 4-way system (i.e., 96 cores total). For mapred.map.tasks I chose the value 173, as that is a prime number near 7*24. For mapred.reduce.tasks I chose 23, since that is a prime number close to 24. Is this what was intended?

Beyond curiosity, I'm concerned about setting these values and other configuration parameters correctly, because I am pursuing some performance issues where it is taking a very long time to process small amounts of data. I am hoping that some amount of tuning will resolve the problems.
Any thoughts and insights most appreciated.

Thanks,
C G
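For what it's worth, the prime-number heuristic C G describes can be checked mechanically. The sketch below is just a primality helper (nothing from Hadoop) confirming that 173 is the smallest prime at or above 7*24 = 168 and that 23 is itself prime; whether host count or core count is the right base for the multiplier is a separate question:

```java
public class PrimePick {
    static boolean isPrime(int n) {
        if (n < 2) return false;
        for (int d = 2; (long) d * d <= n; d++) {
            if (n % d == 0) return false;
        }
        return true;
    }

    // Smallest prime greater than or equal to n.
    static int nextPrime(int n) {
        while (!isPrime(n)) n++;
        return n;
    }

    public static void main(String[] args) {
        System.out.println(nextPrime(7 * 24)); // 173, the mapred.map.tasks value chosen
        System.out.println(nextPrime(23));     // 23, the prime just below the 24-host count
    }
}
```

So both chosen values are consistent with the "prime near a multiple of the cluster size" reading of the docs.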