Re: Regarding Capacity Scheduler
Manish, the pre-emption code in the capacity scheduler was found to need a thorough review and, owing to the inherent complexity of the problem, is likely to have issues of the kind you have noticed. We have decided to rework the pre-emption code from scratch, and to this effect have removed it from the 0.20 branch to start afresh.
Thanks
Hemanth

Manish Katyal wrote:
I'm experimenting with the Capacity scheduler (0.19.0) in a multi-cluster environment. I noticed that, unlike the mappers, the reducers are not pre-empted. I have two queues (high and low), each running big jobs (70+ maps each). The scheduler splits the mappers as per the queue guaranteed capacity (5/8ths for the high queue and the rest for the low queue). However, the reduces are not interleaved -- the reduces of the job in the high queue are blocked waiting for the reduces of the job in the low queue to complete. Is this a bug or by design?

*Low queue:*
Guaranteed Capacity (%) : 37.5
Guaranteed Capacity Maps : 3
Guaranteed Capacity Reduces : *3*
User Limit : 100
Reclaim Time limit : 300
Number of Running Maps : 3
Number of Running Reduces : *7*
Number of Waiting Maps : 131
Number of Waiting Reduces : 0
Priority Supported : NO

*High queue:*
Guaranteed Capacity (%) : 62.5
Guaranteed Capacity Maps : 5
Guaranteed Capacity Reduces : 5
User Limit : 100
Reclaim Time limit : 300
Number of Running Maps : 4
Number of Running Reduces : *0*
Number of Waiting Maps : 68
Number of Waiting Reduces : *7*
Priority Supported : NO
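For context, a hedged sketch of how two such queues might be declared for the capacity scheduler of that era; the values mirror the queue status above, but the file locations and property names are assumptions and should be checked against the capacity scheduler documentation for the release in use:

  <!-- hadoop-site.xml (assumed): enable the scheduler and name the queues -->
  <property>
    <name>mapred.jobtracker.taskScheduler</name>
    <value>org.apache.hadoop.mapred.CapacityTaskScheduler</value>
  </property>
  <property>
    <name>mapred.queue.names</name>
    <value>high,low</value>
  </property>

  <!-- capacity-scheduler.xml (assumed): per-queue guaranteed capacity in percent -->
  <property>
    <name>mapred.capacity-scheduler.queue.high.guaranteed-capacity</name>
    <value>62.5</value>
  </property>
  <property>
    <name>mapred.capacity-scheduler.queue.low.guaranteed-capacity</name>
    <value>37.5</value>
  </property>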
Re: HOD questions
Craig,

While HOD does not do this automatically, please note that since you are bringing up a Map/Reduce cluster on the allocated nodes, you can supply Map/Reduce parameters with which to bring up the cluster when allocating it. The relevant option is --gridservice-mapred.server-params (or -M in shorthand). Please refer to http://hadoop.apache.org/core/docs/r0.19.0/hod_user_guide.html#Options+for+Configuring+Hadoop for details.

I was aware of this, but the issue is that unless you obtain dedicated nodes (as above), this option is not suitable, as it isn't set on a per-node basis. I think it would be /fairly/ straightforward to add to HOD, as I detailed in my initial email, so that it "does the correct thing" out of the box.

True, I did assume you obtained dedicated nodes. It has been fairly simple to operate HOD in this manner, and if I understand correctly, it would also help solve your requirement.

According to hadoop-default.xml, the number of maps is "Typically set to a prime several times greater than number of available hosts." Say we relax this recommendation to read "Typically set to a NUMBER several times greater than number of available hosts" -- would it then be straightforward for HOD to set it automatically?

Actually, AFAIK, the number of maps for a job is determined more or less exclusively by the M/R framework, based on the number of splits. I've seen messages on this list before about how the documentation for this configuration item is misleading. So this might actually not make a difference at all, whatever is specified.

Thanks
Hemanth
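For illustration, a hedged sketch of an allocation that passes such a server-side parameter; the cluster directory, node count, and the particular parameter are hypothetical, and the exact option syntax should be verified against the user guide linked above:

  # bring up the allocated Map/Reduce daemons with one map slot per tasktracker
  hod allocate -d ~/hod-cluster -n 8 \
      --gridservice-mapred.server-params mapred.tasktracker.map.tasks.maximum=1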
Re: HOD questions
Craig,

Hello, we have two HOD questions:

(1) For our current Torque PBS setup, the number of nodes requested by HOD (-l nodes=X) corresponds to the number of CPUs allocated; however, these CPUs can be spread across various partially occupied or empty nodes. Unfortunately, HOD does not appear to honour the number of processors actually allocated by Torque PBS to that job.

Just FYI, at Yahoo! we've set Torque up to allocate separate nodes for the number specified to HOD. In other words, the number corresponds to the number of nodes, not processors. This has proved simpler to manage. I forget right now, but I think you can make Torque behave like this (i.e. not treat processors as individual nodes).

For example, a currently running HOD session can be viewed in qstat as:

  104544.trmaster user parallel HOD 4178 8 ---- 288:0 R 01:48
  node29/2+node29/1+node29/0+node17/2+node17/1+node18/2+node18/1+node19/1

However, on inspection of the Jobtracker UI, it tells us that node19 has "Max Map Tasks" and "Max Reduce Tasks" both set to 2, when I think node19 should be allowed only one map task.

While HOD does not do this automatically, please note that since you are bringing up a Map/Reduce cluster on the allocated nodes, you can supply Map/Reduce parameters with which to bring up the cluster when allocating it. The relevant option is --gridservice-mapred.server-params (or -M in shorthand). Please refer to http://hadoop.apache.org/core/docs/r0.19.0/hod_user_guide.html#Options+for+Configuring+Hadoop for details.

I believe that for each node, HOD should determine (using the information in the $PBS_NODEFILE) how many CPUs on that node are allocated to the HOD job, and then set mapred.tasktracker.map.tasks.maximum appropriately on each node.

(2) In our InputFormat, we use numSplits to tell us how many map tasks the job's files should be split into. However, HOD does not override the mapred.map.tasks property (nor mapred.reduce.tasks), while they should be set depending on the number of available task trackers and/or nodes in the HOD session.

Can this not be submitted via the Hadoop job's configuration? Again, HOD cannot do this automatically currently. But you could use hod.client-params to set up a client-side hadoop-site.xml that would work like this for all jobs submitted to the cluster.

Hope this helps some.
Thanks
Hemanth
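Following on from the hod.client-params suggestion above, a hedged sketch of how client-side defaults for the map and reduce task counts might be supplied at allocation time; the values and cluster directory are hypothetical, and the exact option syntax should be checked against the HOD user guide:

  # generate a client-side hadoop-site.xml that sets default task counts
  # for all jobs submitted against this allocated cluster
  hod allocate -d ~/hod-cluster -n 8 \
      --hod.client-params mapred.map.tasks=32,mapred.reduce.tasks=8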
Re: Integrate HADOOP and Map/Reduce paradigm into HPC environment
Hemanth Yamijala wrote:

Filippo Spiga wrote: This procedure allows me to
- use a persistent HDFS on the whole cluster, placing the namenode on the frontend (always up and running) and datanodes on the other nodes
- submit a lot of jobs to the resource manager transparently without any problem, and manage job priority/reservation with MAUI just as simply as for other classical HPC jobs
- execute the jobtracker and tasktracker services on the nodes chosen by TORQUE (in particular, the first node selected becomes the jobtracker)
- store logs for different users in separate directories
- run only one job at a time (but probably multiple map/reduce jobs can run together, because different jobs use different subsets of nodes)

Probably HOD does what I can do with my raw script... it's possible that I don't understand the user guide well...

Filippo, HOD indeed allows you to do all these things, and a little bit more. On the other hand, your script always executes the jobtracker on the first node, which also seems useful to me. It would be nice if you could still try HOD and see if it makes your life simpler in any way. :-)

Some things that HOD does automatically:
- Sets up log directories differently for different users
- Port numbers need not be fixed; HOD detects free ports and provisions the services to use them
- Depending on need, you can also deploy a custom tarball of Hadoop, rather than use a pre-installed version

Also, since HOD is only a thin wrapper around the resource manager, all policies that you can set up for Maui automatically apply to HOD-run clusters.

Sorry for my english :-P
Regards

2008/9/2 Hemanth Yamijala <[EMAIL PROTECTED]>

Allen Wittenauer wrote: On 8/18/08 11:33 AM, "Filippo Spiga" <[EMAIL PROTECTED]> wrote: Well, I haven't understood how I should configure HOD to work in this manner. For HDFS I follow this sequence of steps:
- conf/master contains only the master node of my cluster
- conf/slaves contains all nodes
- I start HDFS using bin/start-dfs.sh

Right, fine...

Potentially I would like to allow all nodes to be used for MapReduce. For HOD, which parameter should I set in contrib/hod/conf/hodrc? Should I change only the gridservice-hdfs section?

I was hoping the HOD folks would answer this question for you, but they are apparently sleeping. :)

Woops! Sorry, I missed this. Anyway, yes, if you point gridservice-hdfs to a static HDFS, it should use that as the -default- HDFS. That doesn't prevent a user from using HOD to create a custom HDFS as part of their job submission.

Allen's answer is perfect. Please refer to http://hadoop.apache.org/core/docs/current/hod_user_guide.html#Using+an+external+HDFS for more information about how to set up the gridservice-hdfs section to use a static or external HDFS.
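For reference, a hedged sketch of what the gridservice-hdfs section might look like when pointing HOD at a static, persistent HDFS; the host name and port numbers are placeholders, and the exact option names should be verified against the user guide section linked above:

  [gridservice-hdfs]
  external = True
  # namenode of the already-running, persistent HDFS (hypothetical host and ports)
  host = namenode.example.com
  fs_port = 9000
  info_port = 50070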
Re: Integrate HADOOP and Map/Reduce paradigm into HPC environment
Allen Wittenauer wrote: On 8/18/08 11:33 AM, "Filippo Spiga" <[EMAIL PROTECTED]> wrote: Well, I haven't understood how I should configure HOD to work in this manner. For HDFS I follow this sequence of steps:
- conf/master contains only the master node of my cluster
- conf/slaves contains all nodes
- I start HDFS using bin/start-dfs.sh

Right, fine...

Potentially I would like to allow all nodes to be used for MapReduce. For HOD, which parameter should I set in contrib/hod/conf/hodrc? Should I change only the gridservice-hdfs section?

I was hoping the HOD folks would answer this question for you, but they are apparently sleeping. :)

Woops! Sorry, I missed this. Anyway, yes, if you point gridservice-hdfs to a static HDFS, it should use that as the -default- HDFS. That doesn't prevent a user from using HOD to create a custom HDFS as part of their job submission.

Allen's answer is perfect. Please refer to http://hadoop.apache.org/core/docs/current/hod_user_guide.html#Using+an+external+HDFS for more information about how to set up the gridservice-hdfs section to use a static or external HDFS.
Re: HoD and locality of TaskTrackers to data (on DataNodes)
Jiaqi,

Hi, I have a question about using HoD and the locality of the assigned TaskTrackers to the data. Suppose I have a long-running HDFS installation with TaskTracker/JobTracker nodes dynamically allocated by HoD, and I uploaded my data to HDFS prior to running my job/allocating nodes using "dfs -put". Then I allocate some nodes and run my job on that data using HoD. Would the nodes allocated by HoD take into account the HDFS nodes on which my data resides (e.g. by looking at which DataNodes hold blocks that belong to the current user)? If the nodes are just arbitrarily allocated, doesn't that break Hadoop's design principle of having processing take place near the data? And if HoD doesn't currently take block location into account when allocating nodes, are there future plans for that to be incorporated?

Excellent point! HOD does not currently take this into account. We are working on ways in which we can accomplish this using configuration outside HOD (i.e. in Torque / some Hadoop features in 0.17 like HADOOP-1985). I will update this list (and possibly also the documentation) on how this can be set up, after we have some more concrete results.

Thanks
Hemanth
Re: HOD question wrt virtual mapred master node
Jason Venner wrote:
We would like to have the master node be the submit node. Our processing nodes tend to be rather beefy machines, and we don't want to have to use one for a HOD master. We run a shared DFS, so we are only building virtual mapred nodes via hod. Does anyone have any suggestions on how to do that via HOD?

What do you mean by master node - the node that runs the RingMaster process in HOD? If yes, then that node is not exclusively reserved for running the RingMaster process. Either a JobTracker or a TaskTracker is going to be brought up on it.

Thanks
Hemanth
Re: [HOD] Collecting MapReduce logs
Luca,

Luca wrote: Hello everyone, I wonder what the meaning of hodring.log-destination-uri is versus hodring.log-dir. I'd like to collect the MapReduce UI logs after a job has been run, and the only attribute for that seems to be hod.hadoop-ui-log-dir, in the hod section.

log-destination-uri is a config option for uploading hadoop logs after the cluster is deallocated. log-dir is used to store logs generated by the HOD processes themselves. If you want the MapReduce UI logs, hadoop-ui-log-dir is what you want, as you rightly noted.

With that attribute specified, the logs are all grabbed into that directory, producing a large number of HTML files. Is there a way to collect them, maybe as a .tar.gz, in a place somewhere related to the user?

Sorry, no, we don't have that option yet. In fact, going forward, Hadoop might solve this problem on its own. HADOOP-2178 seems to be related to this, but I haven't looked at it too closely.

Additionally, how do administrators specify variables in these values? Which interpreter interprets them? For instance, variables specified in a bash fashion like $USER in the hodring or ringmaster sections work well (I guess they are interpreted by bash itself), but if specified in the hod section they are not. I tried with

  [hod]
  hadoop-ui-log-dir=/somedir/$USER

but any hod command fails, displaying an error on that line.

We are definitely planning to build this capability, as part of the work we will be doing for HADOOP-2849.

Cheers, Luca
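To summarize the three options discussed above, a hedged sketch of where each one lives in the hodrc; the paths and the HDFS URI are purely illustrative:

  [hodring]
  # upload hadoop daemon logs here when the cluster is deallocated (hypothetical URI)
  log-destination-uri = hdfs://namenode.example.com:9000/user/hod/logs
  # local directory for logs written by the HOD processes themselves
  log-dir = /var/log/hod

  [hod]
  # directory into which the MapReduce UI pages are grabbed
  hadoop-ui-log-dir = /home/luca/hod-ui-logs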
Re: [HOD] Example script
Luca,

  #!/bin/bash
  hadoop --config /home/luca/hod-test jar /mnt/scratch/hadoop/hadoop-0.16.0-examples.jar wordcount file:///mnt/scratch/hadoop/test/part-0 test.hodscript.out

Can you try removing the --config from this script? While running scripts, HOD automatically allocates a directory and sets HADOOP_CONF_DIR to it, so you don't need to mention that.

That said, I have a very important note: we are changing HOD's user interface a bit based on feedback from some of our users. Before you write too many scripts, it might be worthwhile to wait until Hadoop 0.16.1 is out, which will have a more user-friendly HOD interface. Please look at H-2861 for all the details.

Thanks
hemanth
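A minimal sketch of the corrected script, following the suggestion above; the jar, input, and output paths are taken from Luca's original and may need adjusting for another setup:

  #!/bin/bash
  # HOD exports HADOOP_CONF_DIR pointing at the allocated cluster's configuration,
  # so the job can be submitted without an explicit --config option.
  hadoop jar /mnt/scratch/hadoop/hadoop-0.16.0-examples.jar wordcount \
      file:///mnt/scratch/hadoop/test/part-0 test.hodscript.out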
Re: More HOD: is there anyway to get HOD to copy back all of the log files to the submit node?
Jason Venner wrote:
I have found that HOD writes a series of log files to directories on the virtual cluster master, if you specify log directories. The interesting part is figuring out which machine was the virtual cluster master, if you have a decent-sized pool of machines.

Can you explain what you mean by 'virtual cluster master'? I guess you could mean the node which ran the 'ringmaster' process - the master process in a hod virtual cluster - or the hadoop 'jobtracker'. Both pieces of information are stored by hod for an allocated cluster. You can retrieve them by using hod -o list and hod -o "info cluster-dir".

If the cluster is deallocated, you can still get the ringmaster node by using torque commands: qstat (to get the torque job ids) and qstat -n, which will print a list of nodes; the first one is the ringmaster.

It would be so nice if HOD could copy this back to the submission node. I suppose we could set syslog up, but then we have to configure the syslogd on each of the submit hosts to accept from any compute node...

Syslog configuration has a bug which will be addressed in Hadoop 0.16.1.

Thanks
hemanth
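A hedged sketch of the sequence just described; the cluster directory and the Torque job id are placeholders:

  # while the cluster is still allocated
  hod -o list
  hod -o "info ~/hod-clusters/mycluster"    # reports the ringmaster and jobtracker nodes

  # after deallocation, via Torque
  qstat              # find the torque job id
  qstat -n 104544    # the first node listed is the ringmaster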
Re: HOD question wrt to the virtual cluster log files - where do they end up when the job ends
Jason Venner wrote: As you have all read from my previous emails, we are still pretty low on the HOD learning curve.

That is understandable. It is new software, so we will improve over time with feedback from our users, like you :-)

We are having jobs that terminate, and the virtual mapred cluster is terminated also. We want access to the log files and job history from the virtual cluster.

You've mentioned that you have a persistent DFS. So, you could do the following. Set up the following variable in your hodrc:

  [hodring]
  log-destination-uri = hdfs://:/user/hod/logs

If you are enabling permissions, you may need to make sure this directory is writable by all users who launch hod jobs. When a job is completed, the hadoop logs will be uploaded to this folder in dfs as zipped files. You can access them from there. Hod logs are stored on the local file system under the log-dir configured in the [ringmaster] and [hodring] sections.

Our other active research project is why the virtual clusters are getting killed.

You mean you allocate, allocation succeeds, and after a while the cluster is no longer active? If yes, please check the following:

- What is the default wallclock time set up in your torque master? For example:

  $ qmgr -c "p q batch"
  #
  # Create queues and set their attributes.
  #
  #
  # Create and define queue batch
  #
  create queue batch
  set queue batch queue_type = Execution
  set queue batch resources_default.walltime = 1:00:00

  If this time limit is low, you could be hitting it, and the torque server is killing the cluster. You can increase the wallclock time when you allocate a cluster. Use --hod.walltime in the allocate command line:

    hod --hod.walltime 3600 -o "allocate .."

- Another possibility for auto-deallocation is if your cluster has not been running any hadoop jobs for a long time. To free up nodes, HOD deallocates such idle clusters automatically.

Anyway, how do we get access to the log files and job history from a terminated virtual cluster?

History logs should also be uploaded to DFS as explained above.

Note: we run a persistent DFS, and allocate virtual mapred clusters.

Thanks
Re: [HOD] hdfs:///mapredsystem directory
Mahadev Konar wrote: Hi Luca, can you do an ls -l on /mapredsystem and send the output? According to the permissions design for mapreduce, the system directories created by the jobtracker should be world writable, so permissions should have worked as-is for hod.

No, it doesn't appear to be working that way. The mapred system directory specified by HOD for a jobtracker is /mapredsystem/. When the jobtracker starts, it creates this directory with permissions rwx-wx-wx. It does not clean it up when it stops. But when it is started again on the same node, I think (may be wrong here) it tries to clean up the directory (possibly to handle crashes etc.?). This seems equivalent to issuing a "hadoop dfs -rmr /mapredsystem/", which behaves like a recursive rm -r on unix. So, it basically has to read the directory in order to delete it recursively; and because there are no read permissions, it fails.

A simple experiment that simulates the same set of DFS operations using two different users, without using mapred, gives the same result. Enabling read permissions allows the delete to work.

There appear to be several ways we can fix this, and that discussion should go on HADOOP-2899. Will post my comments there.

Thanks
hemanth
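A hedged sketch of the kind of two-user experiment described above; the directory name is illustrative, and the commands assume a 0.16-era DFS with permissions enabled:

  # as user A: create the directory and give it the jobtracker-style permissions (rwx-wx-wx)
  hadoop dfs -mkdir /mapredsystem/somehost
  hadoop dfs -chmod 733 /mapredsystem/somehost

  # as user B: the recursive delete has to list the directory first,
  # and with no read permission it fails
  hadoop dfs -rmr /mapredsystem/somehost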
Re: [HOD] Port for MapReduce UI interface
Luca wrote:

  [hod]
  xrs-port-range = 1-11000
  http-port-range = 1-11000

the Mapred UI port is chosen outside this range.

There is no port-range option for the Mapred and HDFS sections currently. You seem to have a use-case for specifying the range within which HOD should try to bind. Please file a JIRA for this.

Thanks
Hemanth
Re: More HOD questions 0.16.0 - debug log enclosed - help with how to debug
Jason Venner wrote: My hadoop jobs don't start. This is configured to use an existing DFS and to unpack a tarball with a cut-down 0.16.0 config. I have looked in the mom logs on the client machines and am not getting anything meaningful.

What is your hod command line? Specifically, how did you provide the tarball option? Can you attach the log of the hod command, like you did the hodrc? There are some lines in the output that don't seem complete.

Set the debug option in the [ringmaster] section to 4, and rerun hod. Under the log-dir specified in the [ringmaster] section you will be able to see a log file corresponding to your jobid. Can you attach that too? The ringmaster node is the first one allocated by torque for the job, that is, the mother superior for the job.

How is your tarball built? Can you check that there's no hadoop-env.sh with pre-filled values in it? Look at HADOOP-2860.

Thanks
Hemanth
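A hedged sketch of the hodrc change being suggested here; the log directory path is illustrative:

  [ringmaster]
  # verbose debug logging for the ringmaster; a per-job log file appears under log-dir
  debug = 4
  log-dir = /var/log/hod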