Re: [9fans] xcpu note
| Oh, wait, 12 nodes. Hmm. That's cheating!

unfortunately, we haven't been able to run the inferno grid stuff on any more than about 300 nodes. it works fairly quickly on that number, but task takeup slows down considerably when it's pumping out a lot of data (this is better now that nodes cache data).

things are slowed down quite a bit by logging constraints (it stores much of its ongoing data on disk, both to reduce memory consumption and so that if the server crashes or is turned off, things can resume with virtually nothing lost). running on top of a ram disk can speed things up by at least an order of magnitude. this probably makes sense for short-lived jobs.

i'd love to try it out on a larger cluster (one could use an existing scheduler to leverage the initial installation).
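The crash-resume behaviour described here (state journalled to disk so a restart loses virtually nothing, at the cost of per-task disk writes) can be sketched minimally like this. This is a hypothetical illustration of the general technique, not the inferno grid code; the file name and helpers are invented for the example:

```python
import os

LOG = "completed.log"  # hypothetical journal of finished task ids

def load_completed():
    """On startup, recover the set of tasks already finished."""
    if not os.path.exists(LOG):
        return set()
    with open(LOG) as f:
        return {line.strip() for line in f if line.strip()}

def mark_completed(task_id):
    """Append the id and fsync, so a crash immediately after loses nothing."""
    with open(LOG, "a") as f:
        f.write(task_id + "\n")
        f.flush()
        os.fsync(f.fileno())

def run(tasks, do_task):
    """Run each task once, skipping anything the journal says is done."""
    done = load_completed()
    for t in tasks:
        if t in done:
            continue  # finished before a crash or restart
        do_task(t)
        mark_completed(t)
```

The fsync per task is exactly the kind of logging constraint that makes a ram disk an order-of-magnitude win: the durability cost disappears, along with the crash-resume guarantee.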
Re: [9fans] xcpu note
| Probably apples and oranges, but Jim Kent wrote a job scheduler for his
| kilocluster that nicely handled about 1M jobs in six hours. It's the
| standard thing for whole genome sequence alignments at ucsc.
|
| I think that's neat, I would like to learn more. Was this scheduler for
| an arbitrary job mix, or specialized to that app?

Well, it was designed to do what we needed and no more, but it's still pretty general. The input is a file of commands, and it runs them all until they are all done (with a way to retry the ones that failed.)

http://www.cse.ucsc.edu/~kent/
http://www.soe.ucsc.edu/~donnak/eng/parasol.htm
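The description above (a file of commands, run them all until all are done, retry the failures) can be sketched roughly as follows. This is a hypothetical illustration of that scheme, not parasol's actual code; the function name and parameters are invented for the example:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_all(command_file, max_retries=3, workers=4):
    """Run every command listed in the file, one per line, retrying
    failures for up to max_retries rounds. Returns the commands that
    never succeeded."""
    with open(command_file) as f:
        pending = [line.strip() for line in f if line.strip()]

    for _ in range(max_retries):
        if not pending:
            break
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # run this round's commands in parallel, collect exit codes
            rcs = list(pool.map(
                lambda cmd: subprocess.run(cmd, shell=True).returncode,
                pending))
        # keep only the failed commands for the next round
        pending = [cmd for cmd, rc in zip(pending, rcs) if rc != 0]
    return pending
```

The appeal of this design is that the job description is just a text file, so generating, splitting, or re-running a million-job batch is ordinary text processing.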
Re: [9fans] xcpu note
| David Leimbach wrote:
| | Clustermatic is pretty cool, I think it's what was installed on one of
| | the other clusters I used at LANL as a contractor at the time. I recall
| | a companion tool for bproc to request nodes, sort of an ad-hoc
| | scheduler. I had to integrate support for this in our MPI's startup
| | that I was testing on that machine.
|
| the simple scheduler, bjs, was written by erik hendriks (now at Google,
| sigh) and was rock-solid. It ran on one cluster, unattended, scheduling
| 128 2-cpu nodes with a very diverse job mix, for one year. It was a
| great piece of software. It was far faster, and far more reliable, than
| any scheduler we have ever seen, then or now. In one test, we ran about
| 20,000 jobs through it in about an hour, on a 1024-node cluster, just to
| test. Note that it could probably have scheduled a lot more jobs, but
| the run-time of the job was non-zero. No other scheduler we have used
| comes close to this kind of performance. Scheduler overhead was
| basically insignificant.

Yeah, when I last came to the lab it was a surprise to find out that I not only had to support bproc but bjs too. Luckily it took about 10 minutes to figure it out and add support to our mpirun startup script. It was pretty neat.

| | I'm curious to see how this all fits together with xcpu, if there is
| | such a resource allocation setup needed etc.
|
| we're going to take bjs and have it schedule nodes to give to users.
| Note one thing we are going to do with xcpu: attach nodes to a user's
| desktop machine, rather than make users log in to the cluster. So users
| will get interactive clusters that look like they own them. This will,
| we hope, kill batch mode. Plan 9 ideas make this possible. It's going to
| be a big change, one we hope users will like.

Hmm, planning to create a multi-hosted xcpu resource all bound to the user's namespace? Or one host per set of files? Is there an easy way to launch multiple jobs in one shot, à la MPI startup, this way?

| If you look at how most clusters are used today, they closely resemble
| the batch world of the 1960s. It is actually kind of shocking. I
| downloaded a JCL manual a year or two ago, and compared what JCL did to
| what people wanted batch schedulers for clusters to do, and the
| correspondence was a little depressing. The Data General ad said it
| best: Batch is a bitch.

Yeah, I've been comparing them to punch card systems for a while now. Some are even almost the same size as those old machines now that we've stacked them up. MPI jobs have turned modern machines into huge monoliths that basically throw out the advantages of a multi-user system.

In fact, having worked with CPlant for a while with Ron Brightwell over at SNL, they had a design optimized for one process per machine. One CPU [no SMP hardware contention], Myrinet with Portals for RDMA and OS bypass reasons [low overheads], no threads [though I was somewhat taunted with them at one point], and this Yod and Yod2 scheduler for job startup. It was very unique, and very interesting to work on, and not a lot of fun to debug running code on. :) The closest thing I've seen to this kind of design in production has to be Blue Gene [which is a much different architecture of course but similar in that it is very custom designed for a few purposes].

| Oh yeah, if anyone has a copy of that ad (Google does not), i'd like it
| in .pdf :-) It appeared in the late 70s IIRC.
|
| ron
|
| p.s. go ahead, google JCL, and you can find very recent manuals on how
| to use it. I will be happy to post the JCL for sort + copy if anyone
| wants to see it.

Please god no!!! :)

Dave
[9fans] xcpu note
oh, yeah, you're going to see a lot of debugging crap from xcpusrv. this is called: A guy who's done select()-based threading for xx years tries to learn Plan 9 threads, and fails a lot, but is slowly getting it, sometimes

sorry for any convenience (sic).

also, on Plan 9 ports, you are going to need a linux kernel, e.g. 2.6.14-rc2, to make it go, or use Newsham's python client code.

ron
Re: [9fans] xcpu note
Congrats on another fine Linux Journal article, Ron. I just got this in the mail yesterday and read it today.

Clustermatic is pretty cool, I think it's what was installed on one of the other clusters I used at LANL as a contractor at the time. I recall a companion tool for bproc to request nodes, sort of an ad-hoc scheduler. I had to integrate support for this in our MPI's startup that I was testing on that machine.

I'm curious to see how this all fits together with xcpu, if there is such a resource allocation setup needed etc.

Dave

On 10/17/05, Ronald G Minnich rminnich@lanl.gov wrote:
| oh, yeah, you're going to see a lot of debugging crap from xcpusrv. this
| is called: A guy who's done select()-based threading for xx years tries
| to learn Plan 9 threads, and fails a lot, but is slowly getting it,
| sometimes
|
| sorry for any convenience (sic).
|
| also, on Plan 9 ports, you are going to need a linux kernel, e.g.
| 2.6.14-rc2, to make it go, or use Newsham's python client code.
|
| ron
Re: [9fans] xcpu note
David Leimbach wrote:
| Clustermatic is pretty cool, I think it's what was installed on one of
| the other clusters I used at LANL as a contractor at the time. I recall
| a companion tool for bproc to request nodes, sort of an ad-hoc
| scheduler. I had to integrate support for this in our MPI's startup
| that I was testing on that machine.

the simple scheduler, bjs, was written by erik hendriks (now at Google, sigh) and was rock-solid. It ran on one cluster, unattended, scheduling 128 2-cpu nodes with a very diverse job mix, for one year. It was a great piece of software. It was far faster, and far more reliable, than any scheduler we have ever seen, then or now. In one test, we ran about 20,000 jobs through it in about an hour, on a 1024-node cluster, just to test. Note that it could probably have scheduled a lot more jobs, but the run-time of the job was non-zero. No other scheduler we have used comes close to this kind of performance. Scheduler overhead was basically insignificant.

| I'm curious to see how this all fits together with xcpu, if there is
| such a resource allocation setup needed etc.

we're going to take bjs and have it schedule nodes to give to users. Note one thing we are going to do with xcpu: attach nodes to a user's desktop machine, rather than make users log in to the cluster. So users will get interactive clusters that look like they own them. This will, we hope, kill batch mode. Plan 9 ideas make this possible. It's going to be a big change, one we hope users will like.

If you look at how most clusters are used today, they closely resemble the batch world of the 1960s. It is actually kind of shocking. I downloaded a JCL manual a year or two ago, and compared what JCL did to what people wanted batch schedulers for clusters to do, and the correspondence was a little depressing. The Data General ad said it best: Batch is a bitch.

Oh yeah, if anyone has a copy of that ad (Google does not), i'd like it in .pdf :-) It appeared in the late 70s IIRC.

ron

p.s. go ahead, google JCL, and you can find very recent manuals on how to use it. I will be happy to post the JCL for sort + copy if anyone wants to see it.
Re: [9fans] xcpu note
| also, on Plan 9 ports, you are going to need a linux kernel, e.g.
| 2.6.14-rc2, to make it go, or use Newsham's python client code.

The latest release candidate, I just checked, is still 2.6.14-rc4.☺ Do you have any info on when it'll become a stable release?

Kenji
Re: [9fans] xcpu note
Kenji Okamoto wrote:
| | also, on Plan 9 ports, you are going to need a linux kernel, e.g.
| | 2.6.14-rc2, to make it go, or use Newsham's python client code.
|
| The latest candidate of this, I checked just before, is still
| 2.6.14-rc4.☺ Do you have any info when it'll become stable release?
|
| Kenji

usual rule with this most recent series is pretty damn soon. 2.6.13 stabilized quite fast. I am guessing we're close, not that I know any more than you do :-)

ron
Re: [9fans] xcpu note
Should be any day now; there weren't that many patches in rc4.

Of course, lucho has a massive overhaul of the mux code waiting in the wings and I have my fid-tracking rework, so stuff won't be stable for long ;) We'll keep that code in our development trees until it has cooked a little, but lucho's code looks to fix a lot of long-standing problems, and hopefully my new fid stuff will make Plan 9 things (p9p) and Ron's new synthetics work better.

-eric

On 10/17/05, Kenji Okamoto [EMAIL PROTECTED] wrote:
| | also, on Plan 9 ports, you are going to need a linux kernel, e.g.
| | 2.6.14-rc2, to make it go, or use Newsham's python client code.
|
| The latest candidate of this, I checked just before, is still
| 2.6.14-rc4.☺ Do you have any info when it'll become stable release?
|
| Kenji
Re: [9fans] xcpu note
| No other scheduler we have used comes close to this kind of
| performance. Scheduler overhead was basically insignificant.

Probably apples and oranges, but Jim Kent wrote a job scheduler for his kilocluster that nicely handled about 1M jobs in six hours. It's the standard thing for whole genome sequence alignments at ucsc.

| If you look at how most clusters are used today, they closely resemble
| the batch world of the 1960s. It is actually kind of shocking.

On the other hand, sometimes that's just what you really want.
Re: [9fans] xcpu note
Scott Schwartz wrote:
| Probably apples and oranges, but Jim Kent wrote a job scheduler for his
| kilocluster that nicely handled about 1M jobs in six hours. It's the
| standard thing for whole genome sequence alignments at ucsc.

I think that's neat, I would like to learn more. Was this scheduler for an arbitrary job mix, or specialized to that app?

| | If you look at how most clusters are used today, they closely resemble
| | the batch world of the 1960s. It is actually kind of shocking.
|
| On the other hand, sometimes that's just what you really want.

true. Sometimes it is. I've found, more often, that it's what people will accept, but not what they want.

ron
Re: [9fans] xcpu note
| | No other scheduler we have used comes close to this kind of
| | performance. Scheduler overhead was basically insignificant.
|
| Probably apples and oranges, but Jim Kent wrote a job scheduler for his
| kilocluster that nicely handled about 1M jobs in six hours. It's the
| standard thing for whole genome sequence alignments at ucsc.

the vitanuova guys probably have better numbers, but when we ran their grid code at ucalgary it executed over a million jobs in a 24-hour period. the jobs were non-null (md5sum using inferno's dis code). it ran on a 12 (or so)-node cluster :)
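A rough back-of-the-envelope comparison of the two throughput figures quoted in this exchange (1M jobs in six hours on the kilocluster vs. a million jobs in 24 hours on roughly 12 nodes). The 1000-node count for the kilocluster is an assumption taken from its name, so the per-node numbers are only indicative:

```python
# Throughput comparison from the figures quoted in the thread.
jobs = 1_000_000

kent_rate = jobs / (6 * 3600)         # Jim Kent's scheduler: ~46 jobs/sec
inferno_rate = jobs / (24 * 3600)     # inferno grid run: ~11.6 jobs/sec

# Per-node rates (assuming ~1000 nodes for the kilocluster, 12 for inferno)
kent_per_node = kent_rate / 1000      # ~0.05 jobs/sec/node
inferno_per_node = inferno_rate / 12  # ~1 job/sec/node

print(f"cluster-wide: {kent_rate:.1f} vs {inferno_rate:.1f} jobs/sec")
print(f"per node:     {kent_per_node:.3f} vs {inferno_per_node:.3f} jobs/sec")
```

Cluster-wide, the kilocluster scheduler moved jobs about four times faster; per node, the small inferno cluster was roughly twenty times busier, which is why the "12 nodes, that's cheating" comparison cuts both ways.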
Re: [9fans] xcpu note
andrey mirtchovski wrote:
| | | No other scheduler we have used comes close to this kind of
| | | performance. Scheduler overhead was basically insignificant.
| |
| | Probably apples and oranges, but Jim Kent wrote a job scheduler for
| | his kilocluster that nicely handled about 1M jobs in six hours. It's
| | the standard thing for whole genome sequence alignments at ucsc.
|
| the vitanuova guys probably have better numbers, but when we ran their
| grid code at ucalgary it executed over a million jobs in a 24-hour
| period. the jobs were non-null (md5sum using inferno's dis code). it
| ran on a 12 (or so)-node cluster :)

man, all these schedulers that work MUCH better than the stuff we pay money for ... ah well. It shows how limited my experience is ... I'm used to schedulers that take 5-25 seconds to schedule jobs on 1000 or so nodes.

Oh, wait, 12 nodes. Hmm. That's cheating!

ron