Re: [9fans] xcpu note

2005-10-19 Thread rog
 Oh, wait, 12 nodes. Hmm. That's cheating!

unfortunately, we haven't been able to run the inferno grid stuff on
any more than about 300 nodes.  it works fairly quickly on that
number, but task takeup slows down considerably when it's pumping out
a lot of data (this is better now that nodes cache data).

things are slowed down quite a bit by logging constraints (it stores
much of its ongoing data on disk, both to reduce memory consumption and
so that if the server crashes or is turned off, things can resume with
virtually nothing lost).  running on top of a ram disk can speed
things up by at least an order of magnitude. this probably makes
sense for short-lived jobs.
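a rough sketch of the journalling scheme described above (hypothetical code, not the actual inferno grid implementation): each task state change is appended to a log and fsync'd, so a restarted server can replay the log and re-queue anything not marked done. the LOG path and record format are made up for illustration; pointing LOG at a ram disk is the speed/durability trade mentioned.

```python
# Hypothetical sketch of crash-resumable job logging, in the spirit of
# the scheme described above (not the actual Inferno grid code).
import json
import os

LOG = "tasks.log"  # put this on a ram disk to trade durability for speed

def record(task_id, state):
    # Append-only journal: one JSON line per state change, flushed and
    # synced so a crash loses at most the record in flight.
    with open(LOG, "a") as f:
        f.write(json.dumps({"id": task_id, "state": state}) + "\n")
        f.flush()
        os.fsync(f.fileno())

def recover():
    # Replay the journal; the last state recorded for each task wins,
    # so anything not marked "done" is re-queued after a restart.
    states = {}
    if os.path.exists(LOG):
        with open(LOG) as f:
            for line in f:
                rec = json.loads(line)
                states[rec["id"]] = rec["state"]
    return [tid for tid, st in states.items() if st != "done"]
```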

i'd love to try it out on a larger cluster (one could use an existing
scheduler to leverage the initial installation).



Re: [9fans] xcpu note

2005-10-18 Thread Scott Schwartz
|  Probably apples and oranges, but Jim Kent wrote a job scheduler for his
|  kilocluster that nicely handled about 1M jobs in six hours.  It's the
|  standard thing for whole genome sequence alignments at ucsc.
| 
| I think that's neat, I would like to learn more. Was this scheduler for 
| an arbitrary job mix, or specialized to that app?
 
Well, it was designed to do what we needed and no more, but it's still
pretty general.  The input is a file of commands, and it runs them all
until they are all done (with a way to retry the ones that failed.)

http://www.cse.ucsc.edu/~kent/
http://www.soe.ucsc.edu/~donnak/eng/parasol.htm
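from the description above, the core of a Parasol-style runner is small: read a file of shell commands, run them all, and retry the failures until they succeed or a retry budget runs out. this is a minimal reconstruction of that idea, not Jim Kent's actual code (the function name and retry policy are invented for illustration):

```python
# Minimal sketch of a Parasol-style batch runner: input is a file of
# commands, run them all until done, with retries for the failed ones.
# A reconstruction from the description above, not the real scheduler.
import subprocess

def run_batch(cmd_file, retries=3):
    with open(cmd_file) as f:
        cmds = [c.strip() for c in f if c.strip()]
    failed = cmds
    for _ in range(retries):
        if not failed:
            break
        still_failed = []
        for cmd in failed:
            # Each line is an arbitrary shell command; nonzero exit
            # status marks it for another round.
            if subprocess.run(cmd, shell=True).returncode != 0:
                still_failed.append(cmd)
        failed = still_failed
    return failed  # commands that never succeeded
```

the real system farmed the commands out across the kilocluster in parallel; this serial loop only shows the run-until-done-with-retry contract.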


Re: [9fans] xcpu note

2005-10-18 Thread leimy2k
 David Leimbach wrote:
 
 Clustermatic is pretty cool, I think it's what was installed on one of
 the other clusters I used at LANL as a contractor at the time.  I
 recall a companion tool for bproc to request nodes, sort of an ad-hoc
 scheduler.  I had to integrate support for this in our MPI's start up
 that I was testing on that machine.
 
 the simple scheduler, bjs, was written by erik hendriks (now at Google, 
 sigh) and was rock-solid. It ran on one cluster, unattended, scheduling 
 128 2-cpu nodes with a very diverse job mix, for one year. It was a 
 great piece of software. It was far faster, and far more reliable, than 
 any scheduler we have ever seen, then or now. In one test, we ran about 
 20,000 jobs through it in about an hour, on a 1024-node cluster, just to 
 test. Note that it could probably have scheduled a lot more jobs, but 
 the run-time of the job was non-zero. No other scheduler we have used 
 comes close to this kind of performance. Scheduler overhead was 
 basically insignificant.
 

Yeah, when I came to the lab last it was a surprise to find out that I 
had to support not only bproc but bjs too.  Luckily it took about 10
minutes to figure it out and add support to our mpirun startup script.

It was pretty neat.

 
 I'm curious to see how this all fits together with xcpu, if there is
 such a resource allocation setup needed etc.
 
 we're going to take bjs and have it schedule nodes to give to users.
 
 Note one thing we are going to do with xcpu: attach nodes to a user's 
 desktop machine, rather than make users log in to the cluster. So users 
 will get interactive clusters that look like they own them. This will, 
 we hope, kill batch mode. Plan 9 ideas make this possible. It's going to 
 be a big change, one we hope users will like.

Hmm, planning to create a multi-hosted xcpu resource all bound to the 
user's namespace?  Or one host per set of files?  Is there an easy way to
launch multiple jobs in one shot, à la MPI startup?

 
 If you look at how most clusters are used today, they closely resemble 
 the batch world of the 1960s. It is actually kind of shocking. I 
 downloaded a JCL manual a year or two ago, and compared what JCL did to 
 what people wanted batch schedulers for clusters to do, and the 
 correspondence was a little depressing. The Data General ad said it 
 best: "Batch is a bitch."

Yeah, I've been comparing them to punch card systems for a while now. 
Some are even almost the same size as those old machines now that we've 
stacked them up.

MPI jobs have turned modern machines into huge monoliths that basically 
throw out the advantages of a multi-user system.  In fact having worked
with CPlant for a while with Ron Brightwell over at SNL, they had a design
optimized for one process per machine.  One CPU [no SMP hardware contention],
Myrinet with Portals for RDMA and OS bypass reasons [low overheads], 
no threads [though I was somewhat taunted with them at one point], and this
Yod and Yod2 scheduler for job startup.

It was unique, and very interesting to work on, though not a lot of fun to
debug running code on. :)

The closest thing I've seen to this kind of design in production has to be 
Blue Gene [which is a much different architecture of course but similar in 
that it is very custom designed for a few purposes].


 
 Oh yeah, if anyone has a copy of that ad (Google does not), i'd like it 
 in .pdf :-) It appeared in the late 70s IIRC.
 
 ron
 p.s. go ahead, google JCL, and you can find very recent manuals on how 
 to use it. I will be happy to post the JCL for sort + copy if anyone 
 wants to see it.

Please god no!!! :)

Dave



[9fans] xcpu note

2005-10-17 Thread Ronald G Minnich

oh, yeah, you're going to see a lot of debugging crap from xcpusrv.

this is called: "A guy who's done select()-based threading for xx years 
tries to learn Plan 9 threads, and fails a lot, but is slowly getting 
it, sometimes."


sorry for any convenience (sic).

also, on Plan 9 ports,  you are going to need a linux kernel, e.g. 
2.6.14-rc2, to make it go, or use Newsham's python client code.


ron


Re: [9fans] xcpu note

2005-10-17 Thread David Leimbach
Congrats on another fine Linux Journal article Ron.  I just got this
in the mail yesterday and read it today.

Clustermatic is pretty cool, I think it's what was installed on one of
the other clusters I used at LANL as a contractor at the time.  I
recall a companion tool for bproc to request nodes, sort of an ad-hoc
scheduler.  I had to integrate support for this in our MPI's start up
that I was testing on that machine.

I'm curious to see how this all fits together with xcpu, if there is
such a resource allocation setup needed etc.

Dave

On 10/17/05, Ronald G Minnich rminnich@lanl.gov wrote:
 oh, yeah, you're going to see a lot of debugging crap from xcpusrv.

 this is called: A guy who's done select()-based threading for xx years
 tries to learn Plan 9 threads, and fails a lot, but is slowly getting
 it, sometimes

 sorry for any convenience (sic).

 also, on Plan 9 ports,  you are going to need a linux kernel, e.g.
 2.6.14-rc2, to make it go, or use Newsham's python client code.

 ron



Re: [9fans] xcpu note

2005-10-17 Thread Ronald G Minnich

David Leimbach wrote:


Clustermatic is pretty cool, I think it's what was installed on one of
the other clusters I used at LANL as a contractor at the time.  I
recall a companion tool for bproc to request nodes, sort of an ad-hoc
scheduler.  I had to integrate support for this in our MPI's start up
that I was testing on that machine.


the simple scheduler, bjs, was written by erik hendriks (now at Google, 
sigh) and was rock-solid. It ran on one cluster, unattended, scheduling 
128 2-cpu nodes with a very diverse job mix, for one year. It was a 
great piece of software. It was far faster, and far more reliable, than 
any scheduler we have ever seen, then or now. In one test, we ran about 
20,000 jobs through it in about an hour, on a 1024-node cluster, just to 
test. Note that it could probably have scheduled a lot more jobs, but 
the run-time of the job was non-zero. No other scheduler we have used 
comes close to this kind of performance. Scheduler overhead was 
basically insignificant.




I'm curious to see how this all fits together with xcpu, if there is
such a resource allocation setup needed etc.


we're going to take bjs and have it schedule nodes to give to users.

Note one thing we are going to do with xcpu: attach nodes to a user's 
desktop machine, rather than make users log in to the cluster. So users 
will get interactive clusters that look like they own them. This will, 
we hope, kill batch mode. Plan 9 ideas make this possible. It's going to 
be a big change, one we hope users will like.
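attaching nodes to a user's desktop presumably means each node's xcpu service gets mounted into the user's namespace and driven through ordinary file operations. the sketch below is a hypothetical client in that spirit; the file names (clone, exec, argv, stdout) and session layout are assumptions based on Plan 9 conventions, not the documented xcpu interface:

```python
# Hypothetical sketch of driving an xcpu-style per-node file interface
# from a user's desktop. The layout (clone allocates a session dir
# holding exec/argv/stdout) is an assumption modelled on Plan 9
# conventions; the real xcpu file names may differ.
import os

def run_on_node(mountpoint, binary, argv):
    # Each node's service is presumed mounted into the user's
    # namespace, e.g. /mnt/xcpu/node7. Reading "clone" allocates a
    # session directory and returns its name.
    with open(os.path.join(mountpoint, "clone")) as f:
        session = f.read().strip()
    sdir = os.path.join(mountpoint, session)
    # Ship the program and its arguments to the node, then hand back
    # the session's output stream.
    with open(binary, "rb") as src, open(os.path.join(sdir, "exec"), "wb") as dst:
        dst.write(src.read())
    with open(os.path.join(sdir, "argv"), "w") as f:
        f.write(" ".join(argv))
    return open(os.path.join(sdir, "stdout"))
```

the appeal is that "scheduling" then reduces to binding more node mounts into the namespace, with no batch queue in the way.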


If you look at how most clusters are used today, they closely resemble 
the batch world of the 1960s. It is actually kind of shocking. I 
downloaded a JCL manual a year or two ago, and compared what JCL did to 
what people wanted batch schedulers for clusters to do, and the 
correspondence was a little depressing. The Data General ad said it 
best: "Batch is a bitch."


Oh yeah, if anyone has a copy of that ad (Google does not), i'd like it 
in .pdf :-) It appeared in the late 70s IIRC.


ron
p.s. go ahead, google JCL, and you can find very recent manuals on how 
to use it. I will be happy to post the JCL for sort + copy if anyone 
wants to see it.


Re: [9fans] xcpu note

2005-10-17 Thread Kenji Okamoto
 also, on Plan 9 ports,  you are going to need a linux kernel, e.g. 
 2.6.14-rc2, to make it go, or use Newsham's python client code.

The latest candidate for this, which I checked just now, is still 2.6.14-rc4.☺
Do you have any info on when it'll become a stable release?

Kenji



Re: [9fans] xcpu note

2005-10-17 Thread Ronald G Minnich

Kenji Okamoto wrote:
also, on Plan 9 ports,  you are going to need a linux kernel, e.g. 
2.6.14-rc2, to make it go, or use Newsham's python client code.



The latest candidate of this, I checked just before, is still 2.6.14-rc4.☺
Do you have any info when it'll become stable release?

Kenji



usual rule with this most recent series is pretty damn soon. 2.6.13 
stabilized quite fast. I am guessing we're close, not that I know any 
more than you do :-)


ron


Re: [9fans] xcpu note

2005-10-17 Thread Eric Van Hensbergen
Should be any day now, there weren't that many patches in rc4.
Of course, lucho has a massive overhaul of the mux code waiting in the
wings and I have my fid tracking rework, so stuff won't be stable for
long ;)  Of course, we'll keep that code in our development trees
until it has cooked a little, but lucho's code looks to fix a lot of
long standing problems and hopefully my new fid stuff will make Plan 9
things (p9p) and Ron's new synthetics work better.

-eric


On 10/17/05, Kenji Okamoto [EMAIL PROTECTED] wrote:
  also, on Plan 9 ports,  you are going to need a linux kernel, e.g.
  2.6.14-rc2, to make it go, or use Newsham's python client code.

 The latest candidate of this, I checked just before, is still 2.6.14-rc4.☺
 Do you have any info when it'll become stable release?

 Kenji




Re: [9fans] xcpu note

2005-10-17 Thread Scott Schwartz
| No other scheduler we have used 
| comes close to this kind of performance. Scheduler overhead was 
| basically insignificant.
 
Probably apples and oranges, but Jim Kent wrote a job scheduler for his
kilocluster that nicely handled about 1M jobs in six hours.  It's the
standard thing for whole genome sequence alignments at ucsc.

| If you look at how most clusters are used today, they closely resemble 
| the batch world of the 1960s. It is actually kind of shocking. 

On the other hand, sometimes that's just what you really want.



Re: [9fans] xcpu note

2005-10-17 Thread Ronald G Minnich

Scott Schwartz wrote:


Probably apples and oranges, but Jim Kent wrote a job scheduler for his
kilocluster that nicely handled about 1M jobs in six hours.  It's the
standard thing for whole genome sequence alignments at ucsc.


I think that's neat, I would like to learn more. Was this scheduler for 
an arbitrary job mix, or specialized to that app?




| If you look at how most clusters are used today, they closely resemble 
| the batch world of the 1960s. It is actually kind of shocking. 


On the other hand, sometimes that's just what you really want.



true. Sometimes it is. I've found, more often, that it's what people 
will accept, but not what they want.


ron


Re: [9fans] xcpu note

2005-10-17 Thread andrey mirtchovski
 | No other scheduler we have used 
 | comes close to this kind of performance. Scheduler overhead was 
 | basically insignificant.
  
 Probably apples and oranges, but Jim Kent wrote a job scheduler for his
 kilocluster that nicely handled about 1M jobs in six hours.  It's the
 standard thing for whole genome sequence alignments at ucsc.

the vitanuova guys probably have better numbers, but when we ran their
grid code at ucalgary it executed over a million jobs in a 24-hour
period.  the jobs were non-null (md5sum using inferno's dis code).  it
ran on a 12 (or so) -node cluster :)



Re: [9fans] xcpu note

2005-10-17 Thread Ronald G Minnich

andrey mirtchovski wrote:
| No other scheduler we have used 
| comes close to this kind of performance. Scheduler overhead was 
| basically insignificant.


Probably apples and oranges, but Jim Kent wrote a job scheduler for his
kilocluster that nicely handled about 1M jobs in six hours.  It's the
standard thing for whole genome sequence alignments at ucsc.



the vitanuova guys probably have better numbers, but when we ran their
grid code at ucalgary it executed over a million jobs in a 24-hour
period.  the jobs were non-null (md5sum using inferno's dis code).  it
ran on a 12 (or so) -node cluster :)



man, all these schedulers that work MUCH better than the stuff we pay 
money for ... ah well. It shows how limited my experience is ... I'm 
used to schedulers that take 5-25 seconds to schedule jobs on 1000 or so 
nodes.


Oh, wait, 12 nodes. Hmm. That's cheating!

ron