----- Original Message ----- From: "Brian Dobbins" <[EMAIL PROTECTED]>
To: "Vincent Diepeveen" <[EMAIL PROTECTED]>
Cc: "pauln" <[EMAIL PROTECTED]>; "Eray Ozkural" <[EMAIL PROTECTED]>; <[email protected]>
Sent: Saturday, June 03, 2006 11:04 AM
Subject: Re: [Beowulf] Building my own highend cluster


Hi Vincent (and others),

>  I just wanted to add my own two cents after having fairly recently
> [snip]

Thanks, I'll have a look at it!

Of course I'd prefer to just put in a CD-ROM, hit enter, and then connect the
cables.

But really, when you guys talk about cfengine, I have no clue what universe you're talking about.

If I boot a machine without a hard drive, basically the machine says: "F you, error! Press enter to reboot"

OK, please let's start there. What do I do after getting that message?

Which key do I hit?

> recalled the relative complexity of creating diskless nodes 'by hand' a
> few years back and subsequently finding the wonderful simplicity of
> tools such as Warewulf (or Rocks).  So, in the interest of providing
> more information to the discussion at hand, here's a bit more detail and
> other assorted thoughts:

So I put a Warewulf CD-ROM in the master node, press enter, set all the
'diskless nodes' in the BIOS to "boot over network", and it all works fine?

By the way, does that 'boot over network' mean I need a 16-port hub for 100 Mbit and
have to connect all the machines to 100 Mbit Ethernet in addition to the Quadrics network?

About Warewulf, one small problem: how do I get it to boot together with OpenSSI and the elan3 drivers?

Now don't tell me it's based upon open-BS; learning Linux when Linus started releasing
it at the start of the '90s was already hard enough for me :)

[From pauln]
> .. my apologies in advance:
> http://www.psc.edu/~pauln/Diskless_Boot_Howto.html

>  While I think cfengine and custom scripts give a ton of flexibility,
> I've found it much easier on our diskless clusters to use the Warewulf
> software ( http://www.warewulf-cluster.org/ ).  It handles a lot of the
> behind-the-scenes dirty work for you (i.e., making the RAM disks/tmpfs,
> configuring PXE & DHCP, etc.), and the people on the mailing list tend to
> be quick to respond to troubles with effective solutions.  Also, it's
> actively supported by other people and it just makes life a lot easier,
> in my opinion.  It isn't hard at all to tweak, either, and I'd happily
> go into more detail if you wish, but I'd really recommend a quick look
> through the website as well, just to get a rough idea of the process.

>  Secondly, though I haven't used it myself, I recently spoke with a
> friend who was very knowledgeable about Rocks, which I'm told also
> has a diskless mode.  Here's the link for that:
> ( http://www.rocksclusters.org/ )

Programming in MPI/SHMEM is, by the way, a pretty clumsy way to program.

>  If ease-of-use and a shared-memory style are more important to you than
> performance, you might be interested in checking out the "Cluster
> OpenMP" developments in the Intel compilers.

OpenMP doesn't enter into it here, of course.

No, no: plain shared-memory programming is way easier.

Just allocate some shared memory in Linux with shmget, attach it with shmat, and you have your shared memory.

That's basically how Diep works.

If I go add all kinds of fancy MPI calls to that, it of course first slows down by a factor of 2 or so
on a single processor.

It's much easier to just keep using what I've got: start n processes on n cores, and use shared memory
to divide the memory into segments.

The assumption in Diep is that the process that first allocates a shared memory segment, and also clears it (or initializes it, whatever you want to call it), is the processor at which that memory physically gets allocated.

If that principle is followed, then Diep parallelizes fine, even with pretty bad processor-to-processor latencies.

The luck I've got in Diep is that it has the most chess knowledge in its evaluation function of all chess programs in the world. That's a result of me having been dogfood for world-class players over the years, some of them in the world top 10 even (and I actually managed to draw a world top-6 player once myself in an official major-league game). You learn the game quickly that way :)

So needing those 64 bytes from a remote node doesn't happen too frequently in Diep, and with 4 cores on a dual Opteron
the odds of the entry being on a remote memory node are of course well below 50%.

An example of access to remote memory is the hash table lookup:

   unsigned int procnr, hindex;

   procnr = ((((unsigned int)(hashpos.lo & 0x000000000000ffff)) * nprocesses) >> 16);
   hindex = (unsigned int)((((hashpos.lo >> 16) & 0x00000000ffffffff) * abmod) >> 32);
   hentry = &(globaltrans[procnr][hindex]);

So basically there exists:
  HashEntry *globaltrans[MAXPROCESSORS];

From the remote processors I simply attach the shared memory into that array with shmget/shmat.
Then the lookup happens.

This is of course a lot simpler than OpenMP, not to mention MPI.

Simplistically, this is also how you program a shared-memory machine such as a quad Opteron or a quad Xeon.

This is how the commercial version of the software looks too, of course.

As you see, I also avoid a slow modulo instruction or two in the code.
Average coders would write something like:

   procnr = ((unsigned int)hashpos.lo) % ((unsigned int)nprocesses);
   hindex = (unsigned int)((hashpos.lo >> 16) % abmod);

Modulo and division are BAD on the processor. Very, very slow.

Though not nearly as slow as an MPI call.

>  This is mostly an aside, but why would you need to strip MPI commands
> to run on a 4- or 8-processor system?

The basic point is: most scientists first slow down their program by a factor of 20 to get MPI in,
and then simply throw a factor of 1000 in hardware at it.

I can't afford that loss on a single-mainboard machine. This software is written, quite optimized, to run optimally on a single-mainboard
machine. No slowdowns.

So if I add MPI calls, that slows me down.

If I move from my dual-core dual Opteron to a 16-node cluster using MPI calls, my first priority is to be faster than something very well
optimized for a single-mainboard machine.

THAT IS NOT EASY.

> [...] matter.  I agree shared memory methods are easier to program, but I [...]

It's not about stripping.

We're talking about 2.2 MB of optimized C code that I would ADD MPI calls to, with all the bugs you get from that and that then need to be fixed. Bugfixing that takes years.

Vincent

>  Finally, going back to the beginning of the discussion, I'd just caution
> you about putting motherboards on a slab of wood in a garage.  The
> filter might keep dust out of the garage, but other things always seem
> to manage to get into garages, and lots of creepy-crawly things love
> warmth and light - two things your system is bound to give off.  :)

Bugs :)

Thanks,
Vincent

_______________________________________________
Beowulf mailing list, [email protected]
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
