Benoit des Ligneris wrote:
Hello,
I think we should also add some flags relevant to the queuing system, and maybe some kind of "Next Action" items.
Basically, for a planned maintenance you need to shut your node down and either reboot from hard disk [reboot from hard-disk] (e.g., a simple kernel upgrade, to get LAM/MPI checkpointing), do a clean install [reboot from network], or halt the node [halt] because of a power failure or excessive generated heat (failure detection). In all cases, the node has to be removed from the PBS queue and certain commands (grub/lilo/syslinux/XXX-dependent, plus halt/remove) have to be issued on the node.
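The "Next Action" idea above could be sketched as a table mapping each action to the ordered commands issued for it. This is purely illustrative: `NEXT_ACTIONS` and `plan_maintenance` are made-up names, and `reboot-to-pxe` stands in for whatever bootloader-dependent mechanism the node actually uses; only `pbsnodes -o` (mark a PBS node offline) is a real command.

```python
# Hypothetical sketch of "Next Action" handling; names are illustrative,
# not real OSCAR APIs. Only `pbsnodes -o` is an actual PBS command.

NEXT_ACTIONS = {
    "reboot_disk":    ["pbsnodes -o {node}", "ssh {node} shutdown -r now"],
    "reboot_network": ["pbsnodes -o {node}", "ssh {node} reboot-to-pxe"],
    "halt":           ["pbsnodes -o {node}", "ssh {node} halt"],
}

def plan_maintenance(node, action):
    """Return the ordered command list for a planned maintenance action."""
    if action not in NEXT_ACTIONS:
        raise ValueError("unknown next action: %s" % action)
    # Every plan starts by pulling the node out of the PBS queue.
    return [cmd.format(node=node) for cmd in NEXT_ACTIONS[action]]
```

The point is that the queue removal is common to every action, while the final command is bootloader/OS dependent.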
I'm operating on the assumption that, at any given time, any node in the cluster can suddenly be found not just underperforming, but not even in the building. Any efforts we make toward HA or FT are great, but we always have to be able to fall back on error handlers that can recover from situations like, "I just tried to submit this fragment of a PBS job to node 0142 and it stood mute... now what?"
I think "Available" flags in the Node Record table would be a good addition. That way the first time a node is found to be surprisingly unavailable, we don't have to get surprised again, and in HA/FT handled absenses, we'll never have to be surprised, but surprise is a fact of life *BOO!* (see?).
I think that the same procedure should be followed when doing a package install/removal/upgrade on the node, i.e.:
- Mark the node so that it does not accept jobs anymore
- Wait for job completion
- Do some action on the node:
  * reboot from medium X
  * halt
  * RPM operations
  * packman operations
For nodes where the installed software base is RPM based, PackMan operations *are* RPM operations. In a perfect world, when this system is implemented, there would be nothing preventing you from having nodes W-X be RPM based, nodes Y-Z be Debian based, and nodes J-Q be Stampede based. Let me state simply, I don't suggest that.
  * ...
- Confirm the success of the action
- Mark the node so that it can now accept jobs
Alternatively, we should be able to override this and not care about the queuing system and proceed immediately.
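The drain/act/confirm cycle above, including the proposed override, might look like the following. This is a minimal sketch under assumed names (`maintain` and the step strings are stand-ins, not actual OSCAR or PBS APIs):

```python
# Illustrative state machine for the drain/act/confirm cycle.
# All names here are stand-ins, not real OSCAR interfaces.

def maintain(node, action, jobs_running, force=False):
    """Run `action` on `node`, draining the queue first unless forced.

    Returns the ordered list of steps taken."""
    steps = []
    if not force:
        steps.append("mark_unavailable")   # stop accepting new jobs
        if jobs_running:
            steps.append("wait_for_jobs")  # let current jobs finish
    steps.append("run:" + action)          # reboot / halt / RPM / packman op
    steps.append("confirm")                # verify the action succeeded
    steps.append("mark_available")         # accept jobs again
    return steps
```

With `force=True` the drain steps are skipped and the action proceeds immediately, matching the override described above.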
Some inline comments on selected extracts:
* Matt Garrett <[EMAIL PROTECTED]> [03-12-02 19:31]:
The Golden Image, in my imagination, is inviolable. It's to be the minimal disk system for a given node group. Real no-frills stuff, but this is, of course, SysAdmin tunable. This field would tell the head node(s) where the image file is located. Nothing prevents more than one node group from having the same GI as its foundation.
Is it different from a regular SIS image with a minimal number of packages?
I think, no. Though, how is the content of that image decided upon? One approach, which David Lombard is working on with update-rpms (but which will be made available from all compliant DepMan modules), is to select a good (read: short) set of top-level packages and have the dependency manager fill in all of the underlying required packages, fleshing out an absolutely minimal system that is guaranteed to support all of the top-level packages specified for a given GI.
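The "fill in the underlying required packages" step is essentially a transitive closure over a depends-on relation. A toy sketch, with made-up package names (real resolution would of course be delegated to DepMan/update-rpms):

```python
# Toy dependency closure: from a short top-level package list and a
# depends-on map, compute the minimal installed set. Package names are
# invented for illustration.

def closure(top_level, depends_on):
    """Return the set of all packages required by `top_level`."""
    needed, stack = set(), list(top_level)
    while stack:
        pkg = stack.pop()
        if pkg not in needed:
            needed.add(pkg)
            # Queue this package's own dependencies for inclusion.
            stack.extend(depends_on.get(pkg, []))
    return needed
```

The SysAdmin names only the short top-level list; everything else in the GI falls out of the dependency data.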
Is this distro dependent?
I think, yes. But not by much. That's what PackMan/DepMan are meant to handle, though they can't do everything. Hopefully, the set of top-level packages can be named similarly across many OSes (read: distributions), so the same top-level package lists can be fed to DepMan for different OSes and produce installed software bases such that the differences between clusters of similar hardware running different OSes are unnoticeable at the operations level (modulo factors such as the performance of Linux 2.6's kernel kicking BSD's... but I digress). Worst case scenario, each distro/version would require a separate top-level package list for a given installed-software-base target.
The Membership List is just a list of node names/IDs for Node Records.
My current imagining is that a given node can only belong to one node group at a time. If there is a compelling reason to allow multiple group membership, we can work that out.
Well, there are several reasons for this. For instance, you can have:
- "geographical" groups of nodes: rack01, rack02, ...
- "network" groups of nodes: switch01, switch02, ...
- "memory" groups of nodes: "256", "512", ...
- "node type" groups of nodes: "PII300", "PIII557", "Athlon800", "IA64", ...
- "queue" groups of nodes: 64jobs, 128jobs, ...
- "network type" groups: 100Mb, 1Gb, Myrinet, Infiniband, ...
- "..."
But none of those grouping labels necessarily impacts the installed software base the group requires. Membership in each of those groups can be determined by a suitable query against a database properly populated with filled-out fields. As opposed to, say, "run this job on those nodes that are attached to the instruments this particular job requires (and on which I've installed the appropriate drivers)", or "run this job on the nodes I've installed LAM/MPI 7.0 on when most of the others have 6.5". I'm looking toward those grouping decisions that mandate alterations in the installed software base.
Now, having said that, it doesn't mean that things like "network type" wouldn't also be reflected in the installed software base. The inclusion or exclusion of GM (for Myrinet) would definitely be an impact on the installed software base (mostly in the GI, I should think) made by the "network type", but things like "geographical", "network", "memory", and "node type" (when all nodes are within a broad category such as IA32) shouldn't have an impact on the node groups system as I've described it. Obviously, though, node architecture (PPC, IA32, IA64, AMD64, Sparc, et al.) would have to be represented by different GIs. I see no roadblock to running jobs on such meta-groups as can be ascertained by judicious ODA queries, like "all nodes attached to switch03".
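Resolving a meta-group like "all nodes attached to switch03" by query rather than stored membership could look like this. The schema is entirely hypothetical (an assumed `nodes` table with `switch` and `arch` columns); ODA's real schema and API will differ:

```python
# Sketch of meta-groups as database queries, using an in-memory SQLite
# database. The `nodes` table and its columns are assumptions, not ODA's
# actual schema.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes (name TEXT, switch TEXT, arch TEXT)")
conn.executemany("INSERT INTO nodes VALUES (?, ?, ?)", [
    ("node01", "switch03", "IA32"),
    ("node02", "switch01", "IA32"),
    ("node03", "switch03", "IA64"),
])

def meta_group(conn, **fields):
    """All node names whose record matches every given field=value pair."""
    where = " AND ".join("%s = ?" % k for k in fields)
    rows = conn.execute("SELECT name FROM nodes WHERE " + where,
                        list(fields.values()))
    return sorted(r[0] for r in rows)
```

No membership list is stored anywhere; the group exists only as the query result, which is exactly why such groups needn't touch the installed software base.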
And, of course, each group is useful sometimes. Nodes can be members of several groups.
Meta-groups, sure, no problem, but I see headaches galore when trying to secure memberships in multiple software base groups for a single node. Which GI does it get imaged with? Does it get the union of the deltas for both groups or the difference?
The node then filters the deltas for contradictions ("install package-1.0" followed by "remove package-1.0" later), redundancies, etc., until it has minimal deltas it can actually apply. It applies the OSCAR package deltas first. Then it filters the auxiliary package delta for removal of packages it doesn't have, or installation/update of packages that are already installed/up-to-date, since they might have been affected by the OSCAR package delta application.
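That filtering step can be sketched roughly as follows. This is a minimal illustration of the idea, not the actual implementation: later operations on a package supersede earlier ones (resolving contradictions), and no-ops against the installed set are dropped (resolving redundancies).

```python
# Minimal sketch of delta filtering: the last operation per package wins,
# then operations that are no-ops against the installed set are dropped.

def filter_delta(delta, installed):
    """delta: ordered list of (op, pkg) pairs, op in {"install", "remove"}.
    installed: set of packages currently on the node.
    Returns the minimal delta, sorted for determinism."""
    last = {}
    for op, pkg in delta:
        last[pkg] = op               # later entries override earlier ones
    result = []
    for pkg, op in last.items():
        if op == "install" and pkg in installed:
            continue                 # redundant: already present
        if op == "remove" and pkg not in installed:
            continue                 # redundant: nothing to remove
        result.append((op, pkg))
    return sorted(result)
```

So "install package-1.0" followed later by "remove package-1.0" collapses to the remove alone, which then vanishes entirely if the package was never installed.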
Is there an "upgrade path", or is each installation considered like a new one (i.e., remove older package, install newer package)?
I guess that would really depend on the underlying OS/package manager. If it is RPM based, then rpm -U will get used (handled by the PackMan abstraction anyway). If not, then that OS's PackMan module handles it as it sees fit. It's not really a consideration at this level.
del_node_group <group> I guess after that, I needed to specify this. Just deletes the specified Node Group table from all existence. This of course orphans all those nodes that belonged to the group. Perhaps the default Next Action if a Node Record's Groupname field is invalid should just be "Nothing".
I would like to create an "active" flag for the node_group and set it to 0. This is generally what is done when we want to preserve the information that a node_group of this name existed. For instance, if we start logging node_group_name somewhere, we should keep the record for this group name so that we can analyze the logs later on.
[Doable anytime. Usually done manually. Deleting the Node Group table of nodes that are currently running jobs is probably not a good idea and should probably be trapped, but it shouldn't affect any currently running jobs.]
Well, now that we have this nice mechanism for grouping, I think I would like OSCAR packages to use these groups: for instance, the c3.conf file, PBS queue configuration, etc.
At this point, deleting a node_group will have important [bad] consequences (not necessarily immediate)...
I completely see your point. Just like good SysAdmin practice is to never delete a pwent in /etc/passwd, but merely set its password hash (or shadow spec as the case may be) to a disabled setting.
Good suggestion.
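The soft delete being agreed on here is simple to state in code. A minimal sketch, assuming a hypothetical in-memory representation of the node_group records (the real ones would live in ODA):

```python
# Hedged sketch of the proposed soft delete: instead of dropping the
# node_group record, clear its "active" flag so old logs that mention
# the group name stay analyzable. The dict stands in for ODA storage.

node_groups = {"rack01": {"active": 1}, "rack02": {"active": 1}}

def del_node_group(name, groups=node_groups):
    """Deactivate a group rather than deleting its record."""
    if name not in groups:
        raise KeyError(name)
    groups[name]["active"] = 0   # record survives for later log analysis
```

This mirrors the /etc/passwd practice: the entry stays, only its usable state changes.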
Ben
-- Matthew Garrett [EMAIL PROTECTED]
"... I do not love the bright sword for its sharpness, nor the arrow for its swiftness, nor the warrior for his glory. I love only that which they defend..." -- Faramir, "The Lord of the Rings"
_______________________________________________ Oscar-devel mailing list [EMAIL PROTECTED] https://lists.sourceforge.net/lists/listinfo/oscar-devel
