Hello,
I think that we should also add some flags relevant for the queuing
system and maybe some kind of "Next Action" items.
Basically, for planned maintenance, for instance, you may need to shut
your node down and [reboot from hard-disk] (a simple kernel upgrade, to
get LAM/MPI checkpointing for instance), or you may want a clean install
[reboot from network], or you may want to halt the node [halt] because
of a power failure or excessive heat (failure detection). In all cases,
the node has to be removed from the PBS queue and certain commands
(grub/lilo/syslinux/XXX dependent + halt/remove) have to be issued on
the node.
I think that the same procedure should be followed when installing,
removing or upgrading packages on the node, i.e.:
- Mark the node so that it does not accept jobs anymore
- Wait for job completion
- Do some action on the node :
  * reboot from medium X
  * halt
  * RPM operations
  * packman operations
  * ...
- Confirm the success of the action
- Mark the node so that it can now accept jobs
Alternatively, we should be able to override this and not care about
the queuing system and proceed immediately.
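To make the sequence above concrete, here is a minimal sketch in Python.
Everything in it is hypothetical (the Node class, run_next_action, the
wait_for_jobs callback and the force flag are names I made up for
illustration); a real version would talk to PBS and to the node itself
instead of flipping fields on an object.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    # Hypothetical stand-in for a Node Record; not an existing OSCAR type.
    name: str
    accepts_jobs: bool = True
    running_jobs: int = 0
    log: List[str] = field(default_factory=list)


def run_next_action(node, action, wait_for_jobs, force=False):
    """Apply `action` to `node`, draining it from the queue first.

    With force=True the queuing system is ignored and the action is
    applied immediately. Returns True if the action reported success.
    """
    if not force:
        node.accepts_jobs = False      # mark the node: no new jobs
        node.log.append("offline")
        wait_for_jobs(node)            # block until running jobs are done
        node.log.append("drained")
    ok = action(node)                  # reboot / halt / package operation
    node.log.append("action ok" if ok else "action failed")
    if ok and not force:
        node.accepts_jobs = True       # mark the node: accepts jobs again
        node.log.append("online")
    return ok


def drain(node):
    # Stand-in for polling the queuing system until the node is idle.
    node.running_jobs = 0


if __name__ == "__main__":
    n = Node("node01", running_jobs=2)
    run_next_action(n, lambda nd: True, drain)
    print(n.log)
```

Note that if the action fails, the node is deliberately left out of the
queue so that nobody schedules jobs on a half-upgraded node.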
Some inline comments from selected extracts :
* Matt Garrett <[EMAIL PROTECTED]> [03-12-02 19:31]:
> The Golden Image, in my imagination, is inviolable. It's to be the
> minimal disk system for a given node group. Real no-frills stuff, but
> this is, of course, SysAdmin tunable. This field would tell the head
> node(s) where the image file is located. Nothing prevents more than one
> node group from having the same GI as its foundation.
Is it different from a regular SIS image with a minimal number of
packages?
Is this distro dependent?
> The Membership List is just a list of node names/IDs for Node Records.
> My current imagining is that a given node can only belong to one node
> group at a time. If there is a compelling reason to allow multiple group
> membership, we can work that out.
Well, there are several reasons for this. For instance you can have
"geographical" groups of nodes : rack01, rack02, ...
"network" group of nodes : switch01, switch02, ...
"memory" group of nodes : "256", "512", ...
"node type" group of nodes : "PII300" "PIII557" "Athlon800" "IA64" ...
"queue" group of nodes : 64jobs 128jobs ...
"network type" : 100Mb 1Gb Myry Infiniband ...
"..."
And, of course, each of these groups is useful at some point. Nodes can
be members of several groups.
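As a rough illustration of what multiple membership buys us, here is a
small Python sketch. GroupRegistry and its methods are hypothetical names
of mine, not an existing OSCAR interface; the point is only that a query
such as "nodes in rack01 with 512 MB" becomes a set intersection.

```python
from collections import defaultdict


class GroupRegistry:
    """Hypothetical many-to-many mapping between groups and nodes."""

    def __init__(self):
        self._members = defaultdict(set)   # group name -> set of node names

    def add(self, group, *nodes):
        self._members[group].update(nodes)

    def groups_of(self, node):
        """All groups a given node belongs to, sorted by name."""
        return sorted(g for g, m in self._members.items() if node in m)

    def nodes_in(self, *groups):
        """Nodes that belong to every one of the given groups."""
        sets = [self._members.get(g, set()) for g in groups]
        return sorted(set.intersection(*sets)) if sets else []


reg = GroupRegistry()
reg.add("rack01", "node01", "node02")
reg.add("512", "node02", "node03")
reg.add("Myry", "node02")

if __name__ == "__main__":
    print(reg.nodes_in("rack01", "512"))
    print(reg.groups_of("node02"))
```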
> The node then filters the deltas for contradictions ("install
> package-1.0" followed by "remove package-1.0" later), redundencies, etc.
> until it has minimal deltas it can actually apply. It applies the OSCAR
> package deltas first. Then, it filters the auxilliary package delta for
> removal of packages it doesn't have or installation/update of packages
> that are already installed/up-to-date, since they might have been
> affected by the OSCAR package delta application.
Is there an "upgrade path", or is each installation treated as a new
one (i.e.: remove the older package, install the newer package)?
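To check that I understand the filtering step, here is how I would sketch
it in Python. This is only my reading of the description, under two
assumptions: a delta is an ordered list of (operation, package) pairs,
and the last request for a given package wins. minimize_deltas is a
hypothetical name. Package strings are compared as opaque names, so
there is no notion of versions or of an upgrade path here.

```python
def minimize_deltas(deltas, installed):
    """Reduce an ordered list of (op, package) deltas to the useful ones.

    op is "install" or "remove"; `installed` is the set of packages
    currently present on the node.
    """
    final = {}                      # package -> last requested operation
    for op, pkg in deltas:
        final.pop(pkg, None)        # re-insert so order follows last request
        final[pkg] = op
    result = []
    for pkg, op in final.items():
        if op == "install" and pkg not in installed:
            result.append((op, pkg))    # not there yet: worth installing
        elif op == "remove" and pkg in installed:
            result.append((op, pkg))    # actually present: worth removing
    return result


if __name__ == "__main__":
    deltas = [
        ("install", "package-1.0"),
        ("install", "foo-2.1"),
        ("remove", "package-1.0"),   # contradicts the first line
        ("install", "foo-2.1"),      # redundant
    ]
    print(minimize_deltas(deltas, installed=set()))
```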
> del_node_group <group>
> I guess after that, I needed to specify this. Just deletes the
> specified Node Group table from all existence. This of course
> orphans all those nodes that belonged to the group. Perhaps the
> default Next Action if a Node Record's Groupname field is invalid
> should just be "Nothing".
I would rather create an "active" flag for the node_group and set it to
0 instead of deleting the table. This is generally what is done when the
information that a node_group of this name existed has to be kept. For
instance, if we start logging node_group_name somewhere, we should keep
the record for this group name so that we can analyze the logs later on.
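A minimal sketch of the idea in Python, assuming an in-memory table
(NodeGroupTable and its methods are hypothetical; only del_node_group
and the "active" flag come from this thread):

```python
class NodeGroupTable:
    """Hypothetical node group store with an "active" flag."""

    def __init__(self):
        self._groups = {}   # name -> {"active": 0 or 1, "members": set}

    def add(self, name, members=()):
        self._groups[name] = {"active": 1, "members": set(members)}

    def del_node_group(self, name):
        # Deactivate instead of deleting, so old logs can still be resolved.
        self._groups[name]["active"] = 0

    def active_groups(self):
        return sorted(n for n, g in self._groups.items() if g["active"])

    def known(self, name):
        """True for any group that ever existed, active or not."""
        return name in self._groups


table = NodeGroupTable()
table.add("rack01", ["node01", "node02"])
table.add("rack02", ["node03"])
table.del_node_group("rack01")

if __name__ == "__main__":
    print(table.active_groups())
    print(table.known("rack01"))
```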
> [Doable anytime. Usually done manually. Deleting the Node Group
> table of nodes that are currently running jobs is probably not a
> good idea and should probably be trapped, but it shouldn't affect
> any currently running jobs.]
Well, now that we have this nice mechanism for grouping, I think I would
like OSCAR packages to use these groups: for instance the c3.conf file,
the PBS queue configuration, etc.
Once that is the case, deleting a node_group will have serious [bad]
consequences (not necessarily immediate)...
Ben
--
Benoit des Ligneris Ph. D. <|> http://benoit.des.ligneris.net/
Centre de Calcul Scientifique <|> http://ccs.USherbrooke.ca/
OSCAR Developpe(u)r <|> http://oscar.sourceforge.net/
ÉduLinux <|> http://www.edulinux.org/
Révolution Linux <|> http://www.revolutionlinux.com/
_______________________________________________
Oscar-devel mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/oscar-devel