Matt-
It's pretty cool that so much of this coincides with resolutions we've discussed here at NCSA... it doesn't completely match, of course. We've discussed this (grouping) at *extreme* length at NCSA, and finally came to some agreement. I described much of it at the OSCAR meeting at NCSA a few months back... unfortunately, neither you nor Ben could be there for that one. It's my own fault for not getting a better description out in text earlier, but so be it.
We'll definitely have to have a call to discuss a couple things here. IU and NCSA were going to call each other tomorrow morning, but I expect Ben may be interested in joining that discussion as well. Ben- please let me know if you're interested/available tomorrow morning.


I don't have much time right at the moment, but off the cuff I have these comments:
* Completely arbitrary grouping of nodes is a must. I suspect this will raise the most questions for you. We do have a layout where this will work well, though. No matter what, we have to have it.
* Packages will be responsible for adjusting their own configs to reflect the grouping. OSCAR's grouping should in no way be bound to any packages.
* The synchronization should NOT be bound to any batch system. If someone wants to "trigger" synchronizations after/between batch jobs, then that should be a layer added after all this, not tied to it whatsoever. (i.e. I could have PBS epilogue scripts kick off a sync, reboot, etc)
* At this point, the only fixed reason more than a single GI would ever be needed is IDE/SCSI, and even that may be fixable in due time. The GI should either be completely minimal, or else be grown to the point of minimizing the +/- deltas somehow. This really isn't that critical, since the sync operations should just "work".
* If a sync fails, no big deal. The "sync_worked" flag is the last operation to be done, and any sync should be re-runnable anyway.


I'm out of time, but I can explain much better later. Hopefully over the phone, and before this thread goes in wild directions that potentially won't matter after we talk on the phone and sync up. :)

No conference number is set up or anything, but we can call 2 places (via 3 way calling) tomorrow from NCSA. Right now I'm thinking it will be IU/Sherbrooke. How does 11AM central time work for you?

Jeremy

At 06:35 PM 12/2/2003, Matt Garrett wrote:
Character sketch for OSCAR4's Node Groups

ODA tables:

Node Group (groupname/ID)
  Golden Image
  OSCAR Package Delta
  Auxiliary Package Delta
  Membership List
  Reader-Writer Flag

Node record (nodename/ID)
  Hostname
  Domain name
  NIC List
  Groupname/ID
  Last Sync Timestamp
  Next Action

NIC (interface/ID)
  IP
  Subnet mask
  Broadcast
  Default Gateway
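
For concreteness, the three tables above could be sketched as record types roughly like the following. (Python here is just a notation; the field names are my shorthand, not a committed ODA schema.)

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class NIC:
    interface: str
    ip: str
    subnet_mask: str
    broadcast: str
    default_gateway: str

@dataclass
class NodeRecord:
    hostname: str
    domain_name: str
    nics: List[NIC]
    group: Optional[str] = None        # groupname/ID of the owning Node Group
    last_sync: Optional[float] = None  # None == never synced
    next_action: str = "Normal Sync"

@dataclass
class NodeGroup:
    name: str
    golden_image: str                  # where the GI lives on the head node(s)
    oscar_delta: list = field(default_factory=list)
    aux_delta: list = field(default_factory=list)
    members: list = field(default_factory=list)  # node names/IDs
    readers_writers: int = 0           # the Reader-Writer Flag
```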

The groupname is an arbitrary name applied to the group. For a cluster with 128 nodes with IDE disks and 128 nodes with SCSI disks, "IDE" and "SCSI" would be natural groupnames for managing the two distinct groups of compute nodes. I was thinking of setting up ODA's setup script to seed two default Node Group tables, "Head" and "Compute". For simple clusters, "Head" wouldn't have much to do and should be removable without any impact, but for HA clusters, "Head" would be required to keep the redundant head nodes in sync. "Compute" would just be the default catch-all for the entire population of compute nodes. None of these names is sacrosanct. It should be relatively straightforward to begin partitioning the cluster into distinct groups later.

The Golden Image, in my imagination, is inviolable. It's to be the minimal disk system for a given node group. Real no-frills stuff, but this is, of course, SysAdmin tunable. This field would tell the head node(s) where the image file is located. Nothing prevents more than one node group from having the same GI as its foundation.

The Package Deltas are just lists of package manager operations (not coincidentally, those supported by PackMan). Each Delta list item would contain a marker indicating which package manager operation it is (Install, Update, or Remove); the list of package names the operation operates on; and a timestamp of when that operation was added to the delta. The deltas are running logs of every package that is I/U/R'ed to a node group, hence the use of timestamps. Nothing prevents a package that is in the GI from being U/R'ed in the deltas.
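
As a sketch of what one delta list item might carry (the dict keys here are illustrative, not a fixed schema):

```python
import time

VALID_OPS = ("Install", "Update", "Remove")

def delta_entry(op, packages):
    """One Package Delta item: the operation marker (I/U/R), the package
    names it operates on, and the time it was added to the delta."""
    if op not in VALID_OPS:
        raise ValueError("unknown operation: %s" % op)
    return {"op": op, "packages": list(packages), "time": time.time()}

# A running log: every I/U/R against the node group, in order.
oscar_delta = [
    delta_entry("Install", ["lam", "pvm"]),
    delta_entry("Update", ["openssh"]),
    delta_entry("Remove", ["lam"]),   # removing a GI package is fine too
]
```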

The Membership List is just a list of node names/IDs for Node Records.
My current imagining is that a given node can only belong to one node group at a time. If there is a compelling reason to allow multiple group membership, we can work that out.


The Reader-Writer Flag is just to keep any compute node from sync'ing when the deltas are about to be augmented, and to keep the head nodes from augmenting the deltas while any compute nodes are sync'ing.

There is one Node Group table per defined node group. There is one Node Record table per node, head or compute. The main data of interest here is the groupname/ID identifying to which node group this node belongs (the double of the Node Group/Membership List), the Last Sync Timestamp, and the Next Action.

Ignoring Next Action, when a node syncs itself, it merely sends a message to the current head node that it wishes to do so. The head node looks up the requesting node's node group, increments that Node Group's Readers-Writers Flag, and retrieves the deltas, then looks up the node's timestamp, filters the deltas of everything older (everything the node's already synced to), and passes the filtered deltas back to the node.

The node then filters the deltas for contradictions ("install package-1.0" followed by "remove package-1.0" later), redundancies, etc. until it has minimal deltas it can actually apply. It applies the OSCAR package deltas first. Then, it filters the auxiliary package delta for removal of packages it doesn't have or installation/update of packages that are already installed/up-to-date, since they might have been affected by the OSCAR package delta application.
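
The two filtering passes — the head node's timestamp filter and the client's contradiction/redundancy reduction — could look roughly like this, assuming each delta entry is a dict with "op", "packages", and "time" keys. ("Keep only the last operation per package" is one simple reduction rule, not the only possible one.)

```python
def newer_than(delta, last_sync):
    """Head-node side: keep only entries newer than the node's
    Last Sync Timestamp (None meaning the node has never synced)."""
    return [e for e in delta if last_sync is None or e["time"] > last_sync]

def reduce_delta(delta):
    """Client side: collapse the log so only the *last* operation seen
    for each package survives ('install foo' then 'remove foo' becomes
    just the remove)."""
    last_op = {}
    for entry in delta:
        for pkg in entry["packages"]:
            last_op[pkg] = entry["op"]
    return last_op

delta = [
    {"op": "Install", "packages": ["foo", "bar"], "time": 1.0},
    {"op": "Remove",  "packages": ["foo"],        "time": 2.0},
    {"op": "Update",  "packages": ["bar"],        "time": 3.0},
]
minimal = reduce_delta(newer_than(delta, 1.5))
# foo: Remove, bar: Update -- the entry at time 1.0 was already applied
```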

Having done that, it updates its own Last Sync Time and then decrements the Readers-Writers Flag.

What happens if a sync fails? Much wailing and gnashing of teeth is the best I have. Let the holy war entitled "When is a package considered installed?" recommence. My vote is that when the delta is successfully augmented, the package is considered I/U/R'ed. The head node can use the Node Group's Membership List to force an entire Node Group to do a group sync. If there's a need to guarantee immediate effect of a delta augmentation just completed, then the head node can force a group sync and check if any nodes barfed.

What happens if a sync'ing node dies before it can decrement the Readers-Writers Flag? I suggest stiff fines and possible jail time. Perhaps Readers-Writers needs to be a list of current readers/sync'ers; that way, when a machine boots, all Readers-Writers Flags can be filtered for the newcomer's ID (right before it tries a new sync).
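
That "list of current sync'ers" variant might behave something like this sketch (all names mine; no persistence or cross-process locking, just the bookkeeping rules):

```python
class SyncLedger:
    """Per-group Readers-Writers bookkeeping: deltas may be augmented
    (write) only when no node is mid-sync (read), and a rebooting node
    purges any stale entry it left behind by dying mid-sync."""

    def __init__(self):
        self.syncing = set()   # node IDs currently sync'ing

    def start_sync(self, node_id, writer_busy=False):
        if writer_busy:
            return False       # hand back the "Nothing" action instead
        self.syncing.add(node_id)
        return True

    def finish_sync(self, node_id):
        self.syncing.discard(node_id)

    def purge(self, node_id):
        """Run at node boot, right before the node tries a fresh sync."""
        self.syncing.discard(node_id)

    def writable(self):
        return not self.syncing
```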

Now, what does Next Action mean? "Normal Sync" means exactly what I just described. "Reimage" signals the node that it's getting a new GI and to assume the position. "Nothing" signals the node that it's not able to sync right now, maybe try again later, but definitely don't bother to update its timestamp, and the Readers-Writers Flag is left unchanged. Future Actions might include "Reboot from [floppy, CD-ROM, HD]".

What action scripts would we want?

Client:
  sync_node_group
    Kicks off the sync/update process described above.
    [Doable anytime, but might get the "Nothing" Action if the deltas
    are tied up. *shrug* Suggested times: At boot, after the completion
    of any PBS job.]

Server:
  augment_group_delta <group> <operation> <package ...>
    Augments the appropriate delta list. Which one should be manifest
    from the package names. Maybe not.
    [Doable whenever the Readers-Writers Flag is 0.]

  sync_node_group <group>
    Uses <group>'s Membership List to force a group sync.
    [Should be done immediately after any new node group is configured
    to initialize it. Also doable ASAW (At SysAdmin's Whim).]

  make_node_group <groupname>
    Adds new Node Group table, which can be filled in with just a new
    GI (built by SystemImager using PackMan/DepMan to build a package
    list for a new minimal system (plus new drivers, configuration data,
    etc.)). Empty deltas are not an inherent problem, unless your GIs
    are truly minimal and you don't have things like
    MPI/PVM/C3/PBS/etc. added to the nodes yet. The group's Membership
    List is empty and its Readers-Writers Flag is 0.
    [Doable anytime. Usually done manually, except during initial
    install.]

  add_node_to_group <nodename> <group>
    Node Group {group}.Membership List += Node Record {nodename}
    Node Record {nodename}.Groupname = Node Group {group}
    Node Record {nodename}.Last Sync Timestamp = <Never>
    Node Record {nodename}.Next Action = Reimage
    foreach other_group in Node Group {
      if (other_group != Node Group {group})
        other_group.Membership List -= Node Record {nodename}
    }

(If you ask me what language that is, I'll plead the fifth. That was just a fancy way of suggesting an enforcement mechanism for allowing each node membership in only one group at a time, as well as all of the other updating that needs to happen.)
[Doable anytime. Usually done manually, except during initial
install.]
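
In real (if toy) Python, that enforcement might read as follows — assuming `nodes` and `groups` are simple in-memory maps; all names here are mine:

```python
from types import SimpleNamespace

def add_node_to_group(nodes, groups, nodename, groupname):
    """Move <nodename> into <groupname>, evicting it from every other
    group so membership stays exclusive, and queue a reimage."""
    node = nodes[nodename]
    for name, group in groups.items():
        if name != groupname and nodename in group.members:
            group.members.remove(nodename)
    if nodename not in groups[groupname].members:
        groups[groupname].members.append(nodename)
    node.group = groupname
    node.last_sync = None            # <Never>
    node.next_action = "Reimage"

nodes = {"n01": SimpleNamespace(group="Compute", last_sync=42.0,
                                next_action="Normal Sync")}
groups = {"Compute": SimpleNamespace(members=["n01"]),
          "SCSI": SimpleNamespace(members=[])}
add_node_to_group(nodes, groups, "n01", "SCSI")
```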


  clone_node_group <group> <groupname>
    Makes a duplicate Node Group table exactly like <group> except it's
    named <groupname>, its Membership List is empty, and its
    Readers-Writers Flag is 0.
    [Doable anytime. Usually done manually.]

clone_node_group would be useful primarily for two purposes: 1) you have a monolithic cluster and now want to start specializing off chunks of it. Clone it, then add_node_to_group those nodes you want in the new, specialized group, then augment_group_delta on the specialized group to begin distinguishing the new group.

And 2) diagnostics and testing. If you want to test out a new configuration on a subset of nodes without disrupting your entire cluster's configuration: clone the configuration, add the victi-- test nodes, make the changes in question, and go "wheeee!". If things crash and there's lots of pain, del_node_group, add the nodes back to their original Node Group, do a group sync, and no one needs to know otherwise.

  del_node_group <group>
    I guess after that, I needed to specify this. Just deletes the
    specified Node Group table from all existence. This of course
    orphans all those nodes that belonged to the group. Perhaps the
    default Next Action if a Node Record's Groupname field is invalid
    should just be "Nothing".
    [Doable anytime. Usually done manually. Deleting the Node Group
    table of nodes that are currently running jobs is probably not a
    good idea and should probably be trapped, but it shouldn't affect
    any currently running jobs.]

Questions?

Comments?

Snide remarks?

--
Matthew Garrett
[EMAIL PROTECTED]

"... I do not love the bright sword for its sharpness, nor the arrow for
its swiftness, nor the warrior for his glory. I love only that which
they defend..."
  -- Faramir, "The Lord of the Rings"



-------------------------------------------------------
This SF.net email is sponsored by: SF.net Giveback Program.
Does SourceForge.net help you be more productive?  Does it
help you create better code?  SHARE THE LOVE, and help us help
YOU!  Click Here: http://sourceforge.net/donate/
_______________________________________________
Oscar-devel mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/oscar-devel



