So the question is: are we looking at /nodes/ that have a /current
role/, or are we looking at /roles/ that have some /current nodes/?

My contention is that the role is the interesting thing, and the nodes
are the incidental thing. That is, as a sysadmin, my hierarchy of
concerns is something like:
  A: are all services running
  B: are any of them in a degraded state where I need to take prompt
action to prevent a service outage [might mean many things: software
update, disk space criticals, a machine failed and we need to scale the
cluster back up, too much load]
  C: are there any planned changes I need to make [new software deploy,
feature request from user, replacing a faulty machine]
  D: are there long term issues sneaking up on me [capacity planning,
machine obsolescence]

If we take /nodes/ as the interesting thing, and what they are doing
right now as the incidental thing, it's much harder to map that onto
the sysadmin concerns. If we start with /roles/ then we can answer:
  A: by showing the list of roles and the summary stats (how many
machines, service status aggregate), role level alerts (e.g. nova-api
is not responding)
  B: by showing the list of roles and more detailed stats (overall
load, response times of services, tickets against services) and a
list of in-trouble instances in each role: instances with alerts
against them (low disk, overload, failed service, early-detection
alerts from hardware)
  C: probably out of our remit for now in the general case, but we need
to enable some things here like replacing faulty machines
  D: by looking at trend graphs for roles (not machines), but also by
looking at the hardware in aggregate - breakdown by age of machines,
summary data for tickets filed against instances that were deployed to
a particular machine
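
A minimal sketch of the role-first view behind A and B, in Python. It
groups instances by role, then reports per-role counts and the
instances that need attention. The InstanceStatus type, its fields, and
summarize_by_role are hypothetical names for illustration only; none of
them is an existing TripleO or OpenStack API.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class InstanceStatus:
    """Hypothetical record: one role deployed on one node, plus its alerts."""
    role: str
    node: str
    service_up: bool
    alerts: list = field(default_factory=list)  # e.g. ["low disk"]


def summarize_by_role(instances):
    """Group instances by role and build a per-role summary."""
    by_role = defaultdict(list)
    for inst in instances:
        by_role[inst.role].append(inst)

    summary = {}
    for role, members in by_role.items():
        summary[role] = {
            "machines": len(members),
            "services_up": sum(1 for m in members if m.service_up),
            # In trouble: the service is down or any alert is raised.
            "in_trouble": sorted(
                m.node for m in members if m.alerts or not m.service_up
            ),
        }
    return summary


instances = [
    InstanceStatus("overcloud compute", "node-1", True),
    InstanceStatus("overcloud compute", "node-2", True, ["low disk"]),
    InstanceStatus("overcloud control plane", "node-3", False, ["failed service"]),
]
report = summarize_by_role(instances)
for role, stats in sorted(report.items()):
    print(role, stats)
```

The node names only appear inside a role's "in_trouble" list, which is
the point of the argument: the role is the row, the node is a detail.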

C: and D: are (F) category work, but for all but the very last thing,
it seems clear how to approach this from a roles perspective.

I've tried to approach this using /nodes/ as the starting point, and
after two terrible drafts I've deleted the section. I'd love it if
someone could show me how it would work:)

     * Unallocated nodes

This implies an 'allocation' step that we don't have - how about
'Idle nodes' or something?

It can be auto-allocation. I don't see a problem with the term 'unallocated'.

Ok, it's not a biggy. I do think it will frame things poorly and lead
to an expectation about how TripleO works that doesn't match how it
does, but we can change it later if I'm right, and if I'm wrong, well
it won't be the first time :).


I'm interested in what the distinction you're making here is. I'd
rather get things defined correctly the first time, and it's very
possible that I'm missing a fundamental definition here.

So we have:
  - node - a physical general purpose machine capable of running in
many roles. Some nodes may have hardware layout that is particularly
useful for a given role.
  - role - a specific workload we want to map onto one or more nodes.
Examples include 'undercloud control plane', 'overcloud control
plane', 'overcloud storage', 'overcloud compute' etc.
  - instance - A role deployed on a node - this is where work actually happens.
  - scheduling - the process of deciding which role is deployed on which node.

This glossary is really handy to make sure we're all speaking the same language.
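
The four glossary terms above can be sketched as a small Python data
model. This is only an illustration of how the terms relate; the class
names mirror the glossary, and the first-free-node rule in schedule()
is an assumption, not how the real scheduler chooses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """A physical general-purpose machine capable of running many roles."""
    name: str


@dataclass(frozen=True)
class Role:
    """A specific workload to map onto one or more nodes."""
    name: str


@dataclass(frozen=True)
class Instance:
    """A role deployed on a node; this is where work actually happens."""
    role: Role
    node: Node


def schedule(role, free_nodes):
    """Scheduling: decide which node a role is deployed on.

    Assumption for illustration: take the first free node.
    """
    if not free_nodes:
        raise RuntimeError(f"no free node for role {role.name!r}")
    node = free_nodes.pop(0)
    return Instance(role=role, node=node)


free = [Node("node-1"), Node("node-2")]
inst = schedule(Role("overcloud compute"), free)
print(inst.role.name, "->", inst.node.name)
```

Note that nothing in the model lets a caller name the node: the input
is a role, and the node comes out of scheduling.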

The way TripleO works is that we define a Heat template that lays out
policy: '5 instances of overcloud control plane please', '20
hypervisors' etc. Heat passes that to Nova, which pulls the image for
the role out of Glance, picks a node, and deploys the image to the
node.

Note in particular the order: Heat -> Nova -> Scheduler -> Node chosen.

The user action is not 'allocate a Node to overcloud control plane',
it is 'size the control plane through Heat'.
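
A toy Python sketch of that ordering, to make the user action concrete.
The functions heat_apply, nova_boot and scheduler_pick are hypothetical
stand-ins named after the components; they are not real Heat or Nova
calls, and the first-free-node choice is an assumption.

```python
def scheduler_pick(free_nodes):
    """Stand-in for the scheduler: take the first free node."""
    return free_nodes.pop(0)


def nova_boot(role, free_nodes):
    """Stand-in for Nova: have a node picked, then deploy the role to it."""
    node = scheduler_pick(free_nodes)
    return (role, node)


def heat_apply(policy, nodes):
    """Stand-in for Heat: request `count` instances of each role."""
    free_nodes = list(nodes)
    instances = []
    for role, count in policy.items():
        for _ in range(count):
            instances.append(nova_boot(role, free_nodes))
    # Whatever remains simply had nothing scheduled to it.
    return instances, free_nodes


# The only user input is the size of each role.
policy = {"overcloud control plane": 2, "overcloud compute": 3}
nodes = [f"node-{i}" for i in range(1, 8)]
instances, unscheduled = heat_apply(policy, nodes)
print(len(instances), "instances; nothing scheduled on:", unscheduled)
```

The policy never mentions a node name, and the list of nodes with
nothing scheduled on them is a by-product of sizing the roles. In this
sketch, asking for more instances than there are nodes raises an
IndexError from the empty list.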

So when we talk about 'unallocated Nodes', the implication is that
users 'allocate Nodes', but they don't: they size roles, and after
doing all that there may be some Nodes that are - yes - unallocated,

I'm not sure if I should ask this here or to your point above, but what about multi-role nodes? Is there any piece in here that says "The policy wants 5 instances but I can fit two of them on this existing underutilized node and three of them on unallocated nodes"? Or, since it's all at the image level, do you get just what's in the image, and is that the finest level of granularity?

or have nothing scheduled to them. So... I'm not debating that we
should have a list of free hardware - we totally should - I'm debating
how we frame it. 'Available Nodes' or 'Undeployed machines' or
whatever. I just want to get away from talking about something
([manual] allocation) that we don't offer.

My only concern here is that we're not talking about cloud users, we're talking about admins adminning (we'll pretend it's a word, come with me) a cloud. To a cloud user, "give me some power so I can do some stuff" is a safe use case if I trust the cloud I'm running on. I trust that the cloud provider has taken the proper steps to ensure that my CPU isn't in New York and my storage in Tokyo.

To the admin setting up an overcloud, they are the ones providing that trust to eventual cloud users. That's where I feel like more visibility and control are going to be desired/appreciated.

I admit what I just said isn't at all concrete. Might even be flat out wrong. I was never an admin, I've just worked on sys management software long enough to have the opinion that their levels of OCD are legendary. I can't shake this feeling that someone is going to slap some fancy new jacked-up piece of hardware onto the network and have a specific purpose they are going to want to use it for. But maybe that's antiquated thinking on my part.

-Rob


_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
