Re: [Openstack] [openstack-dev] Discussion about where to put database for bare-metal provisioning (review 10726)
Vish,

I think I don't understand your statement fully. Unless we use different hostnames, (hostname, hypervisor_hostname) must be the same for all bare-metal nodes under a bare-metal nova-compute. Could you elaborate on the following statement a little bit more?

> You would just have to use a little more than hostname. Perhaps (hostname, hypervisor_hostname) could be used to update the entry?

Thanks,
David

- Original Message -
> I would investigate changing the capabilities to key off of something other than hostname. It looks from the table structure like compute_nodes could have a many-to-one relationship with services. You would just have to use a little more than hostname. Perhaps (hostname, hypervisor_hostname) could be used to update the entry?
>
> Vish
>
> On Aug 24, 2012, at 11:23 AM, David Kang dk...@isi.edu wrote:
> (snip)
Re: [Openstack] [openstack-dev] Discussion about where to put database for bare-metal provisioning (review 10726)
Hi David,

I just checked out the code more extensively and I don't see why you need to create a new service entry for each compute_node entry. The code in host_manager to get all host states explicitly gets all compute_node entries. I don't see any reason why multiple compute_node entries can't share the same service. I don't see any place in the scheduler that is grabbing records by service instead of by compute node, but if there is one that I missed, it should be fairly easy to change it.

The compute_node record is created in compute/resource_tracker.py as of a recent commit, so I think the path forward would be to make sure that one of the records is created for each bare metal node by the bare metal compute, perhaps by having multiple resource_trackers.

Vish

On Aug 27, 2012, at 9:40 AM, David Kang dk...@isi.edu wrote:
(snip)
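To sketch that path under stated assumptions: the bare-metal nova-compute would run one tracker per bare-metal node, and each tracker would own its own compute_nodes row pointing at the shared service record. The class, the node dict, and the field values below are hypothetical; db.compute_node_create and db.compute_node_update are existing db API calls for that table, but whether a per-node tracker is where they should be called from is an assumption.

# Hypothetical sketch (not nova code): one nova-compute service entry,
# one compute_nodes row per bare-metal node, each owned by its own tracker.
from nova import db


class BareMetalNodeTracker(object):
    """Tracks the resources of a single bare-metal node (hypothetical)."""

    def __init__(self, service_id, node):
        self.service_id = service_id   # id of the shared services row
        self.node = node               # dict describing one bare-metal machine
        self.compute_node = None       # the compute_nodes row this tracker owns

    def update_available_resource(self, context):
        values = {
            'service_id': self.service_id,
            'hypervisor_hostname': self.node['name'],   # unique per node
            'hypervisor_type': 'baremetal',
            'hypervisor_version': 1,
            'vcpus': self.node['cpus'],
            'memory_mb': self.node['memory_mb'],
            'local_gb': self.node['local_gb'],
            'vcpus_used': 0,
            'memory_mb_used': 0,
            'local_gb_used': 0,
            'cpu_info': 'baremetal cpu',
        }
        if self.compute_node is None:
            self.compute_node = db.compute_node_create(context, values)
        else:
            db.compute_node_update(context, self.compute_node['id'], values)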
Re: [Openstack] [openstack-dev] Discussion about where to put database for bare-metal provisioning (review 10726)
Hi Vish,

I think I understand your idea. One service entry with multiple bare-metal compute_node entries is registered at the start of bare-metal nova-compute. 'hypervisor_hostname' must be different for each bare-metal machine (such as 'bare-metal-0001.xxx.com', 'bare-metal-0002.xxx.com', etc.). But their IP addresses must be the IP address of the bare-metal nova-compute, such that an instance is cast not to the bare-metal machine directly but to the bare-metal nova-compute.

One extension we need to do at the scheduler side is using (host, hypervisor_hostname) instead of (host) only in host_manager.py. 'HostManager.service_states' is { host : { service : { cap_k : v }}}. It needs to be changed to { host : { service : { hypervisor_name : { cap_k : v }}}}. Most functions of HostState need to be changed to use the (host, hypervisor_name) pair to identify a compute node.

Are we on the same page now?

Thanks,
David

- Original Message -
> Hi David,
>
> I just checked out the code more extensively and I don't see why you need to create a new service entry for each compute_node entry.
> (snip)
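To make the shape of that change concrete, a small illustration; the host, node names, and capability values are made up:

# Today: capabilities keyed by service host only, so all bare-metal nodes
# behind one nova-compute collapse into a single entry.
service_states_today = {
    'bespin101': {
        'compute': {'vcpus': 8, 'memory_mb': 16384},
    },
}

# Proposed: one more level keyed by hypervisor_hostname, so each bare-metal
# node keeps its own capability dict under the shared service host.
service_states_proposed = {
    'bespin101': {
        'compute': {
            'bare-metal-0001.xxx.com': {'vcpus': 8, 'memory_mb': 16384},
            'bare-metal-0002.xxx.com': {'vcpus': 4, 'memory_mb': 8192},
        },
    },
}

# Scheduler lookups then need the (host, hypervisor_hostname) pair:
caps = service_states_proposed['bespin101']['compute']['bare-metal-0001.xxx.com']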
Re: [Openstack] [openstack-dev] Discussion about where to put database for bare-metal provisioning (review 10726)
David Kang dk...@isi.edu wrote on 08/27/2012 02:58:56 PM:

> Hi Vish,
>
> I think I understand your idea. One service entry with multiple bare-metal compute_node entries is registered at the start of bare-metal nova-compute. 'hypervisor_hostname' must be different for each bare-metal machine (such as 'bare-metal-0001.xxx.com', 'bare-metal-0002.xxx.com', etc.). But their IP addresses must be the IP address of the bare-metal nova-compute, such that an instance is cast not to the bare-metal machine directly but to the bare-metal nova-compute.

I believe the change here is to cast out the message to the topic.service-hostname. Existing code sends it to the compute_node hostname (see line 202 of nova/scheduler/filter_scheduler.py, specifically host=weighted_host.host_state.host). Changing that to cast to the service hostname would send the message to the bare-metal proxy node and should not have an effect on current deployments since the service hostname and the host_state.host would always be equal. This model will also let you keep the bare-metal compute node IP in the compute node table.

> One extension we need to do at the scheduler side is using (host, hypervisor_hostname) instead of (host) only in host_manager.py. 'HostManager.service_states' is { host : { service : { cap_k : v }}}. It needs to be changed to { host : { service : { hypervisor_name : { cap_k : v }}}}. Most functions of HostState need to be changed to use the (host, hypervisor_name) pair to identify a compute node.

Would an alternative here be to change the top-level host to be the hypervisor_hostname and enforce uniqueness?

> Are we on the same page now?
>
> Thanks,
> David
>
> - Original Message -
> (snip)
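A hedged sketch of the split Michael describes, i.e. the RPC cast targets the proxy's service hostname while the specific bare-metal node rides along in the request payload. The function body, the nodename attribute, and the rpcapi call are paraphrased for illustration, not copied from nova:

# Illustrative only; the real scheduler cast path differs.
def cast_run_instance(scheduler, context, weighted_host, request_kwargs):
    host_state = weighted_host.host_state

    # Cast target: the nova-compute service host (the bare-metal proxy node).
    service_host = host_state.service['host']

    # Node identity: which bare-metal machine the request is really aimed at,
    # carried in the payload so the proxy can drive the right machine.
    request_kwargs['hypervisor_hostname'] = host_state.nodename

    scheduler.compute_rpcapi.run_instance(context, host=service_host,
                                          **request_kwargs)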
Re: [Openstack] [openstack-dev] Discussion about where to put database for bare-metal provisioning (review 10726)
Hello all,

It seems that the requirement for the keys of HostManager.service_states is just that they be unique; they do not have to be valid hostnames or queues. (Existing code already casts messages to topic.service-hostname, doesn't it, Michael?) So, I tried 'host/bm_node_id' as the 'host' of capabilities. Then, HostManager.service_states is:

{ host/bm_node_id : { service : { cap_k : v }}}

So far, it works fine. How about this way? I paste the relevant code at the bottom of this mail just to make sure.

NOTE: I added a new column 'nodename' to compute_nodes to store bm_node_id, but storing it in 'hypervisor_hostname' may be the right solution.

(The whole code is in our github (NTTdocomo-openstack/nova, branch 'multinode'); multiple resource_trackers are also implemented.)

Thanks,
Arata

diff --git a/nova/scheduler/host_manager.py b/nova/scheduler/host_manager.py
index 33ba2c1..567729f 100644
--- a/nova/scheduler/host_manager.py
+++ b/nova/scheduler/host_manager.py
@@ -98,9 +98,10 @@ class HostState(object):
     previously used and lock down access.
 
-    def __init__(self, host, topic, capabilities=None, service=None):
+    def __init__(self, host, topic, capabilities=None, service=None, nodename=None):
         self.host = host
         self.topic = topic
+        self.nodename = nodename
 
         # Read-only capability dicts
@@ -175,8 +176,8 @@ class HostState(object):
         return True
 
     def __repr__(self):
-        return ("host '%s': free_ram_mb:%s free_disk_mb:%s" %
-                (self.host, self.free_ram_mb, self.free_disk_mb))
+        return ("host '%s' / nodename '%s': free_ram_mb:%s free_disk_mb:%s" %
+                (self.host, self.nodename, self.free_ram_mb, self.free_disk_mb))
 
 
 class HostManager(object):
@@ -268,11 +269,16 @@ class HostManager(object):
                 LOG.warn(_("No service for compute ID %s") % compute['id'])
                 continue
             host = service['host']
-            capabilities = self.service_states.get(host, None)
+            if compute['nodename']:
+                host_node = '%s/%s' % (host, compute['nodename'])
+            else:
+                host_node = host
+            capabilities = self.service_states.get(host_node, None)
             host_state = self.host_state_cls(host, topic,
                     capabilities=capabilities,
-                    service=dict(service.iteritems()))
+                    service=dict(service.iteritems()),
+                    nodename=compute['nodename'])
             host_state.update_from_compute_node(compute)
-            host_state_map[host] = host_state
+            host_state_map[host_node] = host_state
 
         return host_state_map

diff --git a/nova/virt/baremetal/driver.py b/nova/virt/baremetal/driver.py
index 087d1b6..dbcfbde 100644
--- a/nova/virt/baremetal/driver.py
+++ b/nova/virt/baremetal/driver.py
(skip...)
+    def _create_node_cap(self, node):
+        dic = self._node_resources(node)
+        dic['host'] = '%s/%s' % (FLAGS.host, node['id'])
+        dic['cpu_arch'] = self._extra_specs.get('cpu_arch')
+        dic['instance_type_extra_specs'] = self._extra_specs
+        dic['supported_instances'] = self._supported_instances
+        # TODO: put node's extra specs
+        return dic
 
     def get_host_stats(self, refresh=False):
-        return self._get_host_stats()
+        caps = []
+        context = nova_context.get_admin_context()
+        nodes = bmdb.bm_node_get_all(context,
+                                     service_host=FLAGS.host)
+        for node in nodes:
+            node_cap = self._create_node_cap(node)
+            caps.append(node_cap)
+        return caps

(2012/08/28 5:55), Michael J Fork wrote:
(snip)
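For illustration, with this change the capability reports end up keyed per node; a made-up example of the resulting structures (hostname and ids are hypothetical):

# Hypothetical resulting keys, assuming the bare-metal nova-compute runs on
# 'bespin101' and manages bare-metal node ids 1 and 2.
service_states = {
    'bespin101/1': {'compute': {'cpu_arch': 'x86_64', 'memory_mb': 16384}},
    'bespin101/2': {'compute': {'cpu_arch': 'x86_64', 'memory_mb': 8192}},
}

# host_state_map in HostManager.get_all_host_states() is keyed the same way,
# while each HostState still records host='bespin101' (the RPC target) and
# nodename=1 or 2 (which bare-metal machine it describes).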
Re: [Openstack] [openstack-dev] Discussion about where to put database for bare-metal provisioning (review 10726)
David Kang dk...@isi.edu wrote on 08/27/2012 05:22:37 PM:

> Michael,
>
> I think you mean compute_node hostname as the 'hypervisor_hostname' field in the 'compute_node' table.

Yes. This value would be part of the payload of the message cast to the proxy node so that it knows who the request was directed to.

> What do you mean by service hostname? I don't see such a field in the 'service' table in the database. Is it in some other table? Or do you suggest adding a 'service_hostname' field to the 'service' table?

The host field in the services table. This value would be used as the target of the rpc cast so that the proxy node would receive the message.

> Thanks,
> David
>
> - Original Message -
> (snip)
Re: [Openstack] [openstack-dev] Discussion about where to put database for bare-metal provisioning (review 10726)
Michael,

It is a little confusing without knowing the assumptions behind your suggestions. First of all, I want to make sure that you agree on the following:

1. one entry per bare-metal machine in the 'compute_node' table.
2. one entry in the 'service' table for the bare-metal nova-compute that manages N bare-metal machines.

In addition to that, I think you suggest augmenting the 'host' field in the 'service' table, such that the 'host' field can be used for RPC. (I don't think the current 'host' field can be used for that purpose now.)

David

- Original Message -
> David Kang dk...@isi.edu wrote on 08/27/2012 05:22:37 PM:
> (snip)
Re: [Openstack] [openstack-dev] Discussion about where to put database for bare-metal provisioning (review 10726)
VTJ NOTSU Arata no...@virtualtech.jp wrote on 08/27/2012 07:30:40 PM:

> Hi Michael,
>
> > Looking at line 203 in nova/scheduler/filter_scheduler.py, the target host in the cast call is weighted_host.host_state.host and not a service host. (My guess is this will likely require a fair number of changes in the scheduler area to change cast calls to target a service host instead of a compute node)
>
> weighted_host.host_state.host still seems to be service['host']... Please look at it again with me.
>
> # First, HostManager.get_all_host_states:
> # host_manager.py:264
>         compute_nodes = db.compute_node_get_all(context)
>         for compute in compute_nodes:
>             # service is from the services table (joined-loaded with compute_nodes)
>             service = compute['service']
>             if not service:
>                 LOG.warn(_("No service for compute ID %s") % compute['id'])
>                 continue
>             host = service['host']
>             capabilities = self.service_states.get(host, None)
>             # go to the HostState constructor:
>             # the 1st parameter 'host' is service['host']
>             host_state = self.host_state_cls(host, topic,
>                     capabilities=capabilities,
>                     service=dict(service.iteritems()))
>
> # host_manager.py:101
>     def __init__(self, host, topic, capabilities=None, service=None):
>         self.host = host
>         self.topic = topic
>         # here, HostState.host is service['host']
>
> Then, update_from_compute_node(compute) is called, but it leaves self.host unchanged. WeightedHost.host_state is this HostState. So, host at filter_scheduler.py:203 is service['host']. We can use the existing code for the RPC target. Do I miss something?

Agreed, you can use the existing RPC target. Sorry for the confusion. This actually answers the question in David's last e-mail asking if the host field can be used from the services table - it already is.

> Thanks,
> Arata

BIG SNIP

Michael

-
Michael Fork
Cloud Architect, Emerging Solutions
IBM Systems Technology Group
Re: [Openstack] [openstack-dev] Discussion about where to put database for bare-metal provisioning (review 10726)
Vish,

I've tested your code and did more testing. There are a couple of problems.

1. The host name should be unique. If not, any repetitive updates of new capabilities with the same host name simply overwrite each other.

2. We cannot generate arbitrary host names on the fly. The scheduler (I tested the filter scheduler) gets host names from the db. So, if a host name is not in the 'services' table, it is not considered by the scheduler at all.

So, to make your suggestion possible, nova-compute should register N different host names in the 'services' table, and N corresponding entries in the 'compute_nodes' table. Here is an example:

mysql> select id, host, binary, topic, report_count, disabled, availability_zone from services;
+----+-------------+----------------+-----------+--------------+----------+-------------------+
| id | host        | binary         | topic     | report_count | disabled | availability_zone |
+----+-------------+----------------+-----------+--------------+----------+-------------------+
|  1 | bespin101   | nova-scheduler | scheduler |        17145 |        0 | nova              |
|  2 | bespin101   | nova-network   | network   |        16819 |        0 | nova              |
|  3 | bespin101-0 | nova-compute   | compute   |        16405 |        0 | nova              |
|  4 | bespin101-1 | nova-compute   | compute   |            1 |        0 | nova              |
+----+-------------+----------------+-----------+--------------+----------+-------------------+

mysql> select id, service_id, hypervisor_hostname from compute_nodes;
+----+------------+------------------------+
| id | service_id | hypervisor_hostname    |
+----+------------+------------------------+
|  1 |          3 | bespin101.east.isi.edu |
|  2 |          4 | bespin101.east.isi.edu |
+----+------------+------------------------+

Then, the nova db (compute_nodes table) has entries for all bare-metal nodes.

What do you think of this approach? Do you have any better approach?

Thanks,
David

- Original Message -
> To elaborate, something like the below. I'm not absolutely sure you need to be able to set service_name and host, but this gives you the option to do so if needed.
> (snip)
Re: [Openstack] [openstack-dev] Discussion about where to put database for bare-metal provisioning (review 10726)
I would investigate changing the capabilities to key off of something other than hostname. It looks from the table structure like compute_nodes could have a many-to-one relationship with services. You would just have to use a little more than hostname. Perhaps (hostname, hypervisor_hostname) could be used to update the entry?

Vish

On Aug 24, 2012, at 11:23 AM, David Kang dk...@isi.edu wrote:

> Vish,
>
> I've tested your code and did more testing. There are a couple of problems.
> (snip)
Re: [Openstack] [openstack-dev] Discussion about where to put database for bare-metal provisioning (review 10726)
Hi Vish,

We are trying to change our code according to your comment. I want to ask a question.

> a) modify driver.get_host_stats to be able to return a list of host stats instead of just one. Report the whole list back to the scheduler. We could modify the receiving end to accept a list as well or just make multiple calls to self.update_service_capabilities(capabilities)

Modifying driver.get_host_stats to return a list of host stats is easy. Making multiple calls to self.update_service_capabilities(capabilities) doesn't seem to work, because 'capabilities' is overwritten each time. Modifying the receiving end to accept a list seems to be easy. However, 'capabilities' is assumed to be a dictionary by all the other scheduler routines, so it looks like we would have to change all of them to handle 'capabilities' as a list of dictionaries. If my understanding is correct, it would affect many parts of the scheduler. Is that what you recommended?

Thanks,
David

- Original Message -
> This was an immediate goal, the bare-metal nova-compute node could keep an internal database, but report capabilities through nova in the common way with the changes below. Then the scheduler wouldn't need access to the bare metal database at all.
>
> On Aug 15, 2012, at 4:23 PM, David Kang dk...@isi.edu wrote:
>
> > Hi Vish,
> >
> > Is this discussion a long-term goal or for this Folsom release? We still believe that the bare-metal database is needed because there is no automated way for bare-metal nodes to report their capabilities to their bare-metal nova-compute node.
> >
> > Thanks,
> > David
>
> (snip)
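On the receiving-end question, one hedged way to accept a list without touching the rest of the scheduler is to unpack it where the capabilities arrive, so everything downstream keeps seeing one plain dict per key. This is only a sketch of that idea (the class below is hypothetical and only mirrors the shape of HostManager.service_states), not the change Vish ends up proposing verbatim:

class CapabilityStore(object):
    """Hypothetical scheduler-side store for reported capabilities."""

    def __init__(self):
        self.service_states = {}

    def update_service_capabilities(self, service_name, host, capabilities):
        # Accept a single dict or a list of dicts; unpack here so that
        # HostState and the filters keep operating on plain dictionaries.
        if not isinstance(capabilities, list):
            capabilities = [capabilities]
        for cap in capabilities:
            # Fall back to the service host when an item does not name a node.
            key = cap.get('host', host)
            self.service_states.setdefault(key, {})[service_name] = cap

store = CapabilityStore()
store.update_service_capabilities('compute', 'bespin101', [
    {'host': 'bespin101/1', 'memory_mb': 16384},
    {'host': 'bespin101/2', 'memory_mb': 8192},
])
# store.service_states now holds one dict per bare-metal node.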
Re: [Openstack] [openstack-dev] Discussion about where to put database for bare-metal provisioning (review 10726)
I think you should be able to modify nova/manager.py to be able to store and report a list of capabilities and report them all individually to the scheduler via the periodic task.

Vish

On Aug 21, 2012, at 1:28 PM, David Kang dk...@isi.edu wrote:

> Hi Vish,
>
> We are trying to change our code according to your comment. I want to ask a question.
> (snip)
Re: [Openstack] [openstack-dev] Discussion about where to put database for bare-metal provisioning (review 10726)
To elaborate, something like the below. I'm not absolutely sure you need to be able to set service_name and host, but this gives you the option to do so if needed.

diff --git a/nova/manager.py b/nova/manager.py
index c6711aa..c0f4669 100644
--- a/nova/manager.py
+++ b/nova/manager.py
@@ -217,6 +217,8 @@ class SchedulerDependentManager(Manager):
 
     def update_service_capabilities(self, capabilities):
         """Remember these capabilities to send on next periodic update."""
+        if not isinstance(capabilities, list):
+            capabilities = [capabilities]
         self.last_capabilities = capabilities
 
     @periodic_task
@@ -224,5 +226,8 @@ class SchedulerDependentManager(Manager):
         """Pass data back to the scheduler at a periodic interval."""
         if self.last_capabilities:
             LOG.debug(_('Notifying Schedulers of capabilities ...'))
-            self.scheduler_rpcapi.update_service_capabilities(context,
-                self.service_name, self.host, self.last_capabilities)
+            for capability_item in self.last_capabilities:
+                name = capability_item.get('service_name', self.service_name)
+                host = capability_item.get('host', self.host)
+                self.scheduler_rpcapi.update_service_capabilities(context,
+                    name, host, capability_item)

On Aug 21, 2012, at 1:28 PM, David Kang dk...@isi.edu wrote:

> Hi Vish,
>
> We are trying to change our code according to your comment. I want to ask a question.
> (snip)
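The design choice in that diff is worth spelling out: because each capability item may carry its own 'host' key, the bare-metal nova-compute can report one capability entry per bare-metal node while the scheduler-side update_service_capabilities keeps its existing one-dict-per-host signature, so nothing downstream has to learn about lists.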
Re: [Openstack] [openstack-dev] Discussion about where to put database for bare-metal provisioning (review 10726)
vishvana...@gmail.com wrote on 08/15/2012 06:54:58 PM:
From: Vishvananda Ishaya vishvana...@gmail.com
To: OpenStack Development Mailing List openstack-...@lists.openstack.org
Cc: openstack@lists.launchpad.net (openstack@lists.launchpad.net) openstack@lists.launchpad.net
Date: 08/15/2012 06:58 PM
Subject: Re: [Openstack] [openstack-dev] Discussion about where to put database for bare-metal provisioning (review 10726)
Sent by: openstack-bounces+mjfork=us.ibm@lists.launchpad.net

On Aug 15, 2012, at 3:17 PM, Michael J Fork mjf...@us.ibm.com wrote:

I am interested in finding a solution that enables bare-metal and virtualized requests to be serviced through the same scheduler where the compute_nodes table has a full view of schedulable resources. This would seem to simplify the end-to-end flow while opening up some additional use cases (e.g. dynamic allocation of a node from bare-metal to hypervisor and back).

One approach would be to have a proxy running a single nova-compute daemon fronting the bare-metal nodes. That nova-compute daemon would report up many HostState objects (1 per bare-metal node) to become entries in the compute_nodes table, accessible through the scheduler HostManager object. The HostState object would set cpu_info, vcpus, memory_mb and local_gb values to be used for scheduling, with the hypervisor_host field holding the bare-metal machine address (e.g. for IPMI based commands) and hypervisor_type = NONE. The bare-metal Flavors are created with an extra_spec of hypervisor_type = NONE and the corresponding compute_capabilities_filter would reduce the available hosts to those bare_metal nodes. The scheduler would need to understand that hypervisor_type = NONE means you need an exact-fit (or best-fit) host vs weighting them (perhaps through the multi-scheduler). The scheduler would cast out the message to the topic.service-hostname (code today uses the HostState hostname), with the compute driver having to understand if it must be serviced elsewhere (but this does not break any existing implementations since it is 1 to 1).

Does this solution seem workable? Anything I missed?

The bare-metal driver is already proxying for the other nodes, so it sounds like we need a couple of things to make this happen:

a) modify driver.get_host_stats to be able to return a list of host stats instead of just one. Report the whole list back to the scheduler. We could modify the receiving end to accept a list as well or just make multiple calls to self.update_service_capabilities(capabilities)

b) make a few minor changes to the scheduler to make sure filtering still works. Note the changes here may be very helpful: https://review.openstack.org/10327

c) we have to make sure that instances launched on those nodes take up the entire host state somehow. We could probably do this by making sure that the instance_type ram, mb, gb etc. matches what the node has, but we may want a new boolean field 'used' if those aren't sufficient.

My initial thought is that showing the actual resources the guest requested as being consumed in HostState would enable use cases like migrating a guest running on a too-big machine to a right-size one. However, that would require the bare-metal node to store the state of the requested guest, when that information could be obtained from the instance_type.
For now, the simplest is probably to have the bare-metal virt driver set disk_available = 0 and host_memory_free = 0 so the scheduler removes them from consideration, with vcpus, disk_total and host_memory_total set to the physical machine values. If the requested guest size is easily accessible, the _used values could be set to those values (although it is not clear whether anything would break with _total != _free + _used, in which case setting _used = _total would seem acceptable for now). Another option is to add num_instances to HostState and have the bare-metal filter remove hypervisor_type = NONE hosts with num_instances > 0. The scheduler would never see them, and there would then be no need to show them as fully consumed. The drawback is that the num_instances call is marked as being expensive and would incur some overhead.

This approach seems pretty good. We could potentially get rid of the shared bare_metal_node table. I guess the only other concern is how you populate the capabilities that the bare metal nodes are reporting. I guess an api extension that rpcs to a baremetal node to add the node. Maybe someday this could be autogenerated by the bare metal host looking in its arp table for dhcp requests! :)

Vish

Michael - Michael Fork Cloud Architect, Emerging Solutions IBM Systems Technology Group
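As a rough illustration of that second option, a scheduler filter along these lines could hide already-used bare-metal hosts. The class name, the capabilities lookup, and the num_instances attribute are assumptions for the sketch, not existing nova code.

class ExactFitBareMetalFilter(object):
    """Illustrative filter: drop bare-metal hosts that already run an instance."""

    def host_passes(self, host_state, filter_properties):
        # Non-bare-metal hosts are left to the normal filters and weights.
        if host_state.capabilities.get('hypervisor_type') != 'NONE':
            return True
        # A bare-metal node runs exactly one instance, so any instance on
        # it means the node is fully consumed.
        return getattr(host_state, 'num_instances', 0) == 0

In a real patch the hypervisor_type and instance count would come from wherever the scheduler already tracks those values; the point is only that consumed bare-metal nodes never reach weighting.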
Re: [Openstack] [openstack-dev] Discussion about where to put database for bare-metal provisioning (review 10726)
I am interested in finding a solution that enables bare-metal and virtualized requests to be serviced through the same scheduler where the compute_nodes table has a full view of schedulable resources. This would seem to simplify the end-to-end flow while opening up some additional use cases (e.g. dynamic allocation of a node from bare-metal to hypervisor and back).

One approach would be to have a proxy running a single nova-compute daemon fronting the bare-metal nodes. That nova-compute daemon would report up many HostState objects (1 per bare-metal node) to become entries in the compute_nodes table, accessible through the scheduler HostManager object. The HostState object would set cpu_info, vcpus, memory_mb and local_gb values to be used for scheduling, with the hypervisor_host field holding the bare-metal machine address (e.g. for IPMI based commands) and hypervisor_type = NONE. The bare-metal Flavors are created with an extra_spec of hypervisor_type = NONE and the corresponding compute_capabilities_filter would reduce the available hosts to those bare_metal nodes. The scheduler would need to understand that hypervisor_type = NONE means you need an exact-fit (or best-fit) host vs weighting them (perhaps through the multi-scheduler). The scheduler would cast out the message to the topic.service-hostname (code today uses the HostState hostname), with the compute driver having to understand if it must be serviced elsewhere (but this does not break any existing implementations since it is 1 to 1).

Does this solution seem workable? Anything I missed?

Michael - Michael Fork Cloud Architect, Emerging Solutions IBM Systems Technology Group

David Kang dk...@isi.edu wrote on 08/15/2012 11:28:34 AM:
From: David Kang dk...@isi.edu
To: OpenStack Development Mailing List openstack-d...@lists.openstack.org, openstack@lists.launchpad.net (openstack@lists.launchpad.net) openstack@lists.launchpad.net
Cc: mkk...@isi.edu, Ken Ash 50ba...@gmail.com, VTJ NOTSU Arata no...@virtualtech.jp
Date: 08/15/2012 02:08 PM
Subject: [openstack-dev] Discussion about where to put database for bare-metal provisioning (review 10726)

Hi, This is a call for discussion about the code review 10726. https://review.openstack.org/#/c/10726/

Mark asked why we implemented a separate database for bare-metal provisioning. Here we describe our thinking. We are open to discussion and to the changes that the community recommends. Please give us your thoughts.

NTT Docomo and USC/ISI have developed bare-metal provisioning. We created a separate database to describe bare-metal nodes, which now consists of 5 tables. Our initial implementation assumes the database is not part of the nova database. In addition to the reasons described in the comments of the code review, here is another reason we decided on a separate database for bare-metal provisioning.

The bare-metal database is mainly used by bare-metal nova-compute. Since bare-metal nova-compute manages multiple bare-metal machines, it needs to keep and update the information about those machines. If the bare-metal database is part of the main nova db, remote access to the nova db from bare-metal nova-compute is inevitable. Vish once told us that shared db access from nova-compute is not desirable.

It is possible to make the scheduler do the job of bare-metal nova-compute. However, it would need big changes in how the scheduler and nova-compute communicate. For example, currently the scheduler casts an instance to a nova-compute.
But for a bare-metal node, the scheduler should cast an instance to a bare-metal machine through the bare-metal nova-compute. The bare-metal nova-compute should boot the machine, transfer the kernel, filesystem, etc. So the bare-metal nova-compute should know the id of the bare-metal node and other information needed for booting (PXE ip address, ...) and more. That information should be sent to the bare-metal nova-compute by the scheduler.

If frequent access to the bare-metal tables in the nova db from bare-metal nova-compute is OK, we are OK with putting the bare-metal tables into the nova db. Please let us know your opinions.

Thanks, David, Mikyung @ USC/ISI

-- Dr. Dong-In David Kang Computer Scientist USC/ISI
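For context, here is a rough sketch of the kind of per-node record the bare-metal nova-compute has to keep, whichever database it ends up in. The field names are illustrative assumptions only; the actual schema in review 10726 spreads this information across five tables.

import collections

# Hypothetical per-node record; not the schema from review 10726.
BareMetalNode = collections.namedtuple('BareMetalNode', [
    'id',             # node id known to the bare-metal nova-compute
    'ipmi_address',   # address used for power control commands
    'pxe_ip_address', # address handed out for PXE boot
    'cpus',
    'memory_mb',
    'local_gb',
    'instance_uuid',  # instance currently deployed on the node, if any
])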
Re: [Openstack] [openstack-dev] Discussion about where to put database for bare-metal provisioning (review 10726)
On Aug 15, 2012, at 3:17 PM, Michael J Fork mjf...@us.ibm.com wrote:

I am interested in finding a solution that enables bare-metal and virtualized requests to be serviced through the same scheduler where the compute_nodes table has a full view of schedulable resources. This would seem to simplify the end-to-end flow while opening up some additional use cases (e.g. dynamic allocation of a node from bare-metal to hypervisor and back).

One approach would be to have a proxy running a single nova-compute daemon fronting the bare-metal nodes. That nova-compute daemon would report up many HostState objects (1 per bare-metal node) to become entries in the compute_nodes table, accessible through the scheduler HostManager object. The HostState object would set cpu_info, vcpus, memory_mb and local_gb values to be used for scheduling, with the hypervisor_host field holding the bare-metal machine address (e.g. for IPMI based commands) and hypervisor_type = NONE. The bare-metal Flavors are created with an extra_spec of hypervisor_type = NONE and the corresponding compute_capabilities_filter would reduce the available hosts to those bare_metal nodes. The scheduler would need to understand that hypervisor_type = NONE means you need an exact-fit (or best-fit) host vs weighting them (perhaps through the multi-scheduler). The scheduler would cast out the message to the topic.service-hostname (code today uses the HostState hostname), with the compute driver having to understand if it must be serviced elsewhere (but this does not break any existing implementations since it is 1 to 1).

Does this solution seem workable? Anything I missed?

The bare-metal driver is already proxying for the other nodes, so it sounds like we need a couple of things to make this happen:

a) modify driver.get_host_stats to be able to return a list of host stats instead of just one. Report the whole list back to the scheduler. We could modify the receiving end to accept a list as well or just make multiple calls to self.update_service_capabilities(capabilities)

b) make a few minor changes to the scheduler to make sure filtering still works. Note the changes here may be very helpful: https://review.openstack.org/10327

c) we have to make sure that instances launched on those nodes take up the entire host state somehow. We could probably do this by making sure that the instance_type ram, mb, gb etc. matches what the node has, but we may want a new boolean field 'used' if those aren't sufficient.

This approach seems pretty good. We could potentially get rid of the shared bare_metal_node table. I guess the only other concern is how you populate the capabilities that the bare metal nodes are reporting. I guess an api extension that rpcs to a baremetal node to add the node. Maybe someday this could be autogenerated by the bare metal host looking in its arp table for dhcp requests! :)

Vish
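Point (c) could be as simple as checking that the requested flavor matches the node exactly before showing the host as fully consumed. The helper below is only a sketch of that idea; the instance_type keys are the flavor fields commonly used at the time, not taken from any agreed patch.

def instance_type_fills_node(instance_type, node_stats):
    # True when the requested flavor consumes the whole bare-metal node,
    # so the corresponding HostState can be treated as fully used.
    return (instance_type['memory_mb'] == node_stats['memory_mb'] and
            instance_type['vcpus'] == node_stats['vcpus'] and
            instance_type['root_gb'] == node_stats['local_gb'])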
Re: [Openstack] [openstack-dev] Discussion about where to put database for bare-metal provisioning (review 10726)
Hi Vish, Is this discussion a long-term goal or for this Folsom release? We still believe that a bare-metal database is needed because there is no automated way for bare-metal nodes to report their capabilities to their bare-metal nova-compute node. Thanks, David

I am interested in finding a solution that enables bare-metal and virtualized requests to be serviced through the same scheduler where the compute_nodes table has a full view of schedulable resources. This would seem to simplify the end-to-end flow while opening up some additional use cases (e.g. dynamic allocation of a node from bare-metal to hypervisor and back).

One approach would be to have a proxy running a single nova-compute daemon fronting the bare-metal nodes. That nova-compute daemon would report up many HostState objects (1 per bare-metal node) to become entries in the compute_nodes table, accessible through the scheduler HostManager object. The HostState object would set cpu_info, vcpus, memory_mb and local_gb values to be used for scheduling, with the hypervisor_host field holding the bare-metal machine address (e.g. for IPMI based commands) and hypervisor_type = NONE. The bare-metal Flavors are created with an extra_spec of hypervisor_type = NONE and the corresponding compute_capabilities_filter would reduce the available hosts to those bare_metal nodes. The scheduler would need to understand that hypervisor_type = NONE means you need an exact-fit (or best-fit) host vs weighting them (perhaps through the multi-scheduler). The scheduler would cast out the message to the topic.service-hostname (code today uses the HostState hostname), with the compute driver having to understand if it must be serviced elsewhere (but this does not break any existing implementations since it is 1 to 1).

Does this solution seem workable? Anything I missed?

The bare-metal driver is already proxying for the other nodes, so it sounds like we need a couple of things to make this happen:

a) modify driver.get_host_stats to be able to return a list of host stats instead of just one. Report the whole list back to the scheduler. We could modify the receiving end to accept a list as well or just make multiple calls to self.update_service_capabilities(capabilities)

b) make a few minor changes to the scheduler to make sure filtering still works. Note the changes here may be very helpful: https://review.openstack.org/10327

c) we have to make sure that instances launched on those nodes take up the entire host state somehow. We could probably do this by making sure that the instance_type ram, mb, gb etc. matches what the node has, but we may want a new boolean field 'used' if those aren't sufficient.

This approach seems pretty good. We could potentially get rid of the shared bare_metal_node table. I guess the only other concern is how you populate the capabilities that the bare metal nodes are reporting. I guess an api extension that rpcs to a baremetal node to add the node. Maybe someday this could be autogenerated by the bare metal host looking in its arp table for dhcp requests! :)

Vish
Re: [Openstack] [openstack-dev] Discussion about where to put database for bare-metal provisioning (review 10726)
This was an immediate goal; the bare-metal nova-compute node could keep an internal database, but report capabilities through nova in the common way with the changes below. Then the scheduler wouldn't need access to the bare metal database at all.

On Aug 15, 2012, at 4:23 PM, David Kang dk...@isi.edu wrote:

Hi Vish, Is this discussion a long-term goal or for this Folsom release? We still believe that a bare-metal database is needed because there is no automated way for bare-metal nodes to report their capabilities to their bare-metal nova-compute node. Thanks, David

I am interested in finding a solution that enables bare-metal and virtualized requests to be serviced through the same scheduler where the compute_nodes table has a full view of schedulable resources. This would seem to simplify the end-to-end flow while opening up some additional use cases (e.g. dynamic allocation of a node from bare-metal to hypervisor and back).

One approach would be to have a proxy running a single nova-compute daemon fronting the bare-metal nodes. That nova-compute daemon would report up many HostState objects (1 per bare-metal node) to become entries in the compute_nodes table, accessible through the scheduler HostManager object. The HostState object would set cpu_info, vcpus, memory_mb and local_gb values to be used for scheduling, with the hypervisor_host field holding the bare-metal machine address (e.g. for IPMI based commands) and hypervisor_type = NONE. The bare-metal Flavors are created with an extra_spec of hypervisor_type = NONE and the corresponding compute_capabilities_filter would reduce the available hosts to those bare_metal nodes. The scheduler would need to understand that hypervisor_type = NONE means you need an exact-fit (or best-fit) host vs weighting them (perhaps through the multi-scheduler). The scheduler would cast out the message to the topic.service-hostname (code today uses the HostState hostname), with the compute driver having to understand if it must be serviced elsewhere (but this does not break any existing implementations since it is 1 to 1).

Does this solution seem workable? Anything I missed?

The bare-metal driver is already proxying for the other nodes, so it sounds like we need a couple of things to make this happen:

a) modify driver.get_host_stats to be able to return a list of host stats instead of just one. Report the whole list back to the scheduler. We could modify the receiving end to accept a list as well or just make multiple calls to self.update_service_capabilities(capabilities)

b) make a few minor changes to the scheduler to make sure filtering still works. Note the changes here may be very helpful: https://review.openstack.org/10327

c) we have to make sure that instances launched on those nodes take up the entire host state somehow. We could probably do this by making sure that the instance_type ram, mb, gb etc. matches what the node has, but we may want a new boolean field 'used' if those aren't sufficient.

This approach seems pretty good. We could potentially get rid of the shared bare_metal_node table. I guess the only other concern is how you populate the capabilities that the bare metal nodes are reporting. I guess an api extension that rpcs to a baremetal node to add the node. Maybe someday this could be autogenerated by the bare metal host looking in its arp table for dhcp requests!
:) Vish
Re: [Openstack] [openstack-dev] Discussion about where to put database for bare-metal provisioning (review 10726)
I see. We (NTT and USC/ISI) will discuss that further, and we will implement it according to your input. From now on, we assume we will use the current db. Here is my understanding of your comments; please correct me if it is not right.

a) is clear.

b) NTT Docomo has implemented some of the features you mentioned in a new file, baremetal_host_manager.py. We will not use baremetal_host_manager.py but will modify host_manager.py directly. I'm not sure if we have to change the scheduler.

c) I think the 'used' field may work. We'll look into it.

Thanks, David

-- Dr. Dong-In David Kang Computer Scientist USC/ISI

- Original Message -

This was an immediate goal; the bare-metal nova-compute node could keep an internal database, but report capabilities through nova in the common way with the changes below. Then the scheduler wouldn't need access to the bare metal database at all.

On Aug 15, 2012, at 4:23 PM, David Kang dk...@isi.edu wrote:

Hi Vish, Is this discussion a long-term goal or for this Folsom release? We still believe that a bare-metal database is needed because there is no automated way for bare-metal nodes to report their capabilities to their bare-metal nova-compute node. Thanks, David

I am interested in finding a solution that enables bare-metal and virtualized requests to be serviced through the same scheduler where the compute_nodes table has a full view of schedulable resources. This would seem to simplify the end-to-end flow while opening up some additional use cases (e.g. dynamic allocation of a node from bare-metal to hypervisor and back).

One approach would be to have a proxy running a single nova-compute daemon fronting the bare-metal nodes. That nova-compute daemon would report up many HostState objects (1 per bare-metal node) to become entries in the compute_nodes table, accessible through the scheduler HostManager object. The HostState object would set cpu_info, vcpus, memory_mb and local_gb values to be used for scheduling, with the hypervisor_host field holding the bare-metal machine address (e.g. for IPMI based commands) and hypervisor_type = NONE. The bare-metal Flavors are created with an extra_spec of hypervisor_type = NONE and the corresponding compute_capabilities_filter would reduce the available hosts to those bare_metal nodes. The scheduler would need to understand that hypervisor_type = NONE means you need an exact-fit (or best-fit) host vs weighting them (perhaps through the multi-scheduler). The scheduler would cast out the message to the topic.service-hostname (code today uses the HostState hostname), with the compute driver having to understand if it must be serviced elsewhere (but this does not break any existing implementations since it is 1 to 1).

Does this solution seem workable? Anything I missed?

The bare-metal driver is already proxying for the other nodes, so it sounds like we need a couple of things to make this happen:

a) modify driver.get_host_stats to be able to return a list of host stats instead of just one. Report the whole list back to the scheduler. We could modify the receiving end to accept a list as well or just make multiple calls to self.update_service_capabilities(capabilities)

b) make a few minor changes to the scheduler to make sure filtering still works. Note the changes here may be very helpful: https://review.openstack.org/10327

c) we have to make sure that instances launched on those nodes take up the entire host state somehow. We could probably do this by making sure that the instance_type ram, mb, gb etc.
matches what the node has, but we may want a new boolean field 'used' if those aren't sufficient.

This approach seems pretty good. We could potentially get rid of the shared bare_metal_node table. I guess the only other concern is how you populate the capabilities that the bare metal nodes are reporting. I guess an api extension that rpcs to a baremetal node to add the node. Maybe someday this could be autogenerated by the bare metal host looking in its arp table for dhcp requests! :)

Vish
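If the 'used' field route is taken, the host_manager.py change could be as small as zeroing out a consumed node's free resources when building host states. The function below is only a sketch of that idea; the attribute and capability names are assumptions rather than agreed code.

def apply_bare_metal_usage(host_state, capabilities):
    # Only bare-metal entries (hypervisor_type NONE) are touched.
    if capabilities.get('hypervisor_type') != 'NONE':
        return
    if capabilities.get('used'):
        # A consumed bare-metal node advertises no free resources, so the
        # existing RAM/disk filters drop it without any special-casing.
        host_state.free_ram_mb = 0
        host_state.free_disk_mb = 0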