Hi Vish,

I think I understand your idea. One service entry and multiple bare-metal compute_node entries are registered at the start of the bare-metal nova-compute. 'hypervisor_hostname' must be different for each bare-metal machine (such as 'bare-metal-0001.xxx.com', 'bare-metal-0002.xxx.com', etc.), but their IP addresses must be the IP address of the bare-metal nova-compute, so that an instance is cast not to a bare-metal machine directly but to the bare-metal nova-compute.

One extension we need on the scheduler side is to use (host, hypervisor_hostname) instead of (host) alone in host_manager.py. 'HostManager.service_states' is currently { <host> : { <service> : { cap k : v }}}. It needs to become { <host> : { <service> : { <hypervisor_hostname> : { cap k : v }}}}. Most functions of HostState need to be changed to use the (host, hypervisor_hostname) pair to identify a compute node.
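For illustration, here is a rough sketch of the structure change I mean (the host names and capability values below are made up):

    # today host_manager.py keys the capabilities by host alone:
    #     service_states = { <host> : { <service> : { cap k : v }}}
    # with one more level, a single bare-metal nova-compute host can carry
    # the capabilities of many bare-metal nodes:
    service_states = {
        'bare-metal-proxy': {
            'compute': {
                'bare-metal-0001.xxx.com': {'memory_mb': 32768, 'local_gb': 500},
                'bare-metal-0002.xxx.com': {'memory_mb': 16384, 'local_gb': 250},
            },
        },
    }

    def get_node_capabilities(host, hypervisor_hostname, service='compute'):
        """Identify one compute node by its (host, hypervisor_hostname) pair."""
        return service_states[host][service][hypervisor_hostname]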
Are we on the same page now?

Thanks,
David

----- Original Message -----
> Hi David,
>
> I just checked out the code more extensively and I don't see why you need to create a new service entry for each compute_node entry. The code in host_manager to get all host states explicitly gets all compute_node entries. I don't see any reason why multiple compute_node entries can't share the same service. I don't see any place in the scheduler that is grabbing records by "service" instead of by "compute node", but if there is one that I missed, it should be fairly easy to change.
>
> The compute_node record is created in compute/resource_tracker.py as of a recent commit, so I think the path forward would be to make sure that one of those records is created for each bare-metal node by the bare-metal compute, perhaps by having multiple resource_trackers.
>
> Vish
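(For illustration, a rough sketch of the multiple-resource-tracker idea; this is hypothetical code, not what is in the tree today:)

    class NodeResourceTracker(object):
        """Minimal stand-in for compute/resource_tracker.py (hypothetical)."""

        def __init__(self, service_host, hypervisor_hostname):
            self.service_host = service_host                # the one nova-compute service
            self.hypervisor_hostname = hypervisor_hostname  # one bare-metal node

        def update_available_resource(self):
            # would create/update this node's compute_nodes record,
            # all rows sharing the same service_id
            print('update compute_node %s (service host %s)'
                  % (self.hypervisor_hostname, self.service_host))

    # one tracker per bare-metal node behind the proxy nova-compute:
    trackers = [NodeResourceTracker('bespin101', 'bare-metal-%04d.xxx.com' % i)
                for i in (1, 2)]
    for tracker in trackers:
        tracker.update_available_resource()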
> On Aug 27, 2012, at 9:40 AM, David Kang <dk...@isi.edu> wrote:
>
> > Vish,
> >
> > I don't think I fully understand your statement. Unless we use different hostnames, (hostname, hypervisor_hostname) must be the same for all bare-metal nodes under a bare-metal nova-compute.
> >
> > Could you elaborate on the following statement a little more?
> >
> >> You would just have to use a little more than hostname. Perhaps (hostname, hypervisor_hostname) could be used to update the entry?
> >
> > Thanks,
> > David
> >
> > ----- Original Message -----
> >> I would investigate changing the capabilities to key off of something other than hostname. It looks from the table structure like compute_nodes could have a many-to-one relationship with services. You would just have to use a little more than hostname. Perhaps (hostname, hypervisor_hostname) could be used to update the entry?
> >>
> >> Vish
> >>
> >> On Aug 24, 2012, at 11:23 AM, David Kang <dk...@isi.edu> wrote:
> >>
> >>> Vish,
> >>>
> >>> I've tested your code and done some more testing. There are a couple of problems.
> >>> 1. The host name must be unique. If not, repeated updates of new capabilities with the same host name simply overwrite one another.
> >>> 2. We cannot generate arbitrary host names on the fly. The scheduler (I tested the filter scheduler) gets host names from the db, so if a host name is not in the 'services' table, it is not considered by the scheduler at all.
> >>>
> >>> So, to make your suggestion possible, nova-compute should register N different host names in the 'services' table, and N corresponding entries in the 'compute_nodes' table. Here is an example:
> >>>
> >>> mysql> select id, host, binary, topic, report_count, disabled, availability_zone from services;
> >>> +----+-------------+----------------+-----------+--------------+----------+-------------------+
> >>> | id | host        | binary         | topic     | report_count | disabled | availability_zone |
> >>> +----+-------------+----------------+-----------+--------------+----------+-------------------+
> >>> |  1 | bespin101   | nova-scheduler | scheduler |        17145 |        0 | nova              |
> >>> |  2 | bespin101   | nova-network   | network   |        16819 |        0 | nova              |
> >>> |  3 | bespin101-0 | nova-compute   | compute   |        16405 |        0 | nova              |
> >>> |  4 | bespin101-1 | nova-compute   | compute   |            1 |        0 | nova              |
> >>> +----+-------------+----------------+-----------+--------------+----------+-------------------+
> >>>
> >>> mysql> select id, service_id, hypervisor_hostname from compute_nodes;
> >>> +----+------------+------------------------+
> >>> | id | service_id | hypervisor_hostname    |
> >>> +----+------------+------------------------+
> >>> |  1 |          3 | bespin101.east.isi.edu |
> >>> |  2 |          4 | bespin101.east.isi.edu |
> >>> +----+------------+------------------------+
> >>>
> >>> Then, the nova db ('compute_nodes' table) has entries for all of the bare-metal nodes. What do you think of this approach? Do you have a better one?
> >>>
> >>> Thanks,
> >>> David
> >>>
> >>> ----- Original Message -----
> >>>> To elaborate, something like the below. I'm not absolutely sure you need to be able to set service_name and host, but this gives you the option to do so if needed.
> >>>>
> >>>> diff --git a/nova/manager.py b/nova/manager.py
> >>>> index c6711aa..c0f4669 100644
> >>>> --- a/nova/manager.py
> >>>> +++ b/nova/manager.py
> >>>> @@ -217,6 +217,8 @@ class SchedulerDependentManager(Manager):
> >>>>
> >>>>      def update_service_capabilities(self, capabilities):
> >>>>          """Remember these capabilities to send on next periodic update."""
> >>>> +        if not isinstance(capabilities, list):
> >>>> +            capabilities = [capabilities]
> >>>>          self.last_capabilities = capabilities
> >>>>
> >>>>      @periodic_task
> >>>> @@ -224,5 +226,8 @@ class SchedulerDependentManager(Manager):
> >>>>          """Pass data back to the scheduler at a periodic interval."""
> >>>>          if self.last_capabilities:
> >>>>              LOG.debug(_('Notifying Schedulers of capabilities ...'))
> >>>> -            self.scheduler_rpcapi.update_service_capabilities(context,
> >>>> -                self.service_name, self.host, self.last_capabilities)
> >>>> +            for capability_item in self.last_capabilities:
> >>>> +                name = capability_item.get('service_name', self.service_name)
> >>>> +                host = capability_item.get('host', self.host)
> >>>> +                self.scheduler_rpcapi.update_service_capabilities(context,
> >>>> +                    name, host, capability_item)
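(For illustration, a rough sketch of what a bare-metal driver could feed through this patch, one capability dict per bare-metal node; the field names and values here are made up:)

    capabilities = [
        {'host': 'bespin101-0',
         'hypervisor_hostname': 'bare-metal-0001.xxx.com',
         'memory_mb': 32768, 'local_gb': 500},
        {'host': 'bespin101-1',
         'hypervisor_hostname': 'bare-metal-0002.xxx.com',
         'memory_mb': 16384, 'local_gb': 250},
    ]

    # the patched periodic task would then publish one update per node,
    # falling back to the service defaults when 'service_name'/'host'
    # are not set in an item:
    for capability_item in capabilities:
        name = capability_item.get('service_name', 'compute')
        host = capability_item.get('host', 'bespin101')
        print('update_service_capabilities(%s, %s, %s)'
              % (name, host, capability_item))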
> >>>> On Aug 21, 2012, at 1:28 PM, David Kang <dk...@isi.edu> wrote:
> >>>>
> >>>>> Hi Vish,
> >>>>>
> >>>>> We are trying to change our code according to your comment, and I want to ask a question.
> >>>>>
> >>>>>>>> a) modify driver.get_host_stats to be able to return a list of host stats instead of just one. Report the whole list back to the scheduler. We could modify the receiving end to accept a list as well or just make multiple calls to self.update_service_capabilities(capabilities)
> >>>>>
> >>>>> Modifying driver.get_host_stats to return a list of host stats is easy. Making multiple calls to self.update_service_capabilities(capabilities) doesn't seem to work, though, because 'capabilities' is overwritten each time.
> >>>>>
> >>>>> Modifying the receiving end to accept a list seems easy, too. However, since 'capabilities' is assumed to be a dictionary by all the other scheduler routines, it looks like we would have to change all of them to handle 'capabilities' as a list of dictionaries. If my understanding is correct, that would affect many parts of the scheduler. Is that what you recommended?
> >>>>>
> >>>>> Thanks,
> >>>>> David
> >>>>>
> >>>>> ----- Original Message -----
> >>>>>> This was an immediate goal: the bare-metal nova-compute node could keep an internal database but report capabilities through nova in the common way with the changes below. Then the scheduler wouldn't need access to the bare-metal database at all.
> >>>>>>
> >>>>>> On Aug 15, 2012, at 4:23 PM, David Kang <dk...@isi.edu> wrote:
> >>>>>>
> >>>>>>> Hi Vish,
> >>>>>>>
> >>>>>>> Is this discussion a long-term goal or for the Folsom release?
> >>>>>>>
> >>>>>>> We still believe that the bare-metal database is needed, because there is no automated way for bare-metal nodes to report their capabilities to their bare-metal nova-compute node.
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> David
> >>>>>>>
> >>>>>>>> I am interested in finding a solution that enables bare-metal and virtualized requests to be serviced through the same scheduler, where the compute_nodes table has a full view of schedulable resources. This would seem to simplify the end-to-end flow while opening up some additional use cases (e.g. dynamic allocation of a node from bare-metal to hypervisor and back).
> >>>>>>>>
> >>>>>>>> One approach would be to have a proxy running a single nova-compute daemon fronting the bare-metal nodes. That nova-compute daemon would report up many HostState objects (1 per bare-metal node) to become entries in the compute_nodes table and accessible through the scheduler HostManager object.
> >>>>>>>>
> >>>>>>>> The HostState object would set cpu_info, vcpus, memory_mb and local_gb values to be used for scheduling, with the hypervisor_host field holding the bare-metal machine address (e.g. for IPMI-based commands) and hypervisor_type = NONE.
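(For illustration, roughly what two such per-node entries might carry; the field names follow the compute_nodes table, and the values are made up:)

    # one reported HostState per bare-metal node
    host_states = [
        {'hypervisor_host': 'bare-metal-0001.xxx.com',  # address for IPMI commands
         'hypervisor_type': None,                       # no hypervisor: bare metal
         'cpu_info': 'x86_64', 'vcpus': 8,
         'memory_mb': 32768, 'local_gb': 500},
        {'hypervisor_host': 'bare-metal-0002.xxx.com',
         'hypervisor_type': None,
         'cpu_info': 'x86_64', 'vcpus': 4,
         'memory_mb': 16384, 'local_gb': 250},
    ]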
> >>>>>>>> The bare-metal Flavors are created with an extra_spec of hypervisor_type=NONE, and the corresponding compute_capabilities_filter would reduce the available hosts to those bare-metal nodes. The scheduler would need to understand that hypervisor_type = NONE means you need an exact-fit (or best-fit) host vs. weighting them (perhaps through the multi-scheduler). The scheduler would cast out the message to <topic>.<service-hostname> (the code today uses the HostState hostname), with the compute driver having to understand whether it must be serviced elsewhere (but this does not break any existing implementations, since it is 1 to 1).
> >>>>>>>>
> >>>>>>>> Does this solution seem workable? Anything I missed?
> >>>>>>>>
> >>>>>>>> The bare-metal driver is already proxying for the other nodes, so it sounds like we need a couple of things to make this happen:
> >>>>>>>>
> >>>>>>>> a) modify driver.get_host_stats to be able to return a list of host stats instead of just one. Report the whole list back to the scheduler. We could modify the receiving end to accept a list as well or just make multiple calls to self.update_service_capabilities(capabilities)
> >>>>>>>>
> >>>>>>>> b) make a few minor changes to the scheduler to make sure filtering still works. Note the changes here may be very helpful:
> >>>>>>>>
> >>>>>>>> https://review.openstack.org/10327
> >>>>>>>>
> >>>>>>>> c) we have to make sure that instances launched on those nodes take up the entire host state somehow. We could probably do this by making sure that the instance_type ram, mb, gb etc. matches what the node has, but we may want a new boolean field "used" if those aren't sufficient.
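(A rough sketch of what such an exact-fit check could look like; this is a hypothetical helper, not an existing nova filter:)

    def host_passes(host, instance_type):
        """Bare-metal hosts must match the flavor exactly; others just fit."""
        if host.get('hypervisor_type') is None:  # bare metal
            return (not host.get('used', False)
                    and host['memory_mb'] == instance_type['memory_mb']
                    and host['local_gb'] == instance_type['root_gb'])
        # virtualized hosts only need enough free resources
        return host['free_ram_mb'] >= instance_type['memory_mb']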
> >>>>>>>> This approach seems pretty good. We could potentially get rid of the shared bare_metal_node table. I guess the only other concern is how you populate the capabilities that the bare-metal nodes are reporting. I guess we need an api extension that rpcs to a bare-metal node to add the node. Maybe someday this could be autogenerated by the bare-metal host looking in its arp table for dhcp requests! :)
> >>>>>>>>
> >>>>>>>> Vish

_______________________________________________
Mailing list: https://launchpad.net/~openstack
Post to     : openstack@lists.launchpad.net
Unsubscribe : https://launchpad.net/~openstack
More help   : https://help.launchpad.net/ListHelp