The Ironic PTG Summary - The blur(b) from the East

In an effort to provide visibility and awareness of all things related to Ironic, I've typed up a summary below. I've tried to keep this fairly generalized, with enough context to convey the action items and the instances of consensus where applicable. It goes without saying that the week went by as a complete blur; because we had to abruptly change our schedule around, some finer-detail topics were missed. A special thanks to Ruby Loo for taking some time to proofread this for me.
-Julia

---------

From our retrospective:

As seems to be the norm with retrospectives, we brought up a number of issues that slowed us down, hindered us, or hindered our ability to move faster. A great deal of this revolved around specifications, and the perceptions that tend to build up around them.

Action Items:
* Jroll will bring up for discussion whether we can update the theme for rendered specs documentation to highlight that specs are point-in-time references for design, and are not final documentation.
* TheJulia will revise our specification template to try to be clearer about *why* we are asking the questions, and to suggest, but not require, proof-of-concept code.

After our retrospective, we spoke about things that could improve our velocity. This sort of discussion tends to always come up, and it focused on the community's cultural aspects of revising and helping land code. The conclusion we quickly came to was that communication, and an understanding of the contributor's context, is required. One of the points raised, which we did not get to, was that we should listen to contributors' perceptions, which really comes back to communication.

As time went on, we shifted gears to a high-level status of Ironic, and there are some items to take away:

* Inspector, at a high level, could use some additional work and contributors. Virtual media boot support would be helpful, and we may look at breaking some portions out and moving them into ironic. Additional High Availability work may or may not be needed; that is entirely to be determined.
* Ironic-ui presently has no active contributors, but is stable. The major risk right now is a breaking change coming from Horizon, which was also discussed earlier in the week with the Horizon team. We will add testing such that Horizon's gate triggers ironic-ui testing and raises visibility of breaking changes.
* Ironic itself got a lot completed this cycle, and we should expect quite a bit this cycle in terms of clean-up from deprecations.
* Networking-baremetal received a good portion of work this cycle due to routed networks support. \o/
* Networking-generic-switch seems to be in a fairly stable state at this point. Some trunk awareness has been added, as well as support for some new switches and bug fixes.
* Bifrost has low activity, but at the same time we're seeing new contributors fix issues or improve things, which is a good sign.
* Sushy got authentication and introspection support added this cycle. We discussed that we may want to consider supporting RAID (in terms of client actions), as well as composable hardware.

After statuses, we shifted into discussing the future. We started the entire discussion with a visioning exercise to help frame the future, so that we were all using the same words and had the same scope in mind when discussing the future of Ironic. One thing worth noting is that up front there was a lot of alignment; we were sometimes just using slightly different words or concepts. Taking a little more time to reconcile those differences allowed us to relate additional words to the same meaning. Truly this set the stage for all of the other topics, and gave us a common reference point for judging whether what we were talking about made sense. Expect Jroll to send an email to the mailing list summarizing this further; from this initial discussion we will likely draft a formal vision document that will allow us to keep using the same reference point for discussions. Maybe one day your light bulb will be provisioned with Ironic!

Deploy Steps

In terms of the future, we again returned to the concept of breaking up deployments into a series of steps. Without going deep into detail, this is a very large piece of functionality that would help solve many problems and desires that exist today, especially where operators wish for things like deploy-time RAID, or to flash firmware as part of the baremetal node provisioning process.
This work is also influenced by traits, because traits can map to actions that need to be performed automatically. In the end, we agreed to take a small step and iterate from there: specifically, adding a deploy steps framework and splitting our current deploy process into two logical steps.

Location Awareness

"Location awareness" as we are calling it, or perhaps better stated as "conductor-to-node affinity", is a topic that we again revisited. This is important because many operators desire a single pane of glass for their entire baremetal fleet. Some operators would like to isolate conductors per rack, per data center, per customer, per set of data centers in close proximity, or per continent. This is a common problem of creating failure domains that match the environment and have optimal performance, as opposed to deploying across a point-to-point circuit. We agreed this is something that we need to make happen, as it is a very common operational problem. We may further work on this in the future to provide a scoring and anti-affinity system, but right now our focus is hard affinity to clusters of conductors.

Graphical Consoles

We revisited the topic of graphical consoles, one of the topics we made very little progress on this past cycle. It is difficult because there are several different ways to architect and develop this functionality. Then we realized that libvirt offers a VNC server we could very easily leverage, as someone was kind enough to stub it out already in our virtualized BMC services. TL;DR: We are going to pick this back up and try to reach consensus and land the framework this cycle. We know we are likely to want a distinct driver interface to support this, since our existing console interface is designed around serial console usage. We also know we can use our virtualized BMC for testing.

Going beyond the qcow2

Next up on the topic list was partitioning and getting beyond our current use case.
Where this topic came from was several different discussions with the same central theme of "what if I don't want to make or deploy a qcow2 file?" Historically, we have resisted this, as it is more a pattern of pet management. The reality behind that consensus is that we agree pets will happen, and have to be able to happen. So what does this mean for the average user? Not much right now. We still have some things to think about, such as: what would be a good way to tell Ironic about disk partitioning, and then what do we do with the contents of the image? This also led to an interesting shift of "what if we supported a generic TFTP interface?", which gets us towards things like configuring new switches and non-traditional devices upon power-up. The possibilities are somewhat endless. The surprising thing... there was no disagreement. We even had consensus that this sort of thing would be useful, and would be a step towards deploying that light bulb with Ironic.

Action Items:
* Jroll to look at ways we could allow for user-definable partition data, and what that might look like.

Security Interfaces/TPM modules!

This topic was mainly driven by the PTL; there was general consensus among the room that it could be useful, but that a greater understanding was required. Our consensus may be in part due to learning that on Thursday we would likely have fewer attendees due to the incoming weather system. As a follow-up note: I was approached by the Cyborg PTL to see if there could be an opportunity to collaborate. At present we are unsure, given our usage model and workflow, but there may be more discussions in the future.

Action Items:
* TheJulia needs to sit down, write a spec, and popularize the concept.

Reference Architecture

One of our goals during the past cycle was to create a set of reference architecture documentation. We didn't quite get to that work.
One of the advantages of being on the same page and using the same words was that we quickly determined the challenge that deterred us: a lack of clear scope. After some discussion, we were able to refine the scope into smaller logical blocks that build upon each other, to help convey how things fit together and how they can be fit together differently. This also raised greater visibility on where we have an opportunity to improve our developer documentation.

Action Items:
* dtantsur and jroll to begin creating high-level control plane diagrams covering API -> RabbitMQ -> Conductor communication. With this we intend to iterate.
* Sambetts to update the development docs on how the networking works, to help developers troubleshoot.

Cleaning - Firmware versions

One topic that has come up a number of times is how to manage firmware efficiently and effectively, since there are substantial barriers to entry, compounded by differing vendors and hardware fleets. The ask from the community is to help spur further discussion to lower the bar to entry and make it easier to apply firmware updates to hardware nodes, in a way that also provides some level of visibility that the process has completed, or that the latest firmware has been applied. This is complicated even more by the fact that some operators have expressed a need to apply firmware updates before the deploy is completed. Ultimately this takes us down the road of the deploy steps topic, since we should then be able to determine and handle cases where a node needs to be soft reset for in-band firmware updates, or powered off prior to out-of-band firmware updates.

Action Item:
* TheJulia is going to try to spur further community discussion regarding standardization in two weeks.

Cleaning - Burn-in

As part of discussing cleaning changes, we discussed supporting a "burn-in" mode where hardware could be left to run load, memory, or other tests for a period of time.
We did not have consensus on a generic solution, other than that this should likely involve the clean steps we already have, and maybe another entry point into cleaning. Since we didn't really have consensus on use cases, we decided the logical thing was to write them down, and then go from there.

Action Items:
* Community members to document varying burn-in use cases for hardware, as they may vary by industry.
* Community to try to come up with a couple of example clean steps.

Planning for Rocky

Rocky planning was performed in record time, in part because the ironic community performs the initial on-site prioritization via a poll of the room, with five votes per person. This is in turn transformed into our cycle priorities, which are posted to Gerrit and can be viewed at https://review.openstack.org/#/c/550174/. We must stress that due to the notice of the need to vacate the building by 2 PM on Thursday, we chose to move up our planning session, and not everyone was able to attend. Thoughts, feedback, and needs should be communicated via the posted change set by community participants who were not present during the planning process.

Due to the abrupt schedule changes and the need for contributors to begin re-booking flights, we lost some of our time on Thursday. As a result, we were unable to discuss miscellaneous items like communication flow changes, changing the default boot mode, and alternative DHCP servers. None of these are contentious.

Nova/Ironic

Towards the end of Thursday, Ironic was able to convene with the Nova team to discuss topics of interest.

Disk Partitioning

One of the common asks, especially in large-scale deployments, or where things such as RAID are needed, is for the requester to be able to define what the machine should look like. This is not a simple need to fulfill, given that it is not a "cloudy" behavior.
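To make the idea concrete, a user-supplied profile might look something like the sketch below. Nothing here is an agreed-upon format; the field names are purely hypothetical placeholders for whatever an eventual spec defines.

```python
# Purely hypothetical sketch of a user-supplied disk/RAID profile.
# None of these field names come from an agreed Ironic/Nova spec;
# they only illustrate the kind of data a requester might want to pass.
disk_profile = {
    "raid": {
        "logical_disks": [
            {"size_gb": 100, "raid_level": "1", "is_root_volume": True},
        ],
    },
    "partitions": [
        {"mount_point": "/", "size_gb": 40, "filesystem": "ext4"},
        {"mount_point": "/var", "size_gb": 60, "filesystem": "xfs"},
    ],
}

# A deployment tool would validate such a profile before handing it over,
# e.g. checking that exactly one root partition is defined.
root_parts = [p for p in disk_profile["partitions"]
              if p["mount_point"] == "/"]
assert len(root_parts) == 1
```

However the data ends up reaching Ironic (the discussion below mentions Nova passing a pointer of some sort), validation like the root-partition check would need to happen before the deploy begins.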
We discussed various options, and a spec is going to be proposed that will allow Nova to pass a pointer of some sort to Ironic that would define the disk and file system profile for the node.

Action Item:
* jroll to write a spec on how to allow user-supplied partition/RAID configuration to reach Ironic.

Virt driver interactions

There are several cases where the ironic virt driver in nova does things that are not ideal. Also, because of long-lived processes, hardware is not immediately freed to the resource tracker, which can lead to issues. There is a mutual desire to fix these issues, which largely revolves around ensuring that we provide information correctly and set the state of resources such that the virt driver does not encounter issues with placement.

Action Item:
* jroll to fix the nova-compute crash upon start-up when there are issues talking to Ironic, such that it raises NotReadyYet instead.

API Version Negotiation

One of the biggest headaches Ironic has encountered over time is compliance with testing scenarios within the framework, as right now we force a very particular testing order. One of the things that makes this difficult is that we include a pin with our current API client usage (in nova's ironic virt driver) that locks the version the client speaks; if the server does not speak that version, the nova-compute process fails to start. The solution is to begin replacing the use of python-ironicclient in the virt driver with REST calls that explicitly state the API version they need to operate. This provides greater visibility, and maximum flexibility moving forward.

Action Item:
* TheJulia to work on updating the virt driver to use REST calls instead of the client library.

And then there was Friday

On Friday, the available team members discussed the bios_interface and how to handle the getting/setting of properties, considering that what was proposed is very different from how we presently handle RAID.
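Going back to API version negotiation for a moment, the REST approach described above can be sketched roughly as follows. The X-OpenStack-Ironic-API-Version header is Ironic's real microversion mechanism, but the helper, endpoint values, and version number here are illustrative, not the planned virt driver code.

```python
# Sketch of per-request microversion pinning, as discussed under
# "API Version Negotiation". Each REST call states exactly the API
# version it needs, instead of one client-wide pinned version that
# can prevent nova-compute from starting at all.

def build_node_request(ironic_url, node_uuid, api_version):
    """Build the pieces of a GET /v1/nodes/{uuid} call pinned to a version."""
    return {
        "method": "GET",
        "url": "%s/v1/nodes/%s" % (ironic_url.rstrip("/"), node_uuid),
        "headers": {
            # Ironic's microversion header; the server can reject only
            # this request, rather than the whole client session.
            "X-OpenStack-Ironic-API-Version": api_version,
            "Accept": "application/json",
        },
    }

# Hypothetical endpoint and node UUID, for illustration only.
req = build_node_request("http://ironic.example.com:6385",
                         "1234-abcd", "1.31")
```

Because the version travels with each request, a newer call can require a newer microversion while older calls keep working against older servers, which is the flexibility the discussion was after.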
Additionally, the team discussed the deprecation of VIF port IDs being stored in the port's (and portgroup's) "extra" field. This was originally how networking information was conveyed from Nova to Ironic, but that mechanism was replaced with the vif-attach and vif-detach APIs in a previous cycle.

Additional items (from discussions outside ironic sessions):

* Ironic to attempt to implement a CI job triggered in the Horizon CI check queue, to allow for some level of integration testing and provide feedback if a Horizon change breaks ironic-ui. This is the first logical step towards supporting the future of plugins with Horizon, and it lowers the maintenance effort on our end. Please blame TheJulia if there are any questions.
* The Scientific SIG will be creating use cases for ironic as RFEs, for things like kexec from the deployment ramdisk (to avoid extremely time-consuming reboots) and booting purely from a ramdisk.
* The Scientific SIG will also be exploring things like BFV-based cluster booting, so we may receive some interest and RFEs as a result.

Joking about deploying a light bulb aside, it was a positive experience to talk about our mutual shared visions and really reach the same page. While last week was a complete blur, this is an exciting time; now onward to seize it!

__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev