After having some conversations with folks at the Ops Midcycle a few weeks ago, and observing some of the more recent email threads related to glance, glance-store, the client, and the API, I spent last week contacting a few of you individually to learn more about some of the issues confronting the Glance team. I had some very frank, but I think constructive, conversations with all of you about the issues as you see them. As promised, this is the public email thread to discuss what I found, and to see if we can agree on what the Glance team should be focusing on going into the Mitaka summit and development cycle and how the rest of the community can support you in those efforts.
I apologize for the length of this email, but there's a lot to go over. I've identified 2 high priority items that I think are critical for the team to be focusing on starting right away in order to use the upcoming summit time effectively. I will also describe several other issues that need to be addressed but that are less immediately critical.

First the high priority items:

1. Resolve the situation preventing the DefCore committee from including image upload capabilities in the tests used for trademark and interoperability validation.

2. Follow through on the original commitment of the project to provide an image API by completing the integration work with nova and cinder to ensure V2 API adoption.

I. DefCore

The primary issue that attracted my attention was the fact that DefCore cannot currently include an image upload API in its interoperability test suite, and therefore we do not have a way to ensure interoperability between clouds for users or for trademark use.

The DefCore process has been long, and at times confusing, even to those of us following it sort of closely. It's not entirely surprising that some projects haven't been following the whole time, or aren't aware of exactly what the whole thing means. I have proposed a cross-project summit session for the Mitaka summit to address this need for communication more broadly, but I'll try to summarize a bit here.

DefCore is using automated tests, combined with business policies, to build a set of criteria for allowing trademark use. One of the goals of that process is to ensure that all OpenStack deployments are interoperable, so that users who write programs that talk to one cloud can use the same program with another cloud easily. This is a *REST API* level of compatibility.

We cannot insert cloud-specific behavior into our client libraries, because not all cloud consumers will use those libraries to talk to the services.
Similarly, we can't put the logic in the test suite, because that defeats the entire purpose of making the APIs interoperable. For this level of compatibility to work, we need well-defined APIs, with a long support period, that work the same no matter how the cloud is deployed. We need the entire community to support this effort. From what I can tell, that is going to require some changes to the current Glance API to meet the requirements. I'll list those requirements, and I hope we can discuss them to a degree that ensures everyone understands them. I don't want this email thread to get bogged down in implementation details or API designs, though, so let's try to keep the discussion at a somewhat high level, and leave the details for specs and summit discussions. I do hope you will correct any misunderstandings or misconceptions, because unwinding this as an outside observer has been quite a challenge and it's likely I have some details wrong.

As I understand it, there are basically two ways to upload an image to glance using the V2 API today. The "POST" API pushes the image's bits through the Glance API server, and the "task" API instructs Glance to download the image separately in the background. At one point apparently there was a bug that caused the results of the two different paths to be incompatible, but I believe that is now fixed. However, the two separate APIs each have different issues that make them unsuitable for DefCore.

The DefCore process relies on several factors when designating APIs for compliance. One factor is the technical direction, as communicated by the contributor community -- that's where we tell them things like "we plan to deprecate the Glance V1 API". In addition to the technical direction, DefCore looks at the deployment history of an API.
They do not want to require deploying an API if it is not seen as widely usable, and they look for some level of existing adoption by cloud providers and distributors as an indication that the API is desired and can be successfully used. Because we have multiple upload APIs, the message we're sending on technical direction is weak right now, and so they have focused on deployment considerations to resolve the question.

The POST API is enabled in many public clouds, but not consistently. In some clouds like HP, a tenant requires special permission to use the API. At least one provider, Rackspace, has disabled the API entirely. This is apparently due to what seems like a fair argument that uploading the bits directly to the API service presents a possible denial of service vector. Without arguing the technical merits of that decision, the fact remains that without a strong consensus from deployers that the POST API should be publicly and consistently available, it does not meet the requirements to be used for DefCore testing.

The task API is also not widely deployed, so its adoption for DefCore is problematic. If we provide a clear technical direction that this API is preferred, that may overcome the lack of adoption, but the current task API seems to have technical issues that make it fundamentally unsuitable for DefCore consideration. While the task API addresses the problem of a denial of service, and includes useful features such as processing of the image during import, it is not strongly enough defined in its current form to be interoperable. Because it's a generic API, the caller must know how to fully construct each task, and know what task types are supported in the first place. There is only one "import" task type supported in the Glance code repository right now, but it is not clear that "import" always uses the same arguments, or interprets them in the same way.
For example, the upstream documentation [1] describes a task that appears to use a URL as source, while the Rackspace documentation [2] describes a task that appears to take a swift storage location. I wasn't able to find JSONSchema validation for the "input" blob portion of the task in the code [3], though that may happen down inside the task implementation itself somewhere.

Tasks also come from plugins, which may be installed differently based on the deployment. This is an interesting approach to creating API extensions, but isn't discoverable enough to write interoperable tools against. Most of the other projects are starting to move away from supporting API extensions at all because of interoperability concerns they introduce. Deployers should be able to configure their clouds to perform well, but not to behave in fundamentally different ways. Extensions are just that, extensions. We can't rely on them for interoperability testing.

There is a lot of fuzziness around exactly what is supported for image upload, both in the documentation and in the minds of the developers I've spoken to this week, so I'd like to take a step back and try to work through some clear requirements, and then we can have folks familiar with the code help figure out if we have a real issue, if a minor tweak is needed, or if things are good as they stand today and it's all a misunderstanding.

1. We need a strongly defined and well documented API, with arguments that do not change based on deployment choices. The behind-the-scenes behaviors can change, but the arguments provided by the caller must be the same and the responses must look the same. The implementation can run as a background task rather than receiving the full image directly, but the current task API is too vaguely defined to meet this requirement, and IMO we need an entry point focused just on uploading or importing an image.

2. Glance cannot require having a Swift deployment.
It's not clear whether this is actually required now, so if it's not then we're in a good state. It's fine to provide an optional way to take advantage of Swift if it is present, but it cannot be a required component.

There are three separate trademark "programs", with separate policies attached to them. There is an umbrella "Platform" program that is intended to include all of the TC approved release projects, such as nova, glance, and swift. However, there is also a separate "Compute" program that is intended to include Nova, Glance, and some others but *not* Swift. This is an important distinction, because there are many use cases both for distributors and public cloud providers that do not incorporate Swift for a variety of reasons. So, we can't have Glance's primary configuration require Swift and we need to provide tests for the DefCore team that run without Swift. Duplicate tests that do use Swift are fine, and might be used for "Platform" compliance tests.

3. We need an integration test suite in tempest that fully exercises the public image API by talking directly to Glance. This applies to the entire API, not just image uploads. It's fine to have duplicate tests using the proxy in Nova if the Nova team wants those, but DefCore should be using tests that talk directly to the service that owns each feature, without relying on any proxying. We've already missed the chance to deal with this in the current DefCore definition, which uses image-related tests that talk to the Nova proxy [4][5], so we'll have to maintain the proxy for the required deprecation period. But we won't be able to consider removing that proxy until we provide alternate tests for those features that speak directly to Glance. We may have some coverage already, but I wasn't able to find a task-based image upload test and there is no "image create" mentioned in the current draft of capabilities being reviewed [6].
There may be others missing, so someone more familiar with the feature set of Glance should do an audit and document what tests are needed so the work can be split up.

4. Once identified and incorporated into the DefCore capabilities set, the selected API needs to remain stable for an extended period of time and follow the deprecation timelines defined by DefCore. That has implications for the V3 API currently in development to turn Glance into a more generic artifacts service. There are a lot of ways to handle those implications, and no choice needs to be made today, so I only mention it to make sure it's clear that (a) we must get V2 into shape for DefCore and (b) when that happens, we will need to maintain V2 even if V3 is finished. We won't be able to deprecate V2 quickly.

Now, it's entirely possible that we can meet all of those requirements today, and that would be great. If that's the case, then the problem is just one of clear communication and documentation. I think there's probably more work to be done than that, though.

[1] http://developer.openstack.org/api-ref-image-v2.html#os-tasks-v2
[2] http://docs.rackspace.com/images/api/v2/ci-devguide/content/POST_importImage_tasks_Image_Task_Calls.html#d6e4193
[3] http://git.openstack.org/cgit/openstack/glance/tree/glance/api/v2/tasks.py
[4] http://git.openstack.org/cgit/openstack/defcore/tree/2015.05.json#n70
[5] http://git.openstack.org/cgit/openstack/defcore/tree/doc/source/guidelines/2015.07.rst
[6] https://review.openstack.org/#/c/213353/

II. Complete Cinder and Nova V2 Adoption

The Glance team originally committed to providing an Image Service API. Besides our end users, both Cinder and Nova consume that API. The shift from V1 to V2 has been a long road. We're far enough along, and the V1 API has enough issues preventing us from using it for DefCore, that we should push ahead and complete the V2 adoption.
That will let us properly deprecate and drop V1 support, and concentrate on maintaining V2 for the necessary amount of time.

There are a few specs for the work needed in Nova, but that work didn't land in Liberty for a variety of reasons. We need resources from both the Glance and Nova teams to work together to get this done as early as possible in Mitaka to ensure that it actually lands this time. We should be able to schedule a joint session at the summit to have the conversation, and we need to take advantage of that opportunity to ensure the details are fully resolved so that everyone understands the plan.

The work in Cinder is more complete, but may need to be reviewed to ensure that it is using the API correctly, safely, and efficiently. Again, this is a joint effort between the Glance and Cinder teams to identify any issues and work out a resolution.

Part of this work will also be to audit the Glance API documentation, to ensure it accurately reflects what the APIs expect to receive and return. There are reportedly at least a few cases where things are out of sync right now. This will require some coordination with the Documentation team.

Those are the two big priorities I see, based on things the rest of the community needs from the team and existing commitments that have been made. There are some other things that should also be addressed.

III. Security audits & bug fixes

Five of 18 recent security reports were related to Glance [7]. It's not surprising, given recent resource constraints, that addressing these has been a challenge. Still, these should be given high priority.

[7] https://security.openstack.org/search.html?q=glance&check_keywords=yes&area=default

IV. Sorting out the glance-store question

This was perhaps the most confusing thing I learned about this week.
The perception outside of the Glance team is that the library is meant to be used by Nova and Cinder to communicate directly with the image store, bypassing the REST API, to improve performance in several cases. I know the Cinder team is especially interested in some sort of interface for manipulating images inside the storage system without having to download them to make copies (for RBD and other systems that support CoW natively). That doesn't seem to be what the library is actually good for, though, since most of the Glance core folks I talked to thought it was really a caching layer. This discrepancy in what folks wanted vs. what they got may explain some of the heated discussions in other email threads.

Frankly, given the importance of the other issues, I recommend leaving glance-store standalone this cycle. Unless the work for dealing with priorities I and II is made *significantly* easier by not having a library, the time and energy it will take to re-integrate it with the Glance service seems like a waste of limited resources. The time to even discuss it may be better spent on the planning work needed.

That said, if the library doesn't provide the features its users were expecting, it may be better to fold it back in and create a different library with a better understanding of the requirements at some point. The path to take is up to the Glance team, of course, but we're already down far enough on the priority list that I think we'll be lucky to finish the preceding items this cycle.

Those are the development priorities I was able to identify in my interviews this week, and there is one last thing the team needs to do this cycle: Recruit more contributors.

Almost every current core contributor I spoke with this week indicated that their time was split between another project and Glance. Often higher priority had to be given, understandably, to internal product work. That's the reality we work in, and everyone feels the same pressures to some degree.
One way to address that pressure is to bring in help. So, we need a recruiting drive to find folks willing to contribute code and reviews to the project to keep the team healthy.

I listed this item last because if you've made it this far you should see just how much work the team has ahead. We're a big community, and I'm confident that we'll be able to find help for the Glance team, but it will require mentoring and education to bring people up to speed to make them productive.

Doug