On 24/01/2024 18:26, Faith Ekstrand wrote:
> So far, we've been trying to build those components in terms of the
> Vulkan API itself with calls jumping back into the dispatch table to
> try and get inside the driver.
To me, it looks like the "opt-in" approach would still be well suited to the goal of cleaning up "implementing Vulkan in Vulkan", and gradual changes diverging from the usual Vulkan specification behavior could be implemented and maintained in existing and new drivers more efficiently than with a whole new programming model. I think it's important that the scale of our solution matches the scale of the problem; otherwise we risk creating large issues in other areas.

Currently there are only a few places where Mesa implements Vulkan on top of Vulkan:
• WSI,
• Emulated render passes,
• Emulated secondary command buffers,
• Meta.

For WSI, render passes and secondary command buffers, I don't think anything needs to be done, as those already have little to no driver backend involvement or interference with the application's calls: render pass and secondary command buffer emulation interacts with the hardware driver entirely within the framework of the Vulkan specification, only storing a few fields in vk_command_buffer that are already handled fully in common code.

Common meta, on the other hand, is indeed extremely intrusive: it overrides the application's pipeline state and bindings, and passes shaders directly as NIR, bypassing SPIR-V. But with meta being such a different beast, I think we shouldn't even be trying to tame it with the same interfaces as everything else. If we're going to handle meta's special cases throughout our common "Gallium2" framework, it feels like we'll simply be turning our "Vulkan on Vulkan" issue into the problem of "implementing Gallium2 on Gallium2".

Instead, I think the cleanest solution for common meta would be to send commands to the driver through a separate callback interface specifically for meta, rather than trying to make meta mimic application code.
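To make the idea a bit more concrete, here is a minimal, purely hypothetical C sketch of such a meta-only callback interface. It assumes the table is split into groups: a command buffer group (saving and restoring application state, binding internal pipelines, dispatching) that NVK-style drivers could share, and a view group that a driver might specialize, for example to create views with formats vkCreateImageView would reject. None of these names (meta_cmd, meta_cmd_ops, meta_view_ops, meta_driver_ops, meta_copy_image, mydrv_*) exist in Mesa, and the command buffer is a toy struct rather than a real vk_command_buffer.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch: none of these names exist in Mesa. */

/* Toy stand-in for a driver command buffer. */
struct meta_cmd {
   uint32_t bound_pipeline;   /* state owned by the application */
   uint32_t saved_pipeline;   /* scratch slot used while meta runs */
   uint32_t dispatch_count;
   uint32_t last_view_format;
};

/* Group 1: command buffer state, which drivers could share. */
struct meta_cmd_ops {
   void (*save_state)(struct meta_cmd *cmd);
   void (*restore_state)(struct meta_cmd *cmd);
   void (*bind_internal_pipeline)(struct meta_cmd *cmd, uint32_t pipeline);
   void (*dispatch)(struct meta_cmd *cmd, uint32_t x, uint32_t y, uint32_t z);
};

/* Group 2: views, which may have to stay driver-specific. Unlike
 * vkCreateImageView, this callback may accept an incompatible format. */
struct meta_view_ops {
   uint32_t (*create_raw_view)(struct meta_cmd *cmd, uint32_t image, uint32_t format);
};

struct meta_driver_ops {
   const struct meta_cmd_ops *cmd;
   const struct meta_view_ops *view;
};

/* Common implementation of the command buffer group. */
static void common_save(struct meta_cmd *cmd) { cmd->saved_pipeline = cmd->bound_pipeline; }
static void common_restore(struct meta_cmd *cmd) { cmd->bound_pipeline = cmd->saved_pipeline; }
static void common_bind(struct meta_cmd *cmd, uint32_t pipeline) { cmd->bound_pipeline = pipeline; }
static void common_dispatch(struct meta_cmd *cmd, uint32_t x, uint32_t y, uint32_t z)
{
   (void)x; (void)y; (void)z; /* a real driver would emit a dispatch here */
   cmd->dispatch_count++;
}

static const struct meta_cmd_ops common_cmd_ops = {
   .save_state = common_save,
   .restore_state = common_restore,
   .bind_internal_pipeline = common_bind,
   .dispatch = common_dispatch,
};

/* A driver that keeps the common command buffer group but specializes views. */
static uint32_t mydrv_create_raw_view(struct meta_cmd *cmd, uint32_t image, uint32_t format)
{
   cmd->last_view_format = format;
   return (image << 16) | format; /* toy view handle */
}

static const struct meta_view_ops mydrv_view_ops = {
   .create_raw_view = mydrv_create_raw_view,
};

static const struct meta_driver_ops mydrv_meta_ops = {
   .cmd = &common_cmd_ops,
   .view = &mydrv_view_ops,
};

/* A compute-based meta operation written only against the callback table,
 * never against the application-facing Vulkan entrypoints. */
static uint32_t meta_copy_image(const struct meta_driver_ops *ops, struct meta_cmd *cmd,
                                uint32_t image, uint32_t format)
{
   assert(ops && ops->cmd && ops->view);
   ops->cmd->save_state(cmd);
   uint32_t view = ops->view->create_raw_view(cmd, image, format);
   ops->cmd->bind_internal_pipeline(cmd, 0xfeed);
   ops->cmd->dispatch(cmd, 8, 8, 1);
   ops->cmd->restore_state(cmd);
   return view;
}
```

Keeping the groups behind separate pointers is what would let a driver mix the common command buffer group with its own view group; a driver that doesn't use common meta would simply never build such a table.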
Such a meta-specific interface would allow drivers to clearly negotiate the details of applying and reverting state changes and of shader compilation, while letting their developers assume that everything else is written, for the most part, purely against the Vulkan specification.

It would still be okay for meta to call vkGetPhysicalDevice*, vkCreate* and vkDestroy* as long as it does so within the rules of the Vulkan specification, to require certain extensions, and to do some less intrusive, non-hot-path interaction with the driver's internals directly, such as requiring that every VkImage is a vk_image and pulling the needed create info fields from there. However, everything interacting with state and bindings, as well as anything going beyond the specification, like creating image views with incompatible formats, would go through those new callbacks.

NVK-style drivers would be able to share a common implementation of those callbacks. Drivers that want to take advantage of more direct-to-hardware paths would need to provide whatever is friendly to them (maybe even with lighter handling of compute-based meta operations compared to graphics ones). That would probably not be a single flat list of callbacks, but several groups of them: for example, a driver could use the common command buffer callbacks but specialize some view- or descriptor-related ones (it may not be possible to make those common at all, by the way). And if a driver doesn't need common meta at all, none of that would bother it.

The other advantages I see in this separate meta API approach are:
• In the rest of the code, driver developers will in most cases need to refer to only a single authority, the massively detailed Vulkan specification, whereas rolling our own interface for everything carries risks:
  • Driver developers will have to spend more time carefully looking up what they need to do in two places rather than largely just one.
  • We're much more prone to leaving gaps in our interface and to writing insufficient documentation. I can't see this effort not being rushed, with us having to catch up with 10 years of XGL/Vulkan development while moving many drivers over alongside other tasks, and with driver developers' varying levels of enthusiasm towards this. Unless zmike's 10-year estimate is our actual target 🤷
  • Having to deal with a new large-scale API may raise the barrier for new contributors and discourage them. Unlike with OpenGL and all its resource renaming machinery, the experience I got from developing Vulkan applications was enough for me to start implementing Vulkan comfortably, shader compilation aside. When zmike showed me an R600g issue about some relation between vertex buffer bindings and CSOs, I just didn't have anything useful to say.
• Faster iteration inside the common meta code, with the meta interface not having to take the demands of regular draws into account as much. And vice versa, of course, especially when it comes to implementing new extensions: with Gallium2, many of them would still need handling in every driver, and in the Gallium2 interface itself on top of that.
• Breaking changes to the meta-specific interface would only require adjusting meta handling in the affected drivers. Breaking changes to something used by everyone across a vast code surface… Maybe you, Faith, are already well used to making them, but that's still a very special kind of fun 😜

--
Triang3l