On Mon, 2011-01-31 at 10:46 -0800, Marek Olšák wrote: > Hi Keith, > > From my point of view, user buffers are just pointers passed through > the Gallium interface and are well-defined from what I can see. They > might be owned by the application (e.g. set via glVertexPointer etc.), > therefore using the transfer functions on user buffers is invalid per > se. Moreover, the application may change the content of user buffers > any time,
Up until now we've always worked as if user buffers were not mutable either by the application or the driver. This means that userbuffers behave very much like normal buffers which have some initial data but no transfer mechanism. One upshot of this is that the driver can safely promote userbuffers to true buffers in a one-off operation. A second upshot is that userbuffers which may change will need to be deleted & recreated by the state-tracker. This is more expressive than a situation where the driver has to always assume that userbuffers may have changed between arbitrary draw calls -- if the buffer is being reused, the driver knows it has not changed. > meaning that drivers should convert the user buffers to real buffers > in the draw_vbo function, then draw vertices, and then forget the real > buffers, keeping the user buffers bound for the next draw operation. > Drivers should not upload user buffers anywhere else, because the > application may change the contents between glDraw* calls. If they are > bound as vertex buffers, we don't need to know their size and > sometimes we even can't (again, glVertexPointer etc.). Instead, we can > use pipe_draw_info::min_index and max_index and upload only that > range. This has proved to be a great optimization and it's how r300g > and r600g work now. In what sense is this different to having the state-tracker destroy and recreate the userbuffer around each draw call? Is it just the overhead of the CALLOC call to create the pipe_resource struct? Otherwise the behaviour inside draw_vbo() looks identical - the same number of userbuffers get uploaded. In both cases the min/max_index values can be used to minimize the uploaded region. > In theory, doing user buffer uploads at the state tracker side using > inline transfers might work and should remove some burden from > drivers. This would be an alternate approach -- the state-tracker could itself figure out min/max_index, and upload that data into a real hardware buffer -- basically the same task that the driver is doing in both examples above. > In practice, inline transfers may have a very large overhead compared > to how things work now. In transfer_inline_write, you usually want to > map the buffer, do a memcpy, and unmap it. The map/unmap overhead can > be really significant. There are applications that use glDrawElements > to draw one triangle at a time, and they draw hundreds of triangles > with user buffers in this way (yes, applications really do this). We > can't afford doing any more work than is absolutely necessary. When > you get 10000 or more draw_vbo calls per second, everything matters. > > Currently, the radeon drivers have one upload buffer for vertices and > it stays mapped until the command stream is flushed. When they get a > user buffer, they do one memcpy and that's all. They don't touch > winsys unless the upload buffer is full. So the optimization we're really talking about here is saving the map/unmap overhead on the upload buffer? And if the state tracker could do the uploads without incurring the map/unmap overhead, would that be sufficient for you to feel comfortable moving this functionality up a level? > > > And user-buffers tend not to stay user-buffers - they can be promoted > to > regular buffers behind the scenes by the driver. Would that be > reflected in this interface somehow? > > I don't think it's needed. The pipe_resource fields can stay immutable > and drivers can internally replace vertex buffers with their private > pipe_resources. The state trackers don't need to know about it. > > > > From the point of view of recording, replaying, debugging, remoting, > etc. at the gallium boundary, it's preferable if all actions are > interposable - ie. all actions are mediated by a function call of some > sort into the gallium interface. Giving a component a direct memory > access into buffer contents would tend to defeat that and make > record/replay of that action difficult. > > Indeed, record/replay would be difficult but not impossbie. FWIW I > think the interface shouldn't be specifically designed for > record/replay. Instead, record/replay should be made work with > whatever interface there is. Well, yes, but there are some really powerful things you can do with an interface like gallium if it is possible to fully interpose at this level. I'm sure you can come up with examples around remote rendering, debugging, etc - it's important enough to want to preserve the ability to interpose and serialize. > Is it possible to get a description of what you're doing at a slightly > higher level to try and understand if there's a solution without these > drawbacks? > > I am quite content with the current interface for user buffers, but it > will need a few changes for it to be more... efficient. Below is my > plan for further improving Gallium and its interactions with user > buffers in general. > > > 1) New vertex buffer manager > > This is how I'd like to put the burden of uploading user buffers out > of the drivers. I've made a new vertex buffer manager. It can be found > here: > http://cgit.freedesktop.org/~mareko/mesa/commit/?h=vbuf-mgr&id=94a53b672dd238e6a50bb6b252614dc2e9f30ddf > > And the corresponding branch is here: > http://cgit.freedesktop.org/~mareko/mesa/log/?h=vbuf-mgr > > It's a module that drivers can use and it does 2 things: > - uploads user buffers > - takes care of converting unsupported vertex formats and unaligned > vertex layouts to supported ones (there are vertex fetcher capability > bits, see struct u_vbuf_caps) > > Besides some typos in a few commits, this work is already done. > > With this manager, the drivers don't have to deal with user buffers > when they are bound as vertex buffers. They only get real hardware > buffers. The drivers don't even have to deal with unsupported vertex > formats. Moreover, this code has already proven to be fast when it was > maturing in r300g and is specifically optimized for low overhead. Its > integration to a driver is easy and straithforward. But I need the > pipe_resource::user_ptr variable to be able to access the user buffer > memory in util (efficiently). What prevents this being efficient at the state-tracker level? You mention two things -- the map/unmap overhead (ie the state tracker has to constantly map/unmap because it doesn't know which draw_vbo call will cause the driver to flush its command buffer). -- extra capability information about the hardware > 2) Optimizing vertex array state changes in st/mesa > > Because we don't need to know the sizes of user vertex buffers (as I > said, we can use pipe_draw_info::min_index and max_index instead, as > is done in the vertex buffer manager), we can remove the calculation > of the buffer sizes from st_draw_vbo. But we still must be calculating that somewhere -- the min_/max_index info has to come from somewhere in the statetracker. > We can also remove pipe_vertex_buffer::max_index (again, the > information provided by pipe_draw_info is sufficient) and the related > code from st/mesa. Not only will this simplify st_draw_vbo, it will > allow us to bind vertex buffers and a vertex elements state only when > needed, i.e. when either the _NEW_ARRAY or _NEW_PROGRAM flag is set. > It makes this usage case a lot faster: > > for (i = 0; i < 1000; i++) glDrawElements(...); > > This work is *almost* complete and can be found here: > http://cgit.freedesktop.org/~mareko/mesa/log/?h=gallium-varrays-optim > > The framerate in the Torcs game goes from 14 to 20 fps with this code > (in one particular scenario I was optimizing for). > > Since I no longer compute the sizes of user buffers, I have to put ~0 > in the 'size' parameter of user_buffer_create and I expect drivers not > to use it. r300g, r600g, nv50, nvc0, softpipe, and llvmpipe should be > ready for this. Not sure about the other drivers, but it's fixable. > > -- > So this all is my current plan to simplify hardware drivers a bit and > add some nice optimizations. Another option would be to move the new > vertex buffer manager or something equivalent to the state tracker and > remove user buffers from the Gallium interface, but that would be > additional work with uncertain performance implications, so I decided > not to take this path (for now). I've still got some catching up to do on your code (sorry for being slow), but I want to make sure you get what you need to make this work. The only real constraint I don't want to break is the ability to interpose and serialize at the gallium interface layer. Keith _______________________________________________ mesa-dev mailing list mesa-dev@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/mesa-dev