Luca Barbieri wrote:
>> At a first glance:
>>
>> 1) We probably *will* need a delayed destroy workqueue to avoid
>> wasting memory that otherwise should be freed to the system. At the
>> very least, the delayed delete process should optionally be run by a
>> system shrinker.
>>
> You are right. For VRAM we don't care, since we are the only user,
> while for system-backed memory some delayed destruction will be
> needed.
> The logical extension of the scheme would be for the Linux page
> allocator/swapper to check for TTM buffers to destroy when it would
> otherwise shrink caches, try to swap, and/or wait for swap to happen.
> Not sure whether there are existing hooks for this, or where exactly
> to hook this code.
>
I think there are existing hooks for this, but I haven't yet figured
out how they work.

>> 2) Fences in TTM are currently not necessarily strictly ordered, and
>> sequence numbers are hidden from the bo code. This means, for a given
>> FIFO, fence sequence 3 may expire before fence sequence 2, depending
>> on the usage of the buffer.
>>
>
> My definition of "channel" (I sometimes used FIFO incorrectly as a
> synonym of that) is exactly a set of fences that are strictly ordered.
> If the card has multiple HW engines, each is considered a different
> channel (so that a channel becomes a (fifo, engine) pair).
>
> We may, however, need to add the concept of a "sync domain": a set of
> channels that support on-GPU synchronization against each other.
> This would model hardware where channels on the same FIFO can be
> synchronized together while those on different FIFOs cannot, and also
> multi-core GPUs where synchronization might be available only inside
> each core and not across cores.
>
> To sum it up, a GPU consists of a set of sync domains, each consisting
> of a set of channels, each consisting of a sequence of fences, with
> the following rules:
> 1. Fences within the same channel expire in order.
> 2. If channels A and B belong to the same sync domain, it is possible
> to emit a fence on A that is guaranteed to expire after an arbitrary
> fence of B.
>
> Whether two channels share a FIFO is essentially a driver
> implementation detail; what TTM cares about is whether they are in
> the same sync domain.
>
> [I just made up "sync domain" here: is there a standard term?]
>
> This assumes that the "synchronizability" graph is a disjoint union of
> complete graphs. Is there any example where it is not so?
> Also, does this actually model Poulsbo correctly, or am I wrong?

Let me give some usage examples for Intel and Poulsbo. Synchronization
for these was modeled with fences carrying "stages" (called
fence_types).
The signaling of one stage set a bit in a bit mask; a bo was considered
idle when (bo->required_stages & fence->signaled_stages) ==
bo->required_stages.

Intel would have had the stages command_submitted, read_flushed and
write_flushed. A command buffer would have been idle once the fence had
signaled command_submitted, whereas a render buffer would have been
idle on command_submitted | write_flushed. A write flush would have had
to be issued separately, typically when the buffer was put on the
delayed delete queue or when the bo_idle function was called, which
really minimized the amount of flushing needed. An executed write flush
would signal write_flushed on all fences whose sequence had passed. An
explicit ordering of fences here would have meant queueing a
read-write flush in between them.

For Poulsbo, the GPU was modeled with a programming stage, a binner
stage, a rasterizer stage and a feedback stage. Each stage of
completion set a signaled_stage bit in the fence. A vertex buffer would
be idle once the fence signaled binner_done (which could happen well
before rasterization of the previous command sequence was complete).
It's true that one could use "feedback done" for all buffers, but a
low-memory-footprint system quickly ran out of vertex buffer space, so
quick reuse of those buffers was essential.

So to summarize: the usage of the buffer, together with the signaled
state of the fence object, really determines whether the buffer is
idle. In your model, the Poulsbo GPU would have been a sync domain, and
the "programmer", binner, rasterizer and "feedback engine" would have
been separate channels. The Intel case would perhaps have been a bit
trickier.

/Thomas

> Note that we could use CPU mediation more than we currently do.
> For instance, Nouveau currently does inter-channel synchronization by
> simply waiting on the fence with the CPU, immediately and
> synchronously, while it could instead queue the commands in software
> and, with an interrupt/delayed mechanism, submit them to hardware once
> the fence being waited for has expired.

_______________________________________________
Nouveau mailing list
Nouveau@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/nouveau