On Mon, Oct 21, 2002 at 11:27:15AM -0400, Leif Delgass wrote: [...]
> I have a pretty good idea how to do the verification (just checking the
> register count and offset range of each command, skipping the data), but
> I'm not sure if it'll be faster to copy as we verify, or memcpy the
> entire buffer and verify (or verify and then copy). At any rate, it
> should be easy to test and see which works best.

Check the log of an IRC talk I had with Willmore in today's meeting
regarding this same subject (appended below). It was very insightful. In
summary, we would make the most use of the cache by verifying first and
then copying (and, if possible, avoiding write caching on the copy).
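For the verification itself, here is a rough sketch of what I have in
mind. The command encoding (CMD_REG/CMD_COUNT) and the allowed register
bounds below are placeholders, not the actual mach64 layout; the real
driver would use a bitmask or table as discussed in the log:

#include <linux/types.h>

#define CMD_REG(h)      ((h) & 0xffffU)          /* placeholder: register offset */
#define CMD_COUNT(h)    (((h) >> 16) & 0xffffU)  /* placeholder: data dword count */

#define ALLOWED_REG_MIN 0x0100  /* placeholder bounds; a bitmask or */
#define ALLOWED_REG_MAX 0x01ff  /* table lookup would work as well  */

static int verify_buffer(const u32 *buf, unsigned int dwords)
{
	unsigned int i = 0;

	while (i < dwords) {
		u32 header = buf[i++];
		u32 reg = CMD_REG(header);
		u32 count = CMD_COUNT(header);

		/* Reject registers clients aren't allowed to touch. */
		if (reg < ALLOWED_REG_MIN || reg > ALLOWED_REG_MAX)
			return -1;

		/* Skip the data dwords without inspecting them, making
		 * sure the count doesn't run past the buffer end. */
		if (count > dwords - i)
			return -1;
		i += count;
	}
	return 0;
}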
> The thing I hadn't figured out yet is the best way to allocate the
> private buffers. We could allocate a chunk of AGP mem like other drivers
> do for the ring buffer, which could be set read only for userspace. I'm
> not sure if drmAddBufs would work to create one mapped set and one
> unmapped set of buffers.

drmAddBufs is getting deprecated, especially regarding its PCI
counterpart, but I know that some drivers (MGA, IIRC) allocate one extra
buffer which isn't mapped.
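If it did work, I imagine it would look something like the sketch below,
using the drmAddBufs() entry point from libdrm's xf86drm.h. Whether two
separate calls actually yield one client-mapped set and one private set
is exactly the open question, so take the counts, sizes, and offsets as
made-up values:

#include <xf86drm.h>

int setup_buffers(int fd, int agp_offset)
{
	/* Client-visible buffers: a smaller set, mapped into user space. */
	if (drmAddBufs(fd, 128, 4096, DRM_AGP_BUFFER, agp_offset) < 0)
		return -1;

	/* Private buffers: the 2MB pool used for the actual DMA; these
	 * would have to stay unmapped (or read-only) for untrusted
	 * clients. */
	if (drmAddBufs(fd, 512, 4096, DRM_AGP_BUFFER,
	               agp_offset + 128 * 4096) < 0)
		return -1;

	return 0;
}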
> I think we'll need a set of private buffers equivalent to what we use
> now (2MB), but we could probably get away with a smaller set of client
> buffers.
I agree. In principle, each client could even use the same buffer over and over again (not referring to blits). And they don't even need to be DMA buffers at all: a fixed malloc'd buffer in user space should be good enough.
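For instance, the client side could be as simple as this. The submit
structure and the command index are purely hypothetical stand-ins for
whatever ioctl we end up defining; only drmCommandWrite() is real libdrm:

#include <stdlib.h>
#include <stdint.h>
#include <xf86drm.h>

#define BUF_DWORDS 2048  /* ~8K, the typical upper bound from the log */

struct submit {          /* hypothetical ioctl argument */
	uint32_t *buf;
	unsigned int dwords;
};

int emit(int fd, int cmd_index)
{
	static uint32_t *buf;  /* one plain user-space buffer, reused */
	struct submit s;

	if (!buf) {
		buf = malloc(BUF_DWORDS * sizeof(*buf));
		if (!buf)
			return -1;
	}
	s.buf = buf;
	s.dwords = 0;
	/* ... build commands into buf[], bumping s.dwords ... */
	return drmCommandWrite(fd, cmd_index, &s, sizeof(s));
}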
> We should also think about the Xserver, since we'll probably want to be
> able to use "indirect" command buffers from the Xserver to implement XAA
> with DMA.
The X server could use the private buffers directly, since it's trusted. It seems that the logic to accommodate all these scenarios will be a little complicated. We have to think carefully about the API (i.e., the ioctls) to let everything play nicely with each other, both inside and outside of the DRM.

José Fonseca
<willmore> I just wanted to answer his question on the mailing list about copying/verifying DMA buffers. Copy first and then verify using the target copy.
<jfonseca> willmore: And why do you think that?
<willmore> With a copy--unless you're using write cache bypass ala SSE--the target will be the cache 'hotter' of the two. If you verify first, you lose the cache benefits of optimized memcpys.
<willmore> Completely optimal would be a hand tuned block prefetched chunk of assembly, but that's a bit out of the scope of this solution, yes?
<jfonseca> mmm... thanx!
<jfonseca> not necessarily. I've already done some hand coded assembly for copying the vertex data, and I hoped to adapt that for the copy/verify process later on. But it seems I need more info about prefetching for doing that.
<willmore> No problem. The only thing that really determines this is if the memcpy uses a cache bypass type copy.
<willmore> You have to prefetch by hand on x86 and I think there is an SSE instruction to give the hints. I also think that for CPUs that don't support prefetch, they just ignore the instruction--so that might not be exactly correct.
<jfonseca> *Note* that in any case the buffer size we will be dealing with is smaller than the cache size, to avoid a performance hit with the copying.
<willmore> You can make the prefetch transparent on non-prefetch capable CPUs if you schedule things well in your code.
<willmore> What is the typical buffer size? I was assuming near L1 sized.
<jfonseca> I'm not quite sure of the values now, but it's very rare to have more than 8K size. Usually there is a state change that forces us to flush the buffer before it gets big.
<willmore> Hmmm, that's in the range where you might consider doing it by hand and winning. How complex is the validation logic?
<jfonseca> I'm not sure yet, but basically it's checking if a command is acceptable (either using a bitmask, a table, or a series of compares), reading a count word, skipping that many dwords, and going over it again until the end of the buffer.
<jfonseca> just to be sure I'm on track, what's the usual size of L1 cache?
<willmore> Okay, a little pointer math should give you the opportunity to run the prefetch at the right points--as the basic loop isn't using fixed length operands, you can't do the normal *prefetch the next loop+x worth of data* trick.
<willmore> Hmm, or could you?
<willmore> 4K to 16K
<willmore> Some wander way up, like the dual 64K L1's of the K7 generation. Some modern procs like the P4 are only 4K. Go figure. :)
<willmore> The nice thing about prefetch is you don't have to be exact about it.
<willmore> You give the instruction a pointer to anywhere in a cache line and it'll prefetch the whole line.
<jfonseca> As I said, I'm not very familiar with the actual assembly coding of prefetching so I can't be sure now, but I'll investigate both the exact size of the buffers (we can control this anyway) and the whole prefetching concept.
<willmore> If you keep source+dest sized less than L1 it doesn't matter much. If you can control the cache bypass nature of your copies, life gets interesting.
<jfonseca> thank you very much for your help. If you don't mind I would mail the log of this to Leif and possibly dri-devel.
<willmore> Oh, and you can always segment your copies in 2K chunks: verify that chunk, copy 2K more, verify...
<willmore> That way is pretty much guaranteed to work on all chips with an L1 data cache--arch independently.
<willmore> No problem. This and trying out new versions of the mach64 driver are about all I'm good for. :)
<jfonseca> yes, but it would be nice if the original client-submitted buffer wasn't trashed in the copying process.
<jfonseca> i.e., that it remained in the cache before and during the call.
<willmore> Right, so in that case (if you can control caching of the copy): verify the source in less than L1 sized chunks after copying a less than L1 sized chunk.
<willmore> Otherwise, do it in less than half L1 sized chunks and hope you kept a copy in L2. :)
<jfonseca> mmm, ok! I'll try to keep these figures in my mind! :)
<willmore> If you can assume that the source is hot (I think you can), you have some nice optimization options. If you can control the write bypass of the copy, you have even better options.
<willmore> I'm available if you ever have any more questions. I'm a 'large data' kind of optimizer.
<jfonseca> mmm I see what you mean... it makes much more sense to do that verify+copy than the other way around..!!
<jfonseca> ;-)
<willmore> Yeah, I hadn't been thinking that the source would be hot. It so rarely is in the codes I play with. You're lucky.
<willmore> hot input is a great subset. Cache bypassing writes are becoming more common and don't have any of the disadvantages of prefetch instructions.
<jfonseca> well, I'm not 100% sure, but by the results I've had so far (almost no perceptible hit with the copy) I can assume that. The buffer is being generated dword by dword just before the call.
<willmore> Okay, with normal code, that would leave your source hot. Hmm, in that case, if the data is smaller than L1, verify the whole thing,
<willmore> and then copy the whole thing--write cache bypassing if possible.
<willmore> I only suggested, before, to verify the dest as with a non-write cache bypassing copy, you *know* the dest is still in cache, but the source
<willmore> may be gone.
<jfonseca> there is no interest in the dest getting into the cache (assuming the verification is done before), because it's not the CPU that will be reading it, but the card.
<willmore> I know that, but it's not a common assumption that you'll have the ability to say 'bypass the cache on writes'. So, I assumed that a write would dirty the cache--hence verify the dest, as you know it's hot.
<willmore> But, since you get to assume the source is hot and you *don't* want the dest to cache, you're right: verify the source and cache-bypass copy the data.
<jfonseca> yep. it seems clear now. wow! this is neat stuff! thx
<willmore> No problem. There's been a lot of thought put into this stuff by people much brighter than you and I. :) I just focus on it. The 3D stuff I leave to others. :) FYI, O'Reilly has a nice book on optimization that covers stuff like this. It's a great read. :)
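For reference, the chunked scheme willmore suggests might look roughly
like the sketch below. verify_chunk() is assumed to be a chunk-wise
variant of the verifier sketched at the top of this mail (a real one
would have to carry partial-command state across chunk boundaries); the
prefetch hint is optional, and CPUs without support simply ignore it.
The plain memcpy stands in for a cache-bypassing copy (e.g. SSE
non-temporal stores) where the CPU provides one:

#include <string.h>
#include <linux/types.h>

#define CHUNK_DWORDS (2048 / 4)  /* 2K chunks, well under any L1 size */

extern int verify_chunk(const u32 *src, unsigned int dwords);

int verify_and_copy(u32 *dst, const u32 *src, unsigned int dwords)
{
	while (dwords) {
		unsigned int n = dwords < CHUNK_DWORDS ? dwords : CHUNK_DWORDS;

#ifdef __GNUC__
		/* Hint the start of the next chunk into cache while we
		 * work on this one; harmless where unsupported. */
		__builtin_prefetch(src + n);
#endif
		/* Verify while the freshly written source is still hot... */
		if (verify_chunk(src, n))
			return -1;

		/* ...then copy, ideally bypassing the write cache, since
		 * only the card will ever read the destination. */
		memcpy(dst, src, n * 4);

		src += n;
		dst += n;
		dwords -= n;
	}
	return 0;
}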