On Mon, Oct 21, 2002 at 11:27:15AM -0400, Leif Delgass wrote:
> [...]
> I have a pretty good idea how to do the verification (just checking the
> register count and offset range of each command, skipping the data), but
> I'm not sure if it'll be faster to copy as we verify, or memcpy the entire
> buffer and verify (or verify and then copy).  At any rate, it should be
> easy to test and see which works best.
Check the log of an IRC talk I had with Willmore in today's meeting
regarding this same subject. It was very insightful. In summary, we
would make the most of the cache by verifying first and then copying
(and, if possible, avoiding write caching).
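
For concreteness, here's a rough kernel-side sketch of the kind of verify
pass I have in mind (the header layout and the MAX_SAFE_REG bound are
placeholders, not the real mach64 command format):

    /* Walk the command stream: check each command's register offset,
     * make sure its data fits in the buffer, and skip the data itself.
     * Header layout and MAX_SAFE_REG are made up for illustration. */
    static int verify_buffer(const u32 *buf, int dwords)
    {
            int i = 0;

            while (i < dwords) {
                    u32 header = buf[i++];
                    u32 reg    = header & 0xffff;         /* register offset */
                    u32 count  = (header >> 16) & 0xffff; /* data dword count */

                    if (i + count > dwords)
                            return -EINVAL;  /* command overruns the buffer */
                    if (reg > MAX_SAFE_REG)
                            return -EACCES;  /* register not in allowed range */

                    i += count;              /* skip the data, don't read it */
            }
            return 0;
    }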


> The thing I hadn't figured out yet is the best way to allocate the private
> buffers. We could allocate a chunk of AGP mem like other drivers do for
> the ring buffer, which could be set read-only for userspace. I'm not sure
> if drmAddBufs would work to create one mapped set and one unmapped set of
> buffers.
drmAddBufs is getting deprecated, especially regarding the PCI
counterpart, but I know that some drivers (the MGA, IIRC) allocate one
extra buffer which isn't mapped.
> I think we'll need a set of private buffers equivalent to what we use now
> (2MB), but we could probably get away with a smaller set of client buffers.
I agree. In principle, each client could even use the same buffer over
and over again (not referring to blits). And they don't even need to be
DMA buffers at all. Fixed malloc'd memory from user space should be
good enough.
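
Something along these lines on the client side, then (the ioctl name,
the argument struct, and emit_commands() are invented just to
illustrate; a strawman for the struct follows below):

    /* DRM_IOCTL_MACH64_SUBMIT, drm_mach64_submit_t and emit_commands()
     * are hypothetical.  Reuse one malloc'd buffer for every submission;
     * the kernel verifies it and copies it into a private DMA buffer. */
    u32 *buf = malloc(BUF_SIZE);
    drm_mach64_submit_t submit;

    for (;;) {
            submit.addr   = buf;
            submit.dwords = emit_commands(buf, BUF_SIZE / 4); /* fill it */
            ioctl(fd, DRM_IOCTL_MACH64_SUBMIT, &submit);
    }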

> We should also think about the X server, since we'll probably want to be
> able to use "indirect" command buffers from the X server to implement XAA
> with DMA.
The X server could use the private buffers since it's trusted.

It seems that the logic to accommodate all these scenarios will be a
little complicated. We have to think carefully about the API (i.e., the
IOCTLs) to let everything play nicely with each other - inside and
outside of the DRM.
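
As a strawman to frame that discussion (none of these names are final),
the submit ioctl argument could look like:

    /* Hypothetical argument struct -- field names are placeholders. */
    typedef struct drm_mach64_submit {
            void *addr;     /* user pointer to the command data */
            int   dwords;   /* length in dwords */
            int   discard;  /* client won't touch the buffer again */
    } drm_mach64_submit_t;

and in the kernel the same entry point could branch on the caller's
privileges, since the DRM already knows which file handle belongs to the
X server:

    /* trusted_client(), submit_private_buffer() and verify_and_copy()
     * are hypothetical helpers, just to show the split. */
    if (trusted_client(filp))
            submit_private_buffer(dev, &submit);  /* X server: no verify */
    else
            verify_and_copy(dev, &submit);        /* untrusted client */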

José Fonseca
<willmore>  I just wanted to answer his question on the mailing list about 
copying/verifying DMA buffers.  Copy first and then verify using the target copy.
<jfonseca> willmore: And why do you think that?
<willmore> With a copy--unless you're using write cache bypass ala SSE--the target 
will be the cache 'hotter' of the two.  If you verify first, you lose the cache 
benefits of optimized memcpys.
<willmore> Completely optimal would be a hand-tuned, block-prefetched chunk of 
assembly, but that's a bit out of the scope of this solution, yes?
<jfonseca> mmm... thanx!
<jfonseca> not necessarily. I've already done some hand-coded assembly for copying the 
vertex data, and I hoped to adapt that for the copy/verify process later on. But it 
seems I need more info about prefetching to do that.
<willmore> No problem.  The only thing that really determines this is whether the 
memcpy uses a cache-bypass type of copy.
<willmore> You have to prefetch by hand on x86 and I think there is an SSE instruction 
to give the hints.  I also think that CPUs that don't support prefetch just ignore the 
instruction--so that might not be exactly correct.
<jfonseca> Note that in any case the buffer size we will be dealing with is smaller 
than the cache size, to avoid a performance hit with the copying.
<willmore> You can make the prefetch transparent on non-prefetch capable CPUs if you 
schedule things well in your code.
<willmore> What is the typical buffer size?  I was assuming near L1 sized.
<jfonseca> I'm not quite sure of the values now, but it's very rare to have more than 
8K.  Usually there is a state change that forces a flush of the buffer before it gets 
big.
<willmore> Hmmm, that's in the range where you might consider doing it by hand and 
winning.  How complex is the validation logic?
<jfonseca> I'm not sure yet, but basically it's checking whether a command is 
acceptable (either using a bitmask, a table, or a series of compares), reading a count 
word, skipping that many dwords, and going over it again until the end of the buffer.
<jfonseca> just to be sure I'm on track, what's the usual size of L1 cache?
<willmore> Okay, a little pointer math should give you the opportunity to run the 
prefetch at the right points--as the basic loop isn't using fixed length operands, you 
can't do the normal *prefetch the next loop+x worth of data trick*.
<willmore> Hmm, or could you?
<willmore> 4K to 16K
<willmore> Some wander way up like the dual 64K L1's of the K7 generation.  Some 
modern procs like the P4 are only 4K.  Go figure. :)
<willmore> The nice thing about prefetch is you don't have to be exact about it.
<willmore> You give the instruction a pointer to anywhere in a cache line and it'll 
prefetch the whole line.
<jfonseca> As I said, I'm not very familiar with the actual assembly coding of 
prefetching so I can't be sure now, but I'll investigate both the exact size of the 
buffers (we can control this anyway) and the whole prefetching concept.
<willmore> If you keep source+dest sized less than L1 it doesn't matter much.  If you 
can control the cache bypass nature of your copies, life gets interesting.
<jfonseca> thank you very much for your help. If you don't mind I would mail the log 
of this to Leif and possibly dri-devel.
<willmore> Oh, and you can always segment your copies in 2K chunks, verify that 
chunk, copy 2K more, verify...  That way is pretty much guaranteed to work on all 
chips with an L1 data cache--arch independently.
<willmore> No problem.  This and trying out new versions of the mach64 driver are 
about all I'm good for. :)
<jfonseca> yes, but it would be nice if the original client-submitted buffer wasn't 
trashed in the copying process.
<jfonseca> i.e., that it remained on the cache before and during the call.
<willmore> Right, so in that case (if you can control caching of the copy): verify the 
source in less than L1 sized chunks after copying a less than L1 sized chunk.
<willmore> Otherwise, do it in less than half L1 sized chunks and hope you kept a copy 
in L2. :)
<jfonseca> mmm, ok! I'll try to keep these figures in my mind! :)
<willmore> If you can assume that the source is hot (I think you can), you have some 
nice optimization options.  If you can control the write bypass of the copy, you have 
even better options.
<willmore> I'm available if you ever have any more questions.  I'm a 'large data' kind 
of optimizer.
<jfonseca> mmm I see what you mean... it makes much more sense to do that verify+copy 
than the other way around..!!
<jfonseca> ;-)
<willmore> Yeah, I hadn't been thinking that the source would be hot.  It so rarely is 
on the codes I play with.  You're lucky.
<willmore> Hot input is a great subset.  Cache-bypassing writes are becoming more 
common and don't have any of the disadvantages of prefetch instructions.
<jfonseca> well, I'm not 100% sure, but from the results I've had so far (almost no 
perceptible hit from the copy) I can assume that. The buffer is being generated dword 
by dword just before the call.
<willmore> Okay, with normal code, that would leave your source hot.  Hmm, in that 
case, if the data is smaller than L1, verify the whole thing,
<willmore> and then copy the whole thing--write cache bypassing if possible.
<willmore> I only suggested verifying the dest, before, because with a copy that 
doesn't bypass the write cache you *know* the dest is still in cache, but the source
<willmore> may be gone.
<jfonseca> there is no interest in the dest getting into the cache (assuming the 
verification is done beforehand), because it's not the CPU that will be reading it, 
but the card.
<willmore> I know that, but it's not a common assumption that you'll have the ability 
to say 'bypass the cache on writes'.  So, I assumed that a write would dirty the 
cache--hence verify the dest as you know it's hot.
<willmore> But, since you get to assume the source is hot and you *don't* want the 
dest to cache, you're right, verify the source and cache-bypass copy the data.
<jfonseca> yep. it seems clear now. wow! this is neat stuff! thx
<willmore> No problem.  There's been a lot of thought put into stuff by people much 
brighter than you and I. :)  I just focus on it.  The 3D stuff I leave to others. :)  
FYI, O'Reilly has a nice book on optimization that covers stuff like this.  It's a 
great read. :)
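
To sum the log up in code: verify the source while it's still hot in L1,
then copy with cache-bypassing stores. A user-space sketch with SSE2
intrinsics (in the DRM itself this would be hand-written assembly;
verify_buffer() is the routine sketched earlier, and the alignment and
size assumptions are noted in the comments):

    #include <emmintrin.h>  /* SSE2: _mm_load_si128, _mm_stream_si128 */

    typedef unsigned int u32;

    int verify_then_stream_copy(u32 *dst, const u32 *src, int dwords)
    {
            __m128i v;
            int i, ret;

            /* Verify first, while the freshly generated source is hot. */
            ret = verify_buffer(src, dwords);
            if (ret)
                    return ret;

            /* Non-temporal copy: the dest (read only by the card) never
             * dirties the cache.  Assumes 16-byte-aligned buffers and a
             * dword count that is a multiple of 4. */
            for (i = 0; i < dwords; i += 4) {
                    /* hint one cache line ahead (it's only a hint) */
                    _mm_prefetch((const char *)&src[i + 16], _MM_HINT_NTA);
                    v = _mm_load_si128((const __m128i *)&src[i]);
                    _mm_stream_si128((__m128i *)&dst[i], v);
            }
            _mm_sfence();  /* order the streaming stores before DMA kickoff */
            return 0;
    }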
