On Thu, Apr 11, 2013 at 09:12:17AM -0400, Michael R. Hines wrote:
> On 04/11/2013 03:19 AM, Michael S. Tsirkin wrote:
> >On Wed, Apr 10, 2013 at 04:05:34PM -0400, Michael R. Hines wrote:
> >Maybe we should just say "RDMA is incompatible with memory
> >overcommit" and be done with it then. But see below.
> >>I would like to propose a compromise:
> >>
> >>How about we *keep* the registration capability and leave it enabled
> >>by default?
> >>
> >>This gives management tools the ability to get performance if they want to,
> >>but also satisfies your requirements in case management doesn't know the
> >>feature exists - they will just get the default enabled?
> >Well unfortunately the "overcommit" feature as implemented seems useless
> >really. Someone wants to migrate with RDMA but with low performance?
> >Why not migrate with TCP then?
>
> Answer below.
>
> >>Either way, I agree that the optimization would be very useful,
> >>but I disagree that it is possible for an optimized registration algorithm
> >>to perform *as well as* the case when there is no dynamic
> >>registration at all.
> >>
> >>The point is that dynamic registration *only* helps overcommitment.
> >>
> >>It does nothing for performance - and since that's true any optimizations
> >>that improve on dynamic registrations will always be sub-optimal to turning
> >>off dynamic registration in the first place.
> >>
> >>- Michael
> >So you've given up on it. Question is, sub-optimal by how much? And
> >where's the bottleneck?
> >
> >Let's do some math. Assume you send 16 bytes registration request and
> >get back a 16 byte response for each 4Kbyte page (16 bytes enough?). That's
> >32/4096 < 1% transport overhead. Negligible.
> >
> >Is it the source CPU then? But CPU on source is basically doing same
> >things as with pre-registration: you do not pin all memory on source.
> >
> >So it must be the destination CPU that does not keep up then?
> >But it has to do even less than the source CPU.
> >
> >I suggest one explanation: the protocol you proposed is inefficient.
> >It seems to basically do everything in a single thread:
> >get a chunk,pin,wait for control credit,request,response,rdma,unpin,
> >There are two round-trips of send/receive here where you are not
> >doing anything useful. Why not let migration proceed?
> >
> >Doesn't all of this sound worth checking before we give up?
> >
> First, let me remind you:
>
> Chunks are already doing this!
>
> Perhaps you don't fully understand how chunks work or perhaps I
> should be more verbose in the documentation. The protocol is already
> joining multiple pages into a single chunk without issuing any writes.
> It is not until the chunk is full that an actual page registration
> request occurs.

I think I got that at a high level.
But there is a stall between chunks.
If you make chunks smaller, but pipeline registration, then there will
never be any stall.

> So, basically what you want to know is what happens if we *change*
> the chunk size dynamically?

What I wanted to know is where the performance is going.
Why is the chunk-based approach slower? It's not the extra messages:
on the wire, these take up negligible bandwidth.

> Something like this:
>
> 1. Chunk = 1MB, what is the performance?
> 2. Chunk = 2MB, what is the performance?
> 3. Chunk = 4MB, what is the performance?
> 4. Chunk = 8MB, what is the performance?
> 5. Chunk = 16MB, what is the performance?
> 6. Chunk = 32MB, what is the performance?
> 7. Chunk = 64MB, what is the performance?
> 8. Chunk = 128MB, what is the performance?
>
> I'll get you this table today. Expect an email soon.
>
> - Michael
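
To make the stall argument concrete, here is a rough back-of-the-envelope
model in Python. It is only a sketch, not a measurement and not the QEMU
code: the function names (control_overhead, serial_time, pipelined_time)
are made up for illustration, and the link bandwidth and per-chunk stall
are placeholder assumptions. It ignores pinning cost and CPU time
entirely. It compares a protocol that waits for the registration exchange
before every chunk with one that overlaps registration of the next chunk
with the write of the current one.

```python
def control_overhead(request_bytes=16, response_bytes=16, page_bytes=4096):
    """Fraction of wire traffic spent on registration messages per page."""
    return (request_bytes + response_bytes) / page_bytes


def chunk_count(total_bytes, chunk_bytes):
    """Number of chunks needed to cover total_bytes (ceiling division)."""
    return -(-total_bytes // chunk_bytes)


def serial_time(total_bytes, chunk_bytes, bandwidth, stall_s):
    """Every chunk pays the full registration stall before its write starts."""
    chunks = chunk_count(total_bytes, chunk_bytes)
    return chunks * stall_s + total_bytes / bandwidth


def pipelined_time(total_bytes, chunk_bytes, bandwidth, stall_s):
    """Registration of chunk i+1 overlaps the write of chunk i.

    Only the first registration is exposed. After that each step costs
    whichever is longer: writing a full chunk or registering the next one.
    """
    chunks = chunk_count(total_bytes, chunk_bytes)
    write_s = chunk_bytes / bandwidth
    last_bytes = total_bytes - (chunks - 1) * chunk_bytes
    return stall_s + (chunks - 1) * max(write_s, stall_s) + last_bytes / bandwidth


if __name__ == "__main__":
    MIB = 1 << 20
    total = 1024 * MIB        # assumed guest RAM to send: 1 GiB
    bandwidth = 1000 * MIB    # assumed link speed in bytes/second
    stall = 100e-6            # assumed cost of the two control round trips

    print(f"control overhead per page: {control_overhead():.4%}")
    for chunk_mib in (1, 2, 4, 8, 16, 32, 64, 128):
        chunk = chunk_mib * MIB
        s = serial_time(total, chunk, bandwidth, stall)
        p = pipelined_time(total, chunk, bandwidth, stall)
        print(f"chunk={chunk_mib:>3}MB serial={s:.4f}s pipelined={p:.4f}s")
```

In this model the pipelined variant is never slower than the serial one,
and once a chunk write takes at least as long as the registration
exchange, the only stall left is the very first one. Whether the real
protocol behaves this way is exactly what the chunk-size table should
tell us.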