On Apr 28, 2009, at 6:11 PM, Ralph Campbell wrote:

> Ah, free() just puts the buffer on a free list and a subsequent malloc()
> can return it. The application isn't aware of the MPI library calling
> ibv_reg_mr()


Right.

> and the MPI library isn't aware of the application
> reusing the buffer differently.
> The virtual to physical mapping can't change while it is pinned
> so buffer B should have been written with new data overwriting
> the same physical pages that buffer A used.
> I would assume the application would wait for the MPI_Isend() to
> complete before freeing the buffer so it shouldn't be the case that
> the same buffer is in the process of being sent when the application
> overwrites the address and tries to send it again.


This is not the problem.

An MPI program that reuses a buffer while it is still in use by an ongoing non-blocking send operation is clearly erroneous.

Perhaps my explanations were incorrect and you kernel gurus can educate me. What I know can happen is:

- MPI application allocs buffer A and gets virtual address B back, corresponding to physical address C
- MPI application calls MPI_SEND with A
- MPI implementation registers buffer A, caches that address B is registered, and then does the send
- MPI application frees buffer A
- MPI implementation does *NOT* unregister buffer A
- MPI application allocs buffer X and gets virtual address *B* back, corresponding to physical address Z (Z != C)
- MPI application calls MPI_SEND with X
- MPI implementation sees virtual address B in its cache and thinks that it is already registered... badness ensues

Note that the virtual addresses are the same, but the physical addresses are different. This can, and does, happen. It makes it impossible to tell the buffers apart in userspace -- MPI cannot tell that the new buffer is not already pinned (because according to MPI's internal cache, that virtual address *is* registered already). The only way to hack around this is for the MPI implementation to intercept free/sbrk/whatever (horrors!) so that it can a) know to unregister the buffer and b) remove the address from its "already registered" cache. A sketch of the problematic cache lookup is below.
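To make this concrete, here is a minimal sketch (hypothetical names, no error handling or eviction) of a userspace registration cache keyed only on the virtual address; the comment marks where the stale hit happens:

    /* Minimal sketch (hypothetical names) of a registration cache keyed
     * only on virtual address.  Eviction and error handling omitted. */
    #include <stdlib.h>
    #include <infiniband/verbs.h>

    #define CACHE_SLOTS 64

    struct reg_entry {
        void          *addr;  /* virtual address the buffer was registered at */
        size_t         len;
        struct ibv_mr *mr;    /* MR pinned to the pages that backed addr *then* */
    };

    static struct reg_entry cache[CACHE_SLOTS];

    static struct ibv_mr *get_mr(struct ibv_pd *pd, void *addr, size_t len)
    {
        for (int i = 0; i < CACHE_SLOTS; i++) {
            if (cache[i].mr && cache[i].addr == addr && cache[i].len >= len) {
                /* Cache hit on the virtual address alone.  If the application
                 * free()d this buffer and a later malloc() handed the same
                 * virtual address back on different physical pages, this MR
                 * still points at the old pages -- badness ensues. */
                return cache[i].mr;
            }
        }
        struct ibv_mr *mr = ibv_reg_mr(pd, addr, len, IBV_ACCESS_LOCAL_WRITE);
        for (int i = 0; mr && i < CACHE_SLOTS; i++) {
            if (!cache[i].mr) {
                cache[i].addr = addr;
                cache[i].len  = len;
                cache[i].mr   = mr;
                break;
            }
        }
        return mr;
    }

The free()/sbrk() interception hack mentioned above exists precisely to invalidate entries in a cache like this one.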

It's quite possible that I don't know exactly why this happens, or that I've stated the wrong reasons. But it definitely does happen.

> > Note that the above scenario occurs because before Linux kernel
> > v2.6.27, the OF kernel drivers are not notified when pages are
> > returned to the OS -- we're leaking registered memory, and therefore
> > the OF driver/hardware have the wrong virtual/physical mapping. It
> > *may* not segv at step 7 because the OF driver/hardware can still
> > access the memory and it is still registered. But it will definitely
> > be accessing the wrong physical memory.

> Well, the driver can register for callbacks when the mapping changes
> but most HCA drivers aren't going to be able to use it.
> The problem is that once a memory region is created, there is no way
> the driver knows when an incoming or outgoing DMA might try to
> reference that address.


Wouldn't it be an erroneous program that tried to use a region after free()'ing it?

> There would need to be a way to suspend DMAs,
> change the mapping, and then allow DMAs to continue.
> The CPU equivalent is a TLB flush after changing the page table memory.
>
> The whole area of page pinning, mapping, unmapping, etc. between the
> application, MPI library, OS, and driver is very complex and I don't
> think can be designed easily via email.


The conversation needs to start somewhere. MPI is verbs' biggest customer; this is a major pain point for all of us. Can't we fix it? Do you need something more than a specific use case and API proposal to start the conversation? No one has money to travel; the bi-weekly EWG call is for discussing bugs. What other vehicle do you suggest for this discussion?

I'd consider this issue to be among the top 3 roadblocks to verbs adoption for developers other than those of us who write MPI implementations.

> I wasn't at the Sonoma
> conference so I don't know what was discussed.


Only the problem was discussed. It was hypothesized that Pete Wyckoff's "tweak a bit in userspace when something changes" notifier interface would fix the problem, but per my mail, after more post-Sonoma discussion, we think that it's not sufficient.
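For those who weren't there, the idea is roughly of the following shape -- purely illustrative on my part, with hypothetical helper names, not the actual proposed interface: the kernel bumps a word that userspace can see whenever the process's mappings change, and the MPI library drops its registration cache when it notices the change.

    /* Illustrative sketch only -- hypothetical helpers, not the actual
     * proposed interface.  Assume the kernel increments a word visible
     * to userspace whenever this process's virtual-to-physical mappings
     * change. */
    #include <stdint.h>

    extern volatile uint32_t *map_mapping_event_word(void); /* hypothetical */
    extern void reg_cache_flush(void);                       /* hypothetical */

    static volatile uint32_t *event_word;
    static uint32_t           last_seen;

    static void reg_cache_revalidate(void)
    {
        if (!event_word)
            event_word = map_mapping_event_word();
        uint32_t now = *event_word;
        if (now != last_seen) {
            /* Some mapping changed since we last looked; none of the cached
             * registrations can be trusted, so drop them all. */
            reg_cache_flush();
            last_seen = now;
        }
    }

Calling something like reg_cache_revalidate() before each cache lookup is how an MPI implementation might consume such a notifier.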

My Sonoma slides are here:

    http://www.openfabrics.org/archives/spring2009sonoma/tuesday/panel3/panel3.zip
    http://www.openfabrics.org/archives/spring2009sonoma/wednesday/panel1/panel1.zip

The "ideal" from
MPI library perspective is to not have to worry about memory
registrations and have the HCA somehow share the user application's
page table, faulting in IB to physical address mappings as needed.
That involves quite a bit of hardware support as well as the changes
in 2.6.27.



Understood -- but as I stated in my mail, I assume that such a change is a long way off (particularly since it needs some kind of hardware support). Moving the registration cache down into the kernel seems doable. Why not try to tackle this [enormous] problem?

--
Jeff Squyres
Cisco Systems

_______________________________________________
general mailing list
[email protected]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

