Sam Lang wrote:
On Aug 17, 2006, at 7:49 PM, Pete Wyckoff wrote:
[EMAIL PROTECTED] wrote on Thu, 17 Aug 2006 18:14 -0500:
* BMI memory allocation. Do we place any restrictions on when or how
frequently BMI_memalloc is called? In the pvfs code, we always call
BMI_memalloc for a post_send or post_recv. Would it be possible to
avoid the malloc on the client for a write and just use the user
buffer? Or should we mandate that calls to post_send and post_recv
always pass in a pointer from BMI_memalloc? (as a side note, if we
make that mandate, maybe we should have a BMI_buffer type that
memalloc returns and post_send/post_recv accept).
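Roughly what I have in mind for that side note, just as a sketch: the BMI_buffer struct and the _ex names below are made up for this mail and do not exist in BMI today.

#include <stdlib.h>

typedef struct
{
    void   *ptr;      /* the usable memory */
    size_t  size;     /* allocation size */
    void   *method;   /* per-method state, e.g. a registration handle */
} BMI_buffer;

static BMI_buffer *BMI_memalloc_ex(size_t size)
{
    BMI_buffer *b = malloc(sizeof(*b));
    if (!b)
        return NULL;
    b->ptr = malloc(size);
    if (!b->ptr)
    {
        free(b);
        return NULL;
    }
    b->size = size;
    b->method = NULL;   /* a method like bmi_ib could register/pin here */
    return b;
}

/* post_send/post_recv would accept the handle instead of a raw void*,
 * so the method knows every buffer it sees came from memalloc */
static int BMI_post_send_ex(const BMI_buffer *b, size_t size)
{
    return (size <= b->size) ? 0 : -1;   /* real code would queue the operation */
}
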
Both bmi_ib and bmi_gm define the BMI memalloc method to do
something other than simply malloc(). In the IB case, it pins the
memory early, and never unpins it until the corresponding
BMI_memfree() happens. This is better than letting BMI do the
pinning explicitly, as it moves some of the messaging work out of
the critical path, if you can arrange to alloc/free before you do
send/recv.
Note that these alloc routines only do something special if the
buffer is big enough to be "worth it" (8 kB for IB).
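Very roughly, the alloc side looks like this; it is only a sketch with made-up pin/unpin helpers, not the actual bmi_ib code.

#include <stdlib.h>

#define PIN_THRESHOLD (8 * 1024)   /* only pin buffers big enough to be "worth it" */

/* made-up stand-ins for the real memory registration calls */
static void *pin_region(void *buf, size_t len) { (void)len; return buf; }
static void  unpin_region(void *reg)           { (void)reg; }

struct alloc_entry
{
    void   *buf;
    size_t  len;
    void   *reg;   /* non-NULL only if we pinned at alloc time */
};

/* memalloc: pin big buffers up front so the registration cost stays
 * out of the send/recv critical path */
static struct alloc_entry *memalloc_sketch(size_t len)
{
    struct alloc_entry *e = malloc(sizeof(*e));
    if (!e)
        return NULL;
    e->buf = malloc(len);
    if (!e->buf)
    {
        free(e);
        return NULL;
    }
    e->len = len;
    e->reg = (len >= PIN_THRESHOLD) ? pin_region(e->buf, len) : NULL;
    return e;
}

/* memfree: the only place the pin is dropped */
static void memfree_sketch(struct alloc_entry *e)
{
    if (e->reg)
        unpin_region(e->reg);
    free(e->buf);
    free(e);
}
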
There's no restrictions on how frequently you can call these things.
Each pinned memory region has some overhead in terms of in-pvfs data
structures, in-kernel data structures, and on-NIC data structures.
Ideally we'd try to limit the growth of these things and force old
entries to be freed, but in practice they mostly just grow and it's
not a big problem (unless you have lots of pvfs apps on a single
box, for instance).
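If we ever wanted to cap it, even something this simple would limit the growth. This is only a sketch, nothing like it is in the tree, and real code would have to make sure the evicted region is not still in use by an outstanding operation.

#include <stddef.h>

#define MAX_PINNED 128   /* arbitrary cap for the sketch */

/* made-up stand-in for the real unregister call */
static void unpin_region(void *reg) { (void)reg; }

static void *pinned[MAX_PINNED];   /* registration handles, oldest first */
static int   num_pinned = 0;

/* remember a new registration; once we hit the cap, force the oldest
 * entry out so the per-region pvfs/kernel/NIC state stops growing */
static void remember_pin(void *reg)
{
    if (num_pinned == MAX_PINNED)
    {
        unpin_region(pinned[0]);
        for (int i = 1; i < MAX_PINNED; i++)
            pinned[i - 1] = pinned[i];
        num_pinned--;
    }
    pinned[num_pinned++] = reg;
}
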
You can certainly avoid the malloc and use the user buffer when you
have one instead. I think this is the common case for MPI-IO
operations. Point out what case you're talking about and I'll take
a look.
It looks like the mem_to_bmi code (client write) in flow always does a
memalloc for the intermediate buffer and then copies the user buffer
into that.
This is actually a corner case that you are looking at, not the default
behavior. There are two different buffer handling approaches in this
function:
/* was MAX_REGIONS enough to satisfy this step? */
if(!PINT_REQUEST_DONE(flow_data->parent->file_req_state) &&
   q_item->result_chain.result.bytes < flow_data->parent->buffer_size)
{
    /* create an intermediate buffer */
    <.... code snipped - this is where the BMI_memalloc() occurs >
}
else
{
    /* normal case */
    <.... code snipped - no BMI_memalloc() occurs here, and the existing buffer is used >
}
In the case where an intermediate buffer is used, we have detected that
the memory regions being accessed are so discontiguous that it would
take more than MAX_REGIONS (64) offsets and sizes to represent them in
this iteration (up to BUFFER_SIZE, normally 256K, per iteration).
Rather than build an arbitrarily long offset and size list to reach the
256K total that we want to transmit, this is the cutoff point at which
flow throws its hands up, makes a contiguous intermediate buffer, and
copies everything into that. In all other cases we use the existing
buffers rather than allocating something new.
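That intermediate buffer branch boils down to something like this. It is simplified a lot; the real code works on the flow queue item rather than flat arrays.

#include <stdlib.h>
#include <string.h>

/* Pack a scattered set of regions from the user's buffer into one
 * contiguous intermediate buffer, which is what the corner case above
 * does for a client write once the region count passes MAX_REGIONS. */
static void *pack_intermediate(const char *user_buf,
                               const size_t *offsets,
                               const size_t *sizes,
                               int count,
                               size_t *total_out)
{
    size_t total = 0;
    for (int i = 0; i < count; i++)
        total += sizes[i];

    char *tmp = malloc(total);      /* the real code uses BMI_memalloc() here */
    if (!tmp)
        return NULL;

    size_t off = 0;
    for (int i = 0; i < count; i++)
    {
        memcpy(tmp + off, user_buf + offsets[i], sizes[i]);
        off += sizes[i];
    }

    *total_out = total;
    return tmp;
}
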
On reads (bmi_to_mem), flow does use the client's buffer,
so I guess that's a case that doesn't do memalloc. I wonder if the
copy on a client write could be avoided as well though.
We do try to avoid the copy on write if possible, although the choice of
the cutoff point where we give up on list operations and start copying
is kind of arbitrary; I don't think anyone has ever tested to see if
that value makes sense. Modifying it also requires some coordination
with BMI and Trove, though, to make sure that they can handle list I/O
of up to MAX_REGIONS regions without breaking the lists apart.
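For a sense of scale, 256K split across 64 regions is a 4K average region size before the copy path kicks in. Raising MAX_REGIONS without touching BMI and Trove would mean splitting the lists back up somewhere, roughly like this; post_list() below is a made-up stand-in, not the real BMI or Trove interface.

#include <stddef.h>

#define MAX_REGIONS 64   /* what BMI and Trove are currently expected to accept per call */

/* made-up stand-in for a BMI/Trove list operation */
static int post_list(void *const *buffers, const size_t *sizes, int count)
{
    (void)buffers;
    (void)sizes;
    return (count <= MAX_REGIONS) ? 0 : -1;
}

/* If flow handed down longer lists, either BMI and Trove would have to
 * accept them directly or a shim like this would break them back into
 * MAX_REGIONS-sized pieces, which is the kind of coordination mentioned
 * above. */
static int post_list_chunked(void *const *buffers, const size_t *sizes, int count)
{
    for (int i = 0; i < count; i += MAX_REGIONS)
    {
        int n = (count - i < MAX_REGIONS) ? (count - i) : MAX_REGIONS;
        int ret = post_list(&buffers[i], &sizes[i], n);
        if (ret != 0)
            return ret;
    }
    return 0;
}
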
-Phil
_______________________________________________
Pvfs2-developers mailing list
Pvfs2-developers@beowulf-underground.org
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers