Sam Lang wrote:
On Aug 17, 2006, at 7:49 PM, Pete Wyckoff wrote:
[EMAIL PROTECTED] wrote on Thu, 17 Aug 2006 18:14 -0500:
* BMI memory allocation. Do we place any restrictions on when or how
frequently BMI_memalloc is called? In the pvfs code, we always call
BMI_memalloc for a post_send or post_recv. Would it be possible to
avoid the malloc on the client for a write and just use the user
buffer? Or should we mandate that calls to post_send and post_recv
always pass in a pointer from BMI_memalloc? (as a side note, if we
make that mandate, maybe we should have a BMI_buffer type that
memalloc returns and post_send/post_recv accept).
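Roughly what I have in mind for that side note, just as a sketch: the BMI_buffer struct and the _ex names below are made up for this mail and do not exist in BMI today.

#include <stdlib.h>

typedef struct
{
    void   *ptr;      /* the usable memory */
    size_t  size;     /* allocation size */
    void   *method;   /* per-method state, e.g. a registration handle */
} BMI_buffer;

static BMI_buffer *BMI_memalloc_ex(size_t size)
{
    BMI_buffer *b = malloc(sizeof(*b));
    if (!b)
        return NULL;
    b->ptr = malloc(size);
    if (!b->ptr)
    {
        free(b);
        return NULL;
    }
    b->size = size;
    b->method = NULL;   /* a method like bmi_ib could register/pin here */
    return b;
}

/* post_send/post_recv would accept the handle instead of a raw void*,
 * so the method knows every buffer it sees came from memalloc */
static int BMI_post_send_ex(const BMI_buffer *b, size_t size)
{
    return (size <= b->size) ? 0 : -1;   /* real code would queue the operation */
}
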
Both bmi_ib and bmi_gm define the BMI memalloc method to do
something other than simply malloc(). In the IB case, it pins the
memory early, and never unpins it until the corresponding
BMI_memfree() happens. This is better than letting BMI do the
pinning explicitly, as it moves some of the messaging work out of
the critical path, if you can arrange to alloc/free before you do
send/recv.
Note that these alloc routines only do something special if the
buffer is big enough to be "worth it" (8 kB for IB).
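Very roughly, the alloc side looks like this; it is only a sketch with made-up pin/unpin helpers, not the actual bmi_ib code.

#include <stdlib.h>

#define PIN_THRESHOLD (8 * 1024)   /* only pin buffers big enough to be "worth it" */

/* made-up stand-ins for the real memory registration calls */
static void *pin_region(void *buf, size_t len) { (void)len; return buf; }
static void  unpin_region(void *reg)           { (void)reg; }

struct alloc_entry
{
    void   *buf;
    size_t  len;
    void   *reg;   /* non-NULL only if we pinned at alloc time */
};

/* memalloc: pin big buffers up front so the registration cost stays
 * out of the send/recv critical path */
static struct alloc_entry *memalloc_sketch(size_t len)
{
    struct alloc_entry *e = malloc(sizeof(*e));
    if (!e)
        return NULL;
    e->buf = malloc(len);
    if (!e->buf)
    {
        free(e);
        return NULL;
    }
    e->len = len;
    e->reg = (len >= PIN_THRESHOLD) ? pin_region(e->buf, len) : NULL;
    return e;
}

/* memfree: the only place the pin is dropped */
static void memfree_sketch(struct alloc_entry *e)
{
    if (e->reg)
        unpin_region(e->reg);
    free(e->buf);
    free(e);
}
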
There's no restrictions on how frequently you can call these things.
Each pinned memory region has some overhead in terms of in-pvfs data
structures, in-kernel data structures, and on-NIC data structures.
Ideally we'd try to limit the growth of these things and force old
entries to be freed, but in practice they mostly just grow and it's
not a big problem (unless you have lots of pvfs apps on a single
box, for instance).
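If we ever wanted to cap it, even something this simple would limit the growth. This is only a sketch, nothing like it is in the tree, and real code would have to make sure the evicted region is not still in use by an outstanding operation.

#include <stddef.h>

#define MAX_PINNED 128   /* arbitrary cap for the sketch */

/* made-up stand-in for the real unregister call */
static void unpin_region(void *reg) { (void)reg; }

static void *pinned[MAX_PINNED];   /* registration handles, oldest first */
static int   num_pinned = 0;

/* remember a new registration; once we hit the cap, force the oldest
 * entry out so the per-region pvfs/kernel/NIC state stops growing */
static void remember_pin(void *reg)
{
    if (num_pinned == MAX_PINNED)
    {
        unpin_region(pinned[0]);
        for (int i = 1; i < MAX_PINNED; i++)
            pinned[i - 1] = pinned[i];
        num_pinned--;
    }
    pinned[num_pinned++] = reg;
}
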
You can certainly avoid the malloc and use the user buffer when you
have one instead. I think this is the common case for MPI-IO
operations. Point out what case you're talking about and I'll take
a look.
It looks like the mem_to_bmi code (client write) in flow always does a
memalloc for the intermediate buffer and then copies the user buffer
into that.
This is actually a corner case that you are looking at, not the default
behavior. There are two different buffer handling approaches in this
function:
/* was MAX_REGIONS enough to satisfy this step? */
if(!PINT_REQUEST_DONE(flow_data->parent->file_req_state) &&
   q_item->result_chain.result.bytes < flow_data->parent->buffer_size)
{
    /* create an intermediate buffer */
    <.... code snipped - this is where the BMI_memalloc() occurs >
}
else
{
    /* normal case */
    <.... code snipped - no BMI_memalloc() occurs here, and the existing buffer is used >
}
In the case where an intermediate buffer is used, we have detected that
the memory regions being accessed are so discontiguous that it would
take more than MAX_REGIONS (64) offsets and sizes to represent them in
this iteration (up to BUFFER_SIZE, normally 256K, per iteration).
Rather than build an arbitrarily long offset and size list to reach the
256K total that we want to transmit, this is the cutoff point at which
flow throws its hands up, makes a contiguous intermediate buffer, and
copies everything into that. In all other cases we use the existing
buffers rather than allocating something new.
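That intermediate buffer branch boils down to something like this. It is simplified a lot; the real code works on the flow queue item rather than flat arrays.

#include <stdlib.h>
#include <string.h>

/* Pack a scattered set of regions from the user's buffer into one
 * contiguous intermediate buffer, which is what the corner case above
 * does for a client write once the region count passes MAX_REGIONS. */
static void *pack_intermediate(const char *user_buf,
                               const size_t *offsets,
                               const size_t *sizes,
                               int count,
                               size_t *total_out)
{
    size_t total = 0;
    for (int i = 0; i < count; i++)
        total += sizes[i];

    char *tmp = malloc(total);      /* the real code uses BMI_memalloc() here */
    if (!tmp)
        return NULL;

    size_t off = 0;
    for (int i = 0; i < count; i++)
    {
        memcpy(tmp + off, user_buf + offsets[i], sizes[i]);
        off += sizes[i];
    }

    *total_out = total;
    return tmp;
}
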
On reads (bmi_to_mem), flow does use the client's buffer,
so I guess that's a case that doesn't do memalloc. I wonder if the
copy on a client write could be avoided as well though.
We do try to avoid the copy on write if possible, although the choice of
the cutoff point where we give up on list operations and start copying
is kind of arbitrary; I don't think anyone has ever tested to see if
that value makes sense. Modifying it also requires some coordination
with BMI and Trove, though, to make sure that they can handle list I/O
of up to MAX_REGIONS regions without breaking the lists apart.
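For a sense of scale, 256K split across 64 regions is a 4K average region size before the copy path kicks in. Raising MAX_REGIONS without touching BMI and Trove would mean splitting the lists back up somewhere, roughly like this; post_list() below is a made-up stand-in, not the real BMI or Trove interface.

#include <stddef.h>

#define MAX_REGIONS 64   /* what BMI and Trove are currently expected to accept per call */

/* made-up stand-in for a BMI/Trove list operation */
static int post_list(void *const *buffers, const size_t *sizes, int count)
{
    (void)buffers;
    (void)sizes;
    return (count <= MAX_REGIONS) ? 0 : -1;
}

/* If flow handed down longer lists, either BMI and Trove would have to
 * accept them directly or a shim like this would break them back into
 * MAX_REGIONS-sized pieces, which is the kind of coordination mentioned
 * above. */
static int post_list_chunked(void *const *buffers, const size_t *sizes, int count)
{
    for (int i = 0; i < count; i += MAX_REGIONS)
    {
        int n = (count - i < MAX_REGIONS) ? (count - i) : MAX_REGIONS;
        int ret = post_list(&buffers[i], &sizes[i], n);
        if (ret != 0)
            return ret;
    }
    return 0;
}
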
-Phil
_______________________________________________
Pvfs2-developers mailing list
Pvfs2-developers@beowulf-underground.org
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers