On Feb 25, 2009, at 10:36 , Eugene Loh wrote:
George Bosilca wrote:
On Feb 24, 2009, at 18:08 , Eugene Loh wrote:
(Probably this message only for George, but I'll toss it out to
the alias/archive.)
Actually, maybe Rich should weigh in here, too. This relates to the
overflow mechanism in MCA_BTL_SM_FIFO_WRITE.
I have a question about the sm sendi() function. What should
happen if the sendi() function attempts to write to the FIFO, but
the FIFO is full?
The write should not be queued except in the case where all of the
data referred to by the convertor has been copied out of the user memory.
And this is indeed the case. The data-convertor copy completed
successfully.
If the FIFO is full, the best approach would be to allocate a descriptor
and give it back to the PML.
Why? The data has been copied out of the user's buffer. The
pointer to that data has been queued for sending. (It hasn't been
queued in the FIFO, which is full, but it has been queued in the
pending-send list.)
As I previously stated, if the data is copied out of the user buffer,
then sendi should always return success. However, having a queue in the
BTL only duplicates the queue in the PML.
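To make the two self-consistent contracts concrete, here is a minimal
standalone sketch (all names here are hypothetical, not the actual BTL
interface): either sendi consumes the user data and reports success, or
it declines without touching the user buffer so that only the PML does
the queuing. Mixing the two, consuming the data while still reporting an
error, is what causes the trouble discussed below.

  /* Standalone sketch, not the real OMPI code; every name here is
   * hypothetical.  It contrasts the two consistent behaviors:
   * consume the data and report success, or decline without touching
   * the user buffer and let the PML do the queuing. */
  #include <stdbool.h>
  #include <stdio.h>
  #include <string.h>

  #define RC_SUCCESS       0
  #define RC_OUT_OF_RES   -1

  typedef struct { char data[64]; size_t len; } frag_t;

  static bool fifo_has_room(void)   { return false; }  /* pretend the FIFO is full */
  static void fifo_write(frag_t *f) { (void)f; }

  /* Contract A: the copy-out happened, so the send is buffered somewhere;
   * report success. */
  static int sendi_consume_and_succeed(const char *buf, size_t len, frag_t *f)
  {
      if (len > sizeof(f->data))
          return RC_OUT_OF_RES;          /* too big for an inline fragment */
      memcpy(f->data, buf, len);
      f->len = len;
      if (fifo_has_room())
          fifo_write(f);
      /* else: the fragment would be parked internally for later progress */
      return RC_SUCCESS;
  }

  /* Contract B: nothing was consumed; report the error and let the PML
   * queue and retry with its own mechanisms. */
  static int sendi_decline(const char *buf, size_t len, frag_t *f)
  {
      if (len > sizeof(f->data) || !fifo_has_room())
          return RC_OUT_OF_RES;          /* user buffer untouched; PML will queue/retry */
      memcpy(f->data, buf, len);         /* room available: behave like contract A */
      f->len = len;
      fifo_write(f);
      return RC_SUCCESS;
  }

  int main(void)
  {
      frag_t f;
      printf("contract A returns %d\n", sendi_consume_and_succeed("hi", 2, &f));
      printf("contract B returns %d\n", sendi_decline("hi", 2, &f));
      return 0;
  }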
The FIFO has an overflow mechanism. Actually, prior to my recent
putbacks, it had two overflow mechanisms. One was to grow the FIFO,
and the other was to use the pending-send queue. While adding
support for multiple senders per FIFO, and at Rich's suggestion, I
pulled out the ability to grow the FIFO. (Some folks didn't even
believe that the FIFO-grow code existed, was enabled, or worked
properly.) That still leaves the pending sends.
So, the "out of resource" return code from the FIFO write is kind of
spurious. The FIFO write is returning that code even though it has
accepted the write and queued it up.
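A toy model of that overflow behavior (the names are made up; this is
not the real btl_sm_fifo.h macro): when the fixed-size FIFO is full, the
write falls back to a pending-send list, so the fragment is still queued
for later progress, yet the caller is told the write failed.

  /* Toy model of the overflow mechanism described above, with invented
   * names.  A full FIFO diverts the fragment to a pending-send list,
   * so the message is still queued, but the return code claims the
   * resource was exhausted. */
  #include <stdio.h>

  #define FIFO_DEPTH       4
  #define RC_SUCCESS       0
  #define RC_OUT_OF_RES   -1

  static void *fifo[FIFO_DEPTH];
  static int   fifo_count;
  static void *pending[16];
  static int   pending_count;

  static int fifo_write_with_overflow(void *frag)
  {
      if (fifo_count < FIFO_DEPTH) {
          fifo[fifo_count++] = frag;       /* normal path: a slot is free */
          return RC_SUCCESS;
      }
      pending[pending_count++] = frag;     /* overflow path: still queued for progress */
      return RC_OUT_OF_RES;                /* but the caller is told the write failed */
  }

  int main(void)
  {
      int frags[6];
      for (int i = 0; i < 6; i++)
          printf("write %d -> %d\n", i, fifo_write_with_overflow(&frags[i]));
      printf("in FIFO: %d, on pending list: %d\n", fifo_count, pending_count);
      return 0;
  }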
Currently, it appears that the sendi() function returns an error
code to the PML, which assumes that sendi() tried to send the
message but failed, and so it just tries to allocate a descriptor.
Yes, this is the expected behavior.
But is that what should happen? The condition of the FIFO being
full is a little misleading since the write is still queued for
further progress -- not in the FIFO itself but in the pending-
send queue. This distinction should perhaps not matter to the
upper layers. The upper layers should still view the send as
"completed" (buffered by the MPI implementation to be progressed
later). I would think that the sendi() function should return a
SUCCESS code.
If the write is queued, then this is more or less a bug. We will
cope with this case nicely, because we have the sequence number
and will drop the duplicate message, but we will end up sending
the same message twice. The problem is that I don't know which of
the copies will be used on the receiver side; I guess the first
one to reach the receiver.
Arrgh! When the primary mechanism (FIFO) starts getting congested,
we start pumping duplicate messages into the system?
If the BTL queues the send internally and returns an error, then the
PML will:
- go back into mca_pml_ob1_send_request_start with the error set to
OUT_OF_RESOURCE,
- continue over the list of available BTLs for the eager protocol and
try to send the same message again, and
- in the case where no more BTLs are available, add the request to
the pending queue and reschedule it later.
So the answer is yes: if a BTL returns an error while adding the data
to its own queues, then we will duplicate the send operation.
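A rough standalone model of that retry path (the names are invented;
this is not ob1 itself): a BTL that buffers the fragment internally but
still returns an error causes the PML to retry the same message on the
next BTL, so two copies end up in flight.

  /* Rough model of the retry loop described above, with invented names.
   * The first BTL queues the message but reports failure anyway, so the
   * PML sends the same message again through the second BTL. */
  #include <stdio.h>

  #define RC_SUCCESS       0
  #define RC_OUT_OF_RES   -1

  static int copies_in_flight;

  static int buggy_btl_sendi(void) { copies_in_flight++; return RC_OUT_OF_RES; } /* queues, then "fails" */
  static int other_btl_sendi(void) { copies_in_flight++; return RC_SUCCESS; }    /* plain success */

  typedef int (*btl_sendi_fn)(void);

  static void pml_start_send(btl_sendi_fn *btls, int nbtls)
  {
      for (int i = 0; i < nbtls; i++) {
          if (btls[i]() == RC_SUCCESS)
              return;                        /* eager send done */
          /* on error, fall through and try the next BTL with the same data */
      }
      /* no BTL left: the request would go on a pending queue here */
  }

  int main(void)
  {
      btl_sendi_fn btls[] = { buggy_btl_sendi, other_btl_sendi };
      pml_start_send(btls, 2);
      printf("copies of the message in flight: %d\n", copies_in_flight);  /* prints 2 */
      return 0;
  }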
george.
The proper fix (IMHO) is to have the sendi function return a SUCCESS
code once it's written the message and the pointer to the message.
And, once it's written those two things, it seems to me to be a bug
to return any other code.
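A minimal sketch of that fix, assuming hypothetical names rather than
the real sm BTL code: copy the data, try the FIFO, fall back to the
pending-send list when the FIFO is full, and return SUCCESS either way,
since the user buffer is no longer needed once the copy is done.

  /* Minimal sketch of the proposed fix; these names are hypothetical and
   * this is not the actual sm BTL.  Once the data has been copied into
   * an internal fragment, the send is buffered regardless of whether the
   * fragment landed in the FIFO or on the pending-send list. */
  #include <stdbool.h>
  #include <stdio.h>
  #include <string.h>

  #define RC_SUCCESS       0
  #define RC_OUT_OF_RES   -1

  typedef struct { char payload[64]; size_t len; } frag_t;

  static bool fifo_push(frag_t *f)    { (void)f; return false; }  /* pretend the FIFO is full */
  static void pending_push(frag_t *f) { (void)f; }                /* park the fragment for later progress */

  static int model_sendi(const char *user_buf, size_t len, frag_t *f)
  {
      if (len > sizeof(f->payload))
          return RC_OUT_OF_RES;           /* too big to inline; a descriptor would be used instead */
      memcpy(f->payload, user_buf, len);  /* stands in for the convertor copy-out */
      f->len = len;
      if (!fifo_push(f))                  /* FIFO full ... */
          pending_push(f);                /* ... but the send is still queued for progress */
      return RC_SUCCESS;                  /* data is out of the user buffer either way */
  }

  int main(void)
  {
      frag_t f;
      printf("sendi -> %d\n", model_sendi("hello", 5, &f));
      return 0;
  }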
Relevant source code:
PML, line 496:
https://svn.open-mpi.org/source/xref/ompi_1.3/ompi/mca/pml/ob1/pml_ob1_sendreq.c#496
BTL, line 785:
https://svn.open-mpi.org/source/xref/ompi_1.3/ompi/mca/btl/sm/btl_sm.c#785
FIFO write, line 18:
https://svn.open-mpi.org/opengrok/xref/ompi_1.3/ompi/mca/btl/sm/btl_sm_fifo.h#18