Re: [OMPI devel] Inherent limit on #communicators?

Edgar Gabriel Thu, 30 Apr 2009 15:04:26 -0400

so I agree that we need to fix that, and we'll get a fix for that as soon as possible. It still strikes me as wrong, however, that we have fundamentally different types on two layers for the same 'item'.

I still think that going back to the original algorithm would be bad, especially for an application that creates such a large number of communicators and potentially runs on a large number (1000s) of processors. I'll look into how to reuse an entire block of communicator cids and how to take the max_contextid into account.
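A minimal sketch of what reusing whole cid blocks while respecting the PML limit could look like. This is not the Open MPI implementation: `cid_block_pool_t`, `pool_init`, `pool_acquire_block`, `pool_release_block`, `BLOCK_SIZE` and `MAX_BLOCKS` are all hypothetical names and values chosen for illustration, and the agreement between processes on which block to reuse is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define BLOCK_SIZE 8u   /* hypothetical number of cids per block */
#define MAX_BLOCKS 64u  /* hypothetical cap on remembered free blocks */

/* Hypothetical bookkeeping for block-wise cid allocation with reuse. */
typedef struct {
    unsigned max_contextid;            /* largest cid the PML can handle */
    unsigned next_fresh;               /* start of the next never-used block */
    unsigned free_blocks[MAX_BLOCKS];  /* start cids of released blocks */
    size_t   n_free;                   /* number of entries in free_blocks */
} cid_block_pool_t;

static void pool_init(cid_block_pool_t *p, unsigned max_contextid)
{
    assert(p != NULL);
    p->max_contextid = max_contextid;
    p->next_fresh = 0;
    p->n_free = 0;
}

/* Hand out a block: prefer a released block, otherwise take a fresh one,
 * but never one whose last cid would exceed max_contextid. */
static bool pool_acquire_block(cid_block_pool_t *p, unsigned *start)
{
    assert(p != NULL && start != NULL);
    if (p->n_free > 0) {
        *start = p->free_blocks[--p->n_free];
        return true;
    }
    if (p->next_fresh > p->max_contextid ||
        p->max_contextid - p->next_fresh < BLOCK_SIZE - 1) {
        return false;  /* cid space exhausted for this PML */
    }
    *start = p->next_fresh;
    p->next_fresh += BLOCK_SIZE;
    return true;
}

/* Return a block once every communicator using its cids has been freed. */
static bool pool_release_block(cid_block_pool_t *p, unsigned start)
{
    assert(p != NULL);
    if (p->n_free == MAX_BLOCKS) {
        return false;  /* no room to remember this block */
    }
    p->free_blocks[p->n_free++] = start;
    return true;
}
```

The extra bookkeeping is the list of released blocks; the bound check is what keeps a fresh block from running past max_contextid.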


Edgar

Brian W. Barrett wrote:

On Thu, 30 Apr 2009, Edgar Gabriel wrote:
Brian W. Barrett wrote:
When we added the CM PML, we added a pml_max_contextid field to the PML structure, which is the max size cid the PML can handle (because the matching interfaces don't allow 32 bits to be used for the cid). At the same time, the max cid for OB1 was shrunk significantly, so that the header on a short message would be packed tightly with no alignment padding.
At the time, we believed 32k simultaneous communicators was plenty, and that CIDs were reused (we checked, I'm pretty sure). It sounds like someone removed the CID reuse code, which seems rather bad to me.
yes, we added the block algorithm. Not reusing a CID actually doesn't strike me as that dramatic, and I am still not convinced about that :-) We do not have an empty array or something like that, it's just a number.
The reason for the block algorithm was that the performance of our communicator creation code sucked, and the cid allocation was one portion of that. We used to require *at least* 4 collective operations per communicator creation at that time. We are now potentially down to 0, thanks among other things to the block algorithm.
However, let me think about reusing entire blocks; it's probably doable, it just requires a little more bookkeeping...
There have to be unused CIDs in Ralph's example - is there a way to fall back out of the block algorithm when it can't find a new CID and find one it can reuse? Other than turning the multi-threaded case back on, that is?
remember that it's not the communicator id allocation that is failing at this point, so the question is: do we have to 'validate' a cid with the pml before we declare it to be ok?
well, that's only because the code's doing something it shouldn't. Have a look at comm_cid.c:185 - there's the check we added to the multi-threaded case (which was the only case when we added it). The cid generation should never generate a number larger than mca_pml.pml_max_contextid. I'm actually somewhat amazed this fails gracefully, as OB1 doesn't appear to check it got a valid cid in add_comm, which it should probably do.
Looking at the differences between v1.2 and v1.3, the max_contextid code was already in v1.2 and OB1 was setting it to 32k. So the cid blocking code removed a rather critical feature and probably should be fixed or removed for v1.3. On Portals, I only get 8k cids, so not having reuse is a rather large problem.
Brian


--
Edgar Gabriel
Assistant Professor
Parallel Software Technologies Lab      http://pstl.cs.uh.edu
Department of Computer Science          University of Houston
Philip G. Hoffman Hall, Room 524        Houston, TX-77204, USA
Tel: +1 (713) 743-3857                  Fax: +1 (713) 743-3335
