George,

I found the info I think you were referring to. Thanks. I then experimented, more or less at random, with the different algorithms for allreduce, but the really bad performance for certain message sizes persisted with v1.1. The good news is that upgrading to v1.2 fixed my worst problem: performance is now reasonable for all message sizes. I will test the tuned algorithms again as soon as I can.
I had a couple of questions:

1) ompi_info lists only 3 or 4 algorithms for allreduce and reduce, and about 5 for bcast, but higher numbers are accepted as well. Are these additional undocumented algorithms (you mentioned a number like 15), or is it ignoring out-of-range parameters?

2) It seems that for allreduce you can select a tuned reduce and a tuned bcast instead of the binary tree. But there is a faster allreduce which is order 2N rather than the 4N of Reduce + Bcast (N is the message size). It segments the vector and distributes the root among the nodes; in an allreduce there is no need to gather the root vector to one processor and then scatter it again. I wrote a simple version for power-of-2 process counts (MPI_SUM only). Any chance of it being implemented in OMPI?

Tony
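
To make question 2 concrete, here is a minimal sketch of the segmented allreduce idea: a reduce-scatter by recursive halving followed by an allgather by recursive doubling. It is simulated in plain Python rather than MPI so that it runs on one machine. This is not the version mentioned above and not Open MPI code. The function name simulated_allreduce_sum, the in-memory "ranks", and the restrictions (power-of-two rank count, vector length divisible by the rank count, sum only) are assumptions made purely for illustration. The sketch also counts how many elements each rank sends, which under these assumptions comes to 2N(P-1)/P, so just under 2N.

```python
def simulated_allreduce_sum(vectors):
    """Simulate a segmented allreduce (sum) over P = len(vectors) ranks.

    Hypothetical helper for illustration only; no MPI calls are made.
    Assumes P is a power of two and the vector length N is divisible by P.
    Returns (results, sent): results[r] is rank r's final vector and
    sent[r] is the number of elements rank r sent in total.
    """
    p = len(vectors)
    if p == 0 or (p & (p - 1)) != 0:
        raise ValueError("rank count must be a power of two")
    n = len(vectors[0])
    if n % p != 0:
        raise ValueError("vector length must be divisible by rank count")

    data = [list(v) for v in vectors]
    sent = [0] * p
    lo = [0] * p   # each rank's active range is [lo, hi)
    hi = [n] * p

    # Phase 1: reduce-scatter by recursive halving.
    # Each step, a rank keeps one half of its active range, sends the
    # other half to its partner, and adds the partner's contribution.
    dist = p // 2
    while dist >= 1:
        snapshot = [list(d) for d in data]  # emulate a simultaneous exchange
        for r in range(p):
            partner = r ^ dist
            mid = (lo[r] + hi[r]) // 2
            if (r & dist) == 0:
                keep, give = (lo[r], mid), (mid, hi[r])
            else:
                keep, give = (mid, hi[r]), (lo[r], mid)
            sent[r] += give[1] - give[0]
            for i in range(keep[0], keep[1]):
                data[r][i] = snapshot[r][i] + snapshot[partner][i]
            lo[r], hi[r] = keep
        dist //= 2

    # Now rank r holds the fully reduced segment r (N/P elements).

    # Phase 2: allgather by recursive doubling.
    # Each step, a rank swaps everything it holds with its partner,
    # doubling the size of its reduced range.
    dist = 1
    while dist < p:
        snapshot = [list(d) for d in data]
        old = list(zip(lo, hi))
        for r in range(p):
            plo, phi = old[r ^ dist]
            sent[r] += old[r][1] - old[r][0]
            data[r][plo:phi] = snapshot[r ^ dist][plo:phi]
            lo[r], hi[r] = min(old[r][0], plo), max(old[r][1], phi)
        dist *= 2

    return data, sent


if __name__ == "__main__":
    ranks = [[r * 10 + i for i in range(8)] for r in range(4)]
    results, sent = simulated_allreduce_sum(ranks)
    print(results[0])
    print(sent)
```

No rank ever holds the whole reduced vector until the final allgather step, which is the point made above about not gathering to a single root and scattering again. A real implementation would replace the snapshot copies with paired send/receive calls between partners.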