>        The code was pretty simple. I was trying to send 8 MB of data from
>    one rank to another in a loop (say 1000 iterations). I then took the
>    average of the time taken and calculated the bandwidth.
>    I tried the above logic with mpirun both with MCA parameters and without
>    any parameters. To my surprise, the performance degraded when I tried
>    to manipulate the parameters.
That sounds strange. Did you re-use the communication buffers? Have you
tried running an existing benchmark such as NetPIPE [1], IMB, or
Netgauge [2]?
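
For comparison, here is a minimal sketch (plain Python, not MPI code) of the
arithmetic you describe: average the per-iteration times and divide the
message size by that average. The function name, the warmup parameter, and the
sample timings are all hypothetical and only for illustration; I am also
assuming 1 MB = 2**20 bytes. Skipping a few warm-up iterations is my own
suggestion, since the first transfers may include one-time costs that skew
the average.

```python
def average_bandwidth(message_bytes, iteration_times, warmup=0):
    """Return bandwidth in MB/s (1 MB = 2**20 bytes).

    iteration_times holds one wall-clock time in seconds per transfer;
    the first `warmup` entries are discarded before averaging.
    """
    timed = iteration_times[warmup:]
    if not timed:
        raise ValueError("no timed iterations left after warm-up")
    mean_time = sum(timed) / len(timed)
    return message_bytes / mean_time / 2**20


MESSAGE_BYTES = 8 * 2**20  # the 8 MB message from the description

# Hypothetical timings: the first transfer is slower than the rest.
times = [0.020, 0.010, 0.010, 0.010]

print("all iterations:  ", average_bandwidth(MESSAGE_BYTES, times), "MB/s")
print("without warm-up: ", average_bandwidth(MESSAGE_BYTES, times, warmup=1), "MB/s")
```

If the two numbers differ a lot in your runs, the average is dominated by
start-up effects rather than by the steady-state transfer rate.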

>    Now I have another question. Is it possible to have an IB hardware
>    multicast implementation in Open MPI? I have gone through the
>    issues/challenges for the same, but I have also read about a couple of
>    people who have successfully done it for Ethernet/Gigabit Ethernet and
>    IPoIB, of course at an experimental stage. Actually, I want to
>    contribute this to Open MPI and need help with that.
As far as I know, there are two groups/people working on this. Andy
Friedley is implementing a "traditional" ACK-based approach (like the one
that the OSU folks published some time ago), and I implemented a
new idea for extreme scale (see "A practically constant-time MPI
Broadcast Algorithm for large-scale InfiniBand Clusters with
Multicast" [3]). I know that my version is still unstable and has some
problems, but I'm working on it.


[1]: http://www.scl.ameslab.gov/netpipe/
[2]: http://www.unixer.de/research/netgauge/
[3]: https://www.unixer.de/publications/#hoefler-cac07

