I am looking at large dense matrix multiplication as an example problem for a class of middleware. I am also interested in sparse matrices, but am taking things one step at a time.
There is a paper in IEEE CloudCom '10 about Hama that includes a matrix multiplication technique. It is essentially the same as what is called "technique 4" in the 2009 monograph by John Norstad cited early in this thread. This means that, although Hama touts the virtues of BSP (a position with which I am very sympathetic), this technique doesn't really take advantage of the extra features that BSP has over MapReduce.

Note also that this technique creates intermediate data of much greater volume than the input. For example, if each matrix is stored as an NxN grid of blocks, the intermediate data (the blocks paired up, awaiting multiplication) is a factor of N larger than the input. I have heard people say that N may be considerably larger than sqrt(number of machines), because in some circumstances N has to be chosen before the number of available machines is known, and you want to be able to divide the NxN load among your machines fairly evenly. Even if N is close to sqrt(number of machines), that is still an unwelcome amount of bloat.

In comparison, the SUMMA technique does matrix multiplication with intermediate data volume no greater than the input.

Thanks,
Mike
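To make the factor-of-N blowup concrete, here is a small counting sketch (hypothetical illustration, not Hama's actual code): in the replication-based scheme, each output block C[i][j] needs row i of A's blocks paired with column j of B's blocks, so every input block gets emitted N times into the intermediate data.

```python
def technique4_intermediate_blocks(n):
    """Count block copies emitted as intermediate data by the
    replication-based scheme ("technique 4") on an NxN block grid:
    each pair (A[i][k], B[k][j]) is materialized for every (i, j, k)."""
    emitted = 0
    for i in range(n):          # row index of the result block
        for j in range(n):      # column index of the result block
            for k in range(n):  # inner block index of the dot product
                emitted += 2    # one copy of an A block, one of a B block
    return emitted

def input_blocks(n):
    return 2 * n * n  # two input matrices, each an NxN grid of blocks

n = 8
ratio = technique4_intermediate_blocks(n) / input_blocks(n)
print(ratio)  # the blowup factor equals n, here 8.0
```

The count is 2*N^3 intermediate block copies against 2*N^2 input blocks, hence the factor of N. SUMMA avoids this by broadcasting one block-column of A and one block-row of B per step and accumulating into C in place, so only O(input) data is live at any time.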