I should have mentioned that high-performance matrix multiplies almost always come down to fairly simple reorderings of the loops that follow directly from the definition of matrix multiplication.
For example, C = A' * B can be written as

    for (i in 1..n) {
      for (j in 1..n) {
        c[i,j] = 0
        for (k in 1..n) {
          c[i,j] += a[k,i] * b[k,j]
        }
      }
    }

But this can also be turned inside out like this:

    for (i in 1..n) {
      for (j in 1..n) {
        c[i,j] = 0
      }
    }
    for (k in 1..n) {
      for (i in 1..n) {
        for (j in 1..n) {
          c[i,j] += a[k,i] * b[k,j]
        }
      }
    }

This is roughly what I was talking about in the first option for map-reduce
(although I may have gotten columns and rows backwards).

On 3/4/08 5:31 PM, "Ted Dunning" <[EMAIL PROTECTED]> wrote:

> Jeff is right that it really depends on the application.
>
> The area where I have done the most with map-reduce multiplication is
> cooccurrence counting on large data sets.
>
> If we talk about documents that contain terms (as opposed to users who
> view videos, ornithologists who observe bird species, or any other
> equivalent dyadic system), then we can represent our observations as a
> document x term matrix A. For lots of analyses, it is interesting to
> compute a term x term cooccurrence matrix A' * A. When we have millions
> of documents and millions of terms, this is daunting on a single machine.
>
> In even more general terms, we might have a parallel corpus of documents
> where there are two versions of each document in different languages. If
> we say that the document x term matrices for the two languages are A and
> B, then the cross-language cooccurrence matrix A' * B gives us rough
> insight into which terms commonly translate into which terms in the
> other language. Chris Dyer at Maryland talked a while ago about doing
> this using Hadoop, and Miles Osborne at Edinburgh is currently doing
> this as well.
>
> For these kinds of problems, it is common to store the matrices A and B
> in column-compressed sparse representations. This lets us quickly find
> all of the documents a term appears in, but makes it expensive to find
> the terms in a single document.
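For concreteness, the two loop orderings above transcribe directly into plain Python (no libraries; a, b, and n are just toy values I made up), and both produce the same C = A' * B:

```python
n = 3
# Toy matrices; note the transposed indexing a[k][i] because C = A' * B.
a = [[1, 2, 0], [0, 1, 3], [2, 0, 1]]
b = [[1, 0, 2], [3, 1, 0], [0, 2, 1]]

# First ordering: compute each c[i][j] completely with an inner k loop.
c1 = [[0] * n for _ in range(n)]
for i in range(n):
    for j in range(n):
        for k in range(n):
            c1[i][j] += a[k][i] * b[k][j]

# Second ordering, turned inside out: zero C first, then accumulate one
# rank-1 update per k. Same arithmetic, different memory access pattern.
c2 = [[0] * n for _ in range(n)]
for k in range(n):
    for i in range(n):
        for j in range(n):
            c2[i][j] += a[k][i] * b[k][j]

assert c1 == c2  # both orderings give the same product
```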
> For this particular form of multiplication, where the left multiplicand
> is transposed, this uniform representation turns out pretty nicely
> because iterating down a row of A' is the same as iterating down a
> column of A.
>
> In general, computing A' * B requires O(n^3) arithmetic operations and
> O(n^2) storage accesses. The same asymptotic costs apply for sparse
> matrices (if sparsity is constant) as well as for most parallel
> implementations. Since arithmetic operations are SOOO much faster than
> moving data over the network, it behooves us to arrange our computation
> so that we re-use data as much as possible. If we don't, we wind up with
> O(n^3) storage accesses, and our program will be hundreds to thousands
> of times slower than we might otherwise like.
>
> The simplest distributed algorithm for A' * A simply explodes the
> product by sending partial products with a key corresponding to the
> location in the result to which they contribute. These partial products
> are then summed by the reducers, with one reduce call per element of the
> result. This requires that each column of A be examined and all non-zero
> products from that column be sent out. It works reasonably well and is
> pretty easy to code. One common extra requirement is that rows of the
> result be normalized to sum to one (for probabilistic estimates) or be
> reduced by a statistical test. The rough pseudo-code for the map step in
> the basic multiplication, without the fancy extra steps, is this:
>
>   { j, column, out, reporter ->
>     column.foreachNonZero { i1, x1 ->
>       column.foreachNonZero { i2, x2 ->
>         out.collect([i1, i2], x1 * x2)
>       }
>     }
>   }
>
> This mapper is applied in parallel to every column of A (= row of A').
>
> Reduce and combine are very simple and just add up the results:
>
>   { key, values, out, reporter ->
>     out.collect(key, values.sum())
>   }
>
> This algorithm is the ultimate in block decomposition because the blocks
> are as small as possible (1x1).
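To make the quoted 1x1-block algorithm concrete, here is a small Python simulation (toy data; a list and a dict stand in for Hadoop's collectors, and this is not actual Mahout/Hadoop code). Following the note above that rows and columns may be swapped in the pseudo-code, this sketch maps over sparse rows of A (one per document) and emits cross products keyed by the (term1, term2) position they contribute to in A' * A:

```python
from collections import defaultdict

# Toy document x term matrix A as sparse rows: {term index: count}.
rows_of_a = [
    {0: 1.0, 1: 2.0},   # document 0
    {1: 1.0, 2: 3.0},   # document 1
]

# Map: for every document, emit all non-zero cross products of its
# entries, keyed by the position they contribute to in A' * A.
emitted = []
for row in rows_of_a:
    for t1, x1 in row.items():
        for t2, x2 in row.items():
            emitted.append(((t1, t2), x1 * x2))

# Combine/reduce: one sum per key, i.e. one reduce call per non-zero
# element of the result.
product = defaultdict(float)
for key, value in emitted:
    product[key] += value

# product now holds the non-zero entries of A' * A, e.g.
# product[(1, 1)] == 2*2 + 1*1 == 5.0.
```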
> Usually with matrix multiplication, though, it helps to use larger
> sub-blocks so that we can re-use some of the operands while they are
> still in registers. This helps us avoid some of the O(n^3) data traffic
> that we would otherwise incur. Also, small records tend to incur more
> data-transfer overhead than larger ones. The larger blocks need not be
> square; they can just as well be tall and skinny.
>
> One potential algorithm that limits data communication and at the same
> time generalizes the code above to A' * B is to pass A to all mappers
> and then map over columns of B. The mapper then becomes:
>
>   { i, columnOfB, out, reporter ->
>     out.collect(i, A' * columnOfB)
>   }
>
> The output of each map is a row of the final output, so the only thing
> the reducer needs to do is concatenate rows.
>
> There are other block decompositions possible, of course.
>
> (Grant, the Groovy notation here is just for YOU)
>
> On 3/4/08 4:33 PM, "Jeff Eastman" <[EMAIL PROTECTED]> wrote:
>
>> The current implementation is obviously not distributed, and operations
>> only apply to the in-memory subsets of what may or may not be larger
>> matrix/vector computations. These can still be quite large, of course.
>> To go beyond this simple view requires a more global representation and
>> something like BigTable or HBase support.
>>
>> This matrix implementation, for example, would be quite sufficient to
>> do 4x4 graphics transformations of billions of image points, but not to
>> invert a billion-element matrix. In the first example, each mapper
>> would transform a subset of the points in parallel. In the second
>> example, there would be significant inter-mapper communication
>> required.
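Going back to the broadcast-A variant of A' * B quoted further above, here is a small Python sketch of it (toy matrices, plain lists; not Hadoop code). Each map call computes A' times one column of B, i.e. one column of A' * B (equivalently, one row of B' * A), and the "reduce" step just concatenates the results in key order:

```python
# Toy data: A is 3 x 2 and B is 3 x 2, so A' * B is 2 x 2.
# A is assumed to be broadcast to (available inside) every mapper.
a = [[1, 0], [2, 1], [0, 3]]
b = [[1, 2], [0, 1], [4, 0]]

def map_column(i, col_of_b):
    # Compute A' * col_of_b: entry r is the dot product of column r of A
    # (= row r of A') with the given column of B.
    result = [sum(a[k][r] * col_of_b[k] for k in range(len(a)))
              for r in range(len(a[0]))]
    return i, result

# "Map" phase: one call per column of B.
mapped = [map_column(i, [row[i] for row in b]) for i in range(len(b[0]))]

# "Reduce" phase: concatenate the slices of the product in key order.
product_columns = [col for _, col in sorted(mapped)]
assert product_columns == [[1, 12], [4, 1]]  # columns of A' * B
```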
>> I bet Ted knows how to approach that one :)
>>
>> Jeff
>>
>> -----Original Message-----
>> From: Grant Ingersoll [mailto:[EMAIL PROTECTED]
>> Sent: Tuesday, March 04, 2008 4:16 PM
>> To: mahout-dev@lucene.apache.org
>> Subject: Re: [jira] Updated: (MAHOUT-6) Need a matrix implementation
>>
>> One question I have is how you see this being distributed? Does each
>> Mapper/Reducer create its own Matrix based on some subset of the data
>> it is working on? Or is there a notion of what I gather BigTable does
>> (I haven't used BigTable), where you have the notion of a whole Matrix?
>> I realize this is not implemented; I'm just wondering your thoughts on
>> how all this works in a distributed sense.
>>
>> -Grant
>>
>> On Mar 3, 2008, at 4:14 PM, Jeff Eastman (JIRA) wrote:
>>
>>> [ https://issues.apache.org/jira/browse/MAHOUT-6?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>>>
>>> Jeff Eastman updated MAHOUT-6:
>>> ------------------------------
>>>
>>> Attachment: MAHOUT-6j.diff
>>>
>>> Sorted out the two patches and added back my Vector unit tests that
>>> fell out due to an oversight. Couldn't resist adding a cross()
>>> operation on vectors, but I'm going to do something else for a while
>>> to let people review this and let the dust settle. Once we get
>>> vectors into trunk I will update the clustering code to use them.
>>>
>>>> Need a matrix implementation
>>>> ----------------------------
>>>>
>>>> Key: MAHOUT-6
>>>> URL: https://issues.apache.org/jira/browse/MAHOUT-6
>>>> Project: Mahout
>>>> Issue Type: New Feature
>>>> Reporter: Ted Dunning
>>>> Attachments: MAHOUT-6a.diff, MAHOUT-6b.diff, MAHOUT-6c.diff,
>>>> MAHOUT-6d.diff, MAHOUT-6e.diff, MAHOUT-6f.diff, MAHOUT-6g.diff,
>>>> MAHOUT-6h.patch, MAHOUT-6i.diff, MAHOUT-6j.diff
>>>>
>>>> We need matrices for Mahout.
>>>> An initial set of basic requirements includes:
>>>> a) sparse and dense support are required
>>>> b) row and column labels are important
>>>> c) serialization for Hadoop use is required
>>>> d) reasonable floating point performance is required, but awesome
>>>>    FP is not
>>>> e) the API should be simple enough to understand
>>>> f) it should be easy to carve out sub-matrices for sending to
>>>>    different reducers
>>>> g) a reasonable set of matrix operations should be supported; these
>>>>    should eventually include:
>>>>    simple matrix-matrix, matrix-vector, and matrix-scalar linear
>>>>    algebra operations: A B, A + B, A v, A + x, v + x, u + v, dot(u, v)
>>>>    row and column sums
>>>>    generalized level 2 and 3 BLAS primitives: alpha A B + beta C
>>>>    and alpha A u + beta v
>>>> h) easy and efficient iteration constructs, especially for sparse
>>>>    matrices
>>>> i) easy to extend with new implementations
>>>
>>> --
>>> This message is automatically generated by JIRA.
>>> -
>>> You can reply to this email to add a comment to the issue online.