On Fri, May 30, 2008 at 1:14 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
>> a) tree classifiers, notably random forests and some sort of boosted
>> decision tree
>
> I think one of the GSOC's may be working on Random Forests.  Ian?

Excellent!

>> c) better parallel matrix operations
>
> Note, Hama has been accepted into incubation, so that is something to keep
> an eye on and evaluate.

I am dubious of the Hama approach, based as it is on HBase.

For that level of performance, we can just emit order-1 updates and depend
on combiners to make us whole.  For higher performance, I think we need to
do much better blocking and do high-performance multiplies on the map
nodes.

For example, the matrix product

    (A B)_ij = sum_k a_ik b_kj

can be implemented using map-reduce as a bigram counter.  If A and B are
encoded as triples (m, row, column, value), where m is 'a' or 'b' as
appropriate, then we can do roughly the following.  This may be fast
enough for very sparse matrices, but might be very slow for dense
matrices:

map1 = {key, value, out ->
  // value is ['a', i, k, a_ik] or ['b', k, j, b_kj]
  // output record is keyed by k and has ['a', i, a_ik] or ['b', j, b_kj]
  // as value; in real life, the 'a' or 'b' would be inferred in the
  // configure method
  if (value[0] == "a")
    out.collect(value[2], [value[0], value[1], value[3]])
  else
    out.collect(value[1], [value[0], value[2], value[3]])
}

reducer1 = {key, values, out ->
  // key is k, values contains items like ['a', i, a_ik] or ['b', j, b_kj]
  // we build a nested map to simplify iterating over the cross product
  def v = [a: [:], b: [:]]
  values.each { v[it[0]][it[1]] = it[2] }
  v.a.each {i, a_ik ->
    v.b.each {j, b_kj ->
      out.collect([i, j], a_ik * b_kj)
    }
  }
}

// the second map-reduce pass adds up all of the partial products by
// reducing on the final (i, j) position

map2 = {key, ijv, out ->
  // ijv is [i, j, a_ik * b_kj] for some (now unknown) k
  out.collect(ijv[0..1], ijv[2])
}

combiner2 = {ij, values, out ->
  out.collect(ij, values.sum())
}

reducer2 = combiner2
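For concreteness, here is a small self-contained Groovy sketch of the same
two-pass dataflow, run in memory rather than on Hadoop, with groupBy
standing in for the shuffle.  The 2x2 matrices and all the names here are
made-up test data for illustration, not anything that exists in Mahout:

// A = [[1, 2], [3, 4]] and B = [[5, 6], [7, 8]] as (m, row, column, value)
def triples = [
  ['a', 0, 0, 1], ['a', 0, 1, 2], ['a', 1, 0, 3], ['a', 1, 1, 4],
  ['b', 0, 0, 5], ['b', 0, 1, 6], ['b', 1, 0, 7], ['b', 1, 1, 8]
]

// pass 1 map: re-key every triple by k, keeping (m, other index, value)
def keyedByK = triples.collect { t ->
  t[0] == 'a' ? [t[2], [t[0], t[1], t[3]]] : [t[1], [t[0], t[2], t[3]]]
}

// pass 1 reduce: for each k, emit the cross product a_ik * b_kj keyed by [i, j]
def products = []
keyedByK.groupBy { it[0] }.each { k, records ->
  def v = [a: [:], b: [:]]
  records.each { v[it[1][0]][it[1][1]] = it[1][2] }
  v.a.each { i, a_ik ->
    v.b.each { j, b_kj ->
      products << [[i, j], a_ik * b_kj]
    }
  }
}

// pass 2: sum the partial products for each (i, j) position
def c = products.groupBy { it[0] }.collectEntries { ij, vals ->
  [ij, vals.sum { it[1] }]
}

assert c[[0, 0]] == 1 * 5 + 2 * 7   // 19
assert c[[0, 1]] == 1 * 6 + 2 * 8   // 22
assert c[[1, 0]] == 3 * 5 + 4 * 7   // 43
assert c[[1, 1]] == 3 * 6 + 4 * 8   // 50
println c

Note that reducer2 and combiner2 can be the same closure only because
addition is associative and commutative; the same property is what makes
the emit-order-1-updates-and-rely-on-combiners approach above workable at
all.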