On Fri, May 30, 2008 at 1:14 PM, Grant Ingersoll <[EMAIL PROTECTED]>
wrote:

>
>
> a) tree classifiers, notably random forests and some sort of boosted
>> decision tree
>>
>
> I think one of the GSOC's may be working on Random Forests.  Ian?


Excellent!


> c) better parallel matrix operations
>>
>
> Note, Hama has been accepted into incubation, so that is something to keep
> an eye on and evaluate.


I am dubious of the Hama approach, based as it is on Hbase.  For that level
of performance, we can just emit order-1 updates and depend on combiners to
make us whole.  For higher performance, I think we need to do much better
blocking and do high performance multiplies on map nodes.

For example, for the matrix product {A B}_ij = sum_k a_ik b_kj  we can
implement this using map reduce as a bigram counter.

If A and B are encoded as triples containing (m, row, column, value) where m
is 'a' or 'b' as appropriate, then we can do roughly the following.  This
may be fast enough for very sparse matrices, but might be very slow for
dense matrices:

    map1 = {key, value, out ->
          // *value is ['a', i, k, a_ik] or ['b', k, j, b_kj]*
          // *output record is keyed by k and has ['a', i, a_ik] or ['b', j,
b_kj] as value
          *// *in real life, the 'a' or 'b' would be inferred in the
configure method*
          if (value[0] == "a") out.collect(value[2], [value[0], value[1],
value[3]])
          else out.collect(value[1], [value[0], value[2], value[3]])
    }
    reducer1 = {key, values, out ->
          // *key is k, values contains items like ['a', i, a_ik] or ['b',
j, b_kj]*
          // *we build a nested map to simplify iterating over the cross
product*
          def v = [:]
          def v.a = [:]
          def v.b = [:]
          values.each {
               v[it[0]] [it[1]] = it[2]
          }
          v.a.each {i, a_ik ->
                v.b.each {j, b_kj ->
                     out.collect([i, j], a_jk * b_kj)
                }
          }
    }

    // *second mr adds up all of the products by reducing on final position*
    map2 = {key, ijv, out ->
        // *key is [i,j], value is a_ik * b_kj for some (now unknown) k*
        out.collect(ijv[0..1], ijv[2])
    }
    combiner2 = {ij, values, out ->
        out.collect(ij, values.sum())
    }
    reducer2 = combiner

Reply via email to