Re: computing conditional probabilities with Hadoop?

Colin Evans Thu, 27 Sep 2007 14:41:15 -0700

Hi Chris,

This requires some finesse, but you can do it all with 3 map/reducesteps. The third step requires that you do a join, which is the trickypart. Basically, the operation looks like this:


Map/Reduce 1: pairwise sums
Map: <A,B> => <A,B>, 1>
Reduce: <<A, B>, 1, 1, 1...> => <<A, B>, sum(A, B)>

Map/Reduce 2: sums for each variable
Map: <<A, B>, sum(A, B)> => <A, sum(A, B)>, <B, sum(A, B)>
Reduce: <A, sum(A, B1), sum(A, B2) ...> => <A, sum(A, *)>

Map/Reduce 3: join and conditional probability
Map: <<A, B>, sum(A, B)> => <A, sum(A, B)>, <B, sum(A, B)>
...and also <A, sum(A, *)> => <A, sum(A, *)>

Reduce: <A, <A, sum(A, *)>, <A, sum(A, B1)>, <A, sum(A, B2)>...> => <A,sum(A, B1)/sum(A, *)>, <A, sum(A, B2)/sum(A, *)>...

Note in the 3rd step that you're taking input from two different fileswith different formats and outputting two different message valuetypes. When I do work like step 3, I tend to set all keys and values asText, and parse the messages with a custom Mapper and Reducer. Hadoopdoesn't have good support for different message types or for inputs frommultiple files with different formats, so you'll have to roll it yourself.

Given that, this type of operation works -extremely well- with Hadooponce you get the message formats worked out. When Pig grows up a bit,it should be great for this kind of work.


-Colin




Chris Dyer wrote:

Hi all--
I'm new to using Hadoop so I'm hoping to get a little guidance on what
the best way to solve a particular class of problems would be.

The general use case is this: from a very small set of data, I will
generate a massive set of pairs of values, ie, <A,B>.  I would like to
compute the maximum likelihood estimate (MLE) of the conditional
probability P(A|B).  However, it is very obvious to me how to compute
the counts of C(<A,B>) and equally obvious how to compute the counts
C(<A,*>) or C(<*,B>), but what I need is: C(<A,B>)/C(<*,B>).

My approach:

My initial approach to the decomposition of this problem is to use a
mapper to go from my input data to <A,B> pairs, and then a reducer to
go for <A,B> pairs to counts C(A,B).  However, at that point, I'd like
a second reducer like thing (call it Normalize) to run which
aggregates all the C(*,B) pairs and returns a value P(A|B) for each A
that occurs with B.  This is where things get fuzzy for me.  How do I
do this?  A reducer can only return a single value (for example, if I
make B the key for Normalize it could return C(B) very easily).  What
I need is a value type that reduce can return that is essential a list
of (key,value) pairs.  Does such a thing exist?  Am I approaching this
the wrong way?

Thanks for any assistance!
Chris

------------------------------------------
Chris Dyer
Dept. of Linguistics
University of Maryland
1401 Marie Mount Hall
College Park MD 20742

Re: computing conditional probabilities with Hadoop?

Reply via email to