Hi Chris,
This requires some finesse, but you can do it all with 3 map/reduce
steps. The third step requires that you do a join, which is the tricky
part. Basically, the operation looks like this:
Map/Reduce 1: pairwise sums
Map: <A,B> => <A,B>, 1>
Reduce: <<A, B>, 1, 1, 1...> => <<A, B>, sum(A, B)>
Map/Reduce 2: sums for each variable
Map: <<A, B>, sum(A, B)> => <A, sum(A, B)>, <B, sum(A, B)>
Reduce: <A, sum(A, B1), sum(A, B2) ...> => <A, sum(A, *)>
Map/Reduce 3: join and conditional probability
Map: <<A, B>, sum(A, B)> => <A, sum(A, B)>, <B, sum(A, B)>
...and also <A, sum(A, *)> => <A, sum(A, *)>
Reduce: <A, <A, sum(A, *)>, <A, sum(A, B1)>, <A, sum(A, B2)>...> => <A,
sum(A, B1)/sum(A, *)>, <A, sum(A, B2)/sum(A, *)>...
Note in the 3rd step that you're taking input from two different files
with different formats and outputting two different message value
types. When I do work like step 3, I tend to set all keys and values as
Text, and parse the messages with a custom Mapper and Reducer. Hadoop
doesn't have good support for different message types or for inputs from
multiple files with different formats, so you'll have to roll it yourself.
Given that, this type of operation works -extremely well- with Hadoop
once you get the message formats worked out. When Pig grows up a bit,
it should be great for this kind of work.
-Colin
Chris Dyer wrote:
Hi all--
I'm new to using Hadoop so I'm hoping to get a little guidance on what
the best way to solve a particular class of problems would be.
The general use case is this: from a very small set of data, I will
generate a massive set of pairs of values, ie, <A,B>. I would like to
compute the maximum likelihood estimate (MLE) of the conditional
probability P(A|B). However, it is very obvious to me how to compute
the counts of C(<A,B>) and equally obvious how to compute the counts
C(<A,*>) or C(<*,B>), but what I need is: C(<A,B>)/C(<*,B>).
My approach:
My initial approach to the decomposition of this problem is to use a
mapper to go from my input data to <A,B> pairs, and then a reducer to
go for <A,B> pairs to counts C(A,B). However, at that point, I'd like
a second reducer like thing (call it Normalize) to run which
aggregates all the C(*,B) pairs and returns a value P(A|B) for each A
that occurs with B. This is where things get fuzzy for me. How do I
do this? A reducer can only return a single value (for example, if I
make B the key for Normalize it could return C(B) very easily). What
I need is a value type that reduce can return that is essential a list
of (key,value) pairs. Does such a thing exist? Am I approaching this
the wrong way?
Thanks for any assistance!
Chris
------------------------------------------
Chris Dyer
Dept. of Linguistics
University of Maryland
1401 Marie Mount Hall
College Park MD 20742