Hi Ted,

Thanks for your example. It's very interesting to learn about specific map reduce applications.

It's non-obvious to me that it's a good idea to combine two map-reduce pairs by using the cross product of the intermediate states: you might wind up building an O(n^2) intermediate data structure instead of two O(n) ones. Even with parallelism this is not good. I'm wondering if in your example you're relying on the fact that the viewer-video matrix is sparse, so many of the pairs will have value 0? Does the map phase emit intermediate results with 0-values?

Thanks,

Shirley


Take something like what we see in our viewing logs. We have several log entries per view, each of which contains an identifier for the viewer and for the video. These events occur on start, on progress and on completion. We want to have total views per viewer and total views per video. You can pass over the logs twice to get this data, or you can pass over the data once to get total views per (viewer x video). This last is a semi-aggregated form that has no utility except that it is much smaller than the original data. Reducing the semi-aggregated form to viewer counts and video counts results in shorter total processing than processing the raw data twice.
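
A minimal sketch of that idea in plain Python rather than Hadoop code. The record layout, the sample data and the rule "one view per start event" are assumptions made only for illustration, and `pre_aggregate` and `totals` are hypothetical names. Only (viewer, video) pairs that actually occur in the logs get a key, so the semi-aggregated form can never have more entries than there are log records.

```python
from collections import Counter

# Hypothetical log records: (viewer_id, video_id, event_type).
LOG = [
    ("alice", "v1", "start"),
    ("alice", "v1", "progress"),
    ("alice", "v1", "complete"),
    ("bob", "v1", "start"),
    ("bob", "v2", "start"),
    ("bob", "v2", "complete"),
]

def pre_aggregate(records):
    """Single pass over the raw log: views per (viewer, video) pair."""
    pair_counts = Counter()
    for viewer, video, event in records:
        # Assumption: a view is counted once, on its "start" event.
        if event == "start":
            pair_counts[(viewer, video)] += 1
    return pair_counts

def totals(pair_counts):
    """Two final aggregations, both fed by the semi-aggregated form."""
    per_viewer = Counter()
    per_video = Counter()
    for (viewer, video), n in pair_counts.items():
        per_viewer[viewer] += n
        per_video[video] += n
    return per_viewer, per_video

pair_counts = pre_aggregate(LOG)
per_viewer, per_video = totals(pair_counts)
```

The raw log is read once; both final counts are computed from `pair_counts`, which holds one entry per observed pair and no zero-valued entries.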

If you start with a program that has two map-reduce passes over the same data, it is likely very difficult to intuit that they could use the same intermediate data. Even with something like Pig, where you have a good
representation for internal optimizations, it is probably going to be
difficult to convert the two MR steps into one pre-aggregation and two final
aggregations.


On 4/20/08 7:39 AM, "Shirley Cohen" <[EMAIL PROTECTED]> wrote:

Hi Ted,

I'm confused about your second comment below: in the case where
semi-aggregated data is used to produce multiple low-level aggregates,
what sorts of detection did you have in mind which would be hard to do?

Thanks,

Shirley

On Apr 16, 2008, at 7:30 PM, Ted Dunning wrote:


I re-use outputs of MR programs pretty often, but when I need to reuse
the map output, I just manually break the process apart into a
map+identity-reducer and the multiple reducers.  This is rare.
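
A rough in-memory sketch of that split, in plain Python and not using the Hadoop API. The function names, the "viewer video" line format and the sorted list standing in for stored map output are all assumptions for illustration.

```python
from collections import defaultdict

def map_phase(lines):
    """Map step: parse each hypothetical 'viewer video' line into a pair."""
    for line in lines:
        viewer, video = line.split()
        yield viewer, video

def identity_reduce(pairs):
    """Identity reducer: pass map output through unchanged and keep it.

    The sorted list is a stand-in for map output written to storage.
    """
    return sorted(pairs)

def count_by(pairs, key_index):
    """A later reducer: count stored pairs grouped by one field."""
    counts = defaultdict(int)
    for pair in pairs:
        counts[pair[key_index]] += 1
    return dict(counts)

lines = ["alice v1", "bob v1", "bob v2"]
stored = identity_reduce(map_phase(lines))  # map runs once
by_viewer = count_by(stored, 0)             # first reducer over stored output
by_video = count_by(stored, 1)              # second reducer, no re-mapping
```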

It is common to have a semi-aggregated form that is much smaller than
the original data which in turn can be used to produce multiple low
definition aggregates.  I would find it very surprising if you could
detect these sorts of situations.


On 4/16/08 5:26 PM, "Shirley Cohen" <[EMAIL PROTECTED]> wrote:

Dear Hadoop Users,

I'm writing to find out what you think about being able to
incrementally re-execute a map reduce job. My understanding is that
the current framework doesn't support it and I'd like to know
whether, in your opinion, having this capability could help to speed
up development and debugging.

My specific questions are:

1) Do you have to re-run a job often enough that it would be valuable
to incrementally re-run it?

2) Would it be helpful to save the output from a whole bunch of
mappers and then try to detect whether this output can be re-used
when a new job is launched?

3) Would it be helpful to be able to use the output from a map job on
many reducers?

Please let me know what your thoughts are and what specific
applications you are working on.

Much appreciation,

Shirley



