Re: incremental re-execution

Shirley Cohen Mon, 21 Apr 2008 13:07:13 -0700

Hi Ted,

Thanks for your example. It's very interesting to learn about specific map reduce applications.

It's non-obvious to me that it's a good idea to combine two map-reduce pairs by using the cross product of the intermediate states - you might wind up building an O(n^2) intermediate data structure instead of two O(n) ones. Even with parallelism this is not good. I'm wondering if in your example you're relying on the fact that the viewer-video matrix is sparse, so many of the pairs will have value 0? Does the map phase emit intermediate results with 0-values?


Thanks,

Shirley

Take something like what we see in our logs of viewing. We have several log entries per view, each of which contains an identifier for the viewer and for the video. These events occur on start, on progress and on completion. We want to have total views per viewer and total views per video. You can pass over the logs twice to get this data or you can pass over the data once to get total views per (viewer x video). This last is a semi-aggregated form that has no utility except that it is much smaller than the original data. Reducing the semi-aggregated form to viewer counts and video counts results in shorter total processing than processing the raw data twice.

If you start with a program that has two map-reduce passes over the same data, it is likely very difficult to intuit that they could use the same intermediate data. Even with something like Pig, where you have a good representation for internal optimizations, it is probably going to be difficult to convert the two MR steps into one pre-aggregation and two final aggregations.
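
The pre-aggregation described above can be sketched in a few lines of plain Python. This is only an illustration, not Hadoop code: the record layout (viewer, video, event), the sample data, the function names, and the rule that one "start" event counts as one view are all assumptions made for the example. It also shows that only pairs that actually occur in the log get a key, so the semi-aggregated form never has more entries than the log has records and no 0-valued pairs are produced.

```python
from collections import Counter

# Hypothetical log records: (viewer_id, video_id, event).
# Assumption for this sketch: one "start" event corresponds to one view.
LOG = [
    ("alice", "v1", "start"),
    ("alice", "v1", "progress"),
    ("alice", "v1", "complete"),
    ("alice", "v2", "start"),
    ("bob", "v1", "start"),
    ("bob", "v1", "progress"),
    ("alice", "v1", "start"),
]


def pre_aggregate(log):
    """Single pass over the raw log: views per (viewer, video) pair."""
    pairs = Counter()
    for viewer, video, event in log:
        if event == "start":
            pairs[(viewer, video)] += 1
    return pairs


def totals_by(pairs, index):
    """Roll pair counts up by one key component (0 = viewer, 1 = video)."""
    totals = Counter()
    for key, count in pairs.items():
        totals[key[index]] += count
    return totals


# One pass over the raw data, then two cheap passes over the smaller form.
pair_counts = pre_aggregate(LOG)
views_per_viewer = totals_by(pair_counts, 0)
views_per_video = totals_by(pair_counts, 1)

# Only observed pairs are stored, so the intermediate form is sparse.
assert len(pair_counts) <= len(LOG)
```

Both final aggregations read pair_counts rather than LOG, which is where the saving comes from when the raw log is much larger than the set of distinct (viewer, video) pairs.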


On 4/20/08 7:39 AM, "Shirley Cohen" <[EMAIL PROTECTED]> wrote:
Hi Ted,

I'm confused about your second comment below: in the case where
semi-aggregated data is used to produce multiple low-level aggregates,
what sorts of detection did you have in mind which would be hard to do?
Thanks,

Shirley

On Apr 16, 2008, at 7:30 PM, Ted Dunning wrote:
I re-use outputs of MR programs pretty often, but when I need to reuse the
map output, I just manually break the process apart into a
map+identity-reducer and the multiple reducers.  This is rare.
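
As a rough illustration of the map+identity-reducer split mentioned here, the following plain-Python sketch runs the map step once, stores the grouped output via an identity reducer, and then feeds that stored output to two different reducers. It does not use the Hadoop API; the helper names (run_map, run_reduce, identity_reducer) and the sample records are hypothetical and exist only for this example.

```python
from collections import defaultdict


def run_map(records, mapper):
    """Apply the mapper and group emitted values by key (map + shuffle)."""
    grouped = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            grouped[key].append(value)
    return dict(grouped)


def run_reduce(grouped, reducer):
    """Call the reducer once per key and collect its (key, result) pairs."""
    return dict(reducer(key, values) for key, values in sorted(grouped.items()))


def identity_reducer(key, values):
    """Pass grouped map output through unchanged so it can be stored."""
    return key, list(values)


def viewer_video_mapper(record):
    """Hypothetical mapper: emit (viewer, video) for each view record."""
    viewer, video = record
    yield viewer, video


records = [("alice", "v1"), ("alice", "v2"), ("bob", "v1"), ("alice", "v1")]

# Job 1: map + identity reducer. The result is the reusable map output.
stored = run_reduce(run_map(records, viewer_video_mapper), identity_reducer)

# Jobs 2 and 3: different reducers over the same stored output, no re-mapping.
total_views = run_reduce(stored, lambda k, vs: (k, len(vs)))
distinct_videos = run_reduce(stored, lambda k, vs: (k, len(set(vs))))
```

The map step runs only once; each additional reducer costs one pass over the stored output rather than another pass over the original records.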

It is common to have a semi-aggregated form that is much smaller than the
original data which in turn can be used to produce multiple low definition
aggregates.  I would find it very surprising if you could detect these
sorts of situations.


On 4/16/08 5:26 PM, "Shirley Cohen" <[EMAIL PROTECTED]> wrote:
Dear Hadoop Users,

I'm writing to find out what you think about being able to
incrementally re-execute a map reduce job. My understanding is that
the current framework doesn't support it and I'd like to know
whether, in your opinion, having this capability could help to speed
up development and debugging.

My specific questions are:
1) Do you have to re-run a job often enough that it would be valuable
to incrementally re-run it?

2) Would it be helpful to save the output from a whole bunch of
mappers and then try to detect whether this output can be re-used
when a new job is launched?

3) Would it be helpful to be able to use the output from a map job on
many reducers?

Please let me know what your thoughts are and what specific
applications you are working on.

Much appreciation,

Shirley
