Hi Ted,
Thanks for your example. It's very interesting to learn about
specific map reduce applications.
It's not obvious to me that it's a good idea to combine two map-reduce
pairs by using the cross product of the intermediate states: you might
wind up building an O(n^2) intermediate data structure instead of two
O(n) ones. Even with parallelism this is not good. I'm
wondering if in your example you're relying on the fact that the
viewer-video matrix is sparse, so many of the pairs will have value
0? Does the map phase emit intermediate results with 0-values?
Thanks,
Shirley
Take something like what we see in our logs of viewing. We have several
log entries per view, each of which contains an identifier for the
viewer and for the video. These events occur on start, on progress and
on completion. We want to have total views per viewer and total views
per video. You can pass over the logs twice to get this data, or you can
pass over the data once to get total views per (viewer x video). This
last is a semi-aggregated form that has no utility except that it is
much smaller than the original data. Reducing the semi-aggregated form
to viewer counts and video counts results in shorter total processing
than processing the raw data twice.
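
The pre-aggregation described above can be sketched in plain Python (this is not Hadoop code). The log layout, the names LOG, pre_aggregate and final_aggregates, and the rule that one "start" event counts as one view are all assumptions made for illustration; the original message does not say how a view is counted.

```python
from collections import Counter

# Hypothetical log entries: (viewer_id, video_id, event).
# Assumption: each view produces several entries, and exactly one "start".
LOG = [
    ("alice", "v1", "start"),
    ("alice", "v1", "progress"),
    ("alice", "v1", "complete"),
    ("alice", "v2", "start"),
    ("bob", "v1", "start"),
    ("bob", "v1", "complete"),
]


def pre_aggregate(log):
    """Single pass over the raw log: views per (viewer, video) pair."""
    pair_counts = Counter()
    for viewer, video, event in log:
        if event == "start":  # assumed definition of "one view"
            pair_counts[(viewer, video)] += 1
    return pair_counts


def final_aggregates(pair_counts):
    """Two cheap aggregations over the much smaller pair table."""
    per_viewer = Counter()
    per_video = Counter()
    for (viewer, video), n in pair_counts.items():
        per_viewer[viewer] += n
        per_video[video] += n
    return per_viewer, per_video


pairs = pre_aggregate(LOG)
per_viewer, per_video = final_aggregates(pairs)
```

In this sketch the intermediate table holds only the (viewer, video) pairs that actually occur in the log; pairs with a count of zero are never materialized.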
If you start with a program that has two map-reduce passes over the same
data, it is likely very difficult to intuit that they could use the same
intermediate data. Even with something like Pig, where you have a good
representation for internal optimizations, it is probably going to be
difficult to convert the two MR steps into one pre-aggregation and two
final aggregations.
On 4/20/08 7:39 AM, "Shirley Cohen" <[EMAIL PROTECTED]> wrote:
Hi Ted,
I'm confused about your second comment below: in the case where semi-
aggregated data is used to produce multiple low-level aggregates,
what sorts of detection did you have in mind which would be hard
to do?
Thanks,
Shirley
On Apr 16, 2008, at 7:30 PM, Ted Dunning wrote:
I re-use outputs of MR programs pretty often, but when I need to reuse
the map output, I just manually break the process apart into a
map+identity-reducer and the multiple reducers. This is rare.
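
As a rough illustration of that manual split, here is a plain-Python simulation (not the Hadoop API). The helper names run_map, identity_reduce and run_reduce, and the (viewer, seconds watched) records, are hypothetical: the map output is materialized once by an identity step and then fed to two different reducers.

```python
from collections import defaultdict


def run_map(records, mapper):
    """Apply the mapper to every record and collect the (key, value) pairs."""
    out = []
    for record in records:
        out.extend(mapper(record))
    return out


def identity_reduce(pairs):
    """Pass the map output through unchanged so it can be saved and reused."""
    return list(pairs)


def run_reduce(pairs, reducer):
    """Group values by key, then apply the reducer to each group."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


# Hypothetical input: (viewer, seconds watched).
records = [("alice", 30), ("bob", 10), ("alice", 50)]

# Job 1: map + identity reducer; the result is the reusable map output.
saved = identity_reduce(run_map(records, lambda r: [(r[0], r[1])]))

# Jobs 2 and 3: different reducers over the same saved map output.
totals = run_reduce(saved, lambda key, values: sum(values))
longest = run_reduce(saved, lambda key, values: max(values))
```

The mapper runs once; only the grouping and reducing are repeated for each extra reducer.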
It is common to have a semi-aggregated form that is much smaller than
the original data, which in turn can be used to produce multiple low
definition aggregates. I would find it very surprising if you could
detect these sorts of situations.
On 4/16/08 5:26 PM, "Shirley Cohen" <[EMAIL PROTECTED]> wrote:
Dear Hadoop Users,
I'm writing to find out what you think about being able to
incrementally re-execute a map reduce job. My understanding is that
the current framework doesn't support it and I'd like to know
whether, in your opinion, having this capability could help to speed
up development and debugging.
My specific questions are:
1) Do you have to re-run a job often enough that it would be valuable
to incrementally re-run it?
2) Would it be helpful to save the output from a whole bunch of
mappers and then try to detect whether this output can be re-used
when a new job is launched?
3) Would it be helpful to be able to use the output from a map job on
many reducers?
Please let me know what your thoughts are and what specific
applications you are working on.
Much appreciation,
Shirley