Re: incremental re-execution

2008-04-21 Thread Ted Dunning

It isn't that bad.  Remember the input is sparse.

Actually, the size of the original data provides a sharp bound on the size
of the semi-aggregated data.  In practice, the semi-aggregated data will
have 1/k as many records as the original data where k is the average count.
The records in the semi-aggregated result are also typically considerably
smaller than the original records.  Even in long tail environments, k is
typically 5-20 and each record is often 2-4x smaller than the originals.

This means that there is typically an order of magnitude saving in
traversing the semi-aggregated result.
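The saving is easy to see in miniature. A minimal sketch in plain Python (record layout and names hypothetical, standing in for the MR job) that builds the semi-aggregate and then derives both final aggregates from it:

```python
from collections import Counter

# Hypothetical raw log: one record per event (start/progress/complete),
# so each (viewer, video) pair appears several times.
raw_log = [
    ("alice", "v1"), ("alice", "v1"), ("alice", "v1"),
    ("bob",   "v1"), ("bob",   "v1"),
    ("alice", "v2"), ("alice", "v2"), ("alice", "v2"), ("alice", "v2"),
]

# Semi-aggregate: one small (viewer, video) -> count record per pair.
semi_agg = Counter(raw_log)

# k = average count per pair; the semi-aggregate has ~1/k as many records.
k = len(raw_log) / len(semi_agg)

# Both final aggregates are computed from the semi-aggregate,
# never touching the raw log again.
views_per_viewer = Counter()
views_per_video = Counter()
for (viewer, video), n in semi_agg.items():
    views_per_viewer[viewer] += n
    views_per_video[video] += n
```

Here nine raw records collapse to three semi-aggregated ones (k = 3), and both per-viewer and per-video totals come out of the small side.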

The other major approach would be to emit typed records for each full level
aggregation that you intend to do.  This gives you all of your aggregates
together in a single output, but you can jigger the partition function so
that you have a single reduce output file per aggregate type.  This approach
isn't quite as simple as the semi-aggregate approach and doesn't allow
retrospective ad hoc scans, but it is slightly to considerably faster than
the semi-agg approach, especially if you are aggregating on more than two
keys.
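A rough Python sketch of the typed-record approach (all names hypothetical; a real Hadoop job would use a custom Partitioner class): the map emits one tagged record per full-level aggregate, and the partition function keys on the tag so each reduce output file holds exactly one aggregate type.

```python
from collections import defaultdict

AGG_TYPES = ["viewer", "video"]  # one partition (reduce file) per type

def map_event(viewer, video):
    """Emit a typed record for every aggregate this event contributes to."""
    yield (("viewer", viewer), 1)
    yield (("video", video), 1)

def partition(key, num_partitions):
    """Route records by aggregate type, mimicking a jiggered Partitioner."""
    agg_type, _ = key
    return AGG_TYPES.index(agg_type) % num_partitions

def run_job(events, num_partitions=2):
    """Simulated shuffle + reduce: sum counts per key within each partition."""
    partitions = [defaultdict(int) for _ in range(num_partitions)]
    for viewer, video in events:
        for key, value in map_event(viewer, video):
            partitions[partition(key, num_partitions)][key] += value
    return partitions
```

With two partitions, partition 0 ends up holding only per-viewer counts and partition 1 only per-video counts, which is the "single reduce output file per aggregate type" effect described above.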


On 4/21/08 1:06 PM, "Shirley Cohen" <[EMAIL PROTECTED]> wrote:

> Hi Ted,
> 
> Thanks for your example. It's very interesting to learn about
> specific map reduce applications.
> 
> It's non-obvious to me that it's a good idea to combine two map-
> reduce pairs by using the cross product of the intermediate states-
> you might wind up building an O(n^2) intermediate data structure
> instead of two O(n) ones. Even with parallelism this is not good. I'm
> wondering if in your example you're relying on the fact that the
> viewer-video matrix is sparse, so many of the pairs will have value
> 0? Does the map phase emit intermediate results with 0-values?
> 
> Thanks,
> 
> Shirley
> 
>> 
>> Take something like what we see in our logs of viewing.  We have
>> several log
>> entries per view each of which contains an identifier for the
>> viewer and for
>> the video.  These events occur on start, on progress and on
>> completion.  We
>> want to have total views per viewer and total views per video.  You
>> can pass
>> over the logs twice to get this data or you can pass over the data
>> once to
>> get total views per (viewer x video).  This last is a semi-
>> aggregated form
>> that has no utility except that it is much smaller than the
>> original data.
>> Reducing the semi-aggregated form to viewer counts and video counts
>> results
>> in shorter total processing than processing the raw data twice.
>> 
>> If you start with a program that has two map-reduce passes over the
>> same
>> data, it is likely very difficult to intuit that they could use the
>> same
>> intermediate data.  Even with something like Pig, where you have a
>> good
>> representation for internal optimizations, it is probably going to be
>> difficult to convert the two MR steps into one pre-aggregation and
>> two final
>> aggregations.
>> 
>> 
>> On 4/20/08 7:39 AM, "Shirley Cohen" <[EMAIL PROTECTED]> wrote:
>> 
>>> Hi Ted,
>>> 
>>> I'm confused about your second comment below: in the case where semi-
>>> aggregated data is used to produce multiple low-level aggregates,
>>> what sorts of detection did you have in mind which would be hard
>>> to do?
>>> 
>>> Thanks,
>>> 
>>> Shirley
>>> 
>>> On Apr 16, 2008, at 7:30 PM, Ted Dunning wrote:
>>> 
 
 I re-use outputs of MR programs pretty often, but when I need to
 reuse the
 map output, I just manually break the process apart into a
 map+identity-reducer and the multiple reducers.  This is rare.
 
It is common to have a semi-aggregated form that is much smaller than
 the
 original data which in turn can be used to produce multiple low
 definition
aggregates.  I would find it very surprising if you could detect
 these
 sorts of situations.
 
 
 On 4/16/08 5:26 PM, "Shirley Cohen" <[EMAIL PROTECTED]> wrote:
 
> Dear Hadoop Users,
> 
> I'm writing to find out what you think about being able to
> incrementally re-execute a map reduce job. My understanding is that
> the current framework doesn't support it and I'd like to know
> whether, in your opinion, having this capability could help to
> speed
> up development and debugging.
> 
> My specific questions are:
> 
> 1) Do you have to re-run a job often enough that it would be
> valuable
> to incrementally re-run it?
> 
> 2) Would it be helpful to save the output from a whole bunch of
> mappers and then try to detect whether this output can be re-used
> when a new job is launched?
> 
> 3) Would it be helpful to be able to use the output from a map
> job on
> many reducers?
> 
> Please let me know what your thoughts are and what specific
> applications you are working on.
> 
> Much appreciation,
> 
> Shirley
 
>>> 
>> 
> 



Re: incremental re-execution

2008-04-21 Thread Shirley Cohen

Hi Ted,

Thanks for your example. It's very interesting to learn about
specific map reduce applications.

It's non-obvious to me that it's a good idea to combine two map-reduce
pairs by using the cross product of the intermediate states:
you might wind up building an O(n^2) intermediate data structure
instead of two O(n) ones. Even with parallelism this is not good. I'm
wondering if in your example you're relying on the fact that the
viewer-video matrix is sparse, so many of the pairs will have value
0? Does the map phase emit intermediate results with 0-values?

Thanks,

Shirley




Re: incremental re-execution

2008-04-20 Thread Ted Dunning

Take something like what we see in our logs of viewing.  We have several log
entries per view each of which contains an identifier for the viewer and for
the video.  These events occur on start, on progress and on completion.  We
want to have total views per viewer and total views per video.  You can pass
over the logs twice to get this data or you can pass over the data once to
get total views per (viewer x video).  This last is a semi-aggregated form
that has no utility except that it is much smaller than the original data.
Reducing the semi-aggregated form to viewer counts and video counts results
in shorter total processing than processing the raw data twice.

If you start with a program that has two map-reduce passes over the same
data, it is likely very difficult to intuit that they could use the same
intermediate data.  Even with something like Pig, where you have a good
representation for internal optimizations, it is probably going to be
difficult to convert the two MR steps into one pre-aggregation and two final
aggregations.
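The "one pre-aggregation, two final aggregations" shape can be sketched in plain Python (an in-memory stand-in for real MR passes; all names are hypothetical):

```python
from itertools import groupby

def map_reduce(records, mapper, reducer):
    """Tiny in-memory stand-in for one MR pass (illustration only)."""
    pairs = sorted(kv for rec in records for kv in mapper(rec))
    return [reducer(k, [v for _, v in group])
            for k, group in groupby(pairs, key=lambda kv: kv[0])]

# Pre-aggregation: one pass over the raw log of (viewer, video) events
# yields the (viewer x video) semi-aggregate.
pre = lambda log: map_reduce(log, lambda rec: [(rec, 1)],
                             lambda k, vs: (k, sum(vs)))

# Two cheap final aggregations read the small semi-aggregate, not the log.
by_viewer = lambda semi: map_reduce(semi, lambda kv: [(kv[0][0], kv[1])],
                                    lambda k, vs: (k, sum(vs)))
by_video = lambda semi: map_reduce(semi, lambda kv: [(kv[0][1], kv[1])],
                                   lambda k, vs: (k, sum(vs)))
```

The raw log is traversed exactly once (in `pre`); `by_viewer` and `by_video` only ever see the much smaller semi-aggregate.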


On 4/20/08 7:39 AM, "Shirley Cohen" <[EMAIL PROTECTED]> wrote:

> Hi Ted,
> 
> I'm confused about your second comment below: in the case where semi-
> aggregated data is used to produce multiple low-level aggregates,
> what sorts of detection did you have in mind which would be hard to do?
> 
> Thanks,
> 
> Shirley
> 
> On Apr 16, 2008, at 7:30 PM, Ted Dunning wrote:
> 
>> 
>> I re-use outputs of MR programs pretty often, but when I need to
>> reuse the
>> map output, I just manually break the process apart into a
>> map+identity-reducer and the multiple reducers.  This is rare.
>> 
>> It is common to have a semi-aggregated form that is much smaller than
>> the
>> original data which in turn can be used to produce multiple low
>> definition
>> aggregates.  I would find it very surprising if you could detect
>> these
>> sorts of situations.
>> 
>> 
>> On 4/16/08 5:26 PM, "Shirley Cohen" <[EMAIL PROTECTED]> wrote:
>> 
>>> Dear Hadoop Users,
>>> 
>>> I'm writing to find out what you think about being able to
>>> incrementally re-execute a map reduce job. My understanding is that
>>> the current framework doesn't support it and I'd like to know
>>> whether, in your opinion, having this capability could help to speed
>>> up development and debugging.
>>> 
>>> My specific questions are:
>>> 
>>> 1) Do you have to re-run a job often enough that it would be valuable
>>> to incrementally re-run it?
>>> 
>>> 2) Would it be helpful to save the output from a whole bunch of
>>> mappers and then try to detect whether this output can be re-used
>>> when a new job is launched?
>>> 
>>> 3) Would it be helpful to be able to use the output from a map job on
>>> many reducers?
>>> 
>>> Please let me know what your thoughts are and what specific
>>> applications you are working on.
>>> 
>>> Much appreciation,
>>> 
>>> Shirley
>> 
> 



Re: incremental re-execution

2008-04-20 Thread Shirley Cohen

Hi Ted,

I'm confused about your second comment below: in the case where
semi-aggregated data is used to produce multiple low-level aggregates,
what sorts of detection did you have in mind which would be hard to do?


Thanks,

Shirley




Re: incremental re-execution

2008-04-16 Thread Amar Kamat

Shirley Cohen wrote:

> Dear Hadoop Users,
> 
> I'm writing to find out what you think about being able to
> incrementally re-execute a map reduce job. My understanding is that
> the current framework doesn't support it and I'd like to know whether,
> in your opinion, having this capability could help to speed up
> development and debugging.
> 
> My specific questions are:
> 
> 1) Do you have to re-run a job often enough that it would be valuable
> to incrementally re-run it?

Do you have any use case?

> 2) Would it be helpful to save the output from a whole bunch of
> mappers and then try to detect whether this output can be re-used when
> a new job is launched?

As Ted said, you can always write the data to the DFS and run multiple
jobs using that. I think the overhead of keeping track of the
intermediate data would be too high. Also, keeping it on the local FS
would not work, as nodes go down.

> 3) Would it be helpful to be able to use the output from a map job on
> many reducers?

If you mean *multiple reduce functionality* then see here:
[http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200804.mbox/[EMAIL PROTECTED]

Amar

> Please let me know what your thoughts are and what specific
> applications you are working on.
> 
> Much appreciation,
> 
> Shirley




Re: incremental re-execution

2008-04-16 Thread Ted Dunning

I re-use outputs of MR programs pretty often, but when I need to reuse the
map output, I just manually break the process apart into a
map+identity-reducer and the multiple reducers.  This is rare.

It is common to have a semi-aggregated form that is much smaller than the
original data which in turn can be used to produce multiple low definition
aggregates.  I would find it very surprising if you could detect these
sorts of situations.
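The manual split described in the first paragraph can be sketched in plain Python (an in-memory stand-in; all names are hypothetical): stage 1 is the map plus an identity reducer that materializes the map output, and stage 2 runs any number of different reducers over that stored output.

```python
from collections import defaultdict

def identity_reduce(key, values):
    """Identity reducer: pass the shuffled map output through unchanged,
    so it can be persisted and later feed several reduce jobs."""
    return [(key, v) for v in values]

def shuffle(pairs):
    """Group (key, value) pairs by key, as the shuffle phase would."""
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def stage1(records, mapper):
    """Map + identity reducer: materialize the map output once."""
    out = []
    for k, vals in shuffle(kv for r in records for kv in mapper(r)).items():
        out.extend(identity_reduce(k, vals))
    return out

def stage2(stored, reducer):
    """Any later reduce-only job re-reads the stored map output."""
    return {k: reducer(k, vals) for k, vals in shuffle(stored).items()}
```

`stage2` can be invoked repeatedly with different reducers (sum, count, max, ...) without re-running the map.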


On 4/16/08 5:26 PM, "Shirley Cohen" <[EMAIL PROTECTED]> wrote:

> Dear Hadoop Users,
> 
> I'm writing to find out what you think about being able to
> incrementally re-execute a map reduce job. My understanding is that
> the current framework doesn't support it and I'd like to know
> whether, in your opinion, having this capability could help to speed
> up development and debugging.
> 
> My specific questions are:
> 
> 1) Do you have to re-run a job often enough that it would be valuable
> to incrementally re-run it?
> 
> 2) Would it be helpful to save the output from a whole bunch of
> mappers and then try to detect whether this output can be re-used
> when a new job is launched?
> 
> 3) Would it be helpful to be able to use the output from a map job on
> many reducers?
> 
> Please let me know what your thoughts are and what specific
> applications you are working on.
> 
> Much appreciation,
> 
> Shirley



incremental re-execution

2008-04-16 Thread Shirley Cohen

Dear Hadoop Users,

I'm writing to find out what you think about being able to
incrementally re-execute a map reduce job. My understanding is that
the current framework doesn't support it and I'd like to know
whether, in your opinion, having this capability could help to speed
up development and debugging.

My specific questions are:

1) Do you have to re-run a job often enough that it would be valuable
to incrementally re-run it?

2) Would it be helpful to save the output from a whole bunch of
mappers and then try to detect whether this output can be re-used
when a new job is launched?

3) Would it be helpful to be able to use the output from a map job on
many reducers?

Please let me know what your thoughts are and what specific
applications you are working on.

Much appreciation,

Shirley