Re: incremental re-execution
It isn't that bad. Remember that the input is sparse. In fact, the size of the original data provides a sharp bound on the size of the semi-aggregated data. In practice, the semi-aggregated data will have 1/k as many records as the original data, where k is the average count. The records in the semi-aggregated result are also typically considerably smaller than the original records. Even in long-tail environments, k is typically 5-20 and each record is often 2-4x smaller than the originals. This means that there is typically an order-of-magnitude saving in traversing the semi-aggregated result.

The other major approach would be to emit typed records for each full-level aggregation that you intend to do. This gives you all of your aggregates together in a single output, but you can jigger the partition function so that you have a single reduce output file per aggregate type. This approach isn't quite as simple as the semi-aggregate approach and doesn't allow retrospective ad hoc scans, but it is slightly to considerably faster than the semi-agg approach, especially if you are aggregating on more than two keys.

On 4/21/08 1:06 PM, "Shirley Cohen" <[EMAIL PROTECTED]> wrote:

> Hi Ted,
>
> Thanks for your example. It's very interesting to learn about specific
> map-reduce applications.
>
> It's non-obvious to me that it's a good idea to combine two map-reduce
> pairs by using the cross product of the intermediate states: you might
> wind up building an O(n^2) intermediate data structure instead of two
> O(n) ones. Even with parallelism this is not good. I'm wondering if in
> your example you're relying on the fact that the viewer-video matrix is
> sparse, so many of the pairs will have value 0? Does the map phase emit
> intermediate results with 0-values?
>
> Thanks,
>
> Shirley
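Ted's typed-record approach can be sketched as a small pure-Python simulation of the shuffle (illustrative names and data, not Hadoop code): the mapper emits one typed key per intended aggregate, and the partition function routes each aggregate type to its own reduce partition, so each partition's output file holds exactly one aggregate.

```python
from collections import defaultdict

# Hypothetical log events: one (viewer, video) pair per view event.
EVENTS = [("alice", "v1"), ("alice", "v1"), ("bob", "v1"), ("alice", "v2")]

NUM_PARTITIONS = 2  # one reduce partition (output file) per aggregate type

def map_typed(viewer, video):
    # Emit one typed record for each full-level aggregation we intend to do.
    yield (("viewer", viewer), 1)
    yield (("video", video), 1)

def partition(key):
    # "Jigger" the partition function so each aggregate type lands in its
    # own reduce partition, and hence its own output file.
    kind, _ = key
    return {"viewer": 0, "video": 1}[kind]

# Simulate the shuffle + reduce: group by key within each partition and sum.
partitions = [defaultdict(int) for _ in range(NUM_PARTITIONS)]
for viewer, video in EVENTS:
    for key, count in map_typed(viewer, video):
        partitions[partition(key)][key] += count

viewer_counts = {k[1]: v for k, v in partitions[0].items()}
video_counts = {k[1]: v for k, v in partitions[1].items()}
```

Both aggregates come out of a single pass over the data, at the cost of the map output being twice as large as with the semi-aggregate approach.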
Re: incremental re-execution
Hi Ted,

Thanks for your example. It's very interesting to learn about specific map-reduce applications.

It's non-obvious to me that it's a good idea to combine two map-reduce pairs by using the cross product of the intermediate states: you might wind up building an O(n^2) intermediate data structure instead of two O(n) ones. Even with parallelism this is not good. I'm wondering if in your example you're relying on the fact that the viewer-video matrix is sparse, so many of the pairs will have value 0? Does the map phase emit intermediate results with 0-values?

Thanks,

Shirley
Re: incremental re-execution
Take something like what we see in our logs of viewing. We have several log entries per view, each of which contains an identifier for the viewer and for the video. These events occur on start, on progress, and on completion. We want to have total views per viewer and total views per video. You can pass over the logs twice to get this data, or you can pass over the data once to get total views per (viewer x video). This last is a semi-aggregated form that has no utility except that it is much smaller than the original data. Reducing the semi-aggregated form to viewer counts and video counts results in shorter total processing than processing the raw data twice.

If you start with a program that has two map-reduce passes over the same data, it is likely very difficult to intuit that they could use the same intermediate data. Even with something like Pig, where you have a good representation for internal optimizations, it is probably going to be difficult to convert the two MR steps into one pre-aggregation and two final aggregations.

On 4/20/08 7:39 AM, "Shirley Cohen" <[EMAIL PROTECTED]> wrote:

> Hi Ted,
>
> I'm confused about your second comment below: in the case where
> semi-aggregated data is used to produce multiple low-level aggregates,
> what sorts of detection did you have in mind which would be hard to do?
>
> Thanks,
>
> Shirley
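The viewer/video example above can be sketched in pure Python (hypothetical data; `Counter` stands in for the MapReduce sum-reduce): one pass semi-aggregates the raw events into counts per (viewer x video) pair, and both final aggregates are then produced by traversing only the much smaller semi-aggregate. Note that, answering Shirley's sparsity question, zero-valued pairs are never materialized — only pairs that actually occur appear in the semi-aggregate.

```python
from collections import Counter

# Hypothetical view events: several log entries per view, each carrying a
# viewer id and a video id (start/progress/completion all look alike here).
events = [("alice", "v1"), ("alice", "v1"), ("bob", "v1"),
          ("alice", "v2"), ("bob", "v1")]

# Pass 1: semi-aggregate to counts per (viewer x video) pair. This has no
# utility on its own except that it is much smaller than the raw log, and
# it only contains pairs that occurred (no zero-valued cells).
semi = Counter(events)

# Final aggregations traverse the semi-aggregate, not the raw data.
viewer_totals = Counter()
video_totals = Counter()
for (viewer, video), n in semi.items():
    viewer_totals[viewer] += n
    video_totals[video] += n
```

With k events per (viewer, video) pair on average, the two final passes touch 1/k as many records as re-scanning the raw log twice would.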
Re: incremental re-execution
Hi Ted,

I'm confused about your second comment below: in the case where semi-aggregated data is used to produce multiple low-level aggregates, what sorts of detection did you have in mind which would be hard to do?

Thanks,

Shirley

On Apr 16, 2008, at 7:30 PM, Ted Dunning wrote:

> I re-use outputs of MR programs pretty often, but when I need to reuse
> the map output, I just manually break the process apart into a
> map+identity-reducer and the multiple reducers. This is rare.
>
> It is common to have a semi-aggregated form that is much smaller than
> the original data, which in turn can be used to produce multiple
> low-definition aggregates. I would find it very surprising if you could
> detect these sorts of situations.
Re: incremental re-execution
Shirley Cohen wrote:

> Dear Hadoop Users,
>
> I'm writing to find out what you think about being able to incrementally
> re-execute a map reduce job. My understanding is that the current
> framework doesn't support it and I'd like to know whether, in your
> opinion, having this capability could help to speed up development and
> debugging.
>
> My specific questions are:
>
> 1) Do you have to re-run a job often enough that it would be valuable
> to incrementally re-run it?

Do you have a use case?

> 2) Would it be helpful to save the output from a whole bunch of mappers
> and then try to detect whether this output can be re-used when a new
> job is launched?

As Ted said, you can always write the data to the DFS and run multiple jobs using that. I think the overhead of keeping track of intermediate data will be too much. Also, keeping it on the local FS would not work, as nodes go down.

> 3) Would it be helpful to be able to use the output from a map job on
> many reducers?

If you mean *multiple reduce functionality* then see here:
http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200804.mbox/[EMAIL PROTECTED]

Amar

> Please let me know what your thoughts are and what specific
> applications you are working on.
>
> Much appreciation,
>
> Shirley
Re: incremental re-execution
I re-use outputs of MR programs pretty often, but when I need to reuse the map output, I just manually break the process apart into a map+identity-reducer and the multiple reducers. This is rare.

It is common to have a semi-aggregated form that is much smaller than the original data, which in turn can be used to produce multiple low-definition aggregates. I would find it very surprising if you could detect these sorts of situations.

On 4/16/08 5:26 PM, "Shirley Cohen" <[EMAIL PROTECTED]> wrote:

> Dear Hadoop Users,
>
> I'm writing to find out what you think about being able to
> incrementally re-execute a map reduce job. My understanding is that
> the current framework doesn't support it and I'd like to know
> whether, in your opinion, having this capability could help to speed
> up development and debugging.
>
> My specific questions are:
>
> 1) Do you have to re-run a job often enough that it would be valuable
> to incrementally re-run it?
>
> 2) Would it be helpful to save the output from a whole bunch of
> mappers and then try to detect whether this output can be re-used
> when a new job is launched?
>
> 3) Would it be helpful to be able to use the output from a map job on
> many reducers?
>
> Please let me know what your thoughts are and what specific
> applications you are working on.
>
> Much appreciation,
>
> Shirley
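Ted's map+identity-reducer trick can be illustrated with a small pure-Python simulation (illustrative names and data, not Hadoop APIs): the identity reducer just groups and materializes the map output once, and several different reduce functions then run over the saved result instead of re-running the map.

```python
from collections import defaultdict

# Hypothetical map output: raw (key, value) pairs from a single map pass.
mapped = [("alice", 3), ("bob", 1), ("alice", 2)]

def identity_reduce(pairs):
    # An identity reducer groups the map output by key and re-emits it
    # unchanged, so it can be saved and fed to multiple reduce passes.
    grouped = defaultdict(list)
    for k, v in pairs:
        grouped[k].append(v)
    return grouped

def sum_reduce(grouped):
    # First reuse: total value per key.
    return {k: sum(vs) for k, vs in grouped.items()}

def count_reduce(grouped):
    # Second reuse: number of records per key.
    return {k: len(vs) for k, vs in grouped.items()}

saved = identity_reduce(mapped)   # materialized map output (one map pass)
totals = sum_reduce(saved)
counts = count_reduce(saved)
```

In Hadoop terms, `saved` corresponds to the output of a map-only job written to the DFS, which each subsequent reduce-style job then reads as its input.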
incremental re-execution
Dear Hadoop Users,

I'm writing to find out what you think about being able to incrementally re-execute a map reduce job. My understanding is that the current framework doesn't support it and I'd like to know whether, in your opinion, having this capability could help to speed up development and debugging.

My specific questions are:

1) Do you have to re-run a job often enough that it would be valuable to incrementally re-run it?

2) Would it be helpful to save the output from a whole bunch of mappers and then try to detect whether this output can be re-used when a new job is launched?

3) Would it be helpful to be able to use the output from a map job on many reducers?

Please let me know what your thoughts are and what specific applications you are working on.

Much appreciation,

Shirley