Re: M/R API and Writable semantics in reducer

Jan Lukavský Thu, 05 Sep 2013 01:29:22 -0700

Hi,

is there anyone interested in this topic? Basically, what I'm trying to find out is whether it is 'safe' to rely on the side effect of the key being updated while iterating over the values. I believe there must be someone else interested in this, since the secondary sort pattern is very common (at least in our jobs). So far, we have been emulating the GroupingComparator by holding state in the Reducer class, which lets us keep track of 'groups' of keys across several calls to the reduce() method. This approach seems quite safe as far as the API is concerned, but the code is not as pretty (and it is vulnerable to ugly bugs, for instance if you forget to reset the state correctly).
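
A minimal sketch of the state-holding approach described above, in plain Java with no Hadoop classes. The class name, the String-based key parts and the reduce()/cleanup() methods are assumptions made for illustration; they only imitate the call pattern of a Reducer that sees one call per full key, in sorted order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a Reducer that emulates grouping by hand.
// No Hadoop types are used; reduce() and cleanup() only mimic the call pattern.
class StatefulGroupingSketch {
    private String currentGroup = null;                  // group part of the last key seen
    private final List<String> buffer = new ArrayList<>();
    private final List<String> output = new ArrayList<>();

    // Called once per distinct composite key, in sorted order.
    public void reduce(String groupPart, String secondaryPart, Iterable<String> values) {
        if (currentGroup != null && !currentGroup.equals(groupPart)) {
            flush();                                     // group boundary: emit and reset
        }
        currentGroup = groupPart;
        for (String v : values) {
            buffer.add(secondaryPart + "=" + v);
        }
    }

    // Imitates cleanup(): the last group has to be flushed here.
    public void cleanup() {
        if (currentGroup != null) {
            flush();
        }
    }

    private void flush() {
        output.add(currentGroup + ":" + String.join(",", buffer));
        buffer.clear();                                  // forgetting this reset is the typical bug
        currentGroup = null;
    }

    public static List<String> demo() {
        StatefulGroupingSketch r = new StatefulGroupingSketch();
        r.reduce("a", "1", Arrays.asList("x"));
        r.reduce("a", "2", Arrays.asList("y", "z"));
        r.reduce("b", "1", Arrays.asList("w"));
        r.cleanup();
        return r.output;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In this sketch the three reduce() calls produce two groups, "a:1=x,2=y,2=z" and "b:1=w". The second group only appears because cleanup() flushes it, which is exactly the kind of state handling that is easy to get wrong.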

On the other hand, if the way the key gets updated while iterating the values is to be considered part of the contract of the MapReduce API, I think it should be implemented in MRUnit (otherwise you basically cannot use MRUnit to unit-test your job), and if it isn't, then it is probably a bug. If this is internal behavior that might change at any time, then it clearly seems that keeping the state in the Reducer is the only option.

Does anyone else have similar considerations? How do others implement the secondary sort?


Thanks,
 Jan

On 09/02/2013 03:29 PM, Jan Lukavský wrote:

Hi all,
some time ago, I wrote a note to this list saying that it would be nice if it were possible to get the *real* key emitted from the mapper to the reducer when using the GroupingComparator. I got the answer that it is possible because of the Writable semantics, and that currently the following holds:
@Override
protected void reduce(Key key, Iterable<Value> values, Context context)
{
  for (Value v : values) {
    // The key MIGHT change its value in this cycle, because
    // readFields() will be called on it.
    // When using GroupingComparator that groups only by some part of
    // the key, many different keys might be considered single group,
    // so the *real* data matters.
  }
}
When you use GroupingComparator the contents of the key can matter, because if you cannot access it, you have to duplicate the data in the value (which means more network traffic in the shuffle phase, and more I/O generally).
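A toy model of the behavior in question, as a hedged sketch: it uses no Hadoop classes, and CompositeKey, Record and values() are hypothetical names. The values iterator overwrites one shared key object on every step, imitating readFields() being called on a reused key instance.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class KeyReuseSketch {
    // Hypothetical mutable key, standing in for a Writable composite key.
    static final class CompositeKey {
        String group;
        int order;
        void set(String group, int order) { this.group = group; this.order = order; }
    }

    // One sorted (key, value) record as it might arrive at the reducer.
    static final class Record {
        final String group; final int order; final String value;
        Record(String group, int order, String value) {
            this.group = group; this.order = order; this.value = value;
        }
    }

    // Iterates the values of one group and overwrites the shared key object
    // on every step (assumption: this imitates readFields() on a reused key).
    static Iterable<String> values(final List<Record> group, final CompositeKey sharedKey) {
        return () -> new Iterator<String>() {
            private int i = 0;
            public boolean hasNext() { return i < group.size(); }
            public String next() {
                Record r = group.get(i++);
                sharedKey.set(r.group, r.order);
                return r.value;
            }
        };
    }

    public static List<String> demo() {
        // All three records share the group part "a" but differ in the order part.
        List<Record> group = Arrays.asList(
            new Record("a", 1, "x"), new Record("a", 2, "y"), new Record("a", 3, "z"));
        CompositeKey key = new CompositeKey();
        key.set("a", 1);                        // key as handed to reduce(): first record
        List<String> seen = new ArrayList<>();
        for (String v : values(group, key)) {
            seen.add(key.order + ":" + v);      // key.order follows the current value
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In this model the loop observes "1:x", "2:y", "3:z", so the order part of the key does not need to be duplicated in the value. A test harness that hands over a fixed key and a plain list of values would show order 1 for every value instead.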
Now, the question is how much this is a matter of reliable API, and how likely it is that relying on this feature might break in future versions. To me, it seems more like a side effect that is not guaranteed to be maintained in the future. There is already a hint that this is probably very fragile, because MRUnit seems not to update the key during the iteration.
Does anyone have a suggested workaround? Is the 'official' preferred way of accessing the original key to call context.getCurrentKey()? Isn't this the same case? Wouldn't it be nice if the API itself had some guarantees or suggestions about how it works? I can imagine a modified reduce() method, with a signature like
protected void reduce(Key key, Iterable<Pair<Key, Value>> keyValues, Context context);
This seems easily transformable to the old call (which could be the default implementation of this method).
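A sketch of how such a default implementation might look, in plain Java without Hadoop types. Pair, SketchReducer and reducePairs() are hypothetical, and the Context parameter is omitted. The new method gets its own name here because two overloads called reduce() that differ only in the type argument of Iterable would have the same erasure and would not compile.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class PairReduceSketch {
    // Hypothetical pair type; not part of Hadoop.
    static final class Pair<K, V> {
        final K key; final V value;
        Pair(K key, V value) { this.key = key; this.value = value; }
    }

    // Hypothetical reducer base class (Context omitted for brevity).
    static abstract class SketchReducer<K, V> {
        // New-style entry point: by default it drops the per-value keys
        // and delegates to the old-style reduce().
        protected void reducePairs(K key, final Iterable<Pair<K, V>> keyValues) {
            Iterable<V> valuesOnly = () -> {
                final Iterator<Pair<K, V>> it = keyValues.iterator();
                return new Iterator<V>() {
                    public boolean hasNext() { return it.hasNext(); }
                    public V next() { return it.next().value; }
                };
            };
            reduce(key, valuesOnly);
        }

        // Old-style call, implemented by existing reducers.
        protected abstract void reduce(K key, Iterable<V> values);
    }

    public static List<String> demo() {
        final List<String> out = new ArrayList<>();
        // An "old" reducer that only knows about values and sums them.
        SketchReducer<String, Integer> r = new SketchReducer<String, Integer>() {
            @Override
            protected void reduce(String key, Iterable<Integer> values) {
                int sum = 0;
                for (int v : values) {
                    sum += v;
                }
                out.add(key + "=" + sum);
            }
        };
        r.reducePairs("a", Arrays.asList(
            new Pair<String, Integer>("a#1", 2),
            new Pair<String, Integer>("a#2", 5)));
        return out;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Existing reducers keep working unchanged through the default delegation, while a reducer that needs the real keys would override reducePairs() and read each pair's key directly.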
Any opinion on this?

Thanks,
 Jan
