I added the option that doesn't require the caller to specify the
mergeCombiner closure a while ago when I wanted to disable mapSideCombine.
In virtually all use cases I know of, it is fine & easy to specify a
mergeCombiner, so I'm all for this given it simplifies the codebase.
On Tue, Dec 31, 2013 at 5:05 PM, Patrick Wendell wrote:
> Hey All,
>
> There is a small API change that we are considering for the external
> sort patch. Previously we allowed mergeCombiner to be null when map
> side aggregation was not enabled. This is because it wasn't necessary
> in that case since mappers didn't ship pre-aggregated values to
> reducers.
>
> Because the external sort capability also relies on the mergeCombiner
> function to merge partially-aggregated on-disk segments, we now need
> it all the time, even if map side aggregation is enabled. This is a
> fairly esoteric thing that I'm not sure anyone other than Shark ever
> used, but I want to check in case anyone had feelings about this.
>
> The relevant code is here:
>
>
> https://github.com/apache/incubator-spark/pull/303/files#diff-f70e97c099b5eac05c75288cb215e080R72
>
> - Patrick
>