Runping Qi wrote:
The argument for using local combiners is interesting. To me, a combiner class
is just another layer of transformer.  It does not mean that the combiner
class has to be the same as the reducer class. The only criterion is that
they satisfy the associative rule: let L1, L2, ..., Ln and K1, K2, ..., Km be two partitions of S; then Reduce(list(Combiner(L1), Combiner(L2), ..., Combiner(Ln))) and Reduce(list(Combiner(K1), Combiner(K2), ..., Combiner(Km))) are the
same.

A special (maybe very common) scenario is that the combiner and reducer are the
same class and the reduce function is associative. However, this need not be
the case in general. And the class of the reduce outputs need not be the
same as that of the combiner outputs, if the combiner and the reducer are not the
same class.
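
The rule quoted above can be illustrated with a small sketch. This is plain Python, not Hadoop code, and every name in it (combiner, reduce_mean, naive_mean, the sample data) is a hypothetical chosen for illustration. It assumes a combiner that emits (sum, count) pairs and a separate reducer that emits a mean, so the two are different functions with different output types, yet the result does not depend on how S is partitioned. Using the mean itself as both combiner and reducer breaks the rule.

```python
from typing import Iterable, Tuple

# Hypothetical intermediate type produced by the combiner: (sum, count).
Partial = Tuple[float, int]


def combiner(values: Iterable[float]) -> Partial:
    """Collapse one chunk of map outputs into a partial (sum, count)."""
    vals = list(values)
    return (sum(vals), len(vals))


def reduce_mean(partials: Iterable[Partial]) -> float:
    """Merge partial (sum, count) pairs and emit the overall mean."""
    total, count = 0.0, 0
    for s, c in partials:
        total += s
        count += c
    return total / count


def naive_mean(values: Iterable[float]) -> float:
    """Mean used as both combiner and reducer; this violates the rule."""
    vals = list(values)
    return sum(vals) / len(vals)


S = [3.0, 5.0, 8.0, 10.0, 14.0]

# Two different partitions of the same data S.
partition_l = [S[:2], S[2:]]
partition_k = [S[:1], S[1:4], S[4:]]

# Distinct combiner and reducer: the result is independent of the partition.
mean_l = reduce_mean(combiner(chunk) for chunk in partition_l)
mean_k = reduce_mean(combiner(chunk) for chunk in partition_k)

# Same function at both stages: a mean of means depends on the partition.
naive_l = naive_mean(naive_mean(chunk) for chunk in partition_l)
naive_k = naive_mean(naive_mean(chunk) for chunk in partition_k)

print(mean_l == mean_k)    # partition-independent
print(naive_l == naive_k)  # partition-dependent
```

The design point is that the combiner only needs to preserve enough information (here, the count alongside the sum) for the reducer to reach the same answer under any partition.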

This may indeed be an intriguing generalization of the MapReduce model. But it does add more possible failure modes. At present we have far too few unit tests for the existing, simpler MapReduce model, and the platform is still shaky. Thus I am reluctant to spend a lot of time extending the model in ways that are not absolutely essential.

My goal is for Hadoop to be widely used. I do not feel that the power of the MapReduce model is currently a primary bottleneck to wider adoption. The larger issues we face are performance, reliability, scalability and documentation.

If I am to commit a patch, then I must feel that I can support and maintain it, and that it fits within my priorities. Otherwise, if it causes problems that I don't have time to attend to (even if this only means reviewing and testing fixes submitted by others), then the quality of the system will decrease, a direction we must avoid.

Currently we have just four committers on Hadoop. For Mike and Andrzej, Hadoop is a secondary effort. Owen has been voted in as a Hadoop committer, but his paperwork is not yet complete. So I am the bottleneck. I spend a lot of time on annoying yet critical issues like making sure that recent extensions to Hadoop don't break Nutch running in pseudo-distributed mode on Windows.

I don't particularly like things this way, but that's where we are right now. The best way to get out of here is for folks who'd like to be committers to submit high-quality, well-documented, well-formatted, non-disruptive, unit-test-bearing patches that are easy for me to apply and that make Hadoop easier to use and more reliable, thus earning points towards becoming committers. If we have more committers, then we should be able to advance with confidence on more fronts in parallel.

Doug
