Re: What do people use Hadoop for?

Arkady Borkovsky Thu, 25 Jan 2007 10:09:21 -0800

"Disabling the sort" == "map without reduce" == "map writes the outputinto DFS"

is indeed a very useful and desirable feature.
File a JIRA issue.



On Jan 24, 2007, at 5:32 PM, Doug Judd wrote:

After digging into this a bit, it looks like the use ofIdentityReducer does
not disable the sort.  I wrote a simple Map/Reduce program that uses
/usr/share/dict/words as input and generates keys that are a Text
representation of the CRC of the word modulo 65536 and values that aretheword itself. I set the reducer to be the IdentityReducer and theoutput
came out sorted:

0       apperceptively
0       Connarus
1       overfold
1       derationalization
1       gymnasium
10      respecting
10      supperwards
100     cellulofibrous
100     drogherman
100     heteroptics
1000    bacao
1000    Cumaean
1000    didymate
1000    disbelieving
1001    polymer
1001    salveline
1001    workwomanly
1002    sporty
1002    bakal
1003    preferentialist
Also, after reviewing the Google paper, they make no mention of thesortbeing disabled by the Identity reducer. In fact, they describe theirSort
implementation as using the identity reducer.
Unless I'm missing something, I retract my previous statement.Map-Reduceis really just distributed sort. I do think that being able todisable thesort is a much needed enhancement, especially since quite a fewapplications
don't need it.

- Doug

On 1/24/07, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
Doug Judd wrote:
> Part of the problem is that calling the paradigm "Map-Reduce" is
somewhat
> misleading. It is really just a distributed sort. The sort iswhere
> all of
> the complexity comes from.  Invoking map() over the input is O(n),
> invoking
> reduce() over the intermediate results is O(n) as well.  The sort is
> O(nlogn).  A more appropriate name for this algorithm would be
> "Distributed
> Sort with a Pre-map Phase and a Post-reduce Phase"  Calling it
Map-Reduce
> and leaving out the word "sort" (the most important part) is asource of
> confusion.
>
> If you think of it in these terms, I think it's easier to see where
> and how
> it applies.

:) Sure, that's one point of view on this - however, in quite a few
applications sort is definitely less important than the ability tosplitthe processing load in map() and reduce() over many machines.Sometimes
I don't care about the sorting at all (in all cases where
IdentityReducer is used).

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

Re: What do people use Hadoop for?

Reply via email to