Re: reducers and data locality

2012-04-27 Thread Robert Evans
Also, generating random keys/partitions can be problematic, although the 
problems are rare.  A mapper can be restarted after it finishes successfully if 
the machine it was on goes down or has other problems, so that the reducers are 
not able to get that mapper's output data.  If this happens while some of the 
reducers have finished fetching it, but not all of them, and the new mapper 
partitions things differently, some records may show up twice in your output and 
others not at all.

If you do use something like random for the partitioning, make sure that you 
use a constant seed so that it is deterministic, i.e. a re-run of the same 
mapper assigns every record to the same partition as before.
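
A minimal sketch of the constant-seed idea, in plain Java so that it runs
without Hadoop on the classpath. The class name, the method names and the
choice to mix a split identifier into the seed are assumptions made for
illustration, not something from this thread; a single constant seed would be
just as deterministic. The sketch also assumes that a restarted mapper reads
the records of its split in the same order as the first attempt.

```java
import java.util.Arrays;
import java.util.Random;

public class SeededShardKeys {
    // Hypothetical constant seed; any fixed value works.
    static final long CONSTANT_SEED = 42L;

    // Assign a shard in [0, numberOfShards) to each record of one split, in order.
    // The seed depends only on the constant and on the split's identity, so a
    // second attempt of the same map task draws exactly the same sequence.
    static int[] assignShards(String splitId, int numRecords, int numberOfShards) {
        Random random = new Random(CONSTANT_SEED ^ splitId.hashCode());
        int[] shards = new int[numRecords];
        for (int i = 0; i < numRecords; i++) {
            shards[i] = random.nextInt(numberOfShards);
        }
        return shards;
    }

    public static void main(String[] args) {
        // Two "attempts" of the same map task over the same (made-up) split.
        int[] firstAttempt = assignShards("/data/part-00001", 5, 4);
        int[] secondAttempt = assignShards("/data/part-00001", 5, 4);
        // Prints "true": both attempts route every record identically.
        System.out.println(Arrays.equals(firstAttempt, secondAttempt));
    }
}
```

With an unseeded `new Random()` the two attempts would generally differ, which
is the case where some records can end up duplicated and others lost.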

--Bobby Evans





Re: reducers and data locality

2012-04-27 Thread Bejoy KS
Hi Mete

A custom Partitioner class can control the flow of keys to the desired reducer. 
It gives you more control over which key goes to which reducer.
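
As a rough sketch of the routing logic such a class could contain, assuming the
map output key is an integer shard id: in a real job this would live in the
getPartition method of a subclass of org.apache.hadoop.mapreduce.Partitioner
and be registered with job.setPartitionerClass(...). Here it is written as a
plain static method (class and method names are hypothetical) so that it runs
without Hadoop.

```java
public class ShardPartitionerSketch {
    // Route shard id i to reducer (i mod numPartitions).
    // Math.floorMod keeps the result in [0, numPartitions) even for negative ids.
    static int getPartition(int shardId, int numPartitions) {
        return Math.floorMod(shardId, numPartitions);
    }

    public static void main(String[] args) {
        int numPartitions = 10; // number of reducers, chosen for illustration
        System.out.println(getPartition(7, numPartitions));  // prints 7
        System.out.println(getPartition(12, numPartitions)); // prints 2
    }
}
```

Because the result depends only on the key, a restarted mapper sends each key
to the same reducer as before.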


Regards
Bejoy KS

Sent from handheld, please excuse typos.




reducers and data locality

2012-04-26 Thread mete
Hello folks,

I have a lot of input splits (10k-50k, 128 MB blocks) which contain text
files. I need to process those line by line, then copy the result into
roughly equal-sized "shards".

So I generate a random key (from a range of [0:numberOfShards]) which is
used to route the map output to different reducers, and the sizes are more
or less equal.

I know that this is not really efficient and I was wondering if I could
somehow control how keys are routed.
For example, could I generate the random keys with hostname prefixes and
control which keys are sent to each reducer? What do you think?

Kind regards
Mete