I don't think you need to split your input file so that each map is assigned
one key. Your goal is load balancing. Each of your map tasks will initiate a
new MR sub-job. This sub-job will be assigned its own master/workers, which
means the map tasks of the sub-job may be scheduled to workers across the
cluster.
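(A minimal sketch of the sub-job pattern described above, assuming the new
org.apache.hadoop.mapreduce API; the mapper class and per-key paths are
hypothetical, not the poster's actual code:)

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Hypothetical mapper: each key it receives kicks off one MR sub-job.
    public class PerKeySubJobMapper extends Mapper<Text, Text, Text, Text> {

      @Override
      protected void map(Text key, Text value, Context context)
          throws IOException, InterruptedException {
        Job subJob = new Job(new Configuration(), "sub-job-for-" + key);
        subJob.setJarByClass(PerKeySubJobMapper.class);

        // Assumption: input/output are laid out per key; adjust to taste.
        FileInputFormat.addInputPath(subJob, new Path("/data/in/" + key));
        FileOutputFormat.setOutputPath(subJob, new Path("/data/out/" + key));
        // ... set the sub-job's mapper/reducer/output types here ...

        // The sub-job's own tasks get scheduled across the cluster,
        // which is where the load balancing comes from.
        try {
          subJob.waitForCompletion(true);
        } catch (ClassNotFoundException e) {
          throw new IOException(e);
        }
      }
    }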
Thanks for the suggestions!
On Mon, May 23, 2011 at 5:50 PM, Harsh J wrote:
Vincent,
You _might_ lose locality by splitting beyond the block splits, and
the tasks, although better 'parallelized', may end up performing
worse. A good way to increase task counts instead is to go the
block-size route (a lower block size gets you more splits, at the
cost of a little extra NN space).
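(To illustrate the block-size route: a fragment only, where the config key
is the 0.20-era dfs.block.size and 16 MB is an arbitrary example value:)

    Configuration conf = new Configuration();
    // Files written with this conf get smaller blocks, hence more
    // block-based splits per file, at the cost of extra NN metadata.
    conf.setLong("dfs.block.size", 16L * 1024 * 1024); // 16 MB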
Look at NLineInputFormat.
Sent from my iPhone
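(A minimal sketch of the NLineInputFormat route, assuming a Hadoop version
that ships the new-API class and a hypothetical text file listing one key
per line; note NLineInputFormat splits text input, not sequence files:)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class OneKeyPerMap {
      public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "one-key-per-map");
        job.setJarByClass(OneKeyPerMap.class);

        // Each split covers N lines of the text input; with N = 1 every
        // map task receives exactly one line, i.e. one key to work on.
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job, 1);

        // Hypothetical paths: a key-per-line listing and an output dir.
        FileInputFormat.addInputPath(job, new Path("/data/keys.txt"));
        FileOutputFormat.setOutputPath(job, new Path("/data/out"));
        // ... set mapper/reducer/output types as usual ...

        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }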
Look at getSplits() of SequenceFileInputFormat.
-Joey
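(A rough sketch of what a per-key getSplits() override might look like,
assuming Text keys and a sequence file sorted by key; the class name is
hypothetical, and a real version would snap split starts to sync markers:)

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;

    // Hypothetical: one split per distinct key, assuming the file is
    // sorted by key so each key's records are contiguous.
    public class PerKeySequenceFileInputFormat
        extends SequenceFileInputFormat<Text, Text> {

      @Override
      public List<InputSplit> getSplits(JobContext context)
          throws IOException {
        Configuration conf = context.getConfiguration();
        List<InputSplit> splits = new ArrayList<InputSplit>();
        for (Path file : getInputPaths(context)) {
          FileSystem fs = file.getFileSystem(conf);
          SequenceFile.Reader reader =
              new SequenceFile.Reader(fs, file, conf);
          try {
            Text key = new Text();
            Text splitKey = null;
            long splitStart = reader.getPosition();
            long recordStart = splitStart;
            while (reader.next(key)) {
              if (splitKey == null) {
                splitKey = new Text(key);
              } else if (!splitKey.equals(key)) {
                // Key changed: the previous key's records span
                // [splitStart, recordStart).
                splits.add(new FileSplit(file, splitStart,
                    recordStart - splitStart, null));
                splitStart = recordStart;
                splitKey = new Text(key);
              }
              recordStart = reader.getPosition();
            }
            if (splitKey != null) {
              splits.add(new FileSplit(file, splitStart,
                  recordStart - splitStart, null));
            }
          } finally {
            reader.close();
          }
          // Caveat: SequenceFileRecordReader seeks to the next sync
          // marker past a split's start, so real boundaries should be
          // aligned with sync points to avoid skipping records.
        }
        return splits;
      }
    }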
On May 23, 2011 5:09 AM, "Vincent Xue" wrote:
Hello Hadoop Users,
I would like to know if anyone has ever tried splitting an input
sequence file by key instead of by size. I know this is unusual for
the MapReduce paradigm, but I am in a situation where I need to
perform some large tasks on each key pair in a load-balanced
fashion.