The first question can be split (no pun intended) into two topics because
there are actually two distinct steps. First, the InputFormat partitions the
data source into InputSplits. Its implementation will determine the exact
logic. Then the scheduler is responsible for ordering where/when the
InputS
Hey John,
I don't see the similarity. If you take the case of a normal record
file, such as a text file, you read data from the next block only to
finish the record that crosses the boundary. That is, n-1 blocks are
"opened" twice, but not read entirely in both attempts.
In the link you refer to, a specific block will always be read by all
readers
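
To make the text-file case concrete, here is a minimal Python sketch of
the convention a line-oriented reader can follow when a record crosses a
split boundary. It is an in-memory simulation, not Hadoop code: the names
read_split and iter_splits are hypothetical, splits are assumed to be
fixed-size byte ranges, and records are assumed to be newline-terminated.
A reader whose split does not start at offset 0 discards everything up to
the first newline (the previous reader owns that line), and every reader
keeps going past the end of its split just far enough to finish its last
line.

```python
from typing import Iterator, List, Tuple


def iter_splits(total: int, split_size: int) -> Iterator[Tuple[int, int]]:
    """Yield (start, length) byte ranges covering `total` bytes.

    Hypothetical helper: fixed-size splits, one per "block".
    """
    for start in range(0, total, split_size):
        yield start, min(split_size, total - start)


def read_split(data: bytes, start: int, length: int) -> List[bytes]:
    """Return the newline-delimited records owned by one split.

    Assumption: a record belongs to the split in which its first byte
    lies, except that a line starting exactly at the split end is read
    by the reader of the earlier split.
    """
    end = start + length
    pos = start
    if start != 0:
        # The line containing `start` belongs to the previous reader:
        # skip to the byte after the first newline at or after `start`.
        newline = data.find(b"\n", start)
        if newline == -1:
            return []
        pos = newline + 1

    records = []
    # `pos <= end` lets this reader pick up a line that begins right at
    # the boundary, matching the unconditional skip done by the next one.
    while pos <= end and pos < len(data):
        newline = data.find(b"\n", pos)
        if newline == -1:
            # Last line of the file has no trailing newline.
            records.append(data[pos:])
            pos = len(data)
        else:
            # May read past `end`, i.e. into the next block.
            records.append(data[pos:newline])
            pos = newline + 1
    return records


if __name__ == "__main__":
    data = b"alpha\nbravo\ncharlie\ndelta\n"
    for start, length in iter_splits(len(data), 8):
        print(start, length, read_split(data, start, length))
```

With 8-byte splits, the reader of the second split skips the tail of
"bravo" and reads "charlie" even though that line ends in the third
split, so only a few bytes of the next block are touched.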
under most file
formats, records *will* span blocks. But if it were simple to prevent them
from spanning blocks, would that be of benefit?
john
From: Bertrand Dechoux [mailto:decho...@gmail.com]
Sent: Thursday, June 13, 2013 3:37 PM
To: user@hadoop.apache.org
Subject: Re: Assignment of data splits
mple
> to prevent them from spanning blocks, would that be of benefit?
>
> john
>
> *From:* Bertrand Dechoux [mailto:decho...@gmail.com]
> *Sent:* Thursday, June 13, 2013 3:37 PM
> *To:* user@hadoop.apache.org
> *Subject:* Re: Assignment of data spli
s a connection.
Cheers,
John
From: Bertrand Dechoux [mailto:decho...@gmail.com]
Sent: Tuesday, June 18, 2013 3:54 PM
To: user@hadoop.apache.org
Subject: Re: Assignment of data splits to mappers
1) The tradeoff is between reducing the overhead of distributed computing and
reducing the cost of failu