I think you just restated what the OP said.

Your two cases reduce to the same single case the OP described: a split
reports the locations of its first block.

Whether this matters is another question. It seems like it could in cases
where splits != blocks, especially if a split starts near the end of a
block: most of the split's data then lives in the next block, so the
reported hosts give only an illusion of locality.

My guess is that, since data locality is typically very high, this
doesn't matter much.
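
To make the edge case concrete, here is a rough, self-contained Java sketch.
It is not Hadoop code: the Block class, the node names, the sizes, and the
method names are all made up for illustration, and the "most bytes" ranking
is only a simplified stand-in for the idea of hosts that contribute most to
a split, not the actual getSplitHosts logic.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SplitLocalitySketch {

    /** Hypothetical stand-in for a block: a byte range plus the hosts holding a replica. */
    static final class Block {
        final long offset;
        final long length;
        final List<String> hosts;

        Block(long offset, long length, String... hosts) {
            this.offset = offset;
            this.length = length;
            this.hosts = Arrays.asList(hosts);
        }
    }

    /** "First block" rule: hosts of the block that contains the split's first byte. */
    static List<String> firstBlockHosts(List<Block> blocks, long start) {
        for (Block b : blocks) {
            if (start >= b.offset && start < b.offset + b.length) {
                return b.hosts;
            }
        }
        throw new IllegalArgumentException("offset outside file: " + start);
    }

    /** How many bytes of the split [start, start + length) each host stores locally. */
    static Map<String, Long> localBytesPerHost(List<Block> blocks, long start, long length) {
        Map<String, Long> bytes = new HashMap<>();
        long end = start + length;
        for (Block b : blocks) {
            long overlap = Math.min(end, b.offset + b.length) - Math.max(start, b.offset);
            if (overlap <= 0) {
                continue; // this block does not intersect the split
            }
            for (String host : b.hosts) {
                bytes.merge(host, overlap, Long::sum);
            }
        }
        return bytes;
    }

    /** Simplified "contributes most" rule: hosts holding the largest share of the split. */
    static List<String> topHosts(Map<String, Long> bytesPerHost) {
        long max = 0;
        for (long v : bytesPerHost.values()) {
            max = Math.max(max, v);
        }
        List<String> top = new ArrayList<>();
        for (Map.Entry<String, Long> e : bytesPerHost.entrySet()) {
            if (e.getValue() == max) {
                top.add(e.getKey());
            }
        }
        Collections.sort(top); // stable output order for printing
        return top;
    }

    public static void main(String[] args) {
        // Two 128-unit blocks with three replicas each (made-up layout).
        List<Block> blocks = Arrays.asList(
                new Block(0, 128, "nodeA", "nodeB", "nodeC"),
                new Block(128, 128, "nodeD", "nodeE", "nodeF"));

        // A split that starts 8 units before the end of the first block.
        long start = 120;
        long length = 128;

        System.out.println("first-block hosts:     " + firstBlockHosts(blocks, start));
        System.out.println("hosts with most bytes: "
                + topHosts(localBytesPerHost(blocks, start, length)));
    }
}
```

With these made-up numbers the first-block rule reports nodeA, nodeB, and
nodeC, which each hold only 8 of the split's 128 units, while nodeD, nodeE,
and nodeF each hold the other 120.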


On Wed, May 8, 2013 at 3:00 PM, Vinod Kumar Vavilapalli <
vino...@hortonworks.com> wrote:

> I think you misread it.
>
> If a given split has only one block, it uses all the locations of that
> block.
>
> If it so happens that a given split has multiple blocks, it uses all the
> locations of the first block.
>
>  HTH,
> +Vinod Kumar Vavilapalli
> Hortonworks Inc.
> http://hortonworks.com/
>
>
> On May 8, 2013, at 7:21 AM, Brian C. Huffman wrote:
>
> All,
>
> I'm trying to understand how the current FileInputFormat implements
> locality.  As far as I can tell, it calculates splits using getSplits and
> each split will contain the node that hosts the first block of data in that
> split.  Is my understanding correct?
>
> Looking at the FileInputFormat for the old API (mapred), it appears that
> it does more to implement locality, using getSplitHosts to "return the
> hosts that contribute most for a given split".
>
> If I understand correctly, why was this changed?
>
> Thanks,
> Brian
>
>
>
