Hi Robert!

I'm sorry, I do not have a Windows box and probably don't understand the
shuffle process well enough. Could you please create a JIRA in the
MAPREDUCE project if you would like this fixed upstream?
https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=116&projectKey=MAPREDUCE

Thanks,
Ravi

On Mon, Jul 31, 2017 at 6:36 AM, Robert Schmidtke <ro.schmid...@gmail.com>
wrote:

> Hi all,
>
> I just ran into an issue that likely resulted from my not very
> intelligent configuration, but I'd nonetheless like to share it with the
> community. This is all on Hadoop 2.7.3.
>
> In my setup, each reducer fetched roughly 65K from each mapper's spill
> file. I disabled transferTo during shuffle because I wanted to have a look
> at the file system statistics, which miss mmap calls (transferTo sometimes
> falls back to mmap). I left the shuffle buffer size at 128K (not knowing
> about the parameter at the time). As a result, I observed roughly 100%
> more data being read during shuffle, since 128K were read for each 65K
> needed.
>
> I added a quick fix to Hadoop which chooses the minimum of the partition
> size and the shuffle buffer size:
> https://github.com/apache/hadoop/compare/branch-2.7.3...robert-schmidtke:adaptive-shuffle-buffer
> Benchmarking this version against transferTo.allowed=true yields the same
> runtime and roughly 10% more reads in YARN during the shuffle phase (down
> from the previous 100%).
> Maybe this is something that should be added to Hadoop? Or do users have
> to be more clever about their job configurations? I'd be happy to open a PR
> if this is deemed useful.
>
> Anyway, thanks for the attention!
>
> Cheers
> Robert
>
> --
> My GPG Key ID: 336E2680
>
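For readers following along, the adaptive buffer choice described in the quoted mail can be sketched roughly as follows. This is a hypothetical standalone illustration, not the actual Hadoop patch: the class and method names are invented, and 128K is assumed as the default shuffle buffer size.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Sketch of the adaptive-shuffle-buffer idea: cap the read buffer at the
// partition size, so a ~65K partition is not read through a 128K buffer
// (which would otherwise roughly double the bytes read from disk).
public class AdaptiveShuffleBuffer {
    // Assumed default shuffle buffer size of 128K.
    static final int DEFAULT_SHUFFLE_BUFFER = 128 * 1024;

    // Read exactly one partition from the stream, using a buffer no larger
    // than the partition itself.
    static byte[] readPartition(InputStream in, long partitionSize)
            throws IOException {
        int bufSize = (int) Math.min(partitionSize, DEFAULT_SHUFFLE_BUFFER);
        byte[] buf = new byte[bufSize];
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long remaining = partitionSize;
        while (remaining > 0) {
            int n = in.read(buf, 0, (int) Math.min(buf.length, remaining));
            if (n < 0) {
                break; // stream ended before the partition was fully read
            }
            out.write(buf, 0, n);
            remaining -= n;
        }
        return out.toByteArray();
    }
}
```

The only behavioral change versus a fixed-size buffer is the `Math.min` on allocation; reads never request more bytes than the partition still owes.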
