Hi Robert!

I'm sorry, I do not have a Windows box and probably don't understand the shuffle process well enough. Could you please create a JIRA in the MapReduce project if you would like this fixed upstream?
https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=116&projectKey=MAPREDUCE

Thanks
Ravi

On Mon, Jul 31, 2017 at 6:36 AM, Robert Schmidtke <ro.schmid...@gmail.com> wrote:

> Hi all,
>
> I just ran into an issue, which likely resulted from my not very
> intelligent configuration, but nonetheless I'd like to share this with the
> community. This is all on Hadoop 2.7.3.
>
> In my setup, each reducer roughly fetched 65K from each mapper's spill
> file. I disabled transferTo during shuffle, because I wanted to have a look
> at the file system statistics, which miss mmap calls, which is what
> transferTo sometimes defaults to. I left the shuffle buffer size at 128K
> (not knowing about the parameter at the time). This had the effect that I
> observed roughly 100% more data being read during shuffle, since 128K were
> read for each 65K needed.
>
> I added a quick fix to Hadoop which chooses the minimum of the partition
> size and the shuffle buffer size:
> https://github.com/apache/hadoop/compare/branch-2.7.3...robert-schmidtke:adaptive-shuffle-buffer
> Benchmarking this version against transferTo.allowed=true yields the same
> runtime and roughly 10% more reads in YARN during the shuffle phase
> (compared to previous 100%).
> Maybe this is something that should be added to Hadoop? Or do users have
> to be more clever about their job configurations? I'd be happy to open a PR
> if this is deemed useful.
>
> Anyway, thanks for the attention!
>
> Cheers
> Robert
>
> --
> My GPG Key ID: 336E2680
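
For anyone following the thread, the quick fix described in the quoted message (using the minimum of the partition size and the shuffle buffer size) could be sketched roughly as below. This is only an illustration under my own assumptions, not the code from the linked branch and not the actual ShuffleHandler logic: the class name, method name, and the handling of empty partitions are all hypothetical.

```java
// Hypothetical helper for illustration only; not part of Hadoop.
final class AdaptiveShuffleBuffer {

    private AdaptiveShuffleBuffer() {
        // Static utility class, no instances.
    }

    /**
     * Returns the buffer size to use when serving one map output partition:
     * the configured shuffle buffer size, capped at the partition length, so
     * that a small partition does not cause a full-sized buffer read.
     *
     * @param partitionLength      bytes in the partition being served
     * @param configuredBufferSize configured shuffle buffer size in bytes
     */
    static int effectiveBufferSize(long partitionLength, int configuredBufferSize) {
        if (partitionLength < 0) {
            throw new IllegalArgumentException("partitionLength must be >= 0");
        }
        if (configuredBufferSize <= 0) {
            throw new IllegalArgumentException("configuredBufferSize must be > 0");
        }
        // The minimum never exceeds configuredBufferSize, so the int cast is safe.
        int capped = (int) Math.min(partitionLength, (long) configuredBufferSize);
        // Assumption: keep the buffer at least 1 byte even for empty partitions.
        return Math.max(1, capped);
    }
}
```

With the numbers from the message, a 65K partition and a 128K configured buffer would give a 65K effective buffer, while a partition larger than 128K would still use the full configured size.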