Hello Amit,

For the most part, all local fetch does is this: when the upstream vertex's output is on the same host where the downstream vertex's task is running, the fetcher reads the data directly from disk instead of going through the HTTP-based shuffle handler. This optimization should perform well in smaller clusters (or in cases where there is high re-use of containers across vertices).
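
To make the decision concrete, here is a rough Python sketch of the check described above. It is only an illustration, not Tez's actual fetcher code: the names UpstreamOutput, choose_fetch_mode, and local_fetch_enabled are hypothetical and exist only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UpstreamOutput:
    """Hypothetical description of one upstream task's shuffle output."""
    host: str  # host where the upstream task wrote its output
    path: str  # illustrative on-disk location of that output


def choose_fetch_mode(output: UpstreamOutput,
                      local_host: str,
                      local_fetch_enabled: bool = True) -> str:
    """Pick how a downstream task would fetch one upstream output.

    Returns "local-disk" when the optimization is on and the output lives
    on the host running the downstream task; otherwise returns "http",
    meaning the fetch goes through the shuffle handler.
    """
    if local_fetch_enabled and output.host == local_host:
        return "local-disk"
    return "http"


# Example: two upstream outputs, downstream task running on node1.
outputs = [
    UpstreamOutput(host="node1", path="/data/out_0"),
    UpstreamOutput(host="node2", path="/data/out_1"),
]
modes = [choose_fetch_mode(o, local_host="node1") for o in outputs]
```

The fewer distinct hosts the tasks are spread across, the more often the first branch is taken, which is why a small cluster makes the feature easier to exercise in a test.
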
For the setup, a simple approach might be to use a smaller cluster (or one made artificially small by using YARN node labels to force a job to run on a smaller subset of nodes). @Prakash/@Rajesh might be able to shed light on which counters you can look at to verify that local fetch kicked in, and on how to tune the setup further to test various scenarios.

thanks
-- Hitesh

On May 15, 2015, at 11:19 AM, Amit Tiwari <tiw...@yahoo-inc.com> wrote:

> Hey guys,
> Local fetch optimization seems like an awesome feature. I'd like to add some
> tests for our CI/CD pipeline that exercise this feature.
> Any thoughts on what kind of setup, data etc I may need for this?
> thanks
> --amit
>