Hello Amit, 

For the most part, local fetch does one thing: when the upstream vertex's 
output is on the same host where the downstream vertex's task is running, the 
fetcher reads the data directly from disk instead of going through the 
HTTP-based shuffle handler. This optimization should help most in smaller 
clusters (or in cases where containers are heavily re-used across vertices).
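
To make the decision concrete, here is a rough Python sketch. It is only an 
illustration of the same-host check described above, not the actual fetcher 
code; the function and parameter names are made up for this example.

```python
def choose_fetch_path(upstream_host: str, local_host: str,
                      local_fetch_enabled: bool = True) -> str:
    """Pick how a downstream task would read one upstream output.

    Hypothetical helper for illustration only: returns "disk" when the
    output lives on the host the task is running on (and the optimization
    is turned on), otherwise "http" for the shuffle handler path.
    """
    if local_fetch_enabled and upstream_host == local_host:
        return "disk"
    return "http"
```

A test along these lines would expect "disk" whenever producer and consumer 
share a host, and "http" otherwise or when the optimization is switched off.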

For the setup, a simple approach might be to use a smaller cluster (or one 
made artificially small by using YARN node labels to force a job to run on a 
smaller subset of nodes).
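
As a back-of-the-envelope estimate of why fewer nodes helps, assume (purely 
for illustration) that the upstream and downstream tasks are each placed 
independently and uniformly at random on the available nodes. Real scheduling 
and container re-use will not follow this model, and the function name below 
is made up for this example.

```python
def same_host_probability(num_nodes: int) -> float:
    """Chance that a producer and a consumer task land on the same host.

    Assumes independent, uniformly random placement across num_nodes
    nodes, which is a simplification and not how a real scheduler works.
    """
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    return 1.0 / num_nodes
```

Under that assumption, restricting a job to 4 nodes gives a 1-in-4 chance per 
fetch of hitting the local path, versus 1-in-100 on a 100-node cluster.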

@Prakash/@Rajesh might be able to shed light on what counters you can look at 
to verify that local fetch kicked in and also how to tune the setup further to 
test various scenarios.

thanks
— Hitesh

On May 15, 2015, at 11:19 AM, Amit Tiwari <tiw...@yahoo-inc.com> wrote:

> Hey guys,
> Local fetch optimization seems like an awesome feature. I'd like to add some 
> tests for our CI/CD pipeline that exercise this feature.
> Any thoughts on what kind of setup, data etc I may need for this?
> thanks
> --amit
> 
