Hi Patrick,

Please see inline.

Regards,
Mridul

On Wed, Jul 2, 2014 at 10:52 AM, Patrick Wendell <pwend...@gmail.com> wrote:
>> b) Instead of pulling this information, push it to executors as part
>> of task submission. (What Patrick mentioned?)
>> (1) a.1 from above is still an issue for this.
>
> I don't understand what problem a.1 is. In this case, we don't need to do
> caching, right?

To rephrase in this context: attempting to cache won't help, since the
data is reducer-specific and the benefits are minimal (other than for
re-execution after failures and for speculative tasks).

>
>> (2) Serialized task size is also a concern: we have already seen
>> users hitting akka limits for task size - this will be an additional
>> vector which might exacerbate it.
>
> This would add only a small, constant amount of data to the task. It's
> strictly better than before. Before, if the map output status array was
> size M x R, we send a single akka message to every node of size M x
> R... this basically scales quadratically with the size of the RDD. The
> new approach is constant... it's much better. And the total amount of
> data sent over the wire is likely much less.

It would not be constant: the added data would be a function of the
number of mappers, and it would be an overhead on each task.

Regards,
Mridul

>
> - Patrick
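
P.S. To make the two size estimates above concrete, here is a rough back-of-the-envelope sketch in Python. It is only a model under assumptions that are not stated in this thread: one fixed-size entry per (mapper, reducer) pair, every node pulling the full M x R table exactly once, and each reduce task carrying one entry per mapper. The function names, the one-byte entry size, and the job sizes are all hypothetical, and this is not a description of how Spark actually encodes map output statuses.

```python
def pull_model_bytes(num_mappers, num_reducers, num_nodes, bytes_per_entry=1):
    """Total bytes sent if every node fetches the full M x R status table once.

    Assumption: one fixed-size entry per (mapper, reducer) pair.
    """
    per_node = num_mappers * num_reducers * bytes_per_entry
    return per_node * num_nodes


def push_model_bytes_per_task(num_mappers, bytes_per_entry=1):
    """Extra bytes attached to a single reduce task: one entry per mapper."""
    return num_mappers * bytes_per_entry


def push_model_bytes(num_mappers, num_reducers, bytes_per_entry=1):
    """Total extra bytes across all R reduce tasks (ignoring re-executions)."""
    return push_model_bytes_per_task(num_mappers, bytes_per_entry) * num_reducers


if __name__ == "__main__":
    # Hypothetical job: M mappers, R reducers, N nodes.
    m, r, n = 10_000, 10_000, 100
    print("pull, total bytes:   ", pull_model_bytes(m, r, n))
    print("push, bytes per task:", push_model_bytes_per_task(m))
    print("push, total bytes:   ", push_model_bytes(m, r))
```

Under this model the total pushed over the wire is smaller than the total pulled by a factor of the node count, but the per-task addition still grows linearly with the number of mappers, which is the part that matters for the task size limit.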