Hi Patrick,

  Please see inline.

Regards,
Mridul


On Wed, Jul 2, 2014 at 10:52 AM, Patrick Wendell <pwend...@gmail.com> wrote:
>> b) Instead of pulling this information, push it to executors as part
>> of task submission. (What Patrick mentioned ?)
>> (1) a.1 from above is still an issue for this.
>
> I don't understand what problem a.1 is. In this case, we don't need to do
> caching, right?


To rephrase in this context: attempting to cache won't help, since the data is
reducer-specific and the benefits are minimal (other than for re-execution
after failures and for speculative tasks).


>
>> (2) Serialized task size is also a concern : we have already seen
>> users hitting akka limits for task size - this will be an additional
>> vector which might exacerbate it.
>
> This would add only a small, constant amount of data to the task. It's
> strictly better than before. Before, if the map output status array was
> size M x R, we sent a single akka message to every node of size M x
> R... this basically scales quadratically with the size of the RDD. The
> new approach is constant... it's much better. And the total amount of
> data sent over the wire is likely much less.


The added size would be a function of the number of mappers, and it would be
an overhead on each task.
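
As a rough back-of-envelope sketch of the two sizes being compared (a
hypothetical illustration in Python, not Spark code; the function names and
the one-byte-per-entry default are assumptions, and re-executed or speculative
tasks are ignored):

```python
def pull_bytes_per_node(num_mappers, num_reducers, bytes_per_entry=1):
    """Pull model: a node fetches the full M x R map output status table."""
    return num_mappers * num_reducers * bytes_per_entry


def push_bytes_per_task(num_mappers, bytes_per_entry=1):
    """Push model: a reduce task carries one entry per mapper (M entries)."""
    return num_mappers * bytes_per_entry


def totals(num_mappers, num_reducers, num_nodes, bytes_per_entry=1):
    """Total bytes on the wire for each model, as (pull_total, push_total)."""
    pull_total = num_nodes * pull_bytes_per_node(
        num_mappers, num_reducers, bytes_per_entry
    )
    # One pushed slice per reduce task; retries would add further slices.
    push_total = num_reducers * push_bytes_per_task(num_mappers, bytes_per_entry)
    return pull_total, push_total


if __name__ == "__main__":
    m, r, nodes = 10_000, 10_000, 100
    pull_total, push_total = totals(m, r, nodes)
    print("per-task push overhead (bytes):", push_bytes_per_task(m))
    print("pull total (bytes):", pull_total)
    print("push total (bytes):", push_total)
```

Under these assumptions the pushed total is much smaller than the pulled
total, but every task still carries M entries, so the per-task overhead grows
with the number of mappers.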


Regards,
Mridul

>
> - Patrick
