> b) Instead of pulling this information, push it to executors as part
> of task submission. (What Patrick mentioned ?)
> (1) a.1 from above is still an issue for this.

I don't understand what problem a.1 is. In this case, we don't need to
do caching, right?

> (2) Serialized task size is also a concern : we have already seen
> users hitting akka limits for task size - this will be an additional
> vector which might exacerbate it.

This would add only a small, constant amount of data to each task. It's
strictly better than before. Before, if the map output status array was
size M x R, we sent a single akka message of size M x R to every
node... this basically scales quadratically with the size of the RDD.
The new approach is constant... it's much better. And the total amount
of data sent over the wire is likely much less.
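
To make the scaling argument concrete, here is a rough
back-of-the-envelope sketch. The function names, the 1-byte-per-entry
status size, and the 64-byte per-task overhead are all made-up numbers
for illustration, not measurements from Spark.

```python
def status_array_bytes(num_maps, num_reduces, num_nodes, bytes_per_entry=1):
    """Old approach as described above: each node gets the full M x R array.

    bytes_per_entry is an assumed, illustrative size.
    Returns (bytes per message, total bytes across all nodes).
    """
    per_message = num_maps * num_reduces * bytes_per_entry
    return per_message, per_message * num_nodes


def per_task_bytes(num_tasks, extra_bytes_per_task=64):
    """New approach as described above: a fixed-size addition per task.

    extra_bytes_per_task is an assumed, illustrative constant.
    Returns (extra bytes per task, total extra bytes across all tasks).
    """
    return extra_bytes_per_task, extra_bytes_per_task * num_tasks


if __name__ == "__main__":
    nodes = 10  # hypothetical cluster size
    for m, r in [(100, 100), (1000, 1000), (10000, 10000)]:
        old_msg, old_total = status_array_bytes(m, r, nodes)
        new_task, new_total = per_task_bytes(r)
        print(f"M={m} R={r}: old message={old_msg} B, old total={old_total} B, "
              f"new per task={new_task} B, new total={new_total} B")
```

Under these assumptions, growing M and R by 10x grows the old per-node
message by 100x, while the per-task addition stays the same size.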

- Patrick
