Hi guys, I'm currently working on my master's thesis on Tez, and I am
trying to understand how Tez works under the hood. I have some questions and
hope someone can help:

1) How does Tez handle containers for reuse? Are they kept for some seconds
(how long?) in a sort of buffer, waiting for tasks that will need them? Or
is a container sent back to the RM if no task is immediately ready to take
it?

2) Let's say I have a DAG with two branches proceeding in parallel before
joining at a common node (such as the example on the Tez home page,
http://tez.apache.org/images/PigHiveQueryOnTez.png ). In this case, both
branches will be running at the same time. At some point the first branch
may be almost complete while the second is still at an early stage. Does
Tez know that sooner or later the two branches will merge, and that there
will be a common consumer waiting for the slower branch to complete? The
real question is: does Tez prioritize the scheduling/resource allocation of
tasks belonging to slower branches? If yes, what kind of policy is adopted?
Is it configurable?

3) tez.am.shuffle-vertex-manager.min-src-fraction: suppose I have a DAG made
of two producer vertices, each running 10 tasks, and below them a consumer
vertex running, say, 5 tasks. If this property is set to 0.2, does it mean
that before running any consumer task we need 2 producer tasks to complete
for each of the producer vertices? Or are the producers considered as a
whole, so that we need just 4 completed tasks (even from a single vertex)?
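
To make the two readings in question 3 concrete, here is a small Python
sketch. It is not Tez code: the function names, the vertex names "A" and "B",
and the round-up rule are my own assumptions, used only to show where the two
interpretations would give different answers.

```python
import math

def per_vertex_threshold(tasks_per_vertex, fraction):
    """Reading A (assumed): each producer vertex must reach the fraction alone."""
    # The small epsilon guards against float noise before rounding up.
    return {v: math.ceil(n * fraction - 1e-9) for v, n in tasks_per_vertex.items()}

def pooled_threshold(tasks_per_vertex, fraction):
    """Reading B (assumed): all producer tasks are pooled into one count."""
    return math.ceil(sum(tasks_per_vertex.values()) * fraction - 1e-9)

def consumer_may_start(completed, tasks_per_vertex, fraction, pooled):
    """Return True if the consumer could start under the chosen reading."""
    if pooled:
        needed = pooled_threshold(tasks_per_vertex, fraction)
        return sum(completed.values()) >= needed
    needed = per_vertex_threshold(tasks_per_vertex, fraction)
    return all(completed.get(v, 0) >= k for v, k in needed.items())

# Hypothetical scenario from the question: two producers with 10 tasks each.
producers = {"A": 10, "B": 10}
# Four tasks of A have finished, none of B.
done = {"A": 4, "B": 0}

print(consumer_may_start(done, producers, 0.2, pooled=False))
print(consumer_may_start(done, producers, 0.2, pooled=True))
```

With 4 tasks finished on one producer and none on the other, the per-vertex
reading would still hold the consumer back, while the pooled reading would
let it start. That is exactly the case I would like to understand.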

4) As far as I understand, a single Tez Application Master can handle
multiple DAGs at the same time, but only if the user application has been
coded to do so (for example, if I run two wordcount jobs as the same user,
it simply creates two different Tez Application Masters). Is this correct?

Thanks in advance

Fabio
