> i am new to tez, i am not familiar with some words, what is 'Operator pipeline'?
> do we have some design wiki for the execution detail?

The operator pipeline is the part of execution which is not provided by Tez (the name is the same for Hive-on-MR). Tez provides the inputs/outputs and a processor for each vertex, with the vertices connected by edges to make a directed acyclic graph (DAG). I said "operator pipeline" because that is the component of Hive which runs inside each processor in Tez, and your question was about Hive on Tez & reuse of containers.

Tez provides something known as the Object Registry, which allows us to place objects in a cache that is cleared either when a new vertex is seen or when a new DAG is seen. When we reuse a container to run more than 1 task (so with 600 containers, you can run 60,000 tasks - unlike MR, which spins up 1 container per task), we get to reuse some of that state, which includes things like the Hive operator pipeline for a given vertex. This lets us run sub-second tasks, because no time is wasted in loading classes or starting JVMs during reuse.

A container might run "Map 1 (split 0)", "Map 1 (split 1)", "Map 2 (split 0)" in it - we try not to reload the whole set of Hive SQL operators when transitioning from "Map 1 (split 0)" to "Map 1 (split 1)".

As a stress test, I have run ~2500 tasks in 50 containers to test those scenarios (each horizontal lane is a single YARN container, each box is a task) - http://people.apache.org/~gopalv/query10.svg

If you look at the operator pipeline at the same time as the Tez vertex separation, you will notice that data movement within the Hive operators stays within a single JVM, while the Tez edges can move data between different processes/machines.

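To make the container-reuse point above a bit more concrete, here is a toy sketch in Python. It is not the Tez or Hive API: `ToyRegistry`, `ToyContainer` and the `"pipeline"` cache key are hypothetical names made up for illustration. It only models the idea that a reused container keeps the operator pipeline cached while it stays on the same vertex, and rebuilds it when the vertex (or the DAG) changes.

```python
# Toy model only: these classes are hypothetical and are not the Tez API.

class ToyRegistry:
    """Cache whose entries are dropped when the vertex or the DAG changes."""

    def __init__(self):
        self.current_dag = None
        self.current_vertex = None
        self.vertex_cache = {}
        self.dag_cache = {}

    def start_task(self, dag, vertex):
        if dag != self.current_dag:
            # New DAG: nothing cached for the previous query is reusable.
            self.dag_cache.clear()
            self.vertex_cache.clear()
        elif vertex != self.current_vertex:
            # Same DAG, new vertex: only vertex-scoped state is dropped.
            self.vertex_cache.clear()
        self.current_dag, self.current_vertex = dag, vertex


class ToyContainer:
    """One long-lived process that runs many tasks back to back."""

    def __init__(self):
        self.registry = ToyRegistry()
        self.pipelines_built = 0

    def run_task(self, dag, vertex, split):
        self.registry.start_task(dag, vertex)
        pipeline = self.registry.vertex_cache.get("pipeline")
        if pipeline is None:
            # Stand-in for the expensive part: setting up the operators.
            pipeline = "operators for " + vertex
            self.registry.vertex_cache["pipeline"] = pipeline
            self.pipelines_built += 1
        return (pipeline, split)


container = ToyContainer()
tasks = [("q1", "Map 1", 0), ("q1", "Map 1", 1), ("q1", "Map 2", 0)]
results = [container.run_task(*t) for t in tasks]
print(container.pipelines_built)  # 2: built once for Map 1, once for Map 2
```

In this sketch the second "Map 1" task finds the pipeline already in the cache, which is the saving described above; the switch to "Map 2" clears the vertex-scoped entries and pays the setup cost again.
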
Tez does not really bother with the interior details of an operator pipeline, so the view Tez has is more like this - http://people.apache.org/~gopalv/q27-dag.svg

But for the sake of illustration, I have drawn that out for TPC-DS Query 27 - http://people.apache.org/~gopalv/q27-plan.svg

The dashed boxes contain Tez vertices and everything within a box is implemented by Hive as SQL operators.

That's a high-level picture of how a data access engine uses Tez - Tez handles the data transfers and the actual transformations are entirely performed by the operators (those are owned by Pig/Hive/Cascading/Flink etc.).

You can follow the Hive on Tez design docs on the Hive wiki for more details.

Cheers,
Gopal
