How the actual "sample data" are implemented when using tez reduce auto-parallelism

LLBian Fri, 26 Feb 2016 19:14:22 -0800


Hello, respected experts:

Recently I have been studying Tez reduce auto-parallelism. I read the article
"Apache Tez: Dynamic Graph Reconfiguration", TEZ-398, and HIVE-7158.
HIVE-7158 says that "Tez can optionally sample data from a fraction
of the tasks of a vertex and use that information to choose the number of
downstream tasks for any given scatter gather edge".
I know how to enable this optimization, but I am confused by the following passage:


" Tez defines a VertexManager event that can be used to send an arbitrary user 
payload to the vertex manager of a given vertex. The partitioning tasks (say 
the Map tasks) use this event to send statistics such as the size of the output 
partitions produced to the ShuffleVertexManager for the reduce vertex. The 
manager receives these events and tries to model the final output statistics 
that would be produced by the all the tasks."
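
To make my question concrete, here is how I currently picture the estimation described in the passage above. This is only my own minimal sketch, not code from Tez: the class name, method names, and the simple "average of reported tasks times total tasks" extrapolation are all my assumptions, as is the idea that the manager only ever lowers the configured reducer count.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; none of these names come from the Tez code base.
public class ParallelismSketch {

    /**
     * Extrapolate the total output size of a source vertex from the
     * output sizes reported so far by a subset of its tasks.
     * Assumption: unreported tasks produce about the average reported size.
     */
    public static long estimateTotalBytes(List<Long> reportedBytes, int totalSourceTasks) {
        if (reportedBytes.isEmpty()) {
            throw new IllegalArgumentException("need at least one reported task");
        }
        long sum = 0;
        for (long bytes : reportedBytes) {
            sum += bytes;
        }
        double averagePerTask = (double) sum / reportedBytes.size();
        return (long) (averagePerTask * totalSourceTasks);
    }

    /**
     * Pick a reducer count so that each reducer reads roughly
     * desiredBytesPerTask, never exceeding the configured parallelism
     * (assumption: parallelism is only reduced, never increased).
     */
    public static int chooseParallelism(long estimatedTotalBytes,
                                        long desiredBytesPerTask,
                                        int configuredParallelism) {
        // Ceiling division: number of reducers needed for the estimated data.
        long needed = (estimatedTotalBytes + desiredBytesPerTask - 1) / desiredBytesPerTask;
        needed = Math.max(1, needed);
        return (int) Math.min(configuredParallelism, needed);
    }

    public static void main(String[] args) {
        // 3 of 12 source tasks have reported their output sizes (in bytes).
        List<Long> reported = Arrays.asList(100L, 300L, 200L);
        long total = estimateTotalBytes(reported, 12);
        int reducers = chooseParallelism(total, 1000L, 10);
        System.out.println(total + " bytes estimated, " + reducers + " reducers");
    }
}
```

In this sketch the decision is made from whatever fraction of tasks has reported so far. What I do not understand is how the real ShuffleVertexManager decides when that fraction is large enough.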

(1) How is the "data sampling" actually implemented? I mean, how does the
ShuffleVertexManager of the reduce vertex know how much sampled data is enough
to estimate the parallelism of the whole vertex? Is that related to reduce
slow-start? I studied the source code of Apache Tez 0.7.0, but it is still not
very clear to me. Maybe I have simply failed to understand it.
(2) Do the partitioning tasks proactively send their output data statistics to
the consumer's ShuffleVertexManager? Is the event sent via RPC or HTTP?

I am eager to hear your guidance.

I would be very grateful for any reply.


LuluBian
