Hello, Respected experts: Recently, I am studying tez reduce auto-parallelism, I read the article "Apache Tez: Dynamic Graph Reconfiguration",TEZ-398 and HIVE-7158. I found the HIVE-7158 said that "Tez can optionally sample data from a fraction of the tasks of a vertex and use that information to choose the number of downstream tasks for any given scatter gather edge". I know how to use this optimization function,but I was so confused by this:
" Tez defines a VertexManager event that can be used to send an arbitrary user payload to the vertex manager of a given vertex. The partitioning tasks (say the Map tasks) use this event to send statistics such as the size of the output partitions produced to the ShuffleVertexManager for the reduce vertex. The manager receives these events and tries to model the final output statistics that would be produced by the all the tasks." (1)How the actual "sample data" are implemented?I mean how does the reduce ShuffleVertexManger know how many sample data is enough to estimate the whole vertex parallelism, is that relates to reduce slow-start? I studied the source code of apache tez-0.7.0, but still not very clear. Mybe I was too stupid to understood that. (2)Is the partitioning tasks proactively send their output data stats to the consumer ShuffleVertexManger ? The event is sended by RPC or http? I am eager to get your instruction. Any reply would be very very grateful. LuluBian
