Tez runtime repartitioning

Robert Grandl Fri, 30 Dec 2016 12:11:35 -0800

Hi guys,
I know that Tez is able to automatically tune the number of downstream tasks 
based on statistics regarding the output of the tasks from the upstream tasks. 
Upstream tasks can be map tasks, downstream can be reduce tasks in MR parlance.


It seems that Tez is somehow able to repartition the key ranges to adjust the 
new number of downstream tasks computed at runtime based on these statistics. 

I have the following questions:1. When the upstream tasks are executing, how is 
the output data partitioned? I assume it should assume certain key-range 
splitting into separate partitions which are written to disk into intermediate 
files. 
2. After certain upstream tasks have finished and the number of  downstream 
tasks are adjusted based on the expected output, then the data will basically 
be repartitioned. That means if initially my data was going into 10 partitions, 
now it may go to 2 because Tez decided that only 2 downstream tasks are enough 
to fetch the data. If this is the case, how the repartitioning happens? What 
key ranges from the initial partitions will go to what key ranges from the 
computed new partitions?
Thanks in advance,Robert

Tez runtime repartitioning

Reply via email to