Hi guys, I know that Tez is able to automatically tune the number of downstream tasks based on statistics regarding the output of the tasks from the upstream tasks. Upstream tasks can be map tasks, downstream can be reduce tasks in MR parlance.
It seems that Tez is somehow able to repartition the key ranges to adjust the new number of downstream tasks computed at runtime based on these statistics. I have the following questions:1. When the upstream tasks are executing, how is the output data partitioned? I assume it should assume certain key-range splitting into separate partitions which are written to disk into intermediate files. 2. After certain upstream tasks have finished and the number of downstream tasks are adjusted based on the expected output, then the data will basically be repartitioned. That means if initially my data was going into 10 partitions, now it may go to 2 because Tez decided that only 2 downstream tasks are enough to fetch the data. If this is the case, how the repartitioning happens? What key ranges from the initial partitions will go to what key ranges from the computed new partitions? Thanks in advance,Robert
