Hi, I’m working on a setup where we would like to test how Tez behaves with a large number of concurrent tasks running, without spinning up a huge cluster or generating a real workload.
We’ve tried to simulate this by setting up an external table in Hive with 10k-50k files of about 5 MB each, and setting tez.grouping.min-size and tez.grouping.max-size equal to the size of the files. YARN container sizes are also set appropriately. Tez correctly calculates one task per file, which we can see in the PENDING column in the Hive shell, but we are unable to get a large number of them to run concurrently.

It seems that there are only ever 10-20 tasks running at a time. However, our YARN RM reports < 10% utilization, so we know cluster resources are not the issue. Is there a way to “trick” Tez into scheduling more tasks concurrently? We are running simple queries, so it may be that tasks are simply finishing too fast, but for the number of tasks we have, we would expect more than 10-20 running at the same time.

Any help would be appreciated.

Thank you,
Zac
