Hi,

I’m working on a setup where we would like to test how Tez behaves with lots
of concurrent tasks running, without spinning up a huge cluster or generating
a real workload.

We’ve tried to simulate this by setting up an external table in Hive with
10k-50k files of about 5MB each, and setting tez.grouping.min-size and
tez.grouping.max-size equal to the file size. YARN container sizes are also
set appropriately. Tez correctly calculates as many tasks as we have files,
which we can see in the PENDING column in the Hive shell, but we are unable to
get a large number of them to run concurrently. There only ever seem to be
10-20 tasks running at a time, yet our YARN RM reports < 10% utilization, so
we know cluster resources are not the issue. Is there a way to “trick” Tez
into scheduling more tasks concurrently?

We are running simple queries, so it may be that tasks are just finishing too
fast, but with this many tasks we would expect more than 10-20 running at the
same time. Any help would be appreciated.
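
As a rough sanity check on the "finishing too fast" theory, here is a minimal
Python sketch based on Little's law (average tasks in flight = task launch
rate x average task duration). The function name is hypothetical and the
launch rates and durations are made-up illustrative values, not measurements
from our cluster; it only shows how short tasks could cap concurrency even
when the cluster is mostly idle.

```python
def estimated_concurrency(launch_rate_per_sec: float, task_duration_sec: float) -> float:
    """Little's law: average number of tasks in flight.

    launch_rate_per_sec: how many tasks per second actually get started
                         (assumed value, limited by scheduling overhead).
    task_duration_sec:   average time a task runs (assumed value).
    """
    return launch_rate_per_sec * task_duration_sec


if __name__ == "__main__":
    # Illustrative scenarios only: (tasks launched per second, seconds per task).
    scenarios = [(5.0, 3.0), (5.0, 30.0), (50.0, 3.0)]
    for rate, duration in scenarios:
        in_flight = estimated_concurrency(rate, duration)
        print(f"rate={rate}/s duration={duration}s -> ~{in_flight:.0f} tasks running")
```

If something like this holds, the number of running tasks would be bounded by
how fast tasks can be launched rather than by cluster capacity, so either
longer-running tasks or a faster launch rate would be needed to see more of
them at once.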

Thank you,
Zac
