Hi,
I'm using Hive on Tez to query data on S3, and I see two different
behaviours depending on the query.
*Case A*
When the query covers a smaller amount of data, a Tez job (YARN
application) is not created:
select dt from my_db_schema.my_table where dt in ('2018-03-10','2018-03-09') and header = 'xxx';
The output in the above case is:
OK
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
2018-03-10
2018-03-10
2018-03-09
2018-03-09
Time taken: 7.043 seconds, Fetched: 4 row(s)
*Case B*
When the query scans more data:
select dt from my_db_schema.my_table where header = 'xxx';
the output is as follows, and I can see a Tez job logged in the Tez UI
and in YARN.
----------------------------------------------------------------------------------------------
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
----------------------------------------------------------------------------------------------
Map 1 .......... container     SUCCEEDED     22         22        0        0       0       0
----------------------------------------------------------------------------------------------
VERTICES: 01/01  [==========================>>] 100%  ELAPSED TIME: 38.12 s
----------------------------------------------------------------------------------------------
OK
2018-03-05
2018-03-05
2018-03-06
2018-03-06
2018-03-07
2018-03-07
2018-03-08
2018-03-08
2018-03-09
2018-03-09
2018-03-10
2018-03-10
2018-03-25
2018-03-25
2018-03-26
2018-03-26
2018-03-28
2018-03-28
2018-05-09
2018-05-09
2018-05-10
2018-05-10
Time taken: 47.197 seconds, Fetched: 22 row(s)
The problem in case A is that Hive sometimes decides not to trigger a Tez
job, and the query then takes a long time to complete. In that case the
worker nodes are not utilised at all; only the master node executes the
query.
Is there a way to force Hive to always trigger a Tez job?
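One setting that looks relevant, going by the Hive configuration
documentation, is fetch task conversion: for simple select / filter / limit
queries on a small enough input, Hive can answer with a local fetch task
instead of launching a job. I am only assuming this is what happens in
case A; I have not confirmed it on this cluster. A minimal sketch of the
session setting I would try:

```sql
-- Assumption: case A is being served by a local fetch task on the master node.
-- Turn fetch task conversion off for this session so that even simple
-- queries are compiled into a Tez job.
set hive.fetch.task.conversion=none;

-- Same query as in case A, re-run after changing the setting.
select dt from my_db_schema.my_table where dt in ('2018-03-10','2018-03-09') and header = 'xxx';
```

If that is the cause, the related hive.fetch.task.conversion.threshold
property (an input size in bytes) might be a less drastic option than
disabling the conversion entirely.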