Hi Hive team,
I have a Hive query that runs as 2000+ map tasks and 1009 reduce tasks.
The reduce tasks are configured to start only after all map tasks have
completed. In the reduce phase, 1008 of the reduce tasks complete within
5 minutes, but the last one takes more than 14 hours.
I would expect the reduce tasks to complete at roughly the same time once
data skew is handled. For example, I have set the following parameters to
address data skew, but they didn't help:
set hive.optimize.skewjoin=true;
set hive.skewjoin.key=100000000;
Any idea what other parameters I need to set, or how else to reduce the
run time of the reduce tasks?
The query is as follows:
WITH uaf AS (
    SELECT user_id
    FROM db1.table1
    WHERE ds = '2018-11-25'
      AND is_valid
      AND days_since_last_visit = 0
)
SELECT *
FROM db2.table2 c
WHERE c.user_id IN (
        SELECT user_id
        FROM uaf
    )
  AND Substr(datehour, 1, 8) = '20181125'
LIMIT 10
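
To check whether a single user_id value (possibly NULL) accounts for most
of the rows sent to the slow reducer, a diagnostic query along these lines
might help. It is only a sketch: it reuses the table, column, and date
filter from the query above, and the cutoff of 20 rows is an arbitrary
choice.

```sql
-- Count rows per user_id for the same day as the main query.
-- One dominant user_id (or NULL) at the top of the result would be
-- consistent with a single reducer running far longer than the others.
SELECT   c.user_id,
         COUNT(*) AS row_count
FROM     db2.table2 c
WHERE    substr(c.datehour, 1, 8) = '20181125'
GROUP BY c.user_id
ORDER BY row_count DESC
LIMIT    20;
```

If the largest row_count is below 100000000, the hive.skewjoin.key
threshold set above would presumably never be reached (the default value
of that parameter is 100000).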
Da