I have the following join that takes 4.5 hours (with 12 nodes) mostly
because of a single reduce task that gets the bulk of the work:

SELECT ...
FROM T
LEFT OUTER JOIN S
ON T.timestamp = S.timestamp and T.id = S.id

This is a 1:0/1 join so the size of the output is exactly the same as the
size of T (500M records). S is actually very small (5K).

I've tried:
- switching the order of the join conditions
- using a different hash function setting (jenkins instead of murmur)
- using SET set hive.auto.convert.join = true;
- using SET hive.optimize.skewjoin = true;

but nothing helped :(

Anything else I can try?

Thanks!

Reply via email to