> SELECT COUNT(DISTINCT ip) FROM table - 71 seconds
> SELECT COUNT(DISTINCT id) FROM table - 12,399 seconds

Ok, I misunderstood your gist.

> While ip is more unique that id, ip runs many times faster than id.
>
> How can I debug this ?

Nearly the same way - just replace "ip" with "id" in my exploratory queries.
 
count(distinct hash(id)) from the table?

count count(1) as collisions, hash(id) from table group by hash(id) order by 
collisions desc limit 10;

And, if those show many collisions

set tez.runtime.io.sort.mb=640;
set hive.map.aggr=false;
set tez.runtime.pipelined.shuffle=true; // this reduces failure tolerance (i.e 
retries are more expensive, happy path is faster)

select count(distinct id) from ip_table;

Java's hashCode() implementation is pretty horrible (& Hive defaults to using 
it). If you're seeing a high collision count, I think I might know what's 
happening here.

Cheers,
Gopal


Reply via email to