> SELECT COUNT(DISTINCT ip) FROM table - 71 seconds > SELECT COUNT(DISTINCT id) FROM table - 12,399 seconds
Ok, I misunderstood your gist. > While ip is more unique that id, ip runs many times faster than id. > > How can I debug this ? Nearly the same way - just replace "ip" with "id" in my exploratory queries. count(distinct hash(id)) from the table? count count(1) as collisions, hash(id) from table group by hash(id) order by collisions desc limit 10; And, if those show many collisions set tez.runtime.io.sort.mb=640; set hive.map.aggr=false; set tez.runtime.pipelined.shuffle=true; // this reduces failure tolerance (i.e retries are more expensive, happy path is faster) select count(distinct id) from ip_table; Java's hashCode() implementation is pretty horrible (& Hive defaults to using it). If you're seeing a high collision count, I think I might know what's happening here. Cheers, Gopal
