Hi, I think this is worth fixing because this seems to be triggered by the data quality itself - so let me dig in a bit into a couple more scenarios.
> hive.optimize.distinct.rewrite is True by default FYI, we're tackling the count(1) + count(distinct col) case in the Optimizer now (which came up after your original email). https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-hive.optimize.countdistinct > On running the orcfiledump utility, I see that the column on which I want to > run the distinct query is encoded with a DIRECT encoding. When I run > distinct on other columns in the table that are encoded with the dictionary > encoding, the query runs quickly. So the cut-off for dictionary encoding is that the value repeats at least ~2x in each stripe - so very unique patterns won't trigger this. If the total # of rows of IP == total IP values, I don't expect it to be encoded as a dictionary. Also interesting detail - I prefer to now store IPs as 2 bigint cols. bigint ip1, bigint ip2 This was primarily driven by the crazy math required to join different contractions of the IPv6 formatting. The two colon contractions are crazy when you want to joins across different data sources, if you store as a text string. Maybe 2017 is the year of IPv6 :D. > CLUSTERED BY (ip) INTO 16 BUCKETS This is something that completely annoys me - CLUSTERED BY does not cluster, but that doesn't help you here since IP is unique. You need SORTED BY (ip) to properly generate clusters in Hive. > Running a count(distinct) query on master id took 3+ hours. It looks like the > CPU was busy when running this query. Can you do me a favour and run some intermediate state data exploratory queries, because some part of the slowness is probably triggered due to the failure tolerance checkpoints. count(distinct hash(ip)) from the table? count count(1) as collisions, hash(ip) from table group by hash(ip) order by collisions desc limit 10; And, if those show many collisions set tez.runtime.io.sort.mb=640; set hive.map.aggr=false; set tez.runtime.pipelined.shuffle=true; // this reduces failure tolerance (i.e retries are more expensive, happy path is faster) select count(distinct ip) from ip_table; Cheers, Gopal