Hi,

I think this is worth fixing because this seems to be triggered by the data 
quality itself - so let me dig in a bit into a couple more scenarios.

> hive.optimize.distinct.rewrite is True by default

FYI, we're tackling the count(1) + count(distinct col) case in the Optimizer 
now (which came up after your original email).

https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-hive.optimize.countdistinct

> On running the orcfiledump utility, I see that the column on which I want to 
> run the distinct query is encoded with a DIRECT encoding.  When I run 
> distinct on other columns in the table that are encoded with the dictionary 
> encoding, the query runs quickly. 

So the cut-off for dictionary encoding is that the value repeats at least ~2x 
in each stripe - so very unique patterns won't trigger this.

If the total # of rows of IP == total IP values, I don't expect it to be 
encoded as a dictionary.

Also interesting detail - I prefer to now store IPs as 2 bigint cols.

bigint ip1, bigint ip2

This was primarily driven by the crazy math required to join different 
contractions of the IPv6 formatting.

The two colon contractions are crazy when you want to joins across different 
data sources, if you store as a text string. Maybe 2017 is the year of IPv6 :D.

> CLUSTERED BY (ip) INTO 16 BUCKETS

This is something that completely annoys me - CLUSTERED BY does not cluster, 
but that doesn't help you here since IP is unique.

You need SORTED BY (ip) to properly generate clusters in Hive.

> Running a count(distinct) query on master id took 3+ hours. It looks like the 
> CPU was busy when running this query.

Can you do me a favour and run some intermediate state data exploratory 
queries, because some part of the slowness is probably triggered due to the 
failure tolerance checkpoints.

count(distinct hash(ip)) from the table? 

count count(1) as collisions, hash(ip) from table group by hash(ip) order by 
collisions desc limit 10;

And, if those show many collisions

set tez.runtime.io.sort.mb=640;
set hive.map.aggr=false;
set tez.runtime.pipelined.shuffle=true; // this reduces failure tolerance (i.e 
retries are more expensive, happy path is faster)

select count(distinct ip) from ip_table;

Cheers,
Gopal 



Reply via email to