[ https://issues.apache.org/jira/browse/IMPALA-9942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Adam Tamas closed IMPALA-9942.
------------------------------
    Resolution: Fixed

> DataSketches HLL shouldn't take empty strings as distinct values
> ----------------------------------------------------------------
>
>                 Key: IMPALA-9942
>                 URL: https://issues.apache.org/jira/browse/IMPALA-9942
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Backend
>    Affects Versions: Impala 4.0
>            Reporter: Gabor Kaszab
>            Assignee: Adam Tamas
>            Priority: Major
>              Labels: newbie, ramp-up
>
> Let's consider a table that has string, char and varchar columns, where some
> of the values in these columns are empty strings.
> {code:java}
> select * from strings;
> +-----+------------+-----+
> | s   | c          | v   |
> +-----+------------+-----+
> |     |            |     |
> | abc | abc        | abc |
> |     |            |     |
> +-----+------------+-----+
> {code}
> If I query the number of distinct values with DataSketches HLL, the empty
> strings add +1 to the end result.
> {code:java}
> select ds_hll_estimate(ds_hll_sketch(s)), ds_hll_estimate(ds_hll_sketch(c)),
> ds_hll_estimate(ds_hll_sketch(v)) from strings;
> +------------+----------+-------------+
> | hll_string | hll_char | hll_varchar |
> +------------+----------+-------------+
> | 2          | 2        | 2           |
> +------------+----------+-------------+
> {code}
> However, Hive's implementation omits empty strings, so for the example
> above Hive would return 1 for each column.
> I assume it omits empty strings because of this line:
> https://github.com/apache/incubator-datasketches-java/blob/master/src/main/java/org/apache/datasketches/hll/BaseHllSketch.java#L351
> The first step of this task would be to decide which approach is the correct
> one, and the second step would be to adjust Impala if we decide that way.
> Btw, in Impala this function adds strings to the HLL sketches:
> https://github.com/apache/impala/commit/7e456dfa9d932bcdb317ad6477abc3c399abacf2#diff-cb22c62db38ee853b857c3b2302244dfR1661

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
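
The difference between the two behaviors described in the issue can be illustrated with a minimal sketch. It uses a plain Python set as a stand-in for an HLL sketch, which is an assumption made purely for illustration: this is neither the Impala nor the DataSketches code, and the name count_distinct and its skip_empty parameter are hypothetical. It only shows how skipping empty strings changes the distinct count for the example column s:

```python
def count_distinct(values, skip_empty):
    """Exact distinct count; a set stands in for the HLL sketch (assumption)."""
    seen = set()
    for value in values:
        # Hypothetical switch: ignore empty strings instead of
        # treating them as one more distinct value.
        if skip_empty and value == "":
            continue
        seen.add(value)
    return len(seen)


# Column s from the example table: two empty strings and one "abc".
column_s = ["", "abc", ""]

print(count_distinct(column_s, skip_empty=False))  # empty string counted: 2
print(count_distinct(column_s, skip_empty=True))   # empty string skipped: 1
```

With skip_empty=False the result matches the count of 2 reported for Impala in the issue, and with skip_empty=True it matches the count of 1 the issue attributes to Hive.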