Adam Tamas created DATASKETCHES-8:
-------------------------------------

             Summary: HLL doesn't take empty strings as distinct values
                 Key: DATASKETCHES-8
                 URL: https://issues.apache.org/jira/browse/DATASKETCHES-8
             Project: Apache Datasketches
          Issue Type: Bug
            Reporter: Adam Tamas


When using ds_hll in Hive, empty strings are not counted as distinct values 
for string and varchar columns.

Example:
Given a table t with the following (string, char(1), varchar(1)) values:
+-----+-----+-----+
| t.s | t.c | t.v |
+-----+-----+-----+
|     |     |     |
| a   | a   | a   |
|     |     |     |
| a   | a   | a   |
| s   | s   | s   |
| d   | d   | d   |
+-----+-----+-----+
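
To reproduce, the table above could be set up roughly as follows. This is a
sketch only: the DDL is not part of the original report, and it assumes the
blank cells are empty strings (as the summary suggests) rather than NULLs.

```sql
-- Hypothetical setup; column names and types are taken from the report.
CREATE TABLE t (s string, c char(1), v varchar(1));

-- Six rows: two with empty strings, two with 'a', one 's', one 'd'.
INSERT INTO t VALUES
  ('',  '',  ''),
  ('a', 'a', 'a'),
  ('',  '',  ''),
  ('a', 'a', 'a'),
  ('s', 's', 's'),
  ('d', 'd', 'd');
```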

select ds_hll_estimate(ds_hll_sketch(s)),
       ds_hll_estimate(ds_hll_sketch(c)),
       ds_hll_estimate(ds_hll_sketch(v))
from t;

+--------------------+--------------------+--------------------+
|        _c0         |        _c1         |        _c2         |
+--------------------+--------------------+--------------------+
| 3.000000014901161  | 4.000000029802323  | 3.000000014901161  |
+--------------------+--------------------+--------------------+

Each column holds four distinct values (the empty string, a, s and d), so all 
three estimates should be about 4. Only the char(1) column gives that result.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
