Adam Tamas created DATASKETCHES-8:
-------------------------------------
Summary: HLL doesn't take empty strings as distinct values
Key: DATASKETCHES-8
URL: https://issues.apache.org/jira/browse/DATASKETCHES-8
Project: Apache Datasketches
Issue Type: Bug
Reporter: Adam Tamas
When using the ds_hll functions, Hive does not count empty strings as distinct
values for string and varchar columns.
Example:
Given a table t with the following (string, char(1), varchar(1)) values:
+------+------+------+
| t.s  | t.c  | t.v  |
+------+------+------+
|      |      |      |
| a    | a    | a    |
|      |      |      |
| a    | a    | a    |
| s    | s    | s    |
| d    | d    | d    |
+------+------+------+
select ds_hll_estimate(ds_hll_sketch(s)), ds_hll_estimate(ds_hll_sketch(c)),
ds_hll_estimate(ds_hll_sketch(v)) from t;
+--------------------+--------------------+--------------------+
|        _c0         |        _c1         |        _c2         |
+--------------------+--------------------+--------------------+
| 3.000000014901161  | 4.000000029802323  | 3.000000014901161  |
+--------------------+--------------------+--------------------+
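
For anyone trying to reproduce this, here is a minimal HiveQL sketch that
should recreate the example table and put the sketch estimates next to exact
distinct counts. The CREATE TABLE and INSERT statements are assumptions
reconstructed from the values shown above, not the original setup. Each column
holds four distinct values (empty string, a, s, d), so an estimate close to 4
would be expected for all three columns.

```sql
-- Assumed schema, inferred from the (string, char(1), varchar(1)) description.
CREATE TABLE t (s string, c char(1), v varchar(1));

-- Same six rows as in the example, including two rows of empty strings.
INSERT INTO t VALUES
  ('',  '',  ''),
  ('a', 'a', 'a'),
  ('',  '',  ''),
  ('a', 'a', 'a'),
  ('s', 's', 's'),
  ('d', 'd', 'd');

-- Exact distinct counts, for comparison with the sketch estimates.
SELECT count(DISTINCT s), count(DISTINCT c), count(DISTINCT v) FROM t;

-- HLL estimates, the same query as in the report.
SELECT ds_hll_estimate(ds_hll_sketch(s)),
       ds_hll_estimate(ds_hll_sketch(c)),
       ds_hll_estimate(ds_hll_sketch(v))
FROM t;
```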