[
https://issues.apache.org/jira/browse/DATASKETCHES-8?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Adam Tamas updated DATASKETCHES-8:
----------------------------------
Description:
When using the ds_hll functions, Hive does not count empty strings as distinct
values for string and varchar columns.
Example:
Given a table t with the following (string, char(1), varchar(1)) values:
{code:java}
+------+------+------+
| t.s | t.c | t.v |
+------+------+------+
| | | |
| a | a | a |
| | | |
| a | a | a |
| s | s | s |
| d | d | d |
+------+------+------+
{code}
select ds_hll_estimate(ds_hll_sketch(s)), ds_hll_estimate(ds_hll_sketch(c)),
ds_hll_estimate(ds_hll_sketch(v)) from t;
{code:java}
+--------------------+--------------------+--------------------+
| _c0 | _c1 | _c2 |
+--------------------+--------------------+--------------------+
| 3.000000014901161 | 4.000000029802323 | 3.000000014901161 |
+--------------------+--------------------+--------------------+
{code}
The cause may be here:
https://github.com/apache/incubator-datasketches-java/blob/master/src/main/java/org/apache/datasketches/hll/BaseHllSketch.java#L351
The char column works because its values are padded with spaces up to the
declared length, so an empty value is stored as a single space.
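To illustrate the suspected mechanism, here is a minimal sketch. It does not use the DataSketches API: EmptyStringGuardDemo, countDistinctSkippingEmpty and padToChar are hypothetical names, and the exact counter below only stands in for the sketch. It assumes the update path drops null and empty strings before they are counted, which is what the linked line is suspected of doing.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class EmptyStringGuardDemo {

    // Hypothetical stand-in for a sketch update path: an exact distinct
    // counter that silently drops null and empty strings (assumed guard).
    public static int countDistinctSkippingEmpty(List<String> values) {
        Set<String> seen = new HashSet<>();
        for (String v : values) {
            if (v == null || v.isEmpty()) {
                continue; // empty input never reaches the counter
            }
            seen.add(v);
        }
        return seen.size();
    }

    // Mimics char(n) storage by right-padding a value with spaces to length n.
    public static String padToChar(String v, int n) {
        StringBuilder sb = new StringBuilder(v);
        while (sb.length() < n) {
            sb.append(' ');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Same six rows as the example table: two empty values, then a, a, s, d.
        List<String> column = Arrays.asList("", "a", "", "a", "s", "d");

        // string / varchar(1): empty values are dropped by the guard.
        System.out.println("string/varchar: " + countDistinctSkippingEmpty(column)); // prints 3

        // char(1): the empty value becomes " " and passes the guard.
        List<String> padded = Arrays.asList(
            padToChar("", 1), padToChar("a", 1), padToChar("", 1),
            padToChar("a", 1), padToChar("s", 1), padToChar("d", 1));
        System.out.println("char(1): " + countDistinctSkippingEmpty(padded)); // prints 4
    }
}
```

Under that assumption the counts match the estimates above: 3 for string and varchar, 4 for char(1).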
was:
Using ds_hll Hive is not counting empty strings as distinct values for string
and varchar columns.
Example:
With a t table with the following (string, char(1), varchar(1)) values:
{code:java}
+------+------+------+
| t.s | t.c | t.v |
+------+------+------+
| | | |
| a | a | a |
| | | |
| a | a | a |
| s | s | s |
| d | d | d |
+------+------+------+
{code}
select ds_hll_estimate(ds_hll_sketch(s)), ds_hll_estimate(ds_hll_sketch(c)),
ds_hll_estimate(ds_hll_sketch(v)) from t;
{code:java}
+--------------------+--------------------+--------------------+
| _c0 | _c1 | _c2 |
+--------------------+--------------------+--------------------+
| 3.000000014901161 | 4.000000029802323 | 3.000000014901161 |
+--------------------+--------------------+--------------------+
{code}
Could be a problem here:
https://github.com/apache/incubator-datasketches-java/blob/master/src/main/java/org/apache/datasketches/hll/BaseHllSketch.java#L351
but for char it is working fine.
> HLL doesn't take empty strings as distinct values
> -------------------------------------------------
>
> Key: DATASKETCHES-8
> URL: https://issues.apache.org/jira/browse/DATASKETCHES-8
> Project: Apache Datasketches
> Issue Type: Bug
> Reporter: Adam Tamas
> Priority: Major