[ https://issues.apache.org/jira/browse/SPARK-17074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zhenhua Wang updated SPARK-17074:
---------------------------------
Description:
An equi-height histogram is effective at handling skewed data distributions.
In an equi-height histogram, every bin (interval) has the same height, i.e.
each bin holds roughly the same number of rows. The default number of bins we
use is 254.
We currently use a two-step method to generate an equi-height histogram:
1. use percentile_approx to get the percentiles (the end points of the
equi-height bin intervals);
2. use a new aggregate function to count the number of distinct values (ndv)
in each of these bins.
Note that this method requires two table scans. In the future we may provide
other algorithms that need only one table scan.
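The two-step method above can be sketched in plain Python. This is only an
illustration under stated assumptions, not the Spark implementation: exact
percentiles on a sorted list stand in for percentile_approx, a per-bin set
stands in for the new ndv aggregate function, and the function names
equi_height_endpoints and count_ndv_per_bin are hypothetical.

```python
from bisect import bisect_left


def equi_height_endpoints(values, num_bins):
    """Step 1 (first scan): bin end points taken from exact percentiles.

    Assumption: exact nearest-rank percentiles replace percentile_approx.
    Returns num_bins + 1 end points; the first one is the minimum value.
    """
    ordered = sorted(values)
    n = len(ordered)
    endpoints = [ordered[0]]
    for i in range(1, num_bins + 1):
        # Nearest-rank percentile: 1-based rank ceil(i * n / num_bins).
        rank = max(1, (i * n + num_bins - 1) // num_bins)
        endpoints.append(ordered[rank - 1])
    return endpoints


def count_ndv_per_bin(values, endpoints):
    """Step 2 (second scan): number of distinct values in each bin.

    Bin j covers (endpoints[j], endpoints[j + 1]]; the first bin also
    includes endpoints[0]. Values are assumed to lie within the end points.
    """
    distinct = [set() for _ in range(len(endpoints) - 1)]
    for v in values:
        # First index whose end point is >= v, shifted to a 0-based bin index.
        j = max(bisect_left(endpoints, v) - 1, 0)
        distinct[j].add(v)
    return [len(s) for s in distinct]


# Skewed toy column: the value 1 is repeated, so it fills a whole bin.
column = [1, 1, 1, 2, 3, 3, 4, 5, 6, 7, 8, 9]
endpoints = equi_height_endpoints(column, num_bins=4)
ndv = count_ndv_per_bin(column, endpoints)
print(endpoints)  # [1, 1, 3, 6, 9]
print(ndv)        # [1, 2, 3, 3]
```

In this toy column every bin holds three rows, but the ndv per bin differs:
the repeated value 1 occupies a whole bin with ndv 1. That is the kind of
skew the histogram is meant to expose. The sketch reads the data twice, once
per step, which mirrors the two table scans noted above.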
was:
Equi-height histogram is effective in handling skewed data distribution.
For equi-height histogram, the heights of all bins(intervals) are the same. The
default number of bins we use is 254.
Now we use a two-step method to generate an equi-height histogram:
1. use percentile_approx to get percentiles (end points of the equi-height bin
intervals);
2. use a new aggregate function to count ndv in each of these bins.
Note that this method takes two table scans. We may provide other algorithms
which takes only one table scan in the future.
> generate equi-height histogram for column
> -----------------------------------------
>
> Key: SPARK-17074
> URL: https://issues.apache.org/jira/browse/SPARK-17074
> Project: Spark
> Issue Type: Sub-task
> Components: Optimizer
> Affects Versions: 2.0.0
> Reporter: Ron Hu