[ 
https://issues.apache.org/jira/browse/SPARK-17074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhenhua Wang updated SPARK-17074:
---------------------------------
    Description: 
Equi-height histograms are effective at handling skewed data distributions.

In an equi-height histogram, all bins (intervals) have the same height, i.e. 
each bin holds roughly the same number of rows. The default number of bins we 
use is 254.

We now use a two-step method to generate an equi-height histogram:
1. use percentile_approx to get percentiles (the endpoints of the equi-height 
bin intervals);
2. use a new aggregate function to count the ndv (number of distinct values) 
in each of these bins.

Note that this method takes two table scans. We may provide other algorithms 
that take only one table scan in the future.
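
The two steps can be illustrated with a minimal sketch in plain Python. This is 
not the Spark implementation: the function names equi_height_endpoints and 
ndv_per_bin are hypothetical, exact nearest-rank percentiles over a sorted copy 
stand in for percentile_approx, and the bin boundary convention (first bin 
closed on both sides, later bins open on the left) is an assumption.

```python
from bisect import bisect_left


def equi_height_endpoints(values, num_bins):
    """Pass 1: return num_bins + 1 endpoints of an equi-height histogram.

    Exact nearest-rank percentiles on a sorted copy are used here as a
    stand-in for percentile_approx (an assumption made for illustration).
    """
    ordered = sorted(values)
    last = len(ordered) - 1
    return [ordered[(i * last) // num_bins] for i in range(num_bins + 1)]


def ndv_per_bin(values, endpoints):
    """Pass 2: count the number of distinct values (ndv) in each bin.

    Assumed convention: bin 0 is [e0, e1], bin i > 0 is (e_i, e_{i+1}].
    """
    num_bins = len(endpoints) - 1
    distinct = [set() for _ in range(num_bins)]
    for v in values:
        # Find the first inner endpoint that is >= v; the bin sits just below it.
        idx = bisect_left(endpoints, v, 1, num_bins) - 1
        distinct[idx].add(v)
    return [len(s) for s in distinct]


# Skewed column: the value 1 makes up half of the rows.
column = [1] * 50 + list(range(2, 52))
endpoints = equi_height_endpoints(column, 4)   # first scan
ndvs = ndv_per_bin(column, endpoints)          # second scan
print(endpoints, ndvs)
```

With a heavily repeated value, several endpoints coincide, so a bin can end up 
empty; in this sketch its ndv is simply 0. Each function walks the data once, 
which mirrors the two table scans mentioned above.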

  was:
Equi-height histogram is effective in handling skewed data distribution.

For equi-height histogram, the heights of all bins (intervals) are the same. The 
default number of bins we use is 254.

We first use [SPARK-18000] to compute equi-width histograms (for both numeric 
and string types) or endpoints of equi-height histograms (for numeric type 
only). Then, if we get endpoints of an equi-height histogram, we need to compute 
the ndv's between those endpoints by [SPARK-17997] to form the equi-height 
histogram.

This Jira incorporates three Jiras mentioned above to support needed 
aggregation functions. We need to resolve them before this one.


> generate equi-height histogram for column
> -----------------------------------------
>
>                 Key: SPARK-17074
>                 URL: https://issues.apache.org/jira/browse/SPARK-17074
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Optimizer
>    Affects Versions: 2.0.0
>            Reporter: Ron Hu


