[ 
https://issues.apache.org/jira/browse/SPARK-17074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15597858#comment-15597858
 ] 

Herman van Hovell commented on SPARK-17074:
-------------------------------------------

[~ZenWzh] I think your current approach is valid. It will take two passes, but 
that is fine for now.

I have discussed this with Tim, and we are going to see whether we can come up 
with a single-pass algorithm. That will not happen until sometime next week, 
though.

Please also note that we are currently doing some work on the aggregation code 
paths. This might make your effort a little easier.

> generate histogram information for column
> -----------------------------------------
>
>                 Key: SPARK-17074
>                 URL: https://issues.apache.org/jira/browse/SPARK-17074
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Optimizer
>    Affects Versions: 2.0.0
>            Reporter: Ron Hu
>
> We support two kinds of histograms: 
> -     Equi-width histogram: We have a fixed width for each column interval in 
> the histogram.  The height of a histogram represents the frequency for those 
> column values in a specific interval.  For this kind of histogram, its height 
> varies for different column intervals. We use the equi-width histogram when 
> the number of distinct values is less than 254.
> -     Equi-height histogram: For this histogram, the width of column interval 
> varies.  The heights of all column intervals are the same.  The equi-height 
> histogram is effective in handling skewed data distribution. We use the 
> equi-height histogram when the number of distinct values is equal to or 
> greater than 254.  
> We first use [SPARK-18000] and [SPARK-17881] to compute equi-width histograms 
> (for both numeric and string types) or endpoints of equi-height histograms 
> (for numeric type only). Then, if we get endpoints of an equi-height 
> histogram, we need to compute the number of distinct values (ndv) between 
> those endpoints by [SPARK-17997] to form the equi-height histogram.
> This Jira incorporates the three Jiras mentioned above to support the needed 
> aggregation functions. We need to resolve them before this one.
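
The two histogram kinds and the two-pass flow described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the Spark implementation: all function names are hypothetical, exact sorting stands in for whatever aggregation functions the referenced Jiras provide, and the interval convention (first interval closed, the rest half-open on the left) is a choice made for this sketch.

```python
from bisect import bisect_left

NDV_THRESHOLD = 254  # threshold quoted in the issue description


def equi_width_histogram(values, num_bins):
    """Fixed-width intervals; the per-interval count (height) varies."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins or 1  # fall back to 1 when all values are equal
    counts = [0] * num_bins
    for v in values:
        # Clamp so that the maximum value lands in the last interval.
        idx = min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1
    return counts


def equi_height_endpoints(values, num_bins):
    """Pass 1: pick endpoints so each interval holds roughly the same row count."""
    ordered = sorted(values)
    n = len(ordered)
    # Endpoint i is the value at the i/num_bins quantile; the last is the maximum.
    return [ordered[min(i * n // num_bins, n - 1)] for i in range(num_bins + 1)]


def ndv_between_endpoints(values, endpoints):
    """Pass 2: count distinct values per interval.

    Assumed convention: the first interval is [e0, e1], the others are (lo, hi].
    """
    distinct = [set() for _ in range(len(endpoints) - 1)]
    for v in values:
        # bisect_left finds the first endpoint >= v; shift it to an interval index.
        idx = max(bisect_left(endpoints, v) - 1, 0)
        distinct[idx].add(v)
    return [len(s) for s in distinct]


def build_histogram(values, num_bins):
    """Choose the histogram kind from the number of distinct values."""
    if len(set(values)) < NDV_THRESHOLD:
        return {"kind": "equi-width",
                "counts": equi_width_histogram(values, num_bins)}
    endpoints = equi_height_endpoints(values, num_bins)
    return {"kind": "equi-height",
            "endpoints": endpoints,
            "ndvs": ndv_between_endpoints(values, endpoints)}
```

In this sketch the equi-height branch reads the data twice, once for the endpoints and once for the distinct counts, which mirrors the two passes mentioned in the comment above.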



