[ 
https://issues.apache.org/jira/browse/SPARK-17074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15542830#comment-15542830
 ] 

Srinath commented on SPARK-17074:
---------------------------------

IMO, if you can get reasonable error bounds (as Tim points out), the method with 
lower overhead is preferable. In general, you can't rely on exact statistics 
during optimization anyway, since new data may have arrived since the last stats 
collection.

> generate histogram information for column
> -----------------------------------------
>
>                 Key: SPARK-17074
>                 URL: https://issues.apache.org/jira/browse/SPARK-17074
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Optimizer
>    Affects Versions: 2.0.0
>            Reporter: Ron Hu
>
> We support two kinds of histograms:
> - Equi-width histogram: Each column interval has a fixed width. The height
>   of an interval represents the frequency of the column values that fall in
>   it, so the height varies from one interval to the next. We use the
>   equi-width histogram when the number of distinct values is less than 254.
> - Equi-height histogram: The width of each column interval varies, while the
>   heights of all column intervals are the same. The equi-height histogram is
>   effective in handling skewed data distributions. We use the equi-height
>   histogram when the number of distinct values is equal to or greater than
>   254.
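
To make the two histogram kinds in the description concrete, here is a minimal
Python sketch. It is not Spark's implementation: the function names
(equi_width, equi_height, build_histogram) and the default bin count are
hypothetical and chosen only for illustration. The one value taken from the
description is the 254 distinct-value threshold. The sketch also computes
exact counts over an in-memory list, whereas a real engine would likely work
from sampled or approximate statistics.

```python
from typing import List, Sequence, Tuple

# Threshold from the issue description: below it use equi-width, otherwise equi-height.
NDV_THRESHOLD = 254

# Each bin is (lower bound, upper bound, row count).
Bin = Tuple[float, float, int]


def equi_width(values: Sequence[float], num_bins: int) -> List[Bin]:
    """Fixed-width intervals; the count (height) varies per interval."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins or 1.0  # avoid a zero width when all values are equal
    counts = [0] * num_bins
    for v in values:
        # Clamp so that the maximum value lands in the last interval.
        idx = min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1
    return [(lo + i * width, lo + (i + 1) * width, counts[i]) for i in range(num_bins)]


def equi_height(values: Sequence[float], num_bins: int) -> List[Bin]:
    """Intervals of varying width that each hold roughly the same number of rows."""
    ordered = sorted(values)
    n = len(ordered)
    bins: List[Bin] = []
    for i in range(num_bins):
        start = i * n // num_bins
        end = (i + 1) * n // num_bins
        if start < end:  # skip empty slices when there are fewer rows than bins
            bins.append((ordered[start], ordered[end - 1], end - start))
    return bins


def build_histogram(values: Sequence[float], num_bins: int = 4) -> Tuple[str, List[Bin]]:
    """Pick the histogram kind from the number of distinct values (NDV)."""
    ndv = len(set(values))
    if ndv < NDV_THRESHOLD:
        return "equi-width", equi_width(values, num_bins)
    return "equi-height", equi_height(values, num_bins)
```

For example, build_histogram(list(range(1000)), 4) has 1000 distinct values
and therefore takes the equi-height path, with 250 rows in each of the four
intervals. A small skewed input such as [1, 1, 1, 1, 1, 2, 3, 100] takes the
equi-width path, and almost all rows pile up in the first interval, which is
the kind of skew the equi-height form is meant to handle better.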



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
