[ 
https://issues.apache.org/jira/browse/SPARK-18000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhenhua Wang updated SPARK-18000:
---------------------------------
    Description: 
For a column, we will generate an equi-width or equi-height histogram, 
depending on whether its number of distinct values (ndv) is larger than the 
maximum number of bins allowed in one histogram (denoted as numBins).
When the ndv is less than or equal to numBins, the agg function for the column 
returns the bins of an equi-width histogram as (distinct value, frequency) 
pairs. Otherwise, 1) for a column of string type, it returns an empty map; 
2) for a column of numeric type (including DateType and TimestampType), it 
returns the endpoints of an equi-height histogram, i.e. the approximate 
percentiles at percentages 0.0, 1/numBins, 2/numBins, ..., 
(numBins-1)/numBins, 1.0.

  was:
For a column of numeric type (including date and timestamp), we will generate 
an equi-width or equi-height histogram, depending on whether its number of 
distinct values (ndv) is larger than the maximum number of bins allowed in one 
histogram (denoted as numBins).
This agg function computes values and their frequencies using a small hashmap, 
whose size is less than or equal to "numBins", and returns an equi-width 
histogram. 
When the size of the hashmap exceeds "numBins", it clears the hashmap and uses 
ApproximatePercentile to return the endpoints of an equi-height histogram.


> Aggregation function for computing endpoints for histograms
> -----------------------------------------------------------
>
>                 Key: SPARK-18000
>                 URL: https://issues.apache.org/jira/browse/SPARK-18000
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Zhenhua Wang
>
> For a column, we will generate an equi-width or equi-height histogram, 
> depending on whether its number of distinct values (ndv) is larger than the 
> maximum number of bins allowed in one histogram (denoted as numBins).
> When the ndv is less than or equal to numBins, the agg function for the 
> column returns the bins of an equi-width histogram as (distinct value, 
> frequency) pairs. Otherwise, 1) for a column of string type, it returns an 
> empty map; 2) for a column of numeric type (including DateType and 
> TimestampType), it returns the endpoints of an equi-height histogram, i.e. 
> the approximate percentiles at percentages 0.0, 1/numBins, 2/numBins, ..., 
> (numBins-1)/numBins, 1.0.
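
The decision logic in the description can be illustrated with a minimal Python sketch. This is not the Spark implementation: the function name histogram_endpoints and its is_string flag are hypothetical, and an exact nearest-rank percentile over a sorted list stands in for the approximate percentiles that the description refers to.

```python
from collections import Counter


def histogram_endpoints(values, num_bins, is_string=False):
    """Illustrative sketch only; names and signature are hypothetical."""
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)

    # Case 1: ndv <= numBins, so return (distinct value, frequency) pairs.
    if len(counts) <= num_bins:
        return dict(counts)

    # Case 2a: too many distinct values in a string column -> empty map.
    if is_string:
        return {}

    # Case 2b: numeric column -> endpoints of an equi-height histogram at
    # percentages 0, 1/numBins, ..., 1. Exact nearest-rank percentiles are
    # used here as a stand-in for approximate percentiles.
    ordered = sorted(non_null)
    n = len(ordered)
    endpoints = []
    for i in range(num_bins + 1):
        # Integer form of ceil(i * n / num_bins), clamped to at least rank 1.
        rank = max(1, (i * n + num_bins - 1) // num_bins)
        endpoints.append(ordered[rank - 1])
    return endpoints
```

With numBins = 4, a column holding the integers 1 through 8 has eight distinct values, so the sketch returns five endpoints (numBins + 1), while a column with three distinct values returns a map of three (value, frequency) pairs. A real implementation would compute this in a single streaming pass rather than by sorting the whole column.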


