[ https://issues.apache.org/jira/browse/SPARK-18000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zhenhua Wang updated SPARK-18000:
---------------------------------
    Description:
For a column, we will generate an equi-width or equi-height histogram, depending on whether its number of distinct values (ndv) is larger than the maximum number of bins allowed in one histogram (denoted as numBins).
The agg function for a column returns bins - (distinct value, frequency) pairs of an equi-width histogram - when the number of distinct values is less than or equal to numBins. Otherwise, 1) for a column of string type, it returns an empty map; 2) for a column of numeric type (including DateType and TimestampType), it returns the endpoints of an equi-height histogram - approximate percentiles at percentages 0.0, 1/numBins, 2/numBins, ..., (numBins-1)/numBins, 1.0.

  was:
For a column of numeric type (including date and timestamp), we will generate an equi-width or equi-height histogram, depending on whether its ndv is larger than the maximum number of bins allowed in one histogram (denoted as numBins).
This agg function computes values and their frequencies using a small hashmap, whose size is less than or equal to "numBins", and returns an equi-width histogram.
When the size of the hashmap exceeds "numBins", it clears the hashmap and uses ApproximatePercentile to return the endpoints of an equi-height histogram.

> Aggregation function for computing endpoints for histograms
> -----------------------------------------------------------
>
>                 Key: SPARK-18000
>                 URL: https://issues.apache.org/jira/browse/SPARK-18000
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Zhenhua Wang
>
> For a column, we will generate an equi-width or equi-height histogram,
> depending on whether its number of distinct values (ndv) is larger than the
> maximum number of bins allowed in one histogram (denoted as numBins).
> The agg function for a column returns bins - (distinct value, frequency)
> pairs of an equi-width histogram - when the number of distinct values is
> less than or equal to numBins. Otherwise, 1) for a column of string type, it
> returns an empty map; 2) for a column of numeric type (including DateType and
> TimestampType), it returns the endpoints of an equi-height histogram -
> approximate percentiles at percentages 0.0, 1/numBins, 2/numBins, ...,
> (numBins-1)/numBins, 1.0.
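
To make the three return cases in the description concrete, here is a minimal Python sketch of the decision logic. It is not the Spark implementation: the function name column_histogram and its signature are hypothetical, it works on an in-memory list rather than as a distributed aggregate, and it substitutes exact nearest-rank percentiles on sorted data for the approximate percentiles (ApproximatePercentile) that the issue mentions.

```python
from collections import Counter
from numbers import Number


def column_histogram(values, num_bins):
    """Hypothetical helper mirroring the cases described in the issue.

    Returns a dict of (distinct value -> frequency) when the number of
    distinct values is <= num_bins; otherwise an empty dict for
    non-numeric (e.g. string) columns, or a list of num_bins + 1
    endpoints for numeric columns.
    """
    if num_bins < 1:
        raise ValueError("num_bins must be at least 1")

    freq = Counter(values)
    if len(freq) <= num_bins:
        # Few distinct values: one bin per value with its frequency.
        return dict(freq)

    if not all(isinstance(v, Number) for v in values):
        # Too many distinct values in a string column: empty map.
        return {}

    # Numeric column with many distinct values: endpoints at percentages
    # 0, 1/num_bins, ..., 1. Exact nearest-rank percentiles are used here
    # as a stand-in for an approximate percentile algorithm.
    ordered = sorted(values)
    n = len(ordered)
    endpoints = []
    for k in range(num_bins + 1):
        rank = max(-(-k * n // num_bins), 1)  # ceil(k * n / num_bins), min 1
        endpoints.append(ordered[rank - 1])
    return endpoints
```

For example, under these assumptions column_histogram([1, 1, 2, 3, 3, 3], 4) gives the frequency map {1: 2, 2: 1, 3: 3}, while a numeric column with more than num_bins distinct values gives a list whose first and last entries are the column minimum and maximum.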