Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19774#discussion_r152264097
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala ---
    @@ -689,6 +689,11 @@ case class DescribeColumnCommand(
           buffer += Row("distinct_count", 
cs.map(_.distinctCount.toString).getOrElse("NULL"))
           buffer += Row("avg_col_len", 
cs.map(_.avgLen.toString).getOrElse("NULL"))
           buffer += Row("max_col_len", 
cs.map(_.maxLen.toString).getOrElse("NULL"))
    +      buffer ++= cs.flatMap(_.histogram.map { hist =>
    +        val header = Row("histogram", s"height: ${hist.height}, 
num_of_bins: ${hist.bins.length}")
    +        Seq(header) ++ hist.bins.map(bin =>
    +          Row("", s"lower_bound: ${bin.lo}, upper_bound: ${bin.hi}, 
distinct_count: ${bin.ndv}"))
    --- End diff ---
    
    @gatorsmile In Hive, there is no histogram implementation yet (HIVE-3526). In Oracle and MySQL, the histogram information is stored in metadata tables which can be queried (https://docs.oracle.com/cloud/latest/db112/REFRN/statviews_2106.htm#REFRN20279), and each histogram entry mainly carries two pieces of information (a small sketch follows the list):
     - the cumulative count so far;
     - the endpoint value for the current bin.
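
    For illustration only, here is a minimal Scala sketch (hypothetical names, not from this PR nor from the Oracle/MySQL metadata schema; runnable in the Scala REPL) of how such cumulative-count / endpoint entries can be differenced into per-bin counts, whereas the equi-height bins in the diff above directly expose lower_bound, upper_bound and distinct_count per bin:

        // Hypothetical shape of an Oracle/MySQL-style histogram entry:
        // each entry stores a bin endpoint and the cumulative count up to it.
        case class EndpointEntry(endpointValue: Double, cumulativeCount: Long)

        // The per-bin count is recovered by differencing consecutive
        // cumulative counts.
        def perBinCounts(entries: Seq[EndpointEntry]): Seq[Long] =
          entries.map(_.cumulativeCount)
            .foldLeft((0L, Seq.empty[Long])) { case ((prev, acc), cum) =>
              (cum, acc :+ (cum - prev))
            }._2

        // Example: endpoints 10, 20, 30 with cumulative counts 100, 250, 400
        // correspond to per-bin counts 100, 150, 150.
        val entries = Seq(
          EndpointEntry(10, 100L), EndpointEntry(20, 250L), EndpointEntry(30, 400L))
        assert(perBinCounts(entries) == Seq(100L, 150L, 150L))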

