[GitHub] spark pull request #19774: [SPARK-22475][SQL] show histogram in DESC COLUMN ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19774 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19774: [SPARK-22475][SQL] show histogram in DESC COLUMN ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19774#discussion_r152299177 --- Diff: sql/core/src/test/resources/sql-tests/inputs/describe-table-column.sql --- @@ -34,6 +34,19 @@ DESC FORMATTED desc_complex_col_table col; -- Describe a nested column DESC FORMATTED desc_complex_col_table col.x; +-- Test output for histogram statistics +SET spark.sql.statistics.histogram.enabled=true; +SET spark.sql.statistics.histogram.numBins=2; + +INSERT INTO desc_col_table values 1, 2, 3, 4; + +ANALYZE TABLE desc_col_table COMPUTE STATISTICS FOR COLUMNS key; + +DESC EXTENDED desc_col_table key; + +SET spark.sql.statistics.histogram.enabled=false; +SET spark.sql.statistics.histogram.numBins=256; --- End diff -- You can also argue that this increase the maintenance cost as we need to change it if we change the default value of these configs. I think the general principle is, set the config if the following queries depend on it. At the end of test, there is no following queries and we don't need to do anything to the configs. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19774: [SPARK-22475][SQL] show histogram in DESC COLUMN ...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/19774#discussion_r152295430 --- Diff: sql/core/src/test/resources/sql-tests/inputs/describe-table-column.sql --- @@ -34,6 +34,19 @@ DESC FORMATTED desc_complex_col_table col; -- Describe a nested column DESC FORMATTED desc_complex_col_table col.x; +-- Test output for histogram statistics +SET spark.sql.statistics.histogram.enabled=true; +SET spark.sql.statistics.histogram.numBins=2; + +INSERT INTO desc_col_table values 1, 2, 3, 4; + +ANALYZE TABLE desc_col_table COMPUTE STATISTICS FOR COLUMNS key; + +DESC EXTENDED desc_col_table key; + +SET spark.sql.statistics.histogram.enabled=false; +SET spark.sql.statistics.histogram.numBins=256; --- End diff -- isn't it better to reset them to prevent possible future tests eventually added after these lines to be influenced? If no, I can remove them. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19774: [SPARK-22475][SQL] show histogram in DESC COLUMN ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19774#discussion_r152294075 --- Diff: sql/core/src/test/resources/sql-tests/inputs/describe-table-column.sql --- @@ -34,6 +34,19 @@ DESC FORMATTED desc_complex_col_table col; -- Describe a nested column DESC FORMATTED desc_complex_col_table col.x; +-- Test output for histogram statistics +SET spark.sql.statistics.histogram.enabled=true; +SET spark.sql.statistics.histogram.numBins=2; + +INSERT INTO desc_col_table values 1, 2, 3, 4; + +ANALYZE TABLE desc_col_table COMPUTE STATISTICS FOR COLUMNS key; + +DESC EXTENDED desc_col_table key; + +SET spark.sql.statistics.histogram.enabled=false; +SET spark.sql.statistics.histogram.numBins=256; --- End diff -- We don't need to reset the config, the test framework will reset the entire session after each test. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19774: [SPARK-22475][SQL] show histogram in DESC COLUMN ...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/19774#discussion_r152265613 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala --- @@ -689,6 +689,11 @@ case class DescribeColumnCommand( buffer += Row("distinct_count", cs.map(_.distinctCount.toString).getOrElse("NULL")) buffer += Row("avg_col_len", cs.map(_.avgLen.toString).getOrElse("NULL")) buffer += Row("max_col_len", cs.map(_.maxLen.toString).getOrElse("NULL")) + buffer ++= cs.flatMap(_.histogram.map { hist => +val header = Row("histogram", s"height: ${hist.height}, num_of_bins: ${hist.bins.length}") +Seq(header) ++ hist.bins.map(bin => + Row("", s"lower_bound: ${bin.lo}, upper_bound: ${bin.hi}, distinct_count: ${bin.ndv}")) + }).getOrElse(Seq(Row("histogram", "NULL"))) --- End diff -- Hive has no histogram --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19774: [SPARK-22475][SQL] show histogram in DESC COLUMN ...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/19774#discussion_r152265034 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala --- @@ -689,6 +689,11 @@ case class DescribeColumnCommand( buffer += Row("distinct_count", cs.map(_.distinctCount.toString).getOrElse("NULL")) buffer += Row("avg_col_len", cs.map(_.avgLen.toString).getOrElse("NULL")) buffer += Row("max_col_len", cs.map(_.maxLen.toString).getOrElse("NULL")) + buffer ++= cs.flatMap(_.histogram.map { hist => +val header = Row("histogram", s"height: ${hist.height}, num_of_bins: ${hist.bins.length}") +Seq(header) ++ hist.bins.map(bin => + Row("", s"lower_bound: ${bin.lo}, upper_bound: ${bin.hi}, distinct_count: ${bin.ndv}")) --- End diff -- @henryr I think verbosity is a minor issue, since we are returning a `DataFrame` and the user can decide what to show. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19774: [SPARK-22475][SQL] show histogram in DESC COLUMN ...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/19774#discussion_r152264097 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala --- @@ -689,6 +689,11 @@ case class DescribeColumnCommand( buffer += Row("distinct_count", cs.map(_.distinctCount.toString).getOrElse("NULL")) buffer += Row("avg_col_len", cs.map(_.avgLen.toString).getOrElse("NULL")) buffer += Row("max_col_len", cs.map(_.maxLen.toString).getOrElse("NULL")) + buffer ++= cs.flatMap(_.histogram.map { hist => +val header = Row("histogram", s"height: ${hist.height}, num_of_bins: ${hist.bins.length}") +Seq(header) ++ hist.bins.map(bin => + Row("", s"lower_bound: ${bin.lo}, upper_bound: ${bin.hi}, distinct_count: ${bin.ndv}")) --- End diff -- @gatorsmile In Hive, there isn't yet a histogram implementation (HIVE-3526). In Oracle and MySQL, the information is stored in metadata tables which can be queried (https://docs.oracle.com/cloud/latest/db112/REFRN/statviews_2106.htm#REFRN20279) and here we have mainly two information: - the cumulative count so far; - the endpoint value for the current bin. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19774: [SPARK-22475][SQL] show histogram in DESC COLUMN ...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19774#discussion_r152158010 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala --- @@ -689,6 +689,11 @@ case class DescribeColumnCommand( buffer += Row("distinct_count", cs.map(_.distinctCount.toString).getOrElse("NULL")) buffer += Row("avg_col_len", cs.map(_.avgLen.toString).getOrElse("NULL")) buffer += Row("max_col_len", cs.map(_.maxLen.toString).getOrElse("NULL")) + buffer ++= cs.flatMap(_.histogram.map { hist => +val header = Row("histogram", s"height: ${hist.height}, num_of_bins: ${hist.bins.length}") +Seq(header) ++ hist.bins.map(bin => + Row("", s"lower_bound: ${bin.lo}, upper_bound: ${bin.hi}, distinct_count: ${bin.ndv}")) --- End diff -- What is the output contents/formats of the other vendors? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19774: [SPARK-22475][SQL] show histogram in DESC COLUMN ...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19774#discussion_r152157864 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala --- @@ -689,6 +689,11 @@ case class DescribeColumnCommand( buffer += Row("distinct_count", cs.map(_.distinctCount.toString).getOrElse("NULL")) buffer += Row("avg_col_len", cs.map(_.avgLen.toString).getOrElse("NULL")) buffer += Row("max_col_len", cs.map(_.maxLen.toString).getOrElse("NULL")) + buffer ++= cs.flatMap(_.histogram.map { hist => +val header = Row("histogram", s"height: ${hist.height}, num_of_bins: ${hist.bins.length}") +Seq(header) ++ hist.bins.map(bin => + Row("", s"lower_bound: ${bin.lo}, upper_bound: ${bin.hi}, distinct_count: ${bin.ndv}")) + }).getOrElse(Seq(Row("histogram", "NULL"))) --- End diff -- What is the output of Hive? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19774: [SPARK-22475][SQL] show histogram in DESC COLUMN ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19774#discussion_r152100465 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala --- @@ -689,6 +689,11 @@ case class DescribeColumnCommand( buffer += Row("distinct_count", cs.map(_.distinctCount.toString).getOrElse("NULL")) buffer += Row("avg_col_len", cs.map(_.avgLen.toString).getOrElse("NULL")) buffer += Row("max_col_len", cs.map(_.maxLen.toString).getOrElse("NULL")) + buffer ++= cs.flatMap(_.histogram.map { hist => +val header = Row("histogram", s"height: ${hist.height}, num_of_bins: ${hist.bins.length}") +Seq(header) ++ hist.bins.map(bin => + Row("", s"lower_bound: ${bin.lo}, upper_bound: ${bin.hi}, distinct_count: ${bin.ndv}")) --- End diff -- for debug purpose, I'd like to have as more information as possible. In the future maybe we can simplify the `DESC COLUMN` and use `spark.catalog.getColumn` to get detailed information. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19774: [SPARK-22475][SQL] show histogram in DESC COLUMN ...
Github user henryr commented on a diff in the pull request: https://github.com/apache/spark/pull/19774#discussion_r152099227 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala --- @@ -689,6 +689,11 @@ case class DescribeColumnCommand( buffer += Row("distinct_count", cs.map(_.distinctCount.toString).getOrElse("NULL")) buffer += Row("avg_col_len", cs.map(_.avgLen.toString).getOrElse("NULL")) buffer += Row("max_col_len", cs.map(_.maxLen.toString).getOrElse("NULL")) + buffer ++= cs.flatMap(_.histogram.map { hist => +val header = Row("histogram", s"height: ${hist.height}, num_of_bins: ${hist.bins.length}") +Seq(header) ++ hist.bins.map(bin => + Row("", s"lower_bound: ${bin.lo}, upper_bound: ${bin.hi}, distinct_count: ${bin.ndv}")) --- End diff -- yeah, I was wondering if we actually need to see every individual bucket instead of a summary (max bucket: 2.0->4.0 w/128 entries, min bucket: 6.0->8.0 w/0 entries, median bucket size: 32). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19774: [SPARK-22475][SQL] show histogram in DESC COLUMN ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19774#discussion_r152098557 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala --- @@ -689,6 +689,11 @@ case class DescribeColumnCommand( buffer += Row("distinct_count", cs.map(_.distinctCount.toString).getOrElse("NULL")) buffer += Row("avg_col_len", cs.map(_.avgLen.toString).getOrElse("NULL")) buffer += Row("max_col_len", cs.map(_.maxLen.toString).getOrElse("NULL")) + buffer ++= cs.flatMap(_.histogram.map { hist => +val header = Row("histogram", s"height: ${hist.height}, num_of_bins: ${hist.bins.length}") +Seq(header) ++ hist.bins.map(bin => + Row("", s"lower_bound: ${bin.lo}, upper_bound: ${bin.hi}, distinct_count: ${bin.ndv}")) --- End diff -- kind of, but I don't have a better idea, adding a new flag to `DESC COLUMNS` seems an overkill. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19774: [SPARK-22475][SQL] show histogram in DESC COLUMN ...
Github user henryr commented on a diff in the pull request: https://github.com/apache/spark/pull/19774#discussion_r152096698 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala --- @@ -689,6 +689,11 @@ case class DescribeColumnCommand( buffer += Row("distinct_count", cs.map(_.distinctCount.toString).getOrElse("NULL")) buffer += Row("avg_col_len", cs.map(_.avgLen.toString).getOrElse("NULL")) buffer += Row("max_col_len", cs.map(_.maxLen.toString).getOrElse("NULL")) + buffer ++= cs.flatMap(_.histogram.map { hist => +val header = Row("histogram", s"height: ${hist.height}, num_of_bins: ${hist.bins.length}") +Seq(header) ++ hist.bins.map(bin => + Row("", s"lower_bound: ${bin.lo}, upper_bound: ${bin.hi}, distinct_count: ${bin.ndv}")) --- End diff -- Do you think this (printing one line per bucket) might become too verbose if there are many buckets? In an interactive session writing 256 buckets, one-per-line is likely to push the non-histogram output off the top of the screen. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19774: [SPARK-22475][SQL] show histogram in DESC COLUMN ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19774#discussion_r152005780 --- Diff: sql/core/src/test/resources/sql-tests/inputs/describe-table-column.sql --- @@ -24,6 +24,19 @@ DESC EXTENDED desc_col_table key; DESC FORMATTED desc_col_table key; +-- Test output for histogram statistics +SET spark.sql.statistics.histogram.enabled=true; +SET spark.sql.statistics.histogram.numBins=2; + +INSERT INTO desc_col_table values 1, 2, 3, 4; + +ANALYZE TABLE desc_col_table COMPUTE STATISTICS FOR COLUMNS key; + +DESC EXTENDED desc_col_table key; --- End diff -- shall we move these to the end of this file? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19774: [SPARK-22475][SQL] show histogram in DESC COLUMN ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19774#discussion_r152005513 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala --- @@ -689,6 +689,11 @@ case class DescribeColumnCommand( buffer += Row("distinct_count", cs.map(_.distinctCount.toString).getOrElse("NULL")) buffer += Row("avg_col_len", cs.map(_.avgLen.toString).getOrElse("NULL")) buffer += Row("max_col_len", cs.map(_.maxLen.toString).getOrElse("NULL")) + buffer ++= cs.flatMap(_.histogram.map { hist => +val header = Row("histogram", s"height: ${hist.height}, num_of_bins: ${hist.bins.length}") +Seq(header) ++ hist.bins.map(bin => + Row("", s"lower_bound: ${bin.lo}, upper_bound: ${bin.hi}, distinct_count: ${bin.ndv}")) + }).getOrElse(Seq(Row("histogram", "NULL"))) --- End diff -- If no histogram info, I'd like to print nothing here. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19774: [SPARK-22475][SQL] show histogram in DESC COLUMN ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19774#discussion_r152005342 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala --- @@ -689,6 +689,11 @@ case class DescribeColumnCommand( buffer += Row("distinct_count", cs.map(_.distinctCount.toString).getOrElse("NULL")) buffer += Row("avg_col_len", cs.map(_.avgLen.toString).getOrElse("NULL")) buffer += Row("max_col_len", cs.map(_.maxLen.toString).getOrElse("NULL")) + buffer ++= cs.flatMap(_.histogram.map { hist => +val header = Row("histogram", s"height: ${hist.height}, num_of_bins: ${hist.bins.length}") +Seq(header) ++ hist.bins.map(bin => + Row("", s"lower_bound: ${bin.lo}, upper_bound: ${bin.hi}, distinct_count: ${bin.ndv}")) --- End diff -- maybe adding the bin id? ``` histogram height: 2.0, num_of_bins: 2 bin0 lower_bound: 1.0, upper_bound: 2.0, distinct_count: 2 bin1 lower_bound: 2.0, upper_bound: 4.0, distinct_count: 2 ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19774: [SPARK-22475][SQL] show histogram in DESC COLUMN ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19774#discussion_r152004929 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala --- @@ -689,6 +689,11 @@ case class DescribeColumnCommand( buffer += Row("distinct_count", cs.map(_.distinctCount.toString).getOrElse("NULL")) buffer += Row("avg_col_len", cs.map(_.avgLen.toString).getOrElse("NULL")) buffer += Row("max_col_len", cs.map(_.maxLen.toString).getOrElse("NULL")) + buffer ++= cs.flatMap(_.histogram.map { hist => --- End diff -- yea I think for this case the for loop is conventional --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19774: [SPARK-22475][SQL] show histogram in DESC COLUMN ...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19774#discussion_r151892566 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala --- @@ -689,6 +689,11 @@ case class DescribeColumnCommand( buffer += Row("distinct_count", cs.map(_.distinctCount.toString).getOrElse("NULL")) buffer += Row("avg_col_len", cs.map(_.avgLen.toString).getOrElse("NULL")) buffer += Row("max_col_len", cs.map(_.maxLen.toString).getOrElse("NULL")) + buffer ++= cs.flatMap(_.histogram.map { hist => +val header = Row("histogram", s"height: ${hist.height}, num_of_bins: ${hist.bins.length}") +Seq(header) ++ hist.bins.map(bin => + Row("", s"lower_bound: ${bin.lo}, upper_bound: ${bin.hi}, distinct_count: ${bin.ndv}")) --- End diff -- And how about making each class member as a separate field? ``` histogram height: 2.0 num_of_bins: 2 histogram_bin lower_bound: 1.0upper_bound: 2.0 distinct_count: 2 histogram_bin lower_bound: 2.0upper_bound: 4.0 distinct_count: 2 ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19774: [SPARK-22475][SQL] show histogram in DESC COLUMN ...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19774#discussion_r151892351 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala --- @@ -689,6 +689,11 @@ case class DescribeColumnCommand( buffer += Row("distinct_count", cs.map(_.distinctCount.toString).getOrElse("NULL")) buffer += Row("avg_col_len", cs.map(_.avgLen.toString).getOrElse("NULL")) buffer += Row("max_col_len", cs.map(_.maxLen.toString).getOrElse("NULL")) + buffer ++= cs.flatMap(_.histogram.map { hist => +val header = Row("histogram", s"height: ${hist.height}, num_of_bins: ${hist.bins.length}") +Seq(header) ++ hist.bins.map(bin => + Row("", s"lower_bound: ${bin.lo}, upper_bound: ${bin.hi}, distinct_count: ${bin.ndv}")) --- End diff -- @mgaido91 @jaceklaskowski How about using "histogram_bin" instead of the empty string? ``` histogram height: 2.0, num_of_bins: 2 histogram_bin lower_bound: 1.0, upper_bound: 2.0, distinct_count: 2 histogram_bin lower_bound: 2.0, upper_bound: 4.0, distinct_count: 2 ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19774: [SPARK-22475][SQL] show histogram in DESC COLUMN ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19774#discussion_r151855965 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala --- @@ -689,6 +689,11 @@ case class DescribeColumnCommand( buffer += Row("distinct_count", cs.map(_.distinctCount.toString).getOrElse("NULL")) buffer += Row("avg_col_len", cs.map(_.avgLen.toString).getOrElse("NULL")) buffer += Row("max_col_len", cs.map(_.maxLen.toString).getOrElse("NULL")) + buffer ++= cs.flatMap(_.histogram.map { hist => +val header = Row("histogram", s"height: ${hist.height}, num_of_bins: ${hist.bins.length}") +Seq(header) ++ hist.bins.map(bin => + Row("", s"lower_bound: ${bin.lo}, upper_bound: ${bin.hi}, distinct_count: ${bin.ndv}")) + }).getOrElse(Seq(Row("histogram", "NULL"))) --- End diff -- Some comments or cleanup here would be nicer though --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19774: [SPARK-22475][SQL] show histogram in DESC COLUMN ...
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/19774#discussion_r151838625 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala --- @@ -689,6 +689,11 @@ case class DescribeColumnCommand( buffer += Row("distinct_count", cs.map(_.distinctCount.toString).getOrElse("NULL")) buffer += Row("avg_col_len", cs.map(_.avgLen.toString).getOrElse("NULL")) buffer += Row("max_col_len", cs.map(_.maxLen.toString).getOrElse("NULL")) + buffer ++= cs.flatMap(_.histogram.map { hist => +val header = Row("histogram", s"height: ${hist.height}, num_of_bins: ${hist.bins.length}") +Seq(header) ++ hist.bins.map(bin => + Row("", s"lower_bound: ${bin.lo}, upper_bound: ${bin.hi}, distinct_count: ${bin.ndv}")) --- End diff -- @wzhfy I'd rather define a `val` with the comment being the name of the val. That would make it "compile-safe". --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19774: [SPARK-22475][SQL] show histogram in DESC COLUMN ...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19774#discussion_r151826928 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala --- @@ -689,6 +689,11 @@ case class DescribeColumnCommand( buffer += Row("distinct_count", cs.map(_.distinctCount.toString).getOrElse("NULL")) buffer += Row("avg_col_len", cs.map(_.avgLen.toString).getOrElse("NULL")) buffer += Row("max_col_len", cs.map(_.maxLen.toString).getOrElse("NULL")) + buffer ++= cs.flatMap(_.histogram.map { hist => +val header = Row("histogram", s"height: ${hist.height}, num_of_bins: ${hist.bins.length}") +Seq(header) ++ hist.bins.map(bin => + Row("", s"lower_bound: ${bin.lo}, upper_bound: ${bin.hi}, distinct_count: ${bin.ndv}")) --- End diff -- add a comment that the empty column is for indentation and better readability? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19774: [SPARK-22475][SQL] show histogram in DESC COLUMN ...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/19774#discussion_r151741011 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala --- @@ -689,6 +689,11 @@ case class DescribeColumnCommand( buffer += Row("distinct_count", cs.map(_.distinctCount.toString).getOrElse("NULL")) buffer += Row("avg_col_len", cs.map(_.avgLen.toString).getOrElse("NULL")) buffer += Row("max_col_len", cs.map(_.maxLen.toString).getOrElse("NULL")) + buffer ++= cs.flatMap(_.histogram.map { hist => --- End diff -- Thanks for your feedback. IMHO I am not sure that it would be easier to read. But if someone else agrees with you I can change it, thanks. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19774: [SPARK-22475][SQL] show histogram in DESC COLUMN ...
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/19774#discussion_r151737674 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala --- @@ -689,6 +689,11 @@ case class DescribeColumnCommand( buffer += Row("distinct_count", cs.map(_.distinctCount.toString).getOrElse("NULL")) buffer += Row("avg_col_len", cs.map(_.avgLen.toString).getOrElse("NULL")) buffer += Row("max_col_len", cs.map(_.maxLen.toString).getOrElse("NULL")) + buffer ++= cs.flatMap(_.histogram.map { hist => --- End diff -- I'm pretty sure that for-comprehension would make the code read easier. ```scala for { c <- cs hist <- c.histogram ... } yield ... ``` Let me know if you need help with that. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19774: [SPARK-22475][SQL] show histogram in DESC COLUMN ...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19774#discussion_r151693461 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala --- @@ -689,6 +689,11 @@ case class DescribeColumnCommand( buffer += Row("distinct_count", cs.map(_.distinctCount.toString).getOrElse("NULL")) buffer += Row("avg_col_len", cs.map(_.avgLen.toString).getOrElse("NULL")) buffer += Row("max_col_len", cs.map(_.maxLen.toString).getOrElse("NULL")) + buffer ++= cs.flatMap(_.histogram.map { hist => +val header = Row("histogram", s"height: ${hist.height}, num_of_bins: ${hist.bins.length}") +Seq(header) ++ hist.bins.map(bin => Row("", --- End diff -- nit: `hist.bins.map { bin =>` And is it possible to move `Row("",` to the next line? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19774: [SPARK-22475][SQL] show histogram in DESC COLUMN ...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19774#discussion_r151693478 --- Diff: sql/core/src/test/resources/sql-tests/inputs/describe-table-column.sql --- @@ -24,6 +24,18 @@ DESC EXTENDED desc_col_table key; DESC FORMATTED desc_col_table key; +SET spark.sql.statistics.histogram.enabled=true; +SET spark.sql.statistics.histogram.numBins=2; + +INSERT INTO desc_col_table values(1); +INSERT INTO desc_col_table values(2); +INSERT INTO desc_col_table values(3); +INSERT INTO desc_col_table values(4); + +ANALYZE TABLE desc_col_table COMPUTE STATISTICS FOR COLUMNS key; --- End diff -- please set the sql conf back to default value. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19774: [SPARK-22475][SQL] show histogram in DESC COLUMN ...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19774#discussion_r151689883 --- Diff: sql/core/src/test/resources/sql-tests/inputs/describe-table-column.sql --- @@ -24,6 +24,18 @@ DESC EXTENDED desc_col_table key; DESC FORMATTED desc_col_table key; +SET spark.sql.statistics.histogram.enabled=true; +SET spark.sql.statistics.histogram.numBins=2; + +INSERT INTO desc_col_table values(1); +INSERT INTO desc_col_table values(2); +INSERT INTO desc_col_table values(3); +INSERT INTO desc_col_table values(4); --- End diff -- INSERT INTO desc_col_table values 1, 2, 3, 4 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19774: [SPARK-22475][SQL] show histogram in DESC COLUMN ...
GitHub user mgaido91 opened a pull request: https://github.com/apache/spark/pull/19774 [SPARK-22475][SQL] show histogram in DESC COLUMN command ## What changes were proposed in this pull request? Added the histogram representation to the output of the `DESCRIBE EXTENDED table_name column_name` command. ## How was this patch tested? Modified SQL UT and checked output Please review http://spark.apache.org/contributing.html before opening a pull request. You can merge this pull request into a Git repository by running: $ git pull https://github.com/mgaido91/spark SPARK-22475 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19774.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19774 commit 24bfcb1132d35ffa8ba2341a7ea9057b14b5ab8a Author: Marco GaidoDate: 2017-11-17T12:42:16Z [SPARK-22475][SQL] show histogram in DESC COLUMN command --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org