[GitHub] spark pull request #19774: [SPARK-22475][SQL] show histogram in DESC COLUMN ...

2017-11-21 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19774


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19774: [SPARK-22475][SQL] show histogram in DESC COLUMN ...

2017-11-21 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19774#discussion_r152299177
  
--- Diff: 
sql/core/src/test/resources/sql-tests/inputs/describe-table-column.sql ---
@@ -34,6 +34,19 @@ DESC FORMATTED desc_complex_col_table col;
 -- Describe a nested column
 DESC FORMATTED desc_complex_col_table col.x;
 
+-- Test output for histogram statistics
+SET spark.sql.statistics.histogram.enabled=true;
+SET spark.sql.statistics.histogram.numBins=2;
+
+INSERT INTO desc_col_table values 1, 2, 3, 4;
+
+ANALYZE TABLE desc_col_table COMPUTE STATISTICS FOR COLUMNS key;
+
+DESC EXTENDED desc_col_table key;
+
+SET spark.sql.statistics.histogram.enabled=false;
+SET spark.sql.statistics.histogram.numBins=256;
--- End diff --

You can also argue that this increase the maintenance cost as we need to 
change it if we change the default value of these configs. I think the general 
principle is, set the config if the following queries depend on it. At the end 
of test, there is no following queries and we don't need to do anything to the 
configs.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19774: [SPARK-22475][SQL] show histogram in DESC COLUMN ...

2017-11-21 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19774#discussion_r152295430
  
--- Diff: 
sql/core/src/test/resources/sql-tests/inputs/describe-table-column.sql ---
@@ -34,6 +34,19 @@ DESC FORMATTED desc_complex_col_table col;
 -- Describe a nested column
 DESC FORMATTED desc_complex_col_table col.x;
 
+-- Test output for histogram statistics
+SET spark.sql.statistics.histogram.enabled=true;
+SET spark.sql.statistics.histogram.numBins=2;
+
+INSERT INTO desc_col_table values 1, 2, 3, 4;
+
+ANALYZE TABLE desc_col_table COMPUTE STATISTICS FOR COLUMNS key;
+
+DESC EXTENDED desc_col_table key;
+
+SET spark.sql.statistics.histogram.enabled=false;
+SET spark.sql.statistics.histogram.numBins=256;
--- End diff --

isn't it better to reset them to prevent possible future tests eventually 
added after these lines to be influenced? If no, I can remove them.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19774: [SPARK-22475][SQL] show histogram in DESC COLUMN ...

2017-11-21 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19774#discussion_r152294075
  
--- Diff: 
sql/core/src/test/resources/sql-tests/inputs/describe-table-column.sql ---
@@ -34,6 +34,19 @@ DESC FORMATTED desc_complex_col_table col;
 -- Describe a nested column
 DESC FORMATTED desc_complex_col_table col.x;
 
+-- Test output for histogram statistics
+SET spark.sql.statistics.histogram.enabled=true;
+SET spark.sql.statistics.histogram.numBins=2;
+
+INSERT INTO desc_col_table values 1, 2, 3, 4;
+
+ANALYZE TABLE desc_col_table COMPUTE STATISTICS FOR COLUMNS key;
+
+DESC EXTENDED desc_col_table key;
+
+SET spark.sql.statistics.histogram.enabled=false;
+SET spark.sql.statistics.histogram.numBins=256;
--- End diff --

We don't need to reset the config, the test framework will reset the entire 
session after each test.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19774: [SPARK-22475][SQL] show histogram in DESC COLUMN ...

2017-11-21 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19774#discussion_r152265613
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala ---
@@ -689,6 +689,11 @@ case class DescribeColumnCommand(
   buffer += Row("distinct_count", 
cs.map(_.distinctCount.toString).getOrElse("NULL"))
   buffer += Row("avg_col_len", 
cs.map(_.avgLen.toString).getOrElse("NULL"))
   buffer += Row("max_col_len", 
cs.map(_.maxLen.toString).getOrElse("NULL"))
+  buffer ++= cs.flatMap(_.histogram.map { hist =>
+val header = Row("histogram", s"height: ${hist.height}, 
num_of_bins: ${hist.bins.length}")
+Seq(header) ++ hist.bins.map(bin =>
+  Row("", s"lower_bound: ${bin.lo}, upper_bound: ${bin.hi}, 
distinct_count: ${bin.ndv}"))
+  }).getOrElse(Seq(Row("histogram", "NULL")))
--- End diff --

Hive has no histogram


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19774: [SPARK-22475][SQL] show histogram in DESC COLUMN ...

2017-11-21 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19774#discussion_r152265034
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala ---
@@ -689,6 +689,11 @@ case class DescribeColumnCommand(
   buffer += Row("distinct_count", 
cs.map(_.distinctCount.toString).getOrElse("NULL"))
   buffer += Row("avg_col_len", 
cs.map(_.avgLen.toString).getOrElse("NULL"))
   buffer += Row("max_col_len", 
cs.map(_.maxLen.toString).getOrElse("NULL"))
+  buffer ++= cs.flatMap(_.histogram.map { hist =>
+val header = Row("histogram", s"height: ${hist.height}, 
num_of_bins: ${hist.bins.length}")
+Seq(header) ++ hist.bins.map(bin =>
+  Row("", s"lower_bound: ${bin.lo}, upper_bound: ${bin.hi}, 
distinct_count: ${bin.ndv}"))
--- End diff --

@henryr I think verbosity is a minor issue, since we are returning a 
`DataFrame` and the user can decide what to show.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19774: [SPARK-22475][SQL] show histogram in DESC COLUMN ...

2017-11-21 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19774#discussion_r152264097
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala ---
@@ -689,6 +689,11 @@ case class DescribeColumnCommand(
   buffer += Row("distinct_count", 
cs.map(_.distinctCount.toString).getOrElse("NULL"))
   buffer += Row("avg_col_len", 
cs.map(_.avgLen.toString).getOrElse("NULL"))
   buffer += Row("max_col_len", 
cs.map(_.maxLen.toString).getOrElse("NULL"))
+  buffer ++= cs.flatMap(_.histogram.map { hist =>
+val header = Row("histogram", s"height: ${hist.height}, 
num_of_bins: ${hist.bins.length}")
+Seq(header) ++ hist.bins.map(bin =>
+  Row("", s"lower_bound: ${bin.lo}, upper_bound: ${bin.hi}, 
distinct_count: ${bin.ndv}"))
--- End diff --

@gatorsmile In Hive, there isn't yet a histogram implementation 
(HIVE-3526). In Oracle and MySQL, the information is stored in metadata tables 
which can be queried 
(https://docs.oracle.com/cloud/latest/db112/REFRN/statviews_2106.htm#REFRN20279)
 and here we have mainly two information:
 - the cumulative count so far;
 - the endpoint value for the current bin.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19774: [SPARK-22475][SQL] show histogram in DESC COLUMN ...

2017-11-20 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19774#discussion_r152158010
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala ---
@@ -689,6 +689,11 @@ case class DescribeColumnCommand(
   buffer += Row("distinct_count", 
cs.map(_.distinctCount.toString).getOrElse("NULL"))
   buffer += Row("avg_col_len", 
cs.map(_.avgLen.toString).getOrElse("NULL"))
   buffer += Row("max_col_len", 
cs.map(_.maxLen.toString).getOrElse("NULL"))
+  buffer ++= cs.flatMap(_.histogram.map { hist =>
+val header = Row("histogram", s"height: ${hist.height}, 
num_of_bins: ${hist.bins.length}")
+Seq(header) ++ hist.bins.map(bin =>
+  Row("", s"lower_bound: ${bin.lo}, upper_bound: ${bin.hi}, 
distinct_count: ${bin.ndv}"))
--- End diff --

What is the output contents/formats of the other vendors?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19774: [SPARK-22475][SQL] show histogram in DESC COLUMN ...

2017-11-20 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19774#discussion_r152157864
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala ---
@@ -689,6 +689,11 @@ case class DescribeColumnCommand(
   buffer += Row("distinct_count", 
cs.map(_.distinctCount.toString).getOrElse("NULL"))
   buffer += Row("avg_col_len", 
cs.map(_.avgLen.toString).getOrElse("NULL"))
   buffer += Row("max_col_len", 
cs.map(_.maxLen.toString).getOrElse("NULL"))
+  buffer ++= cs.flatMap(_.histogram.map { hist =>
+val header = Row("histogram", s"height: ${hist.height}, 
num_of_bins: ${hist.bins.length}")
+Seq(header) ++ hist.bins.map(bin =>
+  Row("", s"lower_bound: ${bin.lo}, upper_bound: ${bin.hi}, 
distinct_count: ${bin.ndv}"))
+  }).getOrElse(Seq(Row("histogram", "NULL")))
--- End diff --

What is the output of Hive?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19774: [SPARK-22475][SQL] show histogram in DESC COLUMN ...

2017-11-20 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19774#discussion_r152100465
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala ---
@@ -689,6 +689,11 @@ case class DescribeColumnCommand(
   buffer += Row("distinct_count", 
cs.map(_.distinctCount.toString).getOrElse("NULL"))
   buffer += Row("avg_col_len", 
cs.map(_.avgLen.toString).getOrElse("NULL"))
   buffer += Row("max_col_len", 
cs.map(_.maxLen.toString).getOrElse("NULL"))
+  buffer ++= cs.flatMap(_.histogram.map { hist =>
+val header = Row("histogram", s"height: ${hist.height}, 
num_of_bins: ${hist.bins.length}")
+Seq(header) ++ hist.bins.map(bin =>
+  Row("", s"lower_bound: ${bin.lo}, upper_bound: ${bin.hi}, 
distinct_count: ${bin.ndv}"))
--- End diff --

for debug purpose, I'd like to have as more information as possible. In the 
future maybe we can simplify the `DESC COLUMN` and use 
`spark.catalog.getColumn` to get detailed information.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19774: [SPARK-22475][SQL] show histogram in DESC COLUMN ...

2017-11-20 Thread henryr
Github user henryr commented on a diff in the pull request:

https://github.com/apache/spark/pull/19774#discussion_r152099227
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala ---
@@ -689,6 +689,11 @@ case class DescribeColumnCommand(
   buffer += Row("distinct_count", 
cs.map(_.distinctCount.toString).getOrElse("NULL"))
   buffer += Row("avg_col_len", 
cs.map(_.avgLen.toString).getOrElse("NULL"))
   buffer += Row("max_col_len", 
cs.map(_.maxLen.toString).getOrElse("NULL"))
+  buffer ++= cs.flatMap(_.histogram.map { hist =>
+val header = Row("histogram", s"height: ${hist.height}, 
num_of_bins: ${hist.bins.length}")
+Seq(header) ++ hist.bins.map(bin =>
+  Row("", s"lower_bound: ${bin.lo}, upper_bound: ${bin.hi}, 
distinct_count: ${bin.ndv}"))
--- End diff --

yeah, I was wondering if we actually need to see every individual bucket 
instead of a summary (max bucket: 2.0->4.0 w/128 entries, min bucket: 6.0->8.0 
w/0 entries, median bucket size: 32).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19774: [SPARK-22475][SQL] show histogram in DESC COLUMN ...

2017-11-20 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19774#discussion_r152098557
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala ---
@@ -689,6 +689,11 @@ case class DescribeColumnCommand(
   buffer += Row("distinct_count", 
cs.map(_.distinctCount.toString).getOrElse("NULL"))
   buffer += Row("avg_col_len", 
cs.map(_.avgLen.toString).getOrElse("NULL"))
   buffer += Row("max_col_len", 
cs.map(_.maxLen.toString).getOrElse("NULL"))
+  buffer ++= cs.flatMap(_.histogram.map { hist =>
+val header = Row("histogram", s"height: ${hist.height}, 
num_of_bins: ${hist.bins.length}")
+Seq(header) ++ hist.bins.map(bin =>
+  Row("", s"lower_bound: ${bin.lo}, upper_bound: ${bin.hi}, 
distinct_count: ${bin.ndv}"))
--- End diff --

kind of, but I don't have a better idea, adding a new flag to `DESC 
COLUMNS` seems an overkill.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19774: [SPARK-22475][SQL] show histogram in DESC COLUMN ...

2017-11-20 Thread henryr
Github user henryr commented on a diff in the pull request:

https://github.com/apache/spark/pull/19774#discussion_r152096698
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala ---
@@ -689,6 +689,11 @@ case class DescribeColumnCommand(
   buffer += Row("distinct_count", 
cs.map(_.distinctCount.toString).getOrElse("NULL"))
   buffer += Row("avg_col_len", 
cs.map(_.avgLen.toString).getOrElse("NULL"))
   buffer += Row("max_col_len", 
cs.map(_.maxLen.toString).getOrElse("NULL"))
+  buffer ++= cs.flatMap(_.histogram.map { hist =>
+val header = Row("histogram", s"height: ${hist.height}, 
num_of_bins: ${hist.bins.length}")
+Seq(header) ++ hist.bins.map(bin =>
+  Row("", s"lower_bound: ${bin.lo}, upper_bound: ${bin.hi}, 
distinct_count: ${bin.ndv}"))
--- End diff --

Do you think this (printing one line per bucket) might become too verbose 
if there are many buckets? In an interactive session writing 256 buckets, 
one-per-line is likely to push the non-histogram output off the top of the 
screen. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19774: [SPARK-22475][SQL] show histogram in DESC COLUMN ...

2017-11-20 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19774#discussion_r152005780
  
--- Diff: 
sql/core/src/test/resources/sql-tests/inputs/describe-table-column.sql ---
@@ -24,6 +24,19 @@ DESC EXTENDED desc_col_table key;
 
 DESC FORMATTED desc_col_table key;
 
+-- Test output for histogram statistics
+SET spark.sql.statistics.histogram.enabled=true;
+SET spark.sql.statistics.histogram.numBins=2;
+
+INSERT INTO desc_col_table values 1, 2, 3, 4;
+
+ANALYZE TABLE desc_col_table COMPUTE STATISTICS FOR COLUMNS key;
+
+DESC EXTENDED desc_col_table key;
--- End diff --

shall we move these to the end of this file?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19774: [SPARK-22475][SQL] show histogram in DESC COLUMN ...

2017-11-20 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19774#discussion_r152005513
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala ---
@@ -689,6 +689,11 @@ case class DescribeColumnCommand(
   buffer += Row("distinct_count", 
cs.map(_.distinctCount.toString).getOrElse("NULL"))
   buffer += Row("avg_col_len", 
cs.map(_.avgLen.toString).getOrElse("NULL"))
   buffer += Row("max_col_len", 
cs.map(_.maxLen.toString).getOrElse("NULL"))
+  buffer ++= cs.flatMap(_.histogram.map { hist =>
+val header = Row("histogram", s"height: ${hist.height}, 
num_of_bins: ${hist.bins.length}")
+Seq(header) ++ hist.bins.map(bin =>
+  Row("", s"lower_bound: ${bin.lo}, upper_bound: ${bin.hi}, 
distinct_count: ${bin.ndv}"))
+  }).getOrElse(Seq(Row("histogram", "NULL")))
--- End diff --

If no histogram info, I'd like to print nothing here.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19774: [SPARK-22475][SQL] show histogram in DESC COLUMN ...

2017-11-20 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19774#discussion_r152005342
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala ---
@@ -689,6 +689,11 @@ case class DescribeColumnCommand(
   buffer += Row("distinct_count", 
cs.map(_.distinctCount.toString).getOrElse("NULL"))
   buffer += Row("avg_col_len", 
cs.map(_.avgLen.toString).getOrElse("NULL"))
   buffer += Row("max_col_len", 
cs.map(_.maxLen.toString).getOrElse("NULL"))
+  buffer ++= cs.flatMap(_.histogram.map { hist =>
+val header = Row("histogram", s"height: ${hist.height}, 
num_of_bins: ${hist.bins.length}")
+Seq(header) ++ hist.bins.map(bin =>
+  Row("", s"lower_bound: ${bin.lo}, upper_bound: ${bin.hi}, 
distinct_count: ${bin.ndv}"))
--- End diff --

maybe adding the bin id?
```
histogram   height: 2.0, num_of_bins: 2
bin0  lower_bound: 1.0, upper_bound: 2.0, distinct_count: 2
bin1  lower_bound: 2.0, upper_bound: 4.0, distinct_count: 2
```


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19774: [SPARK-22475][SQL] show histogram in DESC COLUMN ...

2017-11-20 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19774#discussion_r152004929
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala ---
@@ -689,6 +689,11 @@ case class DescribeColumnCommand(
   buffer += Row("distinct_count", 
cs.map(_.distinctCount.toString).getOrElse("NULL"))
   buffer += Row("avg_col_len", 
cs.map(_.avgLen.toString).getOrElse("NULL"))
   buffer += Row("max_col_len", 
cs.map(_.maxLen.toString).getOrElse("NULL"))
+  buffer ++= cs.flatMap(_.histogram.map { hist =>
--- End diff --

yea I think for this case the for loop is conventional


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19774: [SPARK-22475][SQL] show histogram in DESC COLUMN ...

2017-11-19 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/19774#discussion_r151892566
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala ---
@@ -689,6 +689,11 @@ case class DescribeColumnCommand(
   buffer += Row("distinct_count", 
cs.map(_.distinctCount.toString).getOrElse("NULL"))
   buffer += Row("avg_col_len", 
cs.map(_.avgLen.toString).getOrElse("NULL"))
   buffer += Row("max_col_len", 
cs.map(_.maxLen.toString).getOrElse("NULL"))
+  buffer ++= cs.flatMap(_.histogram.map { hist =>
+val header = Row("histogram", s"height: ${hist.height}, 
num_of_bins: ${hist.bins.length}")
+Seq(header) ++ hist.bins.map(bin =>
+  Row("", s"lower_bound: ${bin.lo}, upper_bound: ${bin.hi}, 
distinct_count: ${bin.ndv}"))
--- End diff --

And how about making each class member as a separate field?
```
histogram   height: 2.0 num_of_bins: 2
histogram_bin   lower_bound: 1.0upper_bound: 2.0
distinct_count: 2
histogram_bin   lower_bound: 2.0upper_bound: 4.0
distinct_count: 2
```


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19774: [SPARK-22475][SQL] show histogram in DESC COLUMN ...

2017-11-19 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/19774#discussion_r151892351
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala ---
@@ -689,6 +689,11 @@ case class DescribeColumnCommand(
   buffer += Row("distinct_count", 
cs.map(_.distinctCount.toString).getOrElse("NULL"))
   buffer += Row("avg_col_len", 
cs.map(_.avgLen.toString).getOrElse("NULL"))
   buffer += Row("max_col_len", 
cs.map(_.maxLen.toString).getOrElse("NULL"))
+  buffer ++= cs.flatMap(_.histogram.map { hist =>
+val header = Row("histogram", s"height: ${hist.height}, 
num_of_bins: ${hist.bins.length}")
+Seq(header) ++ hist.bins.map(bin =>
+  Row("", s"lower_bound: ${bin.lo}, upper_bound: ${bin.hi}, 
distinct_count: ${bin.ndv}"))
--- End diff --

@mgaido91 @jaceklaskowski How about using "histogram_bin" instead of the 
empty string?
```
histogram   height: 2.0, num_of_bins: 2
histogram_bin   lower_bound: 1.0, upper_bound: 2.0, distinct_count: 2
histogram_bin   lower_bound: 2.0, upper_bound: 4.0, distinct_count: 2
```


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19774: [SPARK-22475][SQL] show histogram in DESC COLUMN ...

2017-11-18 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19774#discussion_r151855965
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala ---
@@ -689,6 +689,11 @@ case class DescribeColumnCommand(
   buffer += Row("distinct_count", 
cs.map(_.distinctCount.toString).getOrElse("NULL"))
   buffer += Row("avg_col_len", 
cs.map(_.avgLen.toString).getOrElse("NULL"))
   buffer += Row("max_col_len", 
cs.map(_.maxLen.toString).getOrElse("NULL"))
+  buffer ++= cs.flatMap(_.histogram.map { hist =>
+val header = Row("histogram", s"height: ${hist.height}, 
num_of_bins: ${hist.bins.length}")
+Seq(header) ++ hist.bins.map(bin =>
+  Row("", s"lower_bound: ${bin.lo}, upper_bound: ${bin.hi}, 
distinct_count: ${bin.ndv}"))
+  }).getOrElse(Seq(Row("histogram", "NULL")))
--- End diff --

Some comments or cleanup here would be nicer though


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19774: [SPARK-22475][SQL] show histogram in DESC COLUMN ...

2017-11-18 Thread jaceklaskowski
Github user jaceklaskowski commented on a diff in the pull request:

https://github.com/apache/spark/pull/19774#discussion_r151838625
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala ---
@@ -689,6 +689,11 @@ case class DescribeColumnCommand(
   buffer += Row("distinct_count", 
cs.map(_.distinctCount.toString).getOrElse("NULL"))
   buffer += Row("avg_col_len", 
cs.map(_.avgLen.toString).getOrElse("NULL"))
   buffer += Row("max_col_len", 
cs.map(_.maxLen.toString).getOrElse("NULL"))
+  buffer ++= cs.flatMap(_.histogram.map { hist =>
+val header = Row("histogram", s"height: ${hist.height}, 
num_of_bins: ${hist.bins.length}")
+Seq(header) ++ hist.bins.map(bin =>
+  Row("", s"lower_bound: ${bin.lo}, upper_bound: ${bin.hi}, 
distinct_count: ${bin.ndv}"))
--- End diff --

@wzhfy I'd rather define a `val` with the comment being the name of the 
val. That would make it "compile-safe".


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19774: [SPARK-22475][SQL] show histogram in DESC COLUMN ...

2017-11-17 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/19774#discussion_r151826928
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala ---
@@ -689,6 +689,11 @@ case class DescribeColumnCommand(
   buffer += Row("distinct_count", 
cs.map(_.distinctCount.toString).getOrElse("NULL"))
   buffer += Row("avg_col_len", 
cs.map(_.avgLen.toString).getOrElse("NULL"))
   buffer += Row("max_col_len", 
cs.map(_.maxLen.toString).getOrElse("NULL"))
+  buffer ++= cs.flatMap(_.histogram.map { hist =>
+val header = Row("histogram", s"height: ${hist.height}, 
num_of_bins: ${hist.bins.length}")
+Seq(header) ++ hist.bins.map(bin =>
+  Row("", s"lower_bound: ${bin.lo}, upper_bound: ${bin.hi}, 
distinct_count: ${bin.ndv}"))
--- End diff --

add a comment that the empty column is for indentation and better 
readability?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19774: [SPARK-22475][SQL] show histogram in DESC COLUMN ...

2017-11-17 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19774#discussion_r151741011
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala ---
@@ -689,6 +689,11 @@ case class DescribeColumnCommand(
   buffer += Row("distinct_count", 
cs.map(_.distinctCount.toString).getOrElse("NULL"))
   buffer += Row("avg_col_len", 
cs.map(_.avgLen.toString).getOrElse("NULL"))
   buffer += Row("max_col_len", 
cs.map(_.maxLen.toString).getOrElse("NULL"))
+  buffer ++= cs.flatMap(_.histogram.map { hist =>
--- End diff --

Thanks for your feedback. IMHO I am not sure that it would be easier to 
read. But if someone else agrees with you I can change it, thanks.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19774: [SPARK-22475][SQL] show histogram in DESC COLUMN ...

2017-11-17 Thread jaceklaskowski
Github user jaceklaskowski commented on a diff in the pull request:

https://github.com/apache/spark/pull/19774#discussion_r151737674
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala ---
@@ -689,6 +689,11 @@ case class DescribeColumnCommand(
   buffer += Row("distinct_count", 
cs.map(_.distinctCount.toString).getOrElse("NULL"))
   buffer += Row("avg_col_len", 
cs.map(_.avgLen.toString).getOrElse("NULL"))
   buffer += Row("max_col_len", 
cs.map(_.maxLen.toString).getOrElse("NULL"))
+  buffer ++= cs.flatMap(_.histogram.map { hist =>
--- End diff --

I'm pretty sure that for-comprehension would make the code read easier.

```scala
for {
  c <- cs
  hist <- c.histogram
  ...
} yield ...
```

Let me know if you need help with that.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19774: [SPARK-22475][SQL] show histogram in DESC COLUMN ...

2017-11-17 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/19774#discussion_r151693461
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala ---
@@ -689,6 +689,11 @@ case class DescribeColumnCommand(
   buffer += Row("distinct_count", 
cs.map(_.distinctCount.toString).getOrElse("NULL"))
   buffer += Row("avg_col_len", 
cs.map(_.avgLen.toString).getOrElse("NULL"))
   buffer += Row("max_col_len", 
cs.map(_.maxLen.toString).getOrElse("NULL"))
+  buffer ++= cs.flatMap(_.histogram.map { hist =>
+val header = Row("histogram", s"height: ${hist.height}, 
num_of_bins: ${hist.bins.length}")
+Seq(header) ++ hist.bins.map(bin => Row("",
--- End diff --

nit: `hist.bins.map { bin =>`
And is it possible to move `Row("",` to the next line?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19774: [SPARK-22475][SQL] show histogram in DESC COLUMN ...

2017-11-17 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/19774#discussion_r151693478
  
--- Diff: 
sql/core/src/test/resources/sql-tests/inputs/describe-table-column.sql ---
@@ -24,6 +24,18 @@ DESC EXTENDED desc_col_table key;
 
 DESC FORMATTED desc_col_table key;
 
+SET spark.sql.statistics.histogram.enabled=true;
+SET spark.sql.statistics.histogram.numBins=2;
+
+INSERT INTO desc_col_table values(1);
+INSERT INTO desc_col_table values(2);
+INSERT INTO desc_col_table values(3);
+INSERT INTO desc_col_table values(4);
+
+ANALYZE TABLE desc_col_table COMPUTE STATISTICS FOR COLUMNS key;
--- End diff --

please set the sql conf back to default value.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19774: [SPARK-22475][SQL] show histogram in DESC COLUMN ...

2017-11-17 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/19774#discussion_r151689883
  
--- Diff: 
sql/core/src/test/resources/sql-tests/inputs/describe-table-column.sql ---
@@ -24,6 +24,18 @@ DESC EXTENDED desc_col_table key;
 
 DESC FORMATTED desc_col_table key;
 
+SET spark.sql.statistics.histogram.enabled=true;
+SET spark.sql.statistics.histogram.numBins=2;
+
+INSERT INTO desc_col_table values(1);
+INSERT INTO desc_col_table values(2);
+INSERT INTO desc_col_table values(3);
+INSERT INTO desc_col_table values(4);
--- End diff --

INSERT INTO desc_col_table values 1, 2, 3, 4


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19774: [SPARK-22475][SQL] show histogram in DESC COLUMN ...

2017-11-17 Thread mgaido91
GitHub user mgaido91 opened a pull request:

https://github.com/apache/spark/pull/19774

[SPARK-22475][SQL] show histogram in DESC COLUMN command

## What changes were proposed in this pull request?

Added the histogram representation to the output of the `DESCRIBE EXTENDED 
table_name column_name` command.

## How was this patch tested?

Modified SQL UT and checked output

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mgaido91/spark SPARK-22475

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19774.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19774


commit 24bfcb1132d35ffa8ba2341a7ea9057b14b5ab8a
Author: Marco Gaido 
Date:   2017-11-17T12:42:16Z

[SPARK-22475][SQL] show histogram in DESC COLUMN command




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org