[GitHub] spark pull request #22036: [SPARK-25028][SQL] Avoid NPE when analyzing parti...

2018-08-13 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/22036


---




[GitHub] spark pull request #22036: [SPARK-25028][SQL] Avoid NPE when analyzing parti...

2018-08-13 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22036#discussion_r209536881
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzePartitionCommand.scala ---
@@ -140,7 +140,13 @@ case class AnalyzePartitionCommand(
     val df = tableDf.filter(Column(filter)).groupBy(partitionColumns: _*).count()
 
     df.collect().map { r =>
-      val partitionColumnValues = partitionColumns.indices.map(r.get(_).toString)
+      val partitionColumnValues = partitionColumns.indices.map { i =>
+        if (r.isNullAt(i)) {
+          ExternalCatalogUtils.DEFAULT_PARTITION_NAME
--- End diff --

I don't think so, as the same situation would happen if Hive's statistics are used instead of the ones computed by Spark.
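
For reference, a minimal sketch of what the complete patched mapping plausibly looks like (the diff above is truncated, so the else branch here is an assumption reconstructed from the removed line):

    val partitionColumnValues = partitionColumns.indices.map { i =>
      if (r.isNullAt(i)) {
        // Placeholder the external catalog uses for null partition values
        // (e.g. __HIVE_DEFAULT_PARTITION__); avoids calling .toString on null.
        ExternalCatalogUtils.DEFAULT_PARTITION_NAME
      } else {
        r.get(i).toString
      }
    }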


---




[GitHub] spark pull request #22036: [SPARK-25028][SQL] Avoid NPE when analyzing parti...

2018-08-13 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22036#discussion_r209534161
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala ---
@@ -204,6 +204,24 @@ class StatisticsCollectionSuite extends StatisticsCollectionTestBase with Shared
     }
   }
 
+  test("SPARK-25028: column stats collection for null partitioning columns") {
+    val table = "analyze_partition_with_null"
+    withTempDir { dir =>
+      withTable(table) {
+        sql(s"""
+             |CREATE TABLE $table (name string, value string)
+             |USING PARQUET
+             |PARTITIONED BY (name)
+             |LOCATION '${dir.toURI}'""".stripMargin)
+        val df = Seq(("a", null), ("b", null)).toDF("value", "name")
--- End diff --

ok, will do, thanks.


---




[GitHub] spark pull request #22036: [SPARK-25028][SQL] Avoid NPE when analyzing parti...

2018-08-12 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22036#discussion_r209481430
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala ---
@@ -204,6 +204,24 @@ class StatisticsCollectionSuite extends StatisticsCollectionTestBase with Shared
     }
   }
 
+  test("SPARK-25028: column stats collection for null partitioning columns") {
+    val table = "analyze_partition_with_null"
+    withTempDir { dir =>
+      withTable(table) {
+        sql(s"""
+             |CREATE TABLE $table (name string, value string)
+             |USING PARQUET
+             |PARTITIONED BY (name)
+             |LOCATION '${dir.toURI}'""".stripMargin)
+        val df = Seq(("a", null), ("b", null)).toDF("value", "name")
--- End diff --

When creating the table, we can put the partition column at the end to avoid this confusion.
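
A quick sketch of the suggested ordering (same hypothetical test table as in the diff; with the partition column declared last, the DataFrame below no longer looks reversed):

    sql(s"""
         |CREATE TABLE $table (value string, name string)
         |USING PARQUET
         |PARTITIONED BY (name)
         |LOCATION '${dir.toURI}'""".stripMargin)
    // Column order now matches the insertion DataFrame directly:
    val df = Seq(("a", null), ("b", null)).toDF("value", "name")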


---




[GitHub] spark pull request #22036: [SPARK-25028][SQL] Avoid NPE when analyzing parti...

2018-08-12 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22036#discussion_r209481344
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzePartitionCommand.scala ---
@@ -140,7 +140,13 @@ case class AnalyzePartitionCommand(
     val df = tableDf.filter(Column(filter)).groupBy(partitionColumns: _*).count()
 
     df.collect().map { r =>
-      val partitionColumnValues = partitionColumns.indices.map(r.get(_).toString)
+      val partitionColumnValues = partitionColumns.indices.map { i =>
+        if (r.isNullAt(i)) {
+          ExternalCatalogUtils.DEFAULT_PARTITION_NAME
--- End diff --

Do we need to change the read path too, i.e. where we use these statistics?


---




[GitHub] spark pull request #22036: [SPARK-25028][SQL] Avoid NPE when analyzing parti...

2018-08-09 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22036#discussion_r208831510
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala ---
@@ -204,6 +204,24 @@ class StatisticsCollectionSuite extends StatisticsCollectionTestBase with Shared
     }
   }
 
+  test("SPARK-25028: column stats collection for null partitioning columns") {
+    val table = "analyze_partition_with_null"
+    withTempDir { dir =>
+      withTable(table) {
+        sql(s"""
+             |CREATE TABLE $table (name string, value string)
+             |USING PARQUET
+             |PARTITIONED BY (name)
+             |LOCATION '${dir.toURI}'""".stripMargin)
+        val df = Seq(("a", null), ("b", null)).toDF("value", "name")
--- End diff --

I don't think we need to add another partition value, as the problem here is `null` throwing an NPE, and the test shows that no NPE is thrown. But if you think it is necessary, I can add it.

The reversed column order is how Spark works when inserting data into a partitioned table: the partitioning columns are specified at the end, after the non-partitioning ones.
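
For context, a sketch of how the truncated test body might continue, assuming the suite's usual imports (e.g. org.apache.spark.sql.catalyst.TableIdentifier); the insert, the ANALYZE statement, and the assertion are assumptions, not lines from the PR:

    df.write.insertInto(table)
    // The point of the test: this must not throw an NPE for the null partition.
    sql(s"ANALYZE TABLE $table PARTITION (name) COMPUTE STATISTICS FOR COLUMNS value")
    val partitions = spark.sessionState.catalog.listPartitions(TableIdentifier(table))
    assert(partitions.nonEmpty)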


---




[GitHub] spark pull request #22036: [SPARK-25028][SQL] Avoid NPE when analyzing parti...

2018-08-08 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/22036#discussion_r208795446
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala ---
@@ -204,6 +204,24 @@ class StatisticsCollectionSuite extends StatisticsCollectionTestBase with Shared
     }
   }
 
+  test("SPARK-25028: column stats collection for null partitioning columns") {
+    val table = "analyze_partition_with_null"
+    withTempDir { dir =>
+      withTable(table) {
+        sql(s"""
+             |CREATE TABLE $table (name string, value string)
+             |USING PARQUET
+             |PARTITIONED BY (name)
+             |LOCATION '${dir.toURI}'""".stripMargin)
+        val df = Seq(("a", null), ("b", null)).toDF("value", "name")
--- End diff --

super nit: better to add a non-null partition value too, e.g., `val df = Seq(("a", null), ("b", null), ("c", "1")).toDF("value", "name")`? BTW, why is the column order reversed (not "name", "value", but "value", "name")?


---




[GitHub] spark pull request #22036: [SPARK-25028][SQL] Avoid NPE when analyzing parti...

2018-08-08 Thread mgaido91
GitHub user mgaido91 opened a pull request:

https://github.com/apache/spark/pull/22036

[SPARK-25028][SQL] Avoid NPE when analyzing partition with NULL values

## What changes were proposed in this pull request?

`ANALYZE TABLE ... PARTITION(...) COMPUTE STATISTICS` can fail with an NPE if a partition column contains a NULL value.

The PR avoids the NPE by replacing the `NULL` values with the default partition placeholder.
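
A minimal repro sketch (the table and column names are hypothetical, modeled on the unit test added below):

    import spark.implicits._  // for toDF
    spark.sql("CREATE TABLE t (value STRING, name STRING) USING PARQUET PARTITIONED BY (name)")
    Seq(("a", null), ("b", null)).toDF("value", "name").write.insertInto("t")
    // Before this patch, the next statement hit a NullPointerException while
    // converting the null partition value to a string:
    spark.sql("ANALYZE TABLE t PARTITION (name) COMPUTE STATISTICS")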

## How was this patch tested?

Added a unit test.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mgaido91/spark SPARK-25028

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22036.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22036


commit ee64a6ba3d41c0f3d6776d7ccbd9af7185d8a3ad
Author: Marco Gaido 
Date:   2018-08-08T12:07:33Z

[SPARK-25028][SQL] Avoid NPE when analyzing partition with NULL values




---
