[GitHub] spark pull request #22036: [SPARK-25028][SQL] Avoid NPE when analyzing parti...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/22036

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/22036#discussion_r209536881

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzePartitionCommand.scala ---
@@ -140,7 +140,13 @@ case class AnalyzePartitionCommand(
     val df = tableDf.filter(Column(filter)).groupBy(partitionColumns: _*).count()
     df.collect().map { r =>
-      val partitionColumnValues = partitionColumns.indices.map(r.get(_).toString)
+      val partitionColumnValues = partitionColumns.indices.map { i =>
+        if (r.isNullAt(i)) {
+          ExternalCatalogUtils.DEFAULT_PARTITION_NAME
--- End diff --

I don't think so, as the same situation would happen if Hive's statistics are used instead of the ones computed by Spark.
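The new branch in the hunk above is truncated mid-`if`; the non-null case presumably still falls back to `toString`. A minimal, Spark-free sketch of that null-safe mapping, where the row model (a plain `Seq[Any]` instead of a Spark `Row`) is an assumption for illustration and the placeholder string mirrors Spark's `ExternalCatalogUtils.DEFAULT_PARTITION_NAME`:

```scala
// Sketch of the null-safe partition-value mapping discussed in the diff.
// A row is modeled as a plain Seq[Any]; the placeholder string matches the
// value Spark uses for ExternalCatalogUtils.DEFAULT_PARTITION_NAME.
object PartitionValueSketch {
  val DefaultPartitionName = "__HIVE_DEFAULT_PARTITION__"

  def partitionValues(row: Seq[Any], numPartitionCols: Int): Seq[String] =
    (0 until numPartitionCols).map { i =>
      if (row(i) == null) DefaultPartitionName // old code did row(i).toString => NPE
      else row(i).toString
    }

  def main(args: Array[String]): Unit =
    println(partitionValues(Seq(null, 2018), 2)) // no NPE for the null column
}
```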
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/22036#discussion_r209534161

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala ---
@@ -204,6 +204,24 @@ class StatisticsCollectionSuite extends StatisticsCollectionTestBase with Shared
     }
   }
+  test("SPARK-25028: column stats collection for null partitioning columns") {
+    val table = "analyze_partition_with_null"
+    withTempDir { dir =>
+      withTable(table) {
+        sql(s"""
+          |CREATE TABLE $table (name string, value string)
+          |USING PARQUET
+          |PARTITIONED BY (name)
+          |LOCATION '${dir.toURI}'""".stripMargin)
+        val df = Seq(("a", null), ("b", null)).toDF("value", "name")
--- End diff --

ok, will do, thanks.
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22036#discussion_r209481430

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala ---
@@ -204,6 +204,24 @@ class StatisticsCollectionSuite extends StatisticsCollectionTestBase with Shared
     }
   }
+  test("SPARK-25028: column stats collection for null partitioning columns") {
+    val table = "analyze_partition_with_null"
+    withTempDir { dir =>
+      withTable(table) {
+        sql(s"""
+          |CREATE TABLE $table (name string, value string)
+          |USING PARQUET
+          |PARTITIONED BY (name)
+          |LOCATION '${dir.toURI}'""".stripMargin)
+        val df = Seq(("a", null), ("b", null)).toDF("value", "name")
--- End diff --

when creating the table, we can put the partition column at the end, to avoid this confusion.
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22036#discussion_r209481344

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzePartitionCommand.scala ---
@@ -140,7 +140,13 @@ case class AnalyzePartitionCommand(
     val df = tableDf.filter(Column(filter)).groupBy(partitionColumns: _*).count()
     df.collect().map { r =>
-      val partitionColumnValues = partitionColumns.indices.map(r.get(_).toString)
+      val partitionColumnValues = partitionColumns.indices.map { i =>
+        if (r.isNullAt(i)) {
+          ExternalCatalogUtils.DEFAULT_PARTITION_NAME
--- End diff --

do we need to change the read path, i.e. where we use these statistics?
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/22036#discussion_r208831510

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala ---
@@ -204,6 +204,24 @@ class StatisticsCollectionSuite extends StatisticsCollectionTestBase with Shared
     }
   }
+  test("SPARK-25028: column stats collection for null partitioning columns") {
+    val table = "analyze_partition_with_null"
+    withTempDir { dir =>
+      withTable(table) {
+        sql(s"""
+          |CREATE TABLE $table (name string, value string)
+          |USING PARQUET
+          |PARTITIONED BY (name)
+          |LOCATION '${dir.toURI}'""".stripMargin)
+        val df = Seq(("a", null), ("b", null)).toDF("value", "name")
--- End diff --

I don't think it is needed to add another partition value, as the problem here is with `null` throwing an NPE, and the test shows that no NPE is thrown. But if you think it is necessary I can add it. The reverse column order is the way Spark works when inserting data into a partitioned table: the partitioning columns are specified at the end, after the non-partitioning ones.
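The ordering rule described here can be sketched without Spark: on insert, a partitioned table's effective column order is the data columns followed by the partitioning columns. The helper below is hypothetical, purely for illustration, and not a Spark API:

```scala
// Illustrative only: reorders a schema the way Spark expects columns when
// inserting into a partitioned table (data columns first, partition columns last).
object ColumnOrderSketch {
  def insertionOrder(schema: Seq[String], partitionCols: Set[String]): Seq[String] = {
    val (part, data) = schema.partition(partitionCols.contains)
    data ++ part
  }

  def main(args: Array[String]): Unit =
    // table declared as (name, value) partitioned by name => insert order (value, name)
    println(insertionOrder(Seq("name", "value"), Set("name")))
}
```

This is why the test's DataFrame uses `toDF("value", "name")` rather than `toDF("name", "value")`.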
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/22036#discussion_r208795446

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala ---
@@ -204,6 +204,24 @@ class StatisticsCollectionSuite extends StatisticsCollectionTestBase with Shared
     }
   }
+  test("SPARK-25028: column stats collection for null partitioning columns") {
+    val table = "analyze_partition_with_null"
+    withTempDir { dir =>
+      withTable(table) {
+        sql(s"""
+          |CREATE TABLE $table (name string, value string)
+          |USING PARQUET
+          |PARTITIONED BY (name)
+          |LOCATION '${dir.toURI}'""".stripMargin)
+        val df = Seq(("a", null), ("b", null)).toDF("value", "name")
--- End diff --

super nit: better to add a non-null partition value, e.g., `val df = Seq(("a", null), ("b", null), ("c", "1")).toDF("value", "name")`? btw, why is this a reverse column order (not "name", "value", but "value", "name")?
GitHub user mgaido91 opened a pull request: https://github.com/apache/spark/pull/22036

[SPARK-25028][SQL] Avoid NPE when analyzing partition with NULL values

## What changes were proposed in this pull request?

`ANALYZE TABLE ... PARTITION(...) COMPUTE STATISTICS` can fail with an NPE if a partition column contains a NULL value. The PR avoids the NPE by replacing the `NULL` values with the default partition placeholder.

## How was this patch tested?

Added UT.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mgaido91/spark SPARK-25028

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22036.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #22036

commit ee64a6ba3d41c0f3d6776d7ccbd9af7185d8a3ad
Author: Marco Gaido
Date: 2018-08-08T12:07:33Z

    [SPARK-25028][SQL] Avoid NPE when analyzing partition with NULL values
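The failure mode described above can be reproduced in plain Scala, with no Spark involved: calling `.toString` on a null element, as the old `r.get(_).toString` did for a null partition value, throws a `NullPointerException`. A minimal sketch (object and method names are illustrative, not from the PR):

```scala
// Reproduces the NPE pattern fixed by the PR: .toString on a null value.
object NpeDemo {
  def stringify(row: Seq[Any]): Either[String, Seq[String]] =
    try Right(row.map(_.toString)) // mirrors the old r.get(_).toString
    catch { case _: NullPointerException => Left("NPE") }

  def main(args: Array[String]): Unit = {
    println(stringify(Seq("a", "2018"))) // all values non-null: succeeds
    println(stringify(Seq(null, "2018"))) // null partition value: NPE caught
  }
}
```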