[GitHub] spark pull request: [SPARK-11955][SQL] Mark optional fields in mer...
Github user tedyu commented on a diff in the pull request: https://github.com/apache/spark/pull/9940#discussion_r51284828

Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala

```scala
@@ -359,6 +361,18 @@ object StructType extends AbstractDataType {
   protected[sql] def fromAttributes(attributes: Seq[Attribute]): StructType =
     StructType(attributes.map(a => StructField(a.name, a.dataType, a.nullable, a.metadata)))

+  def removeMetadata(key: String, dt: DataType): DataType =
+    dt match {
+      case StructType(fields) =>
+        val newFields = fields.map { f =>
+          val mb = new MetadataBuilder()
```

End diff:

This workaround may be taken out in the future. However, use of MetadataBuilder occurs in many other places: http://pastebin.com/nVjNfrgp I feel adding clear() to MetadataBuilder would help in current and future use cases.

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
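The quoted hunk stops at the `MetadataBuilder` allocation. A hedged sketch of how the recursion might continue (everything past the quoted lines is an assumption, modeled on the quoted fragment and the public `MetadataBuilder` API, not the actual PR body):

```scala
import org.apache.spark.sql.types._

// Recursively strip a metadata key from all struct fields, descending into
// nested struct types. Non-struct types are returned unchanged.
def removeMetadata(key: String, dt: DataType): DataType =
  dt match {
    case StructType(fields) =>
      val newFields = fields.map { f =>
        val mb = new MetadataBuilder()
        f.copy(
          dataType = removeMetadata(key, f.dataType),
          metadata = mb.withMetadata(f.metadata).remove(key).build())
      }
      StructType(newFields)
    case _ => dt
  }
```

A fresh `MetadataBuilder` per field is what prompts the `clear()` discussion below: the builder cannot currently be reused across iterations.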
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/9940#discussion_r51235044

Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala

```scala
@@ -359,6 +361,18 @@ object StructType extends AbstractDataType {
   protected[sql] def fromAttributes(attributes: Seq[Attribute]): StructType =
     StructType(attributes.map(a => StructField(a.name, a.dataType, a.nullable, a.metadata)))

+  def removeMetadata(key: String, dt: DataType): DataType =
+    dt match {
+      case StructType(fields) =>
+        val newFields = fields.map { f =>
+          val mb = new MetadataBuilder()
```

End diff:

Agreed.
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/9940#discussion_r51232182

Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala

```scala
@@ -359,6 +361,18 @@ object StructType extends AbstractDataType {
   protected[sql] def fromAttributes(attributes: Seq[Attribute]): StructType =
     StructType(attributes.map(a => StructField(a.name, a.dataType, a.nullable, a.metadata)))

+  def removeMetadata(key: String, dt: DataType): DataType =
+    dt match {
+      case StructType(fields) =>
+        val newFields = fields.map { f =>
+          val mb = new MetadataBuilder()
```

End diff:

This PR is mostly a workaround for a parquet-mr bug (PARQUET-389), and I'd assume that it will be fixed in the near future. Then we can remove this workaround. So it doesn't seem to be worth modifying `MetadataBuilder`, which is part of the public API.
Github user tedyu commented on a diff in the pull request: https://github.com/apache/spark/pull/9940#discussion_r51213703

Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala

```scala
@@ -258,7 +258,12 @@ private[sql] class ParquetRelation(
     job.setOutputFormatClass(classOf[ParquetOutputFormat[Row]])
     ParquetOutputFormat.setWriteSupportClass(job, classOf[CatalystWriteSupport])
-    CatalystWriteSupport.setSchema(dataSchema, conf)
+
+    // We want to clear this temporary metadata from saving into Parquet file.
+    // This metadata is only useful for detecting optional columns when pushdowning filters.
```

End diff:

nit: pushdowning -> pushing down
Github user tedyu commented on a diff in the pull request: https://github.com/apache/spark/pull/9940#discussion_r51213506

Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala

```scala
@@ -359,6 +361,18 @@ object StructType extends AbstractDataType {
   protected[sql] def fromAttributes(attributes: Seq[Attribute]): StructType =
     StructType(attributes.map(a => StructField(a.name, a.dataType, a.nullable, a.metadata)))

+  def removeMetadata(key: String, dt: DataType): DataType =
+    dt match {
+      case StructType(fields) =>
+        val newFields = fields.map { f =>
+          val mb = new MetadataBuilder()
```

End diff:

If we add clear() to MetadataBuilder, this can be lifted above the fields.map. Inside the map operation we just clear the MetadataBuilder. What do you think?
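A sketch of the suggestion above. Note that `clear()` is hypothetical — it does not exist on Spark's `MetadataBuilder` — so this only illustrates the proposed hoisting:

```scala
// Proposed (hypothetical) addition to MetadataBuilder:
//   def clear(): this.type = { map.clear(); this }
//
// With it, the builder allocation could be lifted out of the per-field map:
//
// val mb = new MetadataBuilder()
// val newFields = fields.map { f =>
//   mb.clear()                     // reset reused builder between fields
//   f.copy(metadata = mb.withMetadata(f.metadata).remove(key).build())
// }
```

The counter-argument (below in the thread) is that `MetadataBuilder` is public API and the workaround is expected to be temporary, so the method is not worth adding.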
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/9940
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/9940#issuecomment-176495087

Thanks! I'm going to merge this to master.
Github user viirya commented on the pull request: https://github.com/apache/spark/pull/9940#issuecomment-175927505

ping @liancheng Please see whether the latest updates work for you. Thanks.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9940#issuecomment-175567984

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50190/
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9940#issuecomment-175567983

Merged build finished. Test PASSed.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/9940#issuecomment-175567496

**[Test build #50190 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50190/consoleFull)** for PR 9940 at commit [`1a11770`](https://github.com/apache/spark/commit/1a11770a19b8a6ec263ca3c34def5b03ec4483c9).

- This patch passes all tests.
- This patch merges cleanly.
- This patch adds no public classes.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/9940#issuecomment-175524399

**[Test build #50190 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50190/consoleFull)** for PR 9940 at commit [`1a11770`](https://github.com/apache/spark/commit/1a11770a19b8a6ec263ca3c34def5b03ec4483c9).
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/9940#issuecomment-172972369

Overall LGTM except for several minor issues. Another thing is that we probably want to use a more distinctive name (something like `_OPTIONAL_`) to avoid naming conflicts with user-defined metadata keys.
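A small illustration of the naming suggestion (the constant name and its placement are assumptions; the review only proposes "something like `_OPTIONAL_`"):

```scala
import org.apache.spark.sql.types.MetadataBuilder

// A key unlikely to collide with user-defined metadata keys.
val optionalKey = "_OPTIONAL_"  // hypothetical name, per the review suggestion

val metadata = new MetadataBuilder().putBoolean(optionalKey, true).build()
// metadata.contains(optionalKey) == true; a user key named "optional"
// would no longer clash with the internal marker.
```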
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/9940#discussion_r50167918

Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala

```scala
@@ -330,9 +331,56 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex
       // If the "c = 1" filter gets pushed down, this query will throw an exception which
       // Parquet emits. This is a Parquet issue (PARQUET-389).
+      val df = sqlContext.read.parquet(pathOne, pathTwo).filter("c = 1").selectExpr("c", "b", "a")
       checkAnswer(
-        sqlContext.read.parquet(pathOne, pathTwo).filter("c = 1").selectExpr("c", "b", "a"),
+        df,
         (1 to 1).map(i => Row(i, i.toString, null)))
+
+      // The fields "a" and "c" only exist in one Parquet file.
+      df.schema.fields.foreach { f =>
+        if (f.name == "a" || f.name == "c") {
+          assert(f.metadata.contains("optional"))
+        }
+      }
+
+      val pathThree = s"${dir.getCanonicalPath}/table3"
+      df.write.parquet(pathThree)
+
+      // We will remove the temporary metadata when writing Parquet file.
+      sqlContext.read.parquet(pathThree).schema.fields.foreach { f =>
+        assert(!f.metadata.contains("optional"))
+      }
+
+      val pathFour = s"${dir.getCanonicalPath}/table4"
+      val dfStruct = sparkContext.parallelize(Seq((1, 1))).toDF("a", "b")
+      dfStruct.select(struct("a").as("s")).write.parquet(pathFour)
+
+      val pathFive = s"${dir.getCanonicalPath}/table5"
+      val dfStruct2 = sparkContext.parallelize(Seq((1, 1))).toDF("c", "b")
+      dfStruct2.select(struct("c").as("s")).write.parquet(pathFive)
+
+      // If the "s.c = 1" filter gets pushed down, this query will throw an exception which
+      // Parquet emits.
+      val dfStruct3 = sqlContext.read.parquet(pathFour, pathFive).filter("s.c = 1")
+        .selectExpr("s.c", "s.a")
+      checkAnswer(
+        dfStruct3,
+        (1 to 1).map(i => Row(i, null)))
+
+      // The fields "s.a" and "s.c" only exist in one Parquet file.
+      dfStruct3.schema.fields.foreach { f =>
+        if (f.name == "s.a" || f.name == "s.c") {
+          assert(f.metadata.contains("optional"))
+        }
+      }
+
+      val pathSix = s"${dir.getCanonicalPath}/table6"
+      dfStruct3.write.parquet(pathSix)
+
+      // We will remove the temporary metadata when writing Parquet file.
+      sqlContext.read.parquet(pathSix).schema.fields.foreach { f =>
+        assert(!f.metadata.contains("optional"))
+      }
```

End diff:

Similar to the other one mentioned above.
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/9940#discussion_r50167888

Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala

```scala
@@ -330,9 +331,56 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex
       // ... (same hunk as above, up to:)
+
+      // The fields "s.a" and "s.c" only exist in one Parquet file.
+      dfStruct3.schema.fields.foreach { f =>
+        if (f.name == "s.a" || f.name == "s.c") {
+          assert(f.metadata.contains("optional"))
+        }
+      }
```

End diff:

```scala
assert(df.schema("a").metadata.contains("optional"))
assert(df.schema("c").metadata.contains("optional"))
```
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/9940#discussion_r50167793

Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala

```scala
@@ -330,9 +331,56 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex
       // ... (same hunk as above, up to:)
+      val dfStruct3 = sqlContext.read.parquet(pathFour, pathFive).filter("s.c = 1")
+        .selectExpr("s.c", "s.a")
+      checkAnswer(
+        dfStruct3,
+        (1 to 1).map(i => Row(i, null)))
```

End diff:

```scala
checkAnswer(dfStruct3, Row(1, null))
```
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/9940#discussion_r50167182

Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala

```scala
@@ -330,9 +331,56 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex
       // ... (same hunk as above, up to:)
+      // We will remove the temporary metadata when writing Parquet file.
+      sqlContext.read.parquet(pathThree).schema.fields.foreach { f =>
+        assert(!f.metadata.contains("optional"))
+      }
```

End diff:

```scala
assert(sqlContext.read.parquet(pathThree).schema.forall(!_.metadata.contains("optional")))
```
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/9940#discussion_r50166920

Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala

```scala
@@ -330,9 +331,56 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex
       // If the "c = 1" filter gets pushed down, this query will throw an exception which
       // Parquet emits. This is a Parquet issue (PARQUET-389).
+      val df = sqlContext.read.parquet(pathOne, pathTwo).filter("c = 1").selectExpr("c", "b", "a")
       checkAnswer(
-        sqlContext.read.parquet(pathOne, pathTwo).filter("c = 1").selectExpr("c", "b", "a"),
+        df,
         (1 to 1).map(i => Row(i, i.toString, null)))
```

End diff:

Since you are updating this test case, this line can be simplified to `Row(1, "1", null)`.
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/9940#discussion_r50166678

Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala

```scala
@@ -330,9 +331,56 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex
       // ... (same hunk as above, up to:)
+      // The fields "a" and "c" only exist in one Parquet file.
+      df.schema.fields.foreach { f =>
+        if (f.name == "a" || f.name == "c") {
+          assert(f.metadata.contains("optional"))
+        }
+      }
```

End diff:

I'd prefer the following version. If `a` or `c` doesn't even exist in `df.schema`, the current version still passes.

```scala
assert(df.schema("a").metadata.contains("optional"))
assert(df.schema("c").metadata.contains("optional"))
```
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/9940#discussion_r50165611

Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala

```scala
@@ -207,11 +207,19 @@ private[sql] object ParquetFilters {
     */
   }

+  private def getFieldMap(dataType: DataType): Array[(String, DataType)] = dataType match {
```

End diff:

It's not quite intuitive why this method is needed. Can we add a comment here to explain why fields with "optional" metadata should be filtered? Corresponding JIRA links should also be mentioned.
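The hunk only shows the method's signature. A hedged sketch of what the elided body plausibly does (an assumption based on the surrounding discussion: fields carrying the temporary "optional" marker, i.e. columns missing from some of the merged Parquet files, must be excluded from filter push-down to avoid PARQUET-389):

```scala
import org.apache.spark.sql.types._

// Build a name -> type lookup for push-down candidates, skipping fields
// that carry the temporary "optional" marker added during schema merging.
// Such columns do not exist in every Parquet file, and pushing a filter
// on them would trigger the parquet-mr bug (PARQUET-389 / SPARK-11955).
private def getFieldMap(dataType: DataType): Array[(String, DataType)] = dataType match {
  case StructType(fields) =>
    fields
      .filterNot(_.metadata.contains("optional"))
      .map(f => f.name -> f.dataType)
  case _ => Array.empty[(String, DataType)]
}
```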
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/9940#discussion_r50164899

Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/types/DataTypeSuite.scala

```scala
@@ -122,7 +122,7 @@ class DataTypeSuite extends SparkFunSuite {
     val right = StructType(List())

     val merged = left.merge(right)

-    assert(merged === left)
+    assert(DataType.equalsIgnoreCompatibleNullability(merged, left))
```

End diff:

I guess we resort to `equalsIgnoreCompatibleNullability` because of the extra metadata in `merged`? Can we also add an assertion for the added metadata instead of working around it with `equalsIgnoreCompatibleNullability`?
Github user viirya commented on the pull request: https://github.com/apache/spark/pull/9940#issuecomment-172777996

@liancheng Can we revisit this now? Or do we want to wait a bit longer?
Github user viirya commented on the pull request: https://github.com/apache/spark/pull/9940#issuecomment-162843069

@liancheng Sure. Thank you!
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/9940#issuecomment-162839835

@viirya Thanks for your efforts! Would you mind me revisiting this after the 1.6 release? I would like to see whether we can have PARQUET-389 fixed in the Parquet community ASAP, so that we may not need to work around it in Spark 1.7/2.0.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9940#issuecomment-162336101

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/47241/
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9940#issuecomment-162336100

Merged build finished. Test PASSed.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/9940#issuecomment-162335953

**[Test build #47241 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47241/consoleFull)** for PR 9940 at commit [`40533a7`](https://github.com/apache/spark/commit/40533a79de07f1159d5f5e2a6327a160f278be0e).

- This patch passes all tests.
- This patch merges cleanly.
- This patch adds the following public classes _(experimental)_:
  - `public abstract static class PrefixComputer`
Github user viirya commented on the pull request: https://github.com/apache/spark/pull/9940#issuecomment-162329711 ping @liancheng I have updated this per your suggestions. The metadata is now removed before the Parquet file is saved. I also added more test cases.
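The diff under review adds a `removeMetadata(key, dt)` helper to `StructType` that rebuilds each field's metadata via `MetadataBuilder`, recursing into nested structs, so a given key is stripped from the whole schema before it is written out. As a rough illustration of that recursion (not Spark's actual API), here is a minimal Python sketch that models a schema as plain dicts, with field and key names that are purely hypothetical:

```python
def remove_metadata(key, dt):
    """Recursively strip `key` from every field's metadata in a struct-like
    schema modeled as {"type": "struct", "fields": [...]}; each field is
    {"name", "type", "nullable", "metadata"}. Non-struct types pass through."""
    if isinstance(dt, dict) and dt.get("type") == "struct":
        new_fields = []
        for f in dt["fields"]:
            # Analogue of rebuilding field metadata with MetadataBuilder,
            # minus the key being removed.
            new_meta = {k: v for k, v in f.get("metadata", {}).items() if k != key}
            new_fields.append({
                "name": f["name"],
                "type": remove_metadata(key, f["type"]),  # recurse into nested structs
                "nullable": f["nullable"],
                "metadata": new_meta,
            })
        return {"type": "struct", "fields": new_fields}
    return dt


schema = {
    "type": "struct",
    "fields": [
        {"name": "a", "type": "int", "nullable": True,
         "metadata": {"optional": True, "doc": "example"}},
    ],
}
cleaned = remove_metadata("optional", schema)
```

Like the Scala version, this builds a new schema rather than mutating the original, so the in-memory `StructType` used for merging keeps its markers while the persisted copy does not.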
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/9940#issuecomment-162329133 **[Test build #47241 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47241/consoleFull)** for PR 9940 at commit [`40533a7`](https://github.com/apache/spark/commit/40533a79de07f1159d5f5e2a6327a160f278be0e).
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9940#issuecomment-162237940 Merged build finished. Test PASSed.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9940#issuecomment-162237942 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/47229/
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/9940#issuecomment-162237893 **[Test build #47229 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47229/consoleFull)** for PR 9940 at commit [`db8ffa3`](https://github.com/apache/spark/commit/db8ffa3b73c80ec8f311ece5b1b9e17d5d7257ba). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.