[GitHub] [spark] Fokko commented on a change in pull request #26644: [SPARK-30004][SQL] Allow merge UserDefinedType into a native DataType
Fokko commented on a change in pull request #26644: [SPARK-30004][SQL] Allow merge UserDefinedType into a native DataType URL: https://github.com/apache/spark/pull/26644#discussion_r363111332

## File path: sql/catalyst/src/test/scala/org/apache/spark/sql/types/TestUDT.scala
## @@ -58,4 +63,22 @@ object TestUDT {

   override def equals(other: Any): Boolean = other.isInstanceOf[MyDenseVectorUDT]
 }
+
+  private[sql] class MyXMLGregorianCalendarUDT extends UserDefinedType[XMLGregorianCalendar] {
+    override def sqlType: DataType = TimestampType
+
+    override def serialize(obj: XMLGregorianCalendar): Any =
+      obj.toGregorianCalendar.getTimeInMillis * 1000
+
+    override def deserialize(datum: Any): XMLGregorianCalendar = {
+      val calendar = new GregorianCalendar
+      calendar.setTimeInMillis(datum.asInstanceOf[Long])
+      DatatypeFactory.newInstance.newXMLGregorianCalendar(calendar)
+    }
+
+    override def userClass: Class[XMLGregorianCalendar] = classOf[XMLGregorianCalendar]
+
+    // By setting this to a timestamp, we lose the information about the udt
+    override private[sql] def jsonValue: JValue = "timestamp"

Review comment: That is correct. So when we serialize it, it will use:

```scala
override def serialize(obj: XMLGregorianCalendar): Any =
  obj.toGregorianCalendar.getTimeInMillis * 1000
```

This writes it as a timestamp, stored as the number of microseconds since the epoch of 1970-01-01T00:00:00.00Z (UTC+00:00). The serialize implementation has to agree with the `sqlType`: because we don't store any reference to the UDT, another Spark session will simply decode the column as a `TimestampType`, so the `TimestampType` decoder will be used. I know this isn't trivial, but it is very powerful and makes UDTs much more flexible when writing ETL jobs, since you can use your own types directly with the Dataset API. This is an automated message from the Apache Git Service. 
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
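The serialize/deserialize pair discussed above only touches JDK types, so it can be exercised outside Spark. Below is a minimal, self-contained sketch; the standalone `serialize`/`deserialize` defs are hypothetical stand-ins for the UDT methods, and the read side divides the microseconds back to millis so the round trip is exact (the quoted PR code casts the raw `Long` straight into `setTimeInMillis`, which would need the same adjustment if the stored value is microseconds):

```scala
import java.util.GregorianCalendar
import javax.xml.datatype.{DatatypeFactory, XMLGregorianCalendar}

// Spark's TimestampType stores microseconds since the epoch, hence the * 1000.
def serialize(obj: XMLGregorianCalendar): Long =
  obj.toGregorianCalendar.getTimeInMillis * 1000L

// Invert the conversion: microseconds back to millis before rebuilding the calendar.
def deserialize(micros: Long): XMLGregorianCalendar = {
  val calendar = new GregorianCalendar
  calendar.setTimeInMillis(micros / 1000L)
  DatatypeFactory.newInstance.newXMLGregorianCalendar(calendar)
}
```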
[GitHub] [spark] Fokko commented on a change in pull request #26644: [SPARK-30004][SQL] Allow merge UserDefinedType into a native DataType
Fokko commented on a change in pull request #26644: [SPARK-30004][SQL] Allow merge UserDefinedType into a native DataType URL: https://github.com/apache/spark/pull/26644#discussion_r363110958

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala
## @@ -592,6 +597,12 @@

     case (leftUdt: UserDefinedType[_], rightUdt: UserDefinedType[_])
       if leftUdt.userClass == rightUdt.userClass => leftUdt
+
+    case (leftType, rightUdt: UserDefinedType[_])
+      if leftType == rightUdt.sqlType => leftType
+
+    case (leftUdt: UserDefinedType[_], rightType)
+      if leftUdt.sqlType == rightType => rightType

Review comment: Well, `UserDefinedType` extends `DataType`, just like `TimestampType`, `StringType`, and every other type. The point is that a UserDefinedType can be made compatible with another type; for example, an int is allowed to merge into a long. This is an explicit choice by the developer.
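The precedence of these cases can be illustrated with a self-contained toy model. The `DT` types below are hypothetical stand-ins for Catalyst's `DataType`s; this is a sketch of the rule ordering, not Spark's actual `merge`:

```scala
sealed trait DT
case object TimestampT extends DT
case object LongT extends DT
final case class Udt(sqlType: DT) extends DT

def merge(left: DT, right: DT): DT = (left, right) match {
  // Two identical UDTs collapse to the left one.
  case (l: Udt, r: Udt) if l == r => l
  // A UDT merges with a native type when its underlying sqlType matches,
  // in either order; the native type always wins.
  case (l, r: Udt) if l == r.sqlType => l
  case (l: Udt, r) if l.sqlType == r => r
  // Identical native types are trivially compatible.
  case (l, r) if l == r => l
  // Anything else is a conflicting pair of schemas.
  case _ => throw new IllegalArgumentException(s"Cannot merge $left and $right")
}
```

Note that the UDT-vs-UDT case has to come first: once either side is matched as a bare `DT`, the information that it was a UDT is gone.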
[GitHub] [spark] Fokko commented on a change in pull request #26644: [SPARK-30004][SQL] Allow merge UserDefinedType into a native DataType
Fokko commented on a change in pull request #26644: [SPARK-30004][SQL] Allow merge UserDefinedType into a native DataType URL: https://github.com/apache/spark/pull/26644#discussion_r353026018

## File path: sql/core/src/test/scala/org/apache/spark/sql/UserDefinedTypeSuite.scala
## @@ -287,4 +293,63 @@ class UserDefinedTypeSuite extends QueryTest with SharedSparkSession with Parque

   checkAnswer(spark.createDataFrame(data, schema).selectExpr("typeof(a)"), Seq(Row("array")))
 }
+
+  test("Allow merge UserDefinedType into a native DataType") {

Review comment: After a good night of sleep, I've come up with an integration test that mimics the behavior seen in Delta.
[GitHub] [spark] Fokko commented on a change in pull request #26644: [SPARK-30004][SQL] Allow merge UserDefinedType into a native DataType
Fokko commented on a change in pull request #26644: [SPARK-30004][SQL] Allow merge UserDefinedType into a native DataType URL: https://github.com/apache/spark/pull/26644#discussion_r352771008

## File path: sql/core/src/test/scala/org/apache/spark/sql/UserDefinedTypeSuite.scala
## @@ -287,4 +293,63 @@ class UserDefinedTypeSuite extends QueryTest with SharedSparkSession with Parque

   checkAnswer(spark.createDataFrame(data, schema).selectExpr("typeof(a)"), Seq(Row("array")))
 }
+
+  test("Allow merge UserDefinedType into a native DataType") {

Review comment: @maropu I'm unable to reproduce the issue without Delta. Since plain Spark does not check schema compatibility on write, I'm unable to implement an end-to-end integration test. I tried to work around this by taking the union of the two DataFrames, but that takes another branch in the codebase and fails on `TypeCoercion`. Please advise.
[GitHub] [spark] Fokko commented on a change in pull request #26644: [SPARK-30004][SQL] Allow merge UserDefinedType into a native DataType
Fokko commented on a change in pull request #26644: [SPARK-30004][SQL] Allow merge UserDefinedType into a native DataType URL: https://github.com/apache/spark/pull/26644#discussion_r351535108

## File path: sql/core/src/test/scala/org/apache/spark/sql/UserDefinedTypeSuite.scala
## @@ -287,4 +293,63 @@ class UserDefinedTypeSuite extends QueryTest with SharedSparkSession with Parque

   checkAnswer(spark.createDataFrame(data, schema).selectExpr("typeof(a)"), Seq(Row("array")))
 }
+
+  test("Allow merge UserDefinedType into a native DataType") {

Review comment: Ok, I've rewritten the test, and now it fails on current master:

```
scala> val df = gregorianCalendarDf.union(timestampDf)
org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the compatible column types. timestamp <> timestamp at the first column of the second table;;
'Union
:- LogicalRDD [dt#11], false
+- LogicalRDD [dt#14], false

  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:43)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:95)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$12$$anonfun$apply$13.apply(CheckAnalysis.scala:294)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$12$$anonfun$apply$13.apply(CheckAnalysis.scala:291)
  at scala.collection.immutable.List.foreach(List.scala:392)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$12.apply(CheckAnalysis.scala:291)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$12.apply(CheckAnalysis.scala:280)
  at scala.collection.immutable.List.foreach(List.scala:392)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:280)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:86)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:86)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:95)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$executeAndCheck$1.apply(Analyzer.scala:108)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$executeAndCheck$1.apply(Analyzer.scala:105)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:201)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:105)
  at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57)
  at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55)
  at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:78)
  at org.apache.spark.sql.Dataset.withSetOperator(Dataset.scala:3424)
  at org.apache.spark.sql.Dataset.union(Dataset.scala:1862)
  ... 40 elided
```

I'm creating two DataFrames, both representing time, one of them via a UDT, and trying to merge them. This is very similar to what I'm doing with Delta: with Delta, even when using `SaveMode.Overwrite`, it will still check compatibility with the existing schema.
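A toy model (hypothetical types, not Spark's classes) of why the `AnalysisException` above prints the confusing "timestamp <> timestamp": a UDT borrows its catalog string from its underlying `sqlType`, so both sides of the union render identically while still comparing unequal:

```scala
sealed trait DT { def catalogString: String }
case object TimestampT extends DT { val catalogString = "timestamp" }
// The UDT delegates its printed name to the wrapped native type.
final case class Udt(sqlType: DT) extends DT {
  def catalogString: String = sqlType.catalogString
}

val native: DT = TimestampT
val wrapped: DT = Udt(TimestampT)
// Both render as "timestamp", yet the types are not equal, so a
// compatibility check that compares types rejects the pair.
```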
[GitHub] [spark] Fokko commented on a change in pull request #26644: [SPARK-30004][SQL] Allow merge UserDefinedType into a native DataType
Fokko commented on a change in pull request #26644: [SPARK-30004][SQL] Allow merge UserDefinedType into a native DataType URL: https://github.com/apache/spark/pull/26644#discussion_r351430247

## File path: sql/core/src/test/scala/org/apache/spark/sql/UserDefinedTypeSuite.scala
## @@ -287,4 +293,63 @@ class UserDefinedTypeSuite extends QueryTest with SharedSparkSession with Parque

   checkAnswer(spark.createDataFrame(data, schema).selectExpr("typeof(a)"), Seq(Row("array")))
 }
+
+  test("Allow merge UserDefinedType into a native DataType") {

Review comment: Interesting. When appending to the table, plain Spark does not check compatibility on write; when using Delta, it does.
[GitHub] [spark] Fokko commented on a change in pull request #26644: [SPARK-30004][SQL] Allow merge UserDefinedType into a native DataType
Fokko commented on a change in pull request #26644: [SPARK-30004][SQL] Allow merge UserDefinedType into a native DataType URL: https://github.com/apache/spark/pull/26644#discussion_r351293879

## File path: sql/core/src/test/scala/org/apache/spark/sql/UserDefinedTypeSuite.scala
## @@ -287,4 +293,63 @@ class UserDefinedTypeSuite extends QueryTest with SharedSparkSession with Parque

   checkAnswer(spark.createDataFrame(data, schema).selectExpr("typeof(a)"), Seq(Row("array")))
 }
+
+  test("Allow merge UserDefinedType into a native DataType") {

Review comment: You're right, let me investigate.
[GitHub] [spark] Fokko commented on a change in pull request #26644: [SPARK-30004][SQL] Allow merge UserDefinedType into a native DataType
Fokko commented on a change in pull request #26644: [SPARK-30004][SQL] Allow merge UserDefinedType into a native DataType URL: https://github.com/apache/spark/pull/26644#discussion_r351154890

## File path: sql/core/src/test/scala/org/apache/spark/sql/UserDefinedTypeSuite.scala
## @@ -287,4 +293,63 @@ class UserDefinedTypeSuite extends QueryTest with SharedSparkSession with Parque

   checkAnswer(spark.createDataFrame(data, schema).selectExpr("typeof(a)"), Seq(Row("array")))
 }
+
+  test("Allow merge UserDefinedType into a native DataType") {

Review comment: Ok, let's remove the rule, and I'll retrigger the test.
[GitHub] [spark] Fokko commented on a change in pull request #26644: [SPARK-30004][SQL] Allow merge UserDefinedType into a native DataType
Fokko commented on a change in pull request #26644: [SPARK-30004][SQL] Allow merge UserDefinedType into a native DataType URL: https://github.com/apache/spark/pull/26644#discussion_r351154695

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala
## @@ -453,7 +453,23 @@ case class StructType(fields: Array[StructField]) extends DataType with Seq[Stru

  * 3. If B doesn't exist in `this`, it's also included in the result schema.
  * 4. Otherwise, `this` and `that` are considered as conflicting schemas and an exception would be
  *    thrown.
+ *
+ * Function to merge the two DataTypes that is compatible with the left
+ * and right side. The order is:
+ *
+ * 1. Merge Arrays, where the type of the Arrays should be compatible
+ * 2. Merge Maps, where the type of the Maps should be compatible
+ * 3. Merge Structs, where the struct recursively checked for compatibility
+ * 4. Merge DecimalType, where the scale and precision should be equal
+ * 5. Merge UserDefinedType, where the underlying class is equal
+ * 6. Merge UserDefinedType into DataType, where the underlying DataType is equal
+ * 7. Merge DataType, where we check if the DataTypes are compatible
+ * 8. Unable to determine compatibility or incompatible, throw exception
+ *
+ * @throws org.apache.spark.SparkException In case the DataTypes are incompatible
+ * @return The compatible DataType that support both left and right
  */
+ @throws(classOf[SparkException])

Review comment: Sure, I've removed it for now.
[GitHub] [spark] Fokko commented on a change in pull request #26644: [SPARK-30004][SQL] Allow merge UserDefinedType into a native DataType
Fokko commented on a change in pull request #26644: [SPARK-30004][SQL] Allow merge UserDefinedType into a native DataType URL: https://github.com/apache/spark/pull/26644#discussion_r350698612

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala
## @@ -453,7 +453,23 @@ case class StructType(fields: Array[StructField]) extends DataType with Seq[Stru

  * 3. If B doesn't exist in `this`, it's also included in the result schema.
  * 4. Otherwise, `this` and `that` are considered as conflicting schemas and an exception would be
  *    thrown.
+ *

Review comment: Okay, I've added a statement to the existing comment.
[GitHub] [spark] Fokko commented on a change in pull request #26644: [SPARK-30004][SQL] Allow merge UserDefinedType into a native DataType
Fokko commented on a change in pull request #26644: [SPARK-30004][SQL] Allow merge UserDefinedType into a native DataType URL: https://github.com/apache/spark/pull/26644#discussion_r350674098

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala
## @@ -453,7 +453,23 @@ case class StructType(fields: Array[StructField]) extends DataType with Seq[Stru

  * 3. If B doesn't exist in `this`, it's also included in the result schema.
  * 4. Otherwise, `this` and `that` are considered as conflicting schemas and an exception would be
  *    thrown.
+ *
+ * Function to merge the two DataTypes that is compatible with the left
+ * and right side. The order is:
+ *
+ * 1. Merge Arrays, where the type of the Arrays should be compatible
+ * 2. Merge Maps, where the type of the Maps should be compatible
+ * 3. Merge Structs, where the struct recursively checked for compatibility
+ * 4. Merge DecimalType, where the scale and precision should be equal
+ * 5. Merge UserDefinedType, where the underlying class is equal
+ * 6. Merge UserDefinedType into DataType, where the underlying DataType is equal
+ * 7. Merge DataType, where we check if the DataTypes are compatible
+ * 8. Unable to determine compatibility or incompatible, throw exception
+ *
+ * @throws org.apache.spark.SparkException In case the DataTypes are incompatible
+ * @return The compatible DataType that support both left and right
  */
+ @throws(classOf[SparkException])

Review comment: If the schema isn't compatible, it will throw a SparkException. I like these functions to be annotated. If you don't like it, I'm happy to remove it.
[GitHub] [spark] Fokko commented on a change in pull request #26644: [SPARK-30004][SQL] Allow merge UserDefinedType into a native DataType
Fokko commented on a change in pull request #26644: [SPARK-30004][SQL] Allow merge UserDefinedType into a native DataType URL: https://github.com/apache/spark/pull/26644#discussion_r350673743

## File path: sql/catalyst/src/test/scala/org/apache/spark/sql/types/TestUDT.scala
## @@ -58,4 +63,22 @@ object TestUDT {

   override def equals(other: Any): Boolean = other.isInstanceOf[MyDenseVectorUDT]
 }
+
+  private[sql] class MyXMLGregorianCalendarUDT extends UserDefinedType[XMLGregorianCalendar] {
+    override def sqlType: DataType = TimestampType
+
+    override def serialize(obj: XMLGregorianCalendar): Any =
+      obj.toGregorianCalendar.getTimeInMillis * 1000
+
+    override def deserialize(datum: Any): XMLGregorianCalendar = {
+      val calendar = new GregorianCalendar
+      calendar.setTimeInMillis(datum.asInstanceOf[Long])
+      DatatypeFactory.newInstance.newXMLGregorianCalendar(calendar)
+    }
+
+    override def userClass: Class[XMLGregorianCalendar] = classOf[XMLGregorianCalendar]
+
+    // By setting this to a timestamp, we lose the information about the udt
+    override private[sql] def jsonValue: JValue = "timestamp"

Review comment: With the normal UDT `jsonValue`, we would get:

```
override private[sql] def jsonValue: JValue = {
  ("type" -> "udt") ~
    ("class" -> this.getClass.getName) ~
    ("pyClass" -> pyUDT) ~
    ("sqlType" -> sqlType.jsonValue)
}
```

This writes the type out as the UDT itself. Trying to read the column later on in another job where the UDTRegistration hasn't been done will then give an error. Therefore we would like to write the XMLGregorianCalendar as a plain timestamp. However, when we append to the table, we want to be able to merge the `XMLGregorianCalendar` UDT into that timestamp, which wouldn't be possible without the added rule. Hope this helps.
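To make the trade-off concrete, here is a sketch of the two schema-JSON shapes a field can carry (the class name is hypothetical, and the first shape is a simplification of what the default UDT `jsonValue` quoted above would produce):

```scala
// Default UDT jsonValue: the implementing class is baked into the schema,
// so any reader must have that class registered on its classpath.
val defaultUdtJson =
  """{"type":"udt","class":"org.example.MyXMLGregorianCalendarUDT","sqlType":"timestamp"}"""

// Overridden jsonValue: the field degrades to a plain timestamp.
val overriddenJson = "\"timestamp\""

// A reader without the UDT registered can only interpret a schema
// that carries no "udt" marker.
def readableWithoutUdt(schemaJson: String): Boolean =
  !schemaJson.contains("\"type\":\"udt\"")
```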
[GitHub] [spark] Fokko commented on a change in pull request #26644: [SPARK-30004][SQL] Allow merge UserDefinedType into a native DataType
Fokko commented on a change in pull request #26644: [SPARK-30004][SQL] Allow merge UserDefinedType into a native DataType URL: https://github.com/apache/spark/pull/26644#discussion_r350672550

## File path: sql/core/src/test/scala/org/apache/spark/sql/UserDefinedTypeSuite.scala
## @@ -287,4 +293,63 @@ class UserDefinedTypeSuite extends QueryTest with SharedSparkSession with Parque

   checkAnswer(spark.createDataFrame(data, schema).selectExpr("typeof(a)"), Seq(Row("array")))
 }
+
+  test("Allow merge UserDefinedType into a native DataType") {

Review comment: Perfect, much cleaner. Thanks for suggesting. I'm pretty sure it will fail without the fix, specifically on the second write. Since we've saved the `XMLGregorianCalendarImpl` as a timestamp (this is why we've overridden `jsonValue`), Spark will recognize it as a regular timestamp. At the next append we're trying to merge an XMLGregorianCalendar into a timestamp, which isn't allowed without the additional rule.
[GitHub] [spark] Fokko commented on a change in pull request #26644: [SPARK-30004][SQL] Allow merge UserDefinedType into a native DataType
Fokko commented on a change in pull request #26644: [SPARK-30004][SQL] Allow merge UserDefinedType into a native DataType URL: https://github.com/apache/spark/pull/26644#discussion_r350270807

## File path: sql/catalyst/src/test/scala/org/apache/spark/sql/types/DataTypeSuite.scala
## @@ -478,4 +479,33 @@ class DataTypeSuite extends SparkFunSuite {

   assert(result === expected)
 }
+
+  test("Allow UserDefinedType to be merged into a native DateType") {
+    // In this testcase we have stored the XMLGregorianCalendar as a Timestamp
+    // And we want to read this later on as a regular TimestampType
+    val left = StructType(
+      StructField("a", TimestampType) :: Nil)
+
+    // We should be able to merge these types since the
+    // CustomXMLGregorianCalendarType maps to a TimestampTypes
+    val right = StructType(
+      StructField("a", new MyXMLGregorianCalendarUDT) :: Nil)
+
+    // We expect the DataType to be returned instead of the UserDefinedType
+    assert(left.merge(right) === left)
+  }
+
+  test("Disallow DataType to be merged into UserDefinedType") {
+    val right = StructType(
+      StructField("a", TimestampType) :: Nil)
+
+    val left = StructType(
+      StructField("a", new MyXMLGregorianCalendarUDT) :: Nil)
+
+    // We don't allow DataTypes being mapped into a UserDefinedType.
+    // Main reason is we don't see any application in this for now.
+    intercept[SparkException] {
+      left.merge(right) === left
+    }

Review comment: Sure
[GitHub] [spark] Fokko commented on a change in pull request #26644: [SPARK-30004][SQL] Allow merge UserDefinedType into a native DataType
Fokko commented on a change in pull request #26644: [SPARK-30004][SQL] Allow merge UserDefinedType into a native DataType URL: https://github.com/apache/spark/pull/26644#discussion_r350213716

## File path: sql/catalyst/src/test/scala/org/apache/spark/sql/types/DataTypeSuite.scala
## @@ -478,4 +479,33 @@ class DataTypeSuite extends SparkFunSuite {

   assert(result === expected)
 }
+
+  test("Allow UserDefinedType to be merged into a native DateType") {
+    // In this testcase we have stored the XMLGregorianCalendar as a Timestamp
+    // And we want to read this later on as a regular TimestampType
+    val left = StructType(
+      StructField("a", TimestampType) :: Nil)
+
+    // We should be able to merge these types since the
+    // CustomXMLGregorianCalendarType maps to a TimestampTypes
+    val right = StructType(
+      StructField("a", new MyXMLGregorianCalendarUDT) :: Nil)
+
+    // We expect the DataType to be returned instead of the UserDefinedType
+    assert(left.merge(right) === left)
+  }
+
+  test("Disallow DataType to be merged into UserDefinedType") {
+    val right = StructType(
+      StructField("a", TimestampType) :: Nil)
+
+    val left = StructType(
+      StructField("a", new MyXMLGregorianCalendarUDT) :: Nil)

Review comment: Sure
[GitHub] [spark] Fokko commented on a change in pull request #26644: [SPARK-30004][SQL] Allow merge UserDefinedType into a native DataType
Fokko commented on a change in pull request #26644: [SPARK-30004][SQL] Allow merge UserDefinedType into a native DataType URL: https://github.com/apache/spark/pull/26644#discussion_r350213677

## File path: sql/catalyst/src/test/scala/org/apache/spark/sql/types/DataTypeSuite.scala
## @@ -478,4 +479,33 @@ class DataTypeSuite extends SparkFunSuite {

   assert(result === expected)
 }
+
+  test("Allow UserDefinedType to be merged into a native DateType") {
+    // In this testcase we have stored the XMLGregorianCalendar as a Timestamp
+    // And we want to read this later on as a regular TimestampType
+    val left = StructType(
+      StructField("a", TimestampType) :: Nil)
+
+    // We should be able to merge these types since the
+    // CustomXMLGregorianCalendarType maps to a TimestampTypes
+    val right = StructType(
+      StructField("a", new MyXMLGregorianCalendarUDT) :: Nil)
+
+    // We expect the DataType to be returned instead of the UserDefinedType
+    assert(left.merge(right) === left)
+  }
+
+  test("Disallow DataType to be merged into UserDefinedType") {
+    val right = StructType(
+      StructField("a", TimestampType) :: Nil)

Review comment: Sure
[GitHub] [spark] Fokko commented on a change in pull request #26644: [SPARK-30004][SQL] Allow merge UserDefinedType into a native DataType
Fokko commented on a change in pull request #26644: [SPARK-30004][SQL] Allow merge UserDefinedType into a native DataType URL: https://github.com/apache/spark/pull/26644#discussion_r350213508

## File path: sql/catalyst/src/test/scala/org/apache/spark/sql/types/TestUDT.scala
## @@ -58,4 +63,22 @@ object TestUDT {

   override def equals(other: Any): Boolean = other.isInstanceOf[MyDenseVectorUDT]
 }
+
+  private[sql] class MyXMLGregorianCalendarUDT extends UserDefinedType[XMLGregorianCalendar] {

Review comment: No, this isn't possible, since we need to override `jsonValue` to emit the native DataType. Otherwise, it will be caught by: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala#L592
[GitHub] [spark] Fokko commented on a change in pull request #26644: [SPARK-30004][SQL] Allow merge UserDefinedType into a native DataType
Fokko commented on a change in pull request #26644: [SPARK-30004][SQL] Allow merge UserDefinedType into a native DataType URL: https://github.com/apache/spark/pull/26644#discussion_r350212404

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala
## @@ -526,6 +526,25 @@ object StructType extends AbstractDataType {

   case _ => dt
 }
+
+  /**
+   * Function to merge the two DataTypes that is compatible with the left
+   * and right side. The order is:
+   *
+   * 1. Merge Arrays, where the type of the Arrays should be compatible
+   * 2. Merge Maps, where the type of the Maps should be compatible
+   * 3. Merge Structs, where the struct recursively checked for compatibility
+   * 4. Merge DecimalType, where the scale and precision should be equal
+   * 5. Merge UserDefinedType, where the underlying class is equal
+   * 6. Merge UserDefinedType into DataType, where the underlying DataType is equal
+   * 7. Merge DataType, where we check if the DataTypes are compatible
+   * 8. Unable to determine compatibility or incompatible, throw exception

Review comment: Sure
[GitHub] [spark] Fokko commented on a change in pull request #26644: [SPARK-30004][SQL] Allow merge UserDefinedType into a native DataType
Fokko commented on a change in pull request #26644: [SPARK-30004][SQL] Allow merge UserDefinedType into a native DataType URL: https://github.com/apache/spark/pull/26644#discussion_r349941019

## File path: sql/core/src/test/scala/org/apache/spark/sql/UserDefinedTypeSuite.scala
## @@ -17,15 +17,22 @@

 package org.apache.spark.sql

-import java.util.Arrays
+import java.nio.file.Files
+import java.sql.Timestamp
+import java.util.{Arrays, GregorianCalendar}

+import com.sun.org.apache.xerces.internal.jaxp.datatype.XMLGregorianCalendarImpl
+import javax.xml.datatype.XMLGregorianCalendar

Review comment: Thank you for the pointers. I've fixed the style errors.
Fokko commented on a change in pull request #26644: [SPARK-30004][SQL] Allow merge UserDefinedType into a native DataType
URL: https://github.com/apache/spark/pull/26644#discussion_r349918766

File path: sql/catalyst/src/test/scala/org/apache/spark/sql/types/DataTypeSuite.scala

@@ -478,4 +483,33 @@ class DataTypeSuite extends SparkFunSuite {

```scala
  test("SPARK-30004: Allow UserDefinedType to be merged into a standard DateType") {
    class CustomXMLGregorianCalendarType extends UserDefinedType[XMLGregorianCalendar] {
      override def sqlType: DataType = TimestampType

      override def serialize(obj: XMLGregorianCalendar): Any =
        obj.toGregorianCalendar.getTimeInMillis * 1000

      override def deserialize(datum: Any): XMLGregorianCalendar = {
        val calendar = new GregorianCalendar
        calendar.setTimeInMillis(datum.asInstanceOf[Long])
        DatatypeFactory.newInstance.newXMLGregorianCalendar(calendar)
      }

      override def userClass: Class[XMLGregorianCalendar] = classOf[XMLGregorianCalendar]

      override private[sql] def jsonValue: JValue = "timestamp"
    }

    val left = StructType(
      StructField("a", TimestampType) :: Nil)

    // We should be able to merge these types since the
    // CustomXMLGregorianCalendarType maps to a TimestampType
    val right = StructType(
      StructField("a", new CustomXMLGregorianCalendarType) :: Nil)

    assert(left.merge(right) === left)
  }
```

Review comment: The current implementation isn't symmetrical: we can merge a UserDefinedType into its underlying native DataType, but not the other way around. I'm hesitant to add that functionality because I don't see any obvious applications for it. Please let me know if you think it should be added as well. I've added a test covering the opposite case, along with some additional comments to clarify the idea and how it works.
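Because the serialize/deserialize pair in the test above uses only JDK classes, the conversion can be sanity-checked outside of Spark. The following is a standalone sketch of that conversion, not the Spark test itself; the helper names are invented for illustration, and this version divides the microseconds back down to milliseconds on the way in, since `setTimeInMillis` expects milliseconds.

```scala
import java.util.GregorianCalendar
import javax.xml.datatype.{DatatypeFactory, XMLGregorianCalendar}

// Standalone sketch of the UDT's conversion logic: an XMLGregorianCalendar
// is stored as microseconds since the epoch (TimestampType's internal unit)
// and restored from that value. toMicros/fromMicros are illustrative names.
def toMicros(obj: XMLGregorianCalendar): Long =
  obj.toGregorianCalendar.getTimeInMillis * 1000L

def fromMicros(micros: Long): XMLGregorianCalendar = {
  val calendar = new GregorianCalendar
  calendar.setTimeInMillis(micros / 1000L) // micros back to millis
  DatatypeFactory.newInstance.newXMLGregorianCalendar(calendar)
}
```

The round trip is lossless at millisecond precision; sub-millisecond values cannot survive, since `GregorianCalendar` only tracks milliseconds.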
Fokko commented on a change in pull request #26644: [SPARK-30004][SQL] Allow merge UserDefinedType into a native DataType
URL: https://github.com/apache/spark/pull/26644#discussion_r349908548

File path: sql/catalyst/src/test/scala/org/apache/spark/sql/types/DataTypeSuite.scala

@@ -478,4 +483,33 @@ class DataTypeSuite extends SparkFunSuite {

```scala
  test("SPARK-30004: Allow UserDefinedType to be merged into a standard DateType") {
```

Review comment: Thank you, I wasn't aware of this convention.
Fokko commented on a change in pull request #26644: [SPARK-30004][SQL] Allow merge UserDefinedType into a native DataType
URL: https://github.com/apache/spark/pull/26644#discussion_r349908533

File path: sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala

@@ -592,6 +592,9 @@ object StructType extends AbstractDataType {

```scala
      case (leftUdt: UserDefinedType[_], rightUdt: UserDefinedType[_])
          if leftUdt.userClass == rightUdt.userClass =>
        leftUdt

      case (leftType, rightUdt: UserDefinedType[_])
```

Review comment: I've added a Scaladoc to the private merge function. I think the Javadoc you're pointing to describes the function at a different level; for example, it doesn't mention UDTs at all. Let me know what you think.