Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]
gatorsmile commented on code in PR #46280: URL: https://github.com/apache/spark/pull/46280#discussion_r1671181356

## python/pyspark/sql/types.py:

@@ -693,8 +710,16 @@ def jsonValue(self) -> Dict[str, Any]: }

 @classmethod
-def fromJson(cls, json: Dict[str, Any]) -> "ArrayType":
-    return ArrayType(_parse_datatype_json_value(json["elementType"]), json["containsNull"])
+def fromJson(
+    cls,
+    json: Dict[str, Any],
+    fieldPath: str,
+    collationsMap: Optional[Dict[str, str]],

Review Comment: This is a breaking change. We should provide a default value.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
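The compatibility concern above can be sketched in plain Python: giving the new parameters default values keeps existing single-argument call sites working. The class below is a minimal stand-in for `pyspark.sql.types.ArrayType`, for illustration only, not the real implementation.

```python
from typing import Any, Dict, Optional

class ArrayType:
    """Minimal stand-in for pyspark.sql.types.ArrayType (illustration only)."""
    def __init__(self, element_type: Any, contains_null: bool) -> None:
        self.element_type = element_type
        self.contains_null = contains_null

    @classmethod
    def fromJson(
        cls,
        json: Dict[str, Any],
        fieldPath: str = "",                             # default avoids the breaking change
        collationsMap: Optional[Dict[str, str]] = None,  # ditto
    ) -> "ArrayType":
        # Existing call sites that pass only `json` keep working because the
        # new parameters default to harmless values; new callers may thread
        # the field path and collation map through.
        return cls(json["elementType"], json["containsNull"])

# A pre-existing call site, unchanged, still runs:
arr = ArrayType.fromJson({"elementType": "string", "containsNull": True})
```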
Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]
cloud-fan commented on PR #46280: URL: https://github.com/apache/spark/pull/46280#issuecomment-2121377372 I think so. A string type with collation should become a normal string type in the Hive table schema, so that other engines can still read it. We only keep the collation info in the Spark-specific table schema JSON string in the table properties.
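A concrete, hypothetical illustration of the comment above: the Hive-visible column type stays plain, while the collation survives only inside the Spark-specific schema JSON kept in table properties. The property name and the `__COLLATIONS` metadata key below are illustrative assumptions, not confirmed by this thread.

```python
import json

# What other engines see through the Hive metastore: a plain string column.
hive_columns = [{"name": "c1", "type": "string"}]

# What Spark additionally stores in table properties: its own schema JSON,
# where the collation lives in the field metadata rather than in the type.
spark_schema = {
    "type": "struct",
    "fields": [{
        "name": "c1",
        "type": "string",
        "nullable": True,
        "metadata": {"__COLLATIONS": {"c1": "icu.UNICODE"}},  # assumed key/name
    }],
}
table_properties = {"spark.sql.sources.schema": json.dumps(spark_schema)}

# Round-tripping through the table property preserves the collation info.
restored = json.loads(table_properties["spark.sql.sources.schema"])
```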
Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]
stefankandic commented on PR #46280: URL: https://github.com/apache/spark/pull/46280#issuecomment-2120702050 @cloud-fan I looked into HMS a bit, and it seems we can't store column metadata there, so I guess we will still have to convert a schema with collations into one without them when creating a table in Hive, even though collations are no longer part of the type?
Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]
cloud-fan closed pull request #46280: [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE URL: https://github.com/apache/spark/pull/46280
Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]
cloud-fan commented on PR #46280: URL: https://github.com/apache/spark/pull/46280#issuecomment-2118674278 thanks, merging to master!
Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]
stefankandic commented on PR #46280: URL: https://github.com/apache/spark/pull/46280#issuecomment-2117427722 @cloud-fan all checks passing, can we merge this?
Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]
stefankandic commented on code in PR #46280: URL: https://github.com/apache/spark/pull/46280#discussion_r1604650741

## common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:

@@ -36,11 +36,45 @@
  * Provides functionality to the UTF8String object which respects defined collation settings.
  */
 public final class CollationFactory {
+
+  /**
+   * Identifier for a single collation.
+   */
+  public static class CollationIdentifier {
+    public final String provider;
+    public final String name;
+    public final String version;

Review Comment: that makes sense
Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]
stefankandic commented on code in PR #46280: URL: https://github.com/apache/spark/pull/46280#discussion_r1604650376

## common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:

+  public static class CollationIdentifier {
+    public final String provider;
+    public final String name;
+    public final String version;
+
+    public CollationIdentifier(String provider, String collationName, String version) {
+      this.provider = provider;
+      this.name = collationName;
+      this.version = version;
+    }
+
+    public static CollationIdentifier fromString(String identifier) {
+      long numDots = identifier.chars().filter(ch -> ch == '.').count();
+      assert(numDots > 0);
+
+      if (numDots == 1) {
+        String[] parts = identifier.split("\\.", 2);
+        return new CollationIdentifier(parts[0], parts[1], null);
+      }
+
+      String[] parts = identifier.split("\\.", 3);
+      return new CollationIdentifier(parts[0], parts[1], parts[2]);
+    }
+
+    @Override
+    public String toString() {
+      if (version != null) {
+        return String.format("%s.%s.%s", provider, name, version);
+      }
+
+      return toStringWithoutVersion();
+    }
+
+    /**
+     * Returns the identifier's string value without the version.
+     */
+    public String toStringWithoutVersion() {

Review Comment: added
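The `fromString` logic quoted above can be mirrored in a few lines of Python for readers who want to experiment with the `provider.name[.version]` format. The provider and collation names used in the comments are placeholders, not a claim about Spark's supported values.

```python
from typing import Optional, Tuple

def parse_collation_identifier(identifier: str) -> Tuple[str, str, Optional[str]]:
    """Split 'provider.name' or 'provider.name.version', mirroring
    CollationIdentifier.fromString in the quoted diff."""
    num_dots = identifier.count(".")
    assert num_dots > 0, "expected at least 'provider.name'"
    if num_dots == 1:
        provider, name = identifier.split(".", 1)
        return provider, name, None
    # Everything after the second dot is the version (it may itself contain dots).
    provider, name, version = identifier.split(".", 2)
    return provider, name, version
```

Note the `split` limit of 3 in the Java code (2 in Python, which counts splits rather than parts): it is what lets a version like "75.1" keep its internal dot.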
Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]
cloud-fan commented on code in PR #46280: URL: https://github.com/apache/spark/pull/46280#discussion_r1604604686

## common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:

+    /**
+     * Returns the identifier's string value without the version.
+     */
+    public String toStringWithoutVersion() {

Review Comment: can we add more comments to explain when we should call `toStringWithoutVersion` instead of `toString`?
Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]
cloud-fan commented on code in PR #46280: URL: https://github.com/apache/spark/pull/46280#discussion_r160462

## common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:

+  public static class CollationIdentifier {
+    public final String provider;
+    public final String name;
+    public final String version;

Review Comment: when can `version` be null?
Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]
cloud-fan commented on code in PR #46280: URL: https://github.com/apache/spark/pull/46280#discussion_r1604598115

## common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:

+  public static class CollationIdentifier {
+    public final String provider;
+    public final String name;
+    public final String version;

Review Comment: To strictly follow the Java style, I think these fields should be `private` and we should add `getXXX` methods to return them. The field type should not be `Optional`, but the `getXXX` return type can be `Optional`.
Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]
stefankandic commented on code in PR #46280: URL: https://github.com/apache/spark/pull/46280#discussion_r1603733508

## sql/catalyst/src/test/scala/org/apache/spark/sql/types/DataTypeSuite.scala:

@@ -712,4 +714,181 @@ class DataTypeSuite extends SparkFunSuite {

+  test("schema with collation should not change during ser/de") {
+    val simpleStruct = StructType(
+      StructField("c1", StringType(UNICODE_COLLATION_ID)) :: Nil)
+
+    val nestedStruct = StructType(
+      StructField("nested", simpleStruct) :: Nil)
+
+    val caseInsensitiveNames = StructType(
+      StructField("c1", StringType(UNICODE_COLLATION_ID)) ::
+      StructField("C1", StringType(UNICODE_COLLATION_ID)) :: Nil)
+
+    val specialCharsInName = StructType(
+      StructField("c1.*23?", StringType(UNICODE_COLLATION_ID)) :: Nil)
+
+    val arrayInSchema = StructType(
+      StructField("arrayField", ArrayType(StringType(UNICODE_COLLATION_ID))) :: Nil)
+
+    val mapInSchema = StructType(
+      StructField("mapField",
+        MapType(StringType(UNICODE_COLLATION_ID), StringType(UNICODE_COLLATION_ID))) :: Nil)
+
+    val mapWithKeyInNameInSchema = StructType(
+      StructField("name.key", StringType) ::
+      StructField("name",
+        MapType(StringType(UNICODE_COLLATION_ID), StringType(UNICODE_COLLATION_ID))) :: Nil)
+
+    val arrayInMapInNestedSchema = StructType(
+      StructField("arrInMap",
+        MapType(StringType(UNICODE_COLLATION_ID),
+          ArrayType(StringType(UNICODE_COLLATION_ID)))) :: Nil)
+
+    val nestedArrayInMap = StructType(
+      StructField("nestedArrayInMap",
+        ArrayType(MapType(StringType(UNICODE_COLLATION_ID),
+          ArrayType(ArrayType(StringType(UNICODE_COLLATION_ID)))))) :: Nil)
+
+    val schemaWithMultipleFields = StructType(
+      simpleStruct.fields ++ nestedStruct.fields ++ arrayInSchema.fields ++ mapInSchema.fields ++
+        mapWithKeyInNameInSchema ++ arrayInMapInNestedSchema.fields ++ nestedArrayInMap.fields)
+
+    Seq(
+      simpleStruct, caseInsensitiveNames, specialCharsInName, nestedStruct, arrayInSchema,
+      mapInSchema, mapWithKeyInNameInSchema, nestedArrayInMap, arrayInMapInNestedSchema,
+      schemaWithMultipleFields)
+      .foreach { schema =>
+        val json = schema.json
+        val parsed = DataType.fromJson(json)
+        assert(parsed === schema)
+      }
+  }
+
+  test("non string field has collation metadata") {

Review Comment: `StructTypeSuite` is used just to validate the `toJson` value; deserialization is tested in `DataTypeSuite`.
Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]
stefankandic commented on code in PR #46280: URL: https://github.com/apache/spark/pull/46280#discussion_r1603732173

## sql/api/src/main/scala/org/apache/spark/sql/types/DataType.scala:

@@ -208,22 +206,35 @@ object DataType {

   // NOTE: Map fields must be sorted in alphabetical order to keep consistent with the Python side.
-  private[sql] def parseDataType(json: JValue): DataType = json match {
+  private[sql] def parseDataType(
+      json: JValue,
+      fieldPath: String = "",
+      collationsMap: Map[String, String] = Map.empty): DataType = json match {
     case JString(name) =>
-      nameToType(name)
+      collationsMap.get(fieldPath) match {
+        case Some(collation) =>
+          assertValidTypeForCollations(fieldPath, name, collationsMap)
+          stringTypeWithCollation(collation)
+        case _ => nameToType(name)
+      }
     case JSortedObject(
       ("containsNull", JBool(n)),
       ("elementType", t: JValue),
       ("type", JString("array"))) =>
-      ArrayType(parseDataType(t), n)
+      assertValidTypeForCollations(fieldPath, "array", collationsMap)
+      val elementType = parseDataTypeWithCollation(t, fieldPath + ".element", collationsMap)
+      ArrayType(elementType, n)
     case JSortedObject(
       ("keyType", k: JValue),
       ("type", JString("map")),
       ("valueContainsNull", JBool(n)),
       ("valueType", v: JValue)) =>
-      MapType(parseDataType(k), parseDataType(v), n)
+      assertValidTypeForCollations(fieldPath, "map", collationsMap)
+      val keyType = parseDataTypeWithCollation(k, fieldPath + ".key", collationsMap)
+      val valueType = parseDataTypeWithCollation(v, fieldPath + ".value", collationsMap)

Review Comment: good catch! After some refactoring earlier, this is no longer needed.
Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]
stefankandic commented on code in PR #46280: URL: https://github.com/apache/spark/pull/46280#discussion_r1603727528

## common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:

+  public static class CollationIdentifier {
+    public final String provider;
+    public final String name;
+    public final String version;

Review Comment: So you also think we should put `Optional` here? Even IntelliJ complains when it is used for a field (screenshot: https://github.com/apache/spark/assets/154237371/ea7bd5c5-43cb-4826-aae5-bf76588b6198). Also, you can read below that it's intended to be used for method return types.
Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]
stefankandic commented on code in PR #46280: URL: https://github.com/apache/spark/pull/46280#discussion_r1603651070

## sql/api/src/main/scala/org/apache/spark/sql/types/StructField.scala:

@@ -63,7 +66,61 @@ case class StructField(
     ("name" -> name) ~
     ("type" -> dataType.jsonValue) ~
     ("nullable" -> nullable) ~
-    ("metadata" -> metadata.jsonValue)
+    ("metadata" -> metadataJson)
+  }
+
+  private def metadataJson: JValue = {
+    val metadataJsonValue = metadata.jsonValue
+    metadataJsonValue match {
+      case JObject(fields) if collationMetadata.nonEmpty =>
+        val collationFields = collationMetadata.map(kv => kv._1 -> JString(kv._2)).toList
+        JObject(fields :+ (DataType.COLLATIONS_METADATA_KEY -> JObject(collationFields)))
+
+      case _ => metadataJsonValue
+    }
+  }
+
+  /** Map of field path to collation name. */
+  private lazy val collationMetadata: Map[String, String] = {
+    val fieldToCollationMap = mutable.Map[String, String]()
+
+    def visitRecursively(dt: DataType, path: String): Unit = dt match {
+      case at: ArrayType =>
+        processDataType(at.elementType, path + ".element")
+
+      case mt: MapType =>
+        processDataType(mt.keyType, path + ".key")
+        processDataType(mt.valueType, path + ".value")
+
+      case st: StringType if isCollatedString(st) =>
+        fieldToCollationMap(path) = schemaCollationValue(st)
+
+      case _ =>
+    }
+
+    def processDataType(dt: DataType, path: String): Unit = {
+      if (isCollatedString(dt)) {
+        fieldToCollationMap(path) = schemaCollationValue(dt)
+      } else {
+        visitRecursively(dt, path)
+      }
+    }
+
+    visitRecursively(dataType, name)
+    fieldToCollationMap.toMap
+  }
+
+  private def isCollatedString(dt: DataType): Boolean = dt match {
+    case st: StringType => !st.isUTF8BinaryCollation
+    case _ => false
+  }
+
+  private def schemaCollationValue(dt: DataType): String = dt match {
+    case st: StringType =>
+      val collation = CollationFactory.fetchCollation(st.collationId)
+      collation.identifier().toStringWithoutVersion()

Review Comment: for writing statistics - a version change invalidates them.
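The `collationMetadata` traversal discussed above can be sketched outside Spark. The function below is a hypothetical Python mirror: data types are simplified to plain dicts, and the field paths follow the PR's convention of `<field>.key`, `<field>.value`, and `<field>.element`. The collation names are placeholders.

```python
def collect_collations(name, data_type):
    """Walk a simplified data-type tree and record, per field path, which
    string fields carry a collation (mirrors StructField.collationMetadata)."""
    result = {}

    def visit(dt, path):
        kind = dt.get("type")
        if kind == "array":
            process(dt["elementType"], path + ".element")
        elif kind == "map":
            process(dt["keyType"], path + ".key")
            process(dt["valueType"], path + ".value")
        elif kind == "string" and dt.get("collation"):
            result[path] = dt["collation"]

    def process(dt, path):
        # A collated string is recorded directly; anything else is recursed into.
        if dt.get("type") == "string" and dt.get("collation"):
            result[path] = dt["collation"]
        else:
            visit(dt, path)

    visit(data_type, name)
    return result
```

For a map field named `m` with a collated key and a collated array element as its value, this produces paths `m.key` and `m.value.element`, matching the shape stored under the collations metadata key.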
Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]
cloud-fan commented on code in PR #46280: URL: https://github.com/apache/spark/pull/46280#discussion_r1603406813

## sql/catalyst/src/test/scala/org/apache/spark/sql/types/DataTypeSuite.scala:

+  test("non string field has collation metadata") {

Review Comment: how do we decide to put a test in here or in `StructTypeSuite`?
Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]
cloud-fan commented on code in PR #46280: URL: https://github.com/apache/spark/pull/46280#discussion_r1603405105

## sql/api/src/main/scala/org/apache/spark/sql/types/StructField.scala:

+  private def schemaCollationValue(dt: DataType): String = dt match {
+    case st: StringType =>
+      val collation = CollationFactory.fetchCollation(st.collationId)
+      collation.identifier().toStringWithoutVersion()

Review Comment: when will we use the version?
Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]
cloud-fan commented on code in PR #46280: URL: https://github.com/apache/spark/pull/46280#discussion_r1603404296

## sql/api/src/main/scala/org/apache/spark/sql/types/StructField.scala:

+  private def schemaCollationValue(dt: DataType): String = dt match {
+    case st: StringType =>
+      val collation = CollationFactory.fetchCollation(st.collationId)
+      collation.identifier().toStringWithoutVersion()
+    case _ =>
+      throw new IllegalStateException(s"Unexpected data type $dt")

Review Comment: we should use `SparkException.internalError`
Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]
cloud-fan commented on code in PR #46280: URL: https://github.com/apache/spark/pull/46280#discussion_r1603399289

## sql/api/src/main/scala/org/apache/spark/sql/types/DataType.scala:

+      assertValidTypeForCollations(fieldPath, "map", collationsMap)
+      val keyType = parseDataTypeWithCollation(k, fieldPath + ".key", collationsMap)
+      val valueType = parseDataTypeWithCollation(v, fieldPath + ".value", collationsMap)

Review Comment: why do we need `parseDataTypeWithCollation`? It looks correct to just call `parseDataType`.
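The parse-side half of the scheme discussed in these comments can also be sketched: walk the JSON type tree and, whenever the current field path appears in the collations map, turn the plain `"string"` into a collated one. Everything below (tuple encoding of types, placeholder collation names) is illustrative, not the real `parseDataType` API.

```python
def parse_type(json_value, field_path="", collations=None):
    """Simplified mirror of parseDataType with a field-path -> collation map.
    Types are encoded as tuples; only string, array, and map are handled."""
    collations = collations or {}
    if isinstance(json_value, str):
        if field_path in collations:
            # A collation entry is only valid on a string field.
            assert json_value == "string", "collation recorded for a non-string field"
            return ("string", collations[field_path])
        return (json_value, None)
    kind = json_value["type"]
    if kind == "array":
        elem = parse_type(json_value["elementType"], field_path + ".element", collations)
        return ("array", elem, json_value["containsNull"])
    if kind == "map":
        key = parse_type(json_value["keyType"], field_path + ".key", collations)
        val = parse_type(json_value["valueType"], field_path + ".value", collations)
        return ("map", key, val, json_value["valueContainsNull"])
    raise ValueError(f"unsupported type: {kind}")
```

Note how the recursive calls extend the path with `.element`, `.key`, and `.value`, so the map built at serialization time lines up with the paths reconstructed here.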
Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]
cloud-fan commented on code in PR #46280: URL: https://github.com/apache/spark/pull/46280#discussion_r1603393035

## sql/api/src/main/scala/org/apache/spark/sql/types/DataType.scala:

+      assertValidTypeForCollations(fieldPath, "map", collationsMap)

Review Comment: shall we do this assert in the StructType branch as well?
Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]
cloud-fan commented on code in PR #46280: URL: https://github.com/apache/spark/pull/46280#discussion_r1603384345 ## common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java: ## @@ -110,6 +158,8 @@ public Collation( // No Collation can simultaneously support binary equality and lowercase equality assert(!supportsBinaryEquality || !supportsLowercaseEquality); + assert(SUPPORTED_PROVIDERS.contains(provider) || provider == null); Review Comment: when can `provider` be null?
Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]
cloud-fan commented on code in PR #46280: URL: https://github.com/apache/spark/pull/46280#discussion_r1603382702 ## common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java: ## @@ -36,11 +36,45 @@ * Provides functionality to the UTF8String object which respects defined collation settings. */ public final class CollationFactory { + + /** + * Identifier for single a collation. + */ + public static class CollationIdentifier { +public final String provider; +public final String name; +public final String version; Review Comment: attributes are not optional... It's a bad practice to leave attributes un-initialized, which is null for reference types.
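For reference, the `provider.name.version` format that `CollationIdentifier.fromString` parses in the diff above can be mirrored in Python. The helper name is hypothetical, but splitting on the first two dots matches the quoted Java `identifier.split("\\.", 3)`, so the version component may itself contain dots.

```python
# A Python mirror of the provider.name.version identifier format. Splitting
# on at most the first two dots corresponds to Java's split("\\.", 3),
# which lets the trailing version component contain dots (e.g. "75.1").
def parse_collation_identifier(identifier):
    parts = identifier.split(".", 2)  # at most three parts
    if len(parts) != 3:
        raise ValueError(f"expected provider.name.version, got {identifier!r}")
    return tuple(parts)  # (provider, name, version)
```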
Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]
olaky commented on code in PR #46280: URL: https://github.com/apache/spark/pull/46280#discussion_r1603195283 ## python/pyspark/sql/tests/test_types.py: ## @@ -549,6 +549,129 @@ def test_convert_list_to_str(self): self.assertEqual(df.count(), 1) self.assertEqual(df.head(), Row(name="[123]", income=120)) +def test_schema_with_collations_json_ser_de(self): +from pyspark.sql.types import _parse_datatype_json_string + +unicode_collation = "UNICODE" + +simple_struct = StructType([StructField("c1", StringType(unicode_collation))]) + +nested_struct = StructType([StructField("nested", simple_struct)]) + +array_in_schema = StructType( +[StructField("array", ArrayType(StringType(unicode_collation)))] +) + +map_in_schema = StructType( +[ +StructField( +"map", MapType(StringType(unicode_collation), StringType(unicode_collation)) +) +] +) + +array_in_map = StructType( +[ +StructField( +"arrInMap", +MapType( +StringType(unicode_collation), ArrayType(StringType(unicode_collation)) +), +) +] +) + +nested_array_in_map = StructType( +[ +StructField( +"nestedArrayInMap", +ArrayType( +MapType( +StringType(unicode_collation), + ArrayType(ArrayType(StringType(unicode_collation))), +) +), +) +] +) + +schema_with_multiple_fields = StructType( +simple_struct.fields ++ nested_struct.fields ++ array_in_schema.fields ++ map_in_schema.fields ++ array_in_map.fields ++ nested_array_in_map.fields +) + +schemas = [ +simple_struct, +nested_struct, +array_in_schema, +map_in_schema, +nested_array_in_map, +array_in_map, +schema_with_multiple_fields, +] + +for schema in schemas: +scala_datatype = self.spark._jsparkSession.parseDataType(schema.json()) Review Comment: right, my bad -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]
olaky commented on code in PR #46280: URL: https://github.com/apache/spark/pull/46280#discussion_r1601191384 ## common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java: ## @@ -36,11 +36,45 @@ * Provides functionality to the UTF8String object which respects defined collation settings. */ public final class CollationFactory { + + /** + * Identifier for single a collation. + */ + public static class CollationIdentifier { +public final String provider; +public final String name; +public final String version; Review Comment: Honestly not too familiar with how this is usually done in Java, but Optional would clearly express that the absence is not a bug (which a null often indicates)
Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]
stefankandic commented on code in PR #46280: URL: https://github.com/apache/spark/pull/46280#discussion_r1601881695 ## sql/api/src/main/scala/org/apache/spark/sql/types/StructField.scala: ## @@ -63,7 +66,60 @@ case class StructField( ("name" -> name) ~ ("type" -> dataType.jsonValue) ~ ("nullable" -> nullable) ~ - ("metadata" -> metadata.jsonValue) + ("metadata" -> metadataJson) + } + + private def metadataJson: JValue = { +val metadataJsonValue = metadata.jsonValue +metadataJsonValue match { + case JObject(fields) if collationMetadata.nonEmpty => +val collationFields = collationMetadata.map(kv => kv._1 -> JString(kv._2)).toList +JObject(fields :+ (DataType.COLLATIONS_METADATA_KEY -> JObject(collationFields))) + + case _ => metadataJsonValue +} + } + + /** Map of field path to collation name. */ + private lazy val collationMetadata: mutable.Map[String, String] = { +val fieldToCollationMap = mutable.Map[String, String]() + +def visitRecursively(dt: DataType, path: String): Unit = dt match { + case at: ArrayType => +processDataType(at.elementType, path + ".element") + + case mt: MapType => +processDataType(mt.keyType, path + ".key") +processDataType(mt.valueType, path + ".value") + + case st: StringType if isCollatedString(st) => +fieldToCollationMap(path) = collationName(st) + + case _ => +} + +def processDataType(dt: DataType, path: String): Unit = { + if (isCollatedString(dt)) { +fieldToCollationMap(path) = collationName(dt) + } else { +visitRecursively(dt, path) + } +} + +visitRecursively(dataType, name) Review Comment: Each column has separate metadata so map and string can never clash -> here is how it would look like in json: ```json { "type": "struct", "fields": [ { "name": "c1.key", "type": "string", "nullable": true, "metadata": { "__COLLATIONS": { "c1.key": "ICU.UNICODE_CI" } } }, { "name": "c1", "type": { "type": "map", "keyType": "string", "valueType": "string", "valueContainsNull": true }, "nullable": true, "metadata": { "__COLLATIONS": { "c1.value": 
"ICU.UNICODE", "c1.key": "ICU.UNICODE" } } } ] } ``` "c1.key" will mean different things in each column - for the first it's just the entire column and for the second it is the key of the map
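The field-path construction described above (`.key`/`.value`/`.element` suffixes appended while walking the column's type) can be sketched as a small visitor. This is a simplified model using tuples, not the actual `StructField.collationMetadata` implementation.

```python
# A runnable sketch (simplified tuple-based types, not the PySpark classes)
# of the visitRecursively logic above: collect a {field_path: collation} map
# for one column, using ".key"/".value"/".element" path segments.
def collect_collations(name, dtype):
    out = {}

    def visit(dt, path):
        kind = dt[0]
        if kind == "string" and dt[1] is not None:
            out[path] = dt[1]          # collated string leaf
        elif kind == "array":
            visit(dt[1], path + ".element")
        elif kind == "map":
            visit(dt[1], path + ".key")
            visit(dt[2], path + ".value")

    visit(dtype, name)
    return out
```

Because each column carries its own metadata object, a literal column named `c1.key` and the key of a map column `c1` can both produce the path `c1.key` without clashing, exactly as the JSON example shows.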
Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]
stefankandic commented on code in PR #46280: URL: https://github.com/apache/spark/pull/46280#discussion_r1601878212

## python/pyspark/sql/types.py:

```
@@ -1702,9 +1791,16 @@ def _parse_datatype_json_string(json_string: str) -> DataType:
     return _parse_datatype_json_value(json.loads(json_string))


-def _parse_datatype_json_value(json_value: Union[dict, str]) -> DataType:
+def _parse_datatype_json_value(
+    json_value: Union[dict, str],
+    fieldPath: str = "",
+    collationsMap: Optional[Dict[str, str]] = None,
+) -> DataType:
     if not isinstance(json_value, dict):
         if json_value in _all_atomic_types.keys():
+            if collationsMap is not None and fieldPath in collationsMap:
+                collationName = collationsMap[fieldPath].split(".")[1]
```

Review Comment: sure!
Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]
stefankandic commented on code in PR #46280: URL: https://github.com/apache/spark/pull/46280#discussion_r1601839139 ## python/pyspark/sql/types.py: ## @@ -876,30 +894,86 @@ def __init__( self.dataType = dataType self.nullable = nullable self.metadata = metadata or {} +self._collationMetadata: Optional[Dict[str, str]] = None def simpleString(self) -> str: return "%s:%s" % (self.name, self.dataType.simpleString()) def __repr__(self) -> str: return "StructField('%s', %s, %s)" % (self.name, self.dataType, str(self.nullable)) +def __eq__(self, other: Any) -> bool: +# since collationMetadata is lazy evaluated we should not use it in equality check Review Comment: I agree, doing this for every json call is annoying but since we don't have the security of immutable data like in scala perhaps we should avoid lazy eval
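The equality concern raised above can be illustrated with a minimal sketch (a toy `Field` class, not PySpark's `StructField`): the lazily populated cache is deliberately excluded from `__eq__`, otherwise two logically equal fields would compare unequal depending on whether one of them had already been serialized.

```python
# A toy sketch (not PySpark's StructField) of the equality concern above:
# a lazily populated cache must be excluded from __eq__, otherwise two
# logically equal fields compare unequal once one of them has serialized.
class Field:
    def __init__(self, name):
        self.name = name
        self._collation_metadata = None  # filled lazily, e.g. on json()

    def __eq__(self, other):
        # Compare only logical attributes, never the cache.
        return isinstance(other, Field) and self.name == other.name
```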
Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]
stefankandic commented on code in PR #46280: URL: https://github.com/apache/spark/pull/46280#discussion_r1601793860 ## python/pyspark/sql/tests/test_types.py: ## @@ -549,6 +549,129 @@ def test_convert_list_to_str(self): self.assertEqual(df.count(), 1) self.assertEqual(df.head(), Row(name="[123]", income=120)) +def test_schema_with_collations_json_ser_de(self): +from pyspark.sql.types import _parse_datatype_json_string + +unicode_collation = "UNICODE" + +simple_struct = StructType([StructField("c1", StringType(unicode_collation))]) + +nested_struct = StructType([StructField("nested", simple_struct)]) + +array_in_schema = StructType( +[StructField("array", ArrayType(StringType(unicode_collation)))] +) + +map_in_schema = StructType( +[ +StructField( +"map", MapType(StringType(unicode_collation), StringType(unicode_collation)) +) +] +) + +array_in_map = StructType( +[ +StructField( +"arrInMap", +MapType( +StringType(unicode_collation), ArrayType(StringType(unicode_collation)) +), +) +] +) + +nested_array_in_map = StructType( +[ +StructField( +"nestedArrayInMap", +ArrayType( +MapType( +StringType(unicode_collation), + ArrayType(ArrayType(StringType(unicode_collation))), +) +), +) +] +) + +schema_with_multiple_fields = StructType( +simple_struct.fields ++ nested_struct.fields ++ array_in_schema.fields ++ map_in_schema.fields ++ array_in_map.fields ++ nested_array_in_map.fields +) + +schemas = [ +simple_struct, +nested_struct, +array_in_schema, +map_in_schema, +nested_array_in_map, +array_in_map, +schema_with_multiple_fields, +] + +for schema in schemas: +scala_datatype = self.spark._jsparkSession.parseDataType(schema.json()) +python_datatype = _parse_datatype_json_string(scala_datatype.json()) +assert schema == python_datatype +assert schema == _parse_datatype_json_string(schema.json()) + +def test_schema_with_collations_on_non_string_types(self): +from pyspark.sql.types import _parse_datatype_json_string, _COLLATIONS_METADATA_KEY + +collations_on_int_col_json = f""" +{{ 
+ "type": "struct", + "fields": [ +{{ + "name": "c1", + "type": "integer", + "nullable": true, + "metadata": {{ +"{_COLLATIONS_METADATA_KEY}": {{ + "c1": "icu.UNICODE" +}} + }} +}} + ] +}} +""" + +collations_in_map_json = f""" +{{ + "type": "struct", + "fields": [ +{{ + "name": "mapField", + "type": {{ +"type": "map", +"keyType": "string", +"valueType": "integer", +"valueContainsNull": true + }}, + "nullable": true, + "metadata": {{ +"{_COLLATIONS_METADATA_KEY}": {{ + "mapField.value": "icu.UNICODE" Review Comment: We talked about this one a bit offline, but I would rather tackle this as a separate issue than just a collation protocol error. Currently, both python and scala code will not fail when encountering duplicate keys; python will just pick one to put in the dictionary and scala will have both in the `JObject`. What do you think @cloud-fan ?
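The Python-side duplicate-key behavior described above is easy to confirm: the stdlib `json` module accepts duplicate object keys and silently keeps the last occurrence rather than raising.

```python
import json

# Confirms the Python-side behavior described above: the stdlib json module
# does not fail on duplicate object keys; the last occurrence silently wins.
doc = '{"__COLLATIONS": {"mapField.value": "icu.UNICODE", "mapField.value": "icu.UNICODE_CI"}}'
parsed = json.loads(doc)
# parsed["__COLLATIONS"] == {"mapField.value": "icu.UNICODE_CI"}
```

(If rejecting duplicates were desired, `json.loads` accepts an `object_pairs_hook` callback that sees every key-value pair and could raise on repeats.)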
Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]
stefankandic commented on code in PR #46280: URL: https://github.com/apache/spark/pull/46280#discussion_r1601784552 ## common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java: ## @@ -36,11 +36,45 @@ * Provides functionality to the UTF8String object which respects defined collation settings. */ public final class CollationFactory { + + /** + * Identifier for single a collation. + */ + public static class CollationIdentifier { +public final String provider; +public final String name; +public final String version; Review Comment: I thought that optional is used in java for method return types, and not for attributes since they are "optional" anyways?
Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]
stefankandic commented on code in PR #46280: URL: https://github.com/apache/spark/pull/46280#discussion_r1601770280 ## python/pyspark/sql/tests/test_types.py: ## @@ -549,6 +549,129 @@ def test_convert_list_to_str(self): self.assertEqual(df.count(), 1) self.assertEqual(df.head(), Row(name="[123]", income=120)) +def test_schema_with_collations_json_ser_de(self): +from pyspark.sql.types import _parse_datatype_json_string + +unicode_collation = "UNICODE" + +simple_struct = StructType([StructField("c1", StringType(unicode_collation))]) + +nested_struct = StructType([StructField("nested", simple_struct)]) + +array_in_schema = StructType( +[StructField("array", ArrayType(StringType(unicode_collation)))] +) + +map_in_schema = StructType( +[ +StructField( +"map", MapType(StringType(unicode_collation), StringType(unicode_collation)) +) +] +) + +array_in_map = StructType( +[ +StructField( +"arrInMap", +MapType( +StringType(unicode_collation), ArrayType(StringType(unicode_collation)) +), +) +] +) + +nested_array_in_map = StructType( +[ +StructField( +"nestedArrayInMap", +ArrayType( +MapType( +StringType(unicode_collation), + ArrayType(ArrayType(StringType(unicode_collation))), +) +), +) +] +) + +schema_with_multiple_fields = StructType( +simple_struct.fields ++ nested_struct.fields ++ array_in_schema.fields ++ map_in_schema.fields ++ array_in_map.fields ++ nested_array_in_map.fields +) + +schemas = [ +simple_struct, +nested_struct, +array_in_schema, +map_in_schema, +nested_array_in_map, +array_in_map, +schema_with_multiple_fields, +] + +for schema in schemas: +scala_datatype = self.spark._jsparkSession.parseDataType(schema.json()) Review Comment: aren't we doing that in the following line `python_datatype = _parse_datatype_json_string(scala_datatype.json())`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]
stefankandic commented on code in PR #46280: URL: https://github.com/apache/spark/pull/46280#discussion_r1601767087 ## sql/catalyst/src/test/scala/org/apache/spark/sql/types/StructTypeSuite.scala: ## @@ -606,4 +612,181 @@ class StructTypeSuite extends SparkFunSuite with SQLHelper { "b STRING NOT NULL,c STRING COMMENT 'nullable comment'") assert(fromDDL(struct.toDDL) === struct) } + + test("simple struct with collations to json") { +val simpleStruct = StructType( + StructField("c1", StringType(UNICODE_COLLATION)) :: Nil) + +val expectedJson = + s""" + |{ + | "type": "struct", + | "fields": [ + |{ + | "name": "c1", Review Comment: there are already tests with dots in `DataTypeSuite` because we care about deserialization there as well, so that's why I didn't put them here. I also added tests there for case sensitivity and special characters.
Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]
stefankandic commented on code in PR #46280: URL: https://github.com/apache/spark/pull/46280#discussion_r1601765377 ## sql/api/src/main/scala/org/apache/spark/sql/types/DataType.scala: ## @@ -208,22 +206,33 @@ object DataType { } // NOTE: Map fields must be sorted in alphabetical order to keep consistent with the Python side. - private[sql] def parseDataType(json: JValue): DataType = json match { + private[sql] def parseDataType( + json: JValue, + fieldPath: String = "", Review Comment: there can be no ambiguity with dots in the names; using a string or a list makes no difference
Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]
stefankandic commented on code in PR #46280: URL: https://github.com/apache/spark/pull/46280#discussion_r1601763659 ## python/pyspark/sql/types.py: ## @@ -263,21 +263,30 @@ def __init__(self, collation: Optional[str] = None): def fromCollationId(self, collationId: int) -> "StringType": return StringType(StringType.collationNames[collationId]) -def collationIdToName(self) -> str: -if self.collationId == 0: -return "" -else: -return " collate %s" % StringType.collationNames[self.collationId] +@classmethod +def collationIdToName(cls, collationId: int) -> str: +return StringType.collationNames[collationId] @classmethod def collationNameToId(cls, collationName: str) -> int: return StringType.collationNames.index(collationName) +@classmethod +def collationProvider(cls, collationName: str) -> str: +if collationName.startswith("UTF8"): Review Comment: I think that we will probably have to revamp this class completely to support all collations anyways, so it will probably look more like the scala side, am i correct @nikolamand-db? That's why I think we should be fine with this for now, but I've added a `TODO` to do it properly
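A minimal sketch of the provider inference being discussed. The mapping here is an assumption drawn from this thread, not the authoritative one: the review suggests built-in `UTF8_*` collations belong to the `spark` provider and everything else to `icu`, with the real mapping living in `CollationFactory` on the JVM side.

```python
# Hedged sketch of the provider lookup under discussion. Assumption from this
# thread (not the authoritative CollationFactory mapping): built-in UTF8_*
# collations belong to the "spark" provider, all others to "icu".
def collation_provider(collation_name):
    return "spark" if collation_name.upper().startswith("UTF8") else "icu"
```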
Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]
olaky commented on code in PR #46280: URL: https://github.com/apache/spark/pull/46280#discussion_r1601024762 ## python/pyspark/sql/tests/test_types.py: ## @@ -549,6 +549,129 @@ def test_convert_list_to_str(self): self.assertEqual(df.count(), 1) self.assertEqual(df.head(), Row(name="[123]", income=120)) +def test_schema_with_collations_json_ser_de(self): +from pyspark.sql.types import _parse_datatype_json_string + +unicode_collation = "UNICODE" + +simple_struct = StructType([StructField("c1", StringType(unicode_collation))]) + +nested_struct = StructType([StructField("nested", simple_struct)]) + +array_in_schema = StructType( +[StructField("array", ArrayType(StringType(unicode_collation)))] +) + +map_in_schema = StructType( +[ +StructField( +"map", MapType(StringType(unicode_collation), StringType(unicode_collation)) +) +] +) + +array_in_map = StructType( +[ +StructField( +"arrInMap", +MapType( +StringType(unicode_collation), ArrayType(StringType(unicode_collation)) +), +) +] +) + +nested_array_in_map = StructType( Review Comment: ```suggestion nested_array_in_map_value = StructType( ``` ## common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java: ## @@ -36,11 +36,45 @@ * Provides functionality to the UTF8String object which respects defined collation settings. */ public final class CollationFactory { + + /** + * Identifier for single a collation. 
+ */ + public static class CollationIdentifier { +public final String provider; +public final String name; +public final String version; + +public CollationIdentifier(String provider, String collationName, String version) { + this.provider = provider; + this.name = collationName; + this.version = version; +} + +public static CollationIdentifier fromString(String identifier) { + String[] parts = identifier.split("\\.", 3); + return new CollationIdentifier(parts[0], parts[1], parts[2]); +} + +@Override +public String toString() { + return String.format("%s.%s.%s", provider, name, version); +} + +/** + * Returns the identifier's string value without the version. + */ +public String valueWithoutVersion() { Review Comment: ```suggestion public String toStringWithoutVersion() { ``` ## common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java: ## @@ -36,11 +36,45 @@ * Provides functionality to the UTF8String object which respects defined collation settings. */ public final class CollationFactory { + + /** + * Identifier for single a collation. + */ + public static class CollationIdentifier { +public final String provider; +public final String name; +public final String version; Review Comment: How are we representing identifiers without version? Should we make this Optional[String]? 
The provider might also be optional, but this will make parsing a lot more difficult ## python/pyspark/sql/tests/test_types.py: ## @@ -549,6 +549,129 @@ def test_convert_list_to_str(self): self.assertEqual(df.count(), 1) self.assertEqual(df.head(), Row(name="[123]", income=120)) +def test_schema_with_collations_json_ser_de(self): +from pyspark.sql.types import _parse_datatype_json_string + +unicode_collation = "UNICODE" + +simple_struct = StructType([StructField("c1", StringType(unicode_collation))]) + +nested_struct = StructType([StructField("nested", simple_struct)]) + +array_in_schema = StructType( +[StructField("array", ArrayType(StringType(unicode_collation)))] +) + +map_in_schema = StructType( +[ +StructField( +"map", MapType(StringType(unicode_collation), StringType(unicode_collation)) +) +] +) + +array_in_map = StructType( +[ +StructField( +"arrInMap", +MapType( +StringType(unicode_collation), ArrayType(StringType(unicode_collation)) +), +) +] +) + +nested_array_in_map = StructType( +[ +StructField( +"nestedArrayInMap", +ArrayType( +MapType( +StringType(unicode_collation), Review Comment: What about nested collations in map keys? ##
Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]
cstavr commented on code in PR #46280: URL: https://github.com/apache/spark/pull/46280#discussion_r1598916467 ## common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java: ## @@ -36,11 +36,42 @@ * Provides functionality to the UTF8String object which respects defined collation settings. */ public final class CollationFactory { + + public static class CollationIdentifier { +public final String provider; +public final String name; +public final String version; + +public CollationIdentifier(String provider, String collationName, String version) { + this.provider = provider; + this.name = collationName; + this.version = version; +} + +public static CollationIdentifier fromString(String identifier) { + String[] parts = identifier.split("\\.", 3); + return new CollationIdentifier(parts[0], parts[1], parts[2]); +} + +@Override +public String toString() { + return String.format("%s.%s.%s", provider, name, version); +} + +/** + * Returns Review Comment: ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
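The `fromString`/`toString` pair quoted above can be mimicked in plain Python to make the round trip concrete. This is a hypothetical sketch (not PySpark API); it assumes the `provider.name.version` string form, where only the version component may itself contain dots, which is why `fromString` splits at most twice:

```python
class CollationIdentifier:
    """Minimal Python sketch of the CollationIdentifier class quoted above."""

    def __init__(self, provider, name, version):
        self.provider = provider
        self.name = name
        self.version = version

    @classmethod
    def from_string(cls, identifier):
        # Equivalent of Java's identifier.split("\\.", 3): at most 3 parts,
        # so dots inside the version (e.g. "75.1") stay intact.
        parts = identifier.split(".", 2)
        if len(parts) != 3:
            raise ValueError(f"bad collation identifier: {identifier!r}")
        return cls(*parts)

    def __str__(self):
        return f"{self.provider}.{self.name}.{self.version}"

    def to_string_without_version(self):
        # Matches the rename suggested in the review (toStringWithoutVersion).
        return f"{self.provider}.{self.name}"


ident = CollationIdentifier.from_string("icu.UNICODE.75.1")
# str(ident) == "icu.UNICODE.75.1"
# ident.to_string_without_version() == "icu.UNICODE"
```

The `maxsplit` argument is what keeps a dotted version string from being split further, which is the parsing difficulty the reviewers discuss when considering making provider or version optional.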
Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]
stefankandic commented on code in PR #46280: URL: https://github.com/apache/spark/pull/46280#discussion_r1598823433 ## python/pyspark/sql/types.py: ## @@ -1702,9 +1791,16 @@ def _parse_datatype_json_string(json_string: str) -> DataType: return _parse_datatype_json_value(json.loads(json_string)) -def _parse_datatype_json_value(json_value: Union[dict, str]) -> DataType: +def _parse_datatype_json_value( +json_value: Union[dict, str], +fieldPath: str = "", +collationsMap: Optional[Dict[str, str]] = None, +) -> DataType: if not isinstance(json_value, dict): if json_value in _all_atomic_types.keys(): +if collationsMap is not None and fieldPath in collationsMap: +collationName = collationsMap[fieldPath].split(".")[1] Review Comment: In spark we only use name to identify collations and we don't need the provider at all. That is why I asked if provider could be optional in the delta protocol but the issue would be parsing the identifier in that case https://github.com/delta-io/delta/pull/3068#discussion_r1593742078 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]
stefankandic commented on code in PR #46280: URL: https://github.com/apache/spark/pull/46280#discussion_r1598821762 ## python/pyspark/sql/types.py: ## @@ -1756,6 +1854,15 @@ def _parse_datatype_json_value(json_value: Union[dict, str]) -> DataType: ) +def _parse_type_with_collation( +json_value: Dict[str, Any], fieldPath: str, collationsMap: Optional[Dict[str, str]] +) -> DataType: Review Comment: yes, added now! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]
cstavr commented on code in PR #46280: URL: https://github.com/apache/spark/pull/46280#discussion_r1598693494 ## python/pyspark/sql/types.py: ## @@ -1702,9 +1791,16 @@ def _parse_datatype_json_string(json_string: str) -> DataType: return _parse_datatype_json_value(json.loads(json_string)) -def _parse_datatype_json_value(json_value: Union[dict, str]) -> DataType: +def _parse_datatype_json_value( +json_value: Union[dict, str], +fieldPath: str = "", +collationsMap: Optional[Dict[str, str]] = None, +) -> DataType: if not isinstance(json_value, dict): if json_value in _all_atomic_types.keys(): +if collationsMap is not None and fieldPath in collationsMap: +collationName = collationsMap[fieldPath].split(".")[1] Review Comment: So we are using only the name? What about the rest of the parts? ## python/pyspark/sql/types.py: ## @@ -263,21 +263,31 @@ def __init__(self, collation: Optional[str] = None): def fromCollationId(self, collationId: int) -> "StringType": return StringType(StringType.collationNames[collationId]) -def collationIdToName(self) -> str: -if self.collationId == 0: -return "" -else: -return " collate %s" % StringType.collationNames[self.collationId] +@classmethod +def collationIdToName(cls, collationId: int) -> str: +return StringType.collationNames[collationId] @classmethod def collationNameToId(cls, collationName: str) -> int: return StringType.collationNames.index(collationName) +@classmethod +def collationProvider(cls, collationName: str) -> str: +if collationName.startswith("UTF8"): +return "spark" +return "icu" + def simpleString(self) -> str: -return "string" + self.collationIdToName() +return ( +"string" +if self.isUTF8BinaryCollation() +else "string collate " + self.collationIdToName(self.collationId) +) Review Comment: nit: Since this is multi-line anyway, I think that standard if/else block is better here. 
## python/pyspark/sql/types.py: ## @@ -1756,6 +1854,15 @@ def _parse_datatype_json_value(json_value: Union[dict, str]) -> DataType: ) +def _parse_type_with_collation( +json_value: Dict[str, Any], fieldPath: str, collationsMap: Optional[Dict[str, str]] +) -> DataType: +if collationsMap and fieldPath in collationsMap: +collationName = collationsMap[fieldPath].split(".")[1] Review Comment: ditto ## python/pyspark/sql/types.py: ## @@ -1756,6 +1854,15 @@ def _parse_datatype_json_value(json_value: Union[dict, str]) -> DataType: ) +def _parse_type_with_collation( +json_value: Dict[str, Any], fieldPath: str, collationsMap: Optional[Dict[str, str]] +) -> DataType: Review Comment: Don't we also need the string type check here? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
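The two helpers in the quoted diff (`collationProvider` and `simpleString`) can be sketched in plain Python. This is an illustrative stand-in, not the PySpark classes; it assumes `UTF8_BINARY` is the default collation (collation id 0 in the quoted code):

```python
UTF8_BINARY = "UTF8_BINARY"  # assumed default collation (id 0)

def collation_provider(collation_name: str) -> str:
    # UTF8-prefixed collations (UTF8_BINARY, UTF8_LCASE, ...) are Spark's
    # built-in ones; everything else is backed by ICU.
    return "spark" if collation_name.startswith("UTF8") else "icu"

def simple_string(collation_name: str) -> str:
    # Per the reviewer's nit: once this is multi-line, a standard if/else
    # block reads better than a conditional expression.
    if collation_name == UTF8_BINARY:
        return "string"
    else:
        return "string collate " + collation_name

# simple_string("UNICODE") == "string collate UNICODE"
```

Note the trailing space in `"string collate "`, which is exactly the one-character bug the suggestion above fixes.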
Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]
stefankandic commented on code in PR #46280: URL: https://github.com/apache/spark/pull/46280#discussion_r1598434466 ## python/pyspark/sql/types.py: ## @@ -876,30 +894,86 @@ def __init__( self.dataType = dataType self.nullable = nullable self.metadata = metadata or {} +self._collationMetadata: Optional[Dict[str, str]] = None def simpleString(self) -> str: return "%s:%s" % (self.name, self.dataType.simpleString()) def __repr__(self) -> str: return "StructField('%s', %s, %s)" % (self.name, self.dataType, str(self.nullable)) +def __eq__(self, other: Any) -> bool: +# since collationMetadata is lazy evaluated we should not use it in equality check Review Comment: I would argue that using `self.__dict__ == other.__dict__` is even more dangerous and yet this is the current implementation of equals in all data types. This was only done in order for assert to pass in the tests -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]
stefankandic commented on code in PR #46280: URL: https://github.com/apache/spark/pull/46280#discussion_r1598430811 ## sql/api/src/main/scala/org/apache/spark/sql/types/StructField.scala: ## @@ -63,7 +66,60 @@ case class StructField( ("name" -> name) ~ ("type" -> dataType.jsonValue) ~ ("nullable" -> nullable) ~ - ("metadata" -> metadata.jsonValue) + ("metadata" -> metadataJson) + } + + private def metadataJson: JValue = { +val metadataJsonValue = metadata.jsonValue +metadataJsonValue match { + case JObject(fields) if collationMetadata.nonEmpty => +val collationFields = collationMetadata.map(kv => kv._1 -> JString(kv._2)).toList +JObject(fields :+ (DataType.COLLATIONS_METADATA_KEY -> JObject(collationFields))) + + case _ => metadataJsonValue +} + } + + /** Map of field path to collation name. */ + private lazy val collationMetadata: mutable.Map[String, String] = { +val fieldToCollationMap = mutable.Map[String, String]() + +def visitRecursively(dt: DataType, path: String): Unit = dt match { + case at: ArrayType => +processDataType(at.elementType, path + ".element") + + case mt: MapType => +processDataType(mt.keyType, path + ".key") +processDataType(mt.valueType, path + ".value") + + case st: StringType if isCollatedString(st) => +fieldToCollationMap(path) = collationName(st) + + case _ => +} + +def processDataType(dt: DataType, path: String): Unit = { + if (isCollatedString(dt)) { +fieldToCollationMap(path) = collationName(dt) + } else { +visitRecursively(dt, path) + } +} + +visitRecursively(dataType, name) +fieldToCollationMap + } + + private def isCollatedString(dt: DataType): Boolean = dt match { +case st: StringType => !st.isUTF8BinaryCollation +case _ => false + } + + private def collationName(dt: DataType): String = dt match { +case st: StringType => Review Comment: Casting to a specific `StringType(collationId)` is kind of ugly so I would rather avoid it. 
However, even though we only call this method when we know it's a string type I added an exception in the match clause to be more explicit with the error. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
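The Scala `collationMetadata` visitor quoted above can be sketched in Python over a simplified dict-based type tree (a hypothetical structure, not PySpark's type classes), showing how the `.element`, `.key`, and `.value` path segments compose into the field-path-to-collation map:

```python
def collect_collations(dtype, path, out):
    # Record collated strings; recurse into array elements and map
    # keys/values, extending the path as StructField.collationMetadata does.
    kind = dtype["type"]
    if kind == "string" and dtype.get("collation"):
        out[path] = dtype["collation"]
    elif kind == "array":
        collect_collations(dtype["element"], path + ".element", out)
    elif kind == "map":
        collect_collations(dtype["key"], path + ".key", out)
        collect_collations(dtype["value"], path + ".value", out)
    return out


# A map field "m" with a collated key and a collated array-element value:
schema_field = {
    "type": "map",
    "key": {"type": "string", "collation": "UNICODE"},
    "value": {
        "type": "array",
        "element": {"type": "string", "collation": "UNICODE"},
    },
}
paths = collect_collations(schema_field, "m", {})
# paths == {"m.key": "UNICODE", "m.value.element": "UNICODE"}
```

This also illustrates the `name.key` discussion below: a path key in the map is derived from a single type tree, so the same path can never be produced twice for one field.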
Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]
stefankandic commented on code in PR #46280: URL: https://github.com/apache/spark/pull/46280#discussion_r1598427469 ## sql/api/src/main/scala/org/apache/spark/sql/types/DataType.scala: ## @@ -274,6 +288,39 @@ object DataType { messageParameters = Map("other" -> compact(render(other } + /** + * Checks if the current field is in the collation map, and if it is it returns + * a StringType with the given collation. Otherwise, it further parses its type. + */ + private def resolveType( + json: JValue, + path: String, + collationsMap: Map[String, String]): DataType = { +collationsMap.get(path) match { + case Some(collation) => stringTypeWithCollation(collation) Review Comment: good catch! Added new error class and tests for that case -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
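The error path added here (rejecting an unknown collation provider before building the collated string type) can be sketched as follows. The error message and the supported-provider set are assumptions for illustration only; the actual error class name is defined in the PR:

```python
SUPPORTED_PROVIDERS = {"spark", "icu"}  # assumed set for this sketch

def resolve_collated_string(identifier: str) -> str:
    # Validate the provider component of "provider.name[.version]" before
    # resolving the collation; return the collation name on success.
    parts = identifier.split(".", 2)
    if len(parts) < 2:
        raise ValueError(f"malformed collation identifier: {identifier!r}")
    provider = parts[0]
    if provider not in SUPPORTED_PROVIDERS:
        # hypothetical error text, standing in for the new error class
        raise ValueError(f"invalid collation provider: {provider!r}")
    return parts[1]
```

A lookup like `resolve_collated_string("icu.UNICODE")` returns `"UNICODE"`, while an unknown provider fails fast instead of silently producing a default string type.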
Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]
stefankandic commented on code in PR #46280: URL: https://github.com/apache/spark/pull/46280#discussion_r1598426717 ## python/pyspark/sql/tests/test_types.py: ## @@ -549,6 +549,76 @@ def test_convert_list_to_str(self): self.assertEqual(df.count(), 1) self.assertEqual(df.head(), Row(name="[123]", income=120)) +def test_schema_with_collations_json_ser_de(self): +from pyspark.sql.types import _parse_datatype_json_string + +unicode_collation = "UNICODE" + +simple_struct = StructType([StructField("c1", StringType(unicode_collation))]) + +nested_struct = StructType([StructField("nested", simple_struct)]) + +array_in_schema = StructType( +[StructField("array", ArrayType(StringType(unicode_collation)))] +) + +map_in_schema = StructType( +[ +StructField( +"map", MapType(StringType(unicode_collation), StringType(unicode_collation)) +) +] +) + +array_in_map_in_nested_schema = StructType( Review Comment: you are right we don't need that -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]
stefankandic commented on code in PR #46280: URL: https://github.com/apache/spark/pull/46280#discussion_r1598424420 ## python/pyspark/sql/tests/test_types.py: ## @@ -549,6 +549,76 @@ def test_convert_list_to_str(self): self.assertEqual(df.count(), 1) self.assertEqual(df.head(), Row(name="[123]", income=120)) +def test_schema_with_collations_json_ser_de(self): +from pyspark.sql.types import _parse_datatype_json_string + +unicode_collation = "UNICODE" + +simple_struct = StructType([StructField("c1", StringType(unicode_collation))]) + +nested_struct = StructType([StructField("nested", simple_struct)]) + +array_in_schema = StructType( +[StructField("array", ArrayType(StringType(unicode_collation)))] +) + +map_in_schema = StructType( +[ +StructField( +"map", MapType(StringType(unicode_collation), StringType(unicode_collation)) +) +] +) + +array_in_map_in_nested_schema = StructType( +[ +StructField( +"arrInMap", +MapType( +StringType(unicode_collation), ArrayType(StringType(unicode_collation)) +), +) +] +) + +nested_array_in_map = StructType( +[ +StructField( +"nestedArrayInMap", +ArrayType( +MapType( +StringType(unicode_collation), + ArrayType(ArrayType(StringType(unicode_collation))), +) +), +) +] +) + +schema_with_multiple_fields = StructType( +simple_struct.fields ++ nested_struct.fields ++ array_in_schema.fields ++ map_in_schema.fields ++ array_in_map_in_nested_schema.fields ++ nested_array_in_map.fields +) + +schemas = [ +simple_struct, +nested_struct, +array_in_schema, +map_in_schema, +nested_array_in_map, +array_in_map_in_nested_schema, +schema_with_multiple_fields, +] + +for schema in schemas: +scala_datatype = self.spark._jsparkSession.parseDataType(schema.json()) +python_datatype = _parse_datatype_json_string(scala_datatype.json()) +assert schema == python_datatype Review Comment: `scala_datatype` is a jvm object so we can't just compare it to the `schema` -- This is an automated message from the Apache Git Service. 
Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]
cstavr commented on code in PR #46280: URL: https://github.com/apache/spark/pull/46280#discussion_r1597674391 ## python/pyspark/sql/types.py: ## @@ -263,21 +263,23 @@ def __init__(self, collation: Optional[str] = None): def fromCollationId(self, collationId: int) -> "StringType": return StringType(StringType.collationNames[collationId]) -def collationIdToName(self) -> str: -if self.collationId == 0: -return "" -else: -return " collate %s" % StringType.collationNames[self.collationId] +@classmethod +def collationIdToName(cls, collationId: int) -> str: +return StringType.collationNames[collationId] @classmethod def collationNameToId(cls, collationName: str) -> int: return StringType.collationNames.index(collationName) def simpleString(self) -> str: -return "string" + self.collationIdToName() +return ( +"string" +if self.isUTF8BinaryCollation() +else "string collate" + self.collationIdToName(self.collationId) Review Comment: ```suggestion else "string collate " + self.collationIdToName(self.collationId) ``` ## python/pyspark/sql/tests/test_types.py: ## @@ -549,6 +549,76 @@ def test_convert_list_to_str(self): self.assertEqual(df.count(), 1) self.assertEqual(df.head(), Row(name="[123]", income=120)) +def test_schema_with_collations_json_ser_de(self): +from pyspark.sql.types import _parse_datatype_json_string + +unicode_collation = "UNICODE" + +simple_struct = StructType([StructField("c1", StringType(unicode_collation))]) + +nested_struct = StructType([StructField("nested", simple_struct)]) + +array_in_schema = StructType( +[StructField("array", ArrayType(StringType(unicode_collation)))] +) + +map_in_schema = StructType( +[ +StructField( +"map", MapType(StringType(unicode_collation), StringType(unicode_collation)) +) +] +) + +array_in_map_in_nested_schema = StructType( Review Comment: What does in "in nested schema" mean here? 
```suggestion array_in_map = StructType( ``` ## python/pyspark/sql/tests/test_types.py: ## @@ -549,6 +549,76 @@ def test_convert_list_to_str(self): self.assertEqual(df.count(), 1) self.assertEqual(df.head(), Row(name="[123]", income=120)) +def test_schema_with_collations_json_ser_de(self): +from pyspark.sql.types import _parse_datatype_json_string + +unicode_collation = "UNICODE" + +simple_struct = StructType([StructField("c1", StringType(unicode_collation))]) + +nested_struct = StructType([StructField("nested", simple_struct)]) + +array_in_schema = StructType( +[StructField("array", ArrayType(StringType(unicode_collation)))] +) + +map_in_schema = StructType( +[ +StructField( +"map", MapType(StringType(unicode_collation), StringType(unicode_collation)) +) +] +) + +array_in_map_in_nested_schema = StructType( +[ +StructField( +"arrInMap", +MapType( +StringType(unicode_collation), ArrayType(StringType(unicode_collation)) +), +) +] +) + +nested_array_in_map = StructType( +[ +StructField( +"nestedArrayInMap", +ArrayType( +MapType( +StringType(unicode_collation), + ArrayType(ArrayType(StringType(unicode_collation))), +) +), +) +] +) + +schema_with_multiple_fields = StructType( +simple_struct.fields ++ nested_struct.fields ++ array_in_schema.fields ++ map_in_schema.fields ++ array_in_map_in_nested_schema.fields ++ nested_array_in_map.fields +) + +schemas = [ +simple_struct, +nested_struct, +array_in_schema, +map_in_schema, +nested_array_in_map, +array_in_map_in_nested_schema, +schema_with_multiple_fields, +] + +for schema in schemas: +scala_datatype = self.spark._jsparkSession.parseDataType(schema.json()) +python_datatype = _parse_datatype_json_string(scala_datatype.json()) +assert schema == python_datatype Review Comment: ```suggestion scala_datatype = self.spark._jsparkSession.parseDataType(schema.json()) assert schema ==
Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]
stefankandic commented on code in PR #46280: URL: https://github.com/apache/spark/pull/46280#discussion_r1595221619 ## sql/api/src/main/scala/org/apache/spark/sql/types/StructField.scala: ## @@ -63,7 +66,60 @@ case class StructField( ("name" -> name) ~ ("type" -> dataType.jsonValue) ~ ("nullable" -> nullable) ~ - ("metadata" -> metadata.jsonValue) + ("metadata" -> metadataJson) + } + + private def metadataJson: JValue = { +val metadataJsonValue = metadata.jsonValue +metadataJsonValue match { + case JObject(fields) if collationMetadata.nonEmpty => +val collationFields = collationMetadata.map(kv => kv._1 -> JString(kv._2)).toList +JObject(fields :+ (DataType.COLLATIONS_METADATA_KEY -> JObject(collationFields))) + + case _ => metadataJsonValue +} + } + + /** Map of field path to collation name. */ + private lazy val collationMetadata: mutable.Map[String, String] = { +val fieldToCollationMap = mutable.Map[String, String]() + +def visitRecursively(dt: DataType, path: String): Unit = dt match { + case at: ArrayType => +processDataType(at.elementType, path + ".element") + + case mt: MapType => +processDataType(mt.keyType, path + ".key") +processDataType(mt.valueType, path + ".value") + + case st: StringType if isCollatedString(st) => +fieldToCollationMap(path) = collationName(st) + + case _ => +} + +def processDataType(dt: DataType, path: String): Unit = { + if (isCollatedString(dt)) { +fieldToCollationMap(path) = collationName(dt) + } else { +visitRecursively(dt, path) + } +} + +visitRecursively(dataType, name) Review Comment: I added a test to be sure this works, but I don't see why we would need any special logic for this case. Even if we have a column `name.key` this will never be ambiguous in the metadata as the struct field could be a string or a map but never both, so parsing will just work regardless - we could never have a case of adding the collation two times. -- This is an automated message from the Apache Git Service. 
Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]
stefankandic commented on code in PR #46280: URL: https://github.com/apache/spark/pull/46280#discussion_r1595221619 ## sql/api/src/main/scala/org/apache/spark/sql/types/StructField.scala: ## @@ -63,7 +66,60 @@ case class StructField( ("name" -> name) ~ ("type" -> dataType.jsonValue) ~ ("nullable" -> nullable) ~ - ("metadata" -> metadata.jsonValue) + ("metadata" -> metadataJson) + } + + private def metadataJson: JValue = { +val metadataJsonValue = metadata.jsonValue +metadataJsonValue match { + case JObject(fields) if collationMetadata.nonEmpty => +val collationFields = collationMetadata.map(kv => kv._1 -> JString(kv._2)).toList +JObject(fields :+ (DataType.COLLATIONS_METADATA_KEY -> JObject(collationFields))) + + case _ => metadataJsonValue +} + } + + /** Map of field path to collation name. */ + private lazy val collationMetadata: mutable.Map[String, String] = { +val fieldToCollationMap = mutable.Map[String, String]() + +def visitRecursively(dt: DataType, path: String): Unit = dt match { + case at: ArrayType => +processDataType(at.elementType, path + ".element") + + case mt: MapType => +processDataType(mt.keyType, path + ".key") +processDataType(mt.valueType, path + ".value") + + case st: StringType if isCollatedString(st) => +fieldToCollationMap(path) = collationName(st) + + case _ => +} + +def processDataType(dt: DataType, path: String): Unit = { + if (isCollatedString(dt)) { +fieldToCollationMap(path) = collationName(dt) + } else { +visitRecursively(dt, path) + } +} + +visitRecursively(dataType, name) Review Comment: I added a test to be sure this works, but I don't see why we would need any special logic for this case. Even if we have a column name.key this will never be ambiguous in the metadata as the struct field could be a string or a map but never both, so parsing will just work regardless - we could never have a case of adding the collation two times. -- This is an automated message from the Apache Git Service. 
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]
stefankandic commented on code in PR #46280: URL: https://github.com/apache/spark/pull/46280#discussion_r1595218281

## sql/api/src/main/scala/org/apache/spark/sql/types/StructField.scala:

```scala
@@ -63,7 +66,60 @@ case class StructField(
     ("name" -> name) ~
       ("type" -> dataType.jsonValue) ~
       ("nullable" -> nullable) ~
-      ("metadata" -> metadata.jsonValue)
+      ("metadata" -> metadataJson)
+  }
+
+  private def metadataJson: JValue = {
+    val metadataJsonValue = metadata.jsonValue
+    metadataJsonValue match {
+      case JObject(fields) if collationMetadata.nonEmpty =>
+        val collationFields = collationMetadata.map(kv => kv._1 -> JString(kv._2)).toList
+        JObject(fields :+ (DataType.COLLATIONS_METADATA_KEY -> JObject(collationFields)))
+
+      case _ => metadataJsonValue
+    }
+  }
+
+  /** Map of field path to collation name. */
+  private lazy val collationMetadata: mutable.Map[String, String] = {
+    val fieldToCollationMap = mutable.Map[String, String]()
+
+    def visitRecursively(dt: DataType, path: String): Unit = dt match {
+      case at: ArrayType =>
+        processDataType(at.elementType, path + ".element")
+
+      case mt: MapType =>
+        processDataType(mt.keyType, path + ".key")
+        processDataType(mt.valueType, path + ".value")
+
+      case st: StringType if isCollatedString(st) =>
+        fieldToCollationMap(path) = collationName(st)
+
+      case _ =>
+    }
+
+    def processDataType(dt: DataType, path: String): Unit = {
+      if (isCollatedString(dt)) {
+        fieldToCollationMap(path) = collationName(dt)
+      } else {
+        visitRecursively(dt, path)
+      }
+    }
+
+    visitRecursively(dataType, name)
```

Review Comment: I added a test to be sure this works, but I don't see why we would need any special logic for this case. Even if we have a column named `name.key`, it will never be ambiguous in the metadata: the struct field can be a string or a map, but never both, so parsing will just work regardless; we could never end up adding the collation twice.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
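The traversal in the diff above maps nested paths such as `<field>.key` or `<field>.value.element` to collation names. A minimal, runnable Python sketch of that path-building scheme (the type names `CollatedString`, `ArrayT`, and `MapT` are illustrative stand-ins, not PySpark's real classes):

```python
from dataclasses import dataclass

# Stand-in types; in Spark these correspond to StringType(collation),
# ArrayType, and MapType.
@dataclass
class CollatedString:
    collation: str

@dataclass
class ArrayT:
    element: object

@dataclass
class MapT:
    key: object
    value: object

def collation_map(field_name, dtype):
    """Build the per-field path -> collation-name map."""
    out = {}

    def visit(dt, path):
        if isinstance(dt, CollatedString):
            out[path] = dt.collation
        elif isinstance(dt, ArrayT):
            visit(dt.element, path + ".element")
        elif isinstance(dt, MapT):
            visit(dt.key, path + ".key")
            visit(dt.value, path + ".value")
        # plain strings and other leaf types contribute nothing

    visit(dtype, field_name)
    return out

# A map field whose key is collated and whose value is an array of collated strings:
print(collation_map("tags", MapT(CollatedString("UTF8_LCASE"),
                                 ArrayT(CollatedString("UNICODE")))))
# -> {'tags.key': 'UTF8_LCASE', 'tags.value.element': 'UNICODE'}
```

A top-level collated string simply uses the field name itself as its path key, which is the case the two reviewers discuss below.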
Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]
cloud-fan commented on code in PR #46280: URL: https://github.com/apache/spark/pull/46280#discussion_r1595169435

## sql/api/src/main/scala/org/apache/spark/sql/types/StructField.scala:

```scala
@@ -63,7 +66,60 @@ case class StructField(
     ("name" -> name) ~
       ("type" -> dataType.jsonValue) ~
       ("nullable" -> nullable) ~
-      ("metadata" -> metadata.jsonValue)
+      ("metadata" -> metadataJson)
+  }
+
+  private def metadataJson: JValue = {
+    val metadataJsonValue = metadata.jsonValue
+    metadataJsonValue match {
+      case JObject(fields) if collationMetadata.nonEmpty =>
+        val collationFields = collationMetadata.map(kv => kv._1 -> JString(kv._2)).toList
+        JObject(fields :+ (DataType.COLLATIONS_METADATA_KEY -> JObject(collationFields)))
+
+      case _ => metadataJsonValue
+    }
+  }
+
+  /** Map of field path to collation name. */
+  private lazy val collationMetadata: mutable.Map[String, String] = {
+    val fieldToCollationMap = mutable.Map[String, String]()
+
+    def visitRecursively(dt: DataType, path: String): Unit = dt match {
+      case at: ArrayType =>
+        processDataType(at.elementType, path + ".element")
+
+      case mt: MapType =>
+        processDataType(mt.keyType, path + ".key")
+        processDataType(mt.valueType, path + ".value")
+
+      case st: StringType if isCollatedString(st) =>
+        fieldToCollationMap(path) = collationName(st)
+
+      case _ =>
+    }
+
+    def processDataType(dt: DataType, path: String): Unit = {
+      if (isCollatedString(dt)) {
+        fieldToCollationMap(path) = collationName(dt)
+      } else {
+        visitRecursively(dt, path)
+      }
+    }
+
+    visitRecursively(dataType, name)
```

Review Comment: This is per-field metadata; do we really need to put the field name in the path key? My worry is that the field name can contain special characters, e.g. `name.key`, which may cause issues without proper quoting. How about we use the empty string as the path key for the top-level field?
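To make the quoting concern concrete, here is a minimal sketch (the `__COLLATIONS` key spelling and the dict-based field layout are illustrative assumptions, not a definitive rendering of Spark's JSON): a field literally named `name.key` and a map field named `name` with a collated key do produce the same path string, but because the collation map is stored per `StructField`, the two entries can never land in the same metadata object.

```python
def top_level_path(field_name: str) -> str:
    # A top-level collated string uses the unquoted field name as its path key.
    return field_name

def map_key_path(field_name: str) -> str:
    # A collated map key appends ".key" to the field's path.
    return field_name + ".key"

# The collision cloud-fan describes: both produce the literal path "name.key".
assert top_level_path("name.key") == map_key_path("name") == "name.key"

# But each StructField carries its own collation map, so the entries never share
# a dict: a given field is either a string or a map, never both.
string_field = {"name": "name.key",
                "metadata": {"__COLLATIONS": {"name.key": "UNICODE"}}}
map_field = {"name": "name",
             "metadata": {"__COLLATIONS": {"name.key": "UTF8_LCASE"}}}
assert string_field["metadata"] != map_field["metadata"]
```

This is the disambiguation argument stefankandic makes above: the path strings may coincide across fields, but parsing consults one field's `dataType` at a time, so the mapping stays unambiguous.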
Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]
stefankandic commented on PR #46280: URL: https://github.com/apache/spark/pull/46280#issuecomment-2100170222 @cloud-fan please take a look when you have the time.