Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [spark]
uros-db commented on code in PR #46597: URL: https://github.com/apache/spark/pull/46597#discussion_r1620668890 ## sql/core/benchmarks/CollationBenchmark-results.txt: ## @@ -1,54 +1,79 @@ -OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1018-azure +OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1021-azure AMD EPYC 7763 64-Core Processor collation unit benchmarks - equalsFunction: Best Time(ms) Avg Time(ms) Stdev(ms)Rate(M/s) Per Row(ns) Relative -- -UTF8_BINARY_LCASE3571 3576 7 0.0 35708.8 1.0X -UNICODE 2235 2240 7 0.0 22349.2 1.6X -UTF8_BINARY 2237 2242 6 0.0 22371.7 1.6X -UNICODE_CI 18733 18817 118 0.0 187333.8 0.2X +UTF8_BINARY_LCASE3268 3279 16 0.0 32676.7 1.0X +UNICODE 2086 2087 2 0.0 20857.9 1.6X +UTF8_BINARY 2085 2088 4 0.0 20854.2 1.6X +UNICODE_CI 19807 19813 7 0.0 198074.9 0.2X -OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1018-azure +OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1021-azure AMD EPYC 7763 64-Core Processor collation unit benchmarks - compareFunction: Best Time(ms) Avg Time(ms) Stdev(ms)Rate(M/s) Per Row(ns) Relative --- -UTF8_BINARY_LCASE 4260 4290 41 0.0 42602.6 1.0X -UNICODE 19536 19624 124 0.0 195360.2 0.2X -UTF8_BINARY 3582 3612 43 0.0 35818.5 1.2X -UNICODE_CI 20381 20454 103 0.0 203814.1 0.2X +UTF8_BINARY_LCASE 3839 3843 6 0.0 38389.8 1.0X +UNICODE 19096 19136 57 0.0 190955.4 0.2X +UTF8_BINARY 3196 3197 2 0.0 31955.7 1.2X +UNICODE_CI 19038 19043 7 0.0 190383.4 0.2X -OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1018-azure +OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1021-azure AMD EPYC 7763 64-Core Processor collation unit benchmarks - hashFunction: Best Time(ms) Avg Time(ms) Stdev(ms)Rate(M/s) Per Row(ns) Relative -UTF8_BINARY_LCASE 7347 7349 3 0.0 73467.1 1.0X -UNICODE 73462 73608 206 0.0 734623.2 0.1X -UTF8_BINARY5775 5815 57 0.0 57746.0 1.3X -UNICODE_CI57543 57619 108 0.0 575425.2 0.1X +UTF8_BINARY_LCASE 6914 6921 10 0.0 69135.3 1.0X +UNICODE 67702 67724 31 0.0 677019.3 0.1X +UTF8_BINARY5330 5341 15 0.0 53296.6 1.3X +UNICODE_CI65340 65342 3 0.0 653395.9 0.1X -OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1018-azure +OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1021-azure AMD EPYC 7763 64-Core Processor collation unit benchmarks - contains: Best Time(ms) Avg Time(ms) Stdev(ms)Rate(M/s) Per Row(ns) Relative -UTF8_BINARY_LCASE 15415 15424 13 0.0 154147.1 1.0X -UNICODE
Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [spark]
GideonPotok commented on code in PR #46597: URL: https://github.com/apache/spark/pull/46597#discussion_r1620662883 ## sql/core/benchmarks/CollationBenchmark-results.txt: ## @@ -1,54 +1,79 @@ -OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1018-azure +OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1021-azure AMD EPYC 7763 64-Core Processor collation unit benchmarks - equalsFunction: Best Time(ms) Avg Time(ms) Stdev(ms)Rate(M/s) Per Row(ns) Relative -- -UTF8_BINARY_LCASE3571 3576 7 0.0 35708.8 1.0X -UNICODE 2235 2240 7 0.0 22349.2 1.6X -UTF8_BINARY 2237 2242 6 0.0 22371.7 1.6X -UNICODE_CI 18733 18817 118 0.0 187333.8 0.2X +UTF8_BINARY_LCASE3268 3279 16 0.0 32676.7 1.0X +UNICODE 2086 2087 2 0.0 20857.9 1.6X +UTF8_BINARY 2085 2088 4 0.0 20854.2 1.6X +UNICODE_CI 19807 19813 7 0.0 198074.9 0.2X -OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1018-azure +OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1021-azure AMD EPYC 7763 64-Core Processor collation unit benchmarks - compareFunction: Best Time(ms) Avg Time(ms) Stdev(ms)Rate(M/s) Per Row(ns) Relative --- -UTF8_BINARY_LCASE 4260 4290 41 0.0 42602.6 1.0X -UNICODE 19536 19624 124 0.0 195360.2 0.2X -UTF8_BINARY 3582 3612 43 0.0 35818.5 1.2X -UNICODE_CI 20381 20454 103 0.0 203814.1 0.2X +UTF8_BINARY_LCASE 3839 3843 6 0.0 38389.8 1.0X +UNICODE 19096 19136 57 0.0 190955.4 0.2X +UTF8_BINARY 3196 3197 2 0.0 31955.7 1.2X +UNICODE_CI 19038 19043 7 0.0 190383.4 0.2X -OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1018-azure +OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1021-azure AMD EPYC 7763 64-Core Processor collation unit benchmarks - hashFunction: Best Time(ms) Avg Time(ms) Stdev(ms)Rate(M/s) Per Row(ns) Relative -UTF8_BINARY_LCASE 7347 7349 3 0.0 73467.1 1.0X -UNICODE 73462 73608 206 0.0 734623.2 0.1X -UTF8_BINARY5775 5815 57 0.0 57746.0 1.3X -UNICODE_CI57543 57619 108 0.0 575425.2 0.1X +UTF8_BINARY_LCASE 6914 6921 10 0.0 69135.3 1.0X +UNICODE 67702 67724 31 0.0 677019.3 0.1X +UTF8_BINARY5330 5341 15 0.0 53296.6 1.3X +UNICODE_CI65340 65342 3 0.0 653395.9 0.1X -OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1018-azure +OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1021-azure AMD EPYC 7763 64-Core Processor collation unit benchmarks - contains: Best Time(ms) Avg Time(ms) Stdev(ms)Rate(M/s) Per Row(ns) Relative -UTF8_BINARY_LCASE 15415 15424 13 0.0 154147.1 1.0X -UNICODE
Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [spark]
GideonPotok commented on code in PR #46597: URL: https://github.com/apache/spark/pull/46597#discussion_r1620663435 ## sql/core/benchmarks/CollationBenchmark-results.txt: ## @@ -1,54 +1,79 @@ -OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1018-azure +OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1021-azure AMD EPYC 7763 64-Core Processor collation unit benchmarks - equalsFunction: Best Time(ms) Avg Time(ms) Stdev(ms)Rate(M/s) Per Row(ns) Relative -- -UTF8_BINARY_LCASE3571 3576 7 0.0 35708.8 1.0X -UNICODE 2235 2240 7 0.0 22349.2 1.6X -UTF8_BINARY 2237 2242 6 0.0 22371.7 1.6X -UNICODE_CI 18733 18817 118 0.0 187333.8 0.2X +UTF8_BINARY_LCASE3268 3279 16 0.0 32676.7 1.0X +UNICODE 2086 2087 2 0.0 20857.9 1.6X +UTF8_BINARY 2085 2088 4 0.0 20854.2 1.6X +UNICODE_CI 19807 19813 7 0.0 198074.9 0.2X -OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1018-azure +OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1021-azure AMD EPYC 7763 64-Core Processor collation unit benchmarks - compareFunction: Best Time(ms) Avg Time(ms) Stdev(ms)Rate(M/s) Per Row(ns) Relative --- -UTF8_BINARY_LCASE 4260 4290 41 0.0 42602.6 1.0X -UNICODE 19536 19624 124 0.0 195360.2 0.2X -UTF8_BINARY 3582 3612 43 0.0 35818.5 1.2X -UNICODE_CI 20381 20454 103 0.0 203814.1 0.2X +UTF8_BINARY_LCASE 3839 3843 6 0.0 38389.8 1.0X +UNICODE 19096 19136 57 0.0 190955.4 0.2X +UTF8_BINARY 3196 3197 2 0.0 31955.7 1.2X +UNICODE_CI 19038 19043 7 0.0 190383.4 0.2X -OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1018-azure +OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1021-azure AMD EPYC 7763 64-Core Processor collation unit benchmarks - hashFunction: Best Time(ms) Avg Time(ms) Stdev(ms)Rate(M/s) Per Row(ns) Relative -UTF8_BINARY_LCASE 7347 7349 3 0.0 73467.1 1.0X -UNICODE 73462 73608 206 0.0 734623.2 0.1X -UTF8_BINARY5775 5815 57 0.0 57746.0 1.3X -UNICODE_CI57543 57619 108 0.0 575425.2 0.1X +UTF8_BINARY_LCASE 6914 6921 10 0.0 69135.3 1.0X +UNICODE 67702 67724 31 0.0 677019.3 0.1X +UTF8_BINARY5330 5341 15 0.0 53296.6 1.3X +UNICODE_CI65340 65342 3 0.0 653395.9 0.1X -OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1018-azure +OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1021-azure AMD EPYC 7763 64-Core Processor collation unit benchmarks - contains: Best Time(ms) Avg Time(ms) Stdev(ms)Rate(M/s) Per Row(ns) Relative -UTF8_BINARY_LCASE 15415 15424 13 0.0 154147.1 1.0X -UNICODE
Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [spark]
uros-db commented on PR #46597: URL: https://github.com/apache/spark/pull/46597#issuecomment-2134820952 We can leave `PandasMode` for a separate PR, but we'll definitely need to take care of it at one point now that you've explored various options and finished the `groupMapReduce` approach, I think should can call in other SQL team reviewers to take a look at this and provide their feedback: @dbatomic @nikolamand-db @stefankandic @stevomitric -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [spark]
GideonPotok commented on PR #46597: URL: https://github.com/apache/spark/pull/46597#issuecomment-2132241154 @uros-db ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [spark]
GideonPotok commented on PR #46597: URL: https://github.com/apache/spark/pull/46597#issuecomment-2129923419 @uros-db I forgot but should I add collation support to `org.apache.spark.sql.catalyst.expressions.aggregate.PandasMode`? The only difference will be 1. Support for null keys (thus StringType won't necessarily mean all values in buffer are UTF8String, some might just be null, right?) 2. PandasMode returns a list of all values that are tied for mode. In that case, should all the values be present? Eg if you have the pandas_mode of ['a', 'a', 'a', 'b', 'b', 'B'], with utf_binary_lcase collation, what do you think pandas_mode should return? If we want to support PandasMode, I can do a little research on what other databases seem to favor for this type of question. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [spark]
uros-db commented on code in PR #46597: URL: https://github.com/apache/spark/pull/46597#discussion_r1613363048 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala: ## @@ -74,16 +90,29 @@ case class Mode( if (buffer.isEmpty) { return null } - +val collationAwareBuffer = child.dataType match { + case c: StringType if +!CollationFactory.fetchCollation(c.collationId).supportsBinaryEquality => +val collationId = c.collationId +val modeMap = buffer.toSeq.groupMapReduce { + case (key: String, _) => +CollationFactory.getCollationKey(UTF8String.fromString(key), collationId) + case (key: UTF8String, _) => +CollationFactory.getCollationKey(key, collationId) + case (key, _) => key Review Comment: also, what case exactly does this catch? if `case c: StringType` is already matched, then we should be able to do `CollationFactory.getCollationKey(key.asInstanceOf[UTF8String], collationId)` right away? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [spark]
uros-db commented on code in PR #46597: URL: https://github.com/apache/spark/pull/46597#discussion_r1613361676 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala: ## @@ -74,16 +90,29 @@ case class Mode( if (buffer.isEmpty) { return null } - +val collationAwareBuffer = child.dataType match { + case c: StringType if +!CollationFactory.fetchCollation(c.collationId).supportsBinaryEquality => +val collationId = c.collationId +val modeMap = buffer.toSeq.groupMapReduce { + case (key: String, _) => +CollationFactory.getCollationKey(UTF8String.fromString(key), collationId) + case (key: UTF8String, _) => +CollationFactory.getCollationKey(key, collationId) Review Comment: could we combine these with something like: `CollationFactory.getCollationKey(key.asInstanceOf[UTF8String], collationId)` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [spark]
uros-db commented on code in PR #46597: URL: https://github.com/apache/spark/pull/46597#discussion_r1613358004 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala: ## @@ -74,16 +90,29 @@ case class Mode( if (buffer.isEmpty) { return null } - +val collationAwareBuffer = child.dataType match { + case c: StringType if +!CollationFactory.fetchCollation(c.collationId).supportsBinaryEquality => +val collationId = c.collationId +val modeMap = buffer.toSeq.groupMapReduce { + case (key: String, _) => +CollationFactory.getCollationKey(UTF8String.fromString(key), collationId) + case (key: UTF8String, _) => +CollationFactory.getCollationKey(key, collationId) Review Comment: ```suggestion CollationFactory.getCollationKey(key.asInstanceOf[UTF8String], collationId) ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [spark]
uros-db commented on code in PR #46597: URL: https://github.com/apache/spark/pull/46597#discussion_r1613358004 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala: ## @@ -74,16 +90,29 @@ case class Mode( if (buffer.isEmpty) { return null } - +val collationAwareBuffer = child.dataType match { + case c: StringType if +!CollationFactory.fetchCollation(c.collationId).supportsBinaryEquality => +val collationId = c.collationId +val modeMap = buffer.toSeq.groupMapReduce { + case (key: String, _) => +CollationFactory.getCollationKey(UTF8String.fromString(key), collationId) + case (key: UTF8String, _) => +CollationFactory.getCollationKey(key, collationId) Review Comment: ```suggestion CollationFactory.getCollationKey(key.asInstanceOf[UTF8String], collationId) ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [spark]
uros-db commented on code in PR #46597: URL: https://github.com/apache/spark/pull/46597#discussion_r1613351378 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala: ## @@ -48,6 +49,21 @@ case class Mode( override def inputTypes: Seq[AbstractDataType] = Seq(AnyDataType) + override def checkInputDataTypes(): TypeCheckResult = { +val defaultCheck = super.checkInputDataTypes() +if (defaultCheck.isFailure) { + defaultCheck +} else if (UnsafeRowUtils.isBinaryStable(child.dataType) || + child.dataType.isInstanceOf[StringType]) { + defaultCheck Review Comment: ```suggestion TypeCheckResult.TypeCheckSuccess ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [spark]
GideonPotok commented on code in PR #46597: URL: https://github.com/apache/spark/pull/46597#discussion_r1610561627 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala: ## @@ -48,6 +49,22 @@ case class Mode( override def inputTypes: Seq[AbstractDataType] = Seq(AnyDataType) + override def checkInputDataTypes(): TypeCheckResult = { +val defaultCheck = super.checkInputDataTypes() +if (defaultCheck.isFailure) { + defaultCheck +} else { + child.dataType match { +case _: StructType | _: ArrayType if !UnsafeRowUtils.isBinaryStable(child.dataType) => + TypeCheckResult.TypeCheckFailure( +s"Input to function mode was a complex type" + + s" with non-binary collated fields," + + s" which is not yet supported by mode.") +case _ => TypeCheckResult.TypeCheckSuccess + } Review Comment: @uros-db done -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [spark]
GideonPotok commented on PR #46597: URL: https://github.com/apache/spark/pull/46597#issuecomment-2123419758 @uros-db I have made changes for all but your latest suggestion (re whitelists -- will add that soon) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [spark]
uros-db commented on code in PR #46597: URL: https://github.com/apache/spark/pull/46597#discussion_r1608866089 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala: ## @@ -48,6 +49,22 @@ case class Mode( override def inputTypes: Seq[AbstractDataType] = Seq(AnyDataType) + override def checkInputDataTypes(): TypeCheckResult = { +val defaultCheck = super.checkInputDataTypes() +if (defaultCheck.isFailure) { + defaultCheck +} else { + child.dataType match { +case _: StructType | _: ArrayType if !UnsafeRowUtils.isBinaryStable(child.dataType) => + TypeCheckResult.TypeCheckFailure( +s"Input to function mode was a complex type" + + s" with non-binary collated fields," + + s" which is not yet supported by mode.") +case _ => TypeCheckResult.TypeCheckSuccess + } Review Comment: let's try to white-list instead of black-list (if we can, of course) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [spark]
uros-db commented on code in PR #46597: URL: https://github.com/apache/spark/pull/46597#discussion_r1608865653 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala: ## @@ -48,6 +49,22 @@ case class Mode( override def inputTypes: Seq[AbstractDataType] = Seq(AnyDataType) + override def checkInputDataTypes(): TypeCheckResult = { +val defaultCheck = super.checkInputDataTypes() +if (defaultCheck.isFailure) { + defaultCheck +} else { + child.dataType match { +case _: StructType | _: ArrayType if !UnsafeRowUtils.isBinaryStable(child.dataType) => + TypeCheckResult.TypeCheckFailure( +s"Input to function mode was a complex type" + + s" with non-binary collated fields," + + s" which is not yet supported by mode.") +case _ => TypeCheckResult.TypeCheckSuccess + } Review Comment: ```suggestion if (UnsafeRowUtils.isBinaryStable(child.dataType) || child.dataType.isInstanceOf[StringType]) { TypeCheckResult.TypeCheckSuccess } else { ... } ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [spark]
uros-db commented on code in PR #46597: URL: https://github.com/apache/spark/pull/46597#discussion_r1608247812 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala: ## @@ -74,18 +102,56 @@ case class Mode( if (buffer.isEmpty) { return null } - +val collationAwareBuffer = child.dataType match { + case c: StringType if +!CollationFactory.fetchCollation(c.collationId).supportsBinaryEquality => +val collationId = c.collationId +val modeMap = buffer.toSeq.groupMapReduce { + case (key: String, _) => +CollationFactory.getCollationKey(UTF8String.fromString(key), collationId) + case (key: UTF8String, _) => +CollationFactory.getCollationKey(key, collationId) + case (key, _) => key +}(x => x)((x, y) => (x._1, x._2 + y._2)).values +modeMap +// case s: StructType => getBufferForStructType(buffer, s) + case _ => buffer +} reverseOpt.map { reverse => val defaultKeyOrdering = if (reverse) { PhysicalDataType.ordering(child.dataType).asInstanceOf[Ordering[AnyRef]].reverse } else { PhysicalDataType.ordering(child.dataType).asInstanceOf[Ordering[AnyRef]] } val ordering = Ordering.Tuple2(Ordering.Long, defaultKeyOrdering) - buffer.maxBy { case (key, count) => (count, key) }(ordering) -}.getOrElse(buffer.maxBy(_._2))._1 + collationAwareBuffer.maxBy { case (key, count) => (count, key) }(ordering) +}.getOrElse(collationAwareBuffer.maxBy(_._2))._1 } - +/* + private def getBufferForStructType( + buffer: OpenHashMap[AnyRef, Long], + s: StructType): Iterable[(AnyRef, Long)] = { +val fIsNonBinaryString = s.fields.map(f => (f, f.dataType)).map { + case (f, t: StringType) if !t.supportsBinaryEquality => (f.name, true) + case (f, t) => (f.name, false) +}.toMap +val fCollationIDs = s.fields.collect { + case f if fIsNonBinaryString(f.name) => +(f.name, f.dataType.asInstanceOf[StringType].collationId) +}.toMap + +buffer.groupMapReduce { + case (key: InternalRow, count) => +key.toSeq(s).zip(s.fields).map { + case (k: String, field) if fIsNonBinaryString(field.name) => +CollationFactory.getCollationKey(UTF8String.fromString(k), fCollationIDs(field.name)) + case (k: UTF8String, field) if fIsNonBinaryString(field.name) => +CollationFactory.getCollationKey(k, fCollationIDs(field.name)) + case (k, _) => k + } +}(x => x)((x, y) => (x._1, x._2 + y._2)).values + } +*/ Review Comment: I suppose you will remove this for now -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [spark]
uros-db commented on code in PR #46597: URL: https://github.com/apache/spark/pull/46597#discussion_r1608240127 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala: ## @@ -48,6 +49,33 @@ case class Mode( override def inputTypes: Seq[AbstractDataType] = Seq(AnyDataType) + override def checkInputDataTypes(): TypeCheckResult = { +checkDataType(child.dataType) + } + + private def checkDataType(dataType: DataType, level1: Boolean = true): TypeCheckResult = { +dataType match { + case ArrayType(elementType, _) => +checkDataType(elementType, level1 = false) + case StructType(fields) => +combineTypeCheckResults(fields.map { field => + checkDataType(field.dataType, level1 = false) +}) + case dt: StringType if !level1 && +!CollationFactory.fetchCollation(dt.collationId).supportsBinaryEquality + => TypeCheckResult.TypeCheckFailure( +s"Input to function $prettyName was a complex type" + + s" with strings collated on non-binary collations," + + s" which is not yet supported.") + case _ => TypeCheckResult.TypeCheckSuccess +} + } Review Comment: instead of this recursive check, consider using already existing recursive constructs (such as `existsRecursively` in `DataType`) (see comments above) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [spark]
uros-db commented on code in PR #46597: URL: https://github.com/apache/spark/pull/46597#discussion_r1608242763 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala: ## @@ -48,6 +49,33 @@ case class Mode( override def inputTypes: Seq[AbstractDataType] = Seq(AnyDataType) + override def checkInputDataTypes(): TypeCheckResult = { +checkDataType(child.dataType) Review Comment: related to the comment below, I think you may just go ahead and use something we've already implemented - like `UnsafeRowUtils.isBinaryStable(child.dataType)` also, you'll likely need to allow `child.dataType.isInstanceOf[StringType]` - since we do want to allow all binary stable types, as well as all string types (some of them might be binary "unstable" - such as UTF8_BINARY_LCASE or UNICODE_CI) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [spark]
uros-db commented on code in PR #46597: URL: https://github.com/apache/spark/pull/46597#discussion_r1608240127 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala: ## @@ -48,6 +49,33 @@ case class Mode( override def inputTypes: Seq[AbstractDataType] = Seq(AnyDataType) + override def checkInputDataTypes(): TypeCheckResult = { +checkDataType(child.dataType) + } + + private def checkDataType(dataType: DataType, level1: Boolean = true): TypeCheckResult = { +dataType match { + case ArrayType(elementType, _) => +checkDataType(elementType, level1 = false) + case StructType(fields) => +combineTypeCheckResults(fields.map { field => + checkDataType(field.dataType, level1 = false) +}) + case dt: StringType if !level1 && +!CollationFactory.fetchCollation(dt.collationId).supportsBinaryEquality + => TypeCheckResult.TypeCheckFailure( +s"Input to function $prettyName was a complex type" + + s" with strings collated on non-binary collations," + + s" which is not yet supported.") + case _ => TypeCheckResult.TypeCheckSuccess +} + } Review Comment: instead of this recursive check, consider using already existing recursive constructs (such as `existsRecursively` in `DataType`) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [spark]
uros-db commented on code in PR #46597: URL: https://github.com/apache/spark/pull/46597#discussion_r1608236375 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala: ## @@ -48,6 +49,33 @@ case class Mode( override def inputTypes: Seq[AbstractDataType] = Seq(AnyDataType) + override def checkInputDataTypes(): TypeCheckResult = { Review Comment: here, we also need to do something like: ``` val defaultCheck = super.checkInputDataTypes() if (defaultCheck.isFailure) { defaultCheck } ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [spark]
GideonPotok commented on PR #46597: URL: https://github.com/apache/spark/pull/46597#issuecomment-2121332325 @uros-db I agree that we should avoid auxiliary structures. And I don't see a good way to move the changes to implementation of `merge` and `update` without keeping an auxiliary map from the collation key to the actual values seen (eg from "aa" to "aaaAAa", "AA" for a data frame with the values "aaaAAa" and "AA".) That would be an auxiliary structure. There is ton of of scaffolding to support having just that OpenHashMap available throughout the expression being executed. So I advise strongly against us pursuing this idea, which is good in theory, at least for now. Having said that, such a prototype of an approach might look like this: https://github.com/GideonPotok/spark/pull/1 . Thoughts? Also, I am done with adding the exception for unsupported complex types! Take a look -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [spark]
GideonPotok closed pull request #46526: [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce URL: https://github.com/apache/spark/pull/46526 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [spark]
GideonPotok commented on code in PR #46526: URL: https://github.com/apache/spark/pull/46526#discussion_r1600668920 ## sql/core/src/test/scala/org/apache/spark/sql/CollationStringExpressionsSuite.scala: ## @@ -22,7 +22,7 @@ import org.apache.spark.sql.catalyst.expressions.{ExpressionEvalHelper, Literal, import org.apache.spark.sql.catalyst.util.CollationFactory import org.apache.spark.sql.internal.SQLConf import org.apache.spark.sql.test.SharedSparkSession -import org.apache.spark.sql.types.{ArrayType, BinaryType, BooleanType, DataType, IntegerType, StringType} +import org.apache.spark.sql.types._ Review Comment: ```suggestion import org.apache.spark.sql.types.{ArrayType, BinaryType, BooleanType, DataType, IntegerType, StringType} ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [spark]
uros-db commented on code in PR #46526: URL: https://github.com/apache/spark/pull/46526#discussion_r1600471419 ## sql/core/benchmarks/CollationBenchmark-jdk21-results.txt: ## Review Comment: all things considered, I would say proceed with this approach - clean everything up and get it running with your tests and all CI checks in order. then we can call in other reviewers and see where it goes -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [spark]
uros-db commented on code in PR #46526: URL: https://github.com/apache/spark/pull/46526#discussion_r1600468023 ## sql/core/src/test/scala/org/apache/spark/sql/CollationStringExpressionsSuite.scala: ## Review Comment: "mode" is not a StringExpression let's move the tests to CollationSQLExpressionsSuite.scala -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [spark]
uros-db commented on code in PR #46526: URL: https://github.com/apache/spark/pull/46526#discussion_r1600463245 ## sql/core/benchmarks/CollationBenchmark-jdk21-results.txt: ## Review Comment: apropos altering the benchmark to yield better results for this particular expression, I'm not sure if that's something we should encourage - the benchmark is not perfect and should only be used for rough estimates. it's fine to consider the worst case scenario (all different elements), and I think we should look for the best approach anyways -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [spark]
uros-db commented on code in PR #46526: URL: https://github.com/apache/spark/pull/46526#discussion_r1600457296 ## sql/core/benchmarks/CollationBenchmark-jdk21-results.txt: ## Review Comment: apropos going through lower first, we need to be careful so as not to destroy the original data consider an example of finding the mode in ['a', 'B', 'B', 'A']. Here, correct answers would be: - 'a' - 'A' - 'B' but NOT: 'b' because that value is not found in the original data -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [spark]
GideonPotok commented on code in PR #46526: URL: https://github.com/apache/spark/pull/46526#discussion_r1600041876 ## sql/core/benchmarks/CollationBenchmark-jdk21-results.txt: ## Review Comment: 0. Note, by the way that because we are relying on supportsBinaryEquality, this is about preserving not only the performance for UTF8_BINARY, but also that of UNICODE 1. @uros-db check again. I believe the benchmarks are slightly more realistic now. In that for each string there are 3-6 that are equal by collation. EG: ``` collation unit benchmarks - mode - 30105 elements: Best Time(ms) Avg Time(ms) Stdev(ms)Rate(M/s) Per Row(ns) Relative - UTF8_BINARY_LCASE - mode - 30105 elements 6 6 0 5.1 195.6 1.0X UNICODE - mode - 30105 elements3 3 0 11.6 86.0 2.3X UTF8_BINARY - mode - 30105 elements3 3 0 11.6 85.9 2.3X UNICODE_CI - mode - 30105 elements12 12 1 2.6 382.9 0.5X ``` 2. Still a slowdown (though there is more work, so how would we expect any different). I willl run these new benchmarks on the other approaches and assuming this one is best, we can get this one ready for final stages of cleanup and review... 3. I will leave one benchmark for mode rather then having three for different input sizes... that was just a temporary setup. 5. An idea would be for the case of UTF8_BINARY and UNICODE to go through the `lower` operation first. This would be a better way to check that, as the design doc instructs: "Performance regression for case insensitive collation must be no worse than using upper() or ilike() explicitly" . Let me know whether to change the benchmark accordingly. There will probably still be a performance degradation, but it would at least be a fairer comparison. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [spark]
Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [spark]
uros-db commented on code in PR #46526: URL: https://github.com/apache/spark/pull/46526#discussion_r1598850667 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala: ## @@ -74,6 +78,18 @@ case class Mode( if (buffer.isEmpty) { return null } +val buffer2 = if (!CollationFactory.fetchCollation(collationId).supportsBinaryEquality) { Review Comment: ```suggestion val collationAwareBuffer = if (!CollationFactory.fetchCollation(collationId).supportsBinaryEquality) { ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [spark]
GideonPotok commented on code in PR #46526: URL: https://github.com/apache/spark/pull/46526#discussion_r1598747396 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala: ## @@ -70,10 +74,22 @@ case class Mode( buffer } - override def eval(buffer: OpenHashMap[AnyRef, Long]): Any = { -if (buffer.isEmpty) { + override def eval(buff: OpenHashMap[AnyRef, Long]): Any = { +if (buff.isEmpty) { return null } +val buffer = if (isCollatedString(child)) { + val modeMap = buff.toSeq.groupMapReduce { +case (key: String, _) => + CollationFactory.getCollationKey(UTF8String.fromString(key), collationId) +case (key: UTF8String, _) => + CollationFactory.getCollationKey(key, collationId) +case (key, _) => key + }(x => x)((x, y) => (x._1, x._2 + y._2)).values + modeMap +} else { + buff +} Review Comment: @uros-db Addressing the initial question about why Mode isn't functioning with collated strings in Spark as expected, even though Aggregation typically works: It seems that while the ordering for PhysicalStringType is correctly established using `private[sql] val ordering = CollationFactory.fetchCollation(collationId).comparator.compare(_, _)`, this does not automatically resolve the issue with Mode. To illustrate, consider the example of UTF8_BINARY_LCASE where an input like `Map("a" -> 3L, "b" -> 2L, "B" -> 2L)` results in evaluating the maximum over the tuples `(2L, "B"), (2L, "b"), (3L, "a")` rather than the expected `(3L, "a"), (4L, "b")`. This indicates that the current approach doesn't aggregate values as required for Mode to operate correctly. Unit tests confirm that Mode otherwise won't work for such cases. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [spark]
GideonPotok commented on code in PR #46526: URL: https://github.com/apache/spark/pull/46526#discussion_r1598747396 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala: ## @@ -70,10 +74,22 @@ case class Mode( buffer } - override def eval(buffer: OpenHashMap[AnyRef, Long]): Any = { -if (buffer.isEmpty) { + override def eval(buff: OpenHashMap[AnyRef, Long]): Any = { +if (buff.isEmpty) { return null } +val buffer = if (isCollatedString(child)) { + val modeMap = buff.toSeq.groupMapReduce { +case (key: String, _) => + CollationFactory.getCollationKey(UTF8String.fromString(key), collationId) +case (key: UTF8String, _) => + CollationFactory.getCollationKey(key, collationId) +case (key, _) => key + }(x => x)((x, y) => (x._1, x._2 + y._2)).values + modeMap +} else { + buff +} Review Comment: @uros-db While the ordering for PhysicalStringType is correctly established using `private[sql] val ordering = CollationFactory.fetchCollation(collationId).comparator.compare(_, _)`, this does not automatically resolve the issue with Mode. To illustrate, consider the example of UTF8_BINARY_LCASE where an input like `Map("a" -> 3L, "b" -> 2L, "B" -> 2L)` results in evaluating the maximum over the tuples `(2L, "B"), (2L, "b"), (3L, "a")` rather than the expected `(3L, "a"), (4L, "b")`. This indicates that the current approach doesn't aggregate values as required for Mode to operate correctly. Unit tests confirm that Mode otherwise won't work for such cases. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [spark]
uros-db commented on code in PR #46526: URL: https://github.com/apache/spark/pull/46526#discussion_r1598030581 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala: ## @@ -70,10 +74,22 @@ case class Mode( buffer } - override def eval(buffer: OpenHashMap[AnyRef, Long]): Any = { -if (buffer.isEmpty) { + override def eval(buff: OpenHashMap[AnyRef, Long]): Any = { +if (buff.isEmpty) { return null } +val buffer = if (isCollatedString(child)) { + val modeMap = buff.toSeq.groupMapReduce { +case (key: String, _) => + CollationFactory.getCollationKey(UTF8String.fromString(key), collationId) +case (key: UTF8String, _) => + CollationFactory.getCollationKey(key, collationId) +case (key, _) => key + }(x => x)((x, y) => (x._1, x._2 + y._2)).values + modeMap +} else { + buff +} Review Comment: Going back to the original issue (why Mode doesn't work already, while otherwise Aggregation generally works with collated strings in Spark), here's what I'm interested in: why does `PhysicalDataType.ordering(child.dataType).asInstanceOf[Ordering[AnyRef]]` not work here automatically? afaik, ordering for `PhysicalStringType` is defined correctly: ``` private[sql] val ordering = CollationFactory.fetchCollation(collationId).comparator.compare(_, _) ``` so one would naturally expect Mode to work "as is" did you investigate this maybe? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [spark]
uros-db commented on code in PR #46526: URL: https://github.com/apache/spark/pull/46526#discussion_r1598017669 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala: ## @@ -70,10 +74,22 @@ case class Mode( buffer } - override def eval(buffer: OpenHashMap[AnyRef, Long]): Any = { -if (buffer.isEmpty) { + override def eval(buff: OpenHashMap[AnyRef, Long]): Any = { +if (buff.isEmpty) { Review Comment: we shouldn't make unnecessary changes, let's not rename this if we don't have to just rename your new variable below to something else -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [spark]
uros-db commented on code in PR #46526: URL: https://github.com/apache/spark/pull/46526#discussion_r1598015529 ## sql/core/benchmarks/CollationBenchmark-jdk21-results.txt: ## Review Comment: I would say that these results are somewhat better, but still not too good However - what's imperative right now is that we preserve the performance for UTF8_BINARY (by doing that if/else branch on supportsBinaryEquality). If we don't have a better approach at this moment, and we've already tried a couple of things - then I would say that's fine and we can proceed with the best of what we've got (which is this PR) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [spark]
uros-db commented on code in PR #46526: URL: https://github.com/apache/spark/pull/46526#discussion_r1598012310 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala: ## @@ -86,6 +102,14 @@ case class Mode( }.getOrElse(buffer.maxBy(_._2))._1 } + private def isCollatedString(child: Expression): Boolean = { +child.dataType match { + case s: StringType if s.collationId != CollationFactory.UTF8_BINARY_COLLATION_ID => true + // maybe use supportsBinaryEquality or something else + case _ => false +} + } + Review Comment: we generally shouldn't use collationIds to check equality, you correctly noticed that we should go with `supportsBinaryEquality` instead - you can remove this method, and perform the check above as suggested -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [spark]
uros-db commented on code in PR #46526: URL: https://github.com/apache/spark/pull/46526#discussion_r1598010702 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala: ## @@ -70,10 +74,22 @@ case class Mode( buffer } - override def eval(buffer: OpenHashMap[AnyRef, Long]): Any = { -if (buffer.isEmpty) { + override def eval(buff: OpenHashMap[AnyRef, Long]): Any = { +if (buff.isEmpty) { return null } +val buffer = if (isCollatedString(child)) { Review Comment: ```suggestion val buffer = if (!CollationFactory.fetchCollation(collationId).supportsBinaryEquality) { ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [spark]
GideonPotok opened a new pull request, #46526: URL: https://github.com/apache/spark/pull/46526 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? ### Was this patch authored or co-authored using generative AI tooling? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org