GideonPotok commented on code in PR #46597:
URL: https://github.com/apache/spark/pull/46597#discussion_r1626659308


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala:
##########
@@ -74,16 +90,25 @@ case class Mode(
     if (buffer.isEmpty) {
       return null
     }
-
+    val collationAwareBuffer = child.dataType match {
+      case c: StringType if
+          !CollationFactory.fetchCollation(c.collationId).supportsBinaryEquality =>
+        val collationId = c.collationId
+        val modeMap = buffer.toSeq.groupMapReduce {

Review Comment:
   @uros-db can you weigh in on whether you think further exploration of @dbatomic's approach would be warranted? What @dbatomic is suggesting is slightly different than the approach I have prototyped in the past (though it is basically what you suggested very early on).

   @dbatomic's suggestion might look something like the change to core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala seen here, which I have just now whipped up: https://github.com/GideonPotok/spark/pull/2/files#diff-d758fde336b16d62c0feb00db9350822d60bd16805f8a816f2ee82b343d851aa

   It is possible his approach will work well, but as with our past experience when we tried to use the hashmap plus an auxiliary data structure (https://github.com/GideonPotok/spark/pull/1/files), it might bring more complexity and gotchas than we are interested in taking on for this feature. Is it worth it? I am capable of implementing it that way, but it is going to take a bit of work.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
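For readers following along, the truncated `groupMapReduce` call in the diff can be sketched roughly as follows: merge the counts of buffer values that compare equal under the collation, keeping one representative per equivalence class, then take the class with the highest total count. `lowercaseKey` below is a hypothetical stand-in for deriving a collation key (something Spark would do internally via `CollationFactory` for the column's `collationId`); it is an assumption for illustration, not Spark's actual API.

```scala
// Hypothetical stand-in for a collation key: treats strings that differ only
// in case as equal (UTF8_LCASE-like). Spark's Mode would derive the real key
// from CollationFactory for the column's collationId (assumption, not real API).
def lowercaseKey(s: String): String = s.toLowerCase

// Collapse a value -> count buffer so that values equal under the collation
// share one entry; keep the first representative seen and sum the counts,
// then return the representative of the most frequent equivalence class.
def collationAwareMode(buffer: Map[String, Long]): Option[String] = {
  if (buffer.isEmpty) return None
  val merged = buffer.toSeq
    .groupMapReduce { case (value, _) => lowercaseKey(value) } {
      case (value, count) => (value, count)
    } { case ((v1, c1), (_, c2)) => (v1, c1 + c2) }
  Some(merged.values.maxBy(_._2)._1)
}
```

For example, a buffer like `Map("x" -> 1L, "X" -> 2L, "y" -> 2L)` collapses to a single class for `x`/`X` with total count 3, which wins over `y`'s count of 2.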
########## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala: ########## @@ -74,16 +90,25 @@ case class Mode( if (buffer.isEmpty) { return null } - + val collationAwareBuffer = child.dataType match { + case c: StringType if + !CollationFactory.fetchCollation(c.collationId).supportsBinaryEquality => + val collationId = c.collationId + val modeMap = buffer.toSeq.groupMapReduce { Review Comment: @uros-db can you weigh in on whether you think that further exploration of dbatomic's approach would be warranted? What @dbatomic is suggesting is slightly different then the approach i have prototyped in the past (though is basically what you suggested very early on). dbatomic's suggestion might look something like the change seen here in core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala: https://github.com/GideonPotok/spark/pull/2/files#diff-d758fde336b16d62c0feb00db9350822d60bd16805f8a816f2ee82b343d851aa which I have just now whipped up. It is possible his approach will work well, but as with past experience when we tried to use the hashmap plus an auxilliary data structure (https://github.com/GideonPotok/spark/pull/1/files), it might bring complexity and more gotcha's then we might be interested in taking on for this feature. Is it worth it? I am capable of implementing it that way, but it is going to take a bit of work. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org