GideonPotok commented on code in PR #46597: URL: https://github.com/apache/spark/pull/46597#discussion_r1627976630
########## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala: ##########
@@ -74,16 +90,25 @@ case class Mode(
     if (buffer.isEmpty) {
       return null
     }
-
+    val collationAwareBuffer = child.dataType match {
+      case c: StringType if
+        !CollationFactory.fetchCollation(c.collationId).supportsBinaryEquality =>
+        val collationId = c.collationId
+        val modeMap = buffer.toSeq.groupMapReduce {

Review Comment:
   @dbatomic @uros-db here is a mockup of this proposal: https://github.com/GideonPotok/spark/pull/2. To reiterate, this is merely a mockup/proof of concept. The relevant test passes, which suggests the approach could be viable. We now need to decide whether to move forward and how to carry the relevant information about the key's datatype into the OpenHashMap. What are your thoughts on the feasibility of moving forward? (A rough sketch of the groupMapReduce idea from the excerpt above is at the end of this comment.)

   I'm primarily concerned about the risks involved: integrating collation with specialized types and complex hash functions could increase maintenance burden, introduce serialization challenges, and lead to subtle bugs. Given how critical this data structure is, any change should come with a detailed plan for validation, backward compatibility, and thorough performance testing. It may be wiser to stick with less invasive modifications, such as the one proposed in this PR.

   Despite these concerns, the approach works, and it touches a particularly interesting part of the codebase that I am eager to work on. If you think it's a promising route, I'm ready to complete the implementation and run further benchmarks. However, I would appreciate design input on the two points below.

   To implement this effectively, I see two possible directions:

   1. Is there a benefit to `TypedAggregateWithHashMapAsBuffer` using `AnyRef` (as in `OpenHashMap[AnyRef, ...]`)? This was introduced in https://github.com/apache/spark/pull/37216/files without a clear explanation of why `AnyRef` was preferred over generics. Should `TypedAggregateWithHashMapAsBuffer` remain unchanged, or should it evolve to use `OpenHashMap[childExpression.dataType.getClass, ...]` for more specific typing? @beliefer, although it has been some time since you worked on this, could you advise on whether this component should be modified?

   2. Assuming `TypedAggregateWithHashMapAsBuffer` remains unchanged, I'm looking for a cleaner way to inject the custom hashing logic (and a custom `keyExistsAtPos` method) from `Mode` into the OpenHashMap, based on `childExpr.dataType`. I would greatly value ideas on how best to do this. At the moment, the proof of concept assumes that any object passed into `OpenHashSet` that is not a `Long`, `Int`, `Double`, or `Float` is a `UTF8String` with `UTF8_BINARY_LCASE` collation.

   Lastly, while I am eager to complete the implementation, I want to make sure this is something you definitely want to pursue, barring significant performance regressions in benchmarking. The proof of concept works, but a full implementation should ideally be a direction you are confident is the right one.
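   To make the comparison concrete, here is a minimal, self-contained sketch of the collation-aware merge that the `groupMapReduce` excerpt above is building toward. It is an illustration only, not the PR's actual code: the toy `buffer`, the variable names, and the use of `UTF8String.toLowerCase` as a stand-in for `UTF8_BINARY_LCASE` key derivation are all assumptions; the real code would derive the grouping key from the collation id (e.g. via `CollationFactory`) rather than hard-coding lower-casing.

   ```scala
   import org.apache.spark.unsafe.types.UTF8String

   // Toy buffer, standing in for what TypedAggregateWithHashMapAsBuffer accumulates:
   // each binary-distinct value mapped to its count.
   val buffer: Map[UTF8String, Long] = Map(
     UTF8String.fromString("Spark") -> 2L,
     UTF8String.fromString("SPARK") -> 3L,
     UTF8String.fromString("flink") -> 1L)

   // Collapse keys that are equal under the collation. toLowerCase stands in for
   // UTF8_BINARY_LCASE here; a real implementation would derive the grouping key
   // from the collation id instead.
   val collationAwareBuffer = buffer.toSeq.groupMapReduce {
     case (key, _) => key.toLowerCase          // collation-aware grouping key
   } {
     case (key, count) => (key, count)         // keep a representative original key
   } {
     case ((k1, c1), (_, c2)) => (k1, c1 + c2) // merge counts within a group
   }

   // The mode is the representative key of the group with the largest merged count:
   // here the Spark/SPARK group, with a merged count of 5.
   val mode = collationAwareBuffer.values.maxBy(_._2)._1
   ```

   The OpenHashMap-based alternative in point 2 would instead push this collation awareness into the map's hashing and equality, making the post-hoc merge above unnecessary, at the cost of the maintenance and serialization risks mentioned earlier.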