GideonPotok commented on code in PR #46597:
URL: https://github.com/apache/spark/pull/46597#discussion_r1627976630


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala:
##########
@@ -74,16 +90,25 @@ case class Mode(
     if (buffer.isEmpty) {
       return null
     }
-
+    val collationAwareBuffer = child.dataType match {
+      case c: StringType if
+        !CollationFactory.fetchCollation(c.collationId).supportsBinaryEquality 
=>
+        val collationId = c.collationId
+        val modeMap = buffer.toSeq.groupMapReduce {

Review Comment:
   @dbatomic @uros-db here is a mockup of this proposal: 
https://github.com/GideonPotok/spark/pull/2. To reiterate, this is merely a 
mockup/proof of concept.
   
   The relevant test has passed, which indicates that this approach could be 
viable. Now, we need to consider whether to advance and determine how to 
integrate the relevant information about key datatype into the OpenHashMap. 
What are your thoughts on the feasibility of moving forward?
   
   I'm primarily concerned about the risks involved: Integrating collation with 
specialized types and complex hash functions might increase maintenance 
demands, introduce serialization challenges, and lead to subtle bugs. 
Considering the crucial nature of this data structure, we should approach any 
changes with a detailed plan for validation, ensuring backward compatibility, 
and thorough performance testing. It may be wise to consider less invasive 
modifications , such as the one proposed in this PR.
   
   Despite these concerns, this approach is functioning, and it touches on a 
particularly intriguing part of the codebase that I am eager to work on. If you 
think it's a promising route, I'm ready to complete the implementation and 
perform further benchmarks. However, I would appreciate some design suggestions 
as mentioned below.
   
   To effectively implement this, I see two possible directions:
   1. Is there a benefit to using `AnyRef` (as in `OpenHashMap[AnyRef, ...]`) 
by `TypedAggregateWithHashMapAsBuffer`? This was introduced here: 
https://github.com/apache/spark/pull/37216/files without a clear explanation of 
why `AnyRef` was preferred over generics. Should 
`TypedAggregateWithHashMapAsBuffer` remain unchanged, or should it evolve to 
rely on `OpenHashMap[childExpression.dataType.getClass, ...]` for more specific 
typing? @beliefer, although it’s been some time since you worked on this, could 
you advise on whether this component should be modified?
   2. Assuming `TypedAggregateWithHashMapAsBuffer` remains unchanged, I'm 
seeking a more effective method to inject the custom hashing logic (and a 
custom `keyExistsAtPos` method) from `Mode` into the OpenHashMap, depending on 
the `childExpr.dataType`. I would greatly value ideas on how to best integrate 
this. At the moment, the proof of concept is assuming any object passed into 
`OpenHashSet` that is not `Long,Int,Double, or Float` is a `UTF8String` with 
`UTF8_BINARY_LCASE` collation.
   
   Lastly, while I am eager to complete the implementation, I hope to ensure 
that this is something you would definitively want to pursue, barring any 
significant performance setbacks revealed by benchmarking. I've developed this 
proof of concept and it's operational, but a full implementation should ideally 
be something you are confident is the right direction.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Reply via email to