Re: [PR] [SPARK-48700] [SQL] Mode expression for complex types (all collations) [spark]

via GitHub Thu, 26 Sep 2024 13:54:56 -0700


GideonPotok commented on code in PR #47154:
URL: https://github.com/apache/spark/pull/47154#discussion_r1777467063



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala:
##########
@@ -86,6 +91,53 @@ case class Mode(
     buffer
   }
 
+  private def getCollationAwareBuffer(
+      childDataType: DataType,
+      buffer: OpenHashMap[AnyRef, Long]): Iterable[(AnyRef, Long)] = {
+    def groupAndReduceBuffer(groupingFunction: AnyRef => _): Iterable[(AnyRef, Long)] = {
+      buffer.groupMapReduce(t =>
+        groupingFunction(t._1))(x => x)((x, y) => (x._1, x._2 + y._2)).values
+    }
+    def determineBufferingFunction(
+        childDataType: DataType): Option[AnyRef => _] = {
+      childDataType match {
+        case _ if UnsafeRowUtils.isBinaryStable(child.dataType) => None
+        case _ => Some(collationAwareTransform(_, childDataType))
+      }
+    }
+    determineBufferingFunction(childDataType).map(groupAndReduceBuffer).getOrElse(buffer)
+  }
+
+  private def collationAwareTransform(data: AnyRef, dataType: DataType): AnyRef = {
+    dataType match {
+      case _ if UnsafeRowUtils.isBinaryStable(dataType) => data
+      case st: StructType =>
+        processStructTypeWithBuffer(data.asInstanceOf[InternalRow].toSeq(st).zip(st.fields))
+      case at: ArrayType => processArrayTypeWithBuffer(at, data.asInstanceOf[ArrayData])
+      case st: StringType =>
+        CollationFactory.getCollationKey(data.asInstanceOf[UTF8String], st.collationId)
+      case _ =>
+        throw new SparkUnsupportedOperationException(
+          "UNSUPPORTED_MODE_DATA_TYPE",

Review Comment:
   @MaxGekk @uros-db I am having a lot of trouble with this one! I implemented it as COMPLEX_EXPRESSION_UNSUPPORTED_INPUT.NO_INPUT, and plan to create a different subclass that represents this situation (maybe BAD_INPUT) once I have it working. But when I throw the following:
   ```
   SparkUnsupportedOperationException(
     errorClass = "COMPLEX_EXPRESSION_UNSUPPORTED_INPUT.NO_INPUT",
   ```
     
   I end up getting:
   ```
   org.apache.spark.SparkException: [INTERNAL_ERROR] Cannot find sub error class 'COMPLEX_EXPRESSION_UNSUPPORTED_INPUT.NO_INPUT' SQLSTATE: XX000
   ```
    
   Yet if I do
   ```
   SparkUnsupportedOperationException(
     errorClass = "COMPLEX_EXPRESSION_UNSUPPORTED_INPUT",
   ```
   then `org.apache.spark.ErrorClassesJsonReader#getMessageTemplate` fails on the assertion `assert(errorInfo.subClass.isDefined == subErrorClass.isDefined)`, because the sub-class is missing.
   
   Would either of you be able to tell me whether this is the right pattern? For example, maybe it should be ComplexExpressionException("BAD_INPUT"); I am not sure which is the "old" pattern and which is the preferred one.
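
For illustration only, the two failures described above can be reproduced with a small self-contained model of a main-class / sub-class lookup. This is a sketch under stated assumptions, not Spark's implementation: `ErrorInfoSketch`, `ErrorRegistrySketch`, the sub-class name `SOME_REGISTERED_CASE`, and both message templates are hypothetical. It only shows why a name with an unregistered sub-class and a bare main-class name can each be rejected when the main class declares sub-classes.

```scala
// Simplified, hypothetical model of a main-class / sub-class error registry.
// These names are NOT Spark's API; they only mimic the two failures described above.
final case class ErrorInfoSketch(
    message: String,
    subClass: Option[Map[String, String]]) // sub-class name -> message template

object ErrorRegistrySketch {
  // Invented registry content: one main class with one registered sub-class.
  val registry: Map[String, ErrorInfoSketch] = Map(
    "COMPLEX_EXPRESSION_UNSUPPORTED_INPUT" -> ErrorInfoSketch(
      message = "Main message.",
      subClass = Some(Map("SOME_REGISTERED_CASE" -> "Sub message."))))

  // Returns Right(template) on success, Left(reason) when the lookup is rejected.
  def messageTemplate(errorClass: String): Either[String, String] = {
    val parts = errorClass.split("\\.")
    val mainClass = parts.head
    val subErrorClass = parts.tail.headOption
    registry.get(mainClass) match {
      case None =>
        Left(s"Cannot find main error class '$errorClass'")
      // Mirrors the quoted assertion: sub-class presence must match on both sides.
      case Some(info) if info.subClass.isDefined != subErrorClass.isDefined =>
        Left(s"Sub-class presence mismatch for '$errorClass'")
      case Some(info) =>
        subErrorClass match {
          case None => Right(info.message)
          case Some(sub) =>
            info.subClass.flatMap(_.get(sub)) match {
              case Some(subMessage) => Right(info.message + " " + subMessage)
              case None => Left(s"Cannot find sub error class '$errorClass'")
            }
        }
    }
  }
}
```

In this model, a fully qualified name succeeds only if the sub-class is spelled exactly as it is registered, and the bare main-class name is rejected because the main class declares sub-classes. Whether the real registry contains a `NO_INPUT` entry is not something this sketch can confirm.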
   
   
   
   
    



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

