[GitHub] [spark] zhenlineo commented on a diff in pull request #40729: [WIP][CONNECT] Adding groupByKey + mapGroup functions

via GitHub Thu, 13 Apr 2023 10:21:52 -0700


zhenlineo commented on code in PR #40729:
URL: https://github.com/apache/spark/pull/40729#discussion_r1165825957



##########
connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/KeyValueGroupedDataset.scala:
##########
@@ -0,0 +1,258 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
+
+import java.util.Arrays
+
+import scala.collection.JavaConverters._
+
+import org.apache.spark.api.java.function._
+import org.apache.spark.connect.proto
+import org.apache.spark.sql.catalyst.encoders.AgnosticEncoder
+import org.apache.spark.sql.connect.client.UdfUtils
+import org.apache.spark.sql.expressions.{ScalarUserDefinedFunction, 
SingleInputUserDefinedFunction}
+import org.apache.spark.sql.functions.col
+
+/**
+ * A [[Dataset]] has been logically grouped by a user specified grouping key. 
Users should not
+ * construct a [[KeyValueGroupedDataset]] directly, but should instead call 
`groupByKey` on an
+ * existing [[Dataset]].
+ *
+ * @since 3.5.0
+ */
+class KeyValueGroupedDataset[K, V] private[sql] () extends Serializable {

Review Comment:
   In order to support `keyAs` and `mapValues`, we need to keep track of the 
original type info IK, IV, which are the ones to send via wire, as well as the 
casted version of K and V, which is used to modify the udf passed via e.g. 
mapGroups to make UDF accept the original IK and IV. Thus I introduced the new 
`KeyValueGroupedDatasetImpl` and `SingleInputUserDefinedFunction`. They keep 
track of all type info so that the modifications to UDFs can be easily chained.
   
   It is much more easier when the types are explicitly set as the compiler 
will ensure we did not pass any encoders etc wrongly by mistake. We can take a 
look again if it is better to merge the two classes after all methods are done.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] zhenlineo commented on a diff in pull request #40729: [WIP][CONNECT] Adding groupByKey + mapGroup functions

Reply via email to