Enrico Minack created SPARK-40601:
-------------------------------------

             Summary: Improve error when cogrouping groups with mismatching key sizes
                 Key: SPARK-40601
                 URL: https://issues.apache.org/jira/browse/SPARK-40601
             Project: Spark
          Issue Type: Improvement
          Components: PySpark, SQL
    Affects Versions: 3.4.0
            Reporter: Enrico Minack

Cogrouping two grouped DataFrames in PySpark that have different numbers of group keys raises an error that is not very descriptive:

{code:python}
left.groupby("id", "k").cogroup(right.groupby("id"))
{code}

{code:java}
Traceback (most recent call last):
py4j.protocol.Py4JJavaError: An error occurred while calling o726.collectToPython.
: java.lang.IndexOutOfBoundsException: 1
	at scala.collection.mutable.ResizableArray.apply(ResizableArray.scala:46)
	at scala.collection.mutable.ResizableArray.apply$(ResizableArray.scala:45)
	at scala.collection.mutable.ArrayBuffer.apply(ArrayBuffer.scala:49)
	at org.apache.spark.sql.catalyst.plans.physical.HashShuffleSpec.$anonfun$createPartitioning$5(partitioning.scala:650)
{code}

*Note:* This is Python-specific, as cogrouping with differing group key sizes is not possible in Scala. The respective Scala API is fully typed on the key.

The problem is that {{EnsureRequirements.ensureDistributionAndOrdering}} calls into {{HashShuffleSpec.createPartitioning(clustering)}}, where the length of {{clustering}} is smaller than the largest bit position ({{v.head}}) in {{hashKeyPositions}} (EnsureRequirements.scala:159):

{code:java}
hashKeyPositions.map(v => clustering(v.head))
{code}

Possible fixes:
# Assert identical size for group keys, and provide a meaningful cogroup-specific error message.
# {{EnsureRequirements}} identifies this situation and provides a meaningful distribution-requirements-specific error message.
# Ideally both.
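
A minimal Python sketch of the failure mode and of the first proposed fix. It uses plain Python lists rather than Spark internals. The names {{create_partitioning}} and {{check_cogroup_keys}} and the wording of the error message are hypothetical, chosen only for illustration; they are not existing Spark or PySpark APIs.

```python
from typing import List, Sequence


def create_partitioning(hash_key_positions: Sequence[Sequence[int]],
                        clustering: Sequence[str]) -> List[str]:
    """Toy stand-in for the quoted lookup: take clustering[v[0]] for each v.

    Raises IndexError when a position points past the end of `clustering`,
    which mirrors the IndexOutOfBoundsException in the stack trace.
    """
    return [clustering[v[0]] for v in hash_key_positions]


def check_cogroup_keys(left_keys: Sequence[str],
                       right_keys: Sequence[str]) -> None:
    """Hypothetical up-front check: both sides must have the same number of keys."""
    if len(left_keys) != len(right_keys):
        raise ValueError(
            "Cogroup requires the same number of grouping keys on both sides: "
            f"left has {len(left_keys)} ({', '.join(left_keys)}), "
            f"right has {len(right_keys)} ({', '.join(right_keys)})"
        )
```

With positions derived from two keys ({{[[0], [1]]}}) and a clustering of length one ({{["id"]}}), {{create_partitioning}} fails with a bare index error, whereas calling {{check_cogroup_keys(["id", "k"], ["id"])}} first reports the mismatch in terms of the cogroup keys.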