[ https://issues.apache.org/jira/browse/SPARK-40601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610588#comment-17610588 ]
Apache Spark commented on SPARK-40601:
--------------------------------------

User 'EnricoMi' has created a pull request for this issue:
https://github.com/apache/spark/pull/38036

> Improve error when cogrouping groups with mismatching key sizes
> ---------------------------------------------------------------
>
>                 Key: SPARK-40601
>                 URL: https://issues.apache.org/jira/browse/SPARK-40601
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark, SQL
>    Affects Versions: 3.4.0
>            Reporter: Enrico Minack
>            Priority: Minor
>
> Cogrouping two grouped DataFrames in PySpark that have a different number of
> group keys raises an error that is not very descriptive:
> {code:python}
> left.groupby("id", "k").cogroup(right.groupby("id"))
> {code}
> {code:java}
> Traceback (most recent call last):
> py4j.protocol.Py4JJavaError: An error occurred while calling o726.collectToPython.
> : java.lang.IndexOutOfBoundsException: 1
>     at scala.collection.mutable.ResizableArray.apply(ResizableArray.scala:46)
>     at scala.collection.mutable.ResizableArray.apply$(ResizableArray.scala:45)
>     at scala.collection.mutable.ArrayBuffer.apply(ArrayBuffer.scala:49)
>     at org.apache.spark.sql.catalyst.plans.physical.HashShuffleSpec.$anonfun$createPartitioning$5(partitioning.scala:650)
> {code}
> *Note:* This is Python-specific, as cogrouping with differing group key sizes
> is not possible in Scala. The respective Scala API is fully typed on the key.
>
> The problem is that {{EnsureRequirements.ensureDistributionAndOrdering}}
> calls into {{HashShuffleSpec.createPartitioning(clustering)}} where the length
> of {{clustering}} is smaller than the largest bit position ({{v.head}}) in
> {{hashKeyPositions}} (EnsureRequirements.scala:159):
> {code:java}
> hashKeyPositions.map(v => clustering(v.head))
> {code}
> Possible fixes:
> # Assert that both sides have the same number of group keys, and provide a
> meaningful cogroup-specific error message.
> # {{EnsureRequirements}} identifies this situation and provides a meaningful
> distribution-requirements-specific error message.
> # Ideally both.
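
For illustration only, the first possible fix (asserting identical group key sizes before cogrouping) might look roughly like the sketch below. The helper name `check_cogroup_key_sizes` and the wording of the message are assumptions made for this example; they are not part of PySpark and do not necessarily reflect what the linked pull request implements. The sketch works on plain lists of key names so that it runs without a Spark session.

```python
from typing import Sequence


def check_cogroup_key_sizes(left_keys: Sequence[str], right_keys: Sequence[str]) -> None:
    """Hypothetical helper: fail fast when both sides group by a different number of keys."""
    if len(left_keys) != len(right_keys):
        raise ValueError(
            "Cogroup requires the same number of group keys on both sides, "
            f"got {len(left_keys)} ({', '.join(left_keys)}) "
            f"and {len(right_keys)} ({', '.join(right_keys)})."
        )


# Mirrors the failing example above: two keys on the left, one on the right.
try:
    check_cogroup_key_sizes(["id", "k"], ["id"])
except ValueError as e:
    message = str(e)  # descriptive error instead of an IndexOutOfBoundsException
```

Running such a check where the cogroup is constructed would surface the mismatch in Python, before any plan reaches {{EnsureRequirements}}.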