[ https://issues.apache.org/jira/browse/SPARK-35683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-35683:
------------------------------------

    Assignee: Haejoon Lee  (was: Apache Spark)

> Fix Index.difference to avoid collect 'other' to driver side
> ------------------------------------------------------------
>
>                 Key: SPARK-35683
>                 URL: https://issues.apache.org/jira/browse/SPARK-35683
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 3.2.0
>            Reporter: Hyukjin Kwon
>            Assignee: Haejoon Lee
>            Priority: Major
>
> See:
> https://github.com/databricks/koalas/pull/1325#discussion_r647889901
> https://github.com/databricks/koalas/pull/1325#discussion_r647890007
> {code}
> midx1 = ps.MultiIndex.from_tuples([('a', 'x', 1), ('b', 'z', 2), ('k', 'z', 3)])
> midx1.difference(idx1)
> {code}
> {code}
> pyspark.pandas.exceptions.PandasNotImplementedError: The method `pd.Index.__iter__()` is not implemented. If you want to collect your data as an NumPy array, use 'to_numpy()' instead.
> {code}
> In addition, calling {{MultiIndex.from_tuples}} will result in collecting all into driver side.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)