[ https://issues.apache.org/jira/browse/SPARK-21537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Eric O. LEBIGOT (EOL) updated SPARK-21537:
------------------------------------------
Description:

Converting a *PySpark dataframe with nested columns* to Pandas (with `toPandas()`) does not turn the nested columns into their Pandas equivalent, i.e. *columns indexed by a [MultiIndex|https://pandas.pydata.org/pandas-docs/stable/advanced.html]*.

For example, a dataframe with the following structure:

{code:java}
>>> df.printSchema()
root
 |-- device_ID: string (nullable = true)
 |-- time_origin_UTC: timestamp (nullable = true)
 |-- duration_s: integer (nullable = true)
 |-- session_time_UTC: timestamp (nullable = true)
 |-- probes_by_AP: struct (nullable = true)
 |    |-- aa:bb:cc:dd:ee:ff: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- delay_s: float (nullable = true)
 |    |    |    |-- RSSI: short (nullable = true)
 |-- max_RSSI_info_by_AP: struct (nullable = true)
 |    |-- aa:bb:cc:dd:ee:ff: struct (nullable = true)
 |    |    |-- delay_s: float (nullable = true)
 |    |    |-- RSSI: short (nullable = true)
{code}

yields a Pandas dataframe where the `max_RSSI_info_by_AP` column is _not_ nested in Pandas (through a MultiIndex):

{code}
>>> df_pandas_version = df.toPandas()
>>> df_pandas_version["max_RSSI_info_by_AP", "aa:bb:cc:dd:ee:ff", "RSSI"]  # Should work!
(…)
KeyError: ('max_RSSI_info_by_AP', 'aa:bb:cc:dd:ee:ff', 'RSSI')
>>> df_pandas_version["max_RSSI_info_by_AP"].iloc[0]
Row(aa:bb:cc:dd:ee:ff=Row(delay_s=0.0, RSSI=6))
>>> type(_)  # PySpark type, instead of Pandas!
pyspark.sql.types.Row
{code}

It would be much more convenient if `toPandas()` performed the conversion more thoroughly and mapped nested struct columns to Pandas MultiIndex columns. (A possible Pandas-side workaround is sketched after the issue metadata below.)


> toPandas() should handle nested columns (as a Pandas MultiIndex)
> ----------------------------------------------------------------
>
>                 Key: SPARK-21537
>                 URL: https://issues.apache.org/jira/browse/SPARK-21537
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 2.2.0
>            Reporter: Eric O. LEBIGOT (EOL)
>              Labels: pandas
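
Until `toPandas()` supports this natively, a minimal Pandas-side workaround is sketched below. It is not part of Spark: the helper names `flatten_struct_columns` and `_flatten` are hypothetical, and the sketch only expands struct (`Row`) values into MultiIndex columns; array columns such as `probes_by_AP` are left as lists.

{code:python}
import pandas as pd
from pyspark.sql import Row

def _flatten(value, path):
    """Yield (column-path tuple, leaf value) pairs for a possibly nested Row."""
    if isinstance(value, Row):
        for name, sub in value.asDict().items():
            # Recurse so that deeper struct fields extend the column path.
            for kv in _flatten(sub, path + (name,)):
                yield kv
    else:
        yield path, value

def flatten_struct_columns(pdf):
    """Rebuild a toPandas() result so struct columns become MultiIndex columns."""
    records = [
        dict(kv for col in pdf.columns for kv in _flatten(row[col], (col,)))
        for _, row in pdf.iterrows()
    ]
    flat = pd.DataFrame(records, index=pdf.index)
    # Pad shorter paths with "" so that every column tuple has the same depth.
    depth = max(len(col) for col in flat.columns)
    flat.columns = pd.MultiIndex.from_tuples(
        [col + ("",) * (depth - len(col)) for col in flat.columns]
    )
    return flat

# Usage, with df as in the report:
# flat = flatten_struct_columns(df.toPandas())
# flat["max_RSSI_info_by_AP", "aa:bb:cc:dd:ee:ff", "RSSI"]  # now works
{code}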