Nicholas Gustafson created SPARK-37553:
------------------------------------------

             Summary: pandas.DataFrame.pivot_table KeyError on underscore in name
                 Key: SPARK-37553
                 URL: https://issues.apache.org/jira/browse/SPARK-37553
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 3.2.0
            Reporter: Nicholas Gustafson


The method [pyspark.pandas.frame.DataFrame.pivot_table|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L5841-L6134] raises a {{KeyError}} when the {{values}} kwarg has length greater than 1. This occurs in two cases.

First, when a column name in {{values}} contains an underscore ({{{}_{}}}), because the name is split on the underscore [here|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L6064-L6067].

Minimal reproduction of the bug with an underscore in a {{values}} column name (using IPython):
{code:python}
>>> import numpy as np
>>> import pandas as pd

>>> from pyspark import pandas as ps

>>> pdf = pd.DataFrame(
        {
            "a": [4, 2, 3, 4, 8, 6],
            "b_b": [1, 2, 2, 4, 2, 4],
            "e": [10, 20, 20, 40, 20, 40],
            "c": [1, 2, 9, 4, 7, 4],
            "d": [-1, -2, -3, -4, -5, -6],
        },
        index=np.random.rand(6),
    )
>>> psdf = ps.from_pandas(pdf)
>>> psdf.pivot_table(index=["c"], columns="a", values=["b_b", "e"])

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-8-32d5bb0e1166> in <module>
----> 1 psdf.pivot_table(index=["c"], columns="a", values=["b_b", "e"])

~/.pyenv/versions/3.7.9/envs/venv37/lib/python3.7/site-packages/pyspark/pandas/frame.py in pivot_table(self, values, index, columns, aggfunc, fill_value)
   6053                     column_labels = [
   6054                         tuple(list(column_name_to_index[name.split("_")[1]]) + [name.split("_")[0]])
-> 6055                         for name in data_columns
   6056                     ]
   6057                     column_label_names = (

~/.pyenv/versions/3.7.9/envs/venv37/lib/python3.7/site-packages/pyspark/pandas/frame.py in <listcomp>(.0)
   6053                     column_labels = [
   6054                         tuple(list(column_name_to_index[name.split("_")[1]]) + [name.split("_")[0]])
-> 6055                         for name in data_columns
   6056                     ]
   6057                     column_label_names = (

KeyError: 'b'
{code}
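For illustration, here is a minimal, self-contained sketch (plain Python, not the actual Spark code path) of why the lookup fails. It assumes the aggregated internal column names take the form {{<aggfunc>_<value column>}} (e.g. {{mean_b_b}}); the {{column_name_to_index}} contents and the {{name}} value below are hypothetical.
{code:python}
# Hypothetical mapping of value-column names, standing in for column_name_to_index.
column_name_to_index = {"b_b": ["b_b"], "e": ["e"]}

# Hypothetical aggregated column name of the assumed "<aggfunc>_<value column>" form.
name = "mean_b_b"

print(name.split("_")[1])        # "b" -- only a fragment of the column name "b_b"
try:
    column_name_to_index[name.split("_")[1]]
except KeyError as exc:
    print("KeyError:", exc)      # KeyError: 'b', matching the traceback above

# Splitting at most once keeps the full column name intact for this particular case:
print(name.split("_", 1)[1])     # "b_b"
{code}
Splitting at most once is only one possible way to avoid the error for this first case; the draft pull request may take a different approach.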

Second, when elements of {{columns}} are strings containing underscores ({{_}}), we get a similar {{KeyError}} due to [this line|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L6058].

Here is a minimal reproduction of this second case (in IPython):

{code:python}
>>> import numpy as np
>>> import pandas as pd

>>> from pyspark import pandas as ps

>>> pdf = pd.DataFrame(
        {
            "a": ["4_4", "2_2", "3_3", "4_4", "8_8", "6_6"],
            "b": [1, 2, 2, 4, 2, 4],
            "e": [10, 20, 20, 40, 20, 40],
            "c": [1, 2, 9, 4, 7, 4],
            "d": [-1, -2, -3, -4, -5, -6],
        },
        index=np.random.rand(6),
    )
>>> psdf = ps.from_pandas(pdf)
>>> psdf.pivot_table(index=["c"], columns="a", values=["b", "e"])

KeyError                                  Traceback (most recent call last)
<ipython-input-10-787c0fabea99> in <module>
      1 psdf = ps.from_pandas(pdf)
----> 2 psdf.pivot_table(index=["c"], columns="a", values=["b", "e"])

~/.pyenv/versions/3.7.9/envs/venv37/lib/python3.7/site-packages/pyspark/pandas/frame.py in pivot_table(self, values, index, columns, aggfunc, fill_value)
   6053                     column_labels = [
   6054                         tuple(list(column_name_to_index[name.split("_")[1]]) + [name.split("_")[0]])
-> 6055                         for name in data_columns
   6056                     ]
   6057                     column_label_names = (

~/.pyenv/versions/3.7.9/envs/venv37/lib/python3.7/site-packages/pyspark/pandas/frame.py in <listcomp>(.0)
   6053                     column_labels = [
   6054                         tuple(list(column_name_to_index[name.split("_")[1]]) + [name.split("_")[0]])
-> 6055                         for name in data_columns
   6056                     ]
   6057                     column_label_names = (

KeyError: '2'
{code}
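Again purely for illustration (the names below are assumed, not the actual internals): when a pivoted value such as {{2_2}} ends up embedded in a generated column name, splitting that name on every underscore returns a fragment of the pivoted value rather than a value-column name, which matches the {{KeyError: '2'}} above.
{code:python}
# Hypothetical mapping of value-column names, standing in for column_name_to_index.
column_name_to_index = {"b": ["b"], "e": ["e"]}

# Hypothetical generated name embedding the pivoted value "2_2" and the value column "b";
# the exact format used internally may differ.
name = "b_2_2"

try:
    column_name_to_index[name.split("_")[1]]
except KeyError as exc:
    print("KeyError:", exc)   # KeyError: '2' -- a fragment of the pivoted value "2_2"
{code}
In this case the underscore comes from user data rather than from a {{values}} column name, so any fix that still parses names by splitting on {{_}} would need a separator or strategy that cannot collide with user data; the draft pull request may handle both cases together.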

Based on the instructions in the [PySpark Contributing Guide|https://spark.apache.org/docs/latest/api/python/development/contributing.html], I forked the repo, created a new branch, and added an (example) fix plus new unit tests for the code changes. Since I have never contributed to this project before, I have submitted a Draft Pull Request (instead of a normal pull request).



