[jira] [Comment Edited] (SPARK-46143) pyspark.pandas read_excel implementation at version 3.4.1

Christos Karras (Jira) Fri, 05 Apr 2024 06:43:06 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-46143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17834315#comment-17834315
 ]


Christos Karras edited comment on SPARK-46143 at 4/5/24 1:41 PM:
-----------------------------------------------------------------

I have the same issue. The cause is that the squeeze parameter of read_excel 
was deprecated in pandas 1.4 and removed completely in pandas 2.0, yet 
PySpark's implementation of read_excel still passes squeeze to pandas, even 
though the parameter has also been deprecated in PySpark since 3.4.

PySpark declares no version constraint indicating that pandas 2.0 is not 
supported, so a fix that avoids having to stay on pandas 1.x could be to detect 
the pandas version and pass the squeeze parameter only when the installed 
version still accepts it. In addition, if the squeeze parameter is passed to 
the PySpark function while a version of pandas that no longer supports it is 
installed, an error should be raised. This would be a transitional solution 
until the squeeze parameter is also removed completely from PySpark.
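
As a rough illustration of the version check described above, the major 
version can be derived from a pandas version string. This is only a sketch: the 
helper names pandas_major_version and supports_squeeze are hypothetical and not 
part of PySpark, and in real code the string would come from pd.__version__.

```python
def pandas_major_version(version: str) -> int:
    """Return the major component of a version string such as '2.1.4'."""
    # Only the part before the first dot matters for this check, so
    # suffixes such as 'rc1' or '.dev0' later in the string are ignored.
    return int(version.split(".", 1)[0])


def supports_squeeze(version: str) -> bool:
    """Report whether read_excel in this pandas version still accepts squeeze.

    Assumption taken from the comment above: squeeze was removed in pandas 2.0.
    """
    return pandas_major_version(version) < 2
```

With this, supports_squeeze(pd.__version__) could gate whether squeeze is 
forwarded at all.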

 

Modify pyspark\pandas\namespace.py:
 * Change the squeeze parameter of the read_excel function to "squeeze: 
Optional[bool] = None".
 * Modify the nested pd_read_excel function to check the pandas version and 
build a dict of the arguments it passes to pd.read_excel based on that version. 
Also take into account whether the caller passed the squeeze parameter: if the 
caller specified it but the installed pandas version no longer supports it, 
raise an exception.

 

Sample code that could implement this solution:

# Major version of the installed pandas, e.g. 2 for "2.1.4".
pandas_major_version = int(pd.__version__.split(".")[0])

def pd_read_excel(
    io_or_bin: Any,
    sn: Union[str, int, List[Union[str, int]], None],
    sq: Optional[bool],
) -> pd.DataFrame:
    # header and kwds come from the enclosing read_excel function.
    read_excel_args: dict = {
        "io": (
            BytesIO(io_or_bin)
            if isinstance(io_or_bin, (bytes, bytearray))
            else io_or_bin
        ),
        "sheet_name": sn,
        "header": header,
        # TODO other args...,
        **kwds,
    }

    # Forward squeeze only if the caller set it and pandas still accepts it.
    if sq is not None:
        if pandas_major_version >= 2:
            raise Exception(
                "The squeeze parameter for read_excel is no longer "
                "available in pandas 2.x"
            )
        read_excel_args["squeeze"] = sq
    return pd.read_excel(**read_excel_args)
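
The same decision logic can be tried in isolation, without Spark or pandas 
installed. The following is a minimal sketch under stated assumptions: 
build_read_excel_kwargs and its pandas_major argument are hypothetical names 
used only for illustration, and the caller is assumed to supply the detected 
pandas major version.

```python
from typing import Any, Dict, Optional


def build_read_excel_kwargs(
    pandas_major: int,
    squeeze: Optional[bool] = None,
    **kwargs: Any,
) -> Dict[str, Any]:
    """Assemble keyword arguments for pd.read_excel.

    squeeze is added only when the caller set it explicitly and the
    given pandas major version is below 2.
    """
    args: Dict[str, Any] = dict(kwargs)
    if squeeze is not None:
        if pandas_major >= 2:
            # The caller asked for squeeze, but pandas 2.x rejects it.
            raise ValueError(
                "The squeeze parameter for read_excel is no longer "
                "available in pandas 2.x"
            )
        args["squeeze"] = squeeze
    return args
```

Leaving squeeze as None by default means callers who never used the parameter 
are unaffected on either pandas version.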
 


> pyspark.pandas read_excel implementation at version 3.4.1
> ---------------------------------------------------------
>
>                 Key: SPARK-46143
>                 URL: https://issues.apache.org/jira/browse/SPARK-46143
>             Project: Spark
>          Issue Type: Bug
>          Components: Build
>    Affects Versions: 3.4.1
>         Environment: pyspark 3.4.1.5.3 build 20230713.
> Running on Microsoft Fabric workspace at runtime 1.2.
> Tested the same scenario on a spark 3.4.1 standalone deployment on docker 
> documented at https://github.com/mpavanetti/sparkenv
>  
>  
>            Reporter: Matheus Pavanetti
>            Priority: Major
>         Attachments: MicrosoftTeams-image.png, 
> image-2023-11-28-13-20-40-275.png, image-2023-11-28-13-20-51-291.png
>
>
> Hello, 
> I would like to report an issue with the pyspark.pandas implementation of 
> the read_excel function.
> Microsoft Fabric Spark environment 1.2 (runtime) uses pyspark 3.4.1, which 
> potentially uses an older version of pandas in its implementation of 
> pyspark.pandas.
> The read_excel function from pandas doesn't expect a parameter called 
> "squeeze"; however, it is implemented as part of pyspark.pandas, and the 
> parameter "squeeze" is being passed to the pandas function.
>  
> !image-2023-11-28-13-20-40-275.png!
>  
> Investigating further, I looked at the pyspark 3.4.1 documentation:
> https://spark.apache.org/docs/3.4.1/api/python/_modules/pyspark/pandas/namespace.html#read_excel
>  
> This is the point where I found that the "squeeze" parameter is being passed 
> to the pandas read_excel function, which does not expect it.
> It seems it was deprecated as part of pyspark 3.4.0 but is still being used 
> in the implementation.
>  
> !image-2023-11-28-13-20-51-291.png!
>  
> I believe this is an issue with the pyspark 3.4.1 implementation, not 
> necessarily with Fabric. However, Fabric uses this version in its 1.2 build.
>  
> For now I am able to work around this by downloading the Excel file from 
> OneLake to the Spark driver, loading it into memory with pandas, and then 
> converting it to a Spark dataframe. Alternatively, I made it work by 
> downgrading the build.
> I downloaded the pyspark build 20230713 locally, made the changes, 
> re-compiled it, and it worked locally. So the problem is in the 
> implementation: either it has to be fixed, or I have to downgrade to an older 
> version such as 3.3.3 or try the latest 3.5.0, which is not an option on 
> Fabric.
>  
>  
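
The workaround the reporter describes (read the file with plain pandas on the 
driver, then convert the result to a Spark dataframe) might look roughly like 
the sketch below. The function name read_excel_on_driver and the injectable 
reader argument are hypothetical; the sketch assumes the file has already been 
copied to a path readable from the driver and that spark is an active 
SparkSession.

```python
from typing import Any, Callable, Optional


def read_excel_on_driver(
    spark: Any,
    path: str,
    reader: Optional[Callable[..., Any]] = None,
    **read_excel_kwargs: Any,
) -> Any:
    """Load an Excel file with pandas on the driver and return a Spark dataframe."""
    if reader is None:
        # Imported lazily so the sketch can be exercised without pandas.
        import pandas as pd

        reader = pd.read_excel
    # Plain pandas is called directly, so pyspark.pandas never gets the
    # chance to forward the removed squeeze argument.
    pdf = reader(path, **read_excel_kwargs)
    return spark.createDataFrame(pdf)
```

Because the whole file is loaded into driver memory, this only suits files 
that fit there.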



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
