[jira] [Updated] (SPARK-46143) pyspark.pandas read_excel implementation at version 3.4.1

Matheus Pavanetti (Jira) Tue, 28 Nov 2023 12:21:08 -0800


     [ 
https://issues.apache.org/jira/browse/SPARK-46143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Matheus Pavanetti updated SPARK-46143:
--------------------------------------
    Description: 
Hello, 

I would like to report an issue with the pyspark.pandas implementation of the 
read_excel function.

The Microsoft Fabric Spark environment 1.2 (runtime) uses pyspark 3.4.1, whose 
pyspark.pandas implementation potentially targets an older version of pandas.

The pandas read_excel function does not expect a parameter called "squeeze". 
However, pyspark.pandas.read_excel still declares this parameter and passes it 
through to the pandas function.
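
A minimal sketch of the failure mode described above, using hypothetical 
stand-in functions rather than the real pandas or pyspark code. It shows why 
forwarding a keyword that the target function no longer declares raises a 
TypeError, and one possible guard (checking the target's signature before 
forwarding). This is an illustration only, not the fix applied in Spark.

```python
import inspect


def pandas_like_read_excel(io, sheet_name=0, header=0):
    """Hypothetical stand-in for a reader that has no `squeeze` parameter."""
    return {"io": io, "sheet_name": sheet_name, "header": header}


def wrapper_forwarding_everything(io, squeeze=False, **kwargs):
    # Mirrors the reported pattern: `squeeze` is always forwarded,
    # whether or not the target function accepts it.
    return pandas_like_read_excel(io, squeeze=squeeze, **kwargs)


def wrapper_filtering_kwargs(io, squeeze=False, **kwargs):
    # Forward `squeeze` only if the target function declares it.
    accepted = inspect.signature(pandas_like_read_excel).parameters
    if "squeeze" in accepted:
        kwargs["squeeze"] = squeeze
    return pandas_like_read_excel(io, **kwargs)


error_message = None
try:
    wrapper_forwarding_everything("report.xlsx")
except TypeError as exc:
    # The target rejects the unexpected keyword argument.
    error_message = str(exc)

result = wrapper_filtering_kwargs("report.xlsx", sheet_name=1)
```

Here the first wrapper fails on every call, while the second one drops the 
unsupported keyword and succeeds.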

 

!image-2023-11-28-13-20-40-275.png!

 

I investigated this further in the pyspark 3.4.1 documentation:

https://spark.apache.org/docs/3.4.1/api/python/_modules/pyspark/pandas/namespace.html#read_excel

 

This is where I found that the "squeeze" parameter is passed to the pandas 
read_excel function, which does not expect it.

The parameter seems to have been deprecated as of pyspark 3.4.0, but it is 
still used in the implementation.

 

!image-2023-11-28-13-20-51-291.png!

 

I believe this is an issue with the pyspark 3.4.1 implementation, not 
necessarily with Fabric. However, Fabric uses this version as its 1.2 build.

 

For now I can work around this by downloading the Excel file from OneLake to 
the Spark driver, loading it into memory with pandas, and then converting it 
to a Spark DataFrame. Alternatively, I made it work by downgrading the build.

I downloaded the pyspark build 20230713 to my local machine, made the changes, 
and recompiled it, and it worked locally. This suggests the problem lies in 
the implementation, so either it has to be fixed there, or I have to downgrade 
to an older version such as 3.3.0 or try the latest 3.5.0, which is not an 
option on Fabric.
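
The first workaround above (read with plain pandas on the driver, then 
convert) might look roughly like the sketch below. The function name 
load_excel_via_driver and its "reader" parameter are hypothetical, and the 
sketch assumes the file has already been copied to a local path on the driver; 
the OneLake download step is not shown. It relies only on 
SparkSession.createDataFrame accepting a pandas DataFrame.

```python
def load_excel_via_driver(spark, local_path, reader=None, **read_kwargs):
    """Read an Excel file with plain pandas, then convert to a Spark DataFrame.

    `spark` is an existing SparkSession. `reader` defaults to pandas.read_excel
    and is injectable only so the function can be exercised without pandas.
    """
    if reader is None:
        # Imported lazily so defining this function does not require pandas.
        import pandas as pd

        reader = pd.read_excel

    # Bypasses pyspark.pandas.read_excel entirely, so `squeeze` is never passed.
    pandas_df = reader(local_path, **read_kwargs)

    # The whole file is held in driver memory before being distributed.
    return spark.createDataFrame(pandas_df)
```

Because the whole file is loaded into driver memory first, this only suits 
files that fit comfortably on the driver.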

 

 



> pyspark.pandas read_excel implementation at version 3.4.1
> ---------------------------------------------------------
>
>                 Key: SPARK-46143
>                 URL: https://issues.apache.org/jira/browse/SPARK-46143
>             Project: Spark
>          Issue Type: Bug
>          Components: Build
>    Affects Versions: 3.4.1
>         Environment: Apache spark 3.4.1.5.3 build 20230713.
> Running on Microsoft Fabric workspace.
>  
>  
>            Reporter: Matheus Pavanetti
>            Priority: Major
>         Attachments: MicrosoftTeams-image.png, 
> image-2023-11-28-13-20-40-275.png, image-2023-11-28-13-20-51-291.png



