[jira] [Updated] (SPARK-46143) pyspark.pandas read_excel implementation at version 3.4.1

Matheus Pavanetti (Jira) Tue, 28 Nov 2023 12:23:26 -0800


     [ 
https://issues.apache.org/jira/browse/SPARK-46143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Matheus Pavanetti updated SPARK-46143:
--------------------------------------
    Environment: 
pyspark 3.4.1.5.3 build 20230713.

Running on Microsoft Fabric workspace.

 

 

  was:
Apache spark 3.4.1.5.3 build 20230713.

Running on Microsoft Fabric workspace.

 

 


> pyspark.pandas read_excel implementation at version 3.4.1
> ---------------------------------------------------------
>
>                 Key: SPARK-46143
>                 URL: https://issues.apache.org/jira/browse/SPARK-46143
>             Project: Spark
>          Issue Type: Bug
>          Components: Build
>    Affects Versions: 3.4.1
>         Environment: pyspark 3.4.1.5.3 build 20230713.
> Running on Microsoft Fabric workspace.
>  
>  
>            Reporter: Matheus Pavanetti
>            Priority: Major
>         Attachments: MicrosoftTeams-image.png, 
> image-2023-11-28-13-20-40-275.png, image-2023-11-28-13-20-51-291.png
>
>
> Hello,
> I would like to report an issue with the pyspark.pandas implementation of the
> read_excel function.
> The Microsoft Fabric Spark environment (runtime 1.2) uses pyspark 3.4.1, whose
> pyspark.pandas implementation appears to target an older version of pandas.
> The pandas read_excel function does not accept a parameter called
> "squeeze". However, pyspark.pandas still defines that parameter and passes
> it through to the pandas function.
>  
> !image-2023-11-28-13-20-40-275.png!
>  
> Investigating further, I looked at the pyspark 3.4.1 source published with
> the documentation:
> https://spark.apache.org/docs/3.4.1/api/python/_modules/pyspark/pandas/namespace.html#read_excel
>  
> This is where I found that the "squeeze" parameter is passed to the pandas
> read_excel function, which does not expect it.
> It seems the parameter was deprecated as of pyspark 3.4.0 but is still used
> in the implementation.
>  
> !image-2023-11-28-13-20-51-291.png!
>  
> I believe this is an issue with the pyspark 3.4.1 implementation, not
> necessarily with Fabric. However, Fabric uses this version in its 1.2 build.
>  
> For now I can work around it by downloading the Excel file from OneLake
> to the Spark driver, loading it into memory with pandas, and then
> converting it to a Spark DataFrame.
> Alternatively, I made it work by patching the build: I downloaded pyspark
> build 20230713 to my local machine, made the changes, re-compiled it, and it
> worked locally. This suggests the problem lies in the implementation. Either
> it has to be fixed there, or I have to downgrade to an older version such as
> 3.3.3 or try the latest 3.5.0, which is not an option on Fabric.
>  
>  
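
The failure described above (a wrapper forwarding a keyword that the wrapped function no longer accepts) can be illustrated with a minimal sketch. Everything here is an assumption for illustration only: `accepts_keyword`, `old_style_reader`, `new_style_reader`, and `wrapper` are hypothetical stand-ins, not the actual pyspark or pandas code.

```python
import inspect


def accepts_keyword(func, name):
    """Return True if `func` can be called with keyword argument `name`."""
    params = inspect.signature(func).parameters
    if name in params:
        return params[name].kind in (
            inspect.Parameter.POSITIONAL_OR_KEYWORD,
            inspect.Parameter.KEYWORD_ONLY,
        )
    # A **kwargs catch-all accepts any keyword name.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


# Hypothetical stand-in for a reader whose signature still has `squeeze`.
def old_style_reader(io, sheet_name=0, squeeze=None):
    return (io, sheet_name, squeeze)


# Hypothetical stand-in for a reader whose signature dropped `squeeze`.
def new_style_reader(io, sheet_name=0):
    return (io, sheet_name)


def wrapper(reader, io):
    # Mirrors the reported pattern: the wrapper always forwards `squeeze`,
    # so it raises TypeError when the reader no longer accepts it.
    return reader(io, sheet_name=0, squeeze=None)
```

In an environment with pandas installed, calling `accepts_keyword(pandas.read_excel, "squeeze")` should indicate whether the installed pandas still takes that keyword, which may help confirm the mismatch before patching anything.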

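The first workaround mentioned above (read the file on the driver with pandas, then convert) might look roughly like the sketch below. It assumes the Excel file has already been copied to a local path on the driver and that a single sheet is read. The helper names `drop_removed_kwargs` and `read_excel_via_pandas` are hypothetical and are not part of pyspark or pandas.

```python
def drop_removed_kwargs(kwargs, removed=("squeeze",)):
    """Return a copy of `kwargs` without keywords the installed pandas rejects."""
    return {k: v for k, v in kwargs.items() if k not in removed}


def read_excel_via_pandas(local_path, **kwargs):
    """Read an Excel file with plain pandas, then convert to pandas-on-Spark."""
    # Imported lazily so the helper above works without Spark installed.
    import pandas as pd
    import pyspark.pandas as ps

    # Bypass pyspark.pandas.read_excel entirely, so `squeeze` is never forwarded.
    pdf = pd.read_excel(local_path, **drop_removed_kwargs(kwargs))
    return ps.from_pandas(pdf)
```

Calling `.to_spark()` on the result should give a regular Spark DataFrame. Note that this loads the whole file into driver memory, so it only suits files that fit there.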


--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
