[jira] [Created] (SPARK-46143) pyspark.pandas read_excel implementation at version 3.4.1

Matheus Pavanetti (Jira) Tue, 28 Nov 2023 12:20:04 -0800

Matheus Pavanetti created SPARK-46143:
-----------------------------------------


             Summary: pyspark.pandas read_excel implementation at version 3.4.1
                 Key: SPARK-46143
                 URL: https://issues.apache.org/jira/browse/SPARK-46143
             Project: Spark
          Issue Type: Bug
          Components: Build
    Affects Versions: 3.4.1
         Environment: Apache spark 3.4.1.5.3 build 20230713.

Running on Microsoft Fabric workspace.

 

 
            Reporter: Matheus Pavanetti


Hello, 

I would like to report an issue with the pyspark.pandas implementation of the 
read_excel function.

The Microsoft Fabric Spark environment 1.2 (runtime) uses pyspark 3.4.1, which 
potentially targets an older version of pandas in its implementation of 
pyspark.pandas.

The read_excel function in pandas does not accept a parameter called "squeeze". 
However, the parameter is still part of the pyspark.pandas implementation, and 
"squeeze" is passed through to the pandas function.
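
To illustrate the failure mode described above, here is a minimal sketch that uses only the standard library. The function upstream_read_excel is a hypothetical stand-in for a reader that no longer accepts "squeeze"; it is not the pandas or pyspark code. The naive wrapper always forwards "squeeze" and fails, while the filtering wrapper forwards only the keyword arguments the target declares. This is one possible defensive pattern, not the fix applied in Spark.

```python
import inspect


def upstream_read_excel(io, sheet_name=0, header=0):
    """Hypothetical stand-in for a reader that no longer accepts `squeeze`."""
    return {"io": io, "sheet_name": sheet_name, "header": header}


def naive_wrapper(io, **kwargs):
    # Always forwards `squeeze`, mirroring the behaviour described in the report.
    return upstream_read_excel(io, squeeze=False, **kwargs)


def filtering_wrapper(io, **kwargs):
    # Forward only the keyword arguments that the target function declares.
    accepted = inspect.signature(upstream_read_excel).parameters
    candidates = {"squeeze": False, **kwargs}
    forwarded = {k: v for k, v in candidates.items() if k in accepted}
    return upstream_read_excel(io, **forwarded)


def call_fails(func, *args, **kwargs):
    """Return True if the call raises TypeError (e.g. an unexpected keyword)."""
    try:
        func(*args, **kwargs)
    except TypeError:
        return True
    return False
```

Note that silently dropping an argument hides the mismatch from the caller, so a real wrapper might prefer to warn when it discards one.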

 


 

I investigated further in the pyspark 3.4.1 documentation:

https://spark.apache.org/docs/3.4.1/api/python/_modules/pyspark/pandas/namespace.html#read_excel

 

This is where I found that the "squeeze" parameter is passed to the pandas 
read_excel function, which does not expect it.

It seems it was deprecated as of pyspark 3.4.0 but is still used in the 
implementation.

 


 

I believe this is an issue with the pyspark 3.4.1 implementation, not 
necessarily with Fabric. However, Fabric uses this version as its 1.2 build.

 

For now I can work around it by downloading the Excel file from OneLake to the 
Spark driver, loading it into memory with pandas, and then converting it to a 
Spark dataframe. Alternatively, I made it work by downgrading the build.
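
The workaround above can be sketched roughly as follows. This is an illustration under assumptions, not the reporter's actual code: the function name read_excel_via_driver and the injectable reader argument are hypothetical, and the file is assumed to have already been copied to a path the driver can read. By default the sketch calls plain pandas.read_excel (bypassing pyspark.pandas) and passes the result to spark.createDataFrame.

```python
def read_excel_via_driver(spark, local_path, reader=None, **reader_kwargs):
    """Read an Excel file on the driver with plain pandas, then convert to Spark.

    `spark` is an existing SparkSession; `local_path` is assumed to be a file
    already downloaded to the driver. `reader` is a hypothetical hook that
    defaults to pandas.read_excel.
    """
    if reader is None:
        # Imported lazily so this module loads even where pandas is absent.
        import pandas as pd

        reader = pd.read_excel

    # The whole workbook sheet is loaded into driver memory at this point.
    pandas_df = reader(local_path, **reader_kwargs)

    # Hand the in-memory pandas DataFrame over to Spark.
    return spark.createDataFrame(pandas_df)
```

Because the whole file is loaded into driver memory before conversion, this only suits files that fit comfortably on the driver.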

I downloaded the pyspark build 20230713 to my local machine, made the changes, 
re-compiled it, and it worked locally. So the problem is related to the 
implementation and would have to be fixed there. Otherwise I would have to 
downgrade to an older version such as 3.3.0 or try the latest 3.5.0, which is 
not an option on Fabric.

 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
