I have an Excel file which unfortunately cannot be converted to CSV format,
and I am trying to load it using the PySpark shell.

I tried invoking the pyspark session below with the jars listed.

pyspark --jars \
/home/siddhesh/Downloads/spark-excel_2.12-0.14.0.jar,\
/home/siddhesh/Downloads/xmlbeans-5.0.3.jar,\
/home/siddhesh/Downloads/commons-collections4-4.4.jar,\
/home/siddhesh/Downloads/poi-5.2.0.jar,\
/home/siddhesh/Downloads/poi-ooxml-5.2.0.jar,\
/home/siddhesh/Downloads/poi-ooxml-schemas-4.1.2.jar,\
/home/siddhesh/Downloads/slf4j-log4j12-1.7.28.jar,\
/home/siddhesh/Downloads/log4j-1.2-api-2.17.1.jar

and below is the code to read the Excel file:

df = spark.read.format("excel") \
    .option("dataAddress", "'Sheet1'!") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("/home/.../Documents/test_excel.xlsx")

It is giving me the below error message:

java.lang.NoClassDefFoundError: org/apache/logging/log4j/LogManager
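For reference, my understanding is that POI 5.x logs through the Log4j 2 API, so this error suggests the Log4j 2 jars themselves (log4j-api and log4j-core) are missing from the classpath; log4j-1.2-api is only the 1.x compatibility bridge and does not contain org.apache.logging.log4j.LogManager. A sketch of the launch command with those two jars added (assuming they are downloaded to the same directory; the 2.17.1 version is my assumption to match the bridge jar):

```shell
# Sketch: add log4j-api and log4j-core 2.17.1 alongside the other jars (paths assumed)
pyspark --jars \
/home/siddhesh/Downloads/log4j-api-2.17.1.jar,\
/home/siddhesh/Downloads/log4j-core-2.17.1.jar,\
/home/siddhesh/Downloads/spark-excel_2.12-0.14.0.jar,\
/home/siddhesh/Downloads/poi-5.2.0.jar,\
/home/siddhesh/Downloads/poi-ooxml-5.2.0.jar,\
/home/siddhesh/Downloads/xmlbeans-5.0.3.jar,\
/home/siddhesh/Downloads/commons-collections4-4.4.jar
```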

I tried several jars to resolve this error, but with no luck. Also, what
would be an efficient way to load the file?
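On the efficiency question, one alternative I am considering, assuming the machine can reach Maven Central, is letting Spark resolve spark-excel and its transitive dependencies (POI, XMLBeans, Log4j 2, etc.) instead of listing the jars by hand:

```shell
# Sketch: --packages pulls the jar and its dependency tree from Maven Central
# (assumes network access from the machine running pyspark)
pyspark --packages com.crealytics:spark-excel_2.12:0.14.0
```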

Thanks,
Sid
