Re: Loading .xlsx and .xls files using pyspark

Sid Wed, 23 Feb 2022 05:57:14 -0800

Hi Gourav,

Thanks for your time.


I am worried about how the data would be distributed if the Excel file is
huge. Is Koalas still the better option in that case? If yes, how can I use
it with Glue ETL jobs? Do I have to pass any external jars for it?

Thanks,
Sid

On Wed, Feb 23, 2022 at 7:22 PM Gourav Sengupta <gourav.sengu...@gmail.com>
wrote:

> Hi,
>
> This looks like a very specific, narrowly scoped problem.
>
> Do you think you could load the data into a pandas DataFrame and then load
> it back into Spark using a pandas UDF?
>
> Koalas is now natively integrated with Spark; see if you can use those
> features.
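
For reference, the pandas-on-Spark route suggested above might look roughly
like the sketch below. It is untested against this workbook and rests on a few
assumptions: Spark 3.2 or later (where Koalas ships as pyspark.pandas), the
openpyxl package installed for .xlsx parsing, and a helper name
(read_excel_to_spark) that is purely illustrative. No extra jars should be
needed for this route, since the parsing is done in Python; on Glue, Python
packages are usually supplied through the --additional-python-modules job
parameter rather than through jars.

```python
def read_excel_to_spark(path, sheet_name="Sheet1"):
    """Read one sheet of an Excel workbook and return a Spark DataFrame.

    Hypothetical helper, not part of any library.
    """
    # Imported lazily so this module can be loaded without Spark installed.
    # Assumption: Spark 3.2+; on older versions the equivalent import would be
    # `import databricks.koalas as ps`.
    import pyspark.pandas as ps

    # header=0 treats the first row as column names, like option("header", "true").
    psdf = ps.read_excel(path, sheet_name=sheet_name, header=0)

    # Convert the pandas-on-Spark DataFrame into a regular Spark DataFrame.
    return psdf.to_spark()


# Usage (path left as a placeholder):
# df = read_excel_to_spark("<path to workbook>", "Sheet1")
# df.printSchema()
```

On the distribution concern: as far as I understand, a single workbook is
still parsed in one piece, so the parallelism only starts once the result is a
Spark DataFrame (for example after a repartition).
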
>
>
> Regards,
> Gourav
>
> On Wed, Feb 23, 2022 at 1:31 PM Sid <flinkbyhe...@gmail.com> wrote:
>
>> I have an Excel file which unfortunately cannot be converted to CSV
>> format, and I am trying to load it using the pyspark shell.
>>
>> I tried starting the pyspark session below with the following jars.
>>
>> pyspark --jars \
>> /home/siddhesh/Downloads/spark-excel_2.12-0.14.0.jar,/home/siddhesh/Downloads/xmlbeans-5.0.3.jar,/home/siddhesh/Downloads/commons-collections4-4.4.jar,/home/siddhesh/Downloads/poi-5.2.0.jar,/home/siddhesh/Downloads/poi-ooxml-5.2.0.jar,/home/siddhesh/Downloads/poi-ooxml-schemas-4.1.2.jar,/home/siddhesh/Downloads/slf4j-log4j12-1.7.28.jar,/home/siddhesh/Downloads/log4j-1.2-api-2.17.1.jar
>>
>> and below is the code to read the excel file:
>>
>> df = spark.read.format("excel") \
>>      .option("dataAddress", "'Sheet1'!") \
>>      .option("header", "true") \
>>      .option("inferSchema", "true") \
>>      .load("/home/.../Documents/test_excel.xlsx")
>>
>> It gives me the following error message:
>>
>> java.lang.NoClassDefFoundError: org/apache/logging/log4j/LogManager
>>
>> I tried several jars for this error but had no luck. Also, what would be
>> an efficient way to load it?
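
A note on the error itself: org.apache.logging.log4j.LogManager belongs to the
log4j-api artifact of Log4j 2, and no jar with that name appears in the --jars
list above (log4j-1.2-api is the Log4j 1.x compatibility bridge, which is a
different jar). So a likely cause is simply that log4j-api is missing from the
classpath. Below is a minimal sketch for checking a --jars string before
starting the shell; the function name and the prefix list are my own
assumptions, not anything provided by Spark or spark-excel.

```python
import os

# Assumption: a jar whose file name starts with "log4j-api-" provides
# org.apache.logging.log4j.LogManager. Extend the tuple with other
# artifact prefixes you expect to need.
REQUIRED_PREFIXES = ("log4j-api-",)


def missing_jars(jars_arg, required=REQUIRED_PREFIXES):
    """Return the required prefixes that no jar in a --jars string matches."""
    # --jars takes a comma-separated list of paths; keep only the file names.
    names = [os.path.basename(p.strip()) for p in jars_arg.split(",") if p.strip()]
    return [
        prefix
        for prefix in required
        if not any(name.startswith(prefix) for name in names)
    ]


jars = (
    "/home/siddhesh/Downloads/spark-excel_2.12-0.14.0.jar,"
    "/home/siddhesh/Downloads/poi-5.2.0.jar,"
    "/home/siddhesh/Downloads/log4j-1.2-api-2.17.1.jar"
)
print(missing_jars(jars))  # the bridge jar does not count as log4j-api
```

If the list comes back non-empty, adding the matching jar (in the same version
as the other Log4j 2 jars) to --jars would be the first thing to try.
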
>>
>> Thanks,
>> Sid
>>
>
