Hi Gourav, Thanks for your time.
I am worried about the distribution of data in case of a huge dataset file. Is Koalas still a better option to go ahead with? If yes, how can I use it with Glue ETL jobs? Do I have to pass some kind of external jars for it?

Thanks,
Sid

On Wed, Feb 23, 2022 at 7:22 PM Gourav Sengupta <gourav.sengu...@gmail.com> wrote:

> Hi,
>
> this looks like a very specific and exact problem in its scope.
>
> Do you think that you can load the data into panda dataframe and load it
> back to SPARK using PANDAS UDF?
>
> Koalas is now natively integrated with SPARK, try to see if you can use
> those features.
>
>
> Regards,
> Gourav
>
> On Wed, Feb 23, 2022 at 1:31 PM Sid <flinkbyhe...@gmail.com> wrote:
>
>> I have an excel file which unfortunately cannot be converted to CSV
>> format and I am trying to load it using pyspark shell.
>>
>> I tried invoking the below pyspark session with the jars provided.
>>
>> pyspark --jars
>> /home/siddhesh/Downloads/spark-excel_2.12-0.14.0.jar,/home/siddhesh/Downloads/xmlbeans-5.0.3.jar,/home/siddhesh/Downloads/commons-collections4-4.4.jar,/home/siddhesh/Downloads/poi-5.2.0.jar,/home/siddhesh/Downloads/poi-ooxml-5.2.0.jar,/home/siddhesh/Downloads/poi-ooxml-schemas-4.1.2.jar,/home/siddhesh/Downloads/slf4j-log4j12-1.7.28.jar,/home/siddhesh/Downloads/log4j-1.2-api-2.17.1.jar
>>
>> and below is the code to read the excel file:
>>
>> df = spark.read.format("excel") \
>>     .option("dataAddress", "'Sheet1'!") \
>>     .option("header", "true") \
>>     .option("inferSchema", "true") \
>>     .load("/home/.../Documents/test_excel.xlsx")
>>
>> It is giving me the below error message:
>>
>> java.lang.NoClassDefFoundError: org/apache/logging/log4j/LogManager
>>
>> I tried several Jars for this error but no luck. Also, what would be the
>> efficient way to load it?
>>
>> Thanks,
>> Sid
>>