Cool. Here, the problem is that I have to run the Spark jobs on Glue ETL, which supports Spark 2.4.3, and I don't think this distributed pandas support was added in that version. AFAIK, it was added in version 3.2.
So how can I do it in Spark 2.4.3? Correct me if I'm wrong.

On Wed, Feb 23, 2022 at 8:28 PM Bjørn Jørgensen <bjornjorgen...@gmail.com> wrote:

> You will. The pandas API on Spark that is imported with `from pyspark
> import pandas as ps` is not pandas, but an API that uses PySpark under
> the hood.
>
> On Wed, 23 Feb 2022 at 15:54, Sid <flinkbyhe...@gmail.com> wrote:
>
>> Hi Bjørn,
>>
>> Thanks for your reply. This doesn't help while loading huge datasets.
>> I won't be able to achieve Spark functionality while loading the file
>> in a distributed manner.
>>
>> Thanks,
>> Sid
>>
>> On Wed, 23 Feb 2022 at 7:38 PM, Bjørn Jørgensen <bjornjorgen...@gmail.com> wrote:
>>
>>> from pyspark import pandas as ps
>>>
>>> ps.read_excel?
>>> "Support both `xls` and `xlsx` file extensions from a local filesystem
>>> or URL"
>>>
>>> pdf = ps.read_excel("file")
>>>
>>> df = pdf.to_spark()
>>>
>>> On Wed, 23 Feb 2022 at 14:57, Sid <flinkbyhe...@gmail.com> wrote:
>>>
>>>> Hi Gourav,
>>>>
>>>> Thanks for your time.
>>>>
>>>> I am worried about the distribution of data in the case of a huge
>>>> dataset file. Is Koalas still a better option to go ahead with? If
>>>> yes, how can I use it with Glue ETL jobs? Do I have to pass some kind
>>>> of external jars for it?
>>>>
>>>> Thanks,
>>>> Sid
>>>>
>>>> On Wed, 23 Feb 2022 at 7:22 PM, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> This looks like a very specific and exact problem in its scope.
>>>>>
>>>>> Do you think that you can load the data into a pandas DataFrame and
>>>>> load it back to Spark using a pandas UDF?
>>>>>
>>>>> Koalas is now natively integrated with Spark; try to see if you can
>>>>> use those features.
>>>>>
>>>>> Regards,
>>>>> Gourav
>>>>>
>>>>> On Wed, 23 Feb 2022 at 1:31 PM, Sid <flinkbyhe...@gmail.com> wrote:
>>>>>
>>>>>> I have an Excel file which unfortunately cannot be converted to CSV
>>>>>> format, and I am trying to load it using the pyspark shell.
>>>>>>
>>>>>> I tried invoking the pyspark session below with the jars provided:
>>>>>>
>>>>>> pyspark --jars /home/siddhesh/Downloads/spark-excel_2.12-0.14.0.jar,/home/siddhesh/Downloads/xmlbeans-5.0.3.jar,/home/siddhesh/Downloads/commons-collections4-4.4.jar,/home/siddhesh/Downloads/poi-5.2.0.jar,/home/siddhesh/Downloads/poi-ooxml-5.2.0.jar,/home/siddhesh/Downloads/poi-ooxml-schemas-4.1.2.jar,/home/siddhesh/Downloads/slf4j-log4j12-1.7.28.jar,/home/siddhesh/Downloads/log4j-1.2-api-2.17.1.jar
>>>>>>
>>>>>> and below is the code to read the Excel file:
>>>>>>
>>>>>> df = spark.read.format("excel") \
>>>>>>     .option("dataAddress", "'Sheet1'!") \
>>>>>>     .option("header", "true") \
>>>>>>     .option("inferSchema", "true") \
>>>>>>     .load("/home/.../Documents/test_excel.xlsx")
>>>>>>
>>>>>> It is giving me the below error message:
>>>>>>
>>>>>> java.lang.NoClassDefFoundError: org/apache/logging/log4j/LogManager
>>>>>>
>>>>>> I tried several jars for this error but no luck. Also, what would be
>>>>>> the efficient way to load it?
>>>>>>
>>>>>> Thanks,
>>>>>> Sid
>>>>>
>>>
>>> --
>>> Bjørn Jørgensen
>>> Vestre Aspehaug 4, 6010 Ålesund
>>> Norge
>>>
>>> +47 480 94 297
>>
>
> --
> Bjørn Jørgensen
> Vestre Aspehaug 4, 6010 Ålesund
> Norge
>
> +47 480 94 297