Cool. Here, the problem is that I have to run the Spark jobs on Glue ETL, which supports Spark 2.4.3, and I don't think this distributed pandas support was added in that version. AFAIK, it was added in version 3.2.
So how can I do it in Spark 2.4.3? Correct me if I'm wrong.

On Wed, Feb 23, 2022 at 8:28 PM Bjørn Jørgensen <bjornjorgen...@gmail.com> wrote:

> You will. The pandas API on Spark that is imported with `from pyspark
> import pandas as ps` is not pandas, but an API that uses PySpark under
> the hood.
>
> On Wed, 23 Feb 2022 at 15:54, Sid <flinkbyhe...@gmail.com> wrote:
>
>> Hi Bjørn,
>>
>> Thanks for your reply. This doesn't help while loading huge datasets.
>> I won't be able to achieve Spark functionality while loading the file
>> in a distributed manner.
>>
>> Thanks,
>> Sid
>>
>> On Wed, 23 Feb 2022 at 7:38 PM, Bjørn Jørgensen <bjornjorgen...@gmail.com> wrote:
>>
>>> from pyspark import pandas as ps
>>>
>>> ps.read_excel?
>>> "Support both `xls` and `xlsx` file extensions from a local filesystem
>>> or URL"
>>>
>>> pdf = ps.read_excel("file")
>>>
>>> df = pdf.to_spark()
>>>
>>> On Wed, 23 Feb 2022 at 14:57, Sid <flinkbyhe...@gmail.com> wrote:
>>>
>>>> Hi Gourav,
>>>>
>>>> Thanks for your time.
>>>>
>>>> I am worried about the distribution of data in the case of a huge
>>>> dataset file. Is Koalas still a better option to go ahead with? If
>>>> yes, how can I use it with Glue ETL jobs? Do I have to pass some kind
>>>> of external jars for it?
>>>>
>>>> Thanks,
>>>> Sid
>>>>
>>>> On Wed, 23 Feb 2022 at 7:22 PM, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> This looks like a very specific and exact problem in its scope.
>>>>>
>>>>> Do you think that you can load the data into a pandas DataFrame and
>>>>> load it back to Spark using a pandas UDF?
>>>>>
>>>>> Koalas is now natively integrated with Spark; try to see if you can
>>>>> use those features.
>>>>>
>>>>> Regards,
>>>>> Gourav
>>>>>
>>>>> On Wed, 23 Feb 2022 at 1:31 PM, Sid <flinkbyhe...@gmail.com> wrote:
>>>>>
>>>>>> I have an Excel file which unfortunately cannot be converted to CSV
>>>>>> format, and I am trying to load it using the pyspark shell.
>>>>>>
>>>>>> I tried invoking the pyspark session below with the jars provided:
>>>>>>
>>>>>> pyspark --jars /home/siddhesh/Downloads/spark-excel_2.12-0.14.0.jar,/home/siddhesh/Downloads/xmlbeans-5.0.3.jar,/home/siddhesh/Downloads/commons-collections4-4.4.jar,/home/siddhesh/Downloads/poi-5.2.0.jar,/home/siddhesh/Downloads/poi-ooxml-5.2.0.jar,/home/siddhesh/Downloads/poi-ooxml-schemas-4.1.2.jar,/home/siddhesh/Downloads/slf4j-log4j12-1.7.28.jar,/home/siddhesh/Downloads/log4j-1.2-api-2.17.1.jar
>>>>>>
>>>>>> and below is the code to read the Excel file:
>>>>>>
>>>>>> df = spark.read.format("excel") \
>>>>>>     .option("dataAddress", "'Sheet1'!") \
>>>>>>     .option("header", "true") \
>>>>>>     .option("inferSchema", "true") \
>>>>>>     .load("/home/.../Documents/test_excel.xlsx")
>>>>>>
>>>>>> It is giving me the below error message:
>>>>>>
>>>>>> java.lang.NoClassDefFoundError: org/apache/logging/log4j/LogManager
>>>>>>
>>>>>> I tried several jars for this error but no luck. Also, what would be
>>>>>> the efficient way to load it?
>>>>>>
>>>>>> Thanks,
>>>>>> Sid
>>>>>
>>>
>>> --
>>> Bjørn Jørgensen
>>> Vestre Aspehaug 4, 6010 Ålesund
>>> Norge
>>>
>>> +47 480 94 297
>>
>
> --
> Bjørn Jørgensen
> Vestre Aspehaug 4, 6010 Ålesund
> Norge
>
> +47 480 94 297