Hi Bjørn,

Thanks for your reply. This doesn't help when loading huge datasets, though. I
won't be able to use Spark to load the file in a distributed manner.

Thanks,
Sid

On Wed, Feb 23, 2022 at 7:38 PM Bjørn Jørgensen <bjornjorgen...@gmail.com>
wrote:

> from pyspark import pandas as ps
>
>
> ps.read_excel?
> "Support both `xls` and `xlsx` file extensions from a local filesystem or
> URL"
>
> pdf = ps.read_excel("file")
>
> df = pdf.to_spark()
>
> On Wed, Feb 23, 2022 at 14:57, Sid <flinkbyhe...@gmail.com> wrote:
>
>> Hi Gourav,
>>
>> Thanks for your time.
>>
>> I am worried about the distribution of data in case of a huge dataset
>> file. Is Koalas still a better option to go ahead with? If yes, how can I
>> use it with Glue ETL jobs? Do I have to pass some kind of external jars for
>> it?
>>
>> Thanks,
>> Sid
>>
>> On Wed, Feb 23, 2022 at 7:22 PM Gourav Sengupta <
>> gourav.sengu...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> This looks like a very specific, narrowly scoped problem.
>>>
>>> Do you think that you can load the data into a pandas DataFrame and load it
>>> back into Spark using a pandas UDF?
>>>
>>> Koalas is now natively integrated with Spark; try to see if you can use
>>> those features.
>>>
>>>
>>> Regards,
>>> Gourav
>>>
>>> On Wed, Feb 23, 2022 at 1:31 PM Sid <flinkbyhe...@gmail.com> wrote:
>>>
>>>> I have an Excel file which unfortunately cannot be converted to CSV
>>>> format, and I am trying to load it using the pyspark shell.
>>>>
>>>> I tried invoking the below pyspark session with the jars provided.
>>>>
>>>> pyspark --jars
>>>> /home/siddhesh/Downloads/spark-excel_2.12-0.14.0.jar,/home/siddhesh/Downloads/xmlbeans-5.0.3.jar,/home/siddhesh/Downloads/commons-collections4-4.4.jar,/home/siddhesh/Downloads/poi-5.2.0.jar,/home/siddhesh/Downloads/poi-ooxml-5.2.0.jar,/home/siddhesh/Downloads/poi-ooxml-schemas-4.1.2.jar,/home/siddhesh/Downloads/slf4j-log4j12-1.7.28.jar,/home/siddhesh/Downloads/log4j-1.2-api-2.17.1.jar
>>>>
>>>> and below is the code to read the Excel file:
>>>>
>>>> df = spark.read.format("excel") \
>>>>      .option("dataAddress", "'Sheet1'!") \
>>>>      .option("header", "true") \
>>>>      .option("inferSchema", "true") \
>>>>      .load("/home/.../Documents/test_excel.xlsx")
>>>>
>>>> It is giving me the below error message:
>>>>
>>>>  java.lang.NoClassDefFoundError: org/apache/logging/log4j/LogManager
>>>>
>>>> I tried several jars to fix this error, but no luck. Also, what would be
>>>> the most efficient way to load it?
>>>>
>>>> Thanks,
>>>> Sid
>>>>
>>>
>
> --
> Bjørn Jørgensen
> Vestre Aspehaug 4, 6010 Ålesund
> Norge
>
> +47 480 94 297
>
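
A note on the NoClassDefFoundError quoted in the original message: the class
it names, org/apache/logging/log4j/LogManager, ships in the log4j-api jar.
The --jars list in that message includes log4j-1.2-api-2.17.1.jar (the
Log4j 1.x bridge) but no log4j-api jar. Whether adding one resolves the error
was not confirmed in this thread, so treat the following as a minimal sketch
only. The helper name has_log4j_api and the filename pattern are assumptions
made for illustration; the jar list is copied from the quoted command.

```python
import re
from pathlib import PurePosixPath

# Jar list copied verbatim from the `pyspark --jars` command quoted above.
JARS = ",".join([
    "/home/siddhesh/Downloads/spark-excel_2.12-0.14.0.jar",
    "/home/siddhesh/Downloads/xmlbeans-5.0.3.jar",
    "/home/siddhesh/Downloads/commons-collections4-4.4.jar",
    "/home/siddhesh/Downloads/poi-5.2.0.jar",
    "/home/siddhesh/Downloads/poi-ooxml-5.2.0.jar",
    "/home/siddhesh/Downloads/poi-ooxml-schemas-4.1.2.jar",
    "/home/siddhesh/Downloads/slf4j-log4j12-1.7.28.jar",
    "/home/siddhesh/Downloads/log4j-1.2-api-2.17.1.jar",
])

# Assumed naming pattern: matches log4j-api-<version>.jar,
# but not the bridge jar log4j-1.2-api-<version>.jar.
LOG4J_API = re.compile(r"^log4j-api-\d[\w.\-]*\.jar$")


def has_log4j_api(jars_arg: str) -> bool:
    """Return True if a comma-separated --jars value lists a log4j-api jar."""
    names = [PurePosixPath(p).name for p in jars_arg.split(",") if p]
    return any(LOG4J_API.match(name) for name in names)


if __name__ == "__main__":
    print(has_log4j_api(JARS))
```

This only inspects file names in the --jars argument; it does not look inside
the jars or at what is already on Spark's own classpath.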
