from pyspark import pandas as ps

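# In IPython, `ps.read_excel?` prints the docstring:
# "Support both `xls` and `xlsx` file extensions from a local filesystem or
# URL"

# read_excel returns a pandas-on-Spark DataFrame, not a plain pandas one
psdf = ps.read_excel("file")

# Convert it to a regular Spark DataFrame
df = psdf.to_spark()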
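
A slightly fuller sketch of the same idea, in case it helps. The function name, its parameters and the optional repartition step are my own placeholders, not anything from your setup. pyspark.pandas ships with Spark 3.2 and later, and reading `xlsx` goes through pandas, so I am assuming openpyxl is installed wherever this runs. As far as I understand, a single workbook is parsed by pandas in one process rather than being split across executors, so for a huge file it may be worth repartitioning after the conversion.

```python
def load_excel_as_spark_df(path, sheet_name=0, num_partitions=None):
    """Read one Excel sheet with pandas-on-Spark and return a Spark DataFrame.

    Hypothetical helper for illustration only. `sheet_name` must name a
    single sheet (index or string); a list or None would make read_excel
    return a dict of DataFrames instead.
    """
    # Imported lazily so that defining this helper does not require Spark.
    from pyspark import pandas as ps

    # pandas-on-Spark DataFrame (assumes openpyxl is available for .xlsx).
    psdf = ps.read_excel(path, sheet_name=sheet_name)

    # Convert to a regular Spark DataFrame.
    sdf = psdf.to_spark()

    # Optional: spread the rows over more partitions before heavy processing.
    if num_partitions is not None:
        sdf = sdf.repartition(num_partitions)
    return sdf
```

You would call it as `df = load_excel_as_spark_df("file", num_partitions=8)`, where 8 is only an example value.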

On Wed, 23 Feb 2022 at 14:57, Sid <flinkbyhe...@gmail.com> wrote:

> Hi Gourav,
>
> Thanks for your time.
>
> I am worried about the distribution of data in case of a huge dataset
> file. Is Koalas still a better option to go ahead with? If yes, how can I
> use it with Glue ETL jobs? Do I have to pass some kind of external jars for
> it?
>
> Thanks,
> Sid
>
> On Wed, Feb 23, 2022 at 7:22 PM Gourav Sengupta <gourav.sengu...@gmail.com>
> wrote:
>
>> Hi,
>>
>> This looks like a very specific, narrowly scoped problem.
>>
>> Do you think you could load the data into a pandas DataFrame and then
>> load it back into Spark using a pandas UDF?
>>
>> Koalas is now natively integrated with Spark; see whether you can use
>> those features.
>>
>>
>> Regards,
>> Gourav
>>
>> On Wed, Feb 23, 2022 at 1:31 PM Sid <flinkbyhe...@gmail.com> wrote:
>>
>>> I have an excel file which unfortunately cannot be converted to CSV
>>> format and I am trying to load it using pyspark shell.
>>>
>>> I tried invoking the below pyspark session with the jars provided.
>>>
>>> pyspark --jars
>>> /home/siddhesh/Downloads/spark-excel_2.12-0.14.0.jar,/home/siddhesh/Downloads/xmlbeans-5.0.3.jar,/home/siddhesh/Downloads/commons-collections4-4.4.jar,/home/siddhesh/Downloads/poi-5.2.0.jar,/home/siddhesh/Downloads/poi-ooxml-5.2.0.jar,/home/siddhesh/Downloads/poi-ooxml-schemas-4.1.2.jar,/home/siddhesh/Downloads/slf4j-log4j12-1.7.28.jar,/home/siddhesh/Downloads/log4j-1.2-api-2.17.1.jar
>>>
>>> and below is the code to read the excel file:
>>>
>>> df = spark.read.format("excel") \
>>>      .option("dataAddress", "'Sheet1'!") \
>>>      .option("header", "true") \
>>>      .option("inferSchema", "true") \
>>>      .load("/home/.../Documents/test_excel.xlsx")
>>>
>>> It is giving me the below error message:
>>>
>>>  java.lang.NoClassDefFoundError: org/apache/logging/log4j/LogManager
>>>
>>> I tried several jars to fix this error, but with no luck. Also, what
>>> would be an efficient way to load the file?
>>>
>>> Thanks,
>>> Sid
>>>
>>
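
On the NoClassDefFoundError in the original question: `org.apache.logging.log4j.LogManager` lives in the Log4j 2 `log4j-api` jar, and the `--jars` list above has the `log4j-1.2-api` bridge but no `log4j-api`, so that is my best guess at the cause. Rather than collecting jars by hand, it may be easier to let Spark resolve spark-excel and its dependencies from Maven. The sketch below assumes the coordinate `com.crealytics:spark-excel_2.12:0.14.0` (matching the jar name you used) and network access to Maven Central; the helper names and the `A1` start cell are placeholders of mine. Note that `spark.jars.packages` only takes effect when the session is first created; in the pyspark shell the equivalent is `pyspark --packages` with the same coordinate.

```python
# Assumed Maven coordinate, derived from the jar name spark-excel_2.12-0.14.0.jar.
SPARK_EXCEL_COORDINATE = "com.crealytics:spark-excel_2.12:0.14.0"


def excel_reader_options(sheet="Sheet1", first_cell="A1",
                         header=True, infer_schema=True):
    """Build the reader options as strings (hypothetical helper)."""
    return {
        # Address format: quoted sheet name, '!', then the top-left cell.
        "dataAddress": f"'{sheet}'!{first_cell}",
        "header": str(header).lower(),
        "inferSchema": str(infer_schema).lower(),
    }


def read_excel_with_spark_excel(path, **option_kwargs):
    """Read an Excel file through spark-excel (hypothetical helper)."""
    # Imported lazily so that defining this helper does not require Spark.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        # Must be set before the session (and its JVM) is created.
        .config("spark.jars.packages", SPARK_EXCEL_COORDINATE)
        .getOrCreate()
    )
    options = excel_reader_options(**option_kwargs)
    return spark.read.format("excel").options(**options).load(path)
```

The option values are the same ones used in the original snippet, just built in one place so the sheet name is easy to change.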

-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297
