Pandas API on Spark is an API that lets users work with Spark the same way
they use pandas. It was formerly known as Koalas.
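
As a rough sketch of what that looks like (the csv file and column name are
made-up examples here, and this assumes Spark 3.2+ with an active
SparkSession):

import pyspark.pandas as ps

# pandas-style call, but executed by Spark under the hood
psdf = ps.read_csv("people.csv")           # hypothetical file
psdf["age_plus_one"] = psdf["age"] + 1     # hypothetical column, pandas-style arithmetic

sdf = psdf.to_spark()       # hand it over to a plain PySpark DataFrame
psdf2 = sdf.pandas_api()    # and back to the pandas API on Spark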

Is this limitation still valid for Pandas?
For pandas itself, yes. But what I showed was the pandas API on Spark, so it
is Spark.

Additionally, when we convert from a pandas DF to a Spark DF, what process
is involved under the bonnet?
My guess is PyArrow, plus dropping the index column.
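
A minimal sketch of that conversion path (assuming an active SparkSession
called spark; the toy pandas DataFrame is just an illustration):

import pandas as pd

pdf = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})

# with Arrow enabled, createDataFrame ships the pandas data as pyarrow
# record batches instead of pickled Python rows
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

sdf = spark.createDataFrame(pdf)
sdf.printSchema()   # only id and name; the pandas index is not carried over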

Have a look at
https://github.com/apache/spark/tree/master/python/pyspark/pandas

On Tue, 20 Jun 2023 at 19:05, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Whenever someone mentions Pandas I automatically think of it as an Excel
> sheet for Python.
>
> OK my point below needs some qualification
>
> Why Spark here? Generally, a parallel architecture comes into play when the
> data is too large to be handled on a single machine; that is when the use
> of Spark becomes meaningful. When the (generated) data is going to be very
> large (which these days is the norm rather than the exception), it cannot
> be processed and stored in Pandas data frames, because those data frames
> hold all their data in RAM. The whole dataset then cannot be collected from
> a storage layer like HDFS or cloud storage, because that would take
> significant time and space and probably would not fit in a single machine's
> RAM (in this case the driver memory).
>
> Is this limitation still valid for Pandas? Additionally, when we convert
> from a pandas DF to a Spark DF, what process is involved under the bonnet?
>
> Thanks
>
> Mich Talebzadeh,
> Lead Solutions Architect/Engineering Lead
> Palantir Technologies Limited
> London
> United Kingdom
>
>
>    view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Tue, 20 Jun 2023 at 13:07, Bjørn Jørgensen <bjornjorgen...@gmail.com>
> wrote:
>
>> This is the pandas API on Spark:
>>
>> from pyspark import pandas as ps
>> df = ps.read_excel("testexcel.xlsx")
>> [image: image.png]
>> this will convert it to a PySpark DataFrame:
>> [image: image.png]
>>
>> On Tue, 20 Jun 2023 at 13:42, John Paul Jayme
>> <john.ja...@tdcx.com.invalid> wrote:
>>
>>> Good day,
>>>
>>>
>>>
>>> I have a task to read Excel files in Databricks but I cannot seem to
>>> proceed. I am referencing the API documents - read_excel
>>> <https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.read_excel.html>
>>> - but I get the error: SparkSession object has no attribute
>>> 'read_excel'. Can you advise?
>>>
>>>
>>>
>>> *JOHN PAUL JAYME*
>>> Data Engineer
>>>
>>> m. +639055716384  w. www.tdcx.com
>>>
>>>
>>>
>>> *Winner of over 350 Industry Awards*
>>>
>>> [image: Linkedin] <https://www.linkedin.com/company/tdcxgroup/> [image:
>>> Facebook] <https://www.facebook.com/tdcxgroup/> [image: Twitter]
>>> <https://twitter.com/tdcxgroup/> [image: Youtube]
>>> <https://www.youtube.com/c/TDCXgroup> [image: Instagram]
>>> <https://www.instagram.com/tdcxgroup/>
>>>
>>>
>>>
>>> This is a confidential email that may be privileged or legally
>>> protected. You are not authorized to copy or disclose the contents of this
>>> email. If you are not the intended addressee, please inform the sender and
>>> delete this email.
>>>
>>>
>>>
>>>
>>>
>>
>>
>> --
>> Bjørn Jørgensen
>> Vestre Aspehaug 4, 6010 Ålesund
>> Norge
>>
>> +47 480 94 297
>>
>

-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297
