No, a pandas on Spark DF is distributed.

On Tue, Jun 20, 2023, 1:45 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> Thanks, but if you create a Spark DF from a Pandas DF, that Spark DF is not
> distributed and remains on the driver. I recall a while back we had this
> conversation. I don't think anything has changed.
>
> Happy to be corrected
>
> Mich Talebzadeh,
> Lead Solutions Architect/Engineering Lead
> Palantir Technologies Limited
> London
> United Kingdom
>
> view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
> https://en.everybodywiki.com/Mich_Talebzadeh
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
> On Tue, 20 Jun 2023 at 20:09, Bjørn Jørgensen <bjornjorgen...@gmail.com>
> wrote:
>
>> Pandas API on Spark is an API so that users can use Spark as they use
>> pandas. This was formerly known as Koalas.
>>
>> Is this limitation still valid for Pandas?
>> For pandas, yes. But what I showed was the pandas API on Spark, so it is
>> Spark.
>>
>> Additionally, when we convert from a Pandas DF to a Spark DF, what
>> process is involved under the bonnet?
>> I guess PyArrow, and dropping the index column.
>>
>> Have a look at
>> https://github.com/apache/spark/tree/master/python/pyspark/pandas
>>
>> On Tue, 20 Jun 2023 at 19:05, Mich Talebzadeh
>> <mich.talebza...@gmail.com> wrote:
>>
>>> Whenever someone mentions Pandas I automatically think of it as an
>>> Excel sheet for Python.
>>>
>>> OK, my point below needs some qualification.
>>>
>>> Why Spark here? Generally, a parallel architecture comes into play when
>>> the data size is significantly large and cannot be handled on a single
>>> machine; hence the use of Spark becomes meaningful.
>>> In cases where the (generated) data size is going to be very large
>>> (which is often the norm rather than the exception these days), the
>>> data cannot be processed and stored in Pandas data frames, as these
>>> data frames store data in RAM. Then the whole dataset from storage like
>>> HDFS or cloud storage cannot be collected, because it would take
>>> significant time and space and probably won't fit in a single machine's
>>> RAM (in this case, the driver memory).
>>>
>>> Is this limitation still valid for Pandas? Additionally, when we
>>> convert from a Pandas DF to a Spark DF, what process is involved under
>>> the bonnet?
>>>
>>> Thanks
>>>
>>> Mich Talebzadeh
>>>
>>> On Tue, 20 Jun 2023 at 13:07, Bjørn Jørgensen
>>> <bjornjorgen...@gmail.com> wrote:
>>>
>>>> This is pandas API on Spark:
>>>>
>>>> from pyspark import pandas as ps
>>>> df = ps.read_excel("testexcel.xlsx")
>>>>
>>>> [screenshot: the resulting pandas-on-Spark DataFrame]
>>>>
>>>> this will convert it to pyspark
>>>>
>>>> [screenshot: the conversion to a PySpark DataFrame]
>>>>
>>>> On Tue, 20 Jun 2023 at 13:42, John Paul Jayme
>>>> <john.ja...@tdcx.com.invalid> wrote:
>>>>
>>>>> Good day,
>>>>>
>>>>> I have a task to read excel files in Databricks but I cannot seem to
>>>>> proceed.
>>>>> I am referencing the API documents - read_excel
>>>>> <https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.read_excel.html>,
>>>>> but there is an error: 'SparkSession' object has no attribute
>>>>> 'read_excel'. Can you advise?
>>>>>
>>>>> JOHN PAUL JAYME
>>>>> Data Engineer
>>>>>
>>>>> m. +639055716384 w. www.tdcx.com
>>>>
>>>> --
>>>> Bjørn Jørgensen
>>>> Vestre Aspehaug 4, 6010 Ålesund
>>>> Norge
>>>>
>>>> +47 480 94 297