Yes, p_df = DF.toPandas() is THE pandas, the one you know. Change p_df = DF.toPandas() to p_df = DF.pandas_api() (Spark 3.3+), or on older versions p_df = DF.to_pandas_on_spark() or the deprecated p_df = DF.to_koalas().
https://spark.apache.org/docs/latest/api/python/migration_guide/koalas_to_pyspark.html

Then you will have your PySpark DF as a pandas API on Spark DF.

On Tue, 20 Jun 2023 at 22:16, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> OK thanks
>
> So the issue seems to be creating a pandas DF from a Spark DF (I do it for
> plotting with something like
>
> import matplotlib.pyplot as plt
> p_df = DF.toPandas()
> p_df.plot(....)
>
> I guess that stays in the driver.
>
> Mich Talebzadeh,
> Lead Solutions Architect/Engineering Lead
> Palantir Technologies Limited
> London
> United Kingdom
>
> view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
> https://en.everybodywiki.com/Mich_Talebzadeh
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
> On Tue, 20 Jun 2023 at 20:46, Sean Owen <sro...@gmail.com> wrote:
>
>> No, a pandas on Spark DF is distributed.
>>
>> On Tue, Jun 20, 2023, 1:45 PM Mich Talebzadeh <mich.talebza...@gmail.com>
>> wrote:
>>
>>> Thanks, but if you create a Spark DF from a pandas DF, that Spark DF is
>>> not distributed and remains on the driver. I recall a while back we had
>>> this conversation. I don't think anything has changed.
>>>
>>> Happy to be corrected
>>>
>>> On Tue, 20 Jun 2023 at 20:09, Bjørn Jørgensen <bjornjorgen...@gmail.com>
>>> wrote:
>>>
>>>> Pandas API on Spark is an API so that users can use Spark as they use
>>>> pandas. This was known as Koalas.
>>>>
>>>> Is this limitation still valid for Pandas?
>>>> For pandas, yes. But what I showed was pandas API on Spark, so it's
>>>> Spark.
>>>>
>>>> Additionally, when we convert from a pandas DF to a Spark DF, what
>>>> process is involved under the bonnet?
>>>> I guess PyArrow, and dropping the index column.
>>>>
>>>> Have a look at
>>>> https://github.com/apache/spark/tree/master/python/pyspark/pandas
>>>>
>>>> On Tue, 20 Jun 2023 at 19:05, Mich Talebzadeh
>>>> <mich.talebza...@gmail.com> wrote:
>>>>
>>>>> Whenever someone mentions pandas I automatically think of it as an
>>>>> Excel sheet for Python.
>>>>>
>>>>> OK, my point below needs some qualification.
>>>>>
>>>>> Why Spark here? Generally, parallel architecture comes into play when
>>>>> the data size is significantly large and cannot be handled on a single
>>>>> machine; hence, the use of Spark becomes meaningful. In cases where the
>>>>> (generated) data size is going to be very large (which is often the
>>>>> norm rather than the exception these days), the data cannot be
>>>>> processed and stored in pandas data frames, as these data frames store
>>>>> data in RAM. The whole dataset then cannot be collected from storage
>>>>> like HDFS or cloud storage, because it will take significant time and
>>>>> space and probably won't fit in a single machine's RAM (in this case
>>>>> the driver memory).
>>>>>
>>>>> Is this limitation still valid for Pandas?
>>>>> Additionally, when we convert from a pandas DF to a Spark DF, what
>>>>> process is involved under the bonnet?
>>>>>
>>>>> Thanks
>>>>>
>>>>> On Tue, 20 Jun 2023 at 13:07, Bjørn Jørgensen
>>>>> <bjornjorgen...@gmail.com> wrote:
>>>>>
>>>>>> This is pandas API on Spark
>>>>>>
>>>>>> from pyspark import pandas as ps
>>>>>> df = ps.read_excel("testexcel.xlsx")
>>>>>> [screenshot: the resulting pandas-on-Spark DataFrame]
>>>>>> this will convert it to PySpark
>>>>>> [screenshot: the converted PySpark DataFrame]
>>>>>>
>>>>>> On Tue, 20 Jun 2023 at 13:42, John Paul Jayme
>>>>>> <john.ja...@tdcx.com.invalid> wrote:
>>>>>>
>>>>>>> Good day,
>>>>>>>
>>>>>>> I have a task to read Excel files in Databricks but I cannot seem to
>>>>>>> proceed. I am referencing the API documents - read_excel
>>>>>>> <https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.read_excel.html>
>>>>>>> - but there is an error: 'SparkSession' object has no attribute
>>>>>>> 'read_excel'. Can you advise?
>>>>>>>
>>>>>>> *JOHN PAUL JAYME*
>>>>>>> Data Engineer
>>>>>>
>>>>>> --
>>>>>> Bjørn Jørgensen
>>>>>> Vestre Aspehaug 4, 6010 Ålesund
>>>>>> Norge
>>>>>>
>>>>>> +47 480 94 297