OK thanks for the info.

Regards

Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom


   View my LinkedIn profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 20 Jun 2023 at 21:27, Bjørn Jørgensen <bjornjorgen...@gmail.com>
wrote:

> Yes, p_df = DF.toPandas() is THE pandas, the one you know: a plain pandas DataFrame.
>
> Change p_df = DF.toPandas() to
> p_df = DF.pandas_api()
> or
> p_df = DF.to_pandas_on_spark()
> or
> p_df = DF.to_koalas()
>
>
>
> https://spark.apache.org/docs/latest/api/python/migration_guide/koalas_to_pyspark.html
>
> Then you will have your PySpark DF as a pandas-API-on-Spark DataFrame.
>
> tir. 20. juni 2023 kl. 22:16 skrev Mich Talebzadeh <
> mich.talebza...@gmail.com>:
>
>> OK thanks.
>>
>> So the issue seems to be creating a pandas DF from a Spark DF (I do it for
>> plotting), with something like:
>>
>> import matplotlib.pyplot as plt
>> p_df = DF.toPandas()
>> p_df.plot(...)
>>
>> I guess that stays in the driver.
>>
>>
>> On Tue, 20 Jun 2023 at 20:46, Sean Owen <sro...@gmail.com> wrote:
>>
>>> No, a pandas on Spark DF is distributed.
>>>
>>> On Tue, Jun 20, 2023, 1:45 PM Mich Talebzadeh <mich.talebza...@gmail.com>
>>> wrote:
>>>
>>>> Thanks, but if you create a Spark DF from a pandas DF, that Spark DF is not
>>>> distributed and remains on the driver. I recall we had this conversation a
>>>> while back. I don't think anything has changed.
>>>>
>>>> Happy to be corrected.
>>>>
>>>> On Tue, 20 Jun 2023 at 20:09, Bjørn Jørgensen <bjornjorgen...@gmail.com>
>>>> wrote:
>>>>
>>>>> Pandas API on Spark is an API that lets users use Spark as they use
>>>>> pandas. It was previously known as Koalas.
>>>>>
>>>>> "Is this limitation still valid for Pandas?"
>>>>> For plain pandas, yes. But what I showed was the pandas API on Spark, so
>>>>> it is Spark.
>>>>>
>>>>> "Additionally, when we convert from a pandas DF to a Spark DF, what
>>>>> process is involved under the bonnet?"
>>>>> I guess PyArrow, plus dropping the index column.
>>>>>
>>>>> Have a look at
>>>>> https://github.com/apache/spark/tree/master/python/pyspark/pandas
>>>>>
>>>>> tir. 20. juni 2023 kl. 19:05 skrev Mich Talebzadeh <
>>>>> mich.talebza...@gmail.com>:
>>>>>
>>>>>> Whenever someone mentions pandas I automatically think of it as an
>>>>>> Excel sheet for Python.
>>>>>>
>>>>>> OK, my point below needs some qualification.
>>>>>>
>>>>>> Why Spark here? Generally, a parallel architecture comes into play when
>>>>>> the data size is too large to be handled on a single machine; that is
>>>>>> when the use of Spark becomes meaningful. Where the (generated) data
>>>>>> size is going to be very large (often the norm rather than the exception
>>>>>> these days), the data cannot be processed and stored in pandas data
>>>>>> frames, because these data frames hold all their data in RAM. Nor can
>>>>>> the whole dataset be collected from storage such as HDFS or cloud
>>>>>> storage, because that would take significant time and space and probably
>>>>>> would not fit in a single machine's RAM (in this case, the driver's
>>>>>> memory).
>>>>>>
>>>>>> Is this limitation still valid for Pandas? Additionally when we
>>>>>> convert from Panda DF to Spark DF, what process is involved under the
>>>>>> bonnet?
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> On Tue, 20 Jun 2023 at 13:07, Bjørn Jørgensen <
>>>>>> bjornjorgen...@gmail.com> wrote:
>>>>>>
>>>>>>> This is the pandas API on Spark:
>>>>>>>
>>>>>>> from pyspark import pandas as ps
>>>>>>> df = ps.read_excel("testexcel.xlsx")
>>>>>>> [image: image.png]
>>>>>>> This will convert it to PySpark:
>>>>>>> [image: image.png]
>>>>>>>
>>>>>>> tir. 20. juni 2023 kl. 13:42 skrev John Paul Jayme
>>>>>>> <john.ja...@tdcx.com.invalid>:
>>>>>>>
>>>>>>>> Good day,
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> I have a task to read Excel files in Databricks but I cannot seem
>>>>>>>> to proceed. I am referencing the API documentation - read_excel
>>>>>>>> <https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.read_excel.html>
>>>>>>>> - but I get the error "'SparkSession' object has no attribute
>>>>>>>> 'read_excel'". Can you advise?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> *JOHN PAUL JAYME*
>>>>>>>> Data Engineer
>>>>>>>>
>>>>>>>> m. +639055716384  w. www.tdcx.com
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Bjørn Jørgensen
>>>>>>> Vestre Aspehaug 4, 6010 Ålesund
>>>>>>> Norge
>>>>>>>
>>>>>>> +47 480 94 297
>>>>>>>
>>>>>>
>>>>>
>>>>
>
