Yes, p_df = DF.toPandas() is THE pandas, the one you know. Change p_df = DF.toPandas() to p_df = DF.pandas_api() (Spark 3.3+), or on older versions p_df = DF.to_pandas_on_spark() or the deprecated p_df = DF.to_koalas().
https://spark.apache.org/docs/latest/api/python/migration_guide/koalas_to_pyspark.html

Then you will have your PySpark DF as a pandas API on Spark DF.

On Tue, 20 Jun 2023 at 22:16, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> OK thanks
>
> So the issue seems to be creating a pandas DF from a Spark DF (I do it for
> plotting with something like
>
> import matplotlib.pyplot as plt
> p_df = DF.toPandas()
> p_df.plot(....)
>
> I guess that stays in the driver.
>
> Mich Talebzadeh,
> Lead Solutions Architect/Engineering Lead
> Palantir Technologies Limited
> London
> United Kingdom
>
> view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
> https://en.everybodywiki.com/Mich_Talebzadeh
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
> On Tue, 20 Jun 2023 at 20:46, Sean Owen <sro...@gmail.com> wrote:
>
>> No, a pandas on Spark DF is distributed.
>>
>> On Tue, Jun 20, 2023, 1:45 PM Mich Talebzadeh <mich.talebza...@gmail.com>
>> wrote:
>>
>>> Thanks, but if you create a Spark DF from a pandas DF, that Spark DF is
>>> not distributed and remains on the driver. I recall a while back we had
>>> this conversation. I don't think anything has changed.
>>>
>>> Happy to be corrected
>>>
>>> On Tue, 20 Jun 2023 at 20:09, Bjørn Jørgensen <bjornjorgen...@gmail.com>
>>> wrote:
>>>
>>>> Pandas API on Spark is an API so that users can use Spark as they use
>>>> pandas. This was known as Koalas.
>>>>
>>>> Is this limitation still valid for Pandas?
>>>> For pandas, yes. But what I showed was pandas API on Spark, so it's
>>>> Spark.
>>>>
>>>> Additionally, when we convert from a pandas DF to a Spark DF, what
>>>> process is involved under the bonnet?
>>>> I guess PyArrow, and dropping the index column.
>>>>
>>>> Have a look at
>>>> https://github.com/apache/spark/tree/master/python/pyspark/pandas
>>>>
>>>> On Tue, 20 Jun 2023 at 19:05, Mich Talebzadeh
>>>> <mich.talebza...@gmail.com> wrote:
>>>>
>>>>> Whenever someone mentions pandas I automatically think of it as an
>>>>> Excel sheet for Python.
>>>>>
>>>>> OK, my point below needs some qualification.
>>>>>
>>>>> Why Spark here? Generally, parallel architecture comes into play when
>>>>> the data size is significantly large and cannot be handled on a single
>>>>> machine; hence, the use of Spark becomes meaningful. In cases where the
>>>>> (generated) data size is going to be very large (which is often the
>>>>> norm rather than the exception these days), the data cannot be
>>>>> processed and stored in pandas data frames, as these data frames store
>>>>> data in RAM. The whole dataset then cannot be collected from storage
>>>>> like HDFS or cloud storage, because it will take significant time and
>>>>> space and probably won't fit in a single machine's RAM (in this case
>>>>> the driver memory).
>>>>>
>>>>> Is this limitation still valid for Pandas?
>>>>> Additionally, when we convert from a pandas DF to a Spark DF, what
>>>>> process is involved under the bonnet?
>>>>>
>>>>> Thanks
>>>>>
>>>>> On Tue, 20 Jun 2023 at 13:07, Bjørn Jørgensen
>>>>> <bjornjorgen...@gmail.com> wrote:
>>>>>
>>>>>> This is pandas API on Spark
>>>>>>
>>>>>> from pyspark import pandas as ps
>>>>>> df = ps.read_excel("testexcel.xlsx")
>>>>>> [screenshot: the resulting pandas-on-Spark DataFrame]
>>>>>> this will convert it to PySpark
>>>>>> [screenshot: the converted PySpark DataFrame]
>>>>>>
>>>>>> On Tue, 20 Jun 2023 at 13:42, John Paul Jayme
>>>>>> <john.ja...@tdcx.com.invalid> wrote:
>>>>>>
>>>>>>> Good day,
>>>>>>>
>>>>>>> I have a task to read Excel files in Databricks but I cannot seem to
>>>>>>> proceed. I am referencing the API documents - read_excel
>>>>>>> <https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.read_excel.html>
>>>>>>> - but there is an error: 'SparkSession' object has no attribute
>>>>>>> 'read_excel'. Can you advise?
>>>>>>>
>>>>>>> *JOHN PAUL JAYME*
>>>>>>> Data Engineer
>>>>>>
>>>>>> --
>>>>>> Bjørn Jørgensen
>>>>>> Vestre Aspehaug 4, 6010 Ålesund
>>>>>> Norge
>>>>>>
>>>>>> +47 480 94 297