No, a pandas on Spark DF is distributed.

On Tue, Jun 20, 2023, 1:45 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> Thanks, but if you create a Spark DF from a Pandas DF, that Spark DF is not
> distributed and remains on the driver. I recall a while back we had this
> conversation. I don't think anything has changed.
>
> Happy to be corrected
>
> Mich Talebzadeh,
> Lead Solutions Architect/Engineering Lead
> Palantir Technologies Limited
> London
> United Kingdom
>
> view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
> https://en.everybodywiki.com/Mich_Talebzadeh
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
> On Tue, 20 Jun 2023 at 20:09, Bjørn Jørgensen <bjornjorgen...@gmail.com>
> wrote:
>
>> Pandas API on Spark is an API so that users can use Spark as they use
>> pandas. This was formerly known as Koalas.
>>
>> Is this limitation still valid for Pandas?
>> For pandas, yes. But what I showed was the pandas API on Spark, so it is
>> Spark.
>>
>> Additionally, when we convert from a Pandas DF to a Spark DF, what
>> process is involved under the bonnet?
>> I guess PyArrow, and dropping the index column.
>>
>> Have a look at
>> https://github.com/apache/spark/tree/master/python/pyspark/pandas
>>
>> On Tue, 20 Jun 2023 at 19:05, Mich Talebzadeh
>> <mich.talebza...@gmail.com> wrote:
>>
>>> Whenever someone mentions Pandas I automatically think of it as an
>>> Excel sheet for Python.
>>>
>>> OK, my point below needs some qualification.
>>>
>>> Why Spark here? Generally, a parallel architecture comes into play when
>>> the data size is significantly large and cannot be handled on a single
>>> machine; hence the use of Spark becomes meaningful.
>>> In cases where the (generated) data size is going to be very large
>>> (which is often the norm rather than the exception these days), the
>>> data cannot be processed and stored in Pandas data frames, as these
>>> data frames store data in RAM. Then the whole dataset from storage like
>>> HDFS or cloud storage cannot be collected, because it would take
>>> significant time and space and probably won't fit in a single machine's
>>> RAM (in this case, the driver memory).
>>>
>>> Is this limitation still valid for Pandas? Additionally, when we
>>> convert from a Pandas DF to a Spark DF, what process is involved under
>>> the bonnet?
>>>
>>> Thanks
>>>
>>> Mich Talebzadeh
>>>
>>> On Tue, 20 Jun 2023 at 13:07, Bjørn Jørgensen
>>> <bjornjorgen...@gmail.com> wrote:
>>>
>>>> This is pandas API on Spark:
>>>>
>>>> from pyspark import pandas as ps
>>>> df = ps.read_excel("testexcel.xlsx")
>>>>
>>>> [screenshot: the resulting pandas-on-Spark DataFrame]
>>>>
>>>> this will convert it to pyspark
>>>>
>>>> [screenshot: the conversion to a PySpark DataFrame]
>>>>
>>>> On Tue, 20 Jun 2023 at 13:42, John Paul Jayme
>>>> <john.ja...@tdcx.com.invalid> wrote:
>>>>
>>>>> Good day,
>>>>>
>>>>> I have a task to read excel files in Databricks but I cannot seem to
>>>>> proceed.
>>>>> I am referencing the API documents - read_excel
>>>>> <https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.read_excel.html>,
>>>>> but there is an error: 'SparkSession' object has no attribute
>>>>> 'read_excel'. Can you advise?
>>>>>
>>>>> JOHN PAUL JAYME
>>>>> Data Engineer
>>>>>
>>>>> m. +639055716384 w. www.tdcx.com
>>>>
>>>> --
>>>> Bjørn Jørgensen
>>>> Vestre Aspehaug 4, 6010 Ålesund
>>>> Norge
>>>>
>>>> +47 480 94 297