Yes indeed, very good points by the Artemis User. Just to add, if I may: why choose Spark? Generally, a parallel architecture comes into play when the data is too large to be handled on a single machine; that is when Spark becomes meaningful. Where the generated data volume is very large (the norm rather than the exception these days), the data cannot be processed and stored in Pandas DataFrames, because those hold everything in the RAM of a single machine. Nor can you simply collect the whole dataset from a storage layer such as HDFS or cloud storage onto one node: that would take significant time and space, and it will probably not fit in a single machine's RAM anyway.
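As a concrete illustration, here is a minimal PySpark sketch of that kind of job (the paths and column names are made up for the example; substitute your own HDFS or S3 locations). The point is that the read and the transformations are lazy and run in parallel across the executors, so the full dataset never has to fit in one machine's RAM:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Build (or reuse) a session. On a real cluster the master and resources
# come from spark-submit or the cluster manager, not from the code.
spark = SparkSession.builder.appName("large-dataset-etl").getOrCreate()

# Hypothetical input path. Spark reads it lazily and in parallel across
# executors; the full dataset is never collected into the driver's RAM.
df = spark.read.parquet("hdfs:///data/events/2021/")

# Transformations are lazy too; nothing executes until an action runs.
daily_counts = (
    df.filter(F.col("status") == "OK")
      .groupBy("event_date")
      .agg(F.count("*").alias("n_events"))
)

# Write the (much smaller) aggregate back to distributed storage.
daily_counts.write.mode("overwrite").parquet("hdfs:///data/agg/daily_counts")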
HTH

View my LinkedIn profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

*Disclaimer:* Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

On Thu, 29 Jul 2021 at 15:12, Artemis User <arte...@dtechspace.com> wrote:

> PySpark still uses the Spark DataFrame underneath (it wraps Java code). Use
> PySpark when you have to deal with big-data ETL and analytics, so you can
> leverage the distributed architecture in Spark. If your job is simple, the
> dataset is relatively small, and it doesn't require distributed processing,
> use Pandas.
>
> -- ND
>
> On 7/29/21 9:02 AM, ashok34...@yahoo.com.INVALID wrote:
>
> Hello team,
>
> Someone asked me about well-developed Python code with Pandas DataFrames,
> comparing that to PySpark.
>
> Under what situations would one choose PySpark instead of Python and Pandas?
>
> Appreciated,
>
> AK
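To make ND's rule of thumb concrete, a common pattern is to combine the two: do the distributed heavy lifting in PySpark, then hand only the small aggregated result to Pandas with toPandas(). A sketch follows (the bucket, table and column names are hypothetical):

import pandas as pd

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("spark-then-pandas").getOrCreate()

# Hypothetical large source table; Spark does the distributed ETL.
big = spark.read.parquet("s3a://my-bucket/transactions/")

summary = (
    big.groupBy("country")
       .agg(F.sum("amount").alias("total_amount"))
)

# Only the small aggregate is pulled to the driver as a Pandas DataFrame.
# Safe here because the result has at most a few hundred rows; calling
# toPandas() on the full source table would defeat the purpose.
pdf = summary.toPandas()
print(pdf.sort_values("total_amount", ascending=False).head(10))

The design point is that toPandas() collects everything to the driver, so it should only ever be called on data that has already been reduced to Pandas scale.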