Yes indeed, very good points by the Artemis User. Just to add, if I may: why choose Spark? Generally, a parallel architecture comes into play when the data is too large to be handled on a single machine; that is when Spark becomes meaningful. Where the generated data volume is very large (the norm rather than the exception these days), the data cannot be processed and stored in Pandas DataFrames, because those hold everything in the RAM of a single machine. Nor can you simply collect the whole dataset from a storage layer such as HDFS or cloud storage onto one node: that would take significant time and space, and it will probably not fit in a single machine's RAM anyway.
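As a concrete illustration, here is a minimal PySpark sketch of that kind of job (the paths and column names are made up for the example; substitute your own HDFS or S3 locations). The point is that the read and the transformations are lazy and run in parallel across the executors, so the full dataset never has to fit in one machine's RAM:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Build (or reuse) a session. On a real cluster the master and resources
# come from spark-submit or the cluster manager, not from the code.
spark = SparkSession.builder.appName("large-dataset-etl").getOrCreate()

# Hypothetical input path. Spark reads it lazily and in parallel across
# executors; the full dataset is never collected into the driver's RAM.
df = spark.read.parquet("hdfs:///data/events/2021/")

# Transformations are lazy too; nothing executes until an action runs.
daily_counts = (
    df.filter(F.col("status") == "OK")
      .groupBy("event_date")
      .agg(F.count("*").alias("n_events"))
)

# Write the (much smaller) aggregate back to distributed storage.
daily_counts.write.mode("overwrite").parquet("hdfs:///data/agg/daily_counts")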
HTH

View my LinkedIn profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

*Disclaimer:* Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

On Thu, 29 Jul 2021 at 15:12, Artemis User <arte...@dtechspace.com> wrote:

> PySpark still uses the Spark DataFrame underneath (it wraps Java code). Use
> PySpark when you have to deal with big-data ETL and analytics, so you can
> leverage the distributed architecture in Spark. If your job is simple, the
> dataset is relatively small, and it doesn't require distributed processing,
> use Pandas.
>
> -- ND
>
> On 7/29/21 9:02 AM, ashok34...@yahoo.com.INVALID wrote:
>
> Hello team,
>
> Someone asked me about well-developed Python code with Pandas DataFrames,
> comparing that to PySpark.
>
> Under what situations would one choose PySpark instead of Python and Pandas?
>
> Appreciated,
>
> AK
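To make ND's rule of thumb concrete, a common pattern is to combine the two: do the distributed heavy lifting in PySpark, then hand only the small aggregated result to Pandas with toPandas(). A sketch follows (the bucket, table and column names are hypothetical):

import pandas as pd

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("spark-then-pandas").getOrCreate()

# Hypothetical large source table; Spark does the distributed ETL.
big = spark.read.parquet("s3a://my-bucket/transactions/")

summary = (
    big.groupBy("country")
       .agg(F.sum("amount").alias("total_amount"))
)

# Only the small aggregate is pulled to the driver as a Pandas DataFrame.
# Safe here because the result has at most a few hundred rows; calling
# toPandas() on the full source table would defeat the purpose.
pdf = summary.toPandas()
print(pdf.sort_values("total_amount", ascending=False).head(10))

The design point is that toPandas() collects everything to the driver, so it should only ever be called on data that has already been reduced to Pandas scale.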