Yes indeed, very good points by Artemis User.
Just to add if I may: why choose Spark at all? Generally, a parallel
architecture comes into play when the data is too large to be handled on a
single machine; that is when using Spark becomes meaningful. In cases where
the data fits comfortably on one machine, Pandas is usually the simpler choice.
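To make the data-size point concrete, here is a minimal sketch of one rough heuristic (an assumption on my part, not a hard rule): measure the DataFrame's in-memory footprint and compare it against available RAM before reaching for a cluster.

```python
import pandas as pd

# Rough heuristic: if the dataset's in-memory footprint fits comfortably
# in RAM, Pandas on one machine is usually enough; if it is many times
# larger than RAM, Spark's distributed execution starts to pay off.
df = pd.DataFrame({"x": range(1_000_000)})

# deep=True also counts the actual bytes of object (string) columns.
mb = df.memory_usage(deep=True).sum() / 1e6
print(f"approx in-memory size: {mb:.1f} MB")
```

In practice the working set can be several times the raw size (joins, copies, intermediate results), so leave generous headroom when applying a rule of thumb like this.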
PySpark still uses Spark DataFrames underneath (it wraps the Java/Scala
code). Use PySpark when you have to deal with big-data ETL and analytics, so
you can leverage Spark's distributed architecture. If your job is simple, the
dataset is relatively small, and it doesn't require distributed processing,
use Pandas instead.
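To illustrate the trade-off, here is a minimal sketch of the same aggregation in both libraries. The Pandas version runs in-process; the PySpark equivalent (shown in comments, since it needs a running Spark session) looks almost identical but executes as a distributed job. The column names are made up for the example.

```python
import pandas as pd

# Small dataset: Pandas runs in-process on a single machine.
df = pd.DataFrame({"dept": ["a", "a", "b"], "salary": [10, 20, 30]})
result = df.groupby("dept", as_index=False)["salary"].mean()
print(result)

# The PySpark equivalent (requires a Spark session) is nearly the same
# API surface, but the work is planned and run across a cluster:
#
#   from pyspark.sql import SparkSession, functions as F
#   spark = SparkSession.builder.getOrCreate()
#   sdf = spark.createDataFrame(df)
#   sdf.groupBy("dept").agg(F.mean("salary").alias("salary")).show()
```

For a few thousand rows the Spark version is actually slower, because of JVM startup, serialization, and job-scheduling overhead; that overhead only amortizes once the data outgrows one machine.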
Hello team
Someone asked me about well-developed Python code using Pandas DataFrames,
and how that compares to PySpark.
Under what situations would one choose PySpark instead of plain Python and Pandas?
Appreciate any input,
AK