*Potential reasons*

   - Data Serialization: Spark needs to serialize the DataFrame into an
   in-memory format suitable for storage. This can be time-consuming,
   especially for a large dataset like your 3.2 GB with a complex schema.
   - Shuffle Operations: If your transformations involve shuffles, Spark
   may need to move data across the cluster before it can be cached
   efficiently. Shuffling can be slow, especially on large datasets or with
   limited network bandwidth or few nodes. Check the Stages and Executors
   tabs in the Spark UI for shuffle read/write figures.
   - Memory Allocation: Spark allocates memory for the cached DataFrame.
   Depending on the cluster configuration and available memory, this
   allocation can take some time.

HTH

Mich Talebzadeh,
Technologist | Architect | Data Engineer  | Generative AI | FinCrime
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed. It is essential to note
that, as with any advice, "one test result is worth one-thousand expert
opinions" (Wernher von Braun
<https://en.wikipedia.org/wiki/Wernher_von_Braun>).


On Wed, 8 May 2024 at 13:41, Prem Sahoo <prem.re...@gmail.com> wrote:

> Could any one help me here ?
> Sent from my iPhone
>
> > On May 7, 2024, at 4:30 PM, Prem Sahoo <prem.re...@gmail.com> wrote:
> >
> > 
> > Hello Folks,
> > in Spark I have read a file and done some transformation and finally
> writing to hdfs.
> >
> > Now I am interested in writing the same dataframe to MapRFS, but for
> this Spark will execute the full DAG again (recompute all the previous
> steps: the read plus all transformations).
> >
> > I don't want this recomputation, so I decided to cache() the dataframe
> so that the 2nd/nth write won't recompute all the steps.
> >
> > But here is a catch: the cache() takes more time to persist the data in
> memory.
> >
> > I have a question: when the dataframe is already in memory, why does it
> take more time (6 minutes for 3.2 GB of data) just to save it to another
> space in memory?
> >
> > May I know what operations in cache() are taking such a long time ?
> >
> > I would appreciate it if someone would share the information .
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>
