Very helpful!

On Wed, May 8, 2024 at 9:07 AM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> *Potential reasons*
>
>    - Data Serialization: Spark needs to serialize the DataFrame into an
>    in-memory format suitable for storage. This process can be time-consuming,
>    especially for large datasets like 3.2 GB with complex schemas.
>    - Shuffle Operations: If your transformations involve shuffle
>    operations, Spark might need to shuffle data across the cluster to ensure
>    efficient storage. Shuffling can be slow, especially on large datasets or
>    with limited network bandwidth or few nodes. Check the Spark UI Stages and
>    Executors tabs for info on shuffle reads and writes.
>    - Memory Allocation: Spark allocates memory for the cached DataFrame.
>    Depending on the cluster configuration and available memory, this
>    allocation can take some time.
>
> HTH
>
> Mich Talebzadeh,
> Technologist | Architect | Data Engineer | Generative AI | FinCrime
> London
> United Kingdom
>
> view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
> https://en.everybodywiki.com/Mich_Talebzadeh
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed. It is essential to note
> that, as with any advice: "one test result is worth one-thousand
> expert opinions" (Wernher von Braun
> <https://en.wikipedia.org/wiki/Wernher_von_Braun>).
>
> On Wed, 8 May 2024 at 13:41, Prem Sahoo <prem.re...@gmail.com> wrote:
>
>> Could anyone help me here?
>> Sent from my iPhone
>>
>> > On May 7, 2024, at 4:30 PM, Prem Sahoo <prem.re...@gmail.com> wrote:
>> >
>> > Hello Folks,
>> > In Spark I have read a file, done some transformations, and am
>> > finally writing to HDFS.
>> >
>> > Now I am interested in writing the same dataframe to MapRFS, but for
>> > this Spark will execute the full DAG again (recompute all the
>> > previous steps: all the reads + transformations).
>> >
>> > I don't want this recomputed, so I decided to cache() the dataframe
>> > so that the 2nd/nth write won't recompute all the steps.
>> >
>> > But here is a catch: the cache() takes more time to persist the data
>> > in memory.
>> >
>> > I have a question: when the dataframe is in memory, then just to save
>> > it to another space in memory, why does it take more time (3.2 GB of
>> > data, 6 minutes)?
>> >
>> > May I know what operations in cache() are taking such a long time?
>> >
>> > I would appreciate it if someone would share the information.