Very helpful!

On Wed, May 8, 2024 at 9:07 AM Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> *Potential reasons*
>
>
>    - Data Serialization: Spark needs to serialize the DataFrame into an
>    in-memory format suitable for storage. This process can be time-consuming,
>    especially for large datasets like 3.2 GB with complex schemas.
>    - Shuffle Operations: If your transformations involve shuffle
>    operations, Spark might need to shuffle data across the cluster to
>    ensure efficient storage. Shuffling can be slow, especially with large
>    datasets, limited network bandwidth, or few nodes. Check the Spark UI
>    Stages and Executors tabs for shuffle read and write metrics.
>    - Memory Allocation: Spark allocates memory for the cached DataFrame.
>    Depending on the cluster configuration and available memory, this
>    allocation can take some time (see the sketch below for one way to
>    measure it).
>
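> As a quick experiment, you can persist in serialized form and time the
> materialisation yourself. A minimal sketch, assuming your DataFrame is
> called df and you are running in spark-shell (the output path is a
> placeholder):
>
>   import org.apache.spark.storage.StorageLevel
>
>   // Serialized storage trades some CPU for a smaller memory footprint;
>   // partitions that do not fit in memory spill to local disk.
>   val cached = df.persist(StorageLevel.MEMORY_AND_DISK_SER)
>
>   // persist() is lazy -- the first action pays the caching cost.
>   // spark.time prints how long the materialisation takes.
>   spark.time(cached.count())
>
>   // Later actions read from the cache instead of re-running the DAG.
>   cached.write.mode("overwrite").parquet("hdfs:///tmp/out")  // placeholder
>
> While the count runs, the Storage tab shows the fraction of the DataFrame
> that is cached, and the Stages tab shows shuffle read/write volumes.
>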
> HTH
>
> Mich Talebzadeh,
> Technologist | Architect | Data Engineer | Generative AI | FinCrime
> London
> United Kingdom
>
> LinkedIn profile:
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
> https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but cannot be guaranteed. It is essential to note that, as with
> any advice, "one test result is worth one-thousand expert opinions"
> (Wernher von Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>).
>
>
> On Wed, 8 May 2024 at 13:41, Prem Sahoo <prem.re...@gmail.com> wrote:
>
>> Could anyone help me here?
>> Sent from my iPhone
>>
>> > On May 7, 2024, at 4:30 PM, Prem Sahoo <prem.re...@gmail.com> wrote:
>> >
>> >
>> > Hello Folks,
>> > In Spark I read a file, apply some transformations, and finally write
>> > the result to HDFS.
>> >
>> > Now I want to write the same DataFrame to MapRFS as well, but for this
>> > Spark will execute the full DAG again (recomputing the read and all the
>> > transformations).
>> >
>> > I don't want this recomputation, so I decided to cache() the DataFrame
>> > so that the second (or nth) write won't recompute all the steps.
>> >
>> > But here is the catch: cache() itself takes a long time to persist the
>> > data in memory.
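>> >
>> > A minimal sketch of the pattern I mean (the paths and the helper name
>> > are placeholders, not my real ones):
>> >
>> >   val df = spark.read.parquet("hdfs:///data/in")  // placeholder input
>> >     .transform(myTransformations)                 // hypothetical helper
>> >
>> >   df.cache()                              // lazy: nothing is stored yet
>> >   df.write.parquet("hdfs:///data/out")    // triggers compute + caching
>> >   df.write.parquet("maprfs:///data/out")  // should be served from cache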
>> >
>> > My question: when the DataFrame is already in memory, why does saving
>> > it to another place in memory take so much time (about 6 minutes for
>> > 3.2 GB of data)?
>> >
>> > May I know which operations in cache() are taking such a long time?
>> >
>> > I would appreciate it if someone could share this information.
>>
>>
