Re: caching a dataframe in Spark takes a lot of time

2024-05-08 Thread Prem Sahoo
Very helpful! On Wed, May 8, 2024 at 9:07 AM Mich Talebzadeh wrote: > *Potential reasons* > - Data Serialization: Spark needs to serialize the DataFrame into an in-memory format suitable for storage. This process can be time-consuming, especially for large datasets like 3.2 GB

Re: [DISCUSS] Spark 4.0.0 release

2024-05-08 Thread Holden Karau
That looks cool, maybe let’s split off a thread on how to improve our release processes?

Re: [DISCUSS] Spark 4.0.0 release

2024-05-08 Thread Erik Krogen
On that note, GitHub recently released (public preview) a new feature called Artifact Attestations, which may be relevant/useful here: Introducing Artifact Attestations–now in public beta - The GitHub Blog

Re: caching a dataframe in Spark takes a lot of time

2024-05-08 Thread Mich Talebzadeh
*Potential reasons*
- Data Serialization: Spark needs to serialize the DataFrame into an in-memory format suitable for storage. This process can be time-consuming, especially for large datasets like 3.2 GB with complex schemas.
- Shuffle Operations: If your transformations involve
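
As a minimal sketch of where that caching cost shows up (the Parquet path and column names here are hypothetical; the persist API and its lazy materialization are standard Spark):

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("CacheTiming").getOrCreate()
import spark.implicits._

// Hypothetical input; replace the path and columns with your own
val df = spark.read.parquet("hdfs:///data/input")
  .filter($"status" === "active")

// cache() defaults to MEMORY_AND_DISK for DataFrames; an explicitly
// serialized level trades extra CPU for a smaller in-memory footprint
df.persist(StorageLevel.MEMORY_AND_DISK_SER)

// Caching is lazy: the first action runs the full plan and serializes
// the result into the cache, which is where the time is actually spent
df.count()

Timing that first action against a second df.count() separates the one-off materialization cost from the steady-state read cost.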

Re: caching a dataframe in Spark takes a lot of time

2024-05-08 Thread Prem Sahoo
Could anyone help me here? > On May 7, 2024, at 4:30 PM, Prem Sahoo wrote: > Hello Folks, in Spark I have read a file, done some transformations, and am finally writing to HDFS. > Now I am interested in writing the same dataframe to MapRFS, but for this Spark
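
A hedged sketch of the dual-write pattern the question describes (paths and column names are hypothetical; maprfs:// is the MapR-FS URI scheme and resolves only when the MapR Hadoop client is configured on the classpath):

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("DualWrite").getOrCreate()

// Hypothetical source and transformation; replace with your own
val df = spark.read.parquet("hdfs:///data/input")
  .select("id", "value")

// Persist once so the second write reuses the cached partitions
// instead of recomputing the whole lineage from the source
df.persist(StorageLevel.MEMORY_AND_DISK)

df.write.mode("overwrite").parquet("hdfs:///data/output")
df.write.mode("overwrite").parquet("maprfs:///data/output")

df.unpersist()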

Re: [DISCUSS] Spark 4.0.0 release

2024-05-08 Thread Nimrod Ofek
I have no permissions, so I can't do it, but I'm happy to help (although I am more familiar with GitLab CI/CD than GitHub Actions). Is there a point of contact who can provide me the needed context and permissions? I'd also love to see why the costs are high and how we can reduce them... Thanks,