Re: [DISCUSS] Spark 4.0.0 release

2024-05-07 Thread Holden Karau
I think signing the artifacts produced from a secure CI sounds like a good
idea. I know we’ve been asked to reduce our GitHub action usage but perhaps
someone interested could volunteer to set that up.

Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


On Tue, May 7, 2024 at 9:43 PM Nimrod Ofek  wrote:

> Hi,
> Thanks for the reply.
>
> From my experience, a build on a build server would be much more
> predictable and less error-prone than building on some laptop, and of
> course much faster for producing builds, snapshots, early preview
> releases, release candidates, or final releases.
> It would enable us to have a preview version with the current changes (a
> snapshot version), either automatically every day or, if we need to save
> costs (although a build is really not expensive), with the click of a button.
>
> Regarding keys for signing: that's what vaults are for. All across the
> industry we use vaults (such as HashiCorp Vault). But if the build is
> automated and the only manual step is signing the release, for security
> reasons, that would be reasonable.
>
> Thanks,
> Nimrod
>
>
> On Wed, May 8, 2024 at 12:54 AM Holden Karau <holden.ka...@gmail.com> wrote:
>
>> Indeed. We could conceivably build the release in CI/CD but the final
>> verification / signing should be done locally to keep the keys safe (there
>> was some concern from earlier release processes).
>>
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>>
>> On Tue, May 7, 2024 at 10:55 AM Nimrod Ofek 
>> wrote:
>>
>>> Hi,
>>>
>>> Sorry for the novice question, Wenchen - the release is done manually
>>> from a laptop? Not using a CI CD process on a build server?
>>>
>>> Thanks,
>>> Nimrod
>>>
>>> On Tue, May 7, 2024 at 8:50 PM Wenchen Fan  wrote:
>>>
>>>> UPDATE:
>>>>
>>>> Unfortunately, it took me quite some time to set up my laptop and get
>>>> it ready for the release process (docker desktop doesn't work anymore, my
>>>> pgp key is lost, etc.). I'll start the RC process at my tomorrow. Thanks
>>>> for your patience!
>>>>
>>>> Wenchen
>>>>
>>>> On Fri, May 3, 2024 at 7:47 AM yangjie01 wrote:

> +1
>
>
>
> From: Jungtaek Lim
> Date: Thursday, May 2, 2024, 10:21
> To: Holden Karau
> Cc: Chao Sun, Xiao Li, Tathagata Das,
> Wenchen Fan <cloud0...@gmail.com>, Cheng Pan, Nicholas Chammas,
> Dongjoon Hyun, Cheng Pan, Spark dev list, Anish Shrigondekar
> Subject: Re: [DISCUSS] Spark 4.0.0 release
>
>
>
> +1 love to see it!
>
>
>
> On Thu, May 2, 2024 at 10:08 AM Holden Karau 
> wrote:
>
> +1 :) yay previews
>
>
>
> On Wed, May 1, 2024 at 5:36 PM Chao Sun  wrote:
>
> +1
>
>
>
> On Wed, May 1, 2024 at 5:23 PM Xiao Li  wrote:
>
> +1 for next Monday.
>
>
>
> We can do more previews when the other features are ready for preview.
>
>
>
> On Wed, May 1, 2024 at 8:46 AM Tathagata Das wrote:
>
> Next week sounds great! Thank you Wenchen!
>
>
>
> On Wed, May 1, 2024 at 11:16 AM Wenchen Fan 
> wrote:
>
> Yea I think a preview release won't hurt (without a branch cut). We
> don't need to wait for all the ongoing projects to be ready. How about we
> do a 4.0 preview release based on the current master branch next Monday?
>
>
>
> On Wed, May 1, 2024 at 11:06 PM Tathagata Das <
> tathagata.das1...@gmail.com> wrote:
>
> Hey all,
>
>
>
> Reviving this thread, but Spark master has already accumulated a huge
> amount of changes.  As a downstream project maintainer, I want to really
> start testing the new features and other breaking changes, and it's hard to
> do that without a Preview release. So the sooner we make a Preview release,
> the faster we can start getting feedback for fixing things for a great
> Spark 4.0 final release.
>
>
>
> So I urge the community to produce a Spark 4.0 Preview soon even if
> certain features targeting the Delta 4.0 release are still incomplete.
>
>
>
> Thanks!
>
>
>
>
>
> On Wed, Apr 17, 2024 at 8:35 AM Wenchen Fan 
> wrote:
>
> Thank you all for the replies!
>
>
>
> To @Nicholas Chammas  : Thanks for
> cleaning up the error terminology and documentation! I've merged the first
> PR and let's finish others before the 4.0 release.
>
> To @Dongjoon Hyun  : Thanks for driving the
> ANSI on by default effort! Now the vote has passed, let's flip the config

Re: [DISCUSS] Spark 4.0.0 release

2024-05-07 Thread Nimrod Ofek
Hi,
Thanks for the reply.

From my experience, a build on a build server would be much more
predictable and less error-prone than building on some laptop, and of
course much faster for producing builds, snapshots, early preview
releases, release candidates, or final releases.
It would enable us to have a preview version with the current changes (a
snapshot version), either automatically every day or, if we need to save
costs (although a build is really not expensive), with the click of a button.

Regarding keys for signing: that's what vaults are for. All across the
industry we use vaults (such as HashiCorp Vault). But if the build is
automated and the only manual step is signing the release, for security
reasons, that would be reasonable.

Thanks,
Nimrod


On Wed, May 8, 2024 at 12:54 AM Holden Karau wrote:

> Indeed. We could conceivably build the release in CI/CD but the final
> verification / signing should be done locally to keep the keys safe (there
> was some concern from earlier release processes).
>
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
>
> On Tue, May 7, 2024 at 10:55 AM Nimrod Ofek  wrote:
>
>> Hi,
>>
>> Sorry for the novice question, Wenchen - the release is done manually
>> from a laptop? Not using a CI CD process on a build server?
>>
>> Thanks,
>> Nimrod
>>
>> On Tue, May 7, 2024 at 8:50 PM Wenchen Fan  wrote:
>>
>>> UPDATE:
>>>
>>> Unfortunately, it took me quite some time to set up my laptop and get it
>>> ready for the release process (docker desktop doesn't work anymore, my pgp
>>> key is lost, etc.). I'll start the RC process at my tomorrow. Thanks for
>>> your patience!
>>>
>>> Wenchen
>>>
>>> On Fri, May 3, 2024 at 7:47 AM yangjie01  wrote:
>>>
 +1



 From: Jungtaek Lim
 Date: Thursday, May 2, 2024, 10:21
 To: Holden Karau
 Cc: Chao Sun, Xiao Li, Tathagata Das,
 Wenchen Fan <cloud0...@gmail.com>, Cheng Pan,
 Nicholas Chammas <nicholas.cham...@gmail.com>, Dongjoon Hyun,
 Cheng Pan, Spark dev list, Anish Shrigondekar
 Subject: Re: [DISCUSS] Spark 4.0.0 release



 +1 love to see it!



 On Thu, May 2, 2024 at 10:08 AM Holden Karau 
 wrote:

 +1 :) yay previews



 On Wed, May 1, 2024 at 5:36 PM Chao Sun  wrote:

 +1



 On Wed, May 1, 2024 at 5:23 PM Xiao Li  wrote:

 +1 for next Monday.



 We can do more previews when the other features are ready for preview.



 On Wed, May 1, 2024 at 8:46 AM Tathagata Das wrote:

 Next week sounds great! Thank you Wenchen!



 On Wed, May 1, 2024 at 11:16 AM Wenchen Fan 
 wrote:

 Yea I think a preview release won't hurt (without a branch cut). We
 don't need to wait for all the ongoing projects to be ready. How about we
 do a 4.0 preview release based on the current master branch next Monday?



 On Wed, May 1, 2024 at 11:06 PM Tathagata Das <
 tathagata.das1...@gmail.com> wrote:

 Hey all,



 Reviving this thread, but Spark master has already accumulated a huge
 amount of changes.  As a downstream project maintainer, I want to really
 start testing the new features and other breaking changes, and it's hard to
 do that without a Preview release. So the sooner we make a Preview release,
 the faster we can start getting feedback for fixing things for a great
 Spark 4.0 final release.



 So I urge the community to produce a Spark 4.0 Preview soon even if
 certain features targeting the Delta 4.0 release are still incomplete.



 Thanks!





 On Wed, Apr 17, 2024 at 8:35 AM Wenchen Fan 
 wrote:

 Thank you all for the replies!



 To @Nicholas Chammas  : Thanks for
 cleaning up the error terminology and documentation! I've merged the first
 PR and let's finish others before the 4.0 release.

 To @Dongjoon Hyun  : Thanks for driving the
 ANSI on by default effort! Now the vote has passed, let's flip the config
 and finish the DataFrame error context feature before 4.0.

 To @Jungtaek Lim  : Ack. We can treat
 the Streaming state store data source as completed for 4.0 then.

 To @Cheng Pan  : Yea we definitely should have a
 preview release. Let's collect more feedback on the ongoing projects and
 then we can propose a date for the preview release.



 On Wed, Apr 17, 2024 at 1:22 PM Cheng Pan  wrote:

 will we have preview release for 4.0.0 like we did for 2.0.0 and 3.0.0?

 Thanks,
 Cheng Pan


 > On Apr 15, 2024, at 09:58, Jungtaek Lim 
 wrote:
 >
 > 

Re: [DISCUSS] Spark 4.0.0 release

2024-05-07 Thread Holden Karau
Indeed. We could conceivably build the release in CI/CD but the final
verification / signing should be done locally to keep the keys safe (there
was some concern from earlier release processes).

Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


On Tue, May 7, 2024 at 10:55 AM Nimrod Ofek  wrote:

> Hi,
>
> Sorry for the novice question, Wenchen - the release is done manually from
> a laptop? Not using a CI CD process on a build server?
>
> Thanks,
> Nimrod
>
> On Tue, May 7, 2024 at 8:50 PM Wenchen Fan  wrote:
>
>> UPDATE:
>>
>> Unfortunately, it took me quite some time to set up my laptop and get it
>> ready for the release process (docker desktop doesn't work anymore, my pgp
>> key is lost, etc.). I'll start the RC process at my tomorrow. Thanks for
>> your patience!
>>
>> Wenchen
>>
>> On Fri, May 3, 2024 at 7:47 AM yangjie01  wrote:
>>
>>> +1
>>>
>>>
>>>
>>> From: Jungtaek Lim
>>> Date: Thursday, May 2, 2024, 10:21
>>> To: Holden Karau
>>> Cc: Chao Sun, Xiao Li, Tathagata Das,
>>> Wenchen Fan <cloud0...@gmail.com>, Cheng Pan,
>>> Nicholas Chammas <nicholas.cham...@gmail.com>, Dongjoon Hyun,
>>> Cheng Pan, Spark dev list, Anish Shrigondekar
>>> Subject: Re: [DISCUSS] Spark 4.0.0 release
>>>
>>>
>>>
>>> +1 love to see it!
>>>
>>>
>>>
>>> On Thu, May 2, 2024 at 10:08 AM Holden Karau 
>>> wrote:
>>>
>>> +1 :) yay previews
>>>
>>>
>>>
>>> On Wed, May 1, 2024 at 5:36 PM Chao Sun  wrote:
>>>
>>> +1
>>>
>>>
>>>
>>> On Wed, May 1, 2024 at 5:23 PM Xiao Li  wrote:
>>>
>>> +1 for next Monday.
>>>
>>>
>>>
>>> We can do more previews when the other features are ready for preview.
>>>
>>>
>>>
>>> On Wed, May 1, 2024 at 8:46 AM Tathagata Das wrote:
>>>
>>> Next week sounds great! Thank you Wenchen!
>>>
>>>
>>>
>>> On Wed, May 1, 2024 at 11:16 AM Wenchen Fan  wrote:
>>>
>>> Yea I think a preview release won't hurt (without a branch cut). We
>>> don't need to wait for all the ongoing projects to be ready. How about we
>>> do a 4.0 preview release based on the current master branch next Monday?
>>>
>>>
>>>
>>> On Wed, May 1, 2024 at 11:06 PM Tathagata Das <
>>> tathagata.das1...@gmail.com> wrote:
>>>
>>> Hey all,
>>>
>>>
>>>
>>> Reviving this thread, but Spark master has already accumulated a huge
>>> amount of changes.  As a downstream project maintainer, I want to really
>>> start testing the new features and other breaking changes, and it's hard to
>>> do that without a Preview release. So the sooner we make a Preview release,
>>> the faster we can start getting feedback for fixing things for a great
>>> Spark 4.0 final release.
>>>
>>>
>>>
>>> So I urge the community to produce a Spark 4.0 Preview soon even if
>>> certain features targeting the Delta 4.0 release are still incomplete.
>>>
>>>
>>>
>>> Thanks!
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Apr 17, 2024 at 8:35 AM Wenchen Fan  wrote:
>>>
>>> Thank you all for the replies!
>>>
>>>
>>>
>>> To @Nicholas Chammas  : Thanks for cleaning
>>> up the error terminology and documentation! I've merged the first PR and
>>> let's finish others before the 4.0 release.
>>>
>>> To @Dongjoon Hyun  : Thanks for driving the
>>> ANSI on by default effort! Now the vote has passed, let's flip the config
>>> and finish the DataFrame error context feature before 4.0.
>>>
>>> To @Jungtaek Lim  : Ack. We can treat the
>>> Streaming state store data source as completed for 4.0 then.
>>>
>>> To @Cheng Pan  : Yea we definitely should have a
>>> preview release. Let's collect more feedback on the ongoing projects and
>>> then we can propose a date for the preview release.
>>>
>>>
>>>
>>> On Wed, Apr 17, 2024 at 1:22 PM Cheng Pan  wrote:
>>>
>>> will we have preview release for 4.0.0 like we did for 2.0.0 and 3.0.0?
>>>
>>> Thanks,
>>> Cheng Pan
>>>
>>>
>>> > On Apr 15, 2024, at 09:58, Jungtaek Lim 
>>> wrote:
>>> >
>>> > W.r.t. state data source - reader (SPARK-45511), there are several
>>> follow-up tickets, but we don't plan to address them soon. The current
>>> implementation is the final shape for Spark 4.0.0, unless there are demands
>>> on the follow-up tickets.
>>> >
>>> > We may want to check the plan for transformWithState - my
>>> understanding is that we want to release the feature to 4.0.0, but there
>>> are several remaining works to be done. While the tentative timeline for
>>> releasing is June 2024, what would be the tentative timeline for the RC cut?
>>> > (cc. Anish to add more context on the plan for transformWithState)
>>> >
>>> > On Sat, Apr 13, 2024 at 3:15 AM Wenchen Fan 
>>> wrote:
>>> > Hi all,
>>> >
>>> > It's close to the previously proposed 4.0.0 release date (June 2024),
>>> and I think it's time to prepare for it and discuss the ongoing projects:
>>> > • ANSI by default
>>> > • Spark Connect GA
>>> > • Structured Logging
>>> > • Streaming state store data source
>>> > • new 

Re: [DISCUSS] Spark 4.0.0 release

2024-05-07 Thread Dongjoon Hyun
Thank you so much for the update, Wenchen!

Dongjoon.

On Tue, May 7, 2024 at 10:49 AM Wenchen Fan  wrote:

> UPDATE:
>
> Unfortunately, it took me quite some time to set up my laptop and get it
> ready for the release process (docker desktop doesn't work anymore, my pgp
> key is lost, etc.). I'll start the RC process at my tomorrow. Thanks for
> your patience!
>
> Wenchen
>
> On Fri, May 3, 2024 at 7:47 AM yangjie01  wrote:
>
>> +1
>>
>>
>>
>> From: Jungtaek Lim
>> Date: Thursday, May 2, 2024, 10:21
>> To: Holden Karau
>> Cc: Chao Sun, Xiao Li, Tathagata Das,
>> Wenchen Fan <cloud0...@gmail.com>, Cheng Pan,
>> Nicholas Chammas <nicholas.cham...@gmail.com>, Dongjoon Hyun,
>> Cheng Pan, Spark dev list, Anish Shrigondekar
>> Subject: Re: [DISCUSS] Spark 4.0.0 release
>>
>>
>>
>> +1 love to see it!
>>
>>
>>
>> On Thu, May 2, 2024 at 10:08 AM Holden Karau 
>> wrote:
>>
>> +1 :) yay previews
>>
>>
>>
>> On Wed, May 1, 2024 at 5:36 PM Chao Sun  wrote:
>>
>> +1
>>
>>
>>
>> On Wed, May 1, 2024 at 5:23 PM Xiao Li  wrote:
>>
>> +1 for next Monday.
>>
>>
>>
>> We can do more previews when the other features are ready for preview.
>>
>>
>>
>> On Wed, May 1, 2024 at 8:46 AM Tathagata Das wrote:
>>
>> Next week sounds great! Thank you Wenchen!
>>
>>
>>
>> On Wed, May 1, 2024 at 11:16 AM Wenchen Fan  wrote:
>>
>> Yea I think a preview release won't hurt (without a branch cut). We don't
>> need to wait for all the ongoing projects to be ready. How about we do a
>> 4.0 preview release based on the current master branch next Monday?
>>
>>
>>
>> On Wed, May 1, 2024 at 11:06 PM Tathagata Das <
>> tathagata.das1...@gmail.com> wrote:
>>
>> Hey all,
>>
>>
>>
>> Reviving this thread, but Spark master has already accumulated a huge
>> amount of changes.  As a downstream project maintainer, I want to really
>> start testing the new features and other breaking changes, and it's hard to
>> do that without a Preview release. So the sooner we make a Preview release,
>> the faster we can start getting feedback for fixing things for a great
>> Spark 4.0 final release.
>>
>>
>>
>> So I urge the community to produce a Spark 4.0 Preview soon even if
>> certain features targeting the Delta 4.0 release are still incomplete.
>>
>>
>>
>> Thanks!
>>
>>
>>
>>
>>
>> On Wed, Apr 17, 2024 at 8:35 AM Wenchen Fan  wrote:
>>
>> Thank you all for the replies!
>>
>>
>>
>> To @Nicholas Chammas  : Thanks for cleaning
>> up the error terminology and documentation! I've merged the first PR and
>> let's finish others before the 4.0 release.
>>
>> To @Dongjoon Hyun  : Thanks for driving the
>> ANSI on by default effort! Now the vote has passed, let's flip the config
>> and finish the DataFrame error context feature before 4.0.
>>
>> To @Jungtaek Lim  : Ack. We can treat the
>> Streaming state store data source as completed for 4.0 then.
>>
>> To @Cheng Pan  : Yea we definitely should have a
>> preview release. Let's collect more feedback on the ongoing projects and
>> then we can propose a date for the preview release.
>>
>>
>>
>> On Wed, Apr 17, 2024 at 1:22 PM Cheng Pan  wrote:
>>
>> will we have preview release for 4.0.0 like we did for 2.0.0 and 3.0.0?
>>
>> Thanks,
>> Cheng Pan
>>
>>
>> > On Apr 15, 2024, at 09:58, Jungtaek Lim 
>> wrote:
>> >
>> > W.r.t. state data source - reader (SPARK-45511), there are several
>> follow-up tickets, but we don't plan to address them soon. The current
>> implementation is the final shape for Spark 4.0.0, unless there are demands
>> on the follow-up tickets.
>> >
>> > We may want to check the plan for transformWithState - my understanding
>> is that we want to release the feature to 4.0.0, but there are several
>> remaining works to be done. While the tentative timeline for releasing is
>> June 2024, what would be the tentative timeline for the RC cut?
>> > (cc. Anish to add more context on the plan for transformWithState)
>> >
>> > On Sat, Apr 13, 2024 at 3:15 AM Wenchen Fan 
>> wrote:
>> > Hi all,
>> >
>> > It's close to the previously proposed 4.0.0 release date (June 2024),
>> and I think it's time to prepare for it and discuss the ongoing projects:
>> > • ANSI by default
>> > • Spark Connect GA
>> > • Structured Logging
>> > • Streaming state store data source
>> > • new data type VARIANT
>> > • STRING collation support
>> > • Spark k8s operator versioning
>> > Please help to add more items to this list that are missed here. I
>> would like to volunteer as the release manager for Apache Spark 4.0.0 if
>> there is no objection. Thank you all for the great work that fills Spark
>> 4.0!
>> >
>> > Wenchen Fan

caching a dataframe in Spark takes lot of time

2024-05-07 Thread Prem Sahoo
Hello Folks,
In Spark I read a file, apply some transformations, and finally write the
result to HDFS.

Now I want to write the same DataFrame to MapRFS as well, but for this
Spark will execute the full DAG again (recomputing the read and all the
transformations).

I don't want this recomputation, so I decided to cache() the DataFrame so
that the 2nd/nth write won't recompute all the steps.

But here is the catch: cache() itself takes a long time to persist the data
in memory.

My question: if the DataFrame is already in memory, why does just saving it
to another space in memory take so long (6 minutes for 3.2 GB of data)?

May I know which operations in cache() take such a long time?

I would appreciate it if someone could share this information.
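
One likely factor, offered as a sketch rather than a diagnosis: cache() and
persist() are lazy, so the DataFrame is not in memory until the first action
runs. The time attributed to caching therefore usually includes the read and
the transformations themselves, plus encoding the rows into Spark's in-memory
columnar format. The minimal sketch below shows the write-twice pattern;
`writeTwice` and its path parameters are hypothetical names, and Parquet is
an assumed output format.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.storage.StorageLevel

// Hypothetical helper: write one DataFrame to two destinations while
// computing the read + transformation DAG only once.
def writeTwice(df: DataFrame, firstPath: String, secondPath: String): Unit = {
  // persist() is lazy: nothing is stored until the first action runs.
  val cached = df.persist(StorageLevel.MEMORY_AND_DISK)
  try {
    // The first write runs the full DAG and fills the cache as a side effect.
    cached.write.mode("overwrite").parquet(firstPath)
    // The second write reads the cached partitions instead of recomputing.
    cached.write.mode("overwrite").parquet(secondPath)
  } finally {
    // Release the cached blocks once both writes are done.
    cached.unpersist()
  }
}
```

With this shape the first write pays for the whole computation, and only the
second write benefits from the cache. A call might look like
`writeTwice(df, "hdfs:///tmp/out", "maprfs:///tmp/out")`, where both paths
are placeholders.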


Re: [DISCUSS] Spark 4.0.0 release

2024-05-07 Thread Nimrod Ofek
Hi,

Sorry for the novice question, Wenchen: is the release done manually from
a laptop, rather than through a CI/CD process on a build server?

Thanks,
Nimrod

On Tue, May 7, 2024 at 8:50 PM Wenchen Fan  wrote:

> UPDATE:
>
> Unfortunately, it took me quite some time to set up my laptop and get it
> ready for the release process (docker desktop doesn't work anymore, my pgp
> key is lost, etc.). I'll start the RC process at my tomorrow. Thanks for
> your patience!
>
> Wenchen
>
> On Fri, May 3, 2024 at 7:47 AM yangjie01  wrote:
>
>> +1
>>
>>
>>
>> From: Jungtaek Lim
>> Date: Thursday, May 2, 2024, 10:21
>> To: Holden Karau
>> Cc: Chao Sun, Xiao Li, Tathagata Das,
>> Wenchen Fan <cloud0...@gmail.com>, Cheng Pan,
>> Nicholas Chammas <nicholas.cham...@gmail.com>, Dongjoon Hyun,
>> Cheng Pan, Spark dev list, Anish Shrigondekar
>> Subject: Re: [DISCUSS] Spark 4.0.0 release
>>
>>
>>
>> +1 love to see it!
>>
>>
>>
>> On Thu, May 2, 2024 at 10:08 AM Holden Karau 
>> wrote:
>>
>> +1 :) yay previews
>>
>>
>>
>> On Wed, May 1, 2024 at 5:36 PM Chao Sun  wrote:
>>
>> +1
>>
>>
>>
>> On Wed, May 1, 2024 at 5:23 PM Xiao Li  wrote:
>>
>> +1 for next Monday.
>>
>>
>>
>> We can do more previews when the other features are ready for preview.
>>
>>
>>
>> On Wed, May 1, 2024 at 8:46 AM Tathagata Das wrote:
>>
>> Next week sounds great! Thank you Wenchen!
>>
>>
>>
>> On Wed, May 1, 2024 at 11:16 AM Wenchen Fan  wrote:
>>
>> Yea I think a preview release won't hurt (without a branch cut). We don't
>> need to wait for all the ongoing projects to be ready. How about we do a
>> 4.0 preview release based on the current master branch next Monday?
>>
>>
>>
>> On Wed, May 1, 2024 at 11:06 PM Tathagata Das <
>> tathagata.das1...@gmail.com> wrote:
>>
>> Hey all,
>>
>>
>>
>> Reviving this thread, but Spark master has already accumulated a huge
>> amount of changes.  As a downstream project maintainer, I want to really
>> start testing the new features and other breaking changes, and it's hard to
>> do that without a Preview release. So the sooner we make a Preview release,
>> the faster we can start getting feedback for fixing things for a great
>> Spark 4.0 final release.
>>
>>
>>
>> So I urge the community to produce a Spark 4.0 Preview soon even if
>> certain features targeting the Delta 4.0 release are still incomplete.
>>
>>
>>
>> Thanks!
>>
>>
>>
>>
>>
>> On Wed, Apr 17, 2024 at 8:35 AM Wenchen Fan  wrote:
>>
>> Thank you all for the replies!
>>
>>
>>
>> To @Nicholas Chammas  : Thanks for cleaning
>> up the error terminology and documentation! I've merged the first PR and
>> let's finish others before the 4.0 release.
>>
>> To @Dongjoon Hyun  : Thanks for driving the
>> ANSI on by default effort! Now the vote has passed, let's flip the config
>> and finish the DataFrame error context feature before 4.0.
>>
>> To @Jungtaek Lim  : Ack. We can treat the
>> Streaming state store data source as completed for 4.0 then.
>>
>> To @Cheng Pan  : Yea we definitely should have a
>> preview release. Let's collect more feedback on the ongoing projects and
>> then we can propose a date for the preview release.
>>
>>
>>
>> On Wed, Apr 17, 2024 at 1:22 PM Cheng Pan  wrote:
>>
>> will we have preview release for 4.0.0 like we did for 2.0.0 and 3.0.0?
>>
>> Thanks,
>> Cheng Pan
>>
>>
>> > On Apr 15, 2024, at 09:58, Jungtaek Lim 
>> wrote:
>> >
>> > W.r.t. state data source - reader (SPARK-45511), there are several
>> follow-up tickets, but we don't plan to address them soon. The current
>> implementation is the final shape for Spark 4.0.0, unless there are demands
>> on the follow-up tickets.
>> >
>> > We may want to check the plan for transformWithState - my understanding
>> is that we want to release the feature to 4.0.0, but there are several
>> remaining works to be done. While the tentative timeline for releasing is
>> June 2024, what would be the tentative timeline for the RC cut?
>> > (cc. Anish to add more context on the plan for transformWithState)
>> >
>> > On Sat, Apr 13, 2024 at 3:15 AM Wenchen Fan 
>> wrote:
>> > Hi all,
>> >
>> > It's close to the previously proposed 4.0.0 release date (June 2024),
>> and I think it's time to prepare for it and discuss the ongoing projects:
>> > • ANSI by default
>> > • Spark Connect GA
>> > • Structured Logging
>> > • Streaming state store data source
>> > • new data type VARIANT
>> > • STRING collation support
>> > • Spark k8s operator versioning
>> > Please help to add more items to this list that are missed here. I
>> would like to volunteer as the release manager for Apache Spark 4.0.0 if
>> there is no objection. Thank you all for the great work that fills Spark
>> 4.0!
>> >
>> > Wenchen Fan

Re: [DISCUSS] Spark 4.0.0 release

2024-05-07 Thread Wenchen Fan
UPDATE:

Unfortunately, it took me quite some time to set up my laptop and get it
ready for the release process (Docker Desktop doesn't work anymore, my PGP
key is lost, etc.). I'll start the RC process tomorrow, my time. Thanks for
your patience!

Wenchen

On Fri, May 3, 2024 at 7:47 AM yangjie01  wrote:

> +1
>
>
>
> From: Jungtaek Lim
> Date: Thursday, May 2, 2024, 10:21
> To: Holden Karau
> Cc: Chao Sun, Xiao Li, Tathagata Das,
> Wenchen Fan <cloud0...@gmail.com>, Cheng Pan,
> Nicholas Chammas <nicholas.cham...@gmail.com>, Dongjoon Hyun,
> Cheng Pan, Spark dev list, Anish Shrigondekar
> Subject: Re: [DISCUSS] Spark 4.0.0 release
>
>
>
> +1 love to see it!
>
>
>
> On Thu, May 2, 2024 at 10:08 AM Holden Karau 
> wrote:
>
> +1 :) yay previews
>
>
>
> On Wed, May 1, 2024 at 5:36 PM Chao Sun  wrote:
>
> +1
>
>
>
> On Wed, May 1, 2024 at 5:23 PM Xiao Li  wrote:
>
> +1 for next Monday.
>
>
>
> We can do more previews when the other features are ready for preview.
>
>
>
> On Wed, May 1, 2024 at 8:46 AM Tathagata Das wrote:
>
> Next week sounds great! Thank you Wenchen!
>
>
>
> On Wed, May 1, 2024 at 11:16 AM Wenchen Fan  wrote:
>
> Yea I think a preview release won't hurt (without a branch cut). We don't
> need to wait for all the ongoing projects to be ready. How about we do a
> 4.0 preview release based on the current master branch next Monday?
>
>
>
> On Wed, May 1, 2024 at 11:06 PM Tathagata Das 
> wrote:
>
> Hey all,
>
>
>
> Reviving this thread, but Spark master has already accumulated a huge
> amount of changes.  As a downstream project maintainer, I want to really
> start testing the new features and other breaking changes, and it's hard to
> do that without a Preview release. So the sooner we make a Preview release,
> the faster we can start getting feedback for fixing things for a great
> Spark 4.0 final release.
>
>
>
> So I urge the community to produce a Spark 4.0 Preview soon even if
> certain features targeting the Delta 4.0 release are still incomplete.
>
>
>
> Thanks!
>
>
>
>
>
> On Wed, Apr 17, 2024 at 8:35 AM Wenchen Fan  wrote:
>
> Thank you all for the replies!
>
>
>
> To @Nicholas Chammas  : Thanks for cleaning
> up the error terminology and documentation! I've merged the first PR and
> let's finish others before the 4.0 release.
>
> To @Dongjoon Hyun  : Thanks for driving the ANSI
> on by default effort! Now the vote has passed, let's flip the config and
> finish the DataFrame error context feature before 4.0.
>
> To @Jungtaek Lim  : Ack. We can treat the
> Streaming state store data source as completed for 4.0 then.
>
> To @Cheng Pan  : Yea we definitely should have a
> preview release. Let's collect more feedback on the ongoing projects and
> then we can propose a date for the preview release.
>
>
>
> On Wed, Apr 17, 2024 at 1:22 PM Cheng Pan  wrote:
>
> will we have preview release for 4.0.0 like we did for 2.0.0 and 3.0.0?
>
> Thanks,
> Cheng Pan
>
>
> > On Apr 15, 2024, at 09:58, Jungtaek Lim 
> wrote:
> >
> > W.r.t. state data source - reader (SPARK-45511), there are several
> follow-up tickets, but we don't plan to address them soon. The current
> implementation is the final shape for Spark 4.0.0, unless there are demands
> on the follow-up tickets.
> >
> > We may want to check the plan for transformWithState - my understanding
> is that we want to release the feature to 4.0.0, but there are several
> remaining works to be done. While the tentative timeline for releasing is
> June 2024, what would be the tentative timeline for the RC cut?
> > (cc. Anish to add more context on the plan for transformWithState)
> >
> > On Sat, Apr 13, 2024 at 3:15 AM Wenchen Fan  wrote:
> > Hi all,
> >
> > It's close to the previously proposed 4.0.0 release date (June 2024),
> and I think it's time to prepare for it and discuss the ongoing projects:
> > • ANSI by default
> > • Spark Connect GA
> > • Structured Logging
> > • Streaming state store data source
> > • new data type VARIANT
> > • STRING collation support
> > • Spark k8s operator versioning
> > Please help to add more items to this list that are missed here. I would
> like to volunteer as the release manager for Apache Spark 4.0.0 if there is
> no objection. Thank you all for the great work that fills Spark 4.0!
> >
> > Wenchen Fan


Spark not creating staging dir for insertInto partitioned table

2024-05-07 Thread Sanskar Modi
Hi Folks,

I wanted to check why Spark doesn't create a staging dir when doing an
insertInto on partitioned tables. I'm running the example code below:
```
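// nonstrict mode lets every partition column be resolved dynamically.
spark.sql("set hive.exec.dynamic.partition.mode=nonstrict")

// Three-column DataFrame built from tuples; columns default to _1, _2, _3.
val rdd = sc.parallelize(Seq((1, 5, 1), (2, 1, 2), (4, 4, 3)))
val df = spark.createDataFrame(rdd)
// testing_table is partitioned on "_1"
df.write.insertInto("testing_table")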
```
In this scenario, FileOutputCommitter treats the table path as the output
path, creates temporary folders like
`/testing_table/_temporary/0`, and then moves them to the
partition location when the job commits.

But if multiple parallel apps are inserting into the same partition, this
can cause race conditions when the `_temporary` dir is deleted. Ideally,
each app should have a unique staging dir where the job writes its output.
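
The idea in the last sentence can be sketched as follows. This only
illustrates per-application staging paths; it is not how Spark or
FileOutputCommitter behaves today. `stagingDir`, the `.staging` folder name,
and the `/warehouse/testing_table` path are all hypothetical.

```scala
import java.nio.file.{Path, Paths}
import java.util.UUID

// Hypothetical helper, not a Spark or Hadoop API: derive a staging directory
// that is unique per application, so concurrent jobs never share a single
// `_temporary` folder under the table path.
def stagingDir(tablePath: String, appId: String): Path =
  Paths.get(tablePath, ".staging", s"$appId-${UUID.randomUUID()}")

// Two applications targeting the same table get different directories.
val a = stagingDir("/warehouse/testing_table", "app-1")
val b = stagingDir("/warehouse/testing_table", "app-2")
```

Because each job would clean up only its own directory, one job's commit
could not delete another job's in-flight output.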

Is there any specific reason for this? Or am I missing something here?
Thanks for your time and assistance regarding this!

Kind regards
Sanskar