Re: [DISCUSS] Spark 4.0.0 release

2024-05-08 Thread Holden Karau
That looks cool, maybe let’s split off a thread on how to improve our
release processes?

Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


On Wed, May 8, 2024 at 9:31 AM Erik Krogen  wrote:

> On that note, GitHub recently released (public preview) a new feature
> called Artifact Attestations, which may be relevant/useful here: Introducing
> Artifact Attestations–now in public beta - The GitHub Blog
> <https://github.blog/2024-05-02-introducing-artifact-attestations-now-in-public-beta/>
>
> On Wed, May 8, 2024 at 9:06 AM Nimrod Ofek  wrote:
>
>> I have no permissions so I can't do it, but I'm happy to help (although I
>> am more familiar with GitLab CI/CD than GitHub Actions).
>> Is there some point of contact who can provide me the needed context and
>> permissions?
>> I'd also love to see why the costs are high and see how we can reduce
>> them...
>>
>> Thanks,
>> Nimrod
>>
>> On Wed, May 8, 2024 at 8:26 AM Holden Karau 
>> wrote:
>>
>>> I think signing the artifacts produced from a secure CI sounds like a
>>> good idea. I know we’ve been asked to reduce our GitHub action usage but
>>> perhaps someone interested could volunteer to set that up.
>>>
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>>>
>>> On Tue, May 7, 2024 at 9:43 PM Nimrod Ofek 
>>> wrote:
>>>
>>>> Hi,
>>>> Thanks for the reply.
>>>>
>>>> From my experience, a build on a build server would be much more
>>>> predictable and less error-prone than building on some laptop, and of
>>>> course much faster for producing builds: snapshots, early preview
>>>> releases, release candidates, or final releases.
>>>> It will enable us to have a preview version with current changes (a
>>>> snapshot version), either automatically every day or, if we need to
>>>> save costs (although the build is really not expensive), with a click
>>>> of a button.
>>>>
>>>> Regarding keys for signing: that's what vaults are for; across the
>>>> industry we use vaults (such as HashiCorp Vault). But if the build is
>>>> automated and the only manual step is signing the release, for
>>>> security reasons, that would be reasonable.
>>>>
>>>> Thanks,
>>>> Nimrod
>>>>
>>>>
>>>> On Wed, May 8, 2024 at 00:54, Holden Karau <
>>>> holden.ka...@gmail.com> wrote:
>>>>
>>>>> Indeed. We could conceivably build the release in CI/CD but the final
>>>>> verification / signing should be done locally to keep the keys safe (there
>>>>> was some concern from earlier release processes).
>>>>>
>>>>> Twitter: https://twitter.com/holdenkarau
>>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>>> https://amzn.to/2MaRAG9
>>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>>
>>>>>
>>>>> On Tue, May 7, 2024 at 10:55 AM Nimrod Ofek 
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Sorry for the novice question, Wenchen - the release is done manually
>>>>>> from a laptop? Not using a CI/CD process on a build server?
>>>>>>
>>>>>> Thanks,
>>>>>> Nimrod
>>>>>>
>>>>>> On Tue, May 7, 2024 at 8:50 PM Wenchen Fan 
>>>>>> wrote:
>>>>>>
>>>>>>> UPDATE:
>>>>>>>
>>>>>>> Unfortunately, it took me quite some time to set up my laptop and
>>>>>>> get it ready for the release process (Docker Desktop doesn't work
>>>>>>> anymore, my PGP key is lost, etc.). I'll start the RC process
>>>>>>> tomorrow, my time. Thanks for your patience!
>>>>>>>
>>>>>>> Wenchen
>>>>>>>
>>>>>>> On Fri, May 3, 2024 at 7:47 AM yangjie01 
>>>>>>

Re: [DISCUSS] Spark 4.0.0 release

2024-05-07 Thread Holden Karau
I think signing the artifacts produced from a secure CI sounds like a good
idea. I know we’ve been asked to reduce our GitHub action usage but perhaps
someone interested could volunteer to set that up.
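As a concrete sketch of the vault pattern Nimrod suggests below, here is what
fetching a signing passphrase could look like, assuming HashiCorp Vault's KV v2
engine, the Python hvac client, and GnuPG; the Vault URL, secret path, and
artifact name are made up for illustration:

    # Sketch: pull a signing passphrase from HashiCorp Vault (KV v2) and
    # produce a detached, ASCII-armored signature for a release artifact.
    import subprocess
    import hvac

    client = hvac.Client(url="https://vault.example.com:8200", token="...")
    secret = client.secrets.kv.v2.read_secret_version(path="spark/release-signing")
    passphrase = secret["data"]["data"]["passphrase"]

    subprocess.run(
        ["gpg", "--batch", "--pinentry-mode", "loopback", "--passphrase-fd", "0",
         "--armor", "--detach-sign", "spark-4.0.0-bin-hadoop3.tgz"],
        input=passphrase.encode(),
        check=True,
    )

Whether the signing step runs in CI or stays manual, as discussed below, is a
policy choice; only the secret lookup changes.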

Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


On Tue, May 7, 2024 at 9:43 PM Nimrod Ofek  wrote:

> Hi,
> Thanks for the reply.
>
> From my experience, a build on a build server would be much more
> predictable and less error-prone than building on some laptop, and of
> course much faster for producing builds: snapshots, early preview
> releases, release candidates, or final releases.
> It will enable us to have a preview version with current changes (a
> snapshot version), either automatically every day or, if we need to save
> costs (although the build is really not expensive), with a click of a button.
>
> Regarding keys for signing: that's what vaults are for; across the
> industry we use vaults (such as HashiCorp Vault). But if the build is
> automated and the only manual step is signing the release, for security
> reasons, that would be reasonable.
>
> Thanks,
> Nimrod
>
>
> On Wed, May 8, 2024 at 00:54, Holden Karau <
> holden.ka...@gmail.com> wrote:
>
>> Indeed. We could conceivably build the release in CI/CD but the final
>> verification / signing should be done locally to keep the keys safe (there
>> was some concern from earlier release processes).
>>
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>>
>> On Tue, May 7, 2024 at 10:55 AM Nimrod Ofek 
>> wrote:
>>
>>> Hi,
>>>
>>> Sorry for the novice question, Wenchen - the release is done manually
>>> from a laptop? Not using a CI/CD process on a build server?
>>>
>>> Thanks,
>>> Nimrod
>>>
>>> On Tue, May 7, 2024 at 8:50 PM Wenchen Fan  wrote:
>>>
>>>> UPDATE:
>>>>
>>>> Unfortunately, it took me quite some time to set up my laptop and get
>>>> it ready for the release process (Docker Desktop doesn't work anymore, my
>>>> PGP key is lost, etc.). I'll start the RC process tomorrow, my time. Thanks
>>>> for your patience!
>>>>
>>>> Wenchen
>>>>
>>>> On Fri, May 3, 2024 at 7:47 AM yangjie01  wrote:
>>>>
>>>>> +1
>>>>>
>>>>>
>>>>>
>>>>> *From:* Jungtaek Lim 
>>>>> *Date:* Thursday, May 2, 2024, 10:21
>>>>> *To:* Holden Karau 
>>>>> *Cc:* Chao Sun , Xiao Li ,
>>>>> Tathagata Das , Wenchen Fan <
>>>>> cloud0...@gmail.com>, Cheng Pan , Nicholas Chammas
>>>>> , Dongjoon Hyun ,
>>>>> Cheng Pan , Spark dev list ,
>>>>> Anish Shrigondekar 
>>>>> *Subject:* Re: [DISCUSS] Spark 4.0.0 release
>>>>>
>>>>>
>>>>>
>>>>> +1 love to see it!
>>>>>
>>>>>
>>>>>
>>>>> On Thu, May 2, 2024 at 10:08 AM Holden Karau 
>>>>> wrote:
>>>>>
>>>>> +1 :) yay previews
>>>>>
>>>>>
>>>>>
>>>>> On Wed, May 1, 2024 at 5:36 PM Chao Sun  wrote:
>>>>>
>>>>> +1
>>>>>
>>>>>
>>>>>
>>>>> On Wed, May 1, 2024 at 5:23 PM Xiao Li  wrote:
>>>>>
>>>>> +1 for next Monday.
>>>>>
>>>>>
>>>>>
>>>>> We can do more previews when the other features are ready for preview.
>>>>>
>>>>>
>>>>>
>>>>> Tathagata Das wrote on Wed, May 1, 2024 at 08:46:
>>>>>
>>>>> Next week sounds great! Thank you Wenchen!
>>>>>
>>>>>
>>>>>
>>>>> On Wed, May 1, 2024 at 11:16 AM Wenchen Fan 
>>>>> wrote:
>>>>>
>>>>> Yea I think a preview release won't hurt (without a branch cut). We
>>>>> don't need to wait for all the ongoing projects to be ready. How about we
>>>>> do a 4.0 preview release based on the current master branch next Monday?
>>>>>

Re: [DISCUSS] Spark 4.0.0 release

2024-05-07 Thread Holden Karau
Indeed. We could conceivably build the release in CI/CD but the final
verification / signing should be done locally to keep the keys safe (there
was some concern from earlier release processes).

Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


On Tue, May 7, 2024 at 10:55 AM Nimrod Ofek  wrote:

> Hi,
>
> Sorry for the novice question, Wenchen - the release is done manually from
> a laptop? Not using a CI/CD process on a build server?
>
> Thanks,
> Nimrod
>
> On Tue, May 7, 2024 at 8:50 PM Wenchen Fan  wrote:
>
>> UPDATE:
>>
>> Unfortunately, it took me quite some time to set up my laptop and get it
>> ready for the release process (Docker Desktop doesn't work anymore, my PGP
>> key is lost, etc.). I'll start the RC process tomorrow, my time. Thanks for
>> your patience!
>>
>> Wenchen
>>
>> On Fri, May 3, 2024 at 7:47 AM yangjie01  wrote:
>>
>>> +1
>>>
>>>
>>>
>>> *From:* Jungtaek Lim 
>>> *Date:* Thursday, May 2, 2024, 10:21
>>> *To:* Holden Karau 
>>> *Cc:* Chao Sun , Xiao Li ,
>>> Tathagata Das , Wenchen Fan <
>>> cloud0...@gmail.com>, Cheng Pan , Nicholas Chammas <
>>> nicholas.cham...@gmail.com>, Dongjoon Hyun ,
>>> Cheng Pan , Spark dev list ,
>>> Anish Shrigondekar 
>>> *Subject:* Re: [DISCUSS] Spark 4.0.0 release
>>>
>>>
>>>
>>> +1 love to see it!
>>>
>>>
>>>
>>> On Thu, May 2, 2024 at 10:08 AM Holden Karau 
>>> wrote:
>>>
>>> +1 :) yay previews
>>>
>>>
>>>
>>> On Wed, May 1, 2024 at 5:36 PM Chao Sun  wrote:
>>>
>>> +1
>>>
>>>
>>>
>>> On Wed, May 1, 2024 at 5:23 PM Xiao Li  wrote:
>>>
>>> +1 for next Monday.
>>>
>>>
>>>
>>> We can do more previews when the other features are ready for preview.
>>>
>>>
>>>
>>> Tathagata Das wrote on Wed, May 1, 2024 at 08:46:
>>>
>>> Next week sounds great! Thank you Wenchen!
>>>
>>>
>>>
>>> On Wed, May 1, 2024 at 11:16 AM Wenchen Fan  wrote:
>>>
>>> Yea I think a preview release won't hurt (without a branch cut). We
>>> don't need to wait for all the ongoing projects to be ready. How about we
>>> do a 4.0 preview release based on the current master branch next Monday?
>>>
>>>
>>>
>>> On Wed, May 1, 2024 at 11:06 PM Tathagata Das <
>>> tathagata.das1...@gmail.com> wrote:
>>>
>>> Hey all,
>>>
>>>
>>>
>>> Reviving this thread, but Spark master has already accumulated a huge
>>> amount of changes.  As a downstream project maintainer, I want to really
>>> start testing the new features and other breaking changes, and it's hard to
>>> do that without a Preview release. So the sooner we make a Preview release,
>>> the faster we can start getting feedback for fixing things for a great
>>> Spark 4.0 final release.
>>>
>>>
>>>
>>> So I urge the community to produce a Spark 4.0 Preview soon even if
>>> certain features targeting the Delta 4.0 release are still incomplete.
>>>
>>>
>>>
>>> Thanks!
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Apr 17, 2024 at 8:35 AM Wenchen Fan  wrote:
>>>
>>> Thank you all for the replies!
>>>
>>>
>>>
>>> To @Nicholas Chammas  : Thanks for cleaning
>>> up the error terminology and documentation! I've merged the first PR and
>>> let's finish others before the 4.0 release.
>>>
>>> To @Dongjoon Hyun  : Thanks for driving the
>>> ANSI on by default effort! Now the vote has passed, let's flip the config
>>> and finish the DataFrame error context feature before 4.0.
>>>
>>> To @Jungtaek Lim  : Ack. We can treat the
>>> Streaming state store data source as completed for 4.0 then.
>>>
>>> To @Cheng Pan  : Yea we definitely should have a
>>> preview release. Let's collect more feedback on the ongoing projects and
>>> then we can propose a date for the preview release.
>>>
>>>
>>>
>>> On Wed, Apr 17, 2024 at 1:22 PM Cheng Pan  wrote:
>>>
>>> will we have preview release for 4.0.0 like we did for 2.0.0 and 3.0.0?
>>>
>

Re: ASF board report draft for May

2024-05-06 Thread Holden Karau
I trust Wenchen to manage the preview release effectively, but if there are
concerns around how to manage a developer preview release, let's split that
off from the board report discussion.

On Mon, May 6, 2024 at 10:44 AM Mich Talebzadeh 
wrote:

> I did some historical digging on this.
>
> Whilst both preview releases and RCs are pre-release versions, the main
> difference lies in their maturity and readiness for production use. Preview
> releases are early versions aimed at gathering feedback, while release
> candidates (RCs) are nearly finished versions that undergo final testing
> and voting before the official release.
>
> So in our case, we have two options:
>
>
>    1. Skip mentioning the Preview and focus on: "We are intending to
>    gather feedback on version 4 by releasing an earlier version to the
>    community for look-and-feel feedback, especially focused on APIs."
>    2. Mention the Preview in the form: "There will be a Preview release
>    with the aim of gathering feedback from the community focused on APIs."
>
> IMO a Preview release does not require a formal vote. Preview releases are
> often considered experimental or pre-alpha versions and are not expected to
> meet the same level of stability and completeness as release candidates or
> final releases.
>
> HTH
>
> Mich Talebzadeh,
> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
> London
> United Kingdom
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed. It is essential to note
> that, as with any advice: "one test result is worth one-thousand
> expert opinions" (Wernher von Braun
> <https://en.wikipedia.org/wiki/Wernher_von_Braun>).
>
>
> On Mon, 6 May 2024 at 14:10, Mich Talebzadeh 
> wrote:
>
>> @Wenchen Fan 
>>
>> Thanks for the update! To clarify, is the vote for approving a specific
>> preview build, or is it for moving towards an RC stage? I gather there is a
>> distinction between these two?
>>
>>
>> Mich Talebzadeh,
>> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* The information provided is correct to the best of my
>> knowledge but of course cannot be guaranteed. It is essential to note
>> that, as with any advice: "one test result is worth one-thousand
>> expert opinions" (Wernher von Braun
>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>).
>>
>>
>> On Mon, 6 May 2024 at 13:03, Wenchen Fan  wrote:
>>
>>> The preview release also needs a vote. I'll try my best to cut the RC on
>>> Monday, but the actual release may take some time. Hopefully, we can get it
>>> out this week but if the vote fails, it will take longer as we need more
>>> RCs.
>>>
>>> On Mon, May 6, 2024 at 7:22 AM Dongjoon Hyun 
>>> wrote:
>>>
>>>> +1 for Holden's comment. Yes, it would be great to mention `it` as
>>>> "soon".
>>>> (If Wenchen release it on Monday, we can simply mention the release)
>>>>
>>>> In addition, Apache Spark PMC received an official notice from ASF
>>>> Infra team.
>>>>
>>>> https://lists.apache.org/thread/rgy1cg17tkd3yox7qfq87ht12sqclkbg
>>>> > [NOTICE] Apache Spark's GitHub Actions usage exceeds allowances for
>>>> ASF projects
>>>>
>>>> To track and comply with the new ASF Infra Policy as much as possible,
>>>> we opened a blocker-level JIRA issue and have been working on it.
>>>> - https://infra.apache.org/github-actions-policy.html
>>>>
>>>> Please include a sentence that the Apache Spark PMC is working on this
>>>> under the following umbrella JIRA issue.
>>>>
>>>> https://issues.apache.org/jira/browse/SPARK-48094
>>>> > Reduce GitHub Action usage according to ASF project allowance
>>>>
>>>> Thanks,
>>>> Dongjoon.
>>>>
>>>>
>>>> On Sun, May 5, 2024 at 3:45 PM Holden Karau 

Re: ASF board report draft for May

2024-05-06 Thread Holden Karau
If folks are against the term "soon" we could say "in-progress".

Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


On Mon, May 6, 2024 at 2:08 AM Mich Talebzadeh 
wrote:

> Hi,
>
> We should reconsider using the term "soon" for the ASF board as it is
> subjective with no date (assuming this is an official communication on
> Wednesday). We ought to say
>
>  "Spark 4, the next major release after Spark 3.x, is currently under
> development. We plan to make a preview version available for evaluation as
> soon as it is feasible"
>
> HTH
>
> Mich Talebzadeh,
> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
> London
> United Kingdom
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed. It is essential to note
> that, as with any advice: "one test result is worth one-thousand
> expert opinions" (Wernher von Braun
> <https://en.wikipedia.org/wiki/Wernher_von_Braun>).
>
>
> On Mon, 6 May 2024 at 05:09, Dongjoon Hyun 
> wrote:
>
>> +1 for Holden's comment. Yes, it would be great to mention `it` as
>> "soon".
>> (If Wenchen release it on Monday, we can simply mention the release)
>>
>> In addition, Apache Spark PMC received an official notice from ASF Infra
>> team.
>>
>> https://lists.apache.org/thread/rgy1cg17tkd3yox7qfq87ht12sqclkbg
>> > [NOTICE] Apache Spark's GitHub Actions usage exceeds allowances for ASF
>> projects
>>
>> To track and comply with the new ASF Infra Policy as much as possible, we
>> opened a blocker-level JIRA issue and have been working on it.
>> - https://infra.apache.org/github-actions-policy.html
>>
>> Please include a sentence that the Apache Spark PMC is working on this
>> under the following umbrella JIRA issue.
>>
>> https://issues.apache.org/jira/browse/SPARK-48094
>> > Reduce GitHub Action usage according to ASF project allowance
>>
>> Thanks,
>> Dongjoon.
>>
>>
>> On Sun, May 5, 2024 at 3:45 PM Holden Karau 
>> wrote:
>>
>>> Do we want to include that we’re planning on having a preview release of
>>> Spark 4 so folks can see the APIs “soon”?
>>>
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>>>
>>> On Sun, May 5, 2024 at 3:24 PM Matei Zaharia 
>>> wrote:
>>>
>>>> It’s time for our quarterly ASF board report on Apache Spark this
>>>> Wednesday. Here’s a draft, feel free to suggest changes.
>>>>
>>>> 
>>>>
>>>> Description:
>>>>
>>>> Apache Spark is a fast and general purpose engine for large-scale data
>>>> processing. It offers high-level APIs in Java, Scala, Python, R and SQL as
>>>> well as a rich set of libraries including stream processing, machine
>>>> learning, and graph analytics.
>>>>
>>>> Issues for the board:
>>>>
>>>> - None
>>>>
>>>> Project status:
>>>>
>>>> - We made two patch releases: Spark 3.5.1 on February 28, 2024, and
>>>> Spark 3.4.3 on April 18, 2024.
>>>> - The votes on "SPIP: Structured Logging Framework for Apache Spark"
>>>> and "Pure Python Package in PyPI (Spark Connect)" have passed.
>>>> - The votes for two behavior changes have passed: "SPARK-44444: Use
>>>> ANSI SQL mode by default" and "SPARK-46122: Set
>>>> spark.sql.legacy.createHiveTableByDefault to false".
>>>> - The community decided that the upcoming Spark 4.0 release will drop
>>>> support for Python 3.8.
>>>> - We started a discussion about the definition of behavior changes,
>>>> which is critical for version upgrades and user experience.
>>>> - We've opened a dedicated repository for the Spark Kubernetes Operator
>>>> at https://github.com/apache/spark-kubernetes-operator. We added a new
>>>> version in Apache Spark JIRA for versioning of the Spark operator based on
>>>> a vote result.
>>>>
>>>> Trademarks:
>>>>
>>>> - No changes since the last report.
>>>>
>>>> Latest releases:
>>>> - Spark 3.4.3 was released on April 18, 2024
>>>> - Spark 3.5.1 was released on February 28, 2024
>>>> - Spark 3.3.4 was released on December 16, 2023
>>>>
>>>> Committers and PMC:
>>>>
>>>> - The latest committer was added on Oct 2nd, 2023 (Jiaan Geng).
>>>> - The latest PMC members were added on Oct 2nd, 2023 (Yuanjian Li and
>>>> Yikun Jiang).
>>>>
>>>> 
>>>> -
>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>
>>>>


Re: ASF board report draft for May

2024-05-05 Thread Holden Karau
Do we want to include that we’re planning on having a preview release of
Spark 4 so folks can see the APIs “soon”?

Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


On Sun, May 5, 2024 at 3:24 PM Matei Zaharia 
wrote:

> It’s time for our quarterly ASF board report on Apache Spark this
> Wednesday. Here’s a draft, feel free to suggest changes.
>
> 
>
> Description:
>
> Apache Spark is a fast and general purpose engine for large-scale data
> processing. It offers high-level APIs in Java, Scala, Python, R and SQL as
> well as a rich set of libraries including stream processing, machine
> learning, and graph analytics.
>
> Issues for the board:
>
> - None
>
> Project status:
>
> - We made two patch releases: Spark 3.5.1 on February 28, 2024, and Spark
> 3.4.3 on April 18, 2024.
> - The votes on "SPIP: Structured Logging Framework for Apache Spark" and
> "Pure Python Package in PyPI (Spark Connect)" have passed.
> - The votes for two behavior changes have passed: "SPARK-44444: Use ANSI
> SQL mode by default" and "SPARK-46122: Set
> spark.sql.legacy.createHiveTableByDefault to false".
> - The community decided that the upcoming Spark 4.0 release will drop support
> for Python 3.8.
> - We started a discussion about the definition of behavior changes, which is
> critical for version upgrades and user experience.
> - We've opened a dedicated repository for the Spark Kubernetes Operator at
> https://github.com/apache/spark-kubernetes-operator. We added a new
> version in Apache Spark JIRA for versioning of the Spark operator based on
> a vote result.
>
> Trademarks:
>
> - No changes since the last report.
>
> Latest releases:
> - Spark 3.4.3 was released on April 18, 2024
> - Spark 3.5.1 was released on February 28, 2024
> - Spark 3.3.4 was released on December 16, 2023
>
> Committers and PMC:
>
> - The latest committer was added on Oct 2nd, 2023 (Jiaan Geng).
> - The latest PMC members were added on Oct 2nd, 2023 (Yuanjian Li and
> Yikun Jiang).
>
> 
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [DISCUSS] Spark 4.0.0 release

2024-05-01 Thread Holden Karau
+1 :) yay previews

On Wed, May 1, 2024 at 5:36 PM Chao Sun  wrote:

> +1
>
> On Wed, May 1, 2024 at 5:23 PM Xiao Li  wrote:
>
>> +1 for next Monday.
>>
>> We can do more previews when the other features are ready for preview.
>>
>> Tathagata Das wrote on Wed, May 1, 2024 at 08:46:
>>
>>> Next week sounds great! Thank you Wenchen!
>>>
>>> On Wed, May 1, 2024 at 11:16 AM Wenchen Fan  wrote:
>>>
 Yea I think a preview release won't hurt (without a branch cut). We
 don't need to wait for all the ongoing projects to be ready. How about we
 do a 4.0 preview release based on the current master branch next Monday?

 On Wed, May 1, 2024 at 11:06 PM Tathagata Das <
 tathagata.das1...@gmail.com> wrote:

> Hey all,
>
> Reviving this thread, but Spark master has already accumulated a huge
> amount of changes.  As a downstream project maintainer, I want to really
> start testing the new features and other breaking changes, and it's hard 
> to
> do that without a Preview release. So the sooner we make a Preview 
> release,
> the faster we can start getting feedback for fixing things for a great
> Spark 4.0 final release.
>
> So I urge the community to produce a Spark 4.0 Preview soon even if
> certain features targeting the Delta 4.0 release are still incomplete.
>
> Thanks!
>
>
> On Wed, Apr 17, 2024 at 8:35 AM Wenchen Fan 
> wrote:
>
>> Thank you all for the replies!
>>
>> To @Nicholas Chammas  : Thanks for
>> cleaning up the error terminology and documentation! I've merged the 
>> first
>> PR and let's finish others before the 4.0 release.
>> To @Dongjoon Hyun  : Thanks for driving the
>> ANSI on by default effort! Now the vote has passed, let's flip the config
>> and finish the DataFrame error context feature before 4.0.
>> To @Jungtaek Lim  : Ack. We can treat
>> the Streaming state store data source as completed for 4.0 then.
>> To @Cheng Pan  : Yea we definitely should have
>> a preview release. Let's collect more feedback on the ongoing projects 
>> and
>> then we can propose a date for the preview release.
>>
>> On Wed, Apr 17, 2024 at 1:22 PM Cheng Pan  wrote:
>>
>>> will we have preview release for 4.0.0 like we did for 2.0.0 and
>>> 3.0.0?
>>>
>>> Thanks,
>>> Cheng Pan
>>>
>>>
>>> > On Apr 15, 2024, at 09:58, Jungtaek Lim <
>>> kabhwan.opensou...@gmail.com> wrote:
>>> >
>>> > W.r.t. state data source - reader (SPARK-45511), there are several
>>> follow-up tickets, but we don't plan to address them soon. The current
>>> implementation is the final shape for Spark 4.0.0, unless there are 
>>> demands
>>> on the follow-up tickets.
>>> >
>>> > We may want to check the plan for transformWithState - my
>>> understanding is that we want to release the feature to 4.0.0, but there
>>> are several remaining works to be done. While the tentative timeline for
>>> releasing is June 2024, what would be the tentative timeline for the RC 
>>> cut?
>>> > (cc. Anish to add more context on the plan for transformWithState)
>>> >
>>> > On Sat, Apr 13, 2024 at 3:15 AM Wenchen Fan 
>>> wrote:
>>> > Hi all,
>>> >
>>> > It's close to the previously proposed 4.0.0 release date (June
>>> 2024), and I think it's time to prepare for it and discuss the ongoing
>>> projects:
>>> > • ANSI by default
>>> > • Spark Connect GA
>>> > • Structured Logging
>>> > • Streaming state store data source
>>> > • new data type VARIANT
>>> > • STRING collation support
>>> > • Spark k8s operator versioning
>>> > Please help to add more items to this list that are missed here. I
>>> would like to volunteer as the release manager for Apache Spark 4.0.0 if
>>> there is no objection. Thank you all for the great work that fills Spark
>>> 4.0!
>>> >
>>> > Wenchen Fan
>>>
>>>

-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: [VOTE] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-26 Thread Holden Karau
+1
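For context on what the flipped default changes in practice, a PySpark sketch;
the config name comes from the vote below, and the behavioral summary (a plain
CREATE TABLE producing a native data source table rather than a Hive SerDe
table) is my reading of the linked proposal:

    # Sketch: with the legacy flag off, CREATE TABLE without a USING clause
    # creates a native data source table per spark.sql.sources.default.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder.master("local[1]")
             .config("spark.sql.legacy.createHiveTableByDefault", "false")
             .getOrCreate())
    spark.sql("CREATE TABLE t (id INT)")  # Parquet-backed data source table
    spark.sql("DESCRIBE EXTENDED t").show(truncate=False)
    spark.stop()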

Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


On Fri, Apr 26, 2024 at 12:06 PM L. C. Hsieh  wrote:

> +1
>
> On Fri, Apr 26, 2024 at 10:01 AM Dongjoon Hyun 
> wrote:
> >
> > I'll start with my +1.
> >
> > Dongjoon.
> >
> > On 2024/04/26 16:45:51 Dongjoon Hyun wrote:
> > > Please vote on SPARK-46122 to set
> spark.sql.legacy.createHiveTableByDefault
> > > to `false` by default. The technical scope is defined in the following
> PR.
> > >
> > > - DISCUSSION:
> > > https://lists.apache.org/thread/ylk96fg4lvn6klxhj6t6yh42lyqb8wmd
> > > - JIRA: https://issues.apache.org/jira/browse/SPARK-46122
> > > - PR: https://github.com/apache/spark/pull/46207
> > >
> > > The vote is open until April 30th 1AM (PST) and passes
> > > if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
> > >
> > > [ ] +1 Set spark.sql.legacy.createHiveTableByDefault to false by
> default
> > > [ ] -1 Do not change spark.sql.legacy.createHiveTableByDefault because
> ...
> > >
> > > Thank you in advance.
> > >
> > > Dongjoon
> > >
> >
> > -
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [FYI] SPARK-47993: Drop Python 3.8

2024-04-25 Thread Holden Karau
+1

Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


On Thu, Apr 25, 2024 at 11:18 AM Maciej  wrote:

> +1
>
> Best regards,
> Maciej Szymkiewicz
>
> Web: https://zero323.net
> PGP: A30CEF0C31A501EC
>
> On 4/25/24 6:21 PM, Reynold Xin wrote:
>
> +1
>
> On Thu, Apr 25, 2024 at 9:01 AM Santosh Pingale
>  
> wrote:
>
>> +1
>>
>> On Thu, Apr 25, 2024, 5:41 PM Dongjoon Hyun 
>> wrote:
>>
>>> FYI, there is a proposal to drop Python 3.8 because its EOL is October
>>> 2024.
>>>
>>> https://github.com/apache/spark/pull/46228
>>> [SPARK-47993][PYTHON] Drop Python 3.8
>>>
>>> Since it's still alive and there will be an overlap between the
>>> lifecycle of Python 3.8 and Apache Spark 4.0.0, please give us your
>>> feedback on the PR, if you have any concerns.
>>>
>>> From my side, I agree with this decision.
>>>
>>> Thanks,
>>> Dongjoon.
>>>
>>


Re: [VOTE] SPARK-44444: Use ANSI SQL mode by default

2024-04-13 Thread Holden Karau
+1 -- even if it's not perfect, now is the time to change default values
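For anyone skimming the archive, a minimal PySpark sketch of the default being
voted on; spark.sql.ansi.enabled is the real config, and the inline output
notes are illustrative:

    # With ANSI mode off, an invalid cast silently yields NULL; with it on,
    # the same query fails fast instead of quietly corrupting results.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[1]").getOrCreate()

    spark.conf.set("spark.sql.ansi.enabled", "false")
    spark.sql("SELECT CAST('abc' AS INT) AS v").show()  # prints NULL

    spark.conf.set("spark.sql.ansi.enabled", "true")
    try:
        spark.sql("SELECT CAST('abc' AS INT) AS v").show()
    except Exception as e:
        print("ANSI mode surfaced the bad input:", type(e).__name__)
    spark.stop()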

On Sat, Apr 13, 2024 at 4:11 PM Hyukjin Kwon  wrote:

> +1
>
> On Sun, Apr 14, 2024 at 7:46 AM Chao Sun  wrote:
>
>> +1.
>>
>> This feature is very helpful for guarding against correctness issues,
>> such as null results due to invalid input or math overflows. It’s been
>> there for a while now and it’s a good time to enable it by default as Spark
>> enters the next major release.
>>
>> On Sat, Apr 13, 2024 at 3:27 PM Dongjoon Hyun 
>> wrote:
>>
>>> I'll start from my +1.
>>>
>>> Dongjoon.
>>>
>>> On 2024/04/13 22:22:05 Dongjoon Hyun wrote:
>>> > Please vote on SPARK-44444 to use ANSI SQL mode by default.
>>> > The technical scope is defined in the following PR which is
>>> > one line of code change and one line of migration guide.
>>> >
>>> > - DISCUSSION:
>>> > https://lists.apache.org/thread/ztlwoz1v1sn81ssks12tb19x37zozxlz
>>> > - JIRA: https://issues.apache.org/jira/browse/SPARK-44444
>>> > - PR: https://github.com/apache/spark/pull/46013
>>> >
>>> > The vote is open until April 17th 1AM (PST) and passes
>>> > if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>> >
>>> > [ ] +1 Use ANSI SQL mode by default
>>> > [ ] -1 Do not use ANSI SQL mode by default because ...
>>> >
>>> > Thank you in advance.
>>> >
>>> > Dongjoon
>>> >
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>

-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: Introducing Apache Gluten(incubating), a middle layer to offload Spark to native engine

2024-04-10 Thread Holden Karau
On Wed, Apr 10, 2024 at 9:54 PM Binwei Yang  wrote:

>
> Gluten currently already supports the Velox backend and the ClickHouse
> backend. DataFusion support has also been proposed, but no one has worked
> on it yet.
>
> Gluten isn't a POC. It's under active development, but some companies
> already use it.
>
>
> On 2024/04/11 03:32:01 Dongjoon Hyun wrote:
> > I'm interested in your claim.
> >
> > Could you elaborate or provide some evidence for your claim, *a door for
> > all native libraries*, Binwei?
> >
> > For example, is there any POC for that claim? Maybe, did I miss something
> > in that SPIP?
>
I think the concern here is that there are multiple different layers to get
from Spark -> native code, and ideally any changes we introduce in Spark
would be for common functionality that is useful across them (e.g.
DataFusion Comet, Gluten, Photon*, etc.).


* Photon being harder to guess at since it's closed source.

> >
> > Dongjoon.
> >
> > On Wed, Apr 10, 2024 at 8:19 PM Binwei Yang  wrote:
> >
> > >
> > > The SPIP is not for the current Gluten, but opens a door to support
> > > for all native libraries and accelerators.
> > >
> > > On 2024/04/11 00:27:43 Weiting Chen wrote:
> > > > Yes, the 1st Apache release (v1.2.0) for Gluten will be in September.
> > > > For Spark version support, Gluten v1.1.1 currently supports Spark 3.2
> > > > and 3.3.
> > > > We are planning to support Spark 3.4 and 3.5 in Gluten v1.2.0.
> > > > Spark 4.0 support for Gluten depends on the release schedule of the
> > > > Spark community.
> > > >
> > > > On 2024/04/09 07:14:13 Dongjoon Hyun wrote:
> > > > > Thank you for sharing, Weiting.
> > > > >
> > > > > Do you think you can share the future milestone of Apache Gluten?
> > > > > I'm wondering when the first stable release will come and how we
> > > > > can coordinate across the ASF communities.
> > > > >
> > > > > > This project is still under active development now, and doesn't
> > > > > > have a stable release.
> > > > > > https://github.com/apache/incubator-gluten/releases/tag/v1.1.1
> > > > >
> > > > > In the Apache Spark community, Apache Spark 3.2 and 3.3 are at the
> > > > > end of support.
> > > > > And, 3.4 will have 3.4.3 next week, and 3.4.4 (another EOL release)
> > > > > is scheduled in October.
> > > > >
> > > > > For the SPIP, I guess it's applicable for Apache Spark 4.0.0 only
> > > > > if there is something we need to do from the Spark side.
> > > > >
> > > > > Thanks,
> > > > > Dongjoon.
> > > > >
> > > > >
> > > > > On Mon, Apr 8, 2024 at 11:19 PM WeitingChen <
> weitingc...@apache.org>
> > > wrote:
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > We are excited to introduce a new Apache incubating project
> > > > > > called Gluten. Gluten serves as a middleware layer designed to
> > > > > > offload Spark to native engines like Velox or ClickHouse.
> > > > > > For more detailed information, please visit the project
> > > > > > repository at https://github.com/apache/incubator-gluten
> > > > > >
> > > > > > Additionally, a new Spark SPIP related to Spark + Gluten
> > > > > > collaboration has been proposed at
> > > > > > https://issues.apache.org/jira/browse/SPARK-47773.
> > > > > > We eagerly await feedback from the Spark community.
> > > > > >
> > > > > > Thanks,
> > > > > > Weiting.
> > > > > >
> > > > > >
> > > > >
> > > >
> > > > -
> > > > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> > > >
> > > >
> > >
> > > -
> > > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> > >
> > >
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>

-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: SPIP: Enhancing the Flexibility of Spark's Physical Plan to Enable Execution on Various Native Engines

2024-04-09 Thread Holden Karau
I like the idea of improving the flexibility of Spark's physical plans, and
really anything that might reduce code duplication among the ~4 or so
different accelerators.
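As a toy illustration of the convert/validate/fallback flow Jia describes
below (all names here are hypothetical stand-ins, not the SPIP's actual
interfaces):

    # Toy model: convert each plan node, validate that the native engine
    # supports the result, and fall back to vanilla Spark otherwise.
    SUPPORTED = {"Project", "Filter", "HashAggregate"}

    def try_convert(node: str):
        # Return a Substrait-like representation, or None if unsupported.
        return {"op": node} if node in SUPPORTED else None

    def offload(plan_nodes: list):
        converted = [try_convert(n) for n in plan_nodes]
        if all(c is not None for c in converted):
            return ("native", converted)    # run on Velox/ClickHouse
        return ("vanilla", plan_nodes)      # validation failed: fall back

    print(offload(["Project", "Filter"]))         # ('native', ...)
    print(offload(["Project", "SortMergeJoin"]))  # ('vanilla', ...)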

Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


On Tue, Apr 9, 2024 at 3:14 AM Dongjoon Hyun 
wrote:

> Thank you for sharing, Jia.
>
> I have the same questions like the previous Weiting's thread.
>
> Do you think you can share the future milestone of Apache Gluten?
> I'm wondering when the first stable release will come and how we can
> coordinate across the ASF communities.
>
> > This project is still under active development now, and doesn't have a
> stable release.
> > https://github.com/apache/incubator-gluten/releases/tag/v1.1.1
>
> In the Apache Spark community, Apache Spark 3.2 and 3.3 are at the end of
> support.
> And, 3.4 will have 3.4.3 next week and 3.4.4 (another EOL release) is
> scheduled in October.
>
> For the SPIP, I guess it's applicable for Apache Spark 4.0.0 only if there
> is something we need to do from Spark side.
>
+1 I think any changes need to target 4.0

>
> Thanks,
> Dongjoon.
>
>
> On Tue, Apr 9, 2024 at 12:22 AM Ke Jia  wrote:
>
>> Apache Spark currently lacks an official mechanism to support
>> cross-platform execution of physical plans. The Gluten project offers a
>> mechanism that utilizes the Substrait standard to convert and optimize
>> Spark's physical plans. By introducing Gluten's plan conversion,
>> validation, and fallback mechanisms into Spark, we can significantly
>> enhance the portability and interoperability of Spark's physical plans,
>> enabling them to operate across a broader spectrum of execution
>> environments without requiring users to migrate. It also improves
>> Spark's execution efficiency by utilizing Gluten's advanced
>> optimization techniques. The integration of Gluten into Spark has
>> already shown significant performance improvements with the ClickHouse
>> and Velox backends and has been successfully deployed in production by
>> several customers.
>>
>> References:
>> JIRA Ticket 
>> SPIP Doc
>> 
>>
>> Your feedback and comments are welcome and appreciated.  Thanks.
>>
>> Thanks,
>> Jia Ke
>>
>


Re: Apache Spark 3.4.3 (?)

2024-04-06 Thread Holden Karau
Sounds good to me :)

Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


On Sat, Apr 6, 2024 at 2:51 PM Dongjoon Hyun 
wrote:

> Hi, All.
>
> Apache Spark 3.4.2 tag was created on Nov 24th and `branch-3.4` has 85
> commits including important security and correctness patches like
> SPARK-45580, SPARK-46092, SPARK-46466, SPARK-46794, and SPARK-46862.
>
> https://github.com/apache/spark/releases/tag/v3.4.2
>
> $ git log --oneline v3.4.2..HEAD | wc -l
>   85
>
> SPARK-45580 Subquery changes the output schema of the outer query
> SPARK-46092 Overflow in Parquet row group filter creation causes incorrect
> results
> SPARK-46466 Vectorized parquet reader should never do rebase for timestamp
> ntz
> SPARK-46794 Incorrect results due to inferred predicate from checkpoint
> with subquery
> SPARK-46862 Incorrect count() of a dataframe loaded from CSV datasource
> SPARK-45445 Upgrade snappy to 1.1.10.5
> SPARK-47428 Upgrade Jetty to 9.4.54.v20240208
> SPARK-46239 Hide `Jetty` info
>
>
> Currently, I'm checking more applicable patches for branch-3.4. I'd like
> to propose to release Apache Spark 3.4.3 and volunteer as the release
> manager for Apache Spark 3.4.3. If there are no additional blockers, the
> first tentative RC1 vote date is April 15th (Monday).
>
> WDYT?
>
>
> Dongjoon.
>


Re: [VOTE] SPIP: Pure Python Package in PyPI (Spark Connect)

2024-04-01 Thread Holden Karau
+1

Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


On Mon, Apr 1, 2024 at 5:44 PM Xinrong Meng  wrote:

> +1
>
> Thank you @Hyukjin Kwon 
>
> On Mon, Apr 1, 2024 at 10:19 AM Felix Cheung 
> wrote:
>
>> +1
>> --
>> *From:* Denny Lee 
>> *Sent:* Monday, April 1, 2024 10:06:14 AM
>> *To:* Hussein Awala 
>> *Cc:* Chao Sun ; Hyukjin Kwon ;
>> Mridul Muralidharan ; dev 
>> *Subject:* Re: [VOTE] SPIP: Pure Python Package in PyPI (Spark Connect)
>>
>> +1 (non-binding)
>>
>>
>> On Mon, Apr 1, 2024 at 9:24 AM Hussein Awala  wrote:
>>
>> +1 (non-binding). I'd add that it will also simplify package maintenance
>> and make it easy to release a bug fix or new feature without needing to
>> wait for a PySpark release.
>>
>> On Mon, Apr 1, 2024 at 4:56 PM Chao Sun  wrote:
>>
>> +1
>>
>> On Sun, Mar 31, 2024 at 10:31 PM Hyukjin Kwon 
>> wrote:
>>
>> Oh, I didn't send the discussion thread out as it's pretty simple and
>> non-invasive, and the discussion was sort of done as part of the Spark
>> Connect initial discussion...
>>
>> On Mon, Apr 1, 2024 at 1:59 PM Mridul Muralidharan 
>> wrote:
>>
>>
>> Can you point me to the SPIP’s discussion thread, please?
>> I was not able to find it, but I was on vacation, and so might have
>> missed this …
>>
>>
>> Regards,
>> Mridul
>>
>>
>> On Sun, Mar 31, 2024 at 9:08 PM Haejoon Lee
>>  wrote:
>>
>> +1
>>
>> On Mon, Apr 1, 2024 at 10:15 AM Hyukjin Kwon 
>> wrote:
>>
>> Hi all,
>>
>> I'd like to start the vote for SPIP: Pure Python Package in PyPI (Spark
>> Connect)
>>
>> JIRA 
>> Prototype 
>> SPIP doc
>> 
>>
>> Please vote on the SPIP for the next 72 hours:
>>
>> [ ] +1: Accept the proposal as an official SPIP
>> [ ] +0
>> [ ] -1: I don’t think this is a good idea because …
>>
>> Thanks.
>>
>>


Re: [VOTE] SPIP: Structured Logging Framework for Apache Spark

2024-03-12 Thread Holden Karau
+1

Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


On Mon, Mar 11, 2024 at 7:44 PM Reynold Xin 
wrote:

> +1
>
>
> On Mon, Mar 11 2024 at 7:38 PM, Jungtaek Lim 
> wrote:
>
>> +1 (non-binding), thanks Gengliang!
>>
>> On Mon, Mar 11, 2024 at 5:46 PM Gengliang Wang  wrote:
>>
>>> Hi all,
>>>
>>> I'd like to start the vote for SPIP: Structured Logging Framework for
>>> Apache Spark
>>>
>>> References:
>>>
>>>- JIRA ticket 
>>>- SPIP doc
>>>
>>> 
>>>- Discussion thread
>>>
>>>
>>> Please vote on the SPIP for the next 72 hours:
>>>
>>> [ ] +1: Accept the proposal as an official SPIP
>>> [ ] +0
>>> [ ] -1: I don’t think this is a good idea because …
>>>
>>> Thanks!
>>> Gengliang Wang
>>>
>>


Re: Generating config docs automatically

2024-02-21 Thread Holden Karau
I think this is a good idea. I like having everything in one source of
truth rather than two (so option 1 sounds better to me), but that's just my
opinion. I'd be happy to help with reviews though.
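To make the table-generation idea below concrete, a toy Python sketch; the
tuple layout is a hypothetical stand-in for the metadata a `ConfigEntry` (or
the proposed YAML file) would carry:

    # Render the docs table from config definitions instead of maintaining
    # the HTML by hand; internal configs are filtered out automatically.
    CONFIGS = [
        # (name, default, doc, internal)
        ("spark.sql.ansi.enabled", "true", "Use ANSI SQL mode.", False),
        ("spark.sql.someInternalKnob", "false", "Internal only.", True),
    ]

    def render_table(configs):
        rows = [
            f"<tr><td><code>{name}</code></td><td>{default}</td><td>{doc}</td></tr>"
            for name, default, doc, internal in configs
            if not internal
        ]
        return "<table>\n" + "\n".join(rows) + "\n</table>"

    print(render_table(CONFIGS))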

On Wed, Feb 21, 2024 at 6:37 AM Nicholas Chammas 
wrote:

> I know config documentation is not the most exciting thing. If there is
> anything I can do to make this as easy as possible for a committer to
> shepherd, I’m all ears!
>
>
> On Feb 14, 2024, at 8:53 PM, Nicholas Chammas 
> wrote:
>
> I’m interested in automating our config documentation and need input from
> a committer who is interested in shepherding this work.
>
> We have around 60 tables of configs across our documentation. Here’s a
> typical example.
> 
>
> These tables span several thousand lines of manually maintained HTML,
> which poses a few problems:
>
>- The documentation for a given config is sometimes out of sync across
>the HTML table and its source `ConfigEntry`.
>- Internal configs that are not supposed to be documented publicly
>sometimes are.
>- Many config names and defaults are extremely long, posing formatting
>problems.
>
>
> Contributors waste time dealing with these issues in a losing battle to
> keep everything up-to-date and consistent.
>
> I’d like to solve all these problems by generating HTML tables
> automatically from the `ConfigEntry` instances where the configs are
> defined.
>
> I’ve proposed two alternative solutions:
>
>- #44755 : Enhance
>`ConfigEntry` so a config can be associated with one or more groups, and
>use that new metadata to generate the tables we need.
>- #44756 : Add a
>standalone YAML file where we define config groups, and use that to
>generate the tables we need.
>
>
> If you’re a committer and are interested in this problem, please chime in
> on whatever approach appeals to you. If you think this is a bad idea, I’m
> also eager to hear your feedback.
>
> Nick
>
>


Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-13 Thread Holden Karau
This looks really cool :) Out of interest, what are the differences in
approach between this and Gluten?

On Tue, Feb 13, 2024 at 12:42 PM Chao Sun  wrote:

> Hi all,
>
> We are very happy to announce that Project Comet, a plugin to
> accelerate Spark query execution via leveraging DataFusion and Arrow,
> has now been open sourced under the Apache Arrow umbrella. Please
> check the project repo
> https://github.com/apache/arrow-datafusion-comet for more details if
> you are interested. We'd love to collaborate with people from the open
> source community who share similar goals.
>
> Thanks,
> Chao
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: [Spark-Core] Improving Reliability of spark when Executors OOM

2024-01-16 Thread Holden Karau
Oh, interesting solution. A co-worker was suggesting something similar using
resource profiles to increase memory, but your approach avoids a lot of
complexity; I like it (and we could extend it out to support resource
profile growth too).

I think an SPIP sounds like a great next step.
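For readers skimming, a condensed Python model of the reactive resizing Kalyan
describes below; TaskSpec and the constants are illustrative stand-ins, not
Spark scheduler APIs:

    from dataclasses import dataclass

    EXECUTOR_CORES = 4  # stands in for spark.executor.cores

    @dataclass
    class TaskSpec:
        task_id: int
        cores: int = 1  # stands in for spark.task.cpus

    def handle_failure(task: TaskSpec, reason: str, pending: list) -> None:
        if "OutOfMemoryError" in reason:
            # The retry demands every core, so nothing else can be scheduled
            # beside it and the full executor memory is its own.
            task.cores = EXECUTOR_CORES
        pending.append(task)  # reschedule the attempt

    queue: list = []
    handle_failure(TaskSpec(task_id=7), "java.lang.OutOfMemoryError", queue)
    print(queue[0].cores)  # 4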

On Tue, Jan 16, 2024 at 10:46 PM kalyan  wrote:

> Hello All,
>
> At Uber, we recently did some work on improving the reliability of
> Spark applications in scenarios where fatter executors go out of memory,
> leading to application failure. Fatter executors are those that have more
> than one task running on them concurrently at a given time. This has
> significantly improved the reliability of many Spark applications for us at
> Uber. We made a blog about this recently. Link:
> https://www.uber.com/en-US/blog/dynamic-executor-core-resizing-in-spark/
>
> At a high level, we have done the below changes:
>
>1. When a Task fails with the OOM of an executor, we update the core
>requirements of the task to max executor cores.
>2. When the task is picked for rescheduling, the new attempt of the
>task happens to be on an executor where no other task can run concurrently.
>All cores get allocated to this task itself.
>3. This way we ensure that the configured memory is completely at the
>disposal of a single task. Thus eliminating contention of memory.
>
> The best part of this solution is that it's reactive. It kicks in only
> when the executors fail with the OOM exception.
>
> We understand that the problem statement is very common and we expect our
> solution to be effective in many cases.
>
> There could be more cases that can be covered. Executor failing with OOM
> is like a hard signal. The framework (making the driver aware of
> what's happening with the executor) can be extended to handle scenarios of
> other forms of memory pressure like excessive spilling to disk, etc.
>
> While we had developed this on Spark 2.4.3 in-house, we would like to
> collaborate and contribute this work to the latest versions of Spark.
>
> What is the best way forward here? Will an SPIP proposal to detail the
> changes help?
>
> Regards,
> Kalyan.
> Uber India.
>


-- 
Cell : 425-233-8271


Re: [VOTE] SPIP: An Official Kubernetes Operator for Apache Spark

2023-11-14 Thread Holden Karau
+1

On Tue, Nov 14, 2023 at 10:21 AM DB Tsai  wrote:

> +1
>
> DB Tsai  |  https://www.dbtsai.com/  |  PGP 42E5B25A8F7A82C1
>
> On Nov 14, 2023, at 10:14 AM, Vakaris Baškirov <
> vakaris.bashki...@gmail.com> wrote:
>
> +1 (non-binding)
>
>
> On Tue, Nov 14, 2023 at 8:03 PM Chao Sun  wrote:
>
>> +1
>>
>> On Tue, Nov 14, 2023 at 9:52 AM L. C. Hsieh  wrote:
>> >
>> > +1
>> >
>> > On Tue, Nov 14, 2023 at 9:46 AM Ye Zhou  wrote:
>> > >
>> > > +1(Non-binding)
>> > >
>> > > On Tue, Nov 14, 2023 at 9:42 AM L. C. Hsieh  wrote:
>> > >>
>> > >> Hi all,
>> > >>
>> > >> I’d like to start a vote for SPIP: An Official Kubernetes Operator
>> for
>> > >> Apache Spark.
>> > >>
>> > >> The proposal is to develop an official Java-based Kubernetes operator
>> > >> for Apache Spark to automate the deployment and simplify the
>> lifecycle
>> > >> management and orchestration of Spark applications and Spark clusters
>> > >> on k8s at prod scale.
>> > >>
>> > >> This aims to reduce the learning curve and operation overhead for
>> > >> Spark users so they can concentrate on core Spark logic.
>> > >>
>> > >> Please also refer to:
>> > >>
>> > >>- Discussion thread:
>> > >> https://lists.apache.org/thread/wdy7jfhf7m8jy74p6s0npjfd15ym5rxz
>> > >>- JIRA ticket: https://issues.apache.org/jira/browse/SPARK-45923
>> > >>- SPIP doc:
>> https://docs.google.com/document/d/1f5mm9VpSKeWC72Y9IiKN2jbBn32rHxjWKUfLRaGEcLE
>> > >>
>> > >>
>> > >> Please vote on the SPIP for the next 72 hours:
>> > >>
>> > >> [ ] +1: Accept the proposal as an official SPIP
>> > >> [ ] +0
>> > >> [ ] -1: I don’t think this is a good idea because …
>> > >>
>> > >>
>> > >> Thank you!
>> > >>
>> > >> Liang-Chi Hsieh
>> > >>
>> > >> -
>> > >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> > >>
>> > >
>> > >
>> > > --
>> > >
>> > > Zhou, Ye  周晔
>> >
>> > -
>> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> >
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>


Re: [DISCUSSION] SPIP: An Official Kubernetes Operator for Apache Spark

2023-11-12 Thread Holden Karau
To be clear: I am generally supportive of the idea (+1) but have some
follow-up questions:

- Have we taken the time to learn from the other operators? Do we have a
compatible CRD/API or not (and if so, why)?
- The API seems to assume that everything is packaged in the container in
advance, but I imagine that might not be the case for many folks who have
Java or Python packages published to cloud storage that they want to use.
- What's our plan for testing the potential version explosion (not
tying ourselves to operator version -> Spark version makes a lot of sense,
but how do we reasonably assure ourselves that the cross product of
Operator Version, Kube Version, and Spark Version all function)? Do we have
CI resources for this?
- Is there a current (non-open-source) operator that folks from Apple are
using and planning to open source, or is this a fresh "from the ground up"
operator proposal?
- One of the key reasons for this is listed as "An out-of-the-box automation
solution that scales effectively", but I don't see any discussion of the
target scale or plans to achieve it.



On Thu, Nov 9, 2023 at 9:02 PM Zhou Jiang  wrote:

> Hi Spark community,
>
> I'm reaching out to initiate a conversation about the possibility of
> developing a Java-based Kubernetes operator for Apache Spark. Following the
> operator pattern (
> https://kubernetes.io/docs/concepts/extend-kubernetes/operator/), Spark
> users may manage applications and related components seamlessly using
> native tools like kubectl. The primary goal is to simplify the Spark user
> experience on Kubernetes, minimizing the learning curve and operational
> complexities and therefore enable users to focus on the Spark application
> development.
>
> Although there are several open-source Spark on Kubernetes operators
> available, none of them are officially integrated into the Apache Spark
> project. As a result, these operators may lack active support and
> development for new features. Within this proposal, our aim is to introduce
> a Java-based Spark operator as an integral component of the Apache Spark
> project. This solution has been employed internally at Apple for multiple
> years, operating millions of executors in real production environments. The
> use of Java in this solution is intended to accommodate a wider user and
> contributor audience, especially those who are familiar with Scala.
>
> Ideally, this operator should have its dedicated repository, similar to
> Spark Connect Golang or Spark Docker, allowing it to maintain a loose
> connection with the Spark release cycle. This model is also followed by the
> Apache Flink Kubernetes operator.
>
> We believe that this project holds the potential to evolve into a thriving
> community project over the long run. A comparison can be drawn with the
> Flink Kubernetes Operator: Apple has open-sourced internal Flink Kubernetes
> operator, making it a part of the Apache Flink project (
> https://github.com/apache/flink-kubernetes-operator). This move has
> gained wide industry adoption and contributions from the community. In a
> mere year, the Flink operator has garnered more than 600 stars and has
> attracted contributions from over 80 contributors. This showcases the level
> of community interest and collaborative momentum that can be achieved in
> similar scenarios.
>
> More details can be found at SPIP doc : Spark Kubernetes Operator
> https://docs.google.com/document/d/1f5mm9VpSKeWC72Y9IiKN2jbBn32rHxjWKUfLRaGEcLE
>
> Thanks,
> --
> *Zhou JIANG*
>
>

-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: Apache Spark 3.4.2 (?)

2023-11-06 Thread Holden Karau
+1

On Mon, Nov 6, 2023 at 4:30 PM yangjie01 
wrote:

> +1
>
>
>
> *From:* Yuming Wang 
> *Date:* Tuesday, November 7, 2023, 07:00
> *To:* Santosh Pingale 
> *Cc:* Dongjoon Hyun , dev  >
> *Subject:* Re: Apache Spark 3.4.2 (?)
>
>
>
> +1
>
>
>
> On Tue, Nov 7, 2023 at 3:55 AM Santosh Pingale
>  wrote:
>
> Makes sense given the nature of those commits.
>
>
>
> On Mon, Nov 6, 2023, 7:52 PM Dongjoon Hyun 
> wrote:
>
> Hi, All.
>
> Apache Spark 3.4.1 tag was created on Jun 19th and `branch-3.4` has 103
> commits including important security and correctness patches like
> SPARK-44251, SPARK-44805, and SPARK-44940.
>
> https://github.com/apache/spark/releases/tag/v3.4.1
> 
>
> $ git log --oneline v3.4.1..HEAD | wc -l
> 103
>
> SPARK-44251 Potential for incorrect results or NPE when full outer
> USING join has null key value
> SPARK-44805 Data lost after union using
> spark.sql.parquet.enableNestedColumnVectorizedReader=true
> SPARK-44940 Improve performance of JSON parsing when
> "spark.sql.json.enablePartialResults" is enabled
>
> Currently, I'm checking the following open correctness issues. I'd like to
> propose to release Apache Spark 3.4.2 after resolving them and volunteer as
> the release manager for Apache Spark 3.4.2. If there are no additional
> blockers, the first tentative RC1 vote date is November 13th (Monday). If
> it takes some time to resolve the open correctness issues, we can start the
> vote after Thanksgiving holiday.
>
> SPARK-44512 dataset.sort.select.write.partitionBy sorts wrong column
> SPARK-45282 Join loses records for cached datasets
>
> WDYT?
>
> Dongjoon.
>
>


Re: Write Spark Connection client application in Go

2023-09-12 Thread Holden Karau
That’s so cool! Great work y’all :)

On Tue, Sep 12, 2023 at 8:14 PM bo yang  wrote:

> Hi Spark Friends,
>
> Anyone interested in using Golang to write a Spark application? We created
> a Spark Connect Go Client library.
> Would love to hear feedback/thoughts from the community.
>
> Please see the quick start guide about how to use it. Following is a very
> short Spark Connect application in Go:
>
> func main() {
>   spark, _ := sql.SparkSession.Builder.Remote("sc://localhost:15002").Build()
>   defer spark.Stop()
>
>   df, _ := spark.Sql("select 'apple' as word, 123 as count union all select 'orange' as word, 456 as count")
>   df.Show(100, false)
>   df.Collect()
>
>   df.Write().Mode("overwrite").
>     Format("parquet").
>     Save("file:///tmp/spark-connect-write-example-output.parquet")
>
>   df = spark.Read().Format("parquet").
>     Load("file:///tmp/spark-connect-write-example-output.parquet")
>   df.Show(100, false)
>
>   df.CreateTempView("view1", true, false)
>   df, _ = spark.Sql("select count, word from view1 order by count")
> }
>
>
> Many thanks to Martin, Hyukjin, Ruifeng and Denny for creating and working
> together on this repo! Welcome more people to contribute :)
>
> Best,
> Bo
>
>


Re: [VOTE] Release Apache Spark 3.5.0 (RC4)

2023-09-07 Thread Holden Karau
+1 pip installing seems to function :)
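A minimal smoke test of the sort that backs a pip-install check, assuming the
RC's pyspark tarball from the -bin/ directory linked below has been installed
first:

    # pip install pyspark-3.5.0.tar.gz, then:
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[2]").getOrCreate()
    assert spark.range(10).selectExpr("sum(id)").first()[0] == 45
    spark.stop()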

On Thu, Sep 7, 2023 at 7:22 PM Yuming Wang  wrote:

> +1.
>
> On Thu, Sep 7, 2023 at 10:33 PM yangjie01 
> wrote:
>
>> +1
>>
>>
>>
>> *From:* Gengliang Wang 
>> *Date:* Thursday, September 7, 2023, 12:53
>> *To:* Yuanjian Li 
>> *Cc:* Xiao Li , "her...@databricks.com.invalid"
>> , Spark dev list 
>> *Subject:* Re: [VOTE] Release Apache Spark 3.5.0 (RC4)
>>
>>
>>
>> +1
>>
>>
>>
>> On Wed, Sep 6, 2023 at 9:46 PM Yuanjian Li 
>> wrote:
>>
>> +1 (non-binding)
>>
>> Xiao Li wrote on Wed, Sep 6, 2023 at 15:27:
>>
>> +1
>>
>>
>>
>> Xiao
>>
>>
>>
>> Herman van Hovell wrote on Wed, Sep 6, 2023 at 22:08:
>>
>> Tested connect, and everything looks good.
>>
>>
>>
>> +1
>>
>>
>>
>> On Wed, Sep 6, 2023 at 8:11 AM Yuanjian Li 
>> wrote:
>>
>> Please vote on releasing the following candidate(RC4) as Apache Spark
>> version 3.5.0.
>>
>>
>>
>> The vote is open until 11:59pm Pacific time *Sep 8th* and passes if a
>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>
>>
>>
>> [ ] +1 Release this package as Apache Spark 3.5.0
>>
>> [ ] -1 Do not release this package because ...
>>
>>
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>>
>>
>> The tag to be voted on is v3.5.0-rc4 (commit
>> c2939589a29dd0d6a2d3d31a8d833877a37ee02a):
>>
>> https://github.com/apache/spark/tree/v3.5.0-rc4
>>
>>
>>
>> The release files, including signatures, digests, etc. can be found at:
>>
>> https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc4-bin/
>>
>>
>>
>> Signatures used for Spark RCs can be found in this file:
>>
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>>
>>
>> The staging repository for this release can be found at:
>>
>> https://repository.apache.org/content/repositories/orgapachespark-1448
>>
>>
>>
>> The documentation corresponding to this release can be found at:
>>
>> https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc4-docs/
>>
>>
>>
>> The list of bug fixes going into 3.5.0 can be found at the following URL:
>>
>> https://issues.apache.org/jira/projects/SPARK/versions/12352848
>>
>>
>>
>> This release is using the release script of the tag v3.5.0-rc4.
>>
>>
>>
>> FAQ
>>
>>
>>
>> =
>>
>> How can I help test this release?
>>
>> =
>>
>> If you are a Spark user, you can help us test this release by taking
>>
>> an existing Spark workload and running on this release candidate, then
>>
>> reporting any regressions.
>>
>>
>>
>> If you're working in PySpark you can set up a virtual env and install
>>
>> the current RC and see if anything important breaks, in the Java/Scala
>>
>> you can add the staging repository to your project's resolvers and test
>>
>> with the RC (make sure to clean up the artifact cache before/after so
>>
>> you don't end up building with an out-of-date RC going forward).
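
To make the PySpark path above concrete, a minimal smoke test might look
like the sketch below. This is only an illustration: the pip URL points at
the RC binary directory quoted above, and the exact artifact file name
should be checked against that directory listing.

# Sketch: install the RC into a fresh virtual env, then run a tiny workload.
#   python -m venv rc-test && source rc-test/bin/activate
#   pip install "https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc4-bin/pyspark-3.5.0.tar.gz"
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rc-smoke-test").getOrCreate()

# A trivial aggregation to exercise the SQL engine end to end.
df = spark.range(100).selectExpr("id % 10 AS bucket")
assert df.groupBy("bucket").count().count() == 10

print(spark.version)  # expect 3.5.0
spark.stop()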
>>
>>
>>
>> ===
>>
>> What should happen to JIRA tickets still targeting 3.5.0?
>>
>> ===
>>
>> The current list of open tickets targeted at 3.5.0 can be found at:
>>
>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>> Version/s" = 3.5.0
>>
>>
>>
>> Committers should look at those and triage. Extremely important bug
>>
>> fixes, documentation, and API tweaks that impact compatibility should
>>
>> be worked on immediately. Everything else please retarget to an
>>
>> appropriate release.
>>
>>
>>
>> ==
>>
>> But my bug isn't fixed?
>>
>> ==
>>
>> In order to make timely releases, we will typically not hold the
>>
>> release unless the bug in question is a regression from the previous
>>
>> release. That being said, if there is something which is a regression
>>
>> that has not been correctly targeted please ping me or a committer to
>>
>> help target the issue.
>>
>>
>>
>> Thanks,
>>
>> Yuanjian Li
>>
>>

-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: [VOTE] Release Apache Spark 3.5.0 (RC3)

2023-09-02 Thread Holden Karau
Can we delay the next RC cut until after Labor Day?

On Sat, Sep 2, 2023 at 9:59 PM Yuanjian Li  wrote:

> Thank you for all the reports!
> The vote has failed. I plan to cut RC4 in two days.
>
> @Dipayan Dev  I quickly skimmed through the
> corresponding ticket, and it doesn't seem to be a regression introduced in
> 3.5. Additionally, someone is asking if this is the same issue as
> SPARK-35279.
> @Yuming Wang  I will check the signature for RC4
> @Jungtaek Lim  I will follow up with you
> regarding SPARK-45045 
> @Wenchen Fan  Agree, we should include the
> correctness fix in 3.5
>
> On Thu, Aug 31, 2023 at 23:45, Jungtaek Lim wrote:
>
>> My apologies, I have to add another ticket for a blocker, SPARK-45045.
>> That said, I'm -1 (non-binding).
>>
>> SPARK-43183  made a
>> behavioral change regarding the StreamingQueryListener as well as
>> StreamingQuery API as a side-effect, while the intention was more about
>> introducing the change in the former one. I just got some reports that the
>> behavioral change for StreamingQuery API broke various tests in 3rd party
>> data sources. To help 3rd party ecosystems to adopt 3.5 without hassle, I'd
>> like to see this be fixed in 3.5.0.
>>
>> There is no fix yet but I'm working on it. I'll give an update here.
>> Maybe we could lower the priority and let the release go, describing
>> this as a "known issue", if I couldn't make progress in a couple of days.
>> I'm sorry about that.
>>
>> Thanks,
>> Jungtaek Lim
>>
>> On Fri, Sep 1, 2023 at 12:12 PM Wenchen Fan  wrote:
>>
>>> Sorry for the last-minute bug report, but we found a regression in 3.5:
>>> the SQL INSERT command without a column list fills missing columns with
>>> NULL while Spark 3.4 does not allow it. According to the SQL standard, this
>>> shouldn't be allowed, so it is a regression in 3.5.
>>>
>>> The fix has been merged but one day after the RC3 cut:
>>> https://github.com/apache/spark/pull/42393 . I'm -1 and let's include
>>> this fix in 3.5.
>>>
>>> Thanks,
>>> Wenchen
>>>
>>> On Thu, Aug 31, 2023 at 9:09 PM Ian Manning 
>>> wrote:
>>>
 +1 (non-binding)

 Using Spark Core, Spark SQL, Structured Streaming.

 On Tue, Aug 29, 2023 at 8:12 PM Yuanjian Li 
 wrote:

> Please vote on releasing the following candidate (RC3) as Apache Spark
> version 3.5.0.
>
> The vote is open until 11:59pm Pacific time Aug 31st and passes if a
> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.5.0
>
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v3.5.0-rc3 (commit
> 9f137aa4dc43398aafa0c3e035ed3174182d7d6c):
>
> https://github.com/apache/spark/tree/v3.5.0-rc3
>
> The release files, including signatures, digests, etc. can be found at:
>
> https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc3-bin/
>
> Signatures used for Spark RCs can be found in this file:
>
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
>
> https://repository.apache.org/content/repositories/orgapachespark-1447
>
> The documentation corresponding to this release can be found at:
>
> https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc3-docs/
>
> The list of bug fixes going into 3.5.0 can be found at the following
> URL:
>
> https://issues.apache.org/jira/projects/SPARK/versions/12352848
>
> This release is using the release script of the tag v3.5.0-rc3.
>
>
> FAQ
>
> =
>
> How can I help test this release?
>
> =
>
> If you are a Spark user, you can help us test this release by taking
>
> an existing Spark workload and running on this release candidate, then
>
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
>
> the current RC and see if anything important breaks, in the Java/Scala
>
> you can add the staging repository to your project's resolvers and test
>
> with the RC (make sure to clean up the artifact cache before/after so
>
> you don't end up building with an out-of-date RC going forward).
>
> ===
>
> What should happen to JIRA tickets still targeting 3.5.0?
>
> ===
>
> The current list of open tickets targeted at 3.5.0 can be found at:
>
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.5.0
>
> Committers should look at those 

Re: [Internet]Re: Improving Dynamic Allocation Logic for Spark 4+

2023-08-23 Thread Holden Karau
One option could be to initially launch both the driver and the initial
executors (using lazy executor ID allocation), but it would introduce a lot
of complexity.

On Wed, Aug 23, 2023 at 6:44 PM Qian Sun  wrote:

> Hi Mich
>
> I agree with your opinion that the startup time of the Spark on Kubernetes
> cluster needs to be improved.
>
> Regarding fetching the image directly, I have utilized an ImageCache to
> store the images on the node, eliminating the time required to pull images
> from a remote repository. This does indeed lead to a reduction in overall
> time, and the effect becomes more pronounced as the size of the image
> increases.
>
>
> Additionally, I have observed that the driver pod takes a significant
> amount of time from running to attempting to create executor pods, with an
> estimated time expenditure of around 75%. We can also explore optimization
> options in this area.
>
> On Thu, Aug 24, 2023 at 12:58 AM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Hi all,
>>
>> On this conversation, one of the issues I brought up was the driver
>> start-up time. This is especially true in k8s. As Spark on k8s is modeled
>> on the Spark standalone scheduler, Spark on k8s consists of a single driver
>> pod (like the master in standalone mode) and a number of executors
>> ("workers"). When executed on k8s, the driver and executors run on separate
>> pods <https://spark.apache.org/docs/latest/running-on-kubernetes.html>.
>> First the driver pod is launched, then the driver pod itself launches the
>> executor pods. From my observation, in an autoscaling cluster, the driver
>> pod may take up to 40 seconds, followed by the executor pods. This is a
>> considerable time for customers and it is painfully slow. Can we actually
>> move away from the dependency on standalone mode and try to speed up k8s
>> cluster formation?
>>
>> Another naive question: when the docker image is pulled from the
>> container registry to the driver itself, this takes finite time. The
>> docker image for the executors could be different from that of the driver.
>> Since spark-submit presents this at the time of submission, can we save
>> time by fetching the docker images straight away?
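
As a point of reference, Spark already lets a job point the driver and
executors at different images, so pre-pulled or role-specific images can be
wired in today. A hedged sketch follows; the cluster endpoint and image
names are made up.

from pyspark.sql import SparkSession

# Illustrative only: substitute a real API server address and registry paths.
spark = (
    SparkSession.builder
    .master("k8s://https://kubernetes.example.com:6443")
    .config("spark.kubernetes.driver.container.image", "registry.example.com/spark-driver:3.4.1")
    .config("spark.kubernetes.executor.container.image", "registry.example.com/spark-executor:3.4.1")
    # Reuse images already cached on the node instead of always pulling.
    .config("spark.kubernetes.container.image.pullPolicy", "IfNotPresent")
    .getOrCreate()
)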
>>
>> Thanks
>>
>> Mich
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Tue, 8 Aug 2023 at 18:25, Mich Talebzadeh 
>> wrote:
>>
>>> Splendid idea. 
>>>
>>> Mich Talebzadeh,
>>> Solutions Architect/Engineering Lead
>>> London
>>> United Kingdom
>>>
>>>
>>>view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Tue, 8 Aug 2023 at 18:10, Holden Karau  wrote:
>>>
>>>> The driver itself is probably another topic; perhaps I'll make a
>>>> "faster spark start time" JIRA and a DA JIRA and we can explore both.
>>>>
>>>> On Tue, Aug 8, 2023 at 10:07 AM Mich Talebzadeh <
>>>> mich.talebza...@gmail.com> wrote:
>>>>
>>>>> From my own perspective, faster execution time, especially with Spark
>>>>> on tin boxes (Dataproc & EC2) and Spark on k8s, is something that
>>>>> customers often bring up.
>>>>>
>>>>> Poor time to onboard with autoscaling seems to be particularly singled
>>>>> out for heavy ETL jobs that use Spark. I am disappointed to see the poor
>>>>> performance of Spark on k8s autopilot with timelines starting the driver

Re: [Internet]Re: Improving Dynamic Allocation Logic for Spark 4+

2023-08-08 Thread Holden Karau
>>>> *Subject:* [Internet]Re: Improving Dynamic Allocation Logic for Spark 4+
>>>>
>>>>
>>>>
>>>> On the subject of dynamic allocation, is the following message a cause
>>>> for concern when running Spark on k8s?
>>>>
>>>>
>>>>
>>>> INFO ExecutorAllocationManager: Dynamic allocation is enabled without a
>>>> shuffle service.
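
For what it's worth, that INFO line is expected on k8s, where there is
usually no external shuffle service; dynamic allocation then relies on
shuffle tracking so executors holding shuffle data are not removed
prematurely. A minimal sketch of the relevant settings (values
illustrative):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.dynamicAllocation.enabled", "true")
    # Track shuffle files on executors since no external shuffle service exists.
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
    # How long to keep an executor holding shuffle data before releasing it.
    .config("spark.dynamicAllocation.shuffleTracking.timeout", "300s")
    .config("spark.dynamicAllocation.minExecutors", "1")
    .config("spark.dynamicAllocation.maxExecutors", "10")
    .getOrCreate()
)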
>>>>
>>>>
>>>> Mich Talebzadeh,
>>>>
>>>> Solutions Architect/Engineering Lead
>>>>
>>>> London
>>>>
>>>> United Kingdom
>>>>
>>>>
>>>>
>>>>view my Linkedin profile
>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>
>>>>
>>>>
>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Mon, 7 Aug 2023 at 23:42, Mich Talebzadeh 
>>>> wrote:
>>>>
>>>>
>>>>
>>>> Hi,
>>>>
>>>>
>>>>
>>>> From what I have seen, Spark on a serverless cluster has a hard time
>>>> getting the driver going in a timely manner.
>>>>
>>>>
>>>>
>>>> Annotations:  autopilot.gke.io/resource-adjustment:
>>>>
>>>>
>>>> {"input":{"containers":[{"limits":{"memory":"1433Mi"},"requests":{"cpu":"1","memory":"1433Mi"},"name":"spark-kubernetes-driver"}]},"output...
>>>>
>>>>   autopilot.gke.io/warden-version: 2.7.41
>>>>
>>>>
>>>>
>>>> This is on spark 3.4.1 with Java 11 both the host running spark-submit
>>>> and the docker itself
>>>>
>>>>
>>>>
>>>> I am not sure how relevant this is to this discussion but it looks like
>>>> a kind of blocker for now. What config params can help here and what can be
>>>> done?
>>>>
>>>>
>>>>
>>>> Thanks
>>>>
>>>>
>>>>
>>>> Mich Talebzadeh,
>>>>
>>>> Solutions Architect/Engineering Lead
>>>>
>>>> London
>>>>
>>>> United Kingdom
>>>>
>>>>
>>>>
>>>>view my Linkedin profile
>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>
>>>>
>>>>
>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Mon, 7 Aug 2023 at 22:39, Holden Karau  wrote:
>>>>
>>>> Oh great point
>>>>
>>>>
>>>>
>>>> On Mon, Aug 7, 2023 at 2:23 PM bo yang  wrote:
>>>>
>>>> Thanks Holden for bringing this up!
>>>>
>>>>
>>>>
>>>> Maybe another thing to think about is how to make dynamic allocation
>>>> more friendly with Kubernetes and disaggregated shuffle storage?
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Mon, Aug 7, 2023 at 1:27 PM Holden Karau 
>>>> wrote:
>>>>
>>>> So I'm wondering if there is interest in revisiting some of how Spark
>>>> is doing its dynamic allocation for Spark 4+?
>>>>
>>>>
>>>>
>>>> Some things that I've been thinking about:
>>>>
>>>>
>>>>
>>>> - Advisory user input (e.g. a way to say after X is done I know I need
>>>> Y where Y might be a bunch of GPU machines)
>>>>
>>>> - Configurable tolerance (e.g. if we have at most Z% over target no-op)
>>>>
>>>> - Past runs of same job (e.g. stage X of job Y had a peak of K)
>>>>
>>>> - Faster executor launches (I'm a little fuzzy on what we can do here,
>>>> but one area, for example, is that we set up and tear down an RPC
>>>> connection to the driver with a blocking call, which does seem to have
>>>> some locking inside of the driver at first glance)
>>>>
>>>>
>>>>
>>>> Is this an area other folks are thinking about? Should I make an epic
>>>> we can track ideas in? Or are folks generally happy with today's dynamic
>>>> allocation (or just busy with other things)?
>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> Twitter: https://twitter.com/holdenkarau
>>>>
>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>>>
>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>
>>>> --
>>>>
>>>> Twitter: https://twitter.com/holdenkarau
>>>>
>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>>>
>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>
>>>> --
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: ASF board report draft for August 2023

2023-08-08 Thread Holden Karau
Maybe add a link to the 4.0 JIRA where we are tracking the current plans
for 4.0?

On Tue, Aug 8, 2023 at 9:33 AM Dongjoon Hyun 
wrote:

> Thank you, Matei.
>
> It looks good to me.
>
> Dongjoon
>
> On Mon, Aug 7, 2023 at 22:54 Matei Zaharia 
> wrote:
>
>> It’s time to send our quarterly report to the ASF board on August 9th.
>> Here’s what I wrote as a draft — feel free to suggest changes.
>>
>> =
>>
>> Issues for the board:
>>
>> - None
>>
>> Project status:
>>
> - We cut the branch for Spark 3.5.0 on July 17th, 2023. The community is
>> working on bug fixes, tests, stability and documentation.
>> - We made a patch release, Spark 3.4.1, on June 23, 2023.
>> - We are preparing a Spark 3.3.3 release for later this month (
>> https://lists.apache.org/thread/0kgnw8njjnfgc5nghx60mn7oojvrqwj7).
>> - Votes on three Spark Project Improvement Proposals (SPIP) passed: "XML
>> data source support", "Python Data Source API", and "PySpark Test
>> Framework".
>> - A vote for "Apache Spark PMC asks Databricks to differentiate its Spark
>> version string" did not pass. This was asking a company to change the
>> string returned by Spark APIs in a product that packages a modified version
>> of Apache Spark.
>> - The community decided to release Apache Spark 4.0.0 after the 3.5.0
>> version.
>> - An official Apache Spark Docker image is now available at
>> https://hub.docker.com/_/spark
>> - A new repository, https://github.com/apache/spark-connect-go, was
>> created for the Go client of Spark Connect.
>> - The PMC voted to add two new committers to the project, XiDuo You and
>> Peter Toth
>>
>> Trademarks:
>>
>> - No changes since the last report.
>>
>> Latest releases:
>>
>> - We released Apache Spark 3.4.1 on June 23, 2023
>> - We released Apache Spark 3.2.4 on April 13, 2023
>> - We released Spark 3.3.2 on February 17, 2023
>>
>> Committers and PMC:
>>
>> - The latest committers were added on July 11th, 2023 (XiDuo You and
>> Peter Toth).
>> - The latest PMC members were added on May 10th, 2023 (Chao Sun, Xinrong
>> Meng and Ruifeng Zheng).
>>
>> =
>
> --
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: Dynamic resource allocation for structured streaming [SPARK-24815]

2023-08-07 Thread Holden Karau
Oooh, fascinating. I'm going on call this week so it will take me a while,
but I do want to review this :)

On Mon, Aug 7, 2023 at 5:30 PM Pavan Kotikalapudi
 wrote:

> Hi Spark Dev,
>
> I have extended traditional DRA to work for the structured streaming
> use case.
>
> Here is an initial Implementation draft PR
> https://github.com/apache/spark/pull/42352 and design doc:
> https://docs.google.com/document/d/1_YmfCsQQb9XhRdKh0ijbc-j8JKGtGBxYsk_30NVSTWo/edit?usp=sharing
>
> Please review and let me know what you think.
>
> Thank you,
>
> Pavan
>
-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: Improving Dynamic Allocation Logic for Spark 4+

2023-08-07 Thread Holden Karau
Oh great point

On Mon, Aug 7, 2023 at 2:23 PM bo yang  wrote:

> Thanks Holden for bringing this up!
>
> Maybe another thing to think about is how to make dynamic allocation more
> friendly with Kubernetes and disaggregated shuffle storage?
>
>
>
> On Mon, Aug 7, 2023 at 1:27 PM Holden Karau  wrote:
>
>> So I'm wondering if there is interest in revisiting some of how Spark is
>> doing its dynamic allocation for Spark 4+?
>>
>> Some things that I've been thinking about:
>>
>> - Advisory user input (e.g. a way to say after X is done I know I need Y
>> where Y might be a bunch of GPU machines)
>> - Configurable tolerance (e.g. if we have at most Z% over target no-op)
>> - Past runs of same job (e.g. stage X of job Y had a peak of K)
>> - Faster executor launches (I'm a little fuzzy on what we can do here,
>> but one area, for example, is that we set up and tear down an RPC
>> connection to the driver with a blocking call, which does seem to have
>> some locking inside of the driver at first glance)
>>
>> Is this an area other folks are thinking about? Should I make an epic we
>> can track ideas in? Or are folks generally happy with today's dynamic
>> allocation (or just busy with other things)?
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
> --
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Improving Dynamic Allocation Logic for Spark 4+

2023-08-07 Thread Holden Karau
So I'm wondering if there is interest in revisiting some of how Spark is
doing its dynamic allocation for Spark 4+?

Some things that I've been thinking about:

- Advisory user input (e.g. a way to say after X is done I know I need Y
where Y might be a bunch of GPU machines)
- Configurable tolerance (e.g. if we have at most Z% over target no-op)
- Past runs of same job (e.g. stage X of job Y had a peak of K)
- Faster executor launches (I'm a little fuzzy on what we can do here, but
one area, for example, is that we set up and tear down an RPC connection to
the driver with a blocking call, which does seem to have some locking inside
of the driver at first glance)

Is this an area other folks are thinking about? Should I make an epic we
can track ideas in? Or are folks generally happy with today's dynamic
allocation (or just busy with other things)?
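
For readers mapping the ideas above onto what exists today, the nearest
current knobs are sketched below; the values are illustrative, and none of
these fully covers the advisory-input or past-runs ideas.

# A sketch of today's dynamic allocation knobs, for comparison with the ideas above.
# These could be passed as --conf flags to spark-submit or set on a SparkConf.
dynamic_allocation_conf = {
    # Crude stand-in for advisory user input: a static floor and ceiling.
    "spark.dynamicAllocation.minExecutors": "2",
    "spark.dynamicAllocation.maxExecutors": "50",
    # Closest thing to a configurable tolerance: request only a fraction
    # of what the backlog heuristic asks for.
    "spark.dynamicAllocation.executorAllocationRatio": "0.8",
    # Backlog timing, which controls how aggressively executors are requested.
    "spark.dynamicAllocation.schedulerBacklogTimeout": "1s",
    "spark.dynamicAllocation.sustainedSchedulerBacklogTimeout": "1s",
}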

-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: [VOTE][SPIP] Python Data Source API

2023-07-07 Thread Holden Karau
+1

On Fri, Jul 7, 2023 at 9:55 AM huaxin gao  wrote:

> +1
>
> On Fri, Jul 7, 2023 at 8:59 AM Mich Talebzadeh 
> wrote:
>
>> +1 for me
>>
>> Mich Talebzadeh,
>> Solutions Architect/Engineering Lead
>> Palantir Technologies Limited
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Fri, 7 Jul 2023 at 11:05, Martin Grund 
>> wrote:
>>
>>> +1 (non-binding)
>>>
>>> On Fri, Jul 7, 2023 at 12:05 AM Denny Lee  wrote:
>>>
 +1 (non-binding)

 On Fri, Jul 7, 2023 at 00:50 Maciej  wrote:

> +0
>
> Best regards,
> Maciej Szymkiewicz
>
> Web: https://zero323.net
> PGP: A30CEF0C31A501EC
>
> On 7/6/23 17:41, Xiao Li wrote:
>
> +1
>
> Xiao
>
>> On Wed, Jul 5, 2023 at 17:28, Hyukjin Kwon wrote:
>
>> +1.
>>
>> See https://youtu.be/yj7XlTB1Jvc?t=604 :-).
>>
>> On Thu, 6 Jul 2023 at 09:15, Allison Wang
>> 
>>  wrote:
>>
>>> Hi all,
>>>
>>> I'd like to start the vote for SPIP: Python Data Source API.
>>>
>>> The high-level summary for the SPIP is that it aims to introduce a
>>> simple API in Python for Data Sources. The idea is to enable Python
>>> developers to create data sources without learning Scala or dealing with
>>> the complexities of the current data source APIs. This would make Spark
>>> more accessible to the wider Python developer community.
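
For a flavor of what such an API could look like, here is a sketch modeled
on the pyspark.sql.datasource interface that later shipped; treat the class
and method names as illustrative rather than as the SPIP's final design.

from pyspark.sql.datasource import DataSource, DataSourceReader
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

class FruitDataSource(DataSource):
    """A toy batch source that returns a fixed set of rows."""

    @classmethod
    def name(cls):
        return "fruit"

    def schema(self):
        return StructType([
            StructField("word", StringType()),
            StructField("count", IntegerType()),
        ])

    def reader(self, schema):
        return FruitReader()

class FruitReader(DataSourceReader):
    def read(self, partition):
        # Yield plain tuples matching the declared schema.
        yield ("apple", 123)
        yield ("orange", 456)

# Usage, assuming a Spark version that ships the API:
#   spark.dataSource.register(FruitDataSource)
#   spark.read.format("fruit").load().show()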
>>>
>>> References:
>>>
>>>- SPIP doc
>>>
>>> 
>>>- JIRA ticket 
>>>- Discussion thread
>>>
>>>
>>>
>>> Please vote on the SPIP for the next 72 hours:
>>>
>>> [ ] +1: Accept the proposal as an official SPIP
>>> [ ] +0
>>> [ ] -1: I don’t think this is a good idea because __.
>>>
>>> Thanks,
>>> Allison
>>>
>>

-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: [VOTE][SPIP] PySpark Test Framework

2023-06-21 Thread Holden Karau
A small request: it's Pride weekend in San Francisco, where some of the core
developers are, and it's right before one of the larger Spark-related
conferences, so more folks might be traveling than normal. Could we maybe
extend the vote out an extra day or two just to give folks a chance to be
heard?

On Wed, Jun 21, 2023 at 8:30 AM Reynold Xin  wrote:

> +1
>
> This is a great idea.
>
>
> On Wed, Jun 21, 2023 at 8:29 AM, Holden Karau 
> wrote:
>
>> I’d like to start with a +1, better Python testing tools integrated into
>> the project make sense.
>>
>> On Wed, Jun 21, 2023 at 8:11 AM Amanda Liu 
>> wrote:
>>
>>> Hi all,
>>>
>>> I'd like to start the vote for SPIP: PySpark Test Framework.
>>>
>>> The high-level summary for the SPIP is that it proposes an official test
>>> framework for PySpark. Currently, there are only disparate open-source
>>> repos and blog posts for PySpark testing resources. We can streamline and
>>> simplify the testing process by incorporating test features, such as a
>>> PySpark Test Base class (which allows tests to share Spark sessions) and
>>> test util functions (for example, asserting dataframe and schema equality).
>>>
>>> *SPIP doc:*
>>> https://docs.google.com/document/d/1OkyBn3JbEHkkQgSQ45Lq82esXjr9rm2Vj7Ih_4zycRc/edit#heading=h.f5f0u2riv07v
>>>
>>> *JIRA ticket:* https://issues.apache.org/jira/browse/SPARK-44042
>>>
>>> *Discussion thread:*
>>> https://lists.apache.org/thread/trwgbgn3ycoj8b8k8lkxko2hql23o41n
>>>
>>> Please vote on the SPIP for the next 72 hours:
>>> [ ] +1: Accept the proposal as an official SPIP
>>> [ ] +0
>>> [ ] -1: I don’t think this is a good idea because __.
>>>
>>> Thank you!
>>>
>>> Best,
>>> Amanda Liu
>>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>
> --
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: [VOTE][SPIP] PySpark Test Framework

2023-06-21 Thread Holden Karau
I’d like to start with a +1, better Python testing tools integrated into
the project make sense.

On Wed, Jun 21, 2023 at 8:11 AM Amanda Liu 
wrote:

> Hi all,
>
> I'd like to start the vote for SPIP: PySpark Test Framework.
>
> The high-level summary for the SPIP is that it proposes an official test
> framework for PySpark. Currently, there are only disparate open-source
> repos and blog posts for PySpark testing resources. We can streamline and
> simplify the testing process by incorporating test features, such as a
> PySpark Test Base class (which allows tests to share Spark sessions) and
> test util functions (for example, asserting dataframe and schema equality).
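
As one concrete example of the utilities being proposed, the equality
assertions that later landed in pyspark.testing look roughly like the sketch
below, assuming a Spark version that includes them.

from pyspark.sql import SparkSession
from pyspark.testing import assertDataFrameEqual, assertSchemaEqual

spark = SparkSession.builder.appName("testing-utils-demo").getOrCreate()

actual = spark.createDataFrame([("apple", 1), ("orange", 2)], ["word", "count"])
expected = spark.createDataFrame([("apple", 1), ("orange", 2)], ["word", "count"])

assertSchemaEqual(actual.schema, expected.schema)
assertDataFrameEqual(actual, expected)  # raises a readable diff on mismatch
spark.stop()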
>
> *SPIP doc:*
> https://docs.google.com/document/d/1OkyBn3JbEHkkQgSQ45Lq82esXjr9rm2Vj7Ih_4zycRc/edit#heading=h.f5f0u2riv07v
>
> *JIRA ticket:* https://issues.apache.org/jira/browse/SPARK-44042
>
> *Discussion thread:*
> https://lists.apache.org/thread/trwgbgn3ycoj8b8k8lkxko2hql23o41n
>
> Please vote on the SPIP for the next 72 hours:
> [ ] +1: Accept the proposal as an official SPIP
> [ ] +0
> [ ] -1: I don’t think this is a good idea because __.
>
> Thank you!
>
> Best,
> Amanda Liu
>
-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: [VOTE][RESULT] Release Plan for Apache Spark 4.0.0 (June 2024)

2023-06-20 Thread Holden Karau
That seems like a really good reason for a major version change given the %
of PySpark users and the fact we are (effectively) tied to pandas APIs.

On Tue, Jun 20, 2023 at 12:24 PM Bjørn Jørgensen 
wrote:

> One big thing for 4.0 will be that the pandas API on Spark will support
> pandas version 2.0.
>
> With the major release of pandas 2.0.0 on April 3, 2023, numerous breaking
> changes have been introduced. So, we have made the decision to postpone
> addressing these breaking changes until the next major release of Spark,
> version 4.0.0 to minimize disruptions for our users and provide a more
> seamless upgrade experience.
>
> The pandas 2.0.0 release includes a significant number of updates, such as
> API removals, changes in API behavior, parameter removals, parameter
> behavior changes, and bug fixes. We have planned the following approach for
> each item:
>
> - *API Removals*: Removed APIs will remain deprecated in Spark 3.5.0,
> provide appropriate warnings, and will be removed in Spark 4.0.0.
>
> - *API Behavior Changes*: APIs with changed behavior will retain the
> behavior in Spark 3.5.0, provide appropriate warnings, and will align the
> behavior with pandas in Spark 4.0.0.
>
> - *Parameter Removals*: Removed parameters will remain deprecated in
> Spark 3.5.0, provide appropriate warnings, and will be removed in Spark
> 4.0.0.
>
> - *Parameter Behavior Changes*: Parameters with changed behavior will
> retain the behavior in Spark 3.5.0, provide appropriate warnings, and will
> align the behavior with pandas in Spark 4.0.0.
>
> - *Bug Fixes*: Bug fixes mainly related to correctness issues will be
> fixed in Spark 3.5.0.
>
> *To recap, all breaking changes related to pandas 2.0.0 will be supported
> in Spark 4.0.0,* *and will remain deprecated with appropriate errors in
> Spark 3.5.0.*
>
>
>
> https://issues.apache.org/jira/browse/SPARK-43291?page=com.atlassian.jira.plugin.system.issuetabpanels%3Aall-tabpanel
>
> On Tue, Jun 20, 2023 at 06:18, Dongjoon Hyun wrote:
>
>> Hi, Herman.
>>
>> This is a series of discussions as I re-summarized here.
>>
>> You can find some context in the previous timeline thread.
>>
>> 2023-05-30 Apache Spark 4.0 Timeframe?
>> https://lists.apache.org/thread/xhkgj60j361gdpywoxxz7qspp2w80ry6
>>
>> Could you reply there to collect your timeline suggestions? We can
>> discuss more there.
>>
>> Dongjoon.
>>
>>
>>
>> On Mon, Jun 19, 2023 at 1:58 PM Herman van Hovell 
>> wrote:
>>
>>> Dongjoon, I am not sure if I follow the line of thought here.
>>>
>>> Multiple people have asked for clarification on what Spark 4.0 would
>>> mean (Holden, Mridul, Jia & Xiao). You can - for the record - also add me
>>> to this list. However, you choose to single out Xiao because he asks this
>>> question and wants to do a preview release as well? So again, what does
>>> Spark 4 mean, and why does it need to take almost a year? Historically
>>> major Spark releases tend to break APIs, but if it only entails changing to
>>> Scala 2.13 and dropping support for JDK 8, then we could also just release
>>> a month after 3.5.
>>>
>>> How about we do this? We get 3.5 released, and afterwards we do a couple
>>> of meetings where we build this roadmap. Using that, we can - hopefully -
>>> have a grounded discussion.
>>>
>>> Cheers,
>>> Herman
>>>
>>> On Mon, Jun 19, 2023 at 4:01 PM Dongjoon Hyun 
>>> wrote:
>>>
>>>> Thank you. I reviewed the threads, vote and result once more.
>>>>
>>>> I found that I missed the binding vote mark on Holden in the vote
>>>> result email. The following should be "-0: Holden Karau *". Sorry for this
>>>> mistake, Holden and all.
>>>>
>>>> > -0: Holden Karau
>>>>
>>>> To Hyukjin, I disagree with you on the following point because the
>>>> thread clearly started with your and Sean's Apache Spark 4.0 requirement in
>>>> order to move away from Scala 2.12. In addition, we also discussed another
>>>> item (dropping Java 8) from another current dev thread. The vote scope and
>>>> goal are clear and specific.
>>>>
>>>> > we're unclear on the picture of Spark 4.0.0.
>>>>
>>>> Instead of the vote scope and result, what is really unclear is what
>>>> you propose here. If Xiao wants a preview, Xiao can propose the preview
>>>> plan in more detail. It's welcome. If you have many 4.0 dev ideas which are

Re: Gauging interest in: ScalaFix + Scala Steward for Spark 4.0

2023-06-12 Thread Holden Karau
Yup, I think building consensus on what goes in 4.X is something we'll need
to do.

On Mon, Jun 12, 2023 at 11:56 AM Dongjoon Hyun 
wrote:

> Thank you for sharing those. I'm also interested in taking advantage of
> them. Also, I hope `spark-upgrade` can help us in line with Spark 4.0.
>
> However, we don't need to discuss any of this if we don't build a
> consensus on both Spark 4.0 and the next Scala version.
>
> We don't have a vehicle at all to reach there yet.
>
> In the community, I saw a bottleneck: "No in the 3.x era" and "No for 4.0
> yet because XXX".
>
> Dongjoon.
>
> --
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Gauging interest in: ScalaFix + Scala Steward for Spark 4.0

2023-06-12 Thread Holden Karau
Myself and a few folks have been working on a spark-upgrade project
(focused on getting folks onto current versions of Spark). Since it looks
like we're starting the discussion around Spark 4, I was thinking now could
be a good time for us to consider if we want to try and integrate
auto-upgrade rules like some other Scala projects do.

Context:
- https://github.com/scala-steward-org/scala-steward
- https://scalacenter.github.io/scalafix/
- https://github.com/holdenk/spark-upgrade

-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: [VOTE] Release Plan for Apache Spark 4.0.0 (June 2024)

2023-06-12 Thread Holden Karau
-0

I'd like to see more of a doc around what we're planning on for a 4.0
before we pick a target release date etc. (feels like cart before the
horse).

But it's a weak preference.

On Mon, Jun 12, 2023 at 11:24 AM Xiao Li  wrote:

> Thanks for starting the vote.
>
> I do have a concern about the target release date of Spark 4.0.
>
> On Mon, Jun 12, 2023 at 11:09, L. C. Hsieh wrote:
>
>> +1
>>
>> On Mon, Jun 12, 2023 at 11:06 AM huaxin gao 
>> wrote:
>> >
>> > +1
>> >
>> > On Mon, Jun 12, 2023 at 11:05 AM Dongjoon Hyun 
>> wrote:
>> >>
>> >> +1
>> >>
>> >> Dongjoon
>> >>
>> >> On 2023/06/12 18:00:38 Dongjoon Hyun wrote:
>> >> > Please vote on the release plan for Apache Spark 4.0.0.
>> >> >
>> >> > The vote is open until June 16th 1AM (PST) and passes if a majority
>> +1 PMC
>> >> > votes are cast, with a minimum of 3 +1 votes.
>> >> >
>> >> > [ ] +1 Have a release plan for Apache Spark 4.0.0 (June 2024)
>> >> > [ ] -1 Do not have a plan for Apache Spark 4.0.0 because ...
>> >> >
>> >> > ===
>> >> > Apache Spark 4.0.0 Release Plan
>> >> > ===
>> >> >
>> >> > 1. After creating `branch-3.5`, set "4.0.0-SNAPSHOT" in master
>> branch.
>> >> >
>> >> > 2. Creating `branch-4.0` on April 1st, 2024.
>> >> >
>> >> > 3. Apache Spark 4.0.0 RC1 on May 1st, 2024.
>> >> >
>> >> > 4. Apache Spark 4.0.0 Release in June, 2024.
>> >> >
>> >>
>> >> -
>> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> >>
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>

-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: JDK version support policy?

2023-06-07 Thread Holden Karau
So JDK 11 is still supported in OpenJDK until 2026, and I'm not sure we're
going to see enough folks moving to JRE 17 by the Spark 4 release; unless we
have a strong benefit from dropping 11 support, I'd be inclined to keep it.

On Tue, Jun 6, 2023 at 9:08 PM Dongjoon Hyun  wrote:

> I'm also +1 on dropping both Java 8 and 11 in Apache Spark 4.0, too.
>
> Dongjoon.
>
> On 2023/06/07 02:42:19 yangjie01 wrote:
> > +1 on dropping Java 8 in Spark 4.0, and I even hope Spark 4.0 can only
> support Java 17 and the upcoming Java 21.
> >
> > From: Denny Lee 
> > Date: Wednesday, June 7, 2023 07:10
> > To: Sean Owen 
> > Cc: David Li , "dev@spark.apache.org" <
> dev@spark.apache.org>
> > Subject: Re: JDK version support policy?
> >
> > +1 on dropping Java 8 in Spark 4.0, saying this as a fan of the
> fast-paced (positive) updates to Arrow, eh?!
> >
> > On Tue, Jun 6, 2023 at 4:02 PM Sean Owen  wrote:
> > I haven't followed this discussion closely, but I think we could/should
> drop Java 8 in Spark 4.0, which is up next after 3.5?
> >
> > On Tue, Jun 6, 2023 at 2:44 PM David Li  wrote:
> > Hello Spark developers,
> >
> > I'm from the Apache Arrow project. We've discussed Java version support
> [1], and crucially, whether to continue supporting Java 8 or not. As Spark
> is a big user of Arrow in Java, I was curious what Spark's policy here was.
> >
> > If Spark intends to stay on Java 8, for instance, we may also want to
> stay on Java 8 or otherwise provide some supported version of Arrow for
> Java 8.
> >
> > We've seen dependencies dropping or planning to drop support. gRPC may
> drop Java 8 at any time [2], possibly this September [3], which may affect
> Spark (due to Spark Connect). And today we saw that Arrow had issues
> running tests with Mockito on Java 20, but we couldn't update Mockito since
> it had dropped Java 8 support. (We pinned the JDK version in that CI
> pipeline for now.)
> >
> > So at least, I am curious if Arrow could start the long process of
> migrating Java versions without impacting Spark, or if we should continue
> to cooperate. Arrow Java doesn't see quite so much activity these days, so
> it's not quite critical, but it's possible that these dependency issues
> will start to affect us more soon. And looking forward, Java is working on
> APIs that should also allow us to ditch the --add-opens flag requirement
> too.
> >
> > [1]: https://lists.apache.org/thread/phpgpydtt3yrgnncdyv4qdq1gf02s0yj
> >
> > [2]: https://github.com/grpc/proposal/blob/master/P5-jdk-version-support.md
> >
> > [3]: https://github.com/grpc/grpc-java/issues/9386
> >
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>

-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: ASF policy violation and Scala version issues

2023-06-06 Thread Holden Karau
So I think if the Spark PMC wants to ask Databricks something, that could be
reasonable (although I'm a little fuzzy as to the ask), but that
conversation might belong on private@ (I could be wrong of course).

On Tue, Jun 6, 2023 at 3:29 AM Mich Talebzadeh 
wrote:

> I concur with you Sean.
>
> If I understand correctly the point raised by the thread owner, in the
> heterogeneous environments in which we work, it is up to the practitioner to
> ensure that there is version compatibility among the OS version, the Spark
> version, and the target artefact in consideration. For example, if I try to
> connect to Google BigQuery from Spark 3.4.0, my OS, or for that matter the
> docker image, needs to run Java 8 regardless of Spark's Java version;
> otherwise it will fail.
>
> I think these details should be left to the trenches, because these
> arguments about versioning become tangential in the big picture.  Case in
> point, my current OS Scala version is 2.13.8, but it works fine with Spark
> built on 2.12.17.
>
> HTH
>
> Mich Talebzadeh,
> Lead Solutions Architect/Engineering Lead
> Palantir Technologies Limited
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Tue, 6 Jun 2023 at 01:37, Sean Owen  wrote:
>
>> I think the issue is whether a distribution of Spark is so materially
>> different from OSS that it causes problems for the larger community of
>> users. There's a legitimate question of whether such a thing can be called
>> "Apache Spark + changes", as describing it that way becomes meaningfully
>> inaccurate. And if it's inaccurate, then it's a trademark usage issue, and
>> a matter for the PMC to act on. I certainly recall this type of problem
>> from the early days of Hadoop - the project itself had 2 or 3 live branches
>> in development (was it 0.20.x vs 0.23.x vs 1.x? YARN vs no YARN?) picked up
>> by different vendors and it was unclear what "Apache Hadoop" meant in a
>> vendor distro. Or frankly, upstream.
>>
>> In comparison, variation in Scala maintenance release seems trivial. I'm
>> not clear from the thread what actual issue this causes to users. Is there
>> more to it - does this go hand in hand with JDK version and Ammonite, or
>> are those separate? What's an example of the practical user issue. Like, I
>> compile vs Spark 3.4.0 and because of Scala version differences it doesn't
>> run on some vendor distro? That's not great, but seems like a vendor
>> problem. Unless you tell me we are getting tons of bug reports to OSS Spark
>> as a result or something.
>>
>> Is the implication that something in OSS Spark is being blocked to prefer
>> some set of vendor choices? because the changes you're pointing to seem to
>> be going into Apache Spark, actually. It'd be more useful to be specific
>> and name names at this point, seems fine.
>>
>> The rest of this is just a discussion about Databricks choices. (If it's
>> not clear, I'm at Databricks but do not work on the Spark distro). We can
>> discuss but it seems off-topic _if_ it can't be connected to a problem for
>> OSS Spark. Anyway:
>>
>> If it helps, _some_ important patches are described at
>> https://docs.databricks.com/release-notes/runtime/maintenance-updates.html
>> ; I don't think this is exactly hidden.
>>
>> Out of curiosity, how would you describe this software in the UI instead?
>> "3.4.0" is shorthand, because this is a little dropdown menu; the terminal
>> output is likewise not a place to list all patches. You would propose
>> requesting calling this "3.4.0 + patches"? That's the best I can think of,
>> but I don't think it addresses what you're getting at anyway. I think you'd
>> just prefer Databricks make a different choice, which is legitimate, but,
>> an issue to take up with Databricks, not here.
>>
>>
>> On Mon, Jun 5, 2023 at 6:58 PM Dongjoon Hyun 
>> wrote:
>>
>>> Hi, Sean.
>>>
>>> "+ patches" or "powered by Apache Spark 3.4.0" is not a problem as you
>>> mentioned. For the record, I also didn't bring up any old story here.
>>>
>>> > "Apache Spark 3.4.0 + patches"
>>>
>>> However, "including Apache Spark 3.4.0" still causes confusion even in a
>>> different way because of those missing patches, SPARK-40436 (Upgrade Scala
>>> to 2.12.17) and SPARK-39414 (Upgrade Scala to 2.12.16). Technically,
>>> Databricks Runtime doesn't include Apache Spark 3.4.0 while claiming to
>>> users that it does.
>>>
>>> [image: image.png]
>>>
>>> It's a sad story from the Apache Spark Scala perspective because the
>>> users cannot even try to use the correct Scala 2.12.17 version in the
>>> runtime.

Re: Slack for Spark Community: Merging various threads

2023-04-07 Thread Holden Karau
I think there was some concern around how to make any sync channel show up
in logs / index / search results?

On Fri, Apr 7, 2023 at 9:41 AM Dongjoon Hyun 
wrote:

> Thank you, All.
>
> I'm very satisfied with the focused and right questions for the real
> issues by removing irrelevant claims. :)
>
> Let me collect your relevant comments simply.
>
>
> # Category 1: Invitation Hurdle
>
> > The key question here is: do PMC members have the bandwidth to invite
> everyone in user@ and dev@?
>
> > Extending this to inviting everyone on @user (over >4k  subscribers
> according to the previous thread) might be a stretch,
>
> > we should have an official project Slack with an easy invitation process.
>
>
> # Category 2: Controllability
>
> > Additionally. there is no indication that the-asf.slack.com is intended
> for general support.
>
> > I would also lean towards a standalone workspace, where we have more
> control over organizing the channels,
>
>
> # Category 3: Policy Suggestion
>
> > *Developer* discussions should still happen on email, JIRA and GitHub
> and be async-friendly (72-hour rule) to fit the ASF’s development model.
>
>
> Are there any other questions?
>
>
> Dongjoon.
>
>
> --
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: Apache Spark 3.2.4 EOL Release?

2023-04-04 Thread Holden Karau
+1

On Tue, Apr 4, 2023 at 11:04 AM L. C. Hsieh  wrote:

> +1
>
> Sounds good and thanks Dongjoon for driving this.
>
> On 2023/04/04 17:24:54 Dongjoon Hyun wrote:
> > Hi, All.
> >
> > Since Apache Spark 3.2.0 passed RC7 vote on October 12, 2021, branch-3.2
> > has been maintained and served well until now.
> >
> > - https://github.com/apache/spark/releases/tag/v3.2.0 (tagged on Oct 6,
> > 2021)
> > - https://lists.apache.org/thread/jslhkh9sb5czvdsn7nz4t40xoyvznlc7
> >
> > As of today, branch-3.2 has 62 additional patches after v3.2.3 and
> reaches
> > the end-of-life this month according to the Apache Spark release
> cadence. (
> > https://spark.apache.org/versioning-policy.html)
> >
> > $ git log --oneline v3.2.3..HEAD | wc -l
> > 62
> >
> > With the upcoming Apache Spark 3.4, I hope the users can get a chance to
> > have these last bits of Apache Spark 3.2.x, and I'd like to propose to
> have
> > Apache Spark 3.2.4 EOL Release next week and volunteer as the release
> > manager. WDYT? Please let me know if you need more patches on branch-3.2.
> >
> > Thanks,
> > Dongjoon.
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
> --
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: Ammonite as REPL for Spark Connect

2023-03-22 Thread Holden Karau
I am +1 to the general concept of including Ammonite magic.

On Wed, Mar 22, 2023 at 4:58 PM Herman van Hovell
 wrote:

> Ammonite is maintained externally by Li Haoyi et al. We are including it
> as a 'provided' dependency. The integration bits and pieces (1 file) are
> included in Apache Spark.
>
> On Wed, Mar 22, 2023 at 7:53 PM Mridul Muralidharan 
> wrote:
>
>>
>> Will this be maintained externally or included into Apache Spark ?
>>
>> Regards ,
>> Mridul
>>
>>
>>
>> On Wed, Mar 22, 2023 at 6:50 PM Herman van Hovell
>>  wrote:
>>
>>> Hi All,
>>>
>>> For the Spark Connect Scala Client we are working on making the REPL
>>> experience a bit nicer. In a nutshell, we want to give users a turnkey
>>> Scala REPL that works even if you don't have a Spark distribution on your
>>> machine (through coursier). We are using Ammonite instead of the standard
>>> Scala REPL for this; the main reason for going with Ammonite is that it is
>>> easier to customize and, IMO, has a superior user experience.
>>>
>>> Does anyone object to doing this?
>>>
>>> Kind regards,
>>> Herman
>>>
>>> --
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: SPIP: Shutting down spark structured streaming when the streaming process completed current process

2023-02-18 Thread Holden Karau
Is there someone focused on streaming work these days who would want to
shepherd this?

On Sat, Feb 18, 2023 at 5:02 PM Dongjoon Hyun 
wrote:

> Thank you for considering me, but may I ask what made you think of putting
> me there, Mich? I'm curious about your reason.
>
> > I have put dongjoon.hyun as a shepherd.
>
> BTW, unfortunately, I cannot help you with that due to my on-going
> personal stuff. I'll adjust the JIRA first.
>
> Thanks,
> Dongjoon.
>
>
> On Sat, Feb 18, 2023 at 10:51 AM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> https://issues.apache.org/jira/browse/SPARK-42485
>>
>>
>> Spark Structured Streaming is a very useful tool in dealing with Event
>> Driven Architecture. In an Event Driven Architecture, there is generally a
>> main loop that listens for events and then triggers a call-back function
>> when one of those events is detected. In a streaming application the
>> application waits to receive the source messages in a set interval or
>> whenever they happen and reacts accordingly.
>>
>> There are occasions when you may want to stop the Spark program
>> gracefully, meaning that the Spark application handles the last
>> streaming message completely and then terminates the application. This is
>> different from invoking interrupts such as CTRL-C.
>>
>> Of course one can terminate the process based on the following:
>>
>>1. query.awaitTermination() # Waits for the termination of this
>>query, with stop() or with error
>>
>>2. query.awaitTermination(timeoutMs) # Returns true if this query is
>>terminated within the timeout in milliseconds.
>>
>> So the first one above waits until an interrupt signal is received. The
>> second one will count the timeout and will exit when the timeout in
>> milliseconds is reached.
>>
>> The issue is that one needs to predict how long the streaming job needs
>> to run. Clearly, any interrupt at the terminal or OS level (killing the
>> process) may leave the processing terminated without proper completion of
>> the streaming process.
>>
>> I have devised a method that allows one to terminate the Spark
>> application internally after processing the last received message. Within,
>> say, 2 seconds of the confirmation of shutdown, the process will invoke a
>> graceful shutdown.
>>
>> This new feature proposes a solution that gracefully handles the work
>> being done for the message currently being processed, waits for it to
>> complete, and shuts down the streaming process for a given topic without
>> loss of data or orphaned transactions.
>>
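One way to approximate the pattern described above with today's APIs is to
poll for an external shutdown signal between micro-batches and stop the
query only when no trigger is active. A minimal sketch; the marker-file path
is a made-up convention.

import os
import time

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("graceful-shutdown-sketch").getOrCreate()

stream = spark.readStream.format("rate").option("rowsPerSecond", 10).load()
query = stream.writeStream.format("console").start()

while query.isActive:
    # External shutdown signal: a marker file (could equally be a control topic).
    if os.path.exists("/tmp/stop_streaming_job"):
        # Only stop between micro-batches so the last message is fully handled.
        if not query.status["isTriggerActive"]:
            query.stop()
    time.sleep(2)

spark.stop()
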
>>
>> I have put dongjoon.hyun as a shepherd. Kindly advise me if that is the
>> correct approach.
>>
>> JIRA ticket https://issues.apache.org/jira/browse/SPARK-42485
>>
>> SPIP doc: TBC
>>
>> Discussion thread: in
>>
>> https://lists.apache.org/list.html?dev@spark.apache.org
>>
>>
>> Thanks.
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
> --
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: [VOTE] Release Spark 3.3.2 (RC1)

2023-02-13 Thread Holden Karau
That's legit; if the patch author isn't comfortable with a backport, then
let's leave it be.

On Mon, Feb 13, 2023 at 9:59 AM Dongjoon Hyun 
wrote:

> Hi, All.
>
> As the author of that `Improvement` patch, I strongly disagree with giving
> the wrong idea that Python 3.11 is officially supported in Spark 3.3.
>
> I only developed and delivered it for Apache Spark 3.4.0 specifically as
> `Improvement`.
>
> We may want to backport it to branch-3.3, but that is also another
> discussion topic because it's an `Improvement` instead of a blocker for any
> existing release branch.
>
> Please raise the backporting discussion thread after the 3.3.2 release if
> you want it in branch-3.3.
>
> We need to talk. :)
>
> Bests,
> Dongjoon.
>
>
> On Mon, Feb 13, 2023 at 9:31 AM Chao Sun  wrote:
>
>> +1
>>
>> On Mon, Feb 13, 2023 at 9:20 AM L. C. Hsieh  wrote:
>> >
>> > If it is not supported in Spark 3.3.x, it looks like an improvement in
>> > Spark 3.4.
>> > For such cases we usually do not back port. I think this is also why
>> > the PR did not back port when it was merged.
>> >
>> > I'm okay if there is consensus to back port it.
>> >
>> > On Mon, Feb 13, 2023 at 9:08 AM Sean Owen  wrote:
>> > >
>> > > Does that change change the result for Spark 3.3.x?
>> > > It looks like we do not support Python 3.11 in Spark 3.3.x, which is
>> one answer to whether this should be changed now.
>> > > But if that's the only change that matters for Python 3.11 and makes
>> it work, sure I think we should back-port. It doesn't necessarily block a
>> release but if that's the case, it seems OK to include to me in a next RC.
>> > >
>> > > On Mon, Feb 13, 2023 at 10:53 AM Bjørn Jørgensen <
>> bjornjorgen...@gmail.com> wrote:
>> > >>
>> > >> There is a fix for python 3.11
>> https://github.com/apache/spark/pull/38987
>> > >> We should have this in more branches.
>> > >>
>> > >> On Mon, Feb 13, 2023 at 09:39, Bjørn Jørgensen <
>> bjornjorgen...@gmail.com> wrote:
>> > >>>
>> > >>> On manjaro it is Python 3.10.9
>> > >>>
>> > >>> On ubuntu it is Python 3.11.1
>> > >>>
>> > >>> On Mon, Feb 13, 2023 at 03:24, yangjie01 wrote:
>> > 
>> >  Which Python version do you use for testing? When I use the latest
>> Python 3.11, I can reproduce similar test failures (43 tests in the sql
>> module fail), but when I use Python 3.10, they succeed.
>> > 
>> > 
>> > 
>> >  YangJie
>> > 
>> > 
>> > 
>> >  From: Bjørn Jørgensen 
>> >  Date: Monday, February 13, 2023 05:09
>> >  To: Sean Owen 
>> >  Cc: "L. C. Hsieh" , Spark dev list <
>> dev@spark.apache.org>
>> >  Subject: Re: [VOTE] Release Spark 3.3.2 (RC1)
>> > 
>> > 
>> > 
>> >  Tried it one more time and the same result.
>> > 
>> > 
>> > 
>> >  On another box with Manjaro
>> > 
>> > 
>> 
>> >  [INFO] Reactor Summary for Spark Project Parent POM 3.3.2:
>> >  [INFO]
>> >  [INFO] Spark Project Parent POM ...
>> SUCCESS [01:50 min]
>> >  [INFO] Spark Project Tags .
>> SUCCESS [ 17.359 s]
>> >  [INFO] Spark Project Sketch ...
>> SUCCESS [ 12.517 s]
>> >  [INFO] Spark Project Local DB .
>> SUCCESS [ 14.463 s]
>> >  [INFO] Spark Project Networking ...
>> SUCCESS [01:07 min]
>> >  [INFO] Spark Project Shuffle Streaming Service 
>> SUCCESS [  9.013 s]
>> >  [INFO] Spark Project Unsafe ...
>> SUCCESS [  8.184 s]
>> >  [INFO] Spark Project Launcher .
>> SUCCESS [ 10.454 s]
>> >  [INFO] Spark Project Core .
>> SUCCESS [23:58 min]
>> >  [INFO] Spark Project ML Local Library .
>> SUCCESS [ 21.218 s]
>> >  [INFO] Spark Project GraphX ...
>> SUCCESS [01:24 min]
>> >  [INFO] Spark Project Streaming 
>> SUCCESS [04:57 min]
>> >  [INFO] Spark Project Catalyst .
>> SUCCESS [08:00 min]
>> >  [INFO] Spark Project SQL ..
>> SUCCESS [  01:02 h]
>> >  [INFO] Spark Project ML Library ...
>> SUCCESS [14:38 min]
>> >  [INFO] Spark Project Tools 
>> SUCCESS [  4.394 s]
>> >  [INFO] Spark Project Hive .
>> SUCCESS [53:43 min]
>> >  [INFO] Spark Project REPL .
>> SUCCESS [01:16 min]
>> >  [INFO] Spark Project Assembly .
>> SUCCESS [  2.186 s]
>> >  [INFO] Kafka 0.10+ Token Provider for Streaming ...
>> SUCCESS [ 16.150 s]
>> >  [INFO] Spark Integration for Kafka 0.10 ...
>> SUCCESS [01:34 min]
>> >  [INFO] Kafka 0.10+ Source for Structured Streaming 
>> SUCCESS 

Re: [VOTE] Release Spark 3.3.2 (RC1)

2023-02-13 Thread Holden Karau
I'd be in favor of backporting, with the idea that it's a bug fix for a
language version (admittedly not a version we've supported before).

On Mon, Feb 13, 2023 at 9:19 AM L. C. Hsieh  wrote:

> If it is not supported in Spark 3.3.x, it looks like an improvement in
> Spark 3.4.
> For such cases we usually do not back port. I think this is also why
> the PR was not back-ported when it was merged.
>
> I'm okay if there is consensus to back port it.
>
> On Mon, Feb 13, 2023 at 9:08 AM Sean Owen  wrote:
> >
> > Does that change change the result for Spark 3.3.x?
> > It looks like we do not support Python 3.11 in Spark 3.3.x, which is one
> answer to whether this should be changed now.
> > But if that's the only change that matters for Python 3.11 and makes it
> work, sure I think we should back-port. It doesn't necessarily block a
> release but if that's the case, it seems OK to include to me in a next RC.
> >
> > On Mon, Feb 13, 2023 at 10:53 AM Bjørn Jørgensen <
> bjornjorgen...@gmail.com> wrote:
> >>
> >> There is a fix for Python 3.11
> https://github.com/apache/spark/pull/38987
> >> We should have this in more branches.
> >>
> >> man. 13. feb. 2023 kl. 09:39 skrev Bjørn Jørgensen <
> bjornjorgen...@gmail.com>:
> >>>
> >>> On manjaro it is Python 3.10.9
> >>>
> >>> On ubuntu it is Python 3.11.1
> >>>
> >>> man. 13. feb. 2023 kl. 03:24 skrev yangjie01 :
> 
>  Which Python version do you use for testing? When I use the latest
> Python 3.11, I can reproduce similar test failures (43 tests of sql module
> fail), but when I use Python 3.10, they succeed
> 
> 
> 
>  YangJie
> 
> 
> 
>  发件人: Bjørn Jørgensen 
>  日期: 2023年2月13日 星期一 05:09
>  收件人: Sean Owen 
>  抄送: "L. C. Hsieh" , Spark dev list <
> dev@spark.apache.org>
>  主题: Re: [VOTE] Release Spark 3.3.2 (RC1)
> 
> 
> 
>  Tried it one more time and the same result.
> 
> 
> 
>  On another box with Manjaro
> 
> 
> 
>  [INFO] Reactor Summary for Spark Project Parent POM 3.3.2:
>  [INFO]
>  [INFO] Spark Project Parent POM ... SUCCESS
> [01:50 min]
>  [INFO] Spark Project Tags . SUCCESS [
> 17.359 s]
>  [INFO] Spark Project Sketch ... SUCCESS [
> 12.517 s]
>  [INFO] Spark Project Local DB . SUCCESS [
> 14.463 s]
>  [INFO] Spark Project Networking ... SUCCESS
> [01:07 min]
>  [INFO] Spark Project Shuffle Streaming Service  SUCCESS
> [  9.013 s]
>  [INFO] Spark Project Unsafe ... SUCCESS
> [  8.184 s]
>  [INFO] Spark Project Launcher . SUCCESS [
> 10.454 s]
>  [INFO] Spark Project Core . SUCCESS
> [23:58 min]
>  [INFO] Spark Project ML Local Library . SUCCESS [
> 21.218 s]
>  [INFO] Spark Project GraphX ... SUCCESS
> [01:24 min]
>  [INFO] Spark Project Streaming  SUCCESS
> [04:57 min]
>  [INFO] Spark Project Catalyst . SUCCESS
> [08:00 min]
>  [INFO] Spark Project SQL .. SUCCESS
> [  01:02 h]
>  [INFO] Spark Project ML Library ... SUCCESS
> [14:38 min]
>  [INFO] Spark Project Tools  SUCCESS
> [  4.394 s]
>  [INFO] Spark Project Hive . SUCCESS
> [53:43 min]
>  [INFO] Spark Project REPL . SUCCESS
> [01:16 min]
>  [INFO] Spark Project Assembly . SUCCESS
> [  2.186 s]
>  [INFO] Kafka 0.10+ Token Provider for Streaming ... SUCCESS [
> 16.150 s]
>  [INFO] Spark Integration for Kafka 0.10 ... SUCCESS
> [01:34 min]
>  [INFO] Kafka 0.10+ Source for Structured Streaming  SUCCESS
> [32:55 min]
>  [INFO] Spark Project Examples . SUCCESS [
> 23.800 s]
>  [INFO] Spark Integration for Kafka 0.10 Assembly .. SUCCESS
> [  7.301 s]
>  [INFO] Spark Avro . SUCCESS
> [01:19 min]
>  [INFO]
> 
>  [INFO] BUILD SUCCESS
>  [INFO]
> 
>  [INFO] Total time:  03:31 h
>  [INFO] Finished at: 2023-02-12T21:54:20+01:00
>  [INFO]
> 
>  [bjorn@amd7g spark-3.3.2]$  java -version
>  openjdk version "17.0.6" 2023-01-17
>  OpenJDK Runtime Environment (build 17.0.6+10)
>  OpenJDK 64-Bit Server VM (build 17.0.6+10, mixed mode)
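For reference, a full build-and-test pass like the reactor summary above is
typically driven through Spark's bundled Maven wrapper — a sketch, assuming an
unpacked spark-3.3.2 source tree (exact profiles and memory settings may vary
by environment):

  export MAVEN_OPTS="-Xmx4g -XX:ReservedCodeCacheSize=1g"
  ./build/mvn -Phive -Phive-thriftserver clean package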
> 
> 
> 
> 
> 
>  :)
> 
> 

Re: Spark on Kube (virtual) coffee/tea/pop times

2023-02-13 Thread Holden Karau
Some general issues we found common ground around:

- Inter-pod security, Istio + mTLS
- Sidecar management
- Docker images (add links to more related images, e.g. Helm links)
- Data locality concerns
- Upgrading Spark versions
- Performance issues

Thanks to everyone who was able to make the informal coffee chat.

I'll try to schedule another one at a more European-friendly time so that
we can all get to chat as well.

On Fri, Feb 10, 2023 at 1:08 PM Mich Talebzadeh 
wrote:

> Great looking forward to it
>
> Mich
>
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Fri, 10 Feb 2023 at 18:58, Holden Karau  wrote:
>
>> Ok so the first iteration of this is booked:
>>
>>
>> Spark on Kube Coffee Chats
>> Sunday, Feb 12 · 6–7 PM pacific time
>> Google Meet joining info
>> Video call link: https://meet.google.com/wge-tzzd-uyj
>>
>> Assuming that all goes well I’ll send out another doodle poll after this
>> one for the folks who could not make it.
>>
>> Looking forward to catching up with y’all :) No prep work necessary, but
>> if anyone wants to write down a brief two-sentence blurb about their
>> goals for Spark on Kube, I was thinking we might go around the virtual room
>> sharing that as our kick-off point for this coffee meeting :)
>>
>>
>> On Wed, Feb 8, 2023 at 12:27 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> That sounds like a good plan Holden!
>>>
>>>
>>> Let us go for it
>>>
>>>
>>>view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Wed, 8 Feb 2023 at 20:12, Holden Karau  wrote:
>>>
>>>> My thought here was that it's more focused on getting to understand
>>>> each other's goals / priorities and less on solving any specific problem.
>>>>
>>>> For example, I know that some folks running on EKS have different
>>>> priorities than folks running on-prem.
>>>>
>>>> We might (later on) make a roadmap doc if that seems necessary, but I'm
>>>> hoping that just an understanding of folks' priorities and challenges will
>>>> make it easier for us to all collaborate.
>>>>
>>>> On Wed, Feb 8, 2023 at 11:47 AM Mich Talebzadeh <
>>>> mich.talebza...@gmail.com> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> Is this going to be a brainstorming meeting, or will there be a prior
>>>>> agenda to work from?
>>>>>
>>>>> thanks
>>>>>
>>>>>
>>>>>
>>>>>view my Linkedin profile
>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>
>>>>>
>>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>
>>>>>
>>>>>
>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>>> any loss, damage or destruction of data or any other property which may
>>>>> arise from relying on this email's technical content is explicitly
>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>> arising from such loss, damage or destruction.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Wed, 8 Feb 2023 at 18:33, Mich Talebzadeh <
>>>>> mich.talebza...@gmail.com> wrote:
>>>>>
>>>>>> Ok Colin thanks for clarification
>&

Re: Spark on Kube (virtual) coffee/tea/pop times

2023-02-10 Thread Holden Karau
Ok so the first iteration of this is booked:


Spark on Kube Coffee Chats
Sunday, Feb 12 · 6–7 PM pacific time
Google Meet joining info
Video call link: https://meet.google.com/wge-tzzd-uyj

Assuming that all goes well I’ll send out another doodle poll after this
one for the folks who could not make it.

Looking forward to catching up with y’all :) No prep work necessary, but if
anyone wants to write down a brief two-sentence blurb about their
goals for Spark on Kube, I was thinking we might go around the virtual room
sharing that as our kick-off point for this coffee meeting :)


On Wed, Feb 8, 2023 at 12:27 PM Mich Talebzadeh 
wrote:

> That sounds like a good plan Holden!
>
>
> Let us go for it
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Wed, 8 Feb 2023 at 20:12, Holden Karau  wrote:
>
>> My thought here was that it's more focused on getting to understand each
>> other's goals / priorities and less on solving any specific problem.
>>
>> For example, I know that some folks running on EKS have different
>> priorities than folks running on-prem.
>>
>> We might (later on) make a roadmap doc if that seems necessary, but I'm
>> hoping that just an understanding of folks' priorities and challenges will
>> make it easier for us to all collaborate.
>>
>> On Wed, Feb 8, 2023 at 11:47 AM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> Is this going to be a brainstorming meeting, or will there be a prior
>>> agenda to work from?
>>>
>>> thanks
>>>
>>>
>>>
>>>view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Wed, 8 Feb 2023 at 18:33, Mich Talebzadeh 
>>> wrote:
>>>
>>>> Ok Colin thanks for clarification
>>>>
>>>>
>>>>
>>>>view my Linkedin profile
>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>
>>>>
>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, 8 Feb 2023 at 18:08, Colin Williams <
>>>> colin.williams.seat...@gmail.com> wrote:
>>>>
>>>>> I'm sorry you misunderstood.  The context is migrating jobs to Spark
>>>>> on k8s.
>>>>>
>>>>> On Wed, Feb 8, 2023, 8:31 AM Mich Talebzadeh <
>>>>> mich.talebza...@gmail.com> wrote:
>>>>>
>>>>>> Hi Colin,
>>>>>>
>>>>>> Thanks for your reply.
>>>>>>
>>>>>>
>>>>>> I think both YARN and Kubernetes are cluster managers, plus Standalone
>>>>>> and, more remotely, Mesos. So I gather this discussion will focus on Spark
>>>>>> on k8s, unless I am mistaken.
>>>>>>
>>>>>>
>>>>>> HTH,
>>>>>>
>>>>>>
>>>>>> Mich
>>>>>>
>>>>>>
>>>>>>view my Linkedin profile
>>&g

Re: Spark on Kube (virtual) coffee/tea/pop times

2023-02-08 Thread Holden Karau
gt;>>>>
>>>>>>
>>>>>> On Wed, 8 Feb 2023 at 03:42, Colin Williams <
>>>>>> colin.williams.seat...@gmail.com> wrote:
>>>>>>
>>>>>>> I wouldn't mind attending or viewing a recording depending on
>>>>>>> availability. I'm interested in challenges and solutions to porting 
>>>>>>> Spark
>>>>>>> jobs between environments.
>>>>>>>
>>>>>>> On Tue, Feb 7, 2023 at 7:34 PM Denis Bolshakov <
>>>>>>> bolshakov.de...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> I am also interested, please add me to the conf.
>>>>>>>>
>>>>>>>> ср, 8 февр. 2023 г., 07:21 Jayabindu Singh :
>>>>>>>>
>>>>>>>>> Greetings everyone!
>>>>>>>>> I am super new to this group and currently leading some work to
>>>>>>>>> deploy spark on k8 for my company o9 Solutions.
>>>>>>>>> I would love to join the discussion.
>>>>>>>>> I am in PST.
>>>>>>>>>
>>>>>>>>> Regards
>>>>>>>>> Jay
>>>>>>>>>
>>>>>>>>> Sent from my iPhone
>>>>>>>>>
>>>>>>>>> On Feb 7, 2023, at 3:57 PM, Mich Talebzadeh <
>>>>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> 
>>>>>>>>> Could be interesting. We need to summarise where we are with Spark on
>>>>>>>>> k8s and what the market demands.
>>>>>>>>>
>>>>>>>>> My personal experience with Volcano was not that impressive. So
>>>>>>>>> maybe a summary of where we currently are with Spark on k8s will do.
>>>>>>>>>
>>>>>>>>> I am on Greenwich Mean Time but I can take part in late sessions
>>>>>>>>> if needed
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> HTH
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>view my Linkedin profile
>>>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>>>>>>>>> for any loss, damage or destruction of data or any other property 
>>>>>>>>> which may
>>>>>>>>> arise from relying on this email's technical content is explicitly
>>>>>>>>> disclaimed. The author will in no case be liable for any monetary 
>>>>>>>>> damages
>>>>>>>>> arising from such loss, damage or destruction.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, 7 Feb 2023 at 23:37, John Zhuge  wrote:
>>>>>>>>>
>>>>>>>>>> Awesome, count me in!
>>>>>>>>>> PST
>>>>>>>>>>
>>>>>>>>>> On Tue, Feb 7, 2023 at 3:34 PM Andrew Melo 
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> I'm Central US time (AKA UTC -6:00)
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Feb 7, 2023 at 5:32 PM Holden Karau <
>>>>>>>>>>> hol...@pigscanfly.ca> wrote:
>>>>>>>>>>> >
>>>>>>>>>>> > Awesome, I guess I should have asked folks for timezones that
>>>>>>>>>>> they’re in.
>>>>>>>>>>> >
>>>>>>>>>>> > On Tue, Feb 7, 2023 at 3:30 PM Andrew Melo <
>>>>>>>>>>> andrew.m...@gmail.com> wrote:
>>>>>>>>>>> >>
>>>>>>>>>>> >> Hello Holden,
>>>>>>>>>>> >>
>>>>>>>>>>> >> We are interested in Spark on k8s and would like the
>>>>>>>>>>> opportunity to
>>>>>>>>>>> >> speak with devs about what we're looking for slash better
>>>>>>>>>>> ways to use
>>>>>>>>>>> >> spark.
>>>>>>>>>>> >>
>>>>>>>>>>> >> Thanks!
>>>>>>>>>>> >> Andrew
>>>>>>>>>>> >>
>>>>>>>>>>> >> On Tue, Feb 7, 2023 at 5:24 PM Holden Karau <
>>>>>>>>>>> hol...@pigscanfly.ca> wrote:
>>>>>>>>>>> >> >
>>>>>>>>>>> >> > Hi Folks,
>>>>>>>>>>> >> >
>>>>>>>>>>> >> > It seems like we could maybe use some additional shared
>>>>>>>>>>> context around Spark on Kube so I’d like to try and schedule a 
>>>>>>>>>>> virtual
>>>>>>>>>>> coffee session.
>>>>>>>>>>> >> >
>>>>>>>>>>> >> > Who all would be interested in virtual adventures around
>>>>>>>>>>> Spark on Kube development?
>>>>>>>>>>> >> >
>>>>>>>>>>> >> > No pressure if the idea of hanging out in a virtual chat
>>>>>>>>>>> with coffee and Spark devs does not sound like your thing, just 
>>>>>>>>>>> trying to
>>>>>>>>>>> make something informal so we can have a better understanding of 
>>>>>>>>>>> everyone’s
>>>>>>>>>>> goals here.
>>>>>>>>>>> >> >
>>>>>>>>>>> >> > Cheers,
>>>>>>>>>>> >> >
>>>>>>>>>>> >> > Holden :)
>>>>>>>>>>> >> > --
>>>>>>>>>>> >> > Twitter: https://twitter.com/holdenkarau
>>>>>>>>>>> >> > Books (Learning Spark, High Performance Spark, etc.):
>>>>>>>>>>> https://amzn.to/2MaRAG9
>>>>>>>>>>> >> > YouTube Live Streams:
>>>>>>>>>>> https://www.youtube.com/user/holdenkarau
>>>>>>>>>>> >
>>>>>>>>>>> > --
>>>>>>>>>>> > Twitter: https://twitter.com/holdenkarau
>>>>>>>>>>> > Books (Learning Spark, High Performance Spark, etc.):
>>>>>>>>>>> https://amzn.to/2MaRAG9
>>>>>>>>>>> > YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> -
>>>>>>>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>> John Zhuge
>>>>>>>>>>
>>>>>>>>>

-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: Spark on Kube (virtual) coffee/tea/pop times

2023-02-08 Thread Holden Karau
Ok y'all so it looks like the majority of interested folks are in pacific
timezone with a few in other north american and european timezones. So I've
made a doodle with a lot of pacific teams for next week
https://doodle.com/meeting/participate/id/e16z3MGa and I was thinking I'll
make another doodle for the following week with more european friendly
times.
Let me know what folks think :)

On Tue, Feb 7, 2023 at 3:23 PM Holden Karau  wrote:

> Hi Folks,
>
> It seems like we could maybe use some additional shared context around
> Spark on Kube so I’d like to try and schedule a virtual coffee session.
>
> Who all would be interested in virtual adventures around Spark on Kube
> development?
>
> No pressure if the idea of hanging out in a virtual chat with coffee and
> Spark devs does not sound like your thing, just trying to make something
> informal so we can have a better understanding of everyone’s goals here.
>
> Cheers,
>
> Holden :)
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: Spark on Kube (virtual) coffee/tea/pop times

2023-02-07 Thread Holden Karau
Awesome, I guess I should have asked folks for timezones that they’re in.

On Tue, Feb 7, 2023 at 3:30 PM Andrew Melo  wrote:

> Hello Holden,
>
> We are interested in Spark on k8s and would like the opportunity to
> speak with devs about what we're looking for slash better ways to use
> spark.
>
> Thanks!
> Andrew
>
> On Tue, Feb 7, 2023 at 5:24 PM Holden Karau  wrote:
> >
> > Hi Folks,
> >
> > It seems like we could maybe use some additional shared context around
> Spark on Kube so I’d like to try and schedule a virtual coffee session.
> >
> > Who all would be interested in virtual adventures around Spark on Kube
> development?
> >
> > No pressure if the idea of hanging out in a virtual chat with coffee and
> Spark devs does not sound like your thing, just trying to make something
> informal so we can have a better understanding of everyone’s goals here.
> >
> > Cheers,
> >
> > Holden :)
> > --
> > Twitter: https://twitter.com/holdenkarau
> > Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9
> > YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Spark on Kube (virtual) coffee/tea/pop times

2023-02-07 Thread Holden Karau
Hi Folks,

It seems like we could maybe use some additional shared context around
Spark on Kube so I’d like to try and schedule a virtual coffee session.

Who all would be interested in virtual adventures around Spark on Kube
development?

No pressure if the idea of hanging out in a virtual chat with coffee and
Spark devs does not sound like your thing, just trying to make something
informal so we can have a better understanding of everyone’s goals here.

Cheers,

Holden :)
-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: Syndicate Apache Spark Twitter to Mastodon?

2022-12-01 Thread Holden Karau
The main negative that I can think of is an additional account for the PMC
to maintain, so if we as a community don’t have many people on Mastodon yet
it might not be worth it. It would probably need about ~20 minutes of setup
work to make the sync (probably most of it finding someone with the
Twitter credentials to enable the sync). The other tricky one is picking a
server (there is no default ASF server that I know of).

On Thu, Dec 1, 2022 at 8:03 AM Russell Spitzer 
wrote:

> Since this is just syndication I don't think arguments on the benefits of
> Twitter vs Mastodon are that important, it's really just what are the costs
> of additionally posting to Mastodon. I'm assuming those costs are basically
> 0 since this can be done by a bot? So I don't think there is any strong
> reason not to do so.
>
>
> On Nov 30, 2022, at 5:51 PM, Dmitry  wrote:
>
> My personal opinion: one of the key features of Twitter is that it is not
> federated, and it is a good platform for announcements and so on. So "it
> would be good to reach our users where they are" means staying on Twitter (most
> companies who use Spark/Databricks are on Twitter).
> For federated features, I think Slack would be a better platform; a lot
> of Apache big data projects have Slack for federated features.
>
> чт, 1 дек. 2022 г., 02:33 Holden Karau :
>
>> I agree that there is probably a majority still on twitter, but it would
>> be a syndication (e.g. we'd keep both).
>>
>> As to the # of devs it's hard to say since:
>> 1) It's a federated service
>> 2) Figuring out if an account is a dev or not is hard
>>
>> But, for example,
>>
>> There seems to be roughly an aggregate 6 million users (
>> https://observablehq.com/@simonw/mastodon-users-and-statuses-over-time
>> ), which seems to be about only ~1% of Twitter's size.
>>
>> Nova's (large K8s focused I believe) has ~29k, tech.lgbt has ~6k, The BSD
>> mastodon has ~1k ( https://bsd.network/about )
>>
>> It's hard to say, but I've noticed a larger number of my tech-affiliated
>> friends moving to Mastodon (personally I now do both).
>>
>> On Wed, Nov 30, 2022 at 3:17 PM Dmitry  wrote:
>>
>>> Hello,
> Do any long-term statistics exist about the number of developers who have
> moved to Mastodon and how active they are?
>
> I believe most devs are still using Twitter.
>>>
>>>
>>> чт, 1 дек. 2022 г., 01:35 Holden Karau :
>>>
>>>> Do we want to start syndicating Apache Spark Twitter to a Mastodon
>>>> instance. It seems like a lot of software dev folks are moving over there
>>>> and it would be good to reach our users where they are.
>>>>
>>>> Any objections / concerns? Any thoughts on which server we should pick
>>>> if we do this?
>>>> --
>>>> Twitter: https://twitter.com/holdenkarau
>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>
>>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>
> --
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: Syndicate Apache Spark Twitter to Mastodon?

2022-11-30 Thread Holden Karau
I agree that there is probably a majority still on twitter, but it would be
a syndication (e.g. we'd keep both).

As to the # of devs it's hard to say since:
1) It's a federated service
2) Figuring out if an account is a dev or not is hard

But, for example,

There seems to be roughly an aggregate 6 million users (
https://observablehq.com/@simonw/mastodon-users-and-statuses-over-time ),
which seems to be about only ~1% of Twitter's size.

Nova's (large K8s focused I believe) has ~29k, tech.lgbt has ~6k, The BSD
mastodon has ~1k ( https://bsd.network/about )

It's hard to say, but I've noticed a larger number of my tech-affiliated
friends moving to Mastodon (personally I now do both).

On Wed, Nov 30, 2022 at 3:17 PM Dmitry  wrote:

> Hello,
> Do any long-term statistics exist about the number of developers who have
> moved to Mastodon and how active they are?
>
> I believe most devs are still using Twitter.
>
>
> чт, 1 дек. 2022 г., 01:35 Holden Karau :
>
>> Do we want to start syndicating Apache Spark Twitter to a Mastodon
>> instance. It seems like a lot of software dev folks are moving over there
>> and it would be good to reach our users where they are.
>>
>> Any objections / concerns? Any thoughts on which server we should pick if
>> we do this?
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>

-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Syndicate Apache Spark Twitter to Mastodon?

2022-11-30 Thread Holden Karau
Do we want to start syndicating Apache Spark Twitter to a Mastodon
instance. It seems like a lot of software dev folks are moving over there
and it would be good to reach our users where they are.

Any objections / concerns? Any thoughts on which server we should pick if
we do this?
-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: Jupyter notebook on Dataproc versus GKE

2022-09-06 Thread Holden Karau
I’ve used Argo for K8s scheduling; for a while it was also what Kubeflow used
underneath for scheduling.

On Tue, Sep 6, 2022 at 10:01 AM Mich Talebzadeh 
wrote:

> Thank you all.
>
> Has anyone used Argo as a k8s scheduler, by any chance?
>
> On Tue, 6 Sep 2022 at 13:41, Bjørn Jørgensen 
> wrote:
>
>> "*JupyterLab is the next-generation user interface for Project Jupyter
>> offering all the familiar building blocks of the classic Jupyter Notebook
>> (notebook, terminal, text editor, file browser, rich outputs, etc.) in a
>> flexible and powerful user interface.*"
>> https://github.com/jupyterlab/jupyterlab
>>
>> You will find them both at https://jupyter.org
>>
>> man. 5. sep. 2022 kl. 23:40 skrev Mich Talebzadeh <
>> mich.talebza...@gmail.com>:
>>
>>> Thanks Bjorn,
>>>
>>> What are the differences, and what functionality does JupyterLab bring on
>>> top of the Jupyter notebook?
>>>
>>>
>>>
>>>view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Mon, 5 Sept 2022 at 20:58, Bjørn Jørgensen 
>>> wrote:
>>>
>>>> The Jupyter notebook is replaced with JupyterLab :)
>>>>
>>>> man. 5. sep. 2022 kl. 21:10 skrev Holden Karau :
>>>>
>>>>>
>>>>>
>>>>> On Mon, Sep 5, 2022 at 9:00 AM Mich Talebzadeh <
>>>>> mich.talebza...@gmail.com> wrote:
>>>>>
>>>>>> Thanks for that.
>>>>>>
>>>>>> How do you rate the performance of Jupyter w/Spark on K8s compared to
>>>>>> the same on a cluster of VMs (for example, Dataproc)?
>>>>>>
>>>>>> Also somehow a related question (may be naive as well). For example,
>>>>>> Google offers a lot of standard ML libraries for example built into a 
>>>>>> data
>>>>>> warehouse like BigQuery. What does the Jupyter notebook offer that others
>>>>>> don't?
>>>>>>
>>>>> Jupyter notebook doesn’t offer any particular set of libraries,
>>>>> although you can add your own to the container etc.
>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>view my Linkedin profile
>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>
>>>>>>
>>>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>>
>>>>>>
>>>>>>
>>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>>>>>> for any loss, damage or destruction of data or any other property which 
>>>>>> may
>>>>>> arise from relying on this email's technical content is explicitly
>>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>>> arising from such loss, damage or destruction.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, 5 Sept 2022 at 12:47, Holden Karau 
>>>>>> wrote:
>>>>>>
>>>>>>> I’ve run Jupyter w/Spark on K8s, haven’t tried it with Dataproc
>>>>>>> personally.
>>>>>>>
>>>>>>> The Spark K8s pod scheduler is now more pluggable for Yunikorn and
>>>>>>> Volcano can be used with less effort.
>>>>>>>
>>>>>>> On Mon, Sep 5, 2022 at 7:44 AM Mich Talebzadeh <
>>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>>
>>>>>>>> Has anyone got experience of running Jupyter on dataproc versus
>>>>>>>> Jupyter notebook on GKE (k8).
>>>>>>

Re: Jupyter notebook on Dataproc versus GKE

2022-09-05 Thread Holden Karau
On Mon, Sep 5, 2022 at 9:00 AM Mich Talebzadeh 
wrote:

> Thanks for that.
>
> How do you rate the performance of Jupyter w/Spark on K8s compared to the
> same on a cluster of VMs (for example, Dataproc)?
>
> Also somehow a related question (may be naive as well). For example,
> Google offers a lot of standard ML libraries for example built into a data
> warehouse like BigQuery. What does the Jupyter notebook offer that others
> don't?
>
Jupyter notebook doesn’t offer any particular set of libraries, although
you can add your own to the container etc.

>
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Mon, 5 Sept 2022 at 12:47, Holden Karau  wrote:
>
>> I’ve run Jupyter w/Spark on K8s, haven’t tried it with Dataproc
>> personally.
>>
>> The Spark K8s pod scheduler is now more pluggable for Yunikorn and
>> Volcano can be used with less effort.
>>
>> On Mon, Sep 5, 2022 at 7:44 AM Mich Talebzadeh 
>> wrote:
>>
>>>
>>> Hi,
>>>
>>>
>>> Has anyone got experience of running Jupyter on dataproc versus Jupyter
>>> notebook on GKE (k8).
>>>
>>>
>>> I have not looked at this for a while, but my understanding is that Spark
>>> on GKE/k8 is not yet performant. This is classic Spark with Python/PySpark.
>>>
>>>
>>> Also, I would like to know the state of Spark with Volcano. Has progress
>>> been made on that front?
>>>
>>>
>>> Regards,
>>>
>>>
>>> Mich
>>>
>>>
>>>view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
> --
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: Jupyter notebook on Dataproc versus GKE

2022-09-05 Thread Holden Karau
I’ve run Jupyter w/Spark on K8s, haven’t tried it with Dataproc personally.

The Spark K8s pod scheduler is now more pluggable for Yunikorn and Volcano
can be used with less effort.
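For concreteness, the pluggable scheduler hookup is a submit-time setting — a
sketch, assuming Spark 3.3+ (where spark.kubernetes.scheduler.name was added)
and a cluster that already has the Volcano scheduler installed:

  spark-submit \
    --master k8s://https://<api-server>:6443 \
    --conf spark.kubernetes.scheduler.name=volcano \
    ...  # remaining image/class/jar arguments as usual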

On Mon, Sep 5, 2022 at 7:44 AM Mich Talebzadeh 
wrote:

>
> Hi,
>
>
> Has anyone got experience of running Jupyter on dataproc versus Jupyter
> notebook on GKE (k8).
>
>
> I have not looked at this for a while, but my understanding is that Spark
> on GKE/k8 is not yet performant. This is classic Spark with Python/PySpark.
>
>
> Also, I would like to know the state of Spark with Volcano. Has progress
> been made on that front?
>
>
> Regards,
>
>
> Mich
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: [SPARK-39515] Improve scheduled jobs in GitHub Actions

2022-06-20 Thread Holden Karau
How about a hallway meet-up at Data + AI Summit to talk about the build/CI if
folks are interested?

On Sun, Jun 19, 2022 at 7:50 PM Hyukjin Kwon  wrote:

> Increased the priority to a blocker - I don't think we can release with
> these build failures and poor CI
>
> On Mon, 20 Jun 2022 at 10:39, Hyukjin Kwon  wrote:
>
>> There are too many test failures here. I pinged on some PRs I could
>> identify from a cursory look, but it would be great for you to take a look
>> if you haven't tested your changes against other environments like JDK
>> 11 or Scala 2.13.
>>
>> On Mon, 20 Jun 2022 at 10:04, Hyukjin Kwon  wrote:
>>
>>> Hi all,
>>>
>>> I am trying to rework GitHub Actions CI at
>>> https://issues.apache.org/jira/browse/SPARK-39515. Any help would be
>>> very appreciated.
>>>
>>>
>>> --
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: [VOTE][SPIP] Spark Connect

2022-06-16 Thread Holden Karau
+1

On Thu, Jun 16, 2022 at 7:17 AM Thomas Graves  wrote:

> +1 for the concept.
> Correct me if I'm wrong, but at a high level this is proposing adding
> a new user API (which is language agnostic) and the proposal is to
> start with something like the Logical Plan, with the addition of being
> able to remotely call this API.
>
> +0 on architecture/design as it's not clear from the doc how much
> impact this truly has. But that is a problem with SPIPs which I have
> voiced my concern about in the past.
> I can see how this could be fleshed out and keep the overall impact
> minimal (vs blow up the world architecture change) since it's not just
> a drop in replacement for all existing APIs.  For instance,
> conceptually this is just a version of the Spark thriftserver which
> uses grpc and passes the new API and internally we add a new API
> runPlan(LogicalPlan).  You could potentially also not use the internal
> version of the catalyst Logical Plan API but have some conversion
> still to allow changes to catalyst internals, not sure if that is
> needed but a possibility.
> With any API addition it will have to be kept stable and require more
> testing and likely more dev work, so weighing that vs usefulness is
> the question for me.
>
> Tom
>
> On Mon, Jun 13, 2022 at 1:04 PM Herman van Hovell
>  wrote:
> >
> > Hi all,
> >
> > I’d like to start a vote for SPIP: "Spark Connect"
> >
> > The goal of the SPIP is to introduce a Dataframe based client/server API
> for Spark
> >
> > Please also refer to:
> >
> > - Previous discussion in dev mailing list: [DISCUSS] SPIP: Spark Connect
> - A client and server interface for Apache Spark.
> > - Design doc: Spark Connect - A client and server interface for Apache
> Spark.
> > - JIRA: SPARK-39375
> >
> > Please vote on the SPIP for the next 72 hours:
> >
> > [ ] +1: Accept the proposal as an official SPIP
> > [ ] +0
> > [ ] -1: I don’t think this is a good idea because …
> >
> > Kind Regards,
> > Herman
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
> --
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: [VOTE] Release Spark 3.3.0 (RC6)

2022-06-13 Thread Holden Karau
+1

On Mon, Jun 13, 2022 at 4:51 PM Yuming Wang  wrote:

> +1 (non-binding)
>
> On Tue, Jun 14, 2022 at 7:41 AM Dongjoon Hyun 
> wrote:
>
>> +1
>>
>> Thanks,
>> Dongjoon.
>>
>> On Mon, Jun 13, 2022 at 3:54 PM Chris Nauroth 
>> wrote:
>>
>>> +1 (non-binding)
>>>
>>> I repeated all checks I described for RC5:
>>>
>>> https://lists.apache.org/thread/ksoxmozgz7q728mnxl6c2z7ncmo87vls
>>>
>>> Maxim, thank you for your dedication on these release candidates.
>>>
>>> Chris Nauroth
>>>
>>>
>>> On Mon, Jun 13, 2022 at 3:21 PM Mridul Muralidharan 
>>> wrote:
>>>

 +1

 Signatures, digests, etc check out fine.
 Checked out tag and build/tested with -Pyarn -Pmesos -Pkubernetes
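For anyone new to verifying an RC, the signature and digest checks mentioned
here usually look like the following — a sketch, assuming the KEYS file and
the binary tarball with its .asc/.sha512 companions have been downloaded from
the URLs in the vote email (file names are illustrative):

  gpg --import KEYS
  gpg --verify spark-3.3.0-bin-hadoop3.tgz.asc spark-3.3.0-bin-hadoop3.tgz
  shasum -a 512 -c spark-3.3.0-bin-hadoop3.tgz.sha512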

 The test "SPARK-33084: Add jar support Ivy URI in SQL" in
 sql.SQLQuerySuite fails; but other than that, rest looks good.

 Regards,
 Mridul



 On Mon, Jun 13, 2022 at 4:25 PM Tom Graves 
 wrote:

> +1
>
> Tom
>
> On Thursday, June 9, 2022, 11:27:50 PM CDT, Maxim Gekk
>  wrote:
>
>
> Please vote on releasing the following candidate as
> Apache Spark version 3.3.0.
>
> The vote is open until 11:59pm Pacific time June 14th and passes if a
> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.3.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v3.3.0-rc6 (commit
> f74867bddfbcdd4d08076db36851e88b15e66556):
> https://github.com/apache/spark/tree/v3.3.0-rc6
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc6-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1407
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc6-docs/
>
> The list of bug fixes going into 3.3.0 can be found at the following
> URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12350369
>
> This release is using the release script of the tag v3.3.0-rc6.
>
>
> FAQ
>
> =
> How can I help test this release?
> =
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your projects resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out-of-date RC going forward).
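As a concrete illustration of the PySpark route described above — a sketch,
where the exact tarball name under the RC bin directory is an assumption:

  python -m venv spark-rc-test && source spark-rc-test/bin/activate
  pip install https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc6-bin/pyspark-3.3.0.tar.gz
  python -c "import pyspark; print(pyspark.__version__)"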
>
> ===
> What should happen to JIRA tickets still targeting 3.3.0?
> ===
> The current list of open tickets targeted at 3.3.0 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.3.0
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>
> Maxim Gekk
>
> Software Engineer
>
> Databricks, Inc.
>


-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: Reverse proxy for Spark UI on Kubernetes

2022-05-17 Thread Holden Karau
Could we make it do the same sort of history server fallback approach?
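For context, the "fallback" here refers to the history server picking up
finished applications from their event logs — a sketch of the relevant
settings, with the storage path as a placeholder:

  # application side: write event logs to shared storage
  spark.eventLog.enabled           true
  spark.eventLog.dir               s3a://<bucket>/spark-events
  # history server side: read from the same location
  spark.history.fs.logDirectory    s3a://<bucket>/spark-events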

On Tue, May 17, 2022 at 10:41 PM bo yang  wrote:

> It is like the Web Application Proxy in YARN (
> https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/WebApplicationProxy.html),
> providing easy access to the Spark UI while the Spark application is running.
>
> When running Spark on Kubernetes with S3, there is no YARN. The reverse
> proxy here behaves like that Web Application Proxy; it
> simplifies the setup needed to access the Spark UI on Kubernetes.
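For contrast, the manual alternative such a proxy replaces is a
per-application port-forward — a sketch, with the driver service name as a
placeholder:

  # forward the running driver's UI port to localhost
  kubectl port-forward svc/<spark-driver-svc> 4040:4040
  # then browse to http://localhost:4040 while the application runs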
>
>
> On Mon, May 16, 2022 at 11:46 PM wilson  wrote:
>
>> what's the advantage of using reverse proxy for spark UI?
>>
>> Thanks
>>
>> On Tue, May 17, 2022 at 1:47 PM bo yang  wrote:
>>
>>> Hi Spark Folks,
>>>
>>> I built a web reverse proxy to access Spark UI on Kubernetes (working
>>> together with
>>> https://github.com/GoogleCloudPlatform/spark-on-k8s-operator). Want to
>>> share here in case other people have similar need.
>>>
>>> The reverse proxy code is here:
>>> https://github.com/datapunchorg/spark-ui-reverse-proxy
>>>
>>> Let me know if anyone wants to use or would like to contribute.
>>>
>>> Thanks,
>>> Bo
>>>
>>> --
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: Reverse proxy for Spark UI on Kubernetes

2022-05-17 Thread Holden Karau
Oh that’s rad 

On Tue, May 17, 2022 at 7:47 AM bo yang  wrote:

> Hi Spark Folks,
>
> I built a web reverse proxy to access Spark UI on Kubernetes (working
> together with https://github.com/GoogleCloudPlatform/spark-on-k8s-operator).
> Want to share here in case other people have similar need.
>
> The reverse proxy code is here:
> https://github.com/datapunchorg/spark-ui-reverse-proxy
>
> Let me know if anyone wants to use or would like to contribute.
>
> Thanks,
> Bo
>
> --
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: [VOTE] Release Spark 3.3.0 (RC1)

2022-05-11 Thread Holden Karau
Do we have everything we want for RC2 targeted to 3.3.0 for tracking?

On Wed, May 11, 2022 at 6:44 AM Maxim Gekk
 wrote:

> Hi All,
>
> The vote has failed. I will create RC2 in a couple of days.
>
> Maxim Gekk
>
> Software Engineer
>
> Databricks, Inc.
>
>
> On Wed, May 11, 2022 at 4:23 AM Hyukjin Kwon  wrote:
>
>> I expect to see RC2 too. I guess he just sticks to the standard, leaving
>> the vote open till the end.
>> It hasn't got enough +1s anyway :-).
>>
>> On Wed, 11 May 2022 at 10:17, Holden Karau  wrote:
>>
>>> Technically, releases don't follow vetoes (see
>>> https://www.apache.org/foundation/voting.html ); it's up to the RM if
>>> they get the minimum number of binding +1s (although they are encouraged to
>>> cancel the release if any serious issues are raised).
>>>
>>> That being said I'll add my -1 based on the issues reported in this
>>> thread.
>>>
>>> On Tue, May 10, 2022 at 6:07 PM Sean Owen  wrote:
>>>
>>>> There's a -1 vote here, so I think this RC fails anyway.
>>>>
>>>> On Fri, May 6, 2022 at 10:30 AM Gengliang Wang 
>>>> wrote:
>>>>
>>>>> Hi Maxim,
>>>>>
>>>>> Thanks for the work!
>>>>> There is a bug fix from Bruce merged on branch-3.3 right after the RC1
>>>>> is cut:
>>>>> SPARK-39093: Dividing interval by integral can result in codegen
>>>>> compilation error
>>>>> <https://github.com/apache/spark/commit/fd998c8a6783c0c8aceed8dcde4017cd479e42c8>
>>>>>
>>>>> So -1 from me. We should have RC2 to include the fix.
>>>>>
>>>>> Thanks
>>>>> Gengliang
>>>>>
>>>>> On Fri, May 6, 2022 at 6:15 PM Maxim Gekk
>>>>>  wrote:
>>>>>
>>>>>> Hi Dongjoon,
>>>>>>
>>>>>>  > https://issues.apache.org/jira/projects/SPARK/versions/12350369
>>>>>> > Since RC1 is started, could you move them out from the 3.3.0
>>>>>> milestone?
>>>>>>
>>>>>> I have removed the 3.3.0 label from Fix version(s). Thank you,
>>>>>> Dongjoon.
>>>>>>
>>>>>> Maxim Gekk
>>>>>>
>>>>>> Software Engineer
>>>>>>
>>>>>> Databricks, Inc.
>>>>>>
>>>>>>
>>>>>> On Fri, May 6, 2022 at 11:06 AM Dongjoon Hyun <
>>>>>> dongjoon.h...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi, Sean.
>>>>>>> It's interesting. I didn't see those failures from my side.
>>>>>>>
>>>>>>> Hi, Maxim.
>>>>>>> In the following link, there are 17 in-progress and 6 to-do JIRA
>>>>>>> issues which look irrelevant to this RC1 vote.
>>>>>>>
>>>>>>> https://issues.apache.org/jira/projects/SPARK/versions/12350369
>>>>>>>
>>>>>>> Since RC1 is started, could you move them out from the 3.3.0
>>>>>>> milestone?
>>>>>>> Otherwise, we cannot distinguish new real blocker issues from those
>>>>>>> obsolete JIRA issues.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Dongjoon.
>>>>>>>
>>>>>>>
>>>>>>> On Thu, May 5, 2022 at 11:46 AM Adam Binford 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I looked back at the first one (SPARK-37618), it expects/assumes a
>>>>>>>> 0022 umask to correctly test the behavior. I'm not sure how to get 
>>>>>>>> that to
>>>>>>>> not fail or be ignored with a more open umask.
>>>>>>>>
>>>>>>>> On Thu, May 5, 2022 at 1:56 PM Sean Owen  wrote:
>>>>>>>>
>>>>>>>>> I'm seeing test failures; is anyone seeing ones like this? This is
>>>>>>>>> Java 8 / Scala 2.12 / Ubuntu 22.04:
>>>>>>>>>
>>>>>>>>> - SPARK-37618: Sub dirs are group writable when removing from
>>>>>>>>> shuffle service enabled *** FAILED ***
>>>>>>>>>   [OWNER_WRITE, GROUP_READ, GROUP_WRITE, GROUP_EXECUTE,
>>>>>>>>> OTHERS_READ, O

Re: [VOTE] Release Spark 3.3.0 (RC1)

2022-05-10 Thread Holden Karau
Technically, releases don't follow vetoes (see
https://www.apache.org/foundation/voting.html ); it's up to the RM if they
get the minimum number of binding +1s (although they are encouraged to
cancel the release if any serious issues are raised).

That being said I'll add my -1 based on the issues reported in this thread.

On Tue, May 10, 2022 at 6:07 PM Sean Owen  wrote:

> There's a -1 vote here, so I think this RC fails anyway.
>
> On Fri, May 6, 2022 at 10:30 AM Gengliang Wang  wrote:
>
>> Hi Maxim,
>>
>> Thanks for the work!
>> There is a bug fix from Bruce merged on branch-3.3 right after the RC1 is
>> cut:
>> SPARK-39093: Dividing interval by integral can result in codegen
>> compilation error
>> 
>>
>> So -1 from me. We should have RC2 to include the fix.
>>
>> Thanks
>> Gengliang
>>
>> On Fri, May 6, 2022 at 6:15 PM Maxim Gekk
>>  wrote:
>>
>>> Hi Dongjoon,
>>>
>>>  > https://issues.apache.org/jira/projects/SPARK/versions/12350369
>>> > Since RC1 is started, could you move them out from the 3.3.0 milestone?
>>>
>>> I have removed the 3.3.0 label from Fix version(s). Thank you, Dongjoon.
>>>
>>> Maxim Gekk
>>>
>>> Software Engineer
>>>
>>> Databricks, Inc.
>>>
>>>
>>> On Fri, May 6, 2022 at 11:06 AM Dongjoon Hyun 
>>> wrote:
>>>
 Hi, Sean.
 It's interesting. I didn't see those failures from my side.

 Hi, Maxim.
 In the following link, there are 17 in-progress and 6 to-do JIRA issues
 which look irrelevant to this RC1 vote.

 https://issues.apache.org/jira/projects/SPARK/versions/12350369

 Since RC1 is started, could you move them out from the 3.3.0 milestone?
 Otherwise, we cannot distinguish new real blocker issues from those
 obsolete JIRA issues.

 Thanks,
 Dongjoon.


 On Thu, May 5, 2022 at 11:46 AM Adam Binford  wrote:

> I looked back at the first one (SPARK-37618), it expects/assumes a
> 0022 umask to correctly test the behavior. I'm not sure how to get that to
> not fail or be ignored with a more open umask.
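For anyone hitting this locally, the simplest workaround is to run the suite
under the umask the test assumes — a sketch, assuming a Spark source checkout
(-DwildcardSuites narrows the run to the failing suite):

  umask 0022
  ./build/mvn -pl core test -Dtest=none \
    -DwildcardSuites=org.apache.spark.storage.DiskBlockManagerSuite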
>
> On Thu, May 5, 2022 at 1:56 PM Sean Owen  wrote:
>
>> I'm seeing test failures; is anyone seeing ones like this? This is
>> Java 8 / Scala 2.12 / Ubuntu 22.04:
>>
>> - SPARK-37618: Sub dirs are group writable when removing from shuffle
>> service enabled *** FAILED ***
>>   [OWNER_WRITE, GROUP_READ, GROUP_WRITE, GROUP_EXECUTE, OTHERS_READ,
>> OWNER_READ, OTHERS_EXECUTE, OWNER_EXECUTE] contained GROUP_WRITE
>> (DiskBlockManagerSuite.scala:155)
>>
>> - Check schemas for expression examples *** FAILED ***
>>   396 did not equal 398 Expected 396 blocks in result file but got
>> 398. Try regenerating the result files. 
>> (ExpressionsSchemaSuite.scala:161)
>>
>>  Function 'bloom_filter_agg', Expression class
>> 'org.apache.spark.sql.catalyst.expressions.aggregate.BloomFilterAggregate'
>> "" did not start with "
>>   Examples:
>>   " (ExpressionInfoSuite.scala:142)
>>
>> On Thu, May 5, 2022 at 6:01 AM Maxim Gekk
>>  wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark
>>>  version 3.3.0.
>>>
>>> The vote is open until 11:59pm Pacific time May 10th and passes if
>>> a majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>>
>>> [ ] +1 Release this package as Apache Spark 3.3.0
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see http://spark
>>> .apache.org/
>>>
>>> The tag to be voted on is v3.3.0-rc1 (commit
>>> 482b7d54b522c4d1e25f3e84eabbc78126f22a3d):
>>> https://github.com/apache/spark/tree/v3.3.0-rc1
>>>
>>> The release files, including signatures, digests, etc. can be found
>>> at:
>>> https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc1-bin/
>>>
>>> Signatures used for Spark RCs can be found in this file:
>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>
>>> The staging repository for this release can be found at:
>>>
>>> https://repository.apache.org/content/repositories/orgapachespark-1402
>>>
>>> The documentation corresponding to this release can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc1-docs/
>>>
>>> The list of bug fixes going into 3.3.0 can be found at the
>>> following URL:
>>> https://issues.apache.org/jira/projects/SPARK/versions/12350369
>>>
>>> This release is using the release script of the tag v3.3.0-rc1.
>>>
>>>
>>> FAQ
>>>
>>> =
>>> How can I help test this release?
>>> =
>>> If you are a Spark user, you can help us test this release by taking
>>> an existing Spark workload and running on this release candidate,
>>> then
>>> reporting any 

Re: Apache Spark 3.3 Release

2022-03-16 Thread Holden Karau
> >
> >
> > On Tue, Mar 15, 2022 at 10:17 AM Chao Sun  wrote:
> >
> > Cool, thanks for clarifying!
> >
> > On Tue, Mar 15, 2022 at 10:11 AM Xiao Li  wrote:
> > >>
> > >> For the following list:
> > >> #35789 [SPARK-32268][SQL] Row-level Runtime Filtering
> > >> #34659 [SPARK-34863][SQL] Support complex types for Parquet
> vectorized reader
> > >> #35848 [SPARK-38548][SQL] New SQL function: try_sum
> > >> Do you mean we should include them, or exclude them from 3.3?
> > >
> > >
> > > If possible, I hope these features can be shipped with Spark 3.3.
> > >
> > >
> > >
> > > Chao Sun  于2022年3月15日周二 10:06写道:
> > >>
> > >> Hi Xiao,
> > >>
> > >> For the following list:
> > >>
> > >> #35789 [SPARK-32268][SQL] Row-level Runtime Filtering
> > >> #34659 [SPARK-34863][SQL] Support complex types for Parquet
> vectorized reader
> > >> #35848 [SPARK-38548][SQL] New SQL function: try_sum
> > >>
> > >> Do you mean we should include them, or exclude them from 3.3?
> > >>
> > >> Thanks,
> > >> Chao
> > >>
> > >> On Tue, Mar 15, 2022 at 9:56 AM Dongjoon Hyun <
> dongjoon.h...@gmail.com> wrote:
> > >> >
> > >> > The following was tested and merged a few minutes ago. So, we can
> remove it from the list.
> > >> >
> > >> > #35819 [SPARK-38524][SPARK-38553][K8S] Bump Volcano to v1.5.1
> > >> >
> > >> > Thanks,
> > >> > Dongjoon.
> > >> >
> > >> > On Tue, Mar 15, 2022 at 9:48 AM Xiao Li 
> wrote:
> > >> >>
> > >> >> Let me clarify my above suggestion. Maybe we can wait 3 more days
> to collect the list of actively developed PRs that we want to merge to 3.3
> after the branch cut?
> > >> >>
> > >> >> Please do not rush to merge the PRs that are not fully reviewed.
> We can cut the branch this Friday and continue merging the PRs that have
> been discussed in this thread. Does that make sense?
> > >> >>
> > >> >> Xiao
> > >> >>
> > >> >>
> > >> >>
> > >> >> Holden Karau  于2022年3月15日周二 09:10写道:
> > >> >>>
> > >> >>> May I suggest we push out one week (22nd) just to give everyone a
> bit of breathing space? Rushed software development more often results in
> bugs.
> > >> >>>
> > >> >>> On Tue, Mar 15, 2022 at 6:23 AM Yikun Jiang 
> wrote:
> > >> >>>>
> > >> >>>> > To make our release time more predictable, let us collect the
> PRs and wait three more days before the branch cut?
> > >> >>>>
> > >> >>>> For SPIP: Support Customized Kubernetes Schedulers:
> > >> >>>> #35819 [SPARK-38524][SPARK-38553][K8S] Bump Volcano to v1.5.1
> > >> >>>>
> > >> >>>> Three more days are OK for this from my view.
> > >> >>>>
> > >> >>>> Regards,
> > >> >>>> Yikun
> > >> >>>
> > >> >>> --
> > >> >>> Twitter: https://twitter.com/holdenkarau
> > >> >>> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9
> > >> >>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: Apache Spark 3.3 Release

2022-03-15 Thread Holden Karau
May I suggest we push out one week (22nd) just to give everyone a bit of
breathing space? Rushed software development more often results in bugs.

On Tue, Mar 15, 2022 at 6:23 AM Yikun Jiang  wrote:

> > To make our release time more predictable, let us collect the PRs and
> wait three more days before the branch cut?
>
> For SPIP: Support Customized Kubernetes Schedulers:
> #35819 [SPARK-38524][SPARK-38553][K8S] Bump Volcano to v1.5.1
> 
>
> Three more days are OK for this from my view.
>
> Regards,
> Yikun
>
-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: Apache Spark 3.3 Release

2022-03-14 Thread Holden Karau
On Mon, Mar 14, 2022 at 11:53 PM Xiao Li  wrote:

> Could you please list which features we want to finish before the branch
> cut? How long will they take?
>
> Xiao
>
> Chao Sun  于2022年3月14日周一 13:30写道:
>
>> Hi Max,
>>
>> As there are still some ongoing work for the above listed SPIPs, can we
>> still merge them after the branch cut?
>>
> In the past we’ve allowed merges for actively developed PRs post branch
cut, but it is easier when it doesn’t need to be cherry picked (eg pre cut).

>
>> Thanks,
>> Chao
>>
>> On Mon, Mar 14, 2022 at 6:12 AM Maxim Gekk
>>  wrote:
>>
>>> Hi All,
>>>
>>> Since there are no actual blockers for Spark 3.3.0 and significant
>>> objections, I am going to cut branch-3.3 after 15th March at 00:00 PST.
>>> Please, let us know if you have any concerns about that.
>>>
>>> Best regards,
>>> Max Gekk
>>>
>>>
>>> On Thu, Mar 3, 2022 at 9:44 PM Maxim Gekk 
>>> wrote:
>>>
 Hello All,

 I would like to bring on the table the theme about the new Spark
 release 3.3. According to the public schedule at
 https://spark.apache.org/versioning-policy.html, we planned to start
 the code freeze and release branch cut on March 15th, 2022. Since this date
 is coming soon, I would like to take your attention on the topic and gather
 objections that you might have.

 Bellow is the list of ongoing and active SPIPs:

 Spark SQL:
 - [SPARK-31357] DataSourceV2: Catalog API for view metadata
 - [SPARK-35801] Row-level operations in Data Source V2
 - [SPARK-37166] Storage Partitioned Join

 Spark Core:
 - [SPARK-20624] Add better handling for node shutdown
 - [SPARK-25299] Use remote storage for persisting shuffle data

 PySpark:
 - [SPARK-26413] RDD Arrow Support in Spark Core and PySpark

 Kubernetes:
 - [SPARK-36057] Support Customized Kubernetes Schedulers

 Probably, we should finish if there are any remaining works for Spark
 3.3, and switch to QA mode, cut a branch and keep everything on track. I
 would like to volunteer to help drive this process.

 Best regards,
 Max Gekk

>>> --
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: CVE-2021-38296: Apache Spark Key Negotiation Vulnerability

2022-03-09 Thread Holden Karau
CVEs are generally not mentioned in the release notes or JIRA; instead, we
track them at https://spark.apache.org/security.html once they are resolved.
(Prior to resolution, reports go to secur...@spark.apache.org, which allows
the project time to fix the issue before public disclosure, so there is a
fixed version for people to upgrade to.)

On Wed, Mar 9, 2022 at 2:58 PM Manu Zhang  wrote:

> Hi Sean,
>
> I don't find it in 3.1.3 release notes
> https://spark.apache.org/releases/spark-release-3-1-3.html. Is it tracked
> somewhere?
>
> On Thu, Mar 10, 2022 at 6:14 AM Sean R. Owen  wrote:
>
>> Severity: moderate
>>
>> Description:
>>
>> Apache Spark supports end-to-end encryption of RPC connections via
>> "spark.authenticate" and "spark.network.crypto.enabled". In versions 3.1.2
>> and earlier, it uses a bespoke mutual authentication protocol that allows
>> for full encryption key recovery. After an initial interactive attack, this
>> would allow someone to decrypt plaintext traffic offline. Note that this
>> does not affect security mechanisms controlled by
>> "spark.authenticate.enableSaslEncryption", "spark.io.encryption.enabled",
>> "spark.ssl", "spark.ui.strictTransportSecurity".
>>
>> Mitigation:
>>
>> Update to Apache Spark 3.1.3 or later
>>
>> Credit:
>>
>> Steve Weis (Databricks)
>>
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>

-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: [ANNOUNCE] Apache Spark 3.1.3 released + Docker images

2022-02-25 Thread Holden Karau
On Fri, Feb 25, 2022 at 8:31 AM Ismaël Mejía  wrote:

> The ready to use docker images are great news. I have been waiting for
> this for so long! Extra kudos for including ARM64 versions too!
>
> I am curious, what are the non-ASF artifacts included in them (or you
> refer to the OS specific elements with other licenses?), and what
> consequences might be for end users because of that.
>
OS elements, the JDK, Python, etc. For any licensing concerns you should
probably consult a lawyer.

>
> Thanks and kudos to everyone who helped to make this happen!
> Ismaël
>
> ps. Any plans to make this images official docker images at some point
> (for the extra security/validation) [1]
> [1] https://docs.docker.com/docker-hub/official_images/
>
> On Mon, Feb 21, 2022 at 10:09 PM Holden Karau 
> wrote:
> >
> > We are happy to announce the availability of Spark 3.1.3!
> >
> > Spark 3.1.3 is a maintenance release containing stability fixes. This
> > release is based on the branch-3.1 maintenance branch of Spark. We
> strongly
> > recommend all 3.1 users to upgrade to this stable release.
> >
> > To download Spark 3.1.3, head over to the download page:
> > https://spark.apache.org/downloads.html
> >
> > To view the release notes:
> > https://spark.apache.org/releases/spark-release-3-1-3.html
> >
> > We would like to acknowledge all community members for contributing to
> this
> > release. This release would not have been possible without you.
> >
> > New Dockerhub magic in this release:
> >
> > We've also started publishing docker containers to the Apache Dockerhub,
> > these contain non-ASF artifacts that are subject to different license
> terms than the
> > Spark release. The docker containers are built for Linux x86 and ARM64
> since that's
> > what I have access to (thanks to NV for the ARM64 machines).
> >
> > You can get them from https://hub.docker.com/apache/spark (and spark-r
> and spark-py) :)
> > (And version 3.2.1 is also now published on Dockerhub).
> >
> > Holden
> >
> > --
> > Twitter: https://twitter.com/holdenkarau
> > Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9
> > YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>

-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: [ANNOUNCE] Apache Spark 3.1.3 released + Docker images

2022-02-22 Thread Holden Karau
So you're more than welcome to still build your own Spark docker containers
with the docker image tool; these are provided to make it easier for folks
without specific needs. In the future we'll hopefully have published Spark
containers tagged for different JDKs, but that work has not yet been done.
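
For anyone wanting a starting point, here is a minimal sketch of such a
custom build using the docker-image-tool shipped in the Spark binary
distribution (the repo name, tag, and Java base-image tag are placeholders
to adjust for your environment):

  # from the root of an unpacked Spark distribution
  ./bin/docker-image-tool.sh -r myrepo -t 3.1.3-custom \
    -b java_image_tag=11-jre-slim build
  # optionally also build the PySpark image, then push everything
  ./bin/docker-image-tool.sh -r myrepo -t 3.1.3-custom \
    -p ./kubernetes/dockerfiles/spark/bindings/python/Dockerfile build
  ./bin/docker-image-tool.sh -r myrepo -t 3.1.3-custom push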

On Tue, Feb 22, 2022 at 10:51 AM Denis Bolshakov 
wrote:

> Hello Holden,
>
> Could you please provide more details and plan for docker images support?
>
> So far I see that there are only two tags, I get from them spark version,
> but there is no information about java, hadoop, scala versions.
>
> Also there is no description on docker hub, probably it would be nice to
> put a link to Docker files in github repository.
>
> What directories are expected to be mounted and ports forwarded? How can I
> mount the krb5.conf file and directory where my kerberos ticket is located?
>
> I've pulled the docker image with tag spark 3.2.1 and I see that there is
> java 11 and hadoop 3.3, but our environment requires us to have other
> versions.
>
> On Tue, 22 Feb 2022 at 16:29, Mich Talebzadeh 
> wrote:
>
>> Well that is just a recommendation.
>>
>> The onus is on me, the user, to download it and run a dev and test
>> suite of batch jobs to ensure that all work OK, especially on the
>> edge, sign the release off, and roll it out into production. It wouldn't be
>> prudent otherwise.
>>
>> HTH
>>
>> On Tue, 22 Feb 2022 at 12:12, Bjørn Jørgensen 
>> wrote:
>>
>>> "Spark 3.1.3 is a maintenance release containing stability fixes. This
>>> release is based on the branch-3.1 maintenance branch of Spark. We strongly
>>> recommend all 3.1.3 users to upgrade to this stable release."
>>> https://spark.apache.org/releases/spark-release-3-1-3.html
>>>
>>> Do we have another 3.1.3, or do we strongly recommend all 3.1.2 users to
>>> upgrade to this stable release?
>>>
>>> On Tue, Feb 22, 2022 at 09:50, angers zhu wrote:
>>>
>>>> Hi,  seems
>>>>
>>>>- [SPARK-35391] <https://issues.apache.org/jira/browse/SPARK-36339>:
>>>>Memory leak in ExecutorAllocationListener breaks dynamic allocation 
>>>> under
>>>>high load
>>>>
>>>> Links to wrong jira ticket?
>>>>
>>>> On Tue, Feb 22, 2022 at 15:49, Mich Talebzadeh wrote:
>>>>
>>>>> Well, that is pretty easy to do.
>>>>>
>>>>> However, a quick fix for now could be to retag the image created. It
>>>>> is a small volume which can be done manually for now. For example, I just
>>>>> downloaded v3.1.3
>>>>>
>>>>>
>>>>> docker image ls
>>>>>
>>>>> REPOSITORY TAG
>>>>> IMAGE ID   CREATEDSIZE
>>>>>
>>>>> apache/spark   v3.1.3
>>>>>31ed15daa2bf   12 hours ago   531MB
>>>>>
>>>>> Retag it with
>>>>>
>>>>>
>>>>> docker tag 31ed15daa2bf
>>>>> apache/spark/tags/spark-3.1.3-scala_2.12-8-jre-slim-buster
>>>>>
>>>>> docker image ls
>>>>>
>>>>> REPOSITORY   TAG
>>>>>   IMAGE ID   CREATED
>>>>> SIZE
>>>>>
>>>>> apache/spark/tags/spark-3.1.3-scala_2.12-8-jre-slim-buster   latest
>>>>>  31ed15daa2bf   12 hours ago
>>>>>  531MB
>>>>>
>>>>> Then push it with (example)
>>>>>
>>>>> docker push apache/spark/tags/spark-3.1.3-scala_2.12-8-jre-slim-buster
>>>>>
>>>>>
>>>>> HTH
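
One caveat on the retag sketch above: Docker Hub repository names only take
a namespace/name pair, so while the multi-component name works as a local
tag, a push to Docker Hub would need the variant encoded after the colon as
the tag instead, roughly:

  docker tag 31ed15daa2bf apache/spark:3.1.3-scala_2.12-8-jre-slim-buster
  docker push apache/spark:3.1.3-scala_2.12-8-jre-slim-buster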
>>>>>
>>>>>
>>>>>view my Linkedin profile
>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>
>>>>>
>>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>
>>>>>
>>>>>
>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>>> any loss, damage or destruction of data or any other property which may
>>>>> arise from relying on this email's technical content is explicitly
>>>>> disclaimed. The author will in no case be liable for any monetary
>>>>> damages arising from such loss, damage or destruction.

Re: [ANNOUNCE] Apache Spark 3.1.3 released + Docker images

2022-02-21 Thread Holden Karau
Yeah, I think we should still adopt that naming convention; however, no one
has taken the time to write a script to do it yet, so until we get that
script merged I think we'll just have one build. I can try and do that for
the next release, but it would be a great second issue for someone getting
more familiar with the release tooling.

On Mon, Feb 21, 2022 at 2:18 PM Mich Talebzadeh 
wrote:

> Ok thanks for the correction.
>
> The docker pull line shows as follows:
>
> docker pull apache/spark:v3.2.1
>
>
> So this only tells me the version of Spark 3.2.1
>
>
> I thought we discussed deciding on the docker naming conventions in
> detail, and broadly agreed on what needs to be in the naming convention.
> For example, in this thread:
>
>
> Time to start publishing Spark Docker Images? - mich.talebza...@gmail.com
> - Gmail (google.com)
> <https://mail.google.com/mail/u/0/?hl=en-GB#search/publishing/FMfcgzGkZQSzbXWQDWfddGDNRDQfPCpg>
>  dated
> 22nd July 2021
>
>
> Referring to that, I think the broad agreement was that the docker image
> name should be of the form:
>
>
> The name of the file provides:
>
>- Built for spark or spark-py (PySpark) spark-r
>- Spark version: 3.1.1, 3.1.2, 3.2.1 etc.
>- Scala version: 2.12
>- The OS version based on JAVA: 8-jre-slim-buster, 11-jre-slim-buster
>meaning JAVA 8 and JAVA 11 respectively
>
> I believe it is a good thing and we ought to adopt that convention. For
> example:
>
>
> spark-py-3.2.1-scala_2.12-11-jre-slim-buster
>
>
> HTH
>
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Mon, 21 Feb 2022 at 21:58, Holden Karau  wrote:
>
>> My bad, the correct link is:
>>
>> https://hub.docker.com/r/apache/spark/tags
>>
>> On Mon, Feb 21, 2022 at 1:17 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> well that docker link is not found! maybe a permission issue
>>>
>>>
>>>
>>>
>>>
>>>view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Mon, 21 Feb 2022 at 21:09, Holden Karau  wrote:
>>>
>>>> We are happy to announce the availability of Spark 3.1.3!
>>>>
>>>> Spark 3.1.3 is a maintenance release containing stability fixes. This
>>>> release is based on the branch-3.1 maintenance branch of Spark. We
>>>> strongly
>>>> recommend all 3.1 users to upgrade to this stable release.
>>>>
>>>> To download Spark 3.1.3, head over to the download page:
>>>> https://spark.apache.org/downloads.html
>>>>
>>>> To view the release notes:
>>>> https://spark.apache.org/releases/spark-release-3-1-3.html
>>>>
>>>> We would like to acknowledge all community members for contributing to
>>>> this
>>>> release. This release would not have been possible without you.
>>>>
>>>> *New Dockerhub magic in this release:*
>>>>
>>>> We've also started publishing docker containers to the Apache Dockerhub,
>>>> these contain non-ASF artifacts that are subject to different license
>>>> terms than the
>>>> Spark release. The docker containers are built for Linux x86 and ARM64
>>>> since that's
>>>> what I have access to (thanks to NV for the ARM64 machines).
>>>>
>>>> You can get them from https://hub.docker.com/apache/spark (and spark-r
>>>> and spark-py) :)
>>>> (And version 3.2.1 is also now published on Dockerhub).
>>>>
>>>> Holden
>>>>
>>>> --
>>>> Twitter: https://twitter.com/holdenkarau
>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>
>>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>

-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: [ANNOUNCE] Apache Spark 3.1.3 released + Docker images

2022-02-21 Thread Holden Karau
My bad, the correct link is:

https://hub.docker.com/r/apache/spark/tags

On Mon, Feb 21, 2022 at 1:17 PM Mich Talebzadeh 
wrote:

> well that docker link is not found! maybe a permission issue
>
>
>
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Mon, 21 Feb 2022 at 21:09, Holden Karau  wrote:
>
>> We are happy to announce the availability of Spark 3.1.3!
>>
>> Spark 3.1.3 is a maintenance release containing stability fixes. This
>> release is based on the branch-3.1 maintenance branch of Spark. We
>> strongly
>> recommend all 3.1 users to upgrade to this stable release.
>>
>> To download Spark 3.1.3, head over to the download page:
>> https://spark.apache.org/downloads.html
>>
>> To view the release notes:
>> https://spark.apache.org/releases/spark-release-3-1-3.html
>>
>> We would like to acknowledge all community members for contributing to
>> this
>> release. This release would not have been possible without you.
>>
>> *New Dockerhub magic in this release:*
>>
>> We've also started publishing docker containers to the Apache Dockerhub,
>> these contain non-ASF artifacts that are subject to different license
>> terms than the
>> Spark release. The docker containers are built for Linux x86 and ARM64
>> since that's
>> what I have access to (thanks to NV for the ARM64 machines).
>>
>> You can get them from https://hub.docker.com/apache/spark (and spark-r
>> and spark-py) :)
>> (And version 3.2.1 is also now published on Dockerhub).
>>
>> Holden
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>

-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


[ANNOUNCE] Apache Spark 3.1.3 released + Docker images

2022-02-21 Thread Holden Karau
We are happy to announce the availability of Spark 3.1.3!

Spark 3.1.3 is a maintenance release containing stability fixes. This
release is based on the branch-3.1 maintenance branch of Spark. We strongly
recommend all 3.1 users to upgrade to this stable release.

To download Spark 3.1.3, head over to the download page:
https://spark.apache.org/downloads.html

To view the release notes:
https://spark.apache.org/releases/spark-release-3-1-3.html

We would like to acknowledge all community members for contributing to this
release. This release would not have been possible without you.

*New Dockerhub magic in this release:*

We've also started publishing docker containers to the Apache Dockerhub;
these contain non-ASF artifacts that are subject to different license terms
than the Spark release. The docker containers are built for Linux x86 and
ARM64, since that's what I have access to (thanks to NV for the ARM64
machines).

You can get them from https://hub.docker.com/apache/spark (and spark-r and
spark-py) :)
(And version 3.2.1 is also now published on Dockerhub).
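
For example, pulling the published images (tag naming per the Docker Hub
tags page; the spark-py and spark-r variants follow the same pattern):

  docker pull apache/spark:v3.1.3
  docker pull apache/spark-py:v3.1.3
  docker pull apache/spark-r:v3.1.3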

Holden

-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: [VOTE] Spark 3.1.3 RC4

2022-02-18 Thread Holden Karau
The vote passes with no 0s or -1s and the following +1:
Holden Karau
John Zhuge
Mridul Muralidharan
Thomas graves
Gengliang Wang
Wenchen Fan
Yuming Wang
Ruifeng Zheng
Sean Owen

I will begin finalizing the release now.

On Fri, Feb 18, 2022 at 2:49 PM Holden Karau  wrote:

> +1 myself :)
>
> On Thu, Feb 17, 2022 at 9:17 AM John Zhuge  wrote:
>
>> +1 (non-binding)
>>
>> On Wed, Feb 16, 2022 at 10:06 AM Mridul Muralidharan 
>> wrote:
>>
>>>
>>> +1
>>>
>>> Signatures, digests, etc check out fine.
>>> Checked out tag and build/tested with -Pyarn -Pmesos -Pkubernetes
>>>
>>> Regards,
>>> Mridul
>>>
>>>
>>> On Wed, Feb 16, 2022 at 8:32 AM Thomas graves 
>>> wrote:
>>>
>>>> +1
>>>>
>>>> Tom
>>>>
>>>> On Mon, Feb 14, 2022 at 2:55 PM Holden Karau 
>>>> wrote:
>>>> >
>>>> > Please vote on releasing the following candidate as Apache Spark
>>>> version 3.1.3.
>>>> >
>>>> > The vote is open until Feb. 18th at 1 PM pacific (9 PM GMT) and
>>>> passes if a majority
>>>> > +1 PMC votes are cast, with a minimum of 3 + 1 votes.
>>>> >
>>>> > [ ] +1 Release this package as Apache Spark 3.1.3
>>>> > [ ] -1 Do not release this package because ...
>>>> >
>>>> > To learn more about Apache Spark, please see http://spark.apache.org/
>>>> >
>>>> > There are currently no open issues targeting 3.1.3 in Spark's JIRA
>>>> https://issues.apache.org/jira/browse
>>>> > (try project = SPARK AND "Target Version/s" = "3.1.3" AND status in
>>>> (Open, Reopened, "In Progress"))
>>>> > at https://s.apache.org/n79dw
>>>> >
>>>> >
>>>> >
>>>> > The tag to be voted on is v3.1.3-rc4 (commit
>>>> > d1f8a503a26bcfb4e466d9accc5fa241a7933667):
>>>> > https://github.com/apache/spark/tree/v3.1.3-rc4
>>>> >
>>>> > The release files, including signatures, digests, etc. can be found
>>>> at:
>>>> > https://dist.apache.org/repos/dist/dev/spark/v3.1.3-rc4-bin/
>>>> >
>>>> > Signatures used for Spark RCs can be found in this file:
>>>> > https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>> >
>>>> > The staging repository for this release can be found at
>>>> >
>>>> https://repository.apache.org/content/repositories/orgapachespark-1401
>>>> >
>>>> > The documentation corresponding to this release can be found at:
>>>> > https://dist.apache.org/repos/dist/dev/spark/v3.1.3-rc4-docs/
>>>> >
>>>> > The list of bug fixes going into 3.1.3 can be found at the following
>>>> URL:
>>>> > https://s.apache.org/x0q9b
>>>> >
>>>> > This release is using the release script from 3.1.3
>>>> > The release docker container was rebuilt since the previous version
>>>> didn't have the necessary components to build the R documentation.
>>>> >
>>>> > FAQ
>>>> >
>>>> >
>>>> > =
>>>> > How can I help test this release?
>>>> > =
>>>> >
>>>> > If you are a Spark user, you can help us test this release by taking
>>>> > an existing Spark workload and running on this release candidate, then
>>>> > reporting any regressions.
>>>> >
>>>> > If you're working in PySpark you can set up a virtual env and install
>>>> > the current RC and see if anything important breaks, in the Java/Scala
>>>> > you can add the staging repository to your projects resolvers and test
>>>> > with the RC (make sure to clean up the artifact cache before/after so
>>>> > you don't end up building with an out of date RC going forward).
>>>> >
>>>> > ===
>>>> > What should happen to JIRA tickets still targeting 3.1.3?
>>>> > ===
>>>> >
>>>> > The current list of open tickets targeted at 3.1.3 can be found at:
>>>> > https://issues.apache.org/jira/projects/SPARK and search for "Target
>>>> > Version/s"

Re: [VOTE] Spark 3.1.3 RC4

2022-02-18 Thread Holden Karau
+1 myself :)

On Thu, Feb 17, 2022 at 9:17 AM John Zhuge  wrote:

> +1 (non-binding)
>
> On Wed, Feb 16, 2022 at 10:06 AM Mridul Muralidharan 
> wrote:
>
>>
>> +1
>>
>> Signatures, digests, etc check out fine.
>> Checked out tag and build/tested with -Pyarn -Pmesos -Pkubernetes
>>
>> Regards,
>> Mridul
>>
>>
>> On Wed, Feb 16, 2022 at 8:32 AM Thomas graves  wrote:
>>
>>> +1
>>>
>>> Tom
>>>
>>> On Mon, Feb 14, 2022 at 2:55 PM Holden Karau 
>>> wrote:
>>> >
>>> > Please vote on releasing the following candidate as Apache Spark
>>> version 3.1.3.
>>> >
>>> > The vote is open until Feb. 18th at 1 PM pacific (9 PM GMT) and passes
>>> if a majority
>>> > +1 PMC votes are cast, with a minimum of 3 + 1 votes.
>>> >
>>> > [ ] +1 Release this package as Apache Spark 3.1.3
>>> > [ ] -1 Do not release this package because ...
>>> >
>>> > To learn more about Apache Spark, please see http://spark.apache.org/
>>> >
>>> > There are currently no open issues targeting 3.1.3 in Spark's JIRA
>>> https://issues.apache.org/jira/browse
>>> > (try project = SPARK AND "Target Version/s" = "3.1.3" AND status in
>>> (Open, Reopened, "In Progress"))
>>> > at https://s.apache.org/n79dw
>>> >
>>> >
>>> >
>>> > The tag to be voted on is v3.1.3-rc4 (commit
>>> > d1f8a503a26bcfb4e466d9accc5fa241a7933667):
>>> > https://github.com/apache/spark/tree/v3.1.3-rc4
>>> >
>>> > The release files, including signatures, digests, etc. can be found at:
>>> > https://dist.apache.org/repos/dist/dev/spark/v3.1.3-rc4-bin/
>>> >
>>> > Signatures used for Spark RCs can be found in this file:
>>> > https://dist.apache.org/repos/dist/dev/spark/KEYS
>>> >
>>> > The staging repository for this release can be found at
>>> > https://repository.apache.org/content/repositories/orgapachespark-1401
>>> >
>>> > The documentation corresponding to this release can be found at:
>>> > https://dist.apache.org/repos/dist/dev/spark/v3.1.3-rc4-docs/
>>> >
>>> > The list of bug fixes going into 3.1.3 can be found at the following
>>> URL:
>>> > https://s.apache.org/x0q9b
>>> >
>>> > This release is using the release script from 3.1.3
>>> > The release docker container was rebuilt since the previous version
>>> didn't have the necessary components to build the R documentation.
>>> >
>>> > FAQ
>>> >
>>> >
>>> > =
>>> > How can I help test this release?
>>> > =
>>> >
>>> > If you are a Spark user, you can help us test this release by taking
>>> > an existing Spark workload and running on this release candidate, then
>>> > reporting any regressions.
>>> >
>>> > If you're working in PySpark you can set up a virtual env and install
>>> > the current RC and see if anything important breaks, in the Java/Scala
>>> > you can add the staging repository to your projects resolvers and test
>>> > with the RC (make sure to clean up the artifact cache before/after so
>>> > you don't end up building with an out of date RC going forward).
>>> >
>>> > ===
>>> > What should happen to JIRA tickets still targeting 3.1.3?
>>> > ===
>>> >
>>> > The current list of open tickets targeted at 3.1.3 can be found at:
>>> > https://issues.apache.org/jira/projects/SPARK and search for "Target
>>> > Version/s" = 3.1.3
>>> >
>>> > Committers should look at those and triage. Extremely important bug
>>> > fixes, documentation, and API tweaks that impact compatibility should
>>> > be worked on immediately. Everything else please retarget to an
>>> > appropriate release.
>>> >
>>> > ==
>>> > But my bug isn't fixed?
>>> > ==
>>> >
>>> > In order to make timely releases, we will typically not hold the
>>> > release unless the bug in question is a regression from the previous
>>> > release. That being said, if there is something that is a regression
>>> > that has not been correctly targeted please ping me or a committer to
>>> > help target the issue.
>>> >
>>> > Note: I added an extra day to the vote since I know some folks are
>>> likely busy on the 14th with partner(s).
>>> >
>>> >
>>> > --
>>> > Twitter: https://twitter.com/holdenkarau
>>> > Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9
>>> > YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
>
> --
> John Zhuge
>


-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


[VOTE] Spark 3.1.3 RC4

2022-02-14 Thread Holden Karau
Please vote on releasing the following candidate as Apache Spark version
3.1.3.

The vote is open until Feb. 18th at 1 PM pacific (9 PM GMT) and passes if a
majority
+1 PMC votes are cast, with a minimum of 3 + 1 votes.

[ ] +1 Release this package as Apache Spark 3.1.3
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

There are currently no open issues targeting 3.1.3 in Spark's JIRA
https://issues.apache.org/jira/browse
(try project = SPARK AND "Target Version/s" = "3.1.3" AND status in (Open,
Reopened, "In Progress"))
at https://s.apache.org/n79dw



The tag to be voted on is v3.1.3-rc4 (commit
d1f8a503a26bcfb4e466d9accc5fa241a7933667):
https://github.com/apache/spark/tree/v3.1.3-rc4

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v3.1.3-rc4-bin/

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS
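
As one hedged sketch of checking a binary artifact against those keys (the
artifact name below is one example from the -bin directory above):

  curl -s https://dist.apache.org/repos/dist/dev/spark/KEYS | gpg --import
  curl -sO https://dist.apache.org/repos/dist/dev/spark/v3.1.3-rc4-bin/spark-3.1.3-bin-hadoop3.2.tgz
  curl -sO https://dist.apache.org/repos/dist/dev/spark/v3.1.3-rc4-bin/spark-3.1.3-bin-hadoop3.2.tgz.asc
  gpg --verify spark-3.1.3-bin-hadoop3.2.tgz.asc spark-3.1.3-bin-hadoop3.2.tgz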

The staging repository for this release can be found at
https://repository.apache.org/content/repositories/orgapachespark-1401

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v3.1.3-rc4-docs/

The list of bug fixes going into 3.1.3 can be found at the following URL:
https://s.apache.org/x0q9b

This release is using the release script from 3.1.3
The release docker container was rebuilt since the previous version didn't
have the necessary components to build the R documentation.

FAQ


=
How can I help test this release?
=

If you are a Spark user, you can help us test this release by taking
an existing Spark workload and running on this release candidate, then
reporting any regressions.
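
For instance, a minimal sketch of building the RC from the tag to run a
workload against (the Maven profiles here are illustrative; pick the ones
that match your environment):

  git clone https://github.com/apache/spark.git && cd spark
  git checkout v3.1.3-rc4
  ./build/mvn -Pyarn -Pkubernetes -DskipTests clean package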

If you're working in PySpark you can set up a virtual env and install
the current RC and see if anything important breaks, in the Java/Scala
you can add the staging repository to your projects resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with an out of date RC going forward).
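
A sketch of the PySpark path (the exact pyspark tarball name is whatever is
listed in the -bin directory above):

  python -m venv rc-test && source rc-test/bin/activate
  pip install https://dist.apache.org/repos/dist/dev/spark/v3.1.3-rc4-bin/pyspark-3.1.3.tar.gz
  # For Java/Scala, point your build's resolvers at the staging repository
  # listed above, and clear the local artifact cache before and after.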

===
What should happen to JIRA tickets still targeting 3.1.3?
===

The current list of open tickets targeted at 3.1.3 can be found at:
https://issues.apache.org/jira/projects/SPARK and search for "Target
Version/s" = 3.1.3

Committers should look at those and triage. Extremely important bug
fixes, documentation, and API tweaks that impact compatibility should
be worked on immediately. Everything else please retarget to an
appropriate release.

==
But my bug isn't fixed?
==

In order to make timely releases, we will typically not hold the
release unless the bug in question is a regression from the previous
release. That being said, if there is something that is a regression
that has not been correctly targeted please ping me or a committer to
help target the issue.

Note: I added an extra day to the vote since I know some folks are likely
busy on the 14th with partner(s).


-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: [VOTE] Spark 3.1.3 RC3

2022-02-08 Thread Holden Karau
Yup, I've run into some weirdness with the docs again that I want to verify
before I send the vote email, though.

On Mon, Feb 7, 2022 at 10:06 PM Wenchen Fan  wrote:

> Shall we use the release scripts of branch 3.1 to release 3.1?
>
> On Fri, Feb 4, 2022 at 4:57 AM Holden Karau  wrote:
>
>> Good catch Dongjoon :)
>>
>> This release candidate fails, but feel free to keep testing for any other
>> potential blockers.
>>
>> I’ll roll RC4 next week with the older release scripts (but the more
>> modern image since the legacy image didn’t have a good time with the R doc
>> packaging).
>>
>> On Thu, Feb 3, 2022 at 3:53 PM Dongjoon Hyun 
>> wrote:
>>
>>> Unfortunately, -1 for 3.1.3 RC3 due to the packaging issue.
>>>
>>> It seems that the master branch release script didn't work properly for
>>> Hadoop 2 binary distribution, Holden.
>>>
>>> $ curl -s
>>> https://dist.apache.org/repos/dist/dev/spark/v3.1.3-rc3-bin/spark-3.1.3-bin-hadoop2.tgz
>>> | tar tz | grep hadoop-common
>>> spark-3.1.3-bin-hadoop2/jars/hadoop-common-3.2.0.jar
>>>
>>> Apache Spark didn't drop Apache Hadoop 2 based binary distribution yet.
>>>
>>> Dongjoon
>>>
>>>
>>> On Wed, Feb 2, 2022 at 3:38 PM Mridul Muralidharan 
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>>   Minor nit: the tag mentioned under [1] looks like a typo - I used
>>>> "v3.1.3-rc3"  for my vote (3.2.1 is mentioned in a couple of places, treat
>>>> them as 3.1.3 instead)
>>>>
>>>> +1
>>>> Signatures, digests, etc check out fine.
>>>> Checked out tag and build/tested with -Pyarn -Pmesos -Pkubernetes
>>>>
>>>> Regards,
>>>> Mridul
>>>>
>>>> [1] "The tag to be voted on is v3.2.1-rc1" - the commit hash and git
>>>> url are correct.
>>>>
>>>>
>>>> On Wed, Feb 2, 2022 at 9:30 AM Mridul Muralidharan 
>>>> wrote:
>>>>
>>>>>
>>>>> Thanks Tom !
>>>>> I missed [1] (or probably forgot) the 3.1 part of the discussion given
>>>>> it centered around 3.2 ...
>>>>>
>>>>>
>>>>> Regards,
>>>>> Mridul
>>>>>
>>>>> [1] https://www.mail-archive.com/dev@spark.apache.org/msg28484.html
>>>>>
>>>>> On Wed, Feb 2, 2022 at 8:55 AM Thomas Graves 
>>>>> wrote:
>>>>>
>>>>>> Releasing all the maintenance lines was discussed back at the beginning
>>>>>> of December (Dec 6) when we were talking about releasing 3.2.1.
>>>>>>
>>>>>> Tom
>>>>>>
>>>>>> On Wed, Feb 2, 2022 at 2:07 AM Mridul Muralidharan 
>>>>>> wrote:
>>>>>> >
>>>>>> > Hi Holden,
>>>>>> >
>>>>>> >   Not that I am against releasing 3.1.3 (given the fixes that have
>>>>>> already gone in), but did we discuss releasing it ? I might have missed 
>>>>>> the
>>>>>> thread ...
>>>>>> >
>>>>>> > Regards,
>>>>>> > Mridul
>>>>>> >
>>>>>> > On Tue, Feb 1, 2022 at 7:12 PM Holden Karau 
>>>>>> wrote:
>>>>>> >>
>>>>>> >> Please vote on releasing the following candidate as Apache Spark
>>>>>> version 3.1.3.
>>>>>> >>
>>>>>> >> The vote is open until Feb. 4th at 5 PM PST (1 AM UTC + 1 day) and
>>>>>> passes if a majority
>>>>>> >> +1 PMC votes are cast, with a minimum of 3 + 1 votes.
>>>>>> >>
>>>>>> >> [ ] +1 Release this package as Apache Spark 3.1.3
>>>>>> >> [ ] -1 Do not release this package because ...
>>>>>> >>
>>>>>> >> To learn more about Apache Spark, please see
>>>>>> http://spark.apache.org/
>>>>>> >>
>>>>>> >> There are currently no open issues targeting 3.1.3 in Spark's JIRA
>>>>>> https://issues.apache.org/jira/browse
>>>>>> >> (try project = SPARK AND "Target Version/s" = "3.1.3" AND status
>>>>>> in (Open, Reopened, "In Progress"))
>>>>>> >> at https://s.apache.org/n79dw

Re: [VOTE] Spark 3.1.3 RC3

2022-02-03 Thread Holden Karau
Good catch Dongjoon :)

This release candidate fails, but feel free to keep testing for any other
potential blockers.

I’ll roll RC4 next week with the older release scripts (but the more modern
image since the legacy image didn’t have a good time with the R doc
packaging).

On Thu, Feb 3, 2022 at 3:53 PM Dongjoon Hyun 
wrote:

> Unfortunately, -1 for 3.1.3 RC3 due to the packaging issue.
>
> It seems that the master branch release script didn't work properly for
> Hadoop 2 binary distribution, Holden.
>
> $ curl -s
> https://dist.apache.org/repos/dist/dev/spark/v3.1.3-rc3-bin/spark-3.1.3-bin-hadoop2.tgz
> | tar tz | grep hadoop-common
> spark-3.1.3-bin-hadoop2/jars/hadoop-common-3.2.0.jar
>
> Apache Spark didn't drop Apache Hadoop 2 based binary distribution yet.
>
> Dongjoon
>
>
> On Wed, Feb 2, 2022 at 3:38 PM Mridul Muralidharan 
> wrote:
>
>> Hi,
>>
>>   Minor nit: the tag mentioned under [1] looks like a typo - I used
>> "v3.1.3-rc3"  for my vote (3.2.1 is mentioned in a couple of places, treat
>> them as 3.1.3 instead)
>>
>> +1
>> Signatures, digests, etc check out fine.
>> Checked out tag and build/tested with -Pyarn -Pmesos -Pkubernetes
>>
>> Regards,
>> Mridul
>>
>> [1] "The tag to be voted on is v3.2.1-rc1" - the commit hash and git url
>> are correct.
>>
>>
>> On Wed, Feb 2, 2022 at 9:30 AM Mridul Muralidharan 
>> wrote:
>>
>>>
>>> Thanks Tom !
>>> I missed [1] (or probably forgot) the 3.1 part of the discussion given
>>> it centered around 3.2 ...
>>>
>>>
>>> Regards,
>>> Mridul
>>>
>>> [1] https://www.mail-archive.com/dev@spark.apache.org/msg28484.html
>>>
>>> On Wed, Feb 2, 2022 at 8:55 AM Thomas Graves 
>>> wrote:
>>>
>>>> Releasing all the maintenance lines was discussed back at the beginning
>>>> of December (Dec 6) when we were talking about releasing 3.2.1.
>>>>
>>>> Tom
>>>>
>>>> On Wed, Feb 2, 2022 at 2:07 AM Mridul Muralidharan 
>>>> wrote:
>>>> >
>>>> > Hi Holden,
>>>> >
>>>> >   Not that I am against releasing 3.1.3 (given the fixes that have
>>>> already gone in), but did we discuss releasing it ? I might have missed the
>>>> thread ...
>>>> >
>>>> > Regards,
>>>> > Mridul
>>>> >
>>>> > On Tue, Feb 1, 2022 at 7:12 PM Holden Karau 
>>>> wrote:
>>>> >>
>>>> >> Please vote on releasing the following candidate as Apache Spark
>>>> version 3.1.3.
>>>> >>
>>>> >> The vote is open until Feb. 4th at 5 PM PST (1 AM UTC + 1 day) and
>>>> passes if a majority
>>>> >> +1 PMC votes are cast, with a minimum of 3 + 1 votes.
>>>> >>
>>>> >> [ ] +1 Release this package as Apache Spark 3.1.3
>>>> >> [ ] -1 Do not release this package because ...
>>>> >>
>>>> >> To learn more about Apache Spark, please see
>>>> http://spark.apache.org/
>>>> >>
>>>> >> There are currently no open issues targeting 3.1.3 in Spark's JIRA
>>>> https://issues.apache.org/jira/browse
>>>> >> (try project = SPARK AND "Target Version/s" = "3.1.3" AND status in
>>>> (Open, Reopened, "In Progress"))
>>>> >> at https://s.apache.org/n79dw
>>>> >>
>>>> >>
>>>> >>
>>>> >> The tag to be voted on is v3.2.1-rc1 (commit
>>>> >> b8c0799a8cef22c56132d94033759c9f82b0cc86):
>>>> >> https://github.com/apache/spark/tree/v3.1.3-rc3
>>>> >>
>>>> >> The release files, including signatures, digests, etc. can be found
>>>> at:
>>>> >> https://dist.apache.org/repos/dist/dev/spark/v3.1.3-rc3-bin/
>>>> >>
>>>> >> Signatures used for Spark RCs can be found in this file:
>>>> >> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>> >>
>>>> >> The staging repository for this release can be found at:
>>>> >> https://repository.apache.org/content/repositories/orgapachespark-1400/
>>>> >>
>>>> >> The documentation corresponding to this release can be found at:
>>>> >> https://dist.apache.org/repos/dist/dev/spar

Re: [VOTE] SPIP: Catalog API for view metadata

2022-02-03 Thread Holden Karau
+1 (binding)

On Thu, Feb 3, 2022 at 2:26 PM Erik Krogen  wrote:

> +1 (non-binding)
>
> Really looking forward to having this natively supported by Spark, so that
> we can get rid of our own hacks to tie in a custom view catalog
> implementation. I appreciate the care John has put into various parts of
> the design and believe this will provide a robust and flexible solution to
> this problem faced by various large-scale Spark users.
>
> Thanks John!
>
> On Thu, Feb 3, 2022 at 11:22 AM Walaa Eldin Moustafa <
> wa.moust...@gmail.com> wrote:
>
>> +1
>>
>> On Thu, Feb 3, 2022 at 11:19 AM John Zhuge  wrote:
>>
>>> Hi Spark community,
>>>
>>> I’d like to restart the vote for the ViewCatalog design proposal (SPIP).
>>>
>>> The proposal is to add a ViewCatalog interface that can be used to load,
>>> create, alter, and drop views in DataSourceV2.
>>>
>>> Please vote on the SPIP in the next 72 hours. Once it is approved, I’ll
>>> update the PR  for review.
>>>
>>> [ ] +1: Accept the proposal as an official SPIP
>>> [ ] +0
>>> [ ] -1: I don’t think this is a good idea because …
>>>
>>> Thanks!
>>>
>>> On Fri, Jun 4, 2021 at 1:46 PM Walaa Eldin Moustafa <
>>> wa.moust...@gmail.com> wrote:
>>>
 Considering the API aspect, the ViewCatalog API sounds like a good
 idea. A view catalog will enable us to integrate Coral
  (our view SQL
 translation and management layer) very cleanly into Spark. Currently we can
 only do it by maintaining our special version of the
 HiveExternalCatalog. Considering that views can be expanded
 syntactically without necessarily invoking the analyzer, using a dedicated
 view API can make performance better if performance is the concern.
 Further, a catalog can still be both a table and view provider if it
 chooses to, based on this design, so I do not think we necessarily lose the
 ability to provide both. Looking forward to more discussions on this and
 making views a powerful tool in Spark.

 Thanks,
 Walaa.


 On Wed, May 26, 2021 at 9:54 AM John Zhuge  wrote:

> Looks like we are running in circles. Should we have an online meeting
> to get this sorted out?
>
> Thanks,
> John
>
> On Wed, May 26, 2021 at 12:01 AM Wenchen Fan 
> wrote:
>
>> OK, then I'd vote for TableViewCatalog, because
>> 1. This is how Hive catalog works, and we need to migrate Hive
>> catalog to the v2 API sooner or later.
>> 2. Because of 1, TableViewCatalog is easy to support in the current
>> table/view resolution framework.
>> 3. It's better to avoid name conflicts between table and views at the
>> API level, instead of relying on the catalog implementation.
>> 4. Caching invalidation is always a tricky problem.
>>
>> On Tue, May 25, 2021 at 3:09 AM Ryan Blue 
>> wrote:
>>
>>> I don't think that it makes sense to discuss a different approach in
>>> the PR rather than in the vote. Let's discuss this now since that's the
>>> purpose of an SPIP.
>>>
>>> On Mon, May 24, 2021 at 11:22 AM John Zhuge 
>>> wrote:
>>>
 Hi everyone, I’d like to start a vote for the ViewCatalog design
 proposal (SPIP).

 The proposal is to add a ViewCatalog interface that can be used to
 load, create, alter, and drop views in DataSourceV2.

 The full SPIP doc is here:
 https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing

 Please vote on the SPIP in the next 72 hours. Once it is approved,
 I’ll update the PR for review.

 [ ] +1: Accept the proposal as an official SPIP
 [ ] +0
 [ ] -1: I don’t think this is a good idea because …

>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
>
> --
> John Zhuge
>

>>>
>>> --
>>> John Zhuge
>>>
>> --
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


[VOTE] Spark 3.1.3 RC3

2022-02-01 Thread Holden Karau
Please vote on releasing the following candidate as Apache Spark version
3.1.3.

The vote is open until Feb. 4th at 5 PM PST (1 AM UTC + 1 day) and passes
if a majority
+1 PMC votes are cast, with a minimum of 3 + 1 votes.

[ ] +1 Release this package as Apache Spark 3.1.3
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

There are currently no open issues targeting 3.1.3 in Spark's JIRA
https://issues.apache.org/jira/browse
(try project = SPARK AND "Target Version/s" = "3.1.3" AND status in (Open,
Reopened, "In Progress"))
at https://s.apache.org/n79dw



The tag to be voted on is v3.2.1-rc1 (commit
b8c0799a8cef22c56132d94033759c9f82b0cc86):
https://github.com/apache/spark/tree/v3.1.3-rc3

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v3.1.3-rc3-bin/

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1400/

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v3.1.3-rc3-docs/

The list of bug fixes going into 3.1.3 can be found at the following URL:
https://s.apache.org/x0q9b

This release is using the release script in master as
of ddc77fb906cb3ce1567d277c2d0850104c89ac25
The release docker container was rebuilt since the previous version didn't
have the necessary components to build the R documentation.

FAQ


=
How can I help test this release?
=

If you are a Spark user, you can help us test this release by taking
an existing Spark workload and running on this release candidate, then
reporting any regressions.

If you're working in PySpark you can set up a virtual env and install
the current RC and see if anything important breaks, in the Java/Scala
you can add the staging repository to your projects resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with an out of date RC going forward).

===
What should happen to JIRA tickets still targeting 3.1.3?
===

The current list of open tickets targeted at 3.2.1 can be found at:
https://issues.apache.org/jira/projects/SPARK and search for "Target
Version/s" = 3.1.3

Committers should look at those and triage. Extremely important bug
fixes, documentation, and API tweaks that impact compatibility should
be worked on immediately. Everything else please retarget to an
appropriate release.

==
But my bug isn't fixed?
==

In order to make timely releases, we will typically not hold the
release unless the bug in question is a regression from the previous
release. That being said, if there is something that is a regression
that has not been correctly targeted please ping me or a committer to
help target the issue.

==
What happened to RC1 & RC2?
==

When I first went to build RC1, the build process failed due to the
lack of the R Markdown package in my local rm container. By the time
I had time to debug and rebuild, there was already another bug-fix commit in
branch-3.1, so I decided to skip ahead to RC2 and pick it up directly.
When I went to send the RC2 vote e-mail, I noticed a correctness issue had
been fixed in branch-3.1, so I rolled RC3 to contain the correctness fix.

-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: [VOTE] Release Spark 3.2.1 (RC2)

2022-01-21 Thread Holden Karau
On Fri, Jan 21, 2022 at 6:48 PM Sean Owen  wrote:

> Continue on the ticket - I am not sure this is established. We would block
> a release for critical problems that are not regressions. This is not a
> data loss / 'deleting data' issue even if valid.
> You're welcome to provide feedback but votes are for the PMC.
>
To be clear, users and developers are more than welcome to vote, but only
PMC votes are binding.

>
> On Fri, Jan 21, 2022 at 5:24 PM Bjørn Jørgensen 
> wrote:
>
>> Ok, but deleting users' data without them knowing it is never a good
>> idea. That's why I give this RC -1.
>>
>>> On Sat, Jan 22, 2022 at 00:16, Sean Owen wrote:
>>
>>> (Bjorn - unless this is a regression, it would not block a release, even
>>> if it's a bug)
>>>
>>> On Fri, Jan 21, 2022 at 5:09 PM Bjørn Jørgensen <
>>> bjornjorgen...@gmail.com> wrote:
>>>
 [x] -1 Do not release this package because it deletes all my columns with
 only Null in them.

 I have opened https://issues.apache.org/jira/browse/SPARK-37981 for
 this bug.




 On Fri, Jan 21, 2022 at 21:45, Sean Owen wrote:

> (Are you suggesting this is a regression, or is it a general question?
> here we're trying to figure out whether there are critical bugs introduced
> in 3.2.1 vs 3.2.0)
>
> On Fri, Jan 21, 2022 at 1:58 PM Bjørn Jørgensen <
> bjornjorgen...@gmail.com> wrote:
>
>> Hi, I am wondering if it's a bug or not.
>>
>> I have a lot of json files in which some columns are all "null".
>>
>> I start spark with
>>
>> from pyspark import pandas as ps
>> import re
>> import numpy as np
>> import os
>> import pandas as pd
>>
>> from pyspark import SparkContext, SparkConf
>> from pyspark.sql import SparkSession
>> from pyspark.sql.functions import concat, concat_ws, lit, col, trim,
>> expr
>> from pyspark.sql.types import StructType, StructField,
>> StringType,IntegerType
>>
>> os.environ["PYARROW_IGNORE_TIMEZONE"]="1"
>>
>> def get_spark_session(app_name: str, conf: SparkConf):
>> conf.setMaster('local[*]')
>> conf \
>>   .set('spark.driver.memory', '64g')\
>>   .set("fs.s3a.access.key", "minio") \
>>   .set("fs.s3a.secret.key", "") \
>>   .set("fs.s3a.endpoint", "http://192.168.1.127:9000") \
>>   .set("spark.hadoop.fs.s3a.impl",
>> "org.apache.hadoop.fs.s3a.S3AFileSystem") \
>>   .set("spark.hadoop.fs.s3a.path.style.access", "true") \
>>   .set("spark.sql.repl.eagerEval.enabled", "True") \
>>   .set("spark.sql.adaptive.enabled", "True") \
>>   .set("spark.serializer",
>> "org.apache.spark.serializer.KryoSerializer") \
>>   .set("spark.sql.repl.eagerEval.maxNumRows", "1") \
>>   .set("sc.setLogLevel", "error")
>>
>> return
>> SparkSession.builder.appName(app_name).config(conf=conf).getOrCreate()
>>
>> spark = get_spark_session("Falk", SparkConf())
>>
>> d3 =
>> spark.read.option("multiline","true").json("/home/jovyan/notebooks/falk/data/norm_test/3/*.json")
>>
>> import pyspark
>> def sparkShape(dataFrame):
>> return (dataFrame.count(), len(dataFrame.columns))
>> pyspark.sql.dataframe.DataFrame.shape = sparkShape
>> print(d3.shape())
>>
>>
>> (653610, 267)
>>
>>
>> d3.write.json("d3.json")
>>
>>
>> d3 = spark.read.json("d3.json/*.json")
>>
>> import pyspark
>> def sparkShape(dataFrame):
>> return (dataFrame.count(), len(dataFrame.columns))
>> pyspark.sql.dataframe.DataFrame.shape = sparkShape
>> print(d3.shape())
>>
>> (653610, 186)
>>
>>
>> So spark is deleting 81 columns. I think that all of these 81 deleted
>> columns have only Null in them.
>>
>> Is this a bug or has this been made on purpose?
>>
>>
>> On Fri, Jan 21, 2022 at 04:59, huaxin gao wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark
>>> version 3.2.1. The vote is open until 8:00pm Pacific time January 25 and
>>> passes if a majority +1 PMC votes are cast, with a minimum of 3 +1 
>>> votes. [
>>> ] +1 Release this package as Apache Spark 3.2.1[ ] -1 Do not
>>> release this package because ... To learn more about Apache Spark, 
>>> please
>>> see http://spark.apache.org/ The tag to be voted on is v3.2.1-rc2
>>> (commit 4f25b3f71238a00508a356591553f2dfa89f8290):
>>> https://github.com/apache/spark/tree/v3.2.1-rc2
>>> The release files, including signatures, digests, etc. can be found
>>> at:https://dist.apache.org/repos/dist/dev/spark/v3.2.1-rc2-bin/
>>> Signatures used for Spark RCs can be found in this file:
>>> https://dist.apache.org/repos/dist/dev/spark/KEYS The staging
>>> repository for this release can be found at:
>>> 

Re: Tries on migrating Spark Linux arm64 Job from Jenkins to GitHub Actions

2022-01-08 Thread Holden Karau
Personally I’d love to see us compiling and testing on Linux arm64 as well.

On Sat, Jan 8, 2022 at 7:49 PM Yikun Jiang  wrote:

> BTW, this is not intended to be in opposition to the Apache Spark
> Infra 2022 plan which Dongjoon mentioned in "Apache Spark Jenkins Infra 2022".
> It is just to share a possible way to run the Linux arm64 scheduled job.
>
> Also, I think we should reach a final conclusion about the Spark
> community's stance on self-hosted actions, for future reference.
>
> Regards,
> Yikun
>
> On Sun, Jan 9, 2022 at 11:33, Yikun Jiang wrote:
>
>> Hi, all
>>
>> I tried to verify the feasibility of a *Linux arm64 scheduled job* using
>> self-hosted actions; below is some progress, and I would like to hear your
>> suggestions on the next step (continue or stop).
>>
>> Related JIRA: SPARK-35607
>> 
>>
>> *## About self-hosted Github Action:*
>> Currently, self-hosted actions support x64 (Linux, macOS, Windows),
>> ARM64 (Linux only), and ARM32 (Linux only).
>>
>> There is guidance on self-hosted runners from Apache Infra.
>> The gap in enabling self-hosted runners on an Apache repo is resource
>> security considerations; specifically, preventing the self-hosted runner
>> from being accessed by disallowed users' PRs. Per info and suggestions from
>> the ASF, the apache/airflow team maintains a custom runner, which is also
>> used by apache/airflow in their CI. So, we could just use this directly.
>>
>> TL;DR: what we need is to set up resources with the custom runner, then
>> enable these resources in self-hosted actions.
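
For illustration only, registering a stock GitHub Actions runner on an arm64
host looks roughly like the following (the registration token is a
placeholder obtained from the repo admins; the apache/airflow custom runner
mentioned above layers the PR security checks on top of this):

  # on the arm64 host, inside an unpacked actions-runner release
  ./config.sh --url https://github.com/apache/spark \
    --token <REGISTRATION_TOKEN> \
    --labels self-hosted,linux,arm64 --unattended
  ./run.sh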
>>
>> *## Test on self-hosted Github Action with custom runner:*
>> Here is some tries on my local repo:
>> 1. Spark Maven/SBT test:
>> PR: https://github.com/apache/spark/pull/35088
>> TEST: https://github.com/Yikun/spark/pull/51
>> 2. PySpark test:
>> PR: https://github.com/apache/spark/pull/35049
>> TEST: https://github.com/Yikun/spark/pull/53
>> 3. Pull request test from a disallowed user:
>> TEST: https://github.com/Yikun/spark/pull/60
>> The self-hosted runner will prevent the PR from accessing the runner due to
>> "Running job on worker spark-github-runner-0001 disallowed by security
>> policy".
>>
>> *## Pros of self-hosted GitHub Actions:*
>> - Satisfies the simple demands of Linux arm64 scheduled jobs.
>> - Reuses the main workflow of GitHub Actions.
>> - All changes are visible on GitHub and easy to review.
>> - Easy to migrate when official GA arm64 support is ready.
>>
>> *## What's the next step:*
>> * If we can consider self-hosted actions as an option, I will submit a
>> JIRA to Apache Infra to request the token to continue, like:
>> https://issues.apache.org/jira/browse/INFRA-21305
>> * If we think that self-hosted actions are not a wise choice, I
>> will try to find another way.
>>
>> There is also some initial discussion, just FYI:
>> https://github.com/dongjoon-hyun/ApacheSparkGitHubActionImage/pull/6
>>
>> Regards,
>> Yikun
>>
> --
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: [VOTE][SPIP] Support Customized Kubernetes Schedulers Proposal

2022-01-05 Thread Holden Karau
+1 (binding)

On Wed, Jan 5, 2022 at 5:31 PM William Wang  wrote:

> +1 (non-binding)
>
> On Thu, Jan 6, 2022 at 09:07, Yikun Jiang wrote:
>
>> Hi all,
>>
>> I’d like to start a vote for SPIP: "Support Customized Kubernetes
>> Schedulers Proposal"
>>
>> The SPIP is to support customized Kubernetes schedulers in Spark on
>> Kubernetes.
>>
>> Please also refer to:
>>
>> - Previous discussion in dev mailing list: [DISCUSSION] SPIP: Support
>> Volcano/Alternative Schedulers Proposal
>> 
>> - Design doc: [SPIP] Spark-36057 Support Customized Kubernetes
>> Schedulers Proposal
>> 
>> - JIRA: SPARK-36057 
>>
>> Please vote on the SPIP:
>>
>> [ ] +1: Accept the proposal as an official SPIP
>> [ ] +0
>> [ ] -1: I don’t think this is a good idea because …
>>
>> Regards,
>> Yikun
>>
> --
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: [DISCUSSION] SPIP: Support Volcano/Alternative Schedulers Proposal

2022-01-05 Thread Holden Karau
and scheduler like:
>>>>>>
>>>>>> spark-submit \
>>>>>>
>>>>>> --conf spark.kubernetes.scheduler.name volcano \
>>>>>>
>>>>>> --conf spark.kubernetes.driver.pod.featureSteps
>>>>>> org.apache.spark.deploy.k8s.features.scheduler.VolcanoFeatureStep
>>>>>>
>>>>>> --conf spark.kubernetes.job.queue xxx
>>>>>>
>>>>>> (as above, the VolcanoFeatureStep will help to set the Spark
>>>>>> scheduler queue according to the user-specified conf)
>>>>>>
>>>>>> - SPARK-37331 <https://issues.apache.org/jira/browse/SPARK-37331>:
>>>>>> Added the ability to create kubernetes resources before driver pod 
>>>>>> creation.
>>>>>>
>>>>>> - SPARK-36059 <https://issues.apache.org/jira/browse/SPARK-36059>:
>>>>>> Add the ability to specify a scheduler in driver/executor
>>>>>>
>>>>>> After above all, the framework/common support would be ready for most
>>>>>> of customized schedulers
>>>>>>
>>>>>> Volcano part:
>>>>>>
>>>>>> - SPARK-37258 <https://issues.apache.org/jira/browse/SPARK-37258>:
>>>>>> Upgrade kubernetes-client to 5.11.1 to add volcano scheduler API support.
>>>>>>
>>>>>> - SPARK-36061 <https://issues.apache.org/jira/browse/SPARK-36061>:
>>>>>> Add a VolcanoFeatureStep to help users create a PodGroup with the
>>>>>> user-specified minimum resources required; there is also a WIP commit to
>>>>>> show a preview of this
>>>>>> <https://github.com/Yikun/spark/pull/45/commits/81bf6f98edb5c00ebd0662dc172bc73f980b6a34>
>>>>>> .
>>>>>>
>>>>>> Yunikorn part:
>>>>>>
>>>>>> - @WeiweiYang is completing the doc of the Yunikorn part and
>>>>>> implementing the Yunikorn part.
>>>>>>
>>>>>> Regards,
>>>>>> Yikun
>>>>>>
>>>>>>
>>>>>> On Thu, Dec 2, 2021 at 02:00, Weiwei Yang wrote:
>>>>>>
>>>>>>> Thank you Yikun for the info, and thanks for inviting me to a
>>>>>>> meeting to discuss this.
>>>>>>> I appreciate your effort to put these together, and I agree that the
>>>>>>> purpose is to make Spark easy/flexible enough to support other K8s
>>>>>>> schedulers (not just for Volcano).
>>>>>>> As discussed, could you please help to abstract out the things in
>>>>>>> common and allow Spark to plug in different implementations? I'd be happy to
>>>>>>> to
>>>>>>> work with you guys on this issue.
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Nov 30, 2021 at 6:49 PM Yikun Jiang 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> @Weiwei @Chenya
>>>>>>>>
>>>>>>>> > Thanks for bringing this up. This is quite interesting, we
>>>>>>>> definitely should participate more in the discussions.
>>>>>>>>
>>>>>>>> Thanks for your reply and welcome to join the discussion, I think
>>>>>>>> the input from Yunikorn is very critical.
>>>>>>>>
>>>>>>>> > The main thing here is, the Spark community should make Spark
>>>>>>>> pluggable in order to support other schedulers, not just for Volcano. 
>>>>>>>> It
>>>>>>>> looks like this proposal is pushing really hard for adopting PodGroup,
>>>>>>>> which isn't part of K8s yet, that to me is problematic.
>>>>>>>>
>>>>>>>> Definitely yes, we are on the same page.
>>>>>>>>
>>>>>>>> I think we have the same goal: propose a general and reasonable
>>>>>>>> mechanism to make spark on k8s with a custom scheduler more usable.
>>>>>>>>
>>>>>>>> But for the PodGroup, just allow me to do a brief introduction:
>>>>>>>> - The PodGroup definition has been approved by Kubernetes
>>>>>>>> officially in KEP-583. [1]
>>>>>

Re: Log4j 1.2.17 spark CVE

2021-12-12 Thread Holden Karau
My understanding is that it only applies to log4j 2.x, so we don't need to
do anything.
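
For context, the CVE being asked about (Log4Shell, CVE-2021-44228) targets
log4j 2.x. A quick way to confirm which log4j line a Spark distribution
actually ships (run from an unpacked binary distribution; the exact jar
version may differ):

  ls jars/ | grep -i log4j
  # e.g. log4j-1.2.17.jar -- the 1.x line, not affected by CVE-2021-44228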

On Sun, Dec 12, 2021 at 8:46 PM Pralabh Kumar 
wrote:

> Hi developers,  users
>
> Spark is built using log4j 1.2.17 . Is there a plan to upgrade based on
> recent CVE detected ?
>
>
> Regards
> Pralabh kumar
>
-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: [Apache Spark Jenkins] build system shutting down Dec 23th, 2021

2021-12-06 Thread Holden Karau
Shane you kick ass thank you for everything you’ve done for us :) Keep on
rocking :)

On Mon, Dec 6, 2021 at 4:24 PM Hyukjin Kwon  wrote:

> Thanks, Shane.
>
> On Tue, 7 Dec 2021 at 09:19, Dongjoon Hyun 
> wrote:
>
>> I really want to thank you for all your help.
>> You've done so many things for the Apache Spark community.
>>
>> Sincerely,
>> Dongjoon
>>
>>
>> On Mon, Dec 6, 2021 at 12:02 PM shane knapp ☠ 
>> wrote:
>>
>>> hey everyone!
>>>
>>> after a marathon run of nearly a decade, we're finally going to be
>>> shutting down {amp|rise}lab jenkins at the end of this month...
>>>
>>> the earliest snapshot i could find is from 2013 with builds for spark
>>> 0.7:
>>>
>>> https://web.archive.org/web/20130426155726/https://amplab.cs.berkeley.edu/jenkins/
>>>
>>> it's been a hell of a run, and i'm gonna miss randomly tweaking the
>>> build system, but technology has moved on and running a dedicated set of
>>> servers for just one open source project is too expensive for us here
>>> at uc berkeley.
>>>
>>> if there's interest, i'll fire up a zoom session and all y'alls can
>>> watch me type the final command:
>>>
>>> systemctl stop jenkins
>>>
>>> feeling bittersweet,
>>>
>>> shane
>>> --
>>> Shane Knapp
>>> Computer Guy / Voice of Reason
>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>> https://rise.cs.berkeley.edu
>>>
>> --
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: [DISCUSSION] SPIP: Support Volcano/Alternative Schedulers Proposal

2021-11-30 Thread Holden Karau
Thanks for putting this together; I’m really excited for us to add better
batch-scheduling integrations.

On Tue, Nov 30, 2021 at 12:46 AM Yikun Jiang  wrote:

> Hey everyone,
>
> I'd like to start a discussion on "Support Volcano/Alternative Schedulers
> Proposal".
>
> This SPIP is proposed to make Spark-on-K8s scheduling provide more YARN-like
> features (such as queues and minimum resources before scheduling jobs) that
> many folks want on Kubernetes.
>
> The goal of this SPIP is to improve the current Spark k8s scheduler
> implementation, add the ability to do batch scheduling, and support Volcano
> as one of the implementations.
>
> Design doc:
> https://docs.google.com/document/d/1xgQGRpaHQX6-QH_J9YV2C2Dh6RpXefUpLM7KGkzL6Fg
> JIRA: https://issues.apache.org/jira/browse/SPARK-36057
> Part of PRs:
> Ability to create resources https://github.com/apache/spark/pull/34599
> Add PodGroupFeatureStep: https://github.com/apache/spark/pull/34456
>
> Regards,
> Yikun
>
-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau
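
A minimal sketch of what the pluggable-scheduler configuration roughly looks
like as this work later landed (the scheduler-name and feature-step configs
shipped in Spark 3.3, after this thread; the master URL and image below are
illustrative placeholders):

from pyspark.sql import SparkSession

# Sketch assuming the Spark 3.3 config names; the endpoint and image are
# placeholders, not real deployments.
spark = (
    SparkSession.builder
    .master("k8s://https://kubernetes.example.com:6443")
    .config("spark.kubernetes.container.image", "example/spark:3.3.0")
    # Schedule driver and executor pods with Volcano instead of the default scheduler
    .config("spark.kubernetes.scheduler.name", "volcano")
    # Feature steps that create the PodGroup and attach the pods to it
    .config("spark.kubernetes.driver.pod.featureSteps",
            "org.apache.spark.deploy.k8s.features.VolcanoFeatureStep")
    .config("spark.kubernetes.executor.pod.featureSteps",
            "org.apache.spark.deploy.k8s.features.VolcanoFeatureStep")
    .getOrCreate()
)

The same spark.kubernetes.scheduler.name knob is what keeps the mechanism
pluggable for Yunikorn and other schedulers rather than being Volcano-specific.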


Re: DataFrame.mapInArrow

2021-11-10 Thread Holden Karau
Sorry, I've been busy; I'll try to take a look tomorrow. Excited to see
this progress though :)

On Wed, Nov 10, 2021 at 9:01 PM Hyukjin Kwon  wrote:

> Last reminder: I plan to merge this in a few more days. Any feedback and
> review would be very appreciated.
>
> On Tue, 9 Nov 2021 at 21:51, Hyukjin Kwon  wrote:
>
>> Hi dev,
>>
>> I proposed DataFrame.mapInArrow (
>> https://github.com/apache/spark/pull/34505), which allows users to
>> directly leverage Arrow batches to plug into other external systems easily.
>>
>> I would like to make sure this API design covers most use cases, and
>> would like to know if there is any other feedback or opinion on this.
>>
>> I would appreciate any feedback on this.
>>
>> Thanks.
>>
>

-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau
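
For a concrete flavor of the API under review: a minimal sketch assuming the
signature in the PR, i.e. a function over an iterator of pyarrow.RecordBatch
plus an output schema string (the column arithmetic is purely illustrative):

import pyarrow as pa
import pyarrow.compute as pc
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10)

def double_ids(batches):
    # Each partition arrives as an iterator of pyarrow.RecordBatch,
    # with no pandas conversion step in between.
    for batch in batches:
        doubled = pc.multiply(batch.column(0), 2)
        yield pa.RecordBatch.from_arrays([doubled], names=["id"])

df.mapInArrow(double_ids, schema="id long").show()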


Re: [VOTE] SPIP: Storage Partitioned Join for Data Source V2

2021-10-29 Thread Holden Karau
+1

On Fri, Oct 29, 2021 at 3:07 PM DB Tsai  wrote:

> +1
>
> DB Tsai  |  https://www.dbtsai.com/  |  PGP 42E5B25A8F7A82C1
>
>
> On Fri, Oct 29, 2021 at 11:42 AM Ryan Blue  wrote:
>
>> +1
>>
>> On Fri, Oct 29, 2021 at 11:06 AM huaxin gao 
>> wrote:
>>
>>> +1
>>>
>>> On Fri, Oct 29, 2021 at 10:59 AM Dongjoon Hyun 
>>> wrote:
>>>
 +1

 Dongjoon

 On 2021/10/29 17:48:59, Russell Spitzer 
 wrote:
 > +1 This is a great idea (I have no Apache Spark voting points)
 >
 > On Fri, Oct 29, 2021 at 12:41 PM L. C. Hsieh 
 wrote:
 >
 > >
 > > I'll start with my +1.
 > >
 > > On 2021/10/29 17:30:03, L. C. Hsieh  wrote:
 > > > Hi all,
 > > >
 > > > I’d like to start a vote for SPIP: Storage Partitioned Join for Data
 > > > Source V2.
 > > >
 > > > The proposal is to support a new type of join: storage partitioned
 > > > join, which covers bucket join support for DataSourceV2 but is more
 > > > general. The goal is to let Spark leverage distribution properties
 > > > reported by data sources and eliminate shuffle whenever possible.
 > > >
 > > > Please also refer to:
 > > >
 > > >- Previous discussion in the dev mailing list: [DISCUSS] SPIP: Storage
 > > >  Partitioned Join for Data Source V2
 > > >  <https://lists.apache.org/thread.html/r7dc67c3db280a8b2e65855cb0b1c86b524d4e6ae1ed9db9ca12cb2e6%40%3Cdev.spark.apache.org%3E>
 > > >- JIRA: SPARK-37166 <https://issues.apache.org/jira/browse/SPARK-37166>
 > > >- Design doc
 > > >  <https://docs.google.com/document/d/1foTkDSM91VxKgkEcBMsuAvEjNybjja-uHk-r3vtXWFE>
 > > >
 > > > Please vote on the SPIP for the next 72 hours:
 > > >
 > > > [ ] +1: Accept the proposal as an official SPIP
 > > > [ ] +0
 > > > [ ] -1: I don’t think this is a good idea because …
 > > >


>>
>> --
>> Ryan Blue
>> Tabular
>>
> --
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau
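
For intuition about the shuffle elimination being voted on, the existing
DataSource V1 bucketed join below is the special case this SPIP generalizes
to V2 sources. A minimal sketch (table names are illustrative; the broadcast
threshold is disabled so the join strategy is visible in the plan):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")  # force a sort-merge join

spark.range(1000000).write.bucketBy(16, "id").sortBy("id").saveAsTable("t1")
spark.range(1000000).write.bucketBy(16, "id").sortBy("id").saveAsTable("t2")

# Both sides report matching distributions, so the physical plan contains no
# Exchange; the SPIP extends this kind of elimination to partitioned V2 sources.
spark.sql("SELECT * FROM t1 JOIN t2 ON t1.id = t2.id").explain()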


Re: [VOTE] Release Spark 3.2.0 (RC7)

2021-10-10 Thread Holden Karau
+1

On Sun, Oct 10, 2021 at 10:46 PM Wenchen Fan  wrote:

> +1
>
> On Sat, Oct 9, 2021 at 2:36 PM angers zhu  wrote:
>
>> +1 (non-binding)
>>
>> Cheng Pan wrote on Saturday, October 9, 2021 at 2:06 PM:
>>
>>> +1 (non-binding)
>>>
>>> Integration test passed[1] with my project[2].
>>>
>>> [1]
>>> https://github.com/housepower/spark-clickhouse-connector/runs/3834335017
>>> [2] https://github.com/housepower/spark-clickhouse-connector
>>>
>>> Thanks,
>>> Cheng Pan
>>>
>>>
>>> On Sat, Oct 9, 2021 at 2:01 PM Ye Zhou  wrote:
>>>
 +1 (non-binding).

 Ran the Maven build, tested within our YARN cluster in client and cluster
 mode, with push-based shuffle enabled/disabled, and shuffled a large
 amount of data. Applications ran successfully with the expected shuffle
 behavior.

 On Fri, Oct 8, 2021 at 10:06 PM sarutak 
 wrote:

> +1
>
> I think there are no critical issues left.
> Thank you Gengliang.
>
> Kousuke
>
> > +1
> >
> > Looks good.
> >
> > Liang-Chi
> >
> > On 2021/10/08 16:16:12, Kent Yao  wrote:
> >> +1 (non-binding)
> >>
> >> Best regards,
> >> Kent Yao @ Data Science Center, Hangzhou Research Institute, NetEase Corp.
> >> A Spark enthusiast.
> >> kyuubi <https://github.com/yaooqinn/kyuubi> is a unified multi-tenant JDBC
> >> interface for large-scale data processing and analytics, built on top of
> >> Apache Spark <http://spark.apache.org/>.
> >> spark-authorizer <https://github.com/yaooqinn/spark-authorizer>: a Spark SQL
> >> extension which provides SQL Standard Authorization for Apache Spark.

Re: [VOTE] Release Spark 3.2.0 (RC6)

2021-09-29 Thread Holden Karau
PySpark smoke tests pass; I'm going to do a last pass through the JIRAs
before my vote, though.

On Wed, Sep 29, 2021 at 8:54 AM Sean Owen  wrote:

> +1 looks good to me as before, now that a few recent issues are resolved.
>
>
> On Tue, Sep 28, 2021 at 10:45 AM Gengliang Wang  wrote:
>
>> Please vote on releasing the following candidate as
>> Apache Spark version 3.2.0.
>>
>> The vote is open until 11:59pm Pacific time September 30 and passes if a
>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 3.2.0
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v3.2.0-rc6 (commit
>> dde73e2e1c7e55c8e740cb159872e081ddfa7ed6):
>> https://github.com/apache/spark/tree/v3.2.0-rc6
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc6-bin/
>>
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1393
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc6-docs/
>>
>> The list of bug fixes going into 3.2.0 can be found at the following URL:
>> https://issues.apache.org/jira/projects/SPARK/versions/12349407
>>
>> This release is using the release script of the tag v3.2.0-rc6.
>>
>>
>> FAQ
>>
>> =
>> How can I help test this release?
>> =
>> If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running it on this release candidate, then
>> reporting any regressions.
>>
>> If you're working in PySpark you can set up a virtual env and install
>> the current RC and see if anything important breaks; in Java/Scala,
>> you can add the staging repository to your project's resolvers and test
>> with the RC (make sure to clean up the artifact cache before/after so
>> you don't end up building with an out-of-date RC going forward).
>>
>> ===
>> What should happen to JIRA tickets still targeting 3.2.0?
>> ===
>> The current list of open tickets targeted at 3.2.0 can be found at:
>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>> Version/s" = 3.2.0
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should
>> be worked on immediately. Everything else please retarget to an
>> appropriate release.
>>
>> ==
>> But my bug isn't fixed?
>> ==
>> In order to make timely releases, we will typically not hold the
>> release unless the bug in question is a regression from the previous
>> release. That being said, if there is something which is a regression
>> that has not been correctly targeted please ping me or a committer to
>> help target the issue.
>>
>>

-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: [VOTE] Release Spark 3.2.0 (RC5)

2021-09-27 Thread Holden Karau
I think even if we do cancel this RC, we should leave it open for a bit to
see if we can catch any other errors.

On Mon, Sep 27, 2021 at 12:29 PM Dongjoon Hyun 
wrote:

> Unfortunately, it's the same for me recently. Not only that, I also
> hit a MetaspaceSize OOM.
> I ended up with MAVEN_OPTS like the following.
>
> -Xms12g -Xmx12g -Xss128M -XX:MaxMetaspaceSize=4g ...
>
> Dongjoon.
>
>
> On Mon, Sep 27, 2021 at 12:18 PM Sean Owen  wrote:
>
>> Has anyone seen a StackOverflowError when running tests? It happens in
>> compilation. I heard from another user who hit this earlier, and I had not,
>> until just today testing this:
>>
>> [ERROR] ## Exception when compiling 495 sources to
>> /mnt/data/testing/spark-3.2.0/sql/catalyst/target/scala-2.12/classes
>> java.lang.StackOverflowError
>>
>> scala.tools.nsc.transform.TypingTransformers$TypingTransformer.atOwner(TypingTransformers.scala:38)
>> scala.reflect.internal.Trees.itransform(Trees.scala:1420)
>> scala.reflect.internal.Trees.itransform$(Trees.scala:1400)
>> scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28)
>> ...
>>
>> Upping the JVM thread stack size to, say, 16m from 4m in the pom.xml file
>> made it work. I presume this could be somehow env-specific, as clearly the
>> CI/CD tests and release process built successfully. Just checking if it's
>> "just me".
>>
>>
>> On Mon, Sep 27, 2021 at 7:56 AM Gengliang Wang  wrote:
>>
>>> Please vote on releasing the following candidate as
>>> Apache Spark version 3.2.0.
>>>
>>> The vote is open until 11:59pm Pacific time September 29 and passes if a
>>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>>
>>> [ ] +1 Release this package as Apache Spark 3.2.0
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>
>>> The tag to be voted on is v3.2.0-rc5 (commit
>>> 49aea14c5afd93ae1b9d19b661cc273a557853f5):
>>> https://github.com/apache/spark/tree/v3.2.0-rc5
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc5-bin/
>>>
>>> Signatures used for Spark RCs can be found in this file:
>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1392
>>>
>>> The documentation corresponding to this release can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc5-docs/
>>>
>>> The list of bug fixes going into 3.2.0 can be found at the following URL:
>>> https://issues.apache.org/jira/projects/SPARK/versions/12349407
>>>
>>> This release is using the release script of the tag v3.2.0-rc5.
>>>
>>>
>>> FAQ
>>>
>>> =
>>> How can I help test this release?
>>> =
>>> If you are a Spark user, you can help us test this release by taking
>>> an existing Spark workload and running it on this release candidate, then
>>> reporting any regressions.
>>>
>>> If you're working in PySpark you can set up a virtual env and install
>>> the current RC and see if anything important breaks; in Java/Scala,
>>> you can add the staging repository to your project's resolvers and test
>>> with the RC (make sure to clean up the artifact cache before/after so
>>> you don't end up building with an out-of-date RC going forward).
>>>
>>> ===
>>> What should happen to JIRA tickets still targeting 3.2.0?
>>> ===
>>> The current list of open tickets targeted at 3.2.0 can be found at:
>>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>>> Version/s" = 3.2.0
>>>
>>> Committers should look at those and triage. Extremely important bug
>>> fixes, documentation, and API tweaks that impact compatibility should
>>> be worked on immediately. Everything else please retarget to an
>>> appropriate release.
>>>
>>> ==
>>> But my bug isn't fixed?
>>> ==
>>> In order to make timely releases, we will typically not hold the
>>> release unless the bug in question is a regression from the previous
>>> release. That being said, if there is something which is a regression
>>> that has not been correctly targeted please ping me or a committer to
>>> help target the issue.
>>>
>>

-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Adding Spark 4 to JIRA for targeted versions

2021-09-13 Thread Holden Karau
Hi Folks,

I'm going through the Spark 3.2 tickets just to make sure we're not missing
anything important, and I was wondering what folks' thoughts are on adding
Spark 4 so we can target API-breaking changes to the next major version and
avoid losing track of the issue.

Cheers,


Holden :)

-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: Add option to Spark UI to proxy to the executors?

2021-08-25 Thread Holden Karau
So I tried turning on the Spark exec UI proxy, but it broke the Spark UI (in
3.1.2): regardless of what URL I requested, everything came back as the
text/html of the jobs page. Is anyone actively using this feature in prod?

On Sun, Aug 22, 2021 at 5:58 PM Holden Karau  wrote:

> Oh cool. I’ll have to dig down into why that’s not working with my K8s
> deployment then.
>
> On Sat, Aug 21, 2021 at 11:54 PM Gengliang Wang  wrote:
>
>> Hi Holden,
>>
>> FYI there are already some related features in Spark:
>>
>>- Spark Master UI to reverse proxy Application and Workers UI
>><https://github.com/apache/spark/pull/13950>
>>- Support Spark UI behind front-end reverse proxy using a path prefix
>>Revert proxy URL <https://github.com/apache/spark/pull/29820>
>>
>> Not sure if they are helpful to you.
>>
>> On Sat, Aug 21, 2021 at 3:16 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Yes I can see your point.
>>>
>>> Will that work in a Kubernetes deployment?
>>>
>>> Mich
>>>
>>>
>>>view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Sat, 21 Aug 2021 at 00:02, Holden Karau  wrote:
>>>
>>>> Hi Folks,
>>>>
>>>> I'm wondering what people think about the idea of having the Spark UI
>>>> (optionally) act as a proxy to the executors? This could help with exec UI
>>>> access in some deployment environments.
>>>>
>>>> Cheers,
>>>>
>>>> Holden :)
>>>>
>>>> --
>>>> Twitter: https://twitter.com/holdenkarau
>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>
>>> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
YouTube Live Streams: https://www.youtube.com/user/holdenkarau
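
For anyone who wants to try the two reverse-proxy features Gengliang lists in
the quoted thread above, a minimal sketch of the relevant configuration
(standalone mode; the gateway URL is a placeholder):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Have the standalone master UI reverse-proxy the application and worker UIs
    .config("spark.ui.reverseProxy", "true")
    # Public URL of the proxy sitting in front of the master UI (placeholder)
    .config("spark.ui.reverseProxyUrl", "https://gateway.example.com/spark")
    .getOrCreate()
)

Note this covers the master, application, and worker UIs; proxying all the way
through to executor UIs, as proposed in this thread, is a separate gap.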


Re: Add option to Spark UI to proxy to the executors?

2021-08-22 Thread Holden Karau
Oh cool. I’ll have to dig down into why that’s not working with my K8s
deployment then.

On Sat, Aug 21, 2021 at 11:54 PM Gengliang Wang  wrote:

> Hi Holden,
>
> FYI there are already some related features in Spark:
>
>- Spark Master UI to reverse proxy Application and Workers UI
><https://github.com/apache/spark/pull/13950>
>- Support Spark UI behind front-end reverse proxy using a path prefix
>Revert proxy URL <https://github.com/apache/spark/pull/29820>
>
> Not sure if they are helpful to you.
>
> On Sat, Aug 21, 2021 at 3:16 PM Mich Talebzadeh 
> wrote:
>
>> Yes I can see your point.
>>
>> Will that work in a Kubernetes deployment?
>>
>> Mich
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Sat, 21 Aug 2021 at 00:02, Holden Karau  wrote:
>>
>>> Hi Folks,
>>>
>>> I'm wondering what people think about the idea of having the Spark UI
>>> (optionally) act as a proxy to the executors? This could help with exec UI
>>> access in some deployment environments.
>>>
>>> Cheers,
>>>
>>> Holden :)
>>>
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>> --
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Add option to Spark UI to proxy to the executors?

2021-08-20 Thread Holden Karau
Hi Folks,

I'm wondering what people think about the idea of having the Spark UI
(optionally) act as a proxy to the executors? This could help with exec UI
access in some deployment environments.

Cheers,

Holden :)

-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


-1s on committed but not released code?

2021-08-19 Thread Holden Karau
Hi Y'all,

This just recently came up, but I'm not super sure how we want to handle
this in general. If code was committed under the lazy consensus model and
then a committer or PMC member -1s it post-merge, what do we want to do?

I know we had some previous discussion around -1s, but that was largely
focused on pre-commit -1s.

Cheers,

Holden :)

-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: Time to start publishing Spark Docker Images?

2021-08-17 Thread Holden Karau
>>>>>> in addition to base Java
>>>>>> 11.
>>>>>>
>>>>>> in addition it has these Python packages for now (added for my own
>>>>>> needs for now)
>>>>>>
>>>>>> root@ce6773017a14:/opt/spark/work-dir# pip list
>>>>>> Package   Version
>>>>>> - ---
>>>>>> asn1crypto0.24.0
>>>>>> cryptography  2.6.1
>>>>>> cx-Oracle 8.2.1
>>>>>> entrypoints   0.3
>>>>>> keyring   17.1.1
>>>>>> keyrings.alt  3.1.1
>>>>>> numpy 1.21.2
>>>>>> pip   21.2.4
>>>>>> py4j  0.10.9
>>>>>> pycrypto  2.6.1
>>>>>> PyGObject 3.30.4
>>>>>> pyspark   3.1.2
>>>>>> pyxdg 0.25
>>>>>> PyYAML5.4.1
>>>>>> SecretStorage 2.3.1
>>>>>> setuptools57.4.0
>>>>>> six   1.12.0
>>>>>> wheel 0.32.3
>>>>>>
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>>
>>>>>>view my Linkedin profile
>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>
>>>>>>
>>>>>>
>>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>>>>>> for any loss, damage or destruction of data or any other property which 
>>>>>> may
>>>>>> arise from relying on this email's technical content is explicitly
>>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>>> arising from such loss, damage or destruction.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, 17 Aug 2021 at 16:17, Maciej  wrote:
>>>>>>
>>>>>>> Quick question ‒ is this actual output? If so, do we know what
>>>>>>> accounts for the 1.5GB overhead of the PySpark image? Even without
>>>>>>> --no-install-recommends this seems like a lot (if I recall
>>>>>>> correctly, it was around 400MB for the existing images).
>>>>>>>
>>>>>>>
>>>>>>> On 8/17/21 2:24 PM, Mich Talebzadeh wrote:
>>>>>>>
>>>>>>> Examples:
>>>>>>>
>>>>>>> *docker images*
>>>>>>>
>>>>>>> REPOSITORY       TAG                                  IMAGE ID       CREATED          SIZE
>>>>>>> spark/spark-py   3.1.1_sparkpy_3.7-scala_2.12-java8   ba3c17bc9337   2 minutes ago    2.19GB
>>>>>>> spark            3.1.1-scala_2.12-java11              4595c4e78879   18 minutes ago   635MB
>>>>>>>
>>>>>>>
>>>>>>>view my Linkedin profile
>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>>>>>>> for any loss, damage or destruction of data or any other property which 
>>>>>>> may
>>>>>>> arise from relying on this email's technical content is explicitly
>>>>>>> disclaimed. The author will in no case be liable for any monetary 
>>>>>>> damages
>>>>>>> arising from such loss, damage or destruction.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, 17 Aug 2021 at 10:31, Mich Talebzadeh <
>>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>>
>>>>>>>> 3.1.2_sparkpy_3.7-scala_2.12-java11
>>>>>>>>
>>>>>>>> 3.1.2_sparkR_3.6-scala_2.12-java11
>>>>>>>> Yes let us go with that and remember that we can change the tags
>>>>>>>> anytime. The accompanying release note should detail what is inside the
>>>>>>>> image downloaded.
>>>>>>>>
>>>>>>>> +1 for m

Re: Time to start publishing Spark Docker Images?

2021-08-17 Thread Holden Karau
>> I try to log in to the image as root:
>>
>> *docker run -u0 -it cfbb0e69f204 bash*
>>
>> root@b542b0f1483d:/opt/spark/work-dir# pip install keras
>> Collecting keras
>>   Downloading keras-2.6.0-py2.py3-none-any.whl (1.3 MB)
>>  || 1.3 MB 1.1 MB/s
>> Installing collected packages: keras
>> Successfully installed keras-2.6.0
>> WARNING: Running pip as the 'root' user can result in broken permissions
>> and conflicting behaviour with the system package manager. It is
>> recommended to use a virtual environment instead:
>> https://pip.pypa.io/warnings/venv
>> root@b542b0f1483d:/opt/spark/work-dir# pip list
>> Package   Version
>> - ---
>> asn1crypto0.24.0
>> cryptography  2.6.1
>> cx-Oracle 8.2.1
>> entrypoints   0.3
>> *keras 2.6.0  <--- it is here*
>> keyring   17.1.1
>> keyrings.alt  3.1.1
>> numpy 1.21.1
>> pip   21.2.3
>> py4j  0.10.9
>> pycrypto  2.6.1
>> PyGObject 3.30.4
>> pyspark   3.1.2
>> pyxdg 0.25
>> PyYAML5.4.1
>> SecretStorage 2.3.1
>> setuptools57.4.0
>> six   1.12.0
>> wheel 0.32.3
>> root@b542b0f1483d:/opt/spark/work-dir# exit
>>
>> Now I exited from the image and tried to log in again
>> (pyspark_venv) hduser@rhes76: /home/hduser/dba/bin/build> docker run -u0
>> -it cfbb0e69f204 bash
>>
>> root@5231ee95aa83:/opt/spark/work-dir# pip list
>> Package   Version
>> - ---
>> asn1crypto0.24.0
>> cryptography  2.6.1
>> cx-Oracle 8.2.1
>> entrypoints   0.3
>> keyring   17.1.1
>> keyrings.alt  3.1.1
>> numpy 1.21.1
>> pip   21.2.3
>> py4j  0.10.9
>> pycrypto  2.6.1
>> PyGObject 3.30.4
>> pyspark   3.1.2
>> pyxdg 0.25
>> PyYAML5.4.1
>> SecretStorage 2.3.1
>> setuptools57.4.0
>> six   1.12.0
>> wheel 0.32.3
>>
>> *Hm, that keras is not there*. The Docker image cannot be altered after
>> the build! Once the image is created, it is just a snapshot.
>> However, it will still have tons of useful stuff for most
>> users/organisations. My suggestion is to create, for a given type (spark,
>> spark-py, etc.):
>>
>>
>>1. One vanilla flavour for everyday use with a few useful packages
>>2. One for medium use with the most common packages for ETL/ELT work
>>3. One specialist flavour for ML etc. with keras, tensorflow and anything else
>>needed
>>
>>
>> These images should be maintained as we currently maintain Spark releases,
>> with accompanying documentation. Any reason why we cannot maintain these
>> ourselves?
>>
>> HTH
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Fri, 13 Aug 2021 at 17:26, Holden Karau  wrote:
>>
>>> So we actually do have a script that does the build already; it's more a
>>> matter of publishing the results for easier use. Currently the script
>>> produces three images: spark, spark-py, and spark-r. I can certainly see a
>>> solid reason to publish with, say, a jdk11 and jdk8 suffix as well if there
>>> is interest in the community. If we want to have, say, a spark-py-pandas
>>> image with everything necessary for the Koalas stuff to work, then I think
>>> that could be a great PR from someone to add :)
>>>
>>> On Fri, Aug 13, 2021 at 1:00 AM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> should read PySpark
>>>>
>>>>
>>>>
>>>>view my Linkedin profile
>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed.

Re: Time to start publishing Spark Docker Images?

2021-08-16 Thread Holden Karau
um use with most common packages for ETL/ELT stuff
>3. One specialist for ML etc with keras, tensorflow and anything else
>needed
>
>
> These images should be maintained as we currently maintain Spark releases,
> with accompanying documentation. Any reason why we cannot maintain these
> ourselves?
>
> HTH
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Fri, 13 Aug 2021 at 17:26, Holden Karau  wrote:
>
>> So we actually do have a script that does the build already; it's more a
>> matter of publishing the results for easier use. Currently the script
>> produces three images: spark, spark-py, and spark-r. I can certainly see a
>> solid reason to publish with, say, a jdk11 and jdk8 suffix as well if there
>> is interest in the community. If we want to have, say, a spark-py-pandas
>> image with everything necessary for the Koalas stuff to work, then I think
>> that could be a great PR from someone to add :)
>>
>> On Fri, Aug 13, 2021 at 1:00 AM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> should read PySpark
>>>
>>>
>>>
>>>view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Fri, 13 Aug 2021 at 08:51, Mich Talebzadeh 
>>> wrote:
>>>
>>>> Agreed.
>>>>
>>>> I have already built a few of the latest images for Spark and PySpark on
>>>> 3.1.1 with Java 8, as I found out Java 11 does not work with the Google
>>>> BigQuery data warehouse. However, one finds out how to hack the Dockerfile
>>>> the hard way.
>>>>
>>>> For example, how to add additional Python libraries like tensorflow etc.
>>>> Loading these libraries through Kubernetes is not practical, as unzipping
>>>> and installing them through --py-files etc. takes considerable time, so
>>>> they need to be added to the Dockerfile at build time in the Python
>>>> directory under Kubernetes:
>>>>
>>>> /opt/spark/kubernetes/dockerfiles/spark/bindings/python
>>>>
>>>> RUN pip install pyyaml numpy cx_Oracle tensorflow 
>>>>
>>>> Also you will need curl to test the ports from inside the Docker container:
>>>>
>>>> RUN apt-get update && apt-get install -y curl
>>>> RUN ["apt-get","install","-y","vim"]
>>>>
>>>> As I said, I am happy to build these specific Dockerfiles plus the
>>>> complete documentation for them. I have already built one for Google (GCP).
>>>> The difference between the Spark and PySpark versions is that in
>>>> Spark/Scala a fat jar file will contain everything needed. That is not the
>>>> case with Python, I am afraid.
>>>>
>>>> HTH
>>>>
>>>>
>>>>view my Linkedin profile
>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>>
>>>> On Fri, 13 Aug 2021 at 08:13, Bode, Meikel, NMA-CFD <
>>>> meikel.b...@bertelsmann.de> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>>
>>>>>
>>>>> I am Meikel Bode and only an interested reader of dev and user 

  1   2   3   4   5   >