Re: [DISCUSS] Versionless Spark Programming Guide Proposal

2024-06-10 Thread Nimrod Ofek
programming guides should teach
>>>> fundamentals that do not change version-to-version. TypeScript
>>>> <https://www.typescriptlang.org/docs/handbook/typescript-from-scratch.html>
>>>>  (which
>>>> has one of the best DX's and docs) does this exceptionally well. Their
>>>> guides are refined, versionless pages; new features are elaborated upon in
>>>> release notes (analogous to our version-specific docs); and the occasional
>>>> version-specific caveat is called out in the guides.
>>>>
>>>>  I agree with Wenchen's 3 points. I don't think we need to say that
>>>> they *have* to go to the old page, but that if they want to, they can.
>>>>
>>>> Neil
>>>>
>>>> On Wed, Jun 5, 2024 at 12:04 PM Wenchen Fan 
>>>> wrote:
>>>>
>>>>> I agree with the idea of a versionless programming guide. But one
>>>>> thing we need to make sure of is we give clear messages for things that 
>>>>> are
>>>>> only available in a new version. My proposal is:
>>>>>
>>>>>    1. keep the old versions' programming guide unchanged. For
>>>>>example, people can still access
>>>>>https://spark.apache.org/docs/3.3.4/quick-start.html
>>>>>2. In the new versionless programming guide, we mention at the
>>>>>beginning that for Spark versions before 4.0, go to the versioned doc 
>>>>> site
>>>>>to read the programming guide.
>>>>>3. Revisit the programming guide of Spark 4.0 (compare it with the
>>>>>one of 3.5), and adjust the content to mention version-specific changes
>>>>>(API change, new features, etc.)
>>>>>
>>>>> Then we can have a versionless programming guide starting from Spark
>>>>> 4.0. We can also revisit programming guides of all versions and combine
>>>>> them into one with version-specific notes, but that's probably too much
>>>>> work.
>>>>>
>>>>> Any thoughts?
>>>>>
>>>>> Wenchen
>>>>>
>>>>> On Wed, Jun 5, 2024 at 1:39 AM Martin Andersson <
>>>>> martin.anders...@kambi.com> wrote:
>>>>>
>>>>>> While I have no practical knowledge of how documentation is
>>>>>> maintained in the spark project, I must agree with Nimrod. For users on
>>>>>> older versions, having a programming guide that refers to features or API
>>>>>> methods that do not exist in that version is confusing and detrimental.
>>>>>>
>>>>>> Surely there must be a better way to allow updating documentation
>>>>>> more often?
>>>>>>
>>>>>> Best Regards,
>>>>>> Martin
>>>>>>
>>>>>> --
>>>>>> *From:* Nimrod Ofek 
>>>>>> *Sent:* Wednesday, June 5, 2024 08:26
>>>>>> *To:* Neil Ramaswamy 
>>>>>> *Cc:* Praveen Gattu ; dev <
>>>>>> dev@spark.apache.org>
>>>>>> *Subject:* Re: [DISCUSS] Versionless Spark Programming Guide Proposal
>>>>>>
>>>>>>
>>>>>> Hi Neil,
>>>>>>
>>>>>>
>>>>>> While you wrote you don't mean the api docs (of course), the
>>>>>> programming guides are also different between versions since features are
>>>>>> being added, configs are being added/ removed/ changed, defaults are 
>>>>>> being
>>>>>> changed etc.
>>>>>>
>>>>>> I know of "backport hell" - which is why I wrote that once a version
>>>>>> is released it's frozen and the documentation will be updated for the
>>>>>> new
>>>>>> version only.
>>>>>>
>>>>>> I think of it as facing forward and keeping older versions but
>>>>>> focusing on the new releases to keep the community updating.
>>>>>> While Spark has a support window of 18 months until EOL, we can have
>>>>>> only a 6-month support cycle until EOL for documentation - there are no
>>>>>> major security concerns for documentation...

Re: [DISCUSS] Versionless Spark Programming Guide Proposal

2024-06-05 Thread Nimrod Ofek
Hi Neil,


While you wrote you don't mean the api docs (of course), the programming
guides are also different between versions since features are being added,
configs are being added/ removed/ changed, defaults are being changed etc.

I know of "backport hell" - which is why I wrote that once a version is
released it's frozen and the documentation will be updated for the new
version only.

I think of it as facing forward and keeping older versions but focusing on
the new releases to keep the community updating.
While Spark has a support window of 18 months until EOL, we can have only a
6-month support cycle until EOL for documentation - there are no major
security concerns for documentation...

Nimrod

On Wed, Jun 5, 2024, 08:28 Neil Ramaswamy wrote:

> Hi Nimrod,
>
> Quick clarification—my proposal will not touch API-specific documentation
> for the specific reasons you mentioned (signatures, behavior, etc.). It
> just aims to make the *programming guides* versionless. Programming
> guides should teach fundamentals of Spark, and the fundamentals of Spark
> should not change between releases.
>
> There are a few issues with updating documentation multiple times after
> Spark releases. First, fixes that apply to all existing versions'
> programming guides need backport PRs. For example, this change
> <https://github.com/apache/spark/pull/46797/files> applies to all the
> versions of the SS programming guide, but is likely to be fixed only in
> Spark 4.0. Additionally, any such update within a Spark release will require
> re-building the static sites in the spark repo, and copying those files to
> spark-website via a commit in spark-website. Making a typo fix like the one
> I linked would then require  + 1 PRs,
> as opposed to 1 PR in the versionless programming guide world.
>
> Neil
>
> On Tue, Jun 4, 2024 at 1:32 PM Nimrod Ofek  wrote:
>
>> Hi,
>>
>> While I think that the documentation needs a lot of improvement and
>> important details are missing - and detaching the documentation from the
>> main project can help iterating faster on documentation specific tasks, I
>> don't think we can nor should move to versionless documentation.
>>
>> Documentation is version specific: parameters are added and removed, new
>> features are added, behaviours sometimes change etc.
>>
>> I think the documentation should be version-specific - but separate from the
>> Spark release cadence - and can be updated multiple times after a Spark
>> release.
>> The way I see it, the documentation should be updated only for the latest
>> version; some time before a new release it should be archived, and the
>> updated documentation should reflect the new version.
>>
>> Thanks,
>> Nimrod
>>
>> On Tue, Jun 4, 2024, 18:34 Praveen Gattu wrote:
>>
>>> +1. This helps for greater velocity in improving docs. However, we might
>>> still need a way to provide version-specific information, don't we - i.e.
>>> which features are available in which version, etc.
>>>
>>> On Mon, Jun 3, 2024 at 3:08 PM Neil Ramaswamy 
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I've written up a proposal to migrate all the Apache Spark programming
>>>> guides to be versionless. You can find the proposal here
>>>> <https://docs.google.com/document/d/1OqeQ71zZleUa1XRZrtaPDFnJ-gVJdGM80o42yJVg9zg/>.
>>>> Please leave comments, or reply in this DISCUSS thread.
>>>>
>>>> TLDR: by making the programming guides versionless, we can make updates
>>>> to them whenever we'd like, instead of at the Spark release cadence. This
>>>> increased update velocity will enable us to make gradual improvements,
>>>> including breaking up the Structured Streaming programming guide into
>>>> smaller sub-guides. The proposal does not break *any* existing URLs,
>>>> and it does not affect our versioned API docs in any way.
>>>>
>>>> Thanks!
>>>> Neil
>>>>
>>>


Re: [DISCUSS] Versionless Spark Programming Guide Proposal

2024-06-04 Thread Nimrod Ofek
Hi,

While I think that the documentation needs a lot of improvement and
important details are missing - and detaching the documentation from the
main project can help iterating faster on documentation specific tasks, I
don't think we can nor should move to versionless documentation.

Documentation is version specific: parameters are added and removed, new
features are added, behaviours sometimes change etc.

I think the documentation should be version-specific - but separate from the
Spark release cadence - and can be updated multiple times after a Spark
release.
The way I see it, the documentation should be updated only for the latest
version; some time before a new release it should be archived, and the
updated documentation should reflect the new version.

Thanks,
Nimrod

On Tue, Jun 4, 2024, 18:34 Praveen Gattu wrote:

> +1. This helps for greater velocity in improving docs. However, we might
> still need a way to provide version-specific information, don't we - i.e.
> which features are available in which version, etc.
>
> On Mon, Jun 3, 2024 at 3:08 PM Neil Ramaswamy  wrote:
>
>> Hi all,
>>
>> I've written up a proposal to migrate all the Apache Spark programming
>> guides to be versionless. You can find the proposal here
>> .
>> Please leave comments, or reply in this DISCUSS thread.
>>
>> TLDR: by making the programming guides versionless, we can make updates
>> to them whenever we'd like, instead of at the Spark release cadence. This
>> increased update velocity will enable us to make gradual improvements,
>> including breaking up the Structured Streaming programming guide into
>> smaller sub-guides. The proposal does not break *any* existing URLs, and
>> it does not affect our versioned API docs in any way.
>>
>> Thanks!
>> Neil
>>
>


[DISCUSS] Spark - How to improve our release processes

2024-05-09 Thread Nimrod Ofek
Following the conversation started with Spark 4.0.0 release, this is a
thread to discuss improvements to our release processes.

I'll start by raising some questions that should probably be answered to
start the discussion:


   1. What is currently running in GitHub Actions?
   2. Who currently has permissions for GitHub Actions? Is there a specific
   owner for that today, or a different volunteer each time?
   3. What are the current limits of GitHub Actions, who sets them - and
   what is the process to change those (if possible at all; I presume not
   all Apache projects have the same limits)?
   4. What versions should we support as an output for the build?
   5. Where should the artifacts be stored?
   6. What should be the output? only tar or also a docker image published
   somewhere?
   7. Do we want to have a release on fixed dates or a manual release upon
   request?
   8. Who should be permitted to sign a version - and what is the process
   for that?


Thanks!
Nimrod


Re: [DISCUSS] Spark 4.0.0 release

2024-05-08 Thread Nimrod Ofek
I have no permissions so I can't do it, but I'm happy to help (although I am
more familiar with GitLab CI/CD than GitHub Actions).
Is there some point of contact that can provide me needed context and
permissions?
I'd also love to see why the costs are high and see how we can reduce
them...

Thanks,
Nimrod

On Wed, May 8, 2024 at 8:26 AM Holden Karau  wrote:

> I think signing the artifacts produced from a secure CI sounds like a good
> idea. I know we’ve been asked to reduce our GitHub action usage but perhaps
> someone interested could volunteer to set that up.
>
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
>
> On Tue, May 7, 2024 at 9:43 PM Nimrod Ofek  wrote:
>
>> Hi,
>> Thanks for the reply.
>>
>> From my experience, a build on a build server would be much more
>> predictable and less error prone than building on some laptop - and of
>> course much faster for producing builds, snapshots, release candidates,
>> early preview releases, or final releases.
>> It will enable us to have a preview version with the current changes - a
>> snapshot version - either automatically every day or, if we need to save
>> costs (although a build is really not expensive), with the click of a button.
>>
>> Regarding keys for signing - that's what vaults are for; all across the
>> industry we use vaults (such as HashiCorp Vault). But if the build is
>> automated and the only manual step is signing the release for security
>> reasons, that would be reasonable.
>>
>> Thanks,
>> Nimrod
>>
>>
>> On Wed, May 8, 2024, 00:54 Holden Karau <holden.ka...@gmail.com> wrote:
>>
>>> Indeed. We could conceivably build the release in CI/CD but the final
>>> verification / signing should be done locally to keep the keys safe (there
>>> was some concern from earlier release processes).
>>>
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>>>
>>> On Tue, May 7, 2024 at 10:55 AM Nimrod Ofek 
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> Sorry for the novice question, Wenchen - the release is done manually
>>>> from a laptop? Not using a CI CD process on a build server?
>>>>
>>>> Thanks,
>>>> Nimrod
>>>>
>>>> On Tue, May 7, 2024 at 8:50 PM Wenchen Fan  wrote:
>>>>
>>>>> UPDATE:
>>>>>
>>>>> Unfortunately, it took me quite some time to set up my laptop and get
>>>>> it ready for the release process (docker desktop doesn't work anymore, my
>>>>> pgp key is lost, etc.). I'll start the RC process tomorrow my time. Thanks
>>>>> for your patience!
>>>>>
>>>>> Wenchen
>>>>>
>>>>> On Fri, May 3, 2024 at 7:47 AM yangjie01  wrote:
>>>>>
>>>>>> +1
>>>>>>
>>>>>>
>>>>>>
>>>>>> *From:* Jungtaek Lim 
>>>>>> *Date:* Thursday, May 2, 2024 10:21
>>>>>> *To:* Holden Karau 
>>>>>> *Cc:* Chao Sun , Xiao Li ,
>>>>>> Tathagata Das , Wenchen Fan <
>>>>>> cloud0...@gmail.com>, Cheng Pan , Nicholas
>>>>>> Chammas , Dongjoon Hyun <
>>>>>> dongjoon.h...@gmail.com>, Cheng Pan , Spark dev
>>>>>> list , Anish Shrigondekar <
>>>>>> anish.shrigonde...@databricks.com>
>>>>>> *Subject:* Re: [DISCUSS] Spark 4.0.0 release
>>>>>>
>>>>>>
>>>>>>
>>>>>> +1 love to see it!
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, May 2, 2024 at 10:08 AM Holden Karau 
>>>>>> wrote:
>>>>>>
>>>>>> +1 :) yay previews
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, May 1, 2024 at 5:36 PM Chao Sun  wrote:
>>>>>>
>>>>>> +1
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, May 1, 2024 at 5:23 PM Xiao Li  wrote:
>>>>>>

Re: [DISCUSS] Spark 4.0.0 release

2024-05-07 Thread Nimrod Ofek
Hi,
Thanks for the reply.

From my experience, a build on a build server would be much more
predictable and less error prone than building on some laptop - and of
course much faster for producing builds, snapshots, release candidates,
early preview releases, or final releases.
It will enable us to have a preview version with the current changes - a
snapshot version - either automatically every day or, if we need to save
costs (although a build is really not expensive), with the click of a button.

Regarding keys for signing - that's what vaults are for; all across the
industry we use vaults (such as HashiCorp Vault). But if the build is
automated and the only manual step is signing the release for security
reasons, that would be reasonable.

Thanks,
Nimrod


On Wed, May 8, 2024, 00:54 Holden Karau wrote:

> Indeed. We could conceivably build the release in CI/CD but the final
> verification / signing should be done locally to keep the keys safe (there
> was some concern from earlier release processes).
>
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
>
> On Tue, May 7, 2024 at 10:55 AM Nimrod Ofek  wrote:
>
>> Hi,
>>
>> Sorry for the novice question, Wenchen - the release is done manually
>> from a laptop? Not using a CI CD process on a build server?
>>
>> Thanks,
>> Nimrod
>>
>> On Tue, May 7, 2024 at 8:50 PM Wenchen Fan  wrote:
>>
>>> UPDATE:
>>>
>>> Unfortunately, it took me quite some time to set up my laptop and get it
>>> ready for the release process (docker desktop doesn't work anymore, my pgp
>>> key is lost, etc.). I'll start the RC process tomorrow my time. Thanks for
>>> your patience!
>>>
>>> Wenchen
>>>
>>> On Fri, May 3, 2024 at 7:47 AM yangjie01  wrote:
>>>
>>>> +1
>>>>
>>>>
>>>>
>>>> *From:* Jungtaek Lim 
>>>> *Date:* Thursday, May 2, 2024 10:21
>>>> *To:* Holden Karau 
>>>> *Cc:* Chao Sun , Xiao Li ,
>>>> Tathagata Das , Wenchen Fan <
>>>> cloud0...@gmail.com>, Cheng Pan , Nicholas Chammas <
>>>> nicholas.cham...@gmail.com>, Dongjoon Hyun ,
>>>> Cheng Pan , Spark dev list ,
>>>> Anish Shrigondekar 
>>>> *Subject:* Re: [DISCUSS] Spark 4.0.0 release
>>>>
>>>>
>>>>
>>>> +1 love to see it!
>>>>
>>>>
>>>>
>>>> On Thu, May 2, 2024 at 10:08 AM Holden Karau 
>>>> wrote:
>>>>
>>>> +1 :) yay previews
>>>>
>>>>
>>>>
>>>> On Wed, May 1, 2024 at 5:36 PM Chao Sun  wrote:
>>>>
>>>> +1
>>>>
>>>>
>>>>
>>>> On Wed, May 1, 2024 at 5:23 PM Xiao Li  wrote:
>>>>
>>>> +1 for next Monday.
>>>>
>>>>
>>>>
>>>> We can do more previews when the other features are ready for preview.
>>>>
>>>>
>>>>
>>>> Tathagata Das wrote on Wed, May 1, 2024 at 08:46:
>>>>
>>>> Next week sounds great! Thank you Wenchen!
>>>>
>>>>
>>>>
>>>> On Wed, May 1, 2024 at 11:16 AM Wenchen Fan 
>>>> wrote:
>>>>
>>>> Yea I think a preview release won't hurt (without a branch cut). We
>>>> don't need to wait for all the ongoing projects to be ready. How about we
>>>> do a 4.0 preview release based on the current master branch next Monday?
>>>>
>>>>
>>>>
>>>> On Wed, May 1, 2024 at 11:06 PM Tathagata Das <
>>>> tathagata.das1...@gmail.com> wrote:
>>>>
>>>> Hey all,
>>>>
>>>>
>>>>
>>>> Reviving this thread, but Spark master has already accumulated a huge
>>>> amount of changes.  As a downstream project maintainer, I want to really
>>>> start testing the new features and other breaking changes, and it's hard to
>>>> do that without a Preview release. So the sooner we make a Preview release,
>>>> the faster we can start getting feedback for fixing things for a great
>>>> Spark 4.0 final release.
>>>>
>>>>
>>>>
>>>> So I urge the community to produce a Spark 4.0 Preview soon even if
>>>> certain features targeting the Delta 4.0 release are still incomplete.

Re: [DISCUSS] Spark 4.0.0 release

2024-05-07 Thread Nimrod Ofek
Hi,

Sorry for the novice question, Wenchen - the release is done manually from
a laptop? Not using a CI CD process on a build server?

Thanks,
Nimrod

On Tue, May 7, 2024 at 8:50 PM Wenchen Fan  wrote:

> UPDATE:
>
> Unfortunately, it took me quite some time to set up my laptop and get it
> ready for the release process (docker desktop doesn't work anymore, my pgp
> key is lost, etc.). I'll start the RC process tomorrow my time. Thanks for
> your patience!
>
> Wenchen
>
> On Fri, May 3, 2024 at 7:47 AM yangjie01  wrote:
>
>> +1
>>
>>
>>
>> *From:* Jungtaek Lim 
>> *Date:* Thursday, May 2, 2024 10:21
>> *To:* Holden Karau 
>> *Cc:* Chao Sun , Xiao Li ,
>> Tathagata Das , Wenchen Fan <
>> cloud0...@gmail.com>, Cheng Pan , Nicholas Chammas <
>> nicholas.cham...@gmail.com>, Dongjoon Hyun ,
>> Cheng Pan , Spark dev list ,
>> Anish Shrigondekar 
>> *Subject:* Re: [DISCUSS] Spark 4.0.0 release
>>
>>
>>
>> +1 love to see it!
>>
>>
>>
>> On Thu, May 2, 2024 at 10:08 AM Holden Karau 
>> wrote:
>>
>> +1 :) yay previews
>>
>>
>>
>> On Wed, May 1, 2024 at 5:36 PM Chao Sun  wrote:
>>
>> +1
>>
>>
>>
>> On Wed, May 1, 2024 at 5:23 PM Xiao Li  wrote:
>>
>> +1 for next Monday.
>>
>>
>>
>> We can do more previews when the other features are ready for preview.
>>
>>
>>
>> Tathagata Das wrote on Wed, May 1, 2024 at 08:46:
>>
>> Next week sounds great! Thank you Wenchen!
>>
>>
>>
>> On Wed, May 1, 2024 at 11:16 AM Wenchen Fan  wrote:
>>
>> Yea I think a preview release won't hurt (without a branch cut). We don't
>> need to wait for all the ongoing projects to be ready. How about we do a
>> 4.0 preview release based on the current master branch next Monday?
>>
>>
>>
>> On Wed, May 1, 2024 at 11:06 PM Tathagata Das <
>> tathagata.das1...@gmail.com> wrote:
>>
>> Hey all,
>>
>>
>>
>> Reviving this thread, but Spark master has already accumulated a huge
>> amount of changes.  As a downstream project maintainer, I want to really
>> start testing the new features and other breaking changes, and it's hard to
>> do that without a Preview release. So the sooner we make a Preview release,
>> the faster we can start getting feedback for fixing things for a great
>> Spark 4.0 final release.
>>
>>
>>
>> So I urge the community to produce a Spark 4.0 Preview soon even if
>> certain features targeting the Delta 4.0 release are still incomplete.
>>
>>
>>
>> Thanks!
>>
>>
>>
>>
>>
>> On Wed, Apr 17, 2024 at 8:35 AM Wenchen Fan  wrote:
>>
>> Thank you all for the replies!
>>
>>
>>
>> To @Nicholas Chammas  : Thanks for cleaning
>> up the error terminology and documentation! I've merged the first PR and
>> let's finish others before the 4.0 release.
>>
>> To @Dongjoon Hyun  : Thanks for driving the
>> ANSI on by default effort! Now the vote has passed, let's flip the config
>> and finish the DataFrame error context feature before 4.0.
>>
>> To @Jungtaek Lim  : Ack. We can treat the
>> Streaming state store data source as completed for 4.0 then.
>>
>> To @Cheng Pan  : Yea we definitely should have a
>> preview release. Let's collect more feedback on the ongoing projects and
>> then we can propose a date for the preview release.
>>
>>
>>
>> On Wed, Apr 17, 2024 at 1:22 PM Cheng Pan  wrote:
>>
>> will we have preview release for 4.0.0 like we did for 2.0.0 and 3.0.0?
>>
>> Thanks,
>> Cheng Pan
>>
>>
>> > On Apr 15, 2024, at 09:58, Jungtaek Lim 
>> wrote:
>> >
>> > W.r.t. state data source - reader (SPARK-45511), there are several
>> follow-up tickets, but we don't plan to address them soon. The current
>> implementation is the final shape for Spark 4.0.0, unless there are demands
>> on the follow-up tickets.
>> >
>> > We may want to check the plan for transformWithState - my understanding
>> is that we want to release the feature to 4.0.0, but there are several
>> remaining works to be done. While the tentative timeline for releasing is
>> June 2024, what would be the tentative timeline for the RC cut?
>> > (cc. Anish to add more context on the plan for transformWithState)
>> >
>> > On Sat, Apr 13, 2024 at 3:15 AM Wenchen Fan 
>> wrote:
>> > Hi all,
>> >
>> > It's close to the previously proposed 4.0.0 release date (June 2024),
>> and I think it's time to prepare for it and discuss the ongoing projects:
>> > • ANSI by default
>> > • Spark Connect GA
>> > • Structured Logging
>> > • Streaming state store data source
>> > • new data type VARIANT
>> > • STRING collation support
>> > • Spark k8s operator versioning
>> > Please help to add more items to this list that are missed here. I
>> would like to volunteer as the release manager for Apache Spark 4.0.0 if
>> there is no objection. Thank you all for the great work that fills Spark
>> 4.0!
>> >
>> > Wenchen Fan
>>
>>
>>
>>
>> --
>>
>> Twitter: https://twitter.com/holdenkarau
>> 
>>
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9
>> 

Re: [DISCUSS] clarify the definition of behavior changes

2024-05-02 Thread Nimrod Ofek
Hi Erik and Wenchen,

I think that a good practice with a public API - or an internal API that has
big impact and a lot of usage - is to ease in changes by providing defaults
for new parameters, keeping the former behaviour in a method with the
previous signature plus a deprecation notice, and deleting that deprecated
method in the following release. That way the actual break happens only one
release later, after all libraries have had the chance to align with the API,
and upgrades can be done while already using the new version.
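
To illustrate the pattern, a minimal sketch in Scala (MetricsReporter and its
methods are hypothetical, not an actual Spark interface):

trait MetricsReporter {
  // New signature introduced in this release.
  def report(name: String, value: Long, tags: Map[String, String]): Unit

  // Previous signature: keeps the former behaviour as a simple forwarder,
  // carries a deprecation notice for one release, and is deleted in the next.
  @deprecated("Use report(name, value, tags) instead", since = "4.0.0")
  def report(name: String, value: Long): Unit = report(name, value, Map.empty)
}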

Another thing is that we should probably examine which private APIs are used
externally, so we can provide a better experience and proper public APIs that
meet those needs (for instance, applicative metrics and some way of
creating custom-behaviour columns).

Thanks,
Nimrod


On Thu, May 2, 2024, 03:51 Wenchen Fan wrote:

> Hi Erik,
>
> Thanks for sharing your thoughts! Note: developer APIs are also public
> APIs (such as Data Source V2 API, Spark Listener API, etc.), so breaking
> changes should be avoided as much as we can and new APIs should be
> mentioned in the release notes. Breaking binary compatibility is also a
> "functional change" and should be treated as a behavior change.
>
> BTW, AFAIK some downstream libraries use private APIs such as Catalyst
> Expression and LogicalPlan. It's too much work to track all the changes to
> private APIs and I think it's the downstream library's responsibility to
> check such changes in new Spark versions, or avoid using private APIs.
> Exceptions can happen if certain private APIs are used too widely and we
> should avoid breaking them.
>
> Thanks,
> Wenchen
>
> On Wed, May 1, 2024 at 11:51 PM Erik Krogen  wrote:
>
>> Thanks for raising this important discussion Wenchen! Two points I would
>> like to raise, though I'm fully supportive of any improvements in this
>> regard, my points below notwithstanding -- I am not intending to let
>> perfect be the enemy of good here.
>>
>> On a similar note as Santosh's comment, we should consider how this
>> relates to developer APIs. Let's say I am an end user relying on some
>> library like frameless , which
>> relies on developer APIs in Spark. When we make a change to Spark's
>> developer APIs that requires a corresponding change in frameless, I don't
>> directly see that change as an end user, but it *does* impact me,
>> because now I have to upgrade to a new version of frameless that supports
>> those new changes. This can have ripple effects across the ecosystem.
>> Should we call out such changes so that end users understand the potential
>> impact to libraries they use?
>>
>> Second point, what about binary compatibility? Currently our versioning
>> policy says "Link-level compatibility is something we’ll try to guarantee
>> in future releases." (FWIW, it has said this since at least 2016
>> ...)
>> One step towards this would be to clearly call out any binary-incompatible
>> changes in our release notes, to help users understand if they may be
>> impacted. Similar to my first point, this has ripple effects across the
>> ecosystem -- if I just use Spark itself, recompiling is probably not a big
>> deal, but if I use N libraries that each depend on Spark, then after a
>> binary-incompatible change is made I have to wait for all N libraries to
>> publish new compatible versions before I can upgrade myself, presenting a
>> nontrivial barrier to adoption.
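
For context, a rough sketch of how a downstream library can catch such
binary-incompatible changes early using MiMa (assuming an sbt build; the
artifact coordinates and plugin version below are placeholders):

// project/plugins.sbt
addSbtPlugin("com.typesafe" % "sbt-mima-plugin" % "1.1.3")

// build.sbt: compare the current sources against the last released artifact
mimaPreviousArtifacts := Set("org.example" %% "my-spark-lib" % "1.0.0")

// Running `sbt mimaReportBinaryIssues` then fails the build on
// binary-incompatible changes against that previous release.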
>>
>> On Wed, May 1, 2024 at 8:18 AM Santosh Pingale
>>  wrote:
>>
>>> Thanks Wenchen for starting this!
>>>
>>> How do we define "the user" for spark?
>>> 1. End users: There are some users that use spark as a service from a
>>> provider
>>> 2. Providers/Operators: There are some users that provide spark as a
>>> service for their internal(on-prem setup with yarn/k8s)/external(Something
>>> like EMR) customers
>>> 3. ?
>>>
>>> Perhaps we need to consider infrastructure behavior changes as well to
>>> accommodate the second group of users.
>>>
>>> On 1 May 2024, at 06:08, Wenchen Fan  wrote:
>>>
>>> Hi all,
>>>
>>> It's exciting to see innovations keep happening in the Spark community
>>> and Spark keeps evolving itself. To make these innovations available to
>>> more users, it's important to help users upgrade to newer Spark versions
>>> easily. We've done a good job on it: the PR template requires the author to
>>> write down user-facing behavior changes, and the migration guide contains
>>> behavior changes that need attention from users. Sometimes behavior changes
>>> come with a legacy config to restore the old behavior. However, we still
>>> lack a clear definition of behavior changes and I propose the following
>>> definition:
>>>
>>> Behavior changes mean user-visible functional changes in a new release
>>> via public APIs. This means new features, and even bug fixes that eliminate
>>> NPE or correct query results, 

Re: [VOTE] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-30 Thread Nimrod Ofek
+1 (non-binding)

p.s
How do I become binding?

Thanks,
Nimrod

On Tue, Apr 30, 2024 at 10:53 AM Ye Xianjin  wrote:

> +1
> Sent from my iPhone
>
> On Apr 30, 2024, at 3:23 PM, DB Tsai  wrote:
>
> 
> +1
>
> On Apr 29, 2024, at 8:01 PM, Wenchen Fan  wrote:
>
> 
> To add more color:
>
> Spark data source table and Hive Serde table are both stored in the Hive
> metastore and keep the data files in the table directory. The only
> difference is they have different "table provider", which means Spark will
> use different reader/writer. Ideally the Spark native data source
> reader/writer is faster than the Hive Serde ones.
>
> What's more, the default format of Hive Serde is text. I don't think
> people want to use text format tables in production. Most people will add
> `STORED AS parquet` or `USING parquet` explicitly. By setting this config
> to false, we have a more reasonable default behavior: creating Parquet
> tables (or whatever is specified by `spark.sql.sources.default`).
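
A small sketch of the difference in Scala (spark-shell style; exact behaviour
depends on the Spark version, and spark.sql.sources.default is parquet unless
overridden):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("create-table-default-demo")
  // the flag under vote; false means plain CREATE TABLE uses the Spark native provider
  .config("spark.sql.legacy.createHiveTableByDefault", "false")
  .getOrCreate()

// No USING / STORED AS clause: with the flag off, the table provider comes
// from spark.sql.sources.default instead of the Hive text SerDe.
spark.sql("CREATE TABLE t_default (id INT, name STRING)")

// Explicit provider: unaffected by the flag either way.
spark.sql("CREATE TABLE t_parquet (id INT, name STRING) USING parquet")

// Shows which provider each table ended up with.
spark.sql("DESCRIBE TABLE EXTENDED t_default").show(truncate = false)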
>
> On Tue, Apr 30, 2024 at 10:45 AM Wenchen Fan  wrote:
>
>> @Mich Talebzadeh  there seems to be a
>> misunderstanding here. The Spark native data source table is still stored
>> in the Hive metastore, it's just that Spark will use a different (and
>> faster) reader/writer for it. `hive-site.xml` should work as it is today.
>>
>> On Tue, Apr 30, 2024 at 5:23 AM Hyukjin Kwon 
>> wrote:
>>
>>> +1
>>>
>>> It's a legacy conf that we should eventually remove. Spark
>>> should create a Spark table by default, not a Hive table.
>>>
>>> Mich, for your workload, you can simply switch that conf off if it
>>> concerns you. We also enabled ANSI as well (which you agreed on). It's a bit
>>> awkward to stop in the middle for this compatibility reason while making
>>> Spark sound. The compatibility has been tested in production for a long
>>> time so I don't see any particular issue about the compatibility case you
>>> mentioned.
>>>
>>> On Mon, Apr 29, 2024 at 2:08 AM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>

 Hi @Wenchen Fan 

 Thanks for your response. I believe we have not had enough time to
 "DISCUSS" this matter.

 Currently in order to make Spark take advantage of Hive, I create a
 soft link in $SPARK_HOME/conf. FYI, my spark version is 3.4.0 and Hive is
 3.1.1

  /opt/spark/conf/hive-site.xml ->
 /data6/hduser/hive-3.1.1/conf/hive-site.xml

 This works fine for me in my lab. So in the future if we opt to use the
 setting "spark.sql.legacy.createHiveTableByDefault" to False, there will
 not be a need for this logical link?
 On the face of it, this looks fine but in real life it may require a
 number of changes to the old scripts. Hence my concern.
 As a matter of interest has anyone liaised with the Hive team to ensure
 they have introduced the additional changes you outlined?

 HTH

 Mich Talebzadeh,
 Technologist | Architect | Data Engineer  | Generative AI | FinCrime
 London
 United Kingdom


view my Linkedin profile
 


  https://en.everybodywiki.com/Mich_Talebzadeh



 *Disclaimer:* The information provided is correct to the best of my
 knowledge but of course cannot be guaranteed . It is essential to note
 that, as with any advice, quote "one test result is worth one-thousand
 expert opinions (Werner
 Von Braun
 )".


 On Sun, 28 Apr 2024 at 09:34, Wenchen Fan  wrote:

> @Mich Talebzadeh  thanks for sharing your
> concern!
>
> Note: creating Spark native data source tables is usually Hive
> compatible as well, unless we use features that Hive does not support
> (TIMESTAMP NTZ, ANSI INTERVAL, etc.). I think it's a better default to
> create a Spark native table in this case, instead of creating a Hive table and
> failing.
>
> On Sat, Apr 27, 2024 at 12:46 PM Cheng Pan  wrote:
>
>> +1 (non-binding)
>>
>> Thanks,
>> Cheng Pan
>>
>> On Sat, Apr 27, 2024 at 9:29 AM Holden Karau 
>> wrote:
>> >
>> > +1
>> >
>> > Twitter: https://twitter.com/holdenkarau
>> > Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9
>> > YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>> >
>> >
>> > On Fri, Apr 26, 2024 at 12:06 PM L. C. Hsieh 
>> wrote:
>> >>
>> >> +1
>> >>
>> >> On Fri, Apr 26, 2024 at 10:01 AM Dongjoon Hyun <
>> dongj...@apache.org> wrote:
>> >> >
>> >> > I'll start with my +1.
>> >> >
>> >> > Dongjoon.
>> >> >
>> >> > On 2024/04/26 16:45:51 Dongjoon Hyun wrote:
>> >> > > Please vote on SPARK-46122 to set
>> spark.sql.legacy.createHiveTableByDefault
>> >> > > to `false` by 

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-25 Thread Nimrod Ofek
Of course - I can't think of a scenario with thousands of tables on a single
in-memory Spark cluster with an in-memory catalog.
Thanks for the help!

On Thu, Apr 25, 2024, 23:56 Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

>
>
> Agreed. In scenarios where most of the interactions with the catalog are
> related to query planning, saving and metadata management, the choice of
> catalog implementation may have less impact on query runtime performance.
> This is because the time spent on metadata operations is generally minimal
> compared to the time spent on actual data fetching, processing, and
> computation.
> However, we should also consider scalability and reliability concerns,
> especially as the size and complexity of the data and query workload grow.
> While an in-memory catalog may offer excellent performance for smaller workloads,
> it will face limitations in handling larger-scale deployments with
> thousands of tables, partitions, and users. Additionally, durability and
> persistence are crucial considerations, particularly in production
> environments where data integrity
> and availability are crucial. In-memory catalog implementations may lack
> durability, meaning that metadata changes could be lost in the event of a
> system failure or restart. Therefore, while in-memory catalog
> implementations can provide speed and efficiency for certain use cases, we
> ought to consider the requirements for scalability, reliability, and data
> durability when choosing a catalog solution for production deployments. In
> many cases, a combination of in-memory and disk-based catalog solutions may
> offer the best balance of performance and resilience for demanding large
> scale workloads.
>
>
> HTH
>
>
> Mich Talebzadeh,
> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
> London
> United Kingdom
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
> Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".
>
>
> On Thu, 25 Apr 2024 at 16:32, Nimrod Ofek  wrote:
>
>> Of course, but it's in memory and not persisted, which is much faster. And
>> as I said - I believe most of the interaction with it happens during
>> planning and save, not during actual query-run operations, and those are short
>> and minimal compared to data fetching and manipulation, so I don't believe
>> it will have a big impact on query runtime...
>>
>> On Thu, Apr 25, 2024, 17:52 Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>>> Well, I will be surprised, because the Derby database is single-threaded and
>>> won't be of much use here.
>>>
>>> Most Hive metastores in the commercial world utilise Postgres or Oracle
>>> for the metastore, as they are battle-proven, replicated and backed up.
>>>
>>> Mich Talebzadeh,
>>> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
>>> London
>>> United Kingdom
>>>
>>>
>>>view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* The information provided is correct to the best of my
>>> knowledge but of course cannot be guaranteed . It is essential to note
>>> that, as with any advice, quote "one test result is worth one-thousand
>>> expert opinions (Werner
>>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von Braun
>>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".
>>>
>>>
>>> On Thu, 25 Apr 2024 at 15:39, Nimrod Ofek  wrote:
>>>
>>>> Yes, in memory hive catalog backed by local Derby DB.
>>>> And again, I presume that most metadata related parts are during
>>>> planning and not actual run, so I don't see why it should strongly affect
>>>> query performance.
>>>>
>>>> Thanks,
>>>>
>>>>
>>>> On Thu, Apr 25, 2024, 17:29 Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>
>>>>> With regard to your point below
>>>>>

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-25 Thread Nimrod Ofek
Of course, but it's in memory and not persisted, which is much faster. And
as I said - I believe most of the interaction with it happens during
planning and save, not during actual query-run operations, and those are short
and minimal compared to data fetching and manipulation, so I don't believe
it will have a big impact on query runtime...

On Thu, Apr 25, 2024, 17:52 Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Well, I will be surprised, because the Derby database is single-threaded and
> won't be of much use here.
>
> Most Hive metastores in the commercial world utilise Postgres or Oracle for
> the metastore, as they are battle-proven, replicated and backed up.
>
> Mich Talebzadeh,
> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
> London
> United Kingdom
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
> Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".
>
>
> On Thu, 25 Apr 2024 at 15:39, Nimrod Ofek  wrote:
>
>> Yes, in memory hive catalog backed by local Derby DB.
>> And again, I presume that most metadata related parts are during planning
>> and not actual run, so I don't see why it should strongly affect query
>> performance.
>>
>> Thanks,
>>
>>
>> On Thu, Apr 25, 2024, 17:29 Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>>> With regard to your point below
>>>
>>> "The thing I'm missing is this: let's say that the output format I
>>> choose is delta lake or iceberg or whatever format that uses parquet. Where
>>> does the catalog implementation (which holds metadata afaik, same metadata
>>> that iceberg and delta lake save for their tables about their columns)
>>> comes into play and why should it affect performance? "
>>>
>>> The catalog implementation comes into play regardless of the output
>>> format chosen (Delta Lake, Iceberg, Parquet, etc.) because it is
>>> responsible for managing metadata about the datasets, tables, schemas, and
>>> other objects stored in the aforementioned formats. Even though Delta Lake and
>>> Iceberg have their own metadata management mechanisms internally, they still
>>> rely on the catalog for providing a unified interface for accessing and
>>> manipulating metadata across different storage formats.
>>>
>>> "Another thing is that if I understand correctly, and I might be totally
>>> wrong here, the internal spark catalog is a local installation of hive
>>> metastore anyway, so I'm not sure what the catalog has to do with anything"
>>>
>>> I don't understand this. Do you mean a Derby database?
>>>
>>> HTH
>>>
>>> Mich Talebzadeh,
>>> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
>>> London
>>> United Kingdom
>>>
>>>
>>>view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* The information provided is correct to the best of my
>>> knowledge but of course cannot be guaranteed . It is essential to note
>>> that, as with any advice, quote "one test result is worth one-thousand
>>> expert opinions (Werner
>>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von Braun
>>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".
>>>
>>>
>>> On Thu, 25 Apr 2024 at 14:38, Nimrod Ofek  wrote:
>>>
>>>> Thanks for the detailed answer.
>>>> The thing I'm missing is this: let's say that the output format I
>>>> choose is delta lake or iceberg or whatever format that uses parquet. Where
>>>> does the catalog implementation (which holds metadata afaik, same metadata
>>>> that iceberg and delta lake save for their tables about their columns)
>>>> come into play, and why should it affect performance?
>>>> Another thing is that if I understand correctly, and I might be totally
>>>> wrong here, the internal spark catalog is a local installation of 

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-25 Thread Nimrod Ofek
Yes, an in-memory Hive catalog backed by a local Derby DB.
And again, I presume that most metadata-related work happens during planning
and not during the actual run, so I don't see why it should strongly affect
query performance.
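
For reference, a minimal sketch of how the two options are selected (spark-shell
style Scala; spark.sql.catalogImplementation is a static conf, so pick one per
application):

import org.apache.spark.sql.SparkSession

// "in-memory": table metadata lives only for the lifetime of this session.
// "hive" (or .enableHiveSupport()): metadata is persisted in a Hive metastore,
// which is an embedded local Derby database unless hive-site.xml points elsewhere.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("catalog-demo")
  .config("spark.sql.catalogImplementation", "in-memory") // or "hive"
  .getOrCreate()

spark.sql("CREATE TABLE demo_tbl (id INT) USING parquet")
spark.sql("SHOW TABLES").show() // with "in-memory", gone once the session ends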

Thanks,


On Thu, Apr 25, 2024, 17:29 Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> With regard to your point below
>
> "The thing I'm missing is this: let's say that the output format I choose
> is delta lake or iceberg or whatever format that uses parquet. Where does
> the catalog implementation (which holds metadata afaik, same metadata that
> iceberg and delta lake save for their tables about their columns) comes
> into play and why should it affect performance? "
>
> The catalog implementation comes into play regardless of the output format
> chosen (Delta Lake, Iceberg, Parquet, etc.) because it is responsible for
> managing metadata about the datasets, tables, schemas, and other objects
> stored in aforementioned formats. Even though Delta Lake and Iceberg have
> their metadata management mechanisms internally, they still rely on the
> catalog for providing a unified interface for accessing and manipulating
> metadata across different storage formats.
>
> "Another thing is that if I understand correctly, and I might be totally
> wrong here, the internal spark catalog is a local installation of hive
> metastore anyway, so I'm not sure what the catalog has to do with anything"
>
> I don't understand this. Do you mean a Derby database?
>
> HTH
>
> Mich Talebzadeh,
> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
> London
> United Kingdom
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
> Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".
>
>
> On Thu, 25 Apr 2024 at 14:38, Nimrod Ofek  wrote:
>
>> Thanks for the detailed answer.
>> The thing I'm missing is this: let's say that the output format I choose
>> is delta lake or iceberg or whatever format that uses parquet. Where does
>> the catalog implementation (which holds metadata afaik, same metadata that
>> iceberg and delta lake save for their tables about their columns) come
>> into play and why should it affect performance?
>> Another thing is that if I understand correctly, and I might be totally
>> wrong here, the internal spark catalog is a local installation of hive
>> metastore anyway, so I'm not sure what the catalog has to do with anything.
>>
>> Thanks!
>>
>>
>> On Thu, Apr 25, 2024, 16:14 Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>>> My take regarding your question is that your mileage varies so to speak.
>>>
>>> 1) Hive provides a more mature and widely adopted catalog solution that
>>> integrates well with other components in the Hadoop ecosystem, such as
>>> HDFS, HBase, and YARN. If you are Hadoop-centric (say on-premise), using
>>> Hive may offer better compatibility and interoperability.
>>> 2) Hive provides a SQL-like interface that is familiar to users who are
>>> accustomed to traditional RDBMs. If your use case involves complex SQL
>>> queries or existing SQL-based workflows, using Hive may be advantageous.
>>> 3) If you are looking for performance, spark's native catalog tends to
>>> offer better performance for certain workloads, particularly those that
>>> involve iterative processing or complex data transformations.(my
>>> understanding). Spark's in-memory processing capabilities and optimizations
>>> make it well-suited for interactive analytics and machine learning
>>> tasks.(my favourite)
>>> 4) Integration with Spark Workflows: If you primarily use Spark for data
>>> processing and analytics, using Spark's native catalog may simplify
>>> workflow management and reduce overhead, Spark's  tight integration with
>>> its catalog allows for seamless interaction with Spark applications and
>>> libraries.
>>> 5) There seems to be some similarity between the Spark catalog and
>>> Databricks Unity Catalog, so that may favour the choice.
>>>
>>> HTH
>>>
>>> Mich Talebzadeh,
>>> Technologist | Architect | Data Engineer  | Generative 

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-25 Thread Nimrod Ofek
Thanks for the detailed answer.
The thing I'm missing is this: let's say that the output format I choose is
delta lake or iceberg or whatever format that uses parquet. Where does the
catalog implementation (which holds metadata afaik, same metadata that
iceberg and delta lake save for their tables about their columns) come
into play and why should it affect performance?
Another thing is that if I understand correctly, and I might be totally
wrong here, the internal spark catalog is a local installation of hive
metastore anyway, so I'm not sure what the catalog has to do with anything.
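
For what it's worth, the place this surfaces in Spark's DataSource V2 API is the
catalog plugin configuration - a rough sketch (the catalog name, warehouse path,
and table are placeholders, and the Iceberg catalog class requires the
iceberg-spark-runtime package on the classpath):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("v2-catalog-demo")
  // Each named catalog maps to a plugin implementation; table formats such as
  // Iceberg ship their own, which serves the metadata the planner needs.
  .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
  .config("spark.sql.catalog.local.type", "hadoop")
  .config("spark.sql.catalog.local.warehouse", "/tmp/iceberg-warehouse")
  .getOrCreate()

// "local.db.events" is resolved through the configured catalog plugin, which is
// where the metadata lookups discussed in this thread actually happen.
spark.sql("CREATE TABLE local.db.events (id BIGINT, ts TIMESTAMP) USING iceberg")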

Thanks!


On Thu, Apr 25, 2024, 16:14 Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> My take regarding your question is that your mileage varies so to speak.
>
> 1) Hive provides a more mature and widely adopted catalog solution that
> integrates well with other components in the Hadoop ecosystem, such as
> HDFS, HBase, and YARN. If you are Hadoop-centric (say on-premise), using
> Hive may offer better compatibility and interoperability.
> 2) Hive provides a SQL-like interface that is familiar to users who are
> accustomed to traditional RDBMs. If your use case involves complex SQL
> queries or existing SQL-based workflows, using Hive may be advantageous.
> 3) If you are looking for performance, spark's native catalog tends to
> offer better performance for certain workloads, particularly those that
> involve iterative processing or complex data transformations.(my
> understanding). Spark's in-memory processing capabilities and optimizations
> make it well-suited for interactive analytics and machine learning
> tasks.(my favourite)
> 4) Integration with Spark Workflows: If you primarily use Spark for data
> processing and analytics, using Spark's native catalog may simplify
> workflow management and reduce overhead, Spark's  tight integration with
> its catalog allows for seamless interaction with Spark applications and
> libraries.
> 5) There seems to be some similarity between the Spark catalog and
> Databricks Unity Catalog, so that may favour the choice.
>
> HTH
>
> Mich Talebzadeh,
> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
> London
> United Kingdom
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
> Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".
>
>
> On Thu, 25 Apr 2024 at 12:30, Nimrod Ofek  wrote:
>
>> I would also appreciate some material that describes the differences
>> between Spark native tables and Hive tables, and why each should be used...
>>
>> Thanks
>> Nimrod
>>
>> On Thu, Apr 25, 2024, 14:27 Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>>> I see a statement made as below  and I quote
>>>
>>> "The proposal of SPARK-46122 is to switch the default value of this
>>> configuration from `true` to `false` to use Spark native tables because
>>> we support better."
>>>
>>> Can you please elaborate on the above specifically with regard to the
>>> phrase ".. because
>>> we support better."
>>>
>>> Are you referring to the performance of Spark catalog (I believe it is
>>> internal) or integration with Spark?
>>>
>>> HTH
>>>
>>> Mich Talebzadeh,
>>> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
>>> London
>>> United Kingdom
>>>
>>>
>>>view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* The information provided is correct to the best of my
>>> knowledge but of course cannot be guaranteed . It is essential to note
>>> that, as with any advice, quote "one test result is worth one-thousand
>>> expert opinions (Werner
>>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von Braun
>>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".
>>>
>>>
>>> On Thu, 25 Apr 2024 at 11:17, Wenchen Fan  wrote:
>>>
>>>> +1
>>>>
>>>> On Thu, Apr 25, 2024 at 2:46 PM Ke

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-25 Thread Nimrod Ofek
I would also appreciate some material that describes the differences between
Spark native tables and Hive tables, and why each should be used...

Thanks
Nimrod

On Thu, Apr 25, 2024, 14:27 Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> I see a statement made as below  and I quote
>
> "The proposal of SPARK-46122 is to switch the default value of this
> configuration from `true` to `false` to use Spark native tables because
> we support better."
>
> Can you please elaborate on the above specifically with regard to the
> phrase ".. because
> we support better."
>
> Are you referring to the performance of Spark catalog (I believe it is
> internal) or integration with Spark?
>
> HTH
>
> Mich Talebzadeh,
> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner  Von
> Braun )".
>
>
> On Thu, 25 Apr 2024 at 11:17, Wenchen Fan  wrote:
>
>> +1
>>
>> On Thu, Apr 25, 2024 at 2:46 PM Kent Yao  wrote:
>>
>>> +1
>>>
>>> Nit: the umbrella ticket is SPARK-44111, not SPARK-4.
>>>
>>> Thanks,
>>> Kent Yao
>>>
>>> Dongjoon Hyun  于2024年4月25日周四 14:39写道:
>>> >
>>> > Hi, All.
>>> >
>>> > It's great to see community activities to polish 4.0.0 more and more.
>>> > Thank you all.
>>> >
>>> > I'd like to bring SPARK-46122 (another SQL topic) to you from the
>>> subtasks
>>> > of SPARK-4 (Prepare Apache Spark 4.0.0),
>>> >
>>> > - https://issues.apache.org/jira/browse/SPARK-46122
>>> >Set `spark.sql.legacy.createHiveTableByDefault` to `false` by
>>> default
>>> >
>>> > This legacy configuration is about `CREATE TABLE` SQL syntax without
>>> > `USING` and `STORED AS`, which is currently mapped to `Hive` table.
>>> > The proposal of SPARK-46122 is to switch the default value of this
>>> > configuration from `true` to `false` to use Spark native tables because
>>> > we support better.
>>> >
>>> > In other words, Spark will use the value of `spark.sql.sources.default`
>>> > as the table provider instead of `Hive` like the other Spark APIs. Of
>>> course,
>>> > the users can get all the legacy behavior by setting back to `true`.
>>> >
>>> > Historically, this behavior change was merged once at Apache Spark
>>> 3.0.0
>>> > preparation via SPARK-30098 already, but reverted during the 3.0.0 RC
>>> period.
>>> >
>>> > 2019-12-06: SPARK-30098 Use default datasource as provider for CREATE
>>> TABLE
>>> > 2020-05-16: SPARK-31707 Revert SPARK-30098 Use default datasource as
>>> > provider for CREATE TABLE command
>>> >
>>> > At Apache Spark 3.1.0, we had another discussion about this and
>>> defined it
>>> > as one of legacy behavior via this configuration via reused ID,
>>> SPARK-30098.
>>> >
>>> > 2020-12-01:
>>> https://lists.apache.org/thread/8c8k1jk61pzlcosz3mxo4rkj5l23r204
>>> > 2020-12-03: SPARK-30098 Add a configuration to use default datasource
>>> as
>>> > provider for CREATE TABLE command
>>> >
>>> > Last year, we received two additional requests twice to switch this
>>> because
>>> > Apache Spark 4.0.0 is a good time to make a decision for the future
>>> direction.
>>> >
>>> > 2023-02-27: SPARK-42603 as an independent idea.
>>> > 2023-11-27: SPARK-46122 as a part of Apache Spark 4.0.0 idea
>>> >
>>> >
>>> > WDYT? The technical scope is defined in the following PR which is one
>>> line of main
>>> > code, one line of migration guide, and a few lines of test code.
>>> >
>>> > - https://github.com/apache/spark/pull/46207
>>> >
>>> > Dongjoon.
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>


Support Avro rolling version upgrades using schema manager

2024-04-13 Thread Nimrod Ofek
Hi,

Currently, Avro records are supported in Spark - but with the limitation
that we must specify the input and output schema versions.
For writing out an Avro record that is fine - but for reading Avro records
it is usually a problem, since there are upgrades and changes - and the
current support can't handle them.

Confluent Schema Registry provides such functionality by having the schema
id as part of the message - so we can fetch the relevant schema and read
through changes if there are any - while still producing the output
schema we want (as long as the schemas are compatible, of course).
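
To make the gap concrete, here is a rough sketch (not a proposed API) of what
reading Confluent-framed Avro from Kafka takes in plain Spark today. It assumes
the spark-sql-kafka and spark-avro modules are on the classpath; the broker,
topic, and reader schema are placeholders. Confluent's wire format prefixes the
Avro payload with 1 magic byte and a 4-byte schema id, which we strip by hand -
and because from_avro only takes a fixed schema, a writer-schema change can
still break the read, which is exactly the limitation described above:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.avro.functions.from_avro
import org.apache.spark.sql.functions.{col, expr}

object ConfluentAvroSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("confluent-avro-sketch")
      .getOrCreate()

    // Reader schema we want the output to conform to (placeholder).
    val readerSchema =
      """{"type":"record","name":"User","fields":[
        |  {"name":"id","type":"long"},
        |  {"name":"name","type":"string","default":""}
        |]}""".stripMargin

    val decoded = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "users")
      .load()
      // drop the 5-byte Confluent header (magic byte + 4-byte schema id)
      .withColumn("payload", expr("substring(value, 6, length(value) - 5)"))
      .select(from_avro(col("payload"), readerSchema).as("user"))

    decoded.writeStream.format("console").start().awaitTermination()
  }
}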

The ABRiS open-source project does supply that functionality, but I think
small tweaks to the current Spark implementation should provide a good-enough
solution for 99% of the cases - without the need to go to another project
that duplicates much of the functionality that Spark already has.

I did see an old ticket for that - that never truly matured:
https://issues.apache.org/jira/browse/SPARK-34652

I would like to open such a PR to add this functionality to Spark; I just
wanted to make sure there was no particular reason for not doing it - maybe
there was a specific reason for not supporting a schema registry?


Thanks!
Nimrod