These are all valid points, and it makes total sense to continue considering them. However, reading the mail, I'm wondering if we're discussing the same problems.
Deprecation of APIs aside, the main benefit of Spark Connect is that the contract is explicitly not a jar file full of transitive dependencies (and discoverable internal APIs), but rather the contract established via the proto messages and RPCs. If you compare this, for example, to the R integration: there is no need to embed some Go pieces with the JVM to make it work, and no custom RMI protocol specific to the client language; it is simply the same contract that, for example, PySpark uses. The physical contract is the protobuf and the logical contract is the DataFrame API. This means that Spark Connect clients don't suffer a large part of the challenges that other tools built on top of Spark face, as there is no tight coupling between the driver JVM and the client.
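To make that contract concrete, here is a rough sketch of what a minimal Go client could look like. This is only an illustration under stated assumptions: the `pb` import stands for stubs generated from the spark.connect proto files, and its path and the exact generated names are hypothetical; the real Go client may be structured quite differently.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	// Hypothetical import path for stubs generated from the
	// spark.connect protos; the real module layout may differ.
	pb "example.com/sparkconnect/proto"
)

func main() {
	// Physical contract: a plain gRPC channel. No JVM, no jars, no
	// transitive Spark dependencies on the client side. 15002 is the
	// default Spark Connect port.
	conn, err := grpc.Dial("localhost:15002",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	client := pb.NewSparkConnectServiceClient(conn)

	// Logical contract: a DataFrame operation encoded as a Plan
	// message, here a simple SQL relation.
	req := &pb.ExecutePlanRequest{
		SessionId: "example-session",
		Plan: &pb.Plan{
			OpType: &pb.Plan_Root{
				Root: &pb.Relation{
					RelType: &pb.Relation_Sql{
						Sql: &pb.SQL{Query: "SELECT 1 AS id"},
					},
				},
			},
		},
	}

	// ExecutePlan is a server-streaming RPC; results arrive as a
	// stream of ExecutePlanResponse messages.
	stream, err := client.ExecutePlan(context.Background(), req)
	if err != nil {
		log.Fatal(err)
	}
	for {
		resp, err := stream.Recv()
		if err == io.EOF {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
		fmt.Printf("got response for session %s\n", resp.GetSessionId())
	}
}
```

The same two layers apply to any language: generate stubs from the protos for the physical side, and implement the DataFrame surface on top for the logical side; there is nothing to embed in the driver JVM.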
I'm happy to help establish clear guidance for contrib-style modules that operate under a different set of expectations but are developed by the Spark community and its guidelines.

Martin

On Thu, 1 Jun 2023 at 12:41, Maciej <mszymkiew...@gmail.com> wrote:

> Hi Martin,
>
> On 5/30/23 11:50, Martin Grund wrote:
> > I think it makes sense to split this discussion into two pieces. On
> > the contribution side, my personal perspective is that these new
> > clients are explicitly marked as experimental and unsupported until
> > we deem them mature enough to be supported using the standard
> > release process etc. However, the goal should be that the main
> > contributors of these clients are aiming to follow the same release
> > and maintenance schedule. I think we should encourage the community
> > to contribute to the Spark Connect clients, and as such we should
> > explicitly not make it as hard as possible to get started (and for
> > that reason reserve the right to abandon).
>
> I know it sounds like nitpicking, but we still have components
> deprecated in 1.2 or 1.3, not to mention subprojects that haven't
> been developed for years. So, there is a huge gap between reserving a
> right and actually exercising it when needed. If such a right is to
> be used differently for Spark Connect bindings, it's something that
> should be communicated upfront.
>
> > How exactly the release schedule is going to look will probably
> > require some experimentation, because it's a new area for Spark and
> > its ecosystem. I don't think it requires us to have all answers
> > upfront.
>
> Nonetheless, we should work towards establishing consensus around
> these issues and documenting the answers. They affect not only the
> maintainers (see for example a recent discussion about switching to a
> more predictable release schedule) but also the users, for whom
> multiple APIs (including their development status) have been a common
> source of confusion in the past.
>
> >> Also, an elephant in the room is the future of the current API in
> >> Spark 4 and onwards. As useful as Connect is, it is not exactly a
> >> replacement for many existing deployments. Furthermore, it doesn't
> >> make extending Spark much easier, and the current ecosystem is,
> >> subjectively speaking, a bit brittle.
> >
> > The goal of Spark Connect is not to replace the way users are
> > currently deploying Spark; it's not meant to be that. Users should
> > continue deploying Spark in exactly the way they prefer. Spark
> > Connect allows bringing more interactivity and connectivity to
> > Spark. While Spark Connect extends Spark, most new language
> > consumers will not try to extend Spark, but simply provide the
> > existing surface to their native language. So the goal is not so
> > much extensibility but more availability. For example, I believe it
> > would be awesome if the Livy community would find a way to
> > integrate with Spark Connect to provide the routing capabilities
> > for a stable DNS endpoint for all different Spark deployments.
>
> >> [...] the current ecosystem is, subjectively speaking, a bit
> >> brittle.
> >
> > Can you help me understand that a bit better? Do you mean the Spark
> > ecosystem or the Spark Connect ecosystem?
>
> I mean Spark in general. While most of the core and some closely
> related projects are well maintained, tools built on top of Spark,
> even ones supported by major stakeholders, are often short-lived and
> left unmaintained, if not officially abandoned.
>
> New languages aside, without a single extension point (which, for
> core Spark, is the JVM interface), maintaining public projects on top
> of Spark becomes even less attractive. That, assuming we don't
> completely reject the idea of extending Spark functionality while
> using Spark Connect, effectively limits the target audience for any
> 3rd-party library.
>
> > Martin
> >
> > On Fri, May 26, 2023 at 5:39 PM Maciej <mszymkiew...@gmail.com>
> > wrote:
> >
> > It might be a good idea to have a discussion about how new Connect
> > clients fit into the overall process we have. In particular:
> >
> > * Under what conditions do we consider adding a new language to the
> >   official channels? What process do we follow?
> > * What guarantees do we offer in respect to these clients? Is
> >   adding a new client the same type of commitment as for the core
> >   API? In other words, do we commit to maintaining such clients
> >   "forever", or do we separate the "official" and "contrib"
> >   clients, with the latter being governed by the ASF, but not
> >   guaranteed to be maintained in the future?
> > * Do we follow the same release schedule as for the core project,
> >   or rather release each client separately, after the main release
> >   is completed?
> >
> > Also, an elephant in the room is the future of the current API in
> > Spark 4 and onwards. As useful as Connect is, it is not exactly a
> > replacement for many existing deployments. Furthermore, it doesn't
> > make extending Spark much easier, and the current ecosystem is,
> > subjectively speaking, a bit brittle.
> >
> > --
> > Best regards,
> > Maciej
> >
> > On 5/26/23 07:26, Martin Grund wrote:
> >> Thanks everyone for your feedback! I will work on figuring out
> >> what it takes to get started with a repo for the Go client.
> >>
> >> On Thu, 25 May 2023 at 21:51, Chao Sun <sunc...@apache.org> wrote:
> >>
> >> +1 on separate repo too
> >>
> >> On Thu, May 25, 2023 at 12:43 PM Dongjoon Hyun
> >> <dongjoon.h...@gmail.com> wrote:
> >>>
> >>> +1 for starting on a separate repo.
> >>>
> >>> Dongjoon.
> >>>
> >>> On Thu, May 25, 2023 at 9:53 AM yangjie01 <yangji...@baidu.com>
> >>> wrote:
> >>>>
> >>>> +1 on starting this with a separate repo.
> >>>> Which new clients can be placed in the main repo should be
> >>>> discussed after they are mature enough.
> >>>>
> >>>> Yang Jie
> >>>>
> >>>> From: Denny Lee <denny.g....@gmail.com>
> >>>> Date: Wednesday, May 24, 2023, 21:31
> >>>> To: Hyukjin Kwon <gurwls...@apache.org>
> >>>> Cc: Maciej <mszymkiew...@gmail.com>, "dev@spark.apache.org"
> >>>> <dev@spark.apache.org>
> >>>> Subject: Re: [CONNECT] New Clients for Go and Rust
> >>>>
> >>>> +1 on separate repo, allowing different APIs to run at different
> >>>> speeds and ensuring they get community support.
> >>>>
> >>>> On Wed, May 24, 2023 at 00:37, Hyukjin Kwon
> >>>> <gurwls...@apache.org> wrote:
> >>>>
> >>>> I think we can just start this with a separate repo. I am fine
> >>>> with the second option too, but in this case we would have to
> >>>> triage which language to add into the main repo.
> >>>>
> >>>> On Fri, 19 May 2023 at 22:28, Maciej <mszymkiew...@gmail.com>
> >>>> wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>> Personally, I'm strongly against the second option and have some
> >>>> preference towards the third one (or maybe a mix of the first
> >>>> one and the third one).
> >>>>
> >>>> The project is already pretty large as-is and, with an extremely
> >>>> conservative approach towards removal of APIs, it only tends to
> >>>> grow over time. Making it even larger is not going to make
> >>>> things more maintainable and is likely to create an entry
> >>>> barrier for new contributors (that's similar to Jia's
> >>>> arguments).
> >>>>
> >>>> Moreover, we've seen quite a few different language clients over
> >>>> the years, and only one or two survived, while none is
> >>>> particularly active, as far as I'm aware. Taking responsibility
> >>>> for more clients, without being sure that we have the resources
> >>>> to maintain them and that there is enough community around them
> >>>> to make such an effort worthwhile, doesn't seem like a good
> >>>> idea.
> >>>>
> >>>> --
> >>>> Best regards,
> >>>> Maciej Szymkiewicz
> >>>>
> >>>> Web: https://zero323.net
> >>>> PGP: A30CEF0C31A501EC
> >>>>
> >>>> On 5/19/23 14:57, Jia Fan wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>> Thanks for the contribution!
> >>>>
> >>>> I prefer (1). There are a few reasons:
> >>>>
> >>>> 1. Different repositories can maintain independent versions,
> >>>> different release times, and faster bug-fix releases.
> >>>>
> >>>> 2. Different languages have different build tools. Putting them
> >>>> in one repository will make the main repository more and more
> >>>> complicated, and it will become extremely difficult to perform a
> >>>> complete build in the main repository.
> >>>>
> >>>> 3. Different repositories will make CI configuration and
> >>>> execution easier, and the PR and commit lists will be clearer.
> >>>>
> >>>> 4. Other projects also have different clients to govern, like
> >>>> ClickHouse. It uses different repositories for JDBC, ODBC, and
> >>>> C++. Please refer to:
> >>>>
> >>>> https://github.com/ClickHouse/clickhouse-java
> >>>> https://github.com/ClickHouse/clickhouse-odbc
> >>>> https://github.com/ClickHouse/clickhouse-cpp
> >>>>
> >>>> PS: I'm looking forward to the JavaScript Connect client!
> >>>> Thanks & Regards,
> >>>>
> >>>> Jia Fan
> >>>>
> >>>> On Fri, May 19, 2023 at 20:03, Martin Grund <mgr...@apache.org>
> >>>> wrote:
> >>>>
> >>>> Hi folks,
> >>>>
> >>>> When Bo (thanks for the time and contribution) started the work
> >>>> on https://github.com/apache/spark/pull/41036 he started the Go
> >>>> client directly in the Spark repository. In the meantime, I was
> >>>> approached by other engineers who are willing to contribute to
> >>>> working on a Rust client for Spark Connect.
> >>>>
> >>>> Now one of the key questions is where these connectors should
> >>>> live and how we manage expectations most effectively.
> >>>>
> >>>> At the high level, there are three approaches:
> >>>>
> >>>> (1) "3rd party" (non-JVM / Python) clients should live in
> >>>> separate repositories owned and governed by the Apache Spark
> >>>> community.
> >>>>
> >>>> (2) All clients should live in the main Apache Spark repository
> >>>> in the `connector/connect/client` directory.
> >>>>
> >>>> (3) Non-native (Python, JVM) Spark Connect clients should not be
> >>>> part of the Apache Spark repository and governance rules.
> >>>>
> >>>> Before we iron out exactly how we mark these clients as
> >>>> experimental and how we align their release process etc. with
> >>>> Spark, my suggestion would be to get a consensus on this first
> >>>> question.
> >>>>
> >>>> Personally, I'm fine with (1) and (2), with a preference for
> >>>> (2).
> >>>>
> >>>> Would love to get feedback from other members of the community!
> >>>>
> >>>> Thanks,
> >>>> Martin
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
> --
> Best regards,
> Maciej Szymkiewicz
>
> Web: https://zero323.net
> PGP: A30CEF0C31A501EC