Hi Martin,

Thanks a lot for preparing the new repo and making it super easy for me to just copy my code over! I will create a new PR there.
> I think the PR is fine from a code perspective as a starting point. I've prepared the Go repository with all the things necessary so that it reduces friction for you: the protos are automatically generated, there are pre-commit checks, etc. All you need to do is drop your code :)
> Once we have the first version working we can iterate and identify the next steps.

Best,
Bo

On Thu, Jun 1, 2023 at 11:58 AM Martin Grund <mar...@databricks.com.invalid> wrote:

> These are all valid points and it makes total sense to continue to consider them. However, reading the mail I'm wondering if we're discussing the same problems.
>
> Deprecation of APIs aside, the main benefit of Spark Connect is that the contract is explicitly not a JAR file full of transitive dependencies (and discoverable internal APIs) but rather the contract established via the proto messages and RPCs. If you compare this, for example, to the R integration, there is no need to embed some Go pieces with the JVM to make it work. No custom RMI protocol specific to the client language, but simply the same contract that, for example, PySpark uses. The physical contract is the protobuf and the logical contract is the DataFrame API.
>
> This means that Spark Connect clients don't suffer a large part of the challenges that other tools built on top of Spark have, as there is no tight coupling between the driver JVM and the client.
>
> I'm happy to help establish clear guidance for contrib-style modules that operate with a different set of expectations but are developed by the Spark community under its guidelines.
>
> Martin
>
> On Thu 1. Jun 2023 at 12:41 Maciej <mszymkiew...@gmail.com> wrote:
>
>> Hi Martin,
>>
>> On 5/30/23 11:50, Martin Grund wrote:
>>
>>> I think it makes sense to split this discussion into two pieces.
>>> On the contribution side, my personal perspective is that these new clients are explicitly marked as experimental and unsupported until we deem them mature enough to be supported using the standard release process etc. However, the goal should be that the main contributors of these clients are aiming to follow the same release and maintenance schedule. I think we should encourage the community to contribute to the Spark Connect clients, and as such we should explicitly not make it as hard as possible to get started (and for that reason reserve the right to abandon).
>>
>> I know it sounds like nitpicking, but we still have components deprecated in 1.2 or 1.3, not to mention subprojects that haven't been developed for years. So, there is a huge gap between reserving a right and actually exercising it when needed. If such a right is to be used differently for Spark Connect bindings, it's something that should be communicated upfront.
>>
>>> How exactly the release schedule is going to look will probably require some experimentation, because it's a new area for Spark and its ecosystem. I don't think it requires us to have all answers upfront.
>>
>> Nonetheless, we should work towards establishing consensus around these issues and documenting the answers. They affect not only the maintainers (see for example a recent discussion about switching to a more predictable release schedule) but also the users, for whom multiple APIs (including their development status) have been a common source of confusion in the past.
>>
>>>> Also, an elephant in the room is the future of the current API in Spark 4 and onwards. As useful as Connect is, it is not exactly a replacement for many existing deployments. Furthermore, it doesn't make extending Spark much easier, and the current ecosystem is, subjectively speaking, a bit brittle.
>>> The goal of Spark Connect is not to replace the way users are currently deploying Spark; it's not meant to be that. Users should continue deploying Spark in exactly the way they prefer. Spark Connect allows bringing more interactivity and connectivity to Spark. While Spark Connect extends Spark, most new language consumers will not try to extend Spark, but simply provide the existing surface to their native language. So the goal is not so much extensibility but more availability. For example, I believe it would be awesome if the Livy community would find a way to integrate with Spark Connect to provide the routing capabilities for a stable DNS endpoint across all different Spark deployments.
>>>
>>>> [...] the current ecosystem is, subjectively speaking, a bit brittle.
>>>
>>> Can you help me understand that a bit better? Do you mean the Spark ecosystem or the Spark Connect ecosystem?
>>
>> I mean Spark in general. While most of the core and some closely related projects are well maintained, tools built on top of Spark, even ones supported by major stakeholders, are often short-lived and left unmaintained, if not officially abandoned.
>>
>> New languages aside, without a single extension point (which, for core Spark, is the JVM interface), maintaining public projects on top of Spark becomes even less attractive. That is assuming we don't completely reject the idea of extending Spark functionality while using Spark Connect, effectively limiting the target audience for any 3rd party library.
>>
>>> Martin
>>>
>>> On Fri, May 26, 2023 at 5:39 PM Maciej <mszymkiew...@gmail.com> wrote:
>>>
>>>> It might be a good idea to have a discussion about how new Connect clients fit into the overall process we have. In particular:
>>>>
>>>> * Under what conditions do we consider adding a new language to the official channels? What process do we follow?
>>>> * What guarantees do we offer with respect to these clients? Is adding a new client the same type of commitment as for the core API? In other words, do we commit to maintaining such clients "forever", or do we separate "official" and "contrib" clients, with the latter being governed by the ASF but not guaranteed to be maintained in the future?
>>>> * Do we follow the same release schedule as for the core project, or rather release each client separately, after the main release is completed?
>>>>
>>>> Also, an elephant in the room is the future of the current API in Spark 4 and onwards. As useful as Connect is, it is not exactly a replacement for many existing deployments. Furthermore, it doesn't make extending Spark much easier, and the current ecosystem is, subjectively speaking, a bit brittle.
>>>>
>>>> --
>>>> Best regards,
>>>> Maciej
>>>>
>>>> On 5/26/23 07:26, Martin Grund wrote:
>>>>
>>>> Thanks everyone for your feedback! I will work on figuring out what it takes to get started with a repo for the Go client.
>>>>
>>>> On Thu 25. May 2023 at 21:51 Chao Sun <sunc...@apache.org> wrote:
>>>>
>>>> +1 on separate repo too
>>>>
>>>> On Thu, May 25, 2023 at 12:43 PM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>>>
>>>> +1 for starting on a separate repo.
>>>>
>>>> Dongjoon.
>>>>
>>>> On Thu, May 25, 2023 at 9:53 AM yangjie01 <yangji...@baidu.com> wrote:
>>>>
>>>> +1 on starting this with a separate repo.
>>>> Which new clients can be placed in the main repo should be discussed after they are mature enough.
>>>>
>>>> Yang Jie
>>>>
>>>> From: Denny Lee <denny.g....@gmail.com>
>>>> Date: Wednesday, May 24, 2023, 21:31
>>>> To: Hyukjin Kwon <gurwls...@apache.org>
>>>> Cc: Maciej <mszymkiew...@gmail.com>, "dev@spark.apache.org" <dev@spark.apache.org>
>>>> Subject: Re: [CONNECT] New Clients for Go and Rust
>>>>
>>>> +1 on a separate repo, allowing different APIs to run at different speeds and ensuring they get community support.
>>>>
>>>> On Wed, May 24, 2023 at 00:37 Hyukjin Kwon <gurwls...@apache.org> wrote:
>>>>
>>>> I think we can just start this with a separate repo. I am fine with the second option too, but in this case we would have to triage which language to add into the main repo.
>>>>
>>>> On Fri, 19 May 2023 at 22:28, Maciej <mszymkiew...@gmail.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> Personally, I'm strongly against the second option and have some preference towards the third one (or maybe a mix of the first one and the third one).
>>>>
>>>> The project is already pretty large as-is and, with an extremely conservative approach towards removal of APIs, it only tends to grow over time. Making it even larger is not going to make things more maintainable and is likely to create an entry barrier for new contributors (that's similar to Jia's arguments).
>>>>
>>>> Moreover, we've seen quite a few different language clients over the years and all but one or two survived, while none is particularly active, as far as I'm aware.
>>>> Taking responsibility for more clients, without being sure that we have the resources to maintain them and that there is enough community around them to make such an effort worthwhile, doesn't seem like a good idea.
>>>>
>>>> --
>>>> Best regards,
>>>> Maciej Szymkiewicz
>>>>
>>>> Web: https://zero323.net
>>>> PGP: A30CEF0C31A501EC
>>>>
>>>> On 5/19/23 14:57, Jia Fan wrote:
>>>>
>>>> Hi,
>>>>
>>>> Thanks for the contribution! I prefer (1), for a few reasons:
>>>>
>>>> 1. Separate repositories can maintain independent versions, different release times, and faster bug-fix releases.
>>>>
>>>> 2. Different languages have different build tools. Putting them in one repository will make the main repository more and more complicated, and it will become extremely difficult to perform a complete build in the main repository.
>>>>
>>>> 3. Separate repositories will make CI configuration and execution easier, and the PR and commit lists will be clearer.
>>>>
>>>> 4. Other projects also govern their clients in separate repositories, like ClickHouse, which uses different repositories for JDBC, ODBC, and C++. Please refer to:
>>>> https://github.com/ClickHouse/clickhouse-java
>>>> https://github.com/ClickHouse/clickhouse-odbc
>>>> https://github.com/ClickHouse/clickhouse-cpp
>>>>
>>>> PS: I'm looking forward to the JavaScript Connect client!
>>>>
>>>> Thanks and regards,
>>>> Jia Fan
>>>>
>>>> On Fri, May 19, 2023 at 20:03, Martin Grund <mgr...@apache.org> wrote:
>>>>
>>>> Hi folks,
>>>>
>>>> When Bo (thanks for the time and contribution) started the work on https://github.com/apache/spark/pull/41036 he started the Go client directly in the Spark repository.
>>>> In the meantime, I was approached by other engineers who are willing to contribute to working on a Rust client for Spark Connect.
>>>>
>>>> Now one of the key questions is where these connectors should live and how we manage expectations most effectively.
>>>>
>>>> At the high level, there are three approaches:
>>>>
>>>> (1) "3rd party" (non-JVM / Python) clients should live in separate repositories owned and governed by the Apache Spark community.
>>>>
>>>> (2) All clients should live in the main Apache Spark repository in the `connector/connect/client` directory.
>>>>
>>>> (3) Non-native (Python, JVM) Spark Connect clients should not be part of the Apache Spark repository and governance rules.
>>>>
>>>> Before we iron out exactly how we mark these clients as experimental, and how we align their release process etc. with Spark, my suggestion would be to get a consensus on this first question.
>>>>
>>>> Personally, I'm fine with (1) and (2), with a preference for (2).
>>>>
>>>> Would love to get feedback from other members of the community!
>>>>
>>>> Thanks,
>>>> Martin

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org

--
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
PGP: A30CEF0C31A501EC