Hi Martin,

Thanks a lot for preparing the new repo and making it super easy for me to just copy my code over! I will create a new PR there.
> I think the PR is fine from a code perspective as a starting point. I've prepared the Go repository with all the things necessary so that it reduces friction for you: the protos are automatically generated, there are pre-commit checks, etc. All you need to do is drop your code :)
> Once we have the first version working we can iterate and identify the next steps.

Best,
Bo

On Thu, Jun 1, 2023 at 11:58 AM Martin Grund <mar...@databricks.com.invalid> wrote:

> These are all valid points and it makes total sense to continue to consider them. However, reading the mail I'm wondering if we're discussing the same problems.
>
> Deprecation of APIs aside, the main benefit of Spark Connect is that the contract is explicitly not a JAR file full of transitive dependencies (and discoverable internal APIs) but rather the contract established via the proto messages and RPCs. If you compare this, for example, to the R integration, there is no need to embed some Go pieces with the JVM to make it work. No custom RMI protocol specific to the client language, but simply the same contract that, for example, PySpark uses. The physical contract is the protobuf and the logical contract is the DataFrame API.
>
> This means that Spark Connect clients don't suffer a large part of the challenges that other tools built on top of Spark have, as there is no tight coupling between the driver JVM and the client.
>
> I'm happy to help establish clear guidance for contrib-style modules that operate with a different set of expectations but are developed by the Spark community under its guidelines.
>
> Martin
>
> On Thu 1. Jun 2023 at 12:41 Maciej <mszymkiew...@gmail.com> wrote:
>
>> Hi Martin,
>>
>> On 5/30/23 11:50, Martin Grund wrote:
>>
>>> I think it makes sense to split this discussion into two pieces.
>>> On the contribution side, my personal perspective is that these new clients are explicitly marked as experimental and unsupported until we deem them mature enough to be supported using the standard release process etc. However, the goal should be that the main contributors of these clients are aiming to follow the same release and maintenance schedule. I think we should encourage the community to contribute to the Spark Connect clients, and as such we should explicitly not make it as hard as possible to get started (and for that reason reserve the right to abandon).
>>
>> I know it sounds like nitpicking, but we still have components deprecated in 1.2 or 1.3, not to mention subprojects that haven't been developed for years. So, there is a huge gap between reserving a right and actually exercising it when needed. If such a right is to be used differently for Spark Connect bindings, it's something that should be communicated upfront.
>>
>>> How exactly the release schedule is going to look will probably require some experimentation, because it's a new area for Spark and its ecosystem. I don't think it requires us to have all answers upfront.
>>
>> Nonetheless, we should work towards establishing consensus around these issues and documenting the answers. They affect not only the maintainers (see for example a recent discussion about switching to a more predictable release schedule) but also the users, for whom multiple APIs (including their development status) have been a common source of confusion in the past.
>>
>>>> Also, an elephant in the room is the future of the current API in Spark 4 and onwards. As useful as Connect is, it is not exactly a replacement for many existing deployments. Furthermore, it doesn't make extending Spark much easier, and the current ecosystem is, subjectively speaking, a bit brittle.
>>> The goal of Spark Connect is not to replace the way users are currently deploying Spark; it's not meant to be that. Users should continue deploying Spark in exactly the way they prefer. Spark Connect allows bringing more interactivity and connectivity to Spark. While Spark Connect extends Spark, most new language consumers will not try to extend Spark, but simply provide the existing surface to their native language. So the goal is not so much extensibility but more availability. For example, I believe it would be awesome if the Livy community would find a way to integrate with Spark Connect to provide the routing capabilities for a stable DNS endpoint across all different Spark deployments.
>>>
>>>> [...] the current ecosystem is, subjectively speaking, a bit brittle.
>>>
>>> Can you help me understand that a bit better? Do you mean the Spark ecosystem or the Spark Connect ecosystem?
>>
>> I mean Spark in general. While most of the core and some closely related projects are well maintained, tools built on top of Spark, even ones supported by major stakeholders, are often short-lived and left unmaintained, if not officially abandoned.
>>
>> New languages aside, without a single extension point (which, for core Spark, is the JVM interface), maintaining public projects on top of Spark becomes even less attractive. That is assuming we don't completely reject the idea of extending Spark functionality while using Spark Connect, effectively limiting the target audience for any 3rd party library.
>>
>>> Martin
>>>
>>> On Fri, May 26, 2023 at 5:39 PM Maciej <mszymkiew...@gmail.com> wrote:
>>>
>>>> It might be a good idea to have a discussion about how new Connect clients fit into the overall process we have. In particular:
>>>>
>>>> * Under what conditions do we consider adding a new language to the official channels? What process do we follow?
>>>> * What guarantees do we offer with respect to these clients? Is adding a new client the same type of commitment as for the core API? In other words, do we commit to maintaining such clients "forever", or do we separate "official" and "contrib" clients, with the latter being governed by the ASF but not guaranteed to be maintained in the future?
>>>> * Do we follow the same release schedule as for the core project, or rather release each client separately, after the main release is completed?
>>>>
>>>> Also, an elephant in the room is the future of the current API in Spark 4 and onwards. As useful as Connect is, it is not exactly a replacement for many existing deployments. Furthermore, it doesn't make extending Spark much easier, and the current ecosystem is, subjectively speaking, a bit brittle.
>>>>
>>>> --
>>>> Best regards,
>>>> Maciej
>>>>
>>>> On 5/26/23 07:26, Martin Grund wrote:
>>>>
>>>> Thanks everyone for your feedback! I will work on figuring out what it takes to get started with a repo for the Go client.
>>>>
>>>> On Thu 25. May 2023 at 21:51 Chao Sun <sunc...@apache.org> wrote:
>>>>
>>>> +1 on separate repo too
>>>>
>>>> On Thu, May 25, 2023 at 12:43 PM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>>>
>>>> +1 for starting on a separate repo.
>>>>
>>>> Dongjoon.
>>>>
>>>> On Thu, May 25, 2023 at 9:53 AM yangjie01 <yangji...@baidu.com> wrote:
>>>>
>>>> +1 on starting this with a separate repo.
>>>> Which new clients can be placed in the main repo should be discussed after they are mature enough.
>>>>
>>>> Yang Jie
>>>>
>>>> From: Denny Lee <denny.g....@gmail.com>
>>>> Date: Wednesday, May 24, 2023, 21:31
>>>> To: Hyukjin Kwon <gurwls...@apache.org>
>>>> Cc: Maciej <mszymkiew...@gmail.com>, "dev@spark.apache.org" <dev@spark.apache.org>
>>>> Subject: Re: [CONNECT] New Clients for Go and Rust
>>>>
>>>> +1 on a separate repo, allowing different APIs to run at different speeds and ensuring they get community support.
>>>>
>>>> On Wed, May 24, 2023 at 00:37 Hyukjin Kwon <gurwls...@apache.org> wrote:
>>>>
>>>> I think we can just start this with a separate repo. I am fine with the second option too, but in this case we would have to triage which language to add into the main repo.
>>>>
>>>> On Fri, 19 May 2023 at 22:28, Maciej <mszymkiew...@gmail.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> Personally, I'm strongly against the second option and have some preference towards the third one (or maybe a mix of the first one and the third one).
>>>>
>>>> The project is already pretty large as-is and, with an extremely conservative approach towards removal of APIs, it only tends to grow over time. Making it even larger is not going to make things more maintainable and is likely to create an entry barrier for new contributors (that's similar to Jia's arguments).
>>>>
>>>> Moreover, we've seen quite a few different language clients over the years and all but one or two survived, while none is particularly active, as far as I'm aware.
>>>> Taking responsibility for more clients, without being sure that we have the resources to maintain them and that there is enough community around them to make such an effort worthwhile, doesn't seem like a good idea.
>>>>
>>>> --
>>>> Best regards,
>>>> Maciej Szymkiewicz
>>>>
>>>> Web: https://zero323.net
>>>> PGP: A30CEF0C31A501EC
>>>>
>>>> On 5/19/23 14:57, Jia Fan wrote:
>>>>
>>>> Hi,
>>>>
>>>> Thanks for the contribution! I prefer (1), for a few reasons:
>>>>
>>>> 1. Separate repositories can maintain independent versions, different release times, and faster bug-fix releases.
>>>>
>>>> 2. Different languages have different build tools. Putting them in one repository will make the main repository more and more complicated, and it will become extremely difficult to perform a complete build in the main repository.
>>>>
>>>> 3. Separate repositories will make CI configuration and execution easier, and the PR and commit lists will be clearer.
>>>>
>>>> 4. Other projects also govern their clients in separate repositories, like ClickHouse, which uses different repositories for JDBC, ODBC, and C++. Please refer to:
>>>> https://github.com/ClickHouse/clickhouse-java
>>>> https://github.com/ClickHouse/clickhouse-odbc
>>>> https://github.com/ClickHouse/clickhouse-cpp
>>>>
>>>> PS: I'm looking forward to the JavaScript Connect client!
>>>>
>>>> Thanks and regards,
>>>> Jia Fan
>>>>
>>>> On Fri, May 19, 2023 at 20:03, Martin Grund <mgr...@apache.org> wrote:
>>>>
>>>> Hi folks,
>>>>
>>>> When Bo (thanks for the time and contribution) started the work on https://github.com/apache/spark/pull/41036 he started the Go client directly in the Spark repository.
>>>> In the meantime, I was approached by other engineers who are willing to contribute to working on a Rust client for Spark Connect.
>>>>
>>>> Now one of the key questions is where these connectors should live and how we manage expectations most effectively.
>>>>
>>>> At the high level, there are three approaches:
>>>>
>>>> (1) "3rd party" (non-JVM / Python) clients should live in separate repositories owned and governed by the Apache Spark community.
>>>>
>>>> (2) All clients should live in the main Apache Spark repository in the `connector/connect/client` directory.
>>>>
>>>> (3) Non-native (Python, JVM) Spark Connect clients should not be part of the Apache Spark repository and governance rules.
>>>>
>>>> Before we iron out exactly how we mark these clients as experimental, and how we align their release process etc. with Spark, my suggestion would be to get a consensus on this first question.
>>>>
>>>> Personally, I'm fine with (1) and (2), with a preference for (2).
>>>>
>>>> Would love to get feedback from other members of the community!
>>>>
>>>> Thanks,
>>>> Martin

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org

--
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
PGP: A30CEF0C31A501EC