Re: [External Email] [VOTE] Differentiate Spark without Spark Connect from Spark Connect

2024-07-23 Thread Martin Grund
+1

On Tue, Jul 23, 2024 at 07:06 Dongjoon Hyun  wrote:

> +1 for the proposed definition.
>
> Thanks,
> Dongjoon
>
>
> On Tue, Jul 23, 2024 at 6:42 AM Xianjin YE  wrote:
>
>> +1 (non-binding)
>>
>> On Jul 23, 2024, at 16:16, Jungtaek Lim 
>> wrote:
>>
>> +1 (non-binding)
>>
>> On Tue, Jul 23, 2024 at 1:51 PM  wrote:
>>
>>>
>>> +1
>>>
>>> On Jul 22, 2024, at 21:42, John Zhuge  wrote:
>>>
>>> 
>>> +1 (non-binding)
>>>
>>> On Mon, Jul 22, 2024 at 8:16 PM yangjie01 
>>> wrote:
>>>
 +1

 On 2024/7/23 11:11, "Kent Yao"  wrote:


 +1


 On 2024/07/23 02:04:17 Herman van Hovell wrote:
 > +1
 >
 > On Mon, Jul 22, 2024 at 8:56 PM Wenchen Fan >>> > wrote:
 >
 > > +1
 > >
 > > On Tue, Jul 23, 2024 at 8:40 AM Xinrong Meng >>> > wrote:
 > >
 > >> +1
 > >>
 > >> Thank you @Hyukjin Kwon >>> gurwls...@apache.org>> !
 > >>
 > >> On Mon, Jul 22, 2024 at 5:20 PM Gengliang Wang >>> > wrote:
 > >>
 > >>> +1
 > >>>
 > >>> On Mon, Jul 22, 2024 at 5:19 PM Hyukjin Kwon <
 gurwls...@apache.org >
 > >>> wrote:
 > >>>
 >  Starting with my own +1.
 > 
 >  On Tue, 23 Jul 2024 at 09:12, Hyukjin Kwon >>> >
 >  wrote:
 > 
 > > Hi all,
 > >
 > > I’d like to start a vote for differentiating "Spark without
 Spark
 > > Connect" as "Spark Classic".
 > >
 > > Please also refer to:
 > >
 > > - Discussion thread:
 > >
 https://lists.apache.org/thread/ys7zsod8cs9c7qllmf0p0msk6z2mz2ym
 > >
 > > Please vote on the SPIP for the next 72 hours:
 > >
 > > [ ] +1: Accept the proposal
 > > [ ] +0
 > > [ ] -1: I don’t think this is a good idea because …
 > >
 > > Thank you!
 > >
 > 
 >








>>>
>>> --
>>> John Zhuge
>>>
>>>
>>


Re: [DISCUSS] Differentiate Spark without Spark Connect from Spark Connect

2024-07-22 Thread Martin Grund
+1 for classic. It's simple, easy to understand, and it doesn't carry the
negative connotations that, for example, "legacy" does.

On Sun, Jul 21, 2024 at 23:48 Wenchen Fan  wrote:

> Classic SGTM.
>
> On Mon, Jul 22, 2024 at 1:12 PM Jungtaek Lim 
> wrote:
>
>> I'd propose not to change the name of "Spark Connect" - the name
>> represents the characteristic of the mode (the separation of the client
>> and server layers). Trying to remove the "Connect" part would just create
>> confusion.
>>
>> +1 for Classic to existing mode, till someone comes up with better
>> alternatives.
>>
>> On Mon, Jul 22, 2024 at 8:50 AM Hyukjin Kwon 
>> wrote:
>>
>>> I was thinking about a similar option too but I ended up giving this up
>>> .. It's quite unlikely at this moment but suppose that we have another
>>> Spark Connect-ish component in the far future and it would be challenging
>>> to come up with another name ... Another case is that we might have to cope
>>> with the cases like Spark Connect, vs Spark (with Spark Connect) and Spark
>>> (without Spark Connect) ..
>>>
>>> On Sun, 21 Jul 2024 at 09:59, Holden Karau 
>>> wrote:
>>>
 I think perhaps Spark Connect could be phrased as “Basic* Spark” &
 existing Spark could be “Full Spark” given the API limitations of Spark
 connect.

 *I was also thinking Core here but we’ve used core to refer to the RDD
 APIs for too long to reuse it here.

 Twitter: https://twitter.com/holdenkarau
 Books (Learning Spark, High Performance Spark, etc.):
 https://amzn.to/2MaRAG9  
 YouTube Live Streams: https://www.youtube.com/user/holdenkarau


 On Sat, Jul 20, 2024 at 8:02 PM Xiao Li  wrote:

> Classic is much better than Legacy. : )
>
> Hyukjin Kwon  wrote on Thu, Jul 18, 2024 at 16:58:
>
>> Hi all,
>>
>> I noticed that we need to standardize our terminology before moving
>> forward. For instance, when documenting, 'Spark without Spark Connect' is
>> too long and verbose. Additionally, I've observed that we use various 
>> names
>> for Spark without Spark Connect: Spark Classic, Classic Spark, Legacy
>> Spark, etc.
>>
>> I propose that we consistently refer to it as Spark Classic (vs.
>> Spark Connect).
>>
>> Please share your thoughts on this. Thanks!
>>
>


Re: [DISCUSS] Why do we remove RDD usage and RDD-backed code?

2024-07-12 Thread Martin Grund
Mridul, I really just wanted to understand the concern from Dongjoon. What
you're pointing at is a slightly different concern. So what I see is the
following:

> [...] they can initialize a SparkContext and work with RDD api:

The current PR uses a potentially optional value without checking that it
is set. (Which is what would happen if you just have a SparkContext and no
SparkSession).

I understand that this can happen when someone creates a Spark job and uses
no other Spark APIs to begin with. But in the context of using the
current Spark ML implementation, is it actually possible to end up in this
situation? I'm really just trying to understand the system's invariants.
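
(For illustration only - this is not the PR's code, just a minimal PySpark
sketch of the situation with made-up names: a job that only ever creates a
SparkContext has no SparkSession, so an unchecked lookup of the active
session comes back empty.)

from pyspark import SparkContext
from pyspark.sql import SparkSession

# A job that only creates a SparkContext and sticks to the RDD API.
sc = SparkContext("local[*]", "rdd-only-app")
print(sc.parallelize([1, 2, 3]).sum())

# No SparkSession was ever created, so the active-session lookup is empty.
print(SparkSession.getActiveSession())      # prints None

# Code that assumes a session exists would fail here; the safer pattern is
# to build one on top of the existing SparkContext.
spark = SparkSession.builder.getOrCreate()  # reuses the SparkContext above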

> [...] SparkSession is heavier than SparkContext

Assuming that, for whatever reason, a SparkSession was created. Is there a
downside to using it?

Please see my questions as independent of the RDD API discussion itself,
and I don't think this PR was even meant to be put in the context of any
Spark Connect work.

On Fri, Jul 12, 2024 at 11:58 PM Mridul Muralidharan 
wrote:

>
> It is not necessary for users to create a SparkSession, Martin - they can
> initialize a SparkContext and work with the RDD API, which would be what
> Dongjoon is referring to, IMO.
>
> Even after Spark Connect GA, I am not in favor of deprecating the RDD API at
> least until we have parity between both (which we don’t have today), and we
> have vetted this parity over the course of a few minor releases.
>
>
> Regards,
> Mridul
>
>
>
> On Fri, Jul 12, 2024 at 4:19 PM Dongjoon Hyun 
> wrote:
>
>> Hi, All.
>>
>> Apache Spark's RDD API has played an essential and invaluable role from the
>> beginning, and it will continue to do so even if it's not supported by Spark
>> Connect.
>>
>> I have a concern about recent activity which blindly replaces RDD usage with
>> SparkSession.
>>
>> For instance,
>>
>> https://github.com/apache/spark/pull/47328
>> [SPARK-48883][ML][R] Replace RDD read / write API invocation with
>> Dataframe read / write API
>>
>> This PR doesn't look proper to me in two ways.
>> - SparkSession is heavier than SparkContext
>> - According to the following PR description, the background is also
>> hidden in the community.
>>
>>   > # Why are the changes needed?
>>   > In databricks runtime, RDD read / write API has some issue for
>> certain storage types
>>   > that requires the account key, but Dataframe read / write API works.
>>
>> In addition, we don't know if this PR fixes the mentioned unknown
>> storage's issue or not because it's not testable in the community test
>> coverage.
>>
>> I'm wondering if the Apache Spark community aims to move away from the
>> RDD usage in favor of `Spark Connect`. Isn't it too early because `Spark
>> Connect` is not even GA in the community?
>>
>>
>> Dongjoon.
>>
>


Re: [DISCUSS] Why do we remove RDD usage and RDD-backed code?

2024-07-12 Thread Martin Grund
I took a quick look at the PR and would like to understand your concern
better about:

>  SparkSession is heavier than SparkContext

It looks like the PR is using the active SparkSession, not creating a new
one etc. I would highly appreciate it if you could help me understand this
situation better.

Thanks a lot!

On Fri, Jul 12, 2024 at 8:52 PM Dongjoon Hyun 
wrote:

> Hi, All.
>
> Apache Spark's RDD API has played an essential and invaluable role from the
> beginning, and it will continue to do so even if it's not supported by Spark
> Connect.
>
> I have a concern about recent activity which blindly replaces RDD usage with
> SparkSession.
>
> For instance,
>
> https://github.com/apache/spark/pull/47328
> [SPARK-48883][ML][R] Replace RDD read / write API invocation with
> Dataframe read / write API
>
> This PR doesn't look proper to me in two ways.
> - SparkSession is heavier than SparkContext
> - According to the following PR description, the background is also hidden
> in the community.
>
>   > # Why are the changes needed?
>   > In databricks runtime, RDD read / write API has some issue for certain
> storage types
>   > that requires the account key, but Dataframe read / write API works.
>
> In addition, we don't know if this PR fixes the mentioned unknown
> storage's issue or not because it's not testable in the community test
> coverage.
>
> I'm wondering if the Apache Spark community aims to move away from the RDD
> usage in favor of `Spark Connect`. Isn't it too early because `Spark
> Connect` is not even GA in the community?
>
> Dongjoon.
>


Re: [VOTE] Allow GitHub Actions runs for contributors' PRs without approvals in apache/spark-connect-go

2024-07-04 Thread Martin Grund
+1 (non-binding)

On Thu, Jul 4, 2024 at 7:15 PM Holden Karau  wrote:

> +1
>
> Although given it's a US holiday, maybe keep the vote open for an extra day?
>
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
>
> On Thu, Jul 4, 2024 at 7:33 AM Denny Lee  wrote:
>
>> +1 (non-binding)
>>
>> On Thu, Jul 4, 2024 at 19:13 Hyukjin Kwon  wrote:
>>
>>> Hi all,
>>>
>>> I’d like to start a vote for allowing GitHub Actions runs for
>>> contributors' PRs without approvals in apache/spark-connect-go.
>>>
>>> Please also refer to:
>>>
>>>- Discussion thread:
>>> https://lists.apache.org/thread/tsqm0dv01f7jgkv5l4kyvtpw4tc6f420
>>>- JIRA ticket: https://issues.apache.org/jira/browse/INFRA-25936
>>>
>>> Please vote on the SPIP for the next 72 hours:
>>>
>>> [ ] +1: Accept the proposal
>>> [ ] +0
>>> [ ] -1: I don’t think this is a good idea because …
>>>
>>> Thank you!
>>>
>>>


Re: [DISCUSS] Allow GitHub Actions runs for contributors' PRs without approvals in apache/spark-connect-go

2024-07-03 Thread Martin Grund
Absolutely, we should do that. I thought that the default rule was already
inclusive, so that once folks have their first contribution it would
automatically allow kicking off the workflows.

On Thu, Jul 4, 2024 at 04:20 Matthew Powers 
wrote:

> Yea, this would be great.
>
> spark-connect-go is still experimental and anything we can do to get it
> production grade would be a great step IMO.  The Go community is excited to
> write Spark... with Go!
>
> On Wed, Jul 3, 2024 at 8:49 PM Hyukjin Kwon  wrote:
>
>> Hi all,
>>
>> The Spark Connect Go client repository (
>> https://github.com/apache/spark-connect-go) requires GitHub Actions runs
>> for individual commits within contributors' PRs.
>>
>> This policy was intentionally applied (
>> https://issues.apache.org/jira/browse/INFRA-24387), but we can change
>> this default once we reach a consensus on it.
>>
>> I would like to allow GitHub Actions runs for contributors by default to
>> make the development faster. For now, I have been approving individual
>> commits in their PRs, and this becomes overhead.
>>
>> If you have any feedback on this, please let me know.
>>
>


Re: [External Email] Re: [VOTE] Move Spark Connect server to builtin package (Client API layer stays external)

2024-07-03 Thread Martin Grund
+1 (non-binding)

On Wed, Jul 3, 2024 at 07:25 Holden Karau  wrote:

> +1
>
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
>
> On Tue, Jul 2, 2024 at 10:18 PM yangjie01 
> wrote:
>
>> +1 (non-binding)
>>
>>
>>
>> From: Denny Lee 
>> Date: Wednesday, July 3, 2024, 09:12
>> To: Hyukjin Kwon 
>> Cc: dev 
>> Subject: [External Email] Re: [VOTE] Move Spark Connect server to builtin package
>> (Client API layer stays external)
>>
>>
>>
>> +1 (non-binding)
>>
>>
>>
>> On Wed, Jul 3, 2024 at 9:11 AM Hyukjin Kwon  wrote:
>>
>> Starting with my own +1.
>>
>>
>>
>> On Wed, 3 Jul 2024 at 09:59, Hyukjin Kwon  wrote:
>>
>> Hi all,
>>
>> I’d like to start a vote for moving Spark Connect server to builtin
>> package (Client API layer stays external).
>>
>> Please also refer to:
>>
>>- Discussion thread:
>> https://lists.apache.org/thread/odlx9b552dp8yllhrdlp24pf9m9s4tmx
>> 
>>- JIRA ticket: https://issues.apache.org/jira/browse/SPARK-48763
>> 
>>
>> Please vote on the SPIP for the next 72 hours:
>>
>> [ ] +1: Accept the proposal
>> [ ] +0
>> [ ] -1: I don’t think this is a good idea because …
>>
>> Thank you!
>>
>>


Re: [External Email] [DISCUSS] Move Spark Connect server to builtin package (Client API layer stays external)

2024-07-02 Thread Martin Grund
+1

On Tue, Jul 2, 2024 at 7:19 AM yangjie01 
wrote:

> I have manually attempted to only modify the `assembly/pom.xml` and
> examined the results of executing `dev/make-distribution.sh --tgz`. The
> `spark-connect_2.13-4.0.0-SNAPSHOT.jar` is indeed included in the jars
> directory. However, if rearranging the directories would result in a
> clearer project structure, I believe that would also be a viable approach.
>
>
>
> From: Hyukjin Kwon 
> Date: Tuesday, July 2, 2024, 12:00
> To: yangjie01 
> Cc: dev 
> Subject: Re: [External Email] [DISCUSS] Move Spark Connect server to builtin package
> (Client API layer stays external)
>
>
>
> My concern is that the `connector` directory is really for
> external/optional packages (and they aren't included in assembly IIRC).. so
> I am hesitant to just change the assembly.
> The actual changes are not quite large but it moves the files around.
>
>
>
> On Tue, 2 Jul 2024 at 12:23, yangjie01 
> wrote:
>
> I'm supportive of this initiative. However, if the purpose is just to
> avoid the additional `--packages` option, it seems that making some
> adjustments to the `assembly/pom.xml` could potentially meet our goal. Is
> it really necessary to restructure the code directory?
>
>
>
> Jie Yang
>
>
>
> From: Hyukjin Kwon 
> Date: Tuesday, July 2, 2024, 08:19
> To: dev 
> Subject: [External Email] [DISCUSS] Move Spark Connect server to builtin package
> (Client API layer stays external)
>
>
>
> Hi all,
>
> I would like to discuss moving the Spark Connect server to the builtin
> package. Right now, users have to specify --packages when they run the Spark
> Connect server script, for example:
>
> ./sbin/start-connect-server.sh --jars `ls 
> connector/connect/server/target/**/spark-connect*SNAPSHOT.jar`
>
> or
>
> ./sbin/start-connect-server.sh --packages 
> org.apache.spark:spark-connect_2.12:3.5.1
>
> which is a little bit odd, since sbin scripts shouldn't need to be given jars to start.
>
> Moving it to the builtin package is pretty straightforward because most of
> the jars are shaded, and the impact would be minimal. I have a prototype here:
> apache/spark/#47157.
> This also simplifies the Python local running logic a lot.
>
> The user-facing API layer, the Spark Connect client, stays external, but I
> would like the internal/admin server layer, the Spark Connect server
> implementation, to be built into Spark.
>
> Please let me know if you have thoughts on this!
>
>
>
>


Re: Write Spark Connection client application in Go

2023-09-13 Thread Martin Grund
This is absolutely awesome! Thank you so much for dedicating your time to
this project!


On Wed, Sep 13, 2023 at 6:04 AM Holden Karau  wrote:

> That’s so cool! Great work y’all :)
>
> On Tue, Sep 12, 2023 at 8:14 PM bo yang  wrote:
>
>> Hi Spark Friends,
>>
>> Anyone interested in using Golang to write Spark applications? We created
>> a Spark Connect Go Client library
>> . Would love to hear
>> feedback/thoughts from the community.
>>
>> Please see the quick start guide
>> 
>> about how to use it. Following is a very short Spark Connect application in
>> Go:
>>
>> func main() {
>>  spark, _ := 
>> sql.SparkSession.Builder.Remote("sc://localhost:15002").Build()
>>  defer spark.Stop()
>>
>>  df, _ := spark.Sql("select 'apple' as word, 123 as count union all 
>> select 'orange' as word, 456 as count")
>>  df.Show(100, false)
>>  df.Collect()
>>
>>  df.Write().Mode("overwrite").
>>  Format("parquet").
>>  Save("file:///tmp/spark-connect-write-example-output.parquet")
>>
>>  df = spark.Read().Format("parquet").
>>  Load("file:///tmp/spark-connect-write-example-output.parquet")
>>  df.Show(100, false)
>>
>>  df.CreateTempView("view1", true, false)
>>  df, _ = spark.Sql("select count, word from view1 order by count")
>> }
>>
>>
>> Many thanks to Martin, Hyukjin, Ruifeng and Denny for creating and
>> working together on this repo! Welcome more people to contribute :)
>>
>> Best,
>> Bo
>>
>>


Re: [VOTE] Release Apache Spark 3.5.0 (RC3)

2023-08-29 Thread Martin Grund
+1 (non binding)

Tested Spark Connect fully isolated and with the PySpark build. Also tested
some of the new PySpark ML Connect features.

On Tue 29. Aug 2023 at 18:25 Yuanjian Li  wrote:

> Please vote on releasing the following candidate(RC3) as Apache Spark
> version 3.5.0.
>
> The vote is open until 11:59pm Pacific time Aug 31st and passes if a
> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.5.0
>
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v3.5.0-rc3 (commit
> 9f137aa4dc43398aafa0c3e035ed3174182d7d6c):
>
> https://github.com/apache/spark/tree/v3.5.0-rc3
>
> The release files, including signatures, digests, etc. can be found at:
>
> https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc3-bin/
>
> Signatures used for Spark RCs can be found in this file:
>
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
>
> https://repository.apache.org/content/repositories/orgapachespark-1447
>
> The documentation corresponding to this release can be found at:
>
> https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc3-docs/
>
> The list of bug fixes going into 3.5.0 can be found at the following URL:
>
> https://issues.apache.org/jira/projects/SPARK/versions/12352848
>
> This release is using the release script of the tag v3.5.0-rc3.
>
>
> FAQ
>
> =
>
> How can I help test this release?
>
> =
>
> If you are a Spark user, you can help us test this release by taking
>
> an existing Spark workload and running on this release candidate, then
>
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
>
> the current RC and see if anything important breaks, in the Java/Scala
>
> you can add the staging repository to your projects resolvers and test
>
> with the RC (make sure to clean up the artifact cache before/after so
>
> you don't end up building with an out of date RC going forward).
>
> ===
>
> What should happen to JIRA tickets still targeting 3.5.0?
>
> ===
>
> The current list of open tickets targeted at 3.5.0 can be found at:
>
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.5.0
>
> Committers should look at those and triage. Extremely important bug
>
> fixes, documentation, and API tweaks that impact compatibility should
>
> be worked on immediately. Everything else please retarget to an
>
> appropriate release.
>
> ==
>
> But my bug isn't fixed?
>
> ==
>
> In order to make timely releases, we will typically not hold the
>
> release unless the bug in question is a regression from the previous
>
> release. That being said, if there is something which is a regression
>
> that has not been correctly targeted please ping me or a committer to
>
> help target the issue.
>
> Thanks,
>
> Yuanjian Li
>


Re: Spark Connect: API mismatch in SparkSesession#execute

2023-08-28 Thread Martin Grund
Hi Stefan,

There are some current limitations around how protobuf is embedded in Spark
Connect. One of the challenges there is that, for compatibility reasons, we
currently shade protobuf, which in turn shades the
`protobuf.GeneratedMessage` class. The way to work around this is to shade
the protobuf library in your code following the same rules as in Spark.

I have a fully working example here:
https://github.com/grundprinzip/spark-connect-appstate-example/tree/main

We are definitely looking forward to improving the usability.

Hope this helps,
Martin

On Mon, Aug 28, 2023 at 4:19 PM Stefan Hagedorn 
wrote:

> Hi everyone,
>
>
>
> Trying my luck here, after no success in the user mailing list :)
>
>
>
> I’m trying to use the "extension" feature of the Spark Connect
> CommandPlugin (Spark 3.4.1) [1].
>
>
>
> I created a simple protobuf message `MyMessage` that I want to send from
> the connect client-side to the connect server (where I registered my
> plugin).
>
>
>
> The source API for SparkSession class in `spark-connect-client-jvm`
> provides a method `execute` that accepts a `com.google.protobuf.Any` [2],
> so I packed the MyMessage object in an Any:
>
>
>
> val spark = SparkSession.builder().remote("sc://localhost").build()
>
>
>
>   val cmd = com.test.MyMessage.newBuilder().setBlubb("hello world
> ").build()
>
>   val googleAny = com.google.protobuf.Any.pack(cmd)
>
>
>
>   spark.execute(googleAny)
>
>
>
>
>
> This compiles, but during execution I receive a NoSuchMethodError:
>
> java.lang.NoSuchMethodError: 'void
> org.apache.spark.sql.SparkSession.execute(com.google.protobuf.Any)'
>
>
>
> After looking around for a while and decompiling, I found that
> spark-connect-client-jvm_2.12-3.4.1.jar!SparkSession#execute accepts a `
> org.sparkproject.connect.client.com.google.protobuf.Any` (instead of only
> the com.google.protobuf.Any).
>
>
>
> Am I missing something, how am I supposed to use this? Is there an
> additional build step or should I use a specific plugin? I'm using the
> sbt-protoc [3] plugin in my setup.
>
>
>
> Packing my message object `cmd` into an
> org.sparkproject.connect.client.com.google.protobuf.Any does not compile.
>
>
>
> Thanks,
>
> Stefan
>
>
>
>
>
> [1] https://github.com/apache/spark/pull/39291
>
> [2]
> https://github.com/apache/spark/blob/64c26b7cb9b4c770a3e056404e05f6b6603746ee/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/SparkSession.scala#L444
>
> [3] https://github.com/thesamet/sbt-protoc
>


Re: [VOTE][SPIP] Python Data Source API

2023-07-07 Thread Martin Grund
+1 (non-binding)

On Fri, Jul 7, 2023 at 12:05 AM Denny Lee  wrote:

> +1 (non-binding)
>
> On Fri, Jul 7, 2023 at 00:50 Maciej  wrote:
>
>> +0
>>
>> Best regards,
>> Maciej Szymkiewicz
>>
>> Web: https://zero323.net
>> PGP: A30CEF0C31A501EC
>>
>> On 7/6/23 17:41, Xiao Li wrote:
>>
>> +1
>>
>> Xiao
>>
>> Hyukjin Kwon  wrote on Wed, Jul 5, 2023 at 17:28:
>>
>>> +1.
>>>
>>> See https://youtu.be/yj7XlTB1Jvc?t=604 :-).
>>>
>>> On Thu, 6 Jul 2023 at 09:15, Allison Wang
>>> 
>>>  wrote:
>>>
 Hi all,

 I'd like to start the vote for SPIP: Python Data Source API.

 The high-level summary for the SPIP is that it aims to introduce a
 simple API in Python for Data Sources. The idea is to enable Python
 developers to create data sources without learning Scala or dealing with
 the complexities of the current data source APIs. This would make Spark
 more accessible to the wider Python developer community.

 References:

- SPIP doc

 
- JIRA ticket 
- Discussion thread



 Please vote on the SPIP for the next 72 hours:

 [ ] +1: Accept the proposal as an official SPIP
 [ ] +0
 [ ] -1: I don’t think this is a good idea because __.

 Thanks,
 Allison

>>>


Re: [DISCUSS] SPIP: Python Data Source API

2023-06-24 Thread Martin Grund
Hey,

I would like to express my strong support for Python Data Sources even
though they might not be immediately as powerful as Scala-based data
sources. One element that is easily lost in this discussion is how much
faster the iteration speed is with Python compared to Scala. Due to the
dynamic nature of Python, you can design and build a data source while
running in a notebook and continuously change the code until it works as
you want. This behavior is unparalleled!

There exists a litany of Python libraries connecting to all kinds of
different endpoints that could provide data that is usable with Spark. I
personally can imagine implementing a data source on top of the AWS SDK to
extract EC2 instance information. Now I don't have to switch tools and can
keep my pipeline consistent.

Let's say you want to query an API in parallel from Spark using Python:
today's way would be to create a Python RDD, implement the planning and
execution process manually, and finally call `toDF` at the end. While the
actual code of the DS and the RDD-based implementation would be very
similar, the abstraction that is provided by the DS is much more powerful
and future-proof. Dynamic partition elimination and filter push-down can
all be implemented at a later point in time.
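
To make that concrete, here is a minimal, hypothetical sketch of the
RDD-plus-`toDF` pattern described above (the fetch function and its data are
made up for illustration):

from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

def fetch_page(page):
    # Stand-in for the real API call; returns the records for one unit of work.
    return [Row(page=page, value=page * page)]

# Manual "planning": one partition per page, then back into the DataFrame world.
pages = spark.sparkContext.parallelize(range(10), numSlices=10)
df = pages.flatMap(fetch_page).toDF()
df.show()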

Comparing a DS to using batch calling from a UDF is not great because the
execution pattern would be very brittle. Imagine something like
`spark.range(10).withColumn("data",
fetch_api).explode(col("data")).collect()`. Here you're encoding
partitioning logic and data transformation in simple ways, but you can't
reason about the structural integrity of the query and tiny changes in the
UDF interface might already cause a lot of downstream issues.
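
For comparison, a runnable version of the UDF-based pattern sketched above
(`fetch_api` and its output are hypothetical stand-ins):

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import ArrayType, StringType

spark = SparkSession.builder.getOrCreate()

@F.udf(returnType=ArrayType(StringType()))
def fetch_api(i):
    # Stand-in for the real API call; returns a small batch of records per id.
    return [f"record-{i}-{k}" for k in range(3)]

rows = (spark.range(10)
             .withColumn("data", fetch_api("id"))
             .select(F.explode("data").alias("data"))
             .collect())

The partitioning and the shape of the result are now baked into the query
itself, which is exactly the brittleness described above.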


Martin


On Sat, Jun 24, 2023 at 1:44 AM Maciej  wrote:

> With such limited scope (both language availability and features) do we
> have any representative examples of sources that could significantly
> benefit from providing this API, compared to other available options, such as
> batch imports, direct queries from vectorized UDFs or even interfacing
> sources through 3rd party FDWs?
>
> Best regards,
> Maciej Szymkiewicz
>
> Web: https://zero323.net
> PGP: A30CEF0C31A501EC
>
> On 6/20/23 16:23, Wenchen Fan wrote:
>
> In an ideal world, every data source you want to connect to already has a
> Spark data source implementation (either v1 or v2), then this Python API is
> useless. But I feel it's common that people want to do quick data
> exploration, and the target data system is not popular enough to have an
> existing Spark data source implementation. It will be useful if people can
> quickly implement a Spark data source using their favorite Python language.
>
> I'm +1 to this proposal, assuming that we will keep it simple and won't
> copy all the complicated features we built in DS v2 to this new Python API.
>
> On Tue, Jun 20, 2023 at 2:11 PM Maciej  wrote:
>
>> Similarly to Jacek, I feel it fails to document an actual community need
>> for such a feature.
>>
>> Currently, any data source implementation has the potential to benefit
>> Spark users across all supported and third-party clients.  For generally
>> available sources, this is advantageous for the whole Spark community and
>> avoids creating 1st and 2nd-tier citizens. This is even more important with
>> new officially supported languages being added through connect.
>>
>> Instead, we might rather document in detail the process of implementing a
>> new source using current APIs and work towards easily extensible or
>> customizable sources, in case there is such a need.
>>
>> --
>> Best regards,
>> Maciej Szymkiewicz
>>
>> Web: https://zero323.net
>> PGP: A30CEF0C31A501EC
>>
>>
>> On 6/20/23 05:19, Hyukjin Kwon wrote:
>>
>> Actually I support this idea in a way that Python developers don't have
>> to learn Scala to write their own source (and separate packaging).
>> This is more crucial especially when you want to write a simple data
>> source that interacts with the Python ecosystem.
>>
>> On Tue, 20 Jun 2023 at 03:08, Denny Lee  wrote:
>>
>>> Slightly biased, but per my conversations - this would be awesome to
>>> have!
>>>
>>> On Mon, Jun 19, 2023 at 09:43 Abdeali Kothari 
>>> wrote:
>>>
 I would definitely use it - if it's available :)

 On Mon, 19 Jun 2023, 21:56 Jacek Laskowski,  wrote:

> Hi Allison and devs,
>
> Although I was against this idea at first sight (probably because I'm
> a Scala dev), I think it could work as long as there are people who'd be
> interested in such an API. Were there any? I'm just curious. I've seen no
> emails requesting it.
>
> I also doubt that Python devs would like to work on new data sources
> but support their wishes wholeheartedly :)
>
> Pozdrawiam,
> Jacek Laskowski
> 
> "The Internals Of" Online Books 
> Follow me on https://twitter.com/jaceklaskowski
>
> 

Re: [CONNECT] New Clients for Go and Rust

2023-06-01 Thread Martin Grund
These are all valid points and it makes total sense to continue to consider
them. However, reading the mail I'm wondering if we're discussing the same
problems.

Deprecation of APIs aside, the main benefit of Spark Connect is that the
contract is explicitly not a Jar file full of transitive dependencies (and
discoverable internal APIs) but rather the contract established via the
proto messages and RPCs. If you compare this, for example, to the R
integration, there is no need to embed some Go pieces with the JVM to make
it work. There is no custom RMI protocol specific to the client language,
but simply the same contract as, for example, PySpark uses. The physical
contract is the protobuf and the logical contract is the DataFrame API.

This means that Spark Connect clients don't suffer a large part of the
challenges that other tools built on top of Spark have, as there is no tight
coupling between the driver JVM and the client.

I'm happy to help establish clear guidance for contrib-style modules that
operate with a different set of expectations but are developed by the Spark
community under its guidelines.

Martin


On Thu 1. Jun 2023 at 12:41 Maciej  wrote:

> Hi Martin,
>
>
> On 5/30/23 11:50, Martin Grund wrote:
> > I think it makes sense to split this discussion into two pieces. On
> > the contribution side, my personal perspective is that these new clients
> > are explicitly marked as experimental and unsupported until we deem them
> > mature enough to be supported using the standard release process etc.
> > However, the goal should be that the main contributors of these clients
> > are aiming to follow the same release and maintenance schedule. I think
> > we should encourage the community to contribute to the Spark Connect
> > clients and as such we should explicitly not make it as hard as possible
> > to get started (and for that reason reserve the right to abandon).
>
> I know it sounds like nitpicking, but we still have components
> deprecated in 1.2 or 1.3, not to mention subprojects that haven't been
> developed for years. So, there is a huge gap between reserving a right and
> actually exercising it when needed. If such a right is to be used
> differently for Spark Connect bindings, it's something that should be
> communicated upfront.
>
> > How exactly the release schedule is going to look is going to require
> > probably some experimentation because it's a new area for Spark and
> > its ecosystem. I don't think it requires us to have all answers upfront.
>
> Nonetheless, we should work towards establishing consensus around these
> issues and documenting the answers. They affect not only the maintainers
> (see for example a recent discussion about switching to a more predictable
> release schedule) but also the users, for whom multiple APIs (including
> their development status) have been a common source of confusion in the
> past.
>
> >> Also, an elephant in the room is the future of the current API in
> >> Spark 4 and onwards. As useful as connect is, it is not exactly a
> >> replacement for many existing deployments. Furthermore, it doesn't make
> >> extending Spark much easier and the current ecosystem is, subjectively
> >> speaking, a bit brittle.
> >
> > The goal of Spark Connect is not to replace the way users are
> > currently deploying Spark, it's not meant to be that. Users should
> > continue deploying Spark in exactly the way they prefer. Spark
> > Connect allows bringing more interactivity and connectivity to
> > Spark. While Spark Connect extends Spark, most new language consumers
> > will not try to extend Spark, but simply provide the existing surface to
> > their native language. So the goal is not so much extensibility but
> > more availability. For example, I believe it would be awesome if the Livy
> > community would find a way to integrate with Spark Connect to provide the
> > routing capabilities to provide a stable DNS endpoint for all different
> > Spark deployments.
>
> >> [...] the current ecosystem is, subjectively speaking, a bit
> >> brittle.
> >
> > Can you help me understand that a bit better? Do you mean the Spark
> > ecosystem or the Spark Connect ecosystem?
>
> I mean Spark in general. While most of the core and some closely related
> projects are well maintained, tools built on top of Spark, even ones
> supported by major stakeholders, are often short-lived and left
> unmaintained, if not officially abandoned.
>
> New languages aside, without a single extension point (which, for core
> Spark, is the JVM interface), maintaining public projects on top of Spark
> becomes even less attractive. That, assuming we don't completely reject the
> idea of extending Spark functionality 

Re: [CONNECT] New Clients for Go and Rust

2023-06-01 Thread Martin Grund
Hi Bo,

I think the PR is fine from a code perspective as a starting point. I've
prepared the Go repository with all the things necessary so that it reduces
friction for you: the protos are automatically generated, pre-commit checks
are set up, etc. All you need to do is drop in your code :)

Once we have the first version working we can iterate and identify the next
steps.

Thanks
Martin


On Thu, Jun 1, 2023 at 2:50 AM bo yang  wrote:

> Just saw the discussions here! Really appreciate Martin and other folks
> helping on my previous Golang Spark Connect PR (
> https://github.com/apache/spark/pull/41036)!
>
> Great to see we have a new repo for Spark Golang Connect client. Thanks 
> Hyukjin!
> I am thinking of migrating my PR to this new repo. Would love to hear any
> feedback or suggestions before I make the new PR :)
>
> Thanks,
> Bo
>
>
>
> On Tue, May 30, 2023 at 3:38 AM Martin Grund 
> wrote:
>
>> Hi folks,
>>
>> Thanks a lot for the help from Hyukjin! We've created the
>> https://github.com/apache/spark-connect-go as the first contrib
>> repository for Spark Connect under the Apache Spark project. We will move
>> the development of the Golang client to this repository and make it very
>> clear from the README file that this is an experimental client.
>>
>> Looking forward to all your contributions!
>>
>> On Tue, May 30, 2023 at 11:50 AM Martin Grund 
>> wrote:
>>
>>> I think it makes sense to split this discussion into two pieces. On the
>>> contribution side, my personal perspective is that these new clients are
>>> explicitly marked as experimental and unsupported until we deem them mature
>>> enough to be supported using the standard release process etc. However, the
>>> goal should be that the main contributors of these clients are aiming to
>>> follow the same release and maintenance schedule. I think we should
>>> encourage the community to contribute to the Spark Connect clients and as
>>> such we should explicitly not make it as hard as possible to get started
>>> (and for that reason reserve the right to abandon).
>>>
>>> How exactly the release schedule is going to look is going to require
>>> probably some experimentation because it's a new area for Spark and its
>>> ecosystem. I don't think it requires us to have all answers upfront.
>>>
>>> > Also, an elephant in the room is the future of the current API in
>>> Spark 4 and onwards. As useful as connect is, it is not exactly a
>>> replacement for many existing deployments. Furthermore, it doesn't make
>>> extending Spark much easier and the current ecosystem is, subjectively
>>> speaking, a bit brittle.
>>>
>>> The goal of Spark Connect is not to replace the way users are currently
>>> deploying Spark, it's not meant to be that. Users should continue deploying
>>> Spark in exactly the way they prefer. Spark Connect allows bringing more
>>> interactivity and connectivity to Spark. While Spark Connect extends Spark,
>>> most new language consumers will not try to extend Spark, but simply
>>> provide the existing surface to their native language. So the goal is not
>>> so much extensibility but more availability. For example, I believe it
>>> would be awesome if the Livy community would find a way to integrate with
>>> Spark Connect to provide the routing capabilities to provide a stable DNS
>>> endpoint for all different Spark deployments.
>>>
>>> > [...] the current ecosystem is, subjectively speaking, a bit brittle.
>>>
>>> Can you help me understand that a bit better? Do you mean the Spark
>>> ecosystem or the Spark Connect ecosystem?
>>>
>>>
>>>
>>> Martin
>>>
>>>
>>> On Fri, May 26, 2023 at 5:39 PM Maciej  wrote:
>>>
>>>> It might be a good idea to have a discussion about how new connect
>>>> clients fit into the overall process we have. In particular:
>>>>
>>>>
>>>>- Under what conditions do we consider adding a new language to the
>>>>official channels?  What process do we follow?
>>>>- What guarantees do we offer in respect to these clients? Is
>>>>adding a new client the same type of commitment as for the core API? In
>>>>other words, do we commit to maintaining such clients "forever" or do we
>>>>separate the "official" and "contrib" clients, with the later being
>>>>governed by the ASF, but not guaranteed to be maintained in the future?

Re: [CONNECT] New Clients for Go and Rust

2023-05-30 Thread Martin Grund
Hi folks,

Thanks a lot for the help from Hyukjin! We've created the
https://github.com/apache/spark-connect-go as the first contrib repository
for Spark Connect under the Apache Spark project. We will move the
development of the Golang client to this repository and make it very clear
from the README file that this is an experimental client.

Looking forward to all your contributions!

On Tue, May 30, 2023 at 11:50 AM Martin Grund  wrote:

> I think it makes sense to split this discussion into two pieces. On the
> contribution side, my personal perspective is that these new clients are
> explicitly marked as experimental and unsupported until we deem them mature
> enough to be supported using the standard release process etc. However, the
> goal should be that the main contributors of these clients are aiming to
> follow the same release and maintenance schedule. I think we should
> encourage the community to contribute to the Spark Connect clients and as
> such we should explicitly not make it as hard as possible to get started
> (and for that reason reserve the right to abandon).
>
> How exactly the release schedule is going to look is going to require
> probably some experimentation because it's a new area for Spark and its
> ecosystem. I don't think it requires us to have all answers upfront.
>
> > Also, an elephant in the room is the future of the current API in Spark
> 4 and onwards. As useful as connect is, it is not exactly a replacement for
> many existing deployments. Furthermore, it doesn't make extending Spark
> much easier and the current ecosystem is, subjectively speaking, a bit
> brittle.
>
> The goal of Spark Connect is not to replace the way users are currently
> deploying Spark, it's not meant to be that. Users should continue deploying
> Spark in exactly the way they prefer. Spark Connect allows bringing more
> interactivity and connectivity to Spark. While Spark Connect extends Spark,
> most new language consumers will not try to extend Spark, but simply
> provide the existing surface to their native language. So the goal is not
> so much extensibility but more availability. For example, I believe it
> would be awesome if the Livy community would find a way to integrate with
> Spark Connect to provide the routing capabilities to provide a stable DNS
> endpoint for all different Spark deployments.
>
> > [...] the current ecosystem is, subjectively speaking, a bit brittle.
>
> Can you help me understand that a bit better? Do you mean the Spark
> ecosystem or the Spark Connect ecosystem?
>
>
>
> Martin
>
>
> On Fri, May 26, 2023 at 5:39 PM Maciej  wrote:
>
>> It might be a good idea to have a discussion about how new connect
>> clients fit into the overall process we have. In particular:
>>
>>
>>- Under what conditions do we consider adding a new language to the
>>official channels?  What process do we follow?
>>- What guarantees do we offer in respect to these clients? Is adding
>>a new client the same type of commitment as for the core API? In other
>>words, do we commit to maintaining such clients "forever" or do we 
>> separate
>>the "official" and "contrib" clients, with the later being governed by the
>>ASF, but not guaranteed to be maintained in the future?
>>- Do we follow the same release schedule as for the core project, or
>>rather release each client separately, after the main release is 
>> completed?
>>
>> Also, an elephant in the room is the future of the current API in Spark 4
>> and onwards. As useful as connect is, it is not exactly a replacement for
>> many existing deployments. Furthermore, it doesn't make extending Spark
>> much easier and the current ecosystem is, subjectively speaking, a bit
>> brittle.
>>
>> --
>> Best regards,
>> Maciej
>>
>>
>> On 5/26/23 07:26, Martin Grund wrote:
>>
>> Thanks everyone for your feedback! I will work on figuring out what it
>> takes to get started with a repo for the go client.
>>
>> On Thu 25. May 2023 at 21:51 Chao Sun  wrote:
>>
>>> +1 on separate repo too
>>>
>>> On Thu, May 25, 2023 at 12:43 PM Dongjoon Hyun 
>>> wrote:
>>> >
>>> > +1 for starting on a separate repo.
>>> >
>>> > Dongjoon.
>>> >
>>> > On Thu, May 25, 2023 at 9:53 AM yangjie01  wrote:
>>> >>
>>> >> +1 on start this with a separate repo.
>>> >>
>>> >> Which new clients can be placed in the main repo should be discussed
>>> after they are mature enough,
>>> >>

Re: [CONNECT] New Clients for Go and Rust

2023-05-30 Thread Martin Grund
I think it makes sense to split this discussion into two pieces. On the
contribution side, my personal perspective is that these new clients are
explicitly marked as experimental and unsupported until we deem them mature
enough to be supported using the standard release process etc. However, the
goal should be that the main contributors of these clients are aiming to
follow the same release and maintenance schedule. I think we should
encourage the community to contribute to the Spark Connect clients and as
such we should explicitly not make it as hard as possible to get started
(and for that reason reserve the right to abandon).

How exactly the release schedule is going to look is going to require
probably some experimentation because it's a new area for Spark and its
ecosystem. I don't think it requires us to have all answers upfront.

> Also, an elephant in the room is the future of the current API in Spark 4
and onwards. As useful as connect is, it is not exactly a replacement for
many existing deployments. Furthermore, it doesn't make extending Spark
much easier and the current ecosystem is, subjectively speaking, a bit
brittle.

The goal of Spark Connect is not to replace the way users are currently
deploying Spark, it's not meant to be that. Users should continue deploying
Spark in exactly the way they prefer. Spark Connect allows bringing more
interactivity and connectivity to Spark. While Spark Connect extends Spark,
most new language consumers will not try to extend Spark, but simply
provide the existing surface to their native language. So the goal is not
so much extensibility but more availability. For example, I believe it
would be awesome if the Livy community would find a way to integrate with
Spark Connect to provide the routing capabilities to provide a stable DNS
endpoint for all different Spark deployments.

> [...] the current ecosystem is, subjectively speaking, a bit brittle.

Can you help me understand that a bit better? Do you mean the Spark
ecosystem or the Spark Connect ecosystem?



Martin


On Fri, May 26, 2023 at 5:39 PM Maciej  wrote:

> It might be a good idea to have a discussion about how new connect clients
> fit into the overall process we have. In particular:
>
>
>- Under what conditions do we consider adding a new language to the
>official channels?  What process do we follow?
>- What guarantees do we offer in respect to these clients? Is adding a
>new client the same type of commitment as for the core API? In other words,
>do we commit to maintaining such clients "forever" or do we separate the
>"official" and "contrib" clients, with the later being governed by the ASF,
>but not guaranteed to be maintained in the future?
>- Do we follow the same release schedule as for the core project, or
>rather release each client separately, after the main release is completed?
>
> Also, an elephant in the room is the future of the current API in Spark 4
> and onwards. As useful as connect is, it is not exactly a replacement for
> many existing deployments. Furthermore, it doesn't make extending Spark
> much easier and the current ecosystem is, subjectively speaking, a bit
> brittle.
>
> --
> Best regards,
> Maciej
>
>
> On 5/26/23 07:26, Martin Grund wrote:
>
> Thanks everyone for your feedback! I will work on figuring out what it
> takes to get started with a repo for the go client.
>
> On Thu 25. May 2023 at 21:51 Chao Sun  wrote:
>
>> +1 on separate repo too
>>
>> On Thu, May 25, 2023 at 12:43 PM Dongjoon Hyun 
>> wrote:
>> >
>> > +1 for starting on a separate repo.
>> >
>> > Dongjoon.
>> >
>> > On Thu, May 25, 2023 at 9:53 AM yangjie01  wrote:
>> >>
>> >> +1 on start this with a separate repo.
>> >>
>> >> Which new clients can be placed in the main repo should be discussed
>> after they are mature enough,
>> >>
>> >>
>> >>
>> >> Yang Jie
>> >>
>> >>
>> >>
>> >> From: Denny Lee 
>> >> Date: Wednesday, May 24, 2023, 21:31
>> >> To: Hyukjin Kwon 
>> >> Cc: Maciej , "dev@spark.apache.org" <
>> dev@spark.apache.org>
>> >> Subject: Re: [CONNECT] New Clients for Go and Rust
>> >>
>> >>
>> >>
>> >> +1 on separate repo allowing different APIs to run at different speeds
>> and ensuring they get community support.
>> >>
>> >>
>> >>
>> >> On Wed, May 24, 2023 at 00:37 Hyukjin Kwon 
>> wrote:
>> >>
>> >> I think we can just start this with a separate repo.
>> >> I am fine with the second option too but in 

Re: [CONNECT] New Clients for Go and Rust

2023-05-25 Thread Martin Grund
Thanks everyone for your feedback! I will work on figuring out what it
takes to get started with a repo for the go client.

On Thu 25. May 2023 at 21:51 Chao Sun  wrote:

> +1 on separate repo too
>
> On Thu, May 25, 2023 at 12:43 PM Dongjoon Hyun 
> wrote:
> >
> > +1 for starting on a separate repo.
> >
> > Dongjoon.
> >
> > On Thu, May 25, 2023 at 9:53 AM yangjie01  wrote:
> >>
> >> +1 on start this with a separate repo.
> >>
> >> Which new clients can be placed in the main repo should be discussed
> after they are mature enough,
> >>
> >>
> >>
> >> Yang Jie
> >>
> >>
> >>
> >> From: Denny Lee 
> >> Date: Wednesday, May 24, 2023, 21:31
> >> To: Hyukjin Kwon 
> >> Cc: Maciej , "dev@spark.apache.org" <
> dev@spark.apache.org>
> >> Subject: Re: [CONNECT] New Clients for Go and Rust
> >>
> >>
> >>
> >> +1 on separate repo allowing different APIs to run at different speeds
> and ensuring they get community support.
> >>
> >>
> >>
> >> On Wed, May 24, 2023 at 00:37 Hyukjin Kwon 
> wrote:
> >>
> >> I think we can just start this with a separate repo.
> >> I am fine with the second option too but in this case we would have to
> triage which language to add into the main repo.
> >>
> >>
> >>
> >> On Fri, 19 May 2023 at 22:28, Maciej  wrote:
> >>
> >> Hi,
> >>
> >>
> >>
> >> Personally, I'm strongly against the second option and have some
> preference towards the third one (or maybe a mix of the first one and the
> third one).
> >>
> >>
> >>
> >> The project is already pretty large as-is and, with an extremely
> conservative approach towards removal of APIs, it only tends to grow over
> time. Making it even larger is not going to make things more maintainable
> and is likely to create an entry barrier for new contributors (that's
> similar to Jia's arguments).
> >>
> >>
> >>
> >> Moreover, we've seen quite a few different language clients over the
> years and all but one or two survived while none is particularly active, as
> far as I'm aware.  Taking responsibility for more clients, without being
> sure that we have resources to maintain them and there is enough community
> around them to make such effort worthwhile, doesn't seem like a good idea.
> >>
> >>
> >>
> >> --
> >>
> >> Best regards,
> >>
> >> Maciej Szymkiewicz
> >>
> >>
> >>
> >> Web: https://zero323.net
> >>
> >> PGP: A30CEF0C31A501EC
> >>
> >>
> >>
> >>
> >>
> >> On 5/19/23 14:57, Jia Fan wrote:
> >>
> >> Hi,
> >>
> >>
> >>
> >> Thanks for contribution!
> >>
> >> I prefer (1). There are some reason:
> >>
> >>
> >>
> >> 1. Different repositories can maintain independent versions, different
> >> release times, and faster bug-fix releases.
> >>
> >>
> >>
> >> 2. Different languages have different build tools. Putting them in one
> repository will make the main repository more and more complicated, and it
> will become extremely difficult to perform a complete build in the main
> repository.
> >>
> >>
> >>
> >> 3. Different repositories will make CI configuration and execution easier,
> >> and the PR and commit lists will be clearer.
> >>
> >>
> >>
> >> 4. Other projects also govern different clients in separate repositories,
> >> like ClickHouse, which uses different repositories for JDBC, ODBC, and C++. Please refer to:
> >>
> >> https://github.com/ClickHouse/clickhouse-java
> >>
> >> https://github.com/ClickHouse/clickhouse-odbc
> >>
> >> https://github.com/ClickHouse/clickhouse-cpp
> >>
> >>
> >>
> >> PS: I'm looking forward to the javascript connect client!
> >>
> >>
> >>
> >> Thanks Regards
> >>
> >> Jia Fan
> >>
> >>
> >>
> >> Martin Grund  wrote on Fri, May 19, 2023 at 20:03:
> >>
> >> Hi folks,
> >>
> >>
> >>
> >> When Bo (thanks for the time and contribution) started the work on
> https://github.com/apache/spark/pull/41036 he started the Go client
> directly in the Spark repository. In the meantime, I was a

[CONNECT] New Clients for Go and Rust

2023-05-19 Thread Martin Grund
Hi folks,

When Bo (thanks for the time and contribution) started the work on
https://github.com/apache/spark/pull/41036 he started the Go client
directly in the Spark repository. In the meantime, I was approached by
other engineers who are willing to contribute to working on a Rust client
for Spark Connect.

Now one of the key questions is where these connectors should live and how
we manage expectations most effectively.

At a high level, there are three approaches:

(1) "3rd party" (non-JVM / Python) clients should live in separate
repositories owned and governed by the Apache Spark community.

(2) All clients should live in the main Apache Spark repository in the
`connector/connect/client` directory.

(3) Spark Connect clients other than the native ones (Python, JVM) should not
be part of the Apache Spark repository and governance rules.

Before we iron out exactly how we mark these clients as experimental and
how we align their release process etc. with Spark, my suggestion would be
to get consensus on this first question.

Personally, I'm fine with (1) and (2) with a preference for (2).

Would love to get feedback from other members of the community!

Thanks
Martin


Re: Enforcing scalafmt on Spark Connect - connector/connect

2022-10-14 Thread Martin Grund
I have prepared the following pull request that enforces scalafmt on the
Spark Connect module. Please feel free to have a look and leave comments.
If we reach consensus on the decision, I will also push a PR for the
previously mentioned website to add a small note on using `dev/lint-scala`
for local style checks.

Thanks
Martin

On Fri, Oct 14, 2022 at 11:09 AM Yikun Jiang  wrote:

> +1, I also think it's a good idea.
>
> BTW, we might also consider adding some notes about `lint-scala` in [1],
> just like `lint-python` in pyspark [2].
>
> [1] https://spark.apache.org/developer-tools.html
> [2]
> https://spark.apache.org/docs/latest/api/python/development/contributing.html
>
>
> Regards,
> Yikun
>
>
> On Fri, Oct 14, 2022 at 4:51 PM Hyukjin Kwon  wrote:
>
>> I personally like this idea. At least we now do this in PySpark, and it's
>> pretty nice that you can just forget about formatting it manually by
>> yourself.
>>
>> On Fri, 14 Oct 2022 at 16:37, Martin Grund
>>  wrote:
>>
>>> Hi folks,
>>>
>>> I'm reaching out to gather input / consensus on the following
>>> proposal: Since Spark Connect is effectively new code, I would like to
>>> enforce scalafmt explicitly *only* on this module by adding a check in
>>> `dev/lint-scala` that checks if there is a diff after running
>>>
>>>  ./build/mvn -Pscala-2.12 scalafmt:format -Dscalafmt.skip=false -pl
>>> connector/connect
>>>
>>> I know that enforcing scalafmt is not desirable on the existing code
>>> base but since the Spark Connect code is very new I'm thinking it might
>>> reduce friction in the code reviews and create a consistent style.
>>>
>>> In my previous code reviews where I have applied scalafmt, I've received
>>> feedback that the import grouping scalafmt produces differs from our
>>> default style. I've prepared a PR
>>> https://github.com/apache/spark/pull/38252 to address this issue by
>>> explicitly setting it in the scalafmt options.
>>>
>>> Would you be supportive of enforcing scalafmt *only* on the Spark
>>> Connect module?
>>>
>>> Thanks
>>> Martin
>>>
>>


Enforcing scalafmt on Spark Connect - connector/connect

2022-10-14 Thread Martin Grund
Hi folks,

I'm reaching out to gather input / consensus on the following
proposal: Since Spark Connect is effectively new code, I would like to
enforce scalafmt explicitly *only* on this module by adding a check in
`dev/lint-scala` that checks if there is a diff after running

 ./build/mvn -Pscala-2.12 scalafmt:format -Dscalafmt.skip=false -pl
connector/connect

I know that enforcing scalafmt is not desirable on the existing code base
but since the Spark Connect code is very new I'm thinking it might reduce
friction in the code reviews and create a consistent style.

In my previous code reviews where I have applied scalafmt, I've received
feedback that the import grouping scalafmt produces differs from our
default style. I've prepared a PR
https://github.com/apache/spark/pull/38252 to address this issue by
explicitly setting it in the scalafmt options.

Would you be supportive of enforcing scalafmt *only* on the Spark Connect
module?

Thanks
Martin


Re: [VOTE][RESULT] SPIP: Spark Connect

2022-06-16 Thread Martin Grund
Thanks everyone for your votes and thanks Herman for being the shepherd.

On Fri 17. Jun 2022 at 02:23 Hyukjin Kwon  wrote:

> Awesome, I am excited to see this in Apache Spark.
>
> On Fri, 17 Jun 2022 at 08:37, Herman van Hovell
>  wrote:
>
>> The vote passes with 17 +1s (10 binding +1s).
>> +1:
>> Herman van Hovell*
>> Matei Zaharia*
>> Yuming Wang
>> Hyukjin Kwon*
>> Chao Sun
>> L.C. Hsieh*
>> Huaxin Gao
>> Ruifeng Zheng
>> Wenchen Fan*
>> Believer
>> Xiao Li*
>> Reynold Xin*
>> Dongjoon Hyun*
>> Gengliang Wang
>> Yikun Jiang
>> Tom Graves *
>> Holden Karau *
>>
>> 0: None
>> (Tom has voiced some architectural concerns)
>>
>> -1: None
>>
>> (* = binding)
>>
>> The next step is that we are going to create a high level design doc,
>> which will give clarity on the design and should (hopefully) take away any
>> remaining concerns.
>>
>> Thank you all for chiming in and your votes!
>>
>> Cheers,
>> Herman
>>
>


Re: [DISCUSS] SPIP: Spark Connect - A client and server interface for Apache Spark.

2022-06-07 Thread Martin Grund
On Tue, Jun 7, 2022 at 3:54 PM Steve Loughran 
wrote:

>
>
> On Fri, 3 Jun 2022 at 18:46, Martin Grund
>  wrote:
>
>> Hi Everyone,
>>
>> We would like to start a discussion on the "Spark Connect" proposal.
>> Please find the links below:
>>
>> *JIRA* - https://issues.apache.org/jira/browse/SPARK-39375
>> *SPIP Document* -
>> https://docs.google.com/document/d/1Mnl6jmGszixLW4KcJU5j9IgpG9-UabS0dcM6PM2XGDc/edit#heading=h.wmsrrfealhrj
>>
>> *Excerpt from the document: *
>>
>> We propose to extend Apache Spark by building on the DataFrame API and
>> the underlying unresolved logical plans. The DataFrame API is widely used
>> and makes it very easy to iteratively express complex logic. We will
>> introduce Spark Connect, a remote option of the DataFrame API that
>> separates the client from the Spark server. With Spark Connect, Spark will
>> become decoupled, allowing for built-in remote connectivity: The decoupled
>> client SDK can be used to run interactive data exploration and connect to
>> the server for DataFrame operations.
>>
>> Spark Connect will benefit Spark developers in different ways: The
>> decoupled architecture will result in improved stability, as clients are
>> separated from the driver. From the Spark Connect client perspective, Spark
>> will be (almost) versionless, and thus enable seamless upgradability, as
>> server APIs can evolve without affecting the client API. The decoupled
>> client-server architecture can be leveraged to build close integrations
>> with local developer tooling. Finally, separating the client process from
>> the Spark server process will improve Spark’s overall security posture by
>> avoiding the tight coupling of the client inside the Spark runtime
>> environment.
>>
>
> one key finding in distributed systems, going back to Nelson's first RPC
> work in 1981, is that "seamless upgradability" is usually an unrealised
> vision, especially if things like serialized java/spark objects are part
> of the payload.
>
> if it is a goal, then the tests to validate the versioning would have to
> be a key deliverable. Examples: test modules using old versions.
>
> This is particularly a risk with a design which proposes serialising
> logical plans; it may be hard to change planning in future.
>
> Will the protocol include something similar to the DXL plan language
> implemented in Greenplum's Orca query optimizer? That's an
> under-appreciated piece of work. If the goal of the protocol is to be
> long-lived, it is a design worth considering, not just for its portability but
> because it lets people work on query optimisation as a service.
>
>
In the prototype I've built, I'm not actually using the fully specified
logical plans that Spark uses for query execution before optimization, but
rather something that is closer to the parse plans of a SQL query. The
parse plans follow relational algebra more closely and are much less
likely to change than the actual underlying logical plan operators. The
goal is not to build an endpoint that receives optimized plans and
executes them directly.

For example, all attributes in the plans are referenced as unresolved
attributes, and the same is true for functions. This delegates the
responsibility for name resolution, etc., to the existing implementation,
which we're not going to touch, instead of trying to replicate it. It is
still possible to provide early feedback to the user because one can
always analyze the specific sub-plan.
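
As a rough, hypothetical sketch of the shape of such plans -- these are
made-up case classes for illustration, not the actual protocol messages or
Catalyst classes -- the client would ship relational-algebra-style operators
whose attribute and function references stay unresolved until the server
analyzes them:

// Hypothetical sketch only: illustrates unresolved references in a
// parse-level plan; not the real Spark Connect wire format.
object UnresolvedPlanSketch extends App {
  sealed trait Expression
  case class UnresolvedAttribute(name: String) extends Expression    // resolved on the server
  case class UnresolvedFunction(name: String, args: Seq[Expression]) extends Expression
  case class Literal(value: Any) extends Expression

  sealed trait Relation
  case class Read(table: String) extends Relation
  case class Filter(input: Relation, condition: Expression) extends Relation
  case class Project(input: Relation, expressions: Seq[Expression]) extends Relation

  // Roughly what df.filter(col("age") > 21).select(upper(col("name")))
  // could be encoded as, leaving all name and function resolution to the server:
  val plan: Relation = Project(
    Filter(
      Read("people"),
      UnresolvedFunction(">", Seq(UnresolvedAttribute("age"), Literal(21)))),
    Seq(UnresolvedFunction("upper", Seq(UnresolvedAttribute("name")))))

  println(plan)
}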

Please let me know what you think.


>
> [1]. Orca: A Modular Query Optimizer Architecture for Big Data
>
>  
> https://15721.courses.cs.cmu.edu/spring2017/papers/15-optimizer2/p337-soliman.pdf
>
>
>> Spark Connect will strengthen Spark’s position as the modern unified
>> engine for large-scale data analytics and expand applicability to use cases
>> and developers we could not reach with the current setup: Spark will become
>> ubiquitously usable as the DataFrame API can be used with (almost) any
>> programming language.
>>
> That's a marketing comment, not a technical one. best left out of ASF
> docs.
>


Re: [DISCUSS] SPIP: Spark Connect - A client and server interface for Apache Spark.

2022-06-06 Thread Martin Grund
Hi Mich,

I think I must not have been clear enough in the document. The proposal is
not for connecting Spark to other engines, but for connecting to Spark from
other clients remotely (without using SQL).

Please let me know if that clarifies things or if I can provide additional
context.

Thanks
Martin

On Sun 5. Jun 2022 at 16:38 Mich Talebzadeh 
wrote:

> Hi,
>
> Whilst I concur that there is a need for client-server architecture, that
> technology has been around for over 30 years. Moreover, the current Spark
> has very efficient connections via JDBC to various databases. In some cases
> the API to various databases, for example Google BigQuery, is very
> efficient. I am not sure what this proposal is trying to address?
>
> HTH
>
> On Fri, 3 Jun 2022 at 18:46, Martin Grund
>  wrote:
>
>> Hi Everyone,
>>
>> We would like to start a discussion on the "Spark Connect" proposal.
>> Please find the links below:
>>
>> *JIRA* - https://issues.apache.org/jira/browse/SPARK-39375
>> *SPIP Document* -
>> https://docs.google.com/document/d/1Mnl6jmGszixLW4KcJU5j9IgpG9-UabS0dcM6PM2XGDc/edit#heading=h.wmsrrfealhrj
>>
>> *Excerpt from the document: *
>>
>> We propose to extend Apache Spark by building on the DataFrame API and
>> the underlying unresolved logical plans. The DataFrame API is widely used
>> and makes it very easy to iteratively express complex logic. We will
>> introduce Spark Connect, a remote option of the DataFrame API that
>> separates the client from the Spark server. With Spark Connect, Spark will
>> become decoupled, allowing for built-in remote connectivity: The decoupled
>> client SDK can be used to run interactive data exploration and connect to
>> the server for DataFrame operations.
>>
>> Spark Connect will benefit Spark developers in different ways: The
>> decoupled architecture will result in improved stability, as clients are
>> separated from the driver. From the Spark Connect client perspective, Spark
>> will be (almost) versionless, and thus enable seamless upgradability, as
>> server APIs can evolve without affecting the client API. The decoupled
>> client-server architecture can be leveraged to build close integrations
>> with local developer tooling. Finally, separating the client process from
>> the Spark server process will improve Spark’s overall security posture by
>> avoiding the tight coupling of the client inside the Spark runtime
>> environment.
>>
>> Spark Connect will strengthen Spark’s position as the modern unified
>> engine for large-scale data analytics and expand applicability to use cases
>> and developers we could not reach with the current setup: Spark will become
>> ubiquitously usable as the DataFrame API can be used with (almost) any
>> programming language.
>>
>> We would like to start a discussion on the document and any feedback is
>> welcome!
>>
>> Thanks a lot in advance,
>> Martin
>>
> --
>
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>


Re: [DISCUSS] SPIP: Spark Connect - A client and server interface for Apache Spark.

2022-06-04 Thread Martin Grund
Support for UDFs would work in the same way as it does today. The closures
are serialized on the client and sent via the driver to the workers.

While there is no difference in the execution of the UDF, there can be
potential challenges with the dependencies required for execution. This is
true both for Python and Scala. I would like to avoid bringing dependency
management into this SPIP, and I believe this can be solved in principle by
explicitly adding the JARs for the dependencies so that they are available
on the classpath.
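
To make "serialized on the client" concrete, here is a minimal sketch using
plain JDK serialization in a single JVM -- it is not Spark's actual
closure-shipping machinery (which adds closure cleaning, a pluggable
serializer, and network transport), but it shows why the receiving side
needs the closure's classes on its classpath:

import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

object ClosureSerializationSketch extends App {
  // A user "UDF": an ordinary Scala closure. Scala 2.12+ lambdas are
  // serializable, which is what shipping closures relies on.
  val plusOne: Int => Int = _ + 1

  // "Client" side: turn the closure into bytes.
  val bos = new ByteArrayOutputStream()
  val oos = new ObjectOutputStream(bos)
  oos.writeObject(plusOne)
  oos.close()

  // "Worker" side: revive and invoke. This only works because the classes the
  // closure depends on are present here -- hence the need to add dependency
  // JARs explicitly in the distributed case.
  val ois = new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray))
  val revived = ois.readObject().asInstanceOf[Int => Int]
  println(revived(41)) // prints 42
}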

In its current form, the SPIP does not propose to add new language support
for UDFs, but in theory it becomes possible to do so as long as closures
can be serialized either as code or binary and dynamically loaded on the
other side.

I hope this answers the question.

Thanks
Martin

On Sat 4. Jun 2022 at 05:04 Koert Kuipers  wrote:

> how would scala udfs be supported in this?
>
> On Fri, Jun 3, 2022 at 1:52 PM Martin Grund
>  wrote:
>
>> Hi Everyone,
>>
>> We would like to start a discussion on the "Spark Connect" proposal.
>> Please find the links below:
>>
>> *JIRA* - https://issues.apache.org/jira/browse/SPARK-39375
>> *SPIP Document* -
>> https://docs.google.com/document/d/1Mnl6jmGszixLW4KcJU5j9IgpG9-UabS0dcM6PM2XGDc/edit#heading=h.wmsrrfealhrj
>>
>> *Excerpt from the document: *
>>
>> We propose to extend Apache Spark by building on the DataFrame API and
>> the underlying unresolved logical plans. The DataFrame API is widely used
>> and makes it very easy to iteratively express complex logic. We will
>> introduce Spark Connect, a remote option of the DataFrame API that
>> separates the client from the Spark server. With Spark Connect, Spark will
>> become decoupled, allowing for built-in remote connectivity: The decoupled
>> client SDK can be used to run interactive data exploration and connect to
>> the server for DataFrame operations.
>>
>> Spark Connect will benefit Spark developers in different ways: The
>> decoupled architecture will result in improved stability, as clients are
>> separated from the driver. From the Spark Connect client perspective, Spark
>> will be (almost) versionless, and thus enable seamless upgradability, as
>> server APIs can evolve without affecting the client API. The decoupled
>> client-server architecture can be leveraged to build close integrations
>> with local developer tooling. Finally, separating the client process from
>> the Spark server process will improve Spark’s overall security posture by
>> avoiding the tight coupling of the client inside the Spark runtime
>> environment.
>>
>> Spark Connect will strengthen Spark’s position as the modern unified
>> engine for large-scale data analytics and expand applicability to use cases
>> and developers we could not reach with the current setup: Spark will become
>> ubiquitously usable as the DataFrame API can be used with (almost) any
>> programming language.
>>
>> We would like to start a discussion on the document and any feedback is
>> welcome!
>>
>> Thanks a lot in advance,
>> Martin
>>
>
> CONFIDENTIALITY NOTICE: This electronic communication and any files
> transmitted with it are confidential, privileged and intended solely for
> the use of the individual or entity to whom they are addressed. If you are
> not the intended recipient, you are hereby notified that any disclosure,
> copying, distribution (electronic or otherwise) or forwarding of, or the
> taking of any action in reliance on the contents of this transmission is
> strictly prohibited. Please notify the sender immediately by e-mail if you
> have received this email by mistake and delete this email from your system.
>
> Is it necessary to print this email? If you care about the environment
> like we do, please refrain from printing emails. It helps to keep the
> environment forested and litter-free.


[DISCUSS] SPIP: Spark Connect - A client and server interface for Apache Spark.

2022-06-03 Thread Martin Grund
Hi Everyone,

We would like to start a discussion on the "Spark Connect" proposal. Please
find the links below:

*JIRA* - https://issues.apache.org/jira/browse/SPARK-39375
*SPIP Document* -
https://docs.google.com/document/d/1Mnl6jmGszixLW4KcJU5j9IgpG9-UabS0dcM6PM2XGDc/edit#heading=h.wmsrrfealhrj

*Excerpt from the document: *

We propose to extend Apache Spark by building on the DataFrame API and the
underlying unresolved logical plans. The DataFrame API is widely used and
makes it very easy to iteratively express complex logic. We will introduce
Spark Connect, a remote option of the DataFrame API that separates the
client from the Spark server. With Spark Connect, Spark will become
decoupled, allowing for built-in remote connectivity: The decoupled client
SDK can be used to run interactive data exploration and connect to the
server for DataFrame operations.

Spark Connect will benefit Spark developers in different ways: The
decoupled architecture will result in improved stability, as clients are
separated from the driver. From the Spark Connect client perspective, Spark
will be (almost) versionless, and thus enable seamless upgradability, as
server APIs can evolve without affecting the client API. The decoupled
client-server architecture can be leveraged to build close integrations
with local developer tooling. Finally, separating the client process from
the Spark server process will improve Spark’s overall security posture by
avoiding the tight coupling of the client inside the Spark runtime
environment.

Spark Connect will strengthen Spark’s position as the modern unified engine
for large-scale data analytics and expand applicability to use cases and
developers we could not reach with the current setup: Spark will become
ubiquitously usable as the DataFrame API can be used with (almost) any
programming language.
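
To illustrate the decoupled-client idea from the excerpt above, here is a
purely hypothetical usage sketch (the builder option and endpoint scheme are
made up for illustration, not a committed API of this proposal):

import org.apache.spark.sql.SparkSession

object RemoteClientSketch extends App {
  // The client process runs no driver; it only talks to a remote Spark
  // endpoint that owns the actual session and execution.
  val spark = SparkSession.builder()
    .remote("sc://spark-server.example.com:15002")   // hypothetical remote-endpoint option
    .getOrCreate()

  // DataFrame operations are expressed locally as unresolved plans and
  // executed on the server.
  spark.range(10).filter("id % 2 = 0").show()
}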

We would like to start a discussion on the document and any feedback is
welcome!

Thanks a lot in advance,
Martin