Re: [DISCUSS] Allow GitHub Actions runs for contributors' PRs without approvals in apache/spark-connect-go

2024-07-03 Thread Matthew Powers
Yea, this would be great.

spark-connect-go is still experimental and anything we can do to get it
production grade would be a great step IMO.  The Go community is excited to
write Spark... with Go!

On Wed, Jul 3, 2024 at 8:49 PM Hyukjin Kwon  wrote:

> Hi all,
>
> The Spark Connect Go client repository (
> https://github.com/apache/spark-connect-go) requires approval of GitHub
> Actions runs for individual commits within contributors' PRs.
>
> This policy was intentionally applied (
> https://issues.apache.org/jira/browse/INFRA-24387), but we can change
> this default once we reach a consensus on it.
>
> I would like to allow GitHub Actions runs for contributors by default to
> make development faster. For now, I have been approving individual
> commits in their PRs, and this has become an overhead.
>
> If you have any feedback on this, please let me know.
>


Re: [VOTE] Move Spark Connect server to builtin package (Client API layer stays external)

2024-07-03 Thread Matthew Powers
+1 (non-binding)

Thanks!

On Wed, Jul 3, 2024 at 1:58 PM Xinrong Meng  wrote:

> +1
>
> Thank you @Hyukjin Kwon  !
>
> On Wed, Jul 3, 2024 at 8:55 AM bo yang  wrote:
>
>> +1 (non-binding)
>>
>> On Tue, Jul 2, 2024 at 11:22 PM Cheng Pan  wrote:
>>
>>> +1 (non-binding)
>>>
>>> Thanks,
>>> Cheng Pan
>>>
>>>
>>> On Jul 3, 2024, at 08:59, Hyukjin Kwon  wrote:
>>>
>>> Hi all,
>>>
>>> I’d like to start a vote for moving Spark Connect server to builtin
>>> package (Client API layer stays external).
>>>
>>> Please also refer to:
>>>
>>>- Discussion thread:
>>> https://lists.apache.org/thread/odlx9b552dp8yllhrdlp24pf9m9s4tmx
>>>- JIRA ticket: https://issues.apache.org/jira/browse/SPARK-48763
>>>
>>> Please vote on the SPIP for the next 72 hours:
>>>
>>> [ ] +1: Accept the proposal
>>> [ ] +0
>>> [ ] -1: I don’t think this is a good idea because …
>>>
>>> Thank you!
>>>
>>>
>>>


Re: [DISCUSS] Move Spark Connect server to builtin package (Client API layer stays external)

2024-07-02 Thread Matthew Powers
This is a great idea and would be a great quality of life improvement.

+1 (non-binding)

On Tue, Jul 2, 2024 at 4:56 AM Hyukjin Kwon  wrote:

> > while leaving the connect jvm client in a separate folder looks weird
>
> I plan to actually put it all at the top level together, but I feel like
> this has to be done with a SPIP, so I am moving the internal server side
> first, orthogonally.
>
> On Tue, 2 Jul 2024 at 17:54, Cheng Pan  wrote:
>
>> Thanks for raising this discussion. I think putting the connect folder at
>> the top level is a good idea to promote Spark Connect, but leaving the
>> connect jvm client in a separate folder looks weird. I suppose there is no
>> contract to leave all optional modules under `connector`? e.g.
>> `resource-managers/kubernetes/{docker,integration-tests}`, `hadoop-cloud`.
>> What about moving the whole `connect` folder to the top level?
>>
>> Thanks,
>> Cheng Pan
>>
>>
>> On Jul 2, 2024, at 08:19, Hyukjin Kwon  wrote:
>>
>> Hi all,
>>
>> I would like to discuss moving the Spark Connect server to the builtin
>> package. Right now, users have to specify --packages when they run the
>> Spark Connect server script, for example:
>>
>> ./sbin/start-connect-server.sh --jars `ls connector/connect/server/target/**/spark-connect*SNAPSHOT.jar`
>>
>> or
>>
>> ./sbin/start-connect-server.sh --packages org.apache.spark:spark-connect_2.12:3.5.1
>>
>> It is a little bit odd that an sbin script needs to be given jars in order
>> to start.
>>
>> Moving it to the builtin package is pretty straightforward because most of
>> the jars are shaded, and the impact would be minimal. I have a prototype at
>> apache/spark#47157. This also simplifies the Python local running logic a
>> lot.
>>
>> The user-facing API layer, the Spark Connect client, stays external, but I
>> would like the internal/admin server layer, the Spark Connect server
>> implementation, to be built into Spark.
>>
>> Please let me know if you have thoughts on this!
>>
>>
>>


Re: [DISCUSS] Versionless Spark Programming Guide Proposal

2024-06-05 Thread Matthew Powers
I am a huge fan of the Apache Spark docs and I regularly look at the
analytics on this page to see how well they are doing.  Great work to
everyone that's contributed
to the docs over the years.

We've been chipping away at some improvements over the past year and have
made good progress.  For example, lots of the pages were missing canonical
links.  Canonical links are a special type of link that is extremely
important for any site that has duplicate content.  Versioned documentation
sites have lots of duplicate pages, so getting these canonical links added
was important.  It wasn't really easy to make this change though.

The current site is confusing Google a bit.  If you do a "spark rocksdb"
Google search for example, you get the Spark 3.2 Structured Streaming
Programming Guide as the first result (because Google isn't properly
indexing the docs).  You need to Control+F and search for "rocksdb" to
navigate to the relevant section which says: "As of Spark 3.2, we add a new
built-in state store implementation...", which is what you'd expect in a
versionless docs site in any case.

There are two different user experiences:

* Option A: push Spark 3.1 Structured Streaming users to the Spark 3.1
Structured Streaming Programming Guide that doesn't mention RocksDB
* Option B: push Spark Structured Streaming users to the latest Structured
Streaming Programming Guide, which mentions RocksDB, but caveat that this
feature was added in Spark 3.2

I think Option B provides Spark 3.1 users a better experience overall.
It's better to let users know they can access RocksDB by upgrading than
hiding this info from them IMO.

Now if we want Option A, then we'd need to give users a reasonable way to
actually navigate to the Spark 3.1 docs.  From what I can tell, the only
way to navigate from the latest Structured Streaming Programming Guide to
a different version is by manually updating the URL.

I was just skimming over the Structured Streaming Programming guide and
noticing again how lots of the Python code snippets aren't PEP 8
compliant.  It seems like our current docs publishing process would prevent
us from improving the old docs pages.

In this conversation, let's make sure we distinguish between "programming
guides" and "API documentation".  API docs should be versioned and there is
no question there.  Programming guides are higher level conceptual
overviews, like the Polars user guide, and should
be relevant across many versions.

I would also like to point out that the current programming guides are not
consistent:

* The Structured Streaming programming guide is one giant page
* The SQL programming guide is split across many pages
* The PySpark programming guide takes you to a whole different URL
structure and makes it so you can't even navigate to the other programming
guides anymore

I am looking forward to collaborating with the community and improving the
docs to 1. delight existing users and 2. attract new users.  Docs are a
"website problem" and we're big data people, but I'm confident we'll be
able to work together and find a good path forward here.


On Wed, Jun 5, 2024 at 3:22 PM Neil Ramaswamy  wrote:

> Thanks all for the responses. Let me try to address everything.
>
> > the programming guides are also different between versions since
> features are being added, configs are being added/ removed/ changed,
> defaults are being changed etc.
>
> I agree that this is the case. But I think it's fine to mention what
> version a feature is available in. In fact, I would argue that mentioning
> an improvement that a version brings motivates users to upgrade more than
> keeping docs improvement to "new releases to keep the community updating".
> Users should upgrade to get a better Spark, not better Spark documentation.
>
> > having a programming guide that refers to features or API methods that
> does not exist in that version is confusing and detrimental
>
> I don't think that we'd do this. Again, programming guides should teach
> fundamentals that do not change version-to-version. TypeScript (which has
> one of the best DX's and docs) does this exceptionally well. Their
> guides are refined, versionless pages, new features are elaborated upon in
> release notes (analogous to our version-specific docs), and for the
> occasional caveat for a version, it is called out in the guides.
>
>  I agree with Wenchen's 3 points. I don't think we need to say that they
> *have* to go to the old 

Re: [DISCUSS] Multiple columns adding/replacing support in PySpark DataFrame API

2021-04-30 Thread Matthew Powers
Thanks for starting this good discussion.  You can add multiple columns
with select to avoid calling withColumn multiple times:

import org.apache.spark.sql.functions.{col, lit}

val newCols = Seq(col("*"), lit("val1").as("key1"), lit("val2").as("key2"))
df.select(newCols: _*).show()

withColumns would be a nice interface for less technical Spark users.

Here's a related discussion on why the maintainers are sometimes hesitant
to expose the surface area of
the Spark API (maintainers are doing a great job keeping the API clean).

As Maciej suggested in the thread, I created a separate project called bebe
to expose developer friendly functions
that the maintainers don't want to expose in Spark.  If the Spark
maintainers decide that they don't want to add withColumns to the Spark
API, we can at least add it to bebe.


On Fri, Apr 30, 2021 at 1:13 AM Saurabh Chawla 
wrote:

> Hi All,
>
> I also had a scenario where at runtime, I needed to loop through a
> dataframe to use withColumn many times.
>
> To be on the safer side, I used reflection to access withColumns to
> prevent any java.lang.StackOverflowError.
>
> val dataSetClass = Class.forName("org.apache.spark.sql.Dataset")
> val newConfigurationMethod =
>   dataSetClass.getMethod("withColumns", classOf[Seq[String]], 
> classOf[Seq[Column]])
> newConfigurationMethod.invoke(
>   baseDataFrame, columnName, columnValue).asInstanceOf[DataFrame]
>
> It would be great if we could use "withColumns" rather than using
> reflection code like this, or make changes in the code to merge the new
> project with the existing project in the plan, instead of adding a new
> project every time we call "withColumn".
>
> +1 for exposing the *withColumns*
>
> Regards
> Saurabh Chawla
>
> On Thu, Apr 22, 2021 at 1:03 PM Yikun Jiang  wrote:
>
>> Hi, all
>>
>> *Background:*
>>
>> Currently, there is a withColumns [1] method to help users/devs
>> add/replace multiple columns at once. But this method is private and isn't
>> exposed as a public API interface, which means it cannot be used by users
>> directly, and it is also not supported in the PySpark API.
>>
>> As a dataframe user, I can only call withColumn() multiple times:
>>
>> df.withColumn("key1", col("key1")).withColumn("key2", col("key2")).withColumn("key3", col("key3"))
>>
>> rather than:
>>
>> df.withColumn(["key1", "key2", "key3"], [col("key1"), col("key2"), col("key3")])
>>
>> Multiple calls bring a higher cost in developer experience and
>> performance. Especially in a PySpark scenario, multiple calls mean
>> multiple py4j calls.
>>
>> As mentioned by @Hyukjin, there were some previous discussions on
>> SPARK-12225 [2].
>>
>> [1]
>> https://github.com/apache/spark/blob/b5241c97b17a1139a4ff719bfce7f68aef094d95/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2402
>> [2] https://issues.apache.org/jira/browse/SPARK-12225
>>
>> *Potential solution:*
>> Looks like there are 2 potential solutions if we want to support it:
>>
>> 1. Introduce a *withColumns* API for Scala/Python.
>> A separate public withColumns API would be added to the Scala/Python APIs.
>>
>> 2. Make withColumn able to receive a *single col* and also a *list of cols*.
>> I did an experimental try on PySpark at
>> https://github.com/apache/spark/pull/32276
>> Just like Maciej said, it will bring some confusion with naming.
>>
>>
>> Thanks for reading; feel free to reply if you have any other
>> concerns or suggestions!
>>
>>
>> Regards,
>> Yikun
>>
>


Re: Bintray replacement for spark-packages.org

2021-04-26 Thread Matthew Powers
Great job fixing this!!  I just checked and it's working on my end.  Updated
the resolver and sbt test still works just fine.
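
For anyone making the same change, the resolver swap in an sbt build file
looks roughly like this (a minimal sketch; the resolver label is just a name
I picked):

// sbt: point the resolver at the new repository service
resolvers += "Spark Packages Repo" at "https://repos.spark-packages.org/"

// previously it pointed at Bintray, e.g.:
// resolvers += "Spark Packages Repo" at "https://dl.bintray.com/spark-packages/maven/"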

On Mon, Apr 26, 2021 at 3:31 AM Bo Zhang  wrote:

> Hi Apache Spark devs,
>
> As you might know, Bintray, which is the repository service used for
> spark-packages.org, is in its sunset process. There was a planned
> brown-out on April 12th
>  and there will be
> another one on April 26th
> , and it will no
> longer be available from May 1st.
>
> We have spun up a new repository service at
> https://repos.spark-packages.org and it will be the new home for the
> artifacts on spark-packages.
>
> Given the planned Bintray brown-out, this is a good time for us to test
> the new repository service. To consume artifacts from that, please replace "
> dl.bintray.com/spark-packages/maven" with "repos.spark-packages.org" in
> the Maven pom files or sbt build files in your repositories, e.g.:
> https://github.com/apache/spark/pull/32346
>
> We are still working on the release process to the new repository service,
> and will provide an update here shortly.
>
> If you have any questions for using the new repository service, or any
> general questions for spark-packages, please reach out to
> feedb...@spark-packages.org.
>
> Thanks,
> Bo
>


Re: Auto-closing PRs or How to get reviewers' attention

2021-02-23 Thread Matthew Powers
Enrico - thanks for sharing your experience.

I recently got a couple of PRs merged and my experience was different.  I
got lots of feedback from several maintainers (thank you very much!).

Can't speak to your PRs specifically, but can give the general advice that
pivoting code based on maintainer feedback is probably the easiest way to
get stuff merged.

I initially added an add_hours function to org.apache.spark.sql.functions
and it seemed pretty clear that the maintainers weren't the biggest fans
and were more in favor of a make_interval function.  I proactively closed
my own add_hours PR and pushed forward make_interval instead.
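
To make that concrete, here is a rough sketch of how the make_interval route
covers the add_hours use case (the DataFrame and column names are made up
for illustration; make_interval is invoked through expr):

import org.apache.spark.sql.functions.{col, expr}

// df is assumed to be a DataFrame with a timestamp column "event_time";
// add 3 hours via the SQL make_interval(years, months, weeks, days, hours,
// mins, secs) function instead of a dedicated add_hours helper
val res = df.withColumn(
  "event_time_plus_3h",
  col("event_time") + expr("make_interval(0, 0, 0, 0, 3, 0, 0)")
)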

In hindsight, add_hours would have been a bad addition to the API and I'm
glad it got rejected.  For big, mature projects like Spark, it's more
important for maintainers to reject stuff than add new functionality.
Software bloat is the main risk for Spark.

I'm of the opinion that the auto-closing PR feature is working well.  Spark
maintainers have a difficult job of having to say "no" and disappoint
people a lot.  Auto-closing is a great way to indirectly communicate the
"no" in a way that's more psychologically palatable for both the maintainer
and the contributor.

On Tue, Feb 23, 2021 at 9:13 AM Sean Owen  wrote:

> Yes, committers are added regularly. I don't think that changes the
> situation for most PRs that perhaps just aren't suitable to merge.
> Again the best thing you can do is make it as easy to merge as possible
> and find people who have touched the code for review. This often works out.
>
> On Tue, Feb 23, 2021 at 4:06 AM Enrico Minack 
> wrote:
>
>> Am 18.02.21 um 16:34 schrieb Sean Owen:
>> > One other aspect is that a committer is taking some degree of
>> > responsibility for merging a change, so the ask is more than just a
>> > few minutes of eyeballing. If it breaks something the merger pretty
>> > much owns resolving it, and, the whole project owns any consequence of
>> > the change for the future.
>>
>> I think this explains the hesitation pretty well: Committers take
>> ownership of the change. It is understandable that PRs then have to be
>> very convincing with low risk/benefit ratio.
>>
>> Are there plans or initiatives to proactively widen the base of
>> committers to mitigate the current situation?
>>
>> Enrico
>>
>>


Re: [Spark SQL]: SQL, Python, Scala and R API Consistency

2021-01-30 Thread Matthew Powers
Maciej - I like the idea of a separate library to provide easy access to
functions that the maintainers don't want to merge into Spark core.

I've seen this model work well in other open source communities.  The Rails
Active Support library provides the Ruby community with core functionality
like beginning_of_month.  The Ruby community has a good, well-supported
function, but it's not in the Ruby codebase so it's not a maintenance
burden - best of both worlds.

I'll start a proof-of-concept repo.  If the repo gets popular, I'll be
happy to donate it to a GitHub organization like Awesome Spark
<https://github.com/awesome-spark> or the ASF.

On Sat, Jan 30, 2021 at 9:35 AM Maciej  wrote:

> Just thinking out loud ‒ if there is a community need for providing
> language bindings for less popular SQL functions, could these live outside
> the main project or even outside the ASF?  As long as the expressions are
> already implemented, bindings are trivial after all.
>
> It could also allow usage of a more scalable hierarchy (let's say with
> modules / packages per function family).
>
> On 1/29/21 5:01 AM, Hyukjin Kwon wrote:
>
> FYI exposing methods with a Column signature only is already documented at
> the top of functions.scala, and I believe that has been the current dev
> direction if I am not mistaken.
>
> Another point is that we should rather expose commonly used expressions.
> It's best if it considers language-specific context. Many of the expressions
> are for SQL compliance. Many data science Python libraries don't support
> such features, as an example.
>
>
>
> On Fri, 29 Jan 2021, 12:04 Matthew Powers, 
> wrote:
>
>> Thanks for the thoughtful responses.  I now understand why adding all the
>> functions across all the APIs isn't the default.
>>
>> To Nick's point, relying on heuristics to gauge user interest, in
>> addition to personal experience, is a good idea.  The regexp_extract_all
>> SO thread has 16,000 views
>> <https://stackoverflow.com/questions/47981699/extract-words-from-a-string-column-in-spark-dataframe/47989473>,
>> so I say we set the threshold to 10k, haha, just kidding!  Like Sean
>> mentioned, we don't want to add niche functions.  Now we just need a way to
>> figure out what's niche!
>>
>> To Reynold's point on overloading Scala functions, I think we should start
>> trying to limit the number of overloaded functions.  Some functions have
>> the columnName and column object function signatures.  e.g.
>> approx_count_distinct(columnName: String, rsd: Double) and
>> approx_count_distinct(e: Column, rsd: Double).  We can just expose the
>> approx_count_distinct(e: Column, rsd: Double) variety going forward (not
>> suggesting any backwards incompatible changes, just saying we don't need
>> the columnName-type functions for new stuff).
>>
>> Other functions have one signature with the second object as a Scala
>> object and another signature with the second object as a column object,
>> e.g. date_add(start: Column, days: Column) and date_add(start: Column,
>> days: Int).  We can just expose the date_add(start: Column, days: Column)
>> variety cause it's general purpose.  Let me know if you think that avoiding
>> Scala function overloading will help Reynold.
>>
>> Let's brainstorm Nick's idea of creating a framework that'd test Scala /
>> Python / SQL / R implementations in one-fell-swoop.  Seems like that'd be a
>> great way to reduce the maintenance burden.  Reynold's regexp_extract code
>> from 5 years ago is largely still intact - getting the job done right the
>> first time is another great way to avoid maintenance!
>>
>> On Thu, Jan 28, 2021 at 6:38 PM Reynold Xin  wrote:
>>
>>> There's another thing that's not mentioned … it's primarily a problem
>>> for Scala. Due to static typing, we need a very large number of function
>>> overloads for the Scala version of each function, whereas in SQL/Python
>>> there is just one. There's a limit on how many functions we can add, and it
>>> also makes it difficult to browse through the docs when there are a lot of
>>> functions.
>>>
>>>
>>>
>>> On Thu, Jan 28, 2021 at 1:09 PM, Maciej  wrote:
>>>
>>>> Just my two cents on R side.
>>>>
>>>> On 1/28/21 10:00 PM, Nicholas Chammas wrote:
>>>>
>>>> On Thu, Jan 28, 2021 at 3:40 PM Sean Owen  wrote:
>>>>
>>>>> It isn't that regexp_extract_all (for example) is useless outside SQL,
>>>>> just, where do you draw the line? Supporting 10s of random SQL functions
>>>>> across 3 other languages has a cost, which has to be weighed against
>>>>> benefit, which we can never measure well except anecdotally: one or two
>>>>> people say "I want this" in a sea of hundreds of thousands of users.

Re: [Spark SQL]: SQL, Python, Scala and R API Consistency

2021-01-28 Thread Matthew Powers
Thanks for the thoughtful responses.  I now understand why adding all the
functions across all the APIs isn't the default.

To Nick's point, relying on heuristics to gauge user interest, in
addition to personal experience, is a good idea.  The regexp_extract_all SO
thread has 16,000 views,
so I say we set the threshold to 10k, haha, just kidding!  Like Sean
mentioned, we don't want to add niche functions.  Now we just need a way to
figure out what's niche!

To Reynold's point on overloading Scala functions, I think we should start
trying to limit the number of overloaded functions.  Some functions have
the columnName and column object function signatures.  e.g.
approx_count_distinct(columnName: String, rsd: Double) and
approx_count_distinct(e: Column, rsd: Double).  We can just expose the
approx_count_distinct(e: Column, rsd: Double) variety going forward (not
suggesting any backwards incompatible changes, just saying we don't need
the columnName-type functions for new stuff).

Other functions have one signature with the second object as a Scala object
and another signature with the second object as a column object, e.g.
date_add(start: Column, days: Column) and date_add(start: Column, days:
Int).  We can just expose the date_add(start: Column, days: Column) variety
cause it's general purpose.  Let me know if you think that avoiding Scala
function overloading will help Reynold.
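
For illustration, here's the call style the Column-only signatures leave us
with (a rough sketch; the DataFrame and column names are made up):

import org.apache.spark.sql.functions.{approx_count_distinct, col, date_add, lit}

// approx_count_distinct(e: Column, rsd: Double)
val distinctUsers = df.agg(approx_count_distinct(col("user_id"), 0.05))

// date_add(start: Column, days: Column) -- literals get wrapped with lit()
// instead of relying on a separate Int overload
val shifted = df.withColumn("due_date", date_add(col("start_date"), lit(7)))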

Let's brainstorm Nick's idea of creating a framework that'd test Scala /
Python / SQL / R implementations in one-fell-swoop.  Seems like that'd be a
great way to reduce the maintenance burden.  Reynold's regexp_extract code
from 5 years ago is largely still intact - getting the job done right the
first time is another great way to avoid maintenance!
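
To seed that brainstorm, here is one very rough shape such a framework could
take. Everything below is hypothetical (FunctionSpec and the harness don't
exist today); the idea is that the expectation is written once and a shared
harness replays it through each language binding:

// One language-agnostic spec per function; a harness would evaluate the
// same SQL expression through the SQL, Scala, Python, and R bindings and
// compare the results against the single expected value.
case class FunctionSpec(name: String, sqlExpr: String, expected: Any)

val specs = Seq(
  FunctionSpec("upper", "upper('spark')", "SPARK"),
  FunctionSpec("date_add", "date_add(date'2021-01-01', 7)", java.sql.Date.valueOf("2021-01-08"))
)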

On Thu, Jan 28, 2021 at 6:38 PM Reynold Xin  wrote:

> There's another thing that's not mentioned … it's primarily a problem for
> Scala. Due to static typing, we need a very large number of function
> overloads for the Scala version of each function, whereas in SQL/Python
> there is just one. There's a limit on how many functions we can add, and it
> also makes it difficult to browse through the docs when there are a lot of
> functions.
>
>
>
> On Thu, Jan 28, 2021 at 1:09 PM, Maciej  wrote:
>
>> Just my two cents on R side.
>>
>> On 1/28/21 10:00 PM, Nicholas Chammas wrote:
>>
>> On Thu, Jan 28, 2021 at 3:40 PM Sean Owen  wrote:
>>
>>> It isn't that regexp_extract_all (for example) is useless outside SQL,
>>> just, where do you draw the line? Supporting 10s of random SQL functions
>>> across 3 other languages has a cost, which has to be weighed against
>>> benefit, which we can never measure well except anecdotally: one or two
>>> people say "I want this" in a sea of hundreds of thousands of users.
>>>
>>
>> +1 to this, but I will add that Jira and Stack Overflow activity can
>> sometimes give good signals about API gaps that are frustrating users. If
>> there is an SO question with 30K views about how to do something that
>> should have been easier, then that's an important signal about the API.
>>
>> For this specific case, I think there is a fine argument
>>> that regexp_extract_all should be added simply for consistency
>>> with regexp_extract. I can also see the argument that regexp_extract was a
>>> step too far, but, what's public is now a public API.
>>>
>>
>> I think in this case a few references to where/how people are having to
>> work around missing a direct function for regexp_extract_all could help
>> guide the decision. But that itself means we are making these decisions on
>> a case-by-case basis.
>>
>> From a user perspective, it's definitely conceptually simpler to have SQL
>> functions be consistent and available across all APIs.
>>
>> Perhaps if we had a way to lower the maintenance burden of keeping
>> functions in sync across SQL/Scala/Python/R, it would be easier for
>> everyone to agree to just have all the functions be included across the
>> board all the time.
>>
>> Python aligns quite well with Scala so that might be fine, but R is a bit
>> tricky. In particular, the lack of proper namespaces makes it rather risky
>> to have packages that export hundreds of functions. sparklyr handles this
>> neatly with NSE, but I don't think we're going to go this way.
>>
>>
>> Would, for example, some sort of automatic testing mechanism for SQL
>> functions help here? Something that uses a common function testing
>> specification to automatically test SQL, Scala, Python, and R functions,
>> without requiring maintainers to write tests for each language's version of
>> the functions. Would that address the maintenance burden?
>>
>> With R we don't really test most of the functions beyond simple
>> "callability". Only the complex ones, which require some non-trivial
>> transformations of arguments, are fully tested.
>>
>> --
>> Best regards,
>> Maciej Szymkiewicz
>>