Re: [VOTE] SPIP: Stored Procedures API for Catalogs

2024-05-12 Thread bo yang
+1

On Sat, May 11, 2024 at 4:43 PM huaxin gao  wrote:

> +1
>
> On Sat, May 11, 2024 at 4:35 PM L. C. Hsieh  wrote:
>
>> +1
>>
>> On Sat, May 11, 2024 at 3:11 PM Chao Sun  wrote:
>> >
>> > +1
>> >
>> > On Sat, May 11, 2024 at 2:10 PM L. C. Hsieh  wrote:
>> >>
>> >> Hi all,
>> >>
>> >> I’d like to start a vote for SPIP: Stored Procedures API for Catalogs.
>> >>
>> >> Please also refer to:
>> >>
>> >>- Discussion thread:
>> >> https://lists.apache.org/thread/7r04pz544c9qs3gc8q2nyj3fpzfnv8oo
>> >>- JIRA ticket: https://issues.apache.org/jira/browse/SPARK-44167
>> >>- SPIP doc:
>> https://docs.google.com/document/d/1rDcggNl9YNcBECsfgPcoOecHXYZOu29QYFrloo2lPBg/
>> >>
>> >>
>> >> Please vote on the SPIP for the next 72 hours:
>> >>
>> >> [ ] +1: Accept the proposal as an official SPIP
>> >> [ ] +0
>> >> [ ] -1: I don’t think this is a good idea because …
>> >>
>> >>
>> >> Thank you!
>> >>
>> >> Liang-Chi Hsieh
>> >>
>> >> -
>> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> >>
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: [VOTE] Release Spark 3.4.3 (RC2)

2024-04-16 Thread bo yang
+1

On Tue, Apr 16, 2024 at 1:38 PM Hyukjin Kwon  wrote:

> +1
>
> On Wed, Apr 17, 2024 at 3:57 AM L. C. Hsieh  wrote:
>
>> +1
>>
>> On Tue, Apr 16, 2024 at 4:08 AM Wenchen Fan  wrote:
>> >
>> > +1
>> >
>> > On Mon, Apr 15, 2024 at 12:31 PM Dongjoon Hyun 
>> wrote:
>> >>
>> >> I'll start with my +1.
>> >>
>> >> - Checked checksum and signature
>> >> - Checked Scala/Java/R/Python/SQL Document's Spark version
>> >> - Checked published Maven artifacts
>> >> - All CIs passed.
>> >>
>> >> Thanks,
>> >> Dongjoon.
>> >>
>> >> On 2024/04/15 04:22:26 Dongjoon Hyun wrote:
>> >> > Please vote on releasing the following candidate as Apache Spark
>> version
>> >> > 3.4.3.
>> >> >
>> >> > The vote is open until April 18th 1AM (PDT) and passes if a majority
>> +1 PMC
>> >> > votes are cast, with a minimum of 3 +1 votes.
>> >> >
>> >> > [ ] +1 Release this package as Apache Spark 3.4.3
>> >> > [ ] -1 Do not release this package because ...
>> >> >
>> >> > To learn more about Apache Spark, please see
>> https://spark.apache.org/
>> >> >
>> >> > The tag to be voted on is v3.4.3-rc2 (commit
>> >> > 1eb558c3a6fbdd59e5a305bc3ab12ce748f6511f)
>> >> > https://github.com/apache/spark/tree/v3.4.3-rc2
>> >> >
>> >> > The release files, including signatures, digests, etc. can be found
>> at:
>> >> > https://dist.apache.org/repos/dist/dev/spark/v3.4.3-rc2-bin/
>> >> >
>> >> > Signatures used for Spark RCs can be found in this file:
>> >> > https://dist.apache.org/repos/dist/dev/spark/KEYS
>> >> >
>> >> > The staging repository for this release can be found at:
>> >> >
>> https://repository.apache.org/content/repositories/orgapachespark-1453/
>> >> >
>> >> > The documentation corresponding to this release can be found at:
>> >> > https://dist.apache.org/repos/dist/dev/spark/v3.4.3-rc2-docs/
>> >> >
>> >> > The list of bug fixes going into 3.4.3 can be found at the following
>> URL:
>> >> > https://issues.apache.org/jira/projects/SPARK/versions/12353987
>> >> >
>> >> > This release is using the release script of the tag v3.4.3-rc2.
>> >> >
>> >> > FAQ
>> >> >
>> >> > =
>> >> > How can I help test this release?
>> >> > =
>> >> >
>> >> > If you are a Spark user, you can help us test this release by taking
>> >> > an existing Spark workload and running on this release candidate, then
>> >> > reporting any regressions.
>> >> >
>> >> > If you're working in PySpark you can set up a virtual env and install
>> >> > the current RC and see if anything important breaks; in the Java/Scala
>> >> > case you can add the staging repository to your project's resolvers and
>> >> > test with the RC (make sure to clean up the artifact cache before/after
>> >> > so you don't end up building with an out-of-date RC going forward).
>> >> >
>> >> > ===
>> >> > What should happen to JIRA tickets still targeting 3.4.3?
>> >> > ===
>> >> >
>> >> > The current list of open tickets targeted at 3.4.3 can be found at:
>> >> > https://issues.apache.org/jira/projects/SPARK and search for "Target
>> >> > Version/s" = 3.4.3
>> >> >
>> >> > Committers should look at those and triage. Extremely important bug
>> >> > fixes, documentation, and API tweaks that impact compatibility should
>> >> > be worked on immediately. Everything else please retarget to an
>> >> > appropriate release.
>> >> >
>> >> > ==
>> >> > But my bug isn't fixed?
>> >> > ==
>> >> >
>> >> > In order to make timely releases, we will typically not hold the
>> >> > release unless the bug in question is a regression from the previous
>> >> > release. That being said, if there is something which is a regression
>> >> > that has not been correctly targeted please ping me or a committer to
>> >> > help target the issue.
>> >> >
>> >>
>> >> -
>> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> >>
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: [VOTE] Add new `Versions` in Apache Spark JIRA for Versioning of Spark Operator

2024-04-12 Thread bo yang
+1

On Fri, Apr 12, 2024 at 12:34 PM huaxin gao  wrote:

> +1
>
> On Fri, Apr 12, 2024 at 9:07 AM Dongjoon Hyun  wrote:
>
>> +1
>>
>> Thank you!
>>
>> I hope we can customize `dev/merge_spark_pr.py` script per repository
>> after this PR.
>>
>> Dongjoon.
>>
>> On 2024/04/12 03:28:36 "L. C. Hsieh" wrote:
>> > Hi all,
>> >
>> > Thanks for all discussions in the thread of "Versioning of Spark
>> > Operator":
>> https://lists.apache.org/thread/zhc7nb2sxm8jjxdppq8qjcmlf4rcsthh
>> >
>> > I would like to create this vote to get the consensus for versioning
>> > of the Spark Kubernetes Operator.
>> >
>> > The proposal is to use an independent versioning for the Spark
>> > Kubernetes Operator.
>> >
>> > Please vote on adding new `Versions` in Apache Spark JIRA which can be
>> > used for places like "Fix Version/s" in the JIRA tickets of the
>> > operator.
>> >
>> > The new `Versions` will be `kubernetes-operator-` prefix, for example
>> > `kubernetes-operator-0.1.0`.
>> >
>> > The vote is open until April 15th 1AM (PST) and passes if a majority
>> > +1 PMC votes are cast, with a minimum of 3 +1 votes.
>> >
>> > [ ] +1 Adding the new `Versions` for Spark Kubernetes Operator in
>> > Apache Spark JIRA
>> > [ ] -1 Do not add the new `Versions` because ...
>> >
>> > Thank you.
>> >
>> >
>> > Note that this is not a SPIP vote and also not a release vote. I don't
>> > find similar votes in previous threads. This is made similarly like a
>> > SPIP or a release vote. So I think it should be okay. Please correct
>> > me if this vote format is not good for you.
>> >
>> > -
>> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> >
>> >
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: Versioning of Spark Operator

2024-04-10 Thread bo yang
Cool, looks like we have two options here.

Option 1: Spark Operator and Connect Go Client versioning independent of
Spark, e.g. starting with 0.1.0.
Pros: they can evolve versions independently.
Cons: people will need an extra step to decide the version when using Spark
Operator and Connect Go Client.

Option 2: Spark Operator and Connect Go Client versioning loosely related
with Spark, e.g. starting with the Supported Spark version
Pros: might be easy for beginning users to choose version when using Spark
Operator and Connect Go Client.
Cons: there is uncertainty how the compatibility will go in the future for
Spark Operator and Connect Go Client regarding Spark, which may impact this
version naming.

Right now, Connect Go Client uses Option 2, but can change to Option 1 if
needed.


On Wed, Apr 10, 2024 at 6:19 AM Dongjoon Hyun 
wrote:

> Ya, that would work.
>
> Inevitably, I looked at Apache Flink K8s Operator's JIRA and GitHub repo.
>
> It looks reasonable to me.
>
> Although they share the same JIRA, they choose different patterns per
> place.
>
> 1. In POM file and Maven Artifact, independent version number.
> 1.8.0
>
> 2. Tag is also based on the independent version number
> https://github.com/apache/flink-kubernetes-operator/tags
> - release-1.8.0
> - release-1.7.0
>
> 3. JIRA Fixed Version is `kubernetes-operator-` prefix.
> https://issues.apache.org/jira/browse/FLINK-34957
> > Fix Version/s: kubernetes-operator-1.9.0
>
> Maybe, we can borrow this pattern.
>
> I guess we need a vote for any further decision because we need to create
> new `Versions` in Apache Spark JIRA.
>
> Dongjoon.
>
>


Re: Versioning of Spark Operator

2024-04-09 Thread bo yang
Thanks Liang-Chi for the Spark Operator work, and also the discussion here!

For Spark Operator and Connect Go Client, I am guessing they need to
support multiple versions of Spark? E.g. the same Spark Operator may support
running multiple versions of Spark, and the Connect Go Client might support
multiple versions of the Spark driver as well.

What do people think of using the minimum supported Spark version as the
version name for Spark Operator and Connect Go Client? For example,
Spark Operator 3.5.x supports Spark 3.5 and above.

Best,
Bo


On Tue, Apr 9, 2024 at 10:14 AM Dongjoon Hyun  wrote:

> Ya, that's simple and possible.
>
> However, it may cause many confusions because it implies that new `Spark
> K8s Operator 4.0.0` and `Spark Connect Go 4.0.0` follow the same `Semantic
> Versioning` policy like Apache Spark 4.0.0.
>
> In addition, `Versioning` is directly related to the Release Cadence. It's
> unlikely for us to have `Spark K8s Operator` and `Spark Connect Go`
> releases at every Apache Spark maintenance release. For example, there is
> no commit in Spark Connect Go repository.
>
> I believe the versioning and release cadence is related to those
> subprojects' maturity more.
>
> Dongjoon.
>
> On 2024/04/09 16:59:40 DB Tsai wrote:
> >  Aligning with Spark releases is sensible, as it allows us to guarantee
> that the Spark operator functions correctly with the new version while also
> maintaining support for previous versions.
> >
> > DB Tsai  |  https://www.dbtsai.com/  |  PGP 42E5B25A8F7A82C1
> >
> > > On Apr 9, 2024, at 9:45 AM, Mridul Muralidharan 
> wrote:
> > >
> > >
> > >   I am trying to understand if we can simply align with Spark's
> version for this ?
> > > Makes the release and jira management much more simpler for developers
> and intuitive for users.
> > >
> > > Regards,
> > > Mridul
> > >
> > >
> > > On Tue, Apr 9, 2024 at 10:09 AM Dongjoon Hyun  > wrote:
> > >> Hi, Liang-Chi.
> > >>
> > >> Thank you for leading Apache Spark K8s operator as a shepherd.
> > >>
> > >> I took a look at `Apache Spark Connect Go` repo mentioned in the
> thread. Sadly, there is no release at all and no activity since last 6
> months. It seems to be the first time for Apache Spark community to
> consider these sister repositories (Go and K8s Operator).
> > >>
> > >> https://github.com/apache/spark-connect-go/commits/master/
> > >>
> > >> Dongjoon.
> > >>
> > >> On 2024/04/08 17:48:18 "L. C. Hsieh" wrote:
> > >> > Hi all,
> > >> >
> > >> > We've opened the dedicated repository of Spark Kubernetes Operator,
> > >> > and the first PR is created.
> > >> > Thank you for the review from the community so far.
> > >> >
> > >> > About the versioning of Spark Operator, there are questions.
> > >> >
> > >> > As we are using Spark JIRA, when we are going to merge PRs, we need
> to
> > >> > choose a Spark version. However, the Spark Operator is versioning
> > >> > differently than Spark. I'm wondering how we deal with this?
> > >> >
> > >> > Not sure if Connect also has its versioning different to Spark? If
> so,
> > >> > maybe we can follow how Connect does.
> > >> >
> > >> > Can someone who is familiar with Connect versioning give some
> suggestions?
> > >> >
> > >> > Thank you.
> > >> >
> > >> > Liang-Chi
> > >> >
> > >> >
> -
> > >> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org  dev-unsubscr...@spark.apache.org>
> > >> >
> > >> >
> > >>
> > >> -
> > >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org  dev-unsubscr...@spark.apache.org>
> > >>
> >
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE] SPIP: Pure Python Package in PyPI (Spark Connect)

2024-04-01 Thread bo yang
+1 (non-binding)

On Mon, Apr 1, 2024 at 10:19 AM Felix Cheung 
wrote:

> +1
> --
> *From:* Denny Lee 
> *Sent:* Monday, April 1, 2024 10:06:14 AM
> *To:* Hussein Awala 
> *Cc:* Chao Sun ; Hyukjin Kwon ;
> Mridul Muralidharan ; dev 
> *Subject:* Re: [VOTE] SPIP: Pure Python Package in PyPI (Spark Connect)
>
> +1 (non-binding)
>
>
> On Mon, Apr 1, 2024 at 9:24 AM Hussein Awala  wrote:
>
> +1 (non-binding). I'd add that it will also simplify package maintenance
> and make it easy to release a bug fix or new feature without needing to
> wait for a PySpark release.
>
> On Mon, Apr 1, 2024 at 4:56 PM Chao Sun  wrote:
>
> +1
>
> On Sun, Mar 31, 2024 at 10:31 PM Hyukjin Kwon 
> wrote:
>
> Oh I didn't send the discussion thread out as it's pretty simple,
> non-invasive and the discussion was sort of done as part of the Spark
> Connect initial discussion ..
>
> On Mon, Apr 1, 2024 at 1:59 PM Mridul Muralidharan 
> wrote:
>
>
> Can you point me to the SPIP’s discussion thread please ?
> I was not able to find it, but I was on vacation, and so might have
> missed this …
>
>
> Regards,
> Mridul
>
>
> On Sun, Mar 31, 2024 at 9:08 PM Haejoon Lee
>  wrote:
>
> +1
>
> On Mon, Apr 1, 2024 at 10:15 AM Hyukjin Kwon  wrote:
>
> Hi all,
>
> I'd like to start the vote for SPIP: Pure Python Package in PyPI (Spark
> Connect)
>
> JIRA 
> Prototype 
> SPIP doc
> 
>
> Please vote on the SPIP for the next 72 hours:
>
> [ ] +1: Accept the proposal as an official SPIP
> [ ] +0
> [ ] -1: I don’t think this is a good idea because …
>
> Thanks.
>
>


Re: [VOTE] SPIP: Structured Logging Framework for Apache Spark

2024-03-13 Thread bo yang
+1

On Wed, Mar 13, 2024 at 7:19 AM Tom Graves 
wrote:

> Similar to others, I will be interested in working out the APIs and
> details, but overall I'm in favor of it.
>
> +1
>
> Tom Graves
> On Monday, March 11, 2024 at 11:25:38 AM CDT, Mridul Muralidharan <
> mri...@gmail.com> wrote:
>
>
>
>   I am supportive of the proposal - this is a step in the right direction !
> Additional metadata (explicit and inferred) for log records, and exposing
> them for indexing is extremely useful.
>
> The specifics of the API still need some work IMO and does not need to be
> this disruptive, but I consider that is orthogonal to this vote itself -
> and something we need to iterate upon during PR reviews.
>
> +1
>
> Regards,
> Mridul
>
>
> On Mon, Mar 11, 2024 at 11:09 AM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
> +1
>
> Mich Talebzadeh,
> Dad | Technologist | Solutions Architect | Engineer
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner  Von
> Braun )".
>
>
> On Mon, 11 Mar 2024 at 09:27, Hyukjin Kwon  wrote:
>
> +1
>
> On Mon, 11 Mar 2024 at 18:11, yangjie01 
> wrote:
>
> +1
>
>
>
> Jie Yang
>
>
>
> *From:* Haejoon Lee 
> *Date:* Monday, March 11, 2024 17:09
> *To:* Gengliang Wang 
> *Cc:* dev 
> *Subject:* Re: [VOTE] SPIP: Structured Logging Framework for Apache Spark
>
>
>
> +1
>
>
>
> On Mon, Mar 11, 2024 at 10:36 AM Gengliang Wang  wrote:
>
> Hi all,
>
> I'd like to start the vote for SPIP: Structured Logging Framework for
> Apache Spark
>
>
> References:
>
>- JIRA ticket
>
> 
>- SPIP doc
>
> 
>- Discussion thread
>
> 
>
> Please vote on the SPIP for the next 72 hours:
>
> [ ] +1: Accept the proposal as an official SPIP
> [ ] +0
> [ ] -1: I don’t think this is a good idea because …
>
> Thanks!
>
> Gengliang Wang
>
>


Re: [VOTE] SPIP: An Official Kubernetes Operator for Apache Spark

2023-11-14 Thread bo yang
+1

On Tue, Nov 14, 2023 at 7:18 PM huaxin gao  wrote:

> +1
>
> On Tue, Nov 14, 2023 at 10:45 AM Holden Karau 
> wrote:
>
>> +1
>>
>> On Tue, Nov 14, 2023 at 10:21 AM DB Tsai  wrote:
>>
>>> +1
>>>
>>> DB Tsai  |  https://www.dbtsai.com/  |  PGP 42E5B25A8F7A82C1
>>>
>>> On Nov 14, 2023, at 10:14 AM, Vakaris Baškirov <
>>> vakaris.bashki...@gmail.com> wrote:
>>>
>>> +1 (non-binding)
>>>
>>>
>>> On Tue, Nov 14, 2023 at 8:03 PM Chao Sun  wrote:
>>>
 +1

 On Tue, Nov 14, 2023 at 9:52 AM L. C. Hsieh  wrote:
 >
 > +1
 >
 > On Tue, Nov 14, 2023 at 9:46 AM Ye Zhou  wrote:
 > >
 > > +1(Non-binding)
 > >
 > > On Tue, Nov 14, 2023 at 9:42 AM L. C. Hsieh 
 wrote:
 > >>
 > >> Hi all,
 > >>
 > >> I’d like to start a vote for SPIP: An Official Kubernetes Operator
 for
 > >> Apache Spark.
 > >>
 > >> The proposal is to develop an official Java-based Kubernetes
 operator
 > >> for Apache Spark to automate the deployment and simplify the
 lifecycle
 > >> management and orchestration of Spark applications and Spark
 clusters
 > >> on k8s at prod scale.
 > >>
 > >> This aims to reduce the learning curve and operation overhead for
 > >> Spark users so they can concentrate on core Spark logic.
 > >>
 > >> Please also refer to:
 > >>
 > >>- Discussion thread:
 > >> https://lists.apache.org/thread/wdy7jfhf7m8jy74p6s0npjfd15ym5rxz
 > >>- JIRA ticket:
 https://issues.apache.org/jira/browse/SPARK-45923
 > >>- SPIP doc:
 https://docs.google.com/document/d/1f5mm9VpSKeWC72Y9IiKN2jbBn32rHxjWKUfLRaGEcLE
 > >>
 > >>
 > >> Please vote on the SPIP for the next 72 hours:
 > >>
 > >> [ ] +1: Accept the proposal as an official SPIP
 > >> [ ] +0
 > >> [ ] -1: I don’t think this is a good idea because …
 > >>
 > >>
 > >> Thank you!
 > >>
 > >> Liang-Chi Hsieh
 > >>
 > >>
 -
 > >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
 > >>
 > >
 > >
 > > --
 > >
 > > Zhou, Ye  周晔
 >
 > -
 > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
 >

 -
 To unsubscribe e-mail: dev-unsubscr...@spark.apache.org


>>>


Re: Write Spark Connection client application in Go

2023-09-14 Thread bo yang
Thanks Holden and Martin for the nice words and feedback :)

On Wed, Sep 13, 2023 at 8:22 AM Martin Grund  wrote:

> This is absolutely awesome! Thank you so much for dedicating your time to
> this project!
>
>
> On Wed, Sep 13, 2023 at 6:04 AM Holden Karau  wrote:
>
>> That’s so cool! Great work y’all :)
>>
>> On Tue, Sep 12, 2023 at 8:14 PM bo yang  wrote:
>>
>>> Hi Spark Friends,
>>>
>>> Anyone interested in using Golang to write Spark application? We created
>>> a Spark Connect Go Client library
>>> <https://github.com/apache/spark-connect-go>. Would love to hear
>>> feedback/thoughts from the community.
>>>
>>> Please see the quick start guide
>>> <https://github.com/apache/spark-connect-go/blob/master/quick-start.md>
>>> about how to use it. Following is a very short Spark Connect application in
>>> Go:
>>>
>>> func main() {
>>> spark, _ := sql.SparkSession.Builder.Remote("sc://localhost:15002").Build()
>>> defer spark.Stop()
>>>
>>> df, _ := spark.Sql("select 'apple' as word, 123 as count union all select 'orange' as word, 456 as count")
>>> df.Show(100, false)
>>> df.Collect()
>>>
>>> df.Write().Mode("overwrite").
>>> Format("parquet").
>>> Save("file:///tmp/spark-connect-write-example-output.parquet")
>>>
>>> df = spark.Read().Format("parquet").
>>> Load("file:///tmp/spark-connect-write-example-output.parquet")
>>> df.Show(100, false)
>>>
>>> df.CreateTempView("view1", true, false)
>>> df, _ = spark.Sql("select count, word from view1 order by count")
>>> }
>>>
>>>
>>> Many thanks to Martin, Hyukjin, Ruifeng and Denny for creating and
>>> working together on this repo! Welcome more people to contribute :)
>>>
>>> Best,
>>> Bo
>>>
>>>


Write Spark Connection client application in Go

2023-09-12 Thread bo yang
Hi Spark Friends,

Anyone interested in using Golang to write Spark application? We
created a Spark
Connect Go Client library .
Would love to hear feedback/thoughts from the community.

Please see the quick start guide

about how to use it. Following is a very short Spark Connect application in
Go:

func main() {
    spark, _ := sql.SparkSession.Builder.Remote("sc://localhost:15002").Build()
    defer spark.Stop()

    df, _ := spark.Sql("select 'apple' as word, 123 as count union all select 'orange' as word, 456 as count")
    df.Show(100, false)
    df.Collect()

    df.Write().Mode("overwrite").
        Format("parquet").
        Save("file:///tmp/spark-connect-write-example-output.parquet")

    df = spark.Read().Format("parquet").
        Load("file:///tmp/spark-connect-write-example-output.parquet")
    df.Show(100, false)

    df.CreateTempView("view1", true, false)
    df, _ = spark.Sql("select count, word from view1 order by count")
}


Many thanks to Martin, Hyukjin, Ruifeng and Denny for creating and working
together on this repo! Welcome more people to contribute :)

Best,
Bo


Re: Improving Dynamic Allocation Logic for Spark 4+

2023-08-07 Thread bo yang
Thanks Holden for bringing this up!

Maybe another thing to think about is how to make dynamic allocation more
friendly with Kubernetes and disaggregated shuffle storage?



On Mon, Aug 7, 2023 at 1:27 PM Holden Karau  wrote:

> So I wondering if there is interesting in revisiting some of how Spark is
> doing it's dynamica allocation for Spark 4+?
>
> Some things that I've been thinking about:
>
> - Advisory user input (e.g. a way to say after X is done I know I need Y
> where Y might be a bunch of GPU machines)
> - Configurable tolerance (e.g. if we have at most Z% over target no-op)
> - Past runs of same job (e.g. stage X of job Y had a peak of K)
> - Faster executor launches (I'm a little fuzzy on what we can do here but,
> one area for example is we setup and tear down an RPC connection to the
> driver with a blocking call which does seem to have some locking inside of
> the driver at first glance)
>
> Is this an area other folks are thinking about? Should I make an epic we
> can track ideas in? Or are folks generally happy with today's dynamic
> allocation (or just busy with other things)?
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


Re: [CONNECT] New Clients for Go and Rust

2023-06-01 Thread bo yang
Hi Martin,

Thanks a lot for preparing the new repo and making it super easy for me to
just copy my code to the new repo! I will create a new PR there.

> I think the PR is fine from a code perspective as a starting point. I've
prepared the go repository with all the things necessary so that it reduces
friction for you. The protos are automatically generated, pre-commit checks
etc. All you need to do is drop your code :)

> Once we have the first version working we can iterate and identify the
next steps.

Best,
Bo


On Thu, Jun 1, 2023 at 11:58 AM Martin Grund 
wrote:

> These are all valid points and it makes total sense to continue to
> consider them. However, reading the mail I'm wondering if we're discussing
> the same problems.
>
> Deprecation of APIs aside, the main benefit of Spark Connect is that the
> contract is explicitly not a Jar file full of transitive dependencies (and
> discoverable internal APIs) but rather the contract established via the
> proto messages and RPCs. If you compare this, for example, to the R
> integration, there is no need to embed some Go pieces with the JVM to make
> it work. No custom RMI protocol specific to the client language, but simply
> the same contract as, for example, PySpark uses. The physical contract is
> the protobuf and the logical contract is the dataframe API.
>
> This means that Spark Connect clients don't suffer a large part of the
> challenges that other tools built on top of Spark have, as there is no tight
> coupling between the driver JVM and the client.
>
> I'm happy to help establish clear guidance for contrib-style modules that
> operate with a different set of expectations but are developed by the Spark
> community under its guidelines.
>
> Martin
>
>
> On Thu 1. Jun 2023 at 12:41 Maciej  wrote:
>
>> Hi Martin,
>>
>>
>> On 5/30/23 11:50, Martin Grund wrote:
>> > I think it makes sense to split this discussion into two pieces. On
>> > the contribution side, my personal perspective is that these new clients
>> > are explicitly marked as experimental and unsupported until we deem them
>> > mature enough to be supported using the standard release process etc.
>> > However, the goal should be that the main contributors of these clients
>> > are aiming to follow the same release and maintenance schedule. I think
>> > we should encourage the community to contribute to the Spark Connect
>> > clients and as such we should explicitly not make it as hard as possible
>> > to get started (and for that reason reserve the right to abandon).
>>
>> I know it sounds like a nitpicking, but we still have components
>> deprecated in 1.2 or 1.3, not to mention subprojects that haven't been
>> developed for years.  So, there is a huge gap between reserving a right and
>> actually exercising it when needed. If such a right is to be used
>> differently for Spark Connect bindings, it's something that should be
>> communicated upfront.
>> > How exactly the release schedule is going to look is going to require
>> > probably some experimentation because it's a new area for Spark and
>> > it's ecosystem. I don't think it requires us to have all answers upfront.
>>
>> Nonetheless, we should work towards establishing consensus around these
>> issues and documenting the answers. They affect not only the maintainers
>> (see for example a recent discussion about switching to a more predictable
>> release schedule) but also the users, for whom multiple APIs (including
>> their development status) have been a common source of confusion in the
>> past.
>> >> Also, an elephant in the room is the future of the current API in
>> >> Spark 4 and onwards. As useful as connect is, it is not exactly a
>> >> replacement for many existing deployments. Furthermore, it doesn't make
>> >> extending Spark much easier and the current ecosystem is, subjectively
>> >> speaking, a bit brittle.
>> >
>> > The goal of Spark Connect is not to replace the way users are
>> > currently deploying Spark, it's not meant to be that. Users should
>> > continue deploying Spark in exactly the way they prefer. Spark
>> > Connect allows bringing more interactivity and connectivity to Spark.
>> > While Spark Connect extends Spark, most new language consumers will
>> > not try to extend Spark, but simply provide the existing surface to
>> > their native language. So the goal is not so much extensibility but
>> > more availability. For example, I believe it would be awesome if the Livy
>> > community would find a way to integrate with Spark Connect to provide the
>> > routing capabilities to provide a stable DNS endpoint for all different
>> > Spark deployments.
>> >
>> >> [...] the current ecosystem is, subjectively speaking, a bit brittle.
>> >
>> > Can you help me understand that a bit better? Do you mean the Spark
>> > ecosystem or the Spark Connect ecosystem?
>>
>> I mean Spark in general. While most of the core and some closely related
>> projects are well maintained, tools built on top of Spark, even ones
>> supported by 

Re: [CONNECT] New Clients for Go and Rust

2023-05-31 Thread bo yang
Just saw the discussions here! Really appreciate Martin and other folks
helping on my previous Golang Spark Connect PR (
https://github.com/apache/spark/pull/41036)!

Great to see we have a new repo for the Spark Connect Go client.
Thanks Hyukjin!
I am thinking of migrating my PR to this new repo. I would like to hear any
feedback or suggestions before I make the new PR :)

Thanks,
Bo



On Tue, May 30, 2023 at 3:38 AM Martin Grund 
wrote:

> Hi folks,
>
> Thanks a lot to the help form Hykjin! We've create the
> https://github.com/apache/spark-connect-go as the first contrib
> repository for Spark Connect under the Apache Spark project. We will move
> the development of the Golang client to this repository and make it very
> clear from the README file that this is an experimental client.
>
> Looking forward to all your contributions!
>
> On Tue, May 30, 2023 at 11:50 AM Martin Grund 
> wrote:
>
>> I think it makes sense to split this discussion into two pieces. On the
>> contribution side, my personal perspective is that these new clients are
>> explicitly marked as experimental and unsupported until we deem them mature
>> enough to be supported using the standard release process etc. However, the
>> goal should be that the main contributors of these clients are aiming to
>> follow the same release and maintenance schedule. I think we should
>> encourage the community to contribute to the Spark Connect clients and as
>> such we should explicitly not make it as hard as possible to get started
>> (and for that reason reserve the right to abandon).
>>
>> How exactly the release schedule is going to look is going to require
>> probably some experimentation because it's a new area for Spark and it's
>> ecosystem. I don't think it requires us to have all answers upfront.
>>
>> > Also, an elephant in the room is the future of the current API in Spark
>> 4 and onwards. As useful as connect is, it is not exactly a replacement for
>> many existing deployments. Furthermore, it doesn't make extending Spark
>> much easier and the current ecosystem is, subjectively speaking, a bit
>> brittle.
>>
>> The goal of Spark Connect is not to replace the way users are currently
>> deploying Spark, it's not meant to be that. Users should continue deploying
>> Spark in exactly the way they prefer. Spark Connect allows bringing more
>> interactivity and connectivity to Spark. While Spark Connect extends Spark,
>> most new language consumers will not try to extend Spark, but simply
>> provide the existing surface to their native language. So the goal is not
>> so much extensibility but more availability. For example, I believe it
>> would be awesome if the Livy community would find a way to integrate with
>> Spark Connect to provide the routing capabilities to provide a stable DNS
>> endpoint for all different Spark deployments.
>>
>> > [...] the current ecosystem is, subjectively speaking, a bit brittle.
>>
>> Can you help me understand that a bit better? Do you mean the Spark
>> ecosystem or the Spark Connect ecosystem?
>>
>>
>>
>> Martin
>>
>>
>> On Fri, May 26, 2023 at 5:39 PM Maciej  wrote:
>>
>>> It might be a good idea to have a discussion about how new connect
>>> clients fit into the overall process we have. In particular:
>>>
>>>
>>>- Under what conditions do we consider adding a new language to the
>>>official channels?  What process do we follow?
>>>- What guarantees do we offer in respect to these clients? Is adding
>>>a new client the same type of commitment as for the core API? In other
>>>words, do we commit to maintaining such clients "forever" or do we 
>>> separate
>>>the "official" and "contrib" clients, with the later being governed by 
>>> the
>>>ASF, but not guaranteed to be maintained in the future?
>>>- Do we follow the same release schedule as for the core project, or
>>>rather release each client separately, after the main release is 
>>> completed?
>>>
>>> Also, an elephant in the room is the future of the current API in Spark
>>> 4 and onwards. As useful as connect is, it is not exactly a replacement for
>>> many existing deployments. Furthermore, it doesn't make extending Spark
>>> much easier and the current ecosystem is, subjectively speaking, a bit
>>> brittle.
>>>
>>> --
>>> Best regards,
>>> Maciej
>>>
>>>
>>> On 5/26/23 07:26, Martin Grund wrote:
>>>
>>> Thanks everyone for your feedback! I will work on figuring out what it
>>> takes to get started with a repo for the go client.
>>>
>>> On Thu 25. May 2023 at 21:51 Chao Sun  wrote:
>>>
 +1 on separate repo too

 On Thu, May 25, 2023 at 12:43 PM Dongjoon Hyun 
 wrote:
 >
 > +1 for starting on a separate repo.
 >
 > Dongjoon.
 >
 > On Thu, May 25, 2023 at 9:53 AM yangjie01 
 wrote:
 >>
>> +1 on starting this with a separate repo.
 >>
 >> Which new clients can be placed in the main repo should be discussed
 after they are mature enough,
 >>
 >>
 

Re: How can I get the same spark context in two different python processes

2022-12-12 Thread bo yang
In theory, maybe a Jupyter notebook or something similar could achieve
this? e.g. running some Jypyter kernel inside Spark driver, then another
Python process could connect to that kernel.

But in the end, this is like Spark Connect :)


On Mon, Dec 12, 2022 at 2:55 PM Kevin Su  wrote:

> Also, is there any way to workaround this issue without using Spark
> connect?
>
> On Mon, Dec 12, 2022 at 2:52 PM Kevin Su  wrote:
>
>> nvm, I found the ticket.
>> Also, is there any way to workaround this issue without using Spark
>> connect?
>>
>> On Mon, Dec 12, 2022 at 2:42 PM Kevin Su  wrote:
>>
>>> Thanks for the quick response? Do we have any PR or Jira ticket for it?
>>>
>>> On Mon, Dec 12, 2022 at 2:39 PM Reynold Xin  wrote:
>>>
 Spark Connect :)

 (It’s work in progress)


 On Mon, Dec 12 2022 at 2:29 PM, Kevin Su  wrote:

> Hey there, How can I get the same spark context in two different
> python processes?
> Let’s say I create a context in Process A, and then I want to use
> python subprocess B to get the spark context created by Process A. How can
> I achieve that?
>
> I've
> tried pyspark.sql.SparkSession.builder.appName("spark").getOrCreate(), but
> it will create a new spark context.
>



Re: Reverse proxy for Spark UI on Kubernetes

2022-05-17 Thread bo yang
Yes, it should be possible. Any interest in working on this together? We
need more hands to add more features here :)

On Tue, May 17, 2022 at 2:06 PM Holden Karau  wrote:

> Could we make it do the same sort of history server fallback approach?
>
> On Tue, May 17, 2022 at 10:41 PM bo yang  wrote:
>
>> It is like Web Application Proxy in YARN (
>> https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/WebApplicationProxy.html),
>> to provide easy access for Spark UI when the Spark application is running.
>>
>> When running Spark on Kubernetes with S3, there is no YARN. The reverse
>> proxy here is to behave like that Web Application Proxy. It will
>> simplify settings to access Spark UI on Kubernetes.
>>
>>
>> On Mon, May 16, 2022 at 11:46 PM wilson  wrote:
>>
>>> what's the advantage of using reverse proxy for spark UI?
>>>
>>> Thanks
>>>
>>> On Tue, May 17, 2022 at 1:47 PM bo yang  wrote:
>>>
>>>> Hi Spark Folks,
>>>>
>>>> I built a web reverse proxy to access Spark UI on Kubernetes (working
>>>> together with
>>>> https://github.com/GoogleCloudPlatform/spark-on-k8s-operator). Want to
>>>> share here in case other people have similar need.
>>>>
>>>> The reverse proxy code is here:
>>>> https://github.com/datapunchorg/spark-ui-reverse-proxy
>>>>
>>>> Let me know if anyone wants to use or would like to contribute.
>>>>
>>>> Thanks,
>>>> Bo
>>>>
>>>> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


Re: Reverse proxy for Spark UI on Kubernetes

2022-05-17 Thread bo yang
It is like Web Application Proxy in YARN (
https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/WebApplicationProxy.html),
to provide easy access for Spark UI when the Spark application is running.

When running Spark on Kubernetes with S3, there is no YARN. The reverse
proxy here is to behave like that Web Application Proxy. It will
simplify settings to access Spark UI on Kubernetes.
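
To make the routing idea concrete, here is a minimal sketch in Go of a
path-based reverse proxy for Spark UIs; it is not the actual
spark-ui-reverse-proxy code, and the /sparkui/<app-name> URL layout plus the
per-application UI service name are illustrative assumptions.

package main

import (
    "log"
    "net/http"
    "net/http/httputil"
    "net/url"
    "strings"
)

func main() {
    http.HandleFunc("/sparkui/", func(w http.ResponseWriter, r *http.Request) {
        // Assumed path layout: /sparkui/<app-name>/<spark-ui-path>
        parts := strings.SplitN(strings.TrimPrefix(r.URL.Path, "/sparkui/"), "/", 2)
        if parts[0] == "" {
            http.Error(w, "missing application name", http.StatusBadRequest)
            return
        }

        // Assume each running application exposes its driver UI through a
        // per-application Kubernetes service on port 4040 (illustrative naming).
        target := &url.URL{
            Scheme: "http",
            Host:   parts[0] + "-ui-svc.spark.svc.cluster.local:4040",
        }
        proxy := httputil.NewSingleHostReverseProxy(target)

        // Strip the /sparkui/<app-name> prefix before forwarding to the Spark UI.
        r.URL.Path = "/"
        if len(parts) == 2 {
            r.URL.Path += parts[1]
        }
        proxy.ServeHTTP(w, r)
    })

    log.Fatal(http.ListenAndServe(":8080", nil))
}

A real deployment would also need to handle the Spark UI's redirects and
relative links, which a sketch like this leaves out.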


On Mon, May 16, 2022 at 11:46 PM wilson  wrote:

> what's the advantage of using reverse proxy for spark UI?
>
> Thanks
>
> On Tue, May 17, 2022 at 1:47 PM bo yang  wrote:
>
>> Hi Spark Folks,
>>
>> I built a web reverse proxy to access Spark UI on Kubernetes (working
>> together with
>> https://github.com/GoogleCloudPlatform/spark-on-k8s-operator). Want to
>> share here in case other people have similar need.
>>
>> The reverse proxy code is here:
>> https://github.com/datapunchorg/spark-ui-reverse-proxy
>>
>> Let me know if anyone wants to use or would like to contribute.
>>
>> Thanks,
>> Bo
>>
>>


Re: Reverse proxy for Spark UI on Kubernetes

2022-05-17 Thread bo yang
Thanks Holden :)

On Mon, May 16, 2022 at 11:12 PM Holden Karau  wrote:

> Oh that’s rad 
>
> On Tue, May 17, 2022 at 7:47 AM bo yang  wrote:
>
>> Hi Spark Folks,
>>
>> I built a web reverse proxy to access Spark UI on Kubernetes (working
>> together with
>> https://github.com/GoogleCloudPlatform/spark-on-k8s-operator). Want to
>> share here in case other people have similar need.
>>
>> The reverse proxy code is here:
>> https://github.com/datapunchorg/spark-ui-reverse-proxy
>>
>> Let me know if anyone wants to use or would like to contribute.
>>
>> Thanks,
>> Bo
>>
>> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


Reverse proxy for Spark UI on Kubernetes

2022-05-16 Thread bo yang
Hi Spark Folks,

I built a web reverse proxy to access Spark UI on Kubernetes (working
together with https://github.com/GoogleCloudPlatform/spark-on-k8s-operator).
Want to share here in case other people have similar need.

The reverse proxy code is here:
https://github.com/datapunchorg/spark-ui-reverse-proxy

Let me know if anyone wants to use or would like to contribute.

Thanks,
Bo


Re: One click to run Spark on Kubernetes

2022-02-23 Thread bo yang
It uses Helm to deploy Spark Operator and Nginx. For other parts like
creating EKS, IAM role, node group, etc, it uses AWS SDK to provision those
AWS resources.
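
As a rough illustration of the "AWS SDK" part, a provisioning step could look
something like the following sketch with the AWS SDK for Go; the cluster name,
IAM role ARN, region, and subnets are placeholders, and this is not the punch
project's actual code.

package main

import (
    "fmt"
    "log"

    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/eks"
)

func main() {
    // Shared AWS session; the region is a placeholder.
    sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("us-west-2")}))
    svc := eks.New(sess)

    // Create the EKS control plane; the node group, IAM role, and NGINX
    // ingress would be provisioned in similar follow-up steps.
    out, err := svc.CreateCluster(&eks.CreateClusterInput{
        Name:    aws.String("spark-eks-demo"),                            // placeholder cluster name
        RoleArn: aws.String("arn:aws:iam::123456789012:role/eksCluster"), // placeholder IAM role
        ResourcesVpcConfig: &eks.VpcConfigRequest{
            SubnetIds: aws.StringSlice([]string{"subnet-aaaa", "subnet-bbbb"}), // placeholder subnets
        },
    })
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println("creating EKS cluster:", aws.StringValue(out.Cluster.Name), aws.StringValue(out.Cluster.Status))
}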

On Wed, Feb 23, 2022 at 11:28 AM Bjørn Jørgensen 
wrote:

> So if I get this right you will make a Helm <https://helm.sh> chart to
> deploy Spark and some other stuff on K8S?
>
> ons. 23. feb. 2022 kl. 17:49 skrev bo yang :
>
>> Hi Sarath, let's follow up offline on this.
>>
>> On Wed, Feb 23, 2022 at 8:32 AM Sarath Annareddy <
>> sarath.annare...@gmail.com> wrote:
>>
>>> Hi bo
>>>
>>> How do we start?
>>>
>>> Is there a plan? Onboarding, Arch/design diagram, tasks lined up etc
>>>
>>>
>>> Thanks
>>> Sarath
>>>
>>>
>>> Sent from my iPhone
>>>
>>> On Feb 23, 2022, at 10:27 AM, bo yang  wrote:
>>>
>>> 
>>> Hi Sarath, thanks for your interest and willing to contribute! The
>>> project supports local development using MiniKube. Similarly there is a one
>>> click command with one extra argument to deploy all components in MiniKube,
>>> and people could use that to develop on their local MacBook.
>>>
>>>
>>> On Wed, Feb 23, 2022 at 7:41 AM Sarath Annareddy <
>>> sarath.annare...@gmail.com> wrote:
>>>
>>>> Hi bo
>>>>
>>>> I am interested to contribute.
>>>> But I don’t have free access to any cloud provider. Not sure how I can
>>>> get free access. I know Google, aws, azure only provides temp free access,
>>>> it may not be sufficient.
>>>>
>>>> Guidance is appreciated.
>>>>
>>>> Sarath
>>>>
>>>> Sent from my iPhone
>>>>
>>>> On Feb 23, 2022, at 2:01 AM, bo yang  wrote:
>>>>
>>>> 
>>>>
>>>> Right, normally people start with simple script, then add more stuff,
>>>> like permission and more components. After some time, people want to run
>>>> the script consistently in different environments. Things will become
>>>> complex.
>>>>
>>>> That is why we want to see whether people have interest for such a "one
>>>> click" tool to make things easy.
>>>>
>>>>
>>>> On Tue, Feb 22, 2022 at 11:31 PM Mich Talebzadeh <
>>>> mich.talebza...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> There are two distinct actions here; namely Deploy and Run.
>>>>>
>>>>> Deployment can be done by command line script with autoscaling. In the
>>>>> newer versions of Kubernetes you don't even need to specify the node
>>>>> types, you can leave it to the Kubernetes cluster  to scale up and down 
>>>>> and
>>>>> decide on node type.
>>>>>
>>>>> The second point is the running spark that you will need to submit.
>>>>> However, that depends on setting up access permission, use of service
>>>>> accounts, pulling the correct dockerfiles for the driver and the 
>>>>> executors.
>>>>> Those details add to the complexity.
>>>>>
>>>>> Thanks
>>>>>
>>>>>
>>>>>
>>>>>view my Linkedin profile
>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>
>>>>>
>>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>
>>>>>
>>>>>
>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>>> any loss, damage or destruction of data or any other property which may
>>>>> arise from relying on this email's technical content is explicitly
>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>> arising from such loss, damage or destruction.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Wed, 23 Feb 2022 at 04:06, bo yang  wrote:
>>>>>
>>>>>> Hi Spark Community,
>>>>>>
>>>>>> We built an open source tool to deploy and run Spark on Kubernetes
>>>>>> with a one click command. For example, on AWS, it could automatically
>>>>>> create an EKS cluster, node group, NGINX ingress, and Spark Operator. 
>>>>>> Then
>>>>>> you will be able to use curl or a CLI tool to submit Spark application.
>>>>>> After the deployment, you could also install Uber Remote Shuffle Service 
>>>>>> to
>>>>>> enable Dynamic Allocation on Kubernetes.
>>>>>>
>>>>>> Anyone interested in using or working together on such a tool?
>>>>>>
>>>>>> Thanks,
>>>>>> Bo
>>>>>>
>>>>>>
>
> --
> Bjørn Jørgensen
> Vestre Aspehaug 4, 6010 Ålesund
> Norge
>
> +47 480 94 297
>


Re: One click to run Spark on Kubernetes

2022-02-23 Thread bo yang
Hi Sarath, let's follow up offline on this.

On Wed, Feb 23, 2022 at 8:32 AM Sarath Annareddy 
wrote:

> Hi bo
>
> How do we start?
>
> Is there a plan? Onboarding, Arch/design diagram, tasks lined up etc
>
>
> Thanks
> Sarath
>
>
> Sent from my iPhone
>
> On Feb 23, 2022, at 10:27 AM, bo yang  wrote:
>
> 
> Hi Sarath, thanks for your interest and willing to contribute! The project
> supports local development using MiniKube. Similarly there is a one click
> command with one extra argument to deploy all components in MiniKube, and
> people could use that to develop on their local MacBook.
>
>
> On Wed, Feb 23, 2022 at 7:41 AM Sarath Annareddy <
> sarath.annare...@gmail.com> wrote:
>
>> Hi bo
>>
>> I am interested to contribute.
>> But I don’t have free access to any cloud provider. Not sure how I can
>> get free access. I know Google, aws, azure only provides temp free access,
>> it may not be sufficient.
>>
>> Guidance is appreciated.
>>
>> Sarath
>>
>> Sent from my iPhone
>>
>> On Feb 23, 2022, at 2:01 AM, bo yang  wrote:
>>
>> 
>>
>> Right, normally people start with simple script, then add more stuff,
>> like permission and more components. After some time, people want to run
>> the script consistently in different environments. Things will become
>> complex.
>>
>> That is why we want to see whether people have interest for such a "one
>> click" tool to make things easy.
>>
>>
>> On Tue, Feb 22, 2022 at 11:31 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> There are two distinct actions here; namely Deploy and Run.
>>>
>>> Deployment can be done by command line script with autoscaling. In the
>>> newer versions of Kubernetes you don't even need to specify the node
>>> types, you can leave it to the Kubernetes cluster  to scale up and down and
>>> decide on node type.
>>>
>>> The second point is the running spark that you will need to submit.
>>> However, that depends on setting up access permission, use of service
>>> accounts, pulling the correct dockerfiles for the driver and the executors.
>>> Those details add to the complexity.
>>>
>>> Thanks
>>>
>>>
>>>
>>>view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Wed, 23 Feb 2022 at 04:06, bo yang  wrote:
>>>
>>>> Hi Spark Community,
>>>>
>>>> We built an open source tool to deploy and run Spark on Kubernetes with
>>>> a one click command. For example, on AWS, it could automatically create an
>>>> EKS cluster, node group, NGINX ingress, and Spark Operator. Then you will
>>>> be able to use curl or a CLI tool to submit Spark application. After the
>>>> deployment, you could also install Uber Remote Shuffle Service to enable
>>>> Dynamic Allocation on Kubernetes.
>>>>
>>>> Anyone interested in using or working together on such a tool?
>>>>
>>>> Thanks,
>>>> Bo
>>>>
>>>>


Re: One click to run Spark on Kubernetes

2022-02-23 Thread bo yang
Hi Sarath, thanks for your interest and willing to contribute! The project
supports local development using MiniKube. Similarly there is a one click
command with one extra argument to deploy all components in MiniKube, and
people could use that to develop on their local MacBook.


On Wed, Feb 23, 2022 at 7:41 AM Sarath Annareddy 
wrote:

> Hi bo
>
> I am interested to contribute.
> But I don’t have free access to any cloud provider. Not sure how I can get
> free access. I know Google, aws, azure only provides temp free access, it
> may not be sufficient.
>
> Guidance is appreciated.
>
> Sarath
>
> Sent from my iPhone
>
> On Feb 23, 2022, at 2:01 AM, bo yang  wrote:
>
> 
>
> Right, normally people start with simple script, then add more stuff, like
> permission and more components. After some time, people want to run the
> script consistently in different environments. Things will become complex.
>
> That is why we want to see whether people have interest for such a "one
> click" tool to make things easy.
>
>
> On Tue, Feb 22, 2022 at 11:31 PM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Hi,
>>
>> There are two distinct actions here; namely Deploy and Run.
>>
>> Deployment can be done by command line script with autoscaling. In the
>> newer versions of Kubernetes you don't even need to specify the node
>> types, you can leave it to the Kubernetes cluster  to scale up and down and
>> decide on node type.
>>
>> The second point is the running spark that you will need to submit.
>> However, that depends on setting up access permission, use of service
>> accounts, pulling the correct dockerfiles for the driver and the executors.
>> Those details add to the complexity.
>>
>> Thanks
>>
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Wed, 23 Feb 2022 at 04:06, bo yang  wrote:
>>
>>> Hi Spark Community,
>>>
>>> We built an open source tool to deploy and run Spark on Kubernetes with
>>> a one click command. For example, on AWS, it could automatically create an
>>> EKS cluster, node group, NGINX ingress, and Spark Operator. Then you will
>>> be able to use curl or a CLI tool to submit Spark application. After the
>>> deployment, you could also install Uber Remote Shuffle Service to enable
>>> Dynamic Allocation on Kubernetes.
>>>
>>> Anyone interested in using or working together on such a tool?
>>>
>>> Thanks,
>>> Bo
>>>
>>>


Re: One click to run Spark on Kubernetes

2022-02-23 Thread bo yang
Right, normally people start with simple script, then add more stuff, like
permission and more components. After some time, people want to run the
script consistently in different environments. Things will become complex.

That is why we want to see whether people have interest for such a "one
click" tool to make things easy.


On Tue, Feb 22, 2022 at 11:31 PM Mich Talebzadeh 
wrote:

> Hi,
>
> There are two distinct actions here; namely Deploy and Run.
>
> Deployment can be done by command line script with autoscaling. In the
> newer versions of Kubernetes you don't even need to specify the node
> types, you can leave it to the Kubernetes cluster  to scale up and down and
> decide on node type.
>
> The second point is the running spark that you will need to submit.
> However, that depends on setting up access permission, use of service
> accounts, pulling the correct dockerfiles for the driver and the executors.
> Those details add to the complexity.
>
> Thanks
>
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Wed, 23 Feb 2022 at 04:06, bo yang  wrote:
>
>> Hi Spark Community,
>>
>> We built an open source tool to deploy and run Spark on Kubernetes with a
>> one click command. For example, on AWS, it could automatically create an
>> EKS cluster, node group, NGINX ingress, and Spark Operator. Then you will
>> be able to use curl or a CLI tool to submit Spark application. After the
>> deployment, you could also install Uber Remote Shuffle Service to enable
>> Dynamic Allocation on Kubernetes.
>>
>> Anyone interested in using or working together on such a tool?
>>
>> Thanks,
>> Bo
>>
>>


Re: One click to run Spark on Kubernetes

2022-02-22 Thread bo yang
Merging another email from Prasad. It could co-exist with Livy. Livy is
similar to the REST Service + Spark Operator. Unfortunately Livy is not
very active right now.

To Amihay, the link is: https://github.com/datapunchorg/punch.

On Tue, Feb 22, 2022 at 8:53 PM amihay gonen  wrote:

> Can you share link to the source?
>
> On Wed, Feb 23, 2022 at 6:52, bo yang wrote:
>
>> We do not have SaaS yet. Now it is an open source project we build in our
>> part time , and we welcome more people working together on that.
>>
>> You could specify cluster size (EC2 instance type and number of
>> instances) and run it for 1 hour. Then you could run one click command to
>> destroy the cluster. It is possible to merge these steps as well, and
>> provide a "serverless" experience. That is in our TODO list :)
>>
>>
>> On Tue, Feb 22, 2022 at 8:36 PM Bitfox  wrote:
>>
>>> How can I specify the cluster memory and cores?
>>> For instance, I want to run a job with 16 cores and 300 GB memory for
>>> about 1 hour. Do you have the SaaS solution for this? I can pay as I did.
>>>
>>> Thanks
>>>
>>> On Wed, Feb 23, 2022 at 12:21 PM bo yang  wrote:
>>>
>>>> It is not a standalone spark cluster. In some details, it deploys a
>>>> Spark Operator (
>>>> https://github.com/GoogleCloudPlatform/spark-on-k8s-operator) and an
>>>> extra REST Service. When people submit Spark application to that REST
>>>> Service, the REST Service will create a CRD inside the Kubernetes cluster.
>>>> Then Spark Operator will pick up the CRD and launch the Spark application.
>>>> The one click tool intends to hide these details, so people could just
>>>> submit Spark and do not need to deal with too many deployment details.
>>>>
>>>> On Tue, Feb 22, 2022 at 8:09 PM Bitfox  wrote:
>>>>
>>>>> Can it be a cluster installation of spark? or just the standalone node?
>>>>>
>>>>> Thanks
>>>>>
>>>>> On Wed, Feb 23, 2022 at 12:06 PM bo yang  wrote:
>>>>>
>>>>>> Hi Spark Community,
>>>>>>
>>>>>> We built an open source tool to deploy and run Spark on Kubernetes
>>>>>> with a one click command. For example, on AWS, it could automatically
>>>>>> create an EKS cluster, node group, NGINX ingress, and Spark Operator. 
>>>>>> Then
>>>>>> you will be able to use curl or a CLI tool to submit Spark application.
>>>>>> After the deployment, you could also install Uber Remote Shuffle Service 
>>>>>> to
>>>>>> enable Dynamic Allocation on Kubernetes.
>>>>>>
>>>>>> Anyone interested in using or working together on such a tool?
>>>>>>
>>>>>> Thanks,
>>>>>> Bo
>>>>>>
>>>>>>


Re: One click to run Spark on Kubernetes

2022-02-22 Thread bo yang
We do not have SaaS yet. Now it is an open source project we build in our
part time , and we welcome more people working together on that.

You could specify cluster size (EC2 instance type and number of instances)
and run it for 1 hour. Then you could run one click command to destroy the
cluster. It is possible to merge these steps as well, and provide a
"serverless" experience. That is in our TODO list :)


On Tue, Feb 22, 2022 at 8:36 PM Bitfox  wrote:

> How can I specify the cluster memory and cores?
> For instance, I want to run a job with 16 cores and 300 GB memory for
> about 1 hour. Do you have the SaaS solution for this? I can pay as I did.
>
> Thanks
>
> On Wed, Feb 23, 2022 at 12:21 PM bo yang  wrote:
>
>> It is not a standalone spark cluster. In some details, it deploys a Spark
>> Operator (https://github.com/GoogleCloudPlatform/spark-on-k8s-operator)
>> and an extra REST Service. When people submit Spark application to that
>> REST Service, the REST Service will create a CRD inside the
>> Kubernetes cluster. Then Spark Operator will pick up the CRD and launch the
>> Spark application. The one click tool intends to hide these details, so
>> people could just submit Spark and do not need to deal with too many
>> deployment details.
>>
>> On Tue, Feb 22, 2022 at 8:09 PM Bitfox  wrote:
>>
>>> Can it be a cluster installation of spark? or just the standalone node?
>>>
>>> Thanks
>>>
>>> On Wed, Feb 23, 2022 at 12:06 PM bo yang  wrote:
>>>
>>>> Hi Spark Community,
>>>>
>>>> We built an open source tool to deploy and run Spark on Kubernetes with
>>>> a one click command. For example, on AWS, it could automatically create an
>>>> EKS cluster, node group, NGINX ingress, and Spark Operator. Then you will
>>>> be able to use curl or a CLI tool to submit Spark application. After the
>>>> deployment, you could also install Uber Remote Shuffle Service to enable
>>>> Dynamic Allocation on Kubernetes.
>>>>
>>>> Anyone interested in using or working together on such a tool?
>>>>
>>>> Thanks,
>>>> Bo
>>>>
>>>>


Re: One click to run Spark on Kubernetes

2022-02-22 Thread bo yang
It is not a standalone spark cluster. In some details, it deploys a Spark
Operator (https://github.com/GoogleCloudPlatform/spark-on-k8s-operator) and
an extra REST Service. When people submit Spark application to that REST
Service, the REST Service will create a CRD inside the Kubernetes cluster.
Then Spark Operator will pick up the CRD and launch the Spark application.
The one click tool intends to hide these details, so people could just
submit Spark and do not need to deal with too many deployment details.
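
To make the "REST Service creates a CRD" step concrete, below is a hedged
sketch in Go of creating a SparkApplication custom resource (the CRD managed
by the Spark Operator) with the Kubernetes dynamic client; the namespace,
image, and spec fields are illustrative, the driver/executor sections are
omitted, and this is not the tool's actual implementation.

package main

import (
    "context"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
    "k8s.io/apimachinery/pkg/runtime/schema"
    "k8s.io/client-go/dynamic"
    "k8s.io/client-go/rest"
)

func main() {
    // Assume the REST service runs inside the cluster.
    config, err := rest.InClusterConfig()
    if err != nil {
        panic(err)
    }
    client, err := dynamic.NewForConfig(config)
    if err != nil {
        panic(err)
    }

    // GroupVersionResource of the SparkApplication CRD installed by the Spark Operator.
    gvr := schema.GroupVersionResource{
        Group:    "sparkoperator.k8s.io",
        Version:  "v1beta2",
        Resource: "sparkapplications",
    }

    app := &unstructured.Unstructured{Object: map[string]interface{}{
        "apiVersion": "sparkoperator.k8s.io/v1beta2",
        "kind":       "SparkApplication",
        "metadata":   map[string]interface{}{"name": "spark-pi", "namespace": "spark"},
        "spec": map[string]interface{}{
            "type":                "Scala",
            "mode":                "cluster",
            "image":               "apache/spark:3.2.1",                                  // assumed image
            "mainClass":           "org.apache.spark.examples.SparkPi",
            "mainApplicationFile": "local:///opt/spark/examples/jars/spark-examples.jar", // assumed path
            "sparkVersion":        "3.2.1",
        },
    }}

    // The Spark Operator watches this resource and launches the driver and executor pods.
    _, err = client.Resource(gvr).Namespace("spark").Create(context.TODO(), app, metav1.CreateOptions{})
    if err != nil {
        panic(err)
    }
}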

On Tue, Feb 22, 2022 at 8:09 PM Bitfox  wrote:

> Can it be a cluster installation of spark? or just the standalone node?
>
> Thanks
>
> On Wed, Feb 23, 2022 at 12:06 PM bo yang  wrote:
>
>> Hi Spark Community,
>>
>> We built an open source tool to deploy and run Spark on Kubernetes with a
>> one click command. For example, on AWS, it could automatically create an
>> EKS cluster, node group, NGINX ingress, and Spark Operator. Then you will
>> be able to use curl or a CLI tool to submit Spark application. After the
>> deployment, you could also install Uber Remote Shuffle Service to enable
>> Dynamic Allocation on Kubernetes.
>>
>> Anyone interested in using or working together on such a tool?
>>
>> Thanks,
>> Bo
>>
>>


One click to run Spark on Kubernetes

2022-02-22 Thread bo yang
Hi Spark Community,

We built an open source tool to deploy and run Spark on Kubernetes with a
one click command. For example, on AWS, it could automatically create an
EKS cluster, node group, NGINX ingress, and Spark Operator. Then you will
be able to use curl or a CLI tool to submit Spark application. After the
deployment, you could also install Uber Remote Shuffle Service to enable
Dynamic Allocation on Kubernetes.

Anyone interested in using or working together on such a tool?

Thanks,
Bo


Re: [VOTE][SPIP] Support Customized Kubernetes Schedulers Proposal

2022-01-05 Thread bo yang
+1 (non-binding)

On Wed, Jan 5, 2022 at 11:01 PM Holden Karau  wrote:

> +1 (binding)
>
> On Wed, Jan 5, 2022 at 5:31 PM William Wang 
> wrote:
>
>> +1 (non-binding)
>>
>> Yikun Jiang  于2022年1月6日周四 09:07写道:
>>
>>> Hi all,
>>>
>>> I’d like to start a vote for SPIP: "Support Customized Kubernetes
>>> Schedulers Proposal"
>>>
>>> The SPIP is to support customized Kubernetes schedulers in Spark on
>>> Kubernetes.
>>>
>>> Please also refer to:
>>>
>>> - Previous discussion in dev mailing list: [DISCUSSION] SPIP: Support
>>> Volcano/Alternative Schedulers Proposal
>>> 
>>> - Design doc: [SPIP] Spark-36057 Support Customized Kubernetes
>>> Schedulers Proposal
>>> 
>>> - JIRA: SPARK-36057 
>>>
>>> Please vote on the SPIP:
>>>
>>> [ ] +1: Accept the proposal as an official SPIP
>>> [ ] +0
>>> [ ] -1: I don’t think this is a good idea because …
>>>
>>> Regards,
>>> Yikun
>>>
>>>
>>>
>>
>> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
>
>


Re: [DISCUSSION] SPIP: Support Volcano/Alternative Schedulers Proposal

2022-01-05 Thread bo yang
Hi Mich,

Curious what you mean by “The constraint seems to be that you can fit one
Spark executor pod per Kubernetes node and from my tests you don't seem to
be able to allocate more than 50% of RAM on the node to the container”.
Would you help explain a bit? Asking because there could be
multiple executor pods running on a single Kubernetes node.
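
As a rough illustration of packing several executors on one node (the numbers
below are assumptions, not recommendations): a pod's memory request is roughly
executor memory plus memory overhead, so keeping that sum well under the node's
allocatable memory lets the scheduler place multiple executor pods per node.

// Sketch: size each executor pod so that two or more fit on one worker node.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.instances", "4")
  .set("spark.executor.memory", "12g")           // JVM heap per executor
  .set("spark.executor.memoryOverhead", "2g")    // non-heap memory added to the pod request
  .set("spark.kubernetes.executor.request.cores", "3")  // CPU request per executor pod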

Thanks,
Bo


On Wed, Jan 5, 2022 at 1:13 AM Mich Talebzadeh 
wrote:

> Thanks William for the info.
>
>
>
>
>
> The current model of Spark on k8s has certain drawbacks with pod-based
> scheduling, as I tested it on Google Kubernetes Engine (GKE). The
> constraint seems to be that you can fit one Spark executor pod per
> Kubernetes node, and from my tests you don't seem to be able to allocate
> more than 50% of RAM on the node to the container.
>
>
> [image: gke_memoeyPlot.png]
>
>
> Any more results in the container never being created (stuck at pending):
>
> kubectl describe pod sparkbq-b506ac7dc521b667-driver -n spark
>
> Events:
>   Type     Reason             Age                   From                Message
>   ----     ------             ----                  ----                -------
>   Warning  FailedScheduling   17m                   default-scheduler   0/3 nodes are available: 3 Insufficient memory.
>   Warning  FailedScheduling   17m                   default-scheduler   0/3 nodes are available: 3 Insufficient memory.
>   Normal   NotTriggerScaleUp  2m28s (x92 over 17m)  cluster-autoscaler  pod didn't trigger scale-up:
>
> Obviously this is far from ideal, and this model, although it works, is not
> efficient.
>
>
> Cheers,
>
>
> Mich
>
>
>
>
>
>
>view my Linkedin profile
> 
>
>
>
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction
>
> of data or any other property which may arise from relying on this
> email's technical content is explicitly disclaimed.
>
> The author will in no case be liable for any monetary damages arising from
> such
>
> loss, damage or destruction.
>
>
>
>
>
>
>
>
> On Wed, 5 Jan 2022 at 03:55, William Wang  wrote:
>
>> Hi Mich,
>>
>> Here are some of the performance indications for Volcano:
>> 1. Scheduler throughput: 1.5k pods/s (default scheduler: 100 pods/s)
>> 2. Spark application performance improved 30%+ with the minimal resource
>> reservation feature in case of insufficient resources (tested with TPC-DS).
>>
>> We are still working on more optimizations. Besides performance,
>> Volcano is continuously enhanced in the four directions below to provide
>> abilities that users care about:
>> - Full lifecycle management for jobs
>> - Scheduling policies for high-performance workloads (fair-share,
>> topology, SLA, reservation, preemption, backfill, etc.)
>> - Support for heterogeneous hardware
>> - Performance optimization for high-performance workloads
>>
>> Thanks
>> LeiBo
>>
>> Mich Talebzadeh  于2022年1月4日周二 18:12写道:
>>
> Interesting, thanks
>>>
>>> Do you have any indication of the ballpark figure (a rough numerical
>>> estimate) of how much adding Volcano as an alternative scheduler is going to
>>> improve Spark on k8s performance?
>>>
>>> Thanks
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>view my Linkedin profile
>>> 
>>>
>>>
>>>
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction
>>>
>>> of data or any other property which may arise from relying on this
>>> email's technical content is explicitly disclaimed.
>>>
>>> The author will in no case be liable for any monetary damages arising
>>> from such
>>>
>>> loss, damage or destruction.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Tue, 4 Jan 2022 at 09:43, Yikun Jiang  wrote:
>>>
 Hi, folks! Wishing you all the best in 2022.

 I'd like to share the current status on "Support Customized K8S
 Scheduler in Spark".


 https://docs.google.com/document/d/1xgQGRpaHQX6-QH_J9YV2C2Dh6RpXefUpLM7KGkzL6Fg/edit#heading=h.1quyr1r2kr5n

 Framework/Common support

 - The Volcano and Yunikorn teams joined the discussion and completed the
 initial doc on the framework/common part.

 - SPARK-37145 (under review): We proposed to extend customized scheduler
 support by just using a custom feature step; it will meet the requirement of
 a customized scheduler after it gets merged. After this, the user can enable
 the feature step and scheduler like:

 spark-submit \
   --conf spark.kubernetes.scheduler.name=volcano \
   --conf spark.kubernetes.driver.pod.featureSteps=org.apache.spark.deploy.k8s.features.scheduler.VolcanoFeatureStep \
   --conf spark.kubernetes.job.queue=xxx

 (As above, the VolcanoFeatureStep will help set the Spark scheduler queue
 according to the user-specified conf.)
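
For context, a rough sketch of what such a scheduler feature step could look
like. The feature step interface (KubernetesFeatureConfigStep, SparkPod) was a
Spark-internal API at the time of this thread, which is exactly what
SPARK-37145 proposes to open up, so the signatures and the annotation key
below are assumptions, not a drop-in implementation:

// Sketch only: routes pods to a custom scheduler and tags them with a queue.
import io.fabric8.kubernetes.api.model.PodBuilder
import org.apache.spark.SparkConf
import org.apache.spark.deploy.k8s.SparkPod
import org.apache.spark.deploy.k8s.features.KubernetesFeatureConfigStep

class VolcanoLikeFeatureStep(sparkConf: SparkConf) extends KubernetesFeatureConfigStep {
  override def configurePod(pod: SparkPod): SparkPod = {
    val schedulerName = sparkConf.get("spark.kubernetes.scheduler.name", "volcano")
    val queue = sparkConf.get("spark.kubernetes.job.queue", "default")
    val newPod = new PodBuilder(pod.pod)
      .editOrNewSpec()
        .withSchedulerName(schedulerName)                  // hand the pod to the custom scheduler
      .endSpec()
      .editOrNewMetadata()
        .addToAnnotations("volcano.sh/queue-name", queue)  // illustrative annotation key
      .endMetadata()
      .build()
    SparkPod(newPod, pod.container)
  }
}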

Re: Apache Spark 3.2 Expectation

2021-02-28 Thread bo yang
+1 for better support for disaggregated shuffle (push-based shuffle is a
great example; there are also the Facebook shuffle service
and the Uber remote shuffle service). There were previously some
community sync-up meetings on this, but they were discontinued. Are people
interested in continuing the sync-up meetings on this?

On Fri, Feb 26, 2021 at 6:41 PM Yi Wu  wrote:

> +1 to continue the incomplete push-based shuffle work.
>
> --
> Yi
>
> On Fri, Feb 26, 2021 at 1:26 AM Mridul Muralidharan 
> wrote:
>
>>
>>
>> Nit: Java 17 -> should be available by Sept 2021 :-)
>> Adoption would also depend on some of our nontrivial dependencies
>> supporting it - it might be a stretch to get it in for Apache Spark 3.2 ?
>>
>> Features:
>> Push based shuffle and disaggregated shuffle should also be in 3.2
>>
>>
>> Regards,
>> Mridul
>>
>>
>>
>>
>>
>>
>> On Thu, Feb 25, 2021 at 10:57 AM Dongjoon Hyun 
>> wrote:
>>
>>> Hi, All.
>>>
>>> Since we have been preparing Apache Spark 3.2.0 in master branch since
>>> December 2020, March seems to be a good time to share our thoughts and
>>> aspirations on Apache Spark 3.2.
>>>
>>> According to the progress on Apache Spark 3.1 release, Apache Spark 3.2
>>> seems to be the last minor release of this year. Given the timeframe, we
>>> might consider the following. (This is a small set. Please add your
>>> thoughts to this limited list.)
>>>
>>> # Languages
>>>
>>> - Scala 2.13 Support: This was expected on 3.1 via SPARK-25075 but
>>> slipped out. Currently, we are trying to use Scala 2.13.5 via SPARK-34505
>>> and investigating the publishing issue. Thank you for your contributions
>>> and feedback on this.
>>>
>>> - Java 17 LTS Support: Java 17 LTS will arrive in September 2017. Like
>>> Java 11, we need lots of support from our dependencies. Let's see.
>>>
>>> - Python 3.6 Deprecation(?): Python 3.6 community support ends at
>>> 2021-12-23. So, the deprecation is not required yet, but we had better
>>> prepare it because we don't have an ETA of Apache Spark 3.3 in 2022.
>>>
>>> - SparkR CRAN publishing: As we know, it's discontinued so far. Resuming
>>> it depends on the success of Apache SparkR 3.1.1 CRAN publishing. If that
>>> succeeds in reviving it, we can keep publishing. Otherwise, I believe we had
>>> better officially drop it from the release work item list.
>>>
>>> # Dependencies
>>>
>>> - Apache Hadoop 3.3.2: Hadoop 3.2.0 becomes the default Hadoop profile
>>> in Apache Spark 3.1. Currently, Spark master branch lives on Hadoop 3.2.2's
>>> shaded clients via SPARK-33212. So far, there is one on-going report at
>>> YARN environment. We hope it will be fixed soon at Spark 3.2 timeframe and
>>> we can move toward Hadoop 3.3.2.
>>>
>>> - Apache Hive 2.3.9: Spark 3.0 starts to use Hive 2.3.7 by default
>>> instead of old Hive 1.2 fork. Spark 3.1 removed hive-1.2 profile completely
>>> via SPARK-32981 and replaced the generated hive-service-rpc code with the
>>> official dependency via SPARK-32981. We are steadily improving this area
>>> and will consume Hive 2.3.9 if available.
>>>
>>> - K8s Client 4.13.2: During K8s GA activity, Spark 3.1 upgrades K8s
>>> client dependency to 4.12.0. Spark 3.2 upgrades it to 4.13.2 in order to
>>> support K8s model 1.19.
>>>
>>> - Kafka Client 2.8: To bring the client fixes, Spark 3.1 is using Kafka
>>> Client 2.6. For Spark 3.2, SPARK-33913 upgraded to Kafka 2.7 with Scala
>>> 2.12.13, but it was reverted later due to Scala 2.12.13 issue. Since
>>> KAFKA-12357 fixed the Scala requirement two days ago, Spark 3.2 will go
>>> with Kafka Client 2.8 hopefully.
>>>
>>> # Some Features
>>>
>>> - Data Source v2: Spark 3.2 will deliver much richer DSv2 with Apache
>>> Iceberg integration. Especially, we hope the on-going function catalog SPIP
>>> and up-coming storage partitioned join SPIP can be delivered as a part of
>>> Spark 3.2 and become an additional foundation.
>>>
>>> - Columnar Encryption: As of today, Apache Spark master branch supports
>>> columnar encryption via Apache ORC 1.6 and it's documented via SPARK-34036.
>>> Also, upcoming Apache Parquet 1.12 has a similar capability. Hopefully,
>>> Apache Spark 3.2 is going to be the first release to have this feature
>>> officially. Any feedback is welcome.
>>>
>>> - Improved ZStandard Support: Spark 3.2 will bring more benefits for
>>> ZStandard users: 1) SPARK-34340 added native ZSTD JNI buffer pool support
>>> for all IO operations, 2) SPARK-33978 makes ORC datasource support ZSTD
>>> compression, 3) SPARK-34503 sets ZSTD as the default codec for event log
>>> compression, 4) SPARK-34479 aims to support ZSTD at Avro data source. Also,
>>> the upcoming Parquet 1.12 supports ZSTD (and supports JNI buffer pool),
>>> too. I'm expecting more benefits.
>>>
>>> - Structure Streaming with RocksDB backend: According to the latest
>>> update, it looks active enough for merging to master 

Re: Enabling fully disaggregated shuffle on Spark

2019-12-04 Thread bo yang
Thanks guys for the discussion in the email and also this afternoon!

From our experience, we do not need to change the Spark DAG scheduler to
implement a remote shuffle service. The current Spark shuffle manager
interfaces are pretty good and easy to implement. But we do feel the need
to modify MapStatus to make it more generic.

The current limitation with MapStatus is that it assumes *a map output only
exists on a single executor* (see the following). One easy update could be
making MapStatus support the scenario where *a map output could be on
multiple remote servers*.

private[spark] sealed trait MapStatus {
  def location: BlockManagerId
}

class BlockManagerId private (
    private var executorId_ : String,
    private var host_ : String,
    private var port_ : Int)

Also, MapStatus is a sealed trait, so our ShuffleManager plugin cannot
extend it with our own implementation. How about *making MapStatus a public,
non-sealed trait*, so that different ShuffleManager plugins could implement
their own MapStatus classes?
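
To make the idea concrete, a rough sketch of the generalization (illustrative
only, not Spark's actual API; the trait and class names are made up for this
example):

// A non-sealed map status whose output may live on several remote shuffle
// servers instead of exactly one executor's BlockManager.
import org.apache.spark.storage.BlockManagerId

case class ShuffleServerId(host: String, port: Int)

trait GeneralizedMapStatus {
  /** All servers that hold (replicas of) this map output. */
  def locations: Seq[ShuffleServerId]
  /** Estimated size of the block for the given reduce partition, in bytes. */
  def getSizeForBlock(reduceId: Int): Long
}

/** Adapter for the existing single-executor case. */
case class SingleExecutorMapStatus(
    blockManagerId: BlockManagerId,
    sizes: Array[Long]) extends GeneralizedMapStatus {
  override def locations: Seq[ShuffleServerId] =
    Seq(ShuffleServerId(blockManagerId.host, blockManagerId.port))
  override def getSizeForBlock(reduceId: Int): Long = sizes(reduceId)
}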

Best,
Bo

On Wed, Dec 4, 2019 at 3:27 PM Ben Sidhom  wrote:

> Hey Imran (and everybody who made it to the sync today):
>
> Thanks for the comments. Responses below:
>
> Scheduling and re-executing tasks
>>> Allow coordination between the service and the Spark DAG scheduler as to
>>> whether a given block/partition needs to be recomputed when a task fails or
>>> when shuffle block data cannot be read. Having such coordination is
>>> important, e.g., for suppressing recomputation after aborted executors or
>>> for forcing late recomputation if the service internally acts as a cache.
>>> One catchall solution is to have the shuffle manager provide an indication
>>> of whether shuffle data is external to executors (or nodes). Another
>>> option: allow the shuffle manager (likely on the driver) to be queried for
>>> the existence of shuffle data for a given executor ID (or perhaps map task,
>>> reduce task, etc). Note that this is at the level of data the scheduler is
>>> aware of (i.e., map/reduce partitions) rather than block IDs, which are
>>> internal details for some shuffle managers.
>>
>>
>> sounds reasonable, and I think @Matt Cheah  mentioned something like this
>> has come up with their work on SPARK-25299 and was going to be added even
>> for that work.  (of course, need to look at the actual proposal closely and
>> how it impacts the scheduler.)
>
>
> While this is something that was discussed before, it is not something
> that is *currently* in the scope of SPARK-25299. Given the number of
> parties who are doing async data pushes (either as a backup, as in the case
> of the proposal in SPARK-25299, or as the sole mechanism of data
> distribution), I expect this to be an issue at the forefront for many
> people. I have not yet written a specific proposal for how this should be
> done. Rather, I wanted to gauge how many others see this as an important
> issue and figure out the most reasonable solutions for the community as a
> whole. It sounds like people have been getting by this using hacks so far.
> I would be curious to hear what does and does not work well and which
> solutions we would be OK with in Spark upstream.
>
>
> ShuffleManager API
>>> Add a heartbeat (keep-alive) mechanism to RDD shuffle output so that the
>>> service knows that data is still active. This is one way to enable
>>> time-/job-scoped data because a disaggregated shuffle service cannot rely
>>> on robust communication with Spark and in general has a distinct lifecycle
>>> from the Spark deployment(s) it talks to. This would likely take the form
>>> of a callback on ShuffleManager itself, but there are other approaches.
>>
>>
>
> I believe this can already be done, but maybe its much uglier than it
>> needs to be (though I don't recall the details off the top of my head).
>
>
> As far as I'm aware, this would need to be added out-of-band, e.g., by the
> ShuffleManager itself firing off its own heartbeat thread(s) (on the
> driver, executors, or both). While obviously this is possible, it's also
> prone to leaks and puts more burden on shuffle implementations. In fact, I
> don't have a robust understanding of the lifecycle of the ShuffleManager
> object itself. IIRC (from some ad-hoc tests I did a while back), a new one
> is spawned on each executor itself (as opposed to being instantiated once
> on the driver and deserialized onto executors). If executor
> (ShuffleManager) instances do not receive shutdown hooks, shuffle
> implementations may be prone to resource leaks. Worse, if the behavior of
> ShuffleManager instantiation is not stable between Spark releases, there
> may be correctness issues due to intializers/constructors running in
> unexpected ways. Then you have the ShuffleManager instance used for
> registration. As far as I can tell, this runs on the driver, but might this
> be migrated between machines (either now or in future Spark releases),
> e.g., in cluster mode?
>
> If this were taken care of by the Spark 

Re: Enabling fully disaggregated shuffle on Spark

2019-11-20 Thread bo yang
Cool, thanks Ryan, John, Amogh for the reply! Great to see you interested!
Felix will have a Spark Scalability & Reliability Sync meeting on Dec 4 1pm
PST. We could discuss more details there. Do you want to join?

On Tue, Nov 19, 2019 at 4:23 PM Amogh Margoor  wrote:

> We at Qubole are also looking at disaggregating shuffle on Spark. Would
> love to collaborate and share learnings.
>
> Regards,
> Amogh
>
> On Tue, Nov 19, 2019 at 4:09 PM John Zhuge  wrote:
>
>> Great work, Bo! Would love to hear the details.
>>
>>
>> On Tue, Nov 19, 2019 at 4:05 PM Ryan Blue 
>> wrote:
>>
>>> I'm interested in remote shuffle services as well. I'd love to hear
>>> about what you're using in production!
>>>
>>> rb
>>>
>>> On Tue, Nov 19, 2019 at 2:43 PM bo yang  wrote:
>>>
>>>> Hi Ben,
>>>>
>>>> Thanks for the writing up! This is Bo from Uber. I am in Felix's team
>>>> in Seattle, and working on disaggregated shuffle (we called it remote
>>>> shuffle service, RSS, internally). We have put RSS into production for a
>>>> while, and learned a lot during the work (tried quite a few techniques to
>>>> improve the remote shuffle performance). We could share our learning with
>>>> the community, and also would like to hear feedback/suggestions on how to
>>>> further improve remote shuffle performance. We could chat more details if
>>>> you or other people are interested.
>>>>
>>>> Best,
>>>> Bo
>>>>
>>>> On Fri, Nov 15, 2019 at 4:10 PM Ben Sidhom 
>>>> wrote:
>>>>
>>>>> I would like to start a conversation about extending the Spark shuffle
>>>>> manager surface to support fully disaggregated shuffle implementations.
>>>>> This is closely related to the work in SPARK-25299
>>>>> <https://issues.apache.org/jira/browse/SPARK-25299>, which is focused
>>>>> on refactoring the shuffle manager API (and in particular,
>>>>> SortShuffleManager) to use a pluggable storage backend. The motivation for
>>>>> that SPIP is further enabling Spark on Kubernetes.
>>>>>
>>>>>
>>>>> The motivation for this proposal is enabling full externalized
>>>>> (disaggregated) shuffle service implementations. (Facebook’s Cosco
>>>>> shuffle
>>>>> <https://databricks.com/session/cosco-an-efficient-facebook-scale-shuffle-service>
>>>>> is one example of such a disaggregated shuffle service.) These changes
>>>>> allow the bulk of the shuffle to run in a remote service such that minimal
>>>>> state resides in executors and local disk spill is minimized. The net
>>>>> effect is increased job stability and performance improvements in certain
>>>>> scenarios. These changes should work well with or are complementary to
>>>>> SPARK-25299. Some or all points may be merged into that issue as
>>>>> appropriate.
>>>>>
>>>>>
>>>>> Below is a description of each component of this proposal. These
>>>>> changes can ideally be introduced incrementally. I would like to gather
>>>>> feedback and gauge interest from others in the community to collaborate on
>>>>> this. There are likely more points that would be useful to disaggregated
>>>>> shuffle services. We can outline a more concrete plan after gathering
>>>>> enough input. A working session could help us kick off this joint effort;
>>>>> maybe something in the mid-January to mid-February timeframe (depending on
>>>>> interest and availability). I’m happy to host at our Sunnyvale, CA offices.
>>>>>
>>>>>
>>>>> Proposal
>>>>>
>>>>> Scheduling and re-executing tasks
>>>>>
>>>>> Allow coordination between the service and the Spark DAG scheduler as
>>>>> to whether a given block/partition needs to be recomputed when a task 
>>>>> fails
>>>>> or when shuffle block data cannot be read. Having such coordination is
>>>>> important, e.g., for suppressing recomputation after aborted executors or
>>>>> for forcing late recomputation if the service internally acts as a cache.
>>>>> One catchall solution is to have the shuffle manager provide an indication
>>>>> of whether shuffle data is external to executors (or nodes). Another
>>>>> option: allow the shuffle manager (likely on

Re: Enabling fully disaggregated shuffle on Spark

2019-11-19 Thread bo yang
Hi Ben,

Thanks for the writing up! This is Bo from Uber. I am in Felix's team in
Seattle, and working on disaggregated shuffle (we called it remote shuffle
service, RSS, internally). We have put RSS into production for a while, and
learned a lot during the work (tried quite a few techniques to improve the
remote shuffle performance). We could share our learning with the
community, and also would like to hear feedback/suggestions on how to
further improve remote shuffle performance. We could chat more details if
you or other people are interested.

Best,
Bo

On Fri, Nov 15, 2019 at 4:10 PM Ben Sidhom 
wrote:

> I would like to start a conversation about extending the Spark shuffle
> manager surface to support fully disaggregated shuffle implementations.
> This is closely related to the work in SPARK-25299
> , which is focused on
> refactoring the shuffle manager API (and in particular, SortShuffleManager)
> to use a pluggable storage backend. The motivation for that SPIP is further
> enabling Spark on Kubernetes.
>
>
> The motivation for this proposal is enabling full externalized
> (disaggregated) shuffle service implementations. (Facebook’s Cosco shuffle
> 
> is one example of such a disaggregated shuffle service.) These changes
> allow the bulk of the shuffle to run in a remote service such that minimal
> state resides in executors and local disk spill is minimized. The net
> effect is increased job stability and performance improvements in certain
> scenarios. These changes should work well with or are complementary to
> SPARK-25299. Some or all points may be merged into that issue as
> appropriate.
>
>
> Below is a description of each component of this proposal. These changes
> can ideally be introduced incrementally. I would like to gather feedback
> and gauge interest from others in the community to collaborate on this.
> There are likely more points that would be useful to disaggregated shuffle
> services. We can outline a more concrete plan after gathering enough input.
> A working session could help us kick off this joint effort; maybe something
> in the mid-January to mid-February timeframe (depending on interest and
> availability). I’m happy to host at our Sunnyvale, CA offices.
>
>
> Proposal
>
> Scheduling and re-executing tasks
>
> Allow coordination between the service and the Spark DAG scheduler as to
> whether a given block/partition needs to be recomputed when a task fails or
> when shuffle block data cannot be read. Having such coordination is
> important, e.g., for suppressing recomputation after aborted executors or
> for forcing late recomputation if the service internally acts as a cache.
> One catchall solution is to have the shuffle manager provide an indication
> of whether shuffle data is external to executors (or nodes). Another
> option: allow the shuffle manager (likely on the driver) to be queried for
> the existence of shuffle data for a given executor ID (or perhaps map task,
> reduce task, etc). Note that this is at the level of data the scheduler is
> aware of (i.e., map/reduce partitions) rather than block IDs, which are
> internal details for some shuffle managers.
> ShuffleManager API
>
> Add a heartbeat (keep-alive) mechanism to RDD shuffle output so that the
> service knows that data is still active. This is one way to enable
> time-/job-scoped data because a disaggregated shuffle service cannot rely
> on robust communication with Spark and in general has a distinct lifecycle
> from the Spark deployment(s) it talks to. This would likely take the form
> of a callback on ShuffleManager itself, but there are other approaches.
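
A minimal sketch of the keep-alive callback idea above (the trait name and
method shape are assumptions for illustration, not an existing Spark API):

// Driver-side hook a disaggregated shuffle service could implement so that
// external shuffle data is kept alive only while Spark still references it.
trait ShuffleOutputHeartbeat {
  /** Called periodically with the shuffle IDs that are still registered,
   *  letting the external service refresh their TTL or expire stale data. */
  def reportActiveShuffles(appId: String, activeShuffleIds: Set[Int]): Unit
}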
>
>
> Add lifecycle hooks to shuffle readers and writers (e.g., to close/recycle
> connections/streams/file handles as well as provide commit semantics).
> SPARK-25299 adds commit semantics to the internal data storage layer, but
> this is applicable to all shuffle managers at a higher level and should
> apply equally to the ShuffleWriter.
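
Similarly, a rough sketch of the lifecycle hooks with commit semantics (names
are illustrative, not part of Spark's ShuffleWriter/ShuffleReader API):

// Hooks a shuffle writer implementation could expose so that connections,
// streams, and file handles are released and output is published atomically.
trait ShuffleWriterLifecycle {
  /** Flush buffered data and atomically publish (commit) the map output. */
  def commit(): Unit
  /** Release connections/streams/file handles, even when the task failed. */
  def close(success: Boolean): Unit
}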
>
>
> Do not require ShuffleManagers to expose ShuffleBlockResolvers where they
> are not needed. Ideally, this would be an implementation detail of the
> shuffle manager itself. If there is substantial overlap between the
> SortShuffleManager and other implementations, then the storage details can
> be abstracted at the appropriate level. (SPARK-25299 does not currently
> change this.)
>
>
> Do not require MapStatus to include blockmanager IDs where they are not
> relevant. This is captured by ShuffleBlockInfo
> 
> including an optional BlockManagerId in SPARK-25299. However, this change
> should be lifted to the MapStatus level so that it applies to all
> ShuffleManagers. Alternatively, use a more general data-location
> abstraction than BlockManagerId. This gives the shuffle 

Re: [VOTE][SPARK-25299] SPIP: Shuffle Storage API

2019-06-14 Thread bo yang
+1 This is great work, allowing plugging in different sort shuffle write/read
implementations! Also great to see it retains the current Spark configuration
(spark.shuffle.manager=org.apache.spark.shuffle.YourShuffleManagerImpl).
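
As a quick usage sketch (org.example.MyShuffleManager is a hypothetical
implementation used only for illustration):

// Plugging a custom shuffle manager in through the existing configuration key.
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val conf = new SparkConf()
  .set("spark.shuffle.manager", "org.example.MyShuffleManager")

val spark = SparkSession.builder()
  .appName("custom-shuffle-demo")
  .config(conf)
  .getOrCreate()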


On Thu, Jun 13, 2019 at 2:58 PM Matt Cheah  wrote:

> Hi everyone,
>
>
>
> I would like to call a vote for the SPIP for SPARK-25299
> , which proposes to
> introduce a pluggable storage API for temporary shuffle data.
>
>
>
> You may find the SPIP document here
> 
> .
>
>
>
> The discussion thread for the SPIP was conducted here
> 
> .
>
>
>
> Please vote on whether or not this proposal is agreeable to you.
>
>
>
> Thanks!
>
>
>
> -Matt Cheah
>


Support structured plan logging

2018-10-11 Thread bo yang
Hi All,

Are there any people interested in adding structured plan logging in Spark?
Currently the logical/physical plan can be logged as plain text via the
explain() method, which has some issues, for example string truncation and
being difficult for tools/programs to consume.

This PR fixes the truncation
issue. A further step is to log the plan as structured content (e.g. JSON).
Do other people feel a similar need?
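
For reference, a small sketch of pulling a plan as JSON today via the
TreeNode toJSON/prettyJson helpers (the surrounding wiring is just an
illustration):

// Obtain structured (JSON) representations of the plans for tools to consume.
import org.apache.spark.sql.SparkSession

object PlanLoggingExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("plan-logging").master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq((1, "a"), (2, "b")).toDF("id", "value").filter($"id" > 1)

    val optimizedPlanJson = df.queryExecution.optimizedPlan.prettyJson
    val physicalPlanJson = df.queryExecution.executedPlan.toJSON
    println(optimizedPlanJson)
    println(physicalPlanJson)

    spark.stop()
  }
}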

Thanks,
Bo