Re: [VOTE] SPIP: Stored Procedures API for Catalogs

2024-05-13 Thread Xiao Li
+1

Gengliang Wang wrote on Mon, May 13, 2024 at 16:24:

> +1
>
> On Mon, May 13, 2024 at 12:30 PM Zhou Jiang 
> wrote:
>
>> +1 (non-binding)
>>
>> On Sat, May 11, 2024 at 2:10 PM L. C. Hsieh  wrote:
>>
>>> Hi all,
>>>
>>> I’d like to start a vote for SPIP: Stored Procedures API for Catalogs.
>>>
>>> Please also refer to:
>>>
>>>- Discussion thread:
>>> https://lists.apache.org/thread/7r04pz544c9qs3gc8q2nyj3fpzfnv8oo
>>>- JIRA ticket: https://issues.apache.org/jira/browse/SPARK-44167
>>>- SPIP doc:
>>> https://docs.google.com/document/d/1rDcggNl9YNcBECsfgPcoOecHXYZOu29QYFrloo2lPBg/
>>>
>>>
>>> Please vote on the SPIP for the next 72 hours:
>>>
>>> [ ] +1: Accept the proposal as an official SPIP
>>> [ ] +0
>>> [ ] -1: I don’t think this is a good idea because …
>>>
>>>
>>> Thank you!
>>>
>>> Liang-Chi Hsieh
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
>>
>> --
>> *Zhou JIANG*
>>
>>


Re: [DISCUSS] Spark 4.0.0 release

2024-05-01 Thread Xiao Li
+1 for next Monday.

We can do more previews when the other features are ready for preview.

Tathagata Das wrote on Wed, May 1, 2024 at 08:46:

> Next week sounds great! Thank you Wenchen!
>
> On Wed, May 1, 2024 at 11:16 AM Wenchen Fan  wrote:
>
>> Yea I think a preview release won't hurt (without a branch cut). We don't
>> need to wait for all the ongoing projects to be ready. How about we do a
>> 4.0 preview release based on the current master branch next Monday?
>>
>> On Wed, May 1, 2024 at 11:06 PM Tathagata Das <
>> tathagata.das1...@gmail.com> wrote:
>>
>>> Hey all,
>>>
>>> Reviving this thread, but Spark master has already accumulated a huge
>>> amount of changes.  As a downstream project maintainer, I want to really
>>> start testing the new features and other breaking changes, and it's hard to
>>> do that without a Preview release. So the sooner we make a Preview release,
>>> the faster we can start getting feedback for fixing things for a great
>>> Spark 4.0 final release.
>>>
>>> So I urge the community to produce a Spark 4.0 Preview soon even if
>>> certain features targeting the Delta 4.0 release are still incomplete.
>>>
>>> Thanks!
>>>
>>>
>>> On Wed, Apr 17, 2024 at 8:35 AM Wenchen Fan  wrote:
>>>
 Thank you all for the replies!

 To @Nicholas Chammas  : Thanks for
 cleaning up the error terminology and documentation! I've merged the first
> PR; let's finish the others before the 4.0 release.
 To @Dongjoon Hyun  : Thanks for driving the
 ANSI on by default effort! Now the vote has passed, let's flip the config
 and finish the DataFrame error context feature before 4.0.
 To @Jungtaek Lim  : Ack. We can treat
 the Streaming state store data source as completed for 4.0 then.
 To @Cheng Pan  : Yea we definitely should have a
 preview release. Let's collect more feedback on the ongoing projects and
 then we can propose a date for the preview release.

 On Wed, Apr 17, 2024 at 1:22 PM Cheng Pan  wrote:

> will we have a preview release for 4.0.0 like we did for 2.0.0 and 3.0.0?
>
> Thanks,
> Cheng Pan
>
>
> > On Apr 15, 2024, at 09:58, Jungtaek Lim <
> kabhwan.opensou...@gmail.com> wrote:
> >
> > W.r.t. state data source - reader (SPARK-45511), there are several
> follow-up tickets, but we don't plan to address them soon. The current
> implementation is the final shape for Spark 4.0.0, unless there are 
> demands
> on the follow-up tickets.
> >
> > We may want to check the plan for transformWithState - my
> understanding is that we want to release the feature to 4.0.0, but there
> are several remaining works to be done. While the tentative timeline for
> releasing is June 2024, what would be the tentative timeline for the RC 
> cut?
> > (cc. Anish to add more context on the plan for transformWithState)
> >
> > On Sat, Apr 13, 2024 at 3:15 AM Wenchen Fan 
> wrote:
> > Hi all,
> >
> > It's close to the previously proposed 4.0.0 release date (June
> 2024), and I think it's time to prepare for it and discuss the ongoing
> projects:
> > • ANSI by default
> > • Spark Connect GA
> > • Structured Logging
> > • Streaming state store data source
> > • new data type VARIANT
> > • STRING collation support
> > • Spark k8s operator versioning
> > Please help to add more items to this list that are missed here. I
> would like to volunteer as the release manager for Apache Spark 4.0.0 if
> there is no objection. Thank you all for the great work that fills Spark
> 4.0!
> >
> > Wenchen Fan
>
>


Re: [VOTE] SPARK-44444: Use ANSI SQL mode by default

2024-04-13 Thread Xiao Li
+1

On Sat, Apr 13, 2024 at 17:21 huaxin gao  wrote:

> +1
>
> On Sat, Apr 13, 2024 at 4:36 PM L. C. Hsieh  wrote:
>
>> +1
>>
>> On Sat, Apr 13, 2024 at 4:12 PM Hyukjin Kwon 
>> wrote:
>> >
>> > +1
>> >
>> > On Sun, Apr 14, 2024 at 7:46 AM Chao Sun  wrote:
>> >>
>> >> +1.
>> >>
>> >> This feature is very helpful for guarding against correctness issues,
>> such as null results due to invalid input or math overflows. It’s been
>> there for a while now and it’s a good time to enable it by default as Spark
>> enters the next major release.
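
For readers less familiar with the flag being voted on, a minimal Scala sketch of the behavior difference, assuming a local Spark 3.x session (the query and input value are made up; exact error classes vary by version):

    // spark.sql.ansi.enabled flips silent NULLs into errors.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").getOrCreate()

    spark.conf.set("spark.sql.ansi.enabled", "false")
    spark.sql("SELECT CAST('not a number' AS INT)").show()  // row contains NULL

    spark.conf.set("spark.sql.ansi.enabled", "true")
    // The same invalid cast now raises an error (CAST_INVALID_INPUT in 3.4+)
    // instead of silently returning NULL.
    spark.sql("SELECT CAST('not a number' AS INT)").show()  // throws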
>> >>
>> >> On Sat, Apr 13, 2024 at 3:27 PM Dongjoon Hyun 
>> wrote:
>> >>>
>> >>> I'll start from my +1.
>> >>>
>> >>> Dongjoon.
>> >>>
>> >>> On 2024/04/13 22:22:05 Dongjoon Hyun wrote:
>> >>> > Please vote on SPARK-44444 to use ANSI SQL mode by default.
>> >>> > The technical scope is defined in the following PR which is
>> >>> > one line of code change and one line of migration guide.
>> >>> >
>> >>> > - DISCUSSION:
>> >>> > https://lists.apache.org/thread/ztlwoz1v1sn81ssks12tb19x37zozxlz
>> >>> > - JIRA: https://issues.apache.org/jira/browse/SPARK-44444
>> >>> > - PR: https://github.com/apache/spark/pull/46013
>> >>> >
>> >>> > The vote is open until April 17th 1AM (PST) and passes
>> >>> > if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>> >>> >
>> >>> > [ ] +1 Use ANSI SQL mode by default
>> >>> > [ ] -1 Do not use ANSI SQL mode by default because ...
>> >>> >
>> >>> > Thank you in advance.
>> >>> >
>> >>> > Dongjoon
>> >>> >
>> >>>
>> >>> -
>> >>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> >>>
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: [VOTE] Add new `Versions` in Apache Spark JIRA for Versioning of Spark Operator

2024-04-12 Thread Xiao Li
+1




On Fri, Apr 12, 2024 at 14:30 bo yang  wrote:

> +1
>
> On Fri, Apr 12, 2024 at 12:34 PM huaxin gao 
> wrote:
>
>> +1
>>
>> On Fri, Apr 12, 2024 at 9:07 AM Dongjoon Hyun 
>> wrote:
>>
>>> +1
>>>
>>> Thank you!
>>>
>>> I hope we can customize `dev/merge_spark_pr.py` script per repository
>>> after this PR.
>>>
>>> Dongjoon.
>>>
>>> On 2024/04/12 03:28:36 "L. C. Hsieh" wrote:
>>> > Hi all,
>>> >
>>> > Thanks for all discussions in the thread of "Versioning of Spark
>>> > Operator":
>>> https://lists.apache.org/thread/zhc7nb2sxm8jjxdppq8qjcmlf4rcsthh
>>> >
>>> > I would like to create this vote to get the consensus for versioning
>>> > of the Spark Kubernetes Operator.
>>> >
>>> > The proposal is to use an independent versioning for the Spark
>>> > Kubernetes Operator.
>>> >
>>> > Please vote on adding new `Versions` in Apache Spark JIRA which can be
>>> > used for places like "Fix Version/s" in the JIRA tickets of the
>>> > operator.
>>> >
>>> > The new `Versions` will be `kubernetes-operator-` prefix, for example
>>> > `kubernetes-operator-0.1.0`.
>>> >
>>> > The vote is open until April 15th 1AM (PST) and passes if a majority
>>> > +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>> >
>>> > [ ] +1 Adding the new `Versions` for Spark Kubernetes Operator in
>>> > Apache Spark JIRA
>>> > [ ] -1 Do not add the new `Versions` because ...
>>> >
>>> > Thank you.
>>> >
>>> >
>>> > Note that this is not a SPIP vote and also not a release vote. I don't
>>> > find similar votes in previous threads. This is made similarly like a
>>> > SPIP or a release vote. So I think it should be okay. Please correct
>>> > me if this vote format is not good for you.
>>> >
>>> > -
>>> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>> >
>>> >
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>


Re: [VOTE] SPIP: Pure Python Package in PyPI (Spark Connect)

2024-04-01 Thread Xiao Li
+1

Hussein Awala wrote on Mon, Apr 1, 2024 at 08:07:

> +1 (non-binding). To add to the difference it will make: it will also
> simplify package maintenance and make it easy to release a bug fix or new
> feature without needing to wait for a PySpark release.
>
> On Mon, Apr 1, 2024 at 4:56 PM Chao Sun  wrote:
>
>> +1
>>
>> On Sun, Mar 31, 2024 at 10:31 PM Hyukjin Kwon 
>> wrote:
>>
>>> Oh I didn't send the discussion thread out as it's pretty simple,
>>> non-invasive and the discussion was sort of done as part of the Spark
>>> Connect initial discussion ..
>>>
>>> On Mon, Apr 1, 2024 at 1:59 PM Mridul Muralidharan 
>>> wrote:
>>>

 Can you point me to the SPIP’s discussion thread please ?
 I was not able to find it, but I was on vacation, and so might have
 missed this …


 Regards,
 Mridul

>>>
 On Sun, Mar 31, 2024 at 9:08 PM Haejoon Lee
  wrote:

> +1
>
> On Mon, Apr 1, 2024 at 10:15 AM Hyukjin Kwon 
> wrote:
>
>> Hi all,
>>
>> I'd like to start the vote for SPIP: Pure Python Package in PyPI
>> (Spark Connect)
>>
>> JIRA 
>> Prototype 
>> SPIP doc
>> 
>>
>> Please vote on the SPIP for the next 72 hours:
>>
>> [ ] +1: Accept the proposal as an official SPIP
>> [ ] +0
>> [ ] -1: I don’t think this is a good idea because …
>>
>> Thanks.
>>
>


Re: [VOTE] SPIP: Structured Logging Framework for Apache Spark

2024-03-12 Thread Xiao Li
+1

On Tue, Mar 12, 2024 at 6:09 AM Holden Karau  wrote:

> +1
>
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
>
> On Mon, Mar 11, 2024 at 7:44 PM Reynold Xin 
> wrote:
>
>> +1
>>
>>
>> On Mon, Mar 11 2024 at 7:38 PM, Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>>
>>> +1 (non-binding), thanks Gengliang!
>>>
>>> On Mon, Mar 11, 2024 at 5:46 PM Gengliang Wang  wrote:
>>>
 Hi all,

 I'd like to start the vote for SPIP: Structured Logging Framework for
 Apache Spark

 References:

- JIRA ticket 
- SPIP doc

 
- Discussion thread


 Please vote on the SPIP for the next 72 hours:

 [ ] +1: Accept the proposal as an official SPIP
 [ ] +0
 [ ] -1: I don’t think this is a good idea because …

 Thanks!
 Gengliang Wang

>>>



Re: [VOTE] Release Apache Spark 3.5.1 (RC2)

2024-02-20 Thread Xiao Li
+1

Xiao

Cheng Pan wrote on Tue, Feb 20, 2024 at 04:59:

> +1 (non-binding)
>
> - Build successfully from source code.
> - Pass integration tests with Spark ClickHouse Connector[1]
>
> [1] https://github.com/housepower/spark-clickhouse-connector/pull/299
>
> Thanks,
> Cheng Pan
>
>
> > On Feb 20, 2024, at 10:56, Jungtaek Lim 
> wrote:
> >
> > Thanks Sean, let's continue the process for this RC.
> >
> > +1 (non-binding)
> >
> > - downloaded all files from URL
> > - checked signature
> > - extracted all archives
> > - ran all tests from source files in source archive file, via running
> "sbt clean test package" - Ubuntu 20.04.4 LTS, OpenJDK 17.0.9.
> >
> > Also bumping dev@ to encourage participation - looks like the timing is
> not good for US folks, but let's give it a few more days.
> >
> >
> > On Sat, Feb 17, 2024 at 1:49 AM Sean Owen  wrote:
> > Yeah let's get that fix in, but it seems to be a minor test only issue
> so should not block release.
> >
> > On Fri, Feb 16, 2024, 9:30 AM yangjie01  wrote:
> > Very sorry. When I was fixing SPARK-45242
> (https://github.com/apache/spark/pull/43594), I noticed that the
> `Affects Version` and `Fix Version` of SPARK-45242 were both 4.0, and I
> didn't realize that it had also been merged into branch-3.5, so I didn't
> advocate for SPARK-45357 to be backported to branch-3.5.
> >  As far as I know, the condition to trigger this test failure is: when
> using Maven to test the `connect` module, if  `sparkTestRelation` in
> `SparkConnectProtoSuite` is not the first `DataFrame` to be initialized,
> then the `id` of `sparkTestRelation` will no longer be 0. So, I think this
> is indeed related to the order in which Maven executes the test cases in
> the `connect` module.
> >  I have submitted a backport PR to branch-3.5, and if necessary, we can
> merge it to fix this test issue.
> >  Jie Yang
> >   From: Jungtaek Lim 
> > Date: Friday, Feb 16, 2024 22:15
> > To: Sean Owen , Rui Wang 
> > Cc: dev 
> > Subject: Re: [VOTE] Release Apache Spark 3.5.1 (RC2)
> >   I traced back relevant changes and got a sense of what happened.
> >   Yangjie figured out the issue via link. It's a tricky issue according
> to the comments from Yangjie - the test is dependent on ordering of
> execution for test suites. He said it does not fail in sbt, hence CI build
> couldn't catch it.
> > He fixed it via link, but we missed that the offending commit was also
> ported back to 3.5 as well, hence the fix wasn't ported back to 3.5.
> >   Surprisingly, I can't reproduce locally even with Maven. In my attempt
> to reproduce, SparkConnectProtoSuite was executed third:
> SparkConnectStreamingQueryCacheSuite, then ExecuteEventsManagerSuite, and
> then SparkConnectProtoSuite. Maybe it is very specific to the environment,
> not just Maven? My env: MBP M1 Pro chip, macOS 14.3.1, OpenJDK 17.0.9. I
> used build/mvn (Maven 3.8.8).
> >   I'm not 100% sure this is something we should fail the release as it's
> a test only and sounds very environment dependent, but I'll respect your
> call on vote.
> >   Btw, looks like Rui also made a relevant fix via link (not to fix the
> failing test but to fix other issues), but this also wasn't ported back to
> 3.5. @Rui Wang Do you think this is a regression issue and warrants a new
> RC?
> > On Fri, Feb 16, 2024 at 11:38 AM Sean Owen  wrote:
> > Is anyone seeing this Spark Connect test failure? then again, I have
> some weird issue with this env that always fails 1 or 2 tests that nobody
> else can replicate.
> >   - Test observe *** FAILED ***
> >   == FAIL: Plans do not match ===
> >   !CollectMetrics my_metric, [min(id#0) AS min_val#0, max(id#0) AS max_val#0, sum(id#0) AS sum(id)#0L], 0
> >    CollectMetrics my_metric, [min(id#0) AS min_val#0, max(id#0) AS max_val#0, sum(id#0) AS sum(id)#0L], 44
> >    +- LocalRelation , [id#0, name#0]
> >    +- LocalRelation , [id#0, name#0]
> >   (PlanTest.scala:179)
> >   On Thu, Feb 15, 2024 at 1:34 PM Jungtaek Lim <
> kabhwan.opensou...@gmail.com> wrote:
> > DISCLAIMER: RC for Apache Spark 3.5.1 starts with RC2 as I lately
> figured out doc generation issue after tagging RC1.
> >   Please vote on releasing the following candidate as Apache Spark
> version 3.5.1.
> >
> > The vote is open until February 18th 9AM (PST) and passes if a majority
> +1 PMC votes are cast, with
> > a minimum of 3 +1 votes.
> >
> > [ ] +1 Release this package as Apache Spark 3.5.1
> > [ ] -1 Do not release this package because ...
> >
> > To learn more about Apache Spark, please see https://spark.apache.org/
> >
> > The tag to be voted on is v3.5.1-rc2 (commit
> fd86f85e181fc2dc0f50a096855acf83a6cc5d9c):
> > https://github.com/apache/spark/tree/v3.5.1-rc2
> >
> > The release files, including signatures, digests, etc. can be found at:
> > https://dist.apache.org/repos/dist/dev/spark/v3.5.1-rc2-bin/
> >
> > Signatures used for Spark RCs can be found in this file:
> > 

Re: Re: [DISCUSS] Release Spark 3.5.1?

2024-02-04 Thread Xiao Li
+1

On Sun, Feb 4, 2024 at 6:07 AM beliefer  wrote:

> +1
>
>
>
> 在 2024-02-04 15:26:13,"Dongjoon Hyun"  写道:
>
> +1
>
> On Sat, Feb 3, 2024 at 9:18 PM yangjie01 
> wrote:
>
>> +1
>>
>> On 2024/2/4 13:13, "Kent Yao" <y...@apache.org> wrote:
>>
>>
>> +1
>>
>>
>> Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote on Sat, Feb 3, 2024 at 21:14:
>> >
>> > Hi dev,
>> >
>> > looks like there are a huge number of commits being pushed to
>> branch-3.5 after 3.5.0 was released, 200+ commits.
>> >
>> > $ git log --oneline v3.5.0..HEAD | wc -l
>> > 202
>> >
>> > Also, there are 180 JIRA tickets containing 3.5.1 as fixed version, and
>> 10 resolved issues are either marked as blocker (even correctness issues)
>> or critical, which justifies the release.
>> > https://issues.apache.org/jira/projects/SPARK/versions/12353495
>> >
>> > What do you think about releasing 3.5.1 with the current head of
>> branch-3.5? I'm happy to volunteer as the release manager.
>> >
>> > Thanks,
>> > Jungtaek Lim (HeartSaVioR)
>>
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>>
>>
>>
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>



Re: Remove HiveContext from Apache Spark 4.0

2023-11-29 Thread Xiao Li
Thank you for raising it on the dev list. I do not think we should remove
HiveContext, weighing the cost of breakage against the maintenance cost.

FYI, when releasing Spark 3.0, we had a lot of discussions about the
related topics
https://lists.apache.org/thread/mrx0y078cf3ozs7czykvv864y6dr55xq


Dongjoon Hyun wrote on Wed, Nov 29, 2023 at 08:43:

> Thank you for the heads-up.
>
> I agree with your intention and the fact that it's not useful in Apache
> Spark 4.0.0.
>
> However, as you know, historically, it was removed once and explicitly
> added back to the Apache Spark 3.0 via the vote.
>
> SPARK-31088 Add back HiveContext and createExternalTable
> (As a subtask of SPARK-31085 Amend Spark's Semantic Versioning Policy)
>
> Like you, I'd love to remove that too, but it's a little hard to remove it
> from Apache Spark 4.0.0 under our AS-IS versioning policy and history.
>
> I believe a new specific vote could make it possible to remove HiveContext
> (if we need to remove it).
>
> So, do you want to delete it from Apache Spark 4.0.0 via the official
> community vote with this thread context?
>
> Thanks,
> Dongjoon.
>
>
> On Wed, Nov 29, 2023 at 3:03 AM 杨杰  wrote:
>
>> Hi all,
>>
>> In SPARK-46171 (apache/spark#44077 [1]), I’m trying to remove the
>> deprecated HiveContext from Apache Spark 4.0 since HiveContext has been
>> marked as deprecated since Spark 2.0. This is a long-deprecated API; it
>> should be replaced with SparkSession with enableHiveSupport now, so I think
>> it's time to remove it.
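
For context, a minimal before/after sketch of the migration described above, assuming an existing SparkContext named `sc` (the table name is hypothetical):

    // Old, deprecated since Spark 2.0 and removed in the proposal above:
    // val hiveCtx = new org.apache.spark.sql.hive.HiveContext(sc)
    // hiveCtx.sql("SELECT * FROM my_hive_table")

    // Replacement: a SparkSession with Hive support enabled.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .enableHiveSupport() // takes over HiveContext's role
      .getOrCreate()

    spark.sql("SELECT * FROM my_hive_table") // hypothetical table name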
>>
>> Feel free to comment if you have any concerns.
>>
>> [1] https://github.com/apache/spark/pull/44077
>>
>> Thanks,
>> Jie Yang
>>
>


Re: [VOTE] SPIP: An Official Kubernetes Operator for Apache Spark

2023-11-15 Thread Xiao Li
+1

bo yang wrote on Wed, Nov 15, 2023 at 05:55:

> +1
>
> On Tue, Nov 14, 2023 at 7:18 PM huaxin gao  wrote:
>
>> +1
>>
>> On Tue, Nov 14, 2023 at 10:45 AM Holden Karau 
>> wrote:
>>
>>> +1
>>>
>>> On Tue, Nov 14, 2023 at 10:21 AM DB Tsai  wrote:
>>>
 +1

 DB Tsai  |  https://www.dbtsai.com/  |  PGP 42E5B25A8F7A82C1

 On Nov 14, 2023, at 10:14 AM, Vakaris Baškirov <
 vakaris.bashki...@gmail.com> wrote:

 +1 (non-binding)


 On Tue, Nov 14, 2023 at 8:03 PM Chao Sun  wrote:

> +1
>
> On Tue, Nov 14, 2023 at 9:52 AM L. C. Hsieh  wrote:
> >
> > +1
> >
> > On Tue, Nov 14, 2023 at 9:46 AM Ye Zhou  wrote:
> > >
> > > +1(Non-binding)
> > >
> > > On Tue, Nov 14, 2023 at 9:42 AM L. C. Hsieh 
> wrote:
> > >>
> > >> Hi all,
> > >>
> > >> I’d like to start a vote for SPIP: An Official Kubernetes
> Operator for
> > >> Apache Spark.
> > >>
> > >> The proposal is to develop an official Java-based Kubernetes
> operator
> > >> for Apache Spark to automate the deployment and simplify the
> lifecycle
> > >> management and orchestration of Spark applications and Spark
> clusters
> > >> on k8s at prod scale.
> > >>
> > >> This aims to reduce the learning curve and operation overhead for
> > >> Spark users so they can concentrate on core Spark logic.
> > >>
> > >> Please also refer to:
> > >>
> > >>- Discussion thread:
> > >> https://lists.apache.org/thread/wdy7jfhf7m8jy74p6s0npjfd15ym5rxz
> > >>- JIRA ticket:
> https://issues.apache.org/jira/browse/SPARK-45923
> > >>- SPIP doc:
> https://docs.google.com/document/d/1f5mm9VpSKeWC72Y9IiKN2jbBn32rHxjWKUfLRaGEcLE
> > >>
> > >>
> > >> Please vote on the SPIP for the next 72 hours:
> > >>
> > >> [ ] +1: Accept the proposal as an official SPIP
> > >> [ ] +0
> > >> [ ] -1: I don’t think this is a good idea because …
> > >>
> > >>
> > >> Thank you!
> > >>
> > >> Liang-Chi Hsieh
> > >>
> > >>
> -
> > >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> > >>
> > >
> > >
> > > --
> > >
> > > Zhou, Ye  周晔
> >
> > -
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>



Re: [DISCUSSION] SPIP: An Official Kubernetes Operator for Apache Spark

2023-11-09 Thread Xiao Li
+1

huaxin gao wrote on Thu, Nov 9, 2023 at 16:53:

> +1
>
> On Thu, Nov 9, 2023 at 3:14 PM DB Tsai  wrote:
>
>> +1
>>
>> To be completely transparent, I am employed in the same department as
>> Zhou at Apple.
>>
>> I support this proposal, provided that we witness community adoption
>> following the release of the Flink Kubernetes operator, streamlining Flink
>> deployment on Kubernetes.
>>
>> A well-maintained official Spark Kubernetes operator is essential for our
>> Spark community as well.
>>
>> DB Tsai  |  https://www.dbtsai.com/  |  PGP 42E5B25A8F7A82C1
>>
>> On Nov 9, 2023, at 12:05 PM, Zhou Jiang  wrote:
>>
>> Hi Spark community,
>> I'm reaching out to initiate a conversation about the possibility of
>> developing a Java-based Kubernetes operator for Apache Spark. Following the
>> operator pattern (
>> https://kubernetes.io/docs/concepts/extend-kubernetes/operator/), Spark
>> users may manage applications and related components seamlessly using
>> native tools like kubectl. The primary goal is to simplify the Spark user
>> experience on Kubernetes, minimizing the learning curve and operational
>> complexities and therefore enable users to focus on the Spark application
>> development.
>> Although there are several open-source Spark on Kubernetes operators
>> available, none of them are officially integrated into the Apache Spark
>> project. As a result, these operators may lack active support and
>> development for new features. Within this proposal, our aim is to introduce
>> a Java-based Spark operator as an integral component of the Apache Spark
>> project. This solution has been employed internally at Apple for multiple
>> years, operating millions of executors in real production environments. The
>> use of Java in this solution is intended to accommodate a wider user and
>> contributor audience, especially those who are not familiar with Scala.
>> Ideally, this operator should have its dedicated repository, similar to
>> Spark Connect Golang or Spark Docker, allowing it to maintain a loose
>> connection with the Spark release cycle. This model is also followed by the
>> Apache Flink Kubernetes operator.
>> We believe that this project holds the potential to evolve into a
>> thriving community project over the long run. A comparison can be drawn
>> with the Flink Kubernetes Operator: Apple has open-sourced internal Flink
>> Kubernetes operator, making it a part of the Apache Flink project (
>> https://github.com/apache/flink-kubernetes-operator). This move has
>> gained wide industry adoption and contributions from the community. In a
>> mere year, the Flink operator has garnered more than 600 stars and has
>> attracted contributions from over 80 contributors. This showcases the level
>> of community interest and collaborative momentum that can be achieved in
>> similar scenarios.
>> More details can be found at SPIP doc : Spark Kubernetes Operator
>> https://docs.google.com/document/d/1f5mm9VpSKeWC72Y9IiKN2jbBn32rHxjWKUfLRaGEcLE
>>
>> Thanks,
>> --
>> *Zhou JIANG*
>>
>>
>>


Welcome to Our New Apache Spark Committer and PMCs

2023-10-02 Thread Xiao Li
Hi all,

The Spark PMC is delighted to announce that we have voted to add one new
committer and two new PMC members. These individuals have consistently
contributed to the project and have clearly demonstrated their expertise.

New Committer:
- Jiaan Geng (focusing on Spark Connect and Spark SQL)

New PMCs:
- Yuanjian Li
- Yikun Jiang

Please join us in extending a warm welcome to them in their new roles!

Sincerely,
The Spark PMC


Re: [VOTE] Release Apache Spark 3.5.0 (RC5)

2023-09-11 Thread Xiao Li
+1

Xiao

Yuanjian Li wrote on Mon, Sep 11, 2023 at 10:53:

> @Peter Toth  I've looked into the details of this
> issue, and it appears that it's neither a regression in version 3.5.0 nor a
> correctness issue. It's a bug related to a new feature. I think we can fix
> this in 3.5.1 and list it as a known issue of the Scala client of Spark
> Connect in 3.5.0.
>
> Mridul Muralidharan wrote on Sun, Sep 10, 2023 at 04:12:
>
>>
>> +1
>>
>> Signatures, digests, etc check out fine.
>> Checked out tag and build/tested with -Phive -Pyarn -Pmesos -Pkubernetes
>>
>> Regards,
>> Mridul
>>
>> On Sat, Sep 9, 2023 at 10:02 AM Yuanjian Li 
>> wrote:
>>
>>> Please vote on releasing the following candidate(RC5) as Apache Spark
>>> version 3.5.0.
>>>
>>> The vote is open until 11:59pm Pacific time Sep 11th and passes if a
>>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>>
>>> [ ] +1 Release this package as Apache Spark 3.5.0
>>>
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>
>>> The tag to be voted on is v3.5.0-rc5 (commit
>>> ce5ddad990373636e94071e7cef2f31021add07b):
>>>
>>> https://github.com/apache/spark/tree/v3.5.0-rc5
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>>
>>> https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc5-bin/
>>>
>>> Signatures used for Spark RCs can be found in this file:
>>>
>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>
>>> The staging repository for this release can be found at:
>>>
>>> https://repository.apache.org/content/repositories/orgapachespark-1449
>>>
>>> The documentation corresponding to this release can be found at:
>>>
>>> https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc5-docs/
>>>
>>> The list of bug fixes going into 3.5.0 can be found at the following URL:
>>>
>>> https://issues.apache.org/jira/projects/SPARK/versions/12352848
>>>
>>> This release is using the release script of the tag v3.5.0-rc5.
>>>
>>>
>>> FAQ
>>>
>>> =
>>>
>>> How can I help test this release?
>>>
>>> =
>>>
>>> If you are a Spark user, you can help us test this release by taking
>>>
>>> an existing Spark workload and running on this release candidate, then
>>>
>>> reporting any regressions.
>>>
>>> If you're working in PySpark you can set up a virtual env and install
>>>
>>> the current RC and see if anything important breaks, in the Java/Scala
>>>
>>> you can add the staging repository to your projects resolvers and test
>>>
>>> with the RC (make sure to clean up the artifact cache before/after so
>>>
>>> you don't end up building with an out of date RC going forward).
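
As a concrete sketch of the Java/Scala testing instructions above, an sbt fragment that points a downstream project at this RC's staging repository (the surrounding project definition is assumed; the resolver name is arbitrary):

    // build.sbt sketch for testing Spark 3.5.0 RC5; the URL is the staging
    // repository listed earlier in this email.
    resolvers += "Apache Spark RC staging" at
      "https://repository.apache.org/content/repositories/orgapachespark-1449"

    libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.5.0"

As the email notes, clean the local artifact cache (e.g., ~/.ivy2) before and after, so later builds do not silently pick up the RC artifacts.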
>>>
>>> ===
>>>
>>> What should happen to JIRA tickets still targeting 3.5.0?
>>>
>>> ===
>>>
>>> The current list of open tickets targeted at 3.5.0 can be found at:
>>>
>>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>>> Version/s" = 3.5.0
>>>
>>> Committers should look at those and triage. Extremely important bug
>>>
>>> fixes, documentation, and API tweaks that impact compatibility should
>>>
>>> be worked on immediately. Everything else please retarget to an
>>>
>>> appropriate release.
>>>
>>> ==
>>>
>>> But my bug isn't fixed?
>>>
>>> ==
>>>
>>> In order to make timely releases, we will typically not hold the
>>>
>>> release unless the bug in question is a regression from the previous
>>>
>>> release. That being said, if there is something which is a regression
>>>
>>> that has not been correctly targeted please ping me or a committer to
>>>
>>> help target the issue.
>>>
>>> Thanks,
>>>
>>> Yuanjian Li
>>>
>>


Re: [VOTE] Release Apache Spark 3.5.0 (RC4)

2023-09-06 Thread Xiao Li
+1

Xiao

Herman van Hovell wrote on Wed, Sep 6, 2023 at 22:08:

> Tested connect, and everything looks good.
>
> +1
>
> On Wed, Sep 6, 2023 at 8:11 AM Yuanjian Li  wrote:
>
>> Please vote on releasing the following candidate(RC4) as Apache Spark
>> version 3.5.0.
>>
>> The vote is open until 11:59pm Pacific time Sep 8th and passes if a
>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 3.5.0
>>
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v3.5.0-rc4 (commit
>> c2939589a29dd0d6a2d3d31a8d833877a37ee02a):
>>
>> https://github.com/apache/spark/tree/v3.5.0-rc4
>>
>> The release files, including signatures, digests, etc. can be found at:
>>
>> https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc4-bin/
>>
>> Signatures used for Spark RCs can be found in this file:
>>
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>>
>> https://repository.apache.org/content/repositories/orgapachespark-1448
>>
>> The documentation corresponding to this release can be found at:
>>
>> https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc4-docs/
>>
>> The list of bug fixes going into 3.5.0 can be found at the following URL:
>>
>> https://issues.apache.org/jira/projects/SPARK/versions/12352848
>>
>> This release is using the release script of the tag v3.5.0-rc4.
>>
>>
>> FAQ
>>
>> =
>>
>> How can I help test this release?
>>
>> =
>>
>> If you are a Spark user, you can help us test this release by taking
>>
>> an existing Spark workload and running on this release candidate, then
>>
>> reporting any regressions.
>>
>> If you're working in PySpark you can set up a virtual env and install
>>
>> the current RC and see if anything important breaks, in the Java/Scala
>>
>> you can add the staging repository to your projects resolvers and test
>>
>> with the RC (make sure to clean up the artifact cache before/after so
>>
>> you don't end up building with an out of date RC going forward).
>>
>> ===
>>
>> What should happen to JIRA tickets still targeting 3.5.0?
>>
>> ===
>>
>> The current list of open tickets targeted at 3.5.0 can be found at:
>>
>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>> Version/s" = 3.5.0
>>
>> Committers should look at those and triage. Extremely important bug
>>
>> fixes, documentation, and API tweaks that impact compatibility should
>>
>> be worked on immediately. Everything else please retarget to an
>>
>> appropriate release.
>>
>> ==
>>
>> But my bug isn't fixed?
>>
>> ==
>>
>> In order to make timely releases, we will typically not hold the
>>
>> release unless the bug in question is a regression from the previous
>>
>> release. That being said, if there is something which is a regression
>>
>> that has not been correctly targeted please ping me or a committer to
>>
>> help target the issue.
>>
>> Thanks,
>>
>> Yuanjian Li
>>
>


Re: Welcome two new Apache Spark committers

2023-08-06 Thread Xiao Li
Congratulations, Peter and Xiduo!



Debasish Das wrote on Sun, Aug 6, 2023 at 19:08:

> Congratulations Peter and Xiduo.
>
> On Sun, Aug 6, 2023, 7:05 PM Wenchen Fan  wrote:
>
>> Hi all,
>>
>> The Spark PMC recently voted to add two new committers. Please join me in
>> welcoming them to their new role!
>>
>> - Peter Toth (Spark SQL)
>> - Xiduo You (Spark SQL)
>>
>> They consistently make contributions to the project and clearly showed
>> their expertise. We are very excited to have them join as committers.
>>
>


Re: [VOTE] SPIP: XML data source support

2023-07-28 Thread Xiao Li
+1

On Fri, Jul 28, 2023 at 15:54 Sean Owen  wrote:

> +1 I think that porting the package 'as is' into Spark is probably
> worthwhile.
> That's relatively easy; the code is already pretty battle-tested and not
> that big and even originally came from Spark code, so is more or less
> similar already.
>
> One thing it never got was DSv2 support, which means XML reading would
> still be somewhat behind other formats. (I was not able to implement it.)
> This isn't a necessary goal right now, but would be possibly part of the
> logic of moving it into the Spark code base.
>
> On Fri, Jul 28, 2023 at 5:38 PM Sandip Agarwala
>  wrote:
>
>> Dear Spark community,
>>
>> I would like to start the vote for "SPIP: XML data source support".
>>
>> XML is a widely used data format. An external spark-xml package (
>> https://github.com/databricks/spark-xml) is available to read and write
>> XML data in Spark. Making spark-xml built-in will provide a better user
>> experience for Spark SQL and structured streaming. The proposal is to
>> inline code from the spark-xml package.
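
For readers who have not used the package, a minimal read/write sketch with the external spark-xml package (the file paths and the rowTag/rootTag values are hypothetical; the short `xml` format name assumes the package jar is on the classpath):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").getOrCreate()

    // Each <book> element becomes one row (rowTag is made up for this sketch).
    val df = spark.read
      .format("xml")
      .option("rowTag", "book")
      .load("books.xml")

    df.write
      .format("xml")
      .option("rootTag", "books")
      .option("rowTag", "book")
      .save("books-out")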
>>
>> SPIP link:
>>
>> https://docs.google.com/document/d/1ZaOBT4-YFtN58UCx2cdFhlsKbie1ugAn-Fgz_Dddz-Q/edit?usp=sharing
>>
>> JIRA:
>> https://issues.apache.org/jira/browse/SPARK-44265
>>
>> Discussion Thread:
>> https://lists.apache.org/thread/q32hxgsp738wom03mgpg9ykj9nr2n1fh
>>
>> Please vote on the SPIP for the next 72 hours:
>> [ ] +1: Accept the proposal as an official SPIP
>> [ ] +0
>> [ ] -1: I don’t think this is a good idea because __.
>>
>> Thanks, Sandip
>>
>


Re: Spark Docker Official Image is now available

2023-07-20 Thread Xiao Li
Thank you, Yikun! This is great!

On Wed, Jul 19, 2023 at 7:55 PM Ruifeng Zheng  wrote:

> Awesome, thank you YiKun for driving this!
>
> On Thu, Jul 20, 2023 at 9:12 AM Hyukjin Kwon  wrote:
>
>> This is amazing, finally!
>>
>> On Thu, 20 Jul 2023 at 10:10, Yikun Jiang  wrote:
>>
>>> The spark Docker Official Image is now available:
>>> https://hub.docker.com/_/spark
>>>
>>> $ docker run -it --rm *spark* /opt/spark/bin/spark-shell
>>> $ docker run -it --rm *spark*:python3 /opt/spark/bin/pyspark
>>> $ docker run -it --rm *spark*:r /opt/spark/bin/sparkR
>>>
>>> We had a longer review journey than we expected. If you are also
>>> interested in this journey, you can see more in:
>>>
>>> https://github.com/docker-library/official-images/pull/13089
>>>
>>> Thanks to everyone who helps in the Docker and Apache Spark community!
>>>
>>> Some background you might want to know:
>>> *- apache/spark*: https://hub.docker.com/r/apache/spark, the Apache
>>> Spark docker image, published by the *Apache Spark community* when
>>> Apache Spark is released; it receives no further updates.
>>> *- spark*: https://hub.docker.com/_/spark, the Docker Official Image,
>>> published by the *Docker community*, which keeps actively rebuilding
>>> it for updates and security fixes.
>>> - The source repo of *apache/spark *and *spark: *
>>> https://github.com/apache/spark-docker
>>>
>>> See more in:
>>> [1] [DISCUSS] SPIP: Support Docker Official Image for Spark:
>>> https://lists.apache.org/thread/l1793y5224n8bqkp3s6ltgkykso4htb3
>>> [2] [VOTE] SPIP: Support Docker Official Image for Spark:
>>> https://lists.apache.org/thread/ro6olodm1jzdffwjx4oc7ol7oh6kshbl
>>> [3] https://github.com/docker-library/official-images/pull/13089
>>> [4]
>>> https://docs.google.com/document/d/1nN-pKuvt-amUcrkTvYAQ-bJBgtsWb9nAkNoVNRM2S2o/
>>> [5] https://issues.apache.org/jira/browse/SPARK-40513
>>>
>>> Regards,
>>> Yikun
>>>
>>
>
> --
> Ruifeng Zheng
> E-mail: zrfli...@gmail.com
>




Re: [VOTE][SPIP] Python Data Source API

2023-07-06 Thread Xiao Li
+1

Xiao

Hyukjin Kwon wrote on Wed, Jul 5, 2023 at 17:28:

> +1.
>
> See https://youtu.be/yj7XlTB1Jvc?t=604 :-).
>
> On Thu, 6 Jul 2023 at 09:15, Allison Wang
>  wrote:
>
>> Hi all,
>>
>> I'd like to start the vote for SPIP: Python Data Source API.
>>
>> The high-level summary for the SPIP is that it aims to introduce a
>> simple API in Python for Data Sources. The idea is to enable Python
>> developers to create data sources without learning Scala or dealing with
>> the complexities of the current data source APIs. This would make Spark
>> more accessible to the wider Python developer community.
>>
>> References:
>>
>>- SPIP doc
>>
>> 
>>- JIRA ticket 
>>- Discussion thread
>>
>>
>>
>> Please vote on the SPIP for the next 72 hours:
>>
>> [ ] +1: Accept the proposal as an official SPIP
>> [ ] +0
>> [ ] -1: I don’t think this is a good idea because __.
>>
>> Thanks,
>> Allison
>>
>


Re: [VOTE] Release Plan for Apache Spark 4.0.0 (June 2024)

2023-06-15 Thread Xiao Li
Since the vote includes the release date for Spark 4.0, I cast my vote as
-1, in light of the discussions from the three other PMCs.

Also, considering recent discussions on the dev list, numerous breaking
changes, such as Scala 2.13, JDK 17 support, and pandas 2.0 support, will
be incorporated into Spark 4.0. I propose that we first release a preview
so that the entire community can provide more comprehensive feedback before
the final release.


Jungtaek Lim wrote on Mon, Jun 12, 2023 at 19:28:

> I concur with Holden and Mridul. Let's build a plan before we call the
> tentative deadline. I understand setting a tentative deadline would
> definitely help in pushing back features which "never ever end", but at
> least we may want to list the features and discuss their priority. It is
> still possible that we might even want to treat some features as hard
> blockers on the release for any reason, based on discussion of course.
>
> On Tue, Jun 13, 2023 at 10:58 AM Mridul Muralidharan 
> wrote:
>
>>
>> I agree with Holden, we should have some understanding of what we are
>> targeting for 4.0, given it is a major ver bump - and work from there on
>> the release date.
>>
>> Regards,
>> Mridul
>>
>> On Mon, Jun 12, 2023 at 8:53 PM Jia Fan  wrote:
>>
>>> By the way, as Holden said, what are the big features for 4.0.0? I think
>>> a very big version change always brings some differences.
>>>
>>> Jia Fan wrote on Tue, Jun 13, 2023 at 08:25:
>>>
>>>> +1
>>>>
>>>> 
>>>>
>>>> Jia Fan
>>>>
>>>>
>>>>
>>>> On Jun 13, 2023, at 03:51, Chao Sun wrote:
>>>>
>>>> +1
>>>>
>>>> On Mon, Jun 12, 2023 at 12:50 PM kazuyuki tanimura
>>>>  wrote:
>>>>
>>>>> +1 (non-binding)
>>>>>
>>>>> Thank you!
>>>>> Kazu
>>>>>
>>>>>
>>>>> On Jun 12, 2023, at 11:32 AM, Holden Karau 
>>>>> wrote:
>>>>>
>>>>> -0
>>>>>
>>>>> I'd like to see more of a doc around what we're planning on for a 4.0
>>>>> before we pick a target release date etc. (feels like cart before the
>>>>> horse).
>>>>>
>>>>> But it's a weak preference.
>>>>>
>>>>> On Mon, Jun 12, 2023 at 11:24 AM Xiao Li  wrote:
>>>>>
>>>>>> Thanks for starting the vote.
>>>>>>
>>>>>> I do have a concern about the target release date of Spark 4.0.
>>>>>>
>>>>>> L. C. Hsieh wrote on Mon, Jun 12, 2023 at 11:09:
>>>>>>
>>>>>>> +1
>>>>>>>
>>>>>>> On Mon, Jun 12, 2023 at 11:06 AM huaxin gao 
>>>>>>> wrote:
>>>>>>> >
>>>>>>> > +1
>>>>>>> >
>>>>>>> > On Mon, Jun 12, 2023 at 11:05 AM Dongjoon Hyun <
>>>>>>> dongj...@apache.org> wrote:
>>>>>>> >>
>>>>>>> >> +1
>>>>>>> >>
>>>>>>> >> Dongjoon
>>>>>>> >>
>>>>>>> >> On 2023/06/12 18:00:38 Dongjoon Hyun wrote:
>>>>>>> >> > Please vote on the release plan for Apache Spark 4.0.0.
>>>>>>> >> >
>>>>>>> >> > The vote is open until June 16th 1AM (PST) and passes if a
>>>>>>> majority +1 PMC
>>>>>>> >> > votes are cast, with a minimum of 3 +1 votes.
>>>>>>> >> >
>>>>>>> >> > [ ] +1 Have a release plan for Apache Spark 4.0.0 (June 2024)
>>>>>>> >> > [ ] -1 Do not have a plan for Apache Spark 4.0.0 because ...
>>>>>>> >> >
>>>>>>> >> > ===
>>>>>>> >> > Apache Spark 4.0.0 Release Plan
>>>>>>> >> > ===
>>>>>>> >> >
>>>>>>> >> > 1. After creating `branch-3.5`, set "4.0.0-SNAPSHOT" in master
>>>>>>> branch.
>>>>>>> >> >
>>>>>>> >> > 2. Creating `branch-4.0` on April 1st, 2024.
>>>>>>> >> >
>>>>>>> >> > 3. Apache Spark 4.0.0 RC1 on May 1st, 2024.
>>>>>>> >> >
>>>>>>> >> > 4. Apache Spark 4.0.0 Release in June, 2024.
>>>>>>> >> >
>>>>>>> >>
>>>>>>> >>
>>>>>>> -
>>>>>>> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>>>> >>
>>>>>>>
>>>>>>> -
>>>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>>>>
>>>>>>>
>>>>>
>>>>> --
>>>>> Twitter: https://twitter.com/holdenkarau
>>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>>> https://amzn.to/2MaRAG9
>>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>>
>>>>>
>>>>>
>>>>


Re: [VOTE] Release Plan for Apache Spark 4.0.0 (June 2024)

2023-06-12 Thread Xiao Li
Thanks for starting the vote.

I do have a concern about the target release date of Spark 4.0.

L. C. Hsieh wrote on Mon, Jun 12, 2023 at 11:09:

> +1
>
> On Mon, Jun 12, 2023 at 11:06 AM huaxin gao 
> wrote:
> >
> > +1
> >
> > On Mon, Jun 12, 2023 at 11:05 AM Dongjoon Hyun 
> wrote:
> >>
> >> +1
> >>
> >> Dongjoon
> >>
> >> On 2023/06/12 18:00:38 Dongjoon Hyun wrote:
> >> > Please vote on the release plan for Apache Spark 4.0.0.
> >> >
> >> > The vote is open until June 16th 1AM (PST) and passes if a majority
> +1 PMC
> >> > votes are cast, with a minimum of 3 +1 votes.
> >> >
> >> > [ ] +1 Have a release plan for Apache Spark 4.0.0 (June 2024)
> >> > [ ] -1 Do not have a plan for Apache Spark 4.0.0 because ...
> >> >
> >> > ===
> >> > Apache Spark 4.0.0 Release Plan
> >> > ===
> >> >
> >> > 1. After creating `branch-3.5`, set "4.0.0-SNAPSHOT" in master branch.
> >> >
> >> > 2. Creating `branch-4.0` on April 1st, 2024.
> >> >
> >> > 3. Apache Spark 4.0.0 RC1 on May 1st, 2024.
> >> >
> >> > 4. Apache Spark 4.0.0 Release in June, 2024.
> >> >
> >>
> >> -
> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Apache Spark 3.4.1 Release?

2023-06-09 Thread Xiao Li
+1

On Fri, Jun 9, 2023 at 08:30 Wenchen Fan  wrote:

> +1
>
> On Fri, Jun 9, 2023 at 8:52 PM Xinrong Meng  wrote:
>
>> +1. Thank you Dongjoon!
>>
>> Thanks,
>>
>> Xinrong Meng
>>
>> Mridul Muralidharan wrote on Fri, Jun 9, 2023 at 5:22 AM:
>>
>>>
>>> +1, thanks Dongjoon !
>>>
>>> Regards,
>>> Mridul
>>>
>>> On Thu, Jun 8, 2023 at 7:16 PM Jia Fan 
>>> wrote:
>>>
 +1

 


 Jia Fan



On Jun 9, 2023, at 08:00, Yuming Wang wrote:

 +1.

 On Fri, Jun 9, 2023 at 7:14 AM Chao Sun  wrote:

> +1 too
>
> On Thu, Jun 8, 2023 at 2:34 PM kazuyuki tanimura
>  wrote:
> >
> > +1 (non-binding), Thank you Dongjoon
> >
> > Kazu
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [ANNOUNCE] Apache Spark 3.4.0 released

2023-04-14 Thread Xiao Li
@Dongjoon Hyun  Thank you!

Could you also help update the latest tag?
https://hub.docker.com/r/apache/spark/tags

Xiao

Dongjoon Hyun wrote on Fri, Apr 14, 2023 at 16:23:

> Apache Spark Docker images are published too.
>
> docker pull apache/spark:v3.4.0
> docker pull apache/spark-py:v3.4.0
> docker pull apache/spark-r:v3.4.0
>
> Thanks,
> Dongjoon
>
>
> On Fri, Apr 14, 2023 at 2:56 PM Dongjoon Hyun 
> wrote:
>
>> Thank you, Xinrong!
>>
>> Dongjoon.
>>
>>
>> On Fri, Apr 14, 2023 at 1:37 PM Xiao Li  wrote:
>>
>>> Thank you Xinrong!
>>>
>>> Congratulations everyone! This is a great release with tons of new
>>> features!
>>>
>>>
>>>
>>> Gengliang Wang wrote on Fri, Apr 14, 2023 at 13:04:
>>>
>>>> Congratulations everyone!
>>>> Thank you Xinrong for driving the release!
>>>>
>>>> On Fri, Apr 14, 2023 at 12:47 PM Xinrong Meng 
>>>> wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> We are happy to announce the availability of *Apache Spark 3.4.0*!
>>>>>
>>>>> Apache Spark 3.4.0 is the fifth release of the 3.x line.
>>>>>
>>>>> To download Spark 3.4.0, head over to the download page:
>>>>> https://spark.apache.org/downloads.html
>>>>>
>>>>> To view the release notes:
>>>>> https://spark.apache.org/releases/spark-release-3-4-0.html
>>>>>
>>>>> We would like to acknowledge all community members for contributing to
>>>>> this
>>>>> release. This release would not have been possible without you.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Xinrong Meng
>>>>>
>>>>


Re: [ANNOUNCE] Apache Spark 3.4.0 released

2023-04-14 Thread Xiao Li
Thank you Xinrong!

Congratulations everyone! This is a great release with tons of new features!



Gengliang Wang wrote on Fri, Apr 14, 2023 at 13:04:

> Congratulations everyone!
> Thank you Xinrong for driving the release!
>
> On Fri, Apr 14, 2023 at 12:47 PM Xinrong Meng 
> wrote:
>
>> Hi All,
>>
>> We are happy to announce the availability of *Apache Spark 3.4.0*!
>>
>> Apache Spark 3.4.0 is the fifth release of the 3.x line.
>>
>> To download Spark 3.4.0, head over to the download page:
>> https://spark.apache.org/downloads.html
>>
>> To view the release notes:
>> https://spark.apache.org/releases/spark-release-3-4-0.html
>>
>> We would like to acknowledge all community members for contributing to
>> this
>> release. This release would not have been possible without you.
>>
>> Thanks,
>>
>> Xinrong Meng
>>
>


Re: [VOTE] Release Apache Spark 3.4.0 (RC7)

2023-04-12 Thread Xiao Li
+1

Xiao Li

Emil Ejbyfeldt wrote on Wed, Apr 12, 2023 at 12:39:

> +1 (non-binding)
>
> Ran some tests with the Scala 2.13 build using part of our internal
> spark workload.
>
> On 12/04/2023 19:52, Chris Nauroth wrote:
> > +1 (non-binding)
> >
> > * Verified all checksums.
> > * Verified all signatures.
> > * Built from source, with multiple profiles, to full success:
> >  * build/mvn -Phadoop-cloud -Phive-thriftserver -Pkubernetes
> > -Psparkr -Pyarn -DskipTests clean package
> > * Tests passed.
> > * Ran several examples successfully:
> >  * bin/spark-submit --class org.apache.spark.examples.SparkPi
> > examples/jars/spark-examples_2.13-3.4.0.jar
> >  * bin/spark-submit --class
> > org.apache.spark.examples.sql.hive.SparkHiveExample
> > examples/jars/spark-examples_2.13-3.4.0.jar
> >  * bin/spark-submit
> > examples/src/main/python/streaming/network_wordcount.py localhost 
> >
> > Chris Nauroth
> >
> >
> > On Tue, Apr 11, 2023 at 10:36 PM beliefer <belie...@163.com> wrote:
> >
> > +1
> >
> >
> > At 2023-04-08 07:29:46, "Xinrong Meng" <xinrong.apa...@gmail.com> wrote:
> >
> > Please vote on releasing the following candidate(RC7) as Apache
> > Spark version 3.4.0.
> >
> > The vote is open until 11:59pm Pacific time *April 12th* and
> > passes if a majority +1 PMC votes are cast, with a minimum of 3
> > +1 votes.
> >
> > [ ] +1 Release this package as Apache Spark 3.4.0
> > [ ] -1 Do not release this package because ...
> >
> > To learn more about Apache Spark, please see
> > http://spark.apache.org/
> >
> > The tag to be voted on is v3.4.0-rc7 (commit
> > 87a5442f7ed96b11051d8a9333476d080054e5a0):
> > https://github.com/apache/spark/tree/v3.4.0-rc7
> >
> > The release files, including signatures, digests, etc. can be
> > found at:
> > https://dist.apache.org/repos/dist/dev/spark/v3.4.0-rc7-bin/
> > Signatures used for Spark RCs can be found in this file:
> > https://dist.apache.org/repos/dist/dev/spark/KEYS
> >
> > The staging repository for this release can be found at:
> >
> https://repository.apache.org/content/repositories/orgapachespark-1441
> >
> > The documentation corresponding to this release can be found at:
> > https://dist.apache.org/repos/dist/dev/spark/v3.4.0-rc7-docs/
> >
> > The list of bug fixes going into 3.4.0 can be found at the
> > following URL:
> > https://issues.apache.org/jira/projects/SPARK/versions/12351465
> >
> >
> > This release is using the release script of the tag v3.4.0-rc7.
> >
> >
> > FAQ
> >
> > =
> > How can I help test this release?
> > =
> > If you are a Spark user, you can help us test this release by
> taking
> > an existing Spark workload and running on this release
> > candidate, then
> > reporting any regressions.
> >
> > If you're working in PySpark you can set up a virtual env and
> > install
> > the current RC and see if anything important breaks, in the
> > Java/Scala
> > you can add the staging repository to your projects resolvers
> > and test
> > with the RC (make sure to clean up the artifact cache
> > before/after so
> > you don't end up building with an out of date RC going forward).
> >
> > ===
> > What should happen to JIRA tickets still targeting 3.4.0?
> > ===
> > The current list of open tickets targeted at 3.4.0 can be found
> at:
> > https://issues.apache.org/jira/projects/SPARK

Re: [VOTE] Release Apache Spark 3.4.0 (RC7)

2023-04-11 Thread Xiao Li
Thanks for testing it in your environment!


> This is a minor issue itself, and only impacts the metrics for push-based
> shuffle, but it will essentially completely eliminate the effort
> in SPARK-36620.


Based on my understanding, this is not a regression. It only affects the
new enhancements (https://issues.apache.org/jira/browse/SPARK-36620). If so,
it does not block the release RC.

We should still fix it in 3.4 and the fix will be available in the next
maintenance releases.

Xiao



Ye Zhou wrote on Tue, Apr 11, 2023 at 17:14:

> Manually tested the binary in our cluster.
> Started spark-shell application with some shuffle. Found one issue which
> is related to push based shuffle client side metrics introduced in
> https://github.com/apache/spark/pull/36165.
> Filed a ticket https://issues.apache.org/jira/browse/SPARK-43100, posted
> PR there, and verified that the PR fixes the issue.
>
> This is a minor issue itself, and only impacts the metrics for push-based
> shuffle, but it will essentially completely eliminate the effort
> in SPARK-36620.
>
> Would like to raise this issue in the voting thread, but hold my
> non-binding -1 here.
>
>
> On Tue, Apr 11, 2023 at 1:06 AM Peter Toth  wrote:
>
>> +1
>>
>> Jia Fan wrote (on Tue, Apr 11, 2023 at 9:09):
>>
>>> +1
>>>
>>> Wenchen Fan wrote on Tue, Apr 11, 2023 at 14:32:
>>>
 +1

 On Tue, Apr 11, 2023 at 9:57 AM Yuming Wang  wrote:

> +1.
>
> On Tue, Apr 11, 2023 at 9:14 AM Yikun Jiang 
> wrote:
>
>> +1 (non-binding)
>>
>> Also ran the docker image related test (signatures/standalone/k8s)
>> with rc7: https://github.com/apache/spark-docker/pull/32
>>
>> Regards,
>> Yikun
>>
>>
>> On Tue, Apr 11, 2023 at 4:44 AM Jacek Laskowski 
>> wrote:
>>
>>> +1
>>>
>>> * Built fine with Scala 2.13
>>> and -Pkubernetes,hadoop-cloud,hive,hive-thriftserver,scala-2.13,volcano
>>> * Ran some demos on Java 17
>>> * Mac mini / Apple M2 Pro / Ventura 13.3.1
>>>
>>> Pozdrawiam,
>>> Jacek Laskowski
>>> 
>>> "The Internals Of" Online Books 
>>> Follow me on https://twitter.com/jaceklaskowski
>>>
>>> 
>>>
>>>
>>> On Sat, Apr 8, 2023 at 1:30 AM Xinrong Meng <
>>> xinrong.apa...@gmail.com> wrote:
>>>
 Please vote on releasing the following candidate(RC7) as Apache
 Spark version 3.4.0.

 The vote is open until 11:59pm Pacific time *April 12th* and
 passes if a majority +1 PMC votes are cast, with a minimum of 3 +1 
 votes.

 [ ] +1 Release this package as Apache Spark 3.4.0
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see
 http://spark.apache.org/

 The tag to be voted on is v3.4.0-rc7 (commit
 87a5442f7ed96b11051d8a9333476d080054e5a0):
 https://github.com/apache/spark/tree/v3.4.0-rc7

 The release files, including signatures, digests, etc. can be found
 at:
 https://dist.apache.org/repos/dist/dev/spark/v3.4.0-rc7-bin/

 Signatures used for Spark RCs can be found in this file:
 https://dist.apache.org/repos/dist/dev/spark/KEYS

 The staging repository for this release can be found at:

 https://repository.apache.org/content/repositories/orgapachespark-1441

 The documentation corresponding to this release can be found at:
 https://dist.apache.org/repos/dist/dev/spark/v3.4.0-rc7-docs/

 The list of bug fixes going into 3.4.0 can be found at the
 following URL:
 https://issues.apache.org/jira/projects/SPARK/versions/12351465

 This release is using the release script of the tag v3.4.0-rc7.


 FAQ

 =
 How can I help test this release?
 =
 If you are a Spark user, you can help us test this release by taking
 an existing Spark workload and running on this release candidate,
 then
 reporting any regressions.

 If you're working in PySpark you can set up a virtual env and
 install
 the current RC and see if anything important breaks, in the
 Java/Scala
 you can add the staging repository to your projects resolvers and
 test
 with the RC (make sure to clean up the artifact cache before/after
 so
 you don't end up building with an out of date RC going forward).

 ===
 What should happen to JIRA tickets still targeting 3.4.0?
 ===
 The current list of open tickets targeted at 3.4.0 can be found at:
 

Re: [VOTE] Release Apache Spark 3.4.0 (RC5)

2023-04-05 Thread Xiao Li
Hi, Anton,

Could you please provide a complete list of exceptions that are being used
in the public connector API?

Thanks,

Xiao

Xinrong Meng wrote on Wed, Apr 5, 2023 at 12:06:

> Thank you!
>
> I created a blocker Jira for that for easier tracking:
> https://issues.apache.org/jira/browse/SPARK-43041.
>
>
> On Wed, Apr 5, 2023 at 11:20 AM Gengliang Wang  wrote:
>
>> Hi Anton,
>>
>> +1 for adding the old constructors back!
>> Could you raise a PR for this? I will review it ASAP.
>>
>> Thanks
>> Gengliang
>>
>> On Wed, Apr 5, 2023 at 9:37 AM Anton Okolnychyi 
>> wrote:
>>
>>> Sorry, I think my last message did not land on the list.
>>>
>>> I have a question about changes to exceptions used in the public
>>> connector API, such as NoSuchTableException and TableAlreadyExistsException.
>>>
>>> I consider those as part of the public Catalog API (TableCatalog uses
>>> them in method definitions). However, it looks like PR #37887 has changed
>>> them in an incompatible way. Old constructors accepting Identifier objects
>>> got removed. The only way to construct such exceptions is either by passing
>>> database and table strings or Scala Seq. Shall we add back old constructors
>>> to avoid breaking connectors?
>>>
>>> [1] - https://github.com/apache/spark/pull/37887/
>>> [2] - https://issues.apache.org/jira/browse/SPARK-40360
>>> [3] -
>>> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/NoSuchItemException.scala
>>>
>>> - Anton
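
To make the compatibility concern concrete, a hedged sketch of connector-side code written against the TableCatalog contract (the helper name and its use are made up):

    // TableCatalog.loadTable is documented to throw NoSuchTableException,
    // so connector code both catches it and, in catalog implementations,
    // constructs it.
    import org.apache.spark.sql.catalyst.analysis.NoSuchTableException
    import org.apache.spark.sql.connector.catalog.{Identifier, Table, TableCatalog}

    def loadIfExists(catalog: TableCatalog, ident: Identifier): Option[Table] =
      try {
        Some(catalog.loadTable(ident))
      } catch {
        case _: NoSuchTableException => None
      }

A catalog implementation that raised this exception via the Identifier-based constructor is exactly the kind of connector that breaks when that constructor is removed.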
>>>
>>> On 2023/04/05 16:23:52 Xinrong Meng wrote:
>>> > Considering the above blockers have been resolved, I am about to
>>> > cut v3.4.0-rc6 if no objections.
>>> >
>>> > On Tue, Apr 4, 2023 at 8:20 AM Xinrong Meng 
>>> > wrote:
>>> >
>>> > > Thank you Wenchen for the report. I marked them as blockers just now.
>>> > >
>>> > > On Tue, Apr 4, 2023 at 10:52 AM Wenchen Fan 
>>> wrote:
>>> > >
>>> > >> Sorry for the last-minute change, but we found two wrong behaviors
>>> and
>>> > >> want to fix them before the release:
>>> > >>
>>> > >> https://github.com/apache/spark/pull/40641
>>> > >> We missed a corner case when the input index for `array_insert` is 0.
>>> > >> It should fail, as 0 is an invalid index.
>>> > >>
>>> > >> https://github.com/apache/spark/pull/40623
>>> > >> We found some usability issues with a new API and need to change
>>> the API
>>> > >> to fix it. If people have concerns we can also remove the new API
>>> entirely.
>>> > >>
>>> > >> Thus I'm -1 to this RC. I'll merge these 2 PRs today if no
>>> objections.
>>> > >>
>>> > >> Thanks,
>>> > >> Wenchen
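
For context on the first fix above, a small illustration of the array_insert corner case, assuming an existing SparkSession named `spark` and the 1-based index semantics described in the message:

    // Positions are 1-based; negative positions count from the end.
    spark.sql("SELECT array_insert(array(1, 2, 3), 2, 99)").show()
    // -> [1, 99, 2, 3]

    // Index 0 is not a valid position; with the fix referenced above this
    // raises an error instead of being accepted.
    spark.sql("SELECT array_insert(array(1, 2, 3), 0, 99)").show()  // throws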
>>> > >>
>>> > >> On Tue, Apr 4, 2023 at 3:47 AM L. C. Hsieh 
>>> wrote:
>>> > >>
>>> > >>> +1
>>> > >>>
>>> > >>> Thanks Xinrong.
>>> > >>>
>>> > >>> On Mon, Apr 3, 2023 at 12:35 PM Dongjoon Hyun <
>>> dongjoon.h...@gmail.com>
>>> > >>> wrote:
>>> > >>> >
>>> > >>> > +1
>>> > >>> >
>>> > >>> > I also verified that RC5 has SBOM artifacts.
>>> > >>> >
>>> > >>> >
>>> > >>>
>>> https://repository.apache.org/content/repositories/orgapachespark-1439/org/apache/spark/spark-core_2.12/3.4.0/spark-core_2.12-3.4.0-cyclonedx.json
>>> > >>> >
>>> > >>>
>>> https://repository.apache.org/content/repositories/orgapachespark-1439/org/apache/spark/spark-core_2.13/3.4.0/spark-core_2.13-3.4.0-cyclonedx.json
>>> > >>> >
>>> > >>> > Thanks,
>>> > >>> > Dongjoon.
>>> > >>> >
>>> > >>> >
>>> > >>> >
>>> > >>> > On Mon, Apr 3, 2023 at 1:57 AM yangjie01 
>>> wrote:
>>> > >>> >>
>>> > >>> >> +1, checked Java 17 + Scala 2.13 + Python 3.10.10.
>>> > >>> >>
>>> > >>> >>
>>> > >>> >>
>>> > >>> >> From: Herman van Hovell 
>>> > >>> >> Date: Friday, Mar 31, 2023 12:12
>>> > >>> >> To: Sean Owen 
>>> > >>> >> Cc: Xinrong Meng , dev <
>>> > >>> dev@spark.apache.org>
>>> > >>> >> Subject: Re: [VOTE] Release Apache Spark 3.4.0 (RC5)
>>> > >>> >>
>>> > >>> >>
>>> > >>> >>
>>> > >>> >> +1
>>> > >>> >>
>>> > >>> >>
>>> > >>> >>
>>> > >>> >> On Thu, Mar 30, 2023 at 11:05 PM Sean Owen 
>>> wrote:
>>> > >>> >>
>>> > >>> >> +1 same result from me as last time.
>>> > >>> >>
>>> > >>> >>
>>> > >>> >>
>>> > >>> >> On Thu, Mar 30, 2023 at 3:21 AM Xinrong Meng <
>>> > >>> xinrong.apa...@gmail.com> wrote:
>>> > >>> >>
>>> > >>> >> Please vote on releasing the following candidate(RC5) as Apache
>>> Spark
>>> > >>> version 3.4.0.
>>> > >>> >>
>>> > >>> >> The vote is open until 11:59pm Pacific time April 4th and
>>> passes if a
>>> > >>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>> > >>> >>
>>> > >>> >> [ ] +1 Release this package as Apache Spark 3.4.0
>>> > >>> >> [ ] -1 Do not release this package because ...
>>> > >>> >>
>>> > >>> >> To learn more about Apache Spark, please see
>>> http://spark.apache.org/
>>> > >>> >>
>>> > >>> >> The tag to be voted on is v3.4.0-rc5 (commit
>>> > >>> f39ad617d32a671e120464e4a75986241d72c487):
>>> > >>> >> https://github.com/apache/spark/tree/v3.4.0-rc5
>>> > >>> >>
>>> > >>> >> The release files, including signatures, digests, etc. can be
>>> found
>>> > >>> at:
>>> > 

Re: Slack for PySpark users

2023-03-30 Thread Xiao Li
Hi, Dongjoon,

The other communities (e.g., Pinot, Druid, Flink) created their own Slack
workspaces last year. I did not see an objection from the ASF board. At the
same time, Slack workspaces are very popular and useful in most non-ASF
open source communities. TBH, we are kind of late. I think we can do the
same in our community?

We can follow the guide once the ASF has an official process for ASF
archiving. Since our PMC is the owner of the Slack workspace, we can make
changes based on that policy. WDYT?

Xiao


On Thu, Mar 30, 2023 at 9:03 AM Dongjoon Hyun wrote:

> Hi, Xiao and all.
>
> (cc Matei)
>
> Please hold on the vote.
>
> There is a concern expressed by the ASF board because recent Slack
> activities created an isolated silo outside of the ASF mailing list archive.
>
> We need to establish a way to bring that content back into the ASF archive
> before starting anything official.
>
> Bests,
> Dongjoon.
>
>
>
> On Wed, Mar 29, 2023 at 11:32 PM Xiao Li  wrote:
>
>> +1
>>
>> + @dev@spark.apache.org 
>>
>> This is a good idea. The other Apache projects (e.g., Pinot, Druid,
>> Flink) have created their own dedicated Slack workspaces for faster
>> communication. We can do the same in Apache Spark. The Slack workspace will
>> be maintained by the Apache Spark PMC. I propose to initiate a vote for the
>> creation of a new Apache Spark Slack workspace. Does that sound good?
>>
>> Cheers,
>>
>> Xiao
>>
>>
>>
>>
>>
>>
>>
On Tue, Mar 28, 2023 at 7:07 AM Mich Talebzadeh wrote:
>>
>>> I created one at slack called pyspark
>>>
>>>
>>> Mich Talebzadeh,
>>> Lead Solutions Architect/Engineering Lead
>>> Palantir Technologies Limited
>>>
>>>
>>>view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Tue, 28 Mar 2023 at 03:52, asma zgolli  wrote:
>>>
>>>> +1 good idea, I'd like to join as well.
>>>>
>>>> On Tue, Mar 28, 2023 at 4:09 AM Winston Lai wrote:
>>>>
>>>>> Please let us know when the channel is created. I'd like to join :)
>>>>>
>>>>> Thank You & Best Regards
>>>>> Winston Lai
>>>>> --
>>>>> *From:* Denny Lee 
>>>>> *Sent:* Tuesday, March 28, 2023 9:43:08 AM
>>>>> *To:* Hyukjin Kwon 
>>>>> *Cc:* keen ; u...@spark.apache.org <
>>>>> u...@spark.apache.org>
>>>>> *Subject:* Re: Slack for PySpark users
>>>>>
>>>>> +1 I think this is a great idea!
>>>>>
>>>>> On Mon, Mar 27, 2023 at 6:24 PM Hyukjin Kwon 
>>>>> wrote:
>>>>>
>>>>> Yeah, actually I think we should have a Slack channel so we can
>>>>> easily discuss things with users and developers.
>>>>>
>>>>> On Tue, 28 Mar 2023 at 03:08, keen  wrote:
>>>>>
>>>>> Hi all,
>>>>> I really like *Slack* as a communication channel for a tech community.
>>>>> There is a Slack workspace for *delta lake users* (
>>>>> https://go.delta.io/slack) that I enjoy a lot.
>>>>> I was wondering if there is something similar for PySpark users.
>>>>>
>>>>> If not, would there be anything wrong with creating a new
>>>>> Slack workspace for PySpark users? (when explicitly mentioning that this 
>>>>> is
>>>>> *not* officially part of Apache Spark)?
>>>>>
>>>>> Cheers
>>>>> Martin
>>>>>
>>>>>
>>>>
>>>> --
>>>> Asma ZGOLLI
>>>>
>>>> Ph.D. in Big Data - Applied Machine Learning
>>>>
>>>>


Re: Slack for PySpark users

2023-03-30 Thread Xiao Li
+1

+ @dev@spark.apache.org 

This is a good idea. The other Apache projects (e.g., Pinot, Druid, Flink)
have created their own dedicated Slack workspaces for faster communication.
We can do the same in Apache Spark. The Slack workspace will be maintained
by the Apache Spark PMC. I propose to initiate a vote for the creation of a
new Apache Spark Slack workspace. Does that sound good?

Cheers,

Xiao







On Tue, Mar 28, 2023 at 7:07 AM Mich Talebzadeh wrote:

> I created one at slack called pyspark
>
>
> Mich Talebzadeh,
> Lead Solutions Architect/Engineering Lead
> Palantir Technologies Limited
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Tue, 28 Mar 2023 at 03:52, asma zgolli  wrote:
>
>> +1 good idea, I'd like to join as well.
>>
>> On Tue, Mar 28, 2023 at 4:09 AM Winston Lai wrote:
>>
>>> Please let us know when the channel is created. I'd like to join :)
>>>
>>> Thank You & Best Regards
>>> Winston Lai
>>> --
>>> *From:* Denny Lee 
>>> *Sent:* Tuesday, March 28, 2023 9:43:08 AM
>>> *To:* Hyukjin Kwon 
>>> *Cc:* keen ; u...@spark.apache.org <
>>> u...@spark.apache.org>
>>> *Subject:* Re: Slack for PySpark users
>>>
>>> +1 I think this is a great idea!
>>>
>>> On Mon, Mar 27, 2023 at 6:24 PM Hyukjin Kwon 
>>> wrote:
>>>
>>> Yeah, actually I think we should have a Slack channel so we can
>>> easily discuss things with users and developers.
>>>
>>> On Tue, 28 Mar 2023 at 03:08, keen  wrote:
>>>
>>> Hi all,
>>> I really like *Slack* as a communication channel for a tech community.
>>> There is a Slack workspace for *delta lake users* (
>>> https://go.delta.io/slack) that I enjoy a lot.
>>> I was wondering if there is something similar for PySpark users.
>>>
>>> If not, would there be anything wrong with creating a new
>>> Slack workspace for PySpark users? (when explicitly mentioning that this is
>>> *not* officially part of Apache Spark)?
>>>
>>> Cheers
>>> Martin
>>>
>>>
>>
>> --
>> Asma ZGOLLI
>>
>> Ph.D. in Big Data - Applied Machine Learning
>>
>>


Re: [ANNOUNCE] Apache Spark 3.3.2 released

2023-02-18 Thread Xiao Li
Thank you, Liang-Chi!

Xiao


On Sat, Feb 18, 2023 at 1:07 AM beliefer  wrote:

> Congratulations !
>
>
>
> At 2023-02-17 16:58:22, "L. C. Hsieh"  wrote:
> >We are happy to announce the availability of Apache Spark 3.3.2!
> >
> >Spark 3.3.2 is a maintenance release containing stability fixes. This
> >release is based on the branch-3.3 maintenance branch of Spark. We strongly
> >recommend all 3.3 users to upgrade to this stable release.
> >
> >To download Spark 3.3.2, head over to the download page:
> >https://spark.apache.org/downloads.html
> >
> >To view the release notes:
> >https://spark.apache.org/releases/spark-release-3-3-2.html
> >
> >We would like to acknowledge all community members for contributing to this
> >release. This release would not have been possible without you.
> >
> >-
> >To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>

--


Re: Welcome Yikun Jiang as a Spark committer

2022-10-09 Thread Xiao Li
Congratulations, Yikun!

Xiao

On Sun, Oct 9, 2022 at 7:34 PM Yikun Jiang wrote:

> Thank you all!
>
> Regards,
> Yikun
>
>
> On Mon, Oct 10, 2022 at 3:18 AM Chao Sun  wrote:
>
>> Congratulations Yikun!
>>
>> On Sun, Oct 9, 2022 at 11:14 AM vaquar khan 
>> wrote:
>>
>>> Congratulations.
>>>
>>> Regards,
>>> Vaquar khan
>>>
>>> On Sun, Oct 9, 2022, 6:46 AM 叶先进  wrote:
>>>
 Congrats

 On Oct 9, 2022, at 16:44, XiDuo You  wrote:

 Congratulations, Yikun !

 On Sun, Oct 9, 2022 at 3:59 PM Maxim Gekk wrote:

> Keep up the great work, Yikun!
>
> On Sun, Oct 9, 2022 at 10:52 AM Gengliang Wang 
> wrote:
>
>> Congratulations, Yikun!
>>
>> On Sun, Oct 9, 2022 at 12:33 AM 416161...@qq.com <
>> ruife...@foxmail.com> wrote:
>>
>>> Congrats, Yikun!
>>>
>>> --
>>> Ruifeng Zheng
>>> ruife...@foxmail.com
>>>
>>> 
>>>
>>>
>>>
>>> -- Original --
>>> *From:* "Martin Grigorov" ;
>>> *Date:* Sun, Oct 9, 2022 05:01 AM
>>> *To:* "Hyukjin Kwon";
>>> *Cc:* "dev";"Yikun Jiang">> >;
>>> *Subject:* Re: Welcome Yikun Jiang as a Spark committer
>>>
>>> Congratulations, Yikun!
>>>
>>> On Sat, Oct 8, 2022 at 7:41 AM Hyukjin Kwon 
>>> wrote:
>>>
 Hi all,

 The Spark PMC recently added Yikun Jiang as a committer on the
 project.
 Yikun is the major contributor of the infrastructure and GitHub
 Actions in Apache Spark as well as Kubernetes and PySpark.
 He has put a lot of effort into stabilizing and optimizing the
 builds so we all can work together in Apache Spark more
 efficiently and effectively. He's also driving the SPIP for Docker
 official image in Apache Spark as well, for users and developers.
 Please join me in welcoming Yikun!





Re: Dropping Apache Spark Hadoop2 Binary Distribution?

2022-10-05 Thread Xiao Li
+1.

Xiao

On Wed, Oct 5, 2022 at 12:49 PM Sean Owen  wrote:

> I'm OK with this. It simplifies maintenance a bit, and specifically may
> allow us to finally move off of the ancient version of Guava (?)
>
> On Mon, Oct 3, 2022 at 10:16 PM Dongjoon Hyun 
> wrote:
>
>> Hi, All.
>>
>> I'm wondering if the following Apache Spark Hadoop2 Binary Distribution
>> is still used by someone in the community or not. If it's not used or not
>> useful,
>> we may remove it from Apache Spark 3.4.0 release.
>>
>>
>> https://downloads.apache.org/spark/spark-3.3.0/spark-3.3.0-bin-hadoop2.tgz
>>
>> Here is the background of this question.
>> Since Apache Spark 2.2.0 (SPARK-19493, SPARK-19550), the Apache
>> Spark community has been building and releasing with Java 8 only.
>> I believe that user applications also use Java 8+ these days.
>> Recently, I received the following message from the Hadoop PMC.
>>
>>   > "if you really want to claim hadoop 2.x compatibility, then you have
>> to
>>   > be building against java 7". Otherwise a lot of people with hadoop 2.x
>>   > clusters won't be able to run your code. If your projects are java8+
>>   > only, then they are implicitly hadoop 3.1+, no matter what you use
>>   > in your build. Hence: no need for branch-2 branches except
>>   > to complicate your build/test/release processes [1]
>>
>> If the Hadoop 2 binary distribution is no longer used as of today,
>> or is incomplete somewhere due to being built with Java 8, the following
>> three existing alternative Hadoop 3 binary distributions could be
>> a better official solution for old Hadoop 2 clusters.
>>
>> 1) Scala 2.12 and without-hadoop distribution
>> 2) Scala 2.12 and Hadoop 3 distribution
>> 3) Scala 2.13 and Hadoop 3 distribution
>>
>> In short, is there anyone who is using Apache Spark 3.3.0 Hadoop2 Binary
>> distribution?
>>
>> Dongjoon
>>
>> [1]
>> https://issues.apache.org/jira/browse/ORC-1251?focusedCommentId=17608247&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17608247
>>
>

--


Re: [DISCUSS] SPIP: Support Docker Official Image for Spark

2022-09-21 Thread Xiao Li
+1

On Wed, Sep 21, 2022 at 7:22 AM Yikun Jiang wrote:

> Thanks for all your inputs! BTW, I also create a JIRA to track related
> work: https://issues.apache.org/jira/browse/SPARK-40513
>
> > can I be involved in this work?
>
> @qian Of course! Thanks!
>
> Regards,
> Yikun
>
> On Wed, Sep 21, 2022 at 7:31 PM Xinrong Meng 
> wrote:
>
>> +1
>>
>> On Tue, Sep 20, 2022 at 11:08 PM Qian SUN  wrote:
>>
>>> +1.
>>> It's valuable, can I be involved in this work?
>>>
>>> On Mon, Sep 19, 2022 at 8:15 AM Yikun Jiang wrote:
>>>
 Hi, all

 I would like to start the discussion for supporting Docker Official
 Image for Spark.

 This SPIP proposes to add a Docker Official Image (DOI) to ensure the
 Spark Docker images meet the quality standards for Docker images, and to
 provide these Docker images for users who want to use Apache Spark via
 a Docker image.

 There are also several Apache projects that release Docker Official
 Images, such as: flink, storm, solr, zookeeper, and httpd
 (with 50M+ to 1B+ downloads each).
 From the huge download statistics, we can see the real demand from users,
 and from the support in other Apache projects, we should also be able to
 do it.

 After support:

- The Dockerfile will still be maintained by the Apache Spark
  community and reviewed by Docker.
- The images will be maintained by the Docker community to ensure they
  meet the Docker community's quality standards for Docker images.


 It will also reduce the Apache Spark community's extra Docker image
 maintenance effort (such as frequent rebuilds and image security updates).

 See more in SPIP DOC:
 https://docs.google.com/document/d/1nN-pKuvt-amUcrkTvYAQ-bJBgtsWb9nAkNoVNRM2S2o

 cc: Ruifeng (co-author) and Hyukjin (shepherd)

 Regards,
 Yikun

>>>
>>>
>>> --
>>> Best!
>>> Qian SUN
>>>
>>


Welcoming three new PMC members

2022-08-09 Thread Xiao Li
Hi all,

The Spark PMC recently voted to add three new PMC members. Join me in
welcoming them to their new roles!

New PMC members: Huaxin Gao, Gengliang Wang and Maxim Gekk

The Spark PMC


Re: Apache Spark 3.2.2 Release?

2022-07-06 Thread Xiao Li
+1

Xiao

On Wed, Jul 6, 2022 at 7:16 PM Cheng Su wrote:

> +1 (non-binding)
>
> Thanks,
> Cheng Su
>
> On Wed, Jul 6, 2022 at 6:01 PM Yuming Wang  wrote:
>
>> +1
>>
>> On Thu, Jul 7, 2022 at 5:53 AM Maxim Gekk
>>  wrote:
>>
>>> +1
>>>
>>> On Thu, Jul 7, 2022 at 12:26 AM John Zhuge  wrote:
>>>
 +1  Thanks for the effort!

 On Wed, Jul 6, 2022 at 2:23 PM Bjørn Jørgensen <
 bjornjorgen...@gmail.com> wrote:

> +1
>
> On Wed, Jul 6, 2022 at 11:05 PM Hyukjin Kwon wrote:
>
>> Yeah +1
>>
>> On Thu, Jul 7, 2022 at 5:40 AM Dongjoon Hyun 
>> wrote:
>>
>>> Hi, All.
>>>
>>> Since the Apache Spark 3.2.1 tag creation (Jan 19), 197 new patches,
>>> including 11 correctness patches, have arrived at branch-3.2.
>>>
>>> Shall we make a new release, Apache Spark 3.2.2, as the third release
>>> of the 3.2 line? I'd like to volunteer as the release manager for Apache
>>> Spark 3.2.2. I'm thinking about starting the first RC next week.
>>>
>>> $ git log --oneline v3.2.1..HEAD | wc -l
>>>  197
>>>
>>> # Correctness issues
>>>
>>> SPARK-38075 Hive script transform with order by and limit will
>>> return fake rows
>>> SPARK-38204 All state operators are at a risk of inconsistency
>>> between state partitioning and operator partitioning
>>> SPARK-38309 SHS has incorrect percentiles for shuffle read bytes
>>> and shuffle total blocks metrics
>>> SPARK-38320 (flat)MapGroupsWithState can timeout groups which
>>> just
>>> received inputs in the same microbatch
>>> SPARK-38614 After Spark update, df.show() shows incorrect
>>> F.percent_rank results
>>> SPARK-38655 OffsetWindowFunctionFrameBase cannot find the offset
>>> row whose input is not null
>>> SPARK-38684 Stream-stream outer join has a possible correctness
>>> issue due to weakly read consistent on outer iterators
>>> SPARK-39061 Incorrect results or NPE when using Inline function
>>> against an array of dynamically created structs
>>> SPARK-39107 Silent change in regexp_replace's handling of empty
>>> strings
>>> SPARK-39259 Timestamps returned by now() and equivalent functions
>>> are not consistent in subqueries
>>> SPARK-39293 The accumulator of ArrayAggregate should copy the
>>> intermediate result if string, struct, array, or map
>>>
>>> Best,
>>> Dongjoon.
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>> --
 John Zhuge

>>>


Re: [PSA] Please rebase and sync your master branch in your forked repository

2022-06-20 Thread Xiao Li
Thank you, Hyukjin!

Xiao

On Mon, Jun 20, 2022 at 7:01 PM Yi Wu  wrote:

> Thanks for the work, Hyukjin.
>
> On Tue, Jun 21, 2022 at 7:59 AM Yuming Wang  wrote:
>
>> Thank you Hyukjin.
>>
>> On Tue, Jun 21, 2022 at 7:46 AM Hyukjin Kwon  wrote:
>>
>>> After https://github.com/apache/spark/pull/36922 gets merged, it
>>> requires your fork's master branch to be synced to the latest master branch
>>> in Apache Spark. Otherwise, builds would not be triggered in your PR.
>>>
>>>

--


Re: Re: [VOTE][SPIP] Spark Connect

2022-06-15 Thread Xiao Li
+1

Xiao

On Tue, Jun 14, 2022 at 3:35 AM beliefer wrote:

> +1
> Yeah, I tried to use Apache Livy so that we can run interactive queries.
> But the Spark Driver in Livy looks heavy.
>
> The SPIP may resolve the issue.
>
>
>
> At 2022-06-14 18:11:21, "Wenchen Fan"  wrote:
>
> +1
>
> On Tue, Jun 14, 2022 at 9:38 AM Ruifeng Zheng 
> wrote:
>
>> +1
>>
>>
>> -- Original Message --
>> *From:* "huaxin gao";
>> *Sent:* Tuesday, June 14, 2022, 8:47 AM
>> *To:* "L. C. Hsieh";
>> *Cc:* "Spark dev list";
>> *Subject:* Re: [VOTE][SPIP] Spark Connect
>>
>> +1
>>
>> On Mon, Jun 13, 2022 at 5:42 PM L. C. Hsieh  wrote:
>>
>>> +1
>>>
>>> On Mon, Jun 13, 2022 at 5:41 PM Chao Sun  wrote:
>>> >
>>> > +1 (non-binding)
>>> >
>>> > On Mon, Jun 13, 2022 at 5:11 PM Hyukjin Kwon 
>>> wrote:
>>> >>
>>> >> +1
>>> >>
>>> >> On Tue, 14 Jun 2022 at 08:50, Yuming Wang  wrote:
>>> >>>
>>> >>> +1.
>>> >>>
>>> >>> On Tue, Jun 14, 2022 at 2:20 AM Matei Zaharia <
>>> matei.zaha...@gmail.com> wrote:
>>> 
>>>  +1, very excited about this direction.
>>> 
>>>  Matei
>>> 
>>>  On Jun 13, 2022, at 11:07 AM, Herman van Hovell
>>>  wrote:
>>> 
>>>  Let me kick off the voting...
>>> 
>>>  +1
>>> 
>>>  On Mon, Jun 13, 2022 at 2:02 PM Herman van Hovell <
>>> her...@databricks.com> wrote:
>>> >
>>> > Hi all,
>>> >
>>> > I’d like to start a vote for SPIP: "Spark Connect"
>>> >
>>> > The goal of the SPIP is to introduce a Dataframe based
>>> client/server API for Spark
>>> >
>>> > Please also refer to:
>>> >
>>> > - Previous discussion in dev mailing list: [DISCUSS] SPIP: Spark
>>> Connect - A client and server interface for Apache Spark.
>>> > - Design doc: Spark Connect - A client and server interface for
>>> Apache Spark.
>>> > - JIRA: SPARK-39375
>>> >
>>> > Please vote on the SPIP for the next 72 hours:
>>> >
>>> > [ ] +1: Accept the proposal as an official SPIP
>>> > [ ] +0
>>> > [ ] -1: I don’t think this is a good idea because …
>>> >
>>> > Kind Regards,
>>> > Herman
>>> 
>>> 
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>


Stickers and Swag

2022-06-14 Thread Xiao Li
Hi, all,

The ASF has an official store at RedBubble that Apache Community
Development (ComDev) runs. If you are interested in buying Spark Swag, 70
products featuring the Spark logo are available:
https://www.redbubble.com/shop/ap/113203780

Go Spark!

Xiao


Re: Re: [VOTE] Release Spark 3.3.0 (RC6)

2022-06-13 Thread Xiao Li
+1

Xiao

On Mon, Jun 13, 2022 at 8:04 PM beliefer wrote:

> +1. AFAIK, there are no blocking issues now.
> Glad to hear we are about to release 3.3.0!
>
>
> On 2022-06-14 09:38:35, "Ruifeng Zheng" wrote:
>
> +1 (non-binding)
>
> Maxim, thank you for driving this release!
>
> thanks,
> ruifeng
>
>
>
> -- Original Message --
> *From:* "Chao Sun";
> *Sent:* Tuesday, June 14, 2022, 8:45 AM
> *To:* "Cheng Su";
> *Cc:* "L. C. Hsieh"; "dev";
> *Subject:* Re: [VOTE] Release Spark 3.3.0 (RC6)
>
> +1 (non-binding)
>
> Thanks,
> Chao
>
> On Mon, Jun 13, 2022 at 5:37 PM Cheng Su  wrote:
>
>> +1 (non-binding).
>>
>>
>>
>> Thanks,
>>
>> Cheng Su
>>
>>
>>
>> *From: *L. C. Hsieh 
>> *Date: *Monday, June 13, 2022 at 5:13 PM
>> *To: *dev 
>> *Subject: *Re: [VOTE] Release Spark 3.3.0 (RC6)
>>
>> +1
>>
>> On Mon, Jun 13, 2022 at 5:07 PM Holden Karau 
>> wrote:
>> >
>> > +1
>> >
>> > On Mon, Jun 13, 2022 at 4:51 PM Yuming Wang  wrote:
>> >>
>> >> +1 (non-binding)
>> >>
>> >> On Tue, Jun 14, 2022 at 7:41 AM Dongjoon Hyun 
>> wrote:
>> >>>
>> >>> +1
>> >>>
>> >>> Thanks,
>> >>> Dongjoon.
>> >>>
>> >>> On Mon, Jun 13, 2022 at 3:54 PM Chris Nauroth 
>> wrote:
>> 
>>  +1 (non-binding)
>> 
>>  I repeated all checks I described for RC5:
>> 
>>  https://lists.apache.org/thread/ksoxmozgz7q728mnxl6c2z7ncmo87vls
>> 
>>  Maxim, thank you for your dedication on these release candidates.
>> 
>>  Chris Nauroth
>> 
>> 
>>  On Mon, Jun 13, 2022 at 3:21 PM Mridul Muralidharan <
>> mri...@gmail.com> wrote:
>> >
>> >
>> > +1
>> >
>> > Signatures, digests, etc check out fine.
>> > Checked out tag and build/tested with -Pyarn -Pmesos -Pkubernetes
>> >
>> > The test "SPARK-33084: Add jar support Ivy URI in SQL" in
>> sql.SQLQuerySuite fails; but other than that, the rest looks good.
>> >
>> > Regards,
>> > Mridul
>> >
>> >
>> >
>> > On Mon, Jun 13, 2022 at 4:25 PM Tom Graves
>>  wrote:
>> >>
>> >> +1
>> >>
>> >> Tom
>> >>
>> >> On Thursday, June 9, 2022, 11:27:50 PM CDT, Maxim Gekk
>>  wrote:
>> >>
>> >>
>> >> Please vote on releasing the following candidate as Apache Spark
>> version 3.3.0.
>> >>
>> >> The vote is open until 11:59pm Pacific time June 14th and passes
>> if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>> >>
>> >> [ ] +1 Release this package as Apache Spark 3.3.0
>> >> [ ] -1 Do not release this package because ...
>> >>
>> >> To learn more about Apache Spark, please see
>> http://spark.apache.org/
>> >>
>> >> The tag to be voted on is v3.3.0-rc6 (commit
>> f74867bddfbcdd4d08076db36851e88b15e66556):
>> >> https://github.com/apache/spark/tree/v3.3.0-rc6
>> >>
>> >> The release files, including signatures, digests, etc. can be
>> found at:
>> >> https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc6-bin/
>> >>
>> >> Signatures used for Spark RCs can be found in this file:
>> >> https://dist.apache.org/repos/dist/dev/spark/KEYS
>> >>
>> >> The staging repository for this release can be found at:
>> >>
>> https://repository.apache.org/content/repositories/orgapachespark-1407
>> >>
>> >> The documentation corresponding to this release can be found at:
>> >> https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc6-docs/
>> >>
>> >> The list of bug fixes going into 3.3.0 can be found at the
>> following URL:
>> >> https://issues.apache.org/jira/projects/SPARK/versions/12350369
>> >>
>> >> This release is using the release script of the tag v3.3.0-rc6.
>> >>
>> >>
>> >> FAQ
>> >>
>> >> =
>> >> How can I help test this release?
>> >> =
>> >> If you are a Spark user, you can help us test this release by
>> taking
>> >> an existing Spark workload and running on this release candidate,
>> then
>> >> reporting any regressions.
>> >>
>> >> If you're working in PySpark you can set up a virtual env and
>> install
>> >> the current RC and see if anything important breaks, in the
>> Java/Scala
>> >> you can add the staging repository to your projects resolvers and
>> test
>> >> with the RC (make sure to clean up the artifact cache before/after
>> so
>> >> you don't end up building with a out of date RC going forward).
>> >>
>> >> ===
>> >> What should happen to JIRA tickets still targeting 3.3.0?
>> >> ===
>> >> The current list of open tickets targeted at 3.3.0 can be found at:
>> >> https://issues.apache.org/jira/projects/SPARK  and search for
>> "Target Version/s" = 3.3.0
>> >>
>> >> Committers should look at those and triage. Extremely important bug
>> >> fixes, documentation, and API tweaks that impact compatibility
>> should
>> >> be worked on immediately. Everything else please 

Re: SIGMOD System Award for Apache Spark

2022-05-13 Thread Xiao Li
Congratulations to everyone!

Xiao

On Fri, May 13, 2022 at 9:34 AM Dongjoon Hyun 
wrote:

> Ya, it's really great!. Congratulations to the whole community!
>
> Dongjoon.
>
> On Fri, May 13, 2022 at 8:12 AM Chao Sun  wrote:
>
>> Huge congrats to the whole community!
>>
>> On Fri, May 13, 2022 at 1:56 AM Wenchen Fan  wrote:
>>
>>> Great! Congratulations to everyone!
>>>
>>> On Fri, May 13, 2022 at 10:38 AM Gengliang Wang 
>>> wrote:
>>>
 Congratulations to the whole spark community!

 On Fri, May 13, 2022 at 10:14 AM Jungtaek Lim <
 kabhwan.opensou...@gmail.com> wrote:

> Congrats Spark community!
>
> On Fri, May 13, 2022 at 10:40 AM Qian Sun 
> wrote:
>
>> Congratulations !!!
>>
>> 2022年5月13日 上午3:44,Matei Zaharia  写道:
>>
>> Hi all,
>>
>> We recently found out that Apache Spark received the SIGMOD System Award
>> this year, given by SIGMOD (the ACM’s data management research
>> organization) to impactful real-world and research systems. This puts Spark
>> in good company with some very impressive previous recipients. This award
>> is really an achievement by the whole community, so I wanted to say
>> congrats to everyone who contributes to Spark, whether through code, 
>> issue
>> reports, docs, or other means.
>>
>> Matei
>>
>>
>>

--


Re: Apache Spark 3.3 Release

2022-03-15 Thread Xiao Li
I think I finally got your point. What you want to keep unchanged is the
branch cut date of Spark 3.3. Today, or this Friday? This is not a big
deal.

My major concern is whether we should keep merging feature work or
dependency upgrades after the branch cut. To make our release time more
predictable, I am suggesting we finalize the list of exception PRs
first, instead of merging them in an ad hoc way. In the past, we spent a
lot of time reverting PRs that were merged after the branch cut.
I hope we can minimize unnecessary arguments in this release. Do you agree,
Dongjoon?



On Tue, Mar 15, 2022 at 3:55 PM Dongjoon Hyun wrote:

> That is not totally fine, Xiao. It sounds like you are asking for a change
> of plan without a proper reason.
>
> Although we cut the branch today according to our plan, you can still
> collect and maintain a list of exceptions. I'm not blocking what you want to
> do.
>
> Please let the community start to ramp down as we agreed before.
>
> Dongjoon
>
>
>
> On Tue, Mar 15, 2022 at 3:07 PM Xiao Li  wrote:
>
>> Please do not get me wrong. If we don't cut a branch, we are allowing all
>> patches to land in Apache Spark 3.3. That is totally fine. After we cut the
>> branch, we should avoid merging the feature work. In the next three days,
>> let us collect the actively developed PRs that we want to make an exception
>> (i.e., merged to 3.3 after the upcoming branch cut). Does that make sense?
>>
>> On Tue, Mar 15, 2022 at 2:54 PM Dongjoon Hyun wrote:
>>
>>> Xiao. You are working against what you are saying.
>>> If you don't cut a branch, it means you are allowing all patches to land
>>> in Apache Spark 3.3. No?
>>>
>>> > we need to avoid backporting the feature work that has not been well
>>> discussed.
>>>
>>>
>>>
>>> On Tue, Mar 15, 2022 at 12:12 PM Xiao Li  wrote:
>>>
>>>> Cutting the branch is simple, but we need to avoid backporting the
>>>> feature work that has not been well discussed. Not all the members are
>>>> actively following the dev list. I think we should wait 3 more days to
>>>> collect the PR list before cutting the branch.
>>>>
>>>> BTW, there is very little 3.4-only feature work that will be affected.
>>>>
>>>> Xiao
>>>>
>>>> On Tue, Mar 15, 2022 at 11:49 AM Dongjoon Hyun wrote:
>>>>
>>>>> Hi, Max, Chao, Xiao, Holden and all.
>>>>>
>>>>> I have a different idea.
>>>>>
>>>>> Given the situation and small patch list, I don't think we need to
>>>>> postpone the branch cut for those patches. It's easier to cut a branch-3.3
>>>>> and allow backporting.
>>>>>
>>>>> As of today, we already have an obvious Apache Spark 3.4 patch in the
>>>>> branch together. This situation only becomes worse and worse because there
>>>>> is no way to block the other patches from landing unintentionally if we
>>>>> don't cut a branch.
>>>>>
>>>>> [SPARK-38335][SQL] Implement parser support for DEFAULT column
>>>>> values
>>>>>
>>>>> Let's cut `branch-3.3` Today for Apache Spark 3.3.0 preparation.
>>>>>
>>>>> Best,
>>>>> Dongjoon.
>>>>>
>>>>>
>>>>> On Tue, Mar 15, 2022 at 10:17 AM Chao Sun  wrote:
>>>>>
>>>>>> Cool, thanks for clarifying!
>>>>>>
>>>>>> On Tue, Mar 15, 2022 at 10:11 AM Xiao Li 
>>>>>> wrote:
>>>>>> >>
>>>>>> >> For the following list:
>>>>>> >> #35789 [SPARK-32268][SQL] Row-level Runtime Filtering
>>>>>> >> #34659 [SPARK-34863][SQL] Support complex types for Parquet
>>>>>> vectorized reader
>>>>>> >> #35848 [SPARK-38548][SQL] New SQL function: try_sum
>>>>>> >> Do you mean we should include them, or exclude them from 3.3?
>>>>>> >
>>>>>> >
>>>>>> > If possible, I hope these features can be shipped with Spark 3.3.
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> > On Tue, Mar 15, 2022 at 10:06 AM Chao Sun wrote:
>>>>>> >>
>>>>>> >> Hi Xiao,
>>>>>> >>
>>>>>> >> For the following list:
>>>>>> >>
>>>>>> >> #35789 [SPAR

Re: Apache Spark 3.3 Release

2022-03-15 Thread Xiao Li
Please do not get me wrong. If we don't cut a branch, we are allowing all
patches to land in Apache Spark 3.3. That is totally fine. After we cut the
branch, we should avoid merging the feature work. In the next three days,
let us collect the actively developed PRs that we want to make an exception
(i.e., merged to 3.3 after the upcoming branch cut). Does that make sense?

On Tue, Mar 15, 2022 at 2:54 PM Dongjoon Hyun wrote:

> Xiao. You are working against what you are saying.
> If you don't cut a branch, it means you are allowing all patches to land
> in Apache Spark 3.3. No?
>
> > we need to avoid backporting the feature work that has not been well
> discussed.
>
>
>
> On Tue, Mar 15, 2022 at 12:12 PM Xiao Li  wrote:
>
>> Cutting the branch is simple, but we need to avoid backporting the
>> feature work that has not been well discussed. Not all the members are
>> actively following the dev list. I think we should wait 3 more days to
>> collect the PR list before cutting the branch.
>>
>> BTW, there is very little 3.4-only feature work that will be affected.
>>
>> Xiao
>>
>> On Tue, Mar 15, 2022 at 11:49 AM Dongjoon Hyun wrote:
>>
>>> Hi, Max, Chao, Xiao, Holden and all.
>>>
>>> I have a different idea.
>>>
>>> Given the situation and small patch list, I don't think we need to
>>> postpone the branch cut for those patches. It's easier to cut a branch-3.3
>>> and allow backporting.
>>>
>>> As of today, we already have an obvious Apache Spark 3.4 patch in the
>>> branch together. This situation only becomes worse and worse because there
>>> is no way to block the other patches from landing unintentionally if we
>>> don't cut a branch.
>>>
>>> [SPARK-38335][SQL] Implement parser support for DEFAULT column values
>>>
>>> Let's cut `branch-3.3` Today for Apache Spark 3.3.0 preparation.
>>>
>>> Best,
>>> Dongjoon.
>>>
>>>
>>> On Tue, Mar 15, 2022 at 10:17 AM Chao Sun  wrote:
>>>
>>>> Cool, thanks for clarifying!
>>>>
>>>> On Tue, Mar 15, 2022 at 10:11 AM Xiao Li  wrote:
>>>> >>
>>>> >> For the following list:
>>>> >> #35789 [SPARK-32268][SQL] Row-level Runtime Filtering
>>>> >> #34659 [SPARK-34863][SQL] Support complex types for Parquet
>>>> vectorized reader
>>>> >> #35848 [SPARK-38548][SQL] New SQL function: try_sum
>>>> >> Do you mean we should include them, or exclude them from 3.3?
>>>> >
>>>> >
>>>> > If possible, I hope these features can be shipped with Spark 3.3.
>>>> >
>>>> >
>>>> >
>>>> > On Tue, Mar 15, 2022 at 10:06 AM Chao Sun wrote:
>>>> >>
>>>> >> Hi Xiao,
>>>> >>
>>>> >> For the following list:
>>>> >>
>>>> >> #35789 [SPARK-32268][SQL] Row-level Runtime Filtering
>>>> >> #34659 [SPARK-34863][SQL] Support complex types for Parquet
>>>> vectorized reader
>>>> >> #35848 [SPARK-38548][SQL] New SQL function: try_sum
>>>> >>
>>>> >> Do you mean we should include them, or exclude them from 3.3?
>>>> >>
>>>> >> Thanks,
>>>> >> Chao
>>>> >>
>>>> >> On Tue, Mar 15, 2022 at 9:56 AM Dongjoon Hyun <
>>>> dongjoon.h...@gmail.com> wrote:
>>>> >> >
>>>> >> > The following was tested and merged a few minutes ago. So, we can
>>>> remove it from the list.
>>>> >> >
>>>> >> > #35819 [SPARK-38524][SPARK-38553][K8S] Bump Volcano to v1.5.1
>>>> >> >
>>>> >> > Thanks,
>>>> >> > Dongjoon.
>>>> >> >
>>>> >> > On Tue, Mar 15, 2022 at 9:48 AM Xiao Li 
>>>> wrote:
>>>> >> >>
>>>> >> >> Let me clarify my above suggestion. Maybe we can wait 3 more days
>>>> to collect the list of actively developed PRs that we want to merge to 3.3
>>>> after the branch cut?
>>>> >> >>
>>>> >> >> Please do not rush to merge the PRs that are not fully reviewed.
>>>> We can cut the branch this Friday and continue merging the PRs that have
>>>> been discussed in this thread. Does that make sense?
>>>> >> >>
>>>> >> >> Xiao
>>>> >> >>
>>>> >> >>
>>>> >> >>
>>>> >> >> On Tue, Mar 15, 2022 at 9:10 AM Holden Karau wrote:
>>>> >> >>>
>>>> >> >>> May I suggest we push out one week (22nd) just to give everyone
>>>> a bit of breathing space? Rushed software development more often results in
>>>> bugs.
>>>> >> >>>
>>>> >> >>> On Tue, Mar 15, 2022 at 6:23 AM Yikun Jiang 
>>>> wrote:
>>>> >> >>>>
>>>> >> >>>> > To make our release time more predictable, let us collect the
>>>> PRs and wait three more days before the branch cut?
>>>> >> >>>>
>>>> >> >>>> For SPIP: Support Customized Kubernetes Schedulers:
>>>> >> >>>> #35819 [SPARK-38524][SPARK-38553][K8S] Bump Volcano to v1.5.1
>>>> >> >>>>
>>>> >> >>>> Three more days are OK for this from my view.
>>>> >> >>>>
>>>> >> >>>> Regards,
>>>> >> >>>> Yikun
>>>> >> >>>
>>>> >> >>> --
>>>> >> >>> Twitter: https://twitter.com/holdenkarau
>>>> >> >>> Books (Learning Spark, High Performance Spark, etc.):
>>>> https://amzn.to/2MaRAG9
>>>> >> >>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>
>>>


Re: Apache Spark 3.3 Release

2022-03-15 Thread Xiao Li
Cutting the branch is simple, but we need to avoid backporting the feature
work that has not been well discussed. Not all the members are actively
following the dev list. I think we should wait 3 more days to collect
the PR list before cutting the branch.

BTW, there is very little 3.4-only feature work that will be affected.

Xiao

On Tue, Mar 15, 2022 at 11:49 AM Dongjoon Hyun wrote:

> Hi, Max, Chao, Xiao, Holden and all.
>
> I have a different idea.
>
> Given the situation and small patch list, I don't think we need to
> postpone the branch cut for those patches. It's easier to cut a branch-3.3
> and allow backporting.
>
> As of today, we already have an obvious Apache Spark 3.4 patch in the
> branch together. This situation only becomes worse and worse because there
> is no way to block the other patches from landing unintentionally if we
> don't cut a branch.
>
> [SPARK-38335][SQL] Implement parser support for DEFAULT column values
>
> Let's cut `branch-3.3` Today for Apache Spark 3.3.0 preparation.
>
> Best,
> Dongjoon.
>
>
> On Tue, Mar 15, 2022 at 10:17 AM Chao Sun  wrote:
>
>> Cool, thanks for clarifying!
>>
>> On Tue, Mar 15, 2022 at 10:11 AM Xiao Li  wrote:
>> >>
>> >> For the following list:
>> >> #35789 [SPARK-32268][SQL] Row-level Runtime Filtering
>> >> #34659 [SPARK-34863][SQL] Support complex types for Parquet vectorized
>> reader
>> >> #35848 [SPARK-38548][SQL] New SQL function: try_sum
>> >> Do you mean we should include them, or exclude them from 3.3?
>> >
>> >
>> > If possible, I hope these features can be shipped with Spark 3.3.
>> >
>> >
>> >
> On Tue, Mar 15, 2022 at 10:06 AM Chao Sun wrote:
>> >>
>> >> Hi Xiao,
>> >>
>> >> For the following list:
>> >>
>> >> #35789 [SPARK-32268][SQL] Row-level Runtime Filtering
>> >> #34659 [SPARK-34863][SQL] Support complex types for Parquet vectorized
>> reader
>> >> #35848 [SPARK-38548][SQL] New SQL function: try_sum
>> >>
>> >> Do you mean we should include them, or exclude them from 3.3?
>> >>
>> >> Thanks,
>> >> Chao
>> >>
>> >> On Tue, Mar 15, 2022 at 9:56 AM Dongjoon Hyun 
>> wrote:
>> >> >
>> >> > The following was tested and merged a few minutes ago. So, we can
>> remove it from the list.
>> >> >
>> >> > #35819 [SPARK-38524][SPARK-38553][K8S] Bump Volcano to v1.5.1
>> >> >
>> >> > Thanks,
>> >> > Dongjoon.
>> >> >
>> >> > On Tue, Mar 15, 2022 at 9:48 AM Xiao Li 
>> wrote:
>> >> >>
>> >> >> Let me clarify my above suggestion. Maybe we can wait 3 more days
>> to collect the list of actively developed PRs that we want to merge to 3.3
>> after the branch cut?
>> >> >>
>> >> >> Please do not rush to merge the PRs that are not fully reviewed. We
>> can cut the branch this Friday and continue merging the PRs that have been
>> discussed in this thread. Does that make sense?
>> >> >>
>> >> >> Xiao
>> >> >>
>> >> >>
>> >> >>
>> >> >> On Tue, Mar 15, 2022 at 9:10 AM Holden Karau wrote:
>> >> >>>
>> >> >>> May I suggest we push out one week (22nd) just to give everyone a
>> bit of breathing space? Rushed software development more often results in
>> bugs.
>> >> >>>
>> >> >>> On Tue, Mar 15, 2022 at 6:23 AM Yikun Jiang 
>> wrote:
>> >> >>>>
>> >> >>>> > To make our release time more predictable, let us collect the
>> PRs and wait three more days before the branch cut?
>> >> >>>>
>> >> >>>> For SPIP: Support Customized Kubernetes Schedulers:
>> >> >>>> #35819 [SPARK-38524][SPARK-38553][K8S] Bump Volcano to v1.5.1
>> >> >>>>
>> >> >>>> Three more days are OK for this from my view.
>> >> >>>>
>> >> >>>> Regards,
>> >> >>>> Yikun
>> >> >>>
>> >> >>> --
>> >> >>> Twitter: https://twitter.com/holdenkarau
>> >> >>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9
>> >> >>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>


Re: Apache Spark 3.3 Release

2022-03-15 Thread Xiao Li
>
> For the following list:
> #35789 [SPARK-32268][SQL] Row-level Runtime Filtering
> #34659 [SPARK-34863][SQL] Support complex types for Parquet vectorized
> reader
> #35848 [SPARK-38548][SQL] New SQL function: try_sum
> Do you mean we should include them, or exclude them from 3.3?


If possible, I hope these features can be shipped with Spark 3.3.



On Tue, Mar 15, 2022 at 10:06 AM Chao Sun wrote:

> Hi Xiao,
>
> For the following list:
>
> #35789 [SPARK-32268][SQL] Row-level Runtime Filtering
> #34659 [SPARK-34863][SQL] Support complex types for Parquet vectorized
> reader
> #35848 [SPARK-38548][SQL] New SQL function: try_sum
>
> Do you mean we should include them, or exclude them from 3.3?
>
> Thanks,
> Chao
>
> On Tue, Mar 15, 2022 at 9:56 AM Dongjoon Hyun 
> wrote:
> >
> > The following was tested and merged a few minutes ago. So, we can remove
> it from the list.
> >
> > #35819 [SPARK-38524][SPARK-38553][K8S] Bump Volcano to v1.5.1
> >
> > Thanks,
> > Dongjoon.
> >
> > On Tue, Mar 15, 2022 at 9:48 AM Xiao Li  wrote:
> >>
> >> Let me clarify my above suggestion. Maybe we can wait 3 more days to
> collect the list of actively developed PRs that we want to merge to 3.3
> after the branch cut?
> >>
> >> Please do not rush to merge the PRs that are not fully reviewed. We can
> cut the branch this Friday and continue merging the PRs that have been
> discussed in this thread. Does that make sense?
> >>
> >> Xiao
> >>
> >>
> >>
> >> On Tue, Mar 15, 2022 at 9:10 AM Holden Karau wrote:
> >>>
> >>> May I suggest we push out one week (22nd) just to give everyone a bit
> of breathing space? Rushed software development more often results in bugs.
> >>>
> >>> On Tue, Mar 15, 2022 at 6:23 AM Yikun Jiang 
> wrote:
> >>>>
> >>>> > To make our release time more predictable, let us collect the PRs
> and wait three more days before the branch cut?
> >>>>
> >>>> For SPIP: Support Customized Kubernetes Schedulers:
> >>>> #35819 [SPARK-38524][SPARK-38553][K8S] Bump Volcano to v1.5.1
> >>>>
> >>>> Three more days are OK for this from my view.
> >>>>
> >>>> Regards,
> >>>> Yikun
> >>>
> >>> --
> >>> Twitter: https://twitter.com/holdenkarau
> >>> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9
> >>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


Re: Apache Spark 3.3 Release

2022-03-15 Thread Xiao Li
Let me clarify my above suggestion. Maybe we can wait 3 more days to
collect the list of actively developed PRs that we want to merge to 3.3
after the branch cut?

Please do not rush to merge the PRs that are not fully reviewed. We can cut
the branch this Friday and continue merging the PRs that have been
discussed in this thread. Does that make sense?

Xiao




On Tue, Mar 15, 2022 at 9:10 AM Holden Karau wrote:

> May I suggest we push out one week (22nd) just to give everyone a bit of
> breathing space? Rushed software development more often results in bugs.
>
> On Tue, Mar 15, 2022 at 6:23 AM Yikun Jiang  wrote:
>
>> > To make our release time more predictable, let us collect the PRs and
>> wait three more days before the branch cut?
>>
>> For SPIP: Support Customized Kubernetes Schedulers:
>> #35819 [SPARK-38524][SPARK-38553][K8S] Bump Volcano to v1.5.1
>> 
>>
>> Three more days are OK for this from my view.
>>
>> Regards,
>> Yikun
>>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


Re: Apache Spark 3.3 Release

2022-03-14 Thread Xiao Li
To make our release time more predictable, let us collect the PRs and wait
three more days before the branch cut?

Could you please list all the actively developed feature work we plan to
release with Spark 3.3? We should avoid merging any new feature work that is
not being discussed in this email thread. Below is my list:

   - #35789 [SPARK-32268][SQL] Row-level Runtime Filtering
   <https://github.com/apache/spark/pull/35789>
   - #34659 [SPARK-34863][SQL] Support complex types for Parquet vectorized
   reader <https://github.com/apache/spark/pull/34659>
   - #35848 [SPARK-38548][SQL] New SQL function: try_sum
   <https://github.com/apache/spark/pull/35848>




On Mon, Mar 14, 2022 at 9:17 PM Chao Sun wrote:

> I mainly mean:
>
>   - [SPARK-35801] Row-level operations in Data Source V2
>   - [SPARK-37166] Storage Partitioned Join
>
> For which the PR:
>
> - https://github.com/apache/spark/pull/35395
> - https://github.com/apache/spark/pull/35657
>
> are actively being reviewed. It seems there are ongoing PRs for other
> SPIPs as well but I'm not involved in those so not quite sure whether
> they are intended for 3.3 release.
>
> Chao
>
>
> Chao
>
> On Mon, Mar 14, 2022 at 8:53 PM Xiao Li  wrote:
> >
> > Could you please list which features we want to finish before the branch
> cut? How long will they take?
> >
> > Xiao
> >
> > On Mon, Mar 14, 2022 at 1:30 PM Chao Sun wrote:
> >>
> >> Hi Max,
> >>
> >> As there are still some ongoing work for the above listed SPIPs, can we
> still merge them after the branch cut?
> >>
> >> Thanks,
> >> Chao
> >>
> >> On Mon, Mar 14, 2022 at 6:12 AM Maxim Gekk wrote:
> >>>
> >>> Hi All,
> >>>
> >>> Since there are no actual blockers for Spark 3.3.0 and significant
> objections, I am going to cut branch-3.3 after 15th March at 00:00 PST.
> Please, let us know if you have any concerns about that.
> >>>
> >>> Best regards,
> >>> Max Gekk
> >>>
> >>>
> >>> On Thu, Mar 3, 2022 at 9:44 PM Maxim Gekk 
> wrote:
> >>>>
> >>>> Hello All,
> >>>>
> >>>> I would like to bring to the table the topic of the new Spark
> release 3.3. According to the public schedule at
> https://spark.apache.org/versioning-policy.html, we planned to start the
> code freeze and release branch cut on March 15th, 2022. Since this date is
> coming soon, I would like to draw your attention to the topic and gather
> any objections that you might have.
> >>>>
> >>>> Below is the list of ongoing and active SPIPs:
> >>>>
> >>>> Spark SQL:
> >>>> - [SPARK-31357] DataSourceV2: Catalog API for view metadata
> >>>> - [SPARK-35801] Row-level operations in Data Source V2
> >>>> - [SPARK-37166] Storage Partitioned Join
> >>>>
> >>>> Spark Core:
> >>>> - [SPARK-20624] Add better handling for node shutdown
> >>>> - [SPARK-25299] Use remote storage for persisting shuffle data
> >>>>
> >>>> PySpark:
> >>>> - [SPARK-26413] RDD Arrow Support in Spark Core and PySpark
> >>>>
> >>>> Kubernetes:
> >>>> - [SPARK-36057] Support Customized Kubernetes Schedulers
> >>>>
> >>>> Probably, we should finish any remaining work for Spark
> 3.3, switch to QA mode, cut a branch, and keep everything on track. I
> would like to volunteer to help drive this process.
> >>>>
> >>>> Best regards,
> >>>> Max Gekk
>


Re: Apache Spark 3.3 Release

2022-03-14 Thread Xiao Li
Could you please list which features we want to finish before the branch
cut? How long will they take?

Xiao

Chao Sun  于2022年3月14日周一 13:30写道:

> Hi Max,
>
> As there are still some ongoing work for the above listed SPIPs, can we
> still merge them after the branch cut?
>
> Thanks,
> Chao
>
> On Mon, Mar 14, 2022 at 6:12 AM Maxim Gekk wrote:
>
>> Hi All,
>>
>> Since there are no actual blockers for Spark 3.3.0 and significant
>> objections, I am going to cut branch-3.3 after 15th March at 00:00 PST.
>> Please, let us know if you have any concerns about that.
>>
>> Best regards,
>> Max Gekk
>>
>>
>> On Thu, Mar 3, 2022 at 9:44 PM Maxim Gekk 
>> wrote:
>>
>>> Hello All,
>>>
>>> I would like to bring to the table the topic of the new Spark release
>>> 3.3. According to the public schedule at
>>> https://spark.apache.org/versioning-policy.html, we planned to start
>>> the code freeze and release branch cut on March 15th, 2022. Since this date
>>> is coming soon, I would like to draw your attention to the topic and gather
>>> any objections that you might have.
>>>
>>> Below is the list of ongoing and active SPIPs:
>>>
>>> Spark SQL:
>>> - [SPARK-31357] DataSourceV2: Catalog API for view metadata
>>> - [SPARK-35801] Row-level operations in Data Source V2
>>> - [SPARK-37166] Storage Partitioned Join
>>>
>>> Spark Core:
>>> - [SPARK-20624] Add better handling for node shutdown
>>> - [SPARK-25299] Use remote storage for persisting shuffle data
>>>
>>> PySpark:
>>> - [SPARK-26413] RDD Arrow Support in Spark Core and PySpark
>>>
>>> Kubernetes:
>>> - [SPARK-36057] Support Customized Kubernetes Schedulers
>>>
>>> Probably, we should finish any remaining work for Spark
>>> 3.3, switch to QA mode, cut a branch, and keep everything on track. I
>>> would like to volunteer to help drive this process.
>>>
>>> Best regards,
>>> Max Gekk
>>>
>>


Re: [VOTE] SPIP: Catalog API for view metadata

2022-02-03 Thread Xiao Li
Can we extend the voting window to next Wednesday? This week is a holiday
week for the lunar new year. AFAIK, many members in Asia are taking the
whole week off. They might not check their email regularly.

Also, how about starting a separate email thread with a subject starting with [VOTE]?

Happy Lunar New Year!!!

Xiao

On Thu, Feb 3, 2022 at 12:28 PM Holden Karau wrote:

> +1 (binding)
>
> On Thu, Feb 3, 2022 at 2:26 PM Erik Krogen  wrote:
>
>> +1 (non-binding)
>>
>> Really looking forward to having this natively supported by Spark, so
>> that we can get rid of our own hacks to tie in a custom view catalog
>> implementation. I appreciate the care John has put into various parts of
>> the design and believe this will provide a robust and flexible solution to
>> this problem faced by various large-scale Spark users.
>>
>> Thanks John!
>>
>> On Thu, Feb 3, 2022 at 11:22 AM Walaa Eldin Moustafa <
>> wa.moust...@gmail.com> wrote:
>>
>>> +1
>>>
>>> On Thu, Feb 3, 2022 at 11:19 AM John Zhuge  wrote:
>>>
 Hi Spark community,

 I’d like to restart the vote for the ViewCatalog design proposal (SPIP).

 The proposal is to add a ViewCatalog interface that can be used to
 load, create, alter, and drop views in DataSourceV2.

 Please vote on the SPIP in the next 72 hours. Once it is approved, I’ll
 update the PR for review.

 [ ] +1: Accept the proposal as an official SPIP
 [ ] +0
 [ ] -1: I don’t think this is a good idea because …

 Thanks!
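
For readers skimming the archive, a rough sketch of the kind of interface
being proposed follows. All names, signatures, and stub types below are
assumptions drawn from the summary above, not the actual SPIP interface:

    import org.apache.spark.sql.connector.catalog.{CatalogPlugin, Identifier}

    // Stub result types, for illustration only.
    trait View { def query: String }
    trait ViewChange

    // Hypothetical shape of a DataSourceV2 view catalog.
    trait ViewCatalog extends CatalogPlugin {
      def loadView(ident: Identifier): View                  // resolve a view
      def createView(ident: Identifier, query: String): View // CREATE VIEW
      def alterView(ident: Identifier, changes: ViewChange*): View
      def dropView(ident: Identifier): Boolean               // true if dropped
    }

A catalog that wants to serve both tables and views (the TableViewCatalog
direction discussed below) could implement this alongside TableCatalog.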

 On Fri, Jun 4, 2021 at 1:46 PM Walaa Eldin Moustafa <
 wa.moust...@gmail.com> wrote:

> Considering the API aspect, the ViewCatalog API sounds like a good
> idea. A view catalog will enable us to integrate Coral (our view SQL
> translation and management layer) very cleanly with Spark. Currently we can
> only do it by maintaining our special version of the
> HiveExternalCatalog. Considering that views can be expanded
> syntactically without necessarily invoking the analyzer, using a dedicated
> view API can make performance better if performance is the concern.
> Further, a catalog can still be both a table and view provider if it
> chooses to based on this design, so I do not think we necessarily lose the
> ability of providing both. Looking forward to more discussions on this and
> making views a powerful tool in Spark.
>
> Thanks,
> Walaa.
>
>
> On Wed, May 26, 2021 at 9:54 AM John Zhuge  wrote:
>
>> Looks like we are running in circles. Should we have an online
>> meeting to get this sorted out?
>>
>> Thanks,
>> John
>>
>> On Wed, May 26, 2021 at 12:01 AM Wenchen Fan 
>> wrote:
>>
>>> OK, then I'd vote for TableViewCatalog, because
>>> 1. This is how Hive catalog works, and we need to migrate Hive
>>> catalog to the v2 API sooner or later.
>>> 2. Because of 1, TableViewCatalog is easy to support in the current
>>> table/view resolution framework.
>>> 3. It's better to avoid name conflicts between table and views at
>>> the API level, instead of relying on the catalog implementation.
>>> 4. Caching invalidation is always a tricky problem.
>>>
>>> On Tue, May 25, 2021 at 3:09 AM Ryan Blue 
>>> wrote:
>>>
 I don't think that it makes sense to discuss a different approach
 in the PR rather than in the vote. Let's discuss this now since that's 
 the
 purpose of an SPIP.

 On Mon, May 24, 2021 at 11:22 AM John Zhuge 
 wrote:

> Hi everyone, I’d like to start a vote for the ViewCatalog design
> proposal (SPIP).
>
> The proposal is to add a ViewCatalog interface that can be used to
> load, create, alter, and drop views in DataSourceV2.
>
> The full SPIP doc is here:
> https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing
>
> Please vote on the SPIP in the next 72 hours. Once it is approved,
> I’ll update the PR for review.
>
> [ ] +1: Accept the proposal as an official SPIP
> [ ] +0
> [ ] -1: I don’t think this is a good idea because …
>


 --
 Ryan Blue
 Software Engineer
 Netflix

>>>
>>
>> --
>> John Zhuge
>>
>

 --
 John Zhuge

>>> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


Re: [Apache Spark Jenkins] build system shutting down Dec 23th, 2021

2021-12-06 Thread Xiao Li
Hi, Shane,

Thank you for your work on it!

Xiao




On Mon, Dec 6, 2021 at 6:20 PM L. C. Hsieh  wrote:

> Thank you, Shane.
>
> On Mon, Dec 6, 2021 at 4:27 PM Holden Karau  wrote:
> >
> > Shane you kick ass thank you for everything you’ve done for us :) Keep
> on rocking :)
> >
> > On Mon, Dec 6, 2021 at 4:24 PM Hyukjin Kwon  wrote:
> >>
> >> Thanks, Shane.
> >>
> >> On Tue, 7 Dec 2021 at 09:19, Dongjoon Hyun 
> wrote:
> >>>
> >>> I really want to thank you for all your help.
> >>> You've done so many things for the Apache Spark community.
> >>>
> >>> Sincerely,
> >>> Dongjoon
> >>>
> >>>
> >>> On Mon, Dec 6, 2021 at 12:02 PM shane knapp ☠ 
> wrote:
> 
>  hey everyone!
> 
>  after a marathon run of nearly a decade, we're finally going to be
> shutting down {amp|rise}lab jenkins at the end of this month...
> 
>  the earliest snapshot i could find is from 2013 with builds for spark
> 0.7:
> 
> https://web.archive.org/web/20130426155726/https://amplab.cs.berkeley.edu/jenkins/
> 
>  it's been a hell of a run, and i'm gonna miss randomly tweaking the
> build system, but technology has moved on and running a dedicated set of
> servers for just one open source project is just too expensive for us here
> at uc berkeley.
> 
>  if there's interest, i'll fire up a zoom session and all y'alls can
> watch me type the final command:
> 
>  systemctl stop jenkins
> 
>  feeling bittersweet,
> 
>  shane
>  --
>  Shane Knapp
>  Computer Guy / Voice of Reason
>  UC Berkeley EECS Research / RISELab Staff Technical Lead
>  https://rise.cs.berkeley.edu
> >
> > --
> > Twitter: https://twitter.com/holdenkarau
> > Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9
> > YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>

--


Re: [FYI] Build and run tests on Java 17 for Apache Spark 3.3

2021-11-12 Thread Xiao Li
Thank you! Great job!

Xiao


On Fri, Nov 12, 2021 at 7:02 PM Mridul Muralidharan 
wrote:

>
> Nice job !
> There are some nice API's which should be interesting to explore with JDK
> 17 :-)
>
> Regards.
> Mridul
>
> On Fri, Nov 12, 2021 at 7:08 PM Yuming Wang  wrote:
>
>> Cool, thank you Dongjoon.
>>
>> On Sat, Nov 13, 2021 at 4:09 AM shane knapp ☠ 
>> wrote:
>>
>>> woot!  nice work everyone!  :)
>>>
>>> On Fri, Nov 12, 2021 at 11:37 AM Dongjoon Hyun 
>>> wrote:
>>>
 Hi, All.

 Apache Spark community has been working on Java 17 support under the
 following JIRA.

 https://issues.apache.org/jira/browse/SPARK-33772

 As of today, Apache Spark has daily Java 17 test coverage
 via GitHub Action jobs for Apache Spark 3.3.


 https://github.com/apache/spark/blob/master/.github/workflows/build_and_test.yml#L38-L39

 Today's successful run is here.

 https://github.com/apache/spark/actions/runs/1453788012

 Please note that we are still working on some new Java 17 features like

 JEP 391: macOS/AArch64 Port
 https://bugs.openjdk.java.net/browse/JDK-8251280

 For example, Oracle Java, Azul Zulu, and Eclipse Temurin Java 17
 already support Apple Silicon natively, but some 3rd party libraries like
 RocksDB/LevelDB are not ready yet. Since Mac is one of the popular dev
 environments, we are going to keep monitoring and improving gradually for
 Apache Spark 3.3.

 Please test Java 17 and let us know your feedback.

 Thanks,
 Dongjoon.

>>>
>>>
>>> --
>>> Shane Knapp
>>> Computer Guy / Voice of Reason
>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>> https://rise.cs.berkeley.edu
>>>
>>

--


Re: [ANNOUNCE] Apache Spark 3.2.0

2021-10-19 Thread Xiao Li
Thank you, Gengliang!

Congrats to our community and all the contributors!

Xiao

On Tue, Oct 19, 2021 at 8:26 AM Henrik Peng wrote:

> Congrats and thanks!
>
>
> On Tue, Oct 19, 2021 at 10:16 PM Gengliang Wang wrote:
>
>> Hi all,
>>
>> Apache Spark 3.2.0 is the third release of the 3.x line. With tremendous
>> contribution from the open-source community, this release managed to
>> resolve in excess of 1,700 Jira tickets.
>>
>> We'd like to thank our contributors and users for their contributions and
>> early feedback to this release. This release would not have been possible
>> without you.
>>
>> To download Spark 3.2.0, head over to the download page:
>> https://spark.apache.org/downloads.html
>>
>> To view the release notes:
>> https://spark.apache.org/releases/spark-release-3-2-0.html
>>
>


Re: [VOTE] Release Spark 3.2.0 (RC7)

2021-10-11 Thread Xiao Li
+1

Xiao Li

On Mon, Oct 11, 2021 at 12:08 AM Yi Wu wrote:

> +1 (non-binding)
>
> On Mon, Oct 11, 2021 at 1:57 PM Holden Karau  wrote:
>
>> +1
>>
>> On Sun, Oct 10, 2021 at 10:46 PM Wenchen Fan  wrote:
>>
>>> +1
>>>
>>> On Sat, Oct 9, 2021 at 2:36 PM angers zhu  wrote:
>>>
>>>> +1 (non-binding)
>>>>
>>>> On Sat, Oct 9, 2021 at 2:06 PM Cheng Pan wrote:
>>>>
>>>>> +1 (non-binding)
>>>>>
>>>>> Integration tests passed [1] with my project [2].
>>>>>
>>>>> [1]
>>>>> https://github.com/housepower/spark-clickhouse-connector/runs/3834335017
>>>>> [2] https://github.com/housepower/spark-clickhouse-connector
>>>>>
>>>>> Thanks,
>>>>> Cheng Pan
>>>>>
>>>>>
>>>>> On Sat, Oct 9, 2021 at 2:01 PM Ye Zhou  wrote:
>>>>>
>>>>>> +1 (non-binding).
>>>>>>
>>>>>> Run Maven build, tested within our YARN cluster, in client or cluster
>>>>>> mode, with push-based shuffle enabled/disabled, and shuffling a large
>>>>>> amount of data. Applications ran successfully with expected shuffle
>>>>>> behavior.
>>>>>>
>>>>>> On Fri, Oct 8, 2021 at 10:06 PM sarutak 
>>>>>> wrote:
>>>>>>
>>>>>>> +1
>>>>>>>
>>>>>>> I think there are no critical issues left.
>>>>>>> Thank you Gengliang.
>>>>>>>
>>>>>>> Kousuke
>>>>>>>
>>>>>>> > +1
>>>>>>> >
>>>>>>> > Looks good.
>>>>>>> >
>>>>>>> > Liang-Chi
>>>>>>> >
>>>>>>> > On 2021/10/08 16:16:12, Kent Yao  wrote:
>>>>>>> >> +1 (non-binding)

Re: [VOTE] Release Spark 3.2.0 (RC1)

2021-08-31 Thread Xiao Li
Hi, Chao,

How long will it take? Normally, in the RC stage, we always revert the
upgrade made in the current release. We did the parquet upgrade multiple
times in the previous releases for avoiding the major delay in our Spark
release

Thanks,

Xiao


On Tue, Aug 31, 2021 at 11:03 AM Chao Sun  wrote:

> The Apache Parquet community found an issue [1] in 1.12.0 which could
> cause incorrect file offset being written and subsequently reading of the
> same file to fail. A fix has been proposed in the same JIRA and we may have
> to wait until a new release is available so that we can upgrade Spark with
> the hot fix.
>
> [1]: https://issues.apache.org/jira/browse/PARQUET-2078
>
> On Fri, Aug 27, 2021 at 7:06 AM Sean Owen  wrote:
>
>> Maybe, I'm just confused why it's needed at all. Other profiles that add
>> a dependency seem OK, but something's different here.
>>
>> One thing we can/should change is to simply remove the
>> <dependencyManagement> block in the profile. It should always be a direct
>> dep in Scala 2.13 (which lets us take out the profiles in submodules, which
>> just repeat that)
>> We can also update the version, by the by.
>>
>> I tried this and the resulting POM still doesn't look like what I expect
>> though.
>>
>> (The binary release is OK, FWIW - it gets pulled in as a JAR as expected)
>>
>> On Thu, Aug 26, 2021 at 11:34 PM Stephen Coy 
>> wrote:
>>
>>> Hi Sean,
>>>
>>> I think that maybe the https://www.mojohaus.org/flatten-maven-plugin/ will
>>> help you out here.
>>>
>>> Cheers,
>>>
>>> Steve C
>>>
>>> On 27 Aug 2021, at 12:29 pm, Sean Owen  wrote:
>>>
>>> OK right, you would have seen a different error otherwise.
>>>
>>> Yes profiles are only a compile-time thing, but they should affect the
>>> effective POM for the artifact. mvn -Pscala-2.13 help:effective-pom shows
>>> scala-parallel-collections as a dependency in the POM as expected (not in a
>>> profile). However I see what you see in the .pom in the release repo, and
>>> in my local repo after building - it's just sitting there as a profile as
>>> if it weren't activated or something.
>>>
>>> I'm confused then, that shouldn't be what happens. I'd say maybe there
>>> is a problem with the release script, but seems to affect a simple local
>>> build. Anyone else more expert in this see the problem, while I try to
>>> debug more?
>>> The binary distro may actually be fine, I'll check; it may even not
>>> matter much for users who generally just treat Spark as a compile-time-only
>>> dependency either. But I can see it would break exactly your case,
>>> something like a self-contained test job.
>>>
>>> On Thu, Aug 26, 2021 at 8:41 PM Stephen Coy 
>>> wrote:
>>>
 I did indeed.

 The generated spark-core_2.13-3.2.0.pom that is created alongside the
 jar file in the local repo contains:

 <profile>
   <id>scala-2.13</id>
   <dependencies>
     <dependency>
       <groupId>org.scala-lang.modules</groupId>
       <artifactId>scala-parallel-collections_${scala.binary.version}</artifactId>
     </dependency>
   </dependencies>
 </profile>

 which means this dependency will be missing for unit tests that create
 SparkSessions from library code only, a technique inspired by Spark’s own
 unit tests.
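
 For anyone reproducing this, a minimal self-contained sketch of the kind of
 library-only test described above (the master and app name are arbitrary).
 On a Scala 2.13 build it fails with the NoClassDefFoundError shown in the
 quoted stack trace further down unless scala-parallel-collections is on the
 test classpath:

   import org.apache.spark.sql.SparkSession

   val spark = SparkSession.builder()
     .master("local[2]")
     .appName("union-repro")
     .getOrCreate()
   // UnionExec.doExecute delegates to SparkContext.union, which relies on
   // Scala parallel collections under 2.13:
   spark.range(1).union(spark.range(1)).count()
   spark.stop()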

 Cheers,

 Steve C

 On 27 Aug 2021, at 11:33 am, Sean Owen  wrote:

 Did you run ./dev/change-scala-version.sh 2.13 ? that's required first
 to update POMs. It works fine for me.

 On Thu, Aug 26, 2021 at 8:33 PM Stephen Coy <
 s...@infomedia.com.au.invalid> wrote:

> Hi all,
>
> Being adventurous I have built the RC1 code with:
>
> -Pyarn -Phadoop-3.2  -Pyarn -Phadoop-cloud -Phive-thriftserver
> -Phive-2.3 -Pscala-2.13 -Dhadoop.version=3.2.2
>
>
> And then attempted to build my Java based spark application.
>
> However, I found a number of our unit tests were failing with:
>
> java.lang.NoClassDefFoundError: scala/collection/parallel/TaskSupport
>
> at
> org.apache.spark.SparkContext.$anonfun$union$1(SparkContext.scala:1412)
> at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
> at org.apache.spark.SparkContext.withScope(SparkContext.scala:789)
> at org.apache.spark.SparkContext.union(SparkContext.scala:1406)
> at
> org.apache.spark.sql.execution.UnionExec.doExecute(basicPhysicalOperators.scala:698)
> at
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:184)
> …
>
>
> I tracked this down to a missing dependency:
>
> <dependency>
>   <groupId>org.scala-lang.modules</groupId>
>   <artifactId>scala-parallel-collections_${scala.binary.version}</artifactId>
> </dependency>
>
>
> which unfortunately appears only in a profile in the pom files
> associated with the various spark dependencies.
>
> As far as I know it is not possible to activate profiles in
> dependencies in maven builds.
>
> Therefore I suspect that 

Re: [build system] half of the jenkins workers are down

2021-08-09 Thread Xiao Li
Thank you, Shane!

Xiao

On Mon, Aug 9, 2021 at 1:26 PM shane knapp ☠  wrote:

> turns out that minikube/k8s and friends were being oom-killed and this was
> causing all sorts of weirdnesses.
>
> i've upped the ram limits on all of the k8s jobs to 8G (from 6G), and
> we'll keep an eye on things and see how they go.
>
> On Mon, Aug 9, 2021 at 12:02 PM shane knapp ☠  wrote:
>
>> as workers are continuing to fail, i've stopped jenkins from accepting
>> new builds for the time being.
>>
>> more updates as they come.
>>
>> On Mon, Aug 9, 2021 at 9:17 AM shane knapp ☠  wrote:
>>
>>> happy monday!
>>>
>>> the server gods did not smile upon us this weekend, and 4 of the workers
>>> are down.  we'll most likely need to head to our colo some time today and
>>> give them an in-person kick and see what's going on.
>>>
>>> i'll send an update when they're back up.
>>>
>>> shane
>>> --
>>> Shane Knapp
>>> Computer Guy / Voice of Reason
>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>> https://rise.cs.berkeley.edu
>>>
>>
>>
>> --
>> Shane Knapp
>> Computer Guy / Voice of Reason
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>
>
> --
> Shane Knapp
> Computer Guy / Voice of Reason
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


Re: Flaky build in GitHub Actions

2021-07-26 Thread Xiao Li
Thank you, Liang-chi and Hyukjin!

On Sun, Jul 25, 2021 at 6:25 PM Hyukjin Kwon  wrote:

> This is fixed via Liang-Chi's PR:
> https://github.com/apache/spark/pull/33447. The issue is mostly fixed now
> and the build is less flaky.
> I'm still interacting w/ GitHub Actions: they are still investigating the
> issue. Seems like there's no similar ticket reported so they suspect an
> issue specific to the Apache Spark repo.
>
>
> On Thu, Jul 22, 2021 at 9:40 AM Hyukjin Kwon  wrote:
>
>> FYI, @Liang-Chi Hsieh  is trying to control the memory
>> in the test base at https://github.com/apache/spark/pull/33447 which
>> looks almost promising now.
>> While I don't object to merge things, would need to closely track how
>> these tests go at Github Actions in his PR (and in the main Apache repo)
>>
>> On Thu, Jul 22, 2021 at 3:00 AM Holden Karau  wrote:
>>
>>> I noticed that the worker decommissioning suite maybe seems to be
>>> running up against the memory limits so I'm going to try and see if I can
>>> get our memory usage down a bit as well while we wait for GH response. In
>>> the meantime, I'm assuming if things pass Jenkins we are OK with merging
>>> yes?
>>>
>>> On Wed, Jul 21, 2021 at 10:03 AM Dongjoon Hyun 
>>> wrote:
>>>
 Thank you, Hyukjin!

 Dongjoon.

 On Tue, Jul 20, 2021 at 8:53 PM Hyukjin Kwon 
 wrote:

> I filed a ticket at GitHub. I will share more details when I get a
> response from them.
>
> On Tue, Jul 20, 2021 at 7:30 PM Hyukjin Kwon  wrote:
>
>> Hi all,
>>
>> Looks like there's something going on in the machines in GitHub
>> Actions.
>> The build is now very flaky and keeps dying with symptoms like I
>> guess out-of-memory (?).
>> I will try to take a closer look tomorrow but it would be great if
>> you guys find some time to take a look into it 
>>
>
>>>
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9  
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>>

--


Re: Apache Spark 3.2 Expectation

2021-06-16 Thread Xiao Li
>
> To Liang-Chi, I'm -1 for postponing the branch cut because this is a soft
> cut and the committers still are able to commit to `branch-3.3` according
> to their decisions.


First, I think you are saying "branch-3.2";

Second, the "soft cut" means no "code freeze", although we cut the branch. To
avoid releasing half-baked and unready features, the release
manager needs to be very careful when cutting the RC. Based on what is
proposed here, the RC date is the actual code freeze date.

This way, we can backport the other performance/operability enhancements
> tickets under SPARK-33235 into branch-3.2 to be released in future Spark
> 3.2.x patch releases.


This is not allowed based on the policy. Only bug fixes can be merged to
the patch releases. Thus, if we know it will introduce major performance
regression, we have to turn the feature off by default.

Xiao



On Wed, Jun 16, 2021 at 3:22 PM Min Shen  wrote:

> Hi Gengliang,
>
> Thanks for volunteering as the release manager for Spark 3.2.0.
> Regarding the ongoing work of push-based shuffle in SPARK-30602, we are
> close to having all the patches merged to master to enable push-based
> shuffle.
> Currently, there are 2 PRs under SPARK-30602 that are under active review
> (SPARK-32922 and SPARK-35671), and hopefully can be merged soon.
> We should be able to post the PRs for the other 2 remaining tickets
> (SPARK-32923 and SPARK-35546) early next week.
>
> The tickets under SPARK-30602 are the minimum set of patches to enable
> push-based shuffle.
> We do have other performance/operability enhancements tickets under
> SPARK-33235 that are needed to fully contribute what we have internally for
> push-based shuffle.
> However, these are optional for enabling push-based shuffle.
> We do strongly prefer to cut the release for Spark 3.2.0 including all the
> patches under SPARK-30602.
> This way, we can backport the other performance/operability enhancements
> tickets under SPARK-33235 into branch-3.2 to be released in future Spark
> 3.2.x patch releases.
> I understand the preference of not postponing the branch cut date.
> We will check with Dongjoon regarding the soft cut date and the
> flexibility for including the remaining tickets under SPARK-30602 into
> branch-3.2.
>
> Best,
> Min
>
> On Wed, Jun 16, 2021 at 1:20 PM Liang-Chi Hsieh  wrote:
>
>>
>> Thanks Dongjoon. I've talked with Dongjoon offline to know more this.
>> As it is soft cut date, there is no reason to postpone it.
>>
>> It sounds good then to keep original branch cut date.
>>
>> Thank you.
>>
>>
>>
>> Dongjoon Hyun-2 wrote
>> > Thank you for volunteering, Gengliang.
>> >
>> > Apache Spark 3.2.0 is the first version enabling AQE by default. I'm
>> also
>> > watching some on-going improvements on that.
>> >
>> > https://issues.apache.org/jira/browse/SPARK-33828 (SQL Adaptive
>> Query
>> > Execution QA)
>> >
>> > To Liang-Chi, I'm -1 for postponing the branch cut because this is a
>> soft
>> > cut and the committers still are able to commit to `branch-3.3`
>> according
>> > to their decisions.
>> >
>> > Given that Apache Spark had 115 commits in a week in various areas
>> > concurrently, we should start QA for Apache Spark 3.2 by creating
>> > branch-3.3 and allowing only limited backporting.
>> >
>> > https://github.com/apache/spark/graphs/commit-activity
>> >
>> > Bests,
>> > Dongjoon.
>> >
>> >
>> > On Wed, Jun 16, 2021 at 9:19 AM Liang-Chi Hsieh  wrote:
>> >
>> >> First, thanks for volunteering as the release manager of Spark
>> 3.2.0,
>> >> Gengliang!
>> >>
>> >> And yes, for the two important Structured Streaming features, RocksDB
>> >> StateStore and session window, we're working on them and expect to have
>> >> them
>> >> in the new release.
>> >>
>> >> So I propose to postpone the branch cut date.
>> >>
>> >> Thank you!
>> >>
>> >> Liang-Chi
>> >>
>> >>
>> >> Gengliang Wang-2 wrote
>> >> > Thanks, Hyukjin.
>> >> >
>> >> > The expected target branch cut date of Spark 3.2 is *July 1st* on
>> >> > https://spark.apache.org/versioning-policy.html. However, I notice
>> that
>> >> > there are still multiple important projects in progress now:
>> >> >
>> >> > [Core]
>> >> >
>> >> >- SPIP: Support push-based shuffle to improve shuffle efficiency
>> >> >https://issues.apache.org/jira/browse/SPARK-30602;
>> >> >
>> >> > [SQL]
>> >> >
>> >> >- Support ANSI SQL INTERVAL types
>> >> >https://issues.apache.org/jira/browse/SPARK-27790;
>> >> >- Support Timestamp without time zone data type
>> >> >https://issues.apache.org/jira/browse/SPARK-35662;
>> >> >- Aggregate (Min/Max/Count) push down for Parquet
>> >> >https://issues.apache.org/jira/browse/SPARK-34952;
>> >> >
>> >> > [Streaming]
>> >> >
>> >> >- EventTime based sessionization (session window)
>> >> >https://issues.apache.org/jira/browse/SPARK-10816;
>> >> >- Add RocksDB StateStore as external module
>> >> >https://issues.apache.org/jira/browse/SPARK-34198;
>> >> >
>> >> 

Re: [ANNOUNCE] Apache Spark 3.1.2 released

2021-06-01 Thread Xiao Li
Thank you!

Xiao

On Tue, Jun 1, 2021 at 9:29 PM Hyukjin Kwon  wrote:

> awesome!
>
> On Wed, Jun 2, 2021 at 9:59 AM Dongjoon Hyun  wrote:
>
>> We are happy to announce the availability of Spark 3.1.2!
>>
>> Spark 3.1.2 is a maintenance release containing stability fixes. This
>> release is based on the branch-3.1 maintenance branch of Spark. We
>> strongly
>> recommend all 3.1 users to upgrade to this stable release.
>>
>> To download Spark 3.1.2, head over to the download page:
>> https://spark.apache.org/downloads.html
>>
>> To view the release notes:
>> https://spark.apache.org/releases/spark-release-3-1-2.html
>>
>> We would like to acknowledge all community members for contributing to
>> this
>> release. This release would not have been possible without you.
>>
>> Dongjoon Hyun
>>
>

--


Re: Apache Spark 3.1.2 Release?

2021-05-17 Thread Xiao Li
+1 Thanks, Dongjoon!

Xiao



On Mon, May 17, 2021 at 8:45 PM Kent Yao  wrote:

> +1. thanks Dongjoon
>
> *Kent Yao *
> @ Data Science Center, Hangzhou Research Institute, NetEase Corp.
> *a spark enthusiast*
> *kyuubi is a unified multi-tenant JDBC
> interface for large-scale data processing and analytics, built on top
> of Apache Spark.*
> *spark-authorizer A Spark
> SQL extension which provides SQL Standard Authorization for Apache
> Spark.*
> *spark-postgres A library for
> reading data from and transferring data to Postgres / Greenplum with Spark
> SQL and DataFrames, 10~100x faster.*
> *itatchi A library that
> brings useful functions from various modern database management systems to
> Apache
> Spark.*
>
>
>
> On 05/18/2021 10:57, John Zhuge 
> wrote:
>
> +1, thanks Dongjoon!
>
> On Mon, May 17, 2021 at 7:50 PM Yuming Wang  wrote:
>
>> +1.
>>
>> On Tue, May 18, 2021 at 9:06 AM Hyukjin Kwon  wrote:
>>
>>> +1 thanks for driving this
>>>
>>> On Tue, 18 May 2021, 09:33 Holden Karau,  wrote:
>>>
 +1 and thanks for volunteering to be the RM :)

 On Mon, May 17, 2021 at 4:09 PM Takeshi Yamamuro 
 wrote:

> Thank you, Dongjoon~ sgtm, too.
>
> On Tue, May 18, 2021 at 7:34 AM Cheng Su 
> wrote:
>
>> +1 for a new release, thanks Dongjoon!
>>
>> Cheng Su
>>
>> On 5/17/21, 2:44 PM, "Liang-Chi Hsieh"  wrote:
>>
>> +1 sounds good. Thanks Dongjoon for volunteering on this!
>>
>>
>> Liang-Chi
>>
>>
>> Dongjoon Hyun-2 wrote
>> > Hi, All.
>> >
>> > Since Apache Spark 3.1.1 tag creation (Feb 21),
>> > new 172 patches including 9 correctness patches and 4 K8s
>> patches arrived
>> > at branch-3.1.
>> >
>> > Shall we make a new release, Apache Spark 3.1.2, as the second
>> release at
>> > 3.1 line?
>> > I'd like to volunteer for the release manager for Apache Spark
>> 3.1.2.
>> > I'm thinking about starting the first RC next week.
>> >
>> > $ git log --oneline v3.1.1..HEAD | wc -l
>> >  172
>> >
>> > # Known correctness issues
>> > SPARK-34534 New protocol FetchShuffleBlocks in
>> OneForOneBlockFetcher
>> > lead to data loss or correctness
>> > SPARK-34545 PySpark Python UDF return inconsistent results
>> when
>> > applying 2 UDFs with different return type to 2 columns together
>> > SPARK-34681 Full outer shuffled hash join when building
>> left side
>> > produces wrong result
>> > SPARK-34719 fail if the view query has duplicated column
>> names
>> > SPARK-34794 Nested higher-order functions broken in DSL
>> > SPARK-34829 transform_values return identical values when
>> it's used
>> > with udf that returns reference type
>> > SPARK-34833 Apply right-padding correctly for correlated
>> subqueries
>> > SPARK-35381 Fix lambda variable name issues in nested
>> DataFrame
>> > functions in R APIs
>> > SPARK-35382 Fix lambda variable name issues in nested
>> DataFrame
>> > functions in Python APIs
>> >
>> > # Notable K8s patches since K8s GA
>> > SPARK-34674 Close SparkContext after the Main method has
>> finished
>> > SPARK-34948 Add ownerReference to executor configmap to fix
>> leakages
>> > SPARK-34820 add apt-update before gnupg install
>> > SPARK-34361 In case of downscaling avoid killing of
>> executors already
>> > known by the scheduler backend in the pod allocator
>> >
>> > Bests,
>> > Dongjoon.
>>
>>
>>
>>
>>
>> --
>> Sent from:
>> http://apache-spark-developers-list.1001551.n3.nabble.com/
>>
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>>
>
> --
> ---
> Takeshi Yamamuro
>
 --
 Twitter: https://twitter.com/holdenkarau
 Books (Learning Spark, High Performance Spark, etc.):
 https://amzn.to/2MaRAG9  
 YouTube Live Streams: https://www.youtube.com/user/holdenkarau

>>>
>
> --
> John Zhuge
>
>

--


Re: Welcoming six new Apache Spark committers

2021-03-27 Thread Xiao Li
Congratulations, everyone!

Xiao

On Fri, Mar 26, 2021 at 6:30 PM Chao Sun  wrote:

> Congrats everyone!
>
> On Fri, Mar 26, 2021 at 6:23 PM Mridul Muralidharan 
> wrote:
>
>>
>> Congratulations, looking forward to more exciting contributions !
>>
>> Regards,
>> Mridul
>>
>> On Fri, Mar 26, 2021 at 8:21 PM Dongjoon Hyun 
>> wrote:
>>
>>>
>>> Congratulations! :)
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>> On Fri, Mar 26, 2021 at 5:55 PM angers zhu  wrote:
>>>
 Congratulations

 On Sat, Mar 27, 2021 at 8:35 AM Prashant Sharma  wrote:

> Congratulations  all!!
>
> On Sat, Mar 27, 2021, 5:10 AM huaxin gao 
> wrote:
>
>> Congratulations to you all!!
>>
>> On Fri, Mar 26, 2021 at 4:22 PM Yuming Wang  wrote:
>>
>>> Congrats!
>>>
>>> On Sat, Mar 27, 2021 at 7:13 AM Takeshi Yamamuro <
>>> linguin@gmail.com> wrote:
>>>
 Congrats, all~

 On Sat, Mar 27, 2021 at 7:46 AM Jungtaek Lim <
 kabhwan.opensou...@gmail.com> wrote:

> Congrats all!
>
> On Sat, Mar 27, 2021 at 6:56 AM Liang-Chi Hsieh  wrote:
>
>> Congrats! Welcome!
>>
>>
>> Matei Zaharia wrote
>> > Hi all,
>> >
>> > The Spark PMC recently voted to add several new committers.
>> Please join me
>> > in welcoming them to their new role! Our new committers are:
>> >
>> > - Maciej Szymkiewicz (contributor to PySpark)
>> > - Max Gekk (contributor to Spark SQL)
>> > - Kent Yao (contributor to Spark SQL)
>> > - Attila Zsolt Piros (contributor to decommissioning and Spark
>> on
>> > Kubernetes)
>> > - Yi Wu (contributor to Spark Core and SQL)
>> > - Gabor Somogyi (contributor to Streaming and security)
>> >
>> > All six of them contributed to Spark 3.1 and we’re very excited
>> to have
>> > them join as committers.
>> >
>> > Matei and the Spark PMC
>> >
>> -
>> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>>
>>
>>
>> --
>> Sent from:
>> http://apache-spark-developers-list.1001551.n3.nabble.com/
>>
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>

 --
 ---
 Takeshi Yamamuro

>>>


Re: [VOTE] SPIP: Support pandas API layer on PySpark

2021-03-27 Thread Xiao Li
+1

Xiao

On Fri, Mar 26, 2021 at 4:14 PM Takeshi Yamamuro  wrote:

> +1 (non-binding)
>
> On Sat, Mar 27, 2021 at 4:53 AM Liang-Chi Hsieh  wrote:
>
>> +1 (non-binding)
>>
>>
>> rxin wrote
>> > +1. Would open up a huge persona for Spark.
>> >
>> > On Fri, Mar 26 2021 at 11:30 AM, Bryan Cutler  wrote:
>> >
>> >>
>> >> +1 (non-binding)
>> >>
>> >>
>> >> On Fri, Mar 26, 2021 at 9:49 AM Maciej  wrote:
>> >>
>> >>
>> >>> +1 (nonbinding)
>>
>>
>>
>>
>>
>> --
>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>
> --
> ---
> Takeshi Yamamuro
>


Re: Apache Spark 3.2 Expectation

2021-03-10 Thread Xiao Li
Below are some nice-to-have features we can work on in Spark 3.2: Lateral
Join support, interval
data type, timestamp without time zone, un-nesting arbitrary queries, the
returned metrics of DSV2, and error message standardization. Spark 3.2 will
be another exciting release I believe!

Go Spark!

Xiao




On Wed, Mar 10, 2021 at 12:25 PM Dongjoon Hyun  wrote:

> Hi, Xiao.
>
> This thread started 13 days ago. Since you asked the community about major
> features or timelines at that time, could you share your roadmap or
> expectations if you have something in your mind?
>
> > Thank you, Dongjoon, for initiating this discussion. Let us keep it
> open. It might take 1-2 weeks to collect from the community all the
> features we plan to build and ship in 3.2 since we just finished the 3.1
> voting.
> > TBH, cutting the branch this April does not look good to me. That means,
> we only have one month left for feature development of Spark 3.2. Do we
> have enough features in the current master branch? If not, are we able to
> finish major features we collected here? Do they have a timeline or project
> plan?
>
> Bests,
> Dongjoon.
>
>
>
> On Wed, Mar 3, 2021 at 2:58 PM Dongjoon Hyun 
> wrote:
>
>> Hi, John.
>>
>> This thread aims to share your expectations and goals (and maybe work
>> progress) to Apache Spark 3.2 because we are making this together. :)
>>
>> Bests,
>> Dongjoon.
>>
>>
>> On Wed, Mar 3, 2021 at 1:59 PM John Zhuge  wrote:
>>
>>> Hi Dongjoon,
>>>
>>> Is it possible to get ViewCatalog in? The community already had fairly
>>> detailed discussions.
>>>
>>> Thanks,
>>> John
>>>
>>> On Thu, Feb 25, 2021 at 8:57 AM Dongjoon Hyun 
>>> wrote:
>>>
 Hi, All.

 Since we have been preparing Apache Spark 3.2.0 in master branch since
 December 2020, March seems to be a good time to share our thoughts and
 aspirations on Apache Spark 3.2.

 According to the progress on Apache Spark 3.1 release, Apache Spark 3.2
 seems to be the last minor release of this year. Given the timeframe, we
 might consider the following. (This is a small set. Please add your
 thoughts to this limited list.)

 # Languages

 - Scala 2.13 Support: This was expected on 3.1 via SPARK-25075 but
 slipped out. Currently, we are trying to use Scala 2.13.5 via SPARK-34505
 and investigating the publishing issue. Thank you for your contributions
 and feedback on this.

 - Java 17 LTS Support: Java 17 LTS will arrive in September 2021. Like
 Java 11, we need lots of support from our dependencies. Let's see.

 - Python 3.6 Deprecation(?): Python 3.6 community support ends at
 2021-12-23. So, the deprecation is not required yet, but we had better
 prepare it because we don't have an ETA of Apache Spark 3.3 in 2022.

 - SparkR CRAN publishing: As we know, it's discontinued so far.
 Resuming it depends on the success of Apache SparkR 3.1.1 CRAN publishing.
 If it succeeds to revive it, we can keep publishing. Otherwise, I believe
 we had better drop it from the releasing work item list officially.

 # Dependencies

 - Apache Hadoop 3.3.2: Hadoop 3.2.0 becomes the default Hadoop profile
 in Apache Spark 3.1. Currently, Spark master branch lives on Hadoop 3.2.2's
 shaded clients via SPARK-33212. So far, there is one on-going report at
 YARN environment. We hope it will be fixed soon at Spark 3.2 timeframe and
 we can move toward Hadoop 3.3.2.

 - Apache Hive 2.3.9: Spark 3.0 starts to use Hive 2.3.7 by default
 instead of old Hive 1.2 fork. Spark 3.1 removed hive-1.2 profile completely
 via SPARK-32981 and replaced the generated hive-service-rpc code with the
 official dependency via SPARK-32981. We are steadily improving this area
 and will consume Hive 2.3.9 if available.

 - K8s Client 4.13.2: During K8s GA activity, Spark 3.1 upgrades K8s
 client dependency to 4.12.0. Spark 3.2 upgrades it to 4.13.2 in order to
 support K8s model 1.19.

 - Kafka Client 2.8: To bring the client fixes, Spark 3.1 is using Kafka
 Client 2.6. For Spark 3.2, SPARK-33913 upgraded to Kafka 2.7 with Scala
 2.12.13, but it was reverted later due to Scala 2.12.13 issue. Since
 KAFKA-12357 fixed the Scala requirement two days ago, Spark 3.2 will go
 with Kafka Client 2.8 hopefully.

 # Some Features

 - Data Source v2: Spark 3.2 will deliver much richer DSv2 with Apache
 Iceberg integration. Especially, we hope the on-going function catalog SPIP
 and up-coming storage partitioned join SPIP can be delivered as a part of
 Spark 3.2 and become an additional foundation.

 - Columnar Encryption: As of today, Apache Spark master branch supports
 columnar encryption via Apache ORC 1.6 and it's documented via SPARK-34036.
 Also, upcoming Apache Parquet 1.12 has a similar capability. Hopefully,
 

Re: Apache Spark 2.4.8 (and EOL of 2.4)

2021-03-04 Thread Xiao Li
Thank you, Liang-Chi!

Xiao

On Thu, Mar 4, 2021 at 6:25 PM Hyukjin Kwon  wrote:

> Thanks @Liang-Chi Hsieh  for driving this.
>
> On Fri, Mar 5, 2021 at 5:21 AM Liang-Chi Hsieh  wrote:
>
>>
>> Thanks all for the input.
>>
>> If there is no objection, I am going to cut the branch next Monday.
>>
>> Thanks.
>> Liang-Chi
>>
>>
>> Takeshi Yamamuro wrote
>> > +1 for releasing 2.4.8 and thanks, Liang-chi, for volunteering.
>> > Btw, anyone roughly know how many v2.4 users still are based on some
>> stats
>> > (e.g., # of v2.4.7 downloads from the official repos)?
>> > Most users have started using v3.x?
>> >
>> > On Thu, Mar 4, 2021 at 8:34 AM Hyukjin Kwon  wrote:
>> >
>> >> Yeah, I would prefer to have a 2.4.8 release as an EOL too. I don't
>> mind
>> >> having 2.4.9 as EOL too if that's preferred from more people.
>> >>
>> > Takeshi Yamamuro
>>
>>
>>
>>
>>
>> --
>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>

--


Re: Apache Spark 3.2 Expectation

2021-02-26 Thread Xiao Li
Thank you, Dongjoon, for initiating this discussion. Let us keep it open.
It might take 1-2 weeks to collect from the community all the features
we plan to build and ship in 3.2 since we just finished the 3.1 voting.


> 3. +100 for Apache Spark 3.2.0 in July 2021. Maybe, we need `branch-cut`
> in April because we took 3 month for Spark 3.1 release.


TBH, cutting the branch this April does not look good to me. That means, we
only have one month left for feature development of Spark 3.2. Do we have
enough features in the current master branch? If not, are we able to finish
major features we collected here? Do they have a timeline or project plan?

Xiao

On Fri, Feb 26, 2021 at 10:07 AM Dongjoon Hyun  wrote:

> Thank you, Mridul and Sean.
>
> 1. Yes, `2017` was a typo. Java 17 is scheduled September 2021. And, of
> course, it's a nice-to-have status. :)
>
> 2. `Push based shuffle and disaggregated shuffle`. Definitely. Thanks for
> sharing,
>
> 3. +100 for Apache Spark 3.2.0 in July 2021. Maybe, we need `branch-cut`
> in April because we took 3 month for Spark 3.1 release.
> Let's update our release roadmap of the Apache Spark website.
>
> > I'd roughly expect 3.2 in, say, July of this year, given the usual
> cadence. No reason it couldn't be a little sooner or later. There is
> already some good stuff in 3.2 and will be a good minor release in 5-6
> months.
>
> Bests,
> Dongjoon.
>
>
>
> On Thu, Feb 25, 2021 at 9:33 AM Sean Owen  wrote:
>
>> I'd roughly expect 3.2 in, say, July of this year, given the usual
>> cadence. No reason it couldn't be a little sooner or later. There is
>> already some good stuff in 3.2 and will be a good minor release in 5-6
>> months.
>>
>> On Thu, Feb 25, 2021 at 10:57 AM Dongjoon Hyun 
>> wrote:
>>
>>> Hi, All.
>>>
>>> Since we have been preparing Apache Spark 3.2.0 in master branch since
>>> December 2020, March seems to be a good time to share our thoughts and
>>> aspirations on Apache Spark 3.2.
>>>
>>> According to the progress on Apache Spark 3.1 release, Apache Spark 3.2
>>> seems to be the last minor release of this year. Given the timeframe, we
>>> might consider the following. (This is a small set. Please add your
>>> thoughts to this limited list.)
>>>
>>> # Languages
>>>
>>> - Scala 2.13 Support: This was expected on 3.1 via SPARK-25075 but
>>> slipped out. Currently, we are trying to use Scala 2.13.5 via SPARK-34505
>>> and investigating the publishing issue. Thank you for your contributions
>>> and feedback on this.
>>>
>>> - Java 17 LTS Support: Java 17 LTS will arrive in September 2017. Like
>>> Java 11, we need lots of support from our dependencies. Let's see.
>>>
>>> - Python 3.6 Deprecation(?): Python 3.6 community support ends at
>>> 2021-12-23. So, the deprecation is not required yet, but we had better
>>> prepare it because we don't have an ETA of Apache Spark 3.3 in 2022.
>>>
>>> - SparkR CRAN publishing: As we know, it's discontinued so far. Resuming
>>> it depends on the success of Apache SparkR 3.1.1 CRAN publishing. If it
>>> succeeds to revive it, we can keep publishing. Otherwise, I believe we had
>>> better drop it from the releasing work item list officially.
>>>
>>> # Dependencies
>>>
>>> - Apache Hadoop 3.3.2: Hadoop 3.2.0 becomes the default Hadoop profile
>>> in Apache Spark 3.1. Currently, Spark master branch lives on Hadoop 3.2.2's
>>> shaded clients via SPARK-33212. So far, there is one on-going report at
>>> YARN environment. We hope it will be fixed soon at Spark 3.2 timeframe and
>>> we can move toward Hadoop 3.3.2.
>>>
>>> - Apache Hive 2.3.9: Spark 3.0 starts to use Hive 2.3.7 by default
>>> instead of old Hive 1.2 fork. Spark 3.1 removed hive-1.2 profile completely
>>> via SPARK-32981 and replaced the generated hive-service-rpc code with the
>>> official dependency via SPARK-32981. We are steadily improving this area
>>> and will consume Hive 2.3.9 if available.
>>>
>>> - K8s Client 4.13.2: During K8s GA activity, Spark 3.1 upgrades K8s
>>> client dependency to 4.12.0. Spark 3.2 upgrades it to 4.13.2 in order to
>>> support K8s model 1.19.
>>>
>>> - Kafka Client 2.8: To bring the client fixes, Spark 3.1 is using Kafka
>>> Client 2.6. For Spark 3.2, SPARK-33913 upgraded to Kafka 2.7 with Scala
>>> 2.12.13, but it was reverted later due to Scala 2.12.13 issue. Since
>>> KAFKA-12357 fixed the Scala requirement two days ago, Spark 3.2 will go
>>> with Kafka Client 2.8 hopefully.
>>>
>>> # Some Features
>>>
>>> - Data Source v2: Spark 3.2 will deliver much richer DSv2 with Apache
>>> Iceberg integration. Especially, we hope the on-going function catalog SPIP
>>> and up-coming storage partitioned join SPIP can be delivered as a part of
>>> Spark 3.2 and become an additional foundation.
>>>
>>> - Columnar Encryption: As of today, Apache Spark master branch supports
>>> columnar encryption via Apache ORC 1.6 and it's documented via SPARK-34036.
>>> Also, upcoming Apache Parquet 1.12 has a similar capability. Hopefully,
>>> Apache Spark 3.2 is going to 

Re: [VOTE] Release Spark 3.1.1 (RC3)

2021-02-25 Thread Xiao Li
I confirmed that Q17 and Q39a/b have matching results between Spark 3.0 and
3.1 after enabling spark.sql.legacy.statisticalAggregate. The result
changes are expected. For more details, you can read the PR
https://github.com/apache/spark/pull/29983/. Also, the result of Q18 is
affected by the overflow checking in Spark. These issues exist in all the
releases. We will continue to improve our ANSI mode and fix them in the
upcoming releases.
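
For reference, a minimal spark-shell sketch of the flag flip described above
(the single-row query is an illustration, not one of the TPC-DS queries):

  // Under the 3.1 default, stddev_samp over a single row returns NULL when
  // the sample variance divides by zero; enabling the legacy flag restores
  // the NaN result that Spark 3.0 produced.
  spark.conf.set("spark.sql.legacy.statisticalAggregate", "true")
  spark.sql("SELECT stddev_samp(x) FROM VALUES (1.0) AS t(x)").show()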

Thus, I change my vote from -1 to +1.

As Ismael suggested, we can add some Github Actions to validate the TPC-DS
and TPC-H results for small scale datasets.

Cheers,

Xiao



On Thu, Feb 25, 2021 at 12:16 PM Ismaël Mejía  wrote:

> Since the TPC-DS performance tests are one of the main validation sources
> for regressions on Spark releases maybe it is time to automate the query
> outputs validation to find correctness issues eagerly (it would be also
> nice to validate the performance regressions but correctness >>>
> performance).
>
> This has been a long standing open issue [1] that is probably worth to
> address and it seems that automating this via Github Actions could be
> relatively straight-forward.
>
> [1] https://github.com/databricks/spark-sql-perf/issues/184
>
>
> On Wed, Feb 24, 2021 at 8:15 PM Reynold Xin  wrote:
>
>> +1 Correctness issues are serious!
>>
>>
>> On Wed, Feb 24, 2021 at 11:08 AM, Mridul Muralidharan 
>> wrote:
>>
>>> That is indeed cause for concern.
>>> +1 on extending the voting deadline until we finish investigation of
>>> this.
>>>
>>> Regards,
>>> Mridul
>>>
>>>
>>> On Wed, Feb 24, 2021 at 12:55 PM Xiao Li  wrote:
>>>
>>>> -1 Could we extend the voting deadline?
>>>>
>>>> A few TPC-DS queries (q17, q18, q39a, q39b) are returning different
>>>> results between Spark 3.0 and Spark 3.1. We need a few more days to
>>>> understand whether these changes are expected.
>>>>
>>>> Xiao
>>>>
>>>>
On Wed, Feb 24, 2021 at 10:41 AM Mridul Muralidharan  wrote:
>>>>
>>>>>
>>>>> Sounds good, thanks for clarifying Hyukjin !
>>>>> +1 on release.
>>>>>
>>>>> Regards,
>>>>> Mridul
>>>>>
>>>>>
>>>>> On Wed, Feb 24, 2021 at 2:46 AM Hyukjin Kwon 
>>>>> wrote:
>>>>>
>>>>>> I remember HiveExternalCatalogVersionsSuite was flaky for a while
>>>>>> which is fixed in
>>>>>> https://github.com/apache/spark/commit/0d5d248bdc4cdc71627162a3d20c42ad19f24ef4
>>>>>> and .. KafkaDelegationTokenSuite is flaky (
>>>>>> https://issues.apache.org/jira/browse/SPARK-31250).
>>>>>>
>>>>>> On Wed, Feb 24, 2021 at 5:19 PM Mridul Muralidharan  wrote:
>>>>>>
>>>>>>>
>>>>>>> Signatures, digests, etc check out fine.
>>>>>>> Checked out tag and build/tested with -Pyarn -Phadoop-2.7 -Phive
>>>>>>> -Phive-thriftserver -Pmesos -Pkubernetes
>>>>>>>
>>>>>>> I keep getting test failures with
>>>>>>> * org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite
>>>>>>> * org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite.
>>>>>>> (Note: I remove $HOME/.m2 and $HOME/.iv2 paths before build)
>>>>>>>
>>>>>>> Removing these suites gets the build through though - does anyone
>>>>>>> have suggestions on how to fix it ? I did not face this with RC1.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Mridul
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Feb 22, 2021 at 12:57 AM Hyukjin Kwon 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Please vote on releasing the following candidate as Apache Spark
>>>>>>>> version 3.1.1.
>>>>>>>>
>>>>>>>> The vote is open until February 24th 11PM PST and passes if a
>>>>>>>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>>>>>>>
>>>>>>>> [ ] +1 Release this package as Apache Spark 3.1.1
>>>>>>>> [ ] -1 Do not release this package because ...
>>>>>>>>
>>>>>>>> To learn more about Apache Spark, please see
>>>>>>

Re: [VOTE] Release Spark 3.1.1 (RC3)

2021-02-24 Thread Xiao Li
-1 Could we extend the voting deadline?

A few TPC-DS queries (q17, q18, q39a, q39b) are returning different results
between Spark 3.0 and Spark 3.1. We need a few more days to understand
whether these changes are expected.

Xiao


On Wed, Feb 24, 2021 at 10:41 AM Mridul Muralidharan  wrote:

>
> Sounds good, thanks for clarifying Hyukjin !
> +1 on release.
>
> Regards,
> Mridul
>
>
> On Wed, Feb 24, 2021 at 2:46 AM Hyukjin Kwon  wrote:
>
>> I remember HiveExternalCatalogVersionsSuite was flaky for a while which
>> is fixed in
>> https://github.com/apache/spark/commit/0d5d248bdc4cdc71627162a3d20c42ad19f24ef4
>> and .. KafkaDelegationTokenSuite is flaky (
>> https://issues.apache.org/jira/browse/SPARK-31250).
>>
>> On Wed, Feb 24, 2021 at 5:19 PM Mridul Muralidharan  wrote:
>>
>>>
>>> Signatures, digests, etc check out fine.
>>> Checked out tag and build/tested with -Pyarn -Phadoop-2.7 -Phive
>>> -Phive-thriftserver -Pmesos -Pkubernetes
>>>
>>> I keep getting test failures with
>>> * org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite
>>> * org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite.
>>> (Note: I remove $HOME/.m2 and $HOME/.iv2 paths before build)
>>>
>>> Removing these suites gets the build through though - does anyone have
>>> suggestions on how to fix it ? I did not face this with RC1.
>>>
>>> Regards,
>>> Mridul
>>>
>>>
>>> On Mon, Feb 22, 2021 at 12:57 AM Hyukjin Kwon 
>>> wrote:
>>>
 Please vote on releasing the following candidate as Apache Spark
 version 3.1.1.

 The vote is open until February 24th 11PM PST and passes if a majority
 +1 PMC votes are cast, with a minimum of 3 +1 votes.

 [ ] +1 Release this package as Apache Spark 3.1.1
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see http://spark.apache.org/

 The tag to be voted on is v3.1.1-rc3 (commit
 1d550c4e90275ab418b9161925049239227f3dc9):
 https://github.com/apache/spark/tree/v3.1.1-rc3

 The release files, including signatures, digests, etc. can be found at:
 
 https://dist.apache.org/repos/dist/dev/spark/v3.1.1-rc3-bin/

 Signatures used for Spark RCs can be found in this file:
 https://dist.apache.org/repos/dist/dev/spark/KEYS

 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1367

 The documentation corresponding to this release can be found at:
 https://dist.apache.org/repos/dist/dev/spark/v3.1.1-rc3-docs/

 The list of bug fixes going into 3.1.1 can be found at the following
 URL:
 https://s.apache.org/41kf2

 This release is using the release script of the tag v3.1.1-rc3.

 FAQ

 ===
 What happened to 3.1.0?
 ===

 There was a technical issue during Apache Spark 3.1.0 preparation, and
 it was discussed and decided to skip 3.1.0.
 Please see
 https://spark.apache.org/news/next-official-release-spark-3.1.1.html for
 more details.

 =
 How can I help test this release?
 =

 If you are a Spark user, you can help us test this release by taking
 an existing Spark workload and running on this release candidate, then
 reporting any regressions.

 If you're working in PySpark you can set up a virtual env and install
 the current RC via "pip install
 https://dist.apache.org/repos/dist/dev/spark/v3.1.1-rc3-bin/pyspark-3.1.1.tar.gz
 "
 and see if anything important breaks.
 In the Java/Scala, you can add the staging repository to your projects
 resolvers and test
 with the RC (make sure to clean up the artifact cache before/after so
 you don't end up building with an out of date RC going forward).

 ===
 What should happen to JIRA tickets still targeting 3.1.1?
 ===

 The current list of open tickets targeted at 3.1.1 can be found at:
 https://issues.apache.org/jira/projects/SPARK and search for "Target
 Version/s" = 3.1.1

 Committers should look at those and triage. Extremely important bug
 fixes, documentation, and API tweaks that impact compatibility should
 be worked on immediately. Everything else please retarget to an
 appropriate release.

 ==
 But my bug isn't fixed?
 ==

 In order to make timely releases, we will typically not hold the
 release unless the bug in question is a regression from the previous
 release. That being said, if there is something which is a regression
 that has not been correctly targeted please ping me or a committer to
 help target the issue.




Re: Apache Spark 3.0.2 Release ?

2021-02-12 Thread Xiao Li
+1

Happy Lunar New Year!

Xiao

On Fri, Feb 12, 2021 at 5:33 PM Hyukjin Kwon  wrote:

> Yeah, +1 too
>
> On Sat, Feb 13, 2021 at 4:49 AM Dongjoon Hyun  wrote:
>
>> Thank you, Sean!
>>
>> On Fri, Feb 12, 2021 at 11:41 AM Sean Owen  wrote:
>>
>>> Sounds like a fine time to me, sure.
>>>
>>> On Fri, Feb 12, 2021 at 1:39 PM Dongjoon Hyun 
>>> wrote:
>>>
 Hi, All.

 As of today, `branch-3.0` has 307 patches (including 25 correctness
 patches) since v3.0.1 tag (released on September 8th, 2020).

 Since we stabilized branch-3.0 during 3.1.x preparation so far,
 it would be great if we start to release Apache Spark 3.0.2 next week.
 And, I'd like to volunteer for Apache Spark 3.0.2 release manager.

 What do you think about the Apache Spark 3.0.2 release?

 Bests,
 Dongjoon.


 --
 SPARK-31511 Make BytesToBytesMap iterator() thread-safe
 SPARK-32635 When pyspark.sql.functions.lit() function is used with
 dataframe cache, it returns wrong result
 SPARK-32753 Deduplicating and repartitioning the same column create
 duplicate rows with AQE
 SPARK-32764 compare of -0.0 < 0.0 return true (spot-checked below)
 SPARK-32840 Invalid interval value can happen to be just adhesive with
 the unit
 SPARK-32908 percentile_approx() returns incorrect results
 SPARK-33019 Use
 spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 by default
 SPARK-33183 Bug in optimizer rule EliminateSorts
 SPARK-33260 SortExec produces incorrect results if sortOrder is a Stream
 SPARK-33290 REFRESH TABLE should invalidate cache even though the table
 itself may not be cached
 SPARK-33358 Spark SQL CLI command processing loop can't exit while one
 command fails
 SPARK-33404 "date_trunc" expression returns incorrect results
 SPARK-33435 DSv2: REFRESH TABLE should invalidate caches
 SPARK-33591 NULL is recognized as the "null" string in partition specs
 SPARK-33593 Vector reader got incorrect data with binary partition value
 SPARK-33726 Duplicate field names causes wrong answers during
 aggregation
 SPARK-33950 ALTER TABLE .. DROP PARTITION doesn't refresh cache
 SPARK-34011 ALTER TABLE .. RENAME TO PARTITION doesn't refresh cache
 SPARK-34027 ALTER TABLE .. RECOVER PARTITIONS doesn't refresh cache
 SPARK-34055 ALTER TABLE .. ADD PARTITION doesn't refresh cache
 SPARK-34187 Use available offset range obtained during polling when
 checking offset validation
 SPARK-34212 For parquet table, after changing the precision and scale
 of decimal type in hive, spark reads incorrect value
 SPARK-34213 LOAD DATA doesn't refresh v1 table cache
 SPARK-34229 Avro should read decimal values with the file schema
 SPARK-34262 ALTER TABLE .. SET LOCATION doesn't refresh v1 table cache
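
 As a quick spot check of one of the correctness items above (SPARK-32764),
 a one-line spark-shell probe; the comment notes the expected result:

   spark.sql("SELECT -0.0 < 0.0").show()  // false once fixed; the bug
                                          // made this return true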

>>>

--


Re: [VOTE] Release Spark 3.1.0 (RC1)

2021-01-07 Thread Xiao Li
> I will prepare to upload news in spark-website to explain that 3.1.0 is
incompletely published because there was something wrong during the release
process, and we go to 3.1.1 right away.

+1

On Thu, Jan 7, 2021 at 6:44 AM Sean Owen  wrote:

> While we can delete the tag, maybe just leave it. As a general rule we
> would not remove anything pushed to the main git repo.
>
> On Thu, Jan 7, 2021 at 8:31 AM Jacek Laskowski  wrote:
>
>> Hi,
>>
>> BTW, wondering aloud. Since it was agreed to skip 3.1.0 and go ahead with
>> 3.1.1, what's gonna happen with v3.1.0 tag [1]? Is it going away and we'll
>> see 3.1.1-rc1?
>>
>> [1] https://github.com/apache/spark/tree/v3.1.0-rc1
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> https://about.me/JacekLaskowski
>> "The Internals Of" Online Books 
>> Follow me on https://twitter.com/jaceklaskowski
>>
>> 
>>
>>>


Re: [build system] WE'RE LIVE!

2020-12-01 Thread Xiao Li
Thank you, Shane!

Xiao

On Tue, Dec 1, 2020 at 5:34 PM Dongjoon Hyun 
wrote:

> Yay! Thanks!
>
> Bests,
> Dongjoon
>
> On Tue, Dec 1, 2020 at 5:31 PM Takeshi Yamamuro 
> wrote:
>
>> Many thanks, guys!
>> I've checked I can re-trigger Jenkins tests.
>>
>> Bests,
>> Takeshi
>>
>> On Wed, Dec 2, 2020 at 9:55 AM shane knapp ☠  wrote:
>>
>>> https://amplab.cs.berkeley.edu/jenkins/
>>>
>>> i cleared the build queue, so you'll need to retrigger your PRs.  there
>>> will be occasional downtime over the next few days and weeks as we uncover
>>> system-level errors and more reimaging happens...  but for now, we're
>>> building.
>>>
>>> a big thanks goes out to jon for his work on the project!  we couldn't
>>> have done it w/o him.
>>>
>>> shane
>>> --
>>> Shane Knapp
>>> Computer Guy / Voice of Reason
>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>> https://rise.cs.berkeley.edu
>>>
>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>

--


Re: Seeking committers' help to review on SS PR

2020-11-30 Thread Xiao Li
Just want to say thank you to all the active SS contributors. I saw many
great features/improvements in Streaming have been merged and will be
available in the upcoming 3.1 release.

   - Cache fetched list of files beyond maxFilesPerTrigger as unread file
   (SPARK-32568)
   - Streamline the logic on file stream source and sink metadata log
   (SPARK-30462)
   - Add DataStreamReader.table API (SPARK-32885)
   - Add DataStreamWriter.saveAsTable API (SPARK-32896); both table APIs
     are sketched right after this list
   - Left semi stream-stream join (SPARK-32862)
   - Introduce schema validation for streaming state store (SPARK-31894)
   - Support to use a different compression codec in state store
   (SPARK-33263)
   - Kafka connector infinite wait because metadata never updated
   (SPARK-28367)
   - Upgrade Kafka to 2.6.0 (SPARK-32568)
   - Pagination support for Structured Streaming UI pages (SPARK-31642,
   SPARK-30119)
   - State information in Structured Streaming UI (SPARK-33223)
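
A rough Scala sketch of the two table APIs flagged above; the method names
follow the JIRA titles (SPARK-32885/SPARK-32896), so the released signatures
may differ, and the table names and checkpoint path are made up:

   val events = spark.readStream.table("raw_events")
   events.writeStream
     .option("checkpointLocation", "/tmp/ckpt")
     .saveAsTable("archived_events")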

Structured Streaming UI support in Spark History Server is another great
usability feature: https://github.com/apache/spark/pull/28781 Hopefully,
this can be part of 3.1 release.

Go Spark!

Xiao



On Mon, Nov 30, 2020 at 11:35 AM Ryan Blue  wrote:

> Jungtaek,
>
> If there are contributors that you trust for reviews, then please let PMC
> members know so they can be considered. I agree that is the best solution.
>
> If there aren't contributors that the PMC wants to add as committers, then
> I suggest agreeing on a temporary exception to help make progress in this
> area and give contributors more opportunities to develop. Something like
> this: for the next 6 months, contributions from committers to SS can be
> committed without a committer +1 if they are reviewed by at least one
> contributor (and have no dissent from committers, of course). Then after
> the period expires, we would ideally have new people ready to be added as
> committers.
>
> That would need to be voted on, but I think it is a reasonable step to
> help resuscitate Spark streaming.
>
> On Fri, Nov 27, 2020 at 7:15 PM Sean Owen  wrote:
>
>> I don't know the code well, but those look minor and straightforward.
>> They have reviews from the two most knowledgeable people in this area. I
>> don't think you need to block for 6 months after proactively seeking all
>> likely reviewers - I'm saying that's the resolution to this type of
>> situation (too).
>>
>> On Fri, Nov 27, 2020 at 8:55 PM Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>>
>>> Btw, there are two more PRs which got LGTM by a SS contributor but fail
>>> to get attention from committers. They're 6+ months old. Could you help
>>> reviewing this as well, or do you all think 6 months of time range + LGTM
>>> from an SS contributor is enough to go ahead?
>>>
>>> https://github.com/apache/spark/pull/27649
>>> https://github.com/apache/spark/pull/28363
>>>
>>> These are under 100 lines of changes per each, and not invasive.
>>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: jenkins downtime tomorrow evening/weekend

2020-11-23 Thread Xiao Li
Thank you, Shane!

On Mon, Nov 23, 2020 at 2:12 PM shane knapp ☠  wrote:

> the third most terrifying event in the world, a massive jenkins plugin
> update is happening in a couple of hours.  i'm going to restart jenkins and
> start working out any bugs/issues that pop up.
>
> this could be short, or quite long.  i'm guessing somewhere in the
> middle.  no new builds will be kicked off starting now.
>
> in parallel, i'm about to start porting my ansible to ubuntu 20 and
> testing that on two freshly reinstalled workers.  the ultimate goal is to
> get the PRB running on ubuntu 20...   the sbt tests will also likely be
> broken as i've never been able to get them working on ubuntu 16, 18 or 20.
>
> shane
>
> On Sat, Nov 21, 2020 at 4:23 PM shane knapp ☠  wrote:
>
>> somehow that went pretty smoothly, tho i've got a bunch of plugins to
>> deal with...  we're back up and building w/a shiny new UI.  :)
>>
>> On Sat, Nov 21, 2020 at 3:52 PM shane knapp ☠ 
>> wrote:
>>
>>> this is starting now
>>>
>>> On Thu, Nov 19, 2020 at 4:34 PM shane knapp ☠ 
>>> wrote:
>>>
 i'm going to be upgrading jenkins to something more reasonable, and
 there will definitely be some downtime as i get things sorted.

 we should be back up and building by monday.

 shane
 --
 Shane Knapp
 Computer Guy / Voice of Reason
 UC Berkeley EECS Research / RISELab Staff Technical Lead
 https://rise.cs.berkeley.edu

>>>
>>>
>>> --
>>> Shane Knapp
>>> Computer Guy / Voice of Reason
>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>> https://rise.cs.berkeley.edu
>>>
>>
>>
>> --
>> Shane Knapp
>> Computer Guy / Voice of Reason
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>
>
> --
> Shane Knapp
> Computer Guy / Voice of Reason
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


--


Re: Spark 3.1 branch cut 4th Dec?

2020-11-20 Thread Xiao Li
Thank you, Ryan!

Xiao

On Fri, Nov 20, 2020 at 9:20 AM Dongjoon Hyun  wrote:

> It sounds great! :)
>
> Thanks, Ryan.
>
> On Fri, Nov 20, 2020 at 9:19 AM Ryan Blue  wrote:
>
>> I think we should be able to get the CREATE TABLE changes in. Now that
>> the main blocker (EXTERNAL) has been decided, it's just a matter of normal
>> review comments.
>>
>> On Fri, Nov 20, 2020 at 9:05 AM Dongjoon Hyun 
>> wrote:
>>
>>> Thank you for sharing, Xiao.
>>>
>>> I hope we are able to make some agreement for CREATE TABLE DDLs, too.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>> On Fri, Nov 20, 2020 at 9:01 AM Xiao Li  wrote:
>>>
>>>> https://github.com/apache/spark/pull/28026 is the major feature I am
>>>> tracking. It is painful to keep two sets of CREATE TABLE DDLs with
>>>> different behaviors. This hurts the usability of our SQL users, based on
>>>> what I heard. Unfortunately, this PR missed Spark 3.0 release. Now, I think
>>>> we should try our best to address it in 3.1.
>>>>
>>>> Thanks,
>>>>
>>>> Xiao
>>>>
On Fri, Nov 20, 2020 at 8:52 AM Xiao Li  wrote:
>>>>
>>>>> Hi, Dongjoon,
>>>>>
>>>>> Thank you for your feedback. I think *Early December* does not mean
>>>>> we will cut the branch on Dec 1st. I do not think Dec 1st and Dec 4th are 
>>>>> a
>>>>> big deal. Normally, it would be nice to give enough buffer. Based on my
>>>>> understanding, this email is just a *proposal* and a *reminder*. In
>>>>> the past, we often got mixed feedback.
>>>>>
>>>>> Anyway, we are collecting feedback from the whole community.
>>>>> We welcome input from everyone else.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Xiao
>>>>>
On Fri, Nov 20, 2020 at 8:33 AM Dongjoon Hyun  wrote:
>>>>>
>>>>>> Hi, Xiao.
>>>>>>
>>>>>> I agree.
>>>>>>
>>>>>> > Merging the feature work after the branch cut should not be
>>>>>> encouraged in general, although some committers did make some exceptions
>>>>>> based on their own judgement. We should try to avoid merging the feature
>>>>>> work after the branch cut.
>>>>>>
>>>>>> So, the Apache Spark community accepted your request for delay
>>>>>> already. (Early November to Early December)
>>>>>>
>>>>>> -
>>>>>> https://github.com/apache/spark-website/commit/0cd0bdc80503882b4737db7e77cc8f9d17ec12ca
>>>>>>
>>>>>> I don't think the branch cut should be delayed again. We don't need
>>>>>> to have two weeks after Hyukjin's email.
>>>>>>
>>>>>> Given the delay, I'd strongly recommend to cut the branch on 1st
>>>>>> December.
>>>>>>
>>>>>> I'll create a `branch-3.1` on 1st December if Hyukjin is busy, to
>>>>>> start to stabilize it.
>>>>>>
>>>>>> Again, it will not block you if you have an exceptional request.
>>>>>>
>>>>>> However, it would be helpful for all of us if you make it clear what
>>>>>> features you are waiting for now.
>>>>>>
>>>>>> We are creating Apache Spark together.
>>>>>>
>>>>>> Bests,
>>>>>> Dongjoon.
>>>>>>
>>>>>>
>>>>>> On Thu, Nov 19, 2020 at 11:38 PM Xiao Li 
>>>>>> wrote:
>>>>>>
>>>>>>> Correction:
>>>>>>>
>>>>>>> Merging the feature work after the branch cut should not be
>>>>>>> encouraged in general, although some committers did make some exceptions
>>>>>>> based on their own judgement. We should try to avoid merging the feature
>>>>>>> work after the branch cut.
>>>>>>>
>>>>>>> This email is a good reminder message. At least, we have two weeks
>>>>>>> ahead of the proposed branch cut date. I hope each feature owner might
>>>>>>> hurry up and try to finish it before the branch cut.
>>>>>>>
>>>>>>> Xiao
>>>>>>>

Re: Spark 3.1 branch cut 4th Dec?

2020-11-20 Thread Xiao Li
https://github.com/apache/spark/pull/28026 is the major feature I am
tracking. It is painful to keep two sets of CREATE TABLE DDLs with
different behaviors. This hurts the usability of our SQL users, based on
what I heard. Unfortunately, this PR missed Spark 3.0 release. Now, I think
we should try our best to address it in 3.1.
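
To make the two DDL sets concrete, a hedged illustration (table and column
names are made up; https://github.com/apache/spark/pull/28026 is the work
that unifies how the two are resolved):

  // Native syntax, resolved by Spark's own data source framework:
  spark.sql("CREATE TABLE t1 (id INT) USING parquet")
  // Hive-style syntax, historically resolved through the Hive serde path
  // with different defaults:
  spark.sql("CREATE TABLE t2 (id INT) STORED AS parquet")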

Thanks,

Xiao

On Fri, Nov 20, 2020 at 8:52 AM Xiao Li  wrote:

> Hi, Dongjoon,
>
> Thank you for your feedback. I think *Early December* does not mean we
> will cut the branch on Dec 1st. I do not think Dec 1st and Dec 4th are a
> big deal. Normally, it would be nice to give enough buffer. Based on my
> understanding, this email is just a *proposal* and a *reminder*. In the
> past, we often got mixed feedback.
>
> Anyway, we are collecting feedback from the whole community. We welcome
> input from everyone else.
>
> Thanks,
>
> Xiao
>
On Fri, Nov 20, 2020 at 8:33 AM Dongjoon Hyun  wrote:
>
>> Hi, Xiao.
>>
>> I agree.
>>
>> > Merging the feature work after the branch cut should not be
>> encouraged in general, although some committers did make some exceptions
>> based on their own judgement. We should try to avoid merging the feature
>> work after the branch cut.
>>
>> So, the Apache Spark community accepted your request for delay already.
>> (Early November to Early December)
>>
>> -
>> https://github.com/apache/spark-website/commit/0cd0bdc80503882b4737db7e77cc8f9d17ec12ca
>>
>> I don't think the branch cut should be delayed again. We don't need to
>> have two weeks after Hyukjin's email.
>>
>> Given the delay, I'd strongly recommend to cut the branch on 1st December.
>>
>> I'll create a `branch-3.1` on 1st December if Hyukjin is busy to start
>> to stabilize it.
>>
>> Again, it will not block you if you have an exceptional request.
>>
>> However, it would be helpful for all of us if you make it clear what
>> features you are waiting for now.
>>
>> We are creating Apache Spark together.
>>
>> Bests,
>> Dongjoon.
>>
>>
>> On Thu, Nov 19, 2020 at 11:38 PM Xiao Li  wrote:
>>
>>> Correction:
>>>
>>> Merging the feature work after the branch cut should not be encouraged
>>> in general, although some committers did make some exceptions based on
>>> their own judgement. We should try to avoid merging the feature work after
>>> the branch cut.
>>>
>>> This email is a good reminder message. At least, we have two weeks
>>> ahead of the proposed branch cut date. I hope each feature owner might
>>> hurry up and try to finish it before the branch cut.
>>>
>>> Xiao
>>>
>>> On Thu, Nov 19, 2020 at 11:36 PM Xiao Li  wrote:
>>>
>>>> We should try to merge the feature work after the branch cut. This
>>>> should not be encouraged in general, although some committers did make some
>>>> exceptions based on their own judgement.
>>>>
>>>> This email is a good reminder message. At least, we have two weeks
>>>> ahead of the proposed branch cut date. I hope each feature owner might
>>>> hurry up and try to finish it before the branch cut.
>>>>
>>>> Xiao
>>>>
>>>> On Thu, Nov 19, 2020 at 4:02 PM Dongjoon Hyun  wrote:
>>>>
>>>>> Thank you for your volunteering!
>>>>>
>>>>> Since the previous branch-cuts were always soft-code freeze which
>>>>> allowed committers to merge to the new branches still for a while, I
>>>>> believe 1st December will be better for stabilization.
>>>>>
>>>>> Bests,
>>>>> Dongjoon.
>>>>>
>>>>>
>>>>> On Thu, Nov 19, 2020 at 3:50 PM Hyukjin Kwon 
>>>>> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I think we haven’t decided yet the exact branch-cut, code freeze and
>>>>>> release manager.
>>>>>>
>>>>>> As we planned in https://spark.apache.org/versioning-policy.html
>>>>>>
>>>>>> Early Dec 2020 Code freeze. Release branch cut
>>>>>>
>>>>>> Code freeze and branch cutting is coming.
>>>>>>
>>>>>> Therefore, we should finish any remaining work for
>>>>>> Spark 3.1, and
>>>>>> switch to QA mode soon.
>>>>>> I think it’s time to set to keep it on track, and I would like to
>>>>>> volunteer to help drive this process.
>>>>>>
>>>>>> I am currently thinking 4th Dec as the branch-cut date.
>>>>>>
>>>>>> Any thoughts?
>>>>>>
>>>>>> Thanks all.
>>>>>>
>>>>>>


Re: Spark 3.1 branch cut 4th Dec?

2020-11-20 Thread Xiao Li
Hi, Dongjoon,

Thank you for your feedback. I think *Early December* does not mean we will
cut the branch on Dec 1st. I do not think the difference between Dec 1st and
Dec 4th is a big deal. Normally, it would be nice to give enough buffer.
Based on my understanding, this email is just a *proposal* and a *reminder*.
In the past, we often got mixed feedback.

Anyway, we are collecting feedback from the whole community. Input from
everyone else is welcome.

Thanks,

Xiao

On Fri, Nov 20, 2020 at 8:33 AM Dongjoon Hyun  wrote:

> Hi, Xiao.
>
> I agree.
>
> > Merging the feature work after the branch cut should not be
> encouraged in general, although some committers did make some exceptions
> based on their own judgement. We should try to avoid merging the feature
> work after the branch cut.
>
> So, the Apache Spark community already accepted your request for a delay.
> (Early November to Early December)
>
> -
> https://github.com/apache/spark-website/commit/0cd0bdc80503882b4737db7e77cc8f9d17ec12ca
>
> I don't think the branch cut should be delayed again. We don't need to
> wait two weeks after Hyukjin's email.
>
> Given the delay, I'd strongly recommend cutting the branch on 1st December.
>
> I'll create a `branch-3.1` on 1st December if Hyukjin is too busy to
> start stabilizing it.
>
> Again, it will not block you if you have an exceptional request.
>
> However, it would be helpful for all of us if you make it clear what
> features you are waiting for now.
>
> We are creating Apache Spark together.
>
> Bests,
> Dongjoon.
>
>
> On Thu, Nov 19, 2020 at 11:38 PM Xiao Li  wrote:
>
>> Correction:
>>
>> Merging the feature work after the branch cut should not be encouraged in
>> general, although some committers did make some exceptions based on their
>> own judgement. We should try to avoid merging the feature work after the
>> branch cut.
>>
>> This email is a good reminder message. At least, we have two weeks
>> ahead of the proposed branch cut date. I hope each feature owner might
>> hurry up and try to finish it before the branch cut.
>>
>> Xiao
>>
>> On Thu, Nov 19, 2020 at 11:36 PM Xiao Li  wrote:
>>
>>> We should try to merge the feature work after the branch cut. This
>>> should not be encouraged in general, although some committers did make some
>>> exceptions based on their own judgement.
>>>
>>> This email is a good reminder message. At least, we have two weeks
>>> ahead of the proposed branch cut date. I hope each feature owner might
>>> hurry up and try to finish it before the branch cut.
>>>
>>> Xiao
>>>
>>> On Thu, Nov 19, 2020 at 4:02 PM Dongjoon Hyun  wrote:
>>>
>>>> Thank you for your volunteering!
>>>>
>>>> Since the previous branch cuts were always a soft code freeze, which
>>>> allowed committers to keep merging to the new branches for a while, I
>>>> believe 1st December will be better for stabilization.
>>>>
>>>> Bests,
>>>> Dongjoon.
>>>>
>>>>
>>>> On Thu, Nov 19, 2020 at 3:50 PM Hyukjin Kwon 
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I think we haven’t yet decided the exact branch-cut and code freeze
>>>>> dates, or the release manager.
>>>>>
>>>>> As we planned in https://spark.apache.org/versioning-policy.html
>>>>>
>>>>> Early Dec 2020 Code freeze. Release branch cut
>>>>>
>>>>> Code freeze and branch cutting are coming.
>>>>>
>>>>> Therefore, we should finish any remaining work for Spark 3.1 and
>>>>> switch to QA mode soon.
>>>>> I think it’s time to set a date to keep things on track, and I would
>>>>> like to volunteer to help drive this process.
>>>>>
>>>>> I am currently thinking 4th Dec as the branch-cut date.
>>>>>
>>>>> Any thoughts?
>>>>>
>>>>> Thanks all.
>>>>>
>>>>>


Re: Spark 3.1 branch cut 4th Dec?

2020-11-19 Thread Xiao Li
Correction:

Merging the feature work after the branch cut should not be encouraged in
general, although some committers did make some exceptions based on their
own judgement. We should try to avoid merging the feature work after the
branch cut.

This email is a good reminder message. At least, we have two weeks ahead of
the proposed branch cut date. I hope each feature owner might hurry up and
try to finish it before the branch cut.

Xiao

On Thu, Nov 19, 2020 at 11:36 PM Xiao Li  wrote:

> We should try to merge the feature work after the branch cut. This should
> not be encouraged in general, although some committers did make some
> exceptions based on their own judgement.
>
> This email is a good reminder message. At least, we have two weeks
> ahead of the proposed branch cut date. I hope each feature owner might
> hurry up and try to finish it before the branch cut.
>
> Xiao
>
> On Thu, Nov 19, 2020 at 4:02 PM Dongjoon Hyun  wrote:
>
>> Thank you for your volunteering!
>>
>> Since the previous branch cuts were always a soft code freeze, which
>> allowed committers to keep merging to the new branches for a while, I
>> believe 1st December will be better for stabilization.
>>
>> Bests,
>> Dongjoon.
>>
>>
>> On Thu, Nov 19, 2020 at 3:50 PM Hyukjin Kwon  wrote:
>>
>>> Hi all,
>>>
>>> I think we haven’t yet decided the exact branch-cut and code freeze
>>> dates, or the release manager.
>>>
>>> As we planned in https://spark.apache.org/versioning-policy.html
>>>
>>> Early Dec 2020 Code freeze. Release branch cut
>>>
>>> Code freeze and branch cutting are coming.
>>>
>>> Therefore, we should finish any remaining work for Spark 3.1 and
>>> switch to QA mode soon.
>>> I think it’s time to set a date to keep things on track, and I would
>>> like to volunteer to help drive this process.
>>>
>>> I am currently thinking 4th Dec as the branch-cut date.
>>>
>>> Any thoughts?
>>>
>>> Thanks all.
>>>
>>>


Re: Spark 3.1 branch cut 4th Dec?

2020-11-19 Thread Xiao Li
We should try to merge the feature work after the branch cut. This should
not be encouraged in general, although some committers did make some
exceptions based on their own judgement.

This email is a good reminder message. At least, we have two weeks ahead of
the proposed branch cut date. I hope each feature owner might hurry up and
try to finish it before the branch cut.

Xiao

On Thu, Nov 19, 2020 at 4:02 PM Dongjoon Hyun  wrote:

> Thank you for your volunteering!
>
> Since the previous branch cuts were always a soft code freeze, which
> allowed committers to keep merging to the new branches for a while, I
> believe 1st December will be better for stabilization.
>
> Bests,
> Dongjoon.
>
>
> On Thu, Nov 19, 2020 at 3:50 PM Hyukjin Kwon  wrote:
>
>> Hi all,
>>
>> I think we haven’t yet decided the exact branch-cut and code freeze
>> dates, or the release manager.
>>
>> As we planned in https://spark.apache.org/versioning-policy.html
>>
>> Early Dec 2020 Code freeze. Release branch cut
>>
>> Code freeze and branch cutting are coming.
>>
>> Therefore, we should finish any remaining work for Spark 3.1 and
>> switch to QA mode soon.
>> I think it’s time to set a date to keep things on track, and I would
>> like to volunteer to help drive this process.
>>
>> I am currently thinking 4th Dec as the branch-cut date.
>>
>> Any thoughts?
>>
>> Thanks all.
>>
>>


Re: [VOTE] Standardize Spark Exception Messages SPIP

2020-11-06 Thread Xiao Li
+1

On Fri, Nov 6, 2020 at 6:23 AM Gengliang Wang  wrote:

> +1
>
> On Nov 6, 2020, at 1:52 PM, Wenchen Fan  wrote:
>
> +1
>
> On Fri, Nov 6, 2020 at 12:56 PM kalyan  wrote:
>
>> +1
>>
>> On Fri, Nov 6, 2020, 5:58 AM Matei Zaharia 
>> wrote:
>>
>>> +1
>>>
>>> Matei
>>>
>>> > On Nov 5, 2020, at 10:25 AM, EveLiao  wrote:
>>> >
>>> > +1
>>> > Thanks!
>>> >
>>> >
>>> >
>>> > --
>>> > Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>>> >
>>> > -
>>> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>> >
>>>
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
>

--


Re: I'm going to be out starting Nov 5th

2020-11-01 Thread Xiao Li
Take care, Holden!

Bests,

Xiao

On Sat, Oct 31, 2020 at 9:53 PM 郑瑞峰  wrote:

> Take care, Holden! Best wishes!
>
>
> -- Original Message --
> *From:* "Hyukjin Kwon" ;
> *Sent:* Sunday, November 1, 2020, 10:24 AM
> *To:* "Denny Lee";
> *Cc:* "Dongjoon Hyun";"Holden Karau"<
> hol...@pigscanfly.ca>;"dev";
> *Subject:* Re: I'm going to be out starting Nov 5th
>
> Oh, take care Holden!
>
> On Sun, 1 Nov 2020, 03:04 Denny Lee,  wrote:
>
>> Best wishes Holden! :)
>>
>> On Sat, Oct 31, 2020 at 11:00 Dongjoon Hyun 
>> wrote:
>>
>>> Take care, Holden! I believe everything goes well.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>> On Sat, Oct 31, 2020 at 10:24 AM Reynold Xin 
>>> wrote:
>>>
 Take care Holden and best of luck with everything!


 On Sat, Oct 31 2020 at 10:21 AM, Holden Karau 
 wrote:

> Hi Folks,
>
> Just a heads up so folks working on decommissioning or other areas
> I've been active in don't block on me, I'm going to be out for at least a
> week and possibly more starting on November 5th. If there is anything that
> folks want me to review before then please let me know and I'll make the
> time for it. If you are curious I've got more details at
> http://blog.holdenkarau.com/2020/10/taking-break-surgery.html
>
> Happy Sparking Everyone,
>
> Holden :)
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


--


Re: [DISCUSS][SPIP] Standardize Spark Exception Messages

2020-10-29 Thread Xiao Li
+1

This is a great proposal to improve the usability of Spark. Make Spark
simple to use!

Xiao

On Tue, Oct 27, 2020 at 8:25 PM Xinyi Yu  wrote:

> Hi Chang,
>
> It is a script that directly analyzes the source code, searching for raw
> "throw new" exceptions. :) Hope that gives an intuitive overview of the
> current exceptions in Spark.
> On Oct 27, 2020, 8:21 PM -0700, Chang Chen , wrote:
>
> hi Xinyi
>
> Just curious, which tool did you use to generate this
> 
>
>
> On Mon, Oct 26, 2020 at 8:05 AM Xinyi Yu  wrote:
>
>> Hi all,
>>
>> We would like to post a SPIP on standardizing exception messages in Spark.
>> Here is the document link:
>>
>> https://docs.google.com/document/d/1XGj1o3xAFh8BA7RCn3DtwIPC6--hIFOaNUNSlpaOIZs/edit?usp=sharing
>> <
>> https://docs.google.com/document/d/1XGj1o3xAFh8BA7RCn3DtwIPC6--hIFOaNUNSlpaOIZs/edit?usp=sharing
>> >
>>
>> This SPIP aims to standardize the exception messages in Spark. It has
>> three
>> major focuses:
>> 1. Group exception messages in dedicated files for easy maintenance and
>> auditing.
>> 2. Establish an error message guideline for developers.
>> 3. Improve error message quality.
>>
>> Thanks for your time and patience. Looking forward to your feedback!
>>
>>
>>
>> --
>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: [build system] jenkins wedged again

2020-10-14 Thread Xiao Li
Thank you, Shane!

Xiao

On Wed, Oct 14, 2020 at 12:00 PM shane knapp ☠  wrote:

> we're mostly back up, and just waiting for a couple of ubuntu boxes to
> finish booting...  prb seem to be building now!
>
> On Wed, Oct 14, 2020 at 11:48 AM shane knapp ☠ 
> wrote:
>
>> i'm going to reboot the primary and worker nodes, so it'll be a few
>> minutes before everything is back up.
>>
>> shane
>> --
>> Shane Knapp
>> Computer Guy / Voice of Reason
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>
>
> --
> Shane Knapp
> Computer Guy / Voice of Reason
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


--


Re: [UPDATE] Apache Spark 3.1.0 Release Window

2020-10-12 Thread Xiao Li
Thank you, Dongjoon

Xiao

On Mon, Oct 12, 2020 at 4:19 PM Dongjoon Hyun 
wrote:

> Hi, All.
>
> Apache Spark 3.1.0 Release Window is adjusted like the following today.
> Please check the latest information on the official website.
>
> -
> https://github.com/apache/spark-website/commit/0cd0bdc80503882b4737db7e77cc8f9d17ec12ca
> - https://spark.apache.org/versioning-policy.html
>
> Bests,
> Dongjoon.
>


--


Re: Apache Spark 3.1 Preparation Status (Oct. 2020)

2020-10-04 Thread Xiao Li
As pointed out by Dongjoon, the 2nd half of December is the holiday season
in most countries. If we do the code freeze in mid-November and release the
first RC in mid-December, I am afraid the community will not be active
enough to verify the release candidates during the holiday season.
Normally, the RC stage is the most critical period for detecting defects
and unexpected behavior changes. Thus, starting the RC next January might
be a good option, IMHO.

Cheers,

Xiao


On Sun, Oct 4, 2020 at 10:35 PM Igor Dvorzhak  wrote:

> Why move the code freeze to early December? It seems that even according
> to the changed release cadence, the code freeze should happen in
> mid-November.
>
> On Sun, Oct 4, 2020 at 6:26 PM Xiao Li  wrote:
>
>> Apache Spark 3.1.0 should be compared with Apache Spark 2.1.0.
>>
>>
>> I think we made a change in release cadence since Spark 2.3. See the
>> commit:
>> https://github.com/apache/spark-website/commit/88990968962e5cc47db8bc2c11a50742d2438daa
>> Thus, Spark 3.1 might just follow the release cadence of Spark 2.3/2.4, if
>> we do not want to change the release cadence?
>>
>> How about moving the code freeze of Spark 3.1 to *Early Dec 2020* and
>> the RC1 date to* Early Jan 2021*?
>>
>> Thanks,
>>
>> Xiao
>>
>>
>> On Sun, Oct 4, 2020 at 12:44 PM Dongjoon Hyun  wrote:
>>
>>> For Xiao's comment, I want to point out that Apache Spark 3.1.0 is
>>> different from 2.3 or 2.4.
>>>
>>> Apache Spark 3.1.0 should be compared with Apache Spark 2.1.0.
>>>
>>> - Apache Spark 2.0.0 was released on July 26, 2016.
>>> - Apache Spark 2.1.0 was released on December 28, 2016.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>> On Sun, Oct 4, 2020 at 10:53 AM Dongjoon Hyun 
>>> wrote:
>>>
>>>> Thank you all.
>>>>
>>>> BTW, Xiao and Mridul, I'm wondering what date you have in your mind
>>>> specifically.
>>>>
>>>> Usually, `Christmas and New Year season` doesn't give us much
>>>> additional time.
>>>>
>>>> If you think so, could you make a PR for Apache Spark website
>>>> according to your expectation?
>>>>
>>>> https://spark.apache.org/versioning-policy.html
>>>>
>>>> Bests,
>>>> Dongjoon.
>>>>
>>>>
>>>> On Sun, Oct 4, 2020 at 7:18 AM Mridul Muralidharan 
>>>> wrote:
>>>>
>>>>>
>>>>> +1 on pushing the branch cut for increased dev time to match previous
>>>>> releases.
>>>>>
>>>>> Regards,
>>>>> Mridul
>>>>>
>>>>> On Sat, Oct 3, 2020 at 10:22 PM Xiao Li  wrote:
>>>>>
>>>>>> Thank you for your updates.
>>>>>>
>>>>>> Spark 3.0 got released on Jun 18, 2020. If Nov 1st is the target date
>>>>>> of the 3.1 branch cut, the feature development time window is less than 5
>>>>>> months. This is shorter than what we did in Spark 2.3 and 2.4 releases.
>>>>>>
>>>>>> Below are three highly desirable features I am watching.
>>>>>> Hopefully, we can finish them before the branch cut.
>>>>>>
>>>>>>- Support push-based shuffle to improve shuffle efficiency:
>>>>>>https://issues.apache.org/jira/browse/SPARK-30602
>>>>>>- Unify create table syntax:
>>>>>>https://issues.apache.org/jira/browse/SPARK-31257
>>>>>>- Bloom filter join:
>>>>>>https://issues.apache.org/jira/browse/SPARK-32268
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Xiao
>>>>>>
>>>>>>
>>>>>> On Sat, Oct 3, 2020 at 5:41 PM Hyukjin Kwon  wrote:
>>>>>>
>>>>>>> Nice summary. Thanks Dongjoon. One minor correction -> I believe we
>>>>>>> dropped R 3.5 and below at branch 2.4 as well.
>>>>>>>
>>>>>>> On Sun, 4 Oct 2020, 09:17 Dongjoon Hyun, 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi, All.
>>>>>>>>
>>>>>>>> As of today, master branch (Apache Spark 3.1.0) resolved
>>>>>>>> 852+ JIRA issues and 606+ issues are 3.1.0-only patches.
>>>>>>>> According to the 3.1.0 release window, branch-3.1 will be
>>>>>>>> created on November 1st and enters QA period.

Re: Apache Spark 3.1 Preparation Status (Oct. 2020)

2020-10-04 Thread Xiao Li
>
> Apache Spark 3.1.0 should be compared with Apache Spark 2.1.0.


I think we made a change in release cadence since Spark 2.3. See the
commit:
https://github.com/apache/spark-website/commit/88990968962e5cc47db8bc2c11a50742d2438daa
Thus, Spark 3.1 might just follow the release cadence of Spark 2.3/2.4, if
we do not want to change the release cadence?

How about moving the code freeze of Spark 3.1 to *Early Dec 2020* and the
RC1 date to* Early Jan 2021*?

Thanks,

Xiao


On Sun, Oct 4, 2020 at 12:44 PM Dongjoon Hyun  wrote:

> For Xiao's comment, I want to point out that Apache Spark 3.1.0 is
> different from 2.3 or 2.4.
>
> Apache Spark 3.1.0 should be compared with Apache Spark 2.1.0.
>
> - Apache Spark 2.0.0 was released on July 26, 2016.
> - Apache Spark 2.1.0 was released on December 28, 2016.
>
> Bests,
> Dongjoon.
>
>
> On Sun, Oct 4, 2020 at 10:53 AM Dongjoon Hyun 
> wrote:
>
>> Thank you all.
>>
>> BTW, Xiao and Mridul, I'm wondering what date you have in your mind
>> specifically.
>>
>> Usually, `Christmas and New Year season` doesn't give us much additional
>> time.
>>
>> If you think so, could you make a PR for Apache Spark website according
>> to your expectation?
>>
>> https://spark.apache.org/versioning-policy.html
>>
>> Bests,
>> Dongjoon.
>>
>>
>> On Sun, Oct 4, 2020 at 7:18 AM Mridul Muralidharan 
>> wrote:
>>
>>>
>>> +1 on pushing the branch cut for increased dev time to match previous
>>> releases.
>>>
>>> Regards,
>>> Mridul
>>>
>>> On Sat, Oct 3, 2020 at 10:22 PM Xiao Li  wrote:
>>>
>>>> Thank you for your updates.
>>>>
>>>> Spark 3.0 got released on Jun 18, 2020. If Nov 1st is the target date
>>>> of the 3.1 branch cut, the feature development time window is less than 5
>>>> months. This is shorter than what we did in Spark 2.3 and 2.4 releases.
>>>>
>>>> Below are three highly desirable features I am watching. Hopefully,
>>>> we can finish them before the branch cut.
>>>>
>>>>- Support push-based shuffle to improve shuffle efficiency:
>>>>https://issues.apache.org/jira/browse/SPARK-30602
>>>>- Unify create table syntax:
>>>>https://issues.apache.org/jira/browse/SPARK-31257
>>>>- Bloom filter join:
>>>>https://issues.apache.org/jira/browse/SPARK-32268
>>>>
>>>> Thanks,
>>>>
>>>> Xiao
>>>>
>>>>
>>>> On Sat, Oct 3, 2020 at 5:41 PM Hyukjin Kwon  wrote:
>>>>
>>>>> Nice summary. Thanks Dongjoon. One minor correction -> I believe we
>>>>> dropped R 3.5 and below at branch 2.4 as well.
>>>>>
>>>>> On Sun, 4 Oct 2020, 09:17 Dongjoon Hyun, 
>>>>> wrote:
>>>>>
>>>>>> Hi, All.
>>>>>>
>>>>>> As of today, master branch (Apache Spark 3.1.0) resolved
>>>>>> 852+ JIRA issues and 606+ issues are 3.1.0-only patches.
>>>>>> According to the 3.1.0 release window, branch-3.1 will be
>>>>>> created on November 1st and enters QA period.
>>>>>>
>>>>>> Here are some notable updates I've been monitoring.
>>>>>>
>>>>>> *Language*
>>>>>> 01. SPARK-25075 Support Scala 2.13
>>>>>>   - Since SPARK-32926, Scala 2.13 build test has
>>>>>> become a part of GitHub Action jobs.
>>>>>>   - After SPARK-33044, Scala 2.13 test will be
>>>>>> a part of Jenkins jobs.
>>>>>> 02. SPARK-29909 Drop Python 2 and Python 3.4 and 3.5
>>>>>> 03. SPARK-32082 Project Zen: Improving Python usability
>>>>>>   - 7 of 16 issues are resolved.
>>>>>> 04. SPARK-32073 Drop R < 3.5 support
>>>>>>   - This is done for Spark 3.0.1 and 3.1.0.
>>>>>>
>>>>>> *Dependency*
>>>>>> 05. SPARK-32058 Use Apache Hadoop 3.2.0 dependency
>>>>>>   - This changes the default dist. for better cloud support
>>>>>> 06. SPARK-32981 Remove hive-1.2 distribution
>>>>>> 07. SPARK-20202 Remove references to org.spark-project.hive
>>>>>>   - This will remove Hive 1.2.1 from source code
>>>>>> 08. SPARK-29250 Upgrade to Hadoop 3.2.1 (WIP)
>>>>>>
>>>>>> *Core*
>>>>>>

Re: Apache Spark 3.1 Preparation Status (Oct. 2020)

2020-10-03 Thread Xiao Li
Thank you for your updates.

Spark 3.0 got released on Jun 18, 2020. If Nov 1st is the target date of
the 3.1 branch cut, the feature development time window is less than 5
months. This is shorter than what we did in Spark 2.3 and 2.4 releases.

Below are three highly desirable features I am watching. Hopefully, we
can finish them before the branch cut.

   - Support push-based shuffle to improve shuffle efficiency:
   https://issues.apache.org/jira/browse/SPARK-30602
   - Unify create table syntax:
   https://issues.apache.org/jira/browse/SPARK-31257
   - Bloom filter join: https://issues.apache.org/jira/browse/SPARK-32268

Thanks,

Xiao


On Sat, Oct 3, 2020 at 5:41 PM Hyukjin Kwon  wrote:

> Nice summary. Thanks Dongjoon. One minor correction -> I believe we
> dropped R 3.5 and below at branch 2.4 as well.
>
> On Sun, 4 Oct 2020, 09:17 Dongjoon Hyun,  wrote:
>
>> Hi, All.
>>
>> As of today, master branch (Apache Spark 3.1.0) resolved
>> 852+ JIRA issues and 606+ issues are 3.1.0-only patches.
>> According to the 3.1.0 release window, branch-3.1 will be
>> created on November 1st and enters QA period.
>>
>> Here are some notable updates I've been monitoring.
>>
>> *Language*
>> 01. SPARK-25075 Support Scala 2.13
>>   - Since SPARK-32926, Scala 2.13 build test has
>> become a part of GitHub Action jobs.
>>   - After SPARK-33044, Scala 2.13 test will be
>> a part of Jenkins jobs.
>> 02. SPARK-29909 Drop Python 2 and Python 3.4 and 3.5
>> 03. SPARK-32082 Project Zen: Improving Python usability
>>   - 7 of 16 issues are resolved.
>> 04. SPARK-32073 Drop R < 3.5 support
>>   - This is done for Spark 3.0.1 and 3.1.0.
>>
>> *Dependency*
>> 05. SPARK-32058 Use Apache Hadoop 3.2.0 dependency
>>   - This changes the default dist. for better cloud support
>> 06. SPARK-32981 Remove hive-1.2 distribution
>> 07. SPARK-20202 Remove references to org.spark-project.hive
>>   - This will remove Hive 1.2.1 from source code
>> 08. SPARK-29250 Upgrade to Hadoop 3.2.1 (WIP)
>>
>> *Core*
>> 09. SPARK-27495 Support Stage level resource conf and scheduling
>>   - 11 of 15 issues are resolved
>> 10. SPARK-25299 Use remote storage for persisting shuffle data
>>   - 8 of 14 issues are resolved
>>
>> *Resource Manager*
>> 11. SPARK-33005 Kubernetes GA preparation
>>   - It is on the way and we are waiting for more feedback.
>>
>> *SQL*
>> 12. SPARK-30648/SPARK-32346 Support filters pushdown
>>   to JSON/Avro
>> 13. SPARK-32948/SPARK-32958 Add Json expression optimizer
>> 14. SPARK-12312 Support JDBC Kerberos w/ keytab
>>   - 11 of 17 issues are resolved
>> 15. SPARK-27589 DSv2 was mostly completed in 3.0
>>   and added more features in 3.1 but still we missed
>>   - All built-in DataSource v2 write paths are disabled
>> and v1 write is used instead.
>>   - Support partition pruning with subqueries
>>   - Support bucketing
>>
>> We still have one month before the feature freeze
>> and the start of QA. If you are working on 3.1,
>> please consider the timeline and share your schedule
>> with the Apache Spark community. For the other stuff,
>> we can put it into 3.2 release scheduled in June 2021.
>>
>> Last but not least, I want to emphasize (7) once again.
>> We need to remove the forked unofficial Hive eventually.
>> Please let us know your reasons if you need to build
>> from Apache Spark 3.1 source code for Hive 1.2.
>>
>> https://github.com/apache/spark/pull/29936
>>
>> As I wrote in the above PR description, for old releases,
>> Apache Spark 2.4(LTS) and 3.0 (~2021.12) will provide
>> Hive 1.2-based distribution.
>>
>> Bests,
>> Dongjoon.
>>
>


Re: [VOTE][SPARK-30602] SPIP: Support push-based shuffle to improve shuffle efficiency

2020-09-14 Thread Xiao Li
+1

Xiao

On Mon, Sep 14, 2020 at 4:09 PM DB Tsai  wrote:

> +1
>
> On Mon, Sep 14, 2020 at 12:30 PM Chandni Singh  wrote:
>
>> +1
>>
>> Chandni
>>
>> On Mon, Sep 14, 2020 at 11:41 AM Tom Graves 
>> wrote:
>>
>>> +1
>>>
>>> Tom
>>>
>>> On Sunday, September 13, 2020, 10:00:05 PM CDT, Mridul Muralidharan <
>>> mri...@gmail.com> wrote:
>>>
>>>
>>> Hi,
>>>
>>> I'd like to call for a vote on SPARK-30602 - SPIP: Support push-based
>>> shuffle to improve shuffle efficiency.
>>> Please take a look at:
>>>
>>>- SPIP jira: https://issues.apache.org/jira/browse/SPARK-30602
>>>- SPIP doc:
>>>
>>> https://docs.google.com/document/d/1mYzKVZllA5Flw8AtoX7JUcXBOnNIDADWRbJ7GI6Y71Q/edit
>>>- POC against master and results summary :
>>>
>>> https://docs.google.com/document/d/1Q5m7YAp0HyG_TNFL4p_bjQgzzw33ik5i49Vr86UNZgg/edit
>>>
>>> Active discussions on the jira and SPIP document have settled.
>>>
>>> I will leave the vote open until Friday (the 18th September 2020), 5pm
>>> CST.
>>>
>>> [ ] +1: Accept the proposal as an official SPIP
>>> [ ] +0
>>> [ ] -1: I don't think this is a good idea because ...
>>>
>>>
>>> Thanks,
>>> Mridul
>>>
>>
>
> --
> Sincerely,
>
> DB Tsai
> --
> Web: https://www.dbtsai.com
> PGP Key ID: 42E5B25A8F7A82C1
>


Re: [VOTE] Release Spark 3.0.1 (RC3)

2020-09-01 Thread Xiao Li
I want to change my vote to 0, because we are unable to produce an
end-user query that hits this bug.

Xiao

On Mon, Aug 31, 2020 at 12:41 PM Xiao Li  wrote:

> -1 due to a regression introduced by a fix in 3.0.1.
>
> See https://github.com/apache/spark/pull/29602
>
> Xiao
>
> On Mon, Aug 31, 2020 at 9:26 AM Tom Graves 
> wrote:
>
>> +1
>>
>> Tom
>>
>> On Friday, August 28, 2020, 09:02:31 AM CDT, 郑瑞峰 
>> wrote:
>>
>>
>> Please vote on releasing the following candidate as Apache Spark version
>> 3.0.1.
>>
>> The vote is open until Sep 2nd at 9AM PST and passes if a majority +1 PMC
>> votes are cast, with a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 3.0.1
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> There are currently no issues targeting 3.0.1 (try project = SPARK AND
>> "Target Version/s" = "3.0.1" AND status in (Open, Reopened, "In Progress"))
>>
>> The tag to be voted on is v3.0.1-rc3 (commit
>> dc04bf53fe821b7a07f817966c6c173f3b3788c6):
>> https://github.com/apache/spark/tree/v3.0.1-rc3
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.0.1-rc3-bin/
>>
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1357/
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.0.1-rc3-docs/
>>
>> The list of bug fixes going into 3.0.1 can be found at the following URL:
>> https://s.apache.org/q9g2d
>>
>> This release is using the release script of the tag v3.0.1-rc3.
>>
>> FAQ
>>
>>
>> =
>> How can I help test this release?
>> =
>>
>> If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> If you're working in PySpark you can set up a virtual env and install
>> the current RC and see if anything important breaks, in the Java/Scala
>> you can add the staging repository to your projects resolvers and test
>> with the RC (make sure to clean up the artifact cache before/after so
>> you don't end up building with an out of date RC going forward).
>>
>> ===
>> What should happen to JIRA tickets still targeting 3.0.1?
>> ===
>>
>> The current list of open tickets targeted at 3.0.1 can be found at:
>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>> Version/s" = 3.0.1
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should
>> be worked on immediately. Everything else please retarget to an
>> appropriate release.
>>
>> ==
>> But my bug isn't fixed?
>> ==
>>
>> In order to make timely releases, we will typically not hold the
>> release unless the bug in question is a regression from the previous
>> release. That being said, if there is something which is a regression
>> that has not been correctly targeted please ping me or a committer to
>> help target the issue.
>>
>>
>>
>
> --
>


-- 


Re: [VOTE] Release Spark 3.0.1 (RC3)

2020-08-31 Thread Xiao Li
-1 due to a regression introduced by a fix in 3.0.1.

See https://github.com/apache/spark/pull/29602

Xiao

On Mon, Aug 31, 2020 at 9:26 AM Tom Graves 
wrote:

> +1
>
> Tom
>
> On Friday, August 28, 2020, 09:02:31 AM CDT, 郑瑞峰 
> wrote:
>
>
> Please vote on releasing the following candidate as Apache Spark version
> 3.0.1.
>
> The vote is open until Sep 2nd at 9AM PST and passes if a majority +1 PMC
> votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.0.1
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> There are currently no issues targeting 3.0.1 (try project = SPARK AND
> "Target Version/s" = "3.0.1" AND status in (Open, Reopened, "In Progress"))
>
> The tag to be voted on is v3.0.1-rc3 (commit
> dc04bf53fe821b7a07f817966c6c173f3b3788c6):
> https://github.com/apache/spark/tree/v3.0.1-rc3
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.0.1-rc3-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1357/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.0.1-rc3-docs/
>
> The list of bug fixes going into 3.0.1 can be found at the following URL:
> https://s.apache.org/q9g2d
>
> This release is using the release script of the tag v3.0.1-rc3.
>
> FAQ
>
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your projects resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out of date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 3.0.1?
> ===
>
> The current list of open tickets targeted at 3.0.1 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.0.1
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>
>
>
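
As a side note for anyone following the testing instructions quoted above,
here is a minimal build.sbt sketch for resolving the RC artifacts from the
staging repository (the staging URL is the one from the vote email; the
module and Scala version chosen here are just an illustration, not a
prescribed setup):

    // build.sbt -- test a project against the Spark 3.0.1 RC3 staging artifacts.
    scalaVersion := "2.12.10"

    // Staging repository taken from the vote email above.
    resolvers += "Spark 3.0.1 RC3 staging" at
      "https://repository.apache.org/content/repositories/orgapachespark-1357/"

    // Any Spark module can be exercised this way; spark-sql is one example.
    libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.0.1"

Remember to clean up the artifact cache before and after testing, as the
vote email notes, so later builds do not pick up the RC by accident.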

-- 



Re: pip/conda distribution headless mode

2020-08-30 Thread Xiao Li
Hi, Georg,

This is being tracked by https://issues.apache.org/jira/browse/SPARK-32017.
You can leave comments in the JIRA.

Thanks,

Xiao

On Sun, Aug 30, 2020 at 3:06 PM Georg Heiler 
wrote:

> Hi,
>
> I want to use pyspark as distributed via conda in headless mode.
> It looks like the Hadoop binaries are bundled (= pip distributes a default
> version); see
> https://stackoverflow.com/questions/63661404/bootstrap-spark-itself-on-yarn
>
> I want to ask if it would be possible to A) distribute the headless
> version (= without Hadoop) instead, or B) distribute the headless version
> additionally, for the pip & conda-forge distribution channels.
>
> Best,
> Georg
>


-- 



Re: [VOTE] Release Spark 2.4.7 (RC1)

2020-08-17 Thread Xiao Li
The fix for https://issues.apache.org/jira/browse/SPARK-32609 got merged.
It fixes a correctness bug in DSv2 in Spark 2.4. Please include it in the
upcoming Spark 2.4.7 release.

Thanks,

Xiao

On Sun, Aug 9, 2020 at 10:26 PM Prashant Sharma 
wrote:

> Thanks for letting us know. So this vote is cancelled in favor of RC2.
>
>
>
> On Sun, Aug 9, 2020 at 8:31 AM Takeshi Yamamuro 
> wrote:
>
>> Thanks for letting us know about the two issues above, Dongjoon.
>>
>> 
>> I've checked the release materials (signatures, tag, ...) and it looks
>> fine, too.
>> Also, I ran the tests on my local Mac (Java 1.8.0) with the options
>> `-Pyarn -Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos -Pkubernetes
>> -Psparkr`
>> and they passed.
>>
>> Bests,
>> Takeshi
>>
>>
>>
>> On Sun, Aug 9, 2020 at 11:06 AM Dongjoon Hyun 
>> wrote:
>>
>>> Another instance is SPARK-31703, which was filed on May 13th and whose
>>> PR arrived two days ago.
>>>
>>> [SPARK-31703][SQL] Parquet RLE float/double are read incorrectly on
>>> big endian platforms
>>> https://github.com/apache/spark/pull/29383
>>>
>>> It seems that the patch is already ready in this case.
>>> I raised the priority of SPARK-31703 to `Blocker` for both Apache Spark
>>> 2.4.7 and 3.0.1.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>> On Sat, Aug 8, 2020 at 6:10 AM Holden Karau 
>>> wrote:
>>>
 I'm going to go ahead and vote -0 based on that, then.

 On Fri, Aug 7, 2020 at 11:36 PM Dongjoon Hyun 
 wrote:

> Hi, All.
>
> Unfortunately, there is an ongoing discussion about the new decimal
> correctness issue.
>
> Although we fixed one correctness issue on master and backported it
> partially to 3.0/2.4, it turns out that it needs more patches to be
> complete.
>
> Please see https://github.com/apache/spark/pull/29125 for on-going
> discussion for both 3.0/2.4.
>
> [SPARK-32018][SQL][3.0] UnsafeRow.setDecimal should set null with
> overflowed value
>
> I also confirmed that 2.4.7 RC1 is affected.
>
> Bests,
> Dongjoon.
>
>
> On Thu, Aug 6, 2020 at 2:48 PM Sean Owen  wrote:
>
>> +1 from me. The same as usual. Licenses and sigs look OK, builds and
>> passes tests on a standard selection of profiles.
>>
>> On Thu, Aug 6, 2020 at 7:07 AM Prashant Sharma 
>> wrote:
>> >
>> > Please vote on releasing the following candidate as Apache Spark
>> version 2.4.7.
>> >
>> > The vote is open until Aug 9th at 9AM PST and passes if a majority
>> +1 PMC votes are cast, with a minimum of 3 +1 votes.
>> >
>> > [ ] +1 Release this package as Apache Spark 2.4.7
>> > [ ] -1 Do not release this package because ...
>> >
>> > To learn more about Apache Spark, please see
>> http://spark.apache.org/
>> >
>> > There are currently no issues targeting 2.4.7 (try project = SPARK
>> AND "Target Version/s" = "2.4.7" AND status in (Open, Reopened, "In
>> Progress"))
>> >
>> > The tag to be voted on is v2.4.7-rc1 (commit
>> dc04bf53fe821b7a07f817966c6c173f3b3788c6):
>> > https://github.com/apache/spark/tree/v2.4.7-rc1
>> >
>> > The release files, including signatures, digests, etc. can be found
>> at:
>> > https://dist.apache.org/repos/dist/dev/spark/v2.4.7-rc1-bin/
>> >
>> > Signatures used for Spark RCs can be found in this file:
>> > https://dist.apache.org/repos/dist/dev/spark/KEYS
>> >
>> > The staging repository for this release can be found at:
>> >
>> https://repository.apache.org/content/repositories/orgapachespark-1352/
>> >
>> > The documentation corresponding to this release can be found at:
>> > https://dist.apache.org/repos/dist/dev/spark/v2.4.7-rc1-docs/
>> >
>> > The list of bug fixes going into 2.4.7 can be found at the
>> following URL:
>> > https://s.apache.org/spark-v2.4.7-rc1
>> >
>> > This release is using the release script of the tag v2.4.7-rc1.
>> >
>> > FAQ
>> >
>> >
>> > =
>> > How can I help test this release?
>> > =
>> >
>> > If you are a Spark user, you can help us test this release by taking
>> > an existing Spark workload and running on this release candidate,
>> then
>> > reporting any regressions.
>> >
>> > If you're working in PySpark you can set up a virtual env and
>> install
>> > the current RC and see if anything important breaks, in the
>> Java/Scala
>> > you can add the staging repository to your projects resolvers and
>> test
>> > with the RC (make sure to clean up the artifact cache before/after
>> so
>> > you don't end up building with an out of date RC going forward).
>> >
>> > ===
>> > What should happen to JIRA tickets still targeting 2.4.7?
>> > ===

Re: [SparkSql] Casting of Predicate Literals

2020-08-04 Thread Xiao Li
Hi, Russell,

You might hit other cases in which a CAST blocks predicate pushdown.
If the Cast was added by users and it changes the actual type, we are
unable to optimize it automatically, because doing so could change query
correctness. If it was added by our type coercion rules to make types
consistent at query compile time, we can take a look at the specific
rule. If you think any of the rules is unreasonable or behaves differently
from other database systems, we can discuss it in the PRs or JIRAs. In
general, we have to be very cautious about making any change in these
rules, since it could have a big impact and change query results silently.
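
To make this concrete, below is a minimal, self-contained Scala sketch
(the table name, column name, and literal are made up for illustration;
this is not code from the report). Only the second query keeps a bare
date_col in the filter, which is the shape a data source can translate
into a pushed-down predicate:

    import org.apache.spark.sql.SparkSession

    object CastPushdownSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .master("local[*]")
          .appName("cast-pushdown-sketch")
          .getOrCreate()

        // A tiny parquet-backed table with a DATE column.
        spark.sql("CREATE TABLE source (date_col DATE) USING parquet")

        // Casting the column to match the string literal hides date_col
        // behind a Cast, so source strategies cannot push the filter down.
        spark.sql(
          "SELECT * FROM source WHERE CAST(date_col AS STRING) > '2020-08-03'"
        ).explain()

        // Casting the literal instead leaves a plain comparison on date_col,
        // which can be pushed into the data source.
        spark.sql(
          "SELECT * FROM source WHERE date_col > CAST('2020-08-03' AS DATE)"
        ).explain()

        spark.stop()
      }
    }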

Thanks,

On Tue, Aug 4, 2020 at 9:46 AM Wenchen Fan  wrote:

> I think this is not a problem in 3.0 anymore, see
> https://issues.apache.org/jira/browse/SPARK-27638
>
> On Wed, Aug 5, 2020 at 12:08 AM Russell Spitzer 
> wrote:
>
>> I've just run into this issue again with another user and I feel like
>> most folks here have seen some flavor of this at some point.
>>
>> The user registers a DataSource with a column of type Date (or some other
>> non-string type), then performs a query that looks like:
>>
>> *SELECT * from Source WHERE date_col > '2020-08-03'*
>>
>> Seeing that the predicate literal here is a String, Spark needs to make a
>> change so that both sides of the comparison have the same type, so it
>> places a "Cast" on the DataSource column, and our plan ends up looking
>> like:
>>
>> Cast(date_col as String) > '2020-08-03'
>>
>> Since the DataSource strategies can't handle a pushdown of the "Cast"
>> function, we lose the predicate pushdown we could
>> have had. This can change a job from a single-partition lookup into a
>> full scan, leading to a very confusing situation for
>> the end user. I also wonder about the relative cost here, since we could
>> avoid doing X casts and instead do a single
>> one on the predicate; in addition, we could do the cast at the
>> Analysis phase and cut the run short before any work even
>> starts, rather than doing a perhaps meaningless comparison between a date
>> and a non-date string.
>>
>> I think we should seriously consider whether in cases like this we should
>> attempt to cast the literal rather than casting the
>> source column.
>>
>> Please let me know if anyone has thoughts on this, or has some previous
>> Jiras I could dig into if it's been discussed before,
>> Russ
>>
>

-- 



Re: [VOTE] Update the committer guidelines to clarify when to commit changes.

2020-07-31 Thread Xiao Li
+1

Xiao

On Fri, Jul 31, 2020 at 9:32 AM Mridul Muralidharan 
wrote:

>
> +1
>
> Thanks,
> Mridul
>
> On Thu, Jul 30, 2020 at 4:49 PM Holden Karau  wrote:
>
>> Hi Spark Developers,
>>
>> After the discussion of the proposal to amend Spark committer guidelines,
>> it appears folks are generally in agreement on policy clarifications. (See
>> https://lists.apache.org/thread.html/r6706e977fda2c474a7f24775c933c2f46ea19afbfafb03c90f6972ba%40%3Cdev.spark.apache.org%3E,
>> as well as some on the private@ list for PMC.) Therefore, I am calling
>> for a majority VOTE, which will last at least 72 hours. See the ASF voting
>> rules for procedural changes at
>> https://www.apache.org/foundation/voting.html.
>>
>> The proposal is to add a new section entitled “When to Commit” to the
>> Spark committer guidelines, currently at
>> https://spark.apache.org/committers.html.
>>
>> ** START OF CHANGE **
>>
>> PRs shall not be merged during active, on-topic discussion unless they
>> address issues such as critical security fixes of a public vulnerability.
>> Under extenuating circumstances, PRs may be merged during active, off-topic
>> discussion and the discussion directed to a more appropriate venue. Time
>> should be given prior to merging for those involved with the conversation
>> to explain if they believe they are on-topic.
>>
>> Lazy consensus requires giving time for discussion to settle while
>> understanding that people may not be working on Spark as their full-time
>> job and may take holidays. It is believed that by doing this, we can limit
>> how often people feel the need to exercise their veto.
>>
>> All -1s with justification merit discussion.  A -1 from a non-committer
>> can be overridden only with input from multiple committers, and suitable
>> time must be offered for any committer to raise concerns. A -1 from a
>> committer who cannot be reached requires a consensus vote of the PMC under
>> ASF voting rules to determine the next steps within the ASF guidelines for
>> code vetoes ( https://www.apache.org/foundation/voting.html ).
>>
>> These policies serve to reiterate the core principle that code must not
>> be merged with a pending veto or before a consensus has been reached (lazy
>> or otherwise).
>>
>> It is the PMC’s hope that vetoes continue to be infrequent, and when they
>> occur, that all parties will take the time to build consensus prior to
>> additional feature work.
>>
>> Being a committer means exercising your judgement while working in a
>> community of people with diverse views. There is nothing wrong in getting a
>> second (or third or fourth) opinion when you are uncertain. Thank you for
>> your dedication to the Spark project; it is appreciated by the developers
>> and users of Spark.
>>
>> It is hoped that these guidelines do not slow down development; rather,
>> by removing some of the uncertainty, the goal is to make it easier for us
>> to reach consensus. If you have ideas on how to improve these guidelines or
>> other Spark project operating procedures, you should reach out on the dev@
>> list to start the discussion.
>>
>> ** END OF CHANGE TEXT **
>>
>> I want to thank everyone who has been involved with the discussion
>> leading to this proposal and those of you who take the time to vote on
>> this. I look forward to our continued collaboration in building Apache
>> Spark.
>>
>> I believe we share the goal of creating a welcoming community around the
>> project. On a personal note, it is my belief that consistently applying
>> this policy around commits can help to make a more accessible and welcoming
>> community.
>>
>> Kind Regards,
>>
>> Holden
>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>

-- 



Re: Welcoming some new Apache Spark committers

2020-07-14 Thread Xiao Li
Welcome, Dilip, Huaxin and Jungtaek!

Xiao

On Tue, Jul 14, 2020 at 11:02 AM Holden Karau  wrote:

> So excited to have our committer pool growing with these awesome folks,
> welcome y'all!
>
> On Tue, Jul 14, 2020 at 10:59 AM Driesprong, Fokko 
> wrote:
>
>> Welcome!
>>
>> On Tue, Jul 14, 2020 at 7:53 PM shane knapp ☠  wrote:
>>
>>> welcome, all!
>>>
>>> On Tue, Jul 14, 2020 at 10:37 AM Matei Zaharia 
>>> wrote:
>>>
 Hi all,

 The Spark PMC recently voted to add several new committers. Please join
 me in welcoming them to their new roles! The new committers are:

 - Huaxin Gao
 - Jungtaek Lim
 - Dilip Biswal

 All three of them contributed to Spark 3.0 and we’re excited to have
 them join the project.

 Matei and the Spark PMC
 -
 To unsubscribe e-mail: dev-unsubscr...@spark.apache.org


>>>
>>> --
>>> Shane Knapp
>>> Computer Guy / Voice of Reason
>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>> https://rise.cs.berkeley.edu
>>>
>>
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


-- 



Re: restarting jenkins build system tomorrow (7/8) ~930am PDT

2020-07-13 Thread Xiao Li
Thank you very much, Shane!

Xiao

On Mon, Jul 13, 2020 at 10:15 AM shane knapp ☠  wrote:

> alright, the system load graphs show that we've had a generally decreasing
> load since friday, and have burned through ~3k builds/day since the reboot
> last week!  i don't see many timeouts, and the PRB builds have been
> generally green for a couple of days.
>
> again, i will keep an eye on things but i feel we're out of the woods
> right now.  :)
>
> shane
>
> On Fri, Jul 10, 2020 at 3:43 PM Frank Yin  wrote:
>
>> Great. Thanks.
>>
>> On Fri, Jul 10, 2020 at 3:39 PM shane knapp ☠ 
>> wrote:
>>
>>> no, 8 hours is plenty.  things will speed up soon once the backlog of
>>> builds works through.  i limited the number of PRB builds to 4 per
>>> worker, and things are looking better.  let's see how we look next week.
>>>
>>> On Fri, Jul 10, 2020 at 3:31 PM Frank Yin  wrote:
>>>
 Can we also increase the build timeout?

 https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125617
 This one fails because it times out, not because of test failures.

 On Fri, Jul 10, 2020 at 2:16 PM Frank Yin  wrote:

> Yeah, that's what I figured -- those workers are under load. Thanks.
>
> On Fri, Jul 10, 2020 at 12:43 PM shane knapp ☠ 
> wrote:
>
>> only 125561, 125562 and 125564 were impacted by -9.
>>
>> 125565 exited w/a code of 15 (143 - 128), which means the process was
>> terminated for unknown reasons.
>>
>> 125563 looks like mima failed due to a bunch of errors.
>>
>> i just spot checked a bunch of recent failed PRB builds from today
>> and they all seemed to be legit.
>>
>> another thing that might be happening is an overload of PRB builds on
>> the workers due to the backlog...  the workers are under a LOT of load
>> right now, and i can put some rate limiting in to see if that helps out.
>>
>> shane
>>
>> On Fri, Jul 10, 2020 at 11:31 AM Frank Yin 
>> wrote:
>>
>>> Like from build number 125565 to 125561, all impacted by kill -9.
>>>
>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125565/console
>>>
>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125564/console
>>>
>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125563/console
>>>
>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125562/console
>>>
>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125561/console
>>>
>>> On Fri, Jul 10, 2020 at 9:35 AM shane knapp ☠ 
>>> wrote:
>>>
 define "a lot" and provide some links to those builds, please.
 there are roughly 2000 builds per day, and i can't do more than keep a
 cursory eye on things.

 the infrastructure that the tests run on hasn't changed one bit on
 any of the workers, and 'kill -9' could be a timeout, flakiness caused 
 by
 old build processes remaining on the workers after the master went 
 down, or
 me trying to clean things up w/o a reboot.  or, perhaps, something 
 wrong
 w/the infra.  :)

 On Fri, Jul 10, 2020 at 9:28 AM Frank Yin 
 wrote:

> Agree, but I’ve seen a lot of kills by signal 9. Assuming that's
> infrastructure?
>
> On Fri, Jul 10, 2020 at 8:19 AM shane knapp ☠ 
> wrote:
>
>> yeah, i can't do much for flaky tests...  just flaky
>> infrastructure.
>>
>>
>> On Fri, Jul 10, 2020 at 12:41 AM Hyukjin Kwon <
>> gurwls...@gmail.com> wrote:
>>
>>> A couple of flaky tests can happen; it's usual. It seems to have gotten
>>> better now, at least. I will keep monitoring the builds.
>>>
>>> On Fri, Jul 10, 2020 at 4:33 PM ukby1234  wrote:
>>>
 Looks like Jenkins isn't stable still. My PR fails two times in
 a row:

 https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125565/console

 https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125536/testReport



 --
 Sent from:
 http://apache-spark-developers-list.1001551.n3.nabble.com/


 -
 To unsubscribe e-mail: dev-unsubscr...@spark.apache.org


>>
>> --
>> Shane Knapp
>> Computer Guy / Voice of Reason
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>

 --
 Shane Knapp
 Computer Guy / Voice of Reason
 UC Berkeley EECS Research / RISELab Staff Technical Lead

Re: [DISCUSS] Apache Spark 3.0.1 Release

2020-07-01 Thread Xiao Li
+1 on releasing both 3.0.1 and 2.4.7

Great! Three committers have volunteered to be release managers: Ruifeng,
Prashant, and Holden. Holden just helped release Spark 2.4.6. This time,
maybe, Ruifeng and Prashant can be the release managers of 3.0.1 and 2.4.7,
respectively.

Xiao

On Wed, Jul 1, 2020 at 2:24 PM Jungtaek Lim 
wrote:

> https://issues.apache.org/jira/browse/SPARK-32148 was reported yesterday,
> and if the report is valid, it looks to be a blocker. I'll try to take a
> look soon.
>
> On Thu, Jul 2, 2020 at 12:48 AM Shivaram Venkataraman <
> shiva...@eecs.berkeley.edu> wrote:
>
>> Thanks Holden -- it would be great to also get 2.4.7 started
>>
>> Thanks
>> Shivaram
>>
>> On Tue, Jun 30, 2020 at 10:31 PM Holden Karau 
>> wrote:
>> >
>> > I can take care of 2.4.7 unless someone else wants to do it.
>> >
>> > On Tue, Jun 30, 2020 at 8:29 PM Jason Moore <
>> jason.mo...@quantium.com.au> wrote:
>> >>
>> >> Hi all,
>> >>
>> >>
>> >>
>> >> Could I get some input on the severity of this one that I found
>> yesterday?  If that’s a correctness issue, should it block this patch?  Let
>> me know under the ticket if there’s more info that I can provide to help.
>> >>
>> >>
>> >>
>> >> https://issues.apache.org/jira/browse/SPARK-32136
>> >>
>> >>
>> >>
>> >> Thanks,
>> >>
>> >> Jason.
>> >>
>> >>
>> >>
>> >> From: Jungtaek Lim 
>> >> Date: Wednesday, 1 July 2020 at 10:20 am
>> >> To: Shivaram Venkataraman 
>> >> Cc: Prashant Sharma , 郑瑞峰 ,
>> Gengliang Wang , gurwls223 <
>> gurwls...@gmail.com>, Dongjoon Hyun , Jules
>> Damji , Holden Karau ,
>> Reynold Xin , Yuanjian Li ,
>> "dev@spark.apache.org" , Takeshi Yamamuro <
>> linguin@gmail.com>
>> >> Subject: Re: [DISCUSS] Apache Spark 3.0.1 Release
>> >>
>> >>
>> >>
>> >> SPARK-32130 [1] looks to be a performance regression introduced in
>> Spark 3.0.0, which is ideal to look into before releasing another bugfix
>> version.
>> >>
>> >>
>> >>
>> >> 1. https://issues.apache.org/jira/browse/SPARK-32130
>> >>
>> >>
>> >>
>> >> On Wed, Jul 1, 2020 at 7:05 AM Shivaram Venkataraman <
>> shiva...@eecs.berkeley.edu> wrote:
>> >>
>> >> Hi all
>> >>
>> >>
>> >>
>> >> I just wanted to ping this thread to see if all the outstanding
>> blockers for 3.0.1 have been fixed. If so, it would be great if we can get
>> the release going. The CRAN team sent us a note that the version SparkR
>> available on CRAN for the current R version (4.0.2) is broken and hence we
>> need to update the package soon --  it will be great to do it with 3.0.1.
>> >>
>> >>
>> >>
>> >> Thanks
>> >>
>> >> Shivaram
>> >>
>> >>
>> >>
>> >> On Wed, Jun 24, 2020 at 8:31 PM Prashant Sharma 
>> wrote:
>> >>
>> >> +1 for 3.0.1 release.
>> >>
>> >> I too can help out as release manager.
>> >>
>> >>
>> >>
>> >> On Thu, Jun 25, 2020 at 4:58 AM 郑瑞峰  wrote:
>> >>
>> >> I volunteer to be a release manager of 3.0.1, if nobody is working on
>> this.
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> -- Original Message --
>> >>
>> >> From: "Gengliang Wang";
>> >>
>> >> Sent: Wednesday, June 24, 2020, 4:15 PM
>> >>
>> >> To: "Hyukjin Kwon";
>> >>
>> >> Cc: "Dongjoon Hyun";"Jungtaek Lim"<
>> kabhwan.opensou...@gmail.com>;"Jules Damji";"Holden
>> Karau";"Reynold Xin";"Shivaram
>> Venkataraman";"Yuanjian Li"<
>> xyliyuanj...@gmail.com>;"Spark dev list";"Takeshi
>> Yamamuro";
>> >>
>> >> Subject: Re: [DISCUSS] Apache Spark 3.0.1 Release
>> >>
>> >>
>> >>
>> >> +1, the issues mentioned are really serious.
>> >>
>> >>
>> >>
>> >> On Tue, Jun 23, 2020 at 7:56 PM Hyukjin Kwon 
>> wrote:
>> >>
>> >> +1.
>> >>
>> >> Just as a note,
>> >> - SPARK-31918 is fixed now, and there's no blocker.
>> >> - When we build SparkR, we should use the latest R version, at least
>> 4.0.0+.
>> >>
>> >>
>> >>
>> >> On Wed, Jun 24, 2020 at 11:20 AM Dongjoon Hyun 
>> wrote:
>> >>
>> >> +1
>> >>
>> >>
>> >>
>> >> Bests,
>> >>
>> >> Dongjoon.
>> >>
>> >>
>> >>
>> >> On Tue, Jun 23, 2020 at 1:19 PM Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>> >>
>> >> +1 on a 3.0.1 soon.
>> >>
>> >>
>> >>
>> >> Probably it would be nice if some Scala experts could take a look at
>> https://issues.apache.org/jira/browse/SPARK-32051 and include the fix
>> in 3.0.1 if possible.
>> >>
>> >> It looks like APIs designed to work with Scala 2.11 & Java introduce
>> ambiguity in Scala 2.12 & Java.
>> >>
>> >>
>> >>
>> >> On Wed, Jun 24, 2020 at 4:52 AM Jules Damji 
>> wrote:
>> >>
>> >> +1 (non-binding)
>> >>
>> >>
>> >>
>> >> Sent from my iPhone
>> >>
>> >> Pardon the dumb thumb typos :)
>> >>
>> >>
>> >>
>> >> On Jun 23, 2020, at 11:36 AM, Holden Karau 
>> wrote:
>> >>
>> >> +1 on a patch release soon
>> >>
>> >>
>> >>
>> >> On Tue, Jun 23, 2020 at 10:47 AM Reynold Xin 
>> wrote:
>> >>
>> >>
>> >> +1 on doing a new patch release soon. I saw some of these issues when
>> preparing the 3.0 release, and some of them are very serious.
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> On Tue, Jun 23, 2020 at 8:06 AM, Shivaram Venkataraman <
>> shiva...@eecs.berkeley.edu> wrote:

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2020-06-24 Thread Xiao Li
Hi, Dongjoon,

Please do not misinterpret my point. I already clearly said "I do not know
how to track the popularity of Hadoop 2 vs Hadoop 3."

Also, let me repeat my opinion: the top priority is to provide two options
for the PyPI distribution and let end users choose the one they need:
Hadoop 3.2 or Hadoop 2.7. In general, when we want to make any breaking
change, let us follow our protocol documented in
https://spark.apache.org/versioning-policy.html.

If you just want to change the Jenkins setup, I am OK with it. If you want
to change the default distribution, we need more discussion in the
community to reach an agreement.

Thanks,

Xiao


On Wed, Jun 24, 2020 at 10:07 AM Dongjoon Hyun 
wrote:

> Thanks, Xiao, Sean, Nicholas.
>
> To Xiao,
>
> >  it sounds like Hadoop 3.x is not as popular as Hadoop 2.7.
>
> If you say so,
> - Apache Hadoop 2.6.0 is the most popular one with 156 dependencies.
> - Apache Spark 2.2.0 is the most popular one with 264 dependencies.
>
> As we know, it doesn't make sense. Are we recommending Apache Spark 2.2.0
> over Apache Spark 3.0.0?
>
> There is a reason why Apache Spark dropped the Hadoop 2.6 profile. Hadoop
> 2.7.4 has many limitations in the cloud environment. Apache Hadoop 3.2 will
> unleash Apache Spark 3.1 in the cloud environment (as Nicholas also pointed
> out).
>
> For Sean's comment, yes. We can focus on that later in a different thread.
>
> > The bigger question indeed is dropping Hadoop 2.x / Hive 1.x etc
> eventually, not now.
>
> Bests,
> Dongjoon.
>
>
> On Wed, Jun 24, 2020 at 7:26 AM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> The team I'm on currently uses pip-installed PySpark for local
>> development, and we regularly access S3 directly from our
>> laptops/workstations.
>>
>> One of the benefits of having Spark built against Hadoop 3.2 vs. 2.7 is
>> being able to use a recent version of hadoop-aws that has mature support
>> for s3a. With Hadoop 2.7 the support for s3a is buggy and incomplete, and
>> there are incompatibilities that prevent you from using Spark built against
>> Hadoop 2.7 with hadoop-aws version 2.8 or newer.
>>
>> On Wed, Jun 24, 2020 at 10:15 AM Sean Owen  wrote:
>>
>>> Will PySpark users care much about the Hadoop version? They won't if
>>> running locally. They will if connecting to a Hadoop cluster. Then again,
>>> in that context, they're probably using a distro anyway that harmonizes it.
>>> Hadoop 3's installed base can't be that large yet; it's been around far
>>> less time.
>>>
>>> The bigger question indeed is dropping Hadoop 2.x / Hive 1.x etc
>>> eventually, not now.
>>> But if the question now is build defaults, is it a big deal either way?
>>>
>>> On Tue, Jun 23, 2020 at 11:03 PM Xiao Li  wrote:
>>>
>>>> I think we just need to provide two options and let end users choose
>>>> the ones they need: Hadoop 3.2 or Hadoop 2.7. Thus, SPARK-32017 (Make
>>>> Pyspark Hadoop 3.2+ Variant available in PyPI) is a high-priority task
>>>> for the Spark 3.1 release to me.
>>>>
>>>> I do not know how to track the popularity of Hadoop 2 vs Hadoop 3.
>>>> Based on this link
>>>> https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-hdfs , it
>>>> sounds like Hadoop 3.x is not as popular as Hadoop 2.7.
>>>>
>>>>
>>>>

-- 


Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2020-06-23 Thread Xiao Li
I think we just need to provide two options and let end users choose the
one they need: Hadoop 3.2 or Hadoop 2.7. Thus, SPARK-32017 (Make Pyspark
Hadoop 3.2+ Variant available in PyPI) is a high-priority task for the
Spark 3.1 release to me.

I do not know how to track the popularity of Hadoop 2 vs Hadoop 3. Based on
this link
https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-hdfs , it
sounds like Hadoop 3.x is not as popular as Hadoop 2.7.


On Tue, Jun 23, 2020 at 8:08 PM Dongjoon Hyun 
wrote:

> I fully understand your concern, but we cannot live with Hadoop 2.7.4
> forever, Xiao. Like Hadoop 2.6, we should let it go.
>
> So, are you saying that CRAN/PyPI should have all combinations of Apache
> Spark, including the Hive 1.2 distribution?
>
> What is your suggestion, as a PMC member, on the Hadoop 3.2 migration path?
> I'd love to remove the roadblocks for that.
>
> As a side note, Homebrew is not an official Apache Spark channel, but it's
> also a popular distribution channel in the community. And it's already using
> the Hadoop 3.2 distribution. Hadoop 2.7 is too old for the year 2021 (Apache
> Spark 3.1), isn't it?
>
> Bests,
> Dongjoon.
>
>
>
> On Tue, Jun 23, 2020 at 7:55 PM Xiao Li  wrote:
>
>> Then, it will be a little complex after this PR. It might make the
>> community more confused.
>>
>> In PYPI and CRAN, we are using Hadoop 2.7 as the default profile;
>> however, in the other distributions, we are using Hadoop 3.2 as the
>> default?
>>
>> How to explain this to the community? I would not change the default for
>> consistency.
>>
>> Xiao
>>
>>
>>
>> On Tue, Jun 23, 2020 at 7:18 PM Dongjoon Hyun 
>> wrote:
>>
>>> Thanks. Uploading PySpark to PyPI is a simple manual step and
>>> our release script is able to build PySpark with Hadoop 2.7 still if we
>>> want.
>>> So, `No` for the following question. I updated my PR according to your
>>> comment.
>>>
>>> > If we change the default, will it impact them? If YES,...
>>>
>>> From the comment on the PR, the following becomes irrelevant to the
>>> current PR.
>>>
>>> > SPARK-32017 (Make Pyspark Hadoop 3.2+ Variant available in PyPI)
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>>
>>>
>>> On Tue, Jun 23, 2020 at 12:09 AM Xiao Li  wrote:
>>>
>>>>
>>>> Our monthly pypi downloads of PySpark have reached 5.4 million. We
>>>> should avoid forcing the current PySpark users to upgrade their Hadoop
>>>> versions. If we change the default, will it impact them? If YES, I think we
>>>> should not do it until it is ready and they have a workaround. So far, our
>>>> pypi downloads are still relying on our default version.
>>>>
>>>> Please correct me if my concern is not valid.
>>>>
>>>> Xiao
>>>>
>>>>
>>>> On Tue, Jun 23, 2020 at 12:04 AM Dongjoon Hyun 
>>>> wrote:
>>>>
>>>>> Hi, All.
>>>>>
>>>>> I bump up this thread again with the title "Use Hadoop-3.2 as a
>>>>> default Hadoop profile in 3.1.0?"
>>>>> There exists some recent discussion on the following PR. Please let us
>>>>> know your thoughts.
>>>>>
>>>>> https://github.com/apache/spark/pull/28897
>>>>>
>>>>>
>>>>> Bests,
>>>>> Dongjoon.
>>>>>
>>>>>
>>>>> On Fri, Nov 1, 2019 at 9:41 AM Xiao Li  wrote:
>>>>>
>>>>>> Hi, Steve,
>>>>>>
>>>>>> Thanks for your comments! My major quality concern is not against
>>>>>> Hadoop 3.2. In this release, the Hive execution module upgrade [from 1.2 to
>>>>>> 2.3], the Hive thrift-server upgrade, and JDK 11 support are added to the
>>>>>> Hadoop 3.2 profile only. Compared with the Hadoop 2.x profile, the Hadoop
>>>>>> 3.2 profile is more risky due to these changes.
>>>>>>
>>>>>> To speed up the adoption of Spark 3.0, which has many other highly
>>>>>> desirable features, I am proposing to keep Hadoop 2.x profile as the
>>>>>> default.
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Xiao.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Nov 1, 2019 at 5:33 AM Steve Loughran 
>>>>>> wrote:

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2020-06-23 Thread Xiao Li
Then, it will be a little complex after this PR. It might make the
community more confused.

In PYPI and CRAN, we are using Hadoop 2.7 as the default profile; however,
in the other distributions, we are using Hadoop 3.2 as the default?

How to explain this to the community? I would not change the default for
consistency.
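
For reference, the profile choice shows up directly in how a distribution is
built. A sketch using only flags already mentioned in this thread (not a
release recipe):

    # same source tree, different default profile
    ./dev/make-distribution.sh --tgz -Phadoop-2.7 -Phive -Phive-thriftserver
    ./dev/make-distribution.sh --tgz -Phadoop-3.2 -Phive -Phive-thriftserver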

Xiao



On Tue, Jun 23, 2020 at 7:18 PM Dongjoon Hyun 
wrote:

> Thanks. Uploading PySpark to PyPI is a simple manual step and our release
> script is able to build PySpark with Hadoop 2.7 still if we want.
> So, `No` for the following question. I updated my PR according to your
> comment.
>
> > If we change the default, will it impact them? If YES,...
>
> From the comment on the PR, the following becomes irrelevant to the current
> PR.
>
> > SPARK-32017 (Make Pyspark Hadoop 3.2+ Variant available in PyPI)
>
> Bests,
> Dongjoon.
>
>
>
>
> On Tue, Jun 23, 2020 at 12:09 AM Xiao Li  wrote:
>
>>
>> Our monthly pypi downloads of PySpark have reached 5.4 million. We should
>> avoid forcing the current PySpark users to upgrade their Hadoop versions.
>> If we change the default, will it impact them? If YES, I think we should
>> not do it until it is ready and they have a workaround. So far, our pypi
>> downloads are still relying on our default version.
>>
>> Please correct me if my concern is not valid.
>>
>> Xiao
>>
>>
>> On Tue, Jun 23, 2020 at 12:04 AM Dongjoon Hyun 
>> wrote:
>>
>>> Hi, All.
>>>
>>> I bump up this thread again with the title "Use Hadoop-3.2 as a default
>>> Hadoop profile in 3.1.0?"
>>> There exists some recent discussion on the following PR. Please let us
>>> know your thoughts.
>>>
>>> https://github.com/apache/spark/pull/28897
>>>
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>> On Fri, Nov 1, 2019 at 9:41 AM Xiao Li  wrote:
>>>
>>>> Hi, Steve,
>>>>
>>>> Thanks for your comments! My major quality concern is not against
>>>> Hadoop 3.2. In this release, the Hive execution module upgrade [from 1.2 to
>>>> 2.3], the Hive thrift-server upgrade, and JDK 11 support are added to the
>>>> Hadoop 3.2 profile only. Compared with the Hadoop 2.x profile, the Hadoop
>>>> 3.2 profile is more risky due to these changes.
>>>>
>>>> To speed up the adoption of Spark 3.0, which has many other highly
>>>> desirable features, I am proposing to keep Hadoop 2.x profile as the
>>>> default.
>>>>
>>>> Cheers,
>>>>
>>>> Xiao.
>>>>
>>>>
>>>>
>>>> On Fri, Nov 1, 2019 at 5:33 AM Steve Loughran 
>>>> wrote:
>>>>
>>>>> What is the current default value? As the 2.x releases are becoming
>>>>> EOL: 2.7 is dead, there might be a 2.8.x, and for now 2.9 is the branch-2
>>>>> release getting attention. 2.10.0 shipped yesterday, but the ".0" means
>>>>> there will inevitably be surprises.
>>>>>
>>>>> One issue with using older versions is that any problem reported
>>>>> (especially stack traces you can blame me for) will generally be met by
>>>>> a response of "does it go away when you upgrade?" The other issue is how
>>>>> much test coverage things are getting.
>>>>>
>>>>> w.r.t. Hadoop 3.2 stability, nothing major has been reported. The ABFS
>>>>> client is there, and the big Guava update (HADOOP-16213) went in. People
>>>>> will either love or hate that.
>>>>>
>>>>> No major changes in s3a code between 3.2.0 and 3.2.1; I have a large
>>>>> backport planned though, including changes to better handle AWS caching of
>>>>> 404s generated from HEAD requests before an object was actually created.
>>>>>
>>>>> It would be really good if the spark distributions shipped with later
>>>>> versions of the hadoop artifacts.
>>>>>
>>>>> On Mon, Oct 28, 2019 at 7:53 PM Xiao Li  wrote:
>>>>>
>>>>>> The stability and quality of the Hadoop 3.2 profile are unknown. The
>>>>>> changes are massive, including Hive execution and a new version of the
>>>>>> Hive thriftserver.
>>>>>>
>>>>>> To reduce the risk, I would like to keep the current default version
>>>>>> unchanged. When it becomes stable, we can change the default profile to
>>>>>> Hadoop-3.2.

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2020-06-23 Thread Xiao Li
Our monthly pypi downloads of PySpark have reached 5.4 million. We should
avoid forcing the current PySpark users to upgrade their Hadoop versions.
If we change the default, will it impact them? If YES, I think we should
not do it until it is ready and they have a workaround. So far, our pypi
downloads are still relying on our default version.
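
As a side note, a hedged sketch of how a pip user can check which Hadoop
their installed PySpark bundles (the _jvm gateway is an internal API, so
treat this as a diagnostic trick rather than a supported one):

    python -c "from pyspark.sql import SparkSession; \
      spark = SparkSession.builder.getOrCreate(); \
      print(spark.sparkContext._jvm.org.apache.hadoop.util.VersionInfo.getVersion())"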

Please correct me if my concern is not valid.

Xiao


On Tue, Jun 23, 2020 at 12:04 AM Dongjoon Hyun 
wrote:

> Hi, All.
>
> I bump up this thread again with the title "Use Hadoop-3.2 as a default
> Hadoop profile in 3.1.0?"
> There exists some recent discussion on the following PR. Please let us
> know your thoughts.
>
> https://github.com/apache/spark/pull/28897
>
>
> Bests,
> Dongjoon.
>
>
> On Fri, Nov 1, 2019 at 9:41 AM Xiao Li  wrote:
>
>> Hi, Steve,
>>
>> Thanks for your comments! My major quality concern is not against Hadoop
>> 3.2. In this release, the Hive execution module upgrade [from 1.2 to 2.3], the
>> Hive thrift-server upgrade, and JDK 11 support are added to the Hadoop 3.2
>> profile only. Compared with the Hadoop 2.x profile, the Hadoop 3.2 profile is
>> more risky due to these changes.
>>
>> To speed up the adoption of Spark 3.0, which has many other highly
>> desirable features, I am proposing to keep Hadoop 2.x profile as the
>> default.
>>
>> Cheers,
>>
>> Xiao.
>>
>>
>>
>> On Fri, Nov 1, 2019 at 5:33 AM Steve Loughran 
>> wrote:
>>
>>> What is the current default value? As the 2.x releases are becoming EOL:
>>> 2.7 is dead, there might be a 2.8.x, and for now 2.9 is the branch-2 release
>>> getting attention. 2.10.0 shipped yesterday, but the ".0" means there will
>>> inevitably be surprises.
>>>
>>> One issue with using older versions is that any problem reported
>>> (especially stack traces you can blame me for) will generally be met by
>>> a response of "does it go away when you upgrade?" The other issue is how
>>> much test coverage things are getting.
>>>
>>> w.r.t. Hadoop 3.2 stability, nothing major has been reported. The ABFS
>>> client is there, and the big Guava update (HADOOP-16213) went in. People
>>> will either love or hate that.
>>>
>>> No major changes in s3a code between 3.2.0 and 3.2.1; I have a large
>>> backport planned though, including changes to better handle AWS caching of
>>> 404s generated from HEAD requests before an object was actually created.
>>>
>>> It would be really good if the spark distributions shipped with later
>>> versions of the hadoop artifacts.
>>>
>>> On Mon, Oct 28, 2019 at 7:53 PM Xiao Li  wrote:
>>>
>>>> The stability and quality of the Hadoop 3.2 profile are unknown. The
>>>> changes are massive, including Hive execution and a new version of the
>>>> Hive thriftserver.
>>>>
>>>> To reduce the risk, I would like to keep the current default version
>>>> unchanged. When it becomes stable, we can change the default profile to
>>>> Hadoop-3.2.
>>>>
>>>> Cheers,
>>>>
>>>> Xiao
>>>>
>>>> On Mon, Oct 28, 2019 at 12:51 PM Sean Owen  wrote:
>>>>
>>>>> I'm OK with that, but don't have a strong opinion nor info about the
>>>>> implications.
>>>>> That said my guess is we're close to the point where we don't need to
>>>>> support Hadoop 2.x anyway, so, yeah.
>>>>>
>>>>> On Mon, Oct 28, 2019 at 2:33 PM Dongjoon Hyun 
>>>>> wrote:
>>>>> >
>>>>> > Hi, All.
>>>>> >
>>>>> > There was a discussion on publishing artifacts built with Hadoop 3.
>>>>> > But, we are still publishing with Hadoop 2.7.3 and `3.0-preview`
>>>>> will be the same because we didn't change anything yet.
>>>>> >
>>>>> > Technically, we need to change two places for publishing.
>>>>> >
>>>>> > 1. Jenkins Snapshot Publishing
>>>>> >
>>>>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/
>>>>> >
>>>>> > 2. Release Snapshot/Release Publishing
>>>>> >
>>>>> https://github.com/apache/spark/blob/master/dev/create-release/release-build.sh
>>>>> >
>>>>> > To minimize the change, we need to switch our default Hadoop profile.
>>>>> >
>>>>> > Currently, the default is the `hadoop-2.7 (2.7.4)` profile and
>>>>> `hadoop-3.2 (3.2.0)` is optional.
>>>>> > We had better use the `hadoop-3.2` profile by default and `hadoop-2.7`
>>>>> optionally.
>>>>> >
>>>>> > Note that this means we use Hive 2.3.6 by default. Only the `hadoop-2.7`
>>>>> distribution will use Hive 1.2.1, like Apache Spark 2.4.x.
>>>>> >
>>>>> > Bests,
>>>>> > Dongjoon.
>>>>>
>>>>> -
>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>>
>>>>>
>>>>
>>>> --
>>>>
>>>
>>
>> --
>>
>



Re: Revisiting the idea of a Spark 2.5 transitional release

2020-06-12 Thread Xiao Li
Based on my understanding, DSv2 is not stable yet. It is still missing various
features. Even our built-in file sources are still unable to fully migrate
to DSv2. We plan to enhance it in the next few releases to close the gap.

Also, the changes to DSv2 in Spark 3.0 did not break any existing
application. We should encourage more users to try Spark 3 and increase the
adoption of Spark 3.x.

Xiao

On Fri, Jun 12, 2020 at 5:36 PM Holden Karau  wrote:

> So, one of the things we're planning on backporting internally is
> DSv2, which I think would be more broadly useful if available in a community
> release on a 2.x branch. Anything else on top of that would be on a
> case-by-case basis, depending on whether it makes an easier upgrade path to 3.
>
> If we're worried about people using 2.5 as a long-term home, we could
> always mark it with "-transitional" or something similar?
>
> On Fri, Jun 12, 2020 at 4:33 PM Sean Owen  wrote:
>
>> What is the functionality that would go into a 2.5.0 release, that can't
>> be in a 2.4.7 release? I think that's the key question. 2.4.x is the 2.x
>> maintenance branch, and I personally could imagine being open to more
>> freely backporting a few new features for 2.x users, whereas usually it's
>> only bug fixes. Making 2.5.0 implies that 2.5.x is the 2.x maintenance
>> branch but there's something too big for a 'normal' maintenance release,
>> and I think the whole question turns on what that is.
>>
>> If it's things like JDK 11 support, I think that is unfortunately fairly
>> 'breaking' because of dependency updates. But maybe that's not it.
>>
>>
>> On Fri, Jun 12, 2020 at 4:38 PM Holden Karau 
>> wrote:
>>
>>> Hi Folks,
>>>
>>> As we're getting closer to Spark 3 I'd like to revisit a Spark 2.5
>>> release. Spark 3 brings a number of important changes, and by its nature is
>>> not backward compatible. I think we'd all like to have as smooth an upgrade
>>> experience to Spark 3 as possible, and I believe that having a Spark 2
>>> release with some of the new functionality, while continuing to support the
>>> older APIs and current Scala version, would make the upgrade path smoother.
>>>
>>> This pattern is not uncommon in other Hadoop ecosystem projects, like
>>> Hadoop itself and HBase.
>>>
>>> I know that Ryan Blue has indicated he is already going to be
>>> maintaining something like that internally at Netflix, and we'll be doing
>>> the same thing at Apple. It seems like having a transitional release could
>>> benefit the community with easy migrations and help avoid duplicated work.
>>>
>>> I want to be clear I'm volunteering to do the work of managing a 2.5
>>> release, so hopefully, this wouldn't create any substantial burdens on the
>>> community.
>>>
>>> Cheers,
>>>
>>> Holden
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9  
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


-- 



Re: Revisiting the idea of a Spark 2.5 transitional release

2020-06-12 Thread Xiao Li
Which new functionalities are you referring to? In Spark SQL, most of the
major features in Spark 3.0 are difficult/time-consuming to backport. For
example, adaptive query execution. Releasing a new version is not hard, but
backporting/reviewing/maintaining these features is very time-consuming.
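
For example, the rewritten adaptive query execution in 3.0 sits behind a
single conf on a 3.0 build (a sketch; the engine changes behind this flag
are what would be costly to backport):

    ./bin/spark-sql --conf spark.sql.adaptive.enabled=true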

Which old APIs are broken? If the impact is big, we should add them back
based on our former discussion
http://apache-spark-developers-list.1001551.n3.nabble.com/Proposal-Modification-to-Spark-s-Semantic-Versioning-Policy-td28938.html

Thanks,

Xiao


On Fri, Jun 12, 2020 at 2:38 PM Holden Karau  wrote:

> Hi Folks,
>
> As we're getting closer to Spark 3 I'd like to revisit a Spark 2.5
> release. Spark 3 brings a number of important changes, and by its nature is
> not backward compatible. I think we'd all like to have as smooth an upgrade
> experience to Spark 3 as possible, and I believe that having a Spark 2
> release with some of the new functionality, while continuing to support the
> older APIs and current Scala version, would make the upgrade path smoother.
>
> This pattern is not uncommon in other Hadoop ecosystem projects, like
> Hadoop itself and HBase.
>
> I know that Ryan Blue has indicated he is already going to be maintaining
> something like that internally at Netflix, and we'll be doing the same
> thing at Apple. It seems like having a transitional release could benefit
> the community with easy migrations and help avoid duplicated work.
>
> I want to be clear I'm volunteering to do the work of managing a 2.5
> release, so hopefully, this wouldn't create any substantial burdens on the
> community.
>
> Cheers,
>
> Holden
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


-- 



Re: [vote] Apache Spark 3.0 RC3

2020-06-09 Thread Xiao Li
+1 (binding)

Xiao

On Mon, Jun 8, 2020 at 10:13 PM Xingbo Jiang  wrote:

> +1(non-binding)
>
> On Mon, Jun 8, 2020 at 9:50 PM Jiaxin Shan wrote:
>
>> +1
>> I built the binary using the following command and tested Spark workloads on
>> Kubernetes (AWS EKS); it's working well.
>>
>> ./dev/make-distribution.sh --name spark-v3.0.0-rc3-20200608 --tgz
>> -Phadoop-3.2 -Pkubernetes -Phive -Phive-thriftserver -Phadoop-cloud
>> -Pscala-2.12
>>
>> On Mon, Jun 8, 2020 at 7:13 PM Bryan Cutler  wrote:
>>
>>> +1 (non-binding)
>>>
>>> On Mon, Jun 8, 2020, 1:49 PM Tom Graves 
>>> wrote:
>>>
 +1

 Tom

 On Saturday, June 6, 2020, 03:09:09 PM CDT, Reynold Xin <
 r...@databricks.com> wrote:


 Please vote on releasing the following candidate as Apache Spark
 version 3.0.0.

 The vote is open until [DUE DAY] and passes if a majority +1 PMC votes
 are cast, with a minimum of 3 +1 votes.

 [ ] +1 Release this package as Apache Spark 3.0.0
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see http://spark.apache.org/

 The tag to be voted on is v3.0.0-rc3 (commit
 3fdfce3120f307147244e5eaf46d61419a723d50):
 https://github.com/apache/spark/tree/v3.0.0-rc3

 The release files, including signatures, digests, etc. can be found at:
 https://dist.apache.org/repos/dist/dev/spark/v3.0.0-rc3-bin/

 Signatures used for Spark RCs can be found in this file:
 https://dist.apache.org/repos/dist/dev/spark/KEYS

 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1350/

 The documentation corresponding to this release can be found at:
 https://dist.apache.org/repos/dist/dev/spark/v3.0.0-rc3-docs/

 The list of bug fixes going into 3.0.0 can be found at the following
 URL:
 https://issues.apache.org/jira/projects/SPARK/versions/12339177

 This release is using the release script of the tag v3.0.0-rc3.

 FAQ

 =========================
 How can I help test this release?
 =========================

 If you are a Spark user, you can help us test this release by taking
 an existing Spark workload and running on this release candidate, then
 reporting any regressions.

 If you're working in PySpark you can set up a virtual env and install
 the current RC and see if anything important breaks; in Java/Scala,
 you can add the staging repository to your project's resolvers and test
 with the RC (make sure to clean up the artifact cache before/after so
 you don't end up building with an out-of-date RC going forward).
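
 For example, a hedged sketch of that PySpark flow (the artifact URL is
 illustrative; take the actual file name from the RC bin directory above):

     python -m venv rc-test && source rc-test/bin/activate
     pip install \
       https://dist.apache.org/repos/dist/dev/spark/v3.0.0-rc3-bin/pyspark-3.0.0.tar.gz
     python -c "import pyspark; print(pyspark.__version__)"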

 ===========================================
 What should happen to JIRA tickets still targeting 3.0.0?
 ===========================================

 The current list of open tickets targeted at 3.0.0 can be found at:
 https://issues.apache.org/jira/projects/SPARK and search for "Target
 Version/s" = 3.0.0

 Committers should look at those and triage. Extremely important bug
 fixes, documentation, and API tweaks that impact compatibility should
 be worked on immediately. Everything else please retarget to an
 appropriate release.

 ==================
 But my bug isn't fixed?
 ==================

 In order to make timely releases, we will typically not hold the
 release unless the bug in question is a regression from the previous
 release. That being said, if there is something which is a regression
 that has not been correctly targeted please ping me or a committer to
 help target the issue.



>>
>> --
>> Best Regards!
>> Jiaxin Shan
>> Tel:  412-230-7670
>> Address: 470 2nd Ave S, Kirkland, WA
>> 
>>
>>

-- 



Re: [VOTE] Release Spark 2.4.6 (RC8)

2020-06-03 Thread Xiao Li
Yes. Spark 3.0 RC2 works well.

I think the current behavior in Spark 2.4 affects adoption, especially
for new users who want to try Spark in their local environment.

It impacts all our built-in clients, like Scala Shell and PySpark. Should
we consider back-porting it to 2.4?

Although this fixes the bug, it will also introduce a behavior change. We
should document it publicly and mention it in the release notes. Let us
review it more carefully and understand the risk and impact.
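
For anyone who wants to experiment in the meantime, an untested workaround
sketch (the assumption, which I have not verified, is that also setting the
Hive-side property through the spark.hadoop.* passthrough makes the embedded
metastore pick up the same path):

    ./bin/pyspark \
      --conf spark.sql.warehouse.dir=/tmp/warehouse \
      --conf spark.hadoop.hive.metastore.warehouse.dir=/tmp/warehouse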

Thanks,

Xiao

On Wed, Jun 3, 2020 at 10:12 AM Nicholas Chammas wrote:

> I believe that was fixed in 3.0 and there was a decision not to backport
> the fix: SPARK-31170 <https://issues.apache.org/jira/browse/SPARK-31170>
>
> On Wed, Jun 3, 2020 at 1:04 PM Xiao Li  wrote:
>
>> Just downloaded it on my local MacBook. Trying to create a table using
>> the pre-built PySpark. It sounds like the conf "spark.sql.warehouse.dir"
>> does not take effect. It is trying to create a directory in
>> "file:/user/hive/warehouse/t1". I have not done any investigation yet. Have
>> any of you hit the same issue?
>>
>> C02XT0U7JGH5:bin lixiao$ ./pyspark --conf
>> spark.sql.warehouse.dir="/Users/lixiao/Downloads/spark-2.4.6-bin-hadoop2.6"
>>
>> Python 2.7.16 (default, Jan 27 2020, 04:46:15)
>>
>> [GCC 4.2.1 Compatible Apple LLVM 10.0.1 (clang-1001.0.37.14)] on darwin
>>
>> Type "help", "copyright", "credits" or "license" for more information.
>>
>> 20/06/03 09:56:11 WARN NativeCodeLoader: Unable to load native-hadoop
>> library for your platform... using builtin-java classes where applicable
>>
>> Using Spark's default log4j profile:
>> org/apache/spark/log4j-defaults.properties
>>
>> Setting default log level to "WARN".
>>
>> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use
>> setLogLevel(newLevel).
>>
>> Welcome to
>>       ____              __
>>      / __/__  ___ _____/ /__
>>     _\ \/ _ \/ _ `/ __/  '_/
>>    /__ / .__/\_,_/_/ /_/\_\   version 2.4.6
>>       /_/
>>
>>
>> Using Python version 2.7.16 (default, Jan 27 2020 04:46:15)
>>
>> SparkSession available as 'spark'.
>>
>> >>> spark.sql("set spark.sql.warehouse.dir").show(truncate=False)
>>
>>
>> +-----------------------+-------------------------------------------------+
>> |key                    |value                                            |
>> +-----------------------+-------------------------------------------------+
>> |spark.sql.warehouse.dir|/Users/lixiao/Downloads/spark-2.4.6-bin-hadoop2.6|
>> +-----------------------+-------------------------------------------------+
>>
>>
>> >>> spark.sql("create table t1 (col1 int)")
>>
>> 20/06/03 09:56:29 WARN HiveMetaStore: Location:
>> file:/user/hive/warehouse/t1 specified for non-external table:t1
>>
>> Traceback (most recent call last):
>>
>>   File "", line 1, in 
>>
>>   File
>> "/Users/lixiao/Downloads/spark-2.4.6-bin-hadoop2.6/python/pyspark/sql/session.py",
>> line 767, in sql
>>
>> return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
>>
>>   File
>> "/Users/lixiao/Downloads/spark-2.4.6-bin-hadoop2.6/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>> line 1257, in __call__
>>
>>   File
>> "/Users/lixiao/Downloads/spark-2.4.6-bin-hadoop2.6/python/pyspark/sql/utils.py",
>> line 69, in deco
>>
>> raise AnalysisException(s.split(': ', 1)[1], stackTrace)
>>
>> pyspark.sql.utils.AnalysisException:
>> u'org.apache.hadoop.hive.ql.metadata.HiveException:
>> MetaException(message:file:/user/hive/warehouse/t1 is not a directory or
>> unable to create one);'
>>
>> On Wed, Jun 3, 2020 at 9:18 AM Dongjoon Hyun wrote:
>>
>>> +1
>>>
>>> Bests,
>>> Dongjoon
>>>
>>> On Wed, Jun 3, 2020 at 5:59 AM Tom Graves 
>>> wrote:
>>>
>>>>  +1
>>>>
>>>> Tom
>>>>
>>>> On Sunday, May 31, 2020, 06:47:09 PM CDT, Holden Karau <
>>>> hol...@pigscanfly.ca> wrote:
>>>>
>>>>
>>>> Please vote on releasing the following candidate as Apache Spark
>>>> version 2.4.6.
>>>>
>>>> The vote is open until June 5th at 9AM PST and passes if a majority +1
>>>> PMC votes are cast, with a minimum of 3 +1 votes.
