Re: [FYI] SPARK-47993: Drop Python 3.8

2024-04-26 Thread John Zhuge
+1

On Fri, Apr 26, 2024 at 8:41 AM Kent Yao  wrote:

> +1
>
> > yangjie01  wrote on Fri, Apr 26, 2024 at 17:16:
> >
> > +1
> >
> >
> >
> > From: Ruifeng Zheng 
> > Date: Friday, April 26, 2024 15:05
> > To: Xinrong Meng 
> > Cc: Dongjoon Hyun , "dev@spark.apache.org" <
> dev@spark.apache.org>
> > Subject: Re: [FYI] SPARK-47993: Drop Python 3.8
> >
> >
> >
> > +1
> >
> >
> >
> > On Fri, Apr 26, 2024 at 10:26 AM Xinrong Meng 
> wrote:
> >
> > +1
> >
> >
> >
> > On Thu, Apr 25, 2024 at 2:08 PM Holden Karau 
> wrote:
> >
> > +1
> >
> > Twitter: https://twitter.com/holdenkarau
> >
> > Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9
> >
> > YouTube Live Streams: https://www.youtube.com/user/holdenkarau
> >
> >
> >
> >
> >
> > On Thu, Apr 25, 2024 at 11:18 AM Maciej  wrote:
> >
> > +1
> >
> > Best regards,
> >
> > Maciej Szymkiewicz
> >
> >
> >
> > Web: https://zero323.net
> >
> > PGP: A30CEF0C31A501EC
> >
> > On 4/25/24 6:21 PM, Reynold Xin wrote:
> >
> > +1
> >
> >
> >
> > On Thu, Apr 25, 2024 at 9:01 AM Santosh Pingale <
> santosh.ping...@adyen.com.invalid> wrote:
> >
> > +1
> >
> >
> >
> > On Thu, Apr 25, 2024, 5:41 PM Dongjoon Hyun 
> wrote:
> >
> > FYI, there is a proposal to drop Python 3.8 because its EOL is October
> 2024.
> >
> >
> > https://github.com/apache/spark/pull/46228
> > [SPARK-47993][PYTHON] Drop Python 3.8
> >
> >
> >
> > Since it's still alive and there will be an overlap between the
> lifecycle of Python 3.8 and Apache Spark 4.0.0, please give us your
> feedback on the PR, if you have any concerns.
> >
> >
> >
> > From my side, I agree with this decision.
> >
> >
> >
> > Thanks,
> >
> > Dongjoon.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>

-- 
John Zhuge


Re: [VOTE] SPARK-44444: Use ANSI SQL mode by default

2024-04-15 Thread John Zhuge
+1 (non-binding)

On Sun, Apr 14, 2024 at 7:18 PM Jungtaek Lim 
wrote:

> +1 (non-binding), thanks Dongjoon.
>
> On Sun, Apr 14, 2024 at 7:22 AM Dongjoon Hyun 
> wrote:
>
>> Please vote on SPARK-44444 to use ANSI SQL mode by default.
>> The technical scope is defined in the following PR which is
>> one line of code change and one line of migration guide.
>>
>> - DISCUSSION:
>> https://lists.apache.org/thread/ztlwoz1v1sn81ssks12tb19x37zozxlz
>> - JIRA: https://issues.apache.org/jira/browse/SPARK-44444
>> - PR: https://github.com/apache/spark/pull/46013
>>
>> The vote is open until April 17th 1AM (PST) and passes
>> if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>
>> [ ] +1 Use ANSI SQL mode by default
>> [ ] -1 Do not use ANSI SQL mode by default because ...
>>
>> Thank you in advance.
>>
>> Dongjoon
>>
>
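
For context on the switch being voted on above: ANSI mode is governed by the
existing spark.sql.ansi.enabled SQL config, so users who need the legacy
behavior can still pin it per session after the default flips. A minimal
sketch in Scala (illustrative only; the config key is the documented one,
everything else is boilerplate):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("ansi-default-check")
  .getOrCreate()

// Legacy (non-ANSI) behavior: division by zero yields NULL.
spark.conf.set("spark.sql.ansi.enabled", "false")
spark.sql("SELECT 1 / 0 AS q").show()

// Proposed default: the same query fails with a DIVIDE_BY_ZERO error instead.
spark.conf.set("spark.sql.ansi.enabled", "true")
// spark.sql("SELECT 1 / 0 AS q").show()   // would now throw rather than return NULL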

-- 
John Zhuge


Re: [VOTE] SPIP: Structured Logging Framework for Apache Spark

2024-03-12 Thread John Zhuge
+1 (non-binding)

On Tue, Mar 12, 2024 at 8:45 AM L. C. Hsieh  wrote:

> +1
>
>
> On Tue, Mar 12, 2024 at 8:20 AM Chao Sun  wrote:
>
>> +1
>>
>> On Tue, Mar 12, 2024 at 8:03 AM Xiao Li 
>> wrote:
>>
>>> +1
>>>
>>> On Tue, Mar 12, 2024 at 6:09 AM Holden Karau 
>>> wrote:
>>>
>>>> +1
>>>>
>>>> Twitter: https://twitter.com/holdenkarau
>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>
>>>
>>>>
>>>> On Mon, Mar 11, 2024 at 7:44 PM Reynold Xin 
>>>> wrote:
>>>>
>>>>> +1
>>>>>
>>>>>
>>>>> On Mon, Mar 11 2024 at 7:38 PM, Jungtaek Lim <
>>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>>
>>>>>> +1 (non-binding), thanks Gengliang!
>>>>>>
>>>>>> On Mon, Mar 11, 2024 at 5:46 PM Gengliang Wang 
>>>>>> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I'd like to start the vote for SPIP: Structured Logging Framework
>>>>>>> for Apache Spark
>>>>>>>
>>>>>>> References:
>>>>>>>
>>>>>>>- JIRA ticket <https://issues.apache.org/jira/browse/SPARK-47240>
>>>>>>>- SPIP doc
>>>>>>>
>>>>>>> <https://docs.google.com/document/d/1rATVGmFLNVLmtxSpWrEceYm7d-ocgu8ofhryVs4g3XU/edit?usp=sharing>
>>>>>>>- Discussion thread
>>>>>>><https://lists.apache.org/thread/gocslhbfv1r84kbcq3xt04nx827ljpxq>
>>>>>>>
>>>>>>> Please vote on the SPIP for the next 72 hours:
>>>>>>>
>>>>>>> [ ] +1: Accept the proposal as an official SPIP
>>>>>>> [ ] +0
>>>>>>> [ ] -1: I don’t think this is a good idea because …
>>>>>>>
>>>>>>> Thanks!
>>>>>>> Gengliang Wang
>>>>>>>
>>>>>>
>>>
>>> --
>>>
>>>

-- 
John Zhuge


Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-02-29 Thread John Zhuge
Excellent work, congratulations!

On Wed, Feb 28, 2024 at 10:12 PM Dongjoon Hyun 
wrote:

> Congratulations!
>
> Bests,
> Dongjoon.
>
> On Wed, Feb 28, 2024 at 11:43 AM beliefer  wrote:
>
>> Congratulations!
>>
>>
>>
>> At 2024-02-28 17:43:25, "Jungtaek Lim" 
>> wrote:
>>
>> Hi everyone,
>>
>> We are happy to announce the availability of Spark 3.5.1!
>>
>> Spark 3.5.1 is a maintenance release containing stability fixes. This
>> release is based on the branch-3.5 maintenance branch of Spark. We
>> strongly
>> recommend all 3.5 users to upgrade to this stable release.
>>
>> To download Spark 3.5.1, head over to the download page:
>> https://spark.apache.org/downloads.html
>>
>> To view the release notes:
>> https://spark.apache.org/releases/spark-release-3-5-1.html
>>
>> We would like to acknowledge all community members for contributing to
>> this
>> release. This release would not have been possible without you.
>>
>> Jungtaek Lim
>>
>> ps. Yikun is helping us through releasing the official docker image for
>> Spark 3.5.1 (Thanks Yikun!) It may take some time to be generally available.
>>
>>
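
For JVM projects, the release is also published to Maven Central; a minimal
build.sbt sketch of picking it up (coordinates follow the usual
org.apache.spark naming, with %% resolving the Scala 2.12/2.13 suffix):

// Depend on the Spark 3.5.1 maintenance release from Maven Central.
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.5.1"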

-- 
John Zhuge


Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-13 Thread John Zhuge
Congratulations! Excellent work!

On Tue, Feb 13, 2024 at 8:04 PM Yufei Gu  wrote:

> Absolutely thrilled to see the project going open-source! Huge congrats to
> Chao and the entire team on this milestone!
>
> Yufei
>
>
> On Tue, Feb 13, 2024 at 12:43 PM Chao Sun  wrote:
>
>> Hi all,
>>
>> We are very happy to announce that Project Comet, a plugin to
>> accelerate Spark query execution via leveraging DataFusion and Arrow,
>> has now been open sourced under the Apache Arrow umbrella. Please
>> check the project repo
>> https://github.com/apache/arrow-datafusion-comet for more details if
>> you are interested. We'd love to collaborate with people from the open
>> source community who share similar goals.
>>
>> Thanks,
>> Chao
>>
>> -----
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
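
For readers who want to try the plugin announced above, a rough sketch of how
an accelerator like this is typically wired in through Spark's standard plugin
mechanism. Only spark.plugins is a stock Spark config here; the Comet class
name and the spark.comet.* keys are assumptions to verify against the Comet
README, and the Comet runtime jar must be on the classpath:

import org.apache.spark.sql.SparkSession

// Sketch only: the plugin class name and comet.* keys are assumed from the
// project README and may differ between Comet versions; check the repo first.
val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.plugins", "org.apache.spark.CometPlugin")
  .config("spark.comet.enabled", "true")
  .config("spark.comet.exec.enabled", "true")
  .getOrCreate()

// Operators Comet supports run natively via DataFusion/Arrow;
// anything unsupported falls back to regular Spark execution.
spark.range(0L, 1000000L).selectExpr("sum(id)").show()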

-- 
John Zhuge


Re: Re: [DISCUSS] Release Spark 3.5.1?

2024-02-04 Thread John Zhuge
+1

John Zhuge


On Sun, Feb 4, 2024 at 11:23 AM Santosh Pingale
 wrote:

> +1
>
> On Sun, Feb 4, 2024, 8:18 PM Xiao Li 
> wrote:
>
>> +1
>>
>> On Sun, Feb 4, 2024 at 6:07 AM beliefer  wrote:
>>
>>> +1
>>>
>>>
>>>
>>> On 2024-02-04 15:26:13, "Dongjoon Hyun"  wrote:
>>>
>>> +1
>>>
>>> On Sat, Feb 3, 2024 at 9:18 PM yangjie01 
>>> wrote:
>>>
>>>> +1
>>>>
>>>> On 2024/2/4 13:13, "Kent Yao" <y...@apache.org> wrote:
>>>>
>>>>
>>>> +1
>>>>
>>>>
> >>>> Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote on Sat, Feb 3, 2024 at 21:14:
>>>> >
>>>> > Hi dev,
>>>> >
>>>> > looks like there are a huge number of commits being pushed to
>>>> branch-3.5 after 3.5.0 was released, 200+ commits.
>>>> >
>>>> > $ git log --oneline v3.5.0..HEAD | wc -l
>>>> > 202
>>>> >
>>>> > Also, there are 180 JIRA tickets containing 3.5.1 as fixed version,
>>>> and 10 resolved issues are either marked as blocker (even correctness
>>>> issues) or critical, which justifies the release.
>>>> > https://issues.apache.org/jira/projects/SPARK/versions/12353495 <
>>>> https://issues.apache.org/jira/projects/SPARK/versions/12353495>
>>>> >
>>>> > What do you think about releasing 3.5.1 with the current head of
>>>> branch-3.5? I'm happy to volunteer as the release manager.
>>>> >
>>>> > Thanks,
>>>> > Jungtaek Lim (HeartSaVioR)
>>>>
>>>>
>>>> -
>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> -
>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>
>>>>
>>
>> --
>>
>>


Re: Welcome to Our New Apache Spark Committer and PMCs

2023-10-06 Thread John Zhuge
Congratulations!

On Fri, Oct 6, 2023 at 6:41 PM Yi Wu  wrote:

> Congrats!
>
> On Sat, Oct 7, 2023 at 9:24 AM XiDuo You  wrote:
>
>> Congratulations!
>>
>> > Prashant Sharma  wrote on Fri, Oct 6, 2023 at 00:26:
>> >
>> > Congratulations 
>> >
>> > On Wed, 4 Oct, 2023, 8:52 pm huaxin gao, 
>> wrote:
>> >>
>> >> Congratulations!
>> >>
>> >> On Wed, Oct 4, 2023 at 7:39 AM Chao Sun  wrote:
>> >>>
>> >>> Congratulations!
>> >>>
>> >>> On Wed, Oct 4, 2023 at 5:11 AM Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>> 
>>  Congrats!
>> 
>>  On Wed, Oct 4, 2023 at 5:04 PM, yangjie01  wrote:
>> >
>> > Congratulations!
>> >
>> >
>> >
>> > Jie Yang
>> >
>> >
>> >
>> > From: Dongjoon Hyun 
>> > Date: Wednesday, October 4, 2023 13:04
>> > To: Hyukjin Kwon 
>> > Cc: Hussein Awala , Rui Wang <
>> amaliu...@apache.org>, Gengliang Wang , Xiao Li <
>> gatorsm...@gmail.com>, "dev@spark.apache.org" 
>> > Subject: Re: Welcome to Our New Apache Spark Committer and PMCs
>> >
>> >
>> >
>> > Congratulations!
>> >
>> >
>> >
>> > Dongjoon.
>> >
>> >
>> >
>> > On Tue, Oct 3, 2023 at 5:25 PM Hyukjin Kwon 
>> wrote:
>> >
>> > Woohoo!
>> >
>> >
>> >
>> > On Tue, 3 Oct 2023 at 22:47, Hussein Awala 
>> wrote:
>> >
>> > Congrats to all of you!
>> >
>> >
>> >
>> > On Tue 3 Oct 2023 at 08:15, Rui Wang  wrote:
>> >
>> > Congratulations! Well deserved!
>> >
>> >
>> >
>> > -Rui
>> >
>> >
>> >
>> >
>> >
>> > On Mon, Oct 2, 2023 at 10:32 PM Gengliang Wang 
>> wrote:
>> >
>> > Congratulations to all! Well deserved!
>> >
>> >
>> >
>> > On Mon, Oct 2, 2023 at 10:16 PM Xiao Li 
>> wrote:
>> >
>> > Hi all,
>> >
>> > The Spark PMC is delighted to announce that we have voted to add
>> one new committer and two new PMC members. These individuals have
>> consistently contributed to the project and have clearly demonstrated their
>> expertise.
>> >
>> > New Committer:
>> > - Jiaan Geng (focusing on Spark Connect and Spark SQL)
>> >
>> > New PMCs:
>> > - Yuanjian Li
>> > - Yikun Jiang
>> >
>> > Please join us in extending a warm welcome to them in their new
>> roles!
>> >
>> > Sincerely,
>> > The Spark PMC
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: [VOTE] Release Apache Spark 3.4.0 (RC5)

2023-04-05 Thread John Zhuge
> > >>> > >>> On Mon, Apr 3, 2023 at 12:35 PM Dongjoon Hyun <
> > >>> dongjoon.h...@gmail.com>
> > >>> > >>> wrote:
> > >>> > >>> >
> > >>> > >>> > +1
> > >>> > >>> >
> > >>> > >>> > I also verified that RC5 has SBOM artifacts.
> > >>> > >>> >
> > >>> > >>> >
> > >>> > >>>
> > >>>
> https://repository.apache.org/content/repositories/orgapachespark-1439/org/apache/spark/spark-core_2.12/3.4.0/spark-core_2.12-3.4.0-cyclonedx.json
> > >>> > >>> >
> > >>> > >>>
> > >>>
> https://repository.apache.org/content/repositories/orgapachespark-1439/org/apache/spark/spark-core_2.13/3.4.0/spark-core_2.13-3.4.0-cyclonedx.json
> > >>> > >>> >
> > >>> > >>> > Thanks,
> > >>> > >>> > Dongjoon.
> > >>> > >>> >
> > >>> > >>> >
> > >>> > >>> >
> > >>> > >>> > On Mon, Apr 3, 2023 at 1:57 AM yangjie01 <
> yangji...@baidu.com>
> > >>> wrote:
> > >>> > >>> >>
> > >>> > >>> >> +1, checked Java 17 + Scala 2.13 + Python 3.10.10.
> > >>> > >>> >>
> > >>> > >>> >>
> > >>> > >>> >>
> > >>> > >>> >> From: Herman van Hovell 
> > >>> > >>> >> Date: Friday, March 31, 2023 12:12
> > >>> > >>> >> To: Sean Owen 
> > >>> > >>> >> Cc: Xinrong Meng , dev <
> > >>> > >>> dev@spark.apache.org>
> > >>> > >>> >> Subject: Re: [VOTE] Release Apache Spark 3.4.0 (RC5)
> > >>> > >>> >>
> > >>> > >>> >>
> > >>> > >>> >>
> > >>> > >>> >> +1
> > >>> > >>> >>
> > >>> > >>> >>
> > >>> > >>> >>
> > >>> > >>> >> On Thu, Mar 30, 2023 at 11:05 PM Sean Owen <
> sro...@apache.org>
> > >>> wrote:
> > >>> > >>> >>
> > >>> > >>> >> +1 same result from me as last time.
> > >>> > >>> >>
> > >>> > >>> >>
> > >>> > >>> >>
> > >>> > >>> >> On Thu, Mar 30, 2023 at 3:21 AM Xinrong Meng <
> > >>> > >>> xinrong.apa...@gmail.com> wrote:
> > >>> > >>> >>
> > >>> > >>> >> Please vote on releasing the following candidate(RC5) as
> Apache
> > >>> Spark
> > >>> > >>> version 3.4.0.
> > >>> > >>> >>
> > >>> > >>> >> The vote is open until 11:59pm Pacific time April 4th and
> > >>> passes if a
> > >>> > >>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
> > >>> > >>> >>
> > >>> > >>> >> [ ] +1 Release this package as Apache Spark 3.4.0
> > >>> > >>> >> [ ] -1 Do not release this package because ...
> > >>> > >>> >>
> > >>> > >>> >> To learn more about Apache Spark, please see
> > >>> http://spark.apache.org/
> > >>> > >>> >>
> > >>> > >>> >> The tag to be voted on is v3.4.0-rc5 (commit
> > >>> > >>> f39ad617d32a671e120464e4a75986241d72c487):
> > >>> > >>> >> https://github.com/apache/spark/tree/v3.4.0-rc5
> > >>> > >>> >>
> > >>> > >>> >> The release files, including signatures, digests, etc. can
> be
> > >>> found
> > >>> > >>> at:
> > >>> > >>> >>
> https://dist.apache.org/repos/dist/dev/spark/v3.4.0-rc5-bin/
> > >>> > >>> >>
> > >>> > >>> >> Signatures used for Spark RCs can be found in this file:
> > >>> > >>> >> https://dist.apache.org/repos/dist/dev/spark/KEYS
> > >>> > >>> >>
> > >>> > >>> >> The staging repository for this release can be found at:
> > >>> > >>> >>
> > >>> > >>>
> > >>>
> https://repository.apache.org/content/repositories/orgapachespark-1439
> > >>> > >>> >>
> > >>> > >>> >> The documentation corresponding to this release can be
> found at:
> > >>> > >>> >>
> https://dist.apache.org/repos/dist/dev/spark/v3.4.0-rc5-docs/
> > >>> > >>> >>
> > >>> > >>> >> The list of bug fixes going into 3.4.0 can be found at the
> > >>> following
> > >>> > >>> URL:
> > >>> > >>> >>
> https://issues.apache.org/jira/projects/SPARK/versions/12351465
> > >>> > >>> >>
> > >>> > >>> >> This release is using the release script of the tag
> v3.4.0-rc5.
> > >>> > >>> >>
> > >>> > >>> >>
> > >>> > >>> >>
> > >>> > >>> >>
> > >>> > >>> >>
> > >>> > >>> >> FAQ
> > >>> > >>> >>
> > >>> > >>> >> =
> > >>> > >>> >> How can I help test this release?
> > >>> > >>> >> =
> > >>> > >>> >> If you are a Spark user, you can help us test this release
> by
> > >>> taking
> > >>> > >>> >> an existing Spark workload and running on this release
> > >>> candidate, then
> > >>> > >>> >> reporting any regressions.
> > >>> > >>> >>
> > >>> > >>> >> If you're working in PySpark you can set up a virtual env
> and
> > >>> install
> > >>> > >>> >> the current RC and see if anything important breaks, in the
> > >>> Java/Scala
> > >>> > >>> >> you can add the staging repository to your projects
> resolvers
> > >>> and test
> > >>> > >>> >> with the RC (make sure to clean up the artifact cache
> > >>> before/after so
> > >>> > >>> >> you don't end up building with an out of date RC going
> forward).
> > >>> > >>> >>
> > >>> > >>> >> ===
> > >>> > >>> >> What should happen to JIRA tickets still targeting 3.4.0?
> > >>> > >>> >> ===
> > >>> > >>> >> The current list of open tickets targeted at 3.4.0 can be
> found
> > >>> at:
> > >>> > >>> >> https://issues.apache.org/jira/projects/SPARK and search
> for
> > >>> "Target
> > >>> > >>> Version/s" = 3.4.0
> > >>> > >>> >>
> > >>> > >>> >> Committers should look at those and triage. Extremely
> important
> > >>> bug
> > >>> > >>> >> fixes, documentation, and API tweaks that impact
> compatibility
> > >>> should
> > >>> > >>> >> be worked on immediately. Everything else please retarget
> to an
> > >>> > >>> >> appropriate release.
> > >>> > >>> >>
> > >>> > >>> >> ==
> > >>> > >>> >> But my bug isn't fixed?
> > >>> > >>> >> ==
> > >>> > >>> >> In order to make timely releases, we will typically not
> hold the
> > >>> > >>> >> release unless the bug in question is a regression from the
> > >>> previous
> > >>> > >>> >> release. That being said, if there is something which is a
> > >>> regression
> > >>> > >>> >> that has not been correctly targeted please ping me or a
> > >>> committer to
> > >>> > >>> >> help target the issue.
> > >>> > >>> >>
> > >>> > >>> >>
> > >>> > >>> >>
> > >>> > >>> >> Thanks,
> > >>> > >>> >>
> > >>> > >>> >> Xinrong Meng
> > >>> > >>> >>
> > >>> > >>> >>
> > >>> > >>>
> > >>> > >>>
> > >>> -
> > >>> > >>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> > >>> > >>>
> > >>> > >>>
> > >>> >
> > >>>
> > >>> -
> > >>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> > >>>
> > >>>
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>

-- 
John Zhuge


Re: Adding new connectors

2023-03-24 Thread John Zhuge
Is this similar to Iceberg's hidden partitioning
<https://iceberg.apache.org/docs/latest/partitioning/#icebergs-hidden-partitioning>?
Check out the details in the spec:
https://iceberg.apache.org/spec/#partition-transforms
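
For a concrete flavor of those transforms, a minimal sketch, run from a
spark-shell that has the Iceberg runtime on the classpath and an Iceberg
catalog named demo configured (table and column names here are made up):

// The table is partitioned by transforms of its own columns, not by extra
// user-visible partition columns.
spark.sql("""
  CREATE TABLE demo.db.events (
    id      BIGINT,
    ts      TIMESTAMP,
    payload STRING)
  USING iceberg
  PARTITIONED BY (days(ts), bucket(16, id))
""")

// Readers never reference the partitioning directly; a plain predicate on ts
// is pruned to the matching daily partitions automatically.
spark.sql("SELECT count(*) FROM demo.db.events WHERE ts >= TIMESTAMP '2023-03-01 00:00:00'").show()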

On Fri, Mar 24, 2023 at 2:52 PM Alex Cruise  wrote:

> On Fri, Mar 24, 2023 at 1:46 PM John Zhuge  wrote:
>
>> Have you checked out SparkCatalog
>> <https://github.com/apache/iceberg/blob/master/spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java>
>>  in
>> Apache Iceberg project? More docs at
>> https://iceberg.apache.org/docs/latest/spark-configuration/#catalogs
>>
>
> No, I hadn't seen that one yet, thanks!
>
> Another question: our partitions have no useful uniqueness criteria other
> than a storage URL which should never be exposed to user-space. Our
> "primary" index is a timestamp, and multiple partitions within a table can
> have overlapping time ranges. We support an additional shard key but it's
> optional. Is there something like partition discovery in DataSourceV2 where
> I should list all the (potentially many thousands) of partitions for a
> table, or can I leave them unpopulated until query planning time, when time
> range predicates often have extremely high selectivity?
>
> Thanks!
>
> -0xe1a
>
>>

-- 
John Zhuge


Re: Adding new connectors

2023-03-24 Thread John Zhuge
Have you checked out SparkCatalog
<https://github.com/apache/iceberg/blob/master/spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java>
in
Apache Iceberg project? More docs at
https://iceberg.apache.org/docs/latest/spark-configuration/#catalogs
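
Roughly, wiring such a catalog into a session looks like the sketch below;
"lake" is an arbitrary catalog name and the warehouse path is a placeholder,
while the spark.sql.catalog.* key patterns follow the Iceberg docs linked
above (the iceberg-spark-runtime jar must be on the classpath):

import org.apache.spark.sql.SparkSession

// Register a DataSourceV2 catalog named "lake" backed by Iceberg's SparkCatalog.
val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
  .config("spark.sql.catalog.lake.type", "hadoop")
  .config("spark.sql.catalog.lake.warehouse", "/tmp/iceberg-warehouse")
  .getOrCreate()

// The catalog name now works as a namespace prefix in SQL.
spark.sql("CREATE NAMESPACE IF NOT EXISTS lake.db")
spark.sql("SHOW NAMESPACES IN lake").show()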

On Fri, Mar 24, 2023 at 12:36 PM Alex Cruise  wrote:

> Hey folks, please let me know this is more of a user@ post!
>
> I'm building a Spark connector for my company's data-lake-ish product, and
> it looks like there's very little documentation about how to go about it.
>
> I found ExternalCatalog a few days ago and have been implementing one of
> those, but it seems like DataSourceRegister / SupportsCatalogOptions is
> another popular approach. I'm not sure offhand how they overlap/intersect
> just yet.
>
> I've also noticed a few implementations that put some of their code in
> org.apache.spark.* packages in addition to their own; presumably this isn't
> by accident. Is this practice necessary to get around package-private
> visibility or something?
>
> Thanks!
>
> -0xe1a
>


-- 
John Zhuge


Re: Ammonite as REPL for Spark Connect

2023-03-23 Thread John Zhuge
+1 on better notebook and other REPL experience

On Thu, Mar 23, 2023 at 9:17 AM Dongjoon Hyun 
wrote:

> I also support Herman's `SPARK-42884 Add Ammonite REPL integration` PR.
>
> Thanks,
> Dongjoon.
>
>
> On Thu, Mar 23, 2023 at 7:51 AM Mridul Muralidharan 
> wrote:
>
>>
>> Sounds good, thanks for clarifying !
>>
>> Regards,
>> Mridul
>>
>> On Thu, Mar 23, 2023 at 9:09 AM Herman van Hovell 
>> wrote:
>>
>>> The goal of adding this, is to make it easy for a user to connect a
>>> scala REPL to a Spark Connect server. Just like Spark shell makes it easy
>>> to work with a regular Spark environment.
>>>
>>> It is not meant as a Spark shell replacement. They represent two
>>> different modes of working with Spark, and they have very different API
>>> surfaces (Connect being a subset of what regular Spark has to offer). I do
>>> think we should consider using ammonite for Spark shell at some point,
>>> since this has better UX and does not require us to fork a REPL. That
>>> discussion is for another day though.
>>>
>>> I guess you can use it as an example of building an integration. In
>>> itself I wouldn't call it that because I think this a key part of getting
>>> started with connect, and/or doing debugging.
>>>
>>> On Thu, Mar 23, 2023 at 4:00 AM Mridul Muralidharan 
>>> wrote:
>>>
>>>>
>>>> What is unclear to me is why we are introducing this integration, how
>>>> users will leverage it.
>>>>
>>>> * Are we replacing spark-shell with it ?
>>>> Given the existing gaps, this is not the case.
>>>>
>>>> * Is it an example to showcase how to build an integration ?
>>>> That could be interesting, and we can add it to external/
>>>>
>>>> Anything else I am missing ?
>>>>
>>>> Regards,
>>>> Mridul
>>>>
>>>>
>>>>
>>>> On Wed, Mar 22, 2023 at 6:58 PM Herman van Hovell <
>>>> her...@databricks.com> wrote:
>>>>
>>>>> Ammonite is maintained externally by Li Haoyi et al. We are including
>>>>> it as a 'provided' dependency. The integration bits and pieces (1 file) 
>>>>> are
>>>>> included in Apache Spark.
>>>>>
>>>>> On Wed, Mar 22, 2023 at 7:53 PM Mridul Muralidharan 
>>>>> wrote:
>>>>>
>>>>>>
>>>>>> Will this be maintained externally or included into Apache Spark ?
>>>>>>
>>>>>> Regards ,
>>>>>> Mridul
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Mar 22, 2023 at 6:50 PM Herman van Hovell
>>>>>>  wrote:
>>>>>>
>>>>>>> Hi All,
>>>>>>>
>>>>>>> For Spark Connect Scala Client we are working on making the REPL
>>>>>>> experience a bit nicer <https://github.com/apache/spark/pull/40515>.
>>>>>>> In a nutshell we want to give users a turn key scala REPL, that works 
>>>>>>> even
>>>>>>> if you don't have a Spark distribution on your machine (through
>>>>>>> coursier <https://get-coursier.io/>). We are using Ammonite
>>>>>>> <https://ammonite.io/> instead of the standard scala REPL for this,
>>>>>>> the main reason for going with Ammonite is that it is easier to 
>>>>>>> customize,
>>>>>>> and IMO has a superior user experience.
>>>>>>>
>>>>>>> Does anyone object to doing this?
>>>>>>>
>>>>>>> Kind regards,
>>>>>>> Herman
>>>>>>>
>>>>>>>

-- 
John Zhuge


Re: [VOTE] Release Spark 3.3.2 (RC1)

2023-02-12 Thread John Zhuge
>> > [INFO] Total time:  02:30 h
>> > [INFO] Finished at: 2023-02-11T17:32:45+01:00
>> >
>> > lør. 11. feb. 2023 kl. 06:01 skrev L. C. Hsieh :
>> >>
>> >> Please vote on releasing the following candidate as Apache Spark
>> version 3.3.2.
>> >>
>> >> The vote is open until Feb 15th 9AM (PST) and passes if a majority +1
>> >> PMC votes are cast, with a minimum of 3 +1 votes.
>> >>
>> >> [ ] +1 Release this package as Apache Spark 3.3.2
>> >> [ ] -1 Do not release this package because ...
>> >>
>> >> To learn more about Apache Spark, please see https://spark.apache.org/
>> >>
>> >> The tag to be voted on is v3.3.2-rc1 (commit
>> >> 5103e00c4ce5fcc4264ca9c4df12295d42557af6):
>> >> https://github.com/apache/spark/tree/v3.3.2-rc1
>> >>
>> >> The release files, including signatures, digests, etc. can be found at:
>> >> https://dist.apache.org/repos/dist/dev/spark/v3.3.2-rc1-bin/
>> >>
>> >> Signatures used for Spark RCs can be found in this file:
>> >> https://dist.apache.org/repos/dist/dev/spark/KEYS
>> >>
>> >> The staging repository for this release can be found at:
>> >>
>> https://repository.apache.org/content/repositories/orgapachespark-1433/
>> >>
>> >> The documentation corresponding to this release can be found at:
>> >> https://dist.apache.org/repos/dist/dev/spark/v3.3.2-rc1-docs/
>> >>
>> >> The list of bug fixes going into 3.3.2 can be found at the following
>> URL:
>> >> https://issues.apache.org/jira/projects/SPARK/versions/12352299
>> >>
>> >> This release is using the release script of the tag v3.3.2-rc1.
>> >>
>> >> FAQ
>> >>
>> >> =
>> >> How can I help test this release?
>> >> =
>> >>
>> >> If you are a Spark user, you can help us test this release by taking
>> >> an existing Spark workload and running on this release candidate, then
>> >> reporting any regressions.
>> >>
>> >> If you're working in PySpark you can set up a virtual env and install
>> >> the current RC and see if anything important breaks, in the Java/Scala
>> >> you can add the staging repository to your projects resolvers and test
>> >> with the RC (make sure to clean up the artifact cache before/after so
>> >> you don't end up building with an out of date RC going forward).
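
Concretely, for the Java/Scala path described above, a build.sbt sketch; the
resolver URL is this RC's staging repository quoted above, and the artifacts
are assumed to be staged under the plain 3.3.2 version, as Spark RCs usually
are:

// Point sbt at the RC staging repo and depend on the RC build of spark-sql.
resolvers += "Spark 3.3.2 RC1 staging" at
  "https://repository.apache.org/content/repositories/orgapachespark-1433/"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.3.2"
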
>> >>
>> >> ===
>> >> What should happen to JIRA tickets still targeting 3.3.2?
>> >> ===
>> >>
>> >> The current list of open tickets targeted at 3.3.2 can be found at:
>> >> https://issues.apache.org/jira/projects/SPARK
>> and search for "Target
>> >> Version/s" = 3.3.2
>> >>
>> >> Committers should look at those and triage. Extremely important bug
>> >> fixes, documentation, and API tweaks that impact compatibility should
>> >> be worked on immediately. Everything else please retarget to an
>> >> appropriate release.
>> >>
>> >> ==
>> >> But my bug isn't fixed?
>> >> ==
>> >>
>> >> In order to make timely releases, we will typically not hold the
>> >> release unless the bug in question is a regression from the previous
>> >> release. That being said, if there is something which is a regression
>> >> that has not been correctly targeted please ping me or a committer to
>> >> help target the issue.
>> >>
>> >> -
>> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> >>
>> >
>> >
>> > --
>> > Bjørn Jørgensen
>> > Vestre Aspehaug 4, 6010 Ålesund
>> > Norge
>> >
>> > +47 480 94 297
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>>
>>
>> --
>>
>> Bjørn Jørgensen
>> Vestre Aspehaug 4, 6010 Ålesund
>> Norge
>>
>> +47 480 94 297
>>
>>
>>
>>
>> --
>>
>> Bjørn Jørgensen
>> Vestre Aspehaug 4, 6010 Ålesund
>> Norge
>>
>> +47 480 94 297
>>
>>

-- 
John Zhuge


Re: Spark on Kube (virtua) coffee/tea/pop times

2023-02-07 Thread John Zhuge
Awesome, count me in!
PST

On Tue, Feb 7, 2023 at 3:34 PM Andrew Melo  wrote:

> I'm Central US time (AKA UTC -6:00)
>
> On Tue, Feb 7, 2023 at 5:32 PM Holden Karau  wrote:
> >
> > Awesome, I guess I should have asked folks for timezones that they’re in.
> >
> > On Tue, Feb 7, 2023 at 3:30 PM Andrew Melo 
> wrote:
> >>
> >> Hello Holden,
> >>
> >> We are interested in Spark on k8s and would like the opportunity to
> >> speak with devs about what we're looking for slash better ways to use
> >> spark.
> >>
> >> Thanks!
> >> Andrew
> >>
> >> On Tue, Feb 7, 2023 at 5:24 PM Holden Karau 
> wrote:
> >> >
> >> > Hi Folks,
> >> >
> >> > It seems like we could maybe use some additional shared context
> around Spark on Kube so I’d like to try and schedule a virtual coffee
> session.
> >> >
> >> > Who all would be interested in virtual adventures around Spark on
> Kube development?
> >> >
> >> > No pressure if the idea of hanging out in a virtual chat with coffee
> and Spark devs does not sound like your thing, just trying to make
> something informal so we can have a better understanding of everyone’s
> goals here.
> >> >
> >> > Cheers,
> >> >
> >> > Holden :)
> >> > --
> >> > Twitter: https://twitter.com/holdenkarau
> >> > Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9
> >> > YouTube Live Streams: https://www.youtube.com/user/holdenkarau
> >
> > --
> > Twitter: https://twitter.com/holdenkarau
> > Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9
> > YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
> --
John Zhuge


Re: Time for release v3.3.2

2023-01-30 Thread John Zhuge
+1 Thanks Liang-Chi for driving the release!

On Mon, Jan 30, 2023 at 10:26 PM Yuming Wang  wrote:

> +1
>
> On Tue, Jan 31, 2023 at 12:18 PM yangjie01  wrote:
>
>> +1 Thanks Liang-Chi!
>>
>>
>>
>> YangJie
>>
>>
>>
>> *From: *huaxin gao 
>> *Date: *Tuesday, January 31, 2023 10:03
>> *To: *Dongjoon Hyun 
>> *Cc: *Hyukjin Kwon , Chao Sun ,
>> "L. C. Hsieh" , Spark dev list 
>> *Subject: *Re: Time for release v3.3.2
>>
>>
>>
>> +1 Thanks Liang-Chi!
>>
>>
>>
>> On Mon, Jan 30, 2023 at 6:01 PM Dongjoon Hyun 
>> wrote:
>>
>> +1
>>
>>
>>
>> Thank you so much, Liang-Chi.
>>
>> 3.3.2 release will help 3.4.0 release too because they share many bug
>> fixes.
>>
>>
>>
>> Dongjoon
>>
>>
>>
>>
>>
>> On Mon, Jan 30, 2023 at 5:56 PM Hyukjin Kwon  wrote:
>>
>> +100!
>>
>>
>>
>> On Tue, 31 Jan 2023 at 10:54, Chao Sun  wrote:
>>
>> +1, thanks Liang-Chi for volunteering!
>>
>> Chao
>>
>> On Mon, Jan 30, 2023 at 5:51 PM L. C. Hsieh  wrote:
>> >
>> > Hi Spark devs,
>> >
>> > As you know, it has been 4 months since Spark 3.3.1 was released on
>> > 2022/10, it seems a good time to think about next maintenance release,
>> > i.e. Spark 3.3.2.
>> >
>> > I'm thinking of the release of Spark 3.3.2 this Feb (2023/02).
>> >
>> > What do you think?
>> >
>> > I am willing to volunteer for Spark 3.3.2 if there is consensus about
>> > this maintenance release.
>> >
>> > Thank you.
>> >
>> > -
>> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> >
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>

-- 
John Zhuge


Re: Welcome Yikun Jiang as a Spark committer

2022-10-09 Thread John Zhuge
Congratulations, Yikun!

On Sun, Oct 9, 2022 at 8:52 PM Senthil Kumar  wrote:

> Congratulations Yikun
>
> On Mon, 10 Oct 2022, 09:11 Xiao Li,  wrote:
>
>> Congratulations, Yikun!
>>
>> Xiao
>>
>> Yikun Jiang  wrote on Sun, Oct 9, 2022 at 19:34:
>>
>>> Thank you all!
>>>
>>> Regards,
>>> Yikun
>>>
>>>
>>> On Mon, Oct 10, 2022 at 3:18 AM Chao Sun  wrote:
>>>
>>>> Congratulations Yikun!
>>>>
>>>> On Sun, Oct 9, 2022 at 11:14 AM vaquar khan 
>>>> wrote:
>>>>
>>>>> Congratulations.
>>>>>
>>>>> Regards,
>>>>> Vaquar khan
>>>>>
>>>>> On Sun, Oct 9, 2022, 6:46 AM 叶先进  wrote:
>>>>>
>>>>>> Congrats
>>>>>>
>>>>>> On Oct 9, 2022, at 16:44, XiDuo You  wrote:
>>>>>>
>>>>>> Congratulations, Yikun !
>>>>>>
>>>>>> Maxim Gekk  wrote on Sun, Oct 9, 2022 at 15:59:
>>>>>>
>>>>>>> Keep up the great work, Yikun!
>>>>>>>
>>>>>>> On Sun, Oct 9, 2022 at 10:52 AM Gengliang Wang 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Congratulations, Yikun!
>>>>>>>>
>>>>>>>> On Sun, Oct 9, 2022 at 12:33 AM 416161...@qq.com <
>>>>>>>> ruife...@foxmail.com> wrote:
>>>>>>>>
>>>>>>>>> Congrats, Yikun!
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Ruifeng Zheng
>>>>>>>>> ruife...@foxmail.com
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> -- Original --
>>>>>>>>> *From:* "Martin Grigorov" ;
>>>>>>>>> *Date:* Sun, Oct 9, 2022 05:01 AM
>>>>>>>>> *To:* "Hyukjin Kwon";
>>>>>>>>> *Cc:* "dev";"Yikun Jiang"<
>>>>>>>>> yikunk...@gmail.com>;
>>>>>>>>> *Subject:* Re: Welcome Yikun Jiang as a Spark committer
>>>>>>>>>
>>>>>>>>> Congratulations, Yikun!
>>>>>>>>>
>>>>>>>>> On Sat, Oct 8, 2022 at 7:41 AM Hyukjin Kwon 
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi all,
>>>>>>>>>>
>>>>>>>>>> The Spark PMC recently added Yikun Jiang as a committer on the
>>>>>>>>>> project.
>>>>>>>>>> Yikun is the major contributor of the infrastructure and GitHub
>>>>>>>>>> Actions in Apache Spark as well as Kubernates and PySpark.
>>>>>>>>>> He has put a lot of effort into stabilizing and optimizing the
>>>>>>>>>> builds so we all can work together in Apache Spark more
>>>>>>>>>> efficiently and effectively. He's also driving the SPIP for
>>>>>>>>>> Docker official image in Apache Spark as well for users and 
>>>>>>>>>> developers.
>>>>>>>>>> Please join me in welcoming Yikun!
>>>>>>>>>>
>>>>>>>>>>
>>>>>> --
John Zhuge


Re: Time for Spark 3.3.1 release?

2022-09-12 Thread John Zhuge
+1

On Mon, Sep 12, 2022 at 9:08 PM Yang,Jie(INF)  wrote:

> +1
>
>
>
> Thanks Yuming ~
>
>
>
> *From: *Hyukjin Kwon 
> *Date: *Tuesday, September 13, 2022 08:19
> *To: *Gengliang Wang 
> *Cc: *"L. C. Hsieh" , Dongjoon Hyun <
> dongjoon.h...@gmail.com>, Yuming Wang , dev <
> dev@spark.apache.org>
> *Subject: *Re: Time for Spark 3.3.1 release?
>
>
>
> +1
>
>
>
> On Tue, 13 Sept 2022 at 06:45, Gengliang Wang  wrote:
>
> +1.
>
> Thank you, Yuming!
>
>
>
> On Mon, Sep 12, 2022 at 12:10 PM L. C. Hsieh  wrote:
>
> +1
>
> Thanks Yuming!
>
> On Mon, Sep 12, 2022 at 11:50 AM Dongjoon Hyun 
> wrote:
> >
> > +1
> >
> > Thanks,
> > Dongjoon.
> >
> > On Mon, Sep 12, 2022 at 6:38 AM Yuming Wang  wrote:
> >>
> >> Hi, All.
> >>
> >>
> >>
> >> Since Apache Spark 3.3.0 tag creation (Jun 10), new 138 patches
> including 7 correctness patches arrived at branch-3.3.
> >>
> >>
> >>
> >> Shall we make a new release, Apache Spark 3.3.1, as the second release
> at branch-3.3? I'd like to volunteer as the release manager for Apache
> Spark 3.3.1.
> >>
> >>
> >>
> >> All changes:
> >>
> >> https://github.com/apache/spark/compare/v3.3.0...branch-3.3
> >>
> >>
> >>
> >> Correctness issues:
> >>
> >> SPARK-40149: Propagate metadata columns through Project
> >>
> >> SPARK-40002: Don't push down limit through window using ntile
> >>
> >> SPARK-39976: ArrayIntersect should handle null in left expression
> correctly
> >>
> >> SPARK-39833: Disable Parquet column index in DSv1 to fix a correctness
> issue in the case of overlapping partition and data columns
> >>
> >> SPARK-39061: Set nullable correctly for Inline output attributes
> >>
> >> SPARK-39887: RemoveRedundantAliases should keep aliases that make the
> output of projection nodes unique
> >>
> >> SPARK-38614: Don't push down limit through window that's using
> percent_rank
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
> --
John Zhuge


Re: Apache Spark 3.2.2 Release?

2022-07-06 Thread John Zhuge
+1  Thanks for the effort!

On Wed, Jul 6, 2022 at 2:23 PM Bjørn Jørgensen 
wrote:

> +1
>
> ons. 6. jul. 2022, 23:05 skrev Hyukjin Kwon :
>
>> Yeah +1
>>
>> On Thu, Jul 7, 2022 at 5:40 AM Dongjoon Hyun 
>> wrote:
>>
>>> Hi, All.
>>>
>>> Since Apache Spark 3.2.1 tag creation (Jan 19), new 197 patches
>>> including 11 correctness patches arrived at branch-3.2.
>>>
>>> Shall we make a new release, Apache Spark 3.2.2, as the third release
>>> at 3.2 line? I'd like to volunteer as the release manager for Apache
>>> Spark 3.2.2. I'm thinking about starting the first RC next week.
>>>
>>> $ git log --oneline v3.2.1..HEAD | wc -l
>>>  197
>>>
>>> # Correctness issues
>>>
>>> SPARK-38075 Hive script transform with order by and limit will
>>> return fake rows
>>> SPARK-38204 All state operators are at a risk of inconsistency
>>> between state partitioning and operator partitioning
>>> SPARK-38309 SHS has incorrect percentiles for shuffle read bytes
>>> and shuffle total blocks metrics
>>> SPARK-38320 (flat)MapGroupsWithState can timeout groups which just
>>> received inputs in the same microbatch
>>> SPARK-38614 After Spark update, df.show() shows incorrect
>>> F.percent_rank results
>>> SPARK-38655 OffsetWindowFunctionFrameBase cannot find the offset
>>> row whose input is not null
>>> SPARK-38684 Stream-stream outer join has a possible correctness
>>> issue due to weakly read consistent on outer iterators
>>> SPARK-39061 Incorrect results or NPE when using Inline function
>>> against an array of dynamically created structs
>>> SPARK-39107 Silent change in regexp_replace's handling of empty
>>> strings
>>> SPARK-39259 Timestamps returned by now() and equivalent functions
>>> are not consistent in subqueries
>>> SPARK-39293 The accumulator of ArrayAggregate should copy the
>>> intermediate result if string, struct, array, or map
>>>
>>> Best,
>>> Dongjoon.
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>> --
John Zhuge


Re: [VOTE] SPIP: Catalog API for view metadata

2022-02-23 Thread John Zhuge
Holden has graciously agreed to shepherd the SPIP. Thanks!

On Thu, Feb 10, 2022 at 9:19 AM John Zhuge  wrote:

> The vote is now closed and the vote passes. Thank you to everyone who took
> the time to review and vote on this SPIP. I’m looking forward to adding
> this feature to the next Spark release. The tracking JIRA is
> https://issues.apache.org/jira/browse/SPARK-31357.
>
> The tally is:
>
> +1s:
>
> Walaa Eldin Moustafa
> Erik Krogen
> Holden Karau (binding)
> Ryan Blue
> Chao Sun
> L C Hsieh (binding)
> Huaxin Gao
> Yufei Gu
> Terry Kim
> Jacky Lee
> Wenchen Fan (binding)
>
> 0s:
>
> -1s:
>
> On Mon, Feb 7, 2022 at 10:04 PM Wenchen Fan  wrote:
>
>> +1 (binding)
>>
>> On Sun, Feb 6, 2022 at 10:27 AM Jacky Lee  wrote:
>>
>>> +1 (non-binding). Thanks John!
>>> It's great to see ViewCatalog moving on, it's a nice feature.
>>>
>>>> Terry Kim  wrote on Sat, Feb 5, 2022 at 11:57:
>>>
>>>> +1 (non-binding). Thanks John!
>>>>
>>>> Terry
>>>>
>>>> On Fri, Feb 4, 2022 at 4:13 PM Yufei Gu  wrote:
>>>>
>>>>> +1 (non-binding)
>>>>> Best,
>>>>>
>>>>> Yufei
>>>>>
>>>>> `This is not a contribution`
>>>>>
>>>>>
>>>>> On Fri, Feb 4, 2022 at 11:54 AM huaxin gao 
>>>>> wrote:
>>>>>
>>>>>> +1 (non-binding)
>>>>>>
>>>>>> On Fri, Feb 4, 2022 at 11:40 AM L. C. Hsieh  wrote:
>>>>>>
>>>>>>> +1
>>>>>>>
>>>>>>> On Thu, Feb 3, 2022 at 7:25 PM Chao Sun  wrote:
>>>>>>> >
>>>>>>> > +1 (non-binding). Looking forward to this feature!
>>>>>>> >
>>>>>>> > On Thu, Feb 3, 2022 at 2:32 PM Ryan Blue  wrote:
>>>>>>> >>
>>>>>>> >> +1 for the SPIP. I think it's well designed and it has worked
>>>>>>> quite well at Netflix for a long time.
>>>>>>> >>
>>>>>>> >> On Thu, Feb 3, 2022 at 2:04 PM John Zhuge 
>>>>>>> wrote:
>>>>>>> >>>
>>>>>>> >>> Hi Spark community,
>>>>>>> >>>
>>>>>>> >>> I’d like to restart the vote for the ViewCatalog design proposal
>>>>>>> (SPIP).
>>>>>>> >>>
>>>>>>> >>> The proposal is to add a ViewCatalog interface that can be used
>>>>>>> to load, create, alter, and drop views in DataSourceV2.
>>>>>>> >>>
>>>>>>> >>> Please vote on the SPIP until Feb. 9th (Wednesday).
>>>>>>> >>>
>>>>>>> >>> [ ] +1: Accept the proposal as an official SPIP
>>>>>>> >>> [ ] +0
>>>>>>> >>> [ ] -1: I don’t think this is a good idea because …
>>>>>>> >>>
>>>>>>> >>> Thanks!
>>>>>>> >>
>>>>>>> >>
>>>>>>> >>
>>>>>>> >> --
>>>>>>> >> Ryan Blue
>>>>>>> >> Tabular
>>>>>>>
>>>>>>> -
>>>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>>>>
>>>>>>>
>
> --
> John Zhuge
>


-- 
John Zhuge


Re: [VOTE] Spark 3.1.3 RC4

2022-02-17 Thread John Zhuge
+1 (non-binding)

On Wed, Feb 16, 2022 at 10:06 AM Mridul Muralidharan 
wrote:

>
> +1
>
> Signatures, digests, etc check out fine.
> Checked out tag and build/tested with -Pyarn -Pmesos -Pkubernetes
>
> Regards,
> Mridul
>
>
> On Wed, Feb 16, 2022 at 8:32 AM Thomas graves  wrote:
>
>> +1
>>
>> Tom
>>
>> On Mon, Feb 14, 2022 at 2:55 PM Holden Karau 
>> wrote:
>> >
>> > Please vote on releasing the following candidate as Apache Spark
>> version 3.1.3.
>> >
>> > The vote is open until Feb. 18th at 1 PM pacific (9 PM GMT) and passes
>> if a majority
>> > +1 PMC votes are cast, with a minimum of 3 + 1 votes.
>> >
>> > [ ] +1 Release this package as Apache Spark 3.1.3
>> > [ ] -1 Do not release this package because ...
>> >
>> > To learn more about Apache Spark, please see http://spark.apache.org/
>> >
>> > There are currently no open issues targeting 3.1.3 in Spark's JIRA
>> https://issues.apache.org/jira/browse
>> > (try project = SPARK AND "Target Version/s" = "3.1.3" AND status in
>> (Open, Reopened, "In Progress"))
>> > at https://s.apache.org/n79dw
>> >
>> >
>> >
>> > The tag to be voted on is v3.1.3-rc4 (commit
>> > d1f8a503a26bcfb4e466d9accc5fa241a7933667):
>> > https://github.com/apache/spark/tree/v3.1.3-rc4
>> >
>> > The release files, including signatures, digests, etc. can be found at:
>> > https://dist.apache.org/repos/dist/dev/spark/v3.1.3-rc4-bin/
>> >
>> > Signatures used for Spark RCs can be found in this file:
>> > https://dist.apache.org/repos/dist/dev/spark/KEYS
>> >
>> > The staging repository for this release can be found at
>> > https://repository.apache.org/content/repositories/orgapachespark-1401
>> >
>> > The documentation corresponding to this release can be found at:
>> > https://dist.apache.org/repos/dist/dev/spark/v3.1.3-rc4-docs/
>> >
>> > The list of bug fixes going into 3.1.3 can be found at the following
>> URL:
>> > https://s.apache.org/x0q9b
>> >
>> > This release is using the release script from 3.1.3
>> > The release docker container was rebuilt since the previous version
>> didn't have the necessary components to build the R documentation.
>> >
>> > FAQ
>> >
>> >
>> > =
>> > How can I help test this release?
>> > =
>> >
>> > If you are a Spark user, you can help us test this release by taking
>> > an existing Spark workload and running on this release candidate, then
>> > reporting any regressions.
>> >
>> > If you're working in PySpark you can set up a virtual env and install
>> > the current RC and see if anything important breaks, in the Java/Scala
>> > you can add the staging repository to your projects resolvers and test
>> > with the RC (make sure to clean up the artifact cache before/after so
>> > you don't end up building with an out of date RC going forward).
>> >
>> > ===
>> > What should happen to JIRA tickets still targeting 3.1.3?
>> > ===
>> >
>> > The current list of open tickets targeted at 3.1.3 can be found at:
>> > https://issues.apache.org/jira/projects/SPARK and search for "Target
>> > Version/s" = 3.1.3
>> >
>> > Committers should look at those and triage. Extremely important bug
>> > fixes, documentation, and API tweaks that impact compatibility should
>> > be worked on immediately. Everything else please retarget to an
>> > appropriate release.
>> >
>> > ==
>> > But my bug isn't fixed?
>> > ==
>> >
>> > In order to make timely releases, we will typically not hold the
>> > release unless the bug in question is a regression from the previous
>> > release. That being said, if there is something that is a regression
>> > that has not been correctly targeted please ping me or a committer to
>> > help target the issue.
>> >
>> > Note: I added an extra day to the vote since I know some folks are
>> likely busy on the 14th with partner(s).
>> >
>> >
>> > --
>> > Twitter: https://twitter.com/holdenkarau
>> > Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9
>> > YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>

-- 
John Zhuge


Re: [VOTE] SPIP: Catalog API for view metadata

2022-02-10 Thread John Zhuge
The vote is now closed and the vote passes. Thank you to everyone who took
the time to review and vote on this SPIP. I’m looking forward to adding
this feature to the next Spark release. The tracking JIRA is
https://issues.apache.org/jira/browse/SPARK-31357.

The tally is:

+1s:

Walaa Eldin Moustafa
Erik Krogen
Holden Karau (binding)
Ryan Blue
Chao Sun
L C Hsieh (binding)
Huaxin Gao
Yufei Gu
Terry Kim
Jacky Lee
Wenchen Fan (binding)

0s:

-1s:

On Mon, Feb 7, 2022 at 10:04 PM Wenchen Fan  wrote:

> +1 (binding)
>
> On Sun, Feb 6, 2022 at 10:27 AM Jacky Lee  wrote:
>
>> +1 (non-binding). Thanks John!
>> It's great to see ViewCatalog moving on, it's a nice feature.
>>
>>> Terry Kim  wrote on Sat, Feb 5, 2022 at 11:57:
>>
>>> +1 (non-binding). Thanks John!
>>>
>>> Terry
>>>
>>> On Fri, Feb 4, 2022 at 4:13 PM Yufei Gu  wrote:
>>>
>>>> +1 (non-binding)
>>>> Best,
>>>>
>>>> Yufei
>>>>
>>>> `This is not a contribution`
>>>>
>>>>
>>>> On Fri, Feb 4, 2022 at 11:54 AM huaxin gao 
>>>> wrote:
>>>>
>>>>> +1 (non-binding)
>>>>>
>>>>> On Fri, Feb 4, 2022 at 11:40 AM L. C. Hsieh  wrote:
>>>>>
>>>>>> +1
>>>>>>
>>>>>> On Thu, Feb 3, 2022 at 7:25 PM Chao Sun  wrote:
>>>>>> >
>>>>>> > +1 (non-binding). Looking forward to this feature!
>>>>>> >
>>>>>> > On Thu, Feb 3, 2022 at 2:32 PM Ryan Blue  wrote:
>>>>>> >>
>>>>>> >> +1 for the SPIP. I think it's well designed and it has worked
>>>>>> quite well at Netflix for a long time.
>>>>>> >>
>>>>>> >> On Thu, Feb 3, 2022 at 2:04 PM John Zhuge 
>>>>>> wrote:
>>>>>> >>>
>>>>>> >>> Hi Spark community,
>>>>>> >>>
>>>>>> >>> I’d like to restart the vote for the ViewCatalog design proposal
>>>>>> (SPIP).
>>>>>> >>>
>>>>>> >>> The proposal is to add a ViewCatalog interface that can be used
>>>>>> to load, create, alter, and drop views in DataSourceV2.
>>>>>> >>>
>>>>>> >>> Please vote on the SPIP until Feb. 9th (Wednesday).
>>>>>> >>>
>>>>>> >>> [ ] +1: Accept the proposal as an official SPIP
>>>>>> >>> [ ] +0
>>>>>> >>> [ ] -1: I don’t think this is a good idea because …
>>>>>> >>>
>>>>>> >>> Thanks!
>>>>>> >>
>>>>>> >>
>>>>>> >>
>>>>>> >> --
>>>>>> >> Ryan Blue
>>>>>> >> Tabular
>>>>>>
>>>>>> -
>>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>>>
>>>>>>

-- 
John Zhuge


[VOTE] SPIP: Catalog API for view metadata

2022-02-03 Thread John Zhuge
Hi Spark community,

I’d like to restart the vote for the ViewCatalog design proposal (SPIP).

The proposal is to add a ViewCatalog interface that can be used to load,
create, alter, and drop views in DataSourceV2.
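
As a rough illustration of the shape of the proposal, a Scala sketch follows;
the method names are illustrative only, not the SPIP's actual signatures (see
the SPIP doc for the real interface):

import org.apache.spark.sql.connector.catalog.{CatalogPlugin, Identifier}
import org.apache.spark.sql.types.StructType

// Sketch only: a catalog plugin that exposes view metadata alongside the
// existing TableCatalog, so engines can resolve views without Hive-specific code.
trait View {
  def name: String
  def sql: String                                   // the view definition text
  def schema: StructType
  def properties: java.util.Map[String, String]
}

trait ViewCatalog extends CatalogPlugin {
  def listViews(namespace: Array[String]): Array[Identifier]
  def loadView(ident: Identifier): View
  def createView(
      ident: Identifier,
      sql: String,
      schema: StructType,
      properties: java.util.Map[String, String]): View
  def alterView(ident: Identifier, properties: java.util.Map[String, String]): View
  def dropView(ident: Identifier): Boolean
}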

Please vote on the SPIP until Feb. 9th (Wednesday).

[ ] +1: Accept the proposal as an official SPIP
[ ] +0
[ ] -1: I don’t think this is a good idea because …

Thanks!


Re: [VOTE] SPIP: Catalog API for view metadata

2022-02-03 Thread John Zhuge
Sure Xiao.

Happy Lunar New Year!

On Thu, Feb 3, 2022 at 1:57 PM Xiao Li  wrote:

> Can we extend the voting window to next Wednesday? This week is a holiday
> week for the lunar new year. AFAIK, many members in Asia are taking the
> whole week off. They might not regularly check the emails.
>
> Also how about starting a separate email thread starting with [VOTE] ?
>
> Happy Lunar New Year!!!
>
> Xiao
>
> Holden Karau  wrote on Thu, Feb 3, 2022 at 12:28:
>
>> +1 (binding)
>>
>> On Thu, Feb 3, 2022 at 2:26 PM Erik Krogen  wrote:
>>
>>> +1 (non-binding)
>>>
>>> Really looking forward to having this natively supported by Spark, so
>>> that we can get rid of our own hacks to tie in a custom view catalog
>>> implementation. I appreciate the care John has put into various parts of
>>> the design and believe this will provide a robust and flexible solution to
>>> this problem faced by various large-scale Spark users.
>>>
>>> Thanks John!
>>>
>>> On Thu, Feb 3, 2022 at 11:22 AM Walaa Eldin Moustafa <
>>> wa.moust...@gmail.com> wrote:
>>>
>>>> +1
>>>>
>>>> On Thu, Feb 3, 2022 at 11:19 AM John Zhuge  wrote:
>>>>
>>>>> Hi Spark community,
>>>>>
>>>>> I’d like to restart the vote for the ViewCatalog design proposal (SPIP
>>>>> <https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing>
>>>>> ).
>>>>>
>>>>> The proposal is to add a ViewCatalog interface that can be used to
>>>>> load, create, alter, and drop views in DataSourceV2.
>>>>>
>>>>> Please vote on the SPIP in the next 72 hours. Once it is approved,
>>>>> I’ll update the PR <https://github.com/apache/spark/pull/28147> for
>>>>> review.
>>>>>
>>>>> [ ] +1: Accept the proposal as an official SPIP
>>>>> [ ] +0
>>>>> [ ] -1: I don’t think this is a good idea because …
>>>>>
>>>>> Thanks!
>>>>>
>>>>> On Fri, Jun 4, 2021 at 1:46 PM Walaa Eldin Moustafa <
>>>>> wa.moust...@gmail.com> wrote:
>>>>>
>>>>>> Considering the API aspect, the ViewCatalog API sounds like a good
>>>>>> idea. A view catalog will enable us to integrate Coral
>>>>>> <https://engineering.linkedin.com/blog/2020/coral> (our view SQL
>>>>>> translation and management layer) very cleanly to Spark. Currently we can
>>>>>> only do it by maintaining our special version of the
>>>>>> HiveExternalCatalog. Considering that views can be expanded
>>>>>> syntactically without necessarily invoking the analyzer, using a 
>>>>>> dedicated
>>>>>> view API can make performance better if performance is the concern.
>>>>>> Further, a catalog can still be both a table and view provider if it
>>>>>> chooses to based on this design, so I do not think we necessarily lose 
>>>>>> the
>>>>>> ability of providing both. Looking forward to more discussions on this 
>>>>>> and
>>>>>> making views a powerful tool in Spark.
>>>>>>
>>>>>> Thanks,
>>>>>> Walaa.
>>>>>>
>>>>>>
>>>>>> On Wed, May 26, 2021 at 9:54 AM John Zhuge  wrote:
>>>>>>
>>>>>>> Looks like we are running in circles. Should we have an online
>>>>>>> meeting to get this sorted out?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> John
>>>>>>>
>>>>>>> On Wed, May 26, 2021 at 12:01 AM Wenchen Fan 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> OK, then I'd vote for TableViewCatalog, because
>>>>>>>> 1. This is how Hive catalog works, and we need to migrate Hive
>>>>>>>> catalog to the v2 API sooner or later.
>>>>>>>> 2. Because of 1, TableViewCatalog is easy to support in the current
>>>>>>>> table/view resolution framework.
>>>>>>>> 3. It's better to avoid name conflicts between table and views at
>>>>>>>> the API level, instead of relying on the catalog implementation.
>>>>>>>> 4. Caching invalidation is always a tricky problem.
>

Re: [VOTE] SPIP: Catalog API for view metadata

2022-02-03 Thread John Zhuge
Hi Spark community,

I’d like to restart the vote for the ViewCatalog design proposal (SPIP
<https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing>
).

The proposal is to add a ViewCatalog interface that can be used to load,
create, alter, and drop views in DataSourceV2.

Please vote on the SPIP in the next 72 hours. Once it is approved, I’ll
update the PR <https://github.com/apache/spark/pull/28147> for review.

[ ] +1: Accept the proposal as an official SPIP
[ ] +0
[ ] -1: I don’t think this is a good idea because …

Thanks!

On Fri, Jun 4, 2021 at 1:46 PM Walaa Eldin Moustafa 
wrote:

> Considering the API aspect, the ViewCatalog API sounds like a good idea. A
> view catalog will enable us to integrate Coral
> <https://engineering.linkedin.com/blog/2020/coral> (our view SQL
> translation and management layer) very cleanly to Spark. Currently we can
> only do it by maintaining our special version of the HiveExternalCatalog.
> Considering that views can be expanded syntactically without necessarily
> invoking the analyzer, using a dedicated view API can make performance
> better if performance is the concern. Further, a catalog can still be both
> a table and view provider if it chooses to based on this design, so I do
> not think we necessarily lose the ability of providing both. Looking
> forward to more discussions on this and making views a powerful tool in
> Spark.
>
> Thanks,
> Walaa.
>
>
> On Wed, May 26, 2021 at 9:54 AM John Zhuge  wrote:
>
>> Looks like we are running in circles. Should we have an online meeting to
>> get this sorted out?
>>
>> Thanks,
>> John
>>
>> On Wed, May 26, 2021 at 12:01 AM Wenchen Fan  wrote:
>>
>>> OK, then I'd vote for TableViewCatalog, because
>>> 1. This is how Hive catalog works, and we need to migrate Hive catalog
>>> to the v2 API sooner or later.
>>> 2. Because of 1, TableViewCatalog is easy to support in the current
>>> table/view resolution framework.
>>> 3. It's better to avoid name conflicts between table and views at the
>>> API level, instead of relying on the catalog implementation.
>>> 4. Caching invalidation is always a tricky problem.
>>>
>>> On Tue, May 25, 2021 at 3:09 AM Ryan Blue 
>>> wrote:
>>>
>>>> I don't think that it makes sense to discuss a different approach in
>>>> the PR rather than in the vote. Let's discuss this now since that's the
>>>> purpose of an SPIP.
>>>>
>>>> On Mon, May 24, 2021 at 11:22 AM John Zhuge  wrote:
>>>>
>>>>> Hi everyone, I’d like to start a vote for the ViewCatalog design
>>>>> proposal (SPIP).
>>>>>
>>>>> The proposal is to add a ViewCatalog interface that can be used to
>>>>> load, create, alter, and drop views in DataSourceV2.
>>>>>
>>>>> The full SPIP doc is here:
>>>>> https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing
>>>>>
>>>>> Please vote on the SPIP in the next 72 hours. Once it is approved,
>>>>> I’ll update the PR for review.
>>>>>
>>>>> [ ] +1: Accept the proposal as an official SPIP
>>>>> [ ] +0
>>>>> [ ] -1: I don’t think this is a good idea because …
>>>>>
>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>>
>>>
>>
>> --
>> John Zhuge
>>
>

-- 
John Zhuge


Re: [VOTE] Release Spark 3.2.1 (RC2)

2022-01-24 Thread John Zhuge
+1 (non-binding)

On Mon, Jan 24, 2022 at 2:28 PM Cheng Su  wrote:

> +1 (non-binding)
>
>
>
> Cheng Su
>
>
>
> *From: *Chao Sun 
> *Date: *Monday, January 24, 2022 at 2:10 PM
> *To: *Michael Heuer 
> *Cc: *dev 
> *Subject: *Re: [VOTE] Release Spark 3.2.1 (RC2)
>
> +1 (non-binding)
>
>
>
> On Mon, Jan 24, 2022 at 6:32 AM Michael Heuer  wrote:
>
> +1 (non-binding)
>
>
>
>michael
>
>
>
>
>
> On Jan 24, 2022, at 7:30 AM, Gengliang Wang  wrote:
>
>
>
> +1 (non-binding)
>
>
>
> On Mon, Jan 24, 2022 at 6:26 PM Dongjoon Hyun 
> wrote:
>
> +1
>
>
>
> Dongjoon.
>
>
>
> On Sat, Jan 22, 2022 at 7:19 AM Mridul Muralidharan 
> wrote:
>
>
>
> +1
>
>
>
> Signatures, digests, etc check out fine.
>
> Checked out tag and build/tested with -Pyarn -Pmesos -Pkubernetes
>
>
>
> Regards,
>
> Mridul
>
>
>
> On Fri, Jan 21, 2022 at 9:01 PM Sean Owen  wrote:
>
> +1 with same result as last time.
>
>
>
> On Thu, Jan 20, 2022 at 9:59 PM huaxin gao  wrote:
>
> Please vote on releasing the following candidate as Apache Spark version
> 3.2.1. The vote is open until 8:00pm Pacific time January 25 and passes if
> a majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.2.1
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v3.2.1-rc2 (commit
> 4f25b3f71238a00508a356591553f2dfa89f8290):
> https://github.com/apache/spark/tree/v3.2.1-rc2
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.2.1-rc2-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1398/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.2.1-rc2-docs/_site/
>
> The list of bug fixes going into 3.2.1 can be found at the following URL:
> https://s.apache.org/yu0cy
>
> This release is using the release script of the tag v3.2.1-rc2.
>
> FAQ
>
> =========================
> How can I help test this release?
> =========================
>
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install the
> current RC and see if anything important breaks, in the Java/Scala you can
> add the staging repository to your projects resolvers and test with the RC
> (make sure to clean up the artifact cache before/after so you don't end up
> building with an out of date RC going forward).
>
> ===========================================================
> What should happen to JIRA tickets still targeting 3.2.1?
> ===========================================================
>
> The current list of open tickets targeted at 3.2.1 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.2.1
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==========================
> But my bug isn't fixed?
> ==========================
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>
>
>
>

-- 
John Zhuge
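
As a concrete illustration of the "add the staging repository to your project's resolvers" step in the RC instructions above, an sbt project could point at the staging repository roughly as follows. This is only a sketch: the repository URL comes from the vote email, while the Scala version and the dependency list are assumptions.

```scala
// build.sbt (sketch only): resolver URL taken from the RC2 vote email above;
// Scala version and module selection are illustrative assumptions.
ThisBuild / scalaVersion := "2.12.15"

// Point sbt at the Apache staging repository holding the RC artifacts.
resolvers += "Apache Spark 3.2.1 RC2 staging" at
  "https://repository.apache.org/content/repositories/orgapachespark-1398/"

// Build and run an existing workload against the candidate version.
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.2.1"
```

As the email notes, clear the locally cached RC artifacts (for example the staged org.apache.spark entries under ~/.ivy2 or ~/.m2) before and after testing so later builds do not pick up a stale RC.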


Re: [VOTE][SPIP] Support Customized Kubernetes Schedulers Proposal

2022-01-06 Thread John Zhuge
+1 (non-binding)

On Thu, Jan 6, 2022 at 8:39 AM Chenya Zhang 
wrote:

> +1 (non-binding)
>
> On Thu, Jan 6, 2022 at 1:30 AM Mich Talebzadeh 
> wrote:
>
>> +1 (non-binding)
>>
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Thu, 6 Jan 2022 at 07:03, bo yang  wrote:
>>
>>> +1 (non-binding)
>>>
>>> On Wed, Jan 5, 2022 at 11:01 PM Holden Karau 
>>> wrote:
>>>
>>>> +1 (binding)
>>>>
>>>> On Wed, Jan 5, 2022 at 5:31 PM William Wang 
>>>> wrote:
>>>>
>>>>> +1 (non-binding)
>>>>>
>>>>> Yikun Jiang  于2022年1月6日周四 09:07写道:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I’d like to start a vote for SPIP: "Support Customized Kubernetes
>>>>>> Schedulers Proposal"
>>>>>>
>>>>>> The SPIP is to support customized Kubernetes schedulers in Spark on
>>>>>> Kubernetes.
>>>>>>
>>>>>> Please also refer to:
>>>>>>
>>>>>> - Previous discussion in dev mailing list: [DISCUSSION] SPIP:
>>>>>> Support Volcano/Alternative Schedulers Proposal
>>>>>> <https://lists.apache.org/thread/zv3o62xrob4dvgkbftbv5w5wy75hkbxg>
>>>>>> - Design doc: [SPIP] Spark-36057 Support Customized Kubernetes
>>>>>> Schedulers Proposal
>>>>>> <https://docs.google.com/document/d/1xgQGRpaHQX6-QH_J9YV2C2Dh6RpXefUpLM7KGkzL6Fg>
>>>>>> - JIRA: SPARK-36057
>>>>>> <https://issues.apache.org/jira/browse/SPARK-36057>
>>>>>>
>>>>>> Please vote on the SPIP:
>>>>>>
>>>>>> [ ] +1: Accept the proposal as an official SPIP
>>>>>> [ ] +0
>>>>>> [ ] -1: I don’t think this is a good idea because …
>>>>>>
>>>>>> Regards,
>>>>>> Yikun
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>> Twitter: https://twitter.com/holdenkarau
>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>
>>>>
>>>> --
John Zhuge


Re: Time for Spark 3.2.1?

2022-01-04 Thread John Zhuge
+1 thanks

On Tue, Jan 4, 2022 at 7:59 AM huaxin gao  wrote:

> Happy New Year, everyone!
>
> I will start preparing for Spark 3.2.1 release. I plan to do the branch
> cut on Friday 1/7. Please let me know if there are any issues I need to be
> aware of.
>
> Thanks,
> Huaxin
>
>
> On Tue, Dec 7, 2021 at 11:03 PM Jungtaek Lim 
> wrote:
>
>> +1 for both releases and the time!
>>
>> On Wed, Dec 8, 2021 at 3:46 PM Mridul Muralidharan 
>> wrote:
>>
>>>
>>> +1 for maintenance release, and also +1 for doing this in Jan !
>>>
>>> Thanks,
>>> Mridul
>>>
>>> On Tue, Dec 7, 2021 at 11:41 PM Gengliang Wang  wrote:
>>>
>>>> +1 for new maintenance releases for all 3.x branches as well.
>>>>
>>>> On Wed, Dec 8, 2021 at 8:19 AM Hyukjin Kwon 
>>>> wrote:
>>>>
>>>>> SGTM!
>>>>>
>>>>> On Wed, 8 Dec 2021 at 09:07, huaxin gao 
>>>>> wrote:
>>>>>
>>>>>> I prefer to start rolling the release in January if there is no need
>>>>>> to publish it sooner :)
>>>>>>
>>>>>> On Tue, Dec 7, 2021 at 3:59 PM Hyukjin Kwon 
>>>>>> wrote:
>>>>>>
>>>>>>> Oh BTW, I realised that it's a holiday season soon this month
>>>>>>> including Christmas and new year.
>>>>>>> Shall we maybe start rolling the release around next January? I
>>>>>>> would leave it to @huaxin gao  :-).
>>>>>>>
>>>>>>> On Wed, 8 Dec 2021 at 06:19, Dongjoon Hyun 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> +1 for new releases.
>>>>>>>>
>>>>>>>> Dongjoon.
>>>>>>>>
>>>>>>>> On Mon, Dec 6, 2021 at 8:51 PM Wenchen Fan 
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> +1 to make new maintenance releases for all 3.x branches.
>>>>>>>>>
>>>>>>>>> On Tue, Dec 7, 2021 at 8:57 AM Sean Owen  wrote:
>>>>>>>>>
>>>>>>>>>> Always fine by me if someone wants to roll a release.
>>>>>>>>>>
>>>>>>>>>> It's been ~6 months since the last 3.0.x and 3.1.x releases, too;
>>>>>>>>>> a new release of those wouldn't hurt either, if any of our release 
>>>>>>>>>> managers
>>>>>>>>>> have the time or inclination. 3.0.x is reaching unofficial 
>>>>>>>>>> end-of-life
>>>>>>>>>> around now anyway.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mon, Dec 6, 2021 at 6:55 PM Hyukjin Kwon 
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi all,
>>>>>>>>>>>
>>>>>>>>>>> It's been two months since Spark 3.2.0 release, and we have
>>>>>>>>>>> resolved many bug fixes and regressions. What do you guys think 
>>>>>>>>>>> about
>>>>>>>>>>> rolling Spark 3.2.1 release?
>>>>>>>>>>>
>>>>>>>>>>> cc @huaxin gao  FYI who I happened to
>>>>>>>>>>> overhear that is interested in rolling the maintenance release :-).
>>>>>>>>>>>
>>>>>>>>>> --
John Zhuge


Re: [DISCUSSION] SPIP: Support Volcano/Alternative Schedulers Proposal

2021-11-30 Thread John Zhuge
+1 Kudos to Yikun and the community for starting the discussion!

On Tue, Nov 30, 2021 at 8:47 AM Chenya Zhang 
wrote:

> Thanks folks for bringing up the topic of natively integrating Volcano and
> other alternative schedulers into Spark!
>
> +Weiwei, Wilfred, Chaoran. We would love to contribute to the discussion
> as well.
>
> From our side, we have been using and improving on one alternative
> resource scheduler, Apache YuniKorn (https://yunikorn.apache.org/), for
> Spark on Kubernetes in production at Apple with solid results in the past
> year. It is capable of supporting Gang scheduling (similar to PodGroups),
> multi-tenant resource queues (similar to YARN), FIFO, and other handy
> features like bin packing to enable efficient autoscaling, etc.
>
> Natively integrating with Spark would provide more flexibility for users
> and reduce the extra cost and potential inconsistency of maintaining
> different layers of resource strategies. One interesting topic we hope to
> discuss more about is dynamic allocation, which would benefit from native
> coordination between Spark and resource schedulers in K8s &
> cloud environment for an optimal resource efficiency.
>
>
> On Tue, Nov 30, 2021 at 8:10 AM Holden Karau  wrote:
>
>> Thanks for putting this together, I’m really excited for us to add better
>> batch scheduling integrations.
>>
>> On Tue, Nov 30, 2021 at 12:46 AM Yikun Jiang  wrote:
>>
>>> Hey everyone,
>>>
>>> I'd like to start a discussion on "Support Volcano/Alternative
>>> Schedulers Proposal".
>>>
>>> This SPIP is proposed to make spark k8s schedulers provide more YARN
>>> like features (such as queues and minimum resources before scheduling jobs)
>>> that many folks want on Kubernetes.
>>>
>>> The goal of this SPIP is to improve current spark k8s scheduler
>>> implementations, add the ability of batch scheduling, and support Volcano as
>>> one of the implementations.
>>>
>>> Design doc:
>>> https://docs.google.com/document/d/1xgQGRpaHQX6-QH_J9YV2C2Dh6RpXefUpLM7KGkzL6Fg
>>> JIRA: https://issues.apache.org/jira/browse/SPARK-36057
>>> Part of PRs:
>>> Ability to create resources https://github.com/apache/spark/pull/34599
>>> Add PodGroupFeatureStep: https://github.com/apache/spark/pull/34456
>>>
>>> Regards,
>>> Yikun
>>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>

-- 
John Zhuge


Re: [VOTE] SPIP: Row-level operations in Data Source V2

2021-11-14 Thread John Zhuge
+1 (non-binding)

On Sun, Nov 14, 2021 at 10:33 AM Chao Sun  wrote:

> +1 (non-binding). Thanks Anton for the work!
>
> On Sun, Nov 14, 2021 at 10:01 AM Ryan Blue  wrote:
>
>> +1
>>
>> Thanks to Anton for all this great work!
>>
>> On Sat, Nov 13, 2021 at 8:24 AM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> +1 non-binding
>>>
>>>
>>>
>>>view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Sat, 13 Nov 2021 at 15:07, Russell Spitzer 
>>> wrote:
>>>
>>>> +1 (never binding)
>>>>
>>>> On Sat, Nov 13, 2021 at 1:10 AM Dongjoon Hyun 
>>>> wrote:
>>>>
>>>>> +1
>>>>>
>>>>> On Fri, Nov 12, 2021 at 6:58 PM huaxin gao 
>>>>> wrote:
>>>>>
>>>>>> +1
>>>>>>
>>>>>> On Fri, Nov 12, 2021 at 6:44 PM Yufei Gu 
>>>>>> wrote:
>>>>>>
>>>>>>> +1
>>>>>>>
>>>>>>> > On Nov 12, 2021, at 6:25 PM, L. C. Hsieh  wrote:
>>>>>>> >
>>>>>>> > Hi all,
>>>>>>> >
>>>>>>> > I’d like to start a vote for SPIP: Row-level operations in Data
>>>>>>> Source V2.
>>>>>>> >
>>>>>>> > The proposal is to add support for executing row-level operations
>>>>>>> > such as DELETE, UPDATE, MERGE for v2 tables (SPARK-35801). The
>>>>>>> > execution should be the same across data sources and the best way
>>>>>>> to do
>>>>>>> > that is to implement it in Spark.
>>>>>>> >
>>>>>>> > Right now, Spark can only parse and to some extent analyze DELETE,
>>>>>>> UPDATE,
>>>>>>> > MERGE commands. Data sources that support row-level changes have
>>>>>>> to build
>>>>>>> > custom Spark extensions to execute such statements. The goal of
>>>>>>> this effort
>>>>>>> > is to come up with a flexible and easy-to-use API that will work
>>>>>>> across
>>>>>>> > data sources.
>>>>>>> >
>>>>>>> > Please also refer to:
>>>>>>> >
>>>>>>> >   - Previous discussion in dev mailing list: [DISCUSS] SPIP:
>>>>>>> > Row-level operations in Data Source V2
>>>>>>> >   <
>>>>>>> https://lists.apache.org/thread/kd8qohrk5h3qx8d6y4lhrm67vnn8p6bv>
>>>>>>> >
>>>>>>> >   - JIRA: SPARK-35801 <
>>>>>>> https://issues.apache.org/jira/browse/SPARK-35801>
>>>>>>> >   - PR for handling DELETE statements:
>>>>>>> > <https://github.com/apache/spark/pull/33008>
>>>>>>> >
>>>>>>> >   - Design doc
>>>>>>> > <
>>>>>>> https://docs.google.com/document/d/12Ywmc47j3l2WF4anG5vL4qlrhT2OKigb7_EbIKhxg60/
>>>>>>> >
>>>>>>> >
>>>>>>> > Please vote on the SPIP for the next 72 hours:
>>>>>>> >
>>>>>>> > [ ] +1: Accept the proposal as an official SPIP
>>>>>>> > [ ] +0
>>>>>>> > [ ] -1: I don’t think this is a good idea because …
>>>>>>> >
>>>>>>> >
>>>>>>> -
>>>>>>> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>>>> >
>>>>>>>
>>>>>>>
>>>>>>> -
>>>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>>>>
>>>>>>>
>>
>> --
>> Ryan Blue
>> Tabular
>>
> --
John Zhuge
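
To make the feature under vote concrete, these are the kinds of statements the SPIP wants Spark itself to plan and execute against v2 tables. The snippet is purely illustrative: it assumes an active SparkSession named `spark`, the table names are made up, and running such statements today still requires source-specific extensions, which is exactly what the proposal aims to remove.

```scala
// Illustrative only: row-level operations targeted by the SPIP, issued
// through an assumed SparkSession `spark` against made-up v2 table names.
spark.sql("DELETE FROM cat.db.events WHERE event_date < '2020-01-01'")

spark.sql("UPDATE cat.db.accounts SET status = 'inactive' WHERE last_login < '2020-01-01'")

spark.sql("""
  MERGE INTO cat.db.target AS t
  USING cat.db.updates AS s
  ON t.id = s.id
  WHEN MATCHED THEN UPDATE SET t.value = s.value
  WHEN NOT MATCHED THEN INSERT *
""")
```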


Re: [DISCUSS] SPIP: Storage Partitioned Join for Data Source V2

2021-10-26 Thread John Zhuge
+1  Nicely done!

On Tue, Oct 26, 2021 at 8:08 AM Chao Sun  wrote:

> Oops, sorry. I just fixed the permission setting.
>
> Thanks everyone for the positive support!
>
> On Tue, Oct 26, 2021 at 7:30 AM Wenchen Fan  wrote:
>
>> +1 to this SPIP and nice writeup of the design doc!
>>
>> Can we open comment permission in the doc so that we can discuss details
>> there?
>>
>> On Tue, Oct 26, 2021 at 8:29 PM Hyukjin Kwon  wrote:
>>
>>> Seems making sense to me.
>>>
>>> Would be great to have some feedback from people such as @Wenchen Fan
>>>  @Cheng Su  @angers zhu
>>> .
>>>
>>>
>>> On Tue, 26 Oct 2021 at 17:25, Dongjoon Hyun 
>>> wrote:
>>>
>>>> +1 for this SPIP.
>>>>
>>>> On Sun, Oct 24, 2021 at 9:59 AM huaxin gao 
>>>> wrote:
>>>>
>>>>> +1. Thanks for lifting the current restrictions on bucket join and
>>>>> making this more generalized.
>>>>>
>>>>> On Sun, Oct 24, 2021 at 9:33 AM Ryan Blue  wrote:
>>>>>
>>>>>> +1 from me as well. Thanks Chao for doing so much to get it to this
>>>>>> point!
>>>>>>
>>>>>> On Sat, Oct 23, 2021 at 11:29 PM DB Tsai  wrote:
>>>>>>
>>>>>>> +1 on this SPIP.
>>>>>>>
>>>>>>> This is a more generalized version of bucketed tables and bucketed
>>>>>>> joins, which can eliminate very expensive data shuffles during joins,
>>>>>>> and
>>>>>>> many users in the Apache Spark community have wanted this feature for
>>>>>>> a long time!
>>>>>>>
>>>>>>> Thank you, Ryan and Chao, for working on this, and I look forward to
>>>>>>> it as a new feature in Spark 3.3
>>>>>>>
>>>>>>> DB Tsai  |  https://www.dbtsai.com/  |  PGP 42E5B25A8F7A82C1
>>>>>>>
>>>>>>> On Fri, Oct 22, 2021 at 12:18 PM Chao Sun 
>>>>>>> wrote:
>>>>>>> >
>>>>>>> > Hi,
>>>>>>> >
>>>>>>> > Ryan and I drafted a design doc to support a new type of join:
>>>>>>> storage partitioned join which covers bucket join support for 
>>>>>>> DataSourceV2
>>>>>>> but is more general. The goal is to let Spark leverage distribution
>>>>>>> properties reported by data sources and eliminate shuffle whenever 
>>>>>>> possible.
>>>>>>> >
>>>>>>> > Design doc:
>>>>>>> https://docs.google.com/document/d/1foTkDSM91VxKgkEcBMsuAvEjNybjja-uHk-r3vtXWFE
>>>>>>> (includes a POC link at the end)
>>>>>>> >
>>>>>>> > We'd like to start a discussion on the doc and any feedback is
>>>>>>> welcome!
>>>>>>> >
>>>>>>> > Thanks,
>>>>>>> > Chao
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>>
>>>>>

-- 
John Zhuge
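
To make "distribution properties reported by data sources" a bit more concrete: DataSource V2 already lets a scan report how its output is partitioned, and the SPIP generalizes how Spark exploits that to skip shuffles. Below is a minimal sketch of the reporting side; the class name is made up and the concrete Partitioning implementation is deliberately left abstract.

```scala
import org.apache.spark.sql.connector.read.{Scan, SupportsReportPartitioning}
import org.apache.spark.sql.connector.read.partitioning.Partitioning
import org.apache.spark.sql.types.StructType

// Sketch: a scan that tells Spark its output is already partitioned
// (for example, bucketed by a join key), so that joining two compatibly
// partitioned scans could avoid a shuffle.
class BucketedScan(schema: StructType, partitioning: Partitioning)
    extends Scan with SupportsReportPartitioning {

  override def readSchema(): StructType = schema

  override def outputPartitioning(): Partitioning = partitioning
}
```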


Re: Time to start publishing Spark Docker Images?

2021-08-12 Thread John Zhuge
+1

On Thu, Aug 12, 2021 at 5:44 PM Hyukjin Kwon  wrote:

> +1, I think we generally agreed upon having it. Thanks Holden for headsup
> and driving this.
>
> +@Dongjoon Hyun  FYI
>
> 2021년 7월 22일 (목) 오후 12:22, Kent Yao 님이 작성:
>
>> +1
>>
>> Bests,
>>
>> *Kent Yao *
>> @ Data Science Center, Hangzhou Research Institute, NetEase Corp.
>> *a spark enthusiast*
>> *kyuubi <https://github.com/yaooqinn/kyuubi>is a
>> unified multi-tenant JDBC interface for large-scale data processing and
>> analytics, built on top of Apache Spark <http://spark.apache.org/>.*
>> *spark-authorizer <https://github.com/yaooqinn/spark-authorizer>A Spark
>> SQL extension which provides SQL Standard Authorization for **Apache
>> Spark <http://spark.apache.org/>.*
>> *spark-postgres <https://github.com/yaooqinn/spark-postgres> A library
>> for reading data from and transferring data to Postgres / Greenplum with
>> Spark SQL and DataFrames, 10~100x faster.*
>> *itatchi <https://github.com/yaooqinn/spark-func-extras>A** library t**hat
>> brings useful functions from various modern database management systems to 
>> **Apache
>> Spark <http://spark.apache.org/>.*
>>
>>
>>
>> On 07/22/2021 11:13,Holden Karau
>>  wrote:
>>
>> Hi Folks,
>>
>> Many other distributed computing (https://hub.docker.com/r/rayproject/ray
>> https://hub.docker.com/u/daskdev) and ASF projects (
>> https://hub.docker.com/u/apache) now publish their images to dockerhub.
>>
>> We've already got the docker image tooling in place, I think we'd need to
>> ask the ASF to grant permissions to the PMC to publish containers and
>> update the release steps but I think this could be useful for folks.
>>
>> Cheers,
>>
>> Holden
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>> - To
>> unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
> --
John Zhuge


Re: Spark on Kubernetes scheduler variety

2021-06-24 Thread John Zhuge
>>>> dynamic scaling) without us making changes inside of the Spark Kube
>>>> scheduler.
>>>>
>>>> Certainly whichever scheduler extensions we add support for we should
>>>> collaborate with the people developing those extensions insofar as they are
>>>> interested. My first place that I checked was #sig-scheduling which is
>>>> fairly quite on the Kubernetes slack but if there are more places to look
>>>> for folks interested in batch scheduling on Kubernetes we should definitely
>>>> give it a shot :)
>>>>
>>>> On Fri, Jun 18, 2021 at 1:41 AM Mich Talebzadeh <
>>>> mich.talebza...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Regarding your point and I quote
>>>>>
>>>>> "..  I know that one of the Spark on Kube operators
>>>>> supports volcano/kube-batch so I was thinking that might be a place I 
>>>>> would
>>>>> start exploring..."
>>>>>
>>>>> There seems to be ongoing work on say Volcano as part of  Cloud
>>>>> Native Computing Foundation <https://cncf.io/> (CNCF). For example
>>>>> through https://github.com/volcano-sh/volcano
>>>>>
>>>> <https://github.com/volcano-sh/volcano>
>>>>>
>>>>> There may be value-add in collaborating with such groups through CNCF
>>>>> in order to have a collective approach to such work. There also seems to 
>>>>> be
>>>>> some work on Integration of Spark with Volcano for Batch Scheduling.
>>>>> <https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/volcano-integration.md>
>>>>>
>>>>>
>>>>>
>>>>> What is not very clear is the degree of progress of these projects.
>>>>> You may be kind enough to elaborate on KPI for each of these projects and
>>>>> where you think your contributions is going to be.
>>>>>
>>>>>
>>>>> HTH,
>>>>>
>>>>>
>>>>> Mich
>>>>>
>>>>>
>>>>>view my Linkedin profile
>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>
>>>>>
>>>>>
>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>>> any loss, damage or destruction of data or any other property which may
>>>>> arise from relying on this email's technical content is explicitly
>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>> arising from such loss, damage or destruction.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Fri, 18 Jun 2021 at 00:44, Holden Karau 
>>>>> wrote:
>>>>>
>>>>>> Hi Folks,
>>>>>>
>>>>>> I'm continuing my adventures to make Spark on containers party and I
>>>>>> was wondering if folks have experience with the different batch
>>>>>> scheduler options that they prefer? I was thinking so that we can
>>>>>> better support dynamic allocation it might make sense for us to
>>>>>> support using different schedulers and I wanted to see if there are
>>>>>> any that the community is more interested in?
>>>>>>
>>>>>> I know that one of the Spark on Kube operators supports
>>>>>> volcano/kube-batch so I was thinking that might be a place I start
>>>>>> exploring but also want to be open to other schedulers that folks
>>>>>> might be interested in.
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Holden :)
>>>>>>
>>>>>> --
>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>>>> https://amzn.to/2MaRAG9
>>>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>>>
>>>>>> -
>>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>>>
>>>>>> --
>>>> Twitter: https://twitter.com/holdenkarau
>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>
>>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>

-- 
John Zhuge


Re: Spark on Kubernetes scheduler variety

2021-06-24 Thread John Zhuge
tch Scheduling.
>>>> <https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/volcano-integration.md>
>>>>
>>>>
>>>>
>>>> What is not very clear is the degree of progress of these projects. You
>>>> may be kind enough to elaborate on KPI for each of these projects and where
>>>> you think your contributions is going to be.
>>>>
>>>>
>>>> HTH,
>>>>
>>>>
>>>> Mich
>>>>
>>>>
>>>>view my Linkedin profile
>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>>
>>>> On Fri, 18 Jun 2021 at 00:44, Holden Karau 
>>>> wrote:
>>>>
>>>>> Hi Folks,
>>>>>
>>>>> I'm continuing my adventures to make Spark on containers party and I
>>>>> was wondering if folks have experience with the different batch
>>>>> scheduler options that they prefer? I was thinking so that we can
>>>>> better support dynamic allocation it might make sense for us to
>>>>> support using different schedulers and I wanted to see if there are
>>>>> any that the community is more interested in?
>>>>>
>>>>> I know that one of the Spark on Kube operators supports
>>>>> volcano/kube-batch so I was thinking that might be a place I start
>>>>> exploring but also want to be open to other schedulers that folks
>>>>> might be interested in.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Holden :)
>>>>>
>>>>> --
>>>>> Twitter: https://twitter.com/holdenkarau
>>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>>> https://amzn.to/2MaRAG9
>>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>>
>>>>> -
>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>>
>>>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>> --
John Zhuge


Re: [VOTE] SPIP: Catalog API for view metadata

2021-05-26 Thread John Zhuge
Looks like we are running in circles. Should we have an online meeting to
get this sorted out?

Thanks,
John

On Wed, May 26, 2021 at 12:01 AM Wenchen Fan  wrote:

> OK, then I'd vote for TableViewCatalog, because
> 1. This is how Hive catalog works, and we need to migrate Hive catalog to
> the v2 API sooner or later.
> 2. Because of 1, TableViewCatalog is easy to support in the current
> table/view resolution framework.
> 3. It's better to avoid name conflicts between table and views at the API
> level, instead of relying on the catalog implementation.
> 4. Caching invalidation is always a tricky problem.
>
> On Tue, May 25, 2021 at 3:09 AM Ryan Blue 
> wrote:
>
>> I don't think that it makes sense to discuss a different approach in the
>> PR rather than in the vote. Let's discuss this now since that's the purpose
>> of an SPIP.
>>
>> On Mon, May 24, 2021 at 11:22 AM John Zhuge  wrote:
>>
>>> Hi everyone, I’d like to start a vote for the ViewCatalog design
>>> proposal (SPIP).
>>>
>>> The proposal is to add a ViewCatalog interface that can be used to load,
>>> create, alter, and drop views in DataSourceV2.
>>>
>>> The full SPIP doc is here:
>>> https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing
>>>
>>> Please vote on the SPIP in the next 72 hours. Once it is approved, I’ll
>>> update the PR for review.
>>>
>>> [ ] +1: Accept the proposal as an official SPIP
>>> [ ] +0
>>> [ ] -1: I don’t think this is a good idea because …
>>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>

-- 
John Zhuge
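
To make the TableViewCatalog alternative in the quoted discussion concrete, one possible reading is a single catalog trait whose lookup answers "table or view?" in one metastore call. The sketch below is purely illustrative: no such interface was agreed on in this thread, and every name in it is an assumption.

```scala
import org.apache.spark.sql.connector.catalog.{Identifier, Table, TableCatalog}

// Illustrative only: a combined lookup result, so Spark needs a single call
// per identifier instead of separate table and view lookups.
sealed trait TableOrView
final case class ResolvedTable(table: Table) extends TableOrView
final case class ResolvedView(sql: String) extends TableOrView

// Roughly the "TableViewCatalog" idea from the quoted email: one catalog
// serving both kinds of objects behind a single resolution entry point.
trait TableViewCatalog extends TableCatalog {
  def loadTableOrView(ident: Identifier): TableOrView
}
```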


Re: [VOTE] Release Spark 3.1.2 (RC1)

2021-05-25 Thread John Zhuge
+1 (non-binding)

Validated checksum and signature; ran RAT checks; tried
spark-3.1.2-bin-hadoop2.7 with HMS 1.2.

On Tue, May 25, 2021 at 7:24 PM Liang-Chi Hsieh  wrote:

> +1 (non-binding)
>
> Binary and doc looks good. JIRA tickets looks good. Ran simple tasks.
>
> Thank you, Dongjoon!
>
>
> Hyukjin Kwon wrote
> > +1
> >
> > 2021년 5월 26일 (수) 오전 9:00, Cheng Su 
>
> > chengsu@.com
>
> > 님이 작성:
> >
> >> +1 (non-binding)
> >>
> >>
> >>
> >> Checked the related commits in commit history manually.
> >>
> >>
> >>
> >> Thanks!
> >>
> >> Cheng Su
> >>
> >>
> >>
> >> *From: *Takeshi Yamamuro 
>
> > linguin.m.s@
>
> > 
> >> *Date: *Tuesday, May 25, 2021 at 4:47 PM
> >> *To: *Dongjoon Hyun 
>
> > dongjoon.hyun@
>
> > , dev 
>
> > dev@.apache
>
> > 
> >> *Subject: *Re: [VOTE] Release Spark 3.1.2 (RC1)
> >>
> >>
> >>
> >> +1 (non-binding)
> >>
> >>
> >>
> >> I ran the tests, checked the related jira tickets, and compared TPCDS
> >> performance differences between
> >>
> >> this v3.1.2 candidate and v3.1.1.
> >>
> >> Everything looks fine.
> >>
> >>
> >>
> >> Thank you, Dongjoon!
>
>
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>

-- 
John Zhuge


[VOTE] SPIP: Catalog API for view metadata

2021-05-24 Thread John Zhuge
Hi everyone, I’d like to start a vote for the ViewCatalog design proposal
(SPIP).

The proposal is to add a ViewCatalog interface that can be used to load,
create, alter, and drop views in DataSourceV2.

The full SPIP doc is here:
https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing

Please vote on the SPIP in the next 72 hours. Once it is approved, I’ll
update the PR for review.

[ ] +1: Accept the proposal as an official SPIP
[ ] +0
[ ] -1: I don’t think this is a good idea because …
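
For context while reading the vote, here is a rough sketch of the shape of catalog plugin the SPIP describes. Method names, signatures, and the helper types below are illustrative only; the linked SPIP doc is the authoritative definition.

```scala
import org.apache.spark.sql.connector.catalog.{CatalogPlugin, Identifier}
import org.apache.spark.sql.types.StructType

// Sketch only: a view-level counterpart to TableCatalog, following the
// load/create/alter/drop operations listed in the proposal above.
trait ViewCatalog extends CatalogPlugin {
  def listViews(namespace: Array[String]): Array[Identifier]
  def loadView(ident: Identifier): View
  def createView(
      ident: Identifier,
      sql: String,                                   // the view's defining query text
      schema: StructType,
      properties: java.util.Map[String, String]): View
  def alterView(ident: Identifier, changes: ViewChange*): View
  def dropView(ident: Identifier): Boolean
}

// A loaded view would expose at least its defining SQL and schema.
trait View {
  def name: String
  def sql: String
  def schema: StructType
}

// Minimal stand-in for property changes, just to keep the sketch self-contained.
sealed trait ViewChange
object ViewChange {
  final case class SetProperty(key: String, value: String) extends ViewChange
  final case class RemoveProperty(key: String) extends ViewChange
}
```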


Re: SPIP: Catalog API for view metadata

2021-05-24 Thread John Zhuge
Great! I will start a vote thread.

On Mon, May 24, 2021 at 10:54 AM Wenchen Fan  wrote:

> Yea let's move forward first. We can discuss the caching approach
> and TableViewCatalog approach during the PR review.
>
> On Tue, May 25, 2021 at 1:48 AM John Zhuge  wrote:
>
>> Hi everyone,
>>
>> Is there any more discussion before we start a vote on ViewCatalog? With
>> FunctionCatalog merged, I hope this feature can complete the offerings of
>> catalog plugins in 3.2.
>>
>> Once approved, I will refresh the WIP PR. Implementation details can be
>> ironed out during review.
>>
>> Thanks,
>>
>> On Tue, Nov 10, 2020 at 5:23 PM Ryan Blue 
>> wrote:
>>
>>> An extra RPC call is a concern for the catalog implementation. It is
>>> simple to cache the result of a call to avoid a second one if the catalog
>>> chooses.
>>>
>>> I don't think that an extra RPC that can be easily avoided is a
>>> reasonable justification to add caches in Spark. For one thing, it doesn't
>>> solve the problem because the proposed API still requires separate lookups
>>> for tables and views.
>>>
>>> The only solution that would help is to use a combined trait, but that
>>> has issues. For one, view substitution is much cleaner when it happens well
>>> before table resolution. And, View and Table are very different objects;
>>> returning Object from this API doesn't make much sense.
>>>
>>> One extra RPC is not unreasonable, and the choice should be left to
>>> sources. That's the easiest place to cache results from the underlying
>>> store.
>>>
>>> On Mon, Nov 9, 2020 at 8:18 PM Wenchen Fan  wrote:
>>>
>>>> Moving back the discussion to this thread. The current argument is how
>>>> to avoid extra RPC calls for catalogs supporting both table and view. There
>>>> are several options:
>>>> 1. ignore it as extra RPC calls are cheap compared to the query
>>>> execution
>>>> 2. have a per session cache for loaded table/view
>>>> 3. have a per query cache for loaded table/view
>>>> 4. add a new trait TableViewCatalog
>>>>
>>>> I think it's important to avoid perf regression with new APIs. RPC
>>>> calls can be significant for short queries. We may also double the RPC
>>>> traffic which is bad for the metastore service. Normally I would not
>>>> recommend caching as cache invalidation is a hard problem. Personally I
>>>> prefer option 4 as it only affects catalogs that support both table and
>>>> view, and it fits the hive catalog very well.
>>>>
>>>> On Fri, Sep 4, 2020 at 4:21 PM John Zhuge  wrote:
>>>>
>>>>> SPIP
>>>>> <https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing>
>>>>> has been updated. Please review.
>>>>>
>>>>> On Thu, Sep 3, 2020 at 9:22 AM John Zhuge  wrote:
>>>>>
>>>>>> Wenchen, sorry for the delay, I will post an update shortly.
>>>>>>
>>>>>> On Thu, Sep 3, 2020 at 2:00 AM Wenchen Fan 
>>>>>> wrote:
>>>>>>
>>>>>>> Any updates here? I agree that a new View API is better, but we need
>>>>>>> a solution to avoid performance regression. We need to elaborate on the
>>>>>>> cache idea.
>>>>>>>
>>>>>>> On Thu, Aug 20, 2020 at 7:43 AM Ryan Blue  wrote:
>>>>>>>
>>>>>>>> I think it is a good idea to keep tables and views separate.
>>>>>>>>
>>>>>>>> The main two arguments I’ve heard for combining lookup into a
>>>>>>>> single function are the ones brought up in this thread. First, an
>>>>>>>> identifier in a catalog must be either a view or a table and should not
>>>>>>>> collide. Second, a single lookup is more likely to require a single 
>>>>>>>> RPC. I
>>>>>>>> think the RPC concern is well addressed by caching, which we already 
>>>>>>>> do in
>>>>>>>> the Spark catalog, so I’ll primarily focus on the first.
>>>>>>>>
>>>>>>>> Table/view name collision is unlikely to be a problem. Metastores
>>>>>>>> that support both today store them in a single namespace, so this is 
>>>>>>>

Re: SPIP: Catalog API for view metadata

2021-05-24 Thread John Zhuge
Hi everyone,

Is there any more discussion before we start a vote on ViewCatalog? With
FunctionCatalog merged, I hope this feature can complete the offerings of
catalog plugins in 3.2.

Once approved, I will refresh the WIP PR. Implementation details can be
ironed out during review.

Thanks,

On Tue, Nov 10, 2020 at 5:23 PM Ryan Blue  wrote:

> An extra RPC call is a concern for the catalog implementation. It is
> simple to cache the result of a call to avoid a second one if the catalog
> chooses.
>
> I don't think that an extra RPC that can be easily avoided is a reasonable
> justification to add caches in Spark. For one thing, it doesn't solve the
> problem because the proposed API still requires separate lookups for tables
> and views.
>
> The only solution that would help is to use a combined trait, but that has
> issues. For one, view substitution is much cleaner when it happens well
> before table resolution. And, View and Table are very different objects;
> returning Object from this API doesn't make much sense.
>
> One extra RPC is not unreasonable, and the choice should be left to
> sources. That's the easiest place to cache results from the underlying
> store.
>
> On Mon, Nov 9, 2020 at 8:18 PM Wenchen Fan  wrote:
>
>> Moving back the discussion to this thread. The current argument is how to
>> avoid extra RPC calls for catalogs supporting both table and view. There
>> are several options:
>> 1. ignore it as extra RPC calls are cheap compared to the query execution
>> 2. have a per session cache for loaded table/view
>> 3. have a per query cache for loaded table/view
>> 4. add a new trait TableViewCatalog
>>
>> I think it's important to avoid perf regression with new APIs. RPC calls
>> can be significant for short queries. We may also double the RPC
>> traffic which is bad for the metastore service. Normally I would not
>> recommend caching as cache invalidation is a hard problem. Personally I
>> prefer option 4 as it only affects catalogs that support both table and
>> view, and it fits the hive catalog very well.
>>
>> On Fri, Sep 4, 2020 at 4:21 PM John Zhuge  wrote:
>>
>>> SPIP
>>> <https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing>
>>> has been updated. Please review.
>>>
>>> On Thu, Sep 3, 2020 at 9:22 AM John Zhuge  wrote:
>>>
>>>> Wenchen, sorry for the delay, I will post an update shortly.
>>>>
>>>> On Thu, Sep 3, 2020 at 2:00 AM Wenchen Fan  wrote:
>>>>
>>>>> Any updates here? I agree that a new View API is better, but we need a
>>>>> solution to avoid performance regression. We need to elaborate on the 
>>>>> cache
>>>>> idea.
>>>>>
>>>>> On Thu, Aug 20, 2020 at 7:43 AM Ryan Blue  wrote:
>>>>>
>>>>>> I think it is a good idea to keep tables and views separate.
>>>>>>
>>>>>> The main two arguments I’ve heard for combining lookup into a single
>>>>>> function are the ones brought up in this thread. First, an identifier in 
>>>>>> a
>>>>>> catalog must be either a view or a table and should not collide. Second, 
>>>>>> a
>>>>>> single lookup is more likely to require a single RPC. I think the RPC
>>>>>> concern is well addressed by caching, which we already do in the Spark
>>>>>> catalog, so I’ll primarily focus on the first.
>>>>>>
>>>>>> Table/view name collision is unlikely to be a problem. Metastores
>>>>>> that support both today store them in a single namespace, so this is not 
>>>>>> a
>>>>>> concern for even a naive implementation that talks to the Hive 
>>>>>> MetaStore. I
>>>>>> know that a new metastore catalog could choose to implement both
>>>>>> ViewCatalog and TableCatalog and store the two sets separately, but that
>>>>>> would be a very strange choice: if the metastore itself has different
>>>>>> namespaces for tables and views, then it makes much more sense to expose
>>>>>> them through separate catalogs because Spark will always prefer one over
>>>>>> the other.
>>>>>>
>>>>>> In a similar line of reasoning, catalogs that expose both views and
>>>>>> tables are much more rare than catalogs that only expose one. For 
>>>>>> example,
>>>>>> v2 catalogs for JDBC and

Re: Apache Spark 3.1.2 Release?

2021-05-17 Thread John Zhuge
+1, thanks Dongjoon!

On Mon, May 17, 2021 at 7:50 PM Yuming Wang  wrote:

> +1.
>
> On Tue, May 18, 2021 at 9:06 AM Hyukjin Kwon  wrote:
>
>> +1 thanks for driving me
>>
>> On Tue, 18 May 2021, 09:33 Holden Karau,  wrote:
>>
>>> +1 and thanks for volunteering to be the RM :)
>>>
>>> On Mon, May 17, 2021 at 4:09 PM Takeshi Yamamuro 
>>> wrote:
>>>
>>>> Thank you, Dongjoon~ sgtm, too.
>>>>
>>>> On Tue, May 18, 2021 at 7:34 AM Cheng Su 
>>>> wrote:
>>>>
>>>>> +1 for a new release, thanks Dongjoon!
>>>>>
>>>>> Cheng Su
>>>>>
>>>>> On 5/17/21, 2:44 PM, "Liang-Chi Hsieh"  wrote:
>>>>>
>>>>> +1 sounds good. Thanks Dongjoon for volunteering on this!
>>>>>
>>>>>
>>>>> Liang-Chi
>>>>>
>>>>>
>>>>> Dongjoon Hyun-2 wrote
>>>>> > Hi, All.
>>>>> >
>>>>> > Since Apache Spark 3.1.1 tag creation (Feb 21),
>>>>> > new 172 patches including 9 correctness patches and 4 K8s
>>>>> patches arrived
>>>>> > at branch-3.1.
>>>>> >
>>>>> > Shall we make a new release, Apache Spark 3.1.2, as the second
>>>>> release at
>>>>> > 3.1 line?
>>>>> > I'd like to volunteer for the release manager for Apache Spark
>>>>> 3.1.2.
>>>>> > I'm thinking about starting the first RC next week.
>>>>> >
>>>>> > $ git log --oneline v3.1.1..HEAD | wc -l
>>>>> >  172
>>>>> >
>>>>> > # Known correctness issues
>>>>> > SPARK-34534 New protocol FetchShuffleBlocks in
>>>>> OneForOneBlockFetcher
>>>>> > lead to data loss or correctness
>>>>> > SPARK-34545 PySpark Python UDF return inconsistent results
>>>>> when
>>>>> > applying 2 UDFs with different return type to 2 columns together
>>>>> > SPARK-34681 Full outer shuffled hash join when building left
>>>>> side
>>>>> > produces wrong result
>>>>> > SPARK-34719 fail if the view query has duplicated column
>>>>> names
>>>>> > SPARK-34794 Nested higher-order functions broken in DSL
>>>>> > SPARK-34829 transform_values return identical values when
>>>>> it's used
>>>>> > with udf that returns reference type
>>>>> > SPARK-34833 Apply right-padding correctly for correlated
>>>>> subqueries
>>>>> > SPARK-35381 Fix lambda variable name issues in nested
>>>>> DataFrame
>>>>> > functions in R APIs
>>>>> > SPARK-35382 Fix lambda variable name issues in nested
>>>>> DataFrame
>>>>> > functions in Python APIs
>>>>> >
>>>>> > # Notable K8s patches since K8s GA
>>>>> > SPARK-34674Close SparkContext after the Main method has
>>>>> finished
>>>>> > SPARK-34948Add ownerReference to executor configmap to fix
>>>>> leakages
>>>>> > SPARK-34820add apt-update before gnupg install
>>>>> > SPARK-34361In case of downscaling avoid killing of executors
>>>>> already
>>>>> > known by the scheduler backend in the pod allocator
>>>>> >
>>>>> > Bests,
>>>>> > Dongjoon.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Sent from:
>>>>> http://apache-spark-developers-list.1001551.n3.nabble.com/
>>>>>
>>>>>
>>>>> -
>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>>
>>>>>
>>>>>
>>>>
>>>> --
>>>> ---
>>>> Takeshi Yamamuro
>>>>
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>>

-- 
John Zhuge


Re: [VOTE] Release Spark 2.4.8 (RC4)

2021-05-10 Thread John Zhuge
No, I just tried to build a Java project with the Maven RC repo.

Validated checksum and signature; ran RAT checks; built the source and ran
unit tests.

+1 (non-binding)

On Sun, May 9, 2021 at 11:10 PM Liang-Chi Hsieh  wrote:

> Yea, I don't know why it happens.
>
> I remember RC1 also has the same issue. But RC2 and RC3 don't.
>
> Does it affect the RC?
>
>
> John Zhuge wrote
> > Got this error when browsing the staging repository:
> >
> > 404 - Repository "orgapachespark-1383 (staging: open)"
> > [id=orgapachespark-1383] exists but is not exposed.
> >
> > John Zhuge
>
>
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -----
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>

-- 
John Zhuge


Re: [VOTE] Release Spark 2.4.8 (RC4)

2021-05-09 Thread John Zhuge
Got this error when browsing the staging repository:

404 - Repository "orgapachespark-1383 (staging: open)"
[id=orgapachespark-1383] exists but is not exposed.

On Sun, May 9, 2021 at 2:22 PM Liang-Chi Hsieh  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 2.4.8.
>
> The vote is open until May 14th at 9AM PST and passes if a majority +1 PMC
> votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 2.4.8
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> There are currently no issues targeting 2.4.8 (try project = SPARK AND
> "Target Version/s" = "2.4.8" AND status in (Open, Reopened, "In Progress"))
>
> The tag to be voted on is v2.4.8-rc4 (commit
> 163fbd2528a18bf062bddf7b7753631a12a369b5):
> https://github.com/apache/spark/tree/v2.4.8-rc4
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.8-rc4-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1383/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.8-rc4-docs/
>
> The list of bug fixes going into 2.4.8 can be found at the following URL:
> https://s.apache.org/spark-v2.4.8-rc4
>
> This release is using the release script of the tag v2.4.8-rc4.
>
> FAQ
>
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your projects resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out of date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 2.4.8?
> ===
>
> The current list of open tickets targeted at 2.4.8 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 2.4.8
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>

-- 
John Zhuge


Re: [VOTE] SPIP: Add FunctionCatalog

2021-03-08 Thread John Zhuge
+1 (non-binding)

On Mon, Mar 8, 2021 at 4:32 PM Holden Karau  wrote:

> +1 (binding)
>
> On Mon, Mar 8, 2021 at 3:56 PM Ryan Blue  wrote:
>
>> Hi everyone, I’d like to start a vote for the FunctionCatalog design
>> proposal (SPIP).
>>
>> The proposal is to add a FunctionCatalog interface that can be used to
>> load and list functions for Spark to call. There are interfaces for scalar
>> and aggregate functions.
>>
>> In the discussion we’ve come to consensus and I’ve updated the design doc
>> to match how functions will be called:
>>
>> In addition to produceResult(InternalRow), which is optional, functions
>> can define produceResult methods with arguments that are Spark’s
>> internal data types, like UTF8String. Spark will prefer these methods
>> when calling the UDF using codegen.
>>
>> I’ve also updated the AggregateFunction interface and merged it with the
>> partial aggregate interface because Spark doesn’t support non-partial
>> aggregates.
>>
>> The full SPIP doc is here:
>> https://docs.google.com/document/d/1PLBieHIlxZjmoUB0ERF-VozCRJ0xw2j3qKvUNWpWA2U/edit#heading=h.82w8qxfl2uwl
>>
>> Please vote on the SPIP in the next 72 hours. Once it is approved, I’ll
>> do a final update of the PR and we can merge the API.
>>
>> [ ] +1: Accept the proposal as an official SPIP
>> [ ] +0
>> [ ] -1: I don’t think this is a good idea because …
>> --
>> Ryan Blue
>>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


-- 
John Zhuge
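
Here is a small sketch of the calling convention summarized in the vote email: a row-based produceResult plus a variant that takes Spark's internal types directly, which Spark would prefer under codegen. The trait below is a stand-in for the proposed interface, not the final shipped API; the names follow the email's description and are otherwise assumptions.

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.unsafe.types.UTF8String

// Stand-in for the proposed interface (sketch only).
trait ScalarFunction[R] {
  def name: String
  def produceResult(input: InternalRow): R   // generic, row-based entry point
}

// Example UDF: string length. The typed overload uses an internal type
// (UTF8String); per the proposal, Spark would call it directly from codegen.
class StrLen extends ScalarFunction[Int] {
  override def name: String = "strlen"

  override def produceResult(input: InternalRow): Int =
    input.getUTF8String(0).numChars()

  def produceResult(str: UTF8String): Int = str.numChars()
}
```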


Re: [DISCUSS] SPIP: FunctionCatalog

2021-03-03 Thread John Zhuge
catalyst/expressions/rows.scala#L35>
>>>>>> would throw a ClassCastException. I don’t think that using a row is
>>>>>> a bad option simply because UnsafeRow is unsafe.
>>>>>>
>>>>>> It’s unlikely that UnsafeRow would be used to pass the data. The
>>>>>> implementation would evaluate each argument expression and set the result
>>>>>> in a generic row, then pass that row to the UDF. We can use whatever
>>>>>> implementation we choose to provide better guarantees than unsafe.
>>>>>>
>>>>>> I think we should consider query-compile-time checks as
>>>>>> nearly-as-good as Java-compile-time checks for the purposes of safety.
>>>>>>
>>>>>> I don’t think I agree with this. A failure at query analysis time vs
>>>>>> runtime still requires going back to a separate project, fixing 
>>>>>> something,
>>>>>> and rebuilding. The time needed to fix a problem goes up significantly 
>>>>>> vs.
>>>>>> compile-time checks. And that is even worse if the UDF is maintained by
>>>>>> someone else.
>>>>>>
>>>>>> I think we also need to consider how common it would be that a use
>>>>>> case can have the query-compile-time checks. Going through this in more
>>>>>> detail below makes me think that it is unlikely that these checks would 
>>>>>> be
>>>>>> used often because of the limitations of using an interface with type
>>>>>> erasure.
>>>>>>
>>>>>> I believe that Wenchen’s proposal will provide stronger
>>>>>> query-compile-time safety
>>>>>>
>>>>>> The proposal could have better safety for each argument, assuming
>>>>>> that we detect failures by looking at the parameter types using 
>>>>>> reflection
>>>>>> in the analyzer. But we don’t do that for any of the similar UDFs today 
>>>>>> so
>>>>>> I’m skeptical that this would actually be a high enough priority to
>>>>>> implement.
>>>>>>
>>>>>> As Erik pointed out, type erasure also limits the effectiveness. You
>>>>>> can’t implement ScalarFunction2 and 
>>>>>> ScalarFunction2>>>>> Long>. You can handle those cases using InternalRow or you can
>>>>>> handle them using VarargScalarFunction. That forces many use
>>>>>> cases into varargs with Object, where you don’t get any of the
>>>>>> proposed analyzer benefits and lose compile-time checks. The only time 
>>>>>> the
>>>>>> additional checks (if implemented) would help is when only one set of
>>>>>> argument types is needed because implementing ScalarFunction<Object, Object> defeats the purpose.
>>>>>>
>>>>>> It’s worth noting that safety for the magic methods would be
>>>>>> identical between the two options, so the trade-off to consider is for
>>>>>> varargs and non-codegen cases. Combining the limitations discussed, this
>>>>>> has better safety guarantees only if you need just one set of types for
>>>>>> each number of arguments and are using the non-codegen path. Since 
>>>>>> varargs
>>>>>> is one of the primary reasons to use this API, then I don’t think that it
>>>>>> is a good idea to use Object[] instead of InternalRow.
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Software Engineer
>>>>>> Netflix
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>>
>>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>

-- 
John Zhuge
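
The type-erasure point in the quoted discussion can be seen with a short experiment. The interface name is borrowed from the thread and its exact shape is illustrative: both parameterizations erase to the same raw interface, so a single class cannot implement it twice with different type arguments, which is why the thread falls back to InternalRow or varargs for such cases.

```scala
// Illustrative sketch of the erasure limitation discussed above.
trait ScalarFunction2[A, B] {
  def produceResult(a: A, b: B): Any
}

class IntPlus extends ScalarFunction2[java.lang.Integer, java.lang.Integer] {
  override def produceResult(a: java.lang.Integer, b: java.lang.Integer): Any =
    a.intValue + b.intValue
}

// The following does not compile: ScalarFunction2 would be inherited twice
// with conflicting type arguments.
//
// class IntOrLongPlus
//   extends ScalarFunction2[java.lang.Integer, java.lang.Integer]
//   with ScalarFunction2[java.lang.Long, java.lang.Long]
```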


Re: Apache Spark 3.2 Expectation

2021-03-03 Thread John Zhuge
Hi Dongjoon,

Is it possible to get ViewCatalog in? The community already had fairly
detailed discussions.

Thanks,
John

On Thu, Feb 25, 2021 at 8:57 AM Dongjoon Hyun 
wrote:

> Hi, All.
>
> Since we have been preparing Apache Spark 3.2.0 in master branch since
> December 2020, March seems to be a good time to share our thoughts and
> aspirations on Apache Spark 3.2.
>
> According to the progress on Apache Spark 3.1 release, Apache Spark 3.2
> seems to be the last minor release of this year. Given the timeframe, we
> might consider the following. (This is a small set. Please add your
> thoughts to this limited list.)
>
> # Languages
>
> - Scala 2.13 Support: This was expected on 3.1 via SPARK-25075 but slipped
> out. Currently, we are trying to use Scala 2.13.5 via SPARK-34505 and
> investigating the publishing issue. Thank you for your contributions and
> feedback on this.
>
> - Java 17 LTS Support: Java 17 LTS will arrive in September 2021. Like
> Java 11, we need lots of support from our dependencies. Let's see.
>
> - Python 3.6 Deprecation(?): Python 3.6 community support ends at
> 2021-12-23. So, the deprecation is not required yet, but we had better
> prepare it because we don't have an ETA of Apache Spark 3.3 in 2022.
>
> - SparkR CRAN publishing: As we know, it's discontinued so far. Resuming
> it depends on the success of Apache SparkR 3.1.1 CRAN publishing. If it
> succeeds to revive it, we can keep publishing. Otherwise, I believe we had
> better drop it from the releasing work item list officially.
>
> # Dependencies
>
> - Apache Hadoop 3.3.2: Hadoop 3.2.0 becomes the default Hadoop profile in
> Apache Spark 3.1. Currently, Spark master branch lives on Hadoop 3.2.2's
> shaded clients via SPARK-33212. So far, there is one on-going report at
> YARN environment. We hope it will be fixed soon at Spark 3.2 timeframe and
> we can move toward Hadoop 3.3.2.
>
> - Apache Hive 2.3.9: Spark 3.0 starts to use Hive 2.3.7 by default instead
> of old Hive 1.2 fork. Spark 3.1 removed hive-1.2 profile completely via
> SPARK-32981 and replaced the generated hive-service-rpc code with the
> official dependency via SPARK-32981. We are steadily improving this area
> and will consume Hive 2.3.9 if available.
>
> - K8s Client 4.13.2: During K8s GA activity, Spark 3.1 upgrades K8s client
> dependency to 4.12.0. Spark 3.2 upgrades it to 4.13.2 in order to support
> K8s model 1.19.
>
> - Kafka Client 2.8: To bring the client fixes, Spark 3.1 is using Kafka
> Client 2.6. For Spark 3.2, SPARK-33913 upgraded to Kafka 2.7 with Scala
> 2.12.13, but it was reverted later due to Scala 2.12.13 issue. Since
> KAFKA-12357 fixed the Scala requirement two days ago, Spark 3.2 will go
> with Kafka Client 2.8 hopefully.
>
> # Some Features
>
> - Data Source v2: Spark 3.2 will deliver much richer DSv2 with Apache
> Iceberg integration. Especially, we hope the on-going function catalog SPIP
> and up-coming storage partitioned join SPIP can be delivered as a part of
> Spark 3.2 and become an additional foundation.
>
> - Columnar Encryption: As of today, Apache Spark master branch supports
> columnar encryption via Apache ORC 1.6 and it's documented via SPARK-34036.
> Also, upcoming Apache Parquet 1.12 has a similar capability. Hopefully,
> Apache Spark 3.2 is going to be the first release to have this feature
> officially. Any feedback is welcome.
>
> - Improved ZStandard Support: Spark 3.2 will bring more benefits for
> ZStandard users: 1) SPARK-34340 added native ZSTD JNI buffer pool support
> for all IO operations, 2) SPARK-33978 makes ORC datasource support ZSTD
> compression, 3) SPARK-34503 sets ZSTD as the default codec for event log
> compression, 4) SPARK-34479 aims to support ZSTD at Avro data source. Also,
> the upcoming Parquet 1.12 supports ZSTD (and supports JNI buffer pool),
> too. I'm expecting more benefits.
>
> - Structure Streaming with RocksDB backend: According to the latest
> update, it looks active enough for merging to master branch in Spark 3.2.
>
> Please share your thoughts and let's build better Apache Spark 3.2
> together.
>
> Bests,
> Dongjoon.
>


-- 
John Zhuge


Re: [VOTE] Release Spark 3.1.1 (RC3)

2021-02-23 Thread John Zhuge
>>>> The release files, including signatures, digests, etc. can be found at:
>>>> <https://dist.apache.org/repos/dist/dev/spark/v3.1.1-rc1-bin/>
>>>> https://dist.apache.org/repos/dist/dev/spark/v3.1.1-rc3-bin/
>>>>
>>>> Signatures used for Spark RCs can be found in this file:
>>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>>
>>>> The staging repository for this release can be found at:
>>>> https://repository.apache.org/content/repositories/orgapachespark-1367
>>>>
>>>> The documentation corresponding to this release can be found at:
>>>> https://dist.apache.org/repos/dist/dev/spark/v3.1.1-rc3-docs/
>>>>
>>>> The list of bug fixes going into 3.1.1 can be found at the following
>>>> URL:
>>>> https://s.apache.org/41kf2
>>>>
>>>> This release is using the release script of the tag v3.1.1-rc3.
>>>>
>>>> FAQ
>>>>
>>>> ===
>>>> What happened to 3.1.0?
>>>> ===
>>>>
>>>> There was a technical issue during Apache Spark 3.1.0 preparation, and
>>>> it was discussed and decided to skip 3.1.0.
>>>> Please see
>>>> https://spark.apache.org/news/next-official-release-spark-3.1.1.html for
>>>> more details.
>>>>
>>>> =
>>>> How can I help test this release?
>>>> =
>>>>
>>>> If you are a Spark user, you can help us test this release by taking
>>>> an existing Spark workload and running on this release candidate, then
>>>> reporting any regressions.
>>>>
>>>> If you're working in PySpark you can set up a virtual env and install
>>>> the current RC via "pip install
>>>> https://dist.apache.org/repos/dist/dev/spark/v3.1.1-rc3-bin/pyspark-3.1.1.tar.gz
>>>> "
>>>> and see if anything important breaks.
>>>> In the Java/Scala, you can add the staging repository to your projects
>>>> resolvers and test
>>>> with the RC (make sure to clean up the artifact cache before/after so
>>>> you don't end up building with an out of date RC going forward).
>>>>
>>>> ===
>>>> What should happen to JIRA tickets still targeting 3.1.1?
>>>> ===
>>>>
>>>> The current list of open tickets targeted at 3.1.1 can be found at:
>>>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>>>> Version/s" = 3.1.1
>>>>
>>>> Committers should look at those and triage. Extremely important bug
>>>> fixes, documentation, and API tweaks that impact compatibility should
>>>> be worked on immediately. Everything else please retarget to an
>>>> appropriate release.
>>>>
>>>> ==
>>>> But my bug isn't fixed?
>>>> ==
>>>>
>>>> In order to make timely releases, we will typically not hold the
>>>> release unless the bug in question is a regression from the previous
>>>> release. That being said, if there is something which is a regression
>>>> that has not been correctly targeted please ping me or a committer to
>>>> help target the issue.
>>>>
>>>>

-- 
John Zhuge


Re: [VOTE] Release Spark 3.0.2 (RC1)

2021-02-17 Thread John Zhuge
+1 (non-binding)

On Tue, Feb 16, 2021 at 11:11 PM Maxim Gekk 
wrote:

> +1 (non-binding)
>
> On Wed, Feb 17, 2021 at 9:54 AM Wenchen Fan  wrote:
>
>> +1
>>
>> On Wed, Feb 17, 2021 at 1:43 PM Dongjoon Hyun 
>> wrote:
>>
>>> +1
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>> On Tue, Feb 16, 2021 at 2:27 AM Herman van Hovell 
>>> wrote:
>>>
>>>> +1
>>>>
>>>> On Tue, Feb 16, 2021 at 11:08 AM Hyukjin Kwon 
>>>> wrote:
>>>>
>>>>> +1
>>>>>
>>>>> 2021년 2월 16일 (화) 오후 5:10, Prashant Sharma 님이 작성:
>>>>>
>>>>>> +1
>>>>>>
>>>>>> On Tue, Feb 16, 2021 at 1:22 PM Dongjoon Hyun <
>>>>>> dongjoon.h...@gmail.com> wrote:
>>>>>>
>>>>>>> Please vote on releasing the following candidate as Apache Spark
>>>>>>> version 3.0.2.
>>>>>>>
>>>>>>> The vote is open until February 19th 9AM (PST) and passes if a
>>>>>>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>>>>>>
>>>>>>> [ ] +1 Release this package as Apache Spark 3.0.2
>>>>>>> [ ] -1 Do not release this package because ...
>>>>>>>
>>>>>>> To learn more about Apache Spark, please see
>>>>>>> https://spark.apache.org/
>>>>>>>
>>>>>>> The tag to be voted on is v3.0.2-rc1 (commit
>>>>>>> 648457905c4ea7d00e3d88048c63f360045f0714):
>>>>>>> https://github.com/apache/spark/tree/v3.0.2-rc1
>>>>>>>
>>>>>>> The release files, including signatures, digests, etc. can be found
>>>>>>> at:
>>>>>>> https://dist.apache.org/repos/dist/dev/spark/v3.0.2-rc1-bin/
>>>>>>>
>>>>>>> Signatures used for Spark RCs can be found in this file:
>>>>>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>>>>>
>>>>>>> The staging repository for this release can be found at:
>>>>>>>
>>>>>>> https://repository.apache.org/content/repositories/orgapachespark-1366/
>>>>>>>
>>>>>>> The documentation corresponding to this release can be found at:
>>>>>>> https://dist.apache.org/repos/dist/dev/spark/v3.0.2-rc1-docs/
>>>>>>>
>>>>>>> The list of bug fixes going into 3.0.2 can be found at the following
>>>>>>> URL:
>>>>>>> https://issues.apache.org/jira/projects/SPARK/versions/12348739
>>>>>>>
>>>>>>> FAQ
>>>>>>>
>>>>>>> =
>>>>>>> How can I help test this release?
>>>>>>> =
>>>>>>>
>>>>>>> If you are a Spark user, you can help us test this release by taking
>>>>>>> an existing Spark workload and running on this release candidate,
>>>>>>> then
>>>>>>> reporting any regressions.
>>>>>>>
>>>>>>> If you're working in PySpark you can set up a virtual env and install
>>>>>>> the current RC and see if anything important breaks, in the
>>>>>>> Java/Scala
>>>>>>> you can add the staging repository to your projects resolvers and
>>>>>>> test
>>>>>>> with the RC (make sure to clean up the artifact cache before/after so
>>>>>>> you don't end up building with a out of date RC going forward).
>>>>>>>
>>>>>>> ===
>>>>>>> What should happen to JIRA tickets still targeting 3.0.2?
>>>>>>> ===
>>>>>>>
>>>>>>> The current list of open tickets targeted at 3.0.2 can be found at:
>>>>>>> https://issues.apache.org/jira/projects/SPARK and search for
>>>>>>> "Target Version/s" = 3.0.2
>>>>>>>
>>>>>>> Committers should look at those and triage. Extremely important bug
>>>>>>> fixes, documentation, and API tweaks that impact compatibility should
>>>>>>> be worked on immediately. Everything else please retarget to an
>>>>>>> appropriate release.
>>>>>>>
>>>>>>> ==
>>>>>>> But my bug isn't fixed?
>>>>>>> ==
>>>>>>>
>>>>>>> In order to make timely releases, we will typically not hold the
>>>>>>> release unless the bug in question is a regression from the previous
>>>>>>> release. That being said, if there is something which is a regression
>>>>>>> that has not been correctly targeted please ping me or a committer to
>>>>>>> help target the issue.
>>>>>>>
>>>>>>

-- 
John Zhuge


Re: Apache Spark 3.0.2 Release ?

2021-02-13 Thread John Zhuge
+1

On Sat, Feb 13, 2021 at 9:13 AM Holden Karau  wrote:

> +1, great idea.
>
> On Fri, Feb 12, 2021 at 6:40 PM Yuming Wang  wrote:
>
>> +1.
>>
>> On Sat, Feb 13, 2021 at 10:38 AM Takeshi Yamamuro 
>> wrote:
>>
>>> +1, too. Thanks, Dongjoon!
>>>
>>> 2021/02/13 11:07、Xiao Li のメール:
>>>
>>> 
>>> +1
>>>
>>> Happy Lunar New Year!
>>>
>>> Xiao
>>>
>>> On Fri, Feb 12, 2021 at 5:33 PM Hyukjin Kwon 
>>> wrote:
>>>
>>>> Yeah, +1 too
>>>>
>>>> On Sat, Feb 13, 2021 at 4:49 AM, Dongjoon Hyun wrote:
>>>>
>>>>> Thank you, Sean!
>>>>>
>>>>> On Fri, Feb 12, 2021 at 11:41 AM Sean Owen  wrote:
>>>>>
>>>>>> Sounds like a fine time to me, sure.
>>>>>>
>>>>>> On Fri, Feb 12, 2021 at 1:39 PM Dongjoon Hyun <
>>>>>> dongjoon.h...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi, All.
>>>>>>>
>>>>>>> As of today, `branch-3.0` has 307 patches (including 25 correctness
>>>>>>> patches) since the v3.0.1 tag (released on September 8th, 2020).
>>>>>>>
>>>>>>> Since we have stabilized branch-3.0 during the 3.1.x preparation,
>>>>>>> it would be great if we could start the Apache Spark 3.0.2 release
>>>>>>> next week.
>>>>>>> And I'd like to volunteer as the Apache Spark 3.0.2 release manager.
>>>>>>>
>>>>>>> What do you think about the Apache Spark 3.0.2 release?
>>>>>>>
>>>>>>> Bests,
>>>>>>> Dongjoon.
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> SPARK-31511 Make BytesToBytesMap iterator() thread-safe
>>>>>>> SPARK-32635 When pyspark.sql.functions.lit() function is used with
>>>>>>> dataframe cache, it returns wrong result
>>>>>>> SPARK-32753 Deduplicating and repartitioning the same column create
>>>>>>> duplicate rows with AQE
>>>>>>> SPARK-32764 compare of -0.0 < 0.0 return true
>>>>>>> SPARK-32840 Invalid interval value can happen to be just adhesive
>>>>>>> with the unit
>>>>>>> SPARK-32908 percentile_approx() returns incorrect results
>>>>>>> SPARK-33019 Use
>>>>>>> spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 by 
>>>>>>> default
>>>>>>> SPARK-33183 Bug in optimizer rule EliminateSorts
>>>>>>> SPARK-33260 SortExec produces incorrect results if sortOrder is a
>>>>>>> Stream
>>>>>>> SPARK-33290 REFRESH TABLE should invalidate cache even though the
>>>>>>> table itself may not be cached
>>>>>>> SPARK-33358 Spark SQL CLI command processing loop can't exit while
>>>>>>> one command fails
>>>>>>> SPARK-33404 "date_trunc" expression returns incorrect results
>>>>>>> SPARK-33435 DSv2: REFRESH TABLE should invalidate caches
>>>>>>> SPARK-33591 NULL is recognized as the "null" string in partition
>>>>>>> specs
>>>>>>> SPARK-33593 Vector reader got incorrect data with binary partition
>>>>>>> value
>>>>>>> SPARK-33726 Duplicate field names causes wrong answers during
>>>>>>> aggregation
>>>>>>> SPARK-33950 ALTER TABLE .. DROP PARTITION doesn't refresh cache
>>>>>>> SPARK-34011 ALTER TABLE .. RENAME TO PARTITION doesn't refresh cache
>>>>>>> SPARK-34027 ALTER TABLE .. RECOVER PARTITIONS doesn't refresh cache
>>>>>>> SPARK-34055 ALTER TABLE .. ADD PARTITION doesn't refresh cache
>>>>>>> SPARK-34187 Use available offset range obtained during polling when
>>>>>>> checking offset validation
>>>>>>> SPARK-34212 For parquet table, after changing the precision and
>>>>>>> scale of decimal type in hive, spark reads incorrect value
>>>>>>> SPARK-34213 LOAD DATA doesn't refresh v1 table cache
>>>>>>> SPARK-34229 Avro should read decimal values with the file schema
>>>>>>> SPARK-34262 ALTER TABLE .. SET LOCATION doesn't refresh v1 table
>>>>>>> cache
>>>>>>>
>>>>>>
>>>
>>> --
>>>
>>> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


-- 
John Zhuge


Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-12 Thread John Zhuge
/committers and this thread doesn't get much attention yet.
>>
>> On Wed, Feb 10, 2021 at 6:17 AM Hyukjin Kwon  wrote:
>>
>> Just dropping a few lines. I remember that one of the goals in DSv2 is to
>> correct the mistakes we made in the current Spark code.
>> There would not be much point if we just happen to follow and mimic
>> what Spark currently does. It might just end up as another copy of the Spark
>> APIs, e.g. the Expression (internal) APIs. I sincerely would like to avoid this.
>> I do believe we have been stuck mainly due to trying to come up with a
>> better design. We already have an ugly picture of the current Spark APIs from
>> which to draw a better, bigger picture.
>>
>>
>> On Wed, Feb 10, 2021 at 3:28 AM, Holden Karau wrote:
>>
>> I think this proposal is a good set of trade-offs and has existed in the
>> community for a long period of time. I especially appreciate how the design
>> is focused on a minimal useful component, with future optimizations
>> considered from a point of view of making sure it's flexible, but actual
>> concrete decisions left for the future once we see how this API is used. I
>> think if we try and optimize everything right out of the gate, we'll
>> quickly get stuck (again) and not make any progress.
>>
>> On Mon, Feb 8, 2021 at 10:46 AM Ryan Blue  wrote:
>>
>> Hi everyone,
>>
>> I'd like to start a discussion for adding a FunctionCatalog interface to
>> catalog plugins. This will allow catalogs to expose functions to Spark,
>> similar to how the TableCatalog interface allows a catalog to expose
>> tables. The proposal doc is available here:
>> https://docs.google.com/document/d/1PLBieHIlxZjmoUB0ERF-VozCRJ0xw2j3qKvUNWpWA2U/edit
>>
>> Here's a high-level summary of some of the main design choices:
>> * Adds the ability to list and load functions, not to create or modify
>> them in an external catalog
>> * Supports scalar, aggregate, and partial aggregate functions
>> * Uses load and bind steps for better error messages and simpler
>> implementations
>> * Like the DSv2 table read and write APIs, it uses InternalRow to pass
>> data
>> * Can be extended using mix-in interfaces to add vectorization, codegen,
>> and other future features
>>
>> There is also a PR with the proposed API:
>> https://github.com/apache/spark/pull/24559/files
>>
>> Let's discuss the proposal here rather than on that PR, to get better
>> visibility. Also, please take the time to read the proposal first. That
>> really helps clear up misconceptions.
>>
>>
>>
>> --
>> Ryan Blue
>>
>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>> --
>> Ryan Blue
>>
>>

-- 
John Zhuge
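To make the design summary quoted above more concrete, here is a rough Scala
sketch of the list/load and load-and-bind shape it describes. The names and
signatures below are illustrative assumptions, not the API in the linked PR;
the PR and the proposal doc remain the authoritative definitions.

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.types.{DataType, StructType}

// Illustrative sketch only: list/load functions from a catalog, then bind
// them to an input schema before execution, passing data as InternalRow.
trait FunctionCatalog {
  def listFunctions(namespace: Array[String]): Array[String]
  def loadFunction(namespace: Array[String], name: String): UnboundFunction
}

trait UnboundFunction {
  def name: String
  // The separate bind step validates the input schema up front, which is what
  // enables the "better error messages and simpler implementations" point above.
  def bind(inputType: StructType): BoundFunction
}

trait BoundFunction {
  def inputTypes: Array[DataType]
  def resultType: DataType
}

trait ScalarFunction[R] extends BoundFunction {
  def produceResult(input: InternalRow): R
}

trait AggregateFunction[S, R] extends BoundFunction {
  def newAggregationState(): S
  def update(state: S, input: InternalRow): S
  def merge(left: S, right: S): S
  def produceResult(state: S): R
}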


Re: [VOTE] Release Spark 3.1.1 (RC1)

2021-01-19 Thread John Zhuge
+1 (non-binding)

On Tue, Jan 19, 2021 at 4:08 AM JackyLee  wrote:

> +1
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>

-- 
John Zhuge


Re: SPIP: Catalog API for view metadata

2020-09-04 Thread John Zhuge
SPIP
<https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing>
has been updated. Please review.

On Thu, Sep 3, 2020 at 9:22 AM John Zhuge  wrote:

> Wenchen, sorry for the delay, I will post an update shortly.
>
> On Thu, Sep 3, 2020 at 2:00 AM Wenchen Fan  wrote:
>
>> Any updates here? I agree that a new View API is better, but we need a
>> solution to avoid performance regression. We need to elaborate on the cache
>> idea.
>>
>> On Thu, Aug 20, 2020 at 7:43 AM Ryan Blue  wrote:
>>
>>> I think it is a good idea to keep tables and views separate.
>>>
>>> The main two arguments I’ve heard for combining lookup into a single
>>> function are the ones brought up in this thread. First, an identifier in a
>>> catalog must be either a view or a table and should not collide. Second, a
>>> single lookup is more likely to require a single RPC. I think the RPC
>>> concern is well addressed by caching, which we already do in the Spark
>>> catalog, so I’ll primarily focus on the first.
>>>
>>> Table/view name collision is unlikely to be a problem. Metastores that
>>> support both today store them in a single namespace, so this is not a
>>> concern for even a naive implementation that talks to the Hive MetaStore. I
>>> know that a new metastore catalog could choose to implement both
>>> ViewCatalog and TableCatalog and store the two sets separately, but that
>>> would be a very strange choice: if the metastore itself has different
>>> namespaces for tables and views, then it makes much more sense to expose
>>> them through separate catalogs because Spark will always prefer one over
>>> the other.
>>>
>>> In a similar line of reasoning, catalogs that expose both views and
>>> tables are much more rare than catalogs that only expose one. For example,
>>> v2 catalogs for JDBC and Cassandra expose data through the Table interface
>>> and implementing ViewCatalog would make little sense. Exposing new data
>>> sources to Spark requires TableCatalog, not ViewCatalog. View catalogs are
>>> likely to be the same. Say I have a way to convert Pig statements or some
>>> other representation into a SQL view. It would make little sense to combine
>>> that with some other TableCatalog.
>>>
>>> I also don’t think there is benefit from an API perspective to justify
>>> combining the Table and View interfaces. The two share only schema and
>>> properties, and are handled very differently internally — a View’s SQL
>>> query is parsed and substituted into the plan, while a Table is wrapped in
>>> a relation that eventually becomes a Scan node using SupportsRead. A view’s
>>> SQL also needs additional context to be resolved correctly: the current
>>> catalog and namespace from the time the view was created.
>>>
>>> Query planning is distinct between tables and views, so Spark doesn’t
>>> benefit from combining them. I think it has actually caused problems that
>>> both were resolved by the same method in v1: the resolution rule grew
>>> extremely complicated trying to look up a reference just once because it
>>> had to parse a view plan and resolve relations within it using the view’s
>>> context (current database). In contrast, John’s new view substitution rules
>>> are cleaner and can stay within the substitution batch.
>>>
>>> People implementing views would also not benefit from combining the two
>>> interfaces:
>>>
>>>- There is little overlap between View and Table, only schema and
>>>properties
>>>- Most catalogs won’t implement both interfaces, so returning a
>>>ViewOrTable is more difficult for implementations
>>>- TableCatalog assumes that ViewCatalog will be added separately
>>>like John proposes, so we would have to break or replace that API
>>>
>>> I understand the initial appeal of combining TableCatalog and
>>> ViewCatalog since it is done that way in the existing interfaces. But I
>>> think that Hive chose to do that mostly on the fact that the two were
>>> already stored together, and not because it made sense for users of the
>>> API, or any other implementer of the API.
>>>
>>> rb
>>>
>>> On Tue, Aug 18, 2020 at 9:46 AM John Zhuge  wrote:
>>>
>>>>
>>>>
>>>>
>>>>> > AFAIK view schema is only used by DESCRIBE.
>>>>>
>>>>> Correction: Spark adds 

Re: SPIP: Catalog API for view metadata

2020-09-03 Thread John Zhuge
Wenchen, sorry for the delay, I will post an update shortly.

On Thu, Sep 3, 2020 at 2:00 AM Wenchen Fan  wrote:

> Any updates here? I agree that a new View API is better, but we need a
> solution to avoid performance regression. We need to elaborate on the cache
> idea.
>
> On Thu, Aug 20, 2020 at 7:43 AM Ryan Blue  wrote:
>
>> I think it is a good idea to keep tables and views separate.
>>
>> The main two arguments I’ve heard for combining lookup into a single
>> function are the ones brought up in this thread. First, an identifier in a
>> catalog must be either a view or a table and should not collide. Second, a
>> single lookup is more likely to require a single RPC. I think the RPC
>> concern is well addressed by caching, which we already do in the Spark
>> catalog, so I’ll primarily focus on the first.
>>
>> Table/view name collision is unlikely to be a problem. Metastores that
>> support both today store them in a single namespace, so this is not a
>> concern for even a naive implementation that talks to the Hive MetaStore. I
>> know that a new metastore catalog could choose to implement both
>> ViewCatalog and TableCatalog and store the two sets separately, but that
>> would be a very strange choice: if the metastore itself has different
>> namespaces for tables and views, then it makes much more sense to expose
>> them through separate catalogs because Spark will always prefer one over
>> the other.
>>
>> In a similar line of reasoning, catalogs that expose both views and
>> tables are much more rare than catalogs that only expose one. For example,
>> v2 catalogs for JDBC and Cassandra expose data through the Table interface
>> and implementing ViewCatalog would make little sense. Exposing new data
>> sources to Spark requires TableCatalog, not ViewCatalog. View catalogs are
>> likely to be the same. Say I have a way to convert Pig statements or some
>> other representation into a SQL view. It would make little sense to combine
>> that with some other TableCatalog.
>>
>> I also don’t think there is benefit from an API perspective to justify
>> combining the Table and View interfaces. The two share only schema and
>> properties, and are handled very differently internally — a View’s SQL
>> query is parsed and substituted into the plan, while a Table is wrapped in
>> a relation that eventually becomes a Scan node using SupportsRead. A view’s
>> SQL also needs additional context to be resolved correctly: the current
>> catalog and namespace from the time the view was created.
>>
>> Query planning is distinct between tables and views, so Spark doesn’t
>> benefit from combining them. I think it has actually caused problems that
>> both were resolved by the same method in v1: the resolution rule grew
>> extremely complicated trying to look up a reference just once because it
>> had to parse a view plan and resolve relations within it using the view’s
>> context (current database). In contrast, John’s new view substitution rules
>> are cleaner and can stay within the substitution batch.
>>
>> People implementing views would also not benefit from combining the two
>> interfaces:
>>
>>- There is little overlap between View and Table, only schema and
>>properties
>>- Most catalogs won’t implement both interfaces, so returning a
>>ViewOrTable is more difficult for implementations
>>- TableCatalog assumes that ViewCatalog will be added separately like
>>John proposes, so we would have to break or replace that API
>>
>> I understand the initial appeal of combining TableCatalog and ViewCatalog
>> since it is done that way in the existing interfaces. But I think that Hive
>> chose to do that mostly on the fact that the two were already stored
>> together, and not because it made sense for users of the API, or any other
>> implementer of the API.
>>
>> rb
>>
>> On Tue, Aug 18, 2020 at 9:46 AM John Zhuge  wrote:
>>
>>>
>>>
>>>
>>>> > AFAIK view schema is only used by DESCRIBE.
>>>>
>>>> Correction: Spark adds a new Project at the top of the parsed plan from
>>>> view, based on the stored schema, to make sure the view schema doesn't
>>>> change.
>>>>
>>>
>>> Thanks Wenchen! I thought I forgot something :) Yes it is the validation
>>> done in *checkAnalysis*:
>>>
>>>   // If the view output doesn't have the same number of columns
>>> neither with the child
>>>   // output, nor with the query column names, th

Re: SPIP: Catalog API for view metadata

2020-08-18 Thread John Zhuge
> > AFAIK view schema is only used by DESCRIBE.
>
> Correction: Spark adds a new Project at the top of the parsed plan from
> view, based on the stored schema, to make sure the view schema doesn't
> change.
>

Thanks Wenchen! I thought I forgot something :) Yes it is the validation
done in *checkAnalysis*:

  // If the view output doesn't have the same number of columns neither with the child
  // output, nor with the query column names, throw an AnalysisException.
  // If the view's child output can't up cast to the view output,
  // throw an AnalysisException, too.

The view output comes from the schema:

  val child = View(
    desc = metadata,
    output = metadata.schema.toAttributes,
    child = parser.parsePlan(viewText))

So the stored schema is a nice-to-have for validation (here) or as a cache (in
DESCRIBE), but it is not "required" and does not have to be "frozen". Thanks
Ryan and Burak for pointing that out in the SPIP. I will add a new paragraph
accordingly.
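For readers following along, here is a simplified sketch of the check described
above. It is an illustration only, not the actual CheckAnalysis code, and it
assumes Spark's Cast.canUpCast helper; everything else is made up for the sketch.

import org.apache.spark.sql.catalyst.expressions.{Attribute, Cast}

object ViewSchemaCheck {
  // The stored view schema is only compared against the output of the freshly
  // parsed view SQL; it is not used to resolve the view itself.
  def validateViewOutput(viewOutput: Seq[Attribute], childOutput: Seq[Attribute]): Unit = {
    require(viewOutput.length == childOutput.length,
      s"View columns ${viewOutput.map(_.name)} do not match the underlying " +
        s"query output ${childOutput.map(_.name)} in number of columns")
    viewOutput.zip(childOutput).foreach { case (expected, actual) =>
      require(Cast.canUpCast(actual.dataType, expected.dataType),
        s"Cannot up-cast ${actual.name} from ${actual.dataType} to ${expected.dataType}")
    }
  }
}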


Re: SPIP: Catalog API for view metadata

2020-08-18 Thread John Zhuge
Thanks Wenchen. Will do.

On Tue, Aug 18, 2020 at 6:38 AM Wenchen Fan  wrote:

> > AFAIK view schema is only used by DESCRIBE.
>
> Correction: Spark adds a new Project at the top of the parsed plan from
> view, based on the stored schema, to make sure the view schema doesn't
> change.
>
> Can you update your doc to incorporate the cache idea? Let's make sure we
> don't have perf issues if we go with the new View API.
>
> On Tue, Aug 18, 2020 at 4:25 PM John Zhuge  wrote:
>
>> Thanks Burak and Walaa for the feedback!
>>
>> Here are my perspectives:
>>
>> We shouldn't be persisting things like the schema for a view
>>
>>
>> This is not related to which option to choose because existing code
>> persists schema as well.
>> When resolving the view, the analyzer always parses the view SQL text; it
>> does not use the schema.
>>
>>> AFAIK view schema is only used by DESCRIBE.
>>
>>
>>> Why not use TableCatalog.loadTable to load both tables and views
>>>
>> Also, views can be defined on top of either other views or base tables,
>>> so the less divergence in code paths between views and tables the better.
>>
>>
>> Existing Spark takes this approach and there are quite a few checks like
>> "tableType == CatalogTableType.VIEW".
>> View and table metadata surprisingly have very little in common, thus I'd
>> like to group view related code together, separate from table processing.
>> Views are much closer to CTEs. SPIP proposed a new rule ViewSubstitution
>> in the same "Substitution" batch as CTESubstitution.
>>
>> This way you avoid multiple RPCs to a catalog or data source or
>>> metastore, and you avoid namespace/name conflits. Also you make yourself
>>> less susceptible to race conditions (which still inherently exist).
>>>
>>
>> Valid concern. It can be mitigated by caching RPC calls in the catalog
>> implementation. The window for race conditions can also be narrowed
>> significantly, but not totally eliminated.
>>
>>
>> On Fri, Aug 14, 2020 at 2:43 AM Walaa Eldin Moustafa <
>> wa.moust...@gmail.com> wrote:
>>
>>> Wenchen, agreed with what you said. I was referring to situations where
>>> the underlying table schema evolves (say by introducing a nested field in a
>>> Struct), and also what you mentioned in cases of SELECT *. The Hive
>>> metastore handling of those does not automatically update view schema (even
>>> though executing the view in Hive results in data that has the most recent
>>> schema when underlying tables evolve -- so newly added nested field data
>>> shows up in the view evaluation query result but not in the view schema).
>>>
>>> On Fri, Aug 14, 2020 at 2:36 AM Wenchen Fan  wrote:
>>>
>>>> View should have a fixed schema like a table. It should either be
>>>> inferred from the query when creating the view, or be specified by the user
>>>> manually, like CREATE VIEW v(a, b) AS SELECT .... Users can still alter the
>>>> view schema manually.
>>>>
>>>> Basically a view is just a named SQL query, which mostly has fixed
>>>> schema unless you do something like SELECT *.
>>>>
>>>> On Fri, Aug 14, 2020 at 8:39 AM Walaa Eldin Moustafa <
>>>> wa.moust...@gmail.com> wrote:
>>>>
>>>>> +1 to making views as special forms of tables. Sometimes a table can
>>>>> be converted to a view to hide some of the implementation details while 
>>>>> not
>>>>> impacting readers (provided that the write path is controlled). Also, 
>>>>> views
>>>>> can be defined on top of either other views or base tables, so the less
>>>>> divergence in code paths between views and tables the better.
>>>>>
>>>>> For whether to materialize view schema or infer it, one of the issues
>>>>> we face with the HMS approach of materialization is that when the
>>>>> underlying table schema evolves, HMS will still keep the view schema
>>>>> unchanged. This causes a number of discrepancies that we address
>>>>> out-of-band (e.g., run separate pipeline to ensure view schema freshness,
>>>>> or just re-derive it at read time (example derivation algorithm for
>>>>> view Avro schema
>>>>> <https://github.com/linkedin/coral/blob/master/coral-schema/src/main/java/com/linkedin/coral/schema/avro/ViewToAvroSchemaConverter.java>
>>>>> )).
>>>>>

Re: SPIP: Catalog API for view metadata

2020-08-18 Thread John Zhuge
Thanks Burak and Walaa for the feedback!

Here are my perspectives:

We shouldn't be persisting things like the schema for a view


This is not related to which option to choose because existing code
persists schema as well.
When resolving the view, the analyzer always parses the view SQL text; it
does not use the schema.

> AFAIK view schema is only used by DESCRIBE.


> Why not use TableCatalog.loadTable to load both tables and views
>
Also, views can be defined on top of either other views or base tables, so
> the less divergence in code paths between views and tables the better.


Existing Spark takes this approach and there are quite a few checks like
"tableType == CatalogTableType.VIEW".
View and table metadata surprisingly have very little in common, thus I'd
like to group view-related code together, separate from table processing.
Views are much closer to CTEs. The SPIP proposes a new rule, ViewSubstitution,
in the same "Substitution" batch as CTESubstitution.

This way you avoid multiple RPCs to a catalog or data source or metastore,
> and you avoid namespace/name conflits. Also you make yourself less
> susceptible to race conditions (which still inherently exist).
>

Valid concern. It can be mitigated by caching RPC calls in the catalog
implementation. The window for race conditions can also be narrowed
significantly, but not totally eliminated.
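To illustrate the caching mitigation mentioned above, here is a minimal
self-contained sketch. ViewDef and TableDef are placeholder types invented for
the sketch, not the proposed API: the idea is that a dual catalog resolves an
identifier once against the metastore and serves both the view and the table
lookup from that cached result, so checking the view catalog first does not
double the RPC count.

import java.util.concurrent.ConcurrentHashMap

sealed trait CatalogEntry
case class ViewDef(sql: String) extends CatalogEntry
case class TableDef(location: String) extends CatalogEntry

// One remote lookup per identifier, shared by loadView and loadTable.
class CachingDualCatalog(remoteLookup: String => Option[CatalogEntry]) {
  private val cache = new ConcurrentHashMap[String, Option[CatalogEntry]]()

  private def entry(ident: String): Option[CatalogEntry] =
    cache.computeIfAbsent(ident, k => remoteLookup(k))

  def loadView(ident: String): Option[ViewDef] =
    entry(ident).collect { case v: ViewDef => v }

  def loadTable(ident: String): Option[TableDef] =
    entry(ident).collect { case t: TableDef => t }
}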


On Fri, Aug 14, 2020 at 2:43 AM Walaa Eldin Moustafa 
wrote:

> Wenchen, agreed with what you said. I was referring to situations where
> the underlying table schema evolves (say by introducing a nested field in a
> Struct), and also what you mentioned in cases of SELECT *. The Hive
> metastore handling of those does not automatically update view schema (even
> though executing the view in Hive results in data that has the most recent
> schema when underlying tables evolve -- so newly added nested field data
> shows up in the view evaluation query result but not in the view schema).
>
> On Fri, Aug 14, 2020 at 2:36 AM Wenchen Fan  wrote:
>
>> View should have a fixed schema like a table. It should either be
>> inferred from the query when creating the view, or be specified by the user
>> manually, like CREATE VIEW v(a, b) AS SELECT .... Users can still alter the
>> view schema manually.
>>
>> Basically a view is just a named SQL query, which mostly has fixed schema
>> unless you do something like SELECT *.
>>
>> On Fri, Aug 14, 2020 at 8:39 AM Walaa Eldin Moustafa <
>> wa.moust...@gmail.com> wrote:
>>
>>> +1 to making views as special forms of tables. Sometimes a table can be
>>> converted to a view to hide some of the implementation details while not
>>> impacting readers (provided that the write path is controlled). Also, views
>>> can be defined on top of either other views or base tables, so the less
>>> divergence in code paths between views and tables the better.
>>>
>>> For whether to materialize view schema or infer it, one of the issues we
>>> face with the HMS approach of materialization is that when the underlying
>>> table schema evolves, HMS will still keep the view schema unchanged. This
>>> causes a number of discrepancies that we address out-of-band (e.g., run
>>> separate pipeline to ensure view schema freshness, or just re-derive it at
>>> read time (example derivation algorithm for view Avro schema
>>> <https://github.com/linkedin/coral/blob/master/coral-schema/src/main/java/com/linkedin/coral/schema/avro/ViewToAvroSchemaConverter.java>
>>> )).
>>>
>>> Also regarding SupportsRead vs SupportWrite, some views can be
>>> updateable (example from MySQL
>>> https://dev.mysql.com/doc/refman/8.0/en/view-updatability.html), but
>>> also implementing that requires a few concepts that are more prominent in
>>> an RDBMS.
>>>
>>> Thanks,
>>> Walaa.
>>>
>>>
>>> On Thu, Aug 13, 2020 at 5:09 PM Burak Yavuz  wrote:
>>>
>>>> My high level comment here is that as a naive person, I would expect a
>>>> View to be a special form of Table that SupportsRead but doesn't
>>>> SupportWrite. loadTable in the TableCatalog API should load both tables and
>>>> views. This way you avoid multiple RPCs to a catalog or data source or
>>>> metastore, and you avoid namespace/name conflits. Also you make yourself
>>>> less susceptible to race conditions (which still inherently exist).
>>>>
>>>> In addition, I'm not a SQL expert, but I thought that views are
>>>> evaluated at runtime, therefore we shouldn't be persisting things like the
>>>> schema for a view.
>>>>
>

Re: SPIP: Catalog API for view metadata

2020-08-13 Thread John Zhuge
Thanks Ryan.

The ViewCatalog API mimics the TableCatalog API, including how the shared
namespace is handled:

   - The doc for createView
   
<https://github.com/apache/spark/pull/28147/files#diff-24f7e7a09707492d3e65d549002e5849R109>
states
   "it will throw ViewAlreadyExistsException when a view or table already
   exists for the identifier."
   - The doc for loadView
   
<https://github.com/apache/spark/pull/28147/files#diff-24f7e7a09707492d3e65d549002e5849R75>
states
   "If the catalog supports tables and contains a table for the identifier and
   not a view, this must throw NoSuchViewException."

Agreed, it is good to explicitly specify the order of resolution. I will add
a section to the ViewCatalog javadoc summarizing the behavior for a "shared
namespace". The loadView doc will also be updated to spell out the order of
resolution.
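As a rough illustration of the shared-namespace contract summarized above: the
exception names follow the quoted javadoc, but everything else in this sketch
is an assumption made for illustration, not the proposed API itself.

class ViewAlreadyExistsException(ident: String)
  extends RuntimeException(s"View or table already exists: $ident")
class NoSuchViewException(ident: String)
  extends RuntimeException(s"No such view (it may be a table): $ident")

// A catalog that keeps tables and views in one namespace and enforces the
// documented behavior for createView and loadView.
trait SharedNamespaceCatalog {
  def tableExists(ident: String): Boolean
  def viewExists(ident: String): Boolean
  protected def doCreateView(ident: String, sql: String): Unit
  protected def doLoadView(ident: String): String

  def createView(ident: String, sql: String): Unit = {
    // A new view must not collide with an existing table or view.
    if (tableExists(ident) || viewExists(ident)) {
      throw new ViewAlreadyExistsException(ident)
    }
    doCreateView(ident, sql)
  }

  def loadView(ident: String): String = {
    // Resolution order: an identifier that names a table is not a view.
    if (!viewExists(ident)) throw new NoSuchViewException(ident)
    doLoadView(ident)
  }
}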

On Thu, Aug 13, 2020 at 1:41 PM Ryan Blue  wrote:

> I agree with Wenchen that we need to be clear about resolution and
> behavior. For example, I think that we would agree that CREATE VIEW
> catalog.schema.name should fail when there is a table named
> catalog.schema.name. We’ve already included this behavior in the
> documentation for the TableCatalog API
> <https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/connector/catalog/TableCatalog.html#createTable-org.apache.spark.sql.connector.catalog.Identifier-org.apache.spark.sql.types.StructType-org.apache.spark.sql.connector.expressions.Transform:A-java.util.Map->,
> where create should fail if a view exists for the identifier.
>
> I think it was simply assumed that we would use the same approach — the
> API requires that table and view names share a namespace. But it would be
> good to specifically note either the order in which resolution will happen
> (views are resolved first) or note that it is not allowed and behavior is
> not guaranteed. I prefer the first option.
>
> On Wed, Aug 12, 2020 at 5:14 PM John Zhuge  wrote:
>
>> Hi Wenchen,
>>
>> Thanks for the feedback!
>>
>> 1. Add a new View API. How to avoid name conflicts between table and
>>> view? When resolving relation, shall we lookup table catalog first or view
>>> catalog?
>>
>>
>>  See clarification in SPIP section "Proposed Changes - Namespace":
>>
>>- The proposed new view substitution rule and the changes to
>>ResolveCatalogs should ensure the view catalog is looked up first for a
>>"dual" catalog.
>>- The implementation for a "dual" catalog plugin should ensure:
>>   -  Creating a view in view catalog when a table of the same name
>>   exists should fail.
>>   -  Creating a table in table catalog when a view of the same name
>>   exists should fail as well.
>>
>> Agree with you that a new View API is more flexible. A couple of notes:
>>
>>    - We actually started a common view prototype using the single
>>    catalog approach, but once we added more and more view metadata, storing
>>    it in table properties became unmanageable, especially for features
>>    like "versioning". Eventually we opted for a view backend of S3 JSON
>>    files.
>>- We'd like to move away from Hive metastore
>>
>> For more details and discussion, see SPIP section "Background and
>> Motivation".
>>
>> Thanks,
>> John
>>
>> On Wed, Aug 12, 2020 at 10:15 AM Wenchen Fan  wrote:
>>
>>> Hi John,
>>>
>>> Thanks for working on this! View support is very important to the
>>> catalog plugin API.
>>>
>>> After reading your doc, I have one high-level question: should view be a
>>> separated API or it's just a special type of table?
>>>
>>> AFAIK in most databases, tables and views share the same namespace. You
>>> can't create a view if a same-name table exists. In Hive, view is just a
>>> special type of table, so they are in the same namespace naturally. If we
>>> have both table catalog and view catalog, we need a mechanism to make sure
>>> there are no name conflicts.
>>>
>>> On the other hand, the view metadata is simple enough that it can be put in
>>> table properties. I'd like to see more thoughts to evaluate these 2
>>> approaches:
>>> 1. *Add a new View API*. How to avoid name conflicts between table and
>>> view? When resolving relation, shall we lookup table catalog first or view
>>> catalog?
>>> 2. *Reuse the Table API*. How to indicate it's a view? What if we do
>>> want to store table and views separately?
>>>
>>> I think a new View API is more flexible. I'd

Re: SPIP: Catalog API for view metadata

2020-08-12 Thread John Zhuge
Hi Wenchen,

Thanks for the feedback!

1. Add a new View API. How to avoid name conflicts between table and view?
> When resolving relation, shall we lookup table catalog first or view
> catalog?


 See clarification in SPIP section "Proposed Changes - Namespace":

   - The proposed new view substitution rule and the changes to
   ResolveCatalogs should ensure the view catalog is looked up first for a
   "dual" catalog.
   - The implementation for a "dual" catalog plugin should ensure:
  -  Creating a view in view catalog when a table of the same name
  exists should fail.
  -  Creating a table in table catalog when a view of the same name
  exists should fail as well.

Agree with you that a new View API is more flexible. A couple of notes:

   - We actually started a common view prototype using the single catalog
   approach, but once we added more and more view metadata, storing it in
   table properties became unmanageable, especially for features like
   "versioning". Eventually we opted for a view backend of S3 JSON files.
   - We'd like to move away from Hive metastore

For more details and discussion, see SPIP section "Background and
Motivation".

Thanks,
John

On Wed, Aug 12, 2020 at 10:15 AM Wenchen Fan  wrote:

> Hi John,
>
> Thanks for working on this! View support is very important to the catalog
> plugin API.
>
> After reading your doc, I have one high-level question: should view be a
> separated API or it's just a special type of table?
>
> AFAIK in most databases, tables and views share the same namespace. You
> can't create a view if a same-name table exists. In Hive, view is just a
> special type of table, so they are in the same namespace naturally. If we
> have both table catalog and view catalog, we need a mechanism to make sure
> there are no name conflicts.
>
> On the other hand, the view metadata is simple enough that it can be put in
> table properties. I'd like to see more thoughts to evaluate these 2
> approaches:
> 1. *Add a new View API*. How to avoid name conflicts between table and
> view? When resolving relation, shall we lookup table catalog first or view
> catalog?
> 2. *Reuse the Table API*. How to indicate it's a view? What if we do want
> to store table and views separately?
>
> I think a new View API is more flexible. I'd vote for it if we can come up
> with a good mechanism to avoid name conflicts.
>
> On Wed, Aug 12, 2020 at 6:20 AM John Zhuge  wrote:
>
>> Hi Spark devs,
>>
>> I'd like to bring more attention to this SPIP. As Dongjoon indicated in
>> the email "Apache Spark 3.1 Feature Expectation (Dec. 2020)", this feature
>> can be considered for 3.2 or even 3.1.
>>
>> View catalog builds on top of the catalog plugin system introduced in
>> DataSourceV2. It adds the “ViewCatalog” API to load, create, alter, and
>> drop views. A catalog plugin can naturally implement both ViewCatalog and
>> TableCatalog.
>>
>> Our internal implementation has been in production for over 8 months.
>> Recently we extended it to support materialized views, for the read path
>> initially.
>>
>> The PR has conflicts that I will resolve shortly.
>>
>> Thanks,
>>
>> On Wed, Apr 22, 2020 at 12:24 AM John Zhuge  wrote:
>>
>>> Hi everyone,
>>>
>>> In order to disassociate view metadata from Hive Metastore and support
>>> different storage backends, I am proposing a new view catalog API to load,
>>> create, alter, and drop views.
>>>
>>> Document:
>>> https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing
>>> JIRA: https://issues.apache.org/jira/browse/SPARK-31357
>>> WIP PR: https://github.com/apache/spark/pull/28147
>>>
>>> As part of a project to support common views across query engines like
>>> Spark and Presto, my team used the view catalog API in the Spark
>>> implementation. The project has been in production for over three months.
>>>
>>> Thanks,
>>> John Zhuge
>>>
>>
>>
>> --
>> John Zhuge
>>
>

-- 
John Zhuge


Re: SPIP: Catalog API for view metadata

2020-08-11 Thread John Zhuge
Hi Spark devs,

I'd like to bring more attention to this SPIP. As Dongjoon indicated in the
email "Apache Spark 3.1 Feature Expectation (Dec. 2020)", this feature can
be considered for 3.2 or even 3.1.

View catalog builds on top of the catalog plugin system introduced in
DataSourceV2. It adds the “ViewCatalog” API to load, create, alter, and
drop views. A catalog plugin can naturally implement both ViewCatalog and
TableCatalog.

Our internal implementation has been in production for over 8 months.
Recently we extended it to support materialized views, for the read path
initially.

The PR has conflicts that I will resolve shortly.

Thanks,

On Wed, Apr 22, 2020 at 12:24 AM John Zhuge  wrote:

> Hi everyone,
>
> In order to disassociate view metadata from Hive Metastore and support
> different storage backends, I am proposing a new view catalog API to load,
> create, alter, and drop views.
>
> Document:
> https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing
> JIRA: https://issues.apache.org/jira/browse/SPARK-31357
> WIP PR: https://github.com/apache/spark/pull/28147
>
> As part of a project to support common views across query engines like
> Spark and Presto, my team used the view catalog API in the Spark
> implementation. The project has been in production for over three months.
>
> Thanks,
> John Zhuge
>


-- 
John Zhuge


SPIP: Catalog API for view metadata

2020-04-22 Thread John Zhuge
Hi everyone,

In order to disassociate view metadata from Hive Metastore and support
different storage backends, I am proposing a new view catalog API to load,
create, alter, and drop views.

Document:
https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing
JIRA: https://issues.apache.org/jira/browse/SPARK-31357
WIP PR: https://github.com/apache/spark/pull/28147

As part of a project to support common views across query engines like
Spark and Presto, my team used the view catalog API in the Spark
implementation. The project has been in production for over three months.

Thanks,
John Zhuge


Re: [VOTE] Amend Spark's Semantic Versioning Policy

2020-03-09 Thread John Zhuge
>> >> community is encouraged to add their voice to the discussion.
>>>>>>>>> >> >>
>>>>>>>>> >> >>
>>>>>>>>> >> >> [ ] +1 - Spark should adopt this policy.
>>>>>>>>> >> >>
>>>>>>>>> >> >> [ ] -1  - Spark should not adopt this policy.
>>>>>>>>> >> >>
>>>>>>>>> >> >>
>>>>>>>>> >> >> 
>>>>>>>>> >> >>
>>>>>>>>> >> >>
>>>>>>>>> >> >> Considerations When Breaking APIs
>>>>>>>>> >> >>
>>>>>>>>> >> >> The Spark project strives to avoid breaking APIs or silently
>>>>>>>>> changing behavior, even at major versions. While this is not always
>>>>>>>>> possible, the balance of the following factors should be considered 
>>>>>>>>> before
>>>>>>>>> choosing to break an API.
>>>>>>>>> >> >>
>>>>>>>>> >> >>
>>>>>>>>> >> >> Cost of Breaking an API
>>>>>>>>> >> >>
>>>>>>>>> >> >> Breaking an API almost always has a non-trivial cost to the
>>>>>>>>> users of Spark. A broken API means that Spark programs need to be 
>>>>>>>>> rewritten
>>>>>>>>> before they can be upgraded. However, there are a few considerations 
>>>>>>>>> when
>>>>>>>>> thinking about what the cost will be:
>>>>>>>>> >> >>
>>>>>>>>> >> >> Usage - an API that is actively used in many different
>>>>>>>>> places, is always very costly to break. While it is hard to know 
>>>>>>>>> usage for
>>>>>>>>> sure, there are a bunch of ways that we can estimate:
>>>>>>>>> >> >>
>>>>>>>>> >> >> How long has the API been in Spark?
>>>>>>>>> >> >>
>>>>>>>>> >> >> Is the API common even for basic programs?
>>>>>>>>> >> >>
>>>>>>>>> >> >> How often do we see recent questions in JIRA or mailing
>>>>>>>>> lists?
>>>>>>>>> >> >>
>>>>>>>>> >> >> How often does it appear in StackOverflow or blogs?
>>>>>>>>> >> >>
>>>>>>>>> >> >> Behavior after the break - How will a program that works
>>>>>>>>> today, work after the break? The following are listed roughly in 
>>>>>>>>> order of
>>>>>>>>> increasing severity:
>>>>>>>>> >> >>
>>>>>>>>> >> >> Will there be a compiler or linker error?
>>>>>>>>> >> >>
>>>>>>>>> >> >> Will there be a runtime exception?
>>>>>>>>> >> >>
>>>>>>>>> >> >> Will that exception happen after significant processing has
>>>>>>>>> been done?
>>>>>>>>> >> >>
>>>>>>>>> >> >> Will we silently return different answers? (very hard to
>>>>>>>>> debug, might not even notice!)
>>>>>>>>> >> >>
>>>>>>>>> >> >>
>>>>>>>>> >> >> Cost of Maintaining an API
>>>>>>>>> >> >>
>>>>>>>>> >> >> Of course, the above does not mean that we will never break
>>>>>>>>> any APIs. We must also consider the cost both to the project and to 
>>>>>>>>> our
>>>>>>>>> users of keeping the API in question.
>>>>>>>>> >> >>
>>>>>>>>> >> >> Project Costs - Every API we have needs to be tested and
>>>>>>>>> needs to keep working as other parts of the project changes. These 
>>>>>>>>> costs
>>>>>>>>> are significantly exacerbated when external dependencies change (the 
>>>>>>>>> JVM,
>>>>>>>>> Scala, etc). In some cases, while not completely technically 
>>>>>>>>> infeasible,
>>>>>>>>> the cost of maintaining a particular API can become too high.
>>>>>>>>> >> >>
>>>>>>>>> >> >> User Costs - APIs also have a cognitive cost to users
>>>>>>>>> learning Spark or trying to understand Spark programs. This cost 
>>>>>>>>> becomes
>>>>>>>>> even higher when the API in question has confusing or undefined 
>>>>>>>>> semantics.
>>>>>>>>> >> >>
>>>>>>>>> >> >>
>>>>>>>>> >> >> Alternatives to Breaking an API
>>>>>>>>> >> >>
>>>>>>>>> >> >> In cases where there is a "Bad API", but where the cost of
>>>>>>>>> removal is also high, there are alternatives that should be 
>>>>>>>>> considered that
>>>>>>>>> do not hurt existing users but do address some of the maintenance 
>>>>>>>>> costs.
>>>>>>>>> >> >>
>>>>>>>>> >> >>
>>>>>>>>> >> >> Avoid Bad APIs - While this is a bit obvious, it is an
>>>>>>>>> important point. Anytime we are adding a new interface to Spark we 
>>>>>>>>> should
>>>>>>>>> consider that we might be stuck with this API forever. Think deeply 
>>>>>>>>> about
>>>>>>>>> how new APIs relate to existing ones, as well as how you expect them 
>>>>>>>>> to
>>>>>>>>> evolve over time.
>>>>>>>>> >> >>
>>>>>>>>> >> >> Deprecation Warnings - All deprecation warnings should point
>>>>>>>>> to a clear alternative and should never just say that an API is 
>>>>>>>>> deprecated.
>>>>>>>>> >> >>
>>>>>>>>> >> >> Updated Docs - Documentation should point to the "best"
>>>>>>>>> recommended way of performing a given task. In the cases where we 
>>>>>>>>> maintain
>>>>>>>>> legacy documentation, we should clearly point to newer APIs and 
>>>>>>>>> suggest to
>>>>>>>>> users the "right" way.
>>>>>>>>> >> >>
>>>>>>>>> >> >> Community Work - Many people learn Spark by reading blogs
>>>>>>>>> and other sites such as StackOverflow. However, many of these 
>>>>>>>>> resources are
>>>>>>>>> out of date. Update them, to reduce the cost of eventually removing
>>>>>>>>> deprecated APIs.
>>>>>>>>> >> >>
>>>>>>>>> >> >>
>>>>>>>>> >> >> 
>>>>>>>>> >>
>>>>>>>>> >>
>>>>>>>>> -
>>>>>>>>> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>>>>>> >>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> -
>>>>>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>> --
>>>>>> ---
>>>>>> Takeshi Yamamuro
>>>>>>
>>>>>
>>>
>>> --
>>> <https://databricks.com/sparkaisummit/north-america>
>>>
>>
>
> --
> Takuya UESHIN
>
> http://twitter.com/ueshin
>
>
>

-- 
John Zhuge


Re: [Proposal] Modification to Spark's Semantic Versioning Policy

2020-02-26 Thread John Zhuge
t there are many real world use cases active today. Even those cases
>>likely suffer from undiagnosed issues as there are many areas of Spark 
>> that
>>assume a single context per JVM.
>>-
>>
>>Cost to Maintain - We have recently had users ask on the mailing list
>>if this was supported, as the conf led them to believe it was, and the
>>existence of this configuration as "supported" makes it harder to reason
>>about certain global state in SparkContext.
>>
>>
>> Decision: Remove this configuration and related code.
>>
>> [SPARK-25908] Remove registerTempTable #22921
>> <https://github.com/apache/spark/pull/22921/> (only looking at one API
>> of this PR)
>>
>>
>>-
>>
>>Cost to Break - This is a wildly popular API of Spark SQL that has
>>been there since the first release. There are tons of blog posts and
>>examples that use this syntax if you google "dataframe
>>registerTempTable
>>
>> <https://www.google.com/search?q=dataframe+registertemptable=1C5CHFA_enUS746US746=dataframe+registertemptable=chrome.0.0l8.3040j1j7=chrome=UTF-8>"
>>(even more than the "correct" API "dataframe createOrReplaceView
>>
>> <https://www.google.com/search?rlz=1C5CHFA_enUS746US746=TkZMXrj1ObzA0PEPpLKR2A4=dataframe+createorreplacetempview=dataframe+createor_l=psy-ab.3.0.0j0i22i30l7.663.1303..2750...0.3..1.212.782.7j0j1..01..gws-wiz...0i71j0i131.zP34wH1novM>").
>>All of these will be invalid for users of Spark 3.0
>>-
>>
>>Cost to Maintain - This is just an alias, so there is not a lot of
>>extra machinery required to keep the API. Users have two ways to do the
>>same thing, but we can note that this is just an alias in the docs.
>>
>>
>> Decision: Do not remove this API, I would even consider un-deprecating
>> it. I anecdotally asked several users and this is the API they prefer over
>> the "correct" one.
>>
>> [SPARK-25496] Deprecate from_utc_timestamp and to_utc_timestamp #24195
>> <https://github.com/apache/spark/pull/24195>
>>
>>-
>>
>>Cost to Break - I think that this case actually exemplifies several
>>anti-patterns in breaking APIs. In some languages, the deprecation warning
>>gives you no help, other than what version the function was removed in. In
>>R, it points users to a really deep conversation on the semantics of time
>>in Spark SQL. None of the messages tell you how you should correctly be
>>parsing a timestamp that is given to you in a format other than UTC. My
>>guess is all users will blindly flip the flag to true (to keep using this
>>function), so you've only succeeded in annoying them.
>>-
>>
>>Cost to Maintain - These are two relatively isolated expressions,
>>there should be little cost to keeping them. Users can be confused by 
>> their
>>semantics, so we probably should update the docs to point them to a best
>>practice (I learned only by complaining on the PR, that a good practice is
>>to parse timestamps including the timezone in the format expression, which
>>naturally shifts them to UTC).
>>
>>
>> Decision: Do not deprecate these two functions. We should update the
>> docs to talk about best practices for parsing timestamps, including how to
>> correctly shift them to UTC for storage.
>>
>> [SPARK-28093] Fix TRIM/LTRIM/RTRIM function parameter order issue #24902
>> <https://github.com/apache/spark/pull/24902>
>>
>>
>>-
>>
>>Cost to Break - The TRIM function takes two string parameters. If we
>>switch the parameter order, queries that use the TRIM function would
>>silently get different results on different versions of Spark. Users may
>>not notice it for a long time and wrong query results may cause serious
>>problems to users.
>>-
>>
>>Cost to Maintain - We will have some inconsistency inside Spark, as
>>the TRIM function in Scala API and in SQL have different parameter order.
>>
>>
>> Decision: Do not switch the parameter order. Promote the TRIM(trimStr
>> FROM srcStr) syntax in our SQL docs as it's the SQL standard. Deprecate
>> (with a warning, not by removing) the SQL TRIM function and move users to
>> the SQL standard TRIM syntax.
>>
>> Thanks for taking the time to read this! Happy to discuss the specifics
>> and amend this policy as the community sees fit.
>>
>> Michael
>>
>>

-- 
John Zhuge


Re: Enabling fully disaggregated shuffle on Spark

2019-11-20 Thread John Zhuge
That will be great. Please send us the invite.

On Wed, Nov 20, 2019 at 8:56 AM bo yang  wrote:

> Cool, thanks Ryan, John, Amogh for the reply! Great to see you interested!
> Felix will have a Spark Scalability & Reliability Sync meeting on Dec 4 1pm
> PST. We could discuss more details there. Do you want to join?
>
> On Tue, Nov 19, 2019 at 4:23 PM Amogh Margoor  wrote:
>
>> We at Qubole are also looking at disaggregating shuffle on Spark. Would
>> love to collaborate and share learnings.
>>
>> Regards,
>> Amogh
>>
>> On Tue, Nov 19, 2019 at 4:09 PM John Zhuge  wrote:
>>
>>> Great work, Bo! Would love to hear the details.
>>>
>>>
>>> On Tue, Nov 19, 2019 at 4:05 PM Ryan Blue 
>>> wrote:
>>>
>>>> I'm interested in remote shuffle services as well. I'd love to hear
>>>> about what you're using in production!
>>>>
>>>> rb
>>>>
>>>> On Tue, Nov 19, 2019 at 2:43 PM bo yang  wrote:
>>>>
>>>>> Hi Ben,
>>>>>
>>>>> Thanks for the writing up! This is Bo from Uber. I am in Felix's team
>>>>> in Seattle, and working on disaggregated shuffle (we called it remote
>>>>> shuffle service, RSS, internally). We have put RSS into production for a
>>>>> while, and learned a lot during the work (tried quite a few techniques to
>>>>> improve the remote shuffle performance). We could share our learning with
>>>>> the community, and also would like to hear feedback/suggestions on how to
>>>>> further improve remote shuffle performance. We could chat more details if
>>>>> you or other people are interested.
>>>>>
>>>>> Best,
>>>>> Bo
>>>>>
>>>>> On Fri, Nov 15, 2019 at 4:10 PM Ben Sidhom 
>>>>> wrote:
>>>>>
>>>>>> I would like to start a conversation about extending the Spark
>>>>>> shuffle manager surface to support fully disaggregated shuffle
>>>>>> implementations. This is closely related to the work in SPARK-25299
>>>>>> <https://issues.apache.org/jira/browse/SPARK-25299>, which is
>>>>>> focused on refactoring the shuffle manager API (and in particular,
>>>>>> SortShuffleManager) to use a pluggable storage backend. The motivation 
>>>>>> for
>>>>>> that SPIP is further enabling Spark on Kubernetes.
>>>>>>
>>>>>>
>>>>>> The motivation for this proposal is enabling full externalized
>>>>>> (disaggregated) shuffle service implementations. (Facebook’s Cosco
>>>>>> shuffle
>>>>>> <https://databricks.com/session/cosco-an-efficient-facebook-scale-shuffle-service>
>>>>>> is one example of such a disaggregated shuffle service.) These changes
>>>>>> allow the bulk of the shuffle to run in a remote service such that 
>>>>>> minimal
>>>>>> state resides in executors and local disk spill is minimized. The net
>>>>>> effect is increased job stability and performance improvements in certain
>>>>>> scenarios. These changes should work well with or are complementary to
>>>>>> SPARK-25299. Some or all points may be merged into that issue as
>>>>>> appropriate.
>>>>>>
>>>>>>
>>>>>> Below is a description of each component of this proposal. These
>>>>>> changes can ideally be introduced incrementally. I would like to gather
>>>>>> feedback and gauge interest from others in the community to collaborate 
>>>>>> on
>>>>>> this. There are likely more points that would be useful to disaggregated
>>>>>> shuffle services. We can outline a more concrete plan after gathering
>>>>>> enough input. A working session could help us kick off this joint effort;
>>>>>> maybe something in the mid-January to mid-February timeframe (depending 
>>>>>> on
>>>>>> interest and availability. I’m happy to host at our Sunnyvale, CA 
>>>>>> offices.
>>>>>>
>>>>>>
>>>>>> Proposal
>>>>>>
>>>>>> Scheduling and re-executing tasks
>>>>>>
>>>>>> Allow coordination between the service and the Spark DAG scheduler as
>>>>>> to whether a given block/partition needs to be recomputed when a task 
>>&

Re: Enabling fully disaggregated shuffle on Spark

2019-11-19 Thread John Zhuge
>>> semantics). SPARK-25299 adds commit semantics to the internal data storage
>>> layer, but this is applicable to all shuffle managers at a higher level and
>>> should apply equally to the ShuffleWriter.
>>>
>>>
>>> Do not require ShuffleManagers to expose ShuffleBlockResolvers where
>>> they are not needed. Ideally, this would be an implementation detail of the
>>> shuffle manager itself. If there is substantial overlap between the
>>> SortShuffleManager and other implementations, then the storage details can
>>> be abstracted at the appropriate level. (SPARK-25299 does not currently
>>> change this.)
>>>
>>>
>>> Do not require MapStatus to include blockmanager IDs where they are not
>>> relevant. This is captured by ShuffleBlockInfo
>>> <https://docs.google.com/document/d/1d6egnL6WHOwWZe8MWv3m8n4PToNacdx7n_0iMSWwhCQ/edit#heading=h.imi27prnziyj>
>>> including an optional BlockManagerId in SPARK-25299. However, this
>>> change should be lifted to the MapStatus level so that it applies to all
>>> ShuffleManagers. Alternatively, use a more general data-location
>>> abstraction than BlockManagerId. This gives the shuffle manager more
>>> flexibility and the scheduler more information with respect to data
>>> residence.
>>> Serialization
>>>
>>> Allow serializers to be used more flexibly and efficiently. For example,
>>> have serializers support writing an arbitrary number of objects into an
>>> existing OutputStream or ByteBuffer. This enables objects to be serialized
>>> to direct buffers where doing so makes sense. More importantly, it allows
>>> arbitrary metadata/framing data to be wrapped around individual objects
>>> cheaply. Right now, that’s only possible at the stream level. (There are
>>> hacks around this, but this would enable more idiomatic use in efficient
>>> shuffle implementations.)
>>>
>>>
>>> Have serializers indicate whether they are deterministic. This provides
>>> much of the value of a shuffle service because it means that reducers do
>>> not need to spill to disk when reading/merging/combining inputs--the data
>>> can be grouped by the service, even without the service understanding data
>>> types or byte representations. Alternative (less preferable since it would
>>> break Java serialization, for example): require all serializers to be
>>> deterministic.
>>>
>>>
>>>
>>> --
>>>
>>> - Ben
>>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


-- 
John Zhuge
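For what it's worth, one possible shape for the serializer extensions quoted
above could look like the sketch below. It is purely illustrative and is not
an existing Spark API; the class and method names are assumptions.

import java.io.OutputStream

abstract class RecordSerializer {
  /** Write a single record into a caller-provided stream, so a disaggregated
    * shuffle can cheaply interleave its own framing or metadata per record. */
  def writeRecord(record: Any, out: OutputStream): Unit

  /** True if equal records always serialize to identical bytes, letting a
    * remote shuffle service group/merge records without understanding them. */
  def isDeterministic: Boolean
}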


Re: [DISCUSS] ViewCatalog interface for DSv2

2019-10-14 Thread John Zhuge
Thanks for the feedback. I am preparing a doc and a PoC, will post soon.

On Mon, Oct 14, 2019 at 3:17 AM Wenchen Fan  wrote:

> I'm fine with the view definition proposed here, but my major concern is
> how to make sure table/view share the same namespace. According to the SQL
> spec, if there is a view named "a", we can't create a table named "a"
> anymore.
>
> We can add documents and ask the implementation to guarantee it, but it's
> better if this can be guaranteed by the API.
>
> On Wed, Aug 14, 2019 at 1:46 AM John Zhuge  wrote:
>
>> Thanks for the feedback, Ryan! I can share the WIP copy of the SPIP if
>> that makes sense.
>>
I can't find much about view resolution and validation in SQL Spec
Part 1. Anybody with full SQL knowledge, please chime in.
>>
>> Here are my understanding based on online manuals, docs, and other
>> resources:
>>
>>- A view has a name in the database schema so that other queries can
>>use it like a table.
>>- A view's schema is frozen at the time the view is created;
>>subsequent changes to underlying tables (e.g. adding a column) will not be
>>reflected in the view's schema. If an underlying table is dropped or
>>changed in an incompatible fashion, subsequent attempts to query the
>>invalid view will fail.
>>
>> In Preso, view columns are used for validation only (see
>> StatementAnalyzer.Visitor#isViewStale):
>>
>>- view column names must match the visible fields of analyzed view sql
>>- the visible fields can be coerced to view column types
>>
>> In Spark 2.2+, view columns are also used for validation (see
>> CheckAnalysis#checkAnalysis case View):
>>
>>- view column names must match the output fields of the view sql
>>- view column types must be able to UpCast to output field types
>>
>> Rule EliminateView adds a Project to viewQueryColumnNames if it exists.
>>
As for `softwareVersion`, the purpose is to track which software version
was used to create the view, in preparation for different versions of the
same software or even different software, such as Presto vs. Spark.
>>
>>
>> On Tue, Aug 13, 2019 at 9:47 AM Ryan Blue  wrote:
>>
>>> Thanks for working on this, John!
>>>
>>> I'd like to see a more complete write-up of what you're proposing.
>>> Without that, I don't think we can have a productive discussion about this.
>>>
>>> For example, I think you're proposing to keep the view columns to ensure
>>> that the same columns are produced by the view every time, based on
>>> requirements from the SQL spec. Let's start by stating what those behavior
>>> requirements are, so that everyone has the context to understand why your
>>> proposal includes the view columns. Similarly, I'd like to know why you're
>>> proposing `softwareVersion` in the view definition.
>>>
>>> On Tue, Aug 13, 2019 at 8:56 AM John Zhuge  wrote:
>>>
>>>> Catalog support has been added to DSv2 along with a table catalog
>>>> interface. Here I'd like to propose a view catalog interface, for the
>>>> following benefit:
>>>>
>>>>- Abstraction for view management thus allowing different view
>>>>backends
>>>>- Disassociation of view definition storage from Hive Metastore
>>>>
>>>> A catalog plugin can be both TableCatalog and ViewCatalog. Resolve an
>>>> identifier as view first then table.
>>>>
>>>> More details in SPIP and PR if we decide to proceed. Here is a quick
>>>> glance at the API:
>>>>
>>>> ViewCatalog interface:
>>>>
>>>>- loadView
>>>>- listViews
>>>>- createView
>>>>- deleteView
>>>>
>>>> View interface:
>>>>
>>>>- name
>>>>- originalSql
>>>>- defaultCatalog
>>>>- defaultNamespace
>>>>- viewColumns
>>>>- owner
>>>>- createTime
>>>>- softwareVersion
>>>>- options (map)
>>>>
>>>> ViewColumn interface:
>>>>
>>>>- name
>>>>- type
>>>>
>>>>
>>>> Thanks,
>>>> John Zhuge
>>>>
>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
>>
>> --
>> John Zhuge
>>
>

-- 
John Zhuge


Re: Thoughts on Spark 3 release, or a preview release

2019-09-12 Thread John Zhuge
 really needed for
>>>> Spark 3; I already triaged some).
>>>>
>>>> For me, it's:
>>>> - DSv2?
>>>> - Finishing touches on the Hive, JDK 11 update
>>>>
>>>> What about considering a preview release earlier, as happened for
>>>> Spark 2, to get feedback much earlier than the RC cycle? Could that
>>>> even happen ... about now?
>>>>
>>>> I'm also wondering what a realistic estimate of Spark 3 release is. My
>>>> guess is quite early 2020, from here.
>>>>
>>>>
>>>>
>>>> SPARK-29014 DataSourceV2: Clean up current, default, and session
>>>> catalog uses
>>>> SPARK-28900 Test Pyspark, SparkR on JDK 11 with run-tests
>>>> SPARK-28883 Fix a flaky test: ThriftServerQueryTestSuite
>>>> SPARK-28717 Update SQL ALTER TABLE RENAME  to use TableCatalog API
>>>> SPARK-28588 Build a SQL reference doc
>>>> SPARK-28629 Capture the missing rules in HiveSessionStateBuilder
>>>> SPARK-28684 Hive module support JDK 11
>>>> SPARK-28548 explain() shows wrong result for persisted DataFrames
>>>> after some operations
>>>> SPARK-28372 Document Spark WEB UI
>>>> SPARK-28476 Support ALTER DATABASE SET LOCATION
>>>> SPARK-28264 Revisiting Python / pandas UDF
>>>> SPARK-28301 fix the behavior of table name resolution with multi-catalog
>>>> SPARK-28155 do not leak SaveMode to file source v2
>>>> SPARK-28103 Cannot infer filters from union table with empty local
>>>> relation table properly
>>>> SPARK-28024 Incorrect numeric values when out of range
>>>> SPARK-27936 Support local dependency uploading from --py-files
>>>> SPARK-27884 Deprecate Python 2 support in Spark 3.0
>>>> SPARK-27763 Port test cases from PostgreSQL to Spark SQL
>>>> SPARK-27780 Shuffle server & client should be versioned to enable
>>>> smoother upgrade
>>>> SPARK-27714 Support Join Reorder based on Genetic Algorithm when the #
>>>> of joined tables > 12
>>>> SPARK-27471 Reorganize public v2 catalog API
>>>> SPARK-27520 Introduce a global config system to replace
>>>> hadoopConfiguration
>>>> SPARK-24625 put all the backward compatible behavior change configs
>>>> under spark.sql.legacy.*
>>>> SPARK-24640 size(null) returns null
>>>> SPARK-24702 Unable to cast to calendar interval in spark sql.
>>>> SPARK-24838 Support uncorrelated IN/EXISTS subqueries for more operators
>>>> SPARK-24941 Add RDDBarrier.coalesce() function
>>>> SPARK-25017 Add test suite for ContextBarrierState
>>>> SPARK-25083 remove the type erasure hack in data source scan
>>>> SPARK-25383 Image data source supports sample pushdown
>>>> SPARK-27272 Enable blacklisting of node/executor on fetch failures by
>>>> default
>>>> SPARK-27296 User Defined Aggregating Functions (UDAFs) have a major
>>>> efficiency problem
>>>> SPARK-25128 multiple simultaneous job submissions against k8s backend
>>>> cause driver pods to hang
>>>> SPARK-26731 remove EOLed spark jobs from jenkins
>>>> SPARK-26664 Make DecimalType's minimum adjusted scale configurable
>>>> SPARK-21559 Remove Mesos fine-grained mode
>>>> SPARK-24942 Improve cluster resource management with jobs containing
>>>> barrier stage
>>>> SPARK-25914 Separate projection from grouping and aggregate in logical
>>>> Aggregate
>>>> SPARK-26022 PySpark Comparison with Pandas
>>>> SPARK-20964 Make some keywords reserved along with the ANSI/SQL standard
>>>> SPARK-26221 Improve Spark SQL instrumentation and metrics
>>>> SPARK-26425 Add more constraint checks in file streaming source to
>>>> avoid checkpoint corruption
>>>> SPARK-25843 Redesign rangeBetween API
>>>> SPARK-25841 Redesign window function rangeBetween API
>>>> SPARK-25752 Add trait to easily whitelist logical operators that
>>>> produce named output from CleanupAliases
>>>> SPARK-23210 Introduce the concept of default value to schema
>>>> SPARK-25640 Clarify/Improve EvalType for grouped aggregate and window
>>>> aggregate
>>>> SPARK-25531 new write APIs for data source v2
>>>> SPARK-25547 Pluggable jdbc connection factory
>>>> SPARK-20845 Support specification of column names in INSERT INTO
>>>> SPARK-24417 Build and Run Spark on JDK11
>>>> SPARK-24724 Discuss necessary info and access in barrier mode +
>>>> Kubernetes
>>>> SPARK-24725 Discuss necessary info and access in barrier mode + Mesos
>>>> SPARK-25074 Implement maxNumConcurrentTasks() in
>>>> MesosFineGrainedSchedulerBackend
>>>> SPARK-23710 Upgrade the built-in Hive to 2.3.5 for hadoop-3.2
>>>> SPARK-25186 Stabilize Data Source V2 API
>>>> SPARK-25376 Scenarios we should handle but missed in 2.4 for barrier
>>>> execution mode
>>>> SPARK-25390 data source V2 API refactoring
>>>> SPARK-7768 Make user-defined type (UDT) API public
>>>> SPARK-14922 Alter Table Drop Partition Using Predicate-based Partition
>>>> Spec
>>>> SPARK-15691 Refactor and improve Hive support
>>>> SPARK-15694 Implement ScriptTransformation in sql/core
>>>> SPARK-16217 Support SELECT INTO statement
>>>> SPARK-16452 basic INFORMATION_SCHEMA support
>>>> SPARK-18134 SQL: MapType in Group BY and Joins not working
>>>> SPARK-18245 Improving support for bucketed table
>>>> SPARK-19842 Informational Referential Integrity Constraints Support in
>>>> Spark
>>>> SPARK-22231 Support of map, filter, withColumn, dropColumn in nested
>>>> list of structures
>>>> SPARK-22632 Fix the behavior of timestamp values for R's DataFrame to
>>>> respect session timezone
>>>> SPARK-22386 Data Source V2 improvements
>>>> SPARK-24723 Discuss necessary info and access in barrier mode + YARN
>>>>
>>>> -
>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>> 
>>>>
>>>>
>>>>
>
> --
> Name : Jungtaek Lim
> Blog : http://medium.com/@heartsavior
> Twitter : http://twitter.com/heartsavior
> LinkedIn : http://www.linkedin.com/in/heartsavior
>


-- 
John Zhuge


Re: Welcoming some new committers and PMC members

2019-09-09 Thread John Zhuge
Congratulations!

On Mon, Sep 9, 2019 at 5:45 PM Shane Knapp  wrote:

> congrats everyone!  :)
>
> On Mon, Sep 9, 2019 at 5:32 PM Matei Zaharia 
> wrote:
> >
> > Hi all,
> >
> > The Spark PMC recently voted to add several new committers and one PMC
> member. Join me in welcoming them to their new roles!
> >
> > New PMC member: Dongjoon Hyun
> >
> > New committers: Ryan Blue, Liang-Chi Hsieh, Gengliang Wang, Yuming Wang,
> Weichen Xu, Ruifeng Zheng
> >
> > The new committers cover lots of important areas including ML, SQL, and
> data sources, so it’s great to have them here. All the best,
> >
> > Matei and the Spark PMC
> >
> >
> > -
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >
>
>
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>

-- 
John Zhuge


Re: [VOTE] Release Apache Spark 2.3.4 (RC1)

2019-08-28 Thread John Zhuge
 you can add the staging repository to your projects resolvers and
>>>> test
>>>> >> > with the RC (make sure to clean up the artifact cache before/after
>>>> so
>>>> >> > you don't end up building with a out of date RC going forward).
>>>> >> >
>>>> >> > ===
>>>> >> > What should happen to JIRA tickets still targeting 2.3.4?
>>>> >> > ===
>>>> >> >
>>>> >> > The current list of open tickets targeted at 2.3.4 can be found at:
>>>> >> > https://issues.apache.org/jira/projects/SPARKand search for
>>>> "Target Version/s" = 2.3.4
>>>> >> >
>>>> >> > Committers should look at those and triage. Extremely important bug
>>>> >> > fixes, documentation, and API tweaks that impact compatibility
>>>> should
>>>> >> > be worked on immediately. Everything else please retarget to an
>>>> >> > appropriate release.
>>>> >> >
>>>> >> > ==
>>>> >> > But my bug isn't fixed?
>>>> >> > ==
>>>> >> >
>>>> >> > In order to make timely releases, we will typically not hold the
>>>> >> > release unless the bug in question is a regression from the
>>>> previous
>>>> >> > release. That being said, if there is something which is a
>>>> regression
>>>> >> > that has not been correctly targeted please ping me or a committer
>>>> to
>>>> >> > help target the issue.
>>>> >> >
>>>> >>
>>>> >> -
>>>> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>> >>
>>>>
>>>> -
>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>
>>>>

-- 
John Zhuge


Re: Release Spark 2.3.4

2019-08-16 Thread John Zhuge
+1

On Fri, Aug 16, 2019 at 4:25 PM Xiao Li  wrote:

> +1
>
> On Fri, Aug 16, 2019 at 4:11 PM Takeshi Yamamuro 
> wrote:
>
>> +1, too
>>
>> Bests,
>> Takeshi
>>
>> On Sat, Aug 17, 2019 at 7:25 AM Dongjoon Hyun 
>> wrote:
>>
>>> +1 for 2.3.4 release as the last release for `branch-2.3` EOL.
>>>
>>> Also, +1 for next week release.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>> On Fri, Aug 16, 2019 at 8:19 AM Sean Owen  wrote:
>>>
>>>> I think it's fine to do these in parallel, yes. Go ahead if you are
>>>> willing.
>>>>
>>>> On Fri, Aug 16, 2019 at 9:48 AM Kazuaki Ishizaki 
>>>> wrote:
>>>> >
>>>> > Hi, All.
>>>> >
>>>> > Spark 2.3.3 was released six months ago (15th February, 2019) at
>>>> http://spark.apache.org/news/spark-2-3-3-released.html. And, about 18
>>>> months have been passed after Spark 2.3.0 has been released (28th February,
>>>> 2018).
>>>> > As of today (16th August), there are 103 commits (69 JIRAs) in
>>>> `branch-23` since 2.3.3.
>>>> >
>>>> > It would be great if we can have Spark 2.3.4.
>>>> > If it is ok, shall we start `2.3.4 RC1` concurrent with 2.4.4 or
>>>> after 2.4.4 will be released?
>>>> >
>>>> > A issue list in jira:
>>>> https://issues.apache.org/jira/projects/SPARK/versions/12344844
>>>> > A commit list in github from the last release:
>>>> https://github.com/apache/spark/compare/66fd9c34bf406a4b5f86605d06c9607752bd637a...branch-2.3
>>>> > The 8 correctness issues resolved in branch-2.3:
>>>> >
>>>> https://issues.apache.org/jira/browse/SPARK-26873?jql=project%20%3D%2012315420%20AND%20fixVersion%20%3D%2012344844%20AND%20labels%20in%20(%27correctness%27)%20ORDER%20BY%20priority%20DESC%2C%20key%20ASC
>>>> >
>>>> > Best Regards,
>>>> > Kazuaki Ishizaki
>>>>
>>>> -
>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>
>>>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>
>
> --
>


-- 
John Zhuge


Re: [DISCUSS] ViewCatalog interface for DSv2

2019-08-13 Thread John Zhuge
Thanks for the feedback, Ryan! I can share the WIP copy of the SPIP if that
makes sense.

I couldn't find much about view resolution and validation in SQL Spec
Part 1. Anybody with deeper SQL knowledge, please chime in.

Here is my understanding, based on online manuals, docs, and other
resources:

   - A view has a name in the database schema so that other queries can use
   it like a table.
   - A view's schema is frozen at the time the view is created; subsequent
   changes to underlying tables (e.g. adding a column) will not be reflected
   in the view's schema. If an underlying table is dropped or changed in an
   incompatible fashion, subsequent attempts to query the invalid view will
   fail.

In Presto, view columns are used for validation only (see
StatementAnalyzer.Visitor#isViewStale):

   - view column names must match the visible fields of analyzed view sql
   - the visible fields can be coerced to view column types

In Spark 2.2+, view columns are also used for validation (see
CheckAnalysis#checkAnalysis case View):

   - view column names must match the output fields of the view sql
   - view column types must be able to UpCast to output field types

Rule EliminateView adds a Project to viewQueryColumnNames if it exists.
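
To make those rules concrete, here is a tiny self-contained Scala sketch of this
kind of view-column check. It is not the actual Presto or Spark code, and the
type-compatibility map below is a made-up stand-in for Cast.canUpCast / Presto's
coercion rules:

// Illustration only: compare stored view columns against the analyzed query
// output, requiring matching names and types that can be safely widened.
object ViewValidationSketch {
  case class ViewColumn(name: String, dataType: String)
  case class OutputField(name: String, dataType: String)

  // Made-up stand-in for Spark's Cast.canUpCast / Presto's type coercion rules.
  val safeUpCasts: Map[String, Set[String]] =
    Map(
      "int"   -> Set("int", "bigint", "double", "decimal"),
      "float" -> Set("float", "double")
    ).withDefault(t => Set(t))

  // A view is stale/invalid if the column counts or names don't line up, or if
  // an output type cannot be safely up-cast to the stored view column type.
  def isViewStale(viewCols: Seq[ViewColumn], output: Seq[OutputField]): Boolean =
    viewCols.length != output.length ||
      viewCols.zip(output).exists { case (v, o) =>
        v.name != o.name || !safeUpCasts(o.dataType).contains(v.dataType)
      }
}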

As for `softwareVersion`, the purpose is to track which software version was
used to create the view, in preparation for different versions of the same
software, or even different engines such as Presto vs. Spark.


On Tue, Aug 13, 2019 at 9:47 AM Ryan Blue  wrote:

> Thanks for working on this, John!
>
> I'd like to see a more complete write-up of what you're proposing. Without
> that, I don't think we can have a productive discussion about this.
>
> For example, I think you're proposing to keep the view columns to ensure
> that the same columns are produced by the view every time, based on
> requirements from the SQL spec. Let's start by stating what those behavior
> requirements are, so that everyone has the context to understand why your
> proposal includes the view columns. Similarly, I'd like to know why you're
> proposing `softwareVersion` in the view definition.
>
> On Tue, Aug 13, 2019 at 8:56 AM John Zhuge  wrote:
>
>> Catalog support has been added to DSv2 along with a table catalog
>> interface. Here I'd like to propose a view catalog interface, for the
>> following benefit:
>>
>>- Abstraction for view management thus allowing different view
>>backends
>>- Disassociation of view definition storage from Hive Metastore
>>
>> A catalog plugin can be both TableCatalog and ViewCatalog. Resolve an
>> identifier as view first then table.
>>
>> More details in SPIP and PR if we decide to proceed. Here is a quick
>> glance at the API:
>>
>> ViewCatalog interface:
>>
>>- loadView
>>- listViews
>>- createView
>>- deleteView
>>
>> View interface:
>>
>>- name
>>- originalSql
>>- defaultCatalog
>>- defaultNamespace
>>- viewColumns
>>- owner
>>    - createTime
>>- softwareVersion
>>- options (map)
>>
>> ViewColumn interface:
>>
>>- name
>>- type
>>
>>
>> Thanks,
>> John Zhuge
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


-- 
John Zhuge


[DISCUSS] ViewCatalog interface for DSv2

2019-08-13 Thread John Zhuge
Catalog support has been added to DSv2 along with a table catalog
interface. Here I'd like to propose a view catalog interface, for the
following benefits:

   - Abstraction for view management, thus allowing different view backends
   - Disassociation of view definition storage from the Hive Metastore

A catalog plugin can be both a TableCatalog and a ViewCatalog. An identifier
is resolved as a view first, then as a table.

More details will be in the SPIP and PR if we decide to proceed. Here is a
quick glance at the API (a rough Scala sketch follows the lists below):

ViewCatalog interface:

   - loadView
   - listViews
   - createView
   - deleteView

View interface:

   - name
   - originalSql
   - defaultCatalog
   - defaultNamespace
   - viewColumns
   - owner
   - createTime
   - softwareVersion
   - options (map)

ViewColumn interface:

   - name
   - type
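
A rough Scala sketch of what these interfaces could look like. The member names
follow the lists above, but the signatures and types are illustrative assumptions
(identifiers are simplified to String; a real API would likely use Identifier,
DataType, and proper exceptions), not a final design:

// Illustrative sketch only; the exact shape would be settled in the SPIP/PR.
import java.util.{Map => JMap}

trait ViewColumn {
  def name: String
  def `type`: String            // placeholder for a real Spark DataType
}

trait View {
  def name: String
  def originalSql: String
  def defaultCatalog: String
  def defaultNamespace: Array[String]
  def viewColumns: Seq[ViewColumn]
  def owner: String
  def createTime: Long
  def softwareVersion: String
  def options: JMap[String, String]
}

trait ViewCatalog {
  def loadView(ident: String): View
  def listViews(namespace: Array[String]): Array[String]
  def createView(ident: String, definition: View): View
  def deleteView(ident: String): Boolean
}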


Thanks,
John Zhuge


Re: [VOTE][SPARK-25299] SPIP: Shuffle Storage API

2019-06-18 Thread John Zhuge
+1 (non-binding)  Great work!

On Tue, Jun 18, 2019 at 6:22 AM Vinoo Ganesh  wrote:

> +1 (non-binding).
>
>
>
> Thanks for pushing this forward, Matt and Yifei.
>
>
>
> *From: *Felix Cheung 
> *Date: *Tuesday, June 18, 2019 at 00:01
> *To: *Yinan Li , "rb...@netflix.com" <
> rb...@netflix.com>
> *Cc: *Dongjoon Hyun , Saisai Shao <
> sai.sai.s...@gmail.com>, Imran Rashid , Ilan
> Filonenko , bo yang , Matt Cheah <
> mch...@palantir.com>, Spark Dev List , "Yifei Huang
> (PD)" , Vinoo Ganesh , Imran
> Rashid 
> *Subject: *Re: [VOTE][SPARK-25299] SPIP: Shuffle Storage API
>
>
>
> +1
>
>
>
> Glad to see the progress in this space - it’s been more than a year since
> the original discussion and effort started.
>
>
> --
>
> *From:* Yinan Li 
> *Sent:* Monday, June 17, 2019 7:14:42 PM
> *To:* rb...@netflix.com
> *Cc:* Dongjoon Hyun; Saisai Shao; Imran Rashid; Ilan Filonenko; bo yang;
> Matt Cheah; Spark Dev List; Yifei Huang (PD); Vinoo Ganesh; Imran Rashid
> *Subject:* Re: [VOTE][SPARK-25299] SPIP: Shuffle Storage API
>
>
>
> +1 (non-binding)
>
>
>
> On Mon, Jun 17, 2019 at 1:58 PM Ryan Blue 
> wrote:
>
> +1 (non-binding)
>
>
>
> On Sun, Jun 16, 2019 at 11:11 PM Dongjoon Hyun 
> wrote:
>
> +1
>
>
>
> Bests,
>
> Dongjoon.
>
>
>
>
>
> On Sun, Jun 16, 2019 at 9:41 PM Saisai Shao 
> wrote:
>
> +1 (binding)
>
>
>
> Thanks
>
> Saisai
>
>
>
> Imran Rashid wrote on Sat, Jun 15, 2019 at 3:46 AM:
>
> +1 (binding)
>
> I think this is a really important feature for spark.
>
> First, there is already a lot of interest in alternative shuffle storage
> in the community.  There is already a lot of interest in alternative
> shuffle storage, from dynamic allocation in kubernetes, to even just
> improving stability in standard on-premise use of Spark.  However, they're
> often stuck doing this in forks of Spark, and in ways that are not
> maintainable (because they copy-paste many spark internals) or are
> incorrect (for not correctly handling speculative execution & stage
> retries).
>
> Second, I think the specific proposal is good for finding the right
> balance between flexibility and too much complexity, to allow incremental
> improvements.  A lot of work has been put into this already to try to
> figure out which pieces are essential to make alternative shuffle storage
> implementations feasible.
>
> Of course, that means it doesn't include everything imaginable; some
> things still aren't supported, and some will still choose to use the older
> ShuffleManager api to give total control over all of shuffle.  But we know
> there are a reasonable set of things which can be implemented behind the
> api as the first step, and it can continue to evolve.
>
>
>
> On Fri, Jun 14, 2019 at 12:13 PM Ilan Filonenko  wrote:
>
> +1 (non-binding). This API is versatile and flexible enough to handle
> Bloomberg's internal use-cases. The ability for us to vary implementation
> strategies is quite appealing. It is also worth to note the minimal changes
> to Spark core in order to make it work. This is a very much needed addition
> within the Spark shuffle story.
>
>
>
> On Fri, Jun 14, 2019 at 9:59 AM bo yang  wrote:
>
> +1 This is great work, allowing plugin of different sort shuffle
> write/read implementation! Also great to see it retain the current Spark
> configuration
> (spark.shuffle.manager=org.apache.spark.shuffle.YourShuffleManagerImpl).
>
>
>
>
>
> On Thu, Jun 13, 2019 at 2:58 PM Matt Cheah  wrote:
>
> Hi everyone,
>
>
>
> I would like to call a vote for the SPIP for SPARK-25299
> [issues.apache.org]
> ,
> which proposes to introduce a pluggable storage API for temporary shuffle
> data.
>
>
>
> You may find the SPIP document here [docs.google.com]
> 
> .
>
>
>
> The discussion thread for the SPIP was conducted here [lists.apache.org]
> 
> .
>
>
>
> Please vote on whether or not this proposal is agreeable to you.
>
>
>
> Thanks!
>
>
>
> -Matt Cheah
>
>
>
>
> --
>
> Ryan Blue
>
> Software Engineer
>
> Netflix
>
>

-- 
John


Re: Why hint does not traverse down subquery alias

2019-06-11 Thread John Zhuge
A meaningful error message will be great!

On Tue, Jun 11, 2019 at 6:15 PM Maryann Xue  wrote:

> BTW, I've actually just done some work on hint error handling, which might
> be helpful to what you mentioned:
>
> https://github.com/apache/spark/pull/24653
>
> On Tue, Jun 11, 2019 at 8:04 PM Maryann Xue 
> wrote:
>
>> I believe in the SQL standard, the original name cannot be accessed once
>> it’s aliased.
>>
>> On Tue, Jun 11, 2019 at 7:54 PM John Zhuge  wrote:
>>
>>> Yeah, it is a tough scenario.
>>>
>>> I actually have much simpler cases:
>>>
>>> 1) select /*+ broadcast(t1) */ * from db.t1 join db.t2 on t1.id = t2.id;
>>> 2) select /*+ broadcast(t1) */ * from db.t1 a1 join db.t2 a2 on a1.id =
>>> a2.id;
>>>
>>> 2) is the same as 1) but with aliases. Many users were surprised that 2)
>>> stopped working.
>>>
>>> Thanks,
>>> John
>>>
>>>
>>> On Tue, Jun 11, 2019 at 4:38 PM Maryann Xue 
>>> wrote:
>>>
>>>> Yes, and for a good reason: the hint relation has exactly the same
>>>> scope with other elements of queries/sub-queries.
>>>>
>>>> Suppose there's a query like:
>>>>
>>>> select /*+ broadcast(s) */ from (select a, b from s) t join (select a,
>>>> b from t) s on t1.a = t2.b
>>>>
>>>> If we allowed the hint resolving to "cross" the scopes, we'd end up
>>>> with a really confusing spec.
>>>>
>>>>
>>>> Thanks,
>>>> Maryann
>>>>
>>>> On Tue, Jun 11, 2019 at 5:26 PM John Zhuge  wrote:
>>>>
>>>>> Hi Reynold and Maryann,
>>>>>
>>>>> ResolveHints javadoc indicates the traversal does not go past subquery
>>>>> alias. Is there any specific reason?
>>>>>
>>>>> Thanks,
>>>>> John Zhuge
>>>>>
>>>>
>>>
>>> --
>>> John Zhuge
>>>
>>

-- 
John Zhuge


Re: Why hint does not traverse down subquery alias

2019-06-11 Thread John Zhuge
Yeah, it is a tough scenario.

I actually have much simpler cases:

1) select /*+ broadcast(t1) */ * from db.t1 join db.t2 on t1.id = t2.id;
2) select /*+ broadcast(t1) */ * from db.t1 a1 join db.t2 a2 on a1.id =
a2.id;

2) is the same as 1) but with aliases. Many users were surprised that 2)
stopped working.
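
For users who hit this today, the workaround that appears to work is to hint the
alias name rather than the original table name. A minimal sketch (same tables as
above; that hint resolution matches the subquery alias is my understanding of
ResolveHints, so treat it as an assumption):

// Sketch: reference the alias (a1) in the hint instead of the table name (t1).
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("broadcast-hint-alias").getOrCreate()

val df = spark.sql(
  """SELECT /*+ BROADCAST(a1) */ *
    |FROM db.t1 a1
    |JOIN db.t2 a2 ON a1.id = a2.id
    |""".stripMargin)

df.explain()   // expect a BroadcastHashJoin if the hint was picked up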

Thanks,
John


On Tue, Jun 11, 2019 at 4:38 PM Maryann Xue  wrote:

> Yes, and for a good reason: the hint relation has exactly the same scope
> with other elements of queries/sub-queries.
>
> Suppose there's a query like:
>
> select /*+ broadcast(s) */ from (select a, b from s) t join (select a, b
> from t) s on t1.a = t2.b
>
> If we allowed the hint resolving to "cross" the scopes, we'd end up with a
> really confusing spec.
>
>
> Thanks,
> Maryann
>
> On Tue, Jun 11, 2019 at 5:26 PM John Zhuge  wrote:
>
>> Hi Reynold and Maryann,
>>
>> ResolveHints javadoc indicates the traversal does not go past subquery
>> alias. Is there any specific reason?
>>
>> Thanks,
>> John Zhuge
>>
>

-- 
John Zhuge


Why hint does not traverse down subquery alias

2019-06-11 Thread John Zhuge
Hi Reynold and Maryann,

The ResolveHints javadoc indicates that the traversal does not go past a
subquery alias. Is there a specific reason for that?

Thanks,
John Zhuge


Re: [VOTE] SPIP: Spark API for Table Metadata

2019-02-28 Thread John Zhuge
+1 (non-binding)

On Thu, Feb 28, 2019 at 9:11 AM Matt Cheah  wrote:

> +1 (non-binding)
>
>
>
> *From: *Jamison Bennett 
> *Date: *Thursday, February 28, 2019 at 8:28 AM
> *To: *Ryan Blue , Spark Dev List 
> *Subject: *Re: [VOTE] SPIP: Spark API for Table Metadata
>
>
>
> +1 (non-binding)
>
>
> *Jamison Bennett*
>
> Cloudera Software Engineer
>
> jamison.benn...@cloudera.com
>
> 515 Congress Ave, Suite 1212   |   Austin, TX   |   78701
>
>
>
>
>
> On Thu, Feb 28, 2019 at 10:20 AM Ryan Blue 
> wrote:
>
> +1 (non-binding)
>
>
>
> On Wed, Feb 27, 2019 at 8:34 PM Russell Spitzer 
> wrote:
>
> +1 (non-binding)
>
> On Wed, Feb 27, 2019, 6:28 PM Ryan Blue  wrote:
>
> Hi everyone,
>
>
>
> In the last DSv2 sync, the consensus was that the table metadata SPIP was
> ready to bring up for a vote. Now that the multi-catalog identifier SPIP
> vote has passed, I'd like to start one for the table metadata API,
> TableCatalog.
>
>
>
> The proposal is for adding a TableCatalog interface that will be used by
> v2 plans. That interface has methods to load, create, drop, alter, refresh,
> rename, and check existence for tables. It also specifies the set of
> metadata used to configure tables: schema, partitioning, and key-value
> properties. For more information, please read the SPIP proposal doc
> <https://docs.google.com/document/d/1zLFiA1VuaWeVxeTDXNg8bL6GP3BVoOZBkewFtEnjEoo/edit#heading=h.m45webtwxf2d>
> .
>
>
>
> Please vote in the next 3 days.
>
>
>
> [ ] +1: Accept the proposal as an official SPIP
>
> [ ] +0
>
> [ ] -1: I don't think this is a good idea because ...
>
>
>
>
>
> Thanks!
>
>
>
> --
>
> Ryan Blue
>
> Software Engineer
>
> Netflix
>
>
>
>
> --
>
> Ryan Blue
>
> Software Engineer
>
> Netflix
>
>

-- 
John Zhuge


Re: [VOTE] SPIP: Identifiers for multi-catalog Spark

2019-02-18 Thread John Zhuge
+1

On Mon, Feb 18, 2019 at 8:43 PM Dongjoon Hyun 
wrote:

> +1
>
> Dongjoon.
>
> On 2019/02/19 04:12:23, Wenchen Fan  wrote:
> > +1
> >
> > On Tue, Feb 19, 2019 at 10:50 AM Ryan Blue 
> > wrote:
> >
> > > Hi everyone,
> > >
> > > It looks like there is consensus on the proposal, so I'd like to start
> a
> > > vote thread on the SPIP for identifiers in multi-catalog Spark.
> > >
> > > The doc is available here:
> > >
> https://docs.google.com/document/d/1jEcvomPiTc5GtB9F7d2RTVVpMY64Qy7INCA_rFEd9HQ/edit?usp=sharing
> > >
> > > Please vote in the next 3 days.
> > >
> > > [ ] +1: Accept the proposal as an official SPIP
> > > [ ] +0
> > > [ ] -1: I don't think this is a good idea because ...
> > >
> > >
> > > Thanks!
> > >
> > > rb
> > >
> > > --
> > > Ryan Blue
> > > Software Engineer
> > > Netflix
> > >
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>

-- 
John Zhuge


Re: [VOTE] Release Apache Spark 2.3.3 (RC2)

2019-02-09 Thread John Zhuge
ssues.apache.org/jira/projects/SPARK and search for "Target
>>>> Version/s" = 2.3.3
>>>> >
>>>> > Committers should look at those and triage. Extremely important bug
>>>> > fixes, documentation, and API tweaks that impact compatibility should
>>>> > be worked on immediately. Everything else please retarget to an
>>>> > appropriate release.
>>>> >
>>>> > ==
>>>> > But my bug isn't fixed?
>>>> > ==
>>>> >
>>>> > In order to make timely releases, we will typically not hold the
>>>> > release unless the bug in question is a regression from the previous
>>>> > release. That being said, if there is something which is a regression
>>>> > that has not been correctly targeted please ping me or a committer to
>>>> > help target the issue.
>>>> >
>>>> > P.S.
>>>> > I checked all the tests passed in the Amazon Linux 2 AMI;
>>>> > $ java -version
>>>> > openjdk version "1.8.0_191"
>>>> > OpenJDK Runtime Environment (build 1.8.0_191-b12)
>>>> > OpenJDK 64-Bit Server VM (build 25.191-b12, mixed mode)
>>>> > $ ./build/mvn -Pyarn -Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos
>>>> -Psparkr test
>>>> >
>>>> > --
>>>> > ---
>>>> > Takeshi Yamamuro
>>>>
>>>>
>>>>
>>>> --
>>>> Marcelo
>>>>
>>>> -
>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>
>>>>
>
> --
> ---
> Takeshi Yamamuro
>


-- 
John Zhuge


Re: [VOTE] Release Apache Spark 2.3.3 (RC2)

2019-02-07 Thread John Zhuge
ll
>> >> > the current RC and see if anything important breaks, in the
>> Java/Scala
>> >> > you can add the staging repository to your projects resolvers and
>> test
>> >> > with the RC (make sure to clean up the artifact cache before/after so
>> >> > you don't end up building with a out of date RC going forward).
>> >> >
>> >> > ===
>> >> > What should happen to JIRA tickets still targeting 2.3.3?
>> >> > ===
>> >> >
>> >> > The current list of open tickets targeted at 2.3.3 can be found at:
>> >> > https://issues.apache.org/jira/projects/SPARK and search for
>> "Target Version/s" = 2.3.3
>> >> >
>> >> > Committers should look at those and triage. Extremely important bug
>> >> > fixes, documentation, and API tweaks that impact compatibility should
>> >> > be worked on immediately. Everything else please retarget to an
>> >> > appropriate release.
>> >> >
>> >> > ==
>> >> > But my bug isn't fixed?
>> >> > ==
>> >> >
>> >> > In order to make timely releases, we will typically not hold the
>> >> > release unless the bug in question is a regression from the previous
>> >> > release. That being said, if there is something which is a regression
>> >> > that has not been correctly targeted please ping me or a committer to
>> >> > help target the issue.
>> >> >
>> >> > P.S.
>> >> > I checked all the tests passed in the Amazon Linux 2 AMI;
>> >> > $ java -version
>> >> > openjdk version "1.8.0_191"
>> >> > OpenJDK Runtime Environment (build 1.8.0_191-b12)
>> >> > OpenJDK 64-Bit Server VM (build 25.191-b12, mixed mode)
>> >> > $ ./build/mvn -Pyarn -Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos
>> -Psparkr test
>> >> >
>> >> > --
>> >> > ---
>> >> > Takeshi Yamamuro
>> >>
>> >> -
>> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> >>
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>

-- 
John Zhuge


Re: scheduler braindump: architecture, gotchas, etc.

2019-02-04 Thread John Zhuge
Thx Xiao!

On Mon, Feb 4, 2019 at 9:04 AM Xiao Li  wrote:

> Thank you, Imran!
>
> Also, I attached the slides of "Deep Dive: Scheduler of Apache Spark".
>
> Cheers,
>
> Xiao
>
>
>
> John Zhuge wrote on Mon, Feb 4, 2019 at 8:59 AM:
>
>> Thanks Imran!
>>
>> On Mon, Feb 4, 2019 at 8:42 AM Imran Rashid 
>> wrote:
>>
>>> The scheduler has been pretty error-prone and hard to work on, and I
>>> feel like there may be a dwindling core of active experts.  I'm sure its
>>> very discouraging to folks trying to make what seem like simple changes,
>>> and then find they are in a rats nest of complex issues they weren't
>>> expecting.  But for those who are still trying, THANK YOU!  more
>>> involvement and more folks becoming experts is definitely needed.
>>>
>>> I put together a doc going over the architecture of the scheduler, and
>>> things I've seen us get bitten by in the past.  Its sort of a brain dump,
>>> but I'm hopeful it'll help orient new folks to the scheduler.  I also hope
>>> more experts will chime in -- there are places in the doc I know I've
>>> missed things, and called that out, but there are probably even more that
>>> should be discussed, & mistakes I've made.  All input welcome.
>>>
>>>
>>> https://docs.google.com/document/d/1oiE21t-8gXLXk5evo-t-BXpO5Hdcob5D-Ps40hogsp8/edit?usp=sharing
>>>
>>
>>
>> --
>> John Zhuge
>>
>

-- 
John Zhuge


Re: scheduler braindump: architecture, gotchas, etc.

2019-02-04 Thread John Zhuge
Thanks Imran!

On Mon, Feb 4, 2019 at 8:42 AM Imran Rashid 
wrote:

> The scheduler has been pretty error-prone and hard to work on, and I feel
> like there may be a dwindling core of active experts.  I'm sure its very
> discouraging to folks trying to make what seem like simple changes, and
> then find they are in a rats nest of complex issues they weren't
> expecting.  But for those who are still trying, THANK YOU!  more
> involvement and more folks becoming experts is definitely needed.
>
> I put together a doc going over the architecture of the scheduler, and
> things I've seen us get bitten by in the past.  Its sort of a brain dump,
> but I'm hopeful it'll help orient new folks to the scheduler.  I also hope
> more experts will chime in -- there are places in the doc I know I've
> missed things, and called that out, but there are probably even more that
> should be discussed, & mistakes I've made.  All input welcome.
>
>
> https://docs.google.com/document/d/1oiE21t-8gXLXk5evo-t-BXpO5Hdcob5D-Ps40hogsp8/edit?usp=sharing
>


-- 
John Zhuge


Re: [VOTE] SPARK 2.2.3 (RC1)

2019-01-11 Thread John Zhuge
+1

Appreciate the effort.

On Thu, Jan 10, 2019 at 11:06 PM Hyukjin Kwon  wrote:

> +1
>
> Thanks.
>
> On Fri, Jan 11, 2019 at 7:01 AM, Takeshi Yamamuro wrote:
>
>> ok, thanks for the check.
>>
>> best,
>> takeshi
>>
>> On Fri, Jan 11, 2019 at 1:37 AM Dongjoon Hyun 
>> wrote:
>>
>>> Hi, Takeshi.
>>>
>>> Yep. It's not a release blocker. We don't need that as Sean mentioned
>>> already.
>>> Since you are the release manager of 2.3.3, you may include that in the
>>> scope of Spark 2.3.3 before it starts.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>> On Thu, Jan 10, 2019 at 5:44 AM Sean Owen  wrote:
>>>
>>>> Is that the right link? that is marked as a minor bug, maybe. From
>>>> what you describe it's not a regression from 2.2.2 either.
>>>>
>>>> On Thu, Jan 10, 2019 at 6:37 AM Takeshi Yamamuro 
>>>> wrote:
>>>> >
>>>> > Hi, Dongjoon,
>>>> >
>>>> > We don't need to include https://github.com/apache/spark/pull/23456
>>>> in this release?
>>>> > The query there fails in v2.x while it passes in v1.6.
>>>> >
>>>>
>>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>

-- 
John Zhuge


Re: DataSourceV2 hangouts sync

2018-10-25 Thread John Zhuge
Great idea!

On Thu, Oct 25, 2018 at 1:10 PM Ryan Blue  wrote:

> Hi everyone,
>
> There's been some great discussion for DataSourceV2 in the last few
> months, but it has been difficult to resolve some of the discussions and I
> don't think that we have a very clear roadmap for getting the work done.
>
> To coordinate better as a community, I'd like to start a regular sync-up
> over google hangouts. We use this in the Parquet community to have more
> effective community discussions about thorny technical issues and to get
> aligned on an overall roadmap. It is really helpful in that community and I
> think it would help us get DSv2 done more quickly.
>
> Here's how it works: people join the hangout, we go around the list to
> gather topics, have about an hour-long discussion, and then send a summary
> of the discussion to the dev list for anyone that couldn't participate.
> That way we can move topics along, but we keep the broader community in the
> loop as well for further discussion on the mailing list.
>
> I'll volunteer to set up the sync and send invites to anyone that wants to
> attend. If you're interested, please reply with the email address you'd
> like to put on the invite list (if there's a way to do this without
> specific invites, let me know). Also for the first sync, please note what
> times would work for you so we can try to account for people in different
> time zones.
>
> For the first one, I was thinking some day next week (time TBD by those
> interested) and starting off with a general roadmap discussion before
> diving into specific technical topics.
>
> Thanks,
>
> rb
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


-- 
John Zhuge


Re: Timestamp Difference/operations

2018-10-12 Thread John Zhuge
Yeah, the "-" operator does not seem to be supported; however, you can use
the "datediff" function:

In [9]: select datediff(CAST('2000-02-01 12:34:34' AS TIMESTAMP), CAST('2000-01-01 00:00:00' AS TIMESTAMP))
Out[9]:
| datediff(CAST(CAST(2000-02-01 12:34:34 AS TIMESTAMP) AS DATE), CAST(CAST(2000-01-01 00:00:00 AS TIMESTAMP) AS DATE)) |
| 31 |

In [10]: select datediff('2000-02-01 12:34:34', '2000-01-01 00:00:00')
Out[10]:
| datediff(CAST(2000-02-01 12:34:34 AS DATE), CAST(2000-01-01 00:00:00 AS DATE)) |
| 31 |

In [11]: select datediff(timestamp '2000-02-01 12:34:34', timestamp '2000-01-01 00:00:00')
Out[11]:
| datediff(CAST(TIMESTAMP('2000-02-01 12:34:34.0') AS DATE), CAST(TIMESTAMP('2000-01-01 00:00:00.0') AS DATE)) |
| 31 |
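
If a sub-day difference is needed (closer to what the Hive "-" expression
returns), one option (just a sketch, not the only way) is to go through
unix_timestamp and work in seconds:

// Sketch: compute the timestamp difference in seconds via unix_timestamp,
// since subtracting timestamps directly is not supported.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ts-diff").getOrCreate()

spark.sql(
  """SELECT unix_timestamp(CAST('2000-02-01 12:34:34' AS TIMESTAMP)) -
    |       unix_timestamp(CAST('2000-01-01 00:00:00' AS TIMESTAMP)) AS diff_seconds
    |""".stripMargin).show(truncate = false)
// Expected: diff_seconds = 2723674 (31 days 12:34:34), assuming no DST change in between.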

On Fri, Oct 12, 2018 at 7:01 AM Paras Agarwal 
wrote:

> Hello Spark Community,
>
> Currently in hive we can do operations on Timestamp Like :
> CAST('2000-01-01 12:34:34' AS TIMESTAMP) - CAST('2000-01-01 00:00:00' AS
> TIMESTAMP)
>
> Seems its not supporting in spark.
> Is there any way available.
>
> Kindly provide some insight on this.
>
>
> Paras
> 9130006036
>


-- 
John


Re: from_csv

2018-09-19 Thread John Zhuge
+1

On Wed, Sep 19, 2018 at 8:07 AM Ted Yu  wrote:

> +1
>
>  Original message 
> From: Dongjin Lee 
> Date: 9/19/18 7:20 AM (GMT-08:00)
> To: dev 
> Subject: Re: from_csv
>
> Another +1.
>
> I already experienced this case several times.
>
> On Mon, Sep 17, 2018 at 11:03 AM Hyukjin Kwon  wrote:
>
>> +1 for this idea since text parsing in CSV/JSON is quite common.
>>
>> One thing is about schema inference likewise with JSON functionality. In
>> case of JSON, we added schema_of_json for it and same thing should be able
>> to apply to CSV too.
>> If we see some more needs for it, we can consider a function like
>> schema_of_csv as well.
>>
>>
>> On Sun, Sep 16, 2018 at 4:41 PM, Maxim Gekk wrote:
>>
>>> Hi Reynold,
>>>
>>> > i'd make this as consistent as to_json / from_json as possible
>>>
>>> Sure, new function from_csv() has the same signature as from_json().
>>>
>>> > how would this work in sql? i.e. how would passing options in work?
>>>
>>> The options are passed to the function via map, for example:
>>> select from_csv('26/08/2015', 'time Timestamp', map('timestampFormat',
>>> 'dd/MM/'))
>>>
>>> On Sun, Sep 16, 2018 at 7:01 AM Reynold Xin  wrote:
>>>
>>>> makes sense - i'd make this as consistent as to_json / from_json as
>>>> possible.
>>>>
>>>> how would this work in sql? i.e. how would passing options in work?
>>>>
>>>> --
>>>> excuse the brevity and lower case due to wrist injury
>>>>
>>>>
>>>> On Sat, Sep 15, 2018 at 2:58 AM Maxim Gekk 
>>>> wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> I would like to propose new function from_csv() for parsing columns
>>>>> containing strings in CSV format. Here is my PR:
>>>>> https://github.com/apache/spark/pull/22379
>>>>>
>>>>> An use case is loading a dataset from an external storage, dbms or
>>>>> systems like Kafka to where CSV content was dumped as one of
>>>>> columns/fields. Other columns could contain related information like
>>>>> timestamps, ids, sources of data and etc. The column with CSV strings can
>>>>> be parsed by existing method csv() of DataFrameReader but in that
>>>>> case we have to "clean up" dataset and remove other columns since the
>>>>> csv() method requires Dataset[String]. Joining back result of parsing
>>>>> and original dataset by positions is expensive and not convenient. Instead
>>>>> users parse CSV columns by string functions. The approach is usually error
>>>>> prone especially for quoted values and other special cases.
>>>>>
>>>>> The proposed in the PR methods should make a better user experience in
>>>>> parsing CSV-like columns. Please, share your thoughts.
>>>>>
>>>>> --
>>>>>
>>>>> Maxim Gekk
>>>>>
>>>>> Technical Solutions Lead
>>>>>
>>>>> Databricks Inc.
>>>>>
>>>>> maxim.g...@databricks.com
>>>>>
>>>>> databricks.com
>>>>>
>>>>>
>>>>
>>>
>
> --
> *Dongjin Lee*
>
> *A hitchhiker in the mathematical world.*
>
> *github: github.com/dongjinleekr | linkedin: kr.linkedin.com/in/dongjinleekr |
> slideshare: www.slideshare.net/dongjinleekr*
>


-- 
John Zhuge


Re: [VOTE] SPARK 2.3.2 (RC6)

2018-09-19 Thread John Zhuge
+1 (non-binding)

Built on Ubuntu 16.04 with Maven flags: -Phadoop-2.7 -Pmesos -Pyarn
-Phive-thriftserver -Psparkr -Pkinesis-asl -Phadoop-provided

java version "1.8.0_181"
Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)


On Wed, Sep 19, 2018 at 2:31 AM Takeshi Yamamuro 
wrote:

> +1
>
> I also checked `-Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive
> -Phive-thriftserve` on the openjdk below/macOSv10.12.6
>
> $ java -version
> java version "1.8.0_181"
> Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)
>
>
> On Wed, Sep 19, 2018 at 10:45 AM Dongjoon Hyun 
> wrote:
>
>> +1.
>>
>> I tested with `-Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive
>> -Phive-thriftserve` on OpenJDK(1.8.0_181)/CentOS 7.5.
>>
>> I hit the following test case failure once during testing, but it's not
>> persistent.
>>
>> KafkaContinuousSourceSuite
>> ...
>> subscribing topic by name from earliest offsets (failOnDataLoss:
>> false) *** FAILED ***
>>
>> Thank you, Saisai.
>>
>> Bests,
>> Dongjoon.
>>
>> On Mon, Sep 17, 2018 at 6:48 PM Saisai Shao 
>> wrote:
>>
>>> +1 from my own side.
>>>
>>> Thanks
>>> Saisai
>>>
>>> Wenchen Fan wrote on Tue, Sep 18, 2018 at 9:34 AM:
>>>
 +1. All the blocker issues are all resolved in 2.3.2 AFAIK.

 On Tue, Sep 18, 2018 at 9:23 AM Sean Owen  wrote:

> +1 . Licenses and sigs check out as in previous 2.3.x releases. A
> build from source with most profiles passed for me.
> On Mon, Sep 17, 2018 at 8:17 AM Saisai Shao 
> wrote:
> >
> > Please vote on releasing the following candidate as Apache Spark
> version 2.3.2.
> >
> > The vote is open until September 21 PST and passes if a majority +1
> PMC votes are cast, with a minimum of 3 +1 votes.
> >
> > [ ] +1 Release this package as Apache Spark 2.3.2
> > [ ] -1 Do not release this package because ...
> >
> > To learn more about Apache Spark, please see
> http://spark.apache.org/
> >
> > The tag to be voted on is v2.3.2-rc6 (commit
> 02b510728c31b70e6035ad541bfcdc2b59dcd79a):
> > https://github.com/apache/spark/tree/v2.3.2-rc6
> >
> > The release files, including signatures, digests, etc. can be found
> at:
> > https://dist.apache.org/repos/dist/dev/spark/v2.3.2-rc6-bin/
> >
> > Signatures used for Spark RCs can be found in this file:
> > https://dist.apache.org/repos/dist/dev/spark/KEYS
> >
> > The staging repository for this release can be found at:
> >
> https://repository.apache.org/content/repositories/orgapachespark-1286/
> >
> > The documentation corresponding to this release can be found at:
> > https://dist.apache.org/repos/dist/dev/spark/v2.3.2-rc6-docs/
> >
> > The list of bug fixes going into 2.3.2 can be found at the following
> URL:
> > https://issues.apache.org/jira/projects/SPARK/versions/12343289
> >
> >
> > FAQ
> >
> > =
> > How can I help test this release?
> > =
> >
> > If you are a Spark user, you can help us test this release by taking
> > an existing Spark workload and running on this release candidate,
> then
> > reporting any regressions.
> >
> > If you're working in PySpark you can set up a virtual env and install
> > the current RC and see if anything important breaks, in the
> Java/Scala
> > you can add the staging repository to your projects resolvers and
> test
> > with the RC (make sure to clean up the artifact cache before/after so
> > you don't end up building with a out of date RC going forward).
> >
> > ===
> > What should happen to JIRA tickets still targeting 2.3.2?
> > ===
> >
> > The current list of open tickets targeted at 2.3.2 can be found at:
> > https://issues.apache.org/jira/projects/SPARK and search for
> "Target Version/s" = 2.3.2
> >
> > Committers should look at those and triage. Extremely important bug
> > fixes, documentation, and API tweaks that impact compatibility should
> > be worked on immediately. Everything else please retarget to an
> > appropriate release.
> >
> > ==
> > But my bug isn't fixed?
> > ==
> >
> > In order to make timely releases, we will typically not hold the
> > release unless the bug in question is a regression from the previous
> > release. That being said, if there is something which is a regression
> > that has not been correctly targeted please ping me or a committer to
> > help target the issue.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>
>
> --
> ---

Re: code freeze and branch cut for Apache Spark 2.4

2018-08-07 Thread John Zhuge
+1 on SPARK-25004. We have found it quite useful to diagnose PySpark OOM.

On Tue, Aug 7, 2018 at 1:21 PM Holden Karau  wrote:

> I'd like to suggest we consider  SPARK-25004  (hopefully it goes in soon),
> but solving some of the consistent Python memory issues we've had for years
> would be really amazing to get in.
>
> On Tue, Aug 7, 2018 at 1:07 PM, Tom Graves 
> wrote:
>
>> I would like to get clarification on our avro compatibility story before
>> the release.  anyone interested please look at -
>> https://issues.apache.org/jira/browse/SPARK-24924 . I probably should
>> have filed a separate jira and can if we don't resolve via discussion there.
>>
>> Tom
>>
>> On Tuesday, August 7, 2018, 11:46:31 AM CDT, shane knapp <
>> skn...@berkeley.edu> wrote:
>>
>>
>> According to the status, I think we should wait a few more days. Any
>> objections?
>>
>>
>> none here.
>>
>> i'm also pretty certain that waiting until after the code freeze to start
>> testing the GHPRB on ubuntu is the wisest course of action for us.
>>
>> shane
>> --
>> Shane Knapp
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>>
>
>
> --
> Twitter: https://twitter.com/holdenkarau
>


-- 
John


Re: Handle BlockMissingException in pyspark

2018-08-06 Thread John Zhuge
BlockMissingException typically indicates the HDFS file is corrupted. This
might be an HDFS issue; the Hadoop user mailing list is a better bet:
u...@hadoop.apache.org.

- Capture the full stack trace in the executor log.
- If the file still exists, run `hdfs fsck -blockId blk_1233169822_159765693`
  to determine whether the block is corrupted.
- If it is not corrupted, could there be excessive (thousands of) concurrent
  reads on the block?
- Which Hadoop and Spark versions are you running?



On Mon, Aug 6, 2018 at 2:21 AM Divay Jindal 
wrote:

> Hi ,
>
> I am running pyspark in dockerized jupyter environment , I am constantly
> getting this error :
>
> ```
>
> Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.runJob.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 33 
> in stage 25.0 failed 1 times, most recent failure: Lost task 33.0 in stage 
> 25.0 (TID 35067, localhost, executor driver)
> : org.apache.hadoop.hdfs.BlockMissingException
> : Could not obtain block: 
> BP-1742911633-10.225.201.50-1479296658503:blk_1233169822_159765693
>
> ```
>
> Please can anyone help me with how to handle such exception in pyspark.
>
> --
> Best Regards
> *Divay Jindal*
>
>
>

-- 
John


Re: [DISCUSS][SQL] Control the number of output files

2018-08-05 Thread John Zhuge
Great help from the community!

On Sun, Aug 5, 2018 at 6:17 PM Xiao Li  wrote:

> FYI, the new hints have been merged. They will be available in the
> upcoming release (Spark 2.4).
>
> *John Zhuge*, thanks for your work! Really appreciate it! Please submit
> more PRs and help the community improve Spark. : )
>
> Xiao
>
> 2018-08-05 21:06 GMT-04:00 Koert Kuipers :
>
>> lukas,
>> what is the jira ticket for this? i would like to follow it's activity.
>> thanks!
>> koert
>>
>> On Wed, Jul 25, 2018 at 5:32 PM, lukas nalezenec 
>> wrote:
>>
>>> Hi,
>>> Yes, This feature is planned - Spark should be soon able to repartition
>>> output by size.
>>> Lukas
>>>
>>>
>>> Dne st 25. 7. 2018 23:26 uživatel Forest Fang 
>>> napsal:
>>>
>>>> Has there been any discussion to simply support Hive's merge small
>>>> files configuration? It simply adds one additional stage to inspect size of
>>>> each output file, recompute the desired parallelism to reach a target size,
>>>> and runs a map-only coalesce before committing the final files. Since AFAIK
>>>> SparkSQL already stages the final output commit, it seems feasible to
>>>> respect this Hive config.
>>>>
>>>>
>>>> https://community.hortonworks.com/questions/106987/hive-multiple-small-files.html
>>>>
>>>>
>>>> On Wed, Jul 25, 2018 at 1:55 PM Mark Hamstra 
>>>> wrote:
>>>>
>>>>> See some of the related discussion under
>>>>> https://github.com/apache/spark/pull/21589
>>>>>
>>>>> If feels to me like we need some kind of user code mechanism to signal
>>>>> policy preferences to Spark. This could also include ways to signal
>>>>> scheduling policy, which could include things like scheduling pool and/or
>>>>> barrier scheduling. Some of those scheduling policies operate at 
>>>>> inherently
>>>>> different levels currently -- e.g. scheduling pools at the Job level
>>>>> (really, the thread local level in the current implementation) and barrier
>>>>> scheduling at the Stage level -- so it is not completely obvious how to
>>>>> unify all of these policy options/preferences/mechanism, or whether it is
>>>>> possible, but I think it is worth considering such things at a fairly high
>>>>> level of abstraction and try to unify and simplify before making things
>>>>> more complex with multiple policy mechanisms.
>>>>>
>>>>> On Wed, Jul 25, 2018 at 1:37 PM Reynold Xin 
>>>>> wrote:
>>>>>
>>>>>> Seems like a good idea in general. Do other systems have similar
>>>>>> concepts? In general it'd be easier if we can follow existing convention 
>>>>>> if
>>>>>> there is any.
>>>>>>
>>>>>>
>>>>>> On Wed, Jul 25, 2018 at 11:50 AM John Zhuge 
>>>>>> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> Many Spark users in my company are asking for a way to control the
>>>>>>> number of output files in Spark SQL. There are use cases to either 
>>>>>>> reduce
>>>>>>> or increase the number. The users prefer not to use function
>>>>>>> *repartition*(n) or *coalesce*(n, shuffle) that require them to
>>>>>>> write and deploy Scala/Java/Python code.
>>>>>>>
>>>>>>> Could we introduce a query hint for this purpose (similar to
>>>>>>> Broadcast Join Hints)?
>>>>>>>
>>>>>>> /*+ *COALESCE*(n, shuffle) */
>>>>>>>
>>>>>>> In general, is query hint is the best way to bring DF functionality
>>>>>>> to SQL without extending SQL syntax? Any suggestion is highly 
>>>>>>> appreciated.
>>>>>>>
>>>>>>> This requirement is not the same as SPARK-6221 that asked for
>>>>>>> auto-merging output files.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> John Zhuge
>>>>>>>
>>>>>>
>>
>

-- 
John Zhuge


Re: [DISCUSS][SQL] Control the number of output files

2018-08-05 Thread John Zhuge
https://issues.apache.org/jira/browse/SPARK-24940

The PR has been merged to 2.4.0.
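
For anyone who wants to try it once 2.4.0 is out, here is a minimal sketch of
using the new hints from SQL. It assumes the COALESCE/REPARTITION hint syntax
from the merged PR; the table name and output paths are hypothetical:

// Minimal sketch (db.events and the output paths are made up for illustration).
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("output-file-hints").getOrCreate()

// Reduce the number of output files without a full shuffle.
spark.sql("SELECT /*+ COALESCE(5) */ * FROM db.events")
  .write.mode("overwrite").parquet("/tmp/events_coalesced")

// Increase parallelism (and the number of output files) with a shuffle.
spark.sql("SELECT /*+ REPARTITION(100) */ * FROM db.events")
  .write.mode("overwrite").parquet("/tmp/events_repartitioned")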

On Sun, Aug 5, 2018 at 6:06 PM Koert Kuipers  wrote:

> lukas,
> what is the jira ticket for this? i would like to follow it's activity.
> thanks!
> koert
>
> On Wed, Jul 25, 2018 at 5:32 PM, lukas nalezenec  wrote:
>
>> Hi,
>> Yes, This feature is planned - Spark should be soon able to repartition
>> output by size.
>> Lukas
>>
>>
>> Dne st 25. 7. 2018 23:26 uživatel Forest Fang 
>> napsal:
>>
>>> Has there been any discussion to simply support Hive's merge small files
>>> configuration? It simply adds one additional stage to inspect size of each
>>> output file, recompute the desired parallelism to reach a target size, and
>>> runs a map-only coalesce before committing the final files. Since AFAIK
>>> SparkSQL already stages the final output commit, it seems feasible to
>>> respect this Hive config.
>>>
>>>
>>> https://community.hortonworks.com/questions/106987/hive-multiple-small-files.html
>>>
>>>
>>> On Wed, Jul 25, 2018 at 1:55 PM Mark Hamstra 
>>> wrote:
>>>
>>>> See some of the related discussion under
>>>> https://github.com/apache/spark/pull/21589
>>>>
>>>> If feels to me like we need some kind of user code mechanism to signal
>>>> policy preferences to Spark. This could also include ways to signal
>>>> scheduling policy, which could include things like scheduling pool and/or
>>>> barrier scheduling. Some of those scheduling policies operate at inherently
>>>> different levels currently -- e.g. scheduling pools at the Job level
>>>> (really, the thread local level in the current implementation) and barrier
>>>> scheduling at the Stage level -- so it is not completely obvious how to
>>>> unify all of these policy options/preferences/mechanism, or whether it is
>>>> possible, but I think it is worth considering such things at a fairly high
>>>> level of abstraction and try to unify and simplify before making things
>>>> more complex with multiple policy mechanisms.
>>>>
>>>> On Wed, Jul 25, 2018 at 1:37 PM Reynold Xin 
>>>> wrote:
>>>>
>>>>> Seems like a good idea in general. Do other systems have similar
>>>>> concepts? In general it'd be easier if we can follow existing convention 
>>>>> if
>>>>> there is any.
>>>>>
>>>>>
>>>>> On Wed, Jul 25, 2018 at 11:50 AM John Zhuge  wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> Many Spark users in my company are asking for a way to control the
>>>>>> number of output files in Spark SQL. There are use cases to either reduce
>>>>>> or increase the number. The users prefer not to use function
>>>>>> *repartition*(n) or *coalesce*(n, shuffle) that require them to
>>>>>> write and deploy Scala/Java/Python code.
>>>>>>
>>>>>> Could we introduce a query hint for this purpose (similar to
>>>>>> Broadcast Join Hints)?
>>>>>>
>>>>>> /*+ *COALESCE*(n, shuffle) */
>>>>>>
>>>>>> In general, is query hint is the best way to bring DF functionality
>>>>>> to SQL without extending SQL syntax? Any suggestion is highly 
>>>>>> appreciated.
>>>>>>
>>>>>> This requirement is not the same as SPARK-6221 that asked for
>>>>>> auto-merging output files.
>>>>>>
>>>>>> Thanks,
>>>>>> John Zhuge
>>>>>>
>>>>>
>

-- 
John Zhuge


Re: [DISCUSS][SQL] Control the number of output files

2018-07-26 Thread John Zhuge
Filed https://issues.apache.org/jira/browse/SPARK-24940. Will upload a
patch shortly.

SPARK-20857 introduced a generic SQL hint framework in 2.2.0.

On Thu, Jul 26, 2018 at 4:25 PM Reynold Xin  wrote:

> John,
>
> You want to create a ticket and submit a patch for this? If there is a
> coalesce hint, inject a coalesce logical node. Pretty simple.
>
>
> On Wed, Jul 25, 2018 at 2:48 PM John Zhuge  wrote:
>
>> Thanks for the comment, Forest. What I am asking is to make whatever DF
>> repartition/coalesce functionalities available to SQL users.
>>
>> Agree with you on that reducing the final number of output files by file
>> size is very nice to have. Lukas indicated this is planned.
>>
>> On Wed, Jul 25, 2018 at 2:31 PM Forest Fang 
>> wrote:
>>
>>> Sorry I see https://issues.apache.org/jira/browse/SPARK-6221 was
>>> referenced in John's email. Can you elaborate how is your requirement
>>> different? In my experience, it usually is driven by the need to decrease
>>> the final output parallelism without compromising compute parallelism (i.e.
>>> to prevent too many small files to be persisted on HDFS.) The requirement
>>> in my experience is often pretty ballpark and does not require precise
>>> number of partitions. Therefore setting the desired output size to say
>>> 32-64mb usually gives a good enough result. I'm curious why 6221 was marked
>>> as won't fix.
>>>
>>> On Wed, Jul 25, 2018 at 2:26 PM Forest Fang 
>>> wrote:
>>>
>>>> Has there been any discussion to simply support Hive's merge small
>>>> files configuration? It simply adds one additional stage to inspect size of
>>>> each output file, recompute the desired parallelism to reach a target size,
>>>> and runs a map-only coalesce before committing the final files. Since AFAIK
>>>> SparkSQL already stages the final output commit, it seems feasible to
>>>> respect this Hive config.
>>>>
>>>>
>>>> https://community.hortonworks.com/questions/106987/hive-multiple-small-files.html
>>>>
>>>>
>>>> On Wed, Jul 25, 2018 at 1:55 PM Mark Hamstra 
>>>> wrote:
>>>>
>>>>> See some of the related discussion under
>>>>> https://github.com/apache/spark/pull/21589
>>>>>
>>>>> If feels to me like we need some kind of user code mechanism to signal
>>>>> policy preferences to Spark. This could also include ways to signal
>>>>> scheduling policy, which could include things like scheduling pool and/or
>>>>> barrier scheduling. Some of those scheduling policies operate at 
>>>>> inherently
>>>>> different levels currently -- e.g. scheduling pools at the Job level
>>>>> (really, the thread local level in the current implementation) and barrier
>>>>> scheduling at the Stage level -- so it is not completely obvious how to
>>>>> unify all of these policy options/preferences/mechanism, or whether it is
>>>>> possible, but I think it is worth considering such things at a fairly high
>>>>> level of abstraction and try to unify and simplify before making things
>>>>> more complex with multiple policy mechanisms.
>>>>>
>>>>> On Wed, Jul 25, 2018 at 1:37 PM Reynold Xin 
>>>>> wrote:
>>>>>
>>>>>> Seems like a good idea in general. Do other systems have similar
>>>>>> concepts? In general it'd be easier if we can follow existing convention 
>>>>>> if
>>>>>> there is any.
>>>>>>
>>>>>>
>>>>>> On Wed, Jul 25, 2018 at 11:50 AM John Zhuge 
>>>>>> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> Many Spark users in my company are asking for a way to control the
>>>>>>> number of output files in Spark SQL. There are use cases to either 
>>>>>>> reduce
>>>>>>> or increase the number. The users prefer not to use function
>>>>>>> *repartition*(n) or *coalesce*(n, shuffle) that require them to
>>>>>>> write and deploy Scala/Java/Python code.
>>>>>>>
>>>>>>> Could we introduce a query hint for this purpose (similar to
>>>>>>> Broadcast Join Hints)?
>>>>>>>
>>>>>>> /*+ *COALESCE*(n, shuffle) */
>>>>>>>
>>>>>>> In general, is query hint is the best way to bring DF functionality
>>>>>>> to SQL without extending SQL syntax? Any suggestion is highly 
>>>>>>> appreciated.
>>>>>>>
>>>>>>> This requirement is not the same as SPARK-6221 that asked for
>>>>>>> auto-merging output files.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> John Zhuge
>>>>>>>
>>>>>>
>>
>> --
>> John Zhuge
>>
>

-- 
John Zhuge


Re: [DISCUSS][SQL] Control the number of output files

2018-07-25 Thread John Zhuge
Thanks for the comment, Forest. What I am asking for is to make the DF
repartition/coalesce functionality available to SQL users.

I agree that reducing the final number of output files by target file size is
very nice to have; Lukas indicated this is planned.

On Wed, Jul 25, 2018 at 2:31 PM Forest Fang  wrote:

> Sorry I see https://issues.apache.org/jira/browse/SPARK-6221 was
> referenced in John's email. Can you elaborate on how your requirement is
> different? In my experience, it usually is driven by the need to decrease
> the final output parallelism without compromising compute parallelism (i.e.
> to prevent too many small files to be persisted on HDFS.) The requirement
> in my experience is often pretty ballpark and does not require precise
> number of partitions. Therefore setting the desired output size to say
> 32-64mb usually gives a good enough result. I'm curious why 6221 was marked
> as won't fix.
>
> On Wed, Jul 25, 2018 at 2:26 PM Forest Fang 
> wrote:
>
>> Has there been any discussion to simply support Hive's merge small files
>> configuration? It simply adds one additional stage to inspect the size of each
>> output file, recompute the desired parallelism to reach a target size, and
>> runs a map-only coalesce before committing the final files. Since AFAIK
>> SparkSQL already stages the final output commit, it seems feasible to
>> respect this Hive config.
>>
>>
>> https://community.hortonworks.com/questions/106987/hive-multiple-small-files.html
>>
>>
>> On Wed, Jul 25, 2018 at 1:55 PM Mark Hamstra 
>> wrote:
>>
>>> See some of the related discussion under
>>> https://github.com/apache/spark/pull/21589
>>>
>>> It feels to me like we need some kind of user code mechanism to signal
>>> policy preferences to Spark. This could also include ways to signal
>>> scheduling policy, which could include things like scheduling pool and/or
>>> barrier scheduling. Some of those scheduling policies operate at inherently
>>> different levels currently -- e.g. scheduling pools at the Job level
>>> (really, the thread local level in the current implementation) and barrier
>>> scheduling at the Stage level -- so it is not completely obvious how to
>>> unify all of these policy options/preferences/mechanisms, or whether it is
>>> possible, but I think it is worth considering such things at a fairly high
>>> level of abstraction and try to unify and simplify before making things
>>> more complex with multiple policy mechanisms.
>>>
>>> On Wed, Jul 25, 2018 at 1:37 PM Reynold Xin  wrote:
>>>
>>>> Seems like a good idea in general. Do other systems have similar
>>>> concepts? In general it'd be easier if we could follow an existing
>>>> convention, if there is one.
>>>>
>>>>
>>>> On Wed, Jul 25, 2018 at 11:50 AM John Zhuge  wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> Many Spark users in my company are asking for a way to control the
>>>>> number of output files in Spark SQL. There are use cases to either reduce
>>>>> or increase the number. The users prefer not to use the functions
>>>>> *repartition*(n) or *coalesce*(n, shuffle), which require them to write
>>>>> and deploy Scala/Java/Python code.
>>>>>
>>>>> Could we introduce a query hint for this purpose (similar to Broadcast
>>>>> Join Hints)?
>>>>>
>>>>> /*+ *COALESCE*(n, shuffle) */
>>>>>
>>>>> In general, is a query hint the best way to bring DF functionality to
>>>>> SQL without extending SQL syntax? Any suggestion is highly appreciated.
>>>>>
>>>>> This requirement is not the same as SPARK-6221 that asked for
>>>>> auto-merging output files.
>>>>>
>>>>> Thanks,
>>>>> John Zhuge
>>>>>
>>>>

-- 
John Zhuge
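
For context on the gap being discussed: the partition control currently lives
only in the DataFrame API, so a SQL-only workflow needs a thin wrapper such as
the hypothetical spark-shell sketch below (the helper name and table name are
made up); the proposed hint would make such a wrapper unnecessary.

import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper (not an existing Spark API): run a SQL query and apply
// the partition control on the resulting DataFrame, which is roughly what the
// proposed COALESCE(n, shuffle) hint would do inside the planner.
def runWithOutputPartitions(
    spark: SparkSession,
    query: String,
    numPartitions: Int,
    shuffle: Boolean): DataFrame = {
  val df = spark.sql(query)
  if (shuffle) df.repartition(numPartitions) else df.coalesce(numPartitions)
}

// Example (table name is made up):
// runWithOutputPartitions(spark, "SELECT * FROM events WHERE dt = '2018-07-25'", 16, shuffle = false)
//   .write.mode("overwrite").parquet("/warehouse/events_compacted")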


[DISCUSS][SQL] Control the number of output files

2018-07-25 Thread John Zhuge
Hi all,

Many Spark users in my company are asking for a way to control the number
of output files in Spark SQL. There are use cases to either reduce or
increase the number. The users prefer not to use the functions *repartition*(n)
or *coalesce*(n, shuffle), which require them to write and deploy
Scala/Java/Python code.

Could we introduce a query hint for this purpose (similar to Broadcast Join
Hints)?

/*+ *COALESCE*(n, shuffle) */

In general, is a query hint the best way to bring DF functionality to SQL
without extending SQL syntax? Any suggestion is highly appreciated.

This requirement is not the same as SPARK-6221 that asked for auto-merging
output files.

Thanks,
John Zhuge
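
As a rough illustration of the proposal above, here is how a query with the
suggested hint might look next to an existing hint of the same style, written
as spark-shell snippets. The COALESCE(n, shuffle) form is this thread's
proposal rather than shipped syntax, and the table and column names are made up.

// Illustrative only: a query as it might look if the proposed hint existed.
val compacted = spark.sql(
  """SELECT /*+ COALESCE(8, false) */ user_id, count(*) AS cnt
    |FROM clicks
    |GROUP BY user_id""".stripMargin)

// Existing hint of the same shape, for comparison (broadcast join hint):
val joined = spark.sql(
  """SELECT /*+ BROADCAST(d) */ f.*, d.country
    |FROM facts f JOIN dims d ON f.dim_id = d.id""".stripMargin)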


Re: [VOTE] SPARK 2.3.2 (RC3)

2018-07-18 Thread John Zhuge
one blocking issue SPARK-24781
>>>>> during release preparation.
>>>>>
>>>>> FAQ
>>>>>
>>>>> =
>>>>> How can I help test this release?
>>>>> =
>>>>>
>>>>> If you are a Spark user, you can help us test this release by taking
>>>>> an existing Spark workload and running on this release candidate, then
>>>>> reporting any regressions.
>>>>>
>>>>> If you're working in PySpark, you can set up a virtual env and install
>>>>> the current RC and see if anything important breaks; in Java/Scala,
>>>>> you can add the staging repository to your project's resolvers and test
>>>>> with the RC (make sure to clean up the artifact cache before/after so
>>>>> you don't end up building with an out-of-date RC going forward).
>>>>>
>>>>> ===
>>>>> What should happen to JIRA tickets still targeting 2.3.2?
>>>>> ===
>>>>>
>>>>> The current list of open tickets targeted at 2.3.2 can be found at:
>>>>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>>>>> Version/s" = 2.3.2
>>>>>
>>>>> Committers should look at those and triage. Extremely important bug
>>>>> fixes, documentation, and API tweaks that impact compatibility should
>>>>> be worked on immediately. Everything else please retarget to an
>>>>> appropriate release.
>>>>>
>>>>> ==
>>>>> But my bug isn't fixed?
>>>>> ==
>>>>>
>>>>> In order to make timely releases, we will typically not hold the
>>>>> release unless the bug in question is a regression from the previous
>>>>> release. That being said, if there is something which is a regression
>>>>> that has not been correctly targeted please ping me or a committer to
>>>>> help target the issue.
>>>>>
>>>>> --
>>>>> John Zhuge
>>>>>
>>>>


Re: [VOTE] SPIP: Standardize SQL logical plans

2018-07-17 Thread John Zhuge
+1 (non-binding)

On Tue, Jul 17, 2018 at 8:06 PM Wenchen Fan  wrote:

> +1 (binding). I think this is clearer to both users and developers,
> compared to the existing one, which only supports append/overwrite and
> doesn't work well with tables in data sources (like JDBC tables).
>
> On Wed, Jul 18, 2018 at 2:06 AM Ryan Blue  wrote:
>
>> +1 (not binding)
>>
>> On Tue, Jul 17, 2018 at 10:59 AM Ryan Blue  wrote:
>>
>>> Hi everyone,
>>>
>>> From discussion on the proposal doc and the discussion thread, I think
>>> we have consensus around the plan to standardize logical write operations
>>> for DataSourceV2. I would like to call a vote on the proposal.
>>>
>>> The proposal doc is here: SPIP: Standardize SQL logical plans
>>> <https://docs.google.com/document/u/1/d/1gYm5Ji2Mge3QBdOliFV5gSPTKlX4q1DCBXIkiyMv62A/edit?ts=5ace0718=gmail#heading=h.m45webtwxf2d>
>>> .
>>>
>>> This vote is for the plan in that doc. The related SPIP with APIs to
>>> create/alter/drop tables will be a separate vote.
>>>
>>> Please vote in the next 72 hours:
>>>
>>> [+1]: Spark should adopt the SPIP
>>> [-1]: Spark should not adopt the SPIP because . . .
>>>
>>> Thanks for voting, everyone!
>>>
>>> --
>>> Ryan Blue
>>>
>>
>>
>> --
>> Ryan Blue
>>
>> --
>> John Zhuge
>>
>


Re: [VOTE] SPARK 2.3.2 (RC1)

2018-07-10 Thread John Zhuge
+1

On Sun, Jul 8, 2018 at 1:30 AM Saisai Shao  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 2.3.2.
>
> The vote is open until July 11th PST and passes if a majority +1 PMC votes
> are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 2.3.2
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v2.3.2-rc1
> (commit 4df06b45160241dbb331153efbb25703f913c192):
> https://github.com/apache/spark/tree/v2.3.2-rc1
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.3.2-rc1-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1277/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.3.2-rc1-docs/
>
> The list of bug fixes going into 2.3.2 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12343289
>
> PS. This is my first time doing a release, so please help check that
> everything is landing correctly. Thanks ^-^
>
> FAQ
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark, you can set up a virtual env and install
> the current RC and see if anything important breaks; in Java/Scala,
> you can add the staging repository to your project's resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out-of-date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 2.3.2?
> ===
>
> The current list of open tickets targeted at 2.3.2 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 2.3.2
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>


-- 
John


Re: Time for 2.3.2?

2018-06-29 Thread John Zhuge
+1  Looking forward to the critical fixes in 2.3.2.

On Thu, Jun 28, 2018 at 9:37 AM Ryan Blue  wrote:

> +1
>
> On Thu, Jun 28, 2018 at 9:34 AM Xiao Li  wrote:
>
>> +1. Thanks, Saisai!
>>
>> The impact of SPARK-24495 is large. We should release Spark 2.3.2 ASAP.
>>
>> Thanks,
>>
>> Xiao
>>
>> 2018-06-27 23:28 GMT-07:00 Takeshi Yamamuro :
>>
>>> +1, I heard some Spark users have skipped v2.3.1 because of these bugs.
>>>
>>> On Thu, Jun 28, 2018 at 3:09 PM Xingbo Jiang 
>>> wrote:
>>>
>>>> +1
>>>>
>>>> Wenchen Fan wrote on Thursday, June 28, 2018 at 2:06 PM:
>>>>
>>>>> Hi Saisai, that's great! Please go ahead!
>>>>>
>>>>> On Thu, Jun 28, 2018 at 12:56 PM Saisai Shao 
>>>>> wrote:
>>>>>
>>>>>> +1, like mentioned by Marcelo, these issues seems quite severe.
>>>>>>
>>>>>> I can work on the release if short of hands :).
>>>>>>
>>>>>> Thanks
>>>>>> Jerry
>>>>>>
>>>>>>
>>>>>> Marcelo Vanzin  于2018年6月28日周四 上午11:40写道:
>>>>>>
>>>>>>> +1. SPARK-24589 / SPARK-24552 are kinda nasty and we should get fixes
>>>>>>> for those out.
>>>>>>>
>>>>>>> (Those are what delayed 2.2.2 and 2.1.3 for those watching...)
>>>>>>>
>>>>>>> On Wed, Jun 27, 2018 at 7:59 PM, Wenchen Fan 
>>>>>>> wrote:
>>>>>>> > Hi all,
>>>>>>> >
>>>>>>> > Spark 2.3.1 was released just a while ago, but unfortunately we
>>>>>>> > discovered and fixed some critical issues afterward.
>>>>>>> >
>>>>>>> > SPARK-24495: SortMergeJoin may produce wrong result.
>>>>>>> > This is a serious correctness bug, and it is easy to hit: the query
>>>>>>> > has a duplicated join key from the left table, e.g. `WHERE t1.a =
>>>>>>> > t2.b AND t1.a = t2.c`, and the join is a sort merge join. This bug
>>>>>>> > is only present in Spark 2.3.
>>>>>>> >
>>>>>>> > SPARK-24588: stream-stream join may produce wrong result
>>>>>>> > This is a correctness bug in a new feature of Spark 2.3: the
>>>>>>> > stream-stream join. Users can hit this bug if one of the join sides
>>>>>>> > is partitioned by a subset of the join keys.
>>>>>>> >
>>>>>>> > SPARK-24552: Task attempt numbers are reused when stages are retried
>>>>>>> > This is a long-standing bug in the output committer that may
>>>>>>> > introduce data corruption.
>>>>>>> >
>>>>>>> > SPARK-24542: UDFXPath allow users to pass carefully crafted XML to
>>>>>>> > access arbitrary files
>>>>>>> > This is a potential security issue if users build an access control
>>>>>>> > module on top of Spark.
>>>>>>> >
>>>>>>> > I think we need a Spark 2.3.2 to address these issues (especially
>>>>>>> > the correctness bugs) ASAP. Any thoughts?
>>>>>>> >
>>>>>>> > Thanks,
>>>>>>> > Wenchen
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Marcelo
>>>>>>>
>>>>>>> -
>>>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>>>>
>>>>>>>
>>>
>>> --
>>> ---
>>> Takeshi Yamamuro
>>>
>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
> --
> John Zhuge
>
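
For readers assessing exposure to SPARK-24495 as described above, a minimal
spark-shell sketch of the query shape in question follows; the table names are
made up and this only illustrates the pattern, not a guaranteed reproduction.

// Illustrative shape of the SPARK-24495 pattern: the same left-side column
// (t1.a) appears in two equi-join conditions and the join runs as a sort
// merge join (e.g. with broadcast joins disabled).
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1") // favor sort merge join

val suspect = spark.sql(
  """SELECT *
    |FROM t1 JOIN t2
    |  ON t1.a = t2.b AND t1.a = t2.c""".stripMargin)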


Re: [VOTE] Spark 2.3.1 (RC4)

2018-06-04 Thread John Zhuge
+1

On Sun, Jun 3, 2018 at 6:12 PM, Hyukjin Kwon  wrote:

> +1
>
> On Sun, Jun 3, 2018 at 9:25 PM, Ricardo Almeida wrote:
>
>> +1 (non-binding)
>>
>> On 3 June 2018 at 09:23, Dongjoon Hyun  wrote:
>>
>>> +1
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>> On Sat, Jun 2, 2018 at 8:09 PM, Denny Lee  wrote:
>>>
 +1

 On Sat, Jun 2, 2018 at 4:53 PM Nicholas Chammas <
 nicholas.cham...@gmail.com> wrote:

> I'll give that a try, but I'll still have to figure out what to do if
> none of the release builds work with hadoop-aws, since Flintrock deploys
> Spark release builds to set up a cluster. Building Spark is slow, so we
> only do it if the user specifically requests a Spark version by git hash.
> (This is basically how spark-ec2 did things, too.)
>
>
> On Sat, Jun 2, 2018 at 6:54 PM Marcelo Vanzin 
> wrote:
>
>> If you're building your own Spark, definitely try the hadoop-cloud
>> profile. Then you don't even need to pull anything at runtime;
>> everything is already packaged with Spark.
>>
>> On Fri, Jun 1, 2018 at 6:51 PM, Nicholas Chammas
>>  wrote:
>> > pyspark --packages org.apache.hadoop:hadoop-aws:2.7.3 didn’t work
>> for me
>> > either (even building with -Phadoop-2.7). I guess I’ve been relying
>> on an
>> > unsupported pattern and will need to figure something else out
>> going forward
>> > in order to use s3a://.
>> >
>> >
>> > On Fri, Jun 1, 2018 at 9:09 PM Marcelo Vanzin 
>> wrote:
>> >>
>> >> I have personally never tried to include hadoop-aws that way. But
>> at
>> >> the very least, I'd try to use the same version of Hadoop as the
>> Spark
>> >> build (2.7.3 IIRC). I don't really expect a different version to
>> work,
>> >> and if it did in the past it definitely was not by design.
>> >>
>> >> On Fri, Jun 1, 2018 at 5:50 PM, Nicholas Chammas
>> >>  wrote:
>> >> > Building with -Phadoop-2.7 didn’t help, and if I remember
>> correctly,
>> >> > building with -Phadoop-2.8 worked with hadoop-aws in the 2.3.0
>> release,
>> >> > so
>> >> > it appears something has changed since then.
>> >> >
>> >> > I wasn’t familiar with -Phadoop-cloud, but I can try that.
>> >> >
>> >> > My goal here is simply to confirm that this release of Spark
>> works with
>> >> > hadoop-aws like past releases did, particularly for Flintrock
>> users who
>> >> > use
>> >> > Spark with S3A.
>> >> >
>> >> > We currently provide -hadoop2.6, -hadoop2.7, and -without-hadoop
>> builds
>> >> > with
>> >> > every Spark release. If the -hadoop2.7 release build won’t work
>> with
>> >> > hadoop-aws anymore, are there plans to provide a new build type
>> that
>> >> > will?
>> >> >
>> >> > Apologies if the question is poorly formed. I’m batting a bit
>> outside my
>> >> > league here. Again, my goal is simply to confirm that I/my users
>> still
>> >> > have
>> >> > a way to use s3a://. In the past, that way was simply to call
>> pyspark
>> >> > --packages org.apache.hadoop:hadoop-aws:2.8.4 or something very
>> similar.
>> >> > If
>> >> > that will no longer work, I’m trying to confirm that the change
>> of
>> >> > behavior
>> >> > is intentional or acceptable (as a review for the Spark project)
>> and
>> >> > figure
>> >> > out what I need to change (as due diligence for Flintrock’s
>> users).
>> >> >
>> >> > Nick
>> >> >
>> >> >
>> >> > On Fri, Jun 1, 2018 at 8:21 PM Marcelo Vanzin <
>> van...@cloudera.com>
>> >> > wrote:
>> >> >>
>> >> >> Using the hadoop-aws package is probably going to be a little
>> more
>> >> >> complicated than that. The best bet is to use a custom build of
>> Spark
>> >> >> that includes it (use -Phadoop-cloud). Otherwise you're probably
>> >> >> looking at some nasty dependency issues, especially if you end
>> up
>> >> >> mixing different versions of Hadoop.
>> >> >>
>> >> >> On Fri, Jun 1, 2018 at 4:01 PM, Nicholas Chammas
>> >> >>  wrote:
>> >> >> > I was able to successfully launch a Spark cluster on EC2 at
>> 2.3.1 RC4
>> >> >> > using
>> >> >> > Flintrock. However, trying to load the hadoop-aws package
>> gave me
>> >> >> > some
>> >> >> > errors.
>> >> >> >
>> >> >> > $ pyspark --packages org.apache.hadoop:hadoop-aws:2.8.4
>> >> >> >
>> >> >> > 
>> >> >> >
>> >> >> > :: problems summary ::
>> >> >> >  WARNINGS
>> >> >> > [NOT FOUND  ]
>> >> >> > com.sun.jersey#jersey-json;1.9!jersey-json.jar(bundle) (2ms)
>> >> >> >  local-m2-cache: tried
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> > file:/home/ec2-user/.m2/repository/com/sun/jersey/
>> jersey-json/1.9/jersey-json-1.9.jar
>> >> >> >