Re: [VOTE] SPIP: Catalog API for view metadata

2022-02-03 Thread Chao Sun
+1 (non-binding). Looking forward to this feature!

On Thu, Feb 3, 2022 at 2:32 PM Ryan Blue  wrote:

> +1 for the SPIP. I think it's well designed and it has worked quite well
> at Netflix for a long time.
>
> On Thu, Feb 3, 2022 at 2:04 PM John Zhuge  wrote:
>
>> Hi Spark community,
>>
>> I’d like to restart the vote for the ViewCatalog design proposal (SPIP).
>>
>> The proposal is to add a ViewCatalog interface that can be used to load,
>> create, alter, and drop views in DataSourceV2.
>>
>> Please vote on the SPIP until Feb. 9th (Wednesday).
>>
>> [ ] +1: Accept the proposal as an official SPIP
>> [ ] +0
>> [ ] -1: I don’t think this is a good idea because …
>>
>> Thanks!
>>
>
>
> --
> Ryan Blue
> Tabular
>


Re: [VOTE] SPIP: Catalog API for view metadata

2022-02-03 Thread Ryan Blue
+1 for the SPIP. I think it's well designed and it has worked quite well at
Netflix for a long time.

On Thu, Feb 3, 2022 at 2:04 PM John Zhuge  wrote:

> Hi Spark community,
>
> I’d like to restart the vote for the ViewCatalog design proposal (SPIP).
>
> The proposal is to add a ViewCatalog interface that can be used to load,
> create, alter, and drop views in DataSourceV2.
>
> Please vote on the SPIP until Feb. 9th (Wednesday).
>
> [ ] +1: Accept the proposal as an official SPIP
> [ ] +0
> [ ] -1: I don’t think this is a good idea because …
>
> Thanks!
>


-- 
Ryan Blue
Tabular


[VOTE] SPIP: Catalog API for view metadata

2022-02-03 Thread John Zhuge
Hi Spark community,

I’d like to restart the vote for the ViewCatalog design proposal (SPIP).

The proposal is to add a ViewCatalog interface that can be used to load,
create, alter, and drop views in DataSourceV2.

Please vote on the SPIP until Feb. 9th (Wednesday).

[ ] +1: Accept the proposal as an official SPIP
[ ] +0
[ ] -1: I don’t think this is a good idea because …

Thanks!
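For readers skimming the thread, the contract being voted on — load, create, alter, and drop views — can be sketched as a tiny in-memory catalog. This is only an illustration of the shape of the proposal; the real SPIP defines a Java interface in DataSourceV2, and every name below is invented for the sketch.

```python
from dataclasses import dataclass

@dataclass
class View:
    """Minimal view metadata: a name and the SQL text it expands to."""
    name: str
    sql: str

class InMemoryViewCatalog:
    """Illustrative sketch of the four operations named in the proposal:
    load, create, alter, and drop views. Hypothetical shape only; the
    actual proposal is a Java interface in DataSourceV2."""

    def __init__(self):
        self._views = {}

    def load_view(self, name):
        if name not in self._views:
            raise KeyError(f"view not found: {name}")
        return self._views[name]

    def create_view(self, name, sql, replace=False):
        if name in self._views and not replace:
            raise ValueError(f"view already exists: {name}")
        self._views[name] = View(name, sql)

    def alter_view(self, name, sql):
        self.load_view(name)  # must already exist
        self._views[name] = View(name, sql)

    def drop_view(self, name):
        # Returns True if the view existed and was removed.
        return self._views.pop(name, None) is not None
```

A catalog implementation could expose both this view contract and the existing table contract, which is the "both a table and view provider" option discussed later in the thread.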


Re: [VOTE] SPIP: Catalog API for view metadata

2022-02-03 Thread John Zhuge
Sure Xiao.

Happy Lunar New Year!

On Thu, Feb 3, 2022 at 1:57 PM Xiao Li  wrote:

> Can we extend the voting window to next Wednesday? This week is a holiday
> week for the lunar new year. AFAIK, many members in Asia are taking the
> whole week off. They might not regularly check the emails.
>
> Also how about starting a separate email thread starting with [VOTE] ?
>
> Happy Lunar New Year!!!
>
> Xiao
>
> On Thu, Feb 3, 2022 at 12:28 PM Holden Karau  wrote:
>
>> +1 (binding)
>>
>> On Thu, Feb 3, 2022 at 2:26 PM Erik Krogen  wrote:
>>
>>> +1 (non-binding)
>>>
>>> Really looking forward to having this natively supported by Spark, so
>>> that we can get rid of our own hacks to tie in a custom view catalog
>>> implementation. I appreciate the care John has put into various parts of
>>> the design and believe this will provide a robust and flexible solution to
>>> this problem faced by various large-scale Spark users.
>>>
>>> Thanks John!
>>>
>>> On Thu, Feb 3, 2022 at 11:22 AM Walaa Eldin Moustafa <
>>> wa.moust...@gmail.com> wrote:
>>>
 +1

 On Thu, Feb 3, 2022 at 11:19 AM John Zhuge  wrote:

> Hi Spark community,
>
> I’d like to restart the vote for the ViewCatalog design proposal (SPIP
> 
> ).
>
> The proposal is to add a ViewCatalog interface that can be used to
> load, create, alter, and drop views in DataSourceV2.
>
> Please vote on the SPIP in the next 72 hours. Once it is approved,
> I’ll update the PR  for
> review.
>
> [ ] +1: Accept the proposal as an official SPIP
> [ ] +0
> [ ] -1: I don’t think this is a good idea because …
>
> Thanks!
>
> On Fri, Jun 4, 2021 at 1:46 PM Walaa Eldin Moustafa <
> wa.moust...@gmail.com> wrote:
>
>> Considering the API aspect, the ViewCatalog API sounds like a good
>> idea. A view catalog will enable us to integrate Coral
>>  (our view SQL
>> translation and management layer) very cleanly to Spark. Currently we can
>> only do it by maintaining our special version of the
>> HiveExternalCatalog. Considering that views can be expanded
>> syntactically without necessarily invoking the analyzer, using a 
>> dedicated
>> view API can make performance better if performance is the concern.
>> Further, a catalog can still be both a table and view provider if it
>> chooses to based on this design, so I do not think we necessarily lose 
>> the
>> ability of providing both. Looking forward to more discussions on this 
>> and
>> making views a powerful tool in Spark.
>>
>> Thanks,
>> Walaa.
>>
>>
>> On Wed, May 26, 2021 at 9:54 AM John Zhuge  wrote:
>>
>>> Looks like we are running in circles. Should we have an online
>>> meeting to get this sorted out?
>>>
>>> Thanks,
>>> John
>>>
>>> On Wed, May 26, 2021 at 12:01 AM Wenchen Fan 
>>> wrote:
>>>
 OK, then I'd vote for TableViewCatalog, because
 1. This is how Hive catalog works, and we need to migrate Hive
 catalog to the v2 API sooner or later.
 2. Because of 1, TableViewCatalog is easy to support in the current
 table/view resolution framework.
 3. It's better to avoid name conflicts between table and views at
 the API level, instead of relying on the catalog implementation.
 4. Cache invalidation is always a tricky problem.

 On Tue, May 25, 2021 at 3:09 AM Ryan Blue 
 wrote:

> I don't think that it makes sense to discuss a different approach
> in the PR rather than in the vote. Let's discuss this now since 
> that's the
> purpose of an SPIP.
>
> On Mon, May 24, 2021 at 11:22 AM John Zhuge 
> wrote:
>
>> Hi everyone, I’d like to start a vote for the ViewCatalog design
>> proposal (SPIP).
>>
>> The proposal is to add a ViewCatalog interface that can be used
>> to load, create, alter, and drop views in DataSourceV2.
>>
>> The full SPIP doc is here:
>> https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing
>>
>> Please vote on the SPIP in the next 72 hours. Once it is
>> approved, I’ll update the PR for review.
>>
>> [ ] +1: Accept the proposal as an official SPIP
>> [ ] +0
>> [ ] -1: I don’t think this is a good idea because …
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>

>>>
>>> --
>>> John Zhuge
>>>
>>
>
> --
> John Zhuge
>
 --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau

Re: [VOTE] SPIP: Catalog API for view metadata

2022-02-03 Thread Xiao Li
Can we extend the voting window to next Wednesday? This week is a holiday
week for the lunar new year. AFAIK, many members in Asia are taking the
whole week off. They might not regularly check the emails.

Also how about starting a separate email thread starting with [VOTE] ?

Happy Lunar New Year!!!

Xiao

On Thu, Feb 3, 2022 at 12:28 PM Holden Karau  wrote:

> +1 (binding)
>
> On Thu, Feb 3, 2022 at 2:26 PM Erik Krogen  wrote:
>
>> +1 (non-binding)
>>
>> Really looking forward to having this natively supported by Spark, so
>> that we can get rid of our own hacks to tie in a custom view catalog
>> implementation. I appreciate the care John has put into various parts of
>> the design and believe this will provide a robust and flexible solution to
>> this problem faced by various large-scale Spark users.
>>
>> Thanks John!
>>
>> On Thu, Feb 3, 2022 at 11:22 AM Walaa Eldin Moustafa <
>> wa.moust...@gmail.com> wrote:
>>
>>> +1
>>>
>>> On Thu, Feb 3, 2022 at 11:19 AM John Zhuge  wrote:
>>>
 Hi Spark community,

 I’d like to restart the vote for the ViewCatalog design proposal (SPIP
 
 ).

 The proposal is to add a ViewCatalog interface that can be used to
 load, create, alter, and drop views in DataSourceV2.

 Please vote on the SPIP in the next 72 hours. Once it is approved, I’ll
 update the PR  for review.

 [ ] +1: Accept the proposal as an official SPIP
 [ ] +0
 [ ] -1: I don’t think this is a good idea because …

 Thanks!

 On Fri, Jun 4, 2021 at 1:46 PM Walaa Eldin Moustafa <
 wa.moust...@gmail.com> wrote:

> Considering the API aspect, the ViewCatalog API sounds like a good
> idea. A view catalog will enable us to integrate Coral
>  (our view SQL
> translation and management layer) very cleanly to Spark. Currently we can
> only do it by maintaining our special version of the
> HiveExternalCatalog. Considering that views can be expanded
> syntactically without necessarily invoking the analyzer, using a dedicated
> view API can make performance better if performance is the concern.
> Further, a catalog can still be both a table and view provider if it
> chooses to based on this design, so I do not think we necessarily lose the
> ability of providing both. Looking forward to more discussions on this and
> making views a powerful tool in Spark.
>
> Thanks,
> Walaa.
>
>
> On Wed, May 26, 2021 at 9:54 AM John Zhuge  wrote:
>
>> Looks like we are running in circles. Should we have an online
>> meeting to get this sorted out?
>>
>> Thanks,
>> John
>>
>> On Wed, May 26, 2021 at 12:01 AM Wenchen Fan 
>> wrote:
>>
>>> OK, then I'd vote for TableViewCatalog, because
>>> 1. This is how Hive catalog works, and we need to migrate Hive
>>> catalog to the v2 API sooner or later.
>>> 2. Because of 1, TableViewCatalog is easy to support in the current
>>> table/view resolution framework.
>>> 3. It's better to avoid name conflicts between table and views at
>>> the API level, instead of relying on the catalog implementation.
>>> 4. Cache invalidation is always a tricky problem.
>>>
>>> On Tue, May 25, 2021 at 3:09 AM Ryan Blue 
>>> wrote:
>>>
 I don't think that it makes sense to discuss a different approach
 in the PR rather than in the vote. Let's discuss this now since that's 
 the
 purpose of an SPIP.

 On Mon, May 24, 2021 at 11:22 AM John Zhuge 
 wrote:

> Hi everyone, I’d like to start a vote for the ViewCatalog design
> proposal (SPIP).
>
> The proposal is to add a ViewCatalog interface that can be used to
> load, create, alter, and drop views in DataSourceV2.
>
> The full SPIP doc is here:
> https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing
>
> Please vote on the SPIP in the next 72 hours. Once it is approved,
> I’ll update the PR for review.
>
> [ ] +1: Accept the proposal as an official SPIP
> [ ] +0
> [ ] -1: I don’t think this is a good idea because …
>


 --
 Ryan Blue
 Software Engineer
 Netflix

>>>
>>
>> --
>> John Zhuge
>>
>

 --
 John Zhuge

>>> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


Re: [VOTE] Spark 3.1.3 RC3

2022-02-03 Thread Holden Karau
Good catch Dongjoon :)

This release candidate fails, but feel free to keep testing for any other
potential blockers.

I’ll roll RC4 next week with the older release scripts (but the more modern
image since the legacy image didn’t have a good time with the R doc
packaging).

On Thu, Feb 3, 2022 at 3:53 PM Dongjoon Hyun 
wrote:

> Unfortunately, -1 for 3.1.3 RC3 due to the packaging issue.
>
> It seems that the master branch release script didn't work properly for
> Hadoop 2 binary distribution, Holden.
>
> $ curl -s
> https://dist.apache.org/repos/dist/dev/spark/v3.1.3-rc3-bin/spark-3.1.3-bin-hadoop2.tgz
> | tar tz | grep hadoop-common
> spark-3.1.3-bin-hadoop2/jars/hadoop-common-3.2.0.jar
>
> Apache Spark didn't drop Apache Hadoop 2 based binary distribution yet.
>
> Dongjoon
>
>
> On Wed, Feb 2, 2022 at 3:38 PM Mridul Muralidharan 
> wrote:
>
>> Hi,
>>
>>   Minor nit: the tag mentioned under [1] looks like a typo - I used
>> "v3.1.3-rc3"  for my vote (3.2.1 is mentioned in a couple of places, treat
>> them as 3.1.3 instead)
>>
>> +1
>> Signatures, digests, etc check out fine.
>> Checked out tag and build/tested with -Pyarn -Pmesos -Pkubernetes
>>
>> Regards,
>> Mridul
>>
>> [1] "The tag to be voted on is v3.2.1-rc1" - the commit hash and git url
>> are correct.
>>
>>
>> On Wed, Feb 2, 2022 at 9:30 AM Mridul Muralidharan 
>> wrote:
>>
>>>
>>> Thanks Tom !
>>> I missed [1] (or probably forgot) the 3.1 part of the discussion given
>>> it centered around 3.2 ...
>>>
>>>
>>> Regards,
>>> Mridul
>>>
>>> [1] https://www.mail-archive.com/dev@spark.apache.org/msg28484.html
>>>
>>> On Wed, Feb 2, 2022 at 8:55 AM Thomas Graves 
>>> wrote:
>>>
 It was discussed doing all the maintenance lines back at beginning of
 December (Dec 6) when we were talking about release 3.2.1.

 Tom

 On Wed, Feb 2, 2022 at 2:07 AM Mridul Muralidharan 
 wrote:
 >
 > Hi Holden,
 >
 >   Not that I am against releasing 3.1.3 (given the fixes that have
 already gone in), but did we discuss releasing it ? I might have missed the
 thread ...
 >
 > Regards,
 > Mridul
 >
 > On Tue, Feb 1, 2022 at 7:12 PM Holden Karau 
 wrote:
 >>
 >> Please vote on releasing the following candidate as Apache Spark
 version 3.1.3.
 >>
 >> The vote is open until Feb. 4th at 5 PM PST (1 AM UTC + 1 day) and
 passes if a majority
 >> +1 PMC votes are cast, with a minimum of 3 +1 votes.
 >>
 >> [ ] +1 Release this package as Apache Spark 3.1.3
 >> [ ] -1 Do not release this package because ...
 >>
 >> To learn more about Apache Spark, please see
 http://spark.apache.org/
 >>
 >> There are currently no open issues targeting 3.1.3 in Spark's JIRA
 https://issues.apache.org/jira/browse
 >> (try project = SPARK AND "Target Version/s" = "3.1.3" AND status in
 (Open, Reopened, "In Progress"))
 >> at https://s.apache.org/n79dw
 >>
 >>
 >>
 >> The tag to be voted on is v3.2.1-rc1 (commit
 >> b8c0799a8cef22c56132d94033759c9f82b0cc86):
 >> https://github.com/apache/spark/tree/v3.1.3-rc3
 >>
 >> The release files, including signatures, digests, etc. can be found
 at:
 >> https://dist.apache.org/repos/dist/dev/spark/v3.1.3-rc3-bin/
 >>
 >> Signatures used for Spark RCs can be found in this file:
 >> https://dist.apache.org/repos/dist/dev/spark/KEYS
 >>
 >> The staging repository for this release can be found at
 >> :
 https://repository.apache.org/content/repositories/orgapachespark-1400/
 >>
 >> The documentation corresponding to this release can be found at:
 >> https://dist.apache.org/repos/dist/dev/spark/v3.1.3-rc3-docs/
 >>
 >> The list of bug fixes going into 3.1.3 can be found at the following
 URL:
 >> https://s.apache.org/x0q9b
 >>
 >> This release is using the release script in master as of
 ddc77fb906cb3ce1567d277c2d0850104c89ac25
 >> The release docker container was rebuilt since the previous version
 didn't have the necessary components to build the R documentation.
 >>
 >> FAQ
 >>
 >>
 >> =
 >> How can I help test this release?
 >> =
 >>
 >> If you are a Spark user, you can help us test this release by taking
 >> an existing Spark workload and running on this release candidate,
 then
 >> reporting any regressions.
 >>
 >> If you're working in PySpark you can set up a virtual env and install
 >> the current RC and see if anything important breaks, in the
 Java/Scala
 >> you can add the staging repository to your project's resolvers and
 test
 >> with the RC (make sure to clean up the artifact cache before/after so
 >> you don't end up building with an out of date RC going forward).
 >>
 >> ===
 >> What should happen to JIRA tickets still targeting 3.1.3?
 >> ===

Re: [VOTE] Spark 3.1.3 RC3

2022-02-03 Thread Dongjoon Hyun
Unfortunately, -1 for 3.1.3 RC3 due to the packaging issue.

It seems that the master branch release script didn't work properly for
Hadoop 2 binary distribution, Holden.

$ curl -s
https://dist.apache.org/repos/dist/dev/spark/v3.1.3-rc3-bin/spark-3.1.3-bin-hadoop2.tgz
| tar tz | grep hadoop-common
spark-3.1.3-bin-hadoop2/jars/hadoop-common-3.2.0.jar

Apache Spark didn't drop Apache Hadoop 2 based binary distribution yet.

Dongjoon


On Wed, Feb 2, 2022 at 3:38 PM Mridul Muralidharan  wrote:

> Hi,
>
>   Minor nit: the tag mentioned under [1] looks like a typo - I used
> "v3.1.3-rc3"  for my vote (3.2.1 is mentioned in a couple of places, treat
> them as 3.1.3 instead)
>
> +1
> Signatures, digests, etc check out fine.
> Checked out tag and build/tested with -Pyarn -Pmesos -Pkubernetes
>
> Regards,
> Mridul
>
> [1] "The tag to be voted on is v3.2.1-rc1" - the commit hash and git url
> are correct.
>
>
> On Wed, Feb 2, 2022 at 9:30 AM Mridul Muralidharan 
> wrote:
>
>>
>> Thanks Tom !
>> I missed [1] (or probably forgot) the 3.1 part of the discussion given it
>> centered around 3.2 ...
>>
>>
>> Regards,
>> Mridul
>>
>> [1] https://www.mail-archive.com/dev@spark.apache.org/msg28484.html
>>
>> On Wed, Feb 2, 2022 at 8:55 AM Thomas Graves 
>> wrote:
>>
>>> It was discussed doing all the maintenance lines back at beginning of
>>> December (Dec 6) when we were talking about release 3.2.1.
>>>
>>> Tom
>>>
>>> On Wed, Feb 2, 2022 at 2:07 AM Mridul Muralidharan 
>>> wrote:
>>> >
>>> > Hi Holden,
>>> >
>>> >   Not that I am against releasing 3.1.3 (given the fixes that have
>>> already gone in), but did we discuss releasing it ? I might have missed the
>>> thread ...
>>> >
>>> > Regards,
>>> > Mridul
>>> >
>>> > On Tue, Feb 1, 2022 at 7:12 PM Holden Karau 
>>> wrote:
>>> >>
>>> >> Please vote on releasing the following candidate as Apache Spark
>>> version 3.1.3.
>>> >>
>>> >> The vote is open until Feb. 4th at 5 PM PST (1 AM UTC + 1 day) and
>>> passes if a majority
>>> >> +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>> >>
>>> >> [ ] +1 Release this package as Apache Spark 3.1.3
>>> >> [ ] -1 Do not release this package because ...
>>> >>
>>> >> To learn more about Apache Spark, please see http://spark.apache.org/
>>> >>
>>> >> There are currently no open issues targeting 3.1.3 in Spark's JIRA
>>> https://issues.apache.org/jira/browse
>>> >> (try project = SPARK AND "Target Version/s" = "3.1.3" AND status in
>>> (Open, Reopened, "In Progress"))
>>> >> at https://s.apache.org/n79dw
>>> >>
>>> >>
>>> >>
>>> >> The tag to be voted on is v3.2.1-rc1 (commit
>>> >> b8c0799a8cef22c56132d94033759c9f82b0cc86):
>>> >> https://github.com/apache/spark/tree/v3.1.3-rc3
>>> >>
>>> >> The release files, including signatures, digests, etc. can be found
>>> at:
>>> >> https://dist.apache.org/repos/dist/dev/spark/v3.1.3-rc3-bin/
>>> >>
>>> >> Signatures used for Spark RCs can be found in this file:
>>> >> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>> >>
>>> >> The staging repository for this release can be found at
>>> >> :
>>> https://repository.apache.org/content/repositories/orgapachespark-1400/
>>> >>
>>> >> The documentation corresponding to this release can be found at:
>>> >> https://dist.apache.org/repos/dist/dev/spark/v3.1.3-rc3-docs/
>>> >>
>>> >> The list of bug fixes going into 3.1.3 can be found at the following
>>> URL:
>>> >> https://s.apache.org/x0q9b
>>> >>
>>> >> This release is using the release script in master as of
>>> ddc77fb906cb3ce1567d277c2d0850104c89ac25
>>> >> The release docker container was rebuilt since the previous version
>>> didn't have the necessary components to build the R documentation.
>>> >>
>>> >> FAQ
>>> >>
>>> >>
>>> >> =
>>> >> How can I help test this release?
>>> >> =
>>> >>
>>> >> If you are a Spark user, you can help us test this release by taking
>>> >> an existing Spark workload and running on this release candidate, then
>>> >> reporting any regressions.
>>> >>
>>> >> If you're working in PySpark you can set up a virtual env and install
>>> >> the current RC and see if anything important breaks, in the Java/Scala
>>> >> you can add the staging repository to your project's resolvers and test
>>> >> with the RC (make sure to clean up the artifact cache before/after so
>>> >> you don't end up building with an out of date RC going forward).
>>> >>
>>> >> ===
>>> >> What should happen to JIRA tickets still targeting 3.1.3?
>>> >> ===
>>> >>
>>> >> The current list of open tickets targeted at 3.2.1 can be found at:
>>> >> https://issues.apache.org/jira/projects/SPARK and search for "Target
>>> >> Version/s" = 3.1.3
>>> >>
>>> >> Committers should look at those and triage. Extremely important bug
>>> >> fixes, documentation, and API tweaks that impact compatibility should
>>> >> be worked on immediately. Everything else please retarget to an
>>> >>
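As an aside for readers unfamiliar with the release process: the rule quoted in the vote email above ("passes if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes") can be written down as a small check. This is one reading of that sentence, not an official ASF tallying tool.

```python
def release_vote_passes(pmc_votes):
    """pmc_votes: list of +1/-1 votes cast by PMC members (abstentions
    omitted). One reading of the quoted rule: more +1 than -1 among the
    PMC votes cast, and at least three PMC +1 votes overall."""
    plus = sum(1 for v in pmc_votes if v > 0)
    minus = sum(1 for v in pmc_votes if v < 0)
    return plus >= 3 and plus > minus
```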

Re: [VOTE] SPIP: Catalog API for view metadata

2022-02-03 Thread Holden Karau
+1 (binding)

On Thu, Feb 3, 2022 at 2:26 PM Erik Krogen  wrote:

> +1 (non-binding)
>
> Really looking forward to having this natively supported by Spark, so that
> we can get rid of our own hacks to tie in a custom view catalog
> implementation. I appreciate the care John has put into various parts of
> the design and believe this will provide a robust and flexible solution to
> this problem faced by various large-scale Spark users.
>
> Thanks John!
>
> On Thu, Feb 3, 2022 at 11:22 AM Walaa Eldin Moustafa <
> wa.moust...@gmail.com> wrote:
>
>> +1
>>
>> On Thu, Feb 3, 2022 at 11:19 AM John Zhuge  wrote:
>>
>>> Hi Spark community,
>>>
>>> I’d like to restart the vote for the ViewCatalog design proposal (SPIP
>>> 
>>> ).
>>>
>>> The proposal is to add a ViewCatalog interface that can be used to load,
>>> create, alter, and drop views in DataSourceV2.
>>>
>>> Please vote on the SPIP in the next 72 hours. Once it is approved, I’ll
>>> update the PR  for review.
>>>
>>> [ ] +1: Accept the proposal as an official SPIP
>>> [ ] +0
>>> [ ] -1: I don’t think this is a good idea because …
>>>
>>> Thanks!
>>>
>>> On Fri, Jun 4, 2021 at 1:46 PM Walaa Eldin Moustafa <
>>> wa.moust...@gmail.com> wrote:
>>>
 Considering the API aspect, the ViewCatalog API sounds like a good
 idea. A view catalog will enable us to integrate Coral
  (our view SQL
 translation and management layer) very cleanly to Spark. Currently we can
 only do it by maintaining our special version of the
 HiveExternalCatalog. Considering that views can be expanded
 syntactically without necessarily invoking the analyzer, using a dedicated
 view API can make performance better if performance is the concern.
 Further, a catalog can still be both a table and view provider if it
 chooses to based on this design, so I do not think we necessarily lose the
 ability of providing both. Looking forward to more discussions on this and
 making views a powerful tool in Spark.

 Thanks,
 Walaa.


 On Wed, May 26, 2021 at 9:54 AM John Zhuge  wrote:

> Looks like we are running in circles. Should we have an online meeting
> to get this sorted out?
>
> Thanks,
> John
>
> On Wed, May 26, 2021 at 12:01 AM Wenchen Fan 
> wrote:
>
>> OK, then I'd vote for TableViewCatalog, because
>> 1. This is how Hive catalog works, and we need to migrate Hive
>> catalog to the v2 API sooner or later.
>> 2. Because of 1, TableViewCatalog is easy to support in the current
>> table/view resolution framework.
>> 3. It's better to avoid name conflicts between table and views at the
>> API level, instead of relying on the catalog implementation.
>> 4. Cache invalidation is always a tricky problem.
>>
>> On Tue, May 25, 2021 at 3:09 AM Ryan Blue 
>> wrote:
>>
>>> I don't think that it makes sense to discuss a different approach in
>>> the PR rather than in the vote. Let's discuss this now since that's the
>>> purpose of an SPIP.
>>>
>>> On Mon, May 24, 2021 at 11:22 AM John Zhuge 
>>> wrote:
>>>
 Hi everyone, I’d like to start a vote for the ViewCatalog design
 proposal (SPIP).

 The proposal is to add a ViewCatalog interface that can be used to
 load, create, alter, and drop views in DataSourceV2.

 The full SPIP doc is here:
 https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing

 Please vote on the SPIP in the next 72 hours. Once it is approved,
 I’ll update the PR for review.

 [ ] +1: Accept the proposal as an official SPIP
 [ ] +0
 [ ] -1: I don’t think this is a good idea because …

>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
>
> --
> John Zhuge
>

>>>
>>> --
>>> John Zhuge
>>>
>> --
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: [VOTE] SPIP: Catalog API for view metadata

2022-02-03 Thread Erik Krogen
+1 (non-binding)

Really looking forward to having this natively supported by Spark, so that
we can get rid of our own hacks to tie in a custom view catalog
implementation. I appreciate the care John has put into various parts of
the design and believe this will provide a robust and flexible solution to
this problem faced by various large-scale Spark users.

Thanks John!

On Thu, Feb 3, 2022 at 11:22 AM Walaa Eldin Moustafa 
wrote:

> +1
>
> On Thu, Feb 3, 2022 at 11:19 AM John Zhuge  wrote:
>
>> Hi Spark community,
>>
>> I’d like to restart the vote for the ViewCatalog design proposal (SPIP
>> 
>> ).
>>
>> The proposal is to add a ViewCatalog interface that can be used to load,
>> create, alter, and drop views in DataSourceV2.
>>
>> Please vote on the SPIP in the next 72 hours. Once it is approved, I’ll
>> update the PR  for review.
>>
>> [ ] +1: Accept the proposal as an official SPIP
>> [ ] +0
>> [ ] -1: I don’t think this is a good idea because …
>>
>> Thanks!
>>
>> On Fri, Jun 4, 2021 at 1:46 PM Walaa Eldin Moustafa <
>> wa.moust...@gmail.com> wrote:
>>
>>> Considering the API aspect, the ViewCatalog API sounds like a good idea.
>>> A view catalog will enable us to integrate Coral
>>>  (our view SQL
>>> translation and management layer) very cleanly to Spark. Currently we can
>>> only do it by maintaining our special version of the HiveExternalCatalog.
>>> Considering that views can be expanded syntactically without necessarily
>>> invoking the analyzer, using a dedicated view API can make performance
>>> better if performance is the concern. Further, a catalog can still be both
>>> a table and view provider if it chooses to based on this design, so I do
>>> not think we necessarily lose the ability of providing both. Looking
>>> forward to more discussions on this and making views a powerful tool in
>>> Spark.
>>>
>>> Thanks,
>>> Walaa.
>>>
>>>
>>> On Wed, May 26, 2021 at 9:54 AM John Zhuge  wrote:
>>>
 Looks like we are running in circles. Should we have an online meeting
 to get this sorted out?

 Thanks,
 John

 On Wed, May 26, 2021 at 12:01 AM Wenchen Fan 
 wrote:

> OK, then I'd vote for TableViewCatalog, because
> 1. This is how Hive catalog works, and we need to migrate Hive catalog
> to the v2 API sooner or later.
> 2. Because of 1, TableViewCatalog is easy to support in the current
> table/view resolution framework.
> 3. It's better to avoid name conflicts between table and views at the
> API level, instead of relying on the catalog implementation.
> 4. Cache invalidation is always a tricky problem.
>
> On Tue, May 25, 2021 at 3:09 AM Ryan Blue 
> wrote:
>
>> I don't think that it makes sense to discuss a different approach in
>> the PR rather than in the vote. Let's discuss this now since that's the
>> purpose of an SPIP.
>>
>> On Mon, May 24, 2021 at 11:22 AM John Zhuge 
>> wrote:
>>
>>> Hi everyone, I’d like to start a vote for the ViewCatalog design
>>> proposal (SPIP).
>>>
>>> The proposal is to add a ViewCatalog interface that can be used to
>>> load, create, alter, and drop views in DataSourceV2.
>>>
>>> The full SPIP doc is here:
>>> https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing
>>>
>>> Please vote on the SPIP in the next 72 hours. Once it is approved,
>>> I’ll update the PR for review.
>>>
>>> [ ] +1: Accept the proposal as an official SPIP
>>> [ ] +0
>>> [ ] -1: I don’t think this is a good idea because …
>>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>

 --
 John Zhuge

>>>
>>
>> --
>> John Zhuge
>>
>
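Wenchen's third argument in the quoted thread — avoiding table/view name conflicts at the API level rather than in each catalog implementation — can be illustrated by a combined namespace that enforces the invariant once. This is purely a sketch with invented names, not either of the designs under discussion.

```python
class TableViewNamespace:
    """Sketch: a single namespace guaranteeing that a name identifies
    either a table or a view, never both, regardless of how the
    underlying catalogs are implemented."""

    def __init__(self):
        self._tables, self._views = set(), set()

    def create_table(self, name):
        if name in self._views:
            raise ValueError(f"'{name}' already exists as a view")
        self._tables.add(name)

    def create_view(self, name):
        if name in self._tables:
            raise ValueError(f"'{name}' already exists as a table")
        self._views.add(name)
```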


Re: [VOTE] SPIP: Catalog API for view metadata

2022-02-03 Thread Walaa Eldin Moustafa
+1

On Thu, Feb 3, 2022 at 11:19 AM John Zhuge  wrote:

> Hi Spark community,
>
> I’d like to restart the vote for the ViewCatalog design proposal (SPIP
> 
> ).
>
> The proposal is to add a ViewCatalog interface that can be used to load,
> create, alter, and drop views in DataSourceV2.
>
> Please vote on the SPIP in the next 72 hours. Once it is approved, I’ll
> update the PR  for review.
>
> [ ] +1: Accept the proposal as an official SPIP
> [ ] +0
> [ ] -1: I don’t think this is a good idea because …
>
> Thanks!
>
> On Fri, Jun 4, 2021 at 1:46 PM Walaa Eldin Moustafa 
> wrote:
>
>> Considering the API aspect, the ViewCatalog API sounds like a good idea.
>> A view catalog will enable us to integrate Coral
>>  (our view SQL
>> translation and management layer) very cleanly to Spark. Currently we can
>> only do it by maintaining our special version of the HiveExternalCatalog.
>> Considering that views can be expanded syntactically without necessarily
>> invoking the analyzer, using a dedicated view API can make performance
>> better if performance is the concern. Further, a catalog can still be both
>> a table and view provider if it chooses to based on this design, so I do
>> not think we necessarily lose the ability of providing both. Looking
>> forward to more discussions on this and making views a powerful tool in
>> Spark.
>>
>> Thanks,
>> Walaa.
>>
>>
>> On Wed, May 26, 2021 at 9:54 AM John Zhuge  wrote:
>>
>>> Looks like we are running in circles. Should we have an online meeting
>>> to get this sorted out?
>>>
>>> Thanks,
>>> John
>>>
>>> On Wed, May 26, 2021 at 12:01 AM Wenchen Fan 
>>> wrote:
>>>
 OK, then I'd vote for TableViewCatalog, because
 1. This is how Hive catalog works, and we need to migrate Hive catalog
 to the v2 API sooner or later.
 2. Because of 1, TableViewCatalog is easy to support in the current
 table/view resolution framework.
 3. It's better to avoid name conflicts between table and views at the
 API level, instead of relying on the catalog implementation.
 4. Cache invalidation is always a tricky problem.

 On Tue, May 25, 2021 at 3:09 AM Ryan Blue 
 wrote:

> I don't think that it makes sense to discuss a different approach in
> the PR rather than in the vote. Let's discuss this now since that's the
> purpose of an SPIP.
>
> On Mon, May 24, 2021 at 11:22 AM John Zhuge  wrote:
>
>> Hi everyone, I’d like to start a vote for the ViewCatalog design
>> proposal (SPIP).
>>
>> The proposal is to add a ViewCatalog interface that can be used to
>> load, create, alter, and drop views in DataSourceV2.
>>
>> The full SPIP doc is here:
>> https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing
>>
>> Please vote on the SPIP in the next 72 hours. Once it is approved,
>> I’ll update the PR for review.
>>
>> [ ] +1: Accept the proposal as an official SPIP
>> [ ] +0
>> [ ] -1: I don’t think this is a good idea because …
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>

>>>
>>> --
>>> John Zhuge
>>>
>>
>
> --
> John Zhuge
>


Re: [VOTE] SPIP: Catalog API for view metadata

2022-02-03 Thread John Zhuge
Hi Spark community,

I’d like to restart the vote for the ViewCatalog design proposal (SPIP

).

The proposal is to add a ViewCatalog interface that can be used to load,
create, alter, and drop views in DataSourceV2.

Please vote on the SPIP in the next 72 hours. Once it is approved, I’ll
update the PR  for review.

[ ] +1: Accept the proposal as an official SPIP
[ ] +0
[ ] -1: I don’t think this is a good idea because …

Thanks!

On Fri, Jun 4, 2021 at 1:46 PM Walaa Eldin Moustafa 
wrote:

> Considering the API aspect, the ViewCatalog API sounds like a good idea. A
> view catalog will enable us to integrate Coral
>  (our view SQL
> translation and management layer) very cleanly to Spark. Currently we can
> only do it by maintaining our special version of the HiveExternalCatalog.
> Considering that views can be expanded syntactically without necessarily
> invoking the analyzer, using a dedicated view API can make performance
> better if performance is the concern. Further, a catalog can still be both
> a table and view provider if it chooses to based on this design, so I do
> not think we necessarily lose the ability of providing both. Looking
> forward to more discussions on this and making views a powerful tool in
> Spark.
>
> Thanks,
> Walaa.
>
>
> On Wed, May 26, 2021 at 9:54 AM John Zhuge  wrote:
>
>> Looks like we are running in circles. Should we have an online meeting to
>> get this sorted out?
>>
>> Thanks,
>> John
>>
>> On Wed, May 26, 2021 at 12:01 AM Wenchen Fan  wrote:
>>
>>> OK, then I'd vote for TableViewCatalog, because
>>> 1. This is how Hive catalog works, and we need to migrate Hive catalog
>>> to the v2 API sooner or later.
>>> 2. Because of 1, TableViewCatalog is easy to support in the current
>>> table/view resolution framework.
>>> 3. It's better to avoid name conflicts between table and views at the
>>> API level, instead of relying on the catalog implementation.
>>> 4. Cache invalidation is always a tricky problem.
>>>
>>> On Tue, May 25, 2021 at 3:09 AM Ryan Blue 
>>> wrote:
>>>
 I don't think that it makes sense to discuss a different approach in
 the PR rather than in the vote. Let's discuss this now since that's the
 purpose of an SPIP.

 On Mon, May 24, 2021 at 11:22 AM John Zhuge  wrote:

> Hi everyone, I’d like to start a vote for the ViewCatalog design
> proposal (SPIP).
>
> The proposal is to add a ViewCatalog interface that can be used to
> load, create, alter, and drop views in DataSourceV2.
>
> The full SPIP doc is here:
> https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing
>
> Please vote on the SPIP in the next 72 hours. Once it is approved,
> I’ll update the PR for review.
>
> [ ] +1: Accept the proposal as an official SPIP
> [ ] +0
> [ ] -1: I don’t think this is a good idea because …
>


 --
 Ryan Blue
 Software Engineer
 Netflix

>>>
>>
>> --
>> John Zhuge
>>
>

-- 
John Zhuge
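Walaa's point quoted above — that views can be expanded syntactically without invoking the analyzer — can be illustrated with a toy textual substitution. This is a deliberately naive sketch (real expansion must handle quoting, aliases, and nested view references), and all names in it are invented for illustration.

```python
import re

def expand_views(query, views):
    """Naively replace each known view name in the query text with its
    definition as an inline subquery. A real implementation would work
    on parsed SQL, not raw strings."""
    for name, sql in views.items():
        # \b prevents 'events' from matching inside 'events_raw'
        query = re.sub(rf"\b{re.escape(name)}\b", f"({sql}) AS {name}", query)
    return query

views = {"recent_events": "SELECT * FROM events WHERE ts > now() - INTERVAL 1 DAY"}
expanded = expand_views("SELECT count(*) FROM recent_events", views)
```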


Signalling to a MicroBatchStream

2022-02-03 Thread Ross Lawley
Hi all,

I was hoping for some advice on creating a MicroBatchStream for MongoDB.

MongoDB has a tailable cursor that listens to changes in a collection
(known as a change stream). As a user watches a collection via the change
stream cursor, the cursor reports a resume token that determines the last
seen operation. I would like to use this resume token as my offset.

What I'm struggling with is how to signal this resume token offset from the
PartitionReader back to the MicroBatchStream. Even if there is no result
from the cursor, a new last-seen resume token is available.

When planning partitions I only know the start offset, the resume token
represents the last offset. Because I'm using a single change stream
cursor, I am only producing a single partition each time. However, I would
like to be able to continue the next partition from the last seen resume
token as seen by the PartitionReader.

Is this approach possible? If so how would I pass that information back to
the Spark driver / MicroBatchStream?

Ross
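For context on the question above: in the DataSourceV2 micro-batch model, latestOffset() is invoked on the driver before partitions are planned, so a PartitionReader running on an executor has no built-in channel to push an offset back. One common pattern is therefore to advance the offset on the driver itself (for example, by polling for the latest resume token there) and hand each partition a closed range. The sketch below is illustrative Python pseudologic — the real API is the Java MicroBatchStream interface — and every name in it is invented, not a statement of how the MongoDB connector actually works.

```python
class ResumeTokenStream:
    """Driver-side sketch: offsets are resume tokens obtained on the
    driver, so each planned partition covers a fixed range [start, end)
    and the executor-side reader never has to report back."""

    def __init__(self, get_latest_token, initial_token):
        self._get_latest_token = get_latest_token  # driver-side poll
        self._committed = initial_token

    def latest_offset(self):
        # Called on the driver once per micro-batch.
        return self._get_latest_token()

    def plan_partition(self, start, end):
        # The partition carries both endpoints, so the reader knows
        # exactly where to stop without signalling anything back.
        return {"start": start, "end": end}

    def commit(self, end):
        self._committed = end
```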