Re: [DISCUSS] Spark 3.0 and DataSourceV2

2019-02-28 Thread Ryan Blue
Thanks for the discussion, everyone. Since there aren't many objections to
the scope and we are aligned on what this commitment would mean, I've
started a vote thread for it.

rb

On Wed, Feb 27, 2019 at 5:32 PM Wenchen Fan  wrote:

> I'm good with the list from Ryan, thanks!
>
> On Thu, Feb 28, 2019 at 1:00 AM Ryan Blue  wrote:
>
>> I think that's a good plan. Let's get the functionality done, but mark it
>> experimental pending a new row API.
>>
>> So is there agreement on this set of work, then?
>>
>> On Tue, Feb 26, 2019 at 6:30 PM Matei Zaharia 
>> wrote:
>>
>>> To add to this, we can add a stable interface anytime if the original
>>> one was marked as unstable; we wouldn’t have to wait until 4.0. We had a
>>> lot of APIs that were experimental in 2.0 and then got stabilized in later
>>> 2.x releases for example.
>>>
>>> Matei
>>>
>>> > On Feb 26, 2019, at 5:12 PM, Reynold Xin  wrote:
>>> >
>>> > We will have to fix that before we declare DSv2 is stable, because
>>> InternalRow is not a stable API. We don’t necessarily need to do it in 3.0.
>>> >
>>> > On Tue, Feb 26, 2019 at 5:10 PM Matt Cheah 
>>> wrote:
>>> > Will that then require an API break down the line? Do we save that for
>>> Spark 4?
>>> >
>>> >
>>> >
>>> >
>>> > -Matt Cheah
>>> >
>>> >
>>> >
>>> > From: Ryan Blue 
>>> > Reply-To: "rb...@netflix.com" 
>>> > Date: Tuesday, February 26, 2019 at 4:53 PM
>>> > To: Matt Cheah 
>>> > Cc: Sean Owen , Wenchen Fan ,
>>> Xiao Li , Matei Zaharia ,
>>> Spark Dev List 
>>> > Subject: Re: [DISCUSS] Spark 3.0 and DataSourceV2
>>> >
>>> >
>>> >
>>> > That's a good question.
>>> >
>>> >
>>> >
>>> > While I'd love to have a solution for that, I don't think it is a good
>>> idea to delay DSv2 until we have one. That is going to require a lot of
>>> internal changes and I don't see how we could make the release date if we
>>> are including an InternalRow replacement.
>>> >
>>> >
>>> >
>>> > On Tue, Feb 26, 2019 at 4:41 PM Matt Cheah 
>>> wrote:
>>> >
>>> > Reynold made a note earlier about a proper Row API that isn’t
>>> InternalRow – is that still on the table?
>>> >
>>> >
>>> >
>>> > -Matt Cheah
>>> >
>>> >
>>> >
>>> > From: Ryan Blue 
>>> > Reply-To: "rb...@netflix.com" 
>>> > Date: Tuesday, February 26, 2019 at 4:40 PM
>>> > To: Matt Cheah 
>>> > Cc: Sean Owen , Wenchen Fan ,
>>> Xiao Li , Matei Zaharia ,
>>> Spark Dev List 
>>> > Subject: Re: [DISCUSS] Spark 3.0 and DataSourceV2
>>> >
>>> >
>>> >
>>> > Thanks for bumping this, Matt. I think we can have the discussion here
>>> to clarify exactly what we’re committing to and then have a vote thread
>>> once we’re agreed.
>>> > Getting back to the DSv2 discussion, I think we have a good handle on
>>> what would be added:
>>> > · Plugin system for catalogs
>>> >
>>> > · TableCatalog interface (I’ll start a vote thread for this
>>> SPIP shortly)
>>> >
>>> > · TableCatalog implementation backed by SessionCatalog that
>>> can load v2 tables
>>> >
>>> > · Resolution rule to load v2 tables using the new catalog
>>> >
>>> > · CTAS logical and physical plan nodes
>>> >
>>> > · Conversions from SQL parsed logical plans to v2 logical plans
>>> >
>>> > Initially, this will always use the v2 catalog backed by
>>> SessionCatalog to avoid dependence on the multi-catalog work. All of those
>>> are already implemented and working, so I think it is reasonable that we
>>> can get them in.
>>> > Then we can consider a few stretch goals:
>>> > · Get in as much DDL as we can. I think create and drop table
>>> should be easy.
>>> >
>>> > · Multi-catalog identifier parsing and multi-catalog support
>>> >
>>> > If we get those last two in, it would be great. We can make the call
>>> closer to release time. Does anyone want to change this set of work?

Re: [DISCUSS] Spark 3.0 and DataSourceV2

2019-02-27 Thread Wenchen Fan
I'm good with the list from Ryan, thanks!

On Thu, Feb 28, 2019 at 1:00 AM Ryan Blue  wrote:

> I think that's a good plan. Let's get the functionality done, but mark it
> experimental pending a new row API.
>
> So is there agreement on this set of work, then?
>
> On Tue, Feb 26, 2019 at 6:30 PM Matei Zaharia 
> wrote:
>
>> To add to this, we can add a stable interface anytime if the original one
>> was marked as unstable; we wouldn’t have to wait until 4.0. We had a lot of
>> APIs that were experimental in 2.0 and then got stabilized in later 2.x
>> releases for example.
>>
>> Matei
>>
>> > On Feb 26, 2019, at 5:12 PM, Reynold Xin  wrote:
>> >
>> > We will have to fix that before we declare DSv2 is stable, because
>> InternalRow is not a stable API. We don’t necessarily need to do it in 3.0.
>> >
>> > On Tue, Feb 26, 2019 at 5:10 PM Matt Cheah  wrote:
>> > Will that then require an API break down the line? Do we save that for
>> Spark 4?
>> >
>> >
>> >
>> >
>> > -Matt Cheah
>> >
>> >
>> >
>> > From: Ryan Blue 
>> > Reply-To: "rb...@netflix.com" 
>> > Date: Tuesday, February 26, 2019 at 4:53 PM
>> > To: Matt Cheah 
>> > Cc: Sean Owen , Wenchen Fan ,
>> Xiao Li , Matei Zaharia ,
>> Spark Dev List 
>> > Subject: Re: [DISCUSS] Spark 3.0 and DataSourceV2
>> >
>> >
>> >
>> > That's a good question.
>> >
>> >
>> >
>> > While I'd love to have a solution for that, I don't think it is a good
>> idea to delay DSv2 until we have one. That is going to require a lot of
>> internal changes and I don't see how we could make the release date if we
>> are including an InternalRow replacement.
>> >
>> >
>> >
>> > On Tue, Feb 26, 2019 at 4:41 PM Matt Cheah  wrote:
>> >
>> > Reynold made a note earlier about a proper Row API that isn’t
>> InternalRow – is that still on the table?
>> >
>> >
>> >
>> > -Matt Cheah
>> >
>> >
>> >
>> > From: Ryan Blue 
>> > Reply-To: "rb...@netflix.com" 
>> > Date: Tuesday, February 26, 2019 at 4:40 PM
>> > To: Matt Cheah 
>> > Cc: Sean Owen , Wenchen Fan ,
>> Xiao Li , Matei Zaharia ,
>> Spark Dev List 
>> > Subject: Re: [DISCUSS] Spark 3.0 and DataSourceV2
>> >
>> >
>> >
>> > Thanks for bumping this, Matt. I think we can have the discussion here
>> to clarify exactly what we’re committing to and then have a vote thread
>> once we’re agreed.
>> > Getting back to the DSv2 discussion, I think we have a good handle on
>> what would be added:
>> > · Plugin system for catalogs
>> >
>> > · TableCatalog interface (I’ll start a vote thread for this
>> SPIP shortly)
>> >
>> > · TableCatalog implementation backed by SessionCatalog that can
>> load v2 tables
>> >
>> > · Resolution rule to load v2 tables using the new catalog
>> >
>> > · CTAS logical and physical plan nodes
>> >
>> > · Conversions from SQL parsed logical plans to v2 logical plans
>> >
>> > Initially, this will always use the v2 catalog backed by SessionCatalog
>> to avoid dependence on the multi-catalog work. All of those are already
>> implemented and working, so I think it is reasonable that we can get them
>> in.
>> > Then we can consider a few stretch goals:
>> > · Get in as much DDL as we can. I think create and drop table
>> should be easy.
>> >
>> > · Multi-catalog identifier parsing and multi-catalog support
>> >
>> > If we get those last two in, it would be great. We can make the call
>> closer to release time. Does anyone want to change this set of work?
>> >
>> >
>> > On Tue, Feb 26, 2019 at 4:23 PM Matt Cheah  wrote:
>> >
>> > What would then be the next steps we'd take to collectively decide on
>> plans and timelines moving forward? Might I suggest scheduling a conference
>> call with appropriate PMCs to put our ideas together? Maybe such a
>> discussion can take place at next week's meeting? Or do we need to have a
>> separate formalized voting thread which is guided by a PMC?
>> >
>> > My suggestion is to try to make concrete steps forward and to avoid
>> letting this slip through the cracks.
>> >
>> > I also think there would be merits to having a project plan and
>> estimates around how long each of the features we want to complete is
>> going to take to implement and review.

Re: [DISCUSS] Spark 3.0 and DataSourceV2

2019-02-27 Thread Ryan Blue
I think that's a good plan. Let's get the functionality done, but mark it
experimental pending a new row API.

So is there agreement on this set of work, then?

On Tue, Feb 26, 2019 at 6:30 PM Matei Zaharia 
wrote:

> To add to this, we can add a stable interface anytime if the original one
> was marked as unstable; we wouldn’t have to wait until 4.0. We had a lot of
> APIs that were experimental in 2.0 and then got stabilized in later 2.x
> releases for example.
>
> Matei
>
> > On Feb 26, 2019, at 5:12 PM, Reynold Xin  wrote:
> >
> > We will have to fix that before we declare DSv2 is stable, because
> InternalRow is not a stable API. We don’t necessarily need to do it in 3.0.
> >
> > On Tue, Feb 26, 2019 at 5:10 PM Matt Cheah  wrote:
> > Will that then require an API break down the line? Do we save that for
> Spark 4?
> >
> >
> >
> >
> > -Matt Cheah
> >
> >
> >
> > From: Ryan Blue 
> > Reply-To: "rb...@netflix.com" 
> > Date: Tuesday, February 26, 2019 at 4:53 PM
> > To: Matt Cheah 
> > Cc: Sean Owen , Wenchen Fan ,
> Xiao Li , Matei Zaharia ,
> Spark Dev List 
> > Subject: Re: [DISCUSS] Spark 3.0 and DataSourceV2
> >
> >
> >
> > That's a good question.
> >
> >
> >
> > While I'd love to have a solution for that, I don't think it is a good
> idea to delay DSv2 until we have one. That is going to require a lot of
> internal changes and I don't see how we could make the release date if we
> are including an InternalRow replacement.
> >
> >
> >
> > On Tue, Feb 26, 2019 at 4:41 PM Matt Cheah  wrote:
> >
> > Reynold made a note earlier about a proper Row API that isn’t
> InternalRow – is that still on the table?
> >
> >
> >
> > -Matt Cheah
> >
> >
> >
> > From: Ryan Blue 
> > Reply-To: "rb...@netflix.com" 
> > Date: Tuesday, February 26, 2019 at 4:40 PM
> > To: Matt Cheah 
> > Cc: Sean Owen , Wenchen Fan ,
> Xiao Li , Matei Zaharia ,
> Spark Dev List 
> > Subject: Re: [DISCUSS] Spark 3.0 and DataSourceV2
> >
> >
> >
> > Thanks for bumping this, Matt. I think we can have the discussion here
> to clarify exactly what we’re committing to and then have a vote thread
> once we’re agreed.
> > Getting back to the DSv2 discussion, I think we have a good handle on
> what would be added:
> > · Plugin system for catalogs
> >
> > · TableCatalog interface (I’ll start a vote thread for this SPIP
> shortly)
> >
> > · TableCatalog implementation backed by SessionCatalog that can
> load v2 tables
> >
> > · Resolution rule to load v2 tables using the new catalog
> >
> > · CTAS logical and physical plan nodes
> >
> > · Conversions from SQL parsed logical plans to v2 logical plans
> >
> > Initially, this will always use the v2 catalog backed by SessionCatalog
> to avoid dependence on the multi-catalog work. All of those are already
> implemented and working, so I think it is reasonable that we can get them
> in.
> > Then we can consider a few stretch goals:
> > · Get in as much DDL as we can. I think create and drop table
> should be easy.
> >
> > · Multi-catalog identifier parsing and multi-catalog support
> >
> > If we get those last two in, it would be great. We can make the call
> closer to release time. Does anyone want to change this set of work?
> >
> >
> > On Tue, Feb 26, 2019 at 4:23 PM Matt Cheah  wrote:
> >
> > What would then be the next steps we'd take to collectively decide on
> plans and timelines moving forward? Might I suggest scheduling a conference
> call with appropriate PMCs to put our ideas together? Maybe such a
> discussion can take place at next week's meeting? Or do we need to have a
> separate formalized voting thread which is guided by a PMC?
> >
> > My suggestion is to try to make concrete steps forward and to avoid
> letting this slip through the cracks.
> >
> > I also think there would be merits to having a project plan and
> estimates around how long each of the features we want to complete is going
> to take to implement and review.
> >
> > -Matt Cheah
> >
> > On 2/24/19, 3:05 PM, "Sean Owen"  wrote:
> >
> > Sure, I don't read anyone making these statements though? Let's
> assume
> > good intent, that "foo should happen" as "my opinion as a member of
> > the community, which is not solely up to me, is that foo should
> > happen". I understand it's possible for a person to make their opinion
> over-weighted; this whole style of decision making assumes good actors
> and doesn't optimize against bad ones.

Re: [DISCUSS] Spark 3.0 and DataSourceV2

2019-02-26 Thread Matei Zaharia
To add to this, we can add a stable interface anytime if the original one was 
marked as unstable; we wouldn’t have to wait until 4.0. We had a lot of APIs 
that were experimental in 2.0 and then got stabilized in later 2.x releases for 
example.

Matei

> On Feb 26, 2019, at 5:12 PM, Reynold Xin  wrote:
> 
> We will have to fix that before we declare DSv2 is stable, because
> InternalRow is not a stable API. We don’t necessarily need to do it in 3.0. 
> 
> On Tue, Feb 26, 2019 at 5:10 PM Matt Cheah  wrote:
> Will that then require an API break down the line? Do we save that for Spark 
> 4?
> 
> 
>  
> 
> -Matt Cheah
> 
>  
> 
> From: Ryan Blue 
> Reply-To: "rb...@netflix.com" 
> Date: Tuesday, February 26, 2019 at 4:53 PM
> To: Matt Cheah 
> Cc: Sean Owen , Wenchen Fan , Xiao Li 
> , Matei Zaharia , Spark Dev 
> List 
> Subject: Re: [DISCUSS] Spark 3.0 and DataSourceV2
> 
>  
> 
> That's a good question.
> 
>  
> 
> While I'd love to have a solution for that, I don't think it is a good idea 
> to delay DSv2 until we have one. That is going to require a lot of internal 
> changes and I don't see how we could make the release date if we are 
> including an InternalRow replacement.
> 
>  
> 
> On Tue, Feb 26, 2019 at 4:41 PM Matt Cheah  wrote:
> 
> Reynold made a note earlier about a proper Row API that isn’t InternalRow – 
> is that still on the table?
> 
>  
> 
> -Matt Cheah
> 
>  
> 
> From: Ryan Blue 
> Reply-To: "rb...@netflix.com" 
> Date: Tuesday, February 26, 2019 at 4:40 PM
> To: Matt Cheah 
> Cc: Sean Owen , Wenchen Fan , Xiao Li 
> , Matei Zaharia , Spark Dev 
> List 
> Subject: Re: [DISCUSS] Spark 3.0 and DataSourceV2
> 
>  
> 
> Thanks for bumping this, Matt. I think we can have the discussion here to 
> clarify exactly what we’re committing to and then have a vote thread once 
> we’re agreed.
> Getting back to the DSv2 discussion, I think we have a good handle on what 
> would be added:
> · Plugin system for catalogs
> 
> · TableCatalog interface (I’ll start a vote thread for this SPIP 
> shortly)
> 
> · TableCatalog implementation backed by SessionCatalog that can load 
> v2 tables
> 
> · Resolution rule to load v2 tables using the new catalog
> 
> · CTAS logical and physical plan nodes
> 
> · Conversions from SQL parsed logical plans to v2 logical plans
> 
> Initially, this will always use the v2 catalog backed by SessionCatalog to 
> avoid dependence on the multi-catalog work. All of those are already 
> implemented and working, so I think it is reasonable that we can get them in.
> Then we can consider a few stretch goals:
> · Get in as much DDL as we can. I think create and drop table should 
> be easy.
> 
> · Multi-catalog identifier parsing and multi-catalog support
> 
> If we get those last two in, it would be great. We can make the call closer 
> to release time. Does anyone want to change this set of work?
>  
> 
> On Tue, Feb 26, 2019 at 4:23 PM Matt Cheah  wrote:
> 
> What would then be the next steps we'd take to collectively decide on plans 
> and timelines moving forward? Might I suggest scheduling a conference call 
> with appropriate PMCs to put our ideas together? Maybe such a discussion can 
> take place at next week's meeting? Or do we need to have a separate 
> formalized voting thread which is guided by a PMC?
> 
> My suggestion is to try to make concrete steps forward and to avoid letting 
> this slip through the cracks.
> 
> I also think there would be merits to having a project plan and estimates 
> around how long each of the features we want to complete is going to take to 
> implement and review.
> 
> -Matt Cheah
> 
> On 2/24/19, 3:05 PM, "Sean Owen"  wrote:
> 
> Sure, I don't read anyone making these statements though? Let's assume
> good intent, that "foo should happen" as "my opinion as a member of
> the community, which is not solely up to me, is that foo should
> happen". I understand it's possible for a person to make their opinion
> over-weighted; this whole style of decision making assumes good actors
> and doesn't optimize against bad ones. Not that it can't happen, just
> not seeing it here.
> 
> I have never seen any vote on a feature list, by a PMC or otherwise.
> We can do that if really needed I guess. But that also isn't the
> authoritative process in play here, in contrast.
> 
> If there's not a more specific subtext or issue here, which is fine to
> say (on private@ if it's sensitive or something), yes, let's move on
> in good faith.

Re: [DISCUSS] Spark 3.0 and DataSourceV2

2019-02-26 Thread Reynold Xin
We will have to fix that before we declare DSv2 is stable, because
InternalRow is not a stable API. We don’t necessarily need to do it in 3.0.

On Tue, Feb 26, 2019 at 5:10 PM Matt Cheah  wrote:

> Will that then require an API break down the line? Do we save that for
> Spark 4?
>
>
>
> -Matt Cheah
>
>
>
> *From: *Ryan Blue 
> *Reply-To: *"rb...@netflix.com" 
> *Date: *Tuesday, February 26, 2019 at 4:53 PM
> *To: *Matt Cheah 
> *Cc: *Sean Owen , Wenchen Fan ,
> Xiao Li , Matei Zaharia ,
> Spark Dev List 
> *Subject: *Re: [DISCUSS] Spark 3.0 and DataSourceV2
>
>
>
> That's a good question.
>
>
>
> While I'd love to have a solution for that, I don't think it is a good
> idea to delay DSv2 until we have one. That is going to require a lot of
> internal changes and I don't see how we could make the release date if we
> are including an InternalRow replacement.
>
>
>
> On Tue, Feb 26, 2019 at 4:41 PM Matt Cheah  wrote:
>
> Reynold made a note earlier about a proper Row API that isn’t InternalRow
> – is that still on the table?
>
>
>
> -Matt Cheah
>
>
>
> *From: *Ryan Blue 
> *Reply-To: *"rb...@netflix.com" 
> *Date: *Tuesday, February 26, 2019 at 4:40 PM
> *To: *Matt Cheah 
> *Cc: *Sean Owen , Wenchen Fan ,
> Xiao Li , Matei Zaharia ,
> Spark Dev List 
> *Subject: *Re: [DISCUSS] Spark 3.0 and DataSourceV2
>
>
>
> Thanks for bumping this, Matt. I think we can have the discussion here to
> clarify exactly what we’re committing to and then have a vote thread once
> we’re agreed.
>
> Getting back to the DSv2 discussion, I think we have a good handle on what
> would be added:
>
> · Plugin system for catalogs
>
> · TableCatalog interface (I’ll start a vote thread for this SPIP
> shortly)
>
> · TableCatalog implementation backed by SessionCatalog that can
> load v2 tables
>
> · Resolution rule to load v2 tables using the new catalog
>
> · CTAS logical and physical plan nodes
>
> · Conversions from SQL parsed logical plans to v2 logical plans
>
> Initially, this will always use the v2 catalog backed by SessionCatalog to
> avoid dependence on the multi-catalog work. All of those are already
> implemented and working, so I think it is reasonable that we can get them
> in.
>
> Then we can consider a few stretch goals:
>
> · Get in as much DDL as we can. I think create and drop table
> should be easy.
>
> · Multi-catalog identifier parsing and multi-catalog support
>
> If we get those last two in, it would be great. We can make the call
> closer to release time. Does anyone want to change this set of work?
>
>
>
> On Tue, Feb 26, 2019 at 4:23 PM Matt Cheah  wrote:
>
> What would then be the next steps we'd take to collectively decide on
> plans and timelines moving forward? Might I suggest scheduling a conference
> call with appropriate PMCs to put our ideas together? Maybe such a
> discussion can take place at next week's meeting? Or do we need to have a
> separate formalized voting thread which is guided by a PMC?
>
> My suggestion is to try to make concrete steps forward and to avoid
> letting this slip through the cracks.
>
> I also think there would be merits to having a project plan and estimates
> around how long each of the features we want to complete is going to take
> to implement and review.
>
> -Matt Cheah
>
> On 2/24/19, 3:05 PM, "Sean Owen"  wrote:
>
> Sure, I don't read anyone making these statements though? Let's assume
> good intent, that "foo should happen" as "my opinion as a member of
> the community, which is not solely up to me, is that foo should
> happen". I understand it's possible for a person to make their opinion
> over-weighted; this whole style of decision making assumes good actors
> and doesn't optimize against bad ones. Not that it can't happen, just
> not seeing it here.
>
> I have never seen any vote on a feature list, by a PMC or otherwise.
> We can do that if really needed I guess. But that also isn't the
> authoritative process in play here, in contrast.
>
> If there's not a more specific subtext or issue here, which is fine to
> say (on private@ if it's sensitive or something), yes, let's move on
> in good faith.
>
> On Sun, Feb 24, 2019 at 3:45 PM Mark Hamstra 
> wrote:
> > There is nothing wrong with individuals advocating for what they
> think should or should not be in Spark 3.0, nor should anyone shy away from
> explaining why they think delaying the release for some reason is or isn't
> a good idea. What is a problem, or is at least something that I have a
> problem with, are declarative, pseudo-authoritative statements that 3.0 (or
> some other release) will or won't contain some feature, API, etc. or that
> some issue is or is not blocker or worth delaying for.

Re: [DISCUSS] Spark 3.0 and DataSourceV2

2019-02-26 Thread Matt Cheah
Will that then require an API break down the line? Do we save that for Spark 4?

 

-Matt Cheah

 

From: Ryan Blue 
Reply-To: "rb...@netflix.com" 
Date: Tuesday, February 26, 2019 at 4:53 PM
To: Matt Cheah 
Cc: Sean Owen , Wenchen Fan , Xiao Li 
, Matei Zaharia , Spark Dev 
List 
Subject: Re: [DISCUSS] Spark 3.0 and DataSourceV2

 

That's a good question. 

 

While I'd love to have a solution for that, I don't think it is a good idea to 
delay DSv2 until we have one. That is going to require a lot of internal 
changes and I don't see how we could make the release date if we are including 
an InternalRow replacement.

 

On Tue, Feb 26, 2019 at 4:41 PM Matt Cheah  wrote:

Reynold made a note earlier about a proper Row API that isn’t InternalRow – is 
that still on the table?

 

-Matt Cheah

 

From: Ryan Blue 
Reply-To: "rb...@netflix.com" 
Date: Tuesday, February 26, 2019 at 4:40 PM
To: Matt Cheah 
Cc: Sean Owen , Wenchen Fan , Xiao Li 
, Matei Zaharia , Spark Dev 
List 
Subject: Re: [DISCUSS] Spark 3.0 and DataSourceV2

 

Thanks for bumping this, Matt. I think we can have the discussion here to 
clarify exactly what we’re committing to and then have a vote thread once we’re 
agreed.

Getting back to the DSv2 discussion, I think we have a good handle on what 
would be added:

· Plugin system for catalogs

· TableCatalog interface (I’ll start a vote thread for this SPIP 
shortly)

· TableCatalog implementation backed by SessionCatalog that can load v2 
tables

· Resolution rule to load v2 tables using the new catalog

· CTAS logical and physical plan nodes

· Conversions from SQL parsed logical plans to v2 logical plans

Initially, this will always use the v2 catalog backed by SessionCatalog to 
avoid dependence on the multi-catalog work. All of those are already 
implemented and working, so I think it is reasonable that we can get them in.

Then we can consider a few stretch goals:

· Get in as much DDL as we can. I think create and drop table should be 
easy.

· Multi-catalog identifier parsing and multi-catalog support

If we get those last two in, it would be great. We can make the call closer to 
release time. Does anyone want to change this set of work?

 

On Tue, Feb 26, 2019 at 4:23 PM Matt Cheah  wrote:

What would then be the next steps we'd take to collectively decide on plans and 
timelines moving forward? Might I suggest scheduling a conference call with 
appropriate PMCs to put our ideas together? Maybe such a discussion can take 
place at next week's meeting? Or do we need to have a separate formalized 
voting thread which is guided by a PMC?

My suggestion is to try to make concrete steps forward and to avoid letting 
this slip through the cracks.

I also think there would be merits to having a project plan and estimates 
around how long each of the features we want to complete is going to take to 
implement and review.

-Matt Cheah

On 2/24/19, 3:05 PM, "Sean Owen"  wrote:

Sure, I don't read anyone making these statements though? Let's assume
good intent, that "foo should happen" as "my opinion as a member of
the community, which is not solely up to me, is that foo should
happen". I understand it's possible for a person to make their opinion
over-weighted; this whole style of decision making assumes good actors
and doesn't optimize against bad ones. Not that it can't happen, just
not seeing it here.

I have never seen any vote on a feature list, by a PMC or otherwise.
We can do that if really needed I guess. But that also isn't the
authoritative process in play here, in contrast.

If there's not a more specific subtext or issue here, which is fine to
say (on private@ if it's sensitive or something), yes, let's move on
in good faith.

On Sun, Feb 24, 2019 at 3:45 PM Mark Hamstra  
wrote:
> There is nothing wrong with individuals advocating for what they think 
should or should not be in Spark 3.0, nor should anyone shy away from 
explaining why they think delaying the release for some reason is or isn't a 
good idea. What is a problem, or is at least something that I have a problem 
with, are declarative, pseudo-authoritative statements that 3.0 (or some other 
release) will or won't contain some feature, API, etc. or that some issue is or 
is not blocker or worth delaying for. When the PMC has not voted on such 
issues, I'm often left thinking, "Wait... what? Who decided that, or where did 
that decision come from?"


 

-- 

Ryan Blue 

Software Engineer

Netflix


 

-- 

Ryan Blue 

Software Engineer

Netflix





Re: [DISCUSS] Spark 3.0 and DataSourceV2

2019-02-26 Thread Ryan Blue
That's a good question.

While I'd love to have a solution for that, I don't think it is a good idea
to delay DSv2 until we have one. That is going to require a lot of internal
changes and I don't see how we could make the release date if we are
including an InternalRow replacement.
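
To make the concern concrete: rows cross the v2 API boundary in Spark's internal 
format today, which exposes unstable types like UTF8String to implementers. A 
rough illustration (this is not the v2 read API itself, just the row type it 
currently leans on):

    // Rough illustration of why InternalRow is awkward as a public, stable
    // contract: implementers must know Spark's internal encodings, e.g. that
    // strings are UTF8String rather than java.lang.String.
    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.unsafe.types.UTF8String

    val row: InternalRow = InternalRow(UTF8String.fromString("event-a"), 42L)
    val name: UTF8String = row.getUTF8String(0)   // internal string type
    val count: Long      = row.getLong(1)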

On Tue, Feb 26, 2019 at 4:41 PM Matt Cheah  wrote:

> Reynold made a note earlier about a proper Row API that isn’t InternalRow
> – is that still on the table?
>
>
>
> -Matt Cheah
>
>
>
> *From: *Ryan Blue 
> *Reply-To: *"rb...@netflix.com" 
> *Date: *Tuesday, February 26, 2019 at 4:40 PM
> *To: *Matt Cheah 
> *Cc: *Sean Owen , Wenchen Fan ,
> Xiao Li , Matei Zaharia ,
> Spark Dev List 
> *Subject: *Re: [DISCUSS] Spark 3.0 and DataSourceV2
>
>
>
> Thanks for bumping this, Matt. I think we can have the discussion here to
> clarify exactly what we’re committing to and then have a vote thread once
> we’re agreed.
>
> Getting back to the DSv2 discussion, I think we have a good handle on what
> would be added:
>
> · Plugin system for catalogs
>
> · TableCatalog interface (I’ll start a vote thread for this SPIP
> shortly)
>
> · TableCatalog implementation backed by SessionCatalog that can
> load v2 tables
>
> · Resolution rule to load v2 tables using the new catalog
>
> · CTAS logical and physical plan nodes
>
> · Conversions from SQL parsed logical plans to v2 logical plans
>
> Initially, this will always use the v2 catalog backed by SessionCatalog to
> avoid dependence on the multi-catalog work. All of those are already
> implemented and working, so I think it is reasonable that we can get them
> in.
>
> Then we can consider a few stretch goals:
>
> · Get in as much DDL as we can. I think create and drop table
> should be easy.
>
> · Multi-catalog identifier parsing and multi-catalog support
>
> If we get those last two in, it would be great. We can make the call
> closer to release time. Does anyone want to change this set of work?
>
>
>
> On Tue, Feb 26, 2019 at 4:23 PM Matt Cheah  wrote:
>
> What would then be the next steps we'd take to collectively decide on
> plans and timelines moving forward? Might I suggest scheduling a conference
> call with appropriate PMCs to put our ideas together? Maybe such a
> discussion can take place at next week's meeting? Or do we need to have a
> separate formalized voting thread which is guided by a PMC?
>
> My suggestion is to try to make concrete steps forward and to avoid
> letting this slip through the cracks.
>
> I also think there would be merits to having a project plan and estimates
> around how long each of the features we want to complete is going to take
> to implement and review.
>
> -Matt Cheah
>
> On 2/24/19, 3:05 PM, "Sean Owen"  wrote:
>
> Sure, I don't read anyone making these statements though? Let's assume
> good intent, that "foo should happen" as "my opinion as a member of
> the community, which is not solely up to me, is that foo should
> happen". I understand it's possible for a person to make their opinion
> over-weighted; this whole style of decision making assumes good actors
> and doesn't optimize against bad ones. Not that it can't happen, just
> not seeing it here.
>
> I have never seen any vote on a feature list, by a PMC or otherwise.
> We can do that if really needed I guess. But that also isn't the
> authoritative process in play here, in contrast.
>
> If there's not a more specific subtext or issue here, which is fine to
> say (on private@ if it's sensitive or something), yes, let's move on
> in good faith.
>
> On Sun, Feb 24, 2019 at 3:45 PM Mark Hamstra 
> wrote:
> > There is nothing wrong with individuals advocating for what they
> think should or should not be in Spark 3.0, nor should anyone shy away from
> explaining why they think delaying the release for some reason is or isn't
> a good idea. What is a problem, or is at least something that I have a
> problem with, are declarative, pseudo-authoritative statements that 3.0 (or
> some other release) will or won't contain some feature, API, etc. or that
> some issue is or is not blocker or worth delaying for. When the PMC has not
> voted on such issues, I'm often left thinking, "Wait... what? Who decided
> that, or where did that decision come from?"
>
>
>
>
> --
>
> Ryan Blue
>
> Software Engineer
>
> Netflix
>


-- 
Ryan Blue
Software Engineer
Netflix


Re: [DISCUSS] Spark 3.0 and DataSourceV2

2019-02-26 Thread Ryan Blue
Thanks for bumping this, Matt. I think we can have the discussion here to
clarify exactly what we’re committing to and then have a vote thread once
we’re agreed.

Getting back to the DSv2 discussion, I think we have a good handle on what
would be added:

   - Plugin system for catalogs
   - TableCatalog interface (I’ll start a vote thread for this SPIP shortly)
   - TableCatalog implementation backed by SessionCatalog that can load v2
   tables
   - Resolution rule to load v2 tables using the new catalog
   - CTAS logical and physical plan nodes
   - Conversions from SQL parsed logical plans to v2 logical plans

Initially, this will always use the v2 catalog backed by SessionCatalog to
avoid dependence on the multi-catalog work. All of those are already
implemented and working, so I think it is reasonable that we can get them
in.
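
To make the first two items a bit more concrete, here is a rough sketch of the
shape under discussion; the types and names are simplified for illustration and
the SPIP is the authoritative definition:

    // Sketch only: simplified shape of a pluggable catalog and a table catalog.
    // The real proposal uses richer Identifier/Table/TableChange types; nothing
    // here is the final API.
    import org.apache.spark.sql.types.StructType

    trait CatalogPlugin {
      def initialize(name: String, options: Map[String, String]): Unit
      def name(): String
    }

    trait TableCatalog extends CatalogPlugin {
      def loadTable(ident: Seq[String]): AnyRef   // a v2 Table handle in the real API
      def createTable(ident: Seq[String], schema: StructType,
                      properties: Map[String, String]): AnyRef
      def dropTable(ident: Seq[String]): Boolean
    }

    // Catalog implementations would be registered by configuration, roughly:
    //   spark.sql.catalog.testcat = com.example.MyTableCatalog
    //   spark.sql.catalog.testcat.some-option = some-value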

Then we can consider a few stretch goals:

   - Get in as much DDL as we can. I think create and drop table should be
   easy.
   - Multi-catalog identifier parsing and multi-catalog support

If we get those last two in, it would be great. We can make the call closer
to release time. Does anyone want to change this set of work?
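
On the multi-catalog stretch goal specifically, the parsing change mostly comes
down to deciding whether the leading part of a multi-part name is a registered
catalog. A toy sketch (the helper and the names in it are hypothetical, not the
actual parser change):

    // Toy sketch of multi-catalog identifier resolution; illustration only.
    def splitIdentifier(parts: Seq[String],
                        catalogs: Set[String]): (Option[String], Seq[String]) =
      parts match {
        case head +: rest if catalogs.contains(head) => (Some(head), rest) // catalog.db.table
        case _                                       => (None, parts)      // db.table in the session catalog
      }

    // splitIdentifier(Seq("prod_cat", "db", "events"), Set("prod_cat"))
    //   == (Some("prod_cat"), Seq("db", "events"))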

On Tue, Feb 26, 2019 at 4:23 PM Matt Cheah  wrote:

> What would then be the next steps we'd take to collectively decide on
> plans and timelines moving forward? Might I suggest scheduling a conference
> call with appropriate PMCs to put our ideas together? Maybe such a
> discussion can take place at next week's meeting? Or do we need to have a
> separate formalized voting thread which is guided by a PMC?
>
> My suggestion is to try to make concrete steps forward and to avoid
> letting this slip through the cracks.
>
> I also think there would be merits to having a project plan and estimates
> around how long each of the features we want to complete is going to take
> to implement and review.
>
> -Matt Cheah
>
> On 2/24/19, 3:05 PM, "Sean Owen"  wrote:
>
> Sure, I don't read anyone making these statements though? Let's assume
> good intent, that "foo should happen" as "my opinion as a member of
> the community, which is not solely up to me, is that foo should
> happen". I understand it's possible for a person to make their opinion
> over-weighted; this whole style of decision making assumes good actors
> and doesn't optimize against bad ones. Not that it can't happen, just
> not seeing it here.
>
> I have never seen any vote on a feature list, by a PMC or otherwise.
> We can do that if really needed I guess. But that also isn't the
> authoritative process in play here, in contrast.
>
> If there's not a more specific subtext or issue here, which is fine to
> say (on private@ if it's sensitive or something), yes, let's move on
> in good faith.
>
> On Sun, Feb 24, 2019 at 3:45 PM Mark Hamstra 
> wrote:
> > There is nothing wrong with individuals advocating for what they
> think should or should not be in Spark 3.0, nor should anyone shy away from
> explaining why they think delaying the release for some reason is or isn't
> a good idea. What is a problem, or is at least something that I have a
> problem with, are declarative, pseudo-authoritative statements that 3.0 (or
> some other release) will or won't contain some feature, API, etc. or that
> some issue is or is not blocker or worth delaying for. When the PMC has not
> voted on such issues, I'm often left thinking, "Wait... what? Who decided
> that, or where did that decision come from?"
>
>

-- 
Ryan Blue
Software Engineer
Netflix


Re: [DISCUSS] Spark 3.0 and DataSourceV2

2019-02-26 Thread Matt Cheah
Reynold made a note earlier about a proper Row API that isn’t InternalRow – is 
that still on the table?

 

-Matt Cheah

 

From: Ryan Blue 
Reply-To: "rb...@netflix.com" 
Date: Tuesday, February 26, 2019 at 4:40 PM
To: Matt Cheah 
Cc: Sean Owen , Wenchen Fan , Xiao Li 
, Matei Zaharia , Spark Dev 
List 
Subject: Re: [DISCUSS] Spark 3.0 and DataSourceV2

 

Thanks for bumping this, Matt. I think we can have the discussion here to 
clarify exactly what we’re committing to and then have a vote thread once we’re 
agreed.

Getting back to the DSv2 discussion, I think we have a good handle on what 
would be added:

· Plugin system for catalogs

· TableCatalog interface (I’ll start a vote thread for this SPIP 
shortly)

· TableCatalog implementation backed by SessionCatalog that can load v2 
tables

· Resolution rule to load v2 tables using the new catalog

· CTAS logical and physical plan nodes

· Conversions from SQL parsed logical plans to v2 logical plans

Initially, this will always use the v2 catalog backed by SessionCatalog to 
avoid dependence on the multi-catalog work. All of those are already 
implemented and working, so I think it is reasonable that we can get them in.

Then we can consider a few stretch goals:

· Get in as much DDL as we can. I think create and drop table should be 
easy.

· Multi-catalog identifier parsing and multi-catalog support

If we get those last two in, it would be great. We can make the call closer to 
release time. Does anyone want to change this set of work?

 

On Tue, Feb 26, 2019 at 4:23 PM Matt Cheah  wrote:

What would then be the next steps we'd take to collectively decide on plans and 
timelines moving forward? Might I suggest scheduling a conference call with 
appropriate PMCs to put our ideas together? Maybe such a discussion can take 
place at next week's meeting? Or do we need to have a separate formalized 
voting thread which is guided by a PMC?

My suggestion is to try to make concrete steps forward and to avoid letting 
this slip through the cracks.

I also think there would be merits to having a project plan and estimates 
around how long each of the features we want to complete is going to take to 
implement and review.

-Matt Cheah

On 2/24/19, 3:05 PM, "Sean Owen"  wrote:

Sure, I don't read anyone making these statements though? Let's assume
good intent, that "foo should happen" as "my opinion as a member of
the community, which is not solely up to me, is that foo should
happen". I understand it's possible for a person to make their opinion
over-weighted; this whole style of decision making assumes good actors
and doesn't optimize against bad ones. Not that it can't happen, just
not seeing it here.

I have never seen any vote on a feature list, by a PMC or otherwise.
We can do that if really needed I guess. But that also isn't the
authoritative process in play here, in contrast.

If there's not a more specific subtext or issue here, which is fine to
say (on private@ if it's sensitive or something), yes, let's move on
in good faith.

On Sun, Feb 24, 2019 at 3:45 PM Mark Hamstra  
wrote:
> There is nothing wrong with individuals advocating for what they think 
should or should not be in Spark 3.0, nor should anyone shy away from 
explaining why they think delaying the release for some reason is or isn't a 
good idea. What is a problem, or is at least something that I have a problem 
with, are declarative, pseudo-authoritative statements that 3.0 (or some other 
release) will or won't contain some feature, API, etc. or that some issue is or 
is not blocker or worth delaying for. When the PMC has not voted on such 
issues, I'm often left thinking, "Wait... what? Who decided that, or where did 
that decision come from?"


 

-- 

Ryan Blue 

Software Engineer

Netflix





Re: [DISCUSS] Spark 3.0 and DataSourceV2

2019-02-26 Thread Matt Cheah
What would then be the next steps we'd take to collectively decide on plans and 
timelines moving forward? Might I suggest scheduling a conference call with 
appropriate PMCs to put our ideas together? Maybe such a discussion can take 
place at next week's meeting? Or do we need to have a separate formalized 
voting thread which is guided by a PMC?

My suggestion is to try to make concrete steps forward and to avoid letting 
this slip through the cracks.

I also think there would be merits to having a project plan and estimates 
around how long each of the features we want to complete is going to take to 
implement and review.

-Matt Cheah

On 2/24/19, 3:05 PM, "Sean Owen"  wrote:

Sure, I don't read anyone making these statements though? Let's assume
good intent, that "foo should happen" as "my opinion as a member of
the community, which is not solely up to me, is that foo should
happen". I understand it's possible for a person to make their opinion
over-weighted; this whole style of decision making assumes good actors
and doesn't optimize against bad ones. Not that it can't happen, just
not seeing it here.

I have never seen any vote on a feature list, by a PMC or otherwise.
We can do that if really needed I guess. But that also isn't the
authoritative process in play here, in contrast.

If there's not a more specific subtext or issue here, which is fine to
say (on private@ if it's sensitive or something), yes, let's move on
in good faith.

On Sun, Feb 24, 2019 at 3:45 PM Mark Hamstra  
wrote:
> There is nothing wrong with individuals advocating for what they think 
should or should not be in Spark 3.0, nor should anyone shy away from 
explaining why they think delaying the release for some reason is or isn't a 
good idea. What is a problem, or is at least something that I have a problem 
with, are declarative, pseudo-authoritative statements that 3.0 (or some other 
release) will or won't contain some feature, API, etc. or that some issue is or 
is not blocker or worth delaying for. When the PMC has not voted on such 
issues, I'm often left thinking, "Wait... what? Who decided that, or where did 
that decision come from?"





Re: [DISCUSS] Spark 3.0 and DataSourceV2

2019-02-24 Thread Sean Owen
Sure, I don't read anyone making these statements though? Let's assume
good intent, that "foo should happen" as "my opinion as a member of
the community, which is not solely up to me, is that foo should
happen". I understand it's possible for a person to make their opinion
over-weighted; this whole style of decision making assumes good actors
and doesn't optimize against bad ones. Not that it can't happen, just
not seeing it here.

I have never seen any vote on a feature list, by a PMC or otherwise.
We can do that if really needed I guess. But that also isn't the
authoritative process in play here, in contrast.

If there's not a more specific subtext or issue here, which is fine to
say (on private@ if it's sensitive or something), yes, let's move on
in good faith.

On Sun, Feb 24, 2019 at 3:45 PM Mark Hamstra  wrote:
> There is nothing wrong with individuals advocating for what they think should 
> or should not be in Spark 3.0, nor should anyone shy away from explaining why 
> they think delaying the release for some reason is or isn't a good idea. What 
> is a problem, or is at least something that I have a problem with, are 
> declarative, pseudo-authoritative statements that 3.0 (or some other release) 
> will or won't contain some feature, API, etc. or that some issue is or is not 
> blocker or worth delaying for. When the PMC has not voted on such issues, I'm 
> often left thinking, "Wait... what? Who decided that, or where did that 
> decision come from?"




Re: [DISCUSS] Spark 3.0 and DataSourceV2

2019-02-24 Thread Mark Hamstra
>
> I’m not quite sure what you mean here.
>

I'll try to explain once more, then I'll drop it since continuing the rest
of the discussion in this thread is more important than getting
side-tracked.

There is nothing wrong with individuals advocating for what they think
should or should not be in Spark 3.0, nor should anyone shy away from
explaining why they think delaying the release for some reason is or isn't
a good idea. What is a problem, or is at least something that I have a
problem with, are declarative, pseudo-authoritative statements that 3.0 (or
some other release) will or won't contain some feature, API, etc. or that
some issue is or is not blocker or worth delaying for. When the PMC has not
voted on such issues, I'm often left thinking, "Wait... what? Who decided
that, or where did that decision come from?"

On Sun, Feb 24, 2019 at 1:27 PM Ryan Blue  wrote:

> Thanks to Matt for his philosophical take. I agree.
>
> The intent is to set a common goal, so that we work toward getting v2 in a
> usable state as a community. Part of that is making choices to get it done
> on time, which we have already seen on this thread: setting out more
> clearly what we mean by “DSv2” and what we think we can get done on time.
>
> I don’t mean to say that we should commit to a plan that *requires* a
> delay to the next release (which describes the goal better than 3.0 does).
> But we should commit to making sure the goal is met, acknowledging that
> this is one of the most important efforts for many people that work in this
> community.
>
> I think it would help to clarify what this commitment means, at least to
> me:
>
>1. What it means: the community will seriously consider delaying the
>next release if this isn’t done by our initial deadline.
>2. What it does not mean: delaying the release no matter what happens.
>
> In the event that this feature isn’t done on time, it would be up to the
> community to decide what to do. But in the mean time, I think it is healthy
> to set a goal and work toward it. (I am not making a distinction between
> PMC and community here.)
>
> I think this commitment is a good idea for the same reason why we set
> other goals: to hold ourselves accountable. When one sets a New Year's
> resolution to drop 10 pounds, it isn’t that the hope or intent wasn’t there
> before. It is about having a (self-imposed) constraint that helps you make
> hard choices: cake now or meet my goal?
>
> “Spark 3.0 has many other major features as well, delaying the release has
> significant cost and we should try our best to not let it happen.”
>
> I agree with Wenchen here. No one wants to actually delay the release. We
> just want to push ourselves to make some tough decisions, using that delay
> as a motivating factor.
>
> The fact that some entity other than the PMC thinks that Spark 3.0 should
> contain certain new features or that it will be costly to them if 3.0 does
> not contain those features is not dispositive.
>
> I’m not quite sure what you mean here. While I am representing my
> employer, I am bringing up this topic as a member of the community, to
> suggest a direction for the community to take, and I fully accept that the
> decision is up to the community. I think it is reasonable to candidly state
> how this matters; that context informs the discussion.
>
> On Fri, Feb 22, 2019 at 1:55 PM Mark Hamstra 
> wrote:
>
>> To your other message: I already see a number of PMC members here. Who's
>>> the other entity?
>>>
>>
>> I'll answer indirectly since pointing fingers isn't really my intent. In
>> the absence of a PMC vote, I react negatively to individuals making new
>> declarative policy statements or statements to the effect that Spark
>> 3.0 will (or will not) include these features..., or that it will be too
>> costly to do something. Maybe these are innocent shorthand that leave off a
>> clarifying "in my opinion" or "according to the current state of JIRA" or
>> some such.
>>
>> My points are simply that nobody other than the PMC has an authoritative
>> say on such matters, and if we are at a point where the community needs
>> some definitive guidance, then we need PMC involvement and a vote. That's
>> not intended to preclude or terminate community discussion, because that
>> is, indeed, lovely to see.
>>
>> On Fri, Feb 22, 2019 at 12:04 PM Sean Owen  wrote:
>>
>>> To your other message: I already see a number of PMC members here. Who's
>>> the other entity? The PMC is the thing that says a thing is a release,
>>> sure, but this discussion is properly a community one. And here we are,
>>> this is lovely to see.
>>>
>>> (May I remind everyone to casually, sometime, browse the large list of
>>> other JIRAs targeted for Spark 3? it's much more than DSv2!)
>>>
>>> I can't speak to specific decisions here, but, I see:
>>>
>>> Spark 3 doesn't have a release date. Notionally it's 6 months after
>>> Spark 2.4 (Nov 2018). It'd be reasonable to plan for a little more time.
>>> Can we throw out... June 2019, and I update the website?

Re: [DISCUSS] Spark 3.0 and DataSourceV2

2019-02-24 Thread Ryan Blue
Thanks to Matt for his philosophical take. I agree.

The intent is to set a common goal, so that we work toward getting v2 in a
usable state as a community. Part of that is making choices to get it done
on time, which we have already seen on this thread: setting out more
clearly what we mean by “DSv2” and what we think we can get done on time.

I don’t mean to say that we should commit to a plan that *requires* a delay
to the next release (which describes the goal better than 3.0 does). But we
should commit to making sure the goal is met, acknowledging that this is
one of the most important efforts for many people that work in this
community.

I think it would help to clarify what this commitment means, at least to me:

   1. What it means: the community will seriously consider delaying the
   next release if this isn’t done by our initial deadline.
   2. What it does not mean: delaying the release no matter what happens.

In the event that this feature isn’t done on time, it would be up to the
community to decide what to do. But in the mean time, I think it is healthy
to set a goal and work toward it. (I am not making a distinction between
PMC and community here.)

I think this commitment is a good idea for the same reason why we set other
goals: to hold ourselves accountable. When one sets a New Year's resolution
to drop 10 pounds, it isn’t that the hope or intent wasn’t there before. It
is about having a (self-imposed) constraint that helps you make hard
choices: cake now or meet my goal?

“Spark 3.0 has many other major features as well, delaying the release has
significant cost and we should try our best to not let it happen.”

I agree with Wenchen here. No one wants to actually delay the release. We
just want to push ourselves to make some tough decisions, using that delay
as a motivating factor.

The fact that some entity other than the PMC thinks that Spark 3.0 should
contain certain new features or that it will be costly to them if 3.0 does
not contain those features is not dispositive.

I’m not quite sure what you mean here. While I am representing my employer,
I am bringing up this topic as a member of the community, to suggest a
direction for the community to take, and I fully accept that the decision
is up to the community. I think it is reasonable to candidly state how this
matters; that context informs the discussion.

On Fri, Feb 22, 2019 at 1:55 PM Mark Hamstra 
wrote:

> To your other message: I already see a number of PMC members here. Who's
>> the other entity?
>>
>
> I'll answer indirectly since pointing fingers isn't really my intent. In
> the absence of a PMC vote, I react negatively to individuals making new
> declarative policy statements or statements to the effect that Spark
> 3.0 will (or will not) include these features..., or that it will be too
> costly to do something. Maybe these are innocent shorthand that leave off a
> clarifying "in my opinion" or "according to the current state of JIRA" or
> some such.
>
> My points are simply that nobody other than the PMC has an authoritative
> say on such matters, and if we are at a point where the community needs
> some definitive guidance, then we need PMC involvement and a vote. That's
> not intended to preclude or terminate community discussion, because that
> is, indeed, lovely to see.
>
> On Fri, Feb 22, 2019 at 12:04 PM Sean Owen  wrote:
>
>> To your other message: I already see a number of PMC members here. Who's
>> the other entity? The PMC is the thing that says a thing is a release,
>> sure, but this discussion is properly a community one. And here we are,
>> this is lovely to see.
>>
>> (May I remind everyone to casually, sometime, browse the large list of
>> other JIRAs targeted for Spark 3? it's much more than DSv2!)
>>
>> I can't speak to specific decisions here, but, I see:
>>
>> Spark 3 doesn't have a release date. Notionally it's 6 months after Spark
>> 2.4 (Nov 2018). It'd be reasonable to plan for a little more time. Can we
>> throw out... June 2019, and I update the website? It can slip but that
>> gives a concrete timeframe around which to plan. What can comfortably get
>> in by June 2019?
>>
>> Agreement that "DSv2" is going into Spark 3, for some definition of DSv2
>> that's probably roughly Matt's list.
>>
>> Changes that can't go into a minor release (API changes, etc) must by
>> definition go into Spark 3.0. Agree those first and do those now. Delay
>> Spark 3 until they're done and prioritize accordingly.
>> Changes that can go into a minor release can go into 3.1, if needed.
>> This has been in discussion long enough that I think whatever design(s)
>> are on the table for DSv2 now are as close as one is going to get. The
>> perfect is the enemy of the good.
>>
>> Aside from throwing out a date, I probably just restated what everyone
>> said. But I was 'summoned' :)
>>
>> On Fri, Feb 22, 2019 at 12:40 PM Mark Hamstra 
>> wrote:
>>
>>> However, as other people mentioned, Spark 3.0 has many other major
>>> features as well
 

Re: [DISCUSS] Spark 3.0 and DataSourceV2

2019-02-22 Thread Mark Hamstra
>
> To your other message: I already see a number of PMC members here. Who's
> the other entity?
>

I'll answer indirectly since pointing fingers isn't really my intent. In
the absence of a PMC vote, I react negatively to individuals making new
declarative policy statements or statements to the effect that Spark
3.0 will (or will not) include these features..., or that it will be too
costly to do something. Maybe these are innocent shorthand that leave off a
clarifying "in my opinion" or "according to the current state of JIRA" or
some such.

My points are simply that nobody other than the PMC has an authoritative
say on such matters, and if we are at a point where the community needs
some definitive guidance, then we need PMC involvement and a vote. That's
not intended to preclude or terminate community discussion, because that
is, indeed, lovely to see.

On Fri, Feb 22, 2019 at 12:04 PM Sean Owen  wrote:

> To your other message: I already see a number of PMC members here. Who's
> the other entity? The PMC is the thing that says a thing is a release,
> sure, but this discussion is properly a community one. And here we are,
> this is lovely to see.
>
> (May I remind everyone to casually, sometime, browse the large list of
> other JIRAs targeted for Spark 3? it's much more than DSv2!)
>
> I can't speak to specific decisions here, but, I see:
>
> Spark 3 doesn't have a release date. Notionally it's 6 months after Spark
> 2.4 (Nov 2018). It'd be reasonable to plan for a little more time. Can we
> throw out... June 2019, and I update the website? It can slip but that
> gives a concrete timeframe around which to plan. What can comfortably get
> in by June 2019?
>
> Agreement that "DSv2" is going into Spark 3, for some definition of DSv2
> that's probably roughly Matt's list.
>
> Changes that can't go into a minor release (API changes, etc) must by
> definition go into Spark 3.0. Agree those first and do those now. Delay
> Spark 3 until they're done and prioritize accordingly.
> Changes that can go into a minor release can go into 3.1, if needed.
> This has been in discussion long enough that I think whatever design(s)
> are on the table for DSv2 now are as close as one is going to get. The
> perfect is the enemy of the good.
>
> Aside from throwing out a date, I probably just restated what everyone
> said. But I was 'summoned' :)
>
> On Fri, Feb 22, 2019 at 12:40 PM Mark Hamstra 
> wrote:
>
>> However, as other people mentioned, Spark 3.0 has many other major
>>> features as well
>>>
>>
>> I fundamentally disagree. First, Spark 3.0 has nothing until the PMC says
>> it has something, and we have made no commitment along the lines that
>> "Spark 3.0.0 will not be released unless it contains new features x, y and
>> z." Second, major-version releases are not about adding new features.
>> Major-version releases are about making changes to the public API that we
>> cannot make in feature or bug-fix releases. If that is all that is
>> accomplished in a particular major release, that's fine -- in fact, we
>> quite intentionally did not target new features in the Spark 2.0.0 release.
>> The fact that some entity other than the PMC thinks that Spark 3.0 should
>> contain certain new features or that it will be costly to them if 3.0 does
>> not contain those features is not dispositive. If there are public API
>> changes that should occur in a timely fashion and there is also a list of
>> new features that some users or contributors want to see in 3.0 but that
>> look likely to not be ready in a timely fashion, then the PMC should fully
>> consider releasing 3.0 without all those new features. There is no reason
>> that they can't come in with 3.1.0.
>>
>


Re: [DISCUSS] Spark 3.0 and DataSourceV2

2019-02-22 Thread Sean Owen
To your other message: I already see a number of PMC members here. Who's
the other entity? The PMC is the thing that says a thing is a release,
sure, but this discussion is properly a community one. And here we are,
this is lovely to see.

(May I remind everyone to casually, sometime, browse the large list of
other JIRAs targeted for Spark 3? it's much more than DSv2!)

I can't speak to specific decisions here, but, I see:

Spark 3 doesn't have a release date. Notionally it's 6 months after Spark
2.4 (Nov 2018). It'd be reasonable to plan for a little more time. Can we
throw out... June 2019, and I update the website? It can slip but that
gives a concrete timeframe around which to plan. What can comfortably get
in by June 2019?

Agreement that "DSv2" is going into Spark 3, for some definition of DSv2
that's probably roughly Matt's list.

Changes that can't go into a minor release (API changes, etc) must by
definition go into Spark 3.0. Agree those first and do those now. Delay
Spark 3 until they're done and prioritize accordingly.
Changes that can go into a minor release can go into 3.1, if needed.
This has been in discussion long enough that I think whatever design(s) are
on the table for DSv2 now are as close as one is going to get. The perfect
is the enemy of the good.

Aside from throwing out a date, I probably just restated what everyone
said. But I was 'summoned' :)

On Fri, Feb 22, 2019 at 12:40 PM Mark Hamstra 
wrote:

> However, as other people mentioned, Spark 3.0 has many other major
>> features as well
>>
>
> I fundamentally disagree. First, Spark 3.0 has nothing until the PMC says
> it has something, and we have made no commitment along the lines that
> "Spark 3.0.0 will not be released unless it contains new features x, y and
> z." Second, major-version releases are not about adding new features.
> Major-version releases are about making changes to the public API that we
> cannot make in feature or bug-fix releases. If that is all that is
> accomplished in a particular major release, that's fine -- in fact, we
> quite intentionally did not target new features in the Spark 2.0.0 release.
> The fact that some entity other than the PMC thinks that Spark 3.0 should
> contain certain new features or that it will be costly to them if 3.0 does
> not contain those features is not dispositive. If there are public API
> changes that should occur in a timely fashion and there is also a list of
> new features that some users or contributors want to see in 3.0 but that
> look likely to not be ready in a timely fashion, then the PMC should fully
> consider releasing 3.0 without all those new features. There is no reason
> that they can't come in with 3.1.0.
>


Re: [DISCUSS] Spark 3.0 and DataSourceV2

2019-02-21 Thread Ryan Blue
In addition to logical plans, we need SQL support. That requires resolving
v2 tables from a catalog and a few other changes like separating v1 plans
from SQL parsing (see the earlier dev list thread). I’d also like to add
DDL operations for v2.

I think it also makes sense to add a new DF write API, as we discussed in
the sync. That way, users have an API they can start moving to that always
uses the v2 plans and behavior.
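
For concreteness, one possible shape for that API is the Scala sketch below;
the method names are illustrative assumptions, not a settled design:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("dsv2-write-sketch").getOrCreate()
    import spark.implicits._

    val df = spark.table("db.events")        // placeholder source table

    // Hypothetical fluent v2 write API; writeTo, partitionedBy, createOrReplace,
    // and append are illustrative names only.
    df.writeTo("testcat.db.events_copy")
      .partitionedBy($"level")
      .createOrReplace()                     // CTAS/RTAS-style behavior via v2 plans

    df.writeTo("testcat.db.events_copy")
      .append()                              // INSERT INTO behavior via v2 plans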

Here are all the commands that we have implemented on top of the proposed
table catalog API. We should be able to get these working in upstream Spark
fairly quickly.

   - CREATE TABLE [IF NOT EXISTS] …
   - CREATE TABLE … PARTITIONED BY …
   - CREATE TABLE … AS SELECT …
   - CREATE TABLE LIKE
   - ALTER TABLE …
  - ADD COLUMNS …
  - DROP COLUMNS …
  - ALTER COLUMN … TYPE
  - ALTER COLUMN … COMMENT
  - RENAME COLUMN … TO …
  - SET TBLPROPERTIES …
  - UNSET TBLPROPERTIES …
   - ALTER TABLE … RENAME TO …
   - DROP TABLE [IF EXISTS] …
   - DESCRIBE [FORMATTED|EXTENDED] …
   - SHOW CREATE TABLE …
   - SHOW TBLPROPERTIES
   - ALTER TABLE
   - REFRESH TABLE …
   - INSERT INTO …
   - INSERT OVERWRITE …
   - DELETE FROM … WHERE …
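
A few of these, issued through spark.sql, would look roughly like the sketch
below once the resolution rules route the identifiers to a v2 catalog. The
catalog, namespace, and table names are placeholders:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("dsv2-ddl-sketch").getOrCreate()

    // "testcat" stands in for a registered v2 catalog; table names are made up.
    spark.sql("CREATE TABLE testcat.db.events (id BIGINT, level STRING, message STRING) PARTITIONED BY (level)")
    spark.sql("ALTER TABLE testcat.db.events ADD COLUMNS (ts TIMESTAMP)")
    spark.sql("INSERT INTO testcat.db.events SELECT id, level, message, ts FROM staging_events")
    spark.sql("DROP TABLE IF EXISTS testcat.db.events")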


On Thu, Feb 21, 2019 at 3:57 PM Matt Cheah  wrote:

> To evaluate the amount of work required to get Data Source V2 into Spark
> 3.0, we should have a list of all the specific SPIPs and patches that are
> pending that would constitute a successful and usable revamp of that API.
> Here are the ones I could find and know off the top of my head:
>
>1. Table Catalog API: https://issues.apache.org/jira/browse/SPARK-24252
>   1. In my opinion this is by far the most important API to get in,
>   but it's also the API that most needs thorough thought and
>   evaluation.
>2. Remaining logical plans for CTAS, RTAS, DROP / DELETE, OVERWRITE:
>https://issues.apache.org/jira/browse/SPARK-24923 +
>https://issues.apache.org/jira/browse/SPARK-24253
>3. Catalogs for other entities, such as functions. Pluggable system
>for loading these.
>4. Multi-Catalog support -
>https://issues.apache.org/jira/browse/SPARK-25006
>5. Migration of existing sources to V2, particularly file sources like
>Parquet and ORC – requires #1 as discussed in yesterday’s meeting
>
>
>
> Can someone add to this list if we’re missing anything? It might also make
> sense to either assign a JIRA label or to update JIRA umbrella issues if
> any. Whatever mechanism works for being able to find all of these
> outstanding issues in one place.
>
>
>
> My understanding is that #1 is the most critical feature we need, and the
> feature that will go a long way towards allowing everything else to fall
> into place. #2 is also critical for external implementations of Data Source
> V2. I think we can afford to defer 3-5 to a future point release. But #1
> and #2 are also the features that have remained open for the longest time
> and we really need to move forward on these. Putting a target release for
> 3.0 will help in that regard.
>
>
>
> -Matt Cheah
>
>
>
> *From: *Ryan Blue 
> *Reply-To: *"rb...@netflix.com" 
> *Date: *Thursday, February 21, 2019 at 2:22 PM
> *To: *Matei Zaharia 
> *Cc: *Spark Dev List 
> *Subject: *Re: [DISCUSS] Spark 3.0 and DataSourceV2
>
>
>
> I'm all for making releases more often if we want. But this work could
> really use a target release to motivate getting it done. If we agree that
> it will block a release, then everyone is motivated to review and get the
> PRs in.
>
>
>
> If this work doesn't make it in the 3.0 release, I'm not confident that it
> will get done. Maybe we can have a release shortly after, but the timeline
> for these features -- that many of us need -- is nearly creeping into
> years. That's when alternatives start looking more likely to deliver. I'd
> rather see this work get in so we don't have to consider those
> alternatives, which is why I think this commitment is a good idea.
>
>
>
> I also would like to see multi-catalog support, but that is more
> reasonable to put off for a follow-up feature release, maybe 3.1.
>
>
>
> On Thu, Feb 21, 2019 at 1:45 PM Matei Zaharia 
> wrote:
>
> How large would the delay be? My 2 cents are that there’s nothing stopping
> us from making feature releases more often if we want to, so we shouldn’t
> see this as an “either delay 3.0 or release in >6 months” decision. If the
> work is likely to get in with a small delay and simplifies our work after
> 3.0 (e.g. we can get rid of older APIs), then the delay may be worth it.
> But if it would be a large delay, we should also weigh it against other
> things that are going to get delayed if 3.0 moves much later.
>
> It might also be better to propose a specific date to delay until, so
> people can still plan around when the release branch will likely be cut.

Re: [DISCUSS] Spark 3.0 and DataSourceV2

2019-02-21 Thread Matt Cheah
To evaluate the amount of work required to get Data Source V2 into Spark 3.0, 
we should have a list of all the specific SPIPs and patches that are pending 
that would constitute a successful and usable revamp of that API. Here are the 
ones I could find and know off the top of my head:
1. Table Catalog API: https://issues.apache.org/jira/browse/SPARK-24252
   1. In my opinion this is by far the most important API to get in, but it's also
   the API that most needs thorough thought and evaluation (a rough sketch of its
   shape appears below).
2. Remaining logical plans for CTAS, RTAS, DROP / DELETE, OVERWRITE:
   https://issues.apache.org/jira/browse/SPARK-24923 +
   https://issues.apache.org/jira/browse/SPARK-24253
3. Catalogs for other entities, such as functions. Pluggable system for loading
   these.
4. Multi-Catalog support - https://issues.apache.org/jira/browse/SPARK-25006
5. Migration of existing sources to V2, particularly file sources like Parquet and
   ORC – requires #1 as discussed in yesterday's meeting
 

Can someone add to this list if we’re missing anything? It might also make 
sense to either assign a JIRA label or to update JIRA umbrella issues if any.
Whatever mechanism works for being able to find all of these outstanding issues 
in one place.

 

My understanding is that #1 is the most critical feature we need, and the 
feature that will go a long way towards allowing everything else to fall into 
place. #2 is also critical for external implementations of Data Source V2. I 
think we can afford to defer 3-5 to a future point release. But #1 and #2 are 
also the features that have remained open for the longest time and we really 
need to move forward on these. Putting a target release for 3.0 will help in 
that regard.
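
For anyone who hasn't read SPARK-24252 yet, the rough shape of the table catalog
in #1 is something like the Scala sketch below. This is illustrative only, with
simplified types; it is not the exact interface proposed in the SPIP:

    import org.apache.spark.sql.types.StructType

    // Illustrative sketch of a pluggable table catalog. Names and signatures are
    // simplified approximations, not the exact SPARK-24252 interface.
    case class Identifier(namespace: Seq[String], name: String)

    trait Table {
      def name: String
      def schema: StructType
      def properties: Map[String, String]
    }

    trait TableCatalog {
      // Called once when the catalog is loaded by name from configuration.
      def initialize(name: String, options: Map[String, String]): Unit

      def listTables(namespace: Seq[String]): Seq[Identifier]
      def loadTable(ident: Identifier): Table
      def createTable(
          ident: Identifier,
          schema: StructType,
          partitions: Seq[String],            // simplified; the proposal uses partition transforms
          properties: Map[String, String]): Table
      def dropTable(ident: Identifier): Boolean
    }

A pluggable loading mechanism (#3/#4) would then instantiate an implementation by
name from configuration, e.g. something like a spark.sql.catalog.<name> property
pointing at the implementation class; the exact key is part of that discussion.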

 

-Matt Cheah

 

From: Ryan Blue 
Reply-To: "rb...@netflix.com" 
Date: Thursday, February 21, 2019 at 2:22 PM
To: Matei Zaharia 
Cc: Spark Dev List 
Subject: Re: [DISCUSS] Spark 3.0 and DataSourceV2

 

I'm all for making releases more often if we want. But this work could really 
use a target release to motivate getting it done. If we agree that it will 
block a release, then everyone is motivated to review and get the PRs in. 

 

If this work doesn't make it in the 3.0 release, I'm not confident that it will 
get done. Maybe we can have a release shortly after, but the timeline for these 
features -- that many of us need -- is nearly creeping into years. That's when 
alternatives start looking more likely to deliver. I'd rather see this work get 
in so we don't have to consider those alternatives, which is why I think this 
commitment is a good idea.

 

I also would like to see multi-catalog support, but that is more reasonable to 
put off for a follow-up feature release, maybe 3.1.

 

On Thu, Feb 21, 2019 at 1:45 PM Matei Zaharia  wrote:

How large would the delay be? My 2 cents are that there’s nothing stopping us 
from making feature releases more often if we want to, so we shouldn’t see this 
as an “either delay 3.0 or release in >6 months” decision. If the work is 
likely to get in with a small delay and simplifies our work after 3.0 (e.g. we 
can get rid of older APIs), then the delay may be worth it. But if it would be 
a large delay, we should also weigh it against other things that are going to 
get delayed if 3.0 moves much later.

It might also be better to propose a specific date to delay until, so people 
can still plan around when the release branch will likely be cut.

Matei

> On Feb 21, 2019, at 1:03 PM, Ryan Blue  wrote:
> 
> Hi everyone,
> 
> In the DSv2 sync last night, we had a discussion about roadmap and what the 
> goal should be for getting the main features into Spark. We all agreed that 
> 3.0 should be that goal, even if it means delaying the 3.0 release.
> 
> The possibility of delaying the 3.0 release may be controversial, so I want 
> to bring it up to the dev list to build consensus around it. The rationale 
> for this is partly that much of this work has been outstanding for more than 
> a year now. If it doesn't make it into 3.0, then it would be another 6 months 
> before it would be in a release, and would be nearing 2 years to get the work 
> done.
> 
> Are there any objections to targeting 3.0 for this?
> 
> In addition, much of the planning for multi-catalog support has been done to 
> make v2 possible. Do we also want to include multi-catalog support?
> 
> 
> rb
> 
> -- 
> Ryan Blue
> Software Engineer
> Netflix


 

-- 

Ryan Blue 

Software Engineer

Netflix





Re: [DISCUSS] Spark 3.0 and DataSourceV2

2019-02-21 Thread Ryan Blue
I'm all for making releases more often if we want. But this work could
really use a target release to motivate getting it done. If we agree that
it will block a release, then everyone is motivated to review and get the
PRs in.

If this work doesn't make it in the 3.0 release, I'm not confident that it
will get done. Maybe we can have a release shortly after, but the timeline
for these features -- that many of us need -- is nearly creeping into
years. That's when alternatives start looking more likely to deliver. I'd
rather see this work get in so we don't have to consider those
alternatives, which is why I think this commitment is a good idea.

I also would like to see multi-catalog support, but that is more reasonable
to put off for a follow-up feature release, maybe 3.1.
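
For context, the sort of usage multi-catalog support would enable looks roughly
like the sketch below; the config keys and implementation class names are
illustrative assumptions related to the SPARK-25006 work, not a finished design:

    import org.apache.spark.sql.SparkSession

    // Illustrative only: two catalogs registered by name and addressed through
    // multi-part identifiers. Config keys and classes are assumptions.
    val spark = SparkSession.builder()
      .config("spark.sql.catalog.prod", "com.example.ProdCatalog")
      .config("spark.sql.catalog.test", "com.example.TestCatalog")
      .getOrCreate()

    spark.sql(
      "INSERT INTO test.db.events SELECT * FROM prod.db.events WHERE ds = '2019-02-21'")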

On Thu, Feb 21, 2019 at 1:45 PM Matei Zaharia 
wrote:

> How large would the delay be? My 2 cents are that there’s nothing stopping
> us from making feature releases more often if we want to, so we shouldn’t
> see this as an “either delay 3.0 or release in >6 months” decision. If the
> work is likely to get in with a small delay and simplifies our work after
> 3.0 (e.g. we can get rid of older APIs), then the delay may be worth it.
> But if it would be a large delay, we should also weigh it against other
> things that are going to get delayed if 3.0 moves much later.
>
> It might also be better to propose a specific date to delay until, so
> people can still plan around when the release branch will likely be cut.
>
> Matei
>
> > On Feb 21, 2019, at 1:03 PM, Ryan Blue 
> wrote:
> >
> > Hi everyone,
> >
> > In the DSv2 sync last night, we had a discussion about roadmap and what
> the goal should be for getting the main features into Spark. We all agreed
> that 3.0 should be that goal, even if it means delaying the 3.0 release.
> >
> > The possibility of delaying the 3.0 release may be controversial, so I
> want to bring it up to the dev list to build consensus around it. The
> rationale for this is partly that much of this work has been outstanding
> for more than a year now. If it doesn't make it into 3.0, then it would be
> another 6 months before it would be in a release, and would be nearing 2
> years to get the work done.
> >
> > Are there any objections to targeting 3.0 for this?
> >
> > In addition, much of the planning for multi-catalog support has been
> done to make v2 possible. Do we also want to include multi-catalog support?
> >
> >
> > rb
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
>
>

-- 
Ryan Blue
Software Engineer
Netflix


Re: [DISCUSS] Spark 3.0 and DataSourceV2

2019-02-21 Thread Matei Zaharia
How large would the delay be? My 2 cents are that there’s nothing stopping us 
from making feature releases more often if we want to, so we shouldn’t see this 
as an “either delay 3.0 or release in >6 months” decision. If the work is 
likely to get in with a small delay and simplifies our work after 3.0 (e.g. we 
can get rid of older APIs), then the delay may be worth it. But if it would be 
a large delay, we should also weigh it against other things that are going to 
get delayed if 3.0 moves much later.

It might also be better to propose a specific date to delay until, so people 
can still plan around when the release branch will likely be cut.

Matei

> On Feb 21, 2019, at 1:03 PM, Ryan Blue  wrote:
> 
> Hi everyone,
> 
> In the DSv2 sync last night, we had a discussion about roadmap and what the 
> goal should be for getting the main features into Spark. We all agreed that 
> 3.0 should be that goal, even if it means delaying the 3.0 release.
> 
> The possibility of delaying the 3.0 release may be controversial, so I want 
> to bring it up to the dev list to build consensus around it. The rationale 
> for this is partly that much of this work has been outstanding for more than 
> a year now. If it doesn't make it into 3.0, then it would be another 6 months 
> before it would be in a release, and would be nearing 2 years to get the work 
> done.
> 
> Are there any objections to targeting 3.0 for this?
> 
> In addition, much of the planning for multi-catalog support has been done to 
> make v2 possible. Do we also want to include multi-catalog support?
> 
> 
> rb
> 
> -- 
> Ryan Blue
> Software Engineer
> Netflix





[DISCUSS] Spark 3.0 and DataSourceV2

2019-02-21 Thread Ryan Blue
Hi everyone,

In the DSv2 sync last night, we had a discussion about roadmap and what the
goal should be for getting the main features into Spark. We all agreed that
3.0 should be that goal, even if it means delaying the 3.0 release.

The possibility of delaying the 3.0 release may be controversial, so I want
to bring it up to the dev list to build consensus around it. The rationale
for this is partly that much of this work has been outstanding for more
than a year now. If it doesn't make it into 3.0, then it would be another 6
months before it would be in a release, and would be nearing 2 years to get
the work done.

Are there any objections to targeting 3.0 for this?

In addition, much of the planning for multi-catalog support has been done
to make v2 possible. Do we also want to include multi-catalog support?


rb

-- 
Ryan Blue
Software Engineer
Netflix