Thanks for the discussion, everyone. Since there aren't many objections to the scope and we are aligned on what this commitment would mean, I've started a vote thread for it.
rb

On Wed, Feb 27, 2019 at 5:32 PM Wenchen Fan <cloud0...@gmail.com> wrote:

I'm good with the list from Ryan, thanks!

On Thu, Feb 28, 2019 at 1:00 AM Ryan Blue <rb...@netflix.com> wrote:

I think that's a good plan. Let's get the functionality done, but mark it experimental pending a new row API.

So is there agreement on this set of work, then?

On Tue, Feb 26, 2019 at 6:30 PM Matei Zaharia <matei.zaha...@gmail.com> wrote:

To add to this, we can add a stable interface at any time if the original one was marked as unstable; we wouldn't have to wait until 4.0. For example, we had a lot of APIs that were experimental in 2.0 and then got stabilized in later 2.x releases.

Matei

On Feb 26, 2019, at 5:12 PM, Reynold Xin <r...@databricks.com> wrote:

We will have to fix that before we declare DSv2 stable, because InternalRow is not a stable API. We don't necessarily need to do it in 3.0.

On Tue, Feb 26, 2019 at 5:10 PM Matt Cheah <mch...@palantir.com> wrote:

Will that then require an API break down the line? Do we save that for Spark 4?

-Matt Cheah

From: Ryan Blue <rb...@netflix.com>
Reply-To: "rb...@netflix.com" <rb...@netflix.com>
Date: Tuesday, February 26, 2019 at 4:53 PM
To: Matt Cheah <mch...@palantir.com>
Cc: Sean Owen <sro...@apache.org>, Wenchen Fan <cloud0...@gmail.com>, Xiao Li <lix...@databricks.com>, Matei Zaharia <matei.zaha...@gmail.com>, Spark Dev List <dev@spark.apache.org>
Subject: Re: [DISCUSS] Spark 3.0 and DataSourceV2

That's a good question.

While I'd love to have a solution for that, I don't think it is a good idea to delay DSv2 until we have one. That is going to require a lot of internal changes, and I don't see how we could make the release date if we are including an InternalRow replacement.
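As an aside for readers of the archive: the "mark it experimental" pattern discussed above is usually done in Spark with stability annotations. The following is a toy illustration of that pattern only; the annotation name, attribute, and interface below are hypothetical, not Spark's actual annotations or the proposed row API.

```java
import java.lang.annotation.Documented;
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Hypothetical stability marker, in the spirit of Spark's
// experimental/evolving annotations. Not the real annotation.
@Documented
@Retention(RetentionPolicy.RUNTIME)
@Target({ElementType.TYPE, ElementType.METHOD})
@interface Unstable {
    // Release in which the API first shipped (illustrative attribute).
    String since() default "";
}

// An interface shipped early but subject to change until it is
// stabilized in a later release, as Matei describes for the 2.x line.
@Unstable(since = "3.0.0")
interface RowReader {
    Object get(int ordinal);
}
```

The point of the marker is that a later release can remove the annotation (stabilizing the API) without an API break, whereas changing an already-stable interface would have to wait for a major version.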
On Tue, Feb 26, 2019 at 4:41 PM Matt Cheah <mch...@palantir.com> wrote:

Reynold made a note earlier about a proper Row API that isn't InternalRow – is that still on the table?

-Matt Cheah

From: Ryan Blue <rb...@netflix.com>
Reply-To: "rb...@netflix.com" <rb...@netflix.com>
Date: Tuesday, February 26, 2019 at 4:40 PM
To: Matt Cheah <mch...@palantir.com>
Cc: Sean Owen <sro...@apache.org>, Wenchen Fan <cloud0...@gmail.com>, Xiao Li <lix...@databricks.com>, Matei Zaharia <matei.zaha...@gmail.com>, Spark Dev List <dev@spark.apache.org>
Subject: Re: [DISCUSS] Spark 3.0 and DataSourceV2

Thanks for bumping this, Matt. I think we can have the discussion here to clarify exactly what we're committing to, and then have a vote thread once we're agreed.

Getting back to the DSv2 discussion, I think we have a good handle on what would be added:

- Plugin system for catalogs
- TableCatalog interface (I'll start a vote thread for this SPIP shortly)
- TableCatalog implementation backed by SessionCatalog that can load v2 tables
- Resolution rule to load v2 tables using the new catalog
- CTAS logical and physical plan nodes
- Conversions from SQL parsed logical plans to v2 logical plans

Initially, this will always use the v2 catalog backed by SessionCatalog to avoid dependence on the multi-catalog work. All of those are already implemented and working, so I think it is reasonable that we can get them in.

Then we can consider a few stretch goals:

- Get in as much DDL as we can. I think create and drop table should be easy.
- Multi-catalog identifier parsing and multi-catalog support

If we get those last two in, it would be great. We can make the call closer to release time.
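For readers skimming the archive, the catalog items in the list above could be sketched roughly as follows. This is a simplified, self-contained illustration with hypothetical names (`Table`, `TableCatalog`, `InMemoryTableCatalog`), not the actual DSv2 interfaces being voted on; the real design is in the TableCatalog SPIP.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// A minimal table handle the catalog hands back (illustrative only).
interface Table {
    String name();
}

// The shape of a pluggable catalog: look up, create, and drop tables
// by identifier. The real DSv2 TableCatalog carries schemas,
// partitioning, and properties; those are omitted here.
interface TableCatalog {
    Optional<Table> loadTable(String ident);
    Table createTable(String ident);
    boolean dropTable(String ident);
}

// In-memory stand-in for the implementation backed by SessionCatalog.
class InMemoryTableCatalog implements TableCatalog {
    private final Map<String, Table> tables = new HashMap<>();

    @Override
    public Optional<Table> loadTable(String ident) {
        return Optional.ofNullable(tables.get(ident));
    }

    @Override
    public Table createTable(String ident) {
        Table t = () -> ident; // Table has one method, so a lambda works
        tables.put(ident, t);
        return t;
    }

    @Override
    public boolean dropTable(String ident) {
        return tables.remove(ident) != null;
    }
}
```

A resolution rule in the analyzer would then call something like `loadTable` when it sees a table identifier, which is why the plugin system and the SessionCatalog-backed implementation come first in the list.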
Does anyone want to change this set of work?

On Tue, Feb 26, 2019 at 4:23 PM Matt Cheah <mch...@palantir.com> wrote:

What would then be the next steps we'd take to collectively decide on plans and timelines moving forward? Might I suggest scheduling a conference call with the appropriate PMC members to put our ideas together? Maybe such a discussion can take place at next week's meeting? Or do we need a separate, formalized voting thread guided by the PMC?

My suggestion is to try to make concrete steps forward and to avoid letting this slip through the cracks.

I also think there would be merit in having a project plan with estimates of how long each of the features we want to complete will take to implement and review.

-Matt Cheah

On 2/24/19, 3:05 PM, "Sean Owen" <sro...@apache.org> wrote:

Sure, but I don't read anyone as making those statements. Let's assume good intent and read "foo should happen" as "my opinion as a member of the community, which is not solely up to me, is that foo should happen". I understand it's possible for a person to make their opinion over-weighted; this whole style of decision making assumes good actors and doesn't optimize against bad ones. Not that it can't happen; I'm just not seeing it here.

I have never seen any vote on a feature list, by a PMC or otherwise. We can do that if really needed, I guess. But that also isn't the authoritative process in play here, in contrast.

If there's not a more specific subtext or issue here, which is fine to raise (on private@ if it's sensitive), then yes, let's move on in good faith.
On Sun, Feb 24, 2019 at 3:45 PM Mark Hamstra <m...@clearstorydata.com> wrote:

There is nothing wrong with individuals advocating for what they think should or should not be in Spark 3.0, nor should anyone shy away from explaining why they think delaying the release for some reason is or isn't a good idea. What is a problem, or at least something that I have a problem with, are declarative, pseudo-authoritative statements that 3.0 (or some other release) will or won't contain some feature, API, etc., or that some issue is or is not a blocker or worth delaying for. When the PMC has not voted on such issues, I'm often left thinking, "Wait... what? Who decided that, or where did that decision come from?"

--
Ryan Blue
Software Engineer
Netflix