Thanks for the discussion, everyone. Since there aren't many objections to the scope and we are aligned on what this commitment would mean, I've started a vote thread for it.
rb

On Wed, Feb 27, 2019 at 5:32 PM Wenchen Fan <cloud0...@gmail.com> wrote:

I'm good with the list from Ryan, thanks!

On Thu, Feb 28, 2019 at 1:00 AM Ryan Blue <rb...@netflix.com> wrote:

I think that's a good plan. Let's get the functionality done, but mark it experimental pending a new row API.

So is there agreement on this set of work, then?

On Tue, Feb 26, 2019 at 6:30 PM Matei Zaharia <matei.zaha...@gmail.com> wrote:

To add to this, we can add a stable interface at any time if the original one was marked as unstable; we wouldn't have to wait until 4.0. For example, we had a lot of APIs that were experimental in 2.0 and then got stabilized in later 2.x releases.

Matei

On Feb 26, 2019, at 5:12 PM, Reynold Xin <r...@databricks.com> wrote:

We will have to fix that before we declare DSv2 stable, because InternalRow is not a stable API. We don't necessarily need to do it in 3.0.

On Tue, Feb 26, 2019 at 5:10 PM Matt Cheah <mch...@palantir.com> wrote:

Will that then require an API break down the line? Do we save that for Spark 4?

-Matt Cheah

From: Ryan Blue <rb...@netflix.com>
Reply-To: "rb...@netflix.com" <rb...@netflix.com>
Date: Tuesday, February 26, 2019 at 4:53 PM
To: Matt Cheah <mch...@palantir.com>
Cc: Sean Owen <sro...@apache.org>, Wenchen Fan <cloud0...@gmail.com>, Xiao Li <lix...@databricks.com>, Matei Zaharia <matei.zaha...@gmail.com>, Spark Dev List <dev@spark.apache.org>
Subject: Re: [DISCUSS] Spark 3.0 and DataSourceV2

That's a good question.

While I'd love to have a solution for that, I don't think it is a good idea to delay DSv2 until we have one. That is going to require a lot of internal changes, and I don't see how we could make the release date if we are including an InternalRow replacement.
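As an aside for readers of the archive: the "mark it experimental" pattern discussed above is usually done in Spark with stability annotations. The following is a toy illustration of that pattern only; the annotation name, attribute, and interface below are hypothetical, not Spark's actual annotations or the proposed row API.

```java
import java.lang.annotation.Documented;
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Hypothetical stability marker, in the spirit of Spark's
// experimental/evolving annotations. Not the real annotation.
@Documented
@Retention(RetentionPolicy.RUNTIME)
@Target({ElementType.TYPE, ElementType.METHOD})
@interface Unstable {
    // Release in which the API first shipped (illustrative attribute).
    String since() default "";
}

// An interface shipped early but subject to change until it is
// stabilized in a later release, as Matei describes for the 2.x line.
@Unstable(since = "3.0.0")
interface RowReader {
    Object get(int ordinal);
}
```

The point of the marker is that a later release can remove the annotation (stabilizing the API) without an API break, whereas changing an already-stable interface would have to wait for a major version.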
On Tue, Feb 26, 2019 at 4:41 PM Matt Cheah <mch...@palantir.com> wrote:

Reynold made a note earlier about a proper Row API that isn't InternalRow – is that still on the table?

-Matt Cheah

From: Ryan Blue <rb...@netflix.com>
Reply-To: "rb...@netflix.com" <rb...@netflix.com>
Date: Tuesday, February 26, 2019 at 4:40 PM
To: Matt Cheah <mch...@palantir.com>
Cc: Sean Owen <sro...@apache.org>, Wenchen Fan <cloud0...@gmail.com>, Xiao Li <lix...@databricks.com>, Matei Zaharia <matei.zaha...@gmail.com>, Spark Dev List <dev@spark.apache.org>
Subject: Re: [DISCUSS] Spark 3.0 and DataSourceV2

Thanks for bumping this, Matt. I think we can have the discussion here to clarify exactly what we're committing to, and then have a vote thread once we're agreed.

Getting back to the DSv2 discussion, I think we have a good handle on what would be added:

- Plugin system for catalogs
- TableCatalog interface (I'll start a vote thread for this SPIP shortly)
- TableCatalog implementation backed by SessionCatalog that can load v2 tables
- Resolution rule to load v2 tables using the new catalog
- CTAS logical and physical plan nodes
- Conversions from SQL parsed logical plans to v2 logical plans

Initially, this will always use the v2 catalog backed by SessionCatalog to avoid dependence on the multi-catalog work. All of those are already implemented and working, so I think it is reasonable that we can get them in.

Then we can consider a few stretch goals:

- Get in as much DDL as we can. I think create and drop table should be easy.
- Multi-catalog identifier parsing and multi-catalog support

If we get those last two in, it would be great. We can make the call closer to release time.
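For readers skimming the archive, the catalog items in the list above could be sketched roughly as follows. This is a simplified, self-contained illustration with hypothetical names (`Table`, `TableCatalog`, `InMemoryTableCatalog`), not the actual DSv2 interfaces being voted on; the real design is in the TableCatalog SPIP.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// A minimal table handle the catalog hands back (illustrative only).
interface Table {
    String name();
}

// The shape of a pluggable catalog: look up, create, and drop tables
// by identifier. The real DSv2 TableCatalog carries schemas,
// partitioning, and properties; those are omitted here.
interface TableCatalog {
    Optional<Table> loadTable(String ident);
    Table createTable(String ident);
    boolean dropTable(String ident);
}

// In-memory stand-in for the implementation backed by SessionCatalog.
class InMemoryTableCatalog implements TableCatalog {
    private final Map<String, Table> tables = new HashMap<>();

    @Override
    public Optional<Table> loadTable(String ident) {
        return Optional.ofNullable(tables.get(ident));
    }

    @Override
    public Table createTable(String ident) {
        Table t = () -> ident; // Table has one method, so a lambda works
        tables.put(ident, t);
        return t;
    }

    @Override
    public boolean dropTable(String ident) {
        return tables.remove(ident) != null;
    }
}
```

A resolution rule in the analyzer would then call something like `loadTable` when it sees a table identifier, which is why the plugin system and the SessionCatalog-backed implementation come first in the list.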
Does anyone want to change this set of work?

On Tue, Feb 26, 2019 at 4:23 PM Matt Cheah <mch...@palantir.com> wrote:

What would then be the next steps we'd take to collectively decide on plans and timelines moving forward? Might I suggest scheduling a conference call with the appropriate PMC members to put our ideas together? Maybe such a discussion can take place at next week's meeting? Or do we need a separate, formalized voting thread guided by the PMC?

My suggestion is to try to make concrete steps forward and to avoid letting this slip through the cracks.

I also think there would be merit in having a project plan with estimates of how long each of the features we want to complete will take to implement and review.

-Matt Cheah

On 2/24/19, 3:05 PM, "Sean Owen" <sro...@apache.org> wrote:

Sure, but I don't read anyone as making those statements. Let's assume good intent and read "foo should happen" as "my opinion as a member of the community, which is not solely up to me, is that foo should happen". I understand it's possible for a person to make their opinion over-weighted; this whole style of decision making assumes good actors and doesn't optimize against bad ones. Not that it can't happen; I'm just not seeing it here.

I have never seen any vote on a feature list, by a PMC or otherwise. We can do that if really needed, I guess. But that also isn't the authoritative process in play here, in contrast.

If there's not a more specific subtext or issue here, which is fine to raise (on private@ if it's sensitive), then yes, let's move on in good faith.
On Sun, Feb 24, 2019 at 3:45 PM Mark Hamstra <m...@clearstorydata.com> wrote:

There is nothing wrong with individuals advocating for what they think should or should not be in Spark 3.0, nor should anyone shy away from explaining why they think delaying the release for some reason is or isn't a good idea. What is a problem, or at least something that I have a problem with, are declarative, pseudo-authoritative statements that 3.0 (or some other release) will or won't contain some feature, API, etc., or that some issue is or is not a blocker or worth delaying for. When the PMC has not voted on such issues, I'm often left thinking, "Wait... what? Who decided that, or where did that decision come from?"

--
Ryan Blue
Software Engineer
Netflix