Re: [DISCUSS] Spark 3.0 and DataSourceV2

Wenchen Fan Wed, 27 Feb 2019 17:32:29 -0800

I'm good with the list from Ryan, thanks!

On Thu, Feb 28, 2019 at 1:00 AM Ryan Blue <[email protected]> wrote:


> I think that's a good plan. Let's get the functionality done, but mark it
> experimental pending a new row API.
>
> So is there agreement on this set of work, then?
>
> On Tue, Feb 26, 2019 at 6:30 PM Matei Zaharia <[email protected]>
> wrote:
>
>> To add to this, we can add a stable interface anytime if the original one
>> was marked as unstable; we wouldn’t have to wait until 4.0. We had a lot of
>> APIs that were experimental in 2.0 and then got stabilized in later 2.x
>> releases for example.
>>
>> Matei
>>
>> > On Feb 26, 2019, at 5:12 PM, Reynold Xin <[email protected]> wrote:
>> >
>> > We will have to fix that before we declare DSv2 is stable, because
>> InternalRow is not a stable API. We don’t necessarily need to do it in 3.0.
>> >
>> > On Tue, Feb 26, 2019 at 5:10 PM Matt Cheah <[email protected]> wrote:
>> > Will that then require an API break down the line? Do we save that for
>> Spark 4?
>> >
>> >
>> >
>> >
>> > -Matt Cheah
>> >
>> >
>> >
>> > From: Ryan Blue <[email protected]>
>> > Reply-To: "[email protected]" <[email protected]>
>> > Date: Tuesday, February 26, 2019 at 4:53 PM
>> > To: Matt Cheah <[email protected]>
>> > Cc: Sean Owen <[email protected]>, Wenchen Fan <[email protected]>,
>> Xiao Li <[email protected]>, Matei Zaharia <[email protected]>,
>> Spark Dev List <[email protected]>
>> > Subject: Re: [DISCUSS] Spark 3.0 and DataSourceV2
>> >
>> >
>> >
>> > That's a good question.
>> >
>> >
>> >
>> > While I'd love to have a solution for that, I don't think it is a good
>> idea to delay DSv2 until we have one. That is going to require a lot of
>> internal changes and I don't see how we could make the release date if we
>> are including an InternalRow replacement.
>> >
>> >
>> >
>> > On Tue, Feb 26, 2019 at 4:41 PM Matt Cheah <[email protected]> wrote:
>> >
>> > Reynold made a note earlier about a proper Row API that isn’t
>> InternalRow – is that still on the table?
>> >
>> >
>> >
>> > -Matt Cheah
>> >
>> >
>> >
>> > From: Ryan Blue <[email protected]>
>> > Reply-To: "[email protected]" <[email protected]>
>> > Date: Tuesday, February 26, 2019 at 4:40 PM
>> > To: Matt Cheah <[email protected]>
>> > Cc: Sean Owen <[email protected]>, Wenchen Fan <[email protected]>,
>> Xiao Li <[email protected]>, Matei Zaharia <[email protected]>,
>> Spark Dev List <[email protected]>
>> > Subject: Re: [DISCUSS] Spark 3.0 and DataSourceV2
>> >
>> >
>> >
>> > Thanks for bumping this, Matt. I think we can have the discussion here
>> to clarify exactly what we’re committing to and then have a vote thread
>> once we’re agreed.
>> > Getting back to the DSv2 discussion, I think we have a good handle on
>> what would be added:
>> > · Plugin system for catalogs
>> > · TableCatalog interface (I’ll start a vote thread for this SPIP shortly)
>> > · TableCatalog implementation backed by SessionCatalog that can load v2 tables
>> > · Resolution rule to load v2 tables using the new catalog
>> > · CTAS logical and physical plan nodes
>> > · Conversions from SQL parsed logical plans to v2 logical plans
>> >
>> > Initially, this will always use the v2 catalog backed by SessionCatalog
>> to avoid dependence on the multi-catalog work. All of those are already
>> implemented and working, so I think it is reasonable that we can get them
>> in.
>> > Then we can consider a few stretch goals:
>> > · Get in as much DDL as we can. I think create and drop table should be easy.
>> > · Multi-catalog identifier parsing and multi-catalog support
>> >
>> > If we get those last two in, it would be great. We can make the call
>> closer to release time. Does anyone want to change this set of work?
>> >
>> >
>> > On Tue, Feb 26, 2019 at 4:23 PM Matt Cheah <[email protected]> wrote:
>> >
>> > What would then be the next steps we'd take to collectively decide on
>> plans and timelines moving forward? Might I suggest scheduling a conference
>> call with appropriate PMCs to put our ideas together? Maybe such a
>> discussion can take place at next week's meeting? Or do we need to have a
>> separate formalized voting thread which is guided by a PMC?
>> >
>> > My suggestion is to try to make concrete steps forward and to avoid
>> letting this slip through the cracks.
>> >
>> > I also think there would be merits to having a project plan and
>> estimates around how long each of the features we want to complete is going
>> to take to implement and review.
>> >
>> > -Matt Cheah
>> >
>> > On 2/24/19, 3:05 PM, "Sean Owen" <[email protected]> wrote:
>> >
>> >     Sure, I don't read anyone making these statements though? Let's
>> assume
>> >     good intent, that "foo should happen" as "my opinion as a member of
>> >     the community, which is not solely up to me, is that foo should
>> >     happen". I understand it's possible for a person to make their
>> opinion
>> >     over-weighted; this whole style of decision making assumes good
>> actors
>> >     and doesn't optimize against bad ones. Not that it can't happen,
>> just
>> >     not seeing it here.
>> >
>> >     I have never seen any vote on a feature list, by a PMC or otherwise.
>> >     We can do that if really needed I guess. But that also isn't the
>> >     authoritative process in play here, in contrast.
>> >
>> >     If there's not a more specific subtext or issue here, which is fine
>> to
>> >     say (on private@ if it's sensitive or something), yes, let's move
>> on
>> >     in good faith.
>> >
>> >     On Sun, Feb 24, 2019 at 3:45 PM Mark Hamstra <
>> [email protected]> wrote:
>> >     > There is nothing wrong with individuals advocating for what they
>> think should or should not be in Spark 3.0, nor should anyone shy away from
>> explaining why they think delaying the release for some reason is or isn't
>> a good idea. What is a problem, or is at least something that I have a
>> problem with, are declarative, pseudo-authoritative statements that 3.0 (or
>> some other release) will or won't contain some feature, API, etc. or that
>> some issue is or is not a blocker or worth delaying for. When the PMC has not
>> voted on such issues, I'm often left thinking, "Wait... what? Who decided
>> that, or where did that decision come from?"
>> >
>> >
>> >
>> >
>> >
>> > --
>> >
>> > Ryan Blue
>> >
>> > Software Engineer
>> >
>> > Netflix
>> >
>> >
>> >
>> >
>> >
>> >
>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
