Re: [DISCUSS] Spark 3.0 and DataSourceV2

Matt Cheah Tue, 26 Feb 2019 17:11:43 -0800

Will that then require an API break down the line? Do we save that for Spark 4?

-Matt Cheah

From: Ryan Blue <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Tuesday, February 26, 2019 at 4:53 PM
To: Matt Cheah <[email protected]>
Cc: Sean Owen <[email protected]>, Wenchen Fan <[email protected]>, Xiao Li 
<[email protected]>, Matei Zaharia <[email protected]>, Spark Dev 
List <[email protected]>
Subject: Re: [DISCUSS] Spark 3.0 and DataSourceV2

That's a good question. 

While I'd love to have a solution for that, I don't think it is a good idea to 
delay DSv2 until we have one. That is going to require a lot of internal 
changes and I don't see how we could make the release date if we are including 
an InternalRow replacement.

On Tue, Feb 26, 2019 at 4:41 PM Matt Cheah <[email protected]> wrote:

Reynold made a note earlier about a proper Row API that isn’t InternalRow – is 
that still on the table?

-Matt Cheah

From: Ryan Blue <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Tuesday, February 26, 2019 at 4:40 PM
To: Matt Cheah <[email protected]>
Cc: Sean Owen <[email protected]>, Wenchen Fan <[email protected]>, Xiao Li 
<[email protected]>, Matei Zaharia <[email protected]>, Spark Dev 
List <[email protected]>
Subject: Re: [DISCUSS] Spark 3.0 and DataSourceV2

Thanks for bumping this, Matt. I think we can have the discussion here to 
clarify exactly what we’re committing to and then have a vote thread once we’re 
agreed.

Getting back to the DSv2 discussion, I think we have a good handle on what 
would be added:

· Plugin system for catalogs
· TableCatalog interface (I’ll start a vote thread for this SPIP shortly)
· TableCatalog implementation backed by SessionCatalog that can load v2 tables
· Resolution rule to load v2 tables using the new catalog
· CTAS logical and physical plan nodes
· Conversions from SQL parsed logical plans to v2 logical plans

Initially, this will always use the v2 catalog backed by SessionCatalog to 
avoid dependence on the multi-catalog work. All of those are already 
implemented and working, so I think it is reasonable that we can get them in.

Then we can consider a few stretch goals:

· Get in as much DDL as we can. I think create and drop table should be easy.
· Multi-catalog identifier parsing and multi-catalog support

If we get those last two in, it would be great. We can make the call closer to 
release time. Does anyone want to change this set of work?

On Tue, Feb 26, 2019 at 4:23 PM Matt Cheah <[email protected]> wrote:

What would then be the next steps we'd take to collectively decide on plans and 
timelines moving forward? Might I suggest scheduling a conference call with 
appropriate PMCs to put our ideas together? Maybe such a discussion can take 
place at next week's meeting? Or do we need to have a separate formalized 
voting thread which is guided by a PMC?

My suggestion is to try to make concrete steps forward and to avoid letting 
this slip through the cracks.

I also think there would be merits to having a project plan and estimates 
around how long each of the features we want to complete is going to take to 
implement and review.

-Matt Cheah

On 2/24/19, 3:05 PM, "Sean Owen" <[email protected]> wrote:

    Sure, I don't read anyone making these statements though? Let's assume
    good intent, that "foo should happen" as "my opinion as a member of
    the community, which is not solely up to me, is that foo should
    happen". I understand it's possible for a person to make their opinion
    over-weighted; this whole style of decision making assumes good actors
    and doesn't optimize against bad ones. Not that it can't happen, just
    not seeing it here.

    I have never seen any vote on a feature list, by a PMC or otherwise.
    We can do that if really needed I guess. But that also isn't the
    authoritative process in play here, in contrast.

    If there's not a more specific subtext or issue here, which is fine to
    say (on private@ if it's sensitive or something), yes, let's move on
    in good faith.

    On Sun, Feb 24, 2019 at 3:45 PM Mark Hamstra <[email protected]> wrote:
    > There is nothing wrong with individuals advocating for what they think
    > should or should not be in Spark 3.0, nor should anyone shy away from
    > explaining why they think delaying the release for some reason is or
    > isn't a good idea. What is a problem, or is at least something that I
    > have a problem with, are declarative, pseudo-authoritative statements
    > that 3.0 (or some other release) will or won't contain some feature,
    > API, etc. or that some issue is or is not a blocker or worth delaying
    > for. When the PMC has not voted on such issues, I'm often left thinking,
    > "Wait... what? Who decided that, or where did that decision come from?"

-- 

Ryan Blue 

Software Engineer

Netflix

