Hi Gabor,

First, a little context... one of the goals of DSv2 is to standardize the behavior of SQL operations in Spark. For example, running CTAS when the target table already exists will always fail, instead of taking whatever action the source chooses -- dropping and re-creating, inserting, or failing.
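To make that concrete, here's a rough sketch of the standardized v2 semantics (the table names are just placeholders; replacing an existing table becomes its own explicit statement rather than source-specific CTAS behavior):

```sql
-- Under DSv2 semantics, CTAS fails if the target already exists:
CREATE TABLE db.target AS SELECT * FROM db.source;

-- Replacing is explicit, not something a source decides to do on its own:
CREATE OR REPLACE TABLE db.target AS SELECT * FROM db.source;
```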
Unfortunately, this means that DSv1 can't be easily replaced, because its behavior differs between sources. In addition, we're not really sure how DSv1 works in all cases -- it really depends on what seemed reasonable to the authors at the time. For example, we don't have a good understanding of how file-based tables (those not backed by a Metastore) behave. There are also changes that we know are breaking and are okay with, like only inserting safe casts when writing with v2.

Because of this, we can't just replace v1 with v2 transparently, so the plan is to allow deployments to migrate to v2 in stages. Here's the plan:

1. Use v1 by default, so all existing queries work as they do today for identifiers like `db.table`.
2. Allow users to add additional v2 catalogs that are used when an identifier explicitly starts with one, like `test_catalog.db.table`.
3. Add a v2 catalog that delegates to the session catalog, so that v2 read/write implementations can be used while tables are stored just like v1 tables in the session catalog.
4. Add a setting to use a v2 catalog as the default. Setting this would use a v2 catalog for all identifiers without a catalog, like `db.table`.
5. Add a way for a v2 catalog to return a table that gets converted to v1. This is what `CatalogTableAsV2` does in #24768 <https://github.com/apache/spark/pull/24768>.

PR #24768 <https://github.com/apache/spark/pull/24768> implements the rest of these changes. Specifically, we initially used the default catalog for v2 sources, but that causes namespace problems, so we need the v2 session catalog (point #3) as the default when there is no default v2 catalog.

I hope that answers your question. If not, I'm happy to answer follow-ups, and we can add this as a topic in the next v2 sync on Wednesday. I'm also planning to talk about metadata columns or function push-down from the Kafka v2 PR at that sync, so you may want to attend.
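As a rough configuration sketch of how points 2 and 4 surface to users (the property names follow the catalog-plugin scheme from this work; the catalog name and implementation class below are placeholders, not real classes):

```properties
# Point 2: register an explicit v2 catalog; identifiers like
# test_catalog.db.table then resolve through it.
spark.sql.catalog.test_catalog=com.example.TestCatalogImpl

# Point 4: make a v2 catalog the default for bare identifiers like db.table.
# If unset, v1 stays the default (point 1), with the v2 session catalog
# (point 3) filling in for v2 sources.
spark.sql.defaultCatalog=test_catalog
```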
rb

On Thu, Jun 20, 2019 at 4:45 AM Gabor Somogyi <gabor.g.somo...@gmail.com> wrote:
> Hi All,
>
> I've taken a look at the code and docs to find out when DSv1 sources have
> to be removed (in case a DSv2 replacement is implemented). After some
> digging I've found DSv1 sources which are already removed, but in some
> cases v1 and v2 still exist in parallel.
>
> Can somebody please tell me what's the overall plan in this area?
>
> BR,
> G

--
Ryan Blue
Software Engineer
Netflix