[DISCUSS] USING syntax for Datasource V2

Hyukjin Kwon Mon, 20 Aug 2018 00:20:01 -0700

Hi all,

I have been trying to follow `USING` syntax support for Datasource V2, since
it looks like it is currently not supported whereas the `format` API is. I
have been trying to understand why, and talked with Ryan.


Ryan knows all the details, and he and I thought it would be good to post
here, since I have only just started to look into this.
Here is Ryan's response:


> USING is currently used to select the underlying data source
implementation directly. The string passed in USING or format in the DF API
is used to resolve an implementation class.
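
To make the resolution step above concrete, here is a minimal sketch in Python. It is a toy model and not Spark's code: the `SHORT_NAMES` registry and the `resolve_implementation` helper are hypothetical, and the aliases point at Python standard-library classes only so that the sketch runs without Spark. The idea is simply that the string is either a registered short name or a fully qualified class name that is loaded reflectively.

```python
import importlib

# Hypothetical short-name registry. The targets are stdlib classes chosen
# only so this sketch is runnable; they are not data sources.
SHORT_NAMES = {
    "json": "json.JSONDecoder",
    "csv": "csv.DictReader",
}


def resolve_implementation(name):
    """Resolve a USING/format string to a class, by alias or dotted path."""
    # A registered short name wins; otherwise treat the string as a
    # fully qualified class name.
    qualified = SHORT_NAMES.get(name.lower(), name)
    module_name, _, class_name = qualified.rpartition(".")
    if not module_name:
        raise ValueError(f"cannot resolve data source: {name!r}")
    try:
        module = importlib.import_module(module_name)
        return getattr(module, class_name)
    except (ImportError, AttributeError) as exc:
        raise ValueError(f"cannot resolve data source: {name!r}") from exc
```

With this model, `resolve_implementation("json")` and `resolve_implementation("json.JSONDecoder")` return the same class, while an unregistered bare name such as `"parquet"` raises `ValueError`.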

> The existing catalog supports tables that specify their datasource
implementation, but this will not be the case for all catalogs when Spark
adds multiple catalog support. For example, a Cassandra catalog or a JDBC
catalog that exposes tables in those systems will definitely not support
users marking tables with the “parquet” data source. The catalog must have
the ability to determine the data source implementation. That’s why I think
it is valuable to think of the current ExternalCatalog as one that can
track tables with any read/write implementation. Other catalogs can’t and
won’t do that.

> In the catalog v2 API <https://github.com/apache/spark/pull/21306> I’ve
proposed, everything from CREATE TABLE is passed to the catalog. Then the
catalog determines what source to use and returns a Table instance that
uses some class for its ReadSupport and WriteSupport implementation. An
ExternalCatalog exposed through that API would receive the USING or format
string as a table property and would return a Table that uses the correct
ReadSupport, so tables stored in an ExternalCatalog will work as they do
today.

> I think other catalogs should be able to choose what to do with the USING
string. An Iceberg <https://github.com/Netflix/iceberg> catalog might use
this to determine the underlying file format, which could be parquet, orc,
or avro.
Or, a JDBC catalog might use it for the underlying table implementation in
the DB. This would make the property more of a storage hint for the
catalog, which is going to determine the read/write implementation anyway.
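
The two catalog behaviours described above can be sketched as follows. This is a toy model in Python, not the proposed catalog v2 API: the class names, the `provider` and `file-format` property keys, the `catalog-native` label, and the default of parquet are all assumptions made for illustration. Both catalogs receive the USING string as an ordinary table property; one uses it to pick the read implementation, the other treats it only as a storage hint.

```python
from dataclasses import dataclass, field


@dataclass
class Table:
    name: str
    read_support: str  # label for the implementation that reads the table
    properties: dict = field(default_factory=dict)


class AnyImplementationCatalog:
    """Stands in for a catalog that can track tables with any implementation."""

    def create_table(self, name, properties):
        # The USING string directly names the read/write implementation.
        return Table(name, properties["provider"], dict(properties))


class StorageHintCatalog:
    """Stands in for a catalog that always uses its own implementation."""

    SUPPORTED_FORMATS = {"parquet", "orc", "avro"}

    def create_table(self, name, properties):
        # The USING string is only a hint for the underlying file format.
        # Defaulting to parquet is an assumption of this sketch.
        file_format = properties.get("provider", "parquet")
        if file_format not in self.SUPPORTED_FORMATS:
            raise ValueError(f"unsupported file format: {file_format!r}")
        merged = {**properties, "file-format": file_format}
        return Table(name, "catalog-native", merged)
```

Passing `{"provider": "orc"}` to the first catalog yields a table read by the `orc` implementation; passing it to the second yields a table read by the catalog's own implementation with `orc` recorded as the file format.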

> For cases where there is no catalog involved, the current plan is to use
the reflection-based approach from v1 with the USING or format string. In
v2, that should resolve a ReadSupportProvider, which is used to create a
ReadSupport directly from options. I think this is a good approach for
backward-compatibility, but it can’t provide the same features as a
catalog-based table. Catalogs are how we have decided to build reliable
behavior for CTAS and the other standard logical plans
<https://docs.google.com/document/d/1gYm5Ji2Mge3QBdOliFV5gSPTKlX4q1DCBXIkiyMv62A/edit?ts=5a987801#heading=h.m45webtwxf2d>.
CTAS is a create and then an insert, and a write implementation alone can’t
provide that create operation.
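
The point about CTAS being a create followed by an insert can be illustrated with a small Python sketch. Everything here is hypothetical (`InMemoryCatalog`, `create_table_as_select`), and dropping the table when the insert fails is an assumption about what "reliable behavior" could mean, not something stated above. What the sketch shows is that the create and cleanup steps live on the catalog, so a component that can only write rows has nothing to call for them.

```python
class InMemoryCatalog:
    """Hypothetical catalog that owns table creation, deletion, and storage."""

    def __init__(self):
        self.tables = {}

    def create_table(self, name):
        if name in self.tables:
            raise ValueError(f"table already exists: {name}")
        self.tables[name] = []

    def drop_table(self, name):
        self.tables.pop(name, None)

    def insert(self, name, rows):
        self.tables[name].extend(rows)


def create_table_as_select(catalog, name, rows):
    """CTAS as create-then-insert, undoing the create if the insert fails."""
    catalog.create_table(name)  # fails early if the table already exists
    try:
        # Materialize the query result, then write it into the new table.
        catalog.insert(name, list(rows))
    except Exception:
        # Assumed cleanup: do not leave a half-created table behind.
        catalog.drop_table(name)
        raise
```

If the query producing `rows` fails midway, the newly created table is removed again; if the table already exists, the call fails before anything is written and the existing table is left untouched.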

I was targeting the last case (where there is no catalog involved) in
particular. I think that approach is also good, since `USING` syntax
compatibility should be kept anyway, and it should reduce migration cost as
well. I am wondering what you all think about this.
If you think the last case should be supported anyway, I think we could
proceed with it independently. If you think other issues should be resolved
first, then we (or at least I) should take a look at the set of catalog
APIs.
