Seems reasonable at a high level. I don't think we can use Expression and
SortOrder in public APIs, though. Those are not meant to be public and can
break easily across versions.
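
For illustration only, here's a minimal sketch of the kind of stable,
public-facing representation we could expose instead. The Transform trait
and the case class names below are hypothetical, not an existing Spark API:

// Hypothetical public transform representation; names are illustrative.
// Catalogs would only see column names plus a transform name, never
// Catalyst Expression or SortOrder internals.
sealed trait Transform {
  def name: String              // e.g. "identity", "day", "bucket"
  def references: Seq[String]   // columns the transform reads
}

case class IdentityTransform(column: String) extends Transform {
  def name = "identity"
  def references = Seq(column)
}

case class DayTransform(column: String) extends Transform {
  def name = "day"
  def references = Seq(column)
}

case class BucketTransform(numBuckets: Int, column: String) extends Transform {
  def name = "bucket"
  def references = Seq(column)
}

Because these carry only column names and a transform name, they can stay
source-compatible even when Catalyst internals change.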


On Tue, Jul 24, 2018 at 9:26 AM Ryan Blue <rb...@netflix.com.invalid> wrote:

> The recently adopted SPIP to standardize logical plans requires a way to
> plug in providers for table metadata operations, so that the new plans
> can create and drop tables. I proposed an API to do this in a follow-up SPIP
> on APIs for Table Metadata Operations
> <https://docs.google.com/document/d/1zLFiA1VuaWeVxeTDXNg8bL6GP3BVoOZBkewFtEnjEoo/edit#>.
> This thread is to discuss that proposal.
>
> There are two main parts:
>
>    - A public facing API for creating, altering, and dropping tables
>    - An API for catalog implementations to provide the underlying table
>    operations
>
> The main need is for the plug-in API, but I included the public one
> because there isn’t currently a friendly public API to create tables and I
> think it helps to see how both would work together.
>
> Here’s a sample of the proposed public API:
>
> catalog.createTable("db.table")
>     .addColumn("id", LongType)
>     .addColumn("data", StringType, nullable=true)
>     .addColumn("ts", TimestampType)
>     .partitionBy(day($"ts"))
>     .config("prop", "val")
>     .commit()
>
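> Here’s a minimal sketch of the builder that this fluent API implies; the
> class and member names are illustrative only, not part of the proposal:
>
> import scala.collection.mutable
> import org.apache.spark.sql.types.{DataType, StructField, StructType}
>
> // Illustrative only: partition expressions and the commit path are
> // stubbed out to keep the sketch self-contained.
> class CreateTableBuilder(ident: String) {
>   private val fields = mutable.ArrayBuffer.empty[StructField]
>   private val props = mutable.HashMap.empty[String, String]
>
>   def addColumn(name: String, dataType: DataType,
>       nullable: Boolean = false): this.type = {
>     fields += StructField(name, dataType, nullable)
>     this
>   }
>
>   def config(key: String, value: String): this.type = {
>     props(key) = value
>     this
>   }
>
>   def commit(): Unit = ???  // would delegate to the catalog plug-in
> }
>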
> And here’s a sample of the catalog plug-in API:
>
> Table createTable(
>     TableIdentifier ident,
>     StructType schema,
>     List<Expression> partitions,
>     Optional<List<SortOrder>> sortOrder,
>     Map<String, String> properties)
>
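> To make the contract concrete, here’s a rough in-memory sketch;
> TableCatalog, Table, and InMemoryTable below are stand-ins for whatever
> the final plug-in interface defines, not actual classes:
>
> import java.util.concurrent.ConcurrentHashMap
>
> class InMemoryTableCatalog extends TableCatalog {
>   private val tables = new ConcurrentHashMap[TableIdentifier, Table]()
>
>   override def createTable(
>       ident: TableIdentifier,
>       schema: StructType,
>       partitions: java.util.List[Expression],
>       sortOrder: java.util.Optional[java.util.List[SortOrder]],
>       properties: java.util.Map[String, String]): Table = {
>     // InMemoryTable is a hypothetical Table implementation
>     val table = new InMemoryTable(ident, schema, partitions, properties)
>     if (tables.putIfAbsent(ident, table) != null) {
>       throw new IllegalArgumentException(s"Table already exists: $ident")
>     }
>     table
>   }
> }
>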
> Note that this API passes both bucketing and column-based partitioning as
> Expressions. This is a generalization that makes it possible for the table
> to use the relationship between columns and partitions. In the example
> above, data is partitioned by the day of the timestamp field. Because the
> expression is passed to the table, the table can use predicates on the
> timestamp to filter out partitions without an explicit partition predicate.
> There’s more detail in the proposal on this.
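>
> For example, a reader could prune day partitions with a plain timestamp
> predicate, with no separate partition column needed:
>
> // Because the table knows partitions = day(ts), this predicate can be
> // translated to the partition filter day(ts) = 2018-07-24.
> spark.table("db.table")
>     .where($"ts" >= "2018-07-24 00:00:00" && $"ts" < "2018-07-25 00:00:00")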
>
> The SPIP is for the APIs and does not cover how multiple catalogs would be
> exposed. I started a separate discussion thread on how to access multiple
> catalogs and maintain compatibility with Spark’s current behavior (how to
> get the catalog instance in the above example).
>
> Please use this thread to discuss the proposed APIs. Thanks, everyone!
>
> rb
> --
> Ryan Blue
> Software Engineer
> Netflix
>
