Seems reasonable at a high level. I don't think we can use Expressions and SortOrders in public APIs, though. Those are not meant to be public and can break easily across versions.
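Something narrower could carry the same information without leaking catalyst internals. As a minimal sketch only, a stable partition-transform facade might look like this (Transform, IdentityTransform, DayTransform, and BucketTransform are illustrative names, not part of the proposal):

// Hypothetical sketch: a small, version-stable interface that could stand
// in for catalyst Expression in the public createTable signature.
trait Transform {
  def name: String              // e.g. "identity", "day", "bucket"
  def references: Seq[String]   // column names the transform reads
}

case class IdentityTransform(column: String) extends Transform {
  def name = "identity"
  def references = Seq(column)
}

case class DayTransform(column: String) extends Transform {
  def name = "day"
  def references = Seq(column)
}

case class BucketTransform(column: String, numBuckets: Int) extends Transform {
  def name = "bucket"
  def references = Seq(column)
}

The plug-in signature could then take a List<Transform> instead of List<Expression>, keeping the public surface small enough to hold stable across versions.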
On Tue, Jul 24, 2018 at 9:26 AM Ryan Blue <rb...@netflix.com.invalid> wrote:

> The recently adopted SPIP to standardize logical plans requires a way to
> plug in providers for table metadata operations, so that the new plans
> can create and drop tables. I proposed an API to do this in a follow-up
> SPIP on APIs for Table Metadata Operations
> <https://docs.google.com/document/d/1zLFiA1VuaWeVxeTDXNg8bL6GP3BVoOZBkewFtEnjEoo/edit#>.
> This thread is to discuss that proposal.
>
> There are two main parts:
>
> - A public-facing API for creating, altering, and dropping tables
> - An API for catalog implementations to provide the underlying table
>   operations
>
> The main need is for the plug-in API, but I included the public one
> because there isn’t currently a friendly public API to create tables, and
> I think it helps to see how both would work together.
>
> Here’s a sample of the proposed public API:
>
> catalog.createTable("db.table")
>     .addColumn("id", LongType)
>     .addColumn("data", StringType, nullable=true)
>     .addColumn("ts", TimestampType)
>     .partitionBy(day($"ts"))
>     .config("prop", "val")
>     .commit()
>
> And here’s a sample of the catalog plug-in API:
>
> Table createTable(
>     TableIdentifier ident,
>     StructType schema,
>     List<Expression> partitions,
>     Optional<List<SortOrder>> sortOrder,
>     Map<String, String> properties)
>
> Note that this API passes both bucketing and column-based partitioning as
> Expressions. This is a generalization that makes it possible for the table
> to use the relationship between columns and partitions. In the example
> above, data is partitioned by the day of the timestamp field. Because the
> expression is passed to the table, the table can use predicates on the
> timestamp to filter out partitions without an explicit partition
> predicate. There’s more detail in the proposal on this.
>
> The SPIP is for the APIs and does not cover how multiple catalogs would
> be exposed. I started a separate discussion thread on how to access
> multiple catalogs and maintain compatibility with Spark’s current behavior
> (how to get the catalog instance in the above example).
>
> Please use this thread to discuss the proposed APIs. Thanks, everyone!
>
> rb
>
> --
> Ryan Blue
> Software Engineer
> Netflix
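To make the day(ts) pruning idea in the quoted proposal concrete, here is a minimal sketch; the partition map, the paths, and the dayOf helper are all illustrative, not from the SPIP:

import java.time.{Instant, LocalDate, ZoneOffset}

// Derives the partition value day(ts) from a row's timestamp (illustrative).
def dayOf(ts: Instant): LocalDate = ts.atZone(ZoneOffset.UTC).toLocalDate

// Partitions the table tracks, keyed by day(ts). Paths are made up.
val partitions: Map[LocalDate, String] = Map(
  LocalDate.parse("2018-07-23") -> "s3://bucket/table/day=2018-07-23",
  LocalDate.parse("2018-07-24") -> "s3://bucket/table/day=2018-07-24"
)

// Because the table knows its partitioning expression is day(ts), a query
// predicate like ts >= '2018-07-24T00:00:00Z' maps onto the partition key,
// and earlier partitions are skipped with no explicit partition predicate.
val lowerBound = Instant.parse("2018-07-24T00:00:00Z")
val matching = partitions.filter { case (day, _) => !day.isBefore(dayOf(lowerBound)) }

Here matching keeps only the 2018-07-24 partition, which is the behavior the proposal says falls out of passing the partition expression to the table.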