Re: REST catalog proposal

Ryan Blue Mon, 13 Dec 2021 19:34:33 -0800

I think Jack does a great job of pointing out a lot of the advantages. I
agree with him, but I’ll add my perspective as well. I suggested the REST
catalog a couple (few?) months ago when we were talking about the DynamoDB
catalog and it stuck with me as the solution to quite a few problems.

First, although you can plug a catalog implementation into the classpath
for clients, that’s not always a good idea. JDBC is a good example, where
you probably don’t want a ton of connections going directly to a database.
An intermediate service is a great way to scale such a metastore. As Jack
noted, it’s also nice to implement catalogs like JDBC with a service so
that you get the exact same behavior across languages without implementing
it twice with different DB APIs.

Along the same lines, many hosted processing engines aren’t going to
support customers plugging arbitrary code into processing engines. When I
was at Netflix, we used a custom metastore to track tables. That worked
great, but it meant that our platform was incompatible with things like AWS
Athena because we’d either have to plug in a Jar or get them to implement a
bespoke REST protocol just for us. Right now, catalog customization is only
available if you use the Hive thrift API, which is not a fun way to go if
you just want to try out a hosted processing engine. By building a common
protocol and client, we can hopefully get engines to support the client so
we can point them to any metastore.

A third problem that the REST catalog helps address is version upgrades.
Because the metadata JSON file is written by clients, upgrading to support
new features is difficult. For example, the snapshot reference map that
we’re using to implement tagging isn’t supported by any existing writer.
The change is backward-compatible, so older Iceberg versions can read
tables with refs, but if those versions write to a table, the refs get
dropped. That means we have to upgrade versions at the same time across all
writers. On the other hand, with a change-based protocol we can update the
Iceberg version of just the service that writes the metadata JSON files.
Then with only one version writing metadata, no metadata is unsupported and
dropped.
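
As a rough illustration of the paragraph above, here is a simplified sketch of the two commit styles. It is not Iceberg's actual code: the field names follow the table metadata JSON, but the `old_client_commit` and `service_apply_change` helpers, the `add-snapshot` action name, and the sample table are assumptions made up for this example.

```python
import copy

# Metadata fields that a hypothetical older client library knows about.
# "refs" was added later, so this older writer has never heard of it.
OLD_WRITER_FIELDS = {"format-version", "location", "snapshots", "current-snapshot-id"}


def old_client_commit(metadata, new_snapshot):
    """Older client: rewrite the whole metadata file from the fields it understands."""
    known = {k: copy.deepcopy(v) for k, v in metadata.items() if k in OLD_WRITER_FIELDS}
    known["snapshots"].append(new_snapshot)
    known["current-snapshot-id"] = new_snapshot["snapshot-id"]
    return known  # anything outside OLD_WRITER_FIELDS is silently lost


def service_apply_change(metadata, change):
    """Up-to-date service: apply a client-submitted change to the full metadata."""
    if change["action"] != "add-snapshot":  # hypothetical action name
        raise ValueError("unsupported change: " + change["action"])
    updated = copy.deepcopy(metadata)
    snapshot = change["snapshot"]
    updated["snapshots"].append(snapshot)
    updated["current-snapshot-id"] = snapshot["snapshot-id"]
    return updated  # fields the client never mentioned are left untouched


table = {
    "format-version": 2,
    "location": "s3://bucket/db/table",  # placeholder location
    "snapshots": [{"snapshot-id": 1}],
    "current-snapshot-id": 1,
    "refs": {"release-1": {"snapshot-id": 1, "type": "tag"}},
}

rewritten = old_client_commit(table, {"snapshot-id": 2})
applied = service_apply_change(
    table, {"action": "add-snapshot", "snapshot": {"snapshot-id": 2}}
)

print("refs" in rewritten)  # the old writer dropped the tag
print("refs" in applied)    # the service kept it
```

In the second path the client only describes what changed, so only the service needs to run an Iceberg version that understands every metadata field.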

At Tabular, we’re interested in all 3 of these. We’re building a REST-based
catalog and we could add it in a vendor module just like the Glue catalog.
But I think our time is better spent doing this with the community so that
we all can use a common client, no matter what metastore service you use or
build.

I’ll also reply to the specific questions:

> Is this a spec at the level that the table spec exists or is this an
> informative PR to agree on the REST api of *a* catalog?

I think this is *a* catalog. We want to document the protocol so it is
generally useful, but we’re not aiming to get rid of the existing catalogs.
I think they are complementary. Jack noted some great ideas for what you
can do with the change-based API.

> Is it meant to enshrine the Catalog interface into a spec?

This is meant to be able to do everything that Catalog, SupportsNamespaces,
and TableOperations currently do, since those are what you customize when
you plug in a catalog implementation. Setting up things like the location
provider implementation and FileIO settings is included.

> Will there be both server and client modules in the iceberg codebase?

Iceberg doesn’t provide services, it provides a library. I wouldn’t want to
change that to avoid scope creep. That said, I think a very basic
implementation that translates back to the Catalog API is probably the best
way to test it, so I could see having something like that in tests.

> It may be early to say for sure but does a server implementation imply
> authn/z, database backends, deployment artifacts and all the other fun
> things that go into a server side component?

This is exactly why Iceberg has always provided a library and not services.
There are so many concerns here that I think it would be a separate
project. I don’t think that Iceberg should do this, just like I think it’s
healthy that Nessie is a separate project.

Ryan

On Mon, Dec 13, 2021 at 1:17 PM Jack Ye <yezhao...@gmail.com> wrote:

> Hi Ryan,
>
> Thanks for starting the thread! Just want to share some of my thoughts
> related to this topic.
>
> I think the AWS Glue, DynamoDB and JDBC catalogs will continue to live, I
> don't see a unification through REST as we are not going to build a REST
> server between Iceberg and the related AWS services, and I think anyone can
> continue to add more implementations in this route if they want (although
> not recommended). In my opinion, everything in Iceberg will be client side,
> there is not going to be a server module because that is what the extension
> point is. REST catalog is just a parallel implementation to
> BaseMetastoreCatalog, but it does not enforce writing a JSON metadata file.
> Instead, it only tells the server side the set of changes and lets the
> server handle those changes, the server can choose to write the JSON file
> in whatever way it wants, or even not write it at all. Just like Glue,
> DynamoDB and JDBC all extend BaseMetastoreCatalog, other catalogs can
> choose to "extend" the REST catalog, but the extension is through the
> OpenAPI REST client but not just Java inheritance.
>
> The biggest benefit I see out of this development is that catalog
> providers can focus on the server side implementation to build really great
> catalog services with all sorts of nice features, and no new integration
> and maintenance is needed when Iceberg rolls out new catalog features or
> support for new languages because everything goes through the base REST
> implementation. Because of such simplification in open source
> compatibility, I think most new catalog providers will prefer integration
> through REST. In addition, systems that only have exposure to a
> non-Java/Python language can also be used as a catalog provider using a
> client generated from the OpenAPI spec. It does not need to have any Java
> compatibility. Just like there are people who prefer DynamoDB catalog over
> Glue catalog, we also have use cases in AWS for catalog implementations
> that would only be achievable through a REST catalog, which I will
> contribute in the future after the REST catalog is finalized.
>
> The fact that the REST catalog server receives table changes instead of
> rewriting the entire table metadata also means the catalog service can
> optimize a lot of performance aspects. We have seen issues in streaming
> where the table metadata JSON file size gets too big and impacts reads, we
> also generally agree that small table metadata update through rewriting the
> entire metadata file is very inefficient. All these issues could be fixed
> by moving to a client-server model for a scalable service to handle and
> store these changes.
>
> Best,
> Jack Ye
>
> On Mon, Dec 13, 2021 at 12:28 PM Ryan Murray <rym...@dremio.com> wrote:
>
>> Hi all,
>>
>>
>> For those of you who haven't been following there has been some
>> interesting discussion around the proposal for a REST based catalog[1].
>>
>>
>> One of the primary questions I had while reading it was 'what is the
>> overall goal of the API?'. Given the size of this question I thought it
>> might be better to pose it on the mailing list than to clutter the PR.
>>
>>
>> So I guess primarily for Kyle: what is the long term goal/vision for the
>> REST catalog? Eg what are the use cases and who are the users? Do you see
>> this unifying the other existing catalogs or do you see it as another
>> catalog to complement existing choices?
>>
>>
>> Additionally,
>>
>> * Is this a spec at the level that the table spec exists or is this an
>> informative PR to agree on the REST api of _a_ catalog?
>>
>> * Is it meant to enshrine the `Catalog` interface into a spec? This came
>> up on a python sync also
>>
>> * Will there be both server and client modules in the iceberg codebase? I
>> would expect that at least a reference implementation of a server would be
>> a good thing but this would be the first part of the codebase that runs as
>> a server instead of as client code in an engine. On the other side an open
>> api spec and a client impl w/o a server sounds like it's missing something.
>>
>> * It may be early to say for sure but does a server implementation imply
>> authn/z, database backends, deployment artifacts and all the other fun
>> things that go into a server side component?
>>
>>
>> That's just a few things I have been thinking about. Curious to see if
>> anyone else has been thinking similarly and very excited to hear your
>> thoughts Kyle. Also very excited to see this catalog develop. The activity
>> on the PR speaks to how excited people are about it landing.
>>
>>
>> Best,
>>
>> Ryan
>>
>>
>> [1] https://github.com/apache/iceberg/pull/3561
>>
>

-- 
Ryan Blue
Tabular
