Hey Brian,

Thanks for the summary! Good one!

I would just add the "REST ref impl" discussion.

Regarding the anti-patterns, I agree with the list. IMHO, some are more
"opinionated implementation", so definitely not in the API scope. +1

Thanks again!

Regards
JB

On Fri, Mar 1, 2024 at 1:15 PM Brian Olsen <[email protected]> wrote:
>
> My attempt to consolidate a list of goals, anti-patterns, and impl details
> mentioned since this discussion was brought up at the last Iceberg sync.
> Tried to roughly capture who mentioned these things so we can follow up if
> needed. Hopefully this can serve as a basis for the design discussion.
>
> Goals:
>
> - Remove the initial burden of choice of which REST implementation from new
>   users getting started with Iceberg (Russell S)
> - Cut down on the supported catalogs that are no longer in use (e.g.
>   DynamoDB) or never intended for production (e.g. Hadoop) to minimize
>   maintenance, lower variability, and lower the burden of choice on Iceberg
>   users. (Blue)
> - Simplify plugging in your own catalog so the Iceberg project isn’t
>   responsible for maintaining and testing a bunch of dialects. (Blue)
> - Aim for a REST catalog centric future and continue to remove Iceberg
>   support where it makes sense. (Russell/Jack Ye/Blue)
> - Use this as a test dependency for the Iceberg project (Jack/Russell)
> - Make this an MVP production grade catalog, assuming that whatever we do put
>   out there will end up being used in production anyway. (Blue/Dan)
> - Keep the responsibilities of the REST implementation as light as possible.
>   (Blue)
> - Support the HTTP(S) protocol; the service will act as a load balancer + proxy
>   to the JDBC backend. (Blue)
> - Container image + k8s installation (Blue)
> - Use for Iceberg education and evaluation (Bits)
> - Use as a blueprint for designing your own implementation (JB)
>
> Anti-patterns:
>
> - Avoid becoming the Hive Metastore project, where we support every use case.
> - Don’t support data governance cases like lineage. (Dan)
> - Don’t support metrics reporting. (Blue/Dan)
> - Don’t support security. (Blue)
> - Don’t support a wide range of protocols outside of HTTP(S) (Dan)
> - In general, avoid spending time integrating with whatever runtime a given
>   company uses that removes focus from the core project goals and spec. (Dan)
> - Don’t be overly opinionated with tool choices. (Dan)
>
> Implementation ideas:
>
> - apache/iceberg-catalog repository, with all of the catalog impls moved and
>   maintained there as well. (Blue/Dan/Jack/JB/Russell)
> - A catalog implementation per JDBC backend. (Blue)
> - Servlet like Tomcat or Spring to run / package the service. (Blue)
>
> On Fri, Mar 1, 2024 at 2:54 AM Jean-Baptiste Onofré <[email protected]> wrote:
>>
>> Hi Renjie,
>>
>> maybe I wasn't clear, sorry about that: the target is really both ref
>> impl (where we can test different Iceberg parts like we do with the
>> InMemoryCatalog, JdbcCatalog, etc) and ready to go service for users
>> (simple but to start with).
>>
>> But we can't prevent the community from working on a production grade
>> catalog. The point is: if it's not in Iceberg, then it's going to be
>> elsewhere (another ASF project, vendor project, whatever). This is OK
>> as long as we have a reference implementation in Iceberg. That's the
>> min we should guarantee imho.
>> For instance, for the JAXRS spec, the ref implementation is CXF-RS,
>> but there are other implementations. The same for OSGi Blueprint, the
>> ref implementation is in Apache Aries (aries-blueprint).
>>
>> My proposal is really a simple ref impl in Iceberg (submodule or
>> separate repo, both are OK for me even if I have a preference for
>> separate repo to keep things clean and different lifecycle as we do
>> for iceberg-rust or iceberg-python).
>>
>> That said, I don't see why we could not have iceberg-catalog repo with
>> a ref impl that evolves to something production ready. Observability,
>> scaling, pluggable backend, etc can be implemented there and it would
>> be a great addition for Iceberg with new contributors from the
>> community I'm sure. Separated repo would make this doable imho,
>> Iceberg still focuses on the spec.
>>
>> Regards
>> JB
>>
>> On Fri, Mar 1, 2024 at 9:24 AM Renjie Liu <[email protected]> wrote:
>> >
>> > Hi:
>> >
>> > I think one thing missing in the discussion is that, if the Iceberg
>> > community wants to maintain a REST catalog service, what's the target use
>> > case? Different target use cases may lead to different directions.
>> >
>> > If it's mainly designed for first time users to play or experiment with
>> > the REST catalog, then maybe a submodule in the Java repo or a
>> > test-jar would be enough.
>> >
>> > If it's targeted toward production usage, things get complicated. There
>> > are too many things to think about, such as using different storage
>> > backends, monitoring, HA, scalability, etc. What's more, in an enterprise,
>> > the Iceberg REST catalog is usually only part of a data platform; there are
>> > many other things involved. In this case, I'm skeptical about the actual
>> > value of a REST catalog server, and a spec or a library would be more
>> > valuable.
>> >
>> > On Fri, Mar 1, 2024 at 3:49 PM Jean-Baptiste Onofré <[email protected]>
>> > wrote:
>> >>
>> >> Hi Fokko
>> >>
>> >> If service means the actual runtime service, I partially agree.
>> >>
>> >> I would love to see REST Catalog API the "central cornerstone" used in
>> >> iceberg-java, pyiceberg, etc. So I think we should provide the
>> >> resources for a user to bootstrap a REST Catalog ref impl.
>> >> A lot of Apache projects provide both specs and runtime (for some
>> >> part): Apache Camel, Apache ActiveMQ, Apache Karaf, Apache Kafka, ...
>> >> That's why it would make sense to have it in a separate Iceberg repo
>> >> (iceberg-catalog) to keep the Iceberg main repo focused on the spec.
>> >> Iceberg would need both spec and simple runtime for ref impl. It would
>> >> be a bit "hypocritical" (to our users :)) to say we have the spec but not
>> >> impl. It's like you have Iceberg spec but no Spark or Flink
>> >> extensions.
>> >> Imagine Apache ActiveMQ says we have JMS 3.0 support but no
>> >> runtime/service :)
>> >>
>> >> That's my $0.02, but if we want to promote the REST Catalog (and I
>> >> think it's a good approach), then Iceberg should provide a ref impl
>> >> ready to run (without preventing other impl of course).
>> >>
>> >> Regards
>> >> JB
>> >>
>> >> On Fri, Mar 1, 2024 at 8:13 AM Fokko Driesprong <[email protected]> wrote:
>> >> >
>> >> > Hey everyone,
>> >> >
>> >> > Thanks for raising this. I think a test-jar would be a great first step.
>> >> >
>> >> > We already maintain "service" considering JDBC, Hive, etc catalogs.
>> >> > REST Catalog ref impl in Iceberg would be the same.
>> >> >
>> >> >
>> >> > What I think Ryan means by a service is having to maintain Postgres
>> >> > (JDBC backend), Hive Metastore (Hive backend), etc. There is a lot to
>> >> > it to properly scale these backends.
>> >> >
>> >> > For PyIceberg we decided to build the examples backed by the
>> >> > SqlCatalog. This can be either in memory or on a local disk (sqlite). Of
>> >> > course, it has limited parallelism, but makes it easy to give Iceberg a
>> >> > try. One of the main motivations for doing it this way was that it
>> >> > doesn't require any additional services. Running additional services
>> >> > would require having JRE/Docker/etc installed and potentially
>> >> > also an RDBMS backend to persist the data.
>> >> >
>> >> > Kind regards,
>> >> > Fokko
>> >> >
>> >> >
>> >> > On Fri, Mar 1, 2024 at 7:34 AM Jean-Baptiste Onofré
>> >> > <[email protected]> wrote:
>> >> >>
>> >> >> Hi Ryan
>> >> >>
>> >> >> If we plan to reduce the number of catalogs (and I think it makes
>> >> >> sense and I'm with you on that), we will need an impl/service in
>> >> >> Iceberg for the REST Catalog API, else the users won't be able to use
>> >> >> Iceberg "out of the box".
>> >> >> We already maintain "service" considering JDBC, Hive, etc catalogs.
>> >> >> REST Catalog ref impl in Iceberg would be the same.
>> >> >>
>> >> >> So, in order to promote the REST Catalog API as the Catalog "unique"
>> >> >> façade for Iceberg, I would be in favor of having a simple REST
>> >> >> service in Iceberg.
>> >> >> It would be the entry point for Iceberg users and they can use other
>> >> >> REST catalogs depending on their needs (Gravitino, Tabular, ...).
>> >> >>
>> >> >> Regards
>> >> >> JB
>> >> >>
>> >> >> On Fri, Mar 1, 2024 at 1:28 AM Ryan Blue <[email protected]> wrote:
>> >> >> >
>> >> >> > There is a reference implementation in the project, in the
>> >> >> > CatalogHandlers class. That implements REST requests using a catalog
>> >> >> > and returns REST responses. I believe this is what Gravitino relies
>> >> >> > on and I mentioned it above in the discussion about whether we
>> >> >> > should have a catalog service.
>> >> >> >
>> >> >> > Catalog tests also use catalog handlers, but use a simple HTTP
>> >> >> > wrapper to test the HTTP client. There is also a test class that
>> >> >> > accepts HTTP calls directly and also runs JSON serialization on
>> >> >> > requests and responses.
>> >> >> >
>> >> >> > So far, the Iceberg community has avoided maintaining a service.
>> >> >> > That brings in a lot of complications. So far, we’ve preferred to
>> >> >> > remain focused on providing a library that can be used to wire up
>> >> >> > something like a REST catalog, but not provide a runtime service.
>> >> >> >
>> >> >> > Ryan
>> >> >> >
>> >> >> >
>> >> >> > On Thu, Feb 29, 2024 at 2:59 AM Jean-Baptiste Onofré
>> >> >> > <[email protected]> wrote:
>> >> >> >>
>> >> >> >> Hi Ajantha,
>> >> >> >>
>> >> >> >> Thanks for sharing your thoughts.
>> >> >> >>
>> >> >> >> It makes sense for Gravitino to be a TLP (after the incubation period)
>> >> >> >> because Gravitino is "more" than an Iceberg catalog. It implements the
>> >> >> >> Iceberg REST Catalog API, but it's also a metadata catalog/repo with
>> >> >> >> additional features.
>> >> >> >>
>> >> >> >> That said, I agree with what you said:
>> >> >> >> 1. We have the openapi yaml in the Iceberg project, but no reference
>> >> >> >> implementation in the project itself. I think REST Catalog is a good
>> >> >> >> approach as a "central" Catalog API because any Iceberg engine/layer
>> >> >> >> could use this API (even if written in Python, Rust, Go, whatever),
>> >> >> >> and it allows new use cases (like easily moving data from one engine to
>> >> >> >> another as the catalog API would be the same).
>> >> >> >> 2. From an ASF standpoint, I would not talk about "subproject" but
>> >> >> >> more repositories. The reason is that in terms of governance, it's
>> >> >> >> still the Iceberg project (PMC member or committer has the same
>> >> >> >> permission on all repositories in the Iceberg project, it's not
>> >> >> >> possible to have a committer only on iceberg-rust for instance).
>> >> >> >> Generally speaking, we should limit the number of subprojects.
>> >> >> >> 3. I think it would be fair to have REST Catalog resources (openapi
>> >> >> >> yaml + a ref impl) in an iceberg-catalog repository.
>> >> >> >> 4. However, it's important to have a more global discussion within the
>> >> >> >> community about Iceberg 2.0 and the roadmap about catalogs: do we
>> >> >> >> deprecate Iceberg Java Catalog API in favor of the REST Catalog API?
>> >> >> >> What do we do with the existing catalogs? etc. I think it's a fair
>> >> >> >> discussion to have for Iceberg 2.0.
>> >> >> >>
>> >> >> >> It's an important discussion, community driven.
>> >> >> >>
>> >> >> >> Regards
>> >> >> >> JB
>> >> >> >>
>> >> >> >> On Thu, Feb 29, 2024 at 9:44 AM Ajantha Bhat
>> >> >> >> <[email protected]> wrote:
>> >> >> >> >
>> >> >> >> > I apologize for the delay in responding.
>> >> >> >> >
>> >> >> >> > I'm pleased to see the development of an open-source REST catalog
>> >> >> >> > implementation, and the potential transition of Gravitino to an
>> >> >> >> > ASF project is certainly promising.
>> >> >> >> > But the REST catalog server implementation will be a small part of
>> >> >> >> > the Gravitino ASF project, which has many other things along with the
>> >> >> >> > catalog.
>> >> >> >> >
>> >> >> >> > While I understand Iceberg's focus on the table format
>> >> >> >> > specification and its implementation,
>> >> >> >> > I would like to propose the creation of a sub-project for the
>> >> >> >> > REST catalog server implementation under the Iceberg repository
>> >> >> >> > (similar to pyiceberg, iceberg-rust, etc.).
>> >> >> >> > This suggestion is based on several reasons:
>> >> >> >> >
>> >> >> >> > Every time we make a change to the REST spec, there is no
>> >> >> >> > reference implementation to refer to or modify.
>> >> >> >> > Many companies such as AWS, Apple, Tabular, and Datastrato are
>> >> >> >> > each implementing their own REST servers.
>> >> >> >> > Consolidating efforts within a sub-project could lead to
>> >> >> >> > efficiency gains and potential collaboration opportunities.
>> >> >> >> > From the perspective of open-source users, the absence of an
>> >> >> >> > open-source implementation for the REST catalog within Iceberg
>> >> >> >> > may be inconvenient or frustrating.
>> >> >> >> >
>> >> >> >> > I believe creating a dedicated sub-project would address these
>> >> >> >> > concerns and enhance the overall usability and collaborative
>> >> >> >> > nature of the Iceberg ecosystem.
>> >> >> >> > I also think we can have a sub-project for kafka-connect and
>> >> >> >> > iceberg tools (delta converter, catalog migrator etc) as they
>> >> >> >> > need not depend on the Iceberg release cycle
>> >> >> >> > and they are independent of table format spec.
>> >> >> >> >
>> >> >> >> > Let me know your thoughts on this. I can open a separate thread
>> >> >> >> > for discussion if required.
>> >> >> >> >
>> >> >> >> > - Ajantha
>> >> >> >> >
>> >> >> >> >
>> >> >> >> > On Wed, Jan 31, 2024 at 5:32 AM Jack Ye <[email protected]> wrote:
>> >> >> >> >>
>> >> >> >> >> +1 for using test-jar!
>> >> >> >> >>
>> >> >> >> >> -Jack
>> >> >> >> >>
>> >> >> >> >> On Fri, Jan 26, 2024 at 10:48 AM Ryan Blue <[email protected]> wrote:
>> >> >> >> >>>
>> >> >> >> >>> I think I'd be fine exposing this through a test Jar, but it
>> >> >> >> >>> seems to me that if we were to put it into a normal package it
>> >> >> >> >>> would turn into the situation we want to avoid. People would
>> >> >> >> >>> use it for unintended purposes and it would become a
>> >> >> >> >>> distraction.
>> >> >> >> >>>
>> >> >> >> >>> What do you think about using the tests Jar for this?
>> >> >> >> >>>
>> >> >> >> >>> On Thu, Jan 25, 2024 at 12:48 PM Jack Ye <[email protected]> wrote:
>> >> >> >> >>>>
>> >> >> >> >>>> Yes, sorry I did not make it clear, I also agree it is not the
>> >> >> >> >>>> right direction to invest a lot of community effort. I am more
>> >> >> >> >>>> talking about casual use cases like importing a server for
>> >> >> >> >>>> unit tests outside Iceberg, running some local debugging, etc.
>> >> >> >> >>>> I think it would be valuable to provide a server in Iceberg
>> >> >> >> >>>> for that purpose, and maybe vend it as test utils. Thoughts?
>> >> >> >> >>>>
>> >> >> >> >>>> -Jack
>> >> >> >> >>>>
>> >> >> >> >>>> On Thu, Jan 25, 2024 at 11:35 AM Ryan Blue <[email protected]> wrote:
>> >> >> >> >>>>>
>> >> >> >> >>>>> > I know we have the RESTCatalogAdapter and RESTCatalogServlet
>> >> >> >> >>>>> > for unit tests, and technically we have a very similar
>> >> >> >> >>>>> > Jetty server implementation in TestRESTCatalog. Should we
>> >> >> >> >>>>> > think about moving those components out of the tests into
>> >> >> >> >>>>> > an iceberg-rest-server module for this use case, and merge
>> >> >> >> >>>>> > with the implementation that Gravitino has?
>> >> >> >> >>>>>
>> >> >> >> >>>>> I think that this would take the Iceberg project in the wrong
>> >> >> >> >>>>> direction. Iceberg has always been a library and I think it
>> >> >> >> >>>>> should continue to be. Concerns about runtime should be left
>> >> >> >> >>>>> to other projects that need to fit into existing
>> >> >> >> >>>>> infrastructure or skillsets of people maintaining them. The
>> >> >> >> >>>>> question of whether to use Jetty or Tomcat or whatever else
>> >> >> >> >>>>> is a serious consideration, as is how to monitor that
>> >> >> >> >>>>> application and send metrics. I think it would slow down the
>> >> >> >> >>>>> core purpose of Iceberg if we got distracted by these things.
>> >> >> >> >>>>>
>> >> >> >> >>>>> In fact, I think that this project shows that the library is
>> >> >> >> >>>>> getting the balance right: it is using `CatalogHandlers` for
>> >> >> >> >>>>> their intended purpose. It has opinions about how to run the
>> >> >> >> >>>>> actual HTTP service and people that agree can use it. Other
>> >> >> >> >>>>> people could use `CatalogHandlers` to build on a different
>> >> >> >> >>>>> foundation.
>> >> >> >> >>>>>
>> >> >> >> >>>>> On Thu, Jan 25, 2024 at 11:13 AM Jack Ye
>> >> >> >> >>>>> <[email protected]> wrote:
>> >> >> >> >>>>>>
>> >> >> >> >>>>>> Really cool project!
>> >> >> >> >>>>>>
>> >> >> >> >>>>>> I browsed a bit of the codebase, and see this implementation
>> >> >> >> >>>>>> of the REST service backend:
>> >> >> >> >>>>>> - https://github.com/datastrato/gravitino/blob/main/catalogs/catalog-lakehouse-iceberg/src/main/java/com/datastrato/gravitino/catalog/lakehouse/iceberg/IcebergRESTService.java#L39
>> >> >> >> >>>>>> - https://github.com/datastrato/gravitino/blob/main/catalogs/catalog-lakehouse-iceberg/src/main/java/com/datastrato/gravitino/catalog/lakehouse/iceberg/ops/IcebergTableOps.java#L42-L51
>> >> >> >> >>>>>>
>> >> >> >> >>>>>> Looks like it is initializing a Jetty server that uses
>> >> >> >> >>>>>> CatalogHandlers to delegate the execution to a specific Java
>> >> >> >> >>>>>> Catalog implementation.
>> >> >> >> >>>>>>
>> >> >> >> >>>>>> I think this is actually something that is lacking today in
>> >> >> >> >>>>>> Iceberg, which is an easy way for users to start an actual
>> >> >> >> >>>>>> REST HTTP server.
>> >> >> >> >>>>>>
>> >> >> >> >>>>>> I know we have the RESTCatalogAdapter and RESTCatalogServlet
>> >> >> >> >>>>>> for unit tests, and technically we have a very similar Jetty
>> >> >> >> >>>>>> server implementation in TestRESTCatalog. Should we think
>> >> >> >> >>>>>> about moving those components out of the tests into an
>> >> >> >> >>>>>> iceberg-rest-server module for this use case, and merge with
>> >> >> >> >>>>>> the implementation that Gravitino has?
>> >> >> >> >>>>>>
>> >> >> >> >>>>>> Best,
>> >> >> >> >>>>>> Jack Ye
>> >> >> >> >>>>>>
>> >> >> >> >>>>>> On Thu, Jan 25, 2024 at 10:47 AM Yufei Gu
>> >> >> >> >>>>>> <[email protected]> wrote:
>> >> >> >> >>>>>>>
>> >> >> >> >>>>>>> Thanks, Justin, for sharing.
>> >> >> >> >>>>>>>
>> >> >> >> >>>>>>> It's pretty cool to see an open source REST catalog
>> >> >> >> >>>>>>> implementation in action. Having dabbled a bit in the early
>> >> >> >> >>>>>>> development of Gravitino myself, I'm really excited about
>> >> >> >> >>>>>>> its potential with the Iceberg REST catalog.
>> >> >> >> >>>>>>>
>> >> >> >> >>>>>>> The idea of Gravitino moving to an ASF project is
>> >> >> >> >>>>>>> promising. It’ll surely boost its visibility and open up
>> >> >> >> >>>>>>> more doors for collaboration and adoption.
>> >> >> >> >>>>>>>
>> >> >> >> >>>>>>> Looking forward to where this goes. Keep up the fantastic
>> >> >> >> >>>>>>> work!
>> >> >> >> >>>>>>>
>> >> >> >> >>>>>>> Yufei
>> >> >> >> >>>>>>>
>> >> >> >> >>>>>>>
>> >> >> >> >>>>>>> On Thu, Jan 25, 2024 at 5:55 AM Jean-Baptiste Onofré
>> >> >> >> >>>>>>> <[email protected]> wrote:
>> >> >> >> >>>>>>>>
>> >> >> >> >>>>>>>> Hi Justin,
>> >> >> >> >>>>>>>>
>> >> >> >> >>>>>>>> I talked with Junping a couple of months ago about
>> >> >> >> >>>>>>>> Gravitino. Thanks for sharing!
>> >> >> >> >>>>>>>>
>> >> >> >> >>>>>>>> Regards
>> >> >> >> >>>>>>>> JB
>> >> >> >> >>>>>>>>
>> >> >> >> >>>>>>>> On Thu, Jan 25, 2024 at 12:15 AM Justin Mclean
>> >> >> >> >>>>>>>> <[email protected]> wrote:
>> >> >> >> >>>>>>>> >
>> >> >> >> >>>>>>>> > Hi,
>> >> >> >> >>>>>>>> >
>> >> >> >> >>>>>>>> > We open-sourced a new project, Gravitino, in December
>> >> >> >> >>>>>>>> > and have been working on growing the community and
>> >> >> >> >>>>>>>> > adding new functionality. We plan to donate the project
>> >> >> >> >>>>>>>> > to the ASF this year. Gravitino is a unified metadata
>> >> >> >> >>>>>>>> > lake solution offering a unified approach to managing
>> >> >> >> >>>>>>>> > datasets from diverse sources and regions across
>> >> >> >> >>>>>>>> > multiple cloud platforms. Its core is an Iceberg REST
>> >> >> >> >>>>>>>> > catalog service implementation to manage Iceberg tables
>> >> >> >> >>>>>>>> > efficiently.
>> >> >> >> >>>>>>>> >
>> >> >> >> >>>>>>>> > If this sounds like something you would be interested
>> >> >> >> >>>>>>>> > in, then the following resources will help:
>> >> >> >> >>>>>>>> > - Blog post:
>> >> >> >> >>>>>>>> >   https://datastrato.ai/blog/gravitino-iceberg-rest-catalog-service/
>> >> >> >> >>>>>>>> > - Gravitino documentation:
>> >> >> >> >>>>>>>> >   https://datastrato.ai/docs/0.3.1/
>> >> >> >> >>>>>>>> > - Iceberg REST service documentation:
>> >> >> >> >>>>>>>> >   https://datastrato.ai/docs/0.3.1/iceberg-rest-service
>> >> >> >> >>>>>>>> >
>> >> >> >> >>>>>>>> > We welcome any feedback and suggestions you have, and as
>> >> >> >> >>>>>>>> > always, all contributions are welcome. You can find the
>> >> >> >> >>>>>>>> > source code at https://github.com/datastrato/gravitino.
>> >> >> >> >>>>>>>> >
>> >> >> >> >>>>>>>> > Kind Regards,
>> >> >> >> >>>>>>>> > Justin
>> >> >> >> >>>>>
>> >> >> >> >>>>> --
>> >> >> >> >>>>> Ryan Blue
>> >> >> >> >>>>> Tabular
>> >> >> >> >>>
>> >> >> >> >>> --
>> >> >> >> >>> Ryan Blue
>> >> >> >> >>> Tabular
>> >> >> >
>> >> >> > --
>> >> >> > Ryan Blue
>> >> >> > Tabular
