Could we have a separate devlist thread dedicated to this discussion? It is a bit awkward to continue this critical Iceberg 2.0 catalog consolidation topic under this Gravitino thread, although I understand it is related. I, for one, have overlooked all these discussions until now, and I suspect many other people have too.
-Jack

On Fri, Mar 1, 2024 at 3:09 PM Ryan Blue <b...@tabular.io> wrote:

> To clarify, what I meant was that Iceberg has, so far, avoided providing runtime services that are ready to be deployed and used. I still think that is a good choice, and I agree with the reasons that Renjie listed.
>
> I disagree that this is inconsistent. We don't supply any of the other services that catalog implementations use. As Fokko pointed out, the JDBC catalog client talks to a database, the Hive catalog talks to a Hive Metastore, and the Nessie catalog talks to a Nessie deployment. Similarly, the REST catalog is a client. We have CatalogHandlers, which is a reference implementation for REST service and catalog logic.
>
> The main thing that we don't provide is a deployable runtime REST catalog service. We may choose to add one in order to make it easier to move to the REST client, but I'm not confident that is the right choice vs encouraging other projects.
>
> On Fri, Mar 1, 2024 at 4:43 AM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>
>> Hey Brian
>>
>> Thanks for the summary ! Good one !
>>
>> I would just add the "REST ref impl" discussion.
>>
>> Regarding the anti patterns, I agree with the lists, imho, some are more "opinionated implementation", so definitely not in the API scope.
>> +1
>>
>> Thanks again !
>>
>> Regards
>> JB
>>
>> On Fri, Mar 1, 2024 at 1:15 PM Brian Olsen <bitsondata...@gmail.com> wrote:
>> >
>> > My attempt to consolidate a list of goals, anti patterns, and impl details mentioned since this discussion was brought up at the last Iceberg sync. Tried to roughly capture who mentioned these things so we can follow up if needed. Hopefully this can serve as a basis for the design discussion.
>> >
>> > Goals:
>> >
>> > - Remove the initial burden of choice of which REST implementation from new users getting started with Iceberg (Russell S)
>> > - Cut down on the supported catalogs that are no longer in use (e.g.
>> >   DynamoDB) or never intended for production (e.g. Hadoop) to minimize maintenance, lower variability, and lower the burden of choice on Iceberg users. (Blue)
>> > - Simplify plugging in your own catalog so the Iceberg project isn’t responsible for maintaining and testing a bunch of dialects. (Blue)
>> > - Aim for a REST catalog centric future and continue to remove Iceberg support where it makes sense. (Russell/Jack Ye/Blue)
>> > - Use this as a test dependency for the Iceberg project (Jack/Russell)
>> > - Make this an MVP production grade catalog, assuming that whatever we do put out there will end up being used as production anyways. (Blue/Dan)
>> > - Keep the responsibilities of the REST implementation as light as possible. (Blue)
>> > - Support HTTP(S) protocol, the service will act as a load balancer + proxy to the JDBC backend. (Blue)
>> > - Container image + k8s installation (Blue)
>> > - Use for Iceberg education and evaluation (Bits)
>> > - Use as a blueprint for designing your own implementation (JB)
>> >
>> > Anti patterns:
>> >
>> > - Avoid becoming the Hive Metastore project, where we support every use case.
>> > - Don’t support data governance cases like lineage. (Dan)
>> > - Don’t support metrics reporting. (Blue/Dan)
>> > - Don’t support security. (Blue)
>> > - Don’t support a wide range of protocols outside of HTTP(S) (Dan)
>> > - In general, avoid spending time integrating with whatever runtime a given company uses that removes focus from the core project goals and spec. (Dan)
>> > - Don’t be overly opinionated with tool choices. (Dan)
>> >
>> > Implementation ideas:
>> >
>> > - apache/iceberg-catalog repository, with all of the catalog impls moved and maintained there as well. (Blue/Dan/Jack/JB/Russell)
>> > - A catalog implementation per JDBC backend. (Blue)
>> > - Servlet like Tomcat or Spring to run / package the service.
>> >   (Blue)
>> >
>> > On Fri, Mar 1, 2024 at 2:54 AM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>> >>
>> >> Hi Renjie,
>> >>
>> >> maybe I wasn't clear, sorry about that: the target is really both a ref impl (where we can test different Iceberg parts like we do with the InMemoryCatalog, JdbcCatalog, etc) and a ready to go service for users (simple, but something to start with).
>> >>
>> >> But we can't prevent the community from working on a production grade catalog. The point is: if it's not in Iceberg, then it's going to be elsewhere (another ASF project, vendor project, whatever). This is OK as long as we have a reference implementation in Iceberg. That's the min we should guarantee imho.
>> >> For instance, for the JAXRS spec, the ref implementation is CXF-RS, but there are other implementations. The same for OSGi Blueprint, the ref implementation is in Apache Aries (aries-blueprint).
>> >>
>> >> My proposal is really a simple ref impl in Iceberg (submodule or separate repo, both are OK for me even if I have a preference for a separate repo to keep things clean and on a different lifecycle, as we do for iceberg-rust or iceberg-python).
>> >>
>> >> That said, I don't see why we could not have an iceberg-catalog repo with a ref impl that evolves to something production ready. Observability, scaling, pluggable backend, etc can be implemented there and it would be a great addition for Iceberg with new contributors from the community I'm sure. A separate repo would make this doable imho, while Iceberg still focuses on the spec.
>> >>
>> >> Regards
>> >> JB
>> >>
>> >> On Fri, Mar 1, 2024 at 9:24 AM Renjie Liu <liurenjie2...@gmail.com> wrote:
>> >> >
>> >> > Hi:
>> >> >
>> >> > I think one thing missing in the discussion is that, if the iceberg community wants to maintain a rest catalog service, what's the target use case? Different target use cases may lead to different directions.
>> >> >
>> >> > If it's mainly designed for first time users to play with or try out the rest catalog, then maybe a submodule in the java repo or a test-jar would be enough.
>> >> >
>> >> > If it's targeted toward production usage, things get complicated. There are too many things to think about, such as using different storage backends, monitoring, ha, scalability etc. What's more, in an enterprise the iceberg rest catalog usually is only part of a data platform, and there are many other things involved. In this case, I'm skeptical about the actual value of a rest catalog server, and a spec or a library would be more valuable.
>> >> >
>> >> > On Fri, Mar 1, 2024 at 3:49 PM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>> >> >>
>> >> >> Hi Fokko
>> >> >>
>> >> >> If service means the actual runtime service, I partially agree.
>> >> >>
>> >> >> I would love to see the REST Catalog API as the "central cornerstone" used in iceberg-java, pyiceberg, etc. So I think we should provide the resources for a user to bootstrap a REST Catalog ref impl.
>> >> >> A lot of Apache projects provide both specs and runtime (for some part): Apache Camel, Apache ActiveMQ, Apache Karaf, Apache Kafka, ...
>> >> >> That's why it would make sense to have it in a separate Iceberg repo (iceberg-catalog) to keep the iceberg main repo focused on the spec.
>> >> >> Iceberg would need both the spec and a simple runtime for the ref impl. It would be a bit "hypocritical" (to our users :)) to say we have the spec but no impl. It's like having the Iceberg spec but no Spark or Flink extensions.
>> >> >> Imagine Apache ActiveMQ says we have JMS 3.0 support but no runtime/service :)
>> >> >>
>> >> >> That's my $0.02, but if we want to promote the REST Catalog (and I think it's a good approach), then Iceberg should provide a ref impl ready to run (without preventing other impls of course).
>> >> >>
>> >> >> Regards
>> >> >> JB
>> >> >>
>> >> >> On Fri, Mar 1, 2024 at 8:13 AM Fokko Driesprong <fo...@apache.org> wrote:
>> >> >> >
>> >> >> > Hey everyone,
>> >> >> >
>> >> >> > Thanks for raising this. I think a test-jar would be a great first step.
>> >> >> >
>> >> >> > > We already maintain "service" considering JDBC, Hive, etc catalogs. REST Catalog ref impl in Iceberg would be the same.
>> >> >> >
>> >> >> > What I think Ryan means by a service is having to maintain Postgres (JDBC backend), Hive Metastore (Hive backend), etc. There is a lot to it to properly scale these backends.
>> >> >> >
>> >> >> > For PyIceberg we decided to build the examples backed by the SqlCatalog. This can be either in memory or on a local disk (sqlite). Of course, it has limited parallelism, but it makes it easy to give Iceberg a try. One of the main motivations for doing it this way was that it doesn't require any additional services. Running additional services would require having JRE/Docker/etc installed and potentially also an RDBMS backend to persist the data.
>> >> >> >
>> >> >> > Kind regards,
>> >> >> > Fokko
>> >> >> >
>> >> >> > On Fri, Mar 1, 2024 at 7:34 AM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>> >> >> >>
>> >> >> >> Hi Ryan
>> >> >> >>
>> >> >> >> If we plan to reduce the number of catalogs (and I think it makes sense and I'm with you on that), we will need an impl/service in Iceberg for the REST Catalog API, else the users won't be able to use Iceberg "out of the box".
>> >> >> >> We already maintain "service" considering JDBC, Hive, etc catalogs.
>> >> >> >> REST Catalog ref impl in Iceberg would be the same.
>> >> >> >>
>> >> >> >> So, in order to promote the REST Catalog API as the Catalog "unique" façade for Iceberg, I would be in favor of having a simple REST service in Iceberg.
>> >> >> >> It would be the entry point for Iceberg users and they can use other REST catalogs depending on their needs (Gravitino, Tabular, ...).
>> >> >> >>
>> >> >> >> Regards
>> >> >> >> JB
>> >> >> >>
>> >> >> >> On Fri, Mar 1, 2024 at 1:28 AM Ryan Blue <b...@tabular.io> wrote:
>> >> >> >> >
>> >> >> >> > There is a reference implementation in the project, in the CatalogHandlers class. That implements REST requests using a catalog and returns REST responses. I believe this is what Gravitino relies on and I mentioned it above in the discussion about whether we should have a catalog service.
>> >> >> >> >
>> >> >> >> > Catalog tests also use catalog handlers, but use a simple HTTP wrapper to test the HTTP client. There is also a test class that accepts HTTP calls directly and also runs JSON serialization on requests and responses.
>> >> >> >> >
>> >> >> >> > So far, the Iceberg community has avoided maintaining a service. That brings in a lot of complications. So far, we’ve preferred to remain focused on providing a library that can be used to wire up something like a REST catalog, but not provide a runtime service.
>> >> >> >> >
>> >> >> >> > Ryan
>> >> >> >> >
>> >> >> >> > On Thu, Feb 29, 2024 at 2:59 AM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>> >> >> >> >>
>> >> >> >> >> Hi Ajantha,
>> >> >> >> >>
>> >> >> >> >> Thanks for sharing your thoughts.
>> >> >> >> >>
>> >> >> >> >> It makes sense for Gravitino to be a TLP (after the incubation period) because Gravitino is "more" than an Iceberg catalog. It implements the Iceberg REST Catalog API, but it's also a metadata catalog/repo with additional features.
>> >> >> >> >>
>> >> >> >> >> That said, I agree with what you said:
>> >> >> >> >> 1. We have the openapi yaml in the Iceberg project, but no reference implementation in the project itself.
>> >> >> >> >>    I think REST Catalog is a good approach as a "central" Catalog API because any Iceberg engine/layer could use this API (even if written in Python, rust, go, whatever), and it allows new use cases (like easily moving data from one engine to another, as the catalog API would be the same).
>> >> >> >> >> 2. From an ASF standpoint, I would not talk about "subprojects" but rather repositories. The reason is that in terms of governance, it's still the Iceberg project (a PMC member or committer has the same permission on all repositories in the Iceberg project, it's not possible to have a committer only on iceberg-rust for instance). Generally speaking, we should limit the number of subprojects.
>> >> >> >> >> 3. I think it would be fair to have REST Catalog resources (openapi yaml + a ref impl) in an iceberg-catalog repository.
>> >> >> >> >> 4. However, it's important to have a more global discussion within the community about Iceberg 2.0 and the roadmap about catalogs: do we deprecate the Iceberg Java Catalog API in favor of the REST Catalog API ? What do we do with the existing catalogs ? etc. I think it's a fair discussion to have for Iceberg 2.0.
>> >> >> >> >>
>> >> >> >> >> It's an important discussion, community driven.
>> >> >> >> >>
>> >> >> >> >> Regards
>> >> >> >> >> JB
>> >> >> >> >>
>> >> >> >> >> On Thu, Feb 29, 2024 at 9:44 AM Ajantha Bhat <ajanthab...@gmail.com> wrote:
>> >> >> >> >> >
>> >> >> >> >> > I apologize for the delay in responding.
>> >> >> >> >> >
>> >> >> >> >> > I'm pleased to see the development of an open-source REST catalog implementation, and the potential transition of Gravitino to an ASF project is certainly promising.
>> >> >> >> >> > But the REST catalog server implementation will be a small part of the Gravitino ASF project, which has many other things along with the catalog.
>> >> >> >> >> >
>> >> >> >> >> > While I understand Iceberg's focus on the table format specification and its implementation, I would like to propose the creation of a sub-project for the REST catalog server implementation under the Iceberg repository (similar to pyiceberg, iceberg-rust, etc.).
>> >> >> >> >> > This suggestion is based on several reasons:
>> >> >> >> >> >
>> >> >> >> >> > Every time we make a change to the REST spec, there is no reference implementation to refer to or modify.
>> >> >> >> >> > Many companies such as AWS, Apple, Tabular, and Datastrato are each implementing their own REST servers.
>> >> >> >> >> > Consolidating efforts within a sub-project could lead to efficiency gains and potential collaboration opportunities.
>> >> >> >> >> > From the perspective of open-source users, the absence of an open-source implementation for the REST catalog within Iceberg may be inconvenient or frustrating.
>> >> >> >> >> >
>> >> >> >> >> > I believe creating a dedicated sub-project would address these concerns and enhance the overall usability and collaborative nature of the Iceberg ecosystem.
>> >> >> >> >> > I also think we can have a sub-project for kafka-connect and iceberg tools (delta converter, catalog migrator etc) as they need not depend on the Iceberg release cycle and they are independent of the table format spec.
>> >> >> >> >> >
>> >> >> >> >> > Let me know your thoughts on this. I can open a separate thread for discussion if required.
>> >> >> >> >> >
>> >> >> >> >> > - Ajantha
>> >> >> >> >> >
>> >> >> >> >> > On Wed, Jan 31, 2024 at 5:32 AM Jack Ye <yezhao...@gmail.com> wrote:
>> >> >> >> >> >>
>> >> >> >> >> >> +1 for using test-jar!
>> >> >> >> >> >>
>> >> >> >> >> >> -Jack
>> >> >> >> >> >>
>> >> >> >> >> >> On Fri, Jan 26, 2024 at 10:48 AM Ryan Blue <b...@tabular.io> wrote:
>> >> >> >> >> >>>
>> >> >> >> >> >>> I think I'd be fine exposing this through a test Jar, but it seems to me that if we were to put it into a normal package it would turn into the situation we want to avoid. People would use it for unintended purposes and it would become a distraction.
>> >> >> >> >> >>>
>> >> >> >> >> >>> What do you think about using the tests Jar for this?
>> >> >> >> >> >>>
>> >> >> >> >> >>> On Thu, Jan 25, 2024 at 12:48 PM Jack Ye <yezhao...@gmail.com> wrote:
>> >> >> >> >> >>>>
>> >> >> >> >> >>>> Yes, sorry I did not make it clear, I also agree it is not the right direction to invest a lot of community effort. I am more talking about casual use cases like importing a server for unit tests outside Iceberg, running some local debugging, etc. I think it would be valuable to provide a server in Iceberg for that purpose, and maybe vend it as test utils. Thoughts?
>> >> >> >> >> >>>>
>> >> >> >> >> >>>> -Jack
>> >> >> >> >> >>>>
>> >> >> >> >> >>>> On Thu, Jan 25, 2024 at 11:35 AM Ryan Blue <b...@tabular.io> wrote:
>> >> >> >> >> >>>>>
>> >> >> >> >> >>>>> > I know we have the RESTCatalogAdapter and RESTCatalogServlet for unit tests, and technically we have a very similar Jetty server implementation in TestRESTCatalog. Should we think about moving those components out of the tests into an iceberg-rest-server module for this use case, and merge with the implementation that Gravitino has?
>> >> >> >> >> >>>>>
>> >> >> >> >> >>>>> I think that this would take the Iceberg project in the wrong direction. Iceberg has always been a library and I think it should continue to be. Concerns about runtime should be left to other projects that need to fit into existing infrastructure or skillsets of people maintaining them.
>> >> >> >> >> >>>>> The question of whether to use Jetty or Tomcat or whatever else is a serious consideration, as is how to monitor that application and send metrics. I think it would slow down the core purpose of Iceberg if we got distracted by these things.
>> >> >> >> >> >>>>>
>> >> >> >> >> >>>>> In fact, I think that this project shows that the library is getting the balance right: it is using `CatalogHandlers` for their intended purpose. It has opinions about how to run the actual HTTP service and people that agree can use it. Other people could use `CatalogHandlers` to build on a different foundation.
>> >> >> >> >> >>>>>
>> >> >> >> >> >>>>> On Thu, Jan 25, 2024 at 11:13 AM Jack Ye <yezhao...@gmail.com> wrote:
>> >> >> >> >> >>>>>>
>> >> >> >> >> >>>>>> Really cool project!
>> >> >> >> >> >>>>>>
>> >> >> >> >> >>>>>> I browsed a bit of the codebase, and see this implementation of the REST service backend:
>> >> >> >> >> >>>>>> - https://github.com/datastrato/gravitino/blob/main/catalogs/catalog-lakehouse-iceberg/src/main/java/com/datastrato/gravitino/catalog/lakehouse/iceberg/IcebergRESTService.java#L39
>> >> >> >> >> >>>>>> - https://github.com/datastrato/gravitino/blob/main/catalogs/catalog-lakehouse-iceberg/src/main/java/com/datastrato/gravitino/catalog/lakehouse/iceberg/ops/IcebergTableOps.java#L42-L51
>> >> >> >> >> >>>>>>
>> >> >> >> >> >>>>>> Looks like it is initializing a Jetty server that uses CatalogHandlers to delegate the execution to a specific Java Catalog implementation.
>> >> >> >> >> >>>>>>
>> >> >> >> >> >>>>>> I think this is actually something that is lacking today in Iceberg, which is an easy way for users to start an actual REST HTTP server.
>> >> >> >> >> >>>>>>
>> >> >> >> >> >>>>>> I know we have the RESTCatalogAdapter and RESTCatalogServlet for unit tests, and technically we have a very similar Jetty server implementation in TestRESTCatalog.
>> >> >> >> >> >>>>>> Should we think about moving those components out of the tests into an iceberg-rest-server module for this use case, and merge with the implementation that Gravitino has?
>> >> >> >> >> >>>>>>
>> >> >> >> >> >>>>>> Best,
>> >> >> >> >> >>>>>> Jack Ye
>> >> >> >> >> >>>>>>
>> >> >> >> >> >>>>>> On Thu, Jan 25, 2024 at 10:47 AM Yufei Gu <flyrain...@gmail.com> wrote:
>> >> >> >> >> >>>>>>>
>> >> >> >> >> >>>>>>> Thanks Justin for sharing.
>> >> >> >> >> >>>>>>>
>> >> >> >> >> >>>>>>> It's pretty cool to see an open source REST catalog implementation in action. Having dabbled a bit in the early development of Gravitino myself, I'm really excited about its potential with the Iceberg REST catalog.
>> >> >> >> >> >>>>>>>
>> >> >> >> >> >>>>>>> The idea of Gravitino moving to an ASF project is promising. It’ll surely boost its visibility and open up more doors for collaboration and adoption.
>> >> >> >> >> >>>>>>>
>> >> >> >> >> >>>>>>> Looking forward to where this goes. Keep up the fantastic work!
>> >> >> >> >> >>>>>>>
>> >> >> >> >> >>>>>>> Yufei
>> >> >> >> >> >>>>>>>
>> >> >> >> >> >>>>>>> On Thu, Jan 25, 2024 at 5:55 AM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>> >> >> >> >> >>>>>>>>
>> >> >> >> >> >>>>>>>> Hi Justin,
>> >> >> >> >> >>>>>>>>
>> >> >> >> >> >>>>>>>> I talked with Junping a couple of months ago about Gravitino. Thanks for sharing !
>> >> >> >> >> >>>>>>>>
>> >> >> >> >> >>>>>>>> Regards
>> >> >> >> >> >>>>>>>> JB
>> >> >> >> >> >>>>>>>>
>> >> >> >> >> >>>>>>>> On Thu, Jan 25, 2024 at 12:15 AM Justin Mclean <jus...@classsoftware.com> wrote:
>> >> >> >> >> >>>>>>>> >
>> >> >> >> >> >>>>>>>> > Hi,
>> >> >> >> >> >>>>>>>> >
>> >> >> >> >> >>>>>>>> > We open-sourced a new project, Gravitino, in December and have been working on growing the community and adding new functionality. We plan to donate the project to the ASF this year.
>> >> >> >> >> >>>>>>>> > Gravitino is a unified metadata lake solution offering a unified approach to managing datasets from diverse sources and regions across multiple cloud platforms. Its core is an Iceberg REST catalog service implementation to manage Iceberg tables efficiently.
>> >> >> >> >> >>>>>>>> >
>> >> >> >> >> >>>>>>>> > If this sounds like something you would be interested in, then the following resources will help:
>> >> >> >> >> >>>>>>>> > - Blog post: https://datastrato.ai/blog/gravitino-iceberg-rest-catalog-service/
>> >> >> >> >> >>>>>>>> > - Gravitino documentation: https://datastrato.ai/docs/0.3.1/
>> >> >> >> >> >>>>>>>> > - Iceberg REST service documentation: https://datastrato.ai/docs/0.3.1/iceberg-rest-service
>> >> >> >> >> >>>>>>>> >
>> >> >> >> >> >>>>>>>> > We welcome any feedback and suggestions you have, and as always, all contributions are welcome. You can find the source code at https://github.com/datastrato/gravitino.
>> >> >> >> >> >>>>>>>> >
>> >> >> >> >> >>>>>>>> > Kind Regards,
>> >> >> >> >> >>>>>>>> > Justin
>> >> >> >> >> >>>>>
>> >> >> >> >> >>>>> --
>> >> >> >> >> >>>>> Ryan Blue
>> >> >> >> >> >>>>> Tabular
>> >> >> >> >> >>>
>> >> >> >> >> >>> --
>> >> >> >> >> >>> Ryan Blue
>> >> >> >> >> >>> Tabular
>> >> >> >> >
>> >> >> >> > --
>> >> >> >> > Ryan Blue
>> >> >> >> > Tabular
>
> --
> Ryan Blue
> Tabular