Hey everyone,

Thanks for raising this. I think a test-jar would be a great first step.

> We already maintain "service" considering JDBC, Hive, etc catalogs.
> REST Catalog ref impl in Iceberg would be the same.

What I think Ryan means by a service is having to maintain Postgres (JDBC backend), Hive Metastore (Hive backend), etc. There is a lot involved in scaling these backends properly.

For PyIceberg we decided to build the examples backed by the SqlCatalog. This can run either in memory or on local disk (SQLite). Of course, it has limited parallelism, but it makes it easy to give Iceberg a try. One of the main motivations for doing it this way was that it doesn't require any additional services. Running additional services would require a JRE, Docker, etc. to be installed, and potentially also an RDBMS backend to persist the data.

Kind regards,
Fokko

On Fri, Mar 1, 2024 at 07:34, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:

> Hi Ryan
>
> If we plan to reduce the number of catalogs (and I think it makes sense and I'm with you on that), we will need an impl/service in Iceberg for the REST Catalog API, else the users won't be able to use Iceberg "out of the box".
> We already maintain "service" considering JDBC, Hive, etc catalogs.
> REST Catalog ref impl in Iceberg would be the same.
>
> So, in order to promote the REST Catalog API as the Catalog "unique" façade for Iceberg, I would be in favor of having a simple REST service in Iceberg.
> It would be the entry point for Iceberg users and they can use other REST catalogs depending on their needs (Gravitino, Tabular, ...).
>
> Regards
> JB
>
> On Fri, Mar 1, 2024 at 1:28 AM Ryan Blue <b...@tabular.io> wrote:
> >
> > There is a reference implementation in the project, in the CatalogHandlers class. That implements REST requests using a catalog and returns REST responses. I believe this is what Gravitino relies on and I mentioned it above in the discussion about whether we should have a catalog service.
> >
> > Catalog tests also use catalog handlers, but use a simple HTTP wrapper to test the HTTP client. There is also a test class that accepts HTTP calls directly and also runs JSON serialization on requests and responses.
> >
> > So far, the Iceberg community has avoided maintaining a service. That brings in a lot of complications. So far, we’ve preferred to remain focused on providing a library that can be used to wire up something like a REST catalog, but not provide a runtime service.
> >
> > Ryan
> >
> > On Thu, Feb 29, 2024 at 2:59 AM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> >>
> >> Hi Ajantha,
> >>
> >> Thanks for sharing your thoughts.
> >>
> >> It makes sense for Gravitino to be a TLP (after the incubation period) because Gravitino is "more" than an Iceberg catalog. It implements the Iceberg REST Catalog API, but it's also a metadata catalog/repo with additional features.
> >>
> >> That said, I agree with what you said:
> >> 1. We have the openapi yaml in the Iceberg project, but no reference implementation in the project itself. I think REST Catalog is a good approach as a "central" Catalog API because any Iceberg engine/layer could use this API (even if written in Python, Rust, Go, whatever), and it allows new use cases (like easily moving data from one engine to another, as the catalog API would be the same).
> >> 2. From an ASF standpoint, I would not talk about "subprojects" but rather repositories. The reason is that, in terms of governance, it's still the Iceberg project (a PMC member or committer has the same permissions on all repositories in the Iceberg project; it's not possible to have a committer only on iceberg-rust, for instance). Generally speaking, we should limit the number of subprojects.
> >> 3. I think it would be fair to have REST Catalog resources (openapi yaml + a ref impl) in an iceberg-catalog repository.
> >> 4. However, it's important to have a more global discussion within the community about Iceberg 2.0 and the roadmap for catalogs: do we deprecate the Iceberg Java Catalog API in favor of the REST Catalog API? What do we do with the existing catalogs? etc. I think it's a fair discussion to have for Iceberg 2.0.
> >>
> >> It's an important discussion, community driven.
> >>
> >> Regards
> >> JB
> >>
> >> On Thu, Feb 29, 2024 at 9:44 AM Ajantha Bhat <ajanthab...@gmail.com> wrote:
> >> >
> >> > I apologize for the delay in responding.
> >> >
> >> > I'm pleased to see the development of an open-source REST catalog implementation, and the potential transition of Gravitino to an ASF project is certainly promising.
> >> > But the REST catalog server implementation will be a small part of the Gravitino ASF project, which has many other things along with the catalog.
> >> >
> >> > While I understand Iceberg's focus on the table format specification and its implementation,
> >> > I would like to propose the creation of a sub-project for the REST catalog server implementation under the Iceberg repository (similar to pyiceberg, iceberg-rust, etc.).
> >> > This suggestion is based on several reasons:
> >> >
> >> > Every time we make a change to the REST spec, there is no reference implementation to refer to or modify.
> >> > Many companies such as AWS, Apple, Tabular, and Datastrato are each implementing their own REST servers.
> >> > Consolidating efforts within a sub-project could lead to efficiency gains and potential collaboration opportunities.
> >> > From the perspective of open-source users, the absence of an open-source implementation for the REST catalog within Iceberg may be inconvenient or frustrating.
> >> >
> >> > I believe creating a dedicated sub-project would address these concerns and enhance the overall usability and collaborative nature of the Iceberg ecosystem.
> >> > I also think we can have a sub-project for kafka-connect and iceberg tools (delta converter, catalog migrator, etc.), as they need not depend on the Iceberg release cycle and they are independent of the table format spec.
> >> >
> >> > Let me know your thoughts on this. I can open a separate thread for discussion if required.
> >> >
> >> > - Ajantha
> >> >
> >> > On Wed, Jan 31, 2024 at 5:32 AM Jack Ye <yezhao...@gmail.com> wrote:
> >> >>
> >> >> +1 for using test-jar!
> >> >>
> >> >> -Jack
> >> >>
> >> >> On Fri, Jan 26, 2024 at 10:48 AM Ryan Blue <b...@tabular.io> wrote:
> >> >>>
> >> >>> I think I'd be fine exposing this through a test Jar, but it seems to me that if we were to put it into a normal package it would turn into the situation we want to avoid. People would use it for unintended purposes and it would become a distraction.
> >> >>>
> >> >>> What do you think about using the tests Jar for this?
> >> >>>
> >> >>> On Thu, Jan 25, 2024 at 12:48 PM Jack Ye <yezhao...@gmail.com> wrote:
> >> >>>>
> >> >>>> Yes, sorry I did not make it clear, I also agree it is not the right direction to invest a lot of community effort. I am talking more about casual use cases like importing a server for unit tests outside Iceberg, running some local debugging, etc. I think it would be valuable to provide a server in Iceberg for that purpose, and maybe vend it as test utils. Thoughts?
> >> >>>>
> >> >>>> -Jack
> >> >>>>
> >> >>>> On Thu, Jan 25, 2024 at 11:35 AM Ryan Blue <b...@tabular.io> wrote:
> >> >>>>>
> >> >>>>> > I know we have the RESTCatalogAdapter and RESTCatalogServlet for unit tests, and technically we have a very similar Jetty server implementation in TestRESTCatalog. Should we think about moving those components out of the tests into an iceberg-rest-server module for this use case, and merge with the implementation that Gravitino has?
> >> >>>>>
> >> >>>>> I think that this would take the Iceberg project in the wrong direction. Iceberg has always been a library and I think it should continue to be. Concerns about runtime should be left to other projects that need to fit into existing infrastructure or skillsets of people maintaining them. The question of whether to use Jetty or Tomcat or whatever else is a serious consideration, as is how to monitor that application and send metrics. I think it would slow down the core purpose of Iceberg if we got distracted by these things.
> >> >>>>>
> >> >>>>> In fact, I think that this project shows that the library is getting the balance right: it is using `CatalogHandlers` for their intended purpose. It has opinions about how to run the actual HTTP service and people that agree can use it. Other people could use `CatalogHandlers` to build on a different foundation.
> >> >>>>>
> >> >>>>> On Thu, Jan 25, 2024 at 11:13 AM Jack Ye <yezhao...@gmail.com> wrote:
> >> >>>>>>
> >> >>>>>> Really cool project!
> >> >>>>>>
> >> >>>>>> I browsed a bit of the codebase, and see this implementation of the REST service backend:
> >> >>>>>> - https://github.com/datastrato/gravitino/blob/main/catalogs/catalog-lakehouse-iceberg/src/main/java/com/datastrato/gravitino/catalog/lakehouse/iceberg/IcebergRESTService.java#L39
> >> >>>>>> - https://github.com/datastrato/gravitino/blob/main/catalogs/catalog-lakehouse-iceberg/src/main/java/com/datastrato/gravitino/catalog/lakehouse/iceberg/ops/IcebergTableOps.java#L42-L51
> >> >>>>>>
> >> >>>>>> Looks like it is initializing a Jetty server that uses CatalogHandlers to delegate the execution to a specific Java Catalog implementation.
> >> >>>>>>
> >> >>>>>> I think this is actually something that is lacking today in Iceberg, which is an easy way for users to start an actual REST HTTP server.
> >> >>>>>>
> >> >>>>>> I know we have the RESTCatalogAdapter and RESTCatalogServlet for unit tests, and technically we have a very similar Jetty server implementation in TestRESTCatalog. Should we think about moving those components out of the tests into an iceberg-rest-server module for this use case, and merge with the implementation that Gravitino has?
> >> >>>>>>
> >> >>>>>> Best,
> >> >>>>>> Jack Ye
> >> >>>>>>
> >> >>>>>> On Thu, Jan 25, 2024 at 10:47 AM Yufei Gu <flyrain...@gmail.com> wrote:
> >> >>>>>>>
> >> >>>>>>> Thanks Justin for sharing.
> >> >>>>>>>
> >> >>>>>>> It's pretty cool to see an open source REST catalog implementation in action. Having dabbled a bit in the early development of Gravitino myself, I'm really excited about its potential with the Iceberg REST catalog.
> >> >>>>>>>
> >> >>>>>>> The idea of Gravitino moving to an ASF project is promising. It’ll surely boost its visibility and open up more doors for collaboration and adoption.
> >> >>>>>>>
> >> >>>>>>> Looking forward to where this goes. Keep up the fantastic work!
> >> >>>>>>>
> >> >>>>>>> Yufei
> >> >>>>>>>
> >> >>>>>>> On Thu, Jan 25, 2024 at 5:55 AM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> >> >>>>>>>>
> >> >>>>>>>> Hi Justin,
> >> >>>>>>>>
> >> >>>>>>>> I talked with Junping a couple of months ago about Gravitino. Thanks for sharing!
> >> >>>>>>>>
> >> >>>>>>>> Regards
> >> >>>>>>>> JB
> >> >>>>>>>>
> >> >>>>>>>> On Thu, Jan 25, 2024 at 12:15 AM Justin Mclean <jus...@classsoftware.com> wrote:
> >> >>>>>>>> >
> >> >>>>>>>> > Hi,
> >> >>>>>>>> >
> >> >>>>>>>> > We open-sourced a new project, Gravitino, in December and have been working on growing the community and adding new functionality. We plan to donate the project to the ASF this year. Gravitino is a unified metadata lake solution offering a unified approach to managing datasets from diverse sources and regions across multiple cloud platforms. Its core is an Iceberg REST catalog service implementation to manage Iceberg tables efficiently.
> >> >>>>>>>> >
> >> >>>>>>>> > If this sounds like something you would be interested in, then the following resources will help:
> >> >>>>>>>> > - Blog post: https://datastrato.ai/blog/gravitino-iceberg-rest-catalog-service/
> >> >>>>>>>> > - Gravitino documentation: https://datastrato.ai/docs/0.3.1/
> >> >>>>>>>> > - Iceberg REST service documentation: https://datastrato.ai/docs/0.3.1/iceberg-rest-service
> >> >>>>>>>> >
> >> >>>>>>>> > We welcome any feedback and suggestions you have, and as always, all contributions are welcome. You can find the source code at https://github.com/datastrato/gravitino.
> >> >>>>>>>> >
> >> >>>>>>>> > Kind Regards,
> >> >>>>>>>> > Justin
> >> >>>>>
> >> >>>>> --
> >> >>>>> Ryan Blue
> >> >>>>> Tabular
> >> >>>
> >> >>> --
> >> >>> Ryan Blue
> >> >>> Tabular
> >
> > --
> > Ryan Blue
> > Tabular