Re: Gravitino an Iceberg REST catalog service

Fokko Driesprong Thu, 29 Feb 2024 23:14:02 -0800

Hey everyone,

Thanks for raising this. I think a test-jar would be a great first step.


> We already maintain "service" considering JDBC, Hive, etc catalogs. REST
> Catalog ref impl in Iceberg would be the same.


What I think Ryan means by a service is having to maintain Postgres (JDBC
backend), Hive Metastore (Hive backend), etc. There is a lot to it to
properly scale these backends.

For PyIceberg we decided to build the examples backed by the SqlCatalog.
This can run either in memory or on a local disk (SQLite). Of course, it has
limited parallelism, but it makes it easy to give Iceberg a try. One of the
main motivations for doing it this way was that it doesn't require any
additional services. Running additional services would require having a
JRE, Docker, etc. installed, and potentially also an RDBMS backend to
persist the data.
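
To illustrate why a SQLite-backed catalog needs no additional services, here
is a minimal sketch using only Python's standard library. It is not
PyIceberg's SqlCatalog: the TinySqlCatalog class, its table schema, and the
metadata paths are all hypothetical and exist only for illustration. The same
code runs in memory or against a file on local disk, depending on the path
passed in.

```python
import sqlite3
from typing import Optional


class TinySqlCatalog:
    """Hypothetical toy catalog: maps (namespace, table) to a metadata location."""

    def __init__(self, path: str = ":memory:") -> None:
        # ":memory:" keeps everything in-process; a file path persists to local disk.
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS tables ("
            " namespace TEXT NOT NULL,"
            " name TEXT NOT NULL,"
            " metadata_location TEXT NOT NULL,"
            " PRIMARY KEY (namespace, name))"
        )
        self._conn.commit()

    def register(self, namespace: str, name: str, location: str) -> None:
        # The connection context manager commits on success, rolls back on error.
        with self._conn:
            self._conn.execute(
                "INSERT INTO tables VALUES (?, ?, ?)", (namespace, name, location)
            )

    def load(self, namespace: str, name: str) -> Optional[str]:
        row = self._conn.execute(
            "SELECT metadata_location FROM tables WHERE namespace = ? AND name = ?",
            (namespace, name),
        ).fetchone()
        return row[0] if row else None

    def commit(self, namespace: str, name: str, expected: str, new: str) -> bool:
        # Compare-and-swap: the pointer moves only if nobody else moved it first.
        with self._conn:
            cur = self._conn.execute(
                "UPDATE tables SET metadata_location = ?"
                " WHERE namespace = ? AND name = ? AND metadata_location = ?",
                (new, namespace, name, expected),
            )
        return cur.rowcount == 1


# Hypothetical metadata file locations.
V1 = "file:///tmp/warehouse/demo/events/v1.metadata.json"
V2 = "file:///tmp/warehouse/demo/events/v2.metadata.json"
V3 = "file:///tmp/warehouse/demo/events/v3.metadata.json"

catalog = TinySqlCatalog()  # pass e.g. "catalog.db" to persist on local disk
catalog.register("demo", "events", V1)
ok = catalog.commit("demo", "events", expected=V1, new=V2)     # pointer moves to V2
stale = catalog.commit("demo", "events", expected=V1, new=V3)  # rejected: V1 is stale
```

A single SQLite file serializes writers, which is where the limited
parallelism comes from, but nothing beyond the Python process has to be
installed or kept running.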

Kind regards,
Fokko


On Fri, Mar 1, 2024 at 07:34, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:

> Hi Ryan
>
> If we plan to reduce the number of catalogs (and I think it makes
> sense and I'm with you on that), we will need an impl/service in
> Iceberg for the REST Catalog API, else the users won't be able to use
> Iceberg "out of the box".
> We already maintain "service" considering JDBC, Hive, etc catalogs.
> REST Catalog ref impl in Iceberg would be the same.
>
> So, in order to promote the REST Catalog API as the Catalog "unique"
> façade for Iceberg, I would be in favor of having a simple REST
> service in Iceberg.
> It would be the entry point for Iceberg users and they can use other
> REST catalogs depending on their needs (Gravitino, Tabular, ...).
>
> Regards
> JB
>
> On Fri, Mar 1, 2024 at 1:28 AM Ryan Blue <b...@tabular.io> wrote:
> >
> > There is a reference implementation in the project, in the
> CatalogHandlers class. That implements REST requests using a catalog and
> returns REST responses. I believe this is what Gravitino relies on and I
> mentioned it above in the discussion about whether we should have a catalog
> service.
> >
> > Catalog tests also use catalog handlers, but use a simple HTTP wrapper
> to test the HTTP client. There is also a test class that accepts HTTP calls
> directly and also runs JSON serialization on requests and responses.
> >
> > So far, the Iceberg community has avoided maintaining a service. That
> brings in a lot of complications. So far, we’ve preferred to remain focused
> on providing a library that can be used to wire up something like a REST
> catalog, but not provide a runtime service.
> >
> > Ryan
> >
> >
> > On Thu, Feb 29, 2024 at 2:59 AM Jean-Baptiste Onofré <j...@nanthrax.net>
> wrote:
> >>
> >> Hi Ajantha,
> >>
> >> Thanks for sharing your thoughts.
> >>
> >> It makes sense for Gravitino to be a TLP (after the incubation period)
> >> because Gravitino is "more" than an Iceberg catalog. It implements the
> >> Iceberg REST Catalog API, but it's also a metadata catalog/repo with
> >> additional features.
> >>
> >> That said, I agree with what you said:
> >> 1. We have the openapi yaml in the Iceberg project, but no reference
> >> implementation in the project itself. I think REST Catalog is a good
> >> approach as a "central" Catalog API because any Iceberg engine/layer
> >> could use this API (even if written in Python, rust, go, whatever),
> >> and it allows new use cases (like easily move data from an engine to
> >> another as the catalog API would be the same).
> >> 2. From an ASF standpoint, I would not talk about "subproject" but
> >> more repositories. The reason is that, in terms of governance, it's
> >> still the Iceberg project (a PMC member or committer has the same
> >> permissions on all repositories in the Iceberg project; it's not
> >> possible to have a committer only on iceberg-rust, for instance).
> >> Generally speaking, we should limit the number of subprojects.
> >> 3. I think it would be fair to have REST Catalog resources (openapi
> >> yaml + a ref impl) in an iceberg-catalog repository.
> >> 4. However, it's important to have a more global discussion within the
> >> community about Iceberg 2.0 and the roadmap about catalogs: do we
> >> deprecate the Iceberg Java Catalog API in favor of the REST Catalog API?
> >> What do we do with the existing catalogs? etc. I think it's a fair
> >> discussion to have for Iceberg 2.0.
> >>
> >> It's an important discussion, community driven.
> >>
> >> Regards
> >> JB
> >>
> >> On Thu, Feb 29, 2024 at 9:44 AM Ajantha Bhat <ajanthab...@gmail.com>
> wrote:
> >> >
> >> > I apologize for the delay in responding.
> >> >
> >> > I'm pleased to see the development of an open-source REST catalog
> implementation, and the potential transition of Gravitino to an ASF project
> is certainly promising.
> >> > But the REST catalog server implementation will be a small part of the
> Gravitino ASF project, which has many other things along with the catalog.
> >> >
> >> > While I understand Iceberg's focus on the table format specification
> and its implementation,
> >> > I would like to propose the creation of a sub-project for the REST
> catalog server implementation under the Iceberg repository (similar to
> pyiceberg, iceberg-rust, etc.).
> >> > This suggestion is based on several reasons:
> >> >
> >> > Every time we make a change to the REST spec, there is no reference
> implementation to refer to or modify.
> >> > Many companies such as AWS, Apple, Tabular, and Datastrato are each
> implementing their own REST servers.
> >> > Consolidating efforts within a sub-project could lead to efficiency
> gains and potential collaboration opportunities.
> >> > From the perspective of open-source users, the absence of an
> open-source implementation for the REST catalog within Iceberg may be
> inconvenient or frustrating.
> >> >
> >> > I believe creating a dedicated sub-project would address these
> concerns and enhance the overall usability and collaborative nature of the
> Iceberg ecosystem.
> >> > I also think we can have a sub-project for kafka-connect and iceberg
> tools (delta converter, catalog migrator, etc.), as they need not
> depend on the Iceberg release cycle
> >> > and they are independent of table format spec.
> >> >
> >> > Let me know your thoughts on this. I can open a separate thread for
> discussion if required.
> >> >
> >> > - Ajantha
> >> >
> >> >
> >> > On Wed, Jan 31, 2024 at 5:32 AM Jack Ye <yezhao...@gmail.com> wrote:
> >> >>
> >> >> +1 for using test-jar!
> >> >>
> >> >> -Jack
> >> >>
> >> >> On Fri, Jan 26, 2024 at 10:48 AM Ryan Blue <b...@tabular.io> wrote:
> >> >>>
> >> >>> I think I'd be fine exposing this through a test Jar, but it seems
> to me that if we were to put it into a normal package it would turn into
> the situation we want to avoid. People would use it for unintended purposes
> and it would become a distraction.
> >> >>>
> >> >>> What do you think about using the tests Jar for this?
> >> >>>
> >> >>> On Thu, Jan 25, 2024 at 12:48 PM Jack Ye <yezhao...@gmail.com>
> wrote:
> >> >>>>
> >> >>>> Yes, sorry I did not make it clear, I also agree it is not the
> right direction to invest a lot of community effort. I am more talking
> about casual use cases like importing a server for unit tests outside
> Iceberg, running some local debugging, etc. I think it would be valuable to
> provide a server in Iceberg for that purpose, and maybe vend it as test
> utils. Thoughts?
> >> >>>>
> >> >>>> -Jack
> >> >>>>
> >> >>>> On Thu, Jan 25, 2024 at 11:35 AM Ryan Blue <b...@tabular.io>
> wrote:
> >> >>>>>
> >> >>>>> > I know we have the RESTCatalogAdapter and RESTCatalogServlet for
> unit tests, and technically we have a very similar Jetty server
> implementation in TestRESTCatalog. Should we think about moving those
> components out of the tests into an iceberg-rest-server module for this use
> case, and merge with the implementation that Gravitino has?
> >> >>>>>
> >> >>>>> I think that this would take the Iceberg project in the wrong
> direction. Iceberg has always been a library and I think it should continue
> to be. Concerns about runtime should be left to other projects that need to
> fit into existing infrastructure or skillsets of people maintaining them.
> The question of whether to use Jetty or Tomcat or whatever else is a
> serious consideration, as is how to monitor that application and send
> metrics. I think it would slow down the core purpose of Iceberg if we got
> distracted by these things.
> >> >>>>>
> >> >>>>> In fact, I think that this project shows that the library is
> getting the balance right: it is using `CatalogHandlers` for their intended
> purpose. It has opinions about how to run the actual HTTP service and
> people that agree can use it. Other people could use `CatalogHandlers` to
> build on a different foundation.
> >> >>>>>
> >> >>>>> On Thu, Jan 25, 2024 at 11:13 AM Jack Ye <yezhao...@gmail.com>
> wrote:
> >> >>>>>>
> >> >>>>>> Really cool project!
> >> >>>>>>
> >> >>>>>> I browsed a bit of the codebase, and see this implementation of
> the REST service backend:
> >> >>>>>> -
> https://github.com/datastrato/gravitino/blob/main/catalogs/catalog-lakehouse-iceberg/src/main/java/com/datastrato/gravitino/catalog/lakehouse/iceberg/IcebergRESTService.java#L39
> >> >>>>>> -
> https://github.com/datastrato/gravitino/blob/main/catalogs/catalog-lakehouse-iceberg/src/main/java/com/datastrato/gravitino/catalog/lakehouse/iceberg/ops/IcebergTableOps.java#L42-L51
> >> >>>>>>
> >> >>>>>>  Looks like it is initializing a Jetty server that uses
> CatalogHandlers to delegate the execution to a specific Java Catalog
> implementation.
> >> >>>>>>
> >> >>>>>> I think this is actually something that is lacking today in
> Iceberg, which is an easy way for users to start an actual REST HTTP server.
> >> >>>>>>
> >> >>>>>> I know we have the RESTCatalogAdapter and RESTCatalogServlet for
> unit tests, and technically we have a very similar Jetty server
> implementation in TestRESTCatalog. Should we think about moving those
> components out of the tests into an iceberg-rest-server module for this use
> case, and merge with the implementation that Gravitino has?
> >> >>>>>>
> >> >>>>>> Best,
> >> >>>>>> Jack Ye
> >> >>>>>>
> >> >>>>>> On Thu, Jan 25, 2024 at 10:47 AM Yufei Gu <flyrain...@gmail.com>
> wrote:
> >> >>>>>>>
> >> >>>>>>> Thanks, Justin, for sharing.
> >> >>>>>>>
> >> >>>>>>> It's pretty cool to see an open source REST catalog
> implementation in action. Having dabbled a bit in the early development of
> Gravitino myself, I'm really excited about its potential with the Iceberg
> REST catalog.
> >> >>>>>>>
> >> >>>>>>> The idea of Gravitino moving to an ASF project is promising.
> It’ll surely boost its visibility and open up more doors for collaboration
> and adoption.
> >> >>>>>>>
> >> >>>>>>> Looking forward to where this goes. Keep up the fantastic work!
> >> >>>>>>>
> >> >>>>>>> Yufei
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>> On Thu, Jan 25, 2024 at 5:55 AM Jean-Baptiste Onofré <
> j...@nanthrax.net> wrote:
> >> >>>>>>>>
> >> >>>>>>>> Hi Justin,
> >> >>>>>>>>
> >> >>>>>>>> I talked with Junping a couple of months ago about Gravitino.
> Thanks
> >> >>>>>>>> for sharing!
> >> >>>>>>>>
> >> >>>>>>>> Regards
> >> >>>>>>>> JB
> >> >>>>>>>>
> >> >>>>>>>> On Thu, Jan 25, 2024 at 12:15 AM Justin Mclean <
> jus...@classsoftware.com> wrote:
> >> >>>>>>>> >
> >> >>>>>>>> > Hi,
> >> >>>>>>>> >
> >> >>>>>>>> > We open-sourced a new project, Gravitino, in December and
> have been working on growing the community and adding new functionality. We
> plan to donate the project to the ASF this year. Gravitino is a unified
> metadata lake solution offering a unified approach to managing datasets
> from diverse sources and regions across multiple cloud platforms. Its core
> is an Iceberg REST catalog service implementation to manage Iceberg tables
> efficiently.
> >> >>>>>>>> >
> >> >>>>>>>> > If this sounds like something you would be interested in,
> then the following resources will help:
> >> >>>>>>>> > -  Blog post:
> https://datastrato.ai/blog/gravitino-iceberg-rest-catalog-service/
> >> >>>>>>>> > -  Gravitino documentation:
> https://datastrato.ai/docs/0.3.1/
> >> >>>>>>>> > -  Iceberg REST service documentation:
> https://datastrato.ai/docs/0.3.1/iceberg-rest-service
> >> >>>>>>>> >
> >> >>>>>>>> > We welcome any feedback and suggestions you have, and as
> always, all contributions are welcome. You can find the source code at
> https://github.com/datastrato/gravitino.
> >> >>>>>>>> >
> >> >>>>>>>> > Kind Regards,
> >> >>>>>>>> > Justin
> >> >>>>>
> >> >>>>>
> >> >>>>>
> >> >>>>> --
> >> >>>>> Ryan Blue
> >> >>>>> Tabular
> >> >>>
> >> >>>
> >> >>>
> >> >>> --
> >> >>> Ryan Blue
> >> >>> Tabular
> >
> >
> >
> > --
> > Ryan Blue
> > Tabular
>
