Hey Taher,

You're right! Iceberg uses a catalog, among other things, to maintain consistency; you can read more about it here: <https://iceberg.apache.org/concepts/catalog/>.
The choice of a catalog depends on your organization and how your setup is organized. For example, if you don't use a Hive Metastore today, I would not set one up just for Iceberg. Hive Metastores are backed by a relational database, and with the JDBC catalog you can use such a database directly as the catalog instead; removing Hive from the setup leaves fewer moving parts. There are also a variety of catalogs out there (not going to name any specific ones) that implement the Iceberg REST Open API specification mentioned in the link above; these tend to provide the best Iceberg experience, since the specification is tightly coupled with Iceberg.

Let us know if this clarifies things, and if there are any further questions!

Kind regards,
Fokko

On Wed, 31 Jul 2024 at 07:44, Taher Koitawala <taher...@gmail.com> wrote:

> Hi All,
> I have a question about which catalog to use with Iceberg for our use case.
>
> We are on Kubernetes running Spark, with MinIO for storage. We use Spark to
> write data to MinIO via S3, and we use a third-party data catalog to record
> the table location and create data lineage.
>
> In our new phase we want to move to Iceberg; however, I see that Iceberg
> uses catalogs to maintain atomicity. What catalog would we use for our
> use case?
>
> 1. Will we have to provision a Hive Metastore so that we can use the
> SparkCatalog with the metastore URI, so that multiple transactions are
> isolated?
> 2. Would a simple Spark catalog with a warehouse dir on MinIO suffice?
>    1. In that case my question would be, do all Spark jobs refer to
>       one warehouse dir? I assume not.
> 3. What about a Spark catalog with JDBC? Would that be enough, and an
> easy way to do isolation and atomic reads/writes?
>
> Thanks,
> Taher Koitawala
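For what it's worth, a minimal sketch of what the JDBC-catalog setup could look like in Spark properties, assuming a PostgreSQL database and a MinIO endpoint (the catalog name `my_catalog` and the hostnames/ports are placeholders, not recommendations):

```properties
# Register an Iceberg catalog named "my_catalog" in Spark
spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.my_catalog.type=jdbc
# The relational database that backs the catalog (placeholder URI)
spark.sql.catalog.my_catalog.uri=jdbc:postgresql://postgres:5432/iceberg
spark.sql.catalog.my_catalog.jdbc.user=iceberg
spark.sql.catalog.my_catalog.jdbc.password=secret
# Warehouse location and S3-compatible endpoint for MinIO (placeholders)
spark.sql.catalog.my_catalog.warehouse=s3a://warehouse/
spark.sql.catalog.my_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO
spark.sql.catalog.my_catalog.s3.endpoint=http://minio:9000

# Alternatively, for a catalog implementing the REST specification,
# the type and uri would change to something like:
#   spark.sql.catalog.my_catalog.type=rest
#   spark.sql.catalog.my_catalog.uri=http://rest-catalog:8181
```

In both variants the catalog, not the warehouse directory, is what provides the atomic commit, so multiple Spark jobs can safely point at the same warehouse location as long as they go through the same catalog.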