Hi Feng,

Conflicts and inconsistencies are still likely even with a single
CatalogProvider, because CatalogProvider only exposes read interfaces
(listCatalogs, getCatalog). For example, you may register a catalog X but
then be unable to list it, because it is not in the external metadata service.

To avoid catalog conflicts and keep things consistent, we can extract the
catalog management logic into a pluggable interface that includes
listCatalogs, getCatalog, registerCatalog, unregisterCatalog, etc. The
current CatalogManager would be the default in-memory implementation, and you
could replace it with a user-defined manager, such as
 - file-based: manages catalog information in local files, just like
how Presto/Trino manages catalogs
 - metaservice-based: manages catalog information in an external
metadata service.

There can only be a single catalog manager in a TableEnvironment, which
guarantees data consistency and avoids conflicts. This approach can also
address another pain point of Flink SQL: catalog information is not persisted.
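
As a rough sketch of such a pluggable interface (the names CatalogRegistry
and InMemoryCatalogRegistry are placeholders for illustration, not existing
Flink APIs, and storing catalogs as plain string properties is an assumption
borrowed from the earlier registerCatalog(String, Map<String, String>) idea):

```java
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical interface: the single place where catalog metadata is
// listed, read, registered, and unregistered.
interface CatalogRegistry {
    Set<String> listCatalogs();

    Optional<Map<String, String>> getCatalog(String catalogName);

    void registerCatalog(String catalogName, Map<String, String> properties);

    void unregisterCatalog(String catalogName);
}

// Hypothetical default implementation that keeps everything in memory.
// A file-based or metaservice-based variant would implement the same
// interface against its own storage.
final class InMemoryCatalogRegistry implements CatalogRegistry {
    private final Map<String, Map<String, String>> catalogs = new ConcurrentHashMap<>();

    @Override
    public Set<String> listCatalogs() {
        // Return a snapshot so callers cannot modify the registry through it.
        return Set.copyOf(catalogs.keySet());
    }

    @Override
    public Optional<Map<String, String>> getCatalog(String catalogName) {
        return Optional.ofNullable(catalogs.get(catalogName));
    }

    @Override
    public void registerCatalog(String catalogName, Map<String, String> properties) {
        // Reject duplicates instead of silently shadowing an existing catalog.
        if (catalogs.putIfAbsent(catalogName, Map.copyOf(properties)) != null) {
            throw new IllegalStateException("Catalog already exists: " + catalogName);
        }
    }

    @Override
    public void unregisterCatalog(String catalogName) {
        catalogs.remove(catalogName);
    }
}
```

Because reads and writes go through the same object, anything that was
registered can also be listed, which is the consistency problem described
above.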

Can this approach satisfy your requirements?

Best,
Jark





On Fri, 10 Feb 2023 at 11:21, Shengkai Fang <fskm...@gmail.com> wrote:

> Hi Feng.
>
> I think your idea is very interesting!
>
> 1. I just wonder: after initializing the Catalog, will the Session reuse the
> same Catalog instance or build a new one for later usage? If we reuse the
> same Catalog, I think it's more like lazy initialization. I lean slightly
> toward building a new one, because that makes hot updates of catalog jars
> easier.
>
> 2. When users run a `CREATE CATALOG` statement, should the CatalogManager
> instantiate the Catalog immediately or defer instantiation until first
> use?
>
> Best,
> Shengkai
>
> Feng Jin <jinfeng1...@gmail.com> wrote on Thu, Feb 9, 2023 at 20:13:
>
> > Thanks for your reply.
> >
> > @Timo
> >
> > >  2) avoid  the default in-memory catalog and offer their catalog before
> > a  TableEnvironment session starts
> > >  3) whether this can be disabled and SHOW CATALOGS  can be used for
> > listing first without having a default catalog.
> >
> >
> > Regarding 2 and 3, I think this problem can be solved by introducing
> > catalog providers, and users can control some default catalog
> > behavior.
> >
> >
> > > We could also use the org.apache.flink.table.factories.Factory infra
> > and  allow catalog providers via pure string properties
> >
> > I think this is also very useful. In our usage scenarios, it is
> > usually multi-cluster management, and it is also necessary to pass
> > different configurations through parameters.
> >
> >
> > @Jark @Huang
> >
> > >  About the lazy catalog initialization
> >
> > Our needs may be different. If these properties already exist in an
> > external system, especially when there may be thousands of these
> > catalog properties, I don’t think it is necessary to register all
> > of them in the Flink env at startup. What we need is to be able to
> > register a catalog when it is needed and to fetch its properties
> > from the external meta system.
> >
> >
> > >  It may be hard to avoid conflicts  and duplicates between
> > CatalogProvider and CatalogManager
> >
> > Conflicts are indeed easy to run into. My idea is to move the
> > catalog management of the current CatalogManager into the default
> > CatalogProvider, and to allow only one CatalogProvider per
> > Flink env. This may avoid catalog conflicts.
> >
> >
> > Best,
> > Feng
> >
> > On Tue, Feb 7, 2023 at 1:01 PM Hang Ruan <ruanhang1...@gmail.com> wrote:
> > >
> > > Hi Feng,
> > > I agree with what Jark said. I think what you are looking for is lazy
> > > initialization.
> > >
> > > I don't think we should introduce the new interface CatalogProvider for
> > > lazy initialization. What we should do is store the catalog properties
> > > and initialize the catalog when we need it. Could you please describe
> > > some other scenarios that need the CatalogProvider besides lazy
> > > initialization?
> > >
> > > If we really need the CatalogProvider, I think it is better to be a
> > > single instance. Multiple instances are difficult to manage and there
> > > are name conflicts among providers.
> > >
> > > Best,
> > > Hang
> > >
> > > Jark Wu <imj...@gmail.com> wrote on Tue, Feb 7, 2023 at 10:48:
> > >
> > > > Hi Feng,
> > > >
> > > > I think this feature makes a lot of sense. If I understand correctly,
> > > > what you are looking for is lazy catalog initialization.
> > > >
> > > > However, I have some concerns about introducing CatalogProvider,
> > > > which delegates catalog management to users. It may be hard to avoid
> > > > conflicts and duplicates between CatalogProvider and CatalogManager.
> > > > Is it possible to have a built-in CatalogProvider to instantiate
> > > > catalogs lazily?
> > > >
> > > > An idea in my mind is to introduce another catalog registration API
> > > > without instantiating the catalog, e.g., registerCatalog(String
> > > > catalogName, Map<String, String> catalogProperties). The catalog
> > > > information is stored in CatalogManager as pure strings. The catalog
> > > > is instantiated and initialized when used.
> > > >
> > > > This new API is very similar to other pure-string metadata
> > > > registration, such as "createTable(String path, TableDescriptor
> > > > descriptor)" and "createFunction(String path, String className,
> > > > List<ResourceUri> resourceUris)".
> > > >
> > > > Can this approach satisfy your requirement?
> > > >
> > > > Best,
> > > > Jark
> > > >
> > > > On Mon, 6 Feb 2023 at 22:53, Timo Walther <twal...@apache.org> wrote:
> > > >
> > > > > Hi Feng,
> > > > >
> > > > > this is indeed a good proposal.
> > > > >
> > > > > 1) It makes sense to improve the catalog listing for platform
> > > > > providers.
> > > > >
> > > > > 2) Other feedback from the past has shown that users would like to
> > > > > avoid the default in-memory catalog and offer their catalog before a
> > > > > TableEnvironment session starts.
> > > > >
> > > > > 3) Also we might reconsider whether a default catalog and default
> > > > > database make sense. Or whether this can be disabled and SHOW
> > > > > CATALOGS can be used for listing first without having a default
> > > > > catalog.
> > > > >
> > > > > What do you think about option 2 and 3?
> > > > >
> > > > > In any case, I would propose we pass a CatalogProvider to
> > > > > EnvironmentSettings and only allow a single instance. Catalogs
> > > > > should never shadow other catalogs.
> > > > >
> > > > > We could also use the org.apache.flink.table.factories.Factory infra
> > > > > and allow catalog providers via pure string properties. Not sure if
> > > > > we need this in the first version though.
> > > > >
> > > > > Cheers,
> > > > > Timo
> > > > >
> > > > >
> > > > > On 06.02.23 11:21, Feng Jin wrote:
> > > > > > Hi everyone,
> > > > > >
> > > > > > The original discussion is at
> > > > > > https://issues.apache.org/jira/browse/FLINK-30126
> > > > > >
> > > > > > Currently, Flink has access to many systems, including kafka,
> > > > > > hive, iceberg, hudi, elasticsearch, mysql...  The corresponding
> > > > > > catalog names might be:
> > > > > > kafka_cluster1, kafka_cluster2, hive_cluster1, hive_cluster2,
> > > > > > iceberg_cluster2, elasticsearch_cluster1,  mysql_database1_xxx,
> > > > > > mysql_database2_xxxx
> > > > > >
> > > > > > As the platform for Flink SQL jobs, we need to maintain the meta
> > > > > > information of each system of the company, and when a Flink job
> > > > > > starts, we need to register the catalogs with the Flink table
> > > > > > environment, so that users can use any table through the
> > > > > > env.executeSql interface.
> > > > > >
> > > > > > When we only have a small number of catalogs, we can register them
> > > > > > like this, but when there are thousands of catalogs, I think there
> > > > > > needs to be a dynamic loading mechanism so that we can register a
> > > > > > catalog only when needed, speed up the initialization of the table
> > > > > > environment, and avoid useless catalog registration.
> > > > > >
> > > > > > Preliminary thoughts:
> > > > > >
> > > > > > A new CatalogProvider interface can be added.
> > > > > > It contains two methods:
> > > > > > * listCatalogs(), which lists all the catalogs that the provider
> > > > > > can offer
> > > > > > * getCatalog(), which gets a catalog instance by catalog name.
> > > > > >
> > > > > > ```java
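> > > > > > import java.util.Optional;
> > > > > > import java.util.Set;
> > > > > > import org.apache.flink.configuration.ReadableConfig;
> > > > > > import org.apache.flink.table.catalog.Catalog;
> > > > > >
> > > > > > public interface CatalogProvider {
> > > > > >
> > > > > >      // Optional setup hook; does nothing by default.
> > > > > >      default void initialize(ClassLoader classLoader, ReadableConfig config) {}
> > > > > >
> > > > > >      // Returns the catalog with the given name, if the provider has one.
> > > > > >      Optional<Catalog> getCatalog(String catalogName);
> > > > > >
> > > > > >      // Returns the names of all catalogs the provider can offer.
> > > > > >      Set<String> listCatalogs();
> > > > > > }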
> > > > > > ```
> > > > > >
> > > > > >
> > > > > > The corresponding implementation in CatalogManager is as follows:
> > > > > >
> > > > > > ```java
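> > > > > > import java.util.HashMap;
> > > > > > import java.util.Map;
> > > > > > import java.util.Optional;
> > > > > > import javax.annotation.Nullable;
> > > > > > import org.apache.flink.table.catalog.Catalog;
> > > > > >
> > > > > > public class CatalogManager {
> > > > > >      private @Nullable CatalogProvider catalogProvider;
> > > > > >
> > > > > >      private Map<String, Catalog> catalogs = new HashMap<>();
> > > > > >
> > > > > >      public void setCatalogProvider(CatalogProvider catalogProvider) {
> > > > > >          this.catalogProvider = catalogProvider;
> > > > > >      }
> > > > > >
> > > > > >      public Optional<Catalog> getCatalog(String catalogName) {
> > > > > >          // Catalogs registered in the manager are looked up first.
> > > > > >          Catalog catalog = catalogs.get(catalogName);
> > > > > >          if (catalog != null) {
> > > > > >              return Optional.of(catalog);
> > > > > >          }
> > > > > >          // If there is no corresponding catalog in catalogs,
> > > > > >          // get catalog by catalogProvider
> > > > > >          if (catalogProvider != null) {
> > > > > >              return catalogProvider.getCatalog(catalogName);
> > > > > >          }
> > > > > >          return Optional.empty();
> > > > > >      }
> > > > > > }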
> > > > > > ```
> > > > > >
> > > > > >
> > > > > >
> > > > > > Possible problems:
> > > > > >
> > > > > > 1. Catalog name conflicts: which one should win when a registered
> > > > > > catalog and a catalog provided by the catalog-provider conflict?
> > > > > > I prefer tableEnv-registered ones over catalogs provided by the
> > > > > > catalog-provider. If the user wishes to reference the catalog
> > > > > > provided by the catalog-provider, they can unregister the catalog
> > > > > > in tableEnv through the `unregisterCatalog` interface.
> > > > > >
> > > > > > 2. Number of CatalogProviders: is it possible to have multiple
> > > > > > catalogProvider implementations?
> > > > > > I don't have a good idea about this at the moment. Supporting
> > > > > > multiple catalogProviders would be much more convenient, but
> > > > > > there may be catalog name conflicts between different
> > > > > > catalogProviders.
> > > > > >
> > > > > >
> > > > > >
> > > > > > Looking forward to your reply, any feedback is appreciated!
> > > > > >
> > > > > >
> > > > > > Best.
> > > > > >
> > > > > > Feng Jin
> > > > > >
> > > > >
> > > > >
> > > >
> >
>
