Thanks for your reply.

@Timo

>  2) avoid  the default in-memory catalog and offer their catalog before a  
> TableEnvironment session starts
>  3) whether this can be disabled and SHOW CATALOGS  can be used for listing 
> first without having a default catalog.


Regarding 2 and 3, I think this problem can be solved by introducing
catalog providers, and users can control some default catalog
behavior.


> We could also use the org.apache.flink.table.factories.Factory infra and  
> allow catalog providers via pure string properties

I think this is also very useful. In our usage scenarios, it is
usually multi-cluster management, and it is also necessary to pass
different configurations through parameters.


@Jark @Huang

>  About the lazy catalog initialization

Our needs may be different. If these properties already exist in an
external system, especially when there may be thousands of these
catalog properties, I don’t think it is necessary to register all
these properties in the Flink env at startup, but we need is that we
can register a catalog  when it needs and we can get the properties
from the external meta system .


>  It may be hard to avoid conflicts  and duplicates between CatalogProvider 
> and CatalogManager

It is indeed easy to conflict. My idea is that if we separate the
catalog management of the current CatalogManager as the default
CatalogProvider behavior, at the same time, only one CatalogProvider
exists in a Flink Env.  This may avoid catalog conflicts.


Best,
Feng

On Tue, Feb 7, 2023 at 1:01 PM Hang Ruan <ruanhang1...@gmail.com> wrote:
>
> Hi Feng,
> I agree with what Jark said. I think what you are looking for is lazy
> initialization.
>
> I don't think we should introduce the new interface CatalogProvider for
> lazy initialization. What we should do is to store the catalog properties
> and initialize the catalog when we need it. Could you please introduce some
> other scenarios that we need the CatalogProvider besides the lazy
> initialization?
>
> If we really need the CatalogProvider, I think it is better to be a single
> instance. Multiple instances are difficult to manage and there are name
> conflicts among providers.
>
> Best,
> Hang
>
> Jark Wu <imj...@gmail.com> 于2023年2月7日周二 10:48写道:
>
> > Hi Feng,
> >
> > I think this feature makes a lot of sense. If I understand correctly, what
> > you are looking for is lazy catalog initialization.
> >
> > However, I have some concerns about introducing CatalogProvider, which
> > delegates catalog management to users. It may be hard to avoid conflicts
> > and duplicates between CatalogProvider and CatalogManager. Is it possible
> > to have a built-in CatalogProvider to instantiate catalogs lazily?
> >
> > An idea in my mind is to introduce another catalog registration API
> > without instantiating the catalog, e.g., registerCatalog(String
> > catalogName, Map<String, String> catalogProperties). The catalog
> > information is stored in CatalogManager as pure strings. The catalog is
> > instantiated and initialized when used.
> >
> > This new API is very similar to other pure-string metadata registration,
> > such as "createTable(String path, TableDescriptor descriptor)" and
> > "createFunction(String path, String className, List<ResourceUri>
> > resourceUris)".
> >
> > Can this approach satisfy your requirement?
> >
> > Best,
> > Jark
> >
> > On Mon, 6 Feb 2023 at 22:53, Timo Walther <twal...@apache.org> wrote:
> >
> > > Hi Feng,
> > >
> > > this is indeed a good proposal.
> > >
> > > 1) It makes sense to improve the catalog listing for platform providers.
> > >
> > > 2) Other feedback from the past has shown that users would like to avoid
> > > the default in-memory catalog and offer their catalog before a
> > > TableEnvironment session starts.
> > >
> > > 3) Also we might reconsider whether a default catalog and default
> > > database make sense. Or whether this can be disabled and SHOW CATALOGS
> > > can be used for listing first without having a default catalog.
> > >
> > > What do you think about option 2 and 3?
> > >
> > > In any case, I would propose we pass a CatalogProvider to
> > > EnvironmentSettings and only allow a single instance. Catalogs should
> > > never shadow other catalogs.
> > >
> > > We could also use the org.apache.flink.table.factories.Factory infra and
> > > allow catalog providers via pure string properties. Not sure if we need
> > > this in the first version though.
> > >
> > > Cheers,
> > > Timo
> > >
> > >
> > > On 06.02.23 11:21, Feng Jin wrote:
> > > > Hi everyone,
> > > >
> > > > The original discussion address is
> > > > https://issues.apache.org/jira/browse/FLINK-30126
> > > >
> > > > Currently, Flink has access to many systems, including kafka, hive,
> > > > iceberg, hudi, elasticsearch, mysql...  The corresponding catalog name
> > > > might be:
> > > > kafka_cluster1, kafka_cluster2, hive_cluster1, hive_cluster2,
> > > > iceberg_cluster2, elasticsearch_cluster1,  mysql_database1_xxx,
> > > > mysql_database2_xxxx
> > > >
> > > > As the platform of the Flink SQL job, we need to maintain the meta
> > > > information of each system of the company, and when the Flink job
> > > > starts, we need to register the catalog with the Flink table
> > > > environment, so that users can use any table through the
> > > > env.executeSql interface.
> > > >
> > > > When we only have a small number of catalogs, we can register like
> > > > this, but when there are thousands of catalogs, I think that there
> > > > needs to be a dynamic loading mechanism that we can register catalog
> > > > when needed, speed up the initialization of the table environment, and
> > > > avoid the useless catalog registration process.
> > > >
> > > > Preliminary thoughts:
> > > >
> > > > A new CatalogProvider interface can be added:
> > > > It contains two interfaces:
> > > > * listCatalogs() interface, which can list all the interfaces that the
> > > > interface can provide
> > > > * getCatalog() interface,  which can get a catalog instance by catalog
> > > name.
> > > >
> > > > ```java
> > > > public interface CatalogProvider {
> > > >
> > > >      default void initialize(ClassLoader classLoader, ReadableConfig
> > > config) {}
> > > >
> > > >      Optional<Catalog> getCatalog(String catalogName);
> > > >
> > > >      Set<String> listCatalogs();
> > > > }
> > > > ```
> > > >
> > > >
> > > > The corresponding implementation in CatalogManager is as follows:
> > > >
> > > > ```java
> > > > public CatalogManager {
> > > >      private @Nullable CatalogProvider catalogProvider;
> > > >
> > > >      private Map<String, Catalog> catalogs;
> > > >
> > > >      public void setCatalogProvider(CatalogProvider catalogProvider) {
> > > >          this.catalogProvider = catalogProvider;
> > > >      }
> > > >
> > > >      public Optional<Catalog> getCatalog(String catalogName) {
> > > >          // If there is no corresponding catalog in catalogs,
> > > >          // get catalog by catalogProvider
> > > >          if (catalogProvider != null) {
> > > >              Optional<Catalog> catalog =
> > > catalogProvider.getCatalog(catalogName);
> > > >          }
> > > >      }
> > > >
> > > > }
> > > > ```
> > > >
> > > >
> > > >
> > > > Possible problems:
> > > >
> > > > 1. Catalog name conflict, how to choose when the registered catalog
> > > > and the catalog provided by catalog-provider conflict?
> > > > I prefer tableEnv-registered ones over catalogs provided by the
> > > > catalog-provider. If the user wishes to reference the catalog provided
> > > > by the catalog-provider, they can unregister the catalog in tableEnv
> > > > through the `unregisterCatalog` interface.
> > > >
> > > > 2. Number of CatalogProviders, is it possible to have multiple
> > > > catalogProvider implementations?
> > > > I don't have a good idea of this at the moment. If multiple
> > > > catalogProviders are supported, it brings much more convenience, But
> > > > there may be catalog name conflicts between different
> > > > catalogProviders.
> > > >
> > > >
> > > >
> > > > Looking forward to your reply, any feedback is appreciated!
> > > >
> > > >
> > > > Best.
> > > >
> > > > Feng Jin
> > > >
> > >
> > >
> >

Reply via email to