Hi Feng. I think your idea is very interesting!
1. After the Catalog is initialized, will the Session reuse the same Catalog instance, or build a new one for later use? If we reuse the same Catalog, I think it is more like lazy initialization. I lean toward building a new one, because that makes hot updates of the catalog jar easier for us.

2. Users can also register a catalog in the CatalogManager with the `CREATE CATALOG` statement. In this case, do we need to instantiate the Catalog immediately, or can we defer instantiation until it is first used?

Best,
Shengkai

Feng Jin <jinfeng1...@gmail.com> wrote on Thu, Feb 9, 2023 at 20:13:

> Thanks for your reply.
>
> @Timo
>
> > 2) avoid the default in-memory catalog and offer their catalog before a TableEnvironment session starts
> > 3) whether this can be disabled and SHOW CATALOGS can be used for listing first without having a default catalog.
>
> Regarding 2 and 3, I think this problem can be solved by introducing catalog providers, so that users can control some of the default catalog behavior.
>
> > We could also use the org.apache.flink.table.factories.Factory infra and allow catalog providers via pure string properties
>
> I think this is also very useful. In our usage scenarios we usually manage multiple clusters, so we also need to pass different configurations through parameters.
>
> @Jark @Huang
>
> > About the lazy catalog initialization
>
> Our needs may be different. If these properties already exist in an external system, especially when there may be thousands of catalog properties, I don't think it is necessary to register all of them in the Flink env at startup. What we need is to register a catalog only when it is needed, and to get its properties from the external meta system.
>
> > It may be hard to avoid conflicts and duplicates between CatalogProvider and CatalogManager
>
> It is indeed easy to conflict.
> My idea is to make the catalog management of the current CatalogManager the default CatalogProvider behavior and, at the same time, to allow only one CatalogProvider in a Flink Env. This may avoid catalog conflicts.
>
> Best,
> Feng
>
> On Tue, Feb 7, 2023 at 1:01 PM Hang Ruan <ruanhang1...@gmail.com> wrote:
>
> > Hi Feng,
> >
> > I agree with what Jark said. I think what you are looking for is lazy initialization.
> >
> > I don't think we should introduce the new interface CatalogProvider for lazy initialization. What we should do is store the catalog properties and initialize the catalog when we need it. Could you please describe some other scenarios, besides lazy initialization, in which we need the CatalogProvider?
> >
> > If we really need the CatalogProvider, I think it is better for it to be a single instance. Multiple instances are difficult to manage, and there can be name conflicts among providers.
> >
> > Best,
> > Hang
> >
> > Jark Wu <imj...@gmail.com> wrote on Tue, Feb 7, 2023 at 10:48:
> >
> > > Hi Feng,
> > >
> > > I think this feature makes a lot of sense. If I understand correctly, what you are looking for is lazy catalog initialization.
> > >
> > > However, I have some concerns about introducing CatalogProvider, which delegates catalog management to users. It may be hard to avoid conflicts and duplicates between CatalogProvider and CatalogManager. Is it possible to have a built-in CatalogProvider to instantiate catalogs lazily?
> > >
> > > An idea in my mind is to introduce another catalog registration API without instantiating the catalog, e.g., registerCatalog(String catalogName, Map<String, String> catalogProperties). The catalog information is stored in CatalogManager as pure strings. The catalog is instantiated and initialized when used.
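
The lazy registration idea described above can be sketched roughly as follows. This is only a minimal illustration under stated assumptions, not Flink code: `SimpleCatalog`, `LazyCatalogRegistry`, the `BiFunction` factory hook, and `instantiationCount()` are all hypothetical names introduced here so that the example stays self-contained. Only the shape of `registerCatalog(String, Map<String, String>)` comes from the proposal.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.BiFunction;

/** Hypothetical stand-in for a catalog; not Flink's Catalog interface. */
interface SimpleCatalog {
    String name();
}

/** Keeps catalog properties as plain strings and instantiates a catalog on first use. */
class LazyCatalogRegistry {
    // Properties registered up front; cheap to hold even for many catalogs.
    private final Map<String, Map<String, String>> properties = new HashMap<>();
    // Catalog instances created so far, keyed by catalog name.
    private final Map<String, SimpleCatalog> instances = new HashMap<>();
    // Hypothetical factory hook: builds a catalog from its name and string properties.
    private final BiFunction<String, Map<String, String>, SimpleCatalog> factory;
    private int instantiations = 0;

    LazyCatalogRegistry(BiFunction<String, Map<String, String>, SimpleCatalog> factory) {
        this.factory = factory;
    }

    /** Stores the properties only; no catalog is created here. */
    void registerCatalog(String catalogName, Map<String, String> catalogProperties) {
        if (properties.containsKey(catalogName)) {
            throw new IllegalArgumentException("Catalog already registered: " + catalogName);
        }
        properties.put(catalogName, new HashMap<>(catalogProperties));
    }

    /** Creates the catalog on first access and reuses the same instance afterwards. */
    Optional<SimpleCatalog> getCatalog(String catalogName) {
        SimpleCatalog cached = instances.get(catalogName);
        if (cached != null) {
            return Optional.of(cached);
        }
        Map<String, String> props = properties.get(catalogName);
        if (props == null) {
            return Optional.empty();
        }
        SimpleCatalog created = factory.apply(catalogName, props);
        instantiations++;
        instances.put(catalogName, created);
        return Optional.of(created);
    }

    /** Number of catalogs actually instantiated; only here to make the laziness observable. */
    int instantiationCount() {
        return instantiations;
    }
}
```

In this sketch the instance is cached and reused after the first lookup. Building a fresh instance on every lookup instead would be a small change in `getCatalog`, at the cost of repeated initialization.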
> > >
> > > This new API is very similar to other pure-string metadata registration, such as "createTable(String path, TableDescriptor descriptor)" and "createFunction(String path, String className, List<ResourceUri> resourceUris)".
> > >
> > > Can this approach satisfy your requirement?
> > >
> > > Best,
> > > Jark
> > >
> > > On Mon, 6 Feb 2023 at 22:53, Timo Walther <twal...@apache.org> wrote:
> > >
> > > > Hi Feng,
> > > >
> > > > this is indeed a good proposal.
> > > >
> > > > 1) It makes sense to improve the catalog listing for platform providers.
> > > >
> > > > 2) Other feedback from the past has shown that users would like to avoid the default in-memory catalog and offer their catalog before a TableEnvironment session starts.
> > > >
> > > > 3) Also we might reconsider whether a default catalog and default database make sense. Or whether this can be disabled and SHOW CATALOGS can be used for listing first without having a default catalog.
> > > >
> > > > What do you think about option 2 and 3?
> > > >
> > > > In any case, I would propose we pass a CatalogProvider to EnvironmentSettings and only allow a single instance. Catalogs should never shadow other catalogs.
> > > >
> > > > We could also use the org.apache.flink.table.factories.Factory infra and allow catalog providers via pure string properties. Not sure if we need this in the first version though.
> > > >
> > > > Cheers,
> > > > Timo
> > > >
> > > > On 06.02.23 11:21, Feng Jin wrote:
> > > > > Hi everyone,
> > > > >
> > > > > The original discussion address is
> > > > > https://issues.apache.org/jira/browse/FLINK-30126
> > > > >
> > > > > Currently, Flink has access to many systems, including kafka, hive, iceberg, hudi, elasticsearch, mysql...
> > > > > The corresponding catalog names might be:
> > > > > kafka_cluster1, kafka_cluster2, hive_cluster1, hive_cluster2, iceberg_cluster2, elasticsearch_cluster1, mysql_database1_xxx, mysql_database2_xxxx
> > > > >
> > > > > As the platform for Flink SQL jobs, we need to maintain the meta information of each system in the company, and when a Flink job starts, we need to register the catalogs with the Flink table environment so that users can use any table through the env.executeSql interface.
> > > > >
> > > > > When we only have a small number of catalogs, we can register them like this. But when there are thousands of catalogs, I think there needs to be a dynamic loading mechanism that lets us register a catalog only when it is needed, which speeds up the initialization of the table environment and avoids useless catalog registration.
> > > > >
> > > > > Preliminary thoughts:
> > > > >
> > > > > A new CatalogProvider interface can be added. It contains two methods:
> > > > > * listCatalogs(), which lists all the catalogs that the provider can provide
> > > > > * getCatalog(), which gets a catalog instance by catalog name.
> > > > >
> > > > > ```java
> > > > > import java.util.Optional;
> > > > > import java.util.Set;
> > > > >
> > > > > import org.apache.flink.configuration.ReadableConfig;
> > > > > import org.apache.flink.table.catalog.Catalog;
> > > > >
> > > > > public interface CatalogProvider {
> > > > >
> > > > >     // Optional hook for setting up the provider before first use.
> > > > >     default void initialize(ClassLoader classLoader, ReadableConfig config) {}
> > > > >
> > > > >     // Returns the catalog with the given name, if the provider knows it.
> > > > >     Optional<Catalog> getCatalog(String catalogName);
> > > > >
> > > > >     // Lists the names of all catalogs the provider can provide.
> > > > >     Set<String> listCatalogs();
> > > > > }
> > > > > ```
> > > > >
> > > > > The corresponding implementation in CatalogManager is as follows:
> > > > >
> > > > > ```java
> > > > > public class CatalogManager {
> > > > >     private @Nullable CatalogProvider catalogProvider;
> > > > >
> > > > >     private Map<String, Catalog> catalogs;
> > > > >
> > > > >     public void setCatalogProvider(CatalogProvider catalogProvider) {
> > > > >         this.catalogProvider = catalogProvider;
> > > > >     }
> > > > >
> > > > >     public Optional<Catalog> getCatalog(String catalogName) {
> > > > >         // Catalogs registered in the table environment are looked up first.
> > > > >         Catalog registered = catalogs.get(catalogName);
> > > > >         if (registered != null) {
> > > > >             return Optional.of(registered);
> > > > >         }
> > > > >         // If there is no corresponding catalog in catalogs,
> > > > >         // get catalog by catalogProvider
> > > > >         if (catalogProvider != null) {
> > > > >             return catalogProvider.getCatalog(catalogName);
> > > > >         }
> > > > >         return Optional.empty();
> > > > >     }
> > > > > }
> > > > > ```
> > > > >
> > > > > Possible problems:
> > > > >
> > > > > 1. Catalog name conflicts: how do we choose when a registered catalog and a catalog provided by the catalog-provider have the same name?
> > > > > I prefer tableEnv-registered catalogs over catalogs provided by the catalog-provider. If the user wishes to reference the catalog provided by the catalog-provider, they can unregister the catalog in tableEnv through the `unregisterCatalog` interface.
> > > > >
> > > > > 2. Number of CatalogProviders: is it possible to have multiple catalogProvider implementations?
> > > > > I don't have a good answer to this at the moment. Supporting multiple catalogProviders would be much more convenient, but there may be catalog name conflicts between different catalogProviders.
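
The precedence rule from problem 1 can be illustrated with a small self-contained sketch. Everything named here is an assumption made for illustration and is not Flink API: `NamedCatalog`, `SketchCatalogProvider`, `MapBackedProvider`, and `SketchCatalogManager` are hypothetical stand-ins, and the manager accepts at most one provider.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical stand-in for a catalog; not Flink's Catalog interface. */
interface NamedCatalog {
    String describe();
}

/** Hypothetical provider with the two lookups from the proposal. */
interface SketchCatalogProvider {
    Optional<NamedCatalog> getCatalog(String catalogName);

    Set<String> listCatalogs();
}

/** Provider backed by an in-memory map, standing in for an external meta system. */
class MapBackedProvider implements SketchCatalogProvider {
    private final Map<String, NamedCatalog> external;

    MapBackedProvider(Map<String, NamedCatalog> external) {
        this.external = new HashMap<>(external);
    }

    @Override
    public Optional<NamedCatalog> getCatalog(String catalogName) {
        return Optional.ofNullable(external.get(catalogName));
    }

    @Override
    public Set<String> listCatalogs() {
        return new TreeSet<>(external.keySet());
    }
}

/** Registered catalogs win over provider catalogs with the same name. */
class SketchCatalogManager {
    private final Map<String, NamedCatalog> catalogs = new LinkedHashMap<>();
    private final SketchCatalogProvider provider; // may be null: no provider configured

    SketchCatalogManager(SketchCatalogProvider provider) {
        this.provider = provider;
    }

    void registerCatalog(String name, NamedCatalog catalog) {
        catalogs.put(name, catalog);
    }

    void unregisterCatalog(String name) {
        catalogs.remove(name);
    }

    Optional<NamedCatalog> getCatalog(String name) {
        NamedCatalog registered = catalogs.get(name);
        if (registered != null) {
            return Optional.of(registered);
        }
        return provider == null ? Optional.empty() : provider.getCatalog(name);
    }

    /** Union of registered and provider catalog names, sorted for stable output. */
    Set<String> listCatalogs() {
        Set<String> names = new TreeSet<>(catalogs.keySet());
        if (provider != null) {
            names.addAll(provider.listCatalogs());
        }
        return names;
    }
}
```

With this shape, unregistering a shadowing catalog makes the provider's catalog of the same name visible again, which matches the `unregisterCatalog` workaround described in problem 1.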
> > > > >
> > > > > Looking forward to your reply; any feedback is appreciated!
> > > > >
> > > > > Best,
> > > > > Feng Jin