I'd be very happy to do it. Could you please grant me edit permission? My
Jira ID is hackergin.

Thanks

Best,
Feng

On Fri, Feb 10, 2023 at 4:02 PM Jark Wu <imj...@gmail.com> wrote:
>
> Thank you Feng,
>
> Feel free to start a FLIP proposal if you are interested. Looking forward
> to it!
>
> Best,
> Jark
>
> > On Feb 10, 2023, at 15:44, Feng Jin <jinfeng1...@gmail.com> wrote:
> >
> > @Shengkai
> >> About the catalog jar hot updates
> >
> > We do not have a similar requirement at the moment, but if the catalog
> > management interface is opened up, it could indeed enable hot loading
> > of catalog jars.
> >
> >
> >> do we need to instantiate the Catalog immediately or defer to the usage
> >
> > I think this can stay the same as before.
> >
> >
> >
> > @Jark
> >> There can only be a single catalog manager in TableEnvironment.
> >
> > Big +1 for this. It avoids conflicts and also meets the catalog
> > persistence requirement.
> >
> >
> > Best,
> > Feng
> >
> > On Fri, Feb 10, 2023 at 3:09 PM Jark Wu <imj...@gmail.com> wrote:
> >>
> >> Hi Feng,
> >>
> >> Conflicts and inconsistencies are still likely even if we have only one
> >> CatalogProvider, because CatalogProvider only exposes read-only methods
> >> (listCatalogs, getCatalog). For example, you may register a catalog X but
> >> be unable to list it, because it is not in the external metadata service.
> >>
> >> To avoid catalog conflicts and keep things consistent, we can extract the
> >> catalog management logic into a pluggable interface that includes
> >> listCatalogs, getCatalog, registerCatalog, unregisterCatalog, etc. The
> >> current CatalogManager would be the default in-memory implementation, and
> >> you could replace it with a user-defined manager, such as:
> >> - file-based: manages catalog information in local files, just like
> >> how Presto/Trino manages catalogs
> >> - metaservice-based: manages catalog information in an external
> >> metadata service.
> >>
> >> There can only be a single catalog manager in a TableEnvironment. This
> >> guarantees data consistency and avoids conflicts. This approach also
> >> addresses another pain point of Flink SQL: catalog information is not
> >> persisted.
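
For illustration, here is a minimal sketch of what a pluggable catalog management interface with both read and write methods might look like. `CatalogStoreSketch` and `InMemoryCatalogStore` are hypothetical names for this example, not Flink APIs, and catalogs are represented only by their string properties:

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

/** Hypothetical pluggable interface covering both reads and writes. */
interface CatalogStoreSketch {
    void registerCatalog(String name, Map<String, String> properties);

    void unregisterCatalog(String name);

    Optional<Map<String, String>> getCatalog(String name);

    Set<String> listCatalogs();
}

/** Assumed default: an in-memory store, so nothing survives a restart. */
class InMemoryCatalogStore implements CatalogStoreSketch {
    private final Map<String, Map<String, String>> catalogs = new LinkedHashMap<>();

    @Override
    public void registerCatalog(String name, Map<String, String> properties) {
        // Reject duplicates so one name always maps to one catalog.
        if (catalogs.containsKey(name)) {
            throw new IllegalStateException("Catalog already exists: " + name);
        }
        catalogs.put(name, new LinkedHashMap<>(properties));
    }

    @Override
    public void unregisterCatalog(String name) {
        catalogs.remove(name);
    }

    @Override
    public Optional<Map<String, String>> getCatalog(String name) {
        return Optional.ofNullable(catalogs.get(name));
    }

    @Override
    public Set<String> listCatalogs() {
        // Return a copy so callers cannot modify the store through the view.
        return new LinkedHashSet<>(catalogs.keySet());
    }
}
```

Because registration and listing go through the same store, a catalog that was registered is always listed. A file-based or metaservice-based variant would implement the same four methods against its own backend.
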
> >>
> >> Can this approach satisfy your requirements?
> >>
> >> Best,
> >> Jark
> >>
> >>
> >>
> >>
> >>
> >> On Fri, 10 Feb 2023 at 11:21, Shengkai Fang <fskm...@gmail.com> wrote:
> >>
> >>> Hi Feng.
> >>>
> >>> I think your idea is very interesting!
> >>>
> >>> 1. I just wonder: after the Catalog is initialized, will the Session
> >>> reuse the same Catalog instance or build a new one for later usage? If we
> >>> reuse the same Catalog, I think it's more like lazy initialization. I lean
> >>> slightly toward building a new one, because that makes hot updates of the
> >>> catalog jar easier.
> >>>
> >>> 2. When users run a `CREATE CATALOG` statement against the
> >>> CatalogManager, do we need to instantiate the Catalog immediately, or
> >>> defer instantiation until it is used?
> >>>
> >>> Best,
> >>> Shengkai
> >>>
> >>> On Thu, Feb 9, 2023 at 20:13, Feng Jin <jinfeng1...@gmail.com> wrote:
> >>>
> >>>> Thanks for your reply.
> >>>>
> >>>> @Timo
> >>>>
> >>>>> 2) avoid the default in-memory catalog and offer their catalog before
> >>>>> a TableEnvironment session starts
> >>>>> 3) whether this can be disabled and SHOW CATALOGS can be used for
> >>>>> listing first without having a default catalog.
> >>>>
> >>>>
> >>>> Regarding 2 and 3, I think both can be addressed by introducing
> >>>> catalog providers, which would let users control the default catalog
> >>>> behavior.
> >>>>
> >>>>
> >>>>> We could also use the org.apache.flink.table.factories.Factory infra
> >>>>> and allow catalog providers via pure string properties
> >>>>
> >>>> I think this is also very useful. In our scenarios we usually manage
> >>>> multiple clusters, so we also need to pass different configurations
> >>>> through parameters.
> >>>>
> >>>>
> >>>> @Jark @Huang
> >>>>
> >>>>> About the lazy catalog initialization
> >>>>
> >>>> Our needs may be different. If these properties already exist in an
> >>>> external system, especially when there may be thousands of catalogs,
> >>>> I don't think it is necessary to register all of them in the Flink
> >>>> env at startup. What we need is to register a catalog only when it is
> >>>> needed and to fetch its properties from the external meta system at
> >>>> that point.
> >>>>
> >>>>
> >>>>> It may be hard to avoid conflicts and duplicates between
> >>>>> CatalogProvider and CatalogManager
> >>>>
> >>>> Conflicts are indeed easy to run into. My idea is to extract the
> >>>> catalog management of the current CatalogManager into the default
> >>>> CatalogProvider implementation, and to allow only one CatalogProvider
> >>>> per Flink env. This should avoid catalog conflicts.
> >>>>
> >>>>
> >>>> Best,
> >>>> Feng
> >>>>
> >>>> On Tue, Feb 7, 2023 at 1:01 PM Hang Ruan <ruanhang1...@gmail.com> wrote:
> >>>>>
> >>>>> Hi Feng,
> >>>>> I agree with what Jark said. I think what you are looking for is lazy
> >>>>> initialization.
> >>>>>
> >>>>> I don't think we should introduce the new interface CatalogProvider for
> >>>>> lazy initialization. What we should do is store the catalog properties
> >>>>> and initialize the catalog when we need it. Could you please describe
> >>>>> other scenarios, besides lazy initialization, in which we need the
> >>>>> CatalogProvider?
> >>>>>
> >>>>> If we really need the CatalogProvider, I think it should be a single
> >>>>> instance. Multiple instances are difficult to manage, and names can
> >>>>> conflict among providers.
> >>>>>
> >>>>> Best,
> >>>>> Hang
> >>>>>
> >>>>> On Tue, Feb 7, 2023 at 10:48, Jark Wu <imj...@gmail.com> wrote:
> >>>>>
> >>>>>> Hi Feng,
> >>>>>>
> >>>>>> I think this feature makes a lot of sense. If I understand correctly,
> >>>>>> what you are looking for is lazy catalog initialization.
> >>>>>>
> >>>>>> However, I have some concerns about introducing CatalogProvider, which
> >>>>>> delegates catalog management to users. It may be hard to avoid
> >>>>>> conflicts and duplicates between CatalogProvider and CatalogManager.
> >>>>>> Is it possible to have a built-in CatalogProvider to instantiate
> >>>>>> catalogs lazily?
> >>>>>>
> >>>>>> An idea in my mind is to introduce another catalog registration API
> >>>>>> without instantiating the catalog, e.g., registerCatalog(String
> >>>>>> catalogName, Map<String, String> catalogProperties). The catalog
> >>>>>> information is stored in CatalogManager as pure strings. The catalog
> >>>>>> is instantiated and initialized when used.
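
For illustration, a minimal sketch of the string-based registration idea quoted above: properties are stored at registration time and the catalog is only built on first access. `SketchCatalog`, `LazyCatalogRegistry`, and the factory callback are hypothetical stand-ins for this example, not Flink APIs:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.BiFunction;

/** Hypothetical stand-in for a catalog; only what the sketch needs. */
interface SketchCatalog {
    String name();
}

/** Keeps catalog properties as plain strings and instantiates lazily. */
class LazyCatalogRegistry {
    private final Map<String, Map<String, String>> properties = new HashMap<>();
    private final Map<String, SketchCatalog> instances = new HashMap<>();
    // Assumed factory: builds a catalog from its name and string properties.
    private final BiFunction<String, Map<String, String>, SketchCatalog> factory;
    // Counts factory calls so the laziness is observable.
    int instantiations = 0;

    LazyCatalogRegistry(BiFunction<String, Map<String, String>, SketchCatalog> factory) {
        this.factory = factory;
    }

    /** Registers a catalog by name without creating it. */
    void registerCatalog(String catalogName, Map<String, String> catalogProperties) {
        if (properties.containsKey(catalogName)) {
            throw new IllegalStateException("Catalog already exists: " + catalogName);
        }
        properties.put(catalogName, new HashMap<>(catalogProperties));
    }

    /** Instantiates the catalog on first access and caches it afterwards. */
    Optional<SketchCatalog> getCatalog(String catalogName) {
        Map<String, String> props = properties.get(catalogName);
        if (props == null) {
            return Optional.empty();
        }
        SketchCatalog catalog = instances.get(catalogName);
        if (catalog == null) {
            catalog = factory.apply(catalogName, props);
            instantiations++;
            instances.put(catalogName, catalog);
        }
        return Optional.of(catalog);
    }
}
```

This variant caches the instance after the first lookup. Rebuilding the catalog on every lookup instead would only require dropping the `instances` map.
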
> >>>>>>
> >>>>>> This new API is very similar to other pure-string metadata
> >>>>>> registration, such as "createTable(String path, TableDescriptor
> >>>>>> descriptor)" and "createFunction(String path, String className,
> >>>>>> List<ResourceUri> resourceUris)".
> >>>>>>
> >>>>>> Can this approach satisfy your requirement?
> >>>>>>
> >>>>>> Best,
> >>>>>> Jark
> >>>>>>
> >>>>>> On Mon, 6 Feb 2023 at 22:53, Timo Walther <twal...@apache.org> wrote:
> >>>>>>
> >>>>>>> Hi Feng,
> >>>>>>>
> >>>>>>> this is indeed a good proposal.
> >>>>>>>
> >>>>>>> 1) It makes sense to improve the catalog listing for platform
> >>>>>>> providers.
> >>>>>>>
> >>>>>>> 2) Other feedback from the past has shown that users would like to
> >>>>>>> avoid the default in-memory catalog and offer their catalog before a
> >>>>>>> TableEnvironment session starts.
> >>>>>>>
> >>>>>>> 3) Also we might reconsider whether a default catalog and default
> >>>>>>> database make sense. Or whether this can be disabled and SHOW
> >>>>>>> CATALOGS can be used for listing first without having a default
> >>>>>>> catalog.
> >>>>>>>
> >>>>>>> What do you think about options 2 and 3?
> >>>>>>>
> >>>>>>> In any case, I would propose we pass a CatalogProvider to
> >>>>>>> EnvironmentSettings and only allow a single instance. Catalogs should
> >>>>>>> never shadow other catalogs.
> >>>>>>>
> >>>>>>> We could also use the org.apache.flink.table.factories.Factory infra
> >>>>>>> and allow catalog providers via pure string properties. Not sure if
> >>>>>>> we need this in the first version though.
> >>>>>>>
> >>>>>>> Cheers,
> >>>>>>> Timo
> >>>>>>>
> >>>>>>>
> >>>>>>> On 06.02.23 11:21, Feng Jin wrote:
> >>>>>>>> Hi everyone,
> >>>>>>>>
> >>>>>>>> The original discussion is at
> >>>>>>>> https://issues.apache.org/jira/browse/FLINK-30126
> >>>>>>>>
> >>>>>>>> Currently, Flink has access to many systems, including Kafka, Hive,
> >>>>>>>> Iceberg, Hudi, Elasticsearch, MySQL... The corresponding catalog
> >>>>>>>> names might be:
> >>>>>>>> kafka_cluster1, kafka_cluster2, hive_cluster1, hive_cluster2,
> >>>>>>>> iceberg_cluster2, elasticsearch_cluster1, mysql_database1_xxx,
> >>>>>>>> mysql_database2_xxxx
> >>>>>>>>
> >>>>>>>> As the platform for Flink SQL jobs, we need to maintain the meta
> >>>>>>>> information of each system in the company, and when a Flink job
> >>>>>>>> starts, we need to register the catalogs with the Flink table
> >>>>>>>> environment, so that users can use any table through the
> >>>>>>>> env.executeSql interface.
> >>>>>>>>
> >>>>>>>> With only a small number of catalogs, we can register them all up
> >>>>>>>> front like this. With thousands of catalogs, however, I think we
> >>>>>>>> need a dynamic loading mechanism that registers a catalog only when
> >>>>>>>> it is needed. This speeds up the initialization of the table
> >>>>>>>> environment and avoids useless catalog registration.
> >>>>>>>>
> >>>>>>>> Preliminary thoughts:
> >>>>>>>>
> >>>>>>>> A new CatalogProvider interface can be added.
> >>>>>>>> It contains two methods:
> >>>>>>>> * listCatalogs(), which lists all the catalogs that the provider
> >>>>>>>> can supply
> >>>>>>>> * getCatalog(), which gets a catalog instance by catalog name.
> >>>>>>>>
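> >>>>>>>> ```java
> >>>>>>>> import java.util.Optional;
> >>>>>>>> import java.util.Set;
> >>>>>>>> import org.apache.flink.configuration.ReadableConfig;
> >>>>>>>> import org.apache.flink.table.catalog.Catalog;
> >>>>>>>>
> >>>>>>>> public interface CatalogProvider {
> >>>>>>>>
> >>>>>>>>     // Optional setup hook; does nothing by default.
> >>>>>>>>     default void initialize(ClassLoader classLoader, ReadableConfig config) {}
> >>>>>>>>
> >>>>>>>>     // Returns the catalog with the given name, if the provider has it.
> >>>>>>>>     Optional<Catalog> getCatalog(String catalogName);
> >>>>>>>>
> >>>>>>>>     // Lists the names of all catalogs the provider can supply.
> >>>>>>>>     Set<String> listCatalogs();
> >>>>>>>> }
> >>>>>>>> ```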
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> The corresponding implementation in CatalogManager is as follows:
> >>>>>>>>
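> >>>>>>>> ```java
> >>>>>>>> import java.util.LinkedHashMap;
> >>>>>>>> import java.util.Map;
> >>>>>>>> import java.util.Optional;
> >>>>>>>> import javax.annotation.Nullable;
> >>>>>>>> import org.apache.flink.table.catalog.Catalog;
> >>>>>>>>
> >>>>>>>> public class CatalogManager {
> >>>>>>>>     private @Nullable CatalogProvider catalogProvider;
> >>>>>>>>
> >>>>>>>>     private Map<String, Catalog> catalogs = new LinkedHashMap<>();
> >>>>>>>>
> >>>>>>>>     public void setCatalogProvider(CatalogProvider catalogProvider) {
> >>>>>>>>         this.catalogProvider = catalogProvider;
> >>>>>>>>     }
> >>>>>>>>
> >>>>>>>>     public Optional<Catalog> getCatalog(String catalogName) {
> >>>>>>>>         // Catalogs registered in the table environment come first.
> >>>>>>>>         Catalog registered = catalogs.get(catalogName);
> >>>>>>>>         if (registered != null) {
> >>>>>>>>             return Optional.of(registered);
> >>>>>>>>         }
> >>>>>>>>         // If there is no corresponding catalog in catalogs,
> >>>>>>>>         // get catalog by catalogProvider
> >>>>>>>>         if (catalogProvider != null) {
> >>>>>>>>             return catalogProvider.getCatalog(catalogName);
> >>>>>>>>         }
> >>>>>>>>         return Optional.empty();
> >>>>>>>>     }
> >>>>>>>> }
> >>>>>>>> ```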
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Possible problems:
> >>>>>>>>
> >>>>>>>> 1. Catalog name conflicts: which one wins when a registered catalog
> >>>>>>>> and a catalog supplied by the catalog provider have the same name?
> >>>>>>>> I prefer tableEnv-registered catalogs over those supplied by the
> >>>>>>>> catalog provider. If users want to reference the catalog supplied
> >>>>>>>> by the catalog provider, they can unregister the catalog in
> >>>>>>>> tableEnv through the `unregisterCatalog` interface.
> >>>>>>>>
> >>>>>>>> 2. Number of CatalogProviders: is it possible to have multiple
> >>>>>>>> CatalogProvider implementations?
> >>>>>>>> I don't have a good answer to this at the moment. Supporting
> >>>>>>>> multiple CatalogProviders would be much more convenient, but there
> >>>>>>>> may be catalog name conflicts between different CatalogProviders.
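
For illustration, a minimal sketch of the precedence rule described in point 1: a catalog registered in the table environment shadows a provider catalog of the same name until it is unregistered. `PrecedenceSketch` is a hypothetical name, catalogs are represented as plain strings, and the provider is modeled as a simple lookup function:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

class PrecedenceSketch {
    // Catalogs registered directly in the table environment.
    private final Map<String, String> registered = new HashMap<>();
    // Stands in for CatalogProvider.getCatalog(catalogName).
    private final Function<String, Optional<String>> provider;

    PrecedenceSketch(Function<String, Optional<String>> provider) {
        this.provider = provider;
    }

    void registerCatalog(String name, String catalog) {
        registered.put(name, catalog);
    }

    void unregisterCatalog(String name) {
        registered.remove(name);
    }

    Optional<String> getCatalog(String name) {
        // Registered catalogs win; the provider is only a fallback.
        String local = registered.get(name);
        return local != null ? Optional.of(local) : provider.apply(name);
    }
}
```

With this rule, a name conflict never fails at lookup time, but the provider's catalog stays invisible while the registered one exists.
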
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Looking forward to your reply, any feedback is appreciated!
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Best.
> >>>>>>>>
> >>>>>>>> Feng Jin
> >>>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>
> >>>
>
