[jira] [Commented] (FLINK-13750) Separate HA services between client-/ and server-side
[ https://issues.apache.org/jira/browse/FLINK-13750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16919380#comment-16919380 ]

TisonKun commented on FLINK-13750:
----------------------------------

[~till.rohrmann] Sure. FYI, see FLINK-13912. I'd like to mark this issue as blocked by that one, because then we can avoid introducing a new endpoint at the WebMonitor.

> Separate HA services between client-/ and server-side
> -----------------------------------------------------
>
>                 Key: FLINK-13750
>                 URL: https://issues.apache.org/jira/browse/FLINK-13750
>             Project: Flink
>          Issue Type: Improvement
>          Components: Command Line Client, Runtime / Coordination
>            Reporter: Chesnay Schepler
>            Assignee: TisonKun
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently, we use the same {{HighAvailabilityServices}} on the client and
> server. However, the client does not need several of the features that the
> services currently provide (access to the blobstore or checkpoint metadata).
> Additionally, due to how these services are set up they also require the
> client to have access to the blob storage, despite it never actually being
> used, which can cause issues, like FLINK-13500.
> [~Tison] Would you be interested in this issue?

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
[jira] [Commented] (FLINK-13750) Separate HA services between client-/ and server-side
[ https://issues.apache.org/jira/browse/FLINK-13750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16919312#comment-16919312 ]

Till Rohrmann commented on FLINK-13750:
---------------------------------------

Yes, there should no longer be any need to know the port and hostname of the dispatcher. Could we create a separate issue for the removal of this method?
[jira] [Commented] (FLINK-13750) Separate HA services between client-/ and server-side
[ https://issues.apache.org/jira/browse/FLINK-13750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16919295#comment-16919295 ]

TisonKun commented on FLINK-13750:
----------------------------------

Hi there. I noticed that {{ClusterClient#getClusterConnectionInfo}} does not work as expected in the YARN scenario. There, we set the port to exactly the WebMonitor's port.

{code:java}
final String host = appReport.getHost();
final int rpcPort = appReport.getRpcPort();

LOG.info("Found application JobManager host name '{}' and port '{}' from supplied application id '{}'",
    host, rpcPort, applicationId);

// The same host and port are written to both the JobManager and the REST options.
flinkConfiguration.setString(JobManagerOptions.ADDRESS, host);
flinkConfiguration.setInteger(JobManagerOptions.PORT, rpcPort);

flinkConfiguration.setString(RestOptions.ADDRESS, host);
flinkConfiguration.setInteger(RestOptions.PORT, rpcPort);
{code}

Thus the interface never works as expected. I suspect this was not discovered before because no one relies on it. On further investigation, all usages (leaving aside those used only for logging; they are all under scala-shell and thus the remote executor) actually communicate with the {{WebMonitor}} instead of the {{Dispatcher}}. I'd therefore like to just remove this interface.
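Editorial sketch (not part of the original thread): the observation in the comment above can be reproduced with a few stand-in types. The classes `SketchAppReport` and `YarnConnectionInfoSketch` are hypothetical, and the string keys are used purely for illustration in place of the `JobManagerOptions` and `RestOptions` constants. Assuming the report carries a single port, as in the quoted snippet, the "dispatcher" address a caller reads back is indistinguishable from the REST address.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for the YARN application report used in the quoted snippet. */
final class SketchAppReport {
    private final String host;
    private final int rpcPort;

    SketchAppReport(String host, int rpcPort) {
        this.host = host;
        this.rpcPort = rpcPort;
    }

    String getHost() { return host; }

    int getRpcPort() { return rpcPort; }
}

final class YarnConnectionInfoSketch {

    /** Mirrors the quoted snippet: one reported host/port pair feeds both option groups. */
    static Map<String, String> toConfiguration(SketchAppReport report) {
        Map<String, String> conf = new LinkedHashMap<>();
        // Illustrative keys only; the real code uses option constants.
        conf.put("jobmanager.rpc.address", report.getHost());
        conf.put("jobmanager.rpc.port", String.valueOf(report.getRpcPort()));
        conf.put("rest.address", report.getHost());
        conf.put("rest.port", String.valueOf(report.getRpcPort()));
        return conf;
    }

    /** What a caller asking for the dispatcher's connection info would read back. */
    static String dispatcherAddress(Map<String, String> conf) {
        return conf.get("jobmanager.rpc.address") + ":" + conf.get("jobmanager.rpc.port");
    }

    /** What a caller asking for the REST endpoint would read back. */
    static String restAddress(Map<String, String> conf) {
        return conf.get("rest.address") + ":" + conf.get("rest.port");
    }
}
```

With this setup the two lookups cannot differ, which is consistent with the comment's conclusion that callers end up talking to the WebMonitor whichever value they ask for.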
[jira] [Commented] (FLINK-13750) Separate HA services between client-/ and server-side
[ https://issues.apache.org/jira/browse/FLINK-13750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16913856#comment-16913856 ]

TisonKun commented on FLINK-13750:
----------------------------------

Hi [~till.rohrmann][~Zentol], I sent a pull request, GH-9509, following our discussion above. Please take a look :-)

Besides, I'd like to start a new sub-topic here: shall we expose "leaderInfo" to the client? The only usages I see are logging in {{FlinkYarnSessionCli}} and extracting the dispatcher host:port in {{FlinkShell}} in the YARN scenario. I am thinking of an alternative to exposing "leaderInfo" to the client. Or we could simply decide, by design, to expose it here.
[jira] [Commented] (FLINK-13750) Separate HA services between client-/ and server-side
[ https://issues.apache.org/jira/browse/FLINK-13750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16912036#comment-16912036 ]

TisonKun commented on FLINK-13750:
----------------------------------

I have started a survey thread on both the user and dev lists[1] to get an accurate idea of how our users use high-availability services in Flink. While collecting feedback, I also want to start the implementation work discussed above, to see whether there are any problems we have not noticed.

[1] https://lists.apache.org/x/thread.html/c0cc07197e6ba30b45d7709cc9e17d8497e5e3f33de504d58dfcafad@%3Cuser.flink.apache.org%3E
[jira] [Commented] (FLINK-13750) Separate HA services between client-/ and server-side
[ https://issues.apache.org/jira/browse/FLINK-13750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16911396#comment-16911396 ]

Till Rohrmann commented on FLINK-13750:
---------------------------------------

Yes, you are right. The {{MiniCluster}} only needs to give access to the {{webMonitorLeaderRetrievalService}}. Nothing else should be needed after we have removed the requirement for the dispatcher leader retrieval service.

Doing a survey on the user and dev lists to measure how big a deal a breaking change would be could help us assess whether we need to provide backwards compatibility.
[jira] [Commented] (FLINK-13750) Separate HA services between client-/ and server-side
[ https://issues.apache.org/jira/browse/FLINK-13750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16911290#comment-16911290 ]

TisonKun commented on FLINK-13750:
----------------------------------

The original issue FLINK-13500, which is caused by this issue, requires initializing services on demand. Specifically, the BlobStoreService should not be initialized on the client side. Letting {{HighAvailabilityServices}} extend both {{ClientHighAvailabilityServices}} and {{ClusterHighAvailabilityServices}} and passing it as the respective interface doesn't fix this issue: it limits the access but would not change the initialization.

In our context, {{ClientHighAvailabilityServices}} has only one method, {{getWebMonitorLeaderRetrievalService}}, while {{ClusterHighAvailabilityServices}} doesn't need it. We can rename {{HighAvailabilityServices}} to {{ClusterHighAvailabilityServices}}, deprecate the method, and drop it once breaking changes are allowed.

The MiniCluster scenario is a special case: the client can directly access the dispatcher gateway and thus does not need a {{ClientHighAvailabilityServices}}. We can handle it separately, given its special nature.

The inheritance graph would be

{noformat}
ClientHighAvailabilityServices { only getWebMonitorLeaderRetrievalService }
    ↓
ZK.../Standalone.../Custom...
{noformat}

{noformat}
ClusterHighAvailabilityServices { ... deprecated getWebMonitorLeaderRetrievalService }
    ↓
ZK.../Standalone.../Embedded.../Custom...
{noformat}

Another problem is how we treat custom implementations. A quick solution is {{HighAvailabilityServicesFactory#createClientHAServices}} as described above, where the default creates a ClusterHighAvailabilityServices (the current HighAvailabilityServices) and wraps it so that only the deprecated {{getWebMonitorLeaderRetrievalService}} is accessible. We can drop the fallback once breaking changes are allowed.

Fair enough, a survey to help us understand how users actually customize their HA services would be helpful.
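Editorial sketch (not part of the original thread): the argument that narrowing the interface "limits the access but would not change the initialization" can be illustrated with stand-in types. Everything below is a simplified assumption: `EagerHaServices` and `LeanClientHaServices` are hypothetical classes, the counter merely simulates a blob store being set up in a constructor, and `LeaderRetrievalService` is reduced to a single method.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Simplified stand-in; the real retrieval service is listener-based. */
interface LeaderRetrievalService {
    String getLeaderAddress();
}

/** Client-side view as proposed in the thread: a single method. */
interface ClientHighAvailabilityServices {
    LeaderRetrievalService getWebMonitorLeaderRetrievalService();
}

/** Hypothetical full implementation whose constructor eagerly sets up blob storage. */
final class EagerHaServices implements ClientHighAvailabilityServices {
    static final AtomicInteger BLOB_STORE_INITS = new AtomicInteger();

    EagerHaServices() {
        // Simulates connecting to blob storage at construction time.
        BLOB_STORE_INITS.incrementAndGet();
    }

    @Override
    public LeaderRetrievalService getWebMonitorLeaderRetrievalService() {
        return () -> "webmonitor-host:8081";
    }
}

/** Hypothetical dedicated client-side implementation that never touches blob storage. */
final class LeanClientHaServices implements ClientHighAvailabilityServices {
    @Override
    public LeaderRetrievalService getWebMonitorLeaderRetrievalService() {
        return () -> "webmonitor-host:8081";
    }
}
```

Assigning an `EagerHaServices` to a variable of the narrow type still runs its constructor, so the blob store access happens regardless; only a separate implementation such as `LeanClientHaServices` avoids it. This is why the comment asks for dedicated client-side implementations rather than a narrowed view of the existing ones.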
[jira] [Commented] (FLINK-13750) Separate HA services between client-/ and server-side
[ https://issues.apache.org/jira/browse/FLINK-13750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16911247#comment-16911247 ]

Till Rohrmann commented on FLINK-13750:
---------------------------------------

This sounds good to me.

Maybe another thought: we could split {{HighAvailabilityServices}} up into {{ClientHighAvailabilityServices}} and {{ClusterHighAvailabilityServices}} and let {{HighAvailabilityServices}} extend both interfaces. That way we would decompose the original interface into smaller ones which could be passed into the respective components that use them: {{ClientHighAvailabilityServices}} go into the {{ClusterClient}}, whereas {{ClusterHighAvailabilityServices}} go into the cluster components. This would, however, break binary backwards compatibility, and people would need to recompile their custom implementations against the updated interface, I guess.
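Editorial sketch (not part of the original thread): the decomposition suggested above might look roughly as follows. The interface names come from the comment; the method sets are heavily reduced assumptions, and `BlobStore` and `SketchClusterClient` are hypothetical stand-ins rather than Flink classes.

```java
/** Simplified stand-in; the real retrieval service is listener-based. */
interface LeaderRetrievalService {
    String getLeaderAddress();
}

/** Hypothetical marker for server-side blob storage. */
interface BlobStore {
}

/** What the client needs. */
interface ClientHighAvailabilityServices {
    LeaderRetrievalService getWebMonitorLeaderRetriever();
}

/** What only the cluster components need (reduced to one method for the sketch). */
interface ClusterHighAvailabilityServices {
    BlobStore createBlobStore();
}

/** The original interface, now composed of the two smaller ones. */
interface HighAvailabilityServices
        extends ClientHighAvailabilityServices, ClusterHighAvailabilityServices {
}

/** Hypothetical client that can only see the client-side view. */
final class SketchClusterClient {
    private final ClientHighAvailabilityServices haServices;

    SketchClusterClient(ClientHighAvailabilityServices haServices) {
        this.haServices = haServices;
    }

    String restEndpoint() {
        // createBlobStore() is not reachable through the narrow type.
        return haServices.getWebMonitorLeaderRetriever().getLeaderAddress();
    }
}
```

An existing full implementation can be handed to `SketchClusterClient` unchanged at the source level, since it implements both views; the binary compatibility concern raised in the comment is about already compiled custom implementations, which this sketch does not model.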
[jira] [Commented] (FLINK-13750) Separate HA services between client-/ and server-side
[ https://issues.apache.org/jira/browse/FLINK-13750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16911184#comment-16911184 ]

TisonKun commented on FLINK-13750:
----------------------------------

Hi Till! It also occurred to me a few hours ago that ClusterClient should only hold a WebMonitorRetriever and that the WebMonitorRetriever should only be held by ClusterClient; a request for ConnectionInfo would then be forwarded to the Dispatcher by the WebMonitor, the same as other requests. I'm really encouraged that we are thinking along the same lines.

Given that the (Client)HighAvailabilityServices differ between RestClusterClient and MiniClusterClient, I would prefer to remove the field highAvailabilityServices and push the relevant implementations down to the subclasses. For RestClusterClient, it would be quite similar to your description, while for MiniClusterClient we can access the proper service with miniCluster.getHighAvailabilityServices. (It is an embedded one, and the whole mini cluster must run in the same process. Creating a new service therefore looks unnecessary and would break how the embedded service is implemented.)
[jira] [Commented] (FLINK-13750) Separate HA services between client-/ and server-side
[ https://issues.apache.org/jira/browse/FLINK-13750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16911145#comment-16911145 ]

Till Rohrmann commented on FLINK-13750:
---------------------------------------

Hi Tison, I would try to go the following way:

The {{RestClusterClient}} should only need the {{webMonitorRetrievalService}}. Hence we should try to get rid of the {{dispatcherLeaderRetriever}} and then the {{HighAvailabilityServices}} stored in the {{ClusterClient}}. Btw. the {{ClusterClient}} is a legacy class with a lot of unneeded code.

Then I would introduce a {{ClientHighAvailabilityServices}} which has a method {{LeaderRetrievalService getWebMonitorLeaderRetriever();}}. In order not to break backwards compatibility, we could merely deprecate the same method in {{HighAvailabilityServices}}.

Next, we would need to introduce a new method {{HighAvailabilityServicesFactory#createClientHAServices}} which allows us to create a {{ClientHighAvailabilityServices}} instance. This method should have a default implementation which fails. For backwards compatibility, we could still create a {{HighAvailabilityServices}} if {{#createClientHAServices}} fails and then call {{HighAvailabilityServices#getWebMonitorLeaderRetriever()}}.

For proper resource clean-up, one either needs to pass the {{(Client)HighAvailabilityServices}} to the {{RestClusterClient}} (ideally as an {{AutoCloseable}}) or create a wrapper for the {{LeaderRetrievalService}} which also closes the services when closing the {{LeaderRetrievalService}}.

I hope I haven't overlooked too many details here.
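Editorial sketch (not part of the original thread): the steps proposed above could be arranged roughly like this. The interface and method names follow the comment, but the signatures are simplified assumptions: the factory methods take no arguments here, `LeaderRetrievalService` is reduced to one method, and `ClientHaServicesLoader` is a hypothetical helper.

```java
/** Simplified stand-in; the real retrieval service is listener-based. */
interface LeaderRetrievalService {
    String getLeaderAddress();
}

interface ClientHighAvailabilityServices extends AutoCloseable {
    LeaderRetrievalService getWebMonitorLeaderRetriever();

    @Override
    void close();
}

interface HighAvailabilityServices extends AutoCloseable {
    /** Kept for backwards compatibility, as suggested in the comment. */
    @Deprecated
    LeaderRetrievalService getWebMonitorLeaderRetriever();

    @Override
    void close();
}

interface HighAvailabilityServicesFactory {
    HighAvailabilityServices createHAServices();

    /** Default implementation fails, so existing custom factories keep compiling. */
    default ClientHighAvailabilityServices createClientHAServices() {
        throw new UnsupportedOperationException("client HA services not implemented");
    }
}

/** Hypothetical helper implementing the fallback described above. */
final class ClientHaServicesLoader {
    static ClientHighAvailabilityServices load(HighAvailabilityServicesFactory factory) {
        try {
            return factory.createClientHAServices();
        } catch (UnsupportedOperationException e) {
            // Fallback: create the full services and expose only the retriever.
            final HighAvailabilityServices full = factory.createHAServices();
            return new ClientHighAvailabilityServices() {
                @Override
                public LeaderRetrievalService getWebMonitorLeaderRetriever() {
                    return full.getWebMonitorLeaderRetriever();
                }

                @Override
                public void close() {
                    // Closing the client view also releases the wrapped services.
                    full.close();
                }
            };
        }
    }
}
```

The wrapper takes the first of the two clean-up routes mentioned in the comment: the client holds an `AutoCloseable` services object, and closing it closes the underlying full services. Note that the fallback path still constructs the full services, so it preserves compatibility without avoiding their initialization.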
[jira] [Commented] (FLINK-13750) Separate HA services between client-/ and server-side
[ https://issues.apache.org/jira/browse/FLINK-13750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16910848#comment-16910848 ]

TisonKun commented on FLINK-13750:
----------------------------------

Hi [~Zentol] & [~till.rohrmann].

After an investigation I noticed that {{ClusterClient}} does not need to hold a field that is, or resembles, {{highAvailabilityServices}}. Working towards the goal of {{ClusterClient}} being an interface rather than an abstract class, we can push the initialization logic down into {{RestClusterClient}} and {{MiniClusterClient}}.

Here are two possible directions for the separation; I post them here for advice.

1. Introduce utility functions in {{HighAvailabilityServicesUtils}} that return a limited set of high-availability services regarded as client-side services, without introducing any new class or interface. (A prototype can be found at https://github.com/TisonKun/flink/commit/1ea7c4ed6c7c2ce2a82da48bcacfd20e2bc0fdfd)

pros:
- easy to implement.
- in the custom HA scenario, users don't need to modify their code unless their implementation has an issue similar to FLINK-13500.

cons:
- there is no explicit client-side service concept.
- {{HighAvailabilityServicesUtils}} knows details of the Standalone and ZooKeeper implementations.

nit: for the prototype, we might separate {{getDispatcherLeaderRetrievalService}} and {{getWebMonitorLeaderRetrievalService}}; the downside is that we would initialize the {{CuratorFramework}} and the custom HA service twice or more.

2. Introduce an interface {{RetrieverOnlyHighAvailabilityService}}, which looks like

{code:java}
interface RetrieverOnlyHighAvailabilityService {
    LeaderRetrievalService getDispatcherLeaderRetrievalService();
    LeaderRetrievalService getWebMonitorLeaderRetrievalService();
}
{code}

and implement it for the different high-availability backends.

pros:
- a clear concept of separation between high-availability services.
- {{HighAvailabilityServicesUtils}} only passes the configuration to create a {{RetrieverOnlyHighAvailabilityService}}, and only the {{RetrieverOnlyHighAvailabilityService}} knows the details.

cons:
- we need to implement {{RetrieverOnlyHighAvailabilityService}} for every high-availability service.
- in the {{MiniClusterClient}} scenario, we actually use the service passed from the MiniCluster. Either we treat it as a special case or we completely change the {{MiniClusterClient}} initialization logic.
- in the custom HA scenario, users have to implement a new interface.

nit: it is not true of the current codebase that every ClusterClient shares the same retrieval requirements. Only RestClusterClient needs {{getWebMonitorLeaderRetrievalService}}. Or, on a more conceptual level, the client should only communicate with the WebMonitor, and requests to the Dispatcher are routed by the WebMonitor.
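Editorial sketch (not part of the original thread): for option 2, a backend-specific implementation of the proposed interface could be as small as the following. `StandaloneRetrieverOnlyServices` and the two configuration keys are hypothetical, and `LeaderRetrievalService` is again reduced to a single method; the point is only that such a class can be built from configuration alone, without touching blob storage or leader election.

```java
import java.util.Map;

/** Simplified stand-in; the real retrieval service is listener-based. */
interface LeaderRetrievalService {
    String getLeaderAddress();
}

/** Interface as proposed in option 2 of the comment above. */
interface RetrieverOnlyHighAvailabilityService {
    LeaderRetrievalService getDispatcherLeaderRetrievalService();

    LeaderRetrievalService getWebMonitorLeaderRetrievalService();
}

/** Hypothetical standalone (non-HA) backend: addresses are fixed by configuration. */
final class StandaloneRetrieverOnlyServices implements RetrieverOnlyHighAvailabilityService {
    private final String dispatcherAddress;
    private final String webMonitorAddress;

    StandaloneRetrieverOnlyServices(Map<String, String> config) {
        // Illustrative keys; they are not real configuration options.
        this.dispatcherAddress = config.get("sketch.dispatcher.address");
        this.webMonitorAddress = config.get("sketch.webmonitor.address");
    }

    @Override
    public LeaderRetrievalService getDispatcherLeaderRetrievalService() {
        return () -> dispatcherAddress;
    }

    @Override
    public LeaderRetrievalService getWebMonitorLeaderRetrievalService() {
        return () -> webMonitorAddress;
    }
}
```

A ZooKeeper-backed counterpart would need its own class with its own connection handling, which is the per-backend implementation cost listed under "cons" for this option.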
[jira] [Commented] (FLINK-13750) Separate HA services between client-/ and server-side
[ https://issues.apache.org/jira/browse/FLINK-13750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16910162#comment-16910162 ]

TisonKun commented on FLINK-13750:
----------------------------------

The main requirement of the client side on the HA services is to communicate with the Dispatcher/WebMonitor. LeaderElectionServices, BlobServices and the other LeaderRetrievalServices are not needed on the client side. I think it is reasonable to separate the HA services exposed to the client side and the server side. I'd like to take a closer look and provide a solution to this :-)

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)