[jira] [Commented] (FLINK-13750) Separate HA services between client-/ and server-side

2019-08-30 Thread TisonKun (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919380#comment-16919380
 ] 

TisonKun commented on FLINK-13750:
--

[~till.rohrmann] Sure. FYI, see FLINK-13912. I'd like to mark this issue as 
blocked by that one, because we can then avoid introducing a new endpoint on 
the WebMonitor.

> Separate HA services between client-/ and server-side
> -
>
> Key: FLINK-13750
> URL: https://issues.apache.org/jira/browse/FLINK-13750
> Project: Flink
>  Issue Type: Improvement
>  Components: Command Line Client, Runtime / Coordination
>Reporter: Chesnay Schepler
>Assignee: TisonKun
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently, we use the same {{HighAvailabilityServices}} on the client and 
> server. However, the client does not need several of the features that the 
> services currently provide (access to the blobstore or checkpoint metadata).
> Additionally, due to how these services are set up they also require the 
> client to have access to the blob storage, despite it never actually being 
> used, which can cause issues, like FLINK-13500.
> [~Tison] Would you be interested in this issue?



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (FLINK-13750) Separate HA services between client-/ and server-side

2019-08-30 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919312#comment-16919312
 ] 

Till Rohrmann commented on FLINK-13750:
---

Yes, there should no longer be a need to know the port and hostname of the 
dispatcher. Could we create a separate issue for the removal of this method?



[jira] [Commented] (FLINK-13750) Separate HA services between client-/ and server-side

2019-08-30 Thread TisonKun (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919295#comment-16919295
 ] 

TisonKun commented on FLINK-13750:
--

Hi all. I noticed that {{ClusterClient#getClusterConnectionInfo}} does not work 
as expected in the YARN scenario. There, we set the port to exactly the 
WebMonitor's port.


{code:java}
final String host = appReport.getHost();
final int rpcPort = appReport.getRpcPort();

LOG.info("Found application JobManager host name '{}' and port '{}' from supplied application id '{}'",
    host, rpcPort, applicationId);

flinkConfiguration.setString(JobManagerOptions.ADDRESS, host);
flinkConfiguration.setInteger(JobManagerOptions.PORT, rpcPort);

flinkConfiguration.setString(RestOptions.ADDRESS, host);
flinkConfiguration.setInteger(RestOptions.PORT, rpcPort);
{code}

Thus the interface never works as expected. I suspect this was not discovered 
before because no one relies on it. 

On further investigation, all usages (omitting those used only for logging; all 
of them are under scala-shell and thus the remote executor) actually 
communicate with {{WebMonitor}} instead of {{Dispatcher}}. I'd therefore like 
to just remove this interface.
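
To illustrate the effect, here is a minimal sketch that uses a plain map in place of Flink's {{Configuration}}. The keys are the ones behind {{JobManagerOptions}} and {{RestOptions}}; the class {{YarnPortSketch}} and its methods are hypothetical and only mirror the snippet above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the YARN logic quoted above; a plain map replaces
// Flink's Configuration so the sketch is self-contained.
final class YarnPortSketch {

    // Mirrors the quoted snippet: the single port from the application report
    // is written to both the JobManager RPC option and the REST option.
    static Map<String, Object> configure(String host, int reportedPort) {
        Map<String, Object> conf = new HashMap<>();
        conf.put("jobmanager.rpc.address", host);
        conf.put("jobmanager.rpc.port", reportedPort);
        conf.put("rest.address", host);
        conf.put("rest.port", reportedPort);
        return conf;
    }

    // True when the port a client would read as the dispatcher RPC port
    // is the same value as the REST (WebMonitor) port.
    static boolean rpcPortEqualsRestPort(Map<String, Object> conf) {
        return conf.get("jobmanager.rpc.port").equals(conf.get("rest.port"));
    }
}
```

Any connection info derived from the RPC option therefore carries the WebMonitor's port, which is why callers end up talking to the {{WebMonitor}}.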



[jira] [Commented] (FLINK-13750) Separate HA services between client-/ and server-side

2019-08-22 Thread TisonKun (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16913856#comment-16913856
 ] 

TisonKun commented on FLINK-13750:
--

Hi [~till.rohrmann] [~Zentol], I opened pull request GH-9509 following our 
discussion above. Please take a look :-)

Besides, I'd like to start a new sub-topic here: shall we expose "leaderInfo" 
to the client? The only usages I see are logging in {{FlinkYarnSessionCli}} and 
extracting the dispatcher host:port in {{FlinkShell}} in the YARN scenario. I 
am thinking of an alternative to exposing "leaderInfo" to the client. 
Otherwise, we can deliberately design it to be exposed here.



[jira] [Commented] (FLINK-13750) Separate HA services between client-/ and server-side

2019-08-21 Thread TisonKun (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16912036#comment-16912036
 ] 

TisonKun commented on FLINK-13750:
--

I have started a survey thread on both the user and dev lists[1] to get an 
accurate idea of how our users use high-availability services in Flink.

While collecting feedback, I also want to start the implementation discussed 
above to see whether there are any unnoticed problems.

[1] 
https://lists.apache.org/x/thread.html/c0cc07197e6ba30b45d7709cc9e17d8497e5e3f33de504d58dfcafad@%3Cuser.flink.apache.org%3E



[jira] [Commented] (FLINK-13750) Separate HA services between client-/ and server-side

2019-08-20 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16911396#comment-16911396
 ] 

Till Rohrmann commented on FLINK-13750:
---

Yes, you are right. The {{MiniCluster}} only needs to give access to the 
{{webMonitorLeaderRetrievalService}}. Nothing else should be needed after we 
have removed the requirement for the dispatcher leader retrieval service.

A survey on the user and dev lists to measure how big a deal a breaking change 
would be could help us assess whether we need to provide backwards 
compatibility.



[jira] [Commented] (FLINK-13750) Separate HA services between client-/ and server-side

2019-08-20 Thread TisonKun (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16911290#comment-16911290
 ] 

TisonKun commented on FLINK-13750:
--

The original issue FLINK-13500, which is caused by this issue, requires 
initializing services on demand. Specifically, the BlobStoreService should not 
be initialized on the client side.

Letting {{HighAvailabilityServices}} extend both 
{{ClientHighAvailabilityServices}} and {{ClusterHighAvailabilityServices}} and 
passing it as the respective interface doesn't fix this issue. It limits the 
access but does not change the initialization.
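
As a rough illustration of what "on demand" could mean here, the following sketch creates each service only on first access. {{Lazy}} and {{OnDemandHaServices}} are hypothetical names, and strings stand in for the real service types; this is not how Flink implements it.

```java
import java.util.function.Supplier;

// Hypothetical helper: creates the wrapped value on first access only.
final class Lazy<T> {
    private final Supplier<T> factory;
    private T value;
    private boolean created;

    Lazy(Supplier<T> factory) {
        this.factory = factory;
    }

    synchronized T get() {
        if (!created) {
            value = factory.get();
            created = true;
        }
        return value;
    }

    synchronized boolean isCreated() {
        return created;
    }
}

// Simplified stand-in for an HA services implementation; strings replace
// the real blob store and leader retrieval types.
final class OnDemandHaServices {
    final Lazy<String> blobStore = new Lazy<>(() -> "blob-store");
    final Lazy<String> webMonitorRetriever = new Lazy<>(() -> "web-monitor-retriever");
}
```

A client that only asks for the web monitor retriever never triggers creation of the blob store, which is the property that narrowing the interface alone does not provide.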

In our context, {{ClientHighAvailabilityServices}} has 
{{getWebMonitorLeaderRetrievalService}} as its only method, while 
{{ClusterHighAvailabilityServices}} doesn't need it. We can rename 
{{HighAvailabilityServices}} to {{ClusterHighAvailabilityServices}}, deprecate 
the method, and drop it when breaking changes are allowed.

The MiniCluster scenario is a special case where the client can directly 
access the dispatcher gateway and thus does not need a 
{{ClientHighAvailabilityServices}}. We can handle it separately given its 
special nature.

The inheritance graph would be

{noformat}
ClientHighAvailabilityServices { only getWebMonitorLeaderRetrievalService }
  ↓
ZK.../Standalone.../Custom...
{noformat}

{noformat}
ClusterHighAvailabilityServices { ... deprecated getWebMonitorLeaderRetrievalService }
  ↓
ZK.../Standalone.../Embedded.../Custom...
{noformat}

Another problem is how to treat custom implementations. A quick solution is 
{{HighAvailabilityServicesFactory#createClientHAServices}} as described above, 
where the default is to create a ClusterHighAvailabilityServices (the current 
HighAvailabilityServices) and wrap it so that only the deprecated 
{{getWebMonitorLeaderRetrievalService}} is accessible. We can drop the fallback 
when breaking changes are allowed. Fair enough, a survey to understand how 
users actually customize their HA services would be helpful.



[jira] [Commented] (FLINK-13750) Separate HA services between client-/ and server-side

2019-08-20 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16911247#comment-16911247
 ] 

Till Rohrmann commented on FLINK-13750:
---

This sounds good to me.

Maybe another thought. We could split {{HighAvailabilityServices}} up into 
{{ClientHighAvailabilityServices}} and {{ClusterHighAvailabilityServices}} and 
let {{HighAvailabilityServices}} extend both interfaces. That way we would 
decompose the original interface into smaller ones which could be passed into 
the respective components: {{ClientHighAvailabilityServices}} goes into the 
{{ClusterClient}}, whereas {{ClusterHighAvailabilityServices}} goes into the 
cluster components.

This would, however, break binary backwards compatibility and people would need 
to recompile their custom implementations against the updated interface, I 
guess.
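
A minimal sketch of this split, under simplifying assumptions: the interface names come from the discussion, but the methods return strings instead of the real service types, only two cluster-side methods are shown, and {{StandaloneHaSketch}} and {{ClientSketch}} are hypothetical.

```java
// Client-side view: only what a client needs to find the WebMonitor.
interface ClientHighAvailabilityServices {
    String getWebMonitorLeaderRetriever();
}

// Cluster-side view: everything else (abbreviated to two methods here).
interface ClusterHighAvailabilityServices {
    String getDispatcherLeaderElectionService();
    String createBlobStore();
}

// The original interface becomes the union of both smaller ones.
interface HighAvailabilityServices
        extends ClientHighAvailabilityServices, ClusterHighAvailabilityServices {
}

// Hypothetical implementation that existing code could keep using unchanged.
final class StandaloneHaSketch implements HighAvailabilityServices {
    @Override public String getWebMonitorLeaderRetriever() { return "web-monitor"; }
    @Override public String getDispatcherLeaderElectionService() { return "dispatcher-election"; }
    @Override public String createBlobStore() { return "blob-store"; }
}

// Hypothetical client component: it can only see the client-side methods.
final class ClientSketch {
    private final ClientHighAvailabilityServices haServices;

    ClientSketch(ClientHighAvailabilityServices haServices) {
        this.haServices = haServices;
    }

    String webMonitor() {
        return haServices.getWebMonitorLeaderRetriever();
    }
}
```

The client component compiles only against the narrow interface, while the same object can still be handed to cluster components as {{ClusterHighAvailabilityServices}}.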



[jira] [Commented] (FLINK-13750) Separate HA services between client-/ and server-side

2019-08-20 Thread TisonKun (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16911184#comment-16911184
 ] 

TisonKun commented on FLINK-13750:
--

Hi Till!

It also occurred to me a few hours ago that ClusterClient should hold only the 
WebMonitorRetriever, and that the WebMonitorRetriever should be held only by 
ClusterClient; a request for ConnectionInfo is then forwarded to the Dispatcher 
by the WebMonitor, the same as other requests. I'm glad we are thinking along 
the same lines.

Given that the (Client)HighAvailabilityServices differ between 
RestClusterClient and MiniClusterClient, I would prefer to remove the field 
highAvailabilityServices and push the relevant implementations down to the 
subclasses. For RestClusterClient, it would be quite similar to your 
description, while for MiniClusterClient we can access the proper service with 
miniCluster.getHighAvailabilityServices. (It is an embedded one, and the whole 
mini cluster must run in the same process. Creating a new service thus looks 
unnecessary and would break how the embedded service is implemented.)
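
The idea of pushing the field down can be sketched roughly as follows. All type names ending in {{Sketch}} are hypothetical simplifications, and a retriever is reduced to something that reports an address.

```java
// Hypothetical simplification: a retriever that just reports the leader address.
interface LeaderRetrieverSketch {
    String leaderAddress();
}

// The shared client contract no longer carries an HA services field.
interface ClusterClientSketch {
    String webMonitorAddress();
}

// The REST client owns its own client-side retriever.
final class RestClusterClientSketch implements ClusterClientSketch {
    private final LeaderRetrieverSketch webMonitorRetriever;

    RestClusterClientSketch(LeaderRetrieverSketch webMonitorRetriever) {
        this.webMonitorRetriever = webMonitorRetriever;
    }

    @Override
    public String webMonitorAddress() {
        return webMonitorRetriever.leaderAddress();
    }
}

// Stand-in for the mini cluster, which already runs embedded services in-process.
final class MiniClusterSketch {
    private final LeaderRetrieverSketch embedded = () -> "embedded-web-monitor";

    LeaderRetrieverSketch getWebMonitorRetriever() {
        return embedded;
    }
}

// The mini cluster client reuses the cluster's embedded service instead of creating one.
final class MiniClusterClientSketch implements ClusterClientSketch {
    private final MiniClusterSketch miniCluster;

    MiniClusterClientSketch(MiniClusterSketch miniCluster) {
        this.miniCluster = miniCluster;
    }

    @Override
    public String webMonitorAddress() {
        return miniCluster.getWebMonitorRetriever().leaderAddress();
    }
}
```

Each subclass decides where its retriever comes from, so the base type needs no knowledge of HA services at all.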



[jira] [Commented] (FLINK-13750) Separate HA services between client-/ and server-side

2019-08-20 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16911145#comment-16911145
 ] 

Till Rohrmann commented on FLINK-13750:
---

Hi Tison,

I would try to go the following way: The {{RestClusterClient}} should only need 
the {{webMonitorRetrievalService}}. Hence we should try to get rid of the 
{{dispatcherLeaderRetriever}} and then the {{HighAvailabilityServices}} stored 
in the {{ClusterClient}}. Btw. the {{ClusterClient}} is a legacy class with a 
lot of unneeded code.

Then I would introduce a {{ClientHighAvailabilityServices}} which has a method 
{{LeaderRetrievalService getWebMonitorLeaderRetriever();}}. In order to not 
break backwards compatibility we could only deprecate the same method in 
{{HighAvailabilityServices}}.

Next, we would need to introduce a new method to 
{{HighAvailabilityServicesFactory#createClientHAServices}} which allows us to 
create a {{ClientHighAvailabilityServices}} instance. This method should have a 
default implementation which fails.

For backwards compatibility, we could still create a 
{{HighAvailabilityServices}} if {{#createClientHAServices}} fails and then call 
{{HighAvailabilityServices#getWebMonitorLeaderRetriever()}}.
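
The fallback described above could look roughly like this sketch. It makes several simplifying assumptions: the factory methods take no arguments, retrievers are represented as strings, and {{ClientHaLoader}} is a hypothetical helper.

```java
// Simplified client-side interface; a string stands in for LeaderRetrievalService.
interface ClientHighAvailabilityServices extends AutoCloseable {
    String getWebMonitorLeaderRetriever();

    @Override
    void close();
}

// Simplified existing interface; the method is kept but deprecated.
interface HighAvailabilityServices extends AutoCloseable {
    @Deprecated
    String getWebMonitorLeaderRetriever();

    @Override
    void close();
}

interface HighAvailabilityServicesFactory {
    HighAvailabilityServices createHAServices();

    // New method with a failing default, so existing custom factories still compile.
    default ClientHighAvailabilityServices createClientHAServices() {
        throw new UnsupportedOperationException("createClientHAServices is not implemented");
    }
}

// Hypothetical helper implementing the backwards-compatible fallback.
final class ClientHaLoader {
    static ClientHighAvailabilityServices load(HighAvailabilityServicesFactory factory) {
        try {
            return factory.createClientHAServices();
        } catch (UnsupportedOperationException e) {
            // Legacy factory: create the full services and expose only the one method.
            HighAvailabilityServices full = factory.createHAServices();
            return new ClientHighAvailabilityServices() {
                @Override
                public String getWebMonitorLeaderRetriever() {
                    return full.getWebMonitorLeaderRetriever();
                }

                @Override
                public void close() {
                    full.close();
                }
            };
        }
    }
}
```

A legacy factory keeps working through the wrapper, while a factory that overrides {{createClientHAServices}} never creates the full services on the client.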

For proper resource cleanup, one either needs to pass the 
{{(Client)HighAvailabilityServices}} to the {{RestClusterClient}} (ideally as 
an {{AutoCloseable}}) or create a wrapper for the {{LeaderRetrievalService}} 
which also closes the services when closing the {{LeaderRetrievalService}}.
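
The wrapper variant might be sketched like this. The {{LeaderRetrievalService}} interface is reduced to a single {{stop()}} method for brevity, and {{ClosingLeaderRetrievalService}} is a hypothetical name.

```java
// Simplified to the one method relevant here; the real interface also starts retrieval.
interface LeaderRetrievalService {
    void stop() throws Exception;
}

// Hypothetical wrapper: stopping the retrieval service also closes the services that own it.
final class ClosingLeaderRetrievalService implements LeaderRetrievalService {
    private final LeaderRetrievalService delegate;
    private final AutoCloseable owningServices;

    ClosingLeaderRetrievalService(LeaderRetrievalService delegate, AutoCloseable owningServices) {
        this.delegate = delegate;
        this.owningServices = owningServices;
    }

    @Override
    public void stop() throws Exception {
        try {
            delegate.stop();
        } finally {
            // Close the owning services even if stopping the delegate fails.
            owningServices.close();
        }
    }
}
```

With this shape the client only has to manage one object, at the cost of hiding the lifecycle of the underlying services.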

I hope I haven't overlooked too many details here.



[jira] [Commented] (FLINK-13750) Separate HA services between client-/ and server-side

2019-08-19 Thread TisonKun (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16910848#comment-16910848
 ] 

TisonKun commented on FLINK-13750:
--

Hi [~Zentol] & [~till.rohrmann].

After an investigation I noticed that {{ClusterClient}} does not need to hold a 
field such as {{highAvailabilityServices}}. Since the goal is for 
{{ClusterClient}} to be an interface rather than an abstract class, we can push 
the initialization logic down into {{RestClusterClient}} and 
{{MiniClusterClient}}.

Here are two possible directions for the separation; I am posting them here for 
advice.

1. introduce utility functions in {{HighAvailabilityServicesUtils}} that return 
a limited set of high-availability services regarded as client-side services, 
without introducing any new class or interface. (A prototype can be found at 
https://github.com/TisonKun/flink/commit/1ea7c4ed6c7c2ce2a82da48bcacfd20e2bc0fdfd)

pros:

- easy to implement
- in the custom HA scenario, customers don't need to modify their code unless 
their implementation has an issue similar to FLINK-13500.

cons:

- there is no explicit client-side service concept.
- {{HighAvailabilityServicesUtils}} knows details of Standalone and ZooKeeper 
implementation.

nit: in the prototype, we could separate 
{{getDispatcherLeaderRetrievalService}} and 
{{getWebMonitorLeaderRetrievalService}}; the downside is that we would 
initialize {{CuratorFramework}} and the custom HA service twice or more.

2. introduce an interface {{RetrieverOnlyHighAvailabilityService}} which looks 
like


{code:java}
interface RetrieverOnlyHighAvailabilityService {
  LeaderRetrievalService getDispatcherLeaderRetrievalService();
  LeaderRetrievalService getWebMonitorLeaderRetrievalService();
}
{code}

and implement it for different high-availability backends.

pros:

- a clear concept of separation between high-availability services.
- {{HighAvailabilityServicesUtils}} only passes the configuration to create a 
{{RetrieverOnlyHighAvailabilityService}}, and only 
{{RetrieverOnlyHighAvailabilityService}} knows the details.

cons:

- we need to implement {{RetrieverOnlyHighAvailabilityService}} for every 
high-availability backend.
- in the {{MiniClusterClient}} scenario, we actually use the service passed 
from MiniCluster. Either we treat it as a special case or we completely change 
the {{MiniClusterClient}} initialization logic.
- in the custom HA scenario, the user has to implement a new interface.
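
To make the second option more concrete, here is a sketch of what one backend implementation might look like. The interface is the one proposed above; {{LeaderRetrievalService}} is reduced to a hypothetical single-method type, and {{StandaloneRetrieverOnlyService}} with its fixed addresses is an assumption for illustration.

```java
// Hypothetical simplification: the real service notifies a listener instead.
interface LeaderRetrievalService {
    String leaderAddress();
}

// The interface proposed in option 2.
interface RetrieverOnlyHighAvailabilityService {
    LeaderRetrievalService getDispatcherLeaderRetrievalService();
    LeaderRetrievalService getWebMonitorLeaderRetrievalService();
}

// Hypothetical standalone backend: leaders are fixed addresses taken from configuration.
final class StandaloneRetrieverOnlyService implements RetrieverOnlyHighAvailabilityService {
    private final String dispatcherAddress;
    private final String webMonitorAddress;

    StandaloneRetrieverOnlyService(String dispatcherAddress, String webMonitorAddress) {
        this.dispatcherAddress = dispatcherAddress;
        this.webMonitorAddress = webMonitorAddress;
    }

    @Override
    public LeaderRetrievalService getDispatcherLeaderRetrievalService() {
        return () -> dispatcherAddress;
    }

    @Override
    public LeaderRetrievalService getWebMonitorLeaderRetrievalService() {
        return () -> webMonitorAddress;
    }
}
```

A ZooKeeper or custom backend would need its own implementation of the same interface, which is the extra work listed under cons.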

nit:

In the current codebase it is not true that every ClusterClient shares the 
same retrieval requirements: only RestClusterClient needs 
getWebMonitorLeaderRetrievalService. At a more conceptual level, the client 
should only communicate with the WebMonitor, and requests to the Dispatcher 
should be routed by the WebMonitor.



[jira] [Commented] (FLINK-13750) Separate HA services between client-/ and server-side

2019-08-18 Thread TisonKun (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-13750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16910162#comment-16910162
 ] 

TisonKun commented on FLINK-13750:
--

The main requirement of the client side on the HA services is to communicate 
with the Dispatcher/WebMonitor. LeaderElectionServices, BlobServices and other 
LeaderRetrievalServices are not needed on the client side.

I think it is reasonable to separate the HA services exposed to the client and 
server side.

I'd like to take a closer look and provide a solution for this :-)




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)