[jira] [Commented] (FLINK-2821) Change Akka configuration to allow accessing actors from different URLs
[ https://issues.apache.org/jira/browse/FLINK-2821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15848287#comment-15848287 ]

Philipp von dem Bussche commented on FLINK-2821:

Hello [~mxm],

after being quiet for a while I wanted to give feedback on the setup I am running at the moment. To recap (I had to think about my setup myself again after not spending much time on it lately ;) ):
- The job manager and task manager run in Docker containers.
- I am using an orchestration engine called Rancher on top of Docker, which introduces another set of IP addresses / another network on top of Docker.

Since I am communicating with the JobManager from within the Docker / Rancher network as well as from outside (from my local build server), I had to have the JobManager register under a hostname that is resolvable on the Internet. Both the task manager (coming from within the Docker / Rancher network) and the build server now connect via the Internet hostname. Obviously, since the task manager lives right next to the job manager, the preferred solution would be for the task manager to connect locally (meaning through the Docker / Rancher network), but since one can only specify one listener address it has to go through the Internet hostname.

However, this does not solve the problem completely yet, because if I just tell the JobManager to bind to the Internet hostname I get the following exception while the JobManager starts up:

2017-02-01 11:13:51,997 INFO org.apache.flink.util.NetUtils - Unable to allocate on port 6123, due to error: Address not available (Bind failed)
2017-02-01 11:13:51,999 ERROR org.apache.flink.runtime.jobmanager.JobManager - Failed to run JobManager.
java.lang.RuntimeException: Unable to do further retries starting the actor system
    at org.apache.flink.runtime.jobmanager.JobManager$.retryOnBindException(JobManager.scala:2136)
    at org.apache.flink.runtime.jobmanager.JobManager$.runJobManager(JobManager.scala:2076)
    at org.apache.flink.runtime.jobmanager.JobManager$$anon$12.call(JobManager.scala:1971)
    at org.apache.flink.runtime.jobmanager.JobManager$$anon$12.call(JobManager.scala:1969)
    at org.apache.flink.runtime.security.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:29)
    at org.apache.flink.runtime.jobmanager.JobManager$.main(JobManager.scala:1969)
    at org.apache.flink.runtime.jobmanager.JobManager.main(JobManager.scala)

So additionally I had to put the Docker IP address of the JobManager container into /etc/hosts, resolving to the Internet hostname, so that it tries to bind to the Docker IP address rather than the Amazon AWS IP address (which is the IP that the Internet hostname resolves to).

This works for me now, though I would not call it ideal. I have to admit I have not tested this with the latest RC; I will do that later in the week.

Thanks

> Change Akka configuration to allow accessing actors from different URLs
> ---
>
> Key: FLINK-2821
> URL: https://issues.apache.org/jira/browse/FLINK-2821
> Project: Flink
> Issue Type: Bug
> Components: Distributed Coordination
> Reporter: Robert Metzger
> Assignee: Maximilian Michels
> Fix For: 1.2.0
>
>
> Akka expects the actor's URL to be exactly matching.
> As pointed out here, cases where users were complaining about this:
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Error-trying-to-access-JM-through-proxy-td3018.html
> - Proxy routing (as described here, send to the proxy URL, receiver recognizes only original URL)
> - Using hostname / IP interchangeably does not work (we solved this by always putting IP addresses into URLs, never hostnames)
> - Binding to multiple interfaces (any local 0.0.0.0) does not work. Still no solution to that (but seems not too much of a restriction)
> I am aware that this is not possible due to Akka, so it is actually not a Flink bug. But I think we should track the resolution of the issue here anyways because it's affecting our users' satisfaction.

--
This message was sent by Atlassian JIRA (v6.3.15#6346)
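The "Address not available (Bind failed)" error in the comment above is what a socket bind reports when the hostname resolves to an address that is not configured on any local interface, which fits the situation described (the Internet hostname resolves to the AWS address rather than the container's Docker address). A minimal Python sketch for checking this from inside a container is shown here; the helper name `can_bind` is hypothetical and not part of Flink.

```python
import socket


def can_bind(host: str, port: int = 0) -> bool:
    """Return True if a TCP socket can be bound to host:port on this machine.

    Port 0 asks the OS for any free port, so only the address is tested.
    """
    try:
        # Resolve the name the same way a server would before binding.
        candidates = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        # The name does not resolve at all.
        return False
    for family, socktype, proto, _canonname, sockaddr in candidates:
        try:
            with socket.socket(family, socktype, proto) as sock:
                sock.bind(sockaddr)
                return True
        except OSError:
            # Typically "address not available": the resolved IP is not local.
            continue
    return False
```

Running `can_bind("<internet hostname>")` inside the JobManager container before and after adding the /etc/hosts entry would be one way to confirm that the workaround changes what the name resolves to locally.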
[jira] [Commented] (FLINK-2821) Change Akka configuration to allow accessing actors from different URLs
[ https://issues.apache.org/jira/browse/FLINK-2821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15737849#comment-15737849 ]

Philipp von dem Bussche commented on FLINK-2821:

I am just looking into another use case where communication with the JobManager is required but where the client is not part of the Rancher or Docker environment, so I could not use the hostname specified as the RPC address. I see this use case when deploying the JobManager to, let's say, AWS while wanting to deploy the jobs from my on-prem build system.

The JobManager (container) has a name that is resolvable from the Internet, but since it is also part of the Rancher network I have it listen on the Rancher DNS name. This makes sense, so that JobManager and TaskManager only communicate within the Rancher network, but I still have to deploy the jobs from outside and would do that via the Internet hostname. However, I would then have the JobManager registered to the Internet hostname, and subsequently the TaskManager would also have to use that name.

So for this to work I would need the JobManager to accept communication on at least two RPC addresses. I still don't quite understand why this is being verified at all: why not just serve any connection as long as it hits the right port?
[jira] [Commented] (FLINK-2821) Change Akka configuration to allow accessing actors from different URLs
[ https://issues.apache.org/jira/browse/FLINK-2821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15734537#comment-15734537 ]

Philipp von dem Bussche commented on FLINK-2821:

Thanks [~mxm], this is working for me now. I tested again and am now purely using Rancher DNS names, and I can also deploy my jobs from my build service. There are two things I still had to do:
a) update the Flink version on my build slave as well, so it seems the client also got some changes, and
b) exactly match the hostname when connecting, e.g. from the client. In Rancher a host can be reached by hostname.stackname; in my case the stack name was in camel case, so I also had to use camel case when connecting from the client to the JobManager, because otherwise the JobManager would refuse the connection again.

Anyway, this looks quite good now and I will test further with this. Thanks again for the help!
[jira] [Commented] (FLINK-2821) Change Akka configuration to allow accessing actors from different URLs
[ https://issues.apache.org/jira/browse/FLINK-2821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15731314#comment-15731314 ]

Philipp von dem Bussche commented on FLINK-2821:

Hello [~mxm],

some more feedback on my testing back in the Rancher environment. The connection between TaskManager and JobManager via the Rancher DNS name is working now; however, I still seem to have a slight problem deploying my jobs. I am doing this from Jenkins using the flink cli. Running it with a hostname (which is still slightly different from what I have configured on the JobManager) seems to trigger an IP resolution; the cli then tries to connect via the IP rather than the hostname, and hence it is still blocked from connecting on the JobManager side. Can we somehow make it so that the hostname, and whatever it resolves to on the JobManager, is allowed to connect?

Output from flink cli command:

flink list --jobmanager flink-jobmanager.analyticsstack:6123
Retrieving JobManager.
Using address /10.42.202.225:6123 to connect to JobManager.
The program finished with the following exception:
org.apache.flink.runtime.leaderretrieval.LeaderRetrievalException: Could not retrieve the leader gateway
    at org.apache.flink.runtime.util.LeaderRetrievalUtils.retrieveLeaderGateway(LeaderRetrievalUtils.java:127)

Output from the jobmanager logfile:

2016-12-08 06:39:27,581 ERROR akka.remote.EndpointWriter - dropping message [class akka.actor.ActorSelectionMessage] for non-local recipient [Actor[akka.tcp://flink@10.42.202.225:6123/]] arriving at [akka.tcp://flink@10.42.202.225:6123] inbound addresses are [akka.tcp://flink@flink-jobmanager:6123]
2016-12-08 06:39:37,711 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@172.17.0.6:46589] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
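The "dropping message ... for non-local recipient" log line in the comment above can be read as a literal comparison between the address a message was sent to and the addresses the actor system registered as inbound. The following Python sketch is only a simplified model of the behaviour reported in this thread, not Akka's actual implementation; the function name `accepts` is hypothetical.

```python
from typing import Iterable


def accepts(recipient_address: str, inbound_addresses: Iterable[str]) -> bool:
    """Model of the reported behaviour: a message is only delivered when the
    recipient address equals one of the inbound addresses exactly.

    No DNS resolution and no case folding is applied (an assumption based on
    the observations in this thread).
    """
    return recipient_address in set(inbound_addresses)


# Addresses taken from the log output quoted in the comment.
inbound = ["akka.tcp://flink@flink-jobmanager:6123"]

# The client resolved the hostname to an IP first, so the addresses differ.
print(accepts("akka.tcp://flink@10.42.202.225:6123", inbound))

# Connecting with the exact registered name matches.
print(accepts("akka.tcp://flink@flink-jobmanager:6123", inbound))
```

Under this model a hostname and the IP it resolves to are two different recipients, and so are two spellings of the same hostname that differ only in case.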
[jira] [Commented] (FLINK-2821) Change Akka configuration to allow accessing actors from different URLs
[ https://issues.apache.org/jira/browse/FLINK-2821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15726595#comment-15726595 ]

Philipp von dem Bussche commented on FLINK-2821:

Thanks [~mxm]. This should then work nicely in my Rancher environment, as the hostname I will use from the TaskManager to talk to the JobManager will be resolvable on the JobManager. My local docker-machine based environment on my laptop seems to struggle with this, though, as the hostname I am using there is not resolvable. Then again, I normally test my Flink jobs locally without the Docker/Rancher bits, so I guess I am cool with the PR as it is. I will do more testing after tomorrow when I am back home.
[jira] [Commented] (FLINK-2821) Change Akka configuration to allow accessing actors from different URLs
[ https://issues.apache.org/jira/browse/FLINK-2821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15723269#comment-15723269 ]

Philipp von dem Bussche commented on FLINK-2821:

Hi [~mxm],

I wanted to give this a try, but I am not sure if I am testing this correctly. Do I just have to set jobmanager.rpc.address, but to the hostname that will be used for access from outside? I tried to use a name that is not resolvable on the host itself and that fails, but this is on my local Docker environment and it should be different as soon as I move this to Rancher.

Thanks
[jira] [Commented] (FLINK-2821) Change Akka configuration to allow accessing actors from different URLs
[ https://issues.apache.org/jira/browse/FLINK-2821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15677009#comment-15677009 ]

Philipp von dem Bussche commented on FLINK-2821:

[~StephanEwen] I am not quite sure if this is going to work. The orchestration framework I am using (Rancher) exposes a 10.x IP address which is not available on the host itself (only the 172.x address from Docker is). What I have seen with binding previously was that when the host binds to 172.x it rejects a request against a 10.x address. So if we think that it won't do that when binding to 0.0.0.0, then I am cool with the change :) If this is too theoretical, though, I am more than happy to do more testing if [~mxm] wants to make the change.
[jira] [Commented] (FLINK-2821) Change Akka configuration to allow accessing actors from different URLs
[ https://issues.apache.org/jira/browse/FLINK-2821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15659436#comment-15659436 ]

Philipp von dem Bussche commented on FLINK-2821:

[~mxm] sorry for my late reply, but it is good that you have asked these questions, as I think I need to optimize my setup here a bit. Whenever my test server goes down (which it does quite a bit lately), my Flink services won't come up without manual intervention, and I think that is not good. Since I have forgotten some of the details of my setup myself, I am just going to outline them again below.

I took the Dockerfile from the Flink contrib GitHub project as a basis. In the Docker entrypoint I am doing this to set the JobManager listen address (I think this is still the default):
{code}sed -i -e "s/jobmanager.rpc.address: localhost/jobmanager.rpc.address: `hostname -f`/g" $FLINK_HOME/conf/flink-conf.yaml{code}
This leads to the JobManager doing this:
{code}Starting JobManager actor system at 172.17.0.23:6123{code}
The 172.x address is the IP address coming from Docker. On my TaskManager container I can obviously access this address, and I have an environment variable set on the TaskManager that points to it. However, this is of course not really dynamic: I have about 20 or so containers on my test system, and after the last reboot of the server the Docker IP address changed (it was actually .24 before). So then this whole setup kind of breaks.

Moving on to Rancher: you are able to define stacks (which are like a grouping for your containers). I have one stack for all the containers I need for doing data science things (well, maybe that's a bit of overselling, but anyway ;) ). A service in Rancher is kind of another wrapper around a container: you say you have a service and it is using Docker image X, and if you need more than, let's say, one, you scale the service up and down, and the service can run on different hosts, etc. The name of the service representing the JobManager functionality is flink-jobmanager.

Now with the Rancher DNS I can access the service (and, since I only have one active container, essentially the container) by just connecting to :flink-jobmanager . This is when I am creating the connection from within the same stack. If I were on my application stack and wanted to access Flink directly (I don't, because it goes from the webservice into Kafka first, which is already on the same stack), I could connect via :flink-jobmanager.analyticsstack. This is quite cool, because I can leave out any of those references to hosts etc. via environment variables or parameters, since I can be pretty sure that my other services/containers are always resolvable.

However, the resolution is done against the Rancher IP, and the one for the JobManager in my setup currently is 10.42.9.68. From my TaskManager container I can access all three of those IPs (the host IP, the Docker IP and the Rancher IP). I don't really want to go for the host IP or the Docker IP, because this would make things too static, but when I have the JobManager bind to the Docker IP and try to connect to it via the Rancher IP, it complains. On the other hand, I can't have the JobManager bind to the Rancher IP because that is not available inside the container; it is something available in the Rancher context that then gets mapped/forwarded onto Docker and the 172.x address.

It seems I am currently not running the build where I just patched the Akka version, but I remember I did for a while and it worked fine. I also think this would be the only way this could work, but I might be missing something.

Thanks for looking into this!
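For readers less familiar with sed, the entrypoint substitution quoted in the comment above amounts to a plain string replacement on flink-conf.yaml. A minimal Python sketch of the same rewrite is given here; the function name `set_rpc_address` and the sample configuration text are illustrative assumptions, and only the key `jobmanager.rpc.address` comes from the comment.

```python
# The exact line the sed expression looks for in flink-conf.yaml.
DEFAULT_ENTRY = "jobmanager.rpc.address: localhost"


def set_rpc_address(conf_text: str, address: str) -> str:
    """Replace every default rpc address entry with the given address.

    str.replace substitutes all occurrences, like the trailing /g in sed.
    """
    return conf_text.replace(DEFAULT_ENTRY, f"jobmanager.rpc.address: {address}")


# Illustrative configuration fragment (not a complete flink-conf.yaml).
sample = "jobmanager.rpc.address: localhost\njobmanager.rpc.port: 6123\n"
print(set_rpc_address(sample, "flink-jobmanager"))
```

In an entrypoint script the address would come from the container's fully qualified hostname (the `hostname -f` call in the original command); Python's `socket.getfqdn()` is a rough stand-in for that, though the two are not guaranteed to return the same name.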
[jira] [Comment Edited] (FLINK-4888) instantiated job manager metrics missing important job statistics
[ https://issues.apache.org/jira/browse/FLINK-4888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15604971#comment-15604971 ]

Philipp von dem Bussche edited comment on FLINK-4888 at 10/25/16 10:57 AM:
---

[~Zentol] thanks for that. There is a map currentJobs which is used to get the running jobs and their status. What if we introduce something similar for the archived jobs and their status in order to avoid those repeated RPC calls? I could have a look at where we could populate the map.

was (Author: philipp.bussche):
[~Zentol] thanks for that. There is a map currentJobs which is used to get the running jobs. What if we introduce something similar for the archived jobs and their status in order to avoid those repeated RPC calls? I could have a look at where we could populate the map.

> instantiated job manager metrics missing important job statistics
> --
>
> Key: FLINK-4888
> URL: https://issues.apache.org/jira/browse/FLINK-4888
> Project: Flink
> Issue Type: Improvement
> Components: Metrics
> Affects Versions: 1.1.2
> Reporter: Philipp von dem Bussche
> Assignee: Philipp von dem Bussche
> Priority: Minor
>
> A jobmanager is currently (only) instantiated with the following metrics: taskSlotsAvailable, taskSlotsTotal, numRegisteredTaskManagers and numRunningJobs. Important other metrics would be numFailedJobs, numCancelledJobs and numFinishedJobs. Also to get parity between JobManager metrics and what's available via the REST API it would be good to have these.
[jira] [Commented] (FLINK-4888) instantiated job manager metrics missing important job statistics
[ https://issues.apache.org/jira/browse/FLINK-4888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15604971#comment-15604971 ]

Philipp von dem Bussche commented on FLINK-4888:

[~Zentol] thanks for that. There is a map currentJobs which is used to get the running jobs. What if we introduce something similar for the archived jobs and their status in order to avoid those repeated RPC calls? I could have a look at where we could populate the map.
[jira] [Created] (FLINK-4888) instantiated job manager metrics missing important job statistics
Philipp von dem Bussche created FLINK-4888:
--

Summary: instantiated job manager metrics missing important job statistics
Key: FLINK-4888
URL: https://issues.apache.org/jira/browse/FLINK-4888
Project: Flink
Issue Type: Improvement
Components: Metrics
Affects Versions: 1.1.2
Reporter: Philipp von dem Bussche
Priority: Minor

A jobmanager is currently (only) instantiated with the following metrics: taskSlotsAvailable, taskSlotsTotal, numRegisteredTaskManagers and numRunningJobs. Important other metrics would be numFailedJobs, numCancelledJobs and numFinishedJobs. Also to get parity between JobManager metrics and what's available via the REST API it would be good to have these.
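The three proposed counters could be derived from a locally maintained map of archived jobs and their final status, as suggested in the comments on this issue, so that reading a metric never needs an RPC call. The Python sketch below only illustrates that idea; the class `ArchivedJobs`, its methods and the status strings are assumptions for illustration and do not reflect Flink's actual JobManager code.

```python
from collections import Counter
from typing import Dict

# Assumed names for the terminal job statuses.
TERMINAL_STATUSES = ("FAILED", "CANCELED", "FINISHED")


class ArchivedJobs:
    """Keeps the final status of each archived job in a local map."""

    def __init__(self) -> None:
        self._status_by_job: Dict[str, str] = {}

    def archive(self, job_id: str, status: str) -> None:
        """Record a job once it has reached a terminal status."""
        if status not in TERMINAL_STATUSES:
            raise ValueError(f"not a terminal status: {status}")
        self._status_by_job[job_id] = status

    def gauges(self) -> Dict[str, int]:
        """Return the metric values proposed in the issue description."""
        counts = Counter(self._status_by_job.values())
        return {
            "numFailedJobs": counts["FAILED"],
            "numCancelledJobs": counts["CANCELED"],
            "numFinishedJobs": counts["FINISHED"],
        }


archived = ArchivedJobs()
archived.archive("job-1", "FINISHED")
archived.archive("job-2", "FAILED")
archived.archive("job-3", "FINISHED")
print(archived.gauges())
```

The map is populated once per job at archive time, so each gauge read is a local lookup.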
[jira] [Commented] (FLINK-2821) Change Akka configuration to allow accessing actors from different URLs
[ https://issues.apache.org/jira/browse/FLINK-2821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15592357#comment-15592357 ]

Philipp von dem Bussche commented on FLINK-2821:

Thank you for all your effort [~mxm]. I have tested this and was able to connect a jobmanager and a task manager in a docker-machine environment on my Mac as well as in Rancher. For the Rancher setup to work, though, I had to set the bind-address to 0.0.0.0. I think this makes sense, since Rancher introduces this additional 10.x address (on top of the 172.x address given by Docker), but when specifying the hostname as bind address it would only bind to the 172.x address.

One other thing I realized was that my local flink cli on my Mac would not work together with your custom build anymore because of version discrepancies. I felt this is quite harsh given that I am running 1.1.3 on both sides, but obviously different builds.

I will play around with this a bit more, send some data across, and let you know if I see anything else popping up. Thanks again!
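The observation above about the bind address can be illustrated at the plain TCP level: a socket bound to 0.0.0.0 accepts connections arriving on any local interface, whereas a socket bound to one specific address only accepts connections addressed to it. The Python sketch below demonstrates only the socket behaviour and says nothing about the separate address check Akka performs on messages; the helper name `reachable_via` is hypothetical.

```python
import socket


def reachable_via(bind_host: str, connect_host: str) -> bool:
    """Bind a listener to bind_host on a free port, then try to connect to
    it through connect_host. Returns True if the TCP connection succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.bind((bind_host, 0))  # port 0: let the OS pick a free port
        server.listen(1)
        port = server.getsockname()[1]
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as client:
            client.settimeout(2.0)
            try:
                client.connect((connect_host, port))
                return True
            except OSError:
                return False


# A wildcard listener is reachable through the loopback interface.
print(reachable_via("0.0.0.0", "127.0.0.1"))
```

On a host with several addresses, such as a container with both a 172.x and a 10.x address, the same helper can be used to compare a listener bound to one address with one bound to 0.0.0.0.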
[jira] [Commented] (FLINK-2821) Change Akka configuration to allow accessing actors from different URLs
[ https://issues.apache.org/jira/browse/FLINK-2821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15582974#comment-15582974 ]

Philipp von dem Bussche commented on FLINK-2821:

Thanks [~mxm], I am not getting this Exception anymore; however, I don't think this is working yet. I have to admit, though, that I had to slightly change the environment in which I am testing, since I am currently travelling. I don't have access to the Rancher environment at the moment, so I am purely bringing up a Docker container on my Mac within a (non-native) docker-machine. This basically means I have a VirtualBox virtual machine running on my Mac which runs the Docker daemon, and I am running my Docker containers from this virtual machine at the moment. I do believe, though, that this test environment is quite similar to my initial test with Rancher. I have exposed port 6123 from the Docker container to the host (aka the virtual machine).

This happens on my non-customized 1.1.3 build (not the one you have created for me). I am trying to access my Flink jobmanager's rpc address (doing a simple flink list from my Mac) like this:

PHILIPPs-MacBook:~ philipp$ flink list --jobmanager 192.168.99.100:6123 # 192.168.99.100 is the docker host's IP / the IP of the virtual machine

I am getting this error message after a while:

Retrieving JobManager.
Using address /192.168.99.100:6123 to connect to JobManager.
The program finished with the following exception:
org.apache.flink.runtime.leaderretrieval.LeaderRetrievalException: Could not retrieve the leader gateway
    at org.apache.flink.runtime.util.LeaderRetrievalUtils.retrieveLeaderGateway(LeaderRetrievalUtils.java:127)
    at org.apache.flink.client.program.ClusterClient.getJobManagerGateway(ClusterClient.java:644)
    at org.apache.flink.client.CliFrontend.getJobManagerGateway(CliFrontend.java:868)
    at org.apache.flink.client.CliFrontend.list(CliFrontend.java:387)
    at org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:1008)
    at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1048)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [1 milliseconds]
    at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
    at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
    at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
    at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
    at scala.concurrent.Await$.result(package.scala:190)
    at scala.concurrent.Await.result(package.scala)
    at org.apache.flink.runtime.util.LeaderRetrievalUtils.retrieveLeaderGateway(LeaderRetrievalUtils.java:125)
    ... 5 more

And in my Flink jobmanager's log file I am seeing this error message:

2016-10-17 17:58:46,088 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManager at akka.tcp://flink@172.17.0.2:6123/user/jobmanager.
2016-10-17 17:58:46,108 INFO org.apache.flink.runtime.clusterframework.standalone.StandaloneResourceManager - Trying to associate with JobManager leader akka.tcp://flink@172.17.0.2:6123/user/jobmanager
2016-10-17 17:58:46,132 INFO org.apache.flink.runtime.jobmanager.JobManager - JobManager akka.tcp://flink@172.17.0.2:6123/user/jobmanager was granted leadership with leader session ID None.
2016-10-17 17:58:46,140 INFO org.apache.flink.runtime.clusterframework.standalone.StandaloneResourceManager - Resource Manager associating with leading JobManager Actor[akka://flink/user/jobmanager#-1164381512] - leader session null
2016-10-17 17:59:34,896 ERROR akka.remote.EndpointWriter - dropping message [class akka.actor.ActorSelectionMessage] for non-local recipient [Actor[akka.tcp://flink@192.168.99.100:6123/]] arriving at [akka.tcp://flink@192.168.99.100:6123] inbound addresses are [akka.tcp://flink@172.17.0.2:6123]
2016-10-17 17:59:45,052 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@192.168.99.1:51492] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].

I would think that the difference between this and the Rancher approach is that Rancher introduces this third IP address (10.x), which gets used when using the Rancher DNS name between containers in a Rancher environment.

Anyway, when I am using the custom version that you have sent me and I configure my jobmanager like this:

jobmanager.rpc.address: 192.168.99.100
jobmanager.rpc.bind-address: da54c7ceaaa9 # container's host name resolving to the 172.x address
jobmanager.rpc.port: 6123
jobmanager.rpc.bind-port: 6123

The jobmanager startup fails with a message like this
[jira] [Commented] (FLINK-2821) Change Akka configuration to allow accessing actors from different URLs
[ https://issues.apache.org/jira/browse/FLINK-2821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15576002#comment-15576002 ]

Philipp von dem Bussche commented on FLINK-2821:

Hello [~mxm], here are my first test results. I have built a container and, for a first test, am running it with address set to localhost and bind-address set to the actual hostname. (I ended up doing that because in Rancher it wouldn't start the container at all with address set to the Rancher DNS name and bind-address set to the docker host name, so this is the simplest test I could think of.)

I am seeing this in the logfile of the jobmanager:

    2016-10-14 17:35:22,035 ERROR org.apache.flink.runtime.jobmanager.JobManager - Failed to run JobManager.
    java.lang.Exception: Could not create JobManager actor system
        at org.apache.flink.runtime.jobmanager.JobManager$.startActorSystemAndJobManagerActors(JobManager.scala:2186)
        at org.apache.flink.runtime.jobmanager.JobManager$.runJobManager(JobManager.scala:2015)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$2.apply$mcV$sp(JobManager.scala:2078)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$2.apply(JobManager.scala:2056)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$2.apply(JobManager.scala:2056)
        at scala.util.Try$.apply(Try.scala:161)
        at org.apache.flink.runtime.jobmanager.JobManager$.retryOnBindException(JobManager.scala:2111)
        at org.apache.flink.runtime.jobmanager.JobManager$.runJobManager(JobManager.scala:2056)
        at org.apache.flink.runtime.jobmanager.JobManager$.main(JobManager.scala:1981)
        at org.apache.flink.runtime.jobmanager.JobManager.main(JobManager.scala)
    Caused by: java.lang.ClassNotFoundException: com.google.protobuf.GeneratedMessage
        at java.net.URLClassLoader.findClass(URLClassLoader.java:381)

It is true that this class is not part of flink-dist_2.10-1.1.3.jar; it is there, but only as part of the shaded packages. I compared this with my running container, where the class is part of the jar file under its exact package structure in addition to the shaded structure.

    bash-4.3$ unzip -l flink-dist_2.10-1.1.3.jar | grep com.google.protobuf.GeneratedMessage

gives me a lot of these:

    2006 10-14-16 14:29 org/apache/flink/hadoop/shaded/com/google/protobuf/GeneratedMessage$1.class
    1203 10-14-16 14:29 org/apache/flink/hadoop/shaded/com/google/protobuf/GeneratedMessage$2.class
    1665 10-14-16 14:29 org/apache/flink/hadoop/shaded/com/google/protobuf/GeneratedMessage$Builder$BuilderParentImpl.cla

whereas on the running flink jobmanager

    bash-4.3$ unzip -l flink-dist_2.11-1.1.3.jar | grep com.google.protobuf.GeneratedMessage

gives me also this:

    1513 10-10-16 15:10 com/google/protobuf/GeneratedMessage$1.class
    989 10-10-16 15:10 com/google/protobuf/GeneratedMessage$2.class
    1296 10-10-16 15:10 com/google/protobuf/GeneratedMessage$Builder$BuilderParentImpl.class

Note that on my running containers I am using Scala 2.11.

> Change Akka configuration to allow accessing actors from different URLs
> ---
>
> Key: FLINK-2821
> URL: https://issues.apache.org/jira/browse/FLINK-2821
> Project: Flink
> Issue Type: Bug
> Components: Distributed Coordination
> Reporter: Robert Metzger
> Assignee: Maximilian Michels
>
> Akka expects the actor's URL to be exactly matching.
> As pointed out here, cases where users were complaining about this:
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Error-trying-to-access-JM-through-proxy-td3018.html
> - Proxy routing (as described here, send to the proxy URL, receiver recognizes only original URL)
> - Using hostname / IP interchangeably does not work (we solved this by always putting IP addresses into URLs, never hostnames)
> - Binding to multiple interfaces (any local 0.0.0.0) does not work. Still no solution to that (but seems not too much of a restriction)
> I am aware that this is not possible due to Akka, so it is actually not a Flink bug. But I think we should track the resolution of the issue here anyways because it's affecting our users' satisfaction.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
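The jar comparison in the comment above (the class present only under the shaded prefix in one build, and also under its plain package path in the other) can be reproduced without `unzip` and `grep`. The following is a minimal sketch, not part of Flink: the function name `classify_entries` is hypothetical, and the shaded prefix is simply the one visible in the listing above.

```python
import zipfile

# Prefix taken from the unzip listing in the comment above.
SHADED_PREFIX = "org/apache/flink/hadoop/shaded/"


def _matches(entry, class_path):
    """True if entry is the class itself or one of its inner classes."""
    return entry == class_path + ".class" or entry.startswith(class_path + "$")


def classify_entries(jar, class_path):
    """Split the jar entries for a class into (unshaded, shaded) lists.

    jar is a path or a file-like object; class_path uses slashes,
    e.g. "com/google/protobuf/GeneratedMessage" (hypothetical helper).
    """
    unshaded, shaded = [], []
    with zipfile.ZipFile(jar) as zf:
        for name in zf.namelist():
            if not name.endswith(".class"):
                continue
            if _matches(name, class_path):
                unshaded.append(name)
            elif name.startswith(SHADED_PREFIX) and _matches(
                name[len(SHADED_PREFIX):], class_path
            ):
                shaded.append(name)
    return unshaded, shaded
```

Calling `classify_entries("flink-dist_2.10-1.1.3.jar", "com/google/protobuf/GeneratedMessage")` and getting an empty first list would be consistent with the `ClassNotFoundException` reported above, since the class loader looks for the plain package path.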
[jira] [Commented] (FLINK-2821) Change Akka configuration to allow accessing actors from different URLs
[ https://issues.apache.org/jira/browse/FLINK-2821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15575266#comment-15575266 ] Philipp von dem Bussche commented on FLINK-2821: [~mxm] thanks, Flink 1.1.3 is perfect. I will give it a try and feed back. It will be in a few hours / tonight hopefully.
[jira] [Commented] (FLINK-2821) Change Akka configuration to allow accessing actors from different URLs
[ https://issues.apache.org/jira/browse/FLINK-2821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15572723#comment-15572723 ] Philipp von dem Bussche commented on FLINK-2821: [~StephanEwen] yes more than happy to try out building the version and using it in my Rancher environment.
[jira] [Commented] (FLINK-2821) Change Akka configuration to allow accessing actors from different URLs
[ https://issues.apache.org/jira/browse/FLINK-2821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15572639#comment-15572639 ] Philipp von dem Bussche commented on FLINK-2821: [~mxm] that's awesome! That might work for me actually. I am happy to give this a try.
[jira] [Commented] (FLINK-2821) Change Akka configuration to allow accessing actors from different URLs
[ https://issues.apache.org/jira/browse/FLINK-2821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15571405#comment-15571405 ] Philipp von dem Bussche commented on FLINK-2821: +1 for this feature. I am orchestrating my Docker environment with something called Rancher (http://rancher.com/). With this you might end up having 3 IP addresses a container can be accessed with (IP exposed to the host, Docker IP and Rancher IP). Starting a jobmanager container via Rancher will make the RPC address be set to the Docker IP. However, if you want to use the Rancher DNS capabilities (which are quite cool), you would communicate in your Rancher network (from taskmanager to jobmanager, where the taskmanager is also a container started under Rancher) using the Rancher IP. This does not work at the moment. I worked around this for now by telling the taskmanager, while starting it up, which Docker IP to connect to in order to reach the jobmanager, but this is not really nice when thinking about automation and using other capabilities of Rancher. I can see this being quite a problem when using any orchestration on top of Docker. Thanks
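The issue description states that Akka expects the actor's URL to match exactly, and the jobmanager log earlier in this thread shows a message for `akka.tcp://flink@192.168.99.100:6123/` being dropped because the only inbound address is `akka.tcp://flink@172.17.0.2:6123`. The sketch below is a simplified model of that rule for illustration only. It is not Akka's implementation, and the function names `authority` and `is_local_recipient` are hypothetical.

```python
from urllib.parse import urlsplit


def authority(actor_url):
    """Return (system, host, port) from an Akka-style URL.

    Example input: akka.tcp://flink@172.17.0.2:6123/user/jobmanager
    """
    parts = urlsplit(actor_url)
    return parts.username, parts.hostname, parts.port


def is_local_recipient(recipient_url, inbound_urls):
    """Simplified exact-match rule (assumption, not Akka's code).

    A message is accepted only if the recipient's system@host:port is
    literally one of the addresses the actor system listens under. No
    DNS resolution happens, so a hostname and its IP do not match.
    """
    target = authority(recipient_url)
    return any(authority(url) == target for url in inbound_urls)


if __name__ == "__main__":
    inbound = ["akka.tcp://flink@172.17.0.2:6123"]
    # Address the client used (the docker host's IP): treated as non-local.
    print(is_local_recipient("akka.tcp://flink@192.168.99.100:6123/", inbound))
    # Address the JobManager registered under: treated as local.
    print(is_local_recipient("akka.tcp://flink@172.17.0.2:6123/user/jobmanager", inbound))
```

Under this model, every extra address layer (host-exposed IP, Docker IP, Rancher IP) is a distinct identity, which is why a separate advertised address and bind address are being tested in this thread.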