[jira] [Commented] (FLINK-2821) Change Akka configuration to allow accessing actors from different URLs

2017-02-01 Thread Philipp von dem Bussche (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15848287#comment-15848287
 ] 

Philipp von dem Bussche commented on FLINK-2821:


Hello [~mxm], after being quiet for a while I wanted to feed back on the setup 
I am running at the moment.
To recap (I had to think about my setup myself again after not spending much 
time on it lately ;) ):
- job manager and task manager run in Docker containers
- I am using an orchestration engine called Rancher on top of Docker, which 
introduces another set of IP addresses / another network on top of Docker's.

Since I communicate with the JobManager both from within the Docker / Rancher 
network and from outside (from my local build server), I had to have the 
JobManager register under a hostname that is resolvable on the Internet. Both the 
task manager (coming from within the Docker / Rancher network) and the 
build server now connect via the internet host name. Obviously, since the task 
manager lives right next to the job manager, the preferred solution would 
be for the task manager to connect locally (meaning through the Docker / 
Rancher network), but since one can only specify one listener address it has to 
go through the internet host name.

However, this does not solve the problem completely yet, because if I just tell 
the JobManager to bind to the internet host name I get the following 
exception while the JobManager starts up:

2017-02-01 11:13:51,997 INFO  org.apache.flink.util.NetUtils - Unable to allocate on port 6123, due to error: Address not available (Bind failed)
2017-02-01 11:13:51,999 ERROR org.apache.flink.runtime.jobmanager.JobManager - Failed to run JobManager.
java.lang.RuntimeException: Unable to do further retries starting the actor system
    at org.apache.flink.runtime.jobmanager.JobManager$.retryOnBindException(JobManager.scala:2136)
    at org.apache.flink.runtime.jobmanager.JobManager$.runJobManager(JobManager.scala:2076)
    at org.apache.flink.runtime.jobmanager.JobManager$$anon$12.call(JobManager.scala:1971)
    at org.apache.flink.runtime.jobmanager.JobManager$$anon$12.call(JobManager.scala:1969)
    at org.apache.flink.runtime.security.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:29)
    at org.apache.flink.runtime.jobmanager.JobManager$.main(JobManager.scala:1969)
    at org.apache.flink.runtime.jobmanager.JobManager.main(JobManager.scala)

So additionally I had to put the Docker IP address of the JobManager container 
into /etc/hosts resolving to the internet host name so that it tries to bind on 
the Docker IP address rather than the Amazon AWS IP address (which is the IP 
that the internet host name resolves to).

This works for me now, though I would not call it ideal.
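
A minimal sketch of the /etc/hosts workaround described above, as it might look in a container entrypoint. The host name `flink.example.com`, the helper names, and the use of `hostname -i` to obtain the container's Docker IP are assumptions for illustration and are not taken from the setup described in this comment.

```shell
#!/bin/sh
# Hypothetical public name the JobManager is configured to use.
PUBLIC_NAME="flink.example.com"

# Print an /etc/hosts line mapping an IP address to a host name.
hosts_line() {
  printf '%s %s\n' "$1" "$2"
}

# Append the mapping to a hosts file unless that exact line is already there.
# $1 = hosts file, $2 = IP address, $3 = host name
add_hosts_entry() {
  line="$(hosts_line "$2" "$3")"
  grep -qxF "$line" "$1" 2>/dev/null || printf '%s\n' "$line" >> "$1"
}

# In an entrypoint (running as root) this could be called as:
#   add_hosts_entry /etc/hosts "$(hostname -i)" "$PUBLIC_NAME"
```

With such an entry in place, the public name resolves to the Docker IP inside the container, so the bind happens on an address that actually exists there, while clients outside still resolve the same name through public DNS.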

I have to admit I have not tested this with the latest RC, will do that later 
in the week.
Thanks

> Change Akka configuration to allow accessing actors from different URLs
> ---
>
> Key: FLINK-2821
> URL: https://issues.apache.org/jira/browse/FLINK-2821
> Project: Flink
>  Issue Type: Bug
>  Components: Distributed Coordination
>Reporter: Robert Metzger
>Assignee: Maximilian Michels
> Fix For: 1.2.0
>
>
> Akka expects the actor's URL to be exactly matching.
> As pointed out here, cases where users were complaining about this: 
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Error-trying-to-access-JM-through-proxy-td3018.html
>   - Proxy routing (as described here, send to the proxy URL, receiver 
> recognizes only original URL)
>   - Using hostname / IP interchangeably does not work (we solved this by 
> always putting IP addresses into URLs, never hostnames)
>   - Binding to multiple interfaces (any local 0.0.0.0) does not work. Still 
> no solution to that (but seems not too much of a restriction)
> I am aware that this is not possible due to Akka, so it is actually not a 
> Flink bug. But I think we should track the resolution of the issue here 
> anyways because its affecting our user's satisfaction.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (FLINK-2821) Change Akka configuration to allow accessing actors from different URLs

2016-12-10 Thread Philipp von dem Bussche (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15737849#comment-15737849
 ] 

Philipp von dem Bussche commented on FLINK-2821:


I am just looking into another use case where communication with the JobManager is 
required but the client is not part of the Rancher or Docker 
environment. For that I could not use the hostname specified as the rpc address. 
This is a use case I am seeing when deploying the JobManager 
to, let's say, AWS while wanting to deploy the jobs from my on-prem build system. The 
JobManager (container) has a name that is resolvable from the Internet, but 
since it is also part of the Rancher network I have it listen on the Rancher 
DNS name. This makes sense so that JobManager and TaskManager only 
communicate within the Rancher network, but I still have to deploy 
the jobs from outside and would do that via the Internet host name. However, I would then 
have to register the JobManager under the Internet host name, and subsequently the 
TaskManager would also have to use that name.
So for this to work I would need the JobManager to accept communication on 
at least two rpc addresses. I still don't quite understand why this is being 
verified at all; why not just serve any connection as long as it hits the 
right port?



[jira] [Commented] (FLINK-2821) Change Akka configuration to allow accessing actors from different URLs

2016-12-08 Thread Philipp von dem Bussche (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15734537#comment-15734537
 ] 

Philipp von dem Bussche commented on FLINK-2821:


Thanks [~mxm], this is working for me now. I tested again and am now purely using 
Rancher DNS names, and I can also deploy my jobs from my build service. 
There are two things I still had to do: a) update the Flink 
version on my build slave as well (so it seems the client also got some changes), and b) 
match the hostname exactly when connecting, e.g. from the client. In 
Rancher a host can be reached by hostname.stackname; in my case stackname 
was in camel case, so I also had to use camel case to connect from the 
client to the JobManager because otherwise the JobManager would again refuse the connection.
Anyway, this looks quite good now and I will test further with this. Thanks 
again for the help!
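
The exact-match requirement reported above can be pictured as a plain string comparison. This is only an illustration of the observed behaviour, not Akka's actual code; the function name and the stack name `AnalyticsStack` are made-up examples.

```shell
#!/bin/sh
# Succeed only if the address the client dials is character for character
# the address the JobManager registered (the comparison is case-sensitive).
address_matches() {
  [ "$1" = "$2" ]
}

registered="flink-jobmanager.AnalyticsStack:6123"   # hypothetical registered address

for dialed in "flink-jobmanager.AnalyticsStack:6123" "flink-jobmanager.analyticsstack:6123"; do
  if address_matches "$dialed" "$registered"; then
    echo "$dialed: accepted"
  else
    echo "$dialed: refused"
  fi
done
```

Both names would resolve to the same container through DNS, which is case-insensitive, but only the first would pass a comparison of this kind.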



[jira] [Commented] (FLINK-2821) Change Akka configuration to allow accessing actors from different URLs

2016-12-07 Thread Philipp von dem Bussche (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15731314#comment-15731314
 ] 

Philipp von dem Bussche commented on FLINK-2821:


Hello [~mxm], some more feedback on my testing back in the Rancher environment. 
The connection between TaskManager and JobManager via the Rancher DNS name is 
working now; however, I still seem to have a slight problem deploying my jobs.
I am doing this from Jenkins using the flink CLI. Running it with a hostname 
(which is still slightly different from what I have configured on the JobManager) 
seems to trigger an IP resolution, so it then tries to connect via the IP 
rather than the hostname, and hence the CLI is still blocked from connecting on the 
JobManager side.
Can we somehow make it so that both the hostname and whatever it resolves to on the 
JobManager are allowed to connect?

Output from flink cli command:

flink list --jobmanager flink-jobmanager.analyticsstack:6123
Retrieving JobManager.
Using address /10.42.202.225:6123 to connect to JobManager.


 The program finished with the following exception:

org.apache.flink.runtime.leaderretrieval.LeaderRetrievalException: Could not retrieve the leader gateway
    at org.apache.flink.runtime.util.LeaderRetrievalUtils.retrieveLeaderGateway(LeaderRetrievalUtils.java:127)

Output from the JobManager log file:

2016-12-08 06:39:27,581 ERROR akka.remote.EndpointWriter - dropping message [class akka.actor.ActorSelectionMessage] for non-local recipient [Actor[akka.tcp://flink@10.42.202.225:6123/]] arriving at [akka.tcp://flink@10.42.202.225:6123] inbound addresses are [akka.tcp://flink@flink-jobmanager:6123]
2016-12-08 06:39:37,711 WARN  akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@172.17.0.6:46589] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
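
The EndpointWriter entry above names both the address the client dialed and the address the JobManager registered. A small sketch for pulling those addresses out of a log so they can be compared side by side; the function name `akka_addresses` is hypothetical, and `grep -oE` is assumed to be available (it is in GNU, BSD and BusyBox grep, but it is not required by POSIX).

```shell
#!/bin/sh
# Print each distinct akka.tcp://system@host:port address found on stdin,
# in order of first appearance.
akka_addresses() {
  grep -oE 'akka\.tcp://[A-Za-z0-9_-]+@[A-Za-z0-9_.-]+:[0-9]+' | awk '!seen[$0]++'
}

# Possible usage against a log file (the path is an example only):
#   grep 'dropping message' jobmanager.log | akka_addresses
```

Fed the first log entry above, this prints the dialed address (10.42.202.225) followed by the registered inbound address (flink-jobmanager). The two differ, which is why the message is dropped.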



[jira] [Commented] (FLINK-2821) Change Akka configuration to allow accessing actors from different URLs

2016-12-06 Thread Philipp von dem Bussche (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15726595#comment-15726595
 ] 

Philipp von dem Bussche commented on FLINK-2821:


Thanks [~mxm]. This should then work nicely in my Rancher environment, as the 
hostname I will use from the TaskManager to talk to the JobManager will be 
resolvable on the JobManager. My local docker-machine based environment on 
my laptop seems to struggle with this, though, as the hostname I am using there is not 
resolvable. Then again, I normally test my Flink jobs locally without the 
Docker/Rancher bits, so I guess I am fine with the PR as it is. I will do more 
testing after tomorrow when I am back home.



[jira] [Commented] (FLINK-2821) Change Akka configuration to allow accessing actors from different URLs

2016-12-05 Thread Philipp von dem Bussche (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15723269#comment-15723269
 ] 

Philipp von dem Bussche commented on FLINK-2821:


Hi [~mxm], I wanted to give this a try but I am not sure if I am testing this 
correctly.
Do I just have to set jobmanager.rpc.address to the hostname that will be 
used for access from outside?
I tried to use a name that is not resolvable on the host itself and that fails, 
but this is in my local Docker environment and it should be different as soon 
as I move this to Rancher.
Thanks



[jira] [Commented] (FLINK-2821) Change Akka configuration to allow accessing actors from different URLs

2016-11-18 Thread Philipp von dem Bussche (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15677009#comment-15677009
 ] 

Philipp von dem Bussche commented on FLINK-2821:


[~StephanEwen] I am not quite sure if this is going to work. The 
orchestration framework I am using (Rancher) exposes a 10.x IP address 
which is not available on the host itself (only the 172.x address from Docker is). 
What I have seen with binding previously is that when the host binds 
to 172.x it rejects a request against a 10.x address. So if we think that 
it won't do that when binding on 0.0.0.0 then I am cool with the change :) 
If this is too theoretical, though, I am more than happy to do more testing if 
[~mxm] wants to make the change.



[jira] [Commented] (FLINK-2821) Change Akka configuration to allow accessing actors from different URLs

2016-11-12 Thread Philipp von dem Bussche (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15659436#comment-15659436
 ] 

Philipp von dem Bussche commented on FLINK-2821:


[~mxm] sorry for my late reply but it is good that you have asked these 
questions as I think I need to optimize my setup here a bit. Whenever my test 
server goes down (which it does quite a bit lately) then my flink services 
won't come up without manual intervention and I think that is not good.
Since I have forgotten some of the details of my setup myself I am just going 
to outline them again below:

So I actually took the Dockerfile from the Flink contrib Github project as a 
basis.
In the Docker entrypoint I am doing this to set the JobManager listen address 
(I think this is still default):

{code}sed -i -e "s/jobmanager.rpc.address: localhost/jobmanager.rpc.address: `hostname -f`/g" $FLINK_HOME/conf/flink-conf.yaml{code}

So this actually leads to the JobManager doing this:

{code}Starting JobManager actor system at 172.17.0.23:6123{code}

The 172.x address would be the IP address coming from Docker.
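
The entrypoint substitution above only matches while the value is still the default `localhost`, so running it against an already edited file changes nothing. Below is a sketch of a variant that replaces whatever value is present. The function name `set_rpc_address` is hypothetical, and the sketch avoids `sed -i` so that it behaves the same with GNU and BSD sed.

```shell
#!/bin/sh
# Set jobmanager.rpc.address in a Flink config file, whatever its current value.
# $1 = path to flink-conf.yaml
# $2 = address (host name or IP; must not contain '/' or '&', which sed would interpret)
set_rpc_address() {
  tmp="$(mktemp)" || return 1
  sed "s/^jobmanager\.rpc\.address:.*/jobmanager.rpc.address: $2/" "$1" > "$tmp" \
    && cat "$tmp" > "$1"
  status=$?
  rm -f "$tmp"
  return "$status"
}

# In the entrypoint this could be called as:
#   set_rpc_address "$FLINK_HOME/conf/flink-conf.yaml" "$(hostname -f)"
```

Writing through a temporary file and then copying the content back keeps the original file's permissions and ownership, which matters when the config directory is a mounted volume.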

On my TaskManager container I can obviously access this address, and I 
have an environment variable set on the TaskManager that 
points to it. However, this is of course not really dynamic: I 
have about 20 or so containers on my test system, and after the last reboot of 
the server the Docker IP address changed (it was actually .24 before). So then 
this whole setup kind of breaks.

Moving on to Rancher: you are able to define stacks (which are like a grouping 
for your containers). I have one stack for all the containers I need for doing 
data science things (well, maybe that's a bit of overselling, but anyway ;) ). 
In Rancher a service is kind of another wrapper around a 
container: you say you have a service that uses Docker image X, and 
if you need more than, let's say, one container you scale the service up and down, and a 
service can run on different hosts, etc. The service representing the JobManager 
functionality is named flink-jobmanager. Now with the Rancher DNS I can access the 
service (and, since I only have one active container, essentially the container) 
by just connecting to :flink-jobmanager . This is when I am creating 
the connection from within the same stack. If I were on my application stack and 
wanted to access Flink directly (I don't, because it goes from the webservice into 
Kafka first, which is already on the same stack) I could connect via 
:flink-jobmanager.analyticsstack.
Now this is quite cool because I can leave out any of those references to hosts 
etc. via environment variables or parameters, because I can be pretty sure that 
my other services/containers are always resolvable. However, the resolution is 
done against the Rancher IP, and the one for the JobManager in my setup 
is currently 10.42.9.68.

So from my TaskManager container I can access all three of those IPs (the host 
IP, the Docker IP and the Rancher IP). However, I don't really want to go for the 
host IP or the Docker IP because that would make things too static, but when I 
have the JobManager bind on the Docker IP and try to connect to it via the 
Rancher IP, it complains. 
On the other hand, I can't have the JobManager bind on the Rancher IP because 
that is not available inside the container; it is something available in the 
Rancher context that then gets mapped/forwarded onto Docker and the 172.x 
address.

It seems I am currently not running the build where I just patched the Akka 
version, but I remember I did for a while and it worked fine. I also think this 
would be the only way this could work, but I might be missing something. 
Thanks for looking into this!



[jira] [Comment Edited] (FLINK-4888) instantiated job manager metrics missing important job statistics

2016-10-25 Thread Philipp von dem Bussche (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15604971#comment-15604971
 ] 

Philipp von dem Bussche edited comment on FLINK-4888 at 10/25/16 10:57 AM:
---

[~Zentol] thanks for that. There is a map currentJobs which is used to get the 
running jobs and their status. What if we introduce something similar for the 
archived jobs and their status in order to avoid those repeated RPC calls ? I 
could have a look at where we could populate the map.


was (Author: philipp.bussche):
[~Zentol] thanks for that. There is a map currentJobs which is used to get the 
running jobs. What if we introduce something similar for the archived jobs and 
their status in order to avoid those repeated RPC calls ? I could have a look 
at where we could populate the map.

> instantiated job manager metrics missing important job statistics 
> --
>
> Key: FLINK-4888
> URL: https://issues.apache.org/jira/browse/FLINK-4888
> Project: Flink
>  Issue Type: Improvement
>  Components: Metrics
>Affects Versions: 1.1.2
>Reporter: Philipp von dem Bussche
>Assignee: Philipp von dem Bussche
>Priority: Minor
>
> A jobmanager is currently (only) instantiated with the following metrics: 
> taskSlotsAvailable, taskSlotsTotal, numRegisteredTaskManagers and 
> numRunningJobs. Important other metrics would be numFailedJobs, 
> numCancelledJobs and numFinishedJobs. Also to get parity between JobManager 
> metrics and whats available via the REST API it would be good to have these.





[jira] [Commented] (FLINK-4888) instantiated job manager metrics missing important job statistics

2016-10-25 Thread Philipp von dem Bussche (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15604971#comment-15604971
 ] 

Philipp von dem Bussche commented on FLINK-4888:


[~Zentol] thanks for that. There is a map currentJobs which is used to get the 
running jobs. What if we introduce something similar for the archived jobs and 
their status in order to avoid those repeated RPC calls ? I could have a look 
at where we could populate the map.



[jira] [Created] (FLINK-4888) instantiated job manager metrics missing important job statistics

2016-10-22 Thread Philipp von dem Bussche (JIRA)
Philipp von dem Bussche created FLINK-4888:
--

 Summary: instantiated job manager metrics missing important job 
statistics 
 Key: FLINK-4888
 URL: https://issues.apache.org/jira/browse/FLINK-4888
 Project: Flink
  Issue Type: Improvement
  Components: Metrics
Affects Versions: 1.1.2
Reporter: Philipp von dem Bussche
Priority: Minor


A jobmanager is currently (only) instantiated with the following metrics: 
taskSlotsAvailable, taskSlotsTotal, numRegisteredTaskManagers and 
numRunningJobs. Important other metrics would be numFailedJobs, 
numCancelledJobs and numFinishedJobs. Also to get parity between JobManager 
metrics and whats available via the REST API it would be good to have these.





[jira] [Commented] (FLINK-2821) Change Akka configuration to allow accessing actors from different URLs

2016-10-20 Thread Philipp von dem Bussche (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15592357#comment-15592357
 ] 

Philipp von dem Bussche commented on FLINK-2821:


Thank you for all your effort, [~mxm]. I have tested this and was able to 
connect a JobManager and a TaskManager in a docker-machine environment on my 
Mac as well as in Rancher. For the Rancher setup to work, though, I had to set 
the bind-address to 0.0.0.0. I think this makes sense, since Rancher 
introduces this additional 10.x address (on top of the 172.x address given by 
Docker), but when specifying the hostname as the bind address it would only bind to 
the 172.x address.
One other thing I noticed was that the local flink CLI on my Mac would 
no longer work together with your custom build because of version 
discrepancies. I felt this is quite harsh given that I am running 1.1.3 on 
both sides, though obviously different builds.
I will play around with this a bit more, send some data across, and let you 
know if I see anything else popping up.
Thanks again!



[jira] [Commented] (FLINK-2821) Change Akka configuration to allow accessing actors from different URLs

2016-10-17 Thread Philipp von dem Bussche (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15582974#comment-15582974
 ] 

Philipp von dem Bussche commented on FLINK-2821:


Thanks [~mxm], I am not getting this exception anymore; however, I don't think 
this is working yet.
I have to admit, though, that I had to slightly change the environment in which I 
am testing since I am currently travelling. At the moment I don't have access 
to the Rancher environment, so I am purely bringing up a Docker container on my 
Mac within a (non-native) docker-machine. That basically means I have a 
VirtualBox virtual machine running on my Mac which runs the Docker daemon, and 
I am currently running my Docker containers from this virtual machine. I do 
believe, though, that this test environment is quite similar to my initial test 
with Rancher. I have exposed port 6123 from the Docker container to the host 
(aka the virtual machine).

This happens on my non-customized 1.1.3 build (not the one you have created for 
me):
I am trying to access my Flink's jobmanager rpc address (doing a simple flink 
list from my Mac) like this:

PHILIPPs-MacBook:~ philipp$ flink list --jobmanager 192.168.99.100:6123 # 192.168.99.100 is the docker host's IP / the IP of the virtual machine

I am getting this error message after a while:

Retrieving JobManager.
Using address /192.168.99.100:6123 to connect to JobManager.


 The program finished with the following exception:

org.apache.flink.runtime.leaderretrieval.LeaderRetrievalException: Could not retrieve the leader gateway
    at org.apache.flink.runtime.util.LeaderRetrievalUtils.retrieveLeaderGateway(LeaderRetrievalUtils.java:127)
    at org.apache.flink.client.program.ClusterClient.getJobManagerGateway(ClusterClient.java:644)
    at org.apache.flink.client.CliFrontend.getJobManagerGateway(CliFrontend.java:868)
    at org.apache.flink.client.CliFrontend.list(CliFrontend.java:387)
    at org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:1008)
    at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1048)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [1 milliseconds]
    at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
    at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
    at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
    at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
    at scala.concurrent.Await$.result(package.scala:190)
    at scala.concurrent.Await.result(package.scala)
    at org.apache.flink.runtime.util.LeaderRetrievalUtils.retrieveLeaderGateway(LeaderRetrievalUtils.java:125)
    ... 5 more

And in my Flink jobmanager's log file I am seeing this error message:

2016-10-17 17:58:46,088 INFO  org.apache.flink.runtime.jobmanager.JobManager - Starting JobManager at akka.tcp://flink@172.17.0.2:6123/user/jobmanager.
2016-10-17 17:58:46,108 INFO  org.apache.flink.runtime.clusterframework.standalone.StandaloneResourceManager - Trying to associate with JobManager leader akka.tcp://flink@172.17.0.2:6123/user/jobmanager
2016-10-17 17:58:46,132 INFO  org.apache.flink.runtime.jobmanager.JobManager - JobManager akka.tcp://flink@172.17.0.2:6123/user/jobmanager was granted leadership with leader session ID None.
2016-10-17 17:58:46,140 INFO  org.apache.flink.runtime.clusterframework.standalone.StandaloneResourceManager - Resource Manager associating with leading JobManager Actor[akka://flink/user/jobmanager#-1164381512] - leader session null
2016-10-17 17:59:34,896 ERROR akka.remote.EndpointWriter - dropping message [class akka.actor.ActorSelectionMessage] for non-local recipient [Actor[akka.tcp://flink@192.168.99.100:6123/]] arriving at [akka.tcp://flink@192.168.99.100:6123] inbound addresses are [akka.tcp://flink@172.17.0.2:6123]
2016-10-17 17:59:45,052 WARN  akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@192.168.99.1:51492] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].

I would think that the difference between this and the Rancher approach would 
be that Rancher introduces this third IP address (10.x) which gets used when 
using the Rancher DNS name between containers in a Rancher environment.
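
The EndpointWriter error in the jobmanager log above comes down to an exact 
address comparison: the client addressed the actor system as 
192.168.99.100:6123, while the JobManager only knows itself as 172.17.0.2:6123. 
The following is a minimal Python sketch of that comparison. It is an 
illustration only, not Akka's actual implementation, and the helper names 
parse_akka_address and accepts are hypothetical.

```python
import re

# protocol://system@host:port, optionally followed by an actor path
_ADDRESS = re.compile(
    r"^(?P<protocol>[\w.]+)://(?P<system>[^@/]+)@(?P<host>[^:/]+):(?P<port>\d+)"
)


def parse_akka_address(url):
    """Split an Akka-style URL into (protocol, system, host, port)."""
    match = _ADDRESS.match(url)
    if match is None:
        raise ValueError("not a remote Akka address: " + url)
    return (
        match.group("protocol"),
        match.group("system"),
        match.group("host"),
        int(match.group("port")),
    )


def accepts(inbound_addresses, recipient_url):
    """True only if the recipient's address equals one inbound address exactly."""
    recipient = parse_akka_address(recipient_url)
    return any(parse_akka_address(a) == recipient for a in inbound_addresses)


# Addresses taken from the log above
inbound = ["akka.tcp://flink@172.17.0.2:6123"]

# The client went through the docker host's IP, so the message is dropped
print(accepts(inbound, "akka.tcp://flink@192.168.99.100:6123/"))  # False

# A peer using the container's own IP matches the inbound address
print(accepts(inbound, "akka.tcp://flink@172.17.0.2:6123/user/jobmanager"))  # True
```

Because the host is compared literally, a port mapping on the Docker host is 
not enough on its own; the address the client uses has to be one the actor 
system recognizes as its own.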

Anyways when I am using the custom version that you have sent me and I 
configure my jobmanager like this:

jobmanager.rpc.address: 192.168.99.100
jobmanager.rpc.bind-address: da54c7ceaaa9 # container's host name resolving to the 172.x address
jobmanager.rpc.port: 6123
jobmanager.rpc.bind-port: 6123

The jobmanager startup fails with a message like this 

[jira] [Commented] (FLINK-2821) Change Akka configuration to allow accessing actors from different URLs

2016-10-14 Thread Philipp von dem Bussche (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15576002#comment-15576002
 ] 

Philipp von dem Bussche commented on FLINK-2821:


Hello [~mxm], here are my first test results. 
I have built a container and, as a first test, am running it with address set 
to localhost and bind-address set to the actual hostname. (I ended up doing 
that because in Rancher the container wouldn't start at all with address set to 
the Rancher DNS name and bind-address set to the Docker host name, so this is 
the simplest test I could think of.)
I am seeing this in the logfile of the jobmanager:

2016-10-14 17:35:22,035 ERROR org.apache.flink.runtime.jobmanager.JobManager - Failed to run JobManager.
java.lang.Exception: Could not create JobManager actor system
    at org.apache.flink.runtime.jobmanager.JobManager$.startActorSystemAndJobManagerActors(JobManager.scala:2186)
    at org.apache.flink.runtime.jobmanager.JobManager$.runJobManager(JobManager.scala:2015)
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$2.apply$mcV$sp(JobManager.scala:2078)
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$2.apply(JobManager.scala:2056)
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$2.apply(JobManager.scala:2056)
    at scala.util.Try$.apply(Try.scala:161)
    at org.apache.flink.runtime.jobmanager.JobManager$.retryOnBindException(JobManager.scala:2111)
    at org.apache.flink.runtime.jobmanager.JobManager$.runJobManager(JobManager.scala:2056)
    at org.apache.flink.runtime.jobmanager.JobManager$.main(JobManager.scala:1981)
    at org.apache.flink.runtime.jobmanager.JobManager.main(JobManager.scala)
Caused by: java.lang.ClassNotFoundException: com.google.protobuf.GeneratedMessage
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)

It is true that this class is not part of flink-dist_2.10-1.1.3.jar under its 
original package name; it is there, but only as part of the shaded packages.
I compared this with my running container, where the jar file contains the 
class under its exact original package structure in addition to the shaded 
structure.

bash-4.3$ unzip -l flink-dist_2.10-1.1.3.jar | grep com.google.protobuf.GeneratedMessage
gives me a lot of these:
2006  10-14-16 14:29   org/apache/flink/hadoop/shaded/com/google/protobuf/GeneratedMessage$1.class
1203  10-14-16 14:29   org/apache/flink/hadoop/shaded/com/google/protobuf/GeneratedMessage$2.class
1665  10-14-16 14:29   org/apache/flink/hadoop/shaded/com/google/protobuf/GeneratedMessage$Builder$BuilderParentImpl.cla

whereas on the running flink jobmanager
bash-4.3$ unzip -l flink-dist_2.11-1.1.3.jar | grep com.google.protobuf.GeneratedMessage
gives me also this:
1513  10-10-16 15:10   com/google/protobuf/GeneratedMessage$1.class
989  10-10-16 15:10   com/google/protobuf/GeneratedMessage$2.class
1296  10-10-16 15:10   com/google/protobuf/GeneratedMessage$Builder$BuilderParentImpl.class

Note that on my running containers I am using Scala 2.11.
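
The unzip / grep comparison above can also be scripted so that shaded and 
unshaded copies of a class are reported separately. This is a minimal sketch in 
Python using the standard zipfile module; the function name class_entries is 
hypothetical, and the demo runs on a tiny in-memory archive that only mimics 
the two listings rather than on a real flink-dist jar.

```python
import io
import re
import zipfile


def class_entries(jar, class_name):
    """Return (unshaded, shaded) entries for a fully qualified class name.

    `jar` is a path or file-like object. Nested classes such as Foo$1 or
    Foo$Builder$Impl are counted together with the class itself.
    """
    path = re.escape(class_name.replace(".", "/"))
    pattern = re.compile(r"(?:^|/)" + path + r"(?:\$[^/]*)?\.class$")
    unshaded, shaded = [], []
    with zipfile.ZipFile(jar) as archive:
        for name in archive.namelist():
            match = pattern.search(name)
            if match is None:
                continue
            # A match at offset 0 means the original package structure;
            # anything else sits below a relocation prefix.
            (unshaded if match.start() == 0 else shaded).append(name)
    return unshaded, shaded


# Demo archive with one shaded and one unshaded entry
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as demo:
    demo.writestr(
        "org/apache/flink/hadoop/shaded/com/google/protobuf/GeneratedMessage$1.class",
        b"",
    )
    demo.writestr("com/google/protobuf/GeneratedMessage$1.class", b"")
buf.seek(0)

unshaded, shaded = class_entries(buf, "com.google.protobuf.GeneratedMessage")
print("unshaded:", unshaded)
print("shaded:", shaded)
```

Against a real build one would pass the jar path instead of the in-memory 
buffer; an empty unshaded list would be consistent with the 
ClassNotFoundException shown above.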

> Change Akka configuration to allow accessing actors from different URLs
> ---
>
> Key: FLINK-2821
> URL: https://issues.apache.org/jira/browse/FLINK-2821
> Project: Flink
> Issue Type: Bug
> Components: Distributed Coordination
> Reporter: Robert Metzger
> Assignee: Maximilian Michels
>
> Akka expects the actor's URL to be exactly matching.
> As pointed out here, cases where users were complaining about this: 
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Error-trying-to-access-JM-through-proxy-td3018.html
>   - Proxy routing (as described here, send to the proxy URL, receiver 
> recognizes only original URL)
>   - Using hostname / IP interchangeably does not work (we solved this by 
> always putting IP addresses into URLs, never hostnames)
>   - Binding to multiple interfaces (any local 0.0.0.0) does not work. Still 
> no solution to that (but seems not too much of a restriction)
> I am aware that this is not possible due to Akka, so it is actually not a 
> Flink bug. But I think we should track the resolution of the issue here 
> anyways because it's affecting our users' satisfaction.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-2821) Change Akka configuration to allow accessing actors from different URLs

2016-10-14 Thread Philipp von dem Bussche (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15575266#comment-15575266
 ] 

Philipp von dem Bussche commented on FLINK-2821:


[~mxm] thanks, Flink 1.1.3 is perfect. I will give it a try and feed back. It 
will be in a few hours / tonight hopefully.



[jira] [Commented] (FLINK-2821) Change Akka configuration to allow accessing actors from different URLs

2016-10-13 Thread Philipp von dem Bussche (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15572723#comment-15572723
 ] 

Philipp von dem Bussche commented on FLINK-2821:


[~StephanEwen] yes, I am more than happy to try building the version and using 
it in my Rancher environment.



[jira] [Commented] (FLINK-2821) Change Akka configuration to allow accessing actors from different URLs

2016-10-13 Thread Philipp von dem Bussche (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15572639#comment-15572639
 ] 

Philipp von dem Bussche commented on FLINK-2821:


[~mxm] that's awesome! That might actually work for me. I am happy to give this 
a try.



[jira] [Commented] (FLINK-2821) Change Akka configuration to allow accessing actors from different URLs

2016-10-13 Thread Philipp von dem Bussche (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15571405#comment-15571405
 ] 

Philipp von dem Bussche commented on FLINK-2821:


+1 for this feature. I am orchestrating my Docker environment with something 
called Rancher (http://rancher.com/). With this you might end up with 3 IP 
addresses through which a container can be accessed (the IP exposed to the 
host, the Docker IP and the Rancher IP). Starting a jobmanager container via 
Rancher sets the RPC address to the Docker IP. However, if you want to use the 
Rancher DNS capabilities (which are quite cool), you would communicate within 
your Rancher network (from taskmanager to jobmanager, where the taskmanager is 
also a container started under Rancher) using the Rancher IP. This does not 
work at the moment. For now I worked around it by telling the taskmanager at 
startup which Docker IP to connect to in order to reach the jobmanager, but 
this is not really nice when thinking about automation and using other 
capabilities of Rancher.
I can see this being quite a problem when using any orchestration on top of 
Docker.
Thanks
