[jira] [Commented] (BEAM-5308) JobBundleFactory BindException with FlinkRunner and remote cluster

2018-09-06 Thread Maximilian Michels (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-5308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16605791#comment-16605791
 ] 

Maximilian Michels commented on BEAM-5308:
--

This is a classloading issue when closing the environment LoadingCache. The 
exception was swallowed:

{noformat}
2018-09-06 15:37:07,996 ERROR 
org.apache.beam.runners.flink.translation.functions.ReferenceCountingFlinkExecutableStageContextFactory
  - Unable to close.
java.lang.NoClassDefFoundError: 
org/apache/beam/repackaged/beam_runners_java_fn_execution/com/google/common/cache/RemovalCause
at 
org.apache.beam.repackaged.beam_runners_java_fn_execution.com.google.common.cache.LocalCache$Segment.clear(LocalCache.java:3290)
at 
org.apache.beam.repackaged.beam_runners_java_fn_execution.com.google.common.cache.LocalCache.clear(LocalCache.java:4322)
at 
org.apache.beam.repackaged.beam_runners_java_fn_execution.com.google.common.cache.LocalCache$LocalManualCache.invalidateAll(LocalCache.java:4937)
at 
org.apache.beam.runners.fnexecution.control.JobBundleFactoryBase.close(JobBundleFactoryBase.java:186)
at 
org.apache.beam.runners.flink.translation.functions.FlinkBatchExecutableStageContext.close(FlinkBatchExecutableStageContext.java:68)
at 
org.apache.beam.runners.flink.translation.functions.ReferenceCountingFlinkExecutableStageContextFactory$WrappedContext.closeActual(ReferenceCountingFlinkExecutableStageContextFactory.java:186)
at 
org.apache.beam.runners.flink.translation.functions.ReferenceCountingFlinkExecutableStageContextFactory$WrappedContext.access$200(ReferenceCountingFlinkExecutableStageContextFactory.java:162)
at 
org.apache.beam.runners.flink.translation.functions.ReferenceCountingFlinkExecutableStageContextFactory.release(ReferenceCountingFlinkExecutableStageContextFactory.java:150)
at 
org.apache.beam.runners.flink.translation.functions.ReferenceCountingFlinkExecutableStageContextFactory.lambda$scheduleRelease$1(ReferenceCountingFlinkExecutableStageContextFactory.java:110)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: 
org.apache.beam.repackaged.beam_runners_java_fn_execution.com.google.common.cache.RemovalCause
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at 
org.apache.flink.runtime.execution.librarycache.FlinkUserCodeClassLoaders$ChildFirstClassLoader.loadClass(FlinkUserCodeClassLoaders.java:129)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 16 more
{noformat}

> JobBundleFactory BindException with FlinkRunner and remote cluster
> --
>
> Key: BEAM-5308
> URL: https://issues.apache.org/jira/browse/BEAM-5308
> Project: Beam
>  Issue Type: Task
>  Components: runner-flink
>Reporter: Thomas Weise
>Assignee: Maximilian Michels
>Priority: Major
>  Labels: portability
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Repeated execution of the same job on remote Flink cluster (not embedded in 
> job server) fails with bind exception. There seem to be 2 issues:
>  * Multiple instances of job bundle factory cannot be created (port conflict)
>  * Job bundle factory is not released after job completes (and Docker 
> container keeps on running). That's not the case in embedded mode).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-5308) JobBundleFactory BindException with FlinkRunner and remote cluster

2018-09-05 Thread Thomas Weise (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-5308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16605007#comment-16605007
 ] 

Thomas Weise commented on BEAM-5308:


After the port range fix multiple jobs can run on the cluster. However, the 
second issue of containers not terminating still exists. The docker containers 
remain active after the job has finished and are only removed when the Flink 
cluster is stopped. That's different from behavior in embedded mode, where the 
containers exit after 30s. [~angoenka] any ideas?

> JobBundleFactory BindException with FlinkRunner and remote cluster
> --
>
> Key: BEAM-5308
> URL: https://issues.apache.org/jira/browse/BEAM-5308
> Project: Beam
>  Issue Type: Task
>  Components: runner-flink
>Reporter: Thomas Weise
>Assignee: Maximilian Michels
>Priority: Major
>  Labels: portability
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Repeated execution of the same job on remote Flink cluster (not embedded in 
> job server) fails with bind exception. There seem to be 2 issues:
>  * Multiple instances of job bundle factory cannot be created (port conflict)
>  * Job bundle factory is not released after job completes (and Docker 
> container keeps on running). That's not the case in embedded mode).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-5308) JobBundleFactory BindException with FlinkRunner and remote cluster

2018-09-05 Thread Ankur Goenka (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-5308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16604840#comment-16604840
 ] 

Ankur Goenka commented on BEAM-5308:


I agree, the bug is setting the port range.

> JobBundleFactory BindException with FlinkRunner and remote cluster
> --
>
> Key: BEAM-5308
> URL: https://issues.apache.org/jira/browse/BEAM-5308
> Project: Beam
>  Issue Type: Task
>  Components: runner-flink
>Reporter: Thomas Weise
>Assignee: Maximilian Michels
>Priority: Major
>  Labels: portability
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Repeated execution of the same job on remote Flink cluster (not embedded in 
> job server) fails with bind exception. There seem to be 2 issues:
>  * Multiple instances of job bundle factory cannot be created (port conflict)
>  * Job bundle factory is not released after job completes (and Docker 
> container keeps on running). That's not the case in embedded mode).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-5308) JobBundleFactory BindException with FlinkRunner and remote cluster

2018-09-05 Thread Thomas Weise (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-5308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16604836#comment-16604836
 ] 

Thomas Weise commented on BEAM-5308:


That is an optimization and a different issue. Here we have a bug that multiple 
harnesses cannot run in a single JVM (which must be possible). Max has already 
identified the bug, see linked PR.

 

> JobBundleFactory BindException with FlinkRunner and remote cluster
> --
>
> Key: BEAM-5308
> URL: https://issues.apache.org/jira/browse/BEAM-5308
> Project: Beam
>  Issue Type: Task
>  Components: runner-flink
>Reporter: Thomas Weise
>Assignee: Maximilian Michels
>Priority: Major
>  Labels: portability
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Repeated execution of the same job on remote Flink cluster (not embedded in 
> job server) fails with bind exception. There seem to be 2 issues:
>  * Multiple instances of job bundle factory cannot be created (port conflict)
>  * Job bundle factory is not released after job completes (and Docker 
> container keeps on running). That's not the case in embedded mode).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-5308) JobBundleFactory BindException with FlinkRunner and remote cluster

2018-09-05 Thread Thomas Weise (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-5308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16604330#comment-16604330
 ] 

Thomas Weise commented on BEAM-5308:


Steps to reproduce:
 * Start Flink 1.5.1 cluster: ./bin/start-cluster.sh
 * Run job server (master branch): ./gradlew 
:beam-runners-flink_2.11-job-server:runShadow -PflinkMasterUrl=localhost:8081
 * Run wordcount: ./gradlew :beam-sdks-python:portableWordCount -Pstreaming 
-PjobEndpoint=localhost:8099

Second run exception:
{code:java}
[flink-runner-job-server] ERROR 
org.apache.beam.runners.flink.FlinkJobInvocation - Error during job invocation 
BeamApp-tweise-0905121904-94ae11b1_019a1a43-64e5-4bfd-b2bc-aa41e3835388.

org.apache.flink.client.program.ProgramInvocationException: 
java.lang.RuntimeException: Unable to create context for job 
BeamApp-tweise-0905121904-94ae11b1_019a1a43-64e5-4bfd-b2bc-aa41e3835388

        at 
org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:264)

        at 
org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:464)

        at 
org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:457)

        at 
org.apache.flink.streaming.api.environment.RemoteStreamEnvironment.executeRemotely(RemoteStreamEnvironment.java:219)

        at 
org.apache.flink.streaming.api.environment.RemoteStreamEnvironment.execute(RemoteStreamEnvironment.java:178)

        at 
org.apache.beam.runners.flink.FlinkJobInvocation.runPipeline(FlinkJobInvocation.java:121)

        at 
org.apache.beam.repackaged.beam_runners_flink_2.11.com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:111)

        at 
org.apache.beam.repackaged.beam_runners_flink_2.11.com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:58)

        at 
org.apache.beam.repackaged.beam_runners_flink_2.11.com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:75)

        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)

        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)

        at java.lang.Thread.run(Thread.java:748)

Caused by: java.lang.RuntimeException: Unable to create context for job 
BeamApp-tweise-0905121904-94ae11b1_019a1a43-64e5-4bfd-b2bc-aa41e3835388

        at 
org.apache.beam.runners.flink.translation.functions.ReferenceCountingFlinkExecutableStageContextFactory.lambda$get$0(ReferenceCountingFlinkExecutableStageContextFactory.java:83)

        at 
java.util.concurrent.ConcurrentHashMap.computeIfAbsent(ConcurrentHashMap.java:1660)

        at 
org.apache.beam.runners.flink.translation.functions.ReferenceCountingFlinkExecutableStageContextFactory.get(ReferenceCountingFlinkExecutableStageContextFactory.java:76)

        at 
org.apache.beam.runners.flink.translation.functions.FlinkBatchExecutableStageContext$BatchFactory.get(FlinkBatchExecutableStageContext.java:80)

        at 
org.apache.beam.runners.flink.translation.wrappers.streaming.ExecutableStageDoFnOperator.open(ExecutableStageDoFnOperator.java:136)

        at 
org.apache.flink.streaming.runtime.tasks.StreamTask.openAllOperators(StreamTask.java:420)

        at 
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:296)

        at org.apache.flink.runtime.taskmanager.Task.run(Task.java:703)

        ... 1 more

Caused by: java.io.IOException: Failed to bind

        at 
org.apache.beam.vendor.grpc.v1.io.grpc.netty.NettyServer.start(NettyServer.java:252)

        at 
org.apache.beam.vendor.grpc.v1.io.grpc.internal.ServerImpl.start(ServerImpl.java:163)

        at 
org.apache.beam.vendor.grpc.v1.io.grpc.internal.ServerImpl.start(ServerImpl.java:78)

        at 
org.apache.beam.runners.fnexecution.ServerFactory$InetSocketAddressServerFactory.createServer(ServerFactory.java:130)

        at 
org.apache.beam.runners.fnexecution.ServerFactory$InetSocketAddressServerFactory.allocatePortAndCreate(ServerFactory.java:100)

        at 
org.apache.beam.runners.fnexecution.GrpcFnServer.allocatePortAndCreateFor(GrpcFnServer.java:37)

        at 
org.apache.beam.runners.fnexecution.control.JobBundleFactoryBase.(JobBundleFactoryBase.java:85)

        at 
org.apache.beam.runners.fnexecution.control.DockerJobBundleFactory.(DockerJobBundleFactory.java:75)

        at 
org.apache.beam.runners.fnexecution.control.DockerJobBundleFactory$1.create(DockerJobBundleFactory.java:62)

        at 
org.apache.beam.runners.fnexecution.control.DockerJobBundleFactory.create(DockerJobBundleFactory.java:71)

        at 
org.apache.beam.runners.flink.translation.functions.FlinkBatchExecutableStageContext.create(FlinkBatchExecutableStageContext.java:41)

        at