[jira] [Commented] (BEAM-5308) JobBundleFactory BindException with FlinkRunner and remote cluster
[ https://issues.apache.org/jira/browse/BEAM-5308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16605791#comment-16605791 ] Maximilian Michels commented on BEAM-5308: -- This is a classloading issue when closing the environment LoadingCache. The exception was swallowed: {noformat} 2018-09-06 15:37:07,996 ERROR org.apache.beam.runners.flink.translation.functions.ReferenceCountingFlinkExecutableStageContextFactory - Unable to close. java.lang.NoClassDefFoundError: org/apache/beam/repackaged/beam_runners_java_fn_execution/com/google/common/cache/RemovalCause at org.apache.beam.repackaged.beam_runners_java_fn_execution.com.google.common.cache.LocalCache$Segment.clear(LocalCache.java:3290) at org.apache.beam.repackaged.beam_runners_java_fn_execution.com.google.common.cache.LocalCache.clear(LocalCache.java:4322) at org.apache.beam.repackaged.beam_runners_java_fn_execution.com.google.common.cache.LocalCache$LocalManualCache.invalidateAll(LocalCache.java:4937) at org.apache.beam.runners.fnexecution.control.JobBundleFactoryBase.close(JobBundleFactoryBase.java:186) at org.apache.beam.runners.flink.translation.functions.FlinkBatchExecutableStageContext.close(FlinkBatchExecutableStageContext.java:68) at org.apache.beam.runners.flink.translation.functions.ReferenceCountingFlinkExecutableStageContextFactory$WrappedContext.closeActual(ReferenceCountingFlinkExecutableStageContextFactory.java:186) at org.apache.beam.runners.flink.translation.functions.ReferenceCountingFlinkExecutableStageContextFactory$WrappedContext.access$200(ReferenceCountingFlinkExecutableStageContextFactory.java:162) at org.apache.beam.runners.flink.translation.functions.ReferenceCountingFlinkExecutableStageContextFactory.release(ReferenceCountingFlinkExecutableStageContextFactory.java:150) at org.apache.beam.runners.flink.translation.functions.ReferenceCountingFlinkExecutableStageContextFactory.lambda$scheduleRelease$1(ReferenceCountingFlinkExecutableStageContextFactory.java:110) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.ClassNotFoundException: org.apache.beam.repackaged.beam_runners_java_fn_execution.com.google.common.cache.RemovalCause at java.net.URLClassLoader.findClass(URLClassLoader.java:381) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at org.apache.flink.runtime.execution.librarycache.FlinkUserCodeClassLoaders$ChildFirstClassLoader.loadClass(FlinkUserCodeClassLoaders.java:129) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) ... 16 more {noformat} > JobBundleFactory BindException with FlinkRunner and remote cluster > -- > > Key: BEAM-5308 > URL: https://issues.apache.org/jira/browse/BEAM-5308 > Project: Beam > Issue Type: Task > Components: runner-flink >Reporter: Thomas Weise >Assignee: Maximilian Michels >Priority: Major > Labels: portability > Time Spent: 0.5h > Remaining Estimate: 0h > > Repeated execution of the same job on remote Flink cluster (not embedded in > job server) fails with bind exception. There seem to be 2 issues: > * Multiple instances of job bundle factory cannot be created (port conflict) > * Job bundle factory is not released after job completes (and Docker > container keeps on running). That's not the case in embedded mode). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (BEAM-5308) JobBundleFactory BindException with FlinkRunner and remote cluster
[ https://issues.apache.org/jira/browse/BEAM-5308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16605007#comment-16605007 ] Thomas Weise commented on BEAM-5308: After the port range fix multiple jobs can run on the cluster. However, the second issue of containers not terminating still exists. The docker containers remain active after the job has finished and are only removed when the Flink cluster is stopped. That's different from behavior in embedded mode, where the containers exit after 30s. [~angoenka] any ideas? > JobBundleFactory BindException with FlinkRunner and remote cluster > -- > > Key: BEAM-5308 > URL: https://issues.apache.org/jira/browse/BEAM-5308 > Project: Beam > Issue Type: Task > Components: runner-flink >Reporter: Thomas Weise >Assignee: Maximilian Michels >Priority: Major > Labels: portability > Time Spent: 0.5h > Remaining Estimate: 0h > > Repeated execution of the same job on remote Flink cluster (not embedded in > job server) fails with bind exception. There seem to be 2 issues: > * Multiple instances of job bundle factory cannot be created (port conflict) > * Job bundle factory is not released after job completes (and Docker > container keeps on running). That's not the case in embedded mode). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (BEAM-5308) JobBundleFactory BindException with FlinkRunner and remote cluster
[ https://issues.apache.org/jira/browse/BEAM-5308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16604840#comment-16604840 ] Ankur Goenka commented on BEAM-5308: I agree, the bug is setting the port range. > JobBundleFactory BindException with FlinkRunner and remote cluster > -- > > Key: BEAM-5308 > URL: https://issues.apache.org/jira/browse/BEAM-5308 > Project: Beam > Issue Type: Task > Components: runner-flink >Reporter: Thomas Weise >Assignee: Maximilian Michels >Priority: Major > Labels: portability > Time Spent: 20m > Remaining Estimate: 0h > > Repeated execution of the same job on remote Flink cluster (not embedded in > job server) fails with bind exception. There seem to be 2 issues: > * Multiple instances of job bundle factory cannot be created (port conflict) > * Job bundle factory is not released after job completes (and Docker > container keeps on running). That's not the case in embedded mode). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (BEAM-5308) JobBundleFactory BindException with FlinkRunner and remote cluster
[ https://issues.apache.org/jira/browse/BEAM-5308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16604836#comment-16604836 ] Thomas Weise commented on BEAM-5308: That is an optimization and a different issue. Here we have a bug that multiple harnesses cannot run in a single JVM (which must be possible). Max has already identified the bug, see linked PR. > JobBundleFactory BindException with FlinkRunner and remote cluster > -- > > Key: BEAM-5308 > URL: https://issues.apache.org/jira/browse/BEAM-5308 > Project: Beam > Issue Type: Task > Components: runner-flink >Reporter: Thomas Weise >Assignee: Maximilian Michels >Priority: Major > Labels: portability > Time Spent: 20m > Remaining Estimate: 0h > > Repeated execution of the same job on remote Flink cluster (not embedded in > job server) fails with bind exception. There seem to be 2 issues: > * Multiple instances of job bundle factory cannot be created (port conflict) > * Job bundle factory is not released after job completes (and Docker > container keeps on running). That's not the case in embedded mode). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (BEAM-5308) JobBundleFactory BindException with FlinkRunner and remote cluster
[ https://issues.apache.org/jira/browse/BEAM-5308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16604330#comment-16604330 ] Thomas Weise commented on BEAM-5308: Steps to reproduce: * Start Flink 1.5.1 cluster: ./bin/start-cluster.sh * Run job server (master branch): ./gradlew :beam-runners-flink_2.11-job-server:runShadow -PflinkMasterUrl=localhost:8081 * Run wordcount: ./gradlew :beam-sdks-python:portableWordCount -Pstreaming -PjobEndpoint=localhost:8099 Second run exception: {code:java} [flink-runner-job-server] ERROR org.apache.beam.runners.flink.FlinkJobInvocation - Error during job invocation BeamApp-tweise-0905121904-94ae11b1_019a1a43-64e5-4bfd-b2bc-aa41e3835388. org.apache.flink.client.program.ProgramInvocationException: java.lang.RuntimeException: Unable to create context for job BeamApp-tweise-0905121904-94ae11b1_019a1a43-64e5-4bfd-b2bc-aa41e3835388 at org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:264) at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:464) at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:457) at org.apache.flink.streaming.api.environment.RemoteStreamEnvironment.executeRemotely(RemoteStreamEnvironment.java:219) at org.apache.flink.streaming.api.environment.RemoteStreamEnvironment.execute(RemoteStreamEnvironment.java:178) at org.apache.beam.runners.flink.FlinkJobInvocation.runPipeline(FlinkJobInvocation.java:121) at org.apache.beam.repackaged.beam_runners_flink_2.11.com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:111) at org.apache.beam.repackaged.beam_runners_flink_2.11.com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:58) at org.apache.beam.repackaged.beam_runners_flink_2.11.com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:75) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: java.lang.RuntimeException: Unable to create context for job BeamApp-tweise-0905121904-94ae11b1_019a1a43-64e5-4bfd-b2bc-aa41e3835388 at org.apache.beam.runners.flink.translation.functions.ReferenceCountingFlinkExecutableStageContextFactory.lambda$get$0(ReferenceCountingFlinkExecutableStageContextFactory.java:83) at java.util.concurrent.ConcurrentHashMap.computeIfAbsent(ConcurrentHashMap.java:1660) at org.apache.beam.runners.flink.translation.functions.ReferenceCountingFlinkExecutableStageContextFactory.get(ReferenceCountingFlinkExecutableStageContextFactory.java:76) at org.apache.beam.runners.flink.translation.functions.FlinkBatchExecutableStageContext$BatchFactory.get(FlinkBatchExecutableStageContext.java:80) at org.apache.beam.runners.flink.translation.wrappers.streaming.ExecutableStageDoFnOperator.open(ExecutableStageDoFnOperator.java:136) at org.apache.flink.streaming.runtime.tasks.StreamTask.openAllOperators(StreamTask.java:420) at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:296) at org.apache.flink.runtime.taskmanager.Task.run(Task.java:703) ... 1 more Caused by: java.io.IOException: Failed to bind at org.apache.beam.vendor.grpc.v1.io.grpc.netty.NettyServer.start(NettyServer.java:252) at org.apache.beam.vendor.grpc.v1.io.grpc.internal.ServerImpl.start(ServerImpl.java:163) at org.apache.beam.vendor.grpc.v1.io.grpc.internal.ServerImpl.start(ServerImpl.java:78) at org.apache.beam.runners.fnexecution.ServerFactory$InetSocketAddressServerFactory.createServer(ServerFactory.java:130) at org.apache.beam.runners.fnexecution.ServerFactory$InetSocketAddressServerFactory.allocatePortAndCreate(ServerFactory.java:100) at org.apache.beam.runners.fnexecution.GrpcFnServer.allocatePortAndCreateFor(GrpcFnServer.java:37) at org.apache.beam.runners.fnexecution.control.JobBundleFactoryBase.(JobBundleFactoryBase.java:85) at org.apache.beam.runners.fnexecution.control.DockerJobBundleFactory.(DockerJobBundleFactory.java:75) at org.apache.beam.runners.fnexecution.control.DockerJobBundleFactory$1.create(DockerJobBundleFactory.java:62) at org.apache.beam.runners.fnexecution.control.DockerJobBundleFactory.create(DockerJobBundleFactory.java:71) at org.apache.beam.runners.flink.translation.functions.FlinkBatchExecutableStageContext.create(FlinkBatchExecutableStageContext.java:41) at