Task manager suddenly lost connection to JM
Hi team, I see an wired issue that one of my TM suddenly lost connection to JM. Once the job running on the TM relocated to a new TM, it can reconnect to JM again. And after a while, the new TM running the same job will repeat the same process. It is not guaranteed the troubled TMs can reconnect to JM in a reasonable time frame, like minutes. Sometime it take days in order to reconnect successfully. I am using Flink 1.3.2 and Kubernetes. Is this because of network congestion? Thanks! = Logs from JM == *2017-11-16 19:14:40,216 WARN akka.remote.RemoteWatcher* - Detected unreachable: [akka.tcp://flink@fps-flink-taskmanager-701246518-rb4k6:46416] 2017-11-16 19:14:40,218 INFO org.apache.flink.runtime.jobmanager.JobManager- Task manager akka.tcp://flink@fps-flink-taskmanager-701246518-rb4k6:46416/user/taskmanager terminated. 2017-11-16 19:14:40,219 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph- Source: KafkaSource(maxwell.tickets) -> MaxwellFilter->Maxwell(maxwell.tickets) -> FixedDelayWatermark(maxwell.tickets) -> MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1) (484ebabbd13dce5e8503d88005bcdb6c) switched from RUNNING to FAILED. java.lang.Exception: TaskManager was lost/killed: 50cae001c1d97e55889a6051319f4746 @ fps-flink-taskmanager-701246518-rb4k6 (dataPort=43871) at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:217) at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:533) at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:192) at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:167) at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:212) at org.apache.flink.runtime.jobmanager.JobManager.org$apache$flink$runtime$jobmanager$JobManager$$handleTaskManagerTerminated(JobManager.scala:1228) at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:1131) at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36) at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:49) at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36) at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33) at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28) at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28) at akka.actor.Actor$class.aroundReceive(Actor.scala:467) at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:125) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:44) at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369) at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501) at akka.actor.ActorCell.invoke(ActorCell.scala:486) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238) at akka.dispatch.Mailbox.run(Mailbox.scala:220) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)*2017-11-16 19:14:40,219 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph- Job KafkaDemo (env:production) (3de8f35a1af689237e3c5c94023aba3f) switched from state RUNNING to FAILING. java.lang.Exception: TaskManager was lost/killed: 50cae001c1d97e55889a6051319f4746 @ fps-flink-taskmanager-701246518-rb4k6 (dataPort=43871)* at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:217) at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:533) at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:192) at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:167) at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:212) at org.apache.flink.runtime.jobmanager.JobManager.org$apache$flink$runtime$jobmanager$JobManager$$handleTaskManagerTerminated(JobManager.scala:1228) at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessag
Lost connection to task manager
I run the wordcount example , input data size is 10.9G command: ./bin/flink run -m yarn-cluster -yn 45 -yjm 2048 -ytm 2048 ./examples/batch/WordCount.jar --input /path --output /path1 and finally it throws exceptions as follows Can anyone give me some help?Thanks Caused by: java.lang.Exception: The data preparation for task 'Reduce (SUM(1), at main(WordCount.java:78)' , caused an error: Error obtaining the sorted input: Thread 'SortMerger Reading Thread' terminated due to an exception: Lost connection to task manager 'xxx.com/ip:port'. This indicates that the remote task manager was lost. at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:465) at org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:355) at org.apache.flink.runtime.taskmanager.Task.run(Task.java:655) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.RuntimeException: Error obtaining the sorted input: Thread 'SortMerger Reading Thread' terminated due to an exception: Lost connection to task manager ''xxx.com/ip:port'. This indicates that the remote task manager was lost. at org.apache.flink.runtime.operators.sort.UnilateralSortMerger.getIterator(UnilateralSortMerger.java:619) at org.apache.flink.runtime.operators.BatchTask.getInput(BatchTask.java:1094) at org.apache.flink.runtime.operators.GroupReduceDriver.prepare(GroupReduceDriver.java:99) at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:460) ... 3 more Caused by: java.io.IOException: Thread 'SortMerger Reading Thread' terminated due to an exception: Lost connection to task manager ''xxx.com/ip:port'. This indicates that the remote task manager was lost. at org.apache.flink.runtime.operators.sort.UnilateralSortMerger$ThreadBase.run(UnilateralSortMerger.java:799) Caused by: org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Lost connection to task manager ''xxx.com/ip:port'. This indicates that the remote task manager was lost. at org.apache.flink.runtime.io.network.netty.PartitionRequestClientHandler.exceptionCaught(PartitionRequestClientHandler.java:147) at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:275) at io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:253) at io.netty.channel.ChannelInboundHandlerAdapter.exceptionCaught(ChannelInboundHandlerAdapter.java:131) at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:275) at io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:253) at io.netty.channel.ChannelInboundHandlerAdapter.exceptionCaught(ChannelInboundHandlerAdapter.java:131) at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:275) at io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:253) at io.netty.channel.ChannelHandlerAdapter.exceptionCaught(ChannelHandlerAdapter.java:79) at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:275) at io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:253) at io.netty.channel.DefaultChannelPipeline.fireExceptionCaught(DefaultChannelPipeline.java:835) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.handleReadException(AbstractNioByteChannel.java:87) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:162) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) at java.lang.Thread.run(Thread.java:745) Caused by: java.io.IOException: Connection reset by peer at sun.nio.ch.FileDispatcherImpl.read0(Native Method) at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) at sun.nio.ch.IOUtil.read(IOUtil.java:192) at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380) at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:311) at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881) at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:241
Re: lost connection
That is an exempt from the client log, can you check the JobManager log? It could have crashed, and if so the cause is hopefully in there. Did this issue suddenly occur; as in have you run a job successfully on the system before? (to exclude network configuration issues) Regards, Chesnay On 21.04.2016 16:09, Radu Tudoran wrote: - Could not submit job Operator2 execution (170aef70d31f3fee62f8a483930be213), because there is no connection to a JobManager. 15:59:48,456 WARN Remoting - Tried to associate with unreachable remote address [akka.tcp://flink@10.204.62.71:6123]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: /10.204.62.71:6123 16:01:28,409 ERROR org.apache.flink.client.CliFrontend - Error while running the command. org.apache.flink.client.program.ProgramInvocationException: The program execution failed: Communication with JobManager failed: Lost connection to the JobManager. I do not understand what could be the root cause of this… the IPs look ok and there is not firewall to block things… Dr. Radu Tudoran Research Engineer - Big Data Expert IT R Division cid:image007.jpg@01CD52EB.AD060EE0 HUAWEI TECHNOLOGIES Duesseldorf GmbH European Research Center Riesstrasse 25, 80992 München E-mail: _radu.tudoran@huawei.com_ Mobile: +49 15209084330 Telephone: +49 891588344173 HUAWEI TECHNOLOGIES Duesseldorf GmbH Hansaallee 205, 40549 Düsseldorf, Germany, www.huawei.com <http://www.huawei.com/> Registered Office: Düsseldorf, Register Court Düsseldorf, HRB 56063, Managing Director: Bo PENG, Wanzhou MENG, Lifang CHEN Sitz der Gesellschaft: Düsseldorf, Amtsgericht Düsseldorf, HRB 56063, Geschäftsführer: Bo PENG, Wanzhou MENG, Lifang CHEN This e-mail and its attachments contain confidential information from HUAWEI, which is intended only for the person or entity whose address is listed above. Any use of the information contained herein in any way (including, but not limited to, total or partial disclosure, reproduction, or dissemination) by persons other than the intended recipient(s) is prohibited. If you receive this e-mail in error, please notify the sender by phone or email immediately and delete it! *From:*Chesnay Schepler [mailto:ches...@apache.org] *Sent:* Thursday, April 21, 2016 3:58 PM *To:* user@flink.apache.org *Subject:* Re: lost connection Hello, the first step is always to check the logs under /log. The JobManager log in particular may contain clues as why no connection could be established. Regards, Chesnay On 21.04.2016 15:44, Radu Tudoran wrote: Hi, I am trying to submit a jar via the console (flink run my.jar). The result is that I get an error saying that the communication with the jobmanager failed: Lost connection to the jobmanager. Can you give me some hints/ recommendations about approaching this issue. Thanks Dr. Radu Tudoran Research Engineer - Big Data Expert IT R Division cid:image007.jpg@01CD52EB.AD060EE0 HUAWEI TECHNOLOGIES Duesseldorf GmbH European Research Center Riesstrasse 25, 80992 München E-mail: _radu.tudo...@huawei.com <mailto:radu.tudo...@huawei.com>_ Mobile: +49 15209084330 Telephone: +49 891588344173 HUAWEI TECHNOLOGIES Duesseldorf GmbH Hansaallee 205, 40549 Düsseldorf, Germany, www.huawei.com <http://www.huawei.com/> Registered Office: Düsseldorf, Register Court Düsseldorf, HRB 56063, Managing Director: Bo PENG, Wanzhou MENG, Lifang CHEN Sitz der Gesellschaft: Düsseldorf, Amtsgericht Düsseldorf, HRB 56063, Geschäftsführer: Bo PENG, Wanzhou MENG, Lifang CHEN This e-mail and its attachments contain confidential information from HUAWEI, which is intended only for the person or entity whose address is listed above. Any use of the information contained herein in any way (including, but not limited to, total or partial disclosure, reproduction, or dissemination) by persons other than the intended recipient(s) is prohibited. If you receive this e-mail in error, please notify the sender by phone or email immediately and delete it!
RE: lost connection
- Could not submit job Operator2 execution (170aef70d31f3fee62f8a483930be213), because there is no connection to a JobManager. 15:59:48,456 WARN Remoting - Tried to associate with unreachable remote address [akka.tcp://flink@10.204.62.71:6123]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: /10.204.62.71:6123 16:01:28,409 ERROR org.apache.flink.client.CliFrontend - Error while running the command. org.apache.flink.client.program.ProgramInvocationException: The program execution failed: Communication with JobManager failed: Lost connection to the JobManager. I do not understand what could be the root cause of this... the IPs look ok and there is not firewall to block things... Dr. Radu Tudoran Research Engineer - Big Data Expert IT R Division [cid:image007.jpg@01CD52EB.AD060EE0] HUAWEI TECHNOLOGIES Duesseldorf GmbH European Research Center Riesstrasse 25, 80992 München E-mail: radu.tudo...@huawei.com Mobile: +49 15209084330 Telephone: +49 891588344173 HUAWEI TECHNOLOGIES Duesseldorf GmbH Hansaallee 205, 40549 Düsseldorf, Germany, www.huawei.com<http://www.huawei.com/> Registered Office: Düsseldorf, Register Court Düsseldorf, HRB 56063, Managing Director: Bo PENG, Wanzhou MENG, Lifang CHEN Sitz der Gesellschaft: Düsseldorf, Amtsgericht Düsseldorf, HRB 56063, Geschäftsführer: Bo PENG, Wanzhou MENG, Lifang CHEN This e-mail and its attachments contain confidential information from HUAWEI, which is intended only for the person or entity whose address is listed above. Any use of the information contained herein in any way (including, but not limited to, total or partial disclosure, reproduction, or dissemination) by persons other than the intended recipient(s) is prohibited. If you receive this e-mail in error, please notify the sender by phone or email immediately and delete it! From: Chesnay Schepler [mailto:ches...@apache.org] Sent: Thursday, April 21, 2016 3:58 PM To: user@flink.apache.org Subject: Re: lost connection Hello, the first step is always to check the logs under /log. The JobManager log in particular may contain clues as why no connection could be established. Regards, Chesnay On 21.04.2016 15:44, Radu Tudoran wrote: Hi, I am trying to submit a jar via the console (flink run my.jar). The result is that I get an error saying that the communication with the jobmanager failed: Lost connection to the jobmanager. Can you give me some hints/ recommendations about approaching this issue. Thanks Dr. Radu Tudoran Research Engineer - Big Data Expert IT R Division [cid:image007.jpg@01CD52EB.AD060EE0] HUAWEI TECHNOLOGIES Duesseldorf GmbH European Research Center Riesstrasse 25, 80992 München E-mail: radu.tudo...@huawei.com<mailto:radu.tudo...@huawei.com> Mobile: +49 15209084330 Telephone: +49 891588344173 HUAWEI TECHNOLOGIES Duesseldorf GmbH Hansaallee 205, 40549 Düsseldorf, Germany, www.huawei.com<http://www.huawei.com/> Registered Office: Düsseldorf, Register Court Düsseldorf, HRB 56063, Managing Director: Bo PENG, Wanzhou MENG, Lifang CHEN Sitz der Gesellschaft: Düsseldorf, Amtsgericht Düsseldorf, HRB 56063, Geschäftsführer: Bo PENG, Wanzhou MENG, Lifang CHEN This e-mail and its attachments contain confidential information from HUAWEI, which is intended only for the person or entity whose address is listed above. Any use of the information contained herein in any way (including, but not limited to, total or partial disclosure, reproduction, or dissemination) by persons other than the intended recipient(s) is prohibited. If you receive this e-mail in error, please notify the sender by phone or email immediately and delete it!
Re: lost connection
Hello, the first step is always to check the logs under /log. The JobManager log in particular may contain clues as why no connection could be established. Regards, Chesnay On 21.04.2016 15:44, Radu Tudoran wrote: Hi, I am trying to submit a jar via the console (flink run my.jar). The result is that I get an error saying that the communication with the jobmanager failed: Lost connection to the jobmanager. Can you give me some hints/ recommendations about approaching this issue. Thanks Dr. Radu Tudoran Research Engineer - Big Data Expert IT R Division cid:image007.jpg@01CD52EB.AD060EE0 HUAWEI TECHNOLOGIES Duesseldorf GmbH European Research Center Riesstrasse 25, 80992 München E-mail: _radu.tudoran@huawei.com_ Mobile: +49 15209084330 Telephone: +49 891588344173 HUAWEI TECHNOLOGIES Duesseldorf GmbH Hansaallee 205, 40549 Düsseldorf, Germany, www.huawei.com <http://www.huawei.com/> Registered Office: Düsseldorf, Register Court Düsseldorf, HRB 56063, Managing Director: Bo PENG, Wanzhou MENG, Lifang CHEN Sitz der Gesellschaft: Düsseldorf, Amtsgericht Düsseldorf, HRB 56063, Geschäftsführer: Bo PENG, Wanzhou MENG, Lifang CHEN This e-mail and its attachments contain confidential information from HUAWEI, which is intended only for the person or entity whose address is listed above. Any use of the information contained herein in any way (including, but not limited to, total or partial disclosure, reproduction, or dissemination) by persons other than the intended recipient(s) is prohibited. If you receive this e-mail in error, please notify the sender by phone or email immediately and delete it!
Re: JobTimeoutException: Lost connection to JobManager
The exception indicates that you're still using the old version. It takes some time for the new Maven artifact to get deployed to the snapshot repository. Apparently, a artifact has already been deployed this morning. Did you delete the jar files in your .m2 folder? On Wed, Apr 15, 2015 at 1:38 PM, Mohamed Nadjib MAMI m...@iai.uni-bonn.de wrote: Hello, I'm still facing the problem with 0.9-SNAPSHOT version. Tried to remove the libraries and download them again but same issue. Greetings, Mohamed Exception in thread main org.apache.flink.runtime.client.JobTimeoutException: Lost connection to JobManager at org.apache.flink.runtime.client.JobClient.submitJobAndWait(JobClient.java:164) at org.apache.flink.runtime.minicluster.FlinkMiniCluster.submitJobAndWait(FlinkMiniCluster.scala:198) at org.apache.flink.runtime.minicluster.FlinkMiniCluster.submitJobAndWait(FlinkMiniCluster.scala:188) at org.apache.flink.client.LocalExecutor.executePlan(LocalExecutor.java:179) at org.apache.flink.api.java.LocalEnvironment.execute(LocalEnvironment.java:54) at Main.main(Main.java:142) Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 milliseconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at scala.concurrent.Await.result(package.scala) at org.apache.flink.runtime.client.JobClient.submitJobAndWait(JobClient.java:143) ... 5 more On 15.04.2015 01:02, Stephan Ewen wrote: I pushed a fix to the master. The problem should now be gone. Please let us know if you experience other issues! Greetings, Stephan On Tue, Apr 14, 2015 at 9:57 PM, Mohamed Nadjib MAMI m...@iai.uni-bonn.de wrote: Hello, Once I got the message, few seconds, I received your email. Well, this just to cast a need for a fix. Happy to feel the dynamism of the work. Great work. On 14.04.2015 21:50, Stephan Ewen wrote: You are on the latest snapshot version? I think there is an inconsistency in there. Will try to fix that toning. Can you actually use the milestone1 version? That one should be good. Greetings, Stephan Am 14.04.2015 20:31 schrieb Fotis P fotis...@gmail.com: Hello everyone, I am getting this weird exception while running some simple counting jobs in Flink. Exception in thread main org.apache.flink.runtime.client.JobTimeoutException: Lost connection to JobManager at org.apache.flink.runtime.client.JobClient.submitJobAndWait(JobClient.java:164) at org.apache.flink.runtime.minicluster.FlinkMiniCluster.submitJobAndWait(FlinkMiniCluster.scala:198) at org.apache.flink.runtime.minicluster.FlinkMiniCluster.submitJobAndWait(FlinkMiniCluster.scala:188) at org.apache.flink.client.LocalExecutor.executePlan(LocalExecutor.java:179) at org.apache.flink.api.java.LocalEnvironment.execute(LocalEnvironment.java:54) at trackers.preprocessing.ExtractInfoFromLogs.main(ExtractInfoFromLogs.java:133) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134) Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 milliseconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at scala.concurrent.Await.result(package.scala) at org.apache.flink.runtime.client.JobClient.submitJobAndWait(JobClient.java:143) ... 10 more The only call above which comes from my code is ExtractInfoFromLogs.java:133 which is the environment.execute() method. This exception comes when dealing with largish files (10GB). No exception is thrown when I am working with a smaller subset of my data. Also I would swear that it was working fine until a few days ago, and the code has not been changed :S Only change was a re-import of maven dependencies. I am unsure what other information I could provide that would help you help me :) I am running everything locally through the intelij IDE. Maven dependency is set to 0.9-SNAPSHOT. I have an 8-core Ubuntu 14.04 machine. Thanks in advance :D
Re: JobTimeoutException: Lost connection to JobManager
On 15 Apr 2015, at 14:18, Maximilian Michels m...@apache.org wrote: The exception indicates that you're still using the old version. It takes some time for the new Maven artifact to get deployed to the snapshot repository. Apparently, a artifact has already been deployed this morning. Did you delete the jar files in your .m2 folder? I think that's what he meant. The problem is that the snapshot repositories take some time to synchronize. Please 1. git clone https://github.com/apache/flink.git 2. cd flink 3. mvn clean install -DskipTests This way you build Flink yourself and are guaranteed to work on a version with the fix. Sorry for the inconvenience. Does this solve it? – Ufuk
Re: JobTimeoutException: Lost connection to JobManager
Hello, Once I got the message, few seconds, I received your email. Well, this just to cast a need for a fix. Happy to feel the dynamism of the work. Great work. On 14.04.2015 21:50, Stephan Ewen wrote: You are on the latest snapshot version? I think there is an inconsistency in there. Will try to fix that toning. Can you actually use the milestone1 version? That one should be good. Greetings, Stephan Am 14.04.2015 20:31 schrieb Fotis P fotis...@gmail.com mailto:fotis...@gmail.com: Hello everyone, I am getting this weird exception while running some simple counting jobs in Flink. Exception in thread main org.apache.flink.runtime.client.JobTimeoutException: Lost connection to JobManager at org.apache.flink.runtime.client.JobClient.submitJobAndWait(JobClient.java:164) at org.apache.flink.runtime.minicluster.FlinkMiniCluster.submitJobAndWait(FlinkMiniCluster.scala:198) at org.apache.flink.runtime.minicluster.FlinkMiniCluster.submitJobAndWait(FlinkMiniCluster.scala:188) at org.apache.flink.client.LocalExecutor.executePlan(LocalExecutor.java:179) at org.apache.flink.api.java.LocalEnvironment.execute(LocalEnvironment.java:54) at trackers.preprocessing.ExtractInfoFromLogs.main(ExtractInfoFromLogs.java:133) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134) Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 milliseconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at scala.concurrent.Await.result(package.scala) at org.apache.flink.runtime.client.JobClient.submitJobAndWait(JobClient.java:143) ... 10 more The only call above which comes from my code is ExtractInfoFromLogs.java:133 which is the environment.execute() method. This exception comes when dealing with largish files (10GB). No exception is thrown when I am working with a smaller subset of my data. Also I would swear that it was working fine until a few days ago, and the code has not been changed :S Only change was a re-import of maven dependencies. I am unsure what other information I could provide that would help you help me :) I am running everything locally through the intelij IDE. Maven dependency is set to 0.9-SNAPSHOT. I have an 8-core Ubuntu 14.04 machine. Thanks in advance :D -- Regards, Grüße, Cordialement, Recuerdos, Saluti, προσρήσεις, 问候, تحياتي. Mohamed Nadjib Mami PhD Student - EIS Department - Bonn University, Germany. About me! http://www.strikingly.com/mohamed-nadjib-mami LinkedIn
Re: JobTimeoutException: Lost connection to JobManager
You are on the latest snapshot version? I think there is an inconsistency in there. Will try to fix that toning. Can you actually use the milestone1 version? That one should be good. Greetings, Stephan Am 14.04.2015 20:31 schrieb Fotis P fotis...@gmail.com: Hello everyone, I am getting this weird exception while running some simple counting jobs in Flink. Exception in thread main org.apache.flink.runtime.client.JobTimeoutException: Lost connection to JobManager at org.apache.flink.runtime.client.JobClient.submitJobAndWait(JobClient.java:164) at org.apache.flink.runtime.minicluster.FlinkMiniCluster.submitJobAndWait(FlinkMiniCluster.scala:198) at org.apache.flink.runtime.minicluster.FlinkMiniCluster.submitJobAndWait(FlinkMiniCluster.scala:188) at org.apache.flink.client.LocalExecutor.executePlan(LocalExecutor.java:179) at org.apache.flink.api.java.LocalEnvironment.execute(LocalEnvironment.java:54) at trackers.preprocessing.ExtractInfoFromLogs.main(ExtractInfoFromLogs.java:133) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134) Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 milliseconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at scala.concurrent.Await.result(package.scala) at org.apache.flink.runtime.client.JobClient.submitJobAndWait(JobClient.java:143) ... 10 more The only call above which comes from my code is ExtractInfoFromLogs.java:133 which is the environment.execute() method. This exception comes when dealing with largish files (10GB). No exception is thrown when I am working with a smaller subset of my data. Also I would swear that it was working fine until a few days ago, and the code has not been changed :S Only change was a re-import of maven dependencies. I am unsure what other information I could provide that would help you help me :) I am running everything locally through the intelij IDE. Maven dependency is set to 0.9-SNAPSHOT. I have an 8-core Ubuntu 14.04 machine. Thanks in advance :D
Re: JobTimeoutException: Lost connection to JobManager
I pushed a fix to the master. The problem should now be gone. Please let us know if you experience other issues! Greetings, Stephan On Tue, Apr 14, 2015 at 9:57 PM, Mohamed Nadjib MAMI m...@iai.uni-bonn.de wrote: Hello, Once I got the message, few seconds, I received your email. Well, this just to cast a need for a fix. Happy to feel the dynamism of the work. Great work. On 14.04.2015 21:50, Stephan Ewen wrote: You are on the latest snapshot version? I think there is an inconsistency in there. Will try to fix that toning. Can you actually use the milestone1 version? That one should be good. Greetings, Stephan Am 14.04.2015 20:31 schrieb Fotis P fotis...@gmail.com: Hello everyone, I am getting this weird exception while running some simple counting jobs in Flink. Exception in thread main org.apache.flink.runtime.client.JobTimeoutException: Lost connection to JobManager at org.apache.flink.runtime.client.JobClient.submitJobAndWait(JobClient.java:164) at org.apache.flink.runtime.minicluster.FlinkMiniCluster.submitJobAndWait(FlinkMiniCluster.scala:198) at org.apache.flink.runtime.minicluster.FlinkMiniCluster.submitJobAndWait(FlinkMiniCluster.scala:188) at org.apache.flink.client.LocalExecutor.executePlan(LocalExecutor.java:179) at org.apache.flink.api.java.LocalEnvironment.execute(LocalEnvironment.java:54) at trackers.preprocessing.ExtractInfoFromLogs.main(ExtractInfoFromLogs.java:133) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134) Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 milliseconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at scala.concurrent.Await.result(package.scala) at org.apache.flink.runtime.client.JobClient.submitJobAndWait(JobClient.java:143) ... 10 more The only call above which comes from my code is ExtractInfoFromLogs.java:133 which is the environment.execute() method. This exception comes when dealing with largish files (10GB). No exception is thrown when I am working with a smaller subset of my data. Also I would swear that it was working fine until a few days ago, and the code has not been changed :S Only change was a re-import of maven dependencies. I am unsure what other information I could provide that would help you help me :) I am running everything locally through the intelij IDE. Maven dependency is set to 0.9-SNAPSHOT. I have an 8-core Ubuntu 14.04 machine. Thanks in advance :D -- Regards, Grüße, Cordialement, Recuerdos, Saluti, προσρήσεις, 问候, تحياتي. Mohamed Nadjib Mami PhD Student - EIS Department - Bonn University, Germany. About me! http://www.strikingly.com/mohamed-nadjib-mami LinkedIn