Task manager suddenly lost connection to JM

2017-11-16 Thread Hao Sun
Hi team, I see an wired issue that one of my TM suddenly lost connection to
JM.
Once the job running on the TM relocated to a new TM, it can reconnect to
JM again.
And after a while, the new TM running the same job will repeat the same
process.
It is not guaranteed the troubled TMs can reconnect to JM in a reasonable
time frame, like minutes. Sometime it take days in order to reconnect
successfully.

I am using Flink 1.3.2 and Kubernetes. Is this because of network
congestion?

Thanks!

= Logs from JM ==

*2017-11-16 19:14:40,216 WARN  akka.remote.RemoteWatcher*
   - Detected unreachable:
[akka.tcp://flink@fps-flink-taskmanager-701246518-rb4k6:46416]
2017-11-16 19:14:40,218 INFO
org.apache.flink.runtime.jobmanager.JobManager- Task
manager 
akka.tcp://flink@fps-flink-taskmanager-701246518-rb4k6:46416/user/taskmanager
terminated.
2017-11-16 19:14:40,219 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph-
Source: KafkaSource(maxwell.tickets) ->
MaxwellFilter->Maxwell(maxwell.tickets) ->
FixedDelayWatermark(maxwell.tickets) ->
MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1)
(484ebabbd13dce5e8503d88005bcdb6c) switched from RUNNING to FAILED.
java.lang.Exception: TaskManager was lost/killed:
50cae001c1d97e55889a6051319f4746 @
fps-flink-taskmanager-701246518-rb4k6 (dataPort=43871)
at 
org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:217)
at 
org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:533)
at 
org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:192)
at 
org.apache.flink.runtime.instance.Instance.markDead(Instance.java:167)
at 
org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:212)
at 
org.apache.flink.runtime.jobmanager.JobManager.org$apache$flink$runtime$jobmanager$JobManager$$handleTaskManagerTerminated(JobManager.scala:1228)
at 
org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:1131)
at 
scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at 
org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:49)
at 
scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at 
org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
at 
org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
at 
org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
at 
org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:125)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at 
akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:44)
at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369)
at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501)
at akka.actor.ActorCell.invoke(ActorCell.scala:486)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
at akka.dispatch.Mailbox.run(Mailbox.scala:220)
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)*2017-11-16
19:14:40,219 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph- Job
KafkaDemo (env:production) (3de8f35a1af689237e3c5c94023aba3f) switched
from state RUNNING to FAILING.
java.lang.Exception: TaskManager was lost/killed:
50cae001c1d97e55889a6051319f4746 @
fps-flink-taskmanager-701246518-rb4k6 (dataPort=43871)*
at 
org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:217)
at 
org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:533)
at 
org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:192)
at 
org.apache.flink.runtime.instance.Instance.markDead(Instance.java:167)
at 
org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:212)
at 
org.apache.flink.runtime.jobmanager.JobManager.org$apache$flink$runtime$jobmanager$JobManager$$handleTaskManagerTerminated(JobManager.scala:1228)
at 
org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessag

Lost connection to task manager

2017-04-26 Thread 猎豹移动 李木柯
I run the wordcount example , input data size is 10.9G 
command: ./bin/flink run -m yarn-cluster -yn 45 -yjm 2048 -ytm 2048 
./examples/batch/WordCount.jar --input /path --output /path1
and finally it throws exceptions as follows
Can anyone give me some help?Thanks


Caused by: java.lang.Exception: The data preparation for task 'Reduce (SUM(1), 
at main(WordCount.java:78)' , caused an error: Error obtaining the sorted 
input: Thread 'SortMerger Reading Thread' terminated due to an exception: Lost 
connection to task manager 'xxx.com/ip:port'. This indicates that the remote 
task manager was lost.
at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:465)
at 
org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:355)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:655)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: Error obtaining the sorted input: Thread 
'SortMerger Reading Thread' terminated due to an exception: Lost connection to 
task manager ''xxx.com/ip:port'. This indicates that the remote task manager 
was lost.
at 
org.apache.flink.runtime.operators.sort.UnilateralSortMerger.getIterator(UnilateralSortMerger.java:619)
at 
org.apache.flink.runtime.operators.BatchTask.getInput(BatchTask.java:1094)
at 
org.apache.flink.runtime.operators.GroupReduceDriver.prepare(GroupReduceDriver.java:99)
at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:460)
... 3 more
Caused by: java.io.IOException: Thread 'SortMerger Reading Thread' terminated 
due to an exception: Lost connection to task manager ''xxx.com/ip:port'. This 
indicates that the remote task manager was lost.
at 
org.apache.flink.runtime.operators.sort.UnilateralSortMerger$ThreadBase.run(UnilateralSortMerger.java:799)
Caused by: 
org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: 
Lost connection to task manager ''xxx.com/ip:port'. This indicates that the 
remote task manager was lost.
at 
org.apache.flink.runtime.io.network.netty.PartitionRequestClientHandler.exceptionCaught(PartitionRequestClientHandler.java:147)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:275)
at 
io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:253)
at 
io.netty.channel.ChannelInboundHandlerAdapter.exceptionCaught(ChannelInboundHandlerAdapter.java:131)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:275)
at 
io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:253)
at 
io.netty.channel.ChannelInboundHandlerAdapter.exceptionCaught(ChannelInboundHandlerAdapter.java:131)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:275)
at 
io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:253)
at 
io.netty.channel.ChannelHandlerAdapter.exceptionCaught(ChannelHandlerAdapter.java:79)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:275)
at 
io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:253)
at 
io.netty.channel.DefaultChannelPipeline.fireExceptionCaught(DefaultChannelPipeline.java:835)
at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.handleReadException(AbstractNioByteChannel.java:87)
at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:162)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
at 
io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:311)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
at 
io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:241

Re: lost connection

2016-04-21 Thread Chesnay Schepler
That is an exempt from the client log, can you check the JobManager log? 
It could have crashed, and if so the cause is hopefully in there.


Did this issue suddenly occur; as in have you run a job successfully on 
the system before? (to exclude network configuration issues)


Regards,
Chesnay

On 21.04.2016 16:09, Radu Tudoran wrote:


- Could not submit job Operator2 execution 
(170aef70d31f3fee62f8a483930be213), because there is no connection to 
a JobManager.


15:59:48,456 WARN Remoting - Tried to associate with unreachable 
remote address [akka.tcp://flink@10.204.62.71:6123]. Address is now 
gated for 5000 ms, all messages to this address will be delivered to 
dead letters. Reason: Connection refused: /10.204.62.71:6123


16:01:28,409 ERROR org.apache.flink.client.CliFrontend - Error while 
running the command.


org.apache.flink.client.program.ProgramInvocationException: The 
program execution failed: Communication with JobManager failed: Lost 
connection to the JobManager.


I do not understand what could be the root cause of this… the IPs look 
ok and there is not firewall to block things…


Dr. Radu Tudoran

Research Engineer - Big Data Expert

IT R Division

cid:image007.jpg@01CD52EB.AD060EE0

HUAWEI TECHNOLOGIES Duesseldorf GmbH

European Research Center

Riesstrasse 25, 80992 München

E-mail: _radu.tudoran@huawei.com_

Mobile: +49 15209084330

Telephone: +49 891588344173

HUAWEI TECHNOLOGIES Duesseldorf GmbH
Hansaallee 205, 40549 Düsseldorf, Germany, www.huawei.com 
<http://www.huawei.com/>

Registered Office: Düsseldorf, Register Court Düsseldorf, HRB 56063,
Managing Director: Bo PENG, Wanzhou MENG, Lifang CHEN
Sitz der Gesellschaft: Düsseldorf, Amtsgericht Düsseldorf, HRB 56063,
Geschäftsführer: Bo PENG, Wanzhou MENG, Lifang CHEN

This e-mail and its attachments contain confidential information from 
HUAWEI, which is intended only for the person or entity whose address 
is listed above. Any use of the information contained herein in any 
way (including, but not limited to, total or partial disclosure, 
reproduction, or dissemination) by persons other than the intended 
recipient(s) is prohibited. If you receive this e-mail in error, 
please notify the sender by phone or email immediately and delete it!


*From:*Chesnay Schepler [mailto:ches...@apache.org]
*Sent:* Thursday, April 21, 2016 3:58 PM
*To:* user@flink.apache.org
*Subject:* Re: lost connection

Hello,

the first step is always to check the logs under /log. The JobManager 
log in particular may contain clues as why no connection could be 
established.


Regards,
Chesnay

On 21.04.2016 15:44, Radu Tudoran wrote:

Hi,

I am trying to submit a jar via the console (flink run my.jar).
The result is that I get an error saying that the communication
with the jobmanager failed: Lost connection to the jobmanager.

Can you give me some hints/ recommendations about approaching this
issue.

Thanks

Dr. Radu Tudoran

Research Engineer - Big Data Expert

IT R Division

cid:image007.jpg@01CD52EB.AD060EE0

HUAWEI TECHNOLOGIES Duesseldorf GmbH

European Research Center

Riesstrasse 25, 80992 München

E-mail: _radu.tudo...@huawei.com <mailto:radu.tudo...@huawei.com>_

Mobile: +49 15209084330

Telephone: +49 891588344173

HUAWEI TECHNOLOGIES Duesseldorf GmbH
Hansaallee 205, 40549 Düsseldorf, Germany, www.huawei.com
<http://www.huawei.com/>
Registered Office: Düsseldorf, Register Court Düsseldorf, HRB 56063,
Managing Director: Bo PENG, Wanzhou MENG, Lifang CHEN
Sitz der Gesellschaft: Düsseldorf, Amtsgericht Düsseldorf, HRB 56063,
Geschäftsführer: Bo PENG, Wanzhou MENG, Lifang CHEN

This e-mail and its attachments contain confidential information
from HUAWEI, which is intended only for the person or entity whose
address is listed above. Any use of the information contained
herein in any way (including, but not limited to, total or partial
disclosure, reproduction, or dissemination) by persons other than
the intended recipient(s) is prohibited. If you receive this
e-mail in error, please notify the sender by phone or email
immediately and delete it!





RE: lost connection

2016-04-21 Thread Radu Tudoran
- Could not submit job Operator2 execution (170aef70d31f3fee62f8a483930be213), 
because there is no connection to a JobManager.
15:59:48,456 WARN  Remoting 
 - Tried to associate with unreachable remote address 
[akka.tcp://flink@10.204.62.71:6123]. Address is now gated for 5000 ms, all 
messages to this address will be delivered to dead letters. Reason: Connection 
refused: /10.204.62.71:6123
16:01:28,409 ERROR org.apache.flink.client.CliFrontend  
 - Error while running the command.
org.apache.flink.client.program.ProgramInvocationException: The program 
execution failed: Communication with JobManager failed: Lost connection to the 
JobManager.

I do not understand what could be the root cause of this... the IPs look ok and 
there is not firewall to block things...

Dr. Radu Tudoran
Research Engineer - Big Data Expert
IT R Division

[cid:image007.jpg@01CD52EB.AD060EE0]
HUAWEI TECHNOLOGIES Duesseldorf GmbH
European Research Center
Riesstrasse 25, 80992 München

E-mail: radu.tudo...@huawei.com
Mobile: +49 15209084330
Telephone: +49 891588344173

HUAWEI TECHNOLOGIES Duesseldorf GmbH
Hansaallee 205, 40549 Düsseldorf, Germany, 
www.huawei.com<http://www.huawei.com/>
Registered Office: Düsseldorf, Register Court Düsseldorf, HRB 56063,
Managing Director: Bo PENG, Wanzhou MENG, Lifang CHEN
Sitz der Gesellschaft: Düsseldorf, Amtsgericht Düsseldorf, HRB 56063,
Geschäftsführer: Bo PENG, Wanzhou MENG, Lifang CHEN
This e-mail and its attachments contain confidential information from HUAWEI, 
which is intended only for the person or entity whose address is listed above. 
Any use of the information contained herein in any way (including, but not 
limited to, total or partial disclosure, reproduction, or dissemination) by 
persons other than the intended recipient(s) is prohibited. If you receive this 
e-mail in error, please notify the sender by phone or email immediately and 
delete it!

From: Chesnay Schepler [mailto:ches...@apache.org]
Sent: Thursday, April 21, 2016 3:58 PM
To: user@flink.apache.org
Subject: Re: lost connection

Hello,

the first step is always to check the logs under /log. The JobManager log in 
particular may contain clues as why no connection could be established.

Regards,
Chesnay

On 21.04.2016 15:44, Radu Tudoran wrote:
Hi,

I am trying to submit a jar via the console (flink run my.jar). The result is 
that I get an error saying that the communication with the jobmanager failed: 
Lost connection to the jobmanager.
Can you give me some hints/ recommendations about approaching this issue.

Thanks

Dr. Radu Tudoran
Research Engineer - Big Data Expert
IT R Division

[cid:image007.jpg@01CD52EB.AD060EE0]
HUAWEI TECHNOLOGIES Duesseldorf GmbH
European Research Center
Riesstrasse 25, 80992 München

E-mail: radu.tudo...@huawei.com<mailto:radu.tudo...@huawei.com>
Mobile: +49 15209084330
Telephone: +49 891588344173

HUAWEI TECHNOLOGIES Duesseldorf GmbH
Hansaallee 205, 40549 Düsseldorf, Germany, 
www.huawei.com<http://www.huawei.com/>
Registered Office: Düsseldorf, Register Court Düsseldorf, HRB 56063,
Managing Director: Bo PENG, Wanzhou MENG, Lifang CHEN
Sitz der Gesellschaft: Düsseldorf, Amtsgericht Düsseldorf, HRB 56063,
Geschäftsführer: Bo PENG, Wanzhou MENG, Lifang CHEN
This e-mail and its attachments contain confidential information from HUAWEI, 
which is intended only for the person or entity whose address is listed above. 
Any use of the information contained herein in any way (including, but not 
limited to, total or partial disclosure, reproduction, or dissemination) by 
persons other than the intended recipient(s) is prohibited. If you receive this 
e-mail in error, please notify the sender by phone or email immediately and 
delete it!




Re: lost connection

2016-04-21 Thread Chesnay Schepler

Hello,

the first step is always to check the logs under /log. The JobManager 
log in particular may contain clues as why no connection could be 
established.


Regards,
Chesnay

On 21.04.2016 15:44, Radu Tudoran wrote:


Hi,

I am trying to submit a jar via the console (flink run my.jar). The 
result is that I get an error saying that the communication with the 
jobmanager failed: Lost connection to the jobmanager.


Can you give me some hints/ recommendations about approaching this issue.

Thanks

Dr. Radu Tudoran

Research Engineer - Big Data Expert

IT R Division

cid:image007.jpg@01CD52EB.AD060EE0

HUAWEI TECHNOLOGIES Duesseldorf GmbH

European Research Center

Riesstrasse 25, 80992 München

E-mail: _radu.tudoran@huawei.com_

Mobile: +49 15209084330

Telephone: +49 891588344173

HUAWEI TECHNOLOGIES Duesseldorf GmbH
Hansaallee 205, 40549 Düsseldorf, Germany, www.huawei.com 
<http://www.huawei.com/>

Registered Office: Düsseldorf, Register Court Düsseldorf, HRB 56063,
Managing Director: Bo PENG, Wanzhou MENG, Lifang CHEN
Sitz der Gesellschaft: Düsseldorf, Amtsgericht Düsseldorf, HRB 56063,
Geschäftsführer: Bo PENG, Wanzhou MENG, Lifang CHEN

This e-mail and its attachments contain confidential information from 
HUAWEI, which is intended only for the person or entity whose address 
is listed above. Any use of the information contained herein in any 
way (including, but not limited to, total or partial disclosure, 
reproduction, or dissemination) by persons other than the intended 
recipient(s) is prohibited. If you receive this e-mail in error, 
please notify the sender by phone or email immediately and delete it!






Re: JobTimeoutException: Lost connection to JobManager

2015-04-15 Thread Maximilian Michels
The exception indicates that you're still using the old version. It takes
some time for the new Maven artifact to get deployed to the snapshot
repository. Apparently, a artifact has already been deployed this morning.
Did you delete the jar files in your .m2 folder?

On Wed, Apr 15, 2015 at 1:38 PM, Mohamed Nadjib MAMI m...@iai.uni-bonn.de
wrote:

  Hello,

 I'm still facing the problem with 0.9-SNAPSHOT version. Tried to remove
 the libraries and download them again but same issue.

 Greetings,
 Mohamed


 Exception in thread main
 org.apache.flink.runtime.client.JobTimeoutException: Lost connection to
 JobManager
 at
 org.apache.flink.runtime.client.JobClient.submitJobAndWait(JobClient.java:164)
 at
 org.apache.flink.runtime.minicluster.FlinkMiniCluster.submitJobAndWait(FlinkMiniCluster.scala:198)
 at
 org.apache.flink.runtime.minicluster.FlinkMiniCluster.submitJobAndWait(FlinkMiniCluster.scala:188)
 at
 org.apache.flink.client.LocalExecutor.executePlan(LocalExecutor.java:179)
 at
 org.apache.flink.api.java.LocalEnvironment.execute(LocalEnvironment.java:54)
 at Main.main(Main.java:142)
 Caused by: java.util.concurrent.TimeoutException: Futures timed out after
 [10 milliseconds]
 at
 scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
 at
 scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
 at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
 at
 scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
 at scala.concurrent.Await$.result(package.scala:107)
 at scala.concurrent.Await.result(package.scala)
 at
 org.apache.flink.runtime.client.JobClient.submitJobAndWait(JobClient.java:143)
 ... 5 more


 On 15.04.2015 01:02, Stephan Ewen wrote:

 I pushed a fix to the master. The problem should now be gone.

  Please let us know if you experience other issues!

  Greetings,
 Stephan


 On Tue, Apr 14, 2015 at 9:57 PM, Mohamed Nadjib MAMI m...@iai.uni-bonn.de
  wrote:

  Hello,

 Once I got the message, few seconds, I received your email. Well, this
 just to cast a need for a fix.

 Happy to feel the dynamism of the work. Great work.


 On 14.04.2015 21:50, Stephan Ewen wrote:

 You are on the latest snapshot version? I think there is an inconsistency
 in there. Will try to fix that toning.

 Can you actually use the milestone1 version? That one should be good.

 Greetings,
 Stephan
  Am 14.04.2015 20:31 schrieb Fotis P fotis...@gmail.com:

Hello everyone,

  I am getting this weird exception while running some simple counting
 jobs in Flink.

 Exception in thread main
 org.apache.flink.runtime.client.JobTimeoutException: Lost connection to
 JobManager
 at
 org.apache.flink.runtime.client.JobClient.submitJobAndWait(JobClient.java:164)
 at
 org.apache.flink.runtime.minicluster.FlinkMiniCluster.submitJobAndWait(FlinkMiniCluster.scala:198)
 at
 org.apache.flink.runtime.minicluster.FlinkMiniCluster.submitJobAndWait(FlinkMiniCluster.scala:188)
 at
 org.apache.flink.client.LocalExecutor.executePlan(LocalExecutor.java:179)
 at
 org.apache.flink.api.java.LocalEnvironment.execute(LocalEnvironment.java:54)
 at
 trackers.preprocessing.ExtractInfoFromLogs.main(ExtractInfoFromLogs.java:133)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at
 com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
 Caused by: java.util.concurrent.TimeoutException: Futures timed out
 after [10 milliseconds]
 at
 scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
 at
 scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
 at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
 at
 scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
 at scala.concurrent.Await$.result(package.scala:107)
 at scala.concurrent.Await.result(package.scala)
 at
 org.apache.flink.runtime.client.JobClient.submitJobAndWait(JobClient.java:143)
 ... 10 more


  The only call above which comes from my code is
 ExtractInfoFromLogs.java:133 which is the environment.execute() method.

  This exception comes when dealing with largish files (10GB). No
 exception is thrown when I am working with a smaller subset of my data.
  Also I would swear that it was working fine until a few days ago, and
 the code has not been changed :S Only change was a re-import of maven
 dependencies.

  I am unsure what other information I could provide that would help you
 help me :)

  I am running everything locally through the intelij IDE. Maven
 dependency is set to 0.9-SNAPSHOT.
  I have an 8-core Ubuntu 14.04 machine.

  Thanks in advance :D

Re: JobTimeoutException: Lost connection to JobManager

2015-04-15 Thread Ufuk Celebi
On 15 Apr 2015, at 14:18, Maximilian Michels m...@apache.org wrote:

 The exception indicates that you're still using the old version. It takes 
 some time for the new Maven artifact to get deployed to the snapshot 
 repository. Apparently, a artifact has already been deployed this morning. 
 Did you delete the jar files in your .m2 folder?

I think that's what he meant.

The problem is that the snapshot repositories take some time to synchronize.

Please
1. git clone https://github.com/apache/flink.git
2. cd flink
3. mvn clean install -DskipTests

This way you build Flink yourself and are guaranteed to work on a version with 
the fix.

Sorry for the inconvenience. Does this solve it?

– Ufuk

Re: JobTimeoutException: Lost connection to JobManager

2015-04-14 Thread Mohamed Nadjib MAMI

Hello,

Once I got the message, few seconds, I received your email. Well, this 
just to cast a need for a fix.


Happy to feel the dynamism of the work. Great work.

On 14.04.2015 21:50, Stephan Ewen wrote:


You are on the latest snapshot version? I think there is an 
inconsistency in there. Will try to fix that toning.


Can you actually use the milestone1 version? That one should be good.

Greetings,
Stephan

Am 14.04.2015 20:31 schrieb Fotis P fotis...@gmail.com 
mailto:fotis...@gmail.com:


Hello everyone,

I am getting this weird exception while running some simple
counting jobs in Flink.

Exception in thread main
org.apache.flink.runtime.client.JobTimeoutException: Lost
connection to JobManager
at

org.apache.flink.runtime.client.JobClient.submitJobAndWait(JobClient.java:164)
at

org.apache.flink.runtime.minicluster.FlinkMiniCluster.submitJobAndWait(FlinkMiniCluster.scala:198)
at

org.apache.flink.runtime.minicluster.FlinkMiniCluster.submitJobAndWait(FlinkMiniCluster.scala:188)
at
org.apache.flink.client.LocalExecutor.executePlan(LocalExecutor.java:179)
at
org.apache.flink.api.java.LocalEnvironment.execute(LocalEnvironment.java:54)
at

trackers.preprocessing.ExtractInfoFromLogs.main(ExtractInfoFromLogs.java:133)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at

sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at

sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at
com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
Caused by: java.util.concurrent.TimeoutException: Futures timed
out after [10 milliseconds]
at
scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at
scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at

scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at scala.concurrent.Await.result(package.scala)
at

org.apache.flink.runtime.client.JobClient.submitJobAndWait(JobClient.java:143)
... 10 more


The only call above which comes from my code is
ExtractInfoFromLogs.java:133 which is the environment.execute()
method.

This exception comes when dealing with largish files (10GB). No
exception is thrown when I am working with a smaller subset of my
data.
Also I would swear that it was working fine until a few days ago,
and the code has not been changed :S Only change was a re-import
of maven dependencies.

I am unsure what other information I could provide that would help
you help me :)

I am running everything locally through the intelij IDE. Maven
dependency is set to 0.9-SNAPSHOT.
I have an 8-core Ubuntu 14.04 machine.

Thanks in advance :D



--
Regards, Grüße, Cordialement, Recuerdos, Saluti, προσρήσεις, 问候, 
تحياتي. Mohamed Nadjib Mami

PhD Student - EIS Department - Bonn University, Germany.
About me! http://www.strikingly.com/mohamed-nadjib-mami
LinkedIn


Re: JobTimeoutException: Lost connection to JobManager

2015-04-14 Thread Stephan Ewen
You are on the latest snapshot version? I think there is an inconsistency
in there. Will try to fix that toning.

Can you actually use the milestone1 version? That one should be good.

Greetings,
Stephan
 Am 14.04.2015 20:31 schrieb Fotis P fotis...@gmail.com:

 Hello everyone,

 I am getting this weird exception while running some simple counting jobs
 in Flink.

 Exception in thread main
 org.apache.flink.runtime.client.JobTimeoutException: Lost connection to
 JobManager
 at
 org.apache.flink.runtime.client.JobClient.submitJobAndWait(JobClient.java:164)
 at
 org.apache.flink.runtime.minicluster.FlinkMiniCluster.submitJobAndWait(FlinkMiniCluster.scala:198)
 at
 org.apache.flink.runtime.minicluster.FlinkMiniCluster.submitJobAndWait(FlinkMiniCluster.scala:188)
 at
 org.apache.flink.client.LocalExecutor.executePlan(LocalExecutor.java:179)
 at
 org.apache.flink.api.java.LocalEnvironment.execute(LocalEnvironment.java:54)
 at
 trackers.preprocessing.ExtractInfoFromLogs.main(ExtractInfoFromLogs.java:133)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
 Caused by: java.util.concurrent.TimeoutException: Futures timed out after
 [10 milliseconds]
 at
 scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
 at
 scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
 at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
 at
 scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
 at scala.concurrent.Await$.result(package.scala:107)
 at scala.concurrent.Await.result(package.scala)
 at
 org.apache.flink.runtime.client.JobClient.submitJobAndWait(JobClient.java:143)
 ... 10 more


 The only call above which comes from my code is
 ExtractInfoFromLogs.java:133 which is the environment.execute() method.

 This exception comes when dealing with largish files (10GB). No exception
 is thrown when I am working with a smaller subset of my data.
 Also I would swear that it was working fine until a few days ago, and the
 code has not been changed :S Only change was a re-import of maven
 dependencies.

 I am unsure what other information I could provide that would help you
 help me :)

 I am running everything locally through the intelij IDE. Maven dependency
 is set to 0.9-SNAPSHOT.
 I have an 8-core Ubuntu 14.04 machine.

 Thanks in advance :D



Re: JobTimeoutException: Lost connection to JobManager

2015-04-14 Thread Stephan Ewen
I pushed a fix to the master. The problem should now be gone.

Please let us know if you experience other issues!

Greetings,
Stephan


On Tue, Apr 14, 2015 at 9:57 PM, Mohamed Nadjib MAMI m...@iai.uni-bonn.de
wrote:

  Hello,

 Once I got the message, few seconds, I received your email. Well, this
 just to cast a need for a fix.

 Happy to feel the dynamism of the work. Great work.


 On 14.04.2015 21:50, Stephan Ewen wrote:

 You are on the latest snapshot version? I think there is an inconsistency
 in there. Will try to fix that toning.

 Can you actually use the milestone1 version? That one should be good.

 Greetings,
 Stephan
  Am 14.04.2015 20:31 schrieb Fotis P fotis...@gmail.com:

Hello everyone,

  I am getting this weird exception while running some simple counting
 jobs in Flink.

 Exception in thread main
 org.apache.flink.runtime.client.JobTimeoutException: Lost connection to
 JobManager
 at
 org.apache.flink.runtime.client.JobClient.submitJobAndWait(JobClient.java:164)
 at
 org.apache.flink.runtime.minicluster.FlinkMiniCluster.submitJobAndWait(FlinkMiniCluster.scala:198)
 at
 org.apache.flink.runtime.minicluster.FlinkMiniCluster.submitJobAndWait(FlinkMiniCluster.scala:188)
 at
 org.apache.flink.client.LocalExecutor.executePlan(LocalExecutor.java:179)
 at
 org.apache.flink.api.java.LocalEnvironment.execute(LocalEnvironment.java:54)
 at
 trackers.preprocessing.ExtractInfoFromLogs.main(ExtractInfoFromLogs.java:133)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at
 com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
 Caused by: java.util.concurrent.TimeoutException: Futures timed out after
 [10 milliseconds]
 at
 scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
 at
 scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
 at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
 at
 scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
 at scala.concurrent.Await$.result(package.scala:107)
 at scala.concurrent.Await.result(package.scala)
 at
 org.apache.flink.runtime.client.JobClient.submitJobAndWait(JobClient.java:143)
 ... 10 more


  The only call above which comes from my code is
 ExtractInfoFromLogs.java:133 which is the environment.execute() method.

  This exception comes when dealing with largish files (10GB). No
 exception is thrown when I am working with a smaller subset of my data.
  Also I would swear that it was working fine until a few days ago, and
 the code has not been changed :S Only change was a re-import of maven
 dependencies.

  I am unsure what other information I could provide that would help you
 help me :)

  I am running everything locally through the intelij IDE. Maven
 dependency is set to 0.9-SNAPSHOT.
  I have an 8-core Ubuntu 14.04 machine.

  Thanks in advance :D


 --
 Regards, Grüße, Cordialement, Recuerdos, Saluti, προσρήσεις, 问候, تحياتي.
 Mohamed Nadjib Mami
 PhD Student - EIS Department - Bonn University, Germany.
 About me! http://www.strikingly.com/mohamed-nadjib-mami
 LinkedIn