Are you sure you checked the TaskManager with ID 383f6af3299793ba73eeb7bdbab0ddc7? It should have logged something at the time of the error; otherwise, this would be very strange.
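On the back-pressure question further down this thread: the web UI's back-pressure view is backed by Flink's monitoring REST API, which (in the versions I've used; check the docs for yours) exposes sampling results under `/jobs/<job-id>/vertices/<vertex-id>/backpressure`. While sampling is still in progress the response status is not "ok", which matches the "sampling in progress" message in the UI. A minimal sketch of polling-side handling; the JSON below is an illustrative, made-up response, not output from your cluster:

```python
import json

# Hypothetical response body from GET /jobs/<job-id>/vertices/<vertex-id>/backpressure.
# All IDs and numbers here are invented for illustration.
sample_response = """
{
  "status": "ok",
  "backpressure-level": "high",
  "end-timestamp": 1489104000000,
  "subtasks": [
    {"subtask": 0, "backpressure-level": "ok",   "ratio": 0.02},
    {"subtask": 1, "backpressure-level": "high", "ratio": 0.75}
  ]
}
"""

def worst_backpressure(body: str):
    """Return (subtask index, ratio) of the most back-pressured subtask,
    or None if sampling has not finished yet (retry later in that case)."""
    stats = json.loads(body)
    if stats.get("status") != "ok":
        return None
    worst = max(stats["subtasks"], key=lambda s: s["ratio"])
    return worst["subtask"], worst["ratio"]

print(worst_backpressure(sample_response))  # -> (1, 0.75)
```

The ratio is the fraction of stack-trace samples in which the subtask was blocked requesting an output buffer, so a subtask near 1.0 is the one being throttled by a slow downstream consumer.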
On Sat, Mar 11, 2017 at 12:01 AM, Govindarajan Srinivasaraghavan <govindragh...@gmail.com> wrote:

> This is the exception before the job went into the cancelled state. But when I
> looked into the task manager node, the Flink process is still running.
>
> java.lang.Exception: TaskManager was lost/killed: 383f6af3299793ba73eeb7bdbab0ddc7 @ ip-xx.xx.xxx.xx.us-west-2.compute.internal (dataPort=37652)
>     at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:217)
>     at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:533)
>     at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:192)
>     at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:167)
>     at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:212)
>     at org.apache.flink.runtime.jobmanager.JobManager.org$apache$flink$runtime$jobmanager$JobManager$$handleTaskManagerTerminated(JobManager.scala:1202)
>     at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:1105)
>     at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
>     at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
>     at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
>     at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44)
>     at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
>     at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
>     at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44)
>     at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
>     at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
>     at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
>     at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
>     at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
>     at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
>     at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
>     at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
>     at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:118)
>     at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
>     at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:44)
>     at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369)
>     at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501)
>     at akka.actor.ActorCell.invoke(ActorCell.scala:486)
>     at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
>     at akka.dispatch.Mailbox.run(Mailbox.scala:220)
>     at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
>     at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>     at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>     at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>     at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>
> On Fri, Mar 10, 2017 at 5:40 AM, Robert Metzger <rmetz...@apache.org> wrote:
>
>> Hi,
>>
>> this error is only logged at WARN level. As Kaibo already said, it's not a
>> critical issue.
>>
>> Can you send some more messages from your log? Usually the JobManager
>> logs why a TaskManager has failed, and the last few log messages of the
>> failed TM itself are also often helpful.
>>
>> On Fri, Mar 10, 2017 at 10:46 AM, Kaibo Zhou <zkb...@gmail.com> wrote:
>>
>>> I think this is not the root cause of the job failure; this task failed
>>> because other tasks failed. You can check the log of the first failed task.
>>>
>>> 2017-03-10 12:25 GMT+08:00 Govindarajan Srinivasaraghavan <govindragh...@gmail.com>:
>>>
>>>> Hi All,
>>>>
>>>> I see the below error after running my streaming job for a while, once
>>>> the load increases. After a while the task manager becomes completely
>>>> dead and the job keeps restarting.
>>>>
>>>> Also, when I checked for back pressure in the UI, it kept saying
>>>> "sampling in progress" and no results were displayed. Is there an API
>>>> which can provide the back pressure details?
>>>>
>>>> 2017-03-10 01:40:58,793 WARN org.apache.flink.streaming.api.operators.AbstractStreamOperator - Error while emitting latency marker.
>>>> org.apache.flink.streaming.runtime.tasks.ExceptionInChainedOperatorException: Could not forward element to next operator
>>>>     at org.apache.flink.streaming.runtime.tasks.OperatorChain$ChainingOutput.emitLatencyMarker(OperatorChain.java:426)
>>>>     at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.emitLatencyMarker(AbstractStreamOperator.java:848)
>>>>     at org.apache.flink.streaming.api.operators.StreamSource$LatencyMarksEmitter$1.onProcessingTime(StreamSource.java:152)
>>>>     at org.apache.flink.streaming.runtime.tasks.SystemProcessingTimeService$RepeatedTriggerTask.run(SystemProcessingTimeService.java:256)
>>>>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>>>>     at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>>>>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>>>>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>>>     at java.lang.Thread.run(Thread.java:745)
>>>> Caused by: java.lang.RuntimeException
>>>>     at org.apache.flink.streaming.runtime.io.RecordWriterOutput.emitLatencyMarker(RecordWriterOutput.java:117)
>>>>     at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.emitLatencyMarker(AbstractStreamOperator.java:848)
>>>>     at org.apache.flink.streaming.api.operators.AbstractStreamOperator.reportOrForwardLatencyMarker(AbstractStreamOperator.java:708)
>>>>     at org.apache.flink.streaming.api.operators.AbstractStreamOperator.processLatencyMarker(AbstractStreamOperator.java:690)
>>>>     at org.apache.flink.streaming.runtime.tasks.OperatorChain$ChainingOutput.emitLatencyMarker(OperatorChain.java:423)
>>>>     ... 10 more
>>>> Caused by: java.lang.InterruptedException
>>>>     at java.lang.Object.wait(Native Method)
>>>>     at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestBuffer(LocalBufferPool.java:168)
>>>>     at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestBufferBlocking(LocalBufferPool.java:138)
>>>>     at org.apache.flink.runtime.io.network.api.writer.RecordWriter.sendToTarget(RecordWriter.java:132)
>>>>     at org.apache.flink.runtime.io.network.api.writer.RecordWriter.randomEmit(RecordWriter.java:107)
>>>>     at org.apache.flink.streaming.runtime.io.StreamRecordWriter.randomEmit(StreamRecordWriter.java:104)
>>>>     at org.apache.flink.streaming.runtime.io.RecordWriterOutput.emitLatencyMarker(RecordWriterOutput.java:114)
>>>>     ... 14 more