Hi shishal!

I think there is an issue with cancellation when many timers fire at the
same time. These timers have to finish before shutdown happens, this seems
to take a while in your case.

Did the TM process actually kill itself in the end (and got restarted)?



On Wed, Jul 11, 2018 at 9:29 AM, shishal <shisha...@gmail.com> wrote:

> Hi,
>
> I am using flink 1.4.2 with rocksdb as backend. I am using process function
> with timer on EventTime.  For checkpointing I am using hdfs.
>
> I am trying load testing so Iam reading kafka from beginning (aprox 7 days
> data with 50M events).
>
> My job gets stuck after aprox 20 min with no error. There after watermark
> do
> not progress and all checkpoint fails.
>
> Also When I try to cancel my job (using web UI) , it takes several minutes
> to finally gets cancelled. Also it makes Task manager down as well.
>
> There is no logs while my job hanged but while cancelling I get following
> error.
>
> /
>
> 2018-07-11 09:10:39,385 ERROR
> org.apache.flink.runtime.taskmanager.TaskManager              -
> ==============================================================
> ======================      FATAL      =======================
> ==============================================================
>
> A fatal error occurred, forcing the TaskManager to shut down: Task 'process
> (3/6)' did not react to cancelling signal in the last 30 seconds, but is
> stuck in method:
>  org.rocksdb.RocksDB.get(Native Method)
> org.rocksdb.RocksDB.get(RocksDB.java:810)
> org.apache.flink.contrib.streaming.state.RocksDBMapState.get(
> RocksDBMapState.java:102)
> org.apache.flink.runtime.state.UserFacingMapState.get(
> UserFacingMapState.java:47)
> nl.ing.gmtt.observer.analyser.job.rtpe.process.
> RtpeProcessFunction.onTimer(RtpeProcessFunction.java:94)
> org.apache.flink.streaming.api.operators.KeyedProcessOperator.onEventTime(
> KeyedProcessOperator.java:75)
> org.apache.flink.streaming.api.operators.HeapInternalTimerService.
> advanceWatermark(HeapInternalTimerService.java:288)
> org.apache.flink.streaming.api.operators.InternalTimeServiceManager.
> advanceWatermark(InternalTimeServiceManager.java:108)
> org.apache.flink.streaming.api.operators.AbstractStreamOperator.
> processWatermark(AbstractStreamOperator.java:885)
> org.apache.flink.streaming.runtime.io.StreamInputProcessor$
> ForwardingValveOutputHandler.handleWatermark(
> StreamInputProcessor.java:288)
> org.apache.flink.streaming.runtime.streamstatus.StatusWatermarkValve.
> findAndOutputNewMinWatermarkAcrossAlignedChannels(
> StatusWatermarkValve.java:189)
> org.apache.flink.streaming.runtime.streamstatus.StatusWatermarkValve.
> inputWatermark(StatusWatermarkValve.java:111)
> org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(
> StreamInputProcessor.java:189)
> org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(
> OneInputStreamTask.java:69)
> org.apache.flink.streaming.runtime.tasks.StreamTask.
> invoke(StreamTask.java:264)
> org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
> java.lang.Thread.run(Thread.java:748)
>
> 2018-07-11 09:10:39,390 DEBUG
> org.apache.flink.runtime.akka.StoppingSupervisorWithoutLoggingActorKilledExceptionStrategy
>
> - Actor was killed. Stopping it now.
> akka.actor.ActorKilledException: Kill
> 2018-07-11 09:10:39,407 INFO
> org.apache.flink.runtime.taskmanager.TaskManager              - Stopping
> TaskManager akka://flink/user/taskmanager#-1231617791.
> 2018-07-11 09:10:39,408 INFO
> org.apache.flink.runtime.taskmanager.TaskManager              - Cancelling
> all computations and discarding all cached data.
> 2018-07-11 09:10:39,409 INFO  org.apache.flink.runtime.taskmanager.Task
>
> - Attempting to fail task externally process (3/6)
> (432fd129f3eea363334521f8c8de5198).
> 2018-07-11 09:10:39,409 INFO  org.apache.flink.runtime.taskmanager.Task
>
> - Task process (3/6) is already in state CANCELING
> 2018-07-11 09:10:39,409 INFO  org.apache.flink.runtime.taskmanager.Task
>
> - Attempting to fail task externally process (4/6)
> (7c6b96c9f32b067bdf8fa7c283eca2e0).
> 2018-07-11 09:10:39,409 INFO  org.apache.flink.runtime.taskmanager.Task
>
> - Task process (4/6) is already in state CANCELING
> 2018-07-11 09:10:39,409 INFO  org.apache.flink.runtime.taskmanager.Task
>
> - Attempting to fail task externally process (2/6)
> (a4f731797a7ea210fd0b512b0263bcd9).
> 2018-07-11 09:10:39,409 INFO  org.apache.flink.runtime.taskmanager.Task
>
> - Task process (2/6) is already in state CANCELING
> 2018-07-11 09:10:39,409 INFO  org.apache.flink.runtime.taskmanager.Task
>
> - Attempting to fail task externally process (1/6)
> (cd8a113779a4c00a051d78ad63bc7963).
> 2018-07-11 09:10:39,409 INFO  org.apache.flink.runtime.taskmanager.Task
>
> - Task process (1/6) is already in state CANCELING
> 2018-07-11 09:10:39,409 INFO
> org.apache.flink.runtime.taskmanager.TaskManager              -
> Disassociating from JobManager
> 2018-07-11 09:10:39,412 INFO
> org.apache.flink.runtime.blob.PermanentBlobCache              - Shutting
> down BLOB cache
> 2018-07-11 09:10:39,431 INFO
> org.apache.flink.runtime.blob.TransientBlobCache              - Shutting
> down BLOB cache
> 2018-07-11 09:10:39,444 INFO
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService
> -
> Stopping ZooKeeperLeaderRetrievalService.
> 2018-07-11 09:10:39,444 DEBUG
> org.apache.flink.runtime.io.disk.iomanager.IOManager          - Shutting
> down I/O manager.
> 2018-07-11 09:10:39,451 INFO
> org.apache.flink.runtime.io.disk.iomanager.IOManager          - I/O
> manager
> removed spill file directory
> /tmp/flink-io-989505e5-ac33-4d56-add5-04b2ad3067b4
> 2018-07-11 09:10:39,461 INFO
> org.apache.flink.runtime.io.network.NetworkEnvironment        - Shutting
> down the network environment and its components.
> 2018-07-11 09:10:39,461 DEBUG
> org.apache.flink.runtime.io.network.NetworkEnvironment        - Shutting
> down network connection manager
> 2018-07-11 09:10:39,462 INFO
> org.apache.flink.runtime.io.network.netty.NettyClient         - Successful
> shutdown (took 1 ms).
> 2018-07-11 09:10:39,472 INFO
> org.apache.flink.runtime.io.network.netty.NettyServer         - Successful
> shutdown (took 10 ms).
> 2018-07-11 09:10:39,472 DEBUG
> org.apache.flink.runtime.io.network.NetworkEnvironment        - Shutting
> down intermediate result partition manager
> 2018-07-11 09:10:39,473 DEBUG
> org.apache.flink.runtime.io.network.partition.ResultPartitionManager  -
> Releasing 0 partitions because of shutdown.
> 2018-07-11 09:10:39,474 DEBUG
> org.apache.flink.runtime.io.network.partition.ResultPartitionManager  -
> Successful shutdown.
> 2018-07-11 09:10:39,498 INFO
> org.apache.flink.runtime.taskmanager.TaskManager              - Task
> manager
> akka://flink/user/taskmanager is completely shut down.
> 2018-07-11 09:10:39,504 ERROR
> org.apache.flink.runtime.taskmanager.TaskManager              - Actor
> akka://flink/user/taskmanager#-1231617791 terminated, stopping process...
> 2018-07-11 09:10:39,563 INFO  org.apache.flink.runtime.taskmanager.Task
>
> - Notifying TaskManager about fatal error. Task 'process (2/6)' did not
> react to cancelling signal in the last 30 seconds, but is stuck in method:
>  org.rocksdb.RocksDB.get(Native Method)
> org.rocksdb.RocksDB.get(RocksDB.java:810)
> org.apache.flink.contrib.streaming.state.RocksDBMapState.get(
> RocksDBMapState.java:102)
> org.apache.flink.runtime.state.UserFacingMapState.get(
> UserFacingMapState.java:47)
> nl.ing.gmtt.observer.analyser.job.rtpe.process.
> RtpeProcessFunction.onTimer(RtpeProcessFunction.java:94)
> org.apache.flink.streaming.api.operators.KeyedProcessOperator.onEventTime(
> KeyedProcessOperator.java:75)
> org.apache.flink.streaming.api.operators.HeapInternalTimerService.
> advanceWatermark(HeapInternalTimerService.java:288)
> org.apache.flink.streaming.api.operators.InternalTimeServiceManager.
> advanceWatermark(InternalTimeServiceManager.java:108)
> org.apache.flink.streaming.api.operators.AbstractStreamOperator.
> processWatermark(AbstractStreamOperator.java:885)
> org.apache.flink.streaming.runtime.io.StreamInputProcessor$
> ForwardingValveOutputHandler.handleWatermark(
> StreamInputProcessor.java:288)
> org.apache.flink.streaming.runtime.streamstatus.StatusWatermarkValve.
> findAndOutputNewMinWatermarkAcrossAlignedChannels(
> StatusWatermarkValve.java:189)
> org.apache.flink.streaming.runtime.streamstatus.StatusWatermarkValve.
> inputWatermark(StatusWatermarkValve.java:111)
> org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(
> StreamInputProcessor.java:189)
> org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(
> OneInputStreamTask.java:69)
> org.apache.flink.streaming.runtime.tasks.StreamTask.
> invoke(StreamTask.java:264)
> org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
> java.lang.Thread.run(Thread.java:748)
> .
> 2018-07-11 09:10:39,575 INFO  org.apache.flink.runtime.taskmanager.Task
>
> - Notifying TaskManager about fatal error. Task 'process (1/6)' did not
> react to cancelling signal in the last 30 seconds, but is stuck in method:
>
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(
> DelegatingConstructorAccessorImpl.java:45)
> java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> java.lang.Class.newInstance(Class.java:442)
> org.apache.flink.api.java.typeutils.runtime.PojoSerializer.createInstance(
> PojoSerializer.java:196)
> org.apache.flink.api.java.typeutils.runtime.PojoSerializer.deserialize(
> PojoSerializer.java:399)
> org.apache.flink.contrib.streaming.state.RocksDBMapState.
> deserializeUserValue(RocksDBMapState.java:304)
> org.apache.flink.contrib.streaming.state.RocksDBMapState.get(
> RocksDBMapState.java:104)
> org.apache.flink.runtime.state.UserFacingMapState.get(
> UserFacingMapState.java:47)
> nl.ing.gmtt.observer.analyser.job.rtpe.process.
> RtpeProcessFunction.onTimer(RtpeProcessFunction.java:94)
> org.apache.flink.streaming.api.operators.KeyedProcessOperator.onEventTime(
> KeyedProcessOperator.java:75)
> org.apache.flink.streaming.api.operators.HeapInternalTimerService.
> advanceWatermark(HeapInternalTimerService.java:288)
> org.apache.flink.streaming.api.operators.InternalTimeServiceManager.
> advanceWatermark(InternalTimeServiceManager.java:108)
> org.apache.flink.streaming.api.operators.AbstractStreamOperator.
> processWatermark(AbstractStreamOperator.java:885)
> org.apache.flink.streaming.runtime.io.StreamInputProcessor$
> ForwardingValveOutputHandler.handleWatermark(
> StreamInputProcessor.java:288)
> org.apache.flink.streaming.runtime.streamstatus.StatusWatermarkValve.
> findAndOutputNewMinWatermarkAcrossAlignedChannels(
> StatusWatermarkValve.java:189)
> org.apache.flink.streaming.runtime.streamstatus.StatusWatermarkValve.
> inputWatermark(StatusWatermarkValve.java:111)
> org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(
> StreamInputProcessor.java:189)
> org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(
> OneInputStreamTask.java:69)
> org.apache.flink.streaming.runtime.tasks.StreamTask.
> invoke(StreamTask.java:264)
> org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
> java.lang.Thread.run(Thread.java:748)
> /
>
>
>
> --
> Sent from: http://apache-flink-user-mailing-list-archive.2336050.
> n4.nabble.com/
>

Reply via email to