[ https://issues.apache.org/jira/browse/FLINK-6231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15956181#comment-15956181 ]
Chao Zhao commented on FLINK-6231:
----------------------------------

I tried setting the checkpoint interval to 5 min instead of 5 sec and the heap size to 4g, and the OOM disappeared. I think the checkpoint interval change is indeed what solved the OOM. I am not sure whether the earlier OOM is a defect. Any advice on this?

> completed PendingCheckpoint not release state caused oom
> ---------------------------------------------------------
>
>                 Key: FLINK-6231
>                 URL: https://issues.apache.org/jira/browse/FLINK-6231
>             Project: Flink
>          Issue Type: Bug
>          Components: State Backends, Checkpointing
>    Affects Versions: 1.1.4
>         Environment: linux x64
>            Reporter: Chao Zhao
>
> My cluster has one jobmanager and one taskmanager. The jobmanager hits OOM repeatedly,
> with jobmanager.heap.mb set to either 256 or 1024.
> The OOM is always triggered in the same scenario: checkpoints complete quickly, but the
> completed checkpoints remain in the task queue of CheckpointCoordinator.timer
> without their task state being disposed.
> One checkpoint with its task state is about 10 MB, so about 90 completed
> checkpoints cause an OOM with a heap size of 1024 MB. An hprof file confirms this; I can
> provide it if needed.
> I have looked at PendingCheckpoint.finalizeCheckpoint and am not sure whether it should call
> dispose(null, true) instead of dispose(null, false).
> I have no idea how to make my task state much smaller.
> 2017-03-30 10:15:52,260 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 47 @ 1490840152260
> 2017-03-30 10:16:11,781 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 47 (in 19516 ms).
> 2017-03-30 10:16:11,781 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 48 @ 1490840171781
> 2017-03-30 10:26:11,781 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Checkpoint 48 expired before completing.
> 2017-03-30 10:26:11,782 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 49 @ 1490840771782
> 2017-03-30 10:36:11,782 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Checkpoint 49 expired before completing.
> ....... all expired
> 2017-03-31 00:46:11,826 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Checkpoint 134 expired before completing.
> 2017-03-31 00:46:11,826 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 135 @ 1490892371826
> 2017-03-31 00:56:11,827 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Checkpoint 135 expired before completing.
> 2017-03-31 00:56:11,827 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 136 @ 1490892971827
> 2017-03-31 01:06:11,827 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Checkpoint 136 expired before completing.
> 2017-03-31 01:06:11,827 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 137 @ 1490893571827
> 2017-03-31 01:06:12,215 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 137 (in 384 ms).
> 2017-03-31 01:06:16,827 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 138 @ 1490893576827
> 2017-03-31 01:06:17,454 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 138 (in 624 ms).
> 2017-03-31 01:06:21,827 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 139 @ 1490893581827
> 2017-03-31 01:06:22,189 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 139 (in 357 ms).
> ...... all completed in less than 1s
> 2017-03-31 01:13:51,827 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 229 @ 1490894031827
> 2017-03-31 01:13:52,533 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 229 (in 643 ms).
> 2017-03-31 01:13:56,827 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 230 @ 1490894036827
> 2017-03-31 01:13:58,963 ERROR akka.actor.ActorSystemImpl - Uncaught error from thread [flink-akka.remote.default-remote-dispatcher-5] shutting down JVM since 'akka.jvm-exit-on-fatal-error' is enabled
> java.lang.OutOfMemoryError: Java heap space
>     at java.lang.reflect.Array.newInstance(Array.java:70)
>     at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1670)
>     at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344)
>     at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>     at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>     at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>     at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>     at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>     at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>     at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>     at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>     at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>     at akka.serialization.JavaSerializer$$anonfun$1.apply(Serializer.scala:136)
>     at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
>     at akka.serialization.JavaSerializer.fromBinary(Serializer.scala:136)
>     at akka.serialization.Serialization$$anonfun$deserialize$1.apply(Serialization.scala:104)
>     at scala.util.Try$.apply(Try.scala:192)
>     at akka.serialization.Serialization.deserialize(Serialization.scala:98)
>     at akka.remote.MessageSerializer$.deserialize(MessageSerializer.scala:23)
>     at akka.remote.DefaultMessageDispatcher.payload$lzycompute$1(Endpoint.scala:58)
>     at akka.remote.DefaultMessageDispatcher.payload$1(Endpoint.scala:58)
>     at akka.remote.DefaultMessageDispatcher.dispatch(Endpoint.scala:76)
>     at akka.remote.EndpointReader$$anonfun$receive$2.applyOrElse(Endpoint.scala:937)
>     at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
>     at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:415)
>     at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
>     at akka.actor.ActorCell.invoke(ActorCell.scala:487)
>     at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
>     at akka.dispatch.Mailbox.run(Mailbox.scala:221)
>     at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
>     at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>     at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> 2017-03-31 01:13:59,195 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Stopping checkpoint coordinator for job 489ece0f75fe046bca646f1d19b6b766
> 2017-03-31 01:13:59,197 INFO org.apache.flink.runtime.webmonitor.WebRuntimeMonitor - Removing web dashboard root cache directory /tmp/flink-web-4a631231-cdd4-40d4-850e-00ad7f7936ec
> 2017-03-31 01:13:59,197 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Stopping checkpoint coordinator for job 489ece0f75fe046bca646f1d19b6b766
> 2017-03-31 01:13:59,200 INFO org.apache.flink.runtime.blob.BlobServer - Stopped BLOB server at 0.0.0.0:12984
> 2017-03-31 01:13:59,203 INFO org.apache.flink.runtime.webmonitor.WebRuntimeMonitor - Removing web dashboard jar upload directory /tmp/flink-web-upload-3ad03fcb-b920-45ec-bdc6-befae0a98c08
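
As a rough sanity check of the numbers in the description (about 10 MB of task state per checkpoint, a 1024 MB heap, one checkpoint every 5 seconds), the arithmetic can be sketched as below. This is only a back-of-the-envelope model: it assumes that no completed checkpoint state is released during the window, which is the reporter's hypothesis and not something the logs prove. The function names and the `baseline_mb` parameter are made up for illustration and are not part of Flink.

```python
def checkpoints_until_full(heap_mb, state_mb, baseline_mb=0):
    """How many retained checkpoint states fit in the heap.

    heap_mb     -- JobManager heap size in MB (e.g. jobmanager.heap.mb)
    state_mb    -- approximate task state held per completed checkpoint, in MB
    baseline_mb -- hypothetical heap already used by everything else
    """
    usable_mb = max(heap_mb - baseline_mb, 0)
    return usable_mb // state_mb


def seconds_until_full(heap_mb, state_mb, interval_s, baseline_mb=0):
    """Time to fill the heap if one state is retained per checkpoint interval."""
    return checkpoints_until_full(heap_mb, state_mb, baseline_mb) * interval_s


if __name__ == "__main__":
    # The two settings mentioned in this thread: 1024 MB / 5 s and 4g / 5 min.
    for heap_mb, interval_s in [(1024, 5), (4096, 300)]:
        n = checkpoints_until_full(heap_mb, 10)
        t = seconds_until_full(heap_mb, 10, interval_s)
        print(f"heap={heap_mb} MB, interval={interval_s} s: "
              f"{n} checkpoints, about {t} s until the heap is full")
```

With a 1024 MB heap and 10 MB per checkpoint this gives 102 checkpoints, or 510 seconds at a 5 second interval, which is in the same range as the roughly 90 checkpoints reported and the few minutes between checkpoint 137 and the OutOfMemoryError in the log. Under the same assumption, a longer interval only delays the point at which the heap fills up, so this model on its own cannot tell whether the state is never released or merely released too slowly.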