[ 
https://issues.apache.org/jira/browse/FLINK-6231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15956181#comment-15956181
 ] 

Chao Zhao commented on FLINK-6231:
----------------------------------

I changed the checkpoint interval from 5 sec to 5 min and raised the heap size
to 4 GB, and the OOM disappeared. I think the longer checkpoint interval is
what resolved the OOM. I am not sure whether the earlier OOM is a defect. Any
advice on this?
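
For reference, a rough back-of-the-envelope sketch of why the interval matters
so much here. It assumes the numbers from the report (about 10 MB of task state
per checkpoint, a 5 sec interval) and takes the window between the first and
last fast checkpoints in the log (about 01:06:12 to 01:13:52, roughly 460 sec).
It also assumes that no completed checkpoint releases its state during that
window. The class and method names are made up for illustration and are not
part of Flink.

```java
class CheckpointHeapEstimate {

    // Number of checkpoints triggered within a window for a given interval.
    static long checkpointsInWindow(long windowMs, long intervalMs) {
        return windowMs / intervalMs;
    }

    // Heap (in MB) held if none of those checkpoints releases its state.
    static long retainedMb(long checkpoints, long stateMbPerCheckpoint) {
        return checkpoints * stateMbPerCheckpoint;
    }

    public static void main(String[] args) {
        long windowMs = 460_000L;       // assumed: ~01:06:12 to ~01:13:52 in the log
        long stateMbPerCheckpoint = 10; // assumed: "about 10m" per checkpoint

        long fast = checkpointsInWindow(windowMs, 5_000L);   // 5 sec interval
        long slow = checkpointsInWindow(windowMs, 300_000L); // 5 min interval

        System.out.println("5 sec interval: " + fast + " checkpoints, ~"
                + retainedMb(fast, stateMbPerCheckpoint) + " MB retained");
        System.out.println("5 min interval: " + slow + " checkpoints, ~"
                + retainedMb(slow, stateMbPerCheckpoint) + " MB retained");
    }
}
```

Under these assumptions the 5 sec interval gives 92 checkpoints and about
920 MB retained, which is close to the 1024 MB heap, while the 5 min interval
gives 1 checkpoint and about 10 MB in the same window.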

> completed PendingCheckpoint not  release state caused oom
> ---------------------------------------------------------
>
>                 Key: FLINK-6231
>                 URL: https://issues.apache.org/jira/browse/FLINK-6231
>             Project: Flink
>          Issue Type: Bug
>          Components: State Backends, Checkpointing
>    Affects Versions: 1.1.4
>         Environment: linux x64
>            Reporter: Chao Zhao
>
> My cluster has one JobManager and one TaskManager. The JobManager runs out of
> memory (OOM) repeatedly, with jobmanager.heap.mb set to either 256 or 1024.
> The OOM is always triggered in the same scenario: checkpoints complete
> quickly, while the completed checkpoints remain in the task queue of
> CheckpointCoordinator.timer without their task state being disposed.
> Each of my checkpoints carries about 10 MB of task state, so about 90
> completed checkpoints caused the OOM with a heap size of 1024 MB. The hprof
> file confirms this; I can provide it if needed.
> I have checked PendingCheckpoint.finalizeCheckpoint and am not sure whether it
> should call dispose(null, true) instead of dispose(null, false).
> I do not know how to make my task state much smaller.
> 2017-03-30 10:15:52,260 INFO  
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Triggering 
> checkpoint 47 @ 1490840152260
> 2017-03-30 10:16:11,781 INFO  
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Completed 
> checkpoint 47 (in 19516 ms).
> 2017-03-30 10:16:11,781 INFO  
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Triggering 
> checkpoint 48 @ 1490840171781
> 2017-03-30 10:26:11,781 INFO  
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Checkpoint 48 
> expired before completing.
> 2017-03-30 10:26:11,782 INFO  
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Triggering 
> checkpoint 49 @ 1490840771782
> 2017-03-30 10:36:11,782 INFO  
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Checkpoint 49 
> expired before completing.
> ....... all expired
> 2017-03-31 00:46:11,826 INFO  
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Checkpoint 
> 134 expired before completing.
> 2017-03-31 00:46:11,826 INFO  
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Triggering 
> checkpoint 135 @ 1490892371826
> 2017-03-31 00:56:11,827 INFO  
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Checkpoint 
> 135 expired before completing.
> 2017-03-31 00:56:11,827 INFO  
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Triggering 
> checkpoint 136 @ 1490892971827
> 2017-03-31 01:06:11,827 INFO  
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Checkpoint 
> 136 expired before completing.
> 2017-03-31 01:06:11,827 INFO  
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Triggering 
> checkpoint 137 @ 1490893571827
> 2017-03-31 01:06:12,215 INFO  
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Completed 
> checkpoint 137 (in 384 ms).
> 2017-03-31 01:06:16,827 INFO  
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Triggering 
> checkpoint 138 @ 1490893576827
> 2017-03-31 01:06:17,454 INFO  
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Completed 
> checkpoint 138 (in 624 ms).
> 2017-03-31 01:06:21,827 INFO  
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Triggering 
> checkpoint 139 @ 1490893581827
> 2017-03-31 01:06:22,189 INFO  
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Completed 
> checkpoint 139 (in 357 ms).
> ...... all completed in less than 1s
> 2017-03-31 01:13:51,827 INFO  
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Triggering 
> checkpoint 229 @ 1490894031827
> 2017-03-31 01:13:52,533 INFO  
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Completed 
> checkpoint 229 (in 643 ms).
> 2017-03-31 01:13:56,827 INFO  
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Triggering 
> checkpoint 230 @ 1490894036827
> 2017-03-31 01:13:58,963 ERROR akka.actor.ActorSystemImpl                      
>               - Uncaught error from thread 
> [flink-akka.remote.default-remote-dispatcher-5] shutting down JVM since 
> 'akka.jvm-exit-on-fatal-error' is enabled
> java.lang.OutOfMemoryError: Java heap space
>       at java.lang.reflect.Array.newInstance(Array.java:70)
>       at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1670)
>       at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344)
>       at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>       at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>       at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>       at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>       at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>       at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>       at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>       at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>       at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>       at 
> akka.serialization.JavaSerializer$$anonfun$1.apply(Serializer.scala:136)
>       at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
>       at akka.serialization.JavaSerializer.fromBinary(Serializer.scala:136)
>       at 
> akka.serialization.Serialization$$anonfun$deserialize$1.apply(Serialization.scala:104)
>       at scala.util.Try$.apply(Try.scala:192)
>       at akka.serialization.Serialization.deserialize(Serialization.scala:98)
>       at 
> akka.remote.MessageSerializer$.deserialize(MessageSerializer.scala:23)
>       at 
> akka.remote.DefaultMessageDispatcher.payload$lzycompute$1(Endpoint.scala:58)
>       at akka.remote.DefaultMessageDispatcher.payload$1(Endpoint.scala:58)
>       at akka.remote.DefaultMessageDispatcher.dispatch(Endpoint.scala:76)
>       at 
> akka.remote.EndpointReader$$anonfun$receive$2.applyOrElse(Endpoint.scala:937)
>       at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
>       at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:415)
>       at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
>       at akka.actor.ActorCell.invoke(ActorCell.scala:487)
>       at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
>       at akka.dispatch.Mailbox.run(Mailbox.scala:221)
>       at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
>       at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>       at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> 2017-03-31 01:13:59,195 INFO  
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Stopping 
> checkpoint coordinator for job 489ece0f75fe046bca646f1d19b6b766
> 2017-03-31 01:13:59,197 INFO  
> org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Removing web 
> dashboard root cache directory 
> /tmp/flink-web-4a631231-cdd4-40d4-850e-00ad7f7936ec
> 2017-03-31 01:13:59,197 INFO  
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Stopping 
> checkpoint coordinator for job 489ece0f75fe046bca646f1d19b6b766
> 2017-03-31 01:13:59,200 INFO  org.apache.flink.runtime.blob.BlobServer        
>               - Stopped BLOB server at 0.0.0.0:12984
> 2017-03-31 01:13:59,203 INFO  
> org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Removing web 
> dashboard jar upload directory 
> /tmp/flink-web-upload-3ad03fcb-b920-45ec-bdc6-befae0a98c08



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
