[ https://issues.apache.org/jira/browse/FLINK-7368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Nico Chen updated FLINK-7368: ----------------------------- Description: Flink's `MetricStore` is not thread-safe. multi-treads may acess java' hashmap inside `MetricStore` and can tirgger hashmap's infinte loop. Recently I met the case that flink jobmanager consumed 100% cpu. A part of stacktrace is shown below. The full jstack is in the attachment. {code:java} "ForkJoinPool-1-worker-19" daemon prio=10 tid=0x00007fbdacac9800 nid=0x64c1 runnable [0x00007fbd7d1c2000] java.lang.Thread.State: RUNNABLE at java.util.HashMap.put(HashMap.java:494) at org.apache.flink.runtime.webmonitor.metrics.MetricStore.addMetric(MetricStore.java:176) at org.apache.flink.runtime.webmonitor.metrics.MetricStore.add(MetricStore.java:121) at org.apache.flink.runtime.webmonitor.metrics.MetricFetcher.addMetrics(MetricFetcher.java:198) at org.apache.flink.runtime.webmonitor.metrics.MetricFetcher.access$500(MetricFetcher.java:58) at org.apache.flink.runtime.webmonitor.metrics.MetricFetcher$4.onSuccess(MetricFetcher.java:188) at akka.dispatch.OnSuccess.internal(Future.scala:212) at akka.dispatch.japi$CallbackBridge.apply(Future.scala:175) at akka.dispatch.japi$CallbackBridge.apply(Future.scala:172) at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) at scala.runtime.AbstractPartialFunction.applyOrElse(AbstractPartialFunction.scala:28) at scala.concurrent.Future$$anonfun$onSuccess$1.apply(Future.scala:117) at scala.concurrent.Future$$anonfun$onSuccess$1.apply(Future.scala:115) at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) at java.util.concurrent.ForkJoinTask$AdaptedRunnable.exec(ForkJoinTask.java:1265) at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:334) at java.util.concurrent.ForkJoinWorkerThread.execTask(ForkJoinWorkerThread.java:604) at java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:784) at java.util.concurrent.ForkJoinPool.work(ForkJoinPool.java:646) at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:398) {code} There are 24 threads show same stacktrace as above to indicate they are spining at HashMap.put(HashMap.java:494) (I am using Java 1.7.0_6). Many posts indicate multi-threads accessing hashmap cause this problem and I reproduce the case as well. The test code is attached. I only modify the HashMap.transfer() by adding concurrent barriers for different treads in order to simulate the timing of creation of cycles in hashmap's Entry. Even through `MetricFetcher` has a 10 seconds minimum inteverl between each metrics qurey, it still cannot guarntee query responses do not acess `MtricStore`'s hashmap concurrently. Thus I think it's a bug to fix. was: Flink's `MetricStore` is not thread-safe. multi-treads may acess java' hashmap inside `MetricStore` and can tirgger hashmap's infinte loop. Recently I met the case that flink jobmanager consumed 100% cpu. A part of stacktrace is shown below. The full jstack is in the attachment. {code:java} "ForkJoinPool-1-worker-19" daemon prio=10 tid=0x00007fbdacac9800 nid=0x64c1 runnable [0x00007fbd7d1c2000] java.lang.Thread.State: RUNNABLE at java.util.HashMap.put(HashMap.java:494) at org.apache.flink.runtime.webmonitor.metrics.MetricStore.addMetric(MetricStore.java:176) at org.apache.flink.runtime.webmonitor.metrics.MetricStore.add(MetricStore.java:121) at org.apache.flink.runtime.webmonitor.metrics.MetricFetcher.addMetrics(MetricFetcher.java:198) at org.apache.flink.runtime.webmonitor.metrics.MetricFetcher.access$500(MetricFetcher.java:58) at org.apache.flink.runtime.webmonitor.metrics.MetricFetcher$4.onSuccess(MetricFetcher.java:188) at akka.dispatch.OnSuccess.internal(Future.scala:212) at akka.dispatch.japi$CallbackBridge.apply(Future.scala:175) at akka.dispatch.japi$CallbackBridge.apply(Future.scala:172) at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) at scala.runtime.AbstractPartialFunction.applyOrElse(AbstractPartialFunction.scala:28) at scala.concurrent.Future$$anonfun$onSuccess$1.apply(Future.scala:117) at scala.concurrent.Future$$anonfun$onSuccess$1.apply(Future.scala:115) at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) at java.util.concurrent.ForkJoinTask$AdaptedRunnable.exec(ForkJoinTask.java:1265) at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:334) at java.util.concurrent.ForkJoinWorkerThread.execTask(ForkJoinWorkerThread.java:604) at java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:784) at java.util.concurrent.ForkJoinPool.work(ForkJoinPool.java:646) at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:398) {code} There are 24 threads show same stacktrace as above to indicate they are spining at HashMap.put(HashMap.java:494) (I am using Java 1.7.0_6). Many posts indicate multi-threads accessing hashmap cause this problem and I reproduce the case as well. The test code is attached. I mainly modify the HashMap.transfer() by adding concurrent barriers for different treads in order to simulate the timing of creation of cyclic linked list. Even through `MetricFetcher` has a 10 seconds minimum inteverl between each metrics qurey, it still cannot guarntee query responses do not acess `MtricStore`'s hashmap concurrently. > MetricStore makes cpu spin at 100% > ---------------------------------- > > Key: FLINK-7368 > URL: https://issues.apache.org/jira/browse/FLINK-7368 > Project: Flink > Issue Type: Bug > Components: Metrics > Reporter: Nico Chen > Attachments: jm-jstack.log, MyHashMapInfiniteLoopTest.java, > MyHashMap.java > > > Flink's `MetricStore` is not thread-safe. multi-treads may acess java' > hashmap inside `MetricStore` and can tirgger hashmap's infinte loop. > Recently I met the case that flink jobmanager consumed 100% cpu. A part of > stacktrace is shown below. The full jstack is in the attachment. > {code:java} > "ForkJoinPool-1-worker-19" daemon prio=10 tid=0x00007fbdacac9800 nid=0x64c1 > runnable [0x00007fbd7d1c2000] > java.lang.Thread.State: RUNNABLE > at java.util.HashMap.put(HashMap.java:494) > at > org.apache.flink.runtime.webmonitor.metrics.MetricStore.addMetric(MetricStore.java:176) > at > org.apache.flink.runtime.webmonitor.metrics.MetricStore.add(MetricStore.java:121) > at > org.apache.flink.runtime.webmonitor.metrics.MetricFetcher.addMetrics(MetricFetcher.java:198) > at > org.apache.flink.runtime.webmonitor.metrics.MetricFetcher.access$500(MetricFetcher.java:58) > at > org.apache.flink.runtime.webmonitor.metrics.MetricFetcher$4.onSuccess(MetricFetcher.java:188) > at akka.dispatch.OnSuccess.internal(Future.scala:212) > at akka.dispatch.japi$CallbackBridge.apply(Future.scala:175) > at akka.dispatch.japi$CallbackBridge.apply(Future.scala:172) > at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) > at > scala.runtime.AbstractPartialFunction.applyOrElse(AbstractPartialFunction.scala:28) > at > scala.concurrent.Future$$anonfun$onSuccess$1.apply(Future.scala:117) > at > scala.concurrent.Future$$anonfun$onSuccess$1.apply(Future.scala:115) > at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) > at > java.util.concurrent.ForkJoinTask$AdaptedRunnable.exec(ForkJoinTask.java:1265) > at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:334) > at > java.util.concurrent.ForkJoinWorkerThread.execTask(ForkJoinWorkerThread.java:604) > at java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:784) > at java.util.concurrent.ForkJoinPool.work(ForkJoinPool.java:646) > at > java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:398) > {code} > There are 24 threads show same stacktrace as above to indicate they are > spining at HashMap.put(HashMap.java:494) (I am using Java 1.7.0_6). Many > posts indicate multi-threads accessing hashmap cause this problem and I > reproduce the case as well. The test code is attached. I only modify the > HashMap.transfer() by adding concurrent barriers for different treads in > order to simulate the timing of creation of cycles in hashmap's Entry. > Even through `MetricFetcher` has a 10 seconds minimum inteverl between each > metrics qurey, it still cannot guarntee query responses do not acess > `MtricStore`'s hashmap concurrently. Thus I think it's a bug to fix. > -- This message was sent by Atlassian JIRA (v6.4.14#64029)