[ 
https://issues.apache.org/jira/browse/STORM-3622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Li updated STORM-3622:
----------------------------
    Description: 
We noticed that with the change in https://github.com/apache/storm/pull/3242, 
there is a race condition causing NPE.
{code:java}
2020-04-14 18:22:12.997 o.a.s.u.Utils Thread-17-__acker-executor[16, 16] 
[ERROR] Async loop died!
java.lang.RuntimeException: java.lang.NullPointerException
 at org.apache.storm.executor.Executor.accept(Executor.java:291) 
~[storm-client-2.2.0.y.jar:2.2.0.y]
 at org.apache.storm.utils.JCQueue.consumeImpl(JCQueue.java:131) 
~[storm-client-2.2.0.y.jar:2.2.0.y]
 at org.apache.storm.utils.JCQueue.consume(JCQueue.java:111) 
~[storm-client-2.2.0.y.jar:2.2.0.y]
 at org.apache.storm.executor.bolt.BoltExecutor$1.call(BoltExecutor.java:172) 
~[storm-client-2.2.0.y.jar:2.2.0.y]
 at org.apache.storm.executor.bolt.BoltExecutor$1.call(BoltExecutor.java:159) 
~[storm-client-2.2.0.y.jar:2.2.0.y]
 at org.apache.storm.utils.Utils$1.run(Utils.java:434) 
[storm-client-2.2.0.y.jar:2.2.0.y]
 at java.lang.Thread.run(Thread.java:748) [?:1.8.0_242]
Caused by: java.lang.NullPointerException
 at 
com.codahale.metrics.jvm.ThreadStatesGaugeSet.getThreadCount(ThreadStatesGaugeSet.java:95)
 ~[metrics-jvm-3.2.6.jar:3.2.6]
 at 
com.codahale.metrics.jvm.ThreadStatesGaugeSet.access$000(ThreadStatesGaugeSet.java:20)
 ~[metrics-jvm-3.2.6.jar:3.2.6]
 at 
com.codahale.metrics.jvm.ThreadStatesGaugeSet$1.getValue(ThreadStatesGaugeSet.java:56)
 ~[metrics-jvm-3.2.6.jar:3.2.6]
 at org.apache.storm.executor.Executor.addV2Metrics(Executor.java:344) 
~[storm-client-2.2.0.y.jar:2.2.0.y]
 at org.apache.storm.executor.Executor.metricsTick(Executor.java:320) 
~[storm-client-2.2.0.y.jar:2.2.0.y]
 at 
org.apache.storm.executor.bolt.BoltExecutor.tupleActionFn(BoltExecutor.java:218)
 ~[storm-client-2.2.0.y.jar:2.2.0.y]
 at org.apache.storm.executor.Executor.accept(Executor.java:287) 
~[storm-client-2.2.0.y.jar:2.2.0.y]
 ... 6 more
{code}


This is due to a race condition in CachedGauge 
https://github.com/dropwizard/metrics/blob/v3.2.6/metrics-core/src/main/java/com/codahale/metrics/CachedGauge.java#L49-L53

There are two issues here.
The first one is STORM-3623. 
This makes all the executors to get values for all the metrics. So multiple 
threads will access the same metric.

So the threads gauges are now accessed by multiple threads. But in CachedGauge,


{code:java}
 @Override
    public T getValue() {
        if (shouldLoad()) {
            this.value = loadValue();
        }
        return value;
    }
{code}

this method is not thread-safe. Two threads can reach to getValue at the same 
time.
The first thread reaching shouldLoad knows it needs to reload, so it calls the 
next line this.value=loadValue()
The second thread is a little bit late so shouldLoad returns false. Then it 
returns the value directly.

There is a race condition between first thread calling loadValue() and the 
second thread returning value.

If the first thread finishes loadValue() first, both values returned to the 
threads are the same value (and current value). But if the second thread 
returns earlier, the second thread gets the original value (which is null ), 
hence NPE.

To summarize, the second issue is CachedThreadStatesGaugeSet is not thread-safe

To fix this NPE, we should avoid using CachedThreadStatesGaugeSet. 

But we still need to fix STORM-3623  to avoid unnecessary computations and 
redundant metrics.

  was:
We noticed that with the change in https://github.com/apache/storm/pull/3242, 
there is a race condition causing NPE.
{code:java}
2020-04-14 18:22:12.997 o.a.s.u.Utils Thread-17-__acker-executor[16, 16] 
[ERROR] Async loop died!
java.lang.RuntimeException: java.lang.NullPointerException
 at org.apache.storm.executor.Executor.accept(Executor.java:291) 
~[storm-client-2.2.0.y.jar:2.2.0.y]
 at org.apache.storm.utils.JCQueue.consumeImpl(JCQueue.java:131) 
~[storm-client-2.2.0.y.jar:2.2.0.y]
 at org.apache.storm.utils.JCQueue.consume(JCQueue.java:111) 
~[storm-client-2.2.0.y.jar:2.2.0.y]
 at org.apache.storm.executor.bolt.BoltExecutor$1.call(BoltExecutor.java:172) 
~[storm-client-2.2.0.y.jar:2.2.0.y]
 at org.apache.storm.executor.bolt.BoltExecutor$1.call(BoltExecutor.java:159) 
~[storm-client-2.2.0.y.jar:2.2.0.y]
 at org.apache.storm.utils.Utils$1.run(Utils.java:434) 
[storm-client-2.2.0.y.jar:2.2.0.y]
 at java.lang.Thread.run(Thread.java:748) [?:1.8.0_242]
Caused by: java.lang.NullPointerException
 at 
com.codahale.metrics.jvm.ThreadStatesGaugeSet.getThreadCount(ThreadStatesGaugeSet.java:95)
 ~[metrics-jvm-3.2.6.jar:3.2.6]
 at 
com.codahale.metrics.jvm.ThreadStatesGaugeSet.access$000(ThreadStatesGaugeSet.java:20)
 ~[metrics-jvm-3.2.6.jar:3.2.6]
 at 
com.codahale.metrics.jvm.ThreadStatesGaugeSet$1.getValue(ThreadStatesGaugeSet.java:56)
 ~[metrics-jvm-3.2.6.jar:3.2.6]
 at org.apache.storm.executor.Executor.addV2Metrics(Executor.java:344) 
~[storm-client-2.2.0.y.jar:2.2.0.y]
 at org.apache.storm.executor.Executor.metricsTick(Executor.java:320) 
~[storm-client-2.2.0.y.jar:2.2.0.y]
 at 
org.apache.storm.executor.bolt.BoltExecutor.tupleActionFn(BoltExecutor.java:218)
 ~[storm-client-2.2.0.y.jar:2.2.0.y]
 at org.apache.storm.executor.Executor.accept(Executor.java:287) 
~[storm-client-2.2.0.y.jar:2.2.0.y]
 ... 6 more
{code}


This is due to a race condition in CachedGauge 
https://github.com/dropwizard/metrics/blob/v3.2.6/metrics-core/src/main/java/com/codahale/metrics/CachedGauge.java#L49-L53

There are two issues here.
The first one is STORM-3623 
https://github.com/apache/storm/blob/master/storm-client/src/jvm/org/apache/storm/executor/Executor.java#L335-L341.
 
This makes all the executors to get values for all the metrics. So multiple 
threads will access the same metric.

So the threads gauges are now accessed by multiple threads. But in CachedGauge,


{code:java}
 @Override
    public T getValue() {
        if (shouldLoad()) {
            this.value = loadValue();
        }
        return value;
    }
{code}

this method is not thread-safe. Two threads can reach to getValue at the same 
time.
The first thread reaching shouldLoad knows it needs to reload, so it calls the 
next line this.value=loadValue()
The second thread is a little bit late so shouldLoad returns false. Then it 
returns the value directly.

There is a race condition between first thread calling loadValue() and the 
second thread returning value.

If the first thread finishes loadValue() first, both values returned to the 
threads are the same value (and current value). But if the second thread 
returns earlier, the second thread gets the original value (which is null ), 
hence NPE.

To summarize, the second issue is CachedThreadStatesGaugeSet is not thread-safe

To fix this NPE, we should avoid using CachedThreadStatesGaugeSet. 

But we still need to fix STORM-3623 
https://github.com/apache/storm/blob/master/storm-client/src/jvm/org/apache/storm/executor/Executor.java#L335-L341
  to avoid unnecessary computations and redundant metrics.


> Race Condition in CachedThreadStatesGaugeSet registered at SystemBolt
> ---------------------------------------------------------------------
>
>                 Key: STORM-3622
>                 URL: https://issues.apache.org/jira/browse/STORM-3622
>             Project: Apache Storm
>          Issue Type: Bug
>    Affects Versions: 2.2.0
>            Reporter: Ethan Li
>            Assignee: Ethan Li
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> We noticed that with the change in https://github.com/apache/storm/pull/3242, 
> there is a race condition causing NPE.
> {code:java}
> 2020-04-14 18:22:12.997 o.a.s.u.Utils Thread-17-__acker-executor[16, 16] 
> [ERROR] Async loop died!
> java.lang.RuntimeException: java.lang.NullPointerException
>  at org.apache.storm.executor.Executor.accept(Executor.java:291) 
> ~[storm-client-2.2.0.y.jar:2.2.0.y]
>  at org.apache.storm.utils.JCQueue.consumeImpl(JCQueue.java:131) 
> ~[storm-client-2.2.0.y.jar:2.2.0.y]
>  at org.apache.storm.utils.JCQueue.consume(JCQueue.java:111) 
> ~[storm-client-2.2.0.y.jar:2.2.0.y]
>  at org.apache.storm.executor.bolt.BoltExecutor$1.call(BoltExecutor.java:172) 
> ~[storm-client-2.2.0.y.jar:2.2.0.y]
>  at org.apache.storm.executor.bolt.BoltExecutor$1.call(BoltExecutor.java:159) 
> ~[storm-client-2.2.0.y.jar:2.2.0.y]
>  at org.apache.storm.utils.Utils$1.run(Utils.java:434) 
> [storm-client-2.2.0.y.jar:2.2.0.y]
>  at java.lang.Thread.run(Thread.java:748) [?:1.8.0_242]
> Caused by: java.lang.NullPointerException
>  at 
> com.codahale.metrics.jvm.ThreadStatesGaugeSet.getThreadCount(ThreadStatesGaugeSet.java:95)
>  ~[metrics-jvm-3.2.6.jar:3.2.6]
>  at 
> com.codahale.metrics.jvm.ThreadStatesGaugeSet.access$000(ThreadStatesGaugeSet.java:20)
>  ~[metrics-jvm-3.2.6.jar:3.2.6]
>  at 
> com.codahale.metrics.jvm.ThreadStatesGaugeSet$1.getValue(ThreadStatesGaugeSet.java:56)
>  ~[metrics-jvm-3.2.6.jar:3.2.6]
>  at org.apache.storm.executor.Executor.addV2Metrics(Executor.java:344) 
> ~[storm-client-2.2.0.y.jar:2.2.0.y]
>  at org.apache.storm.executor.Executor.metricsTick(Executor.java:320) 
> ~[storm-client-2.2.0.y.jar:2.2.0.y]
>  at 
> org.apache.storm.executor.bolt.BoltExecutor.tupleActionFn(BoltExecutor.java:218)
>  ~[storm-client-2.2.0.y.jar:2.2.0.y]
>  at org.apache.storm.executor.Executor.accept(Executor.java:287) 
> ~[storm-client-2.2.0.y.jar:2.2.0.y]
>  ... 6 more
> {code}
> This is due to a race condition in CachedGauge 
> https://github.com/dropwizard/metrics/blob/v3.2.6/metrics-core/src/main/java/com/codahale/metrics/CachedGauge.java#L49-L53
> There are two issues here.
> The first one is STORM-3623. 
> This makes all the executors to get values for all the metrics. So multiple 
> threads will access the same metric.
> So the threads gauges are now accessed by multiple threads. But in 
> CachedGauge,
> {code:java}
>  @Override
>     public T getValue() {
>         if (shouldLoad()) {
>             this.value = loadValue();
>         }
>         return value;
>     }
> {code}
> this method is not thread-safe. Two threads can reach to getValue at the same 
> time.
> The first thread reaching shouldLoad knows it needs to reload, so it calls 
> the next line this.value=loadValue()
> The second thread is a little bit late so shouldLoad returns false. Then it 
> returns the value directly.
> There is a race condition between first thread calling loadValue() and the 
> second thread returning value.
> If the first thread finishes loadValue() first, both values returned to the 
> threads are the same value (and current value). But if the second thread 
> returns earlier, the second thread gets the original value (which is null ), 
> hence NPE.
> To summarize, the second issue is CachedThreadStatesGaugeSet is not 
> thread-safe
> To fix this NPE, we should avoid using CachedThreadStatesGaugeSet. 
> But we still need to fix STORM-3623  to avoid unnecessary computations and 
> redundant metrics.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to