[ 
https://issues.apache.org/jira/browse/HBASE-21991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16787455#comment-16787455
 ] 

Sakthi edited comment on HBASE-21991 at 3/8/19 2:28 AM:
--------------------------------------------------------

[~xucang], Please ignore my above comment. I analyzed the possibilities a 
little more. Here's my take.

There are 2 possible code flows where we are modifying the state of the class.
 1:
{code:java}
// This is for the non top-k metrics
private void registerAndMarkMeterIfNotPresent(String name) {
      registerMeterIfNotPresent(name);
      markMeterIfPresent(name);
}
{code}
Here registerMeterIfNotPresent() does only puts. Registers in the registry and 
accordingly updates in the map. As the 'registry' takes care of multithreaded 
access by using 'computeIfAbsent' to put new meter, we are good here. So no 
duplicates in the registry or the map as we use registry.get to put in the map.

And, markMeterIfPresent() does only get & update. As, the metric.mark() looks 
like an atomic operation (uses Atomiclong inside), we look safe here from 
loosing updates.

2:
{code:java}
// This is for the top-k metrics
private void "topK"MetricRegisterAndMark()(String name) {
     registerLossyCountingMeterIfNotPresent(meter, LossyCounting);
     markMeterIfPresent(meter);
}
{code}
Here, we always need to maintain that the internal dataMap of the LossyCounting 
class exactly is equivalent to what we have in our requestsMap. So all the 
access to a particular lossyCounter and it's subsequent effect on the 
requestsMap & registry must be synchronized. Or else, we can end up having a 
metric in the registry that actually isn't in the internal dataMap of the 
lossyCounter for example like below:

 

Lossy Counter referred below as *LC.*

*Thread-1*: Add meter 'm'

*LC* [by Thread-1]: Added 'm' to the internal 'dataMap'. No data swept off. So, 
add it to the registry

*Thread-2*: Add meter 'm'

*LC* [by Thread-2]: Increment 'm' count in the 'dataMap'. Swept off. Removed 
'm' from the 'dataMap'. So, remove it from the registry.

*Thread-2*: A new meter, being removed. So don't add it in the registry and 
remove the remaining meters from the list given by *LC*; 

*Thread-1*: Add the meter to the registry as directed by the *LC*

 

*Voila!* LC doesn't have 'm'. But registry has the 'm'.

*Suggestion:* We don't need synchronization anywhere else in the class except 
in the below function:
{code:java}
private void registerLossyCountingMeterIfNotPresent(String requestMeter, 
LossyCounting lossyCounting) {
        ...
        synchronized(lossyCounting) {
        ...
        }
}
{code}
This way, even addition of a new lossyCounter like for the region-metrics too, 
wouldn't be an issue. Your thoughts? :)


was (Author: jatsakthi):
[~xucang], Please ignore my above comment. I analyzed the possibilities a 
little more. Here's my take.

There are 2 possible code flows where we are modifying the state of the class.
 1:
{code:java}
// This is for the non top-k metrics
private void registerAndMarkMeterIfNotPresent(String name) {
      registerMeterIfNotPresent(name);
      markMeterIfPresent(name);
}
{code}
Here registerMeterIfNotPresent() does only puts. Registers in the registry and 
accordingly updates in the map. As the 'registry' takes care of multithreaded 
access by using 'computeIfAbsent' to put new meter, we are good here. So no 
duplicates in the registry or the map as we use registry.get to put in the map.

And, markMeterIfPresent() does only get & update. As, the metric.mark() looks 
like an atomic operation (uses Atomiclong inside), we look safe here from 
loosing updates.

2:
{code:java}
// This is for the top-k metrics
private void "topK"MetricRegisterAndMark()(String name) {
     registerLossyCountingMeterIfNotPresent(meter, LossyCounting);
     markMeterIfPresent(meter);
}
{code}
Here, we always need to maintain that the internal dataMap of the LossyCounting 
class exactly is equivalent to what we have in our requestsMap. So all the 
access to a particular lossyCounter and it's subsequent effect on the 
requestsMap & registry must be synchronized. Or else, we can end up having a 
metric in the registry that actually isn't in the internal dataMap of the 
lossyCounter for example like below:

 

Lossy Counter referred below as *LC.*

*Thread-1*: Add meter 'm'

*LC* [by Thread-1]: Added 'm' to the internal 'dataMap'. Add it to the registry

*Thread-2*: Add meter 'm'

*LC* [by Thread-2]: Increment 'm' count in the 'dataMap'. Swept off. Removed 
'm' from the 'dataMap'. So, remove it from the registry.

*Thread-2*: A new meter, being removed. So don't add it in the registry and 
remove the remaining meters from the list given by *LC*; 

*Thread-1*: Add the meter to the registry as directed by the *LC*

 

*Voila!* LC doesn't have 'm'. But registry has the 'm'.

*Suggestion:* We don't need synchronization anywhere else in the class except 
in the below function:
{code:java}
private void registerLossyCountingMeterIfNotPresent(String requestMeter, 
LossyCounting lossyCounting) {
        ...
        synchronized(lossyCounting) {
        ...
        }
}
{code}
This way, even addition of a new lossyCounter like for the region-metrics too, 
wouldn't be an issue. Your thoughts? :)

> Fix MetaMetrics issues - [Race condition, Faulty remove logic], few 
> improvements
> --------------------------------------------------------------------------------
>
>                 Key: HBASE-21991
>                 URL: https://issues.apache.org/jira/browse/HBASE-21991
>             Project: HBase
>          Issue Type: Bug
>          Components: Coprocessors, metrics
>            Reporter: Sakthi
>            Assignee: Sakthi
>            Priority: Major
>         Attachments: hbase-21991.master.001.patch
>
>
> Here is a list of the issues related to the MetaMetrics implementation:
> +*Bugs*+:
>  # [_Lossy counting for top-k_] *Faulty remove logic of non-eligible meters*: 
> Under certain conditions, we might end up storing/exposing all the meters 
> rather than top-k-ish
>  # MetaMetrics can throw NPE resulting in aborting of the RS because of a 
> *Race Condition*.
> +*Improvements*+:
>  # With high number of regions in the cluster, exposure of metrics for each 
> region blows up the JMX from ~140 Kbs to 100+ Mbs depending on the number of 
> regions. It's better to use *lossy counting to maintain top-k for region 
> metrics* as well.
>  # As the lossy meters do not represent actual counts, I think, it'll be 
> better to *rename the meters to include "lossy" in the name*. It would be 
> more informative while monitoring the metrics and there would be less 
> confusion regarding actual counts to lossy counts.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to