I would say a few things. A lot of interesting things are going on in the
software, and many of them are worth measuring.

We have several queues and thread pools.

It makes sense to put gauges
(http://metrics.dropwizard.io/3.1.0/getting-started/#gauges) around those.
This will give us visibility into how close they are to 0 at any given
time.

We now have per-node data:

https://issues.apache.org/jira/browse/GOSSIP-21
https://issues.apache.org/jira/browse/GOSSIP-25

It makes sense to use gauges to record the size of these. We should also
use meters to count how many operations/sec are caused by users adding data
as well as by the internode process replicating data.

For PassiveGossipThread I could see us counting messages received as a
meter. We could count corrupt messages separately as a meter. We could also
capture this data per host:

gossipfrom.node1.goodmessages
gossipfrom.node1.badmessages

As well as globally:

gossipfrom.badmessages
gossipfrom.goodmessages
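
As a rough sketch of the naming scheme above, here is one way to bump the
per-host and the global counter together. GossipMessageCounters and its
methods are hypothetical names, not existing project code, and a plain
LongAdder stands in for a Dropwizard meter so the sketch needs only the JDK.
With Dropwizard you would call MetricRegistry.meter(name).mark() in the same
two places and get rates as well as counts.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical helper: counts received messages per sending host and globally. */
class GossipMessageCounters {
    private final Map<String, LongAdder> counters = new ConcurrentHashMap<>();

    /** Record one received message; valid == false means the message was corrupt. */
    void record(String host, boolean valid) {
        String suffix = valid ? "goodmessages" : "badmessages";
        increment("gossipfrom." + host + "." + suffix); // per-host counter
        increment("gossipfrom." + suffix);              // global counter
    }

    /** Current value of a counter, or 0 if nothing was recorded under that name. */
    long count(String name) {
        LongAdder adder = counters.get(name);
        return adder == null ? 0L : adder.sum();
    }

    private void increment(String name) {
        counters.computeIfAbsent(name, k -> new LongAdder()).increment();
    }

    public static void main(String[] args) {
        GossipMessageCounters c = new GossipMessageCounters();
        c.record("node1", true);
        c.record("node1", false);
        c.record("node2", true);
        System.out.println(c.count("gossipfrom.node1.goodmessages"));
        System.out.println(c.count("gossipfrom.goodmessages"));
    }
}
```

One thing to watch: a host name that itself contains dots would add extra
levels to the metric name, so it may need to be escaped first.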

For ActiveGossip we could use histograms to track the time to process:

sendSharedData
sendPerNodeData
sendMembership
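
A minimal sketch of timing those calls, assuming each one can be wrapped in
a Runnable. SendTimings is a hypothetical name, and it keeps every sample in
a list only to stay JDK-only. A real Dropwizard histogram or timer samples
into a bounded reservoir instead, so in practice I would expect
MetricRegistry.timer(name).time() and Timer.Context.stop() around the call.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

/** Hypothetical helper: records how long each named operation took, in nanoseconds. */
class SendTimings {
    private final Map<String, List<Long>> samples = new ConcurrentHashMap<>();

    /** Run the body and record its elapsed time, even if it throws. */
    void time(String operation, Runnable body) {
        long start = System.nanoTime();
        try {
            body.run();
        } finally {
            long elapsed = System.nanoTime() - start;
            samples.computeIfAbsent(operation, k -> new CopyOnWriteArrayList<Long>())
                   .add(elapsed);
        }
    }

    /** Number of recorded samples for the operation. */
    int sampleCount(String operation) {
        List<Long> values = samples.get(operation);
        return values == null ? 0 : values.size();
    }

    /** Slowest recorded run for the operation, or 0 if there are no samples. */
    long maxNanos(String operation) {
        List<Long> values = samples.get(operation);
        long max = 0L;
        if (values != null) {
            for (long v : values) {
                max = Math.max(max, v);
            }
        }
        return max;
    }

    public static void main(String[] args) {
        SendTimings timings = new SendTimings();
        // The empty lambdas stand in for the real send calls.
        timings.time("sendSharedData", () -> { });
        timings.time("sendPerNodeData", () -> { });
        timings.time("sendMembership", () -> { });
        System.out.println(timings.sampleCount("sendMembership"));
    }
}
```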

We could use a gauge to track the size of this.scheduledExecutorService =
Executors.newScheduledThreadPool(2); and other executors to make sure that
the queue is not backing up/blocked. Again, you can track this per host and
globally.
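
Something like the following could back such a gauge. It is only a sketch:
ExecutorQueueGauge is a hypothetical name, and it assumes we hold the
executor as a ScheduledThreadPoolExecutor, because the
ScheduledExecutorService interface does not expose the queue. A
java.util.function.Supplier stands in for a Dropwizard Gauge<Integer>, whose
getValue() would make the same call.

```java
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Hypothetical helper: exposes an executor's pending-task count as a gauge-like value. */
class ExecutorQueueGauge {

    /** Returns a supplier that reports how many tasks are waiting in the executor's queue. */
    static Supplier<Integer> queueDepth(ScheduledThreadPoolExecutor executor) {
        return () -> executor.getQueue().size();
    }

    public static void main(String[] args) {
        // Constructed directly so that getQueue() is available.
        ScheduledThreadPoolExecutor executor = new ScheduledThreadPoolExecutor(2);
        try {
            Supplier<Integer> depth = queueDepth(executor);
            // Schedule tasks far in the future so they stay queued.
            for (int i = 0; i < 3; i++) {
                executor.schedule(() -> { }, 1, TimeUnit.HOURS);
            }
            System.out.println("queued tasks: " + depth.get());
        } finally {
            executor.shutdownNow(); // discard the queued tasks and stop the workers
        }
    }
}
```

The gauge reads the queue only when it is polled, so it adds no work to the
gossip threads themselves.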

I am an ex-system administrator so I am generally ok with as many metrics
as possible as long as we do not clutter the code. There are ways to do
aspect/annotation driven counters as well so we can always look to refactor
around those things if we want to.

If you see something that seems like a point of possible contention, or
something that you believe is important to track, I would capture that. In
the long run there is something to consider about tracking metrics from 1k
node clusters, but we are not there yet, and metrics are generally lighter
than the code anyway.

Thanks for taking the time to look at this.
Edward





On Tue, Oct 11, 2016 at 2:04 PM, chandresh pancholi <
[email protected]> wrote:

> Hi,
>
> I wanted to know where to begin working on this issue.
> Someone please help me out with where to start and how to proceed with it.
>
> For Histogram i see ActiveThreadGroup and PassiveThreadGroup are doing
> inter-node operation.
>
> Where are we tracking success and failure request so generate meter
> metrics?
>
> Any kind of help is appreciable.
>
> --
> Chandresh Pancholi
> Senior Software Engineer
> Flipkart.com
> Email-id:[email protected]
> Contact:08951803660
>
