[ 
https://issues.apache.org/jira/browse/CASSANDRA-10580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15055007#comment-15055007
 ] 

Paulo Motta commented on CASSANDRA-10580:
-----------------------------------------

Looking good. A few comments:
* I think you're fine using {{System.currentTimeMillis()}} instead of 
{{ApproximateTime}} to avoid imprecisions.
* Rename {{TimeTaken}} to {{DroppedLatency}} to be consistent with other 
similar metric names (https://wiki.apache.org/cassandra/Metrics).
** Actually, I think it's better to have 2 metrics {{InternalDroppedLatency}} 
and {{CrossNodeDroppedLatency}}, since they will be quite different (see 
CASSANDRA-9793 for more information).
* Add tests to check metrics are correct, probably on {{MessagingServiceTests}}
* You'll probably also want to verify if the metrics are working by bringing up 
a cluster manually or with ccm, stress it with cassandra-stress and see if new 
metrics are being recorded correctly via jmx with visualvm.
* The latest patch does not apply with {{fatal: corrupt patch at line 125}}, I 
don't know exactly what's that. I wonder if it's a cross-platform thing. Are 
you able to apply it locally?

Answering your questions:

bq. Also, a question: It appears that Timer.Update appends entries to the 
metric (which is what we want). Do you know at what point it starts dropping 
new appends / starts giving up ? I wonder if there is a huge number of dropped 
mutations will the timeTaken metric mess up ?

I think the metrics package already handles that. I think {{Timer}} metrics 
keeps running averages and not the actual values, so no need to cleanup afaik.

bq. To make this work for CF, I will probably pass the mutation to 
MessagingService.LogDroppedMessages (maybe through an overload) and update the 
metrics on appropriate CF. Does that make sense ?

sounds good

bq. If this change looks good, I am more inclined towards making this work for 
CF before making up patches for old branches. Let me know if that's okay.

sure, watch out for additional details with CF metrics such as cleaning up the 
metrics if CF is dropped, etc. You'll probably want to integrate this with the 
{{TableMetrics}} class.

> On dropped mutations, more details should be logged.
> ----------------------------------------------------
>
>                 Key: CASSANDRA-10580
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10580
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Coordination
>         Environment: Production
>            Reporter: Anubhav Kale
>            Assignee: Anubhav Kale
>            Priority: Minor
>             Fix For: 3.2, 2.2.x
>
>         Attachments: 10580-Metrics.patch, 10580.patch, 
> CASSANDRA-10580-Head.patch, Trunk.patch
>
>
> In our production cluster, we are seeing a large number of dropped mutations. 
> At a minimum, we should print the time the thread took to get scheduled 
> thereby dropping the mutation (We should also print the Message / Mutation so 
> it helps in figuring out which column family was affected). This will help 
> find the right tuning parameter for write_timeout_in_ms. 
> The change is small and is in StorageProxy.java and MessagingTask.java. I 
> will submit a patch shortly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to