[ https://issues.apache.org/jira/browse/CASSANDRA-10580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15055007#comment-15055007 ]
Paulo Motta commented on CASSANDRA-10580: ----------------------------------------- Looking good. A few comments: * I think you're fine using {{System.currentTimeMillis()}} instead of {{ApproximateTime}} to avoid imprecisions. * Rename {{TimeTaken}} to {{DroppedLatency}} to be consistent with other similar metric names (https://wiki.apache.org/cassandra/Metrics). ** Actually, I think it's better to have 2 metrics {{InternalDroppedLatency}} and {{CrossNodeDroppedLatency}}, since they will be quite different (see CASSANDRA-9793 for more information). * Add tests to check metrics are correct, probably on {{MessagingServiceTests}} * You'll probably also want to verify if the metrics are working by bringing up a cluster manually or with ccm, stress it with cassandra-stress and see if new metrics are being recorded correctly via jmx with visualvm. * The latest patch does not apply with {{fatal: corrupt patch at line 125}}, I don't know exactly what's that. I wonder if it's a cross-platform thing. Are you able to apply it locally? Answering your questions: bq. Also, a question: It appears that Timer.Update appends entries to the metric (which is what we want). Do you know at what point it starts dropping new appends / starts giving up ? I wonder if there is a huge number of dropped mutations will the timeTaken metric mess up ? I think the metrics package already handles that. I think {{Timer}} metrics keeps running averages and not the actual values, so no need to cleanup afaik. bq. To make this work for CF, I will probably pass the mutation to MessagingService.LogDroppedMessages (maybe through an overload) and update the metrics on appropriate CF. Does that make sense ? sounds good bq. If this change looks good, I am more inclined towards making this work for CF before making up patches for old branches. Let me know if that's okay. sure, watch out for additional details with CF metrics such as cleaning up the metrics if CF is dropped, etc. You'll probably want to integrate this with the {{TableMetrics}} class. > On dropped mutations, more details should be logged. > ---------------------------------------------------- > > Key: CASSANDRA-10580 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10580 > Project: Cassandra > Issue Type: Improvement > Components: Coordination > Environment: Production > Reporter: Anubhav Kale > Assignee: Anubhav Kale > Priority: Minor > Fix For: 3.2, 2.2.x > > Attachments: 10580-Metrics.patch, 10580.patch, > CASSANDRA-10580-Head.patch, Trunk.patch > > > In our production cluster, we are seeing a large number of dropped mutations. > At a minimum, we should print the time the thread took to get scheduled > thereby dropping the mutation (We should also print the Message / Mutation so > it helps in figuring out which column family was affected). This will help > find the right tuning parameter for write_timeout_in_ms. > The change is small and is in StorageProxy.java and MessagingTask.java. I > will submit a patch shortly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)