[ https://issues.apache.org/jira/browse/CASSANDRA-10580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15053675#comment-15053675 ]
Paulo Motta commented on CASSANDRA-10580:
-----------------------------------------

While re-reviewing your patch I noticed that we already log dropped messages in {{MessageDeliveryTask.logDroppedMetrics()}}. We used to log dropped messages individually, but that became very verbose, so we started logging a summary every minute instead (more details on CASSANDRA-1284). Sorry for not checking this before.

I think a more robust and elegant approach is to provide a new {{Timer}} metric, {{droppedTime}} or {{timeTaken}}, on {{DroppedMessageMetrics}}, and print the average dropped time in {{MessagingService.logDroppedMetrics()}}. One benefit of this approach is that the metric can automatically be plotted and consumed in real time via JMX. Another, more aesthetic, benefit is that we would not need to repeat the logging logic in {{MessageDeliveryTask}}, {{LocalMutationRunnable}} and {{DroppableRunnable}}, since they already report statistics to {{MessagingService}} via {{incrementDroppedMessages()}}.

In order to provide dropped mutation metrics per KS/Table, we would need to add a new counter metric {{droppedMutations}} to {{TableMetrics}}. This will be a bit more complex but is still doable, so we can leave it for another ticket if you don't want to do it now.

If you're not familiar with the metrics system, you can have a look at the classes with names ending in {{Metrics}} for more background. Please let me know if you need some help with this approach.

> On dropped mutations, more details should be logged.
> ----------------------------------------------------
>
>                 Key: CASSANDRA-10580
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10580
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Coordination
>         Environment: Production
>            Reporter: Anubhav Kale
>            Assignee: Anubhav Kale
>            Priority: Minor
>             Fix For: 3.2, 2.2.x
>
>         Attachments: 10580.patch, CASSANDRA-10580-Head.patch, Trunk.patch
>
>
> In our production cluster, we are seeing a large number of dropped mutations.
> At a minimum, we should print the time the thread took to get scheduled
> thereby dropping the mutation (We should also print the Message / Mutation so
> it helps in figuring out which column family was affected). This will help
> find the right tuning parameter for write_timeout_in_ms.
> The change is small and is in StorageProxy.java and MessagingTask.java. I
> will submit a patch shortly.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
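
For illustration only, the idea in the comment above (record the time each dropped message waited, then report the average in the periodic summary instead of logging every drop) could be sketched roughly as follows. This is a self-contained stand-in, not Cassandra code: the class {{DroppedTimeSketch}} and its methods are hypothetical names, and it uses plain JDK atomics rather than the metrics library's {{Timer}}, so it shows only the bookkeeping, not JMX exposure.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

/**
 * Hypothetical stand-in for a per-verb dropped-message metric.
 * Not Cassandra's DroppedMessageMetrics; all names here are assumptions.
 */
final class DroppedTimeSketch {
    private final String verb;
    private final AtomicLong dropped = new AtomicLong();
    private final AtomicLong totalNanos = new AtomicLong();

    DroppedTimeSketch(String verb) {
        this.verb = verb;
    }

    /** Record one dropped message and how long it waited before being dropped. */
    void recordDropped(long waited, TimeUnit unit) {
        dropped.incrementAndGet();
        totalNanos.addAndGet(unit.toNanos(waited));
    }

    long droppedCount() {
        return dropped.get();
    }

    /**
     * Mean wait in milliseconds, or 0 if nothing has been dropped.
     * The two counters are read separately, so under concurrent updates
     * the value is approximate, which is acceptable for a log summary.
     */
    long meanDroppedMillis() {
        long n = dropped.get();
        return n == 0 ? 0 : TimeUnit.NANOSECONDS.toMillis(totalNanos.get() / n);
    }

    /** One summary line, in the spirit of a once-a-minute log entry. */
    String summary() {
        return String.format("%d %s messages were dropped; mean time before drop: %d ms",
                             droppedCount(), verb, meanDroppedMillis());
    }
}
```

Callers that already report a drop would pass the elapsed time along with it, and the periodic logger would print {{summary()}} once per interval, so no per-message logging is needed in the individual runnables.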