[ 
https://issues.apache.org/jira/browse/CASSANDRA-10580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15053675#comment-15053675
 ] 

Paulo Motta commented on CASSANDRA-10580:
-----------------------------------------

While re-reviewing your patch I noticed that we already log dropped messages in 
{{MessagingService.logDroppedMetrics()}}. We used to log dropped messages 
individually, but that became very verbose, so we started logging a summary 
every minute instead (more details on CASSANDRA-1284). Sorry for not checking 
this before.

I think a more robust/elegant approach is to add a new {{Timer}} metric, 
{{droppedTime}} or {{timeTaken}}, to {{DroppedMessageMetrics}}, and print the 
average dropped time in {{MessagingService.logDroppedMetrics()}}. One benefit 
of this approach is that the metric automatically becomes available to plot 
and consume in real time via JMX. Another, more aesthetic, benefit is that we 
would not need to repeat the logging logic in {{MessageDeliveryTask}}, 
{{LocalMutationRunnable}} and {{DroppableRunnable}}, since they already report 
statistics to {{MessagingService}} via {{incrementDroppedMessages()}}.
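
As a rough illustration of the idea only (this is not Cassandra code, and 
{{DroppedMessageStats}}, {{recordDropped()}} and {{summary()}} are hypothetical 
names), the sketch below records the time taken for each dropped message in one 
place and derives the average for the periodic summary line. It uses plain JDK 
classes instead of a metrics-library {{Timer}} so that it stays 
dependency-free:

```java
import java.util.Locale;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical stand-in for a per-verb dropped-message metric holder.
// A real implementation would likely use a metrics-library Timer; LongAdder
// is used here only to keep the sketch free of external dependencies.
final class DroppedMessageStats
{
    private final LongAdder dropped = new LongAdder();
    private final LongAdder totalTimeTakenNanos = new LongAdder();

    // Called from the single place where a dropped message is recorded.
    void recordDropped(long timeTaken, TimeUnit unit)
    {
        dropped.increment();
        totalTimeTakenNanos.add(unit.toNanos(timeTaken));
    }

    long droppedCount()
    {
        return dropped.sum();
    }

    // Mean time taken by dropped messages, in milliseconds (0 if none).
    double meanTimeTakenMillis()
    {
        long count = dropped.sum();
        return count == 0 ? 0.0 : totalTimeTakenNanos.sum() / (count * 1_000_000.0);
    }

    // Builds the one-line summary a periodic logger could emit.
    String summary(String verb)
    {
        return String.format(Locale.ROOT,
                             "%s messages dropped: %d, mean time taken: %.1f ms",
                             verb, droppedCount(), meanTimeTakenMillis());
    }

    public static void main(String[] args)
    {
        DroppedMessageStats stats = new DroppedMessageStats();
        stats.recordDropped(2500, TimeUnit.MILLISECONDS);
        stats.recordDropped(3500, TimeUnit.MILLISECONDS);
        System.out.println(stats.summary("MUTATION"));
    }
}
```

Because the callers only report to one place, the three runnables would not 
need any logging code of their own under this design.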

In order to provide dropped mutation metrics per KS/Table we would need to add 
a new counter metric, {{droppedMutations}}, to {{TableMetrics}}. This will be a 
bit more complex but still doable, so we can leave it for another ticket if you 
don't want to do it now. If you're not familiar with the metrics system, you 
can have a look at the classes whose names end in {{Metrics}} for more 
background.
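
To make the per-table idea concrete, here is a minimal, dependency-free sketch. 
It is an assumption-laden illustration rather than how {{TableMetrics}} is 
implemented: {{DroppedMutationCounters}}, {{markDropped()}} and the 
"keyspace.table" key are hypothetical, and in practice the counter would 
presumably live on each table's own metrics object rather than in a shared map:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical registry of dropped-mutation counters keyed by "keyspace.table".
final class DroppedMutationCounters
{
    private final Map<String, LongAdder> counters = new ConcurrentHashMap<>();

    // Increments the counter for the given table, creating it on first use.
    void markDropped(String keyspace, String table)
    {
        counters.computeIfAbsent(keyspace + "." + table, k -> new LongAdder())
                .increment();
    }

    // Returns the number of dropped mutations recorded for the table so far.
    long droppedMutations(String keyspace, String table)
    {
        LongAdder counter = counters.get(keyspace + "." + table);
        return counter == null ? 0L : counter.sum();
    }

    public static void main(String[] args)
    {
        DroppedMutationCounters metrics = new DroppedMutationCounters();
        metrics.markDropped("ks1", "users");
        metrics.markDropped("ks1", "users");
        metrics.markDropped("ks1", "events");
        System.out.println(metrics.droppedMutations("ks1", "users"));
    }
}
```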

Please let me know if you need some help with this approach.

> On dropped mutations, more details should be logged.
> ----------------------------------------------------
>
>                 Key: CASSANDRA-10580
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10580
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Coordination
>         Environment: Production
>            Reporter: Anubhav Kale
>            Assignee: Anubhav Kale
>            Priority: Minor
>             Fix For: 3.2, 2.2.x
>
>         Attachments: 10580.patch, CASSANDRA-10580-Head.patch, Trunk.patch
>
>
> In our production cluster, we are seeing a large number of dropped mutations. 
> At a minimum, we should print the time the thread took to get scheduled 
> thereby dropping the mutation (We should also print the Message / Mutation so 
> it helps in figuring out which column family was affected). This will help 
> find the right tuning parameter for write_timeout_in_ms. 
> The change is small and is in StorageProxy.java and MessagingTask.java. I 
> will submit a patch shortly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
