In one of our load tests, we're incrementing a single counter column as
well as appending columns to a single row (essentially a timeline). You can
think of it as counting the instances of an event and then keeping a
timeline of those events. The ratio of increments to "appends" is 1:1.
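
To make the workload shape concrete, here is a rough in-memory sketch of what
each event does in the test. It is not our load-test code and does not talk to
a cluster; ToyStore, record_event and the "event_count" name are made up for
illustration.

```python
from collections import defaultdict


class ToyStore:
    """In-memory stand-in for the cluster (hypothetical, not a real client)."""

    def __init__(self):
        self.counters = defaultdict(int)  # counter column name -> value
        self.timeline = {}                # column name (timestamp) -> payload

    def record_event(self, ts, payload):
        # One counter increment per event...
        self.counters["event_count"] += 1
        # ...and one new column appended to the single timeline row.
        self.timeline[ts] = payload


store = ToyStore()
for ts in range(5):
    store.record_event(ts, f"event-{ts}")
```

Every event produces exactly one increment and one append, which is where the
1:1 ratio comes from.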

When we run this on a test cluster with RF = 3, one node gets backed up
with a lot of pending replicate-on-write tasks, eventually maxing out at
4128. We think a disk I/O issue is causing the slowdown (we see a lot of
reads), but we're still investigating. A few questions that might help us
understand the issue faster:

1. Is there any way to see metadata about the replicate on write tasks
pending? We're splitting apart the load test to pinpoint which of those
operations is causing an issue, but if there's a way to see that queue,
that might save us some work.

2. I'm assuming the cause in our case is the counter increments, because disk
reads are part of the write path for counters but not for appending columns
to a row. Does that logic make sense?
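
Here is the assumption behind question 2 written out as a toy model. It only
encodes my guess (one local read per counter increment, none per append) and
is not based on the actual server code; ToyReplica and its fields are
hypothetical.

```python
class ToyReplica:
    """Toy model of one node; the read accounting is an assumption."""

    def __init__(self):
        self.counter = 0
        self.timeline = []
        self.reads = 0  # number of local reads triggered by writes

    def increment(self, delta):
        self.counter += delta
        # Assumed: the node reads the current value back before replicating it.
        self.reads += 1

    def append(self, column):
        # Assumed: a plain column append needs no read at all.
        self.timeline.append(column)


replica = ToyReplica()
for i in range(1000):
    replica.increment(1)  # 1:1 ratio, as in the load test
    replica.append(i)
```

If that model is right, all of the write-triggered reads in this run come from
the increments and none from the appends.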

Thanks in advance,
Andrew
