Andrew Wong created KUDU-3310:
---------------------------------

             Summary: Checksum scan results for lagging replicas can be confusing
                 Key: KUDU-3310
                 URL: https://issues.apache.org/jira/browse/KUDU-3310
             Project: Kudu
          Issue Type: Improvement
          Components: ops-tooling
            Reporter: Andrew Wong

When running a checksum scan, we've seen cases where the following is reported:

{code}
Error: Remote error: Service unavailable: Timed out: could not wait for desired snapshot timestamp to be consistent: Timed out waiting for ts: P: 1621906798986764 usec, L: 0 to be safe (mode: NON-LEADER). Current safe time: P: 1621906798962044 usec, L: 0 Physical time difference: 0.025s
{code}

and this results in messages like:

{code}
Aborted: checksum scan error: 1 errors were detected
{code}

Without much context about Kudu, this makes it seem like there is some corruption between replicas, even though the issue is just that the replica is lagging a bit.

We should consider either:
- allowing the wait time to be configured when running the tool, or
- rewording the result such that it's clear the scan failed and no checksums were verified for the tablet
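
To illustrate the second option, here is a rough sketch in Python (not Kudu code) of how a result message could separate "the scan never ran" from "checksums disagreed". The helper names `safe_time_lag_usec` and `describe` are hypothetical, and the pattern assumes the `P: <usec> usec, L: <n>` layout of the error reported in this issue; the real tool would presumably work from status codes rather than by parsing text.

```python
import re

# Sample text taken from the error in this issue, with the desired
# timestamp written as a single number.
SAMPLE = (
    "Timed out: could not wait for desired snapshot timestamp to be "
    "consistent: Timed out waiting for ts: P: 1621906798986764 usec, L: 0 "
    "to be safe (mode: NON-LEADER). Current safe time: "
    "P: 1621906798962044 usec, L: 0 Physical time difference: 0.025s"
)

# Assumed layout of the physical component of a timestamp in the message.
_PHYSICAL_TS = re.compile(r"P:\s*(\d+)\s*usec")


def safe_time_lag_usec(message):
    """Return desired-minus-safe time in microseconds, or None if the
    message is not a snapshot-timestamp timeout in the assumed layout."""
    if "could not wait for desired snapshot timestamp" not in message:
        return None
    stamps = _PHYSICAL_TS.findall(message)
    if len(stamps) < 2:
        return None
    desired, safe = int(stamps[0]), int(stamps[1])
    return desired - safe


def describe(message):
    """Build a result line that says whether any checksum was verified."""
    lag = safe_time_lag_usec(message)
    if lag is None:
        # Anything else is passed through unchanged.
        return "checksum scan error: " + message
    return (
        "checksum scan did not run: replica safe time is lagging by "
        f"{lag / 1e6:.3f}s; no checksums were verified for this tablet"
    )


if __name__ == "__main__":
    print(describe(SAMPLE))
```

For the sample above, the two timestamps differ by 24720 usec, which matches the 0.025s physical time difference in the original error.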