     [ https://issues.apache.org/jira/browse/CASSANDRA-20878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benedict Elliott Smith updated CASSANDRA-20878:
-----------------------------------------------
    Resolution: Fixed
        Status: Resolved  (was: Triage Needed)

> Improve Accord Observability
> ----------------------------
>
>                 Key: CASSANDRA-20878
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-20878
>             Project: Apache Cassandra
>          Issue Type: Improvement
>          Components: Accord
>            Reporter: Benedict Elliott Smith
>            Priority: Normal
>
>     Improve Observability:
>      - Track all active Coordinations
>      - Refactor Replica/Coordinator metrics and report Coordinator 
> exhausted/preempted/timeout
>      - DurabilityQueue metrics and visibility
>     Also Fix:
>      - WaitingState can cause a distributed stall when asked to wait for 
> CanApply if not yet PreCommitted; track a separate querying state and advance 
> this to the next achievable state rather than the desired final state
>      - Stalled coordinators should not prevent recovery
>      - Edge case with fetch unable to make progress when pre-bootstrap and 
> all peers have GC'd
>      - Dependency initialisation for sync points across certain ownership 
> changes
>      - SyncPoint propagation may not include all of the epochs required on 
> the receiving node for ranges they have lost but not closed, and the 
> receiving node does not validate them
>      - Stable tracker accounting with LocalExecute
>      - Do not prune non-durable APPLIED, as it must be reported in 
> dependencies until durably applied (so as not to break recovery)
>      - Ensure we cannot race with replies when initiating Coordination
>      - ProgressLog does not guarantee to clear home or waiting states when 
> erased or invalidated by compaction
>      - WaitingState on non-home shard cannot guarantee progress once home 
> shard is Erased
>      - WaitingOnSync handles retired ranges incorrectly
>     Also Improve:
>      - Standardise failure accounting, use null to represent single reply 
> timeouts
>      - BurnTest record/replay to/from file



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

