[ 
https://issues.apache.org/jira/browse/CASSANDRA-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16587397#comment-16587397
 ] 

Alex Petrov commented on CASSANDRA-14409:
-----------------------------------------

[Link for 
compare|https://github.com/aweisberg/cassandra/compare/2a97227550af593e63583fb1336cb94d746d1838...14409-4],
 may be useful (at least until rebase).

Some preliminary comments (from looking at the code and running unit tests, 
without writing additional dtests so far):
  * 
[here|https://github.com/aweisberg/cassandra/commit/62578c85865c474774633f5337affa8c2ce0eb07#diff-350352a1ac9b039efbee2eeb8978a9c9R149]
 we can do just one iteration and add to one of the sets depending on whether 
the range is full or transient, similar change can be done in 
{{RangeStreamer#fetchAsync}} with {{remaining}} and {{skipped}}. Generally, 
{{fetchAsync}} uses a two-step logic which is quite hard to follow, since it 
becomes hard to track which ranges are actually making it to the end and does 
quite some unnecessary iteration.
  * 
[here|https://github.com/aweisberg/cassandra/commit/62578c85865c474774633f5337affa8c2ce0eb07#diff-d4e3b82e9bebfd2cb466b4a30af07fa4R1132],
 we can simplify the statement to {{... !needsCleanupTransient || 
!sstable.isRepaired()}}
  * 
[this|https://github.com/apache/cassandra/compare/trunk...aweisberg:14409-4#diff-fad052638059f53b1a6d479dbd05f2f2R323]
 might just use {{Replica#isLocal}}. Also, naming is inconsistent with 
{{isNotAlive}}, "is" is missing. Same inconsistency with 
{{endpointNotReplicatedAnymore}}.
  * maybe use a single debug statement 
[here|https://github.com/apache/cassandra/compare/trunk...aweisberg:14409-4#diff-fad052638059f53b1a6d479dbd05f2f2R299]
  * Might be worth to remove {{pritnlns}} from {{MoveTransientTest}} 
  * Should we stick to {{workMap}} or {{rangeMap}} in {{RangeStreamer}}?
  * 
[here|https://github.com/aweisberg/cassandra/compare/2a97227550af593e63583fb1336cb94d746d1838...14409-4#diff-fad052638059f53b1a6d479dbd05f2f2R219]
 we might want to avoid iteration if tracing is not enabled
  * in Range streamer, we keep calling {{Keyspace#open}} many times even though 
we could just do it less times and pass a key space instance where applicable
  * duplicate import 
[here|https://github.com/aweisberg/cassandra/compare/2a97227550af593e63583fb1336cb94d746d1838...14409-4#diff-a4c2ce49cb3a3429a8d6376929a90f7fR33]
  * 
[this|https://github.com/aweisberg/cassandra/compare/2a97227550af593e63583fb1336cb94d746d1838...14409-4#diff-433b489a9a55c01dc4b021b012af3af6R59]
 might need an extra comment, since this task is used in repair as well as in 
streaming. Same goes for 
[here|https://github.com/aweisberg/cassandra/compare/2a97227550af593e63583fb1336cb94d746d1838...14409-4#diff-eb7f85ba9cf3e87d842aad8b82557d19R82].
 It's not completely transparent why we strip node information and 
transientness. I'd rather fix the code that prevents us from not doing It or 
would just use ranges if this information is not important. 
  * 
[here|https://github.com/aweisberg/cassandra/compare/2a97227550af593e63583fb1336cb94d746d1838...14409-4#diff-ce3f6856b405c96859d9a50d9977e0b9R1285]
 (and in some other cases where pair is used), I'd just use a specialised Pair 
class that would give more precise definition of what left and right is. 
Namely: transient and full.
  * do we need to update legacy transferred ranges 
[here|https://github.com/aweisberg/cassandra/compare/2a97227550af593e63583fb1336cb94d746d1838...14409-4#diff-ce3f6856b405c96859d9a50d9977e0b9R1322]?
 However, this is there after [59b5b6bef0fa76bf5740b688fcd4d9cf525760d0], so 
not directly related to this commit.
  * optimised range fetch map calculation feels not fully done. We need to 
change the {{RangeFetchMapCalculator}} instead of unwrapping and then 
re-wrapping items in it instead. What it currently does looks very fragile.
  * {{RangeStreamer#calculateRangesToFetchWithPreferredEndpoints}} also looks 
unfinished, since it returns {{ReplicaMultimap<Replica, ReplicaList>}}, but we 
always convert it to {{Multimap<InetAddressAndPort, Pair<Replica, Replica>>}} 
in {{StorageService#calculateRangesToFetchWithPreferredEndpoints}} (same name, 
different return type, which makes it additionally difficult to follow), and 
then back to {{RangeFetchMapCalculator}} and 
{{convertPreferredEndpointsToWorkMap}}. 
  * it seems that we keep splitting full and transient replicas in a bunch of 
places, maybe we should add some auxiliary method to perform this job 
efficiently with a clear interface and use it everywhere?
  * in {{StorageService}}, use of {{Map.Entry<Replica, Replica>}} 
simultaneously with {{Pair<Replica, Replica>}} really complicates the logic. 
Once again hints we might need a specialised class for replica 
source/destination pairs.
  * in multiple places ({{StorageService}} and {{PendingRepairManager}}), we're 
still calling {{getAddressReplicas}} that materialises map only to get 
{{ReplicaSet}} by address 
  * it looks like the distinction between transient and full ranges during 
range transfer is superficial and is really used only in one place for logging, 
in {{RangeStreamer#fetchAsync}}. In the rest of cases the separation does not 
really exist: we stream ones just as the other ones, while adding a lot of 
complexity both to the system code and in the rest of places. 
{{requestRanges}}, just as {{transferRanges}} doesn't have to make this 
distinction, which will consequently simplify 
{{calculateRangesToFetchWithPreferredEndpoints}} interface, which is currently 
returning {{Multimap<InetAddressAndPort, Pair<Replica, Replica>>}} and instead 
can just return a {{ReplicaMultimap<InetAddressAndPort, ReplicaList>}}.
  * Generally, I'm not 100% sure about the logic that calculates stream and 
fetch ranges. While it was overstreaming, previous logic was quite 
straightforward. Now, when calculating the fetch ranges, we're just picking 
replicas. When calculating ranges to stream, we instead calculate range 
intersections. This once again hints that we might want to end up with a 
streaming plan that only says where to fetch from and what to stream to with 
ranges instead of having {{ReplicaList}} around it all.
  * in {{calculateRangesToStreamWithPreferredEndpoints}}, {{streamRanges}} seem 
not to really need to be {{ReplicaSet}}, since the only case where we really 
check for anything replica-related is {{AssertionError("not good")}}.
  * it seems that the concept of {{RangesByEndpoint}} suggested by [~benedict] 
will be highly beneficial for streaming since it will remove some (seemingly) 
unnecessary indirections and bring the concepts that belong together back 
together again. 

I'm open for the discussion on all the aforementioned points and am happy to 
give a hand in working on some of the points.

> Transient Replication: Support ring changes when transient replication is in 
> use (add/remove node, change RF, add/remove DC)
> ----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-14409
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14409
>             Project: Cassandra
>          Issue Type: Sub-task
>          Components: Coordination, Core, Documentation and Website
>            Reporter: Ariel Weisberg
>            Assignee: Ariel Weisberg
>            Priority: Major
>             Fix For: 4.0
>
>
> The additional state transitions that transient replication introduces 
> require streaming and nodetool cleanup to behave differently. We already have 
> code that does the streaming, but in some cases we shouldn't stream any data 
> and in others when we stream to receive data we have to make sure we stream 
> from a full replica and not a transient replica.
> Transitioning from not replicated to transiently replicated means that a node 
> must stay pending until the next incremental repair completes at which point 
> the data for that range is known to be available at full replicas.
> Transitioning from transiently replicated to fully replicated requires 
> streaming from a full replica and is identical to how we stream from not 
> replicated to replicated. The transition must be managed so the transient 
> replica is not read from as a full replica until streaming completes. It can 
> be used immediately for a write quorum.
> Transitioning from fully replicated to transiently replicated requires 
> cleanup to remove repaired data from the transiently replicated range to 
> reclaim space. It can be used immediately for a write quorum.
> Transitioning from transiently replicated to not replicated requires cleanup 
> to be run to remove the formerly transiently replicated data.
> nodetool move, removenode, cleanup, decommission, and rebuild need to handle 
> these issues as does bootstrap.
> Update web site, documentation, NEWS.txt with a description of the steps for 
> doing common operations. Add/remove DC, Add/remove node(s), replace node, 
> change RF.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org

Reply via email to