[jira] [Commented] (CASSANDRA-14409) Transient Replication: Support ring changes when transient replication is in use (add/remove node, change RF, add/remove DC)

2018-09-04 Thread Benedict (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16602977#comment-16602977
 ] 

Benedict commented on CASSANDRA-14409:
--

Fixes:
# We need to include pending transient->full movements in order to ensure 
correctness on write
#* For now at least, pending transient replicas need to receive all writes for 
correctness when they come online; but reads should still not touch them.
#* As it happens, natural and pending are not read in isolation, so these 
conflicts could also occur through races
#* We introduce some methods to detect conflicts between natural and pending, 
and to always prefer the isFull variant on write; reads continue to use only 
the natural replicas (see the sketch after this list)
# RangeStreamer.addRanges was not correctly disabled for transient replicas
# a sorting bug, and a sourceFilters bug when strictness is impossible, in 
RangeStreamer.calculateRangesToFetchWithPreferredEndpoints
# getOptimisedWorkMap was not properly disabled for transient replication (not 
caught by tests)
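
To make the first fix concrete, here is a minimal sketch of the conflict rule, 
with simplified stand-in types rather than the real Replica classes: when the 
same endpoint appears in both the natural and pending sets, the write path 
keeps the isFull variant.

{code:java}
// Sketch only: simplified stand-ins for Replica/ReplicaCollection.
import java.util.LinkedHashMap;
import java.util.Map;

final class WriteReplicaMerge
{
    static final class Replica
    {
        final String endpoint; final boolean isFull;
        Replica(String endpoint, boolean isFull) { this.endpoint = endpoint; this.isFull = isFull; }
    }

    // Merge natural and pending replicas for a write: if the same endpoint
    // occurs in both, prefer the isFull variant. Reads use only the natural set.
    static Map<String, Replica> forWrite(Iterable<Replica> natural, Iterable<Replica> pending)
    {
        Map<String, Replica> merged = new LinkedHashMap<>();
        for (Replica r : natural)
            merged.put(r.endpoint, r);
        for (Replica r : pending)
            merged.merge(r.endpoint, r, (a, b) -> a.isFull ? a : b);
        return merged;
    }
}
{code}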

Nits:
# de/serializeReplicas: use a consistent method name, and do not serialize the 
endpoint for each replica

Follow-ups:
# consider whether we need to maintain full->transient self-movements in 
pending ranges
# bootstrap vs restarting an existing node vs repair (unlike regular repair, 
transient repair drops data - but in some cases we may need to accumulate?)
# confirm that nothing is broken, and that there are no negative availability 
implications, from maintaining a transient<->full movement in pending (so 
that a single endpoint may be in both pending and natural)
# ring ownership movements can break consistency, in ways that are worse than 
for normal clusters - in the absence of other bugs, and with more full than 
transient replicas in your cluster, it’s probable this would be caught by 
digest mismatches - but this is far from offering our normal guarantees, and 
we need to consider this further

> Transient Replication: Support ring changes when transient replication is in 
> use (add/remove node, change RF, add/remove DC)
> 
>
> Key: CASSANDRA-14409
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14409
> Project: Cassandra
>  Issue Type: Sub-task
>  Components: Coordination, Core, Documentation and Website
>Reporter: Ariel Weisberg
>Assignee: Ariel Weisberg
>Priority: Major
> Fix For: 4.0
>
>
> The additional state transitions that transient replication introduces 
> require streaming and nodetool cleanup to behave differently. We already have 
> code that does the streaming, but in some cases we shouldn't stream any data 
> and in others when we stream to receive data we have to make sure we stream 
> from a full replica and not a transient replica.
> Transitioning from not replicated to transiently replicated means that a node 
> must stay pending until the next incremental repair completes at which point 
> the data for that range is known to be available at full replicas.
> Transitioning from transiently replicated to fully replicated requires 
> streaming from a full replica and is identical to how we stream from not 
> replicated to replicated. The transition must be managed so the transient 
> replica is not read from as a full replica until streaming completes. It can 
> be used immediately for a write quorum.
> Transitioning from fully replicated to transiently replicated requires 
> cleanup to remove repaired data from the transiently replicated range to 
> reclaim space. It can be used immediately for a write quorum.
> Transitioning from transiently replicated to not replicated requires cleanup 
> to be run to remove the formerly transiently replicated data.
> nodetool move, removenode, cleanup, decommission, and rebuild need to handle 
> these issues as does bootstrap.
> Update web site, documentation, NEWS.txt with a description of the steps for 
> doing common operations. Add/remove DC, Add/remove node(s), replace node, 
> change RF.




[jira] [Commented] (CASSANDRA-14409) Transient Replication: Support ring changes when transient replication is in use (add/remove node, change RF, add/remove DC)

2018-08-24 Thread Ariel Weisberg (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16591964#comment-16591964
 ] 

Ariel Weisberg commented on CASSANDRA-14409:


bq. it looks like the distinction between transient and full ranges during 
range transfer is superficial and is really used only in one place, for 
logging, in RangeStreamer#fetchAsync. In the remaining cases the separation 
does not really exist: we stream one just as the other, while adding a lot of 
complexity both to the system code and elsewhere. requestRanges, just like 
transferRanges, doesn't have to make this distinction, which will consequently 
simplify the calculateRangesToFetchWithPreferredEndpoints interface, which 
currently returns Multimap> and instead can just return a ReplicaMultimap.
I finally remembered why it is done this way. It's so that, when streaming with 
an endpoint completes, we write the correct information to a system table. The 
other end streams based on the information in the replica, but on the 
requesting end we need to know, once complete, whether streaming from that node 
satisfied things transiently or fully. So we are supposed to preserve that 
information based on the node we fetched from, and the split in fetchAsync 
operates on that preserved information about the source. I'll add a comment to 
fetchAsync about it; I figured this out based on a comment in StreamRequest.
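
For illustration only, a hedged sketch of that bookkeeping, with hypothetical 
names and simplified types (not the actual RangeStreamer/StreamRequest code): 
the fetched ranges stay keyed by whether their source was full or transient 
until the session with that peer completes.

{code:java}
// Hypothetical sketch: on completion we record, per peer, which ranges were
// satisfied fully vs transiently, based on the source replica we fetched from.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class FetchBookkeeping
{
    static final class Fetched
    {
        final String range; final boolean sourceWasFull;
        Fetched(String range, boolean sourceWasFull) { this.range = range; this.sourceWasFull = sourceWasFull; }
    }

    // Split the preserved source information when the session completes, so
    // the system table row can distinguish full from transient transfers.
    static Map<Boolean, List<String>> onSessionComplete(List<Fetched> fetched)
    {
        Map<Boolean, List<String>> byFullness = new HashMap<>();
        for (Fetched f : fetched)
            byFullness.computeIfAbsent(f.sourceWasFull, k -> new ArrayList<>()).add(f.range);
        return byFullness; // true -> fully satisfied ranges, false -> transiently satisfied
    }
}
{code}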


[jira] [Commented] (CASSANDRA-14409) Transient Replication: Support ring changes when transient replication is in use (add/remove node, change RF, add/remove DC)

2018-08-23 Thread Ariel Weisberg (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16590856#comment-16590856
 ] 

Ariel Weisberg commented on CASSANDRA-14409:


It's much easier to deal with comments if you put them in the PR. Some of the 
links don't work, and it's hard to know what code you are talking about when 
the task is "go find all examples of X".

bq. here we can do just one iteration and add to one of the sets depending on 
whether the range is full or transient; a similar change can be done in 
RangeStreamer#fetchAsync with remaining and skipped. Generally, fetchAsync uses 
two-step logic which is quite hard to follow, since it becomes hard to track 
which ranges are actually making it to the end, and it does quite a lot of 
unnecessary iteration.
The extra iteration is not a performance issue, since this happens infrequently 
and over very small collections. Clarity is far more important. As far as I 
know, there is no way to bend streams into producing two result values. There 
will always be two branches, because there are two kinds of ranges and we need 
them separate at this step. If you really think the iterative approach is 
clearer we can do it, but traversing these small collections twice is not 
going to have a performance impact. Every line you see in each stream 
corresponds to a line plus extra formatting in the iterative approach, once we 
include all the ceremony. I prefer the declarative approach using streams when 
performance isn't an issue.

All of this is modulo your later comment that streaming isn't actually using 
the separation at all; I need to figure out whether that's OK.
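
For reference, a minimal illustration of the two shapes being weighed, using a 
simplified stand-in type (this is not the patch code):

{code:java}
// Simplified stand-in type; the real code works over Replica collections.
import java.util.List;
import java.util.stream.Collectors;

final class SplitStyles
{
    static final class R
    {
        final String range; final boolean full;
        R(String range, boolean full) { this.range = range; this.full = full; }
    }

    // Declarative: two passes over a small collection, one filter per kind.
    static List<R> fullViaStreams(List<R> rs)
    {
        return rs.stream().filter(r -> r.full).collect(Collectors.toList());
    }

    static List<R> transientViaStreams(List<R> rs)
    {
        return rs.stream().filter(r -> !r.full).collect(Collectors.toList());
    }

    // Iterative: one pass, branching per element, but with more ceremony.
    static void splitViaLoop(List<R> rs, List<R> fulls, List<R> transients)
    {
        for (R r : rs)
            (r.full ? fulls : transients).add(r);
    }
}
{code}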

bq. here, we can simplify the statement to ... !needsCleanupTransient || 
!sstable.isRepaired()
done

bq. this might just use Replica#isLocal. Also, naming is inconsistent with 
isNotAlive, "is" is missing. Same inconsistency with 
endpointNotReplicatedAnymore.
bq. maybe use a single debug statement here
Both of those links lead to a compare view, but not to a line of code.

bq. Might be worth removing printlns from MoveTransientTest
They make the tests easier to debug. I removed some that are redundant, 
switched them to log4j, and moved some of the logging in StorageService to 
debug.

bq. Should we stick to workMap or rangeMap in RangeStreamer?
Well they aren't the same thing. workMap is keyed by endpoint, the other is 
keyed by Replica. I changed getOptimizedRangeFetchMap to getOptimizedWorkMap
 to make it more consistent.

bq. here we might want to avoid iteration if tracing is not enabled
You are right. I'm going to go the exact opposite way though and remove the 
if(isTrace) part because it's completely overkill. This is not fast path code.

bq. in RangeStreamer, we keep calling Keyspace#open many times even though we 
could just do it fewer times and pass a keyspace instance where applicable
I updated it to fetch the keyspace only once.

bq. this might need an extra comment, since this task is used in repair as well 
as in streaming. Same goes for here. It's not completely transparent why we 
strip node information and transientness. I'd rather fix the code that prevents 
us from not doing it, or would just use ranges if this information is not 
important.
The comment is on the toDummyList() method. This isn't stripping information, 
it's synthesizing it, and the reason we do it is that repair manages 
transientness at a higher level; from the perspective of streaming, it just 
needs to do what repair decides. I'll add comments pointing to the toDummyList 
doc.
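
A hedged sketch of the idea behind {{toDummyList()}} (names and signatures 
illustrative, not the actual method): repair has already decided transientness, 
so plain ranges are wrapped in uniform placeholder replicas purely to satisfy 
streaming's interface.

{code:java}
import java.util.ArrayList;
import java.util.List;

final class DummyReplicas
{
    static final class Replica
    {
        final String range; final boolean isFull;
        Replica(String range, boolean isFull) { this.range = range; this.isFull = isFull; }
    }

    // Wrap plain ranges as uniform placeholder replicas: streaming needs a
    // Replica-shaped object to carry each range, but repair has already made
    // the full-vs-transient decision at a higher level.
    static List<Replica> toDummyList(List<String> ranges)
    {
        List<Replica> dummies = new ArrayList<>(ranges.size());
        for (String range : ranges)
            dummies.add(new Replica(range, true)); // transientness synthesized, not real
        return dummies;
    }
}
{code}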

bq. here (and in some other cases where pair is used), I'd just use a 
specialised Pair class that gives a more precise definition of what left and 
right are. Namely: transient and full.
Benedict cleaned this up.

bq. do we need to update legacy transferred ranges here? However, this has been 
there since [59b5b6bef0fa76bf5740b688fcd4d9cf525760d0], so it is not directly 
related to this commit.
It's technically an internal system table, so I don't think we need it in 
Cassandra, but if someone else is using it then we will break them by removing 
it. I thought it was better to deprecate first and then remove. It's still 
correct for external tools if you don't use transient replication and don't 
use multiple storage ports, which is most people.

bq. optimised range fetch map calculation feels not fully done. We need to 
change the RangeFetchMapCalculator instead of unwrapping and then re-wrapping 
items in it. What it currently does looks very fragile.
Unwrapping and rewrapping is a lot safer than trying to update the existing 
code. There were two implementations, and instead of making two changes and 
increasing the surface area for getting it wrong, I chose to leave it alone to 
minimize the impact on people not using transient replication. When we unwrap 
and rewrap, the only thing we are adding in is a hard-coded transient bit, 
which we 

[jira] [Commented] (CASSANDRA-14409) Transient Replication: Support ring changes when transient replication is in use (add/remove node, change RF, add/remove DC)

2018-08-22 Thread Benedict (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589031#comment-16589031
 ] 

Benedict commented on CASSANDRA-14409:
--

I have [pushed a 
fix|https://github.com/belliottsmith/cassandra/commit/a08cb8c14eb5cdccf843b7d134a73dac21fe24c9]
 for a bug I introduced in the collection refactor.

I have also [pushed a 
fix|https://github.com/belliottsmith/cassandra/commit/7d3f6cf21895f8d79b906bf5b8f68c8cb791925f]
 for a few subtle bugs in calculateRangesToFetchWithPreferredEndpoints.  I have 
not completed review of the overall logic of this (and related methods), but we 
were at least not applying the provided snitch sort order in all cases, and in 
the case that strictness was requested but impossible to meet, we would fail if 
any of the endpoints failed the provided filters, even though we only need one 
matching source.
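
Under that reading, a minimal sketch of the corrected acceptance rule 
(simplified, not the actual patch): one candidate passing the source filters is 
enough.

{code:java}
import java.util.List;
import java.util.function.Predicate;

final class SourceSelection
{
    // When strict sources are impossible, it suffices that one candidate
    // passes the source filters; requiring every candidate to pass would
    // fail needlessly.
    static <E> E anyMatchingSource(List<E> candidates, Predicate<E> filters)
    {
        for (E candidate : candidates)
            if (filters.test(candidate))
                return candidate; // one matching source is enough
        throw new IllegalStateException("no acceptable source for range");
    }
}
{code}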

I have also pushed a couple of tidying commits, 
[here|https://github.com/belliottsmith/cassandra/commit/7e7d0e30e909d06a8458e5d55e2c415dea23ecd2]
 and 
[here|https://github.com/belliottsmith/cassandra/commit/293fd16ba7be2884be454aa84df1a6cc5b3fb6df].


[jira] [Commented] (CASSANDRA-14409) Transient Replication: Support ring changes when transient replication is in use (add/remove node, change RF, add/remove DC)

2018-08-22 Thread Ariel Weisberg (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16588959#comment-16588959
 ] 

Ariel Weisberg commented on CASSANDRA-14409:


I'm going to rebase on top of Blake's work now, get all the tests passing, 
incorporate/review the collection refactor, get all the tests passing again, 
and then apply Alex's review comments. I've already done a few.

{quote}Looks like StorageProxy#calculateRangesToStreamWithPreferredEndpoints 
can be simplified to something like this. Most likely the same logic can (and 
probably even should) belong to calculateStreamAndFetchRanges, since it seems 
that we are unnecessarily trying to loop and intersect the ranges again.
This seems to have been caused by the fact that 
calculateRangesToFetchWithPreferredEndpoints and 
calculateRangesToStreamWithPreferredEndpoints were receiving Range, not 
Replica, as they do right now, so in both cases we additionally call 
calculateNaturalReplicas (which we could potentially avoid). I think we should 
do all range-related calculations (intersections, etc.) in 
calculateStreamAndFetchRanges. The old code was just making a diff 
between the old and current set per range, like here, so it seems that we can 
achieve the same without splitting the logic in two parts.{quote}

I'll look at this once I have the collections in and the tests passing. I want 
a more stable, representative base to start from before making that kind of 
change to one of the thorniest parts of the patch.


[jira] [Commented] (CASSANDRA-14409) Transient Replication: Support ring changes when transient replication is in use (add/remove node, change RF, add/remove DC)

2018-08-22 Thread Alex Petrov (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16588873#comment-16588873
 ] 

Alex Petrov commented on CASSANDRA-14409:
-

I'm definitely in favour of the refactor proposed by [~benedict]. First of all, 
it removes the ambiguity around the subtle semantic and sorting differences 
imposed by Set and List, which were further blurred by the use of Collection in 
the previous version, so it was difficult to say what comes from where.

It seems that we did not have any places where {{Set}} was serving the purpose 
of a {{ReplicaSet}}, e.g. removing duplicate endpoint/token range pairs in the 
code, but was rather used as a plain {{Set}}.

It'd be great if we could finish up the ReplicaPlan changes that this change 
made possible, and attempt to hide the abstract class and interface in the 
package, so we could always be specific at the code level and have clear 
interfaces. I have several minor comments / suggestions; I'll push them to the 
branch later today.


[jira] [Commented] (CASSANDRA-14409) Transient Replication: Support ring changes when transient replication is in use (add/remove node, change RF, add/remove DC)

2018-08-22 Thread Alex Petrov (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16588603#comment-16588603
 ] 

Alex Petrov commented on CASSANDRA-14409:
-

Looks like {{StorageProxy#calculateRangesToStreamWithPreferredEndpoints}} can 
be simplified to something like 
[this|https://gist.github.com/ifesdjeen/298ea3f90935d66461dbaa550325f37f]. Most 
likely the same logic can (and probably even should) belong to 
{{calculateStreamAndFetchRanges}}, since it seems that we are unnecessarily 
trying to loop and intersect the ranges again. 

This seems to have been caused by the fact that 
{{calculateRangesToFetchWithPreferredEndpoints}} and 
{{calculateRangesToStreamWithPreferredEndpoints}} were receiving 
{{Range}}, not {{Replica}}, as they do right now, so in both cases we 
additionally call {{calculateNaturalReplicas}} (which we could potentially 
avoid). I think we should do _all_ range-related calculations (intersections, 
etc.) in {{calculateStreamAndFetchRanges}}. The old code was just making a 
diff between the old and current set per range, like 
[here|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/StorageService.java#L4337-L4342],
 so it seems that we can achieve the same without splitting the logic in two 
parts. 
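
A tiny sketch of that per-range diff with simplified types (not the gist's 
code): keep what is unchanged, fetch what is gained, stream away what is lost.

{code:java}
import java.util.HashSet;
import java.util.Set;

final class StreamAndFetch
{
    final Set<String> toStream = new HashSet<>(); // ranges we give away
    final Set<String> toFetch  = new HashSet<>(); // ranges we must acquire

    StreamAndFetch(Set<String> current, Set<String> updated)
    {
        toStream.addAll(current);
        toStream.removeAll(updated); // current \ updated: lost ownership
        toFetch.addAll(updated);
        toFetch.removeAll(current);  // updated \ current: gained ownership
    }
}
{code}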


[jira] [Commented] (CASSANDRA-14409) Transient Replication: Support ring changes when transient replication is in use (add/remove node, change RF, add/remove DC)

2018-08-22 Thread Benedict (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16588601#comment-16588601
 ] 

Benedict commented on CASSANDRA-14409:
--

FWIW, I have separated out my review notes, my follow-up improvements, and my 
ReplicaCollection refactor, to simplify review.

[14409-replicas|https://github.com/belliottsmith/cassandra/tree/14409-replicas] 
contains the refactor only, plus any modifications necessary for it to 
compile, squashed.
[14409-improve|https://github.com/belliottsmith/cassandra/tree/14409-improve] 
contains my follow-up improvements; I will continue pushing things that come 
up in review here.

I have moved my own review notes into a separate branch.


[jira] [Commented] (CASSANDRA-14409) Transient Replication: Support ring changes when transient replication is in use (add/remove node, change RF, add/remove DC)

2018-08-21 Thread Alex Petrov (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16587397#comment-16587397
 ] 

Alex Petrov commented on CASSANDRA-14409:
-

[Link for 
compare|https://github.com/aweisberg/cassandra/compare/2a97227550af593e63583fb1336cb94d746d1838...14409-4],
 may be useful (at least until rebase).

Some preliminary comments (from looking at the code and running unit tests, 
without writing additional dtests so far):
  * 
[here|https://github.com/aweisberg/cassandra/commit/62578c85865c474774633f5337affa8c2ce0eb07#diff-350352a1ac9b039efbee2eeb8978a9c9R149]
 we can do just one iteration and add to one of the sets depending on whether 
the range is full or transient; a similar change can be done in 
{{RangeStreamer#fetchAsync}} with {{remaining}} and {{skipped}}. Generally, 
{{fetchAsync}} uses two-step logic which is quite hard to follow, since it 
becomes hard to track which ranges are actually making it to the end, and it 
does quite a lot of unnecessary iteration.
  * 
[here|https://github.com/aweisberg/cassandra/commit/62578c85865c474774633f5337affa8c2ce0eb07#diff-d4e3b82e9bebfd2cb466b4a30af07fa4R1132],
 we can simplify the statement to {{... !needsCleanupTransient || 
!sstable.isRepaired()}}
  * 
[this|https://github.com/apache/cassandra/compare/trunk...aweisberg:14409-4#diff-fad052638059f53b1a6d479dbd05f2f2R323]
 might just use {{Replica#isLocal}}. Also, naming is inconsistent with 
{{isNotAlive}}, "is" is missing. Same inconsistency with 
{{endpointNotReplicatedAnymore}}.
  * maybe use a single debug statement 
[here|https://github.com/apache/cassandra/compare/trunk...aweisberg:14409-4#diff-fad052638059f53b1a6d479dbd05f2f2R299]
  * Might be worth removing {{printlns}} from {{MoveTransientTest}} 
  * Should we stick to {{workMap}} or {{rangeMap}} in {{RangeStreamer}}?
  * 
[here|https://github.com/aweisberg/cassandra/compare/2a97227550af593e63583fb1336cb94d746d1838...14409-4#diff-fad052638059f53b1a6d479dbd05f2f2R219]
 we might want to avoid iteration if tracing is not enabled
  * in {{RangeStreamer}}, we keep calling {{Keyspace#open}} many times even 
though we could just do it fewer times and pass a keyspace instance where 
applicable
  * duplicate import 
[here|https://github.com/aweisberg/cassandra/compare/2a97227550af593e63583fb1336cb94d746d1838...14409-4#diff-a4c2ce49cb3a3429a8d6376929a90f7fR33]
  * 
[this|https://github.com/aweisberg/cassandra/compare/2a97227550af593e63583fb1336cb94d746d1838...14409-4#diff-433b489a9a55c01dc4b021b012af3af6R59]
 might need an extra comment, since this task is used in repair as well as in 
streaming. Same goes for 
[here|https://github.com/aweisberg/cassandra/compare/2a97227550af593e63583fb1336cb94d746d1838...14409-4#diff-eb7f85ba9cf3e87d842aad8b82557d19R82].
 It's not completely transparent why we strip node information and 
transientness. I'd rather fix the code that prevents us from not doing it, or 
would just use ranges if this information is not important. 
  * 
[here|https://github.com/aweisberg/cassandra/compare/2a97227550af593e63583fb1336cb94d746d1838...14409-4#diff-ce3f6856b405c96859d9a50d9977e0b9R1285]
 (and in some other cases where pair is used), I'd just use a specialised Pair 
class that gives a more precise definition of what left and right are. 
Namely: transient and full.
  * do we need to update legacy transferred ranges 
[here|https://github.com/aweisberg/cassandra/compare/2a97227550af593e63583fb1336cb94d746d1838...14409-4#diff-ce3f6856b405c96859d9a50d9977e0b9R1322]?
 However, this has been there since [59b5b6bef0fa76bf5740b688fcd4d9cf525760d0], 
so it is not directly related to this commit.
  * optimised range fetch map calculation feels not fully done. We need to 
change the {{RangeFetchMapCalculator}} instead of unwrapping and then 
re-wrapping items in it. What it currently does looks very fragile.
  * {{RangeStreamer#calculateRangesToFetchWithPreferredEndpoints}} also looks 
unfinished, since it returns {{ReplicaMultimap}}, but we 
always convert it to {{Multimap>}} 
in {{StorageService#calculateRangesToFetchWithPreferredEndpoints}} (same name, 
different return type, which makes it additionally difficult to follow), and 
then back to {{RangeFetchMapCalculator}} and 
{{convertPreferredEndpointsToWorkMap}}. 
  * it seems that we keep splitting full and transient replicas in a bunch of 
places, maybe we should add some auxiliary method to perform this job 
efficiently with a clear interface and use it everywhere?
  * in {{StorageService}}, use of {{Map.Entry}} 
simultaneously with {{Pair}} really complicates the logic. 
Once again hints we might need a specialised class for replica 
source/destination pairs.
  * in multiple places ({{StorageService}} and {{PendingRepairManager}}), we're 
still calling {{getAddressReplicas}}, which materialises a map only to get the 
{{ReplicaSet}} by address 
  * it looks like the distinction between transient and full 

[jira] [Commented] (CASSANDRA-14409) Transient Replication: Support ring changes when transient replication is in use (add/remove node, change RF, add/remove DC)

2018-08-20 Thread Benedict (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16586489#comment-16586489
 ] 

Benedict commented on CASSANDRA-14409:
--

I've pushed a couple of new commits to include the unit tests that I forgot 
earlier.  These also exposed some issues with the consistency of the semantics 
that would not have affected any of the current use cases, but were suboptimal 
and worth fixing to avoid any future surprises.  Specifically: filtering an 
immutable view of a Mutable with a predicate that matches all items (and 
similar scenarios where the subset is the same as the collection being 
operated on).

To fix this, I've introduced a new variable {{isSnapshot}} that tracks whether 
the collection is a view or a snapshot (i.e. can be expected to be truly 
immutable/stable).
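
A hedged sketch of the {{isSnapshot}} rule (illustrative class, not the actual 
{{ReplicaCollection}}): a filter may only short-circuit and return itself when 
the collection is a stable snapshot, since a view over a Mutable could still 
change underneath us.

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

final class ReplicaListSketch
{
    private final List<String> contents;
    private final boolean isSnapshot; // true: truly immutable; false: view of a Mutable

    ReplicaListSketch(List<String> contents, boolean isSnapshot)
    {
        this.contents = contents;
        this.isSnapshot = isSnapshot;
    }

    ReplicaListSketch filter(Predicate<String> predicate)
    {
        List<String> filtered = new ArrayList<>();
        for (String item : contents)
            if (predicate.test(item))
                filtered.add(item);
        // Only safe to return ourselves when nothing was filtered out AND we
        // are a snapshot; a matching-all filter over a view must still copy.
        if (filtered.size() == contents.size() && isSnapshot)
            return this;
        return new ReplicaListSketch(filtered, true); // result is a snapshot
    }
}
{code}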


[jira] [Commented] (CASSANDRA-14409) Transient Replication: Support ring changes when transient replication is in use (add/remove node, change RF, add/remove DC)

2018-08-20 Thread Benedict (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16585874#comment-16585874
 ] 

Benedict commented on CASSANDRA-14409:
--

I have pushed a patch with some major refactors to {{ReplicaCollection}} 
[here|https://github.com/belliottsmith/cassandra/tree/14409-replicas-wip]; it 
is seemingly at parity on unit/dtests.  There are some things I would still 
like to improve, but we’re too pressed for time, and at the point of 
diminishing returns.

The patch does include some minor behavioural changes to fix some unit tests 
that were broken prior to it, and some (in-progress) review feedback comments 
I’m collecting.

In summary, it introduces {{EndpointsForRange}}, {{EndpointsForToken}} and 
{{RangesAtEndpoint}}.  There are also {{Endpoints}}, a super class of the first 
two, and {{AbstractReplicaCollection}} that implements the majority of 
functionality.

Main improvements:
* These classes are immutable by default; I think this actually reduces copying 
on average, and improves clarity
* There are {{Mutable}} variants, for clearly delineating where this is 
necessary
* We now declare in all locations what our expected semantics are, and we 
enforce them
* There is only one implementation for all non-boilerplate 
{{ReplicaCollection}} methods, so a reader can quickly determine the 
characteristics at a call site
* (Almost) all O(n^2) call sites have been eliminated, mostly by our stronger 
constraints and maintaining a {{LinkedHashMap}}.  This is eagerly constructed 
for a brand-new collection, and lazily for a filtered/sorted collection.
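
A hedged sketch of the immutable-by-default pattern described above (names and 
signatures illustrative, not the actual API): the immutable type is the 
default, and an explicit Mutable variant marks the build-up phases.

{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class EndpointsSketch
{
    protected final List<String> endpoints;

    protected EndpointsSketch(List<String> endpoints) { this.endpoints = endpoints; }

    List<String> get() { return Collections.unmodifiableList(endpoints); }

    static final class Mutable extends EndpointsSketch
    {
        Mutable() { super(new ArrayList<>()); }

        void add(String endpoint) { endpoints.add(endpoint); } // append-only mutation

        // Snapshot copies the current contents; append-only mutation means the
        // copied prefix can never change out from under the snapshot.
        EndpointsSketch snapshot() { return new EndpointsSketch(new ArrayList<>(endpoints)); }
    }
}
{code}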

Some open questions: 
* I have limited mutability to append only, which permits easy immutable 
snapshots, but may be fiddly to change later.
* I have made the {{Mutable}} variants extend the immutable ones, but this is 
not necessary, as an immutable view can always be constructed - this might be 
cleaner, and might also permit easier sharing of {{Mutable}} implementation 
details
* {{AbstractReplicaCollection}} introduces a method called {{select()}} for 
efficiently getting a subset of the collection based on a sequence of filters - 
this considerably clarifies one caller, and might clarify others I didn’t spot, 
but is optional.
* Conversely, since we often filter and sort together, this class could be 
extended to perform them together and avoid some unnecessary work, but since 
this would be harder to roll back it probably needs consensus first.
* We maintain a {{LinkedHashMap}} to avoid code bloat while enforcing the same 
iteration order in our derived collections, but we could introduce an 
{{AbstractMap}} that proxies to a pure {{HashMap}}, and iterates our list.  
This further complicates the code (slightly), but improves performance for 
sorting and building.

Suboptimality:
* {{Endpoints}} is necessary in a number of places, because we conflate 
range and point ops in just a handful of places; namely writing batch/hints to 
‘any’ other host, and in Data/DigestResolver, which can be for range or point 
queries.  This means some uglier generics in places, and some code is less 
clear than it might be.  The places where this conflation occurs are well 
delineated now, at least.
* We don’t have isolation when getting natural and pending replicas, so we 
ignore conflicting endpoints when concatenating these.
* {{Endpoints.newMutable}} - this is an instance method for a new {{Mutable}} 
that matches the type of the instance.  It’s ugly, but presently needed for the 
above conflation.  Possibly {{select()}} could be modified to support these 
cases instead.
* {{EndpointsForToken}} and {{EndpointsForRange}} *require* a {{Token}} and 
{{Range}} respectively - this means a bit of boilerplate occasionally to pass 
the correct value to empty(), as well as some unnecessary allocations.  We 
could probably relax this, but it makes the semantics of {{newMutable}} tricky 
(if we improved select() to replace this, we could probably avoid this)
* {{EndpointsByRange}} and {{RangesByEndpoint}} have immutable and mutable 
variants, though I don’t love these at all.  Nor do I like 
{{EndpointsByReplica}}.  But they will probably suffice.


[jira] [Commented] (CASSANDRA-14409) Transient Replication: Support ring changes when transient replication is in use (add/remove node, change RF, add/remove DC)

2018-07-31 Thread Ariel Weisberg (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16563871#comment-16563871
 ] 

Ariel Weisberg commented on CASSANDRA-14409:


Current branch is [here|https://github.com/aweisberg/cassandra/tree/14409-4]. 
Working on dtests right now.
