[ https://issues.apache.org/jira/browse/CASSANDRA-12905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15735349#comment-15735349 ]
Paulo Motta edited comment on CASSANDRA-12905 at 12/9/16 1:46 PM: ------------------------------------------------------------------ Thanks for your great investigative work and proposed patch. Please find a bellow a summary of the problems identified to make sure we're on the same page: 1 - Streaming of MVs may never complete due to lock contention throwing WTE during MV mutation application (bug). 2 - MV hint delivery may throw WTE on MV lock contention, even if the node is not overloaded (bug) 3 - Streaming of MVs is slow due to MV mutation application (performance limitation) Your original patch fixes 1 and 2 by retrying to acquire MV lock on contention instead of throwing WTE for mutations originating from streaming or hints. While 3 is not actually a bug but a potential performance limitation, it seems from your results that the cost of performing bootstrap (and large stream sessions in general) of large base tables may become prohibitive due to a large amount of mutations going through the write path. While your proposed approach of performing sstable-based streaming for base and MVs may be valid in the consistent range movement case like bootstrap or decommission, it is not correct in all scenarios for repair or inconsistent range movements like replace and removenode, because it may generate permanent inconsistencies in the views which are only fixable by dropping and re-creating the view. We may address this by allowing operators to disable mutation-based streaming for MVs and streaming if they know what they are doing and can deal with the inconsistencies that may arise with their own ways (for instance, by ensuring writes are append-only or dealing with it in the application layer), but in the general case I'm afraid we will have to live with this until we find a better way to deal with it globally, with things like CASSANDRA-9779, CASSANDRA-9308, providing a custom tool to repair inconsistencies in views, or even redesigning MV consistency ensuring mechanisms. With this said, since this is blocking release I propose we fix 1 and 2 on this ticket with your original patch, and open a new ticket to discuss ways to mitigate and/or fix 3, without the pressure of having to block release on this and taking the risk of delivering a half-baked solution. Overall your original patch (without the sstable-based streaming fix) and approach looks good, I have the following comments: - invert the variable from {{dontTimeOut}} to {{droppable}} to facilitate understanding, since this is the term used for requests that can throw {{WriteTimeoutException}} - I think we can make hint mutation requests {{deferrable}} and {{non-droppable}} to avoid blocking the thread on failure to acquire the lock - for this we need to do hint replying asynchronously on {{HintVerbHandler}} (similar to what is done on {{MutationVerbHandler}}). The right thing to do would be to make hints droppable similar to remote mutations, but this would require us to send the hint timestamp down to {{keyspace.mutateMV}}, so if you find a non-disruptive way to do this than even better, otherwise we can leave it as {{deferrable}} and {{non-droppable}} for the time being. - Instead of making {{applyBlocking}} throw {{ExecutionException}}, I think we can handle this on {{applyBlocking}} itself instead of handling on consumers with {{Throwables.propagate(e.getCause())}}. Since those are mostly cosmetic changes I will kick-off an initial round of tests with your initial patch to make sure everything is working as expected: ||3.0|| |[branch|https://github.com/apache/cassandra/compare/cassandra-3.0...pauloricardomg:3.0-12905]| |[testall|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-3.0-12905-testall/lastCompletedBuild/testReport/]| |[dtest|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-3.0-12905-dtest/lastCompletedBuild/testReport/]| Please let me if you need help or will not have time to address those and I will take care of the changes myself if you agree with those. Thanks again! was (Author: pauloricardomg): Thanks for your great investigative work and proposed patch. Please find a bellow a summary of the problems identified to make sure we're on the same page: 1 - Streaming of MVs may never complete due to lock contention throwing WTE during MV mutation application (bug). 2 - MV hint delivery may throw WTE on MV lock contention, even if the node is not overloaded (bug) 3 - Streaming of MVs is slow due to MV mutation application (performance limitation) Your original patch fixes 1 and 2 by retrying to acquire MV lock on contention instead of throwing WTE for mutations originating from streaming or hints. While 3 is not actually a bug but a potential performance limitation, it seems from your results that the cost of performing bootstrap (and large stream sessions in general) of large base tables may become prohibitive due to a large amount of mutations going through the write path. While your proposed approach of performing sstable-based streaming for base and MVs may be valid in the consistent range movement case like bootstrap or decommission, it is not correct in all scenarios for repair or inconsistent range movements like replace and removenode, because it may generate permanent inconsistencies in the views which are only fixable by dropping and re-creating the view. We may address this by allowing operators to disable mutation-based streaming for MVs if they know what they are doing and can deal with the inconsistencies that may arise with their own ways (for instance, by ensuring writes are append-only or dealing with it in the application layer), but in the general case I'm afraid we will have to live with this until we find a better way to deal with it globally, with things like CASSANDRA-9779, CASSANDRA-9308, providing a custom tool to repair inconsistencies in views, or even redesigning MV consistency ensuring mechanisms. With this said, since this is blocking release I propose we fix 1 and 2 on this ticket with your original patch, and open a new ticket to discuss ways to mitigate and/or fix 3, without the pressure of having to block release on this and taking the risk of delivering a half-baked solution. Overall your original patch (without the sstable-based streaming fix) and approach looks good, I have the following comments: - invert the variable from {{dontTimeOut}} to {{droppable}} to facilitate understanding, since this is the term used for requests that can throw {{WriteTimeoutException}} - I think we can make hint mutation requests {{deferrable}} and {{non-droppable}} to avoid blocking the thread on failure to acquire the lock - for this we need to do hint replying asynchronously on {{HintVerbHandler}} (similar to what is done on {{MutationVerbHandler}}). The right thing to do would be to make hints droppable similar to remote mutations, but this would require us to send the hint timestamp down to {{keyspace.mutateMV}}, so if you find a non-disruptive way to do this than even better, otherwise we can leave it as {{deferrable}} and {{non-droppable}} for the time being. - Instead of making {{applyBlocking}} throw {{ExecutionException}}, I think we can handle this on {{applyBlocking}} itself instead of handling on consumers with {{Throwables.propagate(e.getCause())}}. Since those are mostly cosmetic changes I will kick-off an initial round of tests with your initial patch to make sure everything is working as expected: ||3.0|| |[branch|https://github.com/apache/cassandra/compare/cassandra-3.0...pauloricardomg:3.0-12905]| |[testall|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-3.0-12905-testall/lastCompletedBuild/testReport/]| |[dtest|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-3.0-12905-dtest/lastCompletedBuild/testReport/]| Please let me if you need help or will not have time to address those and I will take care of the changes myself if you agree with those. Thanks again! > Retry acquire MV lock on failure instead of throwing WTE on streaming > --------------------------------------------------------------------- > > Key: CASSANDRA-12905 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12905 > Project: Cassandra > Issue Type: Bug > Components: Streaming and Messaging > Environment: centos 6.7 x86_64 > Reporter: Nir Zilka > Assignee: Benjamin Roth > Priority: Critical > Fix For: 3.10 > > > Hello, > I performed two upgrades to the current cluster (currently 15 nodes, 1 DC, > private VLAN), > first it was 2.2.5.1 and repair worked flawlessly, > second upgrade was to 3.0.9 (with upgradesstables) and also repair worked > well, > then i upgraded 2 weeks ago to 3.9 - and the repair problems started. > there are several errors types from the system.log (different nodes) : > - Sync failed between /xxx.xxx.xxx.xxx and /xxx.xxx.xxx.xxx > - Streaming error occurred on session with peer xxx.xxx.xxx.xxx Operation > timed out - received only 0 responses > - Remote peer xxx.xxx.xxx.xxx failed stream session > - Session completed with the following error > org.apache.cassandra.streaming.StreamException: Stream failed > ---- > i use 3.9 default configuration with the cluster settings adjustments (3 > seeds, GossipingPropertyFileSnitch). > streaming_socket_timeout_in_ms is the default (86400000). > i'm afraid from consistency problems while i'm not performing repair. > Any ideas? > Thanks, > Nir. -- This message was sent by Atlassian JIRA (v6.3.4#6332)