[jira] [Commented] (CASSANDRA-8094) Heavy writes in RangeSlice read requests
[ https://issues.apache.org/jira/browse/CASSANDRA-8094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14367387#comment-14367387 ] Minh Do commented on CASSANDRA-8094: Will have some time in the next couple of weeks to check it in. Thanks Heavy writes in RangeSlice read requests -- Key: CASSANDRA-8094 URL: https://issues.apache.org/jira/browse/CASSANDRA-8094 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Minh Do Assignee: Minh Do Fix For: 2.0.14 RangeSlice requests always do a scheduled read repair when coordinators try to resolve replicas' responses, whether or not read_repair_chance is set. Because of this, in clusters with low writes and high reads, there is very heavy write traffic between nodes. We should have an option to turn this off, and it can be different from read_repair_chance. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
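For illustration, here is a minimal sketch of the kind of per-CF gate this ticket asks for. It is not the actual Cassandra resolver code; the class name and the rangeSliceReadRepairChance setting are hypothetical, chosen only to show range slice read repair being gated by a chance that is separate from read_repair_chance.

    import java.util.concurrent.ThreadLocalRandom;

    // Hypothetical gate: schedule read repair for a range slice only when a
    // separate per-column-family chance allows it, instead of unconditionally.
    final class RangeSliceRepairGate
    {
        private final double rangeSliceReadRepairChance; // 0.0 turns it off entirely

        RangeSliceRepairGate(double rangeSliceReadRepairChance)
        {
            this.rangeSliceReadRepairChance = rangeSliceReadRepairChance;
        }

        boolean shouldScheduleRepair()
        {
            return ThreadLocalRandom.current().nextDouble() < rangeSliceReadRepairChance;
        }
    }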
[jira] [Commented] (CASSANDRA-8751) C* should always listen to both ssl/non-ssl ports
[ https://issues.apache.org/jira/browse/CASSANDRA-8751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335507#comment-14335507 ] Minh Do commented on CASSANDRA-8751: A TLS/SSL socket by design only processes secured or encrypted messages. How can we use this one TLS/SSL socket to process both plain-text and encrypted messages simultaneously? I don't think we can get away from this. C* should always listen to both ssl/non-ssl ports - Key: CASSANDRA-8751 URL: https://issues.apache.org/jira/browse/CASSANDRA-8751 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Minh Do Assignee: Minh Do Priority: Critical Since there is always one thread dedicated to each server socket listener and it does not use many resources, we should always have both listeners up no matter what users set for internode_encryption. The reason behind this is that we need to switch back and forth between different internode_encryption modes, and we need C* servers to keep running in a transient state during mode switching. Currently this is not possible. For example, we have an internode_encryption=dc cluster in a multi-region AWS environment and want to set internode_encryption=all by rolling-restarting the C* nodes. However, a node with internode_encryption=all does not open the non-ssl port to listen. As a result, we have a split-brain cluster here. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
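The two-listener proposal side-steps that problem: instead of trying to serve plain-text and TLS traffic on one socket, the node binds both ports regardless of internode_encryption. A minimal sketch, assuming the default ports 7000 (storage_port) and 7001 (ssl_storage_port) and the default SSL context; this is not the actual MessagingService code:

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.net.ServerSocket;
    import javax.net.ssl.SSLServerSocket;
    import javax.net.ssl.SSLServerSocketFactory;

    final class DualPortListener
    {
        static final int STORAGE_PORT = 7000;      // non-ssl internode port
        static final int SSL_STORAGE_PORT = 7001;  // ssl internode port

        static void bindBoth() throws IOException
        {
            // Plain listener, always bound even when internode_encryption=all.
            ServerSocket plain = new ServerSocket();
            plain.bind(new InetSocketAddress(STORAGE_PORT));

            // SSL listener, always bound even when internode_encryption=none.
            SSLServerSocket ssl = (SSLServerSocket)
                    SSLServerSocketFactory.getDefault().createServerSocket();
            ssl.bind(new InetSocketAddress(SSL_STORAGE_PORT));

            // Each socket gets its own accept thread; as the ticket notes, the
            // cost of the extra listener thread is negligible.
        }
    }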
[jira] [Created] (CASSANDRA-8751) C* should always listen to both ssl/non-ssl ports
Minh Do created CASSANDRA-8751: -- Summary: C* should always listen to both ssl/non-ssl ports Key: CASSANDRA-8751 URL: https://issues.apache.org/jira/browse/CASSANDRA-8751 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Minh Do Assignee: Minh Do Priority: Critical Since there is always one thread dedicated to each server socket listener and it does not use many resources, we should always have both listeners up no matter what users set for internode_encryption. The reason behind this is that we need to switch back and forth between different internode_encryption modes, and we need C* servers to keep running in a transient state during mode switching. Currently this is not possible. For example, we have an internode_encryption=dc cluster in a multi-region AWS environment and want to set internode_encryption=all by rolling-restarting the C* nodes. However, a node with internode_encryption=all does not open the non-ssl port to listen. As a result, we have a split-brain cluster here. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-8094) Heavy writes in RangeSlice read requests
[ https://issues.apache.org/jira/browse/CASSANDRA-8094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Minh Do updated CASSANDRA-8094: --- Due Date: 15/Jan/15 (was: 14/Nov/14) Heavy writes in RangeSlice read requests -- Key: CASSANDRA-8094 URL: https://issues.apache.org/jira/browse/CASSANDRA-8094 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Minh Do Assignee: Minh Do Fix For: 2.0.12 RangeSlice requests always do a scheduled read repair when coordinators try to resolve replicas' responses, whether or not read_repair_chance is set. Because of this, in clusters with low writes and high reads, there is very heavy write traffic between nodes. We should have an option to turn this off, and it can be different from read_repair_chance. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-8132) Save or stream hints to a safe place in node replacement
[ https://issues.apache.org/jira/browse/CASSANDRA-8132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Minh Do updated CASSANDRA-8132: --- Due Date: 15/Jan/15 (was: 28/Nov/14) Save or stream hints to a safe place in node replacement Key: CASSANDRA-8132 URL: https://issues.apache.org/jira/browse/CASSANDRA-8132 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Minh Do Assignee: Minh Do Fix For: 2.1.3 Often, we need to replace a node with a new instance in the cloud environment where all nodes are still alive. To be safe without losing data, we usually make sure all hints are gone before we do this operation. Replacement means we just want to shut down the C* process on a node and bring up another instance to take over that node's token. However, if a node to be replaced has a lot of stored hints, its HintedHandOffManager seems very slow to send the hints to other nodes. In our case, we tried to replace a node and had to wait for several days before its stored hints were cleared out. As mentioned above, we need all hints on this node to clear out before we can terminate it and replace it with a new instance/machine. Since this is not a decommission, I am proposing that we have the same hints-streaming mechanism as in the decommission code. Furthermore, there needs to be a cmd for NodeTool to trigger this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (CASSANDRA-8132) Save or stream hints to a safe place in node replacement
Minh Do created CASSANDRA-8132: -- Summary: Save or stream hints to a safe place in node replacement Key: CASSANDRA-8132 URL: https://issues.apache.org/jira/browse/CASSANDRA-8132 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Minh Do Assignee: Minh Do Fix For: 2.1.1 Often, we need to replace a node with a new instance in the cloud environment where all nodes are still alive. To be safe without losing data, we usually make sure all hints are gone before we do this operation. Replacement means we just want to shut down the C* process on a node and bring up another instance to take over that node's token. However, if a node has a lot of stored hints, HintedHandOffManager seems very slow to play the hints. In our case, we tried to replace a node and had to wait for several days. Since this is not a decommission, I am proposing that we have the same hints-streaming mechanism as in the decommission code. Furthermore, there needs to be a cmd for NodeTool to trigger this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-8132) Save or stream hints to a safe place in node replacement
[ https://issues.apache.org/jira/browse/CASSANDRA-8132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Minh Do updated CASSANDRA-8132: --- Description: Often, we need to replace a node with a new instance in the cloud environment where all nodes are still alive. To be safe without losing data, we usually make sure all hints are gone before we do this operation. Replacement means we just want to shut down the C* process on a node and bring up another instance to take over that node's token. However, if a node to be replaced has a lot of stored hints, its HintedHandOffManager seems very slow to send the hints to other nodes. In our case, we tried to replace a node and had to wait for several days before its stored hints were cleared out. As mentioned above, we need all hints on this node to clear out before we can terminate it and replace it with a new node. Since this is not a decommission, I am proposing that we have the same hints-streaming mechanism as in the decommission code. Furthermore, there needs to be a cmd for NodeTool to trigger this. was: Often, we need to replace a node with a new instance in the cloud environment where all nodes are still alive. To be safe without losing data, we usually make sure all hints are gone before we do this operation. Replacement means we just want to shut down the C* process on a node and bring up another instance to take over that node's token. However, if a node has a lot of stored hints, HintedHandOffManager seems very slow to play the hints. In our case, we tried to replace a node and had to wait for several days. Since this is not a decommission, I am proposing that we have the same hints-streaming mechanism as in the decommission code. Furthermore, there needs to be a cmd for NodeTool to trigger this. Save or stream hints to a safe place in node replacement Key: CASSANDRA-8132 URL: https://issues.apache.org/jira/browse/CASSANDRA-8132 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Minh Do Assignee: Minh Do Fix For: 2.1.1 Often, we need to replace a node with a new instance in the cloud environment where all nodes are still alive. To be safe without losing data, we usually make sure all hints are gone before we do this operation. Replacement means we just want to shut down the C* process on a node and bring up another instance to take over that node's token. However, if a node to be replaced has a lot of stored hints, its HintedHandOffManager seems very slow to send the hints to other nodes. In our case, we tried to replace a node and had to wait for several days before its stored hints were cleared out. As mentioned above, we need all hints on this node to clear out before we can terminate it and replace it with a new node. Since this is not a decommission, I am proposing that we have the same hints-streaming mechanism as in the decommission code. Furthermore, there needs to be a cmd for NodeTool to trigger this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-8132) Save or stream hints to a safe place in node replacement
[ https://issues.apache.org/jira/browse/CASSANDRA-8132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Minh Do updated CASSANDRA-8132: --- Description: Often, we need to replace a node with a new instance in the cloud environment where all nodes are still alive. To be safe without losing data, we usually make sure all hints are gone before we do this operation. Replacement means we just want to shut down the C* process on a node and bring up another instance to take over that node's token. However, if a node to be replaced has a lot of stored hints, its HintedHandOffManager seems very slow to send the hints to other nodes. In our case, we tried to replace a node and had to wait for several days before its stored hints were cleared out. As mentioned above, we need all hints on this node to clear out before we can terminate it and replace it with a new instance/machine. Since this is not a decommission, I am proposing that we have the same hints-streaming mechanism as in the decommission code. Furthermore, there needs to be a cmd for NodeTool to trigger this. was: Often, we need to replace a node with a new instance in the cloud environment where all nodes are still alive. To be safe without losing data, we usually make sure all hints are gone before we do this operation. Replacement means we just want to shut down the C* process on a node and bring up another instance to take over that node's token. However, if a node to be replaced has a lot of stored hints, its HintedHandOffManager seems very slow to send the hints to other nodes. In our case, we tried to replace a node and had to wait for several days before its stored hints were cleared out. As mentioned above, we need all hints on this node to clear out before we can terminate it and replace it with a new node. Since this is not a decommission, I am proposing that we have the same hints-streaming mechanism as in the decommission code. Furthermore, there needs to be a cmd for NodeTool to trigger this. Save or stream hints to a safe place in node replacement Key: CASSANDRA-8132 URL: https://issues.apache.org/jira/browse/CASSANDRA-8132 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Minh Do Assignee: Minh Do Fix For: 2.1.1 Often, we need to replace a node with a new instance in the cloud environment where all nodes are still alive. To be safe without losing data, we usually make sure all hints are gone before we do this operation. Replacement means we just want to shut down the C* process on a node and bring up another instance to take over that node's token. However, if a node to be replaced has a lot of stored hints, its HintedHandOffManager seems very slow to send the hints to other nodes. In our case, we tried to replace a node and had to wait for several days before its stored hints were cleared out. As mentioned above, we need all hints on this node to clear out before we can terminate it and replace it with a new instance/machine. Since this is not a decommission, I am proposing that we have the same hints-streaming mechanism as in the decommission code. Furthermore, there needs to be a cmd for NodeTool to trigger this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8132) Save or stream hints to a safe place in node replacement
[ https://issues.apache.org/jira/browse/CASSANDRA-8132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174729#comment-14174729 ] Minh Do commented on CASSANDRA-8132: Brandon, I mean it is the other way around: we stream hints from the node about to be replaced to one of its neighbors. It is just like in unbootstrap(), where we have to stream hints to the closest node prior to the shutdown. We need to do this because we don't want to lose hints when shutting down a node and replacing it with a new instance or machine. Save or stream hints to a safe place in node replacement Key: CASSANDRA-8132 URL: https://issues.apache.org/jira/browse/CASSANDRA-8132 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Minh Do Assignee: Minh Do Fix For: 2.1.1 Often, we need to replace a node with a new instance in the cloud environment where all nodes are still alive. To be safe without losing data, we usually make sure all hints are gone before we do this operation. Replacement means we just want to shut down the C* process on a node and bring up another instance to take over that node's token. However, if a node to be replaced has a lot of stored hints, its HintedHandOffManager seems very slow to send the hints to other nodes. In our case, we tried to replace a node and had to wait for several days before its stored hints were cleared out. As mentioned above, we need all hints on this node to clear out before we can terminate it and replace it with a new instance/machine. Since this is not a decommission, I am proposing that we have the same hints-streaming mechanism as in the decommission code. Furthermore, there needs to be a cmd for NodeTool to trigger this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8094) Heavy writes in RangeSlice read requests
[ https://issues.apache.org/jira/browse/CASSANDRA-8094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14169774#comment-14169774 ] Minh Do commented on CASSANDRA-8094: @Jonathan, can we introduce another option per Column Family, similar to read_repair_chance? Heavy writes in RangeSlice read requests -- Key: CASSANDRA-8094 URL: https://issues.apache.org/jira/browse/CASSANDRA-8094 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Minh Do Assignee: Minh Do Fix For: 2.0.11 RangeSlice requests always do a scheduled read repair when coordinators try to resolve replicas' responses, whether or not read_repair_chance is set. Because of this, in clusters with low writes and high reads, there is very heavy write traffic between nodes. We should have an option to turn this off, and it can be different from read_repair_chance. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (CASSANDRA-8094) Heavy writes in RangeSlice read requests
Minh Do created CASSANDRA-8094: -- Summary: Heavy writes in RangeSlice read requests Key: CASSANDRA-8094 URL: https://issues.apache.org/jira/browse/CASSANDRA-8094 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Minh Do Assignee: Minh Do Fix For: 2.0.11 RangeSlice requests always do a scheduler read repair when coordinators try to resolve replicats' responses no matter read_repair_chance is set or not. Because of this, in low writes and high reads clusters, there are very high write requests going on between nodes. We should have an option to turn this off and this can be different than the read_repair_chance. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-8094) Heavy writes in RangeSlice read requests
[ https://issues.apache.org/jira/browse/CASSANDRA-8094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Minh Do updated CASSANDRA-8094: --- Description: RangeSlice requests always do a scheduled read repair when coordinators try to resolve replicas' responses, whether or not read_repair_chance is set. Because of this, in clusters with low writes and high reads, there is very heavy write traffic between nodes. We should have an option to turn this off, and it can be different from read_repair_chance. was: RangeSlice requests always do a scheduler read repair when coordinators try to resolve replicats' responses no matter read_repair_chance is set or not. Because of this, in low writes and high reads clusters, there are very high write requests going on between nodes. We should have an option to turn this off and this can be different than the read_repair_chance. Heavy writes in RangeSlice read requests -- Key: CASSANDRA-8094 URL: https://issues.apache.org/jira/browse/CASSANDRA-8094 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Minh Do Assignee: Minh Do Fix For: 2.0.11 RangeSlice requests always do a scheduled read repair when coordinators try to resolve replicas' responses, whether or not read_repair_chance is set. Because of this, in clusters with low writes and high reads, there is very heavy write traffic between nodes. We should have an option to turn this off, and it can be different from read_repair_chance. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (CASSANDRA-7818) Improve compaction logging
[ https://issues.apache.org/jira/browse/CASSANDRA-7818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Minh Do reassigned CASSANDRA-7818: -- Assignee: Minh Do Improve compaction logging -- Key: CASSANDRA-7818 URL: https://issues.apache.org/jira/browse/CASSANDRA-7818 Project: Cassandra Issue Type: Improvement Reporter: Marcus Eriksson Assignee: Minh Do Priority: Minor Labels: compaction, lhf Fix For: 2.1.1 We should log more information about compactions to be able to debug issues more efficiently * give each CompactionTask an id that we log (so that you can relate the start-compaction-messages to the finished-compaction ones) * log what level the sstables are taken from -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-6702) Upgrading node uses the wrong port in gossiping
[ https://issues.apache.org/jira/browse/CASSANDRA-6702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14038266#comment-14038266 ] Minh Do commented on CASSANDRA-6702: If I recall correctly, this happened on C* 1.2 nodes while the cluster was still in a mixed mode and the target nodes were seed nodes (C* 1.1.x). After a while, gossip seemed to settle down correctly on the right IPs and ports. However, this took significant time depending on the size of the cluster. Upgrading node uses the wrong port in gossiping --- Key: CASSANDRA-6702 URL: https://issues.apache.org/jira/browse/CASSANDRA-6702 Project: Cassandra Issue Type: Bug Components: Core Environment: 1.1.7, AWS, Ec2MultiRegionSnitch Reporter: Minh Do Priority: Minor Fix For: 1.2.17 When upgrading a node in a 1.1.7 (or 1.1.11) cluster to 1.2.15 and inspecting the gossip information on port/IP, I could see that the upgrading node (1.2 version) communicates with one other node in the same region using the public IP and non-encrypted port. For the rest, the upgrading node uses the correct ports and IPs to communicate in this manner: Same region: private IP and non-encrypted port, and Different region: public IP and encrypted port. Because there is one node like this (or 2 out of a 12-node cluster in which nodes are split equally across 2 AWS regions), we have to modify the Security Group to allow the new traffic. Without modifying the SG, the 95th and 99th latencies for both reads and writes in the cluster are very bad (due to RPC timeout). Inspecting closer, that upgraded node (1.2 node) is contributing to all of the high latencies whenever it acts as a coordinator node. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (CASSANDRA-6702) Upgrading node uses the wrong port in gossiping
[ https://issues.apache.org/jira/browse/CASSANDRA-6702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Minh Do updated CASSANDRA-6702: --- Description: When upgrading a node in a 1.1.7 (or 1.1.11) cluster to 1.2.15 and inspecting the gossip information on port/IP, I could see that the upgrading node (1.2 version) communicates with one other node in the same region using the public IP and non-encrypted port. For the rest, the upgrading node uses the correct ports and IPs to communicate in this manner: Same region: private IP and non-encrypted port, and Different region: public IP and encrypted port. Because there is one node like this (or 2 out of a 12-node cluster in which nodes are split equally across 2 AWS regions), we have to modify the Security Group to allow the new traffic. Without modifying the SG, the 95th and 99th latencies for both reads and writes in the cluster are very bad (due to RPC timeout). Inspecting closer, that upgraded node (1.2 node) is contributing to all of the high latencies whenever it acts as a coordinator node. was: When upgrading a node in a 1.1.7 (or 1.1.11) cluster to 1.2.15 and inspecting the gossip information on port/IP, I could see that the upgrading node (1.2 version) communicates with one other node in the same region using the public IP and non-encrypted port. For the rest, the upgrading node uses the correct ports and IPs to communicate in this manner: Same region: private IP and non-encrypted port, and Different region: public IP and encrypted port. Because there is one node like this (or probably 2 max), we have to modify the Security Group to allow the new traffic. Without modifying the SG, the 95th and 99th latencies for both reads and writes in the cluster are very bad (due to RPC timeout). Inspecting closer, that upgraded node (1.2 node) is contributing to all of the high latencies whenever it acts as a coordinator node. Upgrading node uses the wrong port in gossiping --- Key: CASSANDRA-6702 URL: https://issues.apache.org/jira/browse/CASSANDRA-6702 Project: Cassandra Issue Type: Bug Components: Core Environment: 1.1.7, AWS, Ec2MultiRegionSnitch Reporter: Minh Do Priority: Minor Fix For: 1.2.16 When upgrading a node in a 1.1.7 (or 1.1.11) cluster to 1.2.15 and inspecting the gossip information on port/IP, I could see that the upgrading node (1.2 version) communicates with one other node in the same region using the public IP and non-encrypted port. For the rest, the upgrading node uses the correct ports and IPs to communicate in this manner: Same region: private IP and non-encrypted port, and Different region: public IP and encrypted port. Because there is one node like this (or 2 out of a 12-node cluster in which nodes are split equally across 2 AWS regions), we have to modify the Security Group to allow the new traffic. Without modifying the SG, the 95th and 99th latencies for both reads and writes in the cluster are very bad (due to RPC timeout). Inspecting closer, that upgraded node (1.2 node) is contributing to all of the high latencies whenever it acts as a coordinator node. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Created] (CASSANDRA-6702) Upgrading node uses the wrong port in gossiping
Minh Do created CASSANDRA-6702: -- Summary: Upgrading node uses the wrong port in gossiping Key: CASSANDRA-6702 URL: https://issues.apache.org/jira/browse/CASSANDRA-6702 Project: Cassandra Issue Type: Bug Components: Core Environment: 1.1.7, AWS, Ec2MultiRegionSnitch Reporter: Minh Do Priority: Minor Fix For: 1.2.15 When upgrading a node in a 1.1.7 (or 1.1.11) cluster to 1.2.15 and inspecting the gossip information on port/IP, I could see that the upgrading node (1.2 version) communicates with one other node in the same region using the public IP and non-encrypted port. For the rest, the upgrading node uses the correct ports and IPs to communicate in this manner: Same region: private IP and non-encrypted port, and Different region: public IP and encrypted port. Because there is one node like this (or probably 2 max), we have to modify the Security Group to allow the new traffic. Without modifying the SG, the 95th and 99th latencies for both reads and writes in the cluster are very bad (due to RPC timeout). Inspecting closer, that upgraded node (1.2 node) is contributing to all of the high latencies whenever it acts as a coordinator node. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (CASSANDRA-5263) Allow Merkle tree maximum depth to be configurable
[ https://issues.apache.org/jira/browse/CASSANDRA-5263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13890449#comment-13890449 ] Minh Do commented on CASSANDRA-5263: If I understand correctly, are you saying that if N is the total number of rows in all SSTables on a node for a given token range, then depth = log2(N)? This works if a node does not hold too many rows. Can we safely assume that a node does not hold more than 2^24 rows (or 16.7M rows)? For this many rows, we need to build a Merkle tree with depth 24, which requires about 1.6G of heap. Beyond this number, I would say we run into heap allocation issues. I was thinking earlier that depth 20 is the maximum allowable depth, and I worked my way down from it to compute lower-depth trees. Allow Merkle tree maximum depth to be configurable -- Key: CASSANDRA-5263 URL: https://issues.apache.org/jira/browse/CASSANDRA-5263 Project: Cassandra Issue Type: Improvement Components: Config Affects Versions: 1.1.9 Reporter: Ahmed Bashir Assignee: Minh Do Currently, the maximum depth allowed for Merkle trees is hardcoded as 15. This value should be configurable, just like phi_convict_treshold and other properties. Given a cluster with nodes responsible for a large number of row keys, Merkle tree comparisons can result in a large amount of unnecessary row keys being streamed. Empirical testing indicates that reasonable changes to this depth (18, 20, etc) don't affect the Merkle tree generation and differencing timings all that much, and they can significantly reduce the amount of data being streamed during repair. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (CASSANDRA-5263) Allow Merkle tree maximum depth to be configurable
[ https://issues.apache.org/jira/browse/CASSANDRA-5263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13889277#comment-13889277 ] Minh Do commented on CASSANDRA-5263: Using some generated sstable files within a token range, I ran a test building the Merkle tree at depth 20 and then adding the computed hash values for rows (69M added rows). These 2 steps together are equivalent to a validation compaction process on a token range, if I am not missing anything. 1. Tree building uses, on average, 15-18% of total CPU resources, and no I/O. 2. SSTable scanning and row hash computation use, on average, 10-12% of total CPU resources, and I/O resources limited by the configurable global compaction rate limiter. Given Jonathan's pointer on using SSTR.estimatedKeysForRanges() to calculate the number of rows for an SSTable file, and assuming no overlapping among SSTable files (worst case), we can estimate how many data rows are in a given token range. From what I understand, here is the formula to calculate the Merkle tree's depth (assuming each data row has a unique hash value): 1. If the number of rows from all SSTables in a given range is approximately equal to the maximum number of hash entries in that range (subject to a CF's partitioner), then we build the tree at depth 20 (the densest case). 2. When the number of rows from all SSTables in a given range does not cover the full hash range, or in the sparse case, we build a Merkle tree with a depth less than 20. How do we come up with the right depth? depth = 20 * (n rows / max rows), where n is the total number of rows in all SSTables and max is the maximum number of hash entries in that token range. However, since different partitioners give different max numbers, is there anything we can assume to make it easy here, like assuming all partitioners would have the same hash entries in a given token range? Allow Merkle tree maximum depth to be configurable -- Key: CASSANDRA-5263 URL: https://issues.apache.org/jira/browse/CASSANDRA-5263 Project: Cassandra Issue Type: Improvement Components: Config Affects Versions: 1.1.9 Reporter: Ahmed Bashir Assignee: Minh Do Currently, the maximum depth allowed for Merkle trees is hardcoded as 15. This value should be configurable, just like phi_convict_treshold and other properties. Given a cluster with nodes responsible for a large number of row keys, Merkle tree comparisons can result in a large amount of unnecessary row keys being streamed. Empirical testing indicates that reasonable changes to this depth (18, 20, etc) don't affect the Merkle tree generation and differencing timings all that much, and they can significantly reduce the amount of data being streamed during repair. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
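A small sketch of the depth heuristic described above; the class and method names are illustrative, not from the Cassandra code base, and estimatedRows stands in for the sum of SSTR.estimatedKeysForRanges() over the sstables in the range:

    // Scale the depth down from the 20-level maximum when the estimated row
    // count covers only part of the hash space for the token range.
    final class MerkleDepthEstimator
    {
        static final int MAX_DEPTH = 20;

        static int estimateDepth(long estimatedRows, double maxHashEntries)
        {
            if (estimatedRows <= 0 || maxHashEntries <= 0)
                return 1;
            // depth = 20 * (n rows / max rows), clamped to [1, 20]
            int depth = (int) Math.ceil(MAX_DEPTH * (estimatedRows / maxHashEntries));
            return Math.max(1, Math.min(MAX_DEPTH, depth));
        }
    }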
[jira] [Commented] (CASSANDRA-6619) Race condition issue during upgrading 1.1 to 1.2
[ https://issues.apache.org/jira/browse/CASSANDRA-6619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13884312#comment-13884312 ] Minh Do commented on CASSANDRA-6619: Jonathan, you are right that both 1.2 and 1.1 are designed to read out the versions from each other's headers. However, 1.2, as the sender opening the outbound socket, expects to receive the version int back immediately after it sends out its own. 1.1, as the receiver, can read the 1.2 header but does not send the version int back. Here is the piece of code in 1.2 in IncomingTcpConnection.java that sends back the version int: private void handleModernVersion(int version, int header) throws IOException { DataOutputStream out = new DataOutputStream(socket.getOutputStream()); out.writeInt(MessagingService.current_version); out.flush(); ... } Because 1.1 does not send this back immediately, OutboundTcpConnection will time out on the read and the socket gets disconnected. The whole cycle repeats again and again until some code sets the target version right. In the lucky case, IncomingTcpConnection sets the right target version. However, it takes a while for the other 1.1 nodes to learn that there is a new 1.2 node, especially if the new 1.2 node can't connect to any 1.1 nodes first. Race condition issue during upgrading 1.1 to 1.2 Key: CASSANDRA-6619 URL: https://issues.apache.org/jira/browse/CASSANDRA-6619 Project: Cassandra Issue Type: Bug Components: Core Reporter: Minh Do Assignee: Minh Do Priority: Minor Fix For: 1.2.14 Attachments: patch.txt There is a race condition during upgrading a C* 1.1x cluster to C* 1.2. One issue is that OutboundTCPConnection can't establish from a 1.2 node to some 1.1x nodes. Because of this, a live cluster during the upgrading will suffer in high read latency and be unable to fulfill some write requests. It won't be a problem if there is a small cluster but it is a problem in a large cluster (100+ nodes) because the upgrading process takes 10+ hours to 1+ day(s) to complete. Acknowledging about CASSANDRA-5692, however, it is not fully fixed. We already have a patch for this and will attach shortly for feedback. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
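To make the failure mode concrete, here is a sketch of the outbound-side expectation described in the comment; it is not the actual OutboundTcpConnection code, and the method name and timeout handling are assumptions:

    import java.io.DataInputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;
    import java.net.Socket;
    import java.net.SocketTimeoutException;

    final class HandshakeSketch
    {
        // Write our version, then block (with a timeout) for the peer's version int.
        // A 1.2 peer echoes its version back immediately; a 1.1 peer never does,
        // so the read times out, the socket is closed, and the cycle repeats.
        static Integer exchangeVersions(Socket socket, int ourVersion, int readTimeoutMillis)
                throws IOException
        {
            socket.setSoTimeout(readTimeoutMillis);
            DataOutputStream out = new DataOutputStream(socket.getOutputStream());
            out.writeInt(ourVersion);
            out.flush();
            try
            {
                return new DataInputStream(socket.getInputStream()).readInt();
            }
            catch (SocketTimeoutException e)
            {
                socket.close(); // handshake failed; it will be retried later
                return null;
            }
        }
    }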
[jira] [Comment Edited] (CASSANDRA-6619) Race condition issue during upgrading 1.1 to 1.2
[ https://issues.apache.org/jira/browse/CASSANDRA-6619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13884312#comment-13884312 ] Minh Do edited comment on CASSANDRA-6619 at 1/28/14 4:41 PM: - Jonathan, you are right that both 1.2 and 1.1 are designed to read out the versions from each other's headers. However, 1.2, as the sender opening the outbound socket, expects to receive the version int back immediately after it sends out its own. 1.1, as the receiver, can read the 1.2 header but does not send the version int back. Here is the piece of code in 1.2 in IncomingTcpConnection.java that sends back the version int: private void handleModernVersion(int version, int header) throws IOException { DataOutputStream out = new DataOutputStream(socket.getOutputStream()); out.writeInt(MessagingService.current_version); out.flush(); ... } Because 1.1 does not send this back immediately, 1.2 OutboundTcpConnection will time out on the read and the socket gets disconnected. The whole cycle repeats again and again until some code sets the target version right. In the lucky case, IncomingTcpConnection sets the right target version. However, it takes a while for the other 1.1 nodes to learn that there is a new 1.2 node, especially if the new 1.2 node can't connect to any 1.1 nodes first. was (Author: timiblossom): Jonathan, you are right that both 1.2 and 1.1 are designed to read out the versions from each other's headers. However, 1.2, as the sender opening the outbound socket, expects to receive the version int back immediately after it sends out its own. 1.1, as the receiver, can read the 1.2 header but does not send the version int back. Here is the piece of code in 1.2 in IncomingTcpConnection.java that sends back the version int: private void handleModernVersion(int version, int header) throws IOException { DataOutputStream out = new DataOutputStream(socket.getOutputStream()); out.writeInt(MessagingService.current_version); out.flush(); ... } Because 1.1 does not send this back immediately, OutboundTcpConnection will time out on the read and the socket gets disconnected. The whole cycle repeats again and again until some code sets the target version right. In the lucky case, IncomingTcpConnection sets the right target version. However, it takes a while for the other 1.1 nodes to learn that there is a new 1.2 node, especially if the new 1.2 node can't connect to any 1.1 nodes first. Race condition issue during upgrading 1.1 to 1.2 Key: CASSANDRA-6619 URL: https://issues.apache.org/jira/browse/CASSANDRA-6619 Project: Cassandra Issue Type: Bug Components: Core Reporter: Minh Do Assignee: Minh Do Priority: Minor Fix For: 1.2.14 Attachments: patch.txt There is a race condition during upgrading a C* 1.1x cluster to C* 1.2. One issue is that OutboundTCPConnection can't establish from a 1.2 node to some 1.1x nodes. Because of this, a live cluster during the upgrading will suffer in high read latency and be unable to fulfill some write requests. It won't be a problem if there is a small cluster but it is a problem in a large cluster (100+ nodes) because the upgrading process takes 10+ hours to 1+ day(s) to complete. Acknowledging about CASSANDRA-5692, however, it does not fully fix the issue. We already have a patch for this and will attach shortly for feedback. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (CASSANDRA-6619) Race condition issue during upgrading 1.1 to 1.2
[ https://issues.apache.org/jira/browse/CASSANDRA-6619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Minh Do updated CASSANDRA-6619: --- Description: There is a race condition during upgrading a C* 1.1x cluster to C* 1.2. One issue is that OutboundTCPConnection can't establish from a 1.2 node to some 1.1x nodes. Because of this, a live cluster during the upgrading will suffer in high read latency and be unable to fulfill some write requests. It won't be a problem if there is a small cluster but it is a problem in a large cluster (100+ nodes) because the upgrading process takes 10+ hours to 1+ day(s) to complete. Acknowledging about CASSANDRA-5692, however, it does not fully fix the issue. We already have a patch for this and will attach shortly for feedback. was: There is a race condition during upgrading a C* 1.1x cluster to C* 1.2. One issue is that OutboundTCPConnection can't establish from a 1.2 node to some 1.1x nodes. Because of this, a live cluster during the upgrading will suffer in high read latency and be unable to fulfill some write requests. It won't be a problem if there is a small cluster but it is a problem in a large cluster (100+ nodes) because the upgrading process takes 10+ hours to 1+ day(s) to complete. Acknowledging about CASSANDRA-5692, however, it is not fully fixed. We already have a patch for this and will attach shortly for feedback. Race condition issue during upgrading 1.1 to 1.2 Key: CASSANDRA-6619 URL: https://issues.apache.org/jira/browse/CASSANDRA-6619 Project: Cassandra Issue Type: Bug Components: Core Reporter: Minh Do Assignee: Minh Do Priority: Minor Fix For: 1.2.14 Attachments: patch.txt There is a race condition during upgrading a C* 1.1x cluster to C* 1.2. One issue is that OutboundTCPConnection can't establish from a 1.2 node to some 1.1x nodes. Because of this, a live cluster during the upgrading will suffer in high read latency and be unable to fulfill some write requests. It won't be a problem if there is a small cluster but it is a problem in a large cluster (100+ nodes) because the upgrading process takes 10+ hours to 1+ day(s) to complete. Acknowledging about CASSANDRA-5692, however, it does not fully fix the issue. We already have a patch for this and will attach shortly for feedback. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Comment Edited] (CASSANDRA-6619) Race condition issue during upgrading 1.1 to 1.2
[ https://issues.apache.org/jira/browse/CASSANDRA-6619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13884312#comment-13884312 ] Minh Do edited comment on CASSANDRA-6619 at 1/28/14 4:50 PM: - Jonathan, you are right that both 1.2 and 1.1 are designed to read out the versions from each other's headers. However, 1.2, as the sender opening the outbound socket, expects to receive the version int back immediately after it sends out its own. 1.1, as the receiver, can read the 1.2 header but does not send the version int back. Here is the piece of code in 1.2 in IncomingTcpConnection.java that sends back the version int: private void handleModernVersion(int version, int header) throws IOException { DataOutputStream out = new DataOutputStream(socket.getOutputStream()); out.writeInt(MessagingService.current_version); out.flush(); ... } Because 1.1 does not send this back immediately, 1.2 OutboundTcpConnection will time out on the read and the socket gets disconnected. The whole cycle repeats again and again until some code sets the target version right. In the lucky case, IncomingTcpConnection sets the right target version. However, it takes a while for the other 1.1 nodes to learn that there is a new 1.2 node, especially if the new 1.2 node can't connect to any 1.1 nodes first. The version convergence will eventually settle down. However, in a large cluster, this would take some time, causing side effects during that period such as high read latencies and more hints being stored. was (Author: timiblossom): Jonathan, you are right that both 1.2 and 1.1 are designed to read out the versions from each other's headers. However, 1.2, as the sender opening the outbound socket, expects to receive the version int back immediately after it sends out its own. 1.1, as the receiver, can read the 1.2 header but does not send the version int back. Here is the piece of code in 1.2 in IncomingTcpConnection.java that sends back the version int: private void handleModernVersion(int version, int header) throws IOException { DataOutputStream out = new DataOutputStream(socket.getOutputStream()); out.writeInt(MessagingService.current_version); out.flush(); ... } Because 1.1 does not send this back immediately, 1.2 OutboundTcpConnection will time out on the read and the socket gets disconnected. The whole cycle repeats again and again until some code sets the target version right. In the lucky case, IncomingTcpConnection sets the right target version. However, it takes a while for the other 1.1 nodes to learn that there is a new 1.2 node, especially if the new 1.2 node can't connect to any 1.1 nodes first. Race condition issue during upgrading 1.1 to 1.2 Key: CASSANDRA-6619 URL: https://issues.apache.org/jira/browse/CASSANDRA-6619 Project: Cassandra Issue Type: Bug Components: Core Reporter: Minh Do Assignee: Minh Do Priority: Minor Fix For: 1.2.14 Attachments: patch.txt There is a race condition during upgrading a C* 1.1x cluster to C* 1.2. One issue is that OutboundTCPConnection can't establish from a 1.2 node to some 1.1x nodes. Because of this, a live cluster during the upgrading will suffer in high read latency and be unable to fulfill some write requests. It won't be a problem if there is a small cluster but it is a problem in a large cluster (100+ nodes) because the upgrading process takes 10+ hours to 1+ day(s) to complete. Acknowledging about CASSANDRA-5692, however, it does not fully fix the issue. 
We already have a patch for this and will attach shortly for feedback. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (CASSANDRA-6619) Race condition issue during upgrading 1.1 to 1.2
[ https://issues.apache.org/jira/browse/CASSANDRA-6619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13883419#comment-13883419 ] Minh Do commented on CASSANDRA-6619: As posted in other tickets, 1.1 and 1.2 have different message protocols. Hence, it is important to set the right target version when making outbound connections rather than depending on the inbound connections to set a version value; this resolves the race condition in setting the version values. Attached is a patch to make sure the code does that when an outbound connection is opened and the exchange of versioning information in the handshake fails. As discussed with Jason Brown here at Netflix, we came up with a solution: during the upgrade, the upgraded nodes have the variable cassandra.prev_version = 5 (for 1.1.7, or 4 for 1.1) in their environment to help out the handshakes in a mixed-version cluster. Once a cluster is fully upgraded to 1.2, cassandra.prev_version is removed from all nodes' environment and a C* rolling restart across nodes is required. This step ensures that the new patch won't penalize the 1.2 cluster where all outbound connections are from 1.2 to 1.2. Race condition issue during upgrading 1.1 to 1.2 Key: CASSANDRA-6619 URL: https://issues.apache.org/jira/browse/CASSANDRA-6619 Project: Cassandra Issue Type: Bug Components: Core Reporter: Minh Do Assignee: Minh Do Priority: Minor Fix For: 1.2.14 There is a race condition during upgrading a C* 1.1x cluster to C* 1.2. One issue is that OutboundTCPConnection can't establish from a 1.2 node to some 1.1x nodes. Because of this, a live cluster during the upgrading will suffer in high read latency and be unable to fulfill some write requests. It won't be a problem if there is a small cluster but it is a problem in a large cluster (100+ nodes) because the upgrading process takes 10+ hours to 1+ day(s) to complete. Acknowledging about CASSANDRA-5692, however, it is not fully fixed. We already have a patch for this and will attach shortly for feedback. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
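The attached patch is not reproduced here, so how it consumes cassandra.prev_version is an assumption; the sketch below only illustrates reading it as a JVM system property (e.g. started with -Dcassandra.prev_version=5 during a 1.1.7 to 1.2 upgrade) and falling back to it when the version handshake fails:

    final class PrevVersionFallback
    {
        // null once the property is removed after the whole cluster runs 1.2
        static final Integer PREV_VERSION = Integer.getInteger("cassandra.prev_version");

        static int targetVersionAfterFailedHandshake(int currentVersion)
        {
            // During the mixed-version phase, assume the peer still speaks the
            // previous protocol version; afterwards, just use our own version.
            return PREV_VERSION != null ? PREV_VERSION : currentVersion;
        }
    }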
[jira] [Updated] (CASSANDRA-6619) Race condition issue during upgrading 1.1 to 1.2
[ https://issues.apache.org/jira/browse/CASSANDRA-6619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Minh Do updated CASSANDRA-6619: --- Attachment: (was: diff) Race condition issue during upgrading 1.1 to 1.2 Key: CASSANDRA-6619 URL: https://issues.apache.org/jira/browse/CASSANDRA-6619 Project: Cassandra Issue Type: Bug Components: Core Reporter: Minh Do Assignee: Minh Do Priority: Minor Fix For: 1.2.14 There is a race condition during upgrading a C* 1.1x cluster to C* 1.2. One issue is that OutboundTCPConnection can't establish from a 1.2 node to some 1.1x nodes. Because of this, a live cluster during the upgrading will suffer in high read latency and be unable to fulfill some write requests. It won't be a problem if there is a small cluster but it is a problem in a large cluster (100+ nodes) because the upgrading process takes 10+ hours to 1+ day(s) to complete. Acknowledging about CASSANDRA-5692, however, it is not fully fixed. We already have a patch for this and will attach shortly for feedback. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (CASSANDRA-6619) Race condition issue during upgrading 1.1 to 1.2
[ https://issues.apache.org/jira/browse/CASSANDRA-6619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Minh Do updated CASSANDRA-6619: --- Attachment: diff Race condition issue during upgrading 1.1 to 1.2 Key: CASSANDRA-6619 URL: https://issues.apache.org/jira/browse/CASSANDRA-6619 Project: Cassandra Issue Type: Bug Components: Core Reporter: Minh Do Assignee: Minh Do Priority: Minor Fix For: 1.2.14 There is a race condition during upgrading a C* 1.1x cluster to C* 1.2. One issue is that OutboundTCPConnection can't establish from a 1.2 node to some 1.1x nodes. Because of this, a live cluster during the upgrading will suffer in high read latency and be unable to fulfill some write requests. It won't be a problem if there is a small cluster but it is a problem in a large cluster (100+ nodes) because the upgrading process takes 10+ hours to 1+ day(s) to complete. Acknowledging about CASSANDRA-5692, however, it is not fully fixed. We already have a patch for this and will attach shortly for feedback. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (CASSANDRA-6619) Race condition issue during upgrading 1.1 to 1.2
[ https://issues.apache.org/jira/browse/CASSANDRA-6619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Minh Do updated CASSANDRA-6619: --- Attachment: patch.txt Race condition issue during upgrading 1.1 to 1.2 Key: CASSANDRA-6619 URL: https://issues.apache.org/jira/browse/CASSANDRA-6619 Project: Cassandra Issue Type: Bug Components: Core Reporter: Minh Do Assignee: Minh Do Priority: Minor Fix For: 1.2.14 Attachments: patch.txt There is a race condition during upgrading a C* 1.1x cluster to C* 1.2. One issue is that OutboundTCPConnection can't establish from a 1.2 node to some 1.1x nodes. Because of this, a live cluster during the upgrading will suffer in high read latency and be unable to fulfill some write requests. It won't be a problem if there is a small cluster but it is a problem in a large cluster (100+ nodes) because the upgrading process takes 10+ hours to 1+ day(s) to complete. Acknowledging about CASSANDRA-5692, however, it is not fully fixed. We already have a patch for this and will attach shortly for feedback. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (CASSANDRA-6619) Race condition issue during upgrading 1.1 to 1.2
[ https://issues.apache.org/jira/browse/CASSANDRA-6619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Minh Do updated CASSANDRA-6619: --- Reviewer: Jason Brown Race condition issue during upgrading 1.1 to 1.2 Key: CASSANDRA-6619 URL: https://issues.apache.org/jira/browse/CASSANDRA-6619 Project: Cassandra Issue Type: Bug Components: Core Reporter: Minh Do Assignee: Minh Do Priority: Minor Fix For: 1.2.14 Attachments: patch.txt There is a race condition during upgrading a C* 1.1x cluster to C* 1.2. One issue is that OutboundTCPConnection can't establish from a 1.2 node to some 1.1x nodes. Because of this, a live cluster during the upgrading will suffer in high read latency and be unable to fulfill some write requests. It won't be a problem if there is a small cluster but it is a problem in a large cluster (100+ nodes) because the upgrading process takes 10+ hours to 1+ day(s) to complete. Acknowledging about CASSANDRA-5692, however, it is not fully fixed. We already have a patch for this and will attach shortly for feedback. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Created] (CASSANDRA-6619) Race condition during upgrading 1.1 to 1.2
Minh Do created CASSANDRA-6619: -- Summary: Race condition during upgrading 1.1 to 1.2 Key: CASSANDRA-6619 URL: https://issues.apache.org/jira/browse/CASSANDRA-6619 Project: Cassandra Issue Type: Bug Components: Core Reporter: Minh Do Assignee: Minh Do Priority: Minor Fix For: 1.2.14 There was a race condition during upgrading a C* 1.1x cluster to C* 1.2. One issue is that OutboundTCPConnection can't establish from a 1.2 node to some 1.1x nodes. Because of this, a live cluster during the upgrading will suffer in high read latency and be unable to fulfill some write requests. It won't be a problem if there is a small cluster but it is a problem in a large cluster because the upgrading process takes 10+ hours to 1+ days to complete. Acknowledging about CASSANDRA-5692, however, it is not fully fixed. We already have a patch for this and will attach shortly to let the community to review. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (CASSANDRA-6619) Race condition issue during upgrading 1.1 to 1.2
[ https://issues.apache.org/jira/browse/CASSANDRA-6619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Minh Do updated CASSANDRA-6619: --- Description: There is a race condition during upgrading a C* 1.1x cluster to C* 1.2. One issue is that OutboundTCPConnection can't establish from a 1.2 node to some 1.1x nodes. Because of this, a live cluster during the upgrading will suffer in high read latency and be unable to fulfill some write requests. It won't be a problem if there is a small cluster but it is a problem in a large cluster (100+ nodes) because the upgrading process takes 10+ hours to 1+ day(s) to complete. Acknowledging about CASSANDRA-5692, however, it is not fully fixed. We already have a patch for this and will attach shortly for feedback. was: There was a race condition during upgrading a C* 1.1x cluster to C* 1.2. One issue is that OutboundTCPConnection can't establish from a 1.2 node to some 1.1x nodes. Because of this, a live cluster during the upgrading will suffer in high read latency and be unable to fulfill some write requests. It won't be a problem if there is a small cluster but it is a problem in a large cluster because the upgrading process takes 10+ hours to 1+ days to complete. Acknowledging about CASSANDRA-5692, however, it is not fully fixed. We already have a patch for this and will attach shortly to let the community to review. Summary: Race condition issue during upgrading 1.1 to 1.2 (was: Race condition during upgrading 1.1 to 1.2) Race condition issue during upgrading 1.1 to 1.2 Key: CASSANDRA-6619 URL: https://issues.apache.org/jira/browse/CASSANDRA-6619 Project: Cassandra Issue Type: Bug Components: Core Reporter: Minh Do Assignee: Minh Do Priority: Minor Fix For: 1.2.14 There is a race condition during upgrading a C* 1.1x cluster to C* 1.2. One issue is that OutboundTCPConnection can't establish from a 1.2 node to some 1.1x nodes. Because of this, a live cluster during the upgrading will suffer in high read latency and be unable to fulfill some write requests. It won't be a problem if there is a small cluster but it is a problem in a large cluster (100+ nodes) because the upgrading process takes 10+ hours to 1+ day(s) to complete. Acknowledging about CASSANDRA-5692, however, it is not fully fixed. We already have a patch for this and will attach shortly for feedback. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (CASSANDRA-5263) Allow Merkle tree maximum depth to be configurable
[ https://issues.apache.org/jira/browse/CASSANDRA-5263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13869384#comment-13869384 ] Minh Do commented on CASSANDRA-5263: I also don't see how we can use sstable stats to adjust the MerkleTree depth automatically. We can estimate the number of rows for each sstable, but we don't know how many rows are in a given range (unless we assume the input is always a full range). In terms of memory usage, a MerkleTree with depth 20 uses around 100Mb and a MerkleTree with depth 17 uses around 15Mb. Does the extra 100Mb hurt Cassandra performance on some nodes in some cases if we go to this extreme? Also, if we use depth 20 and a multithreaded version to build the MerkleTree, it is going to impact the response latency. Some thoughts? Allow Merkle tree maximum depth to be configurable -- Key: CASSANDRA-5263 URL: https://issues.apache.org/jira/browse/CASSANDRA-5263 Project: Cassandra Issue Type: Improvement Components: Config Affects Versions: 1.1.9 Reporter: Ahmed Bashir Assignee: Minh Do Currently, the maximum depth allowed for Merkle trees is hardcoded as 15. This value should be configurable, just like phi_convict_treshold and other properties. Given a cluster with nodes responsible for a large number of row keys, Merkle tree comparisons can result in a large amount of unnecessary row keys being streamed. Empirical testing indicates that reasonable changes to this depth (18, 20, etc) don't affect the Merkle tree generation and differencing timings all that much, and they can significantly reduce the amount of data being streamed during repair. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Assigned] (CASSANDRA-5263) Allow Merkle tree maximum depth to be configurable
[ https://issues.apache.org/jira/browse/CASSANDRA-5263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Minh Do reassigned CASSANDRA-5263: -- Assignee: Minh Do Allow Merkle tree maximum depth to be configurable -- Key: CASSANDRA-5263 URL: https://issues.apache.org/jira/browse/CASSANDRA-5263 Project: Cassandra Issue Type: Improvement Components: Config Affects Versions: 1.1.9 Reporter: Ahmed Bashir Assignee: Minh Do Currently, the maximum depth allowed for Merkle trees is hardcoded as 15. This value should be configurable, just like phi_convict_treshold and other properties. Given a cluster with nodes responsible for a large number of row keys, Merkle tree comparisons can result in a large amount of unnecessary row keys being streamed. Empirical testing indicates that reasonable changes to this depth (18, 20, etc) don't affect the Merkle tree generation and differencing timings all that much, and they can significantly reduce the amount of data being streamed during repair. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Assigned] (CASSANDRA-6323) Create new sstables in the highest possible level
[ https://issues.apache.org/jira/browse/CASSANDRA-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Minh Do reassigned CASSANDRA-6323: -- Assignee: Minh Do Create new sstables in the highest possible level - Key: CASSANDRA-6323 URL: https://issues.apache.org/jira/browse/CASSANDRA-6323 Project: Cassandra Issue Type: Bug Components: Core Reporter: Jonathan Ellis Assignee: Minh Do Priority: Minor Labels: compaction Fix For: 2.0.3 See PickLevelForMemTableOutput here: https://code.google.com/p/leveldb/source/browse/db/version_set.cc#507 (Moving from CASSANDRA-5936) -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Created] (CASSANDRA-6308) Thread leak caused in creating OutboundTcpConnectionPool
Minh Do created CASSANDRA-6308: -- Summary: Thread leak caused in creating OutboundTcpConnectionPool Key: CASSANDRA-6308 URL: https://issues.apache.org/jira/browse/CASSANDRA-6308 Project: Cassandra Issue Type: Bug Components: Core Reporter: Minh Do Priority: Minor Fix For: 1.2.12, 2.0.3 We have seen in one of our large clusters that there are many OutboundTcpConnection threads having the same names. From a thread dump, OutboundTcpConnection threads have accounted for the largest shares of the total threads (65%+) and kept growing. Here is a portion of a grep output for threads in which names start with WRITE-: WRITE-/10.28.131.195 daemon prio=10 tid=0x2aaac4022000 nid=0x2cb5 waiting on condition [0x2acfbacda000] WRITE-/10.28.131.195 daemon prio=10 tid=0x2aaac42fe000 nid=0x2cb4 waiting on condition [0x2acfbacad000] WRITE-/10.30.142.49 daemon prio=10 tid=0x4084 nid=0x2cb1 waiting on condition [0x2acfbac8] WRITE-/10.6.222.233 daemon prio=10 tid=0x4083e000 nid=0x2cb0 waiting on condition [0x2acfbac53000] WRITE-/10.6.222.233 daemon prio=10 tid=0x4083b800 nid=0x2caf waiting on condition [0x2acfbac26000] WRITE-/10.6.222.233 daemon prio=10 tid=0x40839800 nid=0x2cae waiting on condition [0x2acfbabf9000] WRITE-/10.6.222.233 daemon prio=10 tid=0x40837800 nid=0x2cad waiting on condition [0x2acfbabcc000] WRITE-/10.6.222.233 daemon prio=10 tid=0x404a3800 nid=0x2cac waiting on condition [0x2acfbab9f000] WRITE-/10.30.142.49 daemon prio=10 tid=0x404a1800 nid=0x2cab waiting on condition [0x2acfbab72000] WRITE-/10.6.222.233 daemon prio=10 tid=0x4049f800 nid=0x2caa waiting on condition [0x2acfbab45000] WRITE-/10.6.222.233 daemon prio=10 tid=0x4049e000 nid=0x2ca9 waiting on condition [0x2acfbab18000] WRITE-/10.6.222.233 daemon prio=10 tid=0x4049c800 nid=0x2ca8 waiting on condition [0x2acfbaaeb000] WRITE-/10.157.10.134 daemon prio=10 tid=0x4049a800 nid=0x2ca7 waiting on condition [0x2acfbaabe000] WRITE-/10.6.222.233 daemon prio=10 tid=0x40498800 nid=0x2ca6 waiting on condition [0x2acfbaa91000] WRITE-/10.6.222.233 daemon prio=10 tid=0x40496800 nid=0x2ca5 waiting on condition [0x2acfbaa64000] WRITE-/10.6.222.233 daemon prio=10 tid=0x40717800 nid=0x2ca4 waiting on condition [0x2acfbaa37000] WRITE-/10.6.222.233 daemon prio=10 tid=0x40716000 nid=0x2ca3 waiting on condition [0x2acfbaa0a000] WRITE-/10.30.146.195 daemon prio=10 tid=0x40714800 nid=0x2ca2 waiting on condition [0x2acfba9dd000] WRITE-/10.6.222.233 daemon prio=10 tid=0x40712800 nid=0x2ca1 waiting on condition [0x2acfba9b] WRITE-/10.6.222.233 daemon prio=10 tid=0x40710800 nid=0x2ca0 waiting on condition [0x2acfba983000] WRITE-/10.6.222.233 daemon prio=10 tid=0x4070e800 nid=0x2c9f waiting on condition [0x2acfba956000] WRITE-/10.6.222.233 daemon prio=10 tid=0x4070d000 nid=0x2c9e waiting on condition [0x2acfba929000] WRITE-/10.6.222.233 daemon prio=10 tid=0x4070b800 nid=0x2c9d waiting on condition [0x2acfba8fc000] WRITE-/10.6.222.233 daemon prio=10 tid=0x4070a000 nid=0x2c9c waiting on condition [0x2acfba8cf000] WRITE-/10.6.222.233 daemon prio=10 tid=0x40827000 nid=0x2c9b waiting on condition [0x2acfba8a2000] WRITE-/10.6.222.233 daemon prio=10 tid=0x40825000 nid=0x2c9a waiting on condition [0x2acfba875000] WRITE-/10.6.222.233 daemon prio=10 tid=0x2aaac488e000 nid=0x2c99 waiting on condition [0x2acfba848000] WRITE-/10.6.222.233 daemon prio=10 tid=0x40823000 nid=0x2c98 waiting on condition [0x2acfba81b000] WRITE-/10.6.222.233 daemon prio=10 tid=0x40821800 nid=0x2c97 waiting on condition [0x2acfba7ee000] WRITE-/10.30.146.195 daemon prio=10 tid=0x4081f000 nid=0x2c96 
waiting on condition [0x2acfba7c1000] WRITE-/10.6.222.233 daemon prio=10 tid=0x4081d000 nid=0x2c95 waiting on condition [0x2acfba794000] WRITE-/10.6.222.233 daemon prio=10 tid=0x4081b000 nid=0x2c94 waiting on condition [0x2acfba767000] WRITE-/10.6.222.233 daemon prio=10 tid=0x2aaac488b000 nid=0x2c93 waiting on condition [0x2acfba73a000] WRITE-/10.6.222.233 daemon prio=10 tid=0x40819000 nid=0x2c92 waiting on condition [0x2acfba70d000] WRITE-/10.6.222.233 daemon prio=10 tid=0x407f9000 nid=0x2c91 waiting on condition [0x2acfba6e] WRITE-/10.6.222.233 daemon prio=10 tid=0x407f7000 nid=0x2c90 waiting on condition [0x2acfba6b3000] WRITE-/10.6.222.233 daemon prio=10 tid=0x407f5000 nid=0x2c8f waiting on condition [0x2acfba686000] WRITE-/10.6.222.233 daemon prio=10 tid=0x407f3000 nid=0x2c8d
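One plausible shape of the leak described in this ticket is a race on pool creation: if the pool's constructor starts its WRITE- connection threads and two callers race on putIfAbsent, the losing pool is silently dropped without its threads ever being stopped. The sketch below only illustrates that pattern with simplified, assumed names; it is not the actual OutboundTcpConnectionPool code and the real root cause may differ.
{noformat}
import java.net.InetAddress;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Simplified illustration of a pool-creation race that leaks writer threads;
// class and method names are assumed for the sketch, not the real Cassandra code.
public class PoolCacheSketch
{
    private final ConcurrentMap<InetAddress, Pool> pools = new ConcurrentHashMap<>();

    Pool getConnectionPool(InetAddress to)
    {
        Pool pool = pools.get(to);
        if (pool != null)
            return pool;

        Pool candidate = new Pool(to);                 // constructor starts WRITE-/<ip> threads
        Pool existing = pools.putIfAbsent(to, candidate);
        if (existing != null)
        {
            candidate.shutdown();                      // without this, the loser's threads leak
            return existing;
        }
        return candidate;
    }

    // Minimal stand-in for OutboundTcpConnectionPool.
    static class Pool
    {
        Pool(InetAddress to) { /* start connection writer threads here */ }
        void shutdown()      { /* signal writer threads to exit */ }
    }
}
{noformat}
Either preventing the race or explicitly stopping the losing pool's threads would keep duplicate WRITE-/<ip> threads from accumulating the way the dump above shows.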
[jira] [Updated] (CASSANDRA-6308) Thread leak caused in creating OutboundTcpConnectionPool
[ https://issues.apache.org/jira/browse/CASSANDRA-6308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Minh Do updated CASSANDRA-6308: --- Attachment: patch.txt Thread leak caused in creating OutboundTcpConnectionPool Key: CASSANDRA-6308 URL: https://issues.apache.org/jira/browse/CASSANDRA-6308 Project: Cassandra Issue Type: Bug Components: Core Reporter: Minh Do Priority: Minor Labels: leak, thread Fix For: 1.2.12 Attachments: patch.txt We have seen in one of our large clusters that there are many OutboundTcpConnection threads having the same names. From a thread dump, OutboundTcpConnection threads have accounted for the largest shares of the total threads (65%+) and kept growing. Here is a portion of a grep output for threads in which names start with WRITE-: WRITE-/10.28.131.195 daemon prio=10 tid=0x2aaac4022000 nid=0x2cb5 waiting on condition [0x2acfbacda000] WRITE-/10.28.131.195 daemon prio=10 tid=0x2aaac42fe000 nid=0x2cb4 waiting on condition [0x2acfbacad000] WRITE-/10.30.142.49 daemon prio=10 tid=0x4084 nid=0x2cb1 waiting on condition [0x2acfbac8] WRITE-/10.6.222.233 daemon prio=10 tid=0x4083e000 nid=0x2cb0 waiting on condition [0x2acfbac53000] WRITE-/10.6.222.233 daemon prio=10 tid=0x4083b800 nid=0x2caf waiting on condition [0x2acfbac26000] WRITE-/10.6.222.233 daemon prio=10 tid=0x40839800 nid=0x2cae waiting on condition [0x2acfbabf9000] WRITE-/10.6.222.233 daemon prio=10 tid=0x40837800 nid=0x2cad waiting on condition [0x2acfbabcc000] WRITE-/10.6.222.233 daemon prio=10 tid=0x404a3800 nid=0x2cac waiting on condition [0x2acfbab9f000] WRITE-/10.30.142.49 daemon prio=10 tid=0x404a1800 nid=0x2cab waiting on condition [0x2acfbab72000] WRITE-/10.6.222.233 daemon prio=10 tid=0x4049f800 nid=0x2caa waiting on condition [0x2acfbab45000] WRITE-/10.6.222.233 daemon prio=10 tid=0x4049e000 nid=0x2ca9 waiting on condition [0x2acfbab18000] WRITE-/10.6.222.233 daemon prio=10 tid=0x4049c800 nid=0x2ca8 waiting on condition [0x2acfbaaeb000] WRITE-/10.157.10.134 daemon prio=10 tid=0x4049a800 nid=0x2ca7 waiting on condition [0x2acfbaabe000] WRITE-/10.6.222.233 daemon prio=10 tid=0x40498800 nid=0x2ca6 waiting on condition [0x2acfbaa91000] WRITE-/10.6.222.233 daemon prio=10 tid=0x40496800 nid=0x2ca5 waiting on condition [0x2acfbaa64000] WRITE-/10.6.222.233 daemon prio=10 tid=0x40717800 nid=0x2ca4 waiting on condition [0x2acfbaa37000] WRITE-/10.6.222.233 daemon prio=10 tid=0x40716000 nid=0x2ca3 waiting on condition [0x2acfbaa0a000] WRITE-/10.30.146.195 daemon prio=10 tid=0x40714800 nid=0x2ca2 waiting on condition [0x2acfba9dd000] WRITE-/10.6.222.233 daemon prio=10 tid=0x40712800 nid=0x2ca1 waiting on condition [0x2acfba9b] WRITE-/10.6.222.233 daemon prio=10 tid=0x40710800 nid=0x2ca0 waiting on condition [0x2acfba983000] WRITE-/10.6.222.233 daemon prio=10 tid=0x4070e800 nid=0x2c9f waiting on condition [0x2acfba956000] WRITE-/10.6.222.233 daemon prio=10 tid=0x4070d000 nid=0x2c9e waiting on condition [0x2acfba929000] WRITE-/10.6.222.233 daemon prio=10 tid=0x4070b800 nid=0x2c9d waiting on condition [0x2acfba8fc000] WRITE-/10.6.222.233 daemon prio=10 tid=0x4070a000 nid=0x2c9c waiting on condition [0x2acfba8cf000] WRITE-/10.6.222.233 daemon prio=10 tid=0x40827000 nid=0x2c9b waiting on condition [0x2acfba8a2000] WRITE-/10.6.222.233 daemon prio=10 tid=0x40825000 nid=0x2c9a waiting on condition [0x2acfba875000] WRITE-/10.6.222.233 daemon prio=10 tid=0x2aaac488e000 nid=0x2c99 waiting on condition [0x2acfba848000] WRITE-/10.6.222.233 daemon prio=10 tid=0x40823000 nid=0x2c98 waiting on condition 
[0x2acfba81b000] WRITE-/10.6.222.233 daemon prio=10 tid=0x40821800 nid=0x2c97 waiting on condition [0x2acfba7ee000] WRITE-/10.30.146.195 daemon prio=10 tid=0x4081f000 nid=0x2c96 waiting on condition [0x2acfba7c1000] WRITE-/10.6.222.233 daemon prio=10 tid=0x4081d000 nid=0x2c95 waiting on condition [0x2acfba794000] WRITE-/10.6.222.233 daemon prio=10 tid=0x4081b000 nid=0x2c94 waiting on condition [0x2acfba767000] WRITE-/10.6.222.233 daemon prio=10 tid=0x2aaac488b000 nid=0x2c93 waiting on condition [0x2acfba73a000] WRITE-/10.6.222.233 daemon prio=10 tid=0x40819000 nid=0x2c92 waiting on condition [0x2acfba70d000] WRITE-/10.6.222.233 daemon prio=10 tid=0x407f9000 nid=0x2c91 waiting
[jira] [Commented] (CASSANDRA-5175) Unbounded (?) thread growth connecting to an removed node
[ https://issues.apache.org/jira/browse/CASSANDRA-5175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13721091#comment-13721091 ] Minh Do commented on CASSANDRA-5175: Hi Vijay, I am using your commit db8705294ba96fe2b746fea4f26a919538653ebd but I think the logic in this commit is not the same as the attached patch. Please take a look. if (m == CLOSE_SENTINEL) { disconnect(); +if (!isStopped) +break; continue; } I think it should be : if (isStopped) break; Thanks. Unbounded (?) thread growth connecting to an removed node - Key: CASSANDRA-5175 URL: https://issues.apache.org/jira/browse/CASSANDRA-5175 Project: Cassandra Issue Type: Bug Components: Core Affects Versions: 1.1.8 Environment: EC2, JDK 7u9, Ubuntu 12.04.1 LTS Reporter: Janne Jalkanen Assignee: Vijay Priority: Minor Fix For: 1.1.10, 1.2.1 Attachments: 0001-CASSANDRA-5175.patch The following lines started repeating every minute in the log file {noformat} INFO [GossipStage:1] 2013-01-19 19:35:43,929 Gossiper.java (line 831) InetAddress /10.238.x.y is now dead. INFO [GossipStage:1] 2013-01-19 19:35:43,930 StorageService.java (line 1291) Removing token 170141183460469231731687303715884105718 for /10.238.x.y {noformat} Also, I got about 3000 threads which all look like this: {noformat} Name: WRITE-/10.238.x.y State: WAITING on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@1bb65c0f Total blocked: 0 Total waited: 3 Stack trace: sun.misc.Unsafe.park(Native Method) java.util.concurrent.locks.LockSupport.park(LockSupport.java:186) java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043) java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) org.apache.cassandra.net.OutboundTcpConnection.run(OutboundTcpConnection.java:104) {noformat} A new thread seems to be created every minute, and they never go away. The endpoint in question had been a part of the cluster weeks ago, and the node exhibiting the thread growth was added yesterday. Anyway, assassinating the endpoint in question stopped thread growth (but kept the existing threads running), so this isn't a huge issue. But I don't think the thread count is supposed to be increasing like this... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
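To make the correction suggested in the comment above easier to read, here is the control flow it describes as a self-contained sketch with simplified names (not the actual OutboundTcpConnection source): after handling a CLOSE_SENTINEL the writer thread should exit only when the connection has been stopped for good; with the inverted check (!isStopped), threads for stopped connections never exit, which matches the unbounded WRITE- thread growth reported in this ticket.
{noformat}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Illustrative sketch of the writer-loop fix being discussed; names are
// simplified stand-ins, not the exact Cassandra source.
class WriterLoopSketch implements Runnable
{
    static final Object CLOSE_SENTINEL = new Object();

    private final BlockingQueue<Object> backlog = new LinkedBlockingQueue<>();
    private volatile boolean isStopped;

    void closeSocket(boolean permanent)
    {
        isStopped = permanent;
        backlog.offer(CLOSE_SENTINEL);
    }

    public void run()
    {
        while (true)
        {
            Object m;
            try { m = backlog.take(); }
            catch (InterruptedException e) { break; }

            if (m == CLOSE_SENTINEL)
            {
                disconnect();
                if (isStopped)   // suggested check: exit only on a permanent stop
                    break;
                continue;        // transient close: keep the thread for reconnects
            }
            // ... otherwise serialize m to the socket ...
        }
    }

    private void disconnect() { /* close the underlying socket */ }
}
{noformat}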