[jira] [Updated] (CASSANDRA-15637) CqlInputFormat regression going from 2.1 to 3.x caused by semantic difference between thrift and the new system.size_estimates table when dealing with multiple dc deployments

David Capwell (Jira) Wed, 11 Mar 2020 13:13:15 -0700


     [ 
https://issues.apache.org/jira/browse/CASSANDRA-15637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


David Capwell updated CASSANDRA-15637:
--------------------------------------
    Description: 
In 3.0 CqlInputFormat switched away from thrift in favor of a new 
system.size_estimates table, but the semantics changed when dealing with 
multiple DCs or when Cassandra is not collocated with Hadoop.

The core issues are:

* system.size_estimates uses the primary range, in a multi-dc setup this could 
lead to uneven ranges
example:

{code}
DC1: [0, 10, 20, 30]
DC2: [1, 11, 21, 31]
DC3: [2, 12, 22, 32]
{code}

Using NetworkTopologyStrategy the primary ranges are: [0, 1), [1, 2), [2, 10), 
[10, 11), [11, 12), [12, 20), [20, 21), [21, 22), [22, 30), [30, 31), [31, 32), 
[32, 0).
Given this the only ranges that are more than one token are: [2, 10), [12, 20), 
[22, 30).

* system.size_estimates is not replicated so need to hit every node in the 
cluster to get estimates, if nodes are down in the DC with non-size-1 ranges 
there is no way to get a estimate.
* CqlInputFormat used to call describe_local_ring so all interactions were with 
a single DC, the java driver doesn't filter the DC so looks to allow cross DC 
traffic and includes nodes from other DCs in the replica set; in the example 
above, the amount of splits went from 4 to 12.
* CqlInputFormat used to call describe_splits_ex to dynamically calculate the 
estimates, this was on the "local primary range" and was able to hit replicas 
to create estimates if the primary was down. With system.size_estimates we no 
longer have backup and no longer expose the "local primary range" in multi-dc.
* CqlInputFormat special cases Cassandra being collocated with Hadoop and 
assumes this when querying system.size_estimates as it doesn't filter to the 
specific host, this means that non-collocated deployments randomly select the 
nodes and create splits with ranges the hosts do not have locally.

The problems are deterministic to replicate, the following test will show it

1) deploy a 3 DC cluster with 3 nodes each
2) create DC2 tokens are +1 of DC1 and DC3 are +1 of DC2
3) CREATE KEYSPACE simpleuniform0 WITH replication = {‘class’: 
‘NetworkTopologyStrategy’, ‘DC1’: 3, ‘DC2’: 3, ‘DC3’: 3};
4) CREATE TABLE simpletable0 (pk bigint, ck bigint, value blob, PRIMARY KEY 
(pk, ck))
5) insert 500k partitions uniformly: [0, 500,000)
6) wait until estimates catch up to writes
7) for all nodes, SELECT * FROM system.size_estimates

You will get the following

{code}
 keyspace_name  | table_name   | range_start          | range_end            | 
mean_partition_size | partitions_count
----------------+--------------+----------------------+----------------------+---------------------+------------------
 simpleuniform0 | simpletable0 | -9223372036854775808 | -6148914691236517206 |  
                87 |           122240
 simpleuniform0 | simpletable0 |  6148914691236517207 | -9223372036854775808 |  
                87 |           121472

(2 rows)

 keyspace_name  | table_name   | range_start | range_end           | 
mean_partition_size | partitions_count
----------------+--------------+-------------+---------------------+---------------------+------------------
 simpleuniform0 | simpletable0 |           2 | 6148914691236517205 |            
      87 |           243072

(1 rows)

 keyspace_name  | table_name   | range_start          | range_end            | 
mean_partition_size | partitions_count
----------------+--------------+----------------------+----------------------+---------------------+------------------
 simpleuniform0 | simpletable0 | -6148914691236517206 | -6148914691236517205 |  
                87 |                1

(1 rows)

 keyspace_name  | table_name   | range_start | range_end | mean_partition_size 
| partitions_count
----------------+--------------+-------------+-----------+---------------------+------------------
 simpleuniform0 | simpletable0 |           0 |         1 |                  87 
|                1

(1 rows)

 keyspace_name  | table_name   | range_start         | range_end           | 
mean_partition_size | partitions_count
----------------+--------------+---------------------+---------------------+---------------------+------------------
 simpleuniform0 | simpletable0 | 6148914691236517205 | 6148914691236517206 |    
              87 |                1

(1 rows)

 keyspace_name  | table_name   | range_start          | range_end            | 
mean_partition_size | partitions_count
----------------+--------------+----------------------+----------------------+---------------------+------------------
 simpleuniform0 | simpletable0 | -6148914691236517205 | -6148914691236517204 |  
                87 |                1

(1 rows)

 keyspace_name  | table_name   | range_start | range_end | mean_partition_size 
| partitions_count
----------------+--------------+-------------+-----------+---------------------+------------------
 simpleuniform0 | simpletable0 |           1 |         2 |                  87 
|                1

(1 rows)

 keyspace_name  | table_name   | range_start         | range_end           | 
mean_partition_size | partitions_count
----------------+--------------+---------------------+---------------------+---------------------+------------------
 simpleuniform0 | simpletable0 | 6148914691236517206 | 6148914691236517207 |    
              87 |                1

(1 rows)
{code}

8) create a MR job against simpleuniform0. simpletable0, you will get 10 splits 
where as 2.1 was 4

  was:
In 3.0 CqlInputFormat switched away from thrift in favor of a new 
system.size_estimates table, but the semantics changed when dealing with 
multiple DCs or when Cassandra is not collocated with Hadoop.

The core issues are:

* system.size_estimates uses the primary range, in a multi-dc setup this could 
lead to uneven ranges
example:

DC1: [0, 10, 20, 30]
DC2: [1, 11, 21, 31]
DC3: [2, 12, 22, 32]

Using NetworkTopologyStrategy the primary ranges are: [0, 1), [1, 2), [2, 10), 
[10, 11), [11, 12), [12, 20), [20, 21), [21, 22), [22, 30), [30, 31), [31, 32), 
[32, 0).
Given this the only ranges that are more than one token are: [2, 10), [12, 20), 
[22, 30).

* system.size_estimates is not replicated so need to hit every node in the 
cluster to get estimates, if nodes are down in the DC with non-size-1 ranges 
there is no way to get a estimate.
* CqlInputFormat used to call describe_local_ring so all interactions were with 
a single DC, the java driver doesn't filter the DC so looks to allow cross DC 
traffic and includes nodes from other DCs in the replica set; in the example 
above, the amount of splits went from 4 to 12.
* CqlInputFormat used to call describe_splits_ex to dynamically calculate the 
estimates, this was on the "local primary range" and was able to hit replicas 
to create estimates if the primary was down. With system.size_estimates we no 
longer have backup and no longer expose the "local primary range" in multi-dc.
* CqlInputFormat special cases Cassandra being collocated with Hadoop and 
assumes this when querying system.size_estimates as it doesn't filter to the 
specific host, this means that non-collocated deployments randomly select the 
nodes and create splits with ranges the hosts do not have locally.

The problems are deterministic to replicate, the following test will show it

1) deploy a 3 DC cluster with 3 nodes each
2) create DC2 tokens are +1 of DC1 and DC3 are +1 of DC2
3) CREATE KEYSPACE simpleuniform0 WITH replication = {‘class’: 
‘NetworkTopologyStrategy’, ‘DC1’: 3, ‘DC2’: 3, ‘DC3’: 3};
4) CREATE TABLE simpletable0 (pk bigint, ck bigint, value blob, PRIMARY KEY 
(pk, ck))
5) insert 500k partitions uniformly: [0, 500,000)
6) wait until estimates catch up to writes
7) for all nodes, SELECT * FROM system.size_estimates

You will get the following

 keyspace_name  | table_name   | range_start          | range_end            | 
mean_partition_size | partitions_count
----------------+--------------+----------------------+----------------------+---------------------+------------------
 simpleuniform0 | simpletable0 | -9223372036854775808 | -6148914691236517206 |  
                87 |           122240
 simpleuniform0 | simpletable0 |  6148914691236517207 | -9223372036854775808 |  
                87 |           121472

(2 rows)

 keyspace_name  | table_name   | range_start | range_end           | 
mean_partition_size | partitions_count
----------------+--------------+-------------+---------------------+---------------------+------------------
 simpleuniform0 | simpletable0 |           2 | 6148914691236517205 |            
      87 |           243072

(1 rows)

 keyspace_name  | table_name   | range_start          | range_end            | 
mean_partition_size | partitions_count
----------------+--------------+----------------------+----------------------+---------------------+------------------
 simpleuniform0 | simpletable0 | -6148914691236517206 | -6148914691236517205 |  
                87 |                1

(1 rows)

 keyspace_name  | table_name   | range_start | range_end | mean_partition_size 
| partitions_count
----------------+--------------+-------------+-----------+---------------------+------------------
 simpleuniform0 | simpletable0 |           0 |         1 |                  87 
|                1

(1 rows)

 keyspace_name  | table_name   | range_start         | range_end           | 
mean_partition_size | partitions_count
----------------+--------------+---------------------+---------------------+---------------------+------------------
 simpleuniform0 | simpletable0 | 6148914691236517205 | 6148914691236517206 |    
              87 |                1

(1 rows)

 keyspace_name  | table_name   | range_start          | range_end            | 
mean_partition_size | partitions_count
----------------+--------------+----------------------+----------------------+---------------------+------------------
 simpleuniform0 | simpletable0 | -6148914691236517205 | -6148914691236517204 |  
                87 |                1

(1 rows)

 keyspace_name  | table_name   | range_start | range_end | mean_partition_size 
| partitions_count
----------------+--------------+-------------+-----------+---------------------+------------------
 simpleuniform0 | simpletable0 |           1 |         2 |                  87 
|                1

(1 rows)

 keyspace_name  | table_name   | range_start         | range_end           | 
mean_partition_size | partitions_count
----------------+--------------+---------------------+---------------------+---------------------+------------------
 simpleuniform0 | simpletable0 | 6148914691236517206 | 6148914691236517207 |    
              87 |                1

(1 rows)

8) create a MR job against simpleuniform0. simpletable0, you will get 10 splits 
where as 2.1 was 4


> CqlInputFormat regression going from 2.1 to 3.x caused by semantic difference 
> between thrift and the new system.size_estimates table when dealing with 
> multiple dc deployments
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-15637
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15637
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Legacy/Tools
>            Reporter: David Capwell
>            Assignee: David Capwell
>            Priority: Normal
>
> In 3.0 CqlInputFormat switched away from thrift in favor of a new 
> system.size_estimates table, but the semantics changed when dealing with 
> multiple DCs or when Cassandra is not collocated with Hadoop.
> The core issues are:
> * system.size_estimates uses the primary range, in a multi-dc setup this 
> could lead to uneven ranges
> example:
> {code}
> DC1: [0, 10, 20, 30]
> DC2: [1, 11, 21, 31]
> DC3: [2, 12, 22, 32]
> {code}
> Using NetworkTopologyStrategy the primary ranges are: [0, 1), [1, 2), [2, 
> 10), [10, 11), [11, 12), [12, 20), [20, 21), [21, 22), [22, 30), [30, 31), 
> [31, 32), [32, 0).
> Given this the only ranges that are more than one token are: [2, 10), [12, 
> 20), [22, 30).
> * system.size_estimates is not replicated so need to hit every node in the 
> cluster to get estimates, if nodes are down in the DC with non-size-1 ranges 
> there is no way to get a estimate.
> * CqlInputFormat used to call describe_local_ring so all interactions were 
> with a single DC, the java driver doesn't filter the DC so looks to allow 
> cross DC traffic and includes nodes from other DCs in the replica set; in the 
> example above, the amount of splits went from 4 to 12.
> * CqlInputFormat used to call describe_splits_ex to dynamically calculate the 
> estimates, this was on the "local primary range" and was able to hit replicas 
> to create estimates if the primary was down. With system.size_estimates we no 
> longer have backup and no longer expose the "local primary range" in multi-dc.
> * CqlInputFormat special cases Cassandra being collocated with Hadoop and 
> assumes this when querying system.size_estimates as it doesn't filter to the 
> specific host, this means that non-collocated deployments randomly select the 
> nodes and create splits with ranges the hosts do not have locally.
> The problems are deterministic to replicate, the following test will show it
> 1) deploy a 3 DC cluster with 3 nodes each
> 2) create DC2 tokens are +1 of DC1 and DC3 are +1 of DC2
> 3) CREATE KEYSPACE simpleuniform0 WITH replication = {‘class’: 
> ‘NetworkTopologyStrategy’, ‘DC1’: 3, ‘DC2’: 3, ‘DC3’: 3};
> 4) CREATE TABLE simpletable0 (pk bigint, ck bigint, value blob, PRIMARY KEY 
> (pk, ck))
> 5) insert 500k partitions uniformly: [0, 500,000)
> 6) wait until estimates catch up to writes
> 7) for all nodes, SELECT * FROM system.size_estimates
> You will get the following
> {code}
>  keyspace_name  | table_name   | range_start          | range_end            
> | mean_partition_size | partitions_count
> ----------------+--------------+----------------------+----------------------+---------------------+------------------
>  simpleuniform0 | simpletable0 | -9223372036854775808 | -6148914691236517206 
> |                  87 |           122240
>  simpleuniform0 | simpletable0 |  6148914691236517207 | -9223372036854775808 
> |                  87 |           121472
> (2 rows)
>  keyspace_name  | table_name   | range_start | range_end           | 
> mean_partition_size | partitions_count
> ----------------+--------------+-------------+---------------------+---------------------+------------------
>  simpleuniform0 | simpletable0 |           2 | 6148914691236517205 |          
>         87 |           243072
> (1 rows)
>  keyspace_name  | table_name   | range_start          | range_end            
> | mean_partition_size | partitions_count
> ----------------+--------------+----------------------+----------------------+---------------------+------------------
>  simpleuniform0 | simpletable0 | -6148914691236517206 | -6148914691236517205 
> |                  87 |                1
> (1 rows)
>  keyspace_name  | table_name   | range_start | range_end | 
> mean_partition_size | partitions_count
> ----------------+--------------+-------------+-----------+---------------------+------------------
>  simpleuniform0 | simpletable0 |           0 |         1 |                  
> 87 |                1
> (1 rows)
>  keyspace_name  | table_name   | range_start         | range_end           | 
> mean_partition_size | partitions_count
> ----------------+--------------+---------------------+---------------------+---------------------+------------------
>  simpleuniform0 | simpletable0 | 6148914691236517205 | 6148914691236517206 |  
>                 87 |                1
> (1 rows)
>  keyspace_name  | table_name   | range_start          | range_end            
> | mean_partition_size | partitions_count
> ----------------+--------------+----------------------+----------------------+---------------------+------------------
>  simpleuniform0 | simpletable0 | -6148914691236517205 | -6148914691236517204 
> |                  87 |                1
> (1 rows)
>  keyspace_name  | table_name   | range_start | range_end | 
> mean_partition_size | partitions_count
> ----------------+--------------+-------------+-----------+---------------------+------------------
>  simpleuniform0 | simpletable0 |           1 |         2 |                  
> 87 |                1
> (1 rows)
>  keyspace_name  | table_name   | range_start         | range_end           | 
> mean_partition_size | partitions_count
> ----------------+--------------+---------------------+---------------------+---------------------+------------------
>  simpleuniform0 | simpletable0 | 6148914691236517206 | 6148914691236517207 |  
>                 87 |                1
> (1 rows)
> {code}
> 8) create a MR job against simpleuniform0. simpletable0, you will get 10 
> splits where as 2.1 was 4



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org

[jira] [Updated] (CASSANDRA-15637) CqlInputFormat regression going from 2.1 to 3.x caused by semantic difference between thrift and the new system.size_estimates table when dealing with multiple dc deployments

Reply via email to