[ 
https://issues.apache.org/jira/browse/CASSANDRA-15637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aleksey Yeschenko updated CASSANDRA-15637:
------------------------------------------
          Since Version: 3.0.0
    Source Control Link: 
[a8327eb8868c8d9d03c253a88509ce64d2ac227b|https://github.com/apache/cassandra/commit/a8327eb8868c8d9d03c253a88509ce64d2ac227b]
             Resolution: Fixed
                 Status: Resolved  (was: Ready to Commit)

Cheers, committed as 
[a8327eb8868c8d9d03c253a88509ce64d2ac227b|https://github.com/apache/cassandra/commit/a8327eb8868c8d9d03c253a88509ce64d2ac227b]
 to trunk.

> CqlInputFormat regression going from 2.1 to 3.x caused by semantic difference 
> between thrift and the new system.size_estimates table when dealing with 
> multiple dc deployments
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-15637
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15637
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Legacy/Tools
>            Reporter: David Capwell
>            Assignee: David Capwell
>            Priority: Normal
>              Labels: pull-request-available
>             Fix For: 4.0-alpha
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> In 3.0 CqlInputFormat switched away from thrift in favor of a new 
> system.size_estimates table, but the semantics changed when dealing with 
> multiple DCs or when Cassandra is not collocated with Hadoop.
> The core issues are:
> * system.size_estimates uses the primary range, in a multi-dc setup this 
> could lead to uneven ranges
> example:
> {code}
> DC1: [0, 10, 20, 30]
> DC2: [1, 11, 21, 31]
> DC3: [2, 12, 22, 32]
> {code}
> Using NetworkTopologyStrategy the primary ranges are: [0, 1), [1, 2), [2, 
> 10), [10, 11), [11, 12), [12, 20), [20, 21), [21, 22), [22, 30), [30, 31), 
> [31, 32), [32, 0).
> Given this the only ranges that are more than one token are: [2, 10), [12, 
> 20), [22, 30).
> * system.size_estimates is not replicated so need to hit every node in the 
> cluster to get estimates, if nodes are down in the DC with non-size-1 ranges 
> there is no way to get a estimate.
> * CqlInputFormat used to call describe_local_ring so all interactions were 
> with a single DC, the java driver doesn't filter the DC so looks to allow 
> cross DC traffic and includes nodes from other DCs in the replica set; in the 
> example above, the amount of splits went from 4 to 12.
> * CqlInputFormat used to call describe_splits_ex to dynamically calculate the 
> estimates, this was on the "local primary range" and was able to hit replicas 
> to create estimates if the primary was down. With system.size_estimates we no 
> longer have backup and no longer expose the "local primary range" in multi-dc.
> * CqlInputFormat had a config cassandra.input.keyRange which let you define 
> your own range.  If the range doesn't perfectly match the local range then 
> the intersectWith calls will produce ranges with no estimates.  Example: [0, 
> 10, 20], cassandra.input.keyRange=5,15.  This won't find any estimates so 
> will produce 2 splits with 128 estimate (default when not found).
> * CqlInputFormat special cases Cassandra being collocated with Hadoop and 
> assumes this when querying system.size_estimates as it doesn't filter to the 
> specific host, this means that non-collocated deployments randomly select the 
> nodes and create splits with ranges the hosts do not have locally.
> The problems are deterministic to replicate, the following test will show it
> 1) deploy a 3 DC cluster with 3 nodes each
> 2) create DC2 tokens are +1 of DC1 and DC3 are +1 of DC2
> 3) CREATE KEYSPACE simpleuniform0 WITH replication = {‘class’: 
> ‘NetworkTopologyStrategy’, ‘DC1’: 3, ‘DC2’: 3, ‘DC3’: 3};
> 4) CREATE TABLE simpletable0 (pk bigint, ck bigint, value blob, PRIMARY KEY 
> (pk, ck))
> 5) insert 500k partitions uniformly: [0, 500,000)
> 6) wait until estimates catch up to writes
> 7) for all nodes, SELECT * FROM system.size_estimates
> You will get the following
> {code}
>  keyspace_name  | table_name   | range_start          | range_end            
> | mean_partition_size | partitions_count
> ----------------+--------------+----------------------+----------------------+---------------------+------------------
>  simpleuniform0 | simpletable0 | -9223372036854775808 | -6148914691236517206 
> |                  87 |           122240
>  simpleuniform0 | simpletable0 |  6148914691236517207 | -9223372036854775808 
> |                  87 |           121472
> (2 rows)
>  keyspace_name  | table_name   | range_start | range_end           | 
> mean_partition_size | partitions_count
> ----------------+--------------+-------------+---------------------+---------------------+------------------
>  simpleuniform0 | simpletable0 |           2 | 6148914691236517205 |          
>         87 |           243072
> (1 rows)
>  keyspace_name  | table_name   | range_start          | range_end            
> | mean_partition_size | partitions_count
> ----------------+--------------+----------------------+----------------------+---------------------+------------------
>  simpleuniform0 | simpletable0 | -6148914691236517206 | -6148914691236517205 
> |                  87 |                1
> (1 rows)
>  keyspace_name  | table_name   | range_start | range_end | 
> mean_partition_size | partitions_count
> ----------------+--------------+-------------+-----------+---------------------+------------------
>  simpleuniform0 | simpletable0 |           0 |         1 |                  
> 87 |                1
> (1 rows)
>  keyspace_name  | table_name   | range_start         | range_end           | 
> mean_partition_size | partitions_count
> ----------------+--------------+---------------------+---------------------+---------------------+------------------
>  simpleuniform0 | simpletable0 | 6148914691236517205 | 6148914691236517206 |  
>                 87 |                1
> (1 rows)
>  keyspace_name  | table_name   | range_start          | range_end            
> | mean_partition_size | partitions_count
> ----------------+--------------+----------------------+----------------------+---------------------+------------------
>  simpleuniform0 | simpletable0 | -6148914691236517205 | -6148914691236517204 
> |                  87 |                1
> (1 rows)
>  keyspace_name  | table_name   | range_start | range_end | 
> mean_partition_size | partitions_count
> ----------------+--------------+-------------+-----------+---------------------+------------------
>  simpleuniform0 | simpletable0 |           1 |         2 |                  
> 87 |                1
> (1 rows)
>  keyspace_name  | table_name   | range_start         | range_end           | 
> mean_partition_size | partitions_count
> ----------------+--------------+---------------------+---------------------+---------------------+------------------
>  simpleuniform0 | simpletable0 | 6148914691236517206 | 6148914691236517207 |  
>                 87 |                1
> (1 rows)
> {code}
> 8) create a MR job against simpleuniform0. simpletable0, you will get 10 
> splits where as 2.1 was 4



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org

Reply via email to