[ 
https://issues.apache.org/jira/browse/CASSANDRA-11729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sam Tunnicliffe updated CASSANDRA-11729:
----------------------------------------
    Attachment: node3_debug.log.gz
                node2_debug.log.gz
                node1_debug.log.gz


This isn't actually related to indexes, but is highlighting a race condition 
which is pretty pervasive. You can see from the stacktrace that the assertion 
error is actually being thrown from a lambda defined in 
{{CassandraDaemon::setup}}, which only runs when a node is started. From 
inspection of the code and logs, what seems to be happening is this:
* At startup node1 creates a task to submit rebuilds of all MVs in all 
keyspaces and submits it to the {{OptionalTasks}} executor to run after 
{{RING_DELAY}}.
* While this is still pending, all 3 nodes finish startup and proceed with the 
test, creating and then dropping the {{ks}} keyspace. 
* It so happens that all of the "DROP KEYSPACE" statements hit node3 as the 
coordinator. From its log, we can see that the 4th of these executes at 
{{00:33:56,585}}, so shortly after that point, it pushes a defs change to node1 
and node2.
* Back on node1, the MV building runnable is executed; it calls 
{{Keyspace::all}} and begins to iterate the keyspaces, submitting MV builds. 

This is where the race occurs. {{Keyspace::all}} provides an iterable of 
{{Keyspace}} instances by transforming the key set of 
{{Schema.instance.keyspaces}}, using {{Keyspace::open}} as the transformation 
function. Concurrently, processing the schema update pushed by node3 follows 
the path 
{code}
SchemaKeyspace::mergeSchema
  -> Schema.instance.dropKeyspace
      -> Schema.instance.clearKeyspaceMetadata
          -> Schema.instance.keyspaces.remove
{code}

If the removal from {{Schema.instance.keyspaces}} happens after the 
transforming iterable has read the keyspace name from the keyset, but before it 
attempts to open the {{Keyspace}}, the assertion error is thrown. 
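
The window described above can be reproduced deterministically in a small 
standalone sketch. This is not Cassandra code: the class 
{{KeyspaceRaceSketch}}, the {{keyspaces}} map, and the {{open}} and 
{{openAll}} methods are hypothetical stand-ins, and the concurrent schema 
update is simulated by a callback that runs between reading a name and opening 
it, rather than by a second thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class KeyspaceRaceSketch {
    // Hypothetical stand-in for Schema.instance.keyspaces (name -> metadata).
    static final Map<String, String> keyspaces = new ConcurrentHashMap<>();

    // Hypothetical stand-in for Keyspace::open: fails if the metadata is gone.
    static String open(String name) {
        if (!keyspaces.containsKey(name))
            throw new AssertionError("Unknown keyspace " + name);
        return "Keyspace(" + name + ")";
    }

    // Approximates the transforming iterable: read a name from the key set,
    // then open it. The callback runs inside the gap between the two steps.
    static List<String> openAll(Runnable betweenReadAndOpen) {
        List<String> opened = new ArrayList<>();
        for (String name : keyspaces.keySet()) {
            betweenReadAndOpen.run();   // simulated concurrent schema change
            opened.add(open(name));
        }
        return opened;
    }

    // Returns true if dropping "ks" inside the gap makes open() fail.
    static boolean dropDuringIterationFails() {
        keyspaces.clear();
        keyspaces.put("ks", "metadata");
        try {
            openAll(() -> keyspaces.remove("ks"));
            return false;
        } catch (AssertionError expected) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("drop during iteration fails: " + dropDuringIterationFails());
    }
}
```

With a no-op callback, {{openAll}} opens every keyspace normally. The failure 
only appears when the removal lands inside the gap, which is why the real test 
fails only intermittently.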

This is really a deep-rooted problem: schema is not safe under any level of 
concurrency. {{Keyspace::all}} has many callsites, all of which are 
potentially vulnerable to this, and fixing that properly should be done as a 
subtask of CASSANDRA-9424. 

[~iamaleksey], I don't think that any of the existing subtasks fully capture 
this. Do you think it may fit in CASSANDRA-9425, or do you think a new ticket 
is called for?
 
[~philipthompson], is the best thing to do here just to mark the test as flaky 
for now?

> dtest failure in 
> secondary_indexes_test.TestSecondaryIndexes.test_6924_dropping_ks
> ----------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-11729
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11729
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Russ Hatch
>            Assignee: Sam Tunnicliffe
>              Labels: dtest
>             Fix For: 3.x
>
>         Attachments: node1_debug.log.gz, node2_debug.log.gz, 
> node3_debug.log.gz
>
>
> looks to be a single flap. might be worth trying to reproduce. example 
> failure:
> http://cassci.datastax.com/job/trunk_dtest/1204/testReport/secondary_indexes_test/TestSecondaryIndexes/test_6924_dropping_ks
> Failed on CassCI build trunk_dtest #1204



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
