Vamsee Yarlagadda created SOLR-7198:
---------------------------------------

             Summary: Deleting a collection during leader election results in 
left over znodes in ZK 
                 Key: SOLR-7198
                 URL: https://issues.apache.org/jira/browse/SOLR-7198
             Project: Solr
          Issue Type: Bug
          Components: SolrCloud
    Affects Versions: 4.10.3
            Reporter: Vamsee Yarlagadda


I am seeing this issue while trying to run a collection delete operation in the 
middle of leader election process.

Contents of ZK (after issuing collection delete and waiting for some time)
{code}
  /
    /aliases.json
    /clusterstate.json
    /collections
      SolrUpgrade_collection
        leaders
           shard2_1
    /configs
    /live_nodes
    /overseer
    /overseer_elect
    /solr
    /solr.xml

Contents of znode shard2_1:

version0
aversion0
children_count0
ctimeThu Mar 05 22:05:28 PST 2015 (1425621928169)
cversion0
czxid22815
ephemeralOwner93427899815755800
mtimeThu Mar 05 22:05:28 PST 2015 (1425621928169)
mzxid22815
pzxid22815
dataLength194
{
  "core":"SolrUpgrade_collection_shard2_1_replica2",
  "node_name":"search-testing-c5-ha-1.vpc.cloudera.com:8983_solr",
  "base_url":"http://search-testing-c5-ha-1.vpc.cloudera.com:8983/solr";
}
{code}

Clusterstate.json before running a delete collection
{code}
{
  "shards":{
    "shard1":{
      "range":"80000000-ffffffff",
      "state":"active",
      "replicas":{
        "core_node1":{
          "state":"active",
          "core":"SolrUpgrade_collection_shard1_replica2",
          "node_name":"search-testing-c5-ha-4.vpc.cloudera.com:8983_solr",
          "base_url":"http://search-testing-c5-ha-4.vpc.cloudera.com:8983/solr";,
          "leader":"true"},
        "core_node2":{
          "state":"active",
          "core":"SolrUpgrade_collection_shard1_replica1",
          "node_name":"search-testing-c5-ha-2.vpc.cloudera.com:8983_solr",
          
"base_url":"http://search-testing-c5-ha-2.vpc.cloudera.com:8983/solr"}}},
    "shard2_0":{
      "range":"0-3fffffff",
      "state":"active",
      "replicas":{
        "core_node5":{
          "state":"active",
          "core":"SolrUpgrade_collection_shard2_0_replica1",
          "node_name":"search-testing-c5-ha-3.vpc.cloudera.com:8983_solr",
          "base_url":"http://search-testing-c5-ha-3.vpc.cloudera.com:8983/solr";,
          "leader":"true"},
        "core_node7":{
          "state":"active",
          "core":"SolrUpgrade_collection_shard2_0_replica2",
          "node_name":"search-testing-c5-ha-4.vpc.cloudera.com:8983_solr",
          
"base_url":"http://search-testing-c5-ha-4.vpc.cloudera.com:8983/solr"}}},
    "shard2_1":{
      "range":"40000000-7fffffff",
      "state":"active",
      "replicas":{
        "core_node8":{
          "state":"active",
          "core":"SolrUpgrade_collection_shard2_1_replica2",
          "node_name":"search-testing-c5-ha-1.vpc.cloudera.com:8983_solr",
          
"base_url":"http://search-testing-c5-ha-1.vpc.cloudera.com:8983/solr"},
        "core_node9":{
          "state":"active",
          "core":"SolrUpgrade_collection_shard2_1_replica3",
          "node_name":"search-testing-c5-ha-1.vpc.cloudera.com:8983_solr",
          
"base_url":"http://search-testing-c5-ha-1.vpc.cloudera.com:8983/solr"}}}},
  "maxShardsPerNode":"10",
  "router":"compositeId",
  "replicationFactor":"2",
  "autoAddReplicas":"false",
  "routerSpec":{"name":"compositeId"}}}
{code}

As we can notice, shard (*shard2_1*) doesn't have any leader and i can see from 
the logs that the replicas of the shard just started the leader election 
process. And here are the Solr logs from one of the above replicas which 
eventually becomes the leader and registers in ZK even though the collection 
was deleted.
{code}
2015-03-05 22:05:25,383 INFO org.apache.solr.cloud.ShardLeaderElectionContext: 
Running the leader process for shard shard2_1
2015-03-05 22:05:25,387 INFO org.apache.solr.cloud.ShardLeaderElectionContext: 
Checking if I 
(core=SolrUpgrade_collection_shard2_1_replica2,coreNodeName=core_node8) should 
try and be the leader.
2015-03-05 22:05:25,387 INFO org.apache.solr.cloud.ShardLeaderElectionContext: 
My last published State was Active, it's okay to be the leader.
2015-03-05 22:05:25,387 INFO org.apache.solr.cloud.ShardLeaderElectionContext: 
I may be the new leader - try and sync
2015-03-05 22:05:25,506 INFO org.apache.solr.common.cloud.ZkStateReader: A 
cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged 
path:/clusterstate.json, has occurred - updating... (live nodes size: 4)
2015-03-05 22:05:25,620 INFO org.apache.solr.core.SolrCore.Request: 
[SolrUpgrade_collection_shard2_1_replica2] webapp=/solr path=/select 
params={q=*:*&distrib=false&wt=javabin&version=2} hits=118 status=0 QTime=1 
2015-03-05 22:05:25,623 INFO org.apache.solr.core.SolrCore.Request: 
[SolrUpgrade_collection_shard2_1_replica3] webapp=/solr path=/select 
params={q=*:*&distrib=false&wt=javabin&version=2} hits=118 status=0 QTime=0 
2015-03-05 22:05:27,392 INFO org.apache.solr.cloud.ElectionContext: canceling 
election 
/collections/SolrUpgrade_collection/leader_elect/shard2_1/election/93427899815755803-core_node8-n_0000000001
2015-03-05 22:05:27,393 INFO org.apache.solr.core.SolrCore: 
[SolrUpgrade_collection_shard2_1_replica3]  CLOSING SolrCore 
org.apache.solr.core.SolrCore@559b52d3
2015-03-05 22:05:27,393 INFO org.apache.solr.update.UpdateHandler: closing 
DirectUpdateHandler2{commits=0,autocommit maxTime=60000ms,autocommits=0,soft 
autocommit maxTime=1000ms,soft 
autocommits=0,optimizes=0,rollbacks=0,expungeDeletes=0,docsPending=0,adds=0,deletesById=0,deletesByQuery=0,errors=0,cumulative_adds=0,cumulative_deletesById=0,cumulative_deletesByQuery=0,cumulative_errors=0,transaction_logs_total_size=0,transaction_logs_total_number=0}
2015-03-05 22:05:27,394 INFO org.apache.solr.update.SolrCoreState: Closing 
SolrCoreState
2015-03-05 22:05:27,394 INFO org.apache.solr.update.DefaultSolrCoreState: 
SolrCoreState ref count has reached 0 - closing IndexWriter
2015-03-05 22:05:27,394 INFO org.apache.solr.update.DefaultSolrCoreState: 
closing IndexWriter with IndexWriterCloser
2015-03-05 22:05:27,397 INFO org.apache.solr.servlet.SolrDispatchFilter: 
[admin] webapp=null path=/admin/cores 
params={deleteInstanceDir=true&action=UNLOAD&core=SolrUpgrade_collection_shard2_1_replica2&wt=javabin&qt=/admin/cores&deleteDataDir=true&version=2}
 status=0 QTime=5 
2015-03-05 22:05:27,399 INFO org.apache.solr.cloud.ShardLeaderElectionContext: 
Running the leader process for shard shard2_1
2015-03-05 22:05:27,399 INFO org.apache.solr.cloud.ElectionContext: canceling 
election 
/collections/SolrUpgrade_collection/leader_elect/shard2_1/election/93427899815755803-core_node9-n_0000000002
2015-03-05 22:05:27,400 WARN org.apache.solr.cloud.LeaderElector: 
org.apache.solr.common.SolrException: SolrCore not 
found:SolrUpgrade_collection_shard2_1_replica3 in []
        at 
org.apache.solr.cloud.ShardLeaderElectionContext.runLeaderProcess(ElectionContext.java:211)
        at 
org.apache.solr.cloud.LeaderElector.runIamLeaderProcess(LeaderElector.java:163)
        at 
org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:125)
        at org.apache.solr.cloud.LeaderElector.access$200(LeaderElector.java:55)
        at 
org.apache.solr.cloud.LeaderElector$ElectionWatcher.process(LeaderElector.java:358)
        at 
org.apache.solr.common.cloud.SolrZkClient$3$1.run(SolrZkClient.java:273)
        at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
2015-03-05 22:05:27,415 INFO org.apache.solr.core.SolrCore: 
[SolrUpgrade_collection_shard2_1_replica3] Closing main searcher on request.
2015-03-05 22:05:27,438 INFO org.apache.solr.common.cloud.ZkStateReader: A 
cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged 
path:/clusterstate.json, has occurred - updating... (live nodes size: 4)
2015-03-05 22:05:27,456 INFO org.apache.solr.core.CachingDirectoryFactory: 
Closing HdfsDirectoryFactory - 2 directories currently being tracked
2015-03-05 22:05:27,457 INFO org.apache.solr.core.CachingDirectoryFactory: 
looking to close hdfs://ns1/solr/SolrUpgrade_collection/core_node9/data/index 
[CachedDir<<refCount=0;path=hdfs://ns1/solr/SolrUpgrade_collection/core_node9/data/index;done=false>>]
2015-03-05 22:05:27,457 INFO org.apache.solr.core.CachingDirectoryFactory: 
Closing directory: hdfs://ns1/solr/SolrUpgrade_collection/core_node9/data/index
2015-03-05 22:05:27,470 INFO org.apache.solr.store.hdfs.HdfsDirectory: Closing 
hdfs directory hdfs://ns1/solr/SolrUpgrade_collection/core_node9/data/index
2015-03-05 22:05:27,470 INFO org.apache.solr.core.CachingDirectoryFactory: 
looking to close hdfs://ns1/solr/SolrUpgrade_collection/core_node9/data 
[CachedDir<<refCount=0;path=hdfs://ns1/solr/SolrUpgrade_collection/core_node9/data;done=false>>]
2015-03-05 22:05:27,470 INFO org.apache.solr.core.CachingDirectoryFactory: 
Closing directory: hdfs://ns1/solr/SolrUpgrade_collection/core_node9/data
2015-03-05 22:05:27,470 INFO org.apache.solr.store.hdfs.HdfsDirectory: Closing 
hdfs directory hdfs://ns1/solr/SolrUpgrade_collection/core_node9/data
2015-03-05 22:05:27,471 INFO org.apache.solr.core.CachingDirectoryFactory: 
Removing directory after core close: 
hdfs://ns1/solr/SolrUpgrade_collection/core_node9/data
2015-03-05 22:05:27,490 INFO org.apache.solr.cloud.ElectionContext: canceling 
election 
/collections/SolrUpgrade_collection/leader_elect/shard2_1/election/93427899815755803-core_node9-n_0000000002
2015-03-05 22:05:27,491 WARN org.apache.solr.cloud.ElectionContext: 
cancelElection did not find election node to remove 
/collections/SolrUpgrade_collection/leader_elect/shard2_1/election/93427899815755803-core_node9-n_0000000002
2015-03-05 22:05:27,493 INFO org.apache.solr.servlet.SolrDispatchFilter: 
[admin] webapp=null path=/admin/cores 
params={deleteInstanceDir=true&action=UNLOAD&core=SolrUpgrade_collection_shard2_1_replica3&wt=javabin&qt=/admin/cores&deleteDataDir=true&version=2}
 status=0 QTime=100 
2015-03-05 22:05:27,741 INFO org.apache.solr.common.cloud.ZkStateReader: A 
cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged 
path:/clusterstate.json, has occurred - updating... (live nodes size: 4)
2015-03-05 22:05:27,888 INFO org.apache.solr.cloud.SyncStrategy: Sync replicas 
to 
http://search-testing-c5-ha-1.vpc.cloudera.com:8983/solr/SolrUpgrade_collection_shard2_1_replica2/
2015-03-05 22:05:27,888 ERROR org.apache.solr.cloud.SyncStrategy: Sync 
Failed:org.apache.solr.common.cloud.ZooKeeperException: Could not find 
collection in zk: SolrUpgrade_collection []
        at 
org.apache.solr.common.cloud.ZkStateReader.getReplicaProps(ZkStateReader.java:600)
        at 
org.apache.solr.common.cloud.ZkStateReader.getReplicaProps(ZkStateReader.java:588)
        at 
org.apache.solr.common.cloud.ZkStateReader.getReplicaProps(ZkStateReader.java:583)
        at 
org.apache.solr.cloud.SyncStrategy.syncWithReplicas(SyncStrategy.java:152)
        at 
org.apache.solr.cloud.SyncStrategy.syncReplicas(SyncStrategy.java:122)
        at org.apache.solr.cloud.SyncStrategy.sync(SyncStrategy.java:101)
        at 
org.apache.solr.cloud.ShardLeaderElectionContext.runLeaderProcess(ElectionContext.java:267)
        at 
org.apache.solr.cloud.LeaderElector.runIamLeaderProcess(LeaderElector.java:163)
        at 
org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:125)
        at org.apache.solr.cloud.LeaderElector.access$200(LeaderElector.java:55)
        at 
org.apache.solr.cloud.LeaderElector$ElectionWatcher.process(LeaderElector.java:358)
        at 
org.apache.solr.common.cloud.SolrZkClient$3$1.run(SolrZkClient.java:273)
        at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

2015-03-05 22:05:27,888 INFO org.apache.solr.cloud.SyncStrategy: Leader's 
attempt to sync with shard failed, moving to the next candidate
2015-03-05 22:05:27,888 INFO org.apache.solr.cloud.ShardLeaderElectionContext: 
We failed sync, but we have no versions - we can't sync in that case - we were 
active before, so become leader anyway
2015-03-05 22:05:27,888 INFO org.apache.solr.cloud.ShardLeaderElectionContext: 
I am the new leader: 
http://search-testing-c5-ha-1.vpc.cloudera.com:8983/solr/SolrUpgrade_collection_shard2_1_replica2/
 shard2_1
2015-03-05 22:05:27,888 INFO org.apache.solr.core.SolrCore: 
[SolrUpgrade_collection_shard2_1_replica2]  CLOSING SolrCore 
org.apache.solr.core.SolrCore@595167
2015-03-05 22:05:27,889 INFO org.apache.solr.update.UpdateHandler: closing 
DirectUpdateHandler2{commits=1,autocommit maxTime=60000ms,autocommits=0,soft 
autocommit maxTime=1000ms,soft 
autocommits=0,optimizes=0,rollbacks=0,expungeDeletes=0,docsPending=0,adds=0,deletesById=0,deletesByQuery=0,errors=0,cumulative_adds=0,cumulative_deletesById=0,cumulative_deletesByQuery=0,cumulative_errors=0,transaction_logs_total_size=0,transaction_logs_total_number=0}
2015-03-05 22:05:27,890 INFO org.apache.solr.update.SolrCoreState: Closing 
SolrCoreState
2015-03-05 22:05:27,890 INFO org.apache.solr.update.DefaultSolrCoreState: 
SolrCoreState ref count has reached 0 - closing IndexWriter
2015-03-05 22:05:27,890 INFO org.apache.solr.update.DefaultSolrCoreState: 
closing IndexWriter with IndexWriterCloser
2015-03-05 22:05:27,905 INFO org.apache.solr.core.SolrCore: 
[SolrUpgrade_collection_shard2_1_replica2] Closing main searcher on request.
2015-03-05 22:05:27,911 INFO org.apache.solr.core.CachingDirectoryFactory: 
Closing HdfsDirectoryFactory - 2 directories currently being tracked
2015-03-05 22:05:27,912 INFO org.apache.solr.core.CachingDirectoryFactory: 
looking to close hdfs://ns1/solr/SolrUpgrade_collection/core_node8/data/index 
[CachedDir<<refCount=0;path=hdfs://ns1/solr/SolrUpgrade_collection/core_node8/data/index;done=false>>]
2015-03-05 22:05:27,912 INFO org.apache.solr.core.CachingDirectoryFactory: 
Closing directory: hdfs://ns1/solr/SolrUpgrade_collection/core_node8/data/index
2015-03-05 22:05:27,924 INFO org.apache.solr.store.hdfs.HdfsDirectory: Closing 
hdfs directory hdfs://ns1/solr/SolrUpgrade_collection/core_node8/data/index
2015-03-05 22:05:27,924 INFO org.apache.solr.core.CachingDirectoryFactory: 
looking to close hdfs://ns1/solr/SolrUpgrade_collection/core_node8/data 
[CachedDir<<refCount=0;path=hdfs://ns1/solr/SolrUpgrade_collection/core_node8/data;done=false>>]
2015-03-05 22:05:27,924 INFO org.apache.solr.core.CachingDirectoryFactory: 
Closing directory: hdfs://ns1/solr/SolrUpgrade_collection/core_node8/data
2015-03-05 22:05:27,925 INFO org.apache.solr.store.hdfs.HdfsDirectory: Closing 
hdfs directory hdfs://ns1/solr/SolrUpgrade_collection/core_node8/data
2015-03-05 22:05:28,026 INFO org.apache.solr.core.CachingDirectoryFactory: 
Removing directory after core close: 
hdfs://ns1/solr/SolrUpgrade_collection/core_node8/data
2015-03-05 22:05:28,155 INFO org.apache.solr.common.cloud.SolrZkClient: 
makePath: /collections/SolrUpgrade_collection/leaders/shard2_1
2015-03-05 22:05:28,287 INFO org.apache.solr.common.cloud.ZkStateReader: A 
cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged 
path:/clusterstate.json, has occurred - updating... (live nodes size: 4)
{code}


This might be because of a missing check at line
https://github.com/apache/lucene-solr/blob/lucene_solr_4_10/solr/core/src/java/org/apache/solr/cloud/ElectionContext.java#L138
As the makePath creates all the missing directories recursively.

May be we can check whether the collection znode exists before we create the 
leader znode? Thoughts? 





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Reply via email to