[ https://issues.apache.org/jira/browse/SOLR-16753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chris M. Hostetter updated SOLR-16753:
--------------------------------------
    Attachment: SOLR-16753.txt
        Status: Open  (was: Open)

I'm attaching a log file from when I was able to trigger this failure locally (running {{gradle clean check}}) _after_ committing SOLR-16751. But unfortunately the seed doesn't reproduce -- making me suspicious that it's a timing-related problem exacerbated by high CPU load (i.e. a jenkins box, or running lots of concurrent tests).

The problem almost seems like it must be related to the ZK watchers and/or reading stale state?

Here's the {{SliceMutator}} updating the state of the old shard and the new split shards, which triggers {{zkCallback}} threads, which trigger the watcher set by the test -- and yet the test still doesn't see the expected number of active slices...

{noformat}
2> 382813 INFO (recoveryExecutor-3853-thread-1-processing-127.0.0.1:38817_solr coll_NRT_PULL_shard1_1_replica_p1 coll_NRT_PULL shard1_1 core_node10) [n:127.0.0.1:38817_solr c:coll_NRT_PULL s:shard1_1 r:core_node10 x:coll_NRT_PULL_shard1_1_replica_p1] o.a.s.c.o.SliceMutator Update shard state invoked for collection: coll_NRT_PULL with message: {
2>   "collection":"coll_NRT_PULL",
2>   "shard1_1":"active",
2>   "operation":"updateshardstate",
2>   "shard1_0":"active",
2>   "shard1":"inactive"}
2> 382813 INFO (recoveryExecutor-3853-thread-1-processing-127.0.0.1:38817_solr coll_NRT_PULL_shard1_1_replica_p1 coll_NRT_PULL shard1_1 core_node10) [n:127.0.0.1:38817_solr c:coll_NRT_PULL s:shard1_1 r:core_node10 x:coll_NRT_PULL_shard1_1_replica_p1] o.a.s.c.o.SliceMutator Update shard state shard1_1 to active
2> 382813 INFO (recoveryExecutor-3853-thread-1-processing-127.0.0.1:38817_solr coll_NRT_PULL_shard1_1_replica_p1 coll_NRT_PULL shard1_1 core_node10) [n:127.0.0.1:38817_solr c:coll_NRT_PULL s:shard1_1 r:core_node10 x:coll_NRT_PULL_shard1_1_replica_p1] o.a.s.c.o.SliceMutator Update shard state shard1_0 to active
2> 382813 INFO (recoveryExecutor-3853-thread-1-processing-127.0.0.1:38817_solr coll_NRT_PULL_shard1_1_replica_p1 coll_NRT_PULL shard1_1 core_node10) [n:127.0.0.1:38817_solr c:coll_NRT_PULL s:shard1_1 r:core_node10 x:coll_NRT_PULL_shard1_1_replica_p1] o.a.s.c.o.SliceMutator Update shard state shard1 to inactive
2> 382815 INFO (zkCallback-3873-thread-1) [] o.a.s.c.c.ZkStateReader A cluster state change: [WatchedEvent state:SyncConnected type:NodeChildrenChanged path:/collections/coll_NRT_PULL/state.json] for collection [coll_NRT_PULL] has occurred - updating... (live nodes size: [6])
2> 382815 INFO (zkCallback-3857-thread-2) [] o.a.s.c.c.ZkStateReader A cluster state change: [WatchedEvent state:SyncConnected type:NodeChildrenChanged path:/collections/coll_NRT_PULL/state.json] for collection [coll_NRT_PULL] has occurred - updating... (live nodes size: [6])
2> 382817 INFO (zkCallback-3854-thread-1) [] o.a.s.c.c.ZkStateReader A cluster state change: [WatchedEvent state:SyncConnected type:NodeChildrenChanged path:/collections/coll_NRT_PULL/state.json] for collection [coll_NRT_PULL] has occurred - updating... (live nodes size: [6])
2> 382818 INFO (recoveryExecutor-3853-thread-1-processing-127.0.0.1:38817_solr coll_NRT_PULL_shard1_1_replica_p1 coll_NRT_PULL shard1_1 core_node10) [n:127.0.0.1:38817_solr c:coll_NRT_PULL s:shard1_1 r:core_node10 x:coll_NRT_PULL_shard1_1_replica_p1] o.a.s.c.RecoveryStrategy Finished recovery process, successful=[true] msTimeTaken=84.0
2> 382818 INFO (recoveryExecutor-3853-thread-1-processing-127.0.0.1:38817_solr coll_NRT_PULL_shard1_1_replica_p1 coll_NRT_PULL shard1_1 core_node10) [n:127.0.0.1:38817_solr c:coll_NRT_PULL s:shard1_1 r:core_node10 x:coll_NRT_PULL_shard1_1_replica_p1] o.a.s.c.RecoveryStrategy Finished recovery process. recoveringAfterStartup=true msTimeTaken=85.0
2> 382818 INFO (watches-3871-thread-1) [] o.a.s.c.SolrCloudTestCase active slice count: 1 expected: 2
{noformat}
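For reference, the {{o.a.s.c.SolrCloudTestCase}} line at the end ("active slice count: 1 expected: 2") comes from a {{CollectionStatePredicate}} over the active slice count that the test's watcher evaluates. A minimal sketch of that kind of predicate (my paraphrase for illustration, *not* the exact {{SolrCloudTestCase}} code; the class name and helper are made up) looks roughly like:

{code:java}
import java.util.Set;

import org.apache.solr.common.cloud.CollectionStatePredicate;
import org.apache.solr.common.cloud.DocCollection;

// Hypothetical illustration class -- not part of the Solr test code.
public class ActiveSliceCheckSketch {

  // Only matches once both split sub-shards are ACTIVE and the parent shard
  // has gone inactive -- roughly what the test's wait is blocking on.
  static CollectionStatePredicate expectActiveSlices(int expected) {
    return (Set<String> liveNodes, DocCollection state) -> {
      if (state == null) {
        return false;
      }
      // getActiveSlices() only counts slices in Slice.State.ACTIVE, so a
      // zkCallback delivering a stale state.json snapshot (shard1 still
      // active, shard1_0/shard1_1 not yet flipped) would report 1 here even
      // after SliceMutator has published the "updateshardstate" message.
      return state.getActiveSlices().size() == expected;
    };
  }
}
{code}

If that predicate ever runs against a cached {{DocCollection}} from before the {{updateshardstate}} message landed, it would line up with the "active slice count: 1 expected: 2" message above even though the watcher clearly fired.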
[~noble.paul] - can you please try to dig into this?

> SplitShardWithNodeRoleTest.testSolrClusterWithNodeRoleWithPull failures
> -----------------------------------------------------------------------
>
>                 Key: SOLR-16753
>                 URL: https://issues.apache.org/jira/browse/SOLR-16753
>             Project: Solr
>          Issue Type: Test
>   Security Level: Public (Default Security Level. Issues are Public)
>            Reporter: Chris M. Hostetter
>            Assignee: Noble Paul
>            Priority: Major
>         Attachments: SOLR-16753.txt
>
>
> {{SplitShardWithNodeRoleTest.testSolrClusterWithNodeRoleWithPull}} was added on 2023-03-13, but somewhere between 2023-04-02 and 2023-04-09 it started failing 15-20% of the time on jenkins jobs, with seeds that don't reliably reproduce.
> At first, this seemed like it might be related to SOLR-16751, but even with that fix failures are still happening.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org