[jira] [Updated] (SOLR-13616) Possible racecondition/deadlock between collection DELETE and PrepRecovery ? (TestPolicyCloud failures)

Hoss Man (JIRA) Tue, 09 Jul 2019 15:15:26 -0700


     [ 
https://issues.apache.org/jira/browse/SOLR-13616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Hoss Man updated SOLR-13616:
----------------------------
    Attachment: SOLR-13616.test-incomplete.patch
                thetaphi_Lucene-Solr-master-Linux_24358.log.txt
        Status: Open  (was: Open)

Unlike most tests that explicitly waitFor/assert *active* replicas, 
TestPolicyCloud (currently) has several tests that only assert the quanity and 
location of a replica – it doesn't wait for them to become active, so when 
testing an ADDREPLICA or a SPLITSHARD, those new replicas are still in recovery 
(or PrepRecovery) when the test tries to do cleanup and delete the collection – 
which frequently fails with timeout problems.

While we can certainly "improve" TestPolicyCloud to wait for recoveries to 
finish, and all replicas to be active before attempting to delete the 
collection, a better question is why this is needed?

I'm attaching {{thetaphi_Lucene-Solr-master-Linux_24358.log.txt}} which 
demonstrates the problem in {{TestPolicyCloud.testCreateCollectionAddReplica}} 
here are some highlights...
{noformat}
# thetaphi_Lucene-Solr-master-Linux_24358.log.txt
#
# testCreateCollectionAddReplica

# bulk of test logic is finished, test has added a replica and confirmed it's 
on the expected node
# 
# but meanwhile, recovery is still ongoing...

  [junit4]   2> 959699 INFO  
(recoveryExecutor-5888-thread-1-processing-n:127.0.0.1:42097_solr 
x:testCreateCollectionAddReplica_shard1_replica_n3 
c:testCreateCollectionAddReplica s:shard1 r:core_node4) [n:127.0.0.1:42097_solr 
c:testCreateCollectionAddReplica s:shard1 r:core_node4 
x:testCreateCollectionAddReplica_shard1_replica_n3 ] o.a.s.c.RecoveryStrategy 
Sending prep recovery command to [https://127.0.0.1:42097/solr]; [WaitForState: 
action=PREPRECOVERY&core=testCreateCollectionAddReplica_shard1_replica_n1&nodeName=127.0.0.1:42097_solr&coreNodeName=core_node4&state=recovering&checkLive=true&onlyIfLeader=true&onlyIfLeaderActive=true]
   [junit4]   2> 959701 INFO  (qtp531873617-17025) [n:127.0.0.1:42097_solr    
x:testCreateCollectionAddReplica_shard1_replica_n1 ] o.a.s.h.a.PrepRecoveryOp 
Going to wait for coreNodeName: core_node4, state: recovering, checkLive: true, 
onlyIfLeader: true, onlyIfLeaderActive: true
   [junit4]   2> 959701 INFO  (qtp531873617-17025) [n:127.0.0.1:42097_solr    
x:testCreateCollectionAddReplica_shard1_replica_n1 ] o.a.s.h.a.PrepRecoveryOp 
In WaitForState(recovering): collection=testCreateCollectionAddReplica, 
shard=shard1, thisCore=testCreateCollectionAddReplica_shard1_replica_n1, 
leaderDoesNotNeedRecovery=false, isLeader? true, live=true, checkLive=true, 
currentState=down, localState=active, nodeName=127.0.0.1:42097_solr, 
coreNodeName=core_node4, onlyIfActiveCheckResult=false, nodeProps: core_node4:{
   [junit4]   2>   "core":"testCreateCollectionAddReplica_shard1_replica_n3",
   [junit4]   2>   "base_url":"https://127.0.0.1:42097/solr";,
   [junit4]   2>   "state":"down",
   [junit4]   2>   "node_name":"127.0.0.1:42097_solr",
   [junit4]   2>   "type":"NRT"}
   ...

# the test thread moves on to @After method which calls 
MiniSolrCloudCluster.deleteAllCollections() ...
   ...
   [junit4]   2> 959703 INFO  (qtp531873617-17021) [n:127.0.0.1:42097_solr     
] o.a.s.h.a.CollectionsHandler Invoked Collection Action :delete with params 
name=testCreateCollectionAddReplica&action=DELETE&wt=javabin&version=2 and 
sendToOCPQueue=true
   ...
   [junit4]   2> 959709 INFO  
(OverseerThreadFactory-5345-thread-5-processing-n:127.0.0.1:44991_solr) 
[n:127.0.0.1:44991_solr     ] o.a.s.c.a.c.OverseerCollectionMessageHandler 
Executing Collection 
Cmd=action=UNLOAD&deleteInstanceDir=true&deleteDataDir=true&deleteMetricsHistory=true,
 asyncId=null
   ...
   [junit4]   2> 959750 INFO  (qtp531873617-17325) [n:127.0.0.1:42097_solr    
x:testCreateCollectionAddReplica_shard1_replica_n1 ] o.a.s.c.SolrCore 
[testCreateCollectionAddReplica_shard1_replica_n1]  CLOSING SolrCore 
org.apache.solr.core.SolrCore@39444e66
   ...
   [junit4]   2> 959753 INFO  (qtp531873617-17325) [n:127.0.0.1:42097_solr    
x:testCreateCollectionAddReplica_shard1_replica_n1 ] o.a.s.s.HttpSolrCall 
[admin] webapp=null path=/admin/cores 
params={deleteInstanceDir=true&deleteMetricsHistory=true&core=testCreateCollectionAddReplica_shard1_replica_n1&qt=/admin/cores&deleteDataDir=true&action=UNLOAD&wt=javabin&version=2}
 status=0 QTime=17


# but meanwhile, PrepRecoveryOp is currently blocked on a call to 
ZkStateReader.waitForState
# looking for specific conditions for the leader and (new) replica (that needs 
to recover)
# ... BUT!... the leader core has already been closed, so the watcher never 
succeeds,
# ...so PrepRecovery keeps waitForState ...

   [junit4]   2> 959855 WARN  (watches-5915-thread-1) [     ] 
o.a.s.c.c.ZkStateReader Error on calling watcher
   [junit4]   2>           => org.apache.solr.common.SolrException: core not 
found:testCreateCollectionAddReplica_shard1_replica_n1
   [junit4]   2>        at 
org.apache.solr.handler.admin.PrepRecoveryOp.lambda$execute$0(PrepRecoveryOp.java:83)
   [junit4]   2> org.apache.solr.common.SolrException: core not 
found:testCreateCollectionAddReplica_shard1_replica_n1
   [junit4]   2>        at 
org.apache.solr.handler.admin.PrepRecoveryOp.lambda$execute$0(PrepRecoveryOp.java:83)
 ~[java/:?]
   [junit4]   2>        at 
org.apache.solr.common.cloud.ZkStateReader.lambda$waitForState$13(ZkStateReader.java:1742)
 ~[java/:?]
   [junit4]   2>        at 
org.apache.solr.common.cloud.ZkStateReader$DocCollectionAndLiveNodesWatcherWrapper.onStateChanged(ZkStateReader.java:2309)
 ~[java/:?]
   [junit4]   2>        at 
org.apache.solr.common.cloud.ZkStateReader$Notification.run(ZkStateReader.java:2041)
 ~[java/:?]
   [junit4]   2>        at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
   [junit4]   2>        at 
java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
   [junit4]   2>        at 
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209)
 ~[java/:?]
   [junit4]   2>        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) 
~[?:?]
   [junit4]   2>        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) 
~[?:?]
   [junit4]   2>        at java.lang.Thread.run(Thread.java:835) [?:?]

#  ... repeats 3 times, same millisecond, diff threads
#       (watches-5915-thread-2 & watches-5915-thread-3

# Other then some intranode /admin/metrics requests, *NOTHING* else happens for 
~90 seconds
# Until finally the DELETE collection request times out...

   [junit4]   2> 1049792 INFO  
(TEST-TestPolicyCloud.testCreateCollectionAddReplica-seed#[CA8FB8D61B016EFE]) [ 
    ] o.a.s.SolrTestCaseJ4 ###Ending testCreateCollectionAddReplica
   [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestPolicyCloud 
-Dtests.method=testCreateCollectionAddReplica -Dtests.seed=CA8FB8D61B016EFE 
-Dtests.multiplier=3 -Dtests.slow=true -Dtests.locale=yi-001 
-Dtests.timezone=America/Indiana/Indianapolis -Dtests.asserts=true 
-Dtests.file.encoding=ISO-8859-1
   [junit4] ERROR   92.6s J0 | TestPolicyCloud.testCreateCollectionAddReplica 
<<<
   [junit4]    > Throwable #1: 
org.apache.solr.client.solrj.SolrServerException: Timeout occurred while 
waiting response from server at: https://127.0.0.1:42097/solr
   [junit4]    >        at 
__randomizedtesting.SeedInfo.seed([CA8FB8D61B016EFE:4AAFDDF80A428658]:0)
   [junit4]    >        at 
org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:667)
   [junit4]    >        at 
org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:262)
   [junit4]    >        at 
org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:245)
   [junit4]    >        at 
org.apache.solr.client.solrj.impl.LBSolrClient.doRequest(LBSolrClient.java:368)
   [junit4]    >        at 
org.apache.solr.client.solrj.impl.LBSolrClient.request(LBSolrClient.java:296)
   [junit4]    >        at 
org.apache.solr.client.solrj.impl.BaseCloudSolrClient.sendRequest(BaseCloudSolrClient.java:1128)
   [junit4]    >        at 
org.apache.solr.client.solrj.impl.BaseCloudSolrClient.requestWithRetryOnStaleState(BaseCloudSolrClient.java:897)
   [junit4]    >        at 
org.apache.solr.client.solrj.impl.BaseCloudSolrClient.request(BaseCloudSolrClient.java:829)
   [junit4]    >        at 
org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:211)
   [junit4]    >        at 
org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:228)
   [junit4]    >        at 
org.apache.solr.cloud.MiniSolrCloudCluster.deleteAllCollections(MiniSolrCloudCluster.java:547)
   [junit4]    >        at 
org.apache.solr.cloud.autoscaling.TestPolicyCloud.after(TestPolicyCloud.java:87)
   [junit4]    >        at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   [junit4]    >        at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   [junit4]    >        at 
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   [junit4]    >        at 
java.base/java.lang.reflect.Method.invoke(Method.java:567)
   [junit4]    >        at java.base/java.lang.Thread.run(Thread.java:835)
   [junit4]    > Caused by: java.net.SocketTimeoutException: Read timed out
   [junit4]    >        at 
java.base/java.net.SocketInputStream.socketRead0(Native Method)
   [junit4]    >        at 
java.base/java.net.SocketInputStream.socketRead(SocketInputStream.java:115)
   [junit4]    >        at 
java.base/java.net.SocketInputStream.read(SocketInputStream.java:168)
   [junit4]    >        at 
java.base/java.net.SocketInputStream.read(SocketInputStream.java:140)
   [junit4]    >        at 
java.base/sun.security.ssl.SSLSocketInputRecord.read(SSLSocketInputRecord.java:448)
   [junit4]    >        at 
java.base/sun.security.ssl.SSLSocketInputRecord.bytesInCompletePacket(SSLSocketInputRecord.java:68)
   [junit4]    >        at 
java.base/sun.security.ssl.SSLSocketImpl.readApplicationRecord(SSLSocketImpl.java:1132)
   [junit4]    >        at 
java.base/sun.security.ssl.SSLSocketImpl$AppInputStream.read(SSLSocketImpl.java:828)
{noformat}
This pattern of "RECOVERY happens concurrently with DELETE, lots of wasted time 
with nothing logged, DELETE times out or fails" repeats in other jenkins jobs, 
and in other test methods like {{testCreateCollectionSplitShard}} ... when 
these failures happen, the effects sometimes bleed over into toehr test methods 
– aparently because the old collection is "partially" deleted – the cores are 
unloaded, but the collection still exists in the cluster state causing problems 
when future invokations of the {{@After}} method calls 
{{MiniSolrCloudCluster.deleteAllCollections()}} ...
{noformat}
   [junit4]   2> 1081521 INFO  
(TEST-TestPolicyCloud.testDataProvider-seed#[CA8FB8D61B016EFE]) [     ] 
o.a.s.SolrTestCaseJ4 ###Ending testDataProvider
   [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestPolicyCloud 
-Dtests.method=testDataProvider -Dtests.seed=CA8FB8D61B016EFE 
-Dtests.multiplier=3 -Dtests.slow=true -Dtests.locale=yi-001 
-Dtests.timezone=America/Indiana/Indianapolis -Dtests.asserts=true 
-Dtests.file.encoding=ISO-8859-1
   [junit4] ERROR   31.7s J0 | TestPolicyCloud.testDataProvider <<<
   [junit4]    > Throwable #1: 
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error 
from server at https://127.0.0.1:42097/solr: Could not find collection : 
testCreateCollectionAddReplica
   [junit4]    >        at 
__randomizedtesting.SeedInfo.seed([CA8FB8D61B016EFE:F22410E3A9C36A27]:0)
   [junit4]    >        at 
org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:656)
   [junit4]    >        at 
org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:262)
   [junit4]    >        at 
org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:245)
   [junit4]    >        at 
org.apache.solr.client.solrj.impl.LBSolrClient.doRequest(LBSolrClient.java:368)
   [junit4]    >        at 
org.apache.solr.client.solrj.impl.LBSolrClient.request(LBSolrClient.java:296)
   [junit4]    >        at 
org.apache.solr.client.solrj.impl.BaseCloudSolrClient.sendRequest(BaseCloudSolrClient.java:1128)
   [junit4]    >        at 
org.apache.solr.client.solrj.impl.BaseCloudSolrClient.requestWithRetryOnStaleState(BaseCloudSolrClient.java:897)
   [junit4]    >        at 
org.apache.solr.client.solrj.impl.BaseCloudSolrClient.request(BaseCloudSolrClient.java:829)
   [junit4]    >        at 
org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:211)
   [junit4]    >        at 
org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:228)
   [junit4]    >        at 
org.apache.solr.cloud.MiniSolrCloudCluster.deleteAllCollections(MiniSolrCloudCluster.java:547)
   [junit4]    >        at 
org.apache.solr.cloud.autoscaling.TestPolicyCloud.after(TestPolicyCloud.java:87)
{noformat}
FWIW, I should note:
 * I've been unable to reproduce this locally
 * the failures happen on diff jenkins servers, on diff branches, and different 
OSes
 * so far all of the failures i've seen occur when randomized ssl=true, but 
AFAICT there isn't any obvious indicatiion that the problem relates to the use 
of ssl – i think that just that the added CPU contention caused by using SSL 
seems to slow things down enough on the jenkins machine that the race contidion 
happens
 ** ie: because of SSL slowdown, the recovery isn't fully completed before the 
DELETE collection cleanup requests are processed

----
My initial hunch/impression was that Overseer (Queue) was getting deadlocked 
trying to invoke the PrepRecoveryOp (which was blocked waiting for a leader 
that no longer exists) preventing the DeleteCollectionCmd from executing (which 
would remove the collection from the ClusterState – the only state change 
possible to cause the PrepRecoveryOp's waitForState call to abort.

I tried to write a new test forcing this specific situation (attached), using a 
{{CloseHook}} on the leader {{SolrCore}} and {{waitForState()}} predicate that 
would gate eachother using {{CountDownLatches}} to try and coerce the specific 
sequence of events that seemed to be problematic – but i can't seem to force a 
this type of failure (or even semi-reliably trigger it given that the 
ZkStateReader executes watchers in a random order).

If anyone else has any ideas what's going on here i'm all ears ... in the 
meantime I'm going to try and work up some general improvements to 
TestPolicyCLoud to make the assertions tighter and insure all the expected 
replicas are fully active before finishing the test.

> Possible racecondition/deadlock between collection DELETE and PrepRecovery ? 
> (TestPolicyCloud failures)
> -------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-13616
>                 URL: https://issues.apache.org/jira/browse/SOLR-13616
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Hoss Man
>            Priority: Major
>         Attachments: SOLR-13616.test-incomplete.patch, 
> thetaphi_Lucene-Solr-master-Linux_24358.log.txt
>
>
> Based on some recent jenkins failures in TestPolicyCloud, I suspect there is 
> a possible deadlock condition when attempting to delete a collection while 
> recovery is in progress.
> I haven't been able to identify exactly where/why/how the problem occurs, but 
> it does not appear to be a test specific problem, and seems like it could 
> potentially affect anyone unlucky enough to issue poorly timed DELETE.
> Details to follow in comments...



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Updated] (SOLR-13616) Possible racecondition/deadlock between collection DELETE and PrepRecovery ? (TestPolicyCloud failures)

Reply via email to