[
https://issues.apache.org/jira/browse/SOLR-13616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hoss Man updated SOLR-13616:
----------------------------
Attachment: SOLR-13616.test-incomplete.patch
thetaphi_Lucene-Solr-master-Linux_24358.log.txt
Status: Open (was: Open)
Unlike most tests that explicitly waitFor/assert *active* replicas,
TestPolicyCloud (currently) has several tests that only assert the quanity and
location of a replica – it doesn't wait for them to become active, so when
testing an ADDREPLICA or a SPLITSHARD, those new replicas are still in recovery
(or PrepRecovery) when the test tries to do cleanup and delete the collection –
which frequently fails with timeout problems.
While we can certainly "improve" TestPolicyCloud to wait for recoveries to
finish, and all replicas to be active before attempting to delete the
collection, a better question is why this is needed?
I'm attaching {{thetaphi_Lucene-Solr-master-Linux_24358.log.txt}} which
demonstrates the problem in {{TestPolicyCloud.testCreateCollectionAddReplica}}
here are some highlights...
{noformat}
# thetaphi_Lucene-Solr-master-Linux_24358.log.txt
#
# testCreateCollectionAddReplica
# bulk of test logic is finished, test has added a replica and confirmed it's
on the expected node
#
# but meanwhile, recovery is still ongoing...
[junit4] 2> 959699 INFO
(recoveryExecutor-5888-thread-1-processing-n:127.0.0.1:42097_solr
x:testCreateCollectionAddReplica_shard1_replica_n3
c:testCreateCollectionAddReplica s:shard1 r:core_node4) [n:127.0.0.1:42097_solr
c:testCreateCollectionAddReplica s:shard1 r:core_node4
x:testCreateCollectionAddReplica_shard1_replica_n3 ] o.a.s.c.RecoveryStrategy
Sending prep recovery command to [https://127.0.0.1:42097/solr]; [WaitForState:
action=PREPRECOVERY&core=testCreateCollectionAddReplica_shard1_replica_n1&nodeName=127.0.0.1:42097_solr&coreNodeName=core_node4&state=recovering&checkLive=true&onlyIfLeader=true&onlyIfLeaderActive=true]
[junit4] 2> 959701 INFO (qtp531873617-17025) [n:127.0.0.1:42097_solr
x:testCreateCollectionAddReplica_shard1_replica_n1 ] o.a.s.h.a.PrepRecoveryOp
Going to wait for coreNodeName: core_node4, state: recovering, checkLive: true,
onlyIfLeader: true, onlyIfLeaderActive: true
[junit4] 2> 959701 INFO (qtp531873617-17025) [n:127.0.0.1:42097_solr
x:testCreateCollectionAddReplica_shard1_replica_n1 ] o.a.s.h.a.PrepRecoveryOp
In WaitForState(recovering): collection=testCreateCollectionAddReplica,
shard=shard1, thisCore=testCreateCollectionAddReplica_shard1_replica_n1,
leaderDoesNotNeedRecovery=false, isLeader? true, live=true, checkLive=true,
currentState=down, localState=active, nodeName=127.0.0.1:42097_solr,
coreNodeName=core_node4, onlyIfActiveCheckResult=false, nodeProps: core_node4:{
[junit4] 2> "core":"testCreateCollectionAddReplica_shard1_replica_n3",
[junit4] 2> "base_url":"https://127.0.0.1:42097/solr",
[junit4] 2> "state":"down",
[junit4] 2> "node_name":"127.0.0.1:42097_solr",
[junit4] 2> "type":"NRT"}
...
# the test thread moves on to @After method which calls
MiniSolrCloudCluster.deleteAllCollections() ...
...
[junit4] 2> 959703 INFO (qtp531873617-17021) [n:127.0.0.1:42097_solr
] o.a.s.h.a.CollectionsHandler Invoked Collection Action :delete with params
name=testCreateCollectionAddReplica&action=DELETE&wt=javabin&version=2 and
sendToOCPQueue=true
...
[junit4] 2> 959709 INFO
(OverseerThreadFactory-5345-thread-5-processing-n:127.0.0.1:44991_solr)
[n:127.0.0.1:44991_solr ] o.a.s.c.a.c.OverseerCollectionMessageHandler
Executing Collection
Cmd=action=UNLOAD&deleteInstanceDir=true&deleteDataDir=true&deleteMetricsHistory=true,
asyncId=null
...
[junit4] 2> 959750 INFO (qtp531873617-17325) [n:127.0.0.1:42097_solr
x:testCreateCollectionAddReplica_shard1_replica_n1 ] o.a.s.c.SolrCore
[testCreateCollectionAddReplica_shard1_replica_n1] CLOSING SolrCore
org.apache.solr.core.SolrCore@39444e66
...
[junit4] 2> 959753 INFO (qtp531873617-17325) [n:127.0.0.1:42097_solr
x:testCreateCollectionAddReplica_shard1_replica_n1 ] o.a.s.s.HttpSolrCall
[admin] webapp=null path=/admin/cores
params={deleteInstanceDir=true&deleteMetricsHistory=true&core=testCreateCollectionAddReplica_shard1_replica_n1&qt=/admin/cores&deleteDataDir=true&action=UNLOAD&wt=javabin&version=2}
status=0 QTime=17
# but meanwhile, PrepRecoveryOp is currently blocked on a call to
ZkStateReader.waitForState
# looking for specific conditions for the leader and (new) replica (that needs
to recover)
# ... BUT!... the leader core has already been closed, so the watcher never
succeeds,
# ...so PrepRecovery keeps waitForState ...
[junit4] 2> 959855 WARN (watches-5915-thread-1) [ ]
o.a.s.c.c.ZkStateReader Error on calling watcher
[junit4] 2> => org.apache.solr.common.SolrException: core not
found:testCreateCollectionAddReplica_shard1_replica_n1
[junit4] 2> at
org.apache.solr.handler.admin.PrepRecoveryOp.lambda$execute$0(PrepRecoveryOp.java:83)
[junit4] 2> org.apache.solr.common.SolrException: core not
found:testCreateCollectionAddReplica_shard1_replica_n1
[junit4] 2> at
org.apache.solr.handler.admin.PrepRecoveryOp.lambda$execute$0(PrepRecoveryOp.java:83)
~[java/:?]
[junit4] 2> at
org.apache.solr.common.cloud.ZkStateReader.lambda$waitForState$13(ZkStateReader.java:1742)
~[java/:?]
[junit4] 2> at
org.apache.solr.common.cloud.ZkStateReader$DocCollectionAndLiveNodesWatcherWrapper.onStateChanged(ZkStateReader.java:2309)
~[java/:?]
[junit4] 2> at
org.apache.solr.common.cloud.ZkStateReader$Notification.run(ZkStateReader.java:2041)
~[java/:?]
[junit4] 2> at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
[junit4] 2> at
java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
[junit4] 2> at
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209)
~[java/:?]
[junit4] 2> at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
~[?:?]
[junit4] 2> at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
~[?:?]
[junit4] 2> at java.lang.Thread.run(Thread.java:835) [?:?]
# ... repeats 3 times, same millisecond, diff threads
# (watches-5915-thread-2 & watches-5915-thread-3
# Other then some intranode /admin/metrics requests, *NOTHING* else happens for
~90 seconds
# Until finally the DELETE collection request times out...
[junit4] 2> 1049792 INFO
(TEST-TestPolicyCloud.testCreateCollectionAddReplica-seed#[CA8FB8D61B016EFE]) [
] o.a.s.SolrTestCaseJ4 ###Ending testCreateCollectionAddReplica
[junit4] 2> NOTE: reproduce with: ant test -Dtestcase=TestPolicyCloud
-Dtests.method=testCreateCollectionAddReplica -Dtests.seed=CA8FB8D61B016EFE
-Dtests.multiplier=3 -Dtests.slow=true -Dtests.locale=yi-001
-Dtests.timezone=America/Indiana/Indianapolis -Dtests.asserts=true
-Dtests.file.encoding=ISO-8859-1
[junit4] ERROR 92.6s J0 | TestPolicyCloud.testCreateCollectionAddReplica
<<<
[junit4] > Throwable #1:
org.apache.solr.client.solrj.SolrServerException: Timeout occurred while
waiting response from server at: https://127.0.0.1:42097/solr
[junit4] > at
__randomizedtesting.SeedInfo.seed([CA8FB8D61B016EFE:4AAFDDF80A428658]:0)
[junit4] > at
org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:667)
[junit4] > at
org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:262)
[junit4] > at
org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:245)
[junit4] > at
org.apache.solr.client.solrj.impl.LBSolrClient.doRequest(LBSolrClient.java:368)
[junit4] > at
org.apache.solr.client.solrj.impl.LBSolrClient.request(LBSolrClient.java:296)
[junit4] > at
org.apache.solr.client.solrj.impl.BaseCloudSolrClient.sendRequest(BaseCloudSolrClient.java:1128)
[junit4] > at
org.apache.solr.client.solrj.impl.BaseCloudSolrClient.requestWithRetryOnStaleState(BaseCloudSolrClient.java:897)
[junit4] > at
org.apache.solr.client.solrj.impl.BaseCloudSolrClient.request(BaseCloudSolrClient.java:829)
[junit4] > at
org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:211)
[junit4] > at
org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:228)
[junit4] > at
org.apache.solr.cloud.MiniSolrCloudCluster.deleteAllCollections(MiniSolrCloudCluster.java:547)
[junit4] > at
org.apache.solr.cloud.autoscaling.TestPolicyCloud.after(TestPolicyCloud.java:87)
[junit4] > at
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[junit4] > at
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
[junit4] > at
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
[junit4] > at
java.base/java.lang.reflect.Method.invoke(Method.java:567)
[junit4] > at java.base/java.lang.Thread.run(Thread.java:835)
[junit4] > Caused by: java.net.SocketTimeoutException: Read timed out
[junit4] > at
java.base/java.net.SocketInputStream.socketRead0(Native Method)
[junit4] > at
java.base/java.net.SocketInputStream.socketRead(SocketInputStream.java:115)
[junit4] > at
java.base/java.net.SocketInputStream.read(SocketInputStream.java:168)
[junit4] > at
java.base/java.net.SocketInputStream.read(SocketInputStream.java:140)
[junit4] > at
java.base/sun.security.ssl.SSLSocketInputRecord.read(SSLSocketInputRecord.java:448)
[junit4] > at
java.base/sun.security.ssl.SSLSocketInputRecord.bytesInCompletePacket(SSLSocketInputRecord.java:68)
[junit4] > at
java.base/sun.security.ssl.SSLSocketImpl.readApplicationRecord(SSLSocketImpl.java:1132)
[junit4] > at
java.base/sun.security.ssl.SSLSocketImpl$AppInputStream.read(SSLSocketImpl.java:828)
{noformat}
This pattern of "RECOVERY happens concurrently with DELETE, lots of wasted time
with nothing logged, DELETE times out or fails" repeats in other jenkins jobs,
and in other test methods like {{testCreateCollectionSplitShard}} ... when
these failures happen, the effects sometimes bleed over into toehr test methods
– aparently because the old collection is "partially" deleted – the cores are
unloaded, but the collection still exists in the cluster state causing problems
when future invokations of the {{@After}} method calls
{{MiniSolrCloudCluster.deleteAllCollections()}} ...
{noformat}
[junit4] 2> 1081521 INFO
(TEST-TestPolicyCloud.testDataProvider-seed#[CA8FB8D61B016EFE]) [ ]
o.a.s.SolrTestCaseJ4 ###Ending testDataProvider
[junit4] 2> NOTE: reproduce with: ant test -Dtestcase=TestPolicyCloud
-Dtests.method=testDataProvider -Dtests.seed=CA8FB8D61B016EFE
-Dtests.multiplier=3 -Dtests.slow=true -Dtests.locale=yi-001
-Dtests.timezone=America/Indiana/Indianapolis -Dtests.asserts=true
-Dtests.file.encoding=ISO-8859-1
[junit4] ERROR 31.7s J0 | TestPolicyCloud.testDataProvider <<<
[junit4] > Throwable #1:
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error
from server at https://127.0.0.1:42097/solr: Could not find collection :
testCreateCollectionAddReplica
[junit4] > at
__randomizedtesting.SeedInfo.seed([CA8FB8D61B016EFE:F22410E3A9C36A27]:0)
[junit4] > at
org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:656)
[junit4] > at
org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:262)
[junit4] > at
org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:245)
[junit4] > at
org.apache.solr.client.solrj.impl.LBSolrClient.doRequest(LBSolrClient.java:368)
[junit4] > at
org.apache.solr.client.solrj.impl.LBSolrClient.request(LBSolrClient.java:296)
[junit4] > at
org.apache.solr.client.solrj.impl.BaseCloudSolrClient.sendRequest(BaseCloudSolrClient.java:1128)
[junit4] > at
org.apache.solr.client.solrj.impl.BaseCloudSolrClient.requestWithRetryOnStaleState(BaseCloudSolrClient.java:897)
[junit4] > at
org.apache.solr.client.solrj.impl.BaseCloudSolrClient.request(BaseCloudSolrClient.java:829)
[junit4] > at
org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:211)
[junit4] > at
org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:228)
[junit4] > at
org.apache.solr.cloud.MiniSolrCloudCluster.deleteAllCollections(MiniSolrCloudCluster.java:547)
[junit4] > at
org.apache.solr.cloud.autoscaling.TestPolicyCloud.after(TestPolicyCloud.java:87)
{noformat}
FWIW, I should note:
* I've been unable to reproduce this locally
* the failures happen on diff jenkins servers, on diff branches, and different
OSes
* so far all of the failures i've seen occur when randomized ssl=true, but
AFAICT there isn't any obvious indicatiion that the problem relates to the use
of ssl – i think that just that the added CPU contention caused by using SSL
seems to slow things down enough on the jenkins machine that the race contidion
happens
** ie: because of SSL slowdown, the recovery isn't fully completed before the
DELETE collection cleanup requests are processed
----
My initial hunch/impression was that Overseer (Queue) was getting deadlocked
trying to invoke the PrepRecoveryOp (which was blocked waiting for a leader
that no longer exists) preventing the DeleteCollectionCmd from executing (which
would remove the collection from the ClusterState – the only state change
possible to cause the PrepRecoveryOp's waitForState call to abort.
I tried to write a new test forcing this specific situation (attached), using a
{{CloseHook}} on the leader {{SolrCore}} and {{waitForState()}} predicate that
would gate eachother using {{CountDownLatches}} to try and coerce the specific
sequence of events that seemed to be problematic – but i can't seem to force a
this type of failure (or even semi-reliably trigger it given that the
ZkStateReader executes watchers in a random order).
If anyone else has any ideas what's going on here i'm all ears ... in the
meantime I'm going to try and work up some general improvements to
TestPolicyCLoud to make the assertions tighter and insure all the expected
replicas are fully active before finishing the test.
> Possible racecondition/deadlock between collection DELETE and PrepRecovery ?
> (TestPolicyCloud failures)
> -------------------------------------------------------------------------------------------------------
>
> Key: SOLR-13616
> URL: https://issues.apache.org/jira/browse/SOLR-13616
> Project: Solr
> Issue Type: Bug
> Security Level: Public(Default Security Level. Issues are Public)
> Reporter: Hoss Man
> Priority: Major
> Attachments: SOLR-13616.test-incomplete.patch,
> thetaphi_Lucene-Solr-master-Linux_24358.log.txt
>
>
> Based on some recent jenkins failures in TestPolicyCloud, I suspect there is
> a possible deadlock condition when attempting to delete a collection while
> recovery is in progress.
> I haven't been able to identify exactly where/why/how the problem occurs, but
> it does not appear to be a test specific problem, and seems like it could
> potentially affect anyone unlucky enough to issue poorly timed DELETE.
> Details to follow in comments...
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]