[jira] [Commented] (SOLR-12833) Use timed-out lock in DistributedUpdateProcessor

Andrzej Bialecki (JIRA) Thu, 02 May 2019 08:45:05 -0700


    [ 
https://issues.apache.org/jira/browse/SOLR-12833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16831703#comment-16831703
 ]


Andrzej Bialecki  commented on SOLR-12833:
------------------------------------------

[~yuanyun.cn] Hmm, I'm seeing occasional lock-ups when beasting 
{{PeerSyncTest}}, with stack traces that point to the newly refactored methods 
in {{DistributedUpdateProcessor}} and {{VersionBucket}} (specifically, the code 
that uses intrinsic monitors for locking). If we can't find the cause soon, we 
may need to revert this patch, at least from {{branch_8x}} and 
{{branch_8_1}}.

Here's an example stacktrace:
{code:java}
  [beaster]   2> 9903 INFO  (qtp1564460830-112) [    x:collection1] o.a.s.s.SolrIndexSearcher Opening [Searcher@2d936b61[collection1] realtime]
  [beaster]   2> 9905 INFO  (qtp1564460830-112) [    x:collection1] o.a.s.s.SolrIndexSearcher Opening [Searcher@2c12d484[collection1] realtime]
  [beaster]   2> 9907 INFO  (qtp1564460830-112) [    x:collection1] o.a.s.u.p.LogUpdateProcessorFactory [collection1]  webapp=/jeqeo/s path=/update params={update.distrib=FROMLEADER&_version_=6004&wt=javabin&version=2}{deleteByQuery=val_i_dvo:6 (-6004)} 0 11
  [beaster]   2> 9908 INFO  (qtp1627373062-114) [    x:collection1] o.a.s.u.PeerSync PeerSync: core=collection1 url= START replicas=[http://127.0.0.1:50049/jeqeo/s/collection1] nUpdates=100
  [beaster]   2> 9909 INFO  (qtp1564460830-56) [    x:collection1] o.a.s.u.IndexFingerprint IndexFingerprint millis:0.0 result:{maxVersionSpecified=9223372036854775807, maxVersionEncountered=4110, maxInHash=4110, versionsHash=-2875136333831421842, numVersions=219, numDocs=219, maxDoc=111}
  [beaster]   2> 9909 INFO  (qtp1564460830-56) [    x:collection1] o.a.s.c.S.Request [collection1]  webapp=/jeqeo/s path=/get params={distrib=false&qt=/get&getFingerprint=9223372036854775807&wt=javabin&version=2} status=0 QTime=0
  [beaster]   2> 9910 INFO  (qtp1627373062-114) [    x:collection1] o.a.s.u.IndexFingerprint IndexFingerprint millis:0.0 result:{maxVersionSpecified=9223372036854775807, maxVersionEncountered=4110, maxInHash=4110, versionsHash=-2875136333831421842, numVersions=219, numDocs=219, maxDoc=110}
  [beaster]   2> 9910 INFO  (qtp1627373062-114) [    x:collection1] o.a.s.u.PeerSync We are already in sync. No need to do a PeerSync
  [beaster]   2> 9910 INFO  (qtp1627373062-114) [    x:collection1] o.a.s.c.S.Request [collection1]  webapp=/jeqeo/s path=/get params={qt=/get&getVersions=100&sync=http://127.0.0.1:50049/jeqeo/s/collection1&wt=javabin&version=2} status=0 QTime=2
  [beaster]   2> 129922 INFO  (TEST-PeerSyncTest.test-seed#[A1B6A536E7B4423F]) [    ] o.a.s.SolrTestCaseJ4 ###Ending test

...

  [beaster]   2> 144960 INFO  (qtp1564460830-112) [    x:collection1] o.a.s.u.p.LogUpdateProcessorFactory [collection1]  webapp=/jeqeo/s path=/update params={update.distrib=FROMLEADER&distrib.inplace.prevversion=6000&wt=javabin&version=2}{} 0 135044
  [beaster]   2> 144960 ERROR (qtp1564460830-112) [    x:collection1] o.a.s.h.RequestHandlerBase java.lang.RuntimeException: java.lang.InterruptedException
  [beaster]   2>        at org.apache.solr.update.VersionBucket.awaitNanos(VersionBucket.java:68)
  [beaster]   2>        at org.apache.solr.update.processor.DistributedUpdateProcessor.doWaitForDependentUpdates(DistributedUpdateProcessor.java:593)
  [beaster]   2>        at org.apache.solr.update.processor.DistributedUpdateProcessor.lambda$waitForDependentUpdates$1(DistributedUpdateProcessor.java:536)
  [beaster]   2>        at org.apache.solr.update.VersionBucket.runWithLock(VersionBucket.java:50)
  [beaster]   2>        at org.apache.solr.update.processor.DistributedUpdateProcessor.waitForDependentUpdates(DistributedUpdateProcessor.java:536)
  [beaster]   2>        at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:327)
  [beaster]   2>        at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:223)
...
  [beaster]   2> Caused by: java.lang.InterruptedException
  [beaster]   2>        at java.base/java.lang.Object.wait(Native Method)
  [beaster]   2>        at org.apache.solr.update.VersionBucket.awaitNanos(VersionBucket.java:66)
  [beaster]   2>        ... 52 more
{code}
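For readers unfamiliar with the pattern behind this trace, here is a minimal sketch of a timed wait on an intrinsic monitor in which an {{InterruptedException}} from {{Object.wait}} is rethrown as a {{RuntimeException}}. This is not the actual {{VersionBucket}} source: the class {{MonitorBucket}} and its methods {{advance}} and {{awaitVersion}} are hypothetical names chosen only for illustration.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical illustration only; not Solr's VersionBucket. */
class MonitorBucket {
    private long highest; // guarded by the intrinsic monitor of this object

    /** Records a newer version and wakes up any waiting threads. */
    synchronized void advance(long version) {
        if (version > highest) {
            highest = version;
        }
        notifyAll();
    }

    /**
     * Waits until the bucket has reached the given version or the timeout
     * elapses. Returns true if the version was reached, false on timeout.
     */
    synchronized boolean awaitVersion(long version, long timeoutNanos) {
        long deadline = System.nanoTime() + timeoutNanos;
        while (highest < version) { // loop guards against spurious wakeups
            long remaining = deadline - System.nanoTime();
            if (remaining <= 0) {
                return false;
            }
            try {
                // Timed Object.wait on this monitor; releases the monitor while waiting.
                TimeUnit.NANOSECONDS.timedWait(this, remaining);
            } catch (InterruptedException e) {
                // Restore the interrupt flag, then surface the failure unchecked,
                // which produces a "RuntimeException: InterruptedException" trace.
                Thread.currentThread().interrupt();
                throw new RuntimeException(e);
            }
        }
        return true;
    }

    public static void main(String[] args) {
        MonitorBucket bucket = new MonitorBucket();
        System.out.println("reached before advance: " + bucket.awaitVersion(5, 10_000_000L));
        bucket.advance(5);
        System.out.println("reached after advance: " + bucket.awaitVersion(5, 10_000_000L));
    }
}
```

In this sketch a waiter only returns early if another thread calls {{advance}} or interrupts it; otherwise it holds its request thread for the full timeout.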
Here's how to reproduce this (it usually fails within the first 10 rounds):
{code:java}
cd solr/core
ant beast -Dbeast.iters=50 -Dtestcase=PeerSyncTest -Dtests.method=test \
  -Dtests.slow=true -Dtests.badapples=true -Dtests.asserts=true
{code}
Some of the seeds that failed during beasting (but don't seem to fail when 
running standalone):
{code:java}
ant test -Dtestcase=PeerSyncTest -Dtests.method=test \
  -Dtests.seed=35EDD6492A06CFE -Dtests.slow=true -Dtests.badapples=true \
  -Dtests.locale=fr-CD -Dtests.timezone=Europe/Brussels -Dtests.asserts=true \
  -Dtests.file.encoding=ISO-8859-1
ant test -Dtestcase=PeerSyncTest -Dtests.method=test \
  -Dtests.seed=A1B6A536E7B4423F -Dtests.slow=true -Dtests.badapples=true \
  -Dtests.locale=en-NF -Dtests.timezone=America/Dawson -Dtests.asserts=true \
  -Dtests.file.encoding=US-ASCII
ant test -Dtestcase=PeerSyncTest -Dtests.method=test \
  -Dtests.seed=A9180C308CF9355B -Dtests.slow=true -Dtests.badapples=true \
  -Dtests.locale=kab -Dtests.timezone=CTT -Dtests.asserts=true \
  -Dtests.file.encoding=ISO-8859-1
{code}

I also managed to capture a full thread dump when it locked up (see the 
attachment).

> Use timed-out lock in DistributedUpdateProcessor
> ------------------------------------------------
>
>                 Key: SOLR-12833
>                 URL: https://issues.apache.org/jira/browse/SOLR-12833
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: update, UpdateRequestProcessors
>    Affects Versions: 7.5, 8.0
>            Reporter: jefferyyuan
>            Assignee: Mark Miller
>            Priority: Minor
>             Fix For: 7.7, 8.0
>
>         Attachments: SOLR-12833-noint.patch, SOLR-12833.patch, 
> SOLR-12833.patch
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> There is a synchronized block that blocks other update requests whose IDs 
> fall in the same hash bucket. An update waits indefinitely until it acquires 
> the lock on that block, which can be a problem in some cases.
>  
> Some add/update requests (for example, updates with spatial/shape analysis) 
> may take a long time (30+ seconds or even more), which causes requests to 
> time out and fail.
> The client may retry the same request multiple times, or for several minutes, 
> which makes things worse.
> The server side receives all the update requests, but all except one can do 
> nothing and have to wait. This wastes precious memory and CPU resources.
> We have seen cases where 2000+ threads are blocked on the synchronized lock 
> and only a few updates are making progress. Each thread takes 3+ MB of 
> memory, which causes OOM.
> Also, if an update can't get the lock within the expected time, it's better 
> to fail fast.
>  
> We can add one configuration option in solrconfig.xml, 
> updateHandler/versionLock/timeInMill, so users can specify how long they want 
> to wait for the version bucket lock.
> The default value can be -1, which keeps the current behavior: wait forever 
> until the lock is acquired.
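
The proposal in the description above can be sketched roughly as follows, assuming a {{java.util.concurrent.locks.ReentrantLock}} per bucket. The class {{TimedBucketLock}}, its constructor parameter, and the choice of {{IllegalStateException}} are hypothetical and are not taken from the attached patches. A negative timeout keeps the wait-forever behavior; any other value fails fast once the timeout elapses.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Supplier;

/** Hypothetical illustration only; not the code from the SOLR-12833 patches. */
class TimedBucketLock {
    private final ReentrantLock lock = new ReentrantLock();
    private final long timeoutMs; // negative means wait forever

    TimedBucketLock(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    /** Runs the action while holding the lock, or fails fast on timeout. */
    <T> T runWithLock(Supplier<T> action) {
        if (timeoutMs < 0) {
            lock.lock(); // same semantics as a synchronized block: wait indefinitely
        } else {
            boolean acquired;
            try {
                acquired = lock.tryLock(timeoutMs, TimeUnit.MILLISECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt status
                throw new IllegalStateException("Interrupted while waiting for the bucket lock", e);
            }
            if (!acquired) {
                throw new IllegalStateException(
                    "Could not acquire the bucket lock within " + timeoutMs + " ms");
            }
        }
        try {
            return action.get();
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) {
        TimedBucketLock bucket = new TimedBucketLock(50);
        int result = bucket.runWithLock(() -> 42);
        System.out.println("ran under lock, result = " + result);
    }
}
```

With a finite timeout, a request blocked behind a slow update gives up and releases its thread instead of piling up behind the lock, which is the fail-fast behavior the description asks for.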



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

