[ https://issues.apache.org/jira/browse/SOLR-12833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16831703#comment-16831703 ]
Andrzej Bialecki commented on SOLR-12833:
------------------------------------------

[~yuanyun.cn] Hmm, I'm seeing occasional lock-ups when beasting {{PeerSyncTest}}, with stack traces that point to the newly refactored methods in {{DistributedUpdateProcessor}} and {{VersionBucket}} (specifically, the code that uses intrinsic monitors for locking). If we can't find the cause soon, we may need to revert this patch, at least from {{branch_8x}} and {{branch_8_1}}.

Here's an example stack trace:
{code:java}
[beaster] 2> 9903 INFO (qtp1564460830-112) [ x:collection1] o.a.s.s.SolrIndexSearcher Opening [Searcher@2d936b61[collection1] realtime]
[beaster] 2> 9905 INFO (qtp1564460830-112) [ x:collection1] o.a.s.s.SolrIndexSearcher Opening [Searcher@2c12d484[collection1] realtime]
[beaster] 2> 9907 INFO (qtp1564460830-112) [ x:collection1] o.a.s.u.p.LogUpdateProcessorFactory [collection1] webapp=/jeqeo/s path=/update params={update.distrib=FROMLEADER&_version_=6004&wt=javabin&version=2}{deleteByQuery=val_i_dvo:6 (-6004)} 0 11
[beaster] 2> 9908 INFO (qtp1627373062-114) [ x:collection1] o.a.s.u.PeerSync PeerSync: core=collection1 url= START replicas=[http://127.0.0.1:50049/jeqeo/s/collection1] nUpdates=100
[beaster] 2> 9909 INFO (qtp1564460830-56) [ x:collection1] o.a.s.u.IndexFingerprint IndexFingerprint millis:0.0 result:{maxVersionSpecified=9223372036854775807, maxVersionEncountered=4110, maxInHash=4110, versionsHash=-2875136333831421842, numVersions=219, numDocs=219, maxDoc=111}
[beaster] 2> 9909 INFO (qtp1564460830-56) [ x:collection1] o.a.s.c.S.Request [collection1] webapp=/jeqeo/s path=/get params={distrib=false&qt=/get&getFingerprint=9223372036854775807&wt=javabin&version=2} status=0 QTime=0
[beaster] 2> 9910 INFO (qtp1627373062-114) [ x:collection1] o.a.s.u.IndexFingerprint IndexFingerprint millis:0.0 result:{maxVersionSpecified=9223372036854775807, maxVersionEncountered=4110, maxInHash=4110, versionsHash=-2875136333831421842, numVersions=219, numDocs=219, maxDoc=110}
[beaster] 2> 9910 INFO (qtp1627373062-114) [ x:collection1] o.a.s.u.PeerSync We are already in sync. No need to do a PeerSync
[beaster] 2> 9910 INFO (qtp1627373062-114) [ x:collection1] o.a.s.c.S.Request [collection1] webapp=/jeqeo/s path=/get params={qt=/get&getVersions=100&sync=http://127.0.0.1:50049/jeqeo/s/collection1&wt=javabin&version=2} status=0 QTime=2
[beaster] 2> 129922 INFO (TEST-PeerSyncTest.test-seed#[A1B6A536E7B4423F]) [ ] o.a.s.SolrTestCaseJ4 ###Ending test
...
[beaster] 2> 144960 INFO (qtp1564460830-112) [ x:collection1] o.a.s.u.p.LogUpdateProcessorFactory [collection1] webapp=/jeqeo/s path=/update params={update.distrib=FROMLEADER&distrib.inplace.prevversion=6000&wt=javabin&version=2}{} 0 135044
[beaster] 2> 144960 ERROR (qtp1564460830-112) [ x:collection1] o.a.s.h.RequestHandlerBase java.lang.RuntimeException: java.lang.InterruptedException
[beaster] 2> at org.apache.solr.update.VersionBucket.awaitNanos(VersionBucket.java:68)
[beaster] 2> at org.apache.solr.update.processor.DistributedUpdateProcessor.doWaitForDependentUpdates(DistributedUpdateProcessor.java:593)
[beaster] 2> at org.apache.solr.update.processor.DistributedUpdateProcessor.lambda$waitForDependentUpdates$1(DistributedUpdateProcessor.java:536)
[beaster] 2> at org.apache.solr.update.VersionBucket.runWithLock(VersionBucket.java:50)
[beaster] 2> at org.apache.solr.update.processor.DistributedUpdateProcessor.waitForDependentUpdates(DistributedUpdateProcessor.java:536)
[beaster] 2> at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:327)
[beaster] 2> at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:223)
...
[beaster] 2> Caused by: java.lang.InterruptedException
[beaster] 2> at java.base/java.lang.Object.wait(Native Method)
[beaster] 2> at org.apache.solr.update.VersionBucket.awaitNanos(VersionBucket.java:66)
[beaster] 2> ... 52 more
{code}

Here's how to reproduce this (it usually fails within the first 10 rounds):
{code:java}
cd solr/core
ant beast -Dbeast.iters=50 -Dtestcase=PeerSyncTest -Dtests.method=test -Dtests.slow=true -Dtests.badapples=true -Dtests.asserts=true
{code}

Some of the seeds that failed during beasting (but don't seem to fail when running standalone):
{code:java}
ant test -Dtestcase=PeerSyncTest -Dtests.method=test -Dtests.seed=35EDD6492A06CFE -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=fr-CD -Dtests.timezone=Europe/Brussels -Dtests.asserts=true -Dtests.file.encoding=ISO-8859-1
ant test -Dtestcase=PeerSyncTest -Dtests.method=test -Dtests.seed=A1B6A536E7B4423F -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=en-NF -Dtests.timezone=America/Dawson -Dtests.asserts=true -Dtests.file.encoding=US-ASCII
ant test -Dtestcase=PeerSyncTest -Dtests.method=test -Dtests.seed=A9180C308CF9355B -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=kab -Dtests.timezone=CTT -Dtests.asserts=true -Dtests.file.encoding=ISO-8859-1
{code}

I also managed to capture a full thread dump when it locked up (see the attachment).

> Use timed-out lock in DistributedUpdateProcessor
> ------------------------------------------------
>
> Key: SOLR-12833
> URL: https://issues.apache.org/jira/browse/SOLR-12833
> Project: Solr
> Issue Type: Improvement
> Security Level: Public(Default Security Level. Issues are Public)
> Components: update, UpdateRequestProcessors
> Affects Versions: 7.5, 8.0
> Reporter: jefferyyuan
> Assignee: Mark Miller
> Priority: Minor
> Fix For: 7.7, 8.0
>
> Attachments: SOLR-12833-noint.patch, SOLR-12833.patch, SOLR-12833.patch
>
> Time Spent: 20m
> Remaining Estimate: 0h
>
> There is a synchronized block that blocks other update requests whose IDs fall in the same hash bucket. An update waits forever until it gets the lock at the synchronized block, which can be a problem in some cases.
>
> Some add/update requests (for example, updates with spatial/shape analysis) may take a long time (30+ seconds or even more), which would make the request time out and fail.
> The client may retry the same requests multiple times or for several minutes, which would make things worse.
> The server side receives all the update requests, but all except one can do nothing and have to wait there. This wastes precious memory and CPU resources.
> We have seen a case where 2000+ threads were blocked at the synchronized lock and only a few updates were making progress. Each thread takes 3+ MB of memory, which causes OOM.
> Also, if the update can't get the lock in the expected time range, it's better to fail fast.
>
> We can have one configuration in solrconfig.xml: updateHandler/versionLock/timeInMill, so users can specify how long they want to wait for the version bucket lock.
> The default value can be -1, so it behaves the same as before: wait forever until it gets the lock.
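
For illustration only, here is a rough sketch of the fail-fast behaviour the issue description asks for. It assumes a {{java.util.concurrent.locks.ReentrantLock}} with {{tryLock}} in place of the synchronized block, which is not how the attached patches do it (the stack trace above shows them waiting on intrinsic monitors). The class name {{TimedBucketLock}}, its constructor parameter and the exception types are hypothetical; the timeout stands in for whatever updateHandler/versionLock/timeInMill would resolve to.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Supplier;

// Hypothetical sketch; this is NOT the VersionBucket class from the patch.
class TimedBucketLock {
    private final ReentrantLock lock = new ReentrantLock();
    // Assumed convention from the issue description: -1 (or any value <= 0) means wait forever.
    private final long timeoutMs;

    TimedBucketLock(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    // Runs the action while holding the lock, or fails fast if the lock
    // cannot be acquired within the configured timeout.
    <T> T runWithLock(Supplier<T> action) {
        if (!acquire()) {
            throw new IllegalStateException(
                "Could not get the bucket lock within " + timeoutMs + " ms");
        }
        try {
            return action.get();
        } finally {
            lock.unlock();
        }
    }

    private boolean acquire() {
        if (timeoutMs <= 0) {
            lock.lock(); // old behaviour: block until the lock is available
            return true;
        }
        try {
            return lock.tryLock(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            // Restore the interrupt flag so callers can still observe it.
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for the bucket lock", e);
        }
    }
}
```

With a positive timeout, a request that cannot get the lock in time gets an exception instead of parking a thread indefinitely; with -1 the sketch behaves like the original blocking lock.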