[ https://issues.apache.org/jira/browse/SOLR-11459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrey Kudryavtsev updated SOLR-11459:
--------------------------------------
    Description:
I have a 1_shard / *m*_replicas SolrCloud cluster and run batches of 5-10k in-place updates from time to time.
Once I noticed that the job "hangs": it started and couldn't finish for a while.
Logs were full of messages like:
{code}
Missing update, on which current in-place update depends on, hasn't arrived. id=__, looking for version=___, last found version=0
{code}
{code}
Tried to fetch document ___ from the leader, but the leader says document has been deleted. Deleting the document here and skipping this update: Last found version: 0, was looking for: ___
{code}
Further analysis shows this:
* Among the other updates there are 100-500 updates for nonexistent documents (something that I have to deal with).
* The leader receives a bunch of updates and executes them one by one. {{JavabinLoader}}, which is used for processing documents, reuses the same instance of {{AddUpdateCommand}} for every update and just [clears its state at the end|https://github.com/apache/lucene-solr/blob/e2521b2a8baabdaf43b92192588f51e042d21e97/solr/core/src/java/org/apache/solr/handler/loader/JavabinLoader.java#L99]. Field [AddUpdateCommand#prevVersion|https://github.com/apache/lucene-solr/blob/6396cb759f8c799f381b0730636fa412761030ce/solr/core/src/java/org/apache/solr/update/AddUpdateCommand.java#L76] is not cleared.
* If an update is an in-place update but the specified document does not exist, the update is processed as a regular atomic update (i.e. a new doc is created), but {{prevVersion}} is still used as the {{distrib.inplace.prevversion}} parameter in subsequent calls to the slaves in {{DistributedUpdateProcessor}}. {{prevVersion}} wasn't cleared, so it may contain the version from a previously processed update.
* Each slave checks its own version of the document, which is 0 (because the doc does not exist), concludes that some updates were missed, and spends 5 seconds in [DistributedUpdateProcessor#waitForDependentUpdates|https://github.com/apache/lucene-solr/blob/e2521b2a8baabdaf43b92192588f51e042d21e97/solr/core/src/java/org/apache/solr/handler/loader/JavabinLoader.java#L99] waiting for the missed updates (no luck). It then tries to get the "correct" version from the leader (no luck as well).
* So each such update costs me *m* * 5 sec.

I worked around this with an explicit check of doc existence, but it probably should be fixed.
The obvious first guess is that {{prevVersion}} should be cleared in {{AddUpdateCommand#clear}}, but I have no clue how to test it.
{code}
+++ solr/core/src/java/org/apache/solr/update/AddUpdateCommand.java (revision )
@@ -78,6 +78,7 @@
     updateTerm = null;
     isLastDocInBatch = false;
     version = 0;
+    prevVersion = -1;
   }
{code}

> AddUpdateCommand#prevVersion is not cleared which may lead to problem for in-place updates of non existed documents
> -------------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-11459
>                 URL: https://issues.apache.org/jira/browse/SOLR-11459
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public (Default Security Level. Issues are Public)
>            Reporter: Andrey Kudryavtsev
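
The state leak described in the analysis above can be illustrated in isolation. The following is only a minimal sketch, not Solr code: {{ReusableCommand}}, {{forwardedPrevVersions}} and the {{resetOnClear}} flag are hypothetical names invented for this example, and the "document exists" check is reduced to a positive number in an array. It assumes, as the proposed patch does, that -1 means "no previous version". One command object is reused for a whole batch, and the sketch records which {{prevVersion}} would be forwarded for each update.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: a simplified model of a reused update command.
// None of these types are Solr classes; all names are hypothetical.
final class PrevVersionLeakDemo {

    // Assumed sentinel for "this update depends on no earlier version".
    static final long NO_PREV_VERSION = -1L;

    /** Stand-in for a command object that a loader reuses across updates. */
    static final class ReusableCommand {
        String id;
        long version = 0;
        long prevVersion = NO_PREV_VERSION;

        /** Clears per-update state; prevVersion is reset only if requested. */
        void clear(boolean resetPrevVersion) {
            id = null;
            version = 0;
            if (resetPrevVersion) {
                prevVersion = NO_PREV_VERSION;
            }
        }
    }

    /**
     * Replays a batch through one reused command.
     * existingVersions[i] > 0 means document i exists with that version;
     * any other value means the document does not exist.
     * Returns the prevVersion that would be forwarded for each update.
     */
    static List<Long> forwardedPrevVersions(long[] existingVersions, boolean resetOnClear) {
        ReusableCommand cmd = new ReusableCommand();
        List<Long> forwarded = new ArrayList<>();
        for (int i = 0; i < existingVersions.length; i++) {
            cmd.id = "doc" + i;
            if (existingVersions[i] > 0) {
                // Existing document: the in-place update depends on its current version.
                cmd.prevVersion = existingVersions[i];
            }
            // Missing document: treated as a regular add, prevVersion is left untouched.
            forwarded.add(cmd.prevVersion);
            cmd.clear(resetOnClear);
        }
        return forwarded;
    }

    public static void main(String[] args) {
        long[] batch = {100L, -1L, -1L}; // one existing doc, then two missing docs
        System.out.println(forwardedPrevVersions(batch, false)); // [100, 100, 100]
        System.out.println(forwardedPrevVersions(batch, true));  // [100, -1, -1]
    }
}
```

Without the reset, the two updates for missing documents carry the stale version 100 left over from the first update, which is the situation that would make a receiver wait for a dependency that never arrives. With the reset, they carry -1.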