[jira] [Updated] (SOLR-9941) log replay redundantly (pre-)applies DBQs as if they were out of order

[ https://issues.apache.org/jira/browse/SOLR-9941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ishan Chattopadhyaya updated SOLR-9941:
---------------------------------------
    Fix Version/s:     (was: 6x)
                       6.5

> log replay redundantly (pre-)applies DBQs as if they were out of order
> ----------------------------------------------------------------------
>
>                 Key: SOLR-9941
>                 URL: https://issues.apache.org/jira/browse/SOLR-9941
>             Project: Solr
>          Issue Type: Improvement
>   Security Level: Public (Default Security Level. Issues are Public)
>         Reporter: Hoss Man
>         Assignee: Ishan Chattopadhyaya
>          Fix For: master (7.0), 6.5
>
>      Attachments: SOLR-9941.hoss-test-experiment.patch, SOLR-9941.patch, SOLR-9941.patch, SOLR-9941.patch, SOLR-9941.patch, SOLR-9941.patch
>
>
> There's an odd situation that arises when a Solr node starts up (after a crash) and tries to recover from its tlog: deletes end up being redundantly and excessively applied, and at a minimum this produces really confusing log messages.
> * {{UpdateLog.init(...)}} creates {{TransactionLog}} instances for the most recent log files found (based on numRecordsToKeep) and then builds a {{RecentUpdates}} instance from them
> * Delete entries from the {{RecentUpdates}} are used to populate 2 lists:
> ** {{deleteByQueries}}
> ** {{oldDeletes}} (for deleteById)
> * Then, when {{UpdateLog.recoverFromLog}} is called, a {{LogReplayer}} is used to replay any (uncommitted) {{TransactionLog}} entries
> ** during replay, {{UpdateLog}} delegates to the UpdateRequestProcessorChain for the various adds/deletes, etc.
> ** when an add makes it to {{RunUpdateProcessor}}, it delegates to {{DirectUpdateHandler2}}, which (independent of the fact that we're in log replay) calls {{UpdateLog.getDBQNewer}} for every add, looking for any "reordered" deletes that have a version greater than the add
> *** if it finds _any_ DBQs "newer" than the document being added, it does a low-level {{IndexWriter.updateDocument}} and then immediately executes _all_ the newer DBQs ... _once per add_
> ** these deletes are *also* still executed as part of the normal tlog replay, because they are in the tlog.
> Which means that if you are recovering from a tlog with 90 addDocs followed by 5 DBQs, then *each* of those 5 DBQs will be executed 91 times -- and for 90 of those executions, a DUH2 INFO log message will say {{"Reordered DBQs detected. ..."}} even though the only reason they are out of order is that Solr is deliberately applying them out of order.
> * At a minimum we should improve the log messages
> * Ideally we should stop (pre-emptively) applying these deletes during tlog replay.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
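[Editorial note] The multiplication described above (90 addDocs followed by 5 DBQs, so each DBQ runs 91 times) can be checked with a small self-contained model. The class and method below are illustrative stand-ins for the replay bookkeeping, not Solr's actual API:

```java
import java.util.Arrays;

class ReplayCountSketch {
    /** Returns how many times each DBQ ends up executed during a startup replay. */
    static int[] executionCounts(int numAdds, int numDbqs) {
        // Versions: adds get 1..numAdds, DBQs get numAdds+1..numAdds+numDbqs,
        // mirroring a tlog where the DBQs were logged after the adds.
        int[] executions = new int[numDbqs];
        for (int addVersion = 1; addVersion <= numAdds; addVersion++) {
            // RecentUpdates pre-populated every DBQ before replay began, so each
            // replayed add sees all DBQs as "newer" and executes every one of them.
            for (int d = 0; d < numDbqs; d++) {
                if (numAdds + d + 1 > addVersion) executions[d]++;
            }
        }
        // Each DBQ is also replayed once from the tlog itself.
        for (int d = 0; d < numDbqs; d++) executions[d]++;
        return executions;
    }

    public static void main(String[] args) {
        // The scenario from the issue: 90 addDocs followed by 5 DBQs.
        System.out.println(Arrays.toString(executionCounts(90, 5)));
        // prints [91, 91, 91, 91, 91]
    }
}
```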
[jira] [Updated] (SOLR-9941) log replay redundantly (pre-)applies DBQs as if they were out of order

[ https://issues.apache.org/jira/browse/SOLR-9941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hoss Man updated SOLR-9941:
---------------------------
    Attachment: SOLR-9941.patch

bq. In TestRecovery, we kinda mixed up single instance code path along with SolrCloud mode code path, so we end up with the fail.

Yeah, that was the piece I wasn't clear about -- makes sense.

Based on Dat's comments, I re-did my previous idea as a new independent {{testNewDBQAndDocMatchingOldDBQDuringLogReplay}} method:
* the "client" only sends regular updates (w/o fake versions), as it would to a leader/single-node instance
* confirms that a DBQ which arrives during recovery is processed correctly, even if the docs being deleted haven't been added yet as part of recovery
* also includes a DBQ in the tlog which may (randomly) be applied before/after the new DBQ arrives
** just to sanity check that the new delete doesn't muck with the replay of old deletes
* the DBQ from the tlog also matches a doc which isn't added until recovery is in process (randomly before/after the DBQ from the tlog is applied), to verify that doc isn't deleted by mistake

...I've added this new test method to the patch to help with the coverage of the affected code paths.

bq. Unless someone has any objections or suggests some modifications, I'd like to commit this after the 6.4 branch is cut (or alternatively, commit this to master and wait for backporting after the 6.4 branch is cut).

The sooner it gets committed to master, the sooner Jenkins starts helping us with randomized testing -- since Dat thinks the change makes sense, and there are no objections, I would encourage committing to master early and waiting to backport.
[jira] [Updated] (SOLR-9941) log replay redundantly (pre-)applies DBQs as if they were out of order

[ https://issues.apache.org/jira/browse/SOLR-9941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ishan Chattopadhyaya updated SOLR-9941:
---------------------------------------
    Attachment: SOLR-9941.patch

Thanks [~caomanhdat]. Updated the patch; removed the conditional parameter for startup.

Unless someone has any objections or suggests some modifications, I'd like to commit this after the 6.4 branch is cut (or alternatively, commit this to master and wait for backporting after the 6.4 branch is cut).
[jira] [Updated] (SOLR-9941) log replay redundantly (pre-)applies DBQs as if they were out of order

[ https://issues.apache.org/jira/browse/SOLR-9941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hoss Man updated SOLR-9941:
---------------------------
    Attachment: SOLR-9941.hoss-test-experiment.patch

I wanted to beef up Ishan's testLogReplayWithReorderedDBQ to prove that if another (properly ordered) DBQ arrived _during_ log replay, it would correctly be applied -- even if some affected docs hadn't been added yet as part of replay (i.e. prove that "RecentUpdates" was still being used during replay).

But for some reason my modified test fails even on master. I suspect either I'm misunderstanding something about how recovery works, or about the way TestRecovery works (or I just have a silly bug/typo somewhere).

In any case, I've attached only my test idea in SOLR-9941.hoss-test-experiment.patch if anyone wants to take a look.
[jira] [Updated] (SOLR-9941) log replay redundantly (pre-)applies DBQs as if they were out of order

[ https://issues.apache.org/jira/browse/SOLR-9941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ishan Chattopadhyaya updated SOLR-9941:
---------------------------------------
    Attachment: SOLR-9941.patch

Attaching a fix to clear out the DBQs and oldDeletes lists before a log replay upon startup. [~hossman], [~markrmil...@gmail.com], [~yo...@apache.org], can you please review?
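[Editorial note] A minimal sketch of what "clear out the DBQs and oldDeletes lists before a log replay" amounts to, using a hypothetical simplified type as a stand-in for the relevant UpdateLog internals (the real patch modifies Solr's {{UpdateLog}} directly):

```java
import java.util.ArrayList;
import java.util.List;

/** Toy stand-in for the pieces of UpdateLog this fix touches. */
class UpdateLogSketch {
    final List<String> deleteByQueries = new ArrayList<>(); // populated from RecentUpdates
    final List<String> oldDeletes = new ArrayList<>();      // deleteById entries

    /** The idea of the fix: forget the pre-populated deletes before a startup
     *  replay, so replayed adds no longer see them as "reordered" DBQs. */
    void prepareForStartupReplay() {
        deleteByQueries.clear();
        oldDeletes.clear();
    }
}
```

The deletes are not lost: they are still present in the tlog itself and get applied once each during the normal course of replay.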
[jira] [Updated] (SOLR-9941) log replay redundantly (pre-)applies DBQs as if they were out of order

[ https://issues.apache.org/jira/browse/SOLR-9941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ishan Chattopadhyaya updated SOLR-9941:
---------------------------------------
    Attachment: SOLR-9941.patch

Seems like the fix I posted below won't work, because it would also preclude us from processing actually re-ordered DBQs, and hence leave out some documents that should have been deleted. Added a test for this (which should be committed anyway, I think).
[jira] [Updated] (SOLR-9941) log replay redundantly (pre-)applies DBQs as if they were out of order

[ https://issues.apache.org/jira/browse/SOLR-9941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ishan Chattopadhyaya updated SOLR-9941:
---------------------------------------
    Attachment: SOLR-9941.patch

The attached patch seems to fix the problem, and feels like the right thing to do. Based on Hoss' idea:
{quote}
it seems like at a minimum DUH2 could inspect the UpdateCommand flags to see if this is a {{REPLAY}} command and if it is skip the {{UpdateLog.getDBQNewer}} call?
{quote}
The corresponding test failure on the jira/solr-5944 branch passes with this patch (even after excluding the getUpdateLog().deleteAll() call).
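[Editorial note] The quoted suggestion can be illustrated with a toy model. The flag value and method name below are hypothetical stand-ins chosen for this sketch, not Solr's actual DirectUpdateHandler2 or UpdateCommand API:

```java
/** Toy model of the REPLAY-flag guard described in the quote above. */
class ReplayFlagSketch {
    // Hypothetical flag bit, for illustration only; Solr defines its own
    // flag constants on UpdateCommand.
    static final int REPLAY = 0x02;

    /** Returns true when the per-add reordered-DBQ check should run. */
    static boolean shouldCheckForReorderedDBQs(int commandFlags) {
        // During tlog replay the deletes will be replayed from the tlog
        // anyway, so the per-add getDBQNewer-style lookup can be skipped.
        return (commandFlags & REPLAY) == 0;
    }
}
```

With a guard like this, a normal (non-replay) add still performs the reordered-DBQ check, while replayed adds skip it and rely on the tlog to apply the deletes in their logged positions.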