[jira] [Comment Edited] (SOLR-9941) log replay redundently (pre-)applies DBQs as if they were out of order

2017-01-10 Thread Cao Manh Dat (JIRA)

[ https://issues.apache.org/jira/browse/SOLR-9941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15813726#comment-15813726 ]

Cao Manh Dat edited comment on SOLR-9941 at 1/10/17 9:17 AM:
-

Here are the reasons why Hoss's test fails:
- When the test adds updates, they are written to tlog-
- After the core is restarted, it replays the updates from tlog-
- At the same time, the test adds another DBQ; because the ulog is not active, 
DUP writes this DBQ to another file (tlog-0001, as a buffered update) and does 
not apply it to the IW.
{code:title=DistributedUpdateProcessor.java|borderStyle=solid}
// we're not in an active state, and this update isn't from a replay, so buffer it.
cmd.setFlags(cmd.getFlags() | UpdateCommand.BUFFERING);
ulog.add(cmd);
{code}
- After log replay finishes, we call a commit, which writes a commit record at 
the end of tlog-0001

So this DBQ update will never be replayed (because it belongs to tlog-0001, not 
tlog-), and it has never been applied to the IW, so that DBQ update is lost.
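
A minimal sketch of why tlog-0001 is then skipped: on startup, UpdateLog only 
schedules a tlog for replay if it does not end with a commit record 
({{TransactionLog.endsWithCommit()}} is the real check; the loop below is 
condensed and illustrative, not the actual recoverFromLog code).
{code:title=Why tlog-0001 is never replayed (sketch)|borderStyle=solid}
// Illustrative sketch only: pick the logs to replay, newest first,
// stopping at the first log that ends with a commit record.
List<TransactionLog> logsToReplay(List<TransactionLog> newestFirst) throws IOException {
  List<TransactionLog> toReplay = new ArrayList<>();
  for (TransactionLog tlog : newestFirst) {
    if (tlog.endsWithCommit()) {
      // tlog-0001 ends with the commit written after replay -> never replayed
      break;
    }
    toReplay.add(tlog);
  }
  return toReplay;
}
{code}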

But this does not mean we have a problem in SolrCloud mode or single-instance 
mode.
In SolrCloud mode, when a replica runs recoverFromLog, the replica's state 
during this period is DOWN, so it won't receive any updates.

In single-instance mode, we run the leader logic, so the DBQ is applied to the 
IW immediately (along with updating {{ulog.deleteByQueries}}).

In TestRecovery, we kind of mix the single-instance code path with the 
SolrCloud code path, so we end up with the failure.
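
To make the sequence concrete, here is a rough TestRecovery-style sketch of the 
failing timeline (helper calls follow SolrTestCaseJ4 conventions, but the exact 
steps and timing hooks are illustrative):
{code:title=Failing sequence (sketch)|borderStyle=solid}
// 1. adds are written to the current tlog
assertU(adoc("id", "A1"));
// 2. restart the core; replay of that tlog begins
h.close();
createCore();
// 3. a DBQ arrives while the ulog is still not ACTIVE: DUP buffers it
//    into the next tlog (tlog-0001) and skips the IndexWriter
assertU(delQ("id:A1"));
// 4. replay finishes and a commit is written at the end of tlog-0001,
//    so the buffered DBQ is never replayed and never reaches the IW
assertU(commit());
assertJQ(req("q", "id:A1"), "/response/numFound==1");  // the delete was lost
{code}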


was (Author: caomanhdat):
Here are the reasons why Hoss's test fails:
- When the test adds updates, they are written to tlog-
- After the core is restarted, it replays the updates from tlog-
- At the same time, the test adds another DBQ; because the ulog is not active, 
DUP writes this DBQ to another file (tlog-0001, as a buffered update) and does 
not apply it to the IW.
{code:title=DistributedUpdateProcessor.java|borderStyle=solid}
// we're not in an active state, and this update isn't from a replay, so buffer it.
cmd.setFlags(cmd.getFlags() | UpdateCommand.BUFFERING);
ulog.add(cmd);
{code}
- After log replay finishes, we call a commit, which writes a commit record at 
the end of tlog-0001

So this DBQ update will never be replayed (because it belongs to tlog-0001, not 
tlog-), and it has never been applied to the IW, so that DBQ update is lost.

But this does not mean we have a problem in SolrCloud mode or single-instance 
mode.
In SolrCloud mode, when a replica runs recoverFromLog, the replica's state 
during this period is DOWN, so it won't receive any updates.

In single-instance mode, we run the leader logic, so the DBQ is applied to the 
IW immediately (along with updating {{ulog.deleteByQueries}}).

But in TestRecovery, we kind of mix the single-instance code path with the 
SolrCloud code path, so we end up with the failure.

> log replay redundently (pre-)applies DBQs as if they were out of order
> --
>
> Key: SOLR-9941
> URL: https://issues.apache.org/jira/browse/SOLR-9941
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Hoss Man
> Attachments: SOLR-9941.hoss-test-experiment.patch, SOLR-9941.patch, 
> SOLR-9941.patch, SOLR-9941.patch
>
>
> There's kind of an odd situation that arises when a Solr node starts up 
> (after a crash) and tries to recover from its tlog that causes deletes to be 
> redundantly & excessively applied -- at a minimum it causes really confusing 
> log messages
> * {{UpdateLog.init(...)}} creates {{TransactionLog}} instances for the most 
> recent log files found (based on numRecordsToKeep) and then builds a 
> {{RecentUpdates}} instance from them
> * Delete entries from the {{RecentUpdates}} are used to populate 2 lists:
> ** {{deleteByQueries}}
> ** {{oldDeletes}} (for deleteById).
> * Then when {{UpdateLog.recoverFromLog}} is called, a {{LogReplayer}} is used 
> to replay any (uncommitted) {{TransactionLog}} entries
> ** during replay {{UpdateLog}} delegates to the UpdateRequestProcessorChain 
> for the various adds/deletes, etc...
> ** when an add makes it to {{RunUpdateProcessor}} it delegates to 
> {{DirectUpdateHandler2}}, which (independent of the fact that we're in log 
> replay) calls {{UpdateLog.getDBQNewer}} for every add, looking for any 
> "reordered" deletes that have a version greater than the add
> *** if it finds _any_ DBQs "newer" than the document being added, it does a 
> low-level {{IndexWriter.updateDocument}} and then immediately executes _all_ 
> the newer DBQs ... _once per add_
> ** these deletes are *also* still executed as part of the normal tlog replay, 
> because they are in the tlog.
> Which means if you are recovering from a tlog with 90 addDocs followed by 5 
> DBQs, then *each* of those 5 DBQs will be executed 91 times -- and for 90 of 
> those executions, a DUH2 INFO log message will say {{"Reordered DBQs 
> detected. ..."}} even though the only reason they are out of order is that 
> Solr is deliberately applying them out of order.
> * At a minimum we should improve the log messages
> * Ideally we should stop (pre-emptively) applying these deletes during tlog 
> replay.
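
For readers skimming the quoted report, a condensed sketch of the per-add 
pre-apply it describes ({{getDBQNewer}} and {{addAndDelete}} are real names in 
UpdateLog/DirectUpdateHandler2, but the body here is simplified and partly 
hypothetical):
{code:title=DUH2 pre-apply (sketch)|borderStyle=solid}
public int addDoc(AddUpdateCommand cmd) throws IOException {
  // executed for EVERY add, even during log replay:
  List<UpdateLog.DBQ> deletesAfter = null;
  if (cmd.getVersion() > 0) {
    deletesAfter = ulog.getDBQNewer(cmd.getVersion());
  }
  if (deletesAfter != null) {
    log.info("Reordered DBQs detected. ...");  // logged once per add
    addAndDelete(cmd, deletesAfter);  // low-level update + ALL newer DBQs again
  } else {
    doNormalUpdate(cmd);  // hypothetical name for the regular add path
  }
  return 1;
}
{code}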

[jira] [Comment Edited] (SOLR-9941) log replay redundently (pre-)applies DBQs as if they were out of order

2017-01-10 Thread Cao Manh Dat (JIRA)

[ https://issues.apache.org/jira/browse/SOLR-9941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15813726#comment-15813726 ]

Cao Manh Dat edited comment on SOLR-9941 at 1/10/17 9:11 AM:
-

Here are the reasons why Hoss's test fails:
- When the test adds updates, they are written to tlog-
- After the core is restarted, it replays the updates from tlog-
- At the same time, the test adds another DBQ; because the ulog is not active, 
DUP writes this DBQ to another file (tlog-0001, as a buffered update) and does 
not apply it to the IW.
{code:title=DistributedUpdateProcessor.java|borderStyle=solid}
// we're not in an active state, and this update isn't from a replay, so buffer it.
cmd.setFlags(cmd.getFlags() | UpdateCommand.BUFFERING);
ulog.add(cmd);
{code}
- After log replay finishes, we call a commit, which writes a commit record at 
the end of tlog-0001

So this DBQ update will never be replayed (because it belongs to tlog-0001, not 
tlog-), and it has never been applied to the IW, so that DBQ update is lost.

But this does not mean we have a problem in SolrCloud mode or single-instance 
mode.
In SolrCloud mode, when a replica runs recoverFromLog, the replica's state 
during this period is DOWN, so it won't receive any updates.

In single-instance mode, we run the leader logic, so the DBQ is applied to the 
IW immediately (along with updating {{ulog.deleteByQueries}}).

But in TestRecovery, we kind of mix the single-instance code path with the 
SolrCloud code path, so we end up with the failure.


was (Author: caomanhdat):
Here are the reasons why Hoss's test fails:
- When the test adds updates, they are written to tlog-
- After the core is restarted, it replays the updates from tlog-
- At the same time, the test adds another DBQ; because the ulog is in buffering 
mode, DUP writes this DBQ to another file (tlog-0001) and does not apply it to 
the IW.
- After log replay finishes, we call a commit, which writes a commit record at 
the end of tlog-0001

So this DBQ update will never be replayed (because it belongs to tlog-0001, not 
tlog-), and it has never been applied to the IW, so that DBQ update is lost.
Even if we restart the core, hoping that it will replay tlog-0001, tlog-0001 
will never be replayed, because we wrote a commit at the end of it.

So I think this failure belongs to another issue.


[jira] [Comment Edited] (SOLR-9941) log replay redundently (pre-)applies DBQs as if they were out of order

2017-01-09 Thread Ishan Chattopadhyaya (JIRA)

[ https://issues.apache.org/jira/browse/SOLR-9941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15813195#comment-15813195 ]

Ishan Chattopadhyaya edited comment on SOLR-9941 at 1/10/17 2:37 AM:
-

{quote}
One question i have is: if the only code paths that call 
recoverFromLog(boolean) are "startup" paths that pass true why do we need the 
optional argument? why not just refactor the method to always use the new logic?
{quote}
My thought there was that if someone later wanted to reuse the recoverFromLog() 
method during a live node, and not during the startup of a node/core (for 
whatever reason that I cannot foresee right now), they would not end up 
clearing their deletes lists in the process. Though, given the method's current 
use, I am also open to eliminating that extra parameter if you suggest it.
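
For illustration, a minimal sketch of the shape being discussed, assuming a 
boolean {{startup}} flag (field names follow UpdateLog; {{doLogReplay()}} is a 
hypothetical name for the pre-existing replay path, not necessarily the exact 
code in the patch):
{code:title=recoverFromLog(boolean) sketch|borderStyle=solid}
public Future<RecoveryInfo> recoverFromLog(boolean startup) {
  if (startup) {
    // on startup every tlog entry is about to be replayed in order, so the
    // pre-populated delete lists would only trigger redundant "reordered
    // DBQ" pre-applies
    deleteByQueries.clear();
    oldDeletes.clear();
  }
  return doLogReplay();
}
{code}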


was (Author: ichattopadhyaya):
{quote}
One question i have is: if the only code paths that call 
recoverFromLog(boolean) are "startup" paths that pass true why do we need the 
optional argument? why not just refactor the method to always use the new logic?
{quote}
My thought there was that if someone later wanted to reuse the doLogReplay() 
method during a live node, and not during the startup of a node/core (for 
whatever reason that I cannot foresee right now), they would not end up 
clearing their deletes lists in the process. Though, given the method's current 
use, I am also open to eliminating that extra parameter if you suggest it.




[jira] [Comment Edited] (SOLR-9941) log replay redundently (pre-)applies DBQs as if they were out of order

2017-01-09 Thread Ishan Chattopadhyaya (JIRA)

[ https://issues.apache.org/jira/browse/SOLR-9941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15813309#comment-15813309 ]

Ishan Chattopadhyaya edited comment on SOLR-9941 at 1/10/17 12:14 AM:
--

{quote}
Seems to me that updates arriving during the log replay (in TestRecovery) are 
being silently dropped.
{quote}
One possible explanation came to mind: during log replay, the state of the core 
is REPLAYING. Hence, perhaps, incoming updates are not applied (and are 
dropped) until the recovery has finished and the state is back to ACTIVE?
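
If that explanation holds, the behavior would look roughly like this (hedged 
sketch; {{UpdateLog.State}} and the update flags are real, but the dispatch 
below is condensed and illustrative):
{code:title=State-dependent handling (sketch)|borderStyle=solid}
// While the ulog is not ACTIVE (e.g. REPLAYING), an incoming update is
// only recorded in the new tlog, not applied to the index.
if (ulog.getState() != UpdateLog.State.ACTIVE
    && (cmd.getFlags() & UpdateCommand.REPLAY) == 0) {
  cmd.setFlags(cmd.getFlags() | UpdateCommand.BUFFERING);
  ulog.add(cmd);          // invisible to searches until replayed
} else {
  super.processAdd(cmd);  // normal path: applied to the IndexWriter
}
{code}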


was (Author: ichattopadhyaya):
{quote}
Seems to me that updates arriving during the log replay (in TestRecovery) are 
being silently dropped.
{quote}
One possible explanation came to mind: during log replay, the state of the core 
is REPLAYING. Hence, perhaps, incoming updates are not applied until the 
recovery has finished and the state is back to ACTIVE?




[jira] [Comment Edited] (SOLR-9941) log replay redundently (pre-)applies DBQs as if they were out of order

2017-01-07 Thread Ishan Chattopadhyaya (JIRA)

[ https://issues.apache.org/jira/browse/SOLR-9941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15807338#comment-15807338 ]

Ishan Chattopadhyaya edited comment on SOLR-9941 at 1/7/17 11:35 AM:
-

Attaching a fix that clears out the DBQs and oldDeletes lists before log replay 
on startup. [~hossman], [~markrmil...@gmail.com], [~yo...@apache.org], could 
you please review / suggest alternate fixes?


was (Author: ichattopadhyaya):
Attaching a fix that clears out the DBQs and oldDeletes lists before log replay 
on startup. [~hossman], [~markrmil...@gmail.com], [~yo...@apache.org], could 
you please review?




[jira] [Comment Edited] (SOLR-9941) log replay redundently (pre-)applies DBQs as if they were out of order

2017-01-07 Thread Ishan Chattopadhyaya (JIRA)

[ https://issues.apache.org/jira/browse/SOLR-9941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15807087#comment-15807087 ]

Ishan Chattopadhyaya edited comment on SOLR-9941 at 1/7/17 8:21 AM:


Seems like the fix I posted in my previous patch won't work: it would also 
preclude us from processing genuinely re-ordered DBQs, and hence leave out some 
documents that should have been deleted. Added a test for this (which should be 
committed anyway, I think).
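
For context, a rough sketch of the genuinely re-ordered DBQ case such a test 
needs to cover (TestRecovery-style; the versions, params, and {{FROM_LEADER}} 
constant are illustrative assumptions, not the committed test):
{code:title=Genuinely reordered DBQ (sketch)|borderStyle=solid}
// A DBQ versioned AFTER an add arrives BEFORE that add (reordered
// delivery from the leader); the late add must not resurrect the doc.
updateJ(jsonAdd(sdoc("id", "R1", "_version_", "100")),
    params(DISTRIB_UPDATE_PARAM, FROM_LEADER));
updateJ(jsonDelQ("id:R1"),
    params(DISTRIB_UPDATE_PARAM, FROM_LEADER, "_version_", "-200"));
// reordered: this add was versioned before the DBQ but arrives after it
updateJ(jsonAdd(sdoc("id", "R1", "_version_", "150")),
    params(DISTRIB_UPDATE_PARAM, FROM_LEADER));
assertU(commit());
assertJQ(req("q", "id:R1"), "/response/numFound==0");  // still deleted
{code}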


was (Author: ichattopadhyaya):
Seems like the fix I posted below won't work: it would also preclude us from 
processing genuinely re-ordered DBQs, and hence leave out some documents that 
should have been deleted. Added a test for this (which should be committed 
anyway, I think).

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org