Re: tlog replay
Erick,

Actually, autocommit is configured to 15 seconds and openSearcher is set to false, so neither 2 nor 3 happened. softCommit, however, is set to 10 minutes:

<autoCommit>
  <maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>

We are working on upgrading to 5.3, which will take a bit of time, and trying to get this under control until then.

On 10/8/15 5:28 PM, Erick Erickson wrote:
> [...]
Re: tlog replay
Right, so the scenario is:

1> somehow you didn't do a hard commit (openSearcher=true or false doesn't matter) for a really long time while indexing.
2> Solr abnormally terminated.
3> When Solr started back up it replayed the entire log.

How <1> happened is the mystery though. With a hard commit (autocommit) interval of 15 seconds that's weird. The message indicates something like that happened. In very recent Solr versions, the log will have progress messages printed that'll help see this is happening.

Best,
Erick

On Thu, Oct 8, 2015 at 12:23 PM, Rallavagu wrote:
> [...]
Re: tlog replay
As a follow up.

Eventually the tlog file disappeared (could not track the time it took to clear out completely). However, the following message was noticed in the follower's log:

5120638 [recoveryExecutor-14-thread-2] WARN org.apache.solr.update.UpdateLog – Starting log replay tlog

On 10/7/15 8:29 PM, Erick Erickson wrote:
> [...]
Re: tlog replay
The only way I can account for such a large file off the top of my head is if, for some reason, the Solr on the node somehow was failing to index documents and kept adding them to the log for a long time. But how that would happen without the node being in recovery mode I'm not sure. I mean the Solr instance would have to be healthy otherwise but just not able to index docs, which makes no sense.

The usual question here is whether there were any messages in the solr log file indicating problems while this built up.

tlogs will build up to very large sizes if there are very long hard commit intervals, but I don't see how that interval would be different on the leader and follower.

So color me puzzled.

Best,
Erick

On Wed, Oct 7, 2015 at 8:09 PM, Rallavagu wrote:
> [...]
Re: tlog replay
Thanks Erick.

Eventually, followers caught up, but the 14G tlog file still persists and they are healthy. Is there anything to look for? Will monitor and see how long it takes before it disappears.

Evaluating a move to Solr 5.3.

On 10/7/15 7:51 PM, Erick Erickson wrote:
> [...]
Re: tlog replay
Uhm, that's very weird. Updates are not applied from the tlog. Rather, the raw doc is forwarded to the replica, which both indexes the doc and writes it to the local tlog. So having a 14G tlog on a follower but a small tlog on the leader is definitely strange, especially if it persists over time.

I assume the follower is healthy? And does this very large tlog disappear after a while? I'd expect it to be aged out after a few commits of > 100 docs.

All that said, there have been a LOT of improvements since 4.6, so it might be something that's been addressed in the intervening time.

Best,
Erick

On Wed, Oct 7, 2015 at 7:39 PM, Rallavagu wrote:
> [...]
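Erick's "aged out after a few commits of > 100 docs" refers to the update log's retention rule: each hard commit closes the active tlog file and starts a new one, and old files are dropped once the newer files already cover roughly the last 100 updates. The following is a deliberately simplified model of that policy, not Solr's actual UpdateLog code; only the ~100-record threshold is taken from the thread:

```python
from collections import deque

RECORDS_TO_KEEP = 100  # Solr keeps enough closed tlogs to cover ~100 recent updates


class TlogSet:
    """Toy model of update-log retention across hard commits."""

    def __init__(self):
        self.current = 0    # records in the tlog currently being written
        self.old = deque()  # record counts of closed tlog files, oldest first

    def add(self, n=1):
        # Every update is appended to the active tlog.
        self.current += n

    def hard_commit(self):
        # A hard commit closes the active tlog and starts a new one...
        self.old.append(self.current)
        self.current = 0
        # ...then drops the oldest files while the newer ones alone
        # already hold enough records.
        while len(self.old) > 1 and sum(list(self.old)[1:]) >= RECORDS_TO_KEEP:
            self.old.popleft()

    def total_records(self):
        return self.current + sum(self.old)
```

With a 15-second autocommit each file stays small and old files age out after a few commits; with no hard commits at all, everything accumulates in a single ever-growing active tlog, which is how multi-gigabyte files arise.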
tlog replay
Solr 4.6.1, single shard, 4 node cloud, 3 node zk.

I'd like to understand the behavior better when a large number of updates happen on the leader and it generates a huge tlog (14G sometimes in my case) on the other nodes, while at the same time the leader's tlog is a few KB. So, what is the rate at which the changes from the transaction log are applied at the nodes? The autocommit interval is set to 15 seconds after going through https://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

Thanks
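The 15-second autocommit interval mentioned above would typically be configured in solrconfig.xml along these lines. This is a sketch following the general recommendations of the linked Lucidworks article, not the poster's actual config; the element names are standard Solr 4.x syntax:

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <updateLog>
    <str name="dir">${solr.ulog.dir:}</str>
  </updateLog>

  <!-- Hard commit: flushes to disk and rolls the tlog; does not open
       a new searcher, so it is cheap and safe to run frequently. -->
  <autoCommit>
    <maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>

  <!-- Soft commit: makes documents visible to search. -->
  <autoSoftCommit>
    <maxTime>${solr.autoSoftCommit.maxTime:600000}</maxTime>
  </autoSoftCommit>
</updateHandler>
```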
Re: Solr4.7: tlog replay has a major delay before start recovering transaction replay
> Before tlog replay, the replica will replicate any missing index files
> from the leader. I think that is what is causing the time between the
> two log messages. You have INFO logging turned off so there are no
> messages from the replication handler about it.

I did not monitor network throughput during that timeframe, and I thought the first log line already showed that the peersync failed, so I am trying to understand the time spent there. Also, in our solr.log, I did not see lines reporting Recovery retry(1), Recovery retry(2), Recovery give up, etc. before it tells us "tlog replay".

2015-09-21 9:07 GMT-04:00 Shalin Shekhar Mangar:
> [...]

--
Jeff Wu --- CSDL Beijing, China
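The delay in question can be read directly off the two log lines quoted in this thread: "Recovery failed" at 20:05:22.396 and "Starting log replay" at 20:22:50.343. A quick check of the gap (plain Python, nothing Solr-specific; timestamps are copied verbatim from the excerpt):

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"

# Timestamps taken verbatim from the log excerpt in this thread.
recovery_failed = datetime.strptime("2015-09-16 20:05:22.396", FMT)
replay_started = datetime.strptime("2015-09-16 20:22:50.343", FMT)
replay_finished = datetime.strptime("2015-09-16 20:22:53.301", FMT)

gap = replay_started - recovery_failed     # time spent before replay begins
replay = replay_finished - replay_started  # time spent actually replaying

print(f"wait before replay: {gap.total_seconds():.0f}s")    # ~17.5 minutes
print(f"actual replay:      {replay.total_seconds():.0f}s")  # ~3 seconds
```

Consistent with Shalin's explanation: roughly 17 minutes of (unlogged, at this log level) index file replication, followed by a replay that takes about three seconds.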
Re: Solr4.7: tlog replay has a major delay before start recovering transaction replay
Hi Jeff,

Comments inline:

On Mon, Sep 21, 2015 at 6:06 PM, Jeff Wu wrote:
> Our environment ran on Solr 4.7. Recently we hit a core recovery failure and
> then it retried to recover from the tlog.
>
> We noticed that after the 20:05:22 "Recovery failed" message, the Solr server
> waited a long time before it started tlog replay. During that time, we had
> about 32 cores doing such tlog replay. It took over 40 minutes to bring the
> whole service back.
>
> Some questions we want to know:
> 1. Is tlog replay a single-threaded activity? Can we configure it to use
> multiple threads, since in our deployment we have 64 cores on each Solr
> server?

Each core gets a separate recovery thread, but each individual log replay is single-threaded.

> 2. What might cause the tlog replay thread to wait for over 15 minutes
> before the actual tlog replay? The actual replay seems very quick.

Before tlog replay, the replica will replicate any missing index files from the leader. I think that is what is causing the time between the two log messages. You have INFO logging turned off so there are no messages from the replication handler about it.

> 3. The last message "Log replay finished" does not tell which core
> finished. Given 32 cores to recover, we cannot know which core the log is
> reporting on.

Yeah, many such issues were fixed in recent 5.x releases, where we use MDC to log collection, shard, core etc. for each message. Furthermore, tlog replay progress/status is also logged since 5.0.

> 4. We know 4.7 is pretty old; we'd like to know whether this is a known
> issue fixed in a later release, and any related JIRA?
>
> Line 4120: ERROR - 2015-09-16 20:05:22.396;
> org.apache.solr.cloud.RecoveryStrategy; Recovery failed - trying again...
> (0) core=collection3_shard11_replica2
> WARN - 2015-09-16 20:22:50.343;
> org.apache.solr.update.UpdateLog$LogReplayer; Starting log replay
> tlog{file=/mnt/solrdata1/solr/home/collection3_shard11_replica2/data/tlog/tlog.0120498
> refcount=2} active=true starting pos=25981
> WARN - 2015-09-16 20:22:53.301;
> org.apache.solr.update.UpdateLog$LogReplayer; Log replay finished.
> recoveryInfo=RecoveryInfo{adds=914 deletes=215 deleteByQuery=0 errors=0
> positionOfStart=25981}
>
> Thank you all~

--
Regards,
Shalin Shekhar Mangar.
Solr4.7: tlog replay has a major delay before start recovering transaction replay
Our environment ran on Solr 4.7. Recently we hit a core recovery failure and then it retried to recover from the tlog.

We noticed that after the 20:05:22 "Recovery failed" message, the Solr server waited a long time before it started tlog replay. During that time, we had about 32 cores doing such tlog replay. It took over 40 minutes to bring the whole service back.

Some questions we want to know:
1. Is tlog replay a single-threaded activity? Can we configure it to use multiple threads, since in our deployment we have 64 cores on each Solr server?
2. What might cause the tlog replay thread to wait for over 15 minutes before the actual tlog replay? The actual replay seems very quick.
3. The last message "Log replay finished" does not tell which core finished. Given 32 cores to recover, we cannot know which core the log is reporting on.
4. We know 4.7 is pretty old; we'd like to know whether this is a known issue fixed in a later release, and any related JIRA?

Line 4120: ERROR - 2015-09-16 20:05:22.396; org.apache.solr.cloud.RecoveryStrategy; Recovery failed - trying again... (0) core=collection3_shard11_replica2
WARN - 2015-09-16 20:22:50.343; org.apache.solr.update.UpdateLog$LogReplayer; Starting log replay tlog{file=/mnt/solrdata1/solr/home/collection3_shard11_replica2/data/tlog/tlog.0120498 refcount=2} active=true starting pos=25981
WARN - 2015-09-16 20:22:53.301; org.apache.solr.update.UpdateLog$LogReplayer; Log replay finished. recoveryInfo=RecoveryInfo{adds=914 deletes=215 deleteByQuery=0 errors=0 positionOfStart=25981}

Thank you all~
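Regarding point 3, until the MDC-based logging arrived in 5.x, one workaround is to scrape the recoveryInfo counters out of solr.log yourself. A sketch, assuming the log format shown in the excerpt above; the helper name is illustrative, not a Solr API:

```python
import re
from typing import Optional

# Matches the recoveryInfo summary printed by UpdateLog$LogReplayer,
# in the format shown in the log excerpt above.
RECOVERY_RE = re.compile(
    r"recoveryInfo=RecoveryInfo\{adds=(\d+) deletes=(\d+) "
    r"deleteByQuery=(\d+) errors=(\d+) positionOfStart=(\d+)\}"
)


def parse_recovery_info(line: str) -> Optional[dict]:
    """Return the replay counters from a 'Log replay finished' line, or None."""
    m = RECOVERY_RE.search(line)
    if not m:
        return None
    keys = ("adds", "deletes", "deleteByQuery", "errors", "positionOfStart")
    return dict(zip(keys, map(int, m.groups())))


line = ("WARN - 2015-09-16 20:22:53.301; "
        "org.apache.solr.update.UpdateLog$LogReplayer; Log replay finished. "
        "recoveryInfo=RecoveryInfo{adds=914 deletes=215 deleteByQuery=0 "
        "errors=0 positionOfStart=25981}")
print(parse_recovery_info(line))
```

Note that this still cannot attribute the line to a core; it only makes the per-replay counters machine-readable.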
Re: Tlog replay
On Wed, Jul 8, 2015 at 12:31 PM, Summer Shire wrote:
> Thanks Alessandro!
>
> Any idea on why I couldn't curl the solr core and pass the flag param?

These flags are for internal use only. Solr sets them, the client doesn't.

-Yonik
Re: Tlog replay
Thanks Alessandro!

Any idea on why I couldn't curl the solr core and pass the flag param?

> On Jul 8, 2015, at 7:12 AM, Alessandro Benedetti wrote:
> [...]
Re: Tlog replay
Hi Summer,

If you take a look at the CommitUpdateCommand class, you will notice no flag is in there.

// this is the toString, for example
@Override
public String toString() {
  return super.toString() + ",optimize=" + optimize
      + ",openSearcher=" + openSearcher
      + ",waitSearcher=" + waitSearcher
      + ",expungeDeletes=" + expungeDeletes
      + ",softCommit=" + softCommit
      + ",prepareCommit=" + prepareCommit
      + '}';
}

If you then look at the UpdateCommand class, you find the flags:

public static int BUFFERING = 0x0001;         // update command is being buffered.
public static int REPLAY = 0x0002;            // update command is from replaying a log.
public static int PEER_SYNC = 0x0004;         // update command is a missing update being provided by a peer.
public static int IGNORE_AUTOCOMMIT = 0x0008; // this update should not count toward triggering of autocommits.
public static int CLEAR_CACHES = 0x0010;      // clear caches associated with the update log. used when applying reordered DBQ updates when doing an add.

So flags=2 is actually saying that the update command is from replaying a log (which is what you would expect).

Cheers

2015-07-08 3:01 GMT+01:00 Summer Shire:
> [...]

--
Benedetti Alessandro
Visiting card: http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience - 1794 England
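Since flags is a plain bitmask, any logged value can be decoded mechanically against those constants. A small illustrative decoder, not part of Solr itself, using the same bit values as the Java constants quoted above:

```python
# Bit values copied from Solr's UpdateCommand constants quoted above.
FLAG_NAMES = {
    0x0001: "BUFFERING",
    0x0002: "REPLAY",
    0x0004: "PEER_SYNC",
    0x0008: "IGNORE_AUTOCOMMIT",
    0x0010: "CLEAR_CACHES",
}


def decode_flags(flags: int) -> list:
    """Return the names of all bits set in an UpdateCommand flags value."""
    return [name for bit, name in sorted(FLAG_NAMES.items()) if flags & bit]


print(decode_flags(2))  # the commit logged during tlog replay -> ['REPLAY']
```

As Yonik notes elsewhere in the thread, these flags are internal: Solr sets them when it builds the commit, which is why they cannot be passed in from a client.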
Tlog replay
Hi,

When I restart my solr core the log replay starts, and just before it finishes I see the following commit:

start commit{flags=2,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}

What does the "flags=2" param do?

When I try to send that param to the updateHandler manually, solr does not like it:

curl http://localhost:6600/solr/main/update -H "Content-Type: text/xml" --data-binary '<commit flags="2" waitSearcher="false"/>'

<response>
  <lst name="responseHeader"><int name="status">400</int><int name="QTime">0</int></lst>
  <lst name="error"><str name="msg">Unknown commit parameter 'flags'</str><int name="code">400</int></lst>
</response>

thanks,
Summer