Re: PeerSync Recovery fails, starting Replication Recovery

Anca Kopetz Fri, 20 Dec 2013 02:33:39 -0800

Hi,

We do not use a NRT solution. We do a hardCommit (autocommit with
openSearcher=false) every 15 minutes, and updates with commitWithin
every 30 minutes. The updates are done one by one by many feeders, we do
not send batches of documents.


Solr version : 4.5.1

Yes, the problem is before the recovery, as I explained in the reply to
Mark's mail.

Best regards,
Anca Kopetz

On 12/19/2013 06:39 PM, Daniel Collins wrote:

Are you using a NRT solution, how often do you commit?  We see similar
issues with PeerSync, but then we have a very active NRT system and we
soft-commit sub-second, so since PeerSync has a limit of 100 versions
before it decides its too much to do, if we try and PeerSync whilst
indexing is running, we inevitably have to fallback to a full-sync as this
does.

What Solr version are you using?  There were issues with early 4.X (up to
4.3 wasn't it, I can't find the ticket now?) whhereby if PeerSync failed,
it did a fullCopy when what it should do is an incremental copy (i.e. only
the segments that are missing) which again might hurt you.

But surely your real problem isn't that the recovery taking a long time,
your problem is why did the system enter recovery in the first place (which
is the bit *just before* the trace you gave us!)
The first line is "It has been requested that we recover" (aren't Solr
trace message polite), which I know from recent experience means either the
leader thinks this replica is out of date or you just had a leadership
transfer.  As mark says, the root cause of that is probably a ZK timeout
issue



On 19 December 2013 15:50, Mark Miller<markrmil...@gmail.com>  wrote:

Sounds like you need to raise your ZooKeeper connection timeout.

Also, make sure you are using a concurrent garbage collector as a side
note - stop the world pauses should be avoided. Just good advice :)

- Mark

On Dec 18, 2013, at 5:48 AM, Anca Kopetz<anca.kop...@kelkoo.com>  wrote:

Hi,

In our SolrCloud cluster (2 shards, 8 replicas), the replicas go from

time to time into recovering state, and it takes more than 10 minutes to
finish to recover.

In logs, we see that "PeerSync Recovery" fails with the message :

PeerSync: core=fr_green url=http://solr-08/searchsolrnodefr too many

updates received since start - startingUpdates no longer overlaps with our
currentUpdates

Then "Replication Recovery" starts.

Is there something we can do to avoid the failure of "Peer Recovery" so

that the recovery process is more rapid (less than 10 minutes) ?

The full trace log is here :

2013-12-05 13:51:53,740 [http-8080-46] INFO

  org.apache.solr.handler.admin.CoreAdminHandler:handleRequestRecoveryAction:705
  - It has been requested that we recover

2013-12-05 13:51:53,740 [http-8080-112] INFO

  org.apache.solr.handler.admin.CoreAdminHandler:handleRequestRecoveryAction:705
  - It has been requested that we recover

2013-12-05 13:51:53,740 [http-8080-112] INFO

  org.apache.solr.servlet.SolrDispatchFilter:handleAdminRequest:658  -
[admin] webapp=null path=/admin/cores
params={action=REQUESTRECOVERY&core=fr_green&wt=javabin&version=2} status=0
QTime=0

2013-12-05 13:51:53,740 [Thread-1544] INFO

  org.apache.solr.cloud.ZkController:publish:1017  - publishing
core=fr_green state=recovering

2013-12-05 13:51:53,741 [http-8080-46] INFO

  org.apache.solr.servlet.SolrDispatchFilter:handleAdminRequest:658  -
[admin] webapp=null path=/admin/cores
params={action=REQUESTRECOVERY&core=fr_green&wt=javabin&version=2} status=0
QTime=1

2013-12-05 13:51:53,740 [Thread-1543] INFO

  org.apache.solr.cloud.ZkController:publish:1017  - publishing
core=fr_green state=recovering

2013-12-05 13:51:53,743 [Thread-1544] INFO

  org.apache.solr.cloud.ZkController:publish:1021  - numShards not found on
descriptor - reading it from system property

2013-12-05 13:51:53,746 [Thread-1543] INFO

  org.apache.solr.cloud.ZkController:publish:1021  - numShards not found on
descriptor - reading it from system property

2013-12-05 13:51:53,755 [Thread-1543] WARN

  org.apache.solr.cloud.RecoveryStrategy:close:105  - Stopping recovery for
zkNodeName=solr-08_searchsolrnodefr_fr_greencore=fr_green

2013-12-05 13:51:53,756 [RecoveryThread] INFO

  org.apache.solr.cloud.RecoveryStrategy:run:216  - Starting recovery
process.  core=fr_green recoveringAfterStartup=false

2013-12-05 13:51:53,762 [RecoveryThread] INFO

  org.apache.solr.cloud.RecoveryStrategy:doRecovery:495  - Finished recovery
process. core=fr_green

2013-12-05 13:51:53,762 [RecoveryThread] INFO

  org.apache.solr.cloud.RecoveryStrategy:run:216  - Starting recovery
process.  core=fr_green recoveringAfterStartup=false

2013-12-05 13:51:53,765 [RecoveryThread] INFO

  org.apache.solr.cloud.ZkController:publish:1017  - publishing
core=fr_green state=recovering

2013-12-05 13:51:53,765 [RecoveryThread] INFO

  org.apache.solr.cloud.ZkController:publish:1021  - numShards not found on
descriptor - reading it from system property

2013-12-05 13:51:53,767 [RecoveryThread] INFO

  org.apache.solr.client.solrj.impl.HttpClientUtil:createClient:103  -
Creating new http client,
config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false

2013-12-05 13:51:54,777 [main-EventThread] INFO

  org.apache.solr.common.cloud.ZkStateReader:process:210  - A cluster state
change: WatchedEvent state:SyncConnected type:NodeDataChanged
path:/clusterstate.json, has occurred - updating... (live nodes size: 18)

2013-12-05 13:51:56,804 [RecoveryThread] INFO

  org.apache.solr.cloud.RecoveryStrategy:doRecovery:356  - Attempting to
PeerSync from http://solr-02/searchsolrnodefr/fr_green/ core=fr_green -
recoveringAfterStartup=false

2013-12-05 13:51:56,806 [RecoveryThread] WARN

  org.apache.solr.update.PeerSync:sync:232  - PeerSync: core=fr_green url=
http://solr-08/searchsolrnodefr too many updates received since start -
startingUpdates no longer overlaps with our currentUpdates

2013-12-05 13:51:56,806 [RecoveryThread] INFO

  org.apache.solr.cloud.RecoveryStrategy:doRecovery:394  - PeerSync Recovery
was not successful - trying replication. core=fr_green

2013-12-05 13:51:56,806 [RecoveryThread] INFO

  org.apache.solr.cloud.RecoveryStrategy:doRecovery:397  - Starting
Replication Recovery. core=fr_green

2013-12-05 13:51:56,806 [RecoveryThread] INFO

  org.apache.solr.cloud.RecoveryStrategy:doRecovery:399  - Begin buffering
updates. core=fr_green

2013-12-05 13:51:56,806 [RecoveryThread] INFO

  org.apache.solr.cloud.RecoveryStrategy:replicate:127  - Attempting to
replicate from http://solr-02/searchsolrnodefr/fr_green/. core=fr_green

2013-12-05 13:51:56,806 [RecoveryThread] INFO

  org.apache.solr.client.solrj.impl.HttpClientUtil:createClient:103  -
Creating new http client,
config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false

2013-12-05 13:52:01,203 [RecoveryThread] INFO

  org.apache.solr.handler.SnapPuller:<init>:211  -  No value set for
'pollInterval'. Timer Task not started.

2013-12-05 13:52:01,209 [RecoveryThread] INFO

  org.apache.solr.handler.SnapPuller:fetchLatestIndex:369  - Master's
generation: 4382

2013-12-05 13:52:01,209 [RecoveryThread] INFO

  org.apache.solr.handler.SnapPuller:fetchLatestIndex:370  - Slave's
generation: 4382

2013-12-05 13:52:01,209 [RecoveryThread] INFO

  org.apache.solr.handler.SnapPuller:fetchLatestIndex:371  - Starting
replication process

2013-12-05 13:52:01,213 [RecoveryThread] INFO

  org.apache.solr.handler.SnapPuller:fetchLatestIndex:376  - Number of files
in latest index in master: 158

2013-12-05 13:52:01,213 [RecoveryThread] INFO

  org.apache.solr.core.CachingDirectoryFactory:get:360  - return new
directory for
/opt/kookel/data/searchSolrNode/solrindex/fr1_green/index.20131205135201213

2013-12-05 13:52:01,213 [RecoveryThread] INFO

  org.apache.solr.handler.SnapPuller:fetchLatestIndex:402  - Starting
download to 
org.apache.lucene.store.MMapDirectory@/opt/kookel/data/searchSolrNode/solrindex/fr1_green/index.20131205135201213
lockFactory=org.apache.lucene.store.NativeFSLockFactory@697d36b7fullCopy=true

2013-12-05 13:52:21,970 [main-EventThread] INFO

  org.apache.solr.common.cloud.ZkStateReader:process:210  - A cluster state
change: WatchedEvent state:SyncConnected type:NodeDataChanged
path:/clusterstate.json, has occurred - updating... (live nodes size: 18)

2013-12-05 13:55:15,039 [main-EventThread] INFO

  org.apache.solr.common.cloud.ZkStateReader:process:210  - A cluster state
change: WatchedEvent state:SyncConnected type:NodeDataChanged
path:/clusterstate.json, has occurred - updating... (live nodes size: 18)

2013-12-05 13:56:27,160 [RecoveryThread] INFO

  org.apache.solr.handler.SnapPuller:fetchLatestIndex:406  - Total time
taken for download : 265 secs

2013-12-05 13:56:27,540 [RecoveryThread] INFO

  org.apache.solr.handler.SnapPuller:modifyIndexProps:886  - New index
installed. Updating index properties... index=index.20131205135201213

2013-12-05 13:56:27,546 [RecoveryThread] INFO

  org.apache.solr.handler.SnapPuller:fetchLatestIndex:461  - removing old
index directory 
org.apache.lucene.store.MMapDirectory@/opt/kookel/data/searchSolrNode/solrindex/fr1_green/index.20131201033712724
lockFactory=org.apache.lucene.store.NativeFSLockFactory@53949838

2013-12-05 13:56:27,765 [RecoveryThread] INFO

  org.apache.solr.search.SolrIndexSearcher:<init>:197  - Opening
Searcher@30a544f3 main

2013-12-05 13:58:52,337 [RecoveryThread] INFO

  org.apache.solr.core.CachingDirectoryFactory:closeCacheValue:235  -
looking to close
/opt/kookel/data/searchSolrNode/solrindex/fr1_green/index.20131201033712724[CachedDir<<refCount=0;path=/opt/kookel/data/searchSolrNode/solrindex/fr1_green/index.20131201033712724;done=true>>]

2013-12-05 13:58:52,337 [RecoveryThread] INFO

  org.apache.solr.core.CachingDirectoryFactory:close:304  - Closing
directory:
/opt/kookel/data/searchSolrNode/solrindex/fr1_green/index.20131201033712724

2013-12-05 13:58:52,337 [RecoveryThread] INFO

  org.apache.solr.core.CachingDirectoryFactory:closeCacheValue:279  -
Removing directory before core close:
/opt/kookel/data/searchSolrNode/solrindex/fr1_green/index.20131201033712724

2013-12-05 13:58:54,172 [RecoveryThread] INFO

  org.apache.solr.cloud.RecoveryStrategy:replay:506  - Replaying buffered
documents. core=fr_green

2013-12-05 13:58:54,172 [recoveryExecutor-7-thread-2] WARN

  org.apache.solr.update.UpdateLog:doReplay:1240  - Starting log replay
tlog{file=/opt/kookel/data/searchSolrNode/solrindex/fr1_green/tlog/tlog.0000000000000004091
refcount=2} active=true starting pos=1103833347

2013-12-05 13:59:16,660 [recoveryExecutor-7-thread-2] INFO

  org.apache.solr.search.SolrIndexSearcher:<init>:197  - Opening
Searcher@5e48fb12 main

2013-12-05 14:01:11,908 [main-EventThread] INFO

  org.apache.solr.common.cloud.ZkStateReader:process:210  - A cluster state
change: WatchedEvent state:SyncConnected type:NodeDataChanged
path:/clusterstate.json, has occurred - updating... (live nodes size: 18)

2013-12-05 14:01:22,846 [recoveryExecutor-7-thread-2] WARN

  org.apache.solr.update.UpdateLog:run:1230  - Log replay finished.
recoveryInfo=RecoveryInfo{adds=39345 deletes=0 deleteByQuery=0 errors=0
positionOfStart=1103833347}

2013-12-05 14:01:22,846 [RecoveryThread] INFO

  org.apache.solr.cloud.RecoveryStrategy:doRecovery:410  - Replication
Recovery was successful - registering as Active. core=fr_green

2013-12-05 14:01:22,846 [RecoveryThread] INFO

  org.apache.solr.cloud.ZkController:publish:1017  - publishing
core=fr_green state=active

2013-12-05 14:01:22,850 [RecoveryThread] INFO

  org.apache.solr.cloud.ZkController:publish:1021  - numShards not found on
descriptor - reading it from system property

2013-12-05 14:01:22,860 [RecoveryThread] INFO

  org.apache.solr.cloud.RecoveryStrategy:doRecovery:495  - Finished recovery
process. core=fr_green

Best regards,
Anca Kopetz
--
<generic-logo-no-strap.gif>
Anca Kopetz Software engineer

E anca.kop...@kelkoo.com  Y!Messenger kelkooancak
T +33 (0)4 56 09 07 55
A 4/6 Rue des Meridiens 38130 Echirolles
<x-bottom-left.gif>    <x-bottom-middle.gif>
Kelkoo SAS
Société par Actions Simplifiée
Au capital de € 4.168.964,30
Siège social : 8, rue du Sentier 75002 Paris
425 093 069 RCS Paris

Ce message et les pièces jointes sont confidentiels et établis à

l'attention exclusive de leurs destinataires. Si vous n'êtes pas le
destinataire de ce message, merci de le détruire et d'en avertir
l'expéditeur.


Kelkoo SAS
Société par Actions Simplifiée
Au capital de € 4.168.964,30
Siège social : 8, rue du Sentier 75002 Paris
425 093 069 RCS Paris

Ce message et les pièces jointes sont confidentiels et établis à l'attention 
exclusive de leurs destinataires. Si vous n'êtes pas le destinataire de ce 
message, merci de le détruire et d'en avertir l'expéditeur.

Re: PeerSync Recovery fails, starting Replication Recovery

Reply via email to