Sounds like you need to raise your ZooKeeper connection timeout. 

Also, make sure you are using a concurrent garbage collector as a side note - 
stop the world pauses should be avoided. Just good advice :)

- Mark

On Dec 18, 2013, at 5:48 AM, Anca Kopetz <anca.kop...@kelkoo.com> wrote:

> Hi,
> 
> In our SolrCloud cluster (2 shards, 8 replicas), the replicas go from time to 
> time into recovering state, and it takes more than 10 minutes to finish to 
> recover.
> 
> In logs, we see that "PeerSync Recovery" fails with the message :
> 
> PeerSync: core=fr_green url=http://solr-08/searchsolrnodefr too many updates 
> received since start - startingUpdates no longer overlaps with our 
> currentUpdates
> 
> Then "Replication Recovery" starts. 
> 
> Is there something we can do to avoid the failure of "Peer Recovery" so that 
> the recovery process is more rapid (less than 10 minutes) ?
> 
> The full trace log is here : 
> 
> 2013-12-05 13:51:53,740 [http-8080-46] INFO  
> org.apache.solr.handler.admin.CoreAdminHandler:handleRequestRecoveryAction:705
>   - It has been requested that we recover
> 2013-12-05 13:51:53,740 [http-8080-112] INFO  
> org.apache.solr.handler.admin.CoreAdminHandler:handleRequestRecoveryAction:705
>   - It has been requested that we recover
> 2013-12-05 13:51:53,740 [http-8080-112] INFO  
> org.apache.solr.servlet.SolrDispatchFilter:handleAdminRequest:658  - [admin] 
> webapp=null path=/admin/cores 
> params={action=REQUESTRECOVERY&core=fr_green&wt=javabin&version=2} status=0 
> QTime=0
> 2013-12-05 13:51:53,740 [Thread-1544] INFO  
> org.apache.solr.cloud.ZkController:publish:1017  - publishing core=fr_green 
> state=recovering
> 2013-12-05 13:51:53,741 [http-8080-46] INFO  
> org.apache.solr.servlet.SolrDispatchFilter:handleAdminRequest:658  - [admin] 
> webapp=null path=/admin/cores 
> params={action=REQUESTRECOVERY&core=fr_green&wt=javabin&version=2} status=0 
> QTime=1
> 2013-12-05 13:51:53,740 [Thread-1543] INFO  
> org.apache.solr.cloud.ZkController:publish:1017  - publishing core=fr_green 
> state=recovering
> 2013-12-05 13:51:53,743 [Thread-1544] INFO  
> org.apache.solr.cloud.ZkController:publish:1021  - numShards not found on 
> descriptor - reading it from system property
> 2013-12-05 13:51:53,746 [Thread-1543] INFO  
> org.apache.solr.cloud.ZkController:publish:1021  - numShards not found on 
> descriptor - reading it from system property
> 2013-12-05 13:51:53,755 [Thread-1543] WARN  
> org.apache.solr.cloud.RecoveryStrategy:close:105  - Stopping recovery for 
> zkNodeName=solr-08_searchsolrnodefr_fr_greencore=fr_green
> 2013-12-05 13:51:53,756 [RecoveryThread] INFO  
> org.apache.solr.cloud.RecoveryStrategy:run:216  - Starting recovery process.  
> core=fr_green recoveringAfterStartup=false
> 2013-12-05 13:51:53,762 [RecoveryThread] INFO  
> org.apache.solr.cloud.RecoveryStrategy:doRecovery:495  - Finished recovery 
> process. core=fr_green
> 2013-12-05 13:51:53,762 [RecoveryThread] INFO  
> org.apache.solr.cloud.RecoveryStrategy:run:216  - Starting recovery process.  
> core=fr_green recoveringAfterStartup=false
> 2013-12-05 13:51:53,765 [RecoveryThread] INFO  
> org.apache.solr.cloud.ZkController:publish:1017  - publishing core=fr_green 
> state=recovering
> 2013-12-05 13:51:53,765 [RecoveryThread] INFO  
> org.apache.solr.cloud.ZkController:publish:1021  - numShards not found on 
> descriptor - reading it from system property
> 2013-12-05 13:51:53,767 [RecoveryThread] INFO  
> org.apache.solr.client.solrj.impl.HttpClientUtil:createClient:103  - Creating 
> new http client, 
> config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
> 2013-12-05 13:51:54,777 [main-EventThread] INFO  
> org.apache.solr.common.cloud.ZkStateReader:process:210  - A cluster state 
> change: WatchedEvent state:SyncConnected type:NodeDataChanged 
> path:/clusterstate.json, has occurred - updating... (live nodes size: 18)
> 2013-12-05 13:51:56,804 [RecoveryThread] INFO  
> org.apache.solr.cloud.RecoveryStrategy:doRecovery:356  - Attempting to 
> PeerSync from http://solr-02/searchsolrnodefr/fr_green/ core=fr_green - 
> recoveringAfterStartup=false
> 2013-12-05 13:51:56,806 [RecoveryThread] WARN  
> org.apache.solr.update.PeerSync:sync:232  - PeerSync: core=fr_green 
> url=http://solr-08/searchsolrnodefr too many updates received since start - 
> startingUpdates no longer overlaps with our currentUpdates
> 2013-12-05 13:51:56,806 [RecoveryThread] INFO  
> org.apache.solr.cloud.RecoveryStrategy:doRecovery:394  - PeerSync Recovery 
> was not successful - trying replication. core=fr_green
> 2013-12-05 13:51:56,806 [RecoveryThread] INFO  
> org.apache.solr.cloud.RecoveryStrategy:doRecovery:397  - Starting Replication 
> Recovery. core=fr_green
> 2013-12-05 13:51:56,806 [RecoveryThread] INFO  
> org.apache.solr.cloud.RecoveryStrategy:doRecovery:399  - Begin buffering 
> updates. core=fr_green
> 2013-12-05 13:51:56,806 [RecoveryThread] INFO  
> org.apache.solr.cloud.RecoveryStrategy:replicate:127  - Attempting to 
> replicate from http://solr-02/searchsolrnodefr/fr_green/. core=fr_green
> 2013-12-05 13:51:56,806 [RecoveryThread] INFO  
> org.apache.solr.client.solrj.impl.HttpClientUtil:createClient:103  - Creating 
> new http client, 
> config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
> 2013-12-05 13:52:01,203 [RecoveryThread] INFO  
> org.apache.solr.handler.SnapPuller:<init>:211  -  No value set for 
> 'pollInterval'. Timer Task not started.
> 2013-12-05 13:52:01,209 [RecoveryThread] INFO  
> org.apache.solr.handler.SnapPuller:fetchLatestIndex:369  - Master's 
> generation: 4382
> 2013-12-05 13:52:01,209 [RecoveryThread] INFO  
> org.apache.solr.handler.SnapPuller:fetchLatestIndex:370  - Slave's 
> generation: 4382
> 2013-12-05 13:52:01,209 [RecoveryThread] INFO  
> org.apache.solr.handler.SnapPuller:fetchLatestIndex:371  - Starting 
> replication process
> 2013-12-05 13:52:01,213 [RecoveryThread] INFO  
> org.apache.solr.handler.SnapPuller:fetchLatestIndex:376  - Number of files in 
> latest index in master: 158
> 2013-12-05 13:52:01,213 [RecoveryThread] INFO  
> org.apache.solr.core.CachingDirectoryFactory:get:360  - return new directory 
> for 
> /opt/kookel/data/searchSolrNode/solrindex/fr1_green/index.20131205135201213
> 2013-12-05 13:52:01,213 [RecoveryThread] INFO  
> org.apache.solr.handler.SnapPuller:fetchLatestIndex:402  - Starting download 
> to 
> org.apache.lucene.store.MMapDirectory@/opt/kookel/data/searchSolrNode/solrindex/fr1_green/index.20131205135201213
>  lockFactory=org.apache.lucene.store.NativeFSLockFactory@697d36b7 
> fullCopy=true
> 2013-12-05 13:52:21,970 [main-EventThread] INFO  
> org.apache.solr.common.cloud.ZkStateReader:process:210  - A cluster state 
> change: WatchedEvent state:SyncConnected type:NodeDataChanged 
> path:/clusterstate.json, has occurred - updating... (live nodes size: 18)
> 2013-12-05 13:55:15,039 [main-EventThread] INFO  
> org.apache.solr.common.cloud.ZkStateReader:process:210  - A cluster state 
> change: WatchedEvent state:SyncConnected type:NodeDataChanged 
> path:/clusterstate.json, has occurred - updating... (live nodes size: 18)
> 2013-12-05 13:56:27,160 [RecoveryThread] INFO  
> org.apache.solr.handler.SnapPuller:fetchLatestIndex:406  - Total time taken 
> for download : 265 secs
> 2013-12-05 13:56:27,540 [RecoveryThread] INFO  
> org.apache.solr.handler.SnapPuller:modifyIndexProps:886  - New index 
> installed. Updating index properties... index=index.20131205135201213
> 2013-12-05 13:56:27,546 [RecoveryThread] INFO  
> org.apache.solr.handler.SnapPuller:fetchLatestIndex:461  - removing old index 
> directory 
> org.apache.lucene.store.MMapDirectory@/opt/kookel/data/searchSolrNode/solrindex/fr1_green/index.20131201033712724
>  lockFactory=org.apache.lucene.store.NativeFSLockFactory@53949838
> 2013-12-05 13:56:27,765 [RecoveryThread] INFO  
> org.apache.solr.search.SolrIndexSearcher:<init>:197  - Opening 
> Searcher@30a544f3 main
> 2013-12-05 13:58:52,337 [RecoveryThread] INFO  
> org.apache.solr.core.CachingDirectoryFactory:closeCacheValue:235  - looking 
> to close 
> /opt/kookel/data/searchSolrNode/solrindex/fr1_green/index.20131201033712724[CachedDir<<refCount=0;path=/opt/kookel/data/searchSolrNode/solrindex/fr1_green/index.20131201033712724;done=true>>]
> 2013-12-05 13:58:52,337 [RecoveryThread] INFO  
> org.apache.solr.core.CachingDirectoryFactory:close:304  - Closing directory: 
> /opt/kookel/data/searchSolrNode/solrindex/fr1_green/index.20131201033712724
> 2013-12-05 13:58:52,337 [RecoveryThread] INFO  
> org.apache.solr.core.CachingDirectoryFactory:closeCacheValue:279  - Removing 
> directory before core close: 
> /opt/kookel/data/searchSolrNode/solrindex/fr1_green/index.20131201033712724
> 2013-12-05 13:58:54,172 [RecoveryThread] INFO  
> org.apache.solr.cloud.RecoveryStrategy:replay:506  - Replaying buffered 
> documents. core=fr_green
> 2013-12-05 13:58:54,172 [recoveryExecutor-7-thread-2] WARN  
> org.apache.solr.update.UpdateLog:doReplay:1240  - Starting log replay 
> tlog{file=/opt/kookel/data/searchSolrNode/solrindex/fr1_green/tlog/tlog.0000000000000004091
>  refcount=2} active=true starting pos=1103833347
> 2013-12-05 13:59:16,660 [recoveryExecutor-7-thread-2] INFO  
> org.apache.solr.search.SolrIndexSearcher:<init>:197  - Opening 
> Searcher@5e48fb12 main
> 2013-12-05 14:01:11,908 [main-EventThread] INFO  
> org.apache.solr.common.cloud.ZkStateReader:process:210  - A cluster state 
> change: WatchedEvent state:SyncConnected type:NodeDataChanged 
> path:/clusterstate.json, has occurred - updating... (live nodes size: 18)
> 2013-12-05 14:01:22,846 [recoveryExecutor-7-thread-2] WARN  
> org.apache.solr.update.UpdateLog:run:1230  - Log replay finished. 
> recoveryInfo=RecoveryInfo{adds=39345 deletes=0 deleteByQuery=0 errors=0 
> positionOfStart=1103833347}
> 2013-12-05 14:01:22,846 [RecoveryThread] INFO  
> org.apache.solr.cloud.RecoveryStrategy:doRecovery:410  - Replication Recovery 
> was successful - registering as Active. core=fr_green
> 2013-12-05 14:01:22,846 [RecoveryThread] INFO  
> org.apache.solr.cloud.ZkController:publish:1017  - publishing core=fr_green 
> state=active
> 2013-12-05 14:01:22,850 [RecoveryThread] INFO  
> org.apache.solr.cloud.ZkController:publish:1021  - numShards not found on 
> descriptor - reading it from system property
> 2013-12-05 14:01:22,860 [RecoveryThread] INFO  
> org.apache.solr.cloud.RecoveryStrategy:doRecovery:495  - Finished recovery 
> process. core=fr_green
> 
> Best regards,
> Anca Kopetz
> -- 
> <generic-logo-no-strap.gif>   
> Anca Kopetz Software engineer
> 
> E anca.kop...@kelkoo.com  Y!Messenger kelkooancak
> T +33 (0)4 56 09 07 55        
> A 4/6 Rue des Meridiens 38130 Echirolles
> <x-bottom-left.gif>   <x-bottom-middle.gif>
> Kelkoo SAS
> Société par Actions Simplifiée
> Au capital de € 4.168.964,30
> Siège social : 8, rue du Sentier 75002 Paris
> 425 093 069 RCS Paris
> 
> Ce message et les pièces jointes sont confidentiels et établis à l'attention 
> exclusive de leurs destinataires. Si vous n'êtes pas le destinataire de ce 
> message, merci de le détruire et d'en avertir l'expéditeur.

Reply via email to