[ 
https://issues.apache.org/jira/browse/SOLR-4038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated SOLR-4038:
--------------------------------

    Description: 
See: 
http://lucene.472066.n3.nabble.com/SolrCloud-indexing-blocks-if-node-is-recovering-td4017827.html

While indexing (without CloudSolrServer at that time) one node dies with an 
OOME perhaps  because of the linked issue SOLR-4032. The OOME stack traces are 
varied but here are some ZK-related logs between the OOME stack traces:

{code}
2012-11-02 14:14:37,126 INFO [solr.update.UpdateLog] - [RecoveryThread] - : 
Dropping buffered updates FSUpdateLog{state=BUFFERING, tlog=null}
2012-11-02 14:14:37,127 ERROR [solr.cloud.RecoveryStrategy] - [RecoveryThread] 
- : Recovery failed - trying again... (2) core=shard_e
2012-11-02 14:14:37,127 INFO [solr.cloud.RecoveryStrategy] - [RecoveryThread] - 
: Wait 8.0 seconds before trying to recover again (3)
2012-11-02 14:14:45,328 INFO [solr.cloud.ZkController] - [RecoveryThread] - : 
numShards not found on descriptor - reading it from system property
2012-11-02 14:14:45,363 INFO [solr.cloud.RecoveryStrategy] - [RecoveryThread] - 
: Starting Replication Recovery. core=shard_e
2012-11-02 14:14:45,363 INFO [solrj.impl.HttpClientUtil] - [RecoveryThread] - : 
Creating new http client, 
config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
2012-11-02 14:14:45,775 INFO [common.cloud.ZkStateReader] - [main-EventThread] 
- : A cluster state change has occurred - updating... (10)
2012-11-02 14:14:50,987 INFO [solr.cloud.RecoveryStrategy] - [RecoveryThread] - 
: Begin buffering updates. core=shard_e
2012-11-02 14:14:50,987 INFO [solr.update.UpdateLog] - [RecoveryThread] - : 
Starting to buffer updates. FSUpdateLog{state=ACTIVE, tlog=null}
2012-11-02 14:14:50,987 INFO [solr.cloud.RecoveryStrategy] - [RecoveryThread] - 
: Attempting to replicate from http://rot05.solrserver:8080/solr/shard_e/. 
core=shard_e
2012-11-02 14:14:50,987 INFO [solrj.impl.HttpClientUtil] - [RecoveryThread] - : 
Creating new http client, 
config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
2012-11-02 14:15:03,303 INFO [solr.core.CachingDirectoryFactory] - 
[RecoveryThread] - : Releasing directory:/opt/solr/cores/shard_f/data/index
2012-11-02 14:15:03,303 INFO [solr.handler.SnapPuller] - [RecoveryThread] - : 
removing temporary index download directory files 
NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@/opt/solr/cores/shard_f/data/index.20121102141424591
 lockFactory=org.apache.lucene.store.SimpleFSLockFactory@1520a48c; 
maxCacheMB=48.0 maxMergeSizeMB=4.0)
2012-11-02 14:15:09,421 INFO [apache.zookeeper.ClientCnxn] - 
[main-SendThread(rot1.zkserver:2181)] - : Client session timed out, have not 
heard from server in 11873ms for sessionid 0x13abc504486000f, closing socket 
connection and attempting reconnect
2012-11-02 14:15:09,422 ERROR [solr.core.SolrCore] - [http-8080-exec-1] - : 
org.apache.solr.common.SolrException: Ping query caused exception: Java heap 
space
.....
.....
2012-11-02 14:15:09,867 INFO [common.cloud.ConnectionManager] - 
[main-EventThread] - : Watcher 
org.apache.solr.common.cloud.ConnectionManager@305e7020 
name:ZooKeeperConnection Watcher:rot1.zkserver:2181,rot2.zkserver:2181 got 
event WatchedEvent state:Disconnected type:None path:null path:null type:None
2012-11-02 14:15:09,867 INFO [common.cloud.ConnectionManager] - 
[main-EventThread] - : zkClient has disconnected
2012-11-02 14:15:09,869 ERROR [solr.cloud.RecoveryStrategy] - [RecoveryThread] 
- : Error while trying to recover:java.lang.OutOfMemoryError: Java heap space
.....
.....
2012-11-02 14:15:10,159 INFO [solr.update.UpdateLog] - [RecoveryThread] - : 
Dropping buffered updates FSUpdateLog{state=BUFFERING, tlog=null}
2012-11-02 14:15:10,159 ERROR [solr.cloud.RecoveryStrategy] - [RecoveryThread] 
- : Recovery failed - trying again... (3) core=shard_e
2012-11-02 14:15:10,159 INFO [solr.cloud.RecoveryStrategy] - [RecoveryThread] - 
: Wait 16.0 seconds before trying to recover again (4)
2012-11-02 14:15:09,878 INFO [solr.core.CachingDirectoryFactory] - 
[RecoveryThread] - : Releasing 
directory:/opt/solr/cores/shard_f/data/index.20121102141424591
2012-11-02 14:15:10,192 INFO [solr.core.CachingDirectoryFactory] - 
[RecoveryThread] - : Releasing directory:/opt/solr/cores/shard_f_f/data/index
2012-11-02 14:15:10,192 ERROR [solr.handler.ReplicationHandler] - 
[RecoveryThread] - : SnapPull failed :org.apache.solr.common.SolrException: 
Unable to download _773.tvf completely. Downloaded 246
415360!=562327645
.....
.....
{code} 

At this point indexing has already been blocked. Some nodes do not write 
anything to the logs and the two surrounding nodes are still busy doing some 
replication. Most nodes show a increased number of state changes:

{code}
2012-11-02 14:16:47,768 INFO [common.cloud.ZkStateReader] - [main-EventThread] 
- : A cluster state change has occurred - updating... (10)
{code}

It's about 10 minutes of doing little by trying replication by the surrounding 
nodes and mostly state changes for the other nodes before are documents indexed 
again. At that point i stopped the node.



  was:See: 
http://lucene.472066.n3.nabble.com/SolrCloud-indexing-blocks-if-node-is-recovering-td4017827.html

    
> SolrCloud indexing blocks if node is recovering
> -----------------------------------------------
>
>                 Key: SOLR-4038
>                 URL: https://issues.apache.org/jira/browse/SOLR-4038
>             Project: Solr
>          Issue Type: Bug
>    Affects Versions: 4.0
>            Reporter: Markus Jelsma
>             Fix For: 4.1, 5.0
>
>
> See: 
> http://lucene.472066.n3.nabble.com/SolrCloud-indexing-blocks-if-node-is-recovering-td4017827.html
> While indexing (without CloudSolrServer at that time) one node dies with an 
> OOME perhaps  because of the linked issue SOLR-4032. The OOME stack traces 
> are varied but here are some ZK-related logs between the OOME stack traces:
> {code}
> 2012-11-02 14:14:37,126 INFO [solr.update.UpdateLog] - [RecoveryThread] - : 
> Dropping buffered updates FSUpdateLog{state=BUFFERING, tlog=null}
> 2012-11-02 14:14:37,127 ERROR [solr.cloud.RecoveryStrategy] - 
> [RecoveryThread] - : Recovery failed - trying again... (2) core=shard_e
> 2012-11-02 14:14:37,127 INFO [solr.cloud.RecoveryStrategy] - [RecoveryThread] 
> - : Wait 8.0 seconds before trying to recover again (3)
> 2012-11-02 14:14:45,328 INFO [solr.cloud.ZkController] - [RecoveryThread] - : 
> numShards not found on descriptor - reading it from system property
> 2012-11-02 14:14:45,363 INFO [solr.cloud.RecoveryStrategy] - [RecoveryThread] 
> - : Starting Replication Recovery. core=shard_e
> 2012-11-02 14:14:45,363 INFO [solrj.impl.HttpClientUtil] - [RecoveryThread] - 
> : Creating new http client, 
> config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
> 2012-11-02 14:14:45,775 INFO [common.cloud.ZkStateReader] - 
> [main-EventThread] - : A cluster state change has occurred - updating... (10)
> 2012-11-02 14:14:50,987 INFO [solr.cloud.RecoveryStrategy] - [RecoveryThread] 
> - : Begin buffering updates. core=shard_e
> 2012-11-02 14:14:50,987 INFO [solr.update.UpdateLog] - [RecoveryThread] - : 
> Starting to buffer updates. FSUpdateLog{state=ACTIVE, tlog=null}
> 2012-11-02 14:14:50,987 INFO [solr.cloud.RecoveryStrategy] - [RecoveryThread] 
> - : Attempting to replicate from http://rot05.solrserver:8080/solr/shard_e/. 
> core=shard_e
> 2012-11-02 14:14:50,987 INFO [solrj.impl.HttpClientUtil] - [RecoveryThread] - 
> : Creating new http client, 
> config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
> 2012-11-02 14:15:03,303 INFO [solr.core.CachingDirectoryFactory] - 
> [RecoveryThread] - : Releasing directory:/opt/solr/cores/shard_f/data/index
> 2012-11-02 14:15:03,303 INFO [solr.handler.SnapPuller] - [RecoveryThread] - : 
> removing temporary index download directory files 
> NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@/opt/solr/cores/shard_f/data/index.20121102141424591
>  lockFactory=org.apache.lucene.store.SimpleFSLockFactory@1520a48c; 
> maxCacheMB=48.0 maxMergeSizeMB=4.0)
> 2012-11-02 14:15:09,421 INFO [apache.zookeeper.ClientCnxn] - 
> [main-SendThread(rot1.zkserver:2181)] - : Client session timed out, have not 
> heard from server in 11873ms for sessionid 0x13abc504486000f, closing socket 
> connection and attempting reconnect
> 2012-11-02 14:15:09,422 ERROR [solr.core.SolrCore] - [http-8080-exec-1] - : 
> org.apache.solr.common.SolrException: Ping query caused exception: Java heap 
> space
> .....
> .....
> 2012-11-02 14:15:09,867 INFO [common.cloud.ConnectionManager] - 
> [main-EventThread] - : Watcher 
> org.apache.solr.common.cloud.ConnectionManager@305e7020 
> name:ZooKeeperConnection Watcher:rot1.zkserver:2181,rot2.zkserver:2181 got 
> event WatchedEvent state:Disconnected type:None path:null path:null type:None
> 2012-11-02 14:15:09,867 INFO [common.cloud.ConnectionManager] - 
> [main-EventThread] - : zkClient has disconnected
> 2012-11-02 14:15:09,869 ERROR [solr.cloud.RecoveryStrategy] - 
> [RecoveryThread] - : Error while trying to 
> recover:java.lang.OutOfMemoryError: Java heap space
> .....
> .....
> 2012-11-02 14:15:10,159 INFO [solr.update.UpdateLog] - [RecoveryThread] - : 
> Dropping buffered updates FSUpdateLog{state=BUFFERING, tlog=null}
> 2012-11-02 14:15:10,159 ERROR [solr.cloud.RecoveryStrategy] - 
> [RecoveryThread] - : Recovery failed - trying again... (3) core=shard_e
> 2012-11-02 14:15:10,159 INFO [solr.cloud.RecoveryStrategy] - [RecoveryThread] 
> - : Wait 16.0 seconds before trying to recover again (4)
> 2012-11-02 14:15:09,878 INFO [solr.core.CachingDirectoryFactory] - 
> [RecoveryThread] - : Releasing 
> directory:/opt/solr/cores/shard_f/data/index.20121102141424591
> 2012-11-02 14:15:10,192 INFO [solr.core.CachingDirectoryFactory] - 
> [RecoveryThread] - : Releasing directory:/opt/solr/cores/shard_f_f/data/index
> 2012-11-02 14:15:10,192 ERROR [solr.handler.ReplicationHandler] - 
> [RecoveryThread] - : SnapPull failed :org.apache.solr.common.SolrException: 
> Unable to download _773.tvf completely. Downloaded 246
> 415360!=562327645
> .....
> .....
> {code} 
> At this point indexing has already been blocked. Some nodes do not write 
> anything to the logs and the two surrounding nodes are still busy doing some 
> replication. Most nodes show a increased number of state changes:
> {code}
> 2012-11-02 14:16:47,768 INFO [common.cloud.ZkStateReader] - 
> [main-EventThread] - : A cluster state change has occurred - updating... (10)
> {code}
> It's about 10 minutes of doing little by trying replication by the 
> surrounding nodes and mostly state changes for the other nodes before are 
> documents indexed again. At that point i stopped the node.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Reply via email to