[ https://issues.apache.org/jira/browse/SOLR-4038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma updated SOLR-4038: -------------------------------- Description: See: http://lucene.472066.n3.nabble.com/SolrCloud-indexing-blocks-if-node-is-recovering-td4017827.html While indexing (without CloudSolrServer at that time) one node dies with an OOME perhaps because of the linked issue SOLR-4032. The OOME stack traces are varied but here are some ZK-related logs between the OOME stack traces: {code} 2012-11-02 14:14:37,126 INFO [solr.update.UpdateLog] - [RecoveryThread] - : Dropping buffered updates FSUpdateLog{state=BUFFERING, tlog=null} 2012-11-02 14:14:37,127 ERROR [solr.cloud.RecoveryStrategy] - [RecoveryThread] - : Recovery failed - trying again... (2) core=shard_e 2012-11-02 14:14:37,127 INFO [solr.cloud.RecoveryStrategy] - [RecoveryThread] - : Wait 8.0 seconds before trying to recover again (3) 2012-11-02 14:14:45,328 INFO [solr.cloud.ZkController] - [RecoveryThread] - : numShards not found on descriptor - reading it from system property 2012-11-02 14:14:45,363 INFO [solr.cloud.RecoveryStrategy] - [RecoveryThread] - : Starting Replication Recovery. core=shard_e 2012-11-02 14:14:45,363 INFO [solrj.impl.HttpClientUtil] - [RecoveryThread] - : Creating new http client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false 2012-11-02 14:14:45,775 INFO [common.cloud.ZkStateReader] - [main-EventThread] - : A cluster state change has occurred - updating... (10) 2012-11-02 14:14:50,987 INFO [solr.cloud.RecoveryStrategy] - [RecoveryThread] - : Begin buffering updates. core=shard_e 2012-11-02 14:14:50,987 INFO [solr.update.UpdateLog] - [RecoveryThread] - : Starting to buffer updates. FSUpdateLog{state=ACTIVE, tlog=null} 2012-11-02 14:14:50,987 INFO [solr.cloud.RecoveryStrategy] - [RecoveryThread] - : Attempting to replicate from http://rot05.solrserver:8080/solr/shard_e/. core=shard_e 2012-11-02 14:14:50,987 INFO [solrj.impl.HttpClientUtil] - [RecoveryThread] - : Creating new http client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false 2012-11-02 14:15:03,303 INFO [solr.core.CachingDirectoryFactory] - [RecoveryThread] - : Releasing directory:/opt/solr/cores/shard_f/data/index 2012-11-02 14:15:03,303 INFO [solr.handler.SnapPuller] - [RecoveryThread] - : removing temporary index download directory files NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@/opt/solr/cores/shard_f/data/index.20121102141424591 lockFactory=org.apache.lucene.store.SimpleFSLockFactory@1520a48c; maxCacheMB=48.0 maxMergeSizeMB=4.0) 2012-11-02 14:15:09,421 INFO [apache.zookeeper.ClientCnxn] - [main-SendThread(rot1.zkserver:2181)] - : Client session timed out, have not heard from server in 11873ms for sessionid 0x13abc504486000f, closing socket connection and attempting reconnect 2012-11-02 14:15:09,422 ERROR [solr.core.SolrCore] - [http-8080-exec-1] - : org.apache.solr.common.SolrException: Ping query caused exception: Java heap space ..... ..... 2012-11-02 14:15:09,867 INFO [common.cloud.ConnectionManager] - [main-EventThread] - : Watcher org.apache.solr.common.cloud.ConnectionManager@305e7020 name:ZooKeeperConnection Watcher:rot1.zkserver:2181,rot2.zkserver:2181 got event WatchedEvent state:Disconnected type:None path:null path:null type:None 2012-11-02 14:15:09,867 INFO [common.cloud.ConnectionManager] - [main-EventThread] - : zkClient has disconnected 2012-11-02 14:15:09,869 ERROR [solr.cloud.RecoveryStrategy] - [RecoveryThread] - : Error while trying to recover:java.lang.OutOfMemoryError: Java heap space ..... ..... 2012-11-02 14:15:10,159 INFO [solr.update.UpdateLog] - [RecoveryThread] - : Dropping buffered updates FSUpdateLog{state=BUFFERING, tlog=null} 2012-11-02 14:15:10,159 ERROR [solr.cloud.RecoveryStrategy] - [RecoveryThread] - : Recovery failed - trying again... (3) core=shard_e 2012-11-02 14:15:10,159 INFO [solr.cloud.RecoveryStrategy] - [RecoveryThread] - : Wait 16.0 seconds before trying to recover again (4) 2012-11-02 14:15:09,878 INFO [solr.core.CachingDirectoryFactory] - [RecoveryThread] - : Releasing directory:/opt/solr/cores/shard_f/data/index.20121102141424591 2012-11-02 14:15:10,192 INFO [solr.core.CachingDirectoryFactory] - [RecoveryThread] - : Releasing directory:/opt/solr/cores/shard_f_f/data/index 2012-11-02 14:15:10,192 ERROR [solr.handler.ReplicationHandler] - [RecoveryThread] - : SnapPull failed :org.apache.solr.common.SolrException: Unable to download _773.tvf completely. Downloaded 246 415360!=562327645 ..... ..... {code} At this point indexing has already been blocked. Some nodes do not write anything to the logs and the two surrounding nodes are still busy doing some replication. Most nodes show a increased number of state changes: {code} 2012-11-02 14:16:47,768 INFO [common.cloud.ZkStateReader] - [main-EventThread] - : A cluster state change has occurred - updating... (10) {code} It's about 10 minutes of doing little by trying replication by the surrounding nodes and mostly state changes for the other nodes before are documents indexed again. At that point i stopped the node. was:See: http://lucene.472066.n3.nabble.com/SolrCloud-indexing-blocks-if-node-is-recovering-td4017827.html > SolrCloud indexing blocks if node is recovering > ----------------------------------------------- > > Key: SOLR-4038 > URL: https://issues.apache.org/jira/browse/SOLR-4038 > Project: Solr > Issue Type: Bug > Affects Versions: 4.0 > Reporter: Markus Jelsma > Fix For: 4.1, 5.0 > > > See: > http://lucene.472066.n3.nabble.com/SolrCloud-indexing-blocks-if-node-is-recovering-td4017827.html > While indexing (without CloudSolrServer at that time) one node dies with an > OOME perhaps because of the linked issue SOLR-4032. The OOME stack traces > are varied but here are some ZK-related logs between the OOME stack traces: > {code} > 2012-11-02 14:14:37,126 INFO [solr.update.UpdateLog] - [RecoveryThread] - : > Dropping buffered updates FSUpdateLog{state=BUFFERING, tlog=null} > 2012-11-02 14:14:37,127 ERROR [solr.cloud.RecoveryStrategy] - > [RecoveryThread] - : Recovery failed - trying again... (2) core=shard_e > 2012-11-02 14:14:37,127 INFO [solr.cloud.RecoveryStrategy] - [RecoveryThread] > - : Wait 8.0 seconds before trying to recover again (3) > 2012-11-02 14:14:45,328 INFO [solr.cloud.ZkController] - [RecoveryThread] - : > numShards not found on descriptor - reading it from system property > 2012-11-02 14:14:45,363 INFO [solr.cloud.RecoveryStrategy] - [RecoveryThread] > - : Starting Replication Recovery. core=shard_e > 2012-11-02 14:14:45,363 INFO [solrj.impl.HttpClientUtil] - [RecoveryThread] - > : Creating new http client, > config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false > 2012-11-02 14:14:45,775 INFO [common.cloud.ZkStateReader] - > [main-EventThread] - : A cluster state change has occurred - updating... (10) > 2012-11-02 14:14:50,987 INFO [solr.cloud.RecoveryStrategy] - [RecoveryThread] > - : Begin buffering updates. core=shard_e > 2012-11-02 14:14:50,987 INFO [solr.update.UpdateLog] - [RecoveryThread] - : > Starting to buffer updates. FSUpdateLog{state=ACTIVE, tlog=null} > 2012-11-02 14:14:50,987 INFO [solr.cloud.RecoveryStrategy] - [RecoveryThread] > - : Attempting to replicate from http://rot05.solrserver:8080/solr/shard_e/. > core=shard_e > 2012-11-02 14:14:50,987 INFO [solrj.impl.HttpClientUtil] - [RecoveryThread] - > : Creating new http client, > config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false > 2012-11-02 14:15:03,303 INFO [solr.core.CachingDirectoryFactory] - > [RecoveryThread] - : Releasing directory:/opt/solr/cores/shard_f/data/index > 2012-11-02 14:15:03,303 INFO [solr.handler.SnapPuller] - [RecoveryThread] - : > removing temporary index download directory files > NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@/opt/solr/cores/shard_f/data/index.20121102141424591 > lockFactory=org.apache.lucene.store.SimpleFSLockFactory@1520a48c; > maxCacheMB=48.0 maxMergeSizeMB=4.0) > 2012-11-02 14:15:09,421 INFO [apache.zookeeper.ClientCnxn] - > [main-SendThread(rot1.zkserver:2181)] - : Client session timed out, have not > heard from server in 11873ms for sessionid 0x13abc504486000f, closing socket > connection and attempting reconnect > 2012-11-02 14:15:09,422 ERROR [solr.core.SolrCore] - [http-8080-exec-1] - : > org.apache.solr.common.SolrException: Ping query caused exception: Java heap > space > ..... > ..... > 2012-11-02 14:15:09,867 INFO [common.cloud.ConnectionManager] - > [main-EventThread] - : Watcher > org.apache.solr.common.cloud.ConnectionManager@305e7020 > name:ZooKeeperConnection Watcher:rot1.zkserver:2181,rot2.zkserver:2181 got > event WatchedEvent state:Disconnected type:None path:null path:null type:None > 2012-11-02 14:15:09,867 INFO [common.cloud.ConnectionManager] - > [main-EventThread] - : zkClient has disconnected > 2012-11-02 14:15:09,869 ERROR [solr.cloud.RecoveryStrategy] - > [RecoveryThread] - : Error while trying to > recover:java.lang.OutOfMemoryError: Java heap space > ..... > ..... > 2012-11-02 14:15:10,159 INFO [solr.update.UpdateLog] - [RecoveryThread] - : > Dropping buffered updates FSUpdateLog{state=BUFFERING, tlog=null} > 2012-11-02 14:15:10,159 ERROR [solr.cloud.RecoveryStrategy] - > [RecoveryThread] - : Recovery failed - trying again... (3) core=shard_e > 2012-11-02 14:15:10,159 INFO [solr.cloud.RecoveryStrategy] - [RecoveryThread] > - : Wait 16.0 seconds before trying to recover again (4) > 2012-11-02 14:15:09,878 INFO [solr.core.CachingDirectoryFactory] - > [RecoveryThread] - : Releasing > directory:/opt/solr/cores/shard_f/data/index.20121102141424591 > 2012-11-02 14:15:10,192 INFO [solr.core.CachingDirectoryFactory] - > [RecoveryThread] - : Releasing directory:/opt/solr/cores/shard_f_f/data/index > 2012-11-02 14:15:10,192 ERROR [solr.handler.ReplicationHandler] - > [RecoveryThread] - : SnapPull failed :org.apache.solr.common.SolrException: > Unable to download _773.tvf completely. Downloaded 246 > 415360!=562327645 > ..... > ..... > {code} > At this point indexing has already been blocked. Some nodes do not write > anything to the logs and the two surrounding nodes are still busy doing some > replication. Most nodes show a increased number of state changes: > {code} > 2012-11-02 14:16:47,768 INFO [common.cloud.ZkStateReader] - > [main-EventThread] - : A cluster state change has occurred - updating... (10) > {code} > It's about 10 minutes of doing little by trying replication by the > surrounding nodes and mostly state changes for the other nodes before are > documents indexed again. At that point i stopped the node. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org