[
https://issues.apache.org/jira/browse/SOLR-4038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13546874#comment-13546874
]
Mark Miller commented on SOLR-4038:
-----------------------------------
If a node freezes up or OOMs, I think you are in trouble - it needs to be
killed. Luckily, you can take care of this pretty easily in the OOM case. In
the general freeze case there is less you can do - hopefully that just means
the zk connection will be closed, though - it's hard to imagine it not being
closed unless it's a really weird bug.
In the OOM case, things are just not good. The node will still be somewhat
operating - it might hold connections open but not respond well. Once you OOM,
the JVM is in an undefined state - you don't know what it will do. There is not
a lot we can do about that easily - I think the best thing you can do is set up
your JVMs to be kill'd when an OOM happens.
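On HotSpot JVMs, one way to arrange that is the -XX:OnOutOfMemoryError option,
which runs an arbitrary command when the JVM throws an OutOfMemoryError. A
minimal sketch (assuming Linux and a start.jar-style Solr launch; the heap size
and jar name are illustrative, not from this issue):
{code}
# %p expands to the pid of the JVM that hit the OOM, so the dying
# process is killed outright instead of limping along half-alive.
java -Xmx4g \
     -XX:OnOutOfMemoryError="kill -9 %p" \
     -jar start.jar
{code}
An external process supervisor can then notice the dead process and restart
the node cleanly.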
> SolrCloud indexing blocks if shard is marked as DOWN
> ----------------------------------------------------
>
> Key: SOLR-4038
> URL: https://issues.apache.org/jira/browse/SOLR-4038
> Project: Solr
> Issue Type: Bug
> Components: SolrCloud
> Affects Versions: 4.0
> Reporter: Markus Jelsma
> Assignee: Mark Miller
> Fix For: 4.1, 5.0
>
>
> See:
> http://lucene.472066.n3.nabble.com/SolrCloud-indexing-blocks-if-node-is-recovering-td4017827.html
> While indexing (without CloudSolrServer at that time) one node dies with an
> OOME perhaps because of the linked issue SOLR-4032. The OOME stack traces
> are varied but here are some ZK-related logs between the OOME stack traces:
> {code}
> 2012-11-02 14:14:37,126 INFO [solr.update.UpdateLog] - [RecoveryThread] - :
> Dropping buffered updates FSUpdateLog{state=BUFFERING, tlog=null}
> 2012-11-02 14:14:37,127 ERROR [solr.cloud.RecoveryStrategy] -
> [RecoveryThread] - : Recovery failed - trying again... (2) core=shard_e
> 2012-11-02 14:14:37,127 INFO [solr.cloud.RecoveryStrategy] - [RecoveryThread]
> - : Wait 8.0 seconds before trying to recover again (3)
> 2012-11-02 14:14:45,328 INFO [solr.cloud.ZkController] - [RecoveryThread] - :
> numShards not found on descriptor - reading it from system property
> 2012-11-02 14:14:45,363 INFO [solr.cloud.RecoveryStrategy] - [RecoveryThread]
> - : Starting Replication Recovery. core=shard_e
> 2012-11-02 14:14:45,363 INFO [solrj.impl.HttpClientUtil] - [RecoveryThread] -
> : Creating new http client,
> config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
> 2012-11-02 14:14:45,775 INFO [common.cloud.ZkStateReader] -
> [main-EventThread] - : A cluster state change has occurred - updating... (10)
> 2012-11-02 14:14:50,987 INFO [solr.cloud.RecoveryStrategy] - [RecoveryThread]
> - : Begin buffering updates. core=shard_e
> 2012-11-02 14:14:50,987 INFO [solr.update.UpdateLog] - [RecoveryThread] - :
> Starting to buffer updates. FSUpdateLog{state=ACTIVE, tlog=null}
> 2012-11-02 14:14:50,987 INFO [solr.cloud.RecoveryStrategy] - [RecoveryThread]
> - : Attempting to replicate from http://rot05.solrserver:8080/solr/shard_e/.
> core=shard_e
> 2012-11-02 14:14:50,987 INFO [solrj.impl.HttpClientUtil] - [RecoveryThread] -
> : Creating new http client,
> config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
> 2012-11-02 14:15:03,303 INFO [solr.core.CachingDirectoryFactory] -
> [RecoveryThread] - : Releasing directory:/opt/solr/cores/shard_f/data/index
> 2012-11-02 14:15:03,303 INFO [solr.handler.SnapPuller] - [RecoveryThread] - :
> removing temporary index download directory files
> NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@/opt/solr/cores/shard_f/data/index.20121102141424591
> lockFactory=org.apache.lucene.store.SimpleFSLockFactory@1520a48c;
> maxCacheMB=48.0 maxMergeSizeMB=4.0)
> 2012-11-02 14:15:09,421 INFO [apache.zookeeper.ClientCnxn] -
> [main-SendThread(rot1.zkserver:2181)] - : Client session timed out, have not
> heard from server in 11873ms for sessionid 0x13abc504486000f, closing socket
> connection and attempting reconnect
> 2012-11-02 14:15:09,422 ERROR [solr.core.SolrCore] - [http-8080-exec-1] - :
> org.apache.solr.common.SolrException: Ping query caused exception: Java heap
> space
> .....
> .....
> 2012-11-02 14:15:09,867 INFO [common.cloud.ConnectionManager] -
> [main-EventThread] - : Watcher
> org.apache.solr.common.cloud.ConnectionManager@305e7020
> name:ZooKeeperConnection Watcher:rot1.zkserver:2181,rot2.zkserver:2181 got
> event WatchedEvent state:Disconnected type:None path:null path:null type:None
> 2012-11-02 14:15:09,867 INFO [common.cloud.ConnectionManager] -
> [main-EventThread] - : zkClient has disconnected
> 2012-11-02 14:15:09,869 ERROR [solr.cloud.RecoveryStrategy] -
> [RecoveryThread] - : Error while trying to
> recover:java.lang.OutOfMemoryError: Java heap space
> .....
> .....
> 2012-11-02 14:15:10,159 INFO [solr.update.UpdateLog] - [RecoveryThread] - :
> Dropping buffered updates FSUpdateLog{state=BUFFERING, tlog=null}
> 2012-11-02 14:15:10,159 ERROR [solr.cloud.RecoveryStrategy] -
> [RecoveryThread] - : Recovery failed - trying again... (3) core=shard_e
> 2012-11-02 14:15:10,159 INFO [solr.cloud.RecoveryStrategy] - [RecoveryThread]
> - : Wait 16.0 seconds before trying to recover again (4)
> 2012-11-02 14:15:09,878 INFO [solr.core.CachingDirectoryFactory] -
> [RecoveryThread] - : Releasing
> directory:/opt/solr/cores/shard_f/data/index.20121102141424591
> 2012-11-02 14:15:10,192 INFO [solr.core.CachingDirectoryFactory] -
> [RecoveryThread] - : Releasing directory:/opt/solr/cores/shard_f_f/data/index
> 2012-11-02 14:15:10,192 ERROR [solr.handler.ReplicationHandler] -
> [RecoveryThread] - : SnapPull failed :org.apache.solr.common.SolrException:
> Unable to download _773.tvf completely. Downloaded 246415360!=562327645
> .....
> .....
> {code}
> At this point indexing has already been blocked. Some nodes do not write
> anything to the logs and the two surrounding nodes are still busy doing some
> replication. Most nodes show an increased number of state changes:
> {code}
> 2012-11-02 14:16:47,768 INFO [common.cloud.ZkStateReader] -
> [main-EventThread] - : A cluster state change has occurred - updating... (10)
> {code}
> For about 10 minutes little happens - the surrounding nodes keep attempting
> replication and the other nodes mostly log state changes - before documents
> are indexed again. At that point I stopped the node.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]