Re: Riak Nodes Crashing

Matthew Von-Maszewski Fri, 05 Dec 2014 10:08:33 -0800

Satish,

I find nothing compelling in the log or the app.config.  Therefore I have two 
additional suggestions/requests:


- lower max_open_files in app.config from 315 to 150.  One other customer 
has reported that this limit did not properly prevent out of memory (OOM) 
conditions.

- try to locate a /var/log/syslog* file, on a node that crashed, that covers 
the time of the crash.  There may be helpful information there.  Please send 
that along.
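
A minimal shell sketch of both checks follows.  It is illustrative only: the 
function names are made up for this example, the default app.config path is 
an assumption (pass your real path), and some distributions, Amazon Linux 
among them, write kernel messages to /var/log/messages instead of 
/var/log/syslog.  The search patterns are the usual Linux kernel OOM-killer 
messages.

```shell
#!/bin/sh

# Print every max_open_files setting in a config file, with line numbers.
# The default path below is an assumption; pass the real app.config location.
show_max_open_files() {
    grep -n 'max_open_files' "${1:-/etc/riak/app.config}"
}

# Search one or more system log files for kernel OOM-killer messages.
# Usage: find_oom_events FILE...
find_oom_events() {
    grep -i -E 'out of memory|oom-killer|killed process' "$@"
}
```

Running find_oom_events /var/log/syslog* on a crashed node and comparing the 
timestamps of any matches with the crash time shows whether the kernel killed 
the Riak process.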


Unrelated to this crash … 1.4.7 has a known bug in its active anti-entropy 
(AAE) logic.  This bug is NOT known to cause a crash.  The bug does cause AAE 
to be unreliable for data restoration.  The proper steps for upgrading to the 
current release (1.4.12) are:

-- across the entire cluster
- disable anti_entropy in app.config on all nodes: {anti_entropy, {off, []}}
- perform a rolling restart of all nodes … AAE is now disabled in the cluster 

-- on each node
- stop the node
- remove (erase all files and directories) /vol/lib/riak/anti_entropy
- update Riak to the new software revision
- start the node again

-- across the entire cluster
- enable anti_entropy in app.config on all nodes: {anti_entropy, {on, []}}
- perform a rolling restart of all nodes … AAE is now enabled in the cluster 

The nodes will start rebuilding the AAE hash data.  I suggest performing the 
last rolling restart during a period of low utilization on your cluster.
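
The per-node steps above could be scripted roughly as follows.  This is a 
sketch under assumptions, not a Basho-supplied tool: run and upgrade_node are 
hypothetical helper names, the package upgrade command is left to the caller 
because it depends on your platform, and the find options -mindepth and 
-delete assume GNU or BSD find.  It empties the anti_entropy directory rather 
than deleting the directory itself.  Setting DRY_RUN=1 prints each command 
instead of running it.

```shell
#!/bin/sh

# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Usage: upgrade_node AAE_DIR UPGRADE_COMMAND...
# Stops the node, empties the AAE directory, runs the caller-supplied
# upgrade command, then starts the node again.
upgrade_node() {
    # Require a non-empty directory and an upgrade command before doing anything.
    if [ $# -lt 2 ] || [ -z "$1" ]; then
        echo "usage: upgrade_node AAE_DIR UPGRADE_COMMAND..." >&2
        return 1
    fi
    aae_dir=$1
    shift
    run riak stop || return 1
    # Erase all files and directories below AAE_DIR, keeping AAE_DIR itself.
    run find "$aae_dir" -mindepth 1 -delete || return 1
    run "$@" || return 1
    run riak start
}
```

A dry run on one node might look like this (the package file name is a 
placeholder, not a real artifact name):

    DRY_RUN=1; upgrade_node /vol/lib/riak/anti_entropy rpm -Uvh riak-1.4.12.rpm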


Matthew


On Dec 5, 2014, at 11:02 AM, ender <extr...@gmail.com> wrote:

> Hi Matthew,
> 
> Riak version: 1.4.7
> 5 Nodes in cluster
> RAM: 30GB
> 
> The leveldb logs are attached.
> 
> 
> 
> On Thu, Dec 4, 2014 at 1:34 PM, Matthew Von-Maszewski <matth...@basho.com> 
> wrote:
> Satish,
> 
> Some questions:
> 
> - what version of Riak are you running?  logs suggest 1.4.7
> - how many nodes in your cluster?
> - what is the physical memory (RAM size) of each node?
> - would you send the leveldb LOG files from one of the crashed servers:
>     tar -czf satish_LOG.tgz /vol/lib/riak/leveldb/*/LOG*
> 
> 
> Matthew
> 
> On Dec 4, 2014, at 4:02 PM, ender <extr...@gmail.com> wrote:
> 
> > My Riak installation has been running successfully for about a year.  This 
> > week nodes suddenly started randomly crashing.  The machines have plenty of 
> > memory and free disk space, and looking in the ring directory nothing 
> > appears to be amiss:
> >
> > [ec2-user@ip-10-196-72-247 ~]$ ls -l /vol/lib/riak/ring
> > total 80
> > -rw-rw-r-- 1 riak riak 17829 Nov 29 19:42 riak_core_ring.default.20141129194225
> > -rw-rw-r-- 1 riak riak 17829 Dec  3 19:07 riak_core_ring.default.20141203190748
> > -rw-rw-r-- 1 riak riak 17829 Dec  4 16:29 riak_core_ring.default.20141204162956
> > -rw-rw-r-- 1 riak riak 17847 Dec  4 20:45 riak_core_ring.default.20141204204548
> >
> > [ec2-user@ip-10-196-72-247 ~]$ du -h /vol/lib/riak/ring
> > 84K   /vol/lib/riak/ring
> >
> > I have attached a tarball with the app.config file plus all the logs from 
> > the node at the time of the crash.  Any help much appreciated!
> >
> > Satish
> >
> > <riak-crash-data.tar.gz>
> > _______________________________________________
> > riak-users mailing list
> > riak-users@lists.basho.com
> > http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
> 
> 
> <satish_LOG.tgz>

