Satish, I find nothing compelling in the log or the app.config. Therefore I have two additional suggestions/requests:
- lower max_open_files in app.config to 150 from 315. There was one other customer report regarding the limit not properly stopping out of memory (OOM) conditions.

- try to locate a /var/log/syslog* file from a node that contains the time of the crash. There may be helpful information there. Please send that along.

Unrelated to this crash … 1.4.7 has a known bug in its active anti-entropy (AAE) logic. This bug is NOT known to cause a crash. The bug does cause AAE to be unreliable for data restoration. The proper steps for upgrading to the current release (1.4.12) are:

-- across the entire cluster
- disable anti_entropy in app.config on all nodes: {anti_entropy, {off, []}}
- perform a rolling restart of all nodes … AAE is now disabled in the cluster

-- on each node
- stop the node
- remove (erase all files and directories) /vol/lib/riak/anti_entropy
- update Riak to the new software revision
- start the node again

-- across the entire cluster
- enable anti_entropy in app.config on all nodes: {anti_entropy, {on, []}}
- perform a rolling restart of all nodes … AAE is now enabled in the cluster

The nodes will start rebuilding the AAE hash data. Suggest you perform the last rolling restart during a low utilization time of your cluster.

Matthew

On Dec 5, 2014, at 11:02 AM, ender <extr...@gmail.com> wrote:

> Hi Matthew,
>
> Riak version: 1.4.7
> 5 Nodes in cluster
> RAM: 30GB
>
> The leveldb logs are attached.
>
>
>
> On Thu, Dec 4, 2014 at 1:34 PM, Matthew Von-Maszewski <matth...@basho.com> wrote:
> Satish,
>
> Some questions:
>
> - what version of Riak are you running? logs suggest 1.4.7
> - how many nodes in your cluster?
> - what is the physical memory (RAM size) of each node?
> - would you send the leveldb LOG files from one of the crashed servers:
> tar -czf satish_LOG.tgz /vol/lib/riak/leveldb/*/LOG*
>
>
> Matthew
>
> On Dec 4, 2014, at 4:02 PM, ender <extr...@gmail.com> wrote:
>
> > My Riak installation has been running successfully for about a year. This week nodes suddenly started randomly crashing. The machines have plenty of memory and free disk space, and looking in the ring directory nothing appears to be amiss:
> >
> > [ec2-user@ip-10-196-72-247 ~]$ ls -l /vol/lib/riak/ring
> > total 80
> > -rw-rw-r-- 1 riak riak 17829 Nov 29 19:42 riak_core_ring.default.20141129194225
> > -rw-rw-r-- 1 riak riak 17829 Dec 3 19:07 riak_core_ring.default.20141203190748
> > -rw-rw-r-- 1 riak riak 17829 Dec 4 16:29 riak_core_ring.default.20141204162956
> > -rw-rw-r-- 1 riak riak 17847 Dec 4 20:45 riak_core_ring.default.20141204204548
> >
> > [ec2-user@ip-10-196-72-247 ~]$ du -h /vol/lib/riak/ring
> > 84K /vol/lib/riak/ring
> >
> > I have attached a tarball with the app.config file plus all the logs from the node at the time of the crash. Any help much appreciated!
> >
> > Satish
> >
> > <riak-crash-data.tar.gz>
> > _______________________________________________
> > riak-users mailing list
> > riak-users@lists.basho.com
> > http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
> >
> <satish_LOG.tgz>
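
P.S. The upgrade steps above switch anti_entropy off and later back on in app.config on every node. If editing each file by hand is error-prone, the switch could be scripted. What follows is only a minimal sketch in Python, not a Basho tool: the function name set_anti_entropy is hypothetical, and it assumes the entry is written in the {anti_entropy, {on, []}} form shown above and appears exactly once in the file. It only transforms the config text; reading and writing app.config, and the rolling restarts, are left to the operator.

```python
import re

# Matches the switch inside the anti_entropy tuple, e.g.
#   {anti_entropy, {on, []}}   or   {anti_entropy, {off, []}}
# The comma required right after "anti_entropy" keeps keys that merely start
# with that name (such as anti_entropy_build_limit) from matching.
# Assumption: the entry is written in the tuple form shown in the email.
_AAE_SWITCH = re.compile(r"(\{\s*anti_entropy\s*,\s*\{\s*)(on|off)(\s*,)")


def set_anti_entropy(config_text: str, enabled: bool) -> str:
    """Return config_text with the anti_entropy switch set to on or off.

    Hypothetical helper: raises ValueError unless exactly one entry is found,
    so a missing, duplicated, or commented-out copy is flagged rather than
    silently edited.
    """
    state = "on" if enabled else "off"
    new_text, count = _AAE_SWITCH.subn(
        lambda m: m.group(1) + state + m.group(3), config_text
    )
    if count != 1:
        raise ValueError(f"expected exactly one anti_entropy entry, found {count}")
    return new_text


if __name__ == "__main__":
    # Step 1 of the procedure: disable AAE before the first rolling restart.
    print(set_anti_entropy("{anti_entropy, {on, []}}", enabled=False))
    # prints: {anti_entropy, {off, []}}
```

Refusing to edit when the entry is not found exactly once is deliberate: a node left with AAE still enabled during the upgrade would defeat the purpose of the procedure.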