Satish,

Here is a key line from /var/log/messages:

Dec 5 06:52:43 ip-10-196-72-106 kernel: [26881589.804401] beam.smp invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0

The log entry does NOT match the timestamps of the crash.log and error.log below. But that is OK. The operating system killed off Riak. There would have been no notification in the Riak logs of the operating system's actions.

The fact that the out-of-memory monitor, oom-killer, killed Riak further supports the change to max_open_files. I recommend we now wait to see if the problem occurs again.

Matthew

On Dec 5, 2014, at 2:35 PM, ender <extr...@gmail.com> wrote:

> Hey Matthew,
>
> The crash occurred around 3:00am:
>
> -rw-rw-r-- 1 riak riak 920 Dec 5 03:01 crash.log
> -rw-rw-r-- 1 riak riak 617 Dec 5 03:01 error.log
>
> I have attached the syslog that covers that time. I also went ahead and
> changed max_open_files in app.config to 150 from 315.
>
> Satish
>
>
> On Fri, Dec 5, 2014 at 11:29 AM, Matthew Von-Maszewski <matth...@basho.com>
> wrote:
> Satish,
>
> The "key" system log varies by Linux platform. Yes, /var/log/messages may
> hold some key clues. Again, be sure the file covers the time of a crash.
>
> Matthew
>
>
> On Dec 5, 2014, at 1:29 PM, ender <extr...@gmail.com> wrote:
>
>> Hey Matthew,
>>
>> I see a /var/log/messages file, but no syslog or system.log etc. Is it the
>> messages file you want?
>>
>> Satish
>>
>>
>> On Fri, Dec 5, 2014 at 10:06 AM, Matthew Von-Maszewski <matth...@basho.com>
>> wrote:
>> Satish,
>>
>> I find nothing compelling in the log or the app.config. Therefore I have
>> two additional suggestions/requests:
>>
>> - lower max_open_files in app.config to 150 from 315. There was one
>> other customer report regarding the limit not properly stopping out of
>> memory (OOM) conditions.
>>
>> - try to locate a /var/log/syslog* file from a node that contains the time
>> of the crash. There may be helpful information there. Please send that
>> along.
>>
>>
>> Unrelated to this crash … 1.4.7 has a known bug in its active anti-entropy
>> (AAE) logic. This bug is NOT known to cause a crash. The bug does cause
>> AAE to be unreliable for data restoration. The proper steps for upgrading
>> to the current release (1.4.12) are:
>>
>> -- across the entire cluster
>> - disable anti_entropy in app.config on all nodes: {anti_entropy, {off, []}}
>> - perform a rolling restart of all nodes … AAE is now disabled in the
>> cluster
>>
>> -- on each node
>> - stop the node
>> - remove (erase all files and directories) /vol/lib/riak/anti_entropy
>> - update Riak to the new software revision
>> - start the node again
>>
>> -- across the entire cluster
>> - enable anti_entropy in app.config on all nodes: {anti_entropy, {on, []}}
>> - perform a rolling restart of all nodes … AAE is now enabled in the cluster
>>
>> The nodes will start rebuilding the AAE hash data. Suggest you perform the
>> last rolling restart during a low utilization time of your cluster.
>>
>>
>> Matthew
>>
>>
>> On Dec 5, 2014, at 11:02 AM, ender <extr...@gmail.com> wrote:
>>
>>> Hi Matthew,
>>>
>>> Riak version: 1.4.7
>>> 5 Nodes in cluster
>>> RAM: 30GB
>>>
>>> The leveldb logs are attached.
>>>
>>>
>>>
>>> On Thu, Dec 4, 2014 at 1:34 PM, Matthew Von-Maszewski <matth...@basho.com>
>>> wrote:
>>> Satish,
>>>
>>> Some questions:
>>>
>>> - what version of Riak are you running? logs suggest 1.4.7
>>> - how many nodes in your cluster?
>>> - what is the physical memory (RAM size) of each node?
>>> - would you send the leveldb LOG files from one of the crashed servers:
>>> tar -czf satish_LOG.tgz /vol/lib/riak/leveldb/*/LOG*
>>>
>>>
>>> Matthew
>>>
>>> On Dec 4, 2014, at 4:02 PM, ender <extr...@gmail.com> wrote:
>>>
>>> > My Riak installation has been running successfully for about a year.
>>> > This week nodes suddenly started randomly crashing. The machines have
>>> > plenty of memory and free disk space, and looking in the ring directory
>>> > nothing appears to be amiss:
>>> >
>>> > [ec2-user@ip-10-196-72-247 ~]$ ls -l /vol/lib/riak/ring
>>> > total 80
>>> > -rw-rw-r-- 1 riak riak 17829 Nov 29 19:42 riak_core_ring.default.20141129194225
>>> > -rw-rw-r-- 1 riak riak 17829 Dec 3 19:07 riak_core_ring.default.20141203190748
>>> > -rw-rw-r-- 1 riak riak 17829 Dec 4 16:29 riak_core_ring.default.20141204162956
>>> > -rw-rw-r-- 1 riak riak 17847 Dec 4 20:45 riak_core_ring.default.20141204204548
>>> >
>>> > [ec2-user@ip-10-196-72-247 ~]$ du -h /vol/lib/riak/ring
>>> > 84K /vol/lib/riak/ring
>>> >
>>> > I have attached a tarball with the app.config file plus all the logs from
>>> > the node at the time of the crash. Any help much appreciated!
>>> >
>>> > Satish
>>> >
>>> > <riak-crash-data.tar.gz>_______________________________________________
>>> > riak-users mailing list
>>> > riak-users@lists.basho.com
>>> > http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>>>
>>>
>>> <satish_LOG.tgz>
>>
>>
>
>
> <messages>
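
For anyone repeating the check described at the top of this thread (looking in /var/log/messages for an oom-killer entry that names beam.smp), here is a minimal Python sketch. The function name find_oom_events and the regular expression are illustrative assumptions, not part of Riak or of any Basho tooling, and the pattern is only known to fit the single kernel line quoted above; other kernels or syslog formats may need a different pattern.

```python
import re

# Assumed pattern, modelled on the one kernel line quoted in this thread:
#   Dec 5 06:52:43 <host> kernel: [<uptime>] <process> invoked oom-killer: ...
OOM_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+kernel:"
    r".*?\]\s*(?P<proc>\S+) invoked oom-killer"
)


def find_oom_events(lines, process="beam.smp"):
    """Return (timestamp, host) pairs for lines where `process` invoked oom-killer.

    Note: the "invoked" line names the process whose allocation triggered the
    killer; the kernel reports the process it actually killed on a separate line.
    """
    events = []
    for line in lines:
        match = OOM_RE.search(line)
        if match and match.group("proc") == process:
            events.append((match.group("stamp"), match.group("host")))
    return events


if __name__ == "__main__":
    # Sample input copied from the message above; replace with the lines of
    # your own system log (reading it usually requires root).
    sample = [
        "Dec 5 06:52:43 ip-10-196-72-106 kernel: [26881589.804401] "
        "beam.smp invoked oom-killer: gfp_mask=0x201da, order=0, "
        "oom_adj=0, oom_score_adj=0",
    ]
    for stamp, host in find_oom_events(sample):
        print(f"{stamp} {host}: beam.smp invoked oom-killer")
```

Comparing the timestamps it returns against the modification times of crash.log and error.log is the comparison made by hand at the top of the thread.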