Satish,

Here is a key line from /var/log/messages:

Dec  5 06:52:43 ip-10-196-72-106 kernel: [26881589.804401] beam.smp invoked 
oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0

The log entry does NOT match the timestamps of the crash.log and error.log 
below, but that is ok.  The operating system killed off Riak, so there would 
have been no notification of the operating system's actions in the Riak logs.

The fact that the out-of-memory monitor, the oom-killer, killed Riak further 
supports lowering max_open_files.  I recommend we now wait to see whether the 
problem occurs again.
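
If the problem does occur again, you can quickly check for another OOM kill 
with something like the following (the path assumes your platform logs kernel 
messages to /var/log/messages, as yours does):

    grep -i 'oom-killer' /var/log/messages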


Matthew


On Dec 5, 2014, at 2:35 PM, ender <extr...@gmail.com> wrote:

> Hey Matthew,
> 
> The crash occurred around 3:00am:
> 
> -rw-rw-r-- 1 riak riak    920 Dec  5 03:01 crash.log
> -rw-rw-r-- 1 riak riak    617 Dec  5 03:01 error.log
> 
> I have attached the syslog that covers that time.  I also went ahead and 
> changed max_open_files in app.config from 315 to 150.
> 
> Satish
> 
> 
> On Fri, Dec 5, 2014 at 11:29 AM, Matthew Von-Maszewski <matth...@basho.com> 
> wrote:
> Satish,
> 
> The "key" system log varies by Linux platform.  Yes, /var/log/messages may 
> hold some key clues.  Again, be sure the file covers the time of a crash.
> 
> Matthew
> 
> 
> On Dec 5, 2014, at 1:29 PM, ender <extr...@gmail.com> wrote:
> 
>> Hey Matthew,
>> 
>> I see a /var/log/messages file, but no syslog or system.log etc.  Is it the 
>> messages file you want?
>> 
>> Satish
>> 
>> 
>> On Fri, Dec 5, 2014 at 10:06 AM, Matthew Von-Maszewski <matth...@basho.com> 
>> wrote:
>> Satish,
>> 
>> I find nothing compelling in the log or the app.config.  Therefore I have 
>> two additional suggestions/requests:
>> 
>> - lower max_open_files in app.config from 315 to 150 (see the example 
>> snippet after these two items).  One other customer reported that this 
>> limit did not properly prevent out of memory (OOM) conditions.
>> 
>> - try to locate a /var/log/syslog* file from a node that contains the time 
>> of the crash.  There may be helpful information there.  Please send that 
>> along.
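>> 
>> For reference, max_open_files lives in the eleveldb section of app.config.  
>> A minimal sketch of the relevant fragment (the data_root path here matches 
>> your leveldb directory; other entries in that section are omitted):
>> 
>>     {eleveldb, [
>>         {data_root, "/vol/lib/riak/leveldb"},
>>         {max_open_files, 150}
>>     ]},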
>> 
>> 
>> Unrelated to this crash … 1.4.7 has a known bug in its active anti-entropy 
>> (AAE) logic.  This bug is NOT known to cause a crash.  The bug does cause 
>> AAE to be unreliable for data restoration.  The proper steps for upgrading 
>> to the current release (1.4.12) are:
>> 
>> -- across the entire cluster
>> - disable anti_entropy in app.config on all nodes: {anti_entropy, {off, []}}
>> - perform a rolling restart of all nodes … AAE is now disabled in the 
>> cluster 
>> 
>> -- on each node
>> - stop the node
>> - remove (erase all files and directories) /vol/lib/riak/anti_entropy
>> - update Riak to the new software revision
>> - start the node again
>> 
>> -- across the entire cluster
>> - enable anti_entropy in app.config on all nodes: {anti_entropy, {on, []}}
>> - perform a rolling restart of all nodes … AAE is now enabled in the cluster 
>> 
>> The nodes will start rebuilding the AAE hash data.  I suggest you perform 
>> the last rolling restart during a period of low cluster utilization.
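>> 
>> A minimal sketch of the per-node steps above, assuming a package-based 
>> install (the yum command is illustrative; use your platform's package 
>> manager):
>> 
>>     riak stop
>>     rm -rf /vol/lib/riak/anti_entropy
>>     sudo yum update riak    # upgrade to 1.4.12
>>     riak start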
>> 
>> 
>> Matthew
>> 
>> 
>> On Dec 5, 2014, at 11:02 AM, ender <extr...@gmail.com> wrote:
>> 
>>> Hi Matthew,
>>> 
>>> Riak version: 1.4.7
>>> 5 Nodes in cluster
>>> RAM: 30GB
>>> 
>>> The leveldb logs are attached.
>>> 
>>> 
>>> 
>>> On Thu, Dec 4, 2014 at 1:34 PM, Matthew Von-Maszewski <matth...@basho.com> 
>>> wrote:
>>> Satish,
>>> 
>>> Some questions:
>>> 
>>> - what version of Riak are you running?  (the logs suggest 1.4.7)
>>> - how many nodes in your cluster?
>>> - what is the physical memory (RAM size) of each node?
>>> - would you send the leveldb LOG files from one of the crashed servers:
>>>     tar -czf satish_LOG.tgz /vol/lib/riak/leveldb/*/LOG*
>>> 
>>> 
>>> Matthew
>>> 
>>> On Dec 4, 2014, at 4:02 PM, ender <extr...@gmail.com> wrote:
>>> 
>>> > My Riak installation has been running successfully for about a year.  
>>> > This week nodes suddenly started randomly crashing.  The machines have 
>>> > plenty of memory and free disk space, and looking in the ring directory 
>>> > nothing appears to be amiss:
>>> >
>>> > [ec2-user@ip-10-196-72-247 ~]$ ls -l /vol/lib/riak/ring
>>> > total 80
>>> > -rw-rw-r-- 1 riak riak 17829 Nov 29 19:42 
>>> > riak_core_ring.default.20141129194225
>>> > -rw-rw-r-- 1 riak riak 17829 Dec  3 19:07 
>>> > riak_core_ring.default.20141203190748
>>> > -rw-rw-r-- 1 riak riak 17829 Dec  4 16:29 
>>> > riak_core_ring.default.20141204162956
>>> > -rw-rw-r-- 1 riak riak 17847 Dec  4 20:45 
>>> > riak_core_ring.default.20141204204548
>>> >
>>> > [ec2-user@ip-10-196-72-247 ~]$ du -h /vol/lib/riak/ring
>>> > 84K   /vol/lib/riak/ring
>>> >
>>> > I have attached a tarball with the app.config file plus all the logs from 
>>> > the node at the time of the crash.  Any help much appreciated!
>>> >
>>> > Satish
>>> >
>>> > <riak-crash-data.tar.gz>
>>> 
>>> 
>>> <satish_LOG.tgz>
>> 
>> 
> 
> 
> <messages>

_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
