Satish,

I do NOT recommend adding a sixth node before the other five are stable again.  
Another customer did that recently, and things only got worse because of the 
vnode handoff activity to the new sixth node.

I do recommend one or both of the following:

- disable active anti-entropy in app.config with {anti_entropy, {off, []}}, 
then restart all nodes.  We quickly replaced 1.4.7 due to a bug in its active 
anti-entropy.  I do not know the details of the bug, and no one had seen a 
crash from it, but you may be seeing a long-term problem caused by that same 
bug.  The anti-entropy feature in 1.4.7 is not really protecting your data 
anyway, so it might as well be disabled until you are ready to upgrade.
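
  As a minimal sketch, the change sits in the riak_kv section of app.config; 
  everything else in that section stays as it is:

      {riak_kv, [
          %% ... other riak_kv settings unchanged ...
          {anti_entropy, {off, []}}
      ]},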

- further reduce the max_open_files parameter, simply to get memory usage 
stable: use 75 instead of the recent 150.  You must restart all nodes after 
making the change in app.config.
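
  The matching sketch for this one goes in the eleveldb section of app.config 
  (only the one value changes; keep in mind the limit applies per vnode, so 
  the total number of open files on a node is many times this figure):

      {eleveldb, [
          %% ... other eleveldb settings unchanged ...
          {max_open_files, 75}
      ]},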


I will need to solicit support from others at Basho if the two workarounds 
above do not stabilize the cluster.  

Matthew

On Dec 5, 2014, at 5:54 PM, ender <extr...@gmail.com> wrote:

> Would adding a 6th node mean each node would use less memory, as a stopgap 
> measure?
> 
> On Fri, Dec 5, 2014 at 2:20 PM, ender <extr...@gmail.com> wrote:
> Hey Matthew, it just crashed again.  This time I got the syslog and leveldb 
> logs right away.
> 
> 
> 
> On Fri, Dec 5, 2014 at 11:43 AM, Matthew Von-Maszewski <matth...@basho.com> 
> wrote:
> Satish,
> 
> Here is a key line from /var/log/messages:
> 
> Dec  5 06:52:43 ip-10-196-72-106 kernel: [26881589.804401] beam.smp invoked 
> oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0
> 
> The log entry does NOT match the timestamps of the crash.log and error.log 
> below, but that is ok.  The operating system killed off Riak.  There would 
> have been no notification in the Riak logs of the operating system's actions.
> 
> The fact that the out-of-memory monitor, oom-killer, killed Riak further 
> supports the change to max_open_files.  I recommend we now wait and see 
> whether the problem occurs again.
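> 
> If you want to check the other nodes for the same thing, a quick search of 
> the system log should surface any oom-killer events (a sketch; the log file 
> name varies by distribution):
> 
>     grep -iE 'oom-killer|out of memory' /var/log/messages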
> 
> 
> Matthew
> 
> 
> On Dec 5, 2014, at 2:35 PM, ender <extr...@gmail.com> wrote:
> 
>> Hey Matthew,
>> 
>> The crash occurred around 3:00am:
>> 
>> -rw-rw-r-- 1 riak riak    920 Dec  5 03:01 crash.log
>> -rw-rw-r-- 1 riak riak    617 Dec  5 03:01 error.log
>> 
>> I have attached the syslog that covers that time.  I also went ahead and 
>> changed max_open_files in app.config to 150 from 315.
>> 
>> Satish
>> 
>> 
>> On Fri, Dec 5, 2014 at 11:29 AM, Matthew Von-Maszewski <matth...@basho.com> 
>> wrote:
>> Satish,
>> 
>> The "key" system log varies by Linux platform.  Yes, /var/log/messages may 
>> hold some key clues.  Again, be sure the file covers the time of a crash.
>> 
>> Matthew
>> 
>> 
>> On Dec 5, 2014, at 1:29 PM, ender <extr...@gmail.com> wrote:
>> 
>>> Hey Matthew,
>>> 
>>> I see a /var/log/messages file, but no syslog or system.log etc.  Is it the 
>>> messages file you want?
>>> 
>>> Satish
>>> 
>>> 
>>> On Fri, Dec 5, 2014 at 10:06 AM, Matthew Von-Maszewski <matth...@basho.com> 
>>> wrote:
>>> Satish,
>>> 
>>> I find nothing compelling in the log or the app.config.  Therefore I have 
>>> two additional suggestions/requests:
>>> 
>>> - lower max_open_files in app.config to 150 from 315.  There has been one 
>>> other customer report of this limit not properly preventing out-of-memory 
>>> (OOM) conditions.
>>> 
>>> - try to locate a /var/log/syslog* file from a node that contains the time 
>>> of the crash.  There may be helpful information there.  Please send that 
>>> along.
>>> 
>>> 
>>> Unrelated to this crash … 1.4.7 has a known bug in its active anti-entropy 
>>> (AAE) logic.  This bug is NOT known to cause a crash.  The bug does cause 
>>> AAE to be unreliable for data restoration.  The proper steps for upgrading 
>>> to the current release (1.4.12) are:
>>> 
>>> -- across the entire cluster
>>> - disable anti_entropy in app.config on all nodes: {anti_entropy, {off, []}}
>>> - perform a rolling restart of all nodes … AAE is now disabled in the 
>>> cluster 
>>> 
>>> -- on each node (see the sketch after these steps)
>>> - stop the node
>>> - remove (erase all files and directories) /vol/lib/riak/anti_entropy
>>> - update Riak to the new software revision
>>> - start the node again
>>> 
>>> -- across the entire cluster
>>> - enable anti_entropy in app.config on all nodes: {anti_entropy, {on, []}}
>>> - perform a rolling restart of all nodes … AAE is now enabled in the 
>>> cluster 
>>> 
>>> The nodes will start rebuilding the AAE hash data.  I suggest you perform 
>>> the final rolling restart during a low-utilization period for your cluster.
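>>> 
>>> Concretely, the per-node portion might look like the following.  This is 
>>> only a sketch: the riak and riak-admin scripts are the standard packaged 
>>> ones, the anti_entropy path is the one mentioned above, and the upgrade 
>>> step itself depends on how you installed Riak:
>>> 
>>>     riak stop
>>>     rm -rf /vol/lib/riak/anti_entropy
>>>     # upgrade the Riak package here (apt/yum/tarball, per your install)
>>>     riak start
>>>     riak-admin wait-for-service riak_kv riak@<node-hostname>
>>> 
>>> The wait-for-service line just confirms riak_kv is back up before you 
>>> move on to the next node.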
>>> 
>>> 
>>> Matthew
>>> 
>>> 
>>> On Dec 5, 2014, at 11:02 AM, ender <extr...@gmail.com> wrote:
>>> 
>>>> Hi Matthew,
>>>> 
>>>> Riak version: 1.4.7
>>>> 5 Nodes in cluster
>>>> RAM: 30GB
>>>> 
>>>> The leveldb logs are attached.
>>>> 
>>>> 
>>>> 
>>>> On Thu, Dec 4, 2014 at 1:34 PM, Matthew Von-Maszewski <matth...@basho.com> 
>>>> wrote:
>>>> Satish,
>>>> 
>>>> Some questions:
>>>> 
>>>> - what version of Riak are you running?  The logs suggest 1.4.7.
>>>> - how many nodes in your cluster?
>>>> - what is the physical memory (RAM size) of each node?
>>>> - would you send the leveldb LOG files from one of the crashed servers:
>>>>     tar -czf satish_LOG.tgz /vol/lib/riak/leveldb/*/LOG*
>>>> 
>>>> 
>>>> Matthew
>>>> 
>>>> On Dec 4, 2014, at 4:02 PM, ender <extr...@gmail.com> wrote:
>>>> 
>>>> > My Riak installation has been running successfully for about a year.  
>>>> > This week nodes suddenly started randomly crashing.  The machines have 
>>>> > plenty of memory and free disk space, and looking in the ring directory 
>>>> > nothing appears to be amiss:
>>>> >
>>>> > [ec2-user@ip-10-196-72-247 ~]$ ls -l /vol/lib/riak/ring
>>>> > total 80
>>>> > -rw-rw-r-- 1 riak riak 17829 Nov 29 19:42 
>>>> > riak_core_ring.default.20141129194225
>>>> > -rw-rw-r-- 1 riak riak 17829 Dec  3 19:07 
>>>> > riak_core_ring.default.20141203190748
>>>> > -rw-rw-r-- 1 riak riak 17829 Dec  4 16:29 
>>>> > riak_core_ring.default.20141204162956
>>>> > -rw-rw-r-- 1 riak riak 17847 Dec  4 20:45 
>>>> > riak_core_ring.default.20141204204548
>>>> >
>>>> > [ec2-user@ip-10-196-72-247 ~]$ du -h /vol/lib/riak/ring
>>>> > 84K   /vol/lib/riak/ring
>>>> >
>>>> > I have attached a tarball with the app.config file plus all the logs 
>>>> > from the node at the time of the crash.  Any help much appreciated!
>>>> >
>>>> > Satish
>>>> >
>>>> > <riak-crash-data.tar.gz>
>>>> 
>>>> 
>>>> <satish_LOG.tgz>
>>> 
>>> 
>> 
>> 
>> <messages>
> 
> 
> 

_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
