Hi Matthew,

I would certainly like to understand these leveldb stalls, but I wonder if 
they are a separate issue. I've been sitting here with a live tail of all the 
leveldb LOG files, side by side with the riak console log, and I've seen 
three periods during which we get hundreds of "riak_kv_vnode worker pool 
crashed" error messages in the console log, yet no "waiting" messages in any 
of the leveldb LOG files during those episodes.

Interestingly, I seem to be able to induce these periods of instability by 
running:

>>> import riak
>>> kv = riak.RiakClient()
>>> kv.bucket("ctv_tvdata").get_keys()

This call never returns, and within a few seconds of running it, the 
console.log fills with "riak_kv_vnode worker pool crashed" and "CRASH REPORT" 
log messages.

I'm actually not sure how much data is in this bucket; my guess is about 
10,000 small values (< 1 KB each).

I've done this operation many times in the past, and it can be slow. But today 
it simply never returns, and at the same time Riak starts spewing errors into 
its logs.
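
In case it's useful, here's a bounded variant of that probe I can run (a 
sketch: whether Python's global socket timeout actually applies to this 
client's connections is an assumption I haven't verified):

>>> import socket, riak
>>> socket.setdefaulttimeout(30)  # assumption: the client's sockets pick this up
>>> kv = riak.RiakClient()
>>> try:
...     keys = kv.bucket("ctv_tvdata").get_keys()  # full-bucket fold across every vnode
... except Exception as e:                          # fail fast instead of hanging forever
...     print("list-keys failed: %s" % e)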

Does that change your interpretation of these symptoms?

Dave

--
Dave Lowell
d...@connectv.com

On Nov 14, 2012, at 1:35 PM, Matthew Von-Maszewski wrote:

> Dave,
> 
> The problem seems most pronounced when you are averaging 6 megabyte values. 
> Honestly, my previous test suite only included 150k values. The 
> WriteThrottle is NOT giving you the support you need in this situation (you 
> need it to be throttling a tad more). I need to think on how to help here.
> 
> I see from the old emails that you have 32G of physical memory and this log 
> suggests you have 512 vnodes per physical server.  Is this correct?  Also, 
> how many physical servers?
> 
> Matthew
> 
> On Nov 14, 2012, at 2:00 PM, David Lowell wrote:
> 
>> Thanks Matthew. Yep, there are quite a few hits on 'waiting'. Interesting. 
>> I'll send the merged log separately.
>> 
>> Dave
>> 
>> --
>> Dave Lowell
>> d...@connectv.com
>> 
>> On Nov 14, 2012, at 10:43 AM, Matthew Von-Maszewski wrote:
>> 
>>> Dave,
>>> 
>>> Ok, heavy writes.  Let's see if leveldb has hit one of its intentional 
>>> "stalls":
>>> 
>>>> sort /var/db/riak/leveldb/*/LOG* | grep -i waiting
>>> 
>>> See if that shows any indication of a stall in the LOG files of leveldb.  
>>> If so, pick one server and send me a combined LOG file from that server:
>>> 
>>> sort /var/db/riak/leveldb/*/LOG* >LOG.all
>>> 
>>> That file will compress really well with tar and gzip.
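>>> 
>>> If it's easier to do in one step, here's a rough Python equivalent (a 
>>> sketch, assuming the same data_root; leveldb LOG lines begin with a 
>>> timestamp, so a plain sort interleaves them chronologically):
>>> 
>>>     import glob, tarfile
>>> 
>>>     # merge every vnode's LOG into one chronologically sorted file
>>>     lines = []
>>>     for path in glob.glob("/var/db/riak/leveldb/*/LOG*"):
>>>         with open(path) as f:
>>>             lines.extend(f)
>>>     with open("LOG.all", "w") as out:
>>>         out.writelines(sorted(lines))
>>> 
>>>     # compress the merged log so it mails easily
>>>     with tarfile.open("LOG.all.tar.gz", "w:gz") as tar:
>>>         tar.add("LOG.all")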
>>> 
>>> Matthew
>>> 
>>> 
>>> On Nov 14, 2012, at 1:34 PM, David Lowell wrote:
>>> 
>>>> Thanks Matthew. I've run both greps with no hits, unfortunately.
>>>> 
>>>> A couple of details I want to highlight. Since I first posted about this 
>>>> issue, we upgraded from Riak 1.2.0 to 1.2.1. Following that upgrade, we 
>>>> continue to see these periods of instability, with errors in the logs 
>>>> like "riak_kv_vnode worker pool crashed", but we have also started seeing 
>>>> lots of new error records about "Unrecognized message". In both cases, we 
>>>> see tons of these "long_gc" monitoring messages and several 
>>>> "system_memory_high_watermark" alarms during these periods. The client 
>>>> also has connection problems, such as timeouts and refused connections.
>>>> 
>>>> The logs from the last 18 hours on this server are really a mess, with 
>>>> very high levels of all of these errors. I'd be happy to send them along 
>>>> if you think that would help.
>>>> 
>>>> It's also worth noting that our application is pretty write heavy and has 
>>>> lots of parallel processes generating those writes (so there's not a lot 
>>>> of flow control if Riak bogs down, at least not yet). It's probably 
>>>> pushing Riak fairly hard.
>>>> 
>>>> Dave
>>>> 
>>>> --
>>>> Dave Lowell
>>>> d...@connectv.com
>>>> 
>>>> On Nov 14, 2012, at 8:51 AM, Matthew Von-Maszewski wrote:
>>>> 
>>>>> Dave,
>>>>> 
>>>>> Just getting my head back into the game; I was away for a few days. A 
>>>>> random thought: maybe there is a hard drive with a read problem. That 
>>>>> can cause issues similar to this. 1.2.1 does NOT percolate the read 
>>>>> errors seen in leveldb up to riak-admin (yes, that should start to 
>>>>> happen in 1.3).
>>>>> 
>>>>> I will assume your leveldb "data_root" is "/var/db/riak/leveldb" for this 
>>>>> script.  Please substitute your appropriate path from app.config and try 
>>>>> these two commands on each physical server (node):
>>>>> 
>>>>> sort /var/db/riak/leveldb/*/LOG* | grep corrupted
>>>>> sort /var/db/riak/leveldb/*/LOG* | grep checksum
>>>>> 
>>>>> If you get hits on either, we have found the problem.
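>>>>> 
>>>>> If shell isn't convenient on some box, a rough Python equivalent of 
>>>>> those two greps (a sketch; substitute your data_root as above):
>>>>> 
>>>>>     import glob
>>>>> 
>>>>>     # scan every vnode's LOG for corruption or checksum complaints
>>>>>     for path in sorted(glob.glob("/var/db/riak/leveldb/*/LOG*")):
>>>>>         with open(path) as f:
>>>>>             for line in f:
>>>>>                 if "corrupted" in line or "checksum" in line:
>>>>>                     print("%s: %s" % (path, line.strip()))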
>>>>> 
>>>>> The "LOCK file unavailable" is more of a statement about the internal 
>>>>> condition of the code instead of an error.  The message is saying that 
>>>>> the first attempt to re-open a vnode failed because the prior instance is 
>>>>> still closing (or more likely waiting for Erlang's garbage collection to 
>>>>> finish destroying things).  This message is new to 1.2 code base.
>>>>> 
>>>>> Matthew
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On Nov 7, 2012, at 6:56 PM, David Lowell wrote:
>>>>> 
>>>>>> After further thought, I want to add more color to this issue. I 
>>>>>> hypothesize that the symptoms I described here were continued fallout 
>>>>>> from an earlier crash, so I've waded further back into the logs to try 
>>>>>> to shed light on how the Riak process was doing prior to this time. It 
>>>>>> was unhappy: it appears to have failed and stopped several times in the 
>>>>>> half day prior. The pattern revealed in the logs looks something like 
>>>>>> this:
>>>>>> 
>>>>>> - Following a clean startup, the service runs for a while
>>>>>> 
>>>>>> - Usually the first log entry of any kind near the beginning of a long 
>>>>>> block of error logs is "alarm_handler: 
>>>>>> {set,{system_memory_high_watermark,[]}}" (is this indicating excessive 
>>>>>> memory use?)
>>>>>> 
>>>>>> - Within a few minutes we see several log messages warning about 
>>>>>> "long_gc", which I assume is an indication that garbage collection took 
>>>>>> longer than some threshold
>>>>>> 
>>>>>> - Within the next minute or two, we start to see the legions of errors, 
>>>>>> "riak_kv_vnode worker pool crashed", and "gen_fsm" having some sort of 
>>>>>> timeout when trying to communicate with the eleveldb backend
>>>>>> 
>>>>>> - Eventually we see a log record indicating "Failed to start 
>>>>>> riak_kv_eleveldb" because of a leveldb LOCK file being temporarily 
>>>>>> unavailable
>>>>>> 
>>>>>> - Then Riak starts to exit: riak:stop:46 "backend module failed to 
>>>>>> start."
>>>>>> 
>>>>>> 
>>>>>> Not really knowing the Riak internals, I find it a little difficult to 
>>>>>> piece together the story here. It could be that we're running low on 
>>>>>> memory. It's hard to know why the riak_kv workers are failing, or why 
>>>>>> this leveldb LOCK file is unavailable. To those more learned: do these 
>>>>>> log records tell a story?
>>>>>> 
>>>>>> For the record, we're using the default 8 MB of leveldb cache per 
>>>>>> vnode, so that ought to cap the cache for our 512 vnodes at 4 GB. Our 
>>>>>> host has 32 GB of physical memory. Are there other pieces of Riak that 
>>>>>> can use a lot of memory that we should look out for?
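>>>>>> 
>>>>>> For what it's worth, the back-of-envelope math, plus a caveat I'm 
>>>>>> unsure about: if I understand leveldb correctly, the block cache isn't 
>>>>>> its only per-vnode memory, since each vnode also carries a write buffer 
>>>>>> and open-file overhead (the 4 MB figure below is stock leveldb's 
>>>>>> default write buffer, an assumption about our build):
>>>>>> 
>>>>>>     vnodes = 512
>>>>>>     cache_mb = 8       # default leveldb block cache per vnode
>>>>>>     write_buf_mb = 4   # assumption: stock leveldb default write buffer
>>>>>>     print("block cache alone: %.1f GB" % (vnodes * cache_mb / 1024.0))
>>>>>>     print("cache + write buffers: %.1f GB"
>>>>>>           % (vnodes * (cache_mb + write_buf_mb) / 1024.0))
>>>>>>     # -> 4.0 GB of cache, ~6.0 GB once write buffers are counted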
>>>>>> 
>>>>>> I'll include a few of the actual log records for reference, below.
>>>>>> 
>>>>>> Dave
>>>>>> 
>>>>>> --
>>>>>> Dave Lowell
>>>>>> d...@connectv.com
>>>>>> 
>>>>>> Representative logs, many similar ones deleted for brevity:
>>>>>> 
>>>>>> 2012-11-07 01:41:02.395 [info] <0.51.0> alarm_handler: 
>>>>>> {set,{system_memory_high_watermark,[]}}
>>>>>> 2012-11-07 01:55:50.517 [info] 
>>>>>> <0.73.0>@riak_core_sysmon_handler:handle_event:85 monitor long_gc 
>>>>>> <0.18585.32> 
>>>>>> [{initial_call,{riak_core_coverage_fsm,init,1}},{almost_current_function,{riak_core_coverage_plan,'-next_vnode/2-fun-0-',2}},{message_queue_len,0}]
>>>>>>  
>>>>>> [{timeout,219},{old_heap_block_size,0},{heap_block_size,28657},{mbuf_size,0},{stack_size,48},{old_heap_size,0},{heap_size,11430}]
>>>>>> 2012-11-07 01:56:20.303 [error] 
>>>>>> <0.9231.0>@riak_core_vnode:handle_info:510 
>>>>>> 1258832464966656615093408225054454710289582522368 riak_kv_vnode worker 
>>>>>> pool crashed 
>>>>>> {timeout,{gen_fsm,sync_send_event,[<0.9234.0>,{checkout,false,5000},5000]}}
>>>>>> 2012-11-07 01:56:22.382 [error] <0.10002.0> gen_fsm <0.10002.0> in state 
>>>>>> ready terminated with reason: 
>>>>>> {timeout,{gen_server,call,[<0.10005.0>,{work,<0.10003.0>,{fold,#Fun<riak_kv_eleveldb_backend.3.97398576>,#Fun<riak_kv_vnode.14.47983300>},{fsm,{40916762,{1398702738851840683437120250060505233655091691520,'riak@10.0.3.11'}},<0.18429.32>}}]}}
>>>>>> 2012-11-07 01:57:11.833 [error] <0.19755.32>@riak_kv_vnode:init:265 
>>>>>> Failed to start riak_kv_eleveldb_backend Reason: {db_open,"IO error: 
>>>>>> lock 
>>>>>> /var/data/ctv/riak/leveldb/1258832464966656615093408225054454710289582522368/LOCK:
>>>>>>  Resource temporarily unavailable"}
>>>>>> 2012-11-07 01:57:27.425 [info] 
>>>>>> <0.73.0>@riak_core_sysmon_handler:handle_event:85 monitor long_gc 
>>>>>> <0.19181.32> 
>>>>>> [{initial_call,{riak_core_coverage_fsm,init,1}},{almost_current_function,{gen_fsm,loop,7}},{message_queue_len,0}]
>>>>>>  
>>>>>> [{timeout,109},{old_heap_block_size,0},{heap_block_size,28657},{mbuf_size,0},{stack_size,47},{old_heap_size,0},{heap_size,9171}]
>>>>>> 2012-11-07 01:57:51.109 [error] <0.10002.0> CRASH REPORT Process 
>>>>>> <0.10002.0> with 0 neighbours exited with reason: 
>>>>>> {timeout,{gen_server,call,[<0.10005.0>,{work,<0.10003.0>,{fold,#Fun<riak_kv_eleveldb_backend.3.97398576>,#Fun<riak_kv_vnode.14.47983300>},{fsm,{40916762,{1398702738851840683437120250060505233655091691520,'riak@10.0.3.11'}},<0.18429.32>}}]}}
>>>>>>  in gen_fsm:terminate/7 line 
>>>>>> 
>>>>>> 
>>>>>> On Nov 7, 2012, at 11:34 AM, David Lowell wrote:
>>>>>> 
>>>>>>> Hello Riak Folks,
>>>>>>> 
>>>>>>> For the last three days, we've been having a string of problems with 
>>>>>>> Riak. An otherwise healthy server running our full application stack 
>>>>>>> will suddenly start throwing a bunch of errors in its logs. Although 
>>>>>>> the Riak processes stay up, most or all requests to Riak fail during 
>>>>>>> these periods.
>>>>>>> 
>>>>>>> The errors in the logs predominantly describe "riak_kv_vnode worker 
>>>>>>> pool crashed" and timeout conditions. This morning, this crashy 
>>>>>>> behavior started immediately after a clean Riak startup and a single 
>>>>>>> call to our API, so the logs are quite free of other noise. I've 
>>>>>>> summarized those logs below for curious parties, and can attach the 
>>>>>>> full set if needed.
>>>>>>> 
>>>>>>> I forgot to check this morning, but during a similar outage on Monday, 
>>>>>>> the Riak server was refusing connections to new clients.
>>>>>>> 
>>>>>>> Interestingly, after giving Riak a while with no traffic at all today 
>>>>>>> (like 15-30 minutes), it appears to have recovered without a restart. 
>>>>>>> We've had similar recoveries during other "outages" of this type since 
>>>>>>> Sunday evening. Facilitating this sort of recovery does seem to require 
>>>>>>> shutting down all application KV requests for a while.
>>>>>>> 
>>>>>>> We're suspicious of some kind of corruption in the eleveldb on-disk 
>>>>>>> files, because in past outages of this type, we've observed that the 
>>>>>>> condition persists over reboots. But we don't have much more evidence 
>>>>>>> than that. Is there a command we can run that will check over eleveldb 
>>>>>>> files for corruption or inconsistency?
>>>>>>> 
>>>>>>> Other than that, what can cause "worker pool crashed" events? What do 
>>>>>>> you know about the "timeouts" that appear in these logs?
>>>>>>> 
>>>>>>> For the record, we're running Riak 1.2.0 on Ubuntu 10.04, with the 
>>>>>>> eleveldb backend and 512 partitions. We're running predominantly in a 
>>>>>>> single-node configuration on a bunch of isolated dev boxes at the 
>>>>>>> moment, on our way to spreading our 512 vnodes across 5 hosts in 
>>>>>>> production.
>>>>>>> 
>>>>>>> Thanks for your help,
>>>>>>> 
>>>>>>> Dave
>>>>>>> 
>>>>>>> --
>>>>>>> Dave Lowell
>>>>>>> d...@connectv.com
>>>>>>> 
>>>>>>> 
>>>>>>> 2012-11-07 18:11:03.398 [info] <0.7.0> Application lager started on 
>>>>>>> node 'riak@10.0.3.11'
>>>>>>> 
>>>>>>> ... normal startup messages ...
>>>>>>> 
>>>>>>> 2012-11-07 18:11:50.109 [info] 
>>>>>>> <0.10582.0>@riak_core:wait_for_application:419 Wait complete for 
>>>>>>> application riak_search (0 seconds)
>>>>>>> 2012-11-07 18:22:18.509 [error] 
>>>>>>> <0.2897.0>@riak_core_vnode:handle_info:510 
>>>>>>> 105616329260241031198313161739262640092323250176 riak_kv_vnode worker 
>>>>>>> pool crashed 
>>>>>>> {timeout,{gen_server,call,[<0.2902.0>,{work,<0.2900.0>,{fold,#Fun<riak_kv_eleveldb_backend.3.97398576>,#Fun<riak_kv_vnode.14.47983300>},{fsm,{96247562,{105616329260241031198313161739262640092323250176,'riak@10.0.3.11'}},<0.15324.0>}}]}}
>>>>>>> 2012-11-07 18:22:18.509 [error] <0.2899.0> gen_fsm <0.2899.0> in state 
>>>>>>> ready terminated with reason: 
>>>>>>> {timeout,{gen_server,call,[<0.2902.0>,{work,<0.2900.0>,{fold,#Fun<riak_kv_eleveldb_backend.3.97398576>,#Fun<riak_kv_vnode.14.47983300>},{fsm,{96247562,{105616329260241031198313161739262640092323250176,'riak@10.0.3.11'}},<0.15324.0>}}]}}
>>>>>>> 
>>>>>>> ... 13 more "riak_kv_vnode worker pool crashed" messages...
>>>>>>> 
>>>>>>> 2012-11-07 18:22:21.245 [error] <0.2899.0> CRASH REPORT Process 
>>>>>>> <0.2899.0> with 0 neighbours exited with reason: 
>>>>>>> {timeout,{gen_server,call,[<0.2902.0>,{work,<0.2900.0>,{fold,#Fun<riak_kv_eleveldb_backend.3.97398576>,#Fun<riak_kv_vnode.14.47983300>},{fsm,{96247562,{105616329260241031198313161739262640092323250176,'riak@10.0.3.11'}},<0.15324.0>}}]}}
>>>>>>>  in gen_fsm:terminate/7 line 611
>>>>>>> 2012-11-07 18:22:21.844 [error] <0.2944.0> gen_fsm <0.2944.0> in state 
>>>>>>> ready terminated with reason: 
>>>>>>> {timeout,{gen_server,call,[<0.2947.0>,{work,<0.2945.0>,{fold,#Fun<riak_kv_eleveldb_backend.3.97398576>,#Fun<riak_kv_vnode.14.47983300>},{fsm,{96247562,{114179815416476790484662877555959610910619729920,'riak@10.0.3.11'}},<0.15324.0>}}]}}
>>>>>>> 
>>>>>>> ... 13 more "CRASH REPORT Process <X> with 0 neighbours exited with 
>>>>>>> reason" and "gen_fsm <0.2989.0> in state ready terminated with reason" 
>>>>>>> message pairs
>>>>>>> 
>>>>>>> 2012-11-07 18:23:21.427 [error] <0.15322.0> gen_server <0.15322.0> 
>>>>>>> terminated with reason: 
>>>>>>> {error,{case_clause,{error,timeout,[[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[<<"1352256943.4983411">>],[],[],[],...]}},...}
>>>>>>> 2012-11-07 18:23:21.495 [error] <0.15322.0> CRASH REPORT Process 
>>>>>>> <0.15322.0> with 0 neighbours exited with reason: 
>>>>>>> {error,{case_clause,{error,timeout,[[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[<<"1352256943.4983411">>],[],[],[],...]}},...}
>>>>>>>  in gen_server:terminate/6 line 747
>>>>>>> 2012-11-07 18:23:21.525 [error] <0.10590.0> Supervisor riak_api_pb_sup 
>>>>>>> had child undefined started with 
>>>>>>> {riak_api_pb_server,start_link,undefined} at <0.15322.0> exit with 
>>>>>>> reason 
>>>>>>> {error,{case_clause,{error,timeout,[[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[<<"1352256943.4983411">>],[],[],[],...]}},...}
>>>>>>>  in context child_terminated
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 
> 

_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
