riak-users, Antti's issues has been addressed via a correction to leveldb. Upcoming releases of Riak will of course contain this correction. The fix for Riak 2.x is tagged 2.0.11 at github.com/basho/leveldb. The same fix is also part of the leveldb "develop" branch.
Bug discussion is here: https://github.com/basho/leveldb/wiki/mv-startup-segfault Matthew > On Jan 1, 2016, at 3:25 PM, Antti Kuusela <antti.kuus...@firstbeat.com> wrote: > > Hi Matthew, > > This phenomenon started after upgrade to 2.1.2. I downgraded the servers to > 2.1.1, now waiting to see if it makes a difference. > > I installed the binary packages from Basho repo. > > > > ________________________________________ > From: Matthew Von-Maszewski <matth...@basho.com> > Sent: 31 December 2015 18:25 > To: Antti Kuusela > Cc: Luke Bakken; riak-users > Subject: Re: Leveldb segfault during Riak startup > > I also failed to ask two basic questions: > > 1. did this failure start after your upgrade to 2.1.3, or happen prior to > upgrade also? > > 2. did you use a Basho package for Centos 7, or did you build from source > code? > > Matthew > > >> On Dec 31, 2015, at 6:06 AM, Antti Kuusela <antti.kuus...@firstbeat.fi> >> wrote: >> >> Hi Luke, >> >> We erased the btrfs file system, replaced it with xfs on lvm with >> thinly-provisioned volumes and continued testing with a new database from >> scratch. The same problem continues, though. From /var/log/messages: >> >> Dec 31 03:35:31 storage5 riak[66419]: Starting up >> Dec 31 03:35:45 storage5 kernel: traps: beam.smp[66731] general protection >> ip:7f02280b3f16 sp:7f0197ad8dd0 error:0 in eleveldb.so[7f0228066000+93000] >> Dec 31 03:35:46 storage5 run_erl[66417]: Erlang closed the connection. >> >> >> >> 18.12.2015, 16:45, Luke Bakken kirjoitti: >>> Hi Antti, >>> >>> Riak is not tested on btrfs and the file system is not officially >>> supported. We recommend ext4 or xfs for Linux. ZFS is an option on >>> Solaris derivatives and FreeBSD. >>> >>> -- >>> Luke Bakken >>> Engineer >>> lbak...@basho.com >>> >>> >>> On Fri, Dec 18, 2015 at 6:14 AM, Antti Kuusela >>> <antti.kuus...@firstbeat.fi> wrote: >>>> Hi, >>>> >>>> I have been testing Riak and Riak CS as a possible solution for our future >>>> storage needs. I have a five server cluster running Centos 7. Riak version >>>> is 2.1.3 (first installed as 2.1.1, updated twice via Basho repo) and Riak >>>> CS version is 2.1.0. The servers each have 64GB RAM and six 4TB disks in >>>> raid 6 using btrfs. >>>> >>>> I have been pushing random data into Riak-CS via s3cmd to see how the >>>> system >>>> behaves. Smallest objects have been 2000 bytes, largest 100MB. I have also >>>> been making btrfs snapshots of the entire platform data dir nightly for >>>> backup purposes. Stop Riak CS, wait 10 seconds, stop Riak, wait 10, make >>>> snapshot, start Riak, wait 180 seconds, start Riak CS. This is performed on >>>> each of the servers in turn with a five minute wait in between. I have >>>> added >>>> the waits to try spread the startup load and allow the system time to get >>>> things running. New data is constantly pushed to the S3 API but restarting >>>> the nodes in rotation causes by far the highest stress on the system. >>>> >>>> I have encountered one problem in particular. Quite often one of the Riak >>>> nodes starts up but after a couple of minutes it just drops, all processes >>>> exited except for epmd. >>>> >>>> Following is from /var/log/riak/console, most of the lines skipped for sake >>>> of brevity. Normal startup stuff, as far as I can see: >>>> >>>> 2015-12-16 00:26:04.446 [info] <0.7.0> Application lager started on node >>>> 'riak@192.168.50.32' >>>> ... >>>> 2015-12-16 00:26:04.490 [info] <0.72.0> alarm_handler: >>>> {set,{system_memory_high_watermark,[]}} >>>> ... >>>> 2015-12-16 00:26:04.781 [info] >>>> <0.206.0>@riak_core_capability:process_capability_changes:555 New >>>> capability: {riak_core,vnode_routing} = proxy >>>> ... >>>> 2015-12-16 00:26:04.869 [info] <0.7.0> Application riak_core started on >>>> node >>>> 'riak@192.168.50.32' >>>> ... >>>> 2015-12-16 00:26:04.969 [info] <0.407.0>@riak_kv_env:doc_env:46 Environment >>>> and OS variables: >>>> 2015-12-16 00:26:05.124 [warning] <0.6.0> lager_error_logger_h dropped 9 >>>> messages in the last second that exceeded the limit of 100 messages/sec >>>> 2015-12-16 00:26:05.124 [info] <0.407.0> riak_kv_env: Open file limit: >>>> 65536 >>>> 2015-12-16 00:26:05.124 [warning] <0.407.0> riak_kv_env: Cores are >>>> disabled, >>>> this may hinder debugging >>>> 2015-12-16 00:26:05.124 [info] <0.407.0> riak_kv_env: Erlang process limit: >>>> 262144 >>>> 2015-12-16 00:26:05.125 [info] <0.407.0> riak_kv_env: Erlang ports limit: >>>> 65536 >>>> 2015-12-16 00:26:05.125 [info] <0.407.0> riak_kv_env: ETS table count >>>> limit: >>>> 256000 >>>> 2015-12-16 00:26:05.125 [info] <0.407.0> riak_kv_env: Thread pool size: 64 >>>> 2015-12-16 00:26:05.125 [info] <0.407.0> riak_kv_env: Generations before >>>> full sweep: 0 >>>> 2015-12-16 00:26:05.125 [info] <0.407.0> riak_kv_env: Schedulers: 12 for 12 >>>> cores >>>> 2015-12-16 00:26:05.125 [info] <0.407.0> riak_kv_env: sysctl vm.swappiness >>>> is 0 greater than or equal to 0) >>>> 2015-12-16 00:26:05.125 [info] <0.407.0> riak_kv_env: sysctl >>>> net.core.wmem_default is 8388608 lesser than or equal to 8388608) >>>> ... >>>> 2015-12-16 00:26:05.139 [info] <0.478.0>@riak_core:wait_for_service:504 >>>> Waiting for service riak_kv to start (0 seconds) >>>> 2015-12-16 00:26:05.158 [info] >>>> <0.495.0>@riak_kv_entropy_manager:set_aae_throttle_limits:790 Setting AAE >>>> throttle limits: >>>> [{-1,0},{200,10},{500,50},{750,250},{900,1000},{1100,5000}] >>>> ... >>>> 2015-12-16 00:26:30.160 [info] >>>> <0.495.0>@riak_kv_entropy_manager:perhaps_log_throttle_change:853 Changing >>>> AAE throttle from undefined -> 5000 msec/key, based on maximum vnode >>>> mailbox >>>> size {unknown_mailbox_sizes,node_list,['riak@192.168.50.32']} from >>>> ['riak@192.168.50.32'] >>>> 2015-12-16 00:27:12.053 [info] <0.478.0>@riak_core:wait_for_service:504 >>>> Waiting for service riak_kv to start (60 seconds) >>>> 2015-12-16 00:28:25.057 [info] <0.478.0>@riak_core:wait_for_service:504 >>>> Waiting for service riak_kv to start (120 seconds) >>>> >>>> And then nothing >>>> >>>> From /var/log/messages: >>>> >>>> Dec 16 00:26:02 storage2 su: (to riak) root on none >>>> Dec 16 00:26:04 storage2 riak[48174]: Starting up >>>> Dec 16 00:28:59 storage2 kernel: traps: beam.smp[48492] general protection >>>> ip:7fcaf9402f16 sp:7fca6affcdd0 error:0 in eleveldb.so[7fcaf93b5000+93000] >>>> Dec 16 00:28:59 storage2 run_erl[48172]: Erlang closed the connection. >>>> >>>> On another node at a different time /var/log/riak/console.log had similar >>>> messages, and also some warnings about invalid hint files, such as: >>>> >>>> 2015-12-13 00:15:41.232 [warning] <0.815.0> Hintfile >>>> '/data/riak/bitcask/570899077082383952423314387779798054553098649600/56.bitcask.hint' >>>> invalid >>>> >>>> In this latter example riak was started with "systemctl start riak" rather >>>> than "riak start". From /var/log/messages: >>>> >>>> Dec 13 00:15:29 storage1 riak: Starting riak: [ OK ] >>>> Dec 13 00:15:29 storage1 systemd: Started SYSV: Riak is a distributed data >>>> store. >>>> Dec 13 00:15:56 storage1 kernel: beam.smp[131820]: segfault at 160 ip >>>> 00007f24c0902ce6 sp 00007f24337fddd0 error 4 in >>>> eleveldb.so[7f24c08b5000+93000] >>>> Dec 13 00:15:56 storage1 run_erl[131501]: Erlang closed the connection. >>>> >>>> Of reported Riak bugs, this is similar to >>>> https://github.com/basho/riak/issues/790 . However, the poster of that >>>> issue >>>> reported that his problem was fixed by repairing leveldb partitions. I >>>> looked at this following >>>> http://docs.basho.com/riak/latest/ops/running/recovery/repairing-leveldb/ >>>> but didn't find any errors. >>>> >>>> Incidentally, I started having problems with btrfs as well. On one node >>>> btrfs caused a kernel crash and on another kernel killed beam.smp process >>>> after it stopped responding for over 120 seconds while syncing to btrfs. >>>> The >>>> kernel in Centos 7 probably isn't best suited for working with btrfs. >>>> Advertised version is 3.10.0. >>>> >>>> So, my question is what is your take on this? Is this a bug in the leveldb >>>> library? The same one already reported? What log data would help debug or >>>> reproduce it? Or is there potentially some problem with my setup? Or could >>>> this be caused by a bug in btrfs? What is your take on using Riak with >>>> btrfs? >>>> >>>> -- >>>> Antti Kuusela, M.Sc >>>> Senior Software Developer >>>> Firstbeat Technologies Ltd. >>>> >>>> >>>> _______________________________________________ >>>> riak-users mailing list >>>> riak-users@lists.basho.com >>>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com >>> . >>> >> >> >> -- >> Antti Kuusela, M.Sc >> Senior Software Developer >> Firstbeat Technologies Ltd. >> >> >> _______________________________________________ >> riak-users mailing list >> riak-users@lists.basho.com >> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com > _______________________________________________ riak-users mailing list riak-users@lists.basho.com http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com