Re: Recreation of default bucket type
Hi Douglas. Would you mind sharing the riak_core section of your app.config, along with the output of `riak-admin bucket-type status default` from a node that has been restarted after app.config was updated with the change to the default bucket properties?

On Wed, Jun 10, 2015 at 10:06 AM, Douglas Isaksson douglas.isaks...@x5music.com wrote:

Hi,

Ran into this issue when upgrading from 1.4.X to 2.1.1: https://github.com/basho/riak/issues/727
allow_mult leaked into my default bucket type. Is there a way to change the allow_mult setting in the default bucket type? I'm getting no_default_update when trying, and setting allow_mult in app.config still generates siblings. Or are my only options to create a new bucket type or roll back the upgrade?

Best,
Douglas

--
Douglas Isaksson, Senior System Developer
X5 MUSIC GROUP, Slussplan 9, 111 30 Stockholm, Sweden
Direct: +46 70 595 85 07 | Skype: douglas.isaksson
E-mail: douglas.isaks...@x5music.com | Web: www.x5music.com http://www.x5music.com/
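Coming back to the question at the top: a quick way to see what a node currently reports for the default type is to run the status command and filter for the sibling setting. This is only a sketch; it assumes riak-admin is on the PATH of the node being inspected, and grep is used purely for convenience:

  # show the effective properties of the default bucket type on this node
  riak-admin bucket-type status default
  # narrow the output to the sibling-related property
  riak-admin bucket-type status default | grep allow_mult

If the reported allow_mult does not match what the default_bucket_props entry in the riak_core section of app.config says, that output will be the most useful thing to share.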
Re: Riak 1.4.12 high I/O while almost idle
Hi Timo. Did you check if there is active anti-entropy activity? [1] That could generate a lot of I/O in the background while building the trees with the data needed for automatic repair or running exchanges to verify replicas are up to date.

[1] http://docs.basho.com/riak/latest/ops/advanced/aae/#Monitoring-AAE

On Fri, May 22, 2015 at 10:54 AM, Doug Rohrer droh...@basho.com wrote:

Timo: The general layout of the information is pretty similar to Top (and described on the etop docs http://www.erlang.org/doc/man/etop.html and http://www.erlang.org/doc/apps/observer/etop_ug.html) - the one downside is that it doesn't profile which function is using the most CPU (the Reds column is Reductions, which is Erlang's measure of work being done by a process). If you run it for a while longer, with a shorter -interval setting, you may be able to catch what function is being run most frequently other than the gen_server:loop and gen_event:loop functions you're seeing, as they are generic OTP functions and not very useful for diagnostics.

Doug

On May 22, 2015, at 9:58 AM, Timo Gatsonides t...@me.com wrote:

Try running `riak-admin top` on the node to see what's going on: http://docs.basho.com/riak/1.4.12/ops/running/tools/riak-admin/#top This may give you some insight into what that node is doing.

Thanks for the quick response. I had already done that, however I don't understand the output … It is below.

-Timo

=== ''                                                      13:56:31
Load:  cpu       0    Memory:  total      69336    binary      2717
       procs   664             processes   8439    code       10720
       runq      0             atom         501    ets         6227

Pid            Name or Initial Func       Time  Reds     Memory  MsgQ  Current Function
---
6206.105.0     erlang:apply/2             '-'   5091705    2600     0  cpu_sup:measurement_server_loop/1
6206.181.0     riak_core_vnode_manager    '-'   3064188  231536     0  gen_server:loop/6
6206.580.0     proc_lib:init_p/5          '-'   2805797   34472     0  gen_fsm:loop/7
6206.715.0     proc_lib:init_p/5          '-'   2485959   55144     0  gen_fsm:loop/7
6206.709.0     proc_lib:init_p/5          '-'   2445894   13896     3  eleveldb:get/3
6206.94.0      riak_sysmon_filter         '-'   2394053   13656     0  gen_server:loop/6
6206.714.0     proc_lib:init_p/5          '-'   2177064   34472     0  gen_fsm:loop/7
6206.710.0     proc_lib:init_p/5          '-'   2153568   34472     0  gen_fsm:loop/7
6206.3.0       erl_prim_loader            '-'   2053533  142464     0  erl_prim_loader:loop/3
6206.707.0     proc_lib:init_p/5          '-'   2032382   34472     0  gen_fsm:loop/7

=== ''                                                      13:56:36
Load:  cpu       7    Memory:  total      70450    binary      2994
       procs   665             processes   9133    code       10720
       runq      3             atom         501    ets         6247

Pid            Name or Initial Func       Time  Reds     Memory  MsgQ  Current Function
---
6206.105.0     erlang:apply/2             '-'      8092    2600     0  cpu_sup:measurement_server_loop/1
6206.580.0     proc_lib:init_p/5          '-'      6710   34472     0  eleveldb:write/3
6206.94.0      riak_sysmon_filter         '-'      6447   13656     0  gen_server:loop/6
6206.709.0     proc_lib:init_p/5          '-'      6174   34472     0  gen_fsm:loop/7
6206.28437.0   erlang:apply/2             '-'      3623   21440     0  erlang:receive_emd/3
6206.707.0     proc_lib:init_p/5          '-'      3089   21696     0  gen_fsm:loop/7
6206.711.0     proc_lib:init_p/5          '-'      3075   34472     0  gen_fsm:loop/7
6206.374.0     riak_kv_stat_sj_2          '-'      2679    3856     0  gen_server:loop/6
6206.714.0     proc_lib:init_p/5          '-'      1955   21696     0  gen_fsm:loop/7
6206.7.0       application_controller     '-'      1779   55880     0  gen_server:loop/6

=== ''
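To check the anti-entropy activity mentioned at the top, something along these lines on the busy node is a reasonable starting point. This is a sketch, not a prescription: aae-status and the -interval/-sort options are standard riak-admin features, but pick values that suit your node:

  # show when AAE trees were last built and exchanged, per partition
  riak-admin aae-status
  # sample the busiest Erlang processes every 2 seconds, sorted by reductions (work done)
  riak-admin top -interval 2 -sort reductions

If aae-status shows trees still being built, that alone can explain sustained background I/O on an otherwise idle node.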
Re: Java Client: Thread hangs after strange exception
Hi Henning. While Alex Moore in the clients team is looking at the client side of this issue, I looked up the error messages you were getting in the log and sent to him. Messages like this: 2015-05-08 14:12:18.103 [error] 0.937.0 gen_server 0.937.0 terminated with reason: no function clause matching riak_object:from_binary(progress, UploadServiceTest10, {{ts,{1431,86330,742283}},53,1,0,0,0,34,131,108,0,0,0,1, 104,2,109,0,0,0,8,35,9,254,249,83,136,...}) line 633 Are caused by an issue that since has been fixed in +2.0, where internal timestamps from the memory backend when using a TTL could be seen while internal processes iterated over the key/values: https://github.com/basho/riak_kv/pull/607 I agree with you in that I don't believe this is directly causing your client hang issue. On Tue, Apr 28, 2015 at 6:40 AM, Henning Verbeek hankipa...@gmail.com wrote: For development, I have a single-node Riak 1.4 cluster that I'm connecting to with the Java client 2.0.1. Both client and server are running on the same node, connecting via localhost. Every now and then, an update operation ... hangs. On the console I see this strange error: 12:21:46.646 [nioEventLoopGroup-2-1] ERROR com.basho.riak.client.core.RiakNode - Operation onException() channel: id:-1348399652 localhost:8087 {} io.netty.util.concurrent.BlockingOperationException: DefaultChannelPromise@63f65d07(incomplete) at io.netty.util.concurrent.DefaultPromise.checkDeadLock(DefaultPromise.java:396) ~[storage-backend-2.1-SNAPSHOT.jar:2.1-SNAPSHOT] at io.netty.channel.DefaultChannelPromise.checkDeadLock(DefaultChannelPromise.java:157) ~[storage-backend-2.1-SNAPSHOT.jar:2.1-SNAPSHOT] at io.netty.util.concurrent.DefaultPromise.await(DefaultPromise.java:257) ~[storage-backend-2.1-SNAPSHOT.jar:2.1-SNAPSHOT] at io.netty.channel.DefaultChannelPromise.await(DefaultChannelPromise.java:129) ~[storage-backend-2.1-SNAPSHOT.jar:2.1-SNAPSHOT] at io.netty.channel.DefaultChannelPromise.await(DefaultChannelPromise.java:28) ~[storage-backend-2.1-SNAPSHOT.jar:2.1-SNAPSHOT] at com.basho.riak.client.core.RiakNode.doGetConnection(RiakNode.java:667) ~[storage-backend-2.1-SNAPSHOT.jar:2.1-SNAPSHOT] at com.basho.riak.client.core.RiakNode.getConnection(RiakNode.java:636) ~[storage-backend-2.1-SNAPSHOT.jar:2.1-SNAPSHOT] at com.basho.riak.client.core.RiakNode.execute(RiakNode.java:570) ~[storage-backend-2.1-SNAPSHOT.jar:2.1-SNAPSHOT] at com.basho.riak.client.core.DefaultNodeManager.executeOnNode(DefaultNodeManager.java:90) ~[storage-backend-2.1-SNAPSHOT.jar:2.1-SNAPSHOT] at com.basho.riak.client.core.RiakCluster.execute(RiakCluster.java:201) ~[storage-backend-2.1-SNAPSHOT.jar:2.1-SNAPSHOT] at com.basho.riak.client.core.RiakCluster.execute(RiakCluster.java:195) ~[storage-backend-2.1-SNAPSHOT.jar:2.1-SNAPSHOT] at com.basho.riak.client.api.commands.kv.StoreValue.executeAsync(StoreValue.java:117) ~[storage-backend-2.1-SNAPSHOT.jar:2.1-SNAPSHOT] at com.basho.riak.client.api.commands.kv.UpdateValue$1.handle(UpdateValue.java:182) ~[storage-backend-2.1-SNAPSHOT.jar:2.1-SNAPSHOT] at com.basho.riak.client.api.commands.ListenableFuture.notifyListeners(ListenableFuture.java:78) ~[storage-backend-2.1-SNAPSHOT.jar:2.1-SNAPSHOT] at com.basho.riak.client.api.commands.CoreFutureAdapter.handle(CoreFutureAdapter.java:120) ~[storage-backend-2.1-SNAPSHOT.jar:2.1-SNAPSHOT] at com.basho.riak.client.core.FutureOperation.fireListeners(FutureOperation.java:131) ~[storage-backend-2.1-SNAPSHOT.jar:2.1-SNAPSHOT] at com.basho.riak.client.core.FutureOperation.setResponse(FutureOperation.java:170) 
~[storage-backend-2.1-SNAPSHOT.jar:2.1-SNAPSHOT] at com.basho.riak.client.core.RiakNode.onSuccess(RiakNode.java:824) ~[storage-backend-2.1-SNAPSHOT.jar:2.1-SNAPSHOT] at com.basho.riak.client.core.netty.RiakResponseHandler.channelRead(RiakResponseHandler.java:58) ~[storage-backend-2.1-SNAPSHOT.jar:2.1-SNAPSHOT] at io.netty.channel.DefaultChannelHandlerContext.invokeChannelRead(DefaultChannelHandlerContext.java:340) [storage-backend-2.1-SNAPSHOT.jar:2.1-SNAPSHOT] at io.netty.channel.DefaultChannelHandlerContext.fireChannelRead(DefaultChannelHandlerContext.java:326) [storage-backend-2.1-SNAPSHOT.jar:2.1-SNAPSHOT] at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:155) [storage-backend-2.1-SNAPSHOT.jar:2.1-SNAPSHOT] at io.netty.handler.codec.ByteToMessageCodec.channelRead(ByteToMessageCodec.java:108) [storage-backend-2.1-SNAPSHOT.jar:2.1-SNAPSHOT] at io.netty.channel.DefaultChannelHandlerContext.invokeChannelRead(DefaultChannelHandlerContext.java:340) [storage-backend-2.1-SNAPSHOT.jar:2.1-SNAPSHOT] at io.netty.channel.DefaultChannelHandlerContext.fireChannelRead(DefaultChannelHandlerContext.java:326) [storage-backend-2.1-SNAPSHOT.jar:2.1-SNAPSHOT] at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:785) [storage-backend-2.1-SNAPSHOT.jar:2.1-SNAPSHOT] at
Re: uneven disk distribution
Hi Johnny. Make sure that the configuration on that node is not different from the others. For example, it could be configured to never merge Bitcask files, so that space could never be reclaimed.

http://docs.basho.com/riak/latest/ops/advanced/backends/bitcask/#Configuring-Bitcask

On Thu, May 14, 2015 at 4:31 PM, Johnny Tan johnnyd...@gmail.com wrote:

We have a 6-node test riak cluster. One of the nodes seems to be using far more disk:

staging-riak001.pp /dev/sda3 15G 6.3G 7.2G 47% /
staging-riak002.pp /dev/sda3 15G 6.4G 7.1G 48% /
staging-riak003.pp /dev/sda3 15G 6.1G 7.5G 45% /
staging-riak004.pp /dev/sda3 15G  14G 266M 99% /
staging-riak005.pp /dev/sda3 15G 5.8G 7.7G 44% /
staging-riak006.pp /dev/sda3 15G 6.3G 7.3G 47% /

Specifically, /var/lib/riak/bitcask is using up most of that space. It seems to have files in there that are much older than any of the other nodes. We've done maintenance of various sorts on this cluster -- as the name indicates, we use it as a staging ground before we go to production. I don't recall a specific issue per se, but I wouldn't rule it out. Is there a way to figure out if there's an underlying issue here, or whether some of this disk space is not really current and can somehow be purged? What info would help answer those questions?

johnny
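To help answer Johnny's last question, a rough way to see where the space on the busy node is going is to compare partition directory sizes and file ages. This assumes the /var/lib/riak/bitcask data root mentioned above and GNU coreutils/findutils; adjust paths for your layout:

  # largest Bitcask partition directories first
  du -sh /var/lib/riak/bitcask/* | sort -rh | head
  # oldest Bitcask data files still on disk
  find /var/lib/riak/bitcask -name '*.bitcask.data' -printf '%T+ %p\n' | sort | head

Comparing that output between the 99% node and a healthy node should show fairly quickly whether old, unmerged files are the culprit.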
Re: Merge error in bitcask data store
Hi Lucas. Unfortunately, you have hit a bug in the Bitcask backend that we overlooked. The attached beam file is a patched version of the backend code that should resolve the issue for you. To use it, stop each node, then place riak_kv_bitcask_backend.beam in your basho-patches directory, then restart it. The merges should complete without crashing. I believe in your system the actual location is /usr/lib/riak/lib/basho-patches. We will add this problem and the patch to the known issues information soon. Please let us know if you come across any other issues with the merges. On Wed, Sep 3, 2014 at 11:25 AM, Lucas Grijander lucasgrinjande...@gmail.com wrote: Hi. I am new in the list, so i don't know if this is the right place to open this thread. Our riak cluster is composed by 4 nodes. The O.S is Ubuntu 14.04 and the version of riak is 2.0.0 We are getting a lot of errors in the logs of ours riak nodes. The errors are like this: 2014-09-03 13:05:14.212 [error] 0.19152.3672 Failed to merge {[/var/lib/riak/data/ten_minutes/570899077082383952423314387779798054553098649600/14.bitcask.data,/var/lib/riak/data/ten_minutes/570899077082383952423314387779798054553098649600/13.bitcask.data,/var/lib/riak/data/ten_minutes/570899077082383952423314387779798054553098649600/12.bitcask.data,/var/lib/riak/data/ten_minutes/570899077082383952423314387779798054553098649600/11.bitcask.data,/var/lib/riak/data/ten_minutes/570899077082383952423314387779798054553098649600/10.bitcask.data,/var/lib/riak/data/ten_minutes/570899077082383952423314387779798054553098649600/9.bitcask.data,/var/lib/riak/data/ten_minutes/570899077082383952423314387779798054553098649600/8.bitcask.data,/var/lib/riak/data/ten_minutes/570899077082383952423314387779798054553098649600/7.bitcask.data,/var/lib/riak/data/ten_minutes/570899077082383952423314387779798054553098649600/6.bitcask.data,/var/lib/riak/data/ten_minutes/570899077082383952423314387779798054553098649600/5.bitcask.data,/var/lib/riak/data/ten_minutes/570899077082383952423314387779798054553098649600/4.bitcask.data,/var/lib/riak/data/ten_minutes/570899077082383952423314387779798054553098649600/3.bitcask.data,/var/lib/riak/data/ten_minutes/570899077082383952423314387779798054553098649600/2.bitcask.data,/var/lib/riak/data/ten_minutes/570899077082383952423314387779798054553098649600/1.bitcask.data],[/var/lib/riak/data/ten_minutes/570899077082383952423314387779798054553098649600/14.bitcask.data,/var/lib/riak/data/ten_minutes/570899077082383952423314387779798054553098649600/13.bitcask.data,/var/lib/riak/data/ten_minutes/570899077082383952423314387779798054553098649600/12.bitcask.data,/var/lib/riak/data/ten_minutes/570899077082383952423314387779798054553098649600/11.bitcask.data,/var/lib/riak/data/ten_minutes/570899077082383952423314387779798054553098649600/10.bitcask.data,/var/lib/riak/data/ten_minutes/570899077082383952423314387779798054553098649600/9.bitcask.data,/var/lib/riak/data/ten_minutes/570899077082383952423314387779798054553...,...]}: 
{generic_failure,error,function_clause,[{riak_kv_bitcask_backend,key_transform_to_1,[{tombstone,2,0,4,109,116,116,108,50,48,49,52,48,57,48,50,50,50,53,48,50,53,48,56,102,56,102,102,101,54,49,49,97,101,51,99,52,52,55,101,55,55,100,99,50,100,49,52,56,51,57,55,48,50}],[{file,src/riak_kv_bitcask_backend.erl},{line,99}]},{bitcask,'-expiry_merge/4-fun-0-',7,[{file,src/bitcask.erl},{line,1912}]},{bitcask_fileops,fold_hintfile_loop,5,[{file,src/bitcask_fileops.erl},{line,660}]},{bitcask_fileops,fold_file_loop,8,[{file,src/bitcask_fileops.erl},{line,720}]},{bitcask_fileops,fold_hintfile,3,[{file,src/bitcask_fileops.erl},{line,624}]},{bitcask,expiry_merge,4,[{file,src/bitcask.erl},{line,1915}]},{bitcask,merge1,4,[{file,src/bitcask.erl},{line,686}]},{bitcask,merge,3,[{file,src/bitcask.erl},{line,566}]}]} It seems riak can not merge data in bitcask data store. This is the configuration of bitcask in ten minutes TTL: {ten_minutes_ttl,riak_kv_bitcask_backend, [{io_mode,erlang}, {expiry_grace_time,0}, {small_file_threshold,5242880}, {dead_bytes_threshold,4194304}, {frag_threshold,15}, {dead_bytes_merge_trigger,4194304}, {frag_merge_trigger,10}, {max_file_size,10485760}, {open_timeout,4}, {data_root,/var/lib/riak/data/ten_minutes}, {sync_strategy,none}, {merge_window,always}, {max_fold_age,-1}, {max_fold_puts,0}, {expiry_secs,660}, {require_hint_crc,true}]} As a result, the amount of used memory(RAM) keeps growing until the server run out of free memory. Could you give me some clue that it can point to the cause of the problem? Thanks in advance. ___ riak-users mailing list riak-users@lists.basho.com http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
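For anyone following along, the patch procedure described at the top amounts to roughly the following on each node. The basho-patches path is the one mentioned above for this package; confirm it on your system before copying:

  # stop the node, drop in the patched module, and start it again
  riak stop
  cp riak_kv_bitcask_backend.beam /usr/lib/riak/lib/basho-patches/
  riak start

Modules placed in basho-patches are loaded ahead of the stock code when the node boots, which is why no rebuild is needed.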
Re: Merge error in bitcask data store
Bringing this back to the mailing list: a new patch was required to fix the Bitcask merge problems seen by Lucas. Here it is. Again, it can be loaded by placing it in the basho-patches directory and stopping, then starting, the node again. It could also be loaded without a restart by issuing the load command in the Riak console after placing it in the basho-patches directory. You would need to run bin/riak attach, then issue the command:

l(riak_kv_bitcask_backend).

That is a lowercase L up there, btw. This should resolve the Bitcask merge issues.
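In other words, the no-restart route looks roughly like this, assuming the patched beam file is already sitting in basho-patches on the node:

  # attach to the running node's console
  bin/riak attach
  # at the Erlang prompt that appears, reload the patched module by typing:
  #   l(riak_kv_bitcask_backend).

Either route ends with the patched riak_kv_bitcask_backend loaded; the attach variant just avoids a restart on nodes you would rather keep serving traffic.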
Re: repair-2i stops with bad argument in call to eleveldb:async_write
Simon: The data scan for that partition seems to be taking more than 5 minutes to collect a batch of 1000 items, so the 2i repair process is giving up on it before it has a chance to finish. You can reduce the likelihood of this happening by configuring the batch parameter to something small. In the riak_kv section of the configuration file, set this: {riak_kv, [ {aae_2i_batch_size, 10}, ... Let us know if that allows it to finish the repair. You should still look into what may be causing the slowness. A combination of slow disks or very large data sets might do it. On Fri, Aug 1, 2014 at 5:24 AM, Russell Brown russell.br...@me.com wrote: Hi Simon, Sorry for the delays. I’m on vacation for a couple of days. Will pick this up on Monday. Cheers Russell On 1 Aug 2014, at 09:56, Effenberg, Simon seffenb...@team.mobile.de wrote: Hi Russell, @basho any updates on this? We still have the issues with 2i (repair is also still not possible) and searching for the 2i indexes is reproducable creating (for one range I tested) 3 different values. I would love to provide anything you need to debug that issue. Cheers Simon On Wed, Jul 30, 2014 at 09:22:56AM +, Effenberg, Simon wrote: Great. Thanks Russell.. if you need me to do something.. feel free to ask. Cheers Simon On Wed, Jul 30, 2014 at 10:19:56AM +0100, Russell Brown wrote: Thanks Simon, I’m going to spend a some time on this day. Cheers Russell On 30 Jul 2014, at 10:05, Effenberg, Simon seffenb...@team.mobile.de wrote: Hi Russel, still one machine out of 13 is on wheezy and the rest on squeeze but the software is the same and basho is providing even the erlang stuff. So their should no real difference inside the application. And the errors are almost the same (except the async_write/read difference). I paste them: -- node 1 --- 2014-07-30 06:16:07.728 UTC [info] 0.14871.336@riak_kv_2i_aae:next_partition:160 Finished 2i repair: Total partitions: 1 Finished partitions: 1 Speed: 100 Total 2i items scanned: 0 Total tree objects: 0 Total objects fixed: 0 With errors: Partition: 12559779695812446953312916531172001681702912 Error: index_scan_timeout 2014-07-30 06:16:07.728 UTC [error] 0.1525.0 gen_server 0.1525.0 terminated with reason: bad argument in call to eleveldb:async_write(#Ref0.0.324.211123, , [{put,131,104,2,109,0,0,0,20,99,111,110,118,101,114,115,97 ,116,105,111,110,95,115,101,99,114,...,...}], []) in eleveldb:write/3 line 155 2014-07-30 06:16:07.728 UTC [error] 0.1525.0 CRASH REPORT Process 0.1525.0 with 0 neighbours exited with reason: bad argument in call to eleveldb:async_write(#Ref0.0.324.211123, , [{put,131,104,2,109,0,0,0,20,99,11 1,110,118,101,114,115,97,116,105,111,110,95,115,101,99,114,...,...}], []) in eleveldb:write/3 line 155 in gen_server:terminate/6 line 747 2014-07-30 06:16:07.728 UTC [error] 0.1517.0 Supervisor {0.1517.0,poolboy_sup} had child riak_core_vnode_worker started with {riak_core_vnode_worker,start_link,undefined} at 0.1525.0 exit with reason bad argument in call to eleveldb:async_write(#Ref0.0.324.211123, , [{put,131,104,2,109,0,0,0,20,99,111,110,118,101,114,115,97,116,105,111,110,95,115,101,99,114,...,...}], []) in eleveldb:write/3 line 155 in context child_terminated -- node 2 --- 2014-07-30 06:16:07.791 UTC [info] 0.8083.314@riak_kv_2i_aae:next_partition:160 Finished 2i repair: Total partitions: 1 Finished partitions: 1 Speed: 100 Total 2i items scanned: 0 Total tree objects: 0 Total objects fixed: 0 With errors: Partition: 622279994019798508141412682679979879462877528064 Error: index_scan_timeout 2014-07-30 
06:16:07.791 UTC [error] 0.1884.0 gen_server 0.1884.0 terminated with reason: bad argument in call to eleveldb:async_write(#Ref0.0.318.96628, , [{put,131,104,2,109,0,0,0,20,99,111,110,118,101,114,115,97, 116,105,111,110,95,115,101,99,114,...,...}], []) in eleveldb:write/3 line 155 2014-07-30 06:16:07.791 UTC [error] 0.1884.0 CRASH REPORT Process 0.1884.0 with 0 neighbours exited with reason: bad argument in call to eleveldb:async_write(#Ref0.0.318.96628, , [{put,131,104,2,109,0,0,0,20,99,111 ,110,118,101,114,115,97,116,105,111,110,95,115,101,99,114,...,...}], []) in eleveldb:write/3 line 155 in gen_server:terminate/6 line 747 2014-07-30 06:16:07.792 UTC [error] 0.1875.0 Supervisor {0.1875.0,poolboy_sup} had child riak_core_vnode_worker started with {riak_core_vnode_worker,start_link,undefined} at 0.1884.0 exit with reason bad argument in call to eleveldb:async_write(#Ref0.0.318.96628, , [{put,131,104,2,109,0,0,0,20,99,111,110,118,101,114,115,97,116,105,111,110,95,115,101,99,114,...,...}], []) in eleveldb:write/3 line 155 in context child_terminated -- node 3 ---
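Spelling the suggestion out a little: the aae_2i_batch_size setting goes into the riak_kv section of app.config on each node, the nodes need a restart to pick it up, and then the repair can be retried. The partition id below is just a placeholder; substitute the one from your logs:

  # after adding {aae_2i_batch_size, 10} to the riak_kv section of app.config
  riak stop && riak start
  # retry the 2i repair and keep an eye on it
  riak-admin repair-2i <partition_id>
  riak-admin repair-2i status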
Re: Upgraded riak 1.4.9 is pegging the CPU
Hi Alain. I don't think you are seeing the AAE issue. The problem with upgrading from 1.4.4-1.4.7 to 1.4.8 was a broken hash function in those, which made the AAE trees incompatible. You should not have the same problem in 1.4.0. It seems that Erlang processes are repeatedly crashing and restarting. It would be good to grab all your logs before they rotate so we can take a look at exactly what is the first thing crashing and causing this snowball effect. On Thu, Jun 5, 2014 at 11:58 AM, Alain Rodriguez al...@uber.com wrote: Actually I just noticed it is likely the AAE issue: 2014-06-05 14:53:47.587 [error] 0.16054.31 CRASH REPORT Process 0.16054.31 with 0 neighbours exited with reason: no match of right hand value {error,{db_open,IO error: lock /var/lib/riak/anti_entropy/1061872283373234151507364761270424381468763488256/LOCK: already held by process}} in hashtree:new_segment_store/2 line 505 in gen_server:init_it/6 line 328 2014-06-05 14:53:47.588 [error] 0.16056.31 CRASH REPORT Process 0.16056.31 with 0 neighbours exited with reason: no match of right hand value {error,{db_open,IO error: lock /var/lib/riak/anti_entropy/1335903840372778448670555667404727447654250840064/LOCK: already held by process}} in hashtree:new_segment_store/2 line 505 in gen_server:init_it/6 line 328 2014-06-05 14:53:47.588 [error] 0.16055.31 CRASH REPORT Process 0.16055.31 with 0 neighbours exited with reason: no match of right hand value {error,{db_open,IO error: lock /var/lib/riak/anti_entropy/1267395951122892374379757940871151681107879002112/LOCK: already held by process}} in hashtree:new_segment_store/2 line 505 in gen_server:init_it/6 line 328 Bollocks! On Thu, Jun 5, 2014 at 8:49 AM, Alain Rodriguez al...@uber.com wrote: Thanks for the quick reply and no I did not. Is this something I should be able to do now (stop, remove files, start again) or is it too late? How could I verify this is the issue? On Thu, Jun 5, 2014 at 8:42 AM, Shane McEwan sh...@mcewan.id.au wrote: On 05/06/14 16:20, Alain Rodriguez wrote: Hi all, I upgraded 1 of 9 riak nodes in a cluster last night from 1.4.0 to 1.4.9. The rest are running 1.4.0. Ever since I am seeing the upgraded node, riak01 consuming a significantly larger percent of CPU and the PUT times on it have gotten worse. htop indicicates one particular process pegging the CPU, and many many more processes running than I was used to seeing before. G'day! Did you turn off and remove the Active Anti Entropy files before upgrading? From the 1.4.8 release notes: IMPORTANT We recommend removing current AAE trees before upgrading. That is, all files under the anti_entropy sub-directory. This will avoid potentially large amounts of repair activity once correct hashes start being added. The data in the current trees can only be fixed by a full rebuild, so this repair activity is wasteful. Trees will start to build once AAE is re-enabled. To minimize the impact of this, we recommend upgrading during a period of low activity. Shane. ___ riak-users mailing list riak-users@lists.basho.com http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com ___ riak-users mailing list riak-users@lists.basho.com http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com ___ riak-users mailing list riak-users@lists.basho.com http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
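If it is easier than pulling files one by one, something along these lines will bundle everything up before the logs rotate. It assumes the default /var/log/riak location; adjust if your logs live elsewhere:

  # snapshot all current Riak logs (console.log, error.log, crash.log and their rotated copies)
  tar czf riak01-logs-$(date +%Y%m%d).tar.gz /var/log/riak/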
Re: Upgraded riak 1.4.9 is pegging the CPU
Alain, thanks for the logs you sent me on the side. I'm not yet sure what the root cause is, but I saw a lot of handoff activity and busy distributed port messages, which indicate the single TCP connection between two Erlang nodes is completely saturated. Since there is too much going on, turning off AAE and examining your cluster with less activity might still be a good idea. Check the output of riak-admin transfers until it is quiet.

I noticed you have a file limit of 8192. That is not low, but newer Riaks eat more file handles, so it would be a good idea to double that.

Let us know what the stats and the logs look like after AAE is off to see what else we can do.
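For completeness, the two checks mentioned above look something like this. The /proc path works on Linux and simply inspects the running Erlang VM; where you raise the limit (limits.conf, an init script, and so on) depends on how the node is started:

  # repeat until no handoffs remain
  riak-admin transfers
  # confirm the open-file limit the running Riak VM actually has
  grep 'open files' /proc/$(pgrep -o -f beam.smp)/limits

Doubling the limit from 8192 to 16384, as suggested above, then needs a node restart to take effect.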
Re: Rebuilding AAE hashes - small question
Hey there. There are a couple of things to keep in mind when deleting invalid AAE trees from the 1.4.3-1.4.7 series after upgrading to 1.4.8: * If AAE is disabled, you don't have to stop the node to delete the data in the anti_entropy directories * If AAE is enabled, deleting the AAE data in a rolling manner may trigger an avalanche of read repairs between nodes with the bad trees and nodes with good trees as the data seems to diverge. If your nodes are already up, with AAE enabled and with old incorrect trees in the mix, there is a better way. You can dynamically disable AAE with some console commands. At that point, without stopping the nodes, you can delete all AAE data across the cluster. At a convenient time, re-enable AAE. I say convenient because all trees will start to rebuild, and that can be problematic in an overloaded cluster. Doing this over the weekend might be a good idea unless your cluster can take the extra load. To dynamically disable AAE from the Riak console, you can run this command: riak_core_util:rpc_every_member_ann(riak_kv_entropy_manager, disable, [], 6). and enable with the similar: riak_core_util:rpc_every_member_ann(riak_kv_entropy_manager, enable, [], 6). That last number is just a timeout for the RPC operation. I hope this saves you some extra load on your clusters. On Wed, Apr 9, 2014 at 11:02 AM, Luke Bakken lbak...@basho.com wrote: Hi Guido, I specifically meant riak-admin transfers however using riak-admin wait-for-service riak_kv riak@node is a good first step before waiting for transfers. Thanks! -- Luke Bakken CSE lbak...@basho.com On Wed, Apr 9, 2014 at 7:54 AM, Guido Medina guido.med...@temetra.comwrote: What do you mean by wait for handoff to finish? Are you referring to wait for the service to be fully started? i.e. riak-admin wait-for-service riak_kv riak@node Or do you mean to check for riak-admin transfers on the started node and wait until those handoffs/transfers are gone? Guido. On 09/04/14 15:46, Luke Bakken wrote: Hi Guido, That is the correct process. Be sure to use the rolling restart procedure when restarting nodes (i.e. wait for handoff to finish before moving on). -- Luke Bakken CSE lbak...@basho.com On Wed, Apr 9, 2014 at 6:34 AM, Guido Medina guido.med...@temetra.comwrote: Hi, If nodes are already upgraded to 1.4.8 (and they went all the way from 1.4.0 to 1.4.8 including AAE buggy versions) Will the following command (as root) on Ubuntu Servers 12.04: riak stop; rm -Rf /var/lib/riak/anti_entropy/*; riak start executed on each node be enough to rebuild AAE hashes? Regards, Guido. ___ riak-users mailing list riak-users@lists.basho.com http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com ___ riak-users mailing list riak-users@lists.basho.com http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com ___ riak-users mailing list riak-users@lists.basho.com http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com ___ riak-users mailing list riak-users@lists.basho.com http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
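Put together, the no-restart sequence described above looks roughly like this. The console command only needs to be issued from one node (it acts on every cluster member), while the tree deletion has to be run on each node; the anti_entropy path assumes the default data directory:

  # from any one node's console (riak attach):
  #   riak_core_util:rpc_every_member_ann(riak_kv_entropy_manager, disable, [], 6).
  # then, on every node, clear the stale trees without stopping Riak:
  rm -rf /var/lib/riak/anti_entropy/*
  # finally, at a quiet time, re-enable from the console the same way:
  #   riak_core_util:rpc_every_member_ann(riak_kv_entropy_manager, enable, [], 6).

Once AAE is re-enabled, the trees rebuild from scratch, so expect some extra load while that happens.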
Re: [ANN] Riak 1.4.8
Toby, Since data inserted into the AAE trees while versions 1.4.4 - 1.4.7 of Riak were running was incorrect, they should be discarded before the upgrade. You are seeing them valiantly trying to repair themselves based on the incoming writes, but that activity is wasted since they require a full scan of the data to be useful again. We are going to add a paragraph to the release notes to warn others about this and make sure the anti_entropy directories are cleared before Riak is updated to 1.4.8 and AAE is enabled again. We recommend you disable AAE, remove all AAE trees (the contents of the anti_entropy directory) and enable it again during a period of low cluster activity. They will all be built again and you should be back in business once that the new trees are ready. Engel On Tue, Feb 25, 2014 at 11:02 PM, Toby Corkindale t...@dryft.net wrote: Hi, After upgrading to 1.4.8 in staging, last week, we've been seeing a fairly constant extra amount of riak hits in the stats. Consistently 300-400 a minute. Can get lost in the much higher, varying levels during busy hours, but outside of that it's quite visible as a baseline. The logs show a lot of AAE work going on. I wondered if this was normal behaviour of Riak cleaning up the AAE-related bug from earlier versions, and will go away eventually? Or that unlikely? I emphasize that there's no significant server load resulting from this -- I'm just curious if it's something I need to look into or not. Cheers, Toby On 21 February 2014 04:00, Tom Santero tsant...@basho.com wrote: Hi, Today, Basho released a minor update of Riak and Riak Enterprise Edition, version 1.4.8. This bug-fix release addresses a regression[0] in Riak's active anti-entropy (AAE) system. The regression existed in versions 1.4.3 through 1.4.7. The complete release notes[1] and package downloads[2] are now available. Thanks for being the best community ever. Regards, The Basho Team [0] http://lists.basho.com/pipermail/riak-users_lists.basho.com/2014-January/014551.html [1] https://github.com/basho/riak/blob/1.4/RELEASE-NOTES.md [2] http://docs.basho.com/riak/latest/downloads/ ___ riak-users mailing list riak-users@lists.basho.com http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com -- Turning and turning in the widening gyre The falcon cannot hear the falconer Things fall apart; the center cannot hold Mere anarchy is loosed upon the world ___ riak-users mailing list riak-users@lists.basho.com http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com ___ riak-users mailing list riak-users@lists.basho.com http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
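If taking nodes down briefly is acceptable, the per-node version of this is the same procedure discussed in the "Rebuilding AAE hashes" thread above, run one node at a time; riak@<node-name> below is a placeholder for the node's actual name and the default data path is assumed:

  # with the node stopped, the stale trees can simply be deleted
  riak stop
  rm -rf /var/lib/riak/anti_entropy/*
  riak start
  # wait for KV to come up and for handoffs to drain before moving to the next node
  riak-admin wait-for-service riak_kv riak@<node-name>
  riak-admin transfers

Either way, the rebuilt trees are only trustworthy once they have finished their first full build.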
Re: Corrupted data using Riak Python Client on Win7 64 bit
Hello there, This looks puzzling. Just from looking at the code we haven't found anything suspicious. Would you mind posting a pair of those files that failed to match somewhere so we can look at the differences? Thanks for reporting this. Engel@Basho On Fri, Nov 8, 2013 at 2:41 PM, finkle mcgraw finklemcg...@gmail.comwrote: Fellow Riak users, I've noticed that when I upload binary files with sizes of ~1 MB to Riak from my Windows 7 (64 bit) machine, then read the same data back again, often it has a few corrupted bytes, while maintining the correct total data length. Here's the Python script I use to provoke and detect the situation: https://gist.github.com/anonymous/7376084 Notice that I included the typical output when running the script at the bottom of the gist. As you can see, for that particular run, half of the dummy-data files were corrupted. The returned data from Riak has the exact same length as the source, but not the exact same content. I've only done brief analysis of how the corruptions appear within the files that are detected as corrupted, but it looks like it's typically between 1 to 5 bytes that are altered, evenly distributed within the file. I get no exceptions or warnings from the Riak Python client. Everything appears to be in order. So far I've tested this on two different windows machines against two different Riak clusters (a five node Amazon cluster with a loadbalancer in front, and a local devcluster running inside an Ubuntu 12.04 Virtual Machine). The problems appear in all four possible combinations. However, if I run the script from within an Ubuntu VM, on one of the said Windows machines, against any of the two Riak clusteres, the problems do NOT appear. Another observation: If I generate 50 sample files, upload them, then repeatedly try to download them over and over again, the script will detect corruptions in different files on each repetition of downloading. E.g., on round one it might say that file 1,5, and 19 were corrupted, but on round two it might say 3, 8 and 19. Here is the riak stats-view from the Amazon cluster we're running (that I tested the script agains): https://gist.github.com/anonymous/7376379 But as I said, the corruptions appear also when working locally between a Win7 machine and a cluster running on a virtual Ubuntu 12.04 machine. Here are my local package versions, running on Python 2.7.5 64 bit on Windows 7 64 bit: protobuf==2.4.1 riak==2.0.1 riak-pb==1.4.1.1 Any ideas? This seems relatively serious, unless it's some kind of brutal oversight on my part. Finkle ___ riak-users mailing list riak-users@lists.basho.com http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com ___ riak-users mailing list riak-users@lists.basho.com http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
Re: [ANNC] Riak 1.3.1
Hi Dave, The stats calculation was fixed in 1.3.1, but the read-repair with Last-write-wins=true was not backported. That one will make it to 1.4, which is scheduled in the near future. I hope that helps. -- Engel Sanchez On Thu, Apr 4, 2013 at 11:05 AM, Dave Brady dbr...@weborama.com wrote: Hi Jared, I don't see these patches, which I have applied to our installation of 1.3, explicitly mentioned in the Release Notes: Fix bug where stats endpoints were calculating _all_ riak_kv stats: https://github.com/basho/riak_kv/blob/9be3405e53acf680928faa6c70d265e86c75a22c/src/riak_kv_stat_bc.erl Every read triggers a read-repair when Last-write-wins=true https://github.com/basho/riak_kv/pull/334 Can you confirm whether or not they made it into 1.3.1, please? -- Dave Brady - Original Message - From: Jared Morrow ja...@basho.com To: Riak Users Mailing List riak-users@lists.basho.com Sent: Wednesday, April 3, 2013 11:53:45 PM GMT +01:00 Amsterdam / Berlin / Bern / Rome / Stockholm / Vienna Subject: Re: [ANNC] Riak 1.3.1 I hesitate to reply to my own email, but just wanted to point out that this issue https://github.com/basho/riak_core/pull/281 listed in the release notes should help all of you who had issues with slow bitcask startup times in 1.3.0. If you see or don't see improvements let us know. Thanks, Jared On Wed, Apr 3, 2013 at 3:16 PM, Jared Morrow ja...@basho.com wrote: Riak Users, We are happy to announce that Riak 1.3.1 is ready for your to download and install. It continues on the 1.3.x family with some nice bugfixes. See the release notes linked below for all the details. Release notes can be found here: https://github.com/basho/riak/blob/1.3/RELEASE-NOTES.md Downloads available on our docs page: http://docs.basho.com/riak/1.3.1/downloads/ Thanks as always for being the best community in open source, -Everyone at Basho ___ riak-users mailing list riak-users@lists.basho.com http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com ___ riak-users mailing list riak-users@lists.basho.com http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com ___ riak-users mailing list riak-users@lists.basho.com http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com