A little follow-up for you all, since I went offline for quite some time.

As suggested, it was a Solr performance issue: we were able to prove that my old 5 hosts could handle the load without Solr/Yokozuna. The fact was that my hosts lacked CPU as well as RAM, and Solr is pretty resource-consuming, so I switched from:
- 5 x 16GB x 2-CPU hosts
to
- 3 x 120GB x 8-CPU hosts

And it now works like a charm,

Thanks for the help (especially to Damien)

Guillaume

On 04/05/2016 15:17, Matthew Von-Maszewski wrote:
Guillaume,

Two points:

1. You can send the “riak debug” from one server and I will verify that 2.0.18 is indicated in the LOG file.

2. Your previous “riak debug” from server “riak1” indicated that only two CPU cores existed. We performance test with eight, twelve, and twenty-four core servers, not two. You have two heavyweight applications, Riak and Solr, competing for time on those two cores. Actually, you have three applications due to leveldb’s background compaction operations.

One leveldb compaction is CPU intensive. The compaction reads a block from the disk, computes a CRC32 check of the block, decompresses the block, merges the keys of this block with one or more blocks from other files, then compresses the new block, computes a new CRC32, and finally writes the block to disk. And there can be multiple compactions running simultaneously. All of your CPU time could be periodically lost to leveldb compactions.
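The per-block work described above (read, CRC check, decompress, merge, recompress, new CRC, write) can be sketched in a few lines of Python. This is only an illustration of why compaction is CPU-bound, not leveldb's actual implementation; the block format and the `make_block`/`compact` names are invented for the sketch:

```python
import heapq
import zlib

def make_block(keys):
    """Pack a sorted list of keys into a (crc32, compressed) block."""
    payload = zlib.compress(b",".join(keys))
    return (zlib.crc32(payload), payload)

def verify_and_decompress(block):
    """Check the block's CRC32, then inflate it -- both cost CPU."""
    crc, payload = block
    if zlib.crc32(payload) != crc:
        raise ValueError("corrupt block")
    return zlib.decompress(payload)

def compact(blocks):
    """Merge several sorted blocks into one new compressed block."""
    key_lists = [verify_and_decompress(b).split(b",") for b in blocks]
    merged = b",".join(heapq.merge(*key_lists))  # n-way sorted merge
    payload = zlib.compress(merged)              # recompress
    return (zlib.crc32(payload), payload)        # new checksum
```

Every step except the disk I/O is pure CPU work, which is why several concurrent compactions can starve a two-core box.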

There are some minor tunings we could do, like disabling compression in leveldb, that might help. But I seriously doubt you are going to achieve your desired results with only two cores. Adding a sixth server with two cores is not really going to help.
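For reference, disabling compression is a one-line change in riak.conf. The setting name below is assumed from the Riak 2.x cuttlefish schema; verify it against your installed version before applying:

```
## riak.conf -- trade disk space for CPU by skipping
## (de)compression during leveldb reads and compactions
leveldb.compression = off
```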

Matthew


On May 4, 2016, at 4:27 AM, Guillaume Boddaert <guilla...@lighthouse-analytics.co <mailto:guilla...@lighthouse-analytics.co>> wrote:

Thanks, I've installed the new library as stated in the documentation, using the 2.0.18 files.

I was unable to find the vnode LOG file from the documentation, as my vnodes look like files, not directories. So I can't verify that I'm running the proper version of the library after my Riak restart.

Anyway, it has unfortunately no effect:
http://www.awesomescreenshot.com/image/1219821/1b292613c051da86df5696034c114b14

I think I'll try to add a 6th node that doesn't rely on network disks and see what happens.

G.


On 03/05/2016 22:47, Matthew Von-Maszewski wrote:
Guillaume,

A prebuilt eleveldb 2.0.18 for Debian 7 is found here:

  * https://s3.amazonaws.com/downloads.basho.com/patches/eleveldb/2.0.18/eleveldb_2.0.18_debian7.tgz


There are good instructions for applying an eleveldb patch here:

http://docs.basho.com/community/productadvisories/leveldbsegfault/#patch-eleveldb-so

Key points about the above web page:

- use the eleveldb patch file link in this email, NOT links on the web page

- the Debian directory listed on the web page will be slightly different from your Riak 2.1.4 installation:
/usr/lib/riak/lib/eleveldb-<something_different>/priv/


Matthew


On May 3, 2016, at 1:01 PM, Matthew Von-Maszewski <matth...@basho.com <mailto:matth...@basho.com>> wrote:

Guillaume,

I have reviewed the debug package for your riak1 server. There are two potential areas of follow-up:

1. You are running our most recent Riak 2.1.4 which has eleveldb 2.0.17. We have seen a case where a recent feature in eleveldb 2.0.17 caused too much cache flushing, impacting leveldb’s performance. A discussion is here:

https://github.com/basho/leveldb/wiki/mv-timed-grooming2

2. Yokozuna search was recently updated for some timeout problems. Those updates are not yet in a public build. One of our other engineers is likely to respond to you on that topic.


An eleveldb 2.0.18 is tagged and available via github if you want to build it yourself. Otherwise, Basho may be releasing prebuilt patches of eleveldb 2.0.18 in the near future. But no date is currently set.

Matthew

On May 3, 2016, at 10:50 AM, Luke Bakken <lbak...@basho.com <mailto:lbak...@basho.com>> wrote:

Guillaume -

You said earlier "My data are stored on an openstack volume that
support up to 3000IOPS". There is a likelihood that your write load is
exceeding the capacity of your virtual environment, especially if some
Riak nodes are sharing physical disk or server infrastructure.

Some suggestions:

* If you're not using Riak Search, set "search = off" in riak.conf

* Be sure to carefully read and apply all tunings:
http://docs.basho.com/riak/kv/2.1.4/using/performance/

* You may wish to increase the memory dedicated to leveldb:
http://docs.basho.com/riak/kv/2.1.4/configuring/backend/#leveldb
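For reference, the leveldb memory ceiling is a riak.conf setting. The name and default below are assumed from the Riak 2.x cuttlefish schema; check the linked backend documentation for your exact version:

```
## riak.conf -- share of total RAM handed to leveldb for its
## block cache and write buffers (an absolute byte cap is also
## available via leveldb.maximum_memory)
leveldb.maximum_memory.percent = 70
```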

--
Luke Bakken
Engineer
lbak...@basho.com


On Tue, May 3, 2016 at 7:33 AM, Guillaume Boddaert
<guilla...@lighthouse-analytics.co> wrote:
Hi,

Sorry for the delay, I've spent a lot of time trying to understand whether the
problem was elsewhere.
I've simplified my infrastructure down to a simple layout that no longer relies
on a load balancer, and also corrected some minor performance issues on my
workers.

At the moment, I have up to 32 workers calling Riak for writes,
each of them set to:
w=1
dw=0
timeout=1000
using protobuf
a timed-out attempt is rerun 180s later
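The retry policy above can be sketched as a small wrapper. This is pure illustration: `store` stands in for whatever issues the actual protobuf put (with w=1, dw=0, timeout=1000 ms) and is expected to raise `TimeoutError` when the 1 s timeout fires; the function name and signature are invented:

```python
import time

RETRY_DELAY_S = 180  # a timed-out attempt is rerun 180 s later

def store_with_retry(store, key, value, max_attempts=2, sleep=time.sleep):
    """Call store(key, value); on timeout, wait RETRY_DELAY_S and retry.

    `sleep` is injectable so the schedule can be tested without waiting.
    """
    for attempt in range(max_attempts):
        try:
            return store(key, value)
        except TimeoutError:
            if attempt + 1 == max_attempts:
                raise  # out of attempts, surface the timeout
            sleep(RETRY_DELAY_S)
```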

From my application server's perspective, 23% of the calls are rejected by
timeout (75446 tries, 57564 successes, 17578 timeouts).
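The 23% figure follows directly from the counters above:

```python
tries, successes, timeouts = 75446, 57564, 17578

timeout_rate = timeouts / tries
print(f"timeout rate: {timeout_rate:.1%}")  # timeout rate: 23.3%
```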

Here is a sample riak-admin stat for one of my 5 hosts:

node_put_fsm_time_100 : 999331
node_put_fsm_time_95 : 773682
node_put_fsm_time_99 : 959444
node_put_fsm_time_mean : 156242
node_put_fsm_time_median : 20235
vnode_put_fsm_time_100 : 5267527
vnode_put_fsm_time_95 : 2437457
vnode_put_fsm_time_99 : 4819538
vnode_put_fsm_time_mean : 175567
vnode_put_fsm_time_median : 6928
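These riak-admin timings are reported in microseconds, so converting them makes the comparison with the 1000 ms client timeout direct: the node put 100th percentile sits almost exactly at the 1 s timeout, and the vnode 95th percentile is well past it.

```python
# put-FSM timings from riak-admin stat, in microseconds
stats = {
    "node_put_fsm_time_95": 773682,
    "node_put_fsm_time_100": 999331,
    "vnode_put_fsm_time_95": 2437457,
    "vnode_put_fsm_time_100": 5267527,
}
for name, micros in stats.items():
    print(f"{name}: {micros / 1e6:.2f} s")
```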

I am using leveldb, so I can't apply the suggested bitcask tunings.

I've changed the vm.dirty settings and enabled them:
admin@riak1:~$ sudo sysctl -a | grep dirty
vm.dirty_background_ratio = 0
vm.dirty_background_bytes = 209715200
vm.dirty_ratio = 40
vm.dirty_bytes = 0
vm.dirty_writeback_centisecs = 100
vm.dirty_expire_centisecs = 200

I've seen less idle time between writes; iostat shows near-constant writes
between 20 and 500 kB/s, with some surges around 4000 kB/s. That's better, but
not that great.

Here is the current configuration for my "activity_fr" bucket type and
"tweet" bucket:


admin@riak1:~$ http localhost:8098/types/activity_fr/props
HTTP/1.1 200 OK
Content-Encoding: gzip
Content-Length: 314
Content-Type: application/json
Date: Tue, 03 May 2016 14:30:21 GMT
Server: MochiWeb/1.1 WebMachine/1.10.8 (that head fake, tho)
Vary: Accept-Encoding

{
   "props": {
       "active": true,
       "allow_mult": false,
       "basic_quorum": false,
       "big_vclock": 50,
       "chash_keyfun": {
           "fun": "chash_std_keyfun",
           "mod": "riak_core_util"
       },
       "claimant": "r...@riak2.lighthouse-analytics.co",
       "dvv_enabled": false,
       "dw": "quorum",
       "last_write_wins": true,
       "linkfun": {
           "fun": "mapreduce_linkfun",
           "mod": "riak_kv_wm_link_walker"
       },
       "n_val": 3,
       "notfound_ok": true,
       "old_vclock": 86400,
       "postcommit": [],
       "pr": 0,
       "precommit": [],
       "pw": 0,
       "r": "quorum",
       "rw": "quorum",
       "search_index": "activity_fr.20160422104506",
       "small_vclock": 50,
       "w": "quorum",
       "young_vclock": 20
   }
}

admin@riak1:~$ http localhost:8098/types/activity_fr/buckets/tweet/props
HTTP/1.1 200 OK
Content-Encoding: gzip
Content-Length: 322
Content-Type: application/json
Date: Tue, 03 May 2016 14:30:02 GMT
Server: MochiWeb/1.1 WebMachine/1.10.8 (that head fake, tho)
Vary: Accept-Encoding

{
   "props": {
       "active": true,
       "allow_mult": false,
       "basic_quorum": false,
       "big_vclock": 50,
       "chash_keyfun": {
           "fun": "chash_std_keyfun",
           "mod": "riak_core_util"
       },
       "claimant": "r...@riak2.lighthouse-analytics.co",
       "dvv_enabled": false,
       "dw": "quorum",
       "last_write_wins": true,
       "linkfun": {
           "fun": "mapreduce_linkfun",
           "mod": "riak_kv_wm_link_walker"
       },
       "n_val": 3,
       "name": "tweet",
       "notfound_ok": true,
       "old_vclock": 86400,
       "postcommit": [],
       "pr": 0,
       "precommit": [],
       "pw": 0,
       "r": "quorum",
       "rw": "quorum",
       "search_index": "activity_fr.20160422104506",
       "small_vclock": 50,
       "w": "quorum",
       "young_vclock": 20
   }
}

I really don't know what to do. Can you help?

Guillaume

_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com





