Re: Upgrade from 1.3.1 to 1.4.2 => high IO
Cool.. it gave me an exception ("** exception error: undefined shell command profit/0") but it worked and now I have new data.. thanks a lot!

Cheers
Simon

On Wed, 11 Dec 2013 17:05:29 -0500, Matthew Von-Maszewski wrote:

> One of the core developers says that the following line should stop the
> stats process. It will then be automatically started, without the stuck
> data.
>
> exit(whereis(riak_core_stat_calc_sup), kill), profit().
>
> Matthew
>
> [...]
Re: Upgrade from 1.3.1 to 1.4.2 => high IO
One of the core developers says that the following line should stop the stats process. It will then be automatically started, without the stuck data.

exit(whereis(riak_core_stat_calc_sup), kill), profit().

Matthew

On Dec 11, 2013, at 4:50 PM, Simon Effenberg wrote:

> [...]
Re: Upgrade from 1.3.1 to 1.4.2 => high IO
So I think I have no real chance to get good numbers. I can see a little bit through the app monitoring, but I'm not sure if I can see real differences from the 100 -> 170 open_files increase.

I will try to change the value on the already migrated nodes as well, to see if this improves the things I can see..

Any other ideas?

Cheers
Simon

On Wed, 11 Dec 2013 15:37:03 -0500, Matthew Von-Maszewski wrote:

> [...]
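A rough sketch of what the 100 -> 170 max_open_files change means at the node level, assuming (as the spreadsheet discussion elsewhere in the thread does) a 256-partition ring carried by 11 surviving servers, and assuming max_open_files is enforced per vnode rather than per node:

```python
import math

# Numbers from the thread: ring size 256, 12 servers, with the
# spreadsheet planning for one server down (11 carrying the ring).
ring_size = 256
surviving_servers = 11
vnodes_per_server = math.ceil(ring_size / surviving_servers)  # ~24 vnodes

# max_open_files is a per-vnode leveldb limit, so the per-node ceiling
# on simultaneously open .sst files grows with it:
handles_before = vnodes_per_server * 100
handles_after = vnodes_per_server * 170
```

With these assumptions the node-wide ceiling goes from 2400 to 4080 open leveldb files, which is why the spreadsheet check against available RAM matters before raising the value.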
Re: Upgrade from 1.3.1 to 1.4.2 => high IO
The real Riak developers have suggested this might be your problem with the stats being stuck:

https://github.com/basho/riak_core/pull/467

The fix is included in the upcoming 1.4.4 maintenance release (which is overdue, so I am not going to bother guessing when it will actually arrive).

Matthew

On Dec 11, 2013, at 2:47 PM, Simon Effenberg wrote:

> [...]
Re: Upgrade from 1.3.1 to 1.4.2 => high IO
I will do.. but one other thing:

Every 10.0s: sudo riak-admin status | grep put_fsm (watch output)

node_put_fsm_time_mean : 2208050
node_put_fsm_time_median : 39231
node_put_fsm_time_95 : 17400382
node_put_fsm_time_99 : 50965752
node_put_fsm_time_100 : 59537762
node_put_fsm_active : 5
node_put_fsm_active_60s : 364
node_put_fsm_in_rate : 5
node_put_fsm_out_rate : 3
node_put_fsm_rejected : 0
node_put_fsm_rejected_60s : 0
node_put_fsm_rejected_total : 0

This is not changing at all.. so maybe my expectations are _wrong_?! So I will start searching around for a "status" bug, or whether I'm looking in the wrong place... maybe there is no problem while I'm searching for one?! But I see that at least the app has some issues on GET and PUT (more on PUT).. so I would like to know how fast things are.. but "status" isn't working.. argh...

Cheers
Simon

On Wed, 11 Dec 2013 14:32:07 -0500, Matthew Von-Maszewski wrote:

> [...]
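For reference, the node_put_fsm_time_* values that riak-admin status reports are in microseconds, which makes the figures in this thread worse than they might look at first glance. A small conversion sketch using those numbers:

```python
# node_put_fsm_time_* values from `riak-admin status` are microseconds.
stats_us = {
    "mean":   2208050,
    "median":   39231,
    "95":    17400382,
    "99":    50965752,
    "100":   59537762,
}
stats_ms = {name: us / 1000.0 for name, us in stats_us.items()}
# mean ~2208 ms and 99th percentile ~51 s: the average put on this node
# is taking over two seconds.
```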
Re: Upgrade from 1.3.1 to 1.4.2 => high IO
An additional thought: if increasing max_open_files does NOT help, try removing +S 4:4 from vm.args. Typically the +S setting helps leveldb, but one other user mentioned that the new sorted 2i queries needed more CPU in the Erlang layer.

Summary:
- try increasing max_open_files to 170
- if that helps: try setting sst_block_size to 32768 in app.config
- if it does not help: try removing +S from vm.args

Matthew

On Dec 11, 2013, at 1:58 PM, Simon Effenberg wrote:

> [...]
Re: Upgrade from 1.3.1 to 1.4.2 => high IO
Hi Matthew,

On Wed, 11 Dec 2013 18:38:49 +0100, Matthew Von-Maszewski wrote:

> Simon,
>
> I have plugged your various values into the attached spreadsheet. I
> assumed a vnode count that allows for one of your twelve servers to die
> (256 ring size / 11 servers).

Great, thanks!

> The spreadsheet suggests you can safely raise your max_open_files from
> 100 to 170. I would suggest doing this for the next server you upgrade.
> If part of your problem is file cache thrashing, you should see an
> improvement.

I will try this out.. starting the next server in 3-4 hours.

> Only if max_open_files helps should you then consider adding
> {sst_block_size, 32767} to the eleveldb portion of app.config. This
> setting, given your value sizes, would likely halve the size of the
> metadata held in the file cache. It only impacts the files newly
> compacted in the upgrade, and would gradually increase space in the
> file cache while slowing down the file cache thrashing.

So I'll do this on the server after next, if the next server is fine.

> What build/packaging of Riak do you use, or do you build from source?

Using the debian packages from the basho site..

I'm really wondering why the "put" performance is that bad. Here are the changes which were introduced/changed only on the newly upgraded servers:

+fsm_limit => 5,
--- our '+P' is set to 262144, so more than 3x fsm_limit, which was
--- stated somewhere
+# after finishing the upgrade this should be switched to v1 !!!
+object_format => '__atom_v0',

- '-env ERL_MAX_ETS_TABLES' => 8192,
+ '-env ERL_MAX_ETS_TABLES' => 256000, # old package used 8192
+ #   but 1.4.2 raised it to this high number
+ '-env ERL_MAX_PORTS' => 64000,
+ # Treat error_logger warnings as warnings
+ '+W' => 'w',
+ # Tweak GC to run more often
+ '-env ERL_FULLSWEEP_AFTER' => 0,
+ # Force the erlang VM to use SMP
+ '-smp' => 'enable',

Cheers
Simon

> Matthew
>
> [...]
Re: Upgrade from 1.3.1 to 1.4.2 => high IO
Hi Matthew, thanks for all your time and work.. see inline for answers.. On Wed, 11 Dec 2013 09:17:32 -0500 Matthew Von-Maszewski wrote: > The real Riak developers have arrived on-line for the day. They are telling > me that all of your problems are likely due to the extended upgrade times, > and yes there is a known issue with handoff between 1.3 and 1.4. They also > say everything should calm down after all nodes are upgraded. > > I will review your system settings now and see if there is something that > might make the other machines upgrade quicker. So three more questions: > > - what is the average size of your keys bucket names are between 5 and 15 characters (only ~ 10 buckets).. key names are normally something like 26iesj:hovh7egz > > - what is the average size of your value (data stored) I have to guess.. but mean is (from Riak) 12kb but 95th percentile is at 75kb and in theory we have a limit of 1MB (then it will be split up) but sometimes thanks to siblings (we have two buckets with allow_mult) we have also some 7MB in MAX but this will be reduced again (it's a new feature in our app which has too many parallel writes within 15ms). > > - in regular use, are your keys accessed randomly across their entire range, > or do they contain a date component which clusters older, less used keys normally we don't search but retrieve keys by key name.. and we have data which is up to 6 months old and normally we access mostly new/active/hot data and not all the old ones.. besides this we have a job doing a 2i query every 5mins and another one doing this maybe once an hour.. both don't work while the upgrade is ongoing (2i isn't working).
Cheers Simon > > Matthew > > > On Dec 11, 2013, at 8:43 AM, Simon Effenberg > wrote: > > > Oh and at the moment they are waiting for some handoffs and I see > > errors in logfiles: > > > > > > 2013-12-11 13:41:47.948 UTC [error] > > <0.7157.24>@riak_core_handoff_sender:start_fold:269 hinted_handoff > > transfer of riak_kv_vnode from 'riak@10.46.109.202' > > 468137243207554840987117797979434404733540892672 > > > > but I remember that somebody else had this as well and if I recall > > correctly it disappeared after the full upgrade was done.. but at the > > moment it's hard to think about upgrading everything at once.. > > (~12hours 100% disk utilization on all 12 nodes will lead to real slow > > puts/gets) > > > > What can I do? > > > > Cheers > > Simon > > > > PS: transfers output: > > > > 'riak@10.46.109.202' waiting to handoff 17 partitions > > 'riak@10.46.109.201' waiting to handoff 19 partitions > > > > (these are the 1.4.2 nodes) > > > > > > On Wed, 11 Dec 2013 14:39:58 +0100 > > Simon Effenberg wrote: > > > >> Also some side notes: > >> > >> "top" is even better on new 1.4.2 than on 1.3.1 machines.. IO > >> utilization of disk is mostly the same (round about 33%).. > >> > >> but > >> > >> 95th percentile of response time for get (avg over all nodes): > >> before upgrade: 29ms > >> after upgrade: almost the same > >> > >> 95th percentile of response time for put (avg over all nodes): > >> before upgrade: 60ms > >> after upgrade: 1548ms > >>but this is only because of 2 of 12 nodes are > >>on 1.4.2 and are really slow (17000ms) > >> > >> Cheers, > >> Simon > >> > >> On Wed, 11 Dec 2013 13:45:56 +0100 > >> Simon Effenberg wrote: > >> > >>> Sorry I forgot the half of it.. > >>> > >>> seffenberg@kriak46-1:~$ free -m > >>> total used free sharedbuffers cached > >>> Mem: 23999 23759239 0184 16183 > >>> -/+ buffers/cache: 7391 16607 > >>> Swap:0 0 0 > >>> > >>> We have 12 servers.. > >>> datadir on the compacted servers (1.4.2) ~ 765 GB > >>> > >>> AAE is enabled. 
> >>> > >>> I attached app.config and vm.args. > >>> > >>> Cheers > >>> Simon > >>> > >>> On Wed, 11 Dec 2013 07:33:31 -0500 > >>> Matthew Von-Maszewski wrote: > >>> > Ok, I am now suspecting that your servers are either using swap space > (which is slow) or your leveldb file cache is thrashing (opening and > closing multiple files per request). > > How many servers do you have and do you use Riak's active anti-entropy > feature? I am going to plug all of this into a spreadsheet. > > Matthew Von-Maszewski > > > On Dec 11, 2013, at 7:09, Simon Effenberg > wrote: > > > Hi Matthew > > > > Memory: 23999 MB > > > > ring_creation_size, 256 > > max_open_files, 100 > > > > riak-admin status: > > > > memory_total : 276001360 > > memory_processes : 191506322 > > memory_processes_used : 191439568 > > memory_system : 84495038 > > memory_atom : 686993 > > memory_atom_used : 686560 > > memory_binary : 21965352 > > memory_code : 11332732 > > memory_ets : 10823528 > > > > Thanks for look
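For context on the figures in this thread (765 GB of data per node, ring_creation_size 256, 12 nodes), a back-of-the-envelope sketch of how much leveldb data each vnode carries — the unit the upgrade compaction has to grind through. The even-spread assumption is mine, not stated in the thread.

```python
# Figures taken from this thread; the per-vnode split is an approximation
# (it assumes partitions are spread evenly across the 12 nodes).
ring_size = 256          # ring_creation_size from the poster's config
nodes = 12
data_per_node_gb = 765   # datadir size on a compacted 1.4.2 node

vnodes_per_node = ring_size / nodes
gb_per_vnode = data_per_node_gb / vnodes_per_node

print(round(vnodes_per_node, 1))  # 21.3 vnodes per node
print(round(gb_per_vnode, 1))     # 35.9 GB of leveldb data per vnode
```

At roughly 36 GB per vnode across ~21 vnodes, an 11-hour full rewrite of the .sst files during the 1.3 → 1.4 upgrade is at least plausible.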
Re: Upgrade from 1.3.1 to 1.4.2 => high IO
The real Riak developers have arrived on-line for the day. They are telling me that all of your problems are likely due to the extended upgrade times, and yes there is a known issue with handoff between 1.3 and 1.4. They also say everything should calm down after all nodes are upgraded. I will review your system settings now and see if there is something that might make the other machines upgrade quicker. So three more questions: - what is the average size of your keys - what is the average size of your value (data stored) - in regular use, are your keys accessed randomly across their entire range, or do they contain a date component which clusters older, less used keys Matthew On Dec 11, 2013, at 8:43 AM, Simon Effenberg wrote: > Oh and at the moment they are waiting for some handoffs and I see > errors in logfiles: > > > 2013-12-11 13:41:47.948 UTC [error] > <0.7157.24>@riak_core_handoff_sender:start_fold:269 hinted_handoff > transfer of riak_kv_vnode from 'riak@10.46.109.202' > 468137243207554840987117797979434404733540892672 > > but I remember that somebody else had this as well and if I recall > correctly it disappeared after the full upgrade was done.. but at the > moment it's hard to think about upgrading everything at once.. > (~12hours 100% disk utilization on all 12 nodes will lead to real slow > puts/gets) > > What can I do? > > Cheers > Simon > > PS: transfers output: > > 'riak@10.46.109.202' waiting to handoff 17 partitions > 'riak@10.46.109.201' waiting to handoff 19 partitions > > (these are the 1.4.2 nodes) > > > On Wed, 11 Dec 2013 14:39:58 +0100 > Simon Effenberg wrote: > >> Also some side notes: >> >> "top" is even better on new 1.4.2 than on 1.3.1 machines.. IO >> utilization of disk is mostly the same (round about 33%).. 
>> >> but >> >> 95th percentile of response time for get (avg over all nodes): >> before upgrade: 29ms >> after upgrade: almost the same >> >> 95th percentile of response time for put (avg over all nodes): >> before upgrade: 60ms >> after upgrade: 1548ms >>but this is only because of 2 of 12 nodes are >>on 1.4.2 and are really slow (17000ms) >> >> Cheers, >> Simon >> >> On Wed, 11 Dec 2013 13:45:56 +0100 >> Simon Effenberg wrote: >> >>> Sorry I forgot the half of it.. >>> >>> seffenberg@kriak46-1:~$ free -m >>> total used free sharedbuffers cached >>> Mem: 23999 23759239 0184 16183 >>> -/+ buffers/cache: 7391 16607 >>> Swap:0 0 0 >>> >>> We have 12 servers.. >>> datadir on the compacted servers (1.4.2) ~ 765 GB >>> >>> AAE is enabled. >>> >>> I attached app.config and vm.args. >>> >>> Cheers >>> Simon >>> >>> On Wed, 11 Dec 2013 07:33:31 -0500 >>> Matthew Von-Maszewski wrote: >>> Ok, I am now suspecting that your servers are either using swap space (which is slow) or your leveldb file cache is thrashing (opening and closing multiple files per request). How many servers do you have and do you use Riak's active anti-entropy feature? I am going to plug all of this into a spreadsheet. Matthew Von-Maszewski On Dec 11, 2013, at 7:09, Simon Effenberg wrote: > Hi Matthew > > Memory: 23999 MB > > ring_creation_size, 256 > max_open_files, 100 > > riak-admin status: > > memory_total : 276001360 > memory_processes : 191506322 > memory_processes_used : 191439568 > memory_system : 84495038 > memory_atom : 686993 > memory_atom_used : 686560 > memory_binary : 21965352 > memory_code : 11332732 > memory_ets : 10823528 > > Thanks for looking! > > Cheers > Simon > > > > On Wed, 11 Dec 2013 06:44:42 -0500 > Matthew Von-Maszewski wrote: > >> I need to ask other developers as they arrive for the new day. Does not >> make sense to me. >> >> How many nodes do you have? How much RAM do you have in each node? 
>> What are your settings for max_open_files and cache_size in the >> app.config file? Maybe this is as simple as leveldb using too much RAM >> in 1.4. The memory accounting for maz_open_files changed in 1.4. >> >> Matthew Von-Maszewski >> >> >> On Dec 11, 2013, at 6:28, Simon Effenberg >> wrote: >> >>> Hi Matthew, >>> >>> it took around 11hours for the first node to finish the compaction. The >>> second node is running already 12 hours and is still doing compaction. >>> >>> Besides that I wonder because the fsm_put time on the new 1.4.2 host is >>> much higher (after the compaction) than on an old 1.3.1 (both are >>> running in the cluster right now and another one is doing the >>> compaction/upgrade while it is in the cluster but not directly >>> accessible because it is out of the Loadbalancer): >>>
Re: Upgrade from 1.3.1 to 1.4.2 => high IO
Oh and at the moment they are waiting for some handoffs and I see errors in logfiles: 2013-12-11 13:41:47.948 UTC [error] <0.7157.24>@riak_core_handoff_sender:start_fold:269 hinted_handoff transfer of riak_kv_vnode from 'riak@10.46.109.202' 468137243207554840987117797979434404733540892672 but I remember that somebody else had this as well and if I recall correctly it disappeared after the full upgrade was done.. but at the moment it's hard to think about upgrading everything at once.. (~12hours 100% disk utilization on all 12 nodes will lead to real slow puts/gets) What can I do? Cheers Simon PS: transfers output: 'riak@10.46.109.202' waiting to handoff 17 partitions 'riak@10.46.109.201' waiting to handoff 19 partitions (these are the 1.4.2 nodes) On Wed, 11 Dec 2013 14:39:58 +0100 Simon Effenberg wrote: > Also some side notes: > > "top" is even better on new 1.4.2 than on 1.3.1 machines.. IO > utilization of disk is mostly the same (round about 33%).. > > but > > 95th percentile of response time for get (avg over all nodes): > before upgrade: 29ms > after upgrade: almost the same > > 95th percentile of response time for put (avg over all nodes): > before upgrade: 60ms > after upgrade: 1548ms > but this is only because of 2 of 12 nodes are > on 1.4.2 and are really slow (17000ms) > > Cheers, > Simon > > On Wed, 11 Dec 2013 13:45:56 +0100 > Simon Effenberg wrote: > > > Sorry I forgot the half of it.. > > > > seffenberg@kriak46-1:~$ free -m > > total used free sharedbuffers cached > > Mem: 23999 23759239 0184 16183 > > -/+ buffers/cache: 7391 16607 > > Swap:0 0 0 > > > > We have 12 servers.. > > datadir on the compacted servers (1.4.2) ~ 765 GB > > > > AAE is enabled. > > > > I attached app.config and vm.args. 
> > > > Cheers > > Simon > > > > On Wed, 11 Dec 2013 07:33:31 -0500 > > Matthew Von-Maszewski wrote: > > > > > Ok, I am now suspecting that your servers are either using swap space > > > (which is slow) or your leveldb file cache is thrashing (opening and > > > closing multiple files per request). > > > > > > How many servers do you have and do you use Riak's active anti-entropy > > > feature? I am going to plug all of this into a spreadsheet. > > > > > > Matthew Von-Maszewski > > > > > > > > > On Dec 11, 2013, at 7:09, Simon Effenberg > > > wrote: > > > > > > > Hi Matthew > > > > > > > > Memory: 23999 MB > > > > > > > > ring_creation_size, 256 > > > > max_open_files, 100 > > > > > > > > riak-admin status: > > > > > > > > memory_total : 276001360 > > > > memory_processes : 191506322 > > > > memory_processes_used : 191439568 > > > > memory_system : 84495038 > > > > memory_atom : 686993 > > > > memory_atom_used : 686560 > > > > memory_binary : 21965352 > > > > memory_code : 11332732 > > > > memory_ets : 10823528 > > > > > > > > Thanks for looking! > > > > > > > > Cheers > > > > Simon > > > > > > > > > > > > > > > > On Wed, 11 Dec 2013 06:44:42 -0500 > > > > Matthew Von-Maszewski wrote: > > > > > > > >> I need to ask other developers as they arrive for the new day. Does > > > >> not make sense to me. > > > >> > > > >> How many nodes do you have? How much RAM do you have in each node? > > > >> What are your settings for max_open_files and cache_size in the > > > >> app.config file? Maybe this is as simple as leveldb using too much > > > >> RAM in 1.4. The memory accounting for maz_open_files changed in 1.4. > > > >> > > > >> Matthew Von-Maszewski > > > >> > > > >> > > > >> On Dec 11, 2013, at 6:28, Simon Effenberg > > > >> wrote: > > > >> > > > >>> Hi Matthew, > > > >>> > > > >>> it took around 11hours for the first node to finish the compaction. > > > >>> The > > > >>> second node is running already 12 hours and is still doing compaction. 
> > > >>> > > > >>> Besides that I wonder because the fsm_put time on the new 1.4.2 host > > > >>> is > > > >>> much higher (after the compaction) than on an old 1.3.1 (both are > > > >>> running in the cluster right now and another one is doing the > > > >>> compaction/upgrade while it is in the cluster but not directly > > > >>> accessible because it is out of the Loadbalancer): > > > >>> > > > >>> 1.4.2: > > > >>> > > > >>> node_put_fsm_time_mean : 2208050 > > > >>> node_put_fsm_time_median : 39231 > > > >>> node_put_fsm_time_95 : 17400382 > > > >>> node_put_fsm_time_99 : 50965752 > > > >>> node_put_fsm_time_100 : 59537762 > > > >>> node_put_fsm_active : 5 > > > >>> node_put_fsm_active_60s : 364 > > > >>> node_put_fsm_in_rate : 5 > > > >>> node_put_fsm_out_rate : 3 > > > >>> node_put_fsm_rejected : 0 > > > >>> node_put_fsm_rejected_60s : 0 > > > >>> node_put_fsm_rejected_total : 0 > > > >>> > > > >>> > > > >>> 1.3.1: > > > >>> > > > >>> node_put_fsm_time_mean : 5036 > > > >>> node_put_fsm_time_median : 1614 > > > >>> node_put_fsm_time_95 : 8789 > > > >>> node_pu
Re: Upgrade from 1.3.1 to 1.4.2 => high IO
Also some side notes: "top" is even better on new 1.4.2 than on 1.3.1 machines.. IO utilization of disk is mostly the same (round about 33%).. but 95th percentile of response time for get (avg over all nodes): before upgrade: 29ms after upgrade: almost the same 95th percentile of response time for put (avg over all nodes): before upgrade: 60ms after upgrade: 1548ms but this is only because of 2 of 12 nodes are on 1.4.2 and are really slow (17000ms) Cheers, Simon On Wed, 11 Dec 2013 13:45:56 +0100 Simon Effenberg wrote: > Sorry I forgot the half of it.. > > seffenberg@kriak46-1:~$ free -m > total used free sharedbuffers cached > Mem: 23999 23759239 0184 16183 > -/+ buffers/cache: 7391 16607 > Swap:0 0 0 > > We have 12 servers.. > datadir on the compacted servers (1.4.2) ~ 765 GB > > AAE is enabled. > > I attached app.config and vm.args. > > Cheers > Simon > > On Wed, 11 Dec 2013 07:33:31 -0500 > Matthew Von-Maszewski wrote: > > > Ok, I am now suspecting that your servers are either using swap space > > (which is slow) or your leveldb file cache is thrashing (opening and > > closing multiple files per request). > > > > How many servers do you have and do you use Riak's active anti-entropy > > feature? I am going to plug all of this into a spreadsheet. > > > > Matthew Von-Maszewski > > > > > > On Dec 11, 2013, at 7:09, Simon Effenberg wrote: > > > > > Hi Matthew > > > > > > Memory: 23999 MB > > > > > > ring_creation_size, 256 > > > max_open_files, 100 > > > > > > riak-admin status: > > > > > > memory_total : 276001360 > > > memory_processes : 191506322 > > > memory_processes_used : 191439568 > > > memory_system : 84495038 > > > memory_atom : 686993 > > > memory_atom_used : 686560 > > > memory_binary : 21965352 > > > memory_code : 11332732 > > > memory_ets : 10823528 > > > > > > Thanks for looking! 
> > > > > > Cheers > > > Simon > > > > > > > > > > > > On Wed, 11 Dec 2013 06:44:42 -0500 > > > Matthew Von-Maszewski wrote: > > > > > >> I need to ask other developers as they arrive for the new day. Does not > > >> make sense to me. > > >> > > >> How many nodes do you have? How much RAM do you have in each node? > > >> What are your settings for max_open_files and cache_size in the > > >> app.config file? Maybe this is as simple as leveldb using too much RAM > > >> in 1.4. The memory accounting for maz_open_files changed in 1.4. > > >> > > >> Matthew Von-Maszewski > > >> > > >> > > >> On Dec 11, 2013, at 6:28, Simon Effenberg > > >> wrote: > > >> > > >>> Hi Matthew, > > >>> > > >>> it took around 11hours for the first node to finish the compaction. The > > >>> second node is running already 12 hours and is still doing compaction. > > >>> > > >>> Besides that I wonder because the fsm_put time on the new 1.4.2 host is > > >>> much higher (after the compaction) than on an old 1.3.1 (both are > > >>> running in the cluster right now and another one is doing the > > >>> compaction/upgrade while it is in the cluster but not directly > > >>> accessible because it is out of the Loadbalancer): > > >>> > > >>> 1.4.2: > > >>> > > >>> node_put_fsm_time_mean : 2208050 > > >>> node_put_fsm_time_median : 39231 > > >>> node_put_fsm_time_95 : 17400382 > > >>> node_put_fsm_time_99 : 50965752 > > >>> node_put_fsm_time_100 : 59537762 > > >>> node_put_fsm_active : 5 > > >>> node_put_fsm_active_60s : 364 > > >>> node_put_fsm_in_rate : 5 > > >>> node_put_fsm_out_rate : 3 > > >>> node_put_fsm_rejected : 0 > > >>> node_put_fsm_rejected_60s : 0 > > >>> node_put_fsm_rejected_total : 0 > > >>> > > >>> > > >>> 1.3.1: > > >>> > > >>> node_put_fsm_time_mean : 5036 > > >>> node_put_fsm_time_median : 1614 > > >>> node_put_fsm_time_95 : 8789 > > >>> node_put_fsm_time_99 : 38258 > > >>> node_put_fsm_time_100 : 384372 > > >>> > > >>> > > >>> any clue why this could/should be? 
> > >>> > > >>> Cheers > > >>> Simon > > >>> > > >>> On Tue, 10 Dec 2013 17:21:07 +0100 > > >>> Simon Effenberg wrote: > > >>> > > Hi Matthew, > > > > thanks!.. that answers my questions! > > > > Cheers > > Simon > > > > On Tue, 10 Dec 2013 11:08:32 -0500 > > Matthew Von-Maszewski wrote: > > > > > 2i is not my expertise, so I had to discuss you concerns with another > > > Basho developer. He says: > > > > > > Between 1.3 and 1.4, the 2i query did change but not the 2i on-disk > > > format. You must wait for all nodes to update if you desire to use > > > the new 2i query. The 2i data will properly write/update on both 1.3 > > > and 1.4 machines during the migration. > > > > > > Does that answer your question? > > > > > > > > > And yes, you might see available disk space increase during the > > > upgrade compactions if your dataset contains numerous delete > >
Re: Upgrade from 1.3.1 to 1.4.2 => high IO
Sorry I forgot the half of it..

seffenberg@kriak46-1:~$ free -m
             total       used       free     shared    buffers     cached
Mem:         23999      23759        239          0        184      16183
-/+ buffers/cache:       7391      16607
Swap:            0          0          0

We have 12 servers.. datadir on the compacted servers (1.4.2) ~ 765 GB AAE is enabled. I attached app.config and vm.args. Cheers Simon On Wed, 11 Dec 2013 07:33:31 -0500 Matthew Von-Maszewski wrote: > Ok, I am now suspecting that your servers are either using swap space (which > is slow) or your leveldb file cache is thrashing (opening and closing > multiple files per request). > > How many servers do you have and do you use Riak's active anti-entropy > feature? I am going to plug all of this into a spreadsheet. > > Matthew Von-Maszewski > > > On Dec 11, 2013, at 7:09, Simon Effenberg wrote: > > > Hi Matthew > > > > Memory: 23999 MB > > > > ring_creation_size, 256 > > max_open_files, 100 > > > > riak-admin status: > > > > memory_total : 276001360 > > memory_processes : 191506322 > > memory_processes_used : 191439568 > > memory_system : 84495038 > > memory_atom : 686993 > > memory_atom_used : 686560 > > memory_binary : 21965352 > > memory_code : 11332732 > > memory_ets : 10823528 > > > > Thanks for looking! > > > > Cheers > > Simon > > > > > > > > On Wed, 11 Dec 2013 06:44:42 -0500 > > Matthew Von-Maszewski wrote: > > > >> I need to ask other developers as they arrive for the new day. Does not > >> make sense to me. > >> > >> How many nodes do you have? How much RAM do you have in each node? What > >> are your settings for max_open_files and cache_size in the app.config > >> file? Maybe this is as simple as leveldb using too much RAM in 1.4. The > >> memory accounting for max_open_files changed in 1.4. > >> > >> Matthew Von-Maszewski > >> > >> > >> On Dec 11, 2013, at 6:28, Simon Effenberg > >> wrote: > >> > >>> Hi Matthew, > >>> > >>> it took around 11hours for the first node to finish the compaction. The > >>> second node is running already 12 hours and is still doing compaction.
> >>> > >>> Besides that I wonder because the fsm_put time on the new 1.4.2 host is > >>> much higher (after the compaction) than on an old 1.3.1 (both are > >>> running in the cluster right now and another one is doing the > >>> compaction/upgrade while it is in the cluster but not directly > >>> accessible because it is out of the Loadbalancer): > >>> > >>> 1.4.2: > >>> > >>> node_put_fsm_time_mean : 2208050 > >>> node_put_fsm_time_median : 39231 > >>> node_put_fsm_time_95 : 17400382 > >>> node_put_fsm_time_99 : 50965752 > >>> node_put_fsm_time_100 : 59537762 > >>> node_put_fsm_active : 5 > >>> node_put_fsm_active_60s : 364 > >>> node_put_fsm_in_rate : 5 > >>> node_put_fsm_out_rate : 3 > >>> node_put_fsm_rejected : 0 > >>> node_put_fsm_rejected_60s : 0 > >>> node_put_fsm_rejected_total : 0 > >>> > >>> > >>> 1.3.1: > >>> > >>> node_put_fsm_time_mean : 5036 > >>> node_put_fsm_time_median : 1614 > >>> node_put_fsm_time_95 : 8789 > >>> node_put_fsm_time_99 : 38258 > >>> node_put_fsm_time_100 : 384372 > >>> > >>> > >>> any clue why this could/should be? > >>> > >>> Cheers > >>> Simon > >>> > >>> On Tue, 10 Dec 2013 17:21:07 +0100 > >>> Simon Effenberg wrote: > >>> > Hi Matthew, > > thanks!.. that answers my questions! > > Cheers > Simon > > On Tue, 10 Dec 2013 11:08:32 -0500 > Matthew Von-Maszewski wrote: > > > 2i is not my expertise, so I had to discuss you concerns with another > > Basho developer. He says: > > > > Between 1.3 and 1.4, the 2i query did change but not the 2i on-disk > > format. You must wait for all nodes to update if you desire to use the > > new 2i query. The 2i data will properly write/update on both 1.3 and > > 1.4 machines during the migration. > > > > Does that answer your question? > > > > > > And yes, you might see available disk space increase during the upgrade > > compactions if your dataset contains numerous delete "tombstones". The > > Riak 2.0 code includes a new feature called "aggressive delete" for > > leveldb. 
This feature is more proactive in pushing delete tombstones > > through the levels to free up disk space much more quickly (especially > > if you perform block deletes every now and then). > > > > Matthew > > > > > > On Dec 10, 2013, at 10:44 AM, Simon Effenberg > > wrote: > > > >> Hi Matthew, > >> > >> see inline.. > >> > >> On Tue, 10 Dec 2013 10:38:03 -0500 > >> Matthew Von-Maszewski wrote: > >> > >>> The sad truth is that you are not the first to see this problem. And > >>> yes, it has to do with your 950GB per node dataset. And no, nothing > >>> to do but sit through it at this time. > >>> > >>> While I did extensive
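The `free -m` numbers quoted in this thread run together in the archived text, but they can be cross-checked arithmetically: free(1) derives the "-/+ buffers/cache" row from the "Mem:" row. A sketch assuming the reconstructed values (total 23999, used 23759, free 239, shared 0, buffers 184, cached 16183 MB):

```python
# Reconstructed values (MB) from the flattened "free -m" output in this thread.
total, used, free_mb = 23999, 23759, 239
shared, buffers, cached = 0, 184, 16183

# free(1) derives the "-/+ buffers/cache" row from the Mem: row:
free_without_cache = free_mb + buffers + cached  # reclaimable by applications
used_without_cache = used - buffers - cached

print(free_without_cache)  # 16606 (the quoted output shows 16607; rounding)
print(used_without_cache)  # 7392  (the quoted output shows 7391)
```

With swap configured at 0 MB and ~16 GB effectively reclaimable, the swap-thrashing hypothesis Matthew raises can be ruled out, leaving the leveldb file cache as the suspect.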
Re: Upgrade from 1.3.1 to 1.4.2 => high IO
Ok, I am now suspecting that your servers are either using swap space (which is slow) or your leveldb file cache is thrashing (opening and closing multiple files per request). How many servers do you have and do you use Riak's active anti-entropy feature? I am going to plug all of this into a spreadsheet. Matthew Von-Maszewski On Dec 11, 2013, at 7:09, Simon Effenberg wrote: > Hi Matthew > > Memory: 23999 MB > > ring_creation_size, 256 > max_open_files, 100 > > riak-admin status: > > memory_total : 276001360 > memory_processes : 191506322 > memory_processes_used : 191439568 > memory_system : 84495038 > memory_atom : 686993 > memory_atom_used : 686560 > memory_binary : 21965352 > memory_code : 11332732 > memory_ets : 10823528 > > Thanks for looking! > > Cheers > Simon > > > > On Wed, 11 Dec 2013 06:44:42 -0500 > Matthew Von-Maszewski wrote: > >> I need to ask other developers as they arrive for the new day. Does not >> make sense to me. >> >> How many nodes do you have? How much RAM do you have in each node? What >> are your settings for max_open_files and cache_size in the app.config file? >> Maybe this is as simple as leveldb using too much RAM in 1.4. The memory >> accounting for maz_open_files changed in 1.4. >> >> Matthew Von-Maszewski >> >> >> On Dec 11, 2013, at 6:28, Simon Effenberg wrote: >> >>> Hi Matthew, >>> >>> it took around 11hours for the first node to finish the compaction. The >>> second node is running already 12 hours and is still doing compaction. 
>>> >>> Besides that I wonder because the fsm_put time on the new 1.4.2 host is >>> much higher (after the compaction) than on an old 1.3.1 (both are >>> running in the cluster right now and another one is doing the >>> compaction/upgrade while it is in the cluster but not directly >>> accessible because it is out of the Loadbalancer): >>> >>> 1.4.2: >>> >>> node_put_fsm_time_mean : 2208050 >>> node_put_fsm_time_median : 39231 >>> node_put_fsm_time_95 : 17400382 >>> node_put_fsm_time_99 : 50965752 >>> node_put_fsm_time_100 : 59537762 >>> node_put_fsm_active : 5 >>> node_put_fsm_active_60s : 364 >>> node_put_fsm_in_rate : 5 >>> node_put_fsm_out_rate : 3 >>> node_put_fsm_rejected : 0 >>> node_put_fsm_rejected_60s : 0 >>> node_put_fsm_rejected_total : 0 >>> >>> >>> 1.3.1: >>> >>> node_put_fsm_time_mean : 5036 >>> node_put_fsm_time_median : 1614 >>> node_put_fsm_time_95 : 8789 >>> node_put_fsm_time_99 : 38258 >>> node_put_fsm_time_100 : 384372 >>> >>> >>> any clue why this could/should be? >>> >>> Cheers >>> Simon >>> >>> On Tue, 10 Dec 2013 17:21:07 +0100 >>> Simon Effenberg wrote: >>> Hi Matthew, thanks!.. that answers my questions! Cheers Simon On Tue, 10 Dec 2013 11:08:32 -0500 Matthew Von-Maszewski wrote: > 2i is not my expertise, so I had to discuss you concerns with another > Basho developer. He says: > > Between 1.3 and 1.4, the 2i query did change but not the 2i on-disk > format. You must wait for all nodes to update if you desire to use the > new 2i query. The 2i data will properly write/update on both 1.3 and 1.4 > machines during the migration. > > Does that answer your question? > > > And yes, you might see available disk space increase during the upgrade > compactions if your dataset contains numerous delete "tombstones". The > Riak 2.0 code includes a new feature called "aggressive delete" for > leveldb. 
This feature is more proactive in pushing delete tombstones > through the levels to free up disk space much more quickly (especially if > you perform block deletes every now and then). > > Matthew > > > On Dec 10, 2013, at 10:44 AM, Simon Effenberg > wrote: > >> Hi Matthew, >> >> see inline.. >> >> On Tue, 10 Dec 2013 10:38:03 -0500 >> Matthew Von-Maszewski wrote: >> >>> The sad truth is that you are not the first to see this problem. And >>> yes, it has to do with your 950GB per node dataset. And no, nothing to >>> do but sit through it at this time. >>> >>> While I did extensive testing around upgrade times before shipping 1.4, >>> apparently there are data configurations I did not anticipate. You are >>> likely seeing a cascade where a shift of one file from level-1 to >>> level-2 is causing a shift of another file from level-2 to level-3, >>> which causes a level-3 file to shift to level-4, etc … then the next >>> file shifts from level-1. >>> >>> The bright side of this pain is that you will end up with better write >>> throughput once all the compaction ends. >> >> I have to deal with that.. but my problem is now, if I'm doing this >> node by node it looks like 2i searches aren't possible while 1.3 and >> 1.4 nodes exists in the cluster. Is there any problem which leads me to >> an 2i repai
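Matthew's max_open_files question has a quantitative side. Under Basho leveldb's 1.4-era accounting, each vnode budgets memory per open table file plus its block cache. The ~4 MB per-file charge and the 8 MB cache below are assumptions for illustration, not figures from this thread or from the poster's app.config (beyond max_open_files = 100):

```python
# Hypothetical per-node leveldb memory estimate. file_cost_mb and
# cache_size_mb are assumed values, not taken from the poster's app.config.
ring_size = 256
nodes = 12
max_open_files = 100     # from the poster's app.config
file_cost_mb = 4         # assumed accounting charge per open .sst file
cache_size_mb = 8        # assumed block cache per vnode

vnodes_per_node = ring_size / nodes
per_vnode_mb = max_open_files * file_cost_mb + cache_size_mb
total_gb = vnodes_per_node * per_vnode_mb / 1024

print(round(total_gb, 1))  # 8.5 GB of the 24 GB node, under these assumptions
```

Under these assumed constants, leveldb's file cache alone could claim a third of the node's RAM, which is why raising max_open_files from 100 to 170 (discussed later in the thread) is not free.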
Re: Upgrade from 1.3.1 to 1.4.2 => high IO
Hi Matthew Memory: 23999 MB ring_creation_size, 256 max_open_files, 100 riak-admin status: memory_total : 276001360 memory_processes : 191506322 memory_processes_used : 191439568 memory_system : 84495038 memory_atom : 686993 memory_atom_used : 686560 memory_binary : 21965352 memory_code : 11332732 memory_ets : 10823528 Thanks for looking! Cheers Simon On Wed, 11 Dec 2013 06:44:42 -0500 Matthew Von-Maszewski wrote: > I need to ask other developers as they arrive for the new day. Does not make > sense to me. > > How many nodes do you have? How much RAM do you have in each node? What are > your settings for max_open_files and cache_size in the app.config file? > Maybe this is as simple as leveldb using too much RAM in 1.4. The memory > accounting for maz_open_files changed in 1.4. > > Matthew Von-Maszewski > > > On Dec 11, 2013, at 6:28, Simon Effenberg wrote: > > > Hi Matthew, > > > > it took around 11hours for the first node to finish the compaction. The > > second node is running already 12 hours and is still doing compaction. 
> > > > Besides that I wonder because the fsm_put time on the new 1.4.2 host is > > much higher (after the compaction) than on an old 1.3.1 (both are > > running in the cluster right now and another one is doing the > > compaction/upgrade while it is in the cluster but not directly > > accessible because it is out of the Loadbalancer): > > > > 1.4.2: > > > > node_put_fsm_time_mean : 2208050 > > node_put_fsm_time_median : 39231 > > node_put_fsm_time_95 : 17400382 > > node_put_fsm_time_99 : 50965752 > > node_put_fsm_time_100 : 59537762 > > node_put_fsm_active : 5 > > node_put_fsm_active_60s : 364 > > node_put_fsm_in_rate : 5 > > node_put_fsm_out_rate : 3 > > node_put_fsm_rejected : 0 > > node_put_fsm_rejected_60s : 0 > > node_put_fsm_rejected_total : 0 > > > > > > 1.3.1: > > > > node_put_fsm_time_mean : 5036 > > node_put_fsm_time_median : 1614 > > node_put_fsm_time_95 : 8789 > > node_put_fsm_time_99 : 38258 > > node_put_fsm_time_100 : 384372 > > > > > > any clue why this could/should be? > > > > Cheers > > Simon > > > > On Tue, 10 Dec 2013 17:21:07 +0100 > > Simon Effenberg wrote: > > > >> Hi Matthew, > >> > >> thanks!.. that answers my questions! > >> > >> Cheers > >> Simon > >> > >> On Tue, 10 Dec 2013 11:08:32 -0500 > >> Matthew Von-Maszewski wrote: > >> > >>> 2i is not my expertise, so I had to discuss you concerns with another > >>> Basho developer. He says: > >>> > >>> Between 1.3 and 1.4, the 2i query did change but not the 2i on-disk > >>> format. You must wait for all nodes to update if you desire to use the > >>> new 2i query. The 2i data will properly write/update on both 1.3 and 1.4 > >>> machines during the migration. > >>> > >>> Does that answer your question? > >>> > >>> > >>> And yes, you might see available disk space increase during the upgrade > >>> compactions if your dataset contains numerous delete "tombstones". The > >>> Riak 2.0 code includes a new feature called "aggressive delete" for > >>> leveldb. 
This feature is more proactive in pushing delete tombstones > >>> through the levels to free up disk space much more quickly (especially if > >>> you perform block deletes every now and then). > >>> > >>> Matthew > >>> > >>> > >>> On Dec 10, 2013, at 10:44 AM, Simon Effenberg > >>> wrote: > >>> > Hi Matthew, > > see inline.. > > On Tue, 10 Dec 2013 10:38:03 -0500 > Matthew Von-Maszewski wrote: > > > The sad truth is that you are not the first to see this problem. And > > yes, it has to do with your 950GB per node dataset. And no, nothing to > > do but sit through it at this time. > > > > While I did extensive testing around upgrade times before shipping 1.4, > > apparently there are data configurations I did not anticipate. You are > > likely seeing a cascade where a shift of one file from level-1 to > > level-2 is causing a shift of another file from level-2 to level-3, > > which causes a level-3 file to shift to level-4, etc … then the next > > file shifts from level-1. > > > > The bright side of this pain is that you will end up with better write > > throughput once all the compaction ends. > > I have to deal with that.. but my problem is now, if I'm doing this > node by node it looks like 2i searches aren't possible while 1.3 and > 1.4 nodes exists in the cluster. Is there any problem which leads me to > an 2i repair marathon or could I easily wait for some hours for each > node until all merges are done before I upgrade the next one? (2i > searches can fail for some time.. the APP isn't having problems with > that but are new inserts with 2i indices processed successfully or do > I have to do the 2i repair?) > > /s > > one other good think: saving disk space is one advantage ;).. > > > > > > Riak 2.0's l
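The riak-admin status memory_* figures Simon posted are in bytes, which makes them easy to misread. Converting shows the Erlang VM itself is using only a few hundred MiB, so the bulk of the 24 GB of RAM is being consumed outside the VM (OS page cache and leveldb's own allocations). A quick conversion sketch:

```python
# memory_* values (bytes) from the riak-admin status output in this thread.
stats = {
    "memory_total": 276001360,
    "memory_processes": 191506322,
    "memory_binary": 21965352,
    "memory_ets": 10823528,
}

mib = {name: value / 2**20 for name, value in stats.items()}
print(round(mib["memory_total"]))      # 263 MiB for the whole Erlang VM
print(round(mib["memory_processes"]))  # 183 MiB of that is process heaps
```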
Re: Upgrade from 1.3.1 to 1.4.2 => high IO
I need to ask other developers as they arrive for the new day. Does not make sense to me.

How many nodes do you have? How much RAM do you have in each node? What are your settings for max_open_files and cache_size in the app.config file?

Maybe this is as simple as leveldb using too much RAM in 1.4. The memory accounting for max_open_files changed in 1.4.

Matthew Von-Maszewski
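For reference, the settings Matthew asks about live in the eleveldb section of app.config. A sketch of that section follows; the path and values are illustrative placeholders, not recommendations:

```erlang
%% app.config (fragment) -- eleveldb settings relevant to Matthew's questions.
%% data_root, file count, and cache size below are placeholders.
{eleveldb, [
    {data_root, "/var/lib/riak/leveldb"},
    {max_open_files, 170},   %% per-vnode open-file budget; in 1.4 leveldb's
                             %% memory accounting also charges against this
    {cache_size, 8388608}    %% per-vnode block cache, in bytes (8 MB)
]}
```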
Re: Upgrade from 1.3.1 to 1.4.2 => high IO
Hi Matthew,

it took around 11 hours for the first node to finish the compaction. The second node has already been running for 12 hours and is still doing compaction.

Besides that, I wonder why the fsm_put times on the new 1.4.2 host are much higher (after the compaction) than on an old 1.3.1 host (both are running in the cluster right now, and another one is doing the compaction/upgrade while it is in the cluster but not directly accessible because it is out of the load balancer):

1.4.2:

node_put_fsm_time_mean : 2208050
node_put_fsm_time_median : 39231
node_put_fsm_time_95 : 17400382
node_put_fsm_time_99 : 50965752
node_put_fsm_time_100 : 59537762
node_put_fsm_active : 5
node_put_fsm_active_60s : 364
node_put_fsm_in_rate : 5
node_put_fsm_out_rate : 3
node_put_fsm_rejected : 0
node_put_fsm_rejected_60s : 0
node_put_fsm_rejected_total : 0

1.3.1:

node_put_fsm_time_mean : 5036
node_put_fsm_time_median : 1614
node_put_fsm_time_95 : 8789
node_put_fsm_time_99 : 38258
node_put_fsm_time_100 : 384372

Any clue why this could/should be?

Cheers
Simon
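To put those numbers in scale: the node_put_fsm_time_* stats are reported in microseconds, so this is a gap of milliseconds versus seconds. A quick conversion of the values in the stats output:

```python
# put_fsm times from riak-admin status are in microseconds.
US_PER_MS = 1000

stats = {
    "1.4.2 mean": 2208050,   # ~2.2 s per put on average
    "1.3.1 mean": 5036,      # ~5 ms
    "1.4.2 p99": 50965752,   # ~51 s tail latency
    "1.3.1 p99": 38258,      # ~38 ms
}

for name, us in stats.items():
    print(f"{name}: {us / US_PER_MS:.1f} ms")
```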
Re: Upgrade from 1.3.1 to 1.4.2 => high IO
Hi Matthew,

thanks!.. that answers my questions!

Cheers
Simon
Re: Upgrade from 1.3.1 to 1.4.2 => high IO
2i is not my expertise, so I had to discuss your concerns with another Basho developer. He says:

Between 1.3 and 1.4, the 2i query did change but not the 2i on-disk format. You must wait for all nodes to update if you desire to use the new 2i query. The 2i data will properly write/update on both 1.3 and 1.4 machines during the migration.

Does that answer your question?

And yes, you might see available disk space increase during the upgrade compactions if your dataset contains numerous delete "tombstones". The Riak 2.0 code includes a new feature called "aggressive delete" for leveldb. This feature is more proactive in pushing delete tombstones through the levels to free up disk space much more quickly (especially if you perform block deletes every now and then).

Matthew
Re: Upgrade from 1.3.1 to 1.4.2 => high IO
Hi Matthew,

see inline..

On Tue, 10 Dec 2013 10:38:03 -0500 Matthew Von-Maszewski wrote:

> The sad truth is that you are not the first to see this problem. And yes, it has to do with your 950GB per node dataset. And no, nothing to do but sit through it at this time.
>
> While I did extensive testing around upgrade times before shipping 1.4, apparently there are data configurations I did not anticipate. You are likely seeing a cascade where a shift of one file from level-1 to level-2 is causing a shift of another file from level-2 to level-3, which causes a level-3 file to shift to level-4, etc … then the next file shifts from level-1.
>
> The bright side of this pain is that you will end up with better write throughput once all the compaction ends.

I have to deal with that.. but my problem now is: if I'm doing this node by node, it looks like 2i searches aren't possible while 1.3 and 1.4 nodes exist in the cluster. Is there any problem which leads me to a 2i repair marathon, or could I simply wait a few hours for each node until all merges are done before I upgrade the next one? (2i searches can fail for some time.. the app isn't having problems with that, but are new inserts with 2i indices processed successfully, or do I have to do the 2i repair?)

/s

One other good thing: saving disk space is one advantage ;)..

> Riak 2.0's leveldb has code to prevent/reduce compaction cascades, but that is not going to help you today.
>
> Matthew
Re: Upgrade from 1.3.1 to 1.4.2 => high IO
The sad truth is that you are not the first to see this problem. And yes, it has to do with your 950GB per node dataset. And no, there is nothing to do but sit through it at this time.

While I did extensive testing around upgrade times before shipping 1.4, apparently there are data configurations I did not anticipate. You are likely seeing a cascade where a shift of one file from level-1 to level-2 causes a shift of another file from level-2 to level-3, which causes a level-3 file to shift to level-4, etc … then the next file shifts from level-1.

The bright side of this pain is that you will end up with better write throughput once all the compaction ends.

Riak 2.0's leveldb has code to prevent/reduce compaction cascades, but that is not going to help you today.

Matthew
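The cascade Matthew describes can be sketched with a toy model. This is illustrative only: real leveldb uses per-level byte-size targets and more elaborate heuristics, not the simple file counts assumed here.

```python
# Toy model of a compaction cascade: each level holds at most
# capacity[i] files; pushing a file into a full level displaces one
# file into the next level, which may overflow in turn.

def push(levels, capacity, level=0):
    """Add one file to `level`; return how many compaction moves cascade."""
    moves = 0
    levels[level] += 1
    while level < len(levels) - 1 and levels[level] > capacity[level]:
        levels[level] -= 1       # compact one file out of this level...
        levels[level + 1] += 1   # ...into the next one down
        moves += 1
        level += 1
    return moves

# Levels 1-3 already at capacity: a single new file ripples through all.
levels = [4, 8, 16, 0]
capacity = [4, 8, 16, 1000]
print(push(levels, capacity))  # → 3: every full level overflows in turn
```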
Upgrade from 1.3.1 to 1.4.2 => high IO
Hi @list,

I'm trying to upgrade our Riak cluster from 1.3.1 to 1.4.2. After upgrading the first node (out of 12), this node seems to do many merges. The sst_* directories change in size "rapidly" and the node is at 100% disk utilization all the time.

I know that there is something like that:

"The first execution of 1.4.0 leveldb using a 1.3.x or 1.2.x dataset will initiate an automatic conversion that could pause the startup of each node by 3 to 7 minutes. The leveldb data in "level #1" is being adjusted such that "level #1" can operate as an overlapped data level instead of as a sorted data level. The conversion is simply the reduction of the number of files in "level #1" to being less than eight via normal compaction of data from "level #1" into "level #2". This is a one time conversion."

but it looks much more invasive than explained here, or doesn't have anything to do with the (probably seen) merges.

Is this "normal" behavior or could I do anything about it?

At the moment I'm stuck with the upgrade procedure because this high IO load would probably lead to high response times.

Also we have a lot of data (per node ~950 GB).

Cheers
Simon

___
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
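One way to watch the compaction Simon describes is to sample the total size of the leveldb tree over time. A minimal sketch; the path is an assumption, so substitute the data_root from your own app.config:

```python
import os

def tree_bytes(path):
    """Total size in bytes of all files under `path`."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                pass  # compaction may delete a file while we walk
    return total

# Example usage (path is hypothetical): sample twice, a minute apart,
# and compare -- steady churn means compaction is still running.
#   before = tree_bytes("/var/lib/riak/leveldb")
#   time.sleep(60)
#   print(tree_bytes("/var/lib/riak/leveldb") - before)
```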