Upgraded riak 1.4.9 is pegging the CPU

2014-06-05 Thread Alain Rodriguez
Hi all,

I upgraded 1 of 9 riak nodes in a cluster last night from 1.4.0 to 1.4.9.
The rest are running 1.4.0.

Ever since, the upgraded node, riak01, has been consuming a significantly
larger percentage of CPU, and its PUT times have gotten worse. htop
indicates one particular process pegging the CPU, and many more
processes running than I was used to seeing before.

Has anyone seen this before? Do I have to retune something for 1.4.9?

I am attaching htop, cpu and put graphs, and my app.config used across all
servers.

Thanks!

htop:
https://s3.amazonaws.com/uploads.hipchat.com/17604/95038/vEznS9gh6BRRNMR/htop.png
cpu:
https://s3.amazonaws.com/uploads.hipchat.com/17604/95038/21jilAfIwn8L5zC/cpu.png
put:
https://s3.amazonaws.com/uploads.hipchat.com/17604/95038/wX36crPiMeRg8kb/put.png


Attachments: app.config.erb, vm.args.erb
___
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com


Re: Upgraded riak 1.4.9 is pegging the CPU

2014-06-05 Thread Shane McEwan
G'day!

Did you turn off Active Anti-Entropy (AAE) and remove its files before upgrading?

From the 1.4.8 release notes:

IMPORTANT We recommend removing current AAE trees before upgrading. That
is, all files under the anti_entropy sub-directory. This will avoid
potentially large amounts of repair activity once correct hashes start
being added. The data in the current trees can only be fixed by a full
rebuild, so this repair activity is wasteful. Trees will start to build
once AAE is re-enabled. To minimize the impact of this, we recommend
upgrading during a period of low activity.

Shane.
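
For anyone who wants to script the cleanup described in the release notes, here is a minimal Python sketch. It is not an official Basho tool. It assumes the node has already been stopped, and the function name clear_aae_trees is made up for this example. The anti_entropy directory is passed in explicitly because its location depends on how the node is installed.

```python
import shutil
from pathlib import Path


def clear_aae_trees(anti_entropy_dir):
    """Delete every entry under anti_entropy_dir but keep the directory itself.

    Hypothetical helper: run it only while the Riak node is stopped.
    Returns the sorted names of the entries that were removed.
    """
    root = Path(anti_entropy_dir)
    removed = []
    for entry in sorted(root.iterdir()):
        if entry.is_dir() and not entry.is_symlink():
            # Each partition's hash tree lives in its own sub-directory.
            shutil.rmtree(entry)
        else:
            entry.unlink()
        removed.append(entry.name)
    return removed
```

Calling clear_aae_trees with the node's anti_entropy directory returns the names it deleted, which can be logged before the node is started again.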



Re: Upgraded riak 1.4.9 is pegging the CPU

2014-06-05 Thread Alain Rodriguez
Thanks for the quick reply. No, I did not. Is this something I can still do
now (stop, remove files, start again), or is it too late? How could I verify
that this is the issue?




Re: Upgraded riak 1.4.9 is pegging the CPU

2014-06-05 Thread Alain Rodriguez
Actually I just noticed it is likely the AAE issue:

2014-06-05 14:53:47.587 [error] 0.16054.31 CRASH REPORT Process
0.16054.31 with 0 neighbours exited with reason: no match of right hand
value {error,{db_open,IO error: lock
/var/lib/riak/anti_entropy/1061872283373234151507364761270424381468763488256/LOCK:
already held by process}} in hashtree:new_segment_store/2 line 505 in
gen_server:init_it/6 line 328
2014-06-05 14:53:47.588 [error] 0.16056.31 CRASH REPORT Process
0.16056.31 with 0 neighbours exited with reason: no match of right hand
value {error,{db_open,IO error: lock
/var/lib/riak/anti_entropy/1335903840372778448670555667404727447654250840064/LOCK:
already held by process}} in hashtree:new_segment_store/2 line 505 in
gen_server:init_it/6 line 328
2014-06-05 14:53:47.588 [error] 0.16055.31 CRASH REPORT Process
0.16055.31 with 0 neighbours exited with reason: no match of right hand
value {error,{db_open,IO error: lock
/var/lib/riak/anti_entropy/1267395951122892374379757940871151681107879002112/LOCK:
already held by process}} in hashtree:new_segment_store/2 line 505 in
gen_server:init_it/6 line 328
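
To see how many partitions are affected by crash reports like these, a small Python sketch can count the lock errors per partition directory. The function name locked_partitions is made up for this example, and the pattern is derived only from the three entries above, so it may need adjusting for other log lines.

```python
import re
from collections import Counter

# Matches the lock failure seen in the crash reports above. \s+ lets the
# pattern survive a line wrap between "LOCK:" and "already".
LOCK_ERROR = re.compile(r"anti_entropy/(\d+)/LOCK:\s+already held by process")


def locked_partitions(log_text):
    """Return a Counter mapping partition id to its number of lock errors."""
    return Counter(LOCK_ERROR.findall(log_text))
```

Passing the text of a log file to locked_partitions and printing most_common() on the result shows which AAE trees fail to open most often. Which log file to read is left to the caller, since the thread does not name it.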

Bollocks!




Re: Upgraded riak 1.4.9 is pegging the CPU

2014-06-05 Thread Engel Sanchez
Hi Alain. I don't think you are seeing the AAE issue. The problem with
upgrading from 1.4.4-1.4.7 to 1.4.8 was a broken hash function in those
versions, which made their AAE trees incompatible. You should not have the
same problem coming from 1.4.0. It seems that Erlang processes are repeatedly
crashing and restarting. It would be good to grab all your logs before they
rotate so we can see exactly what crashes first and causes this snowball
effect.
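
One way to look for the first thing that crashed is to pull the earliest [error] entries out of the saved logs. The Python sketch below assumes each entry starts with a timestamp and level in the form shown in the crash reports earlier in this thread ("2014-06-05 14:53:47.587 [error] ..."). The function name earliest_errors is made up for this example.

```python
import re
from datetime import datetime

# Timestamp, level and message, in the layout of the entries quoted above.
ENTRY = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) \[(\w+)\] (.*)$"
)


def earliest_errors(lines, limit=5):
    """Return up to `limit` (timestamp, message) pairs for the oldest errors."""
    entries = []
    for line in lines:
        match = ENTRY.match(line.rstrip("\n"))
        # Skip continuation lines and anything that is not an error entry.
        if match and match.group(2) == "error":
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")
            entries.append((stamp, match.group(3)))
    # The sort is stable, so entries with equal timestamps keep input order.
    entries.sort(key=lambda entry: entry[0])
    return entries[:limit]
```

Feeding it the lines of all collected log files at once gives a single ordering across files, which helps when the logs have already rotated.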




Re: Upgraded riak 1.4.9 is pegging the CPU

2014-06-05 Thread Engel Sanchez
Alain, thanks for the logs you sent me on the side. I'm not yet sure what
the root cause is, but I saw a lot of handoff activity and busy distributed
port messages, which indicate that the single TCP connection between two
Erlang nodes is completely saturated. Since there is too much going on,
turning off AAE and examining your cluster with less activity might still be
a good idea. Check the output of riak-admin transfers until it is quiet. I
noticed you have a file limit of 8192. That is not low, but newer Riak
versions use more file handles, so it would be a good idea to double that.
Let us know what the stats and the logs look like after AAE is off so we can
see what else we can do.
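
The file limit can be checked with a short Python sketch like the one below. The target of 16384 is simply double the 8192 mentioned above, and the function names are made up for this example. It reports the limit of the process that runs the script, so it only reflects Riak's limit if it is run as the same user in the same environment that starts the node.

```python
import resource


def meets_target(limit, target):
    """True if an open-file limit is unlimited or at least `target`."""
    return limit == resource.RLIM_INFINITY or limit >= target


def nofile_report(target=16384):
    """Compare this process's open-file limits with `target`.

    The default of 16384 is an assumption: twice the 8192 from the thread.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return {
        "soft": soft,
        "hard": hard,
        "target": target,
        "ok": meets_target(soft, target),
    }
```

If "ok" comes back False, the limit needs to be raised before the node is started; how to do that depends on the operating system and on how Riak is launched.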

