Re: Riak transfer limit.

2014-01-28 Thread Jeppe Toustrup
On 27 January 2014 18:00, Guido Medina guido.med...@temetra.com wrote:

 What's a good value for the transfer limit when re-arranging, adding or
 removing nodes?
 Or is there a generic rule of thumb based on physical nodes, processors,
 etc.?

 Once the transfer is completed, is it good practice to set it back to its
 default value, or should the calculated (guessed?) transfer limit stay?


I have just removed a node from our Riak cluster. I turned the transfer
limit on the leaving node up high, and set the other machines in the
cluster to 1. That way the node leaving the cluster got rid of its
data as fast as possible, while the nodes serving clients only had one
transfer each, to make sure they weren't overloaded. It worked fine for me,
but it may depend on how much load you have on the cluster during the
data migration and how important response times are for your system.
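
In other words, something along these lines (the node names and the high
limit are placeholders, adjust them to your cluster):

  # node that is leaving the cluster: allow many concurrent handoffs
  riak-admin transfer-limit riak@leaving-node 16

  # every node still serving clients: only one handoff at a time
  riak-admin transfer-limit riak@serving-node-1 1
  riak-admin transfer-limit riak@serving-node-2 1

  # keep an eye on progress
  riak-admin transfers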

-- 
*Jeppe Toustrup*
Operations Engineer

*Falcon Social*
___
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com


Re: Stalled handoffs on a prod cluster after server crash

2013-12-10 Thread Jeppe Toustrup
What does riak-admin transfers tell you? Are there any transfers in progress?
You can try to set the number of allowed transfers per host to 0 and
then back to 2 (the default), or whatever you want, in order to restart
any transfers which may be in progress. You can do that with the
riak-admin transfer-limit <number> command.
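
Something like this (a rough sketch, assuming you want the default limit of 2
afterwards):

  riak-admin transfers          # is anything moving at all?
  riak-admin transfer-limit 0   # pause all handoffs...
  riak-admin transfer-limit 2   # ...then allow them again, which should kick
                                # any stuck transfers back into action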

-- 
Jeppe Fihl Toustrup
Operations Engineer
Falcon Social

On 9 December 2013 15:48, Ivaylo Panitchkov ipanitch...@hibernum.com wrote:


 Hello,

 We have a prod cluster of four machines running Riak (1.1.4 2012-06-19) on
 Debian x86_64.
 Two days ago one of the servers went down because of a hardware failure.
 I force-removed the machine in question to re-balance the cluster before 
 adding the new machine.
 Since then the cluster is operating properly, but I noticed some handoffs are 
 stalled now.
 I had a similar situation a while ago that was solved by simply forcing the
 handoffs, but this time the same approach didn't work.
 Any ideas, solutions or just hints are greatly appreciated.



Re: Stalled handoffs on a prod cluster after server crash

2013-12-10 Thread Jeppe Toustrup
Try to take a look at this thread from November where I experienced a
similar problem:
http://lists.basho.com/pipermail/riak-users_lists.basho.com/2013-November/014027.html

The following mails in the thread mention things you can try to correct
the problem, and what I ended up doing with the help of Basho
employees.

-- 
Jeppe Fihl Toustrup
Operations Engineer
Falcon Social

On 10 December 2013 22:03, Ivaylo Panitchkov ipanitch...@hibernum.com wrote:
 Hello,
 Below is the transfers info:

 ~# riak-admin transfers

 Attempting to restart script through sudo -u riak
 'r...@ccc.ccc.ccc.ccc' waiting to handoff 7 partitions
 'r...@bbb.bbb.bbb.bbb' waiting to handoff 7 partitions
 'r...@aaa.aaa.aaa.aaa' waiting to handoff 5 partitions


 ~# riak-admin member_status
 Attempting to restart script through sudo -u riak
 ================================ Membership =================================
 Status     Ring     Pending    Node
 -----------------------------------------------------------------------------
 valid      45.3%    34.4%      'r...@aaa.aaa.aaa.aaa'
 valid      26.6%    32.8%      'r...@bbb.bbb.bbb.bbb'
 valid      28.1%    32.8%      'r...@ccc.ccc.ccc.ccc'
 -----------------------------------------------------------------------------

 It's been stuck with all those handoffs for a few days now.
 riak-admin ring_status gives me the same info as the one I mentioned when I
 opened the case.
 I noticed AAA.AAA.AAA.AAA experiences more load than the other servers, as it's
 responsible for almost half of the data.
 Is it safe to add another machine to the cluster in order to relieve
 AAA.AAA.AAA.AAA even though the issue with handoffs is not yet resolved?

 Thanks,
 Ivaylo



Re: Ownership handoff never completes

2013-11-20 Thread Jeppe Toustrup
Hi

Thank you for the guide. I stopped two of the nodes (the source and
the destination of the partition transfers), renamed the folders
inside the merge_index folder and started them again. However, the
ownership handoff does not seem to be retried.
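
Roughly, what I did on each of the two nodes was the following (the partition
index is a placeholder for the two indexes in question):

  riak stop
  mv /var/lib/riak/merge_index/<partition_index> \
     /var/lib/riak/merge_index/<partition_index>.old
  riak start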

Looking at the logs, it seems the last attempt was 48 hours ago.
Is there any logic inside Riak which causes it to give up after a
certain number of tries?
Is there a way I can retrigger the handoffs?
I have tried to set the transfer-limit on the cluster to 0 and then
back to 2, but it doesn't seem to do anything.

I wonder if we need the merge_index folder at all, as we have disabled
Riak search since the initial configuration of the cluster. We found a
better way to query our data so that we don't need Riak search
anymore. We disabled it by resetting the properties on the buckets
where search was enabled, and then disabled search in app.config
followed by a restart of each of the nodes. This was done after the
ownership handoff issue first occurred.
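
For what it's worth, the per-bucket part of that was essentially the following
(the bucket name is just an example, and the exact properties may differ
between versions):

  # clear the precommit hooks (this is where the riak_search_kv_hook lives)
  curl -X PUT http://127.0.0.1:8098/buckets/example_bucket/props \
       -H "Content-Type: application/json" \
       -d '{"props": {"precommit": []}}'

  # app.config: set {enabled, false} in the riak_search section, then restart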

-- 
Jeppe Fihl Toustrup
Operations Engineer
Falcon Social


On 19 November 2013 23:17, Mark Phillips m...@basho.com wrote:
 Hi Jeppe,

 As you suspected, this looks like index corruption in Search that's
 preventing handoff from finishing. Specifically, you'll need to delete the
 segment files for the two partitions' indexes and rebuild those indexes
 post-transfer.

 Here's the full process:

 - Stop each node that owns the partitions in question.
 - Delete the data directory for each partition (which contains the segment
   files). It should be something like:

   rm -rf /var/lib/riak/merge_index/<partition>

 - Restart each node
 - Wait for the transfers to complete
 - Rebuild the indexes in question [1]

 Let us know if you run into any further issues.

 Mark

 [1] http://docs.basho.com/riak/latest/ops/running/recovery/repairing-indexes/



 On Tue, Nov 19, 2013 at 4:26 AM, Jeppe Toustrup je...@falconsocial.com
 wrote:

 Hi

 I have recently added two extra nodes to the now seven-node Riak
 cluster. The rebalancing following the expansion worked fine, except
 for two partitions which do not seem to be able to go through. Running
 riak-admin ring-status shows the following:

 ============================== Ownership Handoff ==============================
 Owner:  riak@10.0.0.96
 Next Owner: riak@10.0.0.93

 Index: 239777612374601260017792042867515182912301432832
   Waiting on: []
   Complete:   [riak_kv_vnode,riak_pipe_vnode]

 Index: 696496874040508421956443553091353626554780352512
   Waiting on: []
   Complete:   [riak_kv_vnode,riak_pipe_vnode]


 ---

 I can see from the log file on the source node (10.0.0.96) that it has
 made numerous attempts to transfer the partitions, but it ends up
 failing all the time. Here's an excerpt of the log file showing the
 lines from when the transfer attempt ends up failing:

 2013-11-18 12:29:03.694 [error] emulator Error in process <0.5745.8>
 on node 'riak@10.0.0.96' with exit value:
 {badarg,[{erlang,binary_to_term,[29942 bytes],[]},{mi_segment,iterate_all_bytes,2,[{file,src/mi_segment.erl},{line,167}]},{mi_server,'-group_iterator/2-fun-1-',2,[{file,src/mi_server.erl},{line,725}]},{mi_server,'-group_iterator/2-fun-0-'...
 2013-11-18 12:29:03.885 [error] <0.3269.0>@mi_server:handle_info:524
 lookup/range failure:
 {badarg,[{erlang,binary_to_term,[131,109,0,0,244,240,108,109,102,97,111,111,111,111,111,111,111,111,111,...

Re: Ownership handoff never completes

2013-11-20 Thread Jeppe Toustrup
I've got the problem solved thanks to Brian Sparrow on the IRC channel.

Here are the steps we tried during the troubleshooting session:

1. We first tried to delete the data folders on the receiving node for
the two partitions, while the node was stopped, to see if it would
retrigger the ownership handoff. It didn't change anything.

2. We then tried to run the following Erlang code on the sending
node, in order to see if it would retrigger the ownership handoff. The
partition IDs are those of the partitions needing to be transferred:
IdxList = [696496874040508421956443553091353626554780352512,
           239777612374601260017792042867515182912301432832],
Mod = riak_kv,
Ring = riak_core_ring_manager:get_my_ring(),
riak_core_ring_manager:ring_trans(
  fun(Ring, _) ->
      Ring2 = lists:foldl(
                fun(Idx, Ring) ->
                    riak_core_ring:handoff_complete(Ring, Idx, Mod)
                end,
                Ring,
                IdxList),
      {new_ring, Ring2}
  end, []).

That piece of code didn't help either. The output of the command
showed the two partitions to be in the awaiting state:

[{239777612374601260017792042867515182912301432832,
  'riak@10.0.0.96','riak@10.0.0.93',
  [riak_kv,riak_kv_vnode,riak_pipe_vnode],
  awaiting},
 {696496874040508421956443553091353626554780352512,
  'riak@10.0.0.96','riak@10.0.0.93',
  [riak_kv,riak_kv_vnode,riak_pipe_vnode],
  awaiting}],

3. Brian suggested that I should run
riak_core_ring_events:force_update(). in the Erlang console as well,
but that didn't have any effect.

4. I sent the ring directories from the source and destination nodes
to Brian, and he came back with the following Erlang code, which
solved the problem for us:

IdxList = [696496874040508421956443553091353626554780352512,
           239777612374601260017792042867515182912301432832],
Mod = riak_kv_vnode,
Ring = riak_core_ring_manager:get_my_ring(),
riak_core_ring_manager:ring_trans(
  fun(Ring, _) ->
      %% Strip riak_kv from the module list of each pending transfer
      %% (element 7 of the ring holds the pending transfer entries),
      %% then mark the two handoffs as complete for riak_kv_vnode.
      Ring1 = begin
                A = element(7, Ring),
                B = [{B1, B2, B3,
                      [B4E || B4E <- B4, B4E /= riak_kv],
                      B5} || {B1, B2, B3, B4, B5} <- A],
                setelement(7, Ring, B)
              end,
      Ring2 = lists:foldl(
                fun(Idx, R) ->
                    riak_core_ring:handoff_complete(R, Idx, Mod)
                end,
                Ring1,
                IdxList),
      {new_ring, Ring2}
  end, []).

The output of the command showed the handoffs were complete:

[{239777612374601260017792042867515182912301432832,
  'riak@10.0.0.96','riak@10.0.0.93',
  [riak_kv_vnode,riak_pipe_vnode],
  complete},
 {696496874040508421956443553091353626554780352512,
  'riak@10.0.0.96','riak@10.0.0.93',
  [riak_kv_vnode,riak_pipe_vnode],
  complete}],

And I could confirm that with the usual ring-status, member-status
and transfers commands. There were no pending transfers, no pending
ownership handoffs and the cluster didn't show the rebalancing to be
in progress any more.
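
For completeness, the commands in question were:

  riak-admin ring-status
  riak-admin member-status
  riak-admin transfers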

Thanks a lot to Brian for helping solve this issue. I hope anybody
else who may encounter it can use the above info.

-- 
Jeppe Fihl Toustrup
Operations Engineer
Falcon Social


On 20 November 2013 17:52, Mark Phillips m...@basho.com wrote:
 Hmm. The fact that you've disabled Search probably changes things but I'm
 not entirely sure how.

 Ryan et al - any ideas?

 Mark

 On Wednesday, November 20, 2013, Jeppe Toustrup wrote:

 Hi

 Thank you for the guide. I stopped two of the nodes (the source and
 the destination of the partition transfers), renamed the folders
 inside the merge_index folder and started them again. However, the
 ownership handoff does not seem to be retried.

 Looking at the logs, it seems the last attempt was 48 hours ago.
 Is there any logic inside Riak which causes it to give up after a
 certain number of tries?
 Is there a way I can retrigger the handoffs?
 I have tried to set the transfer-limit on the cluster to 0 and then
 back to 2, but it doesn't seem to do anything.

 I wonder if we need the merge_index folder at all, as we have disabled
 Riak search since the initial configuration of the cluster. We found a
 better way to query our data so that we don't need Riak search
 anymore. We disabled it by resetting the properties on the buckets
 where search was enabled, and then disabled search in app.config
 followed by a restart of each of the nodes. This was done after the
 ownership handoff issue first occurred.

 --
 Jeppe Fihl Toustrup
 Operations Engineer
 Falcon

Ownership handoff never completes

2013-11-19 Thread Jeppe Toustrup
Hi

I have recently added two extra nodes to the now seven-node Riak
cluster. The rebalancing following the expansion worked fine, except
for two partitions which do not seem to be able to go through. Running
riak-admin ring-status shows the following:

== Ownership Handoff ==
Owner:  riak@10.0.0.96
Next Owner: riak@10.0.0.93

Index: 239777612374601260017792042867515182912301432832
  Waiting on: []
  Complete:   [riak_kv_vnode,riak_pipe_vnode]

Index: 696496874040508421956443553091353626554780352512
  Waiting on: []
  Complete:   [riak_kv_vnode,riak_pipe_vnode]

---

I can see from the log file on the source node (10.0.0.96) that it has
made numerous attempts to transfer the partitions, but it ends up
failing all the time. Here's an excerpt of the log file showing the
lines from when the transfer attempt ends up failing:

2013-11-18 12:29:03.694 [error] emulator Error in process <0.5745.8>
on node 'riak@10.0.0.96' with exit value:
{badarg,[{erlang,binary_to_term,[29942 bytes],[]},{mi_segment,iterate_all_bytes,2,[{file,src/mi_segment.erl},{line,167}]},{mi_server,'-group_iterator/2-fun-1-',2,[{file,src/mi_server.erl},{line,725}]},{mi_server,'-group_iterator/2-fun-0-'...
2013-11-18 12:29:03.885 [error] <0.3269.0>@mi_server:handle_info:524
lookup/range failure:
{badarg,[{erlang,binary_to_term,[131,109,0,0,244,240,108,109,102,97,111,111,111,111,111,111,111,111,111,...