Hi Greg,

I don't think the vnodes will always die. I have seen some situations (disk full, filesystem becoming read-only due to device errors, corrupted bitcask files after a machine crash) where the vnode did not crash, but the get and/or put requests returned errors. Even if the process crashes, it will just be restarted, possibly over and over again. Also, the handoff logic only operates on the level of a whole node, not individual vnodes, which makes monitoring and detecting disk failures very important.
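To make that monitoring point concrete, here is a minimal sketch of an external writability probe, assuming hypothetical mount paths; a real watchdog would also watch SMART/dmesg and alert rather than just print:

```python
import os
import tempfile

# Hypothetical mount points; substitute the real bitcask disks.
MOUNTS = ["/mnt/bitcask/disk1", "/mnt/bitcask/disk2", "/mnt/bitcask/disk3"]

def disk_writable(path):
    """True if we can create, write, sync, and remove a file under path."""
    try:
        fd, tmp = tempfile.mkstemp(dir=path)
        os.write(fd, b"probe")
        os.fsync(fd)          # force the write through to the device
        os.close(fd)
        os.remove(tmp)
        return True
    except OSError:
        return False

if __name__ == "__main__":
    for mount in MOUNTS:
        print("%s: %s" % (mount, "ok" if disk_writable(mount) else "FAILED"))
```

This catches the read-only-filesystem case above, which a simple process-liveness check would miss.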

We were also thinking about how to use multiple disks per node. But it's not a very pressing problem for us, since we have a lot of relatively small entries (~1000 bytes), so the RAM used by bitcask becomes a problem long before we can even fill one disk.

Cheers,
Nico


On 23.03.2011 23:50, Greg Nelson wrote:
Hi Joe,

With a few hours of investigation today, your patch is looking
promising. Maybe you can give some more detail on what you did in your
experiments a few months ago?

What I did was set up an Ubuntu VM with three loopback file systems. Then
I built Riak 0.14.1 with your patch, configured as you described to spread
across the three disks. I ran a single node, and it correctly spread
partitions across the disks.

I then corrupted the file system on one of the disks (by zeroing out the
loop device), and did some more GETs and PUTs against Riak. In the logs
it looks like the vnode processes that had bitcasks on that disk died,
as expected, and the other vnodes continued to operate.

I need to do a bit more investigation with more than one node, but given
how well it handled this scenario, it seems like we're on the right track.

Oh, one thing I noticed is that while Riak starts up, if there's a bad
disk then the whole node will shut down, at this line:

https://github.com/jtuple/riak_kv/blob/jdb-multi-dirs/src/riak_kv_bitcask_backend.erl#L103

That makes sense, but I'm wondering if it's possible to let the node
start anyway, since some of its vnodes would be able to open their
bitcasks just fine. I wonder if it's as simple as removing that line?

Greg

On Tuesday, March 22, 2011 at 9:54 AM, Joseph Blomstedt wrote:

You're forgetting how awesome riak actually is. Given how riak is
implemented, my patches should work without any operational headaches
at all. Let me explain.

First, there was the one issue from yesterday. My initial patch didn't
reuse the same partition bitcask on the same node. I've fixed that in
a newer commit:
https://github.com/jtuple/riak_kv/commit/de6b83a4fb53c25b1013f31b8c4172cc40de73ed

Now, about how this all works in operation.

Let's consider a simple scenario under normal riak. The key concept
here is to realize that riak's vnodes are completely independent, and
that failure and partition ownership changes are handled through
handoff alone.

Let's say we have an 8-partition ring with 3 riak nodes:
n1 owns partitions 1,4,7
n2 owns partitions 2,5,8
n3 owns partitions 3,6
ie: Ring = (1/n1, 2/n2, 3/n3, 4/n1, 5/n2, 6/n3, 7/n1, 8/n2)
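For concreteness, the ownership above can be written out as a tiny sketch (plain Python, nothing riak-specific):

```python
# The 8-partition ring above, written out: partition index -> owning node.
# Round-robin claiming gives n1 partitions 1,4,7; n2 2,5,8; n3 3,6.
NODES = ["n1", "n2", "n3"]
RING = {p: NODES[(p - 1) % len(NODES)] for p in range(1, 9)}

def owned(node):
    return sorted(p for p, n in RING.items() if n == node)

print(owned("n1"))  # [1, 4, 7]
print(owned("n2"))  # [2, 5, 8]
print(owned("n3"))  # [3, 6]
```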

Each node runs an independent vnode for each partition it owns, and
each vnode will set up its own bitcask:

vnode 1/1: {n1-root}/data/bitcask/1
vnode 1/4: {n1-root}/data/bitcask/4
...
vnode 2/2: {n2-root}/data/bitcask/2
...
vnode 3/6: {n3-root}/data/bitcask/6

Reads/writes are routed to the appropriate vnodes and to the
appropriate bitcasks. Under failure, hinted handoff comes into play.

Let's have a write to preflist [1,2,3] while n2 is down/split. Since
n2 is down, riak will send the write meant for partition 2 to another
node, let's say n3. n3 will spawn a new vnode for partition 2 which is
initially empty:

vnode 3/2: {n3-root}/data/bitcask/2

and write the incoming value to the new bitcask.

Later, when n2 rejoins, n3 will eventually engage in handoff and send
all (k,v) pairs in its data/bitcask/2 to n2, which writes them into its
data/bitcask/2. After handing off the data, n3 will shut down its 3/2
vnode and delete the bitcask directory {n3-root}/data/bitcask/2.
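That handoff dance can be sketched with plain dicts standing in for bitcasks (a toy model, not riak's actual code):

```python
# Toy model of hinted handoff for partition 2; dicts stand in for bitcasks.
n2_bitcask = {}    # n2's data/bitcask/2, unreachable while n2 is down
n3_fallback = {}   # empty fallback vnode 3/2 spawned on n3

# While n2 is down, a write for partition 2 lands on n3's fallback:
n3_fallback["somekey"] = "somevalue"

# When n2 rejoins, n3 hands everything off, then deletes its copy:
n2_bitcask.update(n3_fallback)
n3_fallback.clear()   # stands in for deleting {n3-root}/data/bitcask/2

print(n2_bitcask)   # {'somekey': 'somevalue'}
print(n3_fallback)  # {}
```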

Under node rebalancing / ownership changes, a similar event occurs.
For example, if a new node n4 takes ownership of partition 4, then n1
will hand off its data to n4 and then shut down its vnode and delete
its {n1-root}/data/bitcask/4.

If you take the above scenario, and change all the directories of the
form:
{NODE-root}/data/bitcask/P
to:
/mnt/DISK-N/NODE/bitcask/P

and allow DISK-N to be any randomly chosen directory in /mnt, then the
scenario plays out exactly the same, provided that riak always selects
the same DISK-N for a given P on a given node (across nodes it doesn't
matter; vnodes are independent). My new commit handles this. A simple
configuration could be:

n1-vars.config:
{bitcask_data_root, {random, ["/mnt/bitcask/disk1/n1",
"/mnt/bitcask/disk2/n1", "/mnt/bitcask/disk3/n1"]}}
n2-vars.config:
{bitcask_data_root, {random, ["/mnt/bitcask/disk1/n2",
"/mnt/bitcask/disk2/n2", "/mnt/bitcask/disk3/n2"]}}
(...etc...)
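One way to satisfy the "same DISK-N for a given P" requirement is to hash the partition index deterministically; I'm not claiming the patch works this way (it may persist a random choice instead), but as a sketch with hypothetical paths:

```python
import hashlib

# Hypothetical disk roots for node n1 (mirrors the config above).
DISKS = ["/mnt/bitcask/disk1/n1",
         "/mnt/bitcask/disk2/n1",
         "/mnt/bitcask/disk3/n1"]

def bitcask_dir(partition):
    """Deterministically map a partition index to one configured disk root."""
    digest = hashlib.md5(str(partition).encode()).hexdigest()
    return "%s/%d" % (DISKS[int(digest, 16) % len(DISKS)], partition)

# Same partition -> same disk, on every call and across restarts:
assert bitcask_dir(42) == bitcask_dir(42)
```

Any stable mapping works; the only invariant is that a node never picks a different disk for the same partition across restarts.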

There is no inherent need for symlinks, or to pre-create any initial
links per partition index. riak already creates and deletes
partition bitcask directories on demand. If a disk fails, then all
vnodes with bitcasks on that disk fail in the same manner as a disk
failure under normal riak. Standard read repair, handoff, and node
replacement apply.

-Joe

On Tue, Mar 22, 2011 at 9:53 AM, Alexander Sicular <[email protected]> wrote:
Ya, my original message just highlighted the standard RAID levels 0, 1,
and 5 that most people/hardware should know/be able to support. There
are better options and 10 would be one of them.


@siculars on twitter
http://siculars.posterous.com
Sent from my iPhone
On Mar 22, 2011, at 8:43, Ryan Zezeski <[email protected]> wrote:



On Tue, Mar 22, 2011 at 10:01 AM, Alexander Sicular <[email protected]> wrote:

Save your ops dudes the headache and just use raid 5 and be done
with it.

Depending on the number of disks available I might even argue for
running software RAID 10 for better throughput and less chance of data
loss (as long as you can afford to cut your available storage in half
on every machine). It's not too hard to set up on modern Linux distros
(mdadm); at least I was doing it 5 years ago and I'm no sysadmin.
-Ryan

_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com


