[ANNOUNCE] Sheepdog: Distributed Storage System for KVM

2009-10-20 Thread MORITA Kazutaka

Hi everyone,

Sheepdog is a distributed storage system for KVM/QEMU. It provides
highly available block level storage volumes to VMs like Amazon EBS.
Sheepdog supports advanced volume management features such as snapshot,
cloning, and thin provisioning. Sheepdog runs on several tens or hundreds
of nodes, and the architecture is fully symmetric; there is no central
node such as a meta-data server.

The following list describes the features of Sheepdog.

* Linear scalability in performance and capacity
* No single point of failure
* Redundant architecture (data is written to multiple nodes)
- Tolerance against network failure
* Zero configuration (newly added machines will join the cluster 
automatically)
- Autonomous load balancing
* Snapshot
- Online snapshot from qemu-monitor
* Clone from a snapshot volume
* Thin provisioning
- Amazon EBS API support (to use from a Eucalyptus instance)

(* = current features, - = on our todo list)

More details and download links are here:

http://www.osrg.net/sheepdog/

Note that the code is still in an early stage.
There are some critical TODO items:

- VM image deletion support
- Support architectures other than X86_64
- Data recovery
- Free space management
- Guarantee reliability and availability under heavy load
- Performance improvement
- Reclaim unused blocks
- More documentation

We hope to find people interested in working together.
Enjoy!


Here are examples:

- create images

$ kvm-img create -f sheepdog "Alice's Disk" 256G
$ kvm-img create -f sheepdog "Bob's Disk" 256G

- list images

$ shepherd info -t vdi
   4 : Alice's Disk  256 GB (allocated: 0 MB, shared: 0 MB), 2009-10-15 16:17:18, tag: 0, current
   8 : Bob's Disk    256 GB (allocated: 0 MB, shared: 0 MB), 2009-10-15 16:29:20, tag: 0, current

- start up a virtual machine

$ kvm --drive format=sheepdog,file="Alice's Disk"

- create a snapshot

$ kvm-img snapshot -c name sheepdog:"Alice's Disk"

- clone from a snapshot

$ kvm-img create -b sheepdog:"Alice's Disk":0 -f sheepdog "Charlie's Disk"


Thanks.

--
MORITA, Kazutaka

NTT Cyber Space Labs
OSS Computing Project
Kernel Group
E-mail: morita.kazut...@lab.ntt.co.jp



Re: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM

2009-10-21 Thread Nikolai K. Bochev
Hello,

I am getting the following error trying to compile sheepdog on Ubuntu 9.10 ( 
2.6.31-14 x64 ) :

cd shepherd; make
make[1]: Entering directory `/home/shiny/Packages/sheepdog-2009102101/shepherd'
cc -c -g -O2 -Wall -Wstrict-prototypes -I../include -D_GNU_SOURCE shepherd.c -o 
shepherd.o
shepherd.c: In function ‘main’:
shepherd.c:300: warning: dereferencing pointer ‘hdr.55’ does break 
strict-aliasing rules
shepherd.c:300: note: initialized from here
cc -c -g -O2 -Wall -Wstrict-prototypes -I../include -D_GNU_SOURCE treeview.c -o 
treeview.o
cc -c -g -O2 -Wall -Wstrict-prototypes -I../include -D_GNU_SOURCE 
../lib/event.c -o ../lib/event.o
cc -c -g -O2 -Wall -Wstrict-prototypes -I../include -D_GNU_SOURCE ../lib/net.c 
-o ../lib/net.o
../lib/net.c: In function ‘write_object’:
../lib/net.c:358: warning: ‘vosts’ may be used uninitialized in this function
cc -c -g -O2 -Wall -Wstrict-prototypes -I../include -D_GNU_SOURCE 
../lib/logger.c -o ../lib/logger.o
cc shepherd.o treeview.o ../lib/event.o ../lib/net.o ../lib/logger.o -o 
shepherd -lncurses -lcrypto
make[1]: Leaving directory `/home/shiny/Packages/sheepdog-2009102101/shepherd'
cd sheep; make
make[1]: Entering directory `/home/shiny/Packages/sheepdog-2009102101/sheep'
cc -c -g -O2 -Wall -Wstrict-prototypes -I../include -D_GNU_SOURCE sheep.c -o 
sheep.o
cc -c -g -O2 -Wall -Wstrict-prototypes -I../include -D_GNU_SOURCE store.c -o 
store.o
cc -c -g -O2 -Wall -Wstrict-prototypes -I../include -D_GNU_SOURCE net.c -o net.o
cc -c -g -O2 -Wall -Wstrict-prototypes -I../include -D_GNU_SOURCE work.c -o 
work.o
In file included from /usr/include/asm/fcntl.h:1,
 from /usr/include/linux/fcntl.h:4,
 from /usr/include/linux/signalfd.h:13,
 from work.c:31:
/usr/include/asm-generic/fcntl.h:117: error: redefinition of ‘struct flock’
/usr/include/asm-generic/fcntl.h:140: error: redefinition of ‘struct flock64’
make[1]: *** [work.o] Error 1
make[1]: Leaving directory `/home/shiny/Packages/sheepdog-2009102101/sheep'
make: *** [all] Error 2

I have all the required libs installed. Patching and compiling qemu-kvm went
flawlessly.


RE: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM

2009-10-21 Thread Dietmar Maurer
Quite interesting. But would it be possible to use corosync for the cluster
communication? The point is that we need corosync anyway for pacemaker; it is
written in C (high performance) and seems to implement the features you need?




Re: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM

2009-10-22 Thread Avi Kivity

On 10/21/2009 07:13 AM, MORITA Kazutaka wrote:


Very interesting!  From a very brief look at the code, it looks like the 
sheepdog block format driver is a network client that is able to access 
highly available images, yes?


If so, is it reasonable to compare this to a cluster file system setup 
(like GFS) with images as files on this filesystem?  The difference 
would be that clustering is implemented in userspace in sheepdog, but in 
the kernel for a clustering filesystem.


How is load balancing implemented?  Can you move an image transparently 
while a guest is running?  Will an image be moved closer to its guest?  
Can you stripe an image across nodes?


Do you support multiple guests accessing the same image?

What about fault tolerance - storing an image redundantly on multiple nodes?

--
error compiling committee.c: too many arguments to function



Re: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM

2009-10-22 Thread Anthony Liguori

Avi Kivity wrote:


Very interesting!  From a very brief look at the code, it looks like 
the sheepdog block format driver is a network client that is able to 
access highly available images, yes?


If so, is it reasonable to compare this to a cluster file system setup 
(like GFS) with images as files on this filesystem?  The difference 
would be that clustering is implemented in userspace in sheepdog, but 
in the kernel for a clustering filesystem.


I'm still in the process of reading the code, but that's the impression
I got too.  It made me think that the protocol for qemu to communicate
with sheepdog could be a filesystem protocol (like 9p) and sheepdog
could expose itself as a synthetic filesystem.  There are some interesting
ramifications to something like that--namely that you could mount
sheepdog on localhost and interact with it through the vfs.


Very interesting stuff; I'm looking forward to examining it more closely.

Regards,

Anthony Liguori


Re: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM

2009-10-22 Thread Avishay Traeger
This looks very interesting - how does this compare with Exanodes/Seanodes?

Thanks,
Avishay


Re: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM

2009-10-22 Thread Alexander Graf


Am 22.10.2009 um 18:28 schrieb Anthony Liguori :



I'm still in the process of reading the code, but that's the  
impression I got too.  It made me think that the protocol for qemu  
to communicate with sheepdog could be a filesystem protocol (like 9p)


Speaking about 9p, what's the status there?

Alex


Re: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM

2009-10-23 Thread MORITA Kazutaka
Hello,

Does the following patch work for you?

diff --git a/sheep/work.c b/sheep/work.c
index 4df8dc0..45f362d 100644
--- a/sheep/work.c
+++ b/sheep/work.c
@@ -28,6 +28,7 @@
 #include 
 #include 
 #include 
+#define _LINUX_FCNTL_H
 #include <linux/signalfd.h>

 #include "list.h"



Re: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM

2009-10-23 Thread MORITA Kazutaka
We use JGroups (Java library) for reliable multicast communication in
our cluster manager daemon. We don't worry about the performance much
since the cluster manager daemon is not involved in the I/O path. We
might think about moving to corosync if it is more stable than
JGroups.




-- 
MORITA, Kazutaka

NTT Cyber Space Labs
OSS Computing Project
Kernel Group
E-mail: morita.kazut...@lab.ntt.co.jp


Re: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM

2009-10-23 Thread Chris Webb
MORITA Kazutaka  writes:

> We use JGroups (Java library) for reliable multicast communication in
> our cluster manager daemon. We don't worry about the performance much
> since the cluster manager daemon is not involved in the I/O path. We
> might think about moving to corosync if it is more stable than
> JGroups.

I'd love to see this running on top of corosync too. Corosync is a well
tested, stable cluster manager, and doesn't have the JVM dependency of
jgroups so feels more suitable for building 'thin virtualisation fabrics'.

Cheers,

Chris.


Re: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM

2009-10-23 Thread Chris Webb
Very exciting project, by the way!

Best wishes,

Chris.


Re: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM

2009-10-23 Thread MORITA Kazutaka
On Fri, Oct 23, 2009 at 12:30 AM, Avi Kivity  wrote:
> On 10/21/2009 07:13 AM, MORITA Kazutaka wrote:
> Very interesting!  From a very brief look at the code, it looks like the
> sheepdog block format driver is a network client that is able to access
> highly available images, yes?

Yes. Sheepdog is a simple key-value storage system that
consists of multiple nodes (a bit similar to Amazon Dynamo, I guess).

The qemu Sheepdog driver (client) divides a VM image into fixed-size
objects and stores them on the key-value storage system.
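
For illustration only (this is not the actual driver code; the 4 MB object
size and the key layout are assumptions), the mapping from a guest offset
to an object key could look roughly like this:

#include <stdint.h>
#include <stdio.h>

/* Assumed fixed object size; the real driver may use a different value. */
#define OBJECT_SIZE (4ULL << 20)        /* 4 MB */

/* Hypothetical key layout: image id in the upper 32 bits, object index below. */
static uint64_t object_key(uint32_t image_id, uint64_t offset)
{
        uint64_t idx = offset / OBJECT_SIZE;
        return ((uint64_t)image_id << 32) | idx;
}

int main(void)
{
        /* A write at offset 10 MB of image 4 falls into object index 2. */
        printf("key = %016llx\n", (unsigned long long)object_key(4, 10ULL << 20));
        return 0;
}

Each such object is then an independent key-value item that can be placed
and replicated on its own.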

> If so, is it reasonable to compare this to a cluster file system setup (like
> GFS) with images as files on this filesystem?  The difference would be that
> clustering is implemented in userspace in sheepdog, but in the kernel for a
> clustering filesystem.

I think that the major difference between sheepdog and cluster file
systems such as Google File system, pNFS, etc is the interface between
clients and a storage system.

> How is load balancing implemented?  Can you move an image transparently
> while a guest is running?  Will an image be moved closer to its guest?

Sheepdog uses consistent hashing to decide where objects are stored; I/O
load is balanced across the nodes. When a new node is added or an
existing node is removed, the hash table changes and the data is
automatically and transparently moved across nodes.

We plan to implement a mechanism to distribute the data not randomly
but intelligently; we could use machine load, the locations of VMs, etc.
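
To make the placement rule concrete, here is a toy consistent-hashing
lookup (a sketch only, not Sheepdog's implementation; the hash function
and node IDs are made up):

#include <stdint.h>
#include <stdio.h>

/* Toy 64-bit mix hash; a real system would use a stronger hash. */
static uint64_t mix64(uint64_t x)
{
        x ^= x >> 33; x *= 0xff51afd7ed558ccdULL;
        x ^= x >> 33; x *= 0xc4ceb9fe1a85ec53ULL;
        return x ^ (x >> 33);
}

/* Pick the node whose point on the ring is the first one clockwise
 * from the object's point. */
static int pick_node(uint64_t object_key, const uint64_t *node_ids, int nr_nodes)
{
        uint64_t key = mix64(object_key);
        uint64_t best_dist = UINT64_MAX;
        int best = 0;
        int i;

        for (i = 0; i < nr_nodes; i++) {
                uint64_t dist = mix64(node_ids[i]) - key;  /* wraps around the ring */
                if (dist < best_dist) {
                        best_dist = dist;
                        best = i;
                }
        }
        return best;
}

int main(void)
{
        uint64_t nodes[] = { 101, 102, 103, 104 };
        uint64_t key;

        for (key = 0; key < 8; key++)
                printf("object %llu -> node %d\n",
                       (unsigned long long)key, pick_node(key, nodes, 4));
        return 0;
}

Because only the arc between two ring points is affected when a node
joins or leaves, rebalancing moves just the objects on that arc rather
than reshuffling everything.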

> Can you stripe an image across nodes?

Yes, a VM image is divided into multiple objects, and they are
stored across nodes.

> Do you support multiple guests accessing the same image?

A VM image can be attached to any VM, but only to one VM at a time; multiple
running VMs cannot access the same VM image.

> What about fault tolerance - storing an image redundantly on multiple nodes?

Yes, all objects are replicated to multiple nodes.


-- 
MORITA, Kazutaka

NTT Cyber Space Labs
OSS Computing Project
Kernel Group
E-mail: morita.kazut...@lab.ntt.co.jp


RE: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM

2009-10-23 Thread Dietmar Maurer
> We use JGroups (Java library) for reliable multicast communication in
> our cluster manager daemon.

I doubt that there is something like 'reliable multicast' - you will run into 
many problems when you try to handle errors.

> We don't worry about the performance much
> since the cluster manager daemon is not involved in the I/O path. We
> might think about moving to corosync if it is more stable than
> JGroups.

corosync is already quite stable. And it supports virtual synchrony

http://en.wikipedia.org/wiki/Virtual_synchrony

Anyways, I do not know JGroups - maybe that 'reliable multicast' solves all 
network problems somehow - Is there any documentation about how they do it?

- Dietmar



RE: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM

2009-10-23 Thread Dietmar Maurer
Another suggestion: use LVM instead of btrfs (to get better performance)



RE: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM

2009-10-23 Thread Dietmar Maurer
> Anyways, I do not know JGroups - maybe that 'reliable multicast' solves
> all network problems somehow - Is there any documentation about how
> they do it?

OK, found the papers on their web site - quite interesting too.



Re: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM

2009-10-23 Thread Javier Guerra
On Fri, Oct 23, 2009 at 5:41 AM, MORITA Kazutaka
 wrote:
> On Fri, Oct 23, 2009 at 12:30 AM, Avi Kivity  wrote:
>> If so, is it reasonable to compare this to a cluster file system setup (like
>> GFS) with images as files on this filesystem?  The difference would be that
>> clustering is implemented in userspace in sheepdog, but in the kernel for a
>> clustering filesystem.
>
> I think that the major difference between sheepdog and cluster file
> systems such as Google File system, pNFS, etc is the interface between
> clients and a storage system.

note that GFS is "Global File System" (written by Sistina (the same
folks from LVM) and bought by RedHat).  Google Filesystem is a
different thing, and ironically the client/storage interface is a
little more like sheepdog and unlike a regular cluster filesystem.

>> How is load balancing implemented?  Can you move an image transparently
>> while a guest is running?  Will an image be moved closer to its guest?
>
> Sheepdog uses consistent hashing to decide where objects are stored; I/O
> load is balanced across the nodes. When a new node is added or an
> existing node is removed, the hash table changes and the data is
> automatically and transparently moved across nodes.
>
> We plan to implement a mechanism to distribute the data not randomly
> but intelligently; we could use machine load, the locations of VMs, etc.

i don't have much hands-on experience on consistent hashing; but it
sounds reasonable to make each node's ring segment proportional to its
storage capacity.  dynamic load balancing seems a tougher nut to
crack, especially while keeping all clients' mappings consistent.

>> Do you support multiple guests accessing the same image?
>
> A VM image can be attached to any VM, but only to one VM at a time; multiple
> running VMs cannot access the same VM image.

this is a must-have safety measure; but a 'manual override' is quite
useful for those that know how to manage a cluster-aware filesystem
inside a VM image, maybe like Xen's "w!" flag does.  just be sure to
avoid distributed caching for a shared image!

in all, great project, and with such a clean patch into KVM/Qemu, high
hopes of making it into regular use.

i'd just want to add my '+1 votes' on both getting rid of JVM
dependency and using block devices (usually LVM) instead of ext3/btrfs

-- 
Javier


Re: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM

2009-10-23 Thread MORITA Kazutaka
Sorry, I am not familiar with the details of Exanodes/Seanodes, but it seems to
be a storage system that provides an iSCSI interface. As I wrote in a different
mail, Sheepdog is a storage system that provides a simple key-value
interface to the Sheepdog client (the qemu block driver).




-- 
MORITA, Kazutaka

NTT Cyber Space Labs
OSS Computing Project
Kernel Group
E-mail: morita.kazut...@lab.ntt.co.jp


Re: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM

2009-10-23 Thread FUJITA Tomonori
On Fri, 23 Oct 2009 09:14:29 -0500
Javier Guerra  wrote:

> > I think that the major difference between sheepdog and cluster file
> > systems such as Google File system, pNFS, etc is the interface between
> > clients and a storage system.
> 
> note that GFS is "Global File System" (written by Sistina (the same
> folks from LVM) and bought by RedHat).  Google Filesystem is a
> different thing, and ironically the client/storage interface is a
> little more like sheepdog and unlike a regular cluster filesystem.

Hmm, Avi referred to Global File System? I wasn't sure. 'GFS' is
ambiguous. Anyway, Global File System is a SAN file system. It's
a completely different architecture from Sheepdog.


> > Sheepdog uses consistent hashing to decide where objects are stored; I/O
> > load is balanced across the nodes. When a new node is added or an
> > existing node is removed, the hash table changes and the data is
> > automatically and transparently moved across nodes.
> >
> > We plan to implement a mechanism to distribute the data not randomly
> > but intelligently; we could use machine load, the locations of VMs, etc.
> 
> i don't have much hands-on experience on consistent hashing; but it
> sounds reasonable to make each node's ring segment proportional to its
> storage capacity.

Yeah, that's one of the techniques, I think.


>  dynamic load balancing seems a tougher nut to
> crack, especially while keeping all clients' mappings consistent.

There are some techniques to do that.

We think that there are some existing techniques to distribute data
intelligently. We just have not analyzed the options.


> i'd just want to add my '+1 votes' on both getting rid of JVM
> dependency and using block devices (usually LVM) instead of ext3/btrfs

LVM doesn't fit our requirements nicely. What we need is to update
some objects in an atomic way. We could implement that ourselves, but
we prefer to keep our code simple by using an existing mechanism.
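
For readers unfamiliar with the pattern: the usual userspace way to get
an atomic object update on top of an ordinary filesystem is write-to-temp,
fsync, then rename. A generic sketch (not Sheepdog's code; the object file
name is hypothetical):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Atomically replace the file at `path` with `len` bytes from `buf`:
 * readers see either the old contents or the new, never a partial write. */
static int atomic_update(const char *path, const void *buf, size_t len)
{
        char tmp[4096];
        int fd;

        snprintf(tmp, sizeof(tmp), "%s.tmp", path);

        fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0600);
        if (fd < 0)
                return -1;

        if (write(fd, buf, len) != (ssize_t)len || fsync(fd) < 0) {
                close(fd);
                unlink(tmp);
                return -1;
        }
        close(fd);

        /* rename() is atomic within a filesystem; for full crash safety
         * the containing directory should be fsync'ed as well. */
        return rename(tmp, path);
}

int main(void)
{
        const char data[] = "new object contents\n";

        /* "obj_000002" is just an example object file name. */
        return atomic_update("obj_000002", data, sizeof(data) - 1) ? 1 : 0;
}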


Re: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM

2009-10-24 Thread Avi Kivity

On 10/23/2009 05:40 PM, FUJITA Tomonori wrote:

Hmm, Avi referred to Global File System? I wasn't sure. 'GFS' is
ambiguous. Anyway, Global File System is a SAN file system. It's
a completely different architecture from Sheepdog.
   


I did, and yes, it is completely different since you don't require 
central storage.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



RE: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM

2009-10-25 Thread Dietmar Maurer
> >> Do you support multiple guests accessing the same image?
> >
> > A VM image can be attached to any VM, but only to one VM at a time; multiple
> > running VMs cannot access the same VM image.

I guess this is a problem when you want to do live migrations?

- Dietmar


Re: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM

2009-10-25 Thread MORITA Kazutaka

On 2009/10/25 17:51, Dietmar Maurer wrote:

Do you support multiple guests accessing the same image?

A VM image can be attached to any VM, but only to one VM at a time; multiple
running VMs cannot access the same VM image.


I guess this is a problem when you want to do live migrations?


Yes, because Sheepdog locks a VM image when it is opened.
To avoid this problem, locking must be delayed until migration has completed.
This is also a TODO item.

--
MORITA Kazutaka





Re: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM

2009-10-27 Thread MORITA Kazutaka


We added some pages to the Sheepdog website:

 Design: http://www.osrg.net/sheepdog/design.html
 FAQ   : http://www.osrg.net/sheepdog/faq.html

The Sheepdog mailing list is also ready to use (thanks to Tomasz)

 Subscribe/Unsubscribe/Preferences
   http://lists.wpkg.org/mailman/listinfo/sheepdog
 Archive
   http://lists.wpkg.org/pipermail/sheepdog/

We are always looking for developers or users interested in
participating in the Sheepdog project!

Thanks.

MORITA Kazutaka


Re: [Qemu-devel] [ANNOUNCE] Sheepdog: Distributed Storage System for KVM

2009-10-23 Thread MORITA Kazutaka
Hi,

Thanks for many comments.

Sheepdog git trees are created.

  Sheepdog server
git://sheepdog.git.sourceforge.net/gitroot/sheepdog/sheepdog

  Sheepdog client
git://sheepdog.git.sourceforge.net/gitroot/sheepdog/qemu-kvm

Please try!




-- 
MORITA, Kazutaka

NTT Cyber Space Labs
OSS Computing Project
Kernel Group
E-mail: morita.kazut...@lab.ntt.co.jp


Re: [Qemu-devel] [ANNOUNCE] Sheepdog: Distributed Storage System for KVM

2009-10-23 Thread Javier Guerra
On Fri, Oct 23, 2009 at 2:39 PM, MORITA Kazutaka
 wrote:
> Thanks for many comments.
>
> Sheepdog git trees are created.

great!

is there any client (no matter how crude) besides the patched
KVM/Qemu?  it would make it far easier to hack around...

-- 
Javier


Re: [Qemu-devel] [ANNOUNCE] Sheepdog: Distributed Storage System for KVM

2009-10-23 Thread MORITA Kazutaka
On Sat, Oct 24, 2009 at 4:45 AM, Javier Guerra  wrote:
> On Fri, Oct 23, 2009 at 2:39 PM, MORITA Kazutaka
>  wrote:
>> Thanks for many comments.
>>
>> Sheepdog git trees are created.
>
> great!
>
> is there any client (no matter how crude) besides the patched
> KVM/Qemu?  it would make it far easier to hack around...

No, there isn't. Sorry.
I think we should provide a test client as soon as possible.

-- 
MORITA, Kazutaka

NTT Cyber Space Labs
OSS Computing Project
Kernel Group
E-mail: morita.kazut...@lab.ntt.co.jp


Re: [Qemu-devel] Re: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM

2009-10-23 Thread Chris Webb
Javier Guerra  writes:

> i'd just want to add my '+1 votes' on both getting rid of JVM
> dependency and using block devices (usually LVM) instead of ext3/btrfs

If the chunks into which the virtual drives are split are quite small (say
the 64MB used by Hadoop), LVM may be a less appropriate choice. It doesn't
support very large numbers of very small logical volumes very well.

Best wishes,

Chris.


Re: [Qemu-devel] Re: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM

2009-10-23 Thread Javier Guerra
On Fri, Oct 23, 2009 at 9:58 AM, Chris Webb  wrote:
> If the chunks into which the virtual drives are split are quite small (say
> the 64MB used by Hadoop), LVM may be a less appropriate choice. It doesn't
> support very large numbers of very small logical volumes very well.

absolutely.  the 'nicest' way to do it would be to use a single block
device per sheep process, and do the splitting there.

it's an extra layer of code, and once you add non-naïve behavior for
deleting and fragmentation, you quickly approach filesystem-like
complexity.

unless you can do some very clever mapping that reuses the consistent
hash algorithms to find not only which server(s) you want, but also
which chunk to hit; the kind of thing i'd love to code, but never
found a use for it.
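
A toy illustration of that idea (purely hypothetical, nothing like this
exists in the code): hash the object key once more to pick a fixed-size
slot on the node's raw device.

#include <stdint.h>
#include <stdio.h>

#define SLOT_SIZE (4ULL << 20)     /* assumed 4 MB chunks */
#define NR_SLOTS  (1ULL << 18)     /* ~1 TB device divided into 4 MB slots */

/* Toy 64-bit mix hash, a stand-in for whatever the ring already uses. */
static uint64_t mix64(uint64_t x)
{
        x ^= x >> 33; x *= 0xff51afd7ed558ccdULL;
        x ^= x >> 33; x *= 0xc4ceb9fe1a85ec53ULL;
        return x ^ (x >> 33);
}

/* Candidate byte offset of an object on the node's block device.
 * A real design would still need collision handling and an allocation
 * index, which is exactly where filesystem-like complexity creeps in. */
static uint64_t slot_offset(uint64_t object_key)
{
        return (mix64(object_key) % NR_SLOTS) * SLOT_SIZE;
}

int main(void)
{
        printf("object 0x2a -> device offset %llu\n",
               (unsigned long long)slot_offset(0x2a));
        return 0;
}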

i'll definitely dig deeper in the code.

-- 
Javier


Re: [Qemu-devel] Re: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM

2009-10-23 Thread MORITA Kazutaka
On Fri, Oct 23, 2009 at 8:10 PM, Alexander Graf  wrote:
>
> On 23.10.2009, at 12:41, MORITA Kazutaka wrote:
>
> On Fri, Oct 23, 2009 at 12:30 AM, Avi Kivity  wrote:
>
> How is load balancing implemented?  Can you move an image transparently
>
> while a guest is running?  Will an image be moved closer to its guest?
>
> Sheepdog uses consistent hashing to decide where objects are stored; I/O
> load is balanced across the nodes. When a new node is added or an
> existing node is removed, the hash table changes and the data is
> automatically and transparently moved across nodes.
>
> We plan to implement a mechanism to distribute the data not randomly
> but intelligently; we could use machine load, the locations of VMs, etc.
>
> What exactly does balanced mean? Can it cope with individual nodes having
> more disk space than others?

I mean objects are uniformly distributed over the nodes by the hash function.
Distribution using free disk space information is one of our TODOs.

> Do you support multiple guests accessing the same image?
>
> A VM image can be attached to any VM, but only to one VM at a time; multiple
> running VMs cannot access the same VM image.
>
> What about read-only access? Imagine you'd have 5 kvm instances each
> accessing it using -snapshot.

By creating new clone images from an existing snapshot image, you can do
something similar.
Sheepdog can create a clone image instantly.

-- 
MORITA, Kazutaka

NTT Cyber Space Labs
OSS Computing Project
Kernel Group
E-mail: morita.kazut...@lab.ntt.co.jp


Re: [Qemu-devel] Re: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM

2009-10-23 Thread Tomasz Chmielewski

Chris Webb wrote:

If the chunks into which the virtual drives are split are quite small (say
the 64MB used by Hadoop), LVM may be a less appropriate choice. It doesn't
support very large numbers of very small logical volumes very well.


Also, on _loaded_ systems, I noticed creating/removing logical volumes
can take really long (several minutes), whereas allocating a file of a
given size would take just a fraction of that.


Not sure how it would matter here, but it probably would.

--
Tomasz Chmielewski
http://wpkg.org





RE: [Qemu-devel] Re: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM

2009-10-25 Thread Dietmar Maurer
> Also, on _loaded_ systems, I noticed creating/removing logical volumes
> can take really long (several minutes); where allocating a file of a
> given size would just take a fraction of that.

Allocating a file takes much longer, unless you use  a 'sparse' file.

- Dietmar



Re: [Qemu-devel] Re: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM

2009-10-25 Thread Tomasz Chmielewski

Dietmar Maurer wrote:

Also, on _loaded_ systems, I noticed creating/removing logical volumes
can take really long (several minutes); where allocating a file of a
given size would just take a fraction of that.


Allocating a file takes much longer, unless you use  a 'sparse' file.


If you mean "allocating" like with:

   dd if=/dev/zero of=image bs=1G count=50

Then of course, that's a lot of IO.


As you mentioned, you can create a sparse file (but then, you'll end up 
with a lot of fragmentation).


But a better way would be to use persistent preallocation (fallocate), 
instead of "traditional" dd or a sparse file.



--
Tomasz Chmielewski
http://wpkg.org
