Re: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM

2009-10-27 Thread MORITA Kazutaka

On 2009/10/21 14:13, MORITA Kazutaka wrote:

Hi everyone,

Sheepdog is a distributed storage system for KVM/QEMU. It provides
highly available block level storage volumes to VMs like Amazon EBS.
Sheepdog supports advanced volume management features such as snapshot,
cloning, and thin provisioning. Sheepdog runs on several tens or hundreds
of nodes, and the architecture is fully symmetric; there is no central
node such as a meta-data server.


We added some pages to Sheepdog website:

 Design: http://www.osrg.net/sheepdog/design.html
 FAQ   : http://www.osrg.net/sheepdog/faq.html

The Sheepdog mailing list is also ready to use (thanks to Tomasz):

 Subscribe/Unsubscribe/Preferences
   http://lists.wpkg.org/mailman/listinfo/sheepdog
 Archive
   http://lists.wpkg.org/pipermail/sheepdog/

We are always looking for developers and users interested in
participating in the Sheepdog project!

Thanks.

MORITA Kazutaka
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM

2009-10-26 Thread MORITA Kazutaka

On 2009/10/25 17:51, Dietmar Maurer wrote:

Do you support multiple guests accessing the same image?

A VM image can be attached to any VM, but only to one VM at a time;
multiple running VMs cannot access the same VM image.


I guess this is a problem when you want to do live migrations?


Yes, because Sheepdog locks a VM image when it is opened.
To avoid this problem, locking must be delayed until the migration has
completed. This is also a TODO item.

--
MORITA Kazutaka





RE: [Qemu-devel] Re: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM

2009-10-25 Thread Dietmar Maurer
 Also, on _loaded_ systems, I noticed creating/removing logical volumes
 can take really long (several minutes); where allocating a file of a
 given size would just take a fraction of that.

Allocating a file takes much longer, unless you use  a 'sparse' file.

- Dietmar



RE: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM

2009-10-25 Thread Dietmar Maurer
  Do you support multiple guests accessing the same image?
 
  A VM image can be attached to any VM, but only to one VM at a time;
  multiple running VMs cannot access the same VM image.

I guess this is a problem when you want to do live migrations?

- Dietmar


Re: [Qemu-devel] Re: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM

2009-10-25 Thread Tomasz Chmielewski

Dietmar Maurer wrote:

Also, on _loaded_ systems, I noticed creating/removing logical volumes
can take really long (several minutes); where allocating a file of a
given size would just take a fraction of that.


Allocating a file takes much longer, unless you use  a 'sparse' file.


If you mean allocating like with:

   dd if=/dev/zero of=image bs=1G count=50

Then of course, that's a lot of IO.


As you mentioned, you can create a sparse file (but then, you'll end up 
with a lot of fragmentation).


But a better way would be to use persistent preallocation (fallocate), 
instead of traditional dd or a sparse file.



--
Tomasz Chmielewski
http://wpkg.org



Re: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM

2009-10-24 Thread Avi Kivity

On 10/23/2009 05:40 PM, FUJITA Tomonori wrote:

On Fri, 23 Oct 2009 09:14:29 -0500
Javier Guerra jav...@guerrag.com wrote:

   

I think that the major difference between sheepdog and cluster file
systems such as Google File system, pNFS, etc is the interface between
clients and a storage system.
   

note that GFS is Global File System (written by Sistina (the same
folks from LVM) and bought by RedHat).  Google Filesystem is a
different thing, and ironically the client/storage interface is a
little more like sheepdog and unlike a regular cluster filesystem.
 

Hmm, Avi referred to Global File System? I wasn't sure. 'GFS' is
ambiguous. Anyway, Global File System is a SAN file system. It's
a completely different architecture from Sheepdog.
   


I did, and yes, it is completely different since you don't require 
central storage.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM

2009-10-23 Thread MORITA Kazutaka
Hello,

Does the following patch work for you?

diff --git a/sheep/work.c b/sheep/work.c
index 4df8dc0..45f362d 100644
--- a/sheep/work.c
+++ b/sheep/work.c
@@ -28,6 +28,7 @@
 #include <syscall.h>
 #include <sys/types.h>
 #include <linux/types.h>
+#define _LINUX_FCNTL_H
 #include <linux/signalfd.h>

 #include "list.h"


On Wed, Oct 21, 2009 at 5:45 PM, Nikolai K. Bochev
n.boc...@grandstarco.com wrote:
 Hello,

 I am getting the following error trying to compile sheepdog on Ubuntu 9.10 ( 
 2.6.31-14 x64 ) :

 cd shepherd; make
 make[1]: Entering directory 
 `/home/shiny/Packages/sheepdog-2009102101/shepherd'
 cc -c -g -O2 -Wall -Wstrict-prototypes -I../include -D_GNU_SOURCE shepherd.c 
 -o shepherd.o
 shepherd.c: In function ‘main’:
 shepherd.c:300: warning: dereferencing pointer ‘hdr.55’ does break 
 strict-aliasing rules
 shepherd.c:300: note: initialized from here
 cc -c -g -O2 -Wall -Wstrict-prototypes -I../include -D_GNU_SOURCE treeview.c 
 -o treeview.o
 cc -c -g -O2 -Wall -Wstrict-prototypes -I../include -D_GNU_SOURCE 
 ../lib/event.c -o ../lib/event.o
 cc -c -g -O2 -Wall -Wstrict-prototypes -I../include -D_GNU_SOURCE 
 ../lib/net.c -o ../lib/net.o
 ../lib/net.c: In function ‘write_object’:
 ../lib/net.c:358: warning: ‘vosts’ may be used uninitialized in this function
 cc -c -g -O2 -Wall -Wstrict-prototypes -I../include -D_GNU_SOURCE 
 ../lib/logger.c -o ../lib/logger.o
 cc shepherd.o treeview.o ../lib/event.o ../lib/net.o ../lib/logger.o -o 
 shepherd -lncurses -lcrypto
 make[1]: Leaving directory `/home/shiny/Packages/sheepdog-2009102101/shepherd'
 cd sheep; make
 make[1]: Entering directory `/home/shiny/Packages/sheepdog-2009102101/sheep'
 cc -c -g -O2 -Wall -Wstrict-prototypes -I../include -D_GNU_SOURCE sheep.c -o 
 sheep.o
 cc -c -g -O2 -Wall -Wstrict-prototypes -I../include -D_GNU_SOURCE store.c -o 
 store.o
 cc -c -g -O2 -Wall -Wstrict-prototypes -I../include -D_GNU_SOURCE net.c -o 
 net.o
 cc -c -g -O2 -Wall -Wstrict-prototypes -I../include -D_GNU_SOURCE work.c -o 
 work.o
 In file included from /usr/include/asm/fcntl.h:1,
                 from /usr/include/linux/fcntl.h:4,
                 from /usr/include/linux/signalfd.h:13,
                 from work.c:31:
 /usr/include/asm-generic/fcntl.h:117: error: redefinition of ‘struct flock’
 /usr/include/asm-generic/fcntl.h:140: error: redefinition of ‘struct flock64’
 make[1]: *** [work.o] Error 1
 make[1]: Leaving directory `/home/shiny/Packages/sheepdog-2009102101/sheep'
 make: *** [all] Error 2

 I have all the required libs installed. Patching and compiling qemu-kvm went
 flawlessly.

 - Original Message -
 From: MORITA Kazutaka morita.kazut...@lab.ntt.co.jp
 To: kvm@vger.kernel.org, qemu-de...@nongnu.org, linux-fsde...@vger.kernel.org
 Sent: Wednesday, October 21, 2009 8:13:47 AM
 Subject: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM

 Hi everyone,

 Sheepdog is a distributed storage system for KVM/QEMU. It provides
 highly available block level storage volumes to VMs like Amazon EBS.
 Sheepdog supports advanced volume management features such as snapshot,
 cloning, and thin provisioning. Sheepdog runs on several tens or hundreds
 of nodes, and the architecture is fully symmetric; there is no central
 node such as a meta-data server.

 The following list describes the features of Sheepdog.

     * Linear scalability in performance and capacity
     * No single point of failure
     * Redundant architecture (data is written to multiple nodes)
     - Tolerance against network failure
     * Zero configuration (newly added machines will join the cluster 
 automatically)
     - Autonomous load balancing
     * Snapshot
     - Online snapshot from qemu-monitor
     * Clone from a snapshot volume
     * Thin provisioning
     - Amazon EBS API support (to use from a Eucalyptus instance)

 (* = current features, - = on our todo list)

 More details and download links are here:

 http://www.osrg.net/sheepdog/

 Note that the code is still in an early stage.
 There are some critical TODO items:

     - VM image deletion support
     - Support architectures other than X86_64
     - Data recovery
     - Free space management
     - Guarantee reliability and availability under heavy load
     - Performance improvement
     - Reclaim unused blocks
     - More documentation

 We hope to find people interested in working together.
 Enjoy!


 Here are examples:

 - create images

 $ kvm-img create -f sheepdog "Alice's Disk" 256G
 $ kvm-img create -f sheepdog "Bob's Disk" 256G

 - list images

 $ shepherd info -t vdi
    4 : Alice's Disk  256 GB (allocated: 0 MB, shared: 0 MB), 2009-10-15
 16:17:18, tag:        0, current
    8 : Bob's Disk    256 GB (allocated: 0 MB, shared: 0 MB), 2009-10-15
 16:29:20, tag:        0, current

 - start up a virtual machine

 $ kvm --drive format=sheepdog,file="Alice's Disk"

 - create a snapshot

 $ kvm-img snapshot -c name "sheepdog:Alice's Disk"

 - clone from a snapshot

 $ kvm-img create -b sheepdog:Alice's 

Re: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM

2009-10-23 Thread MORITA Kazutaka
We use JGroups (Java library) for reliable multicast communication in
our cluster manager daemon. We don't worry about the performance much
since the cluster manager daemon is not involved in the I/O path. We
might think about moving to corosync if it is more stable than
JGroups.

On Wed, Oct 21, 2009 at 6:08 PM, Dietmar Maurer diet...@proxmox.com wrote:
 Quite interesting. But would it be possible to use corosync for the cluster
 communication? The point is that we need corosync anyway for pacemaker; it
 is written in C (high performance) and seems to implement the features you need.

 -Original Message-
 From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On
 Behalf Of MORITA Kazutaka
 Sent: Mittwoch, 21. Oktober 2009 07:14
 To: kvm@vger.kernel.org; qemu-de...@nongnu.org; linux-
 fsde...@vger.kernel.org
 Subject: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM

 Hi everyone,

 Sheepdog is a distributed storage system for KVM/QEMU. It provides
 highly available block level storage volumes to VMs like Amazon EBS.
 Sheepdog supports advanced volume management features such as snapshot,
 cloning, and thin provisioning. Sheepdog runs on several tens or
 hundreds
 of nodes, and the architecture is fully symmetric; there is no central
 node such as a meta-data server.





-- 
MORITA, Kazutaka

NTT Cyber Space Labs
OSS Computing Project
Kernel Group
E-mail: morita.kazut...@lab.ntt.co.jp


Re: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM

2009-10-23 Thread Chris Webb
MORITA Kazutaka morita.kazut...@lab.ntt.co.jp writes:

 We use JGroups (Java library) for reliable multicast communication in
 our cluster manager daemon. We don't worry about the performance much
 since the cluster manager daemon is not involved in the I/O path. We
 might think about moving to corosync if it is more stable than
 JGroups.

I'd love to see this running on top of corosync too. Corosync is a well
tested, stable cluster manager, and doesn't have the JVM dependency of
jgroups so feels more suitable for building 'thin virtualisation fabrics'.

Cheers,

Chris.


Re: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM

2009-10-23 Thread Chris Webb
Chris Webb ch...@arachsys.com writes:

 MORITA Kazutaka morita.kazut...@lab.ntt.co.jp writes:
 
  We use JGroups (Java library) for reliable multicast communication in
  our cluster manager daemon. We don't worry about the performance much
  since the cluster manager daemon is not involved in the I/O path. We
  might think about moving to corosync if it is more stable than
  JGroups.
 
 I'd love to see this running on top of corosync too. Corosync is a well
 tested, stable cluster manager, and doesn't have the JVM dependency of
 jgroups so feels more suitable for building 'thin virtualisation fabrics'.

Very exciting project, by the way!

Best wishes,

Chris.


Re: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM

2009-10-23 Thread MORITA Kazutaka
On Fri, Oct 23, 2009 at 12:30 AM, Avi Kivity a...@redhat.com wrote:
 On 10/21/2009 07:13 AM, MORITA Kazutaka wrote:

 Hi everyone,

 Sheepdog is a distributed storage system for KVM/QEMU. It provides
 highly available block level storage volumes to VMs like Amazon EBS.
 Sheepdog supports advanced volume management features such as snapshot,
 cloning, and thin provisioning. Sheepdog runs on several tens or hundreds
 of nodes, and the architecture is fully symmetric; there is no central
 node such as a meta-data server.

 Very interesting!  From a very brief look at the code, it looks like the
 sheepdog block format driver is a network client that is able to access
 highly available images, yes?

Yes. Sheepdog is a simple key-value storage system that
consists of multiple nodes (a bit similar to Amazon Dynamo, I guess).

The qemu Sheepdog driver (client) divides a VM image into fixed-size
objects and stores them on the key-value storage system.

 If so, is it reasonable to compare this to a cluster file system setup (like
 GFS) with images as files on this filesystem?  The difference would be that
 clustering is implemented in userspace in sheepdog, but in the kernel for a
 clustering filesystem.

I think that the major difference between sheepdog and cluster file
systems such as Google File System, pNFS, etc. is the interface between
clients and the storage system.

 How is load balancing implemented?  Can you move an image transparently
 while a guest is running?  Will an image be moved closer to its guest?

Sheepdog uses consistent hashing to decide where objects are stored; I/O
load is balanced across the nodes. When a new node is added or an
existing node is removed, the hash table changes and data is
automatically and transparently moved between nodes.

We plan to implement a mechanism to distribute the data not randomly
but intelligently; we could use machine load, the locations of VMs, etc.

 Can you stripe an image across nodes?

Yes, a VM image is divided into multiple objects, and they are
stored across the nodes.

 Do you support multiple guests accessing the same image?

A VM image can be attached to any VM, but only to one VM at a time;
multiple running VMs cannot access the same VM image.

 What about fault tolerance - storing an image redundantly on multiple nodes?

Yes, all objects are replicated to multiple nodes.


-- 
MORITA, Kazutaka

NTT Cyber Space Labs
OSS Computing Project
Kernel Group
E-mail: morita.kazut...@lab.ntt.co.jp


RE: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM

2009-10-23 Thread Dietmar Maurer
 We use JGroups (Java library) for reliable multicast communication in
 our cluster manager daemon.

I doubt that there is something like 'reliable multicast' - you will run into 
many problems when you try to handle errors.

 We don't worry about the performance much
 since the cluster manager daemon is not involved in the I/O path. We
 might think about moving to corosync if it is more stable than
 JGroups.

corosync is already quite stable. And it supports virtual synchrony:

http://en.wikipedia.org/wiki/Virtual_synchrony

Anyway, I do not know JGroups - maybe their 'reliable multicast' solves all
these network problems somehow. Is there any documentation about how they do it?

- Dietmar



RE: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM

2009-10-23 Thread Dietmar Maurer
Another suggestion: use LVM instead of btrfs (to get better performance)



RE: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM

2009-10-23 Thread Dietmar Maurer
 Anyway, I do not know JGroups - maybe their 'reliable multicast' solves
 all these network problems somehow. Is there any documentation about how
 they do it?

OK, found the papers on their web site - quite interesting too.



Re: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM

2009-10-23 Thread Javier Guerra
On Fri, Oct 23, 2009 at 5:41 AM, MORITA Kazutaka
morita.kazut...@lab.ntt.co.jp wrote:
 On Fri, Oct 23, 2009 at 12:30 AM, Avi Kivity a...@redhat.com wrote:
 If so, is it reasonable to compare this to a cluster file system setup (like
 GFS) with images as files on this filesystem?  The difference would be that
 clustering is implemented in userspace in sheepdog, but in the kernel for a
 clustering filesystem.

 I think that the major difference between sheepdog and cluster file
 systems such as Google File system, pNFS, etc is the interface between
 clients and a storage system.

note that GFS is Global File System (written by Sistina (the same
folks from LVM) and bought by RedHat).  Google Filesystem is a
different thing, and ironically the client/storage interface is a
little more like sheepdog and unlike a regular cluster filesystem.

 How is load balancing implemented?  Can you move an image transparently
 while a guest is running?  Will an image be moved closer to its guest?

 Sheepdog uses consistent hashing to decide where objects are stored; I/O
 load is balanced across the nodes. When a new node is added or an
 existing node is removed, the hash table changes and data is
 automatically and transparently moved between nodes.

 We plan to implement a mechanism to distribute the data not randomly
 but intelligently; we could use machine load, the locations of VMs, etc.

i don't have much hands-on experience with consistent hashing; but it
sounds reasonable to make each node's ring segment proportional to its
storage capacity.  dynamic load balancing seems a tougher nut to
crack, especially while keeping all clients' mappings consistent.

 Do you support multiple guests accessing the same image?

 A VM image can be attached to any VM, but only to one VM at a time;
 multiple running VMs cannot access the same VM image.

this is a must-have safety measure; but a 'manual override' is quite
useful for those who know how to manage a cluster-aware filesystem
inside a VM image, maybe like Xen's 'w!' flag does.  just be sure to
avoid distributed caching for a shared image!

in all, a great project, and with such a clean patch into KVM/Qemu, high
hopes of making it into regular use.

i'd just want to add my '+1 votes' on both getting rid of the JVM
dependency and using block devices (usually LVM) instead of ext3/btrfs.

-- 
Javier


Re: [Qemu-devel] Re: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM

2009-10-23 Thread Chris Webb
Javier Guerra jav...@guerrag.com writes:

 i'd just want to add my '+1 votes' on both getting rid of JVM
 dependency and using block devices (usually LVM) instead of ext3/btrfs

If the chunks into which the virtual drives are split are quite small (say
the 64MB used by Hadoop), LVM may be a less appropriate choice. It doesn't
support very large numbers of very small logical volumes very well.

Best wishes,

Chris.


Re: [Qemu-devel] Re: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM

2009-10-23 Thread Javier Guerra
On Fri, Oct 23, 2009 at 9:58 AM, Chris Webb ch...@arachsys.com wrote:
 If the chunks into which the virtual drives are split are quite small (say
 the 64MB used by Hadoop), LVM may be a less appropriate choice. It doesn't
 support very large numbers of very small logical volumes very well.

absolutely.  the 'nicest' way to do it would be to use a single block
device per sheep process, and do the splitting there.

it's an extra layer of code, and once you add non-naïve behavior for
deleting and fragmentation, you quickly approach filesystem-like
complexity.

unless you can do some very clever mapping that reuses the consistent
hash algorithms to find not only which server(s) you want, but also
which chunk to hit... the kind of thing i'd love to code, but never
found a use for it.

i'll definitely dig deeper in the code.

-- 
Javier


Re: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM

2009-10-23 Thread MORITA Kazutaka
Sorry, I am not familiar with the details of Exanodes/Seanodes, but it seems
to be a storage system that provides the iSCSI protocol. As I wrote in a
different mail, Sheepdog is a storage system that provides a simple key-value
interface to the Sheepdog client (the qemu block driver).

On Fri, Oct 23, 2009 at 3:53 AM, Avishay Traeger avis...@gmail.com wrote:
 This looks very interesting - how does this compare with Exanodes/Seanodes?

 Thanks,
 Avishay




-- 
MORITA, Kazutaka

NTT Cyber Space Labs
OSS Computing Project
Kernel Group
E-mail: morita.kazut...@lab.ntt.co.jp


Re: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM

2009-10-23 Thread FUJITA Tomonori
On Fri, 23 Oct 2009 09:14:29 -0500
Javier Guerra jav...@guerrag.com wrote:

  I think that the major difference between sheepdog and cluster file
  systems such as Google File system, pNFS, etc is the interface between
  clients and a storage system.
 
 note that GFS is Global File System (written by Sistina (the same
 folks from LVM) and bought by RedHat).  Google Filesystem is a
 different thing, and ironically the client/storage interface is a
 little more like sheepdog and unlike a regular cluster filesystem.

Hmm, Avi referred to Global File System? I wasn't sure. 'GFS' is
ambiguous. Anyway, Global File System is a SAN file system. It's
a completely different architecture from Sheepdog.


  Sheepdog uses consistent hashing to decide where objects are stored; I/O
  load is balanced across the nodes. When a new node is added or an
  existing node is removed, the hash table changes and data is
  automatically and transparently moved between nodes.
 
  We plan to implement a mechanism to distribute the data not randomly
  but intelligently; we could use machine load, the locations of VMs, etc.
 
 i don't have much hands-on experience with consistent hashing; but it
 sounds reasonable to make each node's ring segment proportional to its
 storage capacity.

Yeah, that's one of the techniques, I think.


  dynamic load balancing seems a tougher nut to
 crack, especially while keeping all clients' mappings consistent.

There are some techniques to do that.

We think that there are some existing techniques to distribute data
intelligently. We just have not analyzed the options.


 i'd just want to add my '+1 votes' on both getting rid of JVM
 dependency and using block devices (usually LVM) instead of ext3/btrfs

LVM doesn't fit our requirements nicely. What we need is to update
some objects in an atomic way. We could implement that ourselves, but
we prefer to keep our code simple by using an existing mechanism.


Re: [Qemu-devel] Re: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM

2009-10-23 Thread MORITA Kazutaka
On Fri, Oct 23, 2009 at 8:10 PM, Alexander Graf ag...@suse.de wrote:

 On 23.10.2009, at 12:41, MORITA Kazutaka wrote:

 On Fri, Oct 23, 2009 at 12:30 AM, Avi Kivity a...@redhat.com wrote:

 How is load balancing implemented?  Can you move an image transparently

 while a guest is running?  Will an image be moved closer to its guest?

 Sheepdog uses consistent hashing to decide where objects are stored; I/O
 load is balanced across the nodes. When a new node is added or an
 existing node is removed, the hash table changes and data is
 automatically and transparently moved between nodes.

 We plan to implement a mechanism to distribute the data not randomly
 but intelligently; we could use machine load, the locations of VMs, etc.

 What exactly does balanced mean? Can it cope with individual nodes having
 more disk space than others?

I mean objects are uniformly distributed over the nodes by the hash function.
Distribution using free disk space information is one of our TODO items.

 Do you support multiple guests accessing the same image?

 A VM image can be attached to any VM, but only to one VM at a time;
 multiple running VMs cannot access the same VM image.

 What about read-only access? Imagine you'd have 5 kvm instances each
 accessing it using -snapshot.

By creating a new clone image from an existing snapshot image, you can do
something similar.
Sheepdog can create a clone image instantly.

-- 
MORITA, Kazutaka

NTT Cyber Space Labs
OSS Computing Project
Kernel Group
E-mail: morita.kazut...@lab.ntt.co.jp


Re: [Qemu-devel] Re: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM

2009-10-23 Thread Tomasz Chmielewski

Chris Webb wrote:

Javier Guerra jav...@guerrag.com writes:


i'd just want to add my '+1 votes' on both getting rid of JVM
dependency and using block devices (usually LVM) instead of ext3/btrfs


If the chunks into which the virtual drives are split are quite small (say
the 64MB used by Hadoop), LVM may be a less appropriate choice. It doesn't
support very large numbers of very small logical volumes very well.


Also, on _loaded_ systems, I noticed creating/removing logical volumes 
can take really long (several minutes); where allocating a file of a 
given size would just take a fraction of that.


Not sure how much it would matter here, but probably it would.

--
Tomasz Chmielewski
http://wpkg.org





Re: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM

2009-10-22 Thread Avi Kivity

On 10/21/2009 07:13 AM, MORITA Kazutaka wrote:

Hi everyone,

Sheepdog is a distributed storage system for KVM/QEMU. It provides
highly available block level storage volumes to VMs like Amazon EBS.
Sheepdog supports advanced volume management features such as snapshot,
cloning, and thin provisioning. Sheepdog runs on several tens or hundreds
of nodes, and the architecture is fully symmetric; there is no central
node such as a meta-data server.


Very interesting!  From a very brief look at the code, it looks like the 
sheepdog block format driver is a network client that is able to access 
highly available images, yes?


If so, is it reasonable to compare this to a cluster file system setup 
(like GFS) with images as files on this filesystem?  The difference 
would be that clustering is implemented in userspace in sheepdog, but in 
the kernel for a clustering filesystem.


How is load balancing implemented?  Can you move an image transparently 
while a guest is running?  Will an image be moved closer to its guest?  
Can you stripe an image across nodes?


Do you support multiple guests accessing the same image?

What about fault tolerance - storing an image redundantly on multiple nodes?

--
error compiling committee.c: too many arguments to function



Re: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM

2009-10-22 Thread Anthony Liguori

Avi Kivity wrote:

On 10/21/2009 07:13 AM, MORITA Kazutaka wrote:

Hi everyone,

Sheepdog is a distributed storage system for KVM/QEMU. It provides
highly available block level storage volumes to VMs like Amazon EBS.
Sheepdog supports advanced volume management features such as snapshot,
cloning, and thin provisioning. Sheepdog runs on several tens or hundreds
of nodes, and the architecture is fully symmetric; there is no central
node such as a meta-data server.


Very interesting!  From a very brief look at the code, it looks like 
the sheepdog block format driver is a network client that is able to 
access highly available images, yes?


If so, is it reasonable to compare this to a cluster file system setup 
(like GFS) with images as files on this filesystem?  The difference 
would be that clustering is implemented in userspace in sheepdog, but 
in the kernel for a clustering filesystem.


I'm still in the process of reading the code, but that's the impression
I got too.  It made me think that the protocol for qemu to communicate
with sheepdog could be a filesystem protocol (like 9p), and sheepdog
could expose itself as a synthetic filesystem.  There are some interesting
ramifications to something like that--namely that you could mount
sheepdog on localhost and interact with it through the vfs.


Very interesting stuff, I'm looking forward to examining more closely.

Regards,

Anthony Liguori


Re: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM

2009-10-22 Thread Alexander Graf


On 22.10.2009 at 18:28, Anthony Liguori anth...@codemonkey.ws wrote:


Avi Kivity wrote:

On 10/21/2009 07:13 AM, MORITA Kazutaka wrote:

Hi everyone,

Sheepdog is a distributed storage system for KVM/QEMU. It provides
highly available block level storage volumes to VMs like Amazon EBS.
Sheepdog supports advanced volume management features such as snapshot,
cloning, and thin provisioning. Sheepdog runs on several tens or hundreds
of nodes, and the architecture is fully symmetric; there is no central
node such as a meta-data server.


Very interesting!  From a very brief look at the code, it looks  
like the sheepdog block format driver is a network client that is  
able to access highly available images, yes?


If so, is it reasonable to compare this to a cluster file system  
setup (like GFS) with images as files on this filesystem?  The  
difference would be that clustering is implemented in userspace in  
sheepdog, but in the kernel for a clustering filesystem.


I'm still in the process of reading the code, but that's the  
impression I got too.  It made me think that the protocol for qemu  
to communicate with sheepdog could be a filesystem protocol (like 9p)


Speaking about 9p, what's the status there?

Alex


Re: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM

2009-10-21 Thread Nikolai K. Bochev
Hello,

I am getting the following error trying to compile sheepdog on Ubuntu 9.10 ( 
2.6.31-14 x64 ) :

cd shepherd; make
make[1]: Entering directory `/home/shiny/Packages/sheepdog-2009102101/shepherd'
cc -c -g -O2 -Wall -Wstrict-prototypes -I../include -D_GNU_SOURCE shepherd.c -o 
shepherd.o
shepherd.c: In function ‘main’:
shepherd.c:300: warning: dereferencing pointer ‘hdr.55’ does break 
strict-aliasing rules
shepherd.c:300: note: initialized from here
cc -c -g -O2 -Wall -Wstrict-prototypes -I../include -D_GNU_SOURCE treeview.c -o 
treeview.o
cc -c -g -O2 -Wall -Wstrict-prototypes -I../include -D_GNU_SOURCE 
../lib/event.c -o ../lib/event.o
cc -c -g -O2 -Wall -Wstrict-prototypes -I../include -D_GNU_SOURCE ../lib/net.c 
-o ../lib/net.o
../lib/net.c: In function ‘write_object’:
../lib/net.c:358: warning: ‘vosts’ may be used uninitialized in this function
cc -c -g -O2 -Wall -Wstrict-prototypes -I../include -D_GNU_SOURCE 
../lib/logger.c -o ../lib/logger.o
cc shepherd.o treeview.o ../lib/event.o ../lib/net.o ../lib/logger.o -o 
shepherd -lncurses -lcrypto
make[1]: Leaving directory `/home/shiny/Packages/sheepdog-2009102101/shepherd'
cd sheep; make
make[1]: Entering directory `/home/shiny/Packages/sheepdog-2009102101/sheep'
cc -c -g -O2 -Wall -Wstrict-prototypes -I../include -D_GNU_SOURCE sheep.c -o 
sheep.o
cc -c -g -O2 -Wall -Wstrict-prototypes -I../include -D_GNU_SOURCE store.c -o 
store.o
cc -c -g -O2 -Wall -Wstrict-prototypes -I../include -D_GNU_SOURCE net.c -o net.o
cc -c -g -O2 -Wall -Wstrict-prototypes -I../include -D_GNU_SOURCE work.c -o 
work.o
In file included from /usr/include/asm/fcntl.h:1,
 from /usr/include/linux/fcntl.h:4,
 from /usr/include/linux/signalfd.h:13,
 from work.c:31:
/usr/include/asm-generic/fcntl.h:117: error: redefinition of ‘struct flock’
/usr/include/asm-generic/fcntl.h:140: error: redefinition of ‘struct flock64’
make[1]: *** [work.o] Error 1
make[1]: Leaving directory `/home/shiny/Packages/sheepdog-2009102101/sheep'
make: *** [all] Error 2

I have all the required libs installed. Patching and compiling qemu-kvm went
flawlessly.

- Original Message -
From: MORITA Kazutaka morita.kazut...@lab.ntt.co.jp
To: kvm@vger.kernel.org, qemu-de...@nongnu.org, linux-fsde...@vger.kernel.org
Sent: Wednesday, October 21, 2009 8:13:47 AM
Subject: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM

Hi everyone,

Sheepdog is a distributed storage system for KVM/QEMU. It provides
highly available block level storage volumes to VMs like Amazon EBS.
Sheepdog supports advanced volume management features such as snapshot,
cloning, and thin provisioning. Sheepdog runs on several tens or hundreds
of nodes, and the architecture is fully symmetric; there is no central
node such as a meta-data server.

The following list describes the features of Sheepdog.

 * Linear scalability in performance and capacity
 * No single point of failure
 * Redundant architecture (data is written to multiple nodes)
 - Tolerance against network failure
 * Zero configuration (newly added machines will join the cluster 
automatically)
 - Autonomous load balancing
 * Snapshot
 - Online snapshot from qemu-monitor
 * Clone from a snapshot volume
 * Thin provisioning
 - Amazon EBS API support (to use from a Eucalyptus instance)

(* = current features, - = on our todo list)
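A fully symmetric design with no meta-data server implies every node can compute an object's location locally, which is typically done with consistent hashing. A hypothetical Python sketch of that idea (node names, virtual-node count, and replica count are made up for illustration, not taken from the sheepdog code):

```python
# Sketch: consistent hashing lets any node map an object ID to its
# replica nodes without consulting a central metadata server.
import hashlib
from bisect import bisect


def _hash(key: str) -> int:
    return int(hashlib.sha1(key.encode()).hexdigest(), 16)


class Ring:
    def __init__(self, nodes, vnodes=64):
        # Place several virtual points per node for smoother balance.
        self.points = sorted(
            (_hash(f"{n}:{i}"), n) for n in nodes for i in range(vnodes)
        )

    def replicas(self, obj_id: str, count: int = 3):
        # Walk clockwise from the object's hash, taking distinct nodes.
        keys = [p[0] for p in self.points]
        idx = bisect(keys, _hash(obj_id)) % len(self.points)
        out = []
        while len(out) < count:
            node = self.points[idx % len(self.points)][1]
            if node not in out:
                out.append(node)
            idx += 1
        return out


ring = Ring([f"node{i}" for i in range(8)])
print(ring.replicas("vdi:Alice's Disk:obj0"))
```

When a node joins or leaves, only the objects whose hash interval moved need to be copied, which is what makes "zero configuration" and autonomous rebalancing practical.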

More details and download links are here:

http://www.osrg.net/sheepdog/

Note that the code is still in an early stage.
There are some critical TODO items:

 - VM image deletion support
 - Support architectures other than X86_64
 - Data recovery
 - Free space management
 - Guarantee reliability and availability under heavy load
 - Performance improvement
 - Reclaim unused blocks
 - More documentation

We hope to find people interested in working together.
Enjoy!


Here are examples:

- create images

$ kvm-img create -f sheepdog "Alice's Disk" 256G
$ kvm-img create -f sheepdog "Bob's Disk" 256G

- list images

$ shepherd info -t vdi
4 : Alice's Disk  256 GB (allocated: 0 MB, shared: 0 MB), 2009-10-15 16:17:18, tag:0, current
8 : Bob's Disk    256 GB (allocated: 0 MB, shared: 0 MB), 2009-10-15 16:29:20, tag:0, current

- start up a virtual machine

$ kvm --drive format=sheepdog,file="Alice's Disk"

- create a snapshot

$ kvm-img snapshot -c name sheepdog:"Alice's Disk"

- clone from a snapshot

$ kvm-img create -b sheepdog:"Alice's Disk":0 -f sheepdog "Charlie's Disk"


Thanks.

-- 
MORITA, Kazutaka

NTT Cyber Space Labs
OSS Computing Project
Kernel Group
E-mail: morita.kazut...@lab.ntt.co.jp


RE: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM

2009-10-21 Thread Dietmar Maurer
Quite interesting. But would it be possible to use corosync for the cluster
communication? The point is that we need corosync anyway for Pacemaker; it is
written in C (high performance) and seems to implement the features you need.

 -Original Message-
 From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On
 Behalf Of MORITA Kazutaka
 Sent: Wednesday, 21 October 2009 07:14
 To: kvm@vger.kernel.org; qemu-de...@nongnu.org; linux-
 fsde...@vger.kernel.org
 Subject: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM
 
 Hi everyone,
 
 Sheepdog is a distributed storage system for KVM/QEMU. It provides
 highly available block level storage volumes to VMs like Amazon EBS.
 Sheepdog supports advanced volume management features such as snapshot,
 cloning, and thin provisioning. Sheepdog runs on several tens or
 hundreds
 of nodes, and the architecture is fully symmetric; there is no central
 node such as a meta-data server.
