[PATCH] rbd: fix the memory leak of bio_chain_clone

2012-06-25 Thread Guangliang Zhao
Signed-off-by: Guangliang Zhao <gz...@suse.com>
---
 drivers/block/rbd.c |   10 ++++------
 1 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 65665c9..3d6dfc8 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -719,8 +719,6 @@ static struct bio *bio_chain_clone(struct bio **old, struct bio **next,
 			goto err_out;
 
 		if (total + old_chain->bi_size > len) {
-			struct bio_pair *bp;
-
 			/*
 			 * this split can only happen with a single paged bio,
 			 * split_bio will BUG_ON if this is not the case
@@ -732,13 +730,13 @@ static struct bio *bio_chain_clone(struct bio **old, struct bio **next,
 
 			/* split the bio. We'll release it either in the next
 			   call, or it will have to be released outside */
-			bp = bio_split(old_chain, (len - total) / SECTOR_SIZE);
-			if (!bp)
+			*bp = bio_split(old_chain, (len - total) / SECTOR_SIZE);
+			if (!*bp)
 				goto err_out;
 
-			__bio_clone(tmp, &bp->bio1);
+			__bio_clone(tmp, &(*bp)->bio1);
 
-			*next = &bp->bio2;
+			*next = &(*bp)->bio2;
 		} else {
 			__bio_clone(tmp, old_chain);
 			*next = old_chain->bi_next;
-- 
1.7.3.4
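
For context, a minimal caller-side sketch of what the new out-parameter enables
(illustrative only, not taken from rbd.c; the surrounding names, the length
variable and the error label are made up): once bio_chain_clone() stores the
split pair through *bp, the caller owns it and can release it, instead of the
pair being created in a local variable inside bio_chain_clone() and leaked.

	struct bio *next_bio = NULL;
	struct bio_pair *bp = NULL;	/* set by bio_chain_clone() when it splits a bio */
	struct bio *clone;

	clone = bio_chain_clone(&rq_bio, &next_bio, &bp, len, GFP_ATOMIC);
	if (!clone)
		goto err_out;

	/* ... build and submit the request using 'clone' ... */

	if (bp)
		bio_pair_release(bp);	/* before this fix, nothing ever dropped
					 * the split pair's reference */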



radosgw forgetting subuser permissions when creating a fresh key

2012-06-25 Thread Florian Haas
Hi everyone,

I wonder if this is intentional: when I create a new Swift key for an
existing subuser, which has previously been assigned full control
permissions, those permissions appear to get lost upon key creation.

# radosgw-admin subuser create --uid=johndoe --subuser=johndoe:swift
--access=full
{ "user_id": "johndoe",
  "rados_uid": 0,
  "display_name": "John Doe",
  "email": "j...@example.com",
  "suspended": 0,
  "subusers": [
        { "id": "johndoe:swift",
          "permissions": "full-control"}],
  "keys": [
        { "user": "johndoe",
          "access_key": "QFAMEDSJP5DEKJO0DDXY",
          "secret_key": "iaSFLDVvDdQt6lkNzHyW4fPLZugBAI1g17LO0+87"}],
  "swift_keys": []}

Note "permissions": "full-control"

# radosgw-admin key create --subuser=johndoe:swift --key-type=swift
{ "user_id": "johndoe",
  "rados_uid": 0,
  "display_name": "John Doe",
  "email": "j...@example.com",
  "suspended": 0,
  "subusers": [
        { "id": "johndoe:swift",
          "permissions": "none"}],
  "keys": [
        { "user": "johndoe",
          "access_key": "QFAMEDSJP5DEKJO0DDXY",
          "secret_key": "iaSFLDVvDdQt6lkNzHyW4fPLZugBAI1g17LO0+87"}],
  "swift_keys": [
        { "user": "johndoe:swift",
          "secret_key": "E9T2rUZNu2gxUjcwUBO8n\/Ev4KX6\/GprEuH4qhu1"}]}

Note that while there is now a key, the permissions are gone. Is this
meant to be a security feature of sorts, or is this a bug? "subuser
modify" can obviously restore the permissions, but it seems less than
desirable to have to do that.

Cheers,
Florian


Re: filestore flusher = false , correct my problem of constant write (need info on this parameter)

2012-06-25 Thread Alexandre DERUMIER
Hi,
maybe this can help:

I have tuned
filestore queue max ops = 5

Now I'm able to achieve 4000 io/s (with some spikes)

with 3 nodes with 1 x osd (1 x 15k drive per osd), journal on tmpfs
or
3 nodes with 5 osd (1 x 15k drive per osd), journal on tmpfs.

Same result for both configurations.




- Original Message -

From: Alexandre DERUMIER aderum...@odiso.com
To: Sage Weil s...@inktank.com
Cc: Mark Nelson mark.nel...@inktank.com, ceph-devel@vger.kernel.org,
Stefan Priebe s.pri...@profihost.ag
Sent: Sunday, June 24, 2012 10:10:48
Subject: Re: filestore flusher = false , correct my problem of constant write
(need info on this parameter)

ok, I have done tests with more than 1 client. 

3 kvm guests on 3 different kvm host servers, and 3 kvm guests on the same server.

I have the same result, around 2000 io/s shared between the clients. So it
doesn't scale.


I have also tried with 3 nodes x 5 osd (1 x 15k drive per osd) + 5 tmpfs journals,
and 3 nodes x 1 osd (hardware raid0 with 5 x 15k disks) + 1 tmpfs journal.

The results are the same, around 2000 io/s.


But
if I try with 3 nodes x 1 osd with a single 15k drive,
I get around 500 io/s.


I also know that Stefan Priebe has achieved around 12000 io/s with SSDs as OSDs.

So it seems related to OSD drive speed.

So, are we sure that the journal is acking to the client before flushing to disk?



The benchmark used is fio with direct I/O, writing 100MB in 4K blocks (so the
journal is big enough to handle all the writes).

fio --filename=[disk] --direct=1 --rw=randwrite --bs=4k --size=100M 
--numjobs=50 --runtime=30 --group_reporting --name=file1 


- Original Message -

From: Alexandre DERUMIER aderum...@odiso.com
To: Sage Weil s...@inktank.com
Cc: Mark Nelson mark.nel...@inktank.com, ceph-devel@vger.kernel.org,
Stefan Priebe s.pri...@profihost.ag
Sent: Saturday, June 23, 2012 20:21:05
Subject: Re: filestore flusher = false , correct my problem of constant write
(need info on this parameter)

Is that 2000 ios from a single client? You might try multiple clients and 
see if the sum of the ios will scale any higher. 

yes from a single client. (qemu-kvm guest). 

Tomorrow I'll retry with 3 qemu-kvm guests, on the same host and on 3 different hosts.
I'll also try on a machine with a bigger CPU to compare. (I see a lot of CPU usage in
my kvm guest process, more than with iscsi.)

I'll keep you in touch. 

Thanks 

Alexandre 

- Original Message -

From: Sage Weil s...@inktank.com
To: Alexandre DERUMIER aderum...@odiso.com
Cc: Mark Nelson mark.nel...@inktank.com, ceph-devel@vger.kernel.org,
Stefan Priebe s.pri...@profihost.ag
Sent: Saturday, June 23, 2012 20:12:49
Subject: Re: filestore flusher = false , correct my problem of constant write
(need info on this parameter)

On Sat, 23 Jun 2012, Alexandre DERUMIER wrote: 
 I was just talking with Elder on IRC yesterday about looking into how 
 much small network transfers are hurting us in cases like these. Even 
 with SSD based OSDs I haven't seen a very dramatic improvement in small 
 request performance. How tough would it be to aggregate requests into 
 larger network transactions? There would be a latency penalty of 
 course, but we could flush a client side dirty cache pretty quickly and 
 still benefit if we are getting bombarded with lots of tiny requests. 
 
 Yes, I see no improvement with journal on tmpfs... this is strange.
 
 Also, I have tried with rbd_cache=true, so ios should already be aggregated
 into bigger transactions.
 But I didn't see any improvement.
 
 I'm around 2000 ios.
 
 Do you know what the bottleneck is? The rbd protocol (some kind of overhead
 for each io)?

Is that 2000 ios from a single client? You might try multiple clients and 
see if the sum of the ios will scale any higher. That will tell us 
whether it is in the messenger or osd request pipeline. The latter 
definitely needs some work, although there may be a quick fix to the msgr 
that will buy us some too. 

sage 


 
 
 - Original Message -
 
 From: Mark Nelson mark.nel...@inktank.com
 To: Sage Weil s...@inktank.com
 Cc: Alexandre DERUMIER aderum...@odiso.com, ceph-devel@vger.kernel.org,
 Stefan Priebe s.pri...@profihost.ag
 Sent: Saturday, June 23, 2012 18:40:28
 Subject: Re: filestore flusher = false , correct my problem of constant write
 (need info on this parameter)
 
 On 6/23/12 10:38 AM, Sage Weil wrote: 
  On Fri, 22 Jun 2012, Alexandre DERUMIER wrote: 
  Hi Sage, 
  thanks for your response. 
  
  If you turn off the journal completely, you will see bursty write commits
  from the perspective of the client, because the OSD is periodically doing
  a sync or snapshot and only acking the writes then.
  If you enable the journal, the OSD will reply with a commit as soon as the
  write is stable in the journal. That's one reason why it is there--file
  system commits are heavyweight and slow.
  
  Yes, of course, I don't want to deactivate the journal; using a journal on a
  fast SSD or NVRAM is the right way.
  
  If we left the file system to its 

Qemu fails to open RBD image when auth_supported is not set to 'none'

2012-06-25 Thread Wido den Hollander

Hi,

I just tried to start a VM with libvirt with the following disk:

<disk type='network' device='disk'>
  <driver name='qemu' type='raw' cache='none'/>
  <source protocol='rbd' name='rbd/8489c04f-aab8-4796-a22a-ebaa7be247a7'>
    <host name='31.25.XX.XX' port='6789'/>
  </source>
  <target dev='vda' bus='virtio'/>
</disk>

That fails with: Operation not supported

I tried qemu-img:

qemu-img info 
rbd:rbd/8489c04f-aab8-4796-a22a-ebaa7be247a7:mon_host=31.25.XX.XX\\:6789


Same result.

I then tried:

qemu-img info 
rbd:rbd/8489c04f-aab8-4796-a22a-ebaa7be247a7:mon_host=31.25.XX.XX\\:6789:auth_supported=none


And that worked :)

This host does not have a local ceph.conf, all the parameters have to 
come from the command line.


I know that auth_supported recently changed to default to cephx, but that now
breaks the libvirt integration, since libvirt doesn't explicitly set
auth_supported to none when no auth section is present.


Should this be something that gets fixed in librados or in libvirt?

If it's libvirt, I'll write a patch for it :)

Wido
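
For reference, a minimal librados C sketch of the same situation (illustrative
only; the monitor address is the placeholder from the example above, and the
option names are the ones passed on the qemu-img command line): with no local
ceph.conf, both mon_host and auth_supported have to be supplied explicitly,
which is what libvirt currently fails to do when no auth section is present.

/* Illustrative only -- build with -lrados. */
#include <stdio.h>
#include <rados/librados.h>

int main(void)
{
	rados_t cluster;

	if (rados_create(&cluster, NULL) < 0)
		return 1;

	/* No ceph.conf on this host: everything comes from conf_set calls,
	 * just like the ":key=value" options given to qemu-img above. */
	rados_conf_set(cluster, "mon_host", "31.25.XX.XX:6789");
	rados_conf_set(cluster, "auth_supported", "none");

	if (rados_connect(cluster) < 0) {
		fprintf(stderr, "connect failed\n");
		rados_shutdown(cluster);
		return 1;
	}

	printf("connected without cephx\n");
	rados_shutdown(cluster);
	return 0;
}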


Re: Qemu fails to open RBD image when auth_supported is not set to 'none'

2012-06-25 Thread Wido den Hollander

On 06/25/2012 05:20 PM, Wido den Hollander wrote:

Hi,

I just tried to start a VM with libvirt with the following disk:

<disk type='network' device='disk'>
  <driver name='qemu' type='raw' cache='none'/>
  <source protocol='rbd' name='rbd/8489c04f-aab8-4796-a22a-ebaa7be247a7'>
    <host name='31.25.XX.XX' port='6789'/>
  </source>
  <target dev='vda' bus='virtio'/>
</disk>

That fails with: Operation not supported

I tried qemu-img:

qemu-img info
rbd:rbd/8489c04f-aab8-4796-a22a-ebaa7be247a7:mon_host=31.25.XX.XX\\:6789

Same result.

I then tried:

qemu-img info
rbd:rbd/8489c04f-aab8-4796-a22a-ebaa7be247a7:mon_host=31.25.XX.XX\\:6789:auth_supported=none


And that worked :)

This host does not have a local ceph.conf, all the parameters have to
come from the command line.

I know that auth_supported recently changed to default to cephx, but that now
breaks the libvirt integration, since libvirt doesn't explicitly set
auth_supported to none when no auth section is present.

Should this be something that gets fixed in librados or in libvirt?


Thought about it, this is something in libvirt :)



If it's libvirt, I'll write a patch for it :)


Just did so, very simple patch: 
https://www.redhat.com/archives/libvir-list/2012-June/msg01119.html


Wido



Wido


Re: Unable to restart Mon after reboot

2012-06-25 Thread Tommi Virtanen
On Sat, Jun 23, 2012 at 3:43 AM, David Blundell
david.blund...@100percentit.com wrote:
 The logs on all three servers are full of messages like:
 Jun 23 04:02:19 Store2 kernel: [63811.494955] ceph-osd: page allocation 
 failure: order:3, mode:0x4020

 The difference between the lines is that order: varies between 2, 3, 4 or 5

 Is this likely to be a btrfs bug?

That means you're running out of memory, in kernelspace. The order is
the power-of-two (2**n) of how many 4kB pages were requested, 0x4020 =
GFP_COMP|GFP_HIGH (compound & access emergency pools). Btrfs may be
indirectly related, it's not clear what's consuming all the memory,
but that doesn't sound all that likely. That message should be
followed by a stack dump, that might tell us more.
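
To put those numbers in perspective, a quick arithmetic sketch (plain userspace
C, just to illustrate the sizes involved; not related to the kernel code): an
order-n allocation asks for 2^n physically contiguous 4 kB pages, so the
failing allocations above were requests for 16-128 kB of contiguous memory.

#include <stdio.h>

int main(void)
{
	const unsigned long page_kb = 4;		/* 4 kB pages, as in the log */

	for (int order = 2; order <= 5; order++)	/* the orders seen in the log */
		printf("order %d -> %lu kB contiguous\n", order, page_kb << order);
	return 0;
}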

Are you using the Ceph distributed filesystem, or just the RADOS
level, e.g. RBD images?


Re: Ceph as a NOVA-INST-DIR/instances/ storage backend

2012-06-25 Thread Tommi Virtanen
On Sat, Jun 23, 2012 at 11:42 AM, Igor Laskovy igor.lask...@gmail.com wrote:
 Hi all from hot Kiev))

 Does anybody use Ceph as a backend storage for NOVA-INST-DIR/instances/ ?
 Is it in production use? Live migration is still possible?
 I kindly ask any advice of best practices point of view.

That's the shared NFS mount style for storing images, right? While you
could use the Ceph Distributed File System for that, there's a better
answer (for both Nova and Glance): RBD.

Our docs are still being worked on, but you can see the current state
at http://ceph.com/docs/master/rbd/rbd-openstack/
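
To make "RBD" concrete at the API level, here is a minimal librbd C sketch
(illustrative only; the image name is made up and a working client setup with
ceph.conf and keyring is assumed): each guest disk becomes a named RBD image
in a pool, rather than a file on a shared NFS/CephFS mount.

/* Illustrative only -- build with -lrados -lrbd. */
#include <stdio.h>
#include <rados/librados.h>
#include <rbd/librbd.h>

int main(void)
{
	rados_t cluster;
	rados_ioctx_t io;
	int order = 0;				/* 0 = default object size */

	if (rados_create(&cluster, NULL) < 0 ||
	    rados_conf_read_file(cluster, NULL) < 0 ||
	    rados_connect(cluster) < 0)
		return 1;

	if (rados_ioctx_create(cluster, "rbd", &io) < 0) {
		rados_shutdown(cluster);
		return 1;
	}

	/* Create a 10 GB image that a hypervisor can attach as a block device. */
	if (rbd_create(io, "example-instance-disk", 10ULL << 30, &order) < 0)
		fprintf(stderr, "rbd_create failed\n");

	rados_ioctx_destroy(io);
	rados_shutdown(cluster);
	return 0;
}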


Re: reproducable osd crash

2012-06-25 Thread Dan Mick
I've yet to make the core match the binary.  

On Jun 22, 2012, at 11:32 PM, Stefan Priebe s.pri...@profihost.ag wrote:

 Thanks, did you find anything?
 
  On 23.06.2012 at 01:59, Sam Just sam.j...@inktank.com wrote:
 
 I am still looking into the logs.
 -Sam
 
 On Fri, Jun 22, 2012 at 3:56 PM, Dan Mick dan.m...@inktank.com wrote:
 Stefan, I'm looking at your logs and coredump now.
 
 
 On 06/21/2012 11:43 PM, Stefan Priebe wrote:
 
  Does anybody have an idea? This is a showstopper for me right now.
 
  On 21.06.2012 at 14:55, Stefan Priebe - Profihost AG s.pri...@profihost.ag wrote:
 
 Hello list,
 
  I'm able to reproducibly crash OSD daemons.
  
  How I can reproduce it:
 
 Kernel: 3.5.0-rc3
 Ceph: 0.47.3
 FS: btrfs
 Journal: 2GB tmpfs per OSD
 OSD: 3x servers with 4x Intel SSD OSDs each
 10GBE Network
 rbd_cache_max_age: 2.0
 rbd_cache_size: 33554432
 
 Disk is set to writeback.
 
 Start a KVM VM via PXE with the disk attached in writeback mode.
 
  Then run the randwrite stress test more than 2 times. In my case it is mostly
  OSD 22 that crashes.
 
 # fio --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G
 --numjobs=50 --runtime=90 --group_reporting --name=file1; fio
 --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G
 --numjobs=50 --runtime=90 --group_reporting --name=file1; fio
 --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G
 --numjobs=50 --runtime=90 --group_reporting --name=file1; halt
 
 Strangely exactly THIS OSD also has the most log entries:
 64K ceph-osd.20.log
 64K ceph-osd.21.log
  1,3M ceph-osd.22.log
 64K ceph-osd.23.log
 
 But all OSDs are set to debug osd = 20.
 
 dmesg shows:
 ceph-osd[5381]: segfault at 3f592c000 ip 7fa281d8eb23 sp
 7fa27702d260 error 4 in libtcmalloc.so.0.0.0[7fa281d6a000+3d000]
 
 I uploaded the following files:
 priebe_fio_randwrite_ceph-osd.21.log.bz2 =  OSD which was OK and didn't
 crash
 priebe_fio_randwrite_ceph-osd.22.log.bz2 =  Log from the crashed OSD
 priebe_fio_randwrite_core.ssdstor001.27204.bz2 =  Core dump
 priebe_fio_randwrite_ceph-osd.bz2 =  osd binary
 
 Stefan
 


Re: Ceph as a NOVA-INST-DIR/instances/ storage backend

2012-06-25 Thread Florian Haas
On Mon, Jun 25, 2012 at 6:03 PM, Tommi Virtanen t...@inktank.com wrote:
 On Sat, Jun 23, 2012 at 11:42 AM, Igor Laskovy igor.lask...@gmail.com wrote:
 Hi all from hot Kiev))

 Does anybody use Ceph as a backend storage for NOVA-INST-DIR/instances/ ?

Yes. http://www.sebastien-han.fr/blog/2012/06/10/introducing-ceph-to-openstack/
Look at the "Live Migration with CephFS" part.

 Is it in production use?

Production use would require CephFS to be production ready, which at
this point it isn't.

 Live migration is still possible?

Yes.

 I kindly ask any advice of best practices point of view.

 That's the shared NFS mount style for storing images, right? While you
 could use the Ceph Distributed File System for that, there's a better
 answer (for both Nova and Glance): RBD.

... which sort of goes hand-in-hand with boot from volume, which was
just recently documented in the Nova admin guide, so you may want to
take a look: 
http://docs.openstack.org/trunk/openstack-compute/admin/content/boot-from-volume.html

That being said, volume attachment persistence across live migrations
hasn't always been stellar in Nova, and I'm not 100% sure how well
trunk currently deals with that.

Cheers,
Florian


Re: FS / Kernel question choosing the correct kernel version

2012-06-25 Thread Sage Weil
On Sat, 23 Jun 2012, Stefan Priebe wrote:
 Hi,
 
 i got stuck while selecting the right FS for ceph / RBD.
 
 XFS:
 - deadlock / hung task under 3.0.34 in xfs_ilock / xfs_buf_lock while syncfs

There was an ilock fix that went into 3.4, IIRC.  Have you tried vanilla 
3.4?  We are seeing some lockdep noise currently, but no deadlocks yet.

 - under 3.5-rc3 all my machines got loaded doing nothing but waiting for XFS
 / SSDs, so ceph is really slow / unusable

 btrfs:
 - 3.5-rc3: ceph is pretty fast and works well until I also see a deadlock while
 doing heavy random I/O in my rbd / kvm guest.
 
 All processes hang in btrfs_commit_transaction or
 btrfs_commit_transaction_async

We haven't seen this yet.  See my other reply; a task dump may offer some 
clues.

 Are there tested / recommended kernel versions for rbd and a specific fs?

Lockdep noise aside, we've been fine with 3.4 for btrfs and xfs so far.  
Our regression testing hardware is probably not as fast as yours, though, 
which may explain why our qa hasn't hit the same bugs.

Can you be more specific about how you're generating the rbd workload?

sage