[PATCH] rbd: fix the memory leak of bio_chain_clone
Signed-off-by: Guangliang Zhao <gz...@suse.com>
---
 drivers/block/rbd.c |   10 ++++------
 1 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 65665c9..3d6dfc8 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -719,8 +719,6 @@ static struct bio *bio_chain_clone(struct bio **old, struct bio **next,
 			goto err_out;
 
 		if (total + old_chain->bi_size > len) {
-			struct bio_pair *bp;
-
 			/*
 			 * this split can only happen with a single paged bio,
 			 * split_bio will BUG_ON if this is not the case
@@ -732,13 +730,13 @@ static struct bio *bio_chain_clone(struct bio **old, struct bio **next,
 			/* split the bio. We'll release it
 			 * either in the next call, or it will have to be
 			 * released outside */
-			bp = bio_split(old_chain, (len - total) / SECTOR_SIZE);
-			if (!bp)
+			*bp = bio_split(old_chain, (len - total) / SECTOR_SIZE);
+			if (!*bp)
 				goto err_out;
 
-			__bio_clone(tmp, &bp->bio1);
+			__bio_clone(tmp, &(*bp)->bio1);
 
-			*next = &bp->bio2;
+			*next = &(*bp)->bio2;
 		} else {
 			__bio_clone(tmp, old_chain);
 			*next = old_chain->bi_next;
-- 
1.7.3.4
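For context, a minimal caller-side sketch of why this fixes the leak (hypothetical, not part of the posted patch; the hunks imply bio_chain_clone() now takes a struct bio_pair **bp argument, so the split pair is owned, and eventually released, by the caller instead of going out of scope inside the callee):

	struct bio_pair *bp = NULL;
	struct bio *next = NULL;
	struct bio *clone;

	/* assumed signature after the patch: (old, next, bp, len, gfpmask) */
	clone = bio_chain_clone(&old_chain, &next, &bp, len, gfpmask);
	if (!clone)
		return NULL;

	/* ... hand the cloned chain off to the request ... */

	if (bp)
		bio_pair_release(bp);	/* caller-side release closes the leak */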
radosgw forgetting subuser permissions when creating a fresh key
Hi everyone,

I wonder if this is intentional: when I create a new Swift key for an existing subuser which has previously been assigned full-control permissions, those permissions appear to get lost upon key creation.

# radosgw-admin subuser create --uid=johndoe --subuser=johndoe:swift --access=full
{ "user_id": "johndoe",
  "rados_uid": 0,
  "display_name": "John Doe",
  "email": "j...@example.com",
  "suspended": 0,
  "subusers": [
        { "id": "johndoe:swift",
          "permissions": "full-control"}],
  "keys": [
        { "user": "johndoe",
          "access_key": "QFAMEDSJP5DEKJO0DDXY",
          "secret_key": "iaSFLDVvDdQt6lkNzHyW4fPLZugBAI1g17LO0+87"}],
  "swift_keys": []}

Note "permissions": "full-control".

# radosgw-admin key create --subuser=johndoe:swift --key-type=swift
{ "user_id": "johndoe",
  "rados_uid": 0,
  "display_name": "John Doe",
  "email": "j...@example.com",
  "suspended": 0,
  "subusers": [
        { "id": "johndoe:swift",
          "permissions": "none"}],
  "keys": [
        { "user": "johndoe",
          "access_key": "QFAMEDSJP5DEKJO0DDXY",
          "secret_key": "iaSFLDVvDdQt6lkNzHyW4fPLZugBAI1g17LO0+87"}],
  "swift_keys": [
        { "user": "johndoe:swift",
          "secret_key": "E9T2rUZNu2gxUjcwUBO8n\/Ev4KX6\/GprEuH4qhu1"}]}

Note that while there is now a key, the permissions are gone ("permissions": "none"). Is this meant to be a security feature of sorts, or is this a bug? "subuser modify" can obviously restore the permissions, but having to do that seems less than desirable.

Cheers,
Florian
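As Florian notes, a "subuser modify" puts the permissions back; for example, reusing the same flags as the original create:

# radosgw-admin subuser modify --uid=johndoe --subuser=johndoe:swift --access=full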
Re: filestore flusher = false , correct my problem of constant write (need info on this parameter)
Hi, maybe this can help: I have tuned "filestore queue max ops = 5" and now I'm able to achieve 4000 io/s (with some spikes) with:

- 3 nodes with 1 osd each (1 x 15k drive per osd), journal on tmpfs, or
- 3 nodes with 5 osds each (1 x 15k drive per osd), journal on tmpfs.

Same result for both configurations.

----- Original Message -----
From: Alexandre DERUMIER <aderum...@odiso.com>
To: Sage Weil <s...@inktank.com>
Cc: Mark Nelson <mark.nel...@inktank.com>, ceph-devel@vger.kernel.org, Stefan Priebe <s.pri...@profihost.ag>
Sent: Sunday, June 24, 2012 10:10:48
Subject: Re: filestore flusher = false , correct my problem of constant write (need info on this parameter)

ok, I have done tests with more than one client: 3 kvm guests on 3 different kvm host servers, and 3 kvm guests on the same server. I get the same result, around 2000 io/s shared between the clients, so it doesn't scale.

I have also tried with 3 nodes x 5 osds (15k drive + tmpfs journal each) and with 3 nodes x 1 osd (hardware raid0 with 5 x 15k disks + 1 tmpfs journal); the results are the same, around 2000 io/s. But if I try 3 nodes x 1 osd with a single 15k drive, I get around 500 io/s. I also know that Stefan Priebe has achieved around 12000 io/s with ssds in the osds, so it seems related to osd drive speed. So, are we sure that the journal is acking to the client before flushing to disk?

The benchmark used is fio with direct I/O, writing 100MB of 4K blocks (so the journal is big enough to handle all the writes):

fio --filename=[disk] --direct=1 --rw=randwrite --bs=4k --size=100M --numjobs=50 --runtime=30 --group_reporting --name=file1

----- Original Message -----
From: Alexandre DERUMIER <aderum...@odiso.com>
To: Sage Weil <s...@inktank.com>
Cc: Mark Nelson <mark.nel...@inktank.com>, ceph-devel@vger.kernel.org, Stefan Priebe <s.pri...@profihost.ag>
Sent: Saturday, June 23, 2012 20:21:05
Subject: Re: filestore flusher = false , correct my problem of constant write (need info on this parameter)

> Is that 2000 ios from a single client? You might try multiple clients
> and see if the sum of the ios will scale any higher.

Yes, from a single client (qemu-kvm guest). Tomorrow I'll retry with 3 qemu-kvm guests, on the same host and on 3 different hosts. I'll also try on a machine with a bigger cpu, to compare (I see a lot of cpu used by my kvm guest process, more than with iscsi). I'll keep you in touch.

Thanks
Alexandre

----- Original Message -----
From: Sage Weil <s...@inktank.com>
To: Alexandre DERUMIER <aderum...@odiso.com>
Cc: Mark Nelson <mark.nel...@inktank.com>, ceph-devel@vger.kernel.org, Stefan Priebe <s.pri...@profihost.ag>
Sent: Saturday, June 23, 2012 20:12:49
Subject: Re: filestore flusher = false , correct my problem of constant write (need info on this parameter)

On Sat, 23 Jun 2012, Alexandre DERUMIER wrote:
>> I was just talking with Elder on IRC yesterday about looking into how
>> much small network transfers are hurting us in cases like these. Even
>> with SSD based OSDs I haven't seen a very dramatic improvement in small
>> request performance. How tough would it be to aggregate requests into
>> larger network transactions? There would be a latency penalty of
>> course, but we could flush a client side dirty cache pretty quickly and
>> still benefit if we are getting bombarded with lots of tiny requests.
>
> Yes, I see no improvement with journal on tmpfs ... this is strange.
> Also, I have tried with rbd_cache=true, so ios should already be
> aggregated into bigger transactions. But I didn't see any improvement;
> I'm around 2000 ios. Do you know what the bottleneck is? The rbd
> protocol (some kind of per-io overhead)?

Is that 2000 ios from a single client? You might try multiple clients and see if the sum of the ios will scale any higher. That will tell us whether it is in the messenger or the osd request pipeline. The latter definitely needs some work, although there may be a quick fix to the msgr that will buy us some too.

sage

----- Original Message -----
From: Mark Nelson <mark.nel...@inktank.com>
To: Sage Weil <s...@inktank.com>
Cc: Alexandre DERUMIER <aderum...@odiso.com>, ceph-devel@vger.kernel.org, Stefan Priebe <s.pri...@profihost.ag>
Sent: Saturday, June 23, 2012 18:40:28
Subject: Re: filestore flusher = false , correct my problem of constant write (need info on this parameter)

On 6/23/12 10:38 AM, Sage Weil wrote:
> On Fri, 22 Jun 2012, Alexandre DERUMIER wrote:
>> Hi Sage, thanks for your response.
>>
>>> If you turn off the journal completely, you will see bursty write
>>> commits from the perspective of the client, because the OSD is
>>> periodically doing a sync or snapshot and only acking the writes then.
>>> If you enable the journal, the OSD will reply with a commit as soon as
>>> the write is stable in the journal. That's one reason why it is
>>> there--file system commits are heavyweight and slow.
>>
>> Yes of course, I don't want to deactivate the journal; using a journal
>> on a fast ssd or nvram is the right way. If we left the file system to its
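For reference, the "filestore queue max ops" tuning mentioned at the top of this thread is a ceph.conf setting; a minimal sketch of where it goes (5 is a deliberately low cap on how many operations the filestore will queue before throttling new ones, which is presumably what smooths out the write behavior discussed here):

[osd]
        filestore queue max ops = 5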
Qemu fails to open RBD image when auth_supported is not set to 'none'
Hi,

I just tried to start a VM with libvirt with the following disk definition:

    <disk type='network' device='disk'>
      <driver name='qemu' type='raw' cache='none'/>
      <source protocol='rbd' name='rbd/8489c04f-aab8-4796-a22a-ebaa7be247a7'>
        <host name='31.25.XX.XX' port='6789'/>
      </source>
      <target dev='vda' bus='virtio'/>
    </disk>

That fails with: Operation not supported

I tried qemu-img:

qemu-img info rbd:rbd/8489c04f-aab8-4796-a22a-ebaa7be247a7:mon_host=31.25.XX.XX\\:6789

Same result. I then tried:

qemu-img info rbd:rbd/8489c04f-aab8-4796-a22a-ebaa7be247a7:mon_host=31.25.XX.XX\\:6789:auth_supported=none

And that worked :)

This host does not have a local ceph.conf; all the parameters have to come from the command line. I know that auth_supported recently started defaulting to cephx, but that now breaks the libvirt integration, since libvirt doesn't explicitly set auth_supported to none when no auth section is present. Should this be something that gets fixed in librados or in libvirt? If it's libvirt, I'll write a patch for it :)

Wido
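For comparison, when cephx is actually in use, libvirt wants an explicit <auth> element inside the disk definition. A sketch (the username and secret uuid here are hypothetical; the uuid must reference a libvirt secret holding the cephx key, created beforehand with virsh secret-define):

    <disk type='network' device='disk'>
      <driver name='qemu' type='raw' cache='none'/>
      <auth username='admin'>
        <secret type='ceph' uuid='2a5b08e4-3dca-4ff9-9a1d-40389758d081'/>
      </auth>
      <source protocol='rbd' name='rbd/8489c04f-aab8-4796-a22a-ebaa7be247a7'>
        <host name='31.25.XX.XX' port='6789'/>
      </source>
      <target dev='vda' bus='virtio'/>
    </disk>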
Re: Qemu fails to open RBD image when auth_supported is not set to 'none'
On 06/25/2012 05:20 PM, Wido den Hollander wrote:
> Hi,
>
> I just tried to start a VM with libvirt with the following disk definition:
>
>     <disk type='network' device='disk'>
>       <driver name='qemu' type='raw' cache='none'/>
>       <source protocol='rbd' name='rbd/8489c04f-aab8-4796-a22a-ebaa7be247a7'>
>         <host name='31.25.XX.XX' port='6789'/>
>       </source>
>       <target dev='vda' bus='virtio'/>
>     </disk>
>
> That fails with: Operation not supported
>
> [...]
>
> Should this be something that gets fixed in librados or in libvirt?

Thought about it; this is something in libvirt :)

> If it's libvirt, I'll write a patch for it :)

Just did so, very simple patch: https://www.redhat.com/archives/libvir-list/2012-June/msg01119.html

Wido
Re: Unable to restart Mon after reboot
On Sat, Jun 23, 2012 at 3:43 AM, David Blundell <david.blund...@100percentit.com> wrote:
> The logs on all three servers are full of messages like:
> Jun 23 04:02:19 Store2 kernel: [63811.494955] ceph-osd: page allocation failure: order:3, mode:0x4020
> The difference between the lines is that order: varies between 2, 3, 4 or 5
> Is this likely to be a btrfs bug?

That means you're running out of memory, in kernelspace. The order is the power of two (2**n) of how many 4kB pages were requested as one contiguous block, so order:3 is a 2**3 x 4kB = 32kB allocation. mode 0x4020 is GFP_COMP|GFP_HIGH: a compound (multi-page) allocation that is allowed to dip into the emergency pools.

Btrfs may be indirectly related; it's not clear what's consuming all the memory, but that doesn't sound all that likely. That message should be followed by a stack dump, which might tell us more.

Are you using the Ceph distributed filesystem, or just the RADOS level, e.g. RBD images?
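If you want to see whether the machine has simply fragmented itself out of high-order pages, /proc/buddyinfo is a quick check; each column is the count of free contiguous blocks of order 0, 1, 2, ... per memory zone:

cat /proc/buddyinfo

Zeros in the order-3-and-up columns around the time of the failure would point at fragmentation or plain memory exhaustion rather than a btrfs bug.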
Re: Ceph as a NOVA-INST-DIR/instances/ storage backend
On Sat, Jun 23, 2012 at 11:42 AM, Igor Laskovy <igor.lask...@gmail.com> wrote:
> Hi all from hot Kiev))
> Does anybody use Ceph as a backend storage for NOVA-INST-DIR/instances/ ?
> Is it in production use? Is live migration still possible? I kindly ask
> for any advice from a best-practices point of view.

That's the shared-NFS-mount style for storing images, right? While you could use the Ceph Distributed File System for that, there's a better answer (for both Nova and Glance): RBD. Our docs are still being worked on, but you can see the current state at http://ceph.com/docs/master/rbd/rbd-openstack/
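For reference, a minimal sketch of what the RBD route looks like, based on the doc linked above; the exact option names assume the Essex-era integration and the pool/user names here are illustrative, so check them against that page:

# glance-api.conf: store images in RBD (assumes a pool 'images'
# and a cephx user 'glance' already exist)
default_store = rbd
rbd_store_user = glance
rbd_store_pool = images

# nova.conf: serve volumes out of RBD (assumes a pool 'volumes')
volume_driver = nova.volume.driver.RBDDriver
rbd_pool = volumes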
Re: reproducable osd crash
I've yet to make the core match the binary.

On Jun 22, 2012, at 11:32 PM, Stefan Priebe <s.pri...@profihost.ag> wrote:
> Thanks, did you find anything?

On 23.06.2012 at 01:59, Sam Just <sam.j...@inktank.com> wrote:
> I am still looking into the logs.
> -Sam

On Fri, Jun 22, 2012 at 3:56 PM, Dan Mick <dan.m...@inktank.com> wrote:
> Stefan, I'm looking at your logs and coredump now.

On 06/21/2012 11:43 PM, Stefan Priebe wrote:
> Does anybody have an idea? This is right now a showstopper for me.

On 21.06.2012 at 14:55, Stefan Priebe - Profihost AG <s.pri...@profihost.ag> wrote:
> Hello list,
>
> I'm able to reproducibly crash osd daemons. How I can reproduce:
>
> Kernel: 3.5.0-rc3
> Ceph: 0.47.3
> FS: btrfs
> Journal: 2GB tmpfs per OSD
> OSD: 3x servers with 4x Intel SSD OSDs each
> 10GBE Network
> rbd_cache_max_age: 2.0
> rbd_cache_size: 33554432
> Disk is set to writeback.
>
> Start a KVM VM via PXE with the disk attached in writeback mode, then run
> the randwrite stress more than 2 times. Mostly it's OSD 22 that crashes
> in my case.
>
> # fio --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G --numjobs=50 --runtime=90 --group_reporting --name=file1; fio --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G --numjobs=50 --runtime=90 --group_reporting --name=file1; fio --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G --numjobs=50 --runtime=90 --group_reporting --name=file1; halt
>
> Strangely, exactly THIS OSD also has the most log entries:
>
> 64K  ceph-osd.20.log
> 64K  ceph-osd.21.log
> 1.3M ceph-osd.22.log
> 64K  ceph-osd.23.log
>
> But all OSDs are set to debug osd = 20.
>
> dmesg shows:
> ceph-osd[5381]: segfault at 3f592c000 ip 7fa281d8eb23 sp 7fa27702d260 error 4 in libtcmalloc.so.0.0.0[7fa281d6a000+3d000]
>
> I uploaded the following files:
> priebe_fio_randwrite_ceph-osd.21.log.bz2 = OSD which was OK and didn't crash
> priebe_fio_randwrite_ceph-osd.22.log.bz2 = Log from the crashed OSD
> priebe_fio_randwrite_core.ssdstor001.27204.bz2 = Core dump
> priebe_fio_randwrite_ceph-osd.bz2 = osd binary
>
> Stefan
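For anyone following along, a quick way to check whether a core dump and a binary belong together (a sketch, using the uploaded file names after bunzip2):

# show which command produced the core
file priebe_fio_randwrite_core.ssdstor001.27204

# load the core against the uploaded binary; gdb prints a warning if
# the executable does not appear to match the core
gdb ./ceph-osd priebe_fio_randwrite_core.ssdstor001.27204
(gdb) bt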
Re: Ceph as a NOVA-INST-DIR/instances/ storage backend
On Mon, Jun 25, 2012 at 6:03 PM, Tommi Virtanen <t...@inktank.com> wrote:
> On Sat, Jun 23, 2012 at 11:42 AM, Igor Laskovy <igor.lask...@gmail.com> wrote:
>> Hi all from hot Kiev))
>> Does anybody use Ceph as a backend storage for NOVA-INST-DIR/instances/ ?

Yes: http://www.sebastien-han.fr/blog/2012/06/10/introducing-ceph-to-openstack/ (look at the "Live Migration with CephFS" part).

>> Is it in production use?

Production use would require CephFS to be production ready, which at this point it isn't.

>> Live migration is still possible?

Yes.

>> I kindly ask any advice of best practices point of view.
>
> That's the shared NFS mount style for storing images, right? While you
> could use the Ceph Distributed File System for that, there's a better
> answer (for both Nova and Glance): RBD.

... which sort of goes hand in hand with boot-from-volume, which was just recently documented in the Nova admin guide, so you may want to take a look: http://docs.openstack.org/trunk/openstack-compute/admin/content/boot-from-volume.html

That being said, volume-attachment persistence across live migrations hasn't always been stellar in Nova, and I'm not 100% sure how well trunk currently deals with that.

Cheers,
Florian
Re: FS / Kernel question choosing the correct kernel version
On Sat, 23 Jun 2012, Stefan Priebe wrote:
> Hi,
> I got stuck while selecting the right FS for ceph / RBD.
>
> XFS:
> - deadlock / hung task under 3.0.34 in xfs_ilock / xfs_buf_lock while syncfs

There was an ilock fix that went into 3.4, IIRC. Have you tried vanilla 3.4? We are seeing some lockdep noise currently, but no deadlocks yet.

> - under 3.5-rc3 all my machines got loaded doing nothing but waiting for
>   XFS / SSDs, so ceph is really slow / unusable
>
> btrfs:
> - under 3.5-rc3 ceph is pretty fast and works well, until I also see a
>   deadlock while doing heavy random I/O in my rbd / kvm. All processes
>   hang in btrfs_commit_transaction or btrfs_commit_transaction_async

We haven't seen this yet. See my other reply; a task dump may offer some clues.

> Are there tested / recommended kernel versions for rbd and a specific fs?

Lockdep noise aside, we've been fine with 3.4 for btrfs and xfs so far. Our regression-testing hardware is probably not as fast as yours, though, which may explain why our qa hasn't hit the same bugs. Can you be more specific about how you're generating the rbd workload?

sage
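For reference, the task dump Sage mentions can be captured via sysrq, assuming sysrq is enabled in the running kernel:

# dump only tasks in uninterruptible (blocked) state -- usually the
# interesting ones when processes hang in btrfs_commit_transaction
echo w > /proc/sysrq-trigger

# or dump the state of every task
echo t > /proc/sysrq-trigger

# the dump lands in the kernel log
dmesg | tail -n 200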