Re: qemu-1.7.0 and internal snapshot, Was: qemu-1.5.0 savevm error -95 while writing vm with ceph-rbd as storage-backend
Hi Wido,

On 12/20/2013 08:06 AM, Wido den Hollander wrote:
On 12/17/2013 05:07 PM, Oliver Francke wrote:

Hi Alexandre and Wido ;)

Well, I know this is a pretty old question... but I saw some comments from you, Wido, as well as a most recent patch for qemu-1.7.0 in git.proxmox ("internal snapshot async port to qemu 1.7 v5").

What I currently did: applied modify-query-machines.patch and internal-snapshot-async.patch to the qemu-1.7.0 sources; all hunks succeed. Now, after talking to QMP with:

{ "execute": "savevm-start", "arguments": { "statefile": "rbd:123/905.save1" } }

or with a local file, it spits out:

qemu-system-x86_64: block.c:4430: bdrv_set_in_use: Assertion `bs->in_use != in_use' failed

*sigh*. qemu is started with some parameters... and finally the drive-specific ones:

-device virtio-blk-pci,drive=virtio0 -drive format=raw,file=rbd:123/vm-905-disk-.rbd:rbd_cache=true:rbd_cache_size=33554432:rbd_cache_max_dirty=16777216:rbd_cache_target_dirty=8388608,cache=writeback,if=none,id=virtio0,media=disk,index=0

Did I miss a relevant point? What would be the correct strategy?

I haven't tested this recently, so I'm not sure if this should already work. It would be great if it did, but I'm not aware of it.

Unfortunately I didn't give it a try at the time Alexandre first mentioned it. This functionality should definitely make it into some next qemu version. If you get it to work with the current 1.7.0, I would appreciate any more input ;)

Regards, Oliver.

Wido

Thanks in advance and kind regards, Oliver.

P.S.: I don't use libvirt nor proxmox as a complete system.

On 05/24/2013 10:57 PM, Oliver Francke wrote:

Hi Alexandre,

Am 24.05.2013 um 17:37 schrieb Alexandre DERUMIER aderum...@odiso.com:

Hi, for Proxmox we have made some patches to split the savevm process, to be able to save the memory to an external volume (and not the current volume). For rbd, we create a new rbd volume to store the memory.
The qemu patch is here: https://git.proxmox.com/?p=pve-qemu-kvm.git;a=blob;f=debian/patches/internal-snapshot-async.patch;h=c67a97ea497fe31ff449acb79e04dc1c53b25578;hb=HEAD

- Mail original -

Wow, sounds very interesting. Being on the road for the next 3 days, I will have a closer look next week. Thanks and regards, Oliver.

From: Wido den Hollander w...@42on.com
To: Oliver Francke oliver.fran...@filoo.de
Cc: ceph-devel@vger.kernel.org
Sent: Friday, 24 May 2013 17:08:35
Subject: Re: qemu-1.5.0 savevm error -95 while writing vm with ceph-rbd as storage-backend

On 05/24/2013 09:46 AM, Oliver Francke wrote:

Hi, with a running VM I encounter this strange behaviour; former qemu versions don't show such an error. Perhaps this comes from the rbd backend in qemu-1.5.0 in combination with ceph-0.56.6? Therefore my crosspost. Even if I have no real live-snapshot available - the customers know of this restriction - it's more work for them to perform a shutdown before they wanna do some changes to their VM ;)

Doesn't Qemu try to save the memory state to RBD here as well? That doesn't work and fails.

Any hints welcome, Oliver.

--
Wido den Hollander
42on B.V.
Phone: +31 (0)20 700 9902
Skype: contact42on

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org. More majordomo info at http://vger.kernel.org/majordomo-info.html

--
Oliver Francke
filoo GmbH
Moltkestraße 25a
0 Gütersloh
HRB4355 AG Gütersloh
Geschäftsführer: J.Rehpöhler | C.Kunz
Folgen Sie uns auf Twitter: http://twitter.com/filoogmbh
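For reference, the QMP exchange from the report can be sketched from a shell. This is a hedged example: `savevm-start` only exists with the Proxmox internal-snapshot-async patch applied, and the monitor socket path in the comment is an assumption.

```shell
# Build the QMP command sequence for the patched 'savevm-start' command.
# A QMP session must begin with a capabilities handshake before any command
# is accepted; otherwise every command fails with CommandNotFound.
qmp_savevm_cmds() {
    printf '%s\n' \
        '{"execute": "qmp_capabilities"}' \
        '{"execute": "savevm-start", "arguments": {"statefile": "rbd:123/905.save1"}}'
}

qmp_savevm_cmds
# Pipe these lines into the running VM's monitor socket, e.g.:
#   qmp_savevm_cmds | socat - UNIX-CONNECT:/var/run/qemu/vm-905.qmp   # assumed path
```

The statefile argument is the rbd path from the report; with the patch applied, this is where the memory state should land instead of the boot volume.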
RBD/qemu-1.7.0 memory leak with drive_mirror/live-migration
Hi *,

I just tried a feature for live-migration via:

drive_mirror -f virtio0 rbd:123123/virtio0 rbd

in a qemu monitoring session. One immediately sees some memory consumption... never getting released again. Additionally, the block job never ends, even after X/X bytes completed. You can cancel the job; the memory stays occupied, and after another try even more RSS memory gets filled. The same procedure with qcow2 does not need any more memory, and the job gets cleared after completion.

Possibly @Josh: any idea? Logfiles with debug_what=? needed? ;)

Thanks in advance, Oliver.
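To pin down a leak like this, one can sample the VM process's resident set while the mirror job runs. A minimal sketch (Linux /proc only; the pid, sample count, and interval are whatever you pass in):

```shell
# Sample the resident set size (VmRSS) of a process a few times; a steadily
# growing value during a drive_mirror job points at the leak described above.
watch_rss() {
    # $1 = pid, $2 = number of samples, $3 = seconds between samples
    i=0
    while [ "$i" -lt "$2" ]; do
        awk '/^VmRSS:/ {print $2, $3}' "/proc/$1/status"
        sleep "$3"
        i=$((i + 1))
    done
}

watch_rss $$ 1 0   # demo: one sample of this shell's own RSS
```

Run against the qemu pid with, say, 60 samples at 10-second intervals while the mirror is active; a monotonically rising series that survives `block_job_cancel` matches the behaviour reported here.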
qemu-1.7.0 and internal snapshot, Was: qemu-1.5.0 savevm error -95 while writing vm with ceph-rbd as storage-backend
Hi Alexandre and Wido ;)

Well, I know this is a pretty old question... but I saw some comments from you, Wido, as well as a most recent patch for qemu-1.7.0 in git.proxmox ("internal snapshot async port to qemu 1.7 v5").

What I currently did: applied modify-query-machines.patch and internal-snapshot-async.patch to the qemu-1.7.0 sources; all hunks succeed. Now, after talking to QMP with:

{ "execute": "savevm-start", "arguments": { "statefile": "rbd:123/905.save1" } }

or with a local file, it spits out:

qemu-system-x86_64: block.c:4430: bdrv_set_in_use: Assertion `bs->in_use != in_use' failed

*sigh*. qemu is started with some parameters... and finally the drive-specific ones:

-device virtio-blk-pci,drive=virtio0 -drive format=raw,file=rbd:123/vm-905-disk-.rbd:rbd_cache=true:rbd_cache_size=33554432:rbd_cache_max_dirty=16777216:rbd_cache_target_dirty=8388608,cache=writeback,if=none,id=virtio0,media=disk,index=0

Did I miss a relevant point? What would be the correct strategy?

Thanks in advance and kind regards, Oliver.

P.S.: I don't use libvirt nor proxmox as a complete system.

On 05/24/2013 10:57 PM, Oliver Francke wrote:

Hi Alexandre,

Am 24.05.2013 um 17:37 schrieb Alexandre DERUMIER aderum...@odiso.com:

Hi, for Proxmox we have made some patches to split the savevm process, to be able to save the memory to an external volume (and not the current volume). For rbd, we create a new rbd volume to store the memory. The qemu patch is here: https://git.proxmox.com/?p=pve-qemu-kvm.git;a=blob;f=debian/patches/internal-snapshot-async.patch;h=c67a97ea497fe31ff449acb79e04dc1c53b25578;hb=HEAD

- Mail original -

Wow, sounds very interesting. Being on the road for the next 3 days, I will have a closer look next week. Thanks and regards, Oliver.
From: Wido den Hollander w...@42on.com
To: Oliver Francke oliver.fran...@filoo.de
Cc: ceph-devel@vger.kernel.org
Sent: Friday, 24 May 2013 17:08:35
Subject: Re: qemu-1.5.0 savevm error -95 while writing vm with ceph-rbd as storage-backend

On 05/24/2013 09:46 AM, Oliver Francke wrote:

Hi, with a running VM I encounter this strange behaviour; former qemu versions don't show such an error. Perhaps this comes from the rbd backend in qemu-1.5.0 in combination with ceph-0.56.6? Therefore my crosspost. Even if I have no real live-snapshot available - the customers know of this restriction - it's more work for them to perform a shutdown before they wanna do some changes to their VM ;)

Doesn't Qemu try to save the memory state to RBD here as well? That doesn't work and fails.

Any hints welcome, Oliver.
Re: qemu-1.5.0 savevm error -95 while writing vm with ceph-rbd as storage-backend
Hi Alexandre, Josh,

Sorry for coming back so very late. I tried the patch and, though I could not get it to work properly (very likely my fault), how would it be to integrate it into the rbd handler of qemu? Josh? I think you are CC'd from another qemu ticket anyway? I could just ignore the EOPNOTSUPP or whatever it's called, but a smooth integration of live snapshots would be so cool ;)

Kind regards, Oliver.

On 05/24/2013 05:37 PM, Alexandre DERUMIER wrote:

Hi, for Proxmox we have made some patches to split the savevm process, to be able to save the memory to an external volume (and not the current volume). For rbd, we create a new rbd volume to store the memory. The qemu patch is here: https://git.proxmox.com/?p=pve-qemu-kvm.git;a=blob;f=debian/patches/internal-snapshot-async.patch;h=c67a97ea497fe31ff449acb79e04dc1c53b25578;hb=HEAD

- Mail original -

From: Wido den Hollander w...@42on.com
To: Oliver Francke oliver.fran...@filoo.de
Cc: ceph-devel@vger.kernel.org
Sent: Friday, 24 May 2013 17:08:35
Subject: Re: qemu-1.5.0 savevm error -95 while writing vm with ceph-rbd as storage-backend

On 05/24/2013 09:46 AM, Oliver Francke wrote:

Hi, with a running VM I encounter this strange behaviour; former qemu versions don't show such an error. Perhaps this comes from the rbd backend in qemu-1.5.0 in combination with ceph-0.56.6? Therefore my crosspost. Even if I have no real live-snapshot available - the customers know of this restriction - it's more work for them to perform a shutdown before they wanna do some changes to their VM ;)

Doesn't Qemu try to save the memory state to RBD here as well? That doesn't work and fails.

Any hints welcome, Oliver.
--
Oliver Francke
filoo GmbH
Moltkestraße 25a
0 Gütersloh
HRB4355 AG Gütersloh
Geschäftsführer: S.Grewing | J.Rehpöhler | C.Kunz
Folgen Sie uns auf Twitter: http://twitter.com/filoogmbh
qemu-1.5.0 savevm error -95 while writing vm with ceph-rbd as storage-backend
Hi,

With a running VM I encounter this strange behaviour; former qemu versions don't show such an error. Perhaps this comes from the rbd backend in qemu-1.5.0 in combination with ceph-0.56.6? Therefore my crosspost. Even if I have no real live-snapshot available - the customers know of this restriction - it's more work for them to perform a shutdown before they wanna do some changes to their VM ;)

Any hints welcome, Oliver.
Re: qemu-1.5.0 savevm error -95 while writing vm with ceph-rbd as storage-backend
Well,

On 05/24/2013 05:08 PM, Wido den Hollander wrote:
On 05/24/2013 09:46 AM, Oliver Francke wrote:

Hi, with a running VM I encounter this strange behaviour; former qemu versions don't show such an error. Perhaps this comes from the rbd backend in qemu-1.5.0 in combination with ceph-0.56.6? Therefore my crosspost. Even if I have no real live-snapshot available - the customers know of this restriction - it's more work for them to perform a shutdown before they wanna do some changes to their VM ;)

Doesn't Qemu try to save the memory state to RBD here as well? That doesn't work and fails.

True, but qemu should try the same in version 1.4.x, too, and there it succeeds ;) I tried to figure out some corresponding QMP commands and found some references like:

{ "execute": "snapshot-create", "arguments": { "name": "vm_before_upgrade" } }

but that fails with "unknown command". The convenience of a monitoring command would be that qemu takes care of all attached block devices. Well, it _should_ exclude devices like ide-cd0, where I get another error with an inserted CD, but... that is a different error.

Oliver.

Any hints welcome, Oliver.
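For what it's worth, `snapshot-create` never made it into mainline QMP, which would explain the "unknown command" reply; the external-snapshot command mainline qemu 1.5 does ship is `blockdev-snapshot-sync` (disk-only, no memory state). A hedged sketch, where the device id, snapshot file, and socket path are assumptions:

```shell
# QMP sequence for a mainline external disk snapshot: capabilities handshake,
# then 'blockdev-snapshot-sync', which redirects writes for one device to a
# new qcow2 overlay. It does NOT save memory state (that needs savevm).
qmp_snapshot_cmds() {
    printf '%s\n' \
        '{"execute": "qmp_capabilities"}' \
        '{"execute": "blockdev-snapshot-sync", "arguments": {"device": "virtio0", "snapshot-file": "/tmp/vm-before-upgrade.qcow2", "format": "qcow2"}}'
}

qmp_snapshot_cmds
# e.g.: qmp_snapshot_cmds | socat - UNIX-CONNECT:/var/run/qemu/vm.qmp   # assumed path
```

Note it operates on one named device at a time, so the "cares for all mounted block-devs" convenience asked for above would still have to be scripted per device.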
Re: qemu-1.5.0 savevm error -95 while writing vm with ceph-rbd as storage-backend
Hi Alexandre,

Am 24.05.2013 um 17:37 schrieb Alexandre DERUMIER aderum...@odiso.com:

Hi, for Proxmox we have made some patches to split the savevm process, to be able to save the memory to an external volume (and not the current volume). For rbd, we create a new rbd volume to store the memory. The qemu patch is here: https://git.proxmox.com/?p=pve-qemu-kvm.git;a=blob;f=debian/patches/internal-snapshot-async.patch;h=c67a97ea497fe31ff449acb79e04dc1c53b25578;hb=HEAD

- Mail original -

Wow, sounds very interesting. Being on the road for the next 3 days, I will have a closer look next week. Thanks and regards, Oliver.

From: Wido den Hollander w...@42on.com
To: Oliver Francke oliver.fran...@filoo.de
Cc: ceph-devel@vger.kernel.org
Sent: Friday, 24 May 2013 17:08:35
Subject: Re: qemu-1.5.0 savevm error -95 while writing vm with ceph-rbd as storage-backend

On 05/24/2013 09:46 AM, Oliver Francke wrote:

Hi, with a running VM I encounter this strange behaviour; former qemu versions don't show such an error. Perhaps this comes from the rbd backend in qemu-1.5.0 in combination with ceph-0.56.6? Therefore my crosspost. Even if I have no real live-snapshot available - the customers know of this restriction - it's more work for them to perform a shutdown before they wanna do some changes to their VM ;)

Doesn't Qemu try to save the memory state to RBD here as well? That doesn't work and fails.

Any hints welcome, Oliver.
OSD memory leak when scrubbing [0.56.6]
Well, the subject seems familiar; the version was 0.48.3 in the last mail. Some more of the story: before the successful upgrade to latest bobtail, everything with regard to scrubbing was disabled, that is, via:

ceph osd tell \* injectargs '--osd-max-scrubs 0'

We have been running fine since the 9th of May. "Fine" means, though, not having run any scrubbing for ages. This morning I re-started scrubbing. After a couple of hours I detected the first OSDs eating up memory; the top scorer was running with 23 GiB RSS. After stopping scrubbing again, no memory was regained.

Is anyone else, perhaps with large PGs, experiencing such behaviour? Any advice on how to proceed?

Thanks in advance, Oliver.
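A quick way to spot the "top scorer" among the OSD daemons on a node is to list every `ceph-osd` process by resident set size. A small sketch; it assumes only the stock daemon name and Linux `ps`:

```shell
# Print RSS (kB) and pid of each matching process, biggest first, so a
# scrub-induced leak like the 23 GiB case above stands out immediately.
top_rss() {
    # $1 = process name, e.g. ceph-osd
    ps -C "$1" -o rss=,pid= | sort -rn
}

top_rss ceph-osd   # on a storage node; prints nothing if no OSDs run locally
```

Running this before starting scrubs and again a few hours later gives a per-daemon delta, which also tells you which OSD logs to collect.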
Re: OSD memory leak when scrubbing [0.56.6]
Uhm, to be most correct... there was a follow-up even with version 0.56 ;)

On 05/21/2013 05:24 PM, Oliver Francke wrote:

Well, the subject seems familiar; the version was 0.48.3 in the last mail. Some more of the story: before the successful upgrade to latest bobtail, everything with regard to scrubbing was disabled, that is, via:

ceph osd tell \* injectargs '--osd-max-scrubs 0'

We have been running fine since the 9th of May. "Fine" means, though, not having run any scrubbing for ages. This morning I re-started scrubbing. After a couple of hours I detected the first OSDs eating up memory; the top scorer was running with 23 GiB RSS. After stopping scrubbing again, no memory was regained.

Is anyone else, perhaps with large PGs, experiencing such behaviour? Any advice on how to proceed?

Thanks in advance, Oliver.
Re: OSD memory leak when scrubbing [0.56.6]
Right,

On 05/21/2013 05:35 PM, Sylvain Munaut wrote:

Hi,

The subject seems familiar; the version was 0.48.3 in the last mail. Is anyone else, perhaps with large PGs, experiencing such behaviour? Any advice on how to proceed?

I had the same behaviour in both argonaut and bobtail, rising sharply by ~100M or so at each scrub (every 24h). It's now resolved in cuttlefish, AFAICT. However, given the mon leveldb issues I'm having with cuttlefish, I'm not sure I'd recommend upgrading...

Not really, at this moment, because we have an otherwise stable and fast cluster for our VMs ;)

Oliver.

Cheers, Sylvain
Re: OSD memory leak when scrubbing [0.56.6]
Well,

Am 21.05.2013 um 21:31 schrieb Sage Weil s...@inktank.com:

On Tue, 21 May 2013, Stefan Priebe wrote:
Am 21.05.2013 17:44, schrieb Sage Weil:
On Tue, 21 May 2013, Stefan Priebe - Profihost AG wrote:
Am 21.05.2013 um 17:35 schrieb Sylvain Munaut s.mun...@whatever-company.com:

Hi,

The subject seems familiar; the version was 0.48.3 in the last mail. Is anyone else, perhaps with large PGs, experiencing such behaviour? Any advice on how to proceed?

I had the same behaviour in both argonaut and bobtail, rising sharply by ~100M or so at each scrub (every 24h). It's now resolved in cuttlefish, AFAICT. However, given the mon leveldb issues I'm having with cuttlefish, I'm not sure I'd recommend upgrading...

I thought all mon leveldb issues were solved?

Not quite. I'm now able to reproduce the leveldb growth from Mike Dawson's trace (thanks!), but we don't have a fix yet.

Oh, OK. Is there a tracker id?

http://tracker.ceph.com/issues/4895

True for this issue, but back to topic: we are still not able to safely (deep-)scrub the whole cluster with 0.56.6.

I thought the scrub memory issues were addressed by f80f64cf024bd7519d5a1fb2a5698db97a003ce8 in 0.56.4... :(

Any advice very welcome, though about 1/3 of the cluster is safe, in the sense that we have deep-scrubbed it.

Best regards, Oliver.

sage
Re: Latest 0.56.3 and qemu-1.4.0 and cloned VM-image producing massive fs-corruption, not crashing
Hi Josh,

Thanks for the quick response and...

On 03/26/2013 09:30 AM, Josh Durgin wrote:
On 03/25/2013 03:04 AM, Oliver Francke wrote:

Hi Josh, the logfile is attached...

Thanks. It shows nothing out of the ordinary, but I just reproduced the incorrect rollback locally, so it shouldn't be hard to track down from here. I opened http://tracker.ceph.com/issues/4551 to track it.

... the good news. Oliver.

Josh

On 03/22/2013 08:30 PM, Josh Durgin wrote:
On 03/22/2013 12:09 PM, Oliver Francke wrote:

Hi Josh, all,

I did not want to hijack the thread dealing with a crashing VM, but perhaps there are some common things. Today I installed a fresh cluster with mkcephfs (went fine), imported a master Debian 6.0 image with format 2, made a snapshot, protected it, and made some clones. The clones were mounted with qemu-nbd, I fiddled a bit with IP/interfaces/hosts/net.rules… etc. and cleanly unmounted; the VM started, took 2 secs, and was up and running. Cool.

Now an ordinary shutdown was performed and a snapshot of this image made. Started again, did some apt-get update… install s/t…. Shutdown -> rbd rollback -> startup again -> login -> install s/t else… The filesystem showed many ext3 errors and fell into read-only mode: massive corruption.

This sounds like it might be a bug in rollback. Could you try cloning and snapshotting again, but export the image before booting and after rolling back, and compare the md5sums?

Done: first MD5 mismatch after 32 4MB blocks, checked with dd and a bs of 4MB.

Running the rollback with: --debug-ms 1 --debug-rbd 20 --log-file rbd-rollback.log might help too. Does your ceph.conf where you ran the rollback have anything related to rbd_cache in it?

No cache settings in the global ceph.conf. Hope it helps, Oliver.

The qemu config was with :rbd_cache=false, if it matters. The above scenario is reproducible and, as I pointed out, no crash was detected. Perhaps it is in the same area as in the crash thread; otherwise I will provide logfiles as needed.
It's unrelated; the other thread is an issue with the cache, which does not cause corruption but triggers a crash.

Josh
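The comparison Josh asked for ("first MD5 mismatch after 32 4MB blocks, checked with dd") can be automated once the two images are exported, e.g. with `rbd export`. A sketch of the block-level part; the helper and file names are hypothetical, and the demo files are local scratch data rather than real exports:

```shell
# Report the 4 MiB block index of the first byte that differs between two
# image files (prints nothing if they are identical). 'cmp -l' lists every
# differing byte as "<1-based offset> <octal a> <octal b>".
first_bad_block() {
    cmp -l "$1" "$2" | awk 'NR == 1 {print int(($1 - 1) / 4194304); exit}'
}

# demo: two 8 MiB scratch files differing at offset 4 MiB + 10, i.e. block 1
printf 'x' | dd of=a.img bs=1 seek=$((8 * 1024 * 1024)) conv=notrunc 2>/dev/null
cp a.img b.img
printf 'y' | dd of=b.img bs=1 seek=$((4 * 1024 * 1024 + 10)) conv=notrunc 2>/dev/null
first_bad_block a.img b.img
# → 1
```

Against real exports the call would look like `first_bad_block before.img after.img`, with the two files produced before booting the clone and after the rollback; the block index maps directly onto an rbd object of the default 4 MiB order.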
Latest 0.56.3 and qemu-1.4.0 and cloned VM-image producing massive fs-corruption, not crashing
Hi Josh, all,

I did not want to hijack the thread dealing with a crashing VM, but perhaps there are some common things. Today I installed a fresh cluster with mkcephfs (went fine), imported a master Debian 6.0 image with format 2, made a snapshot, protected it, and made some clones. The clones were mounted with qemu-nbd, I fiddled a bit with IP/interfaces/hosts/net.rules… etc. and cleanly unmounted; the VM started, took 2 secs, and was up and running. Cool.

Now an ordinary shutdown was performed and a snapshot of this image made. Started again, did some apt-get update… install s/t…. Shutdown -> rbd rollback -> startup again -> login -> install s/t else… The filesystem showed many ext3 errors and fell into read-only mode: massive corruption.

The qemu config was with :rbd_cache=false, if it matters. The above scenario is reproducible and, as I pointed out, no crash was detected. Perhaps it is in the same area as in the crash thread; otherwise I will provide logfiles as needed.

Kind regards, Oliver.
Re: A couple of OSD-crashes after serious network trouble
Hi Sam,

On 12/13/2012 05:15 AM, Samuel Just wrote:

Apologies, I missed your reply on Monday.

No prob ;) We are busy, too, with preparing new nodes and the switch to 10GE this evening.

Any attempt to read or write the object will hit the file on the primary (the smaller one with the newer syslog entries). If you take down both OSDs (12 and 40) while performing the repair, the VM in question will hang if it tries to access that block, but should recover when you bring the OSDs back up. To expand on the response Sage posted: writes/reads to that block have been hitting the primary (osd.12), which unfortunately is the incorrect file. I would, however, have expected those writes to have been replicated to the larger file on osd.40 as well. Are you certain that the newer syslog entries on 12 aren't also present on 40?

Well... time heals... I re-checked right now, and both files are md5-wise identical?! I have not checked the other 5 inconsistencies. Still having three headers missing, and 6 OSDs not checked with scrub, though. Will be back... for sure ;)

Thanks for now, Oliver.

-Sam

On Tue, Dec 11, 2012 at 11:38 AM, Oliver Francke oliver.fran...@filoo.de wrote:

Hi Sage,

Am 11.12.2012 um 18:04 schrieb Sage Weil s...@inktank.com:
On Tue, 11 Dec 2012, Oliver Francke wrote:

Hi Sam, perhaps you have overlooked my comments further down, beginning with "been there"? ;)

We're pretty swamped with bobtail stuff at the moment, so ceph-devel inquiries are low on the priority list right now.

100% agree; this thing here is best effort right now, true. See below:

If so, please have a look, cause I'm clueless 8-)

On 12/10/2012 11:48 AM, Oliver Francke wrote:

Hi Sam, helpful input... and... not so...

On 12/07/2012 10:18 PM, Samuel Just wrote:

Ah... unfortunately doing a repair in these 6 cases would probably result in the wrong object surviving. It should work, but it might corrupt the rbd image contents. If the images are expendable, you could repair and then delete the images.
The red flag here is that the "known size" is smaller than the other size. This indicates that it most likely chose the wrong file as the correct one, since rbd image blocks usually get bigger over time. To fix this, you will need to manually copy the file for the larger of the two object replicas over the smaller of the two object replicas.

For the first, soid 87c96f10/rb.0.47d9b.1014b7b4.02df/head//65 in pg 65.10:

1) Find the object on the primary and the replica (from above, the primary is 12 and the replica is 40). You can use find in the primary and replica current/65.10_head directories to look for a file matching *rb.0.47d9b.1014b7b4.02df*. The file name should be 'rb.0.47d9b.1014b7b4.02df__head_87C96F10__65', I think.
2) Stop the primary and replica osds.
3) Compare the file sizes of the two files -- you should find that they do not match.
4) Replace the smaller file with the larger one (you'll probably want to keep a copy of the smaller one around, just in case).
5) Restart the osds and scrub pg 65.10 -- the pg should come up clean (possibly with a relatively harmless stat mismatch).

Been there. On osd.12 it's:

-rw-r--r-- 1 root root 699904 Dec 9 06:25 rb.0.47d9b.1014b7b4.02df__head_87C96F10__41

on osd.40:

-rw-r--r-- 1 root root 4194304 Dec 9 06:25 rb.0.47d9b.1014b7b4.02df__head_87C96F10__41

Going by a short glance into the files, there are some readable syslog entries in both. For the bad luck in this example, the shorter file contains the more current entries?!

It sounds like the larger one was at one point correct, but since they got out of sync, an update was applied to the other. What fs is this (inside the VM)? If we're lucky, the whole block is file data, in which case I would extend the small one (with the more recent data) out to the full size by taking the last chunk of the second one. (Or, if the bytes look like an unimportant file, just use truncate(1) to extend it, and get zeros for that region.)
Make backups of the object first, and fsck inside the VM afterwards.

We've seen this issue bite twice now, both times on argonaut. So far nobody using anything more recent... but that is a smaller pool of people, so no real comfort there. We are working on setting up a higher-stress long-term testing cluster to trigger this. Can you remind me what kernel version you are using?

One of the affected nodes is driven by 3.5.4; the newer ones are nowadays Ubuntu 12.04.1 LTS with a self-compiled 3.6.6. Inside the VMs you can imagine all flavors: less forgiving CentOS 5.8, some Debian 5.0 (ext3)… mostly ext3, I think. Not optimal, at least. There were a couple of problems caused by slow requests; I can see in some log files customers pressing the RESET button, implemented via the qemu monitor. As destructive as can be, with some megs of cache in the rbd device.

Thanks and regards, Oliver.

sage

What exactly happens, if I try to copy or export the file? Which
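One stumbling block in step 1 of the repair recipe above: Sam guessed the filename would end in '__65', but the files Oliver found end in '__41', because the trailing pool id in the on-disk name is hexadecimal (0x41 = 65). A hypothetical helper to assemble the expected name; the pattern is inferred from the listings in this thread, not from any filestore documentation:

```shell
# Build the on-disk filename of a head object: block name, literal '__head_',
# the object hash in upper case, '__', and the pool id rendered in hex.
object_file() {
    # $1 = rbd block name, $2 = object hash (lowercase hex), $3 = decimal pool id
    printf '%s__head_%s__%x\n' "$1" "$(printf '%s' "$2" | tr 'a-f' 'A-F')" "$3"
}

object_file rb.0.47d9b.1014b7b4.02df 87c96f10 65
# → rb.0.47d9b.1014b7b4.02df__head_87C96F10__41
```

With the name in hand, something like `find current/65.10_head -name "$(object_file ...)"` on each OSD locates the two replicas for the size comparison in step 3.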
Re: A couple of OSD-crashes after serious network trouble
Hi Sam, perhaps you have overlooked my comments further down, beginning with "been there"? ;) If so, please have a look, cause I'm clueless 8-)

On 12/10/2012 11:48 AM, Oliver Francke wrote:

Hi Sam, helpful input... and... not so...

On 12/07/2012 10:18 PM, Samuel Just wrote:

Ah... unfortunately doing a repair in these 6 cases would probably result in the wrong object surviving. It should work, but it might corrupt the rbd image contents. If the images are expendable, you could repair and then delete the images. The red flag here is that the "known size" is smaller than the other size. This indicates that it most likely chose the wrong file as the correct one, since rbd image blocks usually get bigger over time. To fix this, you will need to manually copy the file for the larger of the two object replicas over the smaller of the two object replicas.

For the first, soid 87c96f10/rb.0.47d9b.1014b7b4.02df/head//65 in pg 65.10:

1) Find the object on the primary and the replica (from above, the primary is 12 and the replica is 40). You can use find in the primary and replica current/65.10_head directories to look for a file matching *rb.0.47d9b.1014b7b4.02df*. The file name should be 'rb.0.47d9b.1014b7b4.02df__head_87C96F10__65', I think.
2) Stop the primary and replica osds.
3) Compare the file sizes of the two files -- you should find that they do not match.
4) Replace the smaller file with the larger one (you'll probably want to keep a copy of the smaller one around, just in case).
5) Restart the osds and scrub pg 65.10 -- the pg should come up clean (possibly with a relatively harmless stat mismatch).

Been there. On osd.12 it's:

-rw-r--r-- 1 root root 699904 Dec 9 06:25 rb.0.47d9b.1014b7b4.02df__head_87C96F10__41

on osd.40:

-rw-r--r-- 1 root root 4194304 Dec 9 06:25 rb.0.47d9b.1014b7b4.02df__head_87C96F10__41

Going by a short glance into the files, there are some readable syslog entries in both.
For the bad luck in this example, the shorter file contains the more current entries?! What exactly happens, if I try to copy or export the file? Which block will be chosen? VM is running as I'm writing, so flexibility is reduced. Regards, Oliver. If this worked out correctly, you can repeat for the other 5 cases. Let me know if you have any questions. -Sam On Fri, Dec 7, 2012 at 11:09 AM, Oliver Francke oliver.fran...@filoo.de wrote: Hi Sam, Am 07.12.2012 um 19:37 schrieb Samuel Just sam.j...@inktank.com: That is very likely to be one of the merge_log bugs fixed between 0.48 and 0.55. I could confirm with a stacktrace from gdb with line numbers or the remainder of the logging dumped when the daemon crashed. My understanding of your situation is that currently all pgs are active+clean but you are missing some rbd image headers and some rbd images appear to be corrupted. Is that accurate? -Sam thnx for dropping in. Uhm, almost correct, there are now 6 pgs in state inconsistent: HEALTH_WARN 6 pgs inconsistent pg 65.da is active+clean+inconsistent, acting [1,33] pg 65.d7 is active+clean+inconsistent, acting [13,42] pg 65.10 is active+clean+inconsistent, acting [12,40] pg 65.f is active+clean+inconsistent, acting [13,31] pg 65.75 is active+clean+inconsistent, acting [1,33] pg 65.6a is active+clean+inconsistent, acting [13,31] I know which images are affected, but does a repair help?
0 log [ERR] : 65.10 osd.40: soid 87c96f10/rb.0.47d9b.1014b7b4.02df/head//65 size 4194304 != known size 699904 0 log [ERR] : 65.6a osd.31: soid 19a2526a/rb.0.2dcf2.1da2a31e.0737/head//65 size 4191744 != known size 2757632 0 log [ERR] : 65.75 osd.33: soid 20550575/rb.0.2d520.5c17a6e3.0339/head//65 size 4194304 != known size 1238016 0 log [ERR] : 65.d7 osd.42: soid fa3a5d7/rb.0.2c2a8.12ec359d.205c/head//65 size 4194304 != known size 1382912 0 log [ERR] : 65.da osd.33: soid c2a344da/rb.0.2be17.cb4bd69.0081/head//65 size 4191744 != known size 1815552 0 log [ERR] : 65.f osd.31: soid e8d2430f/rb.0.2d1e9.1339c5dd.0c41/head//65 size 2424832 != known size 2331648 or make things worse? I could only check 14 out of 20 OSD's so far, cause from two older nodes a scrub leads to slow-requests… couple of minutes, so VM's got stalled… customers pressing the reset-button, so losing caches… Comments welcome, Oliver. On Fri, Dec 7, 2012 at 6:39 AM, Oliver Francke oliver.fran...@filoo.de wrote: Hi, is the following a known one, too? Would be good to get it out of my head: /var/log/ceph/ceph-osd.40.log.1.gz: 1: /usr/bin/ceph-osd() [0x706c59] /var/log/ceph/ceph-osd.40.log.1.gz: 2: (()+0xeff0) [0x7f7f306c0ff0] /var/log/ceph/ceph-osd.40.log.1.gz: 3: (gsignal()+0x35) [0x7f7f2f35f1b5] /var/log/ceph/ceph-osd.40.log.1.gz: 4: (abort()+0x180) [0x7f7f2f361fc0] /var/log/ceph/ceph-osd.40.log.1.gz: 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f7f2fbf3dc5] /var/log/ceph/ceph-osd.40.log.1.gz: 6
Re: A couple of OSD-crashes after serious network trouble
Hi Sage, Am 11.12.2012 um 18:04 schrieb Sage Weil s...@inktank.com: On Tue, 11 Dec 2012, Oliver Francke wrote: Hi Sam, perhaps you have overlooked my comments further down, beginning with been there ? ;) We're pretty swamped with bobtail stuff at the moment, so ceph-devel inquiries are low on the priority list right now. 100% agree, this thing here is best effort right now, true. See below: If so, please have a look, cause I'm clueless 8-) On 12/10/2012 11:48 AM, Oliver Francke wrote: Hi Sam, helpful input.. and... not so... On 12/07/2012 10:18 PM, Samuel Just wrote: Ah... unfortunately doing a repair in these 6 cases would probably result in the wrong object surviving. It should work, but it might corrupt the rbd image contents. If the images are expendable, you could repair and then delete the images. The red flag here is that the known size is smaller than the other size. This indicates that it most likely chose the wrong file as the correct one since rbd image blocks usually get bigger over time. To fix this, you will need to manually copy the file for the larger of the two object replicas to replace the smaller of the two object replicas. For the first, soid 87c96f10/rb.0.47d9b.1014b7b4.02df/head//65 in pg 65.10: 1) Find the object on the primary and the replica (from above, primary is 12 and replica is 40). You can use find in the primary and replica current/65.10_head directories to look for a file matching *rb.0.47d9b.1014b7b4.02df*). The file name should be 'rb.0.47d9b.1014b7b4.02df__head_87C96F10__65' I think. 2) Stop the primary and replica osds 3) Compare the file sizes for the two files -- you should find that the file sizes do not match. 4) Replace the smaller file with the larger one (you'll probably want to keep a copy of the smaller one around just in case). 5) Restart the osds and scrub pg 65.10 -- the pg should come up clean (possibly with a relatively harmless stat mismatch) been there. 
on OSD.12 it's -rw-r--r-- 1 root root 699904 Dec 9 06:25 rb.0.47d9b.1014b7b4.02df__head_87C96F10__41 on OSD.40: -rw-r--r-- 1 root root 4194304 Dec 9 06:25 rb.0.47d9b.1014b7b4.02df__head_87C96F10__41 going by a short glance into the file, there are some readable syslog-entries, in both files. For the bad luck in this example, the shorter file contains the more current entries?! It sounds like the larger one was at one point correct, but since they got out of sync an update was applied to the other. What fs is this (inside the VM)? If we're lucky the whole block is file data, in which case I would extend the small one with the more recent data out to the full size by taking the last chunk of the second one. (Or, if the bytes look like an unimportant file, just use truncate(1) to extend it, and get zeros for that region.) Make backups of the object first, and fsck inside the VM afterwards. -- We've seen this issue bite twice now, both times on argonaut. So far nobody using anything more recent... but that is a smaller pool of people, so no real comfort there. Working on setting up a higher-stress long-term testing cluster to trigger this. Can you remind me what kernel version you are using? one of the affected nodes is driven by 3.5.4, the newer ones are nowadays Ubuntu 12.04.1 LTS with self-compiled 3.6.6. Inside the VM's you can imagine all flavors, less forgiving CentOS 5.8, some debian5.0 ( ext3)… mostly ext3, I think. Not optimum, at least. Couple of problems caused by slow requests, I can see in some log-files customers pressing the RESET button, implemented via qemu-monitor. Destructive as can be, with having some megs of cache with the rbd-device. Thnx n regards, Oliver. sage What exactly happens, if I try to copy or export the file? Which block will be chosen? VM is running as I'm writing, so flexibility is reduced. Regards, Oliver. If this worked out correctly, you can repeat for the other 5 cases. Let me know if you have any questions.
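Sage's truncate(1) option can be sketched like this. The sizes are the ones from this thread (a 699904-byte short replica padded out to the full 4 MiB object size), but the script runs on a scratch file, not a live OSD object; only do this on the real object if the missing region is expendable file data, and keep a backup and fsck inside the VM afterwards, as Sage says.

```shell
# Sketch: pad the short 699904-byte object out to the full 4 MiB object
# size with zeros, keeping a backup of the original. Done on a scratch
# file here; on a real OSD you'd operate on the object file itself with
# the OSD stopped.
set -e
f=$(mktemp)
head -c 699904 /dev/zero > "$f"     # stand-in for the short replica
cp "$f" "$f.bak"                    # keep the original around
truncate -s 4194304 "$f"            # extend; the new tail reads as zeros
stat -c %s "$f"                     # prints 4194304
```

The extended region returns zeros on read, which is exactly the "get zeros for that region" behaviour described above.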
-Sam On Fri, Dec 7, 2012 at 11:09 AM, Oliver Francke oliver.fran...@filoo.de wrote: Hi Sam, Am 07.12.2012 um 19:37 schrieb Samuel Just sam.j...@inktank.com: That is very likely to be one of the merge_log bugs fixed between 0.48 and 0.55. I could confirm with a stacktrace from gdb with line numbers or the remainder of the logging dumped when the daemon crashed. My understanding of your situation is that currently all pgs are active+clean but you are missing some rbd image headers and some rbd images appear to be corrupted. Is that accurate? -Sam thnx for dropping in. Uhm, almost correct, there are now 6 pgs in state inconsistent: HEALTH_WARN 6 pgs inconsistent pg 65.da is active+clean+inconsistent, acting [1,33] pg 65.d7 is active+clean+inconsistent, acting [13,42] pg 65.10 is active+clean+inconsistent, acting [12,40] pg 65.f is active+clean
Re: A couple of OSD-crashes after serious network trouble
Hi Sam, helpful input.. and... not so... On 12/07/2012 10:18 PM, Samuel Just wrote: Ah... unfortunately doing a repair in these 6 cases would probably result in the wrong object surviving. It should work, but it might corrupt the rbd image contents. If the images are expendable, you could repair and then delete the images. The red flag here is that the known size is smaller than the other size. This indicates that it most likely chose the wrong file as the correct one since rbd image blocks usually get bigger over time. To fix this, you will need to manually copy the file for the larger of the two object replicas to replace the smaller of the two object replicas. For the first, soid 87c96f10/rb.0.47d9b.1014b7b4.02df/head//65 in pg 65.10: 1) Find the object on the primary and the replica (from above, primary is 12 and replica is 40). You can use find in the primary and replica current/65.10_head directories to look for a file matching *rb.0.47d9b.1014b7b4.02df*). The file name should be 'rb.0.47d9b.1014b7b4.02df__head_87C96F10__65' I think. 2) Stop the primary and replica osds 3) Compare the file sizes for the two files -- you should find that the file sizes do not match. 4) Replace the smaller file with the larger one (you'll probably want to keep a copy of the smaller one around just in case). 5) Restart the osds and scrub pg 65.10 -- the pg should come up clean (possibly with a relatively harmless stat mismatch) been there. on OSD.12 it's -rw-r--r-- 1 root root 699904 Dec 9 06:25 rb.0.47d9b.1014b7b4.02df__head_87C96F10__41 on OSD.40: -rw-r--r-- 1 root root 4194304 Dec 9 06:25 rb.0.47d9b.1014b7b4.02df__head_87C96F10__41 going by a short glance into the file, there are some readable syslog-entries, in both files. For the bad luck in this example, the shorter file contains the more current entries?! What exactly happens, if I try to copy or export the file? Which block will be chosen? VM is running as I'm writing, so flexibility reduced. Regards, Oliver. 
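Sam's steps 2-5 can be sketched as a small script. This is a sketch only, demonstrated on scratch copies in a temp directory rather than a live OSD tree: the "larger replica wins" rule and the object name are taken from the thread, but on a real cluster both OSDs must be stopped first and the two directories would be current/65.10_head on osd.12 and osd.40.

```shell
# Sketch of the replica repair: compare the two copies of the object,
# back up the smaller one, and overwrite it with the larger one.
# Paths and file contents are stand-ins for illustration.
set -e
work=$(mktemp -d)
primary=$work/osd.12; replica=$work/osd.40
mkdir -p "$primary" "$replica"
obj='rb.0.47d9b.1014b7b4.02df__head_87C96F10__65'

# stand-ins: primary holds the short (699904-byte) copy, replica the full one
head -c 699904  /dev/zero > "$primary/$obj"
head -c 4194304 /dev/zero > "$replica/$obj"

p=$(stat -c %s "$primary/$obj"); r=$(stat -c %s "$replica/$obj")
if [ "$p" -lt "$r" ]; then small=$primary; large=$replica
else small=$replica; large=$primary; fi

cp "$small/$obj" "$small/$obj.bak"   # keep the smaller copy, just in case
cp "$large/$obj" "$small/$obj"       # the larger replica wins

stat -c %s "$primary/$obj"           # both copies now 4194304 bytes
```

After restarting the OSDs, a scrub of pg 65.10 should then come up clean, per step 5.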
If this worked out correctly, you can repeat for the other 5 cases. Let me know if you have any questions. -Sam On Fri, Dec 7, 2012 at 11:09 AM, Oliver Francke oliver.fran...@filoo.de wrote: Hi Sam, Am 07.12.2012 um 19:37 schrieb Samuel Just sam.j...@inktank.com: That is very likely to be one of the merge_log bugs fixed between 0.48 and 0.55. I could confirm with a stacktrace from gdb with line numbers or the remainder of the logging dumped when the daemon crashed. My understanding of your situation is that currently all pgs are active+clean but you are missing some rbd image headers and some rbd images appear to be corrupted. Is that accurate? -Sam thnx for dropping in. Uhm, almost correct, there are now 6 pgs in state inconsistent: HEALTH_WARN 6 pgs inconsistent pg 65.da is active+clean+inconsistent, acting [1,33] pg 65.d7 is active+clean+inconsistent, acting [13,42] pg 65.10 is active+clean+inconsistent, acting [12,40] pg 65.f is active+clean+inconsistent, acting [13,31] pg 65.75 is active+clean+inconsistent, acting [1,33] pg 65.6a is active+clean+inconsistent, acting [13,31] I know which images are affected, but does a repair help? 0 log [ERR] : 65.10 osd.40: soid 87c96f10/rb.0.47d9b.1014b7b4.02df/head//65 size 4194304 != known size 699904 0 log [ERR] : 65.6a osd.31: soid 19a2526a/rb.0.2dcf2.1da2a31e.0737/head//65 size 4191744 != known size 2757632 0 log [ERR] : 65.75 osd.33: soid 20550575/rb.0.2d520.5c17a6e3.0339/head//65 size 4194304 != known size 1238016 0 log [ERR] : 65.d7 osd.42: soid fa3a5d7/rb.0.2c2a8.12ec359d.205c/head//65 size 4194304 != known size 1382912 0 log [ERR] : 65.da osd.33: soid c2a344da/rb.0.2be17.cb4bd69.0081/head//65 size 4191744 != known size 1815552 0 log [ERR] : 65.f osd.31: soid e8d2430f/rb.0.2d1e9.1339c5dd.0c41/head//65 size 2424832 != known size 2331648 or make things worse?
I could only check 14 out of 20 OSD's so far, cause from two older nodes a scrub leads to slow-requests… couple of minutes, so VM's got stalled… customers pressing the reset-button, so losing caches… Comments welcome, Oliver. On Fri, Dec 7, 2012 at 6:39 AM, Oliver Francke oliver.fran...@filoo.de wrote: Hi, is the following a known one, too? Would be good to get it out of my head: /var/log/ceph/ceph-osd.40.log.1.gz: 1: /usr/bin/ceph-osd() [0x706c59] /var/log/ceph/ceph-osd.40.log.1.gz: 2: (()+0xeff0) [0x7f7f306c0ff0] /var/log/ceph/ceph-osd.40.log.1.gz: 3: (gsignal()+0x35) [0x7f7f2f35f1b5] /var/log/ceph/ceph-osd.40.log.1.gz: 4: (abort()+0x180) [0x7f7f2f361fc0] /var/log/ceph/ceph-osd.40.log.1.gz: 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f7f2fbf3dc5] /var/log/ceph/ceph-osd.40.log.1.gz: 6: (()+0xcb166) [0x7f7f2fbf2166] /var/log/ceph/ceph-osd.40.log.1.gz: 7: (()+0xcb193) [0x7f7f2fbf2193] /var/log/ceph/ceph-osd.40.log.1.gz: 8: (()+0xcb28e) [0x7f7f2fbf228e] /var/log/ceph/ceph-osd.40.log.1.gz
Re: A couple of OSD-crashes after serious network trouble
Hi, is the following a known one, too? Would be good to get it out of my head: /var/log/ceph/ceph-osd.40.log.1.gz: 1: /usr/bin/ceph-osd() [0x706c59] /var/log/ceph/ceph-osd.40.log.1.gz: 2: (()+0xeff0) [0x7f7f306c0ff0] /var/log/ceph/ceph-osd.40.log.1.gz: 3: (gsignal()+0x35) [0x7f7f2f35f1b5] /var/log/ceph/ceph-osd.40.log.1.gz: 4: (abort()+0x180) [0x7f7f2f361fc0] /var/log/ceph/ceph-osd.40.log.1.gz: 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f7f2fbf3dc5] /var/log/ceph/ceph-osd.40.log.1.gz: 6: (()+0xcb166) [0x7f7f2fbf2166] /var/log/ceph/ceph-osd.40.log.1.gz: 7: (()+0xcb193) [0x7f7f2fbf2193] /var/log/ceph/ceph-osd.40.log.1.gz: 8: (()+0xcb28e) [0x7f7f2fbf228e] /var/log/ceph/ceph-osd.40.log.1.gz: 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x793) [0x77e903] /var/log/ceph/ceph-osd.40.log.1.gz: 10: (PG::merge_log(ObjectStore::Transaction, pg_info_t, pg_log_t, int)+0x1de3) [0x63db93] /var/log/ceph/ceph-osd.40.log.1.gz: 11: (PG::RecoveryState::Stray::react(PG::RecoveryState::MLogRec const)+0x2cc) [0x63e00c] /var/log/ceph/ceph-osd.40.log.1.gz: 12: (boost::statechart::simple_statePG::RecoveryState::Stray, PG::RecoveryState::Started, boost::mpl::listmpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, (boost::statechart::history_mode)0::react_impl(boost::statechart::event_base const, void const*)+0x203) [0x658a63] /var/log/ceph/ceph-osd.40.log.1.gz: 13: (boost::statechart::state_machinePG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocatorvoid, boost::statechart::null_exception_translator::process_event(boost::statechart::event_base const)+0x6b) [0x650b4b] /var/log/ceph/ceph-osd.40.log.1.gz: 14: (PG::RecoveryState::handle_log(int, MOSDPGLog*, PG::RecoveryCtx*)+0x190) [0x60a520] /var/log/ceph/ceph-osd.40.log.1.gz: 15: 
(OSD::handle_pg_log(std::tr1::shared_ptrOpRequest)+0x666) [0x5c62e6] /var/log/ceph/ceph-osd.40.log.1.gz: 16: (OSD::dispatch_op(std::tr1::shared_ptrOpRequest)+0x11b) [0x5c6f3b] /var/log/ceph/ceph-osd.40.log.1.gz: 17: (OSD::_dispatch(Message*)+0x173) [0x5d1983] /var/log/ceph/ceph-osd.40.log.1.gz: 18: (OSD::ms_dispatch(Message*)+0x184) [0x5d2254] /var/log/ceph/ceph-osd.40.log.1.gz: 19: (SimpleMessenger::DispatchQueue::entry()+0x5e9) [0x7d3c09] /var/log/ceph/ceph-osd.40.log.1.gz: 20: (SimpleMessenger::dispatch_entry()+0x15) [0x7d5195] /var/log/ceph/ceph-osd.40.log.1.gz: 21: (SimpleMessenger::DispatchThread::entry()+0xd) [0x726bad] /var/log/ceph/ceph-osd.40.log.1.gz: 22: (()+0x68ca) [0x7f7f306b88ca] /var/log/ceph/ceph-osd.40.log.1.gz: 23: (clone()+0x6d) [0x7f7f2f3fc92d] Thnx for looking, Oliver. -- Oliver Francke filoo GmbH Moltkestraße 25a 0 Gütersloh HRB4355 AG Gütersloh Geschäftsführer: S.Grewing | J.Rehpöhler | C.Kunz Folgen Sie uns auf Twitter: http://twitter.com/filoogmbh -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: A couple of OSD-crashes after serious network trouble
Hi Sam, Am 07.12.2012 um 19:37 schrieb Samuel Just sam.j...@inktank.com: That is very likely to be one of the merge_log bugs fixed between 0.48 and 0.55. I could confirm with a stacktrace from gdb with line numbers or the remainder of the logging dumped when the daemon crashed. My understanding of your situation is that currently all pgs are active+clean but you are missing some rbd image headers and some rbd images appear to be corrupted. Is that accurate? -Sam thnx for dropping in. Uhm, almost correct, there are now 6 pgs in state inconsistent: HEALTH_WARN 6 pgs inconsistent pg 65.da is active+clean+inconsistent, acting [1,33] pg 65.d7 is active+clean+inconsistent, acting [13,42] pg 65.10 is active+clean+inconsistent, acting [12,40] pg 65.f is active+clean+inconsistent, acting [13,31] pg 65.75 is active+clean+inconsistent, acting [1,33] pg 65.6a is active+clean+inconsistent, acting [13,31] I know which images are affected, but does a repair help? 0 log [ERR] : 65.10 osd.40: soid 87c96f10/rb.0.47d9b.1014b7b4.02df/head//65 size 4194304 != known size 699904 0 log [ERR] : 65.6a osd.31: soid 19a2526a/rb.0.2dcf2.1da2a31e.0737/head//65 size 4191744 != known size 2757632 0 log [ERR] : 65.75 osd.33: soid 20550575/rb.0.2d520.5c17a6e3.0339/head//65 size 4194304 != known size 1238016 0 log [ERR] : 65.d7 osd.42: soid fa3a5d7/rb.0.2c2a8.12ec359d.205c/head//65 size 4194304 != known size 1382912 0 log [ERR] : 65.da osd.33: soid c2a344da/rb.0.2be17.cb4bd69.0081/head//65 size 4191744 != known size 1815552 0 log [ERR] : 65.f osd.31: soid e8d2430f/rb.0.2d1e9.1339c5dd.0c41/head//65 size 2424832 != known size 2331648 or make things worse? I could only check 14 out of 20 OSD's so far, cause from two older nodes a scrub leads to slow-requests… couple of minutes, so VM's got stalled… customers pressing the reset-button, so losing caches… Comments welcome, Oliver. On Fri, Dec 7, 2012 at 6:39 AM, Oliver Francke oliver.fran...@filoo.de wrote: Hi, is the following a known one, too?
Would be good to get it out of my head: /var/log/ceph/ceph-osd.40.log.1.gz: 1: /usr/bin/ceph-osd() [0x706c59] /var/log/ceph/ceph-osd.40.log.1.gz: 2: (()+0xeff0) [0x7f7f306c0ff0] /var/log/ceph/ceph-osd.40.log.1.gz: 3: (gsignal()+0x35) [0x7f7f2f35f1b5] /var/log/ceph/ceph-osd.40.log.1.gz: 4: (abort()+0x180) [0x7f7f2f361fc0] /var/log/ceph/ceph-osd.40.log.1.gz: 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f7f2fbf3dc5] /var/log/ceph/ceph-osd.40.log.1.gz: 6: (()+0xcb166) [0x7f7f2fbf2166] /var/log/ceph/ceph-osd.40.log.1.gz: 7: (()+0xcb193) [0x7f7f2fbf2193] /var/log/ceph/ceph-osd.40.log.1.gz: 8: (()+0xcb28e) [0x7f7f2fbf228e] /var/log/ceph/ceph-osd.40.log.1.gz: 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x793) [0x77e903] /var/log/ceph/ceph-osd.40.log.1.gz: 10: (PG::merge_log(ObjectStore::Transaction, pg_info_t, pg_log_t, int)+0x1de3) [0x63db93] /var/log/ceph/ceph-osd.40.log.1.gz: 11: (PG::RecoveryState::Stray::react(PG::RecoveryState::MLogRec const)+0x2cc) [0x63e00c] /var/log/ceph/ceph-osd.40.log.1.gz: 12: (boost::statechart::simple_statePG::RecoveryState::Stray, PG::RecoveryState::Started, boost::mpl::listmpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, (boost::statechart::history_mode)0::react_impl(boost::statechart::event_base const, void const*)+0x203) [0x658a63] /var/log/ceph/ceph-osd.40.log.1.gz: 13: (boost::statechart::state_machinePG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocatorvoid, boost::statechart::null_exception_translator::process_event(boost::statechart::event_base const)+0x6b) [0x650b4b] /var/log/ceph/ceph-osd.40.log.1.gz: 14: (PG::RecoveryState::handle_log(int, MOSDPGLog*, PG::RecoveryCtx*)+0x190) [0x60a520] /var/log/ceph/ceph-osd.40.log.1.gz: 15: (OSD::handle_pg_log(std::tr1::shared_ptrOpRequest)+0x666) [0x5c62e6] 
/var/log/ceph/ceph-osd.40.log.1.gz: 16: (OSD::dispatch_op(std::tr1::shared_ptrOpRequest)+0x11b) [0x5c6f3b] /var/log/ceph/ceph-osd.40.log.1.gz: 17: (OSD::_dispatch(Message*)+0x173) [0x5d1983] /var/log/ceph/ceph-osd.40.log.1.gz: 18: (OSD::ms_dispatch(Message*)+0x184) [0x5d2254] /var/log/ceph/ceph-osd.40.log.1.gz: 19: (SimpleMessenger::DispatchQueue::entry()+0x5e9) [0x7d3c09] /var/log/ceph/ceph-osd.40.log.1.gz: 20: (SimpleMessenger::dispatch_entry()+0x15) [0x7d5195] /var/log/ceph/ceph-osd.40.log.1.gz: 21: (SimpleMessenger::DispatchThread::entry()+0xd) [0x726bad] /var/log/ceph/ceph-osd.40.log.1.gz: 22: (()+0x68ca) [0x7f7f306b88ca] /var/log/ceph/ceph-osd.40.log.1.gz: 23: (clone()+0x6d) [0x7f7f2f3fc92d] Thnx for looking, Oliver. -- Oliver Francke filoo GmbH Moltkestraße 25a 0 Gütersloh HRB4355 AG Gütersloh Geschäftsführer: S.Grewing | J.Rehpöhler
Re: A couple of OSD-crashes after serious network trouble
Hi, On 12/05/2012 03:54 PM, Sage Weil wrote: On Wed, 5 Dec 2012, Oliver Francke wrote: Hi *, around midnight yesterday we faced some layer-2 network problems. OSD's started to lose heartbeats and so on. Slow requests... you name it. So, after all OSD's doing their work, we had in sum around 6 of them crashed, 2 had to be restarted after first start. Should be 8 crashes in total. The recover_got() crash has definitely been resolved in the recent code. The others are hard to read since they've been sorted/summed; the full backtrace is better for identifying the crash. Do you have those available? There is the other pattern: /var/log/ceph/ceph-osd.40.log.1.gz: 1: /usr/bin/ceph-osd() [0x706c59] /var/log/ceph/ceph-osd.40.log.1.gz: 2: (()+0xeff0) [0x7f7f306c0ff0] /var/log/ceph/ceph-osd.40.log.1.gz: 3: (gsignal()+0x35) [0x7f7f2f35f1b5] /var/log/ceph/ceph-osd.40.log.1.gz: 4: (abort()+0x180) [0x7f7f2f361fc0] /var/log/ceph/ceph-osd.40.log.1.gz: 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f7f2fbf3dc5] /var/log/ceph/ceph-osd.40.log.1.gz: 6: (()+0xcb166) [0x7f7f2fbf2166] /var/log/ceph/ceph-osd.40.log.1.gz: 7: (()+0xcb193) [0x7f7f2fbf2193] /var/log/ceph/ceph-osd.40.log.1.gz: 8: (()+0xcb28e) [0x7f7f2fbf228e] /var/log/ceph/ceph-osd.40.log.1.gz: 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x793) [0x77e903] /var/log/ceph/ceph-osd.40.log.1.gz: 10: (PG::merge_log(ObjectStore::Transaction, pg_info_t, pg_log_t, int)+0x1de3) [0x63db93] /var/log/ceph/ceph-osd.40.log.1.gz: 11: (PG::RecoveryState::Stray::react(PG::RecoveryState::MLogRec const)+0x2cc) [0x63e00c] /var/log/ceph/ceph-osd.40.log.1.gz: 12: (boost::statechart::simple_statePG::RecoveryState::Stray, PG::RecoveryState::Started, boost::mpl::listmpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, 
(boost::statechart::history_mode)0::react_impl(boost::statechart::event_base const, void const*)+0x203) [0x658a63] /var/log/ceph/ceph-osd.40.log.1.gz: 13: (boost::statechart::state_machinePG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocatorvoid, boost::statechart::null_exception_translator::process_event(boost::statechart::event_base const)+0x6b) [0x650b4b] /var/log/ceph/ceph-osd.40.log.1.gz: 14: (PG::RecoveryState::handle_log(int, MOSDPGLog*, PG::RecoveryCtx*)+0x190) [0x60a520] /var/log/ceph/ceph-osd.40.log.1.gz: 15: (OSD::handle_pg_log(std::tr1::shared_ptrOpRequest)+0x666) [0x5c62e6] /var/log/ceph/ceph-osd.40.log.1.gz: 16: (OSD::dispatch_op(std::tr1::shared_ptrOpRequest)+0x11b) [0x5c6f3b] /var/log/ceph/ceph-osd.40.log.1.gz: 17: (OSD::_dispatch(Message*)+0x173) [0x5d1983] /var/log/ceph/ceph-osd.40.log.1.gz: 18: (OSD::ms_dispatch(Message*)+0x184) [0x5d2254] /var/log/ceph/ceph-osd.40.log.1.gz: 19: (SimpleMessenger::DispatchQueue::entry()+0x5e9) [0x7d3c09] /var/log/ceph/ceph-osd.40.log.1.gz: 20: (SimpleMessenger::dispatch_entry()+0x15) [0x7d5195] /var/log/ceph/ceph-osd.40.log.1.gz: 21: (SimpleMessenger::DispatchThread::entry()+0xd) [0x726bad] /var/log/ceph/ceph-osd.40.log.1.gz: 22: (()+0x68ca) [0x7f7f306b88ca] /var/log/ceph/ceph-osd.40.log.1.gz: 23: (clone()+0x6d) [0x7f7f2f3fc92d] State at the end of the day: active+clean; Unfortunately... after some scrubbing today, we see again inconsistencies... *sigh* End of year syndrom? Tried to get onto one OSD, which crashed yesterday and fired off some ceph osd scrub 0. And then ceph osd repair 0. 2012-12-06 16:46:29.818551 7f49f1923700 0 log [ERR] : 65.ad repair stat mismatch, got 4204/4205 objects, 0/0 clones, 16466529280/16470149632 bytes. 2012-12-06 16:46:29.818734 7f49f1923700 0 log [ERR] : 65.ad repair 1 errors, 1 fixed 2012-12-06 16:46:30.104722 7f49f2124700 0 log [ERR] : 65.23 repair stat mismatch, got 4258/4259 objects, 0/0 clones, 16686233712/16690428016 bytes. 
2012-12-06 16:46:30.104890 7f49f2124700 0 log [ERR] : 65.23 repair 1 errors, 1 fixed 2012-12-06 16:51:26.973407 7f49f2124700 0 log [ERR] : 6.1 osd.31: soid bafe2559/rb.0.1adf5.6733efe2.07ce/head//6 size 4194304 != known size 3046912 2012-12-06 16:51:26.973426 7f49f2124700 0 log [ERR] : 6.1 repair 0 missing, 1 inconsistent objects 2012-12-06 16:51:26.981234 7f49f2124700 0 log [ERR] : 6.1 repair stat mismatch, got 2153/2154 objects, 0/0 clones, 7013002752/7017197056 bytes. 2012-12-06 16:51:26.981402 7f49f2124700 0 log [ERR] : 6.1 repair 1 errors, 1 fixed um... is it repaired? Really? Everything cool now for OSD.0? Additionally there are - again - half a dozen headers missing. If corresponding VM's are stopped now, they will not restart, of course. First tickets are raised by customers having s/t like filesystem errors... mounted read-only... on the console and kind of that crap... again. Well then, should one now do a ceph osd repair \* ? Fix the headers? Is there a best practice
A couple of OSD-crashes after serious network trouble
::shared_ptrOpRequest)+0x32e) 16 (ReplicatedPG::do_sub_op(std::tr1::shared_ptrOpRequest)+0x3f7) 12 (ReplicatedPG::handle_pull_response(std::tr1::shared_ptrOpRequest)+0x4d4) 16 (ReplicatedPG::handle_pull_response(std::tr1::shared_ptrOpRequest)+0xb24) 4 (ReplicatedPG::handle_push(std::tr1::shared_ptrOpRequest)+0x263) 32 (ReplicatedPG::recover_got(hobject_t, 32 (ReplicatedPG::submit_push_complete(ObjectRecoveryInfo, 12 (ReplicatedPG::sub_op_push(std::tr1::shared_ptrOpRequest)+0x98) 16 (ReplicatedPG::sub_op_push(std::tr1::shared_ptrOpRequest)+0xa2) 4 (ReplicatedPG::sub_op_push(std::tr1::shared_ptrOpRequest)+0xf3) 4 (SimpleMessenger::dispatch_entry()+0x15) 4 (SimpleMessenger::DispatchQueue::entry()+0x5e9) 4 (SimpleMessenger::DispatchThread::entry()+0xd) 16 (ThreadPool::worker()+0x4d5) 16 (ThreadPool::worker()+0x76f) 32 (ThreadPool::WorkThread::entry()+0xd) === 8- === Everything has cleared up so far, so that's some good news ;) Comments welcome, Oliver.
Best practice with 0.48.2 to take a node into maintenance
Hi *, well, even if 0.48.2 is really stable and reliable, the same is not always true of the Linux kernel. We have a couple of nodes where an update would make life better. So, as our OSD-nodes have to care for VM's too, it's not a problem to let them drain and migrate all the VM's to other nodes. Just reboot? Perhaps not, cause all OSD's will begin to remap/backfill, as they are instructed to do. Well, declare them as osd lost? Dangerous. Is there another way I'm missing for doing node-maintenance? Will we have to wait for bobtail for far less hassle with all the remapping and resources? Thnx for comments, Oliver.
Re: Best practice with 0.48.2 to take a node into maintenance
Hi Josh, Am 03.12.2012 um 20:14 schrieb Josh Durgin josh.dur...@inktank.com: On 12/03/2012 11:05 AM, Oliver Francke wrote: Hi *, well, even if 0.48.2 is really stable and reliable, the same is not always true of the Linux kernel. We have a couple of nodes where an update would make life better. So, as our OSD-nodes have to care for VM's too, it's not a problem to let them drain and migrate all the VM's to other nodes. Just reboot? Perhaps not, cause all OSD's will begin to remap/backfill, as they are instructed to do. Well, declare them as osd lost? Dangerous. Is there another way I'm missing for doing node-maintenance? Will we have to wait for bobtail for far less hassle with all the remapping and resources? By default the monitors won't mark an OSD out in the time it takes to reboot, but if maintenance takes longer, you can drain data from the node. A simple way to rate limit it yourself is by slowly lowering the weights of the OSDs on the host you want to update, e.g. by 0.1 at a time and waiting for recovery to complete before lowering again. Once they're at 0 and the cluster is healthy, they're not responsible for any data anymore, and the node can be rebooted. True. I should have mentioned that I know about the smooth way. But for a planned reboot this takes way too much time ;) But if it's recommended, it's recommended ;) Oliver. Josh
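Josh's gradual drain could look roughly like the loop below. This is a dry-run sketch: the commands are echoed, not executed, osd.12 is a made-up example ID, and the assumption that `ceph osd reweight` (0.0-1.0 range) is the right knob for "lowering the weights" is mine. On a real cluster you would run the command for each OSD on the host and poll `ceph health` between steps instead of the placeholder echo.

```shell
# Dry-run sketch of draining one OSD before maintenance: lower its weight
# by 0.1 at a time, waiting for recovery between steps. Nothing is
# actually run against a cluster here.
osd=12
for w in 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0.0; do
    echo "ceph osd reweight $osd $w"
    echo "# ... wait here until 'ceph health' reports HEALTH_OK ..."
done
```

Once the weight reaches 0 and the cluster is healthy, the OSD holds no data and the node can be rebooted, as Josh describes.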
Re: Best practice with 0.48.2 to take a node into maintenance
Hi Florian, Am 03.12.2012 um 20:45 schrieb Smart Weblications GmbH - Florian Wiessner f.wiess...@smart-weblications.de: Am 03.12.2012 20:21, schrieb Oliver Francke: Hi Josh, Am 03.12.2012 um 20:14 schrieb Josh Durgin josh.dur...@inktank.com: On 12/03/2012 11:05 AM, Oliver Francke wrote: Hi *, well, even if 0.48.2 is really stable and reliable, it is not everytime the case with linux kernel. We have a couple of nodes, where an update would make life better. So, as our OSD-nodes have to care for VM's too, it's not the problem to let them drain so migrate all of them to other nodes. Just reboot? Perhaps not, cause all OSD's will begin to remap/backfill, they are instructed to do so. Well, declare them as osd lost? Dangerous. Is there another way I miss in doing node-maintenance? Will we have to wait for bobtail for far less hassle with all remapping and resources? By default the monitors won't mark an OSD out in the time it takes to reboot, but if maintenance takes longer, you can drain data from the node. A simple way to rate limit it yourself is by slowly lowering the weights of the OSDs on the host you want to update, e.g. by 0.1 at a time and waiting for recovery to complete before lowering again. Once they're at 0 and the cluster is healthy, they're not responsible for any data anymore, and the node can be rebooted. true. Should have mentioned knowing smooth way. But for a planned reboot this take way too much time ;) But if it's recommended, it's recommended ;) I did rolling reboots of our whole cluster a few days ago (3.4.20). When the system reboots and no fsck is done, ceph won't start to backfill in my setup. 
I had some nodes do fsck after upgrade so ceph marked the osd as down and started to backfill, but once the missing osd was back up running again, the backfill stopped and ceph did just a little bit of peering and was healthy in a few minutes again (2-5 minutes)… if you account for all the BIOS-, POST- and RAID-controller-checks, linux-boot, openvswitch-STP setup and so on, one can imagine that a reboot takes a couple of minutes; normally with our setup the cluster will detect an outage after 30 seconds and start to do its work. Everything's fine, but perhaps we could avoid big load in the cluster from remap and re-remap (theme: slow requests). I have to ask in terms of QoS for a better way ;) All that stuff had a big customer impact in the past… Time to ask. Kind reg's Oliver. -- Mit freundlichen Grüßen, Florian Wiessner Smart Weblications GmbH Martinsberger Str. 1 D-95119 Naila fon.: +49 9282 9638 200 fax.: +49 9282 9638 205 24/7: +49 900 144 000 00 - 0,99 EUR/Min* http://www.smart-weblications.de -- Sitz der Gesellschaft: Naila Geschäftsführer: Florian Wiessner HRB-Nr.: HRB 3840 Amtsgericht Hof *aus dem dt. Festnetz, ggf. abweichende Preise aus dem Mobilfunknetz
Re: rbd STDIN import does not work / wip-rbd-export-stdout
Well... On 11/26/2012 02:20 PM, Stefan Priebe - Profihost AG wrote: Hello list, I know branch wip-rbd-export-stdout is work in progress but it is more than useful ;-) When I try to import an image I get: # gzip -dc vm-101-disk-1.img.gz | rbd import --format=2 --size=42949672960 - kvmpool1/vm-101-disk-1 rbd: error reading file: (29) Illegal seek Importing image: 0% complete...failed. rbd: import failed: (29) Illegal seek Is there anything I've done wrong? I would assume that the size is already in MiB? Seems to be a slightly too big value... Not tried myself, though... Oliver. Greets, Stefan -- Oliver Francke filoo GmbH Moltkestraße 25a 0 Gütersloh HRB4355 AG Gütersloh Managing Directors: S.Grewing | J.Rehpöhler | C.Kunz Follow us on Twitter: http://twitter.com/filoogmbh
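If Oliver's hunch is right and `--size` is interpreted as MiB rather than bytes (an assumption from the thread, not verified against the rbd man page), the value for the 40 GiB image would be computed like this:

```python
def size_mb(size_bytes):
    """Round an image size in bytes up to whole MiB, on the assumption
    (from the thread, not verified) that rbd import --size takes MiB."""
    MIB = 1 << 20
    return (size_bytes + MIB - 1) // MIB

# The 40 GiB image from Stefan's command line:
print(size_mb(42949672960))  # 40960
```

Under that assumption, the import would become `gzip -dc vm-101-disk-1.img.gz | rbd import --format=2 --size=40960 - kvmpool1/vm-101-disk-1`.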
Ubuntu 12.04.1 + xfs + syncfs is still not our friend
Hi *, is anybody out there who's running Ubuntu 12.04.1 in connection with libc + xfs + syncfs? We bit the bullet and reinstalled two new nodes from Debian to precise in favour of a possible performance increase?! *sigh*, still getting: 2012-11-06 17:05:51.863921 7f5cc52e3780 0 filestore(/data/osd6-3) mount syncfs(2) syscall not support by glibc 2012-11-06 17:05:51.863925 7f5cc52e3780 0 filestore(/data/osd6-3) mount no syncfs(2), must use sync(2). as a show-stopper. That's with 3.2.* and 3.6.6 kernels. Should be new enough. And ceph-0.48.2 (the eu.ceph.com mirror was installing 0.48.1... though *ops* ;) ) But both should be capable of this system call?! We've been so very near... but now we're clueless; please shed some light 8-) (I hope this is not a very obvious human RTFM error ;) ) And while I'm writing: should one enable directio + aio, as we're having the journals on an SSD partition? Kind regards, Oliver. -- Oliver Francke filoo GmbH Moltkestraße 25a 0 Gütersloh HRB4355 AG Gütersloh Managing Directors: S.Grewing | J.Rehpöhler | C.Kunz Follow us on Twitter: http://twitter.com/filoogmbh
Re: Ubuntu 12.04.1 + xfs + syncfs is still not our friend
Hi Jens, sorry for the double work… I answered off-list already ;) Oliver. On 06.11.2012 at 19:46, Jens Rehpöhler jens.rehpoeh...@filoo.de wrote: On 06.11.2012 18:33, Gandalf Corvotempesta wrote: 2012/11/6 Oliver Francke oliver.fran...@filoo.de: 2012-11-06 17:05:51.863921 7f5cc52e3780 0 filestore(/data/osd6-3) mount syncfs(2) syscall not support by glibc 2012-11-06 17:05:51.863925 7f5cc52e3780 0 filestore(/data/osd6-3) mount no syncfs(2), must use sync(2). Could you please try to run ldd --version and test if man syncfs is working? Here is the output: ldd --version ldd (Ubuntu EGLIBC 2.15-0ubuntu10.3) 2.15 After installing manpages-dev I get the man page for syncfs (with man syncfs). The distribution is a normal Ubuntu 12.04.1 -- Kind regards Jens Rehpöhler -- Filoo GmbH Moltkestr. 25a 0 Gütersloh HRB4355 AG Gütersloh Managing Directors: S.Grewing | J.Rehpöhler | Dr. C.Kunz Telefon: +49 5241 8673012 | Mobil: +49 151 54645798 Fax: +49 5241 8673020
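One quick sanity check on the glibc side: the syncfs(2) wrapper appeared, to the best of my knowledge, in glibc 2.14 (treat that version as an assumption to verify for your distro). A tiny parser for the `ldd --version` line Jens posted:

```python
import re

def glibc_supports_syncfs(ldd_version_line, minimum=(2, 14)):
    """Parse the glibc version out of the first line of `ldd --version`
    and compare it against the release assumed to have added the
    syncfs(2) wrapper (glibc 2.14 -- an assumption, verify per distro)."""
    m = re.search(r'(\d+)\.(\d+)\s*$', ldd_version_line.strip())
    if not m:
        raise ValueError("unrecognized ldd version line")
    return (int(m.group(1)), int(m.group(2))) >= minimum

# The line from the thread:
print(glibc_supports_syncfs("ldd (Ubuntu EGLIBC 2.15-0ubuntu10.3) 2.15"))  # True
```

Note that a capable runtime glibc may still not be enough if the ceph packages were built against an older glibc, since the syncfs check in the filestore can be decided at build time; that build-time angle is my guess at the likely cause here, not something confirmed in the thread.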
Reduce bandwidth for remapping/backfill/recover?
Hi *, anybody out there who can help with an idea for reducing bandwidth when incorporating 2 new nodes into a cluster? I know of osd recovery max active = X (default 5), but with 4 OSDs per node there is enough possibility to saturate our backnet (1 Gig at the moment). Any other way to not disturb users too much? Thnx in @vance, Oliver.
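For reference, the throttles in question live in the [osd] section of ceph.conf. The values below are illustrative only, and `osd max backfills` only exists once the backfill reservation framework is in (v0.53+ per the release notes elsewhere in this archive) — verify the option names against your release:

```ini
[osd]
    ; fewer parallel recovery ops per OSD (default 5)
    osd recovery max active = 2
    ; cap concurrent backfills per OSD (v0.53+ only)
    osd max backfills = 1
```

Lower values stretch recovery out over more time in exchange for less interference with client I/O on a 1 Gbit backnet.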
Re: v0.53 released
Hi Josh, On 10/19/2012 07:42 AM, Josh Durgin wrote: On 10/17/2012 04:26 AM, Oliver Francke wrote: Hi Sage, *, after having some trouble with the journals - had to erase the partition and redo a ceph... --mkjournal - I started my testing... Everything fine. This would be due to the change in default osd journal size. In 0.53 it's 1024MB, even for block devices. Previously it defaulted to the entire block device. I already fixed this to use the entire block device in 0.54, and didn't realize the fix wasn't included in 0.53. You can restore the correct behaviour for block devices by setting this in the [osd] section of your ceph.conf: osd journal size = 0 thnx for the explanation, it gives me a better feeling for the next stable to come to the stores ;) Uhm, may it be impertinent to bring http://tracker.newdream.net/issues/2573 to your attention, as it's still ongoing at least in 0.48.2argonaut? Thnx in advance, Oliver. Josh --- 8- --- 2012-10-17 12:54:11.167782 7febab24a780 0 filestore(/data/osd0) mount: enabling PARALLEL journal mode: btrfs, SNAP_CREATE_V2 detected and 'filestore btrfs snap' mode is enabled 2012-10-17 12:54:11.191723 7febab24a780 0 journal kernel version is 3.5.0 2012-10-17 12:54:11.191907 7febab24a780 1 journal _open /dev/sdb1 fd 27: 1073741824 bytes, block size 4096 bytes, directio = 1, aio = 1 2012-10-17 12:54:11.201764 7febab24a780 0 journal kernel version is 3.5.0 2012-10-17 12:54:11.201924 7febab24a780 1 journal _open /dev/sdb1 fd 27: 1073741824 bytes, block size 4096 bytes, directio = 1, aio = 1 --- 8- --- And the other minute I started my fairly destructive testing, 0.52 never ever failed on that. 
And then a loop started with --- 8- --- 2012-10-17 12:59:15.403247 7feba5fed700 0 -- 10.0.0.11:6801/29042 10.0.0.12:6801/17706 pipe(0x55a2240 sd=34 :57922 pgs=3 cs=1 l=0).fault, initiating reconnect 2012-10-17 12:59:17.280143 7feb950cc700 0 -- 10.0.0.11:6801/29042 10.0.0.12:6804/17972 pipe(0x17f2240 sd=29 :49431 pgs=3 cs=1 l=0).fault with nothing to send, going to standby 2012-10-17 12:59:18.288902 7feb951cd700 0 -- 10.0.0.11:6801/29042 10.0.0.12:6801/17706 pipe(0x55a2240 sd=34 :37519 pgs=3 cs=2 l=0).connect claims to be 0.0.0.0:6801/5738 not 10.0.0.12:6801/17706 - wrong node! 2012-10-17 12:59:18.297663 7feb951cd700 0 -- 10.0.0.11:6801/29042 10.0.0.12:6801/17706 pipe(0x55a2240 sd=34 :34833 pgs=3 cs=2 l=0).connect claims to be 0.0.0.0:6801/5738 not 10.0.0.12:6801/17706 - wrong node! 2012-10-17 12:59:18.303215 7feb951cd700 0 -- 10.0.0.11:6801/29042 10.0.0.12:6801/17706 pipe(0x55a2240 sd=34 :35169 pgs=3 cs=2 l=0).connect claims to be 0.0.0.0:6801/5738 not 10.0.0.12:6801/17706 - wrong node! --- 8- --- leading to high CPU load on node2 (IP 10.0.0.11). The destructive part happens on node3 (IP 10.0.0.12). The procedure is, as always: just kill some OSDs and start over again... Happened now twice, so I would call it reproducible ;) Kind regards, Oliver. On 10/17/2012 01:48 AM, Sage Weil wrote: Another development release of Ceph is ready, v0.53. We are getting pretty close to what will be frozen for the next stable release (bobtail), so if you would like a preview, give this one a go. 
Notable changes include: * librbd: image locking * rbd: fix list command when more than 1024 (format 2) images * osd: backfill reservation framework (to avoid flooding new osds with backfill data) * osd, mon: honor new 'nobackfill' and 'norecover' osdmap flags * osd: new 'deep scrub' will compare object content across replicas (once per week by default) * osd: crush performance improvements * osd: some performance improvements related to request queuing * osd: capability syntax improvements, bug fixes * osd: misc recovery fixes * osd: fix memory leak on certain error paths * osd: default journal size to 1 GB * crush: default root of tree type is now 'root' instead of 'pool' (to avoid confusion wrt rados pools) * ceph-fuse: fix handling for .. in root directory * librados: some locking fixes * mon: some election bug fixes * mon: some additional on-disk metadata to facilitate future mon changes (post-bobtail) * mon: throttle osd flapping based on osd history (limits osdmap thrashing on overloaded or unhappy clusters) * mon: new 'osd crush create-or-move ...' command * radosgw: fix copy-object vs attributes * radosgw: fix bug in bucket stat updates * mds: fix ino release on abort session close, relative getattr path, mds shutdown, other misc items * upstart: stop jobs on shutdown * common: thread pool sizes can now be adjusted at runtime * build fixes for Fedora 18, CentOS/RHEL 6 The big items are locking support in RBD, and OSD improvements like deep scrub (which verifies object data across replicas) and backfill reservations (which limit load on expanding clusters). And a huge swath of bugfixes and cleanups, many due to feeding the code through scan.coverity.com (they offer free static code analysis for open source projects).
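Josh's workaround from this thread, spelled out as a ceph.conf fragment (the option and value are taken verbatim from his reply; it applies to v0.53 when the journal sits on a block device):

```ini
[osd]
    ; 0 = journal uses the entire block device again
    ; (pre-0.53 behaviour, restored by default in 0.54)
    osd journal size = 0
```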
Re: v0.53 released
Hi Sage, On 19.10.2012 at 17:48, Sage Weil s...@inktank.com wrote: On Fri, 19 Oct 2012, Oliver Francke wrote: Hi Josh, On 10/19/2012 07:42 AM, Josh Durgin wrote: On 10/17/2012 04:26 AM, Oliver Francke wrote: Hi Sage, *, after having some trouble with the journals - had to erase the partition and redo a ceph... --mkjournal - I started my testing... Everything fine. This would be due to the change in default osd journal size. In 0.53 it's 1024MB, even for block devices. Previously it defaulted to the entire block device. I already fixed this to use the entire block device in 0.54, and didn't realize the fix wasn't included in 0.53. You can restore the correct behaviour for block devices by setting this in the [osd] section of your ceph.conf: osd journal size = 0 thnx for the explanation, it gives me a better feeling for the next stable to come to the stores ;) Uhm, may it be impertinent to bring http://tracker.newdream.net/issues/2573 to your attention, as it's still ongoing at least in 0.48.2argonaut? Do you mean these messages? 2012-10-11 10:51:25.879084 7f25d08dc700 0 osd.13 1353 pg[6.5( v 1353'2567562 (1353'2566561,1353'2567562] n=1857 ec=390 les/c 1347/1349 1340/1347/1333) [13,33] r=0 lpr=1347 mlcod 1353'2567561 active+clean] watch: ctx-obc=0x6381000 cookie=1 oi.version=2301953 ctx-at_version=1353'2567563 2012-10-11 10:51:25.879133 7f25d08dc700 0 osd.13 1353 pg[6.5( v 1353'2567562 (1353'2566561,1353'2567562] n=1857 ec=390 les/c 1347/1349 1340/1347/1333) [13,33] r=0 lpr=1347 mlcod 1353'2567561 active+clean] watch: oi.user_version=2301951 They're fixed in master; I'll backport the cleanup to stable. It's useless noise. 
uhm, more into the following: Oct 19 15:28:13 fcmsnode1 kernel: [1483536.141269] libceph: osd13 10.10.10.22:6812 socket closed Oct 19 15:43:13 fcmsnode1 kernel: [1484435.176280] libceph: osd13 10.10.10.22:6812 socket closed Oct 19 15:58:13 fcmsnode1 kernel: [1485334.382798] libceph: osd13 10.10.10.22:6812 socket closed It's kind of new, because I would have become aware of it before. And we have 4 OSDs on every node, so why only from one OSD? Same picture on two other nodes. If I read the ticket correctly, no data is lost, just a socket being closed? But then a kern.log entry is far too much? ;) Oliver. sage Thnx in advance, Oliver. Josh --- 8- --- 2012-10-17 12:54:11.167782 7febab24a780 0 filestore(/data/osd0) mount: enabling PARALLEL journal mode: btrfs, SNAP_CREATE_V2 detected and 'filestore btrfs snap' mode is enabled 2012-10-17 12:54:11.191723 7febab24a780 0 journal kernel version is 3.5.0 2012-10-17 12:54:11.191907 7febab24a780 1 journal _open /dev/sdb1 fd 27: 1073741824 bytes, block size 4096 bytes, directio = 1, aio = 1 2012-10-17 12:54:11.201764 7febab24a780 0 journal kernel version is 3.5.0 2012-10-17 12:54:11.201924 7febab24a780 1 journal _open /dev/sdb1 fd 27: 1073741824 bytes, block size 4096 bytes, directio = 1, aio = 1 --- 8- --- And the other minute I started my fairly destructive testing, 0.52 never ever failed on that. And then a loop started with --- 8- --- 2012-10-17 12:59:15.403247 7feba5fed700 0 -- 10.0.0.11:6801/29042 10.0.0.12:6801/17706 pipe(0x55a2240 sd=34 :57922 pgs=3 cs=1 l=0).fault, initiating reconnect 2012-10-17 12:59:17.280143 7feb950cc700 0 -- 10.0.0.11:6801/29042 10.0.0.12:6804/17972 pipe(0x17f2240 sd=29 :49431 pgs=3 cs=1 l=0).fault with nothing to send, going to standby 2012-10-17 12:59:18.288902 7feb951cd700 0 -- 10.0.0.11:6801/29042 10.0.0.12:6801/17706 pipe(0x55a2240 sd=34 :37519 pgs=3 cs=2 l=0).connect claims to be 0.0.0.0:6801/5738 not 10.0.0.12:6801/17706 - wrong node! 
2012-10-17 12:59:18.297663 7feb951cd700 0 -- 10.0.0.11:6801/29042 10.0.0.12:6801/17706 pipe(0x55a2240 sd=34 :34833 pgs=3 cs=2 l=0).connect claims to be 0.0.0.0:6801/5738 not 10.0.0.12:6801/17706 - wrong node! 2012-10-17 12:59:18.303215 7feb951cd700 0 -- 10.0.0.11:6801/29042 10.0.0.12:6801/17706 pipe(0x55a2240 sd=34 :35169 pgs=3 cs=2 l=0).connect claims to be 0.0.0.0:6801/5738 not 10.0.0.12:6801/17706 - wrong node! --- 8- --- leading to high CPU load on node2 (IP 10.0.0.11). The destructive part happens on node3 (IP 10.0.0.12). The procedure is, as always: just kill some OSDs and start over again... Happened now twice, so I would call it reproducible ;) Kind regards, Oliver. On 10/17/2012 01:48 AM, Sage Weil wrote: Another development release of Ceph is ready, v0.53. We are getting pretty close to what will be frozen for the next stable release (bobtail), so if you would like a preview, give this one a go. Notable changes include: * librbd: image locking * rbd: fix list command when more than 1024 (format 2) images * osd: backfill reservation framework (to avoid flooding new osds with backfill data) * osd, mon: honor new 'nobackfill' and 'norecover' osdmap flags * osd: new 'deep scrub' will compare object content across
Re: v0.53 released
Hi Sage, *, after having some trouble with the journals - had to erase the partition and redo a ceph... --mkjournal - I started my testing... Everything fine. --- 8- --- 2012-10-17 12:54:11.167782 7febab24a780 0 filestore(/data/osd0) mount: enabling PARALLEL journal mode: btrfs, SNAP_CREATE_V2 detected and 'filestore btrfs snap' mode is enabled 2012-10-17 12:54:11.191723 7febab24a780 0 journal kernel version is 3.5.0 2012-10-17 12:54:11.191907 7febab24a780 1 journal _open /dev/sdb1 fd 27: 1073741824 bytes, block size 4096 bytes, directio = 1, aio = 1 2012-10-17 12:54:11.201764 7febab24a780 0 journal kernel version is 3.5.0 2012-10-17 12:54:11.201924 7febab24a780 1 journal _open /dev/sdb1 fd 27: 1073741824 bytes, block size 4096 bytes, directio = 1, aio = 1 --- 8- --- And the other minute I started my fairly destructive testing, 0.52 never ever failed on that. And then a loop started with --- 8- --- 2012-10-17 12:59:15.403247 7feba5fed700 0 -- 10.0.0.11:6801/29042 10.0.0.12:6801/17706 pipe(0x55a2240 sd=34 :57922 pgs=3 cs=1 l=0).fault, initiating reconnect 2012-10-17 12:59:17.280143 7feb950cc700 0 -- 10.0.0.11:6801/29042 10.0.0.12:6804/17972 pipe(0x17f2240 sd=29 :49431 pgs=3 cs=1 l=0).fault with nothing to send, going to standby 2012-10-17 12:59:18.288902 7feb951cd700 0 -- 10.0.0.11:6801/29042 10.0.0.12:6801/17706 pipe(0x55a2240 sd=34 :37519 pgs=3 cs=2 l=0).connect claims to be 0.0.0.0:6801/5738 not 10.0.0.12:6801/17706 - wrong node! 2012-10-17 12:59:18.297663 7feb951cd700 0 -- 10.0.0.11:6801/29042 10.0.0.12:6801/17706 pipe(0x55a2240 sd=34 :34833 pgs=3 cs=2 l=0).connect claims to be 0.0.0.0:6801/5738 not 10.0.0.12:6801/17706 - wrong node! 2012-10-17 12:59:18.303215 7feb951cd700 0 -- 10.0.0.11:6801/29042 10.0.0.12:6801/17706 pipe(0x55a2240 sd=34 :35169 pgs=3 cs=2 l=0).connect claims to be 0.0.0.0:6801/5738 not 10.0.0.12:6801/17706 - wrong node! --- 8- --- leading to high CPU-load on node2 ( IP 10.0.0.11). The destructive part happens on node3 ( IP 10.0.0.12). 
The procedure is, as always: just kill some OSDs and start over again... Happened now twice, so I would call it reproducible ;) Kind regards, Oliver. On 10/17/2012 01:48 AM, Sage Weil wrote: Another development release of Ceph is ready, v0.53. We are getting pretty close to what will be frozen for the next stable release (bobtail), so if you would like a preview, give this one a go. Notable changes include: * librbd: image locking * rbd: fix list command when more than 1024 (format 2) images * osd: backfill reservation framework (to avoid flooding new osds with backfill data) * osd, mon: honor new 'nobackfill' and 'norecover' osdmap flags * osd: new 'deep scrub' will compare object content across replicas (once per week by default) * osd: crush performance improvements * osd: some performance improvements related to request queuing * osd: capability syntax improvements, bug fixes * osd: misc recovery fixes * osd: fix memory leak on certain error paths * osd: default journal size to 1 GB * crush: default root of tree type is now 'root' instead of 'pool' (to avoid confusion wrt rados pools) * ceph-fuse: fix handling for .. in root directory * librados: some locking fixes * mon: some election bug fixes * mon: some additional on-disk metadata to facilitate future mon changes (post-bobtail) * mon: throttle osd flapping based on osd history (limits osdmap thrashing on overloaded or unhappy clusters) * mon: new 'osd crush create-or-move ...' command * radosgw: fix copy-object vs attributes * radosgw: fix bug in bucket stat updates * mds: fix ino release on abort session close, relative getattr path, mds shutdown, other misc items * upstart: stop jobs on shutdown * common: thread pool sizes can now be adjusted at runtime * build fixes for Fedora 18, CentOS/RHEL 6 The big items are locking support in RBD, and OSD improvements like deep scrub (which verifies object data across replicas) and backfill reservations (which limit load on expanding clusters). 
And a huge swath of bugfixes and cleanups, many due to feeding the code through scan.coverity.com (they offer free static code analysis for open source projects). v0.54 is now frozen, and will include many deployment-related fixes (including a new ceph-deploy tool to replace mkcephfs), more bugfixes for libcephfs, ceph-fuse, and the MDS, and the fruits of some performance work on the OSD. You can get v0.53 from the usual locations: * Git at git://github.com/ceph/ceph.git * Tarball at http://ceph.com/download/ceph-0.53.tar.gz * For Debian/Ubuntu packages, see http://ceph.com/docs/master/install/debian * For RPMs, see http://ceph.com/docs/master/install/rpm -- Oliver Francke filoo GmbH
Re: v0.48.2 argonaut update released
Hi *, with reference to the below-mentioned objecter: misc fixes for op reordering, I assumed it could have something to do with slow requests not being resolved for too long. I just didn't see it anymore in 0.51 in our testing environment. But today we took one of our nodes into maintenance, and I see many of: --- 8 2012-10-01 11:58:46.766999 osd.10 [WRN] 38 slow requests, 1 included below; oldest blocked for 1189.312605 secs 2012-10-01 11:58:46.767013 osd.10 [WRN] slow request 240.183032 seconds old, received at 2012-10-01 11:54:46.583860: osd_op(client.110046.0:2143984 rb.0.1adf5.6733efe2.061a [write 208384~4096] 6.f511e801) v4 currently delayed 2 --- 8- --- which is bad, as I assume that some of the VMs are now stalled. Has anybody else experienced such things? Thnx n regards, Oliver. On 09/20/2012 06:52 PM, Sage Weil wrote: Hi everyone, Another update to the stable argonaut series has been released. This fixes a few important bugs in rbd and radosgw and includes a series of changes to upstart and deployment related scripts that will allow the upcoming 'ceph-deploy' tool to work with the argonaut release. Upgrading: * The default search path for keyring files now includes /etc/ceph/ceph.$name.keyring. If such files are present on your cluster, be aware that by default they may now be used. * There are several changes to the upstart init files. These have not been previously documented or recommended. Any existing users should review the changes before upgrading. * The ceph-disk-prepare and ceph-disk-activate scripts have been updated significantly. These have not been previously documented or recommended. Any existing users should review the changes before upgrading. 
Notable changes include: * mkcephfs: fix keyring generation for mds, osd when default paths are used * radosgw: fix bug causing occasional corruption of per-bucket stats * radosgw: workaround to avoid previously corrupted stats from going negative * radosgw: fix bug in usage stats reporting on busy buckets * radosgw: fix Content-Range: header for objects bigger than 2 GB. * rbd: avoid leaving watch acting when command line tool errors out (avoids 30s delay on subsequent operations) * rbd: friendlier use of pool/image options for import (old calling convention still works) * librbd: fix rare snapshot creation race (could lose a snap when creation is concurrent) * librbd: fix discard handling when spanning holes * librbd: fix memory leak on discard when caching is enabled * objecter: misc fixes for op reordering * objecter: fix for rare startup-time deadlock waiting for osdmap * ceph: fix usage * mon: reduce log noise about check_sub * ceph-disk-activate: misc fixes, improvements * ceph-disk-prepare: partition and format osd disks automatically * upstart: start everyone on a reboot * upstart: always update the osd crush location on start if specified in the config * config: add /etc/ceph/ceph.$name.keyring to default keyring search path * ceph.spec: don't package crush headers You can get this release from the usual locations: * Git at git://github.com/ceph/ceph.git * Tarball at http://ceph.newdream.net/download/ceph-0.48.2.tar.gz * For Debian/Ubuntu packages, see http://ceph.newdream.net/docs/master/install/debian -- Oliver Francke filoo GmbH Moltkestraße 25a 0 Gütersloh HRB4355 AG Gütersloh Managing Directors: S.Grewing | J.Rehpöhler | C.Kunz Follow us on Twitter: http://twitter.com/filoogmbh
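For keeping an eye on how bad the stalls get, here is a tiny parser for the [WRN] lines quoted earlier in this thread. The line format is taken verbatim from the log excerpt; this is a sketch for exactly that shape of warning, not a general ceph log parser:

```python
import re

SLOW_RE = re.compile(
    r'(?P<osd>osd\.\d+) \[WRN\] (?P<count>\d+) slow requests.*?'
    r'oldest blocked for (?P<oldest>[\d.]+) secs')

def parse_slow(line):
    """Extract (osd, request_count, oldest_blocked_seconds) from a
    'slow requests' warning line, or None if the line doesn't match."""
    m = SLOW_RE.search(line)
    if not m:
        return None
    return m.group('osd'), int(m.group('count')), float(m.group('oldest'))

line = ('2012-10-01 11:58:46.766999 osd.10 [WRN] 38 slow requests, '
        '1 included below; oldest blocked for 1189.312605 secs')
print(parse_slow(line))  # ('osd.10', 38, 1189.312605)
```

Feeding `ceph -w` output through something like this makes it easy to alert once the oldest blocked age passes a threshold, instead of noticing stalled VMs first.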
Re: v0.48.2 argonaut update released
Well, On 01.10.2012 at 18:07, Sage Weil s...@inktank.com wrote: On Mon, 1 Oct 2012, Oliver Francke wrote: Hi *, with reference to the below-mentioned objecter: misc fixes for op reordering, I assumed it could have something to do with slow requests not being resolved for too long. I just didn't see it anymore in 0.51 in our testing environment. But today we took one of our nodes into maintenance, and I see many of: --- 8 2012-10-01 11:58:46.766999 osd.10 [WRN] 38 slow requests, 1 included below; oldest blocked for 1189.312605 secs 2012-10-01 11:58:46.767013 osd.10 [WRN] slow request 240.183032 seconds old, received at 2012-10-01 11:54:46.583860: osd_op(client.110046.0:2143984 rb.0.1adf5.6733efe2.061a [write 208384~4096] 6.f511e801) v4 currently delayed 2 --- 8- --- which is bad, as I assume that some of the VMs are now stalled. Has anybody else experienced such things? You see this on v0.51? For v0.52 we merged in a large series of messenger fixes and cleanups that could very easily explain this. Can you try v0.52? just checking… up to now no slow request has lasted longer than 30… seconds. The same series has not been merged into the argonaut stable series; I'm unsure yet whether that's a good idea (it's a lot of refactoring mixed in with the fixes). Perhaps after it has proven itself in v0.52 for longer, and/or if we get reports of msgr problems in argonaut deployments. I think most people are just happy campers these days… Not so destructive as myself ;) sage Oliver. Thnx n regards, Oliver. On 09/20/2012 06:52 PM, Sage Weil wrote: Hi everyone, Another update to the stable argonaut series has been released. This fixes a few important bugs in rbd and radosgw and includes a series of changes to upstart and deployment related scripts that will allow the upcoming 'ceph-deploy' tool to work with the argonaut release. Upgrading: * The default search path for keyring files now includes /etc/ceph/ceph.$name.keyring. 
If such files are present on your cluster, be aware that by default they may now be used. * There are several changes to the upstart init files. These have not been previously documented or recommended. Any existing users should review the changes before upgrading. * The ceph-disk-prepare and ceph-disk-activate scripts have been updated significantly. These have not been previously documented or recommended. Any existing users should review the changes before upgrading. Notable changes include: * mkcephfs: fix keyring generation for mds, osd when default paths are used * radosgw: fix bug causing occasional corruption of per-bucket stats * radosgw: workaround to avoid previously corrupted stats from going negative * radosgw: fix bug in usage stats reporting on busy buckets * radosgw: fix Content-Range: header for objects bigger than 2 GB. * rbd: avoid leaving watch acting when command line tool errors out (avoids 30s delay on subsequent operations) * rbd: friendlier use of pool/image options for import (old calling convention still works) * librbd: fix rare snapshot creation race (could lose a snap when creation is concurrent) * librbd: fix discard handling when spanning holes * librbd: fix memory leak on discard when caching is enabled * objecter: misc fixes for op reordering * objecter: fix for rare startup-time deadlock waiting for osdmap * ceph: fix usage * mon: reduce log noise about check_sub * ceph-disk-activate: misc fixes, improvements * ceph-disk-prepare: partition and format osd disks automatically * upstart: start everyone on a reboot * upstart: always update the osd crush location on start if specified in the config * config: add /etc/ceph/ceph.$name.keyring to default keyring search path * ceph.spec: don't package crush headers You can get this release from the usual locations: * Git at git://github.com/ceph/ceph.git * Tarball at http://ceph.newdream.net/download/ceph-0.48.2.tar.gz * For Debian/Ubuntu packages, see 
http://ceph.newdream.net/docs/master/install/debian -- Oliver Francke filoo GmbH Moltkestraße 25a 0 Gütersloh HRB4355 AG Gütersloh Managing Directors: S.Grewing | J.Rehpöhler | C.Kunz Follow us on Twitter: http://twitter.com/filoogmbh
OSD-crash on 0.48.1argonaut, error void ReplicatedPG::recover_got(hobject_t, eversion_t) not seen on list
Hi all, after adding a new node into our ceph-cluster yesterday, we had a crash of one OSD. I have found this kind of message in the bugtracker as being solved (http://tracker.newdream.net/issues/2075); I will update this one for my convenience and attach the according log (due to it being a productive site, there is no more verbose debug available, sorry). Other than that, everything went almost smoothly, except the annoying slow requests, which are hopefully not only fixed in 0.51, ... when do we expect the next stable, btw? The replication was fast, due to an SSD-cached LSI controller, 4 OSDs per node, one per HDD; 1 Gbit was completely saturated, time for the next step towards 10 Gbit ;) Regards, Oliver. -- Oliver Francke filoo GmbH Moltkestraße 25a 0 Gütersloh HRB4355 AG Gütersloh Managing Directors: S.Grewing | J.Rehpöhler | C.Kunz Follow us on Twitter: http://twitter.com/filoogmbh
Re: v0.48.1 argonaut stable update released
Well, On 08/14/2012 09:29 PM, Sage Weil wrote: On Tue, 14 Aug 2012, Oliver Francke wrote: Hi Sage, I just updated to debian-testing/0.50 this afternoon, after some hint: * osd: better tracking of recent slow operations This is actually about the admin socket command to dump operations in flight (more useful information is reported for diagnosis/debugging). and it is hereby confirmed to be better in my testing environment. Before I had requests which could be there for 480 seconds… not any more. That's great news! That is probably Sam's refactor of the OSD threading at work. There were also a few bugs fixed in 0.48.1 that were causing somewhat similar symptoms (ops blocked indefinitely) due to peering problems, but that doesn't sound like it's the same thing. How's about this fix in 0.48.X? It's a huge set of changes, and definitely won't go into the 0.48 series, sorry! (In fact, the pending change was one motivation for doing 0.48 when we did.) It will be in bobtail, though, which is probably about a month away from freeze. Please let us know what your experience is like with 0.50 (and beyond). the more detailed picture is: it works and is stable, so far no problems with my torture-tests. Sporadically I see a line à la: --- 8- --- delete error: image still has watchers This means the image is still open or the client using it crashed. Try again after closing/unmapping it or waiting 30s for the crashed client to timeout. 2012-08-15 15:57:22.072729 7f9fe82a2760 -1 librbd: error removing header: (16) Device or resource busy --- 8- --- even from long-ago stopped VMs. Regards, Oliver. Thanks! sage Thnx in @vance, Oliver - Thus being too lazy to read all change logs - Francke. On 14.08.2012 at 20:18, Sage Weil s...@inktank.com wrote: We've built and pushed the first update to the argonaut stable release. This branch has a range of small fixes for stability, compatibility, and performance, but no major changes in functionality. 
The stability fixes are particularly important for large clusters with many OSDs, and for network environments where intermittent network failures are more common. The highlights include: * mkcephfs: use default `keyring', `osd data', `osd journal' paths when not specified in conf * msgr: various fixes to socket error handling * osd: reduce scrub overhead * osd: misc peering fixes (past_interval sharing, pgs stuck in `peering' states) * osd: fail on EIO in read path (do not silently ignore read errors from failing disks) * osd: avoid internal heartbeat errors by breaking some large transactions into pieces * osd: fix osdmap catch-up during startup (catch up and then add daemon to osdmap) * osd: fix spurious `misdirected op' messages * osd: report scrub status via `pg # query' * rbd: fix race when watch registrations are resent * rbd: fix rbd image id assignment scheme (new image data objects have slightly different names) * rbd: fix perf stats for cache hit rate * rbd tool: fix off-by-one in key name (crash when empty key specified) * rbd: more robust udev rules * rados tool: copy object, pool commands * radosgw: fix in usage stats trimming * radosgw: misc compatibility fixes (date strings, ETag quoting, swift headers, etc.) * ceph-fuse: fix locking in read/write paths * mon: fix rare race corrupting on-disk data * config: fix admin socket `config set' command * log: fix in-memory log event gathering * debian: remove crush headers, include librados-config * rpm: add ceph-disk-{activate, prepare} The fix for the radosgw usage trimming is incompatible with v0.48 (which was effectively broken). You now need to use the v0.48.1 version of radosgw-admin to initiate usage stats trimming. There are a range of smaller bug fixes as well. For a complete list of what went into this release, please see the release notes and changelog. 
You can get this stable update from the usual locations: * Git at git://github.com/ceph/ceph.git * Tarball at http://ceph.newdream.net/download/ceph-0.48.1.tar.gz * For Debian/Ubuntu packages, see http://ceph.newdream.net/docs/master/install/debian -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- Oliver Francke filoo GmbH Moltkestraße 25a 0 Gütersloh HRB4355 AG Gütersloh Geschäftsführer: S.Grewing | J.Rehpöhler | C.Kunz Folgen Sie uns auf Twitter: http://twitter.com/filoogmbh
Re: v0.48.1 argonaut stable update released
Hi Sage, I just updated to debian-testing/0.50 this afternoon, after some hint: * osd: better tracking of recent slow operations and it is hereby confirmed to be better in my testing environment. Before I had requests, which could be there for 480 seconds… not any more. How's about this fix in 0.48.X? Thnx in advance, Oliver - Thus being too lazy to read all change logs - Francke. Am 14.08.2012 um 20:18 schrieb Sage Weil s...@inktank.com: We've built and pushed the first update to the argonaut stable release. This branch has a range of small fixes for stability, compatibility, and performance, but no major changes in functionality. The stability fixes are particularly important for large clusters with many OSDs, and for network environments where intermittent network failures are more common. The highlights include: * mkcephfs: use default `keyring', `osd data', `osd journal' paths when not specified in conf * msgr: various fixes to socket error handling * osd: reduce scrub overhead * osd: misc peering fixes (past_interval sharing, pgs stuck in `peering' states) * osd: fail on EIO in read path (do not silently ignore read errors from failing disks) * osd: avoid internal heartbeat errors by breaking some large transactions into pieces * osd: fix osdmap catch-up during startup (catch up and then add daemon to osdmap) * osd: fix spurious `misdirected op' messages * osd: report scrub status via `pg # query' * rbd: fix race when watch registrations are resent * rbd: fix rbd image id assignment scheme (new image data objects have slightly different names) * rbd: fix perf stats for cache hit rate * rbd tool: fix off-by-one in key name (crash when empty key specified) * rbd: more robust udev rules * rados tool: copy object, pool commands * radosgw: fix in usage stats trimming * radosgw: misc compatibility fixes (date strings, ETag quoting, swift headers, etc.) 
* ceph-fuse: fix locking in read/write paths * mon: fix rare race corrupting on-disk data * config: fix admin socket `config set' command * log: fix in-memory log event gathering * debian: remove crush headers, include librados-config * rpm: add ceph-disk-{activate, prepare} The fix for the radosgw usage trimming is incompatible with v0.48 (which was effectively broken). You now need to use the v0.48.1 version of radosgw-admin to initiate usage stats trimming. There are a range of smaller bug fixes as well. For a complete list of what went into this release, please see the release notes and changelog. You can get this stable update from the usual locations: * Git at git://github.com/ceph/ceph.git * Tarball at http://ceph.newdream.net/download/ceph-0.48.1.tar.gz * For Debian/Ubuntu packages, see http://ceph.newdream.net/docs/master/install/debian
Some findings on 0.48, qemu-1.0.1 eating up RBD-write-cache memory
Hi *, as I have read many postings from users using qemu, too, I would like them to keep an eye on memory consumption. I'm with qemu-1.0.1 and qemu-1.1.0-1 and linux-kernel 3.4.2/3.5.0-rc2. If I restart a VM from cold, I do some readings, up to memory being fully used (cache/buffers), that is, VM started with: -m 1024 and I can see RSS of 1.1g in top. After doing some normal IOps testing with: spew -v --raw -P -t -i 5 -b 4k -p random -B 4k 2G /tmp/doof.dat so a 2G file, tested for IOps-performance with 4k blocks I get a pretty good value for 5x write/read-after-write: Total iterations: 5 Total runtime: 00:04:43 Total write transfer time (WTT): 00:02:15 Total write transfer rate (WTR): 77480.53 KiB/s Total write IOPS: 19370.13 IOPS Total read transfer time (RTT): 00:01:40 Total read transfer rate (RTR): 103823.12 KiB/s Total read IOPS: 25955.78 IOPS but at the cost of approx. 400 MiB more memory used, showing now 1.5g. Though it's not proportional, after the next run I get 1.6g, then the process slows down... another two runs and we break the 1.7g border... But with the following settings in the global section of ceph.conf: rbd_cache = true rbd_cache_size=16777216 rbd_cache_max_dirty=8388608 rbd_cache_target_dirty=4194304 I cannot see why we should waste 500+ MiB of memory ;) (multiplied with approx. 100 VM's running). If the same VM is started with: :rbd_cache=false everything stays as it should. Anybody with a similar setup willing to do some testing? Other than that: fast and stable release, it seems ;) Thnx in advance, Oliver.
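For reference, here are the cache settings quoted above collected into a ceph.conf fragment (values exactly as in the mail; whether they belong in [global] or a [client] section depends on your ceph version, so treat this as a sketch):

```ini
[global]
    rbd_cache = true
    rbd_cache_size = 16777216          ; 16 MiB total cache per image
    rbd_cache_max_dirty = 8388608      ; at most 8 MiB dirty data
    rbd_cache_target_dirty = 4194304   ; start writeback at 4 MiB dirty
```

With a 16 MiB cap per image, RSS growth of 400+ MiB per VM would clearly exceed what these settings should allow, which is the point of the report.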
Re: all rbd users: set 'filestore fiemap = false'
Hi Sage, On 06/18/2012 06:02 AM, Sage Weil wrote: If you are using RBD, and want to avoid potential image corruption, add filestore fiemap = false to the [osd] section of your ceph.conf and restart your OSDs. as far as this heals some trouble, but I frankly don't understand... We've tracked down the source of some corruption to racy/buggy FIEMAP ioctl behavior. The RBD client (when caching is disabled--the default) uses a 'sparse read' operation that the OSD implements by doing an fsync on the object file, mapping which extents are allocated, and sending only that data over the wire. We have observed incorrect/changing FIEMAP on both btrfs: fsync fiemap returns mapping time passes, no modifications to file fiemap returns different mapping ... that even an initial start of a VM leads to corruption of the read data? I get s/t like: --- 8< --- Loading, please wait /sbin/init: relocation error: ... not defined in file libc.so.6... [ 0.81...] Kernel panic - not syncing: Attempted to kill init! --- 8< --- host-kernel is now 3.4.1 + qemu-1.0.1, but shows failures with other kernel/qemu-versions, too. Keeping fingers crossed for Josh, though ;-) Give me a shout, if I can do some debugging, regards, Oliver. Josh is still tracking down which kernels and file systems are affected; fortunately it is relatively easy to reproduce with the test_librbd_fsx tool. In the meantime, the (mis)feature can be safely disabled. It will default to off in 0.48. It is unclear whether it's really much of a performance win anyway. Thanks! 
sage
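Sage's workaround, restated as a concrete config fragment (the option name is exactly as given in the mail):

```ini
[osd]
    ; disable the racy FIEMAP-based sparse reads that were traced
    ; to image corruption; defaults to off from 0.48 onwards
    filestore fiemap = false
```

Each ceph-osd daemon must be restarted after this is added for the setting to take effect.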
Re: Random data corruption in VM, possibly caused by rbd
Hi Guido, yeah, there is something weird going on. I just started to establish some test-VM's. Freshly imported from running *.qcow2 images. Kernel panic with INIT, seg-faults and other funny stuff. Just added the rbd_cache=true in my config, voila. All is fast-n-up-n-running... All my testing was done with cache enabled... Since our errors all came from rbd_writeback from former ceph-versions... Josh? Sage? Help?! Oliver. On 06/08/2012 02:55 PM, Guido Winkelmann wrote: Am Donnerstag, 7. Juni 2012, 12:48:05 schrieben Sie: On 06/07/2012 11:04 AM, Guido Winkelmann wrote: Hi, I'm using Ceph with RBD to provide network-transparent disk images for KVM-based virtual servers. The last two days, I've been hunting some weird elusive bug where data in the virtual machines would be corrupted in weird ways. It usually manifests in files having some random data - usually zeroes - at the start before the actual contents that should be in there start. I definitely want to figure out what's going on with this. A few questions: Are you using rbd caching? If so, what settings? In either case, does the corruption still occur if you switch caching on/off? There are different I/O paths here, and this might tell us if the problem is on the client side. Okay, I've tried enabling rbd caching now, and so far, the problem appears to be gone. I am using libvirt for starting and managing the virtual machines, and what I did was change the source element for the virtual disk from source protocol='rbd' name='rbd/name_of_image' to source protocol='rbd' name='rbd/name_of_image:rbd_cache=true' and then restart the VM. (I found that in one of your mails on this list; there does not appear to be any proper documentation on this...) The iotester does not find any corruptions with these settings. The VM is still horribly broken, but that's probably lingering filesystem damage from yesterday. I'll try with a fresh image next. I did not change anything else in the setup. 
In particular, the OSDs still use btrfs. One of the OSDs has been restarted, though. I will run another test with a VM without rbd caching, to make sure it wasn't by random chance restarting that one osd that made the real difference. Enabling rbd caching did not appear to make any difference wrt performance, but that's probably because my tests mostly create sustained sequential IO, for which caches are generally not very helpful. Enabling rbd caching is not a solution I particularly like, for two reasons: 1. In my setup, migrating VMs from one host to another is a normal part of operation, and I still don't know how to prevent data corruption (in the form of silently lost writes) when combining rbd caching and migration. 2. I'm not really looking into speeding up a single VM, I'm really more interested in just how many VMs I can run before performance starts degrading for everyone, and I don't think rbd caching will help with that. Regards, Guido
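For anyone else flipping the cache flag via libvirt, the change Guido describes is just appending the option to the image name in the source element. A sketch of the relevant disk definition (the surrounding disk/driver/target attributes are illustrative and vary by setup):

```xml
<disk type='network' device='disk'>
  <driver name='qemu'/>
  <!-- before: plain image name -->
  <!-- <source protocol='rbd' name='rbd/name_of_image'/> -->
  <!-- after: rbd_cache enabled via the image-name suffix -->
  <source protocol='rbd' name='rbd/name_of_image:rbd_cache=true'/>
  <target dev='vda' bus='virtio'/>
</disk>
```

The VM has to be restarted (not just the XML redefined) for the new source string to take effect.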
Re: Random data corruption in VM, possibly caused by rbd
Well then, quite busy, too with some other stuff, but... On 06/08/2012 04:50 PM, Josh Durgin wrote: On 06/08/2012 06:55 AM, Sage Weil wrote: On Fri, 8 Jun 2012, Oliver Francke wrote: Hi Guido, yeah, there is something weird going on. I just started to establish some test-VM's. Freshly imported from running *.qcow2 images. Kernel panic with INIT, seg-faults and other funny stuff. Just added the rbd_cache=true in my config, voila. All is fast-n-up-n-running... All my testing was done with cache enabled... Since our errors all came from rbd_writeback from former ceph-versions... Are you guys able to reproduce the corruption with 'debug osd = 20' and 'debug ms = 1'? Ideally we'd like to: - reproduce from a fresh vm, with osd logs - identify the bad file - map that file to a block offset (see http://ceph.com/qa/fiemap.[ch], linux_fiemap.h) - use that to identify the badness in the log a logfile with debugging is available at our local store... I suspect the cache is just masking the problem because it submits fewer IOs... The cache also doesn't do sparse reads. Is it still reproducible with a fresh vm when you set filestore_fiemap_threshold = 0 for the osds, and run without rbd caching? restarted OSDs with this setting, but without rbd_cache I still get errors. *sigh* Oliver. Josh sage Josh? Sage? Help?! Oliver. On 06/08/2012 02:55 PM, Guido Winkelmann wrote: Am Donnerstag, 7. Juni 2012, 12:48:05 schrieben Sie: On 06/07/2012 11:04 AM, Guido Winkelmann wrote: Hi, I'm using Ceph with RBD to provide network-transparent disk images for KVM- based virtual servers. The last two days, I've been hunting some weird elusive bug where data in the virtual machines would be corrupted in weird ways. It usually manifests in files having some random data - usually zeroes - at the start before the actual contents that should be in there start. I definitely want to figure out what's going on with this. A few questions: Are you using rbd caching? If so, what settings? 
In either case, does the corruption still occur if you switch caching on/off? There are different I/O paths here, and this might tell us if the problem is on the client side. Okay, I've tried enabling rbd caching now, and so far, the problem appears to be gone. I am using libvirt for starting and managing the virtual machines, and what I did was change the source element for the virtual disk from source protocol='rbd' name='rbd/name_of_image' to source protocol='rbd' name='rbd/name_of_image:rbd_cache=true' and then restart the VM. (I found that in one of your mails on this list; there does not appear to be any proper documentation on this...) The iotester does not find any corruptions with these settings. The VM is still horribly broken, but that's probably lingering filesystem damage from yesterday. I'll try with a fresh image next. I did not change anything else in the setup. In particular, the OSDs still use btrfs. One of the OSDs has been restarted, though. I will run another test with a VM without rbd caching, to make sure it wasn't by random chance restarting that one osd that made the real difference. Enabling rbd caching did not appear to make any difference wrt performance, but that's probably because my tests mostly create sustained sequential IO, for which caches are generally not very helpful. Enabling rbd caching is not a solution I particularly like, for two reasons: 1. In my setup, migrating VMs from one host to another is a normal part of operation, and I still don't know how to prevent data corruption (in the form of silently lost writes) when combining rbd caching and migration. 2. I'm not really looking into speeding up a single VM, I'm really more interested in just how many VMs I can run before performance starts degrading for everyone, and I don't think rbd caching will help with that. 
Regards, Guido
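The debug settings Josh asks for when reproducing ("debug osd = 20" and "debug ms = 1"), written out as a ceph.conf fragment (the config-file form; they can also be changed at runtime via the admin socket):

```ini
[osd]
    debug osd = 20    ; verbose osd logging for recovery/scrub analysis
    debug ms = 1      ; log every message sent/received
```

These settings generate very large logs, so they are typically enabled only while reproducing an issue and reverted afterwards.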
Re: Random data corruption in VM, possibly caused by rbd
Hi Guido, unfortunately this sounds very familiar to me. We have been on a long road with similar weird errors. Our setup is something like: start a couple of VM's (qemu-*), let them create a 1G-file each and randomly seek and write 4MB blocks filled with md5sums of the block as payload, to be verifiable after completely written. Furthermore create some 1 files every-now-and-then and try to remove them after the verify-run. This produced the same things you are experiencing - zero'ed blocks - with the main difference that my tests are now clean with 0.47-2 and friends. After a couple of hundreds of runs. Our setup is with XFS as OSD-data partition, as we had too many errors with btrfs in the past. My assumption now would be that there is some relation to your filesystem…?! Would be cool if you are able to change your setup to XFS. At least that would be a starting-point for further investigations. Regards, Oliver. Am 07.06.2012 um 20:04 schrieb Guido Winkelmann: Hi, I'm using Ceph with RBD to provide network-transparent disk images for KVM-based virtual servers. The last two days, I've been hunting some weird elusive bug where data in the virtual machines would be corrupted in weird ways. It usually manifests in files having some random data - usually zeroes - at the start before the actual contents that should be in there start. To track this down, I wrote a simple io tester. It does the following: - Create 1 Megabyte of random data - Calculate the SHA256 hash of that data - Write the data to a file on the harddisk, in a given directory, using the hash as the filename - Repeat until the disk is full - Delete the last file (because it is very likely to be incompletely written) - Read and delete all the files just written while checking that their sha256 sums are equal to their filenames When running this io tester in a VM that uses a qcow2 file on a local harddisk for its virtual disk, no errors are found. 
When the same VM is running using rbd, the io tester finds on average about one corruption every 200 Megabytes, reproducibly. (As an interesting aside, the io tester also prints how long it took to read or write 100 MB, and it turns out reading the data back in again is about three times slower than writing it in the first place...) Ceph is version 0.47.2. Qemu KVM is 1.0, compiled with the spec file from http://pkgs.fedoraproject.org/gitweb/?p=qemu.git;a=summary (And compiled after ceph 0.47.2 was installed on that machine, so it would use the correct headers...) Both the Ceph cluster and the KVM host machines are running on Fedora 16, with a fairly recent 3.3.x kernel. The ceph cluster uses btrfs for the osd's data dirs. The journal is on a tmpfs. (This is not a production setup - luckily.) The virtual machine is using ext4 as its filesystem. There were no obvious other problems with either the ceph cluster or the KVM host machines. I have attached a copy of the ceph.conf in use, in case it might be helpful. This is a huge problem, and any help in tracking it down would be much appreciated. Regards, Guido
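Guido's io tester is easy to reconstruct from his description. Here is a minimal sketch in Python (my own reimplementation, not his actual tool): it fills a directory with files of random data named after their SHA-256 hash, then reads each file back and verifies that its hash still matches its name.

```python
import hashlib
import os

CHUNK = 1024 * 1024  # 1 MiB of random data per file, as in the description

def write_files(directory, count):
    """Write `count` files of random data, each named after its SHA-256 hash."""
    for _ in range(count):
        data = os.urandom(CHUNK)
        name = hashlib.sha256(data).hexdigest()
        with open(os.path.join(directory, name), "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # make sure the data actually hits the disk

def verify_files(directory):
    """Read every file back, check hash == filename, delete it.

    Returns the list of filenames whose contents no longer match their name,
    i.e. the corrupted files."""
    corrupt = []
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        with open(path, "rb") as f:
            if hashlib.sha256(f.read()).hexdigest() != name:
                corrupt.append(name)
        os.remove(path)
    return corrupt
```

The real tool writes until the disk is full and discards the last (likely truncated) file; this sketch takes an explicit file count instead so it can run anywhere.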
Re: q. about rbd-header
Hi Josh, On 03/14/2012 10:59 PM, Josh Durgin wrote: On 03/14/2012 01:49 PM, Oliver Francke wrote: Well, nobody able to shed some light on this? Did some math and found out how to fill the size bytes. Sorry I didn't respond faster. But, one question never got answered: - why is - with busy VMs - frequently the first block affected, with the result of damaged grub-loaders/partition-tables/filesystems? Is this some NULL/zero pointer thingy in case of ceph-failure? My guess is that this is not the first object affected, but it's where the loss of an object is most easily noticeable - if an object doesn't exist, it's treated as being full of zeros, which might go undetected for a long time if it's e.g. some temp or log file that's not reread and verified. well, I responded to Sage with some more infos from one of the images where the header is missing... Did not want to bother the list ;) If you want some broken images… we have many of them to investigate, unfortunately. We'd really like to find the root cause of the problem. One possibility is some bad interaction between osds running different versions. This caused one issue with recovery stxShadow saw yesterday, for example (http://tracker.newdream.net/issues/2132). Had you been doing rolling upgrades of osds before these problems appeared? If so, do you know which versions you had running concurrently? Are your osds often restarting? What we'd need to diagnose this are osd logs during recovery with: debug osd = 20 debug ms = 1 Once you detect the problem, a log from each replica storing the pg the bad/missing object is in should be enough. And just to make sure, you aren't writing to these rbd images from multiple places, right? This wouldn't cause the missing header objects, but is likely to cause corruption of the image data. This could happen, for example, by rolling an image back to a snapshot while a vm is running on it. Currently we don't use snapshots. 
And of course we ensure a VM is running only once at a time ;-) And we had some rolling upgrade, but this was _after_ trouble/crashes occurred. Oliver. Josh Maybe this sounds a bit harsh, after the 5th night-shift trying to repair images and keep customers calm, I think this is forgivable. Oliver. Am 14.03.2012 um 16:05 schrieb Oliver Francke: Hey, anybody out there who could explain the structure of a rbd-header? After the last crash we have about 10 images with a: 2012-03-14 15:22:47.998790 7f45a61e3760 librbd: Error reading header: 2 No such file or directory error opening image vm-266-disk-1.rbd: 2 No such file or directory ... error? I understand the rb.x.y-prefix, the 2 ^ 16hex as block-size. But the size/count encoding is not intuitive ;) Besides one file, where I created a header and put it via rados put back into the pool, and got some files back, many of the other images with lost headers have different sizes. We got bad luck again, too many crashed VM's, too much data-loss... Comments welcome ;) Oliver.
q. about rbd-header
Hey, anybody out there who could explain the structure of a rbd-header? After the last crash we have about 10 images with a: 2012-03-14 15:22:47.998790 7f45a61e3760 librbd: Error reading header: 2 No such file or directory error opening image vm-266-disk-1.rbd: 2 No such file or directory ... error? I understand the rb.x.y-prefix, the 2 ^ 16hex as block-size. But the size/count encoding is not intuitive ;) Besides one file, where I created a header and put it via rados put back into the pool, and got some files back, many of the other images with lost headers have different sizes. We got bad luck again, too many crashed VM's, too much data-loss... Comments welcome ;) Oliver.
Re: q. about rbd-header
Well, nobody able to shed some light on this? Did some math and found out how to fill the size bytes. But, one question never got answered: - why is - with busy VMs - frequently the first block affected, with the result of damaged grub-loaders/partition-tables/filesystems? Is this some NULL/zero pointer thingy in case of ceph-failure? If you want some broken images… we have many of them to investigate, unfortunately. Maybe this sounds a bit harsh, after the 5th night-shift trying to repair images and keep customers calm, I think this is forgivable. Oliver. Am 14.03.2012 um 16:05 schrieb Oliver Francke: Hey, anybody out there who could explain the structure of a rbd-header? After the last crash we have about 10 images with a: 2012-03-14 15:22:47.998790 7f45a61e3760 librbd: Error reading header: 2 No such file or directory error opening image vm-266-disk-1.rbd: 2 No such file or directory ... error? I understand the rb.x.y-prefix, the 2 ^ 16hex as block-size. But the size/count encoding is not intuitive ;) Besides one file, where I created a header and put it via rados put back into the pool, and got some files back, many of the other images with lost headers have different sizes. We got bad luck again, too many crashed VM's, too much data-loss... Comments welcome ;) Oliver.
Still inconsistent pg's, ceph-osd crashes reliably after trying to repair
Hi *, after some crashes we still had to care for some remaining inconsistencies reported via ceph -w and friends. Well, we traced one of them down via ceph pg dump and we picked 79. pg=79.7 and found the corresponding file in the /var/log/ceph/osd.2.log. /data/osd4/current/79.7_head/rb.0.0.136c__head_9FB2FA17 and the dup on /data/osd2/... Strange though, they had the same checksum but reported a stat-error. Anyway. Decided to do a: ceph pg repair 79.7 ... byebye ceph-osd on node2! Here the trace: === 8< === 2012-03-01 17:49:13.024571 7f3944584700 -- 10.10.10.14:6802/4892 10.10.10.10:6802/19139 pipe(0xfcd2c80 sd=16 pgs=0 cs=0 l=0).connect protocol version mismatch, my 9 != 0 2012-03-01 17:49:23.674162 7f395001b700 log [ERR] : 79.7 osd.4: soid 9fb2fa17/rb.0.0.136c/headextra attr _, extra attr snapset 2012-03-01 17:49:23.674222 7f395001b700 log [ERR] : 79.7 repair 0 missing, 1 inconsistent objects *** Caught signal (Aborted) ** in thread 7f395001b700 ceph version 0.42-142-gc9416e6 (commit:c9416e6184905501159e96115f734bdf65a74d28) 1: /usr/bin/ceph-osd() [0x5a6b89] 2: (()+0xeff0) [0x7f3960ca5ff0] 3: (gsignal()+0x35) [0x7f395f2841b5] 4: (abort()+0x180) [0x7f395f286fc0] 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f395fb18dc5] 6: (()+0xcb166) [0x7f395fb17166] 7: (()+0xcb193) [0x7f395fb17193] 8: (()+0xcb28e) [0x7f395fb1728e] 9: (ceph::buffer::list::iterator::copy(unsigned int, char*)+0x13e) [0x67c5ce] 10: (object_info_t::decode(ceph::buffer::list::iterator)+0x2c) [0x61663c] 11: (PG::repair_object(hobject_t const, ScrubMap::object*, int, int)+0x3be) [0x68d96e] 12: (PG::scrub_finalize()+0x1438) [0x6b8568] 13: (OSD::ScrubFinalizeWQ::_process(PG*)+0xc) [0x588edc] 14: (ThreadPool::worker()+0xa26) [0x5bc426] 15: (ThreadPool::WorkThread::entry()+0xd) [0x585f0d] 16: (()+0x68ca) [0x7f3960c9d8ca] 17: (clone()+0x6d) [0x7f395f32186d] 2012-03-01 17:49:30.017269 7f81b662b780 ceph version 0.42-142-gc9416e6 (commit:c9416e6184905501159e96115f734bdf65a74d28), process ceph-osd, 
pid 3111 2012-03-01 17:49:30.085426 7f81b662b780 filestore(/data/osd2) mount FIEMAP ioctl is NOT supported 2012-03-01 17:49:30.085466 7f81b662b780 filestore(/data/osd2) mount did NOT detect btrfs 2012-03-01 17:49:30.110409 7f81b662b780 filestore(/data/osd2) mount found snaps 2012-03-01 17:49:30.110476 7f81b662b780 filestore(/data/osd2) mount: enabling WRITEAHEAD journal mode: btrfs not detected 2012-03-01 17:49:31.964977 7f81b662b780 journal _open /dev/sdc1 fd 16: 10737942528 bytes, block size 4096 bytes, directio = 1, aio = 0 2012-03-01 17:49:31.967549 7f81b662b780 journal read_entry 929464 : seq 67841857 11225 bytes === 8< === ... after some journal-replay things calmed down, but: 2012-03-01 17:58:29.470446 log 2012-03-01 17:58:24.242369 osd.2 10.10.10.14:6801/3111 368 : [WRN] bad locator @56 on object @79 loc @56 op osd_op(client.44350.0:1412387 rb.0.0.136c [write 2465792~49152] 56.9fb2fa17) v4 these types of messages we see ever so often... It corresponds, but in what way? Can't we assume, if both snipplets rb.0.0... are identical, that life's good? We had some other inconsistencies, where we had to delete the whole pool to get rid of crappy blocks. The ceph-osd died, too, after doing some rbd rm pool/image the one block in question remained, visible via rados ls -p pool Any idea, or better clue? ;-) Kind reg's, Oliver.
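The sequence Oliver describes, written out as a command walkthrough (pg id 79.7 and the object path are taken from the log above; adjust for your own cluster, and note that on this ceph version the repair itself is what crashed the osd):

```
# watch cluster state for inconsistent pgs as they are reported
ceph -w

# dump all pgs and pick out the inconsistent one (79.7 here)
ceph pg dump

# locate the backing file for the object on each replica's osd, e.g.
#   /data/osd4/current/79.7_head/rb.0.0.136c__head_9FB2FA17
# compare checksums and xattrs (getfattr) across the replicas, then:
ceph pg repair 79.7

# leftover objects after an 'rbd rm pool/image' can be listed with:
rados ls -p pool
```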
Re: Still inconsistent pg's, ceph-osd crashes reliably after trying to repair
Well, Am 01.03.2012 um 18:15 schrieb Oliver Francke: Hi *, after some crashes we still had to care for some remaining inconsistencies reported via ceph -w and friends. Well, we traced one of them down via ceph pg dump and we picked 79. pg=79.7 and found the corresponding file in the /var/log/ceph/osd.2.log. /data/osd4/current/79.7_head/rb.0.0.136c__head_9FB2FA17 and the dup on /data/osd2/... Strange though, they had the same checksum but reported a stat-error. Anyway. Decided to do a: ceph pg repair 79.7 ... byebye ceph-osd on node2! Here the trace: === 8< === 2012-03-01 17:49:13.024571 7f3944584700 -- 10.10.10.14:6802/4892 10.10.10.10:6802/19139 pipe(0xfcd2c80 sd=16 pgs=0 cs=0 l=0).connect protocol version mismatch, my 9 != 0 2012-03-01 17:49:23.674162 7f395001b700 log [ERR] : 79.7 osd.4: soid 9fb2fa17/rb.0.0.136c/headextra attr _, extra attr snapset one clarification we found ourselves: one copy is missing the xattrs (checked via getfattr), but why can't it be corrected, and worse, why does this crash happen? 
2012-03-01 17:49:23.674222 7f395001b700 log [ERR] : 79.7 repair 0 missing, 1 inconsistent objects *** Caught signal (Aborted) ** in thread 7f395001b700 ceph version 0.42-142-gc9416e6 (commit:c9416e6184905501159e96115f734bdf65a74d28) 1: /usr/bin/ceph-osd() [0x5a6b89] 2: (()+0xeff0) [0x7f3960ca5ff0] 3: (gsignal()+0x35) [0x7f395f2841b5] 4: (abort()+0x180) [0x7f395f286fc0] 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f395fb18dc5] 6: (()+0xcb166) [0x7f395fb17166] 7: (()+0xcb193) [0x7f395fb17193] 8: (()+0xcb28e) [0x7f395fb1728e] 9: (ceph::buffer::list::iterator::copy(unsigned int, char*)+0x13e) [0x67c5ce] 10: (object_info_t::decode(ceph::buffer::list::iterator)+0x2c) [0x61663c] 11: (PG::repair_object(hobject_t const, ScrubMap::object*, int, int)+0x3be) [0x68d96e] 12: (PG::scrub_finalize()+0x1438) [0x6b8568] 13: (OSD::ScrubFinalizeWQ::_process(PG*)+0xc) [0x588edc] 14: (ThreadPool::worker()+0xa26) [0x5bc426] 15: (ThreadPool::WorkThread::entry()+0xd) [0x585f0d] 16: (()+0x68ca) [0x7f3960c9d8ca] 17: (clone()+0x6d) [0x7f395f32186d] 2012-03-01 17:49:30.017269 7f81b662b780 ceph version 0.42-142-gc9416e6 (commit:c9416e6184905501159e96115f734bdf65a74d28), process ceph-osd, pid 3111 2012-03-01 17:49:30.085426 7f81b662b780 filestore(/data/osd2) mount FIEMAP ioctl is NOT supported 2012-03-01 17:49:30.085466 7f81b662b780 filestore(/data/osd2) mount did NOT detect btrfs 2012-03-01 17:49:30.110409 7f81b662b780 filestore(/data/osd2) mount found snaps 2012-03-01 17:49:30.110476 7f81b662b780 filestore(/data/osd2) mount: enabling WRITEAHEAD journal mode: btrfs not detected 2012-03-01 17:49:31.964977 7f81b662b780 journal _open /dev/sdc1 fd 16: 10737942528 bytes, block size 4096 bytes, directio = 1, aio = 0 2012-03-01 17:49:31.967549 7f81b662b780 journal read_entry 929464 : seq 67841857 11225 bytes === 8- === ... 
after some journal-replay things calmed down, but: 2012-03-01 17:58:29.470446 log 2012-03-01 17:58:24.242369 osd.2 10.10.10.14:6801/3111 368 : [WRN] bad locator @56 on object @79 loc @56 op osd_op(client.44350.0:1412387 rb.0.0.136c [write 2465792~49152] 56.9fb2fa17) v4 These types of messages we see ever so often... It corresponds, but in what way? Can't we assume, if both snippets rb.0.0... are identical, that life's good? We had some other inconsistencies, where we had to delete the whole pool to get rid of crappy blocks. The ceph-osd died, too; after doing some rbd rm pool/image the one block in question remained, visible via rados ls -p pool. Any idea, or better, a clue? ;-) Kind reg's, Oliver. -- Oliver Francke filoo GmbH Moltkestraße 25a 0 Gütersloh HRB4355 AG Gütersloh Geschäftsführer: S.Grewing | J.Rehpöhler | C.Kunz Folgen Sie uns auf Twitter: http://twitter.com/filoogmbh -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Recommended number of pools, one Q. ever wanted to ask
Hi *, well, there was once a comment on our layout in means of too many pools. Our setup is to have a pool per customer, to simplify the view on used storage capacity. So, if we have - in a couple of months, we hope - more than some hundred customers, this setup was not recommended, cause the whole system is not designed for handling that. ( Sage) What does not recommended mean? Is it that per OSD the used memory will be too high? Is this a general performance issue? Well, if we read pool, this gave us the basic idea/concept to put all per-customer data into it. Please shed some light 8-) Kind regards, Oliver.
Re: Recommended number of pools, one Q. ever wanted to ask
Well, On 02/28/2012 10:42 AM, Wido den Hollander wrote: Hi, On 02/28/2012 10:35 AM, Oliver Francke wrote: Hi *, well, there was once a comment on our layout in means of too many pools. Our setup is to have a pool per customer, to simplify the view on used storage capacity. So, if we have - in a couple of months, we hope - more than some hundred customers, this setup was not recommended, cause the whole system is not designed for handling that. ( Sage) What does not recommended mean? Is it that per OSD the used memory will be too high? Yes. Every new pool you create will consume some memory on the OSD. So if you start creating a lot of pools, you will also start consuming more and more memory. I haven't followed this lately, but that is the current information I have. The number of objects in a pool is also not a problem, you can have millions without any issues. It's the number of pools which will haunt you later on. Thnx for the quick reply; so if we can imagine that the number of pools per OSD is the limiting factor, and we shall not have more than, let's say, ~100, that means we shall be safe. Wido best regards, Oliver. Is this a general performance issue? Well, if we read pool, this gave us the basic idea/concept to put all per-customer data into it. Please shed some light 8-) Kind regards, Oliver.
Re: Problem with inconsistent PG
Well, On 17.02.2012 at 18:54, Sage Weil wrote: On Fri, 17 Feb 2012, Oliver Francke wrote: Well then, found it via the ceph osd dump via the pool-id, thanks. The affected customer opened a ticket this morning for not being able to boot his VM after shutdown. So I had to do some testdisk/fsck and tar the content into a new image. I hope there are no other bad blocks not being visible as inconsistencies. As these faulty images were easily detected, the boot-block being affected: how big is the chance that there are more rb..-fragments being corrupted within an image, in reference to what you mentioned below: ...transactions to leak across checkpoint/snapshot boundaries. Do we have a chance to detect it? I fear not, cause it will perhaps only be visible while doing a fsck inside the VM?! It is hard to say. There is a small chance that it will trigger any time ceph-osd is restarted. The bug is fixed in the next release (which should be out today), but of course upgrading involves shutting down :(. Alternatively, you can cherry-pick the fixes, 1009d1a016f049e19ad729a0c00a354a3956caf7 and 93d7ef96316f30d3d7caefe07a5a747ce883ca2d. v0.42 includes some encoding changes that mean you can upgrade but you can't downgrade again. (These encoding changes are being made so that in the future, you _can_ downgrade.) Here's what I suggest: - don't restart any ceph-osds if you can help it - wait for v0.42 to come out, and wait until Monday at least - pause read/write traffic to the cluster with ceph osd pause - wait at least 30 seconds for osds to do a commit without any load. this makes it extremely unlikely you'd trigger the bug. - upgrade to v0.42, or restart with a patched ceph-osd. - unpause io with ceph osd unpause That sounds reasonable, cool stuff ;-) Thnx again, Oliver. sage Anyway, thanks for your help and best regards, Oliver.
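Sage's suggested sequence can be written down as a small script sketch. No live cluster is assumed here, so the plan is only printed; the init-script invocation for restarting with the patched ceph-osd is an assumption for that era, not something from the thread:

```shell
# Ordered plan: quiesce io, let osds commit, swap binaries, resume.
plan=$(printf '%s\n' \
    'ceph osd pause' \
    'sleep 30' \
    '/etc/init.d/ceph restart osd' \
    'ceph osd unpause')
echo "$plan" | sed 's/^/would run: /'
```

Dropping the `would run:` prefix and executing the lines in order reproduces the pause / commit-quietly / upgrade / unpause procedure from the mail.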
On 16.02.2012 at 19:02, Sage Weil wrote: On Thu, 16 Feb 2012, Oliver Francke wrote: Hi Sage, thnx for the quick response, On 16.02.2012 at 18:17, Sage Weil wrote: On Thu, 16 Feb 2012, Oliver Francke wrote: Hi Sage, *, your tip with truncating from below did not solve the problem. Just to recap: we had two inconsistencies, which we could break down to something like: rb.0.0.__head_DA680EE2 according to the ceph dump from below. Walking to the node with the OSD mounted on /data/osd3 for example, a stupid find … brings up a couple of them, so the pg number is relevant too - makes sense - we went into, let's say, /data/osd3/current/84.2_head/ and did a hex dump of the file; it really looked like the head, in means of signs of an installed grub-loader, but a corrupted partition-table. On other of these files one could do a fdisk -l file and at least a partition-table could have been found. Two days later we got a customer's big complaint about not being able to boot his VM anymore. The point now is: from such a file with name and pg, how can we identify the real file it is associated with, cause there is another customer with a potential problem with the next reboot (second inconsistency). We also had some VMs in a big test-phase with similar problems… grub going into rescue-prompt, invalid/corrupted partition tables, so all in the first head-file? Would be cool to get some more infos… and shed some light into the structures (myself not really being a good code-reader anymore ;) ). 'head' in this case means the object hasn't been COWed (snapshotted and then overwritten), and means it's the first 4MB block of the rbd image/disk. yes, true, Were you able to use the 'rbd info' in the previous email to identify which image it is? Is that what you mean by 'identify the real file'? that's the point, from the object I would like to identify the complete image location ala: pool/image from there I'd know which customer's rbd disk-image is affected.
For pool, look at the pgid, in this case '109.6'. 109 is the pool id. Look at the pool list from 'ceph osd dump' output to see which pool name that is. For the image, rb.0.0 is the image prefix. Look at each rbd image in that pool, and check for the image whose prefix matches. e.g., for img in `rbd -p poolname list` ; do rbd info $img -p poolname | grep -q rb.0.0 && echo found $img ; done BTW, are you creating a pool per customer here? You need to be a little bit careful about creating large numbers of pools; the system isn't really designed to be used that way. You should use a pool if you have a distinct data placement requirement (e.g., put these objects on this set of ceph-osds). But because of the way things work internally, creating hundreds/thousands of them won't be very efficient. sage Thnx for your patience, Oliver. I'm not sure I understand exactly what your question is. I would
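Put together, Sage's lookup can be sketched as below. Splitting the pool id out of the pgid is plain string handling; the image scan itself needs a reachable cluster, so those commands are shown commented out, with poolname as a placeholder:

```shell
# The pool id is everything before the dot in the pgid.
pgid="109.6"
pool_id=${pgid%%.*}
echo "pool id: $pool_id"

# Resolve the pool name from the osdmap, then scan each image's prefix
# for rb.0.0 (requires a live cluster; poolname is a placeholder):
#   ceph osd dump | grep "^pool $pool_id "
#   for img in $(rbd -p poolname list); do
#       rbd info "$img" -p poolname | grep -q 'rb.0.0' && echo "found $img"
#   done
```

The matching image then gives the pool/image pair, which is exactly the "which customer's rbd disk-image is affected" answer asked for above.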
Re: Problem with inconsistent PG
Hi Sage, *, your tip with truncating from below did not solve the problem. Just to recap: we had two inconsistencies, which we could break down to something like: rb.0.0.__head_DA680EE2 according to the ceph dump from below. Walking to the node with the OSD mounted on /data/osd3 for example, a stupid find … brings up a couple of them, so the pg number is relevant too - makes sense - we went into, let's say, /data/osd3/current/84.2_head/ and did a hex dump of the file; it really looked like the head, in means of signs of an installed grub-loader, but a corrupted partition-table. On other of these files one could do a fdisk -l file and at least a partition-table could have been found. Two days later we got a customer's big complaint about not being able to boot his VM anymore. The point now is: from such a file with name and pg, how can we identify the real file it is associated with, cause there is another customer with a potential problem with the next reboot (second inconsistency). We also had some VMs in a big test-phase with similar problems… grub going into rescue-prompt, invalid/corrupted partition tables, so all in the first head-file? Would be cool to get some more infos… and shed some light into the structures (myself not really being a good code-reader anymore ;) ). Thanks in advance and kind regards, Oliver. On 13.02.2012 at 18:13, Sage Weil wrote: On Sun, 12 Feb 2012, Jens Rehpoehler wrote: Hi list, today i've got another problem.
ceph -w shows up with an inconsistent PG over night: 2012-02-10 08:38:48.701775 pg v441251: 1982 pgs: 1981 active+clean, 1 active+clean+inconsistent; 1790 GB data, 3368 GB used, 18977 GB / 22345 GB avail 2012-02-10 08:38:49.702789 pg v441252: 1982 pgs: 1981 active+clean, 1 active+clean+inconsistent; 1790 GB data, 3368 GB used, 18977 GB / 22345 GB avail I've identified it with ceph pg dump - | grep inconsistent 109.6 141 0 0 0 463820288 111780 111780 active+clean+inconsistent 485'7115 480'7301 [3,4] [3,4] 485'7061 2012-02-10 08:02:12.043986 Now I've tried to repair it with: ceph pg repair 109.6 2012-02-10 08:35:52.276325 mon <- [pg,repair,109.6] 2012-02-10 08:35:52.276776 mon.1 -> 'instructing pg 109.6 on osd.3 to repair' (0) but i only get the following result: 2012-02-10 08:36:18.447553 log 2012-02-10 08:36:08.455420 osd.3 10.10.10.8:6801/25980 6913 : [ERR] 109.6 osd.4: soid 1ef398ce/rb.0.0.00bd/head size 2736128 != known size 3145728 2012-02-10 08:36:18.447553 log 2012-02-10 08:36:08.455426 osd.3 10.10.10.8:6801/25980 6914 : [ERR] 109.6 scrub 0 missing, 1 inconsistent objects 2012-02-10 08:36:18.447553 log 2012-02-10 08:36:08.455799 osd.3 10.10.10.8:6801/25980 6915 : [ERR] 109.6 scrub 1 errors Can someone please explain to me what to do in this case and how to recover the pg? So the fix is just to truncate the file to the expected size, 3145728, by finding it in the current/ directory. The name/path will be slightly weird; look for 'rb.0.0.00bd'. The data is still suspect, though. Did the ceph-osd restart or crash recently? I would do that, repair (it should succeed), and then fsck the file system in that rbd image. We just fixed a bug that was causing transactions to leak across checkpoint/snapshot boundaries.
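The truncate step is easy to rehearse on a scratch file before touching the real object under current/; the sizes below are the ones from the scrub error, and GNU truncate(1) is assumed:

```shell
# Rehearse the repair on a throwaway file: a 2736128-byte "short"
# object grown back to the expected 3145728 bytes with zero padding.
tmp=$(mktemp)
head -c 2736128 /dev/zero > "$tmp"   # simulate the short replica
truncate -s 3145728 "$tmp"           # grow to the known size
wc -c < "$tmp"                       # prints 3145728
rm -f "$tmp"
```

On the real osd the same `truncate -s 3145728` would be run against the rb.0.0.00bd file found under the pg's current/ directory, followed by `ceph pg repair` and a fsck inside the rbd image, as Sage describes.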
That could be responsible for causing all sorts of subtle corruptions, including this one. It'll be included in v0.42 (out next week). sage Hi Sage, no ... the osd didn't crash. I had to do some hardware maintenance and pushed it out of distribution with ceph osd out 3. After a short while i used /etc/init.d/ceph stop on that osd. Then, after my work, i've started ceph and pushed it into the distribution with ceph osd in 3. For the bug I'm worried about, stopping the daemon and crashing are equivalent. In both cases, a transaction may have been only partially included in the checkpoint. Could you please tell me if this is the right way to get an osd out for maintenance? Is there any other thing i should do to keep data consistent? You followed the right procedure. There is (hopefully, was!) just a bug. sage My structure is: 3 MDS/MON servers on separate hardware nodes and 3 OSD nodes, each with a total capacity of 8 TB. Journaling is done on a separate SSD per node. The whole thing is a data store for a kvm virtualisation farm. The farm is accessing the data directly per rbd. Thank you Jens
Re: Problem with inconsistent PG
Hi Sage, thnx for the quick response, On 16.02.2012 at 18:17, Sage Weil wrote: On Thu, 16 Feb 2012, Oliver Francke wrote: Hi Sage, *, your tip with truncating from below did not solve the problem. Just to recap: we had two inconsistencies, which we could break down to something like: rb.0.0.__head_DA680EE2 according to the ceph dump from below. Walking to the node with the OSD mounted on /data/osd3 for example, a stupid find … brings up a couple of them, so the pg number is relevant too - makes sense - we went into, let's say, /data/osd3/current/84.2_head/ and did a hex dump of the file; it really looked like the head, in means of signs of an installed grub-loader, but a corrupted partition-table. On other of these files one could do a fdisk -l file and at least a partition-table could have been found. Two days later we got a customer's big complaint about not being able to boot his VM anymore. The point now is: from such a file with name and pg, how can we identify the real file it is associated with, cause there is another customer with a potential problem with the next reboot (second inconsistency). We also had some VMs in a big test-phase with similar problems… grub going into rescue-prompt, invalid/corrupted partition tables, so all in the first head-file? Would be cool to get some more infos… and shed some light into the structures (myself not really being a good code-reader anymore ;) ). 'head' in this case means the object hasn't been COWed (snapshotted and then overwritten), and means it's the first 4MB block of the rbd image/disk. yes, true, Were you able to use the 'rbd info' in the previous email to identify which image it is? Is that what you mean by 'identify the real file'? that's the point, from the object I would like to identify the complete image location ala: pool/image from there I'd know which customer's rbd disk-image is affected. Thnx for your patience, Oliver. I'm not sure I understand exactly what your question is.
I would have expected modifying the file with fdisk -l to work (if fdisk sees a valid partition table, it should be able to write it too). sage Thanks in advance and kind regards, Oliver. On 13.02.2012 at 18:13, Sage Weil wrote: On Sun, 12 Feb 2012, Jens Rehpoehler wrote: Hi list, today i've got another problem. ceph -w shows up with an inconsistent PG over night: 2012-02-10 08:38:48.701775 pg v441251: 1982 pgs: 1981 active+clean, 1 active+clean+inconsistent; 1790 GB data, 3368 GB used, 18977 GB / 22345 GB avail 2012-02-10 08:38:49.702789 pg v441252: 1982 pgs: 1981 active+clean, 1 active+clean+inconsistent; 1790 GB data, 3368 GB used, 18977 GB / 22345 GB avail I've identified it with ceph pg dump - | grep inconsistent 109.6 141 0 0 0 463820288 111780 111780 active+clean+inconsistent 485'7115 480'7301 [3,4] [3,4] 485'7061 2012-02-10 08:02:12.043986 Now I've tried to repair it with: ceph pg repair 109.6 2012-02-10 08:35:52.276325 mon <- [pg,repair,109.6] 2012-02-10 08:35:52.276776 mon.1 -> 'instructing pg 109.6 on osd.3 to repair' (0) but i only get the following result: 2012-02-10 08:36:18.447553 log 2012-02-10 08:36:08.455420 osd.3 10.10.10.8:6801/25980 6913 : [ERR] 109.6 osd.4: soid 1ef398ce/rb.0.0.00bd/head size 2736128 != known size 3145728 2012-02-10 08:36:18.447553 log 2012-02-10 08:36:08.455426 osd.3 10.10.10.8:6801/25980 6914 : [ERR] 109.6 scrub 0 missing, 1 inconsistent objects 2012-02-10 08:36:18.447553 log 2012-02-10 08:36:08.455799 osd.3 10.10.10.8:6801/25980 6915 : [ERR] 109.6 scrub 1 errors Can someone please explain to me what to do in this case and how to recover the pg? So the fix is just to truncate the file to the expected size, 3145728, by finding it in the current/ directory. The name/path will be slightly weird; look for 'rb.0.0.00bd'.
The data is still suspect, though. Did the ceph-osd restart or crash recently? I would do that, repair (it should succeed), and then fsck the file system in that rbd image. We just fixed a bug that was causing transactions to leak across checkpoint/snapshot boundaries. That could be responsible for causing all sorts of subtle corruptions, including this one. It'll be included in v0.42 (out next week). sage Hi Sage, no ... the osd didn't crash. I had to do some hardware maintenance and pushed it out of distribution with ceph osd out 3. After a short while i used /etc/init.d/ceph stop on that osd. Then, after my work, i've started ceph and pushed it into the distribution with ceph osd in 3. For the bug I'm worried about, stopping the daemon