Re: [ceph-users] cuttlefish countdown -- OSD doesn't get marked out
Hi David,

did you test it with more than one rack as well? In my first report I used two racks with a custom crushmap, so that the replicas end up in the two racks (replication level = 2). Then I took one OSD down and expected that the remaining OSDs in that rack would receive the now-missing replicas from the OSDs of the other rack. But nothing happened; the cluster stayed degraded.

-martin

On 26.04.2013 02:22, David Zafman wrote:
I filed tracker bug 4822 and have wip-4822 with a fix. My manual testing shows that it works. I'm building a teuthology test. Given that your osd tree has a single rack, it should always mark OSDs out after 5 minutes by default.

David Zafman
Senior Developer
http://www.inktank.com

On Apr 25, 2013, at 9:38 AM, Martin Mailand mar...@tuxadero.com wrote:

Hi Sage,

On 25.04.2013 18:17, Sage Weil wrote:
What is the output from 'ceph osd tree' and the contents of your [mon*] sections of ceph.conf? Thanks!
sage

root@store1:~# ceph osd tree
# id  weight  type name           up/down  reweight
-1    24      root default
-3    24        rack unknownrack
-2    4           host store1
0     1             osd.0         up       1
1     1             osd.1         down     1
2     1             osd.2         up       1
3     1             osd.3         up       1
-4    4           host store3
10    1             osd.10        up       1
11    1             osd.11        up       1
8     1             osd.8         up       1
9     1             osd.9         up       1
-5    4           host store4
12    1             osd.12        up       1
13    1             osd.13        up       1
14    1             osd.14        up       1
15    1             osd.15        up       1
-6    4           host store5
16    1             osd.16        up       1
17    1             osd.17        up       1
18    1             osd.18        up       1
19    1             osd.19        up       1
-7    4           host store6
20    1             osd.20        up       1
21    1             osd.21        up       1
22    1             osd.22        up       1
23    1             osd.23        up       1
-8    4           host store2
4     1             osd.4         up       1
5     1             osd.5         up       1
6     1             osd.6         up       1
7     1             osd.7         up       1

[global]
auth cluster requierd = none
auth service required = none
auth client required = none
# log file =
log_max_recent=100
log_max_new=100

[mon]
mon data = /data/mon.$id

[mon.a]
mon host = store1
mon addr = 192.168.195.31:6789

[mon.b]
mon host = store3
mon addr = 192.168.195.33:6789

[mon.c]
mon host = store5
mon addr = 192.168.195.35:6789

___
ceph-users mailing list
ceph-us...@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
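For reference, a two-rack layout like the one Martin describes is usually expressed with a CRUSH rule that splits replicas across racks. A hypothetical sketch only (the rule name and numbers are made up; this is not Martin's actual crushmap):

```
# Hypothetical CRUSH rule: with replication level 2,
# "chooseleaf firstn 0 type rack" places one replica per rack, so when an
# OSD fails, its replica must be re-created on the surviving OSDs of the
# same rack (copied from the peer replica in the other rack).
rule two_racks {
        ruleset 1
        type replicated
        min_size 1
        max_size 2
        step take default
        step chooseleaf firstn 0 type rack
        step emit
}
```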
Re: cuttlefish countdown -- OSD doesn't get marked out
Hi,

if I shut down an OSD, the OSD gets marked down after 20 seconds; after 300 seconds the OSD should get marked out and the cluster should resync. But that doesn't happen: the OSD stays in the state down/in forever, and therefore the cluster stays degraded forever. I can reproduce it with a newly installed cluster. If I manually set the OSD out (ceph osd out 1), the cluster resync starts immediately.

I think that's a release-critical bug, because the cluster health is not recovered automatically. I reported this behaviour a while ago: http://article.gmane.org/gmane.comp.file-systems.ceph.user/603/

-martin

Log:

root@store1:~# ceph -s
   health HEALTH_OK
   monmap e1: 3 mons at {a=192.168.195.31:6789/0,b=192.168.195.33:6789/0,c=192.168.195.35:6789/0}, election epoch 82, quorum 0,1,2 a,b,c
   osdmap e204: 24 osds: 24 up, 24 in
   pgmap v106709: 5056 pgs: 5056 active+clean; 526 GB data, 1068 GB used, 173 TB / 174 TB avail
   mdsmap e1: 0/0/1 up

root@store1:~# ceph --version
ceph version 0.60 (f26f7a39021dbf440c28d6375222e21c94fe8e5c)

root@store1:~# /etc/init.d/ceph stop osd.1
=== osd.1 ===
Stopping Ceph osd.1 on store1...bash: warning: setlocale: LC_ALL: cannot change locale (en_GB.utf8)
kill 5492...done

root@store1:~# ceph -s
   health HEALTH_OK
   monmap e1: 3 mons at {a=192.168.195.31:6789/0,b=192.168.195.33:6789/0,c=192.168.195.35:6789/0}, election epoch 82, quorum 0,1,2 a,b,c
   osdmap e204: 24 osds: 24 up, 24 in
   pgmap v106709: 5056 pgs: 5056 active+clean; 526 GB data, 1068 GB used, 173 TB / 174 TB avail
   mdsmap e1: 0/0/1 up

root@store1:~# date -R
Thu, 25 Apr 2013 13:09:54 +0200

root@store1:~# ceph -s; date -R
   health HEALTH_WARN 423 pgs degraded; 423 pgs stuck unclean; recovery 10999/269486 degraded (4.081%); 1/24 in osds are down
   monmap e1: 3 mons at {a=192.168.195.31:6789/0,b=192.168.195.33:6789/0,c=192.168.195.35:6789/0}, election epoch 82, quorum 0,1,2 a,b,c
   osdmap e206: 24 osds: 23 up, 24 in
   pgmap v106715: 5056 pgs: 4633 active+clean, 423 active+degraded; 526 GB data, 1068 GB used, 173 TB / 174 TB avail; 10999/269486 degraded (4.081%)
   mdsmap e1: 0/0/1 up
Thu, 25 Apr 2013 13:10:14 +0200

root@store1:~# ceph -s; date -R
   health HEALTH_WARN 423 pgs degraded; 423 pgs stuck unclean; recovery 10999/269486 degraded (4.081%); 1/24 in osds are down
   monmap e1: 3 mons at {a=192.168.195.31:6789/0,b=192.168.195.33:6789/0,c=192.168.195.35:6789/0}, election epoch 82, quorum 0,1,2 a,b,c
   osdmap e206: 24 osds: 23 up, 24 in
   pgmap v106719: 5056 pgs: 4633 active+clean, 423 active+degraded; 526 GB data, 1068 GB used, 173 TB / 174 TB avail; 10999/269486 degraded (4.081%)
   mdsmap e1: 0/0/1 up
Thu, 25 Apr 2013 13:23:01 +0200

On 25.04.2013 01:46, Sage Weil wrote:

Hi everyone-

We are down to a handful of urgent bugs (3!) and a cuttlefish release date that is less than a week away. Thank you to everyone who has been involved in coding, testing, and stabilizing this release. We are close!

If you would like to test the current release candidate, your efforts would be much appreciated! For deb systems, you can do

wget -q -O- 'https://ceph.com/git/?p=ceph.git;a=blob_plain;f=keys/autobuild.asc' | sudo apt-key add -
echo deb http://gitbuilder.ceph.com/ceph-deb-$(lsb_release -sc)-x86_64-basic/ref/next $(lsb_release -sc) main | sudo tee /etc/apt/sources.list.d/ceph.list

For rpm users you can find packages at

http://gitbuilder.ceph.com/ceph-rpm-centos6-x86_64-basic/ref/next/
http://gitbuilder.ceph.com/ceph-rpm-fc17-x86_64-basic/ref/next/
http://gitbuilder.ceph.com/ceph-rpm-fc18-x86_64-basic/ref/next/

A draft of the release notes is up at

http://ceph.com/docs/master/release-notes/#v0-61

Let me know if I've missed anything!

sage
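The out-marking behaviour Martin expects is governed by the monitor option `mon osd down out interval`, which defaults to 300 seconds. As a toy illustration only (this is not the actual Ceph monitor code), the decision he is waiting on reduces to:

```python
# Toy model of the monitor's down -> out decision (illustration only, not
# the real Ceph monitor code): an OSD that has been down longer than
# "mon osd down out interval" should be marked out and trigger recovery,
# unless the cluster-wide "noout" flag suppresses that.

DOWN_OUT_INTERVAL = 300  # seconds; the default Martin relies on

def should_mark_out(seconds_down, noout=False, interval=DOWN_OUT_INTERVAL):
    if noout:
        return False  # operator asked to skip rebalancing
    return seconds_down > interval

print(should_mark_out(20))               # just marked down -> False
print(should_mark_out(310))              # past the interval -> True
print(should_mark_out(310, noout=True))  # suppressed -> False
```

In Martin's report the cluster behaves as if this interval never fires; the wip-4822 fix mentioned in this thread addressed exactly this mark-out path.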
Re: cuttlefish countdown -- OSD doesn't get marked out
Hi Sage,

On 25.04.2013 18:17, Sage Weil wrote:
What is the output from 'ceph osd tree' and the contents of your [mon*] sections of ceph.conf? Thanks!
sage

root@store1:~# ceph osd tree
# id  weight  type name           up/down  reweight
-1    24      root default
-3    24        rack unknownrack
-2    4           host store1
0     1             osd.0         up       1
1     1             osd.1         down     1
2     1             osd.2         up       1
3     1             osd.3         up       1
-4    4           host store3
10    1             osd.10        up       1
11    1             osd.11        up       1
8     1             osd.8         up       1
9     1             osd.9         up       1
-5    4           host store4
12    1             osd.12        up       1
13    1             osd.13        up       1
14    1             osd.14        up       1
15    1             osd.15        up       1
-6    4           host store5
16    1             osd.16        up       1
17    1             osd.17        up       1
18    1             osd.18        up       1
19    1             osd.19        up       1
-7    4           host store6
20    1             osd.20        up       1
21    1             osd.21        up       1
22    1             osd.22        up       1
23    1             osd.23        up       1
-8    4           host store2
4     1             osd.4         up       1
5     1             osd.5         up       1
6     1             osd.6         up       1
7     1             osd.7         up       1

[global]
auth cluster requierd = none
auth service required = none
auth client required = none
# log file =
log_max_recent=100
log_max_new=100

[mon]
mon data = /data/mon.$id

[mon.a]
mon host = store1
mon addr = 192.168.195.31:6789

[mon.b]
mon host = store3
mon addr = 192.168.195.33:6789

[mon.c]
mon host = store5
mon addr = 192.168.195.35:6789
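One detail worth noting in the quoted [global] section: the first auth line is spelled "requierd" rather than "required", and a misspelled option name is silently ignored. A corrected sketch of just that section (values otherwise as Martin posted them):

```ini
[global]
; note the spelling: "required", not "requierd"
auth cluster required = none
auth service required = none
auth client required = none
```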
Re: [ceph-users] Cluster Map Problems
Hi,

I still have this problem in v0.60. If I stop one OSD, the OSD gets set down after 20 seconds. But after 300 seconds the OSD does not get set out; therefore the cluster stays degraded forever. I can reproduce it with a freshly created cluster.

root@store1:~# ceph -s
   health HEALTH_WARN 405 pgs degraded; 405 pgs stuck unclean; recovery 10603/259576 degraded (4.085%); 1/24 in osds are down
   monmap e1: 3 mons at {a=192.168.195.31:6789/0,b=192.168.195.33:6789/0,c=192.168.195.35:6789/0}, election epoch 10, quorum 0,1,2 a,b,c
   osdmap e150: 24 osds: 23 up, 24 in
   pgmap v12028: 4800 pgs: 4395 active+clean, 405 active+degraded; 505 GB data, 1017 GB used, 173 TB / 174 TB avail; 0B/s rd, 6303B/s wr, 2op/s; 10603/259576 degraded (4.085%)
   mdsmap e1: 0/0/1 up

-martin

On 28.03.2013 23:45, John Wilkins wrote:
Martin, I'm just speculating: since I just rewrote the networking section and there is an empty mon_host value, and I do recall a chat last week where mon_host was considered a different setting now, maybe you might try specifying:

[mon.a]
mon host = store1
mon addr = 192.168.195.31:6789

etc. for monitors. I'm assuming that's not the case, but I want to make sure my docs are right on this point.

On Thu, Mar 28, 2013 at 3:24 PM, Martin Mailand mar...@tuxadero.com wrote:
Hi John, my ceph.conf is a bit further down in this email. -martin

On 28.03.2013 23:21, John Wilkins wrote:
Martin, would you mind posting your Ceph configuration file too? I don't see any value set for mon_host.

On Thu, Mar 28, 2013 at 1:04 PM, Martin Mailand mar...@tuxadero.com wrote:
Hi Greg, the dump from mon.a is attached. -martin

On 28.03.2013 20:55, Gregory Farnum wrote:
Hmm. The monitor code for checking this all looks good to me. Can you go to one of your monitor nodes and dump the config? (http://ceph.com/docs/master/rados/configuration/ceph-conf/?highlight=admin%20socket#viewing-a-configuration-at-runtime) -Greg

On Thu, Mar 28, 2013 at 12:33 PM, Martin Mailand mar...@tuxadero.com wrote:
Hi, I get the same behaviour on a newly created cluster as well, with no changes to the cluster config at all. I stop osd.1; after 20 seconds it gets marked down, but it never gets marked out. ceph version 0.59 (cbae6a435c62899f857775f66659de052fb0e759) -martin

On 28.03.2013 19:48, John Wilkins wrote:
Martin, Greg is talking about noout. With Ceph, you can specifically preclude OSDs from being marked out when down to prevent rebalancing--e.g., during upgrades, short-term maintenance, etc. http://ceph.com/docs/master/rados/operations/troubleshooting-osd/#stopping-w-out-rebalancing

On Thu, Mar 28, 2013 at 11:12 AM, Martin Mailand mar...@tuxadero.com wrote:
Hi Greg, setting the OSD out manually triggered the recovery. But now the question is: why is the OSD not marked out after 300 seconds? That's a default cluster; I use the 0.59 build from your site, and I didn't change any value except for the crushmap. That's my ceph.conf.

-martin

[global]
auth cluster requierd = none
auth service required = none
auth client required = none
# log file =
log_max_recent=100
log_max_new=100

[mon]
mon data = /data/mon.$id

[mon.a]
host = store1
mon addr = 192.168.195.31:6789

[mon.b]
host = store3
mon addr = 192.168.195.33:6789

[mon.c]
host = store5
mon addr = 192.168.195.35:6789

[osd]
journal aio = true
osd data = /data/osd.$id
osd mount options btrfs = rw,noatime,nodiratime,autodefrag
osd mkfs options btrfs = -n 32k -l 32k

[osd.0]
host = store1
osd journal = /dev/sdg1
btrfs devs = /dev/sdc

[osd.1]
host = store1
osd journal = /dev/sdh1
btrfs devs = /dev/sdd

[osd.2]
host = store1
osd journal = /dev/sdi1
btrfs devs = /dev/sde

[osd.3]
host = store1
osd journal = /dev/sdj1
btrfs devs = /dev/sdf

[osd.4]
host = store2
osd journal = /dev/sdg1
btrfs devs = /dev/sdc

[osd.5]
host = store2
osd journal = /dev/sdh1
btrfs devs = /dev/sdd

[osd.6]
host = store2
osd journal = /dev/sdi1
btrfs devs = /dev/sde

[osd.7]
host = store2
osd journal = /dev/sdj1
btrfs devs = /dev/sdf

[osd.8]
host = store3
osd journal = /dev/sdg1
btrfs devs = /dev/sdc

[osd.9]
host = store3
osd journal = /dev/sdh1
btrfs devs = /dev/sdd

[osd.10]
host = store3
osd journal = /dev/sdi1
btrfs devs = /dev/sde

[osd.11]
host = store3
osd journal = /dev/sdj1
btrfs devs = /dev/sdf

[osd.12]
host = store4
osd journal = /dev/sdg1
btrfs devs = /dev/sdc
Re: [ceph-users] Mon crash
Hi Joao,

thanks for catching that up.

-martin

On 28.03.2013 20:03, Joao Eduardo Luis wrote:
Hi Martin, as John said in his reply, these should be reported to ceph-devel (CC'ing). Anyway, this is bug #4519 [1]. It was introduced after 0.58, released under 0.59, and is already fixed in master. As far as we can tell, only when using auth none will anyone using 0.59 stumble upon it. -Joao
osdc/ObjectCacher.cc: 834: FAILED assert(ob->last_commit_tid < tid)
Hi List,

I can reproduce this assertion reliably; how can I help to debug it?

-martin

(Reading database ... 52246 files and directories currently installed.)
Preparing to replace linux-firmware 1.79 (with .../linux-firmware_1.79.1_all.deb) ...
Unpacking replacement linux-firmware ...

osdc/ObjectCacher.cc: In function 'void ObjectCacher::bh_write_commit(int64_t, sobject_t, loff_t, uint64_t, tid_t, int)' thread 7f72b7fff700 time 2013-02-14 16:04:48.867285
osdc/ObjectCacher.cc: 834: FAILED assert(ob->last_commit_tid < tid)
ceph version 0.56.2 (586538e22afba85c59beda49789ec42024e7a061)
 1: (ObjectCacher::bh_write_commit(long, sobject_t, long, unsigned long, unsigned long, int)+0xd68) [0x7f72d4050848]
 2: (ObjectCacher::C_WriteCommit::finish(int)+0x6b) [0x7f72d405742b]
 3: (Context::complete(int)+0xa) [0x7f72d400f9ba]
 4: (librbd::C_Request::finish(int)+0x85) [0x7f72d403f145]
 5: (Context::complete(int)+0xa) [0x7f72d400f9ba]
 6: (librbd::rados_req_cb(void*, void*)+0x47) [0x7f72d40241b7]
 7: (librados::C_AioSafe::finish(int)+0x1d) [0x7f72d33db16d]
 8: (Finisher::finisher_thread_entry()+0x1c0) [0x7f72d3444e50]
 9: (()+0x7e9a) [0x7f72d03c7e9a]
 10: (clone()+0x6d) [0x7f72d00f4cbd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
terminate called after throwing an instance of 'ceph::FailedAssertion'
Aborted
Re: osdc/ObjectCacher.cc: 834: FAILED assert(ob->last_commit_tid < tid)
Hi Sage,

everything is on 0.56.2 and the cluster is healthy. I can reproduce it with an apt-get upgrade within the VM; the VM OS is 12.04. Most of the time the assertion happens when the firmware .deb is updated - see the log in my first email. But I use a custom-built qemu version (1.4-rc1), which was built against 0.56.2.

root@store1:~# ceph -s
   health HEALTH_OK
   monmap e1: 1 mons at {a=192.168.195.33:6789/0}, election epoch 1, quorum 0 a
   osdmap e160: 20 osds: 20 up, 20 in
   pgmap v28314: 3264 pgs: 3264 active+clean; 437 GB data, 1027 GB used, 144 TB / 145 TB avail
   mdsmap e1: 0/0/1 up

root@store1:~# ceph --version
ceph version 0.56.2 (586538e22afba85c59beda49789ec42024e7a061)

root@compute4:~# dpkg -l | grep 'rbd\|rados\|qemu'
ii  librados2    0.56.2-1precise    RADOS distributed object store client library
ii  librbd1      0.56.2-1precise    RADOS block device client library
ii  qemu-common  1.4.0-rc1-vdsp1.0  qemu common functionality (bios, documentation, etc)
ii  qemu-kvm     1.4.0-rc1-vdsp1.0  Full virtualization on i386 and amd64 hardware
ii  qemu-utils   1.4.0-rc1-vdsp1.0  qemu utilities

-martin

On 14.02.2013 18:18, Sage Weil wrote:

Hi Martin-

On Thu, 14 Feb 2013, Martin Mailand wrote:
Hi List, I can reproduce this assertion reliably; how can I help to debug it?

Can you describe the workload? Are the OSDs also running 0.56.2(+)? Any other activity on the server side (data migration, OSD failure, etc.) that may have contributed? We just reopened http://tracker.ceph.com/issues/2947 to track this. I'm working on reproducing it now as well. Thanks!

sage

-martin

(Reading database ... 52246 files and directories currently installed.)
Preparing to replace linux-firmware 1.79 (with .../linux-firmware_1.79.1_all.deb) ...
Unpacking replacement linux-firmware ...

osdc/ObjectCacher.cc: In function 'void ObjectCacher::bh_write_commit(int64_t, sobject_t, loff_t, uint64_t, tid_t, int)' thread 7f72b7fff700 time 2013-02-14 16:04:48.867285
osdc/ObjectCacher.cc: 834: FAILED assert(ob->last_commit_tid < tid)
ceph version 0.56.2 (586538e22afba85c59beda49789ec42024e7a061)
 1: (ObjectCacher::bh_write_commit(long, sobject_t, long, unsigned long, unsigned long, int)+0xd68) [0x7f72d4050848]
 2: (ObjectCacher::C_WriteCommit::finish(int)+0x6b) [0x7f72d405742b]
 3: (Context::complete(int)+0xa) [0x7f72d400f9ba]
 4: (librbd::C_Request::finish(int)+0x85) [0x7f72d403f145]
 5: (Context::complete(int)+0xa) [0x7f72d400f9ba]
 6: (librbd::rados_req_cb(void*, void*)+0x47) [0x7f72d40241b7]
 7: (librados::C_AioSafe::finish(int)+0x1d) [0x7f72d33db16d]
 8: (Finisher::finisher_thread_entry()+0x1c0) [0x7f72d3444e50]
 9: (()+0x7e9a) [0x7f72d03c7e9a]
 10: (clone()+0x6d) [0x7f72d00f4cbd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
terminate called after throwing an instance of 'ceph::FailedAssertion'
Aborted
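The failed assertion says that ObjectCacher expects commit acknowledgements for a given object to arrive with strictly increasing transaction ids. A toy Python model of that invariant (illustration only; the real code is C++ in osdc/ObjectCacher.cc):

```python
# Minimal model of the invariant behind
# "FAILED assert(ob->last_commit_tid < tid)": each commit ack for a cached
# object must carry a tid newer than the last one recorded, so a
# duplicated or reordered ack trips the assertion and aborts the process.

class CachedObject:
    def __init__(self):
        self.last_commit_tid = 0

    def bh_write_commit(self, tid):
        if not (self.last_commit_tid < tid):
            raise AssertionError("ob->last_commit_tid < tid failed")
        self.last_commit_tid = tid

ob = CachedObject()
ob.bh_write_commit(1)
ob.bh_write_commit(2)
try:
    ob.bh_write_commit(2)  # replayed/reordered ack, as in tracker issue 2947
except AssertionError as e:
    print("would abort:", e)
```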
Re: SSD journal suggestion
Hi,

I have 16 SAS disks on an LSI 9266-8i and 4 Intel 520 SSDs on an HBA; the node has dual 10G Ethernet. The clients are 4 nodes with dual 10 GbE; as a test I run rados bench on each client. The aggregated write speed is around 1.6 GB/s with single replication. In the first configuration I had the SSDs on the raidcontroller as well, but then I saturated the PCIe 2.0 x8 interface of the raidcontroller; therefore I use a second controller for the SSDs.

-martin

On 07.11.2012 17:41, Mark Nelson wrote:
Well, local, but still over tcp. Right now I'm focusing on pushing the osds/filestores as far as I can, and after that I'm going to set up a bonded 10GbE network to see what kind of messenger bottlenecks I run into. Sadly the testing is going slower than I would like.
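Martin's controller-saturation observation checks out on the back of an envelope. Assuming PCIe 2.0 delivers roughly 500 MB/s per lane after 8b/10b encoding, and that with the journals on the same controller every client byte crosses the bus twice (once for the journal write, once for the data write), the numbers land right at the limit. All figures below are assumptions for illustration, not measurements:

```python
# Back-of-envelope check (assumptions, not measurements) of why ~1.6 GB/s
# of client writes can saturate a PCIe 2.0 x8 RAID controller that hosts
# both the journal SSDs and the data disks.

per_lane_gbs = 0.5                       # PCIe 2.0: 5 GT/s, 8b/10b -> ~500 MB/s/lane
lanes = 8
bus_limit = per_lane_gbs * lanes         # 4.0 GB/s theoretical ceiling

client_writes = 1.6                      # GB/s aggregate, from rados bench
controller_traffic = client_writes * 2   # journal write + data write

print(bus_limit, controller_traffic)     # 4.0 vs 3.2 -> close to the ceiling
```

With protocol overhead eating into the theoretical 4.0 GB/s, 3.2 GB/s of combined traffic plausibly pins the controller, which matches Martin's decision to move the SSDs to a second controller.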
Re: SSD journal suggestion
Hi,

I tested an Arista 7150S-24 and an HP 5900, and in a few weeks I will get a Mellanox MSX1016. At the moment the Arista is my favourite. For the dual 10 GbE NICs I tested the Intel X520-DA2 and the Mellanox ConnectX-3; my favourite is the Intel X520-DA2.

-martin

On 07.11.2012 22:14, Gandalf Corvotempesta wrote:
2012/11/7 Martin Mailand mar...@tuxadero.com:
I have 16 SAS disks on an LSI 9266-8i and 4 Intel 520 SSDs on an HBA; the node has dual 10G Ethernet. The clients are 4 nodes with dual 10 GbE; as a test I run rados bench on each client. The aggregated write speed is around 1.6 GB/s with single replication.

Just for curiosity, which switches do you have?
Re: SSD journal suggestion
Hi Stefan,

deep buffers mean latency spikes; you should go for low switching latency instead. The HP 5900 has a latency of 1 ms, the Arista and Mellanox of 250 ns. And you should consider the price: the HP 5900 costs three times as much as the Mellanox.

-martin

On 07.11.2012 22:44, Stefan Priebe wrote:
On 07.11.2012 22:35, Martin Mailand wrote:
Hi, I tested an Arista 7150S-24 and an HP 5900, and in a few weeks I will get a Mellanox MSX1016. At the moment the Arista is my favourite. For the dual 10 GbE NICs I tested the Intel X520-DA2 and the Mellanox ConnectX-3; my favourite is the Intel X520-DA2.

That's pretty interesting; I'll get the HP 5900 and HP 5920 in a few weeks. HP told me the deep packet buffers of the HP 5920 will boost performance and that it should be used for storage-related stuff.

Greets,
Stefan
Re: SSD journal suggestion
Hi,

I *think* the HP is Broadcom based, the Arista is Fulcrum based, and I don't know which chips Mellanox is using. Our NOC tested both of them, and the Arista was the clear winner, at least with our workload.

-martin

On 07.11.2012 22:59, Stefan Priebe wrote:
HP told me they all use the same chips, and that Arista measures latency while only one port is in use, whereas HP guarantees the latency when all ports are in use. Whether this is correct or just something HP told me - I don't know. They told me the Arista is slower and the statistics are not comparable...
Re: SSD journal suggestion
Good question; we probably do not have enough experience with IPoIB. But it looks good on paper, so it's definitely worth a try.

-martin

On 07.11.2012 23:28, Gandalf Corvotempesta wrote:
2012/11/7 Martin Mailand mar...@tuxadero.com:
I tested an Arista 7150S-24 and an HP 5900, and in a few weeks I will get a Mellanox MSX1016. At the moment the Arista is my favourite.

Why not Infiniband?
Ceph benchmark high wait on journal device
Hi,

inspired by the performance tests Mark did, I tried to put together my own. I have four OSD processes on one node; each process has an Intel 710 SSD for its journal and 4 SAS disks via an LSI 9266-8i in RAID 0. If I test the SSDs with fio they are quite fast and the w_await time is quite low. But if I run rados bench on the cluster, the w_await times for the journal devices are quite high (around 20-40 ms). I thought the SSDs would do better - any ideas what happened here?

-martin

Logs:
/dev/sd{c,d,e,f}  Intel SSD 710 200G
/dev/sd{g,h,i,j}  each 4 x SAS on LSI 9266-8i RAID 0

fio -name iops -rw=write -size=10G -iodepth 1 -filename /dev/sdc2 -ioengine libaio -direct 1 -bs 256k

Device:  rrqm/s  wrqm/s   r/s     w/s  rMB/s   wMB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
- snip -
sdc        0,00    0,00  0,00  809,20   0,00  202,30    512,00      0,96   1,19     0,00     1,19   1,18  95,84
- snap -

rados bench -p rbd 300 write -t 16

2012-10-15 17:53:17.058383 min lat: 0.035382 max lat: 0.469604 avg lat: 0.189553
  sec  Cur ops  started  finished  avg MB/s  cur MB/s  last lat  avg lat
  300       16    25329     25313   337.443       324  0.274815  0.189553
Total time run:         300.169843
Total writes made:      25329
Write size:             4194304
Bandwidth (MB/sec):     337.529
Stddev Bandwidth:       25.1568
Max bandwidth (MB/sec): 372
Min bandwidth (MB/sec): 0
Average Latency:        0.189597
Stddev Latency:         0.0641609
Max latency:            0.469604
Min latency:            0.035382

During the rados bench test:

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
          20,38   0,00    16,20     8,87    0,00  54,55

Device:  rrqm/s  wrqm/s   r/s     w/s  rMB/s   wMB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
sda        0,00   41,20  0,00   12,40   0,00    0,35     57,42      0,00   0,31     0,00     0,31   0,31   0,38
sdb        0,00    0,00  0,00    0,00   0,00    0,00      0,00      0,00   0,00     0,00     0,00   0,00   0,00
sdc        0,00    0,00  0,00  332,80   0,00  139,67    859,53      7,36  22,09     0,00    22,09   2,12  70,42
sdd        0,00    0,00  0,00  391,60   0,00  175,84    919,62     15,59  39,62     0,00    39,62   2,40  93,80
sde        0,00    0,00  0,00  342,00   0,00  147,39    882,59      8,54  24,89     0,00    24,89   2,18  74,58
sdf        0,00    0,00  0,00  362,20   0,00  162,72    920,05     15,35  42,50     0,00    42,50   2,60  94,20
sdg        0,00    0,00  0,00  522,00   0,00  139,20    546,13      0,28   0,54     0,00     0,54   0,10   5,26
sdh        0,00    0,00  0,00  672,00   0,00  179,20    546,13      9,67  14,42     0,00    14,42   0,61  41,18
sdi        0,00    0,00  0,00  555,00   0,00  148,00    546,13      0,32   0,57     0,00     0,57   0,10   5,46
sdj        0,00    0,00  0,00  582,00   0,00  155,20    546,13      0,51   0,87     0,00     0,87   0,12   6,96

100 seconds later:

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
          22,92   0,00    19,57     9,25    0,00  48,25

Device:  rrqm/s  wrqm/s   r/s     w/s  rMB/s   wMB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
sda        0,00   40,80  0,00   15,60   0,00    0,36     47,08      0,00   0,22     0,00     0,22   0,22   0,34
sdb        0,00    0,00  0,00    0,00   0,00    0,00      0,00      0,00   0,00     0,00     0,00   0,00   0,00
sdc        0,00    0,00  0,00  386,60   0,00  168,33    891,70     12,11  31,08     0,00    31,08   2,25  86,86
sdd        0,00    0,00  0,00  405,00   0,00  183,06    925,68     15,68  38,70     0,00    38,70   2,34  94,90
sde        0,00    0,00  0,00  411,00   0,00  185,06    922,15     15,58  38,09     0,00    38,09   2,33  95,92
sdf        0,00    0,00  0,00  387,00   0,00  168,33    890,79     12,19  31,48     0,00    31,48   2,26  87,48
sdg        0,00    0,00  0,00  646,20   0,00  171,22    542,64      0,42   0,65     0,00     0,65   0,10   6,70
sdh        0,00   85,60  0,40  797,00   0,01  192,97    495,65     10,95  13,73    32,50    13,72   0,55  44,22
sdi        0,00    0,00  0,00  678,20   0,00  180,01    543,59      0,45   0,67     0,00     0,67   0,10   6,76
sdj        0,00    0,00  0,00  639,00   0,00  169,61    543,61      0,36   0,57     0,00     0,57   0,10   6,32

--admin-daemon /var/run/ceph/ceph-osd.1.asok perf dump
Re: Ceph benchmark high wait on journal device
Hi Mark,

I think there are no differences between the 9266-8i and the 9265-8i, except for the CacheVault and the angle of the SAS connectors. In the last test I posted, the SSDs were connected to the onboard SATA ports. Further tests showed that if I reduce the object size (the -b option) to 1M, 512k, 256k, the latency almost vanished; with 256k the w_await was around 1 ms. So my observation is almost the opposite of yours. I use a single controller with a dual-expander backplane.

That's the baby: http://85.214.49.87/ceph/testlab/IMAG0018.jpg

btw. Is there a nice way to format the output of ceph --admin-daemon ceph-osd.0.asok perf_dump?

-martin

On 15.10.2012 21:50, Mark Nelson wrote:

Hi Martin,

I haven't tested the 9266-8i specifically, but it may behave similarly to the 9265-8i. This is just a theory, but I get the impression that the controller itself introduces some latency getting data to disk, and that it may get worse as more data is pushed across the controller. That seems to be the case even if the data is not going to the disk in question. Are you using a single controller with expanders? On some of our nodes that use a single controller with lots of expanders, I've noticed high IO wait times, especially when doing lots of small writes.

Mark

On 10/15/2012 11:12 AM, Martin Mailand wrote:
Hi, inspired by the performance tests Mark did, I tried to put together my own. I have four OSD processes on one node; each process has an Intel 710 SSD for its journal and 4 SAS disks via an LSI 9266-8i in RAID 0. If I test the SSDs with fio they are quite fast and the w_await time is quite low. But if I run rados bench on the cluster, the w_await times for the journal devices are quite high (around 20-40 ms). I thought the SSDs would do better - any ideas what happened here?

[...]

--admin-daemon /var/run/ceph/ceph-osd.1.asok perf dump
{"filestore": {"journal_queue_max_ops": 500, "journal_queue_ops": 0, "journal_ops": 34653, "journal_queue_max_bytes": 104857600, "journal_queue_bytes": 0, "journal_bytes": 86821481160, "journal_latency": {"avgcount": 34653, "sum": 3458.68}, "journal_wr": 19372, "journal_wr_bytes": {"avgcount": 19372, "sum": 87026655232
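On the formatting question raised in this thread: the admin-socket output is plain JSON, so it can be pretty-printed with any JSON tool (e.g. piping through `python -m json.tool`), and the latency counters are avgcount/sum pairs from which an average falls out directly. A small sketch using a trimmed sample of the journal_latency numbers from the perf dump quoted above:

```python
import json

# Trimmed sample of the perf dump quoted above; the real output comes from
# `ceph --admin-daemon /var/run/ceph/ceph-osd.1.asok perf dump`.
raw = ('{"filestore": {"journal_ops": 34653,'
       ' "journal_latency": {"avgcount": 34653, "sum": 3458.68}}}')

dump = json.loads(raw)
print(json.dumps(dump, indent=2))  # pretty-printed, like `python -m json.tool`

# avgcount/sum pairs: sum of latencies (seconds) over number of samples.
lat = dump["filestore"]["journal_latency"]
avg_ms = lat["sum"] / lat["avgcount"] * 1000
print("avg journal latency: %.1f ms" % avg_ms)  # ~99.8 ms
```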
rbd map error with new rbd format
Hi,

whilst testing the new rbd layering feature I found a problem with rbd map: it seems rbd map doesn't support the new format.

-martin

ceph -v
ceph version 0.51-265-gc7d11cd (commit:c7d11cd7b813a47167108c160358f70ec1aab7d6)

rbd create --size 10 --new-format new
rbd map new
add failed: (2) No such file or directory

rbd create --size 10 old
rbd map old
rbd showmapped
id    pool    image    snap    device
1     rbd     old      -       /dev/rbd1

rbd info new
rbd image 'new':
        size 10 MB in 25000 objects
        order 22 (4096 KB objects)
        block_name_prefix: rbd_data.101e1a89b511
        old format: False
        features: layering

rbd info old
rbd image 'old':
        size 10 MB in 25000 objects
        order 22 (4096 KB objects)
        block_name_prefix: rb.0.1021.23697452
        old format: True
        features:
Re: v0.51 released
Hi Sage,

is the rbd layering/cloning already testable in this release? Do you have a link to the docs on how to use it?

Best Regards,
martin

On 26.08.2012 17:58, Sage Weil wrote:

The latest development release v0.51 is ready. Notable changes include:

* crush: tunables documented; feature bit now present and enforced
* osd: various fixes for out-of-order op replies
* osd: several rare peering cases fixed
* osd: fixed detection of EIO errors from fs on read
* osd: new 'lock' rados class for generic object locking
* librbd: fixed memory leak on discard
* librbd: image layering/cloning
* radosgw: fix range header for large objects, ETag quoting, GMT dates, other compatibility fixes
* mkcephfs: fix for default keyring, osd data/journal locations
* wireshark: ceph protocol dissector patch updated
* ceph.spec: fixed packaging problem with crush headers

Full RBD cloning support will be in place in v0.52, as will a refactor of the messenger code with many bug fixes in the socket failure handling. This is available for testing now in 'next' for the adventurous. Improved OSD scrubbing is also coming soon. We should (finally) be building some release RPMs for v0.52 as well.

You can get v0.51 from the usual locations:

* Git at git://github.com/ceph/ceph.git
* Tarball at http://ceph.newdream.net/download/ceph-0.51.tar.gz
* For Debian/Ubuntu packages, see http://ceph.newdream.net/docs/master/install/debian
Re: RBD layering design draft
Hi, what about locked, unlocked, unlocking? -martin On 16.06.2012 17:11, Sage Weil wrote: On Fri, 15 Jun 2012, Yehuda Sadeh wrote: On Fri, Jun 15, 2012 at 5:46 PM, Sage Weil s...@inktank.com wrote: Looks good! A couple of small things: $ rbd unpreserve pool/image@snap Is 'preserve' and 'unpreserve' the verbiage we want to use here? Not sure I have a better suggestion, but preserve is unusual. freeze, thaw/unfreeze? Freeze/thaw usually mean something like quiesce I/O or read-only, usually temporarily. What we actually mean is you can't delete this. Maybe pin/unpin? preserve/unpreserve may be fine, too! sage
Re: Unmountable btrfs filesystems
Hi Wido, until recently there were still a few bugs in btrfs which could be hit quite easily with ceph. The last big one was fixed here: http://www.spinics.net/lists/ceph-devel/msg06270.html I am running a ceph cluster with btrfs on 3.5-rc2 without a problem, even under heavy test load. Hope that helps. -martin On 16.06.2012 20:46, Wido den Hollander wrote: I tried various kernels, the most recent 3.3.0 from kernel.ubuntu.com, but I'm still seeing this. Is anyone seeing the same or did everybody migrate away to ext4 or XFS? I still prefer btrfs due to the snapshotting, but losing all these OSDs all the time is getting kind of frustrating. Any thoughts or comments?
Re: Ceph on btrfs 3.4rc
Hi, the ceph cluster has been running under heavy load for the last 13 hours without a problem, dmesg is empty and the performance is good. -martin On 23.05.2012 21:12, Martin Mailand wrote: this patch has been running for 3 hours without a Bug and without the Warning. I will let it run overnight and report tomorrow. It looks very good ;-)
Re: Ceph on btrfs 3.4rc
Hi Josef, this patch has been running for 3 hours without a Bug and without the Warning. I will let it run overnight and report tomorrow. It looks very good ;-) -martin On 23.05.2012 17:02, Josef Bacik wrote: Ok, give this a shot, it should do it. Thanks,
Re: Ceph on btrfs 3.4rc
Hi Josef, there was one line before the bug. [ 995.725105] couldn't find orphan item for 524 Am 18.05.2012 16:48, schrieb Josef Bacik: Ok hopefully this will print something out that makes sense. Thanks, -martin [ 241.754693] Btrfs loaded [ 241.755148] device fsid 43c4ebd9-3824-4b07-a710-3ec39b012759 devid 1 transid 4 /dev/sdc [ 241.755750] btrfs: setting nodatacow [ 241.755753] btrfs: enabling auto defrag [ 241.755754] btrfs: disk space caching is enabled [ 241.755755] btrfs flagging fs with big metadata feature [ 241.768683] device fsid e7e7f2df-6a4e-45b1-85cc-860cda849953 devid 1 transid 4 /dev/sdd [ 241.769028] btrfs: setting nodatacow [ 241.769030] btrfs: enabling auto defrag [ 241.769031] btrfs: disk space caching is enabled [ 241.769032] btrfs flagging fs with big metadata feature [ 241.781360] device fsid 203fdd4c-baac-49f8-bfdb-08486c937989 devid 1 transid 4 /dev/sde [ 241.781854] btrfs: setting nodatacow [ 241.781859] btrfs: enabling auto defrag [ 241.781861] btrfs: disk space caching is enabled [ 241.781864] btrfs flagging fs with big metadata feature [ 242.713741] device fsid 95c36e12-0098-48d7-a08d-9d54a299206b devid 1 transid 4 /dev/sdf [ 242.714110] btrfs: setting nodatacow [ 242.714118] btrfs: enabling auto defrag [ 242.714121] btrfs: disk space caching is enabled [ 242.714125] btrfs flagging fs with big metadata feature [ 995.725105] couldn't find orphan item for 524 [ 995.725126] [ cut here ] [ 995.725134] kernel BUG at fs/btrfs/inode.c:2227! 
[ 995.725143] invalid opcode: [#1] SMP [ 995.725158] CPU 0 [ 995.725162] Modules linked in: btrfs zlib_deflate libcrc32c ext2 coretemp ghash_clmulni_intel aesni_intel bonding cryptd aes_x86_64 microcode psmouse serio_raw sb_edac edac_core joydev mei(C) ses ioatdma enclosure mac_hid lp parport ixgbe usbhid hid isci libsas megaraid_sas scsi_transport_sas igb dca mdio [ 995.725285] [ 995.725290] Pid: 2972, comm: ceph-osd Tainted: G C 3.4.0-rc7.2012051800+ #14 Supermicro X9SRi/X9SRi [ 995.725324] RIP: 0010:[a028535f] [a028535f] btrfs_orphan_del+0x14f/0x160 [btrfs] [ 995.725354] RSP: 0018:881016ed9d18 EFLAGS: 00010292 [ 995.725364] RAX: 0037 RBX: 88101485fdb0 RCX: [ 995.725378] RDX: RSI: 0082 RDI: 0246 [ 995.725392] RBP: 881016ed9d58 R08: R09: [ 995.725405] R10: R11: 00b6 R12: 88101efe9f90 [ 995.725419] R13: 88101efe9c00 R14: 0001 R15: 0001 [ 995.725433] FS: 7f58e5dbc700() GS:88107fc0() knlGS: [ 995.725466] CS: 0010 DS: ES: CR0: 80050033 [ 995.725492] CR2: 03f28000 CR3: 00101acac000 CR4: 000407f0 [ 995.725522] DR0: DR1: DR2: [ 995.725551] DR3: DR6: 0ff0 DR7: 0400 [ 995.725581] Process ceph-osd (pid: 2972, threadinfo 881016ed8000, task 88101618) [ 995.725626] Stack: [ 995.725646] 0c02 88101deaf550 881016ed9d38 88101deaf550 [ 995.725700] 88101efe9c00 88101485fdb0 880be890c1e0 [ 995.725757] 881016ed9e08 a02897a8 88101485fdb0 [ 995.725807] Call Trace: [ 995.725835] [a02897a8] btrfs_truncate+0x5e8/0x6d0 [btrfs] [ 995.725869] [a028b121] btrfs_setattr+0xc1/0x1b0 [btrfs] [ 995.725898] [811955c3] notify_change+0x183/0x320 [ 995.725925] [8117889e] do_truncate+0x5e/0xa0 [ 995.725951] [81178a24] sys_truncate+0x144/0x1b0 [ 995.725979] [8165fd29] system_call_fastpath+0x16/0x1b [ 995.726006] Code: 45 31 ff e9 3c ff ff ff 48 8b b3 58 fe ff ff 48 85 f6 74 19 80 bb 60 fe ff ff 84 74 10 48 c7 c7 08 48 2e a0 31 c0 e8 09 7c 3c e1 0f 0b 48 8b 73 40 eb ea 66 0f 1f 84 00 00 00 00 00 55 48 89 e5 [ 995.726221] RIP [a028535f] btrfs_orphan_del+0x14f/0x160 [btrfs] [ 995.726258] RSP 881016ed9d18 [ 
995.726574] ---[ end trace 4bde8f513a6d106d ]---
Re: Ceph on btrfs 3.4rc
Hi Josef, now I get [ 2081.142669] couldn't find orphan item for 2039, nlink 1, root 269, root being deleted no -martin Am 18.05.2012 21:01, schrieb Josef Bacik: *sigh* ok try this, hopefully it will point me in the right direction. Thanks, [ 126.389847] Btrfs loaded [ 126.390284] device fsid 0c9d8c6d-2982-4604-b32a-fc443c4e2c50 devid 1 transid 4 /dev/sdc [ 126.391246] btrfs: setting nodatacow [ 126.391252] btrfs: enabling auto defrag [ 126.391254] btrfs: disk space caching is enabled [ 126.391257] btrfs flagging fs with big metadata feature [ 126.405700] device fsid e8a0dc27-8714-49bd-a14f-ac37525febb1 devid 1 transid 4 /dev/sdd [ 126.406162] btrfs: setting nodatacow [ 126.406167] btrfs: enabling auto defrag [ 126.406170] btrfs: disk space caching is enabled [ 126.406172] btrfs flagging fs with big metadata feature [ 126.419819] device fsid f67cd977-ebf4-41f2-9821-f2989e985954 devid 1 transid 4 /dev/sde [ 126.420198] btrfs: setting nodatacow [ 126.420206] btrfs: enabling auto defrag [ 126.420210] btrfs: disk space caching is enabled [ 126.420214] btrfs flagging fs with big metadata feature [ 127.274555] device fsid 3001355e-c2e2-46c7-9eba-dfecb441d6a6 devid 1 transid 4 /dev/sdf [ 127.274980] btrfs: setting nodatacow [ 127.274986] btrfs: enabling auto defrag [ 127.274989] btrfs: disk space caching is enabled [ 127.274992] btrfs flagging fs with big metadata feature [ 2081.142669] couldn't find orphan item for 2039, nlink 1, root 269, root being deleted no [ 2081.142735] [ cut here ] [ 2081.142750] kernel BUG at fs/btrfs/inode.c:2228! 
[ 2081.142766] invalid opcode: [#1] SMP [ 2081.142786] CPU 10 [ 2081.142794] Modules linked in: btrfs zlib_deflate libcrc32c ext2 bonding coretemp ghash_clmulni_intel aesni_intel cryptd aes_x86_64 microcode psmouse serio_raw sb_edac edac_core joydev mei(C) ioatdma ses enclosure mac_hid lp parport usbhid hid megaraid_sas isci libsas scsi_transport_sas igb ixgbe dca mdio [ 2081.142974] [ 2081.142985] Pid: 2966, comm: ceph-osd Tainted: G C 3.4.0-rc7.2012051802+ #16 Supermicro X9SRi/X9SRi [ 2081.143020] RIP: 0010:[a0269383] [a0269383] btrfs_orphan_del+0x173/0x180 [btrfs] [ 2081.143080] RSP: 0018:881016d83d18 EFLAGS: 00010292 [ 2081.143096] RAX: 0062 RBX: 881017ad4770 RCX: [ 2081.143115] RDX: RSI: 0082 RDI: 0246 [ 2081.143134] RBP: 881016d83d58 R08: R09: [ 2081.143154] R10: R11: 0116 R12: 88101e7baf90 [ 2081.143173] R13: 88101e7bac00 R14: 0001 R15: 0001 [ 2081.143193] FS: 7fcc1e736700() GS:88107fd4() knlGS: [ 2081.143243] CS: 0010 DS: ES: CR0: 80050033 [ 2081.143274] CR2: 09269000 CR3: 00101ba87000 CR4: 000407e0 [ 2081.143308] DR0: DR1: DR2: [ 2081.143341] DR3: DR6: 0ff0 DR7: 0400 [ 2081.143376] Process ceph-osd (pid: 2966, threadinfo 881016d82000, task 881023c744a0) [ 2081.143424] Stack: [ 2081.143447] 0c07 88101e1dac30 881016d83d38 88101e1dac30 [ 2081.143510] 88101e7bac00 881017ad4770 88101f0f7d60 [ 2081.143572] 881016d83e08 a026d7c8 881017ad4770 [ 2081.143634] Call Trace: [ 2081.143684] [a026d7c8] btrfs_truncate+0x5e8/0x6d0 [btrfs] [ 2081.143737] [a026f141] btrfs_setattr+0xc1/0x1b0 [btrfs] [ 2081.143773] [811955c3] notify_change+0x183/0x320 [ 2081.143807] [8117889e] do_truncate+0x5e/0xa0 [ 2081.143839] [81178a24] sys_truncate+0x144/0x1b0 [ 2081.143873] [8165fd29] system_call_fastpath+0x16/0x1b [ 2081.143903] Code: a0 49 8b 8d f0 02 00 00 8b 53 48 4c 0f 44 c0 48 85 f6 74 19 80 bb 60 fe ff ff 84 74 10 48 c7 c7 10 88 2c a0 31 c0 e8 e5 3b 3e e1 0f 0b 48 8b 73 40 eb ea 0f 1f 44 00 00 55 48 89 e5 48 83 ec 10 [ 2081.144199] RIP [a0269383] btrfs_orphan_del+0x173/0x180 
[btrfs] [ 2081.144258] RSP 881016d83d18 [ 2081.144614] ---[ end trace 8d0829d100639242 ]---
Re: Ceph on btrfs 3.4rc
Hi Josef, somehow I still get the kernel Bug messages, I used your patch from the 16th against rc7. -martin Am 16.05.2012 21:20, schrieb Josef Bacik: Hrm ok so I finally got some time to try and debug it and let the test run a good long while (5 hours almost) and I couldn't hit either the original bug or the one you guys were hitting. So either my extra little bit of locking did the trick or I get to keep my Worst reproducer ever award. Can you guys give this one a whirl and if it panics send the entire dmesg since it should spit out a WARN_ON() to let me know what I thought was the problem was it. Thanks, [ 2868.813236] [ cut here ] [ 2868.813297] kernel BUG at fs/btrfs/inode.c:2220! [ 2868.813355] invalid opcode: [#2] SMP [ 2868.813479] CPU 2 [ 2868.813516] Modules linked in: btrfs zlib_deflate libcrc32c ext2 bonding coretemp ghash_clmulni_intel aesni_intel cryptd aes_x86_64 microcode psmouse serio_raw sb_edac edac_core joydev mei(C) ses ioatdma enclosure mac_hid lp parport isci libsas scsi_transport_sas usbhid hid ixgbe igb megaraid_sas dca mdio [ 2868.814871] [ 2868.814925] Pid: 5325, comm: ceph-osd Tainted: G D C 3.4.0-rc7+ #10 Supermicro X9SRi/X9SRi [ 2868.815108] RIP: 0010:[a02212f2] [a02212f2] btrfs_orphan_del+0xe2/0xf0 [btrfs] [ 2868.815236] RSP: 0018:880296e89d18 EFLAGS: 00010282 [ 2868.815294] RAX: fffe RBX: 88101ef3c390 RCX: 00562497 [ 2868.815355] RDX: 00562496 RSI: 88101ef1 RDI: ea00407bc400 [ 2868.815416] RBP: 880296e89d58 R08: 60ef8fd0 R09: a01f8c6a [ 2868.815476] R10: R11: 011d R12: 880fdf602790 [ 2868.815537] R13: 880fdf602400 R14: 0001 R15: 0001 [ 2868.815598] FS: 7f07d5512700() GS:88107fc4() knlGS: [ 2868.815675] CS: 0010 DS: ES: CR0: 80050033 [ 2868.815734] CR2: 0ab16000 CR3: 00082a6b2000 CR4: 000407e0 [ 2868.815796] DR0: DR1: DR2: [ 2868.815858] DR3: DR6: 0ff0 DR7: 0400 [ 2868.815920] Process ceph-osd (pid: 5325, threadinfo 880296e88000, task 8810170616e0) [ 2868.815997] Stack: [ 2868.816049] 0c07 88101ef12960 880296e89d38 88101ef12960 [ 
2868.816262] 880fdf602400 88101ef3c390 880b4ce2f260 [ 2868.816485] 880296e89e08 a0225628 88101ef3c390 [ 2868.816694] Call Trace: [ 2868.816755] [a0225628] btrfs_truncate+0x4d8/0x650 [btrfs] [ 2868.816817] [81188afd] ? path_lookupat+0x6d/0x750 [ 2868.816880] [a0227021] btrfs_setattr+0xc1/0x1b0 [btrfs] [ 2868.816940] [811955c3] notify_change+0x183/0x320 [ 2868.816998] [8117889e] do_truncate+0x5e/0xa0 [ 2868.817056] [81178a24] sys_truncate+0x144/0x1b0 [ 2868.817115] [8165fd29] system_call_fastpath+0x16/0x1b [ 2868.817173] Code: e8 4c 8b 75 f0 4c 8b 7d f8 c9 c3 66 0f 1f 44 00 00 80 bb 60 fe ff ff 84 75 b4 eb ae 0f 1f 44 00 00 48 89 df e8 50 73 fe ff eb b8 0f 0b 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 48 83 ec [ 2868.819501] RIP [a02212f2] btrfs_orphan_del+0xe2/0xf0 [btrfs] [ 2868.819602] RSP 880296e89d18 [ 2868.819703] ---[ end trace 94d17b770b376c84 ]--- [ 3249.857453] [ cut here ] [ 3249.857481] kernel BUG at fs/btrfs/inode.c:2220! [ 3249.857506] invalid opcode: [#3] SMP [ 3249.857534] CPU 0 [ 3249.857538] Modules linked in: btrfs zlib_deflate libcrc32c ext2 bonding coretemp ghash_clmulni_intel aesni_intel cryptd aes_x86_64 microcode psmouse serio_raw sb_edac edac_core joydev mei(C) ses ioatdma enclosure mac_hid lp parport isci libsas scsi_transport_sas usbhid hid ixgbe igb megaraid_sas dca mdio [ 3249.857721] [ 3249.857740] Pid: 5384, comm: ceph-osd Tainted: G D C 3.4.0-rc7+ #10 Supermicro X9SRi/X9SRi [ 3249.857791] RIP: 0010:[a02212f2] [a02212f2] btrfs_orphan_del+0xe2/0xf0 [btrfs] [ 3249.857847] RSP: 0018:880abe8b5d18 EFLAGS: 00010282 [ 3249.857873] RAX: fffe RBX: 8807eb8b6670 RCX: 0077a084 [ 3249.857902] RDX: 0077a083 RSI: 88101ee497e0 RDI: ea00407b9240 [ 3249.857931] RBP: 880abe8b5d58 R08: 60ef8fd0 R09: a01f8c6a [ 3249.857959] R10: R11: 0153 R12: 880d56825390 [ 3249.857988] R13: 880d56825000 R14: 0001 R15: 0001 [ 3249.858017] FS: 7f06bd13b700() GS:88107fc0() knlGS: [ 3249.858062] CS: 0010 DS: ES: CR0: 80050033 [ 3249.858088] CR2: 043d2000 CR3: 
000e7ebe5000 CR4: 000407f0 [ 3249.858117] DR0: DR1: DR2: [ 3249.858146] DR3: DR6: 0ff0 DR7:
Re: Ceph on btrfs 3.4rc
Hi Josef, no there was nothing above. Here the is another dmesg output. Was there anything above those messages? There should have been a WARN_ON() or something. If not thats fine, I just need to know one way or the other so I can figure out what to do next. Thanks, Josef -martin [ 63.027277] Btrfs loaded [ 63.027485] device fsid 266726e1-439f-4d89-a374-7ef92d355daf devid 1 transid 4 /dev/sdc [ 63.027750] btrfs: setting nodatacow [ 63.027752] btrfs: enabling auto defrag [ 63.027753] btrfs: disk space caching is enabled [ 63.027754] btrfs flagging fs with big metadata feature [ 63.036347] device fsid 070e2c6c-2ea5-478d-bc07-7ce3a954e2e4 devid 1 transid 4 /dev/sdd [ 63.036624] btrfs: setting nodatacow [ 63.036626] btrfs: enabling auto defrag [ 63.036627] btrfs: disk space caching is enabled [ 63.036628] btrfs flagging fs with big metadata feature [ 63.045628] device fsid 6f7b82a9-a1b7-40c6-8b00-2c2a44481066 devid 1 transid 4 /dev/sde [ 63.045910] btrfs: setting nodatacow [ 63.045912] btrfs: enabling auto defrag [ 63.045913] btrfs: disk space caching is enabled [ 63.045914] btrfs flagging fs with big metadata feature [ 63.831278] device fsid 46890b76-45c2-4ea2-96ee-2ea88e29628b devid 1 transid 4 /dev/sdf [ 63.831577] btrfs: setting nodatacow [ 63.831579] btrfs: enabling auto defrag [ 63.831579] btrfs: disk space caching is enabled [ 63.831580] btrfs flagging fs with big metadata feature [ 1521.820412] [ cut here ] [ 1521.820424] kernel BUG at fs/btrfs/inode.c:2220! 
[ 1521.820433] invalid opcode: [#1] SMP [ 1521.820448] CPU 4 [ 1521.820452] Modules linked in: btrfs zlib_deflate libcrc32c ext2 ses enclosure bonding coretemp ghash_clmulni_intel aesni_intel cryptd aes_x86_64 psmouse microcode serio_raw sb_edac edac_core mei(C) joydev ioatdma mac_hid lp parport isci libsas scsi_transport_sas usbhid hid ixgbe igb dca megaraid_sas mdio [ 1521.820562] [ 1521.820567] Pid: 3095, comm: ceph-osd Tainted: G C 3.4.0-rc7+ #10 Supermicro X9SRi/X9SRi [ 1521.820591] RIP: 0010:[a02532f2] [a02532f2] btrfs_orphan_del+0xe2/0xf0 [btrfs] [ 1521.820616] RSP: 0018:881013da9d18 EFLAGS: 00010282 [ 1521.820626] RAX: fffe RBX: 881013a3b7f0 RCX: 00395dcf [ 1521.820640] RDX: 00395dce RSI: 88101df77480 RDI: ea004077ddc0 [ 1521.820654] RBP: 881013da9d58 R08: 60ef800010d0 R09: a022ac6a [ 1521.820667] R10: R11: 010a R12: 88101e378790 [ 1521.820681] R13: 88101e378400 R14: 0001 R15: 0001 [ 1521.820695] FS: 7faa45d30700() GS:88107fc8() knlGS: [ 1521.820710] CS: 0010 DS: ES: CR0: 80050033 [ 1521.820738] CR2: 7fe0efba6010 CR3: 001016fec000 CR4: 000407e0 [ 1521.820767] DR0: DR1: DR2: [ 1521.820796] DR3: DR6: 0ff0 DR7: 0400 [ 1521.820825] Process ceph-osd (pid: 3095, threadinfo 881013da8000, task 881013da44a0) [ 1521.820870] Stack: [ 1521.820889] 0c05 88101df9c230 881013da9d38 88101df9c230 [ 1521.820939] 88101e378400 881013a3b7f0 880c6880f840 [ 1521.820988] 881013da9e08 a0257628 881013a3b7f0 [ 1521.821038] Call Trace: [ 1521.821066] [a0257628] btrfs_truncate+0x4d8/0x650 [btrfs] [ 1521.821096] [81188afd] ? 
path_lookupat+0x6d/0x750 [ 1521.821128] [a0259021] btrfs_setattr+0xc1/0x1b0 [btrfs] [ 1521.821156] [811955c3] notify_change+0x183/0x320 [ 1521.821183] [8117889e] do_truncate+0x5e/0xa0 [ 1521.821209] [81178a24] sys_truncate+0x144/0x1b0 [ 1521.821237] [8165fd29] system_call_fastpath+0x16/0x1b [ 1521.821265] Code: e8 4c 8b 75 f0 4c 8b 7d f8 c9 c3 66 0f 1f 44 00 00 80 bb 60 fe ff ff 84 75 b4 eb ae 0f 1f 44 00 00 48 89 df e8 50 73 fe ff eb b8 0f 0b 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 48 83 ec [ 1521.821458] RIP [a02532f2] btrfs_orphan_del+0xe2/0xf0 [btrfs] [ 1521.821492] RSP 881013da9d18 [ 1521.821758] ---[ end trace aee4c5fe92ee2a67 ]--- [ 6888.637508] btrfs: truncated 1 orphans [ 7641.701736] [ cut here ] [ 7641.701764] kernel BUG at fs/btrfs/inode.c:2220! [ 7641.701789] invalid opcode: [#2] SMP [ 7641.701816] CPU 3 [ 7641.701819] Modules linked in: btrfs zlib_deflate libcrc32c ext2 ses enclosure bonding coretemp ghash_clmulni_intel aesni_intel cryptd aes_x86_64 psmouse microcode serio_raw sb_edac edac_core mei(C) joydev ioatdma mac_hid lp parport isci libsas scsi_transport_sas usbhid hid ixgbe igb dca megaraid_sas mdio [ 7641.702000] [ 7641.702030] Pid: 3064, comm: ceph-osd Tainted: G D C 3.4.0-rc7+ #10 Supermicro X9SRi/X9SRi [ 7641.702081] RIP: 0010:[a02532f2] [a02532f2]
Re: Ceph on btrfs 3.4rc
Hi Josef, Am 11.05.2012 21:16, schrieb Josef Bacik: Heh duh, sorry, try this one instead. Thanks, With this patch I got this Bug: [ 8233.828722] [ cut here ] [ 8233.828737] kernel BUG at fs/btrfs/inode.c:2217! [ 8233.828746] invalid opcode: [#1] SMP [ 8233.828761] CPU 1 [ 8233.828766] Modules linked in: btrfs zlib_deflate libcrc32c ses enclosure bonding coretemp ghash_clmulni_intel psmouse aesni_intel sb_edac cryptd a es_x86_64 ext2 microcode serio_raw edac_core mei(C) joydev ioatdma mac_hid lp parport usbhid hid isci libsas ixgbe scsi_transport_sas megaraid_sas igb dca mdio [ 8233.828885] [ 8233.828891] Pid: , comm: ceph-osd Tainted: GWC 3.4.0-rc6+ #6 Supermicro X9SRi/X9SRi [ 8233.828915] RIP: 0010:[a02492d2] [a02492d2] btrfs_orphan_del+0xe2/0xf0 [btrfs] [ 8233.828947] RSP: 0018:88101ce53d18 EFLAGS: 00010282 [ 8233.828957] RAX: fffe RBX: 880d194e2c50 RCX: 00d0a3be [ 8233.828971] RDX: 00d0a3bd RSI: 88101de2a000 RDI: ea0040778a80 [ 8233.828985] RBP: 88101ce53d58 R08: 60ef8f00 R09: a0220c6a [ 8233.828999] R10: R11: 00f0 R12: 88071bb1e790 [ 8233.829029] R13: 88071bb1e400 R14: 0001 R15: 0001 [ 8233.829059] FS: 7fdfa179b700() GS:88107fc2() knlGS: [ 8233.829104] CS: 0010 DS: ES: CR0: 80050033 [ 8233.829131] CR2: 0c614000 CR3: 0001df9d2000 CR4: 000407e0 [ 8233.829160] DR0: DR1: DR2: [ 8233.829190] DR3: DR6: 0ff0 DR7: 0400 [ 8233.829220] Process ceph-osd (pid: , threadinfo 88101ce52000, task 88101b7b96e0) [ 8233.829265] Stack: [ 8233.829286] 0c02 88101de14cd0 88101ce53d38 88101de14cd0 [ 8233.829336] 88071bb1e400 880d194e2c50 881024680620 [ 8233.829386] 88101ce53e08 a024d608 880d194e2c50 [ 8233.829436] Call Trace: [ 8233.829472] [a024d608] btrfs_truncate+0x4d8/0x650 [btrfs] [ 8233.829503] [81188afd] ? 
path_lookupat+0x6d/0x750 [ 8233.829537] [a024efc1] btrfs_setattr+0xc1/0x1b0 [btrfs] [ 8233.829567] [811955c3] notify_change+0x183/0x320 [ 8233.829595] [8117889e] do_truncate+0x5e/0xa0 [ 8233.829621] [81178a24] sys_truncate+0x144/0x1b0 [ 8233.829649] [8165fd69] system_call_fastpath+0x16/0x1b [ 8233.829676] Code: e8 4c 8b 75 f0 4c 8b 7d f8 c9 c3 66 0f 1f 44 00 00 80 bb 60 fe ff ff 84 75 b4 eb ae 0f 1f 44 00 00 48 89 df e8 70 73 fe ff eb b8 0f 0b 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 48 83 ec [ 8233.829875] RIP [a02492d2] btrfs_orphan_del+0xe2/0xf0 [btrfs] [ 8233.829914] RSP 88101ce53d18 [ 8233.830187] ---[ end trace 46dd4a711bf2979d ]--- -martin -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Ceph on btrfs 3.4rc
Hi Josef, Am 11.05.2012 15:31, schrieb Josef Bacik: That previous patch was against btrfs-next, this patch is against 3.4-rc6 if you are on mainline. Thanks, I tried your patch against mainline, after a few minutes I hit this bug. [ 1078.523655] [ cut here ] [ 1078.523667] kernel BUG at fs/btrfs/inode.c:2211! [ 1078.523676] invalid opcode: [#1] SMP [ 1078.523692] CPU 5 [ 1078.523696] Modules linked in: btrfs zlib_deflate libcrc32c mlx4_en bonding ext2 coretemp ghash_clmulni_intel aesni_intel cryptd aes_x86_64 microcode psmouse serio_raw sb_edac edac_core mei(C) joydev ses ioatdma enclosure mac_hid lp parport isci libsas scsi_transport_sas usbhid hid igb megaraid_sas mlx4_core dca [ 1078.523813] [ 1078.523818] Pid: 4108, comm: ceph-osd Tainted: G C 3.4.0-rc6+ #5 Supermicro X9SRi/X9SRi [ 1078.523841] RIP: 0010:[a022b2a2] [a022b2a2] btrfs_orphan_del+0xb2/0xc0 [btrfs] [ 1078.523867] RSP: 0018:880ff14a5d38 EFLAGS: 00010282 [ 1078.523877] RAX: fffe RBX: 880ff004d6f0 RCX: 00117400 [ 1078.523891] RDX: 001173ff RSI: 8810279f6ea0 RDI: ea00409e7d80 [ 1078.523905] RBP: 880ff14a5d58 R08: 60ef80001400 R09: a0202c6a [ 1078.523918] R10: R11: 00ba R12: 0001 [ 1078.523932] R13: 881017663c00 R14: 0001 R15: 88101776f5a0 [ 1078.523946] FS: 7f1d2c03c700() GS:88107fca() knlGS: [ 1078.523961] CS: 0010 DS: ES: CR0: 80050033 [ 1078.523990] CR2: 050f4000 CR3: 000ff2a57000 CR4: 000407e0 [ 1078.524019] DR0: DR1: DR2: [ 1078.524048] DR3: DR6: 0ff0 DR7: 0400 [ 1078.524077] Process ceph-osd (pid: 4108, threadinfo 880ff14a4000, task 880ff2aa44a0) [ 1078.524121] Stack: [ 1078.524141] 8810279f7460 881017663c00 880ff004d6f0 [ 1078.524190] 880ff14a5e08 a022f5d8 880ff004d6f0 [ 1078.524240] 880ff14a5e18 81188afd 8000 80001000 [ 1078.524289] Call Trace: [ 1078.524317] [a022f5d8] btrfs_truncate+0x4d8/0x650 [btrfs] [ 1078.524348] [81188afd] ? 
path_lookupat+0x6d/0x750 [ 1078.524380] [a0230f91] btrfs_setattr+0xc1/0x1b0 [btrfs] [ 1078.524408] [811955c3] notify_change+0x183/0x320 [ 1078.524435] [8117889e] do_truncate+0x5e/0xa0 [ 1078.524461] [81178a24] sys_truncate+0x144/0x1b0 [ 1078.524489] [8165fd69] system_call_fastpath+0x16/0x1b [ 1078.524516] Code: 8b 65 e8 4c 8b 6d f0 4c 8b 75 f8 c9 c3 0f 1f 40 00 80 bb 60 fe ff ff 84 75 c1 eb bb 0f 1f 44 00 00 48 89 df e8 a0 73 fe ff eb c1 0f 0b 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 48 83 ec [ 1078.524710] RIP [a022b2a2] btrfs_orphan_del+0xb2/0xc0 [btrfs] [ 1078.524744] RSP 880ff14a5d38 [ 1078.525013] ---[ end trace 88c92720204f7aa4 ]--- That's the drive with the broken btrfs. [ 212.843776] device fsid 28492275-01d3-4e89-9f1c-bd86057194bf devid 1 transid 4 /dev/sdc [ 212.844630] btrfs: setting nodatacow [ 212.844637] btrfs: enabling auto defrag [ 212.844640] btrfs: disk space caching is enabled [ 212.844643] btrfs flagging fs with big metadata feature -martin -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Strange write behavior on an osd
Hi, I have a strange behavior on the osd, the cluster is a two node system, on one machine 50 qemu/rbd vm's are running (idling) the other machine is a osd with four osd processes and one mon processes. The osd disk are as follow sda is root sdb is journal four partitions sd{c,d,e,f) each three disk via a raid controler. /dev/sdc on /data/osd.0 type btrfs (rw,noatime,nodiratime,nodatacow,autodefrag) /dev/sdd on /data/osd.1 type btrfs (rw,noatime,nodiratime,nodatacow,autodefrag) /dev/sde on /data/osd.2 type btrfs (rw,noatime,nodiratime,nodatacow,autodefrag) /dev/sdf on /data/osd.3 type btrfs (rw,noatime,nodiratime,nodatacow,autodefrag) There is almost no network traffic, but the osd writes huge amount to the disk for around 90 sec and then its almost idle for 30 sec, the writes always goes to sde. Why is it so bursty? -martin ## Busy Log ## total-cpu-usage -dsk/total- -net/total- ---paging-- ---system-- usr sys idl wai hiq siq| read writ| recv send| in out | int csw 1 2 89 8 0 0|1632k 88M| 0 0 | 475B 1234B|155310k 0 0 88 12 0 0| 0 147M| 856B 2974B| 0 0 |2056 1789 0 0 88 12 0 0| 0 164M| 85k 6771B| 0 0 |2227 3104 0 1 84 15 0 0| 0 152M| 193k 17k| 0 0 |2805 6116 1 2 83 14 0 0|2704k 183M| 314k 23k| 0 0 |3184 7942 0 1 84 15 0 0|2072k 183M| 213k 16k| 0 0 |3142 6798 0 0 88 12 0 0| 0 167M| 27k 5571B| 0 0 |2418 2608 1 1 80 18 0 0| 96k 207M| 443k 26k| 0 0 |3267 9278 1 2 81 15 0 0| 0 180M| 682k 43k| 0 0 |394113k 1 1 80 17 0 0|2736k 153M| 573k 35k| 0 0 |322911k 1 1 84 14 0 0|9564k 163M| 242k 22k| 0 0 |2988 7054 0 1 75 24 0 0| 160k 166M| 40k 5331B| 0 0 |2187 2759 0 1 85 14 0 0| 32k 176M| 85k 6730B| 0 0 |2244 3198 0 1 83 16 0 0| 0 183M| 137k 12k| 0 0 |2590 5254 0 1 84 15 0 0|2688k 170M| 179k 15k| 0 0 |2780 5461 0 1 86 13 0 0|2692k 166M| 185k 17k| 0 0 |2638 6242 1 1 83 15 0 0| 0 179M| 149k 17k| 0 0 |3165 5695 1 2 81 17 0 0| 0 186M| 484k 33k| 0 0 |351211k 0 1 82 16 0 0| 0 177M| 523k 33k| 0 0 |317711k 1 1 82 16 0 0| 36k 179M| 603k 39k| 0 0 |300611k 1 1 79 19 0 0|3332k 210M| 332k 28k| 
0 0 |3555 8813 0 0 89 11 0 0| 0 167M| 53k 7553B| 0 0 |2423 3136 0 0 87 12 0 0| 0 139M| 129k 11k| 0 0 |2073 3888 0 2 80 18 0 0| 32k 170M| 293k 26k| 0 0 |2950 8825 0 0 88 12 0 0| 772k 175M| 95k 8765B| 0 0 |2512 3640 0 2 86 12 0 0| 28k 197M| 199k 12k| 0 0 |2435 5194 0 0 87 13 0 0| 20k 179M| 111k 7843B| 0 0 |2310 3064 avg-cpu: %user %nice %system %iowait %steal %idle 0.770.001.44 15.810.00 81.99 Device: rrqm/s wrqm/s r/s w/srMB/swMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sda 0.0071.800.00 17.80 0.00 0.35 40.27 0.031.570.001.57 0.54 0.96 sdb 0.00 0.000.00 188.40 0.00 1.51 16.37 0.492.590.002.59 0.70 13.20 sdc 0.00 0.004.00 61.00 0.53 2.34 90.34 0.365.61 46.202.95 1.00 6.48 sde 0.00 1542.000.40 2172.00 0.01 165.82 156.34 143.39 65.76 214.00 65.73 0.46 100.00 sdd 0.00 0.003.40 59.60 0.53 1.25 57.85 0.203.19 32.471.52 0.88 5.52 sdf 0.00 0.008.40 75.40 1.35 1.75 75.59 0.516.13 42.102.12 1.37 11.44 avg-cpu: %user %nice %system %iowait %steal %idle 0.230.000.77 15.960.00 83.03 Device: rrqm/s wrqm/s r/s w/srMB/swMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sda 0.0072.200.00 18.20 0.00 0.35 39.74 0.021.320.001.32 0.57 1.04 sdb 0.00 0.000.00 72.00 0.00 0.52 14.80 0.192.580.002.58 0.73 5.28 sdc 0.00 0.000.20 38.00 0.00 1.64 88.17 0.369.36 16.009.33 1.09 4.16 sde 0.00 1554.801.20 2058.20 0.04 163.24 162.37 143.50 69.50 296.67 69.37 0.49 100.00 sdd 0.00 0.003.40 39.00 0.53 2.90 165.51 0.368.49 67.533.34 1.04 4.40 sdf 0.00 0.003.20 53.40 0.53 4.16 169.41 0.83 14.66 53.00 12.36 1.58 8.96 avg-cpu: %user %nice %system %iowait %steal %idle 0.650.001.36 16.560.00 81.43 Device:
Re: Strange write behavior on an osd
Hi, On 24.04.2012 17:23, João Eduardo Luís wrote: Any chance you could run iotop during the busy periods and tell us which processes are issuing the IO? Sure: http://85.214.49.87/ceph/iotop.txt -martin
Re: Strange write behavior on an osd
Hi, On 24.04.2012 18:31, João Eduardo Luís wrote: What kernel and btrfs versions are you using? Kernel: 3.4.0-rc3 btrfs-tools 0.19+20100601-3ubuntu3 This is how I created the fs: mkfs.btrfs -n 32k -l 32k /dev/sd{c,d,e,f} -martin
Re: rbd snapshot in qemu and libvirt
Hi List, is it possible to quiesce the disk before a snapshot? Or does it make no sense with rbd? How about the new rbd_cache, does it get flushed before the snapshot? I would like to use it like this: virsh snapshot-create --quiesce $DOMAIN -martin
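For what it's worth, virsh's --quiesce flag depends on the QEMU guest agent running inside the VM, which in turn needs a virtio channel in the domain definition; a minimal sketch of that channel device (standard agent channel name, socket path is an assumption for illustration):

```xml
<!-- guest agent channel; virsh snapshot-create --quiesce talks to the
     agent through this to freeze guest filesystems before the snapshot -->
<channel type='unix'>
  <source mode='bind' path='/var/lib/libvirt/qemu/linux1.agent'/>
  <target type='virtio' name='org.qemu.guest_agent.0'/>
</channel>
```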
Re: wip-librbd-caching
On 12.04.2012 21:45, Sage Weil wrote: The config options you'll want to look at are client_oc_* (in case you didn't see that already :). oc is short for objectcacher, and it isn't only used for client (libcephfs), so it might be worth renaming these options before people start using them. Hi, I changed the values and the performance is still very good and the memory footprint is much smaller. OPTION(client_oc_size, OPT_INT, 1024*1024*50) // MB * n OPTION(client_oc_max_dirty, OPT_INT, 1024*1024*25) // MB * n (dirty OR tx.. bigish) OPTION(client_oc_target_dirty, OPT_INT, 1024*1024*8) // target dirty (keep this smallish) // note: the max amount of in flight dirty data is roughly (max - target) But I am not quite sure about the meaning of the values. client_oc_size: max size of the cache? client_oc_max_dirty: max dirty value before the writeback starts? client_oc_target_dirty: ??? -martin
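For anyone following along, the smaller values tested above could also be set from ceph.conf instead of patching the defaults; a sketch, under the assumption that the wip-branch options are read from the [client] section like other client options (byte values match the 50/25/8 MB above):

```ini
[client]
    ; total size of the object cache (50 MB)
    client oc size = 52428800
    ; dirty-data ceiling; writers block above this (25 MB)
    client oc max dirty = 26214400
    ; level the flusher tries to bring dirty data back down to (8 MB)
    client oc target dirty = 8388608
```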
rbd snapshot in qemu and libvirt
Hi List, does anyone know the current progress of the rbd snapshot feature integration into qemu and libvirt? -martin
Re: rbd snapshot in qemu and libvirt
Hi Wido, I am looking to manage snapshots via libvirt: create, delete, rollback and list. -martin On 18.04.2012 15:10, Wido den Hollander wrote: I tested this about a year ago and that worked fine. Anything in particular you are looking for?
Re: rbd snapshot in qemu and libvirt
Hi Andrey, if I try it I get this error: virsh snapshot-create linux1 error: Requested operation is not valid: Disk 'rbd/vm1:rbd_cache_enabled=1' does not support snapshotting Maybe the rbd_cache option is the problem? -martin On 18.04.2012 16:39, Andrey Korolyov wrote: I tested all of them about a week ago, all work fine. Also it would be very nice if rbd could list the actual allocated size of every image or snapshot in the future. On Wed, Apr 18, 2012 at 5:22 PM, Martin Mailand mar...@tuxadero.com wrote: Hi Wido, I am looking for doing the snapshots via libvirt, create, delete, rollback and list of the snapshot. -martin On 18.04.2012 15:10, Wido den Hollander wrote: I tested this about a year ago and that worked fine. Anything in particular you are looking for?
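From the error message, the cache setting was being passed by appending it to the rbd source name, which is what libvirt's snapshot support check trips over; a sketch of the disk definition involved (monitor address and target device are made up for illustration):

```xml
<disk type='network' device='disk'>
  <driver name='qemu' type='raw'/>
  <!-- appending ':rbd_cache_enabled=1' to the source name is what
       produced the "does not support snapshotting" error above -->
  <source protocol='rbd' name='rbd/vm1:rbd_cache_enabled=1'>
    <host name='192.0.2.10' port='6789'/>
  </source>
  <target dev='vda' bus='virtio'/>
</disk>
```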
Re: rbd snapshot in qemu and libvirt
Hi,

On 18.04.2012 17:52, Andrey Korolyov wrote:
> Oh, I forgot to say about a patch:

Perfect, now it works. Thanks.

-martin
wip-librbd-caching
Hi,

today I tried the wip-librbd-caching branch. The performance improvement is very good, particularly for small writes. I tested from within a VM with fio (rbd_cache_enabled=1):

fio -name iops -rw=write -size=10G -iodepth 1 -filename /tmp/bigfile -ioengine libaio -direct 1 -bs 4k

I get over 10k IOPS; with an iodepth of 4 I get over 30k IOPS. In comparison, with the rbd_writebackwindow I get around 5k IOPS at an iodepth of 1. So far the whole cluster has been running stable for over 12 hours.

But there is also a downside. My typical VMs are 1 GB in size and the default cache size is 200 MB, which is 20% more memory usage. Maybe 50 MB or less would be enough? I am going to test that. The other point is that the cache is not KSM enabled, so identical pages will not be merged. Could that be changed, and what would be the downside? Then maybe we could reduce the memory footprint of the cache but keep its performance.

-martin
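A smaller cache could presumably be set per client in ceph.conf. The rbd_cache_enabled spelling matches what is used in this thread, but the size option name, its unit (bytes), and the [client] placement are assumptions for the wip-librbd-caching era, so double-check them against the branch:

```
[client]
    rbd_cache_enabled = 1
    ; 50 MB instead of the 200 MB default discussed above (assumed to be in bytes)
    rbd_cache_size = 52428800
```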
Re: [Crowbar] barclamp-ceph and crowbar
Hi John,

I tried them a few weeks ago. They were developed for crowbar version 1.1 and don't seem to work with 1.2: if I try to create a proposal, the next page is white and an error is logged. The barclamp installs one ceph-mon node and several ceph-store nodes. The glue to connect your virtual machines to the ceph-store nodes is not included in the barclamp.

-martin

On 24.02.2012 15:46, John Alberts wrote:
> Does anyone know what I can do with barclamp-ceph? https://github.com/NewDreamNetwork/barclamp-ceph The code hasn't been touched since its initial import 4 months ago. Does it allow me to easily use ceph for /var/lib/instances on compute hosts so I can use features like live migration easily?
> Thanks
> John
> ___ Crowbar mailing list crow...@dell.com https://lists.us.dell.com/mailman/listinfo/crowbar For more information: https://github.com/dellcloudedge/crowbar/wiki
Re: osd crash during resync
Hi Sage,

I uploaded the osd.0 log as well: http://85.214.49.87/ceph/20120124/osd.0.log.bz2

-martin

On 25.01.2012 23:08, Sage Weil wrote:
> Hi Martin,
>
> On Tue, 24 Jan 2012, Martin Mailand wrote:
>> Hi,
>> today I tried the btrfs patch mentioned on the btrfs ml. Therefore I rebooted osd.0 with a new kernel and created a new btrfs on osd.0, then I took osd.0 into the cluster. During the resync of osd.0, osd.2 and osd.3 crashed. I am not sure if the crashes happened because I played with osd.0, or if they are bugs.
>>
>> osd.2
>> -rw------- 1 root root 1.1G 2012-01-24 12:19 core-ceph-osd-1000-1327403927-s-brick-002
>> log:
>> 2012-01-24 12:15:45.563135 7f1fdd42c700 log [INF] : 2.a restarting backfill on osd.0 from (185'113859,185'113859] 0//0 to 196'114038
>> osd/PG.cc: In function 'void PG::finish_recovery_op(const hobject_t&, bool)', in thread '7f1fdab26700'
>> osd/PG.cc: 1553: FAILED assert(recovery_ops_active > 0)
>> -rw------- 1 root root 758M 2012-01-24 15:58 core-ceph-osd-20755-1327417128-s-brick-002
>
> Can you post the log for osd.0 too? Thanks!
> sage
>
>> log:
>> 2012-01-24 15:58:48.356892 7fe26acbf700 osd.2 379 pg[2.ff( v 379'286211 lc 202'286160 (185'285159,379'286211] n=112 ec=1 les/c 379/310 373/376/376) [2,1] r=0 lpr=376 rops=1 mlcod 202'286160 active m=6] * oi-watcher: client.4478 cookie=1
>> osd/ReplicatedPG.cc: In function 'void ReplicatedPG::populate_obc_watchers(ReplicatedPG::ObjectContext*)', in thread '7fe26fdca700'
>> osd/ReplicatedPG.cc: 3199: FAILED assert(obc->watchers.size() == 0)
>> http://85.214.49.87/ceph/20120124/osd.2.log.bz2
>>
>> osd.3
>> -rw------- 1 root root 986M 2012-01-24 12:24 core-ceph-osd-962-1327404263-s-brick-003
>> log:
>> 2012-01-24 12:15:50.241321 7f30c8fde700 log [INF] : 2.2e restarting backfill on osd.0 from (185'338312,185'338312] 0//0 to 196'339910
>> 2012-01-24 12:21:48.420242 7f30c5ed7700 log [INF] : 2.9d scrub ok
>> osd/PG.cc: In function 'void PG::activate(ObjectStore::Transaction&, std::list<Context*>&, std::map<int, std::map<pg_t, PG::Query> >&, std::map<int, MOSDPGInfo*>*)', in thread '7f30c8fde700'
>> http://85.214.49.87/ceph/20120124/osd.3.log.bz2
>> -martin
Re: Btrfs slowdown with ceph (how to reproduce)
Hi,

I tried the branch on one of my ceph osds, and there is a big difference in performance. The average request size stayed high, but after around an hour the kernel crashed.
IOstat: http://pastebin.com/xjuriJ6J
Kernel trace: http://pastebin.com/SYE95GgH

-martin

On 23.01.2012 19:50, Chris Mason wrote:
> On Mon, Jan 23, 2012 at 01:19:29PM -0500, Josef Bacik wrote:
>> On Fri, Jan 20, 2012 at 01:13:37PM +0100, Christian Brunner wrote:
>>> As you might know, I have been seeing btrfs slowdowns in our ceph cluster for quite some time. Even with the latest btrfs code for 3.3 I'm still seeing these problems. To make things reproducible, I've now written a small test that imitates ceph's behavior: On a freshly created btrfs filesystem (2 TB size, mounted with noatime,nodiratime,compress=lzo,space_cache,inode_cache) I'm opening 100 files. After that I'm doing random writes on these files with a sync_file_range after each write (each write has a size of 100 bytes) and an ioctl(BTRFS_IOC_SYNC) after every 100 writes. After approximately 20 minutes, write activity suddenly increases fourfold and the average request size decreases (see chart in the attachment). You can find IOstat output here: http://pastebin.com/Smbfg1aG I hope that you are able to trace down the problem with the test program in the attachment.
>> Ran it, saw the problem, tried the dangerdonteveruse branch in Chris's tree and formatted the fs with 64k node and leaf sizes, and the problem appeared to go away. So surprise surprise, fragmentation is biting us in the ass. If you can, try running that branch with 64k node and leaf sizes with your ceph cluster and see how that works out. Of course you should only do that if you don't mind losing everything :). Thanks,
> Please keep in mind this branch is only out there for development, and it really might have huge flaws. scrub doesn't work with it correctly right now, and the IO error recovery code is probably broken too.
> Long term though, I think the bigger block sizes are going to make a huge difference in these workloads. If you use the very dangerous code:
>
> mkfs.btrfs -l 64k -n 64k /dev/xxx
>
> (-l is leaf size, -n is node size). 64K is the max right now; 32K may help just as much at a lower CPU cost.
> -chris
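Christian's repro (random 100-byte writes over 100 open files, a sync_file_range after each write, and an ioctl(BTRFS_IOC_SYNC) after every 100 writes) can be approximated with a short script. This is not his actual test program: it substitutes os.fdatasync and os.sync for the btrfs-specific calls, which are not reachable from Python's stdlib, so it only mirrors the write/sync pattern, not the exact syscalls:

```python
import os
import random
import tempfile

def churn(dirpath, nfiles=100, file_size=1 << 20, writes=1000, write_size=100):
    """Imitate ceph's small-sync-write pattern: random write_size-byte
    writes spread over nfiles files, syncing file data after every write
    and the whole filesystem after every 100 writes."""
    fds = [os.open(os.path.join(dirpath, "obj.%d" % i),
                   os.O_CREAT | os.O_RDWR, 0o644) for i in range(nfiles)]
    try:
        for n in range(writes):
            fd = random.choice(fds)
            offset = random.randrange(0, file_size - write_size)
            os.pwrite(fd, b"x" * write_size, offset)
            os.fdatasync(fd)       # stand-in for sync_file_range()
            if (n + 1) % 100 == 0:
                os.sync()          # stand-in for ioctl(BTRFS_IOC_SYNC)
    finally:
        for fd in fds:
            os.close(fd)
    return writes

# scaled way down so it finishes quickly; the original ran for ~20 minutes
with tempfile.TemporaryDirectory() as d:
    done = churn(d, nfiles=10, writes=100)
```

Run on a btrfs mount (with the real syscalls swapped back in via ctypes) this should reproduce the fragmentation pattern; on any other filesystem it is only a shape of the workload.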
Re: Btrfs slowdown with ceph (how to reproduce)
Hi Chris,

great to hear that. Could you give me a ping when you have fixed it, then I can retry it?

-martin

On 24.01.2012 20:40, Chris Mason wrote:
> On Tue, Jan 24, 2012 at 08:15:58PM +0100, Martin Mailand wrote:
>> Hi
>> I tried the branch on one of my ceph osds, and there is a big difference in performance. The average request size stayed high, but after around an hour the kernel crashed.
>> IOstat: http://pastebin.com/xjuriJ6J
>> Kernel trace: http://pastebin.com/SYE95GgH
> Aha, this I know how to fix. Thanks for trying it out.
> -chris
Re: osd crash during resync
Hi Greg,

ok, do you guys still need the core files, or can I delete them?

-martin

On 24.01.2012 22:13, Gregory Farnum wrote:
> On Tue, Jan 24, 2012 at 10:48 AM, Martin Mailand <mar...@tuxadero.com> wrote:
>> Hi,
>> today I tried the btrfs patch mentioned on the btrfs ml. Therefore I rebooted osd.0 with a new kernel and created a new btrfs on osd.0, then I took osd.0 into the cluster. During the resync of osd.0, osd.2 and osd.3 crashed. I am not sure if the crashes happened because I played with osd.0, or if they are bugs.
> These are OSD-level issues not caused by btrfs, so your new kernel definitely didn't do it. It's probably fallout from the backfill changes that got merged in last week. I created new bugs to track them: http://tracker.newdream.net/issues/1982 (1983, 1984). Sam and Josh are going wild on some other issues that we've turned up, and these have been added to the queue as soon as somebody qualified can get to them. :)
> -Greg
rbd snap ls does not list more than 200 snapshots
Hi,

I created quite a few snapshots of an rbd image. After around 200 snapshots, the command rbd snap ls vm10 does not return; instead it uses all of the memory of a 32 GB machine and then the OOM killer kicks in. Are 200 snapshots a known limit?

How to reproduce:
for i in $(seq 500); do rbd snap create --snap=a$i vm10; echo $i; done
rbd snap ls vm10 (doesn't return)

top:
25381 root 20 0 5425m 5.2g 5436 S 29 16.4 1:10.21 rbd

rbd -v
ceph version 0.40-206-g6c275c8 (commit:6c275c8195a8ae04e8a492d043fa6dfd60cecd82)

-martin
Re: Assertion in v0.40 - os/FileStore.cc: 2438: FAILED assert(0 == "unexpected error")
Hi Sage,

that's exactly what I did; the first two crashes are in this log. Unfortunately no debug level was set. http://85.214.49.87/ceph/osd.0.full.log.bz2

-martin

On 15.01.2012 03:45, Sage Weil wrote:
> Hi Martin-
>
> On Sat, 14 Jan 2012, Martin Mailand wrote:
>> Hi
>> one of four OSDs died during the update to v0.40 with an assertion: os/FileStore.cc: 2438: FAILED assert(0 == "unexpected error"). Even after a complete shutdown of the cluster and a new start with all OSDs at the same version, this osd did not start. The OSD log is attached.
> It's trying to replay a transaction that appears to be invalid because the .2 clone is smaller than it thinks. Is this the first time the OSD crashed, or did it crash once, and you cranked up logs and generated this one? If you have the previous log, that would be helpful... it should have a similar transaction dump but a different stack trace. Also, are any of the 6 patches on top of 0.40 related to the filestore or osd? Thanks!
> sage
Re: Assertion in v0.40 - os/FileStore.cc: 2438: FAILED assert(0 == "unexpected error")
Hi Sage,

here is the requested dump file: http://85.214.49.87/ceph/foo.txt.bz2

-martin

On 15.01.2012 06:52, Sage Weil wrote:
> Hi Martin-
>
> On Sat, 14 Jan 2012, Sage Weil wrote:
>> [quoted exchange from the previous mail snipped]
> I pushed a wip-osd-dump-journal branch to git that will make ceph-osd -i whatever --dump-journal /tmp/foo.txt dump the contents of your entire osd journal (sans data) to a text file. Do you mind sending that along as well? I'd like to see what is in the journal _after_ the event that is failing (if anything). Thanks!
> sage
Re: v0.40 released
Hi,

is there an example of how to use it? There is no ceph plugin for collectd.

-martin

On 14.01.2012 06:30, Sage Weil wrote:
> * mon: expose cluster stats via admin socket (accessible via collectd plugin)
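For lack of a ready-made collectd plugin, the stats can be pulled straight off the daemon's admin socket and fed to collectd via its exec plugin. The wire format assumed below (send the command, read a 4-byte big-endian length prefix, then that many bytes of JSON) is my reading of the v0.40-era code, so verify it against your build before relying on it; the socket path and command name are assumptions too:

```python
import json
import socket
import struct

def query_admin_socket(path, command=b"perfcounters_dump\n"):
    """Ask a Ceph daemon's UNIX admin socket for its stats.

    Assumed protocol: write the command, read a 4-byte big-endian
    length prefix, then read that many bytes of JSON.
    """
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        s.connect(path)
        s.sendall(command)
        raw = b""
        while len(raw) < 4:
            chunk = s.recv(4 - len(raw))
            if not chunk:
                raise IOError("short read of length prefix")
            raw += chunk
        (length,) = struct.unpack(">I", raw)
        payload = b""
        while len(payload) < length:
            chunk = s.recv(length - len(payload))
            if not chunk:
                raise IOError("short read of JSON payload")
            payload += chunk
        return json.loads(payload.decode())
    finally:
        s.close()

# e.g. stats = query_admin_socket("/var/run/ceph/ceph-mon.a.asok")
```

A collectd exec-plugin wrapper would then just print selected counters in collectd's PUTVAL format.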
Assertion in v0.40 - os/FileStore.cc: 2438: FAILED assert(0 == "unexpected error")
Hi

one of four OSDs died during the update to v0.40 with an assertion: os/FileStore.cc: 2438: FAILED assert(0 == "unexpected error"). Even after a complete shutdown of the cluster and a new start with all OSDs at the same version, this osd did not start. The OSD log is attached.

-martin

[Attachment: osd.0.log.bz2 (application/bzip)]
Assertion: ./messages/MOSDRepScrub.h: 64: FAILED assert(v == 0)
Hi

today 2 of my osds (osd.4 and osd.7) crashed with the same error.

2011-12-21 14:41:18.896008 7fae9f3a5700 journal check_for_full at 80625664 : JOURNAL FULL 80625664 >= 368639 (max_size 107372544 start 80994304)
2011-12-21 14:41:23.205993 7fae9fba6700 journal FULL_FULL -> FULL_WAIT. last commit epoch committed, waiting for a new one to start.
2011-12-21 14:41:24.075990 7fae9fba6700 journal FULL_WAIT -> FULL_NOTFULL. journal now active, setting completion plug.

./messages/MOSDRepScrub.h: In function 'virtual void MOSDRepScrub::decode_payload(CephContext*)', in thread '7fae93977700'
./messages/MOSDRepScrub.h: 64: FAILED assert(v == 0)
ceph version 0.39-171-gdcedda8 (commit:dcedda84d0e1f69af985c301276c67c1b11e7efc)
 1: /usr/bin/ceph-osd() [0x685e77]
 2: (decode_message(CephContext*, ceph_msg_header&, ceph_msg_footer&, ceph::buffer::list&, ceph::buffer::list&, ceph::buffer::list&)+0xcd2) [0x6a7202]
 3: (SimpleMessenger::Pipe::read_message(Message**)+0x136d) [0x62c9cd]
 4: (SimpleMessenger::Pipe::reader()+0xb99) [0x6357d9]
 5: (SimpleMessenger::Pipe::Reader::entry()+0xd) [0x4c244d]
 6: (()+0x6d8c) [0x7faea6873d8c]
 7: (clone()+0x6d) [0x7faea4eb004d]
ceph version 0.39-171-gdcedda8 (commit:dcedda84d0e1f69af985c301276c67c1b11e7efc)
 1: /usr/bin/ceph-osd() [0x685e77]
 2: (decode_message(CephContext*, ceph_msg_header&, ceph_msg_footer&, ceph::buffer::list&, ceph::buffer::list&, ceph::buffer::list&)+0xcd2) [0x6a7202]
 3: (SimpleMessenger::Pipe::read_message(Message**)+0x136d) [0x62c9cd]
 4: (SimpleMessenger::Pipe::reader()+0xb99) [0x6357d9]
 5: (SimpleMessenger::Pipe::Reader::entry()+0xd) [0x4c244d]
 6: (()+0x6d8c) [0x7faea6873d8c]
 7: (clone()+0x6d) [0x7faea4eb004d]
*** Caught signal (Aborted) ** in thread 7fae93977700
ceph version 0.39-171-gdcedda8 (commit:dcedda84d0e1f69af985c301276c67c1b11e7efc)
 1: /usr/bin/ceph-osd() [0x645172]
 2: (()+0xfc60) [0x7faea687cc60]
 3: (gsignal()+0x35) [0x7faea4dfdd05]
 4: (abort()+0x186) [0x7faea4e01ab6]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7faea56b46dd]
 6: (()+0xb9926) [0x7faea56b2926]
 7: (()+0xb9953) [0x7faea56b2953]
 8: (()+0xb9a5e) [0x7faea56b2a5e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x396) [0x6193d6]
 10: /usr/bin/ceph-osd() [0x685e77]
 11: (decode_message(CephContext*, ceph_msg_header&, ceph_msg_footer&, ceph::buffer::list&, ceph::buffer::list&, ceph::buffer::list&)+0xcd2) [0x6a7202]
 12: (SimpleMessenger::Pipe::read_message(Message**)+0x136d) [0x62c9cd]
 13: (SimpleMessenger::Pipe::reader()+0xb99) [0x6357d9]
 14: (SimpleMessenger::Pipe::Reader::entry()+0xd) [0x4c244d]
 15: (()+0x6d8c) [0x7faea6873d8c]
 16: (clone()+0x6d) [0x7faea4eb004d]

(gdb) thread apply all bt
<snip>
Thread 1 (Thread 2400):
#0  0x7faea687cb3b in raise () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00644dc2 in reraise_fatal (signum=6) at global/signal_handler.cc:59
#2  0x006453ba in handle_fatal_signal (signum=6) at global/signal_handler.cc:106
#3  <signal handler called>
#4  0x7faea4dfdd05 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#5  0x7faea4e01ab6 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#6  0x7faea56b46dd in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x7faea56b2926 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#8  0x7faea56b2953 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#9  0x7faea56b2a5e in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#10 0x006193d6 in ceph::__ceph_assert_fail (assertion=<value optimized out>, file=<value optimized out>, line=<value optimized out>, func=<value optimized out>) at common/assert.cc:70
#11 0x00685e77 in MOSDRepScrub::decode_payload (this=0x33c0c40, cct=<value optimized out>) at ./messages/MOSDRepScrub.h:64
#12 0x006a7202 in decode_message (cct=0x2722000, header=..., footer=<value optimized out>, front=<value optimized out>, middle=<value optimized out>, data=...) at msg/Message.cc:551
#13 0x0062c9cd in SimpleMessenger::Pipe::read_message (this=0x2ed3780, pm=0x7fae93976d88) at msg/SimpleMessenger.cc:1987
#14 0x006357d9 in SimpleMessenger::Pipe::reader (this=0x2ed3780) at msg/SimpleMessenger.cc:1601
#15 0x004c244d in SimpleMessenger::Pipe::Reader::entry (this=<value optimized out>) at msg/SimpleMessenger.h:208
#16 0x7faea6873d8c in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#17 0x7faea4eb004d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#18 0x in ?? ()

(gdb) thread 1
[Switching to thread 1 (Thread 2400)]
#0  0x7faea687cb3b in raise () from /lib/x86_64-linux-gnu/libpthread.so.0
(gdb) frame 11
#11 0x00685e77 in MOSDRepScrub::decode_payload (this=0x33c0c40, cct=<value optimized out>) at ./messages/MOSDRepScrub.h:64
64      ./messages/MOSDRepScrub.h: No such file or directory.
Re: Random blocks when accessing rbd images
Hi Samuel,

I think I am seeing it now.

root@s-brick-003:~# ceph pg dump | grep -i scrub
pg_stat objects mip degr unf kb bytes log disklog state v reported up acting last_scrub
0.6 0 0 0 0 0 0 0 0 active+clean+scrubbing 0'0 60'156 [6,2] [6,2] 0'0 2011-12-20 14:44:55.787529
root@s-brick-003:~# ceph -v
ceph version 0.39-171-gdcedda8 (commit:dcedda84d0e1f69af985c301276c67c1b11e7efc)
root@s-brick-003:~#

I also had an osd crash and hit this (Assertion: ./messages/MOSDRepScrub.h: 64: FAILED assert(v == 0)); see my other email for more information.

-martin

On 16.12.2011 22:17, Samuel Just wrote:
> In master, 061e7619aacf60a828e0ce84a108d5a0bea247c6 may fix the problem. If not, 5274e88d2cb8c0449a4ecd1ff0cf8bb0af2cfc97 includes some asserts that may give us a clue as to how this is happening.
> -Sam
Re: Assertion: ./messages/MOSDRepScrub.h: 64: FAILED assert(v == 0)
Hi Greg,

ok. I also currently have one pg which stays in scrubbing; is that also a result of the different versions I am running? Do you know if Sam needs the cluster in this state to debug the scrubbing problem, or is it unusable for that due to the different versions?

-martin

On 22.12.2011 21:24, Gregory Farnum wrote:
> I see you're following master! :) You got bit by a wire-incompatible change in one of the OSD messages that Sam made, although I think he's actually going to be walking it back after a conversation we just had. In any case, restarting all of your OSDs so they're running the same code will fix it. :)
> -Greg
>
> On Thu, Dec 22, 2011 at 5:48 AM, Martin Mailand <mar...@tuxadero.com> wrote:
>> Hi
>> today 2 of my osds (osd.4 and osd.7) crashed with the same error.
>> 2011-12-21 14:41:18.896008 7fae9f3a5700 journal check_for_full at 80625664 : JOURNAL FULL 80625664 >= 368639 (max_size 107372544 start 80994304)
>> 2011-12-21 14:41:23.205993 7fae9fba6700 journal FULL_FULL -> FULL_WAIT. last commit epoch committed, waiting for a new one to start.
>> 2011-12-21 14:41:24.075990 7fae9fba6700 journal FULL_WAIT -> FULL_NOTFULL. journal now active, setting completion plug.
>> [quoted assert output and backtrace snipped; identical to the original report]
Re: Assertion: ./messages/MOSDRepScrub.h: 64: FAILED assert(v == 0)
Hi Sam,

okay, after I upgraded the whole cluster, the stuck pg went away.

-martin

On 22.12.2011 22:08, Samuel Just wrote:
> Martin, that bug should actually be fixed in current master. You'll need to upgrade the whole cluster, though.
> -Sam
>
> On Thu, Dec 22, 2011 at 12:40 PM, Martin Mailand <mar...@tuxadero.com> wrote:
>> Hi Greg,
>> ok. I also currently have one pg which stays in scrubbing; is that also a result of the different versions I am running? Do you know if Sam needs the cluster in this state to debug the scrubbing problem, or is it unusable for that due to the different versions?
>> -martin
>>
>> On 22.12.2011 21:24, Gregory Farnum wrote:
>>> I see you're following master! :) You got bit by a wire-incompatible change in one of the OSD messages that Sam made, although I think he's actually going to be walking it back after a conversation we just had. In any case, restarting all of your OSDs so they're running the same code will fix it. :)
>>> -Greg
>>>
>>> On Thu, Dec 22, 2011 at 5:48 AM, Martin Mailand <mar...@tuxadero.com> wrote:
>>>> Hi
>>>> today 2 of my osds (osd.4 and osd.7) crashed with the same error.
>>>> 2011-12-21 14:41:18.896008 7fae9f3a5700 journal check_for_full at 80625664 : JOURNAL FULL 80625664 >= 368639 (max_size 107372544 start 80994304)
>>>> 2011-12-21 14:41:23.205993 7fae9fba6700 journal FULL_FULL -> FULL_WAIT. last commit epoch committed, waiting for a new one to start.
>>>> 2011-12-21 14:41:24.075990 7fae9fba6700 journal FULL_WAIT -> FULL_NOTFULL. journal now active, setting completion plug.
>>>> [quoted assert output and backtrace snipped; identical to the original report]
Re: Random blocks when accessing rbd images
Hi Guido,

I am running ceph version 0.39-37-g54758ab (commit:54758abccf429122c1bc3bce6d01bc33f1cfe238) on my cluster and I do not see this problem. Do you use the qemu rbd block driver or the kernel mount? How did you install ceph, via the packages?

-martin

On 15.12.2011 16:45, Guido Winkelmann wrote:
> On Thursday, 15 December 2011, 17:32:25, you wrote:
>> On 12/15/2011 05:07 PM, Guido Winkelmann wrote:
>>> Hi, I've got a small ceph cluster with one mon, one mds and two osds (all on the same machine, for now) that I want to use as a block and file storage backend for qemu machine virtualisation. I found that read access to some of the rbd images, or parts of some of them, sometimes blocks indefinitely, usually after the image has been sitting around untouched for a while, for example over night. This has the effect that virtual machines that try to access their disks, as well as rbd commands like rbd cp, will just hang indefinitely. I found that these blocks can usually be fixed by restarting one of the osds. The last time this happened, ceph -s reported one of the osds to be in state active+clean+scrubbing. (I'm afraid I don't have the complete output from ceph -s anymore.) Does anybody have any idea what could be going wrong here?
>> I think it's fixed in v0.39
> I'm already using 0.39, so, no. (Should have mentioned that to start with...)
> Guido
Re: Random blocks when accessing rbd images
Hi Wido, but wasn't that fixed a few weeks ago? -martin On 15.12.2011 17:33, Wido den Hollander wrote: Yes, from what I've seen it will block indefinitely until you restart one of the OSDs that are members of the PG. Wido
Re: Random blocks when accessing rbd images
Hi, at least there is a patch that should have fixed it. http://marc.info/?l=ceph-devel&m=131955913203561&w=2 On 15.12.2011 17:38, Martin Mailand wrote: Hi Wido, but wasn't that fixed a few weeks ago? -martin On 15.12.2011 17:33, Wido den Hollander wrote: Yes, from what I've seen it will block indefinitely until you restart one of the OSDs that are members of the PG. Wido
Re: os/FileJournal.cc: 1011: FAILED assert(seq >= last_committed_seq)
Hi Sage, it happened again, this time I have the log, it's attached. (gdb) thread 1 [Switching to thread 1 (Thread 24077)] #0 0x7f7995b83b3b in raise () from /lib/x86_64-linux-gnu/libpthread.so.0 (gdb) frame 11 #11 0x0072ee8d in FileJournal::committed_thru (this=0x1ebc000, seq=16833973) at os/FileJournal.cc:1011 1011 os/FileJournal.cc: No such file or directory. in os/FileJournal.cc (gdb) p seq $1 = 16833973 (gdb) p last_committed_seq $2 = 16834010 (gdb) Is this all the info you need, or should I leave the osd in this state for further debugging? -martin On 29.11.2011 17:07, Sage Weil wrote: On Tue, 29 Nov 2011, Martin Mailand wrote: Hi, with a build from today, I have the same prob. os/FileJournal.cc: In function 'virtual void FileJournal::committed_thru(uint64_t)', in thread '7fc55c85f700' os/FileJournal.cc: 1011: FAILED assert(seq >= last_committed_seq) ceph version 0.38-250-gc2889fe (commit:c2889fef420611df3dd0de4064c91f6aa9f86625) Can you post a log of the failed ceph-osd restart with 'debug journal = 20' and 'debug filestore = 20'? Thanks! sage osd.0.log.debug.bz2 Description: BZip2 compressed data
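The gdb session above shows exactly why the assertion fired: the journal was asked to record a commit through seq 16833973 after it had already recorded 16834010. A minimal Python model of that bookkeeping invariant (illustrative only, not Ceph's actual implementation; the class and attribute names are made up):

```python
# Minimal model of the FileJournal::committed_thru() invariant:
# commit notifications must arrive in non-decreasing sequence order.
class JournalModel:
    def __init__(self):
        self.last_committed_seq = 0

    def committed_thru(self, seq):
        # Mirrors FAILED assert(seq >= last_committed_seq): being told
        # about a commit older than one already recorded is a bug.
        assert seq >= self.last_committed_seq, (
            f"seq {seq} < last_committed_seq {self.last_committed_seq}")
        self.last_committed_seq = seq

j = JournalModel()
j.committed_thru(16833973)
j.committed_thru(16834010)      # fine: sequence moves forward
try:
    j.committed_thru(16833973)  # older seq replayed, as in the crash
    crashed = False
except AssertionError:
    crashed = True
print(crashed)  # prints True
```

This matches the gdb values in the report: seq = 16833973 against last_committed_seq = 16834010, so the assertion trips.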
Re: Cluster sync doesn't finish
Hi Sam, is there anything new on this issue that I could test? -martin On 19.11.2011 02:05, Samuel Just wrote: I've inserted this bug as #1738. Unfortunately, this will take a bit of effort to fix. In the short term, you could switch to a crushmap where each node at the bottom level of the hierarchy contains more than one device. (I.e., remove the node level and stop at the rack level.) Thanks for the help! -Sam On Fri, Nov 18, 2011 at 12:17 PM, Martin Mailand mar...@tuxadero.com wrote: Hi Sam, here the crushmap http://85.214.49.87/ceph/crushmap.txt http://85.214.49.87/ceph/crushmap -martin Samuel Just wrote: It looks like a crushmap-related problem. Could you send us the crushmap? ceph osd getcrushmap Thanks -Sam On Fri, Nov 18, 2011 at 10:13 AM, Gregory Farnum gregory.far...@dreamhost.com wrote: On Fri, Nov 18, 2011 at 10:05 AM, Tommi Virtanen tommi.virta...@dreamhost.com wrote: On Thu, Nov 17, 2011 at 12:48, Martin Mailand mar...@tuxadero.com wrote: Hi, I am doing cluster failure tests, where I shut down one OSD and wait for the cluster to sync. But the sync never finished; at around 4-5% it stops. I stopped osd2. ... 2011-11-17 16:42:45.520740 pg v1337: 600 pgs: 547 active+clean, 53 active+clean+degraded; 113 GB data, 184 GB used, 1141 GB / 1395 GB avail; 4025/82404 degraded (4.884%) ... The osd log, the ceph.conf, pg dump, osd dump can be found here. http://85.214.49.87/ceph/ This looks a bit worrying: 2011-11-17 17:56:35.771574 7f704c834700 -- 192.168.42.113:0/2424 >> 192.168.42.114:6802/21115 pipe(0x2596c80 sd=17 pgs=0 cs=0 l=0).connect claims to be 192.168.42.114:6802/21507 not 192.168.42.114:6802/21115 - wrong node! So osd.0 is basically refusing to talk to one of the other OSDs. I don't understand the messenger well enough to know why this would be, but it wouldn't surprise me if this problem kept the objects degraded -- it looks like a breakage in the osd-osd communication. Now if this was the reason, I'd expect a restart of all the OSDs to get it back in shape; messenger state is ephemeral. Can you confirm that? Probably not — that wrong node thing can occur for a lot of different reasons, some of which matter and most of which don't. Sam's looking into the problem; there's something going wrong with the CRUSH calculations or the monitor PG placement overrides or something... -Greg
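Sam's short-term workaround above — a hierarchy whose bottom-level buckets each contain more than one device — would look roughly like this in decompiled crushmap syntax. The bucket names, ids, and weights here are purely illustrative, not taken from Martin's actual map:

```
# Two devices per host bucket, so CRUSH can re-map within a bucket
# when one device goes down, instead of dead-ending at a leaf that
# holds a single OSD.
host store1 {
	id -2
	alg straw
	hash 0	# rjenkins1
	item osd.0 weight 1.000
	item osd.1 weight 1.000
}
host store2 {
	id -3
	alg straw
	hash 0	# rjenkins1
	item osd.2 weight 1.000
	item osd.3 weight 1.000
}
root default {
	id -1
	alg straw
	hash 0	# rjenkins1
	item store1 weight 2.000
	item store2 weight 2.000
}
```

A map in this shape can be compiled with crushtool and injected with ceph osd setcrushmap.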
Re: os/FileJournal.cc: 1011: FAILED assert(seq >= last_committed_seq)
Hi Sage, I just updated the crashed osd, and it did not work very well. os/FileJournal.cc: 1173: FAILED assert(h->seq >= last_committed_seq) 1173 os/FileJournal.cc: No such file or directory. in os/FileJournal.cc (gdb) p h->seq value has been optimized out (gdb) p last_committed_seq $1 = 16834095 -martin On 05.12.2011 18:44, Sage Weil wrote: dc167bac7800c75df971bded4b54e0de48f7b18f (wip-journal branch) should fix this. Can you give it a test before I push to stable? Thanks! sage On Mon, 5 Dec 2011, Martin Mailand wrote: Hi Sage, it happened again, this time I have the log, it's attached. (gdb) thread 1 [Switching to thread 1 (Thread 24077)] #0 0x7f7995b83b3b in raise () from /lib/x86_64-linux-gnu/libpthread.so.0 (gdb) frame 11 #11 0x0072ee8d in FileJournal::committed_thru (this=0x1ebc000, seq=16833973) at os/FileJournal.cc:1011 1011 os/FileJournal.cc: No such file or directory. in os/FileJournal.cc (gdb) p seq $1 = 16833973 (gdb) p last_committed_seq $2 = 16834010 (gdb) Is this all the info you need, or should I leave the osd in this state for further debugging? -martin On 29.11.2011 17:07, Sage Weil wrote: On Tue, 29 Nov 2011, Martin Mailand wrote: Hi, with a build from today, I have the same prob. os/FileJournal.cc: In function 'virtual void FileJournal::committed_thru(uint64_t)', in thread '7fc55c85f700' os/FileJournal.cc: 1011: FAILED assert(seq >= last_committed_seq) ceph version 0.38-250-gc2889fe (commit:c2889fef420611df3dd0de4064c91f6aa9f86625) Can you post a log of the failed ceph-osd restart with 'debug journal = 20' and 'debug filestore = 20'? Thanks! sage osd.log.bz2 Description: application/bzip
os/FileJournal.cc: 1011: FAILED assert(seq >= last_committed_seq)
Hi, I hit this assertion a few times. I use ext4 as the osd fs, so I think we have to replay the whole journal; maybe that triggers it. -martin 2011-11-29 11:37:55.393296 7fab45dbc7a0 FileStore is up to date. os/FileJournal.cc: In function 'virtual void FileJournal::committed_thru(uint64_t)', in thread '7fab434cf700' os/FileJournal.cc: 1011: FAILED assert(seq >= last_committed_seq) ceph version 0.38-244-g30def38 (commit:30def38d21b217f244db74e6c469598d794fa8a1) 1: (FileJournal::committed_thru(unsigned long)+0xcd) [0x72e7cd] 2: (JournalingObjectStore::commit_finish()+0xb9) [0x714d79] 3: (FileStore::sync_entry()+0xec7) [0x70aae7] 4: (FileStore::SyncThread::entry()+0xd) [0x7139bd] 5: (()+0x6d8c) [0x7fab45993d8c] 6: (clone()+0x6d) [0x7fab43fd004d] ceph version 0.38-244-g30def38 (commit:30def38d21b217f244db74e6c469598d794fa8a1) 1: (FileJournal::committed_thru(unsigned long)+0xcd) [0x72e7cd] 2: (JournalingObjectStore::commit_finish()+0xb9) [0x714d79] 3: (FileStore::sync_entry()+0xec7) [0x70aae7] 4: (FileStore::SyncThread::entry()+0xd) [0x7139bd] 5: (()+0x6d8c) [0x7fab45993d8c] 6: (clone()+0x6d) [0x7fab43fd004d] *** Caught signal (Aborted) ** in thread 7fab434cf700 ceph version 0.38-244-g30def38 (commit:30def38d21b217f244db74e6c469598d794fa8a1) 1: /usr/bin/ceph-osd() [0x5a7ba2] 2: (()+0xfc60) [0x7fab4599cc60] 3: (gsignal()+0x35) [0x7fab43f1dd05] 4: (abort()+0x186) [0x7fab43f21ab6] 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fab447d46dd] 6: (()+0xb9926) [0x7fab447d2926] 7: (()+0xb9953) [0x7fab447d2953] 8: (()+0xb9a5e) [0x7fab447d2a5e] 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x396) [0x5cd9e6] 10: (FileJournal::committed_thru(unsigned long)+0xcd) [0x72e7cd] 11: (JournalingObjectStore::commit_finish()+0xb9) [0x714d79] 12: (FileStore::sync_entry()+0xec7) [0x70aae7] 13: (FileStore::SyncThread::entry()+0xd) [0x7139bd] 14: (()+0x6d8c) [0x7fab45993d8c] 15: (clone()+0x6d) [0x7fab43fd004d] Thread 1 (Thread 2491): #0 0x7fab4599cb3b in raise () from /lib/x86_64-linux-gnu/libpthread.so.0 #1 0x005a77f2 in reraise_fatal (signum=6) at global/signal_handler.cc:59 #2 0x005a7dea in handle_fatal_signal (signum=6) at global/signal_handler.cc:106 #3 <signal handler called> #4 0x7fab43f1dd05 in raise () from /lib/x86_64-linux-gnu/libc.so.6 #5 0x7fab43f21ab6 in abort () from /lib/x86_64-linux-gnu/libc.so.6 #6 0x7fab447d46dd in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6 #7 0x7fab447d2926 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6 ---Type <return> to continue, or q <return> to quit--- #8 0x7fab447d2953 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6 #9 0x7fab447d2a5e in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6 #10 0x005cd9e6 in ceph::__ceph_assert_fail (assertion=<value optimized out>, file=<value optimized out>, line=<value optimized out>, func=<value optimized out>) at common/assert.cc:70 #11 0x0072e7cd in FileJournal::committed_thru (this=0x141, seq=4145693) at os/FileJournal.cc:1011 #12 0x00714d79 in JournalingObjectStore::commit_finish (this=0x1401000) at os/JournalingObjectStore.cc:260 #13 0x0070aae7 in FileStore::sync_entry (this=0x1401000) at os/FileStore.cc:3079 #14 0x007139bd in FileStore::SyncThread::entry (this=<value optimized out>) at os/FileStore.h:101 #15 0x7fab45993d8c in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0 #16 0x7fab43fd004d in clone () from /lib/x86_64-linux-gnu/libc.so.6 #17 0x in ?? () (gdb) (gdb) thread 1 [Switching to thread 1 (Thread 2491)] #0 0x7fab4599cb3b in raise () from /lib/x86_64-linux-gnu/libpthread.so.0 (gdb) thread 11 Thread ID 11 not known. (gdb) frame 11 #11 0x0072e7cd in FileJournal::committed_thru (this=0x141, seq=4145693) at os/FileJournal.cc:1011 1011 os/FileJournal.cc: No such file or directory. in os/FileJournal.cc (gdb) p seq $1 = 4145693 (gdb) p last_committed_seq $2 = 4145768 (gdb)
Re: os/FileJournal.cc: 1011: FAILED assert(seq >= last_committed_seq)
Hi, with a build from today, I have the same prob. os/FileJournal.cc: In function 'virtual void FileJournal::committed_thru(uint64_t)', in thread '7fc55c85f700' os/FileJournal.cc: 1011: FAILED assert(seq >= last_committed_seq) ceph version 0.38-250-gc2889fe (commit:c2889fef420611df3dd0de4064c91f6aa9f86625) -martin On 29.11.2011 13:14, Martin Mailand wrote: Hi Stratos, ok, my build was from 23.11.; I'll retest with master. -martin On 29.11.2011 12:56, Stratos Psomadakis wrote: On 11/29/2011 01:48 PM, Martin Mailand wrote: Hi, I hit this assertion a few times. I use ext4 as the osd fs, so I think we have to replay the whole journal; maybe that triggers it. I've hit that too with v0.38 (with OSD on ext4), but when I built ceph from the master branch, the issue seemed to be resolved. -martin 2011-11-29 11:37:55.393296 7fab45dbc7a0 FileStore is up to date. os/FileJournal.cc: In function 'virtual void FileJournal::committed_thru(uint64_t)', in thread '7fab434cf700' os/FileJournal.cc: 1011: FAILED assert(seq >= last_committed_seq) ceph version 0.38-244-g30def38 (commit:30def38d21b217f244db74e6c469598d794fa8a1) 1: (FileJournal::committed_thru(unsigned long)+0xcd) [0x72e7cd] 2: (JournalingObjectStore::commit_finish()+0xb9) [0x714d79] 3: (FileStore::sync_entry()+0xec7) [0x70aae7] 4: (FileStore::SyncThread::entry()+0xd) [0x7139bd] 5: (()+0x6d8c) [0x7fab45993d8c] 6: (clone()+0x6d) [0x7fab43fd004d] ceph version 0.38-244-g30def38 (commit:30def38d21b217f244db74e6c469598d794fa8a1) 1: (FileJournal::committed_thru(unsigned long)+0xcd) [0x72e7cd] 2: (JournalingObjectStore::commit_finish()+0xb9) [0x714d79] 3: (FileStore::sync_entry()+0xec7) [0x70aae7] 4: (FileStore::SyncThread::entry()+0xd) [0x7139bd] 5: (()+0x6d8c) [0x7fab45993d8c] 6: (clone()+0x6d) [0x7fab43fd004d] *** Caught signal (Aborted) ** in thread 7fab434cf700 ceph version 0.38-244-g30def38 (commit:30def38d21b217f244db74e6c469598d794fa8a1) 1: /usr/bin/ceph-osd() [0x5a7ba2] 2: (()+0xfc60) [0x7fab4599cc60] 3: (gsignal()+0x35) [0x7fab43f1dd05] 4: (abort()+0x186) [0x7fab43f21ab6] 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fab447d46dd] 6: (()+0xb9926) [0x7fab447d2926] 7: (()+0xb9953) [0x7fab447d2953] 8: (()+0xb9a5e) [0x7fab447d2a5e] 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x396) [0x5cd9e6] 10: (FileJournal::committed_thru(unsigned long)+0xcd) [0x72e7cd] 11: (JournalingObjectStore::commit_finish()+0xb9) [0x714d79] 12: (FileStore::sync_entry()+0xec7) [0x70aae7] 13: (FileStore::SyncThread::entry()+0xd) [0x7139bd] 14: (()+0x6d8c) [0x7fab45993d8c] 15: (clone()+0x6d) [0x7fab43fd004d] Thread 1 (Thread 2491): #0 0x7fab4599cb3b in raise () from /lib/x86_64-linux-gnu/libpthread.so.0 #1 0x005a77f2 in reraise_fatal (signum=6) at global/signal_handler.cc:59 #2 0x005a7dea in handle_fatal_signal (signum=6) at global/signal_handler.cc:106 #3 <signal handler called> #4 0x7fab43f1dd05 in raise () from /lib/x86_64-linux-gnu/libc.so.6 #5 0x7fab43f21ab6 in abort () from /lib/x86_64-linux-gnu/libc.so.6 #6 0x7fab447d46dd in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6 #7 0x7fab447d2926 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6 ---Type <return> to continue, or q <return> to quit--- #8 0x7fab447d2953 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6 #9 0x7fab447d2a5e in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6 #10 0x005cd9e6 in ceph::__ceph_assert_fail (assertion=<value optimized out>, file=<value optimized out>, line=<value optimized out>, func=<value optimized out>) at common/assert.cc:70 #11 0x0072e7cd in FileJournal::committed_thru (this=0x141, seq=4145693) at os/FileJournal.cc:1011 #12 0x00714d79 in JournalingObjectStore::commit_finish (this=0x1401000) at os/JournalingObjectStore.cc:260 #13 0x0070aae7 in FileStore::sync_entry (this=0x1401000) at os/FileStore.cc:3079 #14 0x007139bd in FileStore::SyncThread::entry (this=<value optimized out>) at os/FileStore.h:101 #15 0x7fab45993d8c in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0 #16 0x7fab43fd004d in clone () from /lib/x86_64-linux-gnu/libc.so.6 #17 0x in ?? () (gdb) (gdb) thread 1 [Switching to thread 1 (Thread 2491)] #0 0x7fab4599cb3b in raise () from /lib/x86_64-linux-gnu/libpthread.so.0 (gdb) thread 11 Thread ID 11 not known. (gdb) frame 11 #11 0x0072e7cd in FileJournal::committed_thru (this=0x141, seq=4145693) at os/FileJournal.cc:1011 1011 os/FileJournal.cc: No such file or directory. in os/FileJournal.cc (gdb) p seq $1 = 4145693 (gdb) p last_committed_seq $2 = 4145768 (gdb)
Re: osd/OSD.cc: 5534: FAILED assert(pending_ops > 0)
Hi Sage, I hit it again, this time on another osd. ceph version 0.38-181-g2e19550 (commit:2e195500b5d3a8ab8512bcf2a219a6b7ff922c97) Thread 1 (Thread 2951): #0 0x7f36bbb41b3b in raise () from /lib/x86_64-linux-gnu/libpthread.so.0 #1 0x005f5852 in reraise_fatal (signum=6) at global/signal_handler.cc:59 #2 0x005f5e4a in handle_fatal_signal (signum=6) at global/signal_handler.cc:106 #3 <signal handler called> #4 0x7f36ba0c2d05 in raise () from /lib/x86_64-linux-gnu/libc.so.6 #5 0x7f36ba0c6ab6 in abort () from /lib/x86_64-linux-gnu/libc.so.6 #6 0x7f36ba9796dd in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6 ---Type <return> to continue, or q <return> to quit--- #7 0x7f36ba977926 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6 #8 0x7f36ba977953 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6 #9 0x7f36ba977a5e in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6 #10 0x005f6956 in ceph::__ceph_assert_fail (assertion=<value optimized out>, file=<value optimized out>, line=<value optimized out>, func=<value optimized out>) at common/assert.cc:70 #11 0x0056616a in OSD::dequeue_op (this=0x25b, pg=<value optimized out>) at osd/OSD.cc:5518 #12 0x005d4406 in ThreadPool::worker (this=0x25b0408) at common/WorkQueue.cc:54 #13 0x005822dd in ThreadPool::WorkThread::entry (this=<value optimized out>) at ./common/WorkQueue.h:120 #14 0x7f36bbb38d8c in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0 #15 0x7f36ba17504d in clone () from /lib/x86_64-linux-gnu/libc.so.6 #16 0x in ?? () (gdb) thread 1 [Switching to thread 1 (Thread 2951)] #0 0x7f36bbb41b3b in raise () from /lib/x86_64-linux-gnu/libpthread.so.0 (gdb) frame 11 #11 0x0056616a in OSD::dequeue_op (this=0x25b, pg=<value optimized out>) at osd/OSD.cc:5518 5518 osd/OSD.cc: No such file or directory. in osd/OSD.cc (gdb) p pending_ops $1 = 0 -martin On 16.11.2011 22:12, Sage Weil wrote: Hi Martin, I've reread the code twice now and it's really not clear to me how pending_ops could get out of sync with the actual queue size. I've pushed a couple of patches that remove surrounding dead code and add an additional assert sanity check to master. Have you seen this again, or just that once? Opened http://tracker.newdream.net/issues/1727 Thanks- sage On Wed, 16 Nov 2011, Martin Mailand wrote: Hi, so after a little help from greg. (gdb) print pending_ops $1 = 0 -martin Sage Weil wrote: On Mon, 14 Nov 2011, Gregory Farnum wrote: It's not a big deal; logging is expensive. :) Just a backtrace isn't a lot to go on, but it's better than nothing! On Mon, Nov 14, 2011 at 11:45 AM, Martin Mailand mar...@tuxadero.com wrote: Hi Gregory, I do not have more at the moment. As I cannot have the debug log always on, a core dump would be the best solution? I'm mainly interested in whether pending_ops is 0 or < 0. A 'thread apply all bt' may also be useful. Thanks! sage -martin Gregory Farnum wrote: Do you have any other system state? (More logs, core dumps.) Make a bug in the tracker either way so it doesn't get lost track of. :) -Greg On Mon, Nov 14, 2011 at 6:04 AM, Martin Mailand mar...@tuxadero.com wrote: Hi, today one of my osds died, the log is: osd/OSD.cc: In function 'void OSD::dequeue_op(PG*)', in thread '7faeb6139700' osd/OSD.cc: 5534: FAILED assert(pending_ops > 0) ceph version 0.38 (commit:b600ec2ac7c0f2e508720f8e8bb87c3db15509b9) 1: (OSD::dequeue_op(PG*)+0x4bb) [0x55a4db] 2: (ThreadPool::worker()+0x6e6) [0x5b7b16] 3: (ThreadPool::WorkThread::entry()+0xd) [0x57398d] 4: (()+0x6d8c) [0x7faec4d12d8c] 5: (clone()+0x6d) [0x7faec355404d] ceph version 0.38 (commit:b600ec2ac7c0f2e508720f8e8bb87c3db15509b9) 1: (OSD::dequeue_op(PG*)+0x4bb) [0x55a4db] 2: (ThreadPool::worker()+0x6e6) [0x5b7b16] 3: (ThreadPool::WorkThread::entry()+0xd) [0x57398d] 4: (()+0x6d8c) [0x7faec4d12d8c] 5: (clone()+0x6d) [0x7faec355404d] *** Caught signal (Aborted) ** in thread 7faeb6139700 ceph version 0.38 (commit:b600ec2ac7c0f2e508720f8e8bb87c3db15509b9) 1: /usr/bin/ceph-osd() [0x5b8b52] 2: (()+0xfc60) [0x7faec4d1bc60] 3: (gsignal()+0x35) [0x7faec34a1d05] 4: (abort()+0x186) [0x7faec34a5ab6] 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7faec3d586dd] 6: (()+0xb9926) [0x7faec3d56926] 7: (()+0xb9953) [0x7faec3d56953] 8: (()+0xb9a5e) [0x7faec3d56a5e] 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x396) [0x5bddb6] 10: (OSD::dequeue_op(PG*)+0x4bb) [0x55a4db] 11: (ThreadPool::worker()+0x6e6) [0x5b7b16] 12: (ThreadPool::WorkThread::entry()+0xd) [0x57398d] 13: (()+0x6d8c) [0x7faec4d12d8c] 14: (clone()+0x6d) [0x7faec355404d] Anything else needed to debug this? -martin
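The assertion here fires when the pending_ops counter and the actual op queue disagree, which is exactly what Sage says he cannot see a path to. A toy Python sketch of the bookkeeping that OSD::dequeue_op() is checking (illustrative model only, not Ceph code; class and method names are made up):

```python
from collections import deque

# Toy model of the OSD op-queue bookkeeping behind
# FAILED assert(pending_ops > 0): the counter is incremented on
# enqueue and decremented on dequeue, and must track the queue.
class OpQueue:
    def __init__(self):
        self.queue = deque()
        self.pending_ops = 0

    def enqueue_op(self, op):
        self.queue.append(op)
        self.pending_ops += 1

    def dequeue_op(self):
        # Mirrors the failed assertion: a dequeue with pending_ops
        # already at 0 means the counter drifted from the queue.
        assert self.pending_ops > 0, "pending_ops must be positive"
        self.pending_ops -= 1
        return self.queue.popleft()

q = OpQueue()
q.enqueue_op("write")
q.dequeue_op()
try:
    q.dequeue_op()       # counter is 0: same failure mode as the crash
    tripped = False
except AssertionError:
    tripped = True
print(tripped)  # prints True
```

In the real crash the gdb output shows pending_ops $1 = 0 at the assertion site, i.e. the second case above: a dequeue ran when the counter said nothing was pending.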
Re: Cluster sync doesn't finish
Hi Sam, here the crushmap http://85.214.49.87/ceph/crushmap.txt http://85.214.49.87/ceph/crushmap -martin Samuel Just wrote: It looks like a crushmap-related problem. Could you send us the crushmap? ceph osd getcrushmap Thanks -Sam On Fri, Nov 18, 2011 at 10:13 AM, Gregory Farnum gregory.far...@dreamhost.com wrote: On Fri, Nov 18, 2011 at 10:05 AM, Tommi Virtanen tommi.virta...@dreamhost.com wrote: On Thu, Nov 17, 2011 at 12:48, Martin Mailand mar...@tuxadero.com wrote: Hi, I am doing cluster failure tests, where I shut down one OSD and wait for the cluster to sync. But the sync never finished; at around 4-5% it stops. I stopped osd2. ... 2011-11-17 16:42:45.520740 pg v1337: 600 pgs: 547 active+clean, 53 active+clean+degraded; 113 GB data, 184 GB used, 1141 GB / 1395 GB avail; 4025/82404 degraded (4.884%) ... The osd log, the ceph.conf, pg dump, osd dump can be found here. http://85.214.49.87/ceph/ This looks a bit worrying: 2011-11-17 17:56:35.771574 7f704c834700 -- 192.168.42.113:0/2424 >> 192.168.42.114:6802/21115 pipe(0x2596c80 sd=17 pgs=0 cs=0 l=0).connect claims to be 192.168.42.114:6802/21507 not 192.168.42.114:6802/21115 - wrong node! So osd.0 is basically refusing to talk to one of the other OSDs. I don't understand the messenger well enough to know why this would be, but it wouldn't surprise me if this problem kept the objects degraded -- it looks like a breakage in the osd-osd communication. Now if this was the reason, I'd expect a restart of all the OSDs to get it back in shape; messenger state is ephemeral. Can you confirm that? Probably not — that wrong node thing can occur for a lot of different reasons, some of which matter and most of which don't. Sam's looking into the problem; there's something going wrong with the CRUSH calculations or the monitor PG placement overrides or something... -Greg
Re: osd/OSD.cc: 5534: FAILED assert(pending_ops > 0)
Hi Sage, I saw it once, but the osd node seems a bit dodgy. I re-imaged the node today; I'll try again to reproduce it. -martin On 16.11.2011 22:12, Sage Weil wrote: Hi Martin, I've reread the code twice now and it's really not clear to me how pending_ops could get out of sync with the actual queue size. I've pushed a couple of patches that remove surrounding dead code and add an additional assert sanity check to master. Have you seen this again, or just that once? Opened http://tracker.newdream.net/issues/1727 Thanks- sage On Wed, 16 Nov 2011, Martin Mailand wrote: Hi, so after a little help from greg. (gdb) print pending_ops $1 = 0 -martin Sage Weil wrote: On Mon, 14 Nov 2011, Gregory Farnum wrote: It's not a big deal; logging is expensive. :) Just a backtrace isn't a lot to go on, but it's better than nothing! On Mon, Nov 14, 2011 at 11:45 AM, Martin Mailand mar...@tuxadero.com wrote: Hi Gregory, I do not have more at the moment. As I cannot have the debug log always on, a core dump would be the best solution? I'm mainly interested in whether pending_ops is 0 or < 0. A 'thread apply all bt' may also be useful. Thanks! sage -martin Gregory Farnum wrote: Do you have any other system state? (More logs, core dumps.) Make a bug in the tracker either way so it doesn't get lost track of. :) -Greg On Mon, Nov 14, 2011 at 6:04 AM, Martin Mailand mar...@tuxadero.com wrote: Hi, today one of my osds died, the log is: osd/OSD.cc: In function 'void OSD::dequeue_op(PG*)', in thread '7faeb6139700' osd/OSD.cc: 5534: FAILED assert(pending_ops > 0) ceph version 0.38 (commit:b600ec2ac7c0f2e508720f8e8bb87c3db15509b9) 1: (OSD::dequeue_op(PG*)+0x4bb) [0x55a4db] 2: (ThreadPool::worker()+0x6e6) [0x5b7b16] 3: (ThreadPool::WorkThread::entry()+0xd) [0x57398d] 4: (()+0x6d8c) [0x7faec4d12d8c] 5: (clone()+0x6d) [0x7faec355404d] ceph version 0.38 (commit:b600ec2ac7c0f2e508720f8e8bb87c3db15509b9) 1: (OSD::dequeue_op(PG*)+0x4bb) [0x55a4db] 2: (ThreadPool::worker()+0x6e6) [0x5b7b16] 3: (ThreadPool::WorkThread::entry()+0xd) [0x57398d] 4: (()+0x6d8c) [0x7faec4d12d8c] 5: (clone()+0x6d) [0x7faec355404d] *** Caught signal (Aborted) ** in thread 7faeb6139700 ceph version 0.38 (commit:b600ec2ac7c0f2e508720f8e8bb87c3db15509b9) 1: /usr/bin/ceph-osd() [0x5b8b52] 2: (()+0xfc60) [0x7faec4d1bc60] 3: (gsignal()+0x35) [0x7faec34a1d05] 4: (abort()+0x186) [0x7faec34a5ab6] 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7faec3d586dd] 6: (()+0xb9926) [0x7faec3d56926] 7: (()+0xb9953) [0x7faec3d56953] 8: (()+0xb9a5e) [0x7faec3d56a5e] 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x396) [0x5bddb6] 10: (OSD::dequeue_op(PG*)+0x4bb) [0x55a4db] 11: (ThreadPool::worker()+0x6e6) [0x5b7b16] 12: (ThreadPool::WorkThread::entry()+0xd) [0x57398d] 13: (()+0x6d8c) [0x7faec4d12d8c] 14: (clone()+0x6d) [0x7faec355404d] Anything else needed to debug this? -martin
Cluster sync doesn't finish
Hi, I am doing cluster failure tests, where I shut down one OSD and wait for the cluster to sync. But the sync never finished; at around 4-5% it stops. I stopped osd2. 2011-11-17 16:40:48.015370 pg v1333: 600 pgs: 1 active, 546 active+clean, 53 active+clean+degraded; 113 GB data, 183 GB used, 1142 GB / 1395 GB avail; 4200/82404 degraded (5.097%) 2011-11-17 16:40:53.109391 pg v1334: 600 pgs: 1 active, 546 active+clean, 53 active+clean+degraded; 113 GB data, 183 GB used, 1142 GB / 1395 GB avail; 4117/82404 degraded (4.996%) 2011-11-17 16:40:58.228525 pg v1335: 600 pgs: 1 active, 546 active+clean, 53 active+clean+degraded; 113 GB data, 183 GB used, 1142 GB / 1395 GB avail; 4037/82404 degraded (4.899%) 2011-11-17 16:41:03.223778 pg v1336: 600 pgs: 547 active+clean, 53 active+clean+degraded; 113 GB data, 183 GB used, 1142 GB / 1395 GB avail; 4025/82404 degraded (4.884%) 2011-11-17 16:42:45.520740 pg v1337: 600 pgs: 547 active+clean, 53 active+clean+degraded; 113 GB data, 184 GB used, 1141 GB / 1395 GB avail; 4025/82404 degraded (4.884%) ^C root@m-brick-000:~# date -R Thu, 17 Nov 2011 17:56:08 +0100 root@m-brick-000:~# So for the last hour nothing happened; there is no load on the cluster. The osd log, the ceph.conf, pg dump, osd dump can be found here. http://85.214.49.87/ceph/ ceph version 0.38-181-g2e19550 (commit:2e195500b5d3a8ab8512bcf2a219a6b7ff922c97) -martin
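For reference, the degraded percentage in these status lines is simply degraded object instances over total object instances. Checking the stuck value from the report:

```python
# The "degraded (4.884%)" figure in the pg status line is
# degraded object instances over total object instances.
degraded, total = 4025, 82404
pct = degraded / total * 100
print(f"{degraded}/{total} degraded ({pct:.3f}%)")
# prints: 4025/82404 degraded (4.884%)
```

The fact that this ratio stays frozen at 4025/82404 across pg map versions v1336 and v1337, more than an hour apart with no cluster load, is what shows recovery has stalled rather than merely slowed.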
Re: osd/OSD.cc: 5534: FAILED assert(pending_ops > 0)
Hi, I have a bt. http://pastebin.com/QNcja2QK -martin Sage Weil wrote: On Mon, 14 Nov 2011, Gregory Farnum wrote: It's not a big deal; logging is expensive. :) Just a backtrace isn't a lot to go on, but it's better than nothing! On Mon, Nov 14, 2011 at 11:45 AM, Martin Mailand mar...@tuxadero.com wrote: Hi Gregory, I do not have more at the moment. As I cannot have the debug log always on, a core dump would be the best solution? I'm mainly interested in whether pending_ops is 0 or < 0. A 'thread apply all bt' may also be useful. Thanks! sage -martin Gregory Farnum wrote: Do you have any other system state? (More logs, core dumps.) Make a bug in the tracker either way so it doesn't get lost track of. :) -Greg On Mon, Nov 14, 2011 at 6:04 AM, Martin Mailand mar...@tuxadero.com wrote: Hi, today one of my osds died, the log is: osd/OSD.cc: In function 'void OSD::dequeue_op(PG*)', in thread '7faeb6139700' osd/OSD.cc: 5534: FAILED assert(pending_ops > 0) ceph version 0.38 (commit:b600ec2ac7c0f2e508720f8e8bb87c3db15509b9) 1: (OSD::dequeue_op(PG*)+0x4bb) [0x55a4db] 2: (ThreadPool::worker()+0x6e6) [0x5b7b16] 3: (ThreadPool::WorkThread::entry()+0xd) [0x57398d] 4: (()+0x6d8c) [0x7faec4d12d8c] 5: (clone()+0x6d) [0x7faec355404d] ceph version 0.38 (commit:b600ec2ac7c0f2e508720f8e8bb87c3db15509b9) 1: (OSD::dequeue_op(PG*)+0x4bb) [0x55a4db] 2: (ThreadPool::worker()+0x6e6) [0x5b7b16] 3: (ThreadPool::WorkThread::entry()+0xd) [0x57398d] 4: (()+0x6d8c) [0x7faec4d12d8c] 5: (clone()+0x6d) [0x7faec355404d] *** Caught signal (Aborted) ** in thread 7faeb6139700 ceph version 0.38 (commit:b600ec2ac7c0f2e508720f8e8bb87c3db15509b9) 1: /usr/bin/ceph-osd() [0x5b8b52] 2: (()+0xfc60) [0x7faec4d1bc60] 3: (gsignal()+0x35) [0x7faec34a1d05] 4: (abort()+0x186) [0x7faec34a5ab6] 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7faec3d586dd] 6: (()+0xb9926) [0x7faec3d56926] 7: (()+0xb9953) [0x7faec3d56953] 8: (()+0xb9a5e) [0x7faec3d56a5e] 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x396) [0x5bddb6] 10: (OSD::dequeue_op(PG*)+0x4bb) [0x55a4db] 11: (ThreadPool::worker()+0x6e6) [0x5b7b16] 12: (ThreadPool::WorkThread::entry()+0xd) [0x57398d] 13: (()+0x6d8c) [0x7faec4d12d8c] 14: (clone()+0x6d) [0x7faec355404d] Anything else needed to debug this? -martin
Re: osd/OSD.cc: 5534: FAILED assert(pending_ops > 0)
Hi,

so after a little help from Greg:

(gdb) print pending_ops
$1 = 0

-martin

Sage Weil wrote:
On Mon, 14 Nov 2011, Gregory Farnum wrote: It's not a big deal; logging is expensive. :) Just a backtrace isn't a lot to go on, but it's better than nothing!

On Mon, Nov 14, 2011 at 11:45 AM, Martin Mailand mar...@tuxadero.com wrote: Hi Gregory, I do not have more at the moment. As I cannot have the debug log always on, would a core dump be the best solution?

I'm mainly interested in whether pending_ops is 0 or < 0. A 'thread apply all bt' may also be useful. Thanks!
sage

-martin

Gregory Farnum wrote:
Do you have any other system state? (More logs, core dumps?) Make a bug in the tracker either way so it doesn't get lost track of. :)
-Greg

On Mon, Nov 14, 2011 at 6:04 AM, Martin Mailand mar...@tuxadero.com wrote: Hi, today one of my OSDs died; the log is:

osd/OSD.cc: In function 'void OSD::dequeue_op(PG*)', in thread '7faeb6139700'
osd/OSD.cc: 5534: FAILED assert(pending_ops > 0)
[full backtrace and abort signal dump snipped]

Anything else needed to debug this?
-martin
ceph and ext4
Hi Christian,

I am not sure if you noticed, but your ext4 bug is fixed in mainline. I have been running a ceph cluster with 40+ VMs for over a week now without any problems, and fsck.ext4 shows the filesystem is clean. The performance of ext4 is much better than btrfs; there is no rise in the load of the OSDs.

-martin
Re: ceph and ext4
Hi Tomasz,

as far as I know it still has this limit, but it should be relatively safe to use it: http://marc.info/?l=ceph-devel&m=131942130322957&w=2

If we hit the 4KB limit of xattrs in ext4, how does it show up in the rbd layer? How does it show up in the fs layer? Would the fs still be clean?

-martin

On 14.11.2011 14:09, Tomasz Paszkowski wrote:
What about the limit on xattr size? Is it still limited to 4KB?

On Mon, Nov 14, 2011 at 1:15 PM, Martin Mailand mar...@tuxadero.com wrote:
Hi Christian, I am not sure if you noticed, but your ext4 bug is fixed in mainline. I have been running a ceph cluster with 40+ VMs for over a week now without any problems, and fsck.ext4 shows the filesystem is clean. The performance of ext4 is much better than btrfs; there is no rise in the load of the OSDs.
-martin
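The limit being discussed is ext4's hard cap: all xattrs of an inode must fit into a single filesystem block, so with 4 KB blocks there are roughly 4096 bytes for all names, values, and entry headers combined. A back-of-the-envelope sketch of that budget (the header and per-entry overhead constants below are approximations for illustration, not the exact ext4 on-disk layout):

```python
# Rough model of ext4's "all xattrs in one block" constraint.
# HEADER and ENTRY_OVERHEAD are illustrative approximations.

def fits_in_ext4_block(xattrs, block_size=4096):
    """Return True if the xattrs (dict of name -> value, both bytes)
    plausibly fit into a single ext4 xattr block."""
    HEADER = 32           # approximate xattr block header
    ENTRY_OVERHEAD = 16   # approximate fixed cost per entry
    used = HEADER
    for name, value in xattrs.items():
        # name stored inline; value padded to 4-byte alignment
        used += ENTRY_OVERHEAD + len(name) + (len(value) + 3) // 4 * 4
    return used <= block_size

# A small osd attribute fits comfortably...
assert fits_in_ext4_block({b"user.ceph._": b"x" * 200})
# ...but a large snapshot-related attribute blows the budget.
assert not fits_in_ext4_block({b"user.ceph.snapset": b"x" * 5000})
```

This is why the thread keeps circling back to snapshots: only large object attributes come anywhere near the per-block budget, while ordinary rbd traffic stays far below it.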
osd/OSD.cc: 5534: FAILED assert(pending_ops > 0)
Hi,

today one of my OSDs died; the log is:

osd/OSD.cc: In function 'void OSD::dequeue_op(PG*)', in thread '7faeb6139700'
osd/OSD.cc: 5534: FAILED assert(pending_ops > 0)
ceph version 0.38 (commit:b600ec2ac7c0f2e508720f8e8bb87c3db15509b9)
1: (OSD::dequeue_op(PG*)+0x4bb) [0x55a4db]
2: (ThreadPool::worker()+0x6e6) [0x5b7b16]
3: (ThreadPool::WorkThread::entry()+0xd) [0x57398d]
4: (()+0x6d8c) [0x7faec4d12d8c]
5: (clone()+0x6d) [0x7faec355404d]
ceph version 0.38 (commit:b600ec2ac7c0f2e508720f8e8bb87c3db15509b9)
1: (OSD::dequeue_op(PG*)+0x4bb) [0x55a4db]
2: (ThreadPool::worker()+0x6e6) [0x5b7b16]
3: (ThreadPool::WorkThread::entry()+0xd) [0x57398d]
4: (()+0x6d8c) [0x7faec4d12d8c]
5: (clone()+0x6d) [0x7faec355404d]
*** Caught signal (Aborted) ** in thread 7faeb6139700
ceph version 0.38 (commit:b600ec2ac7c0f2e508720f8e8bb87c3db15509b9)
1: /usr/bin/ceph-osd() [0x5b8b52]
2: (()+0xfc60) [0x7faec4d1bc60]
3: (gsignal()+0x35) [0x7faec34a1d05]
4: (abort()+0x186) [0x7faec34a5ab6]
5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7faec3d586dd]
6: (()+0xb9926) [0x7faec3d56926]
7: (()+0xb9953) [0x7faec3d56953]
8: (()+0xb9a5e) [0x7faec3d56a5e]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x396) [0x5bddb6]
10: (OSD::dequeue_op(PG*)+0x4bb) [0x55a4db]
11: (ThreadPool::worker()+0x6e6) [0x5b7b16]
12: (ThreadPool::WorkThread::entry()+0xd) [0x57398d]
13: (()+0x6d8c) [0x7faec4d12d8c]
14: (clone()+0x6d) [0x7faec355404d]

Anything else needed to debug this?

-martin
Re: crushmap errors
Hi Sage,

1. The crushtool grammar fix is working for me. Thanks.
2. I think if an admin puts the extra rack info into the ceph.conf file, then it should do what is expected. I understand your worries, but on the other hand ceph is not an end-user tool, and people should know what they are doing and balance their racks evenly. Just my two cents.

-martin

On 11.11.2011 23:51, Sage Weil wrote:
On Fri, 11 Nov 2011, Martin Mailand wrote:
Hi, I used in ceph v0.38 the host and rack feature in the conf during mkcephfs. Now I have two problems with the crushmap.
1. I cannot compile a ceph-generated crushmap:
crushtool -c file.txt -o file
file.txt:4 error: parse error at '.0'

Whoops, will push a patch to stable shortly. The grammar wasn't recognizing '.' as a legal character.

2. Why are 2 racks not enough for 2 failure domains? From the commit: "If there are >2 racks, separate across racks."

Well, technically they are. My worry is that it's more likely that racks will have significantly varying capacity (i.e. crush weight) due to, say, 1 full rack and a second 1/2 rack. If the policy forces replicas to be placed across racks, things won't balance well. I suppose there should be an argument like --min-racks that controls that threshold?

sage
Re: ceph and ext4
Hi Gregory,

this is quite bad, so ext4 is still no alternative as a backend fs.

-martin

Gregory Farnum wrote:
On Mon, Nov 14, 2011 at 5:33 AM, Martin Mailand mar...@tuxadero.com wrote:
Hi Tomasz, as far as I know it still has this limit, but it should be relatively safe to use it: http://marc.info/?l=ceph-devel&m=131942130322957&w=2
If we hit the 4KB limit of xattrs in ext4, how does it show up in the rbd layer? How does it show up in the fs layer? Would the fs still be clean?

Right now it would show up very badly, unfortunately. (And yes, the limit is still there.) You'd notice, though you might manage to corrupt some of your data first. :/ However, if you're not taking snapshots and you're not using xattrs yourself, you won't hit it with rbd or the Ceph FS.
-Greg

On 14.11.2011 14:09, Tomasz Paszkowski wrote:
What about the limit on xattr size? Is it still limited to 4KB?

[rest of quoted thread snipped]
Re: osd/OSD.cc: 5534: FAILED assert(pending_ops > 0)
Hi Gregory,

I do not have more at the moment. As I cannot have the debug log always on, would a core dump be the best solution?

-martin

Gregory Farnum wrote:
Do you have any other system state? (More logs, core dumps?) Make a bug in the tracker either way so it doesn't get lost track of. :)
-Greg

On Mon, Nov 14, 2011 at 6:04 AM, Martin Mailand mar...@tuxadero.com wrote: Hi, today one of my OSDs died; the log is:

osd/OSD.cc: In function 'void OSD::dequeue_op(PG*)', in thread '7faeb6139700'
osd/OSD.cc: 5534: FAILED assert(pending_ops > 0)
[full backtrace and abort signal dump snipped]

Anything else needed to debug this?
-martin
Re: ceph and ext4
So snapshots of rbd images would be safe? Or would they hit the limit?

Sage Weil wrote:
On Mon, 14 Nov 2011, Gregory Farnum wrote:
On Mon, Nov 14, 2011 at 5:33 AM, Martin Mailand mar...@tuxadero.com wrote:
Hi Tomasz, as far as I know it still has this limit, but it should be relatively safe to use it: http://marc.info/?l=ceph-devel&m=131942130322957&w=2
If we hit the 4KB limit of xattrs in ext4, how does it show up in the rbd layer? How does it show up in the fs layer? Would the fs still be clean?

Right now it would show up very badly, unfortunately. (And yes, the limit is still there.) You'd notice, though you might manage to corrupt some of your data first. :/

Well, the OSDs are now more careful about being fail-stop, so if they hit the xattr limit they crash. So there won't be data corruption per se, except that you won't be able to start the OSD up again because the journal replay will keep hitting the limit.

However, if you're not taking snapshots and you're not using xattrs yourself, you won't hit it with rbd or the Ceph FS.

Right. Nothing sets large xattrs on objects in rbd. For the file system, this would only happen on extremely (!) deeply nested directories (ceph dfs xattrs are managed by the MDS, not as object attrs).

sage
-Greg

[rest of quoted thread snipped]
crushmap errors
Hi,

I used in ceph v0.38 the host and rack feature in the conf during mkcephfs. Now I have two problems with the crushmap.

1. I cannot compile a ceph-generated crushmap:

crushtool -c file.txt -o file
file.txt:4 error: parse error at '.0'

# begin crush map
# devices
device 0 osd.0

2. Why are 2 racks not enough for 2 failure domains? From the commit: "If there are >2 racks, separate across racks." And in src/osd/OSDMap.cc:

if (racks.size() < 3) {
  // spread replicas across hosts
  crush_rule_set_step(rule, 1, CRUSH_RULE_CHOOSE_LEAF_FIRSTN, CRUSH_CHOOSE_N, 2);

shouldn't that be

if (racks.size() > 1) {
  // spread replicas across racks
  crush_rule_set_step(rule, 1, CRUSH_RULE_CHOOSE_LEAF_FIRSTN, CRUSH_CHOOSE_N, 2);

Best Regards,
martin
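The disagreement in this thread is only about the rack-count threshold at which replicas get separated across racks rather than hosts. A toy model of that decision (a sketch, not the actual OSDMap.cc code; the min_racks parameter and the string return values are illustrative):

```python
# Toy model of the replica-separation choice debated in the thread:
# below some rack-count threshold, spread replicas across hosts;
# at or above it, spread across racks.

def failure_domain(num_racks, min_racks=3):
    """Return the CRUSH bucket type replicas should be separated across.

    min_racks=3 models the behaviour Martin observed ('>2 racks' in the
    commit message); min_racks=2 models his proposed change."""
    return "rack" if num_racks >= min_racks else "host"

# Observed behaviour: two racks still only separate across hosts.
assert failure_domain(2) == "host"
assert failure_domain(3) == "rack"
# Martin's proposal: two racks are already two failure domains.
assert failure_domain(2, min_racks=2) == "rack"
```

Sage's --min-racks suggestion later in the thread amounts to making that threshold an operator-tunable parameter instead of a hard-coded constant.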
Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
Hi,

resend without the perf attachment, which can be found here: http://tuxadero.com/multistorage/perf.report.txt.bz2

Best Regards,
martin

-------- Original Message --------
Subject: Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
Date: Wed, 26 Oct 2011 22:38:47 +0200
From: Martin Mailand mar...@tuxadero.com
Reply-To: mar...@tuxadero.com
To: Sage Weil s...@newdream.net
CC: Christian Brunner c...@muc.de, ceph-devel@vger.kernel.org, linux-bt...@vger.kernel.org

Hi,

I have more or less the same setup as Christian and I suffer from the same problems. But as far as I can see, the output of latencytop and perf differs from Christian's; both are attached. I was wondering about the high latency from btrfs-submit:

Process btrfs-submit-0 (970) Total: 2123.5 msec

I also have the high IO rate and high IO wait:

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           0.60   0.00     2.20    82.40    0.00  14.80

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 0.00 0.00 8.40 0.00 74.40 17.71 0.03 3.81 0.00 3.81 3.81 3.20
sdb 0.00 7.00 0.00 269.80 0.00 1224.80 9.08 107.19 398.69 0.00 398.69 3.15 85.00

top - 21:57:41 up 8:41, 1 user, load average: 0.65, 0.79, 0.76
Tasks: 179 total, 1 running, 178 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.6%us, 2.4%sy, 0.0%ni, 70.8%id, 25.8%wa, 0.0%hi, 0.3%si, 0.0%st
Mem: 4018276k total, 1577728k used, 2440548k free, 10496k buffers
Swap: 1998844k total, 0k used, 1998844k free, 1316696k cached

 PID  USER PR NI VIRT RES  SHR  S %CPU %MEM   TIME+ COMMAND
1399  root 20  0 548m 103m 3428 S  0.0  2.6 2:01.85 ceph-osd
1401  root 20  0 548m 103m 3428 S  0.0  2.6 1:51.71 ceph-osd
1400  root 20  0 548m 103m 3428 S  0.0  2.6 1:50.30 ceph-osd
1391  root 20  0    0    0    0 S  0.0  0.0 1:18.39 btrfs-endio-wri
 976  root 20  0    0    0    0 S  0.0  0.0 1:18.11 btrfs-endio-wri
1367  root 20  0    0    0    0 S  0.0  0.0 1:05.60 btrfs-worker-1
 968  root 20  0    0    0    0 S  0.0  0.0 1:05.45 btrfs-worker-0
1163  root 20  0 141m 1636 1100 S  0.0  0.0 1:00.56 collectd
 970  root 20  0    0    0    0 S  0.0  0.0 0:47.73 btrfs-submit-0
1402  root 20  0 548m 103m 3428 S  0.0  2.6 0:34.86 ceph-osd
1392  root 20  0    0    0    0 S  0.0  0.0 0:33.70 btrfs-endio-met
 975  root 20  0    0    0    0 S  0.0  0.0 0:32.70 btrfs-endio-met
1415  root 20  0 548m 103m 3428 S  0.0  2.6 0:28.29 ceph-osd
1414  root 20  0 548m 103m 3428 S  0.0  2.6 0:28.24 ceph-osd
1397  root 20  0 548m 103m 3428 S  0.0  2.6 0:24.60 ceph-osd
1436  root 20  0 548m 103m 3428 S  0.0  2.6 0:13.31 ceph-osd

Here is my setup: Kernel v3.1 + Josef. The config for this osd (ceph version 0.37 (commit:a6f3bbb744a6faea95ae48317f0b838edb16a896)) is:

[osd.1]
host = s-brick-003
osd journal = /dev/sda7
btrfs devs = /dev/sdb
btrfs options = noatime
filestore_btrfs_snap = false

I hope this helps to pinpoint the problem.

Best Regards,
martin

Sage Weil wrote:
On Wed, 26 Oct 2011, Christian Brunner wrote:
2011/10/26 Sage Weil s...@newdream.net:
On Wed, 26 Oct 2011, Christian Brunner wrote:
Christian, have you tweaked those settings in your ceph.conf? It would be something like 'journal dio = false'. If not, can you verify that directio shows true when the journal is initialized from your osd log? E.g.,
2011-10-21 15:21:02.026789 7ff7e5c54720 journal _open dev/osd0.journal fd 14: 104857600 bytes, block size 4096 bytes, directio = 1
If directio = 1 for you, something else funky is causing those blkdev_fsync's...

I've looked it up in the logs - directio is 1:
Oct 25 17:20:16 os00 osd.000[1696]: 7f0016841740 journal _open /dev/vg01/lv_osd_journal_0 fd 15: 17179869184 bytes, block size 4096 bytes, directio = 1

Do you mind capturing an strace? I'd like to see where that blkdev_fsync is coming from.

Here is an strace. I can see a lot of sync_file_range operations.

Yeah, these all look like the flusher thread, and shouldn't be hitting blkdev_fsync. Can you confirm that with

filestore flusher = false
filestore sync flush = false

you get no sync_file_range at all? I wonder if this is also perf lying about the call chain.

Yes, setting this makes the sync_file_range calls go away.

Okay.
That means either sync_file_range on a regular btrfs file is triggering blkdev_fsync somewhere in btrfs, there is an extremely sneaky bug that is mixing up file descriptors, or latencytop is lying. I'm guessing the latter, given the other weirdness Josef and Chris were seeing. :) Is it safe to use these settings with filestore btrfs snap = 0? Yeah. They're purely
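For reference, the two knobs Sage asks Christian to flip could be sketched in ceph.conf like this; the option names are the ones quoted in the thread, while the section placement and comments are assumptions:

```ini
[osd]
    ; disable the filestore flusher thread and the explicit
    ; sync_file_range() flushing discussed above (diagnostic only)
    filestore flusher = false
    filestore sync flush = false
```

With these set, the sync_file_range calls disappeared from Christian's strace, which is what pointed the discussion toward latencytop misattributing the blkdev_fsync call chain.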
Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
Hi Stefan,

I think the machine has enough RAM:

root@s-brick-003:~# free -m
             total   used   free  shared  buffers  cached
Mem:          3924   2401   1522       0       42    2115
-/+ buffers/cache:    243   3680
Swap:         1951      0   1951

There is no swap usage at all.

-martin

On 27.10.2011 12:59, Stefan Majer wrote:
Hi Martin,
a quick dig into your perf report shows a large amount of swapper work. If this is the case, I would suspect latency. So do you not have enough physical RAM in your machine?
Greetings
Stefan Majer

On Thu, Oct 27, 2011 at 12:53 PM, Martin Mailand mar...@tuxadero.com wrote:
Hi, resend without the perf attachment, which can be found here: http://tuxadero.com/multistorage/perf.report.txt.bz2
Best Regards,
martin

[quoted message with iostat/top output, osd config, and strace discussion snipped]
Re: kernel BUG at fs/btrfs/inode.c:1163
Hi Anand,

I changed the replication level of the rbd pool from one to two:

ceph osd pool set rbd size 2

And then during the sync the bug happened, but today I could not reproduce it, so I do not have a testcase for you.

Best Regards,
martin

On 19.10.2011 17:02, Anand Jain wrote:
I tried to play with ceph here, but no complete success yet. Any idea what was done on the system at the time of the problem? And any specific command that could trigger this again? Thanks.
anand
0.37 crash
Hi,

today I tried the version 0.37 and it did not work very well, see below. It was an update from 0.36.

Best Regards,
Martin

2011-10-20 17:33:34.350502 7f0ada6f4760 ceph version 0.37 (commit:a6f3bbb744a6faea95ae48317f0b838edb16a896), process ceph-osd, pid 21707
2011-10-20 17:33:34.353543 7f0ada6f4760 filestore(/data/osd2) mount FIEMAP ioctl is NOT supported
2011-10-20 17:33:34.353628 7f0ada6f4760 filestore(/data/osd2) mount detected btrfs
2011-10-20 17:33:34.353656 7f0ada6f4760 filestore(/data/osd2) mount btrfs CLONE_RANGE ioctl is supported
2011-10-20 17:33:34.425059 7f0ada6f4760 filestore(/data/osd2) mount btrfs SNAP_CREATE is supported
2011-10-20 17:33:34.544564 7f0ada6f4760 filestore(/data/osd2) mount btrfs SNAP_DESTROY is supported
2011-10-20 17:33:34.544873 7f0ada6f4760 filestore(/data/osd2) mount btrfs START_SYNC got 0 Success
2011-10-20 17:33:34.544966 7f0ada6f4760 filestore(/data/osd2) mount btrfs START_SYNC is supported (transid 149)
2011-10-20 17:33:34.624965 7f0ada6f4760 filestore(/data/osd2) mount btrfs WAIT_SYNC is supported
2011-10-20 17:33:34.636719 7f0ada6f4760 filestore(/data/osd2) mount btrfs SNAP_CREATE_V2 got 0 Success
2011-10-20 17:33:34.636754 7f0ada6f4760 filestore(/data/osd2) mount btrfs SNAP_CREATE_V2 is supported
2011-10-20 17:33:34.644876 7f0ada6f4760 filestore(/data/osd2) mount found snaps
2011-10-20 17:33:34.644983 7f0ada6f4760 filestore(/data/osd2) mount: enabling WRITEAHEAD journal mode: 'filestore btrfs snap' mode is not enabled
2011-10-20 17:33:34.678324 7f0ada6f4760 journal kernel version is 3.1.0
2011-10-20 17:33:34.678737 7f0ada6f4760 journal _open /dev/sda7 fd 14: 476500201472 bytes, block size 4096 bytes, directio = 1
2011-10-20 17:33:34.688215 7f0ada6f4760 journal read_entry 39366656 : seq 4653 710 bytes
2011-10-20 17:33:34.688420 7f0ada6f4760 journal read_entry 39374848 : seq 4654 33 bytes
2011-10-20 17:33:34.695110 7f0ada6f4760 journal kernel version is 3.1.0
2011-10-20 17:33:34.695496 7f0ada6f4760 journal _open /dev/sda7 fd 14: 476500201472 bytes, block size 4096 bytes, directio = 1
2011-10-20 17:33:34.696359 7f0ada6f4760 FileStore is up to date.
2011-10-20 17:33:34.696683 7f0ada6f4760 journal close /dev/sda7
2011-10-20 17:33:34.697970 7f0ada6f4760 filestore(/data/osd2) mount FIEMAP ioctl is NOT supported
2011-10-20 17:33:34.698013 7f0ada6f4760 filestore(/data/osd2) mount detected btrfs
2011-10-20 17:33:34.698031 7f0ada6f4760 filestore(/data/osd2) mount btrfs CLONE_RANGE ioctl is supported
2011-10-20 17:33:34.774980 7f0ada6f4760 filestore(/data/osd2) mount btrfs SNAP_CREATE is supported
2011-10-20 17:33:34.904538 7f0ada6f4760 filestore(/data/osd2) mount btrfs SNAP_DESTROY is supported
2011-10-20 17:33:34.904945 7f0ada6f4760 filestore(/data/osd2) mount btrfs START_SYNC got 0 Success
2011-10-20 17:33:34.904995 7f0ada6f4760 filestore(/data/osd2) mount btrfs START_SYNC is supported (transid 152)
2011-10-20 17:33:34.991585 7f0ada6f4760 filestore(/data/osd2) mount btrfs WAIT_SYNC is supported
2011-10-20 17:33:34.996636 7f0ada6f4760 filestore(/data/osd2) mount btrfs SNAP_CREATE_V2 got 0 Success
2011-10-20 17:33:34.996664 7f0ada6f4760 filestore(/data/osd2) mount btrfs SNAP_CREATE_V2 is supported
2011-10-20 17:33:35.004813 7f0ada6f4760 filestore(/data/osd2) mount found snaps
2011-10-20 17:33:35.004902 7f0ada6f4760 filestore(/data/osd2) mount: enabling WRITEAHEAD journal mode: 'filestore btrfs snap' mode is not enabled
2011-10-20 17:33:35.023071 7f0ada6f4760 journal kernel version is 3.1.0
2011-10-20 17:33:35.023353 7f0ada6f4760 journal _open /dev/sda7 fd 14: 476500201472 bytes, block size 4096 bytes, directio = 1
2011-10-20 17:33:35.029846 7f0ada6f4760 journal read_entry 39366656 : seq 4653 710 bytes
2011-10-20 17:33:35.030077 7f0ada6f4760 journal read_entry 39374848 : seq 4654 33 bytes
2011-10-20 17:33:35.036728 7f0ada6f4760 journal kernel version is 3.1.0
2011-10-20 17:33:35.037142 7f0ada6f4760 journal _open /dev/sda7 fd 14: 476500201472 bytes, block size 4096 bytes, directio = 1

*** Caught signal (Aborted) ** in thread 0x7f0ace7f9700
ceph version 0.37 (commit:a6f3bbb744a6faea95ae48317f0b838edb16a896)
1: /usr/bin/ceph-osd() [0x5bd012]
2: (()+0xfc60) [0x7f0ada2d4c60]
3: (gsignal()+0x35) [0x7f0ad8a5ad05]
4: (abort()+0x186) [0x7f0ad8a5eab6]
5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f0ad93116dd]
6: (()+0xb9926) [0x7f0ad930f926]
7: (()+0xb9953) [0x7f0ad930f953]
8: (()+0xb9a5e) [0x7f0ad930fa5e]
9: (ceph::buffer::list::iterator::copy(unsigned int, char*)+0x129) [0x5a7e99]
10: (OSDMap::decode(ceph::buffer::list&)+0x81) [0x58f9f1]
11: (OSD::get_map(unsigned int)+0x242) [0x53f6d2]
12: (OSD::handle_osd_map(MOSDMap*)+0x1f82) [0x56ae72]
13: (OSD::_dispatch(Message*)+0x36b) [0x56d11b]
14: (OSD::ms_dispatch(Message*)+0xf6) [0x56e1c6]
15: (SimpleMessenger::dispatch_entry()+0x88b) [0x5fff2b]
16: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x4bd55c]
17: (()+0x6d8c) [0x7f0ada2cbd8c]
18:
Re: 0.37 crash
Hi Stefan,

in my case the osd process was just terminated, no IO wait. Could you have a look in your dmesg to see if there is any btrfs entry? Because the IO wait sounds like a btrfs problem.

Best Regards,
martin

Stefan Kleijkers wrote:
Hello,
I got the exact same problem. I upgraded from 0.36 to 0.37 and one of the two osds wouldn't start. In the log of the osd I also found the same error as below. The ceph-osd had status D (with ps, which is uninterruptible sleep) and I see a high IO wait with top. Also I noticed a lot of disk io on the disks.
Stefan

On 10/20/2011 05:39 PM, Martin Mailand wrote:
Hi, today I tried the version 0.37 and it did not work very well, see below. It was an update from 0.36.
Best Regards,
Martin
[quoted osd startup log and backtrace snipped]
Re: kernel BUG at fs/btrfs/inode.c:1163
On 19.10.2011 11:49, David Sterba wrote: On Tue, Oct 18, 2011 at 10:04:01PM +0200, Martin Mailand wrote: [28997.273289] [ cut here ] [28997.282916] kernel BUG at fs/btrfs/inode.c:1163! 1119 fi = btrfs_item_ptr(leaf, path->slots[0], 1120 struct btrfs_file_extent_item); 1121 extent_type = btrfs_file_extent_type(leaf, fi); 1122 1123 if (extent_type == BTRFS_FILE_EXTENT_REG || 1124 extent_type == BTRFS_FILE_EXTENT_PREALLOC) { ... 1158 } else if (extent_type == BTRFS_FILE_EXTENT_INLINE) { 1159 extent_end = found_key.offset + 1160 btrfs_file_extent_inline_len(leaf, fi); 1161 extent_end = ALIGN(extent_end, root->sectorsize); 1162 } else { 1163 BUG_ON(1); 1164 } The rc10 kernel sources point to this, can you please verify it in your sources? If it's really this one, that means it's an unhandled extent_type read from the b-tree leaf and could be a corruption. (The value is directly obtained from the file extent type item, line 1121.) Yep, that's the same in my source. It would be interesting what the value of 'extent_type' is at the time of the crash; if it's e.g. -1 that could point to a real bug, some unhandled corner case in truncate, for example. How can I do that? [28997.507960] Call Trace: [28997.507960] [a00903e0] ? acls_after_inode_item+0xc0/0xc0 [btrfs] ... a corruption caused by overflow of xattrs/acls into inode item bytes? As ceph stresses xattrs very well, I wouldn't be surprised by that. david -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
kernel BUG at fs/btrfs/inode.c:1163
Hi today I hit this Bug, kernel is v3.1-rc10 + josef from today, workload is a ceph osd. Best Regards, Martin [28997.273289] [ cut here ] [28997.282916] kernel BUG at fs/btrfs/inode.c:1163! [28997.290863] invalid opcode: [#1] SMP [28997.290863] CPU 0 [28997.290863] Modules linked in: radeon ttm drm_kms_helper drm psmouse sp5100_tco i2c_piix4 i2c_algo_bit serio_raw edac_core k8temp edac_mce_amd shpchp lp parport pata_atiixp btrfs zlib_deflate e1000e libcrc32c ahci libahci [28997.290863] [28997.290863] Pid: 1220, comm: ceph-osd Tainted: GW 3.1.0-rc10+ #2 MICRO-STAR INTERNATIONAL CO., LTD MS-96B3/MS-96B3 [28997.290863] RIP: 0010:[a0094f17] [a0094f17] run_delalloc_nocow+0x7a7/0x7c0 [btrfs] [28997.290863] RSP: 0018:880117357a78 EFLAGS: 00010206 [28997.290863] RAX: 002f RBX: 880116b12a20 RCX: 880117357a38 [28997.290863] RDX: 8800 RSI: 0496 RDI: 8801003851e0 [28997.290863] RBP: 880117357b78 R08: 0497 R09: 880117357a28 [28997.290863] R10: 0030 R11: R12: 00011d3b [28997.290863] R13: 00011d3b R14: 8801003851e0 R15: 0030 [28997.290863] FS: 7ff45ae7b700() GS:88011fc0() knlGS: [28997.507960] CS: 0010 DS: ES: CR0: 8005003b [28997.507960] CR2: 7ff450a2 CR3: 000114b75000 CR4: 06f0 [28997.507960] DR0: DR1: DR2: [28997.507960] DR3: DR6: 0ff0 DR7: 0400 [28997.507960] Process ceph-osd (pid: 1220, threadinfo 880117356000, task 88011526) [28997.507960] Stack: [28997.507960] 880117357aa8 81156e90 880104413af0 880104413af0 [28997.507960] 880117550030 880117357bf0 880117550028 880117550020 [28997.507960] 0040 00010040 880117357d14 00ffa00a973e [28997.507960] Call Trace: [28997.507960] [81156e90] ? kmem_cache_free+0x20/0x100 [28997.507960] [a0095264] run_delalloc_range+0x334/0x380 [btrfs] [28997.507960] [a00abc85] __extent_writepage+0x5b5/0x6f0 [btrfs] [28997.507960] [812e526d] ? 
radix_tree_gang_lookup_tag_slot+0x8d/0xd0 [28997.507960] [a00abfea] extent_write_cache_pages.clone.19.clone.26+0x22a/0x3a0 [btrfs] [28997.507960] [a00ac3a5] extent_writepages+0x45/0x60 [btrfs] [28997.507960] [a00903e0] ? acls_after_inode_item+0xc0/0xc0 [btrfs] [28997.507960] [81182ade] ? vfsmount_lock_local_unlock+0x1e/0x30 [28997.507960] [a008fa27] btrfs_writepages+0x27/0x30 [btrfs] [28997.507960] [81118161] do_writepages+0x21/0x40 [28997.507960] [8110e2cb] __filemap_fdatawrite_range+0x5b/0x60 [28997.507960] [8110f1d3] filemap_fdatawrite_range+0x13/0x20 [28997.507960] [81192c99] sys_sync_file_range+0x149/0x180 [28997.835220] [815f05c2] system_call_fastpath+0x16/0x1b [28997.835220] Code: 8b 7d 80 e8 dc 9e 00 00 41 b9 04 00 00 00 e9 3d fe ff ff 4d 89 ef 41 bc 01 00 00 00 48 c7 45 a8 ff ff ff ff e9 5c fb ff ff 0f 0b 0f 0b 0f 0b 0f 0b 0f 0b 0f 0b 0f 0b 0f 0b 0f 0b 66 0f 1f 84 00 [28997.835220] RIP [a0094f17] run_delalloc_nocow+0x7a7/0x7c0 [btrfs] [28997.835220] RSP 880117357a78 [28997.927402] ---[ end trace a0a1c4a13d975229 ]--- -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: OSD blocked for more than 120 seconds
On 17.10.2011 11:40, Christian Brunner wrote: 2011/10/15 Martin Mailand mar...@tuxadero.com: Hi Christian, I have a very similar experience, I also used josef's tree and btrfs snaps = 0; the next problem I had then was excessive fragmentation, so I used this patch http://marc.info/?l=linux-btrfsm=131495014823121w=2 and changed the btrfs options to (btrfs options = noatime,nodatacow,autodefrag), which kept the fragmentation under control. But even with this setup, after a few days the load on the osd is unbearable. How did you find out about our fragmentation issues? Was it just a performance problem? I used filefrag to show the number of extents; after the patch, I have on average 1.14 extents per 4MB ceph object on the osd. As far as I understood the docs, if you disable the btrfs snapshot functionality the writeahead journal is activated. http://ceph.newdream.net/wiki/Ceph.conf And I get this in the logs: mount: enabling WRITEAHEAD journal mode: 'filestore btrfs snap' mode is not enabled May I ask what kind of problems you had with ext4? Because I am looking into this direction as well. You can read about our ext4 problems here: http://marc.info/?l=ceph-develm=131201869703245w=2 I can still reproduce the bug with v3.1-rc9. Our bug report with RedHat didn't make any progress for a long time, but last week RedHat made two suggestions: - If you configure ceph with 'filestore flusher = false', do you see any different behavior? - If you mount with -o noauto_da_alloc, does it change anything? Since I have just migrated to btrfs it is hard for me to check this, but I'll try to do it as soon as I can get hold of some extra hardware. I can check this, I have a spare cluster at the moment. Regards, Christian
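The average-extent figure quoted above can be reproduced with filefrag. A minimal sketch (the object names and sample output below are illustrative, not taken from the cluster; on a real osd you would feed the awk stage from something like `filefrag /data/osd2/current/*/*`, path assumed):

```shell
# Average extents per object, computed from filefrag-style output
# ("name: N extents found"). Sample lines keep the sketch self-contained.
printf 'obj1: 1 extents found\nobj2: 1 extents found\nobj3: 2 extents found\n' \
  | awk '{ ext += $2; n++ } END { printf "%.2f extents per object\n", ext / n }'
```

With the three sample lines this prints `1.33 extents per object`; a value close to 1 means defragmentation is keeping up.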
Re: OSD blocked for more than 120 seconds
On 17.10.2011 14:05, Tomasz Paszkowski wrote: Hi, It seems that ext4 and btrfs are not to be considered stable for now. Can anyone confirm that ext3 is the best choice at the moment? Hi, I did a quick test with ext3, and it did not look very good. After a few minutes one of the osds failed with this message. [315274.737204] kjournald starting. Commit interval 5 seconds [315274.737919] EXT3-fs (sdb): using internal journal [315274.737929] EXT3-fs (sdb): mounted filesystem with ordered data mode [317040.890148] INFO: task ceph-osd:18032 blocked for more than 120 seconds. [317040.905855] echo 0 > /proc/sys/kernel/hung_task_timeout_secs disables this message. [317040.923801] ceph-osd D 880114c8b1a0 0 18032 1 0x [317040.923812] 88010f2e3cb8 0086 88010f2e3cb8 88010f2e3cb8 [317040.923821] 88011ffdff08 88010f2e3fd8 88010f2e2000 88010f2e3fd8 [317040.923830] 880116dadbc0 880114c8ade0 88010f2e3cd8 8110d500 [317040.923847] Call Trace: [317040.923865] [8110d500] ? find_get_pages_tag+0x40/0x130 [317040.923876] [815d93df] schedule+0x3f/0x60 [317040.923884] [815d99ed] schedule_timeout+0x26d/0x2e0 [317040.923893] [8101a725] ? native_sched_clock+0x15/0x70 [317040.923899] [8101a789] ? sched_clock+0x9/0x10 [317040.923908] [8108d465] ? sched_clock_local+0x25/0x90 [317040.923916] [815d9219] wait_for_common+0xd9/0x180 [317040.923924] [8105bbc0] ? try_to_wake_up+0x2b0/0x2b0 [317040.923932] [815d939d] wait_for_completion+0x1d/0x20 [317040.923941] [8118d652] sync_inodes_sb+0x92/0x1c0 [317040.923949] [81192440] ?
__sync_filesystem+0x90/0x90 [317040.923956] [81192430] __sync_filesystem+0x80/0x90 [317040.923963] [8119245f] sync_one_sb+0x1f/0x30 [317040.923972] [81169268] iterate_supers+0xa8/0x100 [317040.923979] [81192360] sync_filesystems+0x20/0x30 [317040.923985] [81192501] sys_sync+0x21/0x40 [317040.923995] [815e37c2] system_call_fastpath+0x16/0x1b Best Regards, martin
Re: OSD blocked for more than 120 seconds
On 17.10.2011 11:40, Christian Brunner wrote: Our bug report with RedHat didn't make any progress for a long time, but last week RedHat made two suggestions: - If you configure ceph with 'filestore flusher = false', do you see any different behavior? - If you mount with -o noauto_da_alloc, does it change anything? Hi, after a quick test I think 'filestore flusher = false' did the trick. What does it do? Best Regards, martin
Re: OSD blocked for more than 120 seconds
Hi Sage, the hang was on a btrfs, I do not have a fix for that. The 'filestore flusher = false' does fix the ext4 problems which were reported by Christian, but this option has quite an impact on the osd performance. The '-o noauto_da_alloc' option did not solve the fsck problem. Best Regards, Martin Sage Weil wrote: On Mon, 17 Oct 2011, Martin Mailand wrote: On 17.10.2011 11:40, Christian Brunner wrote: Our bug report with RedHat didn't make any progress for a long time, but last week RedHat made two suggestions: - If you configure ceph with 'filestore flusher = false', do you see any different behavior? - If you mount with -o noauto_da_alloc, does it change anything? Hi, after a quick test I think 'filestore flusher = false' did the trick. What does it do? It fixes your hang (previous email), or the subsequent fsck errors? When filestore flusher = true (the default), after every write the fd is handed off to another thread that uses sync_file_range() to push the data out to disk quickly before closing the file. The purpose is to limit the latency for the eventual snapshot or sync. Eric suspected the handoff between threads may be what was triggering the bug in ext4. sage
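For reference, the setting under discussion goes in ceph.conf roughly like this (a sketch; the section placement is an assumption, and note the performance impact Martin mentions above):

```ini
[osd]
    ; hand writes straight to the filestore instead of a flusher thread
    ; that calls sync_file_range() on the fd before closing it
    filestore flusher = false
```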
Re: OSD blocked for more than 120 seconds
Hi Christian, I have a very similar experience, I also used josef's tree and btrfs snaps = 0; the next problem I had then was excessive fragmentation, so I used this patch http://marc.info/?l=linux-btrfsm=131495014823121w=2 and changed the btrfs options to (btrfs options = noatime,nodatacow,autodefrag), which kept the fragmentation under control. But even with this setup, after a few days the load on the osd is unbearable. As far as I understood the docs, if you disable the btrfs snapshot functionality the writeahead journal is activated. http://ceph.newdream.net/wiki/Ceph.conf And I get this in the logs: mount: enabling WRITEAHEAD journal mode: 'filestore btrfs snap' mode is not enabled May I ask what kind of problems you had with ext4? Because I am looking into this direction as well. Best Regards, martin Christian Brunner wrote: I'm not seeing the same problem, but I've experienced something similar: As you might know, I had serious performance problems with btrfs some months ago; after that, I switched to ext4 and had other problems there. Last Saturday I decided to give josef's current btrfs git repo a try in our ceph cluster. Everything performed well at first, but after a day I noticed that btrfs-cleaner was wasting more and more time in btrfs_clean_old_snapshots. When we reached load 20 on the OSDs I rebooted the nodes, and everything was back to normal then. But again after a few hours the load started to rise. My solution to fix this for the moment was to turn off the btrfs snapshot feature in ceph with: filestore btrfs snaps = 0 Now I have good performance, low waitio values on the disks, and I haven't seen our btrfs warning until now as well. I don't know what the implications are (does this enable writeahead journaling in ceph?), but to me it's the only setup that does the job at the moment.
Regards, Christian 2011/10/14 Wido den Hollander w...@widodh.nl: Hi, On Thu, 2011-10-13 at 22:39 +0200, Martin Mailand wrote: Hi, on one of my OSDs the ceph-osd task hung for more than 120 sec. The OSD had almost no load, therefore it cannot be an overload problem. I think it is a btrfs problem, could someone clarify it? This was in the dmesg. [29280.890040] INFO: task btrfs-cleaner:1708 blocked for more than 120 Judging by the fact that I see btrfs-cleaner and btrfs-transaction blocking, I guess this is a btrfs bug/hangup. Which kernel are you using? Wido seconds. [29280.905659] echo 0 > /proc/sys/kernel/hung_task_timeout_secs disables this message. [29280.922916] btrfs-cleaner D 8801153bdf80 0 1708 2 0x [29280.922931] 88011698bbd0 0046 88011698bb90 81090d7d [29280.922960] 8801 88011698bfd8 88011698a000 88011698bfd8 [29280.922988] 81a0d020 8801153bdbc0 88011698bbd0 000181090d7d [29280.923018] Call Trace: [29280.923043] [81090d7d] ? ktime_get_ts+0xad/0xe0 [29280.923062] [8110cf10] ? __lock_page+0x70/0x70 [29280.923082] [815d93df] schedule+0x3f/0x60 [29280.923098] [815d948c] io_schedule+0x8c/0xd0 [29280.923114] [8110cf1e] sleep_on_page+0xe/0x20 [29280.923130] [815d9c6f] __wait_on_bit+0x5f/0x90 [29280.923147] [8110d168] wait_on_page_bit+0x78/0x80 [29280.923165] [81086bd0] ? autoremove_wake_function+0x40/0x40 [29280.923227] [a0065ecb] btrfs_defrag_file+0x4fb/0xc10 [btrfs] [29280.923246] [8117f6ac] ? find_inode+0xac/0xb0 [29280.923281] [a003a2d0] ? btrfs_clean_old_snapshots+0x160/0x160 [btrfs] [29280.923302] [812e369b] ? radix_tree_lookup+0xb/0x10 [29280.923337] [a0034f62] ? btrfs_read_fs_root_no_name+0x1c2/0x2e0 [btrfs] [29280.923375] [a004897e] btrfs_run_defrag_inodes+0x15e/0x210 [btrfs] [29280.923410] [a003278f] cleaner_kthread+0x17f/0x1a0 [btrfs] [29280.923443] [a0032610] ? btrfs_congested_fn+0xb0/0xb0 [btrfs] [29280.923460] [81086436] kthread+0x96/0xa0 [29280.923477] [815e5934] kernel_thread_helper+0x4/0x10 [29280.923493] [810863a0] ?
flush_kthread_worker+0xb0/0xb0 [29280.923510] [815e5930] ? gs_change+0x13/0x13 [29280.923521] INFO: task btrfs-transacti:1709 blocked for more than 120 seconds. [29280.939551] echo 0 > /proc/sys/kernel/hung_task_timeout_secs disables this message. [29280.956782] btrfs-transacti D 880115745f80 0 1709 2 0x [29280.956792] 880115e6fd50 0046 880115e6fd20 880111a5a3e0 [29280.956800] 8801 880115e6ffd8 880115e6e000 880115e6ffd8 [29280.956809] 81a0d020 880115745bc0 0282 000116758450 [29280.956817] Call Trace: [29280.956827] [815d93df] schedule+0x3f/0x60 [29280.956855] [a0037de5] wait_for_commit.clone.16+0x55/0x90 [btrfs] [29280.956864] [81086b90] ? wake_up_bit+0x40/0x40 [29280.956891] [a0039726] btrfs_commit_transaction+0x776/0x860 [btrfs
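Pulled together, the workaround settings mentioned in this thread would look roughly like this in ceph.conf (a sketch; the section placement is an assumption, values as quoted above):

```ini
[osd]
    ; disable btrfs snapshots -> ceph falls back to writeahead journaling
    filestore btrfs snaps = 0
    ; mount options that kept fragmentation under control in the thread
    btrfs options = noatime,nodatacow,autodefrag
```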
Btrfs High IO-Wait
Hi, I have high IO-Wait on the osds (ceph); the osds are running a v3.1-rc9 kernel. I also see high IO rates, around 500 IO/s reported via iostat. Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sda 0.00 0.00 0.00 6.80 0.00 62.40 18.35 0.04 5.29 0.00 5.29 5.29 3.60 sdb 0.00 249.80 0.40 669.60 1.60 4118.40 12.30 87.47 130.56 15.00 130.63 1.01 67.40 In comparison, the same workload, but the osd uses ext4 as a backing fs: Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sda 0.00 0.00 0.00 10.00 0.00 128.00 25.60 0.03 3.40 0.00 3.40 3.40 3.40 sdb 0.00 27.80 0.00 48.20 0.00 318.40 13.21 0.43 8.84 0.00 8.84 1.99 9.60 iodump shows similar results, where sdb is the data disk, sda7 the journal and sda5 the root. btrfs root@s-brick-003:~# echo 1 > /proc/sys/vm/block_dump root@s-brick-003:~# while true; do sleep 1; dmesg -c; done | perl /usr/local/bin/iodump ^C# Caught SIGINT. TASK PID TOTAL READ WRITE DIRTY DEVICES btrfs-submit-0 8321 28040 0 28040 0 sdb ceph-osd 8514 158 0 158 0 sda7 kswapd0 46 81 0 81 0 sda1 bash 10709 35 35 0 0 sda1 flush-8:0 962 12 0 12 0 sda5 kworker/0:1 8897 6 0 6 0 sdb kworker/1:1 10354 3 0 3 0 sdb kjournald 266 3 0 3 0 sda5 ceph-osd 8523 2 2 0 0 sda1 ceph-osd 8531 1 1 0 0 sda1 dmesg 10712 1 1 0 0 sda5 ext4 root@s-brick-002:~# echo 1 > /proc/sys/vm/block_dump root@s-brick-002:~# while true; do sleep 1; dmesg -c; done | perl /usr/local/bin/iodump ^C# Caught SIGINT. TASK PID TOTAL READ WRITE DIRTY DEVICES ceph-osd 3115 847 0 847 0 sdb jbd2/sdb-8 2897 784 0 784 0 sdb ceph-osd 3112 728 0 728 0 sda5, sdb ceph-osd 3110 191 0 191 0 sda7 perl 3628 13 13 0 0 sda5 flush-8:16 2901 8 0 8 0 sdb kjournald 272 3 0 3 0 sda5 dmesg 3630 1 1 0 0 sda5 sleep 3629 1 1 0 0 sda5 I think that is the same problem as in http://marc.info/?l=ceph-develm=131158049117139w=2 I also did a latencytop as Chris recommended in the above thread.
Best Regards, martin [Attachments: latencytop.out_long_uptime.bz2, latencytop.out_short_uptime.bz2]
Re: OSD::disk_tp timeout
Hi Christian, if I remember correctly you are using ceph with a qemu-kvm setup? After the last update of ceph, the load average on the osds doubled and the performance of the kvm machines became bad. The really weird thing is that the cluster needs around 30 mins to get into this state. After I restart the osds everything is fine, then after a while the load of the osd nodes builds up again. Most of the load is produced by btrfs kernel processes in the deferred state. Not sure if I have the same problem as you, as I do not get any timeouts. Best Regards, martin Christian Brunner wrote: Hi, I've upgraded ceph from 0.32 to 0.36 yesterday. Now I have a totally screwed ceph cluster. :( What bugs me most is the fact that OSDs become unresponsive frequently. The process is eating a lot of cpu and I can see the following messages in the log: Oct 8 22:30:05 os00 osd.000[31688]: 7fe0f3b9c700 heartbeat_map is_healthy 'OSD::disk_tp thread 0x7fe0e527e700' had timed out after 60 Oct 8 22:30:10 os00 osd.000[31688]: 7fe0f3b9c700 heartbeat_map is_healthy 'OSD::disk_tp thread 0x7fe0e527e700' had timed out after 60 Oct 8 22:30:15 os00 osd.000[31688]: 7fe0f3b9c700 heartbeat_map is_healthy 'OSD::disk_tp thread 0x7fe0e527e700' had timed out after 60 Oct 8 22:30:20 os00 osd.000[31688]: 7fe0f3b9c700 heartbeat_map is_healthy 'OSD::disk_tp thread 0x7fe0e527e700' had timed out after 60 Oct 8 22:30:25 os00 osd.000[31688]: 7fe0f3b9c700 heartbeat_map is_healthy 'OSD::disk_tp thread 0x7fe0e527e700' had timed out after 60 Oct 8 22:30:30 os00 osd.000[31688]: 7fe0f3b9c700 heartbeat_map is_healthy 'OSD::disk_tp thread 0x7fe0e527e700' had timed out after 60 Do you have any idea what to do about that?
Regards, Christian
Re: OSD::disk_tp timeout
Hi, I am using v3.1-rc9, so the fix is in there. Maybe I can nail it down a bit more specifically. Best Regards, martin Sage Weil wrote: Hi Christian, On Sat, 8 Oct 2011, Christian Brunner wrote: Hi, I've upgraded ceph from 0.32 to 0.36 yesterday. Now I have a totally screwed ceph cluster. :( What bugs me most is the fact that OSDs become unresponsive frequently. The process is eating a lot of cpu and I can see the What version of btrfs are you running? This sounds a bit like the bug fixed by this patch: http://www.spinics.net/lists/linux-btrfs/msg12627.html (That was just merged into mainline this week.) following messages in the log: Oct 8 22:30:05 os00 osd.000[31688]: 7fe0f3b9c700 heartbeat_map is_healthy 'OSD::disk_tp thread 0x7fe0e527e700' had timed out after 60 Oct 8 22:30:10 os00 osd.000[31688]: 7fe0f3b9c700 heartbeat_map is_healthy 'OSD::disk_tp thread 0x7fe0e527e700' had timed out after 60 Oct 8 22:30:15 os00 osd.000[31688]: 7fe0f3b9c700 heartbeat_map is_healthy 'OSD::disk_tp thread 0x7fe0e527e700' had timed out after 60 Oct 8 22:30:20 os00 osd.000[31688]: 7fe0f3b9c700 heartbeat_map is_healthy 'OSD::disk_tp thread 0x7fe0e527e700' had timed out after 60 Oct 8 22:30:25 os00 osd.000[31688]: 7fe0f3b9c700 heartbeat_map is_healthy 'OSD::disk_tp thread 0x7fe0e527e700' had timed out after 60 Oct 8 22:30:30 os00 osd.000[31688]: 7fe0f3b9c700 heartbeat_map is_healthy 'OSD::disk_tp thread 0x7fe0e527e700' had timed out after 60 Do you have any idea what to do about that? Those messages just mean that a thread in the disk threadpool (which is doing all the writes to btrfs) is blocked/stopped. sage
Re: rbd snap
Hi, is it possible to access snapshots without rolling back the head of the rbd volume? I want to take a snapshot of a vm running via librbd and qemu, and use the snapshot to make an offsite backup of the vm. Best Regards, Martin Martin Mailand wrote: Okay, with the btrfs patch and the right command line, snapshotting works like a charm. best regards, martin Josh Durgin wrote: On 09/16/2011 02:32 PM, Martin Mailand wrote: root@c-brick-001:~# rbd rm --snap=2011091601 test *** Caught signal (Segmentation fault) ** in thread 0x7f203d749740 ceph version 0.34 (commit:2f039eeeb745622b866d80feda7afa055e15f6d6) 1: rbd() [0x457062] 2: (()+0xfc60) [0x7f203ccf6c60] 3: (librbd::snap_set(librbd::ImageCtx*, char const*)+0x10) [0x7f203d32ecd0] 4: (main()+0x59f) [0x4518ff] 5: (__libc_start_main()+0xff) [0x7f203b6cdeff] 6: rbd() [0x44d569] Segmentation fault I added a bug to the tracker for this (http://tracker.newdream.net/issues/1545). It shouldn't crash the way you ran it, but if you're trying to remove a snapshot you need to use the 'snap rm' command, i.e.: $ rbd snap rm --snap=2011091601 test
Re: rbd snap
Hi, that's great. And is it safe to start different vms with different snapshots of the same image at the same time? Best Regards, Martin Sage Weil wrote: On Fri, 23 Sep 2011, Martin Mailand wrote: Hi, is it possible to access snapshots without rolling back the head of the rbd volume? Because I want to take a snapshot of a vm running via librbd and qemu, and use the snapshot to make an offsite backup of the vm. $ rbd export foo --snap=mysnap /path/to/foo.dump You can also map the snapshot via qemu with a string like rbd:rbd/foo@mysnap. sage Best Regards, Martin Martin Mailand wrote: Okay, with the btrfs patch and the right command line, snapshotting works like a charm. best regards, martin Josh Durgin wrote: On 09/16/2011 02:32 PM, Martin Mailand wrote: root@c-brick-001:~# rbd rm --snap=2011091601 test *** Caught signal (Segmentation fault) ** in thread 0x7f203d749740 ceph version 0.34 (commit:2f039eeeb745622b866d80feda7afa055e15f6d6) 1: rbd() [0x457062] 2: (()+0xfc60) [0x7f203ccf6c60] 3: (librbd::snap_set(librbd::ImageCtx*, char const*)+0x10) [0x7f203d32ecd0] 4: (main()+0x59f) [0x4518ff] 5: (__libc_start_main()+0xff) [0x7f203b6cdeff] 6: rbd() [0x44d569] Segmentation fault I added a bug to the tracker for this (http://tracker.newdream.net/issues/1545).
It shouldn't crash the way you ran it, but if you're trying to remove a snapshot you need to use the 'snap rm' command, i.e.: $ rbd snap rm --snap=2011091601 test
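The export command above suggests a simple offsite-backup workflow. A sketch (image, snapshot, and host names are made up for illustration; the export and snap rm invocations follow the forms shown in this thread, and `rbd snap create` is assumed to be symmetric with the `snap rm` shown above):

```
# take a point-in-time snapshot of the running vm's image
rbd snap create --snap=backup-20110923 vmimage
# export the snapshot, not the moving head
rbd export vmimage --snap=backup-20110923 /backup/vmimage-20110923.dump
# ship it offsite, then drop the snapshot
scp /backup/vmimage-20110923.dump backup-host:/offsite/
rbd snap rm --snap=backup-20110923 vmimage
```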
RBD Performance
hi, I have a few questions about the rbd performance. I have a small ceph installation: three osd servers, one monitor server and one compute node which maps a rbd image to a block device; all servers are connected via a dedicated 1Gb/s network. Each osd is capable of doing around 90MB/s, tested with osd bench. But if I test the write speed of the rbd block device, the performance is quite poor. I do the test with dd if=/dev/zero of=/dev/rbd0 bs=1M count=1 oflag=direct, I get a throughput around 25MB/s. I used wireshark to graph the network throughput, the image is http://tuxadero.com/multistorage/ceph.jpg as you can see the throughput is not smooth. The graph for the test without oflag=direct is http://tuxadero.com/multistorage/ceph2.jpg which is much better, but the compute node uses around 4-5G of its RAM as a writeback cache, which is not acceptable for my application. For comparison, the graph for an scp transfer: http://tuxadero.com/multistorage/scp.jpg I read in the ceph docs that every package has to be committed to the disk on the osd before it is acknowledged to the client; could you please explain what a package is? Probably not a TCP package. And on the mailing list there was a discussion about a writeback window; to my understanding it says how many bytes can be unacknowledged in transit, is that right? How could I activate it? Thanks for your time. Best Regards, martin
Re: RBD Performance
Hi Sage, good to hear that you are working on this issue. I tried qemu-kvm with the rbd block device patch, which I think uses librbd, but I couldn't measure any performance improvements. Which versions do I have to use, and do I have to activate the writeback window or is it on by default? Best Regards, Martin Sage Weil wrote: On Wed, 21 Sep 2011, Martin Mailand wrote: hi, I have a few questions about the rbd performance. I have a small ceph installation: three osd servers, one monitor server and one compute node which maps a rbd image to a block device; all servers are connected via a dedicated 1Gb/s network. Each osd is capable of doing around 90MB/s, tested with osd bench. But if I test the write speed of the rbd block device, the performance is quite poor. I do the test with dd if=/dev/zero of=/dev/rbd0 bs=1M count=1 oflag=direct, I get a throughput around 25MB/s. I used wireshark to graph the network throughput, the image is http://tuxadero.com/multistorage/ceph.jpg as you can see the throughput is not smooth. The graph for the test without oflag=direct is http://tuxadero.com/multistorage/ceph2.jpg which is much better, but the compute node uses around 4-5G of its RAM as a writeback cache, which is not acceptable for my application. For comparison, the graph for an scp transfer: http://tuxadero.com/multistorage/scp.jpg I read in the ceph docs that every package has to be committed to the disk on the osd before it is acknowledged to the client; could you please explain what a package is? Probably not a TCP package. You probably mean object. Each write has to be on disk before it is acknowledged. And on the mailing list there was a discussion about a writeback window; to my understanding it says how many bytes can be unacknowledged in transit, is that right? Right. How could I activate it? So far it's currently only implemented in librbd (the userland implementation).
The problem is that your dd is doing synchronous writes to the block device, which are synchronously written to the OSD. That means a lot of time waiting around for the last write to complete before starting to send the next one. Normal hard disks have a cache that absorbs this. They acknowledge the write immediately, and only promise that the data will actually be durable when you issue a flush command later. In librbd, we just added a write window that gives you similar performance. We acknowledge writes immediately and do the write asynchronously, with a cap on the amount of outstanding bytes. This doesn't coalesce small writes into big ones like a real cache, but usually the filesystem does most of that, so we should get similar performance. Anyway, the kernel implementation doesn't do that yet. It's on the todo list for the next 2 weeks... sage
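Sage's explanation also accounts for the 25MB/s figure seen earlier in the thread: with fully synchronous 1MB writes, throughput is just the write size divided by the per-write round-trip time. A back-of-envelope check (the numbers are the ones quoted in this thread, not new measurements):

```shell
# throughput = block size / per-write latency, so 25 MB/s with synchronous
# 1 MB writes implies each write round trip takes bs/throughput seconds
awk 'BEGIN { bs_mb = 1; tput_mb_s = 25; printf "%.0f ms per write\n", bs_mb / tput_mb_s * 1000 }'
```

This prints `40 ms per write`, i.e. roughly one journal commit plus network round trip per block, which is exactly the latency the writeback window hides.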
Re: RBD Performance
Hi Sage, the performance improvement is quite impressive. Now I get around 90MB/s from within the vm. Thanks. Best Regards, martin Sage Weil wrote: On Wed, 21 Sep 2011, Martin Mailand wrote: Hi Sage, good to hear that you are working on this issue. I tried qemu-kvm with the rbd block device patch, which I think uses librbd, but I couldn't measure any performance improvements. Which versions do I have to use, and do I have to activate the writeback window or is it on by default? In the qemu rbd: line, include an option like :rbd_writeback_window=8192, where the size of the window is specified in bytes. (It's off by default.) Also, keep in mind that unless you're using the latest qemu upstream (or our repo), the flushes aren't being passed down properly, and your data won't quite be safe. (That's the main reason why we're leaving it off by default for the time being.) sage
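Put together, a qemu invocation with the option Sage describes might look like this (a sketch: the pool/image names, memory size, and virtio choice are illustrative; only the `:rbd_writeback_window=8192` option comes from the thread):

```
qemu-system-x86_64 -m 1024 \
  -drive file=rbd:rbd/vmimage:rbd_writeback_window=8192,if=virtio
```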
Re: ceph kernel bug
Hi Sage, I reran the test and I think I triggered the first bug again. http://pastebin.com/ydNm0pff I did also the dumps for you. http://tuxadero.com/multistorage/ceph.ko_dump http://tuxadero.com/multistorage/libceph.ko_dump Best Regards, martin On 16.09.2011 00:54, Sage Weil wrote: On Thu, 15 Sep 2011, Martin Mailand wrote: Hi Sage, that's quite a bit of output, I put it in a pastebin. http://pastebin.com/9CNJk0Pw. Any chance you can include the output of 'objdump -rdS libceph.ko'? ceph.ko too, for good measure. This looks like a slightly different crash than the one on that bug! Thanks! sage Best Regards, martin Sage Weil wrote: On Thu, 15 Sep 2011, Martin Mailand wrote: Hi Sage, I am still hitting this in -rc6. It happens every time I stop an OSD. Do you need more information to reproduce it? Oh, great to hear it's easy to reproduce! I was trying (in my uml environment) and failing. Can you run the script below right before stopping the osd, and send the dmesg output along? (Or attach to http://tracker.newdream.net/issues/1382) Thanks!
sage

#!/bin/sh -x
p() {
  echo $* > /sys/kernel/debug/dynamic_debug/control
}
p 'module ceph +p'
p 'module libceph +p'
p 'module rbd +p'
p 'file net/ceph/messenger.c -p'
p 'file' `grep -- --- /sys/kernel/debug/dynamic_debug/control | grep ceph \
  | awk '{print $1}' | sed 's/:/ line /'` '+p'
p 'file' `grep -- === /sys/kernel/debug/dynamic_debug/control | grep ceph \
  | awk '{print $1}' | sed 's/:/ line /'` '+p'
Re: WARNING: at fs/btrfs/inode.c:2193 btrfs_orphan_commit_root+0xb0/0xc0 [btrfs]()
Hi Josef, after a quick test it seems that I do not hit this Warning any longer. But I got a new one. [ 5241.839951] [ cut here ] [ 5241.839974] WARNING: at fs/btrfs/extent-tree.c:5715 btrfs_alloc_free_block+0xac/0x330 [btrfs]() [ 5241.839979] Hardware name: MS-96B3 [ 5241.839982] Modules linked in: radeon ttm drm_kms_helper drm i2c_algo_bit psmouse k8temp sp5100_tco edac_core edac_mce_amd serio_raw shpchp i2c_piix4 lp parport ahci pata_atiixp libahci btrfs e1000e zlib_deflate libcrc32c [ 5241.840068] Pid: 1568, comm: kworker/0:1 Tainted: GW 3.1.0-rc6 #1 [ 5241.840072] Call Trace: [ 5241.840084] [81063d0f] warn_slowpath_common+0x7f/0xc0 [ 5241.840101] [81063d6a] warn_slowpath_null+0x1a/0x20 [ 5241.840133] [a002a9cc] btrfs_alloc_free_block+0xac/0x330 [btrfs] [ 5241.840152] [8110d35a] ? unlock_page+0x2a/0x40 [ 5241.840188] [a0059268] ? read_extent_buffer+0xa8/0x180 [btrfs] [ 5241.840222] [a0031c00] ? verify_parent_transid+0x160/0x160 [btrfs] [ 5241.840252] [a001a0d2] __btrfs_cow_block+0x122/0x4b0 [btrfs] [ 5241.840283] [a001a552] btrfs_cow_block+0xf2/0x1f0 [btrfs] [ 5241.840314] [a001cb88] push_leaf_left+0x108/0x180 [btrfs] [ 5241.840344] [a001fb78] btrfs_del_items+0x2b8/0x440 [btrfs] [ 5241.840379] [a00300c2] btrfs_del_csums+0x2d2/0x310 [btrfs] [ 5241.840415] [a00677a8] ? btrfs_tree_unlock+0x28/0xb0 [btrfs] [ 5241.840447] [a002597a] __btrfs_free_extent+0x48a/0x6f0 [btrfs] [ 5241.840480] [a0028c8d] run_clustered_refs+0x21d/0x840 [btrfs] [ 5241.840514] [a002937a] btrfs_run_delayed_refs+0xca/0x220 [btrfs] [ 5241.840551] [a0053576] ? btrfs_run_ordered_operations+0x1d6/0x200 [btrfs] [ 5241.840587] [a0038fa3] btrfs_commit_transaction+0x83/0x870 [btrfs] [ 5241.840605] [81012871] ? __switch_to+0x261/0x2f0 [ 5241.840622] [81086d70] ? wake_up_bit+0x40/0x40 [ 5241.840656] [a0039790] ? 
btrfs_commit_transaction+0x870/0x870 [btrfs] [ 5241.840691] [a00397af] do_async_commit+0x1f/0x30 [btrfs] [ 5241.840708] [8108110d] process_one_work+0x11d/0x430 [ 5241.840724] [81081dd9] worker_thread+0x169/0x360 [ 5241.840741] [81081c70] ? manage_workers.clone.21+0x240/0x240 [ 5241.840758] [81086616] kthread+0x96/0xa0 [ 5241.840775] [815f2434] kernel_thread_helper+0x4/0x10 [ 5241.840792] [81086580] ? flush_kthread_worker+0xb0/0xb0 [ 5241.840808] [815f2430] ? gs_change+0x13/0x13 [ 5241.840819] ---[ end trace c8a580615cad6cb5 ]--- Best Regards, Martin On 15.09.2011 21:50, Josef Bacik wrote: On Thu, Sep 15, 2011 at 11:44:09AM -0700, Sage Weil wrote: On Tue, 13 Sep 2011, Liu Bo wrote: On 09/11/2011 05:47 AM, Martin Mailand wrote: Hi, I am hitting this warning reproducibly; the workload is a ceph osd, kernel is 3.1.0-rc5. Have posted a patch for this: http://marc.info/?l=linux-btrfsm=131547325515336w=2 We're still seeing this with -rc6, which includes 98c9942 and 65450aa. I haven't looked at the reservation code in much detail. Is there anything I can do to help track this down? This should be taken care of with all my enospc changes. You can pull them down from my btrfs-work tree as soon as kernel.org comes back from the dead :). Thanks, Josef
rbd snap
Hi, should rbd snap work? I created a snapshot and then I want to list it. rbd snap create --snap=2011091601 lenny1 rbd snap ls lenny1 But the ls command does not come back and I get a Kernel bug on 2 OSD. It is reproducible. [ 7658.115729] [ cut here ] [ 7658.115779] kernel BUG at fs/btrfs/delayed-inode.c:1693! [ 7658.115812] invalid opcode: [#1] SMP [ 7658.115846] CPU 1 [ 7658.115861] Modules linked in: radeon ttm drm_kms_helper drm i2c_algo_bit psmouse k8temp sp5100_tco edac_core edac_mce_amd serio_raw shpchp i2c_piix4 lp parport ahci pata_atiixp libahci btrfs e1000e zlib_deflate libcrc32c [ 7658.116080] [ 7658.116095] Pid: 1418, comm: cosd Tainted: GW 3.1.0-rc6 #1 MICRO-STAR INTERNATIONAL CO., LTD MS-96B3/MS-96B3 [ 7658.116167] RIP: 0010:[a007ffd0] [a007ffd0] btrfs_delayed_update_inode+0x2a0/0x2b0 [btrfs] [ 7658.116278] RSP: 0018:8801160efbc8 EFLAGS: 00010286 [ 7658.116311] RAX: ffe4 RBX: 8800777c0120 RCX: 00018000 [ 7658.116351] RDX: f7e5 RSI: 00018000 RDI: 880116b10160 [ 7658.116389] RBP: 8801160efc08 R08: e8c81a40 R09: 8800886826a0 [ 7658.116428] R10: R11: R12: 880001e4af50 [ 7658.116467] R13: 8800777c0168 R14: 880117122ea0 R15: 880115113000 [ 7658.116507] FS: 7f80e30eb700() GS:88011fc8() knlGS: [ 7658.116550] CS: 0010 DS: ES: CR0: 80050033 [ 7658.116583] CR2: ff600400 CR3: 000116015000 CR4: 06e0 [ 7658.116623] DR0: DR1: DR2: [ 7658.116662] DR3: DR6: 0ff0 DR7: 0400 [ 7658.116700] Process cosd (pid: 1418, threadinfo 8801160ee000, task 880116aeade0) [ 7658.116744] Stack: [ 7658.116759] 0282 00018000 8801160efc18 880001e4af50 [ 7658.116817] 880117122ea0 8800391b01e0 8801171113f0 [ 7658.116875] 8801160efc58 a003f353 8801160efc38 a00677f8 [ 7658.116933] Call Trace: [ 7658.116978] [a003f353] btrfs_update_inode+0x53/0x160 [btrfs] [ 7658.117039] [a00677f8] ? btrfs_tree_unlock+0x78/0xb0 [btrfs] [ 7658.117099] [a0063184] btrfs_ioctl_clone+0x9b4/0xd20 [btrfs] [ 7658.117164] [a00666f6] btrfs_ioctl+0x306/0xe20 [btrfs] [ 7658.117204] [81175f32] ? 
do_filp_open+0x42/0xa0 [ 7658.117240] [81178048] do_vfs_ioctl+0x98/0x540 [ 7658.117277] [81156f40] ? kmem_cache_free+0x20/0x100 [ 7658.117313] [81178581] sys_ioctl+0x91/0xa0 [ 7658.117347] [815f02c2] system_call_fastpath+0x16/0x1b [ 7658.117381] Code: 00 03 00 00 8d 0c 49 48 89 ca 48 89 4d c8 e8 f8 7d fa ff 85 c0 48 8b 4d c8 75 10 48 89 4b 08 e9 cc fd ff ff 0f 1f 80 00 00 00 00 0f 0b 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 53 [ 7658.117814] RIP [a007ffd0] btrfs_delayed_update_inode+0x2a0/0x2b0 [btrfs] [ 7658.117883] RSP 8801160efbc8 [ 7658.122364] ---[ end trace c8a580615cad6cbe ]--- Best Regards, Martin
Re: WARNING: at fs/btrfs/inode.c:2193 btrfs_orphan_commit_root+0xb0/0xc0 [btrfs]()
Hi Josef, the commit is not in there, but the code looks like your post. if (--trans->use_count) { trans->block_rsv = trans->orig_rsv; return 0; } trans->block_rsv = NULL; while (count < 4) { unsigned long cur = trans->delayed_ref_updates; trans->delayed_ref_updates = 0; But on the other hand I am quite new to git, how could I get your latest commit? Best Regards, Martin On 16.09.2011 16:37, Josef Bacik wrote: On 09/16/2011 10:09 AM, Martin Mailand wrote: Hi Josef, after a quick test it seems that I do not hit this Warning any longer. But I got a new one. Hmm, looks like that may not be my newest stuff, is commit 57f499e1bb76ba3ebeb09cd12e9dac84baa5812b in there? Specifically look at __btrfs_end_transaction in transaction.c and see if the line trans->block_rsv = NULL; is before the first while() loop. Thanks, Josef
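For Martin's question about checking for a commit: the usual way to answer "is commit X in my tree?" is `git merge-base --is-ancestor`. A self-contained sketch in a throwaway repository (with a real kernel tree you would run the last command against the commit id from the thread instead):

```shell
# Demonstrate "is commit X an ancestor of HEAD?" in a scratch repo.
tmp=$(mktemp -d) && cd "$tmp"
git init -q repo && cd repo
git -c user.name=t -c user.email=t@example.com commit -q --allow-empty -m first
old=$(git rev-parse HEAD)                 # pretend this is the commit in question
git -c user.name=t -c user.email=t@example.com commit -q --allow-empty -m second
git merge-base --is-ancestor "$old" HEAD && echo "commit is present"
```

To pick up Josef's newest work you would add his btrfs-work tree as a remote and `git fetch` it, then re-run the ancestor check; the tree's URL is not given in the thread, so it is omitted here.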
Re: ceph kernel bug
Hi Sage, yes it fixes things for me as well. Best Regards, martin Sage Weil wrote: Hi Martin, Thanks, this was enough to help me reproduce it, and I believe I have a correct fix (it's working for me). Can you try commit 935b639 'libceph: fix linger request requeuing' (for-linus branch of git://github.com/NewDreamNetwork/ceph-client.git) and confirm that it fixes things for you as well? Thanks! sage
Re: rbd snap
Hi Sage, yes, that fixes the btrfs problem. But now I have a new bug. root@c-brick-001:~# rbd rm --snap=2011091601 test *** Caught signal (Segmentation fault) ** in thread 0x7f203d749740 ceph version 0.34 (commit:2f039eeeb745622b866d80feda7afa055e15f6d6) 1: rbd() [0x457062] 2: (()+0xfc60) [0x7f203ccf6c60] 3: (librbd::snap_set(librbd::ImageCtx*, char const*)+0x10) [0x7f203d32ecd0] 4: (main()+0x59f) [0x4518ff] 5: (__libc_start_main()+0xff) [0x7f203b6cdeff] 6: rbd() [0x44d569] Segmentation fault I use the ceph ubuntu build from your site. Best Regards, martin Sage Weil wrote: There is a patch for a btrfs bug in the clone ioctl reservation that hasn't made it upstream yet. See http://marc.info/?l=linux-btrfsm=131291225105499w=2 That should sort you out! sage On Fri, 16 Sep 2011, Martin Mailand wrote: Hi, should rbd snap work? I created a snapshot and then I want to list it. rbd snap create --snap=2011091601 lenny1 rbd snap ls lenny1 But the ls command does not come back and I get a kernel BUG on 2 OSDs. It is reproducible. [ 7658.115729] [ cut here ] [ 7658.115779] kernel BUG at fs/btrfs/delayed-inode.c:1693!
[ 7658.115812] invalid opcode: [#1] SMP [ 7658.115846] CPU 1 [ 7658.115861] Modules linked in: radeon ttm drm_kms_helper drm i2c_algo_bit psmouse k8temp sp5100_tco edac_core edac_mce_amd serio_raw shpchp i2c_piix4 lp parport ahci pata_atiixp libahci btrfs e1000e zlib_deflate libcrc32c [ 7658.116080] [ 7658.116095] Pid: 1418, comm: cosd Tainted: GW 3.1.0-rc6 #1 MICRO-STAR INTERNATIONAL CO., LTD MS-96B3/MS-96B3 [ 7658.116167] RIP: 0010:[a007ffd0] [a007ffd0] btrfs_delayed_update_inode+0x2a0/0x2b0 [btrfs] [ 7658.116278] RSP: 0018:8801160efbc8 EFLAGS: 00010286 [ 7658.116311] RAX: ffe4 RBX: 8800777c0120 RCX: 00018000 [ 7658.116351] RDX: f7e5 RSI: 00018000 RDI: 880116b10160 [ 7658.116389] RBP: 8801160efc08 R08: e8c81a40 R09: 8800886826a0 [ 7658.116428] R10: R11: R12: 880001e4af50 [ 7658.116467] R13: 8800777c0168 R14: 880117122ea0 R15: 880115113000 [ 7658.116507] FS: 7f80e30eb700() GS:88011fc8() knlGS: [ 7658.116550] CS: 0010 DS: ES: CR0: 80050033 [ 7658.116583] CR2: ff600400 CR3: 000116015000 CR4: 06e0 [ 7658.116623] DR0: DR1: DR2: [ 7658.116662] DR3: DR6: 0ff0 DR7: 0400 [ 7658.116700] Process cosd (pid: 1418, threadinfo 8801160ee000, task 880116aeade0) [ 7658.116744] Stack: [ 7658.116759] 0282 00018000 8801160efc18 880001e4af50 [ 7658.116817] 880117122ea0 8800391b01e0 8801171113f0 [ 7658.116875] 8801160efc58 a003f353 8801160efc38 a00677f8 [ 7658.116933] Call Trace: [ 7658.116978] [a003f353] btrfs_update_inode+0x53/0x160 [btrfs] [ 7658.117039] [a00677f8] ? btrfs_tree_unlock+0x78/0xb0 [btrfs] [ 7658.117099] [a0063184] btrfs_ioctl_clone+0x9b4/0xd20 [btrfs] [ 7658.117164] [a00666f6] btrfs_ioctl+0x306/0xe20 [btrfs] [ 7658.117204] [81175f32] ? do_filp_open+0x42/0xa0 [ 7658.117240] [81178048] do_vfs_ioctl+0x98/0x540 [ 7658.117277] [81156f40] ? 
kmem_cache_free+0x20/0x100 [ 7658.117313] [81178581] sys_ioctl+0x91/0xa0 [ 7658.117347] [815f02c2] system_call_fastpath+0x16/0x1b [ 7658.117381] Code: 00 03 00 00 8d 0c 49 48 89 ca 48 89 4d c8 e8 f8 7d fa ff 85 c0 48 8b 4d c8 75 10 48 89 4b 08 e9 cc fd ff ff 0f 1f 80 00 00 00 00 0f 0b 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 53 [ 7658.117814] RIP [a007ffd0] btrfs_delayed_update_inode+0x2a0/0x2b0 [btrfs] [ 7658.117883] RSP 8801160efbc8 [ 7658.122364] ---[ end trace c8a580615cad6cbe ]--- Best Regards, Martin
Re: rbd snap
Hi Josh, right, that's my mistake, I will try it with the right command line tomorrow. Best Regards, martin Josh Durgin wrote: On 09/16/2011 02:32 PM, Martin Mailand wrote: root@c-brick-001:~# rbd rm --snap=2011091601 test *** Caught signal (Segmentation fault) ** in thread 0x7f203d749740 ceph version 0.34 (commit:2f039eeeb745622b866d80feda7afa055e15f6d6) 1: rbd() [0x457062] 2: (()+0xfc60) [0x7f203ccf6c60] 3: (librbd::snap_set(librbd::ImageCtx*, char const*)+0x10) [0x7f203d32ecd0] 4: (main()+0x59f) [0x4518ff] 5: (__libc_start_main()+0xff) [0x7f203b6cdeff] 6: rbd() [0x44d569] Segmentation fault I added a bug to the tracker for this (http://tracker.newdream.net/issues/1545). It shouldn't crash the way you ran it, but if you're trying to remove a snapshot you need to use the 'snap rm' command, i.e.: $ rbd snap rm --snap=2011091601 test
Re: ceph kernel bug
Hi Sage, I am still hitting this in -rc6. It happens every time I stop an OSD. Do you need more information to reproduce it? Best Regards, martin [103159.164630] libceph: osd0 192.168.42.113:6800 socket closed [103169.153484] [ cut here ] [103169.162935] kernel BUG at net/ceph/messenger.c:2193! [103169.163332] invalid opcode: [#1] SMP [103169.163332] CPU 0 [103169.163332] Modules linked in: btrfs zlib_deflate rbd libceph libcrc32c ip6table_filter ip6_tables iptable_filter ip_tables x_tables kvm_amd kvm bridge nv_tco stp radeon ttm drm_kms_helper drm lp parport i2c_algo_bit amd64_edac_mod i2c_nforce2 edac_core edac_mce_amd k10temp shpchp psmouse serio_raw ses enclosure aacraid forcedeth [103169.163332] [103169.163332] Pid: 4405, comm: kworker/0:1 Not tainted 3.1.0-rc6 #1 Supermicro H8DM8-2/H8DM8-2 [103169.163332] RIP: 0010:[a02b73f1] [a02b73f1] ceph_con_send+0x111/0x120 [libceph] [103169.163332] RSP: 0018:88031c5b3bd0 EFLAGS: 00010202 [103169.163332] RAX: 88040502c678 RBX: 88040452b030 RCX: 88031c8a9e50 [103169.163332] RDX: 88031c5b3fd8 RSI: 88040502c600 RDI: 88040452b1a8 [103169.163332] RBP: 88031c5b3bf0 R08: 88040fc0de40 R09: 0002 [103169.163332] R10: 0002 R11: 0072 R12: 88040452b1a8 [103169.163332] R13: 88040502c600 R14: 88031c8a9e60 R15: 88031c8a9e50 [103169.163332] FS: 7f6d43dd2700() GS:88040fc0() knlGS: [103169.163332] CS: 0010 DS: ES: CR0: 8005003b [103169.163332] CR2: ff600400 CR3: 000403fb1000 CR4: 06f0 [103169.163332] DR0: DR1: DR2: [103169.163332] DR3: DR6: 0ff0 DR7: 0400 [103169.163332] Process kworker/0:1 (pid: 4405, threadinfo 88031c5b2000, task 880405cd5bc0) [103169.163332] Stack: [103169.163332] 88031c5b3bf0 880404632a00 88031c8a9e30 88031c8a9da8 [103169.163332] 88031c5b3c40 a02bc8ad 88031c8a9c80 88031c8a9e00 [103169.163332] 88031c5b3c40 8804045b7151 88031c8a9da8 [103169.163332] Call Trace: [103169.163332] [a02bc8ad] send_queued+0xed/0x130 [libceph] [103169.163332] [a02bed81] ceph_osdc_handle_map+0x261/0x3b0 [libceph] [103169.163332] [a02bb31f]
dispatch+0x10f/0x580 [libceph] [103169.163332] [a02b954f] con_work+0x214f/0x21d0 [libceph] [103169.163332] [a02b7400] ? ceph_con_send+0x120/0x120 [libceph] [103169.163332] [8108110d] process_one_work+0x11d/0x430 [103169.163332] [81081c69] worker_thread+0x169/0x360 [103169.163332] [81081b00] ? manage_workers.clone.21+0x240/0x240 [103169.163332] [81086496] kthread+0x96/0xa0 [103169.163332] [815e5c34] kernel_thread_helper+0x4/0x10 [103169.163332] [81086400] ? flush_kthread_worker+0xb0/0xb0 [103169.163332] [815e5c30] ? gs_change+0x13/0x13 [103169.163332] Code: 65 f0 4c 8b 6d f8 c9 c3 66 90 48 8d be 88 00 00 00 48 c7 c6 70 98 2b a0 e8 1d ad 02 e1 48 8b 5d e8 4c 8b 65 f0 4c 8b 6d f8 c9 c3 0f 0b 0f 0b 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 41 57 [103169.163332] RIP [a02b73f1] ceph_con_send+0x111/0x120 [libceph] [103169.163332] RSP 88031c5b3bd0 [103169.805672] ---[ end trace 49d197af1dff5a93 ]--- [103169.818910] BUG: unable to handle kernel paging request at fff8 [103169.828781] IP: [810868f0] kthread_data+0x10/0x20 [103169.828781] PGD 1a07067 PUD 1a08067 PMD 0 [103169.828781] Oops: [#2] SMP [103169.828781] CPU 0 [103169.828781] Modules linked in: btrfs zlib_deflate rbd libceph libcrc32c ip6table_filter ip6_tables iptable_filter ip_tables x_tables kvm_amd kvm bridge nv_tco stp radeon ttm drm_kms_helper drm lp parport i2c_algo_bit amd64_edac_mod i2c_nforce2 edac_core edac_mce_amd k10temp shpchp psmouse serio_raw ses enclosure aacraid forcedeth [103169.828781] [103169.828781] Pid: 4405, comm: kworker/0:1 Tainted: G D 3.1.0-rc6 #1 Supermicro H8DM8-2/H8DM8-2 [103169.828781] RIP: 0010:[810868f0] [810868f0] kthread_data+0x10/0x20 [103169.828781] RSP: 0018:88031c5b3878 EFLAGS: 00010096 [103169.828781] RAX: RBX: RCX: [103169.828781] RDX: 880405cd5bc0 RSI: RDI: 880405cd5bc0 [103169.828781] RBP: 88031c5b3878 R08: 00989680 R09: [103169.828781] R10: 0400 R11: 0005 R12: 880405cd5f88 [103169.828781] R13: R14: R15: 880405cd5e90 [103169.828781] FS: 7f6d43dd2700() GS:88040fc0() 
knlGS: [103169.828781] CS: 0010 DS: ES: CR0: 8005003b [103169.828781] CR2: fff8 CR3: 000403fb1000 CR4: 06f0 [103169.828781] DR0: