Re: osd: new pool flags: noscrub, nodeep-scrub
On Fri, Sep 11, 2015 at 4:24 PM, Mykola Golub wrote:
> On Fri, Sep 11, 2015 at 05:59:56AM -0700, Sage Weil wrote:
>> I wonder if, in addition, we should also allow scrub and deep-scrub
>> intervals to be set on a per-pool basis?
>
> ceph osd pool set [deep-]scrub_interval N ?

BTW it would be absolutely lovely to see copy-aware scrubbing, e.g. parallel (deep-)scrubs on non-intersecting sets of PGs. Currently, as far as I can see, once a scrub starts, max_scrubs is in effect only for the primary OSD, allowing situations where two scrubs, one primary and one non-primary, land on the same OSD.
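For illustration, the per-pool knobs discussed here might look like this on the CLI. This is a sketch of the proposed interface only, following the flag names from the subject and Mykola's suggested syntax; the pool name and values are examples, not a shipped command set:

  ceph osd pool set rbd noscrub 1
  ceph osd pool set rbd nodeep-scrub 1
  ceph osd pool set rbd scrub_interval 86400         # seconds
  ceph osd pool set rbd deep_scrub_interval 604800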
Re: leaking mons on the latest dumpling
On Thu, Apr 16, 2015 at 11:30 AM, Joao Eduardo Luis j...@suse.de wrote:

On 04/15/2015 05:38 PM, Andrey Korolyov wrote:

Hello,

There is a slow leak which is present in all ceph versions, I assume, but it is positively exposed only over large time spans and on large clusters. It looks like the lower a monitor is placed in the quorum hierarchy, the higher the leak is:

{"election_epoch":26,"quorum":[0,1,2,3,4],"quorum_names":["0","1","2","3","4"],"quorum_leader_name":"0","monmap":{"epoch":1,"fsid":"a2ec787e-3551-4a6f-aa24-deedbd8f8d01","modified":"2015-03-05 13:48:54.696784","created":"2015-03-05 13:48:54.696784","mons":[{"rank":0,"name":"0","addr":"10.0.1.91:6789\/0"},{"rank":1,"name":"1","addr":"10.0.1.92:6789\/0"},{"rank":2,"name":"2","addr":"10.0.1.93:6789\/0"},{"rank":3,"name":"3","addr":"10.0.1.94:6789\/0"},{"rank":4,"name":"4","addr":"10.0.1.95:6789\/0"}]}}

ceph heap stats -m 10.0.1.95:6789 | grep Actual
MALLOC: = 427626648 ( 407.8 MiB) Actual memory used (physical + swap)
ceph heap stats -m 10.0.1.94:6789 | grep Actual
MALLOC: = 289550488 ( 276.1 MiB) Actual memory used (physical + swap)
ceph heap stats -m 10.0.1.93:6789 | grep Actual
MALLOC: = 230592664 ( 219.9 MiB) Actual memory used (physical + swap)
ceph heap stats -m 10.0.1.92:6789 | grep Actual
MALLOC: = 253710488 ( 242.0 MiB) Actual memory used (physical + swap)
ceph heap stats -m 10.0.1.91:6789 | grep Actual
MALLOC: = 97112216 ( 92.6 MiB) Actual memory used (physical + swap)

For almost the same uptime, the data difference is:

rd KB 55365750505
wr KB 82719722467

The leak itself is not very critical, but it of course requires some script work to restart monitors at least once per month on a 300TB cluster to keep monitor processes from reaching 1GB of memory. Given the current status of dumpling, it would probably be possible to identify the leak source and then forward-port the fix to the newer releases, as the freshest version I am running at a large scale is the top of the dumpling branch; otherwise it would require an enormous amount of time to check fix proposals.

There have been numerous reports of a slow leak in the monitors on dumpling and firefly. I'm sure there's a ticket for that but I wasn't able to find it. Many hours were spent chasing down this leak to no avail, despite plugging several leaks throughout the code (especially in firefly; those should have been backported to dumpling at some point or other). This was mostly hard to figure out because it tends to require a long-running cluster to show up, and the bigger the cluster, the larger the probability of triggering it. This behavior has me believing it should be somewhere in the message dispatching workflow and, given it's the leader that suffers the most, should be somewhere in the read-write message dispatching (PaxosService::prepare_update()). But despite code inspections, I don't think we ever found the cause -- or that any fixed leak was ever flagged as the root of the problem.

Anyway, since Giant, most complaints (if not all!) went away. Maybe I missed them, or maybe people suffering from this just stopped complaining. I'm hoping it's the former rather than the latter and, as luck has it, maybe the fix was a fortunate side-effect of some other change.

-Joao

Thanks for the explanation. I accidentally reversed the logical order describing leadership placement above. I'll go through the non-ported commits for firefly and will port the most promising ones when spare time permits, checking whether the leak disappears (it takes about a week to see the difference for my workloads). Could heap dump structures be helpful for developers, to ring a bell for deterministic suggestions?
leaking mons on the latest dumpling
Hello,

There is a slow leak which is present in all ceph versions, I assume, but it is positively exposed only over large time spans and on large clusters. It looks like the lower a monitor is placed in the quorum hierarchy, the higher the leak is:

{"election_epoch":26,"quorum":[0,1,2,3,4],"quorum_names":["0","1","2","3","4"],"quorum_leader_name":"0","monmap":{"epoch":1,"fsid":"a2ec787e-3551-4a6f-aa24-deedbd8f8d01","modified":"2015-03-05 13:48:54.696784","created":"2015-03-05 13:48:54.696784","mons":[{"rank":0,"name":"0","addr":"10.0.1.91:6789\/0"},{"rank":1,"name":"1","addr":"10.0.1.92:6789\/0"},{"rank":2,"name":"2","addr":"10.0.1.93:6789\/0"},{"rank":3,"name":"3","addr":"10.0.1.94:6789\/0"},{"rank":4,"name":"4","addr":"10.0.1.95:6789\/0"}]}}

ceph heap stats -m 10.0.1.95:6789 | grep Actual
MALLOC: = 427626648 ( 407.8 MiB) Actual memory used (physical + swap)
ceph heap stats -m 10.0.1.94:6789 | grep Actual
MALLOC: = 289550488 ( 276.1 MiB) Actual memory used (physical + swap)
ceph heap stats -m 10.0.1.93:6789 | grep Actual
MALLOC: = 230592664 ( 219.9 MiB) Actual memory used (physical + swap)
ceph heap stats -m 10.0.1.92:6789 | grep Actual
MALLOC: = 253710488 ( 242.0 MiB) Actual memory used (physical + swap)
ceph heap stats -m 10.0.1.91:6789 | grep Actual
MALLOC: = 97112216 ( 92.6 MiB) Actual memory used (physical + swap)

For almost the same uptime, the data difference is:

rd KB 55365750505
wr KB 82719722467

The leak itself is not very critical, but it of course requires some script work to restart monitors at least once per month on a 300TB cluster to keep monitor processes from reaching 1GB of memory. Given the current status of dumpling, it would probably be possible to identify the leak source and then forward-port the fix to the newer releases, as the freshest version I am running at a large scale is the top of the dumpling branch; otherwise it would require an enormous amount of time to check fix proposals.

Thanks!
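The "script work" mentioned above could be as small as the sketch below. This is an illustration only: it assumes the tcmalloc totals-line format shown above, the five monitor addresses from this report, and a sysvinit-style restart command; all three are assumptions, not part of the report:

  #!/bin/sh
  # Illustrative: restart any mon whose actual heap usage exceeds ~1 GB.
  LIMIT=1073741824
  RANK=0
  for ADDR in 10.0.1.91 10.0.1.92 10.0.1.93 10.0.1.94 10.0.1.95; do
      # assumes the "MALLOC: = NNN (...) Actual memory used" line format above
      USED=$(ceph heap stats -m $ADDR:6789 | awk '/Actual memory used/ {print $3}')
      if [ -n "$USED" ] && [ "$USED" -gt "$LIMIT" ]; then
          service ceph restart mon.$RANK    # init invocation differs per distro
      fi
      RANK=$((RANK + 1))
  done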
Re: Preliminary RDMA vs TCP numbers
On Wed, Apr 8, 2015 at 11:17 AM, Somnath Roy somnath@sandisk.com wrote:

Hi,

Please find the preliminary performance numbers for the TCP vs RDMA (XIO) implementation (on top of SSDs) at the following link.

http://www.slideshare.net/somnathroy7568/ceph-on-rdma

The attachment didn't go through, it seems, so I had to use slideshare. Mark, if we have time, I can present it in tomorrow's performance meeting.

Thanks & Regards
Somnath

Those numbers are really impressive (for the small-block numbers at least)! What TCP settings are you using? For example, the difference could shrink at scale due to less intensive per-connection acceleration with CUBIC on a larger number of nodes, though I do not believe that was the main reason for the observed TCP catch-up on a relatively flat workload such as fio generates.
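For reference, the client-side TCP knobs being asked about can be inspected like this on Linux. This is a generic illustration, not Somnath's actual configuration:

  sysctl net.ipv4.tcp_congestion_control      # e.g. cubic
  sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem  # per-socket buffer autotuning ranges
  sysctl net.core.rmem_max net.core.wmem_max  # hard caps on socket buffers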
Multiple issues with glibc heap management
Hello,

Since a very long time ago (at least from cuttlefish), many users, including me, have experienced rare but still very disturbing client crashes (#8385, #6480, and a couple of other same-looking traces for different code pieces; I may open corresponding separate bugs if necessary). The main problem is that the issues are very hard to reproduce in a deterministic way, although they *primarily* correlate with disk workload. While fixing everything reported separately is definitely a workable approach, the issue may also be caused by a single similarly-behaving piece of code belonging to one of the shared libraries. As the issue touches more or less all existing deployments on the stable release, it may be worthy (and possible) to fix it other than by cleaning up particular issues one by one.

Thanks!
Re: [Qemu-devel] qemu drive-mirror to rbd storage : no sparse rbd image
On Sat, Oct 11, 2014 at 12:25 PM, Fam Zheng f...@redhat.com wrote:

On Sat, 10/11 10:00, Alexandre DERUMIER wrote:

> What is the source format? If the zero clusters are actually unallocated in the source image, drive-mirror will not write those clusters either. I.e. with drive-mirror sync=top, both source and target should have the same qemu-img map output.

Thanks for your reply. I had tried drive-mirror (sync=full) with:

raw file (sparse) -> rbd (no sparse)
rbd (sparse) -> rbd (no sparse)
raw file (sparse) -> qcow2 on ext4 (sparse)
rbd (sparse) -> raw on ext4 (sparse)

Also I see that I have the same problem with the target file format on xfs:

raw file (sparse) -> qcow2 on xfs (no sparse)
rbd (sparse) -> raw on xfs (no sparse)

These don't tell me much. Maybe it's better to show the actual commands and how you tell sparse from no sparse? Does qcow2 -> qcow2 work for you on xfs?

I only have this problem with drive-mirror; qemu-img convert seems to simply skip zero blocks. Or maybe this is because I'm using sync=full? What is the difference between full and top?

sync: what parts of the disk image should be copied to the destination; possibilities include "full" for all the disk, "top" for only the sectors allocated in the topmost image. (What is the topmost image?)

For sync=top, only the clusters allocated in the image itself are copied; for full, all the clusters allocated in the image itself, its backing image, its backing's backing image, and so on, are copied. The image itself, having a backing image or not, is called the topmost image.

Fam

--

Just a wild guess - Alexandre, did you try the detect-zeroes block option for the mirroring target?
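For reference, detect-zeroes is a generic qemu block option (available in recent qemu); a hedged sketch of enabling it on a drive definition follows. The image name is illustrative, and whether the mirror job honors the option on its target is exactly the open question in the guess above:

  qemu-system-x86_64 ... \
      -drive file=rbd:rbd/target,format=raw,cache=writeback,discard=unmap,detect-zeroes=unmap

With detect-zeroes=unmap (which requires discard=unmap), runs of zeroes are converted into discards instead of allocating writes; detect-zeroes=on would merely turn them into efficient zero-write operations.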
Re: Adding a delay when restarting all OSDs on a host
On Tue, Jul 22, 2014 at 5:19 PM, Wido den Hollander w...@42on.com wrote:

Hi,

Currently on Ubuntu with Upstart, when you invoke a restart like this:

$ sudo restart ceph-osd-all

it will restart all OSDs at once, which can increase the load on the system quite a bit. It's better to restart the OSDs one by one:

$ sudo restart ceph-osd id=X

But you then have to figure out all the IDs by doing a find in /var/lib/ceph/osd, and that's more manual work. I'm thinking of patching the init scripts to allow something like this:

$ sudo restart ceph-osd-all delay=180

It would then wait 180 seconds between each OSD restart, making the process even smoother. I know there are currently sysvinit, upstart and systemd scripts, so it has to be implemented in various places, but how does the general idea sound?

-- Wido den Hollander, Ceph consultant and trainer, 42on B.V. Phone: +31 (0)20 700 9902 Skype: contact42on

--

Hi, this behaviour obviously has a negative side: increased overall peering time and a larger integral value of out-of-SLA delays. I'd vote for warming up the necessary files, most likely the collections, just before restart. If there is not enough room to hold all of them at once, we can probably combine both methods to achieve a lower impact on restart, although adding a simple delay sounds much more straightforward than pulling the file cache into RAM.
Re: Adding a delay when restarting all OSDs on a host
On Tue, Jul 22, 2014 at 6:28 PM, Wido den Hollander w...@42on.com wrote:

On 07/22/2014 03:48 PM, Andrey Korolyov wrote: [...]

In the case I'm talking about there are 23 OSDs running on a single machine, and restarting all the OSDs causes a lot of peering and reading of PG logs. A warm-up mechanism might work, but that would be a lot of work. When upgrading your cluster you simply want to do this:

$ dsh -g ceph-osd sudo restart ceph-osd-all delay=180

That might take hours to complete, but if it's just an upgrade that doesn't matter. You want as minimal an impact on service as possible.

I may suggest measuring the impact with vmtouch[0]; it decreased OSD startup time greatly in my tests, but I hit the same resource exhaustion as before once the OSD marked itself up (an IOPS ceiling, primarily).

0. http://hoytech.com/vmtouch/
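A minimal sketch of the delay idea done by hand, before any init-script patch lands. It assumes upstart-managed OSDs and the default /var/lib/ceph/osd/ceph-N directory layout mentioned above; the 180-second pause mirrors the proposed delay= parameter:

  #!/bin/sh
  # Restart each OSD on this host one by one, pausing between restarts.
  for DIR in /var/lib/ceph/osd/ceph-*; do
      ID=${DIR##*-}              # extract the numeric OSD id from the dir name
      restart ceph-osd id=$ID
      sleep 180                  # let peering settle before touching the next one
  done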
Re: SMART monitoring
On Fri, Dec 27, 2013 at 9:09 PM, Andrey Korolyov and...@xdel.ru wrote:

On 12/27/2013 08:15 PM, Justin Erenkrantz wrote:

On Thu, Dec 26, 2013 at 9:17 PM, Sage Weil s...@inktank.com wrote:

> I think the question comes down to whether Ceph should take some internal action based on the information, or whether that is better handled by some external monitoring agent. For example, an external agent might collect SMART info into graphite, and every so often do some predictive analysis and mark out disks that are expected to fail soon. I'd love to see some consensus form around what this should look like...

My $.02 from the peanut gallery: at a minimum, set the HEALTH_WARN flag if there is a SMART failure on a physical drive that contains an OSD. Yes, you could build the monitoring into a separate system, but I think it'd be really useful to combine it into the cluster health assessment. -- justin

Hi,

Judging from my personal experience, SMART failures can be dangerous when they are not bad enough to completely tear down an OSD: the OSD will not flap and will not be marked down in time, but cluster performance is greatly affected. I don't think the SMART monitoring task really belongs to Ceph, because separate monitoring of predictive failure counters does that job well, and in the case of sudden errors a SMART query may not work at all, since the system may have issued a lot of bus resets and the disk may be entirely inaccessible. So I propose two strategies: do regular scattered background checks, and monitor OSD responsiveness to work around cases of performance degradation due to read/write errors.

Some necromancy for this thread... Considering a year-long experience with Hitachi 4T disks, there are a lot of failures which cannot be caught by SMART completely: speed degradation and sudden disk death. Although the second case rules itself out by kicking out the stuck OSD, it is not very easy to check which disks are about to die without thorough dmesg monitoring for bus errors and periodic speed calibration. Introducing idle-priority speed measurement for OSDs, without dramatically increasing overall wearout, may be useful enough to implement, coupled with an additional OSD perf metric like SMART's seek_time - though SMART may report a good value for it when performance has already slowed to a crawl, and such a metric would also catch things impacting performance which may not be exposed to the host OS at all, such as correctable bus errors. By the way, although 1T Seagates have a much higher failure rate, they always die with an 'appropriate' set of SMART attributes; Hitachi tends to die without warning :) Hope this will be helpful for someone.
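A rough sketch of the "scattered background check" idea as an external agent, using stock smartmontools. The attribute and threshold are illustrative, and plain SATA devices are assumed (disks behind a RAID controller such as the MegaRAID mentioned elsewhere in this list would need smartctl's -d option):

  #!/bin/sh
  # Illustrative: flag disks whose reallocated-sector raw count is nonzero.
  for DEV in /dev/sd?; do
      REALLOC=$(smartctl -A $DEV | awk '/Reallocated_Sector_Ct/ {print $10}')
      if [ -n "$REALLOC" ] && [ "$REALLOC" -gt 0 ]; then
          echo "$DEV: $REALLOC reallocated sectors"
      fi
  done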
Helper for state replication machine
Hello,

I do not know how many of you are aware of this work by Michael Hines [0], but it looks like it could be extremely useful for critical applications using qemu and, of course, Ceph at the block level. My thought was that if the qemu rbd driver could provide some kind of metadata interface to mark each atomic write, it could easily be used to check and replay machine states on the acceptor side independently. Since Ceph replication is asynchronous, there is no acceptable way to tell when it is time to replay a certain memory state on the acceptor side, even if we push all writes in a synchronous manner. I'd be happy to hear any suggestions on this, because the result would probably be widely adopted by enterprise users whose needs include state replication and who are bound to VMware right now.

Of course, I am assuming the worst case above, when the primary replica shifts during a disaster and there are at least two sites holding the primary and non-primary replica sets, with 100% distinction of the primary role (>= 0.80). Of course there are a lot of points to discuss, like 'fallback' primary affinity and so on, but I'd like to ask first about the possibility of implementing such a mechanism at the driver level.

Thanks!

0. http://wiki.qemu.org/Features/MicroCheckpointing
Re: [librbd] Add interface to get the snapshot size?
On 03/24/2014 05:30 PM, Haomai Wang wrote:

Hi all,

As we know, a snapshot is a lightweight resource in librbd and we don't have any statistics about it. But this causes problems for cloud management: we can't measure the size of a snapshot, and different snapshots occupy different amounts of space, so we have no way to estimate a user's resource usage. Maybe we can keep a counter recording space usage from the moment a volume is created. When a snapshot is created, the counter is frozen and stored as the size of the snapshot, and a new counter starting at zero is assigned to the volume. Any feedback is appreciated!

I believe there is a rough estimate available via 'rados df'. Per-image statistics would be awesome, though precise stats would require either walking the rbd object clones per volume or introducing a new counter mechanism. Dealing with discard in the filestore, it looks even more difficult to calculate the right estimate, as it is with the XFS preallocation feature.
XFS preallocation with lots of small objects
Hello,

Due to the many reports of ENOSPC on xfs-based stores, maybe it is worth introducing an option to, say, ceph-deploy that would pass an allocsize= param to the mount, effectively disabling dynamic preallocation? Of course, not every case is really worth it, because of the related performance impact. If there is a method to calculate the 'real' allocation on such volumes, it could be put into the docs as a measurement suggestion too.
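For illustration, the mount option in question pins XFS speculative preallocation to a fixed size instead of letting it scale up with the file. The device, mountpoint and 64k value below are only examples:

  mount -o noatime,allocsize=64k /dev/sdb1 /var/lib/ceph/osd/ceph-0

A possible integration point, assuming the existing config machinery rather than new code: the 'osd mount options xfs' setting in ceph.conf already carries arbitrary mount options for XFS-backed OSDs, so a deployment tool would only need to populate it.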
Re: xfs Warnings in syslog
Just my two cents: XFS was quite unstable with Ceph, especially along with heavy CPU usage, up to 3.7 (primarily soft lockups). I have run 3.7 on the production system for eight months since upgrading, and it performs just perfectly.

On Tue, Oct 22, 2013 at 1:29 PM, Jeff Liu jeff@oracle.com wrote:

Hello,

It's better to add the XFS mailing list to the CC list. :) I think this issue has been fixed by the upstream commit:

commit ff9a28f6c25d18a635abcab1f49db68108203dfb
From: Jan Kara j...@suse.cz
Date: Thu, 14 Mar 2013 14:30:54 +0100
Subject: [PATCH 1/1] xfs: Fix WARN_ON(delalloc) in xfs_vm_releasepage()

Thanks, -Jeff

On 10/22/2013 07:46 PM, Niklas Goerke wrote:

Hi,

My syslog and dmesg are being filled with the warnings attached. Looking at today's syslog I got up to 1101 of these warnings between 10:50 and 11:13 (and only in that window; otherwise the log was clean). I found them on all four of my OSD hosts, all at about the same time. I'm running kernel 3.2.0-4-amd64 on Debian 7.0. Ceph is at version 0.67.4. I have 15 OSDs per OSD host. Ceph does not really seem to care about this, so I'm not sure what it is all about... Still, they are warnings in syslog, and I hope you guys can tell me what went wrong here and what I can do about it?

Thank you, Niklas

Oct 22 11:11:19 cs-bigfoot06 kernel: [9744648.388018] [ cut here ]
Oct 22 11:11:19 cs-bigfoot06 kernel: [9744648.388030] WARNING: at /build/linux-s5x2oE/linux-3.2.46/fs/xfs/xfs_aops.c:1091 xfs_vm_releasepage+0x76/0x8e [xfs]()
Oct 22 11:11:19 cs-bigfoot06 kernel: [9744648.388034] Hardware name: X9DR3-F
Oct 22 11:11:19 cs-bigfoot06 kernel: [9744648.388036] Modules linked in: xfs autofs4 nfsd nfs nfs_acl auth_rpcgss fscache lockd sunrpc ext3 jbd loop acpi_cpufreq mperf coretemp crc32c_intel ghash_clmulni_intel snd_pcm aesni_intel snd_page_alloc aes_x86_64 snd_timer aes_generic snd cryptd soundcore pcspkr sb_edac joydev evdev edac_core iTCO_wdt i2c_i801 iTCO_vendor_support i2c_core ioatdma processor thermal_sys container button ext4 crc16 jbd2 mbcache usbhid hid ses enclosure sg sd_mod crc_t10dif megaraid_sas ehci_hcd usbcore isci libsas usb_common libata ixgbe mdio scsi_transport_sas scsi_mod igb dca [last unloaded: scsi_wait_scan]
Oct 22 11:11:19 cs-bigfoot06 kernel: [9744648.388093] Pid: 3459605, comm: ceph-osd Tainted: G W 3.2.0-4-amd64 #1 Debian 3.2.46-1
Oct 22 11:11:19 cs-bigfoot06 kernel: [9744648.388096] Call Trace:
Oct 22 11:11:19 cs-bigfoot06 kernel: [9744648.388102] [81046b75] ? warn_slowpath_common+0x78/0x8c
Oct 22 11:11:19 cs-bigfoot06 kernel: [9744648.388115] [a048b98c] ? xfs_vm_releasepage+0x76/0x8e [xfs]
Oct 22 11:11:19 cs-bigfoot06 kernel: [9744648.388122] [810bedc5] ? invalidate_inode_page+0x5e/0x80
Oct 22 11:11:19 cs-bigfoot06 kernel: [9744648.388129] [810bee5d] ? invalidate_mapping_pages+0x76/0x102
Oct 22 11:11:19 cs-bigfoot06 kernel: [9744648.388135] [810b7b83] ? sys_fadvise64_64+0x19f/0x1e2
Oct 22 11:11:19 cs-bigfoot06 kernel: [9744648.388140] [81353b52] ? system_call_fastpath+0x16/0x1b
Oct 22 11:11:19 cs-bigfoot06 kernel: [9744648.388144] ---[ end trace e9640ed6f82f066d ]---
Large time shift causes OSD to hit suicide timeout and ABRT
Hello,

Not sure if this matches any real-world problem:

step time server 192.168.10.125 offset 30763065.968946 sec

#0 0x7f2d0294d405 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x7f2d02950b5b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x7f2d0324b875 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3 0x7f2d03249996 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4 0x7f2d032499c3 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5 0x7f2d03249bee in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6 0x0090d2fa in ceph::__ceph_assert_fail (assertion=0xa38ab1 "0 == \"hit suicide timeout\"", file=<optimized out>, line=79, func=0xa38c60 "bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)") at common/assert.cc:77
#7 0x0087914b in ceph::HeartbeatMap::_check (this=this@entry=0x35b40e0, h=h@entry=0x36d1050, who=who@entry=0xa38aef "reset_timeout", now=now@entry=1380797379) at common/HeartbeatMap.cc:79
#8 0x0087940e in ceph::HeartbeatMap::reset_timeout (this=0x35b40e0, h=0x36d1050, grace=15, suicide_grace=150) at common/HeartbeatMap.cc:89
#9 0x0070ada7 in OSD::process_peering_events (this=0x375, pgs=..., handle=...) at osd/OSD.cc:6808
#10 0x0074c2e4 in OSD::PeeringWQ::_process (this=<optimized out>, pgs=..., handle=...) at osd/OSD.h:869
#11 0x00903dca in ThreadPool::worker (this=0x3750478, wt=0x4ef6fa80) at common/WorkQueue.cc:119
#12 0x00905070 in ThreadPool::WorkThread::entry (this=<optimized out>) at common/WorkQueue.h:316
#13 0x7f2d046c2e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#14 0x7f2d02a093dd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#15 0x in ?? ()
Re: Ceph users meetup
If anyone is attending CloudConf Europe, it would be nice to meet in the real world too.

On Wed, Sep 25, 2013 at 2:29 PM, Wido den Hollander w...@42on.com wrote:

On 09/25/2013 10:53 AM, Loic Dachary wrote:

Hi Eric & Patrick,

Yesterday morning Eric suggested that organizing a ceph user meetup would be great and offered his help to make it happen. Although I'd be very happy to attend a France-based meetup, it may make sense to also organize a Europe-wide meetup. For instance, it would be great to have a Ceph room during FOSDEM (http://fosdem.org/, February 2014).

I'm in! NL, BE, DE or FR doesn't matter to me.

Cheers -- Wido den Hollander, 42on B.V. Phone: +31 (0)20 700 9902 Skype: contact42on
Hiding auth key string for the qemu process
Hello,

Since it has been a long time since cephx was enabled by default, and we may assume that everyone is using it, it seems worthwhile to introduce bits of code hiding the key from the cmdline. The first applicable place for such an improvement is most likely OpenStack environments, with their sparse security and their use of the admin key as the default one.
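To illustrate the exposure and one mitigation that already exists: a key passed as key=... in the rbd drive string is visible to every local user via ps, while pointing librbd at on-disk credentials keeps it out of the process arguments. The image name, user and paths below are illustrative:

  # key visible to anyone who can run ps on the hypervisor:
  qemu ... -drive file=rbd:rbd/vm1:id=admin:key=AQB...==

  # versus referencing credentials stored on disk:
  qemu ... -drive file=rbd:rbd/vm1:id=admin:conf=/etc/ceph/ceph.conf
  # where the [client.admin] section of ceph.conf carries: keyring = /etc/ceph/keyring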
Re: Deep-Scrub and High Read Latency with QEMU/RBD
You may want to reduce the number of scrubbing PGs per OSD to 1 using a config option and check the results.

On Fri, Aug 30, 2013 at 8:03 PM, Mike Dawson mike.daw...@cloudapt.com wrote:

We've been struggling with an issue of spikes of high i/o latency with qemu/rbd guests. As we've been chasing this bug, we've greatly improved the methods we use to monitor our infrastructure. It appears that our RBD performance chokes in two situations:

- Deep-Scrub
- Backfill/recovery

In this email, I want to focus on deep-scrub. Graphing '% Util' from 'iostat -x' on my hosts with OSDs, I can see Deep-Scrub take my disks from around 10% utilized to complete saturation during a scrub. RBD writeback cache appears to cover the issue nicely, but occasionally suffers drops in performance (presumably when it flushes). But reads appear to suffer greatly, with multiple seconds of 0B/s of reads accomplished (see log fragment below).

If I make the assumption that deep-scrub isn't intended to create massive spindle contention, this appears to be a problem. What should happen here? Looking at the settings around deep-scrub, I don't see an obvious way to say "don't saturate my drives". Are there any settings in Ceph or otherwise (readahead?) that might lower the burden of deep-scrub? If not, perhaps reads could be remapped to avoid waiting on saturated disks during scrub. Any ideas?

2013-08-30 15:47:20.166149 mon.0 [INF] pgmap v9853931: 20672 pgs: 20665 active+clean, 7 active+clean+scrubbing+deep; 38136 GB data, 111 TB used, 64556 GB / 174 TB avail; 0B/s rd, 5058KB/s wr, 217op/s
2013-08-30 15:47:21.945948 mon.0 [INF] pgmap v9853932: 20672 pgs: 20665 active+clean, 7 active+clean+scrubbing+deep; 38136 GB data, 111 TB used, 64556 GB / 174 TB avail; 0B/s rd, 5553KB/s wr, 229op/s
2013-08-30 15:47:23.205843 mon.0 [INF] pgmap v9853933: 20672 pgs: 20664 active+clean, 8 active+clean+scrubbing+deep; 38136 GB data, 111 TB used, 64556 GB / 174 TB avail; 0B/s rd, 6580KB/s wr, 246op/s
2013-08-30 15:47:24.843308 mon.0 [INF] pgmap v9853934: 20672 pgs: 20664 active+clean, 8 active+clean+scrubbing+deep; 38136 GB data, 111 TB used, 64556 GB / 174 TB avail; 0B/s rd, 3795KB/s wr, 224op/s
2013-08-30 15:47:25.862722 mon.0 [INF] pgmap v9853935: 20672 pgs: 20664 active+clean, 8 active+clean+scrubbing+deep; 38136 GB data, 111 TB used, 64556 GB / 174 TB avail; 1414B/s rd, 3799KB/s wr, 181op/s
2013-08-30 15:47:26.887516 mon.0 [INF] pgmap v9853936: 20672 pgs: 20664 active+clean, 8 active+clean+scrubbing+deep; 38136 GB data, 111 TB used, 64556 GB / 174 TB avail; 1541B/s rd, 8138KB/s wr, 160op/s
2013-08-30 15:47:27.933629 mon.0 [INF] pgmap v9853937: 20672 pgs: 20664 active+clean, 8 active+clean+scrubbing+deep; 38136 GB data, 111 TB used, 64556 GB / 174 TB avail; 0B/s rd, 14458KB/s wr, 304op/s
2013-08-30 15:47:29.127847 mon.0 [INF] pgmap v9853938: 20672 pgs: 20664 active+clean, 8 active+clean+scrubbing+deep; 38136 GB data, 111 TB used, 64556 GB / 174 TB avail; 0B/s rd, 15300KB/s wr, 345op/s
2013-08-30 15:47:30.344837 mon.0 [INF] pgmap v9853939: 20672 pgs: 20664 active+clean, 8 active+clean+scrubbing+deep; 38136 GB data, 111 TB used, 64556 GB / 174 TB avail; 0B/s rd, 13128KB/s wr, 218op/s
2013-08-30 15:47:31.380089 mon.0 [INF] pgmap v9853940: 20672 pgs: 20664 active+clean, 8 active+clean+scrubbing+deep; 38136 GB data, 111 TB used, 64556 GB / 174 TB avail; 0B/s rd, 13299KB/s wr, 241op/s
2013-08-30 15:47:32.388303 mon.0 [INF] pgmap v9853941: 20672 pgs: 20664 active+clean, 8 active+clean+scrubbing+deep; 38136 GB data, 111 TB used, 64556 GB / 174 TB avail; 4951B/s rd, 8147KB/s wr, 192op/s
2013-08-30 15:47:33.858382 mon.0 [INF] pgmap v9853942: 20672 pgs: 20664 active+clean, 8 active+clean+scrubbing+deep; 38136 GB data, 111 TB used, 64556 GB / 174 TB avail; 7029B/s rd, 3254KB/s wr, 190op/s
2013-08-30 15:47:35.279691 mon.0 [INF] pgmap v9853943: 20672 pgs: 20664 active+clean, 8 active+clean+scrubbing+deep; 38136 GB data, 111 TB used, 64555 GB / 174 TB avail; 1651B/s rd, 2476KB/s wr, 207op/s
2013-08-30 15:47:36.309078 mon.0 [INF] pgmap v9853944: 20672 pgs: 20664 active+clean, 8 active+clean+scrubbing+deep; 38136 GB data, 111 TB used, 64555 GB / 174 TB avail; 0B/s rd, 3788KB/s wr, 239op/s
2013-08-30 15:47:38.120343 mon.0 [INF] pgmap v9853945: 20672 pgs: 20664 active+clean, 8 active+clean+scrubbing+deep; 38136 GB data, 111 TB used, 64555 GB / 174 TB avail; 0B/s rd, 4671KB/s wr, 239op/s
2013-08-30 15:47:39.546980 mon.0 [INF] pgmap v9853946: 20672 pgs: 20664 active+clean, 8 active+clean+scrubbing+deep; 38136 GB data, 111 TB used, 64555 GB / 174 TB avail; 0B/s rd, 13487KB/s wr, 444op/s
2013-08-30 15:47:40.561203 mon.0 [INF] pgmap v9853947: 20672 pgs: 20664 active+clean, 8 active+clean+scrubbing+deep; 38136 GB data, 111 TB used, 64555 GB / 174 TB avail; 0B/s rd, 15265KB/s wr, 489op/s
2013-08-30 15:47:41.794355 mon.0 [INF] pgmap v9853948: 20672 pgs: 20664 active+clean, 8 active+clean+scrubbing+deep; 38136 GB data, 111
Re: Deep-Scrub and High Read Latency with QEMU/RBD
On Fri, Aug 30, 2013 at 9:44 PM, Mike Dawson mike.daw...@cloudapt.com wrote:

Andrey,

I use all the defaults:

# ceph --admin-daemon /var/run/ceph/ceph-osd.1.asok config show | grep scrub
  osd_scrub_thread_timeout: 60,
  osd_scrub_finalize_thread_timeout: 600,
  osd_max_scrubs: 1,

This one. I may suggest increasing max_interval and writing some kind of script doing per-PG scrubs at low intensity, so you'll have at most one scrubbing PG at any time and can wait a while before scrubbing the next; that way they will not all start scrubbing at once when max_interval expires (a sketch of such a script follows below). I discussed some throttling mechanisms for scrubbing some months ago here or on ceph-devel, but there is still no such implementation (it is ultimately a low-priority task since it can be handled by something as simple as the proposal above).

  osd_scrub_load_threshold: 0.5,
  osd_scrub_min_interval: 86400,
  osd_scrub_max_interval: 604800,
  osd_scrub_chunk_min: 5,
  osd_scrub_chunk_max: 25,
  osd_deep_scrub_interval: 604800,
  osd_deep_scrub_stride: 524288,

Which value are you referring to? Does anyone know exactly how osd scrub load threshold works? The manual states "The maximum CPU load. Ceph will not scrub when the CPU load is higher than this number. Default is 50%." So on a system with multiple processors and cores... what happens? Is the threshold .5 load (meaning half a core), or 50% of max load, meaning anything less than 8 if you have 16 cores?

Thanks, Mike Dawson

On 8/30/2013 1:34 PM, Andrey Korolyov wrote: [...]
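A minimal sketch of the low-intensity per-PG scrub driver suggested above. It assumes `ceph pg dump pgs_brief` output where the PG id is the first column of each row; the 300-second pacing and the run-over-everything policy are illustrative:

  #!/bin/sh
  # Illustrative: deep-scrub one PG at a time, pausing between them so that
  # at most ~one PG is scrubbing at any moment.
  ceph pg dump pgs_brief 2>/dev/null |
  awk '$1 ~ /^[0-9]+\.[0-9a-f]+$/ {print $1}' |
  while read PGID; do
      ceph pg deep-scrub $PGID
      sleep 300
  done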
Re: libvirt: Removing RBD volumes with snapshots, auto purge or not?
On Tue, Aug 20, 2013 at 7:36 PM, Wido den Hollander w...@42on.com wrote:

Hi,

The current [0] libvirt storage pool code simply calls rbd_remove and nothing else. As far as I know, rbd_remove will fail if the image still has snapshots; you have to remove those snapshots before you can remove the image. The problem is that libvirt's storage pools do not support listing snapshots, so we can't integrate that.

Libvirt however has a flag you can pass down to say you want the device to be zeroed. The normal procedure is that the device is filled with zeros before actually removing it. I was thinking about abusing this flag to use it as a snap purge for RBD. So a regular volume removal would call only rbd_remove, but when the flag VIR_STORAGE_VOL_DELETE_ZEROED is passed, it would purge all snapshots prior to calling rbd_remove.

Another way would be to always purge snapshots, but I'm afraid that could make somebody very unhappy at some point. Currently virsh doesn't support flags, but that could be fixed in a different patch.

Does my idea sound sane?

[0]: http://libvirt.org/git/?p=libvirt.git;a=blob;f=src/storage/storage_backend_rbd.c;h=e3340f63f412c22d025f615beb7cfed25f00107b;hb=master#l407

-- Wido den Hollander, 42on B.V. Phone: +31 (0)20 700 9902 Skype: contact42on

Hi Wido,

You mentioned not so long ago the same idea I had about a year and a half ago: placing memory dumps along with the regular snapshot in Ceph using libvirt mechanisms. That sounds pretty nice, since we'd have something other than qcow2 with the same snapshot functionality, but your current proposal does not extend to this. Placing a custom side hook seems much more extensible than hiding snap purge behind a specific flag.
Re: still recovery issues with cuttlefish
Created #5844.

On Thu, Aug 1, 2013 at 10:38 PM, Samuel Just sam.j...@inktank.com wrote:

Is there a bug open for this? I suspect we don't sufficiently throttle the snapshot removal work.

-Sam

On Thu, Aug 1, 2013 at 7:50 AM, Andrey Korolyov and...@xdel.ru wrote: [...]
Re: still recovery issues with cuttlefish
Second this. Also, regarding the long-lasting snapshot problem and related performance issues, I may say that cuttlefish improved things greatly, but creation/deletion of a large snapshot (hundreds of gigabytes of committed data) can still bring the cluster down for minutes, despite the use of every possible optimization.

On Thu, Aug 1, 2013 at 12:22 PM, Stefan Priebe - Profihost AG s.pri...@profihost.ag wrote:

Hi,

I still have recovery issues with cuttlefish. After the OSD comes back, it seems to hang for around 2-4 minutes and then recovery seems to start (pgs in recovery_wait start to decrement). This is with ceph 0.61.7. I get a lot of slow request messages and hanging VMs.

What I noticed today is that if I leave the OSD off long enough that ceph starts to backfill, the recovery and re-backfilling go absolutely smoothly, without any issues and no slow request messages at all. Does anybody have an idea why?

Greets, Stefan
Re: Read ahead affect Ceph read performance much
Wow, very glad to hear that. I tried with the regular FS tunable and there was almost no effect on the regular test, so I thought that reads could not be improved at all in this direction.

On Mon, Jul 29, 2013 at 2:24 PM, Li Wang liw...@ubuntukylin.com wrote:

We performed an Iozone read test on a 32-node HPC server. Regarding the hardware of each node: the CPU is very powerful, as is the network, with a bandwidth of 1.5 GB/s; 64GB memory; the IO is relatively slow, with a throughput measured locally by 'dd' of around 70MB/s. We configured a Ceph cluster with 24 OSDs on 24 nodes, one mds, and one to four clients, one client per node. The performance is as follows.

Iozone sequential read throughput (MB/s):

  Number of clients    1          2          4
  Default readahead    180.0954   324.4836   591.5851
  Readahead: 256MB     645.3347   1022.998   1267.631

The complete iozone parameters for one client are: iozone -t 1 -+m /tmp/iozone.nodelist.50305030 -s 64G -r 4M -i 0 -+n -w -c -e -b /tmp/iozone.nodelist.50305030.output; on each client node, only one thread is started. For two clients it is: iozone -t 2 -+m /tmp/iozone.nodelist.50305030 -s 32G -r 4M -i 0 -+n -w -c -e -b /tmp/iozone.nodelist.50305030.output.

As the data shows, a larger readahead window can result in a 300% speedup! Besides, since the backend of Ceph is not a traditional hard disk, it is beneficial to capture stride-read prefetching. To prove this, we tested stride reads with the following program. As we know, the generic readahead algorithm of the Linux kernel will not capture stride-read access patterns, so we use fadvise() to manually force prefetching. The record size is 4MB. The result is even more surprising.

Stride read throughput (MB/s):

  Number of records prefetched   0       1        4        16       64       128
  Throughput                     42.82   100.74   217.41   497.73   854.48   950.18

As the data shows, with a readahead size of 128*4MB, the speedup over no readahead can be up to 950/42, roughly 2000%! The core logic of the test program is below:

stride = 17
recordsize = 4MB
for (;;) {
    for (i = 0; i < count; ++i) {
        long long start = pos + (i + 1) * stride * recordsize;
        printf("PRE READ %lld %lld\n", start, start + block);
        posix_fadvise(fd, start, block, POSIX_FADV_WILLNEED);
    }
    len = read(fd, buf, block);
    total += len;
    printf("READ %lld %lld\n", pos, (pos + len));
    pos += len;
    lseek(fd, (stride - 1) * block, SEEK_CUR);
    pos += (stride - 1) * block;
}

Given the above results and some more, we plan to submit a blueprint to discuss the prefetching optimization of Ceph.

Cheers, Li Wang
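For context, the "regular FS tunable" and the 256MB readahead window above would typically map, on the kernel CephFS client, to the rasize mount option (readahead size in bytes). A hedged illustration, assuming a kernel client recent enough to support rasize; the monitor address, mountpoint and credentials are placeholders:

  mount -t ceph mon1:6789:/ /mnt/ceph \
      -o name=admin,secretfile=/etc/ceph/secret,rasize=268435456   # 256 MB readahead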
OSD crash upon pool creation
Hello,

Using db2bb270e93ed44f9252d65d1d4c9b36875d0ea5, I observed some disaster-like behavior after a ``pool create'' command: every osd daemon in the cluster died at least once (some crashed several times in a row after being brought back). Please take a look at the backtraces (almost identical) below. Issue #5637 has been created in the tracker.

Thanks!

http://xdel.ru/downloads/poolcreate.txt.gz
http://xdel.ru/downloads/poolcreate2.txt.gz
Re: poor write performance
On Thu, Apr 18, 2013 at 5:43 PM, Mark Nelson mark.nel...@inktank.com wrote:

On 04/18/2013 06:46 AM, James Harper wrote:

I'm doing some basic testing, so I'm not really fussed about poor performance, but my write performance appears to be so bad I think I'm doing something wrong. Using dd to test gives me kbytes/second write performance for 4kb block sizes, while read performance is acceptable (for testing at least). For dd I'm using iflag=direct for read and oflag=direct for write testing.

My setup, approximately, is:

Two OSDs:
. 1 x 7200RPM SATA disk each
. 2 x gigabit cluster network interfaces each, in a bonded configuration, directly attached (osd to osd, no switch)
. 1 x gigabit public network
. journal on another spindle

Three MONs:
. 1 each on the OSDs
. 1 on another server, which is also the one used for testing performance

I'm using Debian packages from ceph, version 0.56.4. For comparison, my existing production storage is 2 servers running DRBD with iSCSI to the initiators, which run Xen on top of (C)LVM volumes on top of the iSCSI. Performance is not spectacular but acceptable. The servers in question are the same specs as the servers I'm testing on. Where should I start looking for performance problems? I've tried running some of the benchmark stuff in the documentation but I haven't gotten very far...

Hi James!

Sorry to hear about the performance trouble! Is it just sequential 4KB direct IO writes that are giving you trouble? If you are using the kernel version of RBD, we don't have any kind of cache implemented there, and since you are bypassing the pagecache on the client, those writes are being sent to the different OSDs in 4KB chunks over the network. RBD stores data in blocks that are represented by 4MB objects on one of the OSDs, so without cache a lot of sequential 4KB writes will hit 1 OSD repeatedly and then move on to the next one. Hopefully those writes would get aggregated at the OSD level, but clearly that's not really happening here given your performance. Here are a couple of thoughts:

1) If you are working with VMs, using the QEMU/KVM interface with virtio drivers and RBD cache enabled will give you a huge jump in small sequential write performance relative to what you are seeing now.

2) You may want to try upgrading to 0.60. We made a change to how the pg_log works that causes fewer disk seeks during small IO, especially with XFS.

Can you point to the related commits, if possible?

3) If you are still having trouble, testing your network, disk speeds, and using rados bench to test the object store may all be helpful.

Thanks

James

Good luck!
Accidental image deletion
Hello,

Is there an existing or planned way to protect an image from such a thing, other than a protected snapshot? While ``rbd snap protect'' is good enough for small or inactive images, large ones may incur significant space or I/O overhead while the 'locking' snapshot is present, so it would be nice to have the same functionality via a flag on the ``rbd lock'' command.

Thanks!
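For reference, the existing lock mechanism that this proposal would extend looks like the following (image and lock id are illustrative). Note that today these locks are advisory, which is exactly the gap: by themselves they do not stop a removal.

  rbd lock add rbd/huge-image deletion-guard
  rbd lock list rbd/huge-image
  # rbd rm rbd/huge-image    # an advisory lock alone will not block this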
Re: Ceph availability test recovering question
Hello,

I'm experiencing the same long-standing problem: during recovery ops, some percentage of read I/O remains in flight for seconds, rendering the upper-level filesystem in the qemu client very slow and almost unusable. Different striping has almost no effect on the visible delays, and the reads may not be intensive at all, yet they are still very slow. Here are some fio results for randread with small blocks, so it is not affected by readahead the way a linear test would be.

Intensive reads during recovery:

lat (msec) : 2=0.01%, 4=0.08%, 10=1.87%, 20=4.17%, 50=8.34%
lat (msec) : 100=13.93%, 250=2.77%, 500=1.19%, 750=25.13%, 1000=0.41%
lat (msec) : 2000=15.45%, >=2000=26.66%

Same on a healthy cluster:

lat (msec) : 20=0.33%, 50=9.17%, 100=23.35%, 250=25.47%, 750=6.53%
lat (msec) : 1000=0.42%, 2000=34.17%, >=2000=0.56%

On Sun, Mar 17, 2013 at 8:18 AM, kelvin_hu...@wiwynn.com wrote:

Hi, all

I have some problems after an availability test.

Setup:
Linux kernel: 3.2.0
OS: Ubuntu 12.04
Storage server: 11 HDD (each storage server has 11 osd, 7200 rpm, 1T) + 10GbE NIC
RAID card: LSI MegaRAID SAS 9260-4i
For every HDD: RAID0, Write Policy: Write Back with BBU, Read Policy: ReadAhead, IO Policy: Direct
Storage server number: 2
Ceph version: 0.48.2
Replicas: 2
Monitor number: 3

We have two storage servers as a cluster, and a ceph client that creates a 1T RBD image for testing; the client also has a 10GbE NIC, Linux kernel 3.2.0, Ubuntu 12.04. We use FIO to produce the workload.

fio commands:

[Sequential Read]
fio --iodepth=32 --numjobs=1 --runtime=120 --bs=65536 --rw=read --ioengine=libaio --group_reporting --direct=1 --eta=always --ramp_time=10 --thinktime=10

[Sequential Write]
fio --iodepth=32 --numjobs=1 --runtime=120 --bs=65536 --rw=write --ioengine=libaio --group_reporting --direct=1 --eta=always --ramp_time=10 --thinktime=10

Now I want to observe the ceph state when one storage server crashes, so I turn off one storage server's networking. We expected that data write and read operations could resume quickly, or not even be suspended, during ceph recovery, but the experimental results show that write and read operations pause for about 20~30 seconds during recovery.

My questions are:

1. Is the I/O pause normal while ceph is recovering?
2. Is the I/O pause time unavoidable during recovery?
3. How can the I/O pause time be reduced?

Thanks!!
Re: maintanance on osd host
On Tue, Feb 26, 2013 at 6:56 PM, Stefan Priebe - Profihost AG s.pri...@profihost.ag wrote:

Hi list,

How can I do short maintenance, like a kernel upgrade, on an osd host? Right now ceph starts to backfill immediately if I say:

ceph osd out 41
...

Without the ceph osd out command, all clients hang for the time during which ceph does not know that the host was rebooted. I tried ceph osd set nodown and ceph osd set noout, but this doesn't make any difference.

Hi Stefan,

In my practice, nodown will freeze all I/O for sure until the OSD returns; killing the osd processes and setting ``mon osd down out interval'' large enough will do the trick - you'll get only two small freezes from the peering process, at the start and at the end. Also, it is very strange that your clients hang for a long time - I have set non-optimal values on purpose and was not able to observe a re-peering process longer than a minute.
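A sketch of that sequence as concrete commands. The 1800-second value is illustrative (it must outlast the reboot), and the exact injectargs invocation varies across releases, so treat this as an outline rather than a recipe:

  # widen the grace period before down OSDs are marked out (per monitor)
  ceph tell mon.a injectargs '--mon-osd-down-out-interval 1800'
  # stop the OSDs cleanly so peers mark them down immediately instead of timing out
  service ceph stop osd
  reboot
  # after the OSDs have rejoined and peered, restore the default
  ceph tell mon.a injectargs '--mon-osd-down-out-interval 300'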
Re: rbd export speed limit
On Wed, Feb 13, 2013 at 12:22 AM, Stefan Priebe s.pri...@profihost.ag wrote:

Hi,

Is there a speed limit option for rbd export? Right now I'm able to produce several SLOW requests for IMPORTANT valid requests while just exporting a snapshot which is not really important. rbd export runs at 2400MB/s and each OSD at 250MB/s, so it seems to block valid normal read/write operations.

Greets, Stefan

--

I can confirm this in one specific case: when 0.56.2 and 0.56.3 coexist for a long time, nodes running the newer version can produce such warnings at the beginning of exporting huge snapshots, though not during the entire export. And there is real impact on clients - for example, I can see messages from the watchdog in the KVM guests. For now, I will do input throttling on the export as a temporary workaround.
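One way to do such throttling externally, assuming an rbd build that accepts '-' as the export destination: pipe the export through a rate limiter. The 100 MB/s cap and paths are illustrative:

  rbd export rbd/image@snap - | pv -L 100m > /backup/image.raw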
Re: Hit suicide timeout after adding new osd
On Thu, Jan 24, 2013 at 10:01 PM, Sage Weil s...@inktank.com wrote:

On Thu, 24 Jan 2013, Andrey Korolyov wrote:

On Thu, Jan 24, 2013 at 8:39 AM, Sage Weil s...@inktank.com wrote:

On Thu, 24 Jan 2013, Andrey Korolyov wrote:

On Thu, Jan 24, 2013 at 12:59 AM, Jens Kristian Søgaard j...@mermaidconsulting.dk wrote:

Hi Sage,

> I think the problem now is just that 'osd target transaction size' is

I set it to 50, and that seems to have solved all my problems. After a day or so my cluster got to a HEALTH_OK state again. It has been running for a few days now without any crashes!

Hmm, one of the OSDs crashed again, sadly. It logs:

-2 2013-01-23 18:01:23.563624 7f67524da700 1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7f673affd700' had timed out after 60
-1 2013-01-23 18:01:23.563657 7f67524da700 1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7f673affd700' had suicide timed out after 180
0 2013-01-23 18:01:24.257996 7f67524da700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)' thread 7f67524da700 time 2013-01-23 18:01:23.563677 common/HeartbeatMap.cc: 78: FAILED assert(0 == "hit suicide timeout")

With this stack trace:

ceph version 0.56.1-26-g3bd8f6b (3bd8f6b7235eb14cab778e3c6dcdc636aff4f539)
1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x2eb) [0x846ecb]
2: (ceph::HeartbeatMap::is_healthy()+0x8e) [0x8476ae]
3: (ceph::HeartbeatMap::check_touch_file()+0x28) [0x8478d8]
4: (CephContextServiceThread::entry()+0x55) [0x8e0f45]
5: /lib64/libpthread.so.0() [0x3cbc807d14]
6: (clone()+0x6d) [0x3cbc0f167d]

I have saved the core file, if there's anything in there you need? Or do you think I just need to set the target transaction size even lower than 50?

I was able to catch this too on rejoining a very busy cluster, and it seems I need to lower this value, at least at start time. Also, c5fe0965572c074a2a33660719ce3222d18c1464 has increased the overall time before a restarted or new osd joins the cluster; with 2M objects / 3T of replicated data, a restart of the cluster took almost an hour before it actually began to work. The worst thing is that a single osd, if restarted, will be marked up after a couple of minutes, then after almost half an hour (eating 100 percent of one cpu) marked down, and then the cluster will start to redistribute data after the 300s timeout, with the osd still doing something.

Okay, something is very wrong. Can you reproduce this with a log? Or even a partial log while it is spinning? You can adjust the log level on a running process with:

ceph --admin-daemon /var/run/ceph-osd.NN.asok config set debug_osd 20
ceph --admin-daemon /var/run/ceph-osd.NN.asok config set debug_ms 1

We haven't been able to reproduce this, so I'm very much interested in any light you can shine here.

Unfortunately the cluster finally hit the ``suicide timeout'' on every osd, so there were no logs, only some backtraces [1]. Yesterday, after an osd was not able to join the cluster within an hour, I decided to wait until the data was remapped, then tried to restart the cluster, leaving it overnight; by morning all osd processes were dead, with the same backtraces. Before that, after a silly node crash (related to deadlocks in kernel kvm code), some pgs remained stuck in the peering state without any blocker in the json output, so I decided to restart the osd holding the primary copy, because that had helped before. So the most interesting part is missing, but I'll reformat the cluster soon and will try to catch this again after filling in some data.

[1].
http://xdel.ru/downloads/ceph-log/osd-heartbeat/

Thanks, I believe I see the problem. The peering workqueue is way behind, and it is trying to do it all in one lump, timing out the work queue. The workaround is to increase the timeout. We'll put together a proper fix. sage

Hi Sage, single OSDs are still not able to join the cluster after a restart: the osd process eats one core and reads the disk in long continuous stretches, about hundreds of seconds, then keeps eating 100% of a core, then repeats. On a relatively new cluster it is not reproducible even with almost the same amount of committed data; only a week or two of writes, snapshot creation, etc. exposes it. Please see the log below:

2013-02-17 12:08:17.503992 7fbe8795c780 0 ceph version 0.56.3-2-g290a352 (290a352c3f9e241deac562e980ac8c6a74033ba6), process ceph-osd, pid 29283
starting osd.26 at :/0 osd_data /var/lib/ceph/osd/26 /var/lib/ceph/osd/journal/journal26
2013-02-17 12:08:17.508193 7fbe8795c780 1 accepter.accepter.bind my_inst.addr is 0.0.0.0:6803/29283 need_addr=1
2013-02-17 12:08:17.508222 7fbe8795c780 1 accepter.accepter.bind my_inst.addr is 0.0.0.0:6804/29283 need_addr=1
2013-02-17 12:08:17.508244 7fbe8795c780 1 accepter.accepter.bind my_inst.addr is 0.0.0.0:6805
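For reference, the timeout workaround Sage mentions would be a ceph.conf change. A minimal sketch, assuming the FileStore op thread pool timeouts are the ones being hit (the 60s/180s thresholds in the log above match those options' defaults); the 600s value is an illustrative guess, not a recommendation:

    [osd]
        ; warn threshold stays at the default, suicide threshold is raised
        filestore op thread timeout = 60
        filestore op thread suicide timeout = 600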
Re: [0.48.3] OSD memory leak when scrubbing
Can anyone who hit this bug please confirm that your system contains libc 2.15+?

On Tue, Feb 5, 2013 at 1:27 AM, Sébastien Han han.sebast...@gmail.com wrote: oh nice, the pattern also matches paths :D, didn't know that. Thanks Greg -- Regards, Sébastien Han.

On Mon, Feb 4, 2013 at 10:22 PM, Gregory Farnum g...@inktank.com wrote: Set your /proc/sys/kernel/core_pattern file. :) http://linux.die.net/man/5/core -Greg

On Mon, Feb 4, 2013 at 1:08 PM, Sébastien Han han.sebast...@gmail.com wrote: ok, I finally managed to get something on my test cluster; unfortunately, the dump goes to /. Any idea how to change the destination path? My production / won't be big enough... -- Regards, Sébastien Han.

On Mon, Feb 4, 2013 at 10:03 PM, Dan Mick dan.m...@inktank.com wrote: ...and/or do you have the corepath set interestingly, or one of the core-trapping mechanisms turned on?

On 02/04/2013 11:29 AM, Sage Weil wrote: On Mon, 4 Feb 2013, Sébastien Han wrote: Hum, just tried several times on my test cluster and I can't get any core dump. Does Ceph commit suicide or something? Is it expected behavior? SIGSEGV should trigger the usual path that dumps a stack trace and then dumps core. Was your ulimit -c set before the daemon was started? sage

On Sun, Feb 3, 2013 at 10:03 PM, Sébastien Han han.sebast...@gmail.com wrote: Hi Loïc, Thanks for bringing our discussion to the ML. I'll check that tomorrow :-). Cheers -- Regards, Sébastien Han.

On Sun, Feb 3, 2013 at 7:17 PM, Loic Dachary l...@dachary.org wrote: Hi, As discussed during FOSDEM, the script you wrote to kill the OSD when it grows too much could be amended to core dump instead of just being killed and restarted. The binary + core could probably be used to figure out where the leak is. You should make sure the OSD's current working directory is on a file system with enough free disk space to accommodate the dump, and set ulimit -c unlimited before running it (your system default is probably ulimit -c 0, which inhibits core dumps). When you detect that the OSD has grown too much, kill it with kill -SEGV $pid and upload the core found in the working directory, together with the binary, to a public place. If the osd binary is compiled with -g but without changing the -O settings, you get a larger binary file but no negative impact on performance. Forensic analysis will be made a lot easier with the debugging symbols. My 2cts

On 01/31/2013 08:57 PM, Sage Weil wrote: On Thu, 31 Jan 2013, Sylvain Munaut wrote: Hi, I disabled scrubbing using

ceph osd tell \* injectargs '--osd-scrub-min-interval 100'
ceph osd tell \* injectargs '--osd-scrub-max-interval 1000'

and the leak seems to be gone. See the graph at http://i.imgur.com/A0KmVot.png with the OSD memory for the 12 osd processes over the last 3.5 days. Memory was rising every 24h. I made the change yesterday around 13h00 and the OSDs stopped growing. OSD memory even seems to go down slowly in small steps. Of course I assume disabling scrubbing is not a long term solution and I should re-enable it ... (how do I do that btw? what were the default values for those parameters) It depends on the exact commit you're on. You can see the defaults if you do ceph-osd --show-config | grep osd_scrub Thanks for testing this... I have a few other ideas to try to reproduce.
sage

-- Loïc Dachary, Artisan Logiciel Libre
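A minimal sketch of the watchdog Loïc describes, assuming the daemon was already started with ulimit -c unlimited in effect; the single-osd host and the RSS threshold are placeholders:

    #!/bin/sh
    # core-dump a leaking osd instead of quietly restarting it
    PID=$(pidof ceph-osd)                  # assumes one osd per host
    LIMIT_KB=$((4 * 1024 * 1024))          # 4 GB RSS threshold, example value
    RSS_KB=$(awk '/VmRSS/ {print $2}' /proc/"$PID"/status)
    if [ "$RSS_KB" -gt "$LIMIT_KB" ]; then
        kill -SEGV "$PID"                  # triggers the stack-trace + core-dump path
    fi

The core then lands wherever /proc/sys/kernel/core_pattern points, per Greg's note above.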
Re: rbd export speed limit
Hi Stefan, you may be interested in throttle(1) as a side solution, using the stdout export option. By the way, on which interconnect have you managed to get such speeds, if you mean 'committed' bytes (e.g. not an almost empty allocated image)?

On Wed, Feb 13, 2013 at 12:22 AM, Stefan Priebe s.pri...@profihost.ag wrote: Hi, is there a speed limit option for rbd export? Right now I'm able to produce several SLOW requests from IMPORTANT valid requests while just exporting a snapshot which is not really important. rbd export runs at 2400MB/s and each OSD at 250MB/s, so it seems to block valid normal read / write operations. Greets, Stefan
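A sketch of that side solution: export to stdout and cap the rate inside the pipe. pv(1) stands in for throttle(1) here since its rate flag is unambiguous; the image name, snapshot, and 100 MB/s cap are placeholders:

    # export a snapshot to stdout and cap the read rate at 100 MB/s
    rbd export data/image1@snap1 - | pv -L 100m > /backup/image1-snap1.raw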
Re: Paxos and long-lasting deleted data
On Thu, Jan 31, 2013 at 11:18 PM, Andrey Korolyov and...@xdel.ru wrote: On Thu, Jan 31, 2013 at 10:56 PM, Gregory Farnum g...@inktank.com wrote: On Thu, Jan 31, 2013 at 10:50 AM, Andrey Korolyov and...@xdel.ru wrote: http://xdel.ru/downloads/ceph-log/rados-out.txt.gz

On Thu, Jan 31, 2013 at 10:31 PM, Gregory Farnum g...@inktank.com wrote: Can you pastebin the output of rados -p rbd ls?

Well, that sure is a lot of rbd objects. Looks like a tool mismatch or a bug in whatever version you were using. Can you describe how you got into this state, what versions of the servers and client tools you used, etc? -Greg

That's relatively fresh data moved into a bare new cluster a couple of days after the 0.56.1 release, and the tool/daemon versions were kept consistently the same at every moment. All the garbage data belongs to the same pool prefix (3.) on which I have put a bunch of VM images lately; the cluster may have experienced split-brain problems for short times during crash-tests with no workload at all, plus the standard crash tests of osd removal/readdition under moderate workload. Killed osds were returned before, at the time of, and after the process of data rearrangement on the ``osd down'' timeout. Is it possible to do a little cleanup somehow without pool re-creation?

Just an update: this data stayed after pool deletion, so there is probably a way to delete the garbage bytes on a live pool without doing any harm (I hope so), since in theory it can be dissected from the actual pool data placement.
Re: Paxos and long-lasting deleted data
On Mon, Feb 4, 2013 at 1:46 AM, Gregory Farnum g...@inktank.com wrote: On Sunday, February 3, 2013 at 11:45 AM, Andrey Korolyov wrote: Just an update: this data stayed after pool deletion, so there is probably a way to delete the garbage bytes on a live pool without doing any harm (I hope so), since in theory it can be dissected from the actual pool data placement.

What? You mean you deleted the pool and the data in use by the cluster didn't drop? If that's the case, check and see if it's still at the same level - pool deletes are asynchronous and throttled to prevent impacting client operations too much.

Yep, of course, that is exactly what I meant - I waited until the ``ceph -w'' values had stabilized for a long period, then checked that a bunch of files with the same prefix as the deleted pool remained, then I purged them manually. I'm not sure this data was in use at the moment of pool removal; as I mentioned above, it's just garbage produced during periods when the cluster was heavily degraded. -Greg
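A sketch of the manual purge Andrey describes, assuming the leftover objects share a known block-name prefix; the prefix and pool below are placeholders:

    # remove leftover objects by name prefix
    for obj in $(rados -p rbd ls | grep '^rb\.0\.27a8\.'); do
        rados -p rbd rm "$obj"
    done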
Re: Paxos and long-lasting deleted data
http://xdel.ru/downloads/ceph-log/rados-out.txt.gz

On Thu, Jan 31, 2013 at 10:31 PM, Gregory Farnum g...@inktank.com wrote: Can you pastebin the output of rados -p rbd ls?

On Thu, Jan 31, 2013 at 10:17 AM, Andrey Korolyov and...@xdel.ru wrote: Hi, Please take a look, this data has remained for days and seems not to be deleted in the future either:

pool name    category   KB   objects   clones   degraded   unfound   rd   rd KB   wr   wr KB
data         -          000 0 0000 0
install      -          15736833 38560 0 0 163 464648 60970390
metadata     -          000 0 0000 0
prod-rack0   -          364027905888950 0 0 320 267626 689034186
rbd          -          4194305 10270 0 04111269 25165828
total used   690091436893778
total avail  18335469376
total space  25236383744

for pool in $(rados lspools) ; do rbd ls -l $pool ; done | grep -v SIZE | awk '{ sum += $2} END { print sum }'
rbd: pool data doesn't contain rbd images
rbd: pool metadata doesn't contain rbd images
526360

I have seen the same thing before, but not as pronounced as here. The cluster was put under a moderate failure test, dropping one or two osds at once under I/O pressure with replication factor three.

Just wondering if there was something else you wanted to discuss in your email, given the email subject. Did you by any chance want to discuss anything regarding Paxos?

Sorry, please never mind; I was just thinking about paxos-like behavior and spontaneously put that in the title instead of ``osd data placement''.
Re: page allocation failures on osd nodes
On Mon, Jan 28, 2013 at 8:55 PM, Andrey Korolyov and...@xdel.ru wrote: On Mon, Jan 28, 2013 at 5:48 PM, Sam Lang sam.l...@inktank.com wrote: On Sun, Jan 27, 2013 at 2:52 PM, Andrey Korolyov and...@xdel.ru wrote: Ahem, once on an almost empty node the same trace was produced by a qemu process (which was actually pinned to a specific numa node), so it seems that this is generally some scheduler/mm bug, not directly related to the osd processes. In other words, the lower the percentage of memory that is actually RSS, the higher the probability of such an allocation failure.

This might be a known bug in xen for your kernel? The xen users list might be able to help. -sam

It is vanilla 3.4; I really wonder where the paravirt bits in the trace come from. The bug shows up only in 3.4 and is really harmless, at least in the ways I have tested it. Ceph-osd memory allocation behavior is simply more likely to trigger those messages than most other applications under the ``same'' conditions.
Re: page allocation failures on osd nodes
On Sat, Jan 26, 2013 at 12:41 PM, Andrey Korolyov and...@xdel.ru wrote: On Sat, Jan 26, 2013 at 3:40 AM, Sam Lang sam.l...@inktank.com wrote: On Fri, Jan 25, 2013 at 10:07 AM, Andrey Korolyov and...@xdel.ru wrote: Sorry, I wrote too little yesterday because I was sleepy. That's obviously cache pressure, since dropping caches made these errors disappear for a long period. I'm not very familiar with kernel memory mechanisms, but shouldn't the kernel first try to allocate memory on the second node, if this is not prohibited by the process's cpuset, and only then report an allocation failure (as can be seen, only node 0 is involved in the failures)? I really have no idea how NUMA awareness might count in the case of osd daemons.

Hi Andrey, You said that the allocation failure doesn't occur if you flush caches, but the kernel should evict pages from the cache as needed so that the osd can allocate more memory (unless they're dirty, but it doesn't look like you have many dirty pages in this case). It looks like you have plenty of reclaimable pages as well. Does the osd remain running after that error occurs?

Yes, it keeps running flawlessly without even a bit changing in the osdmap, but unfortunately logging wasn't turned on at that moment. As soon as I finish the massive test for the ``suicide timeout'' bug, I'll check your idea with dd and also rerun the test below with ``debug osd = 20''. My thought is that the kernel has ready-to-be-freed memory on node1, yet for some strange reason the osd process tries to reserve pages from node0 (where it obviously allocated memory at start, since node1's memory starts only at high addresses over 32G), and the kernel then refuses to free cache on that specific node (it's quite murky, at least for me, why the kernel does not just invalidate some buffers, even ones more deserving to stay in RAM than the tail-of-LRU ones). Allocation looks like the following on most nodes:

MemTotal:       66081396 kB
MemFree:          278216 kB
Buffers:           15040 kB
Cached:         62422368 kB
SwapCached:            0 kB
Active:          2063908 kB
Inactive:       60876892 kB
Active(anon):     509784 kB
Inactive(anon):       56 kB
Active(file):    1554124 kB
Inactive(file): 60876836 kB

OSD-node free memory, with two osd processes on each node; libvirt prints the ``Free'' field there:

0: 207500 KiB
1: 72332 KiB
Total: 279832 KiB

0: 208528 KiB
1: 80692 KiB
Total: 289220 KiB

Since it is known that the kernel reserves more memory on the node with higher memory pressure, this seems very legit - the osd processes work mostly with node 0's memory, so there is a bigger gap there than on node 1, where almost only fs cache exists.

Ahem, once on an almost empty node the same trace was produced by a qemu process (which was actually pinned to a specific numa node), so it seems that this is generally some scheduler/mm bug, not directly related to the osd processes. In other words, the lower the percentage of memory that is actually RSS, the higher the probability of such an allocation failure. I have printed timestamps of the failure events on selected nodes, just for reference: http://xdel.ru/downloads/ceph-log/allocation-failure/stat.txt

I wonder if you see the same error if you run a long write-intensive workload on the local disk for the osd in question, maybe dd if=/dev/zero of=/data/osd.0/foo -sam

On Fri, Jan 25, 2013 at 2:42 AM, Andrey Korolyov and...@xdel.ru wrote: Hi, Those traces happen only under constant high writes and seem to be very rare. OSD processes do not consume more memory after this event, and the peaks are not distinguishable by monitoring. I was able to catch it after four hours of constant writes on the cluster.
http://xdel.ru/downloads/ceph-log/allocation-failure/
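A sketch of Sam's dd reproduction, with the writer pinned to one NUMA node so the per-node cache pressure can be watched while it runs; the node number, size, and osd data path are assumptions taken from this thread:

    # long write pinned to node 0, with per-node memory watched alongside
    numactl --cpunodebind=0 --membind=0 \
        dd if=/dev/zero of=/data/osd.0/foo bs=1M count=100000 &
    watch -n1 'grep -E "MemFree|FilePages" /sys/devices/system/node/node*/meminfo'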
Re: how to protect rbd from multiple simultaneous mapping
On Fri, Jan 25, 2013 at 7:51 PM, Sage Weil s...@inktank.com wrote: On Fri, 25 Jan 2013, Andrey Korolyov wrote: On Fri, Jan 25, 2013 at 4:52 PM, Ugis ugi...@gmail.com wrote: I mean if you map an rbd and do not use the rbd lock.. command. Can you tell which client has mapped a certain rbd anyway?

Not yet. We need to add the ability to list watchers in librados, which will then let us infer that information.

Assume you have an indistinguishable L3 segment, NAT for example, and access the cluster over it - there is no possibility for the cluster to tell who exactly did something (meaning, the mapping). The lock mechanism is enough to fulfill your request, anyway.

The addrs listed by the lock list are entity_addr_t's, which include an IP, port, and a nonce that uniquely identifies the client. It won't get confused by NAT. Note that you can blacklist either a full IP or an individual entity_addr_t. But, as mentioned above, you can't list users who didn't use the locking (yet).

Yep, I meant the impossibility of mapping a source address to a specific client in this case: it is possible to say that some client mapped the image, but not exactly which one with a specific identity (since clients use the same credentials, in the less-distinguishable case). A client with root privileges could be extended to send the DMI UUID, which is more or less persistent, but this is generally a bad idea since a client may be non-root and still in need of a persistent identity.

sage

2013/1/25 Wido den Hollander w...@widodh.nl: On 01/25/2013 11:47 AM, Ugis wrote: This could work, thanks! P.S. Is there a way to tell which client has mapped a certain rbd if no rbd lock is used?

What you could do is this: $ rbd lock add myimage `hostname` That way you know which client locked the image. Wido

It would be useful to see that info in the output of rbd info image. Probably an attribute for rbd like max_map_count_allowed would be useful in the future - just to make sure an rbd is not mapped from multiple clients if it must not be. I suppose it can actually happen if multiple admins work with the same rbds from multiple clients and no strict rbd lock add.. procedure is followed. Ugis

2013/1/25 Sage Weil s...@inktank.com: On Thu, 24 Jan 2013, Mandell Degerness wrote: The advisory locks are nice, but it would be really nice to have the fencing. If a node is temporarily off the network and a heartbeat monitor attempts to bring up a service on a different node, there is no way to ensure that the first node will not write data to the rbd after the rbd is mounted on the second node. It would be nice if, on seeing that an advisory lock exists, you could tell ceph: Do not accept data from node X until further notice.

Just a reminder: you can use the information from the locks to fence. The basic process is:

- identify old rbd lock holder (rbd lock list img)
- blacklist old owner (ceph osd blacklist add addr)
- break old rbd lock (rbd lock remove img lockid addr)
- lock rbd image on new host (rbd lock add img lockid)
- map rbd image on new host

The oddity here is that the old VM can in theory continue to write up until the OSD hears about the blacklist via the internal gossip. This is okay because the act of the new VM touching any part of the image (and the OSD that stores it) ensures that that OSD gets the blacklist information. So on XFS, for example, the act of replaying the XFS journal ensures that any attempt by the old VM to write to the journal will get EIO.
sage

On Thu, Jan 24, 2013 at 11:50 AM, Josh Durgin josh.dur...@inktank.com wrote: On 01/24/2013 05:30 AM, Ugis wrote: Hi, I have an rbd which contains a non-cluster filesystem. If this rbd is mapped+mounted on one host, it should not be mapped+mounted on another simultaneously. How to protect such an rbd from being mapped on the other host? At ceph level, is the only option to use lock add [image-name] [lock-id] and check for the existence of this lock on the other client, or is it possible to protect the rbd in a way that on other clients the rbd map command would just fail with something like Permission denied, without using arbitrary locks? In other words, can one limit the count of clients that may map a certain rbd?

This is what the lock commands were added for. The lock add command will exit non-zero if the image is already locked, so you can run something like:

rbd lock add [image-name] [lock-id]
rbd map [image-name]

to avoid mapping an image that's in use elsewhere. The lock-id is user-defined, so you could (for example) use the hostname of the machine mapping the image to tell where it's in use. Josh
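A small wrapper sketch around Josh's two commands, relying only on what the thread states (lock add exits non-zero when the image is already locked); the image argument is a placeholder:

    #!/bin/sh
    # map an rbd image only if its advisory lock can be taken first
    IMG=${1:?usage: $0 pool/image}
    LOCK_ID=$(hostname)
    if rbd lock add "$IMG" "$LOCK_ID"; then
        rbd map "$IMG"
    else
        echo "$IMG is locked elsewhere, refusing to map" >&2
        exit 1
    fi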
Re: handling fs errors
On Tue, Jan 22, 2013 at 10:05 AM, Sage Weil s...@inktank.com wrote: We observed an interesting situation over the weekend. The XFS volume under ceph-osd locked up (hung in xfs_ilock) for somewhere between 2 and 4 minutes. After 3 minutes (180s), ceph-osd gave up waiting and committed suicide. XFS seemed to unwedge itself a bit after that, as the daemon was able to restart and continue.

The problem is that during those 180s the OSD was claiming to be alive but not able to do any IO. That heartbeat check is meant as a sanity check against a wedged kernel, but waiting so long meant that the ceph-osd wasn't failed by the cluster quickly enough and client IO stalled.

We could simply change that timeout to something close to the heartbeat interval (currently default is 20s). That will make ceph-osd much more sensitive to fs stalls that may be transient (high load, whatever).

Another option would be to make the osd heartbeat replies conditional on whether the internal heartbeat is healthy. Then the heartbeat warnings could start at 10-20s, ping replies would pause, but the suicide could still be 180s out. If the stall is short-lived, pings will continue, the osd will mark itself back up (if it was marked down) and continue.

Having written that out, the last option sounds like the obvious choice. Any other thoughts? sage

It seems possible to run into domino-style failure marks there if the lock is triggered frequently enough, and that depends only on the sheer amount of workload. By the way, was that fs aged, or were you able to catch the lock on a fresh one? And which kernel did you run there? Thanks!
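As an aside on catching such wedges from the kernel side (not something the thread itself proposes), the hung-task detector that appears in the dmesg output elsewhere in this digest can be made to report well before the 180s suicide:

    # report tasks stuck in uninterruptible sleep after 30s instead of 120s
    echo 30 > /proc/sys/kernel/hung_task_timeout_secs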
Re: Single host VM limit when using RBD
Hi Matthew, seems to be a low value in /proc/sys/kernel/threads-max.

On Thu, Jan 17, 2013 at 12:37 PM, Matthew Anderson matth...@base3.com.au wrote: I've run into a limit on the maximum number of RBD-backed VMs that I'm able to run on a single host. I have 20 VMs (21 RBD volumes open) running on a single host, and when booting the 21st machine I get the below error from libvirt/QEMU. I'm able to shut down a VM and start another in its place, so there seems to be a hard limit on the number of volumes I'm able to have open. I did some googling, and the error 11 from pthread_create seems to mean 'resource unavailable', so I'm probably running into a thread limit of some sort. I did try increasing the max_thread kernel option but nothing changed. I moved a few VMs to a different, empty host and they start with no issues at all. This machine has 4 OSDs running on it in addition to the 20 VMs. Kernel 3.7.1, Ceph 0.56.1 and QEMU 1.3.0. There is currently 65GB of 96GB free ram and no swap. Can anyone suggest where the limit might be, or anything I can do to narrow down the problem? Thanks -Matt

Error starting domain: internal error Process exited while reading console log output: char device redirected to /dev/pts/23
Thread::try_create(): pthread_create failed with error 11
common/Thread.cc: In function 'void Thread::create(size_t)' thread 7f4eb5a65960 time 2013-01-17 02:32:58.096437
common/Thread.cc: 110: FAILED assert(ret == 0)
ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7)
1: (()+0x2aaa8f) [0x7f4eb2de8a8f]
2: (SafeTimer::init()+0x95) [0x7f4eb2cd2575]
3: (librados::RadosClient::connect()+0x72c) [0x7f4eb2c689dc]
4: (()+0xa0290) [0x7f4eb5b27290]
5: (()+0x879dd) [0x7f4eb5b0e9dd]
6: (()+0x87c1b) [0x7f4eb5b0ec1b]
7: (()+0x87ae1) [0x7f4eb5b0eae1]
8: (()+0x87d50) [0x7f4eb5b0ed50]
9: (()+0xb37b2) [0x7f4eb5b3a7b2]
10: (()+0x1e83eb) [0x7f4eb5c6f3eb]
11: (()+0x1ab54a) [0x7f4eb5c3254a]
12: (main()+0x9da) [0x7f4eb5c72a3a]
13: (__libc_start_main()+0xfd) [0x7f4eb1ab4cdd]
14: (()+0x710b9) [0x7f4eb5af80b9]
NOTE: a copy of the executable, or `objdump -rdS executable` is needed to interpret this.
terminate called after

Traceback (most recent call last):
  File /usr/share/virt-manager/virtManager/asyncjob.py, line 96, in cb_wrapper
    callback(asyncjob, *args, **kwargs)
  File /usr/share/virt-manager/virtManager/asyncjob.py, line 117, in tmpcb
    callback(*args, **kwargs)
  File /usr/share/virt-manager/virtManager/domain.py, line 1090, in startup
    self._backend.create()
  File /usr/lib/python2.7/dist-packages/libvirt.py, line 620, in create
    if ret == -1: raise libvirtError ('virDomainCreate() failed', dom=self)
libvirtError: internal error Process exited while reading console log output: char device redirected to /dev/pts/23 Thread::try_create(): pthread_create failed with error 11 common/Thread.cc: In function 'void Thread::create(size_t)' thread 7f4eb5a65960 time 2013-01-17 02:32:58.096437 common/Thread.cc: 110: FAILED assert(ret == 0) ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7) 1: (()+0x2aaa8f) [0x7f4eb2de8a8f] 2: (SafeTimer::init()+0x95) [0x7f4eb2cd2575] 3: (librados::RadosClient::connect()+0x72c) [0x7f4eb2c689dc] 4: (()+0xa0290) [0x7f4eb5b27290] 5: (()+0x879dd) [0x7f4eb5b0e9dd] 6: (()+0x87c1b) [0x7f4eb5b0ec1b] 7: (()+0x87ae1) [0x7f4eb5b0eae1] 8: (()+0x87d50) [0x7f4eb5b0ed50] 9: (()+0xb37b2) [0x7f4eb5b3a7b2] 10: (()+0x1e83eb) [0x7f4eb5c6f3eb] 11: (()+0x1ab54a) [0x7f4eb5c3254a] 12: (main()+0x9da) [0x7f4eb5c72a3a] 13: (__libc_start_main()+0xfd) [0x7f4eb1ab4cdd] 14: (()+0x710b9) [0x7f4eb5af80b9] NOTE: a copy of the executable, or `objdump -rdS executable` is needed to interpret this.

terminate called after
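Some quick checks for the ceilings that can make pthread_create fail with error 11 (EAGAIN); a diagnostic sketch, with the qemu process name an assumption about the setup above:

    cat /proc/sys/kernel/threads-max     # system-wide thread ceiling
    cat /proc/sys/vm/max_map_count       # each thread stack consumes a mapping
    ulimit -u                            # per-user process/thread limit
    # threads currently held by all qemu processes
    ps -o nlwp= -C qemu-system-x86_64 | awk '{s+=$1} END {print s}'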
Re: flashcache
On Thu, Jan 17, 2013 at 7:00 PM, Atchley, Scott atchle...@ornl.gov wrote: On Jan 17, 2013, at 9:48 AM, Gandalf Corvotempesta gandalf.corvotempe...@gmail.com wrote: 2013/1/17 Atchley, Scott atchle...@ornl.gov: IB DDR should get you close to 2 GB/s with IPoIB. I have gotten our IB QDR PCI-E Gen. 2 up to 2.8 GB/s measured via netperf with lots of tuning. Since it uses the traditional socket stack through the kernel, CPU usage will be as high (or higher if QDR) than 10GbE.

Which kind of tuning? Do you have a paper about this?

No, I followed the Mellanox tuning guide and modified their interrupt affinity scripts.

Did you try binding interrupts only to the core to which the QPI link actually belongs, and measuring the difference against spread-over-all-cores binding?

But, actually, is it possible to use ceph with IPoIB in a stable way, or is this experimental?

IPoIB appears as a traditional Ethernet device to Linux and can be used as such.

Not exactly; this summer the kernel gained an additional driver for fully featured L2 (an ib ethernet driver); before that it was quite painful to do any kind of failover using ipoib.

I don't know if support for rsockets is experimental/untested and IPoIB is the stable workaround, or what else.

IPoIB is much more used and pretty stable, while rsockets is new with limited testing. That said, more people using it will help Sean improve it. Ideally, we would like support for zero-copy and reduced CPU usage (via OS-bypass), and with more interconnects than just InfiniBand. :-)

And is a dual controller needed on each OSD node? Is Ceph able to handle OSD network failures? This is really important to know; it changes the whole network topology.

I will let others answer this. Scott
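A sketch of the affinity experiment suggested above, pinning the HCA interrupts to a single local core instead of spreading them; the mlx4 driver name and the CPU mask are assumptions about the hardware:

    service irqbalance stop                      # keep the kernel from re-spreading
    for irq in $(awk -F: '/mlx4/ {gsub(/ /,"",$1); print $1}' /proc/interrupts); do
        echo 2 > /proc/irq/$irq/smp_affinity     # mask 0x2 = CPU1
    done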
Re: Ceph slow request unstable issue
On Wed, Jan 16, 2013 at 10:35 PM, Andrey Korolyov and...@xdel.ru wrote: On Wed, Jan 16, 2013 at 8:58 PM, Sage Weil s...@inktank.com wrote: Hi, On Wed, 16 Jan 2013, Andrey Korolyov wrote: On Wed, Jan 16, 2013 at 4:58 AM, Chen, Xiaoxi xiaoxi.c...@intel.com wrote: Hi list, We are suffering from OSDs or the OS going down when there is continuous high pressure on the Ceph rack. Basically we are on Ubuntu 12.04 + Ceph 0.56.1, 6 nodes, each node with 20 spindles + 4 SSDs as journals (120 spindles in total). We create a lot of RBD volumes (say 240), mount them on 16 different client machines (15 RBD volumes per client) and run dd concurrently on top of each RBD. The issues are:

1. Slow requests. From the list archive it seems solved in 0.56.1, but we still notice such warnings.
2. OSD down or even host down, like the message below. It seems some OSDs have been blocking there for quite a long time.

Suggestions are highly appreciated. Thanks, Xiaoxi

Bad news: I have rolled all my Ceph machines' OS back to kernel 3.2.0-23, which Ubuntu 12.04 uses. I ran a dd command (dd if=/dev/zero bs=1M count=6 of=/dev/rbd${i}) on the Ceph clients as a data-preparation test last night. Now I have one machine down (can't be reached by ping), another two machines have all OSD daemons down, while the three left have some daemons down. I have many warnings in the OSD log like this:

no flag points reached
2013-01-15 19:14:22.769898 7f20a2d57700 0 log [WRN] : slow request 52.218106 seconds old, received at 2013-01-15 19:13:30.551718: osd_op(client.10674.1:1002417 rb.0.27a8.6b8b4567.0eba [write 3145728~524288] 2.c61810ee RETRY) currently waiting for sub ops
2013-01-15 19:14:23.770077 7f20a2d57700 0 log [WRN] : 21 slow requests, 6 included below; oldest blocked for 1132.138983 secs
2013-01-15 19:14:23.770086 7f20a2d57700 0 log [WRN] : slow request 53.216404 seconds old, received at 2013-01-15 19:13:30.553616: osd_op(client.10671.1:1066860 rb.0.282c.6b8b4567.1057 [write 2621440~524288] 2.ea7acebc) currently waiting for sub ops
2013-01-15 19:14:23.770096 7f20a2d57700 0 log [WRN] : slow request 51.442032 seconds old, received at 2013-01-15 19:13:32.327988: osd_op(client.10674.1:1002418

Similar info in dmesg we have seen previously:

[21199.036476] INFO: task ceph-osd:7788 blocked for more than 120 seconds.
[21199.037493] echo 0 > /proc/sys/kernel/hung_task_timeout_secs disables this message.
[21199.038841] ceph-osd D 0006 0 7788 1 0x
[21199.038844] 880fefdafcc8 0086 ffe0
[21199.038848] 880fefdaffd8 880fefdaffd8 880fefdaffd8 00013780
[21199.038852] 88081aa58000 880f68f52de0 880f68f52de0 882017556200
[21199.038856] Call Trace:
[21199.038858] [8165a55f] schedule+0x3f/0x60
[21199.038861] [8106b7e5] exit_mm+0x85/0x130
[21199.038864] [8106b9fe] do_exit+0x16e/0x420
[21199.038866] [8109d88f] ? __unqueue_futex+0x3f/0x80
[21199.038869] [8107a19a] ? __dequeue_signal+0x6a/0xb0
[21199.038872] [8106be54] do_group_exit+0x44/0xa0
[21199.038874] [8107ccdc] get_signal_to_deliver+0x21c/0x420
[21199.038877] [81013865] do_signal+0x45/0x130
[21199.038880] [810a091c] ? do_futex+0x7c/0x1b0
[21199.038882] [810a0b5a] ? sys_futex+0x10a/0x1a0
[21199.038885] [81013b15] do_notify_resume+0x65/0x80
[21199.038887] [81664d50] int_signal+0x12/0x17

We have seen this stack trace several times over the past 6 months, but are not sure what the trigger is. In principle, the ceph server-side daemons shouldn't be capable of locking up like this, but clearly something is amiss between what they are doing in userland and how the kernel is tolerating that. Low memory, perhaps?
In each case where we tried to track it down, the problem seemed to go away on its own. Is this easily reproducible in your case?

My 0.02$: http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg11531.html and kernel panics on two different hosts yesterday during ceph startup (on 3.8-rc3; images from the console are available at http://imgur.com/wIRVn,k0QCS#0) lead to the suggestion that Ceph may have introduced lockup-like behavior not long ago, causing, in my case, an excessive amount of context switches on the host, leading to osd flaps and a panic in the IPoIB stack due to the same issue.

For the stack trace my first guess would be a problem with the IB driver that is triggered by memory pressure. Can you characterize what the system utilization
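One way to put numbers on the excessive-context-switch observation: pidstat's -w flag reports voluntary and involuntary switches per second (the process selection is an assumption):

    # per-second context-switch rates for all osd processes
    pidstat -w -p $(pgrep -d, ceph-osd) 1
    # or watch the system-wide "cs" column
    vmstat 1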
Re: Striped images and cluster misbehavior
After digging a lot, I have found that the IB cards and switch may go into a ``bad'' state after a host load spike, so I have limited all potentially cpu-hungry processes via cgroups. That had no effect at all; the spikes happen almost at the same time the osds on the corresponding host are ``wrongly marked'' down for a couple of seconds. By manual observation, I have confirmed that the osds go crazy first, eating all cores with 100% SY (meaning scheduler or fs issues); then the card, lacking time for its interrupts, starts dropping packets, and so on. This can be reproduced only under heavy workload on the fast cluster; a slow one with similar software versions will crawl but does not produce such locks. Those locks may go away or may hang around for a while, tens of minutes; I am not sure what that depends on. Both nodes with the logs pointed to above contain one monitor and one osd, but the locks happen on two-osd nodes as well. Ceph instances do not share block devices in my setup (except two-osd nodes using the same SSD for a journal, but since it is reproducible on a mon-osd pair with completely separated storage, that seems not to be the exact cause). For the meantime, I may suggest to myself moving away from XFS to see if the locks remain. The issue started with the late 3.6 series and 0.55+, and remains in 3.7.1 and 0.56.1. Should I move to ext4 immediately, or try 3.8-rc with a couple of XFS fixes first?

http://xdel.ru/downloads/ceph-log/osd-lockup-1-14-25-12.875107.log.gz
http://xdel.ru/downloads/ceph-log/osd-lockup-2-14-33-16.741603.log.gz

Timestamps in the filenames were added for easier lookup; the osdmap marked the osds down a couple of beats after those marks.

On Mon, Dec 31, 2012 at 1:16 AM, Andrey Korolyov and...@xdel.ru wrote: On Sun, Dec 30, 2012 at 10:56 PM, Samuel Just sam.j...@inktank.com wrote: Sorry for the delay. A quick look at the log doesn't show anything obvious... Can you elaborate on how you caused the hang? -Sam

I am sorry for all this noise; the issue has almost certainly been triggered by some bug in the Infiniband switch firmware, because a per-port reset was able to solve the ``wrong mark'' problem - at least, it hasn't shown up yet for a week. The problem took almost two days until resolution - all possible connectivity tests displayed no overtimes or drops which could cause wrong marks. Finally, I started playing with TCP settings and found that ipv4.tcp_low_latency raised the probability of the ``wrong mark'' event several times over when enabled - so the area of all possible causes quickly collapsed to a media-only problem, and I fixed it soon after.

On Wed, Dec 19, 2012 at 3:53 AM, Andrey Korolyov and...@xdel.ru wrote: Please take a look at the log below; this is a slightly different bug - both osd processes on the node were stuck eating all available cpu until I killed them. This can be reproduced by doing parallel exports of different images from the same client IP using either ``rbd export'' or API calls - after a couple of wrong ``downs'', osd.19 and osd.27 finally got stuck. What is more interesting, 10.5.0.33 holds the most hungry set of virtual machines, constantly eating four of twenty-four HT cores, and this node fails almost always. The underlying fs is XFS, ceph version gf9d090e. Quite possibly my previous reports are about side effects of this problem.

http://xdel.ru/downloads/ceph-log/osd-19_and_27_stuck.log.gz

and timings for the monmap; the logs are from different hosts, so they may have a time shift of tens of milliseconds:

http://xdel.ru/downloads/ceph-log/timings-crash-osd_19_and_27.txt

Thanks!
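The sysctl toggle used in that bisection; flipping it changed the failure rate, which pointed at the fabric rather than at ceph:

    sysctl -w net.ipv4.tcp_low_latency=1    # made the ``wrong mark'' events several times likelier
    sysctl -w net.ipv4.tcp_low_latency=0    # back to the default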
Re: v0.56.1 released
On Tue, Jan 8, 2013 at 11:30 AM, Stefan Priebe - Profihost AG s.pri...@profihost.ag wrote: Hi, I cannot see any git tag or branch claiming to be 0.56.1? Which commit id is this? Greets, Stefan

Same for me; github simply did not send the new tag on pull to the local tree for some reason. Cloning the repository from scratch resolved this :)

On 08.01.2013 05:53, Sage Weil wrote: We found a few critical problems with v0.56, and fixed a few outstanding problems. v0.56.1 is ready, and we're pretty pleased with it! There are two critical fixes in this update: a fix for possible data loss or corruption if power is lost, and a protocol compatibility problem that was introduced in v0.56 (between v0.56 and any other version of ceph).

* osd: fix commit sequence for XFS, ext4 (or any other non-btrfs) to prevent data loss on power cycle or kernel panic
* osd: fix compatibility for CALL operation
* osd: process old osdmaps prior to joining cluster (fixes slow startup)
* osd: fix a couple of recovery-related crashes
* osd: fix large io requests when journal is in (non-default) aio mode
* log: fix possible deadlock in logging code

This release will kick off the bobtail backport series, and will get a shiny new URL for its home.

* Git at git://github.com/ceph/ceph.git
* Tarball at http://ceph.com/download/ceph-0.56.1.tar.gz
* For Debian/Ubuntu packages, see http://ceph.com/docs/master/install/debian
* For RPMs, see http://ceph.com/docs/master/install/rpm
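A lighter fix than re-cloning, for what it's worth: a plain git pull does not always fetch new tags, but an explicit tag fetch does (tag name follows ceph's v-prefixed convention):

    git fetch --tags origin
    git tag -l 'v0.56*'     # the new tag should now be visible
    git checkout v0.56.1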
Very intensive I/O under mon process
I have just observed that the ceph-mon process, at least the bobtail one, has an extremely high density of writes - several times above the _overall_ cluster amount of writes as measured by the qemu driver (and those measurements are very close to fair). For example, a test cluster of 32 osds shows 7.5 MByte/s of writes on each mon node while the overall amount is about 1.5 MByte/s, and a dev cluster with only three osds shows about 1 MByte/s against an accumulated real write bandwidth of tens of kilobytes per second. I'm afraid that if this is normal, I may hit the limit of spinning storage when growing the test cluster, say, twenty times in osd count and the related ``idle'' write bandwidth.
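A sketch of how such per-process write rates can be measured for comparison against the cluster-wide client I/O that ceph -w reports; the process selection is an assumption:

    # kB written per second by the monitor, sampled each second
    pidstat -d -p $(pgrep -f ceph-mon) 1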
Re: Very intensive I/O under mon process
On Wed, Jan 2, 2013 at 8:00 PM, Joao Eduardo Luis joao.l...@inktank.com wrote: On 01/02/2013 03:40 PM, Andrey Korolyov wrote: I have just observed that the ceph-mon process, at least the bobtail one, has an extremely high density of writes - several times above the _overall_ cluster amount of writes as measured by the qemu driver (and those measurements are very close to fair). For example, a test cluster of 32 osds shows 7.5 MByte/s of writes on each mon node while the overall amount is about 1.5 MByte/s, and a dev cluster with only three osds shows about 1 MByte/s against an accumulated real write bandwidth of tens of kilobytes per second. I'm afraid that if this is normal, I may hit the limit of spinning storage when growing the test cluster, say, twenty times in osd count and the related ``idle'' write bandwidth.

High debugging levels (especially 'debug ms', 'debug mon' or 'debug paxos') should significantly increase IO on the monitors. Might that be the case?

Nope, all debug levels, including the mons', are set to 0/0. I also see that the ``no-client'' cluster shows a very small amount of such writes under the mon, 10-20 kByte/s, and one idle client (writing a couple of bytes without O_SYNC) raises this value up to ~200 kB/s, and so on; so maybe I was wrong before and the writes correlate with the number of clients too (six clients plus three control nodes accessing via API, in the context of the previous message, for both environments).
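For completeness, the mon debug levels can be lowered on a live monitor through its admin socket, mirroring the osd admin-socket commands quoted earlier in this digest; the socket path and mon name are assumptions:

    ceph --admin-daemon /var/run/ceph/ceph-mon.a.asok config set debug_mon 0
    ceph --admin-daemon /var/run/ceph/ceph-mon.a.asok config set debug_ms 0
    ceph --admin-daemon /var/run/ceph/ceph-mon.a.asok config set debug_paxos 0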
0.55 crashed during upgrade to bobtail
Hi, All osds in the dev cluster died shortly after an upgrade (package-only, i.e. a binary upgrade, without even restarting the running processes); please see the attached file.

Was: 0.55.1-356-g850d1d5
Upgraded to: the 0.56 tag

The only difference is the gcc version and corresponding libstdc++: 4.6 on the buildhost and 4.7 on the cluster. Of course I may do a rollback, and the problem will most probably go away, but it seems there should be some fix. Also I saw something similar in the testing env days ago - a package upgrade within 0.55 killed all _windows_ guests and one of tens of linux guests running on top of rbd. Unfortunately I had no debug sessions at that moment and I have only the tail of the log from qemu:

terminate called after throwing an instance of 'ceph::buffer::end_of_buffer'
what(): buffer::end_of_buffer

I'm blaming the ldconfig action from librbd, because nothing else would cause such destruction of running processes - maybe I'm wrong. Thanks!

WBR, Andrey

crashes-2013-01-01.tgz Description: GNU Zip compressed data
Re: 0.55 crashed during upgrade to bobtail
On Tue, Jan 1, 2013 at 9:49 PM, Andrey Korolyov and...@xdel.ru wrote: Hi, All osds in the dev cluster died shortly after an upgrade (package-only, i.e. a binary upgrade, without even restarting the running processes); please see the attached file. Was: 0.55.1-356-g850d1d5. Upgraded to: the 0.56 tag. The only difference is the gcc version and corresponding libstdc++: 4.6 on the buildhost and 4.7 on the cluster. Of course I may do a rollback, and the problem will most probably go away, but it seems there should be some fix. Also I saw something similar in the testing env days ago - a package upgrade within 0.55 killed all _windows_ guests and one of tens of linux guests running on top of rbd. Unfortunately I had no debug sessions at that moment and I have only the tail of the log from qemu: terminate called after throwing an instance of 'ceph::buffer::end_of_buffer' what(): buffer::end_of_buffer I'm blaming the ldconfig action from librbd, because nothing else would cause such destruction of running processes - maybe I'm wrong. Thanks! WBR, Andrey

Sorry, I'm not able to reproduce the crash after a rollback, and the traces were incomplete due to lack of disk space at the specified core location, so please disregard it.
Re: 0.55 crashed during upgrade to bobtail
On Wed, Jan 2, 2013 at 12:16 AM, Andrey Korolyov and...@xdel.ru wrote: On Tue, Jan 1, 2013 at 9:49 PM, Andrey Korolyov and...@xdel.ru wrote: Hi, All osds in the dev cluster died shortly after an upgrade (package-only, i.e. a binary upgrade, without even restarting the running processes); please see the attached file. Was: 0.55.1-356-g850d1d5. Upgraded to: the 0.56 tag. The only difference is the gcc version and corresponding libstdc++: 4.6 on the buildhost and 4.7 on the cluster. Of course I may do a rollback, and the problem will most probably go away, but it seems there should be some fix. Also I saw something similar in the testing env days ago - a package upgrade within 0.55 killed all _windows_ guests and one of tens of linux guests running on top of rbd. Unfortunately I had no debug sessions at that moment and I have only the tail of the log from qemu: terminate called after throwing an instance of 'ceph::buffer::end_of_buffer' what(): buffer::end_of_buffer I'm blaming the ldconfig action from librbd, because nothing else would cause such destruction of running processes - maybe I'm wrong. Thanks! WBR, Andrey

Sorry, I'm not able to reproduce the crash after a rollback, and the traces were incomplete due to lack of disk space at the specified core location, so please disregard it.

Ahem, finally it seems that the osd process is stumbling on something on the fs, because my other environments were also able to reproduce the crash once, but reproduction is not possible once a new osd process has started over the existing filestore (an offline version rollback and another try at the online upgrade went fine). And the backtraces in the first message are complete, at least 1 and 2, despite the lack of space the first time - I have received a couple of coredumps whose traces look exactly the same as 2.
Re: Improving responsiveness of KVM guests on Ceph storage
On Mon, Dec 31, 2012 at 3:12 AM, Jens Kristian Søgaard j...@mermaidconsulting.dk wrote:
> Hi Andrey,
>
> Thanks for your reply!
>
>> You may try to play with SCHED_RT ... by adding small RT slices via the ``cpu'' cgroup to the vcpu/emulator threads ...
>
> I'm not quite sure I understand your suggestion. Do you mean that you set the process priority to real-time on each qemu-kvm process, and then use cgroups cpu.rt_runtime_us / cpu.rt_period_us to restrict the amount of CPU time those processes can receive?
>
> I'm not sure how that would apply here, as I have only one qemu-kvm process, and it is not non-responsive because of a lack of allocated CPU time slices - but rather because some I/Os take a long time to complete, and other I/Os apparently have to wait for those to complete.

Yep, I meant exactly that. Of course it will not help with only one VM; RT may help in more concurrent cases :)

>> Of course, some Ceph tuning like writeback cache and a large journal may help you too ...
>
> I have been considering the journal as something where I could improve performance by tweaking the setup. I have set aside 10 GB of space for the journal, but I'm not sure if this is too little - or if the size really doesn't matter that much when it is on the same mdraid as the data itself.
>
> Is there a tool that can tell me how much of my journal space is actually actively being used? I.e. I'm looking for something that could tell me if increasing the size of the journal, or placing it on a separate (SSD) disk, could solve my problem.

If I understood right, you have an md device holding both journal and filestore? What type of raid do you have here? You will certainly need a separate device for the journal (for experimental purposes, a fast disk may be enough), and if you have any type of redundant storage under the filestore partition, you may also change it to simple RAID0, or even separate disks, and create one osd over every disk (mind the journal device's throughput, which must be equal to the sum of the speeds of all filestore devices - for a commodity-type SSD that sums to two 100MB/s disks, for example). I have a ``pure'' disk setup in my dev environment, built on quite old desktop-class machines, and one rsync process may hang a VM for a short time despite a dedicated SATA disk for the journal.

> How do I change the size of the writeback cache when using qemu-kvm like I do? Does setting rbd cache size in ceph.conf have any effect on qemu-kvm, where the drive is defined as:
>
> format=rbd,file=rbd:data/image1:rbd_cache=1,if=virtio

What cache_size/max_dirty values do you have inside ceph.conf, and which qemu version do you use? The default values are good enough to prevent pushing I/O spikes down to the physical storage, but for long I/O-intensive tasks, increasing the cache may help the OS to align writes more smoothly. Also, you don't need to set rbd_cache explicitly in the disk config on qemu 1.2 and newer releases; for older ones, http://lists.gnu.org/archive/html/qemu-devel/2012-05/msg02500.html should be applied.
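For reference, the cache knobs discussed above live in the [client] section of ceph.conf on the KVM host; the values below only illustrate raising the defaults (32 MB cache, 24 MB max dirty), they are not a recommendation:

[client]
    rbd cache = true
    rbd cache size = 67108864        # 64 MB; default is 32 MB
    rbd cache max dirty = 50331648   # 48 MB; default is 24 MB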
Re: Improving responsiveness of KVM guests on Ceph storage
On Mon, Dec 31, 2012 at 2:58 PM, Jens Kristian Søgaard j...@mermaidconsulting.dk wrote:
> Hi Andrey,
>
>> If I understood right, you have an md device holding both journal and filestore? What type of raid do you have here?
>
> Yes, the same md device holds both journal and filestore. It is a raid5.

Ahem, then of course you need to reassemble it into something faster :)

>> You will certainly need a separate device for the journal (for experimental purposes, a fast disk may be enough)
>
> Is there a way to tell if the journal is the bottleneck without actually adding such an extra device?

In theory, yes - but your setup is already dying under a high number of write seeks, so it may not be necessary. Also, I don't see a right way to measure the bottleneck when one disk device is used for both filestore and journal - with separate devices you can measure the maximum values using fio and compare them to the values calculated from /proc/diskstats, but the ``all-in-one'' case is obviously hard to measure, even if you were able to log writes to the journal file and the filestore files separately without significant overhead.

>> ... you may also change it to simple RAID0, or even separate disks, and create one osd over every disk ...
>
> I have only 3 OSDs with 4 disks each. I was afraid that it would be too brittle as a RAID0, and if I created separate OSDs for each disk, it would stall the file system due to recovery if a server crashes.

No, it isn't too bad in most cases. The recovery process does not affect operations on the rbd storage, except for a small performance degradation, so you may split your raid setup into lightweight R0 sets. It depends: on a plain SATA controller, a software R0 under one OSD will do a better job than 2 separate OSDs having one disk each; on a cache-backed controller, separate OSDs are preferable, unless the controller cannot align writes due to the overall write bandwidth.

>> What cache_size/max_dirty values do you have inside ceph.conf
>
> I haven't set them explicitly, so I imagine the cache_size is 32 MB and the max_dirty is 24 MB.
>
>> and which qemu version do you use?
>
> Using the default 0.15 version in Fedora 16.
>
>> ... for long I/O-intensive tasks, increasing the cache may help the OS to align writes more smoothly. Also, you don't need to set rbd_cache explicitly in the disk config on qemu 1.2 and newer releases; for older ones http://lists.gnu.org/archive/html/qemu-devel/2012-05/msg02500.html should be applied.
>
> I read somewhere that I needed to enable it specifically for older qemu-kvm versions, which I did like this:
>
> format=rbd,file=rbd:data/image1:rbd_cache=1,if=virtio
>
> However now I read in the docs for qemu-rbd that it needs to be set like this:
>
> format=raw,file=rbd:data/squeeze:rbd_cache=true,cache=writeback
>
> I'm not sure if 1 and true are interpreted the same way? I'll try using true and see if I get any noticeable changes in behaviour.
>
> The link you sent me seems to indicate that I need to compile my own version of qemu-kvm to be able to test this?

No, there are no significant changes from 0.15 to the current version, and your options will work just fine. So the general recommendation is to remove redundancy from your disk backend and then move the journal out to a separate disk or ssd.
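For reference, moving the journal out is an osd-section setting in ceph.conf; a minimal sketch, assuming /dev/sdg1 is a partition on the dedicated disk or SSD (the device name is a placeholder):

[osd]
    osd journal = /dev/sdg1              # dedicated partition
    # or, for a file-based journal:
    # osd journal = /var/lib/ceph/osd/ceph-$id/journal
    # osd journal size = 10240           # MB; used when the journal is a plain file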
Re: Striped images and cluster misbehavior
On Sun, Dec 30, 2012 at 10:56 PM, Samuel Just sam.j...@inktank.com wrote:
> Sorry for the delay. A quick look at the log doesn't show anything obvious... Can you elaborate on how you caused the hang?
> -Sam

I am sorry for all this noise; the issue has almost certainly been triggered by some bug in the Infiniband switch firmware, because a per-port reset was able to solve the ``wrong mark'' problem - at least, it hasn't shown up for a week now. The problem took almost two days to resolve - all possible connectivity tests displayed no timeouts or drops that could cause the wrong marks. Finally, I started playing with TCP settings and found that ipv4.tcp_low_latency raises the probability of a ``wrong mark'' event several times over when enabled - so the set of possible causes quickly collapsed to a media-only problem, and I fixed it soon after.

> On Wed, Dec 19, 2012 at 3:53 AM, Andrey Korolyov and...@xdel.ru wrote:
>> Please take a look at the log below; this is a slightly different bug - both osd processes on the node were stuck eating all available cpu until I killed them. This can be reproduced by doing parallel exports of different images from the same client IP, using either ``rbd export'' or API calls - after a couple of wrong ``downs'', osd.19 and osd.27 finally got stuck. What is more interesting, 10.5.0.33 holds the most hungry set of virtual machines, constantly eating four of twenty-four HT cores, and this node fails almost always. The underlying fs is XFS, ceph version gf9d090e. Quite possibly my previous reports are about side effects of this problem.
>>
>> http://xdel.ru/downloads/ceph-log/osd-19_and_27_stuck.log.gz
>>
>> and timings for the monmap; the logs are from different hosts, so they may have a time shift of tens of milliseconds:
>>
>> http://xdel.ru/downloads/ceph-log/timings-crash-osd_19_and_27.txt
>>
>> Thanks!
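For anyone wanting to reproduce the correlation, the toggle in question is a standard sysctl; checking and flipping it looks like this:

$ sysctl net.ipv4.tcp_low_latency
net.ipv4.tcp_low_latency = 0
$ sysctl -w net.ipv4.tcp_low_latency=1    # enabling it raised the ``wrong mark'' rate in the tests above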
Re: Improving responsiveness of KVM guests on Ceph storage
On Sun, Dec 30, 2012 at 9:05 PM, Jens Kristian Søgaard j...@mermaidconsulting.dk wrote:
> Hi guys,
>
> I'm testing Ceph as storage for KVM virtual machine images and found an inconvenience that I am hoping it is possible to find the cause of.
>
> I'm running a single KVM Linux guest on top of Ceph storage. In that guest I run rsync to download files from the internet. When rsync is running, the guest will seemingly stall and run very slowly. For example, if I log in via SSH to the guest and use the command prompt, nothing will happen for a long period (30+ seconds), then it processes a few typed characters, then it blocks for another long period of time, then processes a bit more, etc.
>
> I was hoping to be able to tweak the system so that it runs more like when using conventional storage - i.e. perhaps the rsync won't be super fast, but the machine will be equally responsive all the time. I'm hoping that you can provide some hints on how to best benchmark or test the system to find the cause of this?
>
> The ceph OSDs periodically log these two messages, which I do not fully understand:
>
> 2012-12-30 17:07:12.894920 7fc8f3242700 1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7fc8cbfff700' had timed out after 30
> 2012-12-30 17:07:13.599126 7fc8cbfff700 1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7fc8cbfff700' had timed out after 30
>
> Is this to be expected when the system is in use, or does it indicate that something is wrong?
>
> Ceph also logs messages such as this:
>
> 2012-12-30 17:07:36.932272 osd.0 10.0.0.1:6800/9157 286340 : [WRN] slow request 30.751940 seconds old, received at 2012-12-30 17:07:06.180236: osd_op(client.4705.0:16074961 rb.0.11b7.4a933baa.000c188f [write 532480~4096] 0.f2a63fe) v4 currently waiting for sub ops
>
> My setup: 3 servers running Fedora 17 with Ceph 0.55.1 from RPM. Each server runs one osd and one mon. One of the servers also runs an mds. The backing file system is btrfs stored on md-raid. The journal is stored on the same SATA disks as the rest of the data. Each server has 3 bonded gigabit/sec NICs. One server running Fedora 16 with qemu-kvm has a gigabit/sec NIC connected to the same network as the Ceph servers, and a gigabit/sec NIC connected to the Internet.
>
> Disk is mounted with:
>
> -drive format=rbd,file=rbd:data/image1:rbd_cache=1,if=virtio
>
> iostat on the KVM guest gives:
>
> avg-cpu:  %user  %nice %system %iowait %steal  %idle
>            0,00   0,00    0,00  100,00   0,00   0,00
>
> Device: rrqm/s wrqm/s  r/s  w/s rsec/s wsec/s avgrq-sz avgqu-sz   await   svctm %util
> vda       0,00   1,40 0,10 0,30   0,80  13,60    36,00     1,66 2679,25 2499,75 99,99
>
> Top on the KVM host shows 90% CPU idle and 0.0% I/O waiting.
>
> iostat on an OSD gives:
>
> avg-cpu:  %user  %nice %system %iowait %steal  %idle
>            0,13   0,00    1,50   15,79   0,00  82,58
>
> Device: rrqm/s wrqm/s   r/s    w/s   rkB/s   wkB/s avgrq-sz avgqu-sz  await r_await w_await svctm %util
> sda     240,70 441,20 33,00  42,70 1122,40 1961,80    81,48    14,45 164,42  319,14   44,85  6,63 50,22
> sdb     299,10 393,10 33,90  38,40 1363,60 1720,60    85,32    13,55 171,32  316,21   43,41  6,55 47,39
> sdc     268,50 441,60 28,80  45,40 1191,60 1977,00    85,41    19,08 159,39  345,98   41,02  6,56 48,69
> sdd     255,50 445,50 30,20  45,00 1150,40 1975,80    83,14    18,18 155,97  338,90   33,20  6,95 52,23
> md0       0,00   0,00  1,20 132,70    4,80 4086,40    61,11     0,00   0,00    0,00    0,00  0,00  0,00
>
> The figures are similar on all three OSDs.
>
> I am thinking that one possible cause could be that the journal is stored on the same disks as the rest of the data, but I don't know how to benchmark whether this is actually the case (?)
>
> Thanks for any help or advice you can offer!
Hi Jens,

You may try to play with SCHED_RT. I have found it hard to use myself, but you can achieve your goal by adding small RT slices via the ``cpu'' cgroup to the vcpu/emulator threads; it dramatically increases overall VM responsiveness. I eventually dropped it because the RT scheduler is a very strange thing - it may cause an endless lockup on a disk operation during heavy load, or produce a forever-stuck ``kworker'' on some cores if you kill a VM which has separate RT slices for its vcpu threads. Of course, some Ceph tuning like a writeback cache and a large journal may help you too; I'm speaking primarily of the VM's performance by itself.
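A minimal sketch of the cgroup approach described above, assuming a cgroup v1 ``cpu'' hierarchy mounted at /sys/fs/cgroup/cpu and a hypothetical vcpu thread id 4242 - the slice sizes are illustrative only:

# give the vcpu thread a small guaranteed RT slice: 5ms out of every 100ms period
$ mkdir /sys/fs/cgroup/cpu/vm0
$ echo 100000 > /sys/fs/cgroup/cpu/vm0/cpu.rt_period_us
$ echo 5000   > /sys/fs/cgroup/cpu/vm0/cpu.rt_runtime_us
$ echo 4242   > /sys/fs/cgroup/cpu/vm0/tasks
$ chrt -f -p 10 4242    # switch the thread to SCHED_FIFO, priority 10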
Re: Striped images and cluster misbehavior
On Mon, Dec 17, 2012 at 2:36 AM, Andrey Korolyov and...@xdel.ru wrote:
> [full report and osd failure log snipped - see the original ``Striped images and cluster misbehavior'' message below]
Re: Slow requests
On Sun, Dec 16, 2012 at 5:59 PM, Jens Kristian Søgaard j...@mermaidconsulting.dk wrote:
> Hi,
>
> My log is filling up with warnings about a single slow request that has been around for a very long time:
>
> osd.1 10.0.0.2:6800/900 162926 : [WRN] 1 slow requests, 1 included below; oldest blocked for 84446.312051 secs
> osd.1 10.0.0.2:6800/900 162927 : [WRN] slow request 84446.312051 seconds old, received at 2012-12-15 15:27:56.891437: osd_sub_op(client.4528.0:19602219 0.fe 3807b5fe/rb.0.11b7.4a933baa.0008629e/head//0 [] v 53'185888 snapset=0=[]:[] snapc=0=[]) v7 currently started
>
> How can I identify the cause of this, and how can I cancel this request? I'm running Ceph on Fedora 17 using the latest RPMs available from ceph.com (0.52-6).
>
> Thanks in advance,

Hi Jens,

Please take a look at this thread:

http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/10843

It seems you'll need newer rpms to get rid of this.
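For digging into a wedged request like this, the osd admin socket can be queried directly; the socket path below is the packaged default and may differ:

$ ceph --admin-daemon /var/run/ceph/ceph-osd.1.asok help                  # lists the commands this build supports
$ ceph --admin-daemon /var/run/ceph/ceph-osd.1.asok dump_ops_in_flight   # on builds that have it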
Striped images and cluster misbehavior
Hi,

After the recent switch to the default ``--stripe-count 1'' on image upload, I have observed a strange thing - a single import or deletion of a striped image may literally turn off the entire cluster temporarily (see the log below). Of course the next issued osdmap fixes the situation, but all in-flight operations experience a short freeze. This issue appears randomly on some import or delete operations; I have not seen any other operation types causing it. Even if the nature of this bug lies completely in the client-osd interaction, maybe ceph should develop some foolproof measures, even when the complaining client has admin privileges? Almost for sure this can be reproduced within teuthology with rwx rights on both osds and mons at the client. And as far as I can see, there is no problem on either the physical or the protocol layer of the dedicated cluster interface on the client machine.

2012-12-17 02:17:03.691079 mon.0 [INF] pgmap v2403268: 15552 pgs: 15552 active+clean; 931 GB data, 2927 GB used, 26720 GB / 29647 GB avail
2012-12-17 02:17:04.693344 mon.0 [INF] pgmap v2403269: 15552 pgs: 15552 active+clean; 931 GB data, 2927 GB used, 26720 GB / 29647 GB avail
2012-12-17 02:17:05.695742 mon.0 [INF] pgmap v2403270: 15552 pgs: 15552 active+clean; 931 GB data, 2927 GB used, 26720 GB / 29647 GB avail
2012-12-17 02:17:05.991900 mon.0 [INF] osd.0 10.5.0.10:6800/4907 failed (3 reports from 1 peers after 2012-12-17 02:17:29.991859 >= grace 20.00)
2012-12-17 02:17:05.992017 mon.0 [INF] osd.1 10.5.0.11:6800/5011 failed (3 reports from 1 peers after 2012-12-17 02:17:29.991995 >= grace 20.00)
2012-12-17 02:17:05.992139 mon.0 [INF] osd.2 10.5.0.12:6803/5226 failed (3 reports from 1 peers after 2012-12-17 02:17:29.992110 >= grace 20.00)
2012-12-17 02:17:05.992240 mon.0 [INF] osd.3 10.5.0.13:6803/6054 failed (3 reports from 1 peers after 2012-12-17 02:17:29.992224 >= grace 20.00)
2012-12-17 02:17:05.992330 mon.0 [INF] osd.4 10.5.0.14:6803/5792 failed (3 reports from 1 peers after 2012-12-17 02:17:29.992317 >= grace 20.00)
2012-12-17 02:17:05.992420 mon.0 [INF] osd.5 10.5.0.15:6803/5564 failed (3 reports from 1 peers after 2012-12-17 02:17:29.992405 >= grace 20.00)
2012-12-17 02:17:05.992515 mon.0 [INF] osd.7 10.5.0.17:6803/5902 failed (3 reports from 1 peers after 2012-12-17 02:17:29.992501 >= grace 20.00)
2012-12-17 02:17:05.992607 mon.0 [INF] osd.8 10.5.0.10:6803/5338 failed (3 reports from 1 peers after 2012-12-17 02:17:29.992591 >= grace 20.00)
2012-12-17 02:17:05.992702 mon.0 [INF] osd.10 10.5.0.12:6800/5040 failed (3 reports from 1 peers after 2012-12-17 02:17:29.992686 >= grace 20.00)
2012-12-17 02:17:05.992793 mon.0 [INF] osd.11 10.5.0.13:6800/5748 failed (3 reports from 1 peers after 2012-12-17 02:17:29.992778 >= grace 20.00)
2012-12-17 02:17:05.992891 mon.0 [INF] osd.12 10.5.0.14:6800/5459 failed (3 reports from 1 peers after 2012-12-17 02:17:29.992875 >= grace 20.00)
2012-12-17 02:17:05.992980 mon.0 [INF] osd.13 10.5.0.15:6800/5235 failed (3 reports from 1 peers after 2012-12-17 02:17:29.992966 >= grace 20.00)
2012-12-17 02:17:05.993081 mon.0 [INF] osd.16 10.5.0.30:6800/5585 failed (3 reports from 1 peers after 2012-12-17 02:17:29.993065 >= grace 20.00)
2012-12-17 02:17:05.993184 mon.0 [INF] osd.17 10.5.0.31:6800/5578 failed (3 reports from 1 peers after 2012-12-17 02:17:29.993169 >= grace 20.00)
2012-12-17 02:17:05.993274 mon.0 [INF] osd.18 10.5.0.32:6800/5097 failed (3 reports from 1 peers after 2012-12-17 02:17:29.993260 >= grace 20.00)
2012-12-17 02:17:05.993367 mon.0 [INF] osd.19 10.5.0.33:6800/5109 failed (3 reports from 1 peers after 2012-12-17 02:17:29.993352 >= grace 20.00)
2012-12-17 02:17:05.993464 mon.0 [INF] osd.20 10.5.0.34:6800/5125 failed (3 reports from 1 peers after 2012-12-17 02:17:29.993448 >= grace 20.00)
2012-12-17 02:17:05.993554 mon.0 [INF] osd.21 10.5.0.35:6800/5183 failed (3 reports from 1 peers after 2012-12-17 02:17:29.993538 >= grace 20.00)
2012-12-17 02:17:05.993644 mon.0 [INF] osd.22 10.5.0.36:6800/5202 failed (3 reports from 1 peers after 2012-12-17 02:17:29.993628 >= grace 20.00)
2012-12-17 02:17:05.993740 mon.0 [INF] osd.23 10.5.0.37:6800/5252 failed (3 reports from 1 peers after 2012-12-17 02:17:29.993725 >= grace 20.00)
2012-12-17 02:17:05.993831 mon.0 [INF] osd.24 10.5.0.30:6803/5758 failed (3 reports from 1 peers after 2012-12-17 02:17:29.993816 >= grace 20.00)
2012-12-17 02:17:05.993924 mon.0 [INF] osd.25 10.5.0.31:6803/5748 failed (3 reports from 1 peers after 2012-12-17 02:17:29.993908 >= grace 20.00)
2012-12-17 02:17:05.994018 mon.0 [INF] osd.26 10.5.0.32:6803/5275 failed (3 reports from 1 peers after 2012-12-17 02:17:29.994002 >= grace 20.00)
2012-12-17 02:17:06.105315 mon.0 [INF] osdmap e24204: 32 osds: 4 up, 32 in
2012-12-17 02:17:06.051291 osd.6 [WRN] 1 slow requests, 1 included below; oldest blocked for 30.947080 secs
2012-12-17 02:17:06.051299 osd.6 [WRN] slow request 30.947080 seconds old, received at 2012-12-17 02:16:35.042711:
Re: Slow requests
On Mon, Dec 17, 2012 at 2:42 AM, Jens Kristian Søgaard j...@mermaidconsulting.dk wrote:
> Hi Andrey,
>
> Thanks for your reply!
>
>> Please take a look at this thread: http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/10843
>
> I took your advice and restarted each of my three osds individually. The first two restarted within a minute or two. The last one took 20 minutes to restart (?)
>
> Afterwards the slow request had disappeared, so it did seem to work!
>
>> It seems you'll need newer rpms to get rid of this.
>
> Are newer RPMs available for download somewhere, or do I need to compile my own? I have searched the ceph.com site several times in the past, but I only find older versions.

Oh, sorry, I may have misguided you - the real solution is the patch from Sam; restarts help only in the short term, and you won't be able to check some pgs for consistency until the patch has been applied - they'll hang on scrub every time.
misdirected client messages
Hi,

Today, during a planned kernel upgrade, one of the osds (which I had not touched yet) started complaining about a ``misdirected client'':

2012-12-12 21:22:59.107648 osd.20 [WRN] client.2774043 10.5.0.33:0/1013711 misdirected client.2774043.0:114 pg 5.ad140d42 to osd.20 in e23834, client e23834 pg 5.542 features 67108863

(it remained the same, except for the timestamp, for at least ten minutes and then disappeared)

The last two of the three nodes had been rebooted by that time; the primary was still not rebooted, and osd.20 had not been rebooted either at that moment:

5.542 124 0 0 0 511705106 142753 142753 active+clean 2012-12-12 17:07:05.761509 23834'1022923 23409'2958324 [7,11,5] [7,11,5] 23452'839524 2012-12-08 21:08:42.076364 23452'839524 2012-12-08 21:08:42.076365

As far as I can see from the bug tracker, these messages shouldn't appear, at least in recent versions. I'm using a non-default TCP congestion algorithm, so it is clearly not a result of TCP mistuning between kernel versions. I'm running 0.55 gf9d090e, and the kernel was upgraded within the 3.6 branch. Unfortunately, the message disappeared too soon, before I could pin it down and change the logging level on all involved OSD daemons. Since there is absolutely no harm, may I ask for suggestions on repeating / raising the probability of this bug, so I can do the appropriate logging?
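For next time, the log levels can be raised at runtime without a daemon restart via injectargs; a sketch against the involved osd (on these releases the spelling below should work; newer ones also accept ``ceph tell osd.N ...''):

$ ceph osd tell 20 injectargs '--debug-ms 1 --debug-osd 20'
$ ceph osd tell \* injectargs '--debug-ms 1 --debug-osd 20'    # or all osds at once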
Re: Hangup during scrubbing - possible solutions
On Sat, Dec 1, 2012 at 9:07 AM, Samuel Just sam.j...@inktank.com wrote:
> Just pushed a fix to next, 49f32cee647c5bd09f36ba7c9fd4f481a697b9d7. Let me know if it persists. Thanks for the logs!
> -Sam

Very nice, thanks! There is one corner case - the ``on-the-fly'' upgrade works well only if your patch is applied to the ``generic'' 0.54 by cherry-picking; an online upgrade from the tagged 0.54 to next-dccf6ee causes the osd processes on the upgraded nodes to fail shortly after restart, with the backtrace below. An offline upgrade, i.e. over a shutdown of the entire cluster, works fine, so the only problem is preserving the running state of the cluster over the upgrade, which may confuse some users (at least ones who run production suites).

http://xdel.ru/downloads/ceph-log/bt-recovery-sj-patch.out.gz

> On Fri, Nov 30, 2012 at 2:04 PM, Samuel Just sam.j...@inktank.com wrote:
>> Hah! Thanks for the log, it's our handling of active_pushes. I'll have a patch shortly. Thanks!
>> -Sam
>
> [rest of the quoted thread snipped - see the earlier messages below]
Re: Hangup during scrubbing - possible solutions
http://xdel.ru/downloads/ceph-log/ceph-scrub-stuck.log.gz
http://xdel.ru/downloads/ceph-log/cluster-w.log.gz

Here, please. I have initiated a deep-scrub of osd.1, which led to forever-stuck I/O requests in a short time (a plain scrub will do the same). The second log may be useful for proper timestamps, as seeks in the original may take a long time. The osd processes on the specific node were restarted twice - at the beginning, to be sure all config options were applied, and at the end, to do the same plus get rid of the stuck requests.

On Wed, Nov 28, 2012 at 5:35 AM, Samuel Just sam.j...@inktank.com wrote:
> If you can reproduce it again, what we really need are the osd logs from the acting set of a pg stuck in scrub, with debug osd = 20, debug ms = 1, debug filestore = 20. Thanks,
> -Sam
>
> [rest of the quoted thread snipped - see the messages below]
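Until such a scheduler exists, the closest knobs are the scrub interval options in ceph.conf (the deep-scrub interval only appeared around bobtail); pushing the intervals up effectively defers automated scrubbing - the values below are illustrative only:

[osd]
    osd scrub min interval = 86400      # seconds; don't scrub a pg more often than daily
    osd scrub max interval = 604800     # force a scrub at least weekly
    osd deep scrub interval = 604800    # newer releases only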
Re: parsing in the ceph osd subsystem
On Thu, Nov 29, 2012 at 8:34 PM, Sage Weil s...@inktank.com wrote:
> On Thu, 29 Nov 2012, Andrey Korolyov wrote:
>> $ ceph osd down -
>> osd.0 is already down
>> $ ceph osd down ---
>> osd.0 is already down
>>
>> the same for ``+'', ``/'', ``%'' and so on - I think that for the osd subsystem the ceph cli should explicitly work only with positive integers plus zero, refusing all other input.
>
> Which branch is this? This parsing is cleaned up in the latest next/master.

It was produced by the 0.54 tag. I have built dd3a24a647d0b0f1153cf1b102ed1f51d51be2f2 today and the problem has gone (except for parsing ``-0'' as 0, and ``0/001'' as 0 and 1 respectively).
Re: endless flying slow requests
On Thu, Nov 29, 2012 at 1:12 AM, Samuel Just sam.j...@inktank.com wrote:
> Also, these clusters aren't mixed argonaut and next, are they? (Not that that shouldn't work, but it would be a useful data point.)
> -Sam
>
> On Wed, Nov 28, 2012 at 1:11 PM, Samuel Just sam.j...@inktank.com wrote:
>> Did you observe hung io along with that error? Both sub_op_commit and sub_op_applied have happened, so the sub_op_reply should have been sent back to the primary. This looks more like a leak. If you also observed hung io, then it's possible that the problem is occurring between the sub_op_applied event and the response.
>> -Sam

It is relatively easy to check whether one of the client VMs has locked one or more cores in iowait or simply hangs, so yes, these ops are related to real commit operations and they are hung. I'm using an all-new 0.54 cluster, without mixing of course. Has everyone who hit this bug readjusted the cluster before the bug showed itself (say, within a day)?

> [rest of the quoted thread snipped - see the message below]
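For completeness, enabling the admin socket for librbd-backed VMs as discussed in this thread is a [client] setting on the KVM host; the path template and the client name in the example invocation are illustrative, and the dump command name follows Sage's note below:

[client]
    admin socket = /var/run/ceph/$name.$pid.asok

$ ceph --admin-daemon /var/run/ceph/client.admin.12345.asok help
$ ceph --admin-daemon /var/run/ceph/client.admin.12345.asok objecter_dump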
Re: endless flying slow requests
On Wed, Nov 28, 2012 at 5:51 AM, Sage Weil s...@inktank.com wrote:
> Hi Stefan,
>
> On Thu, 15 Nov 2012, Sage Weil wrote:
>> On Thu, 15 Nov 2012, Stefan Priebe - Profihost AG wrote:
>>> Am 14.11.2012 15:59, schrieb Sage Weil:
>>>> Hi Stefan,
>>>> It would be nice to confirm that no clients are waiting on replies for these requests; currently we suspect that the OSD request tracking is the buggy part. If you query the OSD admin socket you should be able to dump requests and see the client IP, and then query the client. Is it librbd? In that case you likely need to change the config so that it is listening on an admin socket ('admin socket = path').
>>>
>>> Yes it is. So i have to specify admin socket at the KVM host?
>>
>> Right. IIRC the disk line is a ; (or \;) separated list of key/value pairs.
>>
>>> How do i query the admin socket for requests?
>>
>> ceph --admin-daemon /path/to/socket help
>> ceph --admin-daemon /path/to/socket objecter_dump (i think)
>
> Were you able to reproduce this?
>
> Thanks!
> sage

Meanwhile, I did. :) Such requests will always be created if you have restarted, or marked an osd out and then back in, and a scrub didn't happen in the meantime (after such an operation and before the request's arrival). What is more interesting, the hangup happens not exactly at the time of the operation, but tens of minutes later.

{ description: osd_sub_op(client.1292013.0:45422 4.731 a384cf31\/rbd_data.1415fb1075f187.00a7\/head\/\/4 [] v 16444'21693 snapset=0=[]:[] snapc=0=[]),
  received_at: 2012-11-28 03:54:43.094151,
  age: 27812.942680,
  duration: 2.676641,
  flag_point: started,
  events: [
    { time: 2012-11-28 03:54:43.094222, event: waiting_for_osdmap},
    { time: 2012-11-28 03:54:43.386890, event: reached_pg},
    { time: 2012-11-28 03:54:43.386894, event: started},
    { time: 2012-11-28 03:54:43.386973, event: commit_queued_for_journal_write},
    { time: 2012-11-28 03:54:45.360049, event: write_thread_in_journal_buffer},
    { time: 2012-11-28 03:54:45.586183, event: journaled_completion_queued},
    { time: 2012-11-28 03:54:45.586262, event: sub_op_commit},
    { time: 2012-11-28 03:54:45.770792, event: sub_op_applied}]}]}

> On Wed, 14 Nov 2012, Stefan Priebe - Profihost AG wrote:
>> Hello list,
>>
>> i see this several times. Endless flying slow requests. And they never stop until i restart the mentioned osd.
>>
>> 2012-11-14 10:11:57.513395 osd.24 [WRN] 1 slow requests, 1 included below; oldest blocked for 31789.858457 secs
>> 2012-11-14 10:11:57.513399 osd.24 [WRN] slow request 31789.858457 seconds old, received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719 rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 3.3f6d2373) v4 currently delayed
>> 2012-11-14 10:11:58.513584 osd.24 [WRN] 1 slow requests, 1 included below; oldest blocked for 31790.858646 secs
>> 2012-11-14 10:11:58.513586 osd.24 [WRN] slow request 31790.858646 seconds old, received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719 rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 3.3f6d2373) v4 currently delayed
>> 2012-11-14 10:11:59.513766 osd.24 [WRN] 1 slow requests, 1 included below; oldest blocked for 31791.858827 secs
>> 2012-11-14 10:11:59.513768 osd.24 [WRN] slow request 31791.858827 seconds old, received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719 rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 3.3f6d2373) v4 currently delayed
>> 2012-11-14 10:12:00.513909 osd.24 [WRN] 1 slow requests, 1 included below; oldest blocked for 31792.858971 secs
>> 2012-11-14 10:12:00.513916 osd.24 [WRN] slow request 31792.858971 seconds old, received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719 rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 3.3f6d2373) v4 currently delayed
>> 2012-11-14 10:12:01.514061 osd.24 [WRN] 1 slow requests, 1 included below; oldest blocked for 31793.859124 secs
>> 2012-11-14 10:12:01.514063 osd.24 [WRN] slow request 31793.859124 seconds old, received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719 rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 3.3f6d2373) v4 currently delayed
>>
>> When i now restart osd 24 they go away and everything is fine again.
>>
>> Stefan
Re: Hangup during scrubbing - possible solutions
On Fri, Nov 23, 2012 at 12:35 AM, Sage Weil s...@inktank.com wrote:
> On Thu, 22 Nov 2012, Andrey Korolyov wrote:
>> [original problem description snipped - see the ``Hangup during scrubbing - possible solutions'' message below]
>>
>> First of all, I'll be happy to help solve this problem by providing logs.
>
> If you can reproduce this behavior with 'debug osd = 20' and 'debug ms = 1' logging on the OSD, that would be wonderful!

I have tested a slightly different recovery flow, please see below. Since there is no real harm like frozen I/O, the placement groups were merely stuck forever in the active+clean+scrubbing state, until I restarted all osds (end of the log):

http://xdel.ru/downloads/ceph-log/recover-clients-later-than-osd.txt.gz

- start the healthy cluster
- start persistent clients
- add another host with a pair of OSDs, let them take part in the data placement
- wait for the data to rearrange
- [22:06 timestamp] mark the OSDs out, or simply kill them and wait (since I have a half-hour delay on readjustment in such a case, I did ``ceph osd out'' manually)
- watch the data rearrange again
- [22:51 timestamp] when it ends, start a manual rescrub; a non-zero number of placement groups end up in the active+clean+scrubbing state at the end of the process, and they will stay in that state forever until something happens

After that, I can restart the osds one by one, if I want to get rid of the scrubbing states immediately, and then do a deep-scrub (if I don't, those states will return at the next ceph self-scrub), or do a per-osd deep-scrub if I have a lot of time. The case I described in the previous message took place when I removed an osd from the data placement which had existed at the moment the client(s) started, and indeed it is more harmful than the current one (frozen I/O hangs the entire guest, for example). Since testing this flow took a lot of time, I'll send the logs related to that case tomorrow.

>> The second question is not directly related to this problem, but I have thought about it for a long time - are there planned features to control the scrub process more precisely, e.g. a pg scrub rate or scheduled scrubs, instead of the current set of timeouts, which are of course not very predictable as to when they run?
>
> Not yet. I would be interested in hearing what kind of control/config options/whatever you (and others) would like to see!

It would be awesome to have a deterministic scheduler, or at least an option to disable automated scrubbing, since it is not very deterministic in time, and a deep-scrub eats a lot of I/O when the command is issued against an entire OSD. Rate limiting is not in first place - at least it can be recreated in an external script - but for those who prefer to leave control to Ceph it may be very useful.

Thanks!
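A quick way to watch for the stuck state described above from the cli:

$ ceph pg dump | grep -c 'active+clean+scrubbing'   # count of pgs lingering in scrubbing
$ ceph -s                                           # the summary line also shows scrubbing pgs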
Re: 'zombie snapshot' problem
On Thu, Nov 22, 2012 at 2:05 AM, Josh Durgin josh.dur...@inktank.com wrote:
> On 11/21/2012 04:50 AM, Andrey Korolyov wrote:
>> Hi,
>>
>> Somehow I have managed to produce an unkillable snapshot, which does not allow removing itself or its parent image:
>>
>> $ rbd snap purge dev-rack0/vm2
>> Removing all snapshots: 100% complete...done.
>
> I see one bug with 'snap purge' ignoring the return code when removing snaps. I just fixed this in the next branch. It's probably getting the same error as 'rbd snap rm' below. Could you post the output of:
>
> rbd snap purge dev-rack0/vm2 --debug-ms 1 --debug-rbd 20
>
>> [rest of the transcript snipped - see the original message below]
>>
>> Meanwhile, ``rbd ls -l dev-rack0'' segfaults with an attached log. Is there any reliable way to kill the problematic snap?
>
> From this log it looks like vm2 used to be a clone, and the snapshot vm2.snap-yxf was taken before it was flattened. Later, the parent of vm2.snap-yxf was deleted. Is this correct?

I have attached the log you asked for, hope it will be useful. Here are two possible flows, with the snapshot created before and during flatten.

Completely linear flow:

$ rbd cp install/debian7 dev-rack0/testimg
Image copy: 100% complete...done.
$ rbd snap create --snap test1 dev-rack0/testimg
$ rbd snap clone --snap test1 dev-rack0/testimg dev-rack0/testimg2
rbd: error parsing command 'clone'
$ rbd snap protect --snap test1 dev-rack0/testimg
$ rbd clone --snap test1 dev-rack0/testimg dev-rack0/testimg2
$ rbd snap create --snap test2 dev-rack0/testimg2
$ rbd flatten dev-rack0/testimg2
Image flatten: 100% complete...done.
$ rbd snap unprotect --snap test1 dev-rack0/testimg
2012-11-22 15:11:03.446892 7ff9fb7c1780 -1 librbd: snap_unprotect: can't unprotect; at least 1 child(ren) in pool dev-rack0
rbd: unprotecting snap failed: (16) Device or resource busy
$ rbd snap purge dev-rack0/testimg2
Removing all snapshots: 100% complete...done.
$ rbd snap ls dev-rack0/testimg2
$ rbd snap unprotect --snap test1 dev-rack0/testimg

Snapshot created over an image with ``flatten'' in progress:

$ rbd snap create --snap test3 dev-rack0/testimg
$ rbd snap protect --snap test3 dev-rack0/testimg
$ rbd clone --snap test3 dev-rack0/testimg dev-rack0/testimg3
$ rbd flatten dev-rack0/testimg3
[here was executed: rbd snap create --snap test43 dev-rack0/testimg3]
Image flatten: 100% complete...done.
$ rbd snap unprotect --snap test3 dev-rack0/testimg
$ rbd snap ls dev-rack0/testimg3
SNAPID NAME   SIZE
   323 test43 640 MB
$ rbd snap purge dev-rack0/testimg3
Removing all snapshots: 100% complete...done.
$ rbd snap ls dev-rack0/testimg3
SNAPID NAME   SIZE
   323 test43 640 MB
$ rbd snap rm --snap test43 dev-rack0/testimg3
rbd: failed to remove snapshot: (2) No such file or directory

Hooray, the problem is found! For now I'll avoid it by treating the flatten state as exclusive over the image.

>> ceph version 0.54 (commit:60b84b095b1009a305d4d6a5b16f88571cbd3150)
>
> It was a bug in 0.53 that protected snapshots could be deleted.
>
> Josh

[attachment: snap.txt.gz]
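Given the reproducer above, a sketch of the ordering that avoids the zombie state on affected versions - image names are hypothetical - is simply to let flatten finish before taking any snapshot of the clone:

$ rbd clone --snap test3 dev-rack0/testimg dev-rack0/img4
$ rbd flatten dev-rack0/img4           # wait for 100% before snapshotting
$ rbd snap create --snap s1 dev-rack0/img4
$ rbd snap purge dev-rack0/img4        # now completes for real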
Hangup during scrubbing - possible solutions
Hi,

In recent versions, Ceph introduces some unexpected behavior for permanent connections (VM or kernel clients) - after crash recovery, I/O will hang on the next planned scrub in the following scenario:

- launch a bunch of clients doing non-intensive writes,
- lose one or more osds, mark them down, wait for recovery completion,
- do a slow scrub, e.g. scrubbing one osd per 5m inside a bash script, or wait for ceph to do the same,
- observe a rising number of pgs stuck in the active+clean+scrubbing state (they took over the primary role from ones which were on the killed osd, and almost surely they were being written to at the time of the crash),
- some time later, clients will hang hard, and the ceph log reports stuck (old) I/O requests.

The only way to bring the clients back without losing their I/O state is a per-osd restart, which also helps to get rid of the active+clean+scrubbing pgs.

First of all, I'll be happy to help solve this problem by providing logs.

The second question is not directly related to this problem, but I have thought about it for a long time - are there planned features to control the scrub process more precisely, e.g. a pg scrub rate or scheduled scrubs, instead of the current set of timeouts, which are of course not very predictable as to when they run?

Thanks!
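For reference, the ``one osd per 5m'' slow scrub mentioned above can be as simple as this sketch (the osd id range is a placeholder):

#!/bin/bash
# kick a deep-scrub on each osd in turn, spaced five minutes apart
for id in $(seq 0 31); do
    ceph osd deep-scrub $id
    sleep 300
done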
'zombie snapshot' problem
Hi,

Somehow I have managed to produce an unkillable snapshot, which does not allow removing itself or its parent image:

$ rbd snap purge dev-rack0/vm2
Removing all snapshots: 100% complete...done.
$ rbd rm dev-rack0/vm2
2012-11-21 16:31:24.184626 7f7e0d172780 -1 librbd: image has snapshots - not removing
Removing image: 0% complete...failed.
rbd: image has snapshots - these must be deleted with 'rbd snap purge' before the image can be removed.
$ rbd snap ls dev-rack0/vm2
SNAPID NAME         SIZE
   188 vm2.snap-yxf 16384 MB
$ rbd info dev-rack0/vm2
rbd image 'vm2':
    size 16384 MB in 4096 objects
    order 22 (4096 KB objects)
    block_name_prefix: rbd_data.1fa164c960874
    format: 2
    features: layering
$ rbd snap rm --snap vm2.snap-yxf dev-rack0/vm2
rbd: failed to remove snapshot: (2) No such file or directory
$ rbd snap create --snap vm2.snap-yxf dev-rack0/vm2
rbd: failed to create snapshot: (17) File exists
$ rbd snap rollback --snap vm2.snap-yxf dev-rack0/vm2
Rolling back to snapshot: 100% complete...done.
$ rbd snap protect --snap vm2.snap-yxf dev-rack0/vm2
$ rbd snap unprotect --snap vm2.snap-yxf dev-rack0/vm2

Meanwhile, ``rbd ls -l dev-rack0'' segfaults with the attached log. Is there any reliable way to kill the problematic snap?

[attachment: log-crash.txt.gz]
Re: Authorization issues in the 0.54
On Thu, Nov 15, 2012 at 5:03 PM, Andrey Korolyov and...@xdel.ru wrote:
On Thu, Nov 15, 2012 at 5:12 AM, Yehuda Sadeh yeh...@inktank.com wrote:
On Wed, Nov 14, 2012 at 4:20 AM, Andrey Korolyov and...@xdel.ru wrote:
Hi, In 0.54 cephx is probably broken somehow:
$ ceph auth add client.qemukvm osd 'allow *' mon 'allow *' mds 'allow *' -i qemukvm.key
2012-11-14 15:51:23.153910 7ff06441f780 -1 read 65 bytes from qemukvm.key
added key for client.qemukvm
$ ceph auth list
...
client.admin
key: [xx]
caps: [mds] allow *
Note that for mds you just specify 'allow' and not 'allow *'. It shouldn't affect the stuff that you're testing though.
Thanks for the hint!
caps: [mon] allow *
caps: [osd] allow *
client.qemukvm
key: [yy]
caps: [mds] allow *
caps: [mon] allow *
caps: [osd] allow *
...
$ virsh secret-set-value --secret uuid --base64 yy
set username in the VM` xml...
$ virsh start testvm
kvm: -drive file=rbd:rbd/vm0:id=qemukvm:key=yy:auth_supported=cephx\;none:mon_host=192.168.10.125\:6789\;192.168.10.127\:6789\;192.168.10.129\:6789,if=none,id=drive-virtio-disk0,format=raw: could not open disk image rbd:rbd/vm0:id=qemukvm:key=yy:auth_supported=cephx\;none:mon_host=192.168.10.125\:6789\;192.168.10.127\:6789\;192.168.10.129\:6789: Operation not permitted
$ virsh secret-set-value --secret uuid --base64 xx
set username again to admin for the VM` disk
$ virsh start testvm
Finally, the vm started successfully. All rbd commands issued from the cli work fine with the appropriate credentials, and the qemu binary is linked against the same librbd as the running one. Does anyone have a suggestion?
There wasn't any change that I'm aware of that should make that happen. Can you reproduce it with 'debug ms = 1' and 'debug auth = 20'?
I`ll provide detailed logs some time later, after I do an upgrade of the production rack. The situation is quite strange: when I upgraded from an older version (tested with 0.51 and 0.53), auth stopped working exactly as above, and no action on the key (importing and elevating privileges, or importing with the maximum possible privileges) had any effect for an rbd-backed QEMU vm; only the ``admin'' credentials were able to pass authentication. When I finally reformatted the cluster using mkcephfs for 0.54, authentication worked with ``rwx'' rights on osd, where earlier ``rw'' was enough. It seemed to be some kind of bug in the monfs resulting in misbehaving authentication. Also, 0.53 to 0.54 was the first upgrade which made a version rollback impossible - the mons complained about an empty set of some ``missing features'' on start, so I recreated the monfs on every mon during an online downgrade (I know that downgrades are bad by nature, but since the on-disk format for the osds had not changed, I tried to do it).
Sorry, it was three overlapping factors: my inattention, the additional ``x'' attribute in the required key capabilities, and a ``backup'' mon left over from the time of the upgrade - I had simply forgotten to kill it, and this mon alone somehow caused authentication requests from qemu VMs to be dropped, while in the meantime allowing plain cluster operations using the ``rbd'' command and the same credentials (very, very strange). By the way, it seems that a monitor not included in the cluster can easily flood any of the existing mons if it has the same name, even if it is completely outside the authentication keyring. Output from the flooded mon is very close to #2645 by footprint. I would suggest it is reasonable to introduce temporary bans, or some other kind of foolproof behavior, for bad authentication requests on the monitors in the future. Thanks!
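For reference, a hedged sketch of a narrower key that satisfies the ``rwx''-on-osd requirement discovered above; the pool restriction and the get-or-create form are illustrative of later releases (a 0.54-era setup would create the key with ceph auth add, as shown above):

$ ceph auth get-or-create client.qemukvm mon 'allow r' osd 'allow rwx pool=rbd'
$ virsh secret-set-value --secret <uuid> --base64 "$(ceph auth get-key client.qemukvm)"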
Re: changed rbd cp behavior in 0.53
On Thu, Nov 15, 2012 at 8:43 PM, Deb Barba deb.ba...@inktank.com wrote:
This is not common UNIX/posix behavior. If you just give the destination a bare file name, it should assume . (the current directory) as its location, not whatever path you started from. I would expect most UNIX users would lose a lot of files if they copied from path x/y/z and just provided a new name; that would indicate they wanted it stashed in ., not cloned in path x/y/z. I am concerned this would confuse most users out in the field.
Thanks, Deborah Barba
Speaking of standards, the rbd layout is closer to the /dev layout, or at least to iSCSI targets, where not specifying the full path, or falling back on some predefined default prefix, makes no sense at all.
On Wed, Nov 14, 2012 at 10:43 PM, Andrey Korolyov and...@xdel.ru wrote:
On Thu, Nov 15, 2012 at 4:56 AM, Dan Mick dan.m...@inktank.com wrote:
On 11/12/2012 02:47 PM, Josh Durgin wrote:
On 11/12/2012 08:30 AM, Andrey Korolyov wrote:
Hi, For this version, rbd cp assumes that the destination pool is the same as the source, not 'rbd', if the pool in the destination path is omitted.
rbd cp install/img testimg
rbd ls install
img
testimg
Is this change permanent? Thanks!
This is a regression. The previous behavior will be restored for 0.54. I added http://tracker.newdream.net/issues/3478 to track it.
Actually, on detailed examination, it looks like this has been the behavior for a long time; I think the wiser course would be not to change this defaulting. One could argue the value of such defaulting, but it's also true that you can specify the source and destination pools explicitly. Andrey, any strong objection to leaving this the way it is?
I`m not complaining - this behavior seems more logical in the first place, and of course I use the full path even when doing something by hand.
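As Dan notes, the ambiguity disappears when both pools are spelled out; a minimal illustration using the names from the thread:

$ rbd cp install/img rbd/testimg      # explicit destination pool
$ rbd cp install/img install/testimg  # explicit copy within the source pool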
Authorization issues in the 0.54
Hi, In 0.54 cephx is probably broken somehow:
$ ceph auth add client.qemukvm osd 'allow *' mon 'allow *' mds 'allow *' -i qemukvm.key
2012-11-14 15:51:23.153910 7ff06441f780 -1 read 65 bytes from qemukvm.key
added key for client.qemukvm
$ ceph auth list
...
client.admin
key: [xx]
caps: [mds] allow *
caps: [mon] allow *
caps: [osd] allow *
client.qemukvm
key: [yy]
caps: [mds] allow *
caps: [mon] allow *
caps: [osd] allow *
...
$ virsh secret-set-value --secret uuid --base64 yy
set username in the VM` xml...
$ virsh start testvm
kvm: -drive file=rbd:rbd/vm0:id=qemukvm:key=yy:auth_supported=cephx\;none:mon_host=192.168.10.125\:6789\;192.168.10.127\:6789\;192.168.10.129\:6789,if=none,id=drive-virtio-disk0,format=raw: could not open disk image rbd:rbd/vm0:id=qemukvm:key=yy:auth_supported=cephx\;none:mon_host=192.168.10.125\:6789\;192.168.10.127\:6789\;192.168.10.129\:6789: Operation not permitted
$ virsh secret-set-value --secret uuid --base64 xx
set username again to admin for the VM` disk
$ virsh start testvm
Finally, the vm started successfully. All rbd commands issued from the cli work fine with the appropriate credentials, and the qemu binary is linked against the same librbd as the running one. Does anyone have a suggestion?
Re: changed rbd cp behavior in 0.53
On Thu, Nov 15, 2012 at 4:56 AM, Dan Mick dan.m...@inktank.com wrote: On 11/12/2012 02:47 PM, Josh Durgin wrote: On 11/12/2012 08:30 AM, Andrey Korolyov wrote: Hi, For this version, rbd cp assumes that destination pool is the same as source, not 'rbd', if pool in the destination path is omitted. rbd cp install/img testimg rbd ls install img testimg Is this change permanent? Thanks! This is a regression. The previous behavior will be restored for 0.54. I added http://tracker.newdream.net/issues/3478 to track it. Actually, on detailed examination, it looks like this has been the behavior for a long time; I think the wiser course would be not to change this defaulting. One could argue the value of such defaulting, but it's also true that you can specify the source and destination pools explicitly. Andrey, any strong objection to leaving this the way it is? I`m not complaining - this behavior seems more logical in the first place and of course I use full path even doing something by hand. -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
changed rbd cp behavior in 0.53
Hi, For this version, rbd cp assumes that the destination pool is the same as the source, not 'rbd', if the pool in the destination path is omitted:
rbd cp install/img testimg
rbd ls install
img
testimg
Is this change permanent? Thanks!
``rbd mv'' crash when no destination issued
Hi,
Please take a look; it seems harmless:
$ rbd mv vm0
terminate called after throwing an instance of 'std::logic_error'
what(): basic_string::_S_construct null not valid
*** Caught signal (Aborted) **
in thread 7f85f5981780
ceph version 0.53 (commit:2528b5ee105b16352c91af064af5c0b5a7d45d7c)
1: rbd() [0x431a92]
2: (()+0xfcb0) [0x7f85f451fcb0]
3: (gsignal()+0x35) [0x7f85f2c63405]
4: (abort()+0x17b) [0x7f85f2c66b5b]
5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f85f35616dd]
6: (()+0x637e6) [0x7f85f355f7e6]
7: (()+0x63813) [0x7f85f355f813]
8: (()+0x63a3e) [0x7f85f355fa3e]
9: (std::__throw_logic_error(char const*)+0x5d) [0x7f85f35b133d]
10: (char* std::string::_S_construct<char const*>(char const*, char const*, std::allocator<char> const&, std::forward_iterator_tag)+0xa9) [0x7f85f35bd3e9]
11: (std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(char const*, std::allocator<char> const&)+0x43) [0x7f85f35bd453]
12: (librbd::rename(librados::IoCtx&, char const*, char const*)+0x119) [0x7f85f5528099]
13: (main()+0x2ba5) [0x42aff5]
14: (__libc_start_main()+0xed) [0x7f85f2c4e76d]
15: rbd() [0x42d599]
2012-11-09 20:51:50.082452 7f85f5981780 -1 *** Caught signal (Aborted) **
in thread 7f85f5981780
ceph version 0.53 (commit:2528b5ee105b16352c91af064af5c0b5a7d45d7c)
1: rbd() [0x431a92]
2: (()+0xfcb0) [0x7f85f451fcb0]
3: (gsignal()+0x35) [0x7f85f2c63405]
4: (abort()+0x17b) [0x7f85f2c66b5b]
5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f85f35616dd]
6: (()+0x637e6) [0x7f85f355f7e6]
7: (()+0x63813) [0x7f85f355f813]
8: (()+0x63a3e) [0x7f85f355fa3e]
9: (std::__throw_logic_error(char const*)+0x5d) [0x7f85f35b133d]
10: (char* std::string::_S_construct<char const*>(char const*, char const*, std::allocator<char> const&, std::forward_iterator_tag)+0xa9) [0x7f85f35bd3e9]
11: (std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(char const*, std::allocator<char> const&)+0x43) [0x7f85f35bd453]
12: (librbd::rename(librados::IoCtx&, char const*, char const*)+0x119) [0x7f85f5528099]
13: (main()+0x2ba5) [0x42aff5]
14: (__libc_start_main()+0xed) [0x7f85f2c4e76d]
15: rbd() [0x42d599]
NOTE: a copy of the executable, or `objdump -rdS executable` is needed to interpret this.
--- begin dump of recent events ---
-47 2012-11-09 20:51:50.060811 7f85f5981780 5 asok(0x208ba50) register_command perfcounters_dump hook 0x208b9b0
-46 2012-11-09 20:51:50.060838 7f85f5981780 5 asok(0x208ba50) register_command 1 hook 0x208b9b0
-45 2012-11-09 20:51:50.060843 7f85f5981780 5 asok(0x208ba50) register_command perf dump hook 0x208b9b0
-44 2012-11-09 20:51:50.060858 7f85f5981780 5 asok(0x208ba50) register_command perfcounters_schema hook 0x208b9b0
-43 2012-11-09 20:51:50.060864 7f85f5981780 5 asok(0x208ba50) register_command 2 hook 0x208b9b0
-42 2012-11-09 20:51:50.060866 7f85f5981780 5 asok(0x208ba50) register_command perf schema hook 0x208b9b0
-41 2012-11-09 20:51:50.060868 7f85f5981780 5 asok(0x208ba50) register_command config show hook 0x208b9b0
-40 2012-11-09 20:51:50.060874 7f85f5981780 5 asok(0x208ba50) register_command config set hook 0x208b9b0
-39 2012-11-09 20:51:50.060877 7f85f5981780 5 asok(0x208ba50) register_command log flush hook 0x208b9b0
-38 2012-11-09 20:51:50.060880 7f85f5981780 5 asok(0x208ba50) register_command log dump hook 0x208b9b0
-37 2012-11-09 20:51:50.060886 7f85f5981780 5 asok(0x208ba50) register_command log reopen hook 0x208b9b0
-36 2012-11-09 20:51:50.063300 7f85f5981780 1 librados: starting msgr at :/0
-35 2012-11-09 20:51:50.063317 7f85f5981780 1 librados: starting objecter
-34 2012-11-09 20:51:50.063399 7f85f5981780 1 librados: setting wanted keys
-33 2012-11-09 20:51:50.063406 7f85f5981780 1 librados: calling monclient init
-32 2012-11-09 20:51:50.063589 7f85f5981780 2 auth: KeyRing::load: loaded key file /etc/ceph/keyring.bin
-31 2012-11-09 20:51:50.064794 7f85efc26700 5 throttle(msgr_dispatch_throttler-radosclient 0x2095a08) get 473 (0 -> 473)
-30 2012-11-09 20:51:50.064953 7f85efc26700 5 throttle(msgr_dispatch_throttler-radosclient 0x2095a08) get 33 (473 -> 506)
-29 2012-11-09 20:51:50.065026 7f85f1c2a700 1 monclient(hunting): found mon.2
-28 2012-11-09 20:51:50.065055 7f85f1c2a700 5 throttle(msgr_dispatch_throttler-radosclient 0x2095a08) put 473 (0x6a9d28 - 33)
-27 2012-11-09 20:51:50.065332 7f85f1c2a700 5 throttle(msgr_dispatch_throttler-radosclient 0x2095a08) put 33 (0x6a9d28 - 0)
-26 2012-11-09 20:51:50.065627 7f85efc26700 5 throttle(msgr_dispatch_throttler-radosclient 0x2095a08) get 206 (0 -> 206)
-25 2012-11-09 20:51:50.065798 7f85f1c2a700 5 throttle(msgr_dispatch_throttler-radosclient 0x2095a08) put 206 (0x6a9d28 - 0)
-24 2012-11-09 20:51:50.066082 7f85efc26700 5 throttle(msgr_dispatch_throttler-radosclient 0x2095a08) get 393 (0 -> 393)
-23 2012-11-09 20:51:50.066177 7f85f1c2a700 5 throttle(msgr_dispatch_throttler-radosclient
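The crash is just the CLI constructing a std::string from the missing destination argument on its way into librbd::rename(); with both arguments supplied the command works as documented (the new name here is illustrative):

$ rbd mv vm0 vm0-renamed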
Re: clock syncronisation
On Thu, Nov 8, 2012 at 4:00 PM, Wido den Hollander w...@widodh.nl wrote:
On 08-11-12 10:04, Stefan Priebe - Profihost AG wrote:
Hello list, is there any preferred way to do clock synchronisation? I've tried running openntpd and ntpd on all servers but i'm still getting:
2012-11-08 09:55:38.255928 mon.0 [WRN] message from mon.2 was stamped 0.063136s in the future, clocks not synchronized
2012-11-08 09:55:39.328639 mon.0 [WRN] message from mon.2 was stamped 0.063285s in the future, clocks not synchronized
2012-11-08 09:55:39.328833 mon.0 [WRN] message from mon.2 was stamped 0.063301s in the future, clocks not synchronized
2012-11-08 09:55:40.819975 mon.0 [WRN] message from mon.2 was stamped 0.063360s in the future, clocks not synchronized
What NTP server are you using? Network latency might cause the clocks not to be synchronised.
There is no real reason to worry here: the quorum only suffers from large desync delays, on the order of seconds or more. If you have unsynchronized clocks on the mon nodes with such big delays, requests issued from the cli, e.g. creating a new connection, may wait as long as the delay itself, depending on the clock value of the selected monitor node. Clock drift is caused mostly by heavy load, and of course playing with clocksources may have some effect (since most systems already use the HPET timer, there is only one real remedy: sync with an ntp server as frequently as needed to prevent drift).
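If the warnings themselves are the nuisance, the monitors' tolerance is tunable; a hedged ceph.conf sketch (the default for this option is 0.05 s, and raising it only hides small drift rather than fixing it):

[mon]
    mon clock drift allowed = 0.1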
Re: SSD journal suggestion
On Thu, Nov 8, 2012 at 7:02 PM, Atchley, Scott atchle...@ornl.gov wrote:
On Nov 8, 2012, at 10:00 AM, Scott Atchley atchle...@ornl.gov wrote:
On Nov 8, 2012, at 9:39 AM, Mark Nelson mark.nel...@inktank.com wrote:
On 11/08/2012 07:55 AM, Atchley, Scott wrote:
On Nov 8, 2012, at 3:22 AM, Gandalf Corvotempesta gandalf.corvotempe...@gmail.com wrote:
2012/11/8 Mark Nelson mark.nel...@inktank.com:
I haven't done much with IPoIB (just RDMA), but my understanding is that it tends to top out at like 15Gb/s. Some others on this mailing list can probably speak more authoritatively.
Even with RDMA you are going to top out at around 3.1-3.2GB/s.
15Gb/s is still faster than 10Gbe. But this speed limit seems to be kernel-related and should be the same even in a 10Gbe environment, or not?
We have a test cluster with Mellanox QDR HCAs (i.e. NICs). When using Verbs (the native IB API), I see ~27 Gb/s between two hosts. When running Sockets over these devices using IPoIB, I see 13-22 Gb/s depending on whether I use interrupt affinity and process binding. For our Ceph testing, we will set the affinity of two of the mlx4 interrupt handlers to cores 0 and 1 and we will not use process binding. For single stream Netperf, we do use process binding and bind it to the same core (i.e. 0) and we see ~22 Gb/s. For multiple, concurrent Netperf runs, we do not use process binding but we still see ~22 Gb/s.
Scott, this is very interesting! Does setting the interrupt affinity make the biggest difference then when you have concurrent netperf processes going? For some reason I thought that setting interrupt affinity wasn't even guaranteed in linux any more, but this is just some half-remembered recollection from a year or two ago.
We are using RHEL6 with a 3.5.1 kernel. I tested single stream Netperf with and without affinity:
Default (irqbalance running): 12.8 Gb/s
IRQ balance off: 13.0 Gb/s
IRQ affinity set to socket 0: 17.3 Gb/s # using the Mellanox script
When I set the affinity to cores 0-1 _and_ I bind Netperf to core 0, I get ~22 Gb/s for a single stream.
Did you try the Mellanox-baked modules for 2.6.32 before that?
Note, I used hwloc to determine which socket was closer to the mlx4 device on our dual socket machines. On these nodes, hwloc reported that both sockets were equally close, but a colleague has machines where one socket is closer than the other. In that case, bind to the closer socket (or to cores within the closer socket). We used all of the Mellanox tuning recommendations for IPoIB available in their tuning pdf: http://www.mellanox.com/related-docs/prod_software/Performance_Tuning_Guide_for_Mellanox_Network_Adapters.pdf We looked at their interrupt affinity setting scripts and then wrote our own. Our testing is with IPoIB in connected mode, not datagram mode. Connected mode is less scalable, but currently I only get ~3 Gb/s with datagram mode. Mellanox claims that we should get identical performance with both modes and we are looking into it. We are getting a new test cluster with FDR HCAs and I will look into those as well.
Nice! At some point I'll probably try to justify getting some FDR cards in house. I'd definitely like to hear how FDR ends up working for you.
I'll post the numbers when I get access after they are set up.
Scott
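A minimal sketch of the affinity setup described above, assuming the mlx4 vectors are visible in /proc/interrupts and that cores 0 and 1 sit on the socket closest to the HCA:

# stop irqbalance so it does not rewrite the masks behind our back
service irqbalance stop
# pin the first two mlx4 interrupt vectors to cores 0 and 1
irqs=$(awk '/mlx4/ {sub(":","",$1); print $1}' /proc/interrupts | head -2)
mask=1
for irq in $irqs; do
    printf '%x' "$mask" > /proc/irq/$irq/smp_affinity   # hex cpu bitmask
    mask=$((mask << 1))
done
# bind a single-stream netperf to core 0, as in the ~22 Gb/s result above
taskset -c 0 netperf -H server-host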
Re: less cores more iops / speed
On Thu, Nov 8, 2012 at 7:53 PM, Alexandre DERUMIER aderum...@odiso.com wrote:
So it is a problem of KVM which lets the processes jump between cores a lot.
Maybe numad from redhat can help? http://fedoraproject.org/wiki/Features/numad It tries to keep a process on the same numa node, and I think it also does some dynamic pinning.
Numad only keeps memory chunks on the preferred node; cpu pinning, which is the primary goal there, should be done separately via libvirt, or manually for the qemu process via a cpuset (see the sketch after this message). Libvirt does its pinning via taskset, and that seems to be broken at least in debian wheezy - even with an affinity mask set for the qemu process, load spreads all over the numa node, including cpus outside the set.
----- Original message -----
From: Stefan Priebe - Profihost AG s.pri...@profihost.ag
To: Mark Nelson mark.nel...@inktank.com
Cc: Joao Eduardo Luis joao.l...@inktank.com, ceph-devel@vger.kernel.org
Sent: Thursday, 8 November 2012 16:14:32
Subject: Re: less cores more iops / speed
On 08.11.2012 14:19, Mark Nelson wrote:
On 11/08/2012 02:45 AM, Stefan Priebe - Profihost AG wrote:
On 08.11.2012 01:59, Mark Nelson wrote:
There's also the context switching overhead. It'd be interesting to know how much the writer processes were shifting around on cores.
What do you mean by that? I'm talking about the KVM guest, not about the ceph nodes.
in this case, is fio bouncing around between cores?
Thanks, you're correct. If I bind fio to two cores on an 8 core VM it runs with 16,000 iops. So it is a problem of KVM, which lets the processes jump between cores a lot.
Greets, Stefan
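A hedged illustration of doing the pinning by hand; the domain name, vcpu ids and core numbers are made up:

# pin each guest vcpu to a dedicated host core via libvirt
$ virsh vcpupin testvm 0 2
$ virsh vcpupin testvm 1 3
# or constrain a running qemu process directly (assumes you know its pid)
$ taskset -pc 2,3 <qemu-pid>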
Re: BUG: kvm crashing in void librbd::AioCompletion::complete_request
On Mon, Nov 5, 2012 at 11:33 PM, Stefan Priebe s.pri...@profihost.ag wrote:
On 04.11.2012 15:12, Sage Weil wrote:
On Sun, 4 Nov 2012, Stefan Priebe wrote:
Can i merge wip-rbd-read into master?
Yeah. I'm going to do a bit more testing first before I do it, but it should apply cleanly. Hopefully later today.
Thanks - it seems to be fixed with wip-rbd-read, but I have a memory leak now. The kvm process grows and grows, by around 5GB of memory with each new test. Should I write a new mail?
Do you use qemu 1.2.0? It has a memory leak, at least with the rbd backend, but I haven't profiled it yet, so it may be something generic :)
Re: Different geoms for an rbd block device
On Wed, Oct 31, 2012 at 1:07 AM, Josh Durgin josh.dur...@inktank.com wrote:
On 10/28/2012 03:02 AM, Andrey Korolyov wrote:
Hi, Should the following behavior be considered normal?
$ rbd map test-rack0/debiantest --user qemukvm --secret qemukvm.key
$ fdisk /dev/rbd1
Command (m for help): p
Disk /dev/rbd1: 671 MB, 671088640 bytes
255 heads, 63 sectors/track, 81 cylinders, total 1310720 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 4194304 bytes / 4194304 bytes
Disk identifier: 0x00056f14
Device Boot Start End Blocks Id System
/dev/rbd1p1 2048 63487 30720 82 Linux swap / Solaris
Partition 1 does not start on physical sector boundary.
/dev/rbd1p2 63488 1292287 614400 83 Linux
Partition 2 does not start on physical sector boundary.
Meanwhile, in the guest vm over the same image:
fdisk /dev/vda
Command (m for help): p
Disk /dev/vda: 671 MB, 671088640 bytes
16 heads, 63 sectors/track, 1300 cylinders, total 1310720 sectors
I'm guessing the reported number of cylinders is the issue? You can control that with a qemu option. I think -drive ...cyls=81 will do it. You can also set the min/opt i/o sizes via the qemu device properties min_io_size and opt_io_size, in the same way you can adjust discard granularity: http://ceph.com/docs/master/rbd/qemu-rbd/#enabling-discard-trim Unfortunately min_io_size is a uint16 in qemu, so it won't be able to store 4194304.
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00056f14
Device Boot Start End Blocks Id System
/dev/vda1 2048 63487 30720 82 Linux swap / Solaris
/dev/vda2 63488 1292287 614400 83 Linux
The real pain starts when I try to repartition the disk after 'rbd map' using its geometry - it simply breaks the partition layout; for example, the first block offset moves from 2048 to 8192. Of course I can specify the geometry by hand, but before that I would need to start the vm at least once, or do something else which prints out the actual layout. Thanks!
Setting the geometry at qemu boot time should work, and is a bit easier. qemu actually has code to try to guess disk geometry from a partition table, but perhaps it doesn't support the format you're using. Josh
So the preferable geometry is the one provided by the kernel client, right? Are there any advantages to using large blocks for I/O with discard (of course not right now; I`ll wait for virtio bus support :) )? At first sight, TCP transfers should not differ in resulting speed on typical workloads, only on exotic ones - like delayed commit on the guest FS plus intensive writes.
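A sketch of the boot-time overrides Josh suggests, with illustrative values; on qemu of that era the geometry keys went on -drive, while opt_io_size is a virtio-blk device property (min_io_size is left out since, as noted, a uint16 cannot hold 4194304):

$ qemu-system-x86_64 ... \
    -drive file=rbd:test-rack0/debiantest,format=raw,if=none,id=drive0,cyls=81,heads=255,secs=63 \
    -device virtio-blk-pci,drive=drive0,opt_io_size=4194304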
Ignore O_SYNC for rbd cache
Hi,
Recent tests on my test rack with a 20G IB interconnect (ipoib, 64k mtu, default CUBIC, CFQ, LSI SAS 2108 w/ wb cache) show quite fantastic performance - on both reads and writes Ceph utilizes the disk bandwidth at up to 0.9 of the theoretical limit (the sum of all disk bandwidths, bearing in mind the replication level). The only thing that may drag overall performance down is O_SYNC|O_DIRECT writes, which will be issued by almost every database server in its default setup. Assuming the database config may be untouchable, and that somehow I can build a very reliable hardware setup which will never fail on power, should ceph have an option to ignore these flags? Maybe there are other real-world cases for including such an option - or I am very wrong to even think of fooling the client application in this way. Thank you for any suggestions!
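For context, the write pattern in question can be approximated from a guest with dd; a hedged one-liner (the target device is illustrative):

# O_DIRECT plus a data sync per write, roughly what a database journal issues
$ dd if=/dev/zero of=/dev/vda bs=4k count=1000 oflag=direct,dsync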
Re: Collection of strange lockups on 0.51
On Mon, Oct 1, 2012 at 8:42 PM, Tommi Virtanen t...@inktank.com wrote: On Sun, Sep 30, 2012 at 2:55 PM, Andrey Korolyov and...@xdel.ru wrote: Short post mortem - EX3200/12.1R2.9 may begin to drop packets (seems to appear more likely on 0.51 traffic patterns, which is very strange for L2 switching) when a bunch of the 802.3ad pairs, sixteen in my case, exposed to extremely high load - database benchmark over 700+ rbd-backed VMs and cluster rebalance at same time. It explains post-reboot lockups in igb driver and all types of lockups above. I would very appreciate any suggestions of switch models which do not expose such behavior in simultaneous conditions both off-list and in this thread. I don't see how a switch dropping packets would give an ethernet card driver any excuse to crash, but I'm simultaneously happy to hear that it doesn't seem like Ceph is at fault, and sorry for your troubles. I don't have an up to date 1GbE card recommendation to share, but I would recommend making sure you're using a recent Linux kernel. I have incorrectly formulated a reason - of course drops can not cause a lockup by themselves, but switch may create somehow a long-lasting `corrupt` state on the trunk ports which leads to such lockups at the ethernet card. Of course I`ll play with the driver versions and card|port settings, thanks for suggestion :) I`m still investigating the issue since it is a quite hard to repeat in the right time and hope I`m able to capture this state using tcpdump-like, e.g. s/w methods - if card driver locks on something, it may prevent to process problematic byte sequence at packet sniffer level. -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Collection of strange lockups on 0.51
On Thu, Sep 13, 2012 at 1:43 AM, Andrey Korolyov and...@xdel.ru wrote: On Thu, Sep 13, 2012 at 1:09 AM, Tommi Virtanen t...@inktank.com wrote: On Wed, Sep 12, 2012 at 10:33 AM, Andrey Korolyov and...@xdel.ru wrote: Hi, This is completely off-list, but I`m asking because only ceph trigger such a bug :) . With 0.51, following happens: if I kill an osd, one or more neighbor nodes may go to hanged state with cpu lockups, not related to temperature or overall interrupt count or la and it happens randomly over 16-node cluster. Almost sure that ceph triggerizing some hardware bug, but I don`t quite sure of which origin. Also after a short time after reset from such crash a new lockup may be created by any action. From the log, it looks like your ethernet driver is crapping out. [172517.057886] NETDEV WATCHDOG: eth0 (igb): transmit queue 7 timed out ... [172517.058622] [812b2975] ? netif_tx_lock+0x40/0x76 etc. The later oopses are talking about paravirt_write_msr etc, which makes me thing you're using Xen? You probably don't want to run Ceph servers inside virtualization (for production). NOPE. Xen was my choice for almost five years, but right now I am replaced it with kvm everywhere due to buggy 4.1 '-stable'. 4.0 has same poor network performance as 3.x but can be really named stable. All those backtraces comes from bare hardware. At the end you can see nice backtrace which comes out soon after end of the boot sequence when I manually typed 'modprobe rbd', it may be any other command assuming from experience. As soon as I don`t know anything about long-lasting states in intel, especially of those which will survive ipmi reset button, I think that first-sight complain about igb may be not quite right. If there cards may save some of runtime states to EEPROM and pull them back then I`m wrong. Short post mortem - EX3200/12.1R2.9 may begin to drop packets (seems to appear more likely on 0.51 traffic patterns, which is very strange for L2 switching) when a bunch of the 802.3ad pairs, sixteen in my case, exposed to extremely high load - database benchmark over 700+ rbd-backed VMs and cluster rebalance at same time. It explains post-reboot lockups in igb driver and all types of lockups above. I would very appreciate any suggestions of switch models which do not expose such behavior in simultaneous conditions both off-list and in this thread. [172696.503900] [8100d025] ? paravirt_write_msr+0xb/0xe [172696.503942] [810325f3] ? leave_mm+0x3e/0x3e and *then* you get [172695.041709] sd 0:2:0:0: [sda] megasas: RESET cmd=2a retries=0 [172695.041745] megasas: [ 0]waiting for 35 commands to complete [172696.045602] megaraid_sas: no pending cmds after reset [172696.045644] megasas: reset successful which just adds more awesomeness to the soup -- though I do wonder if this could be caused by the soft hang from earlier. -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: enabling cephx by default
On Tue, Sep 18, 2012 at 4:37 PM, Guido Winkelmann guido-c...@thisisnotatest.de wrote:
On Tuesday, 11 September 2012, 17:25:49 you wrote:
The next stable release will have cephx authentication enabled by default.
Hm, that could be a problem for me. I have tried multiple times to get cephx working in the past, without lasting success. (I cannot recall at the moment what the problem was the last time around, but it was probably qemu/libvirt.)
BTW, libvirt 0.10.x has somehow broken cephx support. It forms the same string for -drive as 0.9.x (at least in the log) but fails to pass authentication all the same.
IMHO, the documentation badly needs a high-level overview for cephx (or maybe I just haven't found it yet): what it does, what dangers it protects you from, and how it achieves that.
Guido
Re: enabling cephx by default
On Tue, Sep 18, 2012 at 5:34 PM, Andrey Korolyov and...@xdel.ru wrote: On Tue, Sep 18, 2012 at 4:37 PM, Guido Winkelmann guido-c...@thisisnotatest.de wrote: Am Dienstag, 11. September 2012, 17:25:49 schrieben Sie: The next stable release will have cephx authentication enabled by default. Hm, that could be a problem for me. I have tried multiple times to get cephx working in the past, without lasting success. (I cannot recall at the moment what the problem was the last time around, but it was probably qemu/libvirt.) BTW, libvirt 0.10.x has a broken cephx support somehow. It forms same string for -drive as 0.9x(at least in a log) but failing to pass authentication same moment. Please nevermind, I have build incorrect regex for log parsing previously. https://www.redhat.com/archives/libvirt-users/2012-September/msg00082.html IMHO, the documentation badly needs a high-level overview for cephx (or maybe I just haven't found it yet); what it does, what dangers it protects you from and how it achieves that. Guido -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Collection of strange lockups on 0.51
Hi,
This is completely off-list, but I`m asking because only ceph triggers such a bug :) . With 0.51 the following happens: if I kill an osd, one or more neighbor nodes may go into a hung state with cpu lockups, unrelated to temperature, overall interrupt count or load average, and it happens randomly across the 16-node cluster. I am almost sure that ceph is triggering some hardware bug, but I am not quite sure of its origin. Also, within a short time after a reset from such a crash, a new lockup may be produced by any action. Before blaming system drivers and continuing to investigate the problem, may I ask if someone has faced a similar problem? I am using 802.3ad bonding over a pair of Intel I350 NICs for general connectivity. I have attached a bit of the traces which were pushed to netconsole (in some cases the machine died hard, e.g. without even sending a final bye over netconsole, so the log is not complete).
netcon.log.gz Description: GNU Zip compressed data
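For anyone wanting to capture similar traces, a minimal netconsole setup of the sort used here; the addresses, ports, interface and MAC are placeholders:

# stream kernel messages over UDP to a collector at 10.0.0.2
$ modprobe netconsole netconsole=6665@10.0.0.1/eth0,6666@10.0.0.2/00:11:22:33:44:55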
Re: Collection of strange lockups on 0.51
On Thu, Sep 13, 2012 at 1:09 AM, Tommi Virtanen t...@inktank.com wrote: On Wed, Sep 12, 2012 at 10:33 AM, Andrey Korolyov and...@xdel.ru wrote: Hi, This is completely off-list, but I`m asking because only ceph trigger such a bug :) . With 0.51, following happens: if I kill an osd, one or more neighbor nodes may go to hanged state with cpu lockups, not related to temperature or overall interrupt count or la and it happens randomly over 16-node cluster. Almost sure that ceph triggerizing some hardware bug, but I don`t quite sure of which origin. Also after a short time after reset from such crash a new lockup may be created by any action. From the log, it looks like your ethernet driver is crapping out. [172517.057886] NETDEV WATCHDOG: eth0 (igb): transmit queue 7 timed out ... [172517.058622] [812b2975] ? netif_tx_lock+0x40/0x76 etc. The later oopses are talking about paravirt_write_msr etc, which makes me thing you're using Xen? You probably don't want to run Ceph servers inside virtualization (for production). NOPE. Xen was my choice for almost five years, but right now I am replaced it with kvm everywhere due to buggy 4.1 '-stable'. 4.0 has same poor network performance as 3.x but can be really named stable. All those backtraces comes from bare hardware. At the end you can see nice backtrace which comes out soon after end of the boot sequence when I manually typed 'modprobe rbd', it may be any other command assuming from experience. As soon as I don`t know anything about long-lasting states in intel, especially of those which will survive ipmi reset button, I think that first-sight complain about igb may be not quite right. If there cards may save some of runtime states to EEPROM and pull them back then I`m wrong. [172696.503900] [8100d025] ? paravirt_write_msr+0xb/0xe [172696.503942] [810325f3] ? leave_mm+0x3e/0x3e and *then* you get [172695.041709] sd 0:2:0:0: [sda] megasas: RESET cmd=2a retries=0 [172695.041745] megasas: [ 0]waiting for 35 commands to complete [172696.045602] megaraid_sas: no pending cmds after reset [172696.045644] megasas: reset successful which just adds more awesomeness to the soup -- though I do wonder if this could be caused by the soft hang from earlier. -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: OSD crash
Hi, Almost always one or more osd dies when doing overlapped recovery - e.g. add new crushmap and remove some newly added osds from cluster some minutes later during remap or inject two slightly different crushmaps after a short time(surely preserving at least one of replicas online). Seems that osd dying on excessive amount of operations in queue because under normal test, e.g. rados, iowait does not break one percent barrier but during recovery it may raise up to ten percents(2108 w/ cache, splitted disks as R0 each). #0 0x7f62f193a445 in raise () from /lib/x86_64-linux-gnu/libc.so.6 #1 0x7f62f193db9b in abort () from /lib/x86_64-linux-gnu/libc.so.6 #2 0x7f62f2236665 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6 #3 0x7f62f2234796 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6 #4 0x7f62f22347c3 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6 #5 0x7f62f22349ee in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6 #6 0x00844e11 in ceph::__ceph_assert_fail(char const*, char const*, int, char const*) () #7 0x0073148f in FileStore::_do_transaction(ObjectStore::Transaction, unsigned long, int) () #8 0x0073484e in FileStore::do_transactions(std::listObjectStore::Transaction*, std::allocatorObjectStore::Transaction* , unsigned long) () #9 0x0070c680 in FileStore::_do_op(FileStore::OpSequencer*) () #10 0x0083ce01 in ThreadPool::worker() () #11 0x006823ed in ThreadPool::WorkThread::entry() () #12 0x7f62f345ee9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0 #13 0x7f62f19f64cd in clone () from /lib/x86_64-linux-gnu/libc.so.6 #14 0x in ?? () ceph version 0.48.1argonaut (commit:a7ad701b9bd479f20429f19e6fea7373ca6bba7c) On Sun, Aug 26, 2012 at 8:52 PM, Andrey Korolyov and...@xdel.ru wrote: During recovery, following crash happens(simular to http://tracker.newdream.net/issues/2126 which marked resolved long ago): http://xdel.ru/downloads/ceph-log/osd-2012-08-26.txt On Sat, Aug 25, 2012 at 12:30 PM, Andrey Korolyov and...@xdel.ru wrote: On Thu, Aug 23, 2012 at 4:09 AM, Gregory Farnum g...@inktank.com wrote: The tcmalloc backtrace on the OSD suggests this may be unrelated, but what's the fd limit on your monitor process? You may be approaching that limit if you've got 500 OSDs and a similar number of clients. Thanks! I didn`t measured a # of connection because of bearing in mind 1 conn per client, raising limit did the thing. Previously mentioned qemu-kvm zombie does not related to rbd itself - it can be created by destroying libvirt domain which is in saving state or vice-versa, so I`ll put a workaround on this. Right now I am faced different problem - osds dying silently, e.g. not leaving a core, I`ll check logs on the next testing phase. On Wed, Aug 22, 2012 at 6:55 PM, Andrey Korolyov and...@xdel.ru wrote: On Thu, Aug 23, 2012 at 2:33 AM, Sage Weil s...@inktank.com wrote: On Thu, 23 Aug 2012, Andrey Korolyov wrote: Hi, today during heavy test a pair of osds and one mon died, resulting to hard lockup of some kvm processes - they went unresponsible and was killed leaving zombie processes ([kvm] defunct). Entire cluster contain sixteen osd on eight nodes and three mons, on first and last node and on vm outside cluster. 
osd bt: #0 0x7fc37d490be3 in tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned long, int) () from /usr/lib/libtcmalloc.so.4 (gdb) bt #0 0x7fc37d490be3 in tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned long, int) () from /usr/lib/libtcmalloc.so.4 #1 0x7fc37d490eb4 in tcmalloc::ThreadCache::Scavenge() () from /usr/lib/libtcmalloc.so.4 #2 0x7fc37d4a2287 in tc_delete () from /usr/lib/libtcmalloc.so.4 #3 0x008b1224 in _M_dispose (__a=..., this=0x6266d80) at /usr/include/c++/4.7/bits/basic_string.h:246 #4 ~basic_string (this=0x7fc3736639d0, __in_chrg=optimized out) at /usr/include/c++/4.7/bits/basic_string.h:536 #5 ~basic_stringbuf (this=0x7fc373663988, __in_chrg=optimized out) at /usr/include/c++/4.7/sstream:60 #6 ~basic_ostringstream (this=0x7fc373663980, __in_chrg=optimized out, __vtt_parm=optimized out) at /usr/include/c++/4.7/sstream:439 #7 pretty_version_to_str () at common/version.cc:40 #8 0x00791630 in ceph::BackTrace::print (this=0x7fc373663d10, out=...) at common/BackTrace.cc:19 #9 0x0078f450 in handle_fatal_signal (signum=11) at global/signal_handler.cc:91 #10 signal handler called #11 0x7fc37d490be3 in tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned long, int) () from /usr/lib/libtcmalloc.so.4 #12 0x7fc37d490eb4 in tcmalloc::ThreadCache::Scavenge() () from /usr/lib/libtcmalloc.so.4 #13 0x7fc37d49eb97 in tc_free () from /usr/lib/libtcmalloc.so.4 #14
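Checking and raising the monitor's descriptor limit, as Greg suggests above, is straightforward; a hedged sketch (the value is illustrative, and the 'max open files' option is applied by the init script at daemon start):

$ grep 'open files' /proc/$(pidof ceph-mon)/limits
# in ceph.conf:
[global]
    max open files = 65536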
Re: Ceph benchmarks
On Tue, Aug 28, 2012 at 12:47 AM, Sébastien Han han.sebast...@gmail.com wrote:
Hi community, For those of you who are interested, I performed several benchmarks of RADOS and RBD on different types of hardware and use cases. You can find my results here: http://www.sebastien-han.fr/blog/2012/08/26/ceph-benchmarks/ Hope it helps :) Feel free to comment, critique... :) Cheers!
My two cents - on an ultrafast journal (tmpfs) it matters which tcp congestion control algorithm you are using. With the default CUBIC delays, the aggregated sixteen-osd write speed is about 450MBps, but with DCTCP it rises up to 550MBps. For a device such as an SLC disk (ext4, w/o journal, commit=100) there is no observable difference - both times the aggregated speed measured about 330MBps. I have not yet tried H(S)TCP; it should do the same as DCTCP. For delays lower than regular gigabit ethernet, different congestion algorithms should show a bigger difference, though.
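Switching the congestion control algorithm for a quick A/B comparison is a one-liner, assuming the corresponding module is available in the running kernel (dctcp was an out-of-tree patch at the time; htcp ships with mainline):

$ sysctl net.ipv4.tcp_available_congestion_control
$ modprobe tcp_htcp
$ sysctl -w net.ipv4.tcp_congestion_control=htcp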
Re: OSD crash
On Thu, Aug 23, 2012 at 2:33 AM, Sage Weil s...@inktank.com wrote: On Thu, 23 Aug 2012, Andrey Korolyov wrote: Hi, today during heavy test a pair of osds and one mon died, resulting to hard lockup of some kvm processes - they went unresponsible and was killed leaving zombie processes ([kvm] defunct). Entire cluster contain sixteen osd on eight nodes and three mons, on first and last node and on vm outside cluster. osd bt: #0 0x7fc37d490be3 in tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned long, int) () from /usr/lib/libtcmalloc.so.4 (gdb) bt #0 0x7fc37d490be3 in tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned long, int) () from /usr/lib/libtcmalloc.so.4 #1 0x7fc37d490eb4 in tcmalloc::ThreadCache::Scavenge() () from /usr/lib/libtcmalloc.so.4 #2 0x7fc37d4a2287 in tc_delete () from /usr/lib/libtcmalloc.so.4 #3 0x008b1224 in _M_dispose (__a=..., this=0x6266d80) at /usr/include/c++/4.7/bits/basic_string.h:246 #4 ~basic_string (this=0x7fc3736639d0, __in_chrg=optimized out) at /usr/include/c++/4.7/bits/basic_string.h:536 #5 ~basic_stringbuf (this=0x7fc373663988, __in_chrg=optimized out) at /usr/include/c++/4.7/sstream:60 #6 ~basic_ostringstream (this=0x7fc373663980, __in_chrg=optimized out, __vtt_parm=optimized out) at /usr/include/c++/4.7/sstream:439 #7 pretty_version_to_str () at common/version.cc:40 #8 0x00791630 in ceph::BackTrace::print (this=0x7fc373663d10, out=...) at common/BackTrace.cc:19 #9 0x0078f450 in handle_fatal_signal (signum=11) at global/signal_handler.cc:91 #10 signal handler called #11 0x7fc37d490be3 in tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned long, int) () from /usr/lib/libtcmalloc.so.4 #12 0x7fc37d490eb4 in tcmalloc::ThreadCache::Scavenge() () from /usr/lib/libtcmalloc.so.4 #13 0x7fc37d49eb97 in tc_free () from /usr/lib/libtcmalloc.so.4 #14 0x7fc37d1c6670 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6 #15 0x7fc37d1c4796 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6 #16 0x7fc37d1c47c3 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6 #17 0x7fc37d1c49ee in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6 #18 0x00844e11 in ceph::__ceph_assert_fail (assertion=0x90c01c 0 == \unexpected error\, file=optimized out, line=3007, func=0x90ef80 unsigned int FileStore::_do_transaction(ObjectStore::Transaction, uint64_t, int)) at common/assert.cc:77 This means it got an unexpected error when talking to the file system. If you look in the osd log, it may tell you what that was. (It may not--there isn't usually the other tcmalloc stuff triggered from the assert handler.) What happens if you restart that ceph-osd daemon? sage Unfortunately I have completely disabled logs during test, so there are no suggestion of assert_fail. The main problem was revealed - created VMs was pointed to one monitor instead set of three, so there may be some unusual things(btw, crashed mon isn`t one from above, but a neighbor of crashed osds on first node). After IPMI reset node returns back well and cluster behavior seems to be okay - stuck kvm I/O somehow prevented even other module load|unload on this node, so I finally decided to do hard reset. Despite I`m using almost generic wheezy, glibc was updated to 2.15, may be because of this my trace appears first time ever. I`m almost sure that fs does not triggered this crash and mainly suspecting stuck kvm processes. 
I`ll rerun test with same conditions tomorrow(~500 vms pointed to one mon and very high I/O, but with osd logging). #19 0x0073148f in FileStore::_do_transaction (this=this@entry=0x2cde000, t=..., op_seq=op_seq@entry=429545, trans_num=trans_num@entry=0) at os/FileStore.cc:3007 #20 0x0073484e in FileStore::do_transactions (this=0x2cde000, tls=..., op_seq=429545) at os/FileStore.cc:2436 #21 0x0070c680 in FileStore::_do_op (this=0x2cde000, osr=optimized out) at os/FileStore.cc:2259 #22 0x0083ce01 in ThreadPool::worker (this=0x2cde828) at common/WorkQueue.cc:54 #23 0x006823ed in ThreadPool::WorkThread::entry (this=optimized out) at ./common/WorkQueue.h:126 #24 0x7fc37e3eee9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0 #25 0x7fc37c9864cd in clone () from /lib/x86_64-linux-gnu/libc.so.6 #26 0x in ?? () mon bt was exactly the same as in http://tracker.newdream.net/issues/2762 -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info
another performance-related thread
Hi,
I`ve finally managed to run an rbd-related test on relatively powerful machines, and here is what I got:
1) Reads on an almost fairly balanced cluster (eight nodes) did very well, utilizing almost all disk and network bandwidth (dual gbit 802.3ad nics and sata disks behind an lsi sas 2108 with wt cache gave me ~1.6Gbyte/s on linear and sequential reads, which is close to the overall disk throughput).
2) Writes do much worse, both in rados bench and in a fio test when I ran fio simultaneously on 120 vms - at best, overall performance is about 400Mbyte/s, using rados bench -t 12 on three host nodes.
fio config:
rw=(randread|randwrite|seqread|seqwrite)
size=256m
direct=1
directory=/test
numjobs=1
iodepth=12
group_reporting
name=random-read-direct
bs=1M
loops=12
For the 120 vm set, in Mbyte/s:
linear reads: MEAN: 14156 STDEV: 612.596
random reads: MEAN: 14128 STDEV: 911.789
linear writes: MEAN: 2956 STDEV: 283.165
random writes: MEAN: 2986 STDEV: 361.311
Each node holds 15 vms, and with a 64M rbd cache all three possible states - wb, wt and no-cache - show almost the same numbers in the tests. I wonder if it is possible to raise the write/read ratio somehow. It seems that the osds underutilize themselves; e.g. I am not able to get a single-threaded rbd write above 35Mb/s. Adding a second osd on the same disk only raises iowait time, not the benchmark results.
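For reference, a hedged form of the rados side of the comparison (the pool name and duration are illustrative; rados bench writes 4M objects by default):

$ rados bench -p rbd 60 write -t 12 --no-cleanup   # 12 concurrent writers
$ rados bench -p rbd 60 seq -t 12                  # sequential reads of the objects just written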
Re: another performance-related thread
On 07/31/2012 07:17 PM, Mark Nelson wrote: Hi Andrey! On 07/31/2012 10:03 AM, Andrey Korolyov wrote: Hi, I`ve finally managed to run rbd-related test on relatively powerful machines and what I have got: 1) Reads on almost fair balanced cluster(eight nodes) did very well, utilizing almost all disk and bandwidth (dual gbit 802.3ad nics, sata disks beyond lsi sas 2108 with wt cache gave me ~1.6Gbyte/s on linear and sequential reads, which is close to overall disk throughput) Does your 2108 have the RAID or JBOD firmware? I'm guessing the RAID firmware given that you are able to change the caching behavior? How do you have the arrays setup for the OSDs? Exactly, I am able to change cache behavior on-the-fly using 'famous' megacli binary. Each node contains three disks, each of them configured as raid0 single-disk - two 7200 server sata and intel 313 for journal. On satas I am using xfs with default mount options and on ssd I`ve put ext4 with disabled journal and of course with discard/noatime. This 2108 comes with SuperMicro firmware 2.120.243-1482 - guessing it is RAID variant and I didn`t tried to reflash it yet. For tests, I have forced write-through cache on - this should be very good at small writes aggregation. Before using such config, I have configured two disks to RAID0 and get slightly worse results on write bench. Thanks for suggesting to try JBOD firmware, I`ll do tests using it this week and post results. 2) Writes get much worse, both on rados bench and on fio test when I ran fio simularly on 120 vms - at it best, overall performance is about 400Mbyte/s, using rados bench -t 12 on three host nodes fio config: rw=(randread|randwrite|seqread|seqwrite) size=256m direct=1 directory=/test numjobs=1 iodepth=12 group_reporting name=random-ead-direct bs=1M loops=12 for 120 vm set, Mbyte/s linear reads: MEAN: 14156 STDEV: 612.596 random reads: MEAN: 14128 STDEV: 911.789 linear writes: MEAN: 2956 STDEV: 283.165 random writes: MEAN: 2986 STDEV: 361.311 each node holds 15 vms and for 64M rbd cache all possible three states - wb, wt and no-cache has almost same numbers at the tests. I wonder if it possible to raise write/read ratio somehow. Seems that osd underutilize itself, e.g. I am not able to get single-threaded rbd write to get above 35Mb/s. Adding second osd on same disk only raising iowait time, but not benchmark results. I've seen high IO wait times (especially with small writes) via rados bench as well. It's something we are actively investigating. Part of the issue with rados bench is that every single request is getting written to a seperate file, so especially at small IO sizes there is a lot of underlying filesystem metadata traffic. For us, this is happening on 9260 controllers with RAID firmware. I think we may see some improvement by switching to 2X08 cards with the JBOD (ie IT) firmware, but we haven't confirmed it yet. For 24 HT cores I have seen 2 percent iowait at most(at writes), so almost surely there is no IO bottleneck at all(except breaking the rule 'one osd per physical disk', when iowait raising up to 50 percent on entire system). Rados bench is not an universal measurement tool, thought - using VM` IO requests instead of manipulating rados objects will lead to almost fair result, by my opinion. We actually just purchased a variety of alternative RAID and SAS controllers to test with to see how universal this problem is. Theoretically RBD shouldn't suffer from this as badly as small writes to the same file should get buffered. 
The same is true for CephFS when doing buffered IO to a single file due to the Linux buffer cache. Small writes to many files will likely suffer in the same way that rados bench does though. -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
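The on-the-fly cache toggling mentioned above looks roughly like this with the megacli binary; the exact selector flags vary by version and are illustrative here:

$ MegaCli -LDSetProp WT -LAll -aAll   # force write-through on all logical drives
$ MegaCli -LDSetProp WB -LAll -aAll   # switch back to write-back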
Re: another performance-related thread
On 07/31/2012 07:53 PM, Josh Durgin wrote:
On 07/31/2012 08:03 AM, Andrey Korolyov wrote:
Hi, I`ve finally managed to run an rbd-related test on relatively powerful machines, and here is what I got:
1) Reads on an almost fairly balanced cluster (eight nodes) did very well, utilizing almost all disk and bandwidth (dual gbit 802.3ad nics, sata disks behind an lsi sas 2108 with wt cache gave me ~1.6Gbyte/s on linear and sequential reads, which is close to the overall disk throughput)
2) Writes do much worse, both in rados bench and in the fio test when I ran fio simultaneously on 120 vms - at best, overall performance is about 400Mbyte/s, using rados bench -t 12 on three host nodes
How are your osd journals configured? What's your ceph.conf for the osds?
fio config:
rw=(randread|randwrite|seqread|seqwrite)
size=256m
direct=1
directory=/test
numjobs=1
iodepth=12
group_reporting
name=random-read-direct
bs=1M
loops=12
for the 120 vm set, Mbyte/s:
linear reads: MEAN: 14156 STDEV: 612.596
random reads: MEAN: 14128 STDEV: 911.789
linear writes: MEAN: 2956 STDEV: 283.165
random writes: MEAN: 2986 STDEV: 361.311
each node holds 15 vms, and with a 64M rbd cache all three possible states - wb, wt and no-cache - show almost the same numbers in the tests. I wonder if it is possible to raise the write/read ratio somehow. It seems that the osds underutilize themselves; e.g. I am not able to get a single-threaded rbd write above 35Mb/s. Adding a second osd on the same disk only raises iowait time, not the benchmark results.
Are these write tests using direct I/O? That will bypass the cache for writes, which would explain the similar numbers with different cache modes.
I had previously forgotten that the direct flag may affect rbd cache behaviour. Without it, on the wb cache, the read rate remained the same and writes increased by roughly 15%:
random writes: MEAN: 3370 STDEV: 939.99
linear writes: MEAN: 3561 STDEV: 824.954
Re: ceph status reporting non-existing osd
On Thu, Jul 19, 2012 at 1:28 AM, Gregory Farnum g...@inktank.com wrote: On Wed, Jul 18, 2012 at 12:07 PM, Andrey Korolyov and...@xdel.ru wrote: On Wed, Jul 18, 2012 at 10:30 PM, Gregory Farnum g...@inktank.com wrote: On Wed, Jul 18, 2012 at 12:47 AM, Andrey Korolyov and...@xdel.ru wrote: On Wed, Jul 18, 2012 at 11:18 AM, Gregory Farnum g...@inktank.com wrote: On Tuesday, July 17, 2012 at 11:22 PM, Andrey Korolyov wrote: On Wed, Jul 18, 2012 at 10:09 AM, Gregory Farnum g...@inktank.com (mailto:g...@inktank.com) wrote: Hrm. That shouldn't be possible if the OSD has been removed. How did you take it out? It sounds like maybe you just marked it in the OUT state (and turned it off quite quickly) without actually taking it out of the cluster? -Greg As I have did removal, it was definitely not like that - at first place, I have marked osds(4 and 5 on same host) out, then rebuilt crushmap and then kill osd processes. As I mentioned before, osd.4 doest not exist in crushmap and therefore it shouldn`t be reported at all(theoretically). Okay, that's what happened — marking an OSD out in the CRUSH map means all the data gets moved off it, but that doesn't remove it from all the places where it's registered in the monitor and in the map, for a couple reasons: 1) You might want to mark an OSD out before taking it down, to allow for more orderly data movement. 2) OSDs can get marked out automatically, but the system shouldn't be able to forget about them on its own. 3) You might want to remove an OSD from the CRUSH map in the process of placing it somewhere else (perhaps you moved the physical machine to a new location). etc. You want to run ceph osd rm 4 5 and that should unregister both of them from everything[1]. :) -Greg [1]: Except for the full lists, which have a bug in the version of code you're running — remove the OSDs, then adjust the full ratios again, and all will be well. $ ceph osd rm 4 osd.4 does not exist $ ceph -s health HEALTH_WARN 1 near full osd(s) monmap e3: 3 mons at {0=192.168.10.129:6789/0,1=192.168.10.128:6789/0,2=192.168.10.127:6789/0}, election epoch 58, quorum 0,1,2 0,1,2 osdmap e2198: 4 osds: 4 up, 4 in pgmap v586056: 464 pgs: 464 active+clean; 66645 MB data, 231 GB used, 95877 MB / 324 GB avail mdsmap e207: 1/1/1 up {0=a=up:active} $ ceph health detail HEALTH_WARN 1 near full osd(s) osd.4 is near full at 89% $ ceph osd dump max_osd 4 osd.0 up in weight 1 up_from 2183 up_thru 2187 down_at 2172 last_clean_interval [2136,2171) 192.168.10.128:6800/4030 192.168.10.128:6801/4030 192.168.10.128:6802/4030 exists,up 68b3deec-e80a-48b7-9c29-1b98f5de4f62 osd.1 up in weight 1 up_from 2136 up_thru 2186 down_at 2135 last_clean_interval [2115,2134) 192.168.10.129:6800/2980 192.168.10.129:6801/2980 192.168.10.129:6802/2980 exists,up b2a26fe9-aaa8-445f-be1f-fa7d2a283b57 osd.2 up in weight 1 up_from 2181 up_thru 2187 down_at 2172 last_clean_interval [2136,2171) 192.168.10.128:6803/4128 192.168.10.128:6804/4128 192.168.10.128:6805/4128 exists,up 378d367a-f7fb-4892-9ec9-db8ffdd2eb20 osd.3 up in weight 1 up_from 2136 up_thru 2186 down_at 2135 last_clean_interval [2115,2134) 192.168.10.129:6803/3069 192.168.10.129:6804/3069 192.168.10.129:6805/3069 exists,up faf8eda8-55fc-4a0e-899f-47dbd32b81b8 Hrm. How did you create your new crush map? 
All the normal avenues of removing an OSD from the map set a flag which the PGMap uses to delete its records (which would prevent it reappearing in the full list), and I can't see how setcrushmap would remove an OSD from the map (although there might be a code path I haven't found).

Manually, by deleting the osd.4 and osd.5 entries and reweighting the remaining nodes.

So you extracted the CRUSH map, edited it, and injected it using ceph osd setcrushmap?

Yep, exactly.
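For anyone replaying this thread, the orderly removal sequence being described amounts to roughly the following (a sketch; the service-stop line is a hypothetical placeholder that varies by distribution and init system):

    ceph osd out 4                # mark out and let the data migrate off first
    service ceph stop osd.4       # stop the daemon once recovery finishes
    ceph osd crush remove osd.4   # drop it from the CRUSH map
    ceph auth del osd.4           # remove its authentication key
    ceph osd rm 4                 # unregister it from the osdmap

Skipping the last three steps is exactly what leaves a ghost osd.4 lingering in the monitor's records.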
Re: ceph status reporting non-existing osd
On Wed, Jul 18, 2012 at 10:09 AM, Gregory Farnum g...@inktank.com wrote:
On Monday, July 16, 2012 at 11:55 AM, Andrey Korolyov wrote:
On Mon, Jul 16, 2012 at 10:48 PM, Gregory Farnum g...@inktank.com wrote:

ceph pg set_full_ratio 0.95
ceph pg set_nearfull_ratio 0.94

On Monday, July 16, 2012 at 11:42 AM, Andrey Korolyov wrote:
On Mon, Jul 16, 2012 at 8:12 PM, Gregory Farnum g...@inktank.com wrote:
On Saturday, July 14, 2012 at 7:20 AM, Andrey Korolyov wrote:
On Fri, Jul 13, 2012 at 9:09 PM, Sage Weil s...@inktank.com wrote:
On Fri, 13 Jul 2012, Gregory Farnum wrote:
On Fri, Jul 13, 2012 at 1:17 AM, Andrey Korolyov and...@xdel.ru wrote:

Hi,

Recently I`ve reduced my test suite from 6 to 4 osds at ~60% usage on a six-node cluster, and I removed a bunch of rbd objects during recovery to avoid overfilling. Right now I`m constantly receiving a warning about a nearfull state on a non-existing osd:

health HEALTH_WARN 1 near full osd(s)
monmap e3: 3 mons at {0=192.168.10.129:6789/0,1=192.168.10.128:6789/0,2=192.168.10.127:6789/0}, election epoch 240, quorum 0,1,2 0,1,2
osdmap e2098: 4 osds: 4 up, 4 in
pgmap v518696: 464 pgs: 464 active+clean; 61070 MB data, 181 GB used, 143 GB / 324 GB avail
mdsmap e181: 1/1/1 up {0=a=up:active}

HEALTH_WARN 1 near full osd(s)
osd.4 is near full at 89%

Needless to say, osd.4 remains only in ceph.conf, not in the crushmap. The reduction was done 'on-line', i.e. without restarting the entire cluster.

Whoops! It looks like Sage has written some patches to fix this, but for now you should be good if you just update your ratios to a larger number, and then bring them back down again. :)

Restarting ceph-mon should also do the trick. Thanks for the bug report!
sage

Should I restart mons simultaneously?

I don't think restarting will actually do the trick for you — you actually will need to set the ratios again.

Restarting one by one had no effect, same as filling up the data pool to ~95 percent (btw, when I deleted this 50Gb file on cephfs, the mds was stuck permanently and usage remained the same until I dropped and recreated the data pool - I hope it`s one of the known posix layer bugs). I also deleted the entry from the config and then restarted the mons, with no effect. Any suggestions?

I'm not sure what you're asking about here?
-Greg

Oh, sorry, I misread and thought that you suggested filling up the osds. How can I set the full/nearfull ratios correctly?

$ ceph injectargs '--mon_osd_full_ratio 96'
parsed options
$ ceph injectargs '--mon_osd_near_full_ratio 94'
parsed options
$ ceph pg dump | grep 'full'
full_ratio 0.95
nearfull_ratio 0.85

Setting the parameters in ceph.conf and then restarting the mons does not affect the ratios either.

Thanks, it worked, but setting the values back brings the warning back.

Hrm. That shouldn't be possible if the OSD has been removed. How did you take it out? It sounds like maybe you just marked it in the OUT state (and turned it off quite quickly) without actually taking it out of the cluster?
-Greg

As I did the removal, it was definitely not like that - in the first place, I marked the osds (4 and 5, on the same host) out, then rebuilt the crushmap and then killed the osd processes. As I mentioned before, osd.4 does not exist in the crushmap and therefore it shouldn`t be reported at all (theoretically).
Re: ceph status reporting non-existing osd
On Wed, Jul 18, 2012 at 11:18 AM, Gregory Farnum g...@inktank.com wrote:
On Tuesday, July 17, 2012 at 11:22 PM, Andrey Korolyov wrote:

Hrm. That shouldn't be possible if the OSD has been removed. How did you take it out? It sounds like maybe you just marked it in the OUT state (and turned it off quite quickly) without actually taking it out of the cluster?
-Greg

As I did the removal, it was definitely not like that - in the first place, I marked the osds (4 and 5, on the same host) out, then rebuilt the crushmap and then killed the osd processes. As I mentioned before, osd.4 does not exist in the crushmap and therefore it shouldn`t be reported at all (theoretically).
Okay, that's what happened — marking an OSD out in the CRUSH map means all the data gets moved off it, but that doesn't remove it from all the places where it's registered in the monitor and in the map, for a couple of reasons:
1) You might want to mark an OSD out before taking it down, to allow for more orderly data movement.
2) OSDs can get marked out automatically, but the system shouldn't be able to forget about them on its own.
3) You might want to remove an OSD from the CRUSH map in the process of placing it somewhere else (perhaps you moved the physical machine to a new location).
etc. You want to run ceph osd rm 4 5 and that should unregister both of them from everything[1]. :)
-Greg
[1]: Except for the full lists, which have a bug in the version of code you're running — remove the OSDs, then adjust the full ratios again, and all will be well.
Re: ceph status reporting non-existing osd
On Wed, Jul 18, 2012 at 10:30 PM, Gregory Farnum g...@inktank.com wrote:
On Wed, Jul 18, 2012 at 12:47 AM, Andrey Korolyov and...@xdel.ru wrote:

Hrm. How did you create your new crush map? All the normal avenues of removing an OSD from the map set a flag which the PGMap uses to delete its records (which would prevent it reappearing in the full list), and I can't see how setcrushmap would remove an OSD from the map (although there might be a code path I haven't found).

Manually, by deleting the osd.4 and osd.5 entries and reweighting the remaining nodes.
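The manual editing described here corresponds to the standard extract/decompile/edit/recompile/inject cycle; a sketch (the filenames are arbitrary):

    ceph osd getcrushmap -o crush.bin     # grab the current map
    crushtool -d crush.bin -o crush.txt   # decompile to editable text
    # edit crush.txt: remove the osd.4/osd.5 device and bucket entries, fix weights
    crushtool -c crush.txt -o crush.new   # recompile
    ceph osd setcrushmap -i crush.new     # inject the edited map

As Greg notes, this path bypasses the bookkeeping that ceph osd rm performs, which is how the stale full-list entry survived.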
Re: ceph status reporting non-existing osd
On Mon, Jul 16, 2012 at 10:48 PM, Gregory Farnum g...@inktank.com wrote:

ceph pg set_full_ratio 0.95
ceph pg set_nearfull_ratio 0.94

On Monday, July 16, 2012 at 11:42 AM, Andrey Korolyov wrote:

Oh, sorry, I misread and thought that you suggested filling up the osds. How can I set the full/nearfull ratios correctly?

$ ceph injectargs '--mon_osd_full_ratio 96'
parsed options
$ ceph injectargs '--mon_osd_near_full_ratio 94'
parsed options
$ ceph pg dump | grep 'full'
full_ratio 0.95
nearfull_ratio 0.85

Setting the parameters in ceph.conf and then restarting the mons does not affect the ratios either.

Thanks, it worked, but setting the values back brings the warning back.
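A side note for anyone replaying this exchange: the injectargs calls above pass what look like percentages, but these options take fractions, which would explain why pg dump kept showing the old ratios. A corrected sketch, using the pg-level commands from Greg's reply (which are what actually update the PGMap):

    # ratios are fractions of capacity, not percentages
    ceph pg set_full_ratio 0.96
    ceph pg set_nearfull_ratio 0.94
    # verify the PGMap picked them up
    ceph pg dump | grep full_ratio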
Re: ceph status reporting non-existing osd
On Fri, Jul 13, 2012 at 9:09 PM, Sage Weil s...@inktank.com wrote:
On Fri, 13 Jul 2012, Gregory Farnum wrote:
On Fri, Jul 13, 2012 at 1:17 AM, Andrey Korolyov and...@xdel.ru wrote:

Hi,

Recently I`ve reduced my test suite from 6 to 4 osds at ~60% usage on a six-node cluster, and I removed a bunch of rbd objects during recovery to avoid overfilling. Right now I`m constantly receiving a warning about a nearfull state on a non-existing osd:

health HEALTH_WARN 1 near full osd(s)
monmap e3: 3 mons at {0=192.168.10.129:6789/0,1=192.168.10.128:6789/0,2=192.168.10.127:6789/0}, election epoch 240, quorum 0,1,2 0,1,2
osdmap e2098: 4 osds: 4 up, 4 in
pgmap v518696: 464 pgs: 464 active+clean; 61070 MB data, 181 GB used, 143 GB / 324 GB avail
mdsmap e181: 1/1/1 up {0=a=up:active}

HEALTH_WARN 1 near full osd(s)
osd.4 is near full at 89%

Needless to say, osd.4 remains only in ceph.conf, not in the crushmap. The reduction was done 'on-line', i.e. without restarting the entire cluster.

Whoops! It looks like Sage has written some patches to fix this, but for now you should be good if you just update your ratios to a larger number, and then bring them back down again. :)

Restarting ceph-mon should also do the trick. Thanks for the bug report!
sage

Should I restart mons simultaneously?

Restarting one by one had no effect, same as filling up the data pool to ~95 percent (btw, when I deleted this 50Gb file on cephfs, the mds was stuck permanently and usage remained the same until I dropped and recreated the data pool - I hope it`s one of the known posix layer bugs). I also deleted the entry from the config and then restarted the mons, with no effect. Any suggestions?