Re: [ceph-users] un-even data filled on OSDs
Hello!

On Fri, Jun 10, 2016 at 07:38:10AM +0530, swamireddy wrote:

> Blair - Thanks for the details. I used to set a low priority for
> recovery during the rebalance/recovery activity.
> Even though I set recovery_priority to 5 (instead of 1) and
> client-op_priority to 63, some of my customers complained that
> their VMs were not reachable for a few minutes/seconds during the
> rebalancing task. I am not sure these low-priority settings are doing
> their job as is.

That is true up to Hammer at least. I have no way to test it on a Jewel
setup due to my company policy (part of my cluster is already Jewel, but I
cannot continue the upgrade due to a direct directive).

> Thanks
> Swami

--
WBR, Max A. Krasilnikov
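For reference, the knobs usually involved in throttling recovery on a
Hammer-era cluster look like this (a minimal sketch; the values are
illustrative, not a recommendation):

---
# favor client ops over recovery ops
ceph tell osd.* injectargs '--osd-recovery-op-priority 1 --osd-client-op-priority 63'
# limit concurrent backfill/recovery work per OSD
ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'
---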
Re: [ceph-users] SSD randwrite performance
Hello!

On Wed, May 25, 2016 at 11:45:29AM +0900, chibi wrote:

> Hello,
> On Tue, 24 May 2016 21:20:49 +0300 Max A. Krasilnikov wrote:
>> Hello!
>>
>> I have a cluster with 5 SSD drives as OSDs backed by SSD journals, one
>> per OSD. One OSD per node.
>>
> More details will help identify other potential bottlenecks, such as:
> CPU/RAM
> Kernel, OS version.

For now I have 3x (OpenStack controller + ceph mon + 8x OSD (one of them SSD)).
All running Ubuntu 14.04 + Hammer from ubuntu-cloud, now moving to Ubuntu
14.04 + Ceph Jewel from the Ceph site.
E5-2620 v2 (12 cores)
32G RAM
Linux 4.2.0, moving to 4.4 from Xenial.

>> Data drives are Samsung 850 EVO 1TB, journals are Samsung 850 EVO 250G,
>> the journal partition is 24GB, the data partition is 790GB. OSD nodes are
>> connected by 2x10Gbps Linux bonding for the data/cluster network.
>>
> As Oliver wrote, these SSDs are totally unsuited for usage with Ceph,
> especially for journals.
> But also in general, since they neither handle IOPS in a consistent,
> predictable manner, nor are they durable (endurance, TBW) enough.

Yep, I understand. But on a second cluster with ScaleIO they do much better :(

> When using SSDs or NVMes, use DC level ones exclusively. Intel is the more
> tested one in these parts, but the Samsung DC level ones ought to be fine,
> too.

I can only hope my employer will provide me with them, but for now I have
to do my best with the current hardware :(

>> When doing random writes with 4k blocks with direct=1, buffered=0,
>> iodepth=32..1024, ioengine=libaio from a nova qemu virthost I can get no
>> more than 9 kiops. Randread is about 13-15 kiops.
>>
>> The trouble is that randwrite does not depend on iodepth. read and write
>> can be up to 140 kiops, randread up to 15 kiops. randwrite is always
>> 2-9 kiops.
>>
> Aside from the limitations of your SSDs, there are other factors, like CPU
> utilization.
> And very importantly also network latency, but that's for single threaded
> IOPS mostly.

>> The Ceph cluster is a mix of Jewel and Hammer, upgrading now to Jewel. On
>> Hammer I got the same results.
>>
> Mixed is a very bad state for a cluster to be in.
> Jewel has lots of improvements in that area, but w/o decent hardware you
> may not see them.

My cluster is upgrading now. 2 OSDs per night :), one node per week, with
changing the old 850 EVOs to new ones.

>> All journals can do up to 32 kiops with the same config for fio.
>>
>> I am confused because EMC ScaleIO can do much more IOPS, which is
>> bothering my boss :)
>>
> There are lots of discussions and slides on how to improve/maximize IOPS
> with Ceph, go search for them.
> Fast CPUs, jemalloc, pinning, configuration, NVMes for journals, etc.

I have seen a lot of them. Will try to use pinning, I have never used it
before.

> Christian
> --
> Christian Balzer    Network/Systems Engineer
> ch...@gol.com    Global OnLine Japan/Rakuten Communications
> http://www.gol.com/

--
WBR, Max A. Krasilnikov
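A common way to check whether an SSD is suitable as a Ceph journal is a
single-threaded O_DSYNC write test; a sketch (the target device is an
assumption, and the test destroys data on it):

---
fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k \
    --numjobs=1 --iodepth=1 --runtime=60 --time_based \
    --group_reporting --name=journal-test
---

DC-grade drives sustain this workload at thousands of IOPS; consumer drives
often collapse to a few hundred, which matches the advice above.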
[ceph-users] SSD randwrite performance
Hello!

I have a cluster with 5 SSD drives as OSDs backed by SSD journals, one per
OSD. One OSD per node.

Data drives are Samsung 850 EVO 1TB, journals are Samsung 850 EVO 250G, the
journal partition is 24GB, the data partition is 790GB. OSD nodes are
connected by 2x10Gbps Linux bonding for the data/cluster network.

When doing random writes with 4k blocks with direct=1, buffered=0,
iodepth=32..1024, ioengine=libaio from a nova qemu virthost I can get no
more than 9 kiops. Randread is about 13-15 kiops.

The trouble is that randwrite does not depend on iodepth. read and write can
be up to 140 kiops, randread up to 15 kiops. randwrite is always 2-9 kiops.

The Ceph cluster is a mix of Jewel and Hammer, upgrading now to Jewel. On
Hammer I got the same results.

All journals can do up to 32 kiops with the same config for fio.

I am confused because EMC ScaleIO can do much more IOPS, which is bothering
my boss :)

--
WBR, Max A. Krasilnikov
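The workload described above corresponds roughly to this fio invocation
(the target device inside the guest is an assumption):

---
fio --name=randwrite --ioengine=libaio --direct=1 --buffered=0 \
    --rw=randwrite --bs=4k --iodepth=32 --runtime=60 --time_based \
    --filename=/dev/vdb
---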
Re: [ceph-users] using jemalloc in trusty
Hello!

On Mon, May 23, 2016 at 02:34:37PM +0000, Somnath.Roy wrote:

> You need to build the ceph code base to use jemalloc for OSDs...
> LD_PRELOAD won't work...

Is that true for Xenial too, or only for Trusty? I don't want to rebuild
Jewel on Xenial hosts...

--
WBR, Max A. Krasilnikov
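For context, the two approaches under discussion look roughly like this;
the library path is an assumption, and per the answer above the LD_PRELOAD
route reportedly does not take effect for OSDs:

---
# the LD_PRELOAD attempt (reportedly ineffective for ceph-osd):
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.1 ceph-osd -i 0 -f

# building from source with jemalloc instead of tcmalloc:
./configure --with-jemalloc && make
---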
Re: [ceph-users] Jewel ubuntu release is half cooked
Hello!

On Mon, May 23, 2016 at 11:26:38AM +0100, andrei wrote:

> 1. Ceph journals - After performing the upgrade the ceph-osd processes are
> not starting. I've followed the instructions and chowned /var/lib/ceph
> (also see point 2 below). The issue relates to the journal partitions,
> which are not chowned due to the symlinks. Thus, the ceph user had no
> read/write access to the journal partitions. IMHO, this should be
> addressed at the documentation layer unless it can be easily and reliably
> dealt with by the installation script.

I hit the same trouble and had to chown the journal partitions. I also have
14.04, upgrading Ceph from Hammer (ubuntu-cloud archive) to Jewel (Ceph
site).

On another node, running Ubuntu 16.04 and ceph from the Ubuntu repo, no
such issues were found.

I prefer to upgrade Ceph first and upgrade all systems and OpenStack
services later.

--
WBR, Max A. Krasilnikov
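Since a recursive chown does not follow the journal symlink, the target has
to be chowned explicitly; a minimal sketch (the OSD id is illustrative):

---
stop ceph-osd id=13
chown ceph:ceph "$(readlink -f /var/lib/ceph/osd/ceph-13/journal)"
start ceph-osd id=13
---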
Re: [ceph-users] v0.94.7 Hammer released
Hello!

On Tue, May 17, 2016 at 10:04:41AM +0200, dan wrote:

> Hi Sage et al,
> I'm updating our pre-prod cluster from 0.94.6 to 0.94.7 and after
> upgrading the ceph-mons I'm getting loads of warnings like:
> 2016-05-17 10:01:29.314785 osd.76 [WRN] failed to encode map e103116
> with expected crc
> I've seen that error is whitelisted in the qa-suite:
> https://github.com/ceph/ceph-qa-suite/pull/602/files
> Is it really harmless? (This is the first time I've seen such a warning.)

I have the same warning when using some Jewel OSDs in a Hammer cluster
(during a step-by-step, per-node upgrade). No problems, just a warning in
the logs.

--
WBR, Max A. Krasilnikov
Re: [ceph-users] can I attach a volume to 2 servers
Hello!

On Mon, May 02, 2016 at 11:25:11AM -0400, forsaks.30 wrote:

> Hi Edward
> thanks for your explanation!
> Yes you are right.
> I just came across Sebastien Han's post, using NFS on top of RBD
> (http://www.sebastien-han.fr/blog/2012/07/06/nfs-over-rbd/)
> I will try this method.

Why not use CephFS? I prefer it, as I already have a Ceph cluster for RBD.

> On Mon, May 2, 2016 at 11:14 AM, Edward Huyer <erh...@rit.edu> wrote:
>> Mapping a single RBD on multiple servers isn't going to do what you want
>> unless you're putting some kind of clustered filesystem on it. Exporting
>> the filesystem via an NFS server will generally be simpler.

--
WBR, Max A. Krasilnikov
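Mounting CephFS on several clients at once is straightforward with the
kernel client; a minimal sketch (monitor address and keyfile are
assumptions):

---
mount -t ceph 192.168.0.1:6789:/ /mnt/cephfs \
    -o name=admin,secretfile=/etc/ceph/admin.secret
---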
Re: [ceph-users] Replace Journal
Hello!

On Fri, Apr 22, 2016 at 09:30:15AM +0200, martin.wilderoth wrote:

>>> I have a ceph cluster and I will change my journal devices to new SSDs.
>>>
>>> In some instructions on doing this they refer to a journal file (link
>>> to the UUID of the journal).
>>>
>>> In my OSD folder this journal doesn't exist.
>>
>> If your cluster is "years old" and not created with ceph-disk, then yes,
>> that's not surprising.
>> Mind, I created a recent one of mine manually and still used that scheme:
>> ---
>> ls -la /var/lib/ceph/osd/ceph-12/
>> total 80
>> drwxr-xr-x   4 root root  4096 Mar  1 14:44 .
>> drwxr-xr-x   8 root root  4096 Sep 10  2015 ..
>> -rw-r--r--   1 root root    37 Sep 10  2015 ceph_fsid
>> drwxr-xr-x 320 root root 24576 Mar  2 20:24 current
>> -rw-r--r--   1 root root    37 Sep 10  2015 fsid
>> lrwxrwxrwx   1 root root    44 Sep 10  2015 journal ->
>> /dev/disk/by-id/wwn-0x55cd2e404b77573c-part5
>> -rw-------   1 root root    57 Sep 10  2015 keyring
>> ---
>>
>> Ceph isn't magical, so if that link isn't there, you probably have
>> something like this in your ceph.conf, preferably with a UUID instead of
>> the possibly changing device name:
>> ---
>> [osd.0]
>> host = ceph-01
>> osd journal = /dev/sdc3
>> ---

> Yes that is my setup. Would that mean I could either create the symlink
> journal -> /dev/disk/.. and remove the osd journal entry from ceph.conf,
> or change my ceph.conf with osd journal = /dev/
> And the recommended way is actually to use the journal symlink?

I'm using symlinks to /dev/disk/by-partlabel/. It saves me from any trouble
when replacing journal SSDs. The same for mounts: I use LABEL= in fstab
because of changing device names when replacing HW in storage nodes. Of
course, anyone can set up proper udev rules, but I'm too lazy for these
exercises :)

--
WBR, Max A. Krasilnikov
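A sketch of the by-partlabel scheme described above (partition numbers and
labels are illustrative):

---
# give the journal partition a GPT name, then point the OSD at it
sgdisk --change-name=5:j-13 /dev/sdc
ln -sf /dev/disk/by-partlabel/j-13 /var/lib/ceph/osd/ceph-13/journal

# and mount the data partition by filesystem label in /etc/fstab:
LABEL=osd-13  /var/lib/ceph/osd/ceph-13  xfs  noatime  0 0
---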
Re: [ceph-users] Mon placement over wide area
Hello!

On Tue, Apr 12, 2016 at 07:48:58AM +0000, Maxime.Guyot wrote:

> Hi Adrian,
> Looking at the documentation, RadosGW has multi-region support with the
> "federated gateways"
> (http://docs.ceph.com/docs/master/radosgw/federated-config/):
> "When you deploy a Ceph Object Store service that spans geographical
> locales, configuring Ceph Object Gateway regions and metadata
> synchronization agents enables the service to maintain a global namespace,
> even though Ceph Object Gateway instances run in different geographic
> locales and potentially on different Ceph Storage Clusters."
> Maybe that could do the trick for your multi-metro EC pools?
> Disclaimer: I haven't tested the federated gateways RadosGW.

As I can see in the docs, Jewel should be able to perform per-image async
mirroring:

"There is new support for mirroring (asynchronous replication) of RBD
images across clusters. This is implemented as a per-RBD image journal that
can be streamed across a WAN to another site, and a new rbd-mirror daemon
that performs the cross-cluster replication."
© http://docs.ceph.com/docs/master/release-notes/

I will test it 1-2 months later this year :)

--
WBR, Max A. Krasilnikov
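In Jewel, the per-image mirroring mentioned above is driven roughly like
this (pool and image names are assumptions, and an rbd-mirror daemon must
run on the remote site):

---
# mirror selected images only, on both clusters
rbd mirror pool enable rbd image

# per image: journaling (and exclusive-lock) is a prerequisite
rbd feature enable rbd/myimage exclusive-lock,journaling
rbd mirror image enable rbd/myimage
---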
Re: [ceph-users] Deprecating ext4 support
Hello!

On Mon, Apr 11, 2016 at 05:39:37PM -0400, sage wrote:

> Hi,
> ext4 has never been recommended, but we did test it. After Jewel is out,
> we would like to explicitly recommend *against* ext4 and stop testing it.

1. Does filestore_xattr_use_omap fix the issues with ext4? That is, can I
continue using ext4 for a cluster with RBD && CephFS plus this option set
to true? (See the sketch below.)

2. I agree with Christian: it would be better to warn about, rather than
drop, support for the legacy fs until old HW is out of service, 4-5 years.

3. Also, if BlueStore turns out to be as good as promised, one would prefer
to use it instead of FileStore, so the fs deprecation would be not so
painful.

I'm not that great a Ceph user, but I have limitations like Christian's,
and changing the fs would cost me 24 nights right now :(

--
WBR, Max A. Krasilnikov
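The option from question 1, as a ceph.conf sketch (whether it fully works
around ext4's xattr limits is exactly what is being asked):

---
[osd]
filestore xattr use omap = true
---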
[ceph-users] recorded data digest != on disk
Hello!

I have a 3-node cluster running ceph version 0.94.6
(e832001feaf8c176593e0325c8298e3f16dfb403) on Ubuntu 14.04. When scrubbing
I get an error:

---
-9> 2016-03-21 17:36:09.047029 7f253a4f6700 5 -- op tracker -- seq: 48045, time: 2016-03-21 17:36:09.046984, event: all_read, op: osd_sub_op(unknown.0.0:0 5.ca 0//0//-1 [scrub-map] v 0'0 snapset=0=[]:[] snapc=0=[])
-8> 2016-03-21 17:36:09.047035 7f253a4f6700 5 -- op tracker -- seq: 48045, time: 0.00, event: dispatched, op: osd_sub_op(unknown.0.0:0 5.ca 0//0//-1 [scrub-map] v 0'0 snapset=0=[]:[] snapc=0=[])
-7> 2016-03-21 17:36:09.047066 7f254411b700 5 -- op tracker -- seq: 48045, time: 2016-03-21 17:36:09.047066, event: reached_pg, op: osd_sub_op(unknown.0.0:0 5.ca 0//0//-1 [scrub-map] v 0'0 snapset=0=[]:[] snapc=0=[])
-6> 2016-03-21 17:36:09.047086 7f254411b700 5 -- op tracker -- seq: 48045, time: 2016-03-21 17:36:09.047086, event: started, op: osd_sub_op(unknown.0.0:0 5.ca 0//0//-1 [scrub-map] v 0'0 snapset=0=[]:[] snapc=0=[])
-5> 2016-03-21 17:36:09.047127 7f254411b700 5 -- op tracker -- seq: 48045, time: 2016-03-21 17:36:09.047127, event: done, op: osd_sub_op(unknown.0.0:0 5.ca 0//0//-1 [scrub-map] v 0'0 snapset=0=[]:[] snapc=0=[])
-4> 2016-03-21 17:36:09.047173 7f253f912700 2 osd.13 pg_epoch: 23286 pg[5.ca( v 23286'8176779 (23286'8173729,23286'8176779] local-les=23286 n=8132 ec=114 les/c 23286/23286 23285/23285/23285) [13,21] r=0 lpr=23285 crt=23286'8176777 lcod 23286'8176778 mlcod 23286'8176778 active+clean+scrubbing+deep+repair] scrub_compare_maps osd.13 has 10 items
-3> 2016-03-21 17:36:09.047377 7f253f912700 2 osd.13 pg_epoch: 23286 pg[5.ca( v 23286'8176779 (23286'8173729,23286'8176779] local-les=23286 n=8132 ec=114 les/c 23286/23286 23285/23285/23285) [13,21] r=0 lpr=23285 crt=23286'8176777 lcod 23286'8176778 mlcod 23286'8176778 active+clean+scrubbing+deep+repair] scrub_compare_maps replica 21 has 10 items
-2> 2016-03-21 17:36:09.047983 7f253f912700 2 osd.13 pg_epoch: 23286 pg[5.ca( v 23286'8176779 (23286'8173729,23286'8176779] local-les=23286 n=8132 ec=114 les/c 23286/23286 23285/23285/23285) [13,21] r=0 lpr=23285 crt=23286'8176777 lcod 23286'8176778 mlcod 23286'8176778 active+clean+scrubbing+deep+repair] 5.ca recorded data digest 0xb284fef9 != on disk 0x43d61c5d on 6134ccca/rbd_data.86280c78aaf7da.000e0bb5/17//5
-1> 2016-03-21 17:36:09.048201 7f253f912700 -1 log_channel(cluster) log [ERR] : 5.ca recorded data digest 0xb284fef9 != on disk 0x43d61c5d on 6134ccca/rbd_data.86280c78aaf7da.000e0bb5/17//5
0> 2016-03-21 17:36:09.050672 7f253f912700 -1 osd/osd_types.cc: In function 'uint64_t SnapSet::get_clone_bytes(snapid_t) const' thread 7f253f912700 time 2016-03-21 17:36:09.048341
osd/osd_types.cc: 4103: FAILED assert(clone_size.count(clone))

ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x5606c23633db]
2: (SnapSet::get_clone_bytes(snapid_t) const+0xb6) [0x5606c1fd4666]
3: (ReplicatedPG::_scrub(ScrubMap&, std::map<hobject_t, std::pair, std::less, std::allocator<std::pair > > > const&)+0xa1c) [0x5606c20b3c6c]
4: (PG::scrub_compare_maps()+0xec9) [0x5606c2020d49]
5: (PG::chunky_scrub(ThreadPool::TPHandle&)+0x1ee) [0x5606c20264be]
6: (PG::scrub(ThreadPool::TPHandle&)+0x1f4) [0x5606c2027d44]
7: (OSD::ScrubWQ::_process(PG*, ThreadPool::TPHandle&)+0x19) [0x5606c1f0c379]
8: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa56) [0x5606c2353fc6]
9: (ThreadPool::WorkThread::entry()+0x10) [0x5606c2355070]
10: (()+0x8182) [0x7f256168e182]
11: (clone()+0x6d) [0x7f255fbf947d]
NOTE: a copy of the executable, or `objdump -rdS ` is needed to interpret this.
---

Is there any way to recalculate the data digest? I have removed the OSD
with the failed PG; data was recovered, but the error occurs on another
OSD. I think I do not have a consistent copy of the data. What can I do to
recover?

Pool size is 2 (it's not so good, I know, but I have no ability to increase
it for the nearest 2 months).

--
WBR, Max A. Krasilnikov
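A cautious starting point for a digest mismatch like this is to locate the
object's on-disk copies, compare them, and only then let repair run; a
sketch under those assumptions (filestore paths are illustrative, and with
size 2 repair may copy the bad replica over the good one, so inspect both
first):

---
# on each OSD host holding the PG, find the object's file
find /var/lib/ceph/osd/ceph-13/current/5.ca_head/ -name '*86280c78aaf7da*'
# compare checksums of the copies on osd.13 and osd.21, then:
ceph pg repair 5.ca
---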
Re: [ceph-users] slow requests with rbd
Hello!

On Fri, Mar 04, 2016 at 01:33:24PM +0100, honza801 wrote:

> hi,
> i have rbd0 mapped to a client, xfs formatted. i'm putting a lot of data
> on it. the following messages appear in the logs and 'ceph -s' output:
> osd.255 [WRN] 1 slow requests, 1 included below; oldest blocked for > 51.726881 secs
> osd.255 [WRN] slow request 51.726881 seconds old, received at 2016-03-04 12:22:23.549737: osd_op(client.14296.1:389333 rbd_data.37d230c8153.000d1cc8 [set-alloc-hint object_size 4194304 write_size 4194304,writefull 0~4194304] 2.fc8c5908 ondisk+write e7523) currently waiting for subops from 120,239
> it causes slowdowns on writes. iostat, load, dmesg on the osds show
> nothing odd.
> could anyone give me a hint?

I spent a lot of time with this trouble because of "overtuning" of the
Linux TCP/IP stack using sysctl. If your disks are not overloaded and your
network is not overloaded, take a look at the network configuration,
including sysctl. BTW, the default sysctl settings are quite good :) Things
can be better, but they are stable enough.

--
WBR, Max A. Krasilnikov
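A quick way to spot such overtuning is to list your overrides and revert
them; a minimal sketch:

---
grep -r . /etc/sysctl.d/ /etc/sysctl.conf    # list every setting you have overridden
sysctl --system                              # reload after trimming them back
---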
Re: [ceph-users] Restore properties to default?
Hello!

On Thu, Mar 03, 2016 at 09:53:22AM +1000, lindsay.mathieson wrote:

> Ok, reduced my recovery I/O with
> ceph tell osd.* injectargs '--osd-max-backfills 1'
> ceph tell osd.* injectargs '--osd-recovery-max-active 1'
> ceph tell osd.* injectargs '--osd-client-op-priority 63'
> Now I can put it back to the default values explicitly (10, 15), but is
> there a way to tell ceph to just restore the default args?

As an option:

ceph --show-config -c /dev/null | grep osd_max_backfills
...
ceph tell osd.* injectargs '--osd_max_backfills='

--
WBR, Max A. Krasilnikov
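Scripted, that round trip looks something like this (a sketch; it assumes
the compiled-in default is what you want back):

---
def=$(ceph --show-config -c /dev/null | awk '/^osd_max_backfills = /{print $3}')
ceph tell osd.* injectargs "--osd_max_backfills=$def"
---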
Re: [ceph-users] Hammer OSD crash during deep scrub
Hello!

On Wed, Feb 17, 2016 at 11:14:09AM +0200, pseudo wrote:

> Hello!
> Now I'm going to check the OSD filesystem. But I have neither strange
> logs in syslog, nor SMART reports about this drive.

The filesystem check did not find any troubles. Removing the OSD and
scrubbing the problematic PG on another pair of OSDs resulted in a crash of
the new primary OSD. It looks like I have wrong PG data, but it is a design
flaw when an OSD crashes due to inconsistent input data...

I have no idea how to find the problematic object in the PG. If I could
find it, I would repair it by hand. Any ideas?

--
WBR, Max A. Krasilnikov
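One way to enumerate a PG's objects offline is ceph-objectstore-tool
against a stopped OSD; a sketch (the OSD id and PG id are assumptions):

---
stop ceph-osd id=13
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-13 \
    --journal-path /var/lib/ceph/osd/ceph-13/journal \
    --op list --pgid 5.ca
---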
Re: [ceph-users] can not umount ceph osd partition
Hello!

On Thu, Feb 04, 2016 at 11:10:06AM +0100, yoann.moulin wrote:

> Hello,
>
>>>> I am using 0.94.5. When I try to umount a partition and fsck it, I hit an issue:
>>>> root@storage003:~# stop ceph-osd id=13
>>>> ceph-osd stop/waiting
>>>> root@storage003:~# umount /var/lib/ceph/osd/ceph-13
>>>> root@storage003:~# fsck -yf /dev/sdf
>>>> fsck from util-linux 2.20.1
>>>> e2fsck 1.42.9 (4-Feb-2014)
>>>> /dev/sdf is in use.
>>>> e2fsck: Cannot continue, aborting.
>>>>
>>>> There is no /var/lib/ceph/osd/ceph-13 in /proc/mounts. But there is no ability to check the fs.
>>>> I can mount -o remount,rw, but I would like to umount the device for maintenance and, maybe, replace it.
>>>>
>>>> Why can't I umount?
>>
>>> does "lsof -n | grep /dev/sdf" give anything?
>>
>> Nothing.
>>
>>> and are you sure /dev/sdf is the disk for osd 13?
>>
>> Absolutely. I have even tried fsck -yf /dev/disk/by-label/osd-13. No luck.
>>
>> The disk is mounted using LABEL in fstab, the journal is a symlink to /dev/disk/by-partlabel/j-13.

> I think it's more Linux related.

Maybe. But I have it only on ceph boxes :(

> could you try to look with lsof if something holds the device by the label or uuid instead of /dev/sdf?
> you can try to delete the device from the scsi bus with something like:
> echo 1 > /sys/block/<device>/device/delete
> be careful, it is like removing the disk physically; if a process holds the device, you might expect that process to switch into kernel state "D+". You won't be able to kill that process even by kill -9. To stop it, you will have to reboot the server.
> you can give a look here at how to manipulate the scsi bus:
> http://fibrevillage.com/storage/279-hot-add-remove-rescan-of-scsi-devices-on-linux
> you can install the package "scsitools" that provides rescan-scsi-bus.sh to rescan your scsi bus and get back the removed disk.
> http://manpages.ubuntu.com/manpages/precise/man8/rescan-scsi-bus.8.html
> hope that can help you

Thanks a lot! I will try to use partx -u (it sometimes helped me in the
past to re-read partitions from a disk when gdisk was not able to update
the kernel's list of partitions) and software removing/inserting of the
drive. If some process falls into uninterruptible sleep, I will reboot the
node. It will be rebooted in any case if this does not help.

If I find out something, I will post it here. I think it can affect other
ceph users.

--
WBR, Max A. Krasilnikov
ColoCall Data Center
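Putting the suggestions together, the maintenance sequence might look like
this (the device name is illustrative; make sure nothing holds the disk
first):

---
echo 1 > /sys/block/sdf/device/delete   # soft-remove the disk from the SCSI bus
rescan-scsi-bus.sh                      # from the scsitools package, to get it back
partx -u /dev/sdf                       # force the kernel to re-read the partition table
---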
[ceph-users] can not umount ceph osd partition
Hello!

I am using 0.94.5. When I try to umount a partition and fsck it, I hit an
issue:

root@storage003:~# stop ceph-osd id=13
ceph-osd stop/waiting
root@storage003:~# umount /var/lib/ceph/osd/ceph-13
root@storage003:~# fsck -yf /dev/sdf
fsck from util-linux 2.20.1
e2fsck 1.42.9 (4-Feb-2014)
/dev/sdf is in use.
e2fsck: Cannot continue, aborting.

There is no /var/lib/ceph/osd/ceph-13 in /proc/mounts, but there is no
ability to check the fs. I can mount -o remount,rw, but I would like to
umount the device for maintenance and, maybe, replace it.

Why can't I umount?

--
WBR, Max A. Krasilnikov
ColoCall Data Center
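A first check for what still holds the device; a sketch (fuser can catch
users that a plain lsof grep misses):

---
fuser -vm /dev/sdf
lsof -n | grep -e sdf -e ceph-13
---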
Re: [ceph-users] can not umount ceph osd partition
Hello!

On Wed, Feb 03, 2016 at 04:59:30PM +0100, yoann.moulin wrote:

> Hello,
>
>> I am using 0.94.5. When I try to umount a partition and fsck it, I hit an issue:
>> root@storage003:~# stop ceph-osd id=13
>> ceph-osd stop/waiting
>> root@storage003:~# umount /var/lib/ceph/osd/ceph-13
>> root@storage003:~# fsck -yf /dev/sdf
>> fsck from util-linux 2.20.1
>> e2fsck 1.42.9 (4-Feb-2014)
>> /dev/sdf is in use.
>> e2fsck: Cannot continue, aborting.
>>
>> There is no /var/lib/ceph/osd/ceph-13 in /proc/mounts, but there is no ability to check the fs.
>> I can mount -o remount,rw, but I would like to umount the device for maintenance and, maybe, replace it.
>>
>> Why can't I umount?

> does "lsof -n | grep /dev/sdf" give anything?

Nothing.

> and are you sure /dev/sdf is the disk for osd 13?

Absolutely. I have even tried fsck -yf /dev/disk/by-label/osd-13. No luck.

The disk is mounted using LABEL in fstab, the journal is a symlink to
/dev/disk/by-partlabel/j-13.

--
WBR, Max A. Krasilnikov
ColoCall Data Center
[ceph-users] osd_recovery_delay_start ignored in Hammer?
Hello!

In my cluster running Hammer 0.94.5-0ubuntu0.15.04.1~cloud0, when an OSD
starts, it begins recovery immediately. I have changed
osd_recovery_delay_start to 60 seconds, but this setting is ignored during
OSD bootup.

root@storage001:~# ceph -n osd.9 --show-config | grep osd_recovery_delay_start
osd_recovery_delay_start = 60

I would like to delay recovery because it increases the load on the
cluster, leading to slow requests on start. 1-2 minutes after startup of
the OSD the slow requests disappear and everything is doing fine.

--
WBR, Max A. Krasilnikov
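For reference, the setting in question as it would normally be applied
(whether Hammer honors it at OSD boot is exactly the open question here):

---
# persistent, in ceph.conf:
[osd]
osd recovery delay start = 60

# or at runtime:
ceph tell osd.* injectargs '--osd_recovery_delay_start 60'
---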
Re: [ceph-users] Write throughput drops to zero
Hello!

On Fri, Oct 30, 2015 at 09:30:40PM +0000, moloney wrote:

> Hi,
> I recently got my first Ceph cluster up and running and have been doing some stress tests. I quickly found that during sequential write benchmarks the throughput would often drop to zero. Initially I saw this inside QEMU virtual machines, but I can also reproduce the issue with "rados bench" within 5-10 minutes of sustained writes. If left alone the writes will eventually start going again, but it takes quite a while (at least a couple of minutes). If I stop and restart the benchmark the write throughput will immediately be where it is supposed to be.
> I have convinced myself it is not a network hardware issue. I can load up the network with a bunch of parallel iperf benchmarks and it keeps chugging along happily. When the issue occurs with Ceph I don't see any indications of network issues (e.g. dropped packets). Adding additional network load during the rados bench (using iperf) doesn't seem to trigger the issue any faster or more often.
> I have also convinced myself it isn't an issue with a journal getting full or an OSD being too busy. The amount of data being written before the problem occurs is much larger than the total journal capacity. Watching the load on the OSD servers with top/iostat I don't see anything being overloaded; rather, I see the load everywhere drop to essentially zero when the writes stall. Before the writes stall the load is well distributed with no visible hot spots. The OSDs and hosts that report slow requests are random, so I don't think it is a failing disk or server. I don't see anything interesting going on in the logs so far (I am just about to do some tests with Ceph's debug logging cranked up).
> The cluster specs are:
> OS: Ubuntu 14.04 with 3.16 kernel
> Ceph: 9.1.0
> OSD Filesystem: XFS
> Replication: 3X
> Two racks with IPoIB network
> 10Gbps Ethernet between racks
> 8 OSD servers with:
> * Dual Xeon E5-2630L (12 cores @ 2.4GHz)
> * 128GB RAM
> * 12 6TB Seagate drives (connected to LSI 2208 chip in JBOD mode)
> * Two 400GB Intel P3600 NVMe drives (OS on RAID1 partition, 6 partitions for OSD journals each)
> * Mellanox ConnectX-3 NIC (for both Infiniband and 10Gbps Ethernet)
> 3 Mons collocated on OSD servers
> Any advice is greatly appreciated. I am planning to try this with Hammer too.

I had the same trouble with Hammer, Ubuntu 14.04 and a 3.19 kernel on a
Supermicro X9DRL-3F/iF with Intel 82599ES, bonded into one link to 2
different Cisco Nexus 5020 switches. It was finally fixed by dropping the
MTU from 1500+ down to 1500. It was working with 9000 and the following
sysctls, but after several weeks the trouble repeated and I had to drop the
MTU down again:

net.ipv4.tcp_rmem = 1024000 8738000 1677721600
net.ipv4.tcp_wmem = 1024000 8738000 1677721600
net.ipv4.tcp_mem = 1024000 8738000 1677721600
net.core.netdev_max_backlog = 25
net.ipv4.tcp_max_syn_backlog = 15
net.ipv4.tcp_congestion_control = htcp
net.ipv4.tcp_mtu_probing = 1
net.ipv4.tcp_max_tw_buckets = 200
net.ipv4.tcp_fin_timeout = 10
net.ipv4.tcp_slow_start_after_idle = 0
net.ipv4.tcp_low_latency = 1
vm.swappiness = 1
net.ipv4.tcp_moderate_rcvbuf = 0

> Thanks,
> Brendan

--
WBR, Max A. Krasilnikov
ColoCall Data Center
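When experimenting with jumbo frames, it is worth verifying the path end to
end before trusting MTU 9000; a sketch (addresses and interface names are
assumptions):

---
ip link set dev bond0 mtu 9000    # remember the slave interfaces as well
ping -M do -s 8972 10.0.65.2      # 8972 = 9000 minus 28 bytes of IP/ICMP headers
---

If the ping fails with "message too long", something in the path is not
passing jumbo frames.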
Re: [ceph-users] Potential OSD deadlock?
Hello!

On Fri, Oct 09, 2015 at 01:45:42PM +0200, jan wrote:

> Have you tried running iperf between the nodes? Capturing a pcap of the (failing) Ceph comms from both sides could help narrow it down.
> Is there any SDN layer involved that could add overhead/padding to the frames?
> What about some intermediate MTU like 8000 - does that work?
> Oh and if there's any bonding/trunking involved, beware that you need to set the same MTU and offloads on all interfaces on certain kernels - flags like MTU/offloads should propagate between the master/slave interfaces, but in reality that's not the case and they get reset even if you unplug/replug the ethernet cable.

I'm sorry for the long time to answer, but I have fixed the problem with
jumbo frames with sysctl:

net.ipv4.tcp_moderate_rcvbuf = 0
net.ipv4.tcp_rmem = 1024000 8738000 1677721600
net.ipv4.tcp_wmem = 1024000 8738000 1677721600
net.ipv4.tcp_mem = 1024000 8738000 1677721600
net.core.rmem_max = 1677721600
net.core.rmem_default = 167772160
net.core.wmem_max = 1677721600
net.core.wmem_default = 167772160

And now I can load my cluster without any slow requests. The essential
setting is net.ipv4.tcp_moderate_rcvbuf = 0. All the others are just
tunings.

> Jan

--
WBR, Max A. Krasilnikov
Re: [ceph-users] Potential OSD deadlock?
Hello!

On Thu, Oct 08, 2015 at 11:44:09PM -0600, robert wrote:

> Sage,
>
> After trying to bisect this issue (all tests moved the bisect towards Infernalis) and eventually testing the Infernalis branch again, it looks like the problem still exists, although it is handled a tad better in Infernalis. I'm going to test against Firefly/Giant next week and then try to dive into the code to see if I can expose anything.
>
> If I can do anything to provide you with information, please let me know.

I have fixed my troubles by setting the MTU back to 1500 from 9000 in the
2x10G network between the nodes (2x Cisco Nexus 5020, one link per switch,
LACP, Linux bonding driver: bonding mode=4 lacp_rate=1 xmit_hash_policy=1
miimon=100, Intel 82599ES adapter, non-Intel SFP+). When setting it to 9000
on the nodes and 9216 on the Nexus 5020 switches with jumbo frames enabled,
I got a performance drop and slow requests. When setting 1500 on the nodes
and not touching the Nexus, all problems are fixed.

I rebooted all my ceph services when changing the MTU, and changed things
to 9000 and 1500 several times in order to be sure. It is reproducible in
my environment.

> Thanks,
>
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
>
> On Wed, Oct 7, 2015 at 1:25 PM, Robert LeBlanc wrote:
>> We forgot to upload the ceph.log yesterday. It is there now.
>>
>> On Tue, Oct 6, 2015 at 5:40 PM, Robert LeBlanc wrote:
>>> I upped the debug on about everything and ran the test for about 40 minutes. I took OSD.19 on ceph1 down and then brought it back in. There was at least one op on osd.19 that was blocked for over 1,000 seconds. Hopefully this will have something that will cast a light on what is going on.
>>>
>>> We are going to upgrade this cluster to Infernalis tomorrow and rerun the test to verify the results from the dev cluster. This cluster matches the hardware of our production cluster but is not yet in production, so we can safely wipe it to downgrade back to Hammer.
>>>
>>> Logs are located at http://dev.v3trae.net/~jlavoy/ceph/logs/
>>>
>>> Let me know what else we can do to help.
>>>
>>> Thanks,
>>>
>>> On Tue, Oct 6, 2015 at 2:36 PM, Robert LeBlanc wrote:
>>>> On my second test (a much longer one), it took nearly an hour, but a few messages have popped up over a 20 window. Still far less than I have been seeing.
>>>>
>>>> On Tue, Oct 6, 2015 at 2:00 PM, Robert LeBlanc wrote:
>>>>> I'll

--
WBR, Max A. Krasilnikov
Re: [ceph-users] Potential OSD deadlock?
Hello!

On Fri, Oct 09, 2015 at 11:05:59AM +0200, jan wrote:

> Are there any errors on the NICs? (ethtool -s ethX)

No errors. Neither on the nodes, nor on the switches.

> Also take a look at the switch and look for flow control statistics - do you have flow control enabled or disabled?

Flow control is disabled everywhere.

> We had to disable flow control as it would pause all IO on the port whenever any path got congested, which you don't want to happen with a cluster like Ceph. It's better to let the frame drop/retransmit in this case (and you should size it so it doesn't happen in any case).
> And how about NIC offloads? Do they play nice with jumbo frames? I wouldn't put my money on that...

I tried completely disabling all offloads and then setting the MTU back to
9000. No luck.

I am speaking with my NOC about the MTU in the 10G network. If I have an
update, I will write here. I can hardly believe that it is on the ceph
side, but nothing is impossible.

> Jan

--
WBR, Max A. Krasilnikov
Re: [ceph-users] Potential OSD deadlock?
Hello!

On Fri, Oct 09, 2015 at 01:45:42PM +0200, jan wrote:

> Have you tried running iperf between the nodes? Capturing a pcap of the (failing) Ceph comms from both sides could help narrow it down.
> Is there any SDN layer involved that could add overhead/padding to the frames?

No other layers, only 2x Nexus 5020 with virtual port channels. Everything
else I will check on Monday.

> What about some intermediate MTU like 8000 - does that work?

Not tested. I will.

> Oh and if there's any bonding/trunking involved, beware that you need to set the same MTU and offloads on all interfaces on certain kernels - flags like MTU/offloads should propagate between the master/slave interfaces, but in reality that's not the case and they get reset even if you unplug/replug the ethernet cable.

Yes, I understand it :) I was setting the parameters on both interfaces and
checked them using "ip link".

> Jan

--
WBR, Max A. Krasilnikov
Re: [ceph-users] Potential OSD deadlock?
Hello!

On Mon, Oct 05, 2015 at 09:35:26PM -0600, robert wrote:

> With some off-list help, we have adjusted osd_client_message_cap=1. This seems to have helped a bit and we have seen some OSDs have a value up to 4,000 for client messages. But it does not solve the problem with the blocked I/O.
> One thing that I have noticed is that almost exactly 30 seconds elapse between when an OSD boots and the first blocked I/O message. I don't know if the OSD doesn't have time to get its brain right about a PG before it starts servicing it, or what exactly.

I have problems like yours in my cluster. All of them can be fixed by
restarting some OSDs, but I cannot restart all my OSDs time after time.
The problem occurs when a client is writing to an RBD volume or when
recovering a volume. A typical message is (this one was during recovery):

[WRN] slow request 30.929654 seconds old, received at 2015-10-06 13:00:41.412329: osd_op(client.1068613.0:192715 rbd_data.dc7650539e6a.0820 [set-alloc-hint object_size 4194304 write_size 4194304,write 3371008~4096] 5.d66fd55d snapc c=[c] ack+ondisk+write+known_if_redirected e4009) currently waiting for subops from 51

Restarting osd.51 in such a scenario fixes the problem. There are no slow
requests under low IO on the systems, only when I do something like
uploading an image.

Some time ago I had too many created but unused OSDs. At that time, when
going down for restart, OSDs did not inform the mon about it. Removing the
unused OSD entries fixed that issue. But when doing ceph crush dump I can
still see them. Maybe that is the root of the problem? I tried to do
getcrushmap/edit/setcrushmap, but the entries stay in place.

Maybe my experience will help you to find the answer. I hope it will fix my
problems :)

--
WBR, Max A. Krasilnikov
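The throttle mentioned above is applied at runtime like this (a sketch; the
value is illustrative, since the one in the quoted message looks
truncated):

---
ceph tell osd.* injectargs '--osd_client_message_cap 10000'
---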
Re: [ceph-users] Software Raid 1 for system disks on storage nodes (not for OSD disks)
Hello!

On Sat, Sep 19, 2015 at 07:03:35AM +0200, martin wrote:

> Thanks all for the suggestions.
> Our storage nodes have plenty of RAM and their only purpose is to host the OSD daemons, so we will not create a swap partition on provisioning.

As an option, you can use a swap file on demand. It is easy to deploy; see
the sketch below.

> For the OS disk we will then use a software RAID 1 to handle eventual disk failures. For provisioning the hosts we use kickstart and then Ansible to install and prepare the hosts to be ready for ceph-deploy.

I don't think RAID1 matters much under Ceph, given its ability to keep
copies of data distributed over hosts. Think of Ceph as a RAID1 over hosts,
which is more reliable: even a server crash will not destroy your data, if
your setup is quite correct :)

--
WBR, Max A. Krasilnikov
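A minimal on-demand swap file, as mentioned above (the size is
illustrative):

---
fallocate -l 4G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
# to make it permanent, add to /etc/fstab:
# /swapfile  none  swap  sw  0  0
---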
Re: [ceph-users] help! Ceph Manual Depolyment
Hello!

On Thu, Sep 17, 2015 at 11:59:47PM +0800, wikison wrote:

> Is there any detailed manual deployment document? I downloaded the source
> and built ceph, then installed ceph on 7 computers. I used three as
> monitors and four as OSDs. I followed the official document on ceph.com.
> But it didn't work and it seemed to be out-dated. Could anybody help me?

This works for me:

http://docs.ceph.com/docs/master/install/manual-deployment/
http://docs.ceph.com/docs/master/rados/operations/add-or-rm-osds/
http://www.sebastien-han.fr/blog/2013/05/13/deploy-a-ceph-mds-server/
http://docs.ceph.com/docs/master/cephfs/createfs/

--
WBR, Max A. Krasilnikov
[ceph-users] Strange rbd hung with non-standard crush location
cluster_addr = 10.0.65.1
public_addr = 10.0.66.1
host = storage001

[osd.10]
cluster_addr = 10.0.65.1
public_addr = 10.0.66.1
host = storage001

[osd.11]
cluster_addr = 10.0.65.1
public_addr = 10.0.66.1
host = storage001

[osd.12]
cluster_addr = 10.0.65.1
public_addr = 10.0.66.1
host = storage001

[osd.20]
cluster_addr = 10.0.65.2
public_addr = 10.0.66.2
host = storage002

[osd.30]
cluster_addr = 10.0.65.2
public_addr = 10.0.66.2
host = storage002

[osd.31]
cluster_addr = 10.0.65.2
public_addr = 10.0.66.2
host = storage002

[osd.32]
cluster_addr = 10.0.65.2
public_addr = 10.0.66.2
host = storage002

[osd.40]
cluster_addr = 10.0.65.3
public_addr = 10.0.66.3
host = storage003

[osd.50]
cluster_addr = 10.0.65.3
public_addr = 10.0.66.3
host = storage003

[osd.51]
cluster_addr = 10.0.65.3
public_addr = 10.0.66.3
host = storage003

[osd.52]
cluster_addr = 10.0.65.3
public_addr = 10.0.66.3
host = storage003

My volumes:

pool 0 'rbd' replicated size 2 min_size 1 crush_ruleset 3 object_hash rjenkins pg_num 64 pgp_num 64 last_change 92 flags hashpspool stripe_width 0
pool 4 'openstack-img' replicated size 2 min_size 1 crush_ruleset 3 object_hash rjenkins pg_num 512 pgp_num 512 last_change 187 flags hashpspool stripe_width 0
pool 5 'openstack-hdd' replicated size 2 min_size 1 crush_ruleset 3 object_hash rjenkins pg_num 512 pgp_num 512 last_change 114 flags hashpspool stripe_width 0
pool 6 'openstack-ssd' replicated size 2 min_size 1 crush_ruleset 4 object_hash rjenkins pg_num 512 pgp_num 512 last_change 118 flags hashpspool stripe_width 0
pool 7 'cephfs_metadata' replicated size 2 min_size 1 crush_ruleset 3 object_hash rjenkins pg_num 64 pgp_num 64 last_change 141 flags hashpspool stripe_width 0
pool 8 'cephfs_data' replicated size 2 min_size 1 crush_ruleset 3 object_hash rjenkins pg_num 128 pgp_num 128 last_change 145 flags hashpspool crash_replay_interval 45 stripe_width 0

The first one was added by ceph setup and is not used by me. I have only
changed its ruleset to 3. So why do I need a "default" root with OSDs in
it? And why is this not described in the docs? Or maybe I have
misunderstood it?

--
WBR, Max A. Krasilnikov
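If OSDs have ended up under a root your rules don't reference, one way to
inspect and correct their placement is with the crush commands; a minimal
sketch (bucket names are assumptions):

---
ceph osd tree                                            # see which buckets each OSD sits in
ceph osd crush set osd.10 1.0 root=hdd host=storage001   # move/weight an OSD explicitly
---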
Re: [ceph-users] Recommended way of leveraging multiple disks by Ceph
Hello!

On Tue, Sep 15, 2015 at 04:16:47PM +0000, fangzhe.chang wrote:

> Hi,
> I'd like to run Ceph on a few machines, each of which has multiple disks. The disks are heterogeneous: some are rotational disks of larger capacities while others are smaller solid state disks. What are the recommended ways of running ceph osd-es on them?
> Two of the approaches can be:
> 1) Deploy an osd instance on each hard disk. For instance, if a machine has six hard disks, there will be six osd instances running on it. In this case, does Ceph's replication algorithm recognize that these osd-es are on the same machine and therefore try to avoid placing replicas on disks/osd-es of a same machine?

When adding an OSD, or at any point later, you can set a crush location for
it. PG placement is based on your crush rules and crush locations. In the
general case, data will be written to different hosts. I have a config with
multiple disks on 3 nodes, some of them HDDs and 1 SSD per node, each
serving 1 OSD.

> 2) Create a logical volume spanning multiple hard disks of a machine and run a single copy of osd per machine.

It is more reliable to have several OSDs, one per drive. When losing a
drive, you will not lose all the data on the host.

> If you have previous experiences, benchmarking results, or know a pointer to the corresponding documentation, please share with me and other users. Thanks a lot.

I'd recommend this fine article:
http://www.sebastien-han.fr/blog/2014/08/25/ceph-mix-sata-and-ssd-within-the-same-box/

--
WBR, Max A. Krasilnikov
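A sketch of pinning OSDs to per-type buckets in ceph.conf, along the lines
of the linked article (root and host names are assumptions):

---
[osd.0]
osd crush location = root=hdd host=storage001-hdd

[osd.6]
osd crush location = root=ssd host=storage001-ssd
---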
Re: [ceph-users] RBD performance slowly degrades :-(
Hello!

On Wed, Aug 12, 2015 at 02:30:59PM +0000, pieter.koorts wrote:

> Hi Irek,
> Thanks for the link. I have removed the SSDs for now and performance is up to 30MB/s in a benchmark now. To be honest, I knew the Samsung SSDs weren't great but did not expect them to be worse than just plain hard disks.

I had the same trouble with the Samsung 840 EVO 1TB. 15 of 16 disks were
terribly slow (about 3000 IOPS and up to 200 MBps per drive). All the
drives were replaced by 850 EVO 250GB and the problem was fixed. My SSDs
had the latest firmware and were brand new at the moment of the test.

> Pieter
>
> Something that's been bugging me for a while is I am trying to diagnose iowait time within KVM guests. Guests doing reads or writes tend to do about 50% to 90% iowait but the host itself is only doing about 1% to 2% iowait. So the result is that the guests are extremely slow. I currently run 3x hosts, each with a single SSD and single HDD OSD in cache-tier writeback mode. Although the SSD (Samsung 850 EVO 120GB) is not a great one, it should at least perform reasonably compared to a hard disk, and doing some direct SSD tests I get approximately 100MB/s write and 200MB/s read on each SSD.
> When I run rados bench though, the benchmark starts with a not great but okay speed and as the benchmark progresses it just gets slower and slower till it's worse than a USB hard drive. The SSD cache pool is 120GB in size (360GB RAW) and in use at about 90GB. I have tried tuning the XFS mount options as well but it has had little effect.

--
WBR, Max A. Krasilnikov
Re: [ceph-users] Enclosure power failure pausing client IO till all connected hosts up
Hello!

On Tue, Jul 07, 2015 at 02:21:56PM +0530, mallikarjuna.biradar wrote:

> Hi all,
> Setup details:
> Two storage enclosures, each connected to 4 OSD nodes (shared storage).
> The failure domain is chassis (enclosure) level. The replication count is
> 2. Each host is allotted 4 drives.
> I have active client IO running on the cluster (random write profile with
> 4M block size, 64 queue depth).
> One of the enclosures had a power loss. So all OSDs on the hosts connected
> to this enclosure went down, as expected. But client IO got paused. After
> some time the enclosure and the hosts connected to it came up, and all
> OSDs on those hosts came up. Until then, the cluster was not serving IO.
> Once all OSDs pertaining to that enclosure came up, client IO resumed.
> Can anybody help me understand why the cluster was not serving IO during
> the enclosure failure? Or is it a bug?

With replication factor 2 you have to use 3+ nodes in order to keep serving
clients, if your chooseleaf type is 0.

--
WBR, Max A. Krasilnikov
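For reference, a chassis-level failure domain is expressed in a crush rule
roughly like this (the rule name and numbers are illustrative):

---
rule replicated_chassis {
        ruleset 1
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type chassis
        step emit
}
---

Note that whether IO continues with one chassis down also depends on the
pool's min_size: if fewer than min_size replicas survive, the PGs pause IO
exactly as described above.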