Re: [ceph-users] slow request and unresponsive kvm guests after upgrading ceph cluster and os, please help debugging
seconds or 0 objects
Object prefix: benchmark_data_ceph02_396172
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      16        77        61   243.956       244    0.200718    0.227978
    2      16       151       135   269.946       296    0.327927      0.2265
    3      16       215       199   265.281       256   0.0875193    0.225989
    4      16       288       272   271.951       292    0.184617    0.227921
    5      16       358       342   273.553       280    0.140823     0.22683
    6      16       426       410   273.286       272    0.118436    0.226586
    7      16       501       485   277.094       300    0.224887    0.226209
    8      16       573       557   278.452       288    0.200903    0.226424
    9      16       643       627   278.619       280    0.214474    0.227003
   10      16       711       695   277.952       272    0.259724    0.226849
Total time run:         10.146720
Total writes made:      712
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     280.682
Stddev Bandwidth:       17.7138
Max bandwidth (MB/sec): 300
Min bandwidth (MB/sec): 244
Average IOPS:           70
Stddev IOPS:            4
Max IOPS:               75
Min IOPS:               61
Average Latency(s):     0.227538
Stddev Latency(s):      0.0843661
Max latency(s):         0.48464
Min latency(s):         0.0467124

On 2020-01-06 20:44, Jelle de Jong wrote:

Hello everybody,

I have issues with very slow requests on a simple three-node cluster here: four WDC enterprise disks and an Intel Optane NVMe journal on identical high-memory nodes, with 10GB networking.

It was all working well with Ceph Hammer on Debian Wheezy, but I wanted to upgrade to a supported version and test out bluestore as well. So I upgraded to Luminous on Debian Stretch and used ceph-volume to create bluestore OSDs, and everything went downhill from there. I went back to filestore on all nodes but I still have slow requests and I cannot pinpoint a good reason.

I tried to debug and gathered information to look at: https://paste.debian.net/hidden/acc5d204/

First I thought it was the balancing that was making things slow, then I thought it might be the LVM layer, so I recreated the nodes without LVM by switching from ceph-volume to ceph-disk; no difference, still slow requests. Then I changed back from bluestore to filestore, but still a very slow cluster. Then I thought it was a CPU scheduling issue, downgraded the 5.x kernel, and CPU performance is at full speed again. I thought maybe there was something weird with an OSD and took them out one by one, but slow requests are still showing up and client performance from VMs is really poor. It just feels as if a burst of small requests keeps blocking for a while and then recovers again.

Many thanks for helping out looking at the URL. If there are options I should tune for an HDD with NVMe journal setup, please share.

Jelle
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
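For reference, rados bench output like the table above is typically produced with an invocation along these lines (the pool name and runtime are placeholders, not taken from the thread):

rados bench -p <pool> 10 write --no-cleanup   # 16 concurrent 4MiB writes for 10 seconds
rados bench -p <pool> 10 seq                  # optional sequential read pass over the same objects
rados -p <pool> cleanup                       # remove the benchmark objects afterwards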
Re: [ceph-users] Random slow requests without any load
Hi,

What are the full commands you used to set up this iptables config?

iptables --table raw --append OUTPUT --jump NOTRACK
iptables --table raw --append PREROUTING --jump NOTRACK

does not create the same output; it needs some more.

Kind regards,
Jelle de Jong

On 2019-07-17 14:59, Kees Meijs wrote:
Hi,

Experienced similar issues. Our cluster internal network (completely separated) now has NOTRACK (no connection state tracking) iptables rules. In full:

# iptables-save
# Generated by xtables-save v1.8.2 on Wed Jul 17 14:57:38 2019
*filter
:FORWARD DROP [0:0]
:OUTPUT ACCEPT [0:0]
:INPUT ACCEPT [0:0]
COMMIT
# Completed on Wed Jul 17 14:57:38 2019
# Generated by xtables-save v1.8.2 on Wed Jul 17 14:57:38 2019
*raw
:OUTPUT ACCEPT [0:0]
:PREROUTING ACCEPT [0:0]
-A OUTPUT -j NOTRACK
-A PREROUTING -j NOTRACK
COMMIT
# Completed on Wed Jul 17 14:57:38 2019

Ceph uses IPv4 in our case, but to be complete:

# ip6tables-save
# Generated by xtables-save v1.8.2 on Wed Jul 17 14:58:20 2019
*filter
:OUTPUT ACCEPT [0:0]
:INPUT ACCEPT [0:0]
:FORWARD DROP [0:0]
COMMIT
# Completed on Wed Jul 17 14:58:20 2019
# Generated by xtables-save v1.8.2 on Wed Jul 17 14:58:20 2019
*raw
:OUTPUT ACCEPT [0:0]
:PREROUTING ACCEPT [0:0]
-A OUTPUT -j NOTRACK
-A PREROUTING -j NOTRACK
COMMIT
# Completed on Wed Jul 17 14:58:20 2019

Using this configuration, the state tables can never fill up with dropped connections as a side effect.

Cheers,
Kees

On 17-07-2019 11:27, Maximilien Cuony wrote:
Just a quick update about this in case somebody else gets the same issue:

The problem was with the firewall. Port ranges and established connections are allowed, but for some reason the tracking of connections seems to get lost, leading to a strange state where one machine refuses data (RSTs are replied) and the sender never gets the RST packet (even with 'related' packets allowed).

There was a similar post on this list in February ("Ceph and TCP States") where loss of connections in conntrack created issues, but the fix, net.netfilter.nf_conntrack_tcp_be_liberal=1, did not improve that particular case.

As a workaround, we installed lighter rules for the firewall (allowing all packets from machines inside the cluster by default) and that "fixed" the issue :)
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
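A minimal sketch of commands that should reproduce a ruleset like the one Kees posted, assuming empty tables to start with: the only parts missing from the two NOTRACK rules quoted above are the FORWARD default policy and the IPv6 counterparts. Scope the NOTRACK rules to the cluster interface or VLAN if you do not want to disable conntrack globally.

iptables --policy FORWARD DROP
iptables --table raw --append PREROUTING --jump NOTRACK
iptables --table raw --append OUTPUT --jump NOTRACK
ip6tables --policy FORWARD DROP
ip6tables --table raw --append PREROUTING --jump NOTRACK
ip6tables --table raw --append OUTPUT --jump NOTRACK
# persist across reboots with your distribution's mechanism, e.g. iptables-save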
[ceph-users] slow request and unresponsive kvm guests after upgrading ceph cluster and os, please help debugging
Hello everybody,

I have issues with very slow requests on a simple three-node cluster here: four WDC enterprise disks and an Intel Optane NVMe journal on identical high-memory nodes, with 10GB networking.

It was all working well with Ceph Hammer on Debian Wheezy, but I wanted to upgrade to a supported version and test out bluestore as well. So I upgraded to Luminous on Debian Stretch and used ceph-volume to create bluestore OSDs, and everything went downhill from there. I went back to filestore on all nodes but I still have slow requests and I cannot pinpoint a good reason.

I tried to debug and gathered information to look at: https://paste.debian.net/hidden/acc5d204/

First I thought it was the balancing that was making things slow, then I thought it might be the LVM layer, so I recreated the nodes without LVM by switching from ceph-volume to ceph-disk; no difference, still slow requests. Then I changed back from bluestore to filestore, but still a very slow cluster. Then I thought it was a CPU scheduling issue, downgraded the 5.x kernel, and CPU performance is at full speed again. I thought maybe there was something weird with an OSD and took them out one by one, but slow requests are still showing up and client performance from VMs is really poor. It just feels as if a burst of small requests keeps blocking for a while and then recovers again.

Many thanks for helping out looking at the URL. If there are options I should tune for an HDD with NVMe journal setup, please share.

Jelle
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
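When chasing slow requests like this, it usually helps to ask the implicated OSDs directly what they were blocked on. A few commands to start with (run on the node hosting the OSD; osd.N is a placeholder):

ceph health detail                      # lists which OSDs currently have blocked requests
ceph daemon osd.N dump_ops_in_flight    # ops stuck right now, with their current state
ceph daemon osd.N dump_historic_ops     # recently completed slow ops and where the time was spent
ceph daemon osd.N perf dump             # per-OSD latency counters (commit/apply latency etc.)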
[ceph-users] help! pg inactive and slow requests after filestore to bluestore migration, version 12.2.12
Hello everybody,

I have a three-node ceph cluster made of E3-1220v3 machines with 24GB RAM, 6 hdd OSDs with a 32GB Intel Optane NVMe journal, and 10GB networking. I wanted to move to bluestore due to the dropping of filestore support; our cluster was working fine with filestore and we could take complete nodes out for maintenance without issues.

root@ceph04:~# ceph osd pool get libvirt-pool size
size: 3
root@ceph04:~# ceph osd pool get libvirt-pool min_size
min_size: 2

I removed all OSDs from one node, zapping the OSD and journal devices; we recreated the OSDs as bluestore and used a small 5GB partition as RocksDB device instead of a journal for all OSDs. I saw the cluster suffer with inactive pgs and slow requests. I tried setting the following on all nodes, but no difference:

ceph tell osd.* injectargs '--osd_recovery_max_active 1'
ceph tell osd.* injectargs '--osd_recovery_op_priority 1'
ceph tell osd.* injectargs '--osd_recovery_sleep 0.3'
systemctl restart ceph-osd.target

It took three days to recover and during this time clients were not responsive. How can I migrate to bluestore without inactive pgs or slow requests? I have several more filestore clusters and I would like to know how to migrate without inactive pgs and slow requests.

As a side question: I optimized our cluster for filestore, and the Intel Optane NVMe journals showed good fio dsync write tests. Does bluestore also use dsync writes for RocksDB caching, or can we select NVMe devices on other specifications? My tests with filestore showed that the Optane NVMe SSD was faster than the Samsung NVMe SSD 970 Pro, and I only need a few GB for filestore journals, but with bluestore RocksDB caching the situation is different and I can't find documentation on how to speed test NVMe devices for bluestore.

Kind regards,
Jelle

root@ceph04:~# ceph osd tree
ID CLASS WEIGHT   TYPE NAME       STATUS REWEIGHT PRI-AFF
-1       60.04524 root default
-2       20.01263     host ceph04
 0   hdd  2.72899         osd.0       up      1.0     1.0
 1   hdd  2.72899         osd.1       up      1.0     1.0
 2   hdd  5.45799         osd.2       up      1.0     1.0
 3   hdd  2.72899         osd.3       up      1.0     1.0
14   hdd  3.63869         osd.14      up      1.0     1.0
15   hdd  2.72899         osd.15      up      1.0     1.0
-3       20.01263     host ceph05
 4   hdd  5.45799         osd.4       up      1.0     1.0
 5   hdd  2.72899         osd.5       up      1.0     1.0
 6   hdd  2.72899         osd.6       up      1.0     1.0
13   hdd  3.63869         osd.13      up      1.0     1.0
16   hdd  2.72899         osd.16      up      1.0     1.0
18   hdd  2.72899         osd.18      up      1.0     1.0
-4       20.01997     host ceph06
 8   hdd  5.45999         osd.8       up      1.0     1.0
 9   hdd  2.73000         osd.9       up      1.0     1.0
10   hdd  2.73000         osd.10      up      1.0     1.0
11   hdd  2.73000         osd.11      up      1.0     1.0
12   hdd  3.64000         osd.12      up      1.0     1.0
17   hdd  2.73000         osd.17      up      1.0     1.0

root@ceph04:~# ceph status
  cluster:
    id:     85873cda-4865-4147-819d-8deda5345db5
    health: HEALTH_WARN
            18962/11801097 objects misplaced (0.161%)
            1/3933699 objects unfound (0.000%)
            Reduced data availability: 42 pgs inactive
            Degraded data redundancy: 3645135/11801097 objects degraded (30.888%), 959 pgs degraded, 960 pgs undersized
            110 slow requests are blocked > 32 sec. Implicated osds 3,10,11

  services:
    mon: 3 daemons, quorum ceph04,ceph05,ceph06
    mgr: ceph04(active), standbys: ceph06, ceph05
    osd: 18 osds: 18 up, 18 in; 964 remapped pgs

  data:
    pools:   1 pools, 1024 pgs
    objects: 3.93M objects, 15.0TiB
    usage:   31.2TiB used, 28.8TiB / 60.0TiB avail
    pgs:     4.102% pgs not active
             3645135/11801097 objects degraded (30.888%)
             18962/11801097 objects misplaced (0.161%)
             1/3933699 objects unfound (0.000%)
             913 active+undersized+degraded+remapped+backfill_wait
             60  active+clean
             41  activating+undersized+degraded+remapped
             4   active+remapped+backfill_wait
             4   active+undersized+degraded+remapped+backfilling
             1   undersized+degraded+remapped+backfilling+peered
             1   active+recovery_wait+undersized+remapped

  io:
    recovery: 197MiB/s, 49objects/s

root@ceph04:~# ceph health detail
HEALTH_WARN 18962/11801097 objects misplaced (0.161%); 1/3933699 objects unfound (0.000%); Reduced data availability: 42 pgs inactive; Degraded data redundancy: 3643636/11801097 objects degraded (30.875%), 959 pgs degraded, 960 pgs undersized; 110 slow requests are blocked > 32 sec. Implicated osds 3,10,11
OBJECT_MISPLACED 18962/11801097 objects misplaced (0.161%)
OBJECT_UNFOUND 1/3933699 objects unfound (0.000%)
    pg 3.361 has 1 unfound objects
PG_AVAILABILITY Reduced data availability: 42 pgs i
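One approach that tends to avoid the inactive-pg phase is to convert one OSD at a time rather than a whole node, and to keep backfill throttled while the converted OSD is refilled. A rough sketch (the flags and options are standard Ceph settings, the values are only examples to adapt):

ceph osd set noout
ceph osd set norebalance
# zap and recreate a single OSD as bluestore, wait for it to boot and peer
ceph osd unset norebalance
ceph osd unset noout
# throttle backfill while the recreated OSD is refilled
ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1'
# wait for HEALTH_OK before converting the next OSD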
[ceph-users] help! pg inactive and slow requests after filestore to bluestore migration, version 12.2.12
Hello everybody, [fix confusing typo]

I have a three-node ceph cluster made of E3-1220v3 machines with 24GB RAM, 6 hdd OSDs with a 32GB Intel Optane NVMe journal, and 10GB networking. I wanted to move to bluestore due to the dropping of filestore support; our cluster was working fine with filestore and we could take complete nodes out for maintenance without issues.

root@ceph04:~# ceph osd pool get libvirt-pool size
size: 3
root@ceph04:~# ceph osd pool get libvirt-pool min_size
min_size: 2

I removed all OSDs from one node, zapping the OSD and journal devices; we recreated the OSDs as bluestore and used a small 5GB partition as block device instead of a journal for all OSDs. I saw the cluster suffer with inactive pgs and slow requests. I tried setting the following on all nodes, but no difference:

ceph tell osd.* injectargs '--osd_recovery_max_active 1'
ceph tell osd.* injectargs '--osd_recovery_op_priority 1'
ceph tell osd.* injectargs '--osd_recovery_sleep 0.3'
systemctl restart ceph-osd.target

How can I migrate to bluestore without inactive pgs or slow requests? I have several more filestore clusters and I would like to know how to migrate without inactive pgs and slow requests.

As a side question: I optimized our cluster for filestore, and the Intel Optane NVMe journals showed good fio dsync write tests. Does bluestore also use dsync writes for RocksDB caching, or can we select NVMe devices on other specifications? My tests with filestore showed that the Optane NVMe SSD was faster than the Samsung NVMe SSD 970 Pro, and I only need a few GB for filestore journals, but with bluestore RocksDB caching the situation is different and I can't find documentation on how to speed test NVMe devices for bluestore.

Kind regards,
Jelle

root@ceph04:~# ceph osd tree
ID CLASS WEIGHT   TYPE NAME       STATUS REWEIGHT PRI-AFF
-1       60.04524 root default
-2       20.01263     host ceph04
 0   hdd  2.72899         osd.0       up      1.0     1.0
 1   hdd  2.72899         osd.1       up      1.0     1.0
 2   hdd  5.45799         osd.2       up      1.0     1.0
 3   hdd  2.72899         osd.3       up      1.0     1.0
14   hdd  3.63869         osd.14      up      1.0     1.0
15   hdd  2.72899         osd.15      up      1.0     1.0
-3       20.01263     host ceph05
 4   hdd  5.45799         osd.4       up      1.0     1.0
 5   hdd  2.72899         osd.5       up      1.0     1.0
 6   hdd  2.72899         osd.6       up      1.0     1.0
13   hdd  3.63869         osd.13      up      1.0     1.0
16   hdd  2.72899         osd.16      up      1.0     1.0
18   hdd  2.72899         osd.18      up      1.0     1.0
-4       20.01997     host ceph06
 8   hdd  5.45999         osd.8       up      1.0     1.0
 9   hdd  2.73000         osd.9       up      1.0     1.0
10   hdd  2.73000         osd.10      up      1.0     1.0
11   hdd  2.73000         osd.11      up      1.0     1.0
12   hdd  3.64000         osd.12      up      1.0     1.0
17   hdd  2.73000         osd.17      up      1.0     1.0

root@ceph04:~# ceph status
  cluster:
    id:     85873cda-4865-4147-819d-8deda5345db5
    health: HEALTH_WARN
            18962/11801097 objects misplaced (0.161%)
            1/3933699 objects unfound (0.000%)
            Reduced data availability: 42 pgs inactive
            Degraded data redundancy: 3645135/11801097 objects degraded (30.888%), 959 pgs degraded, 960 pgs undersized
            110 slow requests are blocked > 32 sec. Implicated osds 3,10,11

  services:
    mon: 3 daemons, quorum ceph04,ceph05,ceph06
    mgr: ceph04(active), standbys: ceph06, ceph05
    osd: 18 osds: 18 up, 18 in; 964 remapped pgs

  data:
    pools:   1 pools, 1024 pgs
    objects: 3.93M objects, 15.0TiB
    usage:   31.2TiB used, 28.8TiB / 60.0TiB avail
    pgs:     4.102% pgs not active
             3645135/11801097 objects degraded (30.888%)
             18962/11801097 objects misplaced (0.161%)
             1/3933699 objects unfound (0.000%)
             913 active+undersized+degraded+remapped+backfill_wait
             60  active+clean
             41  activating+undersized+degraded+remapped
             4   active+remapped+backfill_wait
             4   active+undersized+degraded+remapped+backfilling
             1   undersized+degraded+remapped+backfilling+peered
             1   active+recovery_wait+undersized+remapped

  io:
    recovery: 197MiB/s, 49objects/s

root@ceph04:~# ceph health detail
HEALTH_WARN 18962/11801097 objects misplaced (0.161%); 1/3933699 objects unfound (0.000%); Reduced data availability: 42 pgs inactive; Degraded data redundancy: 3643636/11801097 objects degraded (30.875%), 959 pgs degraded, 960 pgs undersized; 110 slow requests are blocked > 32 sec. Implicated osds 3,10,11
OBJECT_MISPLACED 18962/11801097 objects misplaced (0.161%)
OBJECT_UNFOUND 1/3933699 objects unfound (0.000%)
    pg 3.361 has 1 unfound objects
PG_AVAILABILITY Reduced data availability: 42 pgs inactive
    pg 3.26 is stuck inactive for 19268.231084, curren
[ceph-users] help! pg inactive and slow requests after filestore to bluestore migration, version 12.2.12
Hello everybody,

I have a three-node ceph cluster made of E3-1220v3 machines with 24GB RAM, 6 hdd OSDs with a 32GB Intel Optane NVMe journal, and 10GB networking. I wanted to move to bluestore due to the dropping of filestore support; our cluster was working fine with filestore and we could take complete nodes out for maintenance without issues.

root@ceph04:~# ceph osd pool get libvirt-pool size
size: 3
root@ceph04:~# ceph osd pool get libvirt-pool min_size
min_size: 2

I removed all OSDs from one node, zapping the OSD and journal devices; we recreated the OSDs as bluestore and used a small 5GB partition as block device instead of a journal for all OSDs. I saw the cluster suffer with inactive pgs and slow requests. I tried setting the following on all nodes, but no difference:

ceph tell osd.* injectargs '--osd_recovery_max_active 1'
ceph tell osd.* injectargs '--osd_recovery_op_priority 1'
ceph tell osd.* injectargs '--osd_recovery_sleep 0.3'
systemctl restart ceph-osd.target

How can I migrate to bluestore without inactive pgs or slow requests? I have several more filestore clusters and I would like to know how to migrate without inactive pgs and slow requests.

As a side question: I optimized our cluster for filestore, and the Intel Optane NVMe journals showed good fio dsync write tests. Does bluestore also use dsync writes for block caching, or can we select NVMe devices on other specifications? My tests with filestore showed that the Optane NVMe SSD was faster than the Samsung NVMe SSD 970 Pro, and I only need a few GB for filestore journals, but with bluestore block caching the situation is different and I can't find documentation on how to speed test NVMe devices for bluestore.

Kind regards,
Jelle

root@ceph04:~# ceph osd tree
ID CLASS WEIGHT   TYPE NAME       STATUS REWEIGHT PRI-AFF
-1       60.04524 root default
-2       20.01263     host ceph04
 0   hdd  2.72899         osd.0       up      1.0     1.0
 1   hdd  2.72899         osd.1       up      1.0     1.0
 2   hdd  5.45799         osd.2       up      1.0     1.0
 3   hdd  2.72899         osd.3       up      1.0     1.0
14   hdd  3.63869         osd.14      up      1.0     1.0
15   hdd  2.72899         osd.15      up      1.0     1.0
-3       20.01263     host ceph05
 4   hdd  5.45799         osd.4       up      1.0     1.0
 5   hdd  2.72899         osd.5       up      1.0     1.0
 6   hdd  2.72899         osd.6       up      1.0     1.0
13   hdd  3.63869         osd.13      up      1.0     1.0
16   hdd  2.72899         osd.16      up      1.0     1.0
18   hdd  2.72899         osd.18      up      1.0     1.0
-4       20.01997     host ceph06
 8   hdd  5.45999         osd.8       up      1.0     1.0
 9   hdd  2.73000         osd.9       up      1.0     1.0
10   hdd  2.73000         osd.10      up      1.0     1.0
11   hdd  2.73000         osd.11      up      1.0     1.0
12   hdd  3.64000         osd.12      up      1.0     1.0
17   hdd  2.73000         osd.17      up      1.0     1.0

root@ceph04:~# ceph status
  cluster:
    id:     85873cda-4865-4147-819d-8deda5345db5
    health: HEALTH_WARN
            18962/11801097 objects misplaced (0.161%)
            1/3933699 objects unfound (0.000%)
            Reduced data availability: 42 pgs inactive
            Degraded data redundancy: 3645135/11801097 objects degraded (30.888%), 959 pgs degraded, 960 pgs undersized
            110 slow requests are blocked > 32 sec. Implicated osds 3,10,11

  services:
    mon: 3 daemons, quorum ceph04,ceph05,ceph06
    mgr: ceph04(active), standbys: ceph06, ceph05
    osd: 18 osds: 18 up, 18 in; 964 remapped pgs

  data:
    pools:   1 pools, 1024 pgs
    objects: 3.93M objects, 15.0TiB
    usage:   31.2TiB used, 28.8TiB / 60.0TiB avail
    pgs:     4.102% pgs not active
             3645135/11801097 objects degraded (30.888%)
             18962/11801097 objects misplaced (0.161%)
             1/3933699 objects unfound (0.000%)
             913 active+undersized+degraded+remapped+backfill_wait
             60  active+clean
             41  activating+undersized+degraded+remapped
             4   active+remapped+backfill_wait
             4   active+undersized+degraded+remapped+backfilling
             1   undersized+degraded+remapped+backfilling+peered
             1   active+recovery_wait+undersized+remapped

  io:
    recovery: 197MiB/s, 49objects/s

root@ceph04:~# ceph health detail
HEALTH_WARN 18962/11801097 objects misplaced (0.161%); 1/3933699 objects unfound (0.000%); Reduced data availability: 42 pgs inactive; Degraded data redundancy: 3643636/11801097 objects degraded (30.875%), 959 pgs degraded, 960 pgs undersized; 110 slow requests are blocked > 32 sec. Implicated osds 3,10,11
OBJECT_MISPLACED 18962/11801097 objects misplaced (0.161%)
OBJECT_UNFOUND 1/3933699 objects unfound (0.000%)
    pg 3.361 has 1 unfound objects
PG_AVAILABILITY Reduced data availability: 42 pgs inactive
    pg 3.26 is stuck inactive for 19268.231084, current state activating+und
Re: [ceph-users] Scaling out
Thanks heaps Nathan. That's what we thought and what we wanted to implement, but I wanted to double check with the community.

Cheers

On Thu, Nov 21, 2019 at 2:42 PM Nathan Fish wrote:
> The default crush rule uses "host" as the failure domain, so in order
> to deploy on one host you will need to make a crush rule that
> specifies "osd". Then simply adding more hosts with osds will result
> in automatic rebalancing. Once you have enough hosts to satisfy the
> crush rule (3 for replicated size = 3) you can change the pool(s)
> back to the default rule.
>
> On Thu, Nov 21, 2019 at 7:46 AM Alfredo De Luca wrote:
> >
> > Hi all.
> > We are doing some tests on how to scale out nodes on Ceph Nautilus.
> > Basically we want to try to install Ceph on one node and scale up to 2+ nodes. How to do so?
> >
> > Every node has 6 disks and maybe we can use the crushmap to achieve this?
> >
> > Any thoughts/ideas/recommendations?
> >
> > Cheers
> >
> > --
> > Alfredo
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
*Alfredo*
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
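For the record, a minimal sketch of what Nathan describes; the rule and pool names here are placeholders, and "replicated_rule" is the usual name of the default rule:

# rule that only requires distinct OSDs, so a single host is enough
ceph osd crush rule create-replicated single-host-rule default osd
ceph osd pool set <pool> crush_rule single-host-rule
# later, once at least three hosts are in the crush map
ceph osd pool set <pool> crush_rule replicated_rule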
[ceph-users] Scaling out
Hi all.

We are doing some tests on how to scale out nodes on Ceph Nautilus. Basically we want to try to install Ceph on one node and scale up to 2+ nodes. How do we do so?

Every node has 6 disks; maybe we can use the crushmap to achieve this?

Any thoughts/ideas/recommendations?

Cheers

--
*Alfredo*
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ceph-objectstore-tool crash when trying to recover pg from OSD
Hi, does anyone have any feedback for me regarding this? Here's the log I get when trying to restart the OSD via systemctl: https://pastebin.com/tshuqsLP

On Mon, 4 Nov 2019 at 12:42, Eugene de Beste <eug...@sanbi.ac.za> wrote:
> Hi everyone
>
> I have a cluster that was initially set up with bad defaults in Luminous. After upgrading to Nautilus I've had a few OSDs crash on me, due to errors seemingly related to https://tracker.ceph.com/issues/42223 and https://tracker.ceph.com/issues/22678.
> One of my pools has been running with min_size 1 (yes, I know) and I am now stuck with incomplete pgs due to the aforementioned OSD crash.
> When trying to use ceph-objectstore-tool to get the pgs out of the OSD, I'm running into the same issue as when trying to start the OSD, namely the crashes. ceph-objectstore-tool core dumps and I can't retrieve the pg.
> Does anyone have any input on this? I would like to be able to retrieve that data if possible.
> Here's the log for ceph-objectstore-tool --debug --data-path /var/lib/ceph/osd/ceph-22 --skip-journal-replay --skip-mount-omap --op info --pgid 2.9f -- https://pastebin.com/9aGtAfSv
> Regards and thanks,
> Eugene
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] ceph-objectstore-tool crash when trying to recover pg from OSD
Hi everyone

I have a cluster that was initially set up with bad defaults in Luminous. After upgrading to Nautilus I've had a few OSDs crash on me, due to errors seemingly related to https://tracker.ceph.com/issues/42223 and https://tracker.ceph.com/issues/22678.

One of my pools has been running with min_size 1 (yes, I know) and I am now stuck with incomplete pgs due to the aforementioned OSD crash.

When trying to use ceph-objectstore-tool to get the pgs out of the OSD, I'm running into the same issue as when trying to start the OSD, namely the crashes. ceph-objectstore-tool core dumps and I can't retrieve the pg.

Does anyone have any input on this? I would like to be able to retrieve that data if possible.

Here's the log for ceph-objectstore-tool --debug --data-path /var/lib/ceph/osd/ceph-22 --skip-journal-replay --skip-mount-omap --op info --pgid 2.9f -- https://pastebin.com/9aGtAfSv

Regards and thanks,
Eugene
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
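If the objectstore itself can still be opened, the usual path is to export the PG from the dead OSD and import it into a healthy one. A hedged sketch using the paths from the post (the target OSD number and export file are placeholders, and both OSD daemons must be stopped while the tool runs):

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-22 \
    --op export --pgid 2.9f --file /root/pg-2.9f.export
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-NN \
    --op import --file /root/pg-2.9f.export
# restart the target OSD afterwards and let peering/recovery pick the PG up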
Re: [ceph-users] ssd requirements for wal/db
hi all,

maybe to clarify a bit: e.g. https://indico.cern.ch/event/755842/contributions/3243386/attachments/1784159/2904041/2019-jcollet-openlab.pdf clearly shows that the db+wal disks are not saturated, but we are wondering what is really needed/acceptable wrt throughput and latency (e.g. is a 6Gbps SATA enough or is 12Gbps SAS needed); we are thinking of combining 4 or 5 7.2k rpm disks with one ssd.

similar question for the read-intensive drives: how much is actually written to the db+wal compared to the data disk? is that 1-to-1? do people see e.g. 1 DWPD on their db+wal devices? (i guess it depends ;) if so, what kind of daily workload averages is this in terms of volume?

thanks for pointing out the capacitor issue, something to definitely double check for the (cheaper) read-intensive ssd.

stijn

On 10/4/19 7:29 PM, Vitaliy Filippov wrote:
> WAL/DB isn't "read intensive". It's more "write intensive" :) use server
> SSDs with capacitors to get adequate write performance.
>
>> Hi all,
>>
>> We are thinking about putting our wal/db of hdds on ssds. If we would
>> put the wal&db of 4 HDDs on 1 SSD as recommended, what type of SSD would
>> suffice?
>> We were thinking of using SATA Read Intensive 6Gbps 1DWPD SSDs.
>>
>> Does someone have some experience with this configuration? Would we need
>> SAS ssds instead of SATA? And Mixed Use 3DWPD instead of Read Intensive?
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
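On how to qualify an SSD for WAL/DB duty: the usual quick check is a single-job synchronous 4k write test, which approximates the small flushed writes the WAL issues. A sketch (this writes to the raw device and destroys data on it; the device name is a placeholder):

fio --name=sync-write-test --filename=/dev/nvme0n1 --direct=1 --sync=1 \
    --rw=write --bs=4k --numjobs=1 --iodepth=1 \
    --runtime=60 --time_based --group_reporting

Consumer drives without power-loss protection tend to collapse to a few hundred IOPS here, while enterprise drives with capacitors stay in the tens of thousands, which is usually the deciding factor rather than the SATA vs SAS link speed.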
Re: [ceph-users] process stuck in D state on cephfs kernel mount
hi marc,

> - how to prevent the D state process to accumulate so much load?

you can't. in linux, uninterruptible tasks themselves count as "load"; it does not mean you e.g. ran out of cpu resources.

stijn

>
> Thanks,
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
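To see which uninterruptible tasks are inflating the load average and what they are blocked on, something like the following usually works (plain procps and sysrq, nothing Ceph-specific; sysrq must be enabled):

# list processes currently in D state and the kernel function they sleep in
ps -eo pid,stat,wchan:40,cmd | awk '$2 ~ /^D/'
# dump stack traces of all blocked tasks into the kernel log
echo w > /proc/sysrq-trigger
dmesg | tail -n 100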
Re: [ceph-users] Encryption questions
Thanks for the answers, guys!

Am I right to assume msgr2 (http://docs.ceph.com/docs/mimic/dev/msgr2/) will provide encryption between Ceph daemons as well as between clients and daemons? Does anybody know if it will be available in Nautilus?

On Fri, Jan 11, 2019 at 8:10 AM Tobias Florek wrote:
> Hi,
>
> as others pointed out, traffic in ceph is unencrypted (internal traffic
> as well as client traffic). I usually advise to set up IPSec or
> nowadays wireguard connections between all hosts. That takes care of
> any traffic going over the wire, including ceph.
>
> Cheers,
> Tobias Florek
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
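For readers finding this later: msgr2 did ship with Nautilus, and its "secure" mode can encrypt both daemon-to-daemon and client-to-daemon traffic. A hedged ceph.conf sketch (option names as documented for messenger v2; verify against your release and note that older clients without msgr2 support cannot connect in secure-only mode):

[global]
# list only "secure" to require encryption; "crc secure" would also allow unencrypted crc mode
ms_cluster_mode = secure
ms_service_mode = secure
ms_client_mode = secure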
[ceph-users] Encryption questions
Hi everyone,

I have some questions about encryption in Ceph.

1) Are RBD connections encrypted, or is there an option to use encryption between clients and Ceph? From reading the documentation, I have the impression that the only option to guarantee encryption in transit is to force clients to encrypt volumes via dmcrypt. Is there another option? I know I could encrypt the OSDs, but that's not going to solve the problem of encryption in transit.

2) I'm also struggling to understand whether communication between Ceph daemons (monitors and OSDs) is encrypted or not. I came across a few references to msgr2 but I couldn't tell if it is already implemented. Can anyone confirm this?

I'm currently starting a new project using Ceph Mimic, but if there's something new in this space expected for Nautilus, it would be good to know as well.

Regards,
Sergio
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
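For completeness, the client-side dmcrypt approach mentioned in question 1 looks roughly like this (a sketch only; pool, image, and mapping names are placeholders, and it protects the data payload rather than Ceph's control traffic):

rbd map mypool/secure-image              # exposes e.g. /dev/rbd0
cryptsetup luksFormat /dev/rbd0          # one-time: create the LUKS container
cryptsetup open /dev/rbd0 secure-image   # unlock; data is encrypted before it leaves the client
mkfs.ext4 /dev/mapper/secure-image
mount /dev/mapper/secure-image /mnt/secure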
[ceph-users] Lost machine with MON and MDS
Hi,

I have 3 machines with a ceph config with cephfs. But I lost one machine, the only one with the mon and mds. Is it possible to recover cephfs? If yes, how?

ceph: Ubuntu 16.05.5 (lost this machine)
- mon
- mds
- osd

ceph-osd-1: Ubuntu 16.05.5
- osd

ceph-osd-2: Ubuntu 16.05.5
- osd

[]´s
Maiko de Andrade
MAX Brasil
Systems Developer
+55 51 91251756
http://about.me/maiko
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
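Since the CephFS data and metadata pools live on the surviving OSDs, recovery is usually possible if a new monitor store can be rebuilt from them. A heavily condensed sketch of the documented "recover the monitor store from OSDs" procedure (paths and keyring locations are placeholders; read the disaster-recovery documentation before trying this on real data):

# with the OSDs stopped, harvest cluster maps from every surviving OSD into one directory
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$ID \
    --op update-mon-db --mon-store-path /tmp/mon-store
# repeat for each OSD, then rebuild a monitor store from the result
ceph-monstore-tool /tmp/mon-store rebuild -- --keyring /etc/ceph/admin.keyring
# install the rebuilt store.db into a freshly prepared mon data directory, start the mon,
# then recreate the MDS daemon on any node; the filesystem contents themselves are in RADOS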
Re: [ceph-users] dovecot + cephfs - sdbox vs mdbox
Hi, bring this up again to ask one more question: what would be the best recommended locking strategy for dovecot against cephfs? this is a balanced setup using independent director instances but all dovecot instances on each node share the same storage system (cephfs). Regards, Webert Lima DevOps Engineer at MAV Tecnologia *Belo Horizonte - Brasil* *IRC NICK - WebertRLZ* On Wed, May 16, 2018 at 5:15 PM Webert de Souza Lima wrote: > Thanks Jack. > > That's good to know. It is definitely something to consider. > In a distributed storage scenario we might build a dedicated pool for that > and tune the pool as more capacity or performance is needed. > > Regards, > > Webert Lima > DevOps Engineer at MAV Tecnologia > *Belo Horizonte - Brasil* > *IRC NICK - WebertRLZ* > > > On Wed, May 16, 2018 at 4:45 PM Jack wrote: > >> On 05/16/2018 09:35 PM, Webert de Souza Lima wrote: >> > We'll soon do benchmarks of sdbox vs mdbox over cephfs with bluestore >> > backend. >> > We'll have to do some some work on how to simulate user traffic, for >> writes >> > and readings. That seems troublesome. >> I would appreciate seeing these results ! >> >> > Thanks for the plugins recommendations. I'll take the change and ask you >> > how is the SIS status? We have used it in the past and we've had some >> > problems with it. >> >> I am using it since Dec 2016 with mdbox, with no issue at all (I am >> currently using Dovecot 2.2.27-3 from Debian Stretch) >> The only config I use is mail_attachment_dir, the rest lies as default >> (mail_attachment_min_size = 128k, mail_attachment_fs = sis posix, >> ail_attachment_hash = %{sha1}) >> The backend storage is a local filesystem, and there is only one Dovecot >> instance >> >> > >> > Regards, >> > >> > Webert Lima >> > DevOps Engineer at MAV Tecnologia >> > *Belo Horizonte - Brasil* >> > *IRC NICK - WebertRLZ* >> > >> > >> > On Wed, May 16, 2018 at 4:19 PM Jack wrote: >> > >> >> Hi, >> >> >> >> Many (most ?) filesystems does not store multiple files on the same >> block >> >> >> >> Thus, with sdbox, every single mail (you know, that kind of mail with >> 10 >> >> lines in it) will eat an inode, and a block (4k here) >> >> mdbox is more compact on this way >> >> >> >> Another difference: sdbox removes the message, mdbox does not : a >> single >> >> metadata update is performed, which may be packed with others if many >> >> files are deleted at once >> >> >> >> That said, I do not have experience with dovecot + cephfs, nor have >> made >> >> tests for sdbox vs mdbox >> >> >> >> However, and this is a bit out of topic, I recommend you look at the >> >> following dovecot's features (if not already done), as they are awesome >> >> and will help you a lot: >> >> - Compression (classic, https://wiki.dovecot.org/Plugins/Zlib) >> >> - Single-Instance-Storage (aka sis, aka "attachment deduplication" : >> >> https://www.dovecot.org/list/dovecot/2013-December/094276.html) >> >> >> >> Regards, >> >> On 05/16/2018 08:37 PM, Webert de Souza Lima wrote: >> >>> I'm sending this message to both dovecot and ceph-users ML so please >> >> don't >> >>> mind if something seems too obvious for you. >> >>> >> >>> Hi, >> >>> >> >>> I have a question for both dovecot and ceph lists and below I'll >> explain >> >>> what's going on. >> >>> >> >>> Regarding dbox format (https://wiki2.dovecot.org/MailboxFormat/dbox), >> >> when >> >>> using sdbox, a new file is stored for each email message. 
>> >>> When using mdbox, multiple messages are appended to a single file >> until >> >> it >> >>> reaches/passes the rotate limit. >> >>> >> >>> I would like to understand better how the mdbox format impacts on IO >> >>> performance. >> >>> I think it's generally expected that fewer larger file translate to >> less >> >> IO >> >>> and more troughput when compared to more small files, but how does >> >> dovecot >> >>> handle that with mdbox? >> >>> If dovecot does flush data to storage upon each and every n
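On the locking question asked at the top of this thread: the CephFS kernel client supports POSIX fcntl and flock locks cluster-wide, so the usual starting point for several Dovecot backends sharing one CephFS mount is something like the following dovecot.conf fragment. This is a sketch only, with standard Dovecot option names; test it against your director setup and Dovecot version.

# rely on fcntl byte-range locks, which CephFS enforces across client nodes
lock_method = fcntl
# avoid mmap so index reads stay coherent through the distributed cache
mmap_disable = yes
# fsync behaviour for indexes and mail: "optimized" is the usual compromise on networked storage
mail_fsync = optimized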
Re: [ceph-users] rados rm objects, still appear in rados ls
John Spray wrote:
> On Fri, Sep 28, 2018 at 2:25 PM Frank (lists) wrote:
>>
>> Hi,
>>
>> On my cluster I tried to clear all objects from a pool. I used the
>> command "rados -p bench ls | xargs rados -p bench rm". (rados -p bench
>> cleanup doesn't clean everything, because there was a lot of other
>> testing going on here).
>>
>> Now 'rados -p bench ls' returns a list of objects which don't exist:
>> [root@ceph01 yum.repos.d]# rados -p bench stat benchmark_data_ceph01.example.com_1805226_object32453
>> error stat-ing bench/benchmark_data_ceph01.example.com_1805226_object32453: (2) No such file or directory
>>
>> I've tried scrubbing and deep-scrubbing the pg the object is in, but the problem
>> persists. What causes this?
>
> Are you perhaps using a cache tier pool?

The pool had 2 snaps. After removing those, the ls command returned no 'non-existing' objects. I expected that ls would only return objects of the current contents; I did not specify -s for working with snaps of the pool.

>
> John
>
>> I use Centos 7.5 with mimic 13.2.2
>>
>> regards,
>>
>> Frank de Bot
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
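For anyone hitting the same surprise: objects kept alive by pool snapshots keep appearing in a plain listing even after the head objects are removed. The snapshots can be inspected and dropped with the standard rados snapshot commands (pool and snapshot names below are placeholders):

rados -p bench lssnap          # list pool snapshots
rados -p bench rmsnap snap1    # remove a snapshot; its objects are then cleaned up
rados -p bench ls              # the listing should now only show live objects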
Re: [ceph-users] Ceph-Deploy error on 15/71 stage
Hi Eugen.

Just tried everything again here, removing the /dev/sda4 partitions and leaving things so that either salt-run proposal-populate or salt-run state.orch ceph.stage.configure could try to find the free space on the partitions to work with: unsuccessful again. :(

Just to make things clear: are you telling me that it is completely impossible to have a ceph "volume" on non-dedicated devices, sharing space with, for instance, the node's swap, boot or main partition? And that the only possible way to have a functioning ceph distributed filesystem is to have, in each node, at least one disk dedicated to the operating system and another, independent disk dedicated to the ceph filesystem?

That would be an awful drawback for our plans if true, but if there is no other way, we will have to just give up. Please just answer these two questions clearly before we capitulate. :(

Anyway, thanks a lot, once again,

Jones

On Mon, Sep 3, 2018 at 5:39 AM Eugen Block wrote:
> Hi Jones,
>
> I still don't think creating an OSD on a partition will work. The
> reason is that SES creates an additional partition per OSD resulting
> in something like this:
>
> vdb    253:16  0    5G 0 disk
> ├─vdb1 253:17  0  100M 0 part /var/lib/ceph/osd/ceph-1
> └─vdb2 253:18  0  4,9G 0 part
>
> Even with external block.db and wal.db on additional devices you would
> still need two partitions for the OSD. I'm afraid with your setup this
> can't work.
>
> Regards,
> Eugen
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
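For what it's worth, outside of the DeepSea/SES tooling ceph-volume itself generally accepts a partition as the data device, so a shared-disk layout is not impossible in principle; whether stage.3 will drive that for you is a separate question. A hedged manual sketch, assuming the bootstrap-osd keyring is in place and the partition is empty:

# clear any previous signatures on the partition
ceph-volume lvm zap /dev/sda4
# create a bluestore OSD directly on the partition (an LV is created on top of it)
ceph-volume lvm create --bluestore --data /dev/sda4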
Re: [ceph-users] Ceph-Deploy error on 15/71 stage
:18.787469-03:00 polar kernel: [3.036222] ata2.00: configured for UDMA/133 2018-08-30T10:21:18.787469-03:00 polar kernel: [3.043916] scsi 1:0:0:0: CD-ROMPLDS DVD+-RW DU-8A5LH 6D1M PQ: 0 ANSI: 5 2018-08-30T10:21:18.787470-03:00 polar kernel: [3.052087] usb 1-6: new low-speed USB device number 2 using xhci_hcd 2018-08-30T10:21:18.787471-03:00 polar kernel: [3.063179] scsi 1:0:0:0: Attached scsi generic sg1 type 5 2018-08-30T10:21:18.787472-03:00 polar kernel: [3.083566] sda: sda1 sda2 sda3 sda4 2018-08-30T10:21:18.787472-03:00 polar kernel: [3.084238] sd 0:0:0:0: [sda] Attached SCSI disk 2018-08-30T10:21:18.787473-03:00 polar kernel: [3.113065] sr 1:0:0:0: [sr0] scsi3-mmc drive: 24x/24x writer cd/rw xa/form2 cdda tray 2018-08-30T10:21:18.787475-03:00 polar kernel: [3.113068] cdrom: Uniform CD-ROM driver Revision: 3.20 2018-08-30T10:21:18.787476-03:00 polar kernel: [3.113272] sr 1:0:0:0: Attached scsi CD-ROM sr0 2018-08-30T10:21:18.787477-03:00 polar kernel: [3.213133] usb 1-6: New USB device found, idVendor=413c, idProduct=2113 ### I'm trying to run deploy again here, however I'm having some connection issues today (possibly due to the heavy rain) affecting the initial stages of it. If it succeeds, I send the outputs from /var/log/messages on the minions right away. Thanks a lot, Jones On Fri, Aug 31, 2018 at 4:00 AM Eugen Block wrote: > Hi, > > I'm not sure if there's a misunderstanding. You need to track the logs > during the osd deployment step (stage.3), that is where it fails, and > this is where /var/log/messages could be useful. Since the deployment > failed you have no systemd-units (ceph-osd@.service) to log > anything. > > Before running stage.3 again try something like > > grep -C5 ceph-disk /var/log/messages (or messages-201808*.xz) > > or > > grep -C5 sda4 /var/log/messages (or messages-201808*.xz) > > If that doesn't reveal anything run stage.3 again and watch the logs. > > Regards, > Eugen > > > Zitat von Jones de Andrade : > > > Hi Eugen. > > > > Ok, edited the file /etc/salt/minion, uncommented the "log_level_logfile" > > line and set it to "debug" level. > > > > Turned off the computer, waited a few minutes so that the time frame > would > > stand out in the /var/log/messages file, and restarted the computer. > > > > Using vi I "greped out" (awful wording) the reboot section. From that, I > > also removed most of what it seemed totally unrelated to ceph, salt, > > minions, grafana, prometheus, whatever. > > > > I got the lines below. It does not seem to complain about anything that I > > can see. :( > > > > > > 2018-08-30T15:41:46.455383-03:00 torcello systemd[1]: systemd 234 running > > in system mode. (+PAM -AUDIT +SELINUX -IMA +APPARMOR -SMACK +SYSVINIT > +UTMP > > +LIBCRYPTSETUP +GCRYPT -GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID -ELFUTILS > > +KMOD -IDN2 -IDN default-hierarchy=hybrid) > > 2018-08-30T15:41:46.456330-03:00 torcello systemd[1]: Detected > architecture > > x86-64. > > 2018-08-30T15:41:46.456350-03:00 torcello systemd[1]: nss-lookup.target: > > Dependency Before=nss-lookup.target dropped > > 2018-08-30T15:41:46.456357-03:00 torcello systemd[1]: Started Load Kernel > > Modules. > > 2018-08-30T15:41:46.456369-03:00 torcello systemd[1]: Starting Apply > Kernel > > Variables... > > 2018-08-30T15:41:46.457230-03:00 torcello systemd[1]: Started > Alertmanager > > for prometheus. > > 2018-08-30T15:41:46.457237-03:00 torcello systemd[1]: Started Monitoring > > system and time series database. 
> > 2018-08-30T15:41:46.457403-03:00 torcello systemd[1]: Starting NTP > > client/server... > > > > > > > > > > > > > > *2018-08-30T15:41:46.457425-03:00 torcello systemd[1]: Started Prometheus > > exporter for machine metrics.2018-08-30T15:41:46.457706-03:00 torcello > > prometheus[695]: level=info ts=2018-08-30T18:41:44.797896888Z > > caller=main.go:225 msg="Starting Prometheus" version="(version=2.1.0, > > branch=non-git, revision=non-git)"2018-08-30T15:41:46.457712-03:00 > torcello > > prometheus[695]: level=info ts=2018-08-30T18:41:44.797969232Z > > caller=main.go:226 build_context="(go=go1.9.4, user=abuild@lamb69, > > date=20180513-03:46:03)"2018-08-30T15:41:46.457719-03:00 torcello > > prometheus[695]: level=info ts=2018-08-30T18:41:44.798008802Z > > caller=main.go:227 host_details="(Linux 4.12.14-lp150.12.4-default #1 SMP > > Tue May 22 05:17:22 UTC 2018 (66b2eda) x86_64 torcello > > (none))"2018-08-3
Re: [ceph-users] Ceph-Deploy error on 15/71 stage
511493-03:00 torcello systemd[2295]: Reached target Timers. 2018-08-30T15:44:15.511664-03:00 torcello systemd[2295]: Reached target Paths. 2018-08-30T15:44:15.517873-03:00 torcello systemd[2295]: Listening on D-Bus User Message Bus Socket. 2018-08-30T15:44:15.518060-03:00 torcello systemd[2295]: Reached target Sockets. 2018-08-30T15:44:15.518216-03:00 torcello systemd[2295]: Reached target Basic System. 2018-08-30T15:44:15.518373-03:00 torcello systemd[2295]: Reached target Default. 2018-08-30T15:44:15.518501-03:00 torcello systemd[2295]: Startup finished in 31ms. 2018-08-30T15:44:15.518634-03:00 torcello systemd[1]: Started User Manager for UID 1000. 2018-08-30T15:44:15.518759-03:00 torcello systemd[1792]: Received SIGRTMIN+24 from PID 2300 (kill). 2018-08-30T15:44:15.537634-03:00 torcello systemd[1]: Stopped User Manager for UID 464. 2018-08-30T15:44:15.538422-03:00 torcello systemd[1]: Removed slice User Slice of sddm. 2018-08-30T15:44:15.613246-03:00 torcello systemd[2295]: Started D-Bus User Message Bus. 2018-08-30T15:44:15.623989-03:00 torcello dbus-daemon[2311]: [session uid=1000 pid=2311] Successfully activated service 'org.freedesktop.systemd1' 2018-08-30T15:44:16.447162-03:00 torcello kapplymousetheme[2350]: kcm_input: Using X11 backend 2018-08-30T15:44:16.901642-03:00 torcello node_exporter[807]: time="2018-08-30T15:44:16-03:00" level=error msg="ERROR: ntp collector failed after 0.000205s: couldn't get SNTP reply: read udp 127.0.0.1:53434-> 127.0.0.1:123: read: connection refused" source="collector.go:123" Any ideas? Thanks a lot, Jones On Thu, Aug 30, 2018 at 4:14 AM Eugen Block wrote: > Hi, > > > So, it only contains logs concerning the node itself (is it correct? > sincer > > node01 is also the master, I was expecting it to have logs from the other > > too) and, moreover, no ceph-osd* files. Also, I'm looking the logs I have > > available, and nothing "shines out" (sorry for my poor english) as a > > possible error. > > the logging is not configured to be centralised per default, you would > have to configure that yourself. > > Regarding the OSDs, if there are OSD logs created, they're created on > the OSD nodes, not on the master. But since the OSD deployment fails, > there probably are no OSD specific logs yet. So you'll have to take a > look into the syslog (/var/log/messages), that's where the salt-minion > reports its attempts to create the OSDs. Chances are high that you'll > find the root cause in here. > > If the output is not enough, set the log-level to debug: > > osd-1:~ # grep -E "^log_level" /etc/salt/minion > log_level: debug > > > Regards, > Eugen > > > Zitat von Jones de Andrade : > > > Hi Eugen. > > > > Sorry for the delay in answering. > > > > Just looked in the /var/log/ceph/ directory. It only contains the > following > > files (for example on node01): > > > > ### > > # ls -lart > > total 3864 > > -rw--- 1 ceph ceph 904 ago 24 13:11 ceph.audit.log-20180829.xz > > drwxr-xr-x 1 root root 898 ago 28 10:07 .. > > -rw-r--r-- 1 ceph ceph 189464 ago 28 23:59 > ceph-mon.node01.log-20180829.xz > > -rw--- 1 ceph ceph 24360 ago 28 23:59 ceph.log-20180829.xz > > -rw-r--r-- 1 ceph ceph 48584 ago 29 00:00 > ceph-mgr.node01.log-20180829.xz > > -rw--- 1 ceph ceph 0 ago 29 00:00 ceph.audit.log > > drwxrws--T 1 ceph ceph 352 ago 29 00:00 . 
> > -rw-r--r-- 1 ceph ceph 1908122 ago 29 12:46 ceph-mon.node01.log > > -rw--- 1 ceph ceph 175229 ago 29 12:48 ceph.log > > -rw-r--r-- 1 ceph ceph 1599920 ago 29 12:49 ceph-mgr.node01.log > > ### > > > > So, it only contains logs concerning the node itself (is it correct? > sincer > > node01 is also the master, I was expecting it to have logs from the other > > too) and, moreover, no ceph-osd* files. Also, I'm looking the logs I have > > available, and nothing "shines out" (sorry for my poor english) as a > > possible error. > > > > Any suggestion on how to proceed? > > > > Thanks a lot in advance, > > > > Jones > > > > > > On Mon, Aug 27, 2018 at 5:29 AM Eugen Block wrote: > > > >> Hi Jones, > >> > >> all ceph logs are in the directory /var/log/ceph/, each daemon has its > >> own log file, e.g. OSD logs are named ceph-osd.*. > >> > >> I haven't tried it but I don't think SUSE Enterprise Storage deploys > >> OSDs on partitioned disks. Is there a way to attach a second disk to > >> the OSD nodes, maybe via USB or something? > >> > >> Although th
Re: [ceph-users] Ceph-Deploy error on 15/71 stage
Hi Eugen. Sorry for the delay in answering. Just looked in the /var/log/ceph/ directory. It only contains the following files (for example on node01): ### # ls -lart total 3864 -rw--- 1 ceph ceph 904 ago 24 13:11 ceph.audit.log-20180829.xz drwxr-xr-x 1 root root 898 ago 28 10:07 .. -rw-r--r-- 1 ceph ceph 189464 ago 28 23:59 ceph-mon.node01.log-20180829.xz -rw--- 1 ceph ceph 24360 ago 28 23:59 ceph.log-20180829.xz -rw-r--r-- 1 ceph ceph 48584 ago 29 00:00 ceph-mgr.node01.log-20180829.xz -rw--- 1 ceph ceph 0 ago 29 00:00 ceph.audit.log drwxrws--T 1 ceph ceph 352 ago 29 00:00 . -rw-r--r-- 1 ceph ceph 1908122 ago 29 12:46 ceph-mon.node01.log -rw--- 1 ceph ceph 175229 ago 29 12:48 ceph.log -rw-r--r-- 1 ceph ceph 1599920 ago 29 12:49 ceph-mgr.node01.log ### So, it only contains logs concerning the node itself (is it correct? sincer node01 is also the master, I was expecting it to have logs from the other too) and, moreover, no ceph-osd* files. Also, I'm looking the logs I have available, and nothing "shines out" (sorry for my poor english) as a possible error. Any suggestion on how to proceed? Thanks a lot in advance, Jones On Mon, Aug 27, 2018 at 5:29 AM Eugen Block wrote: > Hi Jones, > > all ceph logs are in the directory /var/log/ceph/, each daemon has its > own log file, e.g. OSD logs are named ceph-osd.*. > > I haven't tried it but I don't think SUSE Enterprise Storage deploys > OSDs on partitioned disks. Is there a way to attach a second disk to > the OSD nodes, maybe via USB or something? > > Although this thread is ceph related it is referring to a specific > product, so I would recommend to post your question in the SUSE forum > [1]. > > Regards, > Eugen > > [1] https://forums.suse.com/forumdisplay.php?99-SUSE-Enterprise-Storage > > Zitat von Jones de Andrade : > > > Hi Eugen. > > > > Thanks for the suggestion. I'll look for the logs (since it's our first > > attempt with ceph, I'll have to discover where they are, but no problem). > > > > One thing called my attention on your response however: > > > > I haven't made myself clear, but one of the failures we encountered were > > that the files now containing: > > > > node02: > >-- > >storage: > >-- > >osds: > >-- > >/dev/sda4: > >-- > >format: > >bluestore > >standalone: > >True > > > > Were originally empty, and we filled them by hand following a model found > > elsewhere on the web. It was necessary, so that we could continue, but > the > > model indicated that, for example, it should have the path for /dev/sda > > here, not /dev/sda4. We chosen to include the specific partition > > identification because we won't have dedicated disks here, rather just > the > > very same partition as all disks were partitioned exactly the same. > > > > While that was enough for the procedure to continue at that point, now I > > wonder if it was the right call and, if it indeed was, if it was done > > properly. As such, I wonder: what you mean by "wipe" the partition here? > > /dev/sda4 is created, but is both empty and unmounted: Should a different > > operation be performed on it, should I remove it first, should I have > > written the files above with only /dev/sda as target? > > > > I know that probably I wouldn't run in this issues with dedicated discks, > > but unfortunately that is absolutely not an option. > > > > Thanks a lot in advance for any comments and/or extra suggestions. 
> > > > Sincerely yours, > > > > Jones > > > > On Sat, Aug 25, 2018 at 5:46 PM Eugen Block wrote: > > > >> Hi, > >> > >> take a look into the logs, they should point you in the right direction. > >> Since the deployment stage fails at the OSD level, start with the OSD > >> logs. Something's not right with the disks/partitions, did you wipe > >> the partition from previous attempts? > >> > >> Regards, > >> Eugen > >> > >> Zitat von Jones de Andrade : > >> > >>> (Please forgive my previous email: I was using another message and > >>> completely forget to update the subject) > >>> > >>> Hi all. > >>> > >>> I'm new to ceph, and after having serious problems in ceph stages 0, 1 > >> and > >>> 2 that
Re: [ceph-users] Ceph-Deploy error on 15/71 stage
Hi Eugen. Thanks for the suggestion. I'll look for the logs (since it's our first attempt with ceph, I'll have to discover where they are, but no problem). One thing called my attention on your response however: I haven't made myself clear, but one of the failures we encountered were that the files now containing: node02: -- storage: -- osds: -- /dev/sda4: -- format: bluestore standalone: True Were originally empty, and we filled them by hand following a model found elsewhere on the web. It was necessary, so that we could continue, but the model indicated that, for example, it should have the path for /dev/sda here, not /dev/sda4. We chosen to include the specific partition identification because we won't have dedicated disks here, rather just the very same partition as all disks were partitioned exactly the same. While that was enough for the procedure to continue at that point, now I wonder if it was the right call and, if it indeed was, if it was done properly. As such, I wonder: what you mean by "wipe" the partition here? /dev/sda4 is created, but is both empty and unmounted: Should a different operation be performed on it, should I remove it first, should I have written the files above with only /dev/sda as target? I know that probably I wouldn't run in this issues with dedicated discks, but unfortunately that is absolutely not an option. Thanks a lot in advance for any comments and/or extra suggestions. Sincerely yours, Jones On Sat, Aug 25, 2018 at 5:46 PM Eugen Block wrote: > Hi, > > take a look into the logs, they should point you in the right direction. > Since the deployment stage fails at the OSD level, start with the OSD > logs. Something's not right with the disks/partitions, did you wipe > the partition from previous attempts? > > Regards, > Eugen > > Zitat von Jones de Andrade : > > > (Please forgive my previous email: I was using another message and > > completely forget to update the subject) > > > > Hi all. > > > > I'm new to ceph, and after having serious problems in ceph stages 0, 1 > and > > 2 that I could solve myself, now it seems that I have hit a wall harder > > than my head. :) > > > > When I run salt-run state.orch ceph.stage.deploy, i monitor I see it > going > > up to here: > > > > ### > > [14/71] ceph.sysctl on > > node01... ✓ (0.5s) > > node02 ✓ (0.7s) > > node03... ✓ (0.6s) > > node04. ✓ (0.5s) > > node05... ✓ (0.6s) > > node06.. ✓ (0.5s) > > > > [15/71] ceph.osd on > > node01.. ❌ (0.7s) > > node02 ❌ (0.7s) > > node03... ❌ (0.7s) > > node04. ❌ (0.6s) > > node05... ❌ (0.6s) > > node06.. ❌ (0.7s) > > > > Ended stage: ceph.stage.deploy succeeded=14/71 failed=1/71 time=624.7s > > > > Failures summary: > > > > ceph.osd (/srv/salt/ceph/osd): > > node02: > > deploy OSDs: Module function osd.deploy threw an exception. > Exception: > > Mine on node02 for cephdisks.list > > node03: > > deploy OSDs: Module function osd.deploy threw an exception. > Exception: > > Mine on node03 for cephdisks.list > > node01: > > deploy OSDs: Module function osd.deploy threw an exception. > Exception: > > Mine on node01 for cephdisks.list > > node04: > > deploy OSDs: Module function osd.deploy threw an exception. > Exception: > > Mine on node04 for cephdisks.list > > node05: > > deploy OSDs: Module function osd.deploy threw an exception. > Exception: > > Mine on node05 for cephdisks.list > > node06: > > deploy OSDs: Module function osd.deploy threw an exception. 
> Exception: > > Mine on node06 for cephdisks.list > > ### > > > > Since this is a first attempt in 6 simple test machines, we are going to > > put the mon, osds, etc, in all nodes at first. Only the master is left > in a > > single machine (node01) by now. > > > > As they are simple machin
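On the earlier question of what "wiping" the partition means: it is simply removing any leftover filesystem, LVM, or Ceph signatures from previous deployment attempts so the tooling sees /dev/sda4 as clean. A sketch of the usual options (all destructive, double-check the device name first):

wipefs --all /dev/sda4                         # remove filesystem and LVM signatures
dd if=/dev/zero of=/dev/sda4 bs=1M count=100   # clobber any remaining headers at the start
# or, if ceph-volume is available, let it do the cleanup
ceph-volume lvm zap /dev/sda4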
[ceph-users] Ceph-Deploy error on 15/71 stage
(Please forgive my previous email: I was using another message and completely forgot to update the subject)

Hi all.

I'm new to ceph, and after having serious problems in ceph stages 0, 1 and 2 that I could solve myself, now it seems that I have hit a wall harder than my head. :)

When I run salt-run state.orch ceph.stage.deploy and monitor it, I see it going up to here:

###
[14/71] ceph.sysctl on
        node01... ✓ (0.5s)
        node02 ✓ (0.7s)
        node03... ✓ (0.6s)
        node04. ✓ (0.5s)
        node05... ✓ (0.6s)
        node06.. ✓ (0.5s)

[15/71] ceph.osd on
        node01.. ❌ (0.7s)
        node02 ❌ (0.7s)
        node03... ❌ (0.7s)
        node04. ❌ (0.6s)
        node05... ❌ (0.6s)
        node06.. ❌ (0.7s)

Ended stage: ceph.stage.deploy succeeded=14/71 failed=1/71 time=624.7s

Failures summary:

ceph.osd (/srv/salt/ceph/osd):
  node02:
    deploy OSDs: Module function osd.deploy threw an exception. Exception: Mine on node02 for cephdisks.list
  node03:
    deploy OSDs: Module function osd.deploy threw an exception. Exception: Mine on node03 for cephdisks.list
  node01:
    deploy OSDs: Module function osd.deploy threw an exception. Exception: Mine on node01 for cephdisks.list
  node04:
    deploy OSDs: Module function osd.deploy threw an exception. Exception: Mine on node04 for cephdisks.list
  node05:
    deploy OSDs: Module function osd.deploy threw an exception. Exception: Mine on node05 for cephdisks.list
  node06:
    deploy OSDs: Module function osd.deploy threw an exception. Exception: Mine on node06 for cephdisks.list
###

Since this is a first attempt on 6 simple test machines, we are going to put the mon, osds, etc. on all nodes at first. Only the master is kept on a single machine (node01) for now.

As they are simple machines, they have a single hdd, which is partitioned as follows (the sda4 partition is unmounted and left for the ceph system):

###
# lsblk
NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda      8:0    0 465,8G  0 disk
├─sda1   8:1    0   500M  0 part /boot/efi
├─sda2   8:2    0    16G  0 part [SWAP]
├─sda3   8:3    0  49,3G  0 part /
└─sda4   8:4    0   400G  0 part
sr0     11:0    1   3,7G  0 rom

# salt -I 'roles:storage' cephdisks.list
node01:
node02:
node03:
node04:
node05:
node06:

# salt -I 'roles:storage' pillar.get ceph
node02:
    --
    storage:
        --
        osds:
            --
            /dev/sda4:
                --
                format: bluestore
                standalone: True

(and so on for all 6 machines)
##

Finally, and just in case, my policy.cfg file reads:

#
#cluster-unassigned/cluster/*.sls
cluster-ceph/cluster/*.sls
profile-default/cluster/*.sls
profile-default/stack/default/ceph/minions/*yml
config/stack/default/global.yml
config/stack/default/ceph/cluster.yml
role-master/cluster/node01.sls
role-admin/cluster/*.sls
role-mon/cluster/*.sls
role-mgr/cluster/*.sls
role-mds/cluster/*.sls
role-ganesha/cluster/*.sls
role-client-nfs/cluster/*.sls
role-client-cephfs/cluster/*.sls
##

Please, could someone help me and shed some light on this issue?

Thanks a lot in advance,

Regards,

Jones
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Mimic prometheus plugin -no socket could be created
Hi all. I'm new to ceph, and after having serious problems in ceph stages 0, 1 and 2 that I could solve myself, now it seems that I have hit a wall harder than my head. :) When I run salt-run state.orch ceph.stage.deploy, i monitor I see it going up to here: ### [14/71] ceph.sysctl on node01... ✓ (0.5s) node02 ✓ (0.7s) node03... ✓ (0.6s) node04. ✓ (0.5s) node05... ✓ (0.6s) node06.. ✓ (0.5s) [15/71] ceph.osd on node01.. ❌ (0.7s) node02 ❌ (0.7s) node03... ❌ (0.7s) node04. ❌ (0.6s) node05... ❌ (0.6s) node06.. ❌ (0.7s) Ended stage: ceph.stage.deploy succeeded=14/71 failed=1/71 time=624.7s Failures summary: ceph.osd (/srv/salt/ceph/osd): node02: deploy OSDs: Module function osd.deploy threw an exception. Exception: Mine on node02 for cephdisks.list node03: deploy OSDs: Module function osd.deploy threw an exception. Exception: Mine on node03 for cephdisks.list node01: deploy OSDs: Module function osd.deploy threw an exception. Exception: Mine on node01 for cephdisks.list node04: deploy OSDs: Module function osd.deploy threw an exception. Exception: Mine on node04 for cephdisks.list node05: deploy OSDs: Module function osd.deploy threw an exception. Exception: Mine on node05 for cephdisks.list node06: deploy OSDs: Module function osd.deploy threw an exception. Exception: Mine on node06 for cephdisks.list ### Since this is a first attempt in 6 simple test machines, we are going to put the mon, osds, etc, in all nodes at first. Only the master is left in a single machine (node01) by now. As they are simple machines, they have a single hdd, which is partitioned as follows (the hda4 partition is unmounted and left for the ceph system): ### # lsblk NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT sda 8:00 465,8G 0 disk ├─sda1 8:10 500M 0 part /boot/efi ├─sda2 8:2016G 0 part [SWAP] ├─sda3 8:30 49,3G 0 part / └─sda4 8:40 400G 0 part sr0 11:01 3,7G 0 rom # salt -I 'roles:storage' cephdisks.list node01: node02: node03: node04: node05: node06: # salt -I 'roles:storage' pillar.get ceph node02: -- storage: -- osds: -- /dev/sda4: -- format: bluestore standalone: True (and so on for all 6 machines) ## Finally and just in case, my policy.cfg file reads: # #cluster-unassigned/cluster/*.sls cluster-ceph/cluster/*.sls profile-default/cluster/*.sls profile-default/stack/default/ceph/minions/*yml config/stack/default/global.yml config/stack/default/ceph/cluster.yml role-master/cluster/node01.sls role-admin/cluster/*.sls role-mon/cluster/*.sls role-mgr/cluster/*.sls role-mds/cluster/*.sls role-ganesha/cluster/*.sls role-client-nfs/cluster/*.sls role-client-cephfs/cluster/*.sls ## Please, could someone help me and shed some light on this issue? Thanks a lot in advance, Regasrds, Jones On Thu, Aug 23, 2018 at 2:46 PM John Spray wrote: > On Thu, Aug 23, 2018 at 5:18 PM Steven Vacaroaia wrote: > > > > Hi All, > > > > I am trying to enable prometheus plugin with no success due to "no > socket could be created" > > > > The instructions for enabling the plugin are very straightforward and > simple > > > > Note > > My ultimate goal is to use Prometheus with Cephmetrics > > Some of you suggested to deploy ceph-exporter but why do we need to do > that when there is a plugin already ? > > > > > > How can I troubleshoot this further ? 
> > > > Unhandled exception from module 'prometheus' while running on mgr.mon01: > error('No socket could be created',) > > Aug 23 12:03:06 mon01 ceph-mgr: 2018-08-23 12:03:06.615 7fadab50e700 -1 > prometheus.serve: > > Aug 23 12:03:06 mon01 ceph-mgr: 2018-08-23 12:03:06.615 7fadab50e700 -1 > Traceback (most recent call last): > > Aug 23 12:03:06 mon01 ceph-mgr: File > "/usr/lib64/ceph/mgr/prometheus/module.py", line 720, in serve > > Aug 23 12:03:06 mon01 ceph-mgr: cherrypy.engine.start() > > Aug 23 12:03:06 mon01 ceph-mgr: File > "/usr/lib/python2.7/site-packages/cherrypy/process/wspbus.py", line 250, in > start > > Aug 23 12:03:06 mon01 ceph-mgr: raise e_info > > Aug 23 12:03:06 mon01 ceph-mgr: ChannelFailures: error('No socket could > be created',) > > The things I usually check if a process can't create a socket are: > - is th
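The CherryPy "No socket could be created" failure is almost always about the bind address or port: either something already listens on the mgr's prometheus port, or the configured server_addr cannot be bound on that host. A hedged sketch of the usual checks, assuming the default port 9283 and Mimic's centralized config (mon01 is the host from the log above):

# on the active mgr host: is the port already taken?
ss -tlnp | grep 9283
# bind to all addresses and the default port, then reload the module
ceph config set mgr mgr/prometheus/server_addr 0.0.0.0
ceph config set mgr mgr/prometheus/server_port 9283
ceph mgr module disable prometheus
ceph mgr module enable prometheus

If server_addr had been set to an IP that does not exist on the host currently holding the active mgr, that alone is enough to produce this same error.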
Re: [ceph-users] cephfs kernel client hangs
You can only try to remount the cephs dir. It will probably not work, giving you I/O Errors, so the fallback would be to use a fuse-mount. If I recall correctly you could do a lazy umount on the current dir (umount -fl /mountdir) and remount it using the FUSE client. it will work for new sessions but the currently hanging ones will still be hanging. with fuse you'll only be able to mount cephfs root dir, so if you have multiple directories, you'll need to: - mount root cephfs dir in another directory - mount each subdir (after root mounted) to the desired directory via bind mount. Regards, Webert Lima DevOps Engineer at MAV Tecnologia *Belo Horizonte - Brasil* *IRC NICK - WebertRLZ* On Wed, Aug 8, 2018 at 11:46 AM Zhenshi Zhou wrote: > Hi, > Is there any other way excpet rebooting the server when the client hangs? > If the server is in production environment, I can't restart it everytime. > > Webert de Souza Lima 于2018年8月8日周三 下午10:33写道: > >> Hi Zhenshi, >> >> if you still have the client mount hanging but no session is connected, >> you probably have some PID waiting with blocked IO from cephfs mount. >> I face that now and then and the only solution is to reboot the server, >> as you won't be able to kill a process with pending IO. >> >> Regards, >> >> Webert Lima >> DevOps Engineer at MAV Tecnologia >> *Belo Horizonte - Brasil* >> *IRC NICK - WebertRLZ* >> >> >> On Wed, Aug 8, 2018 at 11:17 AM Zhenshi Zhou wrote: >> >>> Hi Webert, >>> That command shows the current sessions, whereas the server which I get >>> the files(osdc,mdsc,monc) disconnect for a long time. >>> So I cannot get useful infomation from the command you provide. >>> >>> Thanks >>> >>> Webert de Souza Lima 于2018年8月8日周三 下午10:10写道: >>> >>>> You could also see open sessions at the MDS server by issuing `ceph >>>> daemon mds.XX session ls` >>>> >>>> Regards, >>>> >>>> Webert Lima >>>> DevOps Engineer at MAV Tecnologia >>>> *Belo Horizonte - Brasil* >>>> *IRC NICK - WebertRLZ* >>>> >>>> >>>> On Wed, Aug 8, 2018 at 5:08 AM Zhenshi Zhou >>>> wrote: >>>> >>>>> Hi, I find an old server which mounted cephfs and has the debug files. >>>>> # cat osdc >>>>> REQUESTS 0 homeless 0 >>>>> LINGER REQUESTS >>>>> BACKOFFS >>>>> # cat monc >>>>> have monmap 2 want 3+ >>>>> have osdmap 3507 >>>>> have fsmap.user 0 >>>>> have mdsmap 55 want 56+ >>>>> fs_cluster_id -1 >>>>> # cat mdsc >>>>> 194 mds0getattr #1036ae3 >>>>> >>>>> What does it mean? >>>>> >>>>> Zhenshi Zhou 于2018年8月8日周三 下午1:58写道: >>>>> >>>>>> I restarted the client server so that there's no file in that >>>>>> directory. I will take care of it if the client hangs next time. >>>>>> >>>>>> Thanks >>>>>> >>>>>> Yan, Zheng 于2018年8月8日周三 上午11:23写道: >>>>>> >>>>>>> On Wed, Aug 8, 2018 at 11:02 AM Zhenshi Zhou >>>>>>> wrote: >>>>>>> > >>>>>>> > Hi, >>>>>>> > I check all my ceph servers and they are not mount cephfs on each >>>>>>> of them(maybe I umount after testing). As a result, the cluster didn't >>>>>>> encounter a memory deadlock. Besides, I check the monitoring system and >>>>>>> the >>>>>>> memory and cpu usage were at common level while the clients hung. >>>>>>> > Back to my question, there must be something else cause the client >>>>>>> hang. 
>>>>>>> > >>>>>>> >>>>>>> Check if there are hang requests in >>>>>>> /sys/kernel/debug/ceph//{osdc,mdsc}, >>>>>>> >>>>>>> > Zhenshi Zhou 于2018年8月8日周三 上午4:16写道: >>>>>>> >> >>>>>>> >> Hi, I'm not sure if it just mounts the cephfs without using or >>>>>>> doing any operation within the mounted directory would be affected by >>>>>>> flushing cache. I mounted cephfs on osd servers only for testing and >>>>>>> then >>>>>>&
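To make the fallback above concrete, a short sketch; the paths and the client name are examples, not taken from the thread, and debugfs must be mounted for the last command:

# lazily detach the hung kernel mount (processes already blocked on it stay blocked)
umount -f -l /mnt/cephfs
# remount the cephfs root with the FUSE client
ceph-fuse --id admin /mnt/cephfs
# re-expose a subdirectory at its old location via a bind mount
mount --bind /mnt/cephfs/projects /srv/projects
# the old mount's stuck requests remain visible under debugfs
cat /sys/kernel/debug/ceph/*/osdc /sys/kernel/debug/ceph/*/mdsc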
Re: [ceph-users] cephfs kernel client hangs
Hi Zhenshi, if you still have the client mount hanging but no session is connected, you probably have some PID waiting with blocked IO from cephfs mount. I face that now and then and the only solution is to reboot the server, as you won't be able to kill a process with pending IO. Regards, Webert Lima DevOps Engineer at MAV Tecnologia *Belo Horizonte - Brasil* *IRC NICK - WebertRLZ* On Wed, Aug 8, 2018 at 11:17 AM Zhenshi Zhou wrote: > Hi Webert, > That command shows the current sessions, whereas the server which I get > the files(osdc,mdsc,monc) disconnect for a long time. > So I cannot get useful infomation from the command you provide. > > Thanks > > Webert de Souza Lima 于2018年8月8日周三 下午10:10写道: > >> You could also see open sessions at the MDS server by issuing `ceph >> daemon mds.XX session ls` >> >> Regards, >> >> Webert Lima >> DevOps Engineer at MAV Tecnologia >> *Belo Horizonte - Brasil* >> *IRC NICK - WebertRLZ* >> >> >> On Wed, Aug 8, 2018 at 5:08 AM Zhenshi Zhou wrote: >> >>> Hi, I find an old server which mounted cephfs and has the debug files. >>> # cat osdc >>> REQUESTS 0 homeless 0 >>> LINGER REQUESTS >>> BACKOFFS >>> # cat monc >>> have monmap 2 want 3+ >>> have osdmap 3507 >>> have fsmap.user 0 >>> have mdsmap 55 want 56+ >>> fs_cluster_id -1 >>> # cat mdsc >>> 194 mds0getattr #1036ae3 >>> >>> What does it mean? >>> >>> Zhenshi Zhou 于2018年8月8日周三 下午1:58写道: >>> >>>> I restarted the client server so that there's no file in that >>>> directory. I will take care of it if the client hangs next time. >>>> >>>> Thanks >>>> >>>> Yan, Zheng 于2018年8月8日周三 上午11:23写道: >>>> >>>>> On Wed, Aug 8, 2018 at 11:02 AM Zhenshi Zhou >>>>> wrote: >>>>> > >>>>> > Hi, >>>>> > I check all my ceph servers and they are not mount cephfs on each of >>>>> them(maybe I umount after testing). As a result, the cluster didn't >>>>> encounter a memory deadlock. Besides, I check the monitoring system and >>>>> the >>>>> memory and cpu usage were at common level while the clients hung. >>>>> > Back to my question, there must be something else cause the client >>>>> hang. >>>>> > >>>>> >>>>> Check if there are hang requests in >>>>> /sys/kernel/debug/ceph//{osdc,mdsc}, >>>>> >>>>> > Zhenshi Zhou 于2018年8月8日周三 上午4:16写道: >>>>> >> >>>>> >> Hi, I'm not sure if it just mounts the cephfs without using or >>>>> doing any operation within the mounted directory would be affected by >>>>> flushing cache. I mounted cephfs on osd servers only for testing and then >>>>> left it there. Anyway I will umount it. >>>>> >> >>>>> >> Thanks >>>>> >> >>>>> >> John Spray 于2018年8月8日 周三03:37写道: >>>>> >>> >>>>> >>> On Tue, Aug 7, 2018 at 5:42 PM Reed Dier >>>>> wrote: >>>>> >>> > >>>>> >>> > This is the first I am hearing about this as well. >>>>> >>> >>>>> >>> This is not a Ceph-specific thing -- it can also affect similar >>>>> >>> systems like Lustre. >>>>> >>> >>>>> >>> The classic case is when under some memory pressure, the kernel >>>>> tries >>>>> >>> to free memory by flushing the client's page cache, but doing the >>>>> >>> flush means allocating more memory on the server, making the memory >>>>> >>> pressure worse, until the whole thing just seizes up. >>>>> >>> >>>>> >>> John >>>>> >>> >>>>> >>> > Granted, I am using ceph-fuse rather than the kernel client at >>>>> this point, but that isn’t etched in stone. >>>>> >>> > >>>>> >>> > Curious if there is more to share. >>>>> >>> > >>>>> >>> > Reed >>>>> >>> > >>>>> >>> > On Aug 7, 2018, at 9:47 AM, Webert de Souza Lima < >>>>> webert.b...@gmail.c
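If it helps to confirm that a reboot is really the only way out, a hedged check for processes stuck in uninterruptible sleep (D state) on the client:

# list D-state processes and the kernel function they are blocked in
ps axo pid,stat,wchan:32,cmd | awk '$2 ~ /D/'
# optionally dump all blocked tasks to the kernel log (needs root and sysrq enabled)
echo w > /proc/sysrq-trigger

Anything blocked inside cephfs code paths there is what keeps the mount pinned; kill -9 will not help for those.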
Re: [ceph-users] cephfs kernel client hangs
You could also see open sessions at the MDS server by issuing `ceph daemon mds.XX session ls` Regards, Webert Lima DevOps Engineer at MAV Tecnologia *Belo Horizonte - Brasil* *IRC NICK - WebertRLZ* On Wed, Aug 8, 2018 at 5:08 AM Zhenshi Zhou wrote: > Hi, I find an old server which mounted cephfs and has the debug files. > # cat osdc > REQUESTS 0 homeless 0 > LINGER REQUESTS > BACKOFFS > # cat monc > have monmap 2 want 3+ > have osdmap 3507 > have fsmap.user 0 > have mdsmap 55 want 56+ > fs_cluster_id -1 > # cat mdsc > 194 mds0getattr #1036ae3 > > What does it mean? > > Zhenshi Zhou 于2018年8月8日周三 下午1:58写道: > >> I restarted the client server so that there's no file in that directory. >> I will take care of it if the client hangs next time. >> >> Thanks >> >> Yan, Zheng 于2018年8月8日周三 上午11:23写道: >> >>> On Wed, Aug 8, 2018 at 11:02 AM Zhenshi Zhou >>> wrote: >>> > >>> > Hi, >>> > I check all my ceph servers and they are not mount cephfs on each of >>> them(maybe I umount after testing). As a result, the cluster didn't >>> encounter a memory deadlock. Besides, I check the monitoring system and the >>> memory and cpu usage were at common level while the clients hung. >>> > Back to my question, there must be something else cause the client >>> hang. >>> > >>> >>> Check if there are hang requests in >>> /sys/kernel/debug/ceph//{osdc,mdsc}, >>> >>> > Zhenshi Zhou 于2018年8月8日周三 上午4:16写道: >>> >> >>> >> Hi, I'm not sure if it just mounts the cephfs without using or doing >>> any operation within the mounted directory would be affected by flushing >>> cache. I mounted cephfs on osd servers only for testing and then left it >>> there. Anyway I will umount it. >>> >> >>> >> Thanks >>> >> >>> >> John Spray 于2018年8月8日 周三03:37写道: >>> >>> >>> >>> On Tue, Aug 7, 2018 at 5:42 PM Reed Dier >>> wrote: >>> >>> > >>> >>> > This is the first I am hearing about this as well. >>> >>> >>> >>> This is not a Ceph-specific thing -- it can also affect similar >>> >>> systems like Lustre. >>> >>> >>> >>> The classic case is when under some memory pressure, the kernel tries >>> >>> to free memory by flushing the client's page cache, but doing the >>> >>> flush means allocating more memory on the server, making the memory >>> >>> pressure worse, until the whole thing just seizes up. >>> >>> >>> >>> John >>> >>> >>> >>> > Granted, I am using ceph-fuse rather than the kernel client at >>> this point, but that isn’t etched in stone. >>> >>> > >>> >>> > Curious if there is more to share. >>> >>> > >>> >>> > Reed >>> >>> > >>> >>> > On Aug 7, 2018, at 9:47 AM, Webert de Souza Lima < >>> webert.b...@gmail.com> wrote: >>> >>> > >>> >>> > >>> >>> > Yan, Zheng 于2018年8月7日周二 下午7:51写道: >>> >>> >> >>> >>> >> On Tue, Aug 7, 2018 at 7:15 PM Zhenshi Zhou >>> wrote: >>> >>> >> this can cause memory deadlock. you should avoid doing this >>> >>> >> >>> >>> >> > Yan, Zheng 于2018年8月7日 周二19:12写道: >>> >>> >> >> >>> >>> >> >> did you mount cephfs on the same machines that run ceph-osd? >>> >>> >> >> >>> >>> > >>> >>> > >>> >>> > I didn't know about this. I run this setup in production. 
:P >>> >>> > >>> >>> > Regards, >>> >>> > >>> >>> > Webert Lima >>> >>> > DevOps Engineer at MAV Tecnologia >>> >>> > Belo Horizonte - Brasil >>> >>> > IRC NICK - WebertRLZ >>> >>> > >>> >>> > ___ >>> >>> > ceph-users mailing list >>> >>> > ceph-users@lists.ceph.com >>> >>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >>> >>> > >>> >>> > >>> >>> > ___ >>> >>> > ceph-users mailing list >>> >>> > ceph-users@lists.ceph.com >>> >>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >>> >>> ___ >>> >>> ceph-users mailing list >>> >>> ceph-users@lists.ceph.com >>> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >>> > >>> > ___ >>> > ceph-users mailing list >>> > ceph-users@lists.ceph.com >>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >>> >> ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
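For reference, the mdsc line quoted above ("194 mds0 getattr #1036ae3") reads as: request tid 194, sent to MDS rank 0, a getattr on inode 0x1036ae3, still unanswered. A small sketch of the two inspection points being discussed, assuming the MDS daemon is named after the host's short name and debugfs is mounted on the client:

# on the MDS host: which client sessions does this MDS still hold?
ceph daemon mds.$(hostname -s) session ls
# on the client: requests still in flight to the MDSes and OSDs
cat /sys/kernel/debug/ceph/*/mdsc
cat /sys/kernel/debug/ceph/*/osdc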
Re: [ceph-users] Whole cluster flapping
So your OSDs are really too busy to respond heartbeats. You'll be facing this for sometime until cluster loads get lower. I would set `ceph osd set nodeep-scrub` until the heavy disk IO stops. maybe you can schedule it for enable during the night and disabling in the morning. Regards, Webert Lima DevOps Engineer at MAV Tecnologia *Belo Horizonte - Brasil* *IRC NICK - WebertRLZ* On Wed, Aug 8, 2018 at 9:18 AM CUZA Frédéric wrote: > Thx for the command line, I did take a look too it what I don’t really > know what to search for, my bad…. > > All this flapping is due to deep-scrub when it starts on an OSD things > start to go bad. > > > > I set out all the OSDs that were flapping the most (1 by 1 after > rebalancing) and it looks better even if some osds keep going down/up with > the same message in logs : > > > > 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fdabd897700' had > timed out after 90 > > > > (I update it to 90 instead of 15s) > > > > Regards, > > > > > > > > *De :* ceph-users *De la part de* > Webert de Souza Lima > *Envoyé :* 07 August 2018 16:28 > *À :* ceph-users > *Objet :* Re: [ceph-users] Whole cluster flapping > > > > oops, my bad, you're right. > > > > I don't know much you can see but maybe you can dig around performance > counters and see what's happening on those OSDs, try these: > > > > ~# ceph daemonperf osd.XX > > ~# ceph daemon osd.XX perf dump > > > > change XX to your OSD numbers. > > > > Regards, > > > > Webert Lima > > DevOps Engineer at MAV Tecnologia > > *Belo Horizonte - Brasil* > > *IRC NICK - WebertRLZ* > > > > > > On Tue, Aug 7, 2018 at 10:47 AM CUZA Frédéric > wrote: > > Pool is already deleted and no longer present in stats. > > > > Regards, > > > > *De :* ceph-users *De la part de* > Webert de Souza Lima > *Envoyé :* 07 August 2018 15:08 > *À :* ceph-users > *Objet :* Re: [ceph-users] Whole cluster flapping > > > > Frédéric, > > > > see if the number of objects is decreasing in the pool with `ceph df > [detail]` > > > > Regards, > > > > Webert Lima > > DevOps Engineer at MAV Tecnologia > > *Belo Horizonte - Brasil* > > *IRC NICK - WebertRLZ* > > > > > > On Tue, Aug 7, 2018 at 5:46 AM CUZA Frédéric wrote: > > It’s been over a week now and the whole cluster keeps flapping, it is > never the same OSDs that go down. > > Is there a way to get the progress of this recovery ? (The pool hat I > deleted is no longer present (for a while now)) > > In fact, there is a lot of i/o activity on the server where osds go down. > > > > Regards, > > > > *De :* ceph-users *De la part de* > Webert de Souza Lima > *Envoyé :* 31 July 2018 16:25 > *À :* ceph-users > *Objet :* Re: [ceph-users] Whole cluster flapping > > > > The pool deletion might have triggered a lot of IO operations on the disks > and the process might be too busy to respond to hearbeats, so the mons mark > them as down due to no response. > > Check also the OSD logs to see if they are actually crashing and > restarting, and disk IO usage (i.e. iostat). > > > > Regards, > > > > Webert Lima > > DevOps Engineer at MAV Tecnologia > > *Belo Horizonte - Brasil* > > *IRC NICK - WebertRLZ* > > > > > > On Tue, Jul 31, 2018 at 7:23 AM CUZA Frédéric > wrote: > > Hi Everyone, > > > > I just upgrade our cluster to Luminous 12.2.7 and I delete a quite large > pool that we had (120 TB). > > Our cluster is made of 14 Nodes with each composed of 12 OSDs (1 HDD -> 1 > OSD), we have SDD for journal. > > > > After I deleted the large pool my cluster started to flapping on all OSDs. 
> > Osds are marked down and then marked up as follow : > > > > 2018-07-31 10:42:51.504319 mon.ceph_monitor01 [INF] osd.97 > 172.29.228.72:6800/95783 boot > > 2018-07-31 10:42:55.330993 mon.ceph_monitor01 [WRN] Health check update: > 5798/5845200 objects misplaced (0.099%) (OBJECT_MISPLACED) > > 2018-07-31 10:42:55.331065 mon.ceph_monitor01 [WRN] Health check update: > Degraded data redundancy: 221365/5845200 objects degraded (3.787%), 98 pgs > degraded, 317 pgs undersized (PG_DEGRADED) > > 2018-07-31 10:42:55.331093 mon.ceph_monitor01 [WRN] Health check update: > 81 slow requests are blocked > 32 sec (REQUEST_SLOW) > > 2018-07-31 10:42:55.548385 mon.ceph_monitor01 [WRN] Health check update: > Reduced data availability: 13 pgs inactive, 4 pgs peering (PG_AVAILABILITY) > > 2018-07-31 10:42:55.610556 mon.
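If deep scrubs really have to be confined to off-hours, two hedged options (the cron variant assumes the ceph CLI and an admin keyring are available to root's crontab):

# root crontab: only allow deep scrubs between 22:00 and 06:00
0 22 * * * ceph osd unset nodeep-scrub
0 6 * * * ceph osd set nodeep-scrub

# or let the OSDs schedule scrubs only inside a time window themselves
ceph tell osd.* injectargs '--osd_scrub_begin_hour 22 --osd_scrub_end_hour 6'

The injectargs change is not persistent across OSD restarts, so the same values would also belong in ceph.conf.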
Re: [ceph-users] cephfs kernel client hangs
That's good to know, thanks for the explanation. Fortunately we are in the process of cluster redesign and we can definitely fix that scenario. Regards, Webert Lima DevOps Engineer at MAV Tecnologia *Belo Horizonte - Brasil* *IRC NICK - WebertRLZ* On Tue, Aug 7, 2018 at 4:37 PM John Spray wrote: > On Tue, Aug 7, 2018 at 5:42 PM Reed Dier wrote: > > > > This is the first I am hearing about this as well. > > This is not a Ceph-specific thing -- it can also affect similar > systems like Lustre. > > The classic case is when under some memory pressure, the kernel tries > to free memory by flushing the client's page cache, but doing the > flush means allocating more memory on the server, making the memory > pressure worse, until the whole thing just seizes up. > > John > > > Granted, I am using ceph-fuse rather than the kernel client at this > point, but that isn’t etched in stone. > > > > Curious if there is more to share. > > > > Reed > > > > On Aug 7, 2018, at 9:47 AM, Webert de Souza Lima > wrote: > > > > > > Yan, Zheng 于2018年8月7日周二 下午7:51写道: > >> > >> On Tue, Aug 7, 2018 at 7:15 PM Zhenshi Zhou > wrote: > >> this can cause memory deadlock. you should avoid doing this > >> > >> > Yan, Zheng 于2018年8月7日 周二19:12写道: > >> >> > >> >> did you mount cephfs on the same machines that run ceph-osd? > >> >> > > > > > > I didn't know about this. I run this setup in production. :P > > > > Regards, > > > > Webert Lima > > DevOps Engineer at MAV Tecnologia > > Belo Horizonte - Brasil > > IRC NICK - WebertRLZ > > > > ___ > > ceph-users mailing list > > ceph-users@lists.ceph.com > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > > > > ___ > > ceph-users mailing list > > ceph-users@lists.ceph.com > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
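A quick, hedged way to check for the scenario John describes, i.e. whether any OSD host still carries a cephfs mount (kernel mounts show a filesystem type of "ceph", ceph-fuse mounts normally show up as "fuse.ceph-fuse"):

# run on each OSD host
findmnt -t ceph
findmnt -t fuse.ceph-fuse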
Re: [ceph-users] cephfs kernel client hangs
Yan, Zheng wrote on Tue, Aug 7, 2018 at 7:51 PM: > On Tue, Aug 7, 2018 at 7:15 PM Zhenshi Zhou wrote: > this can cause memory deadlock. you should avoid doing this > > > Yan, Zheng wrote on Tue, Aug 7, 2018 at 7:12 PM: > >> > >> did you mount cephfs on the same machines that run ceph-osd? > >> I didn't know about this. I run this setup in production. :P Regards, Webert Lima DevOps Engineer at MAV Tecnologia *Belo Horizonte - Brasil* *IRC NICK - WebertRLZ* ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Whole cluster flapping
oops, my bad, you're right. I don't know much you can see but maybe you can dig around performance counters and see what's happening on those OSDs, try these: ~# ceph daemonperf osd.XX ~# ceph daemon osd.XX perf dump change XX to your OSD numbers. Regards, Webert Lima DevOps Engineer at MAV Tecnologia *Belo Horizonte - Brasil* *IRC NICK - WebertRLZ* On Tue, Aug 7, 2018 at 10:47 AM CUZA Frédéric wrote: > Pool is already deleted and no longer present in stats. > > > > Regards, > > > > *De :* ceph-users *De la part de* > Webert de Souza Lima > *Envoyé :* 07 August 2018 15:08 > *À :* ceph-users > *Objet :* Re: [ceph-users] Whole cluster flapping > > > > Frédéric, > > > > see if the number of objects is decreasing in the pool with `ceph df > [detail]` > > > > Regards, > > > > Webert Lima > > DevOps Engineer at MAV Tecnologia > > *Belo Horizonte - Brasil* > > *IRC NICK - WebertRLZ* > > > > > > On Tue, Aug 7, 2018 at 5:46 AM CUZA Frédéric wrote: > > It’s been over a week now and the whole cluster keeps flapping, it is > never the same OSDs that go down. > > Is there a way to get the progress of this recovery ? (The pool hat I > deleted is no longer present (for a while now)) > > In fact, there is a lot of i/o activity on the server where osds go down. > > > > Regards, > > > > *De :* ceph-users *De la part de* > Webert de Souza Lima > *Envoyé :* 31 July 2018 16:25 > *À :* ceph-users > *Objet :* Re: [ceph-users] Whole cluster flapping > > > > The pool deletion might have triggered a lot of IO operations on the disks > and the process might be too busy to respond to hearbeats, so the mons mark > them as down due to no response. > > Check also the OSD logs to see if they are actually crashing and > restarting, and disk IO usage (i.e. iostat). > > > > Regards, > > > > Webert Lima > > DevOps Engineer at MAV Tecnologia > > *Belo Horizonte - Brasil* > > *IRC NICK - WebertRLZ* > > > > > > On Tue, Jul 31, 2018 at 7:23 AM CUZA Frédéric > wrote: > > Hi Everyone, > > > > I just upgrade our cluster to Luminous 12.2.7 and I delete a quite large > pool that we had (120 TB). > > Our cluster is made of 14 Nodes with each composed of 12 OSDs (1 HDD -> 1 > OSD), we have SDD for journal. > > > > After I deleted the large pool my cluster started to flapping on all OSDs. 
> > Osds are marked down and then marked up as follow : > > > > 2018-07-31 10:42:51.504319 mon.ceph_monitor01 [INF] osd.97 > 172.29.228.72:6800/95783 boot > > 2018-07-31 10:42:55.330993 mon.ceph_monitor01 [WRN] Health check update: > 5798/5845200 objects misplaced (0.099%) (OBJECT_MISPLACED) > > 2018-07-31 10:42:55.331065 mon.ceph_monitor01 [WRN] Health check update: > Degraded data redundancy: 221365/5845200 objects degraded (3.787%), 98 pgs > degraded, 317 pgs undersized (PG_DEGRADED) > > 2018-07-31 10:42:55.331093 mon.ceph_monitor01 [WRN] Health check update: > 81 slow requests are blocked > 32 sec (REQUEST_SLOW) > > 2018-07-31 10:42:55.548385 mon.ceph_monitor01 [WRN] Health check update: > Reduced data availability: 13 pgs inactive, 4 pgs peering (PG_AVAILABILITY) > > 2018-07-31 10:42:55.610556 mon.ceph_monitor01 [INF] osd.96 > 172.29.228.72:6803/95830 boot > > 2018-07-31 10:43:00.331787 mon.ceph_monitor01 [WRN] Health check update: 5 > osds down (OSD_DOWN) > > 2018-07-31 10:43:00.331930 mon.ceph_monitor01 [WRN] Health check update: > 5782/5845401 objects misplaced (0.099%) (OBJECT_MISPLACED) > > 2018-07-31 10:43:00.331950 mon.ceph_monitor01 [WRN] Health check update: > Degraded data redundancy: 167757/5845401 objects degraded (2.870%), 77 pgs > degraded, 223 pgs undersized (PG_DEGRADED) > > 2018-07-31 10:43:00.331966 mon.ceph_monitor01 [WRN] Health check update: > 76 slow requests are blocked > 32 sec (REQUEST_SLOW) > > 2018-07-31 10:43:01.729891 mon.ceph_monitor01 [WRN] Health check update: > Reduced data availability: 7 pgs inactive, 6 pgs peering (PG_AVAILABILITY) > > 2018-07-31 10:43:01.753867 mon.ceph_monitor01 [INF] osd.4 > 172.29.228.246:6812/3144542 boot > > 2018-07-31 10:43:05.332624 mon.ceph_monitor01 [WRN] Health check update: 4 > osds down (OSD_DOWN) > > 2018-07-31 10:43:05.332691 mon.ceph_monitor01 [WRN] Health check update: > 5767/5845569 objects misplaced (0.099%) (OBJECT_MISPLACED) > > 2018-07-31 10:43:05.332718 mon.ceph_monitor01 [WRN] Health check update: > Degraded data redundancy: 130565/5845569 objects degraded (2.234%), 67 pgs > degraded, 220 pgs undersized (PG_DEGRADED) > > 2018-07-31 10:43:05.332736 mon.ceph_monitor01 [WRN]
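In addition to the counters, the admin socket can show exactly which operations were slow on a flapping OSD, which is often easier to act on; a short sketch (run on the host carrying that OSD, replace XX as above):

# live per-daemon counter view
ceph daemonperf osd.XX
# the slowest recently completed ops, with per-stage timestamps
ceph daemon osd.XX dump_historic_ops
# ops currently stuck in flight
ceph daemon osd.XX dump_ops_in_flight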
Re: [ceph-users] Whole cluster flapping
Frédéric, see if the number of objects is decreasing in the pool with `ceph df [detail]` Regards, Webert Lima DevOps Engineer at MAV Tecnologia *Belo Horizonte - Brasil* *IRC NICK - WebertRLZ* On Tue, Aug 7, 2018 at 5:46 AM CUZA Frédéric wrote: > It’s been over a week now and the whole cluster keeps flapping, it is > never the same OSDs that go down. > > Is there a way to get the progress of this recovery ? (The pool hat I > deleted is no longer present (for a while now)) > > In fact, there is a lot of i/o activity on the server where osds go down. > > > > Regards, > > > > *De :* ceph-users *De la part de* > Webert de Souza Lima > *Envoyé :* 31 July 2018 16:25 > *À :* ceph-users > *Objet :* Re: [ceph-users] Whole cluster flapping > > > > The pool deletion might have triggered a lot of IO operations on the disks > and the process might be too busy to respond to hearbeats, so the mons mark > them as down due to no response. > > Check also the OSD logs to see if they are actually crashing and > restarting, and disk IO usage (i.e. iostat). > > > > Regards, > > > > Webert Lima > > DevOps Engineer at MAV Tecnologia > > *Belo Horizonte - Brasil* > > *IRC NICK - WebertRLZ* > > > > > > On Tue, Jul 31, 2018 at 7:23 AM CUZA Frédéric > wrote: > > Hi Everyone, > > > > I just upgrade our cluster to Luminous 12.2.7 and I delete a quite large > pool that we had (120 TB). > > Our cluster is made of 14 Nodes with each composed of 12 OSDs (1 HDD -> 1 > OSD), we have SDD for journal. > > > > After I deleted the large pool my cluster started to flapping on all OSDs. > > Osds are marked down and then marked up as follow : > > > > 2018-07-31 10:42:51.504319 mon.ceph_monitor01 [INF] osd.97 > 172.29.228.72:6800/95783 boot > > 2018-07-31 10:42:55.330993 mon.ceph_monitor01 [WRN] Health check update: > 5798/5845200 objects misplaced (0.099%) (OBJECT_MISPLACED) > > 2018-07-31 10:42:55.331065 mon.ceph_monitor01 [WRN] Health check update: > Degraded data redundancy: 221365/5845200 objects degraded (3.787%), 98 pgs > degraded, 317 pgs undersized (PG_DEGRADED) > > 2018-07-31 10:42:55.331093 mon.ceph_monitor01 [WRN] Health check update: > 81 slow requests are blocked > 32 sec (REQUEST_SLOW) > > 2018-07-31 10:42:55.548385 mon.ceph_monitor01 [WRN] Health check update: > Reduced data availability: 13 pgs inactive, 4 pgs peering (PG_AVAILABILITY) > > 2018-07-31 10:42:55.610556 mon.ceph_monitor01 [INF] osd.96 > 172.29.228.72:6803/95830 boot > > 2018-07-31 10:43:00.331787 mon.ceph_monitor01 [WRN] Health check update: 5 > osds down (OSD_DOWN) > > 2018-07-31 10:43:00.331930 mon.ceph_monitor01 [WRN] Health check update: > 5782/5845401 objects misplaced (0.099%) (OBJECT_MISPLACED) > > 2018-07-31 10:43:00.331950 mon.ceph_monitor01 [WRN] Health check update: > Degraded data redundancy: 167757/5845401 objects degraded (2.870%), 77 pgs > degraded, 223 pgs undersized (PG_DEGRADED) > > 2018-07-31 10:43:00.331966 mon.ceph_monitor01 [WRN] Health check update: > 76 slow requests are blocked > 32 sec (REQUEST_SLOW) > > 2018-07-31 10:43:01.729891 mon.ceph_monitor01 [WRN] Health check update: > Reduced data availability: 7 pgs inactive, 6 pgs peering (PG_AVAILABILITY) > > 2018-07-31 10:43:01.753867 mon.ceph_monitor01 [INF] osd.4 > 172.29.228.246:6812/3144542 boot > > 2018-07-31 10:43:05.332624 mon.ceph_monitor01 [WRN] Health check update: 4 > osds down (OSD_DOWN) > > 2018-07-31 10:43:05.332691 mon.ceph_monitor01 [WRN] Health check update: > 5767/5845569 objects misplaced (0.099%) (OBJECT_MISPLACED) > > 2018-07-31 10:43:05.332718 
mon.ceph_monitor01 [WRN] Health check update: > Degraded data redundancy: 130565/5845569 objects degraded (2.234%), 67 pgs > degraded, 220 pgs undersized (PG_DEGRADED) > > 2018-07-31 10:43:05.332736 mon.ceph_monitor01 [WRN] Health check update: > 83 slow requests are blocked > 32 sec (REQUEST_SLOW) > > 2018-07-31 10:43:07.004993 mon.ceph_monitor01 [WRN] Health check update: > Reduced data availability: 5 pgs inactive, 5 pgs peering (PG_AVAILABILITY) > > 2018-07-31 10:43:10.333548 mon.ceph_monitor01 [WRN] Health check update: > 5752/5845758 objects misplaced (0.098%) (OBJECT_MISPLACED) > > 2018-07-31 10:43:10.333593 mon.ceph_monitor01 [WRN] Health check update: > Degraded data redundancy: 107805/5845758 objects degraded (1.844%), 59 pgs > degraded, 197 pgs undersized (PG_DEGRADED) > > 2018-07-31 10:43:10.333608 mon.ceph_monitor01 [WRN] Health check update: > 95 slow requests are blocked > 32 sec (REQUEST_SLOW) > > 2018-07-31 10:43:15.334451 mon.ceph_monitor01 [WRN] Health check update: > 5738/5845923 objects mispla
Re: [ceph-users] Whole cluster flapping
The pool deletion might have triggered a lot of IO operations on the disks and the process might be too busy to respond to hearbeats, so the mons mark them as down due to no response. Check also the OSD logs to see if they are actually crashing and restarting, and disk IO usage (i.e. iostat). Regards, Webert Lima DevOps Engineer at MAV Tecnologia *Belo Horizonte - Brasil* *IRC NICK - WebertRLZ* On Tue, Jul 31, 2018 at 7:23 AM CUZA Frédéric wrote: > Hi Everyone, > > > > I just upgrade our cluster to Luminous 12.2.7 and I delete a quite large > pool that we had (120 TB). > > Our cluster is made of 14 Nodes with each composed of 12 OSDs (1 HDD -> 1 > OSD), we have SDD for journal. > > > > After I deleted the large pool my cluster started to flapping on all OSDs. > > Osds are marked down and then marked up as follow : > > > > 2018-07-31 10:42:51.504319 mon.ceph_monitor01 [INF] osd.97 > 172.29.228.72:6800/95783 boot > > 2018-07-31 10:42:55.330993 mon.ceph_monitor01 [WRN] Health check update: > 5798/5845200 objects misplaced (0.099%) (OBJECT_MISPLACED) > > 2018-07-31 10:42:55.331065 mon.ceph_monitor01 [WRN] Health check update: > Degraded data redundancy: 221365/5845200 objects degraded (3.787%), 98 pgs > degraded, 317 pgs undersized (PG_DEGRADED) > > 2018-07-31 10:42:55.331093 mon.ceph_monitor01 [WRN] Health check update: > 81 slow requests are blocked > 32 sec (REQUEST_SLOW) > > 2018-07-31 10:42:55.548385 mon.ceph_monitor01 [WRN] Health check update: > Reduced data availability: 13 pgs inactive, 4 pgs peering (PG_AVAILABILITY) > > 2018-07-31 10:42:55.610556 mon.ceph_monitor01 [INF] osd.96 > 172.29.228.72:6803/95830 boot > > 2018-07-31 10:43:00.331787 mon.ceph_monitor01 [WRN] Health check update: 5 > osds down (OSD_DOWN) > > 2018-07-31 10:43:00.331930 mon.ceph_monitor01 [WRN] Health check update: > 5782/5845401 objects misplaced (0.099%) (OBJECT_MISPLACED) > > 2018-07-31 10:43:00.331950 mon.ceph_monitor01 [WRN] Health check update: > Degraded data redundancy: 167757/5845401 objects degraded (2.870%), 77 pgs > degraded, 223 pgs undersized (PG_DEGRADED) > > 2018-07-31 10:43:00.331966 mon.ceph_monitor01 [WRN] Health check update: > 76 slow requests are blocked > 32 sec (REQUEST_SLOW) > > 2018-07-31 10:43:01.729891 mon.ceph_monitor01 [WRN] Health check update: > Reduced data availability: 7 pgs inactive, 6 pgs peering (PG_AVAILABILITY) > > 2018-07-31 10:43:01.753867 mon.ceph_monitor01 [INF] osd.4 > 172.29.228.246:6812/3144542 boot > > 2018-07-31 10:43:05.332624 mon.ceph_monitor01 [WRN] Health check update: 4 > osds down (OSD_DOWN) > > 2018-07-31 10:43:05.332691 mon.ceph_monitor01 [WRN] Health check update: > 5767/5845569 objects misplaced (0.099%) (OBJECT_MISPLACED) > > 2018-07-31 10:43:05.332718 mon.ceph_monitor01 [WRN] Health check update: > Degraded data redundancy: 130565/5845569 objects degraded (2.234%), 67 pgs > degraded, 220 pgs undersized (PG_DEGRADED) > > 2018-07-31 10:43:05.332736 mon.ceph_monitor01 [WRN] Health check update: > 83 slow requests are blocked > 32 sec (REQUEST_SLOW) > > 2018-07-31 10:43:07.004993 mon.ceph_monitor01 [WRN] Health check update: > Reduced data availability: 5 pgs inactive, 5 pgs peering (PG_AVAILABILITY) > > 2018-07-31 10:43:10.333548 mon.ceph_monitor01 [WRN] Health check update: > 5752/5845758 objects misplaced (0.098%) (OBJECT_MISPLACED) > > 2018-07-31 10:43:10.333593 mon.ceph_monitor01 [WRN] Health check update: > Degraded data redundancy: 107805/5845758 objects degraded (1.844%), 59 pgs > degraded, 197 pgs undersized (PG_DEGRADED) > > 2018-07-31 
10:43:10.333608 mon.ceph_monitor01 [WRN] Health check update: > 95 slow requests are blocked > 32 sec (REQUEST_SLOW) > > 2018-07-31 10:43:15.334451 mon.ceph_monitor01 [WRN] Health check update: > 5738/5845923 objects misplaced (0.098%) (OBJECT_MISPLACED) > > 2018-07-31 10:43:15.334494 mon.ceph_monitor01 [WRN] Health check update: > Degraded data redundancy: 107807/5845923 objects degraded (1.844%), 59 pgs > degraded, 197 pgs undersized (PG_DEGRADED) > > 2018-07-31 10:43:15.334510 mon.ceph_monitor01 [WRN] Health check update: > 98 slow requests are blocked > 32 sec (REQUEST_SLOW) > > 2018-07-31 10:43:15.334865 mon.ceph_monitor01 [INF] osd.18 failed > (root=default,room=,host=) (8 reporters from different host after > 54.650576 >= grace 54.300663) > > 2018-07-31 10:43:15.336552 mon.ceph_monitor01 [WRN] Health check update: 5 > osds down (OSD_DOWN) > > 2018-07-31 10:43:17.357747 mon.ceph_monitor01 [WRN] Health check update: > Reduced data availability: 6 pgs inactive, 6 pgs peering (PG_AVAILABILITY) > > 2018-07-31 10:43:20.339495 mon.ceph_monitor01 [WRN] Health check update: > 5724/5846073 objects misplaced (0.098%) (OBJECT_MISPLACED) > > 2018-07-31 10:43:20.339543 mon.ceph_monitor01 [WRN] Health check update: > Degraded data redundancy: 122901/5846073 objects degraded (2.102%), 65 pgs > degraded, 201 pgs undersized (PG_DEGRADED) > > 2018-07-31 10:43:20.339559 mon.ceph_monitor01 [WRN] Health check update: > 78 slow reques
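To make those two checks concrete, a sketch assuming a default packaged install (NN is an OSD id on the affected host; log paths may differ):

# per-device utilization and latency, refreshed every 5 seconds
iostat -x 5
# did the OSD actually crash/restart, or was it only marked down by its peers?
journalctl -u ceph-osd@NN --since "2 hours ago"
grep -E 'heartbeat_map|suicide|abort' /var/log/ceph/ceph-osd.NN.log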
Re: [ceph-users] MDS damaged
Hi Dan, you're right, I was following the mimic instructions (which indeed worked on my mimic testbed), but luminous is different and I missed the additional step. Works now, thanks! Alessandro On 13/07/18 17:51, Dan van der Ster wrote: On Fri, Jul 13, 2018 at 4:07 PM Alessandro De Salvo wrote: However, I cannot reduce the number of mdses anymore, I used to do that with e.g.: ceph fs set cephfs max_mds 1 Trying this with 12.2.6 has apparently no effect, I am left with 2 active mdses. Is this another bug? Are you following this procedure? http://docs.ceph.com/docs/luminous/cephfs/multimds/#decreasing-the-number-of-ranks i.e. you need to deactivate after decreasing max_mds. (Mimic does this automatically, OTOH). -- dan ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
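For the archive, the extra luminous-only step from the linked document, as a sketch using the fs name and rank from this thread (mimic and later handle the deactivation automatically when max_mds is lowered):

# lower the target number of active ranks...
ceph fs set cephfs max_mds 1
# ...then explicitly stop the now-surplus rank
ceph mds deactivate cephfs:1
# watch the rank move through stopping until it disappears
ceph fs status cephfs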
Re: [ceph-users] MDS damaged
Thanks all, 100..inode, mds_snaptable and 1..inode were not corrupted, so I left them as they were. I have re-injected all the bad objects, for all mdses (2 per filesysytem) and all filesystems I had (2), and after setiing the mdses as repaired my filesystems are back! However, I cannot reduce the number of mdses anymore, I was used to do that with e.g.: ceph fs set cephfs max_mds 1 Trying this with 12.2.6 has apparently no effect, I am left with 2 active mdses. Is this another bug? Thanks, Alessandro Il 13/07/18 15:54, Yan, Zheng ha scritto: On Thu, Jul 12, 2018 at 11:39 PM Alessandro De Salvo wrote: Some progress, and more pain... I was able to recover the 200. using the ceph-objectstore-tool for one of the OSDs (all identical copies) but trying to re-inject it just with rados put was giving no error while the get was still giving the same I/O error. So the solution was to rm the object and the put it again, that worked. However, after restarting one of the MDSes and seeting it to repaired, I've hit another, similar problem: 2018-07-12 17:04:41.999136 7f54c3f4e700 -1 log_channel(cluster) log [ERR] : error reading table object 'mds0_inotable' -5 ((5) Input/output error) Can I safely try to do the same as for object 200.? Should I check something before trying it? Again, checking the copies of the object, they have identical md5sums on all the replicas. Yes, It should be safe. you also need to the same for several other objects. full object list are: 200. mds0_inotable 100..inode mds_snaptable 1..inode The first three objects are per-mds-rank. Ff you have enabled multi-active mds, you also need to update objects of other ranks. For mds.1, object names are 201., mds1_inotable and 101..inode. Thanks, Alessandro Il 12/07/18 16:46, Alessandro De Salvo ha scritto: Unfortunately yes, all the OSDs were restarted a few times, but no change. Thanks, Alessandro Il 12/07/18 15:55, Paul Emmerich ha scritto: This might seem like a stupid suggestion, but: have you tried to restart the OSDs? I've also encountered some random CRC errors that only showed up when trying to read an object, but not on scrubbing, that magically disappeared after restarting the OSD. However, in my case it was clearly related to https://tracker.ceph.com/issues/22464 which doesn't seem to be the issue here. Paul 2018-07-12 13:53 GMT+02:00 Alessandro De Salvo : Il 12/07/18 11:20, Alessandro De Salvo ha scritto: Il 12/07/18 10:58, Dan van der Ster ha scritto: On Wed, Jul 11, 2018 at 10:25 PM Gregory Farnum wrote: On Wed, Jul 11, 2018 at 9:23 AM Alessandro De Salvo wrote: OK, I found where the object is: ceph osd map cephfs_metadata 200. osdmap e632418 pool 'cephfs_metadata' (10) object '200.' -> pg 10.844f3494 (10.14) -> up ([23,35,18], p23) acting ([23,35,18], p23) So, looking at the osds 23, 35 and 18 logs in fact I see: osd.23: 2018-07-11 15:49:14.913771 7efbee672700 -1 log_channel(cluster) log [ERR] : 10.14 full-object read crc 0x976aefc5 != expected 0x9ef2b41b on 10:292cf221:::200.:head osd.35: 2018-07-11 18:01:19.989345 7f760291a700 -1 log_channel(cluster) log [ERR] : 10.14 full-object read crc 0x976aefc5 != expected 0x9ef2b41b on 10:292cf221:::200.:head osd.18: 2018-07-11 18:18:06.214933 7fabaf5c1700 -1 log_channel(cluster) log [ERR] : 10.14 full-object read crc 0x976aefc5 != expected 0x9ef2b41b on 10:292cf221:::200.:head So, basically the same error everywhere. I'm trying to issue a repair of the pg 10.14, but I'm not sure if it may help. 
No SMART errors (the fileservers are SANs, in RAID6 + LVM volumes), and no disk problems anywhere. No relevant errors in syslogs, the hosts are just fine. I cannot exclude an error on the RAID controllers, but 2 of the OSDs with 10.14 are on a SAN system and one on a different one, so I would tend to exclude they both had (silent) errors at the same time. That's fairly distressing. At this point I'd probably try extracting the object using ceph-objectstore-tool and seeing if it decodes properly as an mds journal. If it does, you might risk just putting it back in place to overwrite the crc. Wouldn't it be easier to scrub repair the PG to fix the crc? this is what I already instructed the cluster to do, a deep scrub, but I'm not sure it could repair in case all replicas are bad, as it seems to be the case. I finally managed (with the help of Dan), to perform the deep-scrub on pg 10.14, but the deep scrub did not detect anything wrong. Also trying to repair 10.14 has no effect. Still, trying to access the object I get in the OSDs: 2018-07-12 13:40:32.711732 7efbee672700 -1 log_channel(cluster) log [ERR] : 10.14 full-object read crc 0x976aefc5 != expected 0x9ef2b41b on 10:292cf221:::200.:h
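Condensing the recovery that worked here into a sketch, since the ordering matters; the object name is spelled out (the archive truncates it, the rank-0 journal header is 200.00000000), osd.23 and the paths are examples, and ceph-objectstore-tool must only be run while that OSD is stopped:

# 1. extract a good copy of the object from one replica
systemctl stop ceph-osd@23
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-23 '200.00000000' get-bytes /tmp/200.00000000.bin
systemctl start ceph-osd@23
# 2. a plain overwrite was not enough here: remove the object, then put it back
rados -p cephfs_metadata rm 200.00000000
rados -p cephfs_metadata put 200.00000000 /tmp/200.00000000.bin
# 3. mark the rank repaired so a standby MDS can take it again
ceph mds repaired cephfs:0

The same rm-then-put cycle applies to any of the other per-rank objects in Yan's list that turn out to be unreadable (here the inode and snaptable objects were fine and were left alone).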
Re: [ceph-users] MDS damaged
Some progress, and more pain... I was able to recover the 200. using the ceph-objectstore-tool for one of the OSDs (all identical copies) but trying to re-inject it just with rados put was giving no error while the get was still giving the same I/O error. So the solution was to rm the object and the put it again, that worked. However, after restarting one of the MDSes and seeting it to repaired, I've hit another, similar problem: 2018-07-12 17:04:41.999136 7f54c3f4e700 -1 log_channel(cluster) log [ERR] : error reading table object 'mds0_inotable' -5 ((5) Input/output error) Can I safely try to do the same as for object 200.? Should I check something before trying it? Again, checking the copies of the object, they have identical md5sums on all the replicas. Thanks, Alessandro Il 12/07/18 16:46, Alessandro De Salvo ha scritto: Unfortunately yes, all the OSDs were restarted a few times, but no change. Thanks, Alessandro Il 12/07/18 15:55, Paul Emmerich ha scritto: This might seem like a stupid suggestion, but: have you tried to restart the OSDs? I've also encountered some random CRC errors that only showed up when trying to read an object, but not on scrubbing, that magically disappeared after restarting the OSD. However, in my case it was clearly related to https://tracker.ceph.com/issues/22464 which doesn't seem to be the issue here. Paul 2018-07-12 13:53 GMT+02:00 Alessandro De Salvo <mailto:alessandro.desa...@roma1.infn.it>>: Il 12/07/18 11:20, Alessandro De Salvo ha scritto: Il 12/07/18 10:58, Dan van der Ster ha scritto: On Wed, Jul 11, 2018 at 10:25 PM Gregory Farnum mailto:gfar...@redhat.com>> wrote: On Wed, Jul 11, 2018 at 9:23 AM Alessandro De Salvo mailto:alessandro.desa...@roma1.infn.it>> wrote: OK, I found where the object is: ceph osd map cephfs_metadata 200. osdmap e632418 pool 'cephfs_metadata' (10) object '200.' -> pg 10.844f3494 (10.14) -> up ([23,35,18], p23) acting ([23,35,18], p23) So, looking at the osds 23, 35 and 18 logs in fact I see: osd.23: 2018-07-11 15:49:14.913771 7efbee672700 -1 log_channel(cluster) log [ERR] : 10.14 full-object read crc 0x976aefc5 != expected 0x9ef2b41b on 10:292cf221:::200.:head osd.35: 2018-07-11 18:01:19.989345 7f760291a700 -1 log_channel(cluster) log [ERR] : 10.14 full-object read crc 0x976aefc5 != expected 0x9ef2b41b on 10:292cf221:::200.:head osd.18: 2018-07-11 18:18:06.214933 7fabaf5c1700 -1 log_channel(cluster) log [ERR] : 10.14 full-object read crc 0x976aefc5 != expected 0x9ef2b41b on 10:292cf221:::200.:head So, basically the same error everywhere. I'm trying to issue a repair of the pg 10.14, but I'm not sure if it may help. No SMART errors (the fileservers are SANs, in RAID6 + LVM volumes), and no disk problems anywhere. No relevant errors in syslogs, the hosts are just fine. I cannot exclude an error on the RAID controllers, but 2 of the OSDs with 10.14 are on a SAN system and one on a different one, so I would tend to exclude they both had (silent) errors at the same time. That's fairly distressing. At this point I'd probably try extracting the object using ceph-objectstore-tool and seeing if it decodes properly as an mds journal. If it does, you might risk just putting it back in place to overwrite the crc. Wouldn't it be easier to scrub repair the PG to fix the crc? this is what I already instructed the cluster to do, a deep scrub, but I'm not sure it could repair in case all replicas are bad, as it seems to be the case. 
I finally managed (with the help of Dan), to perform the deep-scrub on pg 10.14, but the deep scrub did not detect anything wrong. Also trying to repair 10.14 has no effect. Still, trying to access the object I get in the OSDs: 2018-07-12 13:40:32.711732 7efbee672700 -1 log_channel(cluster) log [ERR]
Re: [ceph-users] MDS damaged
Unfortunately yes, all the OSDs were restarted a few times, but no change. Thanks, Alessandro Il 12/07/18 15:55, Paul Emmerich ha scritto: This might seem like a stupid suggestion, but: have you tried to restart the OSDs? I've also encountered some random CRC errors that only showed up when trying to read an object, but not on scrubbing, that magically disappeared after restarting the OSD. However, in my case it was clearly related to https://tracker.ceph.com/issues/22464 which doesn't seem to be the issue here. Paul 2018-07-12 13:53 GMT+02:00 Alessandro De Salvo <mailto:alessandro.desa...@roma1.infn.it>>: Il 12/07/18 11:20, Alessandro De Salvo ha scritto: Il 12/07/18 10:58, Dan van der Ster ha scritto: On Wed, Jul 11, 2018 at 10:25 PM Gregory Farnum mailto:gfar...@redhat.com>> wrote: On Wed, Jul 11, 2018 at 9:23 AM Alessandro De Salvo mailto:alessandro.desa...@roma1.infn.it>> wrote: OK, I found where the object is: ceph osd map cephfs_metadata 200. osdmap e632418 pool 'cephfs_metadata' (10) object '200.' -> pg 10.844f3494 (10.14) -> up ([23,35,18], p23) acting ([23,35,18], p23) So, looking at the osds 23, 35 and 18 logs in fact I see: osd.23: 2018-07-11 15:49:14.913771 7efbee672700 -1 log_channel(cluster) log [ERR] : 10.14 full-object read crc 0x976aefc5 != expected 0x9ef2b41b on 10:292cf221:::200.:head osd.35: 2018-07-11 18:01:19.989345 7f760291a700 -1 log_channel(cluster) log [ERR] : 10.14 full-object read crc 0x976aefc5 != expected 0x9ef2b41b on 10:292cf221:::200.:head osd.18: 2018-07-11 18:18:06.214933 7fabaf5c1700 -1 log_channel(cluster) log [ERR] : 10.14 full-object read crc 0x976aefc5 != expected 0x9ef2b41b on 10:292cf221:::200.:head So, basically the same error everywhere. I'm trying to issue a repair of the pg 10.14, but I'm not sure if it may help. No SMART errors (the fileservers are SANs, in RAID6 + LVM volumes), and no disk problems anywhere. No relevant errors in syslogs, the hosts are just fine. I cannot exclude an error on the RAID controllers, but 2 of the OSDs with 10.14 are on a SAN system and one on a different one, so I would tend to exclude they both had (silent) errors at the same time. That's fairly distressing. At this point I'd probably try extracting the object using ceph-objectstore-tool and seeing if it decodes properly as an mds journal. If it does, you might risk just putting it back in place to overwrite the crc. Wouldn't it be easier to scrub repair the PG to fix the crc? this is what I already instructed the cluster to do, a deep scrub, but I'm not sure it could repair in case all replicas are bad, as it seems to be the case. I finally managed (with the help of Dan), to perform the deep-scrub on pg 10.14, but the deep scrub did not detect anything wrong. Also trying to repair 10.14 has no effect. Still, trying to access the object I get in the OSDs: 2018-07-12 13:40:32.711732 7efbee672700 -1 log_channel(cluster) log [ERR] : 10.14 full-object read crc 0x976aefc5 != expected 0x9ef2b41b on 10:292cf221:::200.:head Was deep-scrub supposed to detect the wrong crc? If yes, them it sounds like a bug. Can I force the repair someway? Thanks, Alessandro Alessandro, did you already try a deep-scrub on pg 10.14? I'm waiting for the cluster to do that, I've sent it earlier this morning. I expect it'll show an inconsistent object. Though, I'm unsure if repair will correct the crc given that in this case *all* replicas have a bad crc. Exactly, this is what I wonder too. 
Cheers, Alessandro --Dan However, I'm also quite curious how it ended up that way, with a checksum mismatch b
Re: [ceph-users] MDS damaged
Il 12/07/18 11:20, Alessandro De Salvo ha scritto: Il 12/07/18 10:58, Dan van der Ster ha scritto: On Wed, Jul 11, 2018 at 10:25 PM Gregory Farnum wrote: On Wed, Jul 11, 2018 at 9:23 AM Alessandro De Salvo wrote: OK, I found where the object is: ceph osd map cephfs_metadata 200. osdmap e632418 pool 'cephfs_metadata' (10) object '200.' -> pg 10.844f3494 (10.14) -> up ([23,35,18], p23) acting ([23,35,18], p23) So, looking at the osds 23, 35 and 18 logs in fact I see: osd.23: 2018-07-11 15:49:14.913771 7efbee672700 -1 log_channel(cluster) log [ERR] : 10.14 full-object read crc 0x976aefc5 != expected 0x9ef2b41b on 10:292cf221:::200.:head osd.35: 2018-07-11 18:01:19.989345 7f760291a700 -1 log_channel(cluster) log [ERR] : 10.14 full-object read crc 0x976aefc5 != expected 0x9ef2b41b on 10:292cf221:::200.:head osd.18: 2018-07-11 18:18:06.214933 7fabaf5c1700 -1 log_channel(cluster) log [ERR] : 10.14 full-object read crc 0x976aefc5 != expected 0x9ef2b41b on 10:292cf221:::200.:head So, basically the same error everywhere. I'm trying to issue a repair of the pg 10.14, but I'm not sure if it may help. No SMART errors (the fileservers are SANs, in RAID6 + LVM volumes), and no disk problems anywhere. No relevant errors in syslogs, the hosts are just fine. I cannot exclude an error on the RAID controllers, but 2 of the OSDs with 10.14 are on a SAN system and one on a different one, so I would tend to exclude they both had (silent) errors at the same time. That's fairly distressing. At this point I'd probably try extracting the object using ceph-objectstore-tool and seeing if it decodes properly as an mds journal. If it does, you might risk just putting it back in place to overwrite the crc. Wouldn't it be easier to scrub repair the PG to fix the crc? this is what I already instructed the cluster to do, a deep scrub, but I'm not sure it could repair in case all replicas are bad, as it seems to be the case. I finally managed (with the help of Dan), to perform the deep-scrub on pg 10.14, but the deep scrub did not detect anything wrong. Also trying to repair 10.14 has no effect. Still, trying to access the object I get in the OSDs: 2018-07-12 13:40:32.711732 7efbee672700 -1 log_channel(cluster) log [ERR] : 10.14 full-object read crc 0x976aefc5 != expected 0x9ef2b41b on 10:292cf221:::200.:head Was deep-scrub supposed to detect the wrong crc? If yes, them it sounds like a bug. Can I force the repair someway? Thanks, Alessandro Alessandro, did you already try a deep-scrub on pg 10.14? I'm waiting for the cluster to do that, I've sent it earlier this morning. I expect it'll show an inconsistent object. Though, I'm unsure if repair will correct the crc given that in this case *all* replicas have a bad crc. Exactly, this is what I wonder too. Cheers, Alessandro --Dan However, I'm also quite curious how it ended up that way, with a checksum mismatch but identical data (and identical checksums!) across the three replicas. Have you previously done some kind of scrub repair on the metadata pool? Did the PG perhaps get backfilled due to cluster changes? -Greg Thanks, Alessandro Il 11/07/18 18:56, John Spray ha scritto: On Wed, Jul 11, 2018 at 4:49 PM Alessandro De Salvo wrote: Hi John, in fact I get an I/O error by hand too: rados get -p cephfs_metadata 200. 200. error getting cephfs_metadata/200.: (5) Input/output error Next step would be to go look for corresponding errors on your OSD logs, system logs, and possibly also check things like the SMART counters on your hard drives for possible root causes. 
John Can this be recovered someway? Thanks, Alessandro Il 11/07/18 18:33, John Spray ha scritto: On Wed, Jul 11, 2018 at 4:10 PM Alessandro De Salvo wrote: Hi, after the upgrade to luminous 12.2.6 today, all our MDSes have been marked as damaged. Trying to restart the instances only result in standby MDSes. We currently have 2 filesystems active and 2 MDSes each. I found the following error messages in the mon: mds.0 :6800/2412911269 down:damaged mds.1 :6800/830539001 down:damaged mds.0 :6800/4080298733 down:damaged Whenever I try to force the repaired state with ceph mds repaired : I get something like this in the MDS logs: 2018-07-11 13:20:41.597970 7ff7e010e700 0 mds.1.journaler.mdlog(ro) error getting journal off disk 2018-07-11 13:20:41.598173 7ff7df90d700 -1 log_channel(cluster) log [ERR] : Error recovering journal 0x201: (5) Input/output error An EIO reading the journal header is pretty scary. The MDS itself probably can't tell you much more about this: you need to dig down into the RADOS layer. Try reading the 200. object (that happens to be the rank 0 journal header, every CephFS filesystem should have one) u
Re: [ceph-users] MDS damaged
Il 12/07/18 10:58, Dan van der Ster ha scritto: On Wed, Jul 11, 2018 at 10:25 PM Gregory Farnum wrote: On Wed, Jul 11, 2018 at 9:23 AM Alessandro De Salvo wrote: OK, I found where the object is: ceph osd map cephfs_metadata 200. osdmap e632418 pool 'cephfs_metadata' (10) object '200.' -> pg 10.844f3494 (10.14) -> up ([23,35,18], p23) acting ([23,35,18], p23) So, looking at the osds 23, 35 and 18 logs in fact I see: osd.23: 2018-07-11 15:49:14.913771 7efbee672700 -1 log_channel(cluster) log [ERR] : 10.14 full-object read crc 0x976aefc5 != expected 0x9ef2b41b on 10:292cf221:::200.:head osd.35: 2018-07-11 18:01:19.989345 7f760291a700 -1 log_channel(cluster) log [ERR] : 10.14 full-object read crc 0x976aefc5 != expected 0x9ef2b41b on 10:292cf221:::200.:head osd.18: 2018-07-11 18:18:06.214933 7fabaf5c1700 -1 log_channel(cluster) log [ERR] : 10.14 full-object read crc 0x976aefc5 != expected 0x9ef2b41b on 10:292cf221:::200.:head So, basically the same error everywhere. I'm trying to issue a repair of the pg 10.14, but I'm not sure if it may help. No SMART errors (the fileservers are SANs, in RAID6 + LVM volumes), and no disk problems anywhere. No relevant errors in syslogs, the hosts are just fine. I cannot exclude an error on the RAID controllers, but 2 of the OSDs with 10.14 are on a SAN system and one on a different one, so I would tend to exclude they both had (silent) errors at the same time. That's fairly distressing. At this point I'd probably try extracting the object using ceph-objectstore-tool and seeing if it decodes properly as an mds journal. If it does, you might risk just putting it back in place to overwrite the crc. Wouldn't it be easier to scrub repair the PG to fix the crc? this is what I already instructed the cluster to do, a deep scrub, but I'm not sure it could repair in case all replicas are bad, as it seems to be the case. Alessandro, did you already try a deep-scrub on pg 10.14? I'm waiting for the cluster to do that, I've sent it earlier this morning. I expect it'll show an inconsistent object. Though, I'm unsure if repair will correct the crc given that in this case *all* replicas have a bad crc. Exactly, this is what I wonder too. Cheers, Alessandro --Dan However, I'm also quite curious how it ended up that way, with a checksum mismatch but identical data (and identical checksums!) across the three replicas. Have you previously done some kind of scrub repair on the metadata pool? Did the PG perhaps get backfilled due to cluster changes? -Greg Thanks, Alessandro Il 11/07/18 18:56, John Spray ha scritto: On Wed, Jul 11, 2018 at 4:49 PM Alessandro De Salvo wrote: Hi John, in fact I get an I/O error by hand too: rados get -p cephfs_metadata 200. 200. error getting cephfs_metadata/200.: (5) Input/output error Next step would be to go look for corresponding errors on your OSD logs, system logs, and possibly also check things like the SMART counters on your hard drives for possible root causes. John Can this be recovered someway? Thanks, Alessandro Il 11/07/18 18:33, John Spray ha scritto: On Wed, Jul 11, 2018 at 4:10 PM Alessandro De Salvo wrote: Hi, after the upgrade to luminous 12.2.6 today, all our MDSes have been marked as damaged. Trying to restart the instances only result in standby MDSes. We currently have 2 filesystems active and 2 MDSes each. 
I found the following error messages in the mon: mds.0 :6800/2412911269 down:damaged mds.1 :6800/830539001 down:damaged mds.0 :6800/4080298733 down:damaged Whenever I try to force the repaired state with ceph mds repaired : I get something like this in the MDS logs: 2018-07-11 13:20:41.597970 7ff7e010e700 0 mds.1.journaler.mdlog(ro) error getting journal off disk 2018-07-11 13:20:41.598173 7ff7df90d700 -1 log_channel(cluster) log [ERR] : Error recovering journal 0x201: (5) Input/output error An EIO reading the journal header is pretty scary. The MDS itself probably can't tell you much more about this: you need to dig down into the RADOS layer. Try reading the 200. object (that happens to be the rank 0 journal header, every CephFS filesystem should have one) using the `rados` command line tool. John Any attempt of running the journal export results in errors, like this one: cephfs-journal-tool --rank=cephfs:0 journal export backup.bin Error ((5) Input/output error)2018-07-11 17:01:30.631571 7f94354fff00 -1 Header 200. is unreadable 2018-07-11 17:01:30.631584 7f94354fff00 -1 journal_export: Journal not readable, attempt object-by-object dump with `rados` Same happens for recover_dentries cephfs-journal-tool --rank=cephfs:0 event recover_dentries summary Events by type:2018-07-11 17:04:19.770779 7f05429fef00 -1 Header 200. is unreadable Errors: 0
Re: [ceph-users] MDS damaged
> Il giorno 11 lug 2018, alle ore 23:25, Gregory Farnum ha > scritto: > >> On Wed, Jul 11, 2018 at 9:23 AM Alessandro De Salvo >> wrote: >> OK, I found where the object is: >> >> >> ceph osd map cephfs_metadata 200. >> osdmap e632418 pool 'cephfs_metadata' (10) object '200.' -> pg >> 10.844f3494 (10.14) -> up ([23,35,18], p23) acting ([23,35,18], p23) >> >> >> So, looking at the osds 23, 35 and 18 logs in fact I see: >> >> >> osd.23: >> >> 2018-07-11 15:49:14.913771 7efbee672700 -1 log_channel(cluster) log >> [ERR] : 10.14 full-object read crc 0x976aefc5 != expected 0x9ef2b41b on >> 10:292cf221:::200.:head >> >> >> osd.35: >> >> 2018-07-11 18:01:19.989345 7f760291a700 -1 log_channel(cluster) log >> [ERR] : 10.14 full-object read crc 0x976aefc5 != expected 0x9ef2b41b on >> 10:292cf221:::200.:head >> >> >> osd.18: >> >> 2018-07-11 18:18:06.214933 7fabaf5c1700 -1 log_channel(cluster) log >> [ERR] : 10.14 full-object read crc 0x976aefc5 != expected 0x9ef2b41b on >> 10:292cf221:::200.:head >> >> >> So, basically the same error everywhere. >> >> I'm trying to issue a repair of the pg 10.14, but I'm not sure if it may >> help. >> >> No SMART errors (the fileservers are SANs, in RAID6 + LVM volumes), and >> no disk problems anywhere. No relevant errors in syslogs, the hosts are >> just fine. I cannot exclude an error on the RAID controllers, but 2 of >> the OSDs with 10.14 are on a SAN system and one on a different one, so I >> would tend to exclude they both had (silent) errors at the same time. > > That's fairly distressing. At this point I'd probably try extracting the > object using ceph-objectstore-tool and seeing if it decodes properly as an > mds journal. If it does, you might risk just putting it back in place to > overwrite the crc. > Ok, I guess I know how to extract the object from a given OSD, but I’m not sure how to check if it decodes as mds journal, is there a procedure for this? However if trying to export all the sophie’s from all the osd brings the same object md5sum I believe I can try directly to overwrite the object, as it cannot go worse than this, correct? Also I’d need a confirmation of the procedure to follow in this case, when possibly all copies of an object are wrong, I would try the following: - set the noout - bring down all the osd where the object is present - replace the object in all stores - bring the osds up again - unset the noout Correct? > However, I'm also quite curious how it ended up that way, with a checksum > mismatch but identical data (and identical checksums!) across the three > replicas. Have you previously done some kind of scrub repair on the metadata > pool? No, at least not on this pg, I only remember of a repair but it was on a different pool. > Did the PG perhaps get backfilled due to cluster changes? That might be the case, as we have to reboot the osds sometimes when they crash. Also, yesterday we rebooted all of them, but this happens always in sequence, one by one, not all at the same time. Thanks for the help, Alessandro > -Greg > >> >> Thanks, >> >> >> Alessandro >> >> >> >> Il 11/07/18 18:56, John Spray ha scritto: >> > On Wed, Jul 11, 2018 at 4:49 PM Alessandro De Salvo >> > wrote: >> >> Hi John, >> >> >> >> in fact I get an I/O error by hand too: >> >> >> >> >> >> rados get -p cephfs_metadata 200. 200. 
>> >> error getting cephfs_metadata/200.: (5) Input/output error >> > Next step would be to go look for corresponding errors on your OSD >> > logs, system logs, and possibly also check things like the SMART >> > counters on your hard drives for possible root causes. >> > >> > John >> > >> > >> > >> >> >> >> Can this be recovered someway? >> >> >> >> Thanks, >> >> >> >> >> >> Alessandro >> >> >> >> >> >> Il 11/07/18 18:33, John Spray ha scritto: >> >>> On Wed, Jul 11, 2018 at 4:10 PM Alessandro De Salvo >> >>> wrote: >> >>>> Hi, >> >>>> >> >>>> after the upgrade to luminous 12.2.6 today, all our MDSes have been >> >>>> marked as damaged. Trying to restart the instances only result in >> >>>>
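A rough sketch of the extract-and-compare procedure being discussed, assuming systemd-managed OSDs with data paths under /var/lib/ceph/osd; the PG id and OSD ids come from this thread, the object spec placeholder comes from the tool's own list output, and the set-bytes step is only for the case where all copies are confirmed identical and decode as a valid journal header (it overwrites data, so it is strictly a last resort):

  ceph osd set noout
  # Repeat for each replica (osd.23, osd.35, osd.18): the OSD must be stopped
  # for ceph-objectstore-tool to open its store
  systemctl stop ceph-osd@23
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-23 --pgid 10.14 --op list | grep 200.
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-23 --pgid 10.14 '<object-spec-from-list>' get-bytes /tmp/200.osd23.bin
  systemctl start ceph-osd@23
  # Compare the three extracted copies
  md5sum /tmp/200.osd*.bin
  # Only if all copies match and look sane: write the object back on each
  # replica (OSD stopped again) so the stored checksum gets refreshed
  # ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-23 --pgid 10.14 '<object-spec-from-list>' set-bytes /tmp/200.osd23.bin
  ceph osd unset noout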
Re: [ceph-users] v10.2.11 Jewel released
Cheers! Thanks for all the backports and fixes. Regards, Webert Lima DevOps Engineer at MAV Tecnologia *Belo Horizonte - Brasil* *IRC NICK - WebertRLZ* On Wed, Jul 11, 2018 at 1:46 PM Abhishek Lekshmanan wrote: > > We're glad to announce v10.2.11 release of the Jewel stable release > series. This point releases brings a number of important bugfixes and > has a few important security fixes. This is most likely going to be the > final Jewel release (shine on you crazy diamond). We thank everyone in > the community for contributing towards this release and particularly > want to thank Nathan and Yuri for their relentless efforts in > backporting and testing this release. > > We recommend that all Jewel 10.2.x users upgrade. > > Notable Changes > --- > > * CVE 2018-1128: auth: cephx authorizer subject to replay attack > (issue#24836 http://tracker.ceph.com/issues/24836, Sage Weil) > > * CVE 2018-1129: auth: cephx signature check is weak (issue#24837 > http://tracker.ceph.com/issues/24837, Sage Weil) > > * CVE 2018-10861: mon: auth checks not correct for pool ops (issue#24838 > http://tracker.ceph.com/issues/24838, Jason Dillaman) > > * The RBD C API's rbd_discard method and the C++ API's Image::discard > method > now enforce a maximum length of 2GB. This restriction prevents overflow > of > the result code. > > * New OSDs will now use rocksdb for omap data by default, rather than > leveldb. omap is used by RGW bucket indexes and CephFS directories, > and when a single leveldb grows to 10s of GB with a high write or > delete workload, it can lead to high latency when leveldb's > single-threaded compaction cannot keep up. rocksdb supports multiple > threads for compaction, which avoids this problem. > > * The CephFS client now catches failures to clear dentries during startup > and refuses to start as consistency and untrimmable cache issues may > develop. The new option client_die_on_failed_dentry_invalidate (default: > true) may be turned off to allow the client to proceed (dangerous!). > > * In 10.2.10 and earlier releases, keyring caps were not checked for > validity, > so the caps string could be anything. As of 10.2.11, caps strings are > validated and providing a keyring with an invalid caps string to, e.g., > "ceph auth add" will result in an error. > > The changelog and the full release notes are at the release blog entry > at https://ceph.com/releases/v10-2-11-jewel-released/ > > Getting Ceph > > * Git at git://github.com/ceph/ceph.git > * Tarball at http://download.ceph.com/tarballs/ceph-10.2.11.tar.gz > * For packages, see http://docs.ceph.com/docs/master/install/get-packages/ > * Release git sha1: e4b061b47f07f583c92a050d9e84b1813a35671e > > > Best, > Abhishek > > -- > SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, > HRB 21284 (AG Nürnberg) > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
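As a small illustration of the caps validation change mentioned in the release notes (client name and pool are placeholders):

  # Well-formed caps strings are accepted as before
  ceph auth get-or-create client.backup mon 'allow r' osd 'allow rw pool=backup-pool'
  # As of 10.2.11, a malformed caps string (e.g. 'alow r') passed to
  # "ceph auth add" is rejected instead of being stored verbatim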
Re: [ceph-users] MDS damaged
OK, I found where the object is: ceph osd map cephfs_metadata 200. osdmap e632418 pool 'cephfs_metadata' (10) object '200.' -> pg 10.844f3494 (10.14) -> up ([23,35,18], p23) acting ([23,35,18], p23) So, looking at the osds 23, 35 and 18 logs in fact I see: osd.23: 2018-07-11 15:49:14.913771 7efbee672700 -1 log_channel(cluster) log [ERR] : 10.14 full-object read crc 0x976aefc5 != expected 0x9ef2b41b on 10:292cf221:::200.:head osd.35: 2018-07-11 18:01:19.989345 7f760291a700 -1 log_channel(cluster) log [ERR] : 10.14 full-object read crc 0x976aefc5 != expected 0x9ef2b41b on 10:292cf221:::200.:head osd.18: 2018-07-11 18:18:06.214933 7fabaf5c1700 -1 log_channel(cluster) log [ERR] : 10.14 full-object read crc 0x976aefc5 != expected 0x9ef2b41b on 10:292cf221:::200.:head So, basically the same error everywhere. I'm trying to issue a repair of the pg 10.14, but I'm not sure if it may help. No SMART errors (the fileservers are SANs, in RAID6 + LVM volumes), and no disk problems anywhere. No relevant errors in syslogs, the hosts are just fine. I cannot exclude an error on the RAID controllers, but 2 of the OSDs with 10.14 are on a SAN system and one on a different one, so I would tend to exclude they both had (silent) errors at the same time. Thanks, Alessandro Il 11/07/18 18:56, John Spray ha scritto: On Wed, Jul 11, 2018 at 4:49 PM Alessandro De Salvo wrote: Hi John, in fact I get an I/O error by hand too: rados get -p cephfs_metadata 200. 200. error getting cephfs_metadata/200.: (5) Input/output error Next step would be to go look for corresponding errors on your OSD logs, system logs, and possibly also check things like the SMART counters on your hard drives for possible root causes. John Can this be recovered someway? Thanks, Alessandro Il 11/07/18 18:33, John Spray ha scritto: On Wed, Jul 11, 2018 at 4:10 PM Alessandro De Salvo wrote: Hi, after the upgrade to luminous 12.2.6 today, all our MDSes have been marked as damaged. Trying to restart the instances only result in standby MDSes. We currently have 2 filesystems active and 2 MDSes each. I found the following error messages in the mon: mds.0 :6800/2412911269 down:damaged mds.1 :6800/830539001 down:damaged mds.0 :6800/4080298733 down:damaged Whenever I try to force the repaired state with ceph mds repaired : I get something like this in the MDS logs: 2018-07-11 13:20:41.597970 7ff7e010e700 0 mds.1.journaler.mdlog(ro) error getting journal off disk 2018-07-11 13:20:41.598173 7ff7df90d700 -1 log_channel(cluster) log [ERR] : Error recovering journal 0x201: (5) Input/output error An EIO reading the journal header is pretty scary. The MDS itself probably can't tell you much more about this: you need to dig down into the RADOS layer. Try reading the 200. object (that happens to be the rank 0 journal header, every CephFS filesystem should have one) using the `rados` command line tool. John Any attempt of running the journal export results in errors, like this one: cephfs-journal-tool --rank=cephfs:0 journal export backup.bin Error ((5) Input/output error)2018-07-11 17:01:30.631571 7f94354fff00 -1 Header 200. is unreadable 2018-07-11 17:01:30.631584 7f94354fff00 -1 journal_export: Journal not readable, attempt object-by-object dump with `rados` Same happens for recover_dentries cephfs-journal-tool --rank=cephfs:0 event recover_dentries summary Events by type:2018-07-11 17:04:19.770779 7f05429fef00 -1 Header 200. is unreadable Errors: 0 Is there something I could try to do to have the cluster back? 
I was able to dump the contents of the metadata pool with rados export -p cephfs_metadata and I'm currently trying the procedure described in http://docs.ceph.com/docs/master/cephfs/disaster-recovery-experts/#using-an-alternate-metadata-pool-for-recovery but I'm not sure if it will work as it's apparently doing nothing at the moment (maybe it's just very slow). Any help is appreciated, thanks! Alessandro ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
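For the inconsistent PG itself, the usual inspection/repair loop looks roughly like this (Luminous-era commands; whether repair helps when every replica carries the same bad digest, as here, is exactly the open question):

  # Re-run a deep scrub so the inconsistency report is current
  ceph pg deep-scrub 10.14
  # Per-object detail of what the scrub found (digest mismatches, missing shards, ...)
  rados list-inconsistent-obj 10.14 --format=json-pretty
  # Ask the primary to repair from the authoritative copy
  ceph pg repair 10.14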
Re: [ceph-users] MDS damaged
Hi John, in fact I get an I/O error by hand too: rados get -p cephfs_metadata 200. 200. error getting cephfs_metadata/200.: (5) Input/output error Can this be recovered someway? Thanks, Alessandro Il 11/07/18 18:33, John Spray ha scritto: On Wed, Jul 11, 2018 at 4:10 PM Alessandro De Salvo wrote: Hi, after the upgrade to luminous 12.2.6 today, all our MDSes have been marked as damaged. Trying to restart the instances only result in standby MDSes. We currently have 2 filesystems active and 2 MDSes each. I found the following error messages in the mon: mds.0 :6800/2412911269 down:damaged mds.1 :6800/830539001 down:damaged mds.0 :6800/4080298733 down:damaged Whenever I try to force the repaired state with ceph mds repaired : I get something like this in the MDS logs: 2018-07-11 13:20:41.597970 7ff7e010e700 0 mds.1.journaler.mdlog(ro) error getting journal off disk 2018-07-11 13:20:41.598173 7ff7df90d700 -1 log_channel(cluster) log [ERR] : Error recovering journal 0x201: (5) Input/output error An EIO reading the journal header is pretty scary. The MDS itself probably can't tell you much more about this: you need to dig down into the RADOS layer. Try reading the 200. object (that happens to be the rank 0 journal header, every CephFS filesystem should have one) using the `rados` command line tool. John Any attempt of running the journal export results in errors, like this one: cephfs-journal-tool --rank=cephfs:0 journal export backup.bin Error ((5) Input/output error)2018-07-11 17:01:30.631571 7f94354fff00 -1 Header 200. is unreadable 2018-07-11 17:01:30.631584 7f94354fff00 -1 journal_export: Journal not readable, attempt object-by-object dump with `rados` Same happens for recover_dentries cephfs-journal-tool --rank=cephfs:0 event recover_dentries summary Events by type:2018-07-11 17:04:19.770779 7f05429fef00 -1 Header 200. is unreadable Errors: 0 Is there something I could try to do to have the cluster back? I was able to dump the contents of the metadata pool with rados export -p cephfs_metadata and I'm currently trying the procedure described in http://docs.ceph.com/docs/master/cephfs/disaster-recovery-experts/#using-an-alternate-metadata-pool-for-recovery but I'm not sure if it will work as it's apparently doing nothing at the moment (maybe it's just very slow). Any help is appreciated, thanks! Alessandro ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] MDS damaged
Hi Gregory, thanks for the reply. I have the dump of the metadata pool, but I'm not sure what to check there. Is it what you mean? The cluster was operational until today at noon, when a full restart of the daemons was issued, like many other times in the past. I was trying to issue the repaired command to get a real error in the logs, but it was apparently not the case. Thanks, Alessandro Il 11/07/18 18:22, Gregory Farnum ha scritto: Have you checked the actual journal objects as the "journal export" suggested? Did you identify any actual source of the damage before issuing the "repaired" command? What is the history of the filesystems on this cluster? On Wed, Jul 11, 2018 at 8:10 AM Alessandro De Salvo <mailto:alessandro.desa...@roma1.infn.it>> wrote: Hi, after the upgrade to luminous 12.2.6 today, all our MDSes have been marked as damaged. Trying to restart the instances only result in standby MDSes. We currently have 2 filesystems active and 2 MDSes each. I found the following error messages in the mon: mds.0 :6800/2412911269 down:damaged mds.1 :6800/830539001 down:damaged mds.0 :6800/4080298733 down:damaged Whenever I try to force the repaired state with ceph mds repaired : I get something like this in the MDS logs: 2018-07-11 13:20:41.597970 7ff7e010e700 0 mds.1.journaler.mdlog(ro) error getting journal off disk 2018-07-11 13:20:41.598173 7ff7df90d700 -1 log_channel(cluster) log [ERR] : Error recovering journal 0x201: (5) Input/output error Any attempt of running the journal export results in errors, like this one: cephfs-journal-tool --rank=cephfs:0 journal export backup.bin Error ((5) Input/output error)2018-07-11 17:01:30.631571 7f94354fff00 -1 Header 200. is unreadable 2018-07-11 17:01:30.631584 7f94354fff00 -1 journal_export: Journal not readable, attempt object-by-object dump with `rados` Same happens for recover_dentries cephfs-journal-tool --rank=cephfs:0 event recover_dentries summary Events by type:2018-07-11 17:04:19.770779 7f05429fef00 -1 Header 200. is unreadable Errors: 0 Is there something I could try to do to have the cluster back? I was able to dump the contents of the metadata pool with rados export -p cephfs_metadata and I'm currently trying the procedure described in http://docs.ceph.com/docs/master/cephfs/disaster-recovery-experts/#using-an-alternate-metadata-pool-for-recovery but I'm not sure if it will work as it's apparently doing nothing at the moment (maybe it's just very slow). Any help is appreciated, thanks! Alessandro ___ ceph-users mailing list ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] MDS damaged
Hi, after the upgrade to luminous 12.2.6 today, all our MDSes have been marked as damaged. Trying to restart the instances only result in standby MDSes. We currently have 2 filesystems active and 2 MDSes each. I found the following error messages in the mon: mds.0 :6800/2412911269 down:damaged mds.1 :6800/830539001 down:damaged mds.0 :6800/4080298733 down:damaged Whenever I try to force the repaired state with ceph mds repaired : I get something like this in the MDS logs: 2018-07-11 13:20:41.597970 7ff7e010e700 0 mds.1.journaler.mdlog(ro) error getting journal off disk 2018-07-11 13:20:41.598173 7ff7df90d700 -1 log_channel(cluster) log [ERR] : Error recovering journal 0x201: (5) Input/output error Any attempt of running the journal export results in errors, like this one: cephfs-journal-tool --rank=cephfs:0 journal export backup.bin Error ((5) Input/output error)2018-07-11 17:01:30.631571 7f94354fff00 -1 Header 200. is unreadable 2018-07-11 17:01:30.631584 7f94354fff00 -1 journal_export: Journal not readable, attempt object-by-object dump with `rados` Same happens for recover_dentries cephfs-journal-tool --rank=cephfs:0 event recover_dentries summary Events by type:2018-07-11 17:04:19.770779 7f05429fef00 -1 Header 200. is unreadable Errors: 0 Is there something I could try to do to have the cluster back? I was able to dump the contents of the metadata pool with rados export -p cephfs_metadata and I'm currently trying the procedure described in http://docs.ceph.com/docs/master/cephfs/disaster-recovery-experts/#using-an-alternate-metadata-pool-for-recovery but I'm not sure if it will work as it's apparently doing nothing at the moment (maybe it's just very slow). Any help is appreciated, thanks! Alessandro ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Looking for some advise on distributed FS: Is Ceph the right option for me?
Hi all. I'm looking for some information on several distributed filesystems for our application. It looks like it finally came down to two candidates, Ceph being one of them. But there are still a few questions about it that I would really like to clarify, if possible. Our plan, initially on 6 workstations, is to have them hosting a distributed file system that can withstand two simultaneous computer failures without data loss (something resembling RAID 6, but over the network). This file system will also need to be remotely mounted (NFS server with fallbacks) by another 5+ computers. Students will be working on all 11+ computers at the same time (different requirements from different software: some use many small files, others a few really big files, 100s of GB), and absolutely no hardware modifications are allowed. This initial test bed is for undergraduate student usage, but if successful it will also be employed for our small clusters. The connection is a simple GbE. Our actual concerns are: 1) Data Resilience: It seems that a double copy of each block is the standard setting, is that correct? As such, will it stripe parity data among three computers for each block? 2) Metadata Resilience: We've seen that we can now have more than a single Metadata Server (which was a show-stopper on previous versions). However, do they have to be dedicated boxes, or can they share boxes with the Data Servers? Can it be configured in such a way that even if two metadata server computers fail the whole system's data will still be accessible from the remaining computers, without interruptions, or do they serve different data aiming only for performance? 3) Other software compatibility: We've seen that there is NFS incompatibility, is that correct? Also, any POSIX issues? 4) No single (or double) point of failure: every single possible instance has to be able to endure a *double* failure (yes, things can take time to be fixed here). Does Ceph need a single master server for any of its activities? Can it endure a double failure? How long would any sort of "fallback" take to complete; would users need to wait to regain access? I think that covers the initial questions we have. Sorry if this is the wrong list, however. Looking forward to any answer or suggestion, Regards, Jones ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
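On the data-resilience questions, a sketch of the two usual ways a Ceph pool is configured to survive two simultaneous host failures, with Luminous-style commands; pool names, PG counts and the k/m split are placeholders chosen for a 6-host setup:

  # Replication: 3 copies, one per host; any two hosts can be lost without
  # data loss (writes pause while fewer than min_size copies are available)
  ceph osd pool create cephfs_data 128 128 replicated
  ceph osd pool set cephfs_data size 3
  ceph osd pool set cephfs_data min_size 2
  # Erasure coding, RAID6-like: k=4 data + m=2 coding chunks, one per host
  ceph osd erasure-code-profile set ec-k4m2 k=4 m=2 crush-failure-domain=host
  ceph osd pool create cephfs_ec_data 128 128 erasure ec-k4m2
  # Using an EC pool as a CephFS data pool needs BlueStore and this flag (Luminous+)
  ceph osd pool set cephfs_ec_data allow_ec_overwrites true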
Re: [ceph-users] SSD for bluestore
bluestore doesn't have a journal like filestore does, but there is the WAL (Write-Ahead Log), which looks like a journal but works differently. You can (or must, depending on your needs) have SSDs to serve this WAL (and the RocksDB). Regards, Webert Lima DevOps Engineer at MAV Tecnologia *Belo Horizonte - Brasil* *IRC NICK - WebertRLZ* On Sun, Jul 8, 2018 at 11:58 AM Satish Patel wrote: > Folks, > > I'm just reading from multiple posts that bluestore doesn't need an SSD > journal, is that true? > > I'm planning to build a 5 node cluster, so depending on that I'll purchase SSDs > for the journal. > > If it does require an SSD for the journal, then what would be the best vendor and > model which lasts long? Any recommendation? > > Sent from my iPhone > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
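A minimal ceph-volume sketch of what that looks like in practice (device paths are placeholders; when only --block.db is given, the WAL simply lives inside the DB device):

  # HDD for data, NVMe/SSD partition for RocksDB (WAL follows the DB by default)
  ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1p1
  # Or split DB and WAL onto separate fast partitions
  ceph-volume lvm create --bluestore --data /dev/sdc --block.db /dev/nvme0n1p2 --block.wal /dev/nvme0n1p3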
Re: [ceph-users] FreeBSD Initiator with Ceph iscsi
I've crossposted the problem to the freebsd-stable mailinglist. There is no ALUA support at the initiator side. There were 2 options for multipathing: 1. Export your LUNs via two (or more) different paths (for example via two different target portal IP addresses), on the initiator side set up both iSCSI sessions in the usual way (like without multipathing), add kern.iscsi.fail_on_disconnection=1 to /etc/sysctl.conf, and set up gmultipath on top of LUNs reachable via those sessions 2. Set up the target so it redirects (sends "Target moved temporarily" login responses) to the target portal it considers active. Then set up the initiator (single session) to either one; the target will "bounce it" to the right place. You don't need gmultipath in this case, because from the initiator point of view there's only one iSCSI session at any time. Would an of those 2 options be possible on the ceph iscsi gateway solution to configure? Regards, Frank Jason Dillaman wrote: > Conceptually, I would assume it should just work if configured correctly > w/ multipath (to properly configure the ALUA settings on the LUNs). I > don't run FreeBSD, but any particular issue you are seeing? > > On Tue, Jun 26, 2018 at 6:06 PM Frank de Bot (lists) <mailto:li...@searchy.net>> wrote: > > Hi, > > In my test setup I have a ceph iscsi gateway (configured as in > http://docs.ceph.com/docs/luminous/rbd/iscsi-overview/ ) > > I would like to use thie with a FreeBSD (11.1) initiator, but I fail to > make a working setup in FreeBSD. Is it known if the FreeBSD initiator > (with gmultipath) can work with this gateway setup? > > > Regards, > > Frank > ___ > ceph-users mailing list > ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > > -- > Jason ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
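For what it's worth, option 1 on the FreeBSD side would look roughly like the sketch below (portal addresses, target IQN and disk device names are placeholders); whether the ceph-iscsi gateway can be set up to match either option is exactly the question being asked here:

  # /etc/sysctl.conf
  kern.iscsi.fail_on_disconnection=1
  # Log in to the same LUN through both gateway portals
  iscsictl -A -p 192.168.1.11 -t iqn.2003-01.com.redhat.iscsi-gw:ceph-igw
  iscsictl -A -p 192.168.1.12 -t iqn.2003-01.com.redhat.iscsi-gw:ceph-igw
  # Build a multipath device on top of the two resulting disks
  gmultipath label -v cephlun0 /dev/da1 /dev/da2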
[ceph-users] FreeBSD Initiator with Ceph iscsi
Hi, In my test setup I have a ceph iscsi gateway (configured as in http://docs.ceph.com/docs/luminous/rbd/iscsi-overview/ ). I would like to use this with a FreeBSD (11.1) initiator, but I fail to make a working setup in FreeBSD. Is it known whether the FreeBSD initiator (with gmultipath) can work with this gateway setup? Regards, Frank ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Intel SSD DC P3520 PCIe for OSD 1480 TBW good idea?
Hello everybody, I am thinking about making a production three node Ceph cluster with 3x 1.2TB Intel SSD DC P3520 PCIe storage devices per node: 10.8TB raw in total (7.2TB, 66%, for production). I am not planning on a journal on a separate SSD. I assume there is no advantage to this when using PCIe storage? Network connection to a Cisco SG550XG-8F8T 10GbE switch with Intel X710-DA2 (if someone knows a good budget replacement with mainline Linux support, please share). https://www.intel.com/content/www/us/en/products/memory-storage/solid-state-drives/data-center-ssds/dc-p3520-series/dc-p3520-1-2tb-aic-3d1.html Is this a good storage setup? Mainboard: Intel® Server Board S2600CW2R CPU: 2x Intel® Xeon® Processor E5-2630 v4 (25M Cache, 2.20 GHz) Memory: 1x 64GB DDR4 ECC KVR24R17D4K4/64 Disk: 2x WD Gold 4TB 7200rpm 128MB SATA3 Storage: 3x Intel SSD DC P3520 1.2TB PCIe Adapter: Intel Ethernet Converged Network Adapter X710-DA2 I want to try using NUMA to also run KVM guests besides the OSDs. I should have enough cores and only have a few OSD processes. Kind regards, Jelle de Jong ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
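A small sketch of the NUMA split idea (node numbers, core ranges and unit names are illustrative only and would have to match the actual topology reported by numactl --hardware):

  # Keep the OSD daemons on NUMA node 0 via a systemd drop-in,
  # e.g. /etc/systemd/system/ceph-osd@.service.d/numa.conf
  [Service]
  CPUAffinity=0-9
  # Start KVM guests bound to the other node
  numactl --cpunodebind=1 --membind=1 qemu-system-x86_64 ...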
Re: [ceph-users] Frequent slow requests
Frank (lists) wrote: > Hi, > > On a small cluster (3 nodes) I frequently have slow requests. When > dumping the in-flight ops from the hanging OSD, it seems it doesn't get a > 'response' for one of the subops. The events always look like: > I've done some further testing; all slow requests are blocked by OSDs on a single host. How can I debug this problem further? I can't find any errors or other strange things on the host with the OSDs that are seemingly not sending a response to an op. Regards, Frank de Bot ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
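A few commands that usually help narrow this kind of thing down (the OSD id is a placeholder; the daemon commands run against the admin socket on the OSD's own host):

  # Which OSDs are currently implicated in blocked/slow requests
  ceph health detail
  # Per-OSD commit/apply latency; one host standing out is a strong hint
  ceph osd perf
  # On the suspect host: what the stuck ops are actually waiting for
  # (sub_op replies from a peer, journal/disk, PG lock, ...)
  ceph daemon osd.12 dump_ops_in_flight
  ceph daemon osd.12 dump_historic_ops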
Re: [ceph-users] Minimal MDS for CephFS on OSD hosts
Keep in mind that the mds server is cpu-bound, so during heavy workloads it will eat up CPU usage, so the OSD daemons can affect or be affected by the MDS daemon. But it does work well. We've been running a few clusters with MON, MDS and OSDs sharing the same hosts for a couple of years now. Regards, Webert Lima DevOps Engineer at MAV Tecnologia *Belo Horizonte - Brasil* *IRC NICK - WebertRLZ* On Tue, Jun 19, 2018 at 11:03 AM Paul Emmerich wrote: > Just co-locate them with your OSDs. You can can control how much RAM the > MDSs use with the "mds cache memory limit" option. (default 1 GB) > Note that the cache should be large enough RAM to keep the active working > set in the mds cache but 1 million files is not really a lot. > As a rule of thumb: ~1GB of MDS cache per ~100k files. > > 64GB of RAM for 12 OSDs and an MDS is enough in most cases. > > Paul > > 2018-06-19 15:34 GMT+02:00 Denny Fuchs : > >> Hi, >> >> Am 19.06.2018 15:14, schrieb Stefan Kooman: >> >> Storage doesn't matter for MDS, as they won't use it to store ceph data >>> (but instead use the (meta)data pool to store meta data). >>> I would not colocate the MDS daemons with the OSDS, but instead create a >>> couple of VMs (active / standby) and give them as much RAM as you >>> possibly can. >>> >> >> thanks a lot. I think, we would start with round about 8GB and see, what >> happens. >> >> cu denny >> >> ___ >> ceph-users mailing list >> ceph-users@lists.ceph.com >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >> > > > > -- > Paul Emmerich > > Looking for help with your Ceph cluster? Contact us at https://croit.io > > croit GmbH > Freseniusstr. 31h > 81247 München > www.croit.io > Tel: +49 89 1896585 90 > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
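A minimal ceph.conf fragment for the co-located MDS sizing mentioned above; the 16 GiB value is only an example following the ~1 GB per ~100k files rule of thumb:

  [mds]
  # Luminous sizes the MDS cache by memory rather than inode count (default 1 GB)
  # 17179869184 bytes = 16 GiB
  mds cache memory limit = 17179869184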
Re: [ceph-users] Migrating cephfs data pools and/or mounting multiple filesystems belonging to the same cluster
Hi, Il 14/06/18 06:13, Yan, Zheng ha scritto: On Wed, Jun 13, 2018 at 9:35 PM Alessandro De Salvo wrote: Hi, Il 13/06/18 14:40, Yan, Zheng ha scritto: On Wed, Jun 13, 2018 at 7:06 PM Alessandro De Salvo wrote: Hi, I'm trying to migrate a cephfs data pool to a different one in order to reconfigure with new pool parameters. I've found some hints but no specific documentation to migrate pools. I'm currently trying with rados export + import, but I get errors like these: Write #-9223372036854775808::::11e1007.:head# omap_set_header failed: (95) Operation not supported The command I'm using is the following: rados export -p cephfs_data | rados import -p cephfs_data_new - So, I have a few questions: 1) would it work to swap the cephfs data pools by renaming them while the fs cluster is down? 2) how can I copy the old data pool into a new one without errors like the ones above? This won't work as you expected. some cephfs metadata records ID of data pool. This is was suspecting too, hence the question, so thanks for confirming it. Basically, once a cephfs filesystem is created the pool and structure are immutable. This is not good, though. 3) plain copy from a fs to another one would also work, but I didn't find a way to tell the ceph fuse clients how to mount different filesystems in the same cluster, any documentation on it? ceph-fuse /mnt/ceph --client_mds_namespace=cephfs_name In the meantime I also found the same option for fuse and tried it. It works with fuse, but it seems it's not possible to export via nfs-ganesha multiple filesystems. put client_mds_namespace option to client section of ceph.conf (the machine the run ganesha) Yes, that would work but then I need a (set of) exporter(s) for every cephfs filesystem. That sounds reasonable though, as it's the same situation as for the mds services. Thanks for the hint, Alessandro Anyone tried it? 4) even if I found a way to mount via fuse different filesystems belonging to the same cluster, is this feature stable enough or is it still super-experimental? very stable Very good! Thanks, Alessandro Thanks, Alessandro ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
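For reference, the per-client filesystem selection discussed here can look like this; the filesystem name and mountpoint are placeholders:

  # ceph.conf on the nfs-ganesha host: pin that client to one filesystem
  [client]
  client_mds_namespace = cephfs_new
  # ceph-fuse equivalent, one mount per filesystem
  ceph-fuse /mnt/cephfs_new --client_mds_namespace=cephfs_new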
Re: [ceph-users] cephfs: bind data pool via file layout
Got it Gregory, sounds good enough for us. Thank you all for the help provided. Regards, Webert Lima DevOps Engineer at MAV Tecnologia *Belo Horizonte - Brasil* *IRC NICK - WebertRLZ* On Wed, Jun 13, 2018 at 2:20 PM Gregory Farnum wrote: > Nah, I would use one Filesystem unless you can’t. The backtrace does > create another object but IIRC it’s a maximum one IO per create/rename (on > the file). > On Wed, Jun 13, 2018 at 1:12 PM Webert de Souza Lima < > webert.b...@gmail.com> wrote: > >> Thanks for clarifying that, Gregory. >> >> As said before, we use the file layout to resolve the difference of >> workloads in those 2 different directories in cephfs. >> Would you recommend using 2 filesystems instead? By doing so, each fs >> would have it's default data pool accordingly. >> >> >> Regards, >> >> Webert Lima >> DevOps Engineer at MAV Tecnologia >> *Belo Horizonte - Brasil* >> *IRC NICK - WebertRLZ* >> >> >> On Wed, Jun 13, 2018 at 11:33 AM Gregory Farnum >> wrote: >> >>> The backtrace object Zheng referred to is used only for resolving hard >>> links or in disaster recovery scenarios. If the default data pool isn’t >>> available you would stack up pending RADOS writes inside of your mds but >>> the rest of the system would continue unless you manage to run the mds out >>> of memory. >>> -Greg >>> On Wed, Jun 13, 2018 at 9:25 AM Webert de Souza Lima < >>> webert.b...@gmail.com> wrote: >>> >>>> Thank you Zheng. >>>> >>>> Does that mean that, when using such feature, our data integrity relies >>>> now on both data pools' integrity/availability? >>>> >>>> We currently use such feature in production for dovecot's index files, >>>> so we could store this directory on a pool of SSDs only. The main data pool >>>> is made of HDDs and stores the email files themselves. >>>> >>>> There ain't too many files created, it's just a few files per email >>>> user, and basically one directory per user's mailbox. >>>> Each mailbox has a index file that is updated upon every new email >>>> received or moved, deleted, read, etc. >>>> >>>> I think in this scenario the overhead may be acceptable for us. >>>> >>>> >>>> Regards, >>>> >>>> Webert Lima >>>> DevOps Engineer at MAV Tecnologia >>>> *Belo Horizonte - Brasil* >>>> *IRC NICK - WebertRLZ* >>>> >>>> >>>> On Wed, Jun 13, 2018 at 9:51 AM Yan, Zheng wrote: >>>> >>>>> On Wed, Jun 13, 2018 at 3:34 AM Webert de Souza Lima >>>>> wrote: >>>>> > >>>>> > hello, >>>>> > >>>>> > is there any performance impact on cephfs for using file layouts to >>>>> bind a specific directory in cephfs to a given pool? Of course, such pool >>>>> is not the default data pool for this cephfs. >>>>> > >>>>> >>>>> For each file, no matter which pool file data are stored, mds alway >>>>> create an object in the default data pool. The object in default data >>>>> pool is used for storing backtrace. So files stored in non-default >>>>> pool have extra overhead on file creation. For large file, the >>>>> overhead can be neglect. But for lots of small files, the overhead may >>>>> affect performance. 
>>>>> >>>>> >>>>> > Regards, >>>>> > >>>>> > Webert Lima >>>>> > DevOps Engineer at MAV Tecnologia >>>>> > Belo Horizonte - Brasil >>>>> > IRC NICK - WebertRLZ >>>>> > ___ >>>>> > ceph-users mailing list >>>>> > ceph-users@lists.ceph.com >>>>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >>>>> >>>> ___ >>>> ceph-users mailing list >>>> ceph-users@lists.ceph.com >>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >>>> >>> ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
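For completeness, binding a directory to a non-default data pool looks roughly like this (pool, filesystem and directory names are placeholders); as discussed above, the backtrace object of each new file still lands in the default data pool:

  # Make the extra pool a data pool of the filesystem
  ceph fs add_data_pool cephfs cephfs_ssd_data
  # New files created under this directory get their data in the SSD pool
  setfattr -n ceph.dir.layout.pool -v cephfs_ssd_data /mnt/cephfs/INDEX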
Re: [ceph-users] cephfs: bind data pool via file layout
Thanks for clarifying that, Gregory. As said before, we use the file layout to resolve the difference of workloads in those 2 different directories in cephfs. Would you recommend using 2 filesystems instead? By doing so, each fs would have it's default data pool accordingly. Regards, Webert Lima DevOps Engineer at MAV Tecnologia *Belo Horizonte - Brasil* *IRC NICK - WebertRLZ* On Wed, Jun 13, 2018 at 11:33 AM Gregory Farnum wrote: > The backtrace object Zheng referred to is used only for resolving hard > links or in disaster recovery scenarios. If the default data pool isn’t > available you would stack up pending RADOS writes inside of your mds but > the rest of the system would continue unless you manage to run the mds out > of memory. > -Greg > On Wed, Jun 13, 2018 at 9:25 AM Webert de Souza Lima < > webert.b...@gmail.com> wrote: > >> Thank you Zheng. >> >> Does that mean that, when using such feature, our data integrity relies >> now on both data pools' integrity/availability? >> >> We currently use such feature in production for dovecot's index files, so >> we could store this directory on a pool of SSDs only. The main data pool is >> made of HDDs and stores the email files themselves. >> >> There ain't too many files created, it's just a few files per email user, >> and basically one directory per user's mailbox. >> Each mailbox has a index file that is updated upon every new email >> received or moved, deleted, read, etc. >> >> I think in this scenario the overhead may be acceptable for us. >> >> >> Regards, >> >> Webert Lima >> DevOps Engineer at MAV Tecnologia >> *Belo Horizonte - Brasil* >> *IRC NICK - WebertRLZ* >> >> >> On Wed, Jun 13, 2018 at 9:51 AM Yan, Zheng wrote: >> >>> On Wed, Jun 13, 2018 at 3:34 AM Webert de Souza Lima >>> wrote: >>> > >>> > hello, >>> > >>> > is there any performance impact on cephfs for using file layouts to >>> bind a specific directory in cephfs to a given pool? Of course, such pool >>> is not the default data pool for this cephfs. >>> > >>> >>> For each file, no matter which pool file data are stored, mds alway >>> create an object in the default data pool. The object in default data >>> pool is used for storing backtrace. So files stored in non-default >>> pool have extra overhead on file creation. For large file, the >>> overhead can be neglect. But for lots of small files, the overhead may >>> affect performance. >>> >>> >>> > Regards, >>> > >>> > Webert Lima >>> > DevOps Engineer at MAV Tecnologia >>> > Belo Horizonte - Brasil >>> > IRC NICK - WebertRLZ >>> > ___ >>> > ceph-users mailing list >>> > ceph-users@lists.ceph.com >>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >>> >> ___ >> ceph-users mailing list >> ceph-users@lists.ceph.com >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >> > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Migrating cephfs data pools and/or mounting multiple filesystems belonging to the same cluster
Hi, Il 13/06/18 14:40, Yan, Zheng ha scritto: On Wed, Jun 13, 2018 at 7:06 PM Alessandro De Salvo wrote: Hi, I'm trying to migrate a cephfs data pool to a different one in order to reconfigure with new pool parameters. I've found some hints but no specific documentation to migrate pools. I'm currently trying with rados export + import, but I get errors like these: Write #-9223372036854775808::::11e1007.:head# omap_set_header failed: (95) Operation not supported The command I'm using is the following: rados export -p cephfs_data | rados import -p cephfs_data_new - So, I have a few questions: 1) would it work to swap the cephfs data pools by renaming them while the fs cluster is down? 2) how can I copy the old data pool into a new one without errors like the ones above? This won't work as you expected. some cephfs metadata records ID of data pool. This is was suspecting too, hence the question, so thanks for confirming it. Basically, once a cephfs filesystem is created the pool and structure are immutable. This is not good, though. 3) plain copy from a fs to another one would also work, but I didn't find a way to tell the ceph fuse clients how to mount different filesystems in the same cluster, any documentation on it? ceph-fuse /mnt/ceph --client_mds_namespace=cephfs_name In the meantime I also found the same option for fuse and tried it. It works with fuse, but it seems it's not possible to export via nfs-ganesha multiple filesystems. Anyone tried it? 4) even if I found a way to mount via fuse different filesystems belonging to the same cluster, is this feature stable enough or is it still super-experimental? very stable Very good! Thanks, Alessandro Thanks, Alessandro ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] cephfs: bind data pool via file layout
Thank you Zheng. Does that mean that, when using such feature, our data integrity relies now on both data pools' integrity/availability? We currently use such feature in production for dovecot's index files, so we could store this directory on a pool of SSDs only. The main data pool is made of HDDs and stores the email files themselves. There ain't too many files created, it's just a few files per email user, and basically one directory per user's mailbox. Each mailbox has a index file that is updated upon every new email received or moved, deleted, read, etc. I think in this scenario the overhead may be acceptable for us. Regards, Webert Lima DevOps Engineer at MAV Tecnologia *Belo Horizonte - Brasil* *IRC NICK - WebertRLZ* On Wed, Jun 13, 2018 at 9:51 AM Yan, Zheng wrote: > On Wed, Jun 13, 2018 at 3:34 AM Webert de Souza Lima > wrote: > > > > hello, > > > > is there any performance impact on cephfs for using file layouts to bind > a specific directory in cephfs to a given pool? Of course, such pool is not > the default data pool for this cephfs. > > > > For each file, no matter which pool file data are stored, mds alway > create an object in the default data pool. The object in default data > pool is used for storing backtrace. So files stored in non-default > pool have extra overhead on file creation. For large file, the > overhead can be neglect. But for lots of small files, the overhead may > affect performance. > > > > Regards, > > > > Webert Lima > > DevOps Engineer at MAV Tecnologia > > Belo Horizonte - Brasil > > IRC NICK - WebertRLZ > > ___ > > ceph-users mailing list > > ceph-users@lists.ceph.com > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Migrating cephfs data pools and/or mounting multiple filesystems belonging to the same cluster
Hi, I'm trying to migrate a cephfs data pool to a different one in order to reconfigure with new pool parameters. I've found some hints but no specific documentation to migrate pools. I'm currently trying with rados export + import, but I get errors like these: Write #-9223372036854775808::::11e1007.:head# omap_set_header failed: (95) Operation not supported The command I'm using is the following: rados export -p cephfs_data | rados import -p cephfs_data_new - So, I have a few questions: 1) would it work to swap the cephfs data pools by renaming them while the fs cluster is down? 2) how can I copy the old data pool into a new one without errors like the ones above? 3) plain copy from a fs to another one would also work, but I didn't find a way to tell the ceph fuse clients how to mount different filesystems in the same cluster, any documentation on it? 4) even if I found a way to mount via fuse different filesystems belonging to the same cluster, is this feature stable enough or is it still super-experimental? Thanks, Alessandro ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
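A sketch of the "second filesystem plus file-level copy" route that the replies to this message converge on (names are placeholders; at the time, multiple filesystems per cluster had to be enabled explicitly):

  # Allow more than one filesystem in the cluster
  ceph fs flag set enable_multiple true --yes-i-really-mean-it
  # New pools with the desired parameters, and a new filesystem on top
  ceph osd pool create cephfs_metadata_new 128 128
  ceph osd pool create cephfs_data_new 128 128
  ceph fs new cephfs_new cephfs_metadata_new cephfs_data_new
  # Mount both filesystems and copy at the file level (rsync or similar);
  # rados-level export/import of cephfs pools does not produce a usable fs
  ceph-fuse /mnt/cephfs_old --client_mds_namespace=cephfs
  ceph-fuse /mnt/cephfs_new --client_mds_namespace=cephfs_new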
[ceph-users] cephfs: bind data pool via file layout
hello, is there any performance impact on cephfs for using file layouts to bind a specific directory in cephfs to a given pool? Of course, such pool is not the default data pool for this cephfs. Regards, Webert Lima DevOps Engineer at MAV Tecnologia *Belo Horizonte - Brasil* *IRC NICK - WebertRLZ* ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] (yet another) multi active mds advise needed
Hi Daniel, Thanks for clarifying. I'll have a look at dirfrag option. Regards, Webert Lima Em sáb, 19 de mai de 2018 01:18, Daniel Baumann escreveu: > On 05/19/2018 01:13 AM, Webert de Souza Lima wrote: > > New question: will it make any difference in the balancing if instead of > > having the MAIL directory in the root of cephfs and the domains's > > subtrees inside it, I discard the parent dir and put all the subtress > right in cephfs root? > > the balancing between the MDS is influenced by which directories are > accessed, the currently accessed directory-trees are diveded between the > MDS's (also check the dirfrag option in the docs). assuming you have the > same access pattern, the "fragmentation" between the MDS's happens at > these "target-directories", so it doesn't matter if these directories > are further up or down in the same filesystem tree. > > in the multi-MDS scenario where the MDS serving rank 0 fails, the > effects in the moment of the failure for any cephfs client accessing a > directory/file are the same (as described in an earlier mail), > regardless on which level the directory/file is within the filesystem. > > Regards, > Daniel > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] (yet another) multi active mds advise needed
Hi Patrick On Fri, May 18, 2018 at 6:20 PM Patrick Donnelly wrote: > Each MDS may have multiple subtrees they are authoritative for. Each > MDS may also replicate metadata from another MDS as a form of load > balancing. Ok, it's good to know that it actually does some load balancing. Thanks. New question: will it make any difference in the balancing if, instead of having the MAIL directory in the root of cephfs and the domains' subtrees inside it, I discard the parent dir and put all the subtrees right in the cephfs root? > standby-replay daemons are not available to take over for ranks other > than the one it follows. So, you would want to have a standby-replay > daemon for each rank or just have normal standbys. It will likely > depend on the size of your MDS (cache size) and available hardware. > > It's best if you see if the normal balancer (especially in v12.2.6 > [1]) can handle the load for you without trying to micromanage things > via pins. You can use pinning to isolate metadata load from other > ranks as a stop-gap measure. > Ok, I will start with the simplest way. This can be changed after deployment if it comes to be the case. On Fri, May 18, 2018 at 6:38 PM Daniel Baumann wrote: > jftr, having 3 active mds and 3 standby-replay resulted in May 2017 in a > longer downtime for us due to http://tracker.ceph.com/issues/21749 > > we're not using standby-replay MDS's anymore but only "normal" standby, > and haven't had any problems anymore (running kraken then, upgraded > to luminous last fall). > Thank you very much for your feedback Daniel. I'll go for the regular standby daemons, then. Regards, Webert Lima DevOps Engineer at MAV Tecnologia *Belo Horizonte - Brasil* *IRC NICK - WebertRLZ* ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
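The configuration this lands on, roughly sketched (filesystem name is a placeholder; on some Luminous releases the allow_multimds step may already be the default, and the flag disappears entirely in later releases):

  # Allow multiple active ranks, then raise max_mds to the desired number
  ceph fs set cephfs allow_multimds true
  ceph fs set cephfs max_mds 3
  # Extra ceph-mds daemons without standby-replay settings act as plain standbys
  ceph fs status cephfs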
[ceph-users] (yet another) multi active mds advise needed
Hi, We're migrating from a Jewel / filestore based cephfs architecture to a Luminous / bluestore based one. One MUST HAVE is multiple active MDS daemons. I'm still lacking knowledge of how it actually works. After reading the docs and ML we learned that they work by sort of dividing the responsibilities, each with its own directory subtree (please correct me if I'm wrong). Question 1: I'd like to know if it is viable to have 4 MDS daemons, being 3 active and 1 standby (or standby-replay, if that's still possible with multi-MDS). Basically, what we have is 2 subtrees used by dovecot: INDEX and MAIL. Their trees are almost identical, but INDEX stores all dovecot metadata with heavy IO going on, and MAIL stores the actual email files, with many more writes than reads. I don't yet know which one could bottleneck the MDS servers most, so I wonder if I can take metrics on MDS usage per pool once it's deployed. Question 2: If the metadata workloads are very different, I wonder if I can isolate them, like pinning MDS servers X and Y to one of the directories. Cache Tier is deprecated so, Question 3: how can I think of a read cache mechanism in Luminous with bluestore, mainly to keep newly created files (emails that just arrived and will probably be fetched by the user in a few seconds via IMAP/POP3)? Regards, Webert Lima DevOps Engineer at MAV Tecnologia *Belo Horizonte - Brasil* *IRC NICK - WebertRLZ* ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
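On Question 2, subtree pinning is done per directory with an extended attribute, along these lines (mountpoint, directory names and rank numbers are placeholders):

  # Pin the metadata-heavy INDEX tree to rank 0 and MAIL to rank 1;
  # a value of -1 removes the pin and hands the subtree back to the balancer
  setfattr -n ceph.dir.pin -v 0 /mnt/cephfs/INDEX
  setfattr -n ceph.dir.pin -v 1 /mnt/cephfs/MAIL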
Re: [ceph-users] Multi-MDS Failover
Hello, On Mon, Apr 30, 2018 at 7:16 AM Daniel Baumann wrote: > additionally: if rank 0 is lost, the whole FS stands still (no new > client can mount the fs; no existing client can change a directory, etc.). > > my guess is that the root of a cephfs (/; which is always served by rank > 0) is needed in order to do traversals/lookups of any directories on the > top-level (which then can be served by ranks 1-n). > Could someone confirm if this is actually how it works? Thanks. Regards, Webert Lima DevOps Engineer at MAV Tecnologia *Belo Horizonte - Brasil* *IRC NICK - WebertRLZ* > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] dovecot + cephfs - sdbox vs mdbox
Thanks Jack. That's good to know. It is definitely something to consider. In a distributed storage scenario we might build a dedicated pool for that and tune the pool as more capacity or performance is needed. Regards, Webert Lima DevOps Engineer at MAV Tecnologia *Belo Horizonte - Brasil* *IRC NICK - WebertRLZ* On Wed, May 16, 2018 at 4:45 PM Jack wrote: > On 05/16/2018 09:35 PM, Webert de Souza Lima wrote: > > We'll soon do benchmarks of sdbox vs mdbox over cephfs with bluestore > > backend. > > We'll have to do some some work on how to simulate user traffic, for > writes > > and readings. That seems troublesome. > I would appreciate seeing these results ! > > > Thanks for the plugins recommendations. I'll take the change and ask you > > how is the SIS status? We have used it in the past and we've had some > > problems with it. > > I am using it since Dec 2016 with mdbox, with no issue at all (I am > currently using Dovecot 2.2.27-3 from Debian Stretch) > The only config I use is mail_attachment_dir, the rest lies as default > (mail_attachment_min_size = 128k, mail_attachment_fs = sis posix, > ail_attachment_hash = %{sha1}) > The backend storage is a local filesystem, and there is only one Dovecot > instance > > > > > Regards, > > > > Webert Lima > > DevOps Engineer at MAV Tecnologia > > *Belo Horizonte - Brasil* > > *IRC NICK - WebertRLZ* > > > > > > On Wed, May 16, 2018 at 4:19 PM Jack wrote: > > > >> Hi, > >> > >> Many (most ?) filesystems does not store multiple files on the same > block > >> > >> Thus, with sdbox, every single mail (you know, that kind of mail with 10 > >> lines in it) will eat an inode, and a block (4k here) > >> mdbox is more compact on this way > >> > >> Another difference: sdbox removes the message, mdbox does not : a single > >> metadata update is performed, which may be packed with others if many > >> files are deleted at once > >> > >> That said, I do not have experience with dovecot + cephfs, nor have made > >> tests for sdbox vs mdbox > >> > >> However, and this is a bit out of topic, I recommend you look at the > >> following dovecot's features (if not already done), as they are awesome > >> and will help you a lot: > >> - Compression (classic, https://wiki.dovecot.org/Plugins/Zlib) > >> - Single-Instance-Storage (aka sis, aka "attachment deduplication" : > >> https://www.dovecot.org/list/dovecot/2013-December/094276.html) > >> > >> Regards, > >> On 05/16/2018 08:37 PM, Webert de Souza Lima wrote: > >>> I'm sending this message to both dovecot and ceph-users ML so please > >> don't > >>> mind if something seems too obvious for you. > >>> > >>> Hi, > >>> > >>> I have a question for both dovecot and ceph lists and below I'll > explain > >>> what's going on. > >>> > >>> Regarding dbox format (https://wiki2.dovecot.org/MailboxFormat/dbox), > >> when > >>> using sdbox, a new file is stored for each email message. > >>> When using mdbox, multiple messages are appended to a single file until > >> it > >>> reaches/passes the rotate limit. > >>> > >>> I would like to understand better how the mdbox format impacts on IO > >>> performance. > >>> I think it's generally expected that fewer larger file translate to > less > >> IO > >>> and more troughput when compared to more small files, but how does > >> dovecot > >>> handle that with mdbox? 
> >>> If dovecot does flush data to storage upon each and every new email is > >>> arrived and appended to the corresponding file, would that mean that it > >>> generate the same ammount of IO as it would do with one file per > message? > >>> Also, if using mdbox many messages will be appended to a said file > >> before a > >>> new file is created. That should mean that a file descriptor is kept > open > >>> for sometime by dovecot process. > >>> Using cephfs as backend, how would this impact cluster performance > >>> regarding MDS caps and inodes cached when files from thousands of users > >> are > >>> opened and appended all over? > >>> > >>> I would like to understand this better. > >>> > >>> Why? > >>> We are a small Business Email Hosting provider with ba
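Pulling the settings quoted above together, a dovecot configuration sketch for compression plus single-instance attachment storage; the attachment path is a placeholder and the other values match the defaults Jack mentions:

  # Compression of newly saved mails
  mail_plugins = $mail_plugins zlib
  plugin {
    zlib_save = gz
    zlib_save_level = 6
  }
  # Single-Instance-Storage for attachments
  mail_attachment_dir = /srv/mail/attachments
  mail_attachment_min_size = 128k
  mail_attachment_fs = sis posix
  mail_attachment_hash = %{sha1}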
Re: [ceph-users] dovecot + cephfs - sdbox vs mdbox
Hello Danny, I actually saw that thread and I was very excited about it. I thank you all for that idea and all the effort being put in it. I haven't yet tried to play around with your plugin but I intend to, and to contribute back. I think when it's ready for production it will be unbeatable. I have watched your talk at Cephalocon (on YouTube). I'll see your slides, maybe they'll give me more insights on our infrastructure architecture. As you can see our business is still taking baby steps compared to Deutsche Telekom's but we face infrastructure challenges everyday since ever. As for now, I think we could still fit with cephfs, but we definitely need some improvement. Regards, Webert Lima DevOps Engineer at MAV Tecnologia *Belo Horizonte - Brasil* *IRC NICK - WebertRLZ* On Wed, May 16, 2018 at 4:42 PM Danny Al-Gaaf wrote: > Hi, > > some time back we had similar discussions when we, as an email provider, > discussed to move away from traditional NAS/NFS storage to Ceph. > > The problem with POSIX file systems and dovecot is that e.g. with mdbox > only around ~20% of the IO operations are READ/WRITE, the rest are > metadata IOs. You will not change this with using CephFS since it will > basically behave the same way as e.g. NFS. > > We decided to develop librmb to store emails as objects directly in > RADOS instead of CephFS. The project is still under development, so you > should not use it in production, but you can try it to run a POC. > > For more information check out my slides from Ceph Day London 2018: > https://dalgaaf.github.io/cephday-london2018-emailstorage/#/cover-page > > The project can be found on github: > https://github.com/ceph-dovecot/ > > -Danny > > Am 16.05.2018 um 20:37 schrieb Webert de Souza Lima: > > I'm sending this message to both dovecot and ceph-users ML so please > don't > > mind if something seems too obvious for you. > > > > Hi, > > > > I have a question for both dovecot and ceph lists and below I'll explain > > what's going on. > > > > Regarding dbox format (https://wiki2.dovecot.org/MailboxFormat/dbox), > when > > using sdbox, a new file is stored for each email message. > > When using mdbox, multiple messages are appended to a single file until > it > > reaches/passes the rotate limit. > > > > I would like to understand better how the mdbox format impacts on IO > > performance. > > I think it's generally expected that fewer larger file translate to less > IO > > and more troughput when compared to more small files, but how does > dovecot > > handle that with mdbox? > > If dovecot does flush data to storage upon each and every new email is > > arrived and appended to the corresponding file, would that mean that it > > generate the same ammount of IO as it would do with one file per message? > > Also, if using mdbox many messages will be appended to a said file > before a > > new file is created. That should mean that a file descriptor is kept open > > for sometime by dovecot process. > > Using cephfs as backend, how would this impact cluster performance > > regarding MDS caps and inodes cached when files from thousands of users > are > > opened and appended all over? > > > > I would like to understand this better. > > > > Why? > > We are a small Business Email Hosting provider with bare metal, self > hosted > > systems, using dovecot for servicing mailboxes and cephfs for email > storage. > > > > We are currently working on dovecot and storage redesign to be in > > production ASAP. 
The main objective is to serve more users with better > > performance, high availability and scalability. > > * high availability and load balancing is extremely important to us * > > > > On our current model, we're using mdbox format with dovecot, having > > dovecot's INDEXes stored in a replicated pool of SSDs, and messages > stored > > in a replicated pool of HDDs (under a Cache Tier with a pool of SSDs). > > All using cephfs / filestore backend. > > > > Currently there are 3 clusters running dovecot 2.2.34 and ceph Jewel > > (10.2.9-4). > > - ~25K users from a few thousands of domains per cluster > > - ~25TB of email data per cluster > > - ~70GB of dovecot INDEX [meta]data per cluster > > - ~100MB of cephfs metadata per cluster > > > > Our goal is to build a single ceph cluster for storage that could expand > in > > capacity, be highly available and perform well enough. I know, that's > what > > everyone wants. > > > > Cephfs is an important choise because: > > - there can
Re: [ceph-users] dovecot + cephfs - sdbox vs mdbox
Hello Jack, yes, I imagine I'll have to do some work on tuning the block size on cephfs. Thanks for the advise. I knew that using mdbox, messages are not removed but I though that was true in sdbox too. Thanks again. We'll soon do benchmarks of sdbox vs mdbox over cephfs with bluestore backend. We'll have to do some some work on how to simulate user traffic, for writes and readings. That seems troublesome. Thanks for the plugins recommendations. I'll take the change and ask you how is the SIS status? We have used it in the past and we've had some problems with it. Regards, Webert Lima DevOps Engineer at MAV Tecnologia *Belo Horizonte - Brasil* *IRC NICK - WebertRLZ* On Wed, May 16, 2018 at 4:19 PM Jack wrote: > Hi, > > Many (most ?) filesystems does not store multiple files on the same block > > Thus, with sdbox, every single mail (you know, that kind of mail with 10 > lines in it) will eat an inode, and a block (4k here) > mdbox is more compact on this way > > Another difference: sdbox removes the message, mdbox does not : a single > metadata update is performed, which may be packed with others if many > files are deleted at once > > That said, I do not have experience with dovecot + cephfs, nor have made > tests for sdbox vs mdbox > > However, and this is a bit out of topic, I recommend you look at the > following dovecot's features (if not already done), as they are awesome > and will help you a lot: > - Compression (classic, https://wiki.dovecot.org/Plugins/Zlib) > - Single-Instance-Storage (aka sis, aka "attachment deduplication" : > https://www.dovecot.org/list/dovecot/2013-December/094276.html) > > Regards, > On 05/16/2018 08:37 PM, Webert de Souza Lima wrote: > > I'm sending this message to both dovecot and ceph-users ML so please > don't > > mind if something seems too obvious for you. > > > > Hi, > > > > I have a question for both dovecot and ceph lists and below I'll explain > > what's going on. > > > > Regarding dbox format (https://wiki2.dovecot.org/MailboxFormat/dbox), > when > > using sdbox, a new file is stored for each email message. > > When using mdbox, multiple messages are appended to a single file until > it > > reaches/passes the rotate limit. > > > > I would like to understand better how the mdbox format impacts on IO > > performance. > > I think it's generally expected that fewer larger file translate to less > IO > > and more troughput when compared to more small files, but how does > dovecot > > handle that with mdbox? > > If dovecot does flush data to storage upon each and every new email is > > arrived and appended to the corresponding file, would that mean that it > > generate the same ammount of IO as it would do with one file per message? > > Also, if using mdbox many messages will be appended to a said file > before a > > new file is created. That should mean that a file descriptor is kept open > > for sometime by dovecot process. > > Using cephfs as backend, how would this impact cluster performance > > regarding MDS caps and inodes cached when files from thousands of users > are > > opened and appended all over? > > > > I would like to understand this better. > > > > Why? > > We are a small Business Email Hosting provider with bare metal, self > hosted > > systems, using dovecot for servicing mailboxes and cephfs for email > storage. > > > > We are currently working on dovecot and storage redesign to be in > > production ASAP. The main objective is to serve more users with better > > performance, high availability and scalability. 
> > * high availability and load balancing is extremely important to us * > > > > On our current model, we're using mdbox format with dovecot, having > > dovecot's INDEXes stored in a replicated pool of SSDs, and messages > stored > > in a replicated pool of HDDs (under a Cache Tier with a pool of SSDs). > > All using cephfs / filestore backend. > > > > Currently there are 3 clusters running dovecot 2.2.34 and ceph Jewel > > (10.2.9-4). > > - ~25K users from a few thousands of domains per cluster > > - ~25TB of email data per cluster > > - ~70GB of dovecot INDEX [meta]data per cluster > > - ~100MB of cephfs metadata per cluster > > > > Our goal is to build a single ceph cluster for storage that could expand > in > > capacity, be highly available and perform well enough. I know, that's > what > > everyone wants. > > > > Cephfs is an important choise because: > > - there can be multiple mountpoints, thus multip
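For reference, enabling the compression plugin Jack recommends above is a small dovecot-side change; a minimal sketch for dovecot 2.2 (assuming the zlib plugin is installed; the algorithm and level are example values, see https://wiki.dovecot.org/Plugins/Zlib):

mail_plugins = $mail_plugins zlib
plugin {
  zlib_save = gz
  zlib_save_level = 6
}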
[ceph-users] dovecot + cephfs - sdbox vs mdbox
I'm sending this message to both dovecot and ceph-users ML so please don't mind if something seems too obvious for you. Hi, I have a question for both dovecot and ceph lists and below I'll explain what's going on. Regarding dbox format (https://wiki2.dovecot.org/MailboxFormat/dbox), when using sdbox, a new file is stored for each email message. When using mdbox, multiple messages are appended to a single file until it reaches/passes the rotate limit. I would like to understand better how the mdbox format impacts on IO performance. I think it's generally expected that fewer larger file translate to less IO and more troughput when compared to more small files, but how does dovecot handle that with mdbox? If dovecot does flush data to storage upon each and every new email is arrived and appended to the corresponding file, would that mean that it generate the same ammount of IO as it would do with one file per message? Also, if using mdbox many messages will be appended to a said file before a new file is created. That should mean that a file descriptor is kept open for sometime by dovecot process. Using cephfs as backend, how would this impact cluster performance regarding MDS caps and inodes cached when files from thousands of users are opened and appended all over? I would like to understand this better. Why? We are a small Business Email Hosting provider with bare metal, self hosted systems, using dovecot for servicing mailboxes and cephfs for email storage. We are currently working on dovecot and storage redesign to be in production ASAP. The main objective is to serve more users with better performance, high availability and scalability. * high availability and load balancing is extremely important to us * On our current model, we're using mdbox format with dovecot, having dovecot's INDEXes stored in a replicated pool of SSDs, and messages stored in a replicated pool of HDDs (under a Cache Tier with a pool of SSDs). All using cephfs / filestore backend. Currently there are 3 clusters running dovecot 2.2.34 and ceph Jewel (10.2.9-4). - ~25K users from a few thousands of domains per cluster - ~25TB of email data per cluster - ~70GB of dovecot INDEX [meta]data per cluster - ~100MB of cephfs metadata per cluster Our goal is to build a single ceph cluster for storage that could expand in capacity, be highly available and perform well enough. I know, that's what everyone wants. Cephfs is an important choise because: - there can be multiple mountpoints, thus multiple dovecot instances on different hosts - the same storage backend is used for all dovecot instances - no need of sharding domains - dovecot is easily load balanced (with director sticking users to the same dovecot backend) On the upcoming upgrade we intent to: - upgrade ceph to 12.X (Luminous) - drop the SSD Cache Tier (because it's deprecated) - use bluestore engine I was said on freenode/#dovecot that there are many cases where SDBOX would perform better with NFS sharing. In case of cephfs, at first, I wouldn't think that would be true because more files == more generated IO, but thinking about what I said in the beginning regarding sdbox vs mdbox that could be wrong. Any thoughts will be highlt appreciated. Regards, Webert Lima DevOps Engineer at MAV Tecnologia *Belo Horizonte - Brasil* *IRC NICK - WebertRLZ* ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
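The rotate limit mentioned above is a dovecot setting rather than anything ceph-side; a minimal sketch of the two formats being compared (the paths and the rotate size are example values, not recommendations, and the two mail_location lines are alternatives):

# sdbox variant: one file per message
mail_location = sdbox:~/sdbox

# mdbox variant: many messages appended per file, new file after ~4 MB
mail_location = mdbox:~/mdbox
mdbox_rotate_size = 4M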
Re: [ceph-users] Node crash, filesystem not usable
I'm sorry I wouldn't know, I'm on Jewel. is your cluster HEALTH_OK now? Regards, Webert Lima DevOps Engineer at MAV Tecnologia *Belo Horizonte - Brasil* *IRC NICK - WebertRLZ* On Sun, May 13, 2018 at 6:29 AM Marc Roos wrote: > > In luminous > osd_recovery_threads = osd_disk_threads ? > osd_recovery_sleep = osd_recovery_sleep_hdd ? > > Or is this speeding up recovery, a lot different in luminous? > > [@~]# ceph daemon osd.0 config show | grep osd | grep thread > "osd_command_thread_suicide_timeout": "900", > "osd_command_thread_timeout": "600", > "osd_disk_thread_ioprio_class": "", > "osd_disk_thread_ioprio_priority": "-1", > "osd_disk_threads": "1", > "osd_op_num_threads_per_shard": "0", > "osd_op_num_threads_per_shard_hdd": "1", > "osd_op_num_threads_per_shard_ssd": "2", > "osd_op_thread_suicide_timeout": "150", > "osd_op_thread_timeout": "15", > "osd_peering_wq_threads": "2", > "osd_recovery_thread_suicide_timeout": "300", > "osd_recovery_thread_timeout": "30", > "osd_remove_thread_suicide_timeout": "36000", > "osd_remove_thread_timeout": "3600", > > -Original Message- > From: Webert de Souza Lima [mailto:webert.b...@gmail.com] > Sent: vrijdag 11 mei 2018 20:34 > To: ceph-users > Subject: Re: [ceph-users] Node crash, filesytem not usable > > This message seems to be very concerning: > >mds0: Metadata damage detected > > > but for the rest, the cluster seems still to be recovering. you could > try to seep thing up with ceph tell, like: > > ceph tell osd.* injectargs --osd_max_backfills=10 > > ceph tell osd.* injectargs --osd_recovery_sleep=0.0 > > ceph tell osd.* injectargs --osd_recovery_threads=2 > > > > Regards, > > Webert Lima > DevOps Engineer at MAV Tecnologia > Belo Horizonte - Brasil > IRC NICK - WebertRLZ > > > On Fri, May 11, 2018 at 3:06 PM Daniel Davidson > wrote: > > > Below id the information you were asking for. I think they are > size=2, min size=1. > > Dan > > # ceph status > cluster 7bffce86-9d7b-4bdf-a9c9-67670e68ca77 > > > > > health HEALTH_ERR > > > > > 140 pgs are stuck inactive for more than 300 seconds > 64 pgs backfill_wait > 76 pgs backfilling > 140 pgs degraded > 140 pgs stuck degraded > 140 pgs stuck inactive > 140 pgs stuck unclean > 140 pgs stuck undersized > 140 pgs undersized > 210 requests are blocked > 32 sec > recovery 38725029/695508092 objects degraded (5.568%) > recovery 10844554/695508092 objects misplaced (1.559%) > mds0: Metadata damage detected > mds0: Behind on trimming (71/30) > noscrub,nodeep-scrub flag(s) set > monmap e3: 4 mons at > {ceph-0=172.16.31.1:6789/0,ceph-1=172.16.31.2:6789/0,ceph-2=172.16.31.3: > 6789/0,ceph-3=172.16.31.4:6789/0} > election epoch 824, quorum 0,1,2,3 > ceph-0,ceph-1,ceph-2,ceph-3 > fsmap e144928: 1/1/1 up {0=ceph-0=up:active}, 1 up:standby > osdmap e35814: 32 osds: 30 up, 30 in; 140 remapped pgs > flags > noscrub,nodeep-scrub,sortbitwise,require_jewel_osds > pgmap v43142427: 1536 pgs, 2 pools, 762 TB data, 331 Mobjects > 1444 TB used, 1011 TB / 2455 TB avail > 38725029/695508092 objects degraded (5.568%) > 10844554/695508092 objects misplaced (1.559%) > 1396 active+clean > 76 > undersized+degraded+remapped+backfilling+peered > 64 > undersized+degraded+remapped+wait_backfill+peered > recovery io 1244 MB/s, 1612 keys/s, 705 objects/s > > ID WEIGHT TYPE NAMEUP/DOWN REWEIGHT PRIMARY-AFFINITY > -1 2619.54541 root default > -2 163.72159 host ceph-0 > 0 81.86079 osd.0 up 1.0 1.0 > 1 81.86079 osd.1 up 1.0 1.0 > -3 163.72159 host ceph-1 > 2 81.86079 osd.2
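On Marc's question above: in Luminous the recovery sleep is split per device class (osd_recovery_sleep_hdd / osd_recovery_sleep_ssd). A sketch for checking the effective values and adjusting them at runtime (the numbers are only examples, not recommendations):

ceph daemon osd.0 config show | grep -E 'osd_recovery_sleep|osd_max_backfills'
ceph tell osd.* injectargs '--osd_recovery_sleep_hdd 0.1 --osd_recovery_sleep_ssd 0 --osd_max_backfills 2'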
Re: [ceph-users] ceph mds memory usage 20GB : is it normal ?
On Sat, May 12, 2018 at 3:11 AM Alexandre DERUMIER wrote: > The documentation (luminous) say: > > >mds cache size > > > >Description:The number of inodes to cache. A value of 0 indicates an > unlimited number. It is recommended to use mds_cache_memory_limit to limit > the amount of memory the MDS cache uses. > >Type: 32-bit Integer > >Default:0 > > and, my mds_cache_memory_limit is currently at 5GB. yeah I have only suggested that because the high memory usage seemed to trouble you and it might be a bug, so it's more of a workaround. Regards, Webert Lima DevOps Engineer at MAV Tecnologia *Belo Horizonte - Brasil* *IRC NICK - WebertRLZ* ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
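A sketch of how that memory-based limit is usually applied (the 5 GB value mirrors the one quoted above; set it in ceph.conf on the MDS hosts, or inject it at runtime):

[mds]
mds_cache_memory_limit = 5368709120   # 5 GB

# or at runtime (daemon name is a placeholder):
ceph tell mds.<name> injectargs '--mds_cache_memory_limit 5368709120'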
Re: [ceph-users] Question: CephFS + Bluestore
Thanks David. Although you mentioned this was introduced with Luminous, it's working with Jewel. ~# ceph osd pool stats Fri May 11 17:41:39 2018 pool rbd id 5 client io 505 kB/s rd, 3801 kB/s wr, 46 op/s rd, 27 op/s wr pool rbd_cache id 6 client io 2538 kB/s rd, 3070 kB/s wr, 601 op/s rd, 758 op/s wr cache tier io 12225 kB/s flush, 0 op/s promote, 3 PG(s) flushing pool cephfs_metadata id 7 client io 2233 kB/s rd, 2260 kB/s wr, 95 op/s rd, 587 op/s wr pool cephfs_data_ssd id 8 client io 1126 kB/s rd, 94897 B/s wr, 33 op/s rd, 42 op/s wr pool cephfs_data id 9 client io 0 B/s rd, 11203 kB/s wr, 12 op/s rd, 12 op/s wr pool cephfs_data_cache id 10 client io 4383 kB/s rd, 550 kB/s wr, 57 op/s rd, 39 op/s wr cache tier io 7012 kB/s flush, 4399 kB/s evict, 11 op/s promote Regards, Webert Lima DevOps Engineer at MAV Tecnologia *Belo Horizonte - Brasil* *IRC NICK - WebertRLZ* On Fri, May 11, 2018 at 5:14 PM David Turner wrote: > `ceph osd pool stats` with the option to specify the pool you are > interested in should get you the breakdown of IO per pool. This was > introduced with luminous. > > On Fri, May 11, 2018 at 2:39 PM Webert de Souza Lima < > webert.b...@gmail.com> wrote: > >> I think ceph doesn't have IO metrics will filters by pool right? I see IO >> metrics from clients only: >> >> ceph_client_io_ops >> ceph_client_io_read_bytes >> ceph_client_io_read_ops >> ceph_client_io_write_bytes >> ceph_client_io_write_ops >> >> and pool "byte" metrics, but not "io": >> >> ceph_pool(write/read)_bytes(_total) >> >> Regards, >> >> Webert Lima >> DevOps Engineer at MAV Tecnologia >> *Belo Horizonte - Brasil* >> *IRC NICK - WebertRLZ* >> >> On Wed, May 9, 2018 at 2:23 PM Webert de Souza Lima < >> webert.b...@gmail.com> wrote: >> >>> Hey Jon! >>> >>> On Wed, May 9, 2018 at 12:11 PM, John Spray wrote: >>> >>>> It depends on the metadata intensity of your workload. It might be >>>> quite interesting to gather some drive stats on how many IOPS are >>>> currently hitting your metadata pool over a week of normal activity. >>>> >>> >>> Any ceph built-in tool for this? maybe ceph daemonperf (altoght I'm not >>> sure what I should be looking at). >>> My current SSD disks have 2 partitions. >>> - One is used for cephfs cache tier pool, >>> - The other is used for both: cephfs meta-data pool and cephfs >>> data-ssd (this is an additional cephfs data pool with only ssds with file >>> layout for a specific direcotory to use it) >>> >>> Because of this, iostat shows me peaks of 12k IOPS in the metadata >>> partition, but this could definitely be IO for the data-ssd pool. >>> >>> >>>> If you are doing large file workloads, and the metadata mostly fits in >>>> RAM, then the number of IOPS from the MDS can be very, very low. On >>>> the other hand, if you're doing random metadata reads from a small >>>> file workload where the metadata does not fit in RAM, almost every >>>> client read could generate a read operation, and each MDS could easily >>>> generate thousands of ops per second. >>>> >>> >>> I have yet to measure it the right way but I'd assume my metadata fits >>> in RAM (a few 100s of MB only). >>> >>> This is an email hosting cluster with dozens of thousands of users so >>> there are a lot of random reads and writes, but not too many small files. >>> Email messages are concatenated together in files up to 4MB in size >>> (when a rotation happens). >>> Most user operations are dovecot's INDEX operations and I will keep >>> index directory in a SSD-dedicaded pool. 
>>> >>> >>> >>>> Isolating metadata OSDs is useful if the data OSDs are going to be >>>> completely saturated: metadata performance will be protected even if >>>> clients are hitting the data OSDs hard. >>>> >>> >>> This seems to be the case. >>> >>> >>>> If "heavy write" means completely saturating the cluster, then sharing >>>> the OSDs is risky. If "heavy write" just means that there are more >>>> writes than reads, then it may be fine if the metadata workload is not >>>> heavy enough to make good use of SSDs. >>>> >>> >>> Saturarion will only happen in peak workloads, not often. By heavy write >>>
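As David notes in this thread, the pool of interest can be passed to the command directly, which avoids reading through the full listing; a sketch using a pool name from the output above:

ceph osd pool stats cephfs_metadata
ceph -f json osd pool stats cephfs_metadata    # machine-readable, e.g. for graphing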
Re: [ceph-users] Question: CephFS + Bluestore
I think ceph doesn't have IO metrics will filters by pool right? I see IO metrics from clients only: ceph_client_io_ops ceph_client_io_read_bytes ceph_client_io_read_ops ceph_client_io_write_bytes ceph_client_io_write_ops and pool "byte" metrics, but not "io": ceph_pool(write/read)_bytes(_total) Regards, Webert Lima DevOps Engineer at MAV Tecnologia *Belo Horizonte - Brasil* *IRC NICK - WebertRLZ* On Wed, May 9, 2018 at 2:23 PM Webert de Souza Lima wrote: > Hey Jon! > > On Wed, May 9, 2018 at 12:11 PM, John Spray wrote: > >> It depends on the metadata intensity of your workload. It might be >> quite interesting to gather some drive stats on how many IOPS are >> currently hitting your metadata pool over a week of normal activity. >> > > Any ceph built-in tool for this? maybe ceph daemonperf (altoght I'm not > sure what I should be looking at). > My current SSD disks have 2 partitions. > - One is used for cephfs cache tier pool, > - The other is used for both: cephfs meta-data pool and cephfs data-ssd > (this is an additional cephfs data pool with only ssds with file layout for > a specific direcotory to use it) > > Because of this, iostat shows me peaks of 12k IOPS in the metadata > partition, but this could definitely be IO for the data-ssd pool. > > >> If you are doing large file workloads, and the metadata mostly fits in >> RAM, then the number of IOPS from the MDS can be very, very low. On >> the other hand, if you're doing random metadata reads from a small >> file workload where the metadata does not fit in RAM, almost every >> client read could generate a read operation, and each MDS could easily >> generate thousands of ops per second. >> > > I have yet to measure it the right way but I'd assume my metadata fits in > RAM (a few 100s of MB only). > > This is an email hosting cluster with dozens of thousands of users so > there are a lot of random reads and writes, but not too many small files. > Email messages are concatenated together in files up to 4MB in size (when > a rotation happens). > Most user operations are dovecot's INDEX operations and I will keep index > directory in a SSD-dedicaded pool. > > > >> Isolating metadata OSDs is useful if the data OSDs are going to be >> completely saturated: metadata performance will be protected even if >> clients are hitting the data OSDs hard. >> > > This seems to be the case. > > >> If "heavy write" means completely saturating the cluster, then sharing >> the OSDs is risky. If "heavy write" just means that there are more >> writes than reads, then it may be fine if the metadata workload is not >> heavy enough to make good use of SSDs. >> > > Saturarion will only happen in peak workloads, not often. By heavy write I > mean there are much more writes than reads, yes. > So I think I can start sharing the OSDs, if I think this is impacting > performance I can just change the ruleset and move metadata to a SSD-only > pool, right? > > >> The way I'd summarise this is: in the general case, dedicated SSDs are >> the safe way to go -- they're intrinsically better suited to metadata. >> However, in some quite common special cases, the overall number of >> metadata ops is so low that the device doesn't matter. > > > > Thank you very much John! > Webert Lima > DevOps Engineer at MAV Tecnologia > Belo Horizonte - Brasil > IRC NICK - WebertRLZ > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Node crash, filesystem not usable
This message seems to be very concerning: >mds0: Metadata damage detected but for the rest, the cluster seems still to be recovering. you could try to seep thing up with ceph tell, like: ceph tell osd.* injectargs --osd_max_backfills=10 ceph tell osd.* injectargs --osd_recovery_sleep=0.0 ceph tell osd.* injectargs --osd_recovery_threads=2 Regards, Webert Lima DevOps Engineer at MAV Tecnologia *Belo Horizonte - Brasil* *IRC NICK - WebertRLZ* On Fri, May 11, 2018 at 3:06 PM Daniel Davidson wrote: > Below id the information you were asking for. I think they are size=2, > min size=1. > > Dan > > # ceph status > cluster > 7bffce86-9d7b-4bdf-a9c9-67670e68ca77 > > health > HEALTH_ERR > > 140 pgs are stuck inactive for more than 300 seconds > 64 pgs backfill_wait > 76 pgs backfilling > 140 pgs degraded > 140 pgs stuck degraded > 140 pgs stuck inactive > 140 pgs stuck unclean > 140 pgs stuck undersized > 140 pgs undersized > 210 requests are blocked > 32 sec > recovery 38725029/695508092 objects degraded (5.568%) > recovery 10844554/695508092 objects misplaced (1.559%) > mds0: Metadata damage detected > mds0: Behind on trimming (71/30) > noscrub,nodeep-scrub flag(s) set > monmap e3: 4 mons at {ceph-0= > 172.16.31.1:6789/0,ceph-1=172.16.31.2:6789/0,ceph-2=172.16.31.3:6789/0,ceph-3=172.16.31.4:6789/0 > } > election epoch 824, quorum 0,1,2,3 ceph-0,ceph-1,ceph-2,ceph-3 > fsmap e144928: 1/1/1 up {0=ceph-0=up:active}, 1 up:standby > osdmap e35814: 32 osds: 30 up, 30 in; 140 remapped pgs > flags noscrub,nodeep-scrub,sortbitwise,require_jewel_osds > pgmap v43142427: 1536 pgs, 2 pools, 762 TB data, 331 Mobjects > 1444 TB used, 1011 TB / 2455 TB avail > 38725029/695508092 objects degraded (5.568%) > 10844554/695508092 objects misplaced (1.559%) > 1396 active+clean > 76 undersized+degraded+remapped+backfilling+peered > 64 undersized+degraded+remapped+wait_backfill+peered > recovery io 1244 MB/s, 1612 keys/s, 705 objects/s > > ID WEIGHT TYPE NAMEUP/DOWN REWEIGHT PRIMARY-AFFINITY > -1 2619.54541 root default > -2 163.72159 host ceph-0 > 0 81.86079 osd.0 up 1.0 1.0 > 1 81.86079 osd.1 up 1.0 1.0 > -3 163.72159 host ceph-1 > 2 81.86079 osd.2 up 1.0 1.0 > 3 81.86079 osd.3 up 1.0 1.0 > -4 163.72159 host ceph-2 > 8 81.86079 osd.8 up 1.0 1.0 > 9 81.86079 osd.9 up 1.0 1.0 > -5 163.72159 host ceph-3 > 10 81.86079 osd.10up 1.0 1.0 > 11 81.86079 osd.11up 1.0 1.0 > -6 163.72159 host ceph-4 > 4 81.86079 osd.4 up 1.0 1.0 > 5 81.86079 osd.5 up 1.0 1.0 > -7 163.72159 host ceph-5 > 6 81.86079 osd.6 up 1.0 1.0 > 7 81.86079 osd.7 up 1.0 1.0 > -8 163.72159 host ceph-6 > 12 81.86079 osd.12up 0.7 1.0 > 13 81.86079 osd.13up 1.0 1.0 > -9 163.72159 host ceph-7 > 14 81.86079 osd.14up 1.0 1.0 > 15 81.86079 osd.15up 1.0 1.0 > -10 163.72159 host ceph-8 > 16 81.86079 osd.16up 1.0 1.0 > 17 81.86079 osd.17up 1.0 1.0 > -11 163.72159 host ceph-9 > 18 81.86079 osd.18up 1.0 1.0 > 19 81.86079 osd.19up 1.0 1.0 > -12 163.72159 host ceph-10 > 20 81.86079 osd.20up 1.0 1.0 > 21 81.86079 osd.21up 1.0 1.0 > -13 163.72159 host ceph-11 > 22 81.86079 osd.22up 1.0 1.0 > 23 81.86079 osd.23up 1.0 1.0 > -14 163.72159 host ceph-12 > 24 81.86079 osd.24up 1.0 1.0 > 25 81.86079 osd.25up 1.0 1.0 > -15 163.72159 host ceph-13 > 26 81.86079 osd.26 down0 1.0 > 27 81.86079 osd.27 down0 1.0 > -16 163.72159 host ceph-14 > 28 81.86079 osd.28up 1.0 1.0 > 29 81.86079 osd.29up 1.0 1.0 > -17 163.72159 host ceph-15 > 30 81.86079 osd.30up
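On the "mds0: Metadata damage detected" flag specifically: the MDS keeps a damage table that can be listed on the MDS host; a sketch (the daemon name is a placeholder, and entries should be investigated before attempting any repair):

ceph daemon mds.<name> damage ls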
Re: [ceph-users] ceph mds memory usage 20GB : is it normal ?
You could use "mds_cache_size" to limit number of CAPS untill you have this fixed, but I'd say for your number of caps and inodes, 20GB is normal. this mds (jewel) here is consuming 24GB RAM: { "mds": { "request": 7194867047, "reply": 7194866688, "reply_latency": { "avgcount": 7194866688, "sum": 27779142.611775008 }, "forward": 0, "dir_fetch": 179223482, "dir_commit": 1529387896, "dir_split": 0, "inode_max": 300, "inodes": 3001264, "inodes_top": 160517, "inodes_bottom": 226577, "inodes_pin_tail": 2614170, "inodes_pinned": 2770689, "inodes_expired": 2920014835, "inodes_with_caps": 2743194, "caps": 2803568, "subtrees": 2, "traverse": 8255083028, "traverse_hit": 7452972311, "traverse_forward": 0, "traverse_discover": 0, "traverse_dir_fetch": 180547123, "traverse_remote_ino": 122257, "traverse_lock": 5957156, "load_cent": 18446743934203149911, "q": 54, "exported": 0, "exported_inodes": 0, "imported": 0, "imported_inodes": 0 } } Regards, Webert Lima DevOps Engineer at MAV Tecnologia *Belo Horizonte - Brasil* *IRC NICK - WebertRLZ* On Fri, May 11, 2018 at 3:13 PM Alexandre DERUMIER wrote: > Hi, > > I'm still seeing memory leak with 12.2.5. > > seem to leak some MB each 5 minutes. > > I'll try to resent some stats next weekend. > > > - Mail original - > De: "Patrick Donnelly" > À: "Brady Deetz" > Cc: "Alexandre Derumier" , "ceph-users" < > ceph-users@lists.ceph.com> > Envoyé: Jeudi 10 Mai 2018 21:11:19 > Objet: Re: [ceph-users] ceph mds memory usage 20GB : is it normal ? > > On Thu, May 10, 2018 at 12:00 PM, Brady Deetz wrote: > > [ceph-admin@mds0 ~]$ ps aux | grep ceph-mds > > ceph 1841 3.5 94.3 133703308 124425384 ? Ssl Apr04 1808:32 > > /usr/bin/ceph-mds -f --cluster ceph --id mds0 --setuser ceph --setgroup > ceph > > > > > > [ceph-admin@mds0 ~]$ sudo ceph daemon mds.mds0 cache status > > { > > "pool": { > > "items": 173261056, > > "bytes": 76504108600 > > } > > } > > > > So, 80GB is my configured limit for the cache and it appears the mds is > > following that limit. But, the mds process is using over 100GB RAM in my > > 128GB host. I thought I was playing it safe by configuring at 80. What > other > > things consume a lot of RAM for this process? > > > > Let me know if I need to create a new thread. > > The cache size measurement is imprecise pre-12.2.5 [1]. You should upgrade > ASAP. > > [1] https://tracker.ceph.com/issues/22972 > > -- > Patrick Donnelly > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
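A sketch of the cap-limiting workaround mentioned at the top of this message (the value is only an example; mds_cache_size is an inode-count limit, which indirectly bounds caps, and it was superseded by mds_cache_memory_limit in Luminous):

ceph tell mds.<name> injectargs '--mds_cache_size 1000000'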
Re: [ceph-users] howto: multiple ceph filesystems
Basically what we're trying to figure out looks like what is being done here: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-September/020958.html But instead of using LIBRADOS to store EMAILs directly into RADOS we're still using CEPHFS for it, just figuring out if it makes sense to separate them in different workloads. Regards, Webert Lima DevOps Engineer at MAV Tecnologia *Belo Horizonte - Brasil* *IRC NICK - WebertRLZ* On Fri, May 11, 2018 at 2:07 AM, Marc Roos wrote: > > > If I would like to use an erasurecode pool for a cephfs directory how > would I create these placement rules? > > > > > -Original Message- > From: David Turner [mailto:drakonst...@gmail.com] > Sent: vrijdag 11 mei 2018 1:54 > To: João Paulo Sacchetto Ribeiro Bastos > Cc: ceph-users@lists.ceph.com > Subject: Re: [ceph-users] howto: multiple ceph filesystems > > Another option you could do is to use a placement rule. You could create > a general pool for most data to go to and a special pool for specific > folders on the filesystem. Particularly I think of a pool for replica vs > EC vs flash for specific folders in the filesystem. > > If the pool and OSDs wasn't the main concern for multiple filesystems > and the mds servers are then you could have multiple active mds servers > and pin the metadata for the indexes to one of them while the rest is > served by the other active mds servers. > > I really haven't come across a need for multiple filesystems in ceph > with the type of granularity you can achieve with mds pinning, folder > placement rules, and cephx authentication to limit a user to a specific > subfolder. > > > On Thu, May 10, 2018, 5:10 PM João Paulo Sacchetto Ribeiro Bastos > wrote: > > > Hey John, thanks for you answer. For sure the hardware robustness > will be nice enough. My true concern was actually the two FS ecosystem > coexistence. In fact I realized that we may not use this as well because > it may be represent a high overhead, despite the fact that it's a > experiental feature yet. > > On Thu, 10 May 2018 at 15:48 John Spray wrote: > > > On Thu, May 10, 2018 at 7:38 PM, João Paulo Sacchetto > Ribeiro > Bastos > wrote: > > Hello guys, > > > > My company is about to rebuild its whole infrastructure, > so > I was called in > > order to help on the planning. We are essentially an > corporate mail > > provider, so we handle daily lots of clients using > dovecot > and roundcube and > > in order to do so we want to design a better plant of > our > cluster. Today, > > using Jewel, we have a single cephFS for both index and > mail > from dovecot, > > but we want to split it into an index_FS and a mail_FS > to > handle the > > workload a little better, is it profitable nowadays? > From my > research I > > realized that we will need data and metadata individual > pools for each FS > > such as a group of MDS for each of then, also. > > > > The one thing that really scares me about all of this > is: we > are planning to > > have four machines at full disposal to handle our MDS > instances. We started > > to think if an idea like the one below is valid, can > anybody > give a hint on > > this? We basically want to handle two MDS instances on > each > machine (one for > > each FS) and wonder if we'll be able to have them > swapping > between active > > and standby simultaneously without any trouble. 
> > > > index_FS: (active={machines 1 and 3}, standby={machines > 2 > and 4}) > > mail_FS: (active={machines 2 and 4}, standby={machines 1 > and > 3}) > > Nothing wrong with that setup, but remember that those > servers > are > going to have to be well-resourced enough to run all four > at > once > (when a failure occurs), so it might not matter very much > exactly > which servers are running which daemons. > > With a filesystem's MDS daemons (i.e. daemons with the same > standby_for_fscid setting), Ceph will activate whichever > daemon comes > up first, so if it's important to you to have particular > daemons > active then you would need to take care of that at the > point > you're > starting them up. > > John > > > > > Regards, > > -- > > > > João Paulo Sacchetto Ribeiro Bastos > > +55 31 99279-7092 > > > > > > ___ > > ceph-users mailing list > > ceph-us
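On Marc's question in this thread about pointing a cephfs directory at an erasure-coded pool, a minimal sketch for Luminous (the profile, pool names, PG counts and mount path are example values; an EC data pool on bluestore also needs allow_ec_overwrites):

ceph osd erasure-code-profile set ec42 k=4 m=2 crush-failure-domain=host
ceph osd pool create cephfs_data_ec 128 128 erasure ec42
ceph osd pool set cephfs_data_ec allow_ec_overwrites true
ceph fs add_data_pool <fsname> cephfs_data_ec
setfattr -n ceph.dir.layout.pool -v cephfs_data_ec /mnt/cephfs/archive

New files created under that directory then land in the EC pool, while existing files stay where they were written.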
Re: [ceph-users] Question: CephFS + Bluestore
Hey Jon! On Wed, May 9, 2018 at 12:11 PM, John Spray wrote: > It depends on the metadata intensity of your workload. It might be > quite interesting to gather some drive stats on how many IOPS are > currently hitting your metadata pool over a week of normal activity. > Any ceph built-in tool for this? maybe ceph daemonperf (altoght I'm not sure what I should be looking at). My current SSD disks have 2 partitions. - One is used for cephfs cache tier pool, - The other is used for both: cephfs meta-data pool and cephfs data-ssd (this is an additional cephfs data pool with only ssds with file layout for a specific direcotory to use it) Because of this, iostat shows me peaks of 12k IOPS in the metadata partition, but this could definitely be IO for the data-ssd pool. > If you are doing large file workloads, and the metadata mostly fits in > RAM, then the number of IOPS from the MDS can be very, very low. On > the other hand, if you're doing random metadata reads from a small > file workload where the metadata does not fit in RAM, almost every > client read could generate a read operation, and each MDS could easily > generate thousands of ops per second. > I have yet to measure it the right way but I'd assume my metadata fits in RAM (a few 100s of MB only). This is an email hosting cluster with dozens of thousands of users so there are a lot of random reads and writes, but not too many small files. Email messages are concatenated together in files up to 4MB in size (when a rotation happens). Most user operations are dovecot's INDEX operations and I will keep index directory in a SSD-dedicaded pool. > Isolating metadata OSDs is useful if the data OSDs are going to be > completely saturated: metadata performance will be protected even if > clients are hitting the data OSDs hard. > This seems to be the case. > If "heavy write" means completely saturating the cluster, then sharing > the OSDs is risky. If "heavy write" just means that there are more > writes than reads, then it may be fine if the metadata workload is not > heavy enough to make good use of SSDs. > Saturarion will only happen in peak workloads, not often. By heavy write I mean there are much more writes than reads, yes. So I think I can start sharing the OSDs, if I think this is impacting performance I can just change the ruleset and move metadata to a SSD-only pool, right? > The way I'd summarise this is: in the general case, dedicated SSDs are > the safe way to go -- they're intrinsically better suited to metadata. > However, in some quite common special cases, the overall number of > metadata ops is so low that the device doesn't matter. Thank you very much John! Webert Lima DevOps Engineer at MAV Tecnologia Belo Horizonte - Brasil IRC NICK - WebertRLZ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
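One low-effort way to collect the per-pool numbers John asks about, instead of inferring them from iostat on a shared SSD partition, is to sample the pool stats over time; a sketch (the interval and pool name are example values):

while true; do date; ceph osd pool stats cephfs_metadata; sleep 60; done >> cephfs_metadata_io.log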
Re: [ceph-users] Question: CephFS + Bluestore
I'm sorry I have mixed up some information. The actual ratio I have now is 0,0005% (*100MB for 20TB data*). Regards, Webert Lima DevOps Engineer at MAV Tecnologia *Belo Horizonte - Brasil* *IRC NICK - WebertRLZ* On Wed, May 9, 2018 at 11:32 AM, Webert de Souza Lima wrote: > Hello, > > Currently, I run Jewel + Filestore for cephfs, with SSD-only pools used > for cephfs-metadata, and HDD-only pools for cephfs-data. The current > metadata/data ratio is something like 0,25% (50GB metadata for 20TB data). > > Regarding bluestore architecture, assuming I have: > > - SSDs for WAL+DB > - Spinning Disks for bluestore data. > > would you recommend still store metadata in SSD-Only OSD nodes? > If not, is it recommended to *dedicate* some OSDs (Spindle+SSD for > WAL/DB) for cephfs-metadata? > > If I just have 2 pools (metadata and data) all sharing the same OSDs in > the cluster, would it be enough for heavy-write cases? > > Assuming min_size=2, size=3. > > Thanks for your thoughts. > > Regards, > > Webert Lima > DevOps Engineer at MAV Tecnologia > *Belo Horizonte - Brasil* > *IRC NICK - WebertRLZ* > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Question: CephFS + Bluestore
Hello, Currently, I run Jewel + Filestore for cephfs, with SSD-only pools used for cephfs-metadata, and HDD-only pools for cephfs-data. The current metadata/data ratio is something like 0,25% (50GB metadata for 20TB data). Regarding the bluestore architecture, assuming I have: - SSDs for WAL+DB - Spinning disks for bluestore data. would you still recommend storing metadata on SSD-only OSD nodes? If not, is it recommended to *dedicate* some OSDs (spindle + SSD for WAL/DB) to cephfs-metadata? If I just have 2 pools (metadata and data) all sharing the same OSDs in the cluster, would that be enough for heavy-write cases? Assuming min_size=2, size=3. Thanks for your thoughts. Regards, Webert Lima DevOps Engineer at MAV Tecnologia *Belo Horizonte - Brasil* *IRC NICK - WebertRLZ* ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
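For the layout described above (spinning disk for data, SSD for WAL/DB), a sketch of how such an OSD is typically created on Luminous with ceph-volume (device paths are examples; when only --block.db is given, the WAL is colocated with the DB on the SSD partition):

ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1p1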
Re: [ceph-users] Can't get MDS running after a power outage
I'd also try to boot up only one mds until it's fully up and running. Not both of them. Sometimes they go switching states between each other. Regards, Webert Lima DevOps Engineer at MAV Tecnologia *Belo Horizonte - Brasil* *IRC NICK - WebertRLZ* On Thu, Mar 29, 2018 at 7:32 AM, John Spray wrote: > On Thu, Mar 29, 2018 at 8:16 AM, Zhang Qiang > wrote: > > Hi, > > > > Ceph version 10.2.3. After a power outage, I tried to start the MDS > > deamons, but they stuck forever replaying journals, I had no idea why > > they were taking that long, because this is just a small cluster for > > testing purpose with only hundreds MB data. I restarted them, and the > > error below was encountered. > > Usually if an MDS is stuck in replay, it's because it's waiting for > the OSDs to service the reads of the journal. Are all your PGs up and > healthy? > > > > > Any chance I can restore them? > > > > Mar 28 14:20:30 node01 systemd: Started Ceph metadata server daemon. > > Mar 28 14:20:30 node01 systemd: Starting Ceph metadata server daemon... > > Mar 28 14:20:30 node01 ceph-mds: 2018-03-28 14:20:30.796255 > > 7f0150c8c180 -1 deprecation warning: MDS id 'mds.0' is invalid and > > will be forbidden in a future version. MDS names may not start with a > > numeric digit. > > If you're really using "0" as an MDS name, now would be a good time to > fix that -- most people use a hostname or something like that. The > reason that numeric MDS names are invalid is that it makes commands > like "ceph mds fail 0" ambiguous (do we mean the name 0 or the rank > 0?). > > > Mar 28 14:20:30 node01 ceph-mds: starting mds.0 at :/0 > > Mar 28 14:20:30 node01 ceph-mds: ./mds/MDSMap.h: In function 'const > > entity_inst_t MDSMap::get_inst(mds_rank_t)' thread 7f014ac6c700 time > > 2018-03-28 14:20:30.942480 > > Mar 28 14:20:30 node01 ceph-mds: ./mds/MDSMap.h: 582: FAILED > assert(up.count(m)) > > Mar 28 14:20:30 node01 ceph-mds: ceph version 10.2.3 > > (ecc23778eb545d8dd55e2e4735b53cc93f92e65b) > > Mar 28 14:20:30 node01 ceph-mds: 1: (ceph::__ceph_assert_fail(char > > const*, char const*, int, char const*)+0x85) [0x7f01512aba45] > > Mar 28 14:20:30 node01 ceph-mds: 2: (MDSMap::get_inst(int)+0x20f) > > [0x7f0150ee5e3f] > > Mar 28 14:20:30 node01 ceph-mds: 3: > > (MDSRankDispatcher::handle_mds_map(MMDSMap*, MDSMap*)+0x7b9) > > [0x7f0150ed6e49] > > This is a weird assertion. I can't see how it could be reached :-/ > > John > > > Mar 28 14:20:30 node01 ceph-mds: 4: > > (MDSDaemon::handle_mds_map(MMDSMap*)+0xe3d) [0x7f0150eb396d] > > Mar 28 14:20:30 node01 ceph-mds: 5: > > (MDSDaemon::handle_core_message(Message*)+0x7b3) [0x7f0150eb4eb3] > > Mar 28 14:20:30 node01 ceph-mds: 6: > > (MDSDaemon::ms_dispatch(Message*)+0xdb) [0x7f0150eb514b] > > Mar 28 14:20:30 node01 ceph-mds: 7: (DispatchQueue::entry()+0x78a) > > [0x7f01513ad4aa] > > Mar 28 14:20:30 node01 ceph-mds: 8: > > (DispatchQueue::DispatchThread::entry()+0xd) [0x7f015129098d] > > Mar 28 14:20:30 node01 ceph-mds: 9: (()+0x7dc5) [0x7f0150095dc5] > > Mar 28 14:20:30 node01 ceph-mds: 10: (clone()+0x6d) [0x7f014eb61ced] > > Mar 28 14:20:30 node01 ceph-mds: NOTE: a copy of the executable, or > > `objdump -rdS ` is needed to interpret this. > > ___ > > ceph-users mailing list > > ceph-users@lists.ceph.com > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
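A sketch of the "bring up only one MDS" approach on a systemd-based install (the unit/host names are examples):

systemctl stop ceph-mds@node02     # keep a single ceph-mds daemon running
ceph -s                            # wait for the remaining MDS to reach up:active
systemctl start ceph-mds@node02    # then bring the standby back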
Re: [ceph-users] CephFS very unstable with many small files
hi, can you give soem more details on the setup? number and size of osds. are you using EC or not? and if so, what EC parameters? thanks, stijn On 02/26/2018 08:15 AM, Linh Vu wrote: > Sounds like you just need more RAM on your MDS. Ours have 256GB each, and the > OSD nodes have 128GB each. Networking is 2x25Gbe. > > > We are on luminous 12.2.1, bluestore, and use CephFS for HPC, with about > 500-ish compute nodes. We have done stress testing with small files up to 2M > per directory as part of our acceptance testing, and encountered no problem. > > > From: ceph-users on behalf of Oliver > Freyermuth > Sent: Monday, 26 February 2018 3:45:59 AM > To: ceph-users@lists.ceph.com > Subject: [ceph-users] CephFS very unstable with many small files > > Dear Cephalopodians, > > in preparation for production, we have run very successful tests with large > sequential data, > and just now a stress-test creating many small files on CephFS. > > We use a replicated metadata pool (4 SSDs, 4 replicas) and a data pool with 6 > hosts with 32 OSDs each, running in EC k=4 m=2. > Compression is activated (aggressive, snappy). All Bluestore, LVM, Luminous > 12.2.3. > There are (at the moment) only two MDS's, one is active, the other standby. > > For the test, we had 1120 client processes on 40 client machines (all > cephfs-fuse!) extract a tarball with 150k small files > ( http://distfiles.gentoo.org/snapshots/portage-latest.tar.xz ) each into a > separate subdirectory. > > Things started out rather well (but expectedly slow), we had to increase > mds_log_max_segments => 240 > mds_log_max_expiring => 160 > due to https://github.com/ceph/ceph/pull/18624 > and adjusted mds_cache_memory_limit to 4 GB. > > Even though the MDS machine has 32 GB, it is also running 2 OSDs (for > metadata) and so we have been careful with the cache > (e.g. due to http://tracker.ceph.com/issues/22599 ). > > After a while, we tested MDS failover and realized we entered a flip-flop > situation between the two MDS nodes we have. > Increasing mds_beacon_grace to 240 helped with that. > > Now, with about 100,000,000 objects written, we are in a disaster situation. > First off, the MDS could not restart anymore - it required >40 GB of memory, > which (together with the 2 OSDs on the MDS host) exceeded RAM and swap. > So it tried to recover and OOMed quickly after. Replay was reasonably fast, > but join took many minutes: > 2018-02-25 04:16:02.299107 7fe20ce1f700 1 mds.0.17657 rejoin_start > 2018-02-25 04:19:00.618514 7fe20ce1f700 1 mds.0.17657 rejoin_joint_start > and finally, 5 minutes later, OOM. > > I stopped half of the stress-test tar's, which did not help - then I rebooted > half of the clients, which did help and let the MDS recover just fine. > So it seems the client caps have been too many for the MDS to handle. I'm > unsure why "tar" would cause so many open file handles. > Is there anything that can be configured to prevent this from happening? > Now, I only lost some "stress test data", but later, it might be user's > data... > > > In parallel, I had reinstalled one OSD host. > It was backfilling well, but now, <24 hours later, before backfill has > finished, several OSD hosts enter OOM condition. > Our OSD-hosts have 64 GB of RAM for 32 OSDs, which should be fine with the > default bluestore cache size of 1 GB. However, it seems the processes are > using much more, > up to several GBs until memory is exhausted. They then become sluggish, are > kicked out of the cluster, come back, and finally at some point they are > OOMed. 
> > Now, I have restarted some OSD processes and hosts which helped to reduce the > memory usage - but now I have some OSDs crashing continously, > leading to PG unavailability, and preventing recovery from completion. > I have reported a ticket about that, with stacktrace and log: > http://tracker.ceph.com/issues/23120 > This might well be a consequence of a previous OOM killer condition. > > However, my final question after these ugly experiences is: > Did somebody ever stresstest CephFS for many small files? > Are those issues known? Can special configuration help? > Are the memory issues known? Are there solutions? > > We don't plan to use Ceph for many small files, but we don't have full > control of our users, which is why we wanted to test this "worst case" > scenario. > It would be really bad if we lost a production filesystem due to such a > situation, so the plan was to test now to know what happens before we enter > production. > As of now, this looks really bad, and I'm not sure the cluster will ever > recover. > I'll give it some more time, but we'll likely kill off all remaining clients > next week and see what happens, and worst case recreate the Ceph cluster. > > Cheers, > Oliver > > > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/
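For reference, the MDS settings Oliver describes adjusting above would normally go into ceph.conf; a sketch using the values from the message (they are workload-specific, not general recommendations; mds_beacon_grace is also consulted by the monitors, so [global] may be the safer section for it):

[mds]
mds_log_max_segments = 240
mds_log_max_expiring = 160
mds_cache_memory_limit = 4294967296    # 4 GB
mds_beacon_grace = 240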
Re: [ceph-users] CephFS very unstable with many small files
hi oliver, >>> in preparation for production, we have run very successful tests with large >>> sequential data, >>> and just now a stress-test creating many small files on CephFS. >>> >>> We use a replicated metadata pool (4 SSDs, 4 replicas) and a data pool with >>> 6 hosts with 32 OSDs each, running in EC k=4 m=2. >>> Compression is activated (aggressive, snappy). All Bluestore, LVM, Luminous >>> 12.2.3. (this is all afaik;) so with EC k=4, small files get cut in 4 smaller parts. i'm not sure when the compression is applied, but your small files might be very small files before the get cut in 4 tiny parts. this might become pure iops wrt performance. with filestore (and witout compression), this was quite awfull. we have not retested with bluestore yet, but in the end a disk is just a disk. writing 1 file results in 6 diskwrites, so you need a lot of iops and/or disks. <...> >>> In parallel, I had reinstalled one OSD host. >>> It was backfilling well, but now, <24 hours later, before backfill has >>> finished, several OSD hosts enter OOM condition. >>> Our OSD-hosts have 64 GB of RAM for 32 OSDs, which should be fine with the >>> default bluestore cache size of 1 GB. However, it seems the processes are >>> using much more, >>> up to several GBs until memory is exhausted. They then become sluggish, are >>> kicked out of the cluster, come back, and finally at some point they are >>> OOMed. >> >> 32GB RAM for MDS, 64GB RAM for 32 OSDs per node looks very low on memory >> requirements for the scale you are trying. what are the size of each osd >> device? >> Could you also dump osd tree + more cluster info in the tracker you raised, >> so that one could try to recreate at a lower scale and check. > > Done! > All HDD-OSDs have 4 TB, while the SSDs used for the metadata pool have 240 > GB. the rule of thumb is 1GB per 1 TB. that is a lot (and imho one of the bad things about ceph, but i'm not complaining ;) most of the time this memory will not be used except for cache, but eg recovery is one of the cases where it is used, and thus needed. i have no idea what the real requirements are (i assumes there's some fixed amount per OSD and the rest is linear(?) with volume. so you can try to use some softraid on the disks to reduce the number of OSDs per host; but i doubt that the fixed part is over 50%, so you will probably end up with ahving to add some memory or not use certain disks. i don't know if you can limit the amount of volume per disk, eg only use 2TB of a 4TB disk, because then you can keep the iops. stijn > We had initially planned to use something more lightweight on CPU and RAM > (BeeGFS or Lustre), > but since we encountered serious issues with BeeGFS, have some bad past > experience with Lustre (but it was an old version) > and were really happy with the self-healing features of Ceph which also > allows us to reinstall OSD-hosts if we do an upgrade without having a > downtime, > we have decided to repurpose the hardware. For this reason, the RAM is not > really optimized (yet) for Ceph. > We will try to adapt hardware now as best as possible. > > Are there memory recommendations for a setup of this size? Anything's > welcome. > > Cheers and thanks! > Oliver > >> >>> >>> Now, I have restarted some OSD processes and hosts which helped to reduce >>> the memory usage - but now I have some OSDs crashing continously, >>> leading to PG unavailability, and preventing recovery from completion. 
>>> I have reported a ticket about that, with stacktrace and log: >>> http://tracker.ceph.com/issues/23120 >>> This might well be a consequence of a previous OOM killer condition. >>> >>> However, my final question after these ugly experiences is: >>> Did somebody ever stresstest CephFS for many small files? >>> Are those issues known? Can special configuration help? >>> Are the memory issues known? Are there solutions? >>> >>> We don't plan to use Ceph for many small files, but we don't have full >>> control of our users, which is why we wanted to test this "worst case" >>> scenario. >>> It would be really bad if we lost a production filesystem due to such a >>> situation, so the plan was to test now to know what happens before we enter >>> production. >>> As of now, this looks really bad, and I'm not sure the cluster will ever >>> recover. >>> I'll give it some more time, but we'll likely kill off all remaining >>> clients next week and see what happens, and worst case recreate the Ceph >>> cluster. >>> >>> Cheers, >>> Oliver >>> >>> ___ >>> ceph-users mailing list >>> ceph-users@lists.ceph.com >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >> > > > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users
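If adding RAM is not immediately possible, the per-OSD bluestore cache can be shrunk below its 1 GB default to buy headroom during recovery; a sketch (the value is only an example and trades read caching for memory):

[osd]
bluestore_cache_size_hdd = 536870912    # 512 MB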
Re: [ceph-users] Ceph Bluestore performance question
hi oliver, the IPoIB network is not 56gb, it's probably a lot less (20gb or so). the ib_write_bw test is verbs/rdma based. do you have iperf tests between hosts, and if so, can you share those reuslts? stijn > we are just getting started with our first Ceph cluster (Luminous 12.2.2) and > doing some basic benchmarking. > > We have two pools: > - cephfs_metadata, living on 4 SSD devices (each is a bluestore OSD, 240 GB) > on 2 hosts (i.e. 2 SSDs each), setup as: > - replicated, min size 2, max size 4 > - 128 PGs > - cephfs_data, living on 6 hosts each of which has the following setup: > - 32 HDD drives (4 TB) each of which is a bluestore OSD, the LSI controller > to which they are attached is in JBOD personality > - 2 SSD drives, each has 16 partitions with 7 GB per partition, used as > block-db by the bluestore OSDs living on the HDDs. > - Created with: > ceph osd erasure-code-profile set cephfs_data k=4 m=2 > crush-device-class=hdd crush-failure-domain=host > ceph osd pool create cephfs_data 2048 2048 erasure cephfs_data > - So to summarize: 192 OSDs, 2048 PGs, each OSD has 4 TB data + 7 GB > block-db > > The interconnect (public and cluster network) > is made via IP over Infiniband (56 GBit bandwidth), using the software stack > that comes with CentOS 7. > > This leaves us with the possibility that one of the metadata-hosts can fail, > and still one of the disks can fail. > For the data hosts, up to two machines total can fail. > > We have 40 clients connected to this cluster. We now run something like: > dd if=/dev/zero of=some_file bs=1M count=1 > on each CPU core of each of the clients, yielding a total of 1120 writing > processes (all 40 clients have 28+28HT cores), > using the ceph-fuse client. > > This yields a write throughput of a bit below 1 GB/s (capital B), which is > unexpectedly low. > Running a BeeGFS on the same cluster before (disks were in RAID 6 in that > case) yielded throughputs of about 12 GB/s, > but came with other issues (e.g. it's not FOSS...), so we'd love to run Ceph > :-). > > I performed some basic tests to try to understand the bottleneck for Ceph: > # rados bench -p cephfs_data 10 write --no-cleanup -t 40 > Bandwidth (MB/sec): 695.952 > Stddev Bandwidth: 295.223 > Max bandwidth (MB/sec): 1088 > Min bandwidth (MB/sec): 76 > Average IOPS: 173 > Stddev IOPS:73 > Max IOPS: 272 > Min IOPS: 19 > Average Latency(s): 0.220967 > Stddev Latency(s): 0.305967 > Max latency(s): 2.88931 > Min latency(s): 0.0741061 > > => This agrees mostly with our basic dd benchmark. > > Reading is a bit faster: > # rados bench -p cephfs_data 10 rand > => Bandwidth (MB/sec): 1108.75 > > However, the disks are reasonably quick: > # ceph tell osd.0 bench > { > "bytes_written": 1073741824, > "blocksize": 4194304, > "bytes_per_sec": 331850403 > } > > I checked and the OSD-hosts peaked at a load average of about 22 (they have > 24+24HT cores) in our dd benchmark, > but stayed well below that (only about 20 % per OSD daemon) in the rados > bench test. > One idea would be to switch from jerasure to ISA, since the machines are all > Intel CPUs only anyways. > > Already tried: > - TCP stack tuning (wmem, rmem), no huge effect. > - changing the block sizes used by dd, no effect. 
> - Testing network throughput with ib_write_bw, this revealed something like: > #bytes #iterationsBW peak[MB/sec]BW average[MB/sec] > MsgRate[Mpps] > 2 5000 19.73 19.30 10.118121 > 4 5000 52.79 51.70 13.553412 > 8 5000 101.23 96.65 12.668371 > > 16 5000 243.66 233.42 15.297583 > 32 5000 350.66 344.73 11.296089 > 64 5000 909.14 324.85 5.322323 > 1285000 1424.841401.2911.479374 > 2565000 2865.242801.0411.473055 > 5125000 5169.985095.0810.434733 > 1024 5000 10022.759791.42 > 10.026410 > 2048 5000 10988.6410628.83 > 5.441958 > 4096 5000 11401.4011399.14 > 2.918180 > [...] > > So it seems the IP-over-Infiniband is not the bottleneck (BeeGFS was using > RDMA). > Other ideas that come to mind: > - Testing with Ceph-RDMA, but that does not seem production-ready yet, if I > read the list correctly. > - Increasing osd_pool_erasure_code_stripe_width. > - Using ISA as EC plugin. > - Reducing the bluestore_cache_size_hdd, it seems when recovery + benchmark > is ongoing, swap is used (but not when perfo
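A sketch of the plain TCP test Stijn asks about, to compare against the verbs/RDMA numbers above (assuming iperf3 is installed; the address and options are examples):

# on one OSD host
iperf3 -s
# on another host, targeting the IPoIB address
iperf3 -c 10.0.0.1 -P 4 -t 30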
Re: [ceph-users] Luminous 12.2.2 OSDs with Bluestore crashing randomly
Hi Greg, many thanks. This is a new cluster created initially with luminous 12.2.0. I'm not sure the instructions on jewel really apply on my case too, and all the machines have ntp enabled, but I'll have a look, many thanks for the link. All machines are set to CET, although I'm running over docker containers which are using UTC internally, but they are all consistent. At the moment, after setting 5 of the osds out the cluster resumed, and now I'm recreating those osds to be on the safe side. Thanks, Alessandro Il 31/01/18 19:26, Gregory Farnum ha scritto: On Tue, Jan 30, 2018 at 5:49 AM Alessandro De Salvo <mailto:alessandro.desa...@roma1.infn.it>> wrote: Hi, we have several times a day different OSDs running Luminous 12.2.2 and Bluestore crashing with errors like this: starting osd.2 at - osd_data /var/lib/ceph/osd/ceph-2 /var/lib/ceph/osd/ceph-2/journal 2018-01-30 13:45:28.440883 7f1e193cbd00 -1 osd.2 107082 log_to_monitors {default=true} /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/osd/PrimaryLogPG.cc: In function 'void PrimaryLogPG::hit_set_trim(PrimaryLogPG::OpContextUPtr&, unsigned int)' thread 7f1dfd734700 time 2018-01-30 13:45:29.498133 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/osd/PrimaryLogPG.cc: 12819: FAILED assert(obc) ceph version 12.2.2 (cf0baba3b47f9427c6c97e2144b094b7e5ba) luminous (stable) 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x556c6df51550] 2: (PrimaryLogPG::hit_set_trim(std::unique_ptr >&, unsigned int)+0x3b6) [0x556c6db5e106] 3: (PrimaryLogPG::hit_set_persist()+0xb67) [0x556c6db61fb7] 4: (PrimaryLogPG::do_op(boost::intrusive_ptr&)+0x2389) [0x556c6db78d39] 5: (PrimaryLogPG::do_request(boost::intrusive_ptr&, ThreadPool::TPHandle&)+0xeba) [0x556c6db368aa] 6: (OSD::dequeue_op(boost::intrusive_ptr, boost::intrusive_ptr, ThreadPool::TPHandle&)+0x3f9) [0x556c6d9c0899] 7: (PGQueueable::RunVis::operator()(boost::intrusive_ptr const&)+0x57) [0x556c6dc38897] 8: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xfce) [0x556c6d9ee43e] 9: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x839) [0x556c6df57069] 10: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x556c6df59000] 11: (()+0x7e25) [0x7f1e16c17e25] 12: (clone()+0x6d) [0x7f1e15d0b34d] NOTE: a copy of the executable, or `objdump -rdS ` is needed to interpret this. 
2018-01-30 13:45:29.505317 7f1dfd734700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/osd/PrimaryLogPG.cc: In function 'void PrimaryLogPG::hit_set_trim(PrimaryLogPG::OpContextUPtr&, unsigned int)' thread 7f1dfd734700 time 2018-01-30 13:45:29.498133 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/osd/PrimaryLogPG.cc: 12819: FAILED assert(obc) ceph version 12.2.2 (cf0baba3b47f9427c6c97e2144b094b7e5ba) luminous (stable) 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x556c6df51550] 2: (PrimaryLogPG::hit_set_trim(std::unique_ptr >&, unsigned int)+0x3b6) [0x556c6db5e106] 3: (PrimaryLogPG::hit_set_persist()+0xb67) [0x556c6db61fb7] 4: (PrimaryLogPG::do_op(boost::intrusive_ptr&)+0x2389) [0x556c6db78d39] 5: (PrimaryLogPG::do_request(boost::intrusive_ptr&, ThreadPool::TPHandle&)+0xeba) [0x556c6db368aa] 6: (OSD::dequeue_op(boost::intrusive_ptr, boost::intrusive_ptr, ThreadPool::TPHandle&)+0x3f9) [0x556c6d9c0899] 7: (PGQueueable::RunVis::operator()(boost::intrusive_ptr const&)+0x57) [0x556c6dc38897] 8: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xfce) [0x556c6d9ee43e] 9: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x839) [0x556c6df57069] 10: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x556c6df59000] 11: (()+0x7e25) [0x7f1e16c17e25] 12: (clone()+0x6d) [0x7f1e15d0b34d] NOTE: a copy of the executable, or `objdump -rdS ` is needed to interpret this. Is it a known issue? How can we fix that? Hmm, it looks a lot like http://tracker.ceph.com/issues/19185, but that wasn't suppo
[ceph-users] Luminous 12.2.2 OSDs with Bluestore crashing randomly
Hi, we have several times a day different OSDs running Luminous 12.2.2 and Bluestore crashing with errors like this: starting osd.2 at - osd_data /var/lib/ceph/osd/ceph-2 /var/lib/ceph/osd/ceph-2/journal 2018-01-30 13:45:28.440883 7f1e193cbd00 -1 osd.2 107082 log_to_monitors {default=true} /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/osd/PrimaryLogPG.cc: In function 'void PrimaryLogPG::hit_set_trim(PrimaryLogPG::OpContextUPtr&, unsigned int)' thread 7f1dfd734700 time 2018-01-30 13:45:29.498133 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/osd/PrimaryLogPG.cc: 12819: FAILED assert(obc) ceph version 12.2.2 (cf0baba3b47f9427c6c97e2144b094b7e5ba) luminous (stable) 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x556c6df51550] 2: (PrimaryLogPG::hit_set_trim(std::unique_ptrstd::default_delete >&, unsigned int)+0x3b6) [0x556c6db5e106] 3: (PrimaryLogPG::hit_set_persist()+0xb67) [0x556c6db61fb7] 4: (PrimaryLogPG::do_op(boost::intrusive_ptr&)+0x2389) [0x556c6db78d39] 5: (PrimaryLogPG::do_request(boost::intrusive_ptr&, ThreadPool::TPHandle&)+0xeba) [0x556c6db368aa] 6: (OSD::dequeue_op(boost::intrusive_ptr, boost::intrusive_ptr, ThreadPool::TPHandle&)+0x3f9) [0x556c6d9c0899] 7: (PGQueueable::RunVis::operator()(boost::intrusive_ptr const&)+0x57) [0x556c6dc38897] 8: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xfce) [0x556c6d9ee43e] 9: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x839) [0x556c6df57069] 10: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x556c6df59000] 11: (()+0x7e25) [0x7f1e16c17e25] 12: (clone()+0x6d) [0x7f1e15d0b34d] NOTE: a copy of the executable, or `objdump -rdS ` is needed to interpret this. 
2018-01-30 13:45:29.505317 7f1dfd734700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/osd/PrimaryLogPG.cc: In function 'void PrimaryLogPG::hit_set_trim(PrimaryLogPG::OpContextUPtr&, unsigned int)' thread 7f1dfd734700 time 2018-01-30 13:45:29.498133 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/osd/PrimaryLogPG.cc: 12819: FAILED assert(obc) ceph version 12.2.2 (cf0baba3b47f9427c6c97e2144b094b7e5ba) luminous (stable) 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x556c6df51550] 2: (PrimaryLogPG::hit_set_trim(std::unique_ptrstd::default_delete >&, unsigned int)+0x3b6) [0x556c6db5e106] 3: (PrimaryLogPG::hit_set_persist()+0xb67) [0x556c6db61fb7] 4: (PrimaryLogPG::do_op(boost::intrusive_ptr&)+0x2389) [0x556c6db78d39] 5: (PrimaryLogPG::do_request(boost::intrusive_ptr&, ThreadPool::TPHandle&)+0xeba) [0x556c6db368aa] 6: (OSD::dequeue_op(boost::intrusive_ptr, boost::intrusive_ptr, ThreadPool::TPHandle&)+0x3f9) [0x556c6d9c0899] 7: (PGQueueable::RunVis::operator()(boost::intrusive_ptr const&)+0x57) [0x556c6dc38897] 8: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xfce) [0x556c6d9ee43e] 9: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x839) [0x556c6df57069] 10: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x556c6df59000] 11: (()+0x7e25) [0x7f1e16c17e25] 12: (clone()+0x6d) [0x7f1e15d0b34d] NOTE: a copy of the executable, or `objdump -rdS ` is needed to interpret this. Is it a known issue? How can we fix that? Thanks, Alessandro ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
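Since hit_set_trim only runs for pools that have HitSets configured (typically cache-tier pools), the relevant per-pool settings can be inspected while debugging this; a sketch (the pool name is a placeholder):

ceph osd pool get <cache-pool> hit_set_type
ceph osd pool get <cache-pool> hit_set_count
ceph osd pool get <cache-pool> hit_set_period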
Re: [ceph-users] ceph df shows 100% used
Hi, On Fri, Jan 19, 2018 at 8:31 PM, zhangbingyin wrote: > 'MAX AVAIL' in the 'ceph df' output represents the amount of data that can > be used before the first OSD becomes full, and not the sum of all free > space across a set of OSDs. > Thank you very much. I figured this out by the end of the day. That is the answer. I'm not sure this is in ceph.com docs though. Now I know the problem is indeed solved (by doing proper reweight). Regards, Webert Lima DevOps Engineer at MAV Tecnologia *Belo Horizonte - Brasil* *IRC NICK - WebertRLZ* ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
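Since MAX AVAIL is capped by the fullest OSD under the pool's crush root, a quick way to see which OSD is dragging it down, and what a generic reweight would do, is something like this (a sketch only; the %USE column position can differ between releases):
~# ceph osd df | sort -nk7 | tail            # OSDs with the highest %USE
~# ceph osd test-reweight-by-utilization     # dry run of a utilization-based reweight
~# ceph osd reweight-by-utilization 110      # only if the dry run looks sane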
Re: [ceph-users] ceph df shows 100% used
While it seemed to be solved yesterday, today the %USED has grown a lot again. See: ~# ceph osd df tree http://termbin.com/0zhk ~# ceph df detail http://termbin.com/thox 94% USED while there is about 21TB worth of data; size = 2 means ~42TB of RAW usage, but the OSDs in that root sum to ~70TB of available space. Regards, Webert Lima DevOps Engineer at MAV Tecnologia *Belo Horizonte - Brasil* *IRC NICK - WebertRLZ* On Thu, Jan 18, 2018 at 8:21 PM, Webert de Souza Lima wrote: > With the help of robbat2 and llua on IRC channel I was able to solve this > situation by taking down the 2-OSD only hosts. > After crush reweighting OSDs 8 and 23 from host mia1-master-fe02 to 0, > ceph df showed the expected storage capacity usage (about 70%) > > > With this in mind, those guys have told me that it is due to the cluster > being uneven and unable to balance properly. It makes sense and it worked. > But for me it is still a very unexpected behaviour for ceph to say that > the pools are 100% full and Available Space is 0. > > There were 3 hosts and repl. size = 2, if the host with only 2 OSDs were > full (it wasn't), ceph could still use space from OSDs from the other hosts. > > Regards, > > Webert Lima > DevOps Engineer at MAV Tecnologia > *Belo Horizonte - Brasil* > *IRC NICK - WebertRLZ* > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ceph df shows 100% used
With the help of robbat2 and llua on the IRC channel I was able to solve this situation by taking down the 2-OSD only hosts. After crush reweighting OSDs 8 and 23 from host mia1-master-fe02 to 0, ceph df showed the expected storage capacity usage (about 70%). With this in mind, those guys have told me that it is due to the cluster being uneven and unable to balance properly. It makes sense and it worked. But for me it is still very unexpected behaviour for ceph to say that the pools are 100% full and Available Space is 0. There were 3 hosts and repl. size = 2; if the host with only 2 OSDs were full (it wasn't), ceph could still use space from OSDs on the other hosts. Regards, Webert Lima DevOps Engineer at MAV Tecnologia *Belo Horizonte - Brasil* *IRC NICK - WebertRLZ* ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
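For anyone finding this thread later, the reweighting described above boils down to the following commands (OSD ids taken from the message; watch ceph df afterwards to confirm MAX AVAIL recovers):
~# ceph osd crush reweight osd.8 0
~# ceph osd crush reweight osd.23 0
~# ceph df detail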
Re: [ceph-users] ceph df shows 100% used
Hi David, thanks for replying.

On Thu, Jan 18, 2018 at 5:03 PM David Turner wrote: > You can have overall space available in your cluster because not all of > your disks are in the same crush root. You have multiple roots > corresponding to multiple crush rulesets. All pools using crush ruleset 0 > are full because all of the osds in that crush rule are full. >

So I did check this. The usage of the OSDs that belonged to that root (default) was about 60%. All the pools using crush ruleset 0 were being shown as 100% used, and there was only 1 near-full OSD in that crush rule. That's what is so weird about it.

On Thu, Jan 18, 2018 at 8:05 PM, David Turner wrote: > `ceph osd df` is a good command for you to see what's going on. Compare > the osd numbers with `ceph osd tree`. >

I am sorry I forgot to send this output, here it is. I have added 2 OSDs to that crush root, borrowed from the host mia1-master-ds05, to see if the available space would be higher, but it didn't change. So adding new OSDs to it didn't have any effect.

ceph osd df tree
ID  WEIGHT   REWEIGHT SIZE   USE    AVAIL  %USE  VAR  PGS TYPE NAME
 -9 13.5     -        14621G  2341G 12279G 16.02 0.31   0 root databases
 -8  6.5     -         7182G   835G  6346G 11.64 0.22   0     host mia1-master-ds05
 20  3.0     1.0       3463G   380G  3082G 10.99 0.21 260         osd.20
 17  3.5     1.0       3719G   455G  3263G 12.24 0.24 286         osd.17
-10  7.0     -         7438G  1505G  5932G 20.24 0.39   0     host mia1-master-fe01
 21  3.5     1.0       3719G   714G  3004G 19.22 0.37 269         osd.21
 22  3.5     1.0       3719G   791G  2928G 21.27 0.41 295         osd.22
 -3  2.39996 -         2830G  1647G  1182G 58.22 1.12   0 root databases-ssd
 -5  1.19998 -         1415G   823G   591G 58.22 1.12   0     host mia1-master-ds02-ssd
 24  0.3     1.0        471G   278G   193G 58.96 1.14 173         osd.24
 25  0.3     1.0        471G   276G   194G 58.68 1.13 172         osd.25
 26  0.3     1.0        471G   269G   202G 57.03 1.10 167         osd.26
 -6  1.19998 -         1415G   823G   591G 58.22 1.12   0     host mia1-master-ds03-ssd
 27  0.3     1.0        471G   244G   227G 51.87 1.00 152         osd.27
 28  0.3     1.0        471G   281G   190G 59.63 1.15 175         osd.28
 29  0.3     1.0        471G   297G   173G 63.17 1.22 185         osd.29
 -1 71.69997 -        76072G 44464G 31607G 58.45 1.13   0 root default
 -2 26.59998 -        29575G 17334G 12240G 58.61 1.13   0     host mia1-master-ds01
  0  3.2     1.0       3602G  1907G  1695G 52.94 1.02  90         osd.0
  1  3.2     1.0       3630G  2721G   908G 74.97 1.45 112         osd.1
  2  3.2     1.0       3723G  2373G  1349G 63.75 1.23  98         osd.2
  3  3.2     1.0       3723G  1781G  1941G 47.85 0.92 105         osd.3
  4  3.2     1.0       3723G  1880G  1843G 50.49 0.97  95         osd.4
  5  3.2     1.0       3723G  2465G  1257G 66.22 1.28 111         osd.5
  6  3.7     1.0       3723G  1722G  2001G 46.25 0.89 109         osd.6
  7  3.7     1.0       3723G  2481G  1241G 66.65 1.29 126         osd.7
 -4  8.5     -         9311G  8540G   770G 91.72 1.77   0     host mia1-master-fe02
  8  5.5     0.7       5587G  5419G   167G 97.00 1.87 189         osd.8
 23  3.0     1.0       3724G  3120G   603G 83.79 1.62 128         osd.23
 -7 29.5     -        29747G 17821G 11926G 59.91 1.16   0     host mia1-master-ds04
  9  3.7     1.0       3718G  2493G  1224G 67.07 1.29 114         osd.9
 10  3.7     1.0       3718G  2454G  1264G 66.00 1.27  90         osd.10
 11  3.7     1.0       3718G  2202G  1516G 59.22 1.14 116         osd.11
 12  3.7     1.0       3718G  2290G  1427G 61.61 1.19 113         osd.12
 13  3.7     1.0       3718G  2015G  1703G 54.19 1.05 112         osd.13
 14  3.7     1.0       3718G  1264G  2454G 34.00 0.66 101         osd.14
 15  3.7     1.0       3718G  2195G  1522G 59.05 1.14 104         osd.15
 16  3.7     1.0       3718G  2905G   813G 78.13 1.51 130         osd.16
-11  7.0     -         7438G   768G  6669G 10.33 0.20   0     host mia1-master-ds05-borrowed-osds
 18  3.5     1.0       3719G   393G  3325G 10.59 0.20 262         osd.18
 19  3.5     1.0       3719G   374G  3344G 10.07 0.19 256         osd.19
              TOTAL   93524G 48454G 45069G 51.81
MIN/MAX VAR: 0.19/1.87 STDDEV: 22.02

Regards, Webert Lima DevOps Engineer at MAV Tecnologia *Belo Horizonte - Brasil* *IRC NICK - WebertRLZ*
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ceph df shows 100% used
Sorry I forgot, this is a ceph jewel 10.2.10 Regards, Webert Lima DevOps Engineer at MAV Tecnologia *Belo Horizonte - Brasil* *IRC NICK - WebertRLZ* ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ceph df shows 100% used
Also, there is no quota set for the pools. Here is "ceph osd pool get xxx all": http://termbin.com/ix0n Regards, Webert Lima DevOps Engineer at MAV Tecnologia *Belo Horizonte - Brasil* *IRC NICK - WebertRLZ* ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] ceph df shows 100% used
Hello, my radosgw is nearly out of service (very slow to write new objects) and I suspect it's because ceph df is showing 100% usage in some pools, though I don't know where that information comes from.

Pools: ~# ceph osd pool ls detail -> http://termbin.com/lsd0
Crush rules (the important one is rule 0): ~# ceph osd crush rule dump -> http://termbin.com/wkpo
OSD tree: ~# ceph osd tree -> http://termbin.com/87vt
Ceph df, which shows 100% usage: ~# ceph df detail -> http://termbin.com/15mz
Ceph status, which shows 45600 GB / 93524 GB avail: ~# ceph -s -> http://termbin.com/wycq

Any thoughts? Regards, Webert Lima DevOps Engineer at MAV Tecnologia *Belo Horizonte - Brasil* *IRC NICK - WebertRLZ* ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] cephfs degraded on ceph luminous 12.2.2
Hi, it took quite some time to recover the pgs, and indeed the problem with the mds instances was due to the activating pgs. Once they were cleared the fs went back to its original state. I had to restart some OSDs a few times though, in order to get all the pgs activated. I didn't hit the limit on the max pgs, but I'm close to it, so I have set it to 300 just to be safe (AFAIK that was the limit in prior releases of ceph; I'm not sure why it was lowered to 200 now). Thanks, Alessandro On Tue, 2018-01-09 at 09:01 +0100, Burkhard Linke wrote: > Hi, > > > On 01/08/2018 05:40 PM, Alessandro De Salvo wrote: > > Thanks Lincoln, > > > > indeed, as I said the cluster is recovering, so there are pending ops: > > > > > > pgs: 21.034% pgs not active > > 1692310/24980804 objects degraded (6.774%) > > 5612149/24980804 objects misplaced (22.466%) > > 458 active+clean > > 329 active+remapped+backfill_wait > > 159 activating+remapped > > 100 active+undersized+degraded+remapped+backfill_wait > > 58 activating+undersized+degraded+remapped > > 27 activating > > 22 active+undersized+degraded+remapped+backfilling > > 6 active+remapped+backfilling > > 1 active+recovery_wait+degraded > > > > > > If it's just a matter of waiting for the system to complete the recovery > > it's fine, I'll deal with that, but I was wondering if there is a > > more subtle problem here. > > > > OK, I'll wait for the recovery to complete and see what happens, thanks. > > The blocked MDS might be caused by the 'activating' PGs. Do you have a > warning about too many PGs per OSD? If that is the case, > activating/creating/peering/whatever on the affected OSDs is blocked, > which leads to blocked requests etc. > > You can resolve this by increasing the number of allowed PGs per OSD > ('mon_max_pg_per_osd'). AFAIK it needs to be set for mon, mgr and osd > instances. There has also been some discussion about this setting on the > mailing list in the last weeks. > > Regards, > Burkhard > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
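The limit referred to here is the 'mon_max_pg_per_osd' option quoted above. A minimal sketch of raising it on Luminous, assuming the usual ceph.conf layout: put it in [global] so mon, mgr and osd instances all pick it up, then restart the daemons.
[global]
mon_max_pg_per_osd = 300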
Re: [ceph-users] luminous: HEALTH_ERR full ratio(s) out of order
Good to know. I don't think this should trigger HEALTH_ERR though, but HEALTH_WARN makes sense. It does make sense to keep the backfillfull_ratio greater than the nearfull_ratio, as one might need backfilling to avoid an OSD getting full during reweight operations. Regards, Webert Lima DevOps Engineer at MAV Tecnologia *Belo Horizonte - Brasil* *IRC NICK - WebertRLZ* On Wed, Jan 10, 2018 at 12:11 PM, Stefan Priebe - Profihost AG < s.pri...@profihost.ag> wrote: > Hello, > > since upgrading to luminous i get the following error: > > HEALTH_ERR full ratio(s) out of order > OSD_OUT_OF_ORDER_FULL full ratio(s) out of order > backfillfull_ratio (0.9) < nearfull_ratio (0.95), increased > > but ceph.conf has: > > mon_osd_full_ratio = .97 > mon_osd_nearfull_ratio = .95 > mon_osd_backfillfull_ratio = .96 > osd_backfill_full_ratio = .96 > osd_failsafe_full_ratio = .98 > > Any ideas? i already restarted: > * all osds > * all mons > * all mgrs > > Greets, > Stefan > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
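In case someone else hits this: on Luminous these ratios live in the OSDMap, so the mon_osd_* values in ceph.conf are only applied when the cluster is created, and restarting daemons will not reorder them. A sketch of putting them back in order at runtime, using the values from the ceph.conf quoted above:
~# ceph osd set-nearfull-ratio 0.95
~# ceph osd set-backfillfull-ratio 0.96
~# ceph osd set-full-ratio 0.97
~# ceph osd dump | grep ratio   # should now show full > backfillfull > nearfull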
Re: [ceph-users] 'lost' cephfs filesystem?
On Wed, Jan 10, 2018 at 12:44 PM, Mark Schouten wrote: > Thanks, that's a good suggestion. Just one question, will this affect RBD > access from the same (client) host? I'm sorry that this didn't help. No, it does not affect RBD clients, as the MDS is involved only in CephFS. Regards, Webert Lima DevOps Engineer at MAV Tecnologia *Belo Horizonte - Brasil* *IRC NICK - WebertRLZ* ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] 'lost' cephfs filesystem?
try to kick out (evict) that cephfs client from the mds node, see http://docs.ceph.com/docs/master/cephfs/eviction/ Regards, Webert Lima DevOps Engineer at MAV Tecnologia *Belo Horizonte - Brasil* *IRC NICK - WebertRLZ* On Wed, Jan 10, 2018 at 12:59 AM, Mark Schouten wrote: > Hi, > > While upgrading a server with a CephFS mount tonight, it stalled on > installing > a new kernel, because it was waiting for `sync`. > > I'm pretty sure it has something to do with the CephFS filesystem which > caused > some issues last week. I think the kernel still has a reference to the > probably lazy unmounted CephFS filesystem. > Unmounting the filesystem 'works', which means it is no longer available, > but > the unmount-command seems to be waiting for sync() as well. Mounting the > filesystem again doesn't work either. > > I know the simple solution is to just reboot the server, but the server > holds > quite a lot of VM's and Containers, so I'd prefer to fix this without a > reboot. > > Anybody with some clever ideas? :) > > -- > Kerio Operator in de Cloud? https://www.kerioindecloud.nl/ > Mark Schouten | Tuxis Internet Engineering > KvK: 61527076 | http://www.tuxis.nl/ > T: 0318 200208 | i...@tuxis.nl > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
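A sketch of what that eviction usually looks like in practice, assuming a single active MDS (rank 0); <name> and <id> are placeholders you have to look up on your cluster:
~# ceph daemon mds.<name> session ls      # find the session id / client addr of the stuck mount
~# ceph tell mds.0 client evict id=<id>
Keep in mind that an evicted client is blacklisted by default, so the filesystem will have to be remounted on that host afterwards.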
Re: [ceph-users] cephfs degraded on ceph luminous 12.2.2
Thanks Lincoln, indeed, as I said the cluster is recovering, so there are pending ops:

pgs: 21.034% pgs not active
     1692310/24980804 objects degraded (6.774%)
     5612149/24980804 objects misplaced (22.466%)
     458 active+clean
     329 active+remapped+backfill_wait
     159 activating+remapped
     100 active+undersized+degraded+remapped+backfill_wait
     58 activating+undersized+degraded+remapped
     27 activating
     22 active+undersized+degraded+remapped+backfilling
     6 active+remapped+backfilling
     1 active+recovery_wait+degraded

If it's just a matter of waiting for the system to complete the recovery that's fine, I'll deal with that, but I was wondering if there is a more subtle problem here. OK, I'll wait for the recovery to complete and see what happens, thanks. Cheers, Alessandro

On 08/01/18 17:36, Lincoln Bryant wrote: Hi Alessandro, What is the state of your PGs? Inactive PGs have blocked CephFS recovery on our cluster before. I'd try to clear any blocked ops and see if the MDSes recover. --Lincoln
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] cephfs degraded on ceph luminous 12.2.2
Hi, I'm running on ceph luminous 12.2.2 and my cephfs suddenly degraded. I have 2 active mds instances and 1 standby. All the active instances are now in replay state and show the same error in the logs:

mds1
2018-01-08 16:04:15.765637 7fc2e92451c0 0 ceph version 12.2.2 (cf0baba3b47f9427c6c97e2144b094b7e5ba) luminous (stable), process (unknown), pid 164
starting mds.mds1 at -
2018-01-08 16:04:15.785849 7fc2e92451c0 0 pidfile_write: ignore empty --pid-file
2018-01-08 16:04:20.168178 7fc2e1ee1700 1 mds.mds1 handle_mds_map standby
2018-01-08 16:04:20.278424 7fc2e1ee1700 1 mds.1.20635 handle_mds_map i am now mds.1.20635
2018-01-08 16:04:20.278432 7fc2e1ee1700 1 mds.1.20635 handle_mds_map state change up:boot --> up:replay
2018-01-08 16:04:20.278443 7fc2e1ee1700 1 mds.1.20635 replay_start
2018-01-08 16:04:20.278449 7fc2e1ee1700 1 mds.1.20635 recovery set is 0
2018-01-08 16:04:20.278458 7fc2e1ee1700 1 mds.1.20635 waiting for osdmap 21467 (which blacklists prior instance)

mds2
2018-01-08 16:04:16.870459 7fd8456201c0 0 ceph version 12.2.2 (cf0baba3b47f9427c6c97e2144b094b7e5ba) luminous (stable), process (unknown), pid 295
starting mds.mds2 at -
2018-01-08 16:04:16.881616 7fd8456201c0 0 pidfile_write: ignore empty --pid-file
2018-01-08 16:04:21.274543 7fd83e2bc700 1 mds.mds2 handle_mds_map standby
2018-01-08 16:04:21.314438 7fd83e2bc700 1 mds.0.20637 handle_mds_map i am now mds.0.20637
2018-01-08 16:04:21.314459 7fd83e2bc700 1 mds.0.20637 handle_mds_map state change up:boot --> up:replay
2018-01-08 16:04:21.314479 7fd83e2bc700 1 mds.0.20637 replay_start
2018-01-08 16:04:21.314492 7fd83e2bc700 1 mds.0.20637 recovery set is 1
2018-01-08 16:04:21.314517 7fd83e2bc700 1 mds.0.20637 waiting for osdmap 21467 (which blacklists prior instance)
2018-01-08 16:04:21.393307 7fd837aaf700 0 mds.0.cache creating system inode with ino:0x100
2018-01-08 16:04:21.397246 7fd837aaf700 0 mds.0.cache creating system inode with ino:0x1

The cluster is recovering as we are changing some of the osds, and there are a few slow/stuck requests, but I'm not sure if this is the cause, as there is apparently no data loss (until now). How can I force the MDSes to quit the replay state? Thanks for any help, Alessandro ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Linux Meltdown (KPTI) fix and how it affects performance?
or do it live https://access.redhat.com/articles/3311301 # echo 0 > /sys/kernel/debug/x86/pti_enabled # echo 0 > /sys/kernel/debug/x86/ibpb_enabled # echo 0 > /sys/kernel/debug/x86/ibrs_enabled stijn On 01/05/2018 12:54 PM, David wrote: > Hi! > > nopti or pti=off in kernel options should disable some of the kpti. > I haven't tried it yet though, so give it a whirl. > > https://en.wikipedia.org/wiki/Kernel_page-table_isolation > <https://en.wikipedia.org/wiki/Kernel_page-table_isolation> > > Kind Regards, > > David Majchrzak > > >> 5 jan. 2018 kl. 11:03 skrev Xavier Trilla : >> >> Hi Nick, >> >> I'm actually wondering about exactly the same. Regarding OSDs, I agree, >> there is no reason to apply the security patch to the machines running the >> OSDs -if they are properly isolated in your setup-. >> >> But I'm worried about the hypervisors, as I don't know how meltdown or >> Spectre patches -AFAIK, only Spectre patch needs to be applied to the host >> hypervisor, Meltdown patch only needs to be applied to guest- will affect >> librbd performance in the hypervisors. >> >> Does anybody have some information about how Meltdown or Spectre affect ceph >> OSDs and clients? >> >> Also, regarding Meltdown patch, seems to be a compilation option, meaning >> you could build a kernel without it easily. >> >> Thanks, >> Xavier. >> >> -Mensaje original- >> De: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] En nombre de Nick >> Fisk >> Enviado el: jueves, 4 de enero de 2018 17:30 >> Para: 'ceph-users' >> Asunto: [ceph-users] Linux Meltdown (KPTI) fix and how it affects >> performance? >> >> Hi All, >> >> As the KPTI fix largely only affects the performance where there are a large >> number of syscalls made, which Ceph does a lot of, I was wondering if >> anybody has had a chance to perform any initial tests. I suspect small write >> latencies will the worse affected? >> >> Although I'm thinking the backend Ceph OSD's shouldn't really be at risk >> from these vulnerabilities, due to them not being direct user facing and >> could have this work around disabled? >> >> Nick >> >> ___ >> ceph-users mailing list >> ceph-users@lists.ceph.com >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >> ___ >> ceph-users mailing list >> ceph-users@lists.ceph.com >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
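If you just want to verify what a given RHEL-family host is currently running with (same tunables as in the article above; debugfs must be mounted), a quick check is:
# grep . /sys/kernel/debug/x86/pti_enabled /sys/kernel/debug/x86/ibpb_enabled /sys/kernel/debug/x86/ibrs_enabled
# grep -o 'nopti\|pti=off' /proc/cmdline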
[ceph-users] PGs stuck in "active+undersized+degraded+remapped+backfill_wait", recovery speed is extremely slow
Hello all, I have a Ceph Luminous setup with filestore and bluestore OSDs. This cluster was deployed initially as Hammer, then I upgraded it to Jewel and eventually to Luminous. It's heterogeneous: we have SSDs, SAS 15K and 7.2K HDDs in it (see crush map attached). Earlier I converted the 7.2K HDDs from filestore to bluestore without any problem. After converting two SSDs from filestore to bluestore I ended up with the following warning:

  cluster:
    id: 089d3673-5607-404d-9351-2d4004043966
    health: HEALTH_WARN
            Degraded data redundancy: 12566/4361616 objects degraded (0.288%), 6 pgs unclean, 6 pgs degraded, 6 pgs undersized
            10 slow requests are blocked > 32 sec
  services:
    mon: 3 daemons, quorum 2,1,0
    mgr: tw-dwt-prx-03(active), standbys: tw-dwt-prx-05, tw-dwt-prx-07
    osd: 92 osds: 92 up, 92 in; 6 remapped pgs
  data:
    pools: 3 pools, 1024 pgs
    objects: 1419k objects, 5676 GB
    usage: 17077 GB used, 264 TB / 280 TB avail
    pgs: 12566/4361616 objects degraded (0.288%)
         1018 active+clean
         4 active+undersized+degraded+remapped+backfill_wait
         2 active+undersized+degraded+remapped+backfilling
  io:
    client: 1567 kB/s rd, 2274 kB/s wr, 67 op/s rd, 186 op/s wr

# rados df
POOL_NAME USED  OBJECTS CLONES COPIES  MISSING_ON_PRIMARY UNFOUND DEGRADED RD_OPS   RD    WR_OPS    WR
sas_sata  556G  142574  0      427722  0                  0       0        48972431 478G  207803733 3035G
sata_only 1939M 491     0      1473    0                  0       0        3302     5003k 17170     2108M
ssd_sata  5119G 1311028 0      3933084 0                  0       12549    46982011 2474G 620926839 24962G
total_objects 1454093
total_used 17080G
total_avail 264T
total_space 280T

# ceph pg dump_stuck
ok
PG_STAT STATE                                             UP        UP_PRIMARY ACTING  ACTING_PRIMARY
22.ac   active+undersized+degraded+remapped+backfilling   [6,28,62] 6          [28,62] 28
22.85   active+undersized+degraded+remapped+backfilling   [7,43,62] 7          [43,62] 43
22.146  active+undersized+degraded+remapped+backfill_wait [7,48,46] 7          [46,48] 46
22.4f   active+undersized+degraded+remapped+backfill_wait [7,59,58] 7          [58,59] 58
22.d8   active+undersized+degraded+remapped+backfill_wait [7,48,46] 7          [46,48] 46
22.60   active+undersized+degraded+remapped+backfill_wait [7,50,34] 7          [34,50] 34

The pool I have a problem with has replicas on SSDs and 7.2K HDDs, with primary affinity set to 1 for SSD and 0 for HDD. All clients eventually ceased to operate; recovery speed is 1-2 objects per minute (which would take more than a week to recover 12500 objects). The other pools work fine. How can I speed up the recovery process? Thank you, Ignaqui ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
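The throttles that usually matter for this are the backfill/recovery ones. A sketch with example values (not recommendations); on Luminous they can be injected at runtime and then verified on one of the OSDs involved in the stuck PGs, osd.28 here being just an example id from the output above:
~# ceph tell osd.* injectargs '--osd_max_backfills 4 --osd_recovery_max_active 8'
~# ceph tell osd.* injectargs '--osd_recovery_sleep_hdd 0'
~# ceph daemon osd.28 config show | grep -E 'osd_max_backfills|osd_recovery_max_active|osd_recovery_sleep'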
Re: [ceph-users] [luminous 12.2.2] Cluster write performance degradation problem(possibly tcmalloc related)
On Thu, Dec 21, 2017 at 12:52 PM, shadow_lin wrote: > After 18:00 suddenly the write throughput dropped and the osd latency > increased. TCMalloc started to reclaim the page heap freelist much more > frequently. All of this happened very fast and every osd had the identical > pattern. Could that be caused by OSD scrub? Check your "osd_scrub_begin_hour": ceph daemon osd.$ID config show | grep osd_scrub Regards, Webert Lima DevOps Engineer at MAV Tecnologia *Belo Horizonte - Brasil* *IRC NICK - WebertRLZ* ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
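If scrubbing does turn out to be the trigger, a sketch of confirming it and then confining scrub to a quiet window (the hours are only examples; adjust to your load pattern):
~# ceph osd set noscrub && ceph osd set nodeep-scrub     # pause scrubbing and watch whether latency recovers
~# ceph tell osd.* injectargs '--osd_scrub_begin_hour 23 --osd_scrub_end_hour 7'
~# ceph osd unset noscrub && ceph osd unset nodeep-scrub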
Re: [ceph-users] MDS locations
It depends on how you use it. For me, it runs fine on the OSD hosts, but the MDS server consumes loads of RAM, so be aware of that. If the system load average goes too high due to OSD disk utilization, the MDS might run into trouble too, as a delayed response from the host could cause the MDS to be marked as down. Regards, Webert Lima DevOps Engineer at MAV Tecnologia *Belo Horizonte - Brasil* *IRC NICK - WebertRLZ* On Fri, Dec 22, 2017 at 5:24 AM, nigel davies wrote: > Hay all > > Is it ok to set up mds on the same servers that host the osd's or should > they be on different servers? > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
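If you do run them together, capping the MDS cache keeps the RAM usage more predictable. A minimal sketch for Luminous (the 4 GB value is only an example; tune it to what the host can spare):
[mds]
mds_cache_memory_limit = 4294967296   # bytes; on Luminous this replaces the old mds_cache_size inode count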