Re: [ceph-users] slow ops for mon slowly increasing
OK, looks like clock skew was the problem. I thought it was caused by the reboot, but it did not fix itself after a few minutes (mon3 was 6 seconds ahead). After forcing a time sync from the same server, it seems to be solved now.

Kevin

Am Fr., 20. Sept. 2019 um 07:33 Uhr schrieb Kevin Olbrich:
> Hi!
>
> Today some OSDs went down, a temporary problem that was solved easily.
> The mimic cluster is working and all OSDs are complete, all active+clean.
>
> Completely new for me is this:
>
> 25 slow ops, oldest one blocked for 219 sec, mon.mon03 has slow ops
>
> The cluster itself looks fine; monitoring for the VMs that use RBD is fine.
>
> I thought this might be https://tracker.ceph.com/issues/24531, but I
> restarted the mon service (and rebooted the node as a whole) and both did
> not help. The slow ops slowly increase.
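For reference, the monitors only tolerate a tiny drift between each other before complaints start; a minimal sketch of that check, assuming the documented default `mon_clock_drift_allowed` of 0.05 s (the offsets dict is illustrative — on a real cluster they would come from `ceph time-sync-status`):

```python
# Sketch: flag monitors whose clock offset exceeds the allowed drift.
# 0.05 s is the documented default for mon_clock_drift_allowed; the
# offsets below are made-up example values, not real cluster output.
MON_CLOCK_DRIFT_ALLOWED = 0.05  # seconds

def skewed_mons(offsets, allowed=MON_CLOCK_DRIFT_ALLOWED):
    """Return the monitors whose absolute offset exceeds the allowed drift."""
    return {mon: off for mon, off in offsets.items() if abs(off) > allowed}

offsets = {"mon01": 0.002, "mon02": -0.004, "mon03": 6.0}  # mon03 6 s ahead
print(skewed_mons(offsets))  # -> {'mon03': 6.0}
```

A 6-second skew is two orders of magnitude over that limit, which matches the behaviour described above.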
[ceph-users] slow ops for mon slowly increasing
Hi!

Today some OSDs went down, a temporary problem that was solved easily. The mimic cluster is working and all OSDs are complete, all active+clean.

Completely new for me is this:

> 25 slow ops, oldest one blocked for 219 sec, mon.mon03 has slow ops

The cluster itself looks fine; monitoring for the VMs that use RBD is fine.

I thought this might be https://tracker.ceph.com/issues/24531, but I restarted the mon service (and rebooted the node as a whole) and both did not help. The slow ops slowly increase.

Example:

{
    "description": "auth(proto 0 30 bytes epoch 0)",
    "initiated_at": "2019-09-20 05:31:52.295858",
    "age": 7.851164,
    "duration": 7.900068,
    "type_data": {
        "events": [
            { "time": "2019-09-20 05:31:52.295858", "event": "initiated" },
            { "time": "2019-09-20 05:31:52.295858", "event": "header_read" },
            { "time": "2019-09-20 05:31:52.295864", "event": "throttled" },
            { "time": "2019-09-20 05:31:52.295875", "event": "all_read" },
            { "time": "2019-09-20 05:31:52.296075", "event": "dispatched" },
            { "time": "2019-09-20 05:31:52.296089", "event": "mon:_ms_dispatch" },
            { "time": "2019-09-20 05:31:52.296097", "event": "mon:dispatch_op" },
            { "time": "2019-09-20 05:31:52.296098", "event": "psvc:dispatch" },
            { "time": "2019-09-20 05:31:52.296172", "event": "auth:wait_for_readable" },
            { "time": "2019-09-20 05:31:52.296177", "event": "auth:wait_for_readable/paxos" },
            { "time": "2019-09-20 05:31:52.296232", "event": "paxos:wait_for_readable" }
        ],
        "info": {
            "seq": 1708,
            "src_is_mon": false,
            "source": "client.? [fd91:462b:4243:47e::1:3]:0/2365414961",
            "forwarded_to_leader": false
        }
    }
},
{
    "description": "auth(proto 0 30 bytes epoch 0)",
    "initiated_at": "2019-09-20 05:31:52.314892",
    "age": 7.832131,
    "duration": 7.881230,
    "type_data": {
        "events": [
            { "time": "2019-09-20 05:31:52.314892", "event": "initiated" },
            { "time": "2019-09-20 05:31:52.314892", "event": "header_read" },
            { "time": "2019-09-20 05:31:52.314897", "event": "throttled" },
            { "time": "2019-09-20 05:31:52.314907", "event": "all_read" },
            { "time": "2019-09-20 05:31:52.315057", "event": "dispatched" },
            { "time": "2019-09-20 05:31:52.315072", "event": "mon:_ms_dispatch" },
            { "time": "2019-09-20 05:31:52.315082", "event": "mon:dispatch_op" },
            { "time": "2019-09-20 05:31:52.315083", "event": "psvc:dispatch" },
            { "time": "2019-09-20 05:31:52.315161", "event": "auth:wait_for_readable" },
            { "time": "2019-09-20 05:31:52.315167", "event": "auth:wait_for_readable/paxos" },
            { "time": "2019-09-20 05:31:52.315230", "event": "paxos:wait_for_readable" }
        ],
        "info": {
            "seq": 1709,
            "src_is_mon": false,
            "source": "client.? [fd91:462b:4243:47e::1:3]:0/997594187",
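To see where such an op is stuck, the event timestamps in the dump can simply be diffed; a small sketch over a shortened copy of the first op's events (the dump itself comes from `ceph daemon mon.<id> ops`). Note that the internal steps take microseconds while the op's age is 7.85 s, i.e. it is parked in `paxos:wait_for_readable` — consistent with a mon-side problem such as clock skew, not a slow client:

```python
from datetime import datetime

# Sketch: per-step latency from a mon slow-op dump.
# The events list is a shortened copy of the first op in the message above.
events = [
    {"time": "2019-09-20 05:31:52.295858", "event": "initiated"},
    {"time": "2019-09-20 05:31:52.296075", "event": "dispatched"},
    {"time": "2019-09-20 05:31:52.296232", "event": "paxos:wait_for_readable"},
]

def step_latencies(events):
    """Return (event, seconds since the previous event) for each step."""
    fmt = "%Y-%m-%d %H:%M:%S.%f"
    stamps = [datetime.strptime(e["time"], fmt) for e in events]
    return [(events[i]["event"], (stamps[i] - stamps[i - 1]).total_seconds())
            for i in range(1, len(events))]

for name, dt in step_latencies(events):
    print(f"{name}: {dt * 1e6:.0f} us")
```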
Re: [ceph-users] QEMU/KVM client compatibility
Am Di., 28. Mai 2019 um 10:20 Uhr schrieb Wido den Hollander:
>
> On 5/28/19 10:04 AM, Kevin Olbrich wrote:
> > Hi Wido,
> >
> > thanks for your reply!
> >
> > For CentOS 7, this means I can switch over to the "rpm-nautilus/el7"
> > repository and Qemu uses a nautilus-compatible client?
> > I just want to make sure I understand correctly.
>
> Yes, that is correct. Keep in mind though that you will need to
> stop/start the VMs or (live) migrate them to a different hypervisor for
> the new packages to be loaded.

Actually the hosts are Fedora 29, which I need to re-deploy with Fedora 30 to get nautilus on the clients. I just wanted to understand how this works. I always reboot the whole machine after such a large change to make sure it works.

Thank you for your time!

Kevin

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
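Since the effective client level is decided by the librbd/librados that Qemu loads on the hypervisor, one way to verify it is to query the library itself. A sketch via ctypes — `rbd_version()` is part of the public librbd C API (`void rbd_version(int *major, int *minor, int *extra)`); the helper returns None when no librbd is installed on the host:

```python
from ctypes import CDLL, byref, c_int
from ctypes.util import find_library

def librbd_version():
    """Return (major, minor, extra) of the installed librbd, or None."""
    path = find_library("rbd")
    if path is None:
        return None  # no librbd on this host
    lib = CDLL(path)
    major, minor, extra = c_int(), c_int(), c_int()
    lib.rbd_version(byref(major), byref(minor), byref(extra))
    return (major.value, minor.value, extra.value)

print(librbd_version())
```

On an RPM-based hypervisor, `rpm -q librbd1 librados2` answers the same question from the packaging side.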
Re: [ceph-users] QEMU/KVM client compatibility
Hi Wido,

thanks for your reply!

For CentOS 7, this means I can switch over to the "rpm-nautilus/el7" repository and Qemu uses a nautilus-compatible client? I just want to make sure I understand correctly.

Thank you very much!

Kevin

Am Di., 28. Mai 2019 um 09:46 Uhr schrieb Wido den Hollander:
>
> On 5/28/19 7:52 AM, Kevin Olbrich wrote:
> > Hi!
> >
> > How can I determine which client compatibility level (luminous, mimic,
> > nautilus, etc.) is supported in Qemu/KVM?
> > Does it depend on the version of the ceph packages on the system? Or do I
> > need a recent version of Qemu/KVM?
>
> This is mainly related to librados and librbd on your system. Qemu talks
> to librbd, which then talks to librados.
>
> Qemu -> librbd -> librados -> Ceph cluster
>
> So make sure you keep the librbd and librados packages updated on your
> hypervisor.
>
> When upgrading them, make sure you either stop/start or live-migrate the
> VMs to a different hypervisor so the VMs are initiated with the new code.
>
> Wido
>
> > Which component defines which client level will be supported?
> >
> > Thank you very much!
> >
> > Kind regards
> > Kevin
[ceph-users] QEMU/KVM client compatibility
Hi!

How can I determine which client compatibility level (luminous, mimic, nautilus, etc.) is supported in Qemu/KVM? Does it depend on the version of the ceph packages on the system? Or do I need a recent version of Qemu/KVM? Which component defines which client level will be supported?

Thank you very much!

Kind regards
Kevin
Re: [ceph-users] cluster is not stable
Are you sure that firewalld is stopped and disabled? It looked exactly like that when I missed one host in a test cluster.

Kevin

Am Di., 12. März 2019 um 09:31 Uhr schrieb Zhenshi Zhou:
> Hi,
>
> I deployed a ceph cluster with good performance. But the logs
> indicate that the cluster is not as stable as I think it should be.
>
> The log shows the monitors mark some osds as down periodically:
> [image: image.png]
>
> I didn't find any useful information in the osd logs.
>
> ceph version 13.2.4 mimic (stable)
> OS version CentOS 7.6.1810
> kernel version 5.0.0-2.el7
>
> Thanks.
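A quick way to find a host that is still filtering traffic is to probe the OSD port range (6800-7300 by default) from another node; a minimal sketch, with the host name as a placeholder:

```python
import socket

def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example (placeholder host): probe the default OSD port range.
# unreachable = [p for p in range(6800, 7301)
#                if not port_open("osd-host.example", p)]
```

A probe that fails on an otherwise reachable host points at packet filtering — which matches the symptom here: the monitors mark OSDs down even though the OSDs themselves log nothing unusual.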
Re: [ceph-users] Usage of devices in SSD pool vary very much
 dd 0.90999 1.0 932GiB 335GiB  597GiB 35.96 0.79  91
12 hdd 0.90999 1.0 932GiB 357GiB  575GiB 38.28 0.84  96
35 hdd 0.90970 1.0 932GiB 318GiB  614GiB 34.14 0.75  86
 6 ssd 0.43700 1.0 447GiB 278GiB  170GiB 62.08 1.36  63
 7 ssd 0.43700 1.0 447GiB 256GiB  191GiB 57.17 1.25  60
 8 ssd 0.43700 1.0 447GiB 291GiB  156GiB 65.01 1.42  57
31 ssd 0.43660 1.0 447GiB 246GiB  201GiB 54.96 1.20  51
34 ssd 0.43660 1.0 447GiB 189GiB  258GiB 42.22 0.92  46
36 ssd 0.87329 1.0 894GiB 389GiB  506GiB 43.45 0.95  91
37 ssd 0.87329 1.0 894GiB 390GiB  504GiB 43.63 0.96  85
42 ssd 0.87329 1.0 894GiB 401GiB  493GiB 44.88 0.98  92
43 ssd 0.87329 1.0 894GiB 455GiB  439GiB 50.89 1.11  89
17 hdd 0.90999 1.0 932GiB 368GiB  563GiB 39.55 0.87 100
18 hdd 0.90999 1.0 932GiB 350GiB  582GiB 37.56 0.82  95
24 hdd 0.90999 1.0 932GiB 359GiB  572GiB 38.58 0.84  97
26 hdd 0.90999 1.0 932GiB 388GiB  544GiB 41.62 0.91 105
13 ssd 0.43700 1.0 447GiB 322GiB  125GiB 72.12 1.58  80
14 ssd 0.43700 1.0 447GiB 291GiB  156GiB 65.16 1.43  70
15 ssd 0.43700 1.0 447GiB 350GiB 96.9GiB 78.33 1.72  78 <--
16 ssd 0.43700 1.0 447GiB 268GiB  179GiB 60.05 1.31  71
23 hdd 0.90999 1.0 932GiB 364GiB  567GiB 39.08 0.86  98
25 hdd 0.90999 1.0 932GiB 391GiB  541GiB 41.92 0.92 106
27 hdd 0.90999 1.0 932GiB 393GiB  538GiB 42.21 0.92 106
28 hdd 0.90970 1.0 932GiB 467GiB  464GiB 50.14 1.10 126
19 ssd 0.43700 1.0 447GiB 310GiB  137GiB 69.36 1.52  76
20 ssd 0.43700 1.0 447GiB 316GiB  131GiB 70.66 1.55  76
21 ssd 0.43700 1.0 447GiB 323GiB  125GiB 72.13 1.58  80
22 ssd 0.43700 1.0 447GiB 283GiB  164GiB 63.39 1.39  69
38 ssd 0.43660 1.0 447GiB 146GiB  302GiB 32.55 0.71  46
39 ssd 0.43660 1.0 447GiB 142GiB  305GiB 31.84 0.70  43
40 ssd 0.87329 1.0 894GiB 407GiB  487GiB 45.53 1.00  98
41 ssd 0.87329 1.0 894GiB 353GiB  541GiB 39.51 0.87 102
           TOTAL 29.9TiB 13.7TiB 16.3TiB 45.66
MIN/MAX VAR: 0.63/1.72 STDDEV: 13.59

Kevin

Am So., 6. Jan. 2019 um 07:34 Uhr schrieb Konstantin Shalygin:
> On 1/5/19 4:17 PM, Kevin Olbrich wrote:
> > root@adminnode:~# ceph osd tree
> > ID  CLASS WEIGHT   TYPE NAME             STATUS REWEIGHT PRI-AFF
> >  -1       30.82903 root default
> > -16       30.82903     datacenter dc01
> > -19       30.82903         pod dc01-agg01
> > -10       17.43365             rack dc01-rack02
> >  -4        7.20665                 host node1001
> >   0   hdd  0.90999                     osd.0     up  1.0 1.0
> >   1   hdd  0.90999                     osd.1     up  1.0 1.0
> >   5   hdd  0.90999                     osd.5     up  1.0 1.0
> >  29   hdd  0.90970                     osd.29    up  1.0 1.0
> >  32   hdd  0.90970                     osd.32  down  0   1.0
> >  33   hdd  0.90970                     osd.33    up  1.0 1.0
> >   2   ssd  0.43700                     osd.2     up  1.0 1.0
> >   3   ssd  0.43700                     osd.3     up  1.0 1.0
> >   4   ssd  0.43700                     osd.4     up  1.0 1.0
> >  30   ssd  0.43660                     osd.30    up  1.0 1.0
> >  -7        6.29724                 host node1002
> >   9   hdd  0.90999                     osd.9     up  1.0 1.0
> >  10   hdd  0.90999                     osd.10    up  1.0 1.0
> >  11   hdd  0.90999                     osd.11    up  1.0 1.0
> >  12   hdd  0.90999                     osd.12    up  1.0 1.0
> >  35   hdd  0.90970                     osd.35    up  1.0 1.0
> >   6   ssd  0.43700                     osd.6     up  1.0 1.0
> >   7   ssd  0.43700                     osd.7     up  1.0 1.0
> >   8   ssd  0.43700                     osd.8     up  1.0 1.0
> >  31   ssd  0.43660                     osd.31    up  1.0 1.0
> > -28        2.18318                 host node1005
> >  34   ssd  0.43660                     osd.34    up  1.0 1.0
> >  36   ssd  0.87329                     osd.36    up  1.0 1.0
> >  37   ssd  0.87329                     osd.37    up  1.0 1.0
> > -29        1.74658                 host node1006
> >  42   ssd  0.87329                     osd.42    up  1.0 1.0
> >  43   ssd  0.87329                     osd.43    up  1.0 1.0
> > -11       13.39537             rack dc01-rack03
> > -22        5.38794                 host node100
Re: [ceph-users] Rezising an online mounted ext4 on a rbd - failed
Am Sa., 26. Jan. 2019 um 13:43 Uhr schrieb Götz Reinicke:
>
> Hi,
>
> I have a fileserver which mounted a 4TB rbd, which is ext4 formatted.
>
> I grew that rbd and ext4, starting with a 2TB rbd, that way:
>
> rbd resize testpool/disk01 --size 4194304
> resize2fs /dev/rbd0
>
> Today I wanted to extend that ext4 to 8 TB and did:
>
> rbd resize testpool/disk01 --size 8388608
> resize2fs /dev/rbd0
>
> => which gives an error: The filesystem is already 1073741824 blocks. Nothing to do.
>
> I bet I missed something very simple. Any hint? Thanks and regards,
> Götz

Try "partprobe" to re-read the device size.
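The numbers in the error confirm that the kernel still sees the old device size: `rbd resize --size` takes MiB, and the reported block count corresponds exactly to the first (4 TB) resize. A quick check — the 4096-byte ext4 block size is an assumption here (it is the usual default for filesystems this large):

```python
MiB = 1024 * 1024
EXT4_BLOCK = 4096  # assumed ext4 block size

old_rbd_mib = 4194304    # first resize (4 TiB)
new_rbd_mib = 8388608    # second resize (8 TiB)

old_blocks = old_rbd_mib * MiB // EXT4_BLOCK
new_blocks = new_rbd_mib * MiB // EXT4_BLOCK

print(old_blocks)  # -> 1073741824, exactly the count resize2fs reported
print(new_blocks)  # -> 2147483648, what it should see after a rescan
```

So the rbd grew, but the block device size was never refreshed in the kernel — which is exactly what the suggested `partprobe` fixes before re-running resize2fs.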
Re: [ceph-users] pgs stuck in creating+peering state
Are you sure no service like firewalld is running? Did you check that all machines have the same MTU and that jumbo frames are enabled if needed?

I had this problem when I first started with ceph and forgot to disable firewalld. Replication worked perfectly fine, but the OSD was kicked out every few seconds.

Kevin

Am Do., 17. Jan. 2019 um 11:57 Uhr schrieb Johan Thomsen:
>
> Hi,
>
> I have a sad ceph cluster.
> All my osds complain about failed reply on heartbeat, like so:
>
> osd.10 635 heartbeat_check: no reply from 192.168.160.237:6810 osd.42
> ever on either front or back, first ping sent 2019-01-16
> 22:26:07.724336 (cutoff 2019-01-16 22:26:08.225353)
>
> I've checked the network sanity all I can, and all ceph ports are
> open between nodes both on the public network and the cluster network,
> and I have no problems sending traffic back and forth between nodes.
> I've tried tcpdump'ing and traffic is passing in both directions
> between the nodes, but unfortunately I don't natively speak the ceph
> protocol, so I can't figure out what's going wrong in the heartbeat
> conversation.
>
> Still:
>
> # ceph health detail
>
> HEALTH_WARN nodown,noout flag(s) set; Reduced data availability: 1072
> pgs inactive, 1072 pgs peering
> OSDMAP_FLAGS nodown,noout flag(s) set
> PG_AVAILABILITY Reduced data availability: 1072 pgs inactive, 1072 pgs peering
>     pg 7.3cd is stuck inactive for 245901.560813, current state
> creating+peering, last acting [13,41,1]
>     pg 7.3ce is stuck peering for 245901.560813, current state
> creating+peering, last acting [1,40,7]
>     pg 7.3cf is stuck peering for 245901.560813, current state
> creating+peering, last acting [0,42,9]
>     pg 7.3d0 is stuck peering for 245901.560813, current state
> creating+peering, last acting [20,8,38]
>     pg 7.3d1 is stuck peering for 245901.560813, current state
> creating+peering, last acting [10,20,42]
>     (...)
>
> I've set "noout" and "nodown" to prevent all osds from being removed
> from the cluster. They are all running and marked as "up".
>
> # ceph osd tree
>
> ID  CLASS WEIGHT    TYPE NAME              STATUS REWEIGHT PRI-AFF
>  -1       249.73434 root default
> -25       166.48956     datacenter m1
> -24        83.24478         pod kube1
> -35        41.62239             rack 10
> -34        41.62239                 host ceph-sto-p102
>  40   hdd   7.27689                     osd.40     up  1.0 1.0
>  41   hdd   7.27689                     osd.41     up  1.0 1.0
>  42   hdd   7.27689                     osd.42     up  1.0 1.0
>     (...)
>
> I'm at a point where I don't know which options and what logs to check
> anymore. Any debug hint would be very much appreciated.
>
> btw. I have no important data in the cluster (yet), so if the solution
> is to drop all osds and recreate them, that's ok for now. But I'd really
> like to know how the cluster ended in this state.
>
> /Johan
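On the MTU point: a jumbo-frame mismatch shows exactly this pattern (small packets pass, larger messages silently vanish). The usual test is `ping -M do -s <payload> <peer>`, where the payload is the MTU minus the IPv4 header (20 bytes) and the ICMP header (8 bytes):

```python
def max_icmp_payload(mtu, ip_header=20, icmp_header=8):
    """Largest unfragmented ICMP payload for a given IPv4 MTU."""
    return mtu - ip_header - icmp_header

print(max_icmp_payload(9000))  # -> 8972  (jumbo frames)
print(max_icmp_payload(1500))  # -> 1472  (standard Ethernet)
```

If `ping -M do -s 8972` fails between OSD nodes while `-s 1472` works, some hop in between does not carry jumbo frames.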
Re: [ceph-users] Problem with CephFS - No space left on device
It would, but you should not:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-December/014846.html

Kevin

Am Di., 8. Jan. 2019 um 15:35 Uhr schrieb Rodrigo Embeita:
>
> Thanks again Kevin.
> If I reduce the size flag to a value of 2, that should fix the problem?
>
> Regards
Re: [ceph-users] Problem with CephFS - No space left on device
You use replication 3, failure domain host. OSDs 2 and 4 are full, that's why your pool is also full. You need to add two disks to pf-us1-dfs3 or swap one from the larger nodes to this one.

Kevin

Am Di., 8. Jan. 2019 um 15:20 Uhr schrieb Rodrigo Embeita:
>
> Hi Yoann, thanks for your response.
> Here are the results of the commands.
>
> root@pf-us1-dfs2:/var/log/ceph# ceph osd df
> ID CLASS WEIGHT  REWEIGHT SIZE    USE     AVAIL   %USE  VAR  PGS
>  0   hdd 7.27739  1.0 7.3 TiB 6.7 TiB 571 GiB 92.33 1.74 310
>  5   hdd 7.27739  1.0 7.3 TiB 5.6 TiB 1.7 TiB 77.18 1.45 271
>  6   hdd 7.27739  1.0 7.3 TiB 609 GiB 6.7 TiB  8.17 0.15  49
>  8   hdd 7.27739  1.0 7.3 TiB 2.5 GiB 7.3 TiB  0.03 0     42
>  1   hdd 7.27739  1.0 7.3 TiB 5.6 TiB 1.7 TiB 77.28 1.45 285
>  3   hdd 7.27739  1.0 7.3 TiB 6.9 TiB 371 GiB 95.02 1.79 296
>  7   hdd 7.27739  1.0 7.3 TiB 360 GiB 6.9 TiB  4.84 0.09  53
>  9   hdd 7.27739  1.0 7.3 TiB 4.1 GiB 7.3 TiB  0.06 0.00  38
>  2   hdd 7.27739  1.0 7.3 TiB 6.7 TiB 576 GiB 92.27 1.74 321
>  4   hdd 7.27739  1.0 7.3 TiB 6.1 TiB 1.2 TiB 84.10 1.58 351
>              TOTAL    73 TiB  39 TiB  34 TiB 53.13
> MIN/MAX VAR: 0/1.79 STDDEV: 41.15
>
> root@pf-us1-dfs2:/var/log/ceph# ceph osd pool ls detail
> pool 1 'poolcephfs' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 128 pgp_num 128 last_change 471 flags hashpspool,full stripe_width 0
> pool 2 'cephfs_data' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 471 lfor 0/439 flags hashpspool,full stripe_width 0 application cephfs
> pool 3 'cephfs_metadata' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 471 lfor 0/448 flags hashpspool,full stripe_width 0 application cephfs
> pool 4 '.rgw.root' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 471 flags hashpspool,full stripe_width 0 application rgw
> pool 5 'default.rgw.control' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 471 flags hashpspool,full stripe_width 0 application rgw
> pool 6 'default.rgw.meta' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 471 flags hashpspool,full stripe_width 0 application rgw
> pool 7 'default.rgw.log' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 471 flags hashpspool,full stripe_width 0 application rgw
>
> root@pf-us1-dfs2:/var/log/ceph# ceph osd tree
> ID CLASS WEIGHT   TYPE NAME            STATUS REWEIGHT PRI-AFF
> -1       72.77390 root default
> -3       29.10956     host pf-us1-dfs1
>  0   hdd  7.27739         osd.0            up  1.0 1.0
>  5   hdd  7.27739         osd.5            up  1.0 1.0
>  6   hdd  7.27739         osd.6            up  1.0 1.0
>  8   hdd  7.27739         osd.8            up  1.0 1.0
> -5       29.10956     host pf-us1-dfs2
>  1   hdd  7.27739         osd.1            up  1.0 1.0
>  3   hdd  7.27739         osd.3            up  1.0 1.0
>  7   hdd  7.27739         osd.7            up  1.0 1.0
>  9   hdd  7.27739         osd.9            up  1.0 1.0
> -7       14.55478     host pf-us1-dfs3
>  2   hdd  7.27739         osd.2            up  1.0 1.0
>  4   hdd  7.27739         osd.4            up  1.0 1.0
>
> Thanks for your help guys.
>
> On Tue, Jan 8, 2019 at 10:36 AM Yoann Moulin wrote:
>>
>> Hello,
>>
>> Could you give us the output of
>>
>> ceph osd df
>> ceph osd pool ls detail
>> ceph osd tree
>>
>> Best regards,
>>
>> --
>> Yoann Moulin
>> EPFL IC-IT
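The arithmetic behind that answer: with replicated size 3 and CRUSH failure domain "host", each object needs one copy on each of the three hosts, so usable capacity is capped by the smallest host, not by the cluster total. A sketch with the weights from the `ceph osd tree` above:

```python
# OSD weights (TiB) per host, taken from the `ceph osd tree` in this thread.
hosts = {
    "pf-us1-dfs1": 4 * 7.27739,
    "pf-us1-dfs2": 4 * 7.27739,
    "pf-us1-dfs3": 2 * 7.27739,  # only two OSDs on this host
}

# One full replica of everything must fit on every host.
usable = min(hosts.values())
print(f"{usable:.2f} TiB usable")  # -> 14.55 TiB usable
```

That is why osd.2 and osd.4 fill up while the nearly empty OSDs on the larger hosts cannot take their share.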
Re: [ceph-users] Problem with CephFS - No space left on device
Looks like the same problem as mine:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-January/032054.html

The free space is a cluster-wide total, while Ceph is limited by the smallest free space (the worst OSD). Please check your (re-)weights.

Kevin

Am Di., 8. Jan. 2019 um 14:32 Uhr schrieb Rodrigo Embeita:
>
> Hi guys, I need your help.
> I'm new with Cephfs and we started using it as file storage.
> Today we are getting "no space left on device", but I'm seeing that we have
> plenty of space on the filesystem.
>
> Filesystem                                                          Size Used Avail Use% Mounted on
> 192.168.51.8,192.168.51.6,192.168.51.118:6789:/pagefreezer/smhosts  73T  39T  35T   54%  /mnt/cephfs
>
> We have 35TB of disk space. I've added 2 additional OSD disks with 7TB each,
> but I'm getting the error "No space left on device" every time I want to add
> a new file. After adding the 2 additional OSD disks I'm seeing that the load
> is being distributed among the cluster.
> Please, I need your help.
>
> root@pf-us1-dfs1:/etc/ceph# ceph -s
>   cluster:
>     id:     609e9313-bdd3-449e-a23f-3db8382e71fb
>     health: HEALTH_ERR
>             2 backfillfull osd(s)
>             1 full osd(s)
>             7 pool(s) full
>             197313040/508449063 objects misplaced (38.807%)
>             Degraded data redundancy: 2/508449063 objects degraded (0.000%), 2 pgs degraded
>             Degraded data redundancy (low space): 16 pgs backfill_toofull, 3 pgs recovery_toofull
>
>   services:
>     mon: 3 daemons, quorum pf-us1-dfs2,pf-us1-dfs1,pf-us1-dfs3
>     mgr: pf-us1-dfs3(active), standbys: pf-us1-dfs2
>     mds: pagefs-2/2/2 up {0=pf-us1-dfs3=up:active,1=pf-us1-dfs1=up:active}, 1 up:standby
>     osd: 10 osds: 10 up, 10 in; 189 remapped pgs
>     rgw: 1 daemon active
>
>   data:
>     pools:   7 pools, 416 pgs
>     objects: 169.5 M objects, 3.6 TiB
>     usage:   39 TiB used, 34 TiB / 73 TiB avail
>     pgs:     2/508449063 objects degraded (0.000%)
>              197313040/508449063 objects misplaced (38.807%)
>              224 active+clean
>              168 active+remapped+backfill_wait
>              16  active+remapped+backfill_wait+backfill_toofull
>              5   active+remapped+backfilling
>              2   active+recovery_toofull+degraded
>              1   active+recovery_toofull
>
>   io:
>     recovery: 1.1 MiB/s, 31 objects/s
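The reason "no space left" appears despite tens of TiB free: Ceph stops writes as soon as any single OSD crosses the full ratio (0.95 by default), because it cannot safely place data that CRUSH maps onto a full OSD. A sketch using the %USE values from the `ceph osd df` posted in this thread:

```python
FULL_RATIO = 0.95  # Ceph's default full ratio

# %USE values (as fractions) from the `ceph osd df` output in this thread.
util = {0: 0.9233, 5: 0.7718, 6: 0.0817, 8: 0.0003,
        1: 0.7728, 3: 0.9502, 7: 0.0484, 9: 0.0006,
        2: 0.9227, 4: 0.8410}

full = [osd for osd, u in util.items() if u >= FULL_RATIO]
print(full)  # -> [3]: osd.3 at 95.02% is what makes the pools report full
```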
Re: [ceph-users] Balancer=on with crush-compat mode
If I understand the balancer correctly, it balances PGs, not data. That worked perfectly fine in your case. I prefer a PG count of ~100 per OSD; you are at 30. Maybe it would help to bump the PGs.

Kevin

Am Sa., 5. Jan. 2019 um 14:39 Uhr schrieb Marc Roos:
>
> I have straw2, balancer=on, crush-compat, and it gives the worst spread over
> my ssd drives (4 only) being used by only 2 pools. One of these pools
> has pg 8. Should I increase this to 16 to create a better result, or
> will it never be any better?
>
> For now I like to stick to crush-compat, so I can use a default centos7
> kernel.
>
> Luminous 12.2.8, 3.10.0-862.14.4.el7.x86_64, CentOS Linux release 7.5.1804 (Core)
>
> [@c01 ~]# cat balancer-1-before.txt | egrep '^19|^20|^21|^30'
> 19 ssd 0.48000 1.0 447GiB 164GiB 283GiB 36.79 0.93 31
> 20 ssd 0.48000 1.0 447GiB 136GiB 311GiB 30.49 0.77 32
> 21 ssd 0.48000 1.0 447GiB 215GiB 232GiB 48.02 1.22 30
> 30 ssd 0.48000 1.0 447GiB 151GiB 296GiB 33.72 0.86 27
>
> [@c01 ~]# ceph osd df | egrep '^19|^20|^21|^30'
> 19 ssd 0.48000 1.0 447GiB 157GiB 290GiB 35.18 0.87 30
> 20 ssd 0.48000 1.0 447GiB 125GiB 322GiB 28.00 0.69 30
> 21 ssd 0.48000 1.0 447GiB 245GiB 202GiB 54.71 1.35 30
> 30 ssd 0.48000 1.0 447GiB 217GiB 230GiB 48.46 1.20 30
>
> [@c01 ~]# ceph osd pool ls detail | egrep 'fs_meta|rbd.ssd'
> pool 19 'fs_meta' replicated size 3 min_size 2 crush_rule 5 object_hash
> rjenkins pg_num 16 pgp_num 16 last_change 22425 lfor 0/9035 flags
> hashpspool stripe_width 0 application cephfs
> pool 54 'rbd.ssd' replicated size 3 min_size 2 crush_rule 5 object_hash
> rjenkins pg_num 8 pgp_num 8 last_change 24666 flags hashpspool
> stripe_width 0 application rbd
>
> [@c01 ~]# ceph df | egrep 'ssd|fs_meta'
> fs_meta          19 170MiB  0.07  240GiB 2451382
> fs_data.ssd      33 0B      0     240GiB 0
> rbd.ssd          54 266GiB  52.57 240GiB 75902
> fs_data.ec21.ssd 55 0B      0     480GiB 0
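A rough sketch of that sizing rule of thumb (~100 PGs per OSD is a common target; total PGs are shared across all pools on a device class, so this is an upper bound, not a per-pool prescription):

```python
def suggest_pg_num(osds, replicas, target_per_osd=100):
    """Nearest power of two to osds * target_per_osd / replicas."""
    raw = osds * target_per_osd / replicas
    lower = 1
    while lower * 2 <= raw:
        lower *= 2
    upper = lower * 2
    return lower if raw - lower <= upper - raw else upper

# 4 SSD OSDs, replicated size 3 -> far more than the pool's current pg_num 8.
print(suggest_pg_num(4, 3))  # -> 128
```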
Re: [ceph-users] Usage of devices in SSD pool vary very much
                 osd.33
  2 ssd 0.43700 1.0 447GiB  271GiB  176GiB 60.67 1.30  50         osd.2
  3 ssd 0.43700 1.0 447GiB  249GiB  198GiB 55.62 1.19  58         osd.3
  4 ssd 0.43700 1.0 447GiB  297GiB  150GiB 66.39 1.42  56         osd.4
 30 ssd 0.43660 1.0 447GiB  236GiB  211GiB 52.85 1.13  48         osd.30
 -7     6.29724       - 6.29TiB 2.74TiB 3.55TiB 43.53 0.93   -     host node1002
  9 hdd 0.90999 1.0 932GiB  354GiB  578GiB 37.96 0.81  95         osd.9
 10 hdd 0.90999 1.0 932GiB  357GiB  575GiB 38.28 0.82  96         osd.10
 11 hdd 0.90999 1.0 932GiB  318GiB  613GiB 34.18 0.73  86         osd.11
 12 hdd 0.90999 1.0 932GiB  373GiB  558GiB 40.09 0.86 100         osd.12
 35 hdd 0.90970 1.0 932GiB  343GiB  588GiB 36.83 0.79  92         osd.35
  6 ssd 0.43700 1.0 447GiB  269GiB  178GiB 60.20 1.29  60         osd.6
  7 ssd 0.43700 1.0 447GiB  249GiB  198GiB 55.69 1.19  56         osd.7
  8 ssd 0.43700 1.0 447GiB  286GiB  161GiB 63.95 1.37  56         osd.8
 31 ssd 0.43660 1.0 447GiB  257GiB  190GiB 57.47 1.23  55         osd.31
-28     2.18318       - 2.18TiB  968GiB 1.24TiB 43.29 0.93   -     host node1005
 34 ssd 0.43660 1.0 447GiB  202GiB  245GiB 45.14 0.97  47         osd.34
 36 ssd 0.87329 1.0 894GiB  405GiB  489GiB 45.28 0.97  91         osd.36
 37 ssd 0.87329 1.0 894GiB  361GiB  533GiB 40.38 0.87  79         osd.37
-29     1.74658       - 1.75TiB  888GiB  900GiB 49.65 1.06   -     host node1006
 42 ssd 0.87329 1.0 894GiB  417GiB  477GiB 46.68 1.00  92         osd.42
 43 ssd 0.87329 1.0 894GiB  471GiB  424GiB 52.63 1.13  90         osd.43
-11    13.39537       - 13.4TiB 6.64TiB 6.75TiB 49.60 1.06   -   rack dc01-rack03
-22     5.38794       - 5.39TiB 2.70TiB 2.69TiB 50.14 1.07   -     host node1003
 17 hdd 0.90999 1.0 932GiB  371GiB  560GiB 39.83 0.85 100         osd.17
 18 hdd 0.90999 1.0 932GiB  390GiB  542GiB 41.82 0.90 105         osd.18
 24 hdd 0.90999 1.0 932GiB  352GiB  580GiB 37.77 0.81  94         osd.24
 26 hdd 0.90999 1.0 932GiB  387GiB  545GiB 41.54 0.89 104         osd.26
 13 ssd 0.43700 1.0 447GiB  319GiB  128GiB 71.32 1.53  77         osd.13
 14 ssd 0.43700 1.0 447GiB  303GiB  144GiB 67.76 1.45  70         osd.14
 15 ssd 0.43700 1.0 447GiB  361GiB 86.4GiB 80.67 1.73  77         osd.15
 16 ssd 0.43700 1.0 447GiB  283GiB  164GiB 63.29 1.36  71         osd.16
-25     5.38765       - 5.39TiB 2.83TiB 2.56TiB 52.55 1.13   -     host node1004
 23 hdd 0.90999 1.0 932GiB  382GiB  549GiB 41.05 0.88 102         osd.23
 25 hdd 0.90999 1.0 932GiB  412GiB  520GiB 44.20 0.95 111         osd.25
 27 hdd 0.90999 1.0 932GiB  385GiB  546GiB 41.36 0.89 103         osd.27
 28 hdd 0.90970 1.0 932GiB  462GiB  469GiB 49.64 1.06 124         osd.28
 19 ssd 0.43700 1.0 447GiB  314GiB  133GiB 70.22 1.51  75         osd.19
 20 ssd 0.43700 1.0 447GiB  327GiB  120GiB 73.06 1.57  76         osd.20
 21 ssd 0.43700 1.0 447GiB  324GiB  123GiB 72.45 1.55  77         osd.21
 22 ssd 0.43700 1.0 447GiB  292GiB  156GiB 65.21 1.40  68         osd.22
-30     2.61978       - 2.62TiB 1.11TiB 1.51TiB 42.43 0.91   -     host node1007
 38 ssd 0.43660 1.0 447GiB  165GiB  283GiB 36.82 0.79  46         osd.38
 39 ssd 0.43660 1.0 447GiB  156GiB  292GiB 34.79 0.75  42         osd.39
 40 ssd 0.87329 1.0 894GiB  429GiB  466GiB 47.94 1.03  98         osd.40
 41 ssd 0.87329 1.0 894GiB  389GiB  505GiB 43.55 0.93 103         osd.41
              TOTAL 29.9TiB 14.0TiB 16.0TiB 46.65
MIN/MAX VAR: 0.65/1.73 STDDEV: 13.30

=

root@adminnode:~# ceph df && ceph -v
GLOBAL:
    SIZE    AVAIL   RAW USED %RAW USED
    29.9TiB 16.0TiB 14.0TiB  46.65
POOLS:
    NAME              ID USED    %USED MAX AVAIL OBJECTS
    rbd_vms_ssd       2  986GiB  49.83 993GiB    262606
    rbd_vms_hdd       3  3.76TiB 48.94 3.92TiB   992255
    rbd_vms_ssd_01    4  372KiB  0     662GiB    148
    rbd_vms_ssd_01_ec 6  2.85TiB 68.81 1.29TiB   770506
ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)

Kevin

Am Sa., 5. Jan. 2019 um 05:12 Uhr schrieb Konstantin Shalygin:
> On 1/5/19 1:51 AM, Kevin Olbrich wrote:
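For reading these dumps: the VAR column is each OSD's %USE divided by the cluster-average utilization, and STDDEV summarizes the spread of the %USE column, so MIN/MAX VAR 0.65/1.73 means the emptiest OSD sits at 65% of average fill and the fullest at 173% — the imbalance the subject complains about. A sketch over a few of the SSD values above (whether Ceph uses population or sample standard deviation here is an assumption; population is used below):

```python
def min_max_var_stddev(use_pcts):
    """Reproduce the MIN/MAX VAR and STDDEV summary of `ceph osd df`."""
    avg = sum(use_pcts) / len(use_pcts)
    variances = [u / avg for u in use_pcts]
    stddev = (sum((u - avg) ** 2 for u in use_pcts) / len(use_pcts)) ** 0.5
    return min(variances), max(variances), stddev

lo, hi, sd = min_max_var_stddev([60.67, 55.62, 66.39, 52.85, 36.82, 80.67])
print(f"MIN/MAX VAR: {lo:.2f}/{hi:.2f} STDDEV: {sd:.2f}")
```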
Re: [ceph-users] Help Ceph Cluster Down
degraded, acting [9,31] > pg 14.8f9 is activating+degraded, acting [27,21] > pg 14.901 is activating+degraded, acting [22,8] > pg 14.910 is activating+degraded, acting [17,2] > pg 20.808 is activating+degraded, acting [20,12] > pg 20.825 is activating+degraded, acting [25,35] > pg 20.827 is activating+degraded, acting [23,16] > pg 20.829 is activating+degraded, acting [20,31] > pg 20.837 is activating+degraded, acting [31,6] > pg 20.83c is activating+degraded, acting [26,17] > pg 20.85e is activating+degraded, acting [4,27] > pg 20.85f is activating+degraded, acting [1,25] > pg 20.865 is activating+degraded, acting [8,33] > pg 20.88b is activating+degraded, acting [6,32] > pg 20.895 is stale+activating+degraded, acting [37,27] > pg 20.89c is activating+degraded, acting [1,24] > pg 20.8a3 is activating+degraded, acting [30,1] > pg 20.8ad is activating+degraded, acting [1,20] > pg 20.8af is activating+degraded, acting [33,31] > pg 20.8b4 is activating+degraded, acting [9,1] > pg 20.8b7 is activating+degraded, acting [0,33] > pg 20.8b9 is activating+degraded, acting [20,24] > pg 20.8c5 is activating+degraded, acting [27,14] > pg 20.8d1 is activating+degraded, acting [10,7] > pg 20.8d4 is activating+degraded, acting [28,21] > pg 20.8d5 is activating+degraded, acting [24,15] > pg 20.8e0 is activating+degraded, acting [18,0] > pg 20.8e2 is activating+degraded, acting [25,7] > pg 20.8ea is activating+degraded, acting [17,21] > pg 20.8f1 is activating+degraded, acting [15,11] > pg 20.8fb is activating+degraded, acting [10,24] > pg 20.8fc is activating+degraded, acting [20,15] > pg 20.8ff is activating+degraded, acting [18,25] > pg 20.913 is activating+degraded, acting [11,0] > pg 20.91d is activating+degraded, acting [10,16] > REQUEST_SLOW 99059 slow requests are blocked > 32 sec > 24235 ops are blocked > 2097.15 sec > 17029 ops are blocked > 1048.58 sec > 54122 ops are blocked > 524.288 sec > 2311 ops are blocked > 262.144 sec > 767 ops are blocked > 131.072 sec > 396 ops 
are blocked > 65.536 sec > 199 ops are blocked > 32.768 sec > osd.32 has blocked requests > 262.144 sec > osds 5,8,12,26,28 have blocked requests > 524.288 sec > osds 1,3,9,10 have blocked requests > 1048.58 sec > osds 2,14,18,19,20,23,24,25,27,29,30,31,33,34,35 have blocked requests > > 2097.15 sec > REQUEST_STUCK 4834 stuck requests are blocked > 4096 sec > 4834 ops are blocked > 4194.3 sec > osds 0,4,11,13,17,21,22 have stuck requests > 4194.3 sec > TOO_MANY_PGS too many PGs per OSD (3003 > max 200) > [root@fre101 ~]# > > [root@fre101 ~]# ceph -s > 2019-01-04 15:18:53.398950 7fc372c94700 -1 asok(0x7fc36c0017a0) > AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to > bind the UNIX domain socket to > '/var/run/ceph-guests/ceph-client.admin.130425.140477307296080.asok': (2) No > such file or directory > cluster: > id: adb9ad8e-f458-4124-bf58-7963a8d1391f > health: HEALTH_ERR > 3 pools have many more objects per pg than average > 523656/12393978 objects misplaced (4.225%) > 6523 PGs pending on creation > Reduced data availability: 6584 pgs inactive, 1267 pgs down, 2 > pgs peering, 2696 pgs stale > Degraded data redundancy: 86858/12393978 objects degraded > (0.701%), 717 pgs degraded, 21 pgs undersized > 107622 slow requests are blocked > 32 sec > 4957 stuck requests are blocked > 4096 sec > too many PGs per OSD (3003 > max 200) > > services: > mon: 3 daemons, quorum ceph-mon01,ceph-mon02,ceph-mon03 > mgr: ceph-mon03(active), standbys: ceph-mon01, ceph-mon02 > osd: 39 osds: 39 up, 36 in; 85 remapped pgs > rgw: 1 daemon active > > data: > pools: 18 pools, 54656 pgs > objects: 6051k objects, 10947 GB > usage: 21971 GB used, 50650 GB / 72622 GB avail > pgs: 0.002% pgs unknown > 12.046% pgs not active > 86858/12393978 objects degraded (0.701%) > 523656/12393978 objects misplaced (4.225%) > 46743 active+clean > 4342 activating > 1317 stale+active+clean > 1151 stale+down > 667 activating+degraded > 159 stale+activating > 116 down > 
77 activating+remapped > 34 stale+activating+degraded > 21 stale+activating+remapped > 9 stale+active+undersiz
Re: [ceph-users] Help Ceph Cluster Down
I don't think this will help you. Unfound means, the cluster is unable to find the data anywhere (it's lost). It would be sufficient to shut down the new host - the OSDs will then be out. You can also force-heal the cluster, something like "do your best possible": ceph pg 2.5 mark_unfound_lost revert|delete Src: http://docs.ceph.com/docs/mimic/rados/troubleshooting/troubleshooting-pg/ Kevin Am Fr., 4. Jan. 2019 um 20:47 Uhr schrieb Arun POONIA : > > Hi Kevin, > > Can I remove newly added server from Cluster and see if it heals cluster ? > > When I check Hard Disk Iops on new server which are very low compared to > existing cluster server. > > Indeed this is a critical cluster but I don't have expertise to make it > flawless. > > Thanks > Arun > > On Fri, Jan 4, 2019 at 11:35 AM Kevin Olbrich wrote: >> >> If you realy created and destroyed OSDs before the cluster healed >> itself, this data will be permanently lost (not found / inactive). >> Also your PG count is so much oversized, the calculation for peering >> will most likely break because this was never tested. >> >> If this is a critical cluster, I would start a new one and bring back >> the backups (using a better PG count). >> >> Kevin >> >> Am Fr., 4. Jan. 2019 um 20:25 Uhr schrieb Arun POONIA >> : >> > >> > Can anyone comment on this issue please, I can't seem to bring my cluster >> > healthy. >> > >> > On Fri, Jan 4, 2019 at 6:26 AM Arun POONIA >> > wrote: >> >> >> >> Hi Caspar, >> >> >> >> Number of IOPs are also quite low. It used be around 1K Plus on one of >> >> Pool (VMs) now its like close to 10-30 . >> >> >> >> Thansk >> >> Arun >> >> >> >> On Fri, Jan 4, 2019 at 5:41 AM Arun POONIA >> >> wrote: >> >>> >> >>> Hi Caspar, >> >>> >> >>> Yes and No, numbers are going up and down. If I run ceph -s command I >> >>> can see it decreases one time and later it increases again. I see there >> >>> are so many blocked/slow requests. Almost all the OSDs have slow >> >>> requests. 
Around 12% PGs are inactive not sure how to activate them >> >>> again. >> >>> >> >>> >> >>> [root@fre101 ~]# ceph health detail >> >>> 2019-01-04 05:39:23.860142 7fc37a3a0700 -1 asok(0x7fc3740017a0) >> >>> AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed >> >>> to bind the UNIX domain socket to >> >>> '/var/run/ceph-guests/ceph-client.admin.1066526.140477441513808.asok': >> >>> (2) No such file or directory >> >>> HEALTH_ERR 1 osds down; 3 pools have many more objects per pg than >> >>> average; 472812/12392654 objects misplaced (3.815%); 3610 PGs pending on >> >>> creation; Reduced data availability: 6578 pgs inactive, 1882 pgs down, >> >>> 86 pgs peering, 850 pgs stale; Degraded data redundancy: 216694/12392654 >> >>> objects degraded (1.749%), 866 pgs degraded, 16 pgs undersized; 116082 >> >>> slow requests are blocked > 32 sec; 551 stuck requests are blocked > >> >>> 4096 sec; too many PGs per OSD (2709 > max 200) >> >>> OSD_DOWN 1 osds down >> >>> osd.28 (root=default,host=fre119) is down >> >>> MANY_OBJECTS_PER_PG 3 pools have many more objects per pg than average >> >>> pool glance-images objects per pg (10478) is more than 92.7257 times >> >>> cluster average (113) >> >>> pool vms objects per pg (4717) is more than 41.7434 times cluster >> >>> average (113) >> >>> pool volumes objects per pg (1220) is more than 10.7965 times >> >>> cluster average (113) >> >>> OBJECT_MISPLACED 472812/12392654 objects misplaced (3.815%) >> >>> PENDING_CREATING_PGS 3610 PGs pending on creation >> >>> osds >> >>> [osd.0,osd.1,osd.10,osd.11,osd.14,osd.15,osd.17,osd.18,osd.19,osd.20,osd.21,osd.22,osd.23,osd.25,osd.26,osd.27,osd.28,osd.3,osd.30,osd.32,osd.33,osd.35,osd.36,osd.37,osd.38,osd.4,osd.5,osd.6,osd.7,osd.9] >> >>> have pending PGs. 
>> >>> PG_AVAILABILITY Reduced data availability: 6578 pgs inactive, 1882 pgs >> >>> down, 86 pgs peering, 850 pgs stale >> >>> pg 10.900 is down, acting [18] >> >>> pg 10.90e is stuck inactive for 60266.030164, current state >> >&g
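A hedged sketch of the unfound-object workflow suggested above (PG 2.5 is a placeholder taken from the documentation example, not from this cluster; commands follow the linked troubleshooting-pg doc):

```shell
# Show which PGs report unfound objects:
ceph health detail | grep unfound

# Inspect the unfound objects of one PG (placeholder PG ID):
ceph pg 2.5 list_missing

# Last resort ("do your best possible"): give the objects up, either
# rolling back to the previous version (revert) or forgetting them
# entirely (delete). The affected data is gone afterwards.
ceph pg 2.5 mark_unfound_lost revert
# or:
ceph pg 2.5 mark_unfound_lost delete
```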
Re: [ceph-users] Help Ceph Cluster Down
If you really created and destroyed OSDs before the cluster healed itself, this data will be permanently lost (not found / inactive). Also, your PG count is so heavily oversized that the peering calculation will most likely break, because this was never tested. If this is a critical cluster, I would start a new one and bring back the backups (using a better PG count). Kevin Am Fr., 4. Jan. 2019 um 20:25 Uhr schrieb Arun POONIA : > > Can anyone comment on this issue please, I can't seem to bring my cluster > healthy. > > On Fri, Jan 4, 2019 at 6:26 AM Arun POONIA > wrote: >> >> Hi Caspar, >> >> Number of IOPs are also quite low. It used to be around 1K Plus on one of Pool >> (VMs) now its like close to 10-30 . >> >> Thanks >> Arun >> >> On Fri, Jan 4, 2019 at 5:41 AM Arun POONIA >> wrote: >>> >>> Hi Caspar, >>> >>> Yes and No, numbers are going up and down. If I run ceph -s command I can >>> see it decreases one time and later it increases again. I see there are so >>> many blocked/slow requests. Almost all the OSDs have slow requests. Around >>> 12% PGs are inactive not sure how to activate them again. 
>>> >>> >>> [root@fre101 ~]# ceph health detail >>> 2019-01-04 05:39:23.860142 7fc37a3a0700 -1 asok(0x7fc3740017a0) >>> AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to >>> bind the UNIX domain socket to >>> '/var/run/ceph-guests/ceph-client.admin.1066526.140477441513808.asok': (2) >>> No such file or directory >>> HEALTH_ERR 1 osds down; 3 pools have many more objects per pg than average; >>> 472812/12392654 objects misplaced (3.815%); 3610 PGs pending on creation; >>> Reduced data availability: 6578 pgs inactive, 1882 pgs down, 86 pgs >>> peering, 850 pgs stale; Degraded data redundancy: 216694/12392654 objects >>> degraded (1.749%), 866 pgs degraded, 16 pgs undersized; 116082 slow >>> requests are blocked > 32 sec; 551 stuck requests are blocked > 4096 sec; >>> too many PGs per OSD (2709 > max 200) >>> OSD_DOWN 1 osds down >>> osd.28 (root=default,host=fre119) is down >>> MANY_OBJECTS_PER_PG 3 pools have many more objects per pg than average >>> pool glance-images objects per pg (10478) is more than 92.7257 times >>> cluster average (113) >>> pool vms objects per pg (4717) is more than 41.7434 times cluster >>> average (113) >>> pool volumes objects per pg (1220) is more than 10.7965 times cluster >>> average (113) >>> OBJECT_MISPLACED 472812/12392654 objects misplaced (3.815%) >>> PENDING_CREATING_PGS 3610 PGs pending on creation >>> osds >>> [osd.0,osd.1,osd.10,osd.11,osd.14,osd.15,osd.17,osd.18,osd.19,osd.20,osd.21,osd.22,osd.23,osd.25,osd.26,osd.27,osd.28,osd.3,osd.30,osd.32,osd.33,osd.35,osd.36,osd.37,osd.38,osd.4,osd.5,osd.6,osd.7,osd.9] >>> have pending PGs. 
>>> PG_AVAILABILITY Reduced data availability: 6578 pgs inactive, 1882 pgs >>> down, 86 pgs peering, 850 pgs stale >>> pg 10.900 is down, acting [18] >>> pg 10.90e is stuck inactive for 60266.030164, current state activating, >>> last acting [2,38] >>> pg 10.913 is stuck stale for 1887.552862, current state stale+down, >>> last acting [9] >>> pg 10.915 is stuck inactive for 60266.215231, current state activating, >>> last acting [30,38] >>> pg 11.903 is stuck inactive for 59294.465961, current state activating, >>> last acting [11,38] >>> pg 11.910 is down, acting [21] >>> pg 11.919 is down, acting [25] >>> pg 12.902 is stuck inactive for 57118.544590, current state activating, >>> last acting [36,14] >>> pg 13.8f8 is stuck inactive for 60707.167787, current state activating, >>> last acting [29,37] >>> pg 13.901 is stuck stale for 60226.543289, current state >>> stale+active+clean, last acting [1,31] >>> pg 13.905 is stuck inactive for 60266.050940, current state activating, >>> last acting [2,36] >>> pg 13.909 is stuck inactive for 60707.160714, current state activating, >>> last acting [34,36] >>> pg 13.90e is stuck inactive for 60707.410749, current state activating, >>> last acting [21,36] >>> pg 13.911 is down, acting [25] >>> pg 13.914 is stale+down, acting [29] >>> pg 13.917 is stuck stale for 580.224688, current state stale+down, last >>> acting [16] >>> pg 14.901 is stuck inactive for 60266.037762, current state >>> activating+degraded, last acting [22,37] >>> pg 14.90f is stuck inactive for 60296.996447, current state activating, >>> last acting [30,36] >>> pg 14.910 is stuck inactive for 60266.077310, current state >>> activating+degraded, last acting [17,37] >>> pg 14.915 is stuck inactive for 60266.032445, current state activating, >>> last acting [34,36] >>> pg 15.8fa is stuck stale for 560.223249, current state stale+down, last >>> acting [8] >>> pg 15.90c is stuck inactive for 59294.402388, current state activating, >>> last acting [29,38] >>> pg 
15.90d is stuck inactive for 60266.176492, current state activating, >>> last acting [5,36] >>> pg 15.915
Re: [ceph-users] Usage of devices in SSD pool vary very much
PS: Could be http://tracker.ceph.com/issues/36361 There is one HDD OSD that is out (which will not be replaced because the SSD pool will get the images and the hdd pool will be deleted). Kevin Am Fr., 4. Jan. 2019 um 19:46 Uhr schrieb Kevin Olbrich : > > Hi! > > I did what you wrote but my MGRs started to crash again: > root@adminnode:~# ceph -s > cluster: > id: 086d9f80-6249-4594-92d0-e31b6a9c > health: HEALTH_WARN > no active mgr > 105498/6277782 objects misplaced (1.680%) > > services: > mon: 3 daemons, quorum mon01,mon02,mon03 > mgr: no daemons active > osd: 44 osds: 43 up, 43 in > > data: > pools: 4 pools, 1616 pgs > objects: 1.88M objects, 7.07TiB > usage: 13.2TiB used, 16.7TiB / 29.9TiB avail > pgs: 105498/6277782 objects misplaced (1.680%) > 1606 active+clean > 8active+remapped+backfill_wait > 2active+remapped+backfilling > > io: > client: 5.51MiB/s rd, 3.38MiB/s wr, 33op/s rd, 317op/s wr > recovery: 60.3MiB/s, 15objects/s > > > MON 1 log: >-13> 2019-01-04 14:05:04.432186 7fec56a93700 4 mgr ms_dispatch > active mgrdigest v1 >-12> 2019-01-04 14:05:04.432194 7fec56a93700 4 mgr ms_dispatch mgrdigest > v1 >-11> 2019-01-04 14:05:04.822041 7fec434e1700 4 mgr[balancer] > Optimize plan auto_2019-01-04_14:05:04 >-10> 2019-01-04 14:05:04.822170 7fec434e1700 4 mgr get_config > get_configkey: mgr/balancer/mode > -9> 2019-01-04 14:05:04.822231 7fec434e1700 4 mgr get_config > get_configkey: mgr/balancer/max_misplaced > -8> 2019-01-04 14:05:04.822268 7fec434e1700 4 ceph_config_get > max_misplaced not found > -7> 2019-01-04 14:05:04.822444 7fec434e1700 4 mgr[balancer] Mode > upmap, max misplaced 0.05 > -6> 2019-01-04 14:05:04.822849 7fec434e1700 4 mgr[balancer] do_upmap > -5> 2019-01-04 14:05:04.822923 7fec434e1700 4 mgr get_config > get_configkey: mgr/balancer/upmap_max_iterations > -4> 2019-01-04 14:05:04.822964 7fec434e1700 4 ceph_config_get > upmap_max_iterations not found > -3> 2019-01-04 14:05:04.823013 7fec434e1700 4 mgr get_config > get_configkey: 
mgr/balancer/upmap_max_deviation > -2> 2019-01-04 14:05:04.823048 7fec434e1700 4 ceph_config_get > upmap_max_deviation not found > -1> 2019-01-04 14:05:04.823265 7fec434e1700 4 mgr[balancer] pools > ['rbd_vms_hdd', 'rbd_vms_ssd', 'rbd_vms_ssd_01', 'rbd_vms_ssd_01_ec'] > 0> 2019-01-04 14:05:04.836124 7fec434e1700 -1 > /build/ceph-12.2.8/src/osd/OSDMap.cc: In function 'int > OSDMap::calc_pg_upmaps(CephContext*, float, int, const std::set int>&, OSDMap::Incremental*)' thread 7fec434e1700 time 2019-01-04 > 14:05:04.832885 > /build/ceph-12.2.8/src/osd/OSDMap.cc: 4102: FAILED assert(target > 0) > > ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) > luminous (stable) > 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char > const*)+0x102) [0x558c3c0bb572] > 2: (OSDMap::calc_pg_upmaps(CephContext*, float, int, std::set std::less, std::allocator > const&, > OSDMap::Incremental*)+0x2801) [0x558c3c1c0ee1] > 3: (()+0x2f3020) [0x558c3bf5d020] > 4: (PyEval_EvalFrameEx()+0x8a51) [0x7fec5e832971] > 5: (PyEval_EvalCodeEx()+0x85c) [0x7fec5e96805c] > 6: (PyEval_EvalFrameEx()+0x6ffd) [0x7fec5e830f1d] > 7: (PyEval_EvalFrameEx()+0x7124) [0x7fec5e831044] > 8: (PyEval_EvalFrameEx()+0x7124) [0x7fec5e831044] > 9: (PyEval_EvalCodeEx()+0x85c) [0x7fec5e96805c] > 10: (()+0x13e370) [0x7fec5e8be370] > 11: (PyObject_Call()+0x43) [0x7fec5e891273] > 12: (()+0x1853ac) [0x7fec5e9053ac] > 13: (PyObject_Call()+0x43) [0x7fec5e891273] > 14: (PyObject_CallMethod()+0xf4) [0x7fec5e892444] > 15: (PyModuleRunner::serve()+0x5c) [0x558c3bf5a18c] > 16: (PyModuleRunner::PyModuleRunnerThread::entry()+0x1b8) [0x558c3bf5a998] > 17: (()+0x76ba) [0x7fec5d74c6ba] > 18: (clone()+0x6d) [0x7fec5c7b841d] > NOTE: a copy of the executable, or `objdump -rdS ` is > needed to interpret this. 
> > --- logging levels --- >0/ 5 none >0/ 1 lockdep >0/ 1 context >1/ 1 crush >1/ 5 mds >1/ 5 mds_balancer >1/ 5 mds_locker >1/ 5 mds_log >1/ 5 mds_log_expire >1/ 5 mds_migrator >0/ 1 buffer >0/ 1 timer >0/ 1 filer >0/ 1 striper >0/ 1 objecter >0/ 5 rados >0/ 5 rbd >0/ 5 rbd_mirror >0/ 5 rbd_replay >0/ 5 journaler >0/ 5 objectcacher >0/ 5 client >1/ 5 osd >0/ 5 optracker >0/ 5 objclass >1/ 3 filestore >1/ 3 journal >0/ 5 ms >1/ 5 mon >0/10 monc >
Re: [ceph-users] Usage of devices in SSD pool vary very much
Hi! I did what you wrote but my MGRs started to crash again: root@adminnode:~# ceph -s cluster: id: 086d9f80-6249-4594-92d0-e31b6a9c health: HEALTH_WARN no active mgr 105498/6277782 objects misplaced (1.680%) services: mon: 3 daemons, quorum mon01,mon02,mon03 mgr: no daemons active osd: 44 osds: 43 up, 43 in data: pools: 4 pools, 1616 pgs objects: 1.88M objects, 7.07TiB usage: 13.2TiB used, 16.7TiB / 29.9TiB avail pgs: 105498/6277782 objects misplaced (1.680%) 1606 active+clean 8active+remapped+backfill_wait 2active+remapped+backfilling io: client: 5.51MiB/s rd, 3.38MiB/s wr, 33op/s rd, 317op/s wr recovery: 60.3MiB/s, 15objects/s MON 1 log: -13> 2019-01-04 14:05:04.432186 7fec56a93700 4 mgr ms_dispatch active mgrdigest v1 -12> 2019-01-04 14:05:04.432194 7fec56a93700 4 mgr ms_dispatch mgrdigest v1 -11> 2019-01-04 14:05:04.822041 7fec434e1700 4 mgr[balancer] Optimize plan auto_2019-01-04_14:05:04 -10> 2019-01-04 14:05:04.822170 7fec434e1700 4 mgr get_config get_configkey: mgr/balancer/mode -9> 2019-01-04 14:05:04.822231 7fec434e1700 4 mgr get_config get_configkey: mgr/balancer/max_misplaced -8> 2019-01-04 14:05:04.822268 7fec434e1700 4 ceph_config_get max_misplaced not found -7> 2019-01-04 14:05:04.822444 7fec434e1700 4 mgr[balancer] Mode upmap, max misplaced 0.05 -6> 2019-01-04 14:05:04.822849 7fec434e1700 4 mgr[balancer] do_upmap -5> 2019-01-04 14:05:04.822923 7fec434e1700 4 mgr get_config get_configkey: mgr/balancer/upmap_max_iterations -4> 2019-01-04 14:05:04.822964 7fec434e1700 4 ceph_config_get upmap_max_iterations not found -3> 2019-01-04 14:05:04.823013 7fec434e1700 4 mgr get_config get_configkey: mgr/balancer/upmap_max_deviation -2> 2019-01-04 14:05:04.823048 7fec434e1700 4 ceph_config_get upmap_max_deviation not found -1> 2019-01-04 14:05:04.823265 7fec434e1700 4 mgr[balancer] pools ['rbd_vms_hdd', 'rbd_vms_ssd', 'rbd_vms_ssd_01', 'rbd_vms_ssd_01_ec'] 0> 2019-01-04 14:05:04.836124 7fec434e1700 -1 /build/ceph-12.2.8/src/osd/OSDMap.cc: In function 'int 
OSDMap::calc_pg_upmaps(CephContext*, float, int, const std::set&, OSDMap::Incremental*)' thread 7fec434e1700 time 2019-01-04 14:05:04.832885 /build/ceph-12.2.8/src/osd/OSDMap.cc: 4102: FAILED assert(target > 0) ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable) 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x558c3c0bb572] 2: (OSDMap::calc_pg_upmaps(CephContext*, float, int, std::set, std::allocator > const&, OSDMap::Incremental*)+0x2801) [0x558c3c1c0ee1] 3: (()+0x2f3020) [0x558c3bf5d020] 4: (PyEval_EvalFrameEx()+0x8a51) [0x7fec5e832971] 5: (PyEval_EvalCodeEx()+0x85c) [0x7fec5e96805c] 6: (PyEval_EvalFrameEx()+0x6ffd) [0x7fec5e830f1d] 7: (PyEval_EvalFrameEx()+0x7124) [0x7fec5e831044] 8: (PyEval_EvalFrameEx()+0x7124) [0x7fec5e831044] 9: (PyEval_EvalCodeEx()+0x85c) [0x7fec5e96805c] 10: (()+0x13e370) [0x7fec5e8be370] 11: (PyObject_Call()+0x43) [0x7fec5e891273] 12: (()+0x1853ac) [0x7fec5e9053ac] 13: (PyObject_Call()+0x43) [0x7fec5e891273] 14: (PyObject_CallMethod()+0xf4) [0x7fec5e892444] 15: (PyModuleRunner::serve()+0x5c) [0x558c3bf5a18c] 16: (PyModuleRunner::PyModuleRunnerThread::entry()+0x1b8) [0x558c3bf5a998] 17: (()+0x76ba) [0x7fec5d74c6ba] 18: (clone()+0x6d) [0x7fec5c7b841d] NOTE: a copy of the executable, or `objdump -rdS ` is needed to interpret this. 
--- logging levels --- 0/ 5 none 0/ 1 lockdep 0/ 1 context 1/ 1 crush 1/ 5 mds 1/ 5 mds_balancer 1/ 5 mds_locker 1/ 5 mds_log 1/ 5 mds_log_expire 1/ 5 mds_migrator 0/ 1 buffer 0/ 1 timer 0/ 1 filer 0/ 1 striper 0/ 1 objecter 0/ 5 rados 0/ 5 rbd 0/ 5 rbd_mirror 0/ 5 rbd_replay 0/ 5 journaler 0/ 5 objectcacher 0/ 5 client 1/ 5 osd 0/ 5 optracker 0/ 5 objclass 1/ 3 filestore 1/ 3 journal 0/ 5 ms 1/ 5 mon 0/10 monc 1/ 5 paxos 0/ 5 tp 1/ 5 auth 1/ 5 crypto 1/ 1 finisher 1/ 1 reserver 1/ 5 heartbeatmap 1/ 5 perfcounter 1/ 5 rgw 1/10 civetweb 1/ 5 javaclient 1/ 5 asok 1/ 1 throttle 0/ 0 refs 1/ 5 xio 1/ 5 compressor 1/ 5 bluestore 1/ 5 bluefs 1/ 3 bdev 1/ 5 kstore 4/ 5 rocksdb 4/ 5 leveldb 4/ 5 memdb 1/ 5 kinetic 1/ 5 fuse 1/ 5 mgr 1/ 5 mgrc 1/ 5 dpdk 1/ 5 eventtrace -2/-2 (syslog threshold) -1/-1 (stderr threshold) max_recent 1 max_new 1000 log_file /var/log/ceph/ceph-mgr.mon01.ceph01.srvfarm.net.log --- end dump of recent events --- 2019-01-04 14:05:05.032479 7fec434e1700 -1 *** Caught signal (Aborted) ** in thread 7fec434e1700 thread_name:balancer ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable) 1: (()+0x4105b4) [0x558c3c07a5b4] 2: (()+0x11390) [0x7fec5d756390] 3:
[ceph-users] TCP qdisc + congestion control / BBR
Hi! I wonder if changing qdisc and congestion_control (for example fq with Google BBR) on Ceph servers / clients has positive effects during high load. Google BBR: https://cloud.google.com/blog/products/gcp/tcp-bbr-congestion-control-comes-to-gcp-your-internet-just-got-faster I am running a lot of VMs with BBR but the hypervisors run fq_codel + cubic (OSDs run Ubuntu defaults). Did someone test qdisc and congestion control settings? Kevin ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
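For anyone who wants to experiment: on a reasonably recent kernel (4.9+), fq plus BBR can be switched on via sysctl. This is a sketch of the standard knobs, not a recommendation for Ceph nodes specifically:

```shell
# Check which congestion control algorithms the running kernel offers:
sysctl net.ipv4.tcp_available_congestion_control

# Switch the default qdisc to fq and congestion control to BBR at runtime:
sysctl -w net.core.default_qdisc=fq
sysctl -w net.ipv4.tcp_congestion_control=bbr

# Persist across reboots:
cat <<EOF >/etc/sysctl.d/90-bbr.conf
net.core.default_qdisc=fq
net.ipv4.tcp_congestion_control=bbr
EOF
```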
[ceph-users] Usage of devices in SSD pool vary very much
Hi!

On a medium sized cluster with device-classes, I am experiencing a problem with the SSD pool:

root@adminnode:~# ceph osd df | grep ssd
ID CLASS WEIGHT  REWEIGHT SIZE   USE    AVAIL   %USE  VAR  PGS
 2   ssd 0.43700 1.0      447GiB 254GiB 193GiB  56.77 1.28  50
 3   ssd 0.43700 1.0      447GiB 208GiB 240GiB  46.41 1.04  58
 4   ssd 0.43700 1.0      447GiB 266GiB 181GiB  59.44 1.34  55
30   ssd 0.43660 1.0      447GiB 222GiB 225GiB  49.68 1.12  49
 6   ssd 0.43700 1.0      447GiB 238GiB 209GiB  53.28 1.20  59
 7   ssd 0.43700 1.0      447GiB 228GiB 220GiB  50.88 1.14  56
 8   ssd 0.43700 1.0      447GiB 269GiB 178GiB  60.16 1.35  57
31   ssd 0.43660 1.0      447GiB 231GiB 217GiB  51.58 1.16  56
34   ssd 0.43660 1.0      447GiB 186GiB 261GiB  41.65 0.94  49
36   ssd 0.87329 1.0      894GiB 364GiB 530GiB  40.68 0.92  91
37   ssd 0.87329 1.0      894GiB 321GiB 573GiB  35.95 0.81  78
42   ssd 0.87329 1.0      894GiB 375GiB 519GiB  41.91 0.94  92
43   ssd 0.87329 1.0      894GiB 438GiB 456GiB  49.00 1.10  92
13   ssd 0.43700 1.0      447GiB 249GiB 198GiB  55.78 1.25  72
14   ssd 0.43700 1.0      447GiB 290GiB 158GiB  64.76 1.46  71
15   ssd 0.43700 1.0      447GiB 368GiB 78.6GiB 82.41 1.85  78 <
16   ssd 0.43700 1.0      447GiB 253GiB 194GiB  56.66 1.27  70
19   ssd 0.43700 1.0      447GiB 269GiB 178GiB  60.21 1.35  70
20   ssd 0.43700 1.0      447GiB 312GiB 135GiB  69.81 1.57  77
21   ssd 0.43700 1.0      447GiB 312GiB 135GiB  69.77 1.57  77
22   ssd 0.43700 1.0      447GiB 269GiB 178GiB  60.10 1.35  67
38   ssd 0.43660 1.0      447GiB 153GiB 295GiB  34.11 0.77  46
39   ssd 0.43660 1.0      447GiB 127GiB 320GiB  28.37 0.64  38
40   ssd 0.87329 1.0      894GiB 386GiB 508GiB  43.17 0.97  97
41   ssd 0.87329 1.0      894GiB 375GiB 520GiB  41.88 0.94 113

This leads to just 1.2TB free space (some GBs away from NEAR_FULL pool). Currently, the balancer plugin is off because it immediately crashed the MGR in the past (on 12.2.5). Since then I upgraded to 12.2.8 but did not re-enable the balancer. [I am unable to find the bugtracker ID]

Would the balancer plugin correct this situation? What happens if all MGRs die like they did on 12.2.5 because of the plugin? 
Will the balancer take data from the most-unbalanced OSDs first? Otherwise an OSD may fill up beyond FULL, which would freeze the whole pool (because the smallest OSD is taken into account for the free space calculation). This would be the worst case, as over 100 VMs would freeze and cause a lot of trouble. This is also the reason I have not tried to enable the balancer again. Kind regards Kevin ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
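The freeze scenario above follows from how Ceph derives a pool's MAX AVAIL: it is capped by the most-utilized OSD the pool maps to, not by the sum of free space. A simplified, illustrative model of that effect (this is not Ceph's exact formula; sizes and replica count are made-up examples, not values from this cluster):

```python
# Simplified illustration: a pool's usable capacity is capped by its
# fullest OSD, so moving data off the most-utilized OSDs first directly
# raises the pool's MAX AVAIL.

def pool_max_avail(osds, replicas=2):
    """osds: list of (size_gib, used_gib) tuples.
    Rough model of MAX AVAIL: scale total raw capacity by the worst
    OSD's free ratio, then divide by the replica count."""
    worst_free_ratio = min((size - used) / size for size, used in osds)
    total_raw = sum(size for size, _ in osds)
    return total_raw * worst_free_ratio / replicas

# Four identical 447 GiB OSDs, evenly vs. unevenly filled:
balanced   = [(447, 250)] * 4
unbalanced = [(447, 130), (447, 250), (447, 250), (447, 368)]  # one OSD at 82%

print(round(pool_max_avail(balanced), 1))    # even fill
print(round(pool_max_avail(unbalanced), 1))  # one hot OSD drags MAX AVAIL down
```

Same raw usage in both cases, but the single 82%-full OSD cuts the usable capacity of the whole pool by more than half.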
Re: [ceph-users] KVM+Ceph: Live migration of I/O-heavy VM
> > Assuming everything is on LVM including the root filesystem, only moving > > the boot partition will have to be done outside of LVM. > > Since the OP mentioned MS Exchange, I assume the VM is running windows. > You can do the same LVM-like trick in Windows Server via Disk Manager > though; add the new ceph RBD disk to the existing data volume as a > mirror; wait for it to sync, then break the mirror and remove the > original disk.

Mirrors only work on dynamic disks, which are a pain to revert and cause lots of problems with backup solutions. I will keep this in mind, as it is still better than shutting down the whole VM.

@all Thank you very much for your inputs. I will try some less important VMs and then start the migration of the big one.

Kind regards
Kevin
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] KVM+Ceph: Live migration of I/O-heavy VM
Hi!

Currently I plan a migration of a large VM (MS Exchange, 300 mailboxes and 900GB DB) from qcow2 on ext4 (RAID1) to an all-flash Ceph luminous cluster (which already holds lots of images). The server has access to both local and cluster storage; I only need to live-migrate the storage, not the machine. I have never used live migration, as it can cause more issues, and the VMs that were already migrated had planned downtime. Taking the VM offline and converting/importing with qemu-img would take some hours, but I would like to still serve clients, even if it is slower.

The VM is I/O-heavy in terms of the old storage (LSI/Adaptec with BBU). There are two HDDs bound as RAID1 which are constantly under 30% - 60% load (this goes up to 100% during reboot, updates or login prime-time).

What happens when either the local compute node or the ceph cluster fails (degraded)? Or the network is unavailable? Are all writes performed to both locations? Is this fail-safe? Or does the VM crash in the worst case, which can lead to a dirty shutdown for MS-EX DBs?

The node currently has 4GB free RAM and 29GB listed as cache / available. These numbers need caution because we have "tuned" enabled, which causes de-duplication on RAM, and this host runs about 10 Windows VMs. During reboots or updates, RAM can get full again.

Maybe I am too cautious about live storage migration, maybe I am not. What are your experiences or advice?

Thank you very much!

Kind regards
Kevin
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
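For readers in a similar spot: with libvirt, storage-only live migration is typically done with `virsh blockcopy`, which mirrors the running VM's disk to the new target and then pivots onto it. A rough sketch assuming a recent libvirt; the domain, image, pool and monitor names are placeholders, not from this setup:

```shell
# rbd-target.xml describes the destination as a network (rbd) disk, e.g.:
#   <disk type='network' device='disk'>
#     <driver name='qemu' type='raw'/>
#     <source protocol='rbd' name='rbd_pool/exchange-vm'>
#       <host name='mon01' port='6789'/>
#     </source>
#   </disk>

# Mirror the running VM's disk to the RBD image, then switch ("pivot")
# the domain over to it once the copy has converged:
virsh blockcopy exchange-vm vda --xml rbd-target.xml --wait --verbose --pivot
```

During the copy, writes go to both the source and the destination; until the pivot, the VM still depends only on the source side, which bounds the blast radius if the cluster misbehaves mid-copy.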
Re: [ceph-users] Packages for debian in Ceph repo
I now had the time to test and after installing this package, uploads to rbd are working perfectly. Thank you very much for sharing this! Kevin Am Mi., 7. Nov. 2018 um 15:36 Uhr schrieb Kevin Olbrich : > Am Mi., 7. Nov. 2018 um 07:40 Uhr schrieb Nicolas Huillard < > nhuill...@dolomede.fr>: > >> >> > It lists rbd but still fails with the exact same error. >> >> I stumbled upon the exact same error, and since there was no answer >> anywhere, I figured it was a very simple problem: don't forget to >> install the qemu-block-extra package (Debian stretch) along with qemu- >> utils which contains the qemu-img command. >> This command is actually compiled with rbd support (hence the output >> above), but needs this extra package to pull actual support-code and >> dependencies... >> > > I have not been able to test this yet but this package was indeed missing > on my system! > Thank you for this hint! > > >> -- >> Nicolas Huillard >> > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
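For reference, a minimal sketch of the kind of RBD upload that qemu-block-extra enables (the pool and image names are examples):

```shell
# rbd should now appear among the supported formats:
qemu-img --help | grep rbd

# Convert a local qcow2 image directly into an RBD image:
qemu-img convert -p -f qcow2 -O raw disk.qcow2 rbd:rbd/disk01
```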
Re: [ceph-users] Disabling write cache on SATA HDDs reduces write latency 7 times
I read the whole thread and it looks like the write cache should always be disabled, as in the worst case the performance is the same(?). This is based on this discussion. I will test some WD4002FYYZ which don't mention "media cache".

Kevin

Am Di., 13. Nov. 2018 um 09:27 Uhr schrieb Виталий Филиппов < vita...@yourcmc.ru>: > This may be the explanation: > > https://serverfault.com/questions/857271/better-performance-when-hdd-write-cache-is-disabled-hgst-ultrastar-7k6000-and > > Other manufacturers may have started to do the same, I suppose. > -- > With best regards, > Vitaliy Filippov ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
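A sketch of how such a per-drive test might look (`/dev/sdX` is a placeholder; note that `hdparm -W` changes are volatile on many drives and may need to be reapplied at boot, e.g. via a udev rule):

```shell
# Query the current volatile write cache state:
hdparm -W /dev/sdX

# Disable the volatile write cache:
hdparm -W 0 /dev/sdX

# Example sync-write benchmark to compare before/after
# (destructive: writes directly to the raw device!):
fio --name=synctest --filename=/dev/sdX --direct=1 --sync=1 \
    --rw=randwrite --bs=4k --runtime=30 --time_based
```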
Re: [ceph-users] Ceph or Gluster for implementing big NAS
Hi Dan,

ZFS without sync would be very much identical to ext2/ext4 without journals or XFS with barriers disabled. The ARC cache in ZFS is awesome, but disabling sync on ZFS is a very high risk (using ext4 with kvm-mode unsafe would be similar, I think). Also, ZFS only works as expected with the scheduler set to noop, as it is optimized to consume whole, non-shared devices.

Just my 2 cents ;-)

Kevin

Am Mo., 12. Nov. 2018 um 15:08 Uhr schrieb Dan van der Ster < d...@vanderster.com>: > We've done ZFS on RBD in a VM, exported via NFS, for a couple years. > It's very stable and if your use-case permits you can set zfs > sync=disabled to get very fast write performance that's tough to beat. > > But if you're building something new today and have *only* the NAS > use-case then it would make better sense to try CephFS first and see > if it works for you. > > -- Dan > > On Mon, Nov 12, 2018 at 3:01 PM Kevin Olbrich wrote: > > > > Hi! > > > > ZFS won't play nice on ceph. Best would be to mount CephFS directly with > the ceph-fuse driver on the endpoint. > > If you definitely want to put a storage gateway between the data and the > compute nodes, then go with nfs-ganesha which can export CephFS directly > without local ("proxy") mount. > > > > I had such a setup with nfs and switched to mount CephFS directly. If > using NFS with the same data, you must make sure your HA works well to > avoid data corruption. > > With ceph-fuse you directly connect to the cluster, one component less > that breaks. > > > > Kevin > > > > Am Mo., 12. Nov. 2018 um 12:44 Uhr schrieb Premysl Kouril < > premysl.kou...@gmail.com>: > >> > >> Hi, > >> > >> > >> We are planning to build NAS solution which will be primarily used via > NFS and CIFS and workloads ranging from various archival application to > more “real-time processing”. The NAS will not be used as a block storage > for virtual machines, so the access really will always be file oriented. 
> >> > >> > >> We are considering primarily two designs and I’d like to kindly ask for > any thoughts, views, insights, experiences. > >> > >> > >> Both designs utilize “distributed storage software at some level”. Both > designs would be built from commodity servers and should scale as we grow. > Both designs involve virtualization for instantiating "access virtual > machines" which will be serving the NFS and CIFS protocol - so in this > sense the access layer is decoupled from the data layer itself. > >> > >> > >> First design is based on a distributed filesystem like Gluster or > CephFS. We would deploy this software on those commodity servers and mount > the resultant filesystem on the “access virtual machines” and they would be > serving the mounted filesystem via NFS/CIFS. > >> > >> > >> Second design is based on distributed block storage using CEPH. So we > would build distributed block storage on those commodity servers, and then, > via virtualization (like OpenStack Cinder) we would allocate the block > storage into the access VM. Inside the access VM we would deploy ZFS which > would aggregate block storage into a single filesystem. And this filesystem > would be served via NFS/CIFS from the very same VM. > >> > >> > >> Any advices and insights highly appreciated > >> > >> > >> Cheers, > >> > >> Prema > >> > >> ___ > >> ceph-users mailing list > >> ceph-users@lists.ceph.com > >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > > ___ > > ceph-users mailing list > > ceph-users@lists.ceph.com > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
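For completeness, the knob Dan refers to is a per-dataset ZFS property (the dataset name below is an example). Disabling it acknowledges sync writes before they are on stable storage, which is exactly the risk described above:

```shell
zfs get sync tank/nas          # "standard" honors sync writes (the default)
zfs set sync=disabled tank/nas # fast, but unsafe on power loss
zfs set sync=standard tank/nas # revert to the safe default
```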
Re: [ceph-users] Ceph or Gluster for implementing big NAS
Hi! ZFS won't play nice on ceph. Best would be to mount CephFS directly with the ceph-fuse driver on the endpoint. If you definitely want to put a storage gateway between the data and the compute nodes, then go with nfs-ganesha which can export CephFS directly without local ("proxy") mount. I had such a setup with nfs and switched to mount CephFS directly. If using NFS with the same data, you must make sure your HA works well to avoid data corruption. With ceph-fuse you directly connect to the cluster, one component less that breaks. Kevin Am Mo., 12. Nov. 2018 um 12:44 Uhr schrieb Premysl Kouril < premysl.kou...@gmail.com>: > Hi, > > We are planning to build NAS solution which will be primarily used via NFS > and CIFS and workloads ranging from various archival application to more > “real-time processing”. The NAS will not be used as a block storage for > virtual machines, so the access really will always be file oriented. > > We are considering primarily two designs and I’d like to kindly ask for > any thoughts, views, insights, experiences. > > Both designs utilize “distributed storage software at some level”. Both > designs would be built from commodity servers and should scale as we grow. > Both designs involve virtualization for instantiating "access virtual > machines" which will be serving the NFS and CIFS protocol - so in this > sense the access layer is decoupled from the data layer itself. > > First design is based on a distributed filesystem like Gluster or CephFS. > We would deploy this software on those commodity servers and mount the > resultant filesystem on the “access virtual machines” and they would be > serving the mounted filesystem via NFS/CIFS. > > Second design is based on distributed block storage using CEPH. So we > would build distributed block storage on those commodity servers, and then, > via virtualization (like OpenStack Cinder) we would allocate the block > storage into the access VM. 
Inside the access VM we would deploy ZFS which > would aggregate block storage into a single filesystem. And this filesystem > would be served via NFS/CIFS from the very same VM. > > > Any advices and insights highly appreciated > > > Cheers, > > Prema > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
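For the nfs-ganesha route mentioned in this thread, a minimal export block using the Ceph FSAL looks roughly like the sketch below (export ID, paths, and squash settings are illustrative, not a recommendation). With Name = CEPH, ganesha talks to the cluster through libcephfs directly, which is exactly the "no local proxy mount" property described above:

```
# minimal nfs-ganesha export sketch for CephFS (FSAL_CEPH);
# IDs and paths are examples, adjust to your cluster
EXPORT {
    Export_ID = 1;
    Path = "/";                # CephFS path to export
    Pseudo = "/cephfs";        # NFSv4 pseudo-root
    Access_Type = RW;
    Protocols = 4;
    Transports = TCP;
    Squash = No_Root_Squash;
    FSAL {
        Name = CEPH;           # libcephfs-backed, no local mount needed
    }
}
```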
Re: [ceph-users] ceph 12.2.9 release
Am Mi., 7. Nov. 2018 um 16:40 Uhr schrieb Gregory Farnum : > On Wed, Nov 7, 2018 at 5:58 AM Simon Ironside > wrote: > >> >> >> On 07/11/2018 10:59, Konstantin Shalygin wrote: >> >> I wonder if there is any release announcement for ceph 12.2.9 that I >> missed. >> >> I just found the new packages on download.ceph.com, is this an >> official >> >> release? >> > >> > This is because 12.2.9 have a several bugs. You should avoid to use >> this >> > release and wait for 12.2.10 >> >> Argh! What's it doing in the repos then?? I've just upgraded to it! >> What are the bugs? Is there a thread about them? > > > If you’ve already upgraded and have no issues then you won’t have any > trouble going forward — except perhaps on the next upgrade, if you do it > while the cluster is unhealthy. > > I agree that it’s annoying when these issues make it out. We’ve had > ongoing discussions to try and improve the release process so it’s less > drawn-out and to prevent these upgrade issues from making it through > testing, but nobody has resolved it yet. If anybody has experience working > with deb repositories and handling releases, the Ceph upstream could use > some help... ;) > -Greg > >> >> We solve this problem by hosting two repos: one for staging and QA, and one for production. Every release gets to staging (for example directly after building an SCM tag). If QA passes, the staging repo is turned into the prod one. Using symlinks, it would be possible to switch back if problems occur. Example: https://incoming.debian.org/ Currently I would be unable to deploy new nodes if I use the official mirrors, as apt is unable to use older versions (which does work on yum/dnf). That's why we are implementing "mirror-sync" / rsync with a copy of the repo and the desired packages until such a solution is available. 
Kevin >> Simon >> ___ >> ceph-users mailing list >> ceph-users@lists.ceph.com >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >> > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
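The staging/production promotion scheme described above can be sketched with nothing more than symlinks. The directory names below are examples, not an official Ceph layout; the point is that promotion and rollback are single atomic symlink swaps:

```shell
set -eu
# sketch of a staging/production repo scheme; names are illustrative
base=$(mktemp -d)
mkdir -p "$base/pool-12.2.8" "$base/pool-12.2.9"

ln -sfn "$base/pool-12.2.9" "$base/staging"      # new builds land here
ln -sfn "$base/pool-12.2.8" "$base/production"   # clients only see this

# QA passed: promote staging to production with one symlink swap
ln -sfn "$(readlink "$base/staging")" "$base/production"
promoted=$(readlink "$base/production")
echo "production -> $promoted"

# a bad release surfaces later: roll back by re-pointing the symlink
ln -sfn "$base/pool-12.2.8" "$base/production"
rolled_back=$(readlink "$base/production")
echo "production -> $rolled_back"
```

Clients keep pointing at the "production" path the whole time, so no sources.list change is needed on either promotion or rollback.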
Re: [ceph-users] Packages for debian in Ceph repo
Am Mi., 7. Nov. 2018 um 07:40 Uhr schrieb Nicolas Huillard < nhuill...@dolomede.fr>: > > > It lists rbd but still fails with the exact same error. > > I stumbled upon the exact same error, and since there was no answer > anywhere, I figured it was a very simple problem: don't forget to > install the qemu-block-extra package (Debian stretch) along with qemu- > utils which contains the qemu-img command. > This command is actually compiled with rbd support (hence the output > above), but need this extra package to pull actual support-code and > dependencies... > I have not been able to test this yet but this package was indeed missing on my system! Thank you for this hint! > -- > Nicolas Huillard > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
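Nicolas' hint can be turned into a quick pre-flight check: confirm that the rbd protocol driver is actually loadable before attempting a convert. On Debian stretch the driver ships in qemu-block-extra rather than in qemu-utils itself; the sketch below only inspects the help output and is guarded so it degrades gracefully where qemu-img is absent:

```shell
# check whether this qemu-img can speak the rbd protocol
if ! command -v qemu-img >/dev/null 2>&1; then
    msg="qemu-img not installed on this machine"
elif qemu-img --help | grep -qw rbd; then
    msg="rbd listed in supported formats"
else
    msg="rbd driver missing; try: apt-get install qemu-block-extra"
fi
echo "$msg"
```

Note that (as this thread shows) rbd appearing in the format list does not by itself guarantee the protocol driver is installed, so a failing convert after a positive check still points at the extra package.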
Re: [ceph-users] ceph-deploy osd creation failed with multipath and dmcrypt
I ran into the same problem. I had to create a GPT table on each disk, create a first partition spanning the full disk, and then feed these to ceph-volume (should be similar for ceph-deploy). Also, I am not sure you can combine fs-type btrfs with bluestore (afaik that option is for filestore). Kevin Am Di., 6. Nov. 2018 um 14:41 Uhr schrieb Pavan, Krish < krish.pa...@nuance.com>: > Trying to created OSD with multipath with dmcrypt and it failed . Any > suggestion please?. > > ceph-deploy --overwrite-conf osd create ceph-store1:/dev/mapper/mpathr > --bluestore --dmcrypt -- failed > > ceph-deploy --overwrite-conf osd create ceph-store1:/dev/mapper/mpathr > --bluestore – worked > > > > the logs for fail > > [ceph-store12][WARNIN] command: Running command: /usr/sbin/restorecon -R > /var/lib/ceph/osd-lockbox/e15f1adc-feff-4890-a617-adc473e7331e/magic.68428.tmp > > [ceph-store12][WARNIN] command: Running command: /usr/bin/chown -R > ceph:ceph > /var/lib/ceph/osd-lockbox/e15f1adc-feff-4890-a617-adc473e7331e/magic.68428.tmp > > [ceph-store12][WARNIN] Traceback (most recent call last): > > [ceph-store12][WARNIN] File "/usr/sbin/ceph-disk", line 9, in <module> > > [ceph-store12][WARNIN] load_entry_point('ceph-disk==1.0.0', > 'console_scripts', 'ceph-disk')() > > [ceph-store12][WARNIN] File > "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 5736, in run > > [ceph-store12][WARNIN] main(sys.argv[1:]) > > [ceph-store12][WARNIN] File > "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 5687, in main > > [ceph-store12][WARNIN] args.func(args) > > [ceph-store12][WARNIN] File > "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 2108, in main > > [ceph-store12][WARNIN] Prepare.factory(args).prepare() > > [ceph-store12][WARNIN] File > "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 2097, in prepare > > [ceph-store12][WARNIN] self._prepare() > > [ceph-store12][WARNIN] File > "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 2171, in _prepare > > [ceph-store12][WARNIN] 
self.lockbox.prepare() > > [ceph-store12][WARNIN] File > "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 2861, in prepare > > [ceph-store12][WARNIN] self.populate() > > [ceph-store12][WARNIN] File > "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 2818, in populate > > [ceph-store12][WARNIN] get_partition_base(self.partition.get_dev()), > > [ceph-store12][WARNIN] File > "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 844, in > get_partition_base > > [ceph-store12][WARNIN] raise Error('not a partition', dev) > > [ceph-store12][WARNIN] ceph_disk.main.Error: Error: not a partition: > /dev/dm-215 > > [ceph-store12][ERROR ] RuntimeError: command returned non-zero exit > status: 1 > > [ceph_deploy.osd][ERROR ] Failed to execute command: /usr/sbin/ceph-disk > -v prepare --dmcrypt --dmcrypt-key-dir /etc/ceph/dmcrypt-keys --bluestore > --cluster ceph --fs-type btrfs -- /dev/mapper/mpathr > > [ceph_deploy][ERROR ] GenericError: Failed to create 1 OSDs > > > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Packages for debian in Ceph repo
Hi! Proxmox has support for rbd as they ship additional packages as well as ceph via their own repo. I ran your command and got this: > qemu-img version 2.8.1(Debian 1:2.8+dfsg-6+deb9u4) > Copyright (c) 2003-2016 Fabrice Bellard and the QEMU Project developers > Supported formats: blkdebug blkreplay blkverify bochs cloop dmg file ftp > ftps gluster host_cdrom host_device http https iscsi iser luks nbd nfs > null-aio null-co parallels qcow qcow2 qed quorum raw rbd replication > sheepdog ssh vdi vhdx vmdk vpc vvfat It lists rbd but still fails with the exact same error. Kevin Am Di., 30. Okt. 2018 um 17:14 Uhr schrieb David Turner < drakonst...@gmail.com>: > What version of qemu-img are you using? I found [1] this when poking > around on my qemu server when checking for rbd support. This version (note > it's proxmox) has rbd listed as a supported format. > > [1] > # qemu-img -V; qemu-img --help|grep rbd > qemu-img version 2.11.2pve-qemu-kvm_2.11.2-1 > Copyright (c) 2003-2017 Fabrice Bellard and the QEMU Project developers > Supported formats: blkdebug blkreplay blkverify bochs cloop dmg file ftp > ftps gluster host_cdrom host_device http https iscsi iser luks nbd null-aio > null-co parallels qcow qcow2 qed quorum raw rbd replication sheepdog > throttle vdi vhdx vmdk vpc vvfat zeroinit > On Tue, Oct 30, 2018 at 12:08 PM Kevin Olbrich wrote: > >> Is it possible to use qemu-img with rbd support on Debian Stretch? >> I am on Luminous and try to connect my image-buildserver to load images >> into a ceph pool. >> >> root@buildserver:~# qemu-img convert -p -O raw /target/test-vm.qcow2 >>> rbd:rbd_vms_ssd_01/test_vm >>> qemu-img: Unknown protocol 'rbd' >> >> >> Kevin >> >> Am Mo., 3. Sep. 2018 um 12:07 Uhr schrieb Abhishek Lekshmanan < >> abhis...@suse.com>: >> >>> arad...@tma-0.net writes: >>> >>> > Can anyone confirm if the Ceph repos for Debian/Ubuntu contain >>> packages for >>> > Debian? I'm not seeing any, but maybe I'm missing something... 
>>> > >>> > I'm seeing ceph-deploy install an older version of ceph on the nodes >>> (from the >>> > Debian repo) and then failing when I run "ceph-deploy osd ..." because >>> ceph- >>> > volume doesn't exist on the nodes. >>> > >>> The newer versions of Ceph (from mimic onwards) requires compiler >>> toolchains supporting c++17 which we unfortunately do not have for >>> stretch/jessie yet. >>> >>> - >>> Abhishek >>> ___ >>> ceph-users mailing list >>> ceph-users@lists.ceph.com >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >>> >> ___ >> ceph-users mailing list >> ceph-users@lists.ceph.com >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >> > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Packages for debian in Ceph repo
Is it possible to use qemu-img with rbd support on Debian Stretch? I am on Luminous and try to connect my image-buildserver to load images into a ceph pool. root@buildserver:~# qemu-img convert -p -O raw /target/test-vm.qcow2 > rbd:rbd_vms_ssd_01/test_vm > qemu-img: Unknown protocol 'rbd' Kevin Am Mo., 3. Sep. 2018 um 12:07 Uhr schrieb Abhishek Lekshmanan < abhis...@suse.com>: > arad...@tma-0.net writes: > > > Can anyone confirm if the Ceph repos for Debian/Ubuntu contain packages > for > > Debian? I'm not seeing any, but maybe I'm missing something... > > > > I'm seeing ceph-deploy install an older version of ceph on the nodes > (from the > > Debian repo) and then failing when I run "ceph-deploy osd ..." because > ceph- > > volume doesn't exist on the nodes. > > > The newer versions of Ceph (from mimic onwards) requires compiler > toolchains supporting c++17 which we unfortunately do not have for > stretch/jessie yet. > > - > Abhishek > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Command to check last change to rbd image?
Hi! Is there an easy way to check when an image was last modified? I want to make sure that the images I want to clean up have not been used for a long time. Kind regards Kevin ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
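One workaround worth noting: releases of this era keep no per-image mtime, but every RADOS data object backing the image carries one, so the newest object mtime approximates the last write. The cluster-side commands below are a hedged sketch (pool and image names are made up, and they are shown as comments, not run); the "take the newest timestamp" step is demonstrated on plain files:

```shell
set -eu
# On a real cluster, roughly (names are examples, not run here):
#   prefix=$(rbd info rbd_pool/my_image | sed -n 's/.*block_name_prefix: //p')
#   rados -p rbd_pool ls | grep "$prefix" | while read o; do
#       rados -p rbd_pool stat "$o"      # prints each object's mtime
#   done
# The newest mtime among those objects is the last modification.
# Same selection logic, demonstrated on temp files:
d=$(mktemp -d)
touch -d '2019-01-01 00:00' "$d/obj.0"
touch -d '2019-06-15 12:00' "$d/obj.1"
newest=$(ls -t "$d" | head -n 1)
echo "most recently modified: $newest"
rm -rf "$d"
```

Statting every object is slow on large images, so for bulk cleanup it may be cheaper to sample a few objects per image first.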
Re: [ceph-users] nfs-ganesha version in Ceph repos
I had a similar problem: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-September/029698.html But even the recent 2.6.x releases were not working well for me (many, many segfaults). I am on the master-branch (2.7.x) and that works well with fewer crashes. Cluster is 13.2.1/.2 with nfs-ganesha as a standalone VM. Kevin Am Di., 9. Okt. 2018 um 19:39 Uhr schrieb Erik McCormick < emccorm...@cirrusseven.com>: > On Tue, Oct 9, 2018 at 1:27 PM Erik McCormick > wrote: > > > > Hello, > > > > I'm trying to set up an nfs-ganesha server with the Ceph FSAL, and > > running into difficulties getting the current stable release running. > > The versions in the Luminous repo is stuck at 2.6.1, whereas the > > current stable version is 2.6.3. I've seen a couple of HA issues in > > pre 2.6.3 versions that I'd like to avoid. > > > > I should have been more specific that the ones I am looking for are for > Centos 7 > > > I've also been attempting to build my own from source, but banging my > > head against a wall as far as dependencies and config options are > > concerned. > > > > If anyone reading this has the ability to kick off a fresh build of > > the V2.6-stable branch with all the knobs turned properly for Ceph, or > > can point me to a set of cmake configs and scripts that might help me > > do it myself, I would be eternally grateful. > > > > Thanks, > > Erik > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Fastest way to find raw device from OSD-ID? (osd -> lvm lv -> lvm pv -> disk)
Hi Jakub, "ceph osd metadata X" this is perfect! This also lists multipath devices which I was looking for! Kevin Am Mo., 8. Okt. 2018 um 21:16 Uhr schrieb Jakub Jaszewski < jaszewski.ja...@gmail.com>: > Hi Kevin, > Have you tried ceph osd metadata OSDid ? > > Jakub > > pon., 8 paź 2018, 19:32 użytkownik Alfredo Deza > napisał: > >> On Mon, Oct 8, 2018 at 6:09 AM Kevin Olbrich wrote: >> > >> > Hi! >> > >> > Yes, thank you. At least on one node this works, the other node just >> freezes but this might by caused by a bad disk that I try to find. >> >> If it is freezing, you could maybe try running the command where it >> freezes? (ceph-volume will log it to the terminal) >> >> >> > >> > Kevin >> > >> > Am Mo., 8. Okt. 2018 um 12:07 Uhr schrieb Wido den Hollander < >> w...@42on.com>: >> >> >> >> Hi, >> >> >> >> $ ceph-volume lvm list >> >> >> >> Does that work for you? >> >> >> >> Wido >> >> >> >> On 10/08/2018 12:01 PM, Kevin Olbrich wrote: >> >> > Hi! >> >> > >> >> > Is there an easy way to find raw disks (eg. sdd/sdd1) by OSD id? >> >> > Before I migrated from filestore with simple-mode to bluestore with >> lvm, >> >> > I was able to find the raw disk with "df". >> >> > Now, I need to go from LVM LV to PV to disk every time I need to >> >> > check/smartctl a disk. >> >> > >> >> > Kevin >> >> > >> >> > >> >> > ___ >> >> > ceph-users mailing list >> >> > ceph-users@lists.ceph.com >> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >> >> > >> > >> > ___ >> > ceph-users mailing list >> > ceph-users@lists.ceph.com >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >> ___ >> ceph-users mailing list >> ceph-users@lists.ceph.com >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >> > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
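Jakub's suggestion returns JSON whose "devices" key names the raw disk(s) backing the OSD, which skips the LV-to-PV-to-disk walk entirely. The sketch below extracts that field; the JSON document shown is a trimmed, illustrative sample rather than real cluster output, but the extraction works the same either way:

```shell
set -eu
# trimmed, illustrative sample of "ceph osd metadata <id>" output
json='{"id": 29, "devices": "sda", "bluestore_bdev_type": "hdd"}'
# on a real host, the same extraction would be:
#   ceph osd metadata 29 | sed -n 's/.*"devices": "\([^"]*\)".*/\1/p'
device=$(printf '%s' "$json" | sed -n 's/.*"devices": "\([^"]*\)".*/\1/p')
echo "osd.29 lives on /dev/$device"
```

On hosts with jq installed, `ceph osd metadata 29 | jq -r .devices` is the cleaner equivalent; the field can be comma-separated when an OSD spans multiple devices.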
Re: [ceph-users] Fastest way to find raw device from OSD-ID? (osd -> lvm lv -> lvm pv -> disk)
Hi! Yes, thank you. At least on one node this works; the other node just freezes, but this might be caused by a bad disk that I am trying to find. Kevin Am Mo., 8. Okt. 2018 um 12:07 Uhr schrieb Wido den Hollander : > Hi, > > $ ceph-volume lvm list > > Does that work for you? > > Wido > > On 10/08/2018 12:01 PM, Kevin Olbrich wrote: > > Hi! > > > > Is there an easy way to find raw disks (eg. sdd/sdd1) by OSD id? > > Before I migrated from filestore with simple-mode to bluestore with lvm, > > I was able to find the raw disk with "df". > > Now, I need to go from LVM LV to PV to disk every time I need to > > check/smartctl a disk. > > > > Kevin > > > > > > ___ > > ceph-users mailing list > > ceph-users@lists.ceph.com > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Fastest way to find raw device from OSD-ID? (osd -> lvm lv -> lvm pv -> disk)
Hi! Is there an easy way to find raw disks (eg. sdd/sdd1) by OSD id? Before I migrated from filestore with simple-mode to bluestore with lvm, I was able to find the raw disk with "df". Now, I need to go from LVM LV to PV to disk every time I need to check/smartctl a disk. Kevin ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] After 13.2.2 upgrade: bluefs mount failed to replay log: (5) Input/output error
nt: (5) Input/output error 2018-10-08 10:32:17.434 7f6af518e1c0 20 bdev aio_wait 0x55a3a1edb8c0 done 2018-10-08 10:32:17.434 7f6af518e1c0 1 bdev(0x55a3a1d62a80 /var/lib/ceph/osd/ceph-46/block) close 2018-10-08 10:32:17.434 7f6af518e1c0 10 bdev(0x55a3a1d62a80 /var/lib/ceph/osd/ceph-46/block) _aio_stop 2018-10-08 10:32:17.568 7f6add7d3700 10 bdev(0x55a3a1d62a80 /var/lib/ceph/osd/ceph-46/block) _aio_thread end 2018-10-08 10:32:17.573 7f6af518e1c0 10 bdev(0x55a3a1d62a80 /var/lib/ceph/osd/ceph-46/block) _discard_stop 2018-10-08 10:32:17.573 7f6adcfd2700 20 bdev(0x55a3a1d62a80 /var/lib/ceph/osd/ceph-46/block) _discard_thread wake 2018-10-08 10:32:17.573 7f6adcfd2700 10 bdev(0x55a3a1d62a80 /var/lib/ceph/osd/ceph-46/block) _discard_thread finish 2018-10-08 10:32:17.573 7f6af518e1c0 10 bdev(0x55a3a1d62a80 /var/lib/ceph/osd/ceph-46/block) _discard_stop stopped 2018-10-08 10:32:17.573 7f6af518e1c0 1 bdev(0x55a3a1d62000 /var/lib/ceph/osd/ceph-46/block) close 2018-10-08 10:32:17.573 7f6af518e1c0 10 bdev(0x55a3a1d62000 /var/lib/ceph/osd/ceph-46/block) _aio_stop 2018-10-08 10:32:17.817 7f6ade7d5700 10 bdev(0x55a3a1d62000 /var/lib/ceph/osd/ceph-46/block) _aio_thread end 2018-10-08 10:32:17.822 7f6af518e1c0 10 bdev(0x55a3a1d62000 /var/lib/ceph/osd/ceph-46/block) _discard_stop 2018-10-08 10:32:17.822 7f6addfd4700 20 bdev(0x55a3a1d62000 /var/lib/ceph/osd/ceph-46/block) _discard_thread wake 2018-10-08 10:32:17.822 7f6addfd4700 10 bdev(0x55a3a1d62000 /var/lib/ceph/osd/ceph-46/block) _discard_thread finish 2018-10-08 10:32:17.822 7f6af518e1c0 10 bdev(0x55a3a1d62000 /var/lib/ceph/osd/ceph-46/block) _discard_stop stopped 2018-10-08 10:32:17.823 7f6af518e1c0 -1 osd.46 0 OSD:init: unable to mount object store 2018-10-08 10:32:17.823 7f6af518e1c0 -1 ** ERROR: osd init failed: (5) Input/output error Anything interesting here? I will try to export the down PGs from the disks. I got a bunch of new disks to replace all. Most of current disks are of same age. Kevin Am Mi., 3. Okt. 
2018 um 13:52 Uhr schrieb Paul Emmerich < paul.emmer...@croit.io>: > There's "ceph-bluestore-tool repair/fsck" > > In your scenario, a few more log files would be interesting: try > setting debug bluefs to 20/20. And if that's not enough log try also > setting debug osd, debug bluestore, and debug bdev to 20/20. > > > > Paul > Am Mi., 3. Okt. 2018 um 13:48 Uhr schrieb Kevin Olbrich : > > > > The disks were deployed with ceph-deploy / ceph-volume using the default > style (lvm) and not simple-mode. > > > > The disks were provisioned as a whole, no resizing. I never touched the > disks after deployment. > > > > It is very strange that this first happened after the update, never met > such an error before. > > > > I found a BUG in the tracker, that also shows such an error with count > 0. That was closed with „can’t reproduce“ (don’t have the link ready). For > me this seems like the data itself is fine and I just hit a bad transaction > in the replay (which maybe caused the crash in the first place). > > > > I need one of three disks back. Object corruption would not be a problem > (regarding drop of a journal), as this cluster hosts backups which will > fail validation and regenerate. Just marking the OSD lost does not seem to > be an option. > > > > Is there some sort of fsck for BlueFS? > > > > Kevin > > > > > > Igor Fedotov schrieb am Mi. 3. Okt. 2018 um 13:01: > >> > >> I've seen somewhat similar behavior in a log from Sergey Malinin in > another thread ("mimic: 3/4 OSDs crashed...") > >> > >> He claimed it happened after LVM volume expansion. Isn't this the case > for you? > >> > >> Am I right that you use LVM volumes? > >> > >> > >> On 10/3/2018 11:22 AM, Kevin Olbrich wrote: > >> > >> Small addition: the failing disks are in the same host. > >> This is a two-host, failure-domain OSD cluster. > >> > >> > >> Am Mi., 3. Okt. 2018 um 10:13 Uhr schrieb Kevin Olbrich : > >>> > >>> Hi! 
> >>> > >>> Yesterday one of our (non-priority) clusters failed when 3 OSDs went > down (EC 8+2) together. > >>> This is strange as we did an upgrade from 13.2.1 to 13.2.2 one or two > hours before. > >>> They failed exactly at the same moment, rendering the cluster unusable > (CephFS). > >>> We are using CentOS 7 with latest updates and ceph repo. No cache > SSDs, no external journal / wal / db. > >>> > >>> OSD 29 (no disk failure in dmesg): > >>> 2018-10-03 09:47:15.074 7fb8835ce1c0 0 set uid:gid to 167:167 > (ceph:ceph) > >>> 2018-10-03 09:47:15.074 7fb8835ce1c0 0 ceph version 13.2.2 > (02899bfda8141
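Putting Paul's ceph-bluestore-tool suggestion together with the plan to export the down PGs, the recovery attempt looks roughly like the sketch below. The OSD id and PG id are examples taken from this thread's context, the ceph-* tools only exist on a Ceph node, and the commands are skipped when the tools are absent:

```shell
# recovery sketch: fsck the BlueStore volume, then export a still-readable
# PG so it can be injected into a fresh OSD; ids/paths are examples
OSD=46
PG=2.1fs0   # EC pool PG id with shard suffix, purely illustrative
if command -v ceph-bluestore-tool >/dev/null 2>&1; then
    ceph-bluestore-tool fsck --path "/var/lib/ceph/osd/ceph-$OSD"
    ceph-objectstore-tool --data-path "/var/lib/ceph/osd/ceph-$OSD" \
        --op export --pgid "$PG" --file "/root/$PG.export"
    status=ran
else
    status="skipped (ceph tools not installed here)"
fi
echo "fsck/export: $status"
```

The OSD daemon must be stopped before either tool touches its data path, and an export taken from a corrupt store should be verified (the backup-validation workflow mentioned above) before the source OSD is marked lost.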
Re: [ceph-users] After 13.2.2 upgrade: bluefs mount failed to replay log: (5) Input/output error
The disks were deployed with ceph-deploy / ceph-volume using the default style (lvm) and not simple-mode. The disks were provisioned as a whole, no resizing. I never touched the disks after deployment. It is very strange that this first happened after the update, never met such an error before. I found a BUG in the tracker, that also shows such an error with count 0. That was closed with „can’t reproduce“ (don’t have the link ready). For me this seems like the data itself is fine and I just hit a bad transaction in the replay (which maybe caused the crash in the first place). I need one of three disks back. Object corruption would not be a problem (regarding drop of a journal), as this cluster hosts backups which will fail validation and regenerate. Just marking the OSD lost does not seem to be an option. Is there some sort of fsck for BlueFS? Kevin Igor Fedotov schrieb am Mi. 3. Okt. 2018 um 13:01: > I've seen somewhat similar behavior in a log from Sergey Malinin in > another thread ("mimic: 3/4 OSDs crashed...") > > He claimed it happened after LVM volume expansion. Isn't this the case for > you? > > Am I right that you use LVM volumes? > > On 10/3/2018 11:22 AM, Kevin Olbrich wrote: > > Small addition: the failing disks are in the same host. > This is a two-host, failure-domain OSD cluster. > > > Am Mi., 3. Okt. 2018 um 10:13 Uhr schrieb Kevin Olbrich : > >> Hi! >> >> Yesterday one of our (non-priority) clusters failed when 3 OSDs went down >> (EC 8+2) together. >> *This is strange as we did an upgrade from 13.2.1 to 13.2.2 one or two >> hours before.* >> They failed exactly at the same moment, rendering the cluster unusable >> (CephFS). >> We are using CentOS 7 with latest updates and ceph repo. No cache SSDs, >> no external journal / wal / db. 
>> >> *OSD 29 (no disk failure in dmesg):* >> 2018-10-03 09:47:15.074 7fb8835ce1c0 0 set uid:gid to 167:167 (ceph:ceph) >> 2018-10-03 09:47:15.074 7fb8835ce1c0 0 ceph version 13.2.2 >> (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable), process >> ceph-osd, pid 20899 >> 2018-10-03 09:47:15.074 7fb8835ce1c0 0 pidfile_write: ignore empty >> --pid-file >> 2018-10-03 09:47:15.100 7fb8835ce1c0 0 load: jerasure load: lrc load: >> isa >> 2018-10-03 09:47:15.100 7fb8835ce1c0 1 bdev create path >> /var/lib/ceph/osd/ceph-29/block type kernel >> 2018-10-03 09:47:15.100 7fb8835ce1c0 1 bdev(0x561250a2 >> /var/lib/ceph/osd/ceph-29/block) open path /var/lib/ceph/osd/ceph-29/block >> 2018-10-03 09:47:15.100 7fb8835ce1c0 1 bdev(0x561250a2 >> /var/lib/ceph/osd/ceph-29/block) open size 1000198897664 (0xe8e080, 932 >> GiB) block_size 4096 (4 KiB) rotational >> 2018-10-03 09:47:15.101 7fb8835ce1c0 1 >> bluestore(/var/lib/ceph/osd/ceph-29) _set_cache_sizes kv_min_ratio 1 > >> kv_ratio 0.5 >> 2018-10-03 09:47:15.101 7fb8835ce1c0 1 >> bluestore(/var/lib/ceph/osd/ceph-29) _set_cache_sizes cache_size 536870912 >> meta 0 kv 1 data 0 >> 2018-10-03 09:47:15.101 7fb8835ce1c0 1 bdev(0x561250a2 >> /var/lib/ceph/osd/ceph-29/block) close >> 2018-10-03 09:47:15.358 7fb8835ce1c0 1 >> bluestore(/var/lib/ceph/osd/ceph-29) _mount path /var/lib/ceph/osd/ceph-29 >> 2018-10-03 09:47:15.358 7fb8835ce1c0 1 bdev create path >> /var/lib/ceph/osd/ceph-29/block type kernel >> 2018-10-03 09:47:15.358 7fb8835ce1c0 1 bdev(0x561250a2 >> /var/lib/ceph/osd/ceph-29/block) open path /var/lib/ceph/osd/ceph-29/block >> 2018-10-03 09:47:15.359 7fb8835ce1c0 1 bdev(0x561250a2 >> /var/lib/ceph/osd/ceph-29/block) open size 1000198897664 (0xe8e080, 932 >> GiB) block_size 4096 (4 KiB) rotational >> 2018-10-03 09:47:15.360 7fb8835ce1c0 1 >> bluestore(/var/lib/ceph/osd/ceph-29) _set_cache_sizes kv_min_ratio 1 > >> kv_ratio 0.5 >> 2018-10-03 09:47:15.360 7fb8835ce1c0 1 >> bluestore(/var/lib/ceph/osd/ceph-29) _set_cache_sizes 
cache_size 536870912 >> meta 0 kv 1 data 0 >> 2018-10-03 09:47:15.360 7fb8835ce1c0 1 bdev create path >> /var/lib/ceph/osd/ceph-29/block type kernel >> 2018-10-03 09:47:15.360 7fb8835ce1c0 1 bdev(0x561250a20a80 >> /var/lib/ceph/osd/ceph-29/block) open path /var/lib/ceph/osd/ceph-29/block >> 2018-10-03 09:47:15.360 7fb8835ce1c0 1 bdev(0x561250a20a80 >> /var/lib/ceph/osd/ceph-29/block) open size 1000198897664 (0xe8e080, 932 >> GiB) block_size 4096 (4 KiB) rotational >> 2018-10-03 09:47:15.360 7fb8835ce1c0 1 bluefs add_block_device bdev 1 >> path /var/lib/ceph/osd/ceph-29/block size 932 GiB >> 2018-10-03 09:47:15.360 7fb8835ce1c0 1 bluefs mount >> 2018-10-03 09:47:15.538 7fb8835ce1c0 -1 bluefs _replay file wi
Re: [ceph-users] After 13.2.2 upgrade: bluefs mount failed to replay log: (5) Input/output error
Small addition: the failing disks are in the same host. This is a two-host, failure-domain OSD cluster. Am Mi., 3. Okt. 2018 um 10:13 Uhr schrieb Kevin Olbrich : > Hi! > > Yesterday one of our (non-priority) clusters failed when 3 OSDs went down > (EC 8+2) together. > *This is strange as we did an upgrade from 13.2.1 to 13.2.2 one or two > hours before.* > They failed exactly at the same moment, rendering the cluster unusable > (CephFS). > We are using CentOS 7 with latest updates and ceph repo. No cache SSDs, no > external journal / wal / db. > > *OSD 29 (no disk failure in dmesg):* > 2018-10-03 09:47:15.074 7fb8835ce1c0 0 set uid:gid to 167:167 (ceph:ceph) > 2018-10-03 09:47:15.074 7fb8835ce1c0 0 ceph version 13.2.2 > (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable), process > ceph-osd, pid 20899 > 2018-10-03 09:47:15.074 7fb8835ce1c0 0 pidfile_write: ignore empty > --pid-file > 2018-10-03 09:47:15.100 7fb8835ce1c0 0 load: jerasure load: lrc load: isa > 2018-10-03 09:47:15.100 7fb8835ce1c0 1 bdev create path > /var/lib/ceph/osd/ceph-29/block type kernel > 2018-10-03 09:47:15.100 7fb8835ce1c0 1 bdev(0x561250a2 > /var/lib/ceph/osd/ceph-29/block) open path /var/lib/ceph/osd/ceph-29/block > 2018-10-03 09:47:15.100 7fb8835ce1c0 1 bdev(0x561250a2 > /var/lib/ceph/osd/ceph-29/block) open size 1000198897664 (0xe8e080, 932 > GiB) block_size 4096 (4 KiB) rotational > 2018-10-03 09:47:15.101 7fb8835ce1c0 1 > bluestore(/var/lib/ceph/osd/ceph-29) _set_cache_sizes kv_min_ratio 1 > > kv_ratio 0.5 > 2018-10-03 09:47:15.101 7fb8835ce1c0 1 > bluestore(/var/lib/ceph/osd/ceph-29) _set_cache_sizes cache_size 536870912 > meta 0 kv 1 data 0 > 2018-10-03 09:47:15.101 7fb8835ce1c0 1 bdev(0x561250a2 > /var/lib/ceph/osd/ceph-29/block) close > 2018-10-03 09:47:15.358 7fb8835ce1c0 1 > bluestore(/var/lib/ceph/osd/ceph-29) _mount path /var/lib/ceph/osd/ceph-29 > 2018-10-03 09:47:15.358 7fb8835ce1c0 1 bdev create path > /var/lib/ceph/osd/ceph-29/block type kernel > 2018-10-03 
09:47:15.358 7fb8835ce1c0 1 bdev(0x561250a2 > /var/lib/ceph/osd/ceph-29/block) open path /var/lib/ceph/osd/ceph-29/block > 2018-10-03 09:47:15.359 7fb8835ce1c0 1 bdev(0x561250a2 > /var/lib/ceph/osd/ceph-29/block) open size 1000198897664 (0xe8e080, 932 > GiB) block_size 4096 (4 KiB) rotational > 2018-10-03 09:47:15.360 7fb8835ce1c0 1 > bluestore(/var/lib/ceph/osd/ceph-29) _set_cache_sizes kv_min_ratio 1 > > kv_ratio 0.5 > 2018-10-03 09:47:15.360 7fb8835ce1c0 1 > bluestore(/var/lib/ceph/osd/ceph-29) _set_cache_sizes cache_size 536870912 > meta 0 kv 1 data 0 > 2018-10-03 09:47:15.360 7fb8835ce1c0 1 bdev create path > /var/lib/ceph/osd/ceph-29/block type kernel > 2018-10-03 09:47:15.360 7fb8835ce1c0 1 bdev(0x561250a20a80 > /var/lib/ceph/osd/ceph-29/block) open path /var/lib/ceph/osd/ceph-29/block > 2018-10-03 09:47:15.360 7fb8835ce1c0 1 bdev(0x561250a20a80 > /var/lib/ceph/osd/ceph-29/block) open size 1000198897664 (0xe8e080, 932 > GiB) block_size 4096 (4 KiB) rotational > 2018-10-03 09:47:15.360 7fb8835ce1c0 1 bluefs add_block_device bdev 1 > path /var/lib/ceph/osd/ceph-29/block size 932 GiB > 2018-10-03 09:47:15.360 7fb8835ce1c0 1 bluefs mount > 2018-10-03 09:47:15.538 7fb8835ce1c0 -1 bluefs _replay file with link > count 0: file(ino 519 size 0x31e2f42 mtime 2018-10-02 12:24:22.632397 bdev > 1 allocated 320 extents > 
[1:0x700820+10,1:0x700900+10,1:0x700910+10,1:0x700920+10,1:0x700930+10,1:0x700940+10,1:0x700950+10,1:0x700960+10,1:0x700970+10,1:0x700980+10,1:0x700990+10,1:0x7009a0+10,1:0x7009b0+10,1:0x7009c0+10,1:0x7009d0+10,1:0x7009e0+10,1:0x7009f0+10,1:0x700a00+10,1:0x700a10+10,1:0x700a20+10,1:0x700a30+10,1:0x700a40+10,1:0x700a50+10,1:0x700a60+10,1:0x700a70+10,1:0x700a80+10,1:0x700a90+10,1:0x700aa0+10,1:0x700ab0+10,1:0x700ac0+10,1:0x700ad0+10,1:0x700ae0+10,1:0x700af0+10,1:0x700b00+10,1:0x700b10+10,1:0x700b20+10,1:0x700b30+10,1:0x700b40+10,1:0x700b50+10,1:0x700b60+10,1:0x700b70+10,1:0x700b80+10,1:0x700b90+10,1:0x700ba0+10,1:0x700bb0+10,1:0x700bc0+10,1:0x700bd0+10,1:0x700be0+10,1:0x700bf0+10,1:0x700c00+10]) > 2018-10-03 09:47:15.538 7fb8835ce1c0 -1 bluefs mount failed to replay log: > (5) Input/output error > 2018-10-03 09:47:15.538 7fb8835ce1c0 1 stupidalloc 0x0x561250b8d030 > shutdown > 2018-10-03 09:47:15.538 7fb8835ce1c0 -1 > bluestore(/var/lib/ceph/osd/ceph-29) _open_db failed bluefs mount: (
[ceph-users] After 13.2.2 upgrade: bluefs mount failed to replay log: (5) Input/output error
Hi! Yesterday one of our (non-priority) clusters failed when 3 OSDs went down (EC 8+2) together. *This is strange as we did an upgrade from 13.2.1 to 13.2.2 one or two hours before.* They failed at exactly the same moment, rendering the cluster unusable (CephFS). We are using CentOS 7 with latest updates and the ceph repo. No cache SSDs, no external journal / WAL / DB.

*OSD 29 (no disk failure in dmesg):*

2018-10-03 09:47:15.074 7fb8835ce1c0 0 set uid:gid to 167:167 (ceph:ceph)
2018-10-03 09:47:15.074 7fb8835ce1c0 0 ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable), process ceph-osd, pid 20899
2018-10-03 09:47:15.074 7fb8835ce1c0 0 pidfile_write: ignore empty --pid-file
2018-10-03 09:47:15.100 7fb8835ce1c0 0 load: jerasure load: lrc load: isa
2018-10-03 09:47:15.100 7fb8835ce1c0 1 bdev create path /var/lib/ceph/osd/ceph-29/block type kernel
2018-10-03 09:47:15.100 7fb8835ce1c0 1 bdev(0x561250a2 /var/lib/ceph/osd/ceph-29/block) open path /var/lib/ceph/osd/ceph-29/block
2018-10-03 09:47:15.100 7fb8835ce1c0 1 bdev(0x561250a2 /var/lib/ceph/osd/ceph-29/block) open size 1000198897664 (0xe8e080, 932 GiB) block_size 4096 (4 KiB) rotational
2018-10-03 09:47:15.101 7fb8835ce1c0 1 bluestore(/var/lib/ceph/osd/ceph-29) _set_cache_sizes kv_min_ratio 1 > kv_ratio 0.5
2018-10-03 09:47:15.101 7fb8835ce1c0 1 bluestore(/var/lib/ceph/osd/ceph-29) _set_cache_sizes cache_size 536870912 meta 0 kv 1 data 0
2018-10-03 09:47:15.101 7fb8835ce1c0 1 bdev(0x561250a2 /var/lib/ceph/osd/ceph-29/block) close
2018-10-03 09:47:15.358 7fb8835ce1c0 1 bluestore(/var/lib/ceph/osd/ceph-29) _mount path /var/lib/ceph/osd/ceph-29
2018-10-03 09:47:15.358 7fb8835ce1c0 1 bdev create path /var/lib/ceph/osd/ceph-29/block type kernel
2018-10-03 09:47:15.358 7fb8835ce1c0 1 bdev(0x561250a2 /var/lib/ceph/osd/ceph-29/block) open path /var/lib/ceph/osd/ceph-29/block
2018-10-03 09:47:15.359 7fb8835ce1c0 1 bdev(0x561250a2 /var/lib/ceph/osd/ceph-29/block) open size 1000198897664 (0xe8e080, 932 GiB) block_size 4096 (4 KiB) rotational
2018-10-03 09:47:15.360 7fb8835ce1c0 1 bluestore(/var/lib/ceph/osd/ceph-29) _set_cache_sizes kv_min_ratio 1 > kv_ratio 0.5
2018-10-03 09:47:15.360 7fb8835ce1c0 1 bluestore(/var/lib/ceph/osd/ceph-29) _set_cache_sizes cache_size 536870912 meta 0 kv 1 data 0
2018-10-03 09:47:15.360 7fb8835ce1c0 1 bdev create path /var/lib/ceph/osd/ceph-29/block type kernel
2018-10-03 09:47:15.360 7fb8835ce1c0 1 bdev(0x561250a20a80 /var/lib/ceph/osd/ceph-29/block) open path /var/lib/ceph/osd/ceph-29/block
2018-10-03 09:47:15.360 7fb8835ce1c0 1 bdev(0x561250a20a80 /var/lib/ceph/osd/ceph-29/block) open size 1000198897664 (0xe8e080, 932 GiB) block_size 4096 (4 KiB) rotational
2018-10-03 09:47:15.360 7fb8835ce1c0 1 bluefs add_block_device bdev 1 path /var/lib/ceph/osd/ceph-29/block size 932 GiB
2018-10-03 09:47:15.360 7fb8835ce1c0 1 bluefs mount
2018-10-03 09:47:15.538 7fb8835ce1c0 -1 bluefs _replay file with link count 0: file(ino 519 size 0x31e2f42 mtime 2018-10-02 12:24:22.632397 bdev 1 allocated 320 extents [1:0x700820+10,1:0x700900+10,1:0x700910+10,1:0x700920+10,1:0x700930+10,1:0x700940+10,1:0x700950+10,1:0x700960+10,1:0x700970+10,1:0x700980+10,1:0x700990+10,1:0x7009a0+10,1:0x7009b0+10,1:0x7009c0+10,1:0x7009d0+10,1:0x7009e0+10,1:0x7009f0+10,1:0x700a00+10,1:0x700a10+10,1:0x700a20+10,1:0x700a30+10,1:0x700a40+10,1:0x700a50+10,1:0x700a60+10,1:0x700a70+10,1:0x700a80+10,1:0x700a90+10,1:0x700aa0+10,1:0x700ab0+10,1:0x700ac0+10,1:0x700ad0+10,1:0x700ae0+10,1:0x700af0+10,1:0x700b00+10,1:0x700b10+10,1:0x700b20+10,1:0x700b30+10,1:0x700b40+10,1:0x700b50+10,1:0x700b60+10,1:0x700b70+10,1:0x700b80+10,1:0x700b90+10,1:0x700ba0+10,1:0x700bb0+10,1:0x700bc0+10,1:0x700bd0+10,1:0x700be0+10,1:0x700bf0+10,1:0x700c00+10])
2018-10-03 09:47:15.538 7fb8835ce1c0 -1 bluefs mount failed to replay log: (5) Input/output error
2018-10-03 09:47:15.538 7fb8835ce1c0 1 stupidalloc 0x0x561250b8d030 shutdown
2018-10-03 09:47:15.538 7fb8835ce1c0 -1 bluestore(/var/lib/ceph/osd/ceph-29) _open_db failed bluefs mount: (5) Input/output error
2018-10-03 09:47:15.538 7fb8835ce1c0 1 bdev(0x561250a20a80 /var/lib/ceph/osd/ceph-29/block) close
2018-10-03 09:47:15.616 7fb8835ce1c0 1 bdev(0x561250a2 /var/lib/ceph/osd/ceph-29/block) close
2018-10-03 09:47:15.870 7fb8835ce1c0 -1 osd.29 0 OSD:init: unable to mount object store
2018-10-03 09:47:15.870 7fb8835ce1c0 -1 ** ERROR: osd init failed: (5) Input/output error

*OSD 42:* disk is found by lvm, tmpfs is created but the service immediately dies on start without a log... This might be
Re: [ceph-users] data-pool option for qemu-img / ec pool
Hi Paul, thanks for the hint, I just checked and it works perfectly. I found this guide: https://www.reddit.com/r/ceph/comments/72yc9m/ceph_openstack_with_ec/ This works well with one meta/data setup but not with multiple (like device-class based pools). The link above uses client-auth; is there a better way? Kevin Am So., 23. Sep. 2018 um 18:08 Uhr schrieb Paul Emmerich : > > The usual trick for clients not supporting this natively is the option > "rbd_default_data_pool" in ceph.conf which should also work here. > > > Paul > Am So., 23. Sep. 2018 um 18:03 Uhr schrieb Kevin Olbrich : > > > > Hi! > > > > Is it possible to set data-pool for ec-pools on qemu-img? > > For repl-pools I used "qemu-img convert" to convert from e.g. vmdk to raw > > and write to rbd/ceph directly. > > > > The rbd utility is able to do this for raw or empty images but without > > convert (converting 800G and writing it again would now take at least twice > > the time). > > > > Do I miss a parameter for qemu-kvm? > > > > Kind regards > > Kevin > > ___ > > ceph-users mailing list > > ceph-users@lists.ceph.com > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > > -- > Paul Emmerich > > Looking for help with your Ceph cluster? Contact us at https://croit.io > > croit GmbH > Freseniusstr. 31h > 81247 München > www.croit.io > Tel: +49 89 1896585 90 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
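For the archives, the trick Paul describes can be sketched like this — the pool names rbd_meta (replicated, holds metadata) and rbd_ec_data (erasure coded, holds data objects) are hypothetical examples, not from this thread:

```shell
# point librbd at the EC data pool by default; new images created through
# clients that lack a --data-pool option (like qemu-img) will then place
# their data objects there
cat >> /etc/ceph/ceph.conf <<'EOF'
[client]
rbd_default_data_pool = rbd_ec_data
EOF

# qemu-img creates the image in the replicated pool; its data lands in the EC pool
qemu-img convert -p -O raw server02.vmdk rbd:rbd_meta/server02
```

With multiple device-class based EC pools, a single ceph.conf default obviously cannot distinguish them, which matches the limitation noted above.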
[ceph-users] data-pool option for qemu-img / ec pool
Hi! Is it possible to set data-pool for ec-pools on qemu-img? For repl-pools I used "qemu-img convert" to convert from e.g. vmdk to raw and write to rbd/ceph directly. The rbd utility is able to do this for raw or empty images but without convert (converting 800G and writing it again would now take at least twice the time). Am I missing a parameter for qemu-kvm? Kind regards Kevin ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Crush distribution with heterogeneous device classes and failure domain hosts
Thank you very much Paul. Kevin Am Do., 20. Sep. 2018 um 15:19 Uhr schrieb Paul Emmerich < paul.emmer...@croit.io>: > Hi, > > device classes are internally represented as completely independent > trees/roots; showing them in one tree is just syntactic sugar. > > For example, if you have a hierarchy like root --> host1, host2, host3 > --> nvme/ssd/sata OSDs, then you'll actually have 3 trees: > > root~ssd -> host1~ssd, host2~ssd ... > root~sata -> host~sata, ... > > > Paul > > 2018-09-20 14:54 GMT+02:00 Kevin Olbrich : > > Hi! > > > > Currently I have a cluster with four hosts and 4x HDDs + 4 SSDs per host. > > I also have replication rules to distinguish between HDD and SSD (and > > failure-domain set to rack) which are mapped to pools. > > > > What happens if I add a heterogeneous host with 1x SSD and 1x NVMe (where > > NVMe will be a new device-class based rule)? > > > > Will the crush weight be calculated from the OSDs up to the > failure-domain > > based on the crush rule? > > The only crush-weights I know and see are those shown by "ceph osd tree". > > > > Kind regards > > Kevin > > > > ___ > > ceph-users mailing list > > ceph-users@lists.ceph.com > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > > > > -- > Paul Emmerich > > Looking for help with your Ceph cluster? Contact us at https://croit.io > > croit GmbH > Freseniusstr. 31h > 81247 München > www.croit.io > Tel: +49 89 1896585 90 > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Crush distribution with heterogeneous device classes and failure domain hosts
To answer my own question: ceph osd crush tree --show-shadow Sorry for the noise... Am Do., 20. Sep. 2018 um 14:54 Uhr schrieb Kevin Olbrich : > Hi! > > Currently I have a cluster with four hosts and 4x HDDs + 4 SSDs per host. > I also have replication rules to distinguish between HDD and SSD (and > failure-domain set to rack) which are mapped to pools. > > What happens if I add a heterogeneous host with 1x SSD and 1x NVMe (where > NVMe will be a new device-class based rule)? > > Will the crush weight be calculated from the OSDs up to the failure-domain > based on the crush rule? > The only crush-weights I know and see are those shown by "ceph osd tree". > > Kind regards > Kevin > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Crush distribution with heterogeneous device classes and failure domain hosts
Hi! Currently I have a cluster with four hosts and 4x HDDs + 4 SSDs per host. I also have replication rules to distinguish between HDD and SSD (and failure-domain set to rack) which are mapped to pools. What happens if I add a heterogeneous host with 1x SSD and 1x NVMe (where NVMe will be a new device-class based rule)? Will the crush weight be calculated from the OSDs up to the failure-domain based on the crush rule? The only crush-weights I know and see are those shown by "ceph osd tree". Kind regards Kevin ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] (no subject)
Hi! Is the compressible hint / incompressible hint supported on qemu+kvm? http://docs.ceph.com/docs/mimic/rados/configuration/bluestore-config-ref/ If not, only aggressive would work in this case for rbd, right? Kind regards Kevin ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
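For reference, when the client cannot be relied on to send compressible/incompressible hints, BlueStore compression can be forced per pool with the aggressive mode mentioned above; a sketch with a hypothetical pool name:

```shell
# compress everything in this pool unless a client explicitly hints
# the data is incompressible
ceph osd pool set rbd_vms_pool compression_mode aggressive
ceph osd pool set rbd_vms_pool compression_algorithm snappy
```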
[ceph-users] nfs-ganesha FSAL CephFS: nfs_health :DBUS :WARN :Health status is unhealthy
Hi! Today one of our nfs-ganesha gateways experienced an outage and has since crashed every time the client behind it tries to access the data. This is a Ceph Mimic cluster with nfs-ganesha from the ceph repos: nfs-ganesha-2.6.2-0.1.el7.x86_64 nfs-ganesha-ceph-2.6.2-0.1.el7.x86_64 There were fixes for this problem in 2.6.3: https://github.com/nfs-ganesha/nfs-ganesha/issues/339 Can the build in the repos be compiled against this bugfix release? Thank you very much. Kind regards Kevin ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] SPDK/DPDK with Intel P3700 NVMe pool
Hi! During our move from filestore to bluestore, we removed several Intel P3700 NVMe from the nodes. Is someone running a SPDK/DPDK NVMe-only EC pool? Is it working well? The docs are very short about the setup: http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#spdk-usage I would like to re-use these cards for high-end (max IO) for database VMs. Some notes or feedback about the setup (ceph-volume etc.) would be appreciated. Thank you. Kind regards Kevin ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
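For anyone evaluating the same setup: the linked docs boil down to binding the NVMe to SPDK's userspace driver and pointing BlueStore at it by serial number. A hedged sketch — the serial below is the placeholder example from the docs, not a real device:

```shell
# unbind the NVMe from the kernel nvme driver and bind it to SPDK (uio/vfio)
./spdk/scripts/setup.sh

# tell bluestore to open the device through SPDK instead of the kernel block layer
cat >> /etc/ceph/ceph.conf <<'EOF'
[osd]
bluestore_block_path = spdk:55cd2e404bd73932
EOF
```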
[ceph-users] HDD-only CephFS cluster with EC and without SSD/NVMe
Hi! I am in the process of moving a local ("large", 24x1TB) ZFS RAIDZ2 to CephFS. This storage is used for backup images (large sequential reads and writes). To save space and have a RAIDZ2 (RAID6) like setup, I am planning the following profile: ceph osd erasure-code-profile set myprofile \ k=3 \ m=2 \ ruleset-failure-domain=rack Performance is not the first priority, which is why I do not plan to outsource the WAL/DB (broken NVMe = broken OSDs is more administrative overhead than single OSDs). Disks are attached via SAS multipath; throughput in general is no problem, but I did not test with Ceph yet. Is anyone using CephFS + bluestore + EC 3/2 without a WAL/DB device, and is it working well? Thank you. Kevin ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
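A sketch of the full pool setup for this profile — pool names and PG counts are examples, and note that on Luminous/Mimic the profile key is crush-failure-domain (ruleset-failure-domain is the pre-Luminous spelling); CephFS on an EC data pool also needs overwrites enabled:

```shell
# RAIDZ2-like 3+2 profile, spreading chunks across racks
ceph osd erasure-code-profile set myprofile k=3 m=2 crush-failure-domain=rack

# EC data pool for CephFS; pg counts are examples, size to your cluster
ceph osd pool create cephfs_data_ec 256 256 erasure myprofile

# required for RBD/CephFS on EC pools
ceph osd pool set cephfs_data_ec allow_ec_overwrites true

# attach it to an existing filesystem as an additional data pool
ceph fs add_data_pool cephfs cephfs_data_ec
```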
Re: [ceph-users] Running 12.2.5 without problems, should I upgrade to 12.2.7 or wait for 12.2.8?
Am Fr., 10. Aug. 2018 um 19:29 Uhr schrieb : > > > Am 30. Juli 2018 09:51:23 MESZ schrieb Micha Krause : > >Hi, > > Hi Micha, > > > > >I'm Running 12.2.5 and I have no Problems at the moment. > > > >However my servers reporting daily that they want to upgrade to 12.2.7, > >is this save or should I wait for 12.2.8? > > > I guess you should Upgrade to 12.2.7 as soon as you can, specialy when > Why? As far as I understood, replicated pools for rbd are out of danger - .6 and .7 were mostly fixes for the known cases. We are not planning any upgrade from 12.2.5 atm. Please correct me if I am wrong. Kevin > Quote: > The v12.2.5 release has a potential data corruption issue with erasure > coded pools. If you ran v12.2.5 with erasure coding, please see below. > > See: https://ceph.com/releases/12-2-7-luminous-released/ > > Hth > - Mehmet > >Are there any predictions when the 12.2.8 release will be available? > > > > > >Micha Krause > >___ > >ceph-users mailing list > >ceph-users@lists.ceph.com > >http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] v12.2.7 Luminous released
Hi, on upgrade from 12.2.4 to 12.2.5 the balancer module broke (mgr crashes minutes after service started). Only solution was to disable the balancer (service is running fine since). Is this fixed in 12.2.7? I was unable to locate the bug in bugtracker. Kevin 2018-07-17 18:28 GMT+02:00 Abhishek Lekshmanan : > > This is the seventh bugfix release of Luminous v12.2.x long term > stable release series. This release contains several fixes for > regressions in the v12.2.6 and v12.2.5 releases. We recommend that > all users upgrade. > > *NOTE* The v12.2.6 release has serious known regressions, while 12.2.6 > wasn't formally announced in the mailing lists or blog, the packages > were built and available on download.ceph.com since last week. If you > installed this release, please see the upgrade procedure below. > > *NOTE* The v12.2.5 release has a potential data corruption issue with > erasure coded pools. If you ran v12.2.5 with erasure coding, please see > below. > > The full blog post alongwith the complete changelog is published at the > official ceph blog at https://ceph.com/releases/12-2-7-luminous-released/ > > Upgrading from v12.2.6 > -- > > v12.2.6 included an incomplete backport of an optimization for > BlueStore OSDs that avoids maintaining both the per-object checksum > and the internal BlueStore checksum. Due to the accidental omission > of a critical follow-on patch, v12.2.6 corrupts (fails to update) the > stored per-object checksum value for some objects. This can result in > an EIO error when trying to read those objects. > > #. If your cluster uses FileStore only, no special action is required. >This problem only affects clusters with BlueStore. > > #. If your cluster has only BlueStore OSDs (no FileStore), then you >should enable the following OSD option:: > > osd skip data digest = true > >This will avoid setting and start ignoring the full-object digests >whenever the primary for a PG is BlueStore. > > #. 
If you have a mix of BlueStore and FileStore OSDs, then you should >enable the following OSD option:: > > osd distrust data digest = true > >This will avoid setting and start ignoring the full-object digests >in all cases. This weakens the data integrity checks for >FileStore (although those checks were always only opportunistic). > > If your cluster includes BlueStore OSDs and was affected, deep scrubs > will generate errors about mismatched CRCs for affected objects. > Currently the repair operation does not know how to correct them > (since all replicas do not match the expected checksum it does not > know how to proceed). These warnings are harmless in the sense that > IO is not affected and the replicas are all still in sync. The number > of affected objects is likely to drop (possibly to zero) on their own > over time as those objects are modified. We expect to include a scrub > improvement in v12.2.8 to clean up any remaining objects. > > Additionally, see the notes below, which apply to both v12.2.5 and v12.2.6. > > Upgrading from v12.2.5 or v12.2.6 > - > > If you used v12.2.5 or v12.2.6 in combination with erasure coded > pools, there is a small risk of corruption under certain workloads. > Specifically, when: > > * An erasure coded pool is in use > * The pool is busy with successful writes > * The pool is also busy with updates that result in an error result to > the librados user. RGW garbage collection is the most common > example of this (it sends delete operations on objects that don't > always exist.) > * Some OSDs are reasonably busy. One known example of such load is > FileStore splitting, although in principle any load on the cluster > could also trigger the behavior. > * One or more OSDs restarts. > > This combination can trigger an OSD crash and possibly leave PGs in a state > where they fail to peer. > > Notably, upgrading a cluster involves OSD restarts and as such may > increase the risk of encountering this bug. 
For this reason, for > clusters with erasure coded pools, we recommend the following upgrade > procedure to minimize risk: > > 1. Install the v12.2.7 packages. > 2. Temporarily quiesce IO to cluster:: > > ceph osd pause > > 3. Restart all OSDs and wait for all PGs to become active. > 4. Resume IO:: > > ceph osd unpause > > This will cause an availability outage for the duration of the OSD > restarts. If this is unacceptable, a *more risky* alternative is to > disable RGW garbage collection (the primary known cause of these rados > operations) for the duration of the upgrade:: > > 1. Set ``rgw_enable_gc_threads = false`` in ceph.conf > 2. Restart all radosgw daemons > 3. Upgrade and restart all OSDs > 4. Remove ``rgw_enable_gc_threads = false`` from ceph.conf > 5. Restart all radosgw daemons > > Upgrading from other versions > - > > If your cluster did not run v12.2.5 or v12.2.6 then none of the above >
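Condensed, the recommended procedure from the release notes quoted above is a short sequence (install the v12.2.7 packages on all nodes first):

```shell
# temporarily quiesce client IO to the cluster
ceph osd pause

# restart all ceph-osd daemons, then wait until all PGs are active again

# resume IO
ceph osd unpause
```

Note this causes an availability outage for the duration of the OSD restarts, as the notes state.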
Re: [ceph-users] Periodically activating / peering on OSD add
PS: It's luminous 12.2.5! Mit freundlichen Grüßen / best regards, Kevin Olbrich. 2018-07-14 15:19 GMT+02:00 Kevin Olbrich : > Hi, > > why do I see activating followed by peering during OSD add (refill)? > I did not change pg(p)_num. > > Is this normal? From my other clusters, I don't think that happend... > > Kevin > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Periodically activating / peering on OSD add
Hi, why do I see activating followed by peering during OSD add (refill)? I did not change pg(p)_num. Is this normal? From my other clusters, I don't think that happened... Kevin ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Bluestore and number of devices
You can keep the same layout as before. Most place DB/WAL combined in one partition (similar to the journal on filestore). Kevin 2018-07-13 12:37 GMT+02:00 Robert Stanford : > > I'm using filestore now, with 4 data devices per journal device. > > I'm confused by this: "BlueStore manages either one, two, or (in certain > cases) three storage devices." > (http://docs.ceph.com/docs/luminous/rados/configuration/ > bluestore-config-ref/) > > When I convert my journals to bluestore, will they still be four data > devices (osds) per journal, or will they each require a dedicated journal > drive now? > > Regards > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
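As a sketch of the combined layout described above (device names are examples), a shared DB/WAL partition is simply passed via --block.db when creating the bluestore OSD; the WAL lives inside the DB partition unless given its own device:

```shell
# data on the HDD, RocksDB (and implicitly the WAL) on one NVMe partition,
# mirroring the old one-journal-partition-per-OSD filestore layout
ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1p1
```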
Re: [ceph-users] PGs stuck peering (looping?) after upgrade to Luminous.
Sounds a little bit like the problem I had on OSDs:

[ceph-users] Blocked requests activating+remapped after extending pg(p)_num <http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-May/026680.html> *Kevin Olbrich*
- [ceph-users] Blocked requests activating+remapped after extending pg(p)_num <http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-May/026681.html> *Burkhard Linke*
- [ceph-users] Blocked requests activating+remapped after extending pg(p)_num <http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-May/026682.html> *Kevin Olbrich*
- [ceph-users] Blocked requests activating+remapped after extending pg(p)_num <http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-May/026683.html> *Kevin Olbrich*
- [ceph-users] Blocked requests activating+remapped after extending pg(p)_num <http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-May/026685.html> *Kevin Olbrich*
- [ceph-users] Blocked requests activating+remapped after extending pg(p)_num <http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-May/026689.html> *Kevin Olbrich*
- [ceph-users] Blocked requests activating+remapped after extending pg(p)_num <http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-May/026692.html> *Paul Emmerich*
- [ceph-users] Blocked requests activating+remapped after extending pg(p)_num <http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-May/026695.html> *Kevin Olbrich*

I ended up restarting the OSDs which were stuck in that state and they immediately fixed themselves. It should also work to just "out" the problem OSDs and immediately "up" them again to fix it. - Kevin 2018-07-11 20:30 GMT+02:00 Magnus Grönlund : > Hi, > > Started to upgrade a ceph-cluster from Jewel (10.2.10) to Luminous (12.2.6) > > After upgrading and restarting the mons everything looked OK, the mons had > quorum, all OSDs where up and in and all the PGs where active+clean. > But before I had time to start upgrading the OSDs it became obvious that > something had gone terribly wrong. 
> All of a sudden 1600 out of 4100 PGs where inactive and 40% of the data > was misplaced! > > The mons appears OK and all OSDs are still up and in, but a few hours > later there was still 1483 pgs stuck inactive, essentially all of them in > peering! > Investigating one of the stuck PGs it appears to be looping between > “inactive”, “remapped+peering” and “peering” and the epoch number is rising > fast, see the attached pg query outputs. > > We really can’t afford to loose the cluster or the data so any help or > suggestions on how to debug or fix this issue would be very, very > appreciated! > > > health: HEALTH_ERR > 1483 pgs are stuck inactive for more than 60 seconds > 542 pgs backfill_wait > 14 pgs backfilling > 11 pgs degraded > 1402 pgs peering > 3 pgs recovery_wait > 11 pgs stuck degraded > 1483 pgs stuck inactive > 2042 pgs stuck unclean > 7 pgs stuck undersized > 7 pgs undersized > 111 requests are blocked > 32 sec > 10586 requests are blocked > 4096 sec > recovery 9472/11120724 objects degraded (0.085%) > recovery 1181567/11120724 objects misplaced (10.625%) > noout flag(s) set > mon.eselde02u32 low disk space > > services: > mon: 3 daemons, quorum eselde02u32,eselde02u33,eselde02u34 > mgr: eselde02u32(active), standbys: eselde02u33, eselde02u34 > osd: 111 osds: 111 up, 111 in; 800 remapped pgs > flags noout > > data: > pools: 18 pools, 4104 pgs > objects: 3620k objects, 13875 GB > usage: 42254 GB used, 160 TB / 201 TB avail > pgs: 1.876% pgs unknown > 34.259% pgs not active > 9472/11120724 objects degraded (0.085%) > 1181567/11120724 objects misplaced (10.625%) > 2062 active+clean > 1221 peering > 535 active+remapped+backfill_wait > 181 remapped+peering > 77 unknown > 13 active+remapped+backfilling > 7active+undersized+degraded+remapped+backfill_wait > 4remapped > 3active+recovery_wait+degraded+remapped > 1active+degraded+remapped+backfilling > > io: > recovery: 298 MB/s, 77 objects/s > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
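A hedged sketch of the restart/re-peer workaround described in the reply above — the OSD id is an example; apply it only to the OSDs whose PGs are stuck:

```shell
# marking the OSD down makes the mons redistribute its PGs' peering;
# the daemon itself notices and rejoins, forcing a fresh peering round
ceph osd down 17

# alternatively, a full daemon restart achieves the same
systemctl restart ceph-osd@17
```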
Re: [ceph-users] rbd lock remove unable to parse address
2018-07-10 14:37 GMT+02:00 Jason Dillaman : > On Tue, Jul 10, 2018 at 2:37 AM Kevin Olbrich wrote: > >> 2018-07-10 0:35 GMT+02:00 Jason Dillaman : >> >>> Is the link-local address of "fe80::219:99ff:fe9e:3a86%eth0" at least >>> present on the client computer you used? I would have expected the OSD to >>> determine the client address, so it's odd that it was able to get a >>> link-local address. >>> >> >> Yes, it is. eth0 is part of bond0 which is a vlan trunk. Bond0.X is >> attached to brX which has an ULA-prefix for the ceph cluster. >> Eth0 has no address itself. In this case this must mean, the address has >> been carried down to the hardware interface. >> >> I am wondering why it uses link local when there is an ULA-prefix >> available. >> >> The address is available on brX on this client node. >> > > I'll open a tracker ticker to get that issue fixed, but in the meantime, > you can run "rados -p rmxattr rbd_header. > lock.rbd_lock" to remove the lock. > Worked perfectly, thank you very much! > >> - Kevin >> >> >>> On Mon, Jul 9, 2018 at 3:43 PM Kevin Olbrich wrote: >>> >>>> 2018-07-09 21:25 GMT+02:00 Jason Dillaman : >>>> >>>>> BTW -- are you running Ceph on a one-node computer? I thought IPv6 >>>>> addresses starting w/ fe80 were link-local addresses which would probably >>>>> explain why an interface scope id was appended. The current IPv6 address >>>>> parser stops reading after it encounters a non hex, colon character [1]. >>>>> >>>> >>>> No, this is a compute machine attached to the storage vlan where I >>>> previously had also local disks. >>>> >>>> >>>>> >>>>> >>>>> On Mon, Jul 9, 2018 at 3:14 PM Jason Dillaman >>>>> wrote: >>>>> >>>>>> Hmm ... it looks like there is a bug w/ RBD locks and IPv6 addresses >>>>>> since it is failing to parse the address as valid. Perhaps it's barfing >>>>>> on >>>>>> the "%eth0" scope id suffix within the address. >>>>>> >>>>>> On Mon, Jul 9, 2018 at 2:47 PM Kevin Olbrich wrote: >>>>>> >>>>>>> Hi! 
>>>>>>> >>>>>>> I tried to convert an qcow2 file to rbd and set the wrong pool. >>>>>>> Immediately I stopped the transfer but the image is stuck locked: >>>>>>> >>>>>>> Previusly when that happened, I was able to remove the image after >>>>>>> 30 secs. >>>>>>> >>>>>>> [root@vm2003 images1]# rbd -p rbd_vms_hdd lock list fpi_server02 >>>>>>> There is 1 exclusive lock on this image. >>>>>>> Locker ID Address >>>>>>> >>>>>>> client.1195723 auto 93921602220416 [fe80::219:99ff:fe9e:3a86% >>>>>>> eth0]:0/1200385089 >>>>>>> >>>>>>> [root@vm2003 images1]# rbd -p rbd_vms_hdd lock rm fpi_server02 >>>>>>> "auto 93921602220416" client.1195723 >>>>>>> rbd: releasing lock failed: (22) Invalid argument >>>>>>> 2018-07-09 20:45:19.080543 7f6c2c267d40 -1 librados: unable to parse >>>>>>> address [fe80::219:99ff:fe9e:3a86%eth0]:0/1200385089 >>>>>>> 2018-07-09 20:45:19.080555 7f6c2c267d40 -1 librbd: unable to >>>>>>> blacklist client: (22) Invalid argument >>>>>>> >>>>>>> The image is not in use anywhere! >>>>>>> >>>>>>> How can I force removal of all locks for this image? >>>>>>> >>>>>>> Kind regards, >>>>>>> Kevin >>>>>>> ___ >>>>>>> ceph-users mailing list >>>>>>> ceph-users@lists.ceph.com >>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >>>>>>> >>>>>> >>>>>> >>>>>> -- >>>>>> Jason >>>>>> >>>>> >>>>> [1] https://github.com/ceph/ceph/blob/master/src/msg/msg_types.cc#L108 >>>>> >>>>> -- >>>>> Jason >>>>> >>>> >>>> >>> >>> -- >>> Jason >>> >> >> > > -- > Jason > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
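For anyone hitting the same stale-lock issue with a link-local IPv6 address: the archive truncated Jason's command, so here is a reconstructed sketch using the pool/image from this thread; the header object id is a placeholder and must be looked up first:

```shell
# find the image's internal id: block_name_prefix looks like rbd_data.<id>
rbd info rbd_vms_hdd/fpi_server02 | grep block_name_prefix

# drop the lock xattr directly from the matching header object,
# bypassing the librbd address parsing that fails on %eth0
rados -p rbd_vms_hdd rmxattr rbd_header.<id> lock.rbd_lock
```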
Re: [ceph-users] rbd lock remove unable to parse address
2018-07-10 0:35 GMT+02:00 Jason Dillaman : > Is the link-local address of "fe80::219:99ff:fe9e:3a86%eth0" at least > present on the client computer you used? I would have expected the OSD to > determine the client address, so it's odd that it was able to get a > link-local address. > Yes, it is. eth0 is part of bond0 which is a vlan trunk. Bond0.X is attached to brX which has an ULA-prefix for the ceph cluster. Eth0 has no address itself. In this case this must mean, the address has been carried down to the hardware interface. I am wondering why it uses link local when there is an ULA-prefix available. The address is available on brX on this client node. - Kevin > On Mon, Jul 9, 2018 at 3:43 PM Kevin Olbrich wrote: > >> 2018-07-09 21:25 GMT+02:00 Jason Dillaman : >> >>> BTW -- are you running Ceph on a one-node computer? I thought IPv6 >>> addresses starting w/ fe80 were link-local addresses which would probably >>> explain why an interface scope id was appended. The current IPv6 address >>> parser stops reading after it encounters a non hex, colon character [1]. >>> >> >> No, this is a compute machine attached to the storage vlan where I >> previously had also local disks. >> >> >>> >>> >>> On Mon, Jul 9, 2018 at 3:14 PM Jason Dillaman >>> wrote: >>> >>>> Hmm ... it looks like there is a bug w/ RBD locks and IPv6 addresses >>>> since it is failing to parse the address as valid. Perhaps it's barfing on >>>> the "%eth0" scope id suffix within the address. >>>> >>>> On Mon, Jul 9, 2018 at 2:47 PM Kevin Olbrich wrote: >>>> >>>>> Hi! >>>>> >>>>> I tried to convert an qcow2 file to rbd and set the wrong pool. >>>>> Immediately I stopped the transfer but the image is stuck locked: >>>>> >>>>> Previusly when that happened, I was able to remove the image after 30 >>>>> secs. >>>>> >>>>> [root@vm2003 images1]# rbd -p rbd_vms_hdd lock list fpi_server02 >>>>> There is 1 exclusive lock on this image. 
>>>>> Locker ID Address >>>>> >>>>> client.1195723 auto 93921602220416 [fe80::219:99ff:fe9e:3a86% >>>>> eth0]:0/1200385089 >>>>> >>>>> [root@vm2003 images1]# rbd -p rbd_vms_hdd lock rm fpi_server02 "auto >>>>> 93921602220416" client.1195723 >>>>> rbd: releasing lock failed: (22) Invalid argument >>>>> 2018-07-09 20:45:19.080543 7f6c2c267d40 -1 librados: unable to parse >>>>> address [fe80::219:99ff:fe9e:3a86%eth0]:0/1200385089 >>>>> 2018-07-09 20:45:19.080555 7f6c2c267d40 -1 librbd: unable to blacklist >>>>> client: (22) Invalid argument >>>>> >>>>> The image is not in use anywhere! >>>>> >>>>> How can I force removal of all locks for this image? >>>>> >>>>> Kind regards, >>>>> Kevin >>>>> ___ >>>>> ceph-users mailing list >>>>> ceph-users@lists.ceph.com >>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >>>>> >>>> >>>> >>>> -- >>>> Jason >>>> >>> >>> [1] https://github.com/ceph/ceph/blob/master/src/msg/msg_types.cc#L108 >>> >>> -- >>> Jason >>> >> >> > > -- > Jason > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] rbd lock remove unable to parse address
2018-07-09 21:25 GMT+02:00 Jason Dillaman : > BTW -- are you running Ceph on a one-node computer? I thought IPv6 > addresses starting w/ fe80 were link-local addresses which would probably > explain why an interface scope id was appended. The current IPv6 address > parser stops reading after it encounters a non hex, colon character [1]. > No, this is a compute machine attached to the storage vlan where I previously had also local disks. > > > On Mon, Jul 9, 2018 at 3:14 PM Jason Dillaman wrote: > >> Hmm ... it looks like there is a bug w/ RBD locks and IPv6 addresses >> since it is failing to parse the address as valid. Perhaps it's barfing on >> the "%eth0" scope id suffix within the address. >> >> On Mon, Jul 9, 2018 at 2:47 PM Kevin Olbrich wrote: >> >>> Hi! >>> >>> I tried to convert an qcow2 file to rbd and set the wrong pool. >>> Immediately I stopped the transfer but the image is stuck locked: >>> >>> Previusly when that happened, I was able to remove the image after 30 >>> secs. >>> >>> [root@vm2003 images1]# rbd -p rbd_vms_hdd lock list fpi_server02 >>> There is 1 exclusive lock on this image. >>> Locker ID Address >>> >>> client.1195723 auto 93921602220416 [fe80::219:99ff:fe9e:3a86% >>> eth0]:0/1200385089 >>> >>> [root@vm2003 images1]# rbd -p rbd_vms_hdd lock rm fpi_server02 "auto >>> 93921602220416" client.1195723 >>> rbd: releasing lock failed: (22) Invalid argument >>> 2018-07-09 20:45:19.080543 7f6c2c267d40 -1 librados: unable to parse >>> address [fe80::219:99ff:fe9e:3a86%eth0]:0/1200385089 >>> 2018-07-09 20:45:19.080555 7f6c2c267d40 -1 librbd: unable to blacklist >>> client: (22) Invalid argument >>> >>> The image is not in use anywhere! >>> >>> How can I force removal of all locks for this image? 
>>> >>> Kind regards, >>> Kevin >>> ___ >>> ceph-users mailing list >>> ceph-users@lists.ceph.com >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >>> >> >> >> -- >> Jason >> > > [1] https://github.com/ceph/ceph/blob/master/src/msg/msg_types.cc#L108 > > -- > Jason > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] rbd lock remove unable to parse address
Is it possible to force-remove the lock or the image? Kevin 2018-07-09 21:14 GMT+02:00 Jason Dillaman : > Hmm ... it looks like there is a bug w/ RBD locks and IPv6 addresses since > it is failing to parse the address as valid. Perhaps it's barfing on the > "%eth0" scope id suffix within the address. > > On Mon, Jul 9, 2018 at 2:47 PM Kevin Olbrich wrote: > >> Hi! >> >> I tried to convert an qcow2 file to rbd and set the wrong pool. >> Immediately I stopped the transfer but the image is stuck locked: >> >> Previusly when that happened, I was able to remove the image after 30 >> secs. >> >> [root@vm2003 images1]# rbd -p rbd_vms_hdd lock list fpi_server02 >> There is 1 exclusive lock on this image. >> Locker ID Address >> >> client.1195723 auto 93921602220416 [fe80::219:99ff:fe9e:3a86% >> eth0]:0/1200385089 >> >> [root@vm2003 images1]# rbd -p rbd_vms_hdd lock rm fpi_server02 "auto >> 93921602220416" client.1195723 >> rbd: releasing lock failed: (22) Invalid argument >> 2018-07-09 20:45:19.080543 7f6c2c267d40 -1 librados: unable to parse >> address [fe80::219:99ff:fe9e:3a86%eth0]:0/1200385089 >> 2018-07-09 20:45:19.080555 7f6c2c267d40 -1 librbd: unable to blacklist >> client: (22) Invalid argument >> >> The image is not in use anywhere! >> >> How can I force removal of all locks for this image? >> >> Kind regards, >> Kevin >> ___ >> ceph-users mailing list >> ceph-users@lists.ceph.com >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >> > > > -- > Jason > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] rbd lock remove unable to parse address
Hi!

I tried to convert a qcow2 file to rbd and set the wrong pool. I stopped the transfer immediately, but the image is stuck locked:

Previously when that happened, I was able to remove the image after 30 secs.

[root@vm2003 images1]# rbd -p rbd_vms_hdd lock list fpi_server02
There is 1 exclusive lock on this image.
Locker          ID                    Address
client.1195723  auto 93921602220416   [fe80::219:99ff:fe9e:3a86%eth0]:0/1200385089

[root@vm2003 images1]# rbd -p rbd_vms_hdd lock rm fpi_server02 "auto 93921602220416" client.1195723
rbd: releasing lock failed: (22) Invalid argument
2018-07-09 20:45:19.080543 7f6c2c267d40 -1 librados: unable to parse address [fe80::219:99ff:fe9e:3a86%eth0]:0/1200385089
2018-07-09 20:45:19.080555 7f6c2c267d40 -1 librbd: unable to blacklist client: (22) Invalid argument

The image is not in use anywhere!

How can I force removal of all locks for this image?

Kind regards,
Kevin
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
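Until the address parser handles scope ids, one possible workaround is to attack the lock at the RADOS level instead of via `rbd lock rm`. Everything below is a sketch from memory: the image id, the `rbd_lock` lock name and the exact `rados lock` syntax should all be verified against `rbd info` and `rados --help` on your version first.

```shell
# The exclusive lock lives on the image's header object: rbd_header.<image-id>.
rbd -p rbd_vms_hdd info fpi_server02 | grep block_name_prefix
# e.g. block_name_prefix: rbd_data.102a74b0dc51 -> header object rbd_header.102a74b0dc51

# Inspect and break the lock directly with rados, bypassing librbd's
# address parsing / blacklisting step.
rados -p rbd_vms_hdd lock list rbd_header.102a74b0dc51
rados -p rbd_vms_hdd lock break rbd_header.102a74b0dc51 rbd_lock client.1195723 \
      --lock-cookie "auto 93921602220416"
```

The image id (102a74b0dc51) above is a made-up placeholder; use the one `rbd info` actually reports.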
[ceph-users] GFS2 as RBD on ceph?
Hi!

*Is it safe to run GFS2 on ceph as RBD and mount it to approx. 3 to 5 VMs?*

The idea is to consolidate 3 webservers which are located behind proxies. The old infrastructure is not HA or capable of load balancing. I would like to set up a webserver, clone the image and mount the GFS2 disk as shared storage. This would also allow FTP load balancing. Redundancy would be taken care of by ceph while the VMs share up-to-date data on all nodes.

*I don't think CephFS is an option, as most files are very small and thousands of files will be opened simultaneously.*

Is anyone using such an approach?

Kind regards,
Kevin
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Adding cluster network to running cluster
Really? I always thought that splitting off the replication network is best practice. Keeping everything in the same IPv6 network is much easier. Thank you. Kevin 2018-06-07 10:44 GMT+02:00 Wido den Hollander : > > > On 06/07/2018 09:46 AM, Kevin Olbrich wrote: > > Hi! > > > > When we installed our new luminous cluster, we had issues with the > > cluster network (setup of mon's failed). > > We moved on with a single network setup. > > > > Now I would like to set the cluster network again but the cluster is in > > use (4 nodes, 2 pools, VMs). > > Why? What is the benefit from having the cluster network? Back in the > old days when 10Gb was expensive you would run public on 1G and cluster > on 10G. > > Now with 2x10Gb going into each machine, why still bother with managing > two networks? > > I really do not see the benefit. > > I manage multiple 1000 ~ 2500 OSD clusters all running with all their > nodes on IPv6 and 2x10Gb in a single network. That works just fine. > > Try to keep the network simple and do not overcomplicate it. > > Wido > > > What happens if I set the cluster network on one of the nodes and reboot > > (maintenance, updates, etc.)? > > Will the node use both networks as the other three nodes are not > > reachable there? > > > > Both the MONs and OSDs have IPs in both networks, routing is not needed. > > This cluster is dualstack but we set ms_bind_ipv6 = true. > > > > Thank you. > > > > Kind regards > > Kevin > > > > ___ > > ceph-users mailing list > > ceph-users@lists.ceph.com > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
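For reference, following Wido's advice reduces the network section of ceph.conf to a single `public network` entry; a minimal sketch (the subnet below is a documentation placeholder, not taken from this cluster):

```ini
[global]
ms_bind_ipv6 = true
# One flat IPv6 network for mons, OSDs and clients.
public network = 2001:db8::/64
# No "cluster network" line: replication then simply uses the public network.
```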
Re: [ceph-users] Blocked requests activating+remapped after extending pg(p)_num
Hi! @Paul Thanks! I know, I read the whole topic about size 2 some months ago. But this has not been my decision, I had to set it up like that. In the meantime, I did a reboot of node1001 and node1002 with flag "noout" set and now peering has finished and only 0.0x% are rebalanced. IO is flowing again. This happened as soon as the OSD was down (not out). This looks very much like a bug to me, doesn't it? Restarting an OSD to "repair" crush? Also I did query the pg but it did not show any error. It just lists stats and that the pg was active since 8:40 this morning. There are row(s) with "blocked by" but no value, is that supposed to be filled with data? Kind regards, Kevin 2018-05-17 16:45 GMT+02:00 Paul Emmerich <paul.emmer...@croit.io>: > Check ceph pg query, it will (usually) tell you why something is stuck > inactive. > > Also: never do min_size 1. > > > Paul > > > 2018-05-17 15:48 GMT+02:00 Kevin Olbrich <k...@sv01.de>: > >> I was able to obtain another NVMe to get the HDDs in node1004 into the >> cluster.
>> The number of disks (all 1TB) is now balanced between racks, still some >> inactive PGs: >> >> data: >> pools: 2 pools, 1536 pgs >> objects: 639k objects, 2554 GB >> usage: 5167 GB used, 14133 GB / 19300 GB avail >> pgs: 1.562% pgs not active >> 1183/1309952 objects degraded (0.090%) >> 199660/1309952 objects misplaced (15.242%) >> 1072 active+clean >> 405 active+remapped+backfill_wait >> 35 active+remapped+backfilling >> 21 activating+remapped >> 3activating+undersized+degraded+remapped >> >> >> >> ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF >> -1 18.85289 root default >> -16 18.85289 datacenter dc01 >> -19 18.85289 pod dc01-agg01 >> -108.98700 rack dc01-rack02 >> -44.03899 host node1001 >> 0 hdd 0.90999 osd.0 up 1.0 1.0 >> 1 hdd 0.90999 osd.1 up 1.0 1.0 >> 5 hdd 0.90999 osd.5 up 1.0 1.0 >> 2 ssd 0.43700 osd.2 up 1.0 1.0 >> 3 ssd 0.43700 osd.3 up 1.0 1.0 >> 4 ssd 0.43700 osd.4 up 1.0 1.0 >> -74.94899 host node1002 >> 9 hdd 0.90999 osd.9 up 1.0 1.0 >> 10 hdd 0.90999 osd.10up 1.0 1.0 >> 11 hdd 0.90999 osd.11up 1.0 1.0 >> 12 hdd 0.90999 osd.12up 1.0 1.0 >> 6 ssd 0.43700 osd.6 up 1.0 1.0 >> 7 ssd 0.43700 osd.7 up 1.0 1.0 >> 8 ssd 0.43700 osd.8 up 1.0 1.0 >> -119.86589 rack dc01-rack03 >> -225.38794 host node1003 >> 17 hdd 0.90999 osd.17up 1.0 1.0 >> 18 hdd 0.90999 osd.18up 1.0 1.0 >> 24 hdd 0.90999 osd.24up 1.0 1.0 >> 26 hdd 0.90999 osd.26up 1.0 1.0 >> 13 ssd 0.43700 osd.13up 1.0 1.0 >> 14 ssd 0.43700 osd.14up 1.0 1.0 >> 15 ssd 0.43700 osd.15up 1.0 1.0 >> 16 ssd 0.43700 osd.16up 1.0 1.0 >> -254.47795 host node1004 >> 23 hdd 0.90999 osd.23up 1.0 1.0 >> 25 hdd 0.90999 osd.25up 1.0 1.0 >> 27 hdd 0.90999 osd.27up 1.0 1.0 >> 19 ssd 0.43700 osd.19up 1.0 1.0 >> 20 ssd 0.43700 osd.20up 1.0 1.0 >> 21 ssd 0.43700 osd.21up 1.0 1.0 >> 22 ssd 0.43700 osd.22up 1.0 1.0 >> >> >> Pools are size 2, min_size 1 during setup. >> >> The count of PGs in activate state are related to the weight of OSDs but >> why are they failing to proceed to active+clean or active+remapped? 
>> >> Kind regards, >> Kevin
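Since the thread mentions empty "blocked by" rows: `ceph pg <pgid> query` returns JSON, and the relevant fields can be pulled out directly. The sample below is hypothetical and heavily abridged; the field layout (top-level `state`, `blocked_by` under `info.stats`) is what I would expect, but treat the exact paths as assumptions and check them against real query output.

```python
import json

# Hypothetical, abridged `ceph pg 1.0 query` output -- values invented,
# field layout mirrors what the thread describes.
sample = """
{
  "state": "activating+remapped",
  "info": {
    "stats": {
      "blocked_by": []
    }
  }
}
"""

pg = json.loads(sample)
state = pg["state"]
blocked_by = pg["info"]["stats"]["blocked_by"]

# An empty blocked_by list (as Kevin saw) means the PG reports no OSD it
# is waiting on, even though it is not active+clean.
print(state, blocked_by)
```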
Re: [ceph-users] Blocked requests activating+remapped after extending pg(p)_num
I was able to obtain another NVMe to get the HDDs in node1004 into the cluster.
The number of disks (all 1TB) is now balanced between racks, still some inactive PGs:

  data:
    pools:   2 pools, 1536 pgs
    objects: 639k objects, 2554 GB
    usage:   5167 GB used, 14133 GB / 19300 GB avail
    pgs:     1.562% pgs not active
             1183/1309952 objects degraded (0.090%)
             199660/1309952 objects misplaced (15.242%)
             1072 active+clean
             405  active+remapped+backfill_wait
             35   active+remapped+backfilling
             21   activating+remapped
             3    activating+undersized+degraded+remapped

ID  CLASS  WEIGHT    TYPE NAME                      STATUS  REWEIGHT  PRI-AFF
 -1        18.85289  root default
-16        18.85289      datacenter dc01
-19        18.85289          pod dc01-agg01
-10         8.98700              rack dc01-rack02
 -4         4.03899                  host node1001
  0   hdd   0.90999                      osd.0          up  1.0  1.0
  1   hdd   0.90999                      osd.1          up  1.0  1.0
  5   hdd   0.90999                      osd.5          up  1.0  1.0
  2   ssd   0.43700                      osd.2          up  1.0  1.0
  3   ssd   0.43700                      osd.3          up  1.0  1.0
  4   ssd   0.43700                      osd.4          up  1.0  1.0
 -7         4.94899                  host node1002
  9   hdd   0.90999                      osd.9          up  1.0  1.0
 10   hdd   0.90999                      osd.10         up  1.0  1.0
 11   hdd   0.90999                      osd.11         up  1.0  1.0
 12   hdd   0.90999                      osd.12         up  1.0  1.0
  6   ssd   0.43700                      osd.6          up  1.0  1.0
  7   ssd   0.43700                      osd.7          up  1.0  1.0
  8   ssd   0.43700                      osd.8          up  1.0  1.0
-11         9.86589              rack dc01-rack03
-22         5.38794                  host node1003
 17   hdd   0.90999                      osd.17         up  1.0  1.0
 18   hdd   0.90999                      osd.18         up  1.0  1.0
 24   hdd   0.90999                      osd.24         up  1.0  1.0
 26   hdd   0.90999                      osd.26         up  1.0  1.0
 13   ssd   0.43700                      osd.13         up  1.0  1.0
 14   ssd   0.43700                      osd.14         up  1.0  1.0
 15   ssd   0.43700                      osd.15         up  1.0  1.0
 16   ssd   0.43700                      osd.16         up  1.0  1.0
-25         4.47795                  host node1004
 23   hdd   0.90999                      osd.23         up  1.0  1.0
 25   hdd   0.90999                      osd.25         up  1.0  1.0
 27   hdd   0.90999                      osd.27         up  1.0  1.0
 19   ssd   0.43700                      osd.19         up  1.0  1.0
 20   ssd   0.43700                      osd.20         up  1.0  1.0
 21   ssd   0.43700                      osd.21         up  1.0  1.0
 22   ssd   0.43700                      osd.22         up  1.0  1.0

Pools are size 2, min_size 1 during setup.

The count of PGs in activating state is related to the weight of the OSDs, but why are they failing to proceed to active+clean or active+remapped?
Kind regards, Kevin 2018-05-17 14:05 GMT+02:00 Kevin Olbrich <k...@sv01.de>: > Ok, I just waited some time but I still got some "activating" issues: > > data: > pools: 2 pools, 1536 pgs > objects: 639k objects, 2554 GB > usage: 5194 GB used, 11312 GB / 16506 GB avail > pgs: 7.943% pgs not active > 5567/1309948 objects degraded (0.425%) > 195386/1309948 objects misplaced (14.916%) > 1147 active+clean > 235 active+remapped+backfill_wait > * 107 activating+remapped* > 32 active+remapped+backfilling > * 15 activating+undersized+degraded+remapped* > > I set these settings during runtime: > ceph tell 'osd.*' injectargs '--osd-max-backfills 16' > ceph tell 'osd.*' injectargs '--osd-recovery-max-active 4' > ceph tell 'mon.*' injectargs '--mon_max_pg_per_osd 800' > ceph tell 'osd.*' injectargs '--osd_max_pg_per_osd_hard_ratio 32' > > Sure, mon_max_pg_per_osd is oversized but this is just temporary. > Calculated PGs per OSD is 200. > > I searched the net and the bugtracker but most posts suggest > osd_max_pg_per_osd_hard_ratio = 32 to fix this issue but this time, I got > more stuck PGs. > > Any more hints? > > Kind regards. > Kevin > > 2018-05-17 13:37 GMT+02:00 Kevin Olbrich <k...@sv01.de>: > >> PS: Cluster currently is size 2, I used PGCalc on Ceph website which, by >> default, will place 200 PGs on each OSD. >> I read about the protection in
Re: [ceph-users] Blocked requests activating+remapped after extending pg(p)_num
Ok, I just waited some time but I still got some "activating" issues: data: pools: 2 pools, 1536 pgs objects: 639k objects, 2554 GB usage: 5194 GB used, 11312 GB / 16506 GB avail pgs: 7.943% pgs not active 5567/1309948 objects degraded (0.425%) 195386/1309948 objects misplaced (14.916%) 1147 active+clean 235 active+remapped+backfill_wait * 107 activating+remapped* 32 active+remapped+backfilling * 15 activating+undersized+degraded+remapped* I set these settings during runtime: ceph tell 'osd.*' injectargs '--osd-max-backfills 16' ceph tell 'osd.*' injectargs '--osd-recovery-max-active 4' ceph tell 'mon.*' injectargs '--mon_max_pg_per_osd 800' ceph tell 'osd.*' injectargs '--osd_max_pg_per_osd_hard_ratio 32' Sure, mon_max_pg_per_osd is oversized but this is just temporary. Calculated PGs per OSD is 200. I searched the net and the bugtracker but most posts suggest osd_max_pg_per_osd_hard_ratio = 32 to fix this issue but this time, I got more stuck PGs. Any more hints? Kind regards. Kevin 2018-05-17 13:37 GMT+02:00 Kevin Olbrich <k...@sv01.de>: > PS: Cluster currently is size 2, I used PGCalc on Ceph website which, by > default, will place 200 PGs on each OSD. > I read about the protection in the docs and later noticed that I better > had only placed 100 PGs. > > > 2018-05-17 13:35 GMT+02:00 Kevin Olbrich <k...@sv01.de>: > >> Hi! >> >> Thanks for your quick reply. 
>> Before I read your mail, i applied the following conf to my OSDs: >> ceph tell 'osd.*' injectargs '--osd_max_pg_per_osd_hard_ratio 32' >> >> Status is now: >> data: >> pools: 2 pools, 1536 pgs >> objects: 639k objects, 2554 GB >> usage: 5211 GB used, 11295 GB / 16506 GB avail >> pgs: 7.943% pgs not active >> 5567/1309948 objects degraded (0.425%) >> 252327/1309948 objects misplaced (19.262%) >> 1030 active+clean >> 351 active+remapped+backfill_wait >> 107 activating+remapped >> 33 active+remapped+backfilling >> 15 activating+undersized+degraded+remapped >> >> A little bit better but still some non-active PGs. >> I will investigate your other hints! >> >> Thanks >> Kevin >> >> 2018-05-17 13:30 GMT+02:00 Burkhard Linke <Burkhard.Linke@computational. >> bio.uni-giessen.de>: >> >>> Hi, >>> >>> >>> >>> On 05/17/2018 01:09 PM, Kevin Olbrich wrote: >>> >>>> Hi! >>>> >>>> Today I added some new OSDs (nearly doubled) to my luminous cluster. >>>> I then changed pg(p)_num from 256 to 1024 for that pool because it was >>>> complaining about to few PGs. (I noticed that should better have been >>>> small >>>> changes). 
>>>> >>>> This is the current status: >>>> >>>> health: HEALTH_ERR >>>> 336568/1307562 objects misplaced (25.740%) >>>> Reduced data availability: 128 pgs inactive, 3 pgs >>>> peering, 1 >>>> pg stale >>>> Degraded data redundancy: 6985/1307562 objects degraded >>>> (0.534%), 19 pgs degraded, 19 pgs undersized >>>> 107 slow requests are blocked > 32 sec >>>> 218 stuck requests are blocked > 4096 sec >>>> >>>>data: >>>> pools: 2 pools, 1536 pgs >>>> objects: 638k objects, 2549 GB >>>> usage: 5210 GB used, 11295 GB / 16506 GB avail >>>> pgs: 0.195% pgs unknown >>>> 8.138% pgs not active >>>> 6985/1307562 objects degraded (0.534%) >>>> 336568/1307562 objects misplaced (25.740%) >>>> 855 active+clean >>>> 517 active+remapped+backfill_wait >>>> 107 activating+remapped >>>> 31 active+remapped+backfilling >>>> 15 activating+undersized+degraded+remapped >>>> 4 active+undersized+degraded+remapped+backfilling >>>> 3 unknown >>>> 3 peering >>>> 1 stale+active+clean >>>> >>> >>> You need to resolve the unknown/peering/activating pgs first. You have >>> 1536 PGs, assuming replication size 3 this make 4608 PG copies. Given 25 >>> OSDs and the heterogenous host sizes,
Re: [ceph-users] Blocked requests activating+remapped after extending pg(p)_num
PS: Cluster currently is size 2, I used PGCalc on Ceph website which, by default, will place 200 PGs on each OSD. I read about the protection in the docs and later noticed that I better had only placed 100 PGs. 2018-05-17 13:35 GMT+02:00 Kevin Olbrich <k...@sv01.de>: > Hi! > > Thanks for your quick reply. > Before I read your mail, i applied the following conf to my OSDs: > ceph tell 'osd.*' injectargs '--osd_max_pg_per_osd_hard_ratio 32' > > Status is now: > data: > pools: 2 pools, 1536 pgs > objects: 639k objects, 2554 GB > usage: 5211 GB used, 11295 GB / 16506 GB avail > pgs: 7.943% pgs not active > 5567/1309948 objects degraded (0.425%) > 252327/1309948 objects misplaced (19.262%) > 1030 active+clean > 351 active+remapped+backfill_wait > 107 activating+remapped > 33 active+remapped+backfilling > 15 activating+undersized+degraded+remapped > > A little bit better but still some non-active PGs. > I will investigate your other hints! > > Thanks > Kevin > > 2018-05-17 13:30 GMT+02:00 Burkhard Linke <Burkhard.Linke@computational. > bio.uni-giessen.de>: > >> Hi, >> >> >> >> On 05/17/2018 01:09 PM, Kevin Olbrich wrote: >> >>> Hi! >>> >>> Today I added some new OSDs (nearly doubled) to my luminous cluster. >>> I then changed pg(p)_num from 256 to 1024 for that pool because it was >>> complaining about to few PGs. (I noticed that should better have been >>> small >>> changes). 
>>> >>> This is the current status: >>> >>> health: HEALTH_ERR >>> 336568/1307562 objects misplaced (25.740%) >>> Reduced data availability: 128 pgs inactive, 3 pgs peering, >>> 1 >>> pg stale >>> Degraded data redundancy: 6985/1307562 objects degraded >>> (0.534%), 19 pgs degraded, 19 pgs undersized >>> 107 slow requests are blocked > 32 sec >>> 218 stuck requests are blocked > 4096 sec >>> >>>data: >>> pools: 2 pools, 1536 pgs >>> objects: 638k objects, 2549 GB >>> usage: 5210 GB used, 11295 GB / 16506 GB avail >>> pgs: 0.195% pgs unknown >>> 8.138% pgs not active >>> 6985/1307562 objects degraded (0.534%) >>> 336568/1307562 objects misplaced (25.740%) >>> 855 active+clean >>> 517 active+remapped+backfill_wait >>> 107 activating+remapped >>> 31 active+remapped+backfilling >>> 15 activating+undersized+degraded+remapped >>> 4 active+undersized+degraded+remapped+backfilling >>> 3 unknown >>> 3 peering >>> 1 stale+active+clean >>> >> >> You need to resolve the unknown/peering/activating pgs first. You have >> 1536 PGs, assuming replication size 3 this make 4608 PG copies. Given 25 >> OSDs and the heterogenous host sizes, I assume that some OSDs hold more >> than 200 PGs. There's a threshold for the number of PGs; reaching this >> threshold keeps the OSDs from accepting new PGs. >> >> Try to increase the threshold (mon_max_pg_per_osd / >> max_pg_per_osd_hard_ratio / osd_max_pg_per_osd_hard_ratio, not sure about >> the exact one, consult the documentation) to allow more PGs on the OSDs. If >> this is the cause of the problem, the peering and activating states should >> be resolved within a short time. >> >> You can also check the number of PGs per OSD with 'ceph osd df'; the last >> column is the current number of PGs. 
>> >> >>> >>> OSD tree: >>> >>> ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF >>> -1 16.12177 root default >>> -16 16.12177 datacenter dc01 >>> -19 16.12177 pod dc01-agg01 >>> -108.98700 rack dc01-rack02 >>> -44.03899 host node1001 >>>0 hdd 0.90999 osd.0 up 1.0 1.0 >>>1 hdd 0.90999 osd.1 up 1.0 1.0 >>>5 hdd 0.90999 osd.5 up 1.0 1.0 >>>2 ssd 0.43700 osd.2 up 1.0 1.0 >>>3 ssd 0.43700 osd.3 up 1.0 1.0 >>>
Re: [ceph-users] Blocked requests activating+remapped after extending pg(p)_num
Hi! Thanks for your quick reply. Before I read your mail, i applied the following conf to my OSDs: ceph tell 'osd.*' injectargs '--osd_max_pg_per_osd_hard_ratio 32' Status is now: data: pools: 2 pools, 1536 pgs objects: 639k objects, 2554 GB usage: 5211 GB used, 11295 GB / 16506 GB avail pgs: 7.943% pgs not active 5567/1309948 objects degraded (0.425%) 252327/1309948 objects misplaced (19.262%) 1030 active+clean 351 active+remapped+backfill_wait 107 activating+remapped 33 active+remapped+backfilling 15 activating+undersized+degraded+remapped A little bit better but still some non-active PGs. I will investigate your other hints! Thanks Kevin 2018-05-17 13:30 GMT+02:00 Burkhard Linke < burkhard.li...@computational.bio.uni-giessen.de>: > Hi, > > > > On 05/17/2018 01:09 PM, Kevin Olbrich wrote: > >> Hi! >> >> Today I added some new OSDs (nearly doubled) to my luminous cluster. >> I then changed pg(p)_num from 256 to 1024 for that pool because it was >> complaining about to few PGs. (I noticed that should better have been >> small >> changes). 
>> >> This is the current status: >> >> health: HEALTH_ERR >> 336568/1307562 objects misplaced (25.740%) >> Reduced data availability: 128 pgs inactive, 3 pgs peering, 1 >> pg stale >> Degraded data redundancy: 6985/1307562 objects degraded >> (0.534%), 19 pgs degraded, 19 pgs undersized >> 107 slow requests are blocked > 32 sec >> 218 stuck requests are blocked > 4096 sec >> >>data: >> pools: 2 pools, 1536 pgs >> objects: 638k objects, 2549 GB >> usage: 5210 GB used, 11295 GB / 16506 GB avail >> pgs: 0.195% pgs unknown >> 8.138% pgs not active >> 6985/1307562 objects degraded (0.534%) >> 336568/1307562 objects misplaced (25.740%) >> 855 active+clean >> 517 active+remapped+backfill_wait >> 107 activating+remapped >> 31 active+remapped+backfilling >> 15 activating+undersized+degraded+remapped >> 4 active+undersized+degraded+remapped+backfilling >> 3 unknown >> 3 peering >> 1 stale+active+clean >> > > You need to resolve the unknown/peering/activating pgs first. You have > 1536 PGs, assuming replication size 3 this make 4608 PG copies. Given 25 > OSDs and the heterogenous host sizes, I assume that some OSDs hold more > than 200 PGs. There's a threshold for the number of PGs; reaching this > threshold keeps the OSDs from accepting new PGs. > > Try to increase the threshold (mon_max_pg_per_osd / > max_pg_per_osd_hard_ratio / osd_max_pg_per_osd_hard_ratio, not sure about > the exact one, consult the documentation) to allow more PGs on the OSDs. If > this is the cause of the problem, the peering and activating states should > be resolved within a short time. > > You can also check the number of PGs per OSD with 'ceph osd df'; the last > column is the current number of PGs. 
> > >> >> OSD tree: >> >> ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF >> -1 16.12177 root default >> -16 16.12177 datacenter dc01 >> -19 16.12177 pod dc01-agg01 >> -108.98700 rack dc01-rack02 >> -44.03899 host node1001 >>0 hdd 0.90999 osd.0 up 1.0 1.0 >>1 hdd 0.90999 osd.1 up 1.0 1.0 >>5 hdd 0.90999 osd.5 up 1.0 1.0 >>2 ssd 0.43700 osd.2 up 1.0 1.0 >>3 ssd 0.43700 osd.3 up 1.0 1.0 >>4 ssd 0.43700 osd.4 up 1.0 1.0 >> -74.94899 host node1002 >>9 hdd 0.90999 osd.9 up 1.0 1.0 >> 10 hdd 0.90999 osd.10up 1.0 1.0 >> 11 hdd 0.90999 osd.11up 1.0 1.0 >> 12 hdd 0.90999 osd.12up 1.0 1.0 >>6 ssd 0.43700 osd.6 up 1.0 1.0 >>7 ssd 0.43700 osd.7 up 1.0 1.0 >>8 ssd 0.43700 osd.8 up 1.0 1.0 >> -11
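Burkhard's PG-copy arithmetic is easy to sanity-check with the numbers from this thread (1536 PGs, 25 OSDs), for both the actual size 2 and the assumed size 3:

```python
pgs = 1536   # total PGs across both pools (from the status output)
osds = 25    # OSDs listed in the tree

# Total PG copies and the average per OSD for both replication sizes
# discussed in the thread.
for size in (2, 3):
    copies = pgs * size
    print(size, copies, round(copies / osds, 1))
```

With size 3 this gives the 4608 copies Burkhard mentions, i.e. roughly 184 PGs per OSD on average. Because the hosts are not equally sized, individual OSDs can sit well above the average, which is how they can bump into the mon_max_pg_per_osd ceiling (200 by default on luminous, if I recall correctly) even when the average looks safe.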
[ceph-users] Blocked requests activating+remapped after extending pg(p)_num
Hi!

Today I added some new OSDs (nearly doubled) to my luminous cluster.
I then changed pg(p)_num from 256 to 1024 for that pool because it was complaining about too few PGs. (I noticed that should better have been small changes.)

This is the current status:

  health: HEALTH_ERR
          336568/1307562 objects misplaced (25.740%)
          Reduced data availability: 128 pgs inactive, 3 pgs peering, 1 pg stale
          Degraded data redundancy: 6985/1307562 objects degraded (0.534%), 19 pgs degraded, 19 pgs undersized
          107 slow requests are blocked > 32 sec
          218 stuck requests are blocked > 4096 sec

  data:
    pools:   2 pools, 1536 pgs
    objects: 638k objects, 2549 GB
    usage:   5210 GB used, 11295 GB / 16506 GB avail
    pgs:     0.195% pgs unknown
             8.138% pgs not active
             6985/1307562 objects degraded (0.534%)
             336568/1307562 objects misplaced (25.740%)
             855 active+clean
             517 active+remapped+backfill_wait
             107 activating+remapped
             31  active+remapped+backfilling
             15  activating+undersized+degraded+remapped
             4   active+undersized+degraded+remapped+backfilling
             3   unknown
             3   peering
             1   stale+active+clean

OSD tree:

ID  CLASS  WEIGHT    TYPE NAME                      STATUS  REWEIGHT  PRI-AFF
 -1        16.12177  root default
-16        16.12177      datacenter dc01
-19        16.12177          pod dc01-agg01
-10         8.98700              rack dc01-rack02
 -4         4.03899                  host node1001
  0   hdd   0.90999                      osd.0          up  1.0  1.0
  1   hdd   0.90999                      osd.1          up  1.0  1.0
  5   hdd   0.90999                      osd.5          up  1.0  1.0
  2   ssd   0.43700                      osd.2          up  1.0  1.0
  3   ssd   0.43700                      osd.3          up  1.0  1.0
  4   ssd   0.43700                      osd.4          up  1.0  1.0
 -7         4.94899                  host node1002
  9   hdd   0.90999                      osd.9          up  1.0  1.0
 10   hdd   0.90999                      osd.10         up  1.0  1.0
 11   hdd   0.90999                      osd.11         up  1.0  1.0
 12   hdd   0.90999                      osd.12         up  1.0  1.0
  6   ssd   0.43700                      osd.6          up  1.0  1.0
  7   ssd   0.43700                      osd.7          up  1.0  1.0
  8   ssd   0.43700                      osd.8          up  1.0  1.0
-11         7.13477              rack dc01-rack03
-22         5.38678                  host node1003
 17   hdd   0.90970                      osd.17         up  1.0  1.0
 18   hdd   0.90970                      osd.18         up  1.0  1.0
 24   hdd   0.90970                      osd.24         up  1.0  1.0
 26   hdd   0.90970                      osd.26         up  1.0  1.0
 13   ssd   0.43700                      osd.13         up  1.0  1.0
 14   ssd   0.43700                      osd.14         up  1.0  1.0
 15   ssd   0.43700                      osd.15         up  1.0  1.0
 16   ssd   0.43700                      osd.16         up  1.0  1.0
-25         1.74799                  host node1004
 19   ssd   0.43700                      osd.19         up  1.0  1.0
 20   ssd   0.43700                      osd.20         up  1.0  1.0
 21   ssd   0.43700                      osd.21         up  1.0  1.0
 22   ssd   0.43700                      osd.22         up  1.0  1.0

Crush rule is set to chooseleaf rack and (temporarily!) to size 2.

Why are PGs stuck in peering and activating? "ceph df" shows that only 1.5 TB are used on the pool, residing on the HDDs - which would perfectly fit the crush rule(?)

Is this only a problem during recovery (so the cluster moves to OK after rebalancing) or can I take any action to unblock IO on the hdd pool? This is a pre-prod cluster, it does not have highest prio, but I would appreciate it if we were able to use it before rebalancing is completed.

Kind regards,
Kevin
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] read_fsid unparsable uuid
Hi!

Yesterday I deployed 3x SSDs as OSDs fine, but today I get this error when deploying an HDD with separated WAL/DB:

stderr: 2018-04-26 11:58:19.531966 7fe57e5f5e00 -1 bluestore(/var/lib/ceph/osd/ceph-0/) _read_fsid unparsable uuid

Command: ceph-deploy --overwrite-conf osd create --dmcrypt --bluestore --data /dev/sde --block-db /dev/nvme0n1p1 --block-wal /dev/nvme0n1p1 node1001.ceph01.example.com

Seems related to: http://tracker.ceph.com/issues/15386

I am using an Intel P3700 NVMe. Any ideas?

Kind regards,
Kevin
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
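In similar read_fsid cases the cause was often stale metadata left on a reused disk or DB partition. A possible cleanup sketch before retrying the deploy; the device paths are the ones from the failing command, and while `ceph-volume lvm zap` exists on luminous, verify its flags on your build:

```shell
# Remove leftover LVM/BlueStore metadata from the data disk.
ceph-volume lvm zap /dev/sde

# Clear stale signatures from the NVMe partition meant for the DB/WAL.
wipefs -a /dev/nvme0n1p1
dd if=/dev/zero of=/dev/nvme0n1p1 bs=1M count=10 oflag=direct
```

Also note that the failing command points --block-db and --block-wal at the same partition; with a colocated DB/WAL, passing only --block-db is usually sufficient.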
Re: [ceph-users] Where to place Block-DB?
>>What happens im the NVMe dies? >You lost OSDs backed by that NVMe and need to re-add them to cluster. With data located on the OSD (recovery) or as fresh formatted OSD? Thank you. - Kevin 2018-04-26 12:36 GMT+02:00 Serkan Çoban <cobanser...@gmail.com>: > >On bluestore, is it safe to move both Block-DB and WAL to this journal > NVMe? > Yes, just specify block-db with ceph-volume and wal also use that > partition. You can put 12-18 HDDs per NVMe > > >What happens im the NVMe dies? > You lost OSDs backed by that NVMe and need to re-add them to cluster. > > On Thu, Apr 26, 2018 at 12:58 PM, Kevin Olbrich <k...@sv01.de> wrote: > > Hi! > > > > On a small cluster I have an Intel P3700 as the journaling device for 4 > > HDDs. > > While using filestore, I used it as journal. > > > > On bluestore, is it safe to move both Block-DB and WAL to this journal > NVMe? > > Easy maintenance is first priority (on filestore we just had to flush and > > replace the SSD). > > > > What happens im the NVMe dies? > > > > Thank you. > > > > - Kevin > > > > ___ > > ceph-users mailing list > > ceph-users@lists.ceph.com > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Where to place Block-DB?
Hi!

On a small cluster I have an Intel P3700 as the journaling device for 4 HDDs. While using filestore, I used it as the journal.

On bluestore, is it safe to move both the block DB and the WAL to this journal NVMe? Easy maintenance is first priority (on filestore we just had to flush and replace the SSD).

What happens if the NVMe dies?

Thank you.

- Kevin
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
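Serkan's advice ("just specify block-db, the WAL also uses that partition") translates to something like the following with ceph-volume; device names are placeholders, and when no separate --block.wal is given, BlueStore keeps the WAL inside the DB volume:

```shell
# Data on the HDD, DB (and implicitly the WAL) on an NVMe partition.
ceph-volume lvm create --bluestore \
    --data /dev/sdb \
    --block.db /dev/nvme0n1p1
```

On the failure question: if the NVMe dies, every OSD whose DB lived on it is lost and has to be redeployed as a fresh OSD; the data then comes back via recovery from the remaining replicas, not from the dead OSD.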
[ceph-users] Backup LUKS/Dmcrypt keys
Hi, how can I back up the dmcrypt keys on luminous? The folder under /etc/ceph does not exist anymore. Kind regards Kevin ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
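As far as I know, ceph-disk on luminous stores the dmcrypt/LUKS keys in the monitors' config-key store rather than under /etc/ceph. A sketch for exporting them; the dm-crypt/osd/<uuid>/luks key path is from memory, so verify it against the ls output first:

```shell
# See which dm-crypt keys the mons hold.
ceph config-key ls | grep dm-crypt

# Export one key for safekeeping (replace <osd-uuid> with a real uuid from the list).
ceph config-key get dm-crypt/osd/<osd-uuid>/luks > osd-<osd-uuid>.luks.key
```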
Re: [ceph-users] Some monitors have still not reached quorum
I found a fix: it is *mandatory* to set the public network to the same network the mons use. Skipping this while the mon host has another network interface writes garbage into the monmap.

- Kevin

2018-02-23 11:38 GMT+01:00 Kevin Olbrich <k...@sv01.de>:
> I always see this:
>
> [mon01][DEBUG ] "mons": [
> [mon01][DEBUG ]   {
> [mon01][DEBUG ]     "addr": "[fd91:462b:4243:47e::1:1]:6789/0",
> [mon01][DEBUG ]     "name": "mon01",
> [mon01][DEBUG ]     "public_addr": "[fd91:462b:4243:47e::1:1]:6789/0",
> [mon01][DEBUG ]     "rank": 0
> [mon01][DEBUG ]   },
> [mon01][DEBUG ]   {
> [mon01][DEBUG ]     "addr": "0.0.0.0:0/1",
> [mon01][DEBUG ]     "name": "mon02",
> [mon01][DEBUG ]     "public_addr": "0.0.0.0:0/1",
> [mon01][DEBUG ]     "rank": 1
> [mon01][DEBUG ]   },
> [mon01][DEBUG ]   {
> [mon01][DEBUG ]     "addr": "0.0.0.0:0/2",
> [mon01][DEBUG ]     "name": "mon03",
> [mon01][DEBUG ]     "public_addr": "0.0.0.0:0/2",
> [mon01][DEBUG ]     "rank": 2
> [mon01][DEBUG ]   }
> [mon01][DEBUG ] ]
>
> DNS is working fine and the hostnames are also listed in /etc/hosts.
> I already purged the mon but still the same problem.
>
> - Kevin
>
> 2018-02-23 10:26 GMT+01:00 Kevin Olbrich <k...@sv01.de>:
>> Hi!
>>
>> On a new cluster, I get the following error. All 3x mons are connected to
>> the same switch and ping between them works (firewalls disabled).
>> Mon-nodes are Ubuntu 16.04 LTS on Ceph Luminous.
>> >> >> [ceph_deploy.mon][ERROR ] Some monitors have still not reached quorum: >> [ceph_deploy.mon][ERROR ] mon03 >> [ceph_deploy.mon][ERROR ] mon02 >> [ceph_deploy.mon][ERROR ] mon01 >> >> >> root@adminnode:~# cat ceph.conf >> [global] >> fsid = 2689defb-8715-47bb-8d78-e862089adf7a >> ms_bind_ipv6 = true >> mon_initial_members = mon01, mon02, mon03 >> mon_host = [fd91:462b:4243:47e::1:1],[fd91:462b:4243:47e::1:2],[fd91: >> 462b:4243:47e::1:3] >> auth_cluster_required = cephx >> auth_service_required = cephx >> auth_client_required = cephx >> public network = fdd1:ecbd:731f:ee8e::/64 >> cluster network = fd91:462b:4243:47e::/64 >> >> >> root@mon01:~# ip a >> 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group >> default qlen 1000 >> link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 >> inet 127.0.0.1/8 scope host lo >>valid_lft forever preferred_lft forever >> inet6 ::1/128 scope host >>valid_lft forever preferred_lft forever >> 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc pfifo_fast >> state UP group default qlen 1000 >> link/ether b8:ae:ed:e9:b6:61 brd ff:ff:ff:ff:ff:ff >> inet 172.17.1.1/16 brd 172.17.255.255 scope global eth0 >>valid_lft forever preferred_lft forever >> inet6 fd91:462b:4243:47e::1:1/64 scope global >>valid_lft forever preferred_lft forever >> inet6 fe80::baae:edff:fee9:b661/64 scope link >>valid_lft forever preferred_lft forever >> 3: wlan0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group >> default qlen 1000 >> link/ether 00:db:df:64:34:d5 brd ff:ff:ff:ff:ff:ff >> 4: eth0.22@eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc >> noqueue state UP group default qlen 1000 >> link/ether b8:ae:ed:e9:b6:61 brd ff:ff:ff:ff:ff:ff >> inet6 fdd1:ecbd:731f:ee8e::1:1/64 scope global >>valid_lft forever preferred_lft forever >> inet6 fe80::baae:edff:fee9:b661/64 scope link >>valid_lft forever preferred_lft forever >> >> >> Don't mind wlan0, thats because this node is built from an Intel NUC. 
>> >> Any idea? >> >> Kind regards >> Kevin >> > > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
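The mismatch behind the fix is visible in the original ceph.conf: the mons bind to addresses in fd91:462b:4243:47e::/64, while `public network` was set to the other subnet (fdd1:ecbd:731f:ee8e::/64). A quick check with Python's ipaddress module, using the addresses from the thread:

```python
from ipaddress import ip_address, ip_network

mon_addr = ip_address("fd91:462b:4243:47e::1:1")        # mon01 from the monmap

wrong_public = ip_network("fdd1:ecbd:731f:ee8e::/64")   # 'public network' as configured
right_public = ip_network("fd91:462b:4243:47e::/64")    # the net the mons actually use

print(mon_addr in wrong_public)   # False -> mon address outside the public network
print(mon_addr in right_public)   # True  -> what 'public network' should be set to
```

Since the mon address is not inside the configured public network, the other mons get registered with 0.0.0.0 placeholder addresses, which matches the broken monmap in this thread.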
Re: [ceph-users] Some monitors have still not reached quorum
I always see this:

[mon01][DEBUG ] "mons": [
[mon01][DEBUG ]   {
[mon01][DEBUG ]     "addr": "[fd91:462b:4243:47e::1:1]:6789/0",
[mon01][DEBUG ]     "name": "mon01",
[mon01][DEBUG ]     "public_addr": "[fd91:462b:4243:47e::1:1]:6789/0",
[mon01][DEBUG ]     "rank": 0
[mon01][DEBUG ]   },
[mon01][DEBUG ]   {
[mon01][DEBUG ]     "addr": "0.0.0.0:0/1",
[mon01][DEBUG ]     "name": "mon02",
[mon01][DEBUG ]     "public_addr": "0.0.0.0:0/1",
[mon01][DEBUG ]     "rank": 1
[mon01][DEBUG ]   },
[mon01][DEBUG ]   {
[mon01][DEBUG ]     "addr": "0.0.0.0:0/2",
[mon01][DEBUG ]     "name": "mon03",
[mon01][DEBUG ]     "public_addr": "0.0.0.0:0/2",
[mon01][DEBUG ]     "rank": 2
[mon01][DEBUG ]   }
[mon01][DEBUG ] ]

DNS is working fine and the hostnames are also listed in /etc/hosts. I already purged the mon but still the same problem.

- Kevin

2018-02-23 10:26 GMT+01:00 Kevin Olbrich <k...@sv01.de>:
> Hi!
>
> On a new cluster, I get the following error. All 3x mons are connected to
> the same switch and ping between them works (firewalls disabled).
> Mon-nodes are Ubuntu 16.04 LTS on Ceph Luminous.
> > > [ceph_deploy.mon][ERROR ] Some monitors have still not reached quorum: > [ceph_deploy.mon][ERROR ] mon03 > [ceph_deploy.mon][ERROR ] mon02 > [ceph_deploy.mon][ERROR ] mon01 > > > root@adminnode:~# cat ceph.conf > [global] > fsid = 2689defb-8715-47bb-8d78-e862089adf7a > ms_bind_ipv6 = true > mon_initial_members = mon01, mon02, mon03 > mon_host = [fd91:462b:4243:47e::1:1],[fd91:462b:4243:47e::1:2],[ > fd91:462b:4243:47e::1:3] > auth_cluster_required = cephx > auth_service_required = cephx > auth_client_required = cephx > public network = fdd1:ecbd:731f:ee8e::/64 > cluster network = fd91:462b:4243:47e::/64 > > > root@mon01:~# ip a > 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group > default qlen 1000 > link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 > inet 127.0.0.1/8 scope host lo >valid_lft forever preferred_lft forever > inet6 ::1/128 scope host >valid_lft forever preferred_lft forever > 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc pfifo_fast > state UP group default qlen 1000 > link/ether b8:ae:ed:e9:b6:61 brd ff:ff:ff:ff:ff:ff > inet 172.17.1.1/16 brd 172.17.255.255 scope global eth0 >valid_lft forever preferred_lft forever > inet6 fd91:462b:4243:47e::1:1/64 scope global >valid_lft forever preferred_lft forever > inet6 fe80::baae:edff:fee9:b661/64 scope link >valid_lft forever preferred_lft forever > 3: wlan0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group > default qlen 1000 > link/ether 00:db:df:64:34:d5 brd ff:ff:ff:ff:ff:ff > 4: eth0.22@eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue > state UP group default qlen 1000 > link/ether b8:ae:ed:e9:b6:61 brd ff:ff:ff:ff:ff:ff > inet6 fdd1:ecbd:731f:ee8e::1:1/64 scope global >valid_lft forever preferred_lft forever > inet6 fe80::baae:edff:fee9:b661/64 scope link >valid_lft forever preferred_lft forever > > > Don't mind wlan0, thats because this node is built from an Intel NUC. > > Any idea? 
> > Kind regards > Kevin > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
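The `"addr": "0.0.0.0:0/1"` entries in the monmap above mean mon01 never learned its peers' addresses. A minimal check list, as a sketch only — the IPv6 addresses are taken from the quoted ceph.conf, and the exact commands depend on the distribution:

```shell
# On each mon node: is the monitor actually listening on its IPv6 address?
ss -ltn 'sport = :6789'

# From mon01: are the other mons reachable on the mon port?
nc -6 -zv fd91:462b:4243:47e::1:2 6789
nc -6 -zv fd91:462b:4243:47e::1:3 6789

# Ask the local mon daemon for its own view of the quorum:
ceph daemon mon.mon01 mon_status
```

If the daemons are listening but unreachable from each other, the problem is usually on the network side (firewall, routing, or — as found later in this archive — IPv6 DAD delaying the address).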
[ceph-users] Some monitors have still not reached quorum
Hi!

On a new cluster, I get the following error. All 3x mons are connected to the same switch and ping between them works (firewalls disabled). Mon-nodes are Ubuntu 16.04 LTS on Ceph Luminous.

[ceph_deploy.mon][ERROR ] Some monitors have still not reached quorum:
[ceph_deploy.mon][ERROR ] mon03
[ceph_deploy.mon][ERROR ] mon02
[ceph_deploy.mon][ERROR ] mon01

root@adminnode:~# cat ceph.conf
[global]
fsid = 2689defb-8715-47bb-8d78-e862089adf7a
ms_bind_ipv6 = true
mon_initial_members = mon01, mon02, mon03
mon_host = [fd91:462b:4243:47e::1:1],[fd91:462b:4243:47e::1:2],[fd91:462b:4243:47e::1:3]
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
public network = fdd1:ecbd:731f:ee8e::/64
cluster network = fd91:462b:4243:47e::/64

root@mon01:~# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc pfifo_fast state UP group default qlen 1000
    link/ether b8:ae:ed:e9:b6:61 brd ff:ff:ff:ff:ff:ff
    inet 172.17.1.1/16 brd 172.17.255.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fd91:462b:4243:47e::1:1/64 scope global
       valid_lft forever preferred_lft forever
    inet6 fe80::baae:edff:fee9:b661/64 scope link
       valid_lft forever preferred_lft forever
3: wlan0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 00:db:df:64:34:d5 brd ff:ff:ff:ff:ff:ff
4: eth0.22@eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
    link/ether b8:ae:ed:e9:b6:61 brd ff:ff:ff:ff:ff:ff
    inet6 fdd1:ecbd:731f:ee8e::1:1/64 scope global
       valid_lft forever preferred_lft forever
    inet6 fe80::baae:edff:fee9:b661/64 scope link
       valid_lft forever preferred_lft forever

Don't mind wlan0, that's because this node is built from an Intel NUC.

Any idea?
Kind regards Kevin ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Luminous/Ubuntu 16.04 kernel recommendation ?
2018-02-08 11:20 GMT+01:00 Martin Emrich:
> I have a machine here mounting a Ceph RBD from luminous 12.2.2 locally,
> running linux-generic-hwe-16.04 (4.13.0-32-generic).
>
> Works fine, except that it does not support the latest features: I had to
> disable exclusive-lock,fast-diff,object-map,deep-flatten on the image.
> Otherwise it runs well.
>
I always thought that the latest features are built into newer kernels; are they available on non-HWE 4.4, HWE 4.8 or HWE 4.10? I am also researching the OSD server side.

- Kevin
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
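For reference, the image features that a given krbd version does not support can be switched off per image, as Martin describes. A sketch — the pool/image name is a placeholder:

```shell
# Disable the features an older kernel client cannot handle
# ("rbd/myimage" is a placeholder for your pool/image).
rbd feature disable rbd/myimage deep-flatten fast-diff object-map exclusive-lock

# Verify which features are still enabled:
rbd info rbd/myimage
```

Alternatively, `rbd_default_features` in ceph.conf can be lowered so newly created images only get features the intended clients support.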
Re: [ceph-users] Luminous/Ubuntu 16.04 kernel recommendation ?
Would be interested as well. - Kevin 2018-02-04 19:00 GMT+01:00 Yoann Moulin: > Hello, > > What is the best kernel for Luminous on Ubuntu 16.04 ? > > Is linux-image-virtual-lts-xenial still the best one ? Or > linux-virtual-hwe-16.04 will offer some improvement ? > > Thanks, > > -- > Yoann Moulin > EPFL IC-IT > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] _read_bdev_label failed to open
Running the following after prepare and a reboot, "solves" this problem. [root@osd01 ~]# partx -v -a /dev/mapper/mpatha partition: none, disk: /dev/mapper/mpatha, lower: 0, upper: 0 /dev/mapper/mpatha: partition table type 'gpt' detected partx: /dev/mapper/mpatha: adding partition #1 failed: Invalid argument partx: /dev/mapper/mpatha: adding partition #2 failed: Invalid argument partx: /dev/mapper/mpatha: error adding partitions 1-2 The disk is then activated and in and up. It seems like the partuuid was not correctly imported into the kernel. Even if it states that partitions 1 - 2 were not added, they are (this disk has only two partitions). Should I open a bug? Kind regards, Kevin 2018-02-04 19:05 GMT+01:00 Kevin Olbrich <k...@sv01.de>: > I also noticed there are no folders under /var/lib/ceph/osd/ ... > > > Mit freundlichen Grüßen / best regards, > Kevin Olbrich. > > 2018-02-04 19:01 GMT+01:00 Kevin Olbrich <k...@sv01.de>: > >> Hi! >> >> Currently I try to re-deploy a cluster from filestore to bluestore. 
>> I zapped all disks (multiple times) but I fail adding a disk array: >> >> Prepare: >> >>> ceph-deploy --overwrite-conf osd prepare --bluestore --block-wal >>> /dev/sdb --block-db /dev/sdb osd01.cloud.example.local:/dev >>> /mapper/mpatha >> >> >> Activate: >> >>> ceph-deploy --overwrite-conf osd activate osd01.cloud.example.local:/dev >>> /mapper/mpatha1 >> >> >> Error on activate: >> >>> [osd01.cloud.example.local][WARNIN] got monmap epoch 2 >>> [osd01.cloud.example.local][WARNIN] command_check_call: Running >>> command: /usr/bin/ceph-osd --cluster ceph --mkfs -i 0 --monmap >>> /var/lib/ceph/tmp/mnt.pAfCl4/activate.monmap --osd-data >>> /var/lib/ceph/tmp/mnt.pAfCl4 --osd-uuid d5b6ab85-9437-4cb2-a34d-16a29067ba27 >>> --setuser ceph --setgroup ceph >>> >>> *[osd01.cloud.example.local][WARNIN] 2018-02-04 18:52:43.900368 >>> 7f00d6359d00 -1 bluestore(/var/lib/ceph/tmp/mnt.pAfCl4/block) >>> _read_bdev_label failed to open /var/lib/ceph/tmp/mnt.pAfCl4/block: (2) No >>> such file or directory[osd01.cloud.example.local][WARNIN] 2018-02-04 >>> 18:52:43.900405 7f00d6359d00 -1 >>> bluestore(/var/lib/ceph/tmp/mnt.pAfCl4/block) _read_bdev_label failed to >>> open /var/lib/ceph/tmp/mnt.pAfCl4/block: (2) No such file or directory* >>> [osd01.cloud.example.local][WARNIN] 2018-02-04 18:52:43.900462 >>> 7f00d6359d00 -1 bluestore(/var/lib/ceph/tmp/mnt.pAfCl4) >>> _setup_block_symlink_or_file failed to open block file: (13) Permission >>> denied >>> [osd01.cloud.example.local][WARNIN] 2018-02-04 18:52:43.900480 >>> 7f00d6359d00 -1 bluestore(/var/lib/ceph/tmp/mnt.pAfCl4) mkfs failed, >>> (13) Permission denied >>> [osd01.cloud.example.local][WARNIN] 2018-02-04 18:52:43.900485 >>> 7f00d6359d00 -1 OSD::mkfs: ObjectStore::mkfs failed with error (13) >>> Permission denied >>> [osd01.cloud.example.local][WARNIN] 2018-02-04 18:52:43.900662 >>> 7f00d6359d00 -1 ** ERROR: error creating empty object store in >>> /var/lib/ceph/tmp/mnt.pAfCl4: (13) Permission denied >>> 
[osd01.cloud.example.local][WARNIN] mount_activate: Failed to activate >>> [osd01.cloud.example.local][WARNIN] unmount: Unmounting >>> /var/lib/ceph/tmp/mnt.pAfCl4 >>> >> >> >> Same problem on 2x 14 disks. I was unable to get this cluster up. >> >> Any ideas? >> >> Kind regards, >> Kevin >> > > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
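The partx workaround quoted at the top of this thread suggests the kernel never re-read the partition table on the multipath device. A sketch for checking that, using the device name from the thread (kpartx for device-mapper devices is an assumption here, not a verified fix):

```shell
# Does the kernel see the ceph partitions and their PARTUUIDs?
partx --show /dev/mapper/mpatha
ls -l /dev/disk/by-partuuid/

# If not, re-read the table; for dm devices kpartx often works where
# partx reports "Invalid argument":
kpartx -av /dev/mapper/mpatha
```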
Re: [ceph-users] _read_bdev_label failed to open
I also noticed there are no folders under /var/lib/ceph/osd/ ... Mit freundlichen Grüßen / best regards, Kevin Olbrich. 2018-02-04 19:01 GMT+01:00 Kevin Olbrich <k...@sv01.de>: > Hi! > > Currently I try to re-deploy a cluster from filestore to bluestore. > I zapped all disks (multiple times) but I fail adding a disk array: > > Prepare: > >> ceph-deploy --overwrite-conf osd prepare --bluestore --block-wal /dev/sdb >> --block-db /dev/sdb osd01.cloud.example.local:/dev/mapper/mpatha > > > Activate: > >> ceph-deploy --overwrite-conf osd activate osd01.cloud.example.local:/ >> dev/mapper/mpatha1 > > > Error on activate: > >> [osd01.cloud.example.local][WARNIN] got monmap epoch 2 >> [osd01.cloud.example.local][WARNIN] command_check_call: Running command: >> /usr/bin/ceph-osd --cluster ceph --mkfs -i 0 --monmap >> /var/lib/ceph/tmp/mnt.pAfCl4/activate.monmap --osd-data >> /var/lib/ceph/tmp/mnt.pAfCl4 --osd-uuid d5b6ab85-9437-4cb2-a34d-16a29067ba27 >> --setuser ceph --setgroup ceph >> >> *[osd01.cloud.example.local][WARNIN] 2018-02-04 18:52:43.900368 >> 7f00d6359d00 -1 bluestore(/var/lib/ceph/tmp/mnt.pAfCl4/block) >> _read_bdev_label failed to open /var/lib/ceph/tmp/mnt.pAfCl4/block: (2) No >> such file or directory[osd01.cloud.example.local][WARNIN] 2018-02-04 >> 18:52:43.900405 7f00d6359d00 -1 >> bluestore(/var/lib/ceph/tmp/mnt.pAfCl4/block) _read_bdev_label failed to >> open /var/lib/ceph/tmp/mnt.pAfCl4/block: (2) No such file or directory* >> [osd01.cloud.example.local][WARNIN] 2018-02-04 18:52:43.900462 >> 7f00d6359d00 -1 bluestore(/var/lib/ceph/tmp/mnt.pAfCl4) >> _setup_block_symlink_or_file failed to open block file: (13) Permission >> denied >> [osd01.cloud.example.local][WARNIN] 2018-02-04 18:52:43.900480 >> 7f00d6359d00 -1 bluestore(/var/lib/ceph/tmp/mnt.pAfCl4) mkfs failed, >> (13) Permission denied >> [osd01.cloud.example.local][WARNIN] 2018-02-04 18:52:43.900485 >> 7f00d6359d00 -1 OSD::mkfs: ObjectStore::mkfs failed with error (13) >> Permission denied >> 
[osd01.cloud.example.local][WARNIN] 2018-02-04 18:52:43.900662 >> 7f00d6359d00 -1 ** ERROR: error creating empty object store in >> /var/lib/ceph/tmp/mnt.pAfCl4: (13) Permission denied >> [osd01.cloud.example.local][WARNIN] mount_activate: Failed to activate >> [osd01.cloud.example.local][WARNIN] unmount: Unmounting >> /var/lib/ceph/tmp/mnt.pAfCl4 >> > > > Same problem on 2x 14 disks. I was unable to get this cluster up. > > Any ideas? > > Kind regards, > Kevin > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] _read_bdev_label failed to open
Hi! Currently I try to re-deploy a cluster from filestore to bluestore. I zapped all disks (multiple times) but I fail adding a disk array: Prepare: > ceph-deploy --overwrite-conf osd prepare --bluestore --block-wal /dev/sdb > --block-db /dev/sdb osd01.cloud.example.local:/dev/mapper/mpatha Activate: > ceph-deploy --overwrite-conf osd activate osd01.cloud.example > .local:/dev/mapper/mpatha1 Error on activate: > [osd01.cloud.example.local][WARNIN] got monmap epoch 2 > [osd01.cloud.example.local][WARNIN] command_check_call: Running command: > /usr/bin/ceph-osd --cluster ceph --mkfs -i 0 --monmap > /var/lib/ceph/tmp/mnt.pAfCl4/activate.monmap --osd-data > /var/lib/ceph/tmp/mnt.pAfCl4 --osd-uuid > d5b6ab85-9437-4cb2-a34d-16a29067ba27 --setuser ceph --setgroup ceph > > *[osd01.cloud.example.local][WARNIN] 2018-02-04 18:52:43.900368 > 7f00d6359d00 -1 bluestore(/var/lib/ceph/tmp/mnt.pAfCl4/block) > _read_bdev_label failed to open /var/lib/ceph/tmp/mnt.pAfCl4/block: (2) No > such file or directory[osd01.cloud.example.local][WARNIN] 2018-02-04 > 18:52:43.900405 7f00d6359d00 -1 > bluestore(/var/lib/ceph/tmp/mnt.pAfCl4/block) _read_bdev_label failed to > open /var/lib/ceph/tmp/mnt.pAfCl4/block: (2) No such file or directory* > [osd01.cloud.example.local][WARNIN] 2018-02-04 18:52:43.900462 > 7f00d6359d00 -1 bluestore(/var/lib/ceph/tmp/mnt.pAfCl4) > _setup_block_symlink_or_file failed to open block file: (13) Permission > denied > [osd01.cloud.example.local][WARNIN] 2018-02-04 18:52:43.900480 > 7f00d6359d00 -1 bluestore(/var/lib/ceph/tmp/mnt.pAfCl4) mkfs failed, (13) > Permission denied > [osd01.cloud.example.local][WARNIN] 2018-02-04 18:52:43.900485 > 7f00d6359d00 -1 OSD::mkfs: ObjectStore::mkfs failed with error (13) > Permission denied > [osd01.cloud.example.local][WARNIN] 2018-02-04 18:52:43.900662 > 7f00d6359d00 -1 ** ERROR: error creating empty object store in > /var/lib/ceph/tmp/mnt.pAfCl4: (13) Permission denied > [osd01.cloud.example.local][WARNIN] mount_activate: 
Failed to activate > [osd01.cloud.example.local][WARNIN] unmount: Unmounting > /var/lib/ceph/tmp/mnt.pAfCl4 > Same problem on 2x 14 disks. I was unable to get this cluster up. Any ideas? Kind regards, Kevin ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
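The trailing "(13) Permission denied" suggests the ceph user could not open the block/journal devices. A hedged sketch of the usual cleanup between attempts on Jewel-era ceph-disk setups — the partition names are assumptions for this multipath layout:

```shell
# Wipe old signatures so mkfs starts from a clean device:
ceph-deploy disk zap osd01.cloud.example.local:/dev/mapper/mpatha
wipefs -a /dev/mapper/mpatha

# ceph-osd drops privileges to ceph:ceph; the data and WAL/DB
# partitions must be owned accordingly (partition names assumed):
chown ceph:ceph /dev/mapper/mpatha1 /dev/mapper/mpatha2
chown ceph:ceph /dev/sdb1 /dev/sdb2
```

Note that udev normally sets these owners via the ceph-typecode GPT partition UUIDs; multipath partitions frequently miss those rules, which would match the symptom above.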
Re: [ceph-users] RFC Bluestore-Cluster of SAMSUNG PM863a
2018-02-02 12:44 GMT+01:00 Richard Hesketh <richard.hesk...@rd.bbc.co.uk>: > On 02/02/18 08:33, Kevin Olbrich wrote: > > Hi! > > > > I am planning a new Flash-based cluster. In the past we used SAMSUNG > PM863a 480G as journal drives in our HDD cluster. > > After a lot of tests with luminous and bluestore on HDD clusters, we > plan to re-deploy our whole RBD pool (OpenNebula cloud) using these disks. > > > > As far as I understand, it would be best to skip journaling / WAL and > just deploy every OSD 1-by-1. This would have the following pro's (correct > me, if I am wrong): > > - maximum performance as the journal is spread accross all devices > > - a lost drive does not affect any other drive > > > > Currently we are on CentOS 7 with elrepo 4.4.x-kernel. We plan to > migrate to Ubuntu 16.04.3 with HWE (kernel 4.10). > > Clients will be Fedora 27 + OpenNebula. > > > > Any comments? > > > > Thank you. > > > > Kind regards, > > Kevin > > There is only a real advantage to separating the DB/WAL from the main data > if they're going to be hosted on a device which is appreciably faster than > the main storage. Since you're going all SSD, it makes sense to deploy each > OSD all-in-one; as you say, you don't bottleneck on any one disk, and it > also offers you more maintenance flexibility as you will be able to easily > move OSDs between hosts if required. If you wanted to start pushing > performance more, you'd be looking at putting NVMe disks in your hosts for > DB/WAL. > We got some Intel P3700 NVMe (PCIe) disks but each host will be serving 10 OSDs, combined sync-speed on the samsungs was better than this single NVMe (we did some short fio-benchmarks no real-ceph-test, could also be different now). If performance is only slightly better, sticking to single OSD failure domain is better for maintenance, as this new cluster will not be monitored 24/7 by our staff while migration is in progress. > FYI, the 16.04 HWE kernel has currently rolled on over to 4.13. 
>

Did someone test this kernel branch with ceph? Any performance impact? If I understood the docs, Ubuntu is a well-tested platform for Ceph, so this should already have been tested (?).

> May I ask why are you using EL repo with centos?
> AFAIK, Redhat is backporting all ceph features to 3.10 kernels. Am I
> wrong?
>
Before we moved from OpenStack to OpenNebula in early 2017, we had some problems with krbd / fuse (missing features, etc.). We then decided to move from 3.10 to 4.4, which solved all problems, and we noticed a small performance improvement. Maybe these problems are solved already; we had them when we rolled out Mitaka. We did not change our deployment scripts since then, which is why we are still on kernel-ml.

Kind regards,
Kevin
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] RFC Bluestore-Cluster of SAMSUNG PM863a
Hi!

I am planning a new flash-based cluster. In the past we used SAMSUNG PM863a 480G as journal drives in our HDD cluster. After a lot of tests with luminous and bluestore on HDD clusters, we plan to re-deploy our whole RBD pool (OpenNebula cloud) using these disks.

As far as I understand, it would be best to skip a separate journal / WAL device and just deploy every OSD 1-by-1. This would have the following pros (correct me if I am wrong):
- maximum performance, as the journal is spread across all devices
- a lost drive does not affect any other drive

Currently we are on CentOS 7 with an elrepo 4.4.x kernel. We plan to migrate to Ubuntu 16.04.3 with HWE (kernel 4.10). Clients will be Fedora 27 + OpenNebula.

Any comments?

Thank you.

Kind regards,
Kevin
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Adding a new node to a small cluster (size = 2)
Hi!

A customer is running a small two-node ceph cluster with 14 disks each. He has min_size 1 and size 2 and it is only used for backups.

If we add a third member with 14 identical disks and keep size = 2, replicas should be distributed evenly, right? Or is an uneven count of hosts inadvisable, or does it not work at all?

Kind regards,
Kevin
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
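An odd host count with size = 2 does work: CRUSH picks two distinct hosts per PG, so with three equal hosts each one should end up holding roughly two thirds of all PGs. A sketch for verifying this after the third node joins — the pool name "backup" is a placeholder:

```shell
# Confirm the replication settings on the pool:
ceph osd pool get backup size
ceph osd pool get backup min_size

# After rebalancing finishes, check the per-host data distribution:
ceph osd df tree
```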
Re: [ceph-users] Failed to start Ceph disk activation: /dev/dm-18
Hi, seems that I found the cause. The disk array was used for ZFS before and was not wiped. I zapped the disks with sgdisk and via ceph but "zfs_member" was still somewhere on the disk. Wiping the disk (wipefs -a -f /dev/mapper/mpatha), "ceph osd create --zap-disk" twice until entry in "df" and reboot fixed it. Then OSDs were failing again. Cause: IPv6 DAD on bond-interface. Disabled via sysctl. Reboot and voila, cluster immediately online. Kind regards, Kevin. 2017-05-16 16:59 GMT+02:00 Kevin Olbrich <k...@sv01.de>: > HI! > > Currently I am deploying a small cluster with two nodes. I installed ceph > jewel on all nodes and made a basic deployment. > After "ceph osd create..." I am now getting "Failed to start Ceph disk > activation: /dev/dm-18" on boot. All 28 OSDs were never active. > This server has a 14 disk JBOD with 4x fiber using multipath (4x active > multibus). We have two servers. > > OS: Latest CentOS 7 > > [root@osd01 ~]# ceph -v >> ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185) > > > Command run: > >> ceph-deploy osd create osd01.example.local:/dev/mappe >> r/mpatha:/dev/disk/by-partlabel/journal01 > > > There is no error in journalctl, just that the unit failed: > >> May 16 16:47:33 osd01.example.local systemd[1]: Failed to start Ceph disk >> activation: /dev/dm-27. >> May 16 16:47:33 osd01.example.local systemd[1]: ceph-disk@dev-dm >> \x2d27.service: main process exited, code=exited, status=124/n/a >> May 16 16:47:33 osd01.example.local systemd[1]: >> ceph-disk@dev-dm\x2d24.service >> failed. >> May 16 16:47:33 osd01.example.local systemd[1]: Unit >> ceph-disk@dev-dm\x2d24.service >> entered failed state. > > > [root@osd01 ~]# gdisk -l /dev/mapper/mpatha >> GPT fdisk (gdisk) version 0.8.6 >> Partition table scan: >> MBR: protective >> BSD: not present >> APM: not present >> GPT: present >> Found valid GPT with protective MBR; using GPT. 
>> Disk /dev/mapper/mpatha: 976642095 sectors, 465.7 GiB >> Logical sector size: 512 bytes >> Disk identifier (GUID): DEF0B782-3B7F-4AF5-A0CB-9E2B96C40B13 >> Partition table holds up to 128 entries >> First usable sector is 34, last usable sector is 976642061 >> Partitions will be aligned on 2048-sector boundaries >> Total free space is 2014 sectors (1007.0 KiB) >> Number Start (sector)End (sector) Size Code Name >>12048 976642061 465.7 GiB ceph data > > > I had problems with multipath in the past when running ceph but this time > I was unable to solve the problem. > Any ideas? > > Kind regards, > Kevin. > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
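For reference, the two fixes described above as a sketch — the bond interface name is an assumption, and the wipe destroys the disk's contents:

```shell
# Remove leftover ZFS signatures that survive a plain GPT zap:
wipefs -a -f /dev/mapper/mpatha

# Disable IPv6 duplicate address detection on the bond interface
# ("bond0" is assumed; persist via /etc/sysctl.d/ if this helps):
sysctl -w net.ipv6.conf.bond0.accept_dad=0
```

DAD matters here because the OSD daemons try to bind their addresses at boot; while DAD is still probing, the bind fails and the daemons give up.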
[ceph-users] Failed to start Ceph disk activation: /dev/dm-18
HI! Currently I am deploying a small cluster with two nodes. I installed ceph jewel on all nodes and made a basic deployment. After "ceph osd create..." I am now getting "Failed to start Ceph disk activation: /dev/dm-18" on boot. All 28 OSDs were never active. This server has a 14 disk JBOD with 4x fiber using multipath (4x active multibus). We have two servers. OS: Latest CentOS 7 [root@osd01 ~]# ceph -v > ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185) Command run: > ceph-deploy osd create > osd01.example.local:/dev/mapper/mpatha:/dev/disk/by-partlabel/journal01 There is no error in journalctl, just that the unit failed: > May 16 16:47:33 osd01.example.local systemd[1]: Failed to start Ceph disk > activation: /dev/dm-27. > May 16 16:47:33 osd01.example.local systemd[1]: > ceph-disk@dev-dm\x2d27.service: > main process exited, code=exited, status=124/n/a > May 16 16:47:33 osd01.example.local systemd[1]: ceph-disk@dev-dm\x2d24.service > failed. > May 16 16:47:33 osd01.example.local systemd[1]: Unit > ceph-disk@dev-dm\x2d24.service > entered failed state. [root@osd01 ~]# gdisk -l /dev/mapper/mpatha > GPT fdisk (gdisk) version 0.8.6 > Partition table scan: > MBR: protective > BSD: not present > APM: not present > GPT: present > Found valid GPT with protective MBR; using GPT. > Disk /dev/mapper/mpatha: 976642095 sectors, 465.7 GiB > Logical sector size: 512 bytes > Disk identifier (GUID): DEF0B782-3B7F-4AF5-A0CB-9E2B96C40B13 > Partition table holds up to 128 entries > First usable sector is 34, last usable sector is 976642061 > Partitions will be aligned on 2048-sector boundaries > Total free space is 2014 sectors (1007.0 KiB) > Number Start (sector)End (sector) Size Code Name >12048 976642061 465.7 GiB ceph data I had problems with multipath in the past when running ceph but this time I was unable to solve the problem. Any ideas? Kind regards, Kevin. 
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Do we know which version of ceph-client has this fix ? http://tracker.ceph.com/issues/17191
2017-03-22 5:30 GMT+01:00 Brad Hubbard <bhubb...@redhat.com>: > On Wed, Mar 22, 2017 at 10:55 AM, Deepak Naidu <dna...@nvidia.com> wrote: > > Do we know which version of ceph client does this bug has a fix. Bug: > > http://tracker.ceph.com/issues/17191 > > > > > > > > I have ceph-common-10.2.6-0 ( on CentOS 7.3.1611) & ceph-fs-common- > > 10.2.6-1(Ubuntu 14.04.5) > > ceph-client is the repository for the ceph kernel client (kernel modules). > > The commits referenced in the tracker above went into upstream kernel > 4.9-rc1. > > https://lkml.org/lkml/2016/10/8/110 > > I doubt these are available in any CentOS 7.x kernel yet but you could > check the source. > If it is in 4.9-rc1, it could also be in 4.10.4. We are using kernel-lt (4.4.x) for our clusters but there is also mainline in elrepo: https://elrepo.org/linux/kernel/el7/x86_64/RPMS/ I did not test 4.10.x with ceph but 4.4.x with rbd and kvm works well for us. > > > > > > > > > -- > > > > Deepak > > > > > > This email message is for the sole use of the intended recipient(s) and > may > > contain confidential information. Any unauthorized review, use, > disclosure > > or distribution is prohibited. If you are not the intended recipient, > > please contact the sender by reply email and destroy all copies of the > > original message. > > > > > > ___ > > ceph-users mailing list > > ceph-users@lists.ceph.com > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > > > > -- > Cheers, > Brad > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > Kind regards, Kevin Olbrich. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Shrinking lab cluster to free hardware for a new deployment
Hi!

Currently I have a cluster with 6 OSDs (5 hosts, 7TB RAID6 each). We want to shut down the cluster but it holds some semi-productive VMs we might or might not need in the future. To keep them, we would like to shrink our cluster from 6 to 2 OSDs (we use size 2 and min_size 1).

Should I set the OSDs out one by one, or all at once with the nobackfill and norecovery flags set? If the latter, which other flags should also be set?

Thanks!

Kind regards,
Kevin Olbrich.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
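One cautious approach — a sketch, not something from the thread — is to drain the OSDs one at a time and let the cluster rebalance between steps, rather than setting nobackfill/norecovery, which would stop data movement entirely. The OSD ids below are placeholders:

```shell
# Take one OSD out at a time; wait for recovery before the next one.
for id in 5 4 3 2; do
    ceph osd out "$id"
    until ceph health | grep -q HEALTH_OK; do
        sleep 60
    done
done
```

Check first that the two remaining OSDs have enough capacity for the whole data set, or the drain will stall on full/backfillfull OSDs.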
[ceph-users] Review of Ceph on ZFS - or how not to deploy Ceph for RBD + OpenStack
In all cases, VMs were fully functional. Currently we are migrating most VMs out of the cluster to shut it down (we had some semi-productive VMs on it to get real-world usage stats).

I just wanted to let you know which problems we had with Ceph on ZFS. No doubt we made a lot of mistakes (this was our first Ceph cluster), but we had a lot of tests running on it and would not recommend using ZFS as the backend.

And for those interested in monitoring this type of cluster: do not use Munin. As the disks were spinning at 100% and each disk is seen three times (2 paths combined in one mpath), I caused a deadlock resulting in 3/4 offline nodes (one of the disasters where we had Ceph repair everything).

I hope this helps all Ceph users who are interested in the idea of running Ceph on ZFS.

Kind regards,
Kevin Olbrich.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] What happens if all replica OSDs journals are broken?
2016-12-14 2:37 GMT+01:00 Christian Balzer <ch...@gol.com>:
>
> Hello,

Hi!

> On Wed, 14 Dec 2016 00:06:14 +0100 Kevin Olbrich wrote:
>
> > Ok, thanks for your explanation!
> > I read those warnings about size 2 + min_size 1 (we are using ZFS as RAID6,
> > called zraid2) as OSDs.
> >
> This is similar to my RAID6 or RAID10 backed OSDs with regards to having
> very resilient, extremely unlikely to fail OSDs.

This was our intention (unlikely to fail, data security > performance). We use Ceph for OpenStack (Cinder RBD).

> As such a Ceph replication of 2 with min_size 1 is a calculated risk,
> acceptable for me and others in certain use cases.
> This is also with very few (2-3) journals per SSD.

We are running 14x 500G RAID6 ZFS-RAID per host (1x journal, 1x OSD, 32GB RAM). The ZFS pools use an L2ARC cache on Samsung 850 PRO 128GB. Hint: that was a bad idea; it would have been better to split the ZFS pools. (ZFS performance was very good, but double parity with 4k random on sync with Ceph takes very long, resulting in "XXX requests blocked more than 32 seconds".) Currently I am waiting for a lab cluster to test "osd op threads" for these single-OSD hosts.

> If:
>
> 1. Your journal SSDs are well trusted and monitored (Intel DC S36xx, 37xx)

Indeed, Intel DC P3700 400GB for Ceph. We had Samsung 850 PRO before I learned that 4k random with DSYNC is a very bad idea... ;-)

> 2. Your failure domain represented by a journal SSD is small enough
> (meaning that replicating the lost OSDs can be done quickly)

OSDs are rather large but we are "just" using 8 TB (size 2) in the whole cluster (OSD is 24% full). Before we moved from infernalis to jewel, a recovery from an OSD which was offline for 8 hours took approx. one hour to be back in sync.

> it may be an acceptable risk for you as well.

We got reliable backups in the past, but downtime is a greater problem.

> > > Time to raise replication!
> >
> If you can afford that (money, space, latency), definitely go for it.
> It's more the double journal failure which scares me compared to the OSD itself (as ZFS was very reliable in the past). Kevin > Christian > > Kevin > > > > 2016-12-13 0:00 GMT+01:00 Christian Balzer <ch...@gol.com>: > > > > > On Mon, 12 Dec 2016 22:41:41 +0100 Kevin Olbrich wrote: > > > > > > > Hi, > > > > > > > > just in case: What happens when all replica journal SSDs are broken > at > > > once? > > > > > > > That would be bad, as in BAD. > > > > > > In theory you just "lost" all the associated OSDs and their data. > > > > > > In practice everything but in the in-flight data at the time is still > on > > > the actual OSDs (HDDs), but it's inconsistent and inaccessible as far > as > > > Ceph is concerned. > > > > > > So with some trickery and an experienced data-recovery Ceph consultant > you > > > _may_ get things running with limited data loss/corruption, but that's > > > speculation and may be wishful thinking on my part. > > > > > > Another data point to deploy only well known/monitored/trusted SSDs and > > > have a 3x replication. > > > > > > > The PGs most likely will be stuck inactive but as I read, the > journals > > > just > > > > need to be replaced (http://ceph.com/planet/ceph- > recover-osds-after-ssd- > > > > journal-failure/). > > > > > > > > Does this also work in this case? > > > > > > > Not really, no. > > > > > > The above works by having still a valid state and operational OSDs from > > > which the "broken" one can recover. > > > > > > Christian > > > -- > > > Christian BalzerNetwork/Systems Engineer > > > ch...@gol.com Global OnLine Japan/Rakuten Communications > > > http://www.gol.com/ > > > > > > -- > Christian BalzerNetwork/Systems Engineer > ch...@gol.com Global OnLine Japan/Rakuten Communications > http://www.gol.com/ > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
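The "4k random with DSYNC" journal qualification mentioned in this thread is commonly measured with fio; a minimal sketch (the target device is a placeholder, and writing to it destroys its contents):

```shell
# Journal-SSD test: small synchronous direct writes, queue depth 1.
# /dev/sdX is a placeholder -- this overwrites the device!
fio --name=journal-test --filename=/dev/sdX \
    --rw=write --bs=4k --iodepth=1 --numjobs=1 \
    --direct=1 --sync=1 --runtime=60 --time_based
```

Consumer SSDs like the 850 PRO often look fast in buffered tests but collapse under `--sync=1`, which is exactly the access pattern of a Ceph journal.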
Re: [ceph-users] What happens if all replica OSDs journals are broken?
Ok, thanks for your explanation! I read those warnings about size 2 + min_size 1 (we are using ZFS as RAID6, called zraid2) as OSDs. Time to raise replication! Kevin 2016-12-13 0:00 GMT+01:00 Christian Balzer <ch...@gol.com>: > On Mon, 12 Dec 2016 22:41:41 +0100 Kevin Olbrich wrote: > > > Hi, > > > > just in case: What happens when all replica journal SSDs are broken at > once? > > > That would be bad, as in BAD. > > In theory you just "lost" all the associated OSDs and their data. > > In practice everything but in the in-flight data at the time is still on > the actual OSDs (HDDs), but it's inconsistent and inaccessible as far as > Ceph is concerned. > > So with some trickery and an experienced data-recovery Ceph consultant you > _may_ get things running with limited data loss/corruption, but that's > speculation and may be wishful thinking on my part. > > Another data point to deploy only well known/monitored/trusted SSDs and > have a 3x replication. > > > The PGs most likely will be stuck inactive but as I read, the journals > just > > need to be replaced (http://ceph.com/planet/ceph-recover-osds-after-ssd- > > journal-failure/). > > > > Does this also work in this case? > > > Not really, no. > > The above works by having still a valid state and operational OSDs from > which the "broken" one can recover. > > Christian > -- > Christian BalzerNetwork/Systems Engineer > ch...@gol.com Global OnLine Japan/Rakuten Communications > http://www.gol.com/ > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
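"Time to raise replication" boils down to two pool settings; a sketch with a placeholder pool name — expect a full rebalance, since every PG gains a third copy:

```shell
ceph osd pool set rbd size 3
ceph osd pool set rbd min_size 2
```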
[ceph-users] What happens if all replica OSDs journals are broken?
Hi,

just in case: What happens when all replica journal SSDs are broken at once?

The PGs most likely will be stuck inactive but as I read, the journals just need to be replaced (http://ceph.com/planet/ceph-recover-osds-after-ssd-journal-failure/).

Does this also work in this case?

Kind regards,
Kevin
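For the single-failure case the linked article covers, the procedure is roughly the following. This is only a sketch for filestore-era OSDs; the OSD id (20), the partition UUID placeholder, and paths are assumptions you must adapt to your own cluster:

```shell
# Keep CRUSH from rebalancing while the OSD is briefly down.
ceph osd set noout

# Stop the OSD whose journal device failed (id 20 is a placeholder).
systemctl stop ceph-osd@20

# If the old journal is still readable, flush pending writes to the data disk.
ceph-osd -i 20 --flush-journal

# After replacing the SSD and recreating the journal partition, repoint the
# journal symlink in the OSD data directory and initialize a fresh journal.
ln -sf /dev/disk/by-partuuid/<new-partition-uuid> /var/lib/ceph/osd/ceph-20/journal
ceph-osd -i 20 --mkjournal

# Bring the OSD back and re-enable rebalancing.
systemctl start ceph-osd@20
ceph osd unset noout
```

Note the catch relevant to this thread: when the journal is unreadable, `--flush-journal` is impossible and in-flight writes are lost, which is exactly why this recipe does not rescue you when the journals of *all* replicas die at once.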
Re: [ceph-users] [EXTERNAL] Re: 2x replication: A BIG warning
Is Ceph accepting this OSD if the other (newer) replica is down? In this case I would assume that my cluster is instantly broken when rack _after_ rack fails (power outage) and I just start them in random order. We have at least one MON on a stand-alone UPS to resolve such an issue - I just assumed this is safe regardless of a full outage.

Mit freundlichen Grüßen / best regards,
Kevin Olbrich.

2016-12-07 21:10 GMT+01:00 Wido den Hollander <w...@42on.com>:
>
> > On 7 December 2016 at 21:04, "Will.Boege" <will.bo...@target.com> wrote:
> >
> > Hi Wido,
> >
> > Just curious how blocking IO to the final replica provides protection from data loss? I've never really understood why this is a Ceph best practice. In my head all 3 replicas would be on devices that have roughly the same odds of physically failing or getting logically corrupted in any given minute. Not sure how blocking IO prevents this.
>
> Say, disk #1 fails and you have #2 and #3 left. Now #2 fails, leaving only #3.
>
> By blocking you know that #2 and #3 still have the same data. Although #2 failed, it could be that it is the host which went down but the disk itself is just fine. Maybe the SATA cable broke, you never know.
>
> If disk #3 now fails you can still continue your operation if you bring #2 back. It has the same data on disk as #3 had before it failed, since you didn't allow any I/O on #3 when #2 went down earlier.
>
> If you would have accepted writes on #3 while #1 and #2 were gone, you would have invalid/old data on #2 by the time it comes back.
>
> Writes were made on #3, but that one really broke down. You managed to get #2 back, but it doesn't have the changes which #3 had.
>
> The result is corrupted data.
>
> Does this make sense?
>
> Wido
>
> On 12/7/16, 9:11 AM, "ceph-users on behalf of LOIC DEVULDER" <ceph-users-boun...@lists.ceph.com on behalf of loic.devul...@mpsa.com> wrote:
>
> > -Original Message-
> > From: Wido den Hollander [mailto:w...@42on.com]
> > Sent: Wednesday, 7 December 2016 16:01
> > To: ceph-us...@ceph.com; LOIC DEVULDER - U329683 <loic.devul...@mpsa.com>
> > Subject: RE: [ceph-users] 2x replication: A BIG warning
> >
> > > On 7 December 2016 at 15:54, LOIC DEVULDER <loic.devul...@mpsa.com> wrote:
> > >
> > > Hi Wido,
> > >
> > > > As a Ceph consultant I get numerous calls throughout the year to
> > > > help people with getting their broken Ceph clusters back online.
> > > >
> > > > The causes of downtime vary vastly, but one of the biggest causes is
> > > > that people use replication 2x. size = 2, min_size = 1.
> > >
> > > We are building a Ceph cluster for our OpenStack and for data integrity reasons we have chosen to set size=3. But we want to continue to access data if 2 of our 3 OSD servers are dead, so we decided to set min_size=1.
> > >
> > > Is it a (very) bad idea?
> >
> > I would say so. Yes, downtime is annoying on your cloud, but data loss is even worse, much worse.
> >
> > I would always run with min_size = 2 and manually switch to min_size = 1 if the situation really requires it at that moment.
> >
> > Losing two disks at the same time is something which doesn't happen that much, but if it happens you don't want to modify any data on the only copy which you still have left.
> >
> > Setting min_size to 1 should be a manual action imho when size = 3 and you lose two copies. In that case YOU decide at that moment if it is the right course of action.
> >
> > Wido
>
> > Thanks for your quick response!
>
> > That makes sense, I will try to convince my colleagues :-)
> >
> > Loic
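The policy Wido recommends maps onto a couple of pool settings; a sketch, where `rbd` stands in for whatever your pool is actually called:

```shell
# Run with three copies and require at least two for client I/O.
ceph osd pool set rbd size 3
ceph osd pool set rbd min_size 2

# Only as a deliberate, manual emergency step (two copies lost and you have
# decided continued writes are worth the risk), drop the requirement:
ceph osd pool set rbd min_size 1

# ...and raise it back to 2 as soon as recovery completes.
ceph osd pool set rbd min_size 2
```

Keeping min_size at 2 in normal operation is what makes the "last surviving copy never diverges" argument above hold.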
Re: [ceph-users] Deploying new OSDs in parallel or one after another
I need to note that I already have 5 hosts with one OSD each.

Mit freundlichen Grüßen / best regards,
Kevin Olbrich.

2016-11-28 10:02 GMT+01:00 Kevin Olbrich <k...@sv01.de>:
> Hi!
>
> I want to deploy two nodes with 4 OSDs each. I already prepared the OSDs and only need to activate them.
> What is better? One by one or all at once?
>
> Kind regards,
> Kevin.
[ceph-users] Deploying new OSDs in parallel or one after another
Hi!

I want to deploy two nodes with 4 OSDs each. I already prepared the OSDs and only need to activate them.
What is better? One by one or all at once?

Kind regards,
Kevin.
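Either way, the rebalancing impact of bringing new OSDs in can be throttled; a sketch of the usual knobs (the values here are illustrative starting points, not recommendations):

```shell
# Limit concurrent backfill and recovery work per OSD while data moves
# onto the new disks (applies to running OSDs immediately).
ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'

# Watch the cluster work through the remapped/backfilling PGs.
ceph -s

# Once the cluster is back to HEALTH_OK, restore your normal values.
ceph tell osd.* injectargs '--osd-max-backfills 10'
```

With throttling in place, activating all new OSDs at once means only one data-movement phase instead of eight.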
[ceph-users] Ceph performance laggy (requests blocked > 32) on OpenStack
Hi,

we are running 80 VMs using KVM in OpenStack via RBD in Ceph Jewel, on a total of 53 disks (RAID parity already excluded). Our nodes are using Intel P3700 DC SSDs for journaling.

Most VMs are Linux based and load is low to medium. There are also about 10 VMs running Windows 2012R2, two of them running remote services (terminal).

My question is: Are 80 VMs hosted on 53 disks (mostly 7.2k SATA) too much? We sometimes experience lags where nearly all servers suffer from "blocked IO > 32 seconds".

What are your experiences?

Mit freundlichen Grüßen / best regards,
Kevin Olbrich.
Re: [ceph-users] degraded objects after osd add
Hi,

what happens when size = 2 and some objects are in a degraded state? This sounds like easy data loss if the old but active OSD fails while recovery is in progress. It would make more sense to have the PG replicate first and then remove it from the old OSD.

Mit freundlichen Grüßen / best regards,
Kevin Olbrich.

> Original Message
> Subject: Re: [ceph-users] degraded objects after osd add (17-Nov-2016 9:14)
> From: Burkhard Linke <burkhard.li...@computational.bio.uni-giessen.de>
> To: c...@dolphin-it.de
>
> Hi,
>
> On 11/17/2016 08:07 AM, Steffen Weißgerber wrote:
> > Hello,
> >
> > just for understanding:
> >
> > When starting to fill OSDs with data by setting the weight from 0 to the normal value, the ceph status displays degraded objects (>0.05%).
> >
> > I don't understand the reason for this, because there's no storage revoked from the cluster, only added. Therefore only the displayed object displacement makes sense.
>
> If you just added a new OSD, a number of PGs will be backfilling or waiting for backfilling (the remapped ones). I/O to these PGs is not blocked, and thus objects may be modified. AFAIK these objects show up as degraded.
>
> I'm not sure how Ceph handles these objects, e.g. whether it writes them to the old OSDs assigned to the PG, or whether they are put on the new OSD already, even if the corresponding PG is waiting for backfilling.
>
> Nonetheless the degraded objects will be cleaned up during backfilling.
>
> Regards,
> Burkhard
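One common way to shrink the window in which objects show up as degraded is to bring a new OSD in gradually rather than jumping straight to its full CRUSH weight. A sketch, where `osd.5` and the weight steps are placeholders:

```shell
# Start the new OSD at a fraction of its target CRUSH weight...
ceph osd crush reweight osd.5 0.2

# ...wait for the cluster to return to HEALTH_OK, then step it up.
ceph osd crush reweight osd.5 0.5
ceph osd crush reweight osd.5 1.0

# Track the degraded/misplaced counts between steps.
ceph -s
```

Each step remaps fewer PGs at a time, so fewer objects are in the vulnerable degraded state at any moment.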
[ceph-users] How are replicas spread in default crush configuration?
Hi,

just to make sure, as I did not find a reference in the docs: Are replicas spread across hosts or "just" OSDs?

I am using a 5 OSD cluster (4 pools, 128 PGs each) with size = 2. Currently each OSD is a ZFS backed storage array. Now I installed a server which is planned to host 4x OSDs (and I am setting size to 3). I want to make sure we can survive two offline hosts (in terms of hardware). Is my assumption correct?

Mit freundlichen Grüßen / best regards,
Kevin Olbrich.
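Whether replicas are host-separated is decided by the failure domain in the CRUSH rule your pools use, and that can be checked directly rather than assumed (a sketch; rule names vary per cluster):

```shell
# Dump the CRUSH rules and look at the chooseleaf step of the rule
# your pools reference.
ceph osd crush rule dump

# A host-level failure domain appears in the rule's steps roughly as:
#   "op": "chooseleaf_firstn",
#   "type": "host"
# With "type": "osd" instead, two replicas may land on the same host.

# Also check how your OSDs map onto hosts in the CRUSH tree.
ceph osd tree
```

With a host failure domain and size = 3 across 6 hosts, each replica lands on a different host, which is what surviving two offline hosts requires.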