Re: [ceph-users] bluestore OSD did not start at system-boot
Hey Ansgar,

we have a similar "problem": in our case all servers are wiped on reboot, as they boot their operating system from the network into initramfs. While the OS configuration is done with cdist [0], we consider ceph OSDs more dynamic data and simply re-initialise all OSDs on boot using the ungleich-tools [1] suite, which we created mostly for working with ceph clusters. Especially [2] might be of interest for you.

HTH,

Nico

[0] https://www.nico.schottelius.org/software/cdist/
[1] https://github.com/ungleich/ungleich-tools
[2] https://github.com/ungleich/ungleich-tools/blob/master/ceph-osd-activate-all

Ansgar Jazdzewski writes:

> hi folks,
>
> i just figured out that my OSDs did not start because the filesystem
> is not mounted.
>
> So i wrote a script to hack my way around it:
>
> #!/usr/bin/env bash
>
> # collect "osd id" / "osd fsid" pairs from ceph-volume
> DATA=( $(ceph-volume lvm list | grep -e 'osd id\|osd fsid' | awk '{print $3}') )
>
> OSDS=$(( ${#DATA[@]} / 2 ))
>
> for OSD in $(seq 0 $((OSDS - 1))); do
>     ceph-volume lvm activate "${DATA[OSD*2]}" "${DATA[OSD*2+1]}"
> done
>
> I'm sure that this is not the way it should be!? So any help is
> welcome to figure out why my BlueStore OSDs are not mounted at
> boot-time.
>
> Thanks,
> Ansgar
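A note on simplifying the above: on ceph-volume releases where it is available, the whole loop collapses into a single call, and on systemd hosts a successful activation also enables per-OSD units (ceph-volume@lvm-<id>-<fsid>), so the OSDs should come up on their own at the next boot. A minimal sketch, assuming a 12.2.x ceph-volume that supports the flag:

    # activate every LVM-based bluestore OSD found on this host
    # (verify with `ceph-volume lvm activate --help` on your release first)
    ceph-volume lvm activate --all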
Re: [ceph-users] Stuck in creating+activating
You hit the nail! Thanks a lot!

Anytime around in Switzerland for a free beer [tm]?

Vladimir Prokofev writes:

> My first guess would be PG overdose protection kicked in [1][2].
> You can try fixing it by increasing the allowed number of PGs per OSD with
>
> ceph tell mon.* injectargs '--mon_max_pg_per_osd 500'
> ceph tell osd.* injectargs '--mon_max_pg_per_osd 500'
>
> and then triggering a CRUSH algorithm update, by restarting an OSD for example.
>
> [1] https://ceph.com/community/new-luminous-pg-overdose-protection/
> [2] https://blog.widodh.nl/2018/01/placement-groups-with-ceph-luminous-stay-in-activating-state/
>
> 2018-03-17 12:15 GMT+03:00 Nico Schottelius:
>
>> Good morning,
>>
>> some days ago we created a new pool with 512 pgs and originally 5 osds.
>> We use the device class "ssd" and a crush rule that maps all data for
>> the pool "ssd" to the ssd device class osds.
>>
>> While creating, one of the ssds failed and we are left with 4 osds:
>>
>> [10:00:22] server2.place6:/var/log/ceph# ceph osd tree
>> ID CLASS     WEIGHT    TYPE NAME    STATUS REWEIGHT PRI-AFF
>> -1           135.12505 root default
>> -7            51.36911     host server2
>> 15 hdd-big     9.09511         osd.15   up  1.0  1.0
>> 20 hdd-big     9.09511         osd.20   up  1.0  1.0
>> 21 hdd-big     9.09511         osd.21   up  1.0  1.0
>>  7 hdd-small   4.54776         osd.7    up  1.0  1.0
>>  8 hdd-small   4.54776         osd.8    up  1.0  1.0
>> 10 hdd-small   4.54776         osd.10   up  1.0  1.0
>> 26 hdd-small   4.54776         osd.26   up  1.0  1.0
>> 14 notinuse    5.45741         osd.14   up  1.0  1.0
>> 12 ssd         0.21767         osd.12   up  1.0  1.0
>> 24 ssd         0.21767         osd.24   up  1.0  1.0
>> -5            42.50967     host server3
>>  9 hdd-big     9.09511         osd.9    up  1.0  1.0
>> 16 hdd-big     9.09511         osd.16   up  1.0  1.0
>> 19 hdd-big     9.09511         osd.19   up  1.0  1.0
>>  3 hdd-small   4.54776         osd.3    up  1.0  1.0
>>  5 hdd-small   4.54776         osd.5    up  1.0  1.0
>>  6 hdd-small   4.54776         osd.6    up  1.0  1.0
>> 11 notinuse    0.45424         osd.11   up  1.0  1.0
>> 13 notinuse    0.90907         osd.13   up  1.0  1.0
>> 25 ssd         0.21776         osd.25   up  1.0  1.0
>> -2            41.24626     host server4
>>  2 hdd-big     9.09511         osd.2    up  1.0  1.0
>> 17 hdd-big     9.09511         osd.17   up  1.0  1.0
>> 18 hdd-big     9.09511         osd.18   up  1.0  1.0
>>  0 hdd-small   4.54776         osd.0    up  1.0  1.0
>>  1 hdd-small   4.54776         osd.1    up  1.0  1.0
>> 22 hdd-small   4.54776         osd.22   up  1.0  1.0
>>  4 notinuse    0.0             osd.4    up  1.0  1.0
>> 23 ssd         0.21767         osd.23   up  1.0  1.0
>> [10:04:27] server2.place6:/var/log/ceph#
>>
>> We first had about 160 pgs stuck in creating+activating. After
>> restarting all osds in the ssd class one by one, it shifted to
>> 100 activating and 60 creating+activating:
>>
>> [10:00:18] server2.place6:/var/log/ceph# ceph -s
>>   cluster:
>>     id: 1ccd84f6-e362-4c50-9ffe-59436745e445
>>     health: HEALTH_ERR
>>             1803200/13770981 objects misplaced (13.094%)
>>             Reduced data availability: 175 pgs inactive
>>             Degraded data redundancy: 857547/13770981 objects degraded (6.227%), 197 pgs degraded, 123 pgs undersized
>>             39 slow requests are blocked > 32 sec
>>             40 stuck requests are blocked > 4096 sec
>>
>>   services:
>>     mon: 3 daemons, quorum black1,black2,black3
>>     mgr: black3(active), standbys: black2, black1
>>     osd: 27 osds: 27 up, 27 in; 156 remapped pgs
>>
>>   data:
>>     pools:   2 pools, 1024 pgs
>>     objects: 4482k objects, 17725 GB
>>     usage:   55542 GB used, 83188 GB / 135 TB avail
>>     pgs:     17.090% pgs not active
>>              857547/13770981 objects degraded (6.227%)
>>              1803200/13770981 objects misplaced (13.094%)
>>              640 active+clean
>>              105 active+undersized+degraded+remapped+backfill_wait
>>              100 activating
>>              60  creating+activating
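Before raising mon_max_pg_per_osd (default 200 in Luminous), it is worth confirming that some OSDs really exceed the limit. A rough sketch for counting PGs per OSD, assuming the Luminous `ceph osd df` layout where PGS is the last column:

    # print "osd.<id> <pgs>" sorted by PG count, highest last;
    # rows whose first field is numeric are the per-OSD rows
    ceph osd df | awk '$1 ~ /^[0-9]+$/ {print "osd." $1, $NF}' | sort -k2 -n

If any OSD in the affected crush root sits above mon_max_pg_per_osd, the injectargs workaround above should unstick the activating PGs.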
[ceph-users] Stuck in creating+activating
Good morning,

some days ago we created a new pool with 512 pgs and originally 5 osds. We use the device class "ssd" and a crush rule that maps all data for the pool "ssd" to the ssd device class osds.

While creating, one of the ssds failed and we are left with 4 osds:

[10:00:22] server2.place6:/var/log/ceph# ceph osd tree
ID CLASS     WEIGHT    TYPE NAME    STATUS REWEIGHT PRI-AFF
-1           135.12505 root default
-7            51.36911     host server2
15 hdd-big     9.09511         osd.15   up  1.0  1.0
20 hdd-big     9.09511         osd.20   up  1.0  1.0
21 hdd-big     9.09511         osd.21   up  1.0  1.0
 7 hdd-small   4.54776         osd.7    up  1.0  1.0
 8 hdd-small   4.54776         osd.8    up  1.0  1.0
10 hdd-small   4.54776         osd.10   up  1.0  1.0
26 hdd-small   4.54776         osd.26   up  1.0  1.0
14 notinuse    5.45741         osd.14   up  1.0  1.0
12 ssd         0.21767         osd.12   up  1.0  1.0
24 ssd         0.21767         osd.24   up  1.0  1.0
-5            42.50967     host server3
 9 hdd-big     9.09511         osd.9    up  1.0  1.0
16 hdd-big     9.09511         osd.16   up  1.0  1.0
19 hdd-big     9.09511         osd.19   up  1.0  1.0
 3 hdd-small   4.54776         osd.3    up  1.0  1.0
 5 hdd-small   4.54776         osd.5    up  1.0  1.0
 6 hdd-small   4.54776         osd.6    up  1.0  1.0
11 notinuse    0.45424         osd.11   up  1.0  1.0
13 notinuse    0.90907         osd.13   up  1.0  1.0
25 ssd         0.21776         osd.25   up  1.0  1.0
-2            41.24626     host server4
 2 hdd-big     9.09511         osd.2    up  1.0  1.0
17 hdd-big     9.09511         osd.17   up  1.0  1.0
18 hdd-big     9.09511         osd.18   up  1.0  1.0
 0 hdd-small   4.54776         osd.0    up  1.0  1.0
 1 hdd-small   4.54776         osd.1    up  1.0  1.0
22 hdd-small   4.54776         osd.22   up  1.0  1.0
 4 notinuse    0.0             osd.4    up  1.0  1.0
23 ssd         0.21767         osd.23   up  1.0  1.0
[10:04:27] server2.place6:/var/log/ceph#

We first had about 160 pgs stuck in creating+activating. After restarting all osds in the ssd class one by one, it shifted to 100 activating and 60 creating+activating:

[10:00:18] server2.place6:/var/log/ceph# ceph -s
  cluster:
    id: 1ccd84f6-e362-4c50-9ffe-59436745e445
    health: HEALTH_ERR
            1803200/13770981 objects misplaced (13.094%)
            Reduced data availability: 175 pgs inactive
            Degraded data redundancy: 857547/13770981 objects degraded (6.227%), 197 pgs degraded, 123 pgs undersized
            39 slow requests are blocked > 32 sec
            40 stuck requests are blocked > 4096 sec

  services:
    mon: 3 daemons, quorum black1,black2,black3
    mgr: black3(active), standbys: black2, black1
    osd: 27 osds: 27 up, 27 in; 156 remapped pgs

  data:
    pools:   2 pools, 1024 pgs
    objects: 4482k objects, 17725 GB
    usage:   55542 GB used, 83188 GB / 135 TB avail
    pgs:     17.090% pgs not active
             857547/13770981 objects degraded (6.227%)
             1803200/13770981 objects misplaced (13.094%)
             640 active+clean
             105 active+undersized+degraded+remapped+backfill_wait
             100 activating
             60  creating+activating
             50  active+recovery_wait+degraded
             21  active+remapped+backfill_wait
             16  active+recovery_wait+undersized+degraded+remapped
             15  activating+degraded
             9   active+recovery_wait+degraded+remapped
             3   active+recovery_wait+remapped
             3   active+recovery_wait
             2   active+undersized+degraded+remapped+backfilling

  io:
    client: 519 kB/s rd, 38025 kB/s wr, 4 op/s rd, 20 op/s wr
    recovery: 1694 kB/s, 0 objects/s

I looked into the archives, but did not find anything that directly relates to our situation. We are using ceph 12.2.4.
An excerpt from our ceph health detail looks like this:

HEALTH_ERR 1803116/13770981 objects misplaced (13.094%); Reduced data availability: 175 pgs inactive; Degraded data redundancy: 856881/13770981 objects degraded (6.222%), 197 pgs degraded, 123 pgs undersized; 53 slow requests are blocked > 32 sec; 40 stuck requests are blocked > 4096 sec
OBJECT_MISPLACED 1803116/13770981 objects misplaced (13.094%)
PG_AVAILABILITY Reduced data availability: 175 pgs inactive
    pg 7.118 is stuck inactive for 183000.110669, current state creating+activating, last acting [12,23,25]
    pg 7.11a is stuck inactive for 38143.679989, current state activating, last acting [25,24,23]
    pg 7.121 is stuck inactive for 38143.670149, current state activating, last acting [25,23,12]
    pg 7.123 is stuck inactiv
Re: [ceph-users] Ceph iSCSI is a prank?
Max,

I understand your frustration. However, last time I checked, ceph was open source. Some of you might not remember, but one major reason why open source is great is that YOU CAN DO your own modifications. If you need a change like iSCSI support and it isn't there, it is probably best if you implement it.

Even if a lot of people are voluntarily contributing to open source, and even if there is a company behind ceph as a product, there is no right to a feature.

Best,

Nico

p.s.: If your answer is "I don't have the experience to implement it", then my answer will be "hire somebody", and if your answer is "I don't have the money", my answer is "You don't have the resources to have that feature". (from: the book of reality)

Max Cuttins writes:

> Sorry for being rude Ross,
>
> I have followed Ceph since 2014, waiting for iSCSI support in order to
> use it with Xen.
> When it finally seemed to be implemented, the OS requirements are
> unrealistic.
> Seems a bad prank. 4 years waiting for this... and still no true support
> yet.
Re: [ceph-users] Proper procedure to replace DB/WAL SSD
A very interesting question, and I would add the follow-up question:

Is there an easy way to add external DB/WAL devices to an existing OSD?

I suspect that it might be something along the lines of:

- stop osd
- create a link in .../ceph/osd/ceph-XX/block.db to the target device
- (maybe run some kind of osd mkfs?)
- start osd

Has anyone done this so far, or does anyone have recommendations on how to do it? (See also the sketch below.)

Which also makes me wonder: what is actually the format of the WAL and BlockDB in bluestore? Is there any documentation available about it?

Best,

Nico

Caspar Smit writes:

> Hi All,
>
> What would be the proper way to preventively replace a DB/WAL SSD (when
> it is nearing its DWPD/TBW limit and has not failed yet)?
>
> It hosts DB partitions for 5 OSDs.
>
> Maybe something like:
>
> 1) ceph osd reweight 0 the 5 OSDs
> 2) let backfilling complete
> 3) destroy/remove the 5 OSDs
> 4) replace SSD
> 5) create 5 new OSDs with separate DB partition on new SSD
>
> When these 5 OSDs are big HDDs (8TB) a LOT of data has to be moved, so I
> thought maybe the following would work:
>
> 1) ceph osd set noout
> 2) stop the 5 OSDs (systemctl stop)
> 3) 'dd' the old SSD to a new SSD of same or bigger size
> 4) remove the old SSD
> 5) start the 5 OSDs (systemctl start)
> 6) let backfilling/recovery complete (only delta data between OSD stop and now)
> 7) ceph osd unset noout
>
> Would this be a viable method to replace a DB SSD? Any udev/serial nr/uuid
> stuff preventing this from working?
>
> Or is there another 'less hacky' way to replace a DB SSD without moving
> too much data?
>
> Kind regards,
> Caspar
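For the follow-up question: on Luminous there was no supported in-place path, but newer releases (Nautilus onwards, if memory serves) grew one in ceph-bluestore-tool. A hedged sketch, with the OSD id and LV name being examples:

    systemctl stop ceph-osd@12
    # attach a new, empty DB device to an existing bluestore OSD
    ceph-bluestore-tool bluefs-bdev-new-db \
        --path /var/lib/ceph/osd/ceph-12 \
        --dev-target /dev/vg_nvme/osd12_db
    systemctl start ceph-osd@12

On Luminous itself, the symlink-plus-mkfs idea above is untested territory; recreating the OSD remains the safe route.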
Re: [ceph-users] Restoring keyring capabilities
It seems your monitor capabilities are different to mine:

root@server3:/opt/ungleich-tools# ceph -k /var/lib/ceph/mon/ceph-server3/keyring -n mon. auth list
2018-02-16 20:34:59.257529 7fe0d5c6b700  0 librados: mon. authentication error (13) Permission denied
[errno 13] error connecting to the cluster

root@server3:/opt/ungleich-tools# cat /var/lib/ceph/mon/ceph-server3/keyring
[mon.]
        key = AQCp9IVa2GmYARAAVvCGfNpXfxOoUf119KAq1g==

Where you have

> root@ceph-mon1:/# cat /var/lib/ceph/mon/ceph-ceph-mon1/keyring
> [mon.]
>         key = AQD1y3RapVDCNxAAmInc8D3OPZKuTVeUcNsPug==
>         caps mon = "allow *"

Which probably explains why it works for you, but not for me.
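A recovery sketch that follows from the observation above (untested here; please try it on a throwaway cluster first): stop one monitor, append the missing caps line to its keyring, start it again, and then repair client.admin using the mon. identity:

    systemctl stop ceph-mon@server3
    # [mon.] must end up containing: caps mon = "allow *"
    # (the append assumes [mon.] is the only/last section, as in the output above)
    printf '\tcaps mon = "allow *"\n' >> /var/lib/ceph/mon/ceph-server3/keyring
    systemctl start ceph-mon@server3
    ceph -n mon. -k /var/lib/ceph/mon/ceph-server3/keyring \
        auth caps client.admin mon 'allow *' osd 'allow *' mds 'allow *' mgr 'allow *'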
Re: [ceph-users] Restoring keyring capabilities
Saw that, too, however it does not work:

root@server3:/var/lib/ceph/mon/ceph-server3# ceph -n mon. --keyring keyring auth caps client.admin mds 'allow *' osd 'allow *' mon 'allow *'
2018-02-16 17:23:38.154282 7f7e257e3700  0 librados: mon. authentication error (13) Permission denied
[errno 13] error connecting to the cluster

... which kind of makes sense, as the mon. key does not have the capabilities for it. Then again, I wonder how monitors actually talk to each other...

Michel Raabe writes:

> On 02/16/18 @ 18:21, Nico Schottelius wrote:
>> on a test cluster I issued a few seconds ago:
>>
>> ceph auth caps client.admin mgr 'allow *'
>>
>> instead of what I really wanted to do:
>>
>> ceph auth caps client.admin mgr 'allow *' mon 'allow *' osd 'allow *' \
>>     mds allow
>>
>> Now any access to the cluster using client.admin correctly results in
>> client.admin authentication error (13) Permission denied.
>>
>> Is there any way to modify the keyring capabilities "from behind",
>> i.e. by modifying the rocksdb of the monitors or similar?
>
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-January/015474.html
>
> Not verified.
>
> Regards,
> Michel
[ceph-users] Restoring keyring capabilities
Hello,

on a test cluster I issued a few seconds ago:

ceph auth caps client.admin mgr 'allow *'

instead of what I really wanted to do:

ceph auth caps client.admin mgr 'allow *' mon 'allow *' osd 'allow *' \
    mds allow

Now any access to the cluster using client.admin correctly results in client.admin authentication error (13) Permission denied.

Is there any way to modify the keyring capabilities "from behind", i.e. by modifying the rocksdb of the monitors or similar?

If the answer is no, it's not a big problem, as we can easily destroy the cluster; but if the answer is yes, it would be interesting to know how to get out of this.

Best,

Nico
[ceph-users] Is there a "set pool readonly" command?
Hello,

we have one pool in which about 10 disks failed last week (fortunately mostly sequentially), and which now has some pgs that are only left on one disk.

Is there a command to set one pool into "read-only" mode, or even "recovery-io-only" mode, so that the only thing ceph is doing is recovering, and no client i/o will disturb that process?

Best,

Nico
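As far as I know there is no per-pool read-only switch in Luminous, but there is a cluster-wide pause flag that blocks client reads and writes while recovery should keep running. A sketch, if taking all pools offline is acceptable:

    ceph osd set pause      # sets pauserd,pausewr: client I/O stops cluster-wide
    ceph -s                 # watch recovery progress
    ceph osd unset pause    # resume client I/O once the fragile pgs are safe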
[ceph-users] ceph-disk vs. ceph-volume: both error prone
Dear list,

for a few days we have been dissecting ceph-disk and ceph-volume to find out what the appropriate way of creating partitions for ceph is. For years already I have found ceph-disk (and especially ceph-deploy) very error prone, and we at ungleich are considering rewriting both into a ceph-block-do-what-I-want tool.

Considering only bluestore, I see that ceph-disk creates two partitions:

Device      Start        End    Sectors   Size Type
/dev/sde1    2048     206847     204800   100M Ceph OSD
/dev/sde2  206848 2049966046 2049759199 977.4G unknown

Does somebody know what exactly belongs onto the xfs-formatted first partition, and how is the data/wal/db device sde2 formatted?

What I would really like to know is how we can best extract this information, so that we are no longer depending on ceph-{disk,volume}.

Any pointer to the on-disk format would be much appreciated!

Best,

Nico
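One way to peek at what bluestore wrote to sde2 without going through ceph-disk/ceph-volume: the device starts with a bluestore label that ceph-bluestore-tool can print. A sketch, with the device name taken from the example above:

    # dump the bluestore superblock label: osd uuid, size, whoami, etc.
    ceph-bluestore-tool show-label --dev /dev/sde2

The small xfs partition (sde1) only carries the OSD "directory": keyring, whoami, type, fsid and the block/block.db symlinks; the actual data, WAL and DB live inside the raw sde2 device, managed by bluestore/bluefs.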
[ceph-users] Inactive PGs rebuild is not prioritized
Good morning,

after another disk failure, we currently have 7 inactive pgs [1], which are stalling IO from the affected VMs.

It seems that ceph, when rebuilding, does not focus on repairing the inactive PGs first, which surprised us quite a lot: it does not repair the inactive ones first, but mixes inactive with active+undersized+degraded+remapped+backfill_wait.

Is this a misconfiguration on our side or a design aspect of ceph?

I have attached ceph -s from three times while rebuilding below. First the number of active+undersized+degraded+remapped+backfill_wait decreases, and only much later does undersized+degraded+remapped+backfill_wait+peered decrease.

If anyone could comment on this, I would be very thankful to know how to progress here, as we had 6 disk failures this week, and each time we had inactive pgs that stalled the VM i/o.

Best,

Nico

[1]
  cluster:
    id: 26c0c5a8-d7ce-49ac-b5a7-bfd9d0ba81ab
    health: HEALTH_WARN
            108752/3920931 objects misplaced (2.774%)
            Reduced data availability: 7 pgs inactive
            Degraded data redundancy: 419786/3920931 objects degraded (10.706%), 147 pgs unclean, 140 pgs degraded, 140 pgs undersized

  services:
    mon: 3 daemons, quorum server5,server3,server2
    mgr: server5(active), standbys: server3, server2
    osd: 53 osds: 52 up, 52 in; 147 remapped pgs

  data:
    pools:   2 pools, 1280 pgs
    objects: 1276k objects, 4997 GB
    usage:   13481 GB used, 26853 GB / 40334 GB avail
    pgs:     0.547% pgs not active
             419786/3920931 objects degraded (10.706%)
             108752/3920931 objects misplaced (2.774%)
             1133 active+clean
             108  active+undersized+degraded+remapped+backfill_wait
             25   active+undersized+degraded+remapped+backfilling
             7    active+remapped+backfill_wait
             6    undersized+degraded+remapped+backfilling+peered
             1    undersized+degraded+remapped+backfill_wait+peered

  io:
    client: 29980 B/s rd, kB/s wr, 17 op/s rd, 74 op/s wr
    recovery: 71727 kB/s, 17 objects/s

[2]
[11:20:15] server3:~# ceph -s
  cluster:
    id: 26c0c5a8-d7ce-49ac-b5a7-bfd9d0ba81ab
    health: HEALTH_WARN
            103908/3920967 objects misplaced (2.650%)
            Reduced data availability: 7 pgs inactive
            Degraded data redundancy: 380860/3920967 objects degraded (9.713%), 144 pgs unclean, 137 pgs degraded, 137 pgs undersized

  services:
    mon: 3 daemons, quorum server5,server3,server2
    mgr: server5(active), standbys: server3, server2
    osd: 53 osds: 52 up, 52 in; 144 remapped pgs

  data:
    pools:   2 pools, 1280 pgs
    objects: 1276k objects, 4997 GB
    usage:   13630 GB used, 26704 GB / 40334 GB avail
    pgs:     0.547% pgs not active
             380860/3920967 objects degraded (9.713%)
             103908/3920967 objects misplaced (2.650%)
             1136 active+clean
             105  active+undersized+degraded+remapped+backfill_wait
             25   active+undersized+degraded+remapped+backfilling
             7    active+remapped+backfill_wait
             6    undersized+degraded+remapped+backfilling+peered
             1    undersized+degraded+remapped+backfill_wait+peered

  io:
    client: 40201 B/s rd, 1189 kB/s wr, 16 op/s rd, 74 op/s wr
    recovery: 54519 kB/s, 13 objects/s

[3]
  cluster:
    id: 26c0c5a8-d7ce-49ac-b5a7-bfd9d0ba81ab
    health: HEALTH_WARN
            88382/3921066 objects misplaced (2.254%)
            Reduced data availability: 4 pgs inactive
            Degraded data redundancy: 285528/3921066 objects degraded (7.282%), 127 pgs unclean, 121 pgs degraded, 115 pgs undersized
            14 slow requests are blocked > 32 sec

  services:
    mon: 3 daemons, quorum server5,server3,server2
    mgr: server5(active), standbys: server3, server2
    osd: 53 osds: 52 up, 52 in; 121 remapped pgs

  data:
    pools:   2 pools, 1280 pgs
    objects: 1276k objects, 4997 GB
    usage:   14014 GB used, 26320 GB / 40334 GB avail
    pgs:     0.313% pgs not active
             285528/3921066 objects degraded (7.282%)
             88382/3921066 objects misplaced (2.254%)
             1153 active+clean
             78   active+undersized+degraded+remapped+backfill_wait
             33   active+undersized+degraded+remapped+backfilling
             6    active+recovery_wait+degraded
             6    active+remapped+backfill_wait
             2    undersized+degraded+remapped+backfill_wait+peered
             2    undersized+degraded+remapped+backfilling+peered

  io:
    client: 56370 B/s rd, 5304 kB/s wr, 11 op/s rd, 44 op/s wr
    recovery: 37838 kB/s, 9 objects/s

And our tree:

[12:53:57] server4:~# ceph osd tree
ID CLASS WEIGHT   TYPE NAME    STATUS REWEIGHT PRI-AFF
-1       39.84532 root default
-6        7.28383     host server1
25 hdd    4.5             osd.25   up  1.0  1.0
48 ssd    0.22198         osd.48   up  1.0  1.0
49 ssd    0.22198
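Recent Luminous point releases do ship a knob for exactly this situation; assuming your 12.2.x already has it, inactive pgs can be pushed to the front of the queue by hand. A sketch (the pg ids are examples):

    ceph pg dump_stuck inactive          # list the pgs stalling client I/O
    ceph pg force-recovery 4.2a 4.31     # raise recovery priority for these pgs
    ceph pg force-backfill 4.32          # same, for pgs waiting on backfill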
Re: [ceph-users] [Best practice] Adding new data center
Hey Wido,

> [...]
> Like I said, latency, latency, latency. That's what matters. Bandwidth
> usually isn't a real problem.

I imagined that.

> What latency do you have with a 8k ping between hosts?

As the link will be set up this week, I cannot tell yet. However, we currently have a 65km link with ~2ms latency. In our data center, we currently have ~0.4 ms latency. (Both 8k pings.)

Do you see similar latencies in your setup?

Best,

Nico
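For reference, the 8k ping behind such numbers is nothing more elaborate than this sketch (fragmentation left enabled so it also crosses 1500-MTU hops):

    # 8192-byte payload, 100 samples; read rtt min/avg/max from the last line
    ping -c 100 -s 8192 otherhost | tail -2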
[ceph-users] [Best practice] Adding new data center
Good evening list,

we are soon expanding our data center [0] to a new location [1]. We mainly offer VPS / VM hosting, so rbd is our main interest.

We have a low latency 10 Gbit/s link between this and our other location [2], and we are wondering what the best practice for expanding is.

Naturally we think about creating a new ceph cluster that is independent from the first location, so that connection interrupts (unlikely) or different power outages (more likely) are less of a concern.

Given that we would be running two different ceph clusters, we think about rbd mirroring, so that we can (partially) mirror one side to the other or vice versa. However, using this approach we lose the possibility to have very big rbd images (big as in tens to hundreds of TBs), as the storage is divided.

My question to the list is: how have you handled this situation so far? Would you also recommend splitting, or have you expanded ceph clusters over several kilometers of range so far? With what experiences?

I am very curious to hear your answers!

Best,

Nico

[0] https://datacenterlight.ch
[1] Linthal, in pretty Glarus https://www.google.ch/maps/place/Linthal,+8783+Glarus+S%C3%BCd/
[2] Schwanden, also pretty https://www.google.ch/maps/place/Schwanden,+8762+Glarus+S%C3%BCd/
Re: [ceph-users] Adding disks -> getting unfound objects [Luminous]
Hey Burkhard,

we did actually restart osd.61, which led to the current status.

Best,

Nico

Burkhard Linke writes:

> On 01/23/2018 08:54 AM, Nico Schottelius wrote:
>> Good morning,
>>
>> the osd.61 actually just crashed and the disk is still intact. However,
>> after 8 hours of rebuilding, the unfound objects are still missing:
>
> *snipsnap*
>
>> Is there any chance to recover those pgs or did we actually lose data
>> with a 2 disk failure?
>>
>> And is there any way out of this besides going with
>>
>> ceph pg {pg-id} mark_unfound_lost revert|delete
>>
>> ?
>
> Just my 2 cents:
>
> If the disk is still intact and the data is still readable, you can try
> to export the pg content with ceph-objectstore-tool, and import it into
> another OSD.
>
> On the other hand: if the disk is still intact, just restart the OSD?
Re: [ceph-users] Adding disks -> getting unfound objects [Luminous]
... while trying to locate which VMs are potentially affected by a revert/delete, we noticed that

root@server1:~# rados -p one-hdd ls

hangs.

Where does ceph store the index of block devices found in a pool? And is it possible that this information is in one of the damaged pgs?

Nico

Nico Schottelius writes:

> Good morning,
>
> the osd.61 actually just crashed and the disk is still intact. However,
> after 8 hours of rebuilding, the unfound objects are still missing:
>
> root@server1:~# ceph -s
>   cluster:
>     id: 26c0c5a8-d7ce-49ac-b5a7-bfd9d0ba81ab
>     health: HEALTH_WARN
>             noscrub,nodeep-scrub flag(s) set
>             111436/3017766 objects misplaced (3.693%)
>             9377/1005922 objects unfound (0.932%)
>             Reduced data availability: 84 pgs inactive
>             Degraded data redundancy: 277034/3017766 objects degraded (9.180%), 84 pgs unclean, 84 pgs degraded, 84 pgs undersized
>             mon server2 is low on available space
>
>   services:
>     mon: 3 daemons, quorum server5,server3,server2
>     mgr: server5(active), standbys: server2, 2, 0, server3
>     osd: 54 osds: 54 up, 54 in; 84 remapped pgs
>     flags noscrub,nodeep-scrub
>
>   data:
>     pools:   3 pools, 1344 pgs
>     objects: 982k objects, 3837 GB
>     usage:   10618 GB used, 39030 GB / 49648 GB avail
>     pgs:     6.250% pgs not active
>              277034/3017766 objects degraded (9.180%)
>              111436/3017766 objects misplaced (3.693%)
>              9377/1005922 objects unfound (0.932%)
>              1260 active+clean
>              84   recovery_wait+undersized+degraded+remapped+peered
>
>   io:
>     client: 68960 B/s rd, 20722 kB/s wr, 12 op/s rd, 77 op/s wr
>
> We tried restarting osd.61, but ceph health detail does not change
> anymore:
>
> HEALTH_WARN noscrub,nodeep-scrub flag(s) set; 111436/3017886 objects misplaced (3.693%); 9377/1005962 objects unfound (0.932%); Reduced data availability: 84 pgs inactive; Degraded data redundancy: 277034/3017886 objects degraded (9.180%), 84 pgs unclean, 84 pgs degraded, 84 pgs undersized; mon server2 is low on available space
> OSDMAP_FLAGS noscrub,nodeep-scrub flag(s) set
> OBJECT_MISPLACED 111436/3017886 objects misplaced (3.693%)
> OBJECT_UNFOUND 9377/1005962 objects unfound (0.932%)
>     pg 4.fa has 117 unfound objects
>     pg 4.ff has 107 unfound objects
>     pg 4.fd has 113 unfound objects
>     ...
>     pg 4.2a has 108 unfound objects
>
> PG_AVAILABILITY Reduced data availability: 84 pgs inactive
>     pg 4.2a is stuck inactive for 64117.189552, current state recovery_wait+undersized+degraded+remapped+peered, last acting [61]
>     pg 4.31 is stuck inactive for 64117.147636, current state recovery_wait+undersized+degraded+remapped+peered, last acting [61]
>     pg 4.32 is stuck inactive for 64117.178461, current state recovery_wait+undersized+degraded+remapped+peered, last acting [61]
>     pg 4.34 is stuck inactive for 64117.150475, current state recovery_wait+undersized+degraded+remapped+peered, last acting [61]
>     ...
>
> PG_DEGRADED Degraded data redundancy: 277034/3017886 objects degraded (9.180%), 84 pgs unclean, 84 pgs degraded, 84 pgs undersized
>     pg 4.2a is stuck unclean for 131612.984555, current state recovery_wait+undersized+degraded+remapped+peered, last acting [61]
>     pg 4.31 is stuck undersized for 221.568468, current state recovery_wait+undersized+degraded+remapped+peered, last acting [61]
>
> Is there any chance to recover those pgs or did we actually lose data
> with a 2 disk failure?
>
> And is there any way out of this besides going with
>
> ceph pg {pg-id} mark_unfound_lost revert|delete
>
> ?
>
> Best,
>
> Nico
>
> p.s.: the ceph 4.2a query:
>
> {
>     "state": "recovery_wait+undersized+degraded+remapped+peered",
>     "snap_trimq": "[]",
>     "epoch": 17879,
>     "up": [ 17, 13, 25 ],
>     "acting": [ 61 ],
>     "backfill_targets": [ "13", "17", "25" ],
>     "actingbackfill": [ "13", "17", "25", "61" ],
>     "info": {
>         "pgid": "4.2a",
>         "last_update": "17529'53875",
>         "last_complete": "17217'45447",
>         "log_tail": "17090'43812"
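Regarding where the index lives: for rbd pools the image directory is kept in the omap of a single object, rbd_directory, so if that object sits in one of the damaged pgs, both `rados ls` and `rbd ls` will hang. A sketch to check, assuming the pool name from above:

    ceph osd map one-hdd rbd_directory            # which pg holds the index object?
    rados -p one-hdd listomapvals rbd_directory   # dump the image name -> id mapping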
Re: [ceph-users] Adding disks -> getting unfound objects [Luminous]
        "last_became_peered": "0.00",
        "last_unstale": "0.00",
        "last_undegraded": "0.00",
        "last_fullsized": "0.00",
        "mapping_epoch": 17878,
        "log_start": "0'0",
        "ondisk_log_start": "0'0",
        "created": 0,
        "last_epoch_clean": 0,
        "parent": "0.0",
        "parent_split_bits": 0,
        "last_scrub": "0'0",
        "last_scrub_stamp": "0.00",
        "last_deep_scrub": "0'0",
        "last_deep_scrub_stamp": "0.00",
        "last_clean_scrub_stamp": "0.00",
        "log_size": 0,
        "ondisk_log_size": 0,
        "stats_invalid": false,
        "dirty_stats_invalid": false,
        "omap_stats_invalid": false,
        "hitset_stats_invalid": false,
        "hitset_bytes_stats_invalid": false,
        "pin_stats_invalid": false,
        "stat_sum": {
            "num_bytes": 0,
            "num_objects": 0,
            "num_object_clones": 0,
            "num_object_copies": 0,
            "num_objects_missing_on_primary": 0,
            "num_objects_missing": 0,
            "num_objects_degraded": 0,
            "num_objects_misplaced": 0,
            "num_objects_unfound": 0,
            "num_objects_dirty": 0,
            "num_whiteouts": 0,
            "num_read": 0,
            "num_read_kb": 0,
            "num_write": 0,
            "num_write_kb": 0,
            "num_scrub_errors": 0,
            "num_shallow_scrub_errors": 0,
            "num_deep_scrub_errors": 0,
            "num_objects_recovered": 0,
            "num_bytes_recovered": 0,
            "num_keys_recovered": 0,
            "num_objects_omap": 0,
            "num_objects_hit_set_archive": 0,
            "num_bytes_hit_set_archive": 0,
            "num_flush": 0,
            "num_flush_kb": 0,
            "num_evict": 0,
            "num_evict_kb": 0,
            "num_promote": 0,
            "num_flush_mode_high": 0,
            "num_flush_mode_low": 0,
            "num_evict_mode_some": 0,
            "num_evict_mode_full": 0,
            "num_objects_pinned": 0,
            "num_legacy_snapsets": 0
        },
        "up": [ 17, 13, 25 ],
        "acting": [ 61 ],
        "blocked_by": [],
        "up_primary": 17,
        "acting_primary": 61
    },
    "empty": 0,
    "dne": 0,
    "incomplete": 1,
    "last_epoch_started": 17137,
    "hit_set_history": {
        "current_last_update": "0'0",
        "history": []
    }
},
{
    "peer": "25",
    "pgid": "4.2a",
    "last_update": "17529'53875",
    "last_complete": "17529'53875",
    "log_tail": "17090'43812",
    "last_user_version": 53875,
    "last_backfill": "MIN",
    "last_backfill_bitwise": 1,
    "purged_snaps": [
        { "start": "1", "length": "3" },
        { "start": "6", "length": "8" },
        { "start": "10", "length": "2" }
    ],
    "history": {
Re: [ceph-users] Adding disks -> getting unfound objects [Luminous]
While writing, yet another disk (osd.61 now) died, and now we have 172 pgs down:

[19:32:35] server2:~# ceph -s
  cluster:
    id: 26c0c5a8-d7ce-49ac-b5a7-bfd9d0ba81ab
    health: HEALTH_WARN
            noscrub,nodeep-scrub flag(s) set
            21033/2263701 objects misplaced (0.929%)
            Reduced data availability: 186 pgs inactive, 172 pgs down
            Degraded data redundancy: 67370/2263701 objects degraded (2.976%), 219 pgs unclean, 46 pgs degraded, 46 pgs undersized
            mon server2 is low on available space

  services:
    mon: 3 daemons, quorum server5,server3,server2
    mgr: server5(active), standbys: server2, 2, 0, server3
    osd: 54 osds: 53 up, 53 in; 47 remapped pgs
    flags noscrub,nodeep-scrub

  data:
    pools:   3 pools, 1344 pgs
    objects: 736k objects, 2889 GB
    usage:   8517 GB used, 36474 GB / 44991 GB avail
    pgs:     13.839% pgs not active
             67370/2263701 objects degraded (2.976%)
             21033/2263701 objects misplaced (0.929%)
             1125 active+clean
             172  down
             26   active+undersized+degraded+remapped+backfilling
             14   undersized+degraded+remapped+backfilling+peered
             6    active+undersized+degraded+remapped+backfill_wait
             1    active+remapped+backfill_wait

  io:
    client: 835 kB/s rd, 262 kB/s wr, 16 op/s rd, 25 op/s wr
    recovery: 102 MB/s, 26 objects/s

What is the most sensible way to get out of this situation?

David Turner writes:

> I do remember seeing that exactly. As the number of recovery_wait pgs
> decreased, the number of unfound objects decreased until they were all
> found. Unfortunately it blocked some IO from happening during the
> recovery, but in the long run we ended up with full data integrity again.
>
> On Mon, Jan 22, 2018 at 1:03 PM Nico Schottelius
> <nico.schottel...@ungleich.ch> wrote:
>
>> Hey David,
>>
>> thanks for the fast answer. All our pools are running with size=3,
>> min_size=2 and the two disks were in 2 different hosts.
>>
>> What I am a bit worried about is the output of "ceph pg 4.fa query" (see
>> below) that indicates that ceph already queried all other hosts and did
>> not find the data anywhere.
>>
>> Do you remember having seen something similar?
>>
>> Best,
>>
>> Nico
>>
>> David Turner writes:
>>
>> > I have had the same problem before with unfound objects that happened while
>> > backfilling after losing a drive. We didn't lose drives outside of the
>> > failure domains and ultimately didn't lose any data, but we did have to
>> > wait until after all of the PGs in recovery_wait state were caught up. So
>> > if the 2 disks you lost were in the same host and your CRUSH rules are set
>> > so that you can lose a host without losing data, then the cluster will
>> > likely find all of the objects by the time it's done backfilling. With
>> > only losing 2 disks, I wouldn't worry about the missing objects not
>> > becoming found unless your pool size=2.
>> >
>> > On Mon, Jan 22, 2018 at 11:47 AM Nico Schottelius
>> > <nico.schottel...@ungleich.ch> wrote:
>> >
>> >> Hello,
>> >>
>> >> we added about 7 new disks yesterday/today and our cluster became very
>> >> slow. While the rebalancing took place, 2 of the 7 newly added disks
>> >> died.
>> >>
>> >> Our cluster is still recovering, however we spotted that there are a lot
>> >> of unfound objects.
>> >>
>> >> We lost osd.63 and osd.64, which seem not to be involved in the sample
>> >> pg that has unfound objects.
>> >>
>> >> We were wondering why there are unfound objects, where they are coming
>> >> from and if there is a way to recover them?
>> >>
>> >> Any help appreciated,
>> >>
>> >> Best,
>> >>
>> >> Nico
>> >>
>> >> Our status is:
>> >>
>> >>   cluster:
>> >>     id: 26c0c5a8-d7ce-49ac-b5a7-bfd9d0ba81ab
>> >>     health: HEALTH_WARN
>> >>             261953/3006663 objects misplaced (8.712%)
>> >>             9377/1002221 objects unfound (0.936%)
>> >>             Reduced data availability: 176 pgs inactive
>> >>             Degraded data redundancy: 609338/3006663 objects degraded (20.266%), 243 pgs unclean, 222 pgs degraded, 213 pgs undersized
>> >>             mon server2 is
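To see exactly which objects would be affected before reverting or deleting anything, each pg can be asked for its missing set. A sketch, with the pg id taken from the health output above:

    ceph pg 4.fa list_missing | less       # enumerate the unfound objects
    # only once you are sure nothing more can be found:
    ceph pg 4.fa mark_unfound_lost revert  # or: delete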
Re: [ceph-users] Adding disks -> getting unfound objects [Luminous]
Hey David,

thanks for the fast answer. All our pools are running with size=3, min_size=2, and the two disks were in 2 different hosts.

What I am a bit worried about is the output of "ceph pg 4.fa query" (see below), which indicates that ceph already queried all other hosts and did not find the data anywhere.

Do you remember having seen something similar?

Best,

Nico

David Turner writes:

> I have had the same problem before with unfound objects that happened while
> backfilling after losing a drive. We didn't lose drives outside of the
> failure domains and ultimately didn't lose any data, but we did have to
> wait until after all of the PGs in recovery_wait state were caught up. So
> if the 2 disks you lost were in the same host and your CRUSH rules are set
> so that you can lose a host without losing data, then the cluster will
> likely find all of the objects by the time it's done backfilling. With
> only losing 2 disks, I wouldn't worry about the missing objects not
> becoming found unless your pool size=2.
>
> On Mon, Jan 22, 2018 at 11:47 AM Nico Schottelius
> <nico.schottel...@ungleich.ch> wrote:
>
>> Hello,
>>
>> we added about 7 new disks yesterday/today and our cluster became very
>> slow. While the rebalancing took place, 2 of the 7 newly added disks
>> died.
>>
>> Our cluster is still recovering, however we spotted that there are a lot
>> of unfound objects.
>>
>> We lost osd.63 and osd.64, which seem not to be involved in the sample
>> pg that has unfound objects.
>>
>> We were wondering why there are unfound objects, where they are coming
>> from and if there is a way to recover them?
>>
>> Any help appreciated,
>>
>> Best,
>>
>> Nico
>>
>> Our status is:
>>
>>   cluster:
>>     id: 26c0c5a8-d7ce-49ac-b5a7-bfd9d0ba81ab
>>     health: HEALTH_WARN
>>             261953/3006663 objects misplaced (8.712%)
>>             9377/1002221 objects unfound (0.936%)
>>             Reduced data availability: 176 pgs inactive
>>             Degraded data redundancy: 609338/3006663 objects degraded (20.266%), 243 pgs unclean, 222 pgs degraded, 213 pgs undersized
>>             mon server2 is low on available space
>>
>>   services:
>>     mon: 3 daemons, quorum server5,server3,server2
>>     mgr: server5(active), standbys: 2, server2, 0, server3
>>     osd: 54 osds: 54 up, 54 in; 234 remapped pgs
>>
>>   data:
>>     pools:   3 pools, 1344 pgs
>>     objects: 978k objects, 3823 GB
>>     usage:   9350 GB used, 40298 GB / 49648 GB avail
>>     pgs:     13.095% pgs not active
>>              609338/3006663 objects degraded (20.266%)
>>              261953/3006663 objects misplaced (8.712%)
>>              9377/1002221 objects unfound (0.936%)
>>              1101 active+clean
>>              84   recovery_wait+undersized+degraded+remapped+peered
>>              82   undersized+degraded+remapped+backfill_wait+peered
>>              23   active+undersized+degraded+remapped+backfill_wait
>>              18   active+remapped+backfill_wait
>>              14   active+undersized+degraded+remapped+backfilling
>>              10   undersized+degraded+remapped+backfilling+peered
>>              9    active+recovery_wait+degraded
>>              3    active+remapped+backfilling
>>
>>   io:
>>     client: 624 kB/s rd, 3255 kB/s wr, 22 op/s rd, 66 op/s wr
>>     recovery: 90148 kB/s, 22 objects/s
>>
>> Looking at the unfound objects:
>>
>> [17:32:17] server1:~# ceph health detail
>> HEALTH_WARN 263745/3006663 objects misplaced (8.772%); 9377/1002221 objects unfound (0.936%); Reduced data availability: 176 pgs inactive; Degraded data redundancy: 612398/3006663 objects degraded (20.368%), 244 pgs unclean, 223 pgs degraded, 214 pgs undersized; mon server2 is low on available space
>> OBJECT_MISPLACED 263745/3006663 objects misplaced (8.772%)
>> OBJECT_UNFOUND 9377/1002221 objects unfound (0.936%)
>>     pg 4.fa has 117 unfound objects
>>     pg 4.ff has 107 unfound objects
>>     pg 4.fd has 113 unfound objects
>>     pg 4.f0 has 120 unfound objects
>>
>> Output from ceph pg 4.fa query:
>>
>> {
>>     "state": "recovery_wait+undersized+degraded+remapped+peered",
>>     "snap_trimq": "[]",
>>     "epoch": 17561,
>>     "up": [ 8, 17, 25 ],
>
[ceph-users] Adding disks -> getting unfound objects [Luminous]
Hello,

we added about 7 new disks yesterday/today and our cluster became very slow. While the rebalancing took place, 2 of the 7 newly added disks died.

Our cluster is still recovering, however we spotted that there are a lot of unfound objects.

We lost osd.63 and osd.64, which seem not to be involved in the sample pg that has unfound objects.

We were wondering why there are unfound objects, where they are coming from and if there is a way to recover them?

Any help appreciated,

Best,

Nico

Our status is:

  cluster:
    id: 26c0c5a8-d7ce-49ac-b5a7-bfd9d0ba81ab
    health: HEALTH_WARN
            261953/3006663 objects misplaced (8.712%)
            9377/1002221 objects unfound (0.936%)
            Reduced data availability: 176 pgs inactive
            Degraded data redundancy: 609338/3006663 objects degraded (20.266%), 243 pgs unclean, 222 pgs degraded, 213 pgs undersized
            mon server2 is low on available space

  services:
    mon: 3 daemons, quorum server5,server3,server2
    mgr: server5(active), standbys: 2, server2, 0, server3
    osd: 54 osds: 54 up, 54 in; 234 remapped pgs

  data:
    pools:   3 pools, 1344 pgs
    objects: 978k objects, 3823 GB
    usage:   9350 GB used, 40298 GB / 49648 GB avail
    pgs:     13.095% pgs not active
             609338/3006663 objects degraded (20.266%)
             261953/3006663 objects misplaced (8.712%)
             9377/1002221 objects unfound (0.936%)
             1101 active+clean
             84   recovery_wait+undersized+degraded+remapped+peered
             82   undersized+degraded+remapped+backfill_wait+peered
             23   active+undersized+degraded+remapped+backfill_wait
             18   active+remapped+backfill_wait
             14   active+undersized+degraded+remapped+backfilling
             10   undersized+degraded+remapped+backfilling+peered
             9    active+recovery_wait+degraded
             3    active+remapped+backfilling

  io:
    client: 624 kB/s rd, 3255 kB/s wr, 22 op/s rd, 66 op/s wr
    recovery: 90148 kB/s, 22 objects/s

Looking at the unfound objects:

[17:32:17] server1:~# ceph health detail
HEALTH_WARN 263745/3006663 objects misplaced (8.772%); 9377/1002221 objects unfound (0.936%); Reduced data availability: 176 pgs inactive; Degraded data redundancy: 612398/3006663 objects degraded (20.368%), 244 pgs unclean, 223 pgs degraded, 214 pgs undersized; mon server2 is low on available space
OBJECT_MISPLACED 263745/3006663 objects misplaced (8.772%)
OBJECT_UNFOUND 9377/1002221 objects unfound (0.936%)
    pg 4.fa has 117 unfound objects
    pg 4.ff has 107 unfound objects
    pg 4.fd has 113 unfound objects
    pg 4.f0 has 120 unfound objects

Output from ceph pg 4.fa query:

{
    "state": "recovery_wait+undersized+degraded+remapped+peered",
    "snap_trimq": "[]",
    "epoch": 17561,
    "up": [ 8, 17, 25 ],
    "acting": [ 61 ],
    "backfill_targets": [ "8", "17", "25" ],
    "actingbackfill": [ "8", "17", "25", "61" ],
    "info": {
        "pgid": "4.fa",
        "last_update": "17529'85051",
        "last_complete": "17217'77468",
        "log_tail": "17091'75034",
        "last_user_version": 85051,
        "last_backfill": "MAX",
        "last_backfill_bitwise": 0,
        "purged_snaps": [
            { "start": "1", "length": "3" },
            { "start": "6", "length": "8" },
            { "start": "10", "length": "2" }
        ],
        "history": {
            "epoch_created": 9134,
            "epoch_pool_created": 9134,
            "last_epoch_started": 17528,
            "last_interval_started": 17527,
            "last_epoch_clean": 17079,
            "last_interval_clean": 17078,
            "last_epoch_split": 0,
            "last_epoch_marked_full": 0,
            "same_up_since": 17143,
            "same_interval_since": 17530,
            "same_primary_since": 17515,
            "last_scrub": "17090'57357",
            "last_scrub_stamp": "2018-01-20 20:45:32.616142",
            "last_deep_scrub": "17082'54734",
            "last_deep_scrub_stamp": "2018-01-15 21:09:34.121488",
            "last_clean_scrub_stamp": "2018-01-20 20:45:32.616142"
        },
        "stats": {
            "version": "17529'85051",
            "reported_seq": "218453",
            "reported_epoch": "17561",
            "state": "recovery_wait+undersized+degraded+remapped+peered",
            "last_fresh": "2018-01-22 17:42:28.196701",
            "last_change": "2018-01-22 15:00:46.507189",
            "last_active": "2018-01-22 15:00:44.635399",
            "last_peered": "2018-01-22 17:42:28.196701",
            "last_clean": "2018-01-21 20:15:48.267209",
            "last_became_active": "2018-01-22 14:53:07.918893",
            "last_became_peered
[ceph-users] Adding Monitor ceph freeze, monitor 100% cpu usage
Hello,

our problems with ceph monitors continue in version 12.2.2: adding a specific monitor causes all monitors to hang and no longer respond to ceph -s or similar.

Interestingly, when this monitor is on (mon.server2), the other two monitors (mon.server3, mon.server5) randomly begin to consume 100% cpu time, until we restart them, after which the procedure repeats.

The monitor mon.server2 interestingly has a different view of the cluster: when the other two are electing, it is in state synchronising.

We recently noticed that the MTU of the bond0 device that we use was set up to be 9200, and the vlan-tagged device bond0.2, which we use for ceph, also had a 9200 MTU. We raised the underlying devices and bond0 to 9204 and restarted the monitors, but the problem persists.

Does anyone have a hint on how to further debug this problem?

I have added the logs from the time when we tried to restart the monitor on server2.

Best,

Nico

ceph-mon.server2.log.bz2 (attachment: BZip2 compressed data)
ceph-mon.server5.log.bz2 (attachment: BZip2 compressed data)
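If the admin socket still answers while ceph -s hangs, the debug levels can be raised at runtime to capture the synchronisation in more detail. A sketch:

    ceph daemon mon.server2 config set debug_mon 20/20
    ceph daemon mon.server2 config set debug_paxos 20/20
    ceph daemon mon.server2 mon_status | grep '"state"'   # synchronizing vs electing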
Re: [ceph-users] How to enable jumbo frames on IPv6 only cluster?
Hello,

we are running everything IPv6 only. You just need to set up the MTU on your devices (nics, switches) correctly; nothing ceph- or IPv6-specific is required.

If you are using SLAAC (like we do), you can also announce the MTU via RA.

Best,

Nico

Jack writes:

> Or maybe you reach that ipv4 directly, and that ipv6 via a router, somehow.
>
> Check your routing table and neighbor table.
>
> On 27/10/2017 16:02, Wido den Hollander wrote:
>>
>>> Op 27 oktober 2017 om 14:22 schreef Félix Barbeira:
>>>
>>> Hi,
>>>
>>> I'm trying to configure a ceph cluster using IPv6 only but I can't
>>> enable jumbo frames. I made the definition in the 'interfaces' file
>>> and it seems like the value is applied, but when I test it, it looks
>>> like it only works on IPv4, not IPv6.
>>>
>>> It works on IPv4:
>>>
>>> root@ceph-node01:~# ping -c 3 -M do -s 8972 ceph-node02
>>> PING ceph-node02 (x.x.x.x) 8972(9000) bytes of data.
>>> 8980 bytes from ceph-node02 (x.x.x.x): icmp_seq=1 ttl=64 time=0.474 ms
>>> 8980 bytes from ceph-node02 (x.x.x.x): icmp_seq=2 ttl=64 time=0.254 ms
>>> 8980 bytes from ceph-node02 (x.x.x.x): icmp_seq=3 ttl=64 time=0.288 ms
>>
>> Verify with Wireshark/tcpdump if it really sends 9k packets. I doubt it.
>>
>>> --- ceph-node02 ping statistics ---
>>> 3 packets transmitted, 3 received, 0% packet loss, time 2000ms
>>> rtt min/avg/max/mdev = 0.254/0.338/0.474/0.099 ms
>>>
>>> root@ceph-node01:~#
>>>
>>> But *not* in IPv6:
>>>
>>> root@ceph-node01:~# ping6 -c 3 -M do -s 8972 ceph-node02
>>> PING ceph-node02(x:x:x:x:x:x:x:x) 8972 data bytes
>>> ping: local error: Message too long, mtu=1500
>>> ping: local error: Message too long, mtu=1500
>>> ping: local error: Message too long, mtu=1500
>>
>> Like Ronny already mentioned, check the switches and the receiver.
>> There is a 1500 MTU somewhere configured.
>>
>> Wido
>>
>>> --- ceph-node02 ping statistics ---
>>> 4 packets transmitted, 0 received, +4 errors, 100% packet loss, time 3024ms
>>>
>>> root@ceph-node01:~#
>>>
>>> root@ceph-node01:~# ifconfig
>>> eno1   Link encap:Ethernet  HWaddr 24:6e:96:05:55:f8
>>>        inet6 addr: 2a02:x:x:x:x:x:x:x/64 Scope:Global
>>>        inet6 addr: fe80::266e:96ff:fe05:55f8/64 Scope:Link
>>>        UP BROADCAST RUNNING MULTICAST  *MTU:9000*  Metric:1
>>>        RX packets:633318 errors:0 dropped:0 overruns:0 frame:0
>>>        TX packets:649607 errors:0 dropped:0 overruns:0 carrier:0
>>>        collisions:0 txqueuelen:1000
>>>        RX bytes:463355602 (463.3 MB)  TX bytes:498891771 (498.8 MB)
>>>
>>> lo     Link encap:Local Loopback
>>>        inet addr:127.0.0.1  Mask:255.0.0.0
>>>        inet6 addr: ::1/128 Scope:Host
>>>        UP LOOPBACK RUNNING  MTU:65536  Metric:1
>>>        RX packets:127420 errors:0 dropped:0 overruns:0 frame:0
>>>        TX packets:127420 errors:0 dropped:0 overruns:0 carrier:0
>>>        collisions:0 txqueuelen:1
>>>        RX bytes:179470326 (179.4 MB)  TX bytes:179470326 (179.4 MB)
>>>
>>> root@ceph-node01:~#
>>>
>>> root@ceph-node01:~# cat /etc/network/interfaces
>>> # This file describes network interfaces available on your system
>>> # and how to activate them. For more information, see interfaces(5).
>>>
>>> source /etc/network/interfaces.d/*
>>>
>>> # The loopback network interface
>>> auto lo
>>> iface lo inet loopback
>>>
>>> # The primary network interface
>>> auto eno1
>>> iface eno1 inet6 auto
>>>     post-up ifconfig eno1 mtu 9000
>>> root@ceph-node01:#
>>>
>>> Please help!
>>>
>>> --
>>> Félix Barbeira.
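Two details worth adding on the ping test above. First, the arithmetic: IPv6 headers are 40 bytes against IPv4's 20, so `-s 8972` yields a 9020-byte IPv6 packet (8972 + 8 ICMPv6 + 40 IPv6) that can never fit a 9000 MTU; the IPv6 equivalent of the IPv4 test is `-s 8952`. Second, the `mtu=1500` in the error suggests the IPv6 route MTU is 1500, typically because a router advertisement carries MTU 1500, which silently overrides the link MTU for IPv6. A sketch for checking both (addresses are examples):

    ip -6 route get 2001:db8::2          # shows a per-route mtu if one is set
    ping6 -c 3 -M do -s 8952 ceph-node02 # 8952 + 8 + 40 = 9000 bytes on the wire

With SLAAC, the fix is usually to advertise the jumbo MTU as well, e.g. AdvLinkMTU 9000 in radvd.conf.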
Re: [ceph-users] [MONITOR SEGFAULT] Luminous cluster stuck when adding monitor
Hey Joao,

thanks for the pointer! Do you have a timeline for the release of v12.2.2?

Best,

Nico
Re: [ceph-users] [MONITOR SEGFAULT] Luminous cluster stuck when adding monitor
Hello Joao,

thanks for coming back!

I copied the log of the crashing monitor to http://www.nico.schottelius.org/cephmonlog-2017-10-08-v2.xz

Can I somehow get access to the logs of the other monitors, without restarting them?

I would like not to stop them, as we are currently running with 2/3 monitors, and adding a new one does not seem easily possible, because adding a new one means losing the quorum and then being unable to remove the new one, because the quorum is lost with 2/4 nodes. (This is what actually happened about a week ago in our cluster.)

Best,

Nico

Joao Eduardo Luis writes:

> Hi Nico,
>
> I'm sorry I forgot about your issue. Crazy few weeks.
>
> I checked the log you initially sent to the list, but it only contains
> the log from one of the monitors, and it's from the one synchronizing.
> This monitor is not stuck however - synchronizing is progressing, albeit
> slowly.
>
> Can you please share the logs of the other monitors, especially of
> those crashing?
>
> -Joao
>
> On 10/18/2017 06:58 AM, Nico Schottelius wrote:
>>
>> Hello everyone,
>>
>> is there any solution in sight for this problem? Currently our cluster
>> is stuck with a 2 monitor configuration, as every time we restart the
>> one on server2, it crashes after some minutes (and in between the
>> cluster is stuck).
>>
>> Should we consider downgrading to kraken to fix that problem?
>>
>> Best,
>>
>> Nico
Re: [ceph-users] [MONITOR SEGFAULT] Luminous cluster stuck when adding monitor
Hello everyone,

is there any solution in sight for this problem? Currently our cluster is stuck with a 2 monitor configuration, as every time we restart the one on server2, it crashes after some minutes (and in between the cluster is stuck).

Should we consider downgrading to kraken to fix that problem?

Best,

Nico
Re: [ceph-users] [MONITOR SEGFAULT] Luminous cluster stuck when adding monitor
Good morning Joao,

thanks for your feedback! We do actually have three managers running:

  cluster:
    id: 26c0c5a8-d7ce-49ac-b5a7-bfd9d0ba81ab
    health: HEALTH_WARN
            1/3 mons down, quorum server5,server3

  services:
    mon: 3 daemons, quorum server5,server3, out of quorum: server2
    mgr: 0(active), standbys: 1, 2
    osd: 57 osds: 57 up, 57 in

  data:
    pools:   3 pools, 1344 pgs
    objects: 580k objects, 2256 GB
    usage:   6778 GB used, 30276 GB / 37054 GB avail
    pgs:     1344 active+clean

  io:
    client: 17705 B/s rd, 14586 kB/s wr, 21 op/s rd, 70 op/s wr

Joao Eduardo Luis writes:

> This looks a lot like a bug I fixed a week or so ago, but for which I
> currently don't recall the ticket off the top of my head. It was
> basically a crash each time a "ceph osd df" was called, if a mgr was not
> available after having set the luminous osd require flag. I will check
> the log in the morning to figure out whether you need to upgrade to a
> newer version or if this is a corner case the fix missed. In the mean
> time, check if you have ceph-mgr running, because that's the easy work
> around (assuming it's the same bug).
>
> -Joao
Re: [ceph-users] [CLUSTER STUCK] Luminous cluster stuck when adding monitor
After spending some hours on debugging packets on the wire, without seeing a good reason for things not to work, the monitor on server2 eventually joined the quorum.

We were happy for some time, until our alerting sent a message that the quorum was lost. And indeed, the monitor on server2 died, and now comes the not so funny part: restarting the monitor makes the cluster hang again.

I will post another debug log in the next hours, now from the monitor on server2.

Nico Schottelius writes:

> Not sure if I mentioned it before: adding a new monitor also puts the
> whole cluster into a stuck state.
>
> Some minutes ago I did:
>
> root@server1:~# ceph mon add server2 2a0a:e5c0::92e2:baff:fe4e:6614
> port defaulted to 6789; adding mon.server2 at [2a0a:e5c0::92e2:baff:fe4e:6614]:6789/0
>
> And then started the daemon on server2:
>
> ceph-mon -i server2 --pid-file /var/lib/ceph/run/mon.server2.pid -c
> /etc/ceph/ceph.conf --cluster ceph --setuser ceph --setgroup ceph -d 2>&1 |
> tee ~/cephmonlog-2017-10-08-2
>
> And now the cluster hangs (as in ceph -s does not return).
>
> Looking at the mon_status of server5 shows that server5 thinks it is
> time for electing [0].
>
> When stopping the monitor on server2 and trying to remove server2 again,
> the removal command also gets stuck and never returns:
>
> root@server1:~# ceph mon rm server2
>
> As our cluster is now severely degraded, I was wondering if anyone has a
> quick hint on how to get ceph -s back working and/or remove server2
> and/or how to re-add server1?
>
> Best,
>
> Nico
>
> [0]
>
> [10:50:38] server5:~# ceph daemon mon.server5 mon_status
> {
>     "name": "server5",
>     "rank": 0,
>     "state": "electing",
>     "election_epoch": 6087,
>     "quorum": [],
>     "features": {
>         "required_con": "153140804152475648",
>         "required_mon": [ "kraken", "luminous" ],
>         "quorum_con": "2305244844532236283",
>         "quorum_mon": [ "kraken", "luminous" ]
>     },
>     "outside_quorum": [],
>     "extra_probe_peers": [ "[2a0a:e5c0::92e2:baff:fe4e:6614]:6789/0" ],
>     "sync_provider": [],
>     "monmap": {
>         "epoch": 11,
>         "fsid": "26c0c5a8-d7ce-49ac-b5a7-bfd9d0ba81ab",
>         "modified": "2017-10-08 10:43:49.667986",
>         "created": "2017-05-16 22:33:04.500528",
>         "features": {
>             "persistent": [ "kraken", "luminous" ],
>             "optional": []
>         },
>         "mons": [
>             { "rank": 0, "name": "server5", "addr": "[2a0a:e5c0::21b:21ff:fe85:a3a2]:6789/0", "public_addr": "[2a0a:e5c0::21b:21ff:fe85:a3a2]:6789/0" },
>             { "rank": 1, "name": "server3", "addr": "[2a0a:e5c0::21b:21ff:fe85:a42a]:6789/0", "public_addr": "[2a0a:e5c0::21b:21ff:fe85:a42a]:6789/0" },
>             { "rank": 2, "name": "server2", "addr": "[2a0a:e5c0::92e2:baff:fe4e:6614]:6789/0", "public_addr": "[2a0a:e5c0::92e2:baff:fe4e:6614]:6789/0" },
>             { "rank": 3, "name": "server1", "addr": "[2a0a:e5c0::92e2:baff:fe8a:2e78]:6789/0", "public_addr": "[2a0a:e5c0::92e2:baff:fe8a:2e78]:6789/0" }
>         ]
>     },
>     "feature_map": {
>         "mon": {
>             "group": {
>                 "features": "0x1ffddff8eea4fffb",
>                 "release": "luminous",
>                 "num": 1
>             }
>         },
>         "client": {
>             "group": {
>                 "features": "0x1ffddff8eea4fffb",
>
Re: [ceph-users] [CLUSTER STUCK] Luminous cluster stuck when adding monitor
Not sure if I mentioned it before: adding a new monitor also puts the whole cluster into a stuck state.

Some minutes ago I did:

root@server1:~# ceph mon add server2 2a0a:e5c0::92e2:baff:fe4e:6614
port defaulted to 6789; adding mon.server2 at [2a0a:e5c0::92e2:baff:fe4e:6614]:6789/0

And then started the daemon on server2:

ceph-mon -i server2 --pid-file /var/lib/ceph/run/mon.server2.pid -c /etc/ceph/ceph.conf --cluster ceph --setuser ceph --setgroup ceph -d 2>&1 | tee ~/cephmonlog-2017-10-08-2

And now the cluster hangs (as in ceph -s does not return).

Looking at the mon_status of server5 shows that server5 thinks it is time for electing [0].

When stopping the monitor on server2 and trying to remove server2 again, the removal command also gets stuck and never returns:

root@server1:~# ceph mon rm server2

As our cluster is now severely degraded, I was wondering if anyone has a quick hint on how to get ceph -s back working and/or remove server2 and/or how to re-add server1?

Best,

Nico

[0]

[10:50:38] server5:~# ceph daemon mon.server5 mon_status
{
    "name": "server5",
    "rank": 0,
    "state": "electing",
    "election_epoch": 6087,
    "quorum": [],
    "features": {
        "required_con": "153140804152475648",
        "required_mon": [ "kraken", "luminous" ],
        "quorum_con": "2305244844532236283",
        "quorum_mon": [ "kraken", "luminous" ]
    },
    "outside_quorum": [],
    "extra_probe_peers": [ "[2a0a:e5c0::92e2:baff:fe4e:6614]:6789/0" ],
    "sync_provider": [],
    "monmap": {
        "epoch": 11,
        "fsid": "26c0c5a8-d7ce-49ac-b5a7-bfd9d0ba81ab",
        "modified": "2017-10-08 10:43:49.667986",
        "created": "2017-05-16 22:33:04.500528",
        "features": {
            "persistent": [ "kraken", "luminous" ],
            "optional": []
        },
        "mons": [
            {
                "rank": 0,
                "name": "server5",
                "addr": "[2a0a:e5c0::21b:21ff:fe85:a3a2]:6789/0",
                "public_addr": "[2a0a:e5c0::21b:21ff:fe85:a3a2]:6789/0"
            },
            {
                "rank": 1,
                "name": "server3",
                "addr": "[2a0a:e5c0::21b:21ff:fe85:a42a]:6789/0",
                "public_addr": "[2a0a:e5c0::21b:21ff:fe85:a42a]:6789/0"
            },
            {
                "rank": 2,
                "name": "server2",
                "addr": "[2a0a:e5c0::92e2:baff:fe4e:6614]:6789/0",
                "public_addr": "[2a0a:e5c0::92e2:baff:fe4e:6614]:6789/0"
            },
            {
                "rank": 3,
                "name": "server1",
                "addr": "[2a0a:e5c0::92e2:baff:fe8a:2e78]:6789/0",
                "public_addr": "[2a0a:e5c0::92e2:baff:fe8a:2e78]:6789/0"
            }
        ]
    },
    "feature_map": {
        "mon": {
            "group": {
                "features": "0x1ffddff8eea4fffb",
                "release": "luminous",
                "num": 1
            }
        },
        "client": {
            "group": {
                "features": "0x1ffddff8eea4fffb",
                "release": "luminous",
                "num": 4
            }
        }
    }
}

Nico Schottelius writes:

> Good evening Joao,
>
> we double checked our MTUs, they are all 9200 on the servers and 9212 on
> the switches. And we have no problems transferring big files in general
> (as opennebula copies around images for importing, we do this quite a
> lot).
>
> So if you could have a look, it would be much appreciated.
>
> If we should collect other logs, just let us know.
>
> Best,
>
> Nico
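When `ceph mon rm` hangs because quorum itself is gone, the monmap can instead be edited offline; this is essentially the documented "removing monitors from an unhealthy cluster" procedure, sketched here for this setup (verify against the docs for your release before running it):

    systemctl stop ceph-mon@server5                  # stop every surviving mon first
    ceph-mon -i server5 --extract-monmap /tmp/monmap
    monmaptool --print /tmp/monmap
    monmaptool --rm server2 /tmp/monmap              # drop the stuck monitor
    ceph-mon -i server5 --inject-monmap /tmp/monmap  # repeat inject on each surviving mon
    systemctl start ceph-mon@server5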
Re: [ceph-users] Luminous cluster stuck when adding monitor
Good evening Joao, we double checked our MTUs, they are all 9200 on the servers and 9212 on the switches. And we have no problems transferring big files in general (as opennebula copies around images for importing, we do this quite a lot). So if you could have a look, it would be much appreciated. If we should collect other logs, just let us know. Best, Nico Joao Eduardo Luis writes: > On 10/04/2017 09:19 PM, Gregory Farnum wrote: >> Oh, hmm, you're right. I see synchronization starts but it seems to >> progress very slowly, and it certainly doesn't complete in that 2.5 >> minute logging window. I don't see any clear reason why it's so >> slow; it might be more clear if you could provide logs of the other >> logs at the same time (especially since you now say they are getting >> stuck in the electing state during that period). Perhaps Kefu or >> Joao will have some clearer idea what the problem is. >> -Greg > > I haven't gone through logs yet (maybe Friday, it's late today and > it's a holiday tomorrow), but not so long ago I seem to recall someone > having a similar issue with the monitors that was solely related to a > switch's MTU being too small. > > Maybe that could be the case? If not, I'll take a look at the logs as > soon as possible. > > -Joao > >> >> On Wed, Oct 4, 2017 at 1:04 PM Nico Schottelius >> mailto:nico.schottel...@ungleich.ch>> >> wrote: >> >> >> Some more detail: >> >> when restarting the monitor on server1, it stays in synchronizing state >> forever. >> >> However the other two monitors change into electing state. >> >> I have double checked that there are not (host) firewalls active and >> that the times are within 1 second different of the hosts (they all have >> ntpd running). >> >> We are running everything on IPv6, but this should not be a problem, >> should it? >> >> Best, >> >> Nico >> >> >> Nico Schottelius > <mailto:nico.schottel...@ungleich.ch>> writes: >> >> > Hello Gregory, >> > >> > the logfile I produced has already debug mon = 20 set: >> > >> > [21:03:51] server1:~# grep "debug mon" /etc/ceph/ceph.conf >> > debug mon = 20 >> > >> > It is clear that server1 is out of quorum, however how do we make it >> > being part of the quorum again? >> > >> > I expected that the quorum finding process is triggered automatically >> > after restarting the monitor, or is that incorrect? >> > >> > Best, >> > >> > Nico >> > >> > >> > Gregory Farnum mailto:gfar...@redhat.com>> >> writes: >> > >> >> You'll need to change the config so that it's running "debug mon >> = 20" for >> >> the log to be very useful here. It does say that it's dropping >> client >> >> connections because it's been out of quorum for too long, which >> is the >> >> correct behavior in general. I'd imagine that you've got clients >> trying to >> >> connect to the new monitor instead of the ones already in the >> quorum and >> >> not passing around correctly; this is all configurable. >> >> >> >> On Wed, Oct 4, 2017 at 4:09 AM Nico Schottelius < >> >> nico.schottel...@ungleich.ch >> <mailto:nico.schottel...@ungleich.ch>> wrote: >> >> >> >>> >> >>> Good morning, >> >>> >> >>> we have recently upgraded our kraken cluster to luminous and >> since then >> >>> noticed an odd behaviour: we cannot add a monitor anymore. >> >>> >> >>> As soon as we start a new monitor (server2), ceph -s and ceph >> -w start to >> >>> hang. 
>> >>> >> >>> The situation became worse, since one of our staff stopped an >> existing >> >>> monitor (server1), as restarting that monitor results in the same >> >>> situation, ceph -s hangs until we stop the monitor again. >> >>> >> >>> We kept the monitor running for some minutes, but the situation >> never >> >>> cl
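Regarding the MTU check discussed above: whether jumbo frames really pass end-to-end can be verified with a don't-fragment ping sized just below the interface MTU. A sketch for the IPv6 setup described in this thread (9200 bytes MTU minus 40 bytes IPv6 header minus 8 bytes ICMPv6 header leaves 9152 bytes of payload):

    # if this fails while smaller sizes succeed, something on the path
    # is dropping jumbo frames
    ping6 -M do -s 9152 -c 3 server3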
Re: [ceph-users] Luminous cluster stuck when adding monitor
Some more detail: when restarting the monitor on server1, it stays in synchronizing state forever. However, the other two monitors change into electing state. I have double checked that there are no (host) firewalls active and that the times on the hosts are within 1 second of each other (they all have ntpd running). We are running everything on IPv6, but this should not be a problem, should it? Best, Nico Nico Schottelius writes: > Hello Gregory, > > the logfile I produced has already debug mon = 20 set: > > [21:03:51] server1:~# grep "debug mon" /etc/ceph/ceph.conf > debug mon = 20 > > It is clear that server1 is out of quorum, however how do we make it > being part of the quorum again? > > I expected that the quorum finding process is triggered automatically > after restarting the monitor, or is that incorrect? > > Best, > > Nico > > > Gregory Farnum writes: > >> You'll need to change the config so that it's running "debug mon = 20" for >> the log to be very useful here. It does say that it's dropping client >> connections because it's been out of quorum for too long, which is the >> correct behavior in general. I'd imagine that you've got clients trying to >> connect to the new monitor instead of the ones already in the quorum and >> not passing around correctly; this is all configurable. >> >> On Wed, Oct 4, 2017 at 4:09 AM Nico Schottelius < >> nico.schottel...@ungleich.ch> wrote: >> >>> >>> Good morning, >>> >>> we have recently upgraded our kraken cluster to luminous and since then >>> noticed an odd behaviour: we cannot add a monitor anymore. >>> >>> As soon as we start a new monitor (server2), ceph -s and ceph -w start to >>> hang. >>> >>> The situation became worse, since one of our staff stopped an existing >>> monitor (server1), as restarting that monitor results in the same >>> situation, ceph -s hangs until we stop the monitor again. >>> >>> We kept the monitor running for some minutes, but the situation never >>> cleares up. >>> >>> The network does not have any firewall in between the nodes and there >>> are no host firewalls. >>> >>> I have attached the output of the monitor on server1, running in >>> foreground using >>> >>> root@server1:~# ceph-mon -i server1 --pid-file >>> /var/lib/ceph/run/mon.server1.pid -c /etc/ceph/ceph.conf --cluster ceph >>> --setuser ceph --setgroup ceph -d 2>&1 | tee cephmonlog >>> >>> Does anyone see any obvious problem in the attached log? >>> >>> Any input or hint would be appreciated! >>> >>> Best, >>> >>> Nico >>> >>> >>> >>> -- >>> Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch >>> ___ >>> ceph-users mailing list >>> ceph-users@lists.ceph.com >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >>> -- Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
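The time claim is easy to re-verify: with the quorum broken, ceph status cannot report clock skew, so comparing the clocks directly is the pragmatic check. A sketch using the host names from this thread:

    for h in server1 server3 server5; do
        printf '%s: ' "$h"; ssh "$h" date +%s.%N
    done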
Re: [ceph-users] Luminous cluster stuck when adding monitor
Hello Gregory, the logfile I produced has already debug mon = 20 set: [21:03:51] server1:~# grep "debug mon" /etc/ceph/ceph.conf debug mon = 20 It is clear that server1 is out of quorum; however, how do we make it part of the quorum again? I expected that the quorum finding process is triggered automatically after restarting the monitor, or is that incorrect? Best, Nico Gregory Farnum writes: > You'll need to change the config so that it's running "debug mon = 20" for > the log to be very useful here. It does say that it's dropping client > connections because it's been out of quorum for too long, which is the > correct behavior in general. I'd imagine that you've got clients trying to > connect to the new monitor instead of the ones already in the quorum and > not passing around correctly; this is all configurable. > > On Wed, Oct 4, 2017 at 4:09 AM Nico Schottelius < > nico.schottel...@ungleich.ch> wrote: > >> >> Good morning, >> >> we have recently upgraded our kraken cluster to luminous and since then >> noticed an odd behaviour: we cannot add a monitor anymore. >> >> As soon as we start a new monitor (server2), ceph -s and ceph -w start to >> hang. >> >> The situation became worse, since one of our staff stopped an existing >> monitor (server1), as restarting that monitor results in the same >> situation, ceph -s hangs until we stop the monitor again. >> >> We kept the monitor running for some minutes, but the situation never >> cleares up. >> >> The network does not have any firewall in between the nodes and there >> are no host firewalls. >> >> I have attached the output of the monitor on server1, running in >> foreground using >> >> root@server1:~# ceph-mon -i server1 --pid-file >> /var/lib/ceph/run/mon.server1.pid -c /etc/ceph/ceph.conf --cluster ceph >> --setuser ceph --setgroup ceph -d 2>&1 | tee cephmonlog >> >> Does anyone see any obvious problem in the attached log? >> >> Any input or hint would be appreciated! >> >> Best, >> >> Nico >> >> >> >> -- >> Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch >> ___ >> ceph-users mailing list >> ceph-users@lists.ceph.com >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >> -- Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
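One detail worth noting for this situation: ceph tell mon.* injectargs needs a working cluster connection, so on a cluster without quorum the monitor's admin socket is the more reliable way to raise log levels at runtime. A sketch, assuming the default admin socket location:

    # on the host running the affected monitor
    ceph daemon mon.server1 config set debug_mon 20/20
    ceph daemon mon.server1 config set debug_ms 1/1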
[ceph-users] Luminous cluster stuck when adding monitor
Good morning, we have recently upgraded our kraken cluster to luminous and since then noticed an odd behaviour: we cannot add a monitor anymore. As soon as we start a new monitor (server2), ceph -s and ceph -w start to hang. The situation became worse when one of our staff stopped an existing monitor (server1), as restarting that monitor results in the same situation: ceph -s hangs until we stop the monitor again. We kept the monitor running for some minutes, but the situation never clears up. The network does not have any firewall in between the nodes and there are no host firewalls. I have attached the output of the monitor on server1, running in foreground using root@server1:~# ceph-mon -i server1 --pid-file /var/lib/ceph/run/mon.server1.pid -c /etc/ceph/ceph.conf --cluster ceph --setuser ceph --setgroup ceph -d 2>&1 | tee cephmonlog Does anyone see any obvious problem in the attached log? Any input or hint would be appreciated! Best, Nico cephmonlog.bz2 Description: BZip2 compressed data -- Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] RBD I/O errors with QEMU [luminous upgrade/osd change]
Well, we basically needed to fix it; that's why we did it :-) Blair Bethwaite writes: > Great to see this issue sorted. > > I have to say I am quite surprised anyone would implement the > export/import workaround mentioned here without *first* racing to this > ML or IRC and crying out for help. This is a valuable resource, made > more so by people sharing issues. > > Cheers, > > On 12 September 2017 at 07:22, Jason Dillaman wrote: >> Yes -- the upgrade documentation definitely needs to be updated to add >> a pre-monitor upgrade step to verify your caps before proceeding -- I >> will take care of that under this ticket [1]. I believe the OpenStack >> documentation has been updated [2], but let me know if you find other >> places. >> >> [1] http://tracker.ceph.com/issues/21353 >> [2] >> http://docs.ceph.com/docs/master/rbd/rbd-openstack/#setup-ceph-client-authentication >> >> On Mon, Sep 11, 2017 at 5:16 PM, Nico Schottelius >> wrote: >>> >>> That indeed worked! Thanks a lot! >>> >>> The remaining question from my side: did we do anything wrong in the >>> upgrade process and if not, should it be documented somewhere how to >>> setup the permissions correctly on upgrade? >>> >>> Or should the documentation on the side of the cloud infrastructure >>> software be updated? >>> >>> >>> >>> Jason Dillaman writes: >>> >>>> Since you have already upgraded to Luminous, the fastest and probably >>>> easiest way to fix this is to run "ceph auth caps client.libvirt mon >>>> 'profile rbd' osd 'profile rbd pool=one'" [1]. Luminous provides >>>> simplified RBD caps via named profiles which ensure all the correct >>>> permissions are enabled. >>>> >>>> [1] >>>> http://docs.ceph.com/docs/master/rados/operations/user-management/#authorization-capabilities >>> >>> -- >>> Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch >> >> >> >> -- >> Jason >> ___ >> ceph-users mailing list >> ceph-users@lists.ceph.com >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] RBD I/O errors with QEMU [luminous upgrade/osd change]
For opennebula this would be http://docs.opennebula.org/5.4/deployment/open_cloud_storage_setup/ceph_ds.html (added opennebula in CC) Jason Dillaman writes: > Yes -- the upgrade documentation definitely needs to be updated to add > a pre-monitor upgrade step to verify your caps before proceeding -- I > will take care of that under this ticket [1]. I believe the OpenStack > documentation has been updated [2], but let me know if you find other > places. > > [1] http://tracker.ceph.com/issues/21353 > [2] > http://docs.ceph.com/docs/master/rbd/rbd-openstack/#setup-ceph-client-authentication > > On Mon, Sep 11, 2017 at 5:16 PM, Nico Schottelius > wrote: >> >> That indeed worked! Thanks a lot! >> >> The remaining question from my side: did we do anything wrong in the >> upgrade process and if not, should it be documented somewhere how to >> setup the permissions correctly on upgrade? >> >> Or should the documentation on the side of the cloud infrastructure >> software be updated? >> >> >> >> Jason Dillaman writes: >> >>> Since you have already upgraded to Luminous, the fastest and probably >>> easiest way to fix this is to run "ceph auth caps client.libvirt mon >>> 'profile rbd' osd 'profile rbd pool=one'" [1]. Luminous provides >>> simplified RBD caps via named profiles which ensure all the correct >>> permissions are enabled. >>> >>> [1] >>> http://docs.ceph.com/docs/master/rados/operations/user-management/#authorization-capabilities >> >> -- >> Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch -- Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] RBD I/O errors with QEMU [luminous upgrade/osd change]
That indeed worked! Thanks a lot! The remaining question from my side: did we do anything wrong in the upgrade process, and if not, should it be documented somewhere how to set up the permissions correctly on upgrade? Or should the documentation on the side of the cloud infrastructure software be updated? Jason Dillaman writes: > Since you have already upgraded to Luminous, the fastest and probably > easiest way to fix this is to run "ceph auth caps client.libvirt mon > 'profile rbd' osd 'profile rbd pool=one'" [1]. Luminous provides > simplified RBD caps via named profiles which ensure all the correct > permissions are enabled. > > [1] > http://docs.ceph.com/docs/master/rados/operations/user-management/#authorization-capabilities -- Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
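The lesson from this thread condenses into two commands: review the caps of every cephx key that librbd clients use before upgrading the monitors, and move them to the rbd profiles afterwards. Client and pool names below are the ones from this thread:

    # before the mon upgrade: review what the key may do
    ceph auth get client.libvirt

    # on luminous: switch to the simplified profiles, which include the
    # blacklist permission needed to break stale exclusive locks
    ceph auth caps client.libvirt mon 'profile rbd' osd 'profile rbd pool=one'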
Re: [ceph-users] RBD I/O errors with QEMU [luminous upgrade/osd change]
Hey Jason, here it is: [22:42:12] server4:~# ceph auth get client.libvirt exported keyring for client.libvirt [client.libvirt] key = ... caps mgr = "allow r" caps mon = "allow r" caps osd = "allow class-read object_prefix rbd_children, allow rwx pool=one" [22:52:57] server4:~# p.s.: I am also available for online chat on https://brandnewchat.ungleich.ch/ in case you need more information quickly. Jason Dillaman writes: > I see the following which is most likely the issue: > > 2017-09-11 22:26:38.945776 7efd677fe700 -1 > librbd::managed_lock::BreakRequest: 0x7efd58020e70 handle_blacklist: > failed to blacklist lock owner: (13) Permission denied > 2017-09-11 22:26:38.945795 7efd677fe700 10 > librbd::managed_lock::BreakRequest: 0x7efd58020e70 finish: r=-13 > 2017-09-11 22:26:38.945798 7efd677fe700 10 > librbd::managed_lock::AcquireRequest: 0x7efd60017960 > handle_break_lock: r=-13 > 2017-09-11 22:26:38.945800 7efd677fe700 -1 > librbd::managed_lock::AcquireRequest: 0x7efd60017960 > handle_break_lock: failed to break lock : (13) Permission denied > 2017-09-11 22:26:38.945865 7efd677fe700 10 librbd::ManagedLock: > 0x7efd580267d0 handle_acquire_lock: r=-13 > 2017-09-11 22:26:38.945873 7efd677fe700 -1 librbd::ManagedLock: > 0x7efd580267d0 handle_acquire_lock: failed to acquire exclusive > lock:(13) Permission denied > 2017-09-11 22:26:38.945883 7efd677fe700 10 librbd::ExclusiveLock: > 0x7efd580267d0 post_acquire_lock_handler: r=-13 > 2017-09-11 22:26:38.945887 7efd677fe700 10 librbd::ImageState: > 0x55b55ace8dc0 handle_prepare_lock_complete > 2017-09-11 22:26:38.945892 7efd677fe700 10 librbd::ManagedLock: > 0x7efd580267d0 handle_post_acquire_lock: r=-13 > 2017-09-11 22:26:38.945895 7efd677fe700 5 librbd::io::ImageRequestWQ: > 0x55b55ace9a20 handle_acquire_lock: r=-13, req=0x55b55add32a0 > 2017-09-11 22:26:38.945901 7efd677fe700 -1 librbd::io::AioCompletion: > 0x55b55add46a0 fail: (13) Permission denied > > It looks like your "client.libvirt" user lacks the permission to > blacklist a dead client that had previously acquired the exclusive > lock and failed to release it. > > Can you provide the results from "ceph auth get client.libvirt"? I > suspect it only has 'caps mon = "allow r"'. > > On Mon, Sep 11, 2017 at 4:45 PM, Nico Schottelius > wrote: >> >> >> Thanks a lot for the great ceph.conf pointer, Mykola! >> >> I found something interesting: >> >> 2017-09-11 22:26:23.418796 7efd7d479700 10 client.1039597.objecter >> ms_dispatch 0x55b55ab8f950 osd_op_reply(4 rbd_header.df7343d1b58ba [call] >> v0'0 uv0 ondisk = -8 ((8) Exec format error)) v8 >> 2017-09-11 22:26:23.439501 7efd7dc7a700 10 client.1039597.objecter >> ms_dispatch 0x55b55ab8f950 osd_op_reply(14 rbd_header.2b0c02ae8944a >> [call] v0'0 uv0 ondisk = -8 ((8) Exec format error)) v8 >> >> Not sure if those are the ones causing the problem, but at least some >> error. >> >> I have uploaded the log at >> http://www.nico.schottelius.org/ceph.client.libvirt.41670.log.bz2 >> >> I wonder if anyone sees the real reason for the I/O errors in the log? >> >> Best, >> >> Nico >> >>> Mykola Golub writes: >>> >>>> On Sun, Sep 10, 2017 at 03:56:21PM +0200, Nico Schottelius wrote: >>>>> >>>>> Just tried and there is not much more log in ceph -w (see below) neither >>>>> from the qemu process. 
>>>>> >>>>> [15:52:43] server4:~$ /usr/bin/qemu-system-x86_64 -name one-17031 -S >>>>> -machine pc-i440fx-2.1,accel=kvm,usb=off -m 8192 -realtime mlock=off >>>>> -smp 6,sockets=6,cores=1,threads=1 -uuid >>>>> 79845fca-9b26-4072-bcb3-7f5206c2a531 -no-user-config -nodefaults >>>>> -chardev >>>>> socket,id=charmonitor,path=/var/lib/libvirt/qemu/one-17031.monitor,server,nowait >>>>> -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc >>>>> -no-shutdown -boot strict=on -device >>>>> piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive >>>>> file='rbd:one/one-29-17031-0:id=libvirt:key=DELETEME:auth_supported=cephx\;none:mon_host=server1\:6789\;server3\:6789\;server5\:6789,if=none,id=drive-virtio-disk0,format=raw,cache=none' >>>>> -device >>>>> virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 >>>>> -drive >>>>> file=/var/lib/one//datastores/100/17031/disk.1,if=none,id=drive-ide0-0-0,readonly=on,format=raw >>>>> -devic
Re: [ceph-users] RBD I/O errors with QEMU [luminous upgrade/osd change]
The only error message I see is from dmesg when trying to access the XFS filesystem (see attached image). Let me know if you need any more logs - luckily I can spin up this VM in a broken state as often as you want to :-) Jason Dillaman writes: > ... also, do have any logs from the OS associated w/ this log file? I > am specifically looking for anything to indicate which sector was > considered corrupt. > > On Mon, Sep 11, 2017 at 4:41 PM, Jason Dillaman wrote: >> Thanks -- I'll take a look to see if anything else stands out. That >> "Exec format error" isn't actually an issue -- but now that I know >> about it, we can prevent it from happening in the future [1] >> >> [1] http://tracker.ceph.com/issues/21360 >> >> On Mon, Sep 11, 2017 at 4:32 PM, Nico Schottelius >> wrote: >>> >>> >>> Thanks a lot for the great ceph.conf pointer, Mykola! >>> >>> I found something interesting: >>> >>> 2017-09-11 22:26:23.418796 7efd7d479700 10 client.1039597.objecter >>> ms_dispatch 0x55b55ab8f950 osd_op_reply(4 rbd_header.df7343d1b58ba [call] >>> v0'0 uv0 ondisk = -8 ((8) Exec format error)) v8 >>> 2017-09-11 22:26:23.439501 7efd7dc7a700 10 client.1039597.objecter >>> ms_dispatch 0x55b55ab8f950 osd_op_reply(14 rbd_header.2b0c02ae8944a >>> [call] v0'0 uv0 ondisk = -8 ((8) Exec format error)) v8 >>> >>> Not sure if those are the ones causing the problem, but at least some >>> error. >>> >>> I have attached the bzip'd log file for reference (1.7MiB, hope it makes >>> it to the list) and wonder if anyone sees the real reason for the I/O >>> errors? >>> >>> Best, >>> >>> Nico >>> >>> >>> >>> >>> >>> >>> >>> Mykola Golub writes: >>> >>>> On Sun, Sep 10, 2017 at 03:56:21PM +0200, Nico Schottelius wrote: >>>>> >>>>> Just tried and there is not much more log in ceph -w (see below) neither >>>>> from the qemu process. >>>>> >>>>> [15:52:43] server4:~$ /usr/bin/qemu-system-x86_64 -name one-17031 -S >>>>> -machine pc-i440fx-2.1,accel=kvm,usb=off -m 8192 -realtime mlock=off >>>>> -smp 6,sockets=6,cores=1,threads=1 -uuid >>>>> 79845fca-9b26-4072-bcb3-7f5206c2a531 -no-user-config -nodefaults >>>>> -chardev >>>>> socket,id=charmonitor,path=/var/lib/libvirt/qemu/one-17031.monitor,server,nowait >>>>> -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc >>>>> -no-shutdown -boot strict=on -device >>>>> piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive >>>>> file='rbd:one/one-29-17031-0:id=libvirt:key=DELETEME:auth_supported=cephx\;none:mon_host=server1\:6789\;server3\:6789\;server5\:6789,if=none,id=drive-virtio-disk0,format=raw,cache=none' >>>>> -device >>>>> virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 >>>>> -drive >>>>> file=/var/lib/one//datastores/100/17031/disk.1,if=none,id=drive-ide0-0-0,readonly=on,format=raw >>>>> -device ide-cd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 -vnc >>>>> [::]:21131 -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device >>>>> virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6 -msg timestamp=on 2>&1 >>>>> | tee kvmlogwithdebug >>>>> >>>>> -> no output >>>> >>>> Try to find where the qemu process writes the ceph log, e.g. with the >>>> help of lsof utility. Or add something like below >>>> >>>> log file = /tmp/ceph.$name.$pid.log >>>> >>>> to ceph.conf before starting qemu and look for /tmp/ceph.*.log >>> >>> >>> -- >>> Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch >>> >> >> >> >> -- >> Jason -- Modern, affordable, Swiss Virtual Machines.
Visit www.datacenterlight.ch ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
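Jason's question about the corrupt sector can usually be answered from the dmesg output itself; to then poke at a suspect sector directly, bypassing the page cache, something like the following works inside the guest (a sketch; /dev/vda and sector 4242 are placeholders):

    # read one 512-byte sector with O_DIRECT; an I/O error here
    # reproduces the failure below the filesystem layer
    dd if=/dev/vda of=/dev/null bs=512 skip=4242 count=1 iflag=direct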
Re: [ceph-users] RBD I/O errors with QEMU [luminous upgrade/osd change]
Thanks a lot for the great ceph.conf pointer, Mykola! I found something interesting: 2017-09-11 22:26:23.418796 7efd7d479700 10 client.1039597.objecter ms_dispatch 0x55b55ab8f950 osd_op_reply(4 rbd_header.df7343d1b58ba [call] v0'0 uv0 ondisk = -8 ((8) Exec format error)) v8 2017-09-11 22:26:23.439501 7efd7dc7a700 10 client.1039597.objecter ms_dispatch 0x55b55ab8f950 osd_op_reply(14 rbd_header.2b0c02ae8944a [call] v0'0 uv0 ondisk = -8 ((8) Exec format error)) v8 Not sure if those are the ones causing the problem, but at least some error. I have uploaded the log at http://www.nico.schottelius.org/ceph.client.libvirt.41670.log.bz2 I wonder if anyone sees the real reason for the I/O errors in the log? Best, Nico > Mykola Golub writes: > >> On Sun, Sep 10, 2017 at 03:56:21PM +0200, Nico Schottelius wrote: >>> >>> Just tried and there is not much more log in ceph -w (see below) neither >>> from the qemu process. >>> >>> [15:52:43] server4:~$ /usr/bin/qemu-system-x86_64 -name one-17031 -S >>> -machine pc-i440fx-2.1,accel=kvm,usb=off -m 8192 -realtime mlock=off >>> -smp 6,sockets=6,cores=1,threads=1 -uuid >>> 79845fca-9b26-4072-bcb3-7f5206c2a531 -no-user-config -nodefaults >>> -chardev >>> socket,id=charmonitor,path=/var/lib/libvirt/qemu/one-17031.monitor,server,nowait >>> -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc >>> -no-shutdown -boot strict=on -device >>> piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive >>> file='rbd:one/one-29-17031-0:id=libvirt:key=DELETEME:auth_supported=cephx\;none:mon_host=server1\:6789\;server3\:6789\;server5\:6789,if=none,id=drive-virtio-disk0,format=raw,cache=none' >>> -device >>> virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 >>> -drive >>> file=/var/lib/one//datastores/100/17031/disk.1,if=none,id=drive-ide0-0-0,readonly=on,format=raw >>> -device ide-cd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 -vnc >>> [::]:21131 -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device >>> virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6 -msg timestamp=on 2>&1 | >>> tee kvmlogwithdebug >>> >>> -> no output >> >> Try to find where the qemu process writes the ceph log, e.g. with the >> help of lsof utility. Or add something like below >> >> log file = /tmp/ceph.$name.$pid.log >> >> to ceph.conf before starting qemu and look for /tmp/ceph.*.log -- Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
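Put together, the pieces used in this thread amount to a client-side ceph.conf snippet like the following (a sketch; the [client] section affects every librbd consumer on the host, so it should be removed again after debugging):

    [client]
        log file = /tmp/ceph.$name.$pid.log
        debug rbd = 20
        debug objecter = 20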
Re: [ceph-users] RBD I/O errors with QEMU [luminous upgrade/osd change]
Hey Mykola, thanks for the hint, I will test this in a few hours when I'm back on a regular Internet connection! Best, Nico Mykola Golub writes: > On Sun, Sep 10, 2017 at 03:56:21PM +0200, Nico Schottelius wrote: >> >> Just tried and there is not much more log in ceph -w (see below) neither >> from the qemu process. >> >> [15:52:43] server4:~$ /usr/bin/qemu-system-x86_64 -name one-17031 -S >> -machine pc-i440fx-2.1,accel=kvm,usb=off -m 8192 -realtime mlock=off >> -smp 6,sockets=6,cores=1,threads=1 -uuid >> 79845fca-9b26-4072-bcb3-7f5206c2a531 -no-user-config -nodefaults >> -chardev >> socket,id=charmonitor,path=/var/lib/libvirt/qemu/one-17031.monitor,server,nowait >> -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc >> -no-shutdown -boot strict=on -device >> piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive >> file='rbd:one/one-29-17031-0:id=libvirt:key=DELETEME:auth_supported=cephx\;none:mon_host=server1\:6789\;server3\:6789\;server5\:6789,if=none,id=drive-virtio-disk0,format=raw,cache=none' >> -device >> virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 >> -drive >> file=/var/lib/one//datastores/100/17031/disk.1,if=none,id=drive-ide0-0-0,readonly=on,format=raw >> -device ide-cd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 -vnc >> [::]:21131 -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device >> virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6 -msg timestamp=on 2>&1 | >> tee kvmlogwithdebug >> >> -> no output > > Try to find where the qemu process writes the ceph log, e.g. with the > help of lsof utility. Or add something like below > > log file = /tmp/ceph.$name.$pid.log > > to ceph.conf before starting qemu and look for /tmp/ceph.*.log -- Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] RBD I/O errors with QEMU [luminous upgrade/osd change]
Sarunas, may I ask when this happened? And did you move OSDs or mons after that export/import procedure? I really wonder what the reason for this behaviour is, and whether we are likely to experience it again. Best, Nico Sarunas Burdulis writes: > On 2017-09-10 08:23, Nico Schottelius wrote: >> >> Good morning, >> >> yesterday we had an unpleasant surprise that I would like to discuss: >> >> Many (not all!) of our VMs were suddenly >> dying (qemu process exiting) and when trying to restart them, inside the >> qemu process we saw i/o errors on the disks and the OS was not able to >> start (i.e. stopped in initramfs). > > We experienced the same after upgrade from kraken to luminous, i.e. all > VM with their system images in Ceph pool failed to boot due to > filesystem errors, ending in initramfs. fsck wasn't able to fix them. > >> When we exported the image from rbd and loop mounted it, there were >> however no I/O errors and the filesystem could be cleanly mounted [-1]. > > Same here. > > We ended up rbd-exporting images from Ceph rbd pool to local filesystem > and re-exporting them back. That "fixed" them without the need for fsck. -- Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] RBD I/O errors with QEMU [luminous upgrade/osd change]
Good morning Lionel, it's great to hear that it's not only us being affected! I am not sure what you refer to by "glance" images, but what we see is that we can spawn a new VM based on an existing image and that one runs. Can I invite you (and anyone else who has problems w/ Luminous upgrade) to join our chat at https://brandnewchat.ungleich.ch/ so that we can discuss the real-world problems online? For us it is currently very unclear how to progress: whether it is even safe to rejoin the host into the cluster, or whether a downgrade would make sense. Best, Nico p.s.: This cluster was installed with kraken, so no old jewel clients or osds have existed at all. Beard Lionel (BOSTON-STORAGE) writes: > Hi, > > We also have the same issue with Openstack instances (QEMU/libvirt) after > upgrading from kraken to luminous, and just after starting osd migration from > btrfs to bluestore. > We were able to restart failed VMs by mounting all disks from a linux box > with rbd map, and run fsck on them. > QEMU hosts are running Ubuntu with kernel > 4.4. > We have noticed that one of our QEMU hosts was still running jewel ceph > client (error during installation...) , and issue doesn't happen on this one. > > Don't you have issues with some glance images? > Because we do (unable to spawn an instance from them), and it was fixed by > following this ticket: http://tracker.ceph.com/issues/19413 > > Regards, > Lionel > >> -Original Message- >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of >> Nico Schottelius >> Sent: dimanche 10 septembre 2017 14:23 >> To: ceph-users >> Cc: kamila.souck...@ungleich.ch >> Subject: [ceph-users] RBD I/O errors with QEMU [luminous upgrade/osd >> change] >> >> >> Good morning, >> >> yesterday we had an unpleasant surprise that I would like to discuss: >> >> Many (not all!) of our VMs were suddenly dying (qemu process exiting) and >> when trying to restart them, inside the qemu process we saw i/o errors on >> the disks and the OS was not able to start (i.e. stopped in initramfs). >> >> When we exported the image from rbd and loop mounted it, there were >> however no I/O errors and the filesystem could be cleanly mounted [-1]. >> >> We are running Devuan with kernel 3.16.0-4-amd64 and saw that there are >> some problems reported with kernels < 3.16.39 and thus we upgraded one >> host that serves as VM host + runs ceph osds to Devuan ascii using 4.9.0-3- >> amd64. >> >> Trying to start the VM again on this host however resulted in the same I/O >> problem. >> >> We then did the "stupid" approach of exporting an image and importing it >> again as the same name [0]. Surprisingly, this solved our problem >> reproducible for all affected VMs and allowed us to go back online. >> >> We intentionally left one broken VM in our system (a test VM) so that we >> have the chance of debugging further what happened and how we can >> prevent it from happening again.
>> >> As you might have guessed, there have been some event prior this: >> >> - Some weeks before we upgraded our cluster from kraken to luminous (in >> the right order of mon's first, adding mgrs) >> >> - About a week ago we added the first hdd to our cluster and modified the >> crushmap so that it the "one" pool (from opennebula) still selects >> only ssds >> >> - Some hours before we took out one of the 5 hosts of the ceph cluster, >> as we intended to replace the filesystem based OSDs with bluestore >> (roughly 3 hours prior to the event) >> >> - Short time before the event we readded an osd, but did not "up" it >> >> To our understanding, none of these actions should have triggered this >> behaviour, however we are aware that with the upgrade to luminous also >> the client libraries were updated and not all qemu processes were restarted. >> [1] >> >> After this long story, I was wondering about the following things: >> >> - Why did this happen at all? >> And what is different after we reimported the image? >> Can it be related to disconnected the image from the parent >> (i.e. opennebula creates clones prior to starting a VM) >> >> - We have one broken VM left - is there a way to get it back running >> without doing the export/import dance? >> >> - How / or is http://tracker.ceph.com/issues/18807 related to our issue? >> How is the kernel involved into running VMs that use librbd? >> rbd showmapped does not show any mapped VMs, as qemu connects
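The map-and-fsck recovery Lionel describes above would look roughly as follows (a sketch; the image name is one from this thread, and the fsck invocation depends on the guest filesystem):

    rbd map one/one-29-17031-0 --id admin
    # the image shows up as /dev/rbd0; check its first partition
    fsck -f /dev/rbd0p1
    rbd unmap /dev/rbd0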
Re: [ceph-users] RBD I/O errors with QEMU [luminous upgrade/osd change]
Just tried and there is not much more log in ceph -w (see below) neither from the qemu process. [15:52:43] server4:~$ /usr/bin/qemu-system-x86_64 -name one-17031 -S -machine pc-i440fx-2.1,accel=kvm,usb=off -m 8192 -realtime mlock=off -smp 6,sockets=6,cores=1,threads=1 -uuid 79845fca-9b26-4072-bcb3-7f5206c2a531 -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/one-17031.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file='rbd:one/one-29-17031-0:id=libvirt:key=DELETEME:auth_supported=cephx\;none:mon_host=server1\:6789\;server3\:6789\;server5\:6789,if=none,id=drive-virtio-disk0,format=raw,cache=none' -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -drive file=/var/lib/one//datastores/100/17031/disk.1,if=none,id=drive-ide0-0-0,readonly=on,format=raw -device ide-cd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 -vnc [::]:21131 -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6 -msg timestamp=on 2>&1 | tee kvmlogwithdebug -> no output The command line of qemu is copied out of what opennebula usually spawns, minus the networking part. [15:41:54] server4:~# ceph -w 2017-09-10 15:44:32.873281 7f59f17fa700 10 client.?.objecter ms_handle_connect 0x7f59f4150e90 2017-09-10 15:44:32.873315 7f59f17fa700 10 client.?.objecter resend_mon_ops 2017-09-10 15:44:32.873327 7f59f17fa700 10 client.?.objecter ms_handle_connect 0x7f59f41544d0 2017-09-10 15:44:32.873329 7f59f17fa700 10 client.?.objecter resend_mon_ops 2017-09-10 15:44:32.876248 7f59f9a63700 10 client.1021613.objecter _maybe_request_map subscribing (onetime) to next osd map 2017-09-10 15:44:32.876710 7f59f17fa700 10 client.1021613.objecter ms_dispatch 0x7f59f4000fe0 osd_map(9059..9059 src has 8530..9059) v3 2017-09-10 15:44:32.876722 7f59f17fa700 3 client.1021613.objecter handle_osd_map got epochs [9059,9059] > 0 2017-09-10 15:44:32.876726 7f59f17fa700 3 client.1021613.objecter handle_osd_map decoding full epoch 9059 2017-09-10 15:44:32.877099 7f59f17fa700 20 client.1021613.objecter dump_active .. 
0 homeless 2017-09-10 15:44:32.877423 7f59f17fa700 10 client.1021613.objecter ms_handle_connect 0x7f59dc00c9c0 cluster: id: 26c0c5a8-d7ce-49ac-b5a7-bfd9d0ba81ab health: HEALTH_OK services: mon: 3 daemons, quorum server5,server3,server1 mgr: 1(active), standbys: 2, 0 osd: 50 osds: 49 up, 49 in data: pools: 2 pools, 1088 pgs objects: 500k objects, 1962 GB usage: 5914 GB used, 9757 GB / 15672 GB avail pgs: 1088 active+clean io: client: 18822 B/s rd, 799 kB/s wr, 6 op/s rd, 52 op/s wr 2017-09-10 15:44:37.876324 7f59f1ffb700 10 client.1021613.objecter tick 2017-09-10 15:44:42.876437 7f59f1ffb700 10 client.1021613.objecter tick 2017-09-10 15:44:45.223970 7f59f17fa700 10 client.1021613.objecter ms_dispatch 0x7f59f4000fe0 log(2 entries from seq 215046 at 2017-09-10 15:44:45.164162) v1 2017-09-10 15:44:47.876548 7f59f1ffb700 10 client.1021613.objecter tick 2017-09-10 15:44:52.876668 7f59f1ffb700 10 client.1021613.objecter tick 2017-09-10 15:44:57.876770 7f59f1ffb700 10 client.1021613.objecter tick 2017-09-10 15:45:02.876888 7f59f1ffb700 10 client.1021613.objecter tick 2017-09-10 15:45:07.877001 7f59f1ffb700 10 client.1021613.objecter tick 2017-09-10 15:45:12.877120 7f59f1ffb700 10 client.1021613.objecter tick 2017-09-10 15:45:17.877229 7f59f1ffb700 10 client.1021613.objecter tick 2017-09-10 15:45:22.877349 7f59f1ffb700 10 client.1021613.objecter tick 2017-09-10 15:45:27.877455 7f59f1ffb700 10 client.1021613.objecter tick Jason Dillaman writes: > Sorry -- meant VM. Yes, librbd uses ceph.conf for configuration settings. > > On Sun, Sep 10, 2017 at 9:22 AM, Nico Schottelius > wrote: >> >> Hello Jason, >> >> I think there is a slight misunderstanding: >> There is only one *VM*, not one OSD left that we did not start. >> >> Or does librbd also read ceph.conf and will that cause qemu to output >> debug messages? >> >> Best, >> >> Nico >> >> Jason Dillaman writes: >> >>> I presume QEMU is using librbd instead of a mapped krbd block device, >>> correct? If that is the case, can you add "debug-rbd=20" and "debug >>> objecter=20" to your ceph.conf and boot up your last remaining broken >>> OSD? >>> >>> On Sun, Sep 10, 2017 at 8:23 AM, Nico Schottelius >>> wrote: >>>> >>>> Good morning, >>>> >>>> yesterday we had an unpleasant surprise that I would like to discuss: >>>> >>>> Many (not all!) of our VMs were suddenly >>>> dying (qe
Re: [ceph-users] RBD I/O errors with QEMU [luminous upgrade/osd change]
Hello Jason, I think there is a slight misunderstanding: There is only one *VM*, not one OSD left that we did not start. Or does librbd also read ceph.conf and will that cause qemu to output debug messages? Best, Nico Jason Dillaman writes: > I presume QEMU is using librbd instead of a mapped krbd block device, > correct? If that is the case, can you add "debug-rbd=20" and "debug > objecter=20" to your ceph.conf and boot up your last remaining broken > OSD? > > On Sun, Sep 10, 2017 at 8:23 AM, Nico Schottelius > wrote: >> >> Good morning, >> >> yesterday we had an unpleasant surprise that I would like to discuss: >> >> Many (not all!) of our VMs were suddenly >> dying (qemu process exiting) and when trying to restart them, inside the >> qemu process we saw i/o errors on the disks and the OS was not able to >> start (i.e. stopped in initramfs). >> >> When we exported the image from rbd and loop mounted it, there were >> however no I/O errors and the filesystem could be cleanly mounted [-1]. >> >> We are running Devuan with kernel 3.16.0-4-amd64 and saw that there are >> some problems reported with kernels < 3.16.39 and thus we upgraded one >> host that serves as VM host + runs ceph osds to Devuan ascii using >> 4.9.0-3-amd64. >> >> Trying to start the VM again on this host however resulted in the same >> I/O problem. >> >> We then did the "stupid" approach of exporting an image and importing it >> again as the same name [0]. Surprisingly, this solved our problem >> reproducible for all affected VMs and allowed us to go back online. >> >> We intentionally left one broken VM in our system (a test VM) so that we >> have the chance of debugging further what happened and how we can >> prevent it from happening again. >> >> As you might have guessed, there have been some event prior this: >> >> - Some weeks before we upgraded our cluster from kraken to luminous (in >> the right order of mon's first, adding mgrs) >> >> - About a week ago we added the first hdd to our cluster and modified the >> crushmap so that it the "one" pool (from opennebula) still selects >> only ssds >> >> - Some hours before we took out one of the 5 hosts of the ceph cluster, >> as we intended to replace the filesystem based OSDs with bluestore >> (roughly 3 hours prior to the event) >> >> - Short time before the event we readded an osd, but did not "up" it >> >> To our understanding, none of these actions should have triggered this >> behaviour, however we are aware that with the upgrade to luminous also >> the client libraries were updated and not all qemu processes were >> restarted. [1] >> >> After this long story, I was wondering about the following things: >> >> - Why did this happen at all? >> And what is different after we reimported the image? >> Can it be related to disconnected the image from the parent >> (i.e. opennebula creates clones prior to starting a VM) >> >> - We have one broken VM left - is there a way to get it back running >> without doing the export/import dance? >> >> - How / or is http://tracker.ceph.com/issues/18807 related to our issue? >> How is the kernel involved into running VMs that use librbd? >> rbd showmapped does not show any mapped VMs, as qemu connects directly >> to ceph. >> >> We tried upgrading one host to Devuan ascii which uses 4.9.0-3-amd64, >> but did not fix our problem. >> >> We would appreciate any pointer! 
>> >> Best, >> >> Nico >> >> >> [-1] >> losetup -P /dev/loop0 /var/tmp/one-staging/monitoring1-disk.img >> mkdir /tmp/monitoring1-mnt >> mount /dev/loop0p1 /tmp/monitoring1-mnt/ >> >> >> [0] >> >> rbd export one/$img /var/tmp/one-staging/$img >> rbd rm one/$img >> rbd import /var/tmp/one-staging/$img one/$img >> rm /var/tmp/one-staging/$img >> >> [1] >> [14:05:34] server5:~# ceph features >> { >> "mon": { >> "group": { >> "features": "0x1ffddff8eea4fffb", >> "release": "luminous", >> "num": 3 >> } >> }, >> "osd": { >> "group": { >> "features": "0x1ffddff8eea4fffb", >> "release": "luminous", >> "num&qu
[ceph-users] RBD I/O errors with QEMU [luminous upgrade/osd change]
Good morning, yesterday we had an unpleasant surprise that I would like to discuss: Many (not all!) of our VMs were suddenly dying (qemu process exiting) and when trying to restart them, inside the qemu process we saw i/o errors on the disks and the OS was not able to start (i.e. stopped in initramfs). When we exported the image from rbd and loop mounted it, there were however no I/O errors and the filesystem could be cleanly mounted [-1]. We are running Devuan with kernel 3.16.0-4-amd64 and saw that there are some problems reported with kernels < 3.16.39 and thus we upgraded one host that serves as VM host + runs ceph osds to Devuan ascii using 4.9.0-3-amd64. Trying to start the VM again on this host however resulted in the same I/O problem. We then did the "stupid" approach of exporting an image and importing it again as the same name [0]. Surprisingly, this solved our problem reproducibly for all affected VMs and allowed us to go back online. We intentionally left one broken VM in our system (a test VM) so that we have the chance of debugging further what happened and how we can prevent it from happening again. As you might have guessed, there have been some events prior to this: - Some weeks before we upgraded our cluster from kraken to luminous (in the right order of mon's first, adding mgrs) - About a week ago we added the first hdd to our cluster and modified the crushmap so that the "one" pool (from opennebula) still selects only ssds - Some hours before we took out one of the 5 hosts of the ceph cluster, as we intended to replace the filesystem based OSDs with bluestore (roughly 3 hours prior to the event) - A short time before the event we re-added an osd, but did not "up" it To our understanding, none of these actions should have triggered this behaviour, however we are aware that with the upgrade to luminous the client libraries were also updated and not all qemu processes were restarted. [1] After this long story, I was wondering about the following things: - Why did this happen at all? And what is different after we reimported the image? Can it be related to disconnecting the image from the parent (i.e. opennebula creates clones prior to starting a VM) - We have one broken VM left - is there a way to get it back running without doing the export/import dance? - How / or is http://tracker.ceph.com/issues/18807 related to our issue? How is the kernel involved in running VMs that use librbd? rbd showmapped does not show any mapped VMs, as qemu connects directly to ceph. We tried upgrading one host to Devuan ascii, which uses 4.9.0-3-amd64, but that did not fix our problem. We would appreciate any pointer! Best, Nico [-1] losetup -P /dev/loop0 /var/tmp/one-staging/monitoring1-disk.img mkdir /tmp/monitoring1-mnt mount /dev/loop0p1 /tmp/monitoring1-mnt/ [0] rbd export one/$img /var/tmp/one-staging/$img rbd rm one/$img rbd import /var/tmp/one-staging/$img one/$img rm /var/tmp/one-staging/$img [1] [14:05:34] server5:~# ceph features { "mon": { "group": { "features": "0x1ffddff8eea4fffb", "release": "luminous", "num": 3 } }, "osd": { "group": { "features": "0x1ffddff8eea4fffb", "release": "luminous", "num": 49 } }, "client": { "group": { "features": "0xffddff8ee84fffb", "release": "kraken", "num": 1 }, "group": { "features": "0xffddff8eea4fffb", "release": "luminous", "num": 4 }, "group": { "features": "0x1ffddff8eea4fffb", "release": "luminous", "num": 61 } } } -- Modern, affordable, Swiss Virtual Machines.
Visit www.datacenterlight.ch ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
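The export/import dance in [0] can also be done entirely inside the pool, avoiding the local staging file (a sketch, assuming enough free space in the pool; like the export/import variant it drops snapshots and the clone parent, which may be exactly the point):

    rbd cp one/$img one/$img.new
    rbd rm one/$img
    rbd mv one/$img.new one/$img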
Re: [ceph-users] Is ceph production ready? [was: Ceph PG Incomplete = Cluster unusable]
Good morning Jiri, sure, let me catch up on this: - Kernel 3.16 - ceph: 0.80.7 - fs: xfs - os: debian (backports) (1x)/ubuntu (2x) Cheers, Nico Jiri Kanicky [Fri, Jan 09, 2015 at 10:44:33AM +1100]: > Hi Nico. > > If you are experiencing such issues it would be good if you provide more info > about your deployment: ceph version, kernel versions, OS, filesystem > btrfs/xfs. > > Thx Jiri > > - Reply message - > From: "Nico Schottelius" > To: > Subject: [ceph-users] Is ceph production ready? [was: Ceph PG Incomplete = > Cluster unusable] > Date: Wed, Dec 31, 2014 02:36 > > Good evening, > > we also tried to rescue data *from* our old / broken pool by map'ing the > rbd devices, mounting them on a host and rsync'ing away as much as > possible. > > However, after some time rsync got completly stuck and eventually the > host which mounted the rbd mapped devices decided to kernel panic at > which time we decided to drop the pool and go with a backup. > > This story and the one of Christian makes me wonder: > > Is anyone using ceph as a backend for qemu VM images in production? > > And: > > Has anyone on the list been able to recover from a pg incomplete / > stuck situation like ours? > > Reading about the issues on the list here gives me the impression that > ceph as a software is stuck/incomplete and has not yet become ready > "clean" for production (sorry for the word joke). > > Cheers, > > Nico > > Christian Eichelmann [Tue, Dec 30, 2014 at 12:17:23PM +0100]: > > Hi Nico and all others who answered, > > > > After some more trying to somehow get the pgs in a working state (I've > > tried force_create_pg, which was putting then in creating state. But > > that was obviously not true, since after rebooting one of the containing > > osd's it went back to incomplete), I decided to save what can be saved. > > > > I've created a new pool, created a new image there, mapped the old image > > from the old pool and the new image from the new pool to a machine, to > > copy data on posix level. > > > > Unfortunately, formatting the image from the new pool hangs after some > > time. So it seems that the new pool is suffering from the same problem > > as the old pool. Which is totaly not understandable for me. > > > > Right now, it seems like Ceph is giving me no options to either save > > some of the still intact rbd volumes, or to create a new pool along the > > old one to at least enable our clients to send data to ceph again. > > > > To tell the truth, I guess that will result in the end of our ceph > > project (running for already 9 Monthes). > > > > Regards, > > Christian > > > > Am 29.12.2014 15:59, schrieb Nico Schottelius: > > > Hey Christian, > > > > > > Christian Eichelmann [Mon, Dec 29, 2014 at 10:56:59AM +0100]: > > >> [incomplete PG / RBD hanging, osd lost also not helping] > > > > > > that is very interesting to hear, because we had a similar situation > > > with ceph 0.80.7 and had to re-create a pool, after I deleted 3 pg > > > directories to allow OSDs to start after the disk filled up completly. > > > > > > So I am sorry not to being able to give you a good hint, but I am very > > > interested in seeing your problem solved, as it is a show stopper for > > > us, too. (*) > > > > > > Cheers, > > > > > > Nico > > > > > > (*) We migrated from sheepdog to gluster to ceph and so far sheepdog > > > seems to run much smoother. 
The first one is however not supported > > > by opennebula directly, the second one not flexible enough to host > > > our heterogeneous infrastructure (mixed disk sizes/amounts) - so we > > > are using ceph at the moment. > > > > > > > > > -- > > Christian Eichelmann > > Systemadministrator > > > > 1&1 Internet AG - IT Operations Mail & Media Advertising & Targeting > > Brauerstraße 48 · DE-76135 Karlsruhe > > Telefon: +49 721 91374-8026 > > christian.eichelm...@1und1.de > > > > Amtsgericht Montabaur / HRB 6484 > > Vorstände: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Robert > > Hoffmann, Markus Huhn, Hans-Henning Kettler, Dr. Oliver Mauss, Jan Oetjen > > Aufsichtsratsvorsitzender: Michael Scheeren > > -- > New PGP key: 659B 0D91 E86E 7E24 FD15 69D0 C729 21A1 293F 2D24 > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- New PGP key: 659B 0D91 E86E 7E24 FD15 69D0 C729 21A1 293F 2D24 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Is ceph production ready? [was: Ceph PG Incomplete = Cluster unusable]
Lionel, Christian, we have exactly the same trouble as Christian, namely Christian Eichelmann [Fri, Jan 09, 2015 at 10:43:20AM +0100]: > We still don't know what caused this specific error... and > ...there is currently no way to make ceph forget about the data of this pg > and create it as an empty one. So the only way > to make this pool usable again is to loose all your data in there. I wonder: what is the position of the ceph developers regarding dropping (emptying) specific pgs? Is that a use case that was never thought of or tested? For us it is essential to be able to keep the pool/cluster running even in case we have lost pgs. Even though I do not like the fact that we lost a pg for an unknown reason, I would prefer ceph to handle that case and recover to the best possible situation. Namely, I wonder if we can integrate a tool that shows which (parts of) rbd images would be affected by dropping a pg. That would give us the chance to selectively restore VMs in case this happens again. Cheers, Nico -- New PGP key: 659B 0D91 E86E 7E24 FD15 69D0 C729 21A1 293F 2D24 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
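The tool asked for here can be approximated with existing commands: every RBD data object's name starts with the image's block_name_prefix, and ceph osd map reports the PG any object falls into. A sketch (pool "one" and PG "4.2d" are placeholders; rados ls walks the whole pool, so this is slow):

    pool=one
    pg=4.2d
    # 1) every rbd data object that maps into $pg
    rados -p $pool ls | grep '^rb' | while read obj; do
        ceph osd map $pool "$obj" | grep -qF "($pg)" && echo "$obj"
    done
    # 2) match the hits back to images via their block_name_prefix
    rbd -p $pool ls | while read img; do
        echo "$img: $(rbd info $pool/$img | awk '$1 == "block_name_prefix:" {print $2}')"
    done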
Re: [ceph-users] Is ceph production ready? [was: Ceph PG Incomplete = Cluster unusable]
Hello Dan, it is good to know that there are actually people using ceph + qemu in production! Regarding replicas: I thought about using size = 2, but I see that this resembles raid5 and size = 3 is more or less equal in terms of loss to raid6. Regarding the kernel panics: I am still researching / trying to find out why they happen. They can easily be reproduced by triggering high amount of i/o in a VM. We are mostly running Debian (stable, testing, stable+backports) that shows the kernel panics. Ubuntu has not shown this behaviour so far, afair. So if anyone has experienced kernel panics in Qemu-VMs running on RBD (and fixed it), please let me know! Cheers, Nico p.s.: We are *not* using rbdmap / kernel mounts - it's just qemu running with qemu-system-x86_64 -enable-kvm -name one-204 -S -machine pc-i440fx-trusty,accel=kvm,usb=off -m 512 -realtime mlock=off -smp 2,sockets=2,cores=1,threads=1 -uuid d7c3374e-349e-4db6-8f54-f3c607f93101 -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/one-204.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device lsi,id=scsi0,bus=pci.0,addr=0x4 -drive file=rbd:one/one-53-204-0:id=libvirt:key=...:auth_supported=cephx\;none:mon_host=kaffee.private.ungleich.ch\;wein.private.ungleich.ch\;tee.private.ungleich.ch,if=none,id=drive-scsi0-0-0,format=raw,cache=none -device scsi-hd,bus=scsi0.0,scsi-id=0,drive=drive-scsi0-0-0,id=scsi0-0-0,bootindex=1 -drive file=/var/lib/one//datastores/0/204/disk.1,if=none,id=drive-ide0-0-0,readonly=on,format=raw -device ide-cd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 -netdev tap,fd=24,id=hostnet0 -device rtl8139,netdev=hostnet0,id=net0,mac=02:00:4d:6d:96:ae,bus=pci.0,addr=0x3 -vnc 0.0.0.0:204 -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 Dan Van Der Ster [Wed, Jan 07, 2015 at 08:12:29PM +]: > Hi Nico, > Yes Ceph is production ready. Yes people are using it in production for qemu. > Last time I heard, Ceph was surveyed as the most popular backend for > OpenStack Cinder in production. > > When using RBD in production, it really is critically important to (a) use 3 > replicas and (b) pay attention to pg distribution early on so that you don't > end up with unbalanced OSDs. > > Replication is especially important for RBD because you > _must_not_ever_lose_an_entire_pg_. Parts of every single rbd device are > stored on every single PG... So losing a PG means you lost random parts of > every single block device. If this happens, the only safe course of action is > to restore from backups. But the whole point of Ceph is that it enables you > to configure adequate replication across failure domains, which makes this > scenario very very very unlikely to occur. > > I don't know why you were getting kernel panics. It's probably advisable to > stick to the most recent mainline kernel when using kRBD. > > Cheers, Dan > > On 7 Jan 2015 20:45, Nico Schottelius wrote: > Good evening, > > we also tried to rescue data *from* our old / broken pool by map'ing the > rbd devices, mounting them on a host and rsync'ing away as much as > possible. > > However, after some time rsync got completly stuck and eventually the > host which mounted the rbd mapped devices decided to kernel panic at > which time we decided to drop the pool and go with a backup. 
> > This story and the one of Christian makes me wonder: > > Is anyone using ceph as a backend for qemu VM images in production? > > And: > > Has anyone on the list been able to recover from a pg incomplete / > stuck situation like ours? > > Reading about the issues on the list here gives me the impression that > ceph as a software is stuck/incomplete and has not yet become ready > "clean" for production (sorry for the word joke). > > Cheers, > > Nico > > Christian Eichelmann [Tue, Dec 30, 2014 at 12:17:23PM +0100]: > > Hi Nico and all others who answered, > > > > After some more trying to somehow get the pgs in a working state (I've > > tried force_create_pg, which was putting then in creating state. But > > that was obviously not true, since after rebooting one of the containing > > osd's it went back to incomplete), I decided to save what can be saved. > > > > I've created a new pool, created a new image there, mapped the old image > > from the old pool and the new image from the new pool to a machine, to > > copy data on posix level. > > > > Unfortunately, formatting the image from the new pool hangs after some > > time. So it seems that the new pool
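The panics land in the guest's sym53c8xx driver, i.e. the driver for the emulated LSI SCSI controller ("-device lsi ... -device scsi-hd" in the command line above). One mitigation worth trying, not something confirmed in this thread, is attaching the RBD-backed drive via virtio instead, reusing the same drive definition (key and mon_host elided; the guest needs virtio drivers, and its root device name changes from /dev/sdX to /dev/vdX):

    -drive file=rbd:one/one-53-204-0:id=libvirt:key=...:auth_supported=cephx\;none:mon_host=...,if=none,id=drive-virtio-disk0,format=raw,cache=none \
    -device virtio-blk-pci,scsi=off,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1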
Re: [ceph-users] Hanging VMs with Qemu + RBD
Hello Achim, good to hear someone else running this setup. We have changed the number of backfills using ceph tell osd.\* injectargs '--osd-max-backfills 1' and it mostly resolves the issues we saw while rebalancing. One unsolved problem we have is machines kernel panic'ing when i/o is slow. We usually see a kernel panic in the sym53c8xx driver, especially for those VMs with high i/o rates. We tried to upgrade the kernel in the VM (Debian stable 3.2.0 -> Debian backports 3.16.0), but now just get a different kernel panic in the same driver. Have you had the same problem and if so, how did you get it fixed? Cheers, Nico Achim Ledermüller [Wed, Jan 07, 2015 at 05:42:38PM +0100]: > Hi, > > We have the same setup including OpenNebula 4.10.1. We had some > backfilling due to node failures and node expansion. If we throttle > osd_max_backfills there is not a problem at all. If the value for > backfilling jobs is too high, we can see delayed reactions within the > shell, eg. `ls -lh` needs 2 seconds. > > Kind regards, > Achim > > -- > Achim Ledermüller, M. Sc. > Systems Engineer > > NETWAYS Managed Services GmbH | Deutschherrnstr. 15-19 | D-90429 Nuernberg > Tel: +49 911 92885-0 | Fax: +49 911 92885-77 > GF: Julian Hein, Bernd Erk | AG Nuernberg HRB25207 > http://www.netways.de | achim.ledermuel...@netways.de > > ** OSDC 2015 - April - osdc.de ** > ** Puppet Camp Berlin 2015 - April - netways.de/puppetcamp ** > ** OSBConf 2015 - September – osbconf.org ** > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- New PGP key: 659B 0D91 E86E 7E24 FD15 69D0 C729 21A1 293F 2D24 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
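The injectargs call above only lasts until the next daemon restart; a hedged sketch of making the throttle permanent, assuming the usual /etc/ceph/ceph.conf layout:

    # /etc/ceph/ceph.conf, on every osd host
    [osd]
        osd max backfills = 1

    # apply to the running daemons without a restart
    ceph tell osd.\* injectargs '--osd-max-backfills 1'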
[ceph-users] Is ceph production ready? [was: Ceph PG Incomplete = Cluster unusable]
Good evening, we also tried to rescue data *from* our old / broken pool by map'ing the rbd devices, mounting them on a host and rsync'ing away as much as possible. However, after some time rsync got completely stuck and eventually the host which mounted the rbd mapped devices decided to kernel panic, at which time we decided to drop the pool and go with a backup. This story and the one of Christian makes me wonder: Is anyone using ceph as a backend for qemu VM images in production? And: Has anyone on the list been able to recover from a pg incomplete / stuck situation like ours? Reading about the issues on the list here gives me the impression that ceph as a software is stuck/incomplete and has not yet become ready "clean" for production (sorry for the word joke). Cheers, Nico Christian Eichelmann [Tue, Dec 30, 2014 at 12:17:23PM +0100]: > Hi Nico and all others who answered, > > After some more trying to somehow get the pgs in a working state (I've > tried force_create_pg, which was putting them in creating state. But > that was obviously not true, since after rebooting one of the containing > osd's it went back to incomplete), I decided to save what can be saved. > > I've created a new pool, created a new image there, mapped the old image > from the old pool and the new image from the new pool to a machine, to > copy data on posix level. > > Unfortunately, formatting the image from the new pool hangs after some > time. So it seems that the new pool is suffering from the same problem > as the old pool. Which is totally not understandable for me. > > Right now, it seems like Ceph is giving me no options to either save > some of the still intact rbd volumes, or to create a new pool along the > old one to at least enable our clients to send data to ceph again. > > To tell the truth, I guess that will result in the end of our ceph > project (running already for 9 months). > > Regards, > Christian > > On 29.12.2014 15:59, Nico Schottelius wrote: > > Hey Christian, > > > > Christian Eichelmann [Mon, Dec 29, 2014 at 10:56:59AM +0100]: > >> [incomplete PG / RBD hanging, osd lost also not helping] > > > > that is very interesting to hear, because we had a similar situation > > with ceph 0.80.7 and had to re-create a pool, after I deleted 3 pg > > directories to allow OSDs to start after the disk filled up completely. > > > > So I am sorry not to be able to give you a good hint, but I am very > > interested in seeing your problem solved, as it is a show stopper for > > us, too. (*) > > > > Cheers, > > > > Nico > > > > (*) We migrated from sheepdog to gluster to ceph and so far sheepdog > > seems to run much smoother. The first one is however not supported > > by opennebula directly, the second one not flexible enough to host > > our heterogeneous infrastructure (mixed disk sizes/amounts) - so we > > are using ceph at the moment. > > > > > -- > Christian Eichelmann > Systemadministrator > > 1&1 Internet AG - IT Operations Mail & Media Advertising & Targeting > Brauerstraße 48 · DE-76135 Karlsruhe > Telefon: +49 721 91374-8026 > christian.eichelm...@1und1.de > > Amtsgericht Montabaur / HRB 6484 > Vorstände: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Robert > Hoffmann, Markus Huhn, Hans-Henning Kettler, Dr. Oliver Mauss, Jan Oetjen > Aufsichtsratsvorsitzender: Michael Scheeren -- New PGP key: 659B 0D91 E86E 7E24 FD15 69D0 C729 21A1 293F 2D24 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
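A hedged sketch of the posix-level rescue path Christian describes above (pool, image and mount point names are illustrative, not from the thread):

    rbd map old/broken-image          # appears as e.g. /dev/rbd0
    rbd map new/rescue-image          # appears as e.g. /dev/rbd1
    mkfs.ext4 /dev/rbd1               # this is the step that hung for Christian
    mount /dev/rbd0 /mnt/old
    mount /dev/rbd1 /mnt/new
    rsync -aHAX /mnt/old/ /mnt/new/   # copy on posix level; may stall on broken pgs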
Re: [ceph-users] Weights: Hosts vs. OSDs
Hey Lindsay, Lindsay Mathieson [Wed, Dec 31, 2014 at 06:23:10AM +1000]: > On Tue, 30 Dec 2014 05:07:31 PM Nico Schottelius wrote: > > While writing this I noted that the relation / factor is exactly 5.5 times > > wrong, so I *guess* that ceph treats all hosts with the same weight (even > > though it looks different to me in the osd tree and the crushmap)? > > I believe if you have the default replication factor of 3, then with 3 hosts > you will effectively have a weight of 1 per host no matter what you specify, > because ceph will be forced to place a copy of all data on each host to > satisfy replication requirements. sorry, I forgot to mention we are using size = 2. Cheers, Nico -- New PGP key: 659B 0D91 E86E 7E24 FD15 69D0 C729 21A1 293F 2D24 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Weights: Hosts vs. OSDs
Good evening, for some time we have the problem that ceph stores too much data on a host with small disks. Originally we used weight 1 = 1 TB, but we reduced the weight for this particular host further to keep it somehow alive. Our setup currently consists of 3 hosts:

    wein:   6x 136G (fast disks)
    kaffee: 1x 5.5T (slow disk)
    tee:    1x 5.5T (slow disk)

We originally started with 6 osds on wein with a weight of 0.13, but had to reduce it to 0.05 because the disks were running full. The current tree looks as follows:

root@wein:~# ceph osd tree
# id    weight  type name       up/down reweight
-1      2.3     root default
-2      0.2999          host wein
0       0.04999                 osd.0   up      1
3       0.04999                 osd.3   up      1
4       0.04999                 osd.4   up      1
5       0.04999                 osd.5   up      1
6       0.04999                 osd.6   up      1
7       0.04999                 osd.7   up      1
-3      1               host tee
1       5.5                     osd.1   up      1
-4      1               host kaffee
2       5.5                     osd.2   up      1

The hosts have the following disk usage:

root@wein:~# df -h | grep ceph
/dev/sdc1       136G   58G   79G  43% /var/lib/ceph/osd/ceph-0
/dev/sdd1       136G   54G   83G  40% /var/lib/ceph/osd/ceph-3
/dev/sde1       136G   31G  105G  23% /var/lib/ceph/osd/ceph-4
/dev/sdf1       136G   62G   75G  46% /var/lib/ceph/osd/ceph-5
/dev/sdg1       136G   45G   92G  33% /var/lib/ceph/osd/ceph-6
/dev/sdh1       136G   28G  109G  21% /var/lib/ceph/osd/ceph-7

root@kaffee:~# df -h | grep ceph
/dev/sdc        5.5T  983G  4.5T  18% /var/lib/ceph/osd/ceph-2

root@tee:~# df -h | grep ceph
/dev/sdb        5.5T  967G  4.6T  18% /var/lib/ceph/osd/ceph-1

On wein 48G are stored on average per osd, while tee/kaffee store 952G on average:

    (58+64+31+62+45+28)/6 = 48.0
    (967+938)/2 = 952.5

The weight relation from a wein osd to a kaffee/tee osd is 5.5/0.05 = 110.0, but the usage relation is only 952.5/48 = 19.84375. So ceph is allocating 5.5 times more storage to the wein osds than what we want it to:

    110/(952.5/48) = 5.543307086614173

We are also a bit puzzled that the host weight for wein is 0.3 while tee/kaffee is 1: for wein the host weight is the sum of its OSDs, but for kaffee and tee it is not. However, looking at the crushmap, the host weight is being displayed as 5.5! Does anyone have an idea what may be going wrong here? While writing this I noted that the relation / factor is exactly 5.5 times wrong, so I *guess* that ceph treats all hosts with the same weight (even though it looks different to me in the osd tree and the crushmap)? You find our crushmap attached below.
Cheers, Nico

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
device 6 osd.6
device 7 osd.7

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host wein {
        id -2           # do not change unnecessarily
        # weight 0.300
        alg straw
        hash 0  # rjenkins1
        item osd.0 weight 0.050
        item osd.3 weight 0.050
        item osd.4 weight 0.050
        item osd.5 weight 0.050
        item osd.6 weight 0.050
        item osd.7 weight 0.050
}
host tee {
        id -3           # do not change unnecessarily
        # weight 5.500
        alg straw
        hash 0  # rjenkins1
        item osd.1 weight 5.500
}
host kaffee {
        id -4           # do not change unnecessarily
        # weight 5.500
        alg straw
        hash 0  # rjenkins1
        item osd.2 weight 5.500
}
root default {
        id -1           # do not change unnecessarily
        # weight 2.300
        alg straw
        hash 0  # rjenkins1
        item wein weight 0.300
        item tee weight 1.000
        item kaffee weight 1.000
}

# rules
rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
}

--
New PGP key: 659B 0D91 E86E 7E24 FD15 69D0 C729 21A1 293F 2D24
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
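Given the weight 1 = 1 TB convention from the mail above, a hedged sketch of how a crush weight would normally be set to match disk size (osd ids and values are illustrative; on this cluster the wein weights were reduced on purpose):

    ceph osd crush reweight osd.1 5.5     # 5.5 TB disk  -> crush weight 5.5
    ceph osd crush reweight osd.0 0.136   # 136 GB disk  -> crush weight 0.136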
Re: [ceph-users] Ceph PG Incomplete = Cluster unusable
Hey Christian, Christian Eichelmann [Mon, Dec 29, 2014 at 10:56:59AM +0100]: > [incomplete PG / RBD hanging, osd lost also not helping] that is very interesting to hear, because we had a similar situation with ceph 0.80.7 and had to re-create a pool, after I deleted 3 pg directories to allow OSDs to start after the disk filled up completely. So I am sorry not to be able to give you a good hint, but I am very interested in seeing your problem solved, as it is a show stopper for us, too. (*) Cheers, Nico (*) We migrated from sheepdog to gluster to ceph and so far sheepdog seems to run much smoother. The first one is however not supported by opennebula directly, the second one not flexible enough to host our heterogeneous infrastructure (mixed disk sizes/amounts) - so we are using ceph at the moment. -- New PGP key: 659B 0D91 E86E 7E24 FD15 69D0 C729 21A1 293F 2D24 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] HEALTH_WARN 29 pgs degraded; 29 pgs stuck degraded; 133 pgs stuck unclean; 29 pgs stuck undersized;
Hey Jiri, also raise the pgp_num (pg != pgp - it's easy to overlook). Cheers, Nico

Jiri Kanicky [Sun, Dec 28, 2014 at 01:52:39AM +1100]:
> Hi,
>
> I just built my CEPH cluster but am having problems with the health of
> the cluster.
>
> Here are a few details:
> - I followed the ceph documentation.
> - I used the btrfs filesystem for all OSDs
> - I did not set "osd pool default size = 2" as I thought that if I
>   have 2 nodes + 4 OSDs, I can leave default=3. I am not sure if this
>   was right.
> - I noticed that the default pools "data,metadata" were not created.
>   Only the "rbd" pool was created.
> - As it was complaining that the pg_num is too low, I increased the
>   pg_num for pool rbd to 133 (400/3) and ended up with "pool rbd pg_num
>   133 > pgp_num 64".
>
> Would you give me a hint where I have made the mistake? (I can remove
> the OSDs and start over if needed.)
>
> cephadmin@ceph1:/etc/ceph$ sudo ceph health
> HEALTH_WARN 29 pgs degraded; 29 pgs stuck degraded; 133 pgs stuck
> unclean; 29 pgs stuck undersized; 29 pgs undersized; pool rbd pg_num
> 133 > pgp_num 64
> cephadmin@ceph1:/etc/ceph$ sudo ceph status
>     cluster bce2ff4d-e03b-4b75-9b17-8a48ee4d7788
>      health HEALTH_WARN 29 pgs degraded; 29 pgs stuck degraded; 133
> pgs stuck unclean; 29 pgs stuck undersized; 29 pgs undersized; pool
> rbd pg_num 133 > pgp_num 64
>      monmap e1: 2 mons at
> {ceph1=192.168.30.21:6789/0,ceph2=192.168.30.22:6789/0}, election
> epoch 8, quorum 0,1 ceph1,ceph2
>      osdmap e42: 4 osds: 4 up, 4 in
>       pgmap v77: 133 pgs, 1 pools, 0 bytes data, 0 objects
>             11704 kB used, 11154 GB / 11158 GB avail
>                   29 active+undersized+degraded
>                  104 active+remapped
>
> cephadmin@ceph1:/etc/ceph$ sudo ceph osd tree
> # id    weight  type name       up/down reweight
> -1      10.88   root default
> -2      5.44            host ceph1
> 0       2.72                    osd.0   up      1
> 1       2.72                    osd.1   up      1
> -3      5.44            host ceph2
> 2       2.72                    osd.2   up      1
> 3       2.72                    osd.3   up      1
>
> cephadmin@ceph1:/etc/ceph$ sudo ceph osd lspools
> 0 rbd,
>
> cephadmin@ceph1:/etc/ceph$ cat ceph.conf
> [global]
> fsid = bce2ff4d-e03b-4b75-9b17-8a48ee4d7788
> public_network = 192.168.30.0/24
> cluster_network = 10.1.1.0/24
> mon_initial_members = ceph1, ceph2
> mon_host = 192.168.30.21,192.168.30.22
> auth_cluster_required = cephx
> auth_service_required = cephx
> auth_client_required = cephx
> filestore_xattr_use_omap = true
>
> Thank you
> Jiri
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
New PGP key: 659B 0D91 E86E 7E24 FD15 69D0 C729 21A1 293F 2D24
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
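A minimal sketch of the fix, with the pool name and pg count taken from Jiri's output (pgp_num has to follow pg_num before data is actually rebalanced):

    ceph osd pool set rbd pg_num 133    # already done above
    ceph osd pool set rbd pgp_num 133   # this is the step that was missing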
Re: [ceph-users] Behaviour of a cluster with full OSD(s)
Max, List, Max Power [Tue, Dec 23, 2014 at 12:34:54PM +0100]: > [...Recovering from full osd ...] > > Normally > the osd process quits then and I cannot restart it (even after setting the > replicas back). The only possibility is to manually delete complete PG folders > after exploring them with 'pg dump'. Is this the only way to get it back > working > again? I was wondering whether ceph-osd crashing when the disk gets full shouldn't be considered a bug? Shouldn't ceph-osd be able to recover itself? If an admin detects that the disk is full, she can simply reduce the weight of the osd to free up space - but with a dead osd, this is not possible. To those having deeper ceph knowledge: for what reason does ceph-osd exit when the disk is full? Why can it not start when it is full, to get itself out of this invidious situation? Cheers, Nico -- New PGP key: 659B 0D91 E86E 7E24 FD15 69D0 C729 21A1 293F 2D24 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
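A hedged sketch of the reweight escape route mentioned above, as it would look while the osd is still up (osd id and values are illustrative):

    ceph osd reweight 7 0.8              # temporary override between 0 and 1; moves data off osd.7
    ceph osd crush reweight osd.7 0.04   # or lower the crush weight permanently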
Re: [ceph-users] Running instances on ceph with openstack
Hello Ali Shah, we are running VMs using Opennebula with ceph as the backend, so far with varying results: from time to time VMs are freezing, probably panic'ing, when the load on the ceph storage is too high due to rebalance work. We are experimenting with --osd-max-backfills 1, but it hasn't solved the problem completely. Cheers, Nico Zeeshan Ali Shah [Tue, Dec 23, 2014 at 09:12:25AM +0100]: > Has anyone tried running instances over ceph, i.e. using ceph as backend for > vm storage. How would you get instant migration in that case, since every > compute host will have its own RBD. The other option is to have a big rbd > pool on the head node and share it with NFS to have a shared file system > > any idea ? > > -- > > Regards > > Zeeshan Ali Shah > System Administrator - PDC HPC > PhD researcher (IT security) > Kungliga Tekniska Hogskolan > +46 8 790 9115 > http://www.pdc.kth.se/members/zashah > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- New PGP key: 659B 0D91 E86E 7E24 FD15 69D0 C729 21A1 293F 2D24 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ceph-deploy & state of documentation [was: OSD & JOURNAL not associated - ceph-disk list ?]
Yes - we are mostly using cdist [0] and also plan to support ceph mid-term. The website is by the way running on a VM managed with Opennebula and stored on the ceph cluster - so in case it is not reachable, you can guess why ;-) [0] http://www.nico.schottelius.org/software/cdist/ Craig Lewis [Mon, Dec 22, 2014 at 11:44:43AM -0800]: > I get the impression that more people on the ML are using a config > management system. ceph-deploy questions seem to come from new users > following the quick start guide. > > I know both Puppet and Chef are fairly well represented here. I've seen a > few posts about Salt and Ansible, but not much. Calamari is built on top > of Salt, so I suppose that means Salt is well represented. I really > haven't seen anything from the CFEngine or Bcfg2 camps. > > > I'm personally using Chef with a private fork of the Ceph cookbook. The > Ceph cookbook doesn't use ceph-deploy, but it does use ceph-disk. Whenever > I have problems with the ceph-disk command, I first go look at the cookbook > to see how it's doing things. > > > > On Sun, Dec 21, 2014 at 10:37 AM, Nico Schottelius < > nico-ceph-us...@schottelius.org> wrote: > > > Hello list, > > > > I am a bit wondering about "ceph-deploy" and the development of ceph: I > > see that many people in the community are pushing towards the use of > > ceph-deploy, likely to ease use of ceph. > > > > However, I have run into issues multiple times using ceph-deploy, when > > it failed or incorrectly set up partitions or created a cluster of > > monitors that never reached quorum. > > > > I have also found debugging and learning ceph to be much more > > difficult with ceph-deploy, compared to going the manual way, because as > > a user I miss a lot of information. > > > > Furthermore, as the maintainer of a configuration management system [0], > > I am interested in knowing how things work behind the scenes to > > be able to automate them. > > > > Thus I was wondering if it is an option for the ceph community to > > focus on both ways (the manual one & ceph-deploy) instead of just pushing > > ceph-deploy? > > > > Cheers, > > > > Nico > > > > p.s.: Loic, just taking your mail as an example, but it is not personal > > - just want to show my point. > > > > Loic Dachary [Sun, Dec 21, 2014 at 06:08:27PM +0100]: > > > [...] > > > > > > Is there a reason why you need to do this instead of letting ceph-disk > > prepare do it for you ? > > > > > > [...] > > > > -- > > New PGP key: 659B 0D91 E86E 7E24 FD15 69D0 C729 21A1 293F 2D24 > > ___ > > ceph-users mailing list > > ceph-users@lists.ceph.com > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > -- New PGP key: 659B 0D91 E86E 7E24 FD15 69D0 C729 21A1 293F 2D24 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] ceph-deploy & state of documentation [was: OSD & JOURNAL not associated - ceph-disk list ?]
Hello list, I am a bit wondering about "ceph-deploy" and the development of ceph: I see that many people in the community are pushing towards the use of ceph-deploy, likely to ease use of ceph. However, I have run into issues multiple times using ceph-deploy, when it failed or incorrectly set up partitions or created a cluster of monitors that never reached quorum. I have also found debugging and learning ceph to be much more difficult with ceph-deploy, compared to going the manual way, because as a user I miss a lot of information. Furthermore, as the maintainer of a configuration management system [0], I am interested in knowing how things work behind the scenes to be able to automate them. Thus I was wondering if it is an option for the ceph community to focus on both ways (the manual one & ceph-deploy) instead of just pushing ceph-deploy? Cheers, Nico p.s.: Loic, just taking your mail as an example, but it is not personal - just want to show my point. Loic Dachary [Sun, Dec 21, 2014 at 06:08:27PM +0100]: > [...] > > Is there a reason why you need to do this instead of letting ceph-disk > prepare do it for you ? > > [...] -- New PGP key: 659B 0D91 E86E 7E24 FD15 69D0 C729 21A1 293F 2D24 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Hanging VMs with Qemu + RBD
Hello, another issue we have experienced with qemu VMs (qemu 2.0.0) with ceph-0.80 on Ubuntu 14.04 managed by opennebula 4.10.1: the VMs are completely frozen when rebalancing takes place; they do not even respond to ping anymore. Looking at the qemu processes, they are in state "Sl". Is this a known problem / have others seen this behaviour? I have not yet tuned any backfilling parameters, and it is a cluster of 3 hosts with one host having 6 osds and the other two having 1 each (so 8 osds in total). Our qemu runs with these rbd-related options: qemu-system-x86_64 ... -drive file=rbd:one/one-38:id=libvirt:key=...:auth_supported=cephx\;none:mon_host=kaffee.private.ungleich.ch\;wein.private.ungleich.ch\;tee.private.ungleich.ch,if=none,id=drive-ide0-0-0,format=raw,cache=none Cheers, Nico -- New PGP key: 659B 0D91 E86E 7E24 FD15 69D0 C729 21A1 293F 2D24 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
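A hedged sketch of the backfill/recovery throttles available in the 0.80 era (values illustrative and untested here; the defaults are much more aggressive):

    ceph tell osd.\* injectargs '--osd-max-backfills 1'           # at most 1 backfill per osd
    ceph tell osd.\* injectargs '--osd-recovery-max-active 1'     # at most 1 active recovery op per osd
    ceph tell osd.\* injectargs '--osd-recovery-op-priority 1'    # let client ops win over recovery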
[ceph-users] Stuck + Incomplete after deleting to allow osd to start
Hello, we have had some trouble with osds running full, even after rebalancing. So at 100% usage and ceph-osds not starting anymore, we decided to delete some pg directories, after which rebalancing finished. However, after this we have the situation that one pg is not becoming clean anymore. We tried to

a) stop, then stop+out osd.7 -> after rebalancing the pg is still stuck
b) mark objects lost:
   root@wein:~# ceph pg 3.14 mark_unfound_lost revert
   pg has no unfound objects
c) stop osd.7, rsync the directory 3.14_head from osd.2, start osd.7
d) ceph pg scrub 3.14

So far the status is still that this pg is down. I have attached some of the lines / logs. I would be grateful if you can give any hints on how to repair this situation. Cheers, Nico p.s.: Using ceph 0.80.7.

Action causing the problem:

root@wein:/var/lib/ceph/osd/ceph-7/current# ls
0.12_head  0.a_head   2.1c_head  3.2a_head  3.4c_head  3.6b_TEMP  3.8b_head  3.97_TEMP  3.c7_TEMP
0.14_head  1.10_head  2.26_head  3.32_head  3.4c_TEMP  3.6c_head  3.8d_head  3.9b_head  3.c_head
0.21_head  1.1a_head  2.2a_head  3.32_TEMP  3.56_head  3.6c_TEMP  3.8d_TEMP  3.9b_TEMP  3.d_head
0.23_head  1.21_head  2.2e_head  3.37_head  3.56_TEMP  3.6_head   3.8e_head  3.a9_head  3.d_TEMP
0.2b_head  1.2b_head  2.2f_head  3.37_TEMP  3.5b_head  3.7b_head  3.8_head   3.a9_TEMP  3.f_head
0.2d_head  1.2c_head  2.33_head  3.47_head  3.5b_TEMP  3.7b_TEMP  3.91_head  3.ab_TEMP  3.f_TEMP
0.2e_head  1.32_head  2.3f_head  3.47_TEMP  3.60_head  3.80_head  3.91_TEMP  3.b2_TEMP  commit_op_seq
0.2_head   1.37_head  2.b_head   3.49_head  3.61_head  3.81_head  3.93_head  3.b7_TEMP  meta
0.38_head  1.3c_head  3.0_head   3.49_TEMP  3.61_TEMP  3.82_head  3.93_TEMP  3.bf_head  nosnap
0.3b_head  1.e_head   3.12_head  3.4a_head  3.67_head  3.82_TEMP  3.94_head  3.bf_TEMP  omap
0.3e_head  2.10_head  3.14_head  3.4a_TEMP  3.67_TEMP  3.89_head  3.94_TEMP  3.b_head
0.7_head   2.15_head  3.14_TEMP  3.4b_head  3.6b_head  3.89_TEMP  3.97_head  3.b_TEMP
root@wein:/var/lib/ceph/osd/ceph-7/current# du -sh 3.14_*
3.9G    3.14_head
4.0K    3.14_TEMP

The current status:

root@kaffee:~# ceph -s
    cluster e0611730-09ff-4f3c-bfdb-2dd415274a36
     health HEALTH_WARN 1 pgs down; 1 pgs peering; 1 pgs stuck inactive; 1 pgs stuck unclean; 5 requests are blocked > 32 sec
     monmap e3: 3 mons at {kaffee=192.168.40.1:6789/0,tee=192.168.40.2:6789/0,wein=192.168.40.3:6789/0}, election epoch 3652, quorum 0,1,2 kaffee,tee,wein
     osdmap e1129: 8 osds: 7 up, 7 in
      pgmap v435448: 448 pgs, 4 pools, 976 GB data, 248 kobjects
            1938 GB used, 9913 GB / 11852 GB avail
                 447 active+clean
                   1 down+peering

root@wein:/var/lib/ceph/osd/ceph-7/current# ceph health detail
HEALTH_WARN 1 pgs incomplete; 1 pgs stuck inactive; 1 pgs stuck unclean; 5 requests are blocked > 32 sec; 1 osds have slow requests
pg 3.14 is stuck inactive for 135697.438689, current state incomplete, last acting [2,7]
pg 3.14 is stuck unclean for 135697.438702, current state incomplete, last acting [2,7]
pg 3.14 is incomplete, acting [2,7]
5 ops are blocked > 8388.61 sec
5 ops are blocked > 8388.61 sec on osd.2
1 osds have slow requests

root@wein:~# ceph pg dump_stuck stale
ok
root@wein:~# ceph pg dump_stuck unclean
ok
pg_stat objects mip  degr  unf  bytes       log   disklog  state       state_stamp                 v          reported    up     up_primary  acting  acting_primary  last_scrub  scrub_stamp                 last_deep_scrub  deep_scrub_stamp
3.14    1006    0    0     0    4135415824  3001  3001     incomplete  2014-12-19 14:40:00.272775  589'27399  1150:66317  [2,7]  2           [2,7]   2               503'24268   2014-12-13 19:17:39.272720  503'24268        2014-12-13 19:17:38.672258
root@wein:~# ceph pg dump_stuck inactive
ok
pg_stat objects mip  degr  unf  bytes       log   disklog  state       state_stamp                 v          reported    up     up_primary  acting  acting_primary  last_scrub  scrub_stamp                 last_deep_scrub  deep_scrub_stamp
3.14    1006    0    0     0    4135415824  3001  3001     incomplete  2014-12-19 14:40:00.272775  589'27399  1150:66317  [2,7]  2           [2,7]   2               503'24268   2014-12-13 19:17:39.272720  503'24268        2014-12-13 19:17:38.672258
root@wein:~#
root@wein:~# ceph osd tree
# id    weight  type name       up/down reweight
-1      2.3     root default
-2      0.2999          host wein
0       0.04999                 osd.0   up      1
3       0.04999                 osd.3   up      1
4       0.04999                 osd.4   up      1
5       0.04999                 osd.5   up      1
6       0.04999