Re: [ceph-users] Repair inconsistent pgs..
Is there a bug for this in the tracker? -Sam

On Thu, Aug 20, 2015 at 9:54 AM, Voloshanenko Igor igor.voloshane...@gmail.com wrote: The issue is that in forward mode fstrim doesn't work properly, and when we take a snapshot the data is not properly updated in the cache layer, so the client (ceph) sees a damaged snap, because the headers are served from the cache layer.
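For reference, the mode a cache tier is currently in shows up in the pool details, and it can be switched back if forward mode is the problem; a minimal sketch, assuming a hypothetical cache pool named "cache":

    # The pool line should show cache_mode writeback/forward/readonly/none.
    ceph osd dump | grep "'cache'"
    # Switch the hypothetical tier pool back to writeback if forward mode misbehaves.
    ceph osd tier cache-mode cache writeback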
Re: [ceph-users] Repair inconsistent pgs..
Not yet. I will create one. But according to the mailing lists and the Inktank docs, it's expected behaviour when a cache tier is enabled.

2015-08-20 19:56 GMT+03:00 Samuel Just sj...@redhat.com: Is there a bug for this in the tracker? -Sam
Re: [ceph-users] Repair inconsistent pgs..
Which docs? -Sam

On Thu, Aug 20, 2015 at 9:57 AM, Voloshanenko Igor igor.voloshane...@gmail.com wrote: Not yet. I will create one. But according to the mailing lists and the Inktank docs, it's expected behaviour when a cache tier is enabled.
Re: [ceph-users] Repair inconsistent pgs..
Inktank: https://download.inktank.com/docs/ICE%201.2%20-%20Cache%20and%20Erasure%20Coding%20FAQ.pdf
Mail-list: https://www.mail-archive.com/ceph-users@lists.ceph.com/msg18338.html

2015-08-20 20:06 GMT+03:00 Samuel Just sj...@redhat.com: Which docs? -Sam
Re: [ceph-users] Repair inconsistent pgs..
Image? One? We started deleting images only to fix this (export/import); before that, 1-4 times per day (when a VM is destroyed)...

2015-08-21 1:44 GMT+03:00 Samuel Just sj...@redhat.com: Interesting. How often do you delete an image? I'm wondering if whatever this is happened when you deleted these two images. -Sam
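For reference, the export/import workaround mentioned above can be done with the stock rbd CLI; a minimal sketch, assuming hypothetical pool and image names and that the VM is stopped while the copy runs:

    # Copy a suspect image's data into a brand-new image, then point the VM at the copy.
    rbd export rbd/bad-image - | rbd import - rbd/bad-image-copy
    # Leave the original image and its snapshots in place until the on-disk state
    # behind the scrub errors has been cleaned up, as discussed later in the thread.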
Re: [ceph-users] Repair inconsistent pgs..
Hi Samuel, we tried to fix it in a tricky way. We checked all the affected rbd_data chunks from the OSD logs, then queried rbd info to work out which rbd images contain the bad rbd_data; after that we mapped each such rbd as rbd0, created an empty rbd, and dd'd everything from the bad volume to the new one. But after that the scrub errors keep growing... It was 15 errors.. now 35... We also tried to out the OSD which was the lead (primary), but after rebalancing these 2 pgs still have 35 scrub errors... ceph osd getmap -o outfile - attached

2015-08-18 18:48 GMT+03:00 Samuel Just sj...@redhat.com: Is the number of inconsistent objects growing? Can you attach the whole ceph.log from the 6 hours before and after the snippet you linked above? Are you using cache/tiering? Can you attach the osdmap (ceph osd getmap -o outfile)? -Sam

On Tue, Aug 18, 2015 at 4:15 AM, Voloshanenko Igor igor.voloshane...@gmail.com wrote: ceph - 0.94.2. It happened during rebalancing. I thought too that some OSD was missing a copy, but it looks like all of them are... So, any advice on which direction I need to go?

2015-08-18 14:14 GMT+03:00 Gregory Farnum gfar...@redhat.com: From a quick peek it looks like some of the OSDs are missing clones of objects. I'm not sure how that could happen and I'd expect the pg repair to handle that, but if it's not there's probably something wrong; what version of Ceph are you running? Sam, is this something you've seen, a new bug, or some kind of config issue? -Greg

On Tue, Aug 18, 2015 at 6:27 AM, Voloshanenko Igor igor.voloshane...@gmail.com wrote: Hi all, at our production cluster, due to high rebalancing ((( we have 2 pgs in an inconsistent state...

root@temp:~# ceph health detail | grep inc
HEALTH_ERR 2 pgs inconsistent; 18 scrub errors
pg 2.490 is active+clean+inconsistent, acting [56,15,29]
pg 2.c4 is active+clean+inconsistent, acting [56,10,42]

From OSD logs, after a recovery attempt:

root@test:~# ceph pg dump | grep -i incons | cut -f 1 | while read i; do ceph pg repair ${i} ; done
dumped all in format plain
instructing pg 2.490 on osd.56 to repair
instructing pg 2.c4 on osd.56 to repair

/var/log/ceph/ceph-osd.56.log:51:2015-08-18 07:26:37.035910 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 f5759490/rbd_data.1631755377d7e.04da/head//2 expected clone 90c59490/rbd_data.eb486436f2beb.7a65/141//2
/var/log/ceph/ceph-osd.56.log:52:2015-08-18 07:26:37.035960 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 fee49490/rbd_data.12483d3ba0794b.522f/head//2 expected clone f5759490/rbd_data.1631755377d7e.04da/141//2
/var/log/ceph/ceph-osd.56.log:53:2015-08-18 07:26:37.036133 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 a9b39490/rbd_data.12483d3ba0794b.37b3/head//2 expected clone fee49490/rbd_data.12483d3ba0794b.522f/141//2
/var/log/ceph/ceph-osd.56.log:54:2015-08-18 07:26:37.036243 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 bac19490/rbd_data.1238e82ae8944a.032e/head//2 expected clone a9b39490/rbd_data.12483d3ba0794b.37b3/141//2
/var/log/ceph/ceph-osd.56.log:55:2015-08-18 07:26:37.036289 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 98519490/rbd_data.123e9c2ae8944a.0807/head//2 expected clone bac19490/rbd_data.1238e82ae8944a.032e/141//2
/var/log/ceph/ceph-osd.56.log:56:2015-08-18 07:26:37.036314 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 c3c09490/rbd_data.1238e82ae8944a.0c2b/head//2 expected clone 98519490/rbd_data.123e9c2ae8944a.0807/141//2
/var/log/ceph/ceph-osd.56.log:57:2015-08-18 07:26:37.036363 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 28809490/rbd_data.edea7460fe42b.01d9/head//2 expected clone c3c09490/rbd_data.1238e82ae8944a.0c2b/141//2
/var/log/ceph/ceph-osd.56.log:58:2015-08-18 07:26:37.036432 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 e1509490/rbd_data.1423897545e146.09a6/head//2 expected clone 28809490/rbd_data.edea7460fe42b.01d9/141//2
/var/log/ceph/ceph-osd.56.log:59:2015-08-18 07:26:38.548765 7f94663b3700 -1 log_channel(cluster) log [ERR] : 2.490 deep-scrub 17 errors

So, how can I solve the "expected clone" situation by hand? Thanks in advance!
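The rbd_data-prefix-to-image lookup described above can be scripted around rbd info; a rough sketch, where the pool name is a placeholder and the prefix is taken from one of the log lines:

    # Find which RBD image owns a given rbd_data prefix seen in the scrub errors.
    PREFIX="rbd_data.12483d3ba0794b"   # prefix from the log lines above
    POOL="rbd"                         # hypothetical pool name
    for img in $(rbd ls "$POOL"); do
        if rbd info "$POOL/$img" | grep -q "block_name_prefix: $PREFIX"; then
            echo "$PREFIX belongs to $POOL/$img"
        fi
    done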
Re: [ceph-users] Repair inconsistent pgs..
Interesting. How often do you delete an image? I'm wondering if whatever this is happened when you deleted these two images. -Sam

On Thu, Aug 20, 2015 at 3:42 PM, Voloshanenko Igor igor.voloshane...@gmail.com wrote: Sam, I tried to understand which rbd contains these chunks.. but no luck. No rbd image block names start with these...

2015-08-21 1:36 GMT+03:00 Samuel Just sj...@redhat.com: Actually, now that I think about it, you probably didn't remove the images for 3fac9490/rbd_data.eb5f22eb141f2.04ba/snapdir//2 and 22ca30c4/rbd_data.e846e25a70bf7.0307/snapdir//2, but other images (that's why the scrub errors went down briefly; those objects -- which were fine -- went away). You might want to export and reimport those two images into new images, but leave the old ones alone until you can clean up the on-disk state (image and snapshots) and clear the scrub errors. You probably don't want to read the snapshots for those images either. Everything else is, I think, harmless. The ceph-objectstore-tool feature would probably not be too hard, actually. Each head/snapdir object has two attrs (possibly stored in leveldb -- that's why you want to modify ceph-objectstore-tool and use its interfaces rather than mucking about with the files directly), '_' and 'snapset', which contain encoded representations of object_info_t and SnapSet (both can be found in src/osd/osd_types.h). SnapSet has a set of clones and related metadata -- you want to read the SnapSet attr off disk and commit a transaction writing out a new version with that clone removed. I'd start by cloning the repo, starting a vstart cluster locally, and reproducing the issue. Next, get familiar with using ceph-objectstore-tool on the osds in that vstart cluster. A good first change would be creating a ceph-objectstore-tool op that lets you dump json for the object_info_t and SnapSet (both types have format() methods which make that easy) on an object to stdout, so you can confirm what's actually there. oftc #ceph-devel or the ceph-devel mailing list would be the right place to ask questions. Otherwise, it'll probably get done in the next few weeks. -Sam
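For anyone picking up that ceph-objectstore-tool work, a rough sketch of the local dev setup Sam describes; the hammer-era build steps, vstart flags and paths here are assumptions to verify against the tree, not a recipe:

    # Build Ceph from source and start a throw-away local cluster with vstart.sh.
    git clone https://github.com/ceph/ceph.git
    cd ceph
    ./autogen.sh && ./configure && make -j4     # autotools build used at the time
    cd src
    MON=1 OSD=3 ./vstart.sh -d -n -x            # new local cluster, debug on, cephx on
    ./ceph -c ceph.conf -s                      # talk to the vstart cluster
    # Then stop one of the vstart OSDs and experiment with ceph-objectstore-tool
    # against its data directory (dev/osd0 under the source tree).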
Re: [ceph-users] Repair inconsistent pgs..
Ok, so images are regularly removed. In that case, these two objects are probably left over from previously removed images. Once ceph-objectstore-tool can dump the SnapSet from those two objects, you will probably find that those two snapdir objects each have only one bogus clone, in which case you'll probably just remove the images. -Sam

On Thu, Aug 20, 2015 at 3:45 PM, Voloshanenko Igor igor.voloshane...@gmail.com wrote: Image? One? We started deleting images only to fix this (export/import); before that, 1-4 times per day (when a VM is destroyed)...
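Until that dump feature lands, the two snapdir objects can at least be located on disk with the existing ceph-objectstore-tool list op; a sketch, assuming default Ubuntu paths and osd.56 (the primary in the acting set above). The OSD must be stopped first:

    # Stop the OSD, list the objects of the affected PG, then restart it.
    stop ceph-osd id=56          # upstart syntax; use your init system's equivalent
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-56 \
        --journal-path /var/lib/ceph/osd/ceph-56/journal \
        --pgid 2.490 --op list | grep eb5f22eb141f2
    start ceph-osd id=56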
Re: [ceph-users] Repair inconsistent pgs..
Guys, I'm Igor's colleague, working a bit on Ceph together with Igor. This is a production cluster, and we are becoming more desperate as time goes by. I'm not sure if this is the appropriate place to seek commercial support, but anyhow, I'll do it... If anyone feels like it and has experience with this particular PG troubleshooting issue, we are also ready to seek commercial support to solve our problem, company or individual, it doesn't matter. Thanks, Andrija

On 20 August 2015 at 19:07, Voloshanenko Igor igor.voloshane...@gmail.com wrote: Inktank: https://download.inktank.com/docs/ICE%201.2%20-%20Cache%20and%20Erasure%20Coding%20FAQ.pdf Mail-list: https://www.mail-archive.com/ceph-users@lists.ceph.com/msg18338.html
Re: [ceph-users] Repair inconsistent pgs..
Ah, this is kind of silly. I think you don't have 37 errors, but 2 errors. pg 2.490 object 3fac9490/rbd_data.eb5f22eb141f2.04ba/snapdir//2 is missing snap 141. If you look at the objects after that in the log:

2015-08-20 20:15:44.865670 osd.19 10.12.2.6:6838/1861727 298 : cluster [ERR] repair 2.490 68c89490/rbd_data.16796a3d1b58ba.0047/head//2 expected clone 2d7b9490/rbd_data.18f92c3d1b58ba.6167/141//2
2015-08-20 20:15:44.865817 osd.19 10.12.2.6:6838/1861727 299 : cluster [ERR] repair 2.490 ded49490/rbd_data.11a25c7934d3d4.8a8a/head//2 expected clone 68c89490/rbd_data.16796a3d1b58ba.0047/141//2

The clone from the second line matches the head object from the previous line, and they have the same clone id. I *think* that the first error is real, and the subsequent ones are just scrub being dumb. Same deal with pg 2.c4. I just opened http://tracker.ceph.com/issues/12738. The original problem is that 3fac9490/rbd_data.eb5f22eb141f2.04ba/snapdir//2 and 22ca30c4/rbd_data.e846e25a70bf7.0307/snapdir//2 are both missing a clone. Not sure how that happened; my money is on a cache/tiering evict racing with a snap trim. If you have any logging or relevant information from when that happened, you should open a bug. The 'snapdir' in the two object names indicates that the head object has actually been deleted (which makes sense if you moved the image to a new image and deleted the old one) and is only being kept around since there are live snapshots. I suggest you leave the snapshots for those images alone for the time being -- removing them might cause the osd to crash trying to clean up the weird on-disk state. Other than the leaked space from those two image snapshots and the annoying spurious scrub errors, I think no actual corruption is going on though. I created a tracker ticket for a feature that would let ceph-objectstore-tool remove the spurious clone from the head/snapdir metadata. Am I right that you haven't actually seen any osd crashes or user-visible corruption (except possibly on snapshots of those two images)? -Sam

On Thu, Aug 20, 2015 at 10:07 AM, Voloshanenko Igor igor.voloshane...@gmail.com wrote: Inktank: https://download.inktank.com/docs/ICE%201.2%20-%20Cache%20and%20Erasure%20Coding%20FAQ.pdf Mail-list: https://www.mail-archive.com/ceph-users@lists.ceph.com/msg18338.html
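If logs from around the time those clones disappeared still exist, grepping everything for the two affected block-name prefixes is a cheap way to gather material for that bug report; a simple sketch using the default log locations:

    # Collect every mention of the two damaged objects from cluster and OSD logs.
    for p in rbd_data.eb5f22eb141f2 rbd_data.e846e25a70bf7; do
        grep "$p" /var/log/ceph/ceph.log /var/log/ceph/ceph-osd.*.log
    done > affected-objects.log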
Re: [ceph-users] Repair inconsistent pgs..
The feature bug for the tool is http://tracker.ceph.com/issues/12740. -Sam

On Thu, Aug 20, 2015 at 2:52 PM, Samuel Just sj...@redhat.com wrote: Ah, this is kind of silly. I think you don't have 37 errors, but 2 errors.
Re: [ceph-users] Repair inconsistent pgs..
Thank you, Sam! I also noticed these linked errors during scrub... Now it all looks reasonable! So we will wait for the bug to be closed. Do you need any help with it? I mean, I can help with coding/testing/etc...

2015-08-21 0:52 GMT+03:00 Samuel Just sj...@redhat.com: Ah, this is kind of silly. I think you don't have 37 errors, but 2 errors.
Re: [ceph-users] Repair inconsistent pgs..
Samuel, we turned off the cache layer a few hours ago... I will post ceph.log in a few minutes. As for the snapshots - we found the issue, it was connected with the cache tier..

2015-08-20 19:23 GMT+03:00 Samuel Just sj...@redhat.com: Ok, you appear to be using a replicated cache tier in front of a replicated base tier. Please scrub both inconsistent pgs and post the ceph.log from before when you started the scrub until after. Also, what command are you using to take snapshots? -Sam
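For completeness, "turning off the cache layer" is normally a sequence like the following (pool names are placeholders; the flush/evict step can take a long time on a busy tier):

    # Stop new writes landing in the tier, flush/evict what it holds, then detach it.
    ceph osd tier cache-mode cachepool forward
    rados -p cachepool cache-flush-evict-all
    ceph osd tier remove-overlay basepool
    ceph osd tier remove basepool cachepool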
Re: [ceph-users] Repair inconsistent pgs..
What was the issue? -Sam On Thu, Aug 20, 2015 at 9:41 AM, Voloshanenko Igor igor.voloshane...@gmail.com wrote: Samuel, we turned off cache layer few hours ago... I will post ceph.log in few minutes For snap - we found issue, was connected with cache tier.. 2015-08-20 19:23 GMT+03:00 Samuel Just sj...@redhat.com: Ok, you appear to be using a replicated cache tier in front of a replicated base tier. Please scrub both inconsistent pgs and post the ceph.log from before when you started the scrub until after. Also, what command are you using to take snapshots? -Sam On Thu, Aug 20, 2015 at 3:59 AM, Voloshanenko Igor igor.voloshane...@gmail.com wrote: Hi Samuel, we try to fix it in trick way. we check all rbd_data chunks from logs (OSD) which are affected, then query rbd info to compare which rbd consist bad rbd_data, after that we mount this rbd as rbd0, create empty rbd, and DD all info from bad volume to new one. But after that - scrub errors growing... Was 15 errors.. .Now 35... We laos try to out OSD which was lead, but after rebalancing this 2 pgs still have 35 scrub errors... ceph osd getmap -o outfile - attached 2015-08-18 18:48 GMT+03:00 Samuel Just sj...@redhat.com: Is the number of inconsistent objects growing? Can you attach the whole ceph.log from the 6 hours before and after the snippet you linked above? Are you using cache/tiering? Can you attach the osdmap (ceph osd getmap -o outfile)? -Sam On Tue, Aug 18, 2015 at 4:15 AM, Voloshanenko Igor igor.voloshane...@gmail.com wrote: ceph - 0.94.2 Its happen during rebalancing I thought too, that some OSD miss copy, but looks like all miss... So any advice in which direction i need to go 2015-08-18 14:14 GMT+03:00 Gregory Farnum gfar...@redhat.com: From a quick peek it looks like some of the OSDs are missing clones of objects. I'm not sure how that could happen and I'd expect the pg repair to handle that but if it's not there's probably something wrong; what version of Ceph are you running? Sam, is this something you've seen, a new bug, or some kind of config issue? -Greg On Tue, Aug 18, 2015 at 6:27 AM, Voloshanenko Igor igor.voloshane...@gmail.com wrote: Hi all, at our production cluster, due high rebalancing ((( we have 2 pgs in inconsistent state... 
root@temp:~# ceph health detail | grep inc HEALTH_ERR 2 pgs inconsistent; 18 scrub errors pg 2.490 is active+clean+inconsistent, acting [56,15,29] pg 2.c4 is active+clean+inconsistent, acting [56,10,42] From OSD logs, after recovery attempt: root@test:~# ceph pg dump | grep -i incons | cut -f 1 | while read i; do ceph pg repair ${i} ; done dumped all in format plain instructing pg 2.490 on osd.56 to repair instructing pg 2.c4 on osd.56 to repair /var/log/ceph/ceph-osd.56.log:51:2015-08-18 07:26:37.035910 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 f5759490/rbd_data.1631755377d7e.04da/head//2 expected clone 90c59490/rbd_data.eb486436f2beb.7a65/141//2 /var/log/ceph/ceph-osd.56.log:52:2015-08-18 07:26:37.035960 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 fee49490/rbd_data.12483d3ba0794b.522f/head//2 expected clone f5759490/rbd_data.1631755377d7e.04da/141//2 /var/log/ceph/ceph-osd.56.log:53:2015-08-18 07:26:37.036133 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 a9b39490/rbd_data.12483d3ba0794b.37b3/head//2 expected clone fee49490/rbd_data.12483d3ba0794b.522f/141//2 /var/log/ceph/ceph-osd.56.log:54:2015-08-18 07:26:37.036243 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 bac19490/rbd_data.1238e82ae8944a.032e/head//2 expected clone a9b39490/rbd_data.12483d3ba0794b.37b3/141//2 /var/log/ceph/ceph-osd.56.log:55:2015-08-18 07:26:37.036289 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 98519490/rbd_data.123e9c2ae8944a.0807/head//2 expected clone bac19490/rbd_data.1238e82ae8944a.032e/141//2 /var/log/ceph/ceph-osd.56.log:56:2015-08-18 07:26:37.036314 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 c3c09490/rbd_data.1238e82ae8944a.0c2b/head//2 expected clone 98519490/rbd_data.123e9c2ae8944a.0807/141//2 /var/log/ceph/ceph-osd.56.log:57:2015-08-18 07:26:37.036363 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 28809490/rbd_data.edea7460fe42b.01d9/head//2 expected clone c3c09490/rbd_data.1238e82ae8944a.0c2b/141//2 /var/log/ceph/ceph-osd.56.log:58:2015-08-18 07:26:37.036432 7f94663b3700 -1
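For the step of working out which images the scrub errors implicate, the rbd_data prefixes can be pulled straight out of the per-OSD log excerpted above. A rough one-liner, assuming the log format shown (the hex id between rbd_data. and the object index is the per-image block name prefix):

grep 'expected clone' /var/log/ceph/ceph-osd.56.log | grep -o 'rbd_data\.[0-9a-f]*' | sort -u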
Re: [ceph-users] Repair inconsistent pgs..
The issue is that in forward mode fstrim doesn't work properly, and when we take a snapshot the data is not properly updated in the cache layer, so the client (ceph) sees a damaged snap, since the headers are requested from the cache layer. 2015-08-20 19:53 GMT+03:00 Samuel Just sj...@redhat.com: What was the issue? -Sam On Thu, Aug 20, 2015 at 9:41 AM, Voloshanenko Igor igor.voloshane...@gmail.com wrote: Samuel, we turned off cache layer few hours ago... I will post ceph.log in few minutes For snap - we found issue, was connected with cache tier.. 2015-08-20 19:23 GMT+03:00 Samuel Just sj...@redhat.com: Ok, you appear to be using a replicated cache tier in front of a replicated base tier. Please scrub both inconsistent pgs and post the ceph.log from before when you started the scrub until after. Also, what command are you using to take snapshots? -Sam On Thu, Aug 20, 2015 at 3:59 AM, Voloshanenko Igor igor.voloshane...@gmail.com wrote: Hi Samuel, we try to fix it in trick way. we check all rbd_data chunks from logs (OSD) which are affected, then query rbd info to compare which rbd consist bad rbd_data, after that we mount this rbd as rbd0, create empty rbd, and DD all info from bad volume to new one. But after that - scrub errors growing... Was 15 errors.. .Now 35... We laos try to out OSD which was lead, but after rebalancing this 2 pgs still have 35 scrub errors... ceph osd getmap -o outfile - attached 2015-08-18 18:48 GMT+03:00 Samuel Just sj...@redhat.com: Is the number of inconsistent objects growing? Can you attach the whole ceph.log from the 6 hours before and after the snippet you linked above? Are you using cache/tiering? Can you attach the osdmap (ceph osd getmap -o outfile)? -Sam On Tue, Aug 18, 2015 at 4:15 AM, Voloshanenko Igor igor.voloshane...@gmail.com wrote: ceph - 0.94.2 Its happen during rebalancing I thought too, that some OSD miss copy, but looks like all miss... So any advice in which direction i need to go 2015-08-18 14:14 GMT+03:00 Gregory Farnum gfar...@redhat.com: From a quick peek it looks like some of the OSDs are missing clones of objects. I'm not sure how that could happen and I'd expect the pg repair to handle that but if it's not there's probably something wrong; what version of Ceph are you running? Sam, is this something you've seen, a new bug, or some kind of config issue? -Greg On Tue, Aug 18, 2015 at 6:27 AM, Voloshanenko Igor igor.voloshane...@gmail.com wrote: Hi all, at our production cluster, due high rebalancing ((( we have 2 pgs in inconsistent state... 
root@temp:~# ceph health detail | grep inc HEALTH_ERR 2 pgs inconsistent; 18 scrub errors pg 2.490 is active+clean+inconsistent, acting [56,15,29] pg 2.c4 is active+clean+inconsistent, acting [56,10,42] From OSD logs, after recovery attempt: root@test:~# ceph pg dump | grep -i incons | cut -f 1 | while read i; do ceph pg repair ${i} ; done dumped all in format plain instructing pg 2.490 on osd.56 to repair instructing pg 2.c4 on osd.56 to repair /var/log/ceph/ceph-osd.56.log:51:2015-08-18 07:26:37.035910 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 f5759490/rbd_data.1631755377d7e.04da/head//2 expected clone 90c59490/rbd_data.eb486436f2beb.7a65/141//2 /var/log/ceph/ceph-osd.56.log:52:2015-08-18 07:26:37.035960 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 fee49490/rbd_data.12483d3ba0794b.522f/head//2 expected clone f5759490/rbd_data.1631755377d7e.04da/141//2 /var/log/ceph/ceph-osd.56.log:53:2015-08-18 07:26:37.036133 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 a9b39490/rbd_data.12483d3ba0794b.37b3/head//2 expected clone fee49490/rbd_data.12483d3ba0794b.522f/141//2 /var/log/ceph/ceph-osd.56.log:54:2015-08-18 07:26:37.036243 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 bac19490/rbd_data.1238e82ae8944a.032e/head//2 expected clone a9b39490/rbd_data.12483d3ba0794b.37b3/141//2 /var/log/ceph/ceph-osd.56.log:55:2015-08-18 07:26:37.036289 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 98519490/rbd_data.123e9c2ae8944a.0807/head//2 expected clone bac19490/rbd_data.1238e82ae8944a.032e/141//2 /var/log/ceph/ceph-osd.56.log:56:2015-08-18 07:26:37.036314 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 c3c09490/rbd_data.1238e82ae8944a.0c2b/head//2 expected clone
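For context on turning the cache layer off: with a writeback cache tier this is usually done by switching the tier to forward mode, flushing and evicting it, and removing the overlay. A rough sketch with hypothetical pool names (basepool and cachepool); exact flags vary by release, and some versions ask for an extra confirmation flag before allowing forward mode:

ceph osd tier cache-mode cachepool forward
rados -p cachepool cache-flush-evict-all
ceph osd tier remove-overlay basepool
ceph osd tier remove basepool cachepool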
Re: [ceph-users] Repair inconsistent pgs..
Ok, you appear to be using a replicated cache tier in front of a replicated base tier. Please scrub both inconsistent pgs and post the ceph.log from before when you started the scrub until after. Also, what command are you using to take snapshots? -Sam On Thu, Aug 20, 2015 at 3:59 AM, Voloshanenko Igor igor.voloshane...@gmail.com wrote: Hi Samuel, we try to fix it in trick way. we check all rbd_data chunks from logs (OSD) which are affected, then query rbd info to compare which rbd consist bad rbd_data, after that we mount this rbd as rbd0, create empty rbd, and DD all info from bad volume to new one. But after that - scrub errors growing... Was 15 errors.. .Now 35... We laos try to out OSD which was lead, but after rebalancing this 2 pgs still have 35 scrub errors... ceph osd getmap -o outfile - attached 2015-08-18 18:48 GMT+03:00 Samuel Just sj...@redhat.com: Is the number of inconsistent objects growing? Can you attach the whole ceph.log from the 6 hours before and after the snippet you linked above? Are you using cache/tiering? Can you attach the osdmap (ceph osd getmap -o outfile)? -Sam On Tue, Aug 18, 2015 at 4:15 AM, Voloshanenko Igor igor.voloshane...@gmail.com wrote: ceph - 0.94.2 Its happen during rebalancing I thought too, that some OSD miss copy, but looks like all miss... So any advice in which direction i need to go 2015-08-18 14:14 GMT+03:00 Gregory Farnum gfar...@redhat.com: From a quick peek it looks like some of the OSDs are missing clones of objects. I'm not sure how that could happen and I'd expect the pg repair to handle that but if it's not there's probably something wrong; what version of Ceph are you running? Sam, is this something you've seen, a new bug, or some kind of config issue? -Greg On Tue, Aug 18, 2015 at 6:27 AM, Voloshanenko Igor igor.voloshane...@gmail.com wrote: Hi all, at our production cluster, due high rebalancing ((( we have 2 pgs in inconsistent state... 
root@temp:~# ceph health detail | grep inc HEALTH_ERR 2 pgs inconsistent; 18 scrub errors pg 2.490 is active+clean+inconsistent, acting [56,15,29] pg 2.c4 is active+clean+inconsistent, acting [56,10,42] From OSD logs, after recovery attempt: root@test:~# ceph pg dump | grep -i incons | cut -f 1 | while read i; do ceph pg repair ${i} ; done dumped all in format plain instructing pg 2.490 on osd.56 to repair instructing pg 2.c4 on osd.56 to repair /var/log/ceph/ceph-osd.56.log:51:2015-08-18 07:26:37.035910 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 f5759490/rbd_data.1631755377d7e.04da/head//2 expected clone 90c59490/rbd_data.eb486436f2beb.7a65/141//2 /var/log/ceph/ceph-osd.56.log:52:2015-08-18 07:26:37.035960 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 fee49490/rbd_data.12483d3ba0794b.522f/head//2 expected clone f5759490/rbd_data.1631755377d7e.04da/141//2 /var/log/ceph/ceph-osd.56.log:53:2015-08-18 07:26:37.036133 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 a9b39490/rbd_data.12483d3ba0794b.37b3/head//2 expected clone fee49490/rbd_data.12483d3ba0794b.522f/141//2 /var/log/ceph/ceph-osd.56.log:54:2015-08-18 07:26:37.036243 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 bac19490/rbd_data.1238e82ae8944a.032e/head//2 expected clone a9b39490/rbd_data.12483d3ba0794b.37b3/141//2 /var/log/ceph/ceph-osd.56.log:55:2015-08-18 07:26:37.036289 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 98519490/rbd_data.123e9c2ae8944a.0807/head//2 expected clone bac19490/rbd_data.1238e82ae8944a.032e/141//2 /var/log/ceph/ceph-osd.56.log:56:2015-08-18 07:26:37.036314 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 c3c09490/rbd_data.1238e82ae8944a.0c2b/head//2 expected clone 98519490/rbd_data.123e9c2ae8944a.0807/141//2 /var/log/ceph/ceph-osd.56.log:57:2015-08-18 07:26:37.036363 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 28809490/rbd_data.edea7460fe42b.01d9/head//2 expected clone c3c09490/rbd_data.1238e82ae8944a.0c2b/141//2 /var/log/ceph/ceph-osd.56.log:58:2015-08-18 07:26:37.036432 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 e1509490/rbd_data.1423897545e146.09a6/head//2 expected clone 28809490/rbd_data.edea7460fe42b.01d9/141//2 /var/log/ceph/ceph-osd.56.log:59:2015-08-18 07:26:38.548765 7f94663b3700 -1 log_channel(cluster) log [ERR] : 2.490 deep-scrub 17 errors So, how i can solve expected clone situation by hand? Thank in advance! ___
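A minimal way to carry out that request, assuming default log locations (the cluster log normally lives on a monitor host):

ceph pg deep-scrub 2.490
ceph pg deep-scrub 2.c4
# wait for both scrubs to finish (watch ceph -w), then capture the window:
cp /var/log/ceph/ceph.log ceph.log.scrub-window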
Re: [ceph-users] Repair inconsistent pgs..
Also, was there at any point a power failure/power cycle event, perhaps on osd 56? -Sam On Thu, Aug 20, 2015 at 9:23 AM, Samuel Just sj...@redhat.com wrote: Ok, you appear to be using a replicated cache tier in front of a replicated base tier. Please scrub both inconsistent pgs and post the ceph.log from before when you started the scrub until after. Also, what command are you using to take snapshots? -Sam On Thu, Aug 20, 2015 at 3:59 AM, Voloshanenko Igor igor.voloshane...@gmail.com wrote: Hi Samuel, we try to fix it in trick way. we check all rbd_data chunks from logs (OSD) which are affected, then query rbd info to compare which rbd consist bad rbd_data, after that we mount this rbd as rbd0, create empty rbd, and DD all info from bad volume to new one. But after that - scrub errors growing... Was 15 errors.. .Now 35... We laos try to out OSD which was lead, but after rebalancing this 2 pgs still have 35 scrub errors... ceph osd getmap -o outfile - attached 2015-08-18 18:48 GMT+03:00 Samuel Just sj...@redhat.com: Is the number of inconsistent objects growing? Can you attach the whole ceph.log from the 6 hours before and after the snippet you linked above? Are you using cache/tiering? Can you attach the osdmap (ceph osd getmap -o outfile)? -Sam On Tue, Aug 18, 2015 at 4:15 AM, Voloshanenko Igor igor.voloshane...@gmail.com wrote: ceph - 0.94.2 Its happen during rebalancing I thought too, that some OSD miss copy, but looks like all miss... So any advice in which direction i need to go 2015-08-18 14:14 GMT+03:00 Gregory Farnum gfar...@redhat.com: From a quick peek it looks like some of the OSDs are missing clones of objects. I'm not sure how that could happen and I'd expect the pg repair to handle that but if it's not there's probably something wrong; what version of Ceph are you running? Sam, is this something you've seen, a new bug, or some kind of config issue? -Greg On Tue, Aug 18, 2015 at 6:27 AM, Voloshanenko Igor igor.voloshane...@gmail.com wrote: Hi all, at our production cluster, due high rebalancing ((( we have 2 pgs in inconsistent state... 
root@temp:~# ceph health detail | grep inc HEALTH_ERR 2 pgs inconsistent; 18 scrub errors pg 2.490 is active+clean+inconsistent, acting [56,15,29] pg 2.c4 is active+clean+inconsistent, acting [56,10,42] From OSD logs, after recovery attempt: root@test:~# ceph pg dump | grep -i incons | cut -f 1 | while read i; do ceph pg repair ${i} ; done dumped all in format plain instructing pg 2.490 on osd.56 to repair instructing pg 2.c4 on osd.56 to repair /var/log/ceph/ceph-osd.56.log:51:2015-08-18 07:26:37.035910 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 f5759490/rbd_data.1631755377d7e.04da/head//2 expected clone 90c59490/rbd_data.eb486436f2beb.7a65/141//2 /var/log/ceph/ceph-osd.56.log:52:2015-08-18 07:26:37.035960 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 fee49490/rbd_data.12483d3ba0794b.522f/head//2 expected clone f5759490/rbd_data.1631755377d7e.04da/141//2 /var/log/ceph/ceph-osd.56.log:53:2015-08-18 07:26:37.036133 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 a9b39490/rbd_data.12483d3ba0794b.37b3/head//2 expected clone fee49490/rbd_data.12483d3ba0794b.522f/141//2 /var/log/ceph/ceph-osd.56.log:54:2015-08-18 07:26:37.036243 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 bac19490/rbd_data.1238e82ae8944a.032e/head//2 expected clone a9b39490/rbd_data.12483d3ba0794b.37b3/141//2 /var/log/ceph/ceph-osd.56.log:55:2015-08-18 07:26:37.036289 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 98519490/rbd_data.123e9c2ae8944a.0807/head//2 expected clone bac19490/rbd_data.1238e82ae8944a.032e/141//2 /var/log/ceph/ceph-osd.56.log:56:2015-08-18 07:26:37.036314 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 c3c09490/rbd_data.1238e82ae8944a.0c2b/head//2 expected clone 98519490/rbd_data.123e9c2ae8944a.0807/141//2 /var/log/ceph/ceph-osd.56.log:57:2015-08-18 07:26:37.036363 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 28809490/rbd_data.edea7460fe42b.01d9/head//2 expected clone c3c09490/rbd_data.1238e82ae8944a.0c2b/141//2 /var/log/ceph/ceph-osd.56.log:58:2015-08-18 07:26:37.036432 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 e1509490/rbd_data.1423897545e146.09a6/head//2 expected clone 28809490/rbd_data.edea7460fe42b.01d9/141//2 /var/log/ceph/ceph-osd.56.log:59:2015-08-18 07:26:38.548765 7f94663b3700 -1 log_channel(cluster) log [ERR] : 2.490
Re: [ceph-users] Repair inconsistent pgs..
Actually, now that I think about it, you probably didn't remove the images for 3fac9490/rbd_data.eb5f22eb141f2.04ba/snapdir//2 and 22ca30c4/rbd_data.e846e25a70bf7.0307/snapdir//2, but other images (that's why the scrub errors went down briefly, those objects -- which were fine -- went away). You might want to export and reimport those two images into new images, but leave the old ones alone until you can clean up the on disk state (image and snapshots) and clear the scrub errors. You probably don't want to read the snapshots for those images either. Everything else is, I think, harmless. The ceph-objectstore-tool feature would probably not be too hard, actually. Each head/snapdir image has two attrs (possibly stored in leveldb -- that's why you want to modify the ceph-objectstore-tool and use its interfaces rather than mucking about with the files directly) '_' and 'snapset' which contain encoded representations of object_info_t and SnapSet (both can be found in src/osd/osd_types.h). SnapSet has a set of clones and related metadata -- you want to read the SnapSet attr off disk and commit a transaction writing out a new version with that clone removed. I'd start by cloning the repo, starting a vstart cluster locally, and reproducing the issue. Next, get familiar with using ceph-objectstore-tool on the osds in that vstart cluster. A good first change would be creating a ceph-objectstore-tool op that lets you dump json for the object_info_t and SnapSet (both types have format() methods which make that easy) on an object to stdout so you can confirm what's actually there. oftc #ceph-devel or the ceph-devel mailing list would be the right place to ask questions. Otherwise, it'll probably get done in the next few weeks. -Sam On Thu, Aug 20, 2015 at 3:10 PM, Voloshanenko Igor igor.voloshane...@gmail.com wrote: thank you Sam! I also noticed this linked errors during scrub... Now all lools like reasonable! So we will wait for bug to be closed. do you need any help on it? I mean i can help with coding/testing/etc... 2015-08-21 0:52 GMT+03:00 Samuel Just sj...@redhat.com: Ah, this is kind of silly. I think you don't have 37 errors, but 2 errors. pg 2.490 object 3fac9490/rbd_data.eb5f22eb141f2.04ba/snapdir//2 is missing snap 141. If you look at the objects after that in the log: 2015-08-20 20:15:44.865670 osd.19 10.12.2.6:6838/1861727 298 : cluster [ERR] repair 2.490 68c89490/rbd_data.16796a3d1b58ba.0047/head//2 expected clone 2d7b9490/rbd_data.18f92c3d1b58ba.6167/141//2 2015-08-20 20:15:44.865817 osd.19 10.12.2.6:6838/1861727 299 : cluster [ERR] repair 2.490 ded49490/rbd_data.11a25c7934d3d4.8a8a/head//2 expected clone 68c89490/rbd_data.16796a3d1b58ba.0047/141//2 The clone from the second line matches the head object from the previous line, and they have the same clone id. I *think* that the first error is real, and the subsequent ones are just scrub being dumb. Same deal with pg 2.c4. I just opened http://tracker.ceph.com/issues/12738. The original problem is that 3fac9490/rbd_data.eb5f22eb141f2.04ba/snapdir//2 and 22ca30c4/rbd_data.e846e25a70bf7.0307/snapdir//2 are both missing a clone. Not sure how that happened, my money is on a cache/tiering evict racing with a snap trim. If you have any logging or relevant information from when that happened, you should open a bug. 
The 'snapdir' in the two object names indicates that the head object has actually been deleted (which makes sense if you moved the image to a new image and deleted the old one) and is only being kept around since there are live snapshots. I suggest you leave the snapshots for those images alone for the time being -- removing them might cause the osd to crash trying to clean up the weird on-disk state. Other than the leaked space from those two image snapshots and the annoying spurious scrub errors, I think no actual corruption is going on though. I created a tracker ticket for a feature that would let ceph-objectstore-tool remove the spurious clone from the head/snapdir metadata. Am I right that you haven't actually seen any osd crashes or user visible corruption (except possibly on snapshots of those two images)? -Sam On Thu, Aug 20, 2015 at 10:07 AM, Voloshanenko Igor igor.voloshane...@gmail.com wrote: Inktank: https://download.inktank.com/docs/ICE%201.2%20-%20Cache%20and%20Erasure%20Coding%20FAQ.pdf Mail-list: https://www.mail-archive.com/ceph-users@lists.ceph.com/msg18338.html 2015-08-20 20:06 GMT+03:00 Samuel Just sj...@redhat.com: Which docs? -Sam On Thu, Aug 20, 2015 at 9:57 AM, Voloshanenko Igor igor.voloshane...@gmail.com wrote: Not yet. I will create. But according to mail lists and Inktank docs - it's expected behaviour when cache enable 2015-08-20 19:56 GMT+03:00 Samuel Just sj...@redhat.com: Is
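Until the proposed ceph-objectstore-tool op exists, the tool's existing per-object subcommands can at least show what is on disk. A rough sketch, assuming default OSD paths, that osd.56 is stopped first, that this release of the tool supports --op list, list-attrs and get-attr, and that ceph-dencoder on this version knows the SnapSet type; the quoted object spec is a placeholder for the JSON line printed by --op list:

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-56 \
    --journal-path /var/lib/ceph/osd/ceph-56/journal \
    --pgid 2.490 --op list | grep eb5f22eb141f2
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-56 \
    --journal-path /var/lib/ceph/osd/ceph-56/journal \
    '<object-json-from-op-list>' get-attr snapset > snapset.bin
ceph-dencoder type SnapSet import snapset.bin decode dump_json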
Re: [ceph-users] Repair inconsistent pgs..
Sam, i try to understand which rbd contain this chunks.. but no luck. No rbd images block names started with this... Actually, now that I think about it, you probably didn't remove the images for 3fac9490/rbd_data.eb5f22eb141f2.04ba/snapdir//2 and 22ca30c4/rbd_data.e846e25a70bf7.0307/snapdir//2 2015-08-21 1:36 GMT+03:00 Samuel Just sj...@redhat.com: Actually, now that I think about it, you probably didn't remove the images for 3fac9490/rbd_data.eb5f22eb141f2.04ba/snapdir//2 and 22ca30c4/rbd_data.e846e25a70bf7.0307/snapdir//2, but other images (that's why the scrub errors went down briefly, those objects -- which were fine -- went away). You might want to export and reimport those two images into new images, but leave the old ones alone until you can clean up the on disk state (image and snapshots) and clear the scrub errors. You probably don't want to read the snapshots for those images either. Everything else is, I think, harmless. The ceph-objectstore-tool feature would probably not be too hard, actually. Each head/snapdir image has two attrs (possibly stored in leveldb -- that's why you want to modify the ceph-objectstore-tool and use its interfaces rather than mucking about with the files directly) '_' and 'snapset' which contain encoded representations of object_info_t and SnapSet (both can be found in src/osd/osd_types.h). SnapSet has a set of clones and related metadata -- you want to read the SnapSet attr off disk and commit a transaction writing out a new version with that clone removed. I'd start by cloning the repo, starting a vstart cluster locally, and reproducing the issue. Next, get familiar with using ceph-objectstore-tool on the osds in that vstart cluster. A good first change would be creating a ceph-objectstore-tool op that lets you dump json for the object_info_t and SnapSet (both types have format() methods which make that easy) on an object to stdout so you can confirm what's actually there. oftc #ceph-devel or the ceph-devel mailing list would be the right place to ask questions. Otherwise, it'll probably get done in the next few weeks. -Sam On Thu, Aug 20, 2015 at 3:10 PM, Voloshanenko Igor igor.voloshane...@gmail.com wrote: thank you Sam! I also noticed this linked errors during scrub... Now all lools like reasonable! So we will wait for bug to be closed. do you need any help on it? I mean i can help with coding/testing/etc... 2015-08-21 0:52 GMT+03:00 Samuel Just sj...@redhat.com: Ah, this is kind of silly. I think you don't have 37 errors, but 2 errors. pg 2.490 object 3fac9490/rbd_data.eb5f22eb141f2.04ba/snapdir//2 is missing snap 141. If you look at the objects after that in the log: 2015-08-20 20:15:44.865670 osd.19 10.12.2.6:6838/1861727 298 : cluster [ERR] repair 2.490 68c89490/rbd_data.16796a3d1b58ba.0047/head//2 expected clone 2d7b9490/rbd_data.18f92c3d1b58ba.6167/141//2 2015-08-20 20:15:44.865817 osd.19 10.12.2.6:6838/1861727 299 : cluster [ERR] repair 2.490 ded49490/rbd_data.11a25c7934d3d4.8a8a/head//2 expected clone 68c89490/rbd_data.16796a3d1b58ba.0047/141//2 The clone from the second line matches the head object from the previous line, and they have the same clone id. I *think* that the first error is real, and the subsequent ones are just scrub being dumb. Same deal with pg 2.c4. I just opened http://tracker.ceph.com/issues/12738. The original problem is that 3fac9490/rbd_data.eb5f22eb141f2.04ba/snapdir//2 and 22ca30c4/rbd_data.e846e25a70bf7.0307/snapdir//2 are both missing a clone. 
Not sure how that happened, my money is on a cache/tiering evict racing with a snap trim. If you have any logging or relevant information from when that happened, you should open a bug. The 'snapdir' in the two object names indicates that the head object has actually been deleted (which makes sense if you moved the image to a new image and deleted the old one) and is only being kept around since there are live snapshots. I suggest you leave the snapshots for those images alone for the time being -- removing them might cause the osd to crash trying to clean up the wierd on disk state. Other than the leaked space from those two image snapshots and the annoying spurious scrub errors, I think no actual corruption is going on though. I created a tracker ticket for a feature that would let ceph-objectstore-tool remove the spurious clone from the head/snapdir metadata. Am I right that you haven't actually seen any osd crashes or user visible corruption (except possibly on snapshots of those two images)? -Sam On Thu, Aug 20, 2015 at 10:07 AM, Voloshanenko Igor igor.voloshane...@gmail.com wrote: Inktank:
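One way to attempt the prefix-to-image mapping is via the block_name_prefix that rbd info prints; finding no match would be consistent with the head objects having been deleted, as described above. A sketch with a hypothetical pool name, and note that the ids shown in the scrub errors may be abbreviated in this archive:

pool=rbd                               # hypothetical pool name
prefix=rbd_data.eb5f22eb141f2          # taken from the object names in the scrub errors
for img in $(rbd -p "$pool" ls); do
    rbd -p "$pool" info "$img" | grep -q "block_name_prefix: ${prefix}" && echo "$img"
done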
Re: [ceph-users] Repair inconsistent pgs..
Voloshanenko Igor writes: Hi Irek, Please read careful ))) You proposal was the first, i try to do... That's why i asked about help... ( 2015-08-18 8:34 GMT+03:00 Irek Fasikhov malm...@gmail.com: Hi, Igor. You need to repair the PG. for i in `ceph pg dump| grep inconsistent | grep -v 'inconsistent+repair' | awk {'print$1'}`;do ceph pg repair $i;done С уважением, Фасихов Ирек Нургаязович Моб.: +79229045757 2015-08-18 8:27 GMT+03:00 Voloshanenko Igor igor.voloshane...@gmail.com: Hi all, at our production cluster, due high rebalancing ((( we have 2 pgs in inconsistent state... root@temp:~# ceph health detail | grep inc HEALTH_ERR 2 pgs inconsistent; 18 scrub errors pg 2.490 is active+clean+inconsistent, acting [56,15,29] pg 2.c4 is active+clean+inconsistent, acting [56,10,42] From OSD logs, after recovery attempt: root@test:~# ceph pg dump | grep -i incons | cut -f 1 | while read i; do ceph pg repair ${i} ; done dumped all in format plain instructing pg 2.490 on osd.56 to repair instructing pg 2.c4 on osd.56 to repair /var/log/ceph/ceph-osd.56.log:51:2015-08-18 07:26:37.035910 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 f5759490/rbd_data.1631755377d7e.04da/head//2 expected clone 90c59490/rbd_data.eb486436f2beb.7a65/141//2 /var/log/ceph/ceph-osd.56.log:52:2015-08-18 07:26:37.035960 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 fee49490/rbd_data.12483d3ba0794b.522f/head//2 expected clone f5759490/rbd_data.1631755377d7e.04da/141//2 /var/log/ceph/ceph-osd.56.log:53:2015-08-18 07:26:37.036133 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 a9b39490/rbd_data.12483d3ba0794b.37b3/head//2 expected clone fee49490/rbd_data.12483d3ba0794b.522f/141//2 /var/log/ceph/ceph-osd.56.log:54:2015-08-18 07:26:37.036243 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 bac19490/rbd_data.1238e82ae8944a.032e/head//2 expected clone a9b39490/rbd_data.12483d3ba0794b.37b3/141//2 /var/log/ceph/ceph-osd.56.log:55:2015-08-18 07:26:37.036289 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 98519490/rbd_data.123e9c2ae8944a.0807/head//2 expected clone bac19490/rbd_data.1238e82ae8944a.032e/141//2 /var/log/ceph/ceph-osd.56.log:56:2015-08-18 07:26:37.036314 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 c3c09490/rbd_data.1238e82ae8944a.0c2b/head//2 expected clone 98519490/rbd_data.123e9c2ae8944a.0807/141//2 /var/log/ceph/ceph-osd.56.log:57:2015-08-18 07:26:37.036363 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 28809490/rbd_data.edea7460fe42b.01d9/head//2 expected clone c3c09490/rbd_data.1238e82ae8944a.0c2b/141//2 /var/log/ceph/ceph-osd.56.log:58:2015-08-18 07:26:37.036432 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 e1509490/rbd_data.1423897545e146.09a6/head//2 expected clone 28809490/rbd_data.edea7460fe42b.01d9/141//2 /var/log/ceph/ceph-osd.56.log:59:2015-08-18 07:26:38.548765 7f94663b3700 -1 log_channel(cluster) log [ERR] : 2.490 deep-scrub 17 errors So, how i can solve expected clone situation by hand? Thank in advance! I've had an inconsistent pg once, but it was a different sort of an error (some sort of digest mismatch, where the secondary object copies had later timestamps). This was fixed by moving the object away and restarting, the osd which got fixed when the osd peered, similar to what was mentioned in Sebastian Han's blog[1]. 
I'm guessing the same method will solve this error as well, but I'm not completely sure; maybe someone else who has seen this particular error can guide you better. [1]: http://www.sebastien-han.fr/blog/2015/04/27/ceph-manually-repair-object/ -- Abhishek
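For completeness, the procedure from that blog post is roughly the following; the object name and paths are illustrative, the init commands differ by distro (Hammer on Ubuntu used upstart, so stop/start ceph-osd id=56 rather than systemctl), and it targets the case where a single replica is bad:

stop ceph-osd id=56                    # or: systemctl stop ceph-osd@56
find /var/lib/ceph/osd/ceph-56/current/2.490_head/ -name '*1631755377d7e*' -ls
mv '/var/lib/ceph/osd/ceph-56/current/2.490_head/<offending-file>' /root/backup/
start ceph-osd id=56                   # or: systemctl start ceph-osd@56
ceph pg repair 2.490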
Re: [ceph-users] Repair inconsistent pgs..
No, that will not help ((( I tried to find the data, but it looks like it either exists with the same timestamp on all OSDs or is missing on all of them... So I need advice on what to do next. On Tuesday, 18 August 2015, Abhishek L wrote: I've had an inconsistent pg once, but it was a different sort of an error (some sort of digest mismatch, where the secondary object copies had later timestamps). This was fixed by moving the object away and restarting the OSD, which got fixed when the OSD peered, similar to what was mentioned in Sebastian Han's blog[1]. I'm guessing the same method will solve this error as well, but I'm not completely sure; maybe someone else who has seen this particular error can guide you better. [1]: http://www.sebastien-han.fr/blog/2015/04/27/ceph-manually-repair-object/ -- Abhishek
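One way to check that observation across the acting set of pg 2.490 (osd.56, osd.15, osd.29) is to compare the filestore copies directly. A rough sketch, assuming default paths and GNU find; matching on the hex image id alone sidesteps the underscore escaping that filestore applies to object file names:

# run on each node that holds a replica of pg 2.490
find /var/lib/ceph/osd/ceph-*/current/2.490_head/ -name '*1631755377d7e*' \
    -printf '%p %TY-%Tm-%Td %TH:%TM\n' -exec md5sum {} \;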
Re: [ceph-users] Repair inconsistent pgs..
From a quick peek it looks like some of the OSDs are missing clones of objects. I'm not sure how that could happen and I'd expect the pg repair to handle that but if it's not there's probably something wrong; what version of Ceph are you running? Sam, is this something you've seen, a new bug, or some kind of config issue? -Greg On Tue, Aug 18, 2015 at 6:27 AM, Voloshanenko Igor igor.voloshane...@gmail.com wrote: Hi all, at our production cluster, due high rebalancing ((( we have 2 pgs in inconsistent state... root@temp:~# ceph health detail | grep inc HEALTH_ERR 2 pgs inconsistent; 18 scrub errors pg 2.490 is active+clean+inconsistent, acting [56,15,29] pg 2.c4 is active+clean+inconsistent, acting [56,10,42] From OSD logs, after recovery attempt: root@test:~# ceph pg dump | grep -i incons | cut -f 1 | while read i; do ceph pg repair ${i} ; done dumped all in format plain instructing pg 2.490 on osd.56 to repair instructing pg 2.c4 on osd.56 to repair /var/log/ceph/ceph-osd.56.log:51:2015-08-18 07:26:37.035910 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 f5759490/rbd_data.1631755377d7e.04da/head//2 expected clone 90c59490/rbd_data.eb486436f2beb.7a65/141//2 /var/log/ceph/ceph-osd.56.log:52:2015-08-18 07:26:37.035960 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 fee49490/rbd_data.12483d3ba0794b.522f/head//2 expected clone f5759490/rbd_data.1631755377d7e.04da/141//2 /var/log/ceph/ceph-osd.56.log:53:2015-08-18 07:26:37.036133 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 a9b39490/rbd_data.12483d3ba0794b.37b3/head//2 expected clone fee49490/rbd_data.12483d3ba0794b.522f/141//2 /var/log/ceph/ceph-osd.56.log:54:2015-08-18 07:26:37.036243 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 bac19490/rbd_data.1238e82ae8944a.032e/head//2 expected clone a9b39490/rbd_data.12483d3ba0794b.37b3/141//2 /var/log/ceph/ceph-osd.56.log:55:2015-08-18 07:26:37.036289 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 98519490/rbd_data.123e9c2ae8944a.0807/head//2 expected clone bac19490/rbd_data.1238e82ae8944a.032e/141//2 /var/log/ceph/ceph-osd.56.log:56:2015-08-18 07:26:37.036314 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 c3c09490/rbd_data.1238e82ae8944a.0c2b/head//2 expected clone 98519490/rbd_data.123e9c2ae8944a.0807/141//2 /var/log/ceph/ceph-osd.56.log:57:2015-08-18 07:26:37.036363 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 28809490/rbd_data.edea7460fe42b.01d9/head//2 expected clone c3c09490/rbd_data.1238e82ae8944a.0c2b/141//2 /var/log/ceph/ceph-osd.56.log:58:2015-08-18 07:26:37.036432 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 e1509490/rbd_data.1423897545e146.09a6/head//2 expected clone 28809490/rbd_data.edea7460fe42b.01d9/141//2 /var/log/ceph/ceph-osd.56.log:59:2015-08-18 07:26:38.548765 7f94663b3700 -1 log_channel(cluster) log [ERR] : 2.490 deep-scrub 17 errors So, how i can solve expected clone situation by hand? Thank in advance! ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
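As a starting point for untangling the expected clone errors by hand, it can help to see which clones RADOS currently records for one of the affected objects. A sketch with a hypothetical pool name; the object is picked by the image prefix that appears in the log excerpt above (rados ls over a large pool is slow, so this is only for spot checks):

pool=rbd                                                # hypothetical pool name
obj=$(rados -p "$pool" ls | grep rbd_data.1631755377d7e | head -1)
rados -p "$pool" listsnaps "$obj"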
Re: [ceph-users] Repair inconsistent pgs..
Also, what command are you using to take snapshots? -Sam On Tue, Aug 18, 2015 at 8:48 AM, Samuel Just sj...@redhat.com wrote: Is the number of inconsistent objects growing? Can you attach the whole ceph.log from the 6 hours before and after the snippet you linked above? Are you using cache/tiering? Can you attach the osdmap (ceph osd getmap -o outfile)? -Sam On Tue, Aug 18, 2015 at 4:15 AM, Voloshanenko Igor igor.voloshane...@gmail.com wrote: ceph - 0.94.2 Its happen during rebalancing I thought too, that some OSD miss copy, but looks like all miss... So any advice in which direction i need to go 2015-08-18 14:14 GMT+03:00 Gregory Farnum gfar...@redhat.com: From a quick peek it looks like some of the OSDs are missing clones of objects. I'm not sure how that could happen and I'd expect the pg repair to handle that but if it's not there's probably something wrong; what version of Ceph are you running? Sam, is this something you've seen, a new bug, or some kind of config issue? -Greg On Tue, Aug 18, 2015 at 6:27 AM, Voloshanenko Igor igor.voloshane...@gmail.com wrote: Hi all, at our production cluster, due high rebalancing ((( we have 2 pgs in inconsistent state... root@temp:~# ceph health detail | grep inc HEALTH_ERR 2 pgs inconsistent; 18 scrub errors pg 2.490 is active+clean+inconsistent, acting [56,15,29] pg 2.c4 is active+clean+inconsistent, acting [56,10,42] From OSD logs, after recovery attempt: root@test:~# ceph pg dump | grep -i incons | cut -f 1 | while read i; do ceph pg repair ${i} ; done dumped all in format plain instructing pg 2.490 on osd.56 to repair instructing pg 2.c4 on osd.56 to repair /var/log/ceph/ceph-osd.56.log:51:2015-08-18 07:26:37.035910 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 f5759490/rbd_data.1631755377d7e.04da/head//2 expected clone 90c59490/rbd_data.eb486436f2beb.7a65/141//2 /var/log/ceph/ceph-osd.56.log:52:2015-08-18 07:26:37.035960 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 fee49490/rbd_data.12483d3ba0794b.522f/head//2 expected clone f5759490/rbd_data.1631755377d7e.04da/141//2 /var/log/ceph/ceph-osd.56.log:53:2015-08-18 07:26:37.036133 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 a9b39490/rbd_data.12483d3ba0794b.37b3/head//2 expected clone fee49490/rbd_data.12483d3ba0794b.522f/141//2 /var/log/ceph/ceph-osd.56.log:54:2015-08-18 07:26:37.036243 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 bac19490/rbd_data.1238e82ae8944a.032e/head//2 expected clone a9b39490/rbd_data.12483d3ba0794b.37b3/141//2 /var/log/ceph/ceph-osd.56.log:55:2015-08-18 07:26:37.036289 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 98519490/rbd_data.123e9c2ae8944a.0807/head//2 expected clone bac19490/rbd_data.1238e82ae8944a.032e/141//2 /var/log/ceph/ceph-osd.56.log:56:2015-08-18 07:26:37.036314 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 c3c09490/rbd_data.1238e82ae8944a.0c2b/head//2 expected clone 98519490/rbd_data.123e9c2ae8944a.0807/141//2 /var/log/ceph/ceph-osd.56.log:57:2015-08-18 07:26:37.036363 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 28809490/rbd_data.edea7460fe42b.01d9/head//2 expected clone c3c09490/rbd_data.1238e82ae8944a.0c2b/141//2 /var/log/ceph/ceph-osd.56.log:58:2015-08-18 07:26:37.036432 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 e1509490/rbd_data.1423897545e146.09a6/head//2 expected clone 28809490/rbd_data.edea7460fe42b.01d9/141//2 /var/log/ceph/ceph-osd.56.log:59:2015-08-18 07:26:38.548765 7f94663b3700 
-1 log_channel(cluster) log [ERR] : 2.490 deep-scrub 17 errors So, how i can solve expected clone situation by hand? Thank in advance! ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
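For reference, snapshots of RBD images are normally taken with the rbd snap commands below; whether this cluster drove them directly or through a cloud platform is not stated in the thread, so the pool, image and snapshot names here are placeholders:

rbd snap create rbd/vm-disk-01@snap1
rbd snap ls rbd/vm-disk-01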
Re: [ceph-users] Repair inconsistent pgs..
Hi, Igor. You need to repair the PG. for i in `ceph pg dump| grep inconsistent | grep -v 'inconsistent+repair' | awk {'print$1'}`;do ceph pg repair $i;done С уважением, Фасихов Ирек Нургаязович Моб.: +79229045757 2015-08-18 8:27 GMT+03:00 Voloshanenko Igor igor.voloshane...@gmail.com: Hi all, at our production cluster, due high rebalancing ((( we have 2 pgs in inconsistent state... root@temp:~# ceph health detail | grep inc HEALTH_ERR 2 pgs inconsistent; 18 scrub errors pg 2.490 is active+clean+inconsistent, acting [56,15,29] pg 2.c4 is active+clean+inconsistent, acting [56,10,42] From OSD logs, after recovery attempt: root@test:~# ceph pg dump | grep -i incons | cut -f 1 | while read i; do ceph pg repair ${i} ; done dumped all in format plain instructing pg 2.490 on osd.56 to repair instructing pg 2.c4 on osd.56 to repair /var/log/ceph/ceph-osd.56.log:51:2015-08-18 07:26:37.035910 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 f5759490/rbd_data.1631755377d7e.04da/head//2 expected clone 90c59490/rbd_data.eb486436f2beb.7a65/141//2 /var/log/ceph/ceph-osd.56.log:52:2015-08-18 07:26:37.035960 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 fee49490/rbd_data.12483d3ba0794b.522f/head//2 expected clone f5759490/rbd_data.1631755377d7e.04da/141//2 /var/log/ceph/ceph-osd.56.log:53:2015-08-18 07:26:37.036133 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 a9b39490/rbd_data.12483d3ba0794b.37b3/head//2 expected clone fee49490/rbd_data.12483d3ba0794b.522f/141//2 /var/log/ceph/ceph-osd.56.log:54:2015-08-18 07:26:37.036243 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 bac19490/rbd_data.1238e82ae8944a.032e/head//2 expected clone a9b39490/rbd_data.12483d3ba0794b.37b3/141//2 /var/log/ceph/ceph-osd.56.log:55:2015-08-18 07:26:37.036289 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 98519490/rbd_data.123e9c2ae8944a.0807/head//2 expected clone bac19490/rbd_data.1238e82ae8944a.032e/141//2 /var/log/ceph/ceph-osd.56.log:56:2015-08-18 07:26:37.036314 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 c3c09490/rbd_data.1238e82ae8944a.0c2b/head//2 expected clone 98519490/rbd_data.123e9c2ae8944a.0807/141//2 /var/log/ceph/ceph-osd.56.log:57:2015-08-18 07:26:37.036363 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 28809490/rbd_data.edea7460fe42b.01d9/head//2 expected clone c3c09490/rbd_data.1238e82ae8944a.0c2b/141//2 /var/log/ceph/ceph-osd.56.log:58:2015-08-18 07:26:37.036432 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 e1509490/rbd_data.1423897545e146.09a6/head//2 expected clone 28809490/rbd_data.edea7460fe42b.01d9/141//2 /var/log/ceph/ceph-osd.56.log:59:2015-08-18 07:26:38.548765 7f94663b3700 -1 log_channel(cluster) log [ERR] : 2.490 deep-scrub 17 errors So, how i can solve expected clone situation by hand? Thank in advance! ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Repair inconsistent pgs..
Hi Irek, Please read careful ))) You proposal was the first, i try to do... That's why i asked about help... ( 2015-08-18 8:34 GMT+03:00 Irek Fasikhov malm...@gmail.com: Hi, Igor. You need to repair the PG. for i in `ceph pg dump| grep inconsistent | grep -v 'inconsistent+repair' | awk {'print$1'}`;do ceph pg repair $i;done С уважением, Фасихов Ирек Нургаязович Моб.: +79229045757 2015-08-18 8:27 GMT+03:00 Voloshanenko Igor igor.voloshane...@gmail.com: Hi all, at our production cluster, due high rebalancing ((( we have 2 pgs in inconsistent state... root@temp:~# ceph health detail | grep inc HEALTH_ERR 2 pgs inconsistent; 18 scrub errors pg 2.490 is active+clean+inconsistent, acting [56,15,29] pg 2.c4 is active+clean+inconsistent, acting [56,10,42] From OSD logs, after recovery attempt: root@test:~# ceph pg dump | grep -i incons | cut -f 1 | while read i; do ceph pg repair ${i} ; done dumped all in format plain instructing pg 2.490 on osd.56 to repair instructing pg 2.c4 on osd.56 to repair /var/log/ceph/ceph-osd.56.log:51:2015-08-18 07:26:37.035910 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 f5759490/rbd_data.1631755377d7e.04da/head//2 expected clone 90c59490/rbd_data.eb486436f2beb.7a65/141//2 /var/log/ceph/ceph-osd.56.log:52:2015-08-18 07:26:37.035960 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 fee49490/rbd_data.12483d3ba0794b.522f/head//2 expected clone f5759490/rbd_data.1631755377d7e.04da/141//2 /var/log/ceph/ceph-osd.56.log:53:2015-08-18 07:26:37.036133 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 a9b39490/rbd_data.12483d3ba0794b.37b3/head//2 expected clone fee49490/rbd_data.12483d3ba0794b.522f/141//2 /var/log/ceph/ceph-osd.56.log:54:2015-08-18 07:26:37.036243 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 bac19490/rbd_data.1238e82ae8944a.032e/head//2 expected clone a9b39490/rbd_data.12483d3ba0794b.37b3/141//2 /var/log/ceph/ceph-osd.56.log:55:2015-08-18 07:26:37.036289 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 98519490/rbd_data.123e9c2ae8944a.0807/head//2 expected clone bac19490/rbd_data.1238e82ae8944a.032e/141//2 /var/log/ceph/ceph-osd.56.log:56:2015-08-18 07:26:37.036314 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 c3c09490/rbd_data.1238e82ae8944a.0c2b/head//2 expected clone 98519490/rbd_data.123e9c2ae8944a.0807/141//2 /var/log/ceph/ceph-osd.56.log:57:2015-08-18 07:26:37.036363 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 28809490/rbd_data.edea7460fe42b.01d9/head//2 expected clone c3c09490/rbd_data.1238e82ae8944a.0c2b/141//2 /var/log/ceph/ceph-osd.56.log:58:2015-08-18 07:26:37.036432 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 e1509490/rbd_data.1423897545e146.09a6/head//2 expected clone 28809490/rbd_data.edea7460fe42b.01d9/141//2 /var/log/ceph/ceph-osd.56.log:59:2015-08-18 07:26:38.548765 7f94663b3700 -1 log_channel(cluster) log [ERR] : 2.490 deep-scrub 17 errors So, how i can solve expected clone situation by hand? Thank in advance! ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com