Re: [ceph-users] Upgrade from Giant to Hammer and after some basic operations most of the OSD's went down
Hi

Thanks Sage, I got it working now. Everything else seems to be ok, except mds is reporting "mds cluster is degraded"; not sure what could be wrong. Mds is running, all osds are up, and pg's are active+clean and active+clean+replay. I had to delete some empty pools which were created while the osd's were not working, and recovery started to go through.

Seems mds is not that stable; this isn't the first time it has gone degraded. Before, it just started working again by itself, but now I just can't get it back.

Thanks
Br,
Tuomas

-----Original Message-----
From: tuomas.juntu...@databasement.fi [mailto:tuomas.juntu...@databasement.fi]
Sent: 1 May 2015 21:14
To: Sage Weil
Cc: tuomas.juntunen; ceph-users@lists.ceph.com; ceph-de...@vger.kernel.org
Subject: Re: [ceph-users] Upgrade from Giant to Hammer and after some basic operations most of the OSD's went down

Thanks, I'll do this when the commit is available and report back. And indeed, I'll change to the official ones after everything is ok.

Br,
Tuomas

> On Fri, 1 May 2015, tuomas.juntu...@databasement.fi wrote:
>> Hi
>>
>> I deleted the images and img pools and started osd's, they still die.
>>
>> Here's a log of one of the osd's after this, if you need it.
>>
>> http://beta.xaasbox.com/ceph/ceph-osd.19.log
>
> I've pushed another commit that should avoid this case, sha1
> 425bd4e1dba00cc2243b0c27232d1f9740b04e34.
>
> Note that once the pools are fully deleted (shouldn't take too long
> once the osds are up and stabilize) you should switch back to the
> normal packages that don't have these workarounds.
>
> sage
>
>> Br,
>> Tuomas
>>
>> > Thanks man. I'll try it tomorrow. Have a good one.
>> >
>> > Br, T
>> >
>> > -------- Original message --------
>> > From: Sage Weil
>> > Date: 30/04/2015 18:23 (GMT+02:00)
>> > To: Tuomas Juntunen
>> > Cc: ceph-users@lists.ceph.com, ceph-de...@vger.kernel.org
>> > Subject: RE: [ceph-users] Upgrade from Giant to Hammer and after
>> > some basic operations most of the OSD's went down
>> >
>> > On Thu, 30 Apr 2015, tuomas.juntu...@databasement.fi wrote:
>> >> Hey
>> >>
>> >> Yes I can drop the images data, you think this will fix it?
>> >
>> > It's a slightly different assert that (I believe) should not
>> > trigger once the pool is deleted. Please give that a try and if
>> > you still hit it I'll whip up a workaround.
>> >
>> > Thanks!
>> > sage
>> >
>> >> Br,
>> >>
>> >> Tuomas
>> >>
>> >> > On Wed, 29 Apr 2015, Tuomas Juntunen wrote:
>> >> >> Hi
>> >> >>
>> >> >> I updated that version and it seems that something did happen;
>> >> >> the osd's stayed up for a while and 'ceph status' got updated.
>> >> >> But then in a couple of minutes, they all went down the same way.
>> >> >>
>> >> >> I have attached a new 'ceph osd dump -f json-pretty' and got a
>> >> >> new log from one of the osd's with osd debug = 20:
>> >> >> http://beta.xaasbox.com/ceph/ceph-osd.15.log
>> >> >
>> >> > Sam mentioned that you had said earlier that this was not critical data?
>> >> > If not, I think the simplest thing is to just drop those pools.
>> >> > The important thing (from my perspective at least :) is that we
>> >> > understand the root cause and can prevent this in the future.
>> >> >
>> >> > sage
>> >> >
>> >> >> Thank you!
>> >> >>
>> >> >> Br,
>> >> >> Tuomas
>> >> >>
>> >> >> -----Original Message-----
>> >> >> From: Sage Weil [mailto:s...@newdream.net]
>> >> >> Sent: 28 April 2015 23:57
>> >> >> To: Tuomas Juntunen
>> >> >> Cc: ceph-users@lists.ceph.com; ceph-de...@vger.kernel.org
>> >> >> Subject: Re: [ceph-users] Upgrade from Giant to Hammer and
>> >> >> after some basic operations most of the OSD's went down
>> >> >>
>> >> >> Hi Tuomas,
>> >> >>
>> >> >> I've pushed an updated wip-hammer-snaps branch. Can you please try it?
>> >> >> The build will appear here
>> >> >>
>> >> >> http://gitbuilder.ceph.com/ceph-deb-trusty-x86_64-basic/sha1/08bf531331afd5e2eb514067f72afda11bcde286
>> >> >>
>> >> >> (or a similar url; adjust for your distro).
>> >> >>
>> >> >> Thanks!
>> >> >> sage
>> >> >>
>> >> >> On Tue, 28 Apr 2015, Sage Weil wrote:
>> >> >>
>> >> >> > [adding ceph-devel]
>> >> >> >
>> >> >> > Okay, I see the problem. This seems to be unrelated to the
>> >> >> > giant -> hammer move... it's a result of the tiering changes you made:
>> >> >> >
>> >> >> > > > > > > > The following:
>> >> >> > > > > > > >
>> >> >> > > > > > > > ceph osd tier add img images --force-nonempty
>> >> >> > > > > > > > ceph osd tier cache-mode images forward
>> >> >> > > > > > > > ceph osd tier set-overlay img images
>> >> >> >
>> >> >> > Specifically, --force-nonempty bypassed important safety checks.
>> >> >> >
>> >> >> > 1. images had snapshots (and removed_snaps)
>> >> >> >
>> >> >> > 2. images was added as a tier *of* img, and img's
>> >> >> > removed_snaps was copied to
[ceph-users] How to add a slave to rgw
Hi, geeks:

I have a ceph cluster for rgw service in production, which was set up according to the simple configuration tutorial, with only one default region and one default zone. Even worse, I didn't enable either the meta logging or the data logging. Now I want to add a slave zone to the rgw for disaster recovery. How can I do this while affecting the production service as little as possible?

Thank you for your help.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
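For reference, the Hammer-era answer to the question above is the federated configuration built around radosgw-admin's region/zone commands. A heavily hedged sketch of the master-side steps follows; the zone names, the `sync-agent` user, and the exact JSON fields to edit are placeholders, and the federated configuration guide should be treated as the authoritative reference:

```shell
# Dump the implicit default region and zone so they can be named and edited.
radosgw-admin region get > region.json
radosgw-admin zone get > zone-master.json

# Edit region.json by hand: give the region a name, list both the master
# and the planned slave zone, and enable "log_meta"/"log_data" for the
# master zone so the sync agent has logs to replay.
radosgw-admin region set < region.json
radosgw-admin regionmap update

# Create a system user that the sync agent (radosgw-agent) will use to
# authenticate against both zones.
radosgw-admin user create --uid=sync-agent --display-name="Sync Agent" --system
```

The slave side then mirrors the zone setup on the second cluster, after which radosgw-agent replicates metadata and data from master to slave.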
Re: [ceph-users] Btrfs defragmentation
On 05/04/15 01:34, Sage Weil wrote:
> On Mon, 4 May 2015, Lionel Bouton wrote:
>> Hi, we began testing one Btrfs OSD volume last week and for this
>> first test we disabled autodefrag and began to launch manual btrfs fi
>> defrag. During the tests, I monitored the number of extents of the
>> journal (10GB) and it went through the roof (it currently sits at
>> 8000+ extents for example). I was tempted to defragment it but after
>> thinking a bit about it I think it might not be a good idea. With
>> Btrfs, by default the data written to the journal on disk isn't
>> copied to its final destination. Ceph is using a clone_range feature
>> to reference the same data instead of copying it.
> We've discussed this possibility but have never implemented it. The
> data is written twice: once to the journal and once to the object file.

That's odd. Here's an extract of filefrag output:

Filesystem type is: 9123683e
File size of /var/lib/ceph/osd/ceph-17/journal is 1048576 (256 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..       0:  155073097.. 155073097:      1:
   1:        1..    1254:  155068587.. 155069840:   1254:  155073098: shared
   2:     1255..    2296:  155071149.. 155072190:   1042:  155069841: shared
   3:     2297..    2344:  148124256.. 148124303:     48:  155072191: shared
   4:     2345..    4396:  148129654.. 148131705:   2052:  148124304: shared
   5:     4397..    6446:  148137117.. 148139166:   2050:  148131706: shared
   6:     6447..    6451:  150414237.. 150414241:      5:  148139167: shared
   7:     6452..   10552:  150432040.. 150436140:   4101:  150414242: shared
   8:    10553..   12603:  150477824.. 150479874:   2051:  150436141: shared

Almost all extents of the journal are shared with another file (on one occasion I've found 3 consecutive extents without the shared flag). I've thought that it could be shared by a copy in a snapshot, but the snapshots are of the "current" subvolume.

Lionel
Re: [ceph-users] Btrfs defragmentation
On Mon, 4 May 2015, Lionel Bouton wrote:
> Hi,
>
> we began testing one Btrfs OSD volume last week and for this first test
> we disabled autodefrag and began to launch manual btrfs fi defrag.
>
> During the tests, I monitored the number of extents of the journal
> (10GB) and it went through the roof (it currently sits at 8000+ extents
> for example).
> I was tempted to defragment it but after thinking a bit about it I think
> it might not be a good idea.
> With Btrfs, by default the data written to the journal on disk isn't
> copied to its final destination. Ceph is using a clone_range feature to
> reference the same data instead of copying it.

We've discussed this possibility but have never implemented it. The data is written twice: once to the journal and once to the object file.

> So if you defragment both the journal and the final destination, you are
> moving the data around to attempt to get both references to satisfy a
> one extent goal but most of the time can't get both of them at the same
> time (unless the destination is a whole file instead of a fragment of one).
>
> I assume the journal probably doesn't benefit at all from
> defragmentation: it's overwritten constantly and as Btrfs uses CoW, the
> previous extents won't be reused at all and new ones will be created for
> the new data instead of overwriting the old in place. The final
> destination files are reused (reread) and benefit from defragmentation.

Yeah, I agree. It is probably best to let btrfs write the journal anywhere since it is never read (except for replay after a failure or restart).

There is also a newish 'journal discard' option that is false by default; enabling this may let us throw out the previously allocated space so that the new writes get written to fresh locations (instead of to the previously written and fragmented positions). I expect this will make a positive difference, but I'm not sure that anyone has tested it.
> Under these assumptions we excluded the journal file from
> defragmentation; in fact we only defragment the "current" directory
> (snapshot directories are probably only read from in rare cases and are
> ephemeral, so optimizing them is not interesting).
>
> The filesystem is only one week old so we will have to wait a bit to see
> if this strategy is better than the one used when mounting with
> autodefrag (I couldn't find much about it but last year we had
> unmanageable latencies).

Cool.. let us know how things look after it ages!

sage

> We have a small Ruby script which triggers defragmentation based on the
> number of extents and by default limits the rate of calls to btrfs fi
> defrag to a negligible level to avoid trashing the filesystem. If
> someone is interested I can attach it or push it on Github after a bit
> of cleanup.
[ceph-users] Btrfs defragmentation
Hi,

we began testing one Btrfs OSD volume last week and for this first test we disabled autodefrag and began to launch manual btrfs fi defrag.

During the tests, I monitored the number of extents of the journal (10GB) and it went through the roof (it currently sits at 8000+ extents for example). I was tempted to defragment it, but after thinking a bit about it I think it might not be a good idea. With Btrfs, by default the data written to the journal on disk isn't copied to its final destination. Ceph is using a clone_range feature to reference the same data instead of copying it.

So if you defragment both the journal and the final destination, you are moving the data around to attempt to get both references to satisfy a one-extent goal, but most of the time you can't get both of them at the same time (unless the destination is a whole file instead of a fragment of one).

I assume the journal probably doesn't benefit at all from defragmentation: it's overwritten constantly, and as Btrfs uses CoW, the previous extents won't be reused at all and new ones will be created for the new data instead of overwriting the old in place. The final destination files are reused (reread) and benefit from defragmentation.

Under these assumptions we excluded the journal file from defragmentation; in fact we only defragment the "current" directory (snapshot directories are probably only read from in rare cases and are ephemeral, so optimizing them is not interesting).

The filesystem is only one week old so we will have to wait a bit to see if this strategy is better than the one used when mounting with autodefrag (I couldn't find much about it, but last year we had unmanageable latencies).

We have a small Ruby script which triggers defragmentation based on the number of extents and by default limits the rate of calls to btrfs fi defrag to a negligible level to avoid trashing the filesystem. If someone is interested I can attach it or push it on Github after a bit of cleanup.
Best regards,
Lionel
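The extent-count-triggered approach Lionel describes can be sketched in Python. This is a hypothetical re-implementation, not the Ruby script mentioned above; the 50-extent threshold and the pause between defrag calls are assumptions standing in for whatever tuning the real script uses:

```python
import re
import subprocess
import time

def parse_extent_count(filefrag_output):
    """Extract N from filefrag's summary line of the form 'path: N extents found'."""
    m = re.search(r":\s*(\d+) extents? found", filefrag_output)
    return int(m.group(1)) if m else 0

def defrag_if_fragmented(path, threshold=50, pause=2.0):
    """Defragment a file only when it exceeds `threshold` extents,
    sleeping between calls so the filesystem isn't thrashed."""
    out = subprocess.run(["filefrag", path], capture_output=True,
                         text=True, check=True).stdout
    if parse_extent_count(out) > threshold:
        subprocess.run(["btrfs", "filesystem", "defragment", path], check=True)
        time.sleep(pause)  # crude rate limit, mirroring the rate cap described above
```

A wrapper would walk the OSD's "current" directory and call `defrag_if_fragmented` on each file, skipping the journal per the reasoning in the thread.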
Re: [ceph-users] Help with CEPH deployment
On 04/05/15 05:42, Venkateswara Rao Jujjuri wrote:

Here is the output. I am still stuck at this step. :( (tried multiple times by purging and restarting from scratch)

vjujjuri@rgulistan-wsl10:~/ceph-cluster$ ceph-deploy mon create-initial
[ceph_deploy.conf][DEBUG ] found configuration file at: /home/vjujjuri/.cephdeploy.conf
[ceph_deploy.cli][INFO ] Invoked (1.5.23): /usr/bin/ceph-deploy mon create-initial
[ceph_deploy.mon][DEBUG ] Deploying mon, cluster ceph hosts rgulistan-wsl11
[ceph_deploy.mon][DEBUG ] detecting platform for host rgulistan-wsl11 ...
[rgulistan-wsl11][DEBUG ] connection detected need for sudo
[rgulistan-wsl11][DEBUG ] connected to host: rgulistan-wsl11
[rgulistan-wsl11][DEBUG ] detect platform information from remote host
[rgulistan-wsl11][DEBUG ] detect machine type
[ceph_deploy.mon][INFO ] distro info: Ubuntu 12.04 precise
[rgulistan-wsl11][DEBUG ] determining if provided host has same hostname in remote
[rgulistan-wsl11][DEBUG ] get remote short hostname
[rgulistan-wsl11][DEBUG ] deploying mon to rgulistan-wsl11
[rgulistan-wsl11][DEBUG ] get remote short hostname
[rgulistan-wsl11][DEBUG ] remote hostname: rgulistan-wsl11
[rgulistan-wsl11][DEBUG ] write cluster configuration to /etc/ceph/{cluster}.conf
[rgulistan-wsl11][DEBUG ] create the mon path if it does not exist
[rgulistan-wsl11][DEBUG ] checking for done path: /var/lib/ceph/mon/ceph-rgulistan-wsl11/done
[rgulistan-wsl11][DEBUG ] create a done file to avoid re-doing the mon deployment
[rgulistan-wsl11][DEBUG ] create the init path if it does not exist
[rgulistan-wsl11][DEBUG ] locating the `service` executable...
[rgulistan-wsl11][INFO ] Running command: sudo initctl emit ceph-mon cluster=ceph id=rgulistan-wsl11
[rgulistan-wsl11][INFO ] Running command: sudo ceph --cluster=ceph --admin-daemon /var/run/ceph/ceph-mon.rgulistan-wsl11.asok mon_status
[rgulistan-wsl11][DEBUG ]
[rgulistan-wsl11][DEBUG ] status for monitor: mon.rgulistan-wsl11
[rgulistan-wsl11][DEBUG ] {
[rgulistan-wsl11][DEBUG ]   "election_epoch": 1,
[rgulistan-wsl11][DEBUG ]   "extra_probe_peers": [],
[rgulistan-wsl11][DEBUG ]   "monmap": {
[rgulistan-wsl11][DEBUG ]     "created": "2015-05-02 10:52:17.318500",
[rgulistan-wsl11][DEBUG ]     "epoch": 1,
[rgulistan-wsl11][DEBUG ]     "fsid": "64e48bd5-f174-44a4-a485-7df3adbdad3d",
[rgulistan-wsl11][DEBUG ]     "modified": "2015-05-02 10:52:17.318500",
[rgulistan-wsl11][DEBUG ]     "mons": [
[rgulistan-wsl11][DEBUG ]       {
[rgulistan-wsl11][DEBUG ]         "addr": "xx.xx.xx.xx:6789/0",
[rgulistan-wsl11][DEBUG ]         "name": "rgulistan-wsl11",
[rgulistan-wsl11][DEBUG ]         "rank": 0
[rgulistan-wsl11][DEBUG ]       }
[rgulistan-wsl11][DEBUG ]     ]
[rgulistan-wsl11][DEBUG ]   },
[rgulistan-wsl11][DEBUG ]   "name": "rgulistan-wsl11",
[rgulistan-wsl11][DEBUG ]   "outside_quorum": [],
[rgulistan-wsl11][DEBUG ]   "quorum": [
[rgulistan-wsl11][DEBUG ]     0
[rgulistan-wsl11][DEBUG ]   ],
[rgulistan-wsl11][DEBUG ]   "rank": 0,
[rgulistan-wsl11][DEBUG ]   "state": "leader",
[rgulistan-wsl11][DEBUG ]   "sync_provider": []
[rgulistan-wsl11][DEBUG ] }
[rgulistan-wsl11][DEBUG ]
[rgulistan-wsl11][INFO ] monitor: mon.rgulistan-wsl11 is running
[rgulistan-wsl11][INFO ] Running command: sudo ceph --cluster=ceph --admin-daemon /var/run/ceph/ceph-mon.rgulistan-wsl11.asok mon_status
[ceph_deploy.mon][INFO ] processing monitor mon.rgulistan-wsl11
[rgulistan-wsl11][DEBUG ] connection detected need for sudo
[rgulistan-wsl11][DEBUG ] connected to host: rgulistan-wsl11
[rgulistan-wsl11][INFO ] Running command: sudo ceph --cluster=ceph --admin-daemon /var/run/ceph/ceph-mon.rgulistan-wsl11.asok mon_status
[ceph_deploy.mon][INFO ] mon.rgulistan-wsl11 monitor has reached quorum!
[ceph_deploy.mon][INFO ] all initial monitors are running and have formed quorum
[ceph_deploy.mon][INFO ] Running gatherkeys...
[ceph_deploy.gatherkeys][DEBUG ] Checking rgulistan-wsl11 for /etc/ceph/ceph.client.admin.keyring
[rgulistan-wsl11][DEBUG ] connection detected need for sudo
[rgulistan-wsl11][DEBUG ] connected to host: rgulistan-wsl11
[rgulistan-wsl11][DEBUG ] detect platform information from remote host
[rgulistan-wsl11][DEBUG ] detect machine type
[rgulistan-wsl11][DEBUG ] fetch remote file
[ceph_deploy.gatherkeys][WARNIN] Unable to find /etc/ceph/ceph.client.admin.keyring on rgulistan-wsl11
[ceph_deploy][ERROR ] KeyNotFoundError: Could not find keyring file: /etc/ceph/ceph.client.admin.keyring on host rgulistan-wsl11

Hmmm, so this is Ubuntu 12.04, which should work ok. It looks like the upstart command to start the monitor is working, which *should* kick off the key creation (see /etc/init/ceph-create-keys.conf). I'd guess that ceph-create-keys is hanging or failing - do you see the process running? If not, have a look in /var/log/ceph on the mon host to see what is going on.
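Some hedged first checks for the ceph-create-keys theory above (the `--id` argument should match the mon's short hostname; paths and whether the tool accepts `-v` may vary by version and distro):

```shell
# Is ceph-create-keys still running, i.e. hung waiting on the mon?
ps aux | grep '[c]eph-create-keys'

# Re-run it by hand to see where it blocks or errors out.
sudo ceph-create-keys --id rgulistan-wsl11

# Did the admin keyring finally appear?
ls -l /etc/ceph/ceph.client.admin.keyring
```

If the tool hangs, the usual suspect is the mon not answering on its admin socket or the ceph.conf `mon host` address being unreachable from the node itself.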
Re: [ceph-users] 1 unfound object (but I can find it on-disk on the OSDs!)
Okay, I have now ended up returning the cluster into a healthy state, but instead using the version of the object from OSDs 0 and 2 rather than OSD 1.

I set the "noout" flag and shut down OSD 1. That appears to have resulted in the cluster being happy to use the version of the object that was present on the other OSDs. Then after starting up OSD 1 again, their version was replicated back to OSD 1. So there are no more inconsistencies or unfound objects.

I had noticed that the object in question corresponded to the first 4 MB of a logical volume within the VM that was used for its root filesystem (which is BTRFS). Comparing the content to the equivalent location on disk on some other similar VMs, I started suspecting that the "extra data" in OSD 1's copy of the object was superfluous anyway. I have now restarted the VM that owns the RBD, and it was at least quite happy mounting the filesystem, so I'm hoping all is well...

Alex

On 03/05/2015 12:55 PM, Alex Moore wrote:

Hi all, I need some help getting my 0.87.1 cluster back into a healthy state...

Overnight, a deep scrub detected an inconsistent object in a pg. Ceph health detail said the following:

# ceph health detail
HEALTH_ERR 1 pgs inconsistent; 2 scrub errors
pg 2.3b is active+clean+inconsistent, acting [1,2,0]
2 scrub errors

And these were the corresponding errors from the log:

2015-05-03 02:47:27.804774 6a8bc3f1e700 -1 log_channel(default) log [ERR] : 2.3b shard 1: soid c886da7b/rbd_data.25212ae8944a.0100/head//2 digest 1859582522 != known digest 2859280481, size 4194304 != known size 1642496
2015-05-03 02:47:44.099475 6a8bc3f1e700 -1 log_channel(default) log [ERR] : 2.3b deep-scrub stat mismatch, got 655/656 objects, 0/0 clones, 655/656 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0 whiteouts, 2685746176/2689940480 bytes, 0/0 hit_set_archive bytes.
2015-05-03 02:47:44.099496 6a8bc3f1e700 -1 log_channel(default) log [ERR] : 2.3b deep-scrub 0 missing, 1 inconsistent objects
2015-05-03 02:47:44.099501 6a8bc3f1e700 -1 log_channel(default) log [ERR] : 2.3b deep-scrub 2 errors

I located the inconsistent object on-disk on the 3 OSDs (and have saved a copy of them). The copies on OSDs 0 and 2 match each other, and have the supposedly "known size" of 1642496. The copy on OSD 1 (the primary) has additional data appended, and a size of 4194304. The content within the portion of the file that exists on OSDs 0 and 2 is the same on OSD 1; it just has extra data as well.

As this is part of an RBD (used by a linux VM, with a filesystem on top) I reasoned that if the "extra data" on OSD 1's copy of the object is not supposed to be there, then it almost certainly maps to an unallocated part of the filesystem within the VM, and so having the extra data isn't going to do any harm. So I want to stick with the version on OSD 1 (the primary).

I then ran "ceph pg repair 2.3b", as my understanding is that should replace the copies of the object on OSDs 0 and 2 with the one from the primary OSD, achieving what I want and removing the inconsistency. However that doesn't seem to have happened!

Instead I now have 1 unfound object (and it is the same object that had previously been reported as inconsistent), and some IO is now being blocked:

# ceph health detail
HEALTH_WARN 1 pgs recovering; 1 pgs stuck unclean; 1 requests are blocked > 32 sec; 1 osds have slow requests; recovery -1/1747956 objects degraded (-0.000%); 1/582652 unfound (0.000%)
pg 2.3b is stuck unclean for 533.238307, current state active+recovering, last acting [1,2,0]
pg 2.3b is active+recovering, acting [1,2,0], 1 unfound
1 ops are blocked > 524.288 sec
1 ops are blocked > 524.288 sec on osd.1
1 osds have slow requests
recovery -1/1747956 objects degraded (-0.000%); 1/582652 unfound (0.000%)

# ceph pg 2.3b list_missing
{ "offset": { "oid": "", "key": "", "snapid": 0, "hash": 0, "max": 0, "pool": -1, "namespace": ""},
  "num_missing": 1,
  "num_unfound": 1,
  "objects": [
        { "oid": { "oid": "rbd_data.25212ae8944a.0100", "key": "", "snapid": -2, "hash": 3364280955, "max": 0, "pool": 2, "namespace": ""},
          "need": "1216'8088646",
          "have": "0'0",
          "locations": []}],
  "more": 0}

However the 3 OSDs do still have the corresponding file on-disk, with the same content that they had when I first looked at them.

I can only assume that because the data in the object on the primary OSD didn't match the "known size", when I issued the "repair" Ceph somehow decided to invalidate the copy of the object on the primary OSD, rather than use it as the authoritative version, and now believes it has no good copies of the object.

How can I persuade Ceph to just go ahead and use the version of rbd_data.25212ae8944a.0100 that is already on-disk on OSD 1, and push it out to OSDs 0 and 2? Surely there is a way to do that!
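The noout-and-restart procedure described at the top of this message amounts to roughly the following sequence (hedged: the OSD id, the sysvinit script path, and the exact health output are specific to this cluster, and the PG must be watched between steps rather than scripted blindly):

```shell
# Prevent CRUSH from marking the OSD out and rebalancing while it is down.
ceph osd set noout

# Stop the primary that holds the divergent copy of the object.
sudo /etc/init.d/ceph stop osd.1

# Watch until pg 2.3b goes active using the surviving replicas [2,0].
ceph health detail

# Bring the primary back; the peers push their (now authoritative) copy to it.
sudo /etc/init.d/ceph start osd.1
ceph osd unset noout
```

Note this kept the OSDs 0/2 version, the opposite of the original goal of keeping OSD 1's copy, which the poster later decided was acceptable.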
Re: [ceph-users] Kicking 'Remapped' PGs
Thanks, Greg. Following your lead, we discovered the proper 'set_choose_tries xxx' value had not been applied to *this* pool's rule, and we updated the cluster accordingly. We then moved a random OSD out and back in to 'kick' things, but no joy: we still have the 4 'remapped' PGs.

BTW: the 4 PGs look OK from a basic rule perspective: they're on different OSDs/on different hosts, which is what we're concerned with... but it seems CRUSH has different goals for them and they are inactive.

So, back to the basic question: can we get just the 'remapped' PGs to re-sort themselves without causing massive data movement, or is a complete re-sort the only way to get to a desired CRUSH state?

As for the force_create_pg command: if it creates a blank PG element on a specific OSD (yes?), what happens to an existing PG element on other OSDs? Could we use force_create_pg followed by a 'pg repair' command to get things back to the proper state (in a very targeted way)?

For reference, below is the (reduced) output of dump_stuck:

pg_stat  objects  mip  degr  unf  bytes       log   disklog  state     state_stamp                 v            reported      up      up_pri  acting   acting_pri
11.6e5   284      0    0     0    2366787669  3012  3012     remapped  2015-04-23 13:19:02.373507  68310'49068  78500:123712  [0,92]  0       [0,84]   0
11.8bb   283      0    0     0    2349260884  3001  3001     remapped  2015-04-23 13:19:02.550735  70105'49776  78500:125026  [0,92]  0       [0,88]   0
11.e2f   280      0    0     0    2339844181  3001  3001     remapped  2015-04-23 13:18:59.299589  68310'51082  78500:119555  [77,4]  77      [77,34]  77
11.323   282      0    0     0    2357186647  3001  3001     remapped  2015-04-23 13:18:58.970396  70105'48961  78500:123987  [0,37]  0       [0,19]   0

On Apr 30, 2015, at 10:30 AM, Gregory Farnum <g...@gregs42.com> wrote:

Remapped PGs that are stuck that way mean that CRUSH is failing to map them appropriately — I think we talked about the circumstances around that previously. :) So nudging CRUSH can't do anything; it will just fail to map them appropriately again. (And indeed this is what happens whenever anyone does something to that PG or the OSD Map gets changed.)

The force_create_pg command does exactly what it sounds like: it tells the OSDs which should currently host the named PG to create it. You shouldn't need to run it and I don't remember exactly what checks it goes through, but it's generally for when you've given up on retrieving any data out of a PG whose OSDs died and want to just start over with a completely blank one.
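Applying a `set_choose_tries` value to a pool's rule, as discussed above, follows the usual CRUSH map round-trip. The value 100 and the file names below are placeholders; the rule edited must be the one the affected pool actually uses:

```shell
# Export the binary CRUSH map and decompile it to text.
ceph osd getcrushmap -o crush.bin
crushtool -d crush.bin -o crush.txt

# Edit crush.txt: inside the rule used by pool 11, raise the retry
# budget before the placement steps, e.g.:
#   rule my_rule {
#       ...
#       step set_choose_tries 100
#       step take default
#       ...
#   }

# Recompile and inject the modified map.
crushtool -c crush.txt -o crush-new.bin
ceph osd setcrushmap -i crush-new.bin
```

`crushtool --test` can also be run against the edited map first to verify that every PG of the pool now gets a full mapping before the map is injected into the live cluster.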
[ceph-users] 1 unfound object (but I can find it on-disk on the OSDs!)
Hi all, I need some help getting my 0.87.1 cluster back into a healthy state...

Overnight, a deep scrub detected an inconsistent object in a pg. Ceph health detail said the following:

# ceph health detail
HEALTH_ERR 1 pgs inconsistent; 2 scrub errors
pg 2.3b is active+clean+inconsistent, acting [1,2,0]
2 scrub errors

And these were the corresponding errors from the log:

2015-05-03 02:47:27.804774 6a8bc3f1e700 -1 log_channel(default) log [ERR] : 2.3b shard 1: soid c886da7b/rbd_data.25212ae8944a.0100/head//2 digest 1859582522 != known digest 2859280481, size 4194304 != known size 1642496
2015-05-03 02:47:44.099475 6a8bc3f1e700 -1 log_channel(default) log [ERR] : 2.3b deep-scrub stat mismatch, got 655/656 objects, 0/0 clones, 655/656 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0 whiteouts, 2685746176/2689940480 bytes, 0/0 hit_set_archive bytes.
2015-05-03 02:47:44.099496 6a8bc3f1e700 -1 log_channel(default) log [ERR] : 2.3b deep-scrub 0 missing, 1 inconsistent objects
2015-05-03 02:47:44.099501 6a8bc3f1e700 -1 log_channel(default) log [ERR] : 2.3b deep-scrub 2 errors

I located the inconsistent object on-disk on the 3 OSDs (and have saved a copy of them). The copies on OSDs 0 and 2 match each other, and have the supposedly "known size" of 1642496. The copy on OSD 1 (the primary) has additional data appended, and a size of 4194304. The content within the portion of the file that exists on OSDs 0 and 2 is the same on OSD 1; it just has extra data as well.

As this is part of an RBD (used by a linux VM, with a filesystem on top) I reasoned that if the "extra data" on OSD 1's copy of the object is not supposed to be there, then it almost certainly maps to an unallocated part of the filesystem within the VM, and so having the extra data isn't going to do any harm. So I want to stick with the version on OSD 1 (the primary).

I then ran "ceph pg repair 2.3b", as my understanding is that should replace the copies of the object on OSDs 0 and 2 with the one from the primary OSD, achieving what I want and removing the inconsistency. However that doesn't seem to have happened!

Instead I now have 1 unfound object (and it is the same object that had previously been reported as inconsistent), and some IO is now being blocked:

# ceph health detail
HEALTH_WARN 1 pgs recovering; 1 pgs stuck unclean; 1 requests are blocked > 32 sec; 1 osds have slow requests; recovery -1/1747956 objects degraded (-0.000%); 1/582652 unfound (0.000%)
pg 2.3b is stuck unclean for 533.238307, current state active+recovering, last acting [1,2,0]
pg 2.3b is active+recovering, acting [1,2,0], 1 unfound
1 ops are blocked > 524.288 sec
1 ops are blocked > 524.288 sec on osd.1
1 osds have slow requests
recovery -1/1747956 objects degraded (-0.000%); 1/582652 unfound (0.000%)

# ceph pg 2.3b list_missing
{ "offset": { "oid": "", "key": "", "snapid": 0, "hash": 0, "max": 0, "pool": -1, "namespace": ""},
  "num_missing": 1,
  "num_unfound": 1,
  "objects": [
        { "oid": { "oid": "rbd_data.25212ae8944a.0100", "key": "", "snapid": -2, "hash": 3364280955, "max": 0, "pool": 2, "namespace": ""},
          "need": "1216'8088646",
          "have": "0'0",
          "locations": []}],
  "more": 0}

However the 3 OSDs do still have the corresponding file on-disk, with the same content that they had when I first looked at them.

I can only assume that because the data in the object on the primary OSD didn't match the "known size", when I issued the "repair" Ceph somehow decided to invalidate the copy of the object on the primary OSD, rather than use it as the authoritative version, and now believes it has no good copies of the object.

How can I persuade Ceph to just go ahead and use the version of rbd_data.25212ae8944a.0100 that is already on-disk on OSD 1, and push it out to OSDs 0 and 2? Surely there is a way to do that!

Thanks in advance!
Alex
[ceph-users] OSD failing to restart
Hi,

On starting the OSD it fails:

$ /etc/init.d/ceph start osd.119

with errors:

$ tail -f /var/lib/ceph/osd/ceph-119/ceph-osd.119.log | grep -i err
-1/-1 (stderr threshold)
2015-05-03 11:38:44.366984 7f0794e5b780 -1 journal _check_disk_write_cache: fclose error: (61) No data available
2015-05-03 11:38:44.526567 7f0794e5b780 -1 filestore(/var/lib/ceph/osd/ceph-119) FileStore::_do_copy_range: read error at 155648~303616, (5) Input/output error
 -9> 2015-05-03 11:38:44.366984 7f0794e5b780 -1 journal _check_disk_write_cache: fclose error: (61) No data available
 -1> 2015-05-03 11:38:44.526567 7f0794e5b780 -1 filestore(/var/lib/ceph/osd/ceph-119) FileStore::_do_copy_range: read error at 155648~303616, (5) Input/output error
---

Any ideas?

--
Sourabh
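The (5) Input/output error and (61) No data available returns in the log above usually indicate a problem with the backing disk rather than with Ceph itself. Some hedged first checks, with /dev/sdX as a placeholder for whatever device backs osd.119:

```shell
# Look for kernel-level I/O errors on the backing device.
dmesg | grep -i -E 'error|fail' | tail -n 50

# Check SMART health attributes of the disk behind osd.119.
sudo smartctl -a /dev/sdX

# Optionally kick off a short SMART self-test and check back later.
sudo smartctl -t short /dev/sdX
```

If the disk is failing, the usual course is to mark the OSD out, let the cluster re-replicate, and replace the drive rather than trying to keep the OSD alive.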