Re: [ceph-users] Upgrade from Giant to Hammer and after some basic operations most of the OSD's went down

2015-05-03 Thread Tuomas Juntunen
Hi

Thanks Sage, I got it working now. Everything else seems to be OK, except
that the MDS is reporting "mds cluster is degraded"; I'm not sure what could
be wrong. The MDS is running, all OSDs are up, and the PGs are active+clean
and active+clean+replay.

I had to delete some empty pools that were created while the OSDs were not
working, and then recovery started to go through.

It seems the MDS is not that stable; this isn't the first time it has gone
degraded. Before, it started working again on its own, but this time I just
can't get it back.
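
(For reference, a hedged set of commands that might show why the MDS reports
degraded on a Hammer cluster -- the daemon name "a" and the log path below are
placeholders for whatever is actually running:)

    # Overall cluster health and the MDS map
    ceph -s
    ceph mds stat
    ceph mds dump

    # The MDS daemon's own log usually shows which step (replay/rejoin) it is stuck on
    tail -n 200 /var/log/ceph/ceph-mds.a.log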

Thanks

Br,
Tuomas


-----Original Message-----
From: tuomas.juntu...@databasement.fi [mailto:tuomas.juntu...@databasement.fi]
Sent: 1 May 2015 21:14
To: Sage Weil
Cc: tuomas.juntunen; ceph-users@lists.ceph.com; ceph-de...@vger.kernel.org
Subject: Re: [ceph-users] Upgrade from Giant to Hammer and after some basic
operations most of the OSD's went down

Thanks, I'll do this when the commit is available and report back.

And indeed, I'll switch back to the official packages after everything is OK.
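
(A quick hedged check, per Sage's note further down, that the pools are
actually gone and the OSDs have settled before switching back to the official
packages -- nothing here is specific to the workaround build:)

    # List the pools the cluster still knows about
    ceph osd dump | grep '^pool'
    ceph df

    # And confirm the OSDs are up and the cluster has stabilized
    ceph -s
    ceph osd stat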

Br,
Tuomas

> On Fri, 1 May 2015, tuomas.juntu...@databasement.fi wrote:
>> Hi
>>
>> I deleted the images and img pools and started the OSDs; they still die.
>>
>> Here's a log from one of the OSDs after this, if you need it.
>>
>> http://beta.xaasbox.com/ceph/ceph-osd.19.log
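
(For the record, a hedged sketch of the pool removal itself on Hammer, using
the pool names from this thread -- the tier relationship has to be removed
before the pools can be deleted, and pool deletion is irreversible, so
double-check the names first:)

    # Undo the tiering between img and images first
    ceph osd tier remove-overlay img
    ceph osd tier remove img images

    # Then drop the pools (the name must be given twice as a safety check)
    ceph osd pool delete images images --yes-i-really-really-mean-it
    ceph osd pool delete img img --yes-i-really-really-mean-it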
>
> I've pushed another commit that should avoid this case, sha1 
> 425bd4e1dba00cc2243b0c27232d1f9740b04e34.
>
> Note that once the pools are fully deleted (shouldn't take too long
> once the OSDs are up and have stabilized) you should switch back to the
> normal packages that don't have these workarounds.
>
> sage
>
>
>
>>
>> Br,
>> Tuomas
>>
>>
>> > Thanks man. I'll try it tomorrow. Have a good one.
>> >
>> > Br,T
>> >
>> > -------- Original message --------
>> > From: Sage Weil 
>> > Date: 30/04/2015  18:23  (GMT+02:00)
>> > To: Tuomas Juntunen 
>> > Cc: ceph-users@lists.ceph.com, ceph-de...@vger.kernel.org
>> > Subject: RE: [ceph-users] Upgrade from Giant to Hammer and after
>> > some basic operations most of the OSD's went down
>> >
>> > On Thu, 30 Apr 2015, tuomas.juntu...@databasement.fi wrote:
>> >> Hey
>> >>
>> >> Yes I can drop the images data, you think this will fix it?
>> >
>> > It's a slightly different assert that (I believe) should not 
>> > trigger once the pool is deleted.  Please give that a try and if 
>> > you still hit it I'll whip up a workaround.
>> >
>> > Thanks!
>> > sage
>> >
>> >  >
>> >>
>> >> Br,
>> >>
>> >> Tuomas
>> >>
>> >> > On Wed, 29 Apr 2015, Tuomas Juntunen wrote:
>> >> >> Hi
>> >> >>
>> >> >> I updated that version and it seems that something did happen;
>> >> >> the OSDs stayed up for a while and 'ceph status' got updated.
>> >> >> But then in a couple of minutes, they all went down the same way.
>> >> >>
>> >> >> I have attached a new 'ceph osd dump -f json-pretty' and got a
>> >> >> new log from one of the OSDs with osd debug = 20:
>> >> >> http://beta.xaasbox.com/ceph/ceph-osd.15.log
>> >> >
>> >> > Sam mentioned that you had said earlier that this was not critical data?
>> >> > If not, I think the simplest thing is to just drop those pools.
>> >> > The important thing (from my perspective at least :) is that we
>> >> > understand the root cause and can prevent this in the future.
>> >> >
>> >> > sage
>> >> >
>> >> >
>> >> >>
>> >> >> Thank you!
>> >> >>
>> >> >> Br,
>> >> >> Tuomas
>> >> >>
>> >> >>
>> >> >>
>> >> >> -----Original Message-----
>> >> >> From: Sage Weil [mailto:s...@newdream.net]
>> >> >> Sent: 28 April 2015 23:57
>> >> >> To: Tuomas Juntunen
>> >> >> Cc: ceph-users@lists.ceph.com; ceph-de...@vger.kernel.org
>> >> >> Subject: Re: [ceph-users] Upgrade from Giant to Hammer and after some
>> >> >> basic operations most of the OSD's went down
>> >> >>
>> >> >> Hi Tuomas,
>> >> >>
>> >> >> I've pushed an updated wip-hammer-snaps branch.  Can you please try it?
>> >> >> The build will appear here:
>> >> >>
>> >> >> http://gitbuilder.ceph.com/ceph-deb-trusty-x86_64-basic/sha1/08bf531331afd5e2eb514067f72afda11bcde286
>> >> >>
>> >> >> (or a similar url; adjust for your distro).
>> >> >>
>> >> >> Thanks!
>> >> >> sage
>> >> >>
>> >> >>
>> >> >> On Tue, 28 Apr 2015, Sage Weil wrote:
>> >> >>
>> >> >> > [adding ceph-devel]
>> >> >> >
>> >> >> > Okay, I see the problem.  This seems to be unrelated to the
>> >> >> > giant -> hammer move... it's a result of the tiering changes you made:
>> >> >> >
>> >> >> > > > > > > > The following:
>> >> >> > > > > > > >
>> >> >> > > > > > > > ceph osd tier add img images --force-nonempty
>> >> >> > > > > > > > ceph osd tier cache-mode images forward
>> >> >> > > > > > > > ceph osd tier set-overlay img images
>> >> >> >
>> >> >> > Specifically, --force-nonempty bypassed important safety checks.
>> >> >> >
>> >> >> > 1. images had snapshots (and removed_snaps)
>> >> >> >
>> >> >> > 2. images was added as a tier *of* img, and img's 
>> >> >> > removed_snaps was copied to 

[ceph-users] How to add a slave to rgw

2015-05-03 Thread 周炳华
Hi, geeks:

I have a Ceph cluster providing the RGW service in production, which was set
up according to the simple configuration tutorial, with only one default
region and one default zone. Even worse, I enabled neither the metadata
logging nor the data logging. Now I want to add a slave zone to the RGW for
disaster recovery. How can I do this while affecting the production service
as little as possible?

Thank you for your help.


Re: [ceph-users] Btrfs defragmentation

2015-05-03 Thread Lionel Bouton
On 05/04/15 01:34, Sage Weil wrote:
> On Mon, 4 May 2015, Lionel Bouton wrote:
>> Hi, we began testing one Btrfs OSD volume last week and for this
>> first test we disabled autodefrag and began to launch manual btrfs fi
>> defrag. During the tests, I monitored the number of extents of the
>> journal (10GB) and it went through the roof (it currently sits at
>> 8000+ extents for example). I was tempted to defragment it but after
>> thinking a bit about it I think it might not be a good idea. With
>> Btrfs, by default the data written to the journal on disk isn't
>> copied to its final destination. Ceph is using a clone_range feature
>> to reference the same data instead of copying it. 
> We've discussed this possibility but have never implemented it. The
> data is written twice: once to the journal and once to the object file.

That's odd. Here's an extract of filefrag output:

Filesystem type is: 9123683e
File size of /var/lib/ceph/osd/ceph-17/journal is 10485760000 (2560000
blocks of 4096 bytes)
 ext: logical_offset:physical_offset: length:   expected: flags:
   0:0..   0:  155073097.. 155073097:  1:   
   1:1..1254:  155068587.. 155069840:   1254:  155073098: shared
   2: 1255..2296:  155071149.. 155072190:   1042:  155069841: shared
   3: 2297..2344:  148124256.. 148124303: 48:  155072191: shared
   4: 2345..4396:  148129654.. 148131705:   2052:  148124304: shared
   5: 4397..6446:  148137117.. 148139166:   2050:  148131706: shared
   6: 6447..6451:  150414237.. 150414241:  5:  148139167: shared
   7: 6452..   10552:  150432040.. 150436140:   4101:  150414242: shared
   8:10553..   12603:  150477824.. 150479874:   2051:  150436141: shared

Almost all extents of the journal are shared with another file (on one
occasion I found 3 consecutive extents without the shared flag). I thought
they could be shared with a copy in a snapshot, but the snapshots only cover
the "current" subvolume.
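
(A quick hedged way to see how many of the journal's extents carry the shared
flag, assuming the journal path used above:)

    # Total extent count ("<file>: N extents found")
    sudo filefrag /var/lib/ceph/osd/ceph-17/journal

    # How many of those extents are flagged as shared
    sudo filefrag -v /var/lib/ceph/osd/ceph-17/journal | grep -c shared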

Lionel


Re: [ceph-users] Btrfs defragmentation

2015-05-03 Thread Sage Weil
On Mon, 4 May 2015, Lionel Bouton wrote:
> Hi,
> 
> we began testing one Btrfs OSD volume last week and for this first test
> we disabled autodefrag and began to launch manual btrfs fi defrag.
> 
> During the tests, I monitored the number of extents of the journal
> (10GB) and it went through the roof (it currently sits at 8000+ extents
> for example).
> I was tempted to defragment it but after thinking a bit about it I think
> it might not be a good idea.
> With Btrfs, by default the data written to the journal on disk isn't
> copied to its final destination. Ceph is using a clone_range feature to
> reference the same data instead of copying it.

We've discussed this possibility but have never implemented it.  The data 
is written twice: once to the journal and once to the object file.

> So if you defragment both the journal and the final destination, you are
> moving the data around to attempt to get both references to satisfy a
> one extent goal but most of the time can't get both of them at the same
> time (unless the destination is a whole file instead of a fragment of one).
> 
> I assume the journal probably doesn't benefit at all from
> defragmentation: it's overwritten constantly and as Btrfs uses CoW, the
> previous extents won't be reused at all and new ones will be created for
> the new data instead of overwriting the old in place. The final
> destination files are reused (reread) and benefit from defragmentation.

Yeah, I agree.  It is probably best to let btrfs write the journal 
anywhere since it is never read (except for replay after a failure 
or restart).

There is also a newish 'journal discard' option that is false by default; 
enabling this may let us throw out the previously allocated space so that 
the new writes get written to fresh locations (instead of to the 
previously written and fragmented positions).  I expect this will make a 
positive difference, but I'm not sure that anyone has tested it.
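
(If anyone wants to experiment with that, a hedged sketch -- the option Sage
mentions goes in ceph.conf on the OSD host and needs an OSD restart to take
effect; double-check the exact option name on your release:)

    [osd]
        journal discard = true      ; default is false

    # then restart the affected OSD so the journal is reopened, e.g. with upstart:
    sudo restart ceph-osd id=17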

> Under these assumptions we excluded the journal file from
> defragmentation, in fact we only defragment the "current" directory
> (snapshot directories are probably only read from in rare cases and are
> ephemeral so optimizing them is not interesting).
> 
> The filesystem is only one week old so we will have to wait a bit to see
> if this strategy is better than the one used when mounting with
> autodefrag (I couldn't find much about it but last year we had
> unmanageable latencies).

Cool.. let us know how things look after it ages!

sage

> We have a small Ruby script which triggers defragmentation based on the
> number of extents and by default limits the rate of calls to btrfs fi
> defrag to a negligible level to avoid trashing the filesystem. If
> someone is interested I can attach it or push it on Github after a bit
> of cleanup.



[ceph-users] Btrfs defragmentation

2015-05-03 Thread Lionel Bouton
Hi,

we began testing one Btrfs OSD volume last week and for this first test
we disabled autodefrag and began to launch manual btrfs fi defrag.

During the tests, I monitored the number of extents of the journal
(10GB) and it went through the roof (it currently sits at 8000+ extents
for example).
I was tempted to defragment it but after thinking a bit about it I think
it might not be a good idea.
With Btrfs, by default the data written to the journal on disk isn't
copied to its final destination. Ceph is using a clone_range feature to
reference the same data instead of copying it.
So if you defragment both the journal and the final destination, you are
moving the data around to attempt to get both references to satisfy a
one extent goal but most of the time can't get both of them at the same
time (unless the destination is a whole file instead of a fragment of one).

I assume the journal probably doesn't benefit at all from
defragmentation: it's overwritten constantly and as Btrfs uses CoW, the
previous extents won't be reused at all and new ones will be created for
the new data instead of overwriting the old in place. The final
destination files are reused (reread) and benefit from defragmentation.

Under these assumptions we excluded the journal file from
defragmentation, in fact we only defragment the "current" directory
(snapshot directories are probably only read from in rare cases and are
ephemeral so optimizing them is not interesting).

The filesystem is only one week old so we will have to wait a bit to see
if this strategy is better than the one used when mounting with
autodefrag (I couldn't find much about it but last year we had
unmanageable latencies).
We have a small Ruby script which triggers defragmentation based on the
number of extents and by default limits the rate of calls to btrfs fi
defrag to a negligible level to avoid trashing the filesystem. If
someone is interested I can attach it or push it on Github after a bit
of cleanup.
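
(Not the Ruby script itself, but a rough shell equivalent of the idea, under
the same assumptions: check the extent count with filefrag, defragment only
files that cross a threshold, and pause between calls so the I/O impact stays
negligible. Paths and thresholds below are placeholders; the journal lives
outside "current", so it is naturally excluded here.)

    #!/bin/sh
    OSD_DIR=/var/lib/ceph/osd/ceph-17/current   # placeholder path
    THRESHOLD=64                                # extents before we defragment
    PAUSE=30                                    # seconds between defrag calls

    find "$OSD_DIR" -type f | while read -r f; do
        # filefrag prints "<file>: N extents found"
        extents=$(filefrag "$f" | awk '{print $(NF-2)}')
        if [ "$extents" -gt "$THRESHOLD" ]; then
            btrfs filesystem defragment "$f"
            sleep "$PAUSE"
        fi
    done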

Best regards,

Lionel


Re: [ceph-users] Help with CEPH deployment

2015-05-03 Thread Mark Kirkwood

On 04/05/15 05:42, Venkateswara Rao Jujjuri wrote:

Here is the output... I am still stuck at this step. :(
(I have tried multiple times, purging and restarting from scratch.)

vjujjuri@rgulistan-wsl10:~/ceph-cluster$ ceph-deploy mon create-initial
[ceph_deploy.conf][DEBUG ] found configuration file at:
/home/vjujjuri/.cephdeploy.conf
[ceph_deploy.cli][INFO  ] Invoked (1.5.23): /usr/bin/ceph-deploy mon
create-initial
[ceph_deploy.mon][DEBUG ] Deploying mon, cluster ceph hosts rgulistan-wsl11
[ceph_deploy.mon][DEBUG ] detecting platform for host rgulistan-wsl11 ...
[rgulistan-wsl11][DEBUG ] connection detected need for sudo
[rgulistan-wsl11][DEBUG ] connected to host: rgulistan-wsl11
[rgulistan-wsl11][DEBUG ] detect platform information from remote host
[rgulistan-wsl11][DEBUG ] detect machine type
[ceph_deploy.mon][INFO  ] distro info: Ubuntu 12.04 precise
[rgulistan-wsl11][DEBUG ] determining if provided host has same
hostname in remote
[rgulistan-wsl11][DEBUG ] get remote short hostname
[rgulistan-wsl11][DEBUG ] deploying mon to rgulistan-wsl11
[rgulistan-wsl11][DEBUG ] get remote short hostname
[rgulistan-wsl11][DEBUG ] remote hostname: rgulistan-wsl11
[rgulistan-wsl11][DEBUG ] write cluster configuration to
/etc/ceph/{cluster}.conf
[rgulistan-wsl11][DEBUG ] create the mon path if it does not exist
[rgulistan-wsl11][DEBUG ] checking for done path:
/var/lib/ceph/mon/ceph-rgulistan-wsl11/done
[rgulistan-wsl11][DEBUG ] create a done file to avoid re-doing the mon
deployment
[rgulistan-wsl11][DEBUG ] create the init path if it does not exist
[rgulistan-wsl11][DEBUG ] locating the `service` executable...
[rgulistan-wsl11][INFO  ] Running command: sudo initctl emit ceph-mon
cluster=ceph id=rgulistan-wsl11
[rgulistan-wsl11][INFO  ] Running command: sudo ceph --cluster=ceph
--admin-daemon /var/run/ceph/ceph-mon.rgulistan-wsl11.asok mon_status
[rgulistan-wsl11][DEBUG ]

[rgulistan-wsl11][DEBUG ] status for monitor: mon.rgulistan-wsl11
[rgulistan-wsl11][DEBUG ] {
[rgulistan-wsl11][DEBUG ]   "election_epoch": 1,
[rgulistan-wsl11][DEBUG ]   "extra_probe_peers": [],
[rgulistan-wsl11][DEBUG ]   "monmap": {
[rgulistan-wsl11][DEBUG ] "created": "2015-05-02 10:52:17.318500",
[rgulistan-wsl11][DEBUG ] "epoch": 1,
[rgulistan-wsl11][DEBUG ] "fsid": "64e48bd5-f174-44a4-a485-7df3adbdad3d",
[rgulistan-wsl11][DEBUG ] "modified": "2015-05-02 10:52:17.318500",
[rgulistan-wsl11][DEBUG ] "mons": [
[rgulistan-wsl11][DEBUG ]   {
[rgulistan-wsl11][DEBUG ] "addr": "xx.xx.xx.xx:6789/0",
[rgulistan-wsl11][DEBUG ] "name": "rgulistan-wsl11",
[rgulistan-wsl11][DEBUG ] "rank": 0
[rgulistan-wsl11][DEBUG ]   }
[rgulistan-wsl11][DEBUG ] ]
[rgulistan-wsl11][DEBUG ]   },
[rgulistan-wsl11][DEBUG ]   "name": "rgulistan-wsl11",
[rgulistan-wsl11][DEBUG ]   "outside_quorum": [],
[rgulistan-wsl11][DEBUG ]   "quorum": [
[rgulistan-wsl11][DEBUG ] 0
[rgulistan-wsl11][DEBUG ]   ],
[rgulistan-wsl11][DEBUG ]   "rank": 0,
[rgulistan-wsl11][DEBUG ]   "state": "leader",
[rgulistan-wsl11][DEBUG ]   "sync_provider": []
[rgulistan-wsl11][DEBUG ] }
[rgulistan-wsl11][DEBUG ]

[rgulistan-wsl11][INFO  ] monitor: mon.rgulistan-wsl11 is running
[rgulistan-wsl11][INFO  ] Running command: sudo ceph --cluster=ceph
--admin-daemon /var/run/ceph/ceph-mon.rgulistan-wsl11.asok mon_status
[ceph_deploy.mon][INFO  ] processing monitor mon.rgulistan-wsl11
[rgulistan-wsl11][DEBUG ] connection detected need for sudo
[rgulistan-wsl11][DEBUG ] connected to host: rgulistan-wsl11
[rgulistan-wsl11][INFO  ] Running command: sudo ceph --cluster=ceph
--admin-daemon /var/run/ceph/ceph-mon.rgulistan-wsl11.asok mon_status
[ceph_deploy.mon][INFO  ] mon.rgulistan-wsl11 monitor has reached quorum!
[ceph_deploy.mon][INFO  ] all initial monitors are running and have
formed quorum
[ceph_deploy.mon][INFO  ] Running gatherkeys...
[ceph_deploy.gatherkeys][DEBUG ] Checking rgulistan-wsl11 for
/etc/ceph/ceph.client.admin.keyring
[rgulistan-wsl11][DEBUG ] connection detected need for sudo
[rgulistan-wsl11][DEBUG ] connected to host: rgulistan-wsl11
[rgulistan-wsl11][DEBUG ] detect platform information from remote host
[rgulistan-wsl11][DEBUG ] detect machine type
[rgulistan-wsl11][DEBUG ] fetch remote file
[ceph_deploy.gatherkeys][WARNIN] Unable to find
/etc/ceph/ceph.client.admin.keyring on rgulistan-wsl11
[ceph_deploy][ERROR ] KeyNotFoundError: Could not find keyring file:
/etc/ceph/ceph.client.admin.keyring on host rgulistan-wsl11





Hmmm, so this is Ubuntu 12.04, which should work ok.

It looks like the upstart command to start the monitor is working, which 
*should* kick off the key creation (see /etc/init/ceph-create-keys.conf).


I'd guess that ceph-create-keys is hanging or failing - do you see the
process running? If not, have a look in /var/log/ceph on the mon host to
see what is going on.
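
(A hedged way to check that on the mon host -- the monitor id below matches
the one in the log above, and the upstart log path is an assumption:)

    # Is the key-creation helper still running, or did it die?
    ps aux | grep '[c]eph-create-keys'

    # Any clues in the ceph or upstart logs?
    ls -l /var/log/ceph/ /var/log/upstart/ceph-create-keys*.log 2>/dev/null

    # It can also be run by hand against the local monitor
    sudo ceph-create-keys --cluster ceph -i rgulistan-wsl11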

Re: [ceph-users] 1 unfound object (but I can find it on-disk on the OSDs!)

2015-05-03 Thread Alex Moore
Okay I have now ended up returning the cluster into a healthy state but 
instead using the version of the object from OSDs 0 and 2 rather than 
OSD 1. I set the "noout" flag, and shut down OSD 1. That appears to have 
resulted in the cluster being happy to use the version of the object 
that was present on the other OSDs. Then after starting up OSD 1 again, 
their version was replicated back to OSD 1. So there are no more 
inconsistencies or unfound objects.
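
(For the archives, the sequence described above sketched as commands -- osd.1
is the primary holding the divergent copy here, and the stop/start lines use
the upstart form, so adjust for your init system:)

    # Stop Ceph from marking the OSD out while it is down
    ceph osd set noout

    # Shut down the primary so the PG goes active on the copies on osd.0/osd.2
    sudo stop ceph-osd id=1

    # ... wait for the PG to settle, then bring osd.1 back; its divergent copy
    # gets replaced by the version the other OSDs agree on
    sudo start ceph-osd id=1

    # Re-enable normal out-marking
    ceph osd unset noout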


I had noticed that the object in question corresponded to the first 4 MB 
of a logical volume within the VM that was used for its root filesystem 
(which is BTRFS). Comparing the content to the equivalent location on 
disk on some other similar VMs, I started suspecting that the "extra 
data" in OSD 1's copy of the object was superfluous anyway. I have now 
restarted the VM that owns the RBD, and it was at least quite happy 
mounting the filesystem, so I'm hoping all is well...


Alex

On 03/05/2015 12:55 PM, Alex Moore wrote:
Hi all, I need some help getting my 0.87.1 cluster back into a healthy 
state...


Overnight, a deep scrub detected an inconsistent object pg. Ceph 
health detail said the following:


# ceph health detail
HEALTH_ERR 1 pgs inconsistent; 2 scrub errors
pg 2.3b is active+clean+inconsistent, acting [1,2,0]
2 scrub errors

And these were the corresponding errors from the log:

2015-05-03 02:47:27.804774 6a8bc3f1e700 -1 log_channel(default) log 
[ERR] : 2.3b shard 1: soid 
c886da7b/rbd_data.25212ae8944a.0100/head//2 digest 
1859582522 != known digest 2859280481, size 4194304 != known size 1642496
2015-05-03 02:47:44.099475 6a8bc3f1e700 -1 log_channel(default) log 
[ERR] : 2.3b deep-scrub stat mismatch, got 655/656 objects, 0/0 
clones, 655/656 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0 whiteouts, 
2685746176/2689940480 bytes,0/0 hit_set_archive bytes.
2015-05-03 02:47:44.099496 6a8bc3f1e700 -1 log_channel(default) log 
[ERR] : 2.3b deep-scrub 0 missing, 1 inconsistent objects
2015-05-03 02:47:44.099501 6a8bc3f1e700 -1 log_channel(default) log 
[ERR] : 2.3b deep-scrub 2 errors


I located the inconsistent object on-disk on the 3 OSDs (and have 
saved a copy of them). The copy on OSDs 0 and 2 match each other, and 
have the supposedly "known size" of 1642496. The copy on OSD 1 (the 
primary) has additional data appended, and a size of 4194304. The 
content within the portion of the file that exists on OSDs 0 and 2 is 
the same on OSD 1, it just has extra data as well.


As this is part of an RBD (used by a linux VM, with a filesystem on 
top) I reasoned that if the "extra data" on OSD 1's copy of the object 
is not supposed to be there, then it almost certainly maps to an 
unallocated part of the filesystem within the VM, and so having the 
extra data isn't going to do any harm. So I want to stick with the 
version on OSD 1 (the primary).


I then ran "ceph pg repair 2.3b", as my understanding is that should 
replace the copies of the object on OSDs 0 and 2 with the one from the 
primary OSD, achieving what I want, and removing the inconsistency. 
However that doesn't seem to have happened!


Instead I now have 1 unfound object (and it is the same object that 
had previously been reported as inconsistent), and some IO is now 
being blocked:


# ceph health detail
HEALTH_WARN 1 pgs recovering; 1 pgs stuck unclean; 1 requests are 
blocked > 32 sec; 1 osds have slow requests; recovery -1/1747956 
objects degraded (-0.000%); 1/582652 unfound (0.000%)
pg 2.3b is stuck unclean for 533.238307, current state 
active+recovering, last acting [1,2,0]

pg 2.3b is active+recovering, acting [1,2,0], 1 unfound
1 ops are blocked > 524.288 sec
1 ops are blocked > 524.288 sec on osd.1
1 osds have slow requests
recovery -1/1747956 objects degraded (-0.000%); 1/582652 unfound (0.000%)

# ceph pg 2.3b list_missing
{ "offset": { "oid": "",
  "key": "",
  "snapid": 0,
  "hash": 0,
  "max": 0,
  "pool": -1,
  "namespace": ""},
  "num_missing": 1,
  "num_unfound": 1,
  "objects": [
{ "oid": { "oid": "rbd_data.25212ae8944a.0100",
  "key": "",
  "snapid": -2,
  "hash": 3364280955,
  "max": 0,
  "pool": 2,
  "namespace": ""},
  "need": "1216'8088646",
  "have": "0'0",
  "locations": []}],
  "more": 0}

However the 3 OSDs do still have the corresponding file on-disk, with 
the same content that they had when I first looked at them. I can only 
assume that because the data in the object on the primary OSD didn't 
match the "known size", when I issued the "repair" Ceph somehow 
decided to invalidate the copy of the object on the primary OSD, 
rather than use it as the authoritative version, and now believes it 
has no good copies of the object.


How can I persuade Ceph to just go ahead and use the version of 
rbd_data.25212ae8944a.0100 that is already on-disk on OSD 
1, and push it out to OSDs 0 and 2? Surely there is a way to do that!

Re: [ceph-users] Kicking 'Remapped' PGs

2015-05-03 Thread Paul Evans
Thanks, Greg.  Following your lead, we discovered the proper 'set_choose_tries
xxx' value had not been applied to *this* pool's rule, and we updated the
cluster accordingly. We then moved a random OSD out and back in to 'kick'
things, but no joy: we still have the 4 'remapped' PGs.  BTW: the 4 PGs look
OK from a basic rule perspective - they're on different OSDs on different
hosts, which is what we're concerned with - but it seems CRUSH has different
goals for them and they are inactive.

So, back to the basic question: can we get just the 'remapped' PGs to re-sort
themselves without causing massive data movement, or is a complete re-sort the
only way to get to a desired CRUSH state?

As for the force_create_pg command: if it creates a blank PG element on a
specific OSD (yes?), what happens to an existing PG element on other OSDs?
Could we use force_create_pg followed by a 'pg repair' command to get things
back to the proper state (in a very targeted way)?

For reference, below is the (reduced) output of dump_stuck:

pg_stat  objects  mip  degr  unf  bytes  log  disklog  state  state_stamp  v  reported  up  up_pri  acting  acting_pri
11.6e52840002366787669  30123012  remapped  2015-04-23 
13:19:02.373507  68310'4906878500:123712   [0,92]0[0,84]0
11.8bb2830002349260884  30013001  remapped  2015-04-23 
13:19:02.550735  70105'4977678500:125026   [0,92]0[0,88]0
11.e2f2800002339844181  30013001  remapped  2015-04-23 
13:18:59.299589  68310'5108278500:119555   [77,4]77   [77,34]   77
11.3232820002357186647  30013001  remapped  2015-04-23 
13:18:58.970396  70105'4896178500:123987   [0,37]0[0,19]0



On Apr 30, 2015, at 10:30 AM, Gregory Farnum <g...@gregs42.com> wrote:

Remapped PGs that are stuck that way mean that CRUSH is failing to map
them appropriately — I think we talked about the circumstances around
that previously. :) So nudging CRUSH can't do anything; it will just
fail to map them appropriately again. (And indeed this is what happens
whenever anyone does something to that PG or the OSD Map gets
changed.)

The force_create_pg command does exactly what it sounds like: it tells
the OSDs which should currently host the named PG to create it. You
shouldn't need to run it and I don't remember exactly what checks it
goes through, but it's generally for when you've given up on
retrieving any data out of a PG whose OSDs died and want to just start
over with a completely blank one.
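
(For anyone who hits this later: a hedged, offline way to check whether CRUSH
can actually satisfy the rule after changing set_choose_tries -- the rule
number and replica count below are placeholders for this pool's values:)

    # Grab and decompile the current CRUSH map
    ceph osd getcrushmap -o crush.bin
    crushtool -d crush.bin -o crush.txt

    # ... edit crush.txt (set_choose_tries etc.), then recompile
    crushtool -c crush.txt -o crush.new

    # Dry-run the rule and list any inputs that fail to get a full mapping
    crushtool -i crush.new --test --rule 1 --num-rep 2 --show-bad-mappings

    # Only inject the map once the test comes back clean
    ceph osd setcrushmap -i crush.new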



[ceph-users] 1 unfound object (but I can find it on-disk on the OSDs!)

2015-05-03 Thread Alex Moore
Hi all, I need some help getting my 0.87.1 cluster back into a healthy 
state...


Overnight, a deep scrub detected an inconsistent object in a PG. Ceph health
detail said the following:


# ceph health detail
HEALTH_ERR 1 pgs inconsistent; 2 scrub errors
pg 2.3b is active+clean+inconsistent, acting [1,2,0]
2 scrub errors

And these were the corresponding errors from the log:

2015-05-03 02:47:27.804774 6a8bc3f1e700 -1 log_channel(default) log 
[ERR] : 2.3b shard 1: soid 
c886da7b/rbd_data.25212ae8944a.0100/head//2 digest 
1859582522 != known digest 2859280481, size 4194304 != known size 1642496
2015-05-03 02:47:44.099475 6a8bc3f1e700 -1 log_channel(default) log 
[ERR] : 2.3b deep-scrub stat mismatch, got 655/656 objects, 0/0 clones, 
655/656 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0 whiteouts, 
2685746176/2689940480 bytes,0/0 hit_set_archive bytes.
2015-05-03 02:47:44.099496 6a8bc3f1e700 -1 log_channel(default) log 
[ERR] : 2.3b deep-scrub 0 missing, 1 inconsistent objects
2015-05-03 02:47:44.099501 6a8bc3f1e700 -1 log_channel(default) log 
[ERR] : 2.3b deep-scrub 2 errors


I located the inconsistent object on-disk on the 3 OSDs (and have saved 
a copy of them). The copy on OSDs 0 and 2 match each other, and have the 
supposedly "known size" of 1642496. The copy on OSD 1 (the primary) has 
additional data appended, and a size of 4194304. The content within the 
portion of the file that exists on OSDs 0 and 2 is the same on OSD 1, it 
just has extra data as well.
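
(In case it helps anyone reproduce the comparison: a hedged sketch of how the
on-disk copies can be located and checksummed. The PG directory name follows
the usual <pgid>_head convention; the copied file names are placeholders.)

    # On each OSD host, find the object file inside the PG's directory
    find /var/lib/ceph/osd/ceph-1/current/2.3b_head -name '*25212ae8944a*' -ls

    # Compare the sizes and checksums of the three saved copies
    ls -l   copy_from_osd0 copy_from_osd1 copy_from_osd2
    md5sum  copy_from_osd0 copy_from_osd1 copy_from_osd2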


As this is part of an RBD (used by a linux VM, with a filesystem on top) 
I reasoned that if the "extra data" on OSD 1's copy of the object is not 
supposed to be there, then it almost certainly maps to an unallocated 
part of the filesystem within the VM, and so having the extra data isn't 
going to do any harm. So I want to stick with the version on OSD 1 (the 
primary).


I then ran "ceph pg repair 2.3b", as my understanding is that should 
replace the copies of the object on OSDs 0 and 2 with the one from the 
primary OSD, achieving what I want, and removing the inconsistency. 
However that doesn't seem to have happened!


Instead I now have 1 unfound object (and it is the same object that had 
previously been reported as inconsistent), and some IO is now being blocked:


# ceph health detail
HEALTH_WARN 1 pgs recovering; 1 pgs stuck unclean; 1 requests are 
blocked > 32 sec; 1 osds have slow requests; recovery -1/1747956 objects 
degraded (-0.000%); 1/582652 unfound (0.000%)
pg 2.3b is stuck unclean for 533.238307, current state 
active+recovering, last acting [1,2,0]

pg 2.3b is active+recovering, acting [1,2,0], 1 unfound
1 ops are blocked > 524.288 sec
1 ops are blocked > 524.288 sec on osd.1
1 osds have slow requests
recovery -1/1747956 objects degraded (-0.000%); 1/582652 unfound (0.000%)

# ceph pg 2.3b list_missing
{ "offset": { "oid": "",
  "key": "",
  "snapid": 0,
  "hash": 0,
  "max": 0,
  "pool": -1,
  "namespace": ""},
  "num_missing": 1,
  "num_unfound": 1,
  "objects": [
{ "oid": { "oid": "rbd_data.25212ae8944a.0100",
  "key": "",
  "snapid": -2,
  "hash": 3364280955,
  "max": 0,
  "pool": 2,
  "namespace": ""},
  "need": "1216'8088646",
  "have": "0'0",
  "locations": []}],
  "more": 0}

However the 3 OSDs do still have the corresponding file on-disk, with 
the same content that they had when I first looked at them. I can only 
assume that because the data in the object on the primary OSD didn't 
match the "known size", when I issued the "repair" Ceph somehow decided 
to invalidate the copy of the object on the primary OSD, rather than use 
it as the authoritative version, and now believes it has no good copies 
of the object.


How can I persuade Ceph to just go ahead and use the version of 
rbd_data.25212ae8944a.0100 that is already on-disk on OSD 1, 
and push it out to OSDs 0 and 2? Surely there is a way to do that!


Thanks in advance!
Alex


[ceph-users] OSD failing to restart

2015-05-03 Thread sourabh saryal

Hi,

On starting the OSD, it fails:

$ /etc/init.d/ceph start osd.119

with these errors:

$ tail -f /var/lib/ceph/osd/ceph-119/ceph-osd.119.log |grep -i err
  -1/-1 (stderr threshold)
2015-05-03 11:38:44.366984 7f0794e5b780 -1 journal 
_check_disk_write_cache: fclose error: (61) No data available
2015-05-03 11:38:44.526567 7f0794e5b780 -1 
filestore(/var/lib/ceph/osd/ceph-119) FileStore::_do_copy_range: read 
error at 155648~303616, (5) Input/output error
-9> 2015-05-03 11:38:44.366984 7f0794e5b780 -1 journal 
_check_disk_write_cache: fclose error: (61) No data available
-1> 2015-05-03 11:38:44.526567 7f0794e5b780 -1 
filestore(/var/lib/ceph/osd/ceph-119) FileStore::_do_copy_range: read 
error at 155648~303616, (5) Input/output error

---

Any ideas?

--
Sourabh