[ceph-users] Cluster issue - pgs degraded, recovering, stale, etc.

2016-05-10 Thread deeepdish
Hello.

I have a two-node cluster with 4x replicas for all objects distributed between 
the two nodes (two copies on each node).  I recently converted my OSDs from 
BTRFS to XFS (BTRFS was slow) by removing / preparing / activating the OSDs on 
each node (one at a time) as XFS, allowing the cluster to rebalance / recover 
itself.   Now that this is all complete, I have a better performing cluster and 
all data is intact, however the cluster reports the status below.   How can I 
remedy this?  I'm looking for guidance on troubleshooting steps / a starting 
point.   There’s a bunch of seemingly different issues that likely stem from 
the same root cause.   

 health HEALTH_WARN
11 pgs degraded
7 pgs peering
4 pgs recovering
2 pgs recovery_wait
885 pgs stale
11 pgs stuck degraded
60 pgs stuck inactive
885 pgs stuck stale
66 pgs stuck unclean
recovery 3/24971148 objects degraded (0.000%)
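
For reference, the first things I plan to run (the pg id in the last line is 
just a placeholder for one of the stuck PGs):

ceph health detail              # lists the exact PGs behind each warning
ceph pg dump_stuck stale
ceph pg dump_stuck inactive
ceph osd tree                   # confirm every OSD came back up/in after the XFS conversion
ceph pg 2.5 query               # substitute a real pg id from the dump_stuck output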


Thank you.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CRUSH Rule Review - Not replicating correctly

2016-01-20 Thread deeepdish
Hi Robert,

Just wanted to let you know that after applying your crush suggestion and 
allowing the cluster to rebalance itself, I now have symmetrical data distribution. 
  In keeping 5 monitors, my rationale is availability.   I have 3 compute nodes 
+ 2 storage nodes.   I was thinking that making all of them a monitor would 
provide additional backup.  Based on your earlier comments, can you provide 
guidance on how much latency is induced by having excess monitors deployed?
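
For the archive, this is roughly how I checked the distribution after the change 
(the rule id 1 below is a placeholder; use whatever 'ceph osd crush rule dump' 
reports for the replicated rule):

ceph osd df                     # per-OSD utilization should now be even across both hosts
ceph osd getcrushmap -o /tmp/cm
crushtool -i /tmp/cm --test --rule 1 --num-rep 4 --show-mappings | head
# each mapping should list two OSDs from b02s08 and two from b02s12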

Thanks.


> On Jan 18, 2016, at 12:36 , Robert LeBlanc <rob...@leblancnet.us> wrote:
> 
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
> 
> Not that I know of.
> - 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> 
> 
> On Mon, Jan 18, 2016 at 10:33 AM, deeepdish  wrote:
>> Thanks Robert.   Will definitely try this.   Is there a way to implement 
>> “gradual CRUSH” changes?   I noticed whenever cluster wide changes are 
>> pushed (crush map, for instance) the cluster immediately attempts to align 
>> itself disrupting client access / performance…
>> 
>> 
>>> On Jan 18, 2016, at 12:22 , Robert LeBlanc  wrote:
>>> 
>>> -BEGIN PGP SIGNED MESSAGE-
>>> Hash: SHA256
>>> 
>>> I'm not sure why you have six monitors. Six monitors buys you nothing
>>> over five monitors other than more power being used, and more latency
>>> and more headache. See
>>> http://docs.ceph.com/docs/hammer/rados/configuration/mon-config-ref/#monitor-quorum
>>>  
>>> <http://docs.ceph.com/docs/hammer/rados/configuration/mon-config-ref/#monitor-quorum>
>>> for some more info. Also, I'd consider 5 monitors overkill for this
>>> size cluster, I'd recommend three.
>>> 
>>> Although this is most likely not the root cause of your problem, you
>>> probably have an error here: "root replicated-T1" is pointing to
>>> b02s08 and b02s12 and "site erbus" is also pointing to b02s08 and
>>> b02s12. You probably meant to have "root replicated-T1" pointing to
>>> erbus instead.
>>> 
>>> Where I think your problem is, is in your "rule replicated" section.
>>> You can try:
>>> step take replicated-T1
>>> step choose firstn 2 type host
>>> step chooseleaf firstn 2 type osdgroup
>>> step emit
>>> 
>>> What this does is choose two hosts from the root replicated-T1 (which
>>> happens to be both hosts you have), then chooses an OSD from two
>>> osdgroups on each host.
>>> 
>>> I believe the problem with your current rule set is that firstn 0 type
>>> host tries to select four hosts, but only two are available. You
>>> should be able to see that with 'ceph pg dump', where only two osds
>>> will be listed in the up set.
>>> 
>>> I hope that helps.
>>> -BEGIN PGP SIGNATURE-
>>> Version: Mailvelope v1.3.3
>>> Comment: https://www.mailvelope.com <https://www.mailvelope.com/>
>>> 
>>> wsFcBAEBCAAQBQJWnR9kCRDmVDuy+mK58QAA5hUP/iJprG4nGR2sJvL//8l+
>>> V6oLYXTCs8lHeKL3ZPagThE9oh2xDMV37WR3I/xMNTA8735grl8/AAhy8ypW
>>> MDOikbpzfWnlaL0SWs5rIQ5umATwv73Fg/Mf+K2Olt8IGP6D0NMIxfeOjU6E
>>> 0Sc3F37nDQFuDEkBYjcVcqZC89PByh7yaId+eOgr7Ot+BZL/3fbpWIZ9kyD5
>>> KoPYdPjtFruoIpc8DJydzbWdmha65DkB65QOZlI3F3lMc6LGXUopm4OP4sQd
>>> txVKFtTcLh97WgUshQMSWIiJiQT7+3D6EqQyPzlnei3O3gACpkpsmUteDPpn
>>> p8CDeJtIpgKnQZjBwfK/bUQXdIGem8Y0x/PC+1ekIhkHCIJeW2sD3mFJduDQ
>>> 9loQ9+IsWHfQmEHLMLdeNzRXbgBY2djxP2X70fXTg31fx+dYvbWeulYJHiKi
>>> 1fJS4GdbPjoRUp5k4lthk3hDTFD/f5ZuowLDIaexgISb0bIJcObEn9RWlHut
>>> IRVi0fUuRVIX3snGMOKjLmSUe87Od2KSEbULYPTLYDMo/FsWXWHNlP3gVKKd
>>> lQJdxcwXOW7/v5oayY4wiEE6NF4rCupcqt0nPxxmbehmeRPxgkWCKJJs3FNr
>>> VmUdnrdpfxzR5c8dmOELJnpNS6MTT56B8A4kKmqbbHCEKpZ83piG7uwqc+6f
>>> RKkQ
>>> =gp/0
>>> -END PGP SIGNATURE-
>>> 
>>> Robert LeBlanc
>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>> 
>>> 
>>> On Sun, Jan 17, 2016 at 6:31 PM, deeepdish  wrote:
>>>> Hi Everyone,
>>>> 
>>>> Looking for a double check of my logic and crush map..
>>>> 
>>>> Overview:
>>>> 
>>>> - osdgroup bucket type defines failure domain within a host of 5 OSDs + 1
>>>> SSD.   Therefore 5 OSDs (all utilizing the same journal) constitute an
>>>> osdgroup bucket.   Each host has 4 osdgroups.
>>>> - 6 monitors
>>>> - Two node cl

Re: [ceph-users] CRUSH Rule Review - Not replicating correctly

2016-01-18 Thread deeepdish
Thanks Robert.   Will definitely try this.   Is there a way to implement 
“gradual CRUSH” changes?   I noticed that whenever cluster-wide changes are pushed 
(a crush map, for instance), the cluster immediately attempts to align itself, 
disrupting client access / performance…  


> On Jan 18, 2016, at 12:22 , Robert LeBlanc <rob...@leblancnet.us> wrote:
> 
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
> 
> I'm not sure why you have six monitors. Six monitors buys you nothing
> over five monitors other than more power being used, and more latency
> and more headache. See
> http://docs.ceph.com/docs/hammer/rados/configuration/mon-config-ref/#monitor-quorum
> for some more info. Also, I'd consider 5 monitors overkill for this
> size cluster, I'd recommend three.
> 
> Although this is most likely not the root cause of your problem, you
> probably have an error here: "root replicated-T1" is pointing to
> b02s08 and b02s12 and "site erbus" is also pointing to b02s08 and
> b02s12. You probably meant to have "root replicated-T1" pointing to
> erbus instead.
> 
> Where I think your problem is, is in your "rule replicated" section.
> You can try:
> step take replicated-T1
> step choose firstn 2 type host
> step chooseleaf firstn 2 type osdgroup
> step emit
> 
> What this does is choose two hosts from the root replicated-T1 (which
> happens to be both hosts you have), then chooses an OSD from two
> osdgroups on each host.
> 
> I believe the problem with your current rule set is that firstn 0 type
> host tries to select four hosts, but only two are available. You
> should be able to see that with 'ceph pg dump', where only two osds
> will be listed in the up set.
> 
> I hope that helps.
> -BEGIN PGP SIGNATURE-
> Version: Mailvelope v1.3.3
> Comment: https://www.mailvelope.com
> 
> wsFcBAEBCAAQBQJWnR9kCRDmVDuy+mK58QAA5hUP/iJprG4nGR2sJvL//8l+
> V6oLYXTCs8lHeKL3ZPagThE9oh2xDMV37WR3I/xMNTA8735grl8/AAhy8ypW
> MDOikbpzfWnlaL0SWs5rIQ5umATwv73Fg/Mf+K2Olt8IGP6D0NMIxfeOjU6E
> 0Sc3F37nDQFuDEkBYjcVcqZC89PByh7yaId+eOgr7Ot+BZL/3fbpWIZ9kyD5
> KoPYdPjtFruoIpc8DJydzbWdmha65DkB65QOZlI3F3lMc6LGXUopm4OP4sQd
> txVKFtTcLh97WgUshQMSWIiJiQT7+3D6EqQyPzlnei3O3gACpkpsmUteDPpn
> p8CDeJtIpgKnQZjBwfK/bUQXdIGem8Y0x/PC+1ekIhkHCIJeW2sD3mFJduDQ
> 9loQ9+IsWHfQmEHLMLdeNzRXbgBY2djxP2X70fXTg31fx+dYvbWeulYJHiKi
> 1fJS4GdbPjoRUp5k4lthk3hDTFD/f5ZuowLDIaexgISb0bIJcObEn9RWlHut
> IRVi0fUuRVIX3snGMOKjLmSUe87Od2KSEbULYPTLYDMo/FsWXWHNlP3gVKKd
> lQJdxcwXOW7/v5oayY4wiEE6NF4rCupcqt0nPxxmbehmeRPxgkWCKJJs3FNr
> VmUdnrdpfxzR5c8dmOELJnpNS6MTT56B8A4kKmqbbHCEKpZ83piG7uwqc+6f
> RKkQ
> =gp/0
> -END PGP SIGNATURE-
> 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> 
> 
> On Sun, Jan 17, 2016 at 6:31 PM, deeepdish <deeepd...@gmail.com> wrote:
>> Hi Everyone,
>> 
>> Looking for a double check of my logic and crush map..
>> 
>> Overview:
>> 
>> - osdgroup bucket type defines failure domain within a host of 5 OSDs + 1
>> SSD.   Therefore 5 OSDs (all utilizing the same journal) constitute an
>> osdgroup bucket.   Each host has 4 osdgroups.
>> - 6 monitors
>> - Two node cluster
>> - Each node:
>> - 20 OSDs
>> -  4 SSDs
>> - 4 osdgroups
>> 
>> Desired Crush Rule outcome:
>> - Assuming a pool with min_size=2 and size=4, all each node would contain a
>> redundant copy of each object.   Should any of the hosts fail, access to
>> data would be uninterrupted.
>> 
>> Current Crush Rule outcome:
>> - There are 4 copies of each object, however I don’t believe each node has a
>> redundant copy of each object, when a node fails, data is NOT accessible
>> until ceph rebuilds itself / node becomes accessible again.
>> 
>> I susepct my crush is not right, and to remedy it may take some time and
>> cause cluster to be unresponsive / unavailable.Is there a way / method
>> to apply substantial crush changes gradually to a cluster?
>> 
>> Thanks for your help.
>> 
>> 
>> Current crush map:
>> 
>> # begin crush map
>> tunable choose_local_tries 0
>> tunable choose_local_fallback_tries 0
>> tunable choose_total_tries 50
>> tunable chooseleaf_descend_once 1
>> tunable straw_calc_version 1
>> 
>> # devices
>> device 0 osd.0
>> device 1 osd.1
>> device 2 osd.2
>> device 3 osd.3
>> device 4 osd.4
>> device 5 osd.5
>> device 6 osd.6
>> device 7 osd.7
>> device 8 osd.8
>> device 9 osd.9
>> device 10 osd.10
>> device 11 osd.11
>> device 12 osd.12
>> device 13 osd.13
>

[ceph-users] CRUSH Rule Review - Not replicating correctly

2016-01-17 Thread deeepdish
Hi Everyone,

Looking for a double check of my logic and crush map..

Overview:

- The osdgroup bucket type defines a failure domain within a host: 5 OSDs + 1 SSD.  
 Therefore 5 OSDs (all utilizing the same journal) constitute an osdgroup 
bucket.   Each host has 4 osdgroups.
- 6 monitors
- Two node cluster
- Each node:
-   20 OSDs
-   4 SSDs
-   4 osdgroups

Desired Crush Rule outcome:
- Assuming a pool with min_size=2 and size=4, each node would contain a 
redundant copy of each object.   Should either of the hosts fail, access to data 
would be uninterrupted.   

Current Crush Rule outcome:
- There are 4 copies of each object; however, I don’t believe each node holds a 
full set of copies. When a node fails, data is NOT accessible until 
ceph rebuilds itself / the node becomes accessible again.

I suspect my crush map is not right, and remedying it may take some time and cause 
the cluster to be unresponsive / unavailable.    Is there a way / method to apply 
substantial crush changes gradually to a cluster?   
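
The only lever I'm aware of so far is throttling recovery around the change; a 
sketch of what I mean, with example values:

ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'
# ... inject the new crush map and let it settle ...
ceph tell osd.* injectargs '--osd-max-backfills 10 --osd-recovery-max-active 15'   # or your normal values

That only slows the disruption down rather than making it truly gradual, hence 
the question.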

Thanks for your help.


Current crush map:

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable straw_calc_version 1

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
device 6 osd.6
device 7 osd.7
device 8 osd.8
device 9 osd.9
device 10 osd.10
device 11 osd.11
device 12 osd.12
device 13 osd.13
device 14 osd.14
device 15 osd.15
device 16 osd.16
device 17 osd.17
device 18 osd.18
device 19 osd.19
device 20 osd.20
device 21 osd.21
device 22 osd.22
device 23 osd.23
device 24 osd.24
device 25 osd.25
device 26 osd.26
device 27 osd.27
device 28 osd.28
device 29 osd.29
device 30 osd.30
device 31 osd.31
device 32 osd.32
device 33 osd.33
device 34 osd.34
device 35 osd.35
device 36 osd.36
device 37 osd.37
device 38 osd.38
device 39 osd.39

# types
type 0 osd
type 1 osdgroup
type 2 host
type 3 rack
type 4 site
type 5 root

# buckets
osdgroup b02s08-osdgroupA {
id -81  # do not change unnecessarily
# weight 18.100
alg straw
hash 0  # rjenkins1
item osd.0 weight 3.620
item osd.1 weight 3.620
item osd.2 weight 3.620
item osd.3 weight 3.620
item osd.4 weight 3.620
}
osdgroup b02s08-osdgroupB {
id -82  # do not change unnecessarily
# weight 18.100
alg straw
hash 0  # rjenkins1
item osd.5 weight 3.620
item osd.6 weight 3.620
item osd.7 weight 3.620
item osd.8 weight 3.620
item osd.9 weight 3.620
}
osdgroup b02s08-osdgroupC {
id -83  # do not change unnecessarily
# weight 19.920
alg straw
hash 0  # rjenkins1
item osd.10 weight 3.620
item osd.11 weight 3.620
item osd.12 weight 3.620
item osd.13 weight 3.620
item osd.14 weight 5.440
}
osdgroup b02s08-osdgroupD {
id -84  # do not change unnecessarily
# weight 19.920
alg straw
hash 0  # rjenkins1
item osd.15 weight 3.620
item osd.16 weight 3.620
item osd.17 weight 3.620
item osd.18 weight 3.620
item osd.19 weight 5.440
}
host b02s08 {
id -80  # do not change unnecessarily
# weight 76.040
alg straw
hash 0  # rjenkins1
item b02s08-osdgroupA weight 18.100
item b02s08-osdgroupB weight 18.100
item b02s08-osdgroupC weight 19.920
item b02s08-osdgroupD weight 19.920
}
osdgroup b02s12-osdgroupA {
id -121 # do not change unnecessarily
# weight 18.100
alg straw
hash 0  # rjenkins1
item osd.20 weight 3.620
item osd.21 weight 3.620
item osd.22 weight 3.620
item osd.23 weight 3.620
item osd.24 weight 3.620
}
osdgroup b02s12-osdgroupB {
id -122 # do not change unnecessarily
# weight 18.100
alg straw
hash 0  # rjenkins1
item osd.25 weight 3.620
item osd.26 weight 3.620
item osd.27 weight 3.620
item osd.28 weight 3.620
item osd.29 weight 3.620
}
osdgroup b02s12-osdgroupC {
id -123 # do not change unnecessarily
# weight 19.920
alg straw
hash 0  # rjenkins1
item osd.30 weight 3.620
item osd.31 weight 3.620
item osd.32 weight 3.620
item osd.33 weight 3.620
item osd.34 weight 5.440
}
osdgroup b02s12-osdgroupD {
id -124 # do not change unnecessarily
# weight 19.920
alg straw
hash 0  # rjenkins1
item osd.35 weight 3.620
item osd.36 weight 3.620
item osd.37 weight 3.620
item osd.38 weight 3.620
item osd.39 weight 5.440
}
host b02s12 {
id -120 # do not change unnecessarily
# weight 76.040
alg straw
  

Re: [ceph-users] Help! OSD host failure - recovery without rebuilding OSDs

2015-12-28 Thread deeepdish
Hi Josef,

Yes, everything came back to normal.   Thanks for following up!


> On Dec 28, 2015, at 11:25 , Josef Johansson <jose...@gmail.com> wrote:
> 
> Did you manage to work this out?
> 
> On 25 Dec 2015 9:33 am, "Josef Johansson" <jose...@gmail.com 
> <mailto:jose...@gmail.com>> wrote:
> Hi
> 
> Someone here will probably lay out a detailed answer but to get you started,
> 
> All the details for the osd are in the xfs partitions, mirror a new USB key 
> and change ip etc and you should be able to recover.
> 
> If the journal is linked to a /dev/sdx, make sure it's in the same spot as it 
> was before..
> 
> All the best of luck
> /Josef
> 
> On 25 Dec 2015 05:39, "deeepdish" <deeepd...@gmail.com 
> <mailto:deeepd...@gmail.com>> wrote:
> Hello,
> 
> Had an interesting issue today.
> 
> My OSD hosts are booting off a USB key which, you guessed it has a root 
> partition on there.   All OSDs are mounted.   My USB key failed on one of my 
> OSD hosts, leaving the data on OSDs inaccessible to the rest of my cluster.   
> I have multiple monitors running other OSD hosts where data can be recovered 
> to.   However I’m wondering if there’s a way to “restore” / “rebuild” the 
> ceph install that was on this host without having all OSDs resync again.
> 
> Lesson learned = don’t use USB boot/root drives.   However, now just looking 
> at what needs to be done once the OS and Ceph packages are reinstalled.
> 
> Thank you.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> <http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com>

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Help! OSD host failure - recovery without rebuilding OSDs

2015-12-24 Thread deeepdish
Hello,

Had an interesting issue today. 

My OSD hosts are booting off a USB key which, you guessed it, has the root 
partition on it.   All OSDs are mounted.   The USB key failed on one of my 
OSD hosts, leaving the data on its OSDs inaccessible to the rest of my cluster.   I 
have multiple monitors running on other OSD hosts where data can be recovered to.  
 However I’m wondering if there’s a way to “restore” / “rebuild” the ceph 
install that was on this host without having all OSDs resync again.

Lesson learned = don’t use USB boot/root drives.   However, now I'm just looking at 
what needs to be done once the OS and Ceph packages are reinstalled.   
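
My rough plan for that part, in case someone can sanity-check it (assumes 
/etc/ceph/ceph.conf and the keyrings get copied back from another node first, and 
that the OSD data partitions themselves are intact):

ceph-disk activate /dev/sdb1        # per prepared data partition (device name is an example), or:
ceph-disk activate-all              # scan and start every prepared OSD partition it finds
ceph osd tree                       # the OSDs should rejoin with their old ids and weights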

Thank you.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [SOLVED] Monitor rename / recreate issue -- probing state

2015-12-21 Thread deeepdish
Hello Joao,

Thanks for your help.I increased logging on the failed monitor and noticed 
a lot of cephx authentication errors.   After verifying ntp sync, I noticed 
that the monitor keyring deployed on working monitors differed from what was 
stored in the management server’s ceph.mon.keyring.   Syncing the key and 
redeploying monitors got them to peer and establish quorum.
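
For the archive, the comparison that exposed it was roughly this (paths are the 
stock ceph-deploy locations; hostnames as in the thread below):

ssh smon01 sudo cat /var/lib/ceph/mon/ceph-smon01/keyring   # key a healthy mon is actually running with
cat ceph.mon.keyring                                        # key the management node hands to new mons
# the two differed; after syncing them:
ceph-deploy mon destroy smg01
ceph-deploy mon create smg01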



> On Dec 14, 2015, at 11:10 , deeepdish <deeepd...@gmail.com> wrote:
> 
> Joao,
> 
> Please see below.   I think you’re totally right on:
> 
>> I suspect they may already have this monitor in their map, but either
>> with a different name or a different address -- and are thus ignoring
>> probes from a peer that does not match what they are expecting.
> 
> 
> The monitor in question has been previously working (quorum).   It was 
> removed and now attempting to re-add using a different IP address as per 
> public procedure:  
> http://docs.ceph.com/docs/master/rados/operations/add-or-rm-mons/ 
> <http://docs.ceph.com/docs/master/rados/operations/add-or-rm-mons/>   (I 
> followed the 'CHANGING A MONITOR’S IP ADDRESS (THE RIGHT WAY)’ procedure)
> 
> #  ceph --cluster=ceph --admin-daemon /var/run/ceph/ceph-mon.smg01.asok 
> mon_status
> {
> "name": "smg01",
> "rank": 0,
> "state": "probing",
> "election_epoch": 0,
> "quorum": [],
> "outside_quorum": [
> "smg01"
> ],
> "extra_probe_peers": [
> "10.20.1.8:6789\/0",
> "10.20.10.251:6789\/0",
> "10.20.10.252:6789\/0"
> ],
> "sync_provider": [],
> "monmap": {
> "epoch": 0,
> "fsid": "693834c1-1f95-4237-ab97-a767b0c0e6e7",
> "modified": "0.00",
> "created": "0.00",
> "mons": [
> {
> "rank": 0,
> "name": "smg01",
> "addr": "10.20.10.250:6789\/0"
> },
> {
> "rank": 1,
> "name": "smon01s",
> "addr": "0.0.0.0:0\/1"
> },
> {
> "rank": 2,
> "name": "smon02s",
> "addr": "0.0.0.0:0\/2"
> },
> {
> "rank": 3,
> "name": "b02s08",
> "addr": "0.0.0.0:0\/3"
> }
> ]
> }
> }
> 
> 
> # ceph --cluster=ceph --admin-daemon /var/run/ceph/ceph-mon.smon01.asok 
> mon_status
> {
> "name": "smon01",
> "rank": 1,
> "state": "peon",
> "election_epoch": 2702,
> "quorum": [
> 0,
> 1,
> 2
> ],
> "outside_quorum": [],
> "extra_probe_peers": [],
> "sync_provider": [],
> "monmap": {
> "epoch": 12,
> "fsid": "693834c1-1f95-4237-ab97-a767b0c0e6e7",
> "modified": "2015-12-09 06:23:43.665100",
> "created": "0.00",
> "mons": [
> {
> "rank": 0,
> "name": "b02s08",
> "addr": "10.20.1.8:6789\/0"
> },
> {
> "rank": 1,
> "name": "smon01",
> "addr": "10.20.10.251:6789\/0"
> },
> {
> "rank": 2,
> "name": "smon02",
> "addr": "10.20.10.252:6789\/0"
> }
> ]
> }
> }
> 
> # ceph --cluster=ceph --admin-daemon /var/run/ceph/ceph-mon.smon02.asok 
> mon_status
> {
> "name": "smon02",
> "rank": 2,
> "state": "peon",
> "election_epoch": 2702,
> "quorum": [
> 0,
> 1,
> 2
> ],
> "outside_quorum": [],
> "extra_probe_peers": [],
> "sync_provider": [],
> "monmap": {
>   

Re: [ceph-users] Monitor rename / recreate issue -- probing state

2015-12-13 Thread deeepdish
On 12/10/2015 04:00 AM, deeepdish wrote:
> Hello,
> 
> I encountered a strange issue when rebuilding monitors reusing same
> hostnames, however different IPs.
> 
> Steps to reproduce:
> 
> - Build monitor using ceph-deploy create mon 
> - Remove monitor
> via http://docs.ceph.com/docs/master/rados/operations/add-or-rm-mons/ 
> <http://docs.ceph.com/docs/master/rados/operations/add-or-rm-mons/>
> (remove monitor) - I didn't realize there was a ceph-deploy mon destroy
> command at this point.
> - Build a new monitor on same hardware using ceph-deploy create mon
>   # reason = to rename / change IP of monitor as per above link
> - Monitor ends up in probing mode.   When connecting via the admin
> socket, I see that there are no peers avail.   
> 
> The above behavior of only when reinstalling monitors.   I even tried
> reinstalling the OS, however there?s a monmap embedded somewhere causing
> the previous monitor hostnames / IPs to conflict with the new monitor?s
> peering ability.  

> [root@b02s08 ~]# 
> 
> On a reinstalled (not working) monitor:
> 
> sudo ceph --cluster=ceph --admin-daemon
> /var/run/ceph/ceph-mon.smg01.asok mon_status
> {
>"name": "smg01",
>"rank": 0,
>"state": "probing",
>"election_epoch": 0,
>"quorum": [],
>"outside_quorum": [
>"smg01"
>],
>"extra_probe_peers": [
>"10.20.1.8:6789\/0",
>"10.20.10.14:6789\/0",
>"10.20.10.16:6789\/0",
>"10.20.10.18:6789\/0",
>"10.20.10.251:6789\/0",
>"10.20.10.252:6789\/0"
>],
[snip]
> }


> 
> This appears to be consistent with a wrongly populated 'mon_host' and
> 'mon_initial_members' in your ceph.conf.
> 
>  -Joao



Thanks Joao.   I had a look but my other 3 monitors are working just fine.   To 
be clear, I’ve confirmed the same behaviour on other monitor nodes that have 
been removed from the cluster and rebuilt with a new IP (but the same name).

[global]
fsid = (hidden)
mon_initial_members = smg01, smon01s, smon02s, b02s08
mon_host = 10.20.10.250, 10.20.10.251, 10.20.10.252, 10.20.1.8
public network = 10.20.10.0/24, 10.20.1.0/24
cluster network = 10.20.41.0/24

. . . 

[mon.smg01s]
#host = smg01s.erbus.kupsta.net
host = smg01s
addr = 10.20.10.250:6789

[mon.smon01s]
#host = smon01s.erbus.kupsta.net
host = smon01s
addr = 10.20.10.251:6789

[mon.smon02s]
#host = smon02s.erbus.kupsta.net
host = smon02s
addr = 10.20.10.252:6789

[mon.b02s08]
#host = b02s08.erbus.kupsta.net
host = b02s08
addr = 10.20.1.8:6789

# sudo ceph --cluster=ceph --admin-daemon /var/run/ceph/ceph-mon.smg01.asok 
mon_status
{
"name": "smg01",
"rank": 0,
"state": "probing",
"election_epoch": 0,
"quorum": [],
"outside_quorum": [
"smg01"
],
"extra_probe_peers": [
"10.20.1.8:6789\/0",
"10.20.10.251:6789\/0",
"10.20.10.252:6789\/0"
],
"sync_provider": [],
"monmap": {
"epoch": 0,
"fsid": “(hidden)",
"modified": "0.00",
"created": "0.00",
"mons": [
{
"rank": 0,
"name": "smg01",
"addr": "10.20.10.250:6789\/0"
},
{
"rank": 1,
"name": "smon01s",
"addr": "0.0.0.0:0\/1"
},
{
"rank": 2,
"name": "smon02s",
"addr": "0.0.0.0:0\/2"
},
{
"rank": 3,
"name": "b02s08",
"addr": "0.0.0.0:0\/3"
}
]
}
}

Processes running on the monitor node that’s in probing state:

# ps -ef | grep ceph
root  1140 1  0 Dec11 ?00:05:07 python 
/usr/sbin/ceph-create-keys --cluster ceph -i smg01
root  6406 1  0 Dec11 ?00:05:10 python 
/usr/sbin/ceph-create-keys --cluster ceph -i smg01
root  7712 1  0 Dec11 ?00:05:09 python 
/usr/sbin/ceph-create-keys --cluster ceph -i smg01
root  9105 1  0 Dec11 ?00:05:11 python 
/usr/sbin/ceph-create-keys --cluster ceph -i smg01
root 13098 30548  0 07:18 pts/100:00:00 grep --color=auto ceph
root 14243 1  0 Dec11 ?00:05:09 python 
/usr/sbin/ceph-create-keys --cluster ceph -i smg01
root 31222 1  0 05:39 ?00:00:00 /bin/bash -c ulimit -n 32768; 
/usr/bin/ceph-mon -i smg01 --pid-file /var/run/ceph/mon.smg01.pid -c 
/etc/ceph/ceph.conf --cluster ceph -f
root 31226 31222  1 05:39 ?00:01:39 /usr/bin/ceph-mon -i smg01 
--pid-file /var/run/ceph/mon.smg01.pid -c /etc/ceph/ceph.conf --cluster ceph -f
root 31228 1  0 05:39 pts/100:00:15 python 
/usr/sbin/ceph-create-keys --cluster ceph -i smg01

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Monitor rename / recreate issue -- probing state

2015-12-13 Thread deeepdish
Perhaps I’m not understanding something..

The “extra_probe_peers” ARE the other working monitors in quorum out of the 
mon_host line in ceph.conf.

In the example below 10.20.1.8 = b02s08; 10.20.10.251 = smon01s; 10.20.10.252 = 
smon02s

The monitor is not reaching out to the other IPs and syncing.   I’m able to 
ping all IPs in the extra_probe_peers list.
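
Something else I can check is the monmap the stuck daemon is actually holding on 
disk, as opposed to what ceph.conf claims (sketch; the mon daemon has to be 
stopped first):

ceph-mon -i smg01 --extract-monmap /tmp/monmap.smg01
monmaptool --print /tmp/monmap.smg01     # presumably this is where the 0.0.0.0 ghost entries below come from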

# ceph --cluster=ceph --admin-daemon /var/run/ceph/ceph-mon.smg01.asok 
mon_status
{
"name": "smg01",
"rank": 0,
"state": "probing",
"election_epoch": 0,
"quorum": [],
"outside_quorum": [
"smg01"
],
"extra_probe_peers": [
"10.20.1.8:6789\/0",
"10.20.10.251:6789\/0",
"10.20.10.252:6789\/0"
],
"sync_provider": [],
"monmap": {
"epoch": 0,
"fsid": "693834c1-1f95-4237-ab97-a767b0c0e6e7",
"modified": "0.00",
"created": "0.00",
"mons": [
{
"rank": 0,
"name": "smg01",
"addr": "10.20.10.250:6789\/0"
},
{
"rank": 1,
"name": "smon01s",
"addr": "0.0.0.0:0\/1"
},
{
"rank": 2,
    "name": "smon02s",
"addr": "0.0.0.0:0\/2"
},
{
"rank": 3,
"name": "b02s08",
"addr": "0.0.0.0:0\/3"
}
]
}
}


> On Dec 13, 2015, at 19:18 , Joao Eduardo Luis <j...@suse.de> wrote:
> 
> On 12/13/2015 12:26 PM, deeepdish wrote:
>>> 
>>> This appears to be consistent with a wrongly populated 'mon_host' and
>>> 'mon_initial_members' in your ceph.conf.
>>> 
>>> -Joao
>> 
>> 
>> Thanks Joao.   I had a look but my other 3 monitors are working just
>> fine.   To be clear, I’ve confirmed the same behaviour on other monitor
>> nodes that have been removed from the cluster and rebuild with a new IP
>> (however same name).
> 
> I'm not entirely sure what you mean, but let me clarify what I meant a bit.
> 
> Existing monitors take their monmap from their own stores. All monitors
> in a quorum will see the same monmap. Existing monitors do not care
> about the configuration file for their monmap.
> 
> 'mon_host' and 'mon_initial_members' are only used by clients trying to
> reach the monitors AND when creating a new monitor.
> 
> Therefore, when creating a new monitor, 'mon_host' must contain the ips
> of the existing monitors PLUS the monitor you are creating, and
> 'mon_initial_members' must contain the hosts of the existing monitors
> PLUS the host of the monitor you are creating.
> 
> Your initial email reflected a lot of other ips on the
> 'extra_probe_peers' (which is basically the contents of mon_host during
> the probing phase, while the monitor tries to find the other monitors),
> which is consistent with mon_host being wrongly populated.
> 
>  -Joao

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Monitor rename / recreate issue -- probing state

2015-12-09 Thread deeepdish
Hello,

I encountered a strange issue when rebuilding monitors that reuse the same 
hostnames but different IPs.

Steps to reproduce:

- Build monitor using ceph-deploy mon create 
- Remove monitor via 
http://docs.ceph.com/docs/master/rados/operations/add-or-rm-mons/ (remove 
monitor) — I didn’t realize there was a ceph-deploy mon destroy command at this 
point.
- Build a new monitor on the same hardware using ceph-deploy mon create 
  # reason = to rename / change IP of monitor as per above link
- Monitor ends up in probing mode.   When connecting via the admin socket, I 
see that there are no peers avail.   

The above behavior occurs only when reinstalling monitors.   I even tried 
reinstalling the OS, however there’s a monmap embedded somewhere causing the 
previous monitor hostnames / IPs to conflict with the new monitor’s peering 
ability.  

On a working monitor:

# sudo ceph --cluster=ceph --admin-daemon /var/run/ceph/ceph-mon.b02s08.asok 
mon_status
{
"name": "b02s08",
"rank": 0,
"state": "leader",
"election_epoch": 2618,
"quorum": [
0,
1,
2
],
"outside_quorum": [],
"extra_probe_peers": [
"10.20.10.14:6789\/0",
"10.20.10.16:6789\/0"
],
"sync_provider": [],
"monmap": {
"epoch": 12,
"fsid": "693834c1-1f95-4237-ab97-a767b0c0e6e7",
"modified": "2015-12-09 06:23:43.665100",
"created": "0.00",
"mons": [
{
"rank": 0,
"name": "b02s08",
"addr": "10.20.1.8:6789\/0"
},
{
"rank": 1,
"name": "smon01",
"addr": "10.20.10.251:6789\/0"
},
{
"rank": 2,
"name": "smon02",
"addr": "10.20.10.252:6789\/0"
}
]
}
}

[root@b02s08 ~]# 

On a reinstalled (not working) monitor:

# sudo ceph --cluster=ceph --admin-daemon /var/run/ceph/ceph-mon.smg01.asok 
mon_status
{
"name": "smg01",
"rank": 0,
"state": "probing",
"election_epoch": 0,
"quorum": [],
"outside_quorum": [
"smg01"
],
"extra_probe_peers": [
"10.20.1.8:6789\/0",
"10.20.10.14:6789\/0",
"10.20.10.16:6789\/0",
"10.20.10.18:6789\/0",
"10.20.10.251:6789\/0",
"10.20.10.252:6789\/0"
],
"sync_provider": [],
"monmap": {
"epoch": 0,
"fsid": "693834c1-1f95-4237-ab97-a767b0c0e6e7",
"modified": "0.00",
"created": "0.00",
"mons": [
{
"rank": 0,
"name": "smg01",
"addr": "10.20.10.250:6789\/0"
},
{
"rank": 1,
"name": "b02vm14s",
"addr": "0.0.0.0:0\/1"
},
{
"rank": 2,
"name": "b02vm16s",
"addr": "0.0.0.0:0\/2"
},
{
"rank": 3,
"name": "b02s18s",
"addr": "0.0.0.0:0\/3"
},
{
"rank": 4,
"name": "smon01s",
"addr": "0.0.0.0:0\/4"
},
{
"rank": 5,
"name": "smon02s",
"addr": "0.0.0.0:0\/5"
},
{
"rank": 6,
"name": "b02s08",
"addr": "0.0.0.0:0\/6"
}
]
}
}


How can I correct this?
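
Before anything else I want to compare the monmap the quorum agrees on with the 
epoch-0 map the rebuilt mon shows above; a sketch:

ceph mon getmap -o /tmp/monmap           # run against the healthy quorum
monmaptool --print /tmp/monmap           # should list only the real mons and their current addresses

If they disagree, I understand the add-or-rm-mons page's --extract-monmap / 
--inject-monmap procedure is the way to replace the stale copy, which is 
presumably the "monmap embedded somewhere" mentioned above.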

Thanks.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] XFS calltrace exporting RBD via NFS

2015-11-09 Thread deeepdish
Hello,

This is the second time I experienced this, so I thought I'd post to get some 
perspective.   When this first happened, I suspected the kernel, and upgraded 
from 3.18.22 to 3.18.23.

Scenario:

- lab scenario
- single osd host — osdhost01.   Supermicro X8DTE-F - 2x X5570 + 48G RAM + 20x 
4TB OSD + SSD journals — 5:1 osd:journal ratio.   All OSDs stable, no active 
rebuilds / scrubbing, etc.
- osdhost01 also runs a mon
- Separate mon-only node  — two monitors in cluster.
- osdhost01 maps rbds, exports via NFS (rough sketch of the export below).
- rbds = xfs formatted.
- osds = btrfs formatted.
- ceph 0.94.5
- calltrace only on NFS use.   I haven’t observed a specific trigger IO pattern.
- When the rbds are exported via block methods (e.g. SCST) instead, things are stable.
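
For reference, the export path in question is nothing exotic; roughly this 
(pool, image and paths are made up):

rbd map rbd/archive01                 # shows up as /dev/rbd0
mkfs.xfs /dev/rbd0
mount /dev/rbd0 /srv/archive01
echo '/srv/archive01 10.20.10.0/24(rw,sync,no_root_squash)' >> /etc/exports
exportfs -ra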

Relevant / recent log entries — nothing significant preceding the below:

- the system returned to normal after a reboot.
—

Nov  9 06:56:11 osdhost01 bash: 2015-11-09 06:56:11.209312 7f60a6669700 -1 
osd.10 504 heartbeat_check: no reply from osd.19 since back 2015-11-09 
06:55:51.113587 front 2015-11-09 06:55:55.814376 (cutoff 2015-11-09 
06:55:51.209310)
Nov  9 06:56:11 osdhost01 bash: 2015-11-09 06:56:11.424398 7f608a0d5700 -1 
osd.10 504 heartbeat_check: no reply from osd.19 since back 2015-11-09 
06:55:51.113587 front 2015-11-09 06:55:55.814376 (cutoff 2015-11-09 
06:55:51.424395)
Nov  9 06:56:11 osdhost01 bash: 2015-11-09 06:56:11.982586 7fe3d2cf0700 -1 
osd.13 504 heartbeat_check: no reply from osd.19 since back 2015-11-09 
06:55:51.670920 front 2015-11-09 06:55:51.670920 (cutoff 2015-11-09 
06:55:51.982583)
Nov  9 06:56:12 osdhost01 bash: 2015-11-09 06:56:12.287997 7fe3ee2f1700 -1 
osd.13 504 heartbeat_check: no reply from osd.19 since back 2015-11-09 
06:55:51.670920 front 2015-11-09 06:55:51.670920 (cutoff 2015-11-09 
06:55:52.287995)
Nov  9 06:56:12 osdhost01 bash: 2015-11-09 06:56:12.290600 7f60a6669700 -1 
osd.10 504 heartbeat_check: no reply from osd.19 since back 2015-11-09 
06:55:51.113587 front 2015-11-09 06:55:55.814376 (cutoff 2015-11-09 
06:55:52.290598)
Nov  9 06:56:12 osdhost01 bash: 2015-11-09 06:56:12.675519 7f6acff5c700 -1 
osd.7 504 heartbeat_check: no reply from osd.19 since back 2015-11-09 
06:55:51.795557 front 2015-11-09 06:55:51.795557 (cutoff 2015-11-09 
06:55:52.675515)
Nov  9 06:56:12 osdhost01 bash: 2015-11-09 06:56:12.763544 7f03b2c1d700 -1 
osd.11 504 heartbeat_check: no reply from osd.19 since back 2015-11-09 
06:55:52.641513 front 2015-11-09 06:55:52.641513 (cutoff 2015-11-09 
06:55:52.763540)
Nov  9 06:56:12 osdhost01 bash: 2015-11-09 06:56:12.804794 7f6ab4604700 -1 
osd.7 504 heartbeat_check: no reply from osd.19 since back 2015-11-09 
06:55:51.795557 front 2015-11-09 06:55:51.795557 (cutoff 2015-11-09 
06:55:52.804791)
Nov  9 06:56:12 osdhost01 bash: 2015-11-09 06:56:12.708600 7fe64670f700 -1 
osd.14 504 heartbeat_check: no reply from osd.19 since back 2015-11-09 
06:55:52.701280 front 2015-11-09 06:55:52.701280 (cutoff 2015-11-09 
06:55:52.708597)
Nov  9 06:56:13 osdhost01 bash: 2015-11-09 06:56:13.056720 7f0395812700 -1 
osd.11 504 heartbeat_check: no reply from osd.19 since back 2015-11-09 
06:55:52.641513 front 2015-11-09 06:55:52.641513 (cutoff 2015-11-09 
06:55:53.056717)
Nov  9 06:56:13 osdhost01 bash: 2015-11-09 06:56:13.127314 7f608a0d5700 -1 
osd.10 504 heartbeat_check: no reply from osd.19 since back 2015-11-09 
06:55:51.113587 front 2015-11-09 06:55:55.814376 (cutoff 2015-11-09 
06:55:53.127302)
Nov  9 06:56:13 osdhost01 bash: 2015-11-09 06:56:13.288281 7fe3ee2f1700 -1 
osd.13 504 heartbeat_check: no reply from osd.19 since back 2015-11-09 
06:55:51.670920 front 2015-11-09 06:55:51.670920 (cutoff 2015-11-09 
06:55:53.288278)
Nov  9 06:56:13 osdhost01 bash: 2015-11-09 06:56:13.290793 7f60a6669700 -1 
osd.10 504 heartbeat_check: no reply from osd.19 since back 2015-11-09 
06:55:51.113587 front 2015-11-09 06:55:55.814376 (cutoff 2015-11-09 
06:55:53.290791)
Nov  9 06:56:13 osdhost01 bash: 2015-11-09 06:56:13.469732 7f7809459700 -1 
osd.3 504 heartbeat_check: no reply from osd.19 since back 2015-11-09 
06:55:52.637819 front 2015-11-09 06:55:52.637819 (cutoff 2015-11-09 
06:55:53.469719)
Nov  9 06:56:13 osdhost01 bash: 2015-11-09 06:56:13.598682 7fe66401a700 -1 
osd.14 504 heartbeat_check: no reply from osd.19 since back 2015-11-09 
06:55:52.701280 front 2015-11-09 06:55:52.701280 (cutoff 2015-11-09 
06:55:53.598678)
Nov  9 06:56:13 osdhost01 bash: 2015-11-09 06:56:13.674622 7f9f225f1700 -1 
osd.9 504 heartbeat_check: no reply from osd.19 since back 2015-11-09 
06:55:52.803479 front 2015-11-09 06:55:52.803479 (cutoff 2015-11-09 
06:55:53.674619)
Nov  9 06:56:13 osdhost01 bash: 2015-11-09 06:56:13.675760 7f6acff5c700 -1 
osd.7 504 heartbeat_check: no reply from osd.19 since back 2015-11-09 
06:55:51.795557 front 2015-11-09 06:55:51.795557 (cutoff 2015-11-09 
06:55:53.675757)
Nov  9 06:56:13 osdhost01 bash: 2015-11-09 06:56:13.683551 7fe3d2cf0700 -1 
osd.13 504 heartbeat_check: no reply from osd.19 since back 2015-11-09 

Re: [ceph-users] hanging nfsd requests on an RBD to NFS gateway

2015-10-23 Thread deeepdish
@John-Paul Robinson:

I’ve also experienced nfs being blocked when serving rbd devices (XFS filesystem).  
In my scenario I had an rbd device mapped on an OSD host and exported over nfs (lab 
scenario).   Log entries below.   Running CentOS 7 w/ 
3.10.0-229.14.1.el7.x86_64.   Next step for me is to compile 3.18.22 and test 
nfs and scst (iscsi / fc).

Oct 22 13:30:01 osdhost01 systemd: Started Session 14 of user root.
Oct 22 13:37:04 osdhost01 kernel: INFO: task nfsd:12672 blocked for more than 
120 seconds.
Oct 22 13:37:04 osdhost01 kernel: "echo 0 > 
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 22 13:37:04 osdhost01 kernel: nfsdD 880627c73680 0 
12672  2 0x0080
Oct 22 13:37:04 osdhost01 kernel: 880bda763b08 0046 
880be73af1c0 880bda763fd8
Oct 22 13:37:04 osdhost01 kernel: 880bda763fd8 880bda763fd8 
880be73af1c0 880627c73f48
Oct 22 13:37:04 osdhost01 kernel: 880c3ff98ae8 0002 
811562e0 880bda763b80
Oct 22 13:37:04 osdhost01 kernel: Call Trace:
Oct 22 13:37:04 osdhost01 kernel: [] ? 
wait_on_page_read+0x60/0x60
Oct 22 13:37:04 osdhost01 kernel: [] io_schedule+0x9d/0x130
Oct 22 13:37:04 osdhost01 kernel: [] sleep_on_page+0xe/0x20
Oct 22 13:37:04 osdhost01 kernel: [] __wait_on_bit+0x60/0x90
Oct 22 13:37:04 osdhost01 kernel: [] 
wait_on_page_bit+0x86/0xb0
Oct 22 13:37:04 osdhost01 kernel: [] ? 
autoremove_wake_function+0x40/0x40
Oct 22 13:37:04 osdhost01 kernel: [] 
filemap_fdatawait_range+0x111/0x1b0
Oct 22 13:37:04 osdhost01 kernel: [] 
filemap_write_and_wait_range+0x3f/0x70
Oct 22 13:37:04 osdhost01 kernel: [] 
xfs_file_fsync+0x66/0x1f0 [xfs]
Oct 22 13:37:04 osdhost01 kernel: [] vfs_fsync_range+0x1d/0x30
Oct 22 13:37:04 osdhost01 kernel: [] nfsd_commit+0xb9/0xe0 
[nfsd]
Oct 22 13:37:04 osdhost01 kernel: [] nfsd4_commit+0x57/0x60 
[nfsd]
Oct 22 13:37:04 osdhost01 kernel: [] 
nfsd4_proc_compound+0x4d7/0x7f0 [nfsd]
Oct 22 13:37:04 osdhost01 kernel: [] nfsd_dispatch+0xbb/0x200 
[nfsd]
Oct 22 13:37:04 osdhost01 kernel: [] 
svc_process_common+0x453/0x6f0 [sunrpc]
Oct 22 13:37:04 osdhost01 kernel: [] svc_process+0x103/0x170 
[sunrpc]
Oct 22 13:37:04 osdhost01 kernel: [] nfsd+0xe7/0x150 [nfsd]
Oct 22 13:37:04 osdhost01 kernel: [] ? nfsd_destroy+0x80/0x80 
[nfsd]
Oct 22 13:37:04 osdhost01 kernel: [] kthread+0xcf/0xe0
Oct 22 13:37:04 osdhost01 kernel: [] ? 
kthread_create_on_node+0x140/0x140
Oct 22 13:37:04 osdhost01 kernel: [] ret_from_fork+0x58/0x90
Oct 22 13:37:04 osdhost01 kernel: [] ? 
kthread_create_on_node+0x140/0x140
Oct 22 13:37:04 osdhost01 kernel: INFO: task kworker/u50:81:15660 blocked for 
more than 120 seconds.
Oct 22 13:37:04 osdhost01 kernel: "echo 0 > 
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 22 13:37:04 osdhost01 kernel: kworker/u50:81  D 880c3fc73680 0 
15660  2 0x0080
Oct 22 13:37:04 osdhost01 kernel: Workqueue: writeback bdi_writeback_workfn 
(flush-252:0)
Oct 22 13:37:04 osdhost01 kernel: 88086deeb738 0046 
880beb6796c0 88086deebfd8
Oct 22 13:37:04 osdhost01 kernel: 88086deebfd8 88086deebfd8 
880beb6796c0 880c3fc73f48
Oct 22 13:37:04 osdhost01 kernel: 88061aec0fc0 880c1bb2dea0 
88061aec0ff0 88061aec0fc0
Oct 22 13:37:04 osdhost01 kernel: Call Trace:
Oct 22 13:37:04 osdhost01 kernel: [] io_schedule+0x9d/0x130
Oct 22 13:37:04 osdhost01 kernel: [] get_request+0x1b5/0x780
Oct 22 13:37:04 osdhost01 kernel: [] ? wake_up_bit+0x30/0x30
Oct 22 13:37:04 osdhost01 kernel: [] blk_queue_bio+0xc6/0x390
Oct 22 13:37:04 osdhost01 kernel: [] 
generic_make_request+0xe2/0x130
Oct 22 13:37:04 osdhost01 kernel: [] submit_bio+0x71/0x150
Oct 22 13:37:04 osdhost01 kernel: [] 
xfs_submit_ioend_bio.isra.12+0x33/0x40 [xfs]
Oct 22 13:37:04 osdhost01 kernel: [] 
xfs_submit_ioend+0xef/0x130 [xfs]
Oct 22 13:37:04 osdhost01 kernel: [] 
xfs_vm_writepage+0x36a/0x5d0 [xfs]
Oct 22 13:37:04 osdhost01 kernel: [] __writepage+0x13/0x50
Oct 22 13:37:04 osdhost01 kernel: [] 
write_cache_pages+0x251/0x4d0
Oct 22 13:37:04 osdhost01 kernel: [] ? 
global_dirtyable_memory+0x70/0x70
Oct 22 13:37:04 osdhost01 kernel: [] 
generic_writepages+0x4d/0x80
Oct 22 13:37:04 osdhost01 kernel: [] 
xfs_vm_writepages+0x43/0x50 [xfs]
Oct 22 13:37:04 osdhost01 kernel: [] do_writepages+0x1e/0x40
Oct 22 13:37:04 osdhost01 kernel: [] 
__writeback_single_inode+0x40/0x220
Oct 22 13:37:04 osdhost01 kernel: [] 
writeback_sb_inodes+0x25e/0x420
Oct 22 13:37:04 osdhost01 kernel: [] 
__writeback_inodes_wb+0x9f/0xd0
Oct 22 13:37:04 osdhost01 kernel: [] wb_writeback+0x263/0x2f0
Oct 22 13:37:04 osdhost01 kernel: [] 
bdi_writeback_workfn+0x1cc/0x460
Oct 22 13:37:04 osdhost01 kernel: [] 
process_one_work+0x17b/0x470
Oct 22 13:37:04 osdhost01 kernel: [] worker_thread+0x11b/0x400
Oct 22 13:37:04 osdhost01 kernel: [] ? 
rescuer_thread+0x400/0x400
Oct 22 13:37:04 osdhost01 kernel: [] kthread+0xcf/0xe0
Oct 22 13:37:04 osdhost01 kernel: [] ? 
kthread_create_on_node+0x140/0x140

[ceph-users] Cache tier full not evicting

2015-09-14 Thread deeepdish
Hi Everyone,

Getting close to cracking my understanding of cache tiering and ec pools.   
Stuck on one anomaly which I do not understand — spent hours reviewing docs 
online and can’t seem to pinpoint what I’m doing wrong.   Referencing 
http://ceph.com/docs/master/rados/operations/cache-tiering/ 


Setup:

Test / PoC Lab environment (not production)

1x [26x OSD/MON host]
1x MON VM

Erasure coded pool consisting of 10 spinning OSDs  (journals on SSDs - 5:1 
spinner:SSD ratio)
Cache tier consisting of 2 SSD OSDs

Issue:

Cache tier is not honoring configured thresholds.   In my particular case, I 
have 2 OSDs in pool ‘cache’ (140G each == 280G total pool capacity).   

Pool cache is configured with replica factor of 2 (size = 2, min size = 1)

Initially I tried the following settings:

ceph osd pool set cache cache_target_dirty_ratio 0.3
ceph osd pool set cache cache_target_full_ratio 0.7
ceph osd pool set cache cache_min_flush_age 1
ceph osd pool set cache cache_min_evict_age 1

My cache tier’s utilization hit 96%+, causing the pool to run out of capacity.

I realized that in a replicated pool, only 1/2 the capacity is available and 
made the following adjustments:

ceph osd pool set cache cache_target_dirty_ratio 0.1
ceph osd pool set cache cache_target_full_ratio 0.3
ceph osd pool set cache cache_min_flush_age 1
ceph osd pool set cache cache_min_evict_age 1

The above implies that 0.3 = 60% of the replicated (2x) pool size and 0.1 = 20% of 
the replicated (2x) pool size.   

Even with the above revised values, I still see the cache tier getting full.  

The cache tier can only be flushed / evicted by manually running the following:

rados -p cache cache-flush-evict-all
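
In case it helps, the tier's current settings and usage as the cluster reports 
them (sketch; 'cache' is my tier pool name):

ceph df                          # per-pool usage; shows how full the tier actually is
ceph osd dump | grep "'cache'"   # the tier pool's line should show cache_mode plus the target_max_* values in effect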

Thank you.







 ___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [SOLVED] Cache tier full not evicting

2015-09-14 Thread deeepdish
Thanks Nick.   That did it!   The cache cleans itself up now.   
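
For anyone hitting this later, the missing piece on my side was roughly the 
following (the byte value is only an example, sized under my 2x-replicated 280G 
raw tier; the ratios then become fractions of target_max_bytes, not of raw pool 
capacity):

ceph osd pool set cache target_max_bytes 120000000000
ceph osd pool set cache cache_target_dirty_ratio 0.4
ceph osd pool set cache cache_target_full_ratio 0.8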

> On Sep 14, 2015, at 11:49 , Nick Fisk <n...@fisk.me.uk> wrote:
> 
> Have you set the target_max_bytes? Otherwise those ratios are not relative to 
> anything, they use the target_max_bytes as a max, not the pool size.
>  
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com 
> <mailto:ceph-users-boun...@lists.ceph.com>] On Behalf Of deeepdish
> Sent: 14 September 2015 16:27
> To: ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
> Subject: [ceph-users] Cache tier full not evicting
>  
> Hi Everyone,
>  
> Getting close to cracking my understanding of cache tiering, and ec pools.   
> Stuck on one anomaly which I do not understand — spent hours reviewing docs 
> online, can’t seem to pin point what I’m doing wrong.   Referencing 
> http://xo4t.mj.am/link/xo4t/no2irn4/1/4BSmK1EUshpYjOdI2VWk4g/aHR0cDovL2NlcGguY29tL2RvY3MvbWFzdGVyL3JhZG9zL29wZXJhdGlvbnMvY2FjaGUtdGllcmluZy8
>  
> <http://xo4t.mj.am/link/xo4t/no2irn4/1/4BSmK1EUshpYjOdI2VWk4g/aHR0cDovL2NlcGguY29tL2RvY3MvbWFzdGVyL3JhZG9zL29wZXJhdGlvbnMvY2FjaGUtdGllcmluZy8>
>  
> Setup:
>  
> Test / PoC Lab environment (not production)
>  
> 1x [26x OSD/MON host]
> 1x MON VM
>  
> Erasure coded pool consisting of 10 spinning OSDs  (journals on SSDs - 5:1 
> spinner:SSD ratio)
> Cache tier consisting of 2 SSD OSDs
>  
> Issue:
>  
> Cache tier is not honoring configured thresholds.   In my particular case, I 
> have 2 OSDs in pool ‘cache’ (140G each == 280G total pool capacity).   
>  
> Pool cache is configured with replica factor of 2 (size = 2, min size = 1)
>  
> Initially I tried the following settings:
>  
> ceph osd pool set cache cache_target_dirty_ratio 0.3
> ceph osd pool set cache cache_target_full_ratio 0.7
> ceph osd pool set cache cache_min_flush_age 1
> ceph osd pool set cache cache_min_evict_age 1
>  
> My cache tier’s utilization hit 96%+, causing the pool to run out of capacity.
>  
> I realized that in a replicated pool, only 1/2 the capacity is available and 
> made the following adjustments:
>  
> ceph osd pool set cache cache_target_dirty_ratio 0.1
> ceph osd pool set cache cache_target_full_ratio 0.3
> ceph osd pool set cache cache_min_flush_age 1
> ceph osd pool set cache cache_min_evict_age 1
>  
> The above implies that 0.3 = 60% of replicated (2x) pool size) and 0.1 = 20% 
> of replicated (2x) pool size.   
>  
> Even with above revised values, I still see the cache tier getting full.  
>  
> The cache tier can only be flushed / evicted by manually running the 
> following:
>  
> rados -p cache cache-flush-evict-all
>  
> Thank you.
>  
>  
>  
>  
>  
>  
>  
>  
> 
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SOLVED: CRUSH odd bucket affinity / persistence

2015-09-13 Thread deeepdish
Thanks Nick.   Looking at the script, it's something along the lines of what I was after. 
 

I just realized that I could create multiple availability-group “hosts”; however, 
your statement is valid that the failure domain is an entire host.
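
For the archive, the relevant bits that ended up in my ceph.conf (the hook path 
is just wherever you drop Wido's script, if you go that route):

[osd]
osd crush update on start = false
# alternatively, keep automatic placement but drive it from a script:
# osd crush location hook = /usr/local/bin/crush-location-hook.sh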

Thanks for all your help everyone.

> On Sep 13, 2015, at 11:47 , Nick Fisk <n...@fisk.me.uk> wrote:
> 
>> -Original Message-
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>> deeepdish
>> Sent: 13 September 2015 02:47
>> To: Johannes Formann <mlm...@formann.de>
>> Cc: ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users] CRUSH odd bucket affinity / persistence
>> 
>> Johannes,
>> 
>> Thank you — "osd crush update on start = false” did the trick.   I wasn’t 
>> aware
>> that ceph has automatic placement logic for OSDs
>> (http://permalink.gmane.org/gmane.comp.file-
>> systems.ceph.user/9035).   This brings up a best practice question..
>> 
>> How is the configuration of OSD hosts with multiple storage types (e.g.
>> spinners + flash/ssd), typically implemented in the field from a crush map /
>> device location perspective?   Preference is for a scale out design.
> 
> I use something based on this script:
> 
> https://gist.github.com/wido/5d26d88366e28e25e23d
> 
> With the crush hook location config value in ceph.conf. You can pretty much 
> place OSD's wherever you like with it.
> 
>> 
>> In addition to the SSDs which are used for a EC cache tier, I’m also 
>> planning a
>> 5:1 ratio of spinners to SSD for journals.   In this case I want to 
>> implement an
>> availability groups within the OSD host itself.
>> 
>> e.g. in a 26-drive chassis, there will be 6 SSDs + 20 spinners.   [2 SSDs for
>> replicated cache tier, 4 SSDs will create 5 availability groups of 5 spinners
>> each]   The idea is to have CRUSH take into account SSD journal failure
>> (affecting 5 spinners).
> 
> By default Ceph will make the host the smallest failure domain, so I'm not 
> sure if there is any benefit to identifying to crush that several OSD's share 
> one journal. Whether you lose 1 OSD or all OSD's from a server, there 
> shouldn't be any difference to the possibility of data loss. Or have I 
> misunderstood your question?
> 
>> 
>> Thanks.
>> 
>> 
>> 
>> On Sep 12, 2015, at 19:11 , Johannes Formann <mlm...@formann.de> wrote:
>> 
>> Hi,
>> 
>> 
>> I’m having a (strange) issue with OSD bucket persistence / affinity on my 
>> test
>> cluster..
>> 
>> The cluster is PoC / test, by no means production.   Consists of a single 
>> OSD /
>> MON host + another MON running on a KVM VM.
>> 
>> Out of 12 OSDs I’m trying to get osd.10 and osd.11 to be part of the ssd
>> bucket in my CRUSH map.   This works fine when either editing the CRUSH
>> map by hand (exporting, decompile, edit, compile, import), or via the ceph
>> osd crush set command:
>> 
>> "ceph osd crush set osd.11 0.140 root=ssd”
>> 
>> I’m able to verify that the OSD / MON host and another MON I have running
>> see the same CRUSH map.
>> 
>> After rebooting OSD / MON host, both osd.10 and osd.11 become part of the
>> default bucket.   How can I ensure that ODSs persist in their configured
>> buckets?
>> 
>> I guess you have set "osd crush update on start = true"
>> (http://ceph.com/docs/master/rados/operations/crush-map/ ) and only the
>> default „root“-entry.
>> 
>> Either fix the „root“-Entry in the ceph.conf or set osd crush update on 
>> start =
>> false.
>> 
>> greetings
>> 
>> Johannes
> 
> 
> 
> 
> 
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] CRUSH odd bucket affinity / persistence

2015-09-12 Thread deeepdish
Hello,

I’m having a (strange) issue with OSD bucket persistence / affinity on my test 
cluster..   

The cluster is PoC / test, by no means production.   Consists of a single OSD / 
MON host + another MON running on a KVM VM.  

Out of 12 OSDs I’m trying to get osd.10 and osd.11 to be part of the ssd bucket 
in my CRUSH map.   This works fine when either editing the CRUSH map by hand 
(exporting, decompile, edit, compile, import), or via the ceph osd crush set 
command:

"ceph osd crush set osd.11 0.140 root=ssd”

I’m able to verify that the OSD / MON host and another MON I have running see 
the same CRUSH map. 

After rebooting the OSD / MON host, both osd.10 and osd.11 become part of the 
default bucket.   How can I ensure that OSDs persist in their configured 
buckets?

Here’s my desired CRUSH map.   This is a PoC and by no means production ready.. 

——

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable straw_calc_version 1

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
device 6 osd.6
device 7 osd.7
device 8 osd.8
device 9 osd.9
device 10 osd.10
device 11 osd.11

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host osdhost {
id -2   # do not change unnecessarily
# weight 36.200
alg straw
hash 0  # rjenkins1
item osd.0 weight 3.620
item osd.1 weight 3.620
item osd.2 weight 3.620
item osd.3 weight 3.620
item osd.4 weight 3.620
item osd.5 weight 3.620
item osd.6 weight 3.620
item osd.7 weight 3.620
item osd.8 weight 3.620
item osd.9 weight 3.620
}
root default {
id -1   # do not change unnecessarily
# weight 36.200
alg straw
hash 0  # rjenkins1
item osdhost weight 36.200
}
host osdhost-ssd {
id -3   # do not change unnecessarily
# weight 0.280
alg straw
hash 0  # rjenkins1
item osd.10 weight 0.140
item osd.11 weight 0.140
}
root ssd {
id -4   # do not change unnecessarily
# weight 0.280
alg straw
hash 0  # rjenkins1
item osdhost-ssd weight 0.280
}

# rules
rule replicated_ruleset {
ruleset 0
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type osd
step emit
}
rule ecpool {
ruleset 1
type erasure
min_size 3
max_size 7
step set_chooseleaf_tries 5
step set_choose_tries 100
step take default
step choose indep 0 type osd
step emit
}
rule cachetier {
ruleset 2
type replicated
min_size 1
max_size 10
step set_chooseleaf_tries 5
step set_choose_tries 100
step take ssd
step chooseleaf firstn 0 type osd
step emit
}

# end crush map

——


ceph osd tree (before reboot)

ID WEIGHT   TYPE NAME   UP/DOWN REWEIGHT PRIMARY-AFFINITY 
-4  0.28000 root ssd  
-30 host osdhost-ssd   
10  0.14000 osd.10   up  1.0  1.0 
11  0.14000 osd.11   up  1.0  1.0 
-1 36.19995 root default  
-2 36.19995 host osdhost   
 0  3.62000 osd.0up  1.0  1.0 
 1  3.62000 osd.1up  1.0  1.0 
 2  3.62000 osd.2up  1.0  1.0 
 3  3.62000 osd.3up  1.0  1.0 
 4  3.62000 osd.4up  1.0  1.0 
 5  3.62000 osd.5up  1.0  1.0 
 6  3.62000 osd.6up  1.0  1.0 
 7  3.62000 osd.7up  1.0  1.0 
 8  3.62000 osd.8up  1.0  1.0 
 9  3.62000 osd.9up  1.0  1.0 


ceph osd tree (after reboot)


[root@osdhost tmp]# ceph osd tree
ID WEIGHT   TYPE NAME   UP/DOWN REWEIGHT PRIMARY-AFFINITY 
-40 root ssd  
-30 host osdhost-ssd   
-1 36.47995 root default  
-2 36.47995 host osdhost   
 0  3.62000 osd.0up  1.0  1.0 
 1  3.62000 osd.1up  1.0  1.0 
 2  3.62000 osd.2up  1.0  1.0 
 3  3.62000 osd.3up  1.0  1.0 
 4  

Re: [ceph-users] CRUSH odd bucket affinity / persistence

2015-09-12 Thread deeepdish
Johannes,

Thank you — "osd crush update on start = false” did the trick.   I wasn’t aware 
that ceph has automatic placement logic for OSDs 
(http://permalink.gmane.org/gmane.comp.file-systems.ceph.user/9035 
).
   This brings up a best practice question..   

How is the configuration of OSD hosts with multiple storage types (e.g. 
spinners + flash/ssd) typically implemented in the field, from a crush map / 
device location perspective?   Preference is for a scale-out design.

In addition to the SSDs which are used for an EC cache tier, I’m also planning a 
5:1 ratio of spinners to SSD for journals.   In this case I want to implement 
availability groups within the OSD host itself.   

e.g. in a 26-drive chassis, there will be 6 SSDs + 20 spinners.   [2 SSDs for the 
replicated cache tier; the other 4 SSDs define 4 availability groups of 5 spinners 
each.]   The idea is to have CRUSH take into account SSD journal failure 
(affecting 5 spinners).   

Thanks.



> On Sep 12, 2015, at 19:11 , Johannes Formann  wrote:
> 
> Hi,
> 
>> I’m having a (strange) issue with OSD bucket persistence / affinity on my 
>> test cluster..   
>> 
>> The cluster is PoC / test, by no means production.   Consists of a single 
>> OSD / MON host + another MON running on a KVM VM.  
>> 
>> Out of 12 OSDs I’m trying to get osd.10 and osd.11 to be part of the ssd 
>> bucket in my CRUSH map.   This works fine when either editing the CRUSH map 
>> by hand (exporting, decompile, edit, compile, import), or via the ceph osd 
>> crush set command:
>> 
>> "ceph osd crush set osd.11 0.140 root=ssd”
>> 
>> I’m able to verify that the OSD / MON host and another MON I have running 
>> see the same CRUSH map. 
>> 
>> After rebooting OSD / MON host, both osd.10 and osd.11 become part of the 
>> default bucket.   How can I ensure that ODSs persist in their configured 
>> buckets?
> 
> I guess you have set "osd crush update on start = true" 
> (http://ceph.com/docs/master/rados/operations/crush-map/ ) and only the 
> default „root“-entry.
> 
> Either fix the „root“-Entry in the ceph.conf or set osd crush update on start 
> = false.
> 
> greetings
> 
> Johannes

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] EC + RBD Possible?

2015-01-07 Thread deeepdish
Hello.

I wasn’t able to obtain a clear answer from my googling and the official Ceph 
docs: are Erasure Coded pools possible / supported for RBD access?   

The idea is to have block (cold) storage for archival purposes.   I would 
access an RBD device and format it as EXT or XFS for block use.    I understand 
that acceleration is possible by using SSDs as a cache tier or for OSD journals.   
Thank you.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Cache Tiering vs. OSD Journal

2015-01-07 Thread deeepdish
Hello.

Quick question RE: cache tiering vs. OSD journals.

As I understand it, SSD acceleration is possible at the pool or OSD level. 

When considering cache tiering, should I still put OSD journals on SSDs, or 
can they be skipped altogether?  

Can a single SSD pool function as a cache tier for multiple pools?

Thank you.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph-deploy Errors - Fedora 21

2014-12-29 Thread deeepdish
Hello.

I’m having an issue with ceph-deploy on Fedora 21.   

- Installed ceph-deploy via 'yum install ceph-deploy'
- created non-root user
- assigned sudo privs as per documentation - 
http://ceph.com/docs/master/rados/deployment/preflight-checklist/
 
$ ceph-deploy install smg01.erbus.kupsta.net 
[ceph_deploy.conf][DEBUG ] found configuration file at: /cephfs/.cephdeploy.conf
[ceph_deploy.cli][INFO  ] Invoked (1.5.20): /bin/ceph-deploy install [hostname]
[ceph_deploy.install][DEBUG ] Installing stable version firefly on cluster ceph 
hosts [hostname]
[ceph_deploy.install][DEBUG ] Detecting platform for host [hostname] ...
[ceph_deploy][ERROR ] RuntimeError: connecting to host: [hostname] resulted in 
errors: TypeError __init__() got an unexpected keyword argument 'detect_sudo'


Thank you.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

