Re: [ceph-users] Monitor Recovery

2018-10-23 Thread Bastiaan Visser
Are you using ceph-deploy?

In that case you could do:
ceph-deploy mon destroy {host-name [host-name]...}
and:
ceph-deploy mon create {host-name [host-name]...}
to recreate it.
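
If the cluster was not deployed with ceph-deploy, the same result can be achieved
manually. A rough sketch (monitor id, cluster name and paths are placeholders;
double-check against the add/remove-monitor procedure for your release):

systemctl stop ceph-mon@<mon-id>
ceph mon remove <mon-id>
mv /var/lib/ceph/mon/<cluster>-<mon-id> /var/lib/ceph/mon/<cluster>-<mon-id>.bak

ceph mon getmap -o /tmp/monmap
ceph auth get mon. -o /tmp/mon.keyring
ceph-mon -i <mon-id> --mkfs --monmap /tmp/monmap --keyring /tmp/mon.keyring
systemctl start ceph-mon@<mon-id>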


- Original Message -
From: "John Petrini" 
To: "ceph-users" 
Sent: Tuesday, October 23, 2018 8:22:44 PM
Subject: [ceph-users] Monitor Recovery

Hi List,

I've got a monitor that won't stay up. It comes up and joins the
cluster but crashes within a couple of minutes with no info in the
logs. At this point I'd prefer to just give up on it and assume it's
in a bad state and recover it from the working monitors. What's the
best way to go about this?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Crushmap and failure domains at rack level (ideally data-center level in the future)

2018-10-23 Thread Bastiaan Visser
Something must be wrong: with min_size 3 the pool should go read-only as soon as
you take out the first rack, probably even when you take out the first host.

What is the output of "ceph osd pool get <pool> min_size"?

I guess it will be 2, since you did not hit a problem while taking out one rack.
As soon as you take out an extra host after that, some PGs will have a size < 2,
so the pool goes read-only and recovery should start. Once all PGs are back to
size 2, the pool would be RW again.
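
For reference, checking (and, if you accept the availability/durability
trade-off, changing) it looks like this, with <pool> as a placeholder:

ceph osd pool get <pool> size
ceph osd pool get <pool> min_size
ceph osd pool set <pool> min_size 2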


From: "Waterbly, Dan"  
To: "ceph-users"  
Sent: Tuesday, October 23, 2018 6:44:52 PM 
Subject: [ceph-users] Crushmap and failure domains at rack level (ideally 
data-center level in the future) 



Hello, 



I want to create a crushmap rule where I can lose two racks of hosts and still 
be able to operate. I have tried the rule below, but it only allows me to 
operate (rados gateway) with one rack down and two racks up. If I lose any host 
in the two remaining racks my rados gateway stops responding. Here is my 
crushmap and rule. If anyone can point out what I am doing wrong it would be 
greatly appreciated. I’m very new to ceph so please forgive any incorrect 
terminology I have used. 




[ceph-users] Monitor Recovery

2018-10-23 Thread John Petrini
Hi List,

I've got a monitor that won't stay up. It comes up and joins the
cluster but crashes within a couple of minutes with no info in the
logs. At this point I'd prefer to just give up on it and assume it's
in a bad state and recover it from the working monitors. What's the
best way to go about this?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Crushmap and failure domains at rack level (ideally data-center level in the future)

2018-10-23 Thread Waterbly, Dan
Hello,

I want to create a crushmap rule where I can lose two racks of hosts and still 
be able to operate. I have tried the rule below, but it only allows me to 
operate (rados gateway) with one rack down and two racks up. If I lose any host 
in the two remaining racks my rados gateway stops responding. Here is my 
crushmap and rule. If anyone can point out what I am doing wrong it would be 
greatly appreciated. I'm very new to ceph so please forgive any incorrect 
terminology I have used.

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class hdd
device 1 osd.1 class hdd
device 2 osd.2 class hdd
device 3 osd.3 class hdd
device 4 osd.4 class hdd
device 5 osd.5 class hdd
device 6 osd.6 class hdd
device 7 osd.7 class hdd
device 8 osd.8 class hdd
device 9 osd.9 class hdd
device 10 osd.10 class hdd
device 11 osd.11 class hdd
device 12 osd.12 class hdd
device 13 osd.13 class hdd
device 14 osd.14 class hdd
device 15 osd.15 class hdd
device 16 osd.16 class hdd
device 17 osd.17 class hdd
device 18 osd.18 class hdd
device 19 osd.19 class hdd
device 20 osd.20 class hdd
device 21 osd.21 class hdd
device 22 osd.22 class hdd
device 23 osd.23 class hdd
device 24 osd.24 class hdd
device 25 osd.25 class hdd
device 26 osd.26 class hdd
device 27 osd.27 class hdd
device 28 osd.28 class hdd
device 29 osd.29 class hdd
device 30 osd.30 class hdd
device 31 osd.31 class hdd
device 32 osd.32 class hdd
device 33 osd.33 class hdd
device 34 osd.34 class hdd
device 35 osd.35 class hdd
device 36 osd.36 class hdd
device 37 osd.37 class hdd
device 38 osd.38 class hdd
device 39 osd.39 class hdd
device 40 osd.40 class hdd
device 41 osd.41 class hdd
device 42 osd.42 class hdd
device 43 osd.43 class hdd
device 44 osd.44 class hdd
device 45 osd.45 class hdd
device 46 osd.46 class hdd
device 47 osd.47 class hdd
device 48 osd.48 class hdd
device 49 osd.49 class hdd
device 50 osd.50 class hdd
device 51 osd.51 class hdd
device 52 osd.52 class hdd
device 53 osd.53 class hdd
device 54 osd.54 class hdd
device 55 osd.55 class hdd
device 56 osd.56 class hdd
device 57 osd.57 class hdd
device 58 osd.58 class hdd
device 59 osd.59 class hdd
device 60 osd.60 class hdd
device 61 osd.61 class hdd
device 62 osd.62 class hdd
device 63 osd.63 class hdd
device 64 osd.64 class hdd
device 65 osd.65 class hdd
device 66 osd.66 class hdd
device 67 osd.67 class hdd
device 68 osd.68 class hdd
device 69 osd.69 class hdd
device 70 osd.70 class hdd
device 71 osd.71 class hdd
device 72 osd.72 class hdd
device 73 osd.73 class hdd
device 74 osd.74 class hdd
device 75 osd.75 class hdd
device 76 osd.76 class hdd
device 77 osd.77 class hdd
device 78 osd.78 class hdd
device 79 osd.79 class hdd
device 80 osd.80 class hdd
device 81 osd.81 class hdd
device 82 osd.82 class hdd
device 83 osd.83 class hdd
device 84 osd.84 class hdd
device 85 osd.85 class hdd
device 86 osd.86 class hdd
device 87 osd.87 class hdd
device 88 osd.88 class hdd
device 89 osd.89 class hdd
device 90 osd.90 class hdd
device 91 osd.91 class hdd
device 92 osd.92 class hdd
device 93 osd.93 class hdd
device 94 osd.94 class hdd
device 95 osd.95 class hdd
device 96 osd.96 class hdd
device 97 osd.97 class hdd
device 98 osd.98 class hdd
device 99 osd.99 class hdd
device 100 osd.100 class hdd
device 101 osd.101 class hdd
device 102 osd.102 class hdd
device 103 osd.103 class hdd
device 104 osd.104 class hdd
device 105 osd.105 class hdd
device 106 osd.106 class hdd
device 107 osd.107 class hdd
device 108 osd.108 class hdd
device 109 osd.109 class hdd
device 110 osd.110 class hdd
device 111 osd.111 class hdd
device 112 osd.112 class hdd
device 113 osd.113 class hdd
device 114 osd.114 class hdd
device 115 osd.115 class hdd
device 116 osd.116 class hdd
device 117 osd.117 class hdd
device 118 osd.118 class hdd
device 119 osd.119 class hdd
device 120 osd.120 class hdd
device 121 osd.121 class hdd
device 122 osd.122 class hdd
device 123 osd.123 class hdd
device 124 osd.124 class hdd
device 125 osd.125 class hdd
device 126 osd.126 class hdd
device 127 osd.127 class hdd
device 128 osd.128 class hdd
device 129 osd.129 class hdd
device 130 osd.130 class hdd
device 131 osd.131 class hdd
device 132 osd.132 class hdd
device 133 osd.133 class hdd
device 134 osd.134 class hdd
device 135 osd.135 class hdd
device 136 osd.136 class hdd
device 137 osd.137 class hdd
device 138 osd.138 class hdd
device 139 osd.139 class hdd
device 140 osd.140 class hdd
device 141 osd.141 class hdd
device 142 osd.142 class hdd
device 143 osd.143 class hdd

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host cephstorage02 {
id -3   # do not change unnecessarily
id -4 class hdd # 
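
For comparison, a replicated rule that uses rack as the failure domain usually
looks roughly like the sketch below (rule name, id and root are illustrative,
not taken from the map above). With pool size 3 it places one copy per rack;
whether I/O continues after losing one or two racks is then governed by the
pool's min_size, as discussed in the reply earlier in this digest.

rule replicated_rack {
        id 1
        type replicated
        min_size 1
        max_size 10
        step take default class hdd
        step chooseleaf firstn 0 type rack
        step emit
}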

Re: [ceph-users] bluestore compression enabled but no data compressed

2018-10-23 Thread Igor Fedotov

Hi Frank,


On 10/23/2018 2:56 PM, Frank Schilder wrote:

Dear David and Igor,

thank you very much for your help. I have one more question about chunk sizes 
and data granularity on bluestore and will summarize the information I got on 
bluestore compression at the end.

1) Compression ratio
---

Following Igor's explanation, I tried to understand the numbers for 
compressed_allocated and compressed_original and am somewhat stuck with 
figuring out how bluestore arithmetic works. I created a 32GB file of zeros 
using dd with write size bs=8M on a cephfs with

 ceph.dir.layout="stripe_unit=4194304 stripe_count=1 object_size=4194304 
pool=con-fs-data-test"

The data pool is an 8+2 erasure coded pool with properties

 pool 37 'con-fs-data-test' erasure size 10 min_size 9 crush_rule 11 
object_hash rjenkins pg_num 900 pgp_num 900 last_change 9970 flags 
hashpspool,ec_overwrites stripe_width 32768 compression_mode aggressive 
application cephfs

As I understand EC pools, a 4M object is split into 8x0.5M data shards that are 
stored together with 2x0.5M coding shards on one OSD each. So, I would expect a 
full object write to put a 512K chunk on each OSD in the PG. Looking at some 
config options of one of the OSDs, I see:

 "bluestore_compression_max_blob_size_hdd": "524288",
 "bluestore_compression_min_blob_size_hdd": "131072",
 "bluestore_max_blob_size_hdd": "524288",
 "bluestore_min_alloc_size_hdd": "65536",

 From this, I would conclude that the largest chunk size is 512K, which also 
equals compression_max_blob_size. The minimum allocation size is 64K for any 
object. What I would expect now is that the full object writes to cephfs
create chunks of 512K per OSD in the PG, meaning that with an all-zero
file I should observe a compressed_allocated ratio of 64K/512K = 0.125 instead of
the 0.5 reported below. It looks like chunks of 128K are written instead
of 512K. I'm happy with the 64K granularity, but the observed maximum chunk
size seems a factor of 4 too small.

Where am I going wrong, what am I overlooking?
Please note how the choice between compression_max_blob_size and
compression_min_blob_size is made.


The max blob size threshold is mainly used for objects tagged with flags
indicating non-random access, e.g. sequential read and/or write,
immutable, append-only, etc.

Here is how it's determined in the code:
  if ((alloc_hints & CEPH_OSD_ALLOC_HINT_FLAG_SEQUENTIAL_READ) &&
  (alloc_hints & CEPH_OSD_ALLOC_HINT_FLAG_RANDOM_READ) == 0 &&
  (alloc_hints & (CEPH_OSD_ALLOC_HINT_FLAG_IMMUTABLE |
  CEPH_OSD_ALLOC_HINT_FLAG_APPEND_ONLY)) &&
  (alloc_hints & CEPH_OSD_ALLOC_HINT_FLAG_RANDOM_WRITE) == 0) {
    dout(20) << __func__ << " will prefer large blob and csum sizes" << dendl;


This is done to minimize the overhead during future random access, since
such access needs full blob decompression.
Hence the min blob size is used for regular random I/O, which is probably
your case as well.
You can check the bluestore log (once its level is raised to 20) to confirm
this, e.g. by looking for the following line in the output:

  dout(20) << __func__ << " prefer csum_order " << wctx->csum_order
   << " target_blob_size 0x" << std::hex << wctx->target_blob_size
   << std::dec << dendl;

So you can simply increase bluestore_compression_min_blob_size_hdd if 
you want longer compressed chunks.

With the above-mentioned penalty on subsequent access though.
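
For example, mirroring the runtime config-set form used elsewhere in this thread
(these are runtime-only changes, and the blob-size setting only affects newly
written data):

ceph tell osd.0 config set debug_bluestore 20/20
ceph tell "osd.*" config set bluestore_compression_min_blob_size_hdd 524288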


2) Bluestore compression configuration
---

If I understand David correctly, pool and OSD settings do *not* override each 
other, but are rather *combined* into a resulting setting as follows. Let

 0 - (n)one
 1 - (p)assive
 2 - (a)ggressive
 3 - (f)orce

 ? - (u)nset

be the 4+1 possible settings of compression modes with numeric values assigned 
as shown. Then, the resulting numeric compression mode for data in a pool on a 
specific OSD is

 res_compr_mode = min(mode OSD, mode pool)

or in form of a table:

            pool
      | n  p  a  f  u
   ---+---------------
    n | n  n  n  n  n
 O  p | n  p  p  p  ?
 S  a | n  p  a  a  ?
 D  f | n  p  a  f  ?
    u | n  ?  ?  ?  u

which would allow for the flexible configuration as mentioned by David below.

I'm actually not sure if I can confirm this. I have some pools where compression_mode is not set 
and which reside on separate OSDs with compression enabled, yet there is compressed data on these 
OSDs. Wondering if I polluted my test with "ceph config set bluestore_compression_mode 
aggressive" that I executed earlier, or if my above interpretation is still wrong. Does the 
setting issued with "ceph config set bluestore_compression_mode aggressive" apply to 
pools with 'compression_mode' not set on the pool (see question marks in table above, what is the 
resulting mode?).

What I would like to do is 

Re: [ceph-users] [ceph-ansible]Purging cluster using ceph-ansible stable 3.1/3.2

2018-10-23 Thread Cody
Hi Mark,

Thank you for pointing out the issue.

The problem is solved after I added "library=
~/.ansible/plugins/modules:/usr/share/ansible/plugins/modules:/root/ceph-ansible/library"
into the /root/ceph-ansible/ansible.cfg file. The "library" key wasn't
there in the first place and the result of running "ansible --version"
from the /root/ceph-ansible directory initially showed the first two
paths only.
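
For anyone hitting the same error: the library key lives in the [defaults]
section, so the relevant stanza in /root/ceph-ansible/ansible.cfg ends up
looking roughly like this:

[defaults]
library = ~/.ansible/plugins/modules:/usr/share/ansible/plugins/modules:/root/ceph-ansible/library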

Thank you very much.

Best regards,
Cody
On Tue, Oct 23, 2018 at 9:51 AM Mark Johnston  wrote:
>
> On Mon, 2018-10-22 at 20:05 -0400, Cody wrote:
> > I tried to purge a ceph cluster using infrastructure-playbooks/purge-
> > cluster.yml from stable 3.1 and stable 3.2 branches, but kept getting the
> > following error immediately:
> >
> > ERROR! no action detected in task. This often indicates a misspelled module
> > name, or incorrect module path.
> >
> > The error appears to have been in '/root/ceph-ansible/infrastructure-
> > playbooks/purge-cluster.yml': line 353, column 5, but may
> > be elsewhere in the file depending on the exact syntax problem.
> >
> > The offending line appears to be:
> >
> >   - name: zap and destroy osds created by ceph-volume with lvm_volumes
> > ^ here
>
> That's Ansible's way of saying "the module referenced in this task doesn't
> exist".  In this case it can't find the ceph_volume module, which is packaged
> with the ceph-ansible distribution.  It should find it if you're running 
> Ansible
> from the /root/ceph-ansible directory.
>
> Try running "ansible --version" and check what's shown for "config file" and
> "configured module search path".  You should have /root/ceph-ansible/library
> on the module search path.
>
>
> Mark
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS hangs in "heartbeat_map" deadlock

2018-10-23 Thread Stefan Kooman
Quoting Patrick Donnelly (pdonn...@redhat.com):
> Thanks for the detailed notes. It looks like the MDS is stuck
> somewhere it's not even outputting any log messages. If possible, it'd
> be helpful to get a coredump (e.g. by sending SIGQUIT to the MDS) or,
> if you're comfortable with gdb, a backtrace of any threads that look
> suspicious (e.g. not waiting on a futex) including `info threads`.

It took a while before the same issue reappeared again ... but we
managed to catch gdb backtraces and strace output. See below pastebin
links. Note: we had difficulty getting the MDSs working again, so we had
to restart them a couple of times, capturing debug output as much as we
can. Hopefully you can squeeze some useful information out of this data.

MDS1:
https://8n1.org/13869/bc3b - Some few minutes after it first started
acting up
https://8n1.org/13870/caf4 - Probably made when I tried to stop the
process and it took too long (process already received SIGKILL)
https://8n1.org/13871/2f22 - After restarting the same issue returned
https://8n1.org/13872/2246 - After restarting the same issue returned

MDS2:
https://8n1.org/13873/f861 - After it went craycray when it became
active
https://8n1.org/13874/c567 - After restarting the same issue returned
https://8n1.org/13875/133a - After restarting the same issue returned

STRACES:
MDS1: https://8n1.org/mds1-strace.zip
MDS2: https://8n1.org/mds2-strace.zip

Gr. Stefan

-- 
| BIT BV  http://www.bit.nl/    Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [ceph-ansible]Purging cluster using ceph-ansible stable 3.1/3.2

2018-10-23 Thread Mark Johnston
On Mon, 2018-10-22 at 20:05 -0400, Cody wrote:
> I tried to purge a ceph cluster using infrastructure-playbooks/purge-
> cluster.yml from stable 3.1 and stable 3.2 branches, but kept getting the
> following error immediately:
> 
> ERROR! no action detected in task. This often indicates a misspelled module
> name, or incorrect module path.
> 
> The error appears to have been in '/root/ceph-ansible/infrastructure-
> playbooks/purge-cluster.yml': line 353, column 5, but may
> be elsewhere in the file depending on the exact syntax problem.
> 
> The offending line appears to be:
> 
>   - name: zap and destroy osds created by ceph-volume with lvm_volumes
> ^ here 

That's Ansible's way of saying "the module referenced in this task doesn't
exist".  In this case it can't find the ceph_volume module, which is packaged
with the ceph-ansible distribution.  It should find it if you're running Ansible
from the /root/ceph-ansible directory.

Try running "ansible --version" and check what's shown for "config file" and
"configured module search path".  You should have /root/ceph-ansible/library
on the module search path.


Mark
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] bluestore compression enabled but no data compressed

2018-10-23 Thread Frank Schilder
Dear David and Igor,

thank you very much for your help. I have one more question about chunk sizes 
and data granularity on bluestore and will summarize the information I got on 
bluestore compression at the end.

1) Compression ratio
---

Following Igor's explanation, I tried to understand the numbers for 
compressed_allocated and compressed_original and am somewhat stuck with 
figuring out how bluestore arithmetic works. I created a 32GB file of zeros 
using dd with write size bs=8M on a cephfs with

ceph.dir.layout="stripe_unit=4194304 stripe_count=1 object_size=4194304 
pool=con-fs-data-test"

The data pool is an 8+2 erasure coded pool with properties

pool 37 'con-fs-data-test' erasure size 10 min_size 9 crush_rule 11 
object_hash rjenkins pg_num 900 pgp_num 900 last_change 9970 flags 
hashpspool,ec_overwrites stripe_width 32768 compression_mode aggressive 
application cephfs

As I understand EC pools, a 4M object is split into 8x0.5M data shards that are 
stored together with 2x0.5M coding shards on one OSD each. So, I would expect a 
full object write to put a 512K chunk on each OSD in the PG. Looking at some 
config options of one of the OSDs, I see:

"bluestore_compression_max_blob_size_hdd": "524288",
"bluestore_compression_min_blob_size_hdd": "131072",
"bluestore_max_blob_size_hdd": "524288",
"bluestore_min_alloc_size_hdd": "65536",

From this, I would conclude that the largest chunk size is 512K, which also
equals compression_max_blob_size. The minimum allocation size is 64K for any
object. What I would expect now is that the full object writes to cephfs
create chunks of 512K per OSD in the PG, meaning that with an all-zero
file I should observe a compressed_allocated ratio of 64K/512K = 0.125 instead
of the 0.5 reported below. It looks like chunks of 128K are written
instead of 512K. I'm happy with the 64K granularity, but the observed maximum
chunk size seems a factor of 4 too small.

Where am I going wrong, what am I overlooking?

2) Bluestore compression configuration
---

If I understand David correctly, pool and OSD settings do *not* override each 
other, but are rather *combined* into a resulting setting as follows. Let

0 - (n)one
1 - (p)assive
2 - (a)ggressive
3 - (f)orce

? - (u)nset

be the 4+1 possible settings of compression modes with numeric values assigned 
as shown. Then, the resulting numeric compression mode for data in a pool on a 
specific OSD is

res_compr_mode = min(mode OSD, mode pool)

or in form of a table:

            pool
      | n  p  a  f  u
   ---+---------------
    n | n  n  n  n  n
 O  p | n  p  p  p  ?
 S  a | n  p  a  a  ?
 D  f | n  p  a  f  ?
    u | n  ?  ?  ?  u

which would allow for the flexible configuration as mentioned by David below.

I'm actually not sure if I can confirm this. I have some pools where 
compression_mode is not set and which reside on separate OSDs with compression 
enabled, yet there is compressed data on these OSDs. Wondering if I polluted my 
test with "ceph config set bluestore_compression_mode aggressive" that I 
executed earlier, or if my above interpretation is still wrong. Does the 
setting issued with "ceph config set bluestore_compression_mode aggressive" 
apply to pools with 'compression_mode' not set on the pool (see question marks 
in table above, what is the resulting mode?).

What I would like to do is enable compression on all OSDs, enable compression 
on all data pools and disable compression on all meta data pools. Data and meta 
data pools might share OSDs in the future. The above table says I should be 
able to do just that by being explicit.
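
Assuming the min()-style combination above is correct, a sketch of that explicit
configuration (pool names are placeholders) would be:

ceph tell "osd.*" config set bluestore_compression_mode aggressive
ceph osd pool set <data-pool> compression_mode aggressive
ceph osd pool set <metadata-pool> compression_mode none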

Many thanks again and best regards,

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Igor Fedotov 
Sent: 19 October 2018 23:41
To: Frank Schilder; David Turner
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] bluestore compression enabled but no data compressed

Hi Frank,


On 10/19/2018 2:19 PM, Frank Schilder wrote:
> Hi David,
>
> sorry for the slow response, we had a hell of a week at work.
>
> OK, so I had compression mode set to aggressive on some pools, but the global 
> option was not changed, because I interpreted the documentation as "pool 
> settings take precedence". To check your advise, I executed
>
>ceph tell "osd.*" config set bluestore_compression_mode aggressive
>
> and dumped a new file consisting of null-bytes. Indeed, this time I observe 
> compressed objects:
>
> [root@ceph-08 ~]# ceph daemon osd.80 perf dump | grep blue
>  "bluefs": {
>  "bluestore": {
>  "bluestore_allocated": 2967207936,
>  "bluestore_stored": 3161981179,
>  "bluestore_compressed": 24549408,
>  "bluestore_compressed_allocated": 261095424,
>  "bluestore_compressed_original": 522190848,
>
> 

Re: [ceph-users] scrub errors

2018-10-23 Thread Sergey Malinin
There is an osd_scrub_auto_repair setting which defaults to 'false'.
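
Enabling it could look like the following (a sketch; note that automatic repair
is capped by osd_scrub_auto_repair_num_errors, and whether you want repairs to
happen without human review is a judgment call):

ceph tell "osd.*" injectargs '--osd_scrub_auto_repair=true'

and, to persist it, in ceph.conf under [osd]:

osd scrub auto repair = true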


> On 23.10.2018, at 12:12, Dominque Roux  wrote:
> 
> Hi all,
> 
> We lately faced several scrub errors.
> All of them were more or less easily fixed with the ceph pg repair X.Y
> command.
> 
> We're using ceph version 12.2.7 and have SSD and HDD pools.
> 
> Is there a way to prevent our datastore from these kinds of errors, or is
> there a way to automate the fix? (It would be rather easy to create a
> bash script.)
> 
> Thank you very much for your help!
> 
> Best regards,
> 
> Dominique
> 
> -- 
> Your Swiss, Open Source and IPv6 Virtual Machine. Now on
> www.datacenterlight.ch
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] scrub errors

2018-10-23 Thread Dominque Roux
Hi all,

We lately faced several scrub errors.
All of them were more or less easily fixed with the ceph pg repair X.Y
command.

We're using ceph version 12.2.7 and have SSD and HDD pools.

Is there a way to prevent our datastore from these kinds of errors, or is
there a way to automate the fix? (It would be rather easy to create a
bash script.)

Thank you very much for your help!

Best regards,

Dominique

-- 
Your Swiss, Open Source and IPv6 Virtual Machine. Now on
www.datacenterlight.ch
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RGW stale buckets

2018-10-23 Thread Janne Johansson
When you run rgw it creates a ton of pools, so one of the other pools
was holding the indexes of what buckets there are, while the actual
data is what got stored in default.rgw.data (or whatever name it had).
So that cleanup was not complete, and this is what causes your issues,
I'd say.

How to move from here depends on how much work/data you have put into
the badly-cleaned-pools and if you can redo the last part again after
a good clean restart.
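
A few radosgw-admin commands that may help to see what bucket metadata is left
over (a general sketch, not specific to this cluster; the last one removes a
bucket and its index entries, and may behave oddly if the backing data pool was
recreated underneath it):

radosgw-admin bucket list
radosgw-admin metadata list bucket
radosgw-admin bucket stats --bucket=<name>
radosgw-admin bucket rm --bucket=<name> --purge-objects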

On Tue, 23 Oct 2018 at 00:27, Robert Stanford  wrote:
>
>
>  Someone deleted our rgw data pool to clean up.  They recreated it afterward. 
>  This is fine in one respect, we don't need the data.  But listing with 
> radosgw-admin still shows all the buckets.  How can we clean things up and 
> get rgw to understand what actually exists, and what doesn't?
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] slow requests and degraded cluster, but not really ?

2018-10-23 Thread Ben Morrice

Hello all,

We have an issue with our ceph cluster where 'ceph -s' shows that
several requests are blocked; however, querying further with 'ceph health
detail' indicates that the affected PGs are either active+clean or do
not currently exist.
OSD 32 appears to be working fine, and the cluster is performing as 
expected with no clients seemingly affected.


Note - we had just upgraded to Luminous - and despite having "mon max pg 
per osd = 400" set in ceph.conf, we still have the message "too many PGs 
per OSD (278 > max 200)"


In order to improve the situation above, I removed several pools that 
were not used anymore. I assume the PGs that ceph cannot find now are 
related to this pool deletion.


Does anyone have any ideas on how to get out of this state?

Details below - and full 'ceph health detail' attached to this email.

Kind regards,

Ben Morrice

[root@ceph03 ~]# ceph -s
  cluster:
    id: 6c21c4ba-9c4d-46ef-93a3-441b8055cdc6
    health: HEALTH_WARN
    Degraded data redundancy: 443765/14311983 objects degraded 
(3.101%), 162 pgs degraded, 241 pgs undersized

    75 slow requests are blocked > 32 sec. Implicated osds 32
    too many PGs per OSD (278 > max 200)

  services:
    mon: 5 daemons, quorum bbpocn01,bbpocn02,bbpocn03,bbpocn04,bbpocn07
    mgr: bbpocn03(active, starting)
    osd: 36 osds: 36 up, 36 in
    rgw: 1 daemon active

  data:
    pools:   24 pools, 3440 pgs
    objects: 4.77M objects, 7.69TiB
    usage:   23.1TiB used, 104TiB / 127TiB avail
    pgs: 443765/14311983 objects degraded (3.101%)
 3107 active+clean
 170  active+undersized
 109  active+undersized+degraded
 43   active+recovery_wait+degraded
 10   active+recovering+degraded
 1    active+recovery_wait

[root@ceph03 ~]# for i in `ceph health detail | grep stuck | awk '{print $2}'`; do echo -n "$i: "; ceph pg $i query -f plain | cut -d: -f2 | cut -d\" -f2; done

150.270: active+clean
150.2a0: active+clean
150.2b6: active+clean
150.2c2: active+clean
150.2cc: active+clean
150.2d5: active+clean
150.2d6: active+clean
150.2e1: active+clean
150.2ef: active+clean
150.2f5: active+clean
150.2f7: active+clean
150.2fc: active+clean
150.315: active+clean
150.318: active+clean
150.31a: active+clean
150.320: active+clean
150.326: active+clean
150.36e: active+clean
150.380: active+clean
150.389: active+clean
150.3a4: active+clean
150.3ad: active+clean
150.3b4: active+clean
150.3bb: active+clean
150.3ce: active+clean
150.3d0: active+clean
150.3d8: active+clean
150.3e0: active+clean
150.3f6: active+clean
165.24c: Error ENOENT: problem getting command descriptions from pg.165.24c
165.28f: Error ENOENT: problem getting command descriptions from pg.165.28f
165.2b3: Error ENOENT: problem getting command descriptions from pg.165.2b3
165.2b4: Error ENOENT: problem getting command descriptions from pg.165.2b4
165.2d6: Error ENOENT: problem getting command descriptions from pg.165.2d6
165.2f4: Error ENOENT: problem getting command descriptions from pg.165.2f4
165.2fd: Error ENOENT: problem getting command descriptions from pg.165.2fd
165.30f: Error ENOENT: problem getting command descriptions from pg.165.30f
165.322: Error ENOENT: problem getting command descriptions from pg.165.322
165.325: Error ENOENT: problem getting command descriptions from pg.165.325
165.334: Error ENOENT: problem getting command descriptions from pg.165.334
165.36e: Error ENOENT: problem getting command descriptions from pg.165.36e
165.37c: Error ENOENT: problem getting command descriptions from pg.165.37c
165.382: Error ENOENT: problem getting command descriptions from pg.165.382
165.387: Error ENOENT: problem getting command descriptions from pg.165.387
165.3af: Error ENOENT: problem getting command descriptions from pg.165.3af
165.3da: Error ENOENT: problem getting command descriptions from pg.165.3da
165.3e0: Error ENOENT: problem getting command descriptions from pg.165.3e0
165.3e2: Error ENOENT: problem getting command descriptions from pg.165.3e2
165.3e9: Error ENOENT: problem getting command descriptions from pg.165.3e9
165.3fb: Error ENOENT: problem getting command descriptions from pg.165.3fb

[root@ceph03 ~]# ceph pg 165.24c query
Error ENOENT: problem getting command descriptions from pg.165.24c
[root@ceph03 ~]# ceph pg 165.24c delete
Error ENOENT: problem getting command descriptions from pg.165.24c

--
Kind regards,

Ben Morrice

__
Ben Morrice | e: ben.morr...@epfl.ch | t: +41-21-693-9670
EPFL / BBP
Biotech Campus
Chemin des Mines 9
1202 Geneva
Switzerland

HEALTH_WARN Degraded data redundancy: 443765/14311983 objects degraded 
(3.101%), 162 pgs degraded, 241 pgs undersized; 75 slow requests are blocked > 
32 sec. Implicated osds 32; too many PGs per OSD (278 > max 200)
pg 150.270 is stuck undersized for 1871.987162, current state 
active+undersized, last acting [17,30]
pg 150.2a0 is