[ceph-users] bluestore compression enabled but no data compressed

2018-09-18 Thread Frank Schilder
I seem to have a problem getting bluestore compression to do anything. I 
followed the documentation and enabled bluestore compression on various pools 
by executing "ceph osd pool set  compression_mode aggressive". 
Unfortunately, it seems like no data is compressed at all. As an example, below 
is some diagnostic output for a data pool used by a cephfs:

[root@ceph-01 ~]# ceph --version
ceph version 12.2.7 (3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5) luminous (stable)

All defaults are OK:

[root@ceph-01 ~]# ceph --show-config | grep compression
[...]
bluestore_compression_algorithm = snappy
bluestore_compression_max_blob_size = 0
bluestore_compression_max_blob_size_hdd = 524288
bluestore_compression_max_blob_size_ssd = 65536
bluestore_compression_min_blob_size = 0
bluestore_compression_min_blob_size_hdd = 131072
bluestore_compression_min_blob_size_ssd = 8192
bluestore_compression_mode = none
bluestore_compression_required_ratio = 0.875000
[...]

Compression is reported as enabled:

[root@ceph-01 ~]# ceph osd pool ls detail
[...]
pool 24 'sr-fs-data-test' erasure size 8 min_size 7 crush_rule 10 object_hash 
rjenkins pg_num 50 pgp_num 50 last_change 7726 flags hashpspool,ec_overwrites 
stripe_width 24576 compression_algorithm snappy compression_mode aggressive 
application cephfs
[...]

[root@ceph-01 ~]# ceph osd pool get sr-fs-data-test compression_mode
compression_mode: aggressive
[root@ceph-01 ~]# ceph osd pool get sr-fs-data-test compression_algorithm
compression_algorithm: snappy

We dumped a 4 GiB file with dd from /dev/zero (roughly as sketched below); it
should be easy to compress with an excellent ratio.
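
The command was something like this (the path is illustrative; it assumes the
cephfs is mounted under /mnt/cephfs):

  dd if=/dev/zero of=/mnt/cephfs/zeros.bin bs=4M count=1024

Then we search for a PG of the pool: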

[root@ceph-01 ~]# ceph pg ls-by-pool sr-fs-data-test
PG_STAT OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES LOG
DISK_LOG STATE STATE_STAMP VERSION REPORTED UP UP_PRIMARY ACTING
ACTING_PRIMARY LAST_SCRUB SCRUB_STAMP LAST_DEEP_SCRUB DEEP_SCRUB_STAMP
24.0 15 0 0 0 0 62914560 77
77 active+clean 2018-09-14 01:07:14.593007 7698'77 7735:142
[53,47,36,30,14,55,57,5] 53 [53,47,36,30,14,55,57,5] 53
7698'77 2018-09-14 01:07:14.592966 0'0 2018-09-11 08:06:29.309010

There is about 250 MB of data on the primary OSD, but nothing seems to be compressed:

[root@ceph-07 ~]# ceph daemon osd.53 perf dump | grep blue
[...]
"bluestore_allocated": 313917440,
"bluestore_stored": 264362803,
"bluestore_compressed": 0,
"bluestore_compressed_allocated": 0,
"bluestore_compressed_original": 0,
[...]

Just to make sure, I checked one of the objects' contents:

[root@ceph-01 ~]# rados ls -p sr-fs-data-test
104.039c
[...]
104.039f

It is 4M chunks ...
[root@ceph-01 ~]# rados -p sr-fs-data-test stat 104.039f
sr-fs-data-test/104.039f mtime 2018-09-11 14:39:38.00, size 
4194304

... with all zeros:

[root@ceph-01 ~]# rados -p sr-fs-data-test get 104.039f obj

[root@ceph-01 ~]# hexdump -C obj
  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  ||
*
0040

All as it should be, except for compression. Am I overlooking something?

Best regards,

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] bluestore compression enabled but no data compressed

2018-10-12 Thread Frank Schilder
Hi David,

thanks for your answer. I did enable compression on the pools as described in 
the link you sent below (ceph osd pool set sr-fs-data-test compression_mode 
aggressive, I also tried force to no avail). However, I could not find anything 
on enabling compression per OSD. Could you possibly provide a source or sample 
commands?

Thanks and best regards,

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: David Turner 
Sent: 09 October 2018 17:42
To: Frank Schilder
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] bluestore compression enabled but no data compressed

When I've tested compression before, there are 2 places you need to configure 
compression: on the OSDs, via the configuration settings that you mentioned, but 
also on the [1] pools themselves.  If you have the compression mode on the 
pools set to none, then it doesn't matter what the OSDs configuration is and 
vice versa unless you are using the setting of force.  If you want to default 
compress everything, set pools to passive and osds to aggressive.  If you want 
to only compress specific pools, set the osds to passive and the specific pools 
to aggressive.  Good luck.


[1] http://docs.ceph.com/docs/mimic/rados/operations/pools/#set-pool-values
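
For illustration, that combination could look roughly like this (the pool name
and mode choices are only examples of the scheme above, not exact commands):

  # per pool:
  ceph osd pool set sr-fs-data-test compression_mode aggressive
  # per OSD, either in ceph.conf under [osd] (or [global]):
  #   bluestore compression mode = passive
  # or injected at runtime:
  ceph tell 'osd.*' config set bluestore_compression_mode passive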

On Tue, Sep 18, 2018 at 7:11 AM Frank Schilder <fr...@dtu.dk> wrote:
I seem to have a problem getting bluestore compression to do anything. I 
followed the documentation and enabled bluestore compression on various pools 
by executing "ceph osd pool set  compression_mode aggressive". 
Unfortunately, it seems like no data is compressed at all. As an example, below 
is some diagnostic output for a data pool used by a cephfs:

[root@ceph-01 ~]# ceph --version
ceph version 12.2.7 (3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5) luminous (stable)

All defaults are OK:

[root@ceph-01 ~]# ceph --show-config | grep compression
[...]
bluestore_compression_algorithm = snappy
bluestore_compression_max_blob_size = 0
bluestore_compression_max_blob_size_hdd = 524288
bluestore_compression_max_blob_size_ssd = 65536
bluestore_compression_min_blob_size = 0
bluestore_compression_min_blob_size_hdd = 131072
bluestore_compression_min_blob_size_ssd = 8192
bluestore_compression_mode = none
bluestore_compression_required_ratio = 0.875000
[...]

Compression is reported as enabled:

[root@ceph-01 ~]# ceph osd pool ls detail
[...]
pool 24 'sr-fs-data-test' erasure size 8 min_size 7 crush_rule 10 object_hash 
rjenkins pg_num 50 pgp_num 50 last_change 7726 flags hashpspool,ec_overwrites 
stripe_width 24576 compression_algorithm snappy compression_mode aggressive 
application cephfs
[...]

[root@ceph-01 ~]# ceph osd pool get sr-fs-data-test compression_mode
compression_mode: aggressive
[root@ceph-01 ~]# ceph osd pool get sr-fs-data-test compression_algorithm
compression_algorithm: snappy

We dumped a 4 GiB file with dd from /dev/zero. It should be easy to compress with 
an excellent ratio. Search for a PG:

[root@ceph-01 ~]# ceph pg ls-by-pool sr-fs-data-test
PG_STAT OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES LOG
DISK_LOG STATE STATE_STAMP VERSION REPORTED UP UP_PRIMARY ACTING
ACTING_PRIMARY LAST_SCRUB SCRUB_STAMP LAST_DEEP_SCRUB DEEP_SCRUB_STAMP
24.0 15 0 0 0 0 62914560 77
77 active+clean 2018-09-14 01:07:14.593007 7698'77 7735:142
[53,47,36,30,14,55,57,5] 53 [53,47,36,30,14,55,57,5] 53
7698'77 2018-09-14 01:07:14.592966 0'0 2018-09-11 08:06:29.309010

There is about 250 MB of data on the primary OSD, but nothing seems to be compressed:

[root@ceph-07 ~]# ceph daemon osd.53 perf dump | grep blue
[...]
"bluestore_allocated": 313917440,
"bluestore_stored": 264362803,
"bluestore_compressed": 0,
"bluestore_compressed_allocated": 0,
"bluestore_compressed_original": 0,
[...]

Just to make sure, I checked one of the objects' contents:

[root@ceph-01 ~]# rados ls -p sr-fs-data-test
104.039c
[...]
104.039f

It is 4M chunks ...
[root@ceph-01 ~]# rados -p sr-fs-data-test stat 104.039f
sr-fs-data-test/104.039f mtime 2018-09-11 14:39:38.00, size 
4194304

... with all zeros:

[root@ceph-01 ~]# rados -p sr-fs-data-test get 104.039f obj

[root@ceph-01 ~]# hexdump -C obj
  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  ||
*
0040

All as it should be, except for compression. Am I overlooking something?

Best regards,

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.c

Re: [ceph-users] bluestore compression enabled but no data compressed

2018-10-12 Thread Frank Schilder
Hi David,

thanks for your quick answer. When I look at both references, I see exactly the 
same commands:

ceph osd pool set {pool-name} {key} {value}

where one of the pages describes only the compression-specific keys. This is the 
command I found and used. However, I can't see any compression happening. If 
you know of anything other than the "ceph osd pool set" commands, please let 
me know.

Best regards,

=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: David Turner 
Sent: 12 October 2018 15:47:20
To: Frank Schilder
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] bluestore compression enabled but no data compressed

It's all of the settings that you found in your first email when you dumped the 
configurations and such.  
http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#inline-compression
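
If in doubt about what an individual OSD currently has in effect, the admin
socket can be queried, e.g. (osd.53 is just the example id from your earlier
output):

  ceph daemon osd.53 config show | grep bluestore_compression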

On Fri, Oct 12, 2018 at 7:36 AM Frank Schilder <fr...@dtu.dk> wrote:
Hi David,

thanks for your answer. I did enable compression on the pools as described in 
the link you sent below (ceph osd pool set sr-fs-data-test compression_mode 
aggressive, I also tried force to no avail). However, I could not find anything 
on enabling compression per OSD. Could you possibly provide a source or sample 
commands?

Thanks and best regards,

=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: David Turner <drakonst...@gmail.com>
Sent: 09 October 2018 17:42
To: Frank Schilder
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] bluestore compression enabled but no data compressed

When I've tested compression before there are 2 places you need to configure 
compression.  On the OSDs in the configuration settings that you mentioned, but 
also on the [1] pools themselves.  If you have the compression mode on the 
pools set to none, then it doesn't matter what the OSDs configuration is and 
vice versa unless you are using the setting of force.  If you want to default 
compress everything, set pools to passive and osds to aggressive.  If you want 
to only compress specific pools, set the osds to passive and the specific pools 
to aggressive.  Good luck.


[1] http://docs.ceph.com/docs/mimic/rados/operations/pools/#set-pool-values

On Tue, Sep 18, 2018 at 7:11 AM Frank Schilder <fr...@dtu.dk> wrote:
I seem to have a problem getting bluestore compression to do anything. I 
followed the documentation and enabled bluestore compression on various pools 
by executing "ceph osd pool set  compression_mode aggressive". 
Unfortunately, it seems like no data is compressed at all. As an example, below 
is some diagnostic output for a data pool used by a cephfs:

[root@ceph-01 ~]# ceph --version
ceph version 12.2.7 (3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5) luminous (stable)

All defaults are OK:

[root@ceph-01 ~]# ceph --show-config | grep compression
[...]
bluestore_compression_algorithm = snappy
bluestore_compression_max_blob_size = 0
bluestore_compression_max_blob_size_hdd = 524288
bluestore_compression_max_blob_size_ssd = 65536
bluestore_compression_min_blob_size = 0
bluestore_compression_min_blob_size_hdd = 131072
bluestore_compression_min_blob_size_ssd = 8192
bluestore_compression_mode = none
bluestore_compression_required_ratio = 0.875000
[...]

Compression is reported as enabled:

[root@ceph-01 ~]# ceph osd pool ls detail
[...]
pool 24 'sr-fs-data-test' erasure size 8 min_size 7 crush_rule 10 object_hash 
rjenkins pg_num 50 pgp_num 50 last_change 7726 flags hashpspool,ec_overwrites 
stripe_width 24576 compression_algorithm snappy compression_mode aggressive 
application cephfs
[...]

[root@ceph-01 ~]# ceph osd pool get sr-fs-data-test compression_mode
compression_mode: aggressive
[root@ceph-01 ~]# ceph osd pool get sr-fs-data-test compression_algorithm
compression_algorithm: snappy

We dumped a 4 GiB file with dd from /dev/zero. It should be easy to compress with 
an excellent ratio. Search for a PG:

[root@ceph-01 ~]# ceph pg ls-by-pool sr-fs-data-test
PG_STAT OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES LOG
DISK_LOG STATE STATE_STAMP VERSION REPORTED UP UP_PRIMARY ACTING
ACTING_PRIMARY LAST_SCRUB SCRUB_STAMP LAST_DEEP_SCRUB DEEP_SCRUB_STAMP
24.0 15 0 0 0 0 62914560 77
77 active+clean 2018-09-14 01:07:14.593007 7698'77 7735:142
[53,47,36,30,14,55,57,5] 53 [53,47,36,30,14,55,57,5] 53
7698'77 2018-09-14 01:07:14.592966 0'0 2018-09-11 08:06:29.309010

There is about 250 MB of data on the primary OSD, but nothing seems to be compressed:

[root@ceph-07 ~]# ceph daemon osd.53 perf d

Re: [ceph-users] bluestore compression enabled but no data compressed

2018-10-12 Thread Frank Schilder
Hi David,

thanks, now I see what you mean. If you are right, that would mean that the 
documentation is wrong. Under 
"http://docs.ceph.com/docs/master/rados/operations/pools/#set-pool-values" it is 
stated that "Sets inline compression algorithm to use for underlying BlueStore. 
This setting overrides the global setting of bluestore compression algorithm". 
In other words, the global setting should be irrelevant if compression is 
enabled on a pool.

Well, I will try setting both to "aggressive" or "force", see how that works 
out, and let you know.

Thanks and have a nice weekend,

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: David Turner 
Sent: 12 October 2018 16:50:31
To: Frank Schilder
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] bluestore compression enabled but no data compressed

If you go down just a little farther you'll see the settings that you put into 
your ceph.conf under the osd section (although I'd probably do global).  That's 
where the OSDs get the settings from.  As a note, once these are set, future 
writes will be compressed (if they match the compression settings which you can 
see there about minimum ratios, blob sizes, etc).  To compress current data, 
you need to re-write it.
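
As a minimal sketch of such a ceph.conf fragment (the values are the ones
discussed in this thread, not a general recommendation):

  [osd]
  bluestore compression mode = aggressive
  bluestore compression algorithm = snappy

Existing objects stay uncompressed until rewritten, e.g. by copying a file to a
new name within the cephfs and deleting the original.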

On Fri, Oct 12, 2018 at 10:41 AM Frank Schilder <fr...@dtu.dk> wrote:
Hi David,

thanks for your quick answer. When I look at both references, I see exactly the 
same commands:

ceph osd pool set {pool-name} {key} {value}

where on one page only keys specific for compression are described. This is the 
command I found and used. However, I can't see any compression happening. If 
you know about something else than "ceph osd pool set" - commands, please let 
me know.

Best regards,

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: David Turner <drakonst...@gmail.com>
Sent: 12 October 2018 15:47:20
To: Frank Schilder
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] bluestore compression enabled but no data compressed

It's all of the settings that you found in your first email when you dumped the 
configurations and such.  
http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#inline-compression

On Fri, Oct 12, 2018 at 7:36 AM Frank Schilder <fr...@dtu.dk> wrote:
Hi David,

thanks for your answer. I did enable compression on the pools as described in 
the link you sent below (ceph osd pool set sr-fs-data-test compression_mode 
aggressive, I also tried force to no avail). However, I could not find anything 
on enabling compression per OSD. Could you possibly provide a source or sample 
commands?

Thanks and best regards,

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: David Turner <drakonst...@gmail.com>
Sent: 09 October 2018 17:42
To: Frank Schilder
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] bluestore compression enabled but no data compressed

When I've tested compression before there are 2 places you need to configure 
compression.  On the OSDs in the configuration settings that you mentioned, but 
also on the [1] pools themselves.  If you have the compression mode on the 
pools set to none, then it doesn't matter what the OSDs configuration is and 
vice versa unless you are using the setting of force.  If you want to default 
compress everything, set pools to passive and osds to aggressive.  If you want 
to only compress specific pools, set the osds to passive and the specific pools 
to aggressive.  Good luck.


[1] http://docs.ceph.com/docs/mimic/rados/operations/pools/#set-pool-values

On Tue, Sep 18, 2018 at 7:11 AM Frank Schilder <fr...@dtu.dk> wrote:
I seem to have a problem getting bluestore compression to do anything. I 
followed the documentation and enabled bluestore compression on various pools 
by executing "ceph osd pool set  compression_mode aggressive". 
Unfortunately, it seems like no data is compressed at all. As an example, below 
is some diagnostic output for a data pool used by a cephfs:

[root@ceph-01 ~]# ceph --version
ceph version 12.2.7 (3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5) luminous (stable)

All defaults are OK:

[root@ceph-01 ~]# ceph --show-config | grep compression
[...]
bluestore_compression_algorithm = snappy
bluesto

Re: [ceph-users] bluestore compression enabled but no data compressed

2018-10-19 Thread Frank Schilder
Hi David,

sorry for the slow response, we had a hell of a week at work.

OK, so I had compression mode set to aggressive on some pools, but the global 
option was not changed, because I interpreted the documentation as "pool 
settings take precedence". To check your advice, I executed

  ceph tell "osd.*" config set bluestore_compression_mode aggressive

and dumped a new file consisting of null-bytes. Indeed, this time I observe 
compressed objects:

[root@ceph-08 ~]# ceph daemon osd.80 perf dump | grep blue
"bluefs": {
"bluestore": {
"bluestore_allocated": 2967207936,
"bluestore_stored": 3161981179,
"bluestore_compressed": 24549408,
"bluestore_compressed_allocated": 261095424,
"bluestore_compressed_original": 522190848,

Obvious questions that come to my mind:

1) I think either the documentation is misleading or the implementation is not 
following documented behaviour. I observe that per pool settings do *not* 
override globals, but the documentation says they will. (From doc: "Sets the 
policy for the inline compression algorithm for underlying BlueStore. This 
setting overrides the global setting of bluestore compression mode.") Will this 
be fixed in the future? Should this be reported?

Remark: When I look at "compression_mode" under 
"http://docs.ceph.com/docs/luminous/rados/operations/pools/?highlight=bluestore%20compression#set-pool-values"
it actually looks like a copy-and-paste error. The doc here talks about 
compression algorithm (see quote above) while the compression mode should be 
explained. Maybe that is worth looking at?

2) If I set the global to aggressive, do I now have to disable compression 
explicitly on pools where I don't want compression or is the pool default still 
"none"? Right now, I seem to observe that compression is still disabled by 
default.

3) Do you know what the output means? What is the compression ratio? 
bluestore_compressed/bluestore_compressed_original=0.04 or 
bluestore_compressed_allocated/bluestore_compressed_original=0.5? The second 
ratio does not look too impressive given the file contents.

4) Is there any way to get uncompressed data compressed as a background task 
like scrub?

If you have the time to look at these questions, this would be great. Most 
importantly right now is that I got it to work.

Thanks for your help,

=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: ceph-users  on behalf of Frank 
Schilder 
Sent: 12 October 2018 17:00
To: David Turner
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] bluestore compression enabled but no data compressed

Hi David,

thanks, now I see what you mean. If you are right, that would mean that the 
documentation is wrong. Under 
"http://docs.ceph.com/docs/master/rados/operations/pools/#set-pool-values" it is 
stated that "Sets inline compression algorithm to use for underlying BlueStore. 
This setting overrides the global setting of bluestore compression algorithm". 
In other words, the global setting should be irrelevant if compression is 
enabled on a pool.

Well, I will try how setting both to "aggressive" or "force" works out and let 
you know.

Thanks and have a nice weekend,

=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: David Turner 
Sent: 12 October 2018 16:50:31
To: Frank Schilder
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] bluestore compression enabled but no data compressed

If you go down just a little farther you'll see the settings that you put into 
your ceph.conf under the osd section (although I'd probably do global).  That's 
where the OSDs get the settings from.  As a note, once these are set, future 
writes will be compressed (if they match the compression settings which you can 
see there about minimum ratios, blob sizes, etc).  To compress current data, 
you need to re-write it.

On Fri, Oct 12, 2018 at 10:41 AM Frank Schilder <fr...@dtu.dk> wrote:
Hi David,

thanks for your quick answer. When I look at both references, I see exactly the 
same commands:

ceph osd pool set {pool-name} {key} {value}

where on one page only keys specific for compression are described. This is the 
command I found and used. However, I can't see any compression happening. If 
you know about something else than "ceph osd pool set" - commands, please let 
me know.

Best regards,

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: David Turner <drakonst...@gmail.com>
Sent: 12 October 2018 15:47:20
To: Frank Schilder
Cc: ceph-users@lists.ceph.com
Subject: Re: [cep

Re: [ceph-users] bluestore compression enabled but no data compressed

2018-10-23 Thread Frank Schilder
Dear David and Igor,

thank you very much for your help. I have one more question about chunk sizes 
and data granularity on bluestore and will summarize the information I got on 
bluestore compression at the end.

1) Compression ratio
---

Following Igor's explanation, I tried to understand the numbers for 
compressed_allocated and compressed_original and am somewhat stuck with 
figuring out how bluestore arithmetic works. I created a 32GB file of zeros 
using dd with write size bs=8M on a cephfs with

ceph.dir.layout="stripe_unit=4194304 stripe_count=1 object_size=4194304 
pool=con-fs-data-test"

The data pool is an 8+2 erasure coded pool with properties

pool 37 'con-fs-data-test' erasure size 10 min_size 9 crush_rule 11 
object_hash rjenkins pg_num 900 pgp_num 900 last_change 9970 flags 
hashpspool,ec_overwrites stripe_width 32768 compression_mode aggressive 
application cephfs

As I understand EC pools, a 4M object is split into 8x0.5M data shards that are 
stored together with 2x0.5M coding shards on one OSD each. So, I would expect a 
full object write to put a 512K chunk on each OSD in the PG. Looking at some 
config options of one of the OSDs, I see:

"bluestore_compression_max_blob_size_hdd": "524288",
"bluestore_compression_min_blob_size_hdd": "131072",
"bluestore_max_blob_size_hdd": "524288",
"bluestore_min_alloc_size_hdd": "65536",

From this, I would conclude that the largest chunk size is 512K, which also 
equals compression_max_blob_size. The minimum allocation size is 64K for any 
object. What I would expect now is that the full object writes to cephfs 
create chunk sizes of 512K per OSD in the PG, meaning that with an all-zero 
file I should observe a compressed_allocated ratio of 64K/512K=0.125 instead 
of the 0.5 reported below. It looks like chunks of 128K are written 
instead of 512K. I'm happy with the 64K granularity, but the observed maximum 
chunk size seems a factor of 4 too small.

Where am I going wrong, what am I overlooking?

2) Bluestore compression configuration
---

If I understand David correctly, pool and OSD settings do *not* override each 
other, but are rather *combined* into a resulting setting as follows. Let

0 - (n)one
1 - (p)assive
2 - (a)ggressive
3 - (f)orce

? - (u)nset

be the 4+1 possible settings of compression modes with numeric values assigned 
as shown. Then, the resulting numeric compression mode for data in a pool on a 
specific OSD is

res_compr_mode = min(mode OSD, mode pool)

or in form of a table:

            pool
      | n  p  a  f  u
   ---+---------------
    n | n  n  n  n  n
O   p | n  p  p  p  ?
S   a | n  p  a  a  ?
D   f | n  p  a  f  ?
    u | n  ?  ?  ?  u

which would allow for the flexible configuration as mentioned by David below.

I'm actually not sure if I can confirm this. I have some pools where 
compression_mode is not set and which reside on separate OSDs with compression 
enabled, yet there is compressed data on these OSDs. Wondering if I polluted my 
test with "ceph config set bluestore_compression_mode aggressive" that I 
executed earlier, or if my above interpretation is still wrong. Does the 
setting issued with "ceph config set bluestore_compression_mode aggressive" 
apply to pools with 'compression_mode' not set on the pool (see question marks 
in table above, what is the resulting mode?).
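
One way to check what actually applies in a given case is to compare the
per-pool value with the OSD's effective value (pool and OSD ids taken from the
examples above; as far as I can tell, the pool query errors out if the key is
not set on the pool):

  ceph osd pool get con-fs-data-test compression_mode
  ceph daemon osd.80 config get bluestore_compression_mode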

What I would like to do is enable compression on all OSDs, enable compression 
on all data pools and disable compression on all meta data pools. Data and meta 
data pools might share OSDs in the future. The above table says I should be 
able to do just that by being explicit.

Many thanks again and best regards,

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Igor Fedotov 
Sent: 19 October 2018 23:41
To: Frank Schilder; David Turner
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] bluestore compression enabled but no data compressed

Hi Frank,


On 10/19/2018 2:19 PM, Frank Schilder wrote:
> Hi David,
>
> sorry for the slow response, we had a hell of a week at work.
>
> OK, so I had compression mode set to aggressive on some pools, but the global 
> option was not changed, because I interpreted the documentation as "pool 
> settings take precedence". To check your advice, I executed
>
>ceph tell "osd.*" config set bluestore_compression_mode aggressive
>
> and dumped a new file consisting of null-bytes. Indeed, this time I observe 
> compressed objects:
>
> [root@ceph-08 ~]# ceph daemon osd.80 perf dump | grep blue
>  "bluefs": {
>  "blues

Re: [ceph-users] bluestore compression enabled but no data compressed

2019-03-16 Thread Frank Schilder
Yes:

Two days ago I did a complete re-deployment of ceph from my test cluster to a 
production cluster. As part this re-deployment I also added the following to my 
ceph.conf:

[osd]
bluestore compression mode = aggressive
bluestore compression min blob size hdd = 262144

Apparently, cephfs and rbd clients do not provide hints to ceph about blob 
sizes, so for these apps bluestore will (at least in current versions) always 
compress blobs of size bluestore_compression_min_blob_size_hdd. The best 
achievable compression ratio is 
bluestore_compression_min_blob_size_hdd/bluestore_min_alloc_size_hdd.

I did not want to reduce the default of bluestore_min_alloc_size_hdd = 64KB and 
only increased bluestore_compression_min_blob_size_hdd to 
4*bluestore_min_alloc_size_hdd (262144), which means that for large 
compressible files the best ratio is 4. I tested this with a 1TB file of zeros 
and it works.
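
For reference, the arithmetic behind that ratio with the values above:

  bluestore_min_alloc_size_hdd            = 65536   (64 KiB, left at default)
  bluestore_compression_min_blob_size_hdd = 262144  (256 KiB, as set above)
  best achievable ratio = 262144 / 65536  = 4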

I'm not sure what the performance impact and rocksDB overhead implications of 
all bluestore options are. Pretty much nothing of this is easy to find in the 
documentation. I will keep watching how the above works in reality and, maybe, 
make some more advanced experiments later.

Best regards,

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: ceph-users  on behalf of Ragan, Tj 
(Dr.) 
Sent: 14 March 2019 11:22:07
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] bluestore compression enabled but no data compressed

Hi Frank,

Did you ever get the 0.5 compression ratio thing figured out?

Thanks
-TJ Ragan


On 23 Oct 2018, at 16:56, Igor Fedotov 
mailto:ifedo...@suse.de>> wrote:

Hi Frank,


On 10/23/2018 2:56 PM, Frank Schilder wrote:
Dear David and Igor,

thank you very much for your help. I have one more question about chunk sizes 
and data granularity on bluestore and will summarize the information I got on 
bluestore compression at the end.

1) Compression ratio
---

Following Igor's explanation, I tried to understand the numbers for 
compressed_allocated and compressed_original and am somewhat stuck with 
figuring out how bluestore arithmetic works. I created a 32GB file of zeros 
using dd with write size bs=8M on a cephfs with

ceph.dir.layout="stripe_unit=4194304 stripe_count=1 object_size=4194304 
pool=con-fs-data-test"

The data pool is an 8+2 erasure coded pool with properties

pool 37 'con-fs-data-test' erasure size 10 min_size 9 crush_rule 11 
object_hash rjenkins pg_num 900 pgp_num 900 last_change 9970 flags 
hashpspool,ec_overwrites stripe_width 32768 compression_mode aggressive 
application cephfs

As I understand EC pools, a 4M object is split into 8x0.5M data shards that are 
stored together with 2x0.5M coding shards on one OSD each. So, I would expect a 
full object write to put a 512K chunk on each OSD in the PG. Looking at some 
config options of one of the OSDs, I see:

"bluestore_compression_max_blob_size_hdd": "524288",
"bluestore_compression_min_blob_size_hdd": "131072",
"bluestore_max_blob_size_hdd": "524288",
"bluestore_min_alloc_size_hdd": "65536",

From this, I would conclude that the largest chunk size is 512K, which also 
equals compression_max_blob_size. The minimum allocation size is 64K for any 
object. What I would expect now is that the full object writes to cephfs 
create chunk sizes of 512K per OSD in the PG, meaning that with an all-zero 
file I should observe a compressed_allocated ratio of 64K/512K=0.125 instead 
of the 0.5 reported below. It looks like chunks of 128K are written 
instead of 512K. I'm happy with the 64K granularity, but the observed maximum 
chunk size seems a factor of 4 too small.

Where am I going wrong, what am I overlooking?
Please note how the selection between compression_max_blob_size and 
compression_min_blob_size is performed.

Max blob size threshold is mainly for objects that are tagged with flags 
indicating non-random access, e.g. sequential read and/or write, immutable, 
append-only etc.
Here is how it's determined in the code:
  if ((alloc_hints & CEPH_OSD_ALLOC_HINT_FLAG_SEQUENTIAL_READ) &&
  (alloc_hints & CEPH_OSD_ALLOC_HINT_FLAG_RANDOM_READ) == 0 &&
  (alloc_hints & (CEPH_OSD_ALLOC_HINT_FLAG_IMMUTABLE |
  CEPH_OSD_ALLOC_HINT_FLAG_APPEND_ONLY)) &&
  (alloc_hints & CEPH_OSD_ALLOC_HINT_FLAG_RANDOM_WRITE) == 0) {
dout(20) << __func__ << " will prefer large blob and csum sizes" << dendl;

This is done to minimize the overhead during future random access since it will 
need full blob decompression.
Hence min blob size is used for regular random I/O, which is probably your case 
as well.
You can check bluestore log (once its level is raised to 20) to conf

Re: [ceph-users] Checking cephfs compression is working

2019-03-26 Thread Frank Schilder
Hi Rhian,

not sure if you found an answer already.

I believe in luminous and mimic it is only possible to extract compression 
information on osd device level. According to the recent announcement of 
nautilus, this seems to get better in the future.

If you want to check if anything is compressed, you need to find a data-OSD 
that's in a data pool of your ceph fs. On this OSD you can extract low-level 
information, including compression statistics. You find details in this 
conversation: 
https://www.mail-archive.com/ceph-users@lists.ceph.com/msg49339.html .
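
For example, on the host of such an OSD (osd.53 is just an example id taken
from the thread linked above):

  ceph daemon osd.53 perf dump | grep bluestore_compressed

Non-zero bluestore_compressed_original / bluestore_compressed_allocated
counters mean that compression is actually happening on that OSD.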

Hope that helps,

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: ceph-users  on behalf of Rhian Resnick 

Sent: 16 November 2018 16:58:04
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Checking cephfs compression is working

How do you confirm that cephfs files and rados objects are being compressed?

I don't see how in the docs.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] mimic: MDS standby-replay causing blocked ops (MDS bug?)

2019-05-13 Thread Frank Schilder
[...] The problem seems to remain isolated. Also, the load on 
the servers was not high during the test. The fs remained responsive to other 
users. Also, the MDS daemons never crashed. There was no fail-over except the 
ones we triggered manually.

As mentioned above, we can do some more testing within reason. We already have 
pilot users on the system and we need to keep it sort of working.


Best regards and thanks in advance for any help on getting a stable 
active/standby-replay config working.

And for reading all that.

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mimic: MDS standby-replay causing blocked ops (MDS bug?)

2019-05-15 Thread Frank Schilder
"time": "2019-05-15 11:37:31.765640",
"event": "all_read"
},
{
"time": "2019-05-15 11:37:31.765731",
"event": "dispatched"
},
{
"time": "2019-05-15 11:37:31.765759",
"event": "failed to authpin, dir is being fragmented"
}
]
}
},
{
"description": "client_request(client.377552:5446 readdir 
#0x13a 2019-05-15 11:43:07.569329 caller_uid=0, caller_gid=0{})",
"initiated_at": "2019-05-15 11:38:36.511381",
"age": 23.462997,
"duration": 23.463467,
"type_data": {
"flag_point": "failed to authpin, dir is being fragmented",
"reqid": "client.377552:5446",
"op_type": "client_request",
"client_info": {
"client": "client.377552",
"tid": 5446
},
"events": [
{
"time": "2019-05-15 11:38:36.511381",
"event": "initiated"
},
{
"time": "2019-05-15 11:38:36.511381",
"event": "header_read"
},
{
"time": "2019-05-15 11:38:36.511383",
    "event": "throttled"
},
{
"time": "2019-05-15 11:38:36.511392",
"event": "all_read"
},
{
"time": "2019-05-15 11:38:36.511561",
"event": "dispatched"
},
{
"time": "2019-05-15 11:38:36.511604",
"event": "failed to authpin, dir is being fragmented"
}
]
}
},
{
"description": "client_request(client.62472:6092368 getattr 
pAsLsXsFs #0x138 2019-05-15 11:17:21.633854 caller_uid=105731, 
caller_gid=105731{})",
"initiated_at": "2019-05-15 11:17:21.635927",
"age": 1298.338451,
"duration": 1298.338955,
"type_data": {
"flag_point": "failed to authpin, dir is being fragmented",
"reqid": "client.62472:6092368",
"op_type": "client_request",
"client_info": {
"client": "client.62472",
"tid": 6092368
},
"events": [
{
"time": "2019-05-15 11:17:21.635927",
"event": "initiated"
},
{
"time": "2019-05-15 11:17:21.635927",
"event": "header_read"
},
{
"time": "2019-05-15 11:17:21.635931",
"event": "throttled"
},
{
"time": "2019-05-15 11:17:21.635944",
"event": "all_read"
},
{
"time": "2019-05-15 11:17:21.636081",
"event": "dispatched"
},
{
"time": "2019-05-15 11:17:21.636118",
"event": "failed to authpin, dir is being fragmented"
}
]
}
},
{
"description": "client_request(client.62472:6092400 getattr 
pAsLsXsFs #0x138 2019-05-15 11:21:25.909555 caller_uid=105731, 
caller_gid=105731{})",
"initiated_at": "2019-05-15 11:21:25.910514",
"age": 1054.063864,
"duration": 1054.064406,
"type_data": {
"flag_point": "failed to authpin, dir is being fragmented",
"reqid": "client.62472:6092400",
"op_type": "client_request",
"client_info": {
"client": "client.62472",
"tid": 6092400
},
"events": [
{
"time": "2019-05-15 11:21:25.910514",
"event": "initiated"
},
{
"time": "2019-05-15 11:21:25.910514",
"event": "header_read"
},
{
"time": "2019-05-15 11:21:25.910527",
"event": "throttled"
},
{
"time": "2019-05-15 11:21:25.910537",
"event": "all_read"
},
{
"time": "2019-05-15 11:21:25.910597",
"event": "dispatched"
},
{
"time": "2019-05-15 11:21:25.910635",
"event": "failed to authpin, dir is being fragmented"
}
]
}
}
],
"num_ops": 12
}

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Stefan Kooman 
Sent: 14 May 2019 09:54:05
To: Frank Schilder
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] mimic: MDS standby-replay causing blocked ops (MDS 
bug?)

Quoting Frank Schilder (fr...@dtu.dk):

If at all possible I would:

Upgrade to 13.2.5 (there have been quite a few MDS fixes since 13.2.2).
Use more recent kernels on the clients.

Below settings for [mds] might help with trimming (you might already
have changed mds_log_max_segments to 128 according to logs):

[mds]
mds_log_max_expiring = 80  # default 20
# trim max $value segments in parallel
# Defaults are too conservative.
mds_log_max_segments = 120  # default 30
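
As far as I know these can also be changed on the fly through the admin socket,
e.g. (the daemon name is just an example from your cluster):

ceph daemon mds.ceph-08 config set mds_log_max_expiring 80
ceph daemon mds.ceph-08 config set mds_log_max_segments 120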


> 1) Is there a bug with having MDS daemons acting as standby-replay?
I can't tell what bug you are referring to based on info below. It does
seem to work as designed.

Gr. Stefan

--
| BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mimic: MDS standby-replay causing blocked ops (MDS bug?)

2019-05-15 Thread Frank Schilder
Dear Yan,

OK, I will try to trigger the problem again and dump the information requested. 
Since it is not easy to get into this situation and I usually need to resolve 
it fast (it's not a test system), is there anything else worth capturing?

I will get back as soon as it happened again.

In the meantime, I would be grateful if you could shed some light on the 
following questions:

- Is there a way to cancel an individual operation in the queue? It is a bit 
harsh to have to fail an MDS for that.
- What is the fragmentdir operation doing in a single MDS setup? I thought this 
was only relevant if multiple MDS daemons are active on a file system.

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Yan, Zheng 
Sent: 16 May 2019 05:50
To: Frank Schilder
Cc: Stefan Kooman; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] mimic: MDS standby-replay causing blocked ops (MDS 
bug?)

> [...]
> This time I captured the MDS ops list (log output does not really contain 
> more info than this list). It contains 12 ops and I will include it here in 
> full length (hope this is acceptable):
>

Your issues were caused by a stuck internal fragmentdir op.  Can you
dump the mds cache and send the output to us?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mimic: MDS standby-replay causing blocked ops (MDS bug?)

2019-05-16 Thread Frank Schilder
Dear Yan and Stefan,

thanks for the additional information, it should help reproducing the issue.

The pdsh command executes a bash script that echoes a few values to stdout. 
Access should be read-only; however, we still have the FS mounted with atime 
enabled, so there is probably a metadata write and synchronisation per access. 
Files accessed are ssh auth-keys in .ssh and the shell script. The shell script 
was located in the home-dir of the user and, following your explanations, to 
reproduce the issue I will create a directory with many entries and execute a 
test with the many-clients single-file-read load on it.
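
A rough sketch of the test I have in mind (paths, file counts and the host list
are made up for illustration):

  # create a directory with many entries on the cephfs
  mkdir /mnt/cephfs/dirfrag-test
  for i in $(seq 1 100000); do touch /mnt/cephfs/dirfrag-test/f$i; done
  # then read a single small file from many clients at once
  pdsh -w 'client[001-550]' 'cat /mnt/cephfs/dirfrag-test/f1 > /dev/null'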

I hope it doesn't take too long.

Thanks for your input!

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Yan, Zheng 
Sent: 16 May 2019 09:35
To: Frank Schilder
Subject: Re: [ceph-users] mimic: MDS standby-replay causing blocked ops (MDS 
bug?)

On Thu, May 16, 2019 at 2:52 PM Frank Schilder  wrote:
>
> Dear Yan,
>
> OK, I will try to trigger the problem again and dump the information 
> requested. Since it is not easy to get into this situation and I usually need 
> to resolve it fast (its not a test system), is there anything else worth 
> capturing?
>

just

ceph daemon mds.x dump_ops_in_flight
ceph daemon mds.x dump cache /tmp/cachedump.x

> I will get back as soon as it happened again.
>
> In the meantime, I would be grateful if you could shed some light on the 
> following questions:
>
> - Is there a way to cancel an individual operation in the queue? It is a bit 
> harsh to have to fail an MDS for that.

no

> - What is the fragmentdir operation doing in a single MDS setup? I thought 
> this was only relevant if multiple MDS daemons are active on a file system.
>

It splits large directories into smaller parts.


> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ____
> From: Yan, Zheng 
> Sent: 16 May 2019 05:50
> To: Frank Schilder
> Cc: Stefan Kooman; ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] mimic: MDS standby-replay causing blocked ops (MDS 
> bug?)
>
> > [...]
> > This time I captured the MDS ops list (log output does not really contain 
> > more info than this list). It contains 12 ops and I will include it here in 
> > full length (hope this is acceptable):
> >
>
> Your issues were caused by stuck internal op fragmentdir.  Can you
> dump mds cache and send the output to us?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mimic: MDS standby-replay causing blocked ops (MDS bug?)

2019-05-16 Thread Frank Schilder
Dear Yan,

it is difficult to push the MDS to err in this special way. Is it advisable or 
not to increase the likelihood and frequency of dirfrag operations by tweaking 
some of the parameters mentioned here: 
http://docs.ceph.com/docs/mimic/cephfs/dirfrags/. If so, what would reasonable 
values be, keeping in mind that we are in a pilot production phase already and 
need to maintain integrity of user data?

Is there any counter showing if such operations happened at all?

Best regards,

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Yan, Zheng 
Sent: 16 May 2019 09:35
To: Frank Schilder
Subject: Re: [ceph-users] mimic: MDS standby-replay causing blocked ops (MDS 
bug?)

On Thu, May 16, 2019 at 2:52 PM Frank Schilder  wrote:
>
> Dear Yan,
>
> OK, I will try to trigger the problem again and dump the information 
> requested. Since it is not easy to get into this situation and I usually need 
> to resolve it fast (its not a test system), is there anything else worth 
> capturing?
>

just

ceph daemon mds.x dump_ops_in_flight
ceph daemon mds.x dump cache /tmp/cachedump.x

> I will get back as soon as it happened again.
>
> In the meantime, I would be grateful if you could shed some light on the 
> following questions:
>
> - Is there a way to cancel an individual operation in the queue? It is a bit 
> harsh to have to fail an MDS for that.

no

> - What is the fragmentdir operation doing in a single MDS setup? I thought 
> this was only relevant if multiple MDS daemons are active on a file system.
>

It splits large directories into smaller parts.


> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ____
> From: Yan, Zheng 
> Sent: 16 May 2019 05:50
> To: Frank Schilder
> Cc: Stefan Kooman; ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] mimic: MDS standby-replay causing blocked ops (MDS 
> bug?)
>
> > [...]
> > This time I captured the MDS ops list (log output does not really contain 
> > more info than this list). It contains 12 ops and I will include it here in 
> > full length (hope this is acceptable):
> >
>
> Your issues were caused by stuck internal op fragmentdir.  Can you
> dump mds cache and send the output to us?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mimic: MDS standby-replay causing blocked ops (MDS bug?)

2019-05-18 Thread Frank Schilder
Dear Yan and Stefan,

it happened again and there were only very few ops in the queue. I pulled the 
ops list and the cache. Please find a zip file here: 
"https://files.dtu.dk/u/w6nnVOsp51nRqedU/mds-stuck-dirfrag.zip?l". It's a bit 
more than 100MB.

The active MDS failed over to the standby after or during the dump cache 
operation. Is this expected? As a result, the cluster is healthy and I can't do 
further diagnostics. In case you need more information, we have to wait until 
next time.

Some further observations:

There was no load on the system. I am starting to suspect that this is not a 
load-induced event. It is also not caused by excessive atime updates; the FS is 
mounted with relatime. Could it have to do with the large level-2 network (ca. 
550 client servers in the same broadcast domain)? I include our kernel tuning 
profile below, just in case. The cluster networks (back and front) are isolated 
VLANs, no gateways, no routing.

We run rolling snapshots on the file system. I didn't observe any problems with 
this, but am wondering if this might be related. We have currently 30 snapshots 
in total. Here is the output of status and pool ls:

[root@ceph-01 ~]# ceph status # before the MDS failed over
  cluster:
id: ###
health: HEALTH_WARN
1 MDSs report slow requests
 
  services:
mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
mgr: ceph-01(active), standbys: ceph-02, ceph-03
mds: con-fs-1/1/1 up  {0=ceph-08=up:active}, 1 up:standby
osd: 192 osds: 192 up, 192 in
 
  data:
pools:   5 pools, 750 pgs
objects: 6.35 M objects, 5.2 TiB
usage:   5.1 TiB used, 1.3 PiB / 1.3 PiB avail
pgs: 750 active+clean
 
[root@ceph-01 ~]# ceph status # after cache dump and the MDS failed over
  cluster:
id: ###
health: HEALTH_OK
 
  services:
mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
mgr: ceph-01(active), standbys: ceph-02, ceph-03
mds: con-fs-1/1/1 up  {0=ceph-12=up:active}, 1 up:standby
osd: 192 osds: 192 up, 192 in
 
  data:
pools:   5 pools, 750 pgs
objects: 6.33 M objects, 5.2 TiB
usage:   5.1 TiB used, 1.3 PiB / 1.3 PiB avail
pgs: 749 active+clean
 1   active+clean+scrubbing+deep
 
  io:
client:   6.3 KiB/s wr, 0 op/s rd, 0 op/s wr

[root@ceph-01 ~]# ceph osd pool ls detail # after the MDS failed over
pool 1 'sr-rbd-meta-one' replicated size 3 min_size 2 crush_rule 1 object_hash 
rjenkins pg_num 80 pgp_num 80 last_change 486 flags 
hashpspool,nodelete,selfmanaged_snaps max_bytes 536870912000 stripe_width 0 
application rbd
removed_snaps [1~5]
pool 2 'sr-rbd-data-one' erasure size 8 min_size 6 crush_rule 5 object_hash 
rjenkins pg_num 300 pgp_num 300 last_change 1759 flags 
hashpspool,ec_overwrites,nodelete,selfmanaged_snaps max_bytes 274877906944000 
stripe_width 24576 compression_mode aggressive application rbd
removed_snaps [1~3]
pool 3 'sr-rbd-one-stretch' replicated size 4 min_size 2 crush_rule 2 
object_hash rjenkins pg_num 20 pgp_num 20 last_change 500 flags 
hashpspool,nodelete,selfmanaged_snaps max_bytes 5497558138880 stripe_width 0 
compression_mode aggressive application rbd
removed_snaps [1~7]
pool 4 'con-fs-meta' replicated size 3 min_size 2 crush_rule 3 object_hash 
rjenkins pg_num 50 pgp_num 50 last_change 428 flags hashpspool,nodelete 
max_bytes 1099511627776 stripe_width 0 application cephfs
pool 5 'con-fs-data' erasure size 10 min_size 8 crush_rule 6 object_hash 
rjenkins pg_num 300 pgp_num 300 last_change 2561 flags 
hashpspool,ec_overwrites,nodelete,selfmanaged_snaps max_bytes 21990232200 
stripe_width 32768 compression_mode aggressive application cephfs
removed_snaps 
[2~3d,41~2a,6d~2a,99~c,a6~1e,c6~18,df~3,e3~1,e5~3,e9~1,eb~3,ef~1,f1~1,f3~1,f5~3,f9~1,fb~3,ff~1,101~1,103~1,105~1,107~1,109~1,10b~1,10d~1,10f~1,111~1]

The relevant pools are con-fs-meta and con-fs-data.

Best regards,
Frank

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


[root@ceph-08 ~]# cat /etc/tuned/ceph/tuned.conf 
[main]
summary=Settings for ceph cluster. Derived from throughput-performance.
include=throughput-performance

[vm]
transparent_hugepages=never

[sysctl]
# See also:
# - https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt
# - https://www.kernel.org/doc/Documentation/sysctl/net.txt
# - https://cromwell-intl.com/open-source/performance-tuning/tcp.html
# - https://fatmin.com/2015/08/19/ceph-tcp-performance-tuning/
# - https://www.spinics.net/lists/ceph-devel/msg21721.html

# Set available PIDs and open files to maximum possible.
kernel.pid_max=4194304
fs.file-max=26234859

# Swap options, reduce swappiness.
vm.zone_reclaim_mode=0
#vm.dirty_ratio = 20
vm.dirty_bytes = 629145600
vm.dirty_background_bytes = 314572800
vm.swappiness=10
vm.min_free_kbytes=8388608

# Increase ARP cache size to accommodate large level-2 client network.
net.ipv4.neigh.defau

Re: [ceph-users] mimic: MDS standby-replay causing blocked ops (MDS bug?)

2019-05-18 Thread Frank Schilder
Hi Stefan, cc Yan,

thanks for your quick reply.

> I am pretty sure you hit bug #26982: https://tracker.ceph.com/issues/26982 
> "mds: crash when dumping ops in flight".

Everything is fine, the daemon did not crash. The dump cache operation seems to 
be a blocking operation. It simply blocked the MDS on ceph-08 for too long and 
the mons decided to flip to the MDS on ceph-12. The MDS on ceph-08 is up for 
almost 5 days:

[root@ceph-mds:ceph-08 /]# ps -e -o pid,etime,cmd
PID ELAPSED CMD
  1  4-21:03:44 /bin/bash /entrypoint.sh mds
190  4-21:03:43 /usr/bin/ceph-mds --cluster ceph --setuser ceph --setgroup 
ceph -d -i ceph-08
  31344   02:42 /bin/bash
  31364   00:00 ps -e -o pid,etime,cmd

The relevant section from the syslog is (filtered by 'grep -i mds'):

May 18 10:20:45 ceph-08 journal: 2019-05-18 08:20:45.400 7f1c99552700  1 
mds.ceph-08 asok_command: dump cache (starting...)
May 18 10:20:45 ceph-08 journal: 2019-05-18 08:20:45.400 7f1c99552700  1 
mds.0.cache dump_cache to /var/log/ceph/mds-case/cache
May 18 10:20:51 ceph-01 journal: cluster 2019-05-18 08:20:44.135690 mds.ceph-08 
mds.0 192.168.32.72:6800/314672380 2554 : cluster 
[WRN] 7 slow requests, 0 included below; oldest blocked for > 1931.724397 secs
May 18 10:20:51 ceph-03 journal: cluster 2019-05-18 08:20:44.135690 mds.ceph-08 
mds.0 192.168.32.72:6800/314672380 2554 : cluster 
[WRN] 7 slow requests, 0 included below; oldest blocked for > 1931.724397 secs
May 18 10:20:51 ceph-02 journal: cluster 2019-05-18 08:20:44.135690 mds.ceph-08 
mds.0 192.168.32.72:6800/314672380 2554 : cluster 
[WRN] 7 slow requests, 0 included below; oldest blocked for > 1931.724397 secs
May 18 10:21:01 ceph-08 journal: 2019-05-18 08:21:01.414 7f1c952c1700  1 
heartbeat_map is_healthy 'MDSRank' had timed out after 15
May 18 10:21:01 ceph-08 journal: 2019-05-18 08:21:01.414 7f1c952c1700  0 
mds.beacon.ceph-08 _send skipping beacon, heartbeat map not healthy
May 18 10:21:03 ceph-08 journal: 2019-05-18 08:21:03.549 7f1c99d53700  1 
heartbeat_map is_healthy 'MDSRank' had timed out after 15
May 18 10:21:05 ceph-08 journal: 2019-05-18 08:21:05.414 7f1c952c1700  1 
heartbeat_map is_healthy 'MDSRank' had timed out after 15
May 18 10:21:05 ceph-08 journal: 2019-05-18 08:21:05.414 7f1c952c1700  0 
mds.beacon.ceph-08 _send skipping beacon, heartbeat map not healthy
May 18 10:21:08 ceph-08 journal: 2019-05-18 08:21:08.549 7f1c99d53700  1 
heartbeat_map is_healthy 'MDSRank' had timed out after 15
May 18 10:21:09 ceph-08 journal: 2019-05-18 08:21:09.415 7f1c952c1700  1 
heartbeat_map is_healthy 'MDSRank' had timed out after 15
May 18 10:21:09 ceph-08 journal: 2019-05-18 08:21:09.415 7f1c952c1700  0 
mds.beacon.ceph-08 _send skipping beacon, heartbeat map not healthy
May 18 10:21:13 ceph-01 journal: debug 2019-05-18 08:21:13.021 7f38552b8700  1 
mon.ceph-01@0(leader).mds e16312 no beacon from mds.0.15942 (gid: 327273 addr: 
192.168.32.72:6800/314672380 state: up:active) since 15.6064s
May 18 10:21:13 ceph-01 journal: debug 2019-05-18 08:21:13.021 7f38552b8700  1 
mon.ceph-01@0(leader).mds e16312  replacing 327273 
192.168.32.72:6800/314672380mds.0.15942 up:active with 457451/ceph-12 
192.168.32.76:6800/3202682100
May 18 10:21:13 ceph-01 journal: debug 2019-05-18 08:21:13.021 7f38552b8700  0 
log_channel(cluster) log [WRN] : daemon mds.ceph-08 is not responding, 
replacing it as rank 0 with standby daemon mds.ceph-12
May 18 10:21:13 ceph-01 journal: debug 2019-05-18 08:21:13.021 7f38552b8700  1 
mon.ceph-01@0(leader).mds e16312 fail_mds_gid 327273 mds.ceph-08 role 0
May 18 10:21:13 ceph-01 journal: debug 2019-05-18 08:21:13.038 7f38552b8700  0 
log_channel(cluster) log [WRN] : Health check failed: insufficient standby MDS 
daemons available (MDS_INSUFFICIENT_STANDBY)
May 18 10:21:13 ceph-01 journal: debug 2019-05-18 08:21:13.038 7f38552b8700  0 
log_channel(cluster) log [INF] : Health check cleared: MDS_SLOW_REQUEST (was: 1 
MDSs report slow requests)
May 18 10:21:13 ceph-01 journal: debug 2019-05-18 08:21:13.105 7f384eaab700  0 
mon.ceph-01@0(leader).mds e16313 new map
May 18 10:21:13 ceph-01 journal: debug 2019-05-18 08:21:13.105 7f384eaab700  0 
mon.ceph-01@0(leader).mds e16313 print_map
May 18 10:21:13 ceph-01 journal: compat: compat={},rocompat={},incompat={1=base 
v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in 
separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no 
anchor table,9=file layout v2,10=snaprealm v2}

Sorry, I should have checked this first.

Best regards,

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mimic: MDS standby-replay causing blocked ops (MDS bug?)

2019-05-18 Thread Frank Schilder
Hi Stefan,

thanks for being so thorough. I am aware of that. We are still in a pilot 
phase, which is also the reason that I'm still relatively relaxed about the 
observed issue. I guess you also noticed that our cluster is almost empty too.

I don't have a complete list of storage requirements yet and had to restrict 
allocation of PGs to a reasonable minimum as with mimic I cannot reduce the PG 
count of a pool. With the current values I see imbalance but still reasonable 
performance. Once I have more information about what pools I still need to 
create, I will aim for the 100 PGs per OSD. I actually plan to give the cephfs 
a bit higher share for performance reasons. It's on the list.

Thanks again and have a good weekend,

=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Stefan Kooman 
Sent: 18 May 2019 17:41
To: Frank Schilder
Cc: Yan, Zheng; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] mimic: MDS standby-replay causing blocked ops (MDS 
bug?)

Quoting Frank Schilder (fr...@dtu.dk):
>
> [root@ceph-01 ~]# ceph status # before the MDS failed over
>   cluster:
> id: ###
> health: HEALTH_WARN
> 1 MDSs report slow requests
>
>   services:
> mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
> mgr: ceph-01(active), standbys: ceph-02, ceph-03
> mds: con-fs-1/1/1 up  {0=ceph-08=up:active}, 1 up:standby
> osd: 192 osds: 192 up, 192 in
>
>   data:
> pools:   5 pools, 750 pgs
> objects: 6.35 M objects, 5.2 TiB
> usage:   5.1 TiB used, 1.3 PiB / 1.3 PiB avail
> pgs: 750 active+clean

How many pools do you plan to use? You have 5 pools and only 750 PGs
total? What hardware do you have for OSDs? If cephfs is your biggest
user I would add up to 6150 (!) PGs to your pool(s). Having around ~100 PGs
per OSD is healthy. The cluster will also be able to balance way better.
Math: ((100 (PG/OSD) * 192 (# OSDs)) - 750) / 3 = 6150 for 3-replica
pools. You might have a lot of contention going on on your OSDs; they
are probably under-performing.

Gr. Stefan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mimic: MDS standby-replay causing blocked ops (MDS bug?)

2019-05-20 Thread Frank Schilder
Dear Yan,

thank you for taking care of this. I removed all snapshots and stopped snapshot 
creation.

Please keep me posted.

Best regards,

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Yan, Zheng 
Sent: 20 May 2019 13:34:07
To: Frank Schilder
Cc: Stefan Kooman; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] mimic: MDS standby-replay causing blocked ops (MDS 
bug?)

On Sat, May 18, 2019 at 5:47 PM Frank Schilder  wrote:
>
> Dear Yan and Stefan,
>
> it happened again and there were only very few ops in the queue. I pulled the 
> ops list and the cache. Please find a zip file here: 
> "https://files.dtu.dk/u/w6nnVOsp51nRqedU/mds-stuck-dirfrag.zip?l"; . Its a bit 
> more than 100MB.
>

The MDS cache dump shows a snapshot-related issue. Please avoid using
snapshots until we fix the bug.

Regards
Yan, Zheng

> The active MDS failed over to the standby after or during the dump cache 
> operation. Is this expected? As a result, the cluster is healthy and I can't 
> do further diagnostics. In case you need more information, we have to wait 
> until next time.
>
> Some further observations:
>
> There was no load on the system. I start suspecting that this is not a 
> load-induced event. It is also not caused by excessive atime updates, the FS 
> is mounted with relatime. Could it have to do with the large level-2 network 
> (ca. 550 client servers in the same broadcast domain)? I include our kernel 
> tuning profile below, just in case. The cluster networks (back and front) are 
> isolated VLANs, no gateways, no routing.
>
> We run rolling snapshots on the file system. I didn't observe any problems 
> with this, but am wondering if this might be related. We have currently 30 
> snapshots in total. Here is the output of status and pool ls:
>
> [root@ceph-01 ~]# ceph status # before the MDS failed over
>   cluster:
> id: ###
> health: HEALTH_WARN
> 1 MDSs report slow requests
>
>   services:
> mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
> mgr: ceph-01(active), standbys: ceph-02, ceph-03
> mds: con-fs-1/1/1 up  {0=ceph-08=up:active}, 1 up:standby
> osd: 192 osds: 192 up, 192 in
>
>   data:
> pools:   5 pools, 750 pgs
> objects: 6.35 M objects, 5.2 TiB
> usage:   5.1 TiB used, 1.3 PiB / 1.3 PiB avail
> pgs: 750 active+clean
>
> [root@ceph-01 ~]# ceph status # after cache dump and the MDS failed over
>   cluster:
> id: ###
> health: HEALTH_OK
>
>   services:
> mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
> mgr: ceph-01(active), standbys: ceph-02, ceph-03
> mds: con-fs-1/1/1 up  {0=ceph-12=up:active}, 1 up:standby
> osd: 192 osds: 192 up, 192 in
>
>   data:
> pools:   5 pools, 750 pgs
> objects: 6.33 M objects, 5.2 TiB
> usage:   5.1 TiB used, 1.3 PiB / 1.3 PiB avail
> pgs: 749 active+clean
>  1   active+clean+scrubbing+deep
>
>   io:
> client:   6.3 KiB/s wr, 0 op/s rd, 0 op/s wr
>
> [root@ceph-01 ~]# ceph osd pool ls detail # after the MDS failed over
> pool 1 'sr-rbd-meta-one' replicated size 3 min_size 2 crush_rule 1 
> object_hash rjenkins pg_num 80 pgp_num 80 last_change 486 flags 
> hashpspool,nodelete,selfmanaged_snaps max_bytes 536870912000 stripe_width 0 
> application rbd
> removed_snaps [1~5]
> pool 2 'sr-rbd-data-one' erasure size 8 min_size 6 crush_rule 5 object_hash 
> rjenkins pg_num 300 pgp_num 300 last_change 1759 flags 
> hashpspool,ec_overwrites,nodelete,selfmanaged_snaps max_bytes 274877906944000 
> stripe_width 24576 compression_mode aggressive application rbd
> removed_snaps [1~3]
> pool 3 'sr-rbd-one-stretch' replicated size 4 min_size 2 crush_rule 2 
> object_hash rjenkins pg_num 20 pgp_num 20 last_change 500 flags 
> hashpspool,nodelete,selfmanaged_snaps max_bytes 5497558138880 stripe_width 0 
> compression_mode aggressive application rbd
> removed_snaps [1~7]
> pool 4 'con-fs-meta' replicated size 3 min_size 2 crush_rule 3 object_hash 
> rjenkins pg_num 50 pgp_num 50 last_change 428 flags hashpspool,nodelete 
> max_bytes 1099511627776 stripe_width 0 application cephfs
> pool 5 'con-fs-data' erasure size 10 min_size 8 crush_rule 6 object_hash 
> rjenkins pg_num 300 pgp_num 300 last_change 2561 flags 
> hashpspool,ec_overwrites,nodelete,selfmanaged_snaps max_bytes 21990232200 
> stripe_width 32768 compression_mode aggressive application cephfs
> removed_snaps 
> [2~3d,41~2a,6d~2a,99~c,a6~1e,c6~18,df~3,e3~1,e5~3,e9~1,eb~3,ef~1,f1~1,f3~1,f5~3,f9~1,fb~3,ff~1,101~1,103~1,105~1,107~1,109~1,10b~1,10d~1,10f~1,111~1]

Re: [ceph-users] Default min_size value for EC pools

2019-05-20 Thread Frank Schilder
This is an issue that is coming up every now and then (for example: 
https://www.mail-archive.com/ceph-users@lists.ceph.com/msg50415.html) and I 
would consider it a very serious one (I will give an example below). A 
statement like "min_size = k is unsafe and should never be set" deserves a bit 
more explanation, because ceph is the only storage system I know of, for which 
k+m redundancy does *not* mean "you can lose up to m disks and still have 
read-write access". If this is really true then, assuming the same redundancy 
level, losing service (client access) is significantly more likely with ceph 
than with other storage systems. And this has impact on design and storage 
pricing.

However, some help seems on the way and an, in my opinion, utterly important 
feature update seems almost finished: https://github.com/ceph/ceph/pull/17619 . 
It will implement the following:

- recovery I/O happens as long as k shards are available (this is new)
- client I/O will happen as long as min_size shards are available
- recommended is min_size=k+1 (this might be wrong)

This is pretty good and much better than the current behaviour (see below). 
This pull request also offers useful further information.

Apparently, there is some kind of rare issue with erasure coding in ceph that 
makes it problematic to use min_size=k. I couldn't find anything better than 
vague explanations. Quote from the thread above: "Recovery on EC pools requires 
min_size rather than k shards at this time. There were reasons; they weren't 
great."

This is actually a situation I was in. I once lost 2 failure domains 
simultaneously on an 8+2 EC pool and was really surprised that recovery stopped 
after some time with the worst degraded PGs remaining unfixed. I discovered the 
min_size=9 (instead of 8) and "ceph health detail" recommended to reduce 
min_size. Before doing so, I searched the web (I mean, why the default k+1? 
Come on, there must be a reason.) and found some vague hints about problems 
with min_size=k during rebuild. This is a really bad corner to be in. A lot of 
PGs are already critically degraded and the only way forward was to make a bad 
situation worse, because reducing min_size would immediately enable client I/O 
in addition to recovery I/O.

It looks like the default of min_size=k+1 will stay, because min_size=k does 
have some rare issues and these seem not to disappear. (I hope I'm wrong 
though.) Hence, if min_size=k will remain problematic, the recommendation 
should be "never to use m=1" instead of "never use min_size=k". In other words, 
instead of using a 2+1 EC profile, one should use a 4+2 EC profile. If one 
would like to have secure write access for n disk losses, then m>=n+1.
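
As a concrete sketch (profile, pool name and PG count are just examples), a 4+2 
profile with the default min_size of k+1 would be set up like this:

  ceph osd erasure-code-profile set ec-4-2 k=4 m=2 crush-failure-domain=host
  ceph osd pool create ec-test 128 128 erasure ec-4-2
  ceph osd pool get ec-test min_size    # expect k+1 = 5 on recent releases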

If this issue remains, in my opinion this should be taken up in the best 
practices section. In particular, the documentation should not use examples 
with m=1, this gives the wrong impression. Either min_size=k is safe or not. If 
it is not, it should never be used anywhere in the documentation.

I hope I marked my opinions and hypotheses clearly and that the links are 
helpful. If anyone could shed some light on as to why exactly min_size=k+1 is 
important, I would be grateful.

Best regards,

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Default min_size value for EC pools

2019-05-20 Thread Frank Schilder
Dear Paul,

thank you very much for this clarification. I believe ZFS erasure-coded data also 
has this property, which is probably the main cause for the expectation of 
min_size=k. So, basically, min_size=k means that we are on the security level 
of traditional redundant storage, and this may or may not be good enough - there 
is no additional risk beyond that. Ceph's default says it is not good enough. 
That's perfectly fine - assuming the rebuild gets fixed.

I have a follow-up: I thought that non-redundant writes would almost never 
occur, because PGs get remapped before accepting writes. To stay with my 
example of 2 (out of 16) failure domains failing simultaneously, I thought that 
all PGs will immediately be remapped to fully redundant sets, because there are 
still 14 failure domains up and only 10 are needed for the 8+2 EC profile. 
Furthermore, I assumed that writes would not be accepted before a PG is 
remapped, meaning that every new write will always be fully redundant while 
recovery I/O slowly recreates the missing objects in the background.

If this "remap first" strategy is not the current behaviour, would it make 
sense to consider this as an interesting feature? Is there any reason for not 
remapping all PGs (if possible) prior to starting recovery? It would eliminate 
the lack of redundancy for new writes (at least for new objects).

Thanks again and best regards,

=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Paul Emmerich 
Sent: 20 May 2019 21:23
To: Frank Schilder
Cc: florent; ceph-users
Subject: Re: [ceph-users] Default min_size value for EC pools

Yeah, the current situation with recovery and min_size is... unfortunate :(

The reason why min_size = k is bad is just that it means you are accepting 
writes without guaranteeing durability while you are in a degraded state.
A durable storage system should never tell a client "okay, i've written your 
data" if losing a single disk leads to data loss.

Yes, that is the default behavior of traditional raid 5 and raid 6 systems 
during rebuild (with 1 or 2 disk failures for raid 5/6), but that doesn't mean 
it's a good idea.


Paul

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io<http://www.croit.io>
Tel: +49 89 1896585 90


On Mon, May 20, 2019 at 7:37 PM Frank Schilder 
mailto:fr...@dtu.dk>> wrote:
This is an issue that is coming up every now and then (for example: 
https://www.mail-archive.com/ceph-users@lists.ceph.com/msg50415.html) and I 
would consider it a very serious one (I will give an example below). A 
statement like "min_size = k is unsafe and should never be set" deserves a bit 
more explanation, because ceph is the only storage system I know of, for which 
k+m redundancy does *not* mean "you can loose up to m disks and still have 
read-write access". If this is really true then, assuming the same redundancy 
level, loosing service (client access) is significantly more likely with ceph 
than with other storage systems. And this has impact on design and storage 
pricing.

However, some help seems on the way and an, in my opinion, utterly important 
feature update seems almost finished: https://github.com/ceph/ceph/pull/17619 . 
It will implement the following:

- recovery I/O happens as long as k shards are available (this is new)
- client I/O will happen as long as min_size shards are available
- recommended is min_size=k+1 (this might be wrong)

This is pretty good and much better than the current behaviour (see below). 
This pull request also offers useful further information.

Apparently, there is some kind of rare issue with erasure coding in ceph that 
makes it problematic to use min_size=k. I couldn't find anything better than 
vague explanations. Quote from the thread above: "Recovery on EC pools requires 
min_size rather than k shards at this time. There were reasons; they weren't 
great."

This is actually a situation I was in. I once lost 2 failure domains 
simultaneously on an 8+2 EC pool and was really surprised that recovery stopped 
after some time with the worst degraded PGs remaining unfixed. I discovered the 
min_size=9 (instead of 8) and "ceph health detail" recommended to reduce 
min_size. Before doing so, I searched the web (I mean, why the default k+1? 
Come on, there must be a reason.) and found some vague hints about problems 
with min_size=k during rebuild. This is a really bad corner to be in. A lot of 
PGs are already critically degraded and the only way forward was to make a bad 
situation worse, because reducing min_size would immediately enable client I/O 
in addition to recovery I/O.

It looks like the default of min_size=k+1 will stay, because min_size=k does 
have some rare issues and these seem not to disappear. (I hope 

Re: [ceph-users] Default min_size value for EC pools

2019-05-20 Thread Frank Schilder
If min_size=1 and you lose the last disk, that's the end of any data that was only 
on this disk.

Apart from this, using size=2 and min_size=1 is a really bad idea. This has 
nothing to do with data replication but rather with an inherent problem with 
high availability and the number 2. You need at least 3 members of an HA group 
to ensure stable operation with proper majorities. There are numerous stories 
about OSD flapping caused by size-2 min_size-1 pools, leading to situations 
that are extremely hard to recover from. My favourite is this one: 
https://blog.noc.grnet.gr/2016/10/18/surviving-a-ceph-cluster-outage-the-hard-way/
 . You will easily find more. The deeper problem here is called "split-brain" 
and there is no real solution to it except to avoid it at all cost.

Best regards,

=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Florent B 
Sent: 20 May 2019 21:33
To: Paul Emmerich; Frank Schilder
Cc: ceph-users
Subject: Re: [ceph-users] Default min_size value for EC pools

I understand better thanks to Frank & Paul messages.

Paul, when min_size=k, is it the same problem with replicated pool size=2 & 
min_size=1 ?

On 20/05/2019 21:23, Paul Emmerich wrote:
Yeah, the current situation with recovery and min_size is... unfortunate :(

The reason why min_size = k is bad is just that it means you are accepting 
writes without guaranteeing durability while you are in a degraded state.
A durable storage system should never tell a client "okay, i've written your 
data" if losing a single disk leads to data loss.

Yes, that is the default behavior of traditional raid 5 and raid 6 systems 
during rebuild (with 1 or 2 disk failures for raid 5/6), but that doesn't mean 
it's a good idea.


Paul

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io<http://www.croit.io>
Tel: +49 89 1896585 90
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Default min_size value for EC pools

2019-05-20 Thread Frank Schilder
Dear Maged,

thanks for elaborating on this question. Is there already information on the 
release in which this patch will land?

Best regards,

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Default min_size value for EC pools

2019-05-21 Thread Frank Schilder
Hi Paul,

maybe we misunderstood each other here or I'm misunderstanding something. My HA 
comment was not on PGs becoming active/inactive or data loss.

As far as I understand the discussions, the OSD flapping itself may be caused 
by the 2-member HA group, because the OSDs keep marking each other out and 
themselves in continuously. As far as I saw, this type of OSD flapping is never 
observed when only using pools with size>=3 and min_size>size/2 (strictly 
larger than), because this min_size setting will always ensure a stable (>50%) 
majority of votes of HA members that cannot be questioned by a single OSD 
trying to mark itself as in.

At least the only context I have heard of OSD flapping was in connection to 
2/1-pools. I have never seen such a report for, say, 3/2 pools. Am I 
overlooking something here?

Best regards,

=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Paul Emmerich 
Sent: 21 May 2019 09:45
To: Frank Schilder
Cc: Florent B; ceph-users
Subject: Re: [ceph-users] Default min_size value for EC pools

No, there is no split-brain problem even with size/min_size 2/1. A PG will not 
go active if it doesn't have the latest data because all other OSDs that might 
have seen writes are currently offline.
That's what the history_ignore_les_bounds option effectively does: it tells 
ceph to take a PG active anyways in that situation.

That's why you end up with inactive PGs if you run 2/1 and a disk dies while 
OSDs flap. You then have to set history_ignore_les_bounds if the dead disk is 
really unrecoverable, losing the latest modifications to an object.
But Ceph will not compromise your data without you manually telling it to do 
so, it will just block IO instead.


Paul

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io<http://www.croit.io>
Tel: +49 89 1896585 90


On Mon, May 20, 2019 at 10:04 PM Frank Schilder 
mailto:fr...@dtu.dk>> wrote:
If min_size=1 and you loose the last disk, that's end of any data that was only 
on this disk.

Apart from this, using size=2 and min_size=1 is a really bad idea. This has 
nothing to do with data replication but rather with an inherent problem with 
high availability and the number 2. You need at least 3 members of an HA group 
to ensure stable operation with proper majorities. There are numerous stories 
about OSD flapping caused by size-2 min_size-1 pools, leading to situations 
that are extremely hard to recover from. My favourite is this one: 
https://blog.noc.grnet.gr/2016/10/18/surviving-a-ceph-cluster-outage-the-hard-way/
 . You will easily find more. The deeper problem here is called "split-brain" 
and there is no real solution to it except to avoid it at all cost.

Best regards,

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Florent B mailto:flor...@coppint.com>>
Sent: 20 May 2019 21:33
To: Paul Emmerich; Frank Schilder
Cc: ceph-users
Subject: Re: [ceph-users] Default min_size value for EC pools

I understand better thanks to Frank & Paul messages.

Paul, when min_size=k, is it the same problem with replicated pool size=2 & 
min_size=1 ?

On 20/05/2019 21:23, Paul Emmerich wrote:
Yeah, the current situation with recovery and min_size is... unfortunate :(

The reason why min_size = k is bad is just that it means you are accepting 
writes without guaranteeing durability while you are in a degraded state.
A durable storage system should never tell a client "okay, i've written your 
data" if losing a single disk leads to data loss.

Yes, that is the default behavior of traditional raid 5 and raid 6 systems 
during rebuild (with 1 or 2 disk failures for raid 5/6), but that doesn't mean 
it's a good idea.


Paul

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io<http://www.croit.io><http://www.croit.io>
Tel: +49 89 1896585 90
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs causing high load on vm, taking down 15 min later another cephfs vm

2019-05-23 Thread Frank Schilder
Hi Marc,

if you can exclude network problems, you can ignore this message.

The only time we observed something that might be similar to your problem was 
when a network connection was overloaded. Potential causes include

- broadcast storm
- the "too much cache memory" issues 
https://www.suse.com/support/kb/doc/?id=7010287
- a network or I/O intensive scheduled task that runs at the same time on many 
machines
- a shared up-link between clients and ceph storage with insufficient peak 
capacity
- a bad link in a trunk

In our case, we observed two different network related break downs:

- broadcast storms, probably caused by a misbehaving router and
- a bad link in a trunk. The trunk was a switch stacking connection and failed 
due to a half-broken SFP transceiver. This was really bad and hard to find, 
because the hardware error was not detected by the internal health checks (the 
transceiver showed up as good). The symptom was that packets just disappeared 
randomly, the more likely the larger they were. However, no packet losses were 
reported on the server NICs, because they got lost within the switch stack. 
Everything looked healthy. It just didn't work.

If a network connection becomes too congested, latency might get high enough 
for ceph or ceph clients to trigger time-outs. Also, connection attempts might 
repeatedly time out and fail in short succession. We also saw that OSD 
heartbeats did not arrive in time.

Ceph tends to react faster than other services to network issues, so you might 
not see ssh problems etc. while still having a network problem.

If your ceph cluster was healthy during the event (100% cpu load on an OSD is 
not necessarily unhealthy), this could indicate that it is not ceph related.

Some things worth checking:

- are there any health warnings or errors in the ceph.log
- are slow ops/requests reported
- do you have any network load/health monitoring in place (netdata is really 
good for this)
- are you collecting client/guest I/O stats with the hypervisor, do they peak 
during the incident
- are there high-network-load scheduled tasks on your machines (host or VM) or 
somewhere else affecting relevant network traffic (backups etc?)
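
For reference, the kind of commands that cover the first two checks (log path 
and OSD id are only examples):

  ceph status
  ceph health detail
  grep -Ei 'slow (request|ops)|heartbeat' /var/log/ceph/ceph.log
  ceph daemon osd.0 dump_historic_ops    # recent slow ops on one OSD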

Best regards,

=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: ceph-users  on behalf of Marc Roos 

Sent: 20 May 2019 12:41:43
To: ceph-users
Subject: [ceph-users] cephfs causing high load on vm, taking down 15 min later 
another cephfs vm

I got my first problem with cephfs in a production environment. Is it
possible from these logfiles to deduce what happened?

svr1 is connected to ceph client network via switch
svr2 vm is collocated on c01 node.
c01 has osd's and the mon.a colocated.

svr1 was the first to report errors at 03:38:44. I have no error
messages reported of a network connection problem by any of the ceph
nodes. I have nothing in dmesg on c01.

[@c01 ~]# cat /etc/redhat-release
CentOS Linux release 7.6.1810 (Core)
[@c01 ~]# uname -a
Linux c01 3.10.0-957.10.1.el7.x86_64 #1 SMP Mon Mar 18 15:06:45 UTC 2019
x86_64 x86_64 x86_64 GNU/Linux
[@c01 ~]# ceph versions
{
"mon": {
"ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777)
luminous (stable)": 3
},
"mgr": {
"ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777)
luminous (stable)": 3
},
"osd": {
"ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777)
luminous (stable)": 32
},
"mds": {
"ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777)
luminous (stable)": 2
},
"rgw": {
"ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777)
luminous (stable)": 2
},
"overall": {
"ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777)
luminous (stable)": 42
}
}




[0] svr1 messages
May 20 03:36:01 svr1 systemd: Started Session 308978 of user root.
May 20 03:36:01 svr1 systemd: Started Session 308979 of user root.
May 20 03:36:01 svr1 systemd: Started Session 308979 of user root.
May 20 03:36:01 svr1 systemd: Started Session 308980 of user root.
May 20 03:36:01 svr1 systemd: Started Session 308980 of user root.
May 20 03:38:01 svr1 systemd: Started Session 308981 of user root.
May 20 03:38:01 svr1 systemd: Started Session 308981 of user root.
May 20 03:38:01 svr1 systemd: Started Session 308982 of user root.
May 20 03:38:01 svr1 systemd: Started Session 308982 of user root.
May 20 03:38:01 svr1 systemd: Started Session 308983 of user root.
May 20 03:38:01 svr1 systemd: Started Session 308983 of user root.
May 20 03:38:44 svr1 kernel: libceph: osd0 192.168.x.111:6814 io error
May 20 03:38:44 svr1 kernel: libceph: osd0 192.168.x.111:6814 io error
May 20 03:38:45 svr1 kernel: last message repeated 5 times
May 20 03:38:45 svr1 kernel: libc

[ceph-users] Pool configuration for RGW on multi-site cluster

2019-06-17 Thread Frank Schilder
We have a two-site cluster with OSDs and pools localised in two locations. I 
was now looking into setting up a rados gateway with the following properties:

- Pools should be EC pools whenever possible.
- Pools for specific buckets should be localised at OSDs on only one of the two 
locations (meaning the crush root must *not* be "default").

Unfortunately, I seem to be unable to find documentation on how to configure 
pools used by RGW in such detail. It seems that the RGW daemon and 
radosgw-admin create pools on the fly, using some global settings that don't 
allow any a priori fine-tuning of the type described above. I looked here:

- http://docs.ceph.com/docs/mimic/radosgw/placement/
- http://docs.ceph.com/docs/mimic/radosgw/pools/
- http://docs.ceph.com/docs/mimic/radosgw/multisite/

I would be most grateful about answers (or links) to the following questions:

- Which pools are used by RGW (where can I find a complete list)?
- Which of these pools must be replicated and which can be EC pools?
- Are there sizing guides and performance considerations (replication type, 
device class, best practices)?
- If I create all of these pools empty, with the desired properties and prior to 
RGW startup, will the RGW daemon work properly?
- If some pools need to be created by the RGW daemon, how does one specify 
details like
  * crush root
  * EC profile / replication rule
  * device class
  * etc.

I would like to avoid any manual a-posteriori operations like editing crush 
rules to adjust locations of pools, etc.
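
To illustrate the kind of pre-creation I have in mind (the pool and profile names 
assume the default zone layout, and the crush root "site-a" is a placeholder):

  ceph osd erasure-code-profile set rgw-ec-site-a k=8 m=2 crush-failure-domain=host crush-root=site-a crush-device-class=hdd
  ceph osd pool create default.rgw.buckets.data 256 256 erasure rgw-ec-site-a
  ceph osd pool application enable default.rgw.buckets.data rgw
  # index/meta/log pools would stay replicated, e.g. on an SSD-backed rule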

Best regards,

=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph fs: stat fails on folder

2019-06-17 Thread Frank Schilder
We observe the following on ceph fs clients with identical ceph fs mounts:

[frans@sophia1 ~]$ ls -l ../neda
ls: cannot access ../neda/NEWA_TEST: Permission denied
total 5
drwxrwxr-x 1 neda neda1 May 17 19:30 ffpy_test
-rw-rw-r-- 1 neda neda  135 May 17 21:06 mount_newa
drwxrwxr-x 1 neda neda1 Jun  6 15:39 neda
drwxrwx--- 1 neda neda 1405 Jun 13 15:25 NEWA
d? ? ??   ?? NEWA_TEST
-rw-rw-r-- 1 neda neda 3671 Jun  3 15:37 test_post.py
-rw-r--r-- 1 neda neda  211 May 17 20:28 test_sophia.slurm

[frans@sn440 ~]$ ls -l ../neda
total 5
drwxrwxr-x 1 neda neda1 May 17 19:30 ffpy_test
-rw-rw-r-- 1 neda neda  135 May 17 21:06 mount_newa
drwxrwxr-x 1 neda neda1 Jun  6 15:39 neda
drwxrwx--- 1 neda neda 1405 Jun 13 15:25 NEWA
drwxrwxr-x 1 neda neda0 May 17 18:58 NEWA_TEST
-rw-rw-r-- 1 neda neda 3671 Jun  3 15:37 test_post.py
-rw-r--r-- 1 neda neda  211 May 17 20:28 test_sophia.slurm

On sophia1 'stat ../neda/NEWA_TEST' returns with permission denied while we see 
no problem on any of our other clients. I guess temporarily evicting the client 
or failing over the MDS will restore access from this client. However, sophia1 
is a head node of an HPC cluster and I would really like to avoid clearing the 
client cache - if possible.

There is no urgent pressure to fix this and I can collect some debug info in 
case this is a yet unknown issue. Please let me know what information to 
collect and how to proceed.

The storage cluster is still at 13.2.2 (upgrade planned):
[root@ceph-01 ~]# ceph -v
ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable)

and the client is at 12.2.11 (upgrade to mimic planned):
[frans@sophia1 ~]$ ceph -v
ceph version 12.2.11 (26dc3775efc7bb286a1d6d66faee0ba30ea23eee) luminous 
(stable)

I can't see anything unusual in the logs or health reports.

Thanks for your help!

=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph fs: stat fails on folder

2019-06-17 Thread Frank Schilder
Please ignore the message below, it has nothing to do with ceph.

Sorry for the spam.

Best regards,

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: ceph-users  on behalf of Frank 
Schilder 
Sent: 17 June 2019 20:33
To: ceph-users@lists.ceph.com
Subject: [ceph-users] ceph fs: stat fails on folder

We observe the following on ceph fs clients with identical ceph fs mounts:

[frans@sophia1 ~]$ ls -l ../neda
ls: cannot access ../neda/NEWA_TEST: Permission denied
total 5
drwxrwxr-x 1 neda neda1 May 17 19:30 ffpy_test
-rw-rw-r-- 1 neda neda  135 May 17 21:06 mount_newa
drwxrwxr-x 1 neda neda1 Jun  6 15:39 neda
drwxrwx--- 1 neda neda 1405 Jun 13 15:25 NEWA
d? ? ??   ?? NEWA_TEST
-rw-rw-r-- 1 neda neda 3671 Jun  3 15:37 test_post.py
-rw-r--r-- 1 neda neda  211 May 17 20:28 test_sophia.slurm

[frans@sn440 ~]$ ls -l ../neda
total 5
drwxrwxr-x 1 neda neda1 May 17 19:30 ffpy_test
-rw-rw-r-- 1 neda neda  135 May 17 21:06 mount_newa
drwxrwxr-x 1 neda neda1 Jun  6 15:39 neda
drwxrwx--- 1 neda neda 1405 Jun 13 15:25 NEWA
drwxrwxr-x 1 neda neda0 May 17 18:58 NEWA_TEST
-rw-rw-r-- 1 neda neda 3671 Jun  3 15:37 test_post.py
-rw-r--r-- 1 neda neda  211 May 17 20:28 test_sophia.slurm

On sophia1 'stat ../neda/NEWA_TEST' returns with permission denied while we see 
no problem on any of our other clients. I guess temporarily evicting the client 
or failing over the MDS will restore access from this client. However, sophia1 
is a head node of an HPC cluster and I would really like to avoid clearing the 
client cache - if possible.

There is no urgent pressure to fix this and I can collect some debug info in 
case this is a yet unknown issue. Please let me know what information to 
collect and how to proceed.

The storage cluster is still at 13.2.2 (upgrade planned):
[root@ceph-01 ~]# ceph -v
ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable)

and the client is at 12.2.11 (upgrade to mimic planned):
[frans@sophia1 ~]$ ceph -v
ceph version 12.2.11 (26dc3775efc7bb286a1d6d66faee0ba30ea23eee) luminous 
(stable)

I can't see anything unusual in the logs or health reports.

Thanks for your help!

=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] understanding the bluestore blob, chunk and compression params

2019-06-20 Thread Frank Schilder
Hi Dan,

this older thread 
(https://www.mail-archive.com/ceph-users@lists.ceph.com/msg49339.html) contains 
details about:

- how to get bluestore compression working (must be enabled on pool as well as 
OSD)
- what the best compression ratio is depending on the application (if 
applications do not give hints, it is 
bluestore_min_alloc_size_hdd/bluestore_compression_min_blob_size_hdd, which is 
usually 0.5 as you observe).
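
As far as I understand it, with the defaults you list below 
(bluestore_min_alloc_size_hdd = 64kB, bluestore_compression_min_blob_size_hdd = 
128kB) the arithmetic is simply: a 128kB compression blob can shrink to at best 
one 64kB allocation unit, so allocated/original is at best 64/128 = 0.5.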

I doubled bluestore_min_alloc_size_hdd to get to 0.25. There are trade-offs for 
random I/O performance. However, since I use EC pools, I have those anyway. 
For replicated pools, the aggregated IOPs might be heavily affected. I have, 
however, no data on that case.

Hope that helps,
Frank

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: ceph-users  on behalf of Dan van der 
Ster 
Sent: 20 June 2019 17:23:51
To: ceph-users
Subject: Re: [ceph-users] understanding the bluestore blob, chunk and 
compression params

P.S. I know this has been discussed before, but the
compression_(mode|algorithm) pool options [1] seem completely broken
-- With the pool mode set to force, we see that sometimes the
compression is invoked and sometimes it isn't. AFAICT,
the only way to compress every object is to set
bluestore_compression_mode=force on the osd.

-- dan

[1] http://docs.ceph.com/docs/master/rados/operations/pools/#set-pool-values


On Thu, Jun 20, 2019 at 4:33 PM Dan van der Ster  wrote:
>
> Hi all,
>
> I'm trying to compress an rbd pool via backfilling the existing data,
> and the allocated space doesn't match what I expect.
>
> Here is the test: I marked osd.130 out and waited for it to erase all its 
> data.
> Then I set (on the pool) compression_mode=force and 
> compression_algorithm=zstd.
> Then I marked osd.130 to get its PGs/objects back (this time compressing 
> them).
>
> After a few 10s of minutes we have:
> "bluestore_compressed": 989250439,
> "bluestore_compressed_allocated": 3859677184,
> "bluestore_compressed_original": 7719354368,
>
> So, the allocated is exactly 50% of original, but we are wasting space
> because compressed is 12.8% of original.
>
> I don't understand why...
>
> The rbd images all use 4MB objects, and we use the default chunk and
> blob sizes (in v13.2.6):
>osd_recovery_max_chunk = 8MB
>bluestore_compression_max_blob_size_hdd = 512kB
>bluestore_compression_min_blob_size_hdd = 128kB
>bluestore_max_blob_size_hdd = 512kB
>bluestore_min_alloc_size_hdd = 64kB
>
> From my understanding, backfilling should read a whole 4MB object from
> the src osd, then write it to osd.130's bluestore, compressing in
> 512kB blobs. Those compress on average at 12.8% so I would expect to
> see allocated being closer to bluestore_min_alloc_size_hdd /
> bluestore_compression_max_blob_size_hdd = 12.5%.
>
> Does someone understand where the 0.5 ratio is coming from?
>
> Thanks!
>
> Dan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] understanding the bluestore blob, chunk and compression params

2019-06-20 Thread Frank Schilder
Typo below, I meant "I doubled bluestore_compression_min_blob_size_hdd ..."
____
From: Frank Schilder
Sent: 20 June 2019 19:02
To: Dan van der Ster; ceph-users
Subject: Re: [ceph-users] understanding the bluestore blob, chunk and 
compression params

Hi Dan,

this older thread 
(https://www.mail-archive.com/ceph-users@lists.ceph.com/msg49339.html) contains 
details about:

- how to get bluestore compression working (must be enabled on pool as well as 
OSD)
- what the best compression ratio is depending on the application (if 
applications do not give hints, it is 
bluestore_min_alloc_size_hdd/bluestore_compression_min_blob_size_hdd, which is 
usually 0.5 as you observe).

I doubled bluestore_min_alloc_size_hdd to get to 0.25. There are trade-offs for 
random I/O performance. However, since I use EC pools, I have those any ways. 
For replicated pools, the aggregated IOPs might be heavily affected. I have, 
however, no data on that case.

Hope that helps,
Frank

=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: ceph-users  on behalf of Dan van der 
Ster 
Sent: 20 June 2019 17:23:51
To: ceph-users
Subject: Re: [ceph-users] understanding the bluestore blob, chunk and 
compression params

P.S. I know this has been discussed before, but the
compression_(mode|algorithm) pool options [1] seem completely broken
-- With the pool mode set to force, we see that sometimes the
compression is invoked and sometimes it isn't. AFAICT,
the only way to compress every object is to set
bluestore_compression_mode=force on the osd.

-- dan

[1] http://docs.ceph.com/docs/master/rados/operations/pools/#set-pool-values


On Thu, Jun 20, 2019 at 4:33 PM Dan van der Ster  wrote:
>
> Hi all,
>
> I'm trying to compress an rbd pool via backfilling the existing data,
> and the allocated space doesn't match what I expect.
>
> Here is the test: I marked osd.130 out and waited for it to erase all its 
> data.
> Then I set (on the pool) compression_mode=force and 
> compression_algorithm=zstd.
> Then I marked osd.130 to get its PGs/objects back (this time compressing 
> them).
>
> After a few 10s of minutes we have:
> "bluestore_compressed": 989250439,
> "bluestore_compressed_allocated": 3859677184,
> "bluestore_compressed_original": 7719354368,
>
> So, the allocated is exactly 50% of original, but we are wasting space
> because compressed is 12.8% of original.
>
> I don't understand why...
>
> The rbd images all use 4MB objects, and we use the default chunk and
> blob sizes (in v13.2.6):
>osd_recovery_max_chunk = 8MB
>bluestore_compression_max_blob_size_hdd = 512kB
>bluestore_compression_min_blob_size_hdd = 128kB
>bluestore_max_blob_size_hdd = 512kB
>bluestore_min_alloc_size_hdd = 64kB
>
> From my understanding, backfilling should read a whole 4MB object from
> the src osd, then write it to osd.130's bluestore, compressing in
> 512kB blobs. Those compress on average at 12.8% so I would expect to
> see allocated being closer to bluestore_min_alloc_size_hdd /
> bluestore_compression_max_blob_size_hdd = 12.5%.
>
> Does someone understand where the 0.5 ratio is coming from?
>
> Thanks!
>
> Dan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mimic: MDS standby-replay causing blocked ops (MDS bug?)

2019-06-21 Thread Frank Schilder
Dear Yan, Zheng,

does mimic 13.2.6 fix the snapshot issue? If not, could you please send me a 
link to the issue tracker?

Thanks and best regards,

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Yan, Zheng 
Sent: 20 May 2019 13:34
To: Frank Schilder
Cc: Stefan Kooman; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] mimic: MDS standby-replay causing blocked ops (MDS 
bug?)

On Sat, May 18, 2019 at 5:47 PM Frank Schilder  wrote:
>
> Dear Yan and Stefan,
>
> it happened again and there were only very few ops in the queue. I pulled the 
> ops list and the cache. Please find a zip file here: 
> "https://files.dtu.dk/u/w6nnVOsp51nRqedU/mds-stuck-dirfrag.zip?l"; . Its a bit 
> more than 100MB.
>

The MDS cache dump shows there is a snapshot-related issue. Please avoid using
snapshots until we fix the bug.

Regards
Yan, Zheng
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] What's the best practice for Erasure Coding

2019-07-08 Thread Frank Schilder
Hi David,

I'm running a cluster with bluestore on raw devices (no lvm) and all journals 
collocated on the same disk with the data. Disks are spinning NL-SAS. Our goal 
was to build storage at lowest cost, therefore all data on HDD only. I got a 
few SSDs that I'm using for FS and RBD meta data. All large pools are EC on 
spinning disk.

I spent at least one month to run detailed benchmarks (rbd bench) depending on 
EC profile, object size, write size, etc. Results were varying a lot. My advice 
would be to run benchmarks with your hardware. If there was a single perfect 
choice, there wouldn't be so many options. For example, my tests will not be 
valid when using separate fast disks for WAL and DB.
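
For reference, the kind of runs this is based on looks roughly like the 
following (pool/image names and sizes are only placeholders):

  rbd create --size 100G --object-size 4M --data-pool rbd-ec-data rbd-meta/bench01
  rbd bench --io-type write --io-pattern seq  --io-size 1M --io-threads 1  --io-total 10G rbd-meta/bench01
  rbd bench --io-type write --io-pattern rand --io-size 4K --io-threads 16 --io-total 1G  rbd-meta/bench01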

There are some results though that might be valid in general:

1) EC pools have high throughput but low IOP/s compared with replicated pools

I see single-thread write speeds of up to 1.2GB (gigabyte) per second, which is 
probably the network limit and not the disk limit. IOP/s get better with more 
disks, but are way lower than what replicated pools can provide. On a cephfs 
with EC data pool, small-file IO will be comparably slow and eat a lot of 
resources.

2) I observe massive network traffic amplification on small IO sizes, which is 
due to the way EC overwrites are handled. This is one bottleneck for IOP/s. We 
have 10G infrastructure and use 2x10G client and 4x10G OSD network. OSD 
bandwidth at least 2x client network, better 4x or more.

3) k should only have small prime factors, power of 2 if possible

I tested k=5,6,8,10,12. Best results in decreasing order: k=8, k=6. All other 
choices were poor. The value of m seems not relevant for performance. Larger k 
will require more failure domains (more hardware).

4) object size matters

The best throughput (1M write size) I see with object sizes of 4MB or 8MB, with 
IOP/s getting somewhat better with smaller object sizes but throughput dropping 
fast. I use the default of 4MB in production. Works well for us.

5) jerasure is quite good and seems most flexible

jerasure is quite CPU efficient and can handle smaller chunk sizes than other 
plugins, which is preferable for IOP/s. However, CPU usage can become a 
problem and a plugin optimized for specific values of k and m might help here. 
Under usual circumstances I see very low load on all OSD hosts, even under 
rebalancing. However, I remember that once I needed to rebuild something on all 
OSDs (I don't remember what it was, sorry). In this situation, CPU load went up 
to 30-50% (meaning up to half the cores were at 100%), which is really high 
considering that each server has only 16 disks at the moment and is sized to 
handle up to 100. CPU power could become a bottleneck for us in the future.

These are some general observations and do not replace benchmarks for specific 
use cases. I was hunting for a specific performance pattern, which might not be 
what you want to optimize for. I would recommend running extensive benchmarks if 
you have to live with a configuration for a long time - EC profiles cannot be 
changed.

We settled on 8+2 and 6+2 pools with jerasure and object size 4M. We also use 
bluestore compression. All meta data pools are on SSD, only very little SSD 
space is required. This choice works well for the majority of our use cases. We 
can still build small expensive pools to accommodate special performance 
requests.
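
As a sketch of what such a pool setup looks like (names and PG counts are 
examples, not our exact values; remember that compression must also be enabled 
on the OSD side, as discussed in the compression threads):

  ceph osd erasure-code-profile set ec-8-2 k=8 m=2 plugin=jerasure crush-failure-domain=host
  ceph osd pool create fs-data-ec 512 512 erasure ec-8-2
  ceph osd pool set fs-data-ec allow_ec_overwrites true
  ceph osd pool set fs-data-ec compression_mode aggressive
  ceph osd pool set fs-data-ec compression_algorithm snappy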

Best regards,

=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: ceph-users  on behalf of David 

Sent: 07 July 2019 20:01:18
To: ceph-users@lists.ceph.com
Subject: [ceph-users]  What's the best practice for Erasure Coding

Hi Ceph-Users,

I'm working with a  Ceph cluster (about 50TB, 28 OSDs, all Bluestore on lvm).
Recently, I'm trying to use the Erasure Code pool.
My question is "what's the best practice for using EC pools ?".
More specifically, which plugin (jerasure, isa, lrc, shec or  clay) should I 
adopt, and how to choose the combinations of (k,m) (e.g. (k=3,m=2), (k=6,m=3) ).

Does anyone share some experience?

Thanks for any help.

Regards,
David

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] What's the best practice for Erasure Coding

2019-07-09 Thread Frank Schilder
Hi Nathan,

It's just a hypothesis; I did not check what the algorithm does.

The reasoning is this. Bluestore and modern disks have preferred read/write 
sizes that are quite large for large drives. These are usually powers of 2. If 
you use a k+m EC profile, any read/write is split into k fragments. What I 
observe is that throughput seems best if these fragments are multiples of the 
preferred read/write sizes.

Any prime factor other than 2 will imply split-ups that don't fit perfectly. 
The mismatch tends to be worse the larger a prime factor and the smaller the 
object size. At least this is a correlation I observed in benchmarks. Since 
correlation does not mean causation, I will not claim that my hypothesis is an 
explanation of the observation.

Nevertheless, bluestore has default alloc sizes and just for storage efficiency 
I would try to aim for alloc_size=object_size/k. Coincidentally, for 
spinning disks this also seems to imply best performance.
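
A rough illustration with 4MB objects and the default 64kB allocation size on 
HDD: with k=8 each shard gets 4MB/8 = 512kB, i.e. exactly 8 allocation units; 
with k=6 a shard gets about 683kB, which is not a whole multiple of 64kB.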

If this is wrong, maybe a disk IO expert can provide a better explanation as a 
guide for EC profile choices?

Best regards,

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: ceph-users  on behalf of Nathan Fish 

Sent: 08 July 2019 18:07:25
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] What's the best practice for Erasure Coding

This is very interesting, thank you. I'm curious, what is the reason
for avoiding k's with large prime factors? If I set k=5, what happens?

On Mon, Jul 8, 2019 at 8:56 AM Lei Liu  wrote:
>
> Hi Frank,
>
> Thanks for sharing valuable experience.
>
> Frank Schilder  于2019年7月8日周一 下午4:36写道:
>>
>> Hi David,
>>
>> I'm running a cluster with bluestore on raw devices (no lvm) and all 
>> journals collocated on the same disk with the data. Disks are spinning 
>> NL-SAS. Our goal was to build storage at lowest cost, therefore all data on 
>> HDD only. I got a few SSDs that I'm using for FS and RBD meta data. All 
>> large pools are EC on spinning disk.
>>
>> I spent at least one month to run detailed benchmarks (rbd bench) depending 
>> on EC profile, object size, write size, etc. Results were varying a lot. My 
>> advice would be to run benchmarks with your hardware. If there was a single 
>> perfect choice, there wouldn't be so many options. For example, my tests 
>> will not be valid when using separate fast disks for WAL and DB.
>>
>> There are some results though that might be valid in general:
>>
>> 1) EC pools have high throughput but low IOP/s compared with replicated pools
>>
>> I see single-thread write speeds of up to 1.2GB (gigabyte) per second, which 
>> is probably the network limit and not the disk limit. IOP/s get better with 
>> more disks, but are way lower than what replicated pools can provide. On a 
>> cephfs with EC data pool, small-file IO will be comparably slow and eat a 
>> lot of resources.
>>
>> 2) I observe massive network traffic amplification on small IO sizes, which 
>> is due to the way EC overwrites are handled. This is one bottleneck for 
>> IOP/s. We have 10G infrastructure and use 2x10G client and 4x10G OSD 
>> network. OSD bandwidth at least 2x client network, better 4x or more.
>>
>> 3) k should only have small prime factors, power of 2 if possible
>>
>> I tested k=5,6,8,10,12. Best results in decreasing order: k=8, k=6. All 
>> other choices were poor. The value of m seems not relevant for performance. 
>> Larger k will require more failure domains (more hardware).
>>
>> 4) object size matters
>>
>> The best throughput (1M write size) I see with object sizes of 4MB or 8MB, 
>> with IOP/s getting somewhat better with slower object sizes but throughput 
>> dropping fast. I use the default of 4MB in production. Works well for us.
>>
>> 5) jerasure is quite good and seems most flexible
>>
>> jerasure is quite CPU efficient and can handle smaller chunk sizes than 
>> other plugins, which is preferrable for IOP/s. However, CPU usage can become 
>> a problem and a plugin optimized for specific values of k and m might help 
>> here. Under usual circumstances I see very low load on all OSD hosts, even 
>> under rebalancing. However, I remember that once I needed to rebuild 
>> something on all OSDs (I don't remember what it was, sorry). In this 
>> situation, CPU load went up to 30-50% (meaning up to half the cores were at 
>> 100%), which is really high considering that each server has only 16 disks 
>> at the moment and is sized to handle up to 100. CPU power could become a 
>> bottle for us neck in the future.
>>
>> These are 

Re: [ceph-users] What's the best practice for Erasure Coding

2019-07-09 Thread Frank Schilder
Small addition:

This result holds for rbd bench. It seems to imply good performance for 
large-file IO on cephfs, since cephfs will split large files into many objects 
of size object_size. Small-file IO is a different story.

The formula should be N*alloc_size=object_size/k, where N is some integer, i.e. 
object_size/k should be an integer multiple of alloc_size.

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Frank Schilder
Sent: 09 July 2019 09:22
To: Nathan Fish; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] What's the best practice for Erasure Coding

Hi Nathan,

its just a hypothesis. I did not check what the algorithm does.

The reasoning is this. Bluestore and modern disks have preferred read/write 
sizes that are quite large for large drives. These are usually powers of 2. If 
you use a k+m EC profile, any read/write is split into k fragments. What I 
observe is, that throughput seems best if these fragments are multiples of the 
preferred read/write sizes.

Any prime factor other than 2 will imply split-ups that don't fit perfectly. 
The mismatch tends to be worse the larger a prime factor and the smaller the 
object size. At least this is a correlation I observed in benchmarks. Since 
correlation does not mean causation, I will not claim that my hypothesis is an 
explanation of the observation.

Nevertheless, bluestore has default alloc sizes and just for storage efficiency 
I would try to achieve aim for alloc_size=object_size/k. Coincidentally, for 
spinning disks this also seems to imply best performance.

If this is wrong, maybe a disk IO expert can provide a better explanation as a 
guide for EC profile choices?

Best regards,

=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: ceph-users  on behalf of Nathan Fish 

Sent: 08 July 2019 18:07:25
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] What's the best practice for Erasure Coding

This is very interesting, thank you. I'm curious, what is the reason
for avoiding k's with large prime factors? If I set k=5, what happens?

On Mon, Jul 8, 2019 at 8:56 AM Lei Liu  wrote:
>
> Hi Frank,
>
> Thanks for sharing valuable experience.
>
> Frank Schilder  于2019年7月8日周一 下午4:36写道:
>>
>> Hi David,
>>
>> I'm running a cluster with bluestore on raw devices (no lvm) and all 
>> journals collocated on the same disk with the data. Disks are spinning 
>> NL-SAS. Our goal was to build storage at lowest cost, therefore all data on 
>> HDD only. I got a few SSDs that I'm using for FS and RBD meta data. All 
>> large pools are EC on spinning disk.
>>
>> I spent at least one month to run detailed benchmarks (rbd bench) depending 
>> on EC profile, object size, write size, etc. Results were varying a lot. My 
>> advice would be to run benchmarks with your hardware. If there was a single 
>> perfect choice, there wouldn't be so many options. For example, my tests 
>> will not be valid when using separate fast disks for WAL and DB.
>>
>> There are some results though that might be valid in general:
>>
>> 1) EC pools have high throughput but low IOP/s compared with replicated pools
>>
>> I see single-thread write speeds of up to 1.2GB (gigabyte) per second, which 
>> is probably the network limit and not the disk limit. IOP/s get better with 
>> more disks, but are way lower than what replicated pools can provide. On a 
>> cephfs with EC data pool, small-file IO will be comparably slow and eat a 
>> lot of resources.
>>
>> 2) I observe massive network traffic amplification on small IO sizes, which 
>> is due to the way EC overwrites are handled. This is one bottleneck for 
>> IOP/s. We have 10G infrastructure and use 2x10G client and 4x10G OSD 
>> network. OSD bandwidth at least 2x client network, better 4x or more.
>>
>> 3) k should only have small prime factors, power of 2 if possible
>>
>> I tested k=5,6,8,10,12. Best results in decreasing order: k=8, k=6. All 
>> other choices were poor. The value of m seems not relevant for performance. 
>> Larger k will require more failure domains (more hardware).
>>
>> 4) object size matters
>>
>> The best throughput (1M write size) I see with object sizes of 4MB or 8MB, 
>> with IOP/s getting somewhat better with slower object sizes but throughput 
>> dropping fast. I use the default of 4MB in production. Works well for us.
>>
>> 5) jerasure is quite good and seems most flexible
>>
>> jerasure is quite CPU efficient and can handle smaller chunk sizes than 
>> other plugins, which is preferrable for IOP/s. However, CPU usage can become 
>> 

Re: [ceph-users] What's the best practice for Erasure Coding

2019-07-11 Thread Frank Schilder
Striping with stripe units other than 1 is something I also tested. I found 
that with EC pools non-trivial striping should be avoided. Firstly, EC is 
already a striped format and, secondly, striping on top of that with 
stripe_unit>1 will make every write an ec_overwrite, because now shards are 
rarely if ever written as a whole.

The native striping in EC pools comes from k, data is striped over k disks. The 
higher k the more throughput at the expense of cpu and network.

In my long list, this should actually be point

6) Use stripe_unit=1 (default).

To get back to your question, this is another argument for k=power-of-two. 
Object sizes in ceph are always powers of 2 and stripe sizes contain k as a 
factor. Hence, any prime factor other than 2 in k will imply a mismatch. How 
badly a mismatch affects performance should be tested.

Example: on our 6+2 EC pool I have stripe_width  24576, which has 3 as a 
factor. The 3 comes from k=6=3*2 and will always be there. This implies a 
misalignment and some writes will have to be split/padded in the middle. This 
does not happen too often per object, so 6+2 performance is good, but not as 
good as 8+2 performance.

Some numbers:

1) rbd object size 8MB, 4 servers writing with 1 process each (=4 workers):
EC profile   4K random write      sequential write, 8M write size
             IOP/s aggregated     MB/s aggregated
 5+2          802.30              1156.05
 6+2         1188.26              1873.67
 8+2         1210.27              2510.78
10+4          421.80               681.22

2) rbd object size 8MB, 4 servers writing with 4 processes each (=16 workers):
EC profile   4K random write      sequential write, 8M write size
             IOP/s aggregated     MB/s aggregated
 6+2         1384.43              3139.14
 8+2         1343.34              4069.27

The EC-profiles with factor 5 are so bad that I didn't repeat the multi-process 
tests (2) with these. I had limited time and went for the discard-early 
strategy to find suitable parameters.

The 25% smaller throughput (6+2 vs 8+2) in test (2) is probably due to the fact 
that data is striped over 6 instead of 8 disks. There might be some impact of 
the factor 3 somewhere as well, but it seems negligible in the scenario I 
tested.

Results with non-trivial striping (stripe_size>1) were so poor, I did not even 
include them in my report.

We use the 8+2 pool for ceph fs, where throughput is important. The 6+2 pool is 
used for VMs (RBD images), where IOP/s are more important. It also offers a 
higher redundancy level. It's an acceptable compromise for us.

Note that numbers will vary depending on hardware, OSD config, kernel 
parameters etc, etc. One needs to test what one has.

Best regards,

=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: ceph-users  on behalf of Lars 
Marowsky-Bree 
Sent: 11 July 2019 10:14:04
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] What's the best practice for Erasure Coding

On 2019-07-09T07:27:28, Frank Schilder  wrote:

> Small addition:
>
> This result holds for rbd bench. It seems to imply good performance for 
> large-file IO on cephfs, since cephfs will split large files into many 
> objects of size object_size. Small-file IO is a different story.
>
> The formula should be N*alloc_size=object_size/k, where N is some integer. 
> alloc_size should be an integer multiple of object_size/k.

If using rbd striping, I'd also assume that making rbd's stripe_unit be
equal to, or at least a multiple of, the stripe_width of the EC pool is
sensible.

(Similar for CephFS's layout.)

Does this hold in your environment?


--
SUSE Linux GmbH, GF: Felix Imendörffer, Mary Higgins, Sri Rasiah, HRB 21284 (AG 
Nürnberg)
"Architects should open possibilities and not determine everything." (Ueli 
Zbinden)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] What's the best practice for Erasure Coding

2019-07-11 Thread Frank Schilder
Oh dear. Every occurrence of stripe_* is wrong :)

It should be stripe_count (option --stripe-count in rbd create) everywhere in 
my text.

What choices are legal depends on the restrictions on stripe_count*stripe_unit 
(=stripe_size=stripe_width?) imposed by ceph. I believe all of this ends up 
being powers of 2.
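
For illustration, the options in question at image creation time (values are 
only an example and must respect the constraints above):

  rbd create --size 100G --object-size 4M --stripe-unit 1M --stripe-count 4 rbd/striped-img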

Yes, the 6+2 is a bit surprising. I have no explanation for the observation. It 
just seems a good argument for "do not trust what you believe, gather facts". 
And to try things that seem non-obvious - just to be sure.

Best regards,

=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: ceph-users  on behalf of Lars 
Marowsky-Bree 
Sent: 11 July 2019 12:17:37
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] What's the best practice for Erasure Coding

On 2019-07-11T09:46:47, Frank Schilder  wrote:

> Striping with stripe units other than 1 is something I also tested. I found 
> that with EC pools non-trivial striping should be avoided. Firstly, EC is 
> already a striped format and, secondly, striping on top of that with 
> stripe_unit>1 will make every write an ec_overwrite, because now shards are 
> rarely if ever written as a whole.

That's why I said that rbd's stripe_unit should match the EC pool's
stripe_width, or be a 2^n multiple of it. (Not sure what stripe_count
should be set to, probably also a small number of two.)

> The native striping in EC pools comes from k, data is striped over k disks. 
> The higher k the more throughput at the expense of cpu and network.

Increasing k also increases stripe_width though; this leads to more IO
suffering from the ec_overwrite penalty.

> In my long list, this should actually be point
>
> 6) Use stripe_unit=1 (default).

You mean stripe-count?

> To get back to your question, this is another argument for k=power-of-two. 
> Object sizes in ceph are always powers of 2 and stripe sizes contain k as a 
> factor. Hence, any prime factor other than 2 in k will imply a mismatch. How 
> badly a mismatch affects performance should be tested.

Yes, of course. Depending on the IO pattern, this means more IO will be
misaligned or have non-stripe_width portions. (Most IO patterns, if they
strive for alignment, aim for a power of two alignment, obviously.)

> Results with non-trivial striping (stripe_size>1) were so poor, I did not 
> even include them in my report.

stripe_size?

> We use the 8+2 pool for ceph fs, where throughput is important. The 6+2 pool 
> is used for VMs (RBD images), where IOP/s are more important. It also offers 
> a higher redundancy level. It's an acceptable compromise for us.

Especially with RBDs, I'm surprised that k=6 works well for you. Block
device IO is most commonly aligned on power-of-two boundaries.


Regards,
Lars

--
SUSE Linux GmbH, GF: Felix Imendörffer, Mary Higgins, Sri Rasiah, HRB 21284 (AG 
Nürnberg)
"Architects should open possibilities and not determine everything." (Ueli 
Zbinden)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] What if etcd is lost

2019-07-15 Thread Frank Schilder
Hi Oscar,

ceph itself does not use etcd for anything. Hence, a deployed and operational 
cluster will not notice the presence or absence of an etcd store.

How much a loss of etcd means for your work depends on what you plan to store 
in it. If you look at the ceph/daemon container on docker, the last time I 
checked the code, it stored only very little data and all of this would be 
re-built from the running cluster if you create and run a new etcd container. 
In this framework, it only affects how convenient deployment of new servers is. 
You could easily copy the few files it holds by hand to a new server. So etcd 
is not critical at all.

You should have a look at the deploy scripts/method to check under what 
conditions you can lose and re-build an etcd store. In the example of 
ceph/daemon on docker, a rebuild requires execution on a node with the admin 
keyring (e.g. a mon node) against a running cluster with mons in quorum.

Best regards,

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: ceph-users  on behalf of Oscar Segarra 

Sent: 15 July 2019 11:55
To: ceph-users
Subject: [ceph-users] What if etcd is lost

Hi,

I'm planning to deploy a ceph cluster using etcd as kv store.

I'm planning to deploy a stateless etcd docker to store the data.

I'd like to know if the ceph cluster will be able to boot when the etcd 
container restarts (and loses all data written in it).

If the etcd container restarts while the ceph cluster (osd, mds, mon, mgr) is 
working and stable, will everything continue working, or will some component 
stop working?

Will the mons be able to regenerate the keys?

Thanks a lot in advance
Óscar
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs snapshot scripting questions

2019-07-19 Thread Frank Schilder
This is a question I'm interested in as well.

Right now, I'm using cephfs-snap from the storage tools project and am quite 
happy with that. I made a small modification, but will probably not change it. 
It's a simple and robust tool.
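
For reference, the rotation such a tool performs essentially boils down to 
creating and removing directories under .snap. A minimal sketch, assuming the 
cephfs root is mounted at /ceph (mount point, prefix and retention below are 
placeholders, and this is not the cephfs-snap code itself):

    #!/bin/bash
    # create a new snapshot and keep only the newest $KEEP ones
    SNAPDIR=/ceph/.snap
    KEEP=14
    mkdir "${SNAPDIR}/daily-$(date +%Y-%m-%d_%H%M)"
    # remove the oldest snapshots beyond the retention count (GNU head)
    ls -1 "${SNAPDIR}" | grep '^daily-' | sort | head -n "-${KEEP}" | \
        while read -r snap; do rmdir "${SNAPDIR}/${snap}"; done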

About where to take snapshots: there seems to be a bug in cephfs that implies a 
recommended limit of no more than 400 snapshots in total. Hence, taking as few 
as possible (i.e. high up in the tree) seems to be a must. Has this changed by 
now? In case this limit no longer exists, what would be best practice?

Note that we disabled rolling snapshots due to a not yet fixed bug; see this 
thread: https://www.mail-archive.com/ceph-users@lists.ceph.com/msg54233.html

Best regards,

=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: ceph-users  on behalf of Robert Ruge 

Sent: 17 July 2019 02:44:02
To: ceph-users@lists.ceph.com
Subject: [ceph-users] cephfs snapshot scripting questions

Greetings.

Before I reinvent the wheel: has anyone written a script to maintain X number of 
snapshots on a cephfs file system that can be run through cron?
I am aware of the cephfs-snap code but just wondering if there are any other 
options out there.

On a related note which of these options would be better?

1.   Maintain one .snap directory at the root of the cephfs tree - 
/ceph/.snap

2.   Have a .snap directory for every second level directory 
/ceph/user/.snap

I am thinking the latter might make it more obvious for the users to do their 
own restores, but I am wondering what the resource implications of either 
approach might be.

The documentation indicates that I should use kernel >= 4.17 for cephfs.  I’m 
currently using Mimic 13.2.6 on Ubuntu 18.04 with kernel version 4.15.0. What 
issues might I see with this combination? I’m hesitant to upgrade to an 
unsupported kernel on Ubuntu but wondering if I’m going to be playing Russian 
Roulette with this combo.

Are there any gotcha’s I should be aware of before plunging into full blown 
cephfs snapshotting?

Regards and thanks.
Robert Ruge


Important Notice: The contents of this email are intended solely for the named 
addressee and are confidential; any unauthorised use, reproduction or storage 
of the contents is expressly prohibited. If you have received this email in 
error, please delete it and any attachments immediately and advise the sender 
by return email or telephone.

Deakin University does not warrant that this email and any attachments are 
error or virus free.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Error Mounting CephFS

2019-08-07 Thread Frank Schilder
On CentOS 7, the option "secretfile" requires installation of ceph-fuse.
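
For reference, a typical kernel-client mount using a secret file looks roughly 
like this (monitor address, client name and paths are placeholders):

    # the secret file contains only the base64 key of the cephx user, e.g. client.samba
    mount -t ceph 192.168.0.1:6789:/ /mnt/cephfs \
        -o name=samba,secretfile=/etc/ceph/client.samba.secret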

Best regards,

=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: ceph-users  on behalf of Yan, Zheng 

Sent: 07 August 2019 10:10:19
To: dhils...@performair.com
Cc: ceph-users
Subject: Re: [ceph-users] Error Mounting CephFS

On Wed, Aug 7, 2019 at 3:46 PM  wrote:
>
> All;
>
> I have a server running CentOS 7.6 (1810), that I want to set up with CephFS 
> (full disclosure, I'm going to be running samba on the CephFS).  I can mount 
> the CephFS fine when I use the option secret=, but when I switch to 
> secretfile=, I get an error "No such process."  I installed ceph-common.
>
> Is there a service that I'm not aware I should be starting?
> Do I need to install another package?
>

mount.ceph is missing.  check if it exists and is located in $PATH

> Thank you,
>
> Dominic L. Hilsbos, MBA
> Director - Information Technology
> Perform Air International Inc.
> dhils...@performair.com
> www.PerformAir.com
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Failure to start ceph-mon in docker

2019-08-29 Thread Frank Schilder
Hi Robert,

this is a bit less trivial than it might look right now. The ceph user is 
usually created by installing the package ceph-common. By default it will use 
id 167. If the ceph user already exists, I would assume it will use the 
existing user to allow an operator to avoid UID collisions (if 167 is used 
already).

If you use docker, the ceph UID on the host and inside the container should 
match (or need to be translated). If they don't, you will have a lot of fun 
re-owning stuff all the time, because deployments will use the symbolic name 
ceph, which has different UIDs on the host and inside the container in your 
case.

I would recommend removing this discrepancy as soon as possible:

1) Find out why there was a ceph user with UID different from 167 before 
installation of ceph-common.
   Did you create it by hand? Was UID 167 allocated already?
2) If you can safely change the GID and UID of ceph to 167, just do 
groupmod+usermod with the new GID and UID (see the sketch after this list).
3) If 167 is used already by another service, you will have to map the UIDs 
between host and container.
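
A minimal sketch for option 2) above (assuming UID/GID 167 are free, all ceph 
daemons/containers on the node are stopped, and the usual ceph paths are in use):

    # change the ceph group and user to the IDs used inside the container
    groupmod -g 167 ceph
    usermod -u 167 -g 167 ceph
    # re-own state that was created under the old IDs
    chown -R ceph:ceph /var/lib/ceph /var/log/ceph /etc/ceph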

To prevent ansible from deploying dockerized ceph with mismatching user ID for 
ceph, add these tasks to an appropriate part of your deployment (general host 
preparation or so):

- name: "Create group 'ceph'."
  group:
name: ceph
gid: 167
local: yes
state: present
system: yes

- name: "Create user 'ceph'."
  user:
name: ceph
password: "!"
comment: "ceph-container daemons"
uid: 167
group: ceph
shell: "/sbin/nologin"
home: "/var/lib/ceph"
create_home: no
local: yes
state: present
system: yes

These tasks should fail if a ceph group and user already exist with IDs 
different from 167.

Best regards,

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: ceph-users  on behalf of Robert 
LeBlanc 
Sent: 28 August 2019 23:23:06
To: ceph-users
Subject: Re: [ceph-users] Failure to start ceph-mon in docker

Turns out /var/lib/ceph was owned ceph.ceph and not 167.167; chowning it made 
things work. I guess only the monitor needs that permission; rgw, mgr and osd 
are all happy without needing it to be 167.167.

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Wed, Aug 28, 2019 at 1:45 PM Robert LeBlanc 
mailto:rob...@leblancnet.us>> wrote:
We are trying to set up a new Nautilus cluster using ceph-ansible with 
containers. We got things deployed, but I couldn't run `ceph -s` on the host, so 
I decided to `apt install ceph-common` and installed the Luminous version from 
Ubuntu 18.04. For some reason the docker container that was running the monitor 
restarted and now won't start. I added the repo for Nautilus and upgraded
ceph-common, but the problem persists. The Manager and OSD docker containers 
don't seem to be affected at all. I see this in the journal:

Aug 28 20:40:55 sun-gcs02-osd01 systemd[1]: Starting Ceph Monitor...
Aug 28 20:40:55 sun-gcs02-osd01 docker[2926]: Error: No such container: 
ceph-mon-sun-gcs02-osd01
Aug 28 20:40:55 sun-gcs02-osd01 systemd[1]: Started Ceph Monitor.
Aug 28 20:40:55 sun-gcs02-osd01 docker[2949]: WARNING: Your kernel does not 
support swap limit capabilities or the cgroup is not mounted. Memory limited 
without swap.
Aug 28 20:40:56 sun-gcs02-osd01 docker[2949]: 2019-08-28 20:40:56  
/opt/ceph-container/bin/entrypoint.sh: Existing mon, trying to rejoin cluster...
Aug 28 20:40:56 sun-gcs02-osd01 docker[2949]: warning: line 41: 
'osd_memory_target' in section 'osd' redefined
Aug 28 20:41:03 sun-gcs02-osd01 docker[2949]: 2019-08-28 20:41:03  
/opt/ceph-container/bin/entrypoint.sh: /etc/ceph/ceph.conf is already memory 
tuned
Aug 28 20:41:03 sun-gcs02-osd01 docker[2949]: 2019-08-28 20:41:03  
/opt/ceph-container/bin/entrypoint.sh: SUCCESS
Aug 28 20:41:03 sun-gcs02-osd01 docker[2949]: exec: PID 368: spawning 
/usr/bin/ceph-mon --cluster ceph --default-log-to-file=false 
--default-mon-cluster-log-to-file=false --setuser ceph --setgroup ceph -d 
--mon-cluster-log-to-stderr --log-stderr-prefix=debug  -i sun-gcs02-osd01 
--mon-data /var/lib/ceph/mon/ceph-sun-gcs02-osd01 --public-addr 10.65.101.21
Aug 28 20:41:03 sun-gcs02-osd01 docker[2949]: exec: Waiting 368 to quit
Aug 28 20:41:03 sun-gcs02-osd01 docker[2949]: warning: line 41: 
'osd_memory_target' in section 'osd' redefined
Aug 28 20:41:03 sun-gcs02-osd01 docker[2949]: debug 2019-08-28 20:41:03.835 
7f401283c180  0 set uid:gid to 167:167 (ceph:ceph)
Aug 28 20:41:03 sun-gcs02-osd01 docker[2949]: debug 2019-08-28 20:41:03.835 
7f401283c180  0 ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) 
nautilus (stable), process ceph-mon, pid 368
Aug 28 20:41:03 sun-gcs02-osd01 docker[2949]: debug 2019-08-28 20:41:03.835 
7f401283c180 -1 stat(/var/lib/ceph/mon/ceph-sun-gcs02-o

Re: [ceph-users] Can't create erasure coded pools with k+m greater than hosts?

2019-10-24 Thread Frank Schilder
I have some experience with an EC set-up with 2 shards per host, failure domain 
host, and also some multi-site wishful thinking from users. What I learned is 
the following:

1) Avoid this work-around for having too few hosts for an EC rule at all cost. 
There are two types of resiliency in ceph. One is against hardware fails and the 
other is against admin fails. Using a non-standard crush set-up to accommodate 
for a lack of hosts dramatically reduces resiliency against admin fails. You 
will have down-time due to simple mistakes. You will also need to adjust other 
defaults, like min_size, to be able to do anything on this cluster without 
downtime, sweating every time and praying that nothing goes wrong. Use this 
only if there is a short-term horizon after which it will be over.

2) Do not use EC 2+1. It does not offer anything interesting for production. 
Use 4+2 (or 8+2, 8+3 if you have the hosts). Here you can operate with non-zero 
redundancy while doing maintenance (min_size=5); a short sketch follows after 
point 4.

3) If you have no perspective of getting at least 7 servers in the long run 
(4+2=6 for EC profile, +1 for fail-over automatic rebuild), do not go for EC. 
If this helps in your negotiations, tell everyone that they either give you 
more servers now and get low-cost storage, or have to pay for expensive 
replicated storage forever.

4) Before you start thinking about replicating to a second site, you should 
have a primary site running solid first. I was in exactly the same situation, 
people expecting wonders while giving me only half the stuff I need. Simply do 
not do it. I wasted a lot of time on impossible requests. With the hardware you 
have, I would ditch the second DC and rather start building up a solid first DC 
to be mirrored later when people move over bags with money. You have 6 servers. 
That's a good start for a 4+2 EC pool. You will not have fail-over capacity, 
but at least you don't have to work around too many exceptions. The one you 
should be aware of though is this one: 
https://docs.ceph.com/docs/mimic/rados/troubleshooting/troubleshooting-pg/?highlight=erasure%20code%20pgs#crush-gives-up-too-soon
 . If you had 7 servers, you would be out of trouble.
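
For point 2) above, a minimal sketch of how such a pool could be created 
(profile/pool names and PG counts are placeholders, not taken from this thread):

    ceph osd erasure-code-profile set ec-4-2 k=4 m=2 crush-failure-domain=host
    ceph osd pool create ec-4-2-pool 128 128 erasure ec-4-2
    # k+1: writes keep one redundant shard even with one host down for maintenance
    ceph osd pool set ec-4-2-pool min_size 5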

This is collected from my experience. I would do things differently now, and 
maybe it helps you with deciding how to proceed. It's basically about what 
resources you can expect in the foreseeable future and what compromises you are 
willing to make with regards to sleep and sanity.

Best regards,

=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: ceph-users  on behalf of Salsa 

Sent: 21 October 2019 17:31
To: Martin Verges
Cc: ceph-users
Subject: Re: [ceph-users] Can't create erasure coded pools with k+m greater 
than hosts?

Just to clarify my situation: we have 2 datacenters with 3 hosts each, 12 4TB 
disks per host (2 are in a RAID with the OS installed and the remaining 10 are 
used for Ceph). Right now I'm trying a single-DC installation and intend to 
migrate to multi-site, mirroring DC1 to DC2, so if we lose DC1 we can activate 
DC2 (NOTE: I have no idea how this is set up and have not planned it at all; I 
thought of getting DC1 to work first and setting up the mirroring later).

I don't think I'll be able to change the setup in any way, so my next question 
is: Should I go with a replica 3 or would an erasure 2,1 be ok?

There's a very small chance we get 2 extra hosts for each DC in the near future, 
but we'll probably use up all the available storage space even sooner.

We're trying to use as much space as possible.

Thanks;

--
Salsa

Sent with ProtonMail<https://protonmail.com> Secure Email.

‐‐‐ Original Message ‐‐‐
On Monday, October 21, 2019 2:53 AM, Martin Verges  
wrote:

Just don't do such setups for production. It will be a lot of pain and trouble, 
and cause you problems.

Just take a cheap system, put some of the disks in it and do a way, way better 
deployment than something like 4+2 on 3 hosts. Whatever you do with that 
cluster (for example a kernel update, reboot, or PSU failure) causes you and 
all attached clients to stop any IO or even crash completely, which is 
especially bad with VMs on that Ceph cluster.

--
Martin Verges
Managing director

Mobile: +49 174 9335695
E-Mail: martin.ver...@croit.io<mailto:martin.ver...@croit.io>
Chat: https://t.me/MartinVerges

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263

Web: https://croit.io
YouTube: https://goo.gl/PGE1Bx


Am Sa., 19. Okt. 2019 um 01:51 Uhr schrieb Chris Taylor 
mailto:ctay...@eyonic.com>>:
Full disclosure - I have not created an erasure code pool yet!

I have been wanting to do the same thing that you are attempting and
have these links saved. I believe this is what you are looking for.

This link is for decompiling the CRUSH rules and recompiling:

https://docs.ceph.com/docs/luminous/rados/

Re: [ceph-users] Erasure coded pools on Ambedded - advice please

2019-10-24 Thread Frank Schilder
There are plenty of posts in this list. Please search a bit. Example threads 
are:

What's the best practice for Erasure Coding
large concurrent rbd operations block for over 15 mins!
Can't create erasure coded pools with k+m greater than hosts?

And many more. As you will see there, k=2,m=1 is bad and so is k=7. You should 
also refer to the ceph documentation about failure domains and EC pools, which 
will explain the possible reasons why your 7+2 pool does not work.
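
For reference, a minimal sketch of how to narrow down why PGs of such a wide EC 
pool stay incomplete (profile name and PG id are placeholders):

    # check crush-failure-domain and k+m against the number of hosts
    ceph osd erasure-code-profile get my-7-2-profile
    # list PGs that never became active and inspect one of them
    ceph pg dump_stuck inactive
    ceph pg <pgid> query    # an "up" set shorter than k+m means CRUSH cannot find enough hosts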

The answers are not easy, depend on your hardware and you will have to do some 
work testing and benchmarking.

Best regards,

=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: ceph-users  on behalf of John Hearns 

Sent: 24 October 2019 08:21:47
To: ceph-users
Subject: [ceph-users] Erasure coded pools on Ambedded - advice please

I am setting up a storage cluster on Ambedded ARM hardware, which is nice!
I find that I can set up an erasure coded pool with the default k=2,m=1

The cluster has 9x OSD with HDD and 12xOSD with SSD

If I configure another erasure profile such as k=7 m=2 then the pool creates, 
but the pgs stick in configuring/incomplete.
Some advice please:

a) what erasure profiles do people suggest for this setup

b) a pool with m=1 will work fine of course, I imagine though a failed OSD has 
to be replaced quickly

If anyone else has Ambedded, what crush rule do you select for the metadata 
when creating a pool?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v13.2.7 osds crash in build_incremental_map_msg

2019-12-04 Thread Frank Schilder
Is this issue now a no-go for updating to 13.2.7 or are there only some 
specific unsafe scenarios?

Best regards,

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: ceph-users  on behalf of Dan van der 
Ster 
Sent: 03 December 2019 16:42:45
To: ceph-users
Subject: Re: [ceph-users] v13.2.7 osds crash in build_incremental_map_msg

I created https://tracker.ceph.com/issues/43106 and we're downgrading
our osds back to 13.2.6.

-- dan

On Tue, Dec 3, 2019 at 4:09 PM Dan van der Ster  wrote:
>
> Hi all,
>
> We're midway through an update from 13.2.6 to 13.2.7 and started
> getting OSDs crashing regularly like this [1].
> Does anyone obviously know what the issue is? (Maybe
> https://github.com/ceph/ceph/pull/26448/files ?)
> Or is it some temporary problem while we still have v13.2.6 and
> v13.2.7 osds running concurrently?
>
> Thanks!
>
> Dan
>
> [1]
>
> 2019-12-03 15:53:51.817 7ff3a3d39700 -1 osd.1384 2758889
> build_incremental_map_msg missing incremental map 2758889
> 2019-12-03 15:53:51.817 7ff3a453a700 -1 osd.1384 2758889
> build_incremental_map_msg missing incremental map 2758889
> 2019-12-03 15:53:51.817 7ff3a453a700 -1 osd.1384 2758889
> build_incremental_map_msg unable to load latest map 2758889
> 2019-12-03 15:53:51.822 7ff3a453a700 -1 *** Caught signal (Aborted) **
>  in thread 7ff3a453a700 thread_name:tp_osd_tp
>
>  ceph version 13.2.7 (71bd687b6e8b9424dd5e5974ed542595d8977416) mimic (stable)
>  1: (()+0xf5f0) [0x7ff3c620b5f0]
>  2: (gsignal()+0x37) [0x7ff3c522b337]
>  3: (abort()+0x148) [0x7ff3c522ca28]
>  4: (OSDService::build_incremental_map_msg(unsigned int, unsigned int,
> OSDSuperblock&)+0x767) [0x555d60e8d797]
>  5: (OSDService::send_incremental_map(unsigned int, Connection*,
> std::shared_ptr&)+0x39e) [0x555d60e8dbee]
>  6: (OSDService::share_map_peer(int, Connection*,
> std::shared_ptr)+0x159) [0x555d60e8eda9]
>  7: (OSDService::send_message_osd_cluster(int, Message*, unsigned
> int)+0x1a5) [0x555d60e8f085]
>  8: (ReplicatedBackend::issue_op(hobject_t const&, eversion_t const&,
> unsigned long, osd_reqid_t, eversion_t, eversion_t, hobject_t,
> hobject_t, std::vector
> > const&, boost::optional&,
> ReplicatedBackend::InProgressOp*, ObjectStore::Transaction&)+0x452)
> [0x555d6116e522]
>  9: (ReplicatedBackend::submit_transaction(hobject_t const&,
> object_stat_sum_t const&, eversion_t const&,
> std::unique_ptr >&&,
> eversion_t const&, eversion_t const&, std::vector std::allocator > const&,
> boost::optional&, Context*, unsigned long,
> osd_reqid_t, boost::intrusive_ptr)+0x6f5) [0x555d6117ed85]
>  10: (PrimaryLogPG::issue_repop(PrimaryLogPG::RepGather*,
> PrimaryLogPG::OpContext*)+0xd62) [0x555d60ff5142]
>  11: (PrimaryLogPG::execute_ctx(PrimaryLogPG::OpContext*)+0xf12)
> [0x555d61035902]
>  12: (PrimaryLogPG::do_op(boost::intrusive_ptr&)+0x3679)
> [0x555d610397a9]
>  13: (PrimaryLogPG::do_request(boost::intrusive_ptr&,
> ThreadPool::TPHandle&)+0xc99) [0x555d6103d869]
>  14: (OSD::dequeue_op(boost::intrusive_ptr,
> boost::intrusive_ptr, ThreadPool::TPHandle&)+0x1b7)
> [0x555d60e8e8a7]
>  15: (PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr&,
> ThreadPool::TPHandle&)+0x62) [0x555d611144c2]
>  16: (OSD::ShardedOpWQ::_process(unsigned int,
> ceph::heartbeat_handle_d*)+0x592) [0x555d60eb25f2]
>  17: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x3d3)
> [0x7ff3c929f5b3]
>  18: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7ff3c92a01a0]
>  19: (()+0x7e65) [0x7ff3c6203e65]
>  20: (clone()+0x6d) [0x7ff3c52f388d]
>  NOTE: a copy of the executable, or `objdump -rdS ` is
> needed to interpret this.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Beginner questions

2020-01-17 Thread Frank Schilder
I would strongly advise against 2+1 EC pools for production if stability is 
your main concern. There was a discussion towards the end of last year 
addressing this in more detail. Short story, if you don't have at least 8-10 
nodes (in the short run), EC is not suitable. You cannot maintain a cluster 
with such EC-pools.

Reasoning: k+1 is a no-go in production. You can set min_size to k, but 
whenever a node is down (maintenance or whatever), new writes are 
non-redundant. Losing just one more disk means data loss. This is not a 
problem with replication x3 and min_size=2. Be aware that maintenance more 
often than not takes more than a day. Parts may need to be shipped. An upgrade 
goes wrong and requires lengthy support for fixing. Etc.

In addition, admins make mistakes. You need to build your cluster such that it 
can survive mistakes (shut down wrong host, etc.) in degraded state. Redundancy 
m=1 means zero tolerance for errors. Often the recommendation therefore is m=3, 
while m=2 is the bare minimum. Note that EC 1+2 is equal in redundancy to 
replication x3, but will use more compute (hence, it's useless). In your 
situation, I would start with replicated pools and move to EC once enough nodes 
are at hand.

If you want to use the benefits of EC, you need to build large clusters. 
Starting with 3 nodes and failure domain disk will be a horrible experience. 
You will not be able to maintain, upgrade or fix anything without downtime.

Plan for sleeping well in worst-case situations.

Best regards,

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: ceph-users  on behalf of Bastiaan 
Visser 
Sent: 17 January 2020 06:55:25
To: Dave Hall
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Beginner questions

There is no difference in allocation between replication and EC. If the failure 
domain is host, one OSD per host is used for a PG. So if you use a 2+1 EC 
profile with a host failure domain, you need 3 hosts for a healthy cluster. The 
pool will go read-only when you have a failure (host or disk), or are doing 
maintenance on a node (reboot). On a node failure there will be no rebuilding, 
since there is no place to find a 3rd osd for a pg, so you'll have to 
fix/replace the node before any writes will be accepted.

So yes, you can do a 2+1 EC pool on 3 nodes, but you are paying the price in 
reliability, flexibility and maybe performance. The only way to really know the 
latter is benchmarking with your setup.

I think you will be fine on the hardware side. Memory recommendations swing 
between 512 MB and 1 GB per TB of storage; I usually go with 1 GB. But I never 
use disks larger than 4 TB. On the CPU side I always try to have a few more 
cores than I have OSDs in a machine, so 16 is fine in your case.


On Fri, Jan 17, 2020, 03:29 Dave Hall 
mailto:kdh...@binghamton.edu>> wrote:

Bastiaan,

Regarding EC pools: our concern at 3 nodes is that 2-way replication seems 
risky - if the two copies don't match, which one is corrupted? However, 3-way 
replication on a 3 node cluster triples the price per TB.   Doing EC pools that 
are the equivalent of RAID-5 2+1 seems like the right place to start as far as 
maximizing capacity is concerned, although I do understand the potential time 
involved in rebuilding a 12 TB drive.  Early on I'd be more concerned about a 
drive failure than about a node failure.

Regarding the hardware, our nodes are single socket EPYC 7302 (16 core, 32 
thread) with 128GB RAM.  From what I recall reading I think the RAM, at least, 
is a bit higher than recommended.

Question:  Does a PG (EC or replicated) span multiple drives per node?  I 
haven't got to the point of understanding this part yet, so pardon the totally 
naive question.  I'll probably be conversant on this by Monday.

-Dave

Dave Hall
Binghamton University
kdh...@binghamton.edu<mailto:kdh...@binghamton.edu>
607-760-2328 (Cell)
607-777-4641 (Office)




On 1/16/2020 4:27 PM, Bastiaan Visser wrote:
Dave made a good point: WAL + DB might end up a little over 60 GB, so I would 
probably go with ~70 GB partitions/LVs per OSD in your case (if the NVMe drive 
is smart enough to spread the writes over all available capacity; most recent 
NVMes are). I have not yet seen a WAL larger than, or even close to, a 
gigabyte.

We don't even think about EC pools on clusters with fewer than 6 nodes 
(spindles; full SSD is another story).
EC pools need more processing resources. We usually settle for 1 GB per TB of 
storage on replicated-only clusters, but when EC pools are involved, we add at 
least 50% to that. Also make sure your processors are up for it.

Do not base your calculations on a healthy cluster -> build to fail.
How long are you willing to be in a degraded state on node failure? Especially 
when using many large spindles, recovery time might be way longer than you 
think. 12 * 12 TB is 144 TB.

Re: [ceph-users] low io with enterprise SSDs ceph luminous - can we expect more? [klartext]

2020-01-20 Thread Frank Schilder
We are using Micron 5200 PRO, 1.92TB for RBD images on KVM and are very happy 
with the performance. We are using EC 6+2 pools, which really eat up IOPs. 
Still, we get enough performance out to run 20-50 VMs per disk, which results 
in good space utilisation as well since our default image size is 50GB and we 
take rolling snapshots. I was thinking about 4TB disks also, but am concerned 
that their IOPs/TB performance is too low for images on EC pools.

We found the raw throughput in fio benchmarks to be very different for 
write-cache enabled and disabled, exactly as explained in the performance 
article. Changing write cache settings is a boot-time operation. Unfortunately, 
I couldn't find a reliable way to disable write cache at boot time (I was 
looking for tuned configs) and ended up adding this to a container startup 
script:

  if [[ "$1" == "osd_ceph_disk_activate" && -n "${OSD_DEVICE}" ]] ; then
echo "Disabling write cache on ${OSD_DEVICE}"
/usr/sbin/smartctl -s wcache=off "${OSD_DEVICE}"
  fi

This works for both SAS and SATA drives and ensures that the write cache is 
disabled before an OSD daemon starts.
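
To verify that the setting actually took effect, something along these lines can 
be used (device path is a placeholder; the exact output wording may differ 
between tool versions):

    smartctl -g wcache /dev/sdX    # reports the current state of the volatile write cache
    hdparm -W /dev/sdX             # same information for SATA drives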

Best regards,

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: ceph-users  on behalf of Eric K. 
Miller 
Sent: 19 January 2020 04:24:33
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] low io with enterprise SSDs ceph luminous - can we 
expect more? [klartext]

Hi Vitaliy,

Similar to Stefan, we have a bunch of Micron 5200's (3.84TB ECO SATA version) 
in a Ceph cluster (Nautilus) and performance seems less than optimal.  I have 
followed all instructions on your site (thank you for your wonderful article 
btw!!), but I haven't seen much change.

The only thing I could think of is that "maybe" disabling the write cache only 
takes place upon a reboot or power cycle?  Is that necessary?  Or is it a 
"live" change?

I have tested with the cache disabled as well as enabled on all drives.  We're 
using fio running in a QEMU/KVM VM in an OpenStack cluster, so not "raw" access 
to the Micron 5200's.  OSD (Bluestore) nodes run CentOS 7 using a 4.18.x 
kernel.  Testing doesn't show any, or much, difference, enough that the 
variations could be considered "noise" in the results.  Certainly no change 
that anyone could tell.

Thought I'd check to see if you, or anyone else, might have any suggestions 
specific to the Micron 5200.

We have some Micron 5300's inbound, but probably won't have them here for 
another few weeks due to Micron's manufacturing delays, so will be able to test 
these raw drives soon.  I will report back after, but if you know anything 
about these, I'm all ears. :)

Thank you!

Eric


From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Stefan 
Bauer
Sent: Tuesday, January 14, 2020 10:28 AM
To: undisclosed-recipients
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] low io with enterprise SSDs ceph luminous - can we 
expect more? [klartext]


Thank you all,



performance is indeed better now. Can now go back to sleep ;)



KR



Stefan


-Ursprüngliche Nachricht-
Von: Виталий Филиппов 
Gesendet: Dienstag 14 Januar 2020 10:28
An: Wido den Hollander ; Stefan Bauer 
CC: ceph-users@lists.ceph.com
Betreff: Re: [ceph-users] low io with enterprise SSDs ceph luminous - can we 
expect more? [klartext]

...disable signatures and rbd cache. I didn't mention it in the email to not 
repeat myself. But I have it in the article :-)
--
With best regards,
Vitaliy Filippov
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] low io with enterprise SSDs ceph luminous - can we expect more? [klartext]

2020-01-21 Thread Frank Schilder
> So hdparm -W 0 /dev/sdx doesn't work or it makes no difference?

I wrote "We found the raw throughput in fio benchmarks to be very different for 
write-cache enabled and disabled, exactly as explained in the performance 
article.", so yes, it makes a huge difference.

> Also I am not sure I understand why it should happen before OSDs have been 
> started. 
> At least in my experience hdparm does it to the hardware regardless.

I'm not sure I understand this question. Ideally it happens at boot time and, if 
this doesn't work, at least some time before the OSD is started. Why and how 
else would one want this to happen?

Best regards,

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] low io with enterprise SSDs ceph luminous - can we expect more? [klartext]

2020-01-21 Thread Frank Schilder
OK, now I understand. Yes, the cache setting will take effect immediately. It's 
more about whether you trust the disk firmware to apply the change correctly in 
all situations where production IO is active at the same time (will the 
volatile cache be flushed correctly or not)? I would not, and would rather 
change the setting while the OSD is down.

During benchmarks on raw disks I just switched the cache on and off when I 
needed to. There was nothing running on the disks and the fio benchmark is 
destructive anyway.
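
For reference, the kind of destructive raw-disk test meant here looks roughly 
like this (device path is a placeholder; this overwrites the disk, so never run 
it on an OSD that holds data):

    hdparm -W 0 /dev/sdX           # volatile write cache off
    fio --name=sync-write --filename=/dev/sdX --direct=1 --sync=1 \
        --rw=write --bs=4k --iodepth=1 --numjobs=1 --runtime=60 --time_based
    hdparm -W 1 /dev/sdX           # cache back on for the comparison run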

Best regards,

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Sasha Litvak 
Sent: 21 January 2020 10:19
To: Frank Schilder
Cc: ceph-users
Subject: Re: [ceph-users] low io with enterprise SSDs ceph luminous - can we 
expect more? [klartext]

Frank,

Sorry for the confusion. I thought that turning off the cache using hdparm -W 0 
/dev/sdx takes effect right away, and that in the case of non-RAID controllers 
and Seagate or Micron SSDs I would see a difference when starting a fio 
benchmark right after executing hdparm. So I wonder whether it makes a 
difference whether the cache is turned off before the OSD is started or after.



On Tue, Jan 21, 2020, 2:07 AM Frank Schilder 
mailto:fr...@dtu.dk>> wrote:
> So hdparm -W 0 /dev/sdx doesn't work or it makes no difference?

I wrote "We found the raw throughput in fio benchmarks to be very different for 
write-cache enabled and disabled, exactly as explained in the performance 
article.", so yes, it makes a huge difference.

> Also I am not sure I understand why it should happen before OSDs have been 
> started.
> At least in my experience hdparm does it to the hardware regardless.

I'm not sure I understand this question. Ideally it happens at boot time and, if 
this doesn't work, at least some time before the OSD is started. Why and how 
else would one want this to happen?

Best regards,

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com