[ceph-users] Re: Corruption on cluster

2021-09-21 Thread Christian Wuerdig
This tracker item should cover it: https://tracker.ceph.com/issues/51948

On Wed, 22 Sept 2021 at 11:03, Nigel Williams
 wrote:
>
> Could we see the content of the bug report please, that RH bugzilla entry
> seems to have restricted access.
> "You are not authorized to access bug #1996680."
>
> On Wed, 22 Sept 2021 at 03:32, Patrick Donnelly  wrote:
>
> > You're probably hitting this bug:
> > https://bugzilla.redhat.com/show_bug.cgi?id=1996680
> >
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RocksDB options for HDD, SSD, NVME Mixed productions

2021-09-21 Thread Szabo, Istvan (Agoda)
Increasing day by day, this is the current situation (1 server has 6x 15.3TB SAS SSDs; 3x SSDs are using 1x 1.92TB NVMe for DB+WAL):

[WRN] BLUEFS_SPILLOVER: 13 OSD(s) experiencing BlueFS spillover
 osd.1 spilled over 56 GiB metadata from 'db' device (318 GiB used of 596 
GiB) to slow device
 osd.5 spilled over 34 GiB metadata from 'db' device (316 GiB used of 596 
GiB) to slow device
 osd.6 spilled over 37 GiB metadata from 'db' device (314 GiB used of 596 
GiB) to slow device
 osd.8 spilled over 121 MiB metadata from 'db' device (317 GiB used of 596 
GiB) to slow device
 osd.9 spilled over 53 GiB metadata from 'db' device (316 GiB used of 596 
GiB) to slow device
 osd.10 spilled over 114 GiB metadata from 'db' device (307 GiB used of 596 
GiB) to slow device
 osd.11 spilled over 68 GiB metadata from 'db' device (315 GiB used of 596 
GiB) to slow device
 osd.13 spilled over 30 GiB metadata from 'db' device (315 GiB used of 596 
GiB) to slow device
 osd.15 spilled over 65 GiB metadata from 'db' device (313 GiB used of 596 
GiB) to slow device
 osd.21 spilled over 6.8 GiB metadata from 'db' device (298 GiB used of 596 
GiB) to slow device
 osd.22 spilled over 23 GiB metadata from 'db' device (317 GiB used of 596 
GiB) to slow device
 osd.27 spilled over 228 GiB metadata from 'db' device (292 GiB used of 596 
GiB) to slow device
 osd.34 spilled over 9.8 GiB metadata from 'db' device (316 GiB used of 596 
GiB) to slow device

I guess it’s not an issue because it spilled over to an SSD from an NVMe which is
unfortunately running at 100%. I guess it would have more of an effect if it
spilled over to HDD. Or is there anything to be worried about with spillover?

I have 6 nodes with this setup; I can’t believe that 30k read and 10-15k write
IOPS at less than 1GB of throughput can max out this cluster with EC 4:2 :((

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---

On 2021. Sep 21., at 20:21, Christian Wuerdig  
wrote:

Email received from the internet. If in doubt, don't click any link nor open 
any attachment !


On Wed, 22 Sept 2021 at 05:54, Szabo, Istvan (Agoda)
 wrote:

Sorry to hijack the thread: if I have 500GB and 700GB mixed WAL+RocksDB on NVMe,
should the level base be 50 and 70? Or does it need to be a power of 2?

Generally the sum of all levels (up to the max of your metadata) needs
to fit into the db partition for each OSD. If you have a 500 or 700 GB
WAL+DB partition per OSD then the default settings should carry you to
L3 (~333GB required free space). Do you have more than 300GB metadata
per OSD?
All examples I've ever seen show the level base size at a power of 2,
but I don't know if there are any side effects of not doing that.
A 5/7GB level base is an order of magnitude higher than the default and
it's unclear what performance effect this has.
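
(As a rough illustration, not from the original mail: the arithmetic behind this
rule of thumb can be sketched in a few lines of Python. It assumes the simplified
model used in this thread - level sizes start at max_bytes_for_level_base and grow
by max_bytes_for_level_multiplier, and the DB partition has to hold the sum of all
levels up to the one that covers your metadata; the thresholds BlueFS actually
reports can differ somewhat.)

# level_math.py - back-of-the-envelope helper, not an official Ceph tool
def level_sizes(base_gb=0.256, multiplier=10, count=4):
    """Approximate per-level target sizes in GB, starting at max_bytes_for_level_base."""
    return [base_gb * multiplier ** i for i in range(count)]

def db_space_needed(metadata_gb, base_gb=0.256, multiplier=10):
    """Sum level sizes until they can hold metadata_gb; that sum is roughly
    what the DB partition must hold to avoid spillover."""
    needed, size = 0.0, base_gb
    while needed < metadata_gb:
        needed += size
        size *= multiplier
    return needed

print(level_sizes())         # roughly [0.256, 2.56, 25.6, 256] GB -> sum ~284 GB with defaults
print(db_space_needed(300))  # ~2844 GB: metadata beyond ~284 GB pulls in the next (2.5 TB) level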

I would not advise tinkering with the defaults unless you have a lot
of time and energy to burn on testing and are willing to accept
potential future performance issues on upgrade because you run a setup
that nobody ever tests for.

What's the size of your OSDs, how much db space per OSD do you
actually have and what do the spillover warnings say?


Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---

On 2021. Sep 21., at 9:19, Christian Wuerdig  
wrote:

Email received from the internet. If in doubt, don't click any link nor open 
any attachment !


It's been discussed a few times on the list but RocksDB levels essentially
grow by a factor of 10 (max_bytes_for_level_multiplier) by default and you
need (level-1)*10 space for the next level on your drive to avoid spill over
So the sequence (by default) is 256MB -> 2.56GB -> 25.6GB -> 256GB, and
since 50GB < 286GB (the sum of all levels) you get spill-over going from L2 to
L3. See also
https://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref/#sizing

Interestingly your level base seems to be 512MB instead of the default
256MB - did you change that? In your case the sequence I would have
expected is 0.5 -> 5 -> 50 - and you should have already seen spillover at
5GB (since you only have 50GB partitions but you need 55.5GB at least)
Not sure what's up with that. I think you need to re-create OSDs after
changing these RocksDB params

Overall since Pacific this no longer holds entirely true since RocksDB
sharding was added (
https://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref/#bluestore-rocksdb-sharding)
- it was broken in 16.2.4 but looks like it's fixed in 16.2.6

 1. Upgrade to Pacific
 2. Get rid of the NVME raid
 3. Make 160GB DB partitions
 4. Activate RocksDB sharding
 5. Don't worry about RocksDB params

[ceph-users] Re: Corruption on cluster

2021-09-21 Thread Nigel Williams
Could we see the content of the bug report please, that RH bugzilla entry
seems to have restricted access.
"You are not authorized to access bug #1996680."

On Wed, 22 Sept 2021 at 03:32, Patrick Donnelly  wrote:

> You're probably hitting this bug:
> https://bugzilla.redhat.com/show_bug.cgi?id=1996680
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RocksDB options for HDD, SSD, NVME Mixed productions

2021-09-21 Thread Christian Wuerdig
On Wed, 22 Sept 2021 at 07:07, Szabo, Istvan (Agoda)
 wrote:
>
> Increasing day by day, this is the current situation: (1 server has 6x 15.3TB 
> SAS ssds, 3x ssds are using 1x 1.92TB nvme for db+wal.
>
> WRN] BLUEFS_SPILLOVER: 13 OSD(s) experiencing BlueFS spillover
>  osd.1 spilled over 56 GiB metadata from 'db' device (318 GiB used of 596 
> GiB) to slow device
>  osd.5 spilled over 34 GiB metadata from 'db' device (316 GiB used of 596 
> GiB) to slow device
>  osd.6 spilled over 37 GiB metadata from 'db' device (314 GiB used of 596 
> GiB) to slow device
>  osd.8 spilled over 121 MiB metadata from 'db' device (317 GiB used of 
> 596 GiB) to slow device
>  osd.9 spilled over 53 GiB metadata from 'db' device (316 GiB used of 596 
> GiB) to slow device
>  osd.10 spilled over 114 GiB metadata from 'db' device (307 GiB used of 
> 596 GiB) to slow device
>  osd.11 spilled over 68 GiB metadata from 'db' device (315 GiB used of 
> 596 GiB) to slow device
>  osd.13 spilled over 30 GiB metadata from 'db' device (315 GiB used of 
> 596 GiB) to slow device
>  osd.15 spilled over 65 GiB metadata from 'db' device (313 GiB used of 
> 596 GiB) to slow device
>  osd.21 spilled over 6.8 GiB metadata from 'db' device (298 GiB used of 
> 596 GiB) to slow device
>  osd.22 spilled over 23 GiB metadata from 'db' device (317 GiB used of 
> 596 GiB) to slow device
>  osd.27 spilled over 228 GiB metadata from 'db' device (292 GiB used of 
> 596 GiB) to slow device
>  osd.34 spilled over 9.8 GiB metadata from 'db' device (316 GiB used of 
> 596 GiB) to slow device
>
> I guess it’s not an issue because it spilled over to an ssd from an nvme 
> which is running on 100% unfortunately. It would have effect I guess more if 
> spillover to hdd. Or can I scare anything with spillover?

Spillover is normal but it may slow things down, obviously. If you have
decent SSDs then the impact may not be as bad as if you had HDD
backing storage.
Looks like your setup is pretty decent - you could benefit from a
level base of 400 or 500MiB I guess - so you could set that and try to
make it stick. It's still not clear to me what you have to do to make it
change; maybe try an offline compaction of an OSD (ceph-kvstore-tool
bluestore-kv ${osd_path} compact) and then restart it.


>
> I have 6 nodes with this setup, I can’t believe that 30k read and 10-15k 
> write iops less than 1GB throughput can max out this cluster with ec 4:2 :((
>
> Istvan Szabo
> Senior Infrastructure Engineer
> ---
> Agoda Services Co., Ltd.
> e: istvan.sz...@agoda.com
> ---
>
> On 2021. Sep 21., at 20:21, Christian Wuerdig  
> wrote:
>
> Email received from the internet. If in doubt, don't click any link nor open 
> any attachment !
> 
>
> On Wed, 22 Sept 2021 at 05:54, Szabo, Istvan (Agoda)
>  wrote:
>
>
> Sorry to steal it, so if I have 500GB and 700GB mixed wal+rocksdb on nvme 
> the number should be the level base 50 and 70? Or needs to be 
> power of 2?
>
>
> Generally the sum of all levels (up to the max of your metadata) needs
> to fit into the db partition for each OSD. If you have a 500 or 700 GB
> WAL+DB partition per OSD then the default settings should carry you to
> L3 (~333GB required free space). Do you have more than 300GB metadata
> per OSD?
> All examples I've ever seen show the level base size at a power of 2
> but I don't know if there are any side effects not doing that
> 5/7GB level base is an order of magnitude higher than the default and
> it's unclear what performance effect this has.
>
> I would not advise tinkering with the defaults unless you have a lot
> of time and energy to burn on testing and are willing to accept
> potential future performance issues on upgrade because you run a setup
> that nobody ever tests for.
>
> What's the size of your OSDs, how much db space per OSD do you
> actually have and what do the spillover warnings say?
>
>
> Istvan Szabo
>
> Senior Infrastructure Engineer
>
> ---
>
> Agoda Services Co., Ltd.
>
> e: istvan.sz...@agoda.com
>
> ---
>
>
> On 2021. Sep 21., at 9:19, Christian Wuerdig  
> wrote:
>
>
> Email received from the internet. If in doubt, don't click any link nor open 
> any attachment !
>
> 
>
>
> It's been discussed a few times on the list but RocksDB levels essentially
>
> grow by a factor of 10 (max_bytes_for_level_multiplier) by default and you
>
> need (level-1)*10 space for the next level on your drive to avoid spill over
>
> So the sequence (by default) is 256MB -> 2.56GB -> 25.6GB -> 256GB, and
>
> since 50GB < 286GB (the sum of all levels) you get spill-over going from L2 to
>
> L3. See also
>
> https://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref/#sizing
>
>
> 

[ceph-users] Re: RocksDB options for HDD, SSD, NVME Mixed productions

2021-09-21 Thread Szabo, Istvan (Agoda)
Sorry to hijack the thread: if I have 500GB and 700GB mixed WAL+RocksDB on NVMe,
should the level base be 50 and 70? Or does it need to be a power of 2?

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---

On 2021. Sep 21., at 9:19, Christian Wuerdig  
wrote:

Email received from the internet. If in doubt, don't click any link nor open 
any attachment !


It's been discussed a few times on the list but RocksDB levels essentially
grow by a factor of 10 (max_bytes_for_level_multiplier) by default and you
need (level-1)*10 space for the next level on your drive to avoid spill over
So the sequence (by default) is 256MB -> 2.56GB -> 25.6GB -> 256GB, and
since 50GB < 286GB (the sum of all levels) you get spill-over going from L2 to
L3. See also
https://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref/#sizing

Interestingly your level base seems to be 512MB instead of the default
256MB - did you change that? In your case the sequence I would have
expected is 0.5 -> 5 -> 50 - and you should have already seen spillover at
5GB (since you only have 50GB partitions but you need 55.5GB at least)
Not sure what's up with that. I think you need to re-create OSDs after
changing these RocksDB params

Overall since Pacific this no longer holds entirely true since RocksDB
sharding was added (
https://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref/#bluestore-rocksdb-sharding)
- it was broken in 16.2.4 but looks like it's fixed in 16.2.6

  1. Upgrade to Pacific
  2. Get rid of the NVME raid
  3. Make 160GB DB partitions
  4. Activate RocksDB sharding
  5. Don't worry about RocksDB params

If you don't feel like upgrading to Pacific any time soon but want to make
more efficient use of the NVME and don't mind going out on a limb I'd still
do 2+3 plus study
https://github-wiki-see.page/m/facebook/rocksdb/wiki/Leveled-Compaction
carefully and make adjustments based on that.
With 160GB partitions a multiplier of 7 might work well with a base size of
350MB: 0.35 -> 2.45 -> 17.15 -> 120.05 (total ~140GB out of 160GB).

You could also try to switch to a 9x multiplier and re-create one of the
OSDs to see how it pans out prior to dissolving the raid1 setup (given your
settings that should result in 0.5 -> 4.5 -> 40.5 GB usage)
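
(As a quick sanity check, not from the original mail, the two combinations suggested
above can be reproduced with a couple of lines of Python; the GB figures use the same
rough approximations as the rest of the thread.)

seqs = {
    "160GB partition, 0.35GB base, 7x multiplier": [0.35 * 7 ** i for i in range(4)],
    "50GB partition, 0.5GB base, 9x multiplier": [0.5 * 9 ** i for i in range(3)],
}
for name, levels in seqs.items():
    # per-level sizes plus their sum, which must fit into the DB partition
    print(name, [round(l, 2) for l in levels], "sum =", round(sum(levels), 2), "GB")
# -> [0.35, 2.45, 17.15, 120.05] sum = 140.0 GB, and [0.5, 4.5, 40.5] sum = 45.5 GB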

On Tue, 21 Sept 2021 at 13:19, mhnx  wrote:

Hello everyone!
I want to understand the concept and tune my rocksDB options on nautilus
14.2.16.

osd.178 spilled over 102 GiB metadata from 'db' device (24 GiB used of
50 GiB) to slow device
osd.180 spilled over 91 GiB metadata from 'db' device (33 GiB used of
50 GiB) to slow device

The problem is, I have the spill over warnings like the rest of the
community.
I tuned RocksDB options with the settings below but the problem still
exists, and I wonder if I did anything wrong. I still have the spillovers,
and sometimes the index SSDs also go down due to compaction problems
and cannot be started until I do an offline compaction.

Let me tell you about my hardware right?
Every server in my system has:
HDD -   19 x TOSHIBA  MG08SCA16TEY   16.0TB for EC pool.
SSD -3 x SAMSUNG  MZILS960HEHP/007 GXL0 960GB
NVME - 2 x PM1725b 1.6TB

I'm using RAID 1 NVMe for the BlueStore DB. I don't have a separate WAL.
19*50GB = 950GB total usage on the NVMe. (I was thinking of using the rest but
regret it now.)

So! Finally let's check my RocksDB Options:
[osd]
bluefs_buffered_io = true
bluestore_rocksdb_options =

compression=kNoCompression,max_write_buffer_number=32,min_write_buffer_number_to_merge=2,recycle_log_file_num=32,compaction_style=kCompactionStyleLevel,write_buffer_size=67108864,target_file_size_base=67108864,max_background_compactions=31,level0_file_num_compaction_trigger=8,level0_slowdown_writes_trigger=32,level0_stop_writes_trigger=64,flusher_threads=8,compaction_readahead_size=2MB,compaction_threads=16,
max_bytes_for_level_base=536870912,
max_bytes_for_level_multiplier=10
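
(An aside, not from the original mail: a long option string like the one above is
easier to audit if you split it up; the snippet below is only an illustration and
uses a shortened copy of the value.)

# double-check what the RocksDB option string actually sets
opts_str = ("write_buffer_size=67108864,target_file_size_base=67108864,"
            "max_bytes_for_level_base=536870912,max_bytes_for_level_multiplier=10")
opts = dict(kv.split("=", 1) for kv in opts_str.split(",") if kv)
print(int(opts["max_bytes_for_level_base"]) / 2**20, "MiB level base,",
      opts["max_bytes_for_level_multiplier"], "x multiplier")
# -> 512.0 MiB level base, 10 x multiplier (the 0.5 -> 5 -> 50 sequence mentioned above)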

*"ceph osd df tree"  *to see ssd and hdd usage, omap and meta.

ID   CLASS  WEIGHT     REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS  TYPE NAME
-28         280.04810         -  280 TiB  169 TiB  166 TiB  688 GiB  2.4 TiB  111 TiB  60.40  1.00    -          host MHNX1
178  hdd     14.60149   1.0       15 TiB  8.6 TiB  8.5 TiB   44 KiB  126 GiB  6.0 TiB  59.21  0.98  174  up      osd.178
179  ssd      0.87329   1.0      894 GiB  415 GiB   89 GiB  321 GiB  5.4 GiB  479 GiB  46.46  0.77  104  up      osd.179


I know the size of the NVMe is not suitable for 16TB HDDs. I should have more,
but the expense is cutting us to pieces. Because of that I think I'll see the
spillovers no matter what I do. But maybe I can make it better
with your help!

My questions are:
1- What is the meaning of (33 GiB used of 50 GiB)
2- Why it's not 50GiB / 50GiB ?
3- Do I have 17GiB 


[ceph-users] Re: EC CLAY production-ready or technology preview in Pacific?

2021-09-21 Thread Neha Ojha
On Thu, Aug 19, 2021 at 9:29 AM Jeremy Austin  wrote:
>
> I cannot speak in any official capacity, but in my limited experience
> (20-30TB), EC CLAY has been functioning without an error for about 2 years.
> No issues in Pacific myself yet (fingers crossed).

This is good to know! I don't recall too many bugs reported against
the clay plugin since we introduced it in Nautilus. I'd be curious to
know how many users are using it and how their experience has been. We
are also trying to use telemetry to capture more data about this.

Thanks,
Neha


>
> Statistics is not the plural of anecdote… YMMV,
> Jeremy
>
> On Thu, Aug 19, 2021 at 12:56 AM Alexander Sporleder 
> wrote:
>
> > Hallo list,
> > Is the "EC CLAY code plugin" considered to be production-ready in Pacific
> > or is it more a technology preview?
> >
> >
> > Best,
> > Alex
> >
> >
> >
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> >
>
>
> --
> Jeremy Austin
> jhaus...@gmail.com
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Monitor issue while installation

2021-09-21 Thread Konstantin Shalygin

Hi,

Your Ansible monitoring_group_name variable is not defined, define it first



k

> On 21 Sep 2021, at 12:12, Michel Niyoyita  wrote:
> 
> Hello team
> 
> I am running a Ceph cluster (Pacific version) deployed using Ansible. I
> would like to add other OSDs but it fails once it reaches the mon
> installation with this fatal error:
> 
> msg: |-
>The conditional check 'groups.get(monitoring_group_name, []) | length >
> 0' failed. The error was: error while evaluating conditional
> (groups.get(monitoring_group_name, []) | length > 0):
> 'monitoring_group_name' is undefined
> 
>The error appears to be in '/opt/ceph-ansible/dashboard.yml': line 15,
> column 7, but may
>be elsewhere in the file depending on the exact syntax problem.
> 
>The offending line appears to be:
> 
>  pre_tasks:
>- name: set ceph node exporter install 'In Progress'
>  ^ here
> 
> Can someone help to resolve the issue below?
> 
> Michel
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RocksDB options for HDD, SSD, NVME Mixed productions

2021-09-21 Thread Christian Wuerdig
On Wed, 22 Sept 2021 at 05:54, Szabo, Istvan (Agoda)
 wrote:
>
> Sorry to steal it, so if I have 500GB and 700GB mixed wal+rocksdb on nvme 
> the number should be the level base 50 and 70? Or needs to be 
> power of 2?

Generally the sum of all levels (up to the max of your metadata) needs
to fit into the db partition for each OSD. If you have a 500 or 700 GB
WAL+DB partition per OSD then the default settings should carry you to
L3 (~333GB required free space). Do you have more than 300GB metadata
per OSD?
All examples I've ever seen show the level base size at a power of 2,
but I don't know if there are any side effects of not doing that.
A 5/7GB level base is an order of magnitude higher than the default and
it's unclear what performance effect this has.

I would not advise tinkering with the defaults unless you have a lot
of time and energy to burn on testing and are willing to accept
potential future performance issues on upgrade because you run a setup
that nobody ever tests for.

What's the size of your OSDs, how much db space per OSD do you
actually have and what do the spillover warnings say?


> Istvan Szabo
> Senior Infrastructure Engineer
> ---
> Agoda Services Co., Ltd.
> e: istvan.sz...@agoda.com
> ---
>
> On 2021. Sep 21., at 9:19, Christian Wuerdig  
> wrote:
>
> Email received from the internet. If in doubt, don't click any link nor open 
> any attachment !
> 
>
> It's been discussed a few times on the list but RocksDB levels essentially
> grow by a factor of 10 (max_bytes_for_level_multiplier) by default and you
> need (level-1)*10 space for the next level on your drive to avoid spill over
> So the sequence (by default) is 256MB -> 2.56GB -> 25.6GB -> 256GB, and
> since 50GB < 286GB (the sum of all levels) you get spill-over going from L2 to
> L3. See also
> https://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref/#sizing
>
> Interestingly your level base seems to be 512MB instead of the default
> 256MB - did you change that? In your case the sequence I would have
> expected is 0.5 -> 5 -> 50 - and you should have already seen spillover at
> 5GB (since you only have 50GB partitions but you need 55.5GB at least)
> Not sure what's up with that. I think you need to re-create OSDs after
> changing these RocksDB params
>
> Overall since Pacific this no longer holds entirely true since RocksDB
> sharding was added (
> https://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref/#bluestore-rocksdb-sharding)
> - it was broken in 16.2.4 but looks like it's fixed in 16.2.6
>
>   1. Upgrade to Pacific
>   2. Get rid of the NVME raid
>   3. Make 160GB DB partitions
>   4. Activate RocksDB sharding
>   5. Don't worry about RocksDB params
>
> If you don't feel like upgrading to Pacific any time soon but want to make
> more efficient use of the NVME and don't mind going out on a limb I'd still
> do 2+3 plus study
> https://github-wiki-see.page/m/facebook/rocksdb/wiki/Leveled-Compaction
> carefully and make adjustments based on that.
> With 160GB partitions a multiplier of 7 might work well with a base size of
> 350MB 0.35 -> 2.45 -> 17.15 -> 120.05 (total 140GB out of 160GB)
>
> You could also try to switch to a 9x multiplier and re-create one of the
> OSDs to see how it pans out prior to dissolving the raid1 setup (given your
> settings that should result in 0.5 -> 4.5 -> 40.5 GB usage)
>
> On Tue, 21 Sept 2021 at 13:19, mhnx  wrote:
>
> Hello everyone!
>
> I want to understand the concept and tune my rocksDB options on nautilus
>
> 14.2.16.
>
>
> osd.178 spilled over 102 GiB metadata from 'db' device (24 GiB used of
>
> 50 GiB) to slow device
>
> osd.180 spilled over 91 GiB metadata from 'db' device (33 GiB used of
>
> 50 GiB) to slow device
>
>
> The problem is, I have the spill over warnings like the rest of the
>
> community.
>
> I tuned RocksDB Options with the settings below but the problem still
>
> exists and I wonder if I did anything wrong. I still have the Spill Overs
>
> and also some times index SSD's are getting down due to compaction problems
>
> and can not start them until I do offline compaction.
>
>
> Let me tell you about my hardware right?
>
> Every server in my system has:
>
> HDD -   19 x TOSHIBA  MG08SCA16TEY   16.0TB for EC pool.
>
> SSD -3 x SAMSUNG  MZILS960HEHP/007 GXL0 960GB
>
> NVME - 2 x PM1725b 1.6TB
>
>
> I'm using Raid 1 Nvme for Bluestore DB. I dont have WAL.
>
> 19*50GB = 950GB total usage on NVME. (I was thinking use the rest but
>
> regret it now)
>
>
> So! Finally let's check my RocksDB Options:
>
> [osd]
>
> bluefs_buffered_io = true
>
> bluestore_rocksdb_options =
>
>
> 

[ceph-users] Re: *****SPAM***** Re: Corruption on cluster

2021-09-21 Thread David Schulz
Wow!  Thanks everyone!

The bug report at https://tracker.ceph.com/issues/51948 describes 
exactly the behaviour that we are seeing.  I'll update and let everyone 
know when I've finished the upgrade.  This will probably take a few days 
as I need to wait for a window to do the work.

Sincerely

-Dave

On 2021-09-21 11:42 a.m., Dan van der Ster wrote:
> [△EXTERNAL]
>
>
>
> It's this: https://tracker.ceph.com/issues/51948
>
>
> The fix just landed in 4.18.0-305.19.1
>
> https://access.redhat.com/errata/RHSA-2021:3548
>
>
>
> On Tue, 21 Sep 2021, 19:35 Marc,  wrote:
>
>> I do not have access to this page. Maybe others also not, so it is better
>> to paste it's content here.
>>
>>> -Original Message-
>>> From: Patrick Donnelly 
>>> Sent: Tuesday, 21 September 2021 19:30
>>> To: David Schulz 
>>> Cc: ceph-users@ceph.io
>>> Subject: *SPAM* [ceph-users] Re: Corruption on cluster
>>>
>>> Hi Dave,
>>>
>>> On Tue, Sep 21, 2021 at 1:20 PM David Schulz 
>>> wrote:
 Hi Everyone,

 For a couple of weeks I've been battling a corruption in Ceph FS that
 happens when a writer on one node writes a line and calls sync as is
 typical with logging and the file is corrupted when the same file that
 is being written is read from another client.

 The cluster is a Nautilus 14.2.9 and the clients are all kernel client
 mounting the filesystem with CentOS 8.4 kernel
 4.18.0-305.10.2.el8_4.x86_64.  Bluestore OSDs and Eraseure coding are
 both used.  The cluster was upgraded from Mimic (the first installed
 versoin) at some point.

 Here is a little python3 program that triggers the issue:

 import os
 import time

 fh=open("test.log", "a")

 while True:
   start = time.time()
   fh.writelines("test2\n")
   end = time.time()
   fh.flush()
   junk=os.getpid()
   fh.writelines(f"took {(end - start)}\n")
   fh.flush()
   time.sleep(1)

 If I run this on one client and repeatedly run "wc -l " on a different
 client.  The wc will do 2 different behaviours, sometimes NULL bytes
>>> get
 scribbled in the file and the next line of output is appended and
>>> other
 times the file gets truncated.

 I did update from 14.2.2 to 14.2.9 (I had the a clone of the 14.2.9
>>> repo
 on hand).  I read the release notes and there did seem to be some
 related fixes between 14.2.2 and 14.2.9 but nothing after 14.2.9.

 I can't seem to find any references to a problem like this anywhere.
 Does anyone have any ideas?
>>> You're probably hitting this bug:
>>> https://bugzilla.redhat.com/show_bug.cgi?id=1996680
>>>
>>> Try upgrading your kernel.
>>>
>>> --
>>> Patrick Donnelly, Ph.D.
>>> He / Him / His
>>> Principal Software Engineer
>>> Red Hat Sunnyvale, CA
>>> GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
>>>
>>> ___
>>> ceph-users mailing list -- ceph-users@ceph.io
>>> To unsubscribe send an email to ceph-users-le...@ceph.io
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RocksDB options for HDD, SSD, NVME Mixed productions

2021-09-21 Thread Christian Wuerdig
On Wed, 22 Sept 2021 at 00:54, mhnx  wrote:
>
> Thanks for the explanation. Then the first thing I did wrong is that I didn't add
> up the levels to reach the total space. I didn't know that, and I've set:
> max_bytes_for_level_base=536870912 and max_bytes_for_level_multiplier=10
> 536870912*10*10=50GB
>
> I have space on Nvme's. I think I can resize the partitions.
> 1- Set osd down
> 2- Migrate partition to next blocks to be able resize the partition
> 3- Resize DB partition block size to 60GiB * 19HDD =
> 4- Set osd up
>
> Also the other option is:
> 1- Remove Nvme from raid1
> 2- Migrate half of the partitions on New empty Nvme.
> 3- Resize the partitions
> 4- Resize the rest partitions or re-create the Nvme to get rid of degraded 
> Nvme pool.
>
> It's a lot of hard work, and also your statement "You need to re-create OSDs for
> new RocksDB options" killed my dreams.
> Are you sure about this? Why does an OSD restart have no effect on RocksDB options?
> Do I really need to re-create all 190 HDDs? Just wow. It will take decades
> to be done.


No - I'm not 100% sure about this (I've never tinkered with the settings),
but it would fit the observations (namely that the 512MB level base
which you set doesn't seem to apply). I haven't found any explicit
documentation, but in principle RocksDB should support changing these
params on the fly; I'm just not sure how to get Ceph to apply them. Somebody
else would have to chime in to confirm.
Also keep in mind that even with a 60GB partition you will still get
spillover, since you seem to have around 120-130GB of metadata per OSD, so
moving to 160GB partitions would seem to be better.
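
(Illustration only, not from the original mail - this just restates the rule of
thumb from earlier in the thread in Python; BlueFS itself may report slightly
different thresholds.)

def spills(metadata_gb, partition_gb, base_gb=0.256, mult=10):
    """True if the level sum needed for metadata_gb exceeds the DB partition."""
    needed, size = 0.0, base_gb
    while needed < metadata_gb:   # add levels until they can hold the metadata
        needed += size
        size *= mult
    return needed > partition_gb

print(spills(125, 60))            # True  - ~125 GB of metadata spills out of a 60 GB partition
print(spills(125, 160, 0.35, 7))  # False - fits a 160 GB partition with a 350 MB base and 7x multiplier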

>
>
>
>
>
>
> Christian Wuerdig wrote on Tue, 21 Sep 2021 at 10:15:
>>
>> It's been discussed a few times on the list but RocksDB levels essentially 
>> grow by a factor of 10 (max_bytes_for_level_multiplier) by default and you 
>> need (level-1)*10 space for the next level on your drive to avoid spill over
>> So the sequence (by default) is 256MB -> 2.56GB -> 25.6GB -> 256GB, and
>> since 50GB < 286GB (the sum of all levels) you get spill-over going from L2 to L3.
>> See also 
>> https://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref/#sizing
>>
>> Interestingly your level base seems to be 512MB instead of the default 256MB 
>> - did you change that? In your case the sequence I would have expected is 
>> 0.5 -> 5 -> 50 - and you should have already seen spillover at 5GB (since 
>> you only have 50GB partitions but you need 55.5GB at least)
>> Not sure what's up with that. I think you need to re-create OSDs after 
>> changing these RocksDB params
>>
>> Overall since Pacific this no longer holds entirely true since RocksDB 
>> sharding was added 
>> (https://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref/#bluestore-rocksdb-sharding)
>>  - it was broken in 16.2.4 but looks like it's fixed in 16.2.6
>>
>> Upgrade to Pacific
>> Get rid of the NVME raid
>> Make 160GB DB partitions
>> Activate RocksDB sharding
>> Don't worry about RocksDB params
>>
>> If you don't feel like upgrading to Pacific any time soon but want to make 
>> more efficient use of the NVME and don't mind going out on a limb I'd still
>> do 2+3 plus study 
>> https://github-wiki-see.page/m/facebook/rocksdb/wiki/Leveled-Compaction 
>> carefully and make adjustments based on that.
>> With 160GB partitions a multiplier of 7 might work well with a base size of 
>> 350MB 0.35 -> 2.45 -> 17.15 -> 120.05 (total 140GB out of 160GB)
>>
>> You could also try to switch to a 9x multiplier and re-create one of the 
>> OSDs to see how it pans out prior to dissolving the raid1 setup (given your 
>> settings that should result in 0.5 -> 4.5 -> 40.5 GB usage)
>>
>> On Tue, 21 Sept 2021 at 13:19, mhnx  wrote:
>>>
>>> Hello everyone!
>>> I want to understand the concept and tune my rocksDB options on nautilus
>>> 14.2.16.
>>>
>>>  osd.178 spilled over 102 GiB metadata from 'db' device (24 GiB used of
>>> 50 GiB) to slow device
>>>  osd.180 spilled over 91 GiB metadata from 'db' device (33 GiB used of
>>> 50 GiB) to slow device
>>>
>>> The problem is, I have the spill over warnings like the rest of the
>>> community.
>>> I tuned RocksDB Options with the settings below but the problem still
>>> exists and I wonder if I did anything wrong. I still have the Spill Overs
>>> and also some times index SSD's are getting down due to compaction problems
>>> and can not start them until I do offline compaction.
>>>
>>> Let me tell you about my hardware right?
>>> Every server in my system has:
>>> HDD -   19 x TOSHIBA  MG08SCA16TEY   16.0TB for EC pool.
>>> SSD -3 x SAMSUNG  MZILS960HEHP/007 GXL0 960GB
>>> NVME - 2 x PM1725b 1.6TB
>>>
>>> I'm using Raid 1 Nvme for Bluestore DB. I dont have WAL.
>>> 19*50GB = 950GB total usage on NVME. (I was thinking use the rest but
>>> regret it now)
>>>
>>> So! Finally let's check my RocksDB Options:
>>> [osd]
>>> bluefs_buffered_io = true
>>> bluestore_rocksdb_options =
>>> 

[ceph-users] Re: *****SPAM***** Re: Corruption on cluster

2021-09-21 Thread Dan van der Ster
It's this: https://tracker.ceph.com/issues/51948


The fix just landed in 4.18.0-305.19.1

https://access.redhat.com/errata/RHSA-2021:3548



On Tue, 21 Sep 2021, 19:35 Marc,  wrote:

>
> I do not have access to this page. Maybe others also not, so it is better
> to paste it's content here.
>
> > -Original Message-
> > From: Patrick Donnelly 
> > Sent: Tuesday, 21 September 2021 19:30
> > To: David Schulz 
> > Cc: ceph-users@ceph.io
> > Subject: *SPAM* [ceph-users] Re: Corruption on cluster
> >
> > Hi Dave,
> >
> > On Tue, Sep 21, 2021 at 1:20 PM David Schulz 
> > wrote:
> > >
> > > Hi Everyone,
> > >
> > > For a couple of weeks I've been battling a corruption in Ceph FS that
> > > happens when a writer on one node writes a line and calls sync as is
> > > typical with logging and the file is corrupted when the same file that
> > > is being written is read from another client.
> > >
> > > The cluster is a Nautilus 14.2.9 and the clients are all kernel client
> > > mounting the filesystem with CentOS 8.4 kernel
> > > 4.18.0-305.10.2.el8_4.x86_64.  Bluestore OSDs and Eraseure coding are
> > > both used.  The cluster was upgraded from Mimic (the first installed
> > > versoin) at some point.
> > >
> > > Here is a little python3 program that triggers the issue:
> > >
> > > import os
> > > import time
> > >
> > > fh=open("test.log", "a")
> > >
> > > while True:
> > >  start = time.time()
> > >  fh.writelines("test2\n")
> > >  end = time.time()
> > >  fh.flush()
> > >  junk=os.getpid()
> > >  fh.writelines(f"took {(end - start)}\n")
> > >  fh.flush()
> > >  time.sleep(1)
> > >
> > > If I run this on one client and repeatedly run "wc -l " on a different
> > > client.  The wc will do 2 different behaviours, sometimes NULL bytes
> > get
> > > scribbled in the file and the next line of output is appended and
> > other
> > > times the file gets truncated.
> > >
> > > I did update from 14.2.2 to 14.2.9 (I had the a clone of the 14.2.9
> > repo
> > > on hand).  I read the release notes and there did seem to be some
> > > related fixes between 14.2.2 and 14.2.9 but nothing after 14.2.9.
> > >
> > > I can't seem to find any references to a problem like this anywhere.
> > > Does anyone have any ideas?
> >
> > You're probably hitting this bug:
> > https://bugzilla.redhat.com/show_bug.cgi?id=1996680
> >
> > Try upgrading your kernel.
> >
> > --
> > Patrick Donnelly, Ph.D.
> > He / Him / His
> > Principal Software Engineer
> > Red Hat Sunnyvale, CA
> > GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
> >
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: *****SPAM***** Re: Corruption on cluster

2021-09-21 Thread Marc


I do not have access to this page. Maybe others do not either, so it is better to
paste its content here.

> -Original Message-
> From: Patrick Donnelly 
> Sent: Tuesday, 21 September 2021 19:30
> To: David Schulz 
> Cc: ceph-users@ceph.io
> Subject: *SPAM* [ceph-users] Re: Corruption on cluster
> 
> Hi Dave,
> 
> On Tue, Sep 21, 2021 at 1:20 PM David Schulz 
> wrote:
> >
> > Hi Everyone,
> >
> > For a couple of weeks I've been battling a corruption in Ceph FS that
> > happens when a writer on one node writes a line and calls sync as is
> > typical with logging and the file is corrupted when the same file that
> > is being written is read from another client.
> >
> > The cluster is a Nautilus 14.2.9 and the clients are all kernel client
> > mounting the filesystem with CentOS 8.4 kernel
> > 4.18.0-305.10.2.el8_4.x86_64.  Bluestore OSDs and Eraseure coding are
> > both used.  The cluster was upgraded from Mimic (the first installed
> > versoin) at some point.
> >
> > Here is a little python3 program that triggers the issue:
> >
> > import os
> > import time
> >
> > fh=open("test.log", "a")
> >
> > while True:
> >  start = time.time()
> >  fh.writelines("test2\n")
> >  end = time.time()
> >  fh.flush()
> >  junk=os.getpid()
> >  fh.writelines(f"took {(end - start)}\n")
> >  fh.flush()
> >  time.sleep(1)
> >
> > If I run this on one client and repeatedly run "wc -l " on a different
> > client.  The wc will do 2 different behaviours, sometimes NULL bytes
> get
> > scribbled in the file and the next line of output is appended and
> other
> > times the file gets truncated.
> >
> > I did update from 14.2.2 to 14.2.9 (I had the a clone of the 14.2.9
> repo
> > on hand).  I read the release notes and there did seem to be some
> > related fixes between 14.2.2 and 14.2.9 but nothing after 14.2.9.
> >
> > I can't seem to find any references to a problem like this anywhere.
> > Does anyone have any ideas?
> 
> You're probably hitting this bug:
> https://bugzilla.redhat.com/show_bug.cgi?id=1996680
> 
> Try upgrading your kernel.
> 
> --
> Patrick Donnelly, Ph.D.
> He / Him / His
> Principal Software Engineer
> Red Hat Sunnyvale, CA
> GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Corruption on cluster

2021-09-21 Thread Patrick Donnelly
Hi Dave,

On Tue, Sep 21, 2021 at 1:20 PM David Schulz  wrote:
>
> Hi Everyone,
>
> For a couple of weeks I've been battling a corruption in Ceph FS that
> happens when a writer on one node writes a line and calls sync as is
> typical with logging and the file is corrupted when the same file that
> is being written is read from another client.
>
> The cluster is a Nautilus 14.2.9 and the clients are all kernel client
> mounting the filesystem with CentOS 8.4 kernel
> 4.18.0-305.10.2.el8_4.x86_64.  Bluestore OSDs and Eraseure coding are
> both used.  The cluster was upgraded from Mimic (the first installed
> versoin) at some point.
>
> Here is a little python3 program that triggers the issue:
>
> import os
> import time
>
> fh=open("test.log", "a")
>
> while True:
>  start = time.time()
>  fh.writelines("test2\n")
>  end = time.time()
>  fh.flush()
>  junk=os.getpid()
>  fh.writelines(f"took {(end - start)}\n")
>  fh.flush()
>  time.sleep(1)
>
> If I run this on one client and repeatedly run "wc -l " on a different
> client.  The wc will do 2 different behaviours, sometimes NULL bytes get
> scribbled in the file and the next line of output is appended and other
> times the file gets truncated.
>
> I did update from 14.2.2 to 14.2.9 (I had the a clone of the 14.2.9 repo
> on hand).  I read the release notes and there did seem to be some
> related fixes between 14.2.2 and 14.2.9 but nothing after 14.2.9.
>
> I can't seem to find any references to a problem like this anywhere.
> Does anyone have any ideas?

You're probably hitting this bug:
https://bugzilla.redhat.com/show_bug.cgi?id=1996680

Try upgrading your kernel.

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Principal Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Corruption on cluster

2021-09-21 Thread David Schulz
Hi Everyone,

For a couple of weeks I've been battling a corruption in CephFS that
happens when a writer on one node writes a line and calls sync, as is
typical with logging, and the file is corrupted when the same file that
is being written is read from another client.

The cluster is Nautilus 14.2.9 and the clients all use the kernel client,
mounting the filesystem with the CentOS 8.4 kernel
4.18.0-305.10.2.el8_4.x86_64.  BlueStore OSDs and erasure coding are
both used.  The cluster was upgraded from Mimic (the first installed
version) at some point.

Here is a little python3 program that triggers the issue:

import os
import time

fh=open("test.log", "a")

while True:
     start = time.time()
     fh.writelines("test2\n")
     end = time.time()
     fh.flush()
     junk=os.getpid()
     fh.writelines(f"took {(end - start)}\n")
     fh.flush()
     time.sleep(1)

If I run this on one client and repeatedly run "wc -l" on a different
client, wc will show two different behaviours: sometimes NULL bytes get
scribbled into the file and the next line of output is appended, and other
times the file gets truncated.
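
(A small reader-side sketch, not part of the original report, that makes the two
failure modes easier to spot than repeated wc -l runs; it assumes the same test.log
written by the script above and polls it from the second client.)

import time

last_size = 0
while True:
    with open("test.log", "rb") as f:
        data = f.read()
    # symptom 1: NULL bytes scribbled into the file
    nul = data.find(b"\x00")
    if nul != -1:
        print("NUL bytes starting at offset", nul, "of", len(data))
    # symptom 2: the file shrinking (truncation)
    if len(data) < last_size:
        print("file truncated:", last_size, "->", len(data), "bytes")
    last_size = max(last_size, len(data))
    time.sleep(1)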

I did update from 14.2.2 to 14.2.9 (I had a clone of the 14.2.9 repo
on hand).  I read the release notes and there did seem to be some
related fixes between 14.2.2 and 14.2.9, but nothing after 14.2.9.

I can't seem to find any references to a problem like this anywhere.  
Does anyone have any ideas?

Sincerely

-Dave

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Error: UPGRADE_FAILED_PULL: Upgrade: failed to pull target image

2021-09-21 Thread Radoslav Milanov

There is a problem upgrading ceph-iscsi from 16.2.5 to 16.2.6:

2021-09-21T12:43:58.767556-0400 mgr.nj3231.wagzhn [ERR] cephadm exited 
with an error code: 1, stderr:Redeploy daemon iscsi.iscsi.nj3231.mqeari ...

Creating ceph-iscsi config...
Write file: 
/var/lib/ceph/c6c8bc66-1716-11ec-b029-1c34da4b9fb6/iscsi.iscsi.nj3231.mqeari/iscsi-gateway.cfg
Failed to trim old cgroups 
/sys/fs/cgroup/system.slice/system-ceph\x2dc6c8bc66\x2d1716\x2d11ec\x2db029\x2d1c34da4b9fb6.slice/ceph-c6c8bc66-1716-11ec-b029-1c34da4b9fb6@iscsi.iscsi.nj3231.mqeari.service
Non-zero exit code 1 from systemctl start 
ceph-c6c8bc66-1716-11ec-b029-1c34da4b9fb6@iscsi.iscsi.nj3231.mqeari
systemctl: stderr Job for 
ceph-c6c8bc66-1716-11ec-b029-1c34da4b9fb6@iscsi.iscsi.nj3231.mqeari.service 
failed because the control process exited with error code.
systemctl: stderr See "systemctl status 
ceph-c6c8bc66-1716-11ec-b029-1c34da4b9fb6@iscsi.iscsi.nj3231.mqeari.service" 
and "journalctl -xe" for details.

Traceback (most recent call last):
  File 
"/var/lib/ceph/c6c8bc66-1716-11ec-b029-1c34da4b9fb6/cephadm.f46dc95b01feeedb28941a48e2f1d0abb51139ca828de11150ea7122a8e3549c", 
line 8479, in 

    main()
  File 
"/var/lib/ceph/c6c8bc66-1716-11ec-b029-1c34da4b9fb6/cephadm.f46dc95b01feeedb28941a48e2f1d0abb51139ca828de11150ea7122a8e3549c", 
line 8467, in main

    r = ctx.func(ctx)
  File 
"/var/lib/ceph/c6c8bc66-1716-11ec-b029-1c34da4b9fb6/cephadm.f46dc95b01feeedb28941a48e2f1d0abb51139ca828de11150ea7122a8e3549c", 
line 1782, in _default_image

    return func(ctx)
  File 
"/var/lib/ceph/c6c8bc66-1716-11ec-b029-1c34da4b9fb6/cephadm.f46dc95b01feeedb28941a48e2f1d0abb51139ca828de11150ea7122a8e3549c", 
line 4523, in command_deploy

    deploy_daemon(ctx, ctx.fsid, daemon_type, daemon_id, c, uid, gid,
  File 
"/var/lib/ceph/c6c8bc66-1716-11ec-b029-1c34da4b9fb6/cephadm.f46dc95b01feeedb28941a48e2f1d0abb51139ca828de11150ea7122a8e3549c", 
line 2669, in deploy_daemon

    deploy_daemon_units(ctx, fsid, uid, gid, daemon_type, daemon_id,
  File 
"/var/lib/ceph/c6c8bc66-1716-11ec-b029-1c34da4b9fb6/cephadm.f46dc95b01feeedb28941a48e2f1d0abb51139ca828de11150ea7122a8e3549c", 
line 2899, in deploy_daemon_units

    call_throws(ctx, ['systemctl', 'start', unit_name])
  File 
"/var/lib/ceph/c6c8bc66-1716-11ec-b029-1c34da4b9fb6/cephadm.f46dc95b01feeedb28941a48e2f1d0abb51139ca828de11150ea7122a8e3549c", 
line 1462, in call_throws

    raise RuntimeError('Failed command: %s' % ' '.join(command))
RuntimeError: Failed command: systemctl start 
ceph-c6c8bc66-1716-11ec-b029-1c34da4b9fb6@iscsi.iscsi.nj3231.mqeari

Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 1366, in 
_remote_connection

    yield (conn, connr)
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 1263, in _run_cephadm
    code, '\n'.join(err)))
orchestrator._interface.OrchestratorError: cephadm exited with an error 
code: 1, stderr:Redeploy daemon iscsi.iscsi.nj3231.mqeari ...

Creating ceph-iscsi config...
Write file: 
/var/lib/ceph/c6c8bc66-1716-11ec-b029-1c34da4b9fb6/iscsi.iscsi.nj3231.mqeari/iscsi-gateway.cfg
Failed to trim old cgroups 
/sys/fs/cgroup/system.slice/system-ceph\x2dc6c8bc66\x2d1716\x2d11ec\x2db029\x2d1c34da4b9fb6.slice/ceph-c6c8bc66-1716-11ec-b029-1c34da4b9fb6@iscsi.iscsi.nj3231.mqeari.service
Non-zero exit code 1 from systemctl start 
ceph-c6c8bc66-1716-11ec-b029-1c34da4b9fb6@iscsi.iscsi.nj3231.mqeari
systemctl: stderr Job for 
ceph-c6c8bc66-1716-11ec-b029-1c34da4b9fb6@iscsi.iscsi.nj3231.mqeari.service 
failed because the control process exited with error code.
systemctl: stderr See "systemctl status 
ceph-c6c8bc66-1716-11ec-b029-1c34da4b9fb6@iscsi.iscsi.nj3231.mqeari.service" 
and "journalctl -xe" for details.

Traceback (most recent call last):
  File 
"/var/lib/ceph/c6c8bc66-1716-11ec-b029-1c34da4b9fb6/cephadm.f46dc95b01feeedb28941a48e2f1d0abb51139ca828de11150ea7122a8e3549c", 
line 8479, in 

    main()
  File 
"/var/lib/ceph/c6c8bc66-1716-11ec-b029-1c34da4b9fb6/cephadm.f46dc95b01feeedb28941a48e2f1d0abb51139ca828de11150ea7122a8e3549c", 
line 8467, in main

    r = ctx.func(ctx)
  File 
"/var/lib/ceph/c6c8bc66-1716-11ec-b029-1c34da4b9fb6/cephadm.f46dc95b01feeedb28941a48e2f1d0abb51139ca828de11150ea7122a8e3549c", 
line 1782, in _default_image

    return func(ctx)
  File 
"/var/lib/ceph/c6c8bc66-1716-11ec-b029-1c34da4b9fb6/cephadm.f46dc95b01feeedb28941a48e2f1d0abb51139ca828de11150ea7122a8e3549c", 
line 4523, in command_deploy

    deploy_daemon(ctx, ctx.fsid, daemon_type, daemon_id, c, uid, gid,
  File 
"/var/lib/ceph/c6c8bc66-1716-11ec-b029-1c34da4b9fb6/cephadm.f46dc95b01feeedb28941a48e2f1d0abb51139ca828de11150ea7122a8e3549c", 
line 2669, in deploy_daemon

    deploy_daemon_units(ctx, fsid, uid, gid, daemon_type, daemon_id,
  File 
"/var/lib/ceph/c6c8bc66-1716-11ec-b029-1c34da4b9fb6/cephadm.f46dc95b01feeedb28941a48e2f1d0abb51139ca828de11150ea7122a8e3549c", 
line 2899, in deploy_daemon_units

    call_throws(ctx, ['systemctl', 'start', unit_name])
  File 

[ceph-users] MDS 16.2.5-387-g7282d81d and DAEMON_OLD_VERSION

2021-09-21 Thread Выдрук Денис
Hello.

I had upgraded my cluster from Nautilus to Pacific and switched it to Cephadm,
adopting the services into the new container system.

I wanted to start using CephFS and created it with "sudo ceph fs volume create".

Everything works fine.

Some days ago, the “DAEMON_OLD_VERSION” warning appeared on my dashboard.

All my services are version "16.2.5", but my new MDS containers' image name
is "docker.io/ceph/daemon-base:latest-pacific-devel" with version
"16.2.5-387-g7282d81d".

Why does the MDS use a "devel latest" image? Is it stable? Should I do something about
this warning?

Or maybe this warning is not about MDS and I need to look at some other daemons?

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: S3 Bucket Notification requirement

2021-09-21 Thread Sanjeev Jha
Hi Yuval,

I am stuck on the first step, where I am trying to create an SNS topic but am not able to
create it, and I am not able to figure out the issue. The AMQP server is ready, up and
running with AMQP 0.9.1.


[root@ceprgw01 ~]# aws --endpoint-url http://localhost:8000 sns create-topic 
--name=mytopic  --attributes='{"push-endpoint": "amqp://10.xx.xx.xx:5672", 
"amqp-exchange": "ex1", "amqp-ack-level": "broker"}'

I am getting the error below:
Unknown options: {"push-endpoint": "amqp://10.xx.xx.xx:15672", 
"amqp-ack-level": "broker",  "persistent": "true", "amqp-exchange": 
"topic_logs"}, set-topic-attributes


I am using Ceph 4.2.


[root@ceprgw01 ~]# aws --version
aws-cli/1.14.28 Python/2.7.5 Linux/3.10.0-1160.25.1.el7.x86_64 botocore/1.8.35



Best regards,
Sanjeev


From: Yuval Lifshitz 
Sent: Friday, August 20, 2021 3:40 PM
To: Sanjeev Jha 
Cc: ceph-users@ceph.io 
Subject: Re: [ceph-users] S3 Bucket Notification requirement



On Thu, Aug 19, 2021 at 6:30 PM Sanjeev Jha wrote:
Hi Yuval,

Thanks very much for your reply.

I am using AMQP 0.9.1.

Can I use the aws sns create-topic command to create a topic in Ceph's RadosGW?
Yes. Note that you define the topic on the RGW, not on the RabbitMQ broker, e.g.:

aws --endpoint-url http://<host>:<port> sns create-topic --name=mytopic
--attributes='{"push-endpoint": "amqp://<amqp-host>:5672", "amqp-exchange":
"ex1"}'

(see also: https://github.com/ceph/ceph/tree/master/examples/boto3#aws-cli)

If yes, then how and where do I define notifications to associate with the created
topics? Basically, I want to understand the tasks that need to be defined in Ceph
and in the RabbitMQ broker.
you can use the aws cli to define the topic. a minimal setup to get all 
notifications for object creation would be:

aws --region=default --endpoint-url http://<host>:<port> s3api
put-bucket-notification-configuration --bucket=mybucket 
--notification-configuration='{"TopicConfigurations": [{"Id": "notif1", 
"TopicArn": "arn:aws:sns:default::mytopic", "Events": ["s3:ObjectCreated:*"]}]}'


I have found a few formats like the one below in the developer documentation but am
wondering how/where to define it -- where and how should the code below be defined,
in Ceph or in the AMQP broker?

in ceph (RGW).
these are the formats of the raw HTTP messages used to create the topic. when 
you use the AWS CLI, it crafts these messages and sends them to the RGW.
unless you want to craft these messages yourself (e.g. using CURL) you should 
not really care about them.

For Example:
POST
Action=CreateTopic
&Name=<topic name>
&push-endpoint=<endpoint>
[&amqp-exchange=<exchange>]
...



Java is being used with the RabbitMQ broker, with the AMQP 0.9.1 protocol.

creating topics and notifications is also possible using the AWS Java SDK:
* topics: 
https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/examples-simple-notification-service.html
* notifications:
https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/model/BucketNotificationConfiguration.html

so, if your application wants to create them, the application developers will 
have to write code that does that (sadly, we don't have Java examples like the 
CLI and python ones, but you can use the example from the AWS docs).

if the topics and notifications are more static, you can create them up front 
using the AWS cli, and then you don't need to do anything in your Java code.

Best regards,
Sanjeev



From: Yuval Lifshitz mailto:ylifs...@redhat.com>>
Sent: Thursday, August 19, 2021 8:38 PM
To: Sanjeev Jha mailto:sanjeev_...@hotmail.com>>
Cc: ceph-users@ceph.io 
mailto:ceph-users@ceph.io>>
Subject: Re: [ceph-users] S3 Bucket Notification requirement

Hi Sanjeev,
Welcome to the Ceph community!

Which protocol do you intend to use in ActiveMQ?
If you want to use AMQP1.0, you will have to wait, as this is still not 
officially supported [1].
Currently, we support AMQP0.9.1, Kafka, and HTTP.

As for the more general question.
To make bucket notifications work you first need to define a "topic", with an 
endpoint configured to point to your message broker.
Next, you need to define a "notification" that associates the topic with a 
bucket on which you want to get notifications.
In the "notification" you can define a filter and the events on which you want 
to be notified.

You can find examples of how to do that using the AWS CLI tool (or in python, 
using the boto3 library) here [2].
If you are looking for a different client or client SDK let me know.

More details on the bucket notification feature (without any client specific 
examples) could be found here [3].

Yuval

[1] https://github.com/ceph/ceph/pull/42548
[2] https://github.com/ceph/ceph/tree/master/examples/boto3
[3] https://docs.ceph.com/en/latest/radosgw/notifications/


On Thu, Aug 19, 2021 at 5:10 PM Sanjeev Jha 
mailto:sanjeev_...@hotmail.com>> wrote:
Dear Sir,

I would like to inform you that I am new to Ceph, especially new to the S3 
bucket notification topic, and not able to 

[ceph-users] Re: rocksdb corruption with 16.2.6

2021-09-21 Thread Sven Kieske
On Mo, 2021-09-20 at 10:29 -0500, Mark Nelson wrote:
> At least in one case for us, the user was using consumer grade SSDs 
> without power loss protection.  I don't think we ever fully diagnosed if 
> that was the cause though.  Another case potentially was related to high 
> memory usage on the node.  Hardware errors are a legitimate concern here 
> so probably checking dmesg/smartctl/etc is warranted.  ECC memory 
> obviously helps too (or rather the lack of which makes it more difficult 
> to diagnose).
> 
> 
> For folks that have experienced this, any info you can give related to 
> the HW involved would be helpful.  We (and other projects) have seen 
> similar things over the years but this is a notoriously difficult issue 
> to track down given that it could be any one of many different things 
> and it may or may not be our code.
> 

Hi,

maybe I can help debug this and you can help me too!

We run 14.2.10 in pre-production and I'm fairly confident we hit this bug:

https://tracker.ceph.com/issues/37282

This is an Ubuntu-based ceph-ansible deployment using enterprise SSDs with
power-loss protection.

We see random and rare osd crashes (in ceph crash ls) distributed through our
140 OSD erasure coded cluster.

This is an all flash ssd cluster with metadata on nvme ssd storage.

As I said, these are enterprise ssd from Intel (S4610) and 
Samsung(MZWLL1T6HAJQ).

I already did bluestore fsck (deep) and repair. I see no Hardware Errors at 
all, not even
small issues with SMART etc.

This started happening some time after we upgraded the cluster from 14.2.6 
to 14.2.10, fwiw.

We also pushed the osd_memory_target up somewhat aggressively after that, so I 
feared this might
cause crashes if OSDs die due to OOM (see e.g. this for a
report: 
https://www.mail-archive.com/search?l=ceph-users%40lists.ceph.com=subject:%22%5C%5Bceph%5C-users%5C%5D+OSD+crash+after+change+of+osd_memory_target%22=newest=1
 ).

I'm currently in the process of lowering the osd_memory_target again.
We had no crashes since the beginning of september.
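For completeness, this is roughly how I'm checking and lowering it (the value
below is just the 4 GiB default I'm going back to, not a recommendation):

# what the OSDs currently have configured
ceph config get osd osd_memory_target
# lower it for all OSDs (value in bytes; 4294967296 = 4 GiB)
ceph config set osd osd_memory_target 4294967296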

if you need more information about the past crashes I can provide logs etc. 

-- 
Mit freundlichen Grüßen / Regards

Sven Kieske
Systementwickler / systems engineer
 
 
Mittwald CM Service GmbH & Co. KG
Königsberger Straße 4-6
32339 Espelkamp
 
Tel.: 05772 / 293-900
Fax: 05772 / 293-333
 
https://www.mittwald.de
 
Geschäftsführer: Robert Meyer, Florian Jürgens
 
St.Nr.: 331/5721/1033, USt-IdNr.: DE814773217, HRA 6640, AG Bad Oeynhausen
Komplementärin: Robert Meyer Verwaltungs GmbH, HRB 13260, AG Bad Oeynhausen

Informationen zur Datenverarbeitung im Rahmen unserer Geschäftstätigkeit 
gemäß Art. 13-14 DSGVO sind unter www.mittwald.de/ds abrufbar.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] after upgrade: HEALTH ERR ...'devicehealth' has failed: can't subtract offset-naive and offset-aware datetimes

2021-09-21 Thread Harry G. Coin
A cluster that reported no errors while running 16.2.5 shows, immediately after 
the upgrade to 16.2.6, what seems to be an entirely bug-related, dramatic 
'Health Err' on the dashboard:


Module 'devicehealth' has failed: can't subtract offset-naive and 
offset-aware datetimes


Looking at the bug tracker, others reported this upon upgrade to 
16.2.5, and it 'went away' upon upgrade to 16.2.6.


Echoes of Bilbo passing the ring to Frodo?

It would really be nice not to have to explain a dramatic, scary 
dashboard banner for the months between .6 and .7. Any help?
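
For now I'm considering just muting the health check so the dashboard stops
shouting; I'm going from memory on the check name, so treat this as a sketch:

# mute the failed-module health check until the fix lands (TTL is optional)
ceph health mute MGR_MODULE_ERROR 30d
# and to undo it later
ceph health unmute MGR_MODULE_ERROR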


Thanks

Harry Coin



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] osd marked down

2021-09-21 Thread Abdelillah Asraoui
Hi,

One of the OSDs in the cluster went down. Is there a workaround to bring
this OSD back?


logs from ceph osd pod shows the following:

kubectl -n rook-ceph logs rook-ceph-osd-3-6497bdc65b-pn7mg

debug 2021-09-20T14:32:46.388+ 7f930fe9cf00 -1 auth: unable to find a
keyring on /var/lib/ceph/osd/ceph-3/keyring: (13) Permission denied

debug 2021-09-20T14:32:46.389+ 7f930fe9cf00 -1 auth: unable to find a
keyring on /var/lib/ceph/osd/ceph-3/keyring: (13) Permission denied

debug 2021-09-20T14:32:46.389+ 7f930fe9cf00 -1 auth: unable to find a
keyring on /var/lib/ceph/osd/ceph-3/keyring: (13) Permission denied

debug 2021-09-20T14:32:46.389+ 7f930fe9cf00 -1 monclient: keyring not
found

failed to fetch mon config (--no-mon-config to skip)





kubectl -n rook-ceph describe pod  rook-ceph-osd-3-64





Events:

  Type Reason   Age  From Message

   --     ---

  Normal   Pulled   50m (x749 over 2d16h)kubelet  Container image
"ceph/ceph:v15.2.13" already present on machine

  Warning  BackOff  19s (x18433 over 2d16h)  kubelet  Back-off restarting
failed container



ceph health detail | more

HEALTH_WARN noout flag(s) set; 1 osds down; 1 host (1 osds) down; Degraded
data redundancy: 180969/542907 objects degraded (33.333%), 225 pgs degra

ded, 225 pgs undersized

[WRN] OSDMAP_FLAGS: noout flag(s) set

[WRN] OSD_DOWN: 1 osds down

osd.3 (root=default,host=ab-test) is down

[WRN] OSD_HOST_DOWN: 1 host (1 osds) down

host ab-test-mstr-1-cwan-net (root=default) (1 osds) is down

[WRN] PG_DEGRADED: Degraded data redundancy: 180969/542907 objects degraded
(33.333%), 225 pgs degraded, 225 pgs undersized

pg 3.4d is active+undersized+degraded, acting [2,0]

pg 3.4e is stuck undersized for 3d, current state
active+undersized+degraded, last acting [0,2]


Thanks!
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RocksDB options for HDD, SSD, NVME Mixed productions

2021-09-21 Thread mhnx
Thanks for the explanation. Then the first thing I did wrong was not sizing the
levels to reach the total space. I didn't know that, and I had set:
max_bytes_for_level_base=536870912 and max_bytes_for_level_multiplier=10,
i.e. 536870912*10*10 ≈ 50GB.

I have space on the NVMEs, so I think I can resize the partitions (rough
commands sketched below, after the two options):
1- Set the OSD down
2- Move the partition to the next blocks so it can be resized
3- Resize the DB partition to 60GiB per OSD, i.e. 60GiB * 19 HDDs = 1140GiB
4- Set the OSD up

Also, the other option is:
1- Remove one NVME from the raid1
2- Migrate half of the partitions onto the new, empty NVME
3- Resize the partitions
4- Resize the remaining partitions, or re-create the NVME to get rid of the
degraded NVME pool.
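
Roughly, per OSD, what I picture for option 1 is something like this (only a
sketch; osd.178 and the path are examples from my own setup, and the
partition/LV backing block.db has to be grown first with whatever tool fits
the layout):

ceph osd set noout                    # avoid rebalancing while the OSD is down
systemctl stop ceph-osd@178
# ... grow the partition/LV that backs block.db here ...
ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-178
systemctl start ceph-osd@178
ceph osd unset noout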

It's a lot of hard work, and also your "you need to re-create OSDs for
new RocksDB options" killed my dreams.
Are you sure about this? Why does an OSD restart have no effect on RocksDB options?
Do I really need to re-create all 190 HDDs? Just wow. It will take
decades to be done.






Christian Wuerdig  wrote on Tue, 21 Sept 2021 at 10:15:

> It's been discussed a few times on the list but RocksDB levels essentially
> grow by a factor of 10 (max_bytes_for_level_multiplier) by default and you
> need (level-1)*10 space for the next level on your drive to avoid spill over
> So the sequence (by default) is 256MB -> 2.56GB -> 25.6GB -> 256GB and
> since 50GB < 286 (sum of all levels) you get spill-over going from L2 to
> L3. See also
> https://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref/#sizing
>
> Interestingly your level base seems to be 512MB instead of the default
> 256MB - did you change that? In your case the sequence I would have
> expected is 0.5 -> 5 -> 50 - and you should have already seen spillover at
> 5GB (since you only have 50GB partitions but you need 55.5GB at least)
> Not sure what's up with that. I think you need to re-create OSDs after
> changing these RocksDB params
>
> Overall since Pacific this no longer holds entirely true since RocksDB
> sharding was added (
> https://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref/#bluestore-rocksdb-sharding)
> - it was broken in 16.2.4 but looks like it's fixed in 16.2.6
>
>1. Upgrade to Pacific
>2. Get rid of the NVME raid
>3. Make 160GB DB partitions
>4. Activate RocksDB sharding
>5. Don't worry about RocksDB params
>
> If you don't feel like upgrading to Pacific any time soon but want to make
> more efficient use of the NVME and don't mind going out on a limp I'd still
> do 2+3 plus study
> https://github-wiki-see.page/m/facebook/rocksdb/wiki/Leveled-Compaction
> carefully and make adjustments based on that.
> With 160GB partitions a multiplier of 7 might work well with a base size
> of 350MB 0.35 -> 2.45 -> 17.15 -> 120.05 (total 140GB out of 160GB)
>
> You could also try to switch to a 9x multiplier and re-create one of the
> OSDs to see how it pans out prior to dissolving the raid1 setup (given your
> settings that should result in 0.5 -> 4.5 -> 40.5 GB usage)
>
> On Tue, 21 Sept 2021 at 13:19, mhnx  wrote:
>
>> Hello everyone!
>> I want to understand the concept and tune my rocksDB options on nautilus
>> 14.2.16.
>>
>>  osd.178 spilled over 102 GiB metadata from 'db' device (24 GiB used
>> of
>> 50 GiB) to slow device
>>  osd.180 spilled over 91 GiB metadata from 'db' device (33 GiB used of
>> 50 GiB) to slow device
>>
>> The problem is, I have the spill over warnings like the rest of the
>> community.
>> I tuned RocksDB Options with the settings below but the problem still
>> exists and I wonder if I did anything wrong. I still have the Spill Overs
>> and also some times index SSD's are getting down due to compaction
>> problems
>> and can not start them until I do offline compaction.
>>
>> Let me tell you about my hardware right?
>> Every server in my system has:
>> HDD -   19 x TOSHIBA  MG08SCA16TEY   16.0TB for EC pool.
>> SSD -3 x SAMSUNG  MZILS960HEHP/007 GXL0 960GB
>> NVME - 2 x PM1725b 1.6TB
>>
>> I'm using Raid 1 Nvme for Bluestore DB. I dont have WAL.
>> 19*50GB = 950GB total usage on NVME. (I was thinking use the rest but
>> regret it now)
>>
>> So! Finally let's check my RocksDB Options:
>> [osd]
>> bluefs_buffered_io = true
>> bluestore_rocksdb_options =
>>
>> compression=kNoCompression,max_write_buffer_number=32,min_write_buffer_number_to_merge=2,recycle_log_file_num=32,compaction_style=kCompactionStyleLevel,write_buffer_size=67108864,target_file_size_base=67108864,max_background_compactions=31,level0_file_num_compaction_trigger=8,level0_slowdown_writes_trigger=32,level0_stop_writes_trigger=64,flusher_threads=8,compaction_readahead_size=2MB,compaction_threads=16,
>> *max_bytes_for_level_base=536870912*,
>> *max_bytes_for_level_multiplier=10*
>>
>> *"ceph osd df tree"  *to see ssd and hdd usage, omap and meta.
>>
>> > ID  CLASS WEIGHT REWEIGHT SIZERAW USE DATAOMAPMETA
>> > AVAIL   %USE  VAR  PGS STATUS TYPE NAME
>> > -28280.04810- 280 TiB 169 TiB 166 TiB 688 GiB 2.4 TiB

[ceph-users] Re: [EXTERNAL] RE: OSDs flapping with "_open_alloc loaded 132 GiB in 2930776 extents available 113 GiB"

2021-09-21 Thread Dave Piper
I still can't find a way to get ceph-bluestore-tool working in my containerized 
deployment. As soon as the OSD daemon stops, the contents of 
/var/lib/ceph/osd/ceph- are unreachable. 

I've found this blog post that suggests changes to the container's entrypoint 
are required, but the proposed fix didn't work for me. 
https://blog.cephtips.com/perform-osd-maintenance-in-a-container/  The 
container stays alive, but the OSD process within it has still died, which 
seems to be enough to mask / unmount the files so that the ceph- folder 
appears to be empty to all other processes.

The associated pull request in the ceph-container project suggests setting ` -e 
DEBUG=stayalive` when running the container as an alternative but I see the 
same behaviour when trying this: an empty folder as soon as the OSD process 
crashed. https://github.com/ceph/ceph-container/pull/1605
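
The next thing I plan to try (untested, so very much a sketch, and assuming
the OSDs are ceph-volume lvm OSDs) is a throwaway container with /var/lib/ceph
and /dev mounted, re-activating the OSD's data dir without starting the daemon:

# inside the one-off container, with the OSD daemon stopped:
ceph-volume lvm activate --no-systemd <osd-id> <osd-fsid>   # repopulates the tmpfs dir
ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-<osd-id>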
 



 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Successful Upgrade from 14.2.22 to 15.2.14

2021-09-21 Thread Dan van der Ster
Dear friends,

This morning we upgraded our pre-prod cluster from 14.2.22 to 15.2.14,
successfully, following the procedure at
https://docs.ceph.com/en/latest/releases/octopus/#upgrading-from-mimic-or-nautilus
It's a 400TB cluster which is 10% used with 72 osds (block=hdd,
block.db=ssd) and 40M objects.

* The mons upgraded cleanly as expected.
* One minor surprise was that the mgrs respawned themselves moments
after the leader restarted into octopus:

2021-09-21T10:16:38.992219+0200 mon.cephdwight-mon-1633994557 (mon.0)
16 : cluster [INF] mon.cephdwight-mon-1633994557 is new leader, mons
cephdwight-mon-1633994557,cephdwight-mon-f7df6839c6,cephdwight-mon-d8788e3256
in quorum (ranks 0,1,2)

2021-09-21 10:16:39.046 7fae3caf8700  1 mgr handle_mgr_map respawning
because set of enabled modules changed!

This didn't create any problems AFAICT.

* The osds performed the expected fsck after restarting. Their logs
are spammed with things like

2021-09-21T11:15:23.233+0200 7f85901bd700 -1
bluestore(/var/lib/ceph/osd/ceph-1) fsck warning:
#174:1e024a6e:::10009663a55.:head# has omap that is not
per-pool or pgmeta

but that is fully expected AFAIU. Each osd took just under 10 minutes to fsck:

2021-09-21T11:22:27.188+0200 7f85a3a2bf00  1
bluestore(/var/lib/ceph/osd/ceph-1) _fsck_on_open <<>> with 0
errors, 197756 warnings, 197756 repaired, 0 remaining in 475.083056
seconds

For reference, this cluster was created many major releases ago (maybe
firefly) but osds were probably re-created in luminous.
The memory usage was quite normal, we didn't suffer any OOMs.

* The active mds restarted into octopus without incident.

In summary it was a very smooth upgrade. After a week of observation
we'll proceed with more production clusters.
For our largest S3 cluster with slow HDDs, we expect huge fsck
transactions, so we will wait for https://github.com/ceph/ceph/pull/42958
to be merged before upgrading.
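For that cluster we will probably also defer the conversion at mount time and
trigger it per OSD later, along the lines of:

# skip the automatic quick-fix/fsck at OSD startup during the upgrade window
ceph config set osd bluestore_fsck_quick_fix_on_mount false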

Best Regards, and thanks to all the devs for their work,

Dan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Monitor issue while installation

2021-09-21 Thread Michel Niyoyita
Hello team

I am running a Ceph Pacific cluster deployed using ceph-ansible. I
would like to add more OSDs, but the run fails once it reaches the mon
installation with this fatal error:

 msg: |-
The conditional check 'groups.get(monitoring_group_name, []) | length >
0' failed. The error was: error while evaluating conditional
(groups.get(monitoring_group_name, []) | length > 0):
'monitoring_group_name' is undefined

The error appears to be in '/opt/ceph-ansible/dashboard.yml': line 15,
column 7, but may
be elsewhere in the file depending on the exact syntax problem.

The offending line appears to be:

  pre_tasks:
- name: set ceph node exporter install 'In Progress'
  ^ here

Can someone help me resolve this issue?
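
In case it matters, my inventory has no monitoring group defined. Would simply
defining the variable be enough (guessing here; the path is just my install
directory)?

echo "monitoring_group_name: monitoring" >> /opt/ceph-ansible/group_vars/all.yml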

Michel
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RocksDB options for HDD, SSD, NVME Mixed productions

2021-09-21 Thread Szabo, Istvan (Agoda)
Let me join in: I have 11 BlueFS spillovers in my cluster. Where do these settings 
come from?
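
For reference, this is how I'm counting them on my side:

# number of OSDs currently reporting BlueFS spillover
ceph health detail | grep -c 'spilled over'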

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---

On 2021. Sep 21., at 3:19, mhnx  wrote:



Hello everyone!
I want to understand the concept and tune my rocksDB options on nautilus
14.2.16.

osd.178 spilled over 102 GiB metadata from 'db' device (24 GiB used of
50 GiB) to slow device
osd.180 spilled over 91 GiB metadata from 'db' device (33 GiB used of
50 GiB) to slow device

The problem is, I have the spill over warnings like the rest of the
community.
I tuned RocksDB Options with the settings below but the problem still
exists and I wonder if I did anything wrong. I still have the Spill Overs
and also some times index SSD's are getting down due to compaction problems
and can not start them until I do offline compaction.

Let me tell you about my hardware right?
Every server in my system has:
HDD -   19 x TOSHIBA  MG08SCA16TEY   16.0TB for EC pool.
SSD -3 x SAMSUNG  MZILS960HEHP/007 GXL0 960GB
NVME - 2 x PM1725b 1.6TB

I'm using Raid 1 Nvme for Bluestore DB. I dont have WAL.
19*50GB = 950GB total usage on NVME. (I was thinking use the rest but
regret it now)

So! Finally let's check my RocksDB Options:
[osd]
bluefs_buffered_io = true
bluestore_rocksdb_options =
compression=kNoCompression,max_write_buffer_number=32,min_write_buffer_number_to_merge=2,recycle_log_file_num=32,compaction_style=kCompactionStyleLevel,write_buffer_size=67108864,target_file_size_base=67108864,max_background_compactions=31,level0_file_num_compaction_trigger=8,level0_slowdown_writes_trigger=32,level0_stop_writes_trigger=64,flusher_threads=8,compaction_readahead_size=2MB,compaction_threads=16,
*max_bytes_for_level_base=536870912*,
*max_bytes_for_level_multiplier=10*

*"ceph osd df tree"  *to see ssd and hdd usage, omap and meta.

ID  CLASS WEIGHT REWEIGHT SIZERAW USE DATAOMAPMETA
AVAIL   %USE  VAR  PGS STATUS TYPE NAME
-28280.04810- 280 TiB 169 TiB 166 TiB 688 GiB 2.4 TiB 111
TiB 60.40 1.00   -host MHNX1
178   hdd   14.60149  1.0  15 TiB 8.6 TiB 8.5 TiB  44 KiB 126 GiB 6.0
TiB 59.21 0.98 174 up osd.178
179   ssd0.87329  1.0 894 GiB 415 GiB  89 GiB 321 GiB 5.4 GiB 479
GiB 46.46 0.77 104 up osd.179


I know the size of NVME is not suitable for 16TB HDD's. I should have more
but the expense is cutting us pieces. Because of that I think I'll see the
spill overs no matter what I do. But maybe I will make it better
with your help!

*My questions are:*
1- What is the meaning of (33 GiB used of 50 GiB)
2- Why it's not 50GiB / 50GiB ?
3- Do I have 17GiB unused area on the DB partition?
4- Is there anything wrong with my Rocksdb options?
5- How can I be sure and find the good Rocksdb Options for Ceph?
6- How can I measure the change and test it?
7- Do I need different RocksDB options for HDD's and SSD's ?
8- If I stop using Nvme Raid1 to gain x2 size and resize the DB's  to
160GiB. Is it worth to take Nvme faulty? Because I will lose 10HDD at the
same time but I have 10 Node and that's only %5 of the EC data . I use m=8
k=2.

P.S: There are so many people asking and searching around this. I hope it
will work this time.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RocksDB options for HDD, SSD, NVME Mixed productions

2021-09-21 Thread Christian Wuerdig
 It's been discussed a few times on the list but RocksDB levels essentially
grow by a factor of 10 (max_bytes_for_level_multiplier) by default and you
need (level-1)*10 space for the next level on your drive to avoid spill over
So the sequence (by default) is 256MB -> 2.56GB -> 25.6GB -> 256GB and
since 50GB < 286 (sum of all levels) you get spill-over going from L2 to
L3. See also
https://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref/#sizing

Interestingly your level base seems to be 512MB instead of the default
256MB - did you change that? In your case the sequence I would have
expected is 0.5 -> 5 -> 50 - and you should have already seen spillover at
5GB (since you only have 50GB partitions but you need 55.5GB at least)
Not sure what's up with that. I think you need to re-create OSDs after
changing these RocksDB params

Overall since Pacific this no longer holds entirely true since RocksDB
sharding was added (
https://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref/#bluestore-rocksdb-sharding)
- it was broken in 16.2.4 but looks like it's fixed in 16.2.6

   1. Upgrade to Pacific
   2. Get rid of the NVME raid
   3. Make 160GB DB partitions
   4. Activate RocksDB sharding
   5. Don't worry about RocksDB params

If you don't feel like upgrading to Pacific any time soon but want to make
more efficient use of the NVME and don't mind going out on a limp I'd still
do 2+3 plus study
https://github-wiki-see.page/m/facebook/rocksdb/wiki/Leveled-Compaction
carefully and make adjustments based on that.
With 160GB partitions a multiplier of 7 might work well with a base size of
350MB: 0.35 -> 2.45 -> 17.15 -> 120.05 GB (total ~140GB out of 160GB)
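
Expressed as config, that would look roughly like this - a sketch only, keep
the rest of your existing bluestore_rocksdb_options string as it is (350MB =
367001600 bytes):

[osd]
# only the two values under discussion shown; splice them into your full string
bluestore_rocksdb_options = ...,max_bytes_for_level_base=367001600,max_bytes_for_level_multiplier=7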

You could also try to switch to a 9x multiplier and re-create one of the
OSDs to see how it pans out prior to dissolving the raid1 setup (given your
settings that should result in 0.5 -> 4.5 -> 40.5 GB usage)

On Tue, 21 Sept 2021 at 13:19, mhnx  wrote:

> Hello everyone!
> I want to understand the concept and tune my rocksDB options on nautilus
> 14.2.16.
>
>  osd.178 spilled over 102 GiB metadata from 'db' device (24 GiB used of
> 50 GiB) to slow device
>  osd.180 spilled over 91 GiB metadata from 'db' device (33 GiB used of
> 50 GiB) to slow device
>
> The problem is, I have the spill over warnings like the rest of the
> community.
> I tuned RocksDB Options with the settings below but the problem still
> exists and I wonder if I did anything wrong. I still have the Spill Overs
> and also some times index SSD's are getting down due to compaction problems
> and can not start them until I do offline compaction.
>
> Let me tell you about my hardware right?
> Every server in my system has:
> HDD -   19 x TOSHIBA  MG08SCA16TEY   16.0TB for EC pool.
> SSD -3 x SAMSUNG  MZILS960HEHP/007 GXL0 960GB
> NVME - 2 x PM1725b 1.6TB
>
> I'm using Raid 1 Nvme for Bluestore DB. I dont have WAL.
> 19*50GB = 950GB total usage on NVME. (I was thinking use the rest but
> regret it now)
>
> So! Finally let's check my RocksDB Options:
> [osd]
> bluefs_buffered_io = true
> bluestore_rocksdb_options =
>
> compression=kNoCompression,max_write_buffer_number=32,min_write_buffer_number_to_merge=2,recycle_log_file_num=32,compaction_style=kCompactionStyleLevel,write_buffer_size=67108864,target_file_size_base=67108864,max_background_compactions=31,level0_file_num_compaction_trigger=8,level0_slowdown_writes_trigger=32,level0_stop_writes_trigger=64,flusher_threads=8,compaction_readahead_size=2MB,compaction_threads=16,
> *max_bytes_for_level_base=536870912*,
> *max_bytes_for_level_multiplier=10*
>
> *"ceph osd df tree"  *to see ssd and hdd usage, omap and meta.
>
> > ID  CLASS WEIGHT REWEIGHT SIZERAW USE DATAOMAPMETA
> > AVAIL   %USE  VAR  PGS STATUS TYPE NAME
> > -28280.04810- 280 TiB 169 TiB 166 TiB 688 GiB 2.4 TiB 111
> > TiB 60.40 1.00   -host MHNX1
> > 178   hdd   14.60149  1.0  15 TiB 8.6 TiB 8.5 TiB  44 KiB 126 GiB 6.0
> > TiB 59.21 0.98 174 up osd.178
> > 179   ssd0.87329  1.0 894 GiB 415 GiB  89 GiB 321 GiB 5.4 GiB 479
> > GiB 46.46 0.77 104 up osd.179
>
>
> I know the size of NVME is not suitable for 16TB HDD's. I should have more
> but the expense is cutting us pieces. Because of that I think I'll see the
> spill overs no matter what I do. But maybe I will make it better
> with your help!
>
> *My questions are:*
> 1- What is the meaning of (33 GiB used of 50 GiB)
> 2- Why it's not 50GiB / 50GiB ?
> 3- Do I have 17GiB unused area on the DB partition?
> 4- Is there anything wrong with my Rocksdb options?
> 5- How can I be sure and find the good Rocksdb Options for Ceph?
> 6- How can I measure the change and test it?
> 7- Do I need different RocksDB options for HDD's and SSD's ?
> 8- If I stop using Nvme Raid1 to gain x2 size and resize the DB's  to
> 160GiB. Is it worth to take Nvme faulty? Because I will lose 10HDD at the
> same time but I have 10 Node and that's only 

[ceph-users] Re: [EXTERNAL] Re: OSDs flapping with "_open_alloc loaded 132 GiB in 2930776 extents available 113 GiB"

2021-09-21 Thread Janne Johansson
On Mon, 20 Sep 2021 at 18:02, Dave Piper  wrote:
> Okay - I've finally got full debug logs from the flapping OSDs. The raw logs 
> are both 100M each - I can email them directly if necessary. (Igor I've 
> already sent these your way.)
> Both flapping OSDs are reporting the same "bluefs _allocate failed to 
> allocate" errors as before.  I've also noticed additional errors about 
> corrupt blocks which I haven't noticed previously.  E.g.
> 2021-09-08T10:42:13.316+ 7f705c4f2f00  3 rocksdb: 
> [table/block_based_table_reader.cc:1117] Encountered error while reading data 
> from compression dictionary block Corruption: block checksum mismatch: 
> expected 0, got 2324967111  in db/501397.sst offset 18446744073709551615 size 
> 18446744073709551615

Those 18446744073709551615 numbers are -1 (or the largest 64bit int),
so something makes the numbers wrap around below zero.
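Easy to reproduce the pattern:

# -1 reinterpreted as an unsigned 64-bit integer
printf '%u\n' -1
# prints 18446744073709551615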
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: rocksdb corruption with 16.2.6

2021-09-21 Thread Andrej Filipcic


Hi,

Some further investigation on the failed OSDs:

1 out of 8 OSDs actually has a hardware issue:

[16841006.029332] sd 0:0:10:0: [sdj] tag#96 FAILED Result: 
hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=2s
[16841006.037917] sd 0:0:10:0: [sdj] tag#34 FAILED Result: 
hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=2s
[16841006.047558] sd 0:0:10:0: [sdj] tag#96 Sense Key : Medium Error 
[current]
[16841006.057647] sd 0:0:10:0: [sdj] tag#34 CDB: Read(16) 88 00 00 00 00 
00 00 07 e7 70 00 00 00 10 00 00
[16841006.064693] sd 0:0:10:0: [sdj] tag#96 Add. Sense: Unrecovered read 
error
[16841006.073988] blk_update_request: I/O error, dev sdj, sector 518000 
op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[16841006.080949] sd 0:0:10:0: [sdj] tag#96 CDB: Read(16) 88 00 00 00 00 
00 0b 95 d9 80 00 00 00 08 00 00


smartctl:
Error 23 occurred at disk power-on lifetime: 6105 hours (254 days + 9 hours)
  When the command that caused the error occurred, the device was 
active or idle.


  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 80 d9 95 0b  Error: UNC at LBA = 0x0b95d980 = 194369920

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --    
  60 00 10 70 e7 07 40 00  14d+02:46:05.704  READ FPDMA QUEUED
  60 00 08 ff ff ff 4f 00  14d+02:46:05.703  READ FPDMA QUEUED
  60 00 08 ff ff ff 4f 00  14d+02:46:05.703  READ FPDMA QUEUED
  60 00 08 ff ff ff 4f 00  14d+02:46:05.703  READ FPDMA QUEUED
  60 00 08 ff ff ff 4f 00  14d+02:46:05.703  READ FPDMA QUEUED

so, let's say, this might be a hardware fault, though the drive appears to be 
working fine.


But the other 7 show no hardware-related issues. The HDDs are Seagate Exos 
X16, enterprise grade, and the servers are Supermicro SSG-6029P-E1CR24L-AT059 
with ECC. There are no CPU or memory errors logged in the past months on 
the servers, which have been up for ~200 days. So a hardware fault is unlikely.


Is there something else that could be checked? I have left one OSD 
intact, so it can be checked further.
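
If it helps the debugging, I can dump the RocksDB files from that intact
(stopped) OSD with something like this - paths are examples:

# export the BlueFS contents (the RocksDB .sst/.log files) for offline inspection
ceph-bluestore-tool bluefs-export --path /var/lib/ceph/osd/ceph-<id> --out-dir /tmp/osd-<id>-bluefs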


Best regards,
Andrej

On 20/09/2021 17:09, Neha Ojha wrote:

Can we please create a bluestore tracker issue for this
(if one does not exist already), where we can start capturing all the
relevant information needed to debug this? Given that this has been
encountered in previous 16.2.* versions, it doesn't sound like a
regression in 16.2.6 to me, rather an issue in pacific. In any case,
we'll prioritize fixing it.

Thanks,
Neha

On Mon, Sep 20, 2021 at 8:03 AM Andrej Filipcic  wrote:

On 20/09/2021 16:02, David Orman wrote:

Same question here, for clarity, was this on upgrading to 16.2.6 from
16.2.5? Or upgrading
from some other release?

from 16.2.5. but the OSD services were never restarted after upgrade to
.5, so it could be a leftover of previous issues.

Cheers,
Andrej

On Mon, Sep 20, 2021 at 8:57 AM Sean  wrote:

   I also ran into this with v16. In my case, trying to run a repair totally
exhausted the RAM on the box, and was unable to complete.

After removing/recreating the OSD, I did notice that it has a drastically
   smaller OMAP size than the other OSDs. I don’t know if that actually means
anything, but just wanted to mention it in case it does.

ID   CLASS  WEIGHT REWEIGHT  SIZE RAW USE  DATA OMAP META
AVAIL%USE   VAR   PGS  STATUS  TYPE NAME
14   hdd10.91409   1.0   11 TiB  3.3 TiB  3.2 TiB  4.6 MiB  5.4 GiB
   7.7 TiB  29.81  1.02   34  up    osd.14
16   hdd10.91409   1.0   11 TiB  3.3 TiB  3.3 TiB   20 KiB  9.4 GiB
   7.6 TiB  30.03  1.03   35  up    osd.16

~ Sean


On Sep 20, 2021 at 8:27:39 AM, Paul Mezzanini  wrote:


I got the exact same error on one of my OSDs when upgrading to 16.  I
used it as an exercise in trying to fix a corrupt RocksDB. I spent a few
days of poking with no success.  I got mostly tool crashes like you are
seeing with no forward progress.

I eventually just gave up, purged the OSD, did a smart long test on the
drive to be sure and then threw it back in the mix.  Been HEALTH OK for
a week now after it finished refilling the drive.


On 9/19/21 10:47 AM, Andrej Filipcic wrote:

2021-09-19T15:47:13.610+0200 7f8bc1f0e700  2 rocksdb:

[db_impl/db_impl_compaction_flush.cc:2344] Waiting after background

compaction error: Corruption: block checksum mismatch: expected

2427092066, got 4051549320  in db/251935.sst offset 18414386 size

4032, Accumulated background error counts: 1

2021-09-19T15:47:13.636+0200 7f8bbacf1700 -1 rocksdb: submit_common

error: Corruption: block checksum mismatch: expected 2427092066, got

4051549320  in db/251935.sst offset 18414386 size 4032 code = 2

Rocksdb transaction:

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io