[ceph-users] Previously synced bucket resharded after sync removed

2023-11-20 Thread Szabo, Istvan (Agoda)
Hi,

I had a multisite bucket which I removed from sync completely and then resharded
on the master zone, which was successful.

On the 2nd site I can't list anything inside that bucket anymore (which was
expected and is okay); the issue is how I can delete the data there.
There was 50TB of data which I'd like to clean up, but at the moment everything
shows 0: if I check the user space usage or the bucket space usage, it's all 0.
However, I'm sure the data is still there, because the used space in my cluster
is still the same as before.
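
For reference, the space-usage checks mentioned above are roughly the following
(bucket and user names are placeholders, not my actual values):

# per-bucket size and object counts as RGW sees them
radosgw-admin bucket stats --bucket=mybucket
# per-user space usage, refreshed from the bucket indexes
radosgw-admin user stats --uid=myuser --sync-stats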

Thank you


This message is confidential and is for the sole use of the intended 
recipient(s). It may also be privileged or otherwise protected by copyright or 
other legal rules. If you have received it by mistake please let us know by 
reply email and delete it from your system. It is prohibited to copy this 
message or disclose its content to anyone. Any confidentiality or privilege is 
not waived or lost by any mistaken delivery or unauthorized disclosure of the 
message. All messages sent to and from Agoda may be monitored to ensure 
compliance with company policies, to protect the company's interests and to 
remove potential malware. Electronic messages may be intercepted, amended, lost 
or deleted, or contain viruses.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CFP closing soon: Everything Open 2024 (Gladstone, Queensland, Australia, April 16-18)

2023-11-20 Thread Tim Serong

Update: the CFP has been extended 'til November 30 (see
http://lists.linux.org.au/pipermail/eo-announce/2023-November/11.html)

On 11/17/23 14:09, Tim Serong wrote:
Everything Open (auspiced by Linux Australia) is happening again in 
2024.  The CFP closes at the end of this weekend (November 19):


   https://2024.everythingopen.au/programme/proposals/

More details below.

 Forwarded Message 
Date: Sun, 15 Oct 2023 09:16:31 +1000
From: Everything Open 
To: eo-annou...@lists.linux.org.au, annou...@lists.linux.org.au
Subject: [Announce] Everything Open 2024: Call for Sessions Now Open
User-Agent: Roundcube Webmail/1.1.1

Submit your session proposals today - the Everything Open 2024 Call for
Sessions is now open.


## Call for Sessions

We invite you to submit a session proposal on a topic you are familiar 
with via our proposals portal at 
https://2024.everythingopen.au/programme/proposals/.
The Call for Sessions will remain open until 11:59pm on Sunday 19 
November 2023 anywhere on earth (AoE).


There will be multiple streams catering for a wide range of interest 
areas across the many facets of open technology, including Linux, open 
source software, open hardware, standards, formats and documentation, 
and our communities.
In keeping with the conference’s aim to be inclusive to all community 
members, presentations can be aimed at any level, ranging from technical 
deep-dives through to beginner and intermediate level presentations for 
those who are newer to the subject.


There will be two types of sessions at Everything Open: talks and 
tutorials. Talks will nominally be 45 minutes long on a single topic 
presented in lecture format. We will also have a few short talk slots of 
25 minutes available, which are perfect for people new to presenting at 
a conference. Tutorials are interactive and hands-on in nature, 
presented in classroom format.
Each accepted session will receive one Professional level ticket to 
attend the conference.


The Session Selection Committee is looking forward to reading your
submissions. We would also like to thank them for coming together and
volunteering their time to help put this conference together.


## Sponsor Early

As usual, we have a range of sponsorship opportunities available, for 
the conference overall as well as the ability to contribute towards 
specific parts of the event.
We encourage you to sponsor the conference early, to get the maximum 
promotion during the lead up to the event.

If you or your organisation is interested in sponsoring Everything Open,
please get in touch via 
https://2024.everythingopen.au/sponsors/prospectus/.



Read this online at
https://2024.everythingopen.au/news/call-for-sessions-open/


___
announce mailing list
annou...@lists.linux.org.au
http://lists.linux.org.au/mailman/listinfo/announce

- End forwarded message -
___
Dev mailing list -- d...@ceph.io
To unsubscribe send an email to dev-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Why is min_size of erasure pools set to k+1

2023-11-20 Thread Wesley Dillingham
" if min_size is k and you lose an OSD during recovery after a failure of m
OSDs, data will become unavailable"

In that situation the data wouldn't just become unavailable, it would be lost.

Having a min_size of k+1 provides a buffer between data being
active+writeable and data being lost. That in-between state is called inactive.

By having that buffer you prevent the situation of data being
written to the PG when you are only one disk/shard away from data loss.

Imagine the scenario of a 4+2 pool with a min_size of 4. The cluster is 6 servers
filled with OSDs.

You have brought 2 servers down for maintenance (not a good idea but this
is an example). Your PGs are all degraded with only 4 shards of clean data
but active because k=min_size. Data is being written to the pool.

As you are booting your 2 servers up out of maintenance an OSD/disk on
another server fails and fails hard. Because that OSD was part of the
acting set the cluster only wrote four shards and now one is lost.

You only have 3 shards of data in a 4+2 and now some subset of data is lost.

Now imagine a 4+2 with min_size = 5.

You wouldn't bring down more than 1 host, because "ceph osd ok-to-stop" would
return false if you tried to bring down more than 1 host for maintenance.

Let's say you did bring down two hosts against the advice of the ok-to-stop
command: your PGs would become inactive and so they wouldn't accept
writes. Once you boot your 2 servers back, the cluster heals.

Let's say you heed the advice of ok-to-stop and only bring 1 host down for
maintenance at a time. Your data is degraded with 5/6 shards healthy. New
data is being written with 5 shards able to be written out.

As you are booting your server out of maintenance, an OSD on another host
dies and those shards are lost forever. The PGs from that lost OSD now have
4 healthy shards. That is enough shards to recover the data from (though
you would have some PGs inactive for a bit until recovery finishes).

Hope this helps to answer the min_size question a bit.
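
As an illustration of where this is checked and set (the pool name is a
placeholder, values follow the 4+2 example above):

# inspect the EC profile and the current min_size of a pool
ceph osd pool get ecpool erasure_code_profile
ceph osd pool get ecpool min_size
# for k=4, m=2 the recommended value is k+1
ceph osd pool set ecpool min_size 5
# and before taking OSDs/hosts down for maintenance:
ceph osd ok-to-stop 12 13 14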

Respectfully,

*Wes Dillingham*
w...@wesdillingham.com
LinkedIn 


On Mon, Nov 20, 2023 at 2:03 PM Vladimir Brik <
vladimir.b...@icecube.wisc.edu> wrote:

> Could someone help me understand why it's a bad idea to set min_size of
> erasure-coded pools to k?
>
> From what I've read, the argument for k+1 is that if min_size is k and you
> lose an OSD during recovery after a failure of m OSDs, data will become
> unavailable. But how does setting min_size to k+1 help? If m=2, if you
> experience a double failure followed by another failure during recovery you
> still lost 3 OSDs and therefore your data because the pool wasn't set up to
> handle 3 concurrent failures, and the value of min_size is irrelevant.
>
> https://github.com/ceph/ceph/pull/8008 mentions inability to peer if
> min_size = k, but I don't understand why. Does that mean that if min_size=k
> and I lose m OSDs, and then an OSD is restarted during recovery, PGs will
> not peer even after the restarted OSD comes back online?
>
>
> Vlad
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Why is min_size of erasure pools set to k+1

2023-11-20 Thread Vladimir Brik
Could someone help me understand why it's a bad idea to set min_size of
erasure-coded pools to k?

From what I've read, the argument for k+1 is that if min_size is k and you
lose an OSD during recovery after a failure of m OSDs, data will become
unavailable. But how does setting min_size to k+1 help? If m=2, if you
experience a double failure followed by another failure during recovery you
still lost 3 OSDs and therefore your data because the pool wasn't set up to
handle 3 concurrent failures, and the value of min_size is irrelevant.

https://github.com/ceph/ceph/pull/8008 mentions inability to peer if
min_size = k, but I don't understand why. Does that mean that if min_size=k
and I lose m OSDs, and then an OSD is restarted during recovery, PGs will
not peer even after the restarted OSD comes back online?


Vlad
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Bug fixes in 17.2.7

2023-11-20 Thread Konstantin Shalygin
Hi,

> On Nov 20, 2023, at 19:24, Tobias Kulschewski  
> wrote:
> 
> do you have a rough estimate of when this will happen?
> 
> 

Not this year, I think. For now the precedence is the 18.2.1 release and the
last release of Pacific.
But you can request a shaman build and clone the repo for your local usage.


k
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Bug fixes in 17.2.7

2023-11-20 Thread Konstantin Shalygin
Hi Tobias,

This has not been merged to Quincy yet [1].

k

[1] https://tracker.ceph.com/issues/59730
Sent from my iPhone

> On Nov 20, 2023, at 17:50, Tobias Kulschewski  
> wrote:
> 
> Just wanted to ask, if the bug with the multipart upload [1] has been fixed 
> in 17.2.7?
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] After hardware failure tried to recover ceph and followed instructions for recovery using OSDS

2023-11-20 Thread Manolis Daramas
Hello everyone,

We had a recent power failure on a server which hosts a 3-node Ceph cluster
(Ubuntu 20.04, Ceph version 17.2.7) and we think that we may have lost
some of our data, if not all of it.

We have followed the instructions on 
https://docs.ceph.com/en/reef/rados/troubleshooting/troubleshooting-mon/#recovery-using-osds
 but with no luck.

We kept a backup of the store.db folder on all 3 nodes prior to the steps below.

We have stopped ceph.target on all 3 nodes.

We ran the first part of the script, altered according to our configuration:

ms=/root/mon-store
mkdir $ms

hosts="node01 node02 node03"
# collect the cluster map from stopped OSDs
for host in $hosts; do
  rsync -avz $ms/. root@$host:$ms.remote
  rm -rf $ms
  ssh root@$host <=5.
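
The script appears to have been cut off above; for reference, the collection loop
from the linked documentation looks roughly like this (paths and the rebuild step
are the documented defaults, not necessarily our altered version):

ms=/root/mon-store
mkdir $ms
hosts="node01 node02 node03"
# collect the cluster map from the stopped OSDs on every host
for host in $hosts; do
  rsync -avz $ms/. root@$host:$ms.remote
  rm -rf $ms
  ssh root@$host <<'EOF'
    for osd in /var/lib/ceph/osd/ceph-*; do
      ceph-objectstore-tool --data-path $osd --no-mon-config \
        --op update-mon-db --mon-store-path /root/mon-store.remote
    done
EOF
  rsync -avz root@$host:$ms.remote/. $ms
done
# then rebuild the monitor store from the collected maps
ceph-monstore-tool $ms rebuild -- --keyring /path/to/admin.keyring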



2023-11-17T12:26:24.160+0200 7f482b393600  4 rocksdb: EVENT_LOG_v1 
{"time_micros": 1700216784163944, "cf_name": "default", "job": 1, "event": 
"table_file_creation", "file_number": 86, "file_size": 1266, "file_checksum": 
"", "file_checksum_func_name": "Unknown", "table_properties": {"data_size": 
238, "index_size": 40, "index_partitions": 0, "top_level_index_size": 0, 
"index_key_is_user_key": 1, "index_value_is_delta_encoded": 1, "filter_size": 
69, "raw_key_size": 72, "raw_average_key_size": 24, "raw_value_size": 148, 
"raw_average_value_size": 49, "num_data_blocks": 1, "num_entries": 3, 
"num_deletions": 0, "num_merge_operands": 0, "num_range_deletions": 0, 
"format_version": 0, "fixed_key_len": 0, "filter_policy": 
"rocksdb.BuiltinBloomFilter", "column_family_name": "default", 
"column_family_id": 0, "comparator": "leveldb.BytewiseComparator", 
"merge_operator": "", "prefix_extractor_name": "nullptr", 
"property_collectors": "[]", "compression": "NoCompression", 
"compression_options": "wind
 ow_bits=-14; level=32767; strategy=0; max_dict_bytes=0; 
zstd_max_train_bytes=0; enabled=0; ", "creation_time": 1700216784, 
"oldest_key_time": 0, "file_creation_time": 0, "db_id": 
"53025a24-2059-43e1-a0f7-a87a28e33d38", "db_session_id": 
"OS2T69IQ02SU5OKHBI40"}}



2023-11-17T12:26:24.160+0200 7f482b393600  4 rocksdb: [db/version_set.cc:4082] 
Creating manifest 87





2023-11-17T12:26:24.160+0200 7f482b393600  4 rocksdb: EVENT_LOG_v1 
{"time_micros": 1700216784166273, "job": 1, "event": "recovery_finished"}



2023-11-17T12:26:24.160+0200 7f482b393600  4 rocksdb: [db/column_family.cc:983] 
[default] Increasing compaction threads because we have 14 level-0 files



2023-11-17T12:26:24.160+0200 7f482b393600  4 rocksdb: 
[file/delete_scheduler.cc:69] Deleted file /root/mon-store/store.db/82.log 
immediately, rate_bytes_per_sec 0, total_trash_size 0 max_trash_db_ratio 
0.25



2023-11-17T12:26:24.164+0200 7f482b393600  4 rocksdb: 
[db/db_impl/db_impl_open.cc:1700] SstFileManager instance 0x56017d230700



2023-11-17T12:26:24.164+0200 7f482b393600  4 rocksdb: DB pointer 0x56017df56000



adding auth for 'client.admin': 
auth(key=AQCsdUViHYjTGBAAf7/1KYZjb0h3x3EOywqbbQ==) with caps({mds=allow 
*,mgr=allow *,mon=allow *,osd=allow *})

2023-11-17T12:26:24.164+0200 7f482a349700  4 rocksdb: 
[db/compaction/compaction_job.cc:1881] [default] [JOB 3] Compacting 14@0 files 
to L6, score 3.50



2023-11-17T12:26:24.164+0200 7f482a349700  4 rocksdb: 
[db/compaction/compaction_job.cc:1887] [default] Compaction start summary: Base 
version 3 Base level 0, inputs: [86(1266B) 80(1266B) 74(1267B) 68(1267B) 
62(1266B) 56(1265B) 50(1265B) 44(1265B) 38(1265B) 32(1266B) 26(1265B) 20(1265B) 
14(283KB) 8(7387KB)]





2023-11-17T12:26:24.164+0200 7f482a349700  4 rocksdb: EVENT_LOG_v1 
{"time_micros": 1700216784169200, "job": 3, "event": "compaction_started", 
"compaction_reason": "LevelL0FilesNum", "files_L0": [86, 80, 74, 68, 62, 56, 
50, 44, 38, 32, 26, 20, 14, 8], "score": 3.5, "input_data_size": 7870219}



2023-11-17T12:26:24.164+0200 7f4822339700  4 rocksdb: 
[db/db_impl/db_impl.cc:901] --- DUMPING STATS ---



2023-11-17T12:26:24.164+0200 7f4822339700  4 rocksdb: 
[db/db_impl/db_impl.cc:903]

** DB Stats **

Uptime(secs): 0.0 total, 0.0 interval

Cumulative writes: 0 writes, 0 keys, 0 commit groups, 0.0 writes per commit 
group, ingest: 0.00 GB, 0.00 MB/s

Cumulative WAL: 0 writes, 0 syncs, 0.00 writes per sync, written: 0.00 GB, 0.00 
MB/s

Cumulative stall: 00:00:0.000 H:M:S, 0.0 percent

Interval writes: 0 writes, 0 keys, 0 commit groups, 0.0 writes per commit 
group, ingest: 0.00 MB, 0.00 MB/s

Interval WAL: 0 writes, 0 syncs, 0.00 writes per sync, written: 0.00 MB, 0.00 
MB/s

Interval stall: 00:00:0.000 H:M:S, 0.0 percent



** Compaction Stats [default] **

LevelFiles   Size Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) 
Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) 
Avg(sec) KeyIn KeyDrop



  L0 14/14   7.51 MB   0.0  

[ceph-users] Bug fixes in 17.2.7

2023-11-20 Thread Tobias Kulschewski

Hi guys,

thank you for releasing 17.2.7!

Just wanted to ask if the bug with the multipart upload [1] has been
fixed in 17.2.7?


When are you planning on fixing this bug?

Best, Tobias

[1] https://tracker.ceph.com/issues/58879




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: blustore osd nearfull but no pgs on it

2023-11-20 Thread Wesley Dillingham
The large amount of osdmaps is what I was suspecting. "ceph tell osd.158
status" (or any OSD other than 158) would show us how many osdmaps the OSDs
are currently holding on to.
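
For illustration, the osdmap range an OSD keeps can be read like this (the osd id
and the jq filter are just examples):

# "oldest_map" / "newest_map" in the output show the range of osdmaps still held
ceph tell osd.158 status
# cluster-wide committed osdmap range (requires jq)
ceph report 2>/dev/null | jq '.osdmap_first_committed, .osdmap_last_committed'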

Respectfully,

*Wes Dillingham*
w...@wesdillingham.com
LinkedIn 


On Mon, Nov 20, 2023 at 6:15 AM Debian  wrote:

> Hi,
>
> yes all of my small osds are affected
>
> i found the issue, my cluster is healthy and my rebalance finished - i
> have only to wait that my old osdmaps get cleaned up.
>
> like in the thread "Disks are filling up even if there is not a single
> placement group on them"
>
> thx!
>
> On 20.11.23 11:36, Eugen Block wrote:
> > You provide only a few details at a time, it would help to get a full
> > picture if you provided the output Wesley asked for (ceph df detail,
> > ceph tell osd.158 status, ceph osd df tree). Is osd.149 now the
> > problematic one or did you just add output from a different osd?
> > It's not really clear what you're doing without the necessary context.
> > You can just add the 'ceph daemon osd.{OSD} perf dump' output here or
> > in some pastebin.
> >
> > Zitat von Debian :
> >
> >> Hi,
> >>
> >> the block.db size ist default and not custom configured:
> >>
> >> current:
> >>
> >> bluefs.db_used_bytes: 9602859008
> >> bluefs.db_used_bytes: 469434368
> >>
> >> ceph daemon osd.149 config show
> >>
> >> "bluestore_bitmapallocator_span_size": "1024",
> >> "bluestore_block_db_size": "0",
> >> "bluestore_block_size": "107374182400",
> >> "bluestore_block_wal_size": "100663296",
> >> "bluestore_cache_size": "0",
> >> "bluestore_cache_size_hdd": "1073741824",
> >> "bluestore_cache_size_ssd": "3221225472",
> >> "bluestore_compression_max_blob_size": "0",
> >> "bluestore_compression_max_blob_size_hdd": "524288",
> >> "bluestore_compression_max_blob_size_ssd": "65536",
> >> "bluestore_compression_min_blob_size": "0",
> >> "bluestore_compression_min_blob_size_hdd": "131072",
> >> "bluestore_compression_min_blob_size_ssd": "8192",
> >> "bluestore_extent_map_inline_shard_prealloc_size": "256",
> >> "bluestore_extent_map_shard_max_size": "1200",
> >> "bluestore_extent_map_shard_min_size": "150",
> >> "bluestore_extent_map_shard_target_size": "500",
> >> "bluestore_extent_map_shard_target_size_slop": "0.20",
> >> "bluestore_max_alloc_size": "0",
> >> "bluestore_max_blob_size": "0",
> >> "bluestore_max_blob_size_hdd": "524288",
> >> "bluestore_max_blob_size_ssd": "65536",
> >> "bluestore_min_alloc_size": "0",
> >> "bluestore_min_alloc_size_hdd": "65536",
> >> "bluestore_min_alloc_size_ssd": "4096",
> >> "bluestore_prefer_deferred_size": "0",
> >> "bluestore_prefer_deferred_size_hdd": "32768",
> >> "bluestore_prefer_deferred_size_ssd": "0",
> >> "bluestore_rocksdb_options":
> >>
> "compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=1,recycle_log_file_num=4,write_buffer_size=268435456,writable_file_max_buffer_size=0,compaction_readahead_size=2097152,max_background_compactions=2",
> >>
> >> "bluefs_alloc_size": "1048576",
> >> "bluefs_allocator": "hybrid",
> >> "bluefs_buffered_io": "false",
> >> "bluefs_check_for_zeros": "false",
> >> "bluefs_compact_log_sync": "false",
> >> "bluefs_log_compact_min_ratio": "5.00",
> >> "bluefs_log_compact_min_size": "16777216",
> >> "bluefs_max_log_runway": "4194304",
> >> "bluefs_max_prefetch": "1048576",
> >> "bluefs_min_flush_size": "524288",
> >> "bluefs_min_log_runway": "1048576",
> >> "bluefs_preextend_wal_files": "false",
> >> "bluefs_replay_recovery": "false",
> >> "bluefs_replay_recovery_disable_compact": "false",
> >> "bluefs_shared_alloc_size": "65536",
> >> "bluefs_sync_write": "false",
> >>
> >> which the osd performance counter i cannot determine who is using the
> >> memory,...
> >>
> >> thx & best regards
> >>
> >>
> >> On 18.11.23 09:05, Eugen Block wrote:
> >>> Do you have a large block.db size defined in the ceph.conf (or
> >>> config store)?
> >>>
> >>> Zitat von Debian :
> >>>
>  thx for your reply, it shows nothing,... there are no pgs on the
>  osd,...
> 
>  best regards
> 
>  On 17.11.23 23:09, Eugen Block wrote:
> > After you create the OSD, run ‚ceph pg ls-by-osd {OSD}‘, it should
> > show you which PGs are created there and then you’ll know which
> > pool they belong to, then check again the crush rule for that
> > pool. You can paste the outputs here.
> >
> > Zitat von Debian :
> >
> >> Hi,
> >>
> >> after a massive rebalance(tunables) my small SSD-OSDs are getting
> >> full, i changed my crush rules so there are actual no pgs/pools
> >> on it, but the disks stay full:
> >>
> >> ceph version 14.2.21 (5ef401921d7a88aea18ec7558f7f9374ebd8f5a6)
> >> nautilus (stable)
> >>
> >> ID CLASS WEIGHT 

[ceph-users] 304 response is not RFC9110 compliant

2023-11-20 Thread Ondřej Kukla
Hello,

I’ve noticed that the 304 response from the S3 and s3website APIs is not RFC 9110
compliant. This is an issue especially for caching the content when you have a
Cache-Control header set on the object.
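
A minimal way to reproduce what I mean (endpoint, bucket and ETag are placeholders):

# conditional GET against an RGW S3 endpoint
curl -sI -H 'If-None-Match: "d41d8cd98f00b204e9800998ecf8427e"' \
  https://rgw.example.com/bucket/object
# per RFC 9110 the 304 Not Modified response should repeat headers such as
# Cache-Control, ETag and Expires; the issue here is that RGW's 304 does not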

There was an old issue and PR from 2020 fixing this, but it was completely
ignored.

I’ve created a new issue so it is possible to get back to it:
https://tracker.ceph.com/issues/63507

Regards,

Ondrej
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: blustore osd nearfull but no pgs on it

2023-11-20 Thread Debian

Hi,

Yes, all of my small OSDs are affected.

I found the issue: my cluster is healthy and my rebalance finished - I
only have to wait for my old osdmaps to get cleaned up.


like in the thread "Disks are filling up even if there is not a single 
placement group on them"


thx!

On 20.11.23 11:36, Eugen Block wrote:
You provide only a few details at a time, it would help to get a full 
picture if you provided the output Wesley asked for (ceph df detail, 
ceph tell osd.158 status, ceph osd df tree). Is osd.149 now the 
problematic one or did you just add output from a different osd?
It's not really clear what you're doing without the necessary context. 
You can just add the 'ceph daemon osd.{OSD} perf dump' output here or 
in some pastebin.


Zitat von Debian :


Hi,

the block.db size ist default and not custom configured:

current:

bluefs.db_used_bytes: 9602859008
bluefs.db_used_bytes: 469434368

ceph daemon osd.149 config show

    "bluestore_bitmapallocator_span_size": "1024",
    "bluestore_block_db_size": "0",
    "bluestore_block_size": "107374182400",
    "bluestore_block_wal_size": "100663296",
    "bluestore_cache_size": "0",
    "bluestore_cache_size_hdd": "1073741824",
    "bluestore_cache_size_ssd": "3221225472",
    "bluestore_compression_max_blob_size": "0",
    "bluestore_compression_max_blob_size_hdd": "524288",
    "bluestore_compression_max_blob_size_ssd": "65536",
    "bluestore_compression_min_blob_size": "0",
    "bluestore_compression_min_blob_size_hdd": "131072",
    "bluestore_compression_min_blob_size_ssd": "8192",
    "bluestore_extent_map_inline_shard_prealloc_size": "256",
    "bluestore_extent_map_shard_max_size": "1200",
    "bluestore_extent_map_shard_min_size": "150",
    "bluestore_extent_map_shard_target_size": "500",
    "bluestore_extent_map_shard_target_size_slop": "0.20",
    "bluestore_max_alloc_size": "0",
    "bluestore_max_blob_size": "0",
    "bluestore_max_blob_size_hdd": "524288",
    "bluestore_max_blob_size_ssd": "65536",
    "bluestore_min_alloc_size": "0",
    "bluestore_min_alloc_size_hdd": "65536",
    "bluestore_min_alloc_size_ssd": "4096",
    "bluestore_prefer_deferred_size": "0",
    "bluestore_prefer_deferred_size_hdd": "32768",
    "bluestore_prefer_deferred_size_ssd": "0",
    "bluestore_rocksdb_options": 
"compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=1,recycle_log_file_num=4,write_buffer_size=268435456,writable_file_max_buffer_size=0,compaction_readahead_size=2097152,max_background_compactions=2",


    "bluefs_alloc_size": "1048576",
    "bluefs_allocator": "hybrid",
    "bluefs_buffered_io": "false",
    "bluefs_check_for_zeros": "false",
    "bluefs_compact_log_sync": "false",
    "bluefs_log_compact_min_ratio": "5.00",
    "bluefs_log_compact_min_size": "16777216",
    "bluefs_max_log_runway": "4194304",
    "bluefs_max_prefetch": "1048576",
    "bluefs_min_flush_size": "524288",
    "bluefs_min_log_runway": "1048576",
    "bluefs_preextend_wal_files": "false",
    "bluefs_replay_recovery": "false",
    "bluefs_replay_recovery_disable_compact": "false",
    "bluefs_shared_alloc_size": "65536",
    "bluefs_sync_write": "false",

which the osd performance counter i cannot determine who is using the 
memory,...


thx & best regards


On 18.11.23 09:05, Eugen Block wrote:
Do you have a large block.db size defined in the ceph.conf (or 
config store)?


Zitat von Debian :

thx for your reply, it shows nothing,... there are no pgs on the 
osd,...


best regards

On 17.11.23 23:09, Eugen Block wrote:
After you create the OSD, run ‚ceph pg ls-by-osd {OSD}‘, it should 
show you which PGs are created there and then you’ll know which 
pool they belong to, then check again the crush rule for that 
pool. You can paste the outputs here.


Zitat von Debian :


Hi,

after a massive rebalance(tunables) my small SSD-OSDs are getting 
full, i changed my crush rules so there are actual no pgs/pools 
on it, but the disks stay full:


ceph version 14.2.21 (5ef401921d7a88aea18ec7558f7f9374ebd8f5a6) 
nautilus (stable)


ID CLASS WEIGHT REWEIGHT SIZE    RAW USE DATA OMAP 
META AVAIL    %USE  VAR  PGS STATUS TYPE NAME
158   ssd    0.21999  1.0 224 GiB 194 GiB 193 GiB 22 MiB 1002 
MiB   30 GiB 86.68 1.49   0 up osd.158


inferring bluefs devices from bluestore path
1 : device size 0x37e440 : own 0x[1ad3f0~23c60] = 
0x23c60 : using 0x3963(918 MiB) : bluestore has 
0x46e2d(18 GiB) available


when i recreate the osd the osd gets full again

any suggestion?

thx & best regards
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe 

[ceph-users] Re: blustore osd nearfull but no pgs on it

2023-11-20 Thread Debian

Hi,

Ohh, that is exactly my problem: my cluster is healthy and no rebalance is
active.


I only have to wait for the old osdmaps to get cleaned up,...

thx!

On 20.11.23 10:42, Michal Strnad wrote:

Hi.

Try to look on thread "Disks are filling up even if there is not a 
single placement group on them" in this mailing list. Maybe you 
encounter the same problem as me.


Michal



On 11/20/23 08:56, Debian wrote:

Hi,

the block.db size ist default and not custom configured:

current:

bluefs.db_used_bytes: 9602859008
bluefs.db_used_bytes: 469434368

ceph daemon osd.149 config show

 "bluestore_bitmapallocator_span_size": "1024",
 "bluestore_block_db_size": "0",
 "bluestore_block_size": "107374182400",
 "bluestore_block_wal_size": "100663296",
 "bluestore_cache_size": "0",
 "bluestore_cache_size_hdd": "1073741824",
 "bluestore_cache_size_ssd": "3221225472",
 "bluestore_compression_max_blob_size": "0",
 "bluestore_compression_max_blob_size_hdd": "524288",
 "bluestore_compression_max_blob_size_ssd": "65536",
 "bluestore_compression_min_blob_size": "0",
 "bluestore_compression_min_blob_size_hdd": "131072",
 "bluestore_compression_min_blob_size_ssd": "8192",
 "bluestore_extent_map_inline_shard_prealloc_size": "256",
 "bluestore_extent_map_shard_max_size": "1200",
 "bluestore_extent_map_shard_min_size": "150",
 "bluestore_extent_map_shard_target_size": "500",
 "bluestore_extent_map_shard_target_size_slop": "0.20",
 "bluestore_max_alloc_size": "0",
 "bluestore_max_blob_size": "0",
 "bluestore_max_blob_size_hdd": "524288",
 "bluestore_max_blob_size_ssd": "65536",
 "bluestore_min_alloc_size": "0",
 "bluestore_min_alloc_size_hdd": "65536",
 "bluestore_min_alloc_size_ssd": "4096",
 "bluestore_prefer_deferred_size": "0",
 "bluestore_prefer_deferred_size_hdd": "32768",
 "bluestore_prefer_deferred_size_ssd": "0",
 "bluestore_rocksdb_options": 
"compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=1,recycle_log_file_num=4,write_buffer_size=268435456,writable_file_max_buffer_size=0,compaction_readahead_size=2097152,max_background_compactions=2",


 "bluefs_alloc_size": "1048576",
 "bluefs_allocator": "hybrid",
 "bluefs_buffered_io": "false",
 "bluefs_check_for_zeros": "false",
 "bluefs_compact_log_sync": "false",
 "bluefs_log_compact_min_ratio": "5.00",
 "bluefs_log_compact_min_size": "16777216",
 "bluefs_max_log_runway": "4194304",
 "bluefs_max_prefetch": "1048576",
 "bluefs_min_flush_size": "524288",
 "bluefs_min_log_runway": "1048576",
 "bluefs_preextend_wal_files": "false",
 "bluefs_replay_recovery": "false",
 "bluefs_replay_recovery_disable_compact": "false",
 "bluefs_shared_alloc_size": "65536",
 "bluefs_sync_write": "false",

which the osd performance counter i cannot determine who is using the 
memory,...


thx & best regards


On 18.11.23 09:05, Eugen Block wrote:
Do you have a large block.db size defined in the ceph.conf (or 
config store)?


Zitat von Debian :

thx for your reply, it shows nothing,... there are no pgs on the 
osd,...


best regards

On 17.11.23 23:09, Eugen Block wrote:
After you create the OSD, run ‚ceph pg ls-by-osd {OSD}‘, it should 
show you which PGs are created there and then you’ll know which 
pool they belong to, then check again the crush rule for that 
pool. You can paste the outputs here.


Zitat von Debian :


Hi,

after a massive rebalance(tunables) my small SSD-OSDs are getting 
full, i changed my crush rules so there are actual no pgs/pools 
on it, but the disks stay full:


ceph version 14.2.21 (5ef401921d7a88aea18ec7558f7f9374ebd8f5a6) 
nautilus (stable)


ID CLASS WEIGHT REWEIGHT SIZE    RAW USE DATA OMAP META 
AVAIL    %USE  VAR  PGS STATUS TYPE NAME
158   ssd    0.21999  1.0 224 GiB 194 GiB 193 GiB 22 MiB 1002 
MiB   30 GiB 86.68 1.49   0 up osd.158


inferring bluefs devices from bluestore path
1 : device size 0x37e440 : own 0x[1ad3f0~23c60] = 
0x23c60 : using 0x3963(918 MiB) : bluestore has 
0x46e2d(18 GiB) available


when i recreate the osd the osd gets full again

any suggestion?

thx & best regards
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an 

[ceph-users] Re: blustore osd nearfull but no pgs on it

2023-11-20 Thread Eugen Block
You provide only a few details at a time, it would help to get a full  
picture if you provided the output Wesley asked for (ceph df detail,  
ceph tell osd.158 status, ceph osd df tree). Is osd.149 now the  
problematic one or did you just add output from a different osd?
It's not really clear what you're doing without the necessary context.  
You can just add the 'ceph daemon osd.{OSD} perf dump' output here or  
in some pastebin.


Zitat von Debian :


Hi,

the block.db size ist default and not custom configured:

current:

bluefs.db_used_bytes: 9602859008
bluefs.db_used_bytes: 469434368

ceph daemon osd.149 config show

    "bluestore_bitmapallocator_span_size": "1024",
    "bluestore_block_db_size": "0",
    "bluestore_block_size": "107374182400",
    "bluestore_block_wal_size": "100663296",
    "bluestore_cache_size": "0",
    "bluestore_cache_size_hdd": "1073741824",
    "bluestore_cache_size_ssd": "3221225472",
    "bluestore_compression_max_blob_size": "0",
    "bluestore_compression_max_blob_size_hdd": "524288",
    "bluestore_compression_max_blob_size_ssd": "65536",
    "bluestore_compression_min_blob_size": "0",
    "bluestore_compression_min_blob_size_hdd": "131072",
    "bluestore_compression_min_blob_size_ssd": "8192",
    "bluestore_extent_map_inline_shard_prealloc_size": "256",
    "bluestore_extent_map_shard_max_size": "1200",
    "bluestore_extent_map_shard_min_size": "150",
    "bluestore_extent_map_shard_target_size": "500",
    "bluestore_extent_map_shard_target_size_slop": "0.20",
    "bluestore_max_alloc_size": "0",
    "bluestore_max_blob_size": "0",
    "bluestore_max_blob_size_hdd": "524288",
    "bluestore_max_blob_size_ssd": "65536",
    "bluestore_min_alloc_size": "0",
    "bluestore_min_alloc_size_hdd": "65536",
    "bluestore_min_alloc_size_ssd": "4096",
    "bluestore_prefer_deferred_size": "0",
    "bluestore_prefer_deferred_size_hdd": "32768",
    "bluestore_prefer_deferred_size_ssd": "0",
    "bluestore_rocksdb_options":  
"compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=1,recycle_log_file_num=4,write_buffer_size=268435456,writable_file_max_buffer_size=0,compaction_readahead_size=2097152,max_background_compactions=2",


    "bluefs_alloc_size": "1048576",
    "bluefs_allocator": "hybrid",
    "bluefs_buffered_io": "false",
    "bluefs_check_for_zeros": "false",
    "bluefs_compact_log_sync": "false",
    "bluefs_log_compact_min_ratio": "5.00",
    "bluefs_log_compact_min_size": "16777216",
    "bluefs_max_log_runway": "4194304",
    "bluefs_max_prefetch": "1048576",
    "bluefs_min_flush_size": "524288",
    "bluefs_min_log_runway": "1048576",
    "bluefs_preextend_wal_files": "false",
    "bluefs_replay_recovery": "false",
    "bluefs_replay_recovery_disable_compact": "false",
    "bluefs_shared_alloc_size": "65536",
    "bluefs_sync_write": "false",

which the osd performance counter i cannot determine who is using  
the memory,...


thx & best regards


On 18.11.23 09:05, Eugen Block wrote:
Do you have a large block.db size defined in the ceph.conf (or  
config store)?


Zitat von Debian :


thx for your reply, it shows nothing,... there are no pgs on the osd,...

best regards

On 17.11.23 23:09, Eugen Block wrote:
After you create the OSD, run ‚ceph pg ls-by-osd {OSD}‘, it  
should show you which PGs are created there and then you’ll know  
which pool they belong to, then check again the crush rule for  
that pool. You can paste the outputs here.


Zitat von Debian :


Hi,

after a massive rebalance(tunables) my small SSD-OSDs are  
getting full, i changed my crush rules so there are actual no  
pgs/pools on it, but the disks stay full:


ceph version 14.2.21 (5ef401921d7a88aea18ec7558f7f9374ebd8f5a6)  
nautilus (stable)


ID CLASS WEIGHT REWEIGHT SIZE    RAW USE DATA OMAP  
META AVAIL    %USE  VAR  PGS STATUS TYPE NAME
158   ssd    0.21999  1.0 224 GiB 194 GiB 193 GiB  22 MiB  
1002 MiB   30 GiB 86.68 1.49   0 up osd.158


inferring bluefs devices from bluestore path
1 : device size 0x37e440 : own 0x[1ad3f0~23c60] =  
0x23c60 : using 0x3963(918 MiB) : bluestore has  
0x46e2d(18 GiB) available


when i recreate the osd the osd gets full again

any suggestion?

thx & best regards
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To 

[ceph-users] Re: blustore osd nearfull but no pgs on it

2023-11-20 Thread Michal Strnad

Hi.

Try looking at the thread "Disks are filling up even if there is not a
single placement group on them" in this mailing list. Maybe you are
encountering the same problem as me.


Michal



On 11/20/23 08:56, Debian wrote:

Hi,

the block.db size ist default and not custom configured:

current:

bluefs.db_used_bytes: 9602859008
bluefs.db_used_bytes: 469434368

ceph daemon osd.149 config show

     "bluestore_bitmapallocator_span_size": "1024",
     "bluestore_block_db_size": "0",
     "bluestore_block_size": "107374182400",
     "bluestore_block_wal_size": "100663296",
     "bluestore_cache_size": "0",
     "bluestore_cache_size_hdd": "1073741824",
     "bluestore_cache_size_ssd": "3221225472",
     "bluestore_compression_max_blob_size": "0",
     "bluestore_compression_max_blob_size_hdd": "524288",
     "bluestore_compression_max_blob_size_ssd": "65536",
     "bluestore_compression_min_blob_size": "0",
     "bluestore_compression_min_blob_size_hdd": "131072",
     "bluestore_compression_min_blob_size_ssd": "8192",
     "bluestore_extent_map_inline_shard_prealloc_size": "256",
     "bluestore_extent_map_shard_max_size": "1200",
     "bluestore_extent_map_shard_min_size": "150",
     "bluestore_extent_map_shard_target_size": "500",
     "bluestore_extent_map_shard_target_size_slop": "0.20",
     "bluestore_max_alloc_size": "0",
     "bluestore_max_blob_size": "0",
     "bluestore_max_blob_size_hdd": "524288",
     "bluestore_max_blob_size_ssd": "65536",
     "bluestore_min_alloc_size": "0",
     "bluestore_min_alloc_size_hdd": "65536",
     "bluestore_min_alloc_size_ssd": "4096",
     "bluestore_prefer_deferred_size": "0",
     "bluestore_prefer_deferred_size_hdd": "32768",
     "bluestore_prefer_deferred_size_ssd": "0",
     "bluestore_rocksdb_options": 
"compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=1,recycle_log_file_num=4,write_buffer_size=268435456,writable_file_max_buffer_size=0,compaction_readahead_size=2097152,max_background_compactions=2",


     "bluefs_alloc_size": "1048576",
     "bluefs_allocator": "hybrid",
     "bluefs_buffered_io": "false",
     "bluefs_check_for_zeros": "false",
     "bluefs_compact_log_sync": "false",
     "bluefs_log_compact_min_ratio": "5.00",
     "bluefs_log_compact_min_size": "16777216",
     "bluefs_max_log_runway": "4194304",
     "bluefs_max_prefetch": "1048576",
     "bluefs_min_flush_size": "524288",
     "bluefs_min_log_runway": "1048576",
     "bluefs_preextend_wal_files": "false",
     "bluefs_replay_recovery": "false",
     "bluefs_replay_recovery_disable_compact": "false",
     "bluefs_shared_alloc_size": "65536",
     "bluefs_sync_write": "false",

which the osd performance counter i cannot determine who is using the 
memory,...


thx & best regards


On 18.11.23 09:05, Eugen Block wrote:
Do you have a large block.db size defined in the ceph.conf (or config 
store)?


Zitat von Debian :


thx for your reply, it shows nothing,... there are no pgs on the osd,...

best regards

On 17.11.23 23:09, Eugen Block wrote:
After you create the OSD, run ‚ceph pg ls-by-osd {OSD}‘, it should 
show you which PGs are created there and then you’ll know which pool 
they belong to, then check again the crush rule for that pool. You 
can paste the outputs here.


Zitat von Debian :


Hi,

after a massive rebalance(tunables) my small SSD-OSDs are getting 
full, i changed my crush rules so there are actual no pgs/pools on 
it, but the disks stay full:


ceph version 14.2.21 (5ef401921d7a88aea18ec7558f7f9374ebd8f5a6) 
nautilus (stable)


ID CLASS WEIGHT REWEIGHT SIZE    RAW USE DATA OMAP META 
AVAIL    %USE  VAR  PGS STATUS TYPE NAME
158   ssd    0.21999  1.0 224 GiB 194 GiB 193 GiB  22 MiB 1002 
MiB   30 GiB 86.68 1.49   0 up osd.158


inferring bluefs devices from bluestore path
1 : device size 0x37e440 : own 0x[1ad3f0~23c60] = 
0x23c60 : using 0x3963(918 MiB) : bluestore has 
0x46e2d(18 GiB) available


when i recreate the osd the osd gets full again

any suggestion?

thx & best regards
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


--
Michal Strnad
Oddeleni datovych ulozist
CESNET z.s.p.o.



[ceph-users] Re: How to use hardware

2023-11-20 Thread Frank Schilder
Hi Simon,

we are using something similar for ceph-fs. For a backup system your setup can 
work, depending on how you back up. While HDD pools have poor IOP/s 
performance, they are very good for streaming workloads. If you are using 
something like Borg backup that writes huge files sequentially, a HDD back-end 
should be OK.

Here some things to consider and try out:

1. You really need to get a bunch of enterprise SSDs with power loss protection 
for the FS meta data pool (disable write cache if enabled, this will disable 
volatile write cache and switch to protected caching). We are using (formerly 
Intel) 1.8T SATA drives that we subdivide into 4 OSDs each to raise 
performance. Place the meta-data pool and the primary data pool on these disks. 
Create a secondary data pool on the HDDs and assign it to the root *before* 
creating anything on the FS (see the recommended 3-pool layout for ceph file 
systems in the docs). I would not even consider running this without SSDs. 1 
such SSD per host is the minimum, 2 is better. If Borg or whatever can make use 
of a small fast storage directory, assign a sub-dir of the root to the primary 
data pool.
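
A rough sketch of that layout (pool/FS names, PG counts and the mount point are
placeholders; the CRUSH rules pinning the replicated pools to the SSD device class
are omitted):

# 4+2 EC profile and the three pools
ceph osd erasure-code-profile set ec42 k=4 m=2 crush-failure-domain=host
ceph osd pool create cephfs_metadata 64 64 replicated
ceph osd pool create cephfs_data 64 64 replicated
ceph osd pool create cephfs_data_hdd 1024 1024 erasure ec42
ceph osd pool set cephfs_data_hdd allow_ec_overwrites true
ceph fs new cephfs cephfs_metadata cephfs_data
ceph fs add_data_pool cephfs cephfs_data_hdd
# after mounting, assign the FS root to the HDD pool *before* creating anything ...
setfattr -n ceph.dir.layout.pool -v cephfs_data_hdd /mnt/cephfs
# ... and keep a "fast" sub-directory on the SSD-backed primary data pool
mkdir /mnt/cephfs/fast
setfattr -n ceph.dir.layout.pool -v cephfs_data /mnt/cephfs/fast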

2. Calculate with sufficient extra disk space. As long as utilization stays 
below 60-70% bluestore will try to make large object writes sequential, which 
is really important for HDDs. On our cluster we currently have 40% utilization 
and I get full HDD bandwidth out for large sequential reads/writes. Make sure 
your backup application makes large sequential IO requests.

3. As Anthony said, add RAM. You should go for 512G on nodes with 50 HDDs. You can 
run the MDS daemons on the OSD nodes. Set a reasonable cache limit and use 
ephemeral pinning. Depending on the CPUs you are using, 48 cores can be plenty. 
The latest generation Intel Xeon Scalable Processors is so efficient with ceph 
that 1HT per HDD is more than enough.

4. 3 MON+MGR nodes are sufficient. You can do something else with the remaining 
2 nodes. Of course, you can use them as additional MON+MGR nodes. We also use 5 
and it improves maintainability a lot.

Something more exotic if you have time:

5. To improve sequential performance further, you can experiment with larger 
min_alloc_sizes for OSDs (on creation time, you will need to scrap and 
re-deploy the cluster to test different values). Every HDD has a preferred 
IO-size for which random IO achieves nearly the same bandwidth as sequential 
writes. (But see 7.)
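
If you do experiment with this, the option is set before the OSDs are (re)created,
for example (the value is only an example, not a recommendation):

# only affects OSDs created after the change; existing OSDs keep the value
# they were created with
ceph config set osd bluestore_min_alloc_size_hdd 1048576
# check what the option is currently set to for a given daemon
ceph daemon osd.0 config get bluestore_min_alloc_size_hdd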

6. On your set-up you will probably go for a 4+2 EC data pool on HDD. With 
object size 4M the max. chunk size per OSD will be 1M. For many HDDs this is 
the preferred IO size (usually between 256K-1M). (But see 7.)

7. Important: large min_alloc_sizes are only good if your workload *never* 
modifies files, but only replaces them. A bit like a pool without EC overwrite 
enabled. The implementation of EC overwrites has a "feature" that can lead to 
massive allocation amplification. If your backup workload does modifications to 
files instead of adding new+deleting old, do *not* experiment with options 
5.-7. Instead, use the default and make sure you have sufficient unused 
capacity to increase the chances for large bluestore writes (keep utilization 
below 60-70% and just buy extra disks). A workload with large min_alloc_sizes 
has to be S3-like, only upload, download and delete are allowed.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Anthony D'Atri 
Sent: Saturday, November 18, 2023 3:24 PM
To: Simon Kepp
Cc: Albert Shih; ceph-users@ceph.io
Subject: [ceph-users] Re: How to use hardware

Common motivations for this strategy include the lure of unit economics and RUs.

Often ultra dense servers can’t fill racks anyway due to power and weight 
limits.

Here the osd_memory_target would have to be severely reduced to avoid 
oomkilling.  Assuming the OSDs are top load LFF HDDs with expanders, the HBA 
will be a bottleneck as well.  I’ve suffered similar systems for RGW.  All the 
clever juggling in the world could not override the math, and the solution was 
QLC.

“We can lose 4 servers”

Do you realize that your data would then be unavailable?  When you lose even 
one, you will not be able to restore redundancy and your OSDs likely will 
oomkill.

If you’re running CephFS, how are you provisioning fast OSDs for the metadata 
pool?  Are the CPUs high-clock for MDS responsiveness?

Even given the caveats this seems like a recipe for at best disappointment.

At the very least add RAM.  8GB per OSD plus ample for other daemons.  Better 
would be 3x normal additional hosts for the others.

> On Nov 17, 2023, at 8:33 PM, Simon Kepp  wrote:
>
> I know that your question is regarding the service servers, but may I ask,
> why you are planning to place so many OSDs ( 300) on so few OSD hosts( 6)
> (= 50 OSDs per node)?
> This is possible to do, but sounds like the nodes were designed for
> scale-up rather than a