[ceph-users] Re: Accumulation of removed_snaps_queue After Deleting Snapshots in Ceph RBD

2024-02-13 Thread Josh Baergen
> 24   active+clean+snaptrim

I see snaptrimming happening in your status output - do you know if
that was happening before restarting those OSDs? This is the mechanism
by which OSDs clean up deleted snapshots, and once all OSDs have
completed snaptrim for a given snapshot it should be removed from the
removed_snaps_queue.

> ceph version 16.2.7

You may want to consider upgrading. 16.2.8 has a fix for
https://tracker.ceph.com/issues/52026, which is an issue that can
cause snaptrim to not happen in some circumstances.
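
If you want to watch the queue drain, something like this should work (the
JSON field names are my assumption from the Pacific pool dump, and it needs
jq, so treat it as a sketch):

    # removed-snap intervals still queued, per pool
    ceph osd pool ls detail --format json | \
        jq '.[] | {pool: .pool_name, queued: (.removed_snaps_queue | length)}'

    # PGs still trimming or waiting to trim
    ceph pg dump pgs_brief | grep -c snaptrim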

Josh

On Mon, Feb 12, 2024 at 6:00 PM localhost Liam  wrote:
>
> Thanks, our storage is under a lot less stress now.
> 0. I rebooted 30 OSDs on one machine and the queue was not reduced, but a
> large amount of storage space was released.
> 1. Why did rebooting the OSDs release so much space?
>
>
> Here are Ceph details..
>
> ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific 
> (stable)
>
>   cluster:
> id: 9acc3734-b27b-4bc3-84b8-c7762f2294c6
> health: HEALTH_OK
>
>   services:
> mon: 3 daemons, quorum onf-akl-stor001,onf-akl-stor002,onf-akl-stor003 
> (age
> 11d)
> mgr: onf-akl-stor001(active, since 3M), standbys: onf-akl-stor002
> osd: 101 osds: 98 up (since 41s), 98 in (since 11d)
> rgw: 2 daemons active (2 hosts, 1 zones)
>
>   data:
> pools:   7 pools, 2209 pgs
> objects: 25.47M objects, 58 TiB
> usage:   115 TiB used, 184 TiB / 299 TiB avail
> pgs: 2183 active+clean
> 24   active+clean+snaptrim
> 2active+clean+scrubbing+deep
>
>   io:
> client:   38 MiB/s rd, 226 MiB/s wr, 1.32k op/s rd, 2.27k op/s wr
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Accumulation of removed_snaps_queue After Deleting Snapshots in Ceph RBD

2024-02-09 Thread Josh Baergen
Hello,

Which version of Ceph are you using? Are all of your OSDs currently
up+in? If you're HEALTH_OK and all OSDs are up, snaptrim should work
through the removed_snaps_queue and clear it over time, but I have
seen cases where this seems to get stuck and restarting OSDs can help.

Josh

On Wed, Feb 7, 2024 at 12:01 PM localhost Liam  wrote:
>
> Hello,
>
> I'm encountering an issue with Ceph when using it as the backend storage for 
> OpenStack Cinder. Specifically, after deleting RBD snapshots through Cinder, 
> I've noticed a significant increase in the removed_snaps_queue entries within 
> the corresponding Ceph pool. It seems to affect the pool's performance and 
> space efficiency.
>
> I understand that snapshot deletion in Cinder is an asynchronous operation, 
> and Ceph itself uses a lazy deletion mechanism to handle snapshot removal. 
> However, even after allowing sufficient time, the entries in 
> removed_snaps_queue do not decrease as expected.
>
> I have several questions for the community:
>
> Are there recommended methods or best practices for managing or reducing 
> entries in removed_snaps_queue?
> Is there any tool or command that can safely clear these residual snapshot 
> entries without affecting the integrity of active snapshots and data?
> Is this issue known, and are there any bug reports or plans for fixes related 
> to it?
> Thank you very much for your assistance!
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How to solve data fixity

2024-02-09 Thread Josh Baergen
MPU etags are an MD5-of-MD5s, FWIW. If the user knows how the parts are
uploaded then it can be used to verify contents, both just after upload and
then at download time (both need to be validated if you want end-to-end
validation - but then you're trusting the system to not change the etag
underneath you).
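
As a minimal sketch of that check, assuming you know the part size that was
used for the upload (16 MiB here is only an illustration) and all parts except
the last were that size:

    # split the local copy the same way the multipart upload did
    split -b 16M myobject mpu-part-
    # hex MD5 of each part -> back to binary -> MD5 of the concatenation
    for p in mpu-part-*; do md5sum "$p" | cut -d' ' -f1; done | xxd -r -p | md5sum
    # the S3 ETag should be this digest with "-<number of parts>" appended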

Josh

On Fri, Feb 9, 2024, 6:16 a.m. Michal Strnad 
wrote:

> Thank you for your response.
>
> We have already done some Lua scripting in the past, and it wasn't
> entirely enjoyable :-), but we may have to do it again. Scrubbing is
> still enabled, and turning it off definitely won't be an option.
> However, due to the project requirements, it would be great if
> Ceph could, on upload completion, compute a hash (md5, sha256) and
> store it in the object's metadata, so that the user could later
> validate that the downloaded data is correct.
>
> We can't use the ETag for that, as it does not contain the md5 in the
> case of a multipart upload.
>
> Michal
>
>
> On 2/9/24 13:53, Anthony D'Atri wrote:
> > You could use Lua scripting perhaps to do this at ingest, but I'm very
> curious about scrubs -- you have them turned off completely?
> >
> >
> >> On Feb 9, 2024, at 04:18, Michal Strnad 
> wrote:
> >>
> >> Hi all!
> >>
> >> In the context of a repository-type project, we need to address a
> situation where we cannot use periodic checks in Ceph (scrubbing) due to
> the project's nature. Instead, we need the ability to write a checksum into
> the metadata of the uploaded file via API. In this context, we are not
> concerned about individual file parts, but rather the file as a whole.
> Users will calculate the checksum and write it. Based on this hash, we
> should be able to trigger a check of the given files. We are aware that
> tools like s3cmd can write MD5 hashes to file metadata, but is there a more
> general approach? Does anyone have experience with this, or can you suggest
> a tool that can accomplish this?
> >>
> >> Thx
> >> Michal
> >> ___
> >> ceph-users mailing list -- ceph-users@ceph.io
> >> To unsubscribe send an email to ceph-users-le...@ceph.io
> >
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: XFS on top of RBD, overhead

2024-02-02 Thread Josh Baergen
On Fri, Feb 2, 2024 at 7:44 AM Ruben Vestergaard  wrote:
> Is the RBD client performing partial object reads? Is that even a thing?

Yup! The rados API has both length and offset parameters for reads
(https://docs.ceph.com/en/latest/rados/api/librados/#c.rados_aio_read)
and writes 
(https://docs.ceph.com/en/latest/rados/api/librados/#c.rados_aio_write).

Josh
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Scrubbing?

2024-01-30 Thread Josh Baergen
Ah, yeah, you hit https://tracker.ceph.com/issues/63389 during the upgrade.

Josh

On Tue, Jan 30, 2024 at 3:17 AM Jan Marek  wrote:
>
> Hello again,
>
> I'm sorry, I forgot attach file... :-(
>
> Sincerely
> Jan
>
> On Tue, Jan 30, 2024 at 11:09:44 CET, Jan Marek wrote:
> > Hello Sridhar,
> >
> > at Saturday I've finished upgrade proces to 18.2.1.
> >
> > Cluster is now in HEALTH_OK state and performs well.
> >
> > According to my colleagues there are lower latences and good
> > throughput.
> >
> > On OSD nodes there is relative low I/O activity.
> >
> > I still have mClock profile "high_client_ops".
> >
> > When I was stuck in the upgrade process, I had many such records in
> > the logs, see the attached file. Since the upgrade completed, these
> > messages have gone away... Could this have been the reason for the
> > poor performance?
> >
> > Sincerely
> > Jan Marek
> >
> > On Thu, Jan 25, 2024 at 02:31:41 CET, Jan Marek wrote:
> > > Hello Sridhar,
> > >
> > > On Thu, Jan 25, 2024 at 09:53:26 CET, Sridhar Seshasayee wrote:
> > > > Hello Jan,
> > > >
> > > > The point of my previous post was that the Ceph cluster didn't fulfill
> > > > my needs and, although I had set the mClock profile to
> > > > "high_client_ops" (because I have plenty of time for rebalancing
> > > > and scrubbing), my clients ran into problems.
> > > >
> > > > As far as the question around mClock is concerned, there are further
> > > > improvements in the works to handle QoS between client ops and
> > > > background scrub ops. This should help address the issue you are
> > > > currently facing. See PR: https://github.com/ceph/ceph/pull/51171
> > > > for more information.
> > > > Also, it would be helpful to know the Ceph version you are currently 
> > > > using.
> > >
> > > thanks for your reply.
> > >
> > > I'm just in the process of upgrading from 17.2.6 to 18.2.1 (you can
> > > see my previous posts about getting stuck in the upgrade to Reef).
> > >
> > > Maybe this was the cause of my problem...
> > >
> > > Now I've tried to give the cluster a rest to do some "background"
> > > tasks (and it seems that this was correct, because on my hosts
> > > there is around 50-100 MBps read and approx. 10-50 MBps write traffic -
> > > approx. 1/4-1/2 of the previous load).
> > >
> > > On Saturday I will change some settings on the networking and I will
> > > try to start the upgrade process, maybe with --limit=1, to be "soft"
> > > on the cluster and on our clients...
> > >
> > > > -Sridhar
> > >
> > > Sincerely
> > > Jan Marek
> > > --
> > > Ing. Jan Marek
> > > University of South Bohemia
> > > Academic Computer Centre
> > > Phone: +420389032080
> > > http://www.gnu.org/philosophy/no-word-attachments.cs.html
> >
> >
> >
> > > ___
> > > ceph-users mailing list -- ceph-users@ceph.io
> > > To unsubscribe send an email to ceph-users-le...@ceph.io
> >
> >
> > --
> > Ing. Jan Marek
> > University of South Bohemia
> > Academic Computer Centre
> > Phone: +420389032080
> > http://www.gnu.org/philosophy/no-word-attachments.cs.html
>
>
>
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
> --
> Ing. Jan Marek
> University of South Bohemia
> Academic Computer Centre
> Phone: +420389032080
> http://www.gnu.org/philosophy/no-word-attachments.cs.html
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: 6 pgs not deep-scrubbed in time

2024-01-29 Thread Josh Baergen
You need to be running at least 16.2.11 on the OSDs so that you have
the fix for https://tracker.ceph.com/issues/55631.
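
A quick way to confirm what the OSDs are actually running, and which PGs are
behind, before relying on that fix:

    ceph versions                                  # per-daemon-type version breakdown
    ceph health detail | grep 'not deep-scrubbed'  # the PGs behind on deep scrub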

On Mon, Jan 29, 2024 at 8:07 AM Michel Niyoyita  wrote:
>
> I am running ceph pacific , version 16 , ubuntu 20 OS , deployed using 
> ceph-ansible.
>
> Michel
>
> On Mon, Jan 29, 2024 at 4:47 PM Josh Baergen  
> wrote:
>>
>> Make sure you're on a fairly recent version of Ceph before doing this, 
>> though.
>>
>> Josh
>>
>> On Mon, Jan 29, 2024 at 5:05 AM Janne Johansson  wrote:
>> >
>> > On Mon, Jan 29, 2024 at 12:58, Michel Niyoyita wrote:
>> > >
>> > > Thank you Frank ,
>> > >
>> > > All disks are HDDs. I would like to know if I can increase the number of
>> > > PGs live in production without a negative impact on the cluster. If yes,
>> > > which commands should I use?
>> >
>> > Yes. "ceph osd pool set  pg_num "
>> > where the number usually should be a power of two that leads to a
>> > number of PGs per OSD between 100-200.
>> >
>> > --
>> > May the most significant bit of your life be positive.
>> > ___
>> > ceph-users mailing list -- ceph-users@ceph.io
>> > To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: 6 pgs not deep-scrubbed in time

2024-01-29 Thread Josh Baergen
Make sure you're on a fairly recent version of Ceph before doing this, though.
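
As a rough illustration of the sizing rule Janne quotes below (the numbers
here are hypothetical, not taken from your cluster):

    # e.g. 100 OSDs, replicated size 3, aiming for ~150 PGs per OSD:
    #   100 * 150 / 3 = 5000  ->  round to a power of two: 4096
    ceph osd pool set <pool> pg_num 4096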

Josh

On Mon, Jan 29, 2024 at 5:05 AM Janne Johansson  wrote:
>
> On Mon, Jan 29, 2024 at 12:58, Michel Niyoyita wrote:
> >
> > Thank you Frank ,
> >
> > All disks are HDDs. I would like to know if I can increase the number of PGs
> > live in production without a negative impact on the cluster. If yes, which
> > commands should I use?
>
> Yes. "ceph osd pool set  pg_num "
> where the number usually should be a power of two that leads to a
> number of PGs per OSD between 100-200.
>
> --
> May the most significant bit of your life be positive.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD read latency grows over time

2024-01-26 Thread Josh Baergen
> Just curious, can decreasing rocksdb_cf_compact_on_deletion_trigger 16384 >
> 4096 hurt performance of HDD OSDs in any way? I have no growing latency on
> HDD OSD, where data is stored, but it would be easier to set it to [osd]
> section without cherry picking only SSD/NVME OSDs, but for all at once.

I think that depends on your workload, but I'm not certain.

If you don't override the OSD classes, you should be able to do
something like "ceph config set osd/class:ssd
rocksdb_cf_compact_on_deletion_trigger 4096".

Josh

On Fri, Jan 26, 2024 at 10:27 AM Roman Pashin  wrote:
>
> > Unfortunately they cannot. You'll want to set them in centralized conf
> > and then restart OSDs for them to take effect.
> >
>
> Got it. Thank you Josh! WIll put it to config of affected OSDs and restart
> them.
>
> Just curious, can decreasing rocksdb_cf_compact_on_deletion_trigger 16384 >
> 4096 hurt performance of HDD OSDs in any way? I have no growing latency on
> HDD OSD, where data is stored, but it would be easier to set it to [osd]
> section without cherry picking only SSD/NVME OSDs, but for all at once.
>
> --
> Thank you,
> Roman
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD read latency grows over time

2024-01-26 Thread Josh Baergen
> Do you know if it rocksdb_cf_compact_on_deletion_trigger and
> rocksdb_cf_compact_on_deletion_sliding_window can be changed in runtime
> without OSD restart?

Unfortunately they cannot. You'll want to set them in centralized conf
and then restart OSDs for them to take effect.
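
For example (a sketch - the restart mechanics depend on how your OSDs are
deployed):

    ceph config set osd rocksdb_cf_compact_on_deletion_trigger 4096
    # then restart OSDs, one failure domain at a time, e.g. on each host:
    systemctl restart ceph-osd@<id>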

Josh

On Fri, Jan 26, 2024 at 2:54 AM Roman Pashin  wrote:
>
> Hi Mark,
>
> In v17.2.7 we enabled a feature that automatically performs a compaction
> >> if too many tombstones are present during iteration in RocksDB.  It
> >> might be worth upgrading to see if it helps (you might have to try
> >> tweaking the settings if the defaults aren't helping enough).  The PR is
> >> here:
> >>
> >> https://github.com/ceph/ceph/pull/50893
> >
> >
> we've upgraded Ceph to v17.2.7 yesterday. Unfortunately I still see growing
> latency on OSDs hosting index pool. Will try to tune
> rocksdb_cf_compact_on_deletion options as you suggested.
>
> I've started with decreasing deletion_trigger from 16384 to 512 with:
>
> # ceph tell 'osd.*' injectargs '--rocksdb_cf_compact_on_deletion_trigger
> 512'
>
> At first glance - nothing has changed per OSD latency graphs. I've tried to
> decrease it to 32 deletions per window on a single OSD where I see
> increasing latency to force compactions, but per graphs nothing has changed
> after approx 40 minutes.
>
> # ceph tell 'osd.435' injectargs '--rocksdb_cf_compact_on_deletion_trigger
> 32'
>
> Didn't touch rocksdb_cf_compact_on_deletion_sliding_window yet, it is set
> with default 32768 entries.
>
> Do you know if it rocksdb_cf_compact_on_deletion_trigger and
> rocksdb_cf_compact_on_deletion_sliding_window can be changed in runtime
> without OSD restart?
>
> --
> Thank you,
> Roman
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Logging control

2023-12-19 Thread Josh Baergen
I would start with "ceph tell osd.1 config diff", as I find that
output the easiest to read when trying to understand where various
config overrides are coming from. You almost never need to use "ceph
daemon" in Octopus+ systems since "ceph tell" should be able to access
pretty much all commands for daemons from any node.
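
For example (debug_osd/debug_ms are just placeholders for whatever you
turned up):

    ceph tell osd.1 config diff     # every setting that differs from default
    ceph config rm osd debug_osd    # drop overrides stored in the centralized config
    ceph config rm osd debug_ms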

Josh

On Tue, Dec 19, 2023 at 2:02 PM Tim Holloway  wrote:
>
> Ceph version is Pacific (16.2.14), upgraded from a sloppy Octopus.
>
> I ran afoul of all the best bugs in Octopus, and in the process
> switched on a lot of stuff better left alone, including some detailed
> debug logging. Now I can't turn it off.
>
> I am confidently informed by the documentation that the first step
> would be the command:
>
> ceph daemon osd.1 config show | less
>
> But instead of config information I get back:
>
> Can't get admin socket path: unable to get conf option admin_socket for
> osd: b"error parsing 'osd': expected string of the form TYPE.ID, valid
> types are: auth, mon, osd, mds, mgr, client\n"
>
> Which seems to be kind of insane.
>
> Attempting to get daemon config info on a monitor on that machine
> gives:
>
> admin_socket: exception getting command descriptions: [Errno 2] No such
> file or directory
>
> Which doesn't help either.
>
> Anyone got an idea?
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: About ceph osd slow ops

2023-12-01 Thread Josh Baergen
Given that this is s3, are the slow ops on index or data OSDs? (You
mentioned HDD but I don't want to assume that meant that the osd you
mentioned
is data)

Josh

On Fri, Dec 1, 2023 at 7:05 AM VÔ VI  wrote:
>
> Hi Stefan,
>
> I am running replica x3 with host as the failure domain, and the pool's
> min_size set to 1. Because my cluster's S3 traffic is real-time and I can't
> stop or block IO, data may be lost but IO must always be available. I hope
> my cluster can run with two nodes unavailable.
> After two nodes went down at the same time and then came back up, client IO
> and recovery were running at the same time, and some disks warned of slow
> ops. What is the problem? Maybe my disks are overloaded, but disk
> utilization is only 60-80%.
>
> Thanks Stefan
>
> On Fri, Dec 1, 2023 at 16:40, Stefan Kooman wrote:
>
> > On 01-12-2023 08:45, VÔ VI wrote:
> > > Hi community,
> > >
> > > My cluster runs with 10 nodes and 2 nodes went down; sometimes the log
> > > shows slow ops. What is the root cause?
> > > My OSDs are HDDs, and block.db and WAL are on a 500GB SSD per OSD.
> > >
> > > Health check update: 13 slow ops, oldest one blocked for 167 sec, osd.10
> > > has slow ops (SLOW_OPS)
> >
> > Most likely you have a crush rule that spreads objects over hosts as a
> > failure domain. For size=3, min_size=2 (default for replicated pools)
> > you might end up in a situation where the two nodes that are offline
> > host PGs for which the min_size=2 requirement is no longer fulfilled;
> > those PGs will hence be inactive and slow ops will occur.
> >
> > When host is your failure domain, you should not reboot more than one at
> > the same time. If the hosts are somehow organized (different racks,
> > datacenters) you could make a higher level bucket and put your hosts
> > there. And create a crush rule using that bucket type as failure domain,
> > and have your pools use that.
> >
> > Gr. Stefan
> >
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: resharding RocksDB after upgrade to Pacific breaks OSDs

2023-11-03 Thread Josh Baergen
The ticket has been updated, but it's probably important enough to
state on the list as well: The documentation is currently wrong in a
way that running the command as documented will cause this corruption.
The correct command to run is:

    ceph-bluestore-tool \
      --path  \
      --sharding="m(3) p(3,0-12) O(3,0-13)=block_cache={type=binned_lru} L P" \
      reshard

Josh

On Fri, Nov 3, 2023 at 7:58 AM Denis Polom  wrote:
>
> Hi,
>
> yes, exactly. I had to recreate OSD as well because daemon wasn't able
> to start.
>
> It's obviously a bug and should be fixed either in documentation or code.
>
>
> On 11/3/23 11:45, Eugen Block wrote:
> > Hi,
> >
> > this seems like a dangerous operation to me, I tried the same on two
> > different virtual clusters, Reef and Pacific (all upgraded from
> > previous releases). In Reef the reshard fails alltogether and the OSD
> > fails to start, I had to recreate it. In Pacific the reshard reports a
> > successful operation, but the OSD fails to start as well, with the
> > same stack trace as yours. I wasn't aware of this resharding operation
> > yet, but is it really safe? I don't have an idea how to fix, I just
> > recreated the OSDs.
> >
> >
> > Zitat von Denis Polom :
> >
> >> Hi
> >>
> >> we upgraded our Ceph cluster from latest Octopus to Pacific 16.2.14
> >> and then we followed the docs
> >> (https://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref/#rocksdb-sharding
> >> )
> >> to reshard RocksDB on our OSDs.
> >>
> >> Despite resharding reports operation as successful, OSD fails to start.
> >>
> >> # ceph-bluestore-tool  --path /var/lib/ceph/osd/ceph-5/
> >> --sharding="m(3) p(3,0-12) o(3,0-13)=block_cache={type=binned_lru} l
> >> p" reshard
> >> reshard success
> >>
> >> Oct 30 12:44:17 octopus2 ceph-osd[4521]:
> >> /build/ceph-16.2.14/src/kv/RocksDBStore.cc: 1223: FAILED
> >> ceph_assert(recreate_mode)
> >> Oct 30 12:44:17 octopus2 ceph-osd[4521]:  ceph version 16.2.14
> >> (238ba602515df21ea7ffc75c88db29f9e5ef12c9) pacific (stable)
> >> Oct 30 12:44:17 octopus2 ceph-osd[4521]:  1:
> >> (ceph::__ceph_assert_fail(char const*, char const*, int, char
> >> const*)+0x14b) [0x564047cb92b2]
> >> Oct 30 12:44:17 octopus2 ceph-osd[4521]:  2:
> >> /usr/bin/ceph-osd(+0xaa948a) [0x564047cb948a]
> >> Oct 30 12:44:17 octopus2 ceph-osd[4521]:  3:
> >> (RocksDBStore::do_open(std::ostream&, bool, bool,
> >> std::__cxx11::basic_string,
> >> std::allocator > const&)+0x1609) [0x564048794829]
> >> Oct 30 12:44:17 octopus2 ceph-osd[4521]:  4:
> >> (BlueStore::_open_db(bool, bool, bool)+0x601) [0x564048240421]
> >> Oct 30 12:44:17 octopus2 ceph-osd[4521]:  5:
> >> (BlueStore::_open_db_and_around(bool, bool)+0x26b) [0x5640482a5f8b]
> >> Oct 30 12:44:17 octopus2 ceph-osd[4521]:  6:
> >> (BlueStore::_mount()+0x9c) [0x5640482a896c]
> >> Oct 30 12:44:17 octopus2 ceph-osd[4521]:  7: (OSD::init()+0x38a)
> >> [0x564047daacea]
> >> Oct 30 12:44:17 octopus2 ceph-osd[4521]:  8: main()
> >> Oct 30 12:44:17 octopus2 ceph-osd[4521]:  9: __libc_start_main()
> >> Oct 30 12:44:17 octopus2 ceph-osd[4521]:  10: _start()
> >> Oct 30 12:44:17 octopus2 ceph-osd[4521]:  0>
> >> 2023-10-30T12:44:17.088+ 7f4971ed2100 -1 *** Caught signal
> >> (Aborted) **
> >> Oct 30 12:44:17 octopus2 ceph-osd[4521]:  in thread 7f4971ed2100
> >> thread_name:ceph-osd
> >> Oct 30 12:44:17 octopus2 ceph-osd[4521]:  ceph version 16.2.14
> >> (238ba602515df21ea7ffc75c88db29f9e5ef12c9) pacific (stable)
> >> Oct 30 12:44:17 octopus2 ceph-osd[4521]:  1:
> >> /lib/x86_64-linux-gnu/libpthread.so.0(+0x12730) [0x7f4972921730]
> >> Oct 30 12:44:17 octopus2 ceph-osd[4521]:  2: gsignal()
> >> Oct 30 12:44:17 octopus2 ceph-osd[4521]:  3: abort()
> >> Oct 30 12:44:17 octopus2 ceph-osd[4521]:  4:
> >> (ceph::__ceph_assert_fail(char const*, char const*, int, char
> >> const*)+0x19c) [0x564047cb9303]
> >> Oct 30 12:44:17 octopus2 ceph-osd[4521]:  5:
> >> /usr/bin/ceph-osd(+0xaa948a) [0x564047cb948a]
> >> Oct 30 12:44:17 octopus2 ceph-osd[4521]:  6:
> >> (RocksDBStore::do_open(std::ostream&, bool, bool,
> >> std::__cxx11::basic_string,
> >> std::allocator > const&)+0x1609) [0x564048794829]
> >> Oct 30 12:44:17 octopus2 ceph-osd[4521]:  7:
> >> (BlueStore::_open_db(bool, bool, bool)+0x601) [0x564048240421]
> >> Oct 30 12:44:17 octopus2 ceph-osd[4521]:  8:
> >> (BlueStore::_open_db_and_around(bool, bool)+0x26b) [0x5640482a5f8b]
> >> Oct 30 12:44:17 octopus2 ceph-osd[4521]:  9:
> >> (BlueStore::_mount()+0x9c) [0x5640482a896c]
> >> Oct 30 12:44:17 octopus2 ceph-osd[4521]:  10: (OSD::init()+0x38a)
> >> [0x564047daacea]
> >> Oct 30 12:44:17 octopus2 ceph-osd[4521]:  11: main()
> >> Oct 30 12:44:17 octopus2 ceph-osd[4521]:  12: __libc_start_main()
> >> Oct 30 12:44:17 octopus2 ceph-osd[4521]:  13: _start()
> >> Oct 30 12:44:17 octopus2 ceph-osd[4521]:  NOTE: a copy of the
> >> executable, or `objdump -rdS ` is needed to 

[ceph-users] Re: Ceph 16.2.14: how to set mon_rocksdb_options to enable RocksDB compression?

2023-10-16 Thread Josh Baergen
> the resulting ceph.conf inside the monitor container doesn't have 
> mon_rocksdb_options

I don't know where this particular ceph.conf copy comes from, but I
still suspect that this is where this particular option needs to be
set. The reason I think this is that rocksdb mount options are needed
_before_ the mon is able to access any of the centralized conf data,
which I believe is itself stored in rocksdb.

Josh

On Sun, Oct 15, 2023 at 10:29 PM Zakhar Kirpichenko  wrote:
>
> Out of curiosity, I tried setting mon_rocksdb_options via ceph.conf. This 
> didn't work either: ceph.conf gets overridden at monitor start, the resulting 
> ceph.conf inside the monitor container doesn't have mon_rocksdb_options, the 
> monitor starts with no RocksDB compression.
>
> I would appreciate it if someone from the Ceph team could please chip in and 
> suggest a working way to enable RocksDB compression in Ceph monitors.
>
> /Z
>
> On Sat, 14 Oct 2023 at 19:16, Zakhar Kirpichenko  wrote:
>>
>> Thanks for your response, Josh. Our ceph.conf doesn't have anything but the 
>> mon addresses, modern Ceph versions store their configuration in the monitor 
>> configuration database.
>>
>> This works rather well for various Ceph components, including the monitors. 
>> RocksDB options are also applied to monitors correctly, but for some reason 
>> are being ignored.
>>
>> /Z
>>
>> On Sat, 14 Oct 2023, 17:40 Josh Baergen,  wrote:
>>>
>>> Apologies if you tried this already and I missed it - have you tried
>>> configuring that setting in /etc/ceph/ceph.conf (or wherever your conf
>>> file is) instead of via 'ceph config'? I wonder if mon settings like
>>> this one won't actually apply the way you want because they're needed
>>> before the mon has the ability to obtain configuration from,
>>> effectively, itself.
>>>
>>> Josh
>>>
>>> On Sat, Oct 14, 2023 at 1:32 AM Zakhar Kirpichenko  wrote:
>>> >
>>> > I also tried setting RocksDB compression options and deploying a new
>>> > monitor. The monitor started with no RocksDB compression again.
>>> >
>>> > Ceph monitors seem to ignore mon_rocksdb_options set at runtime, at mon
>>> > start and at mon deploy. How can I enable RocksDB compression in Ceph
>>> > monitors?
>>> >
>>> > Any input from anyone, please?
>>> >
>>> > /Z
>>> >
>>> > On Fri, 13 Oct 2023 at 23:01, Zakhar Kirpichenko  wrote:
>>> >
>>> > > Hi,
>>> > >
>>> > > I'm still trying to fight large Ceph monitor writes. One option I
>>> > > considered is enabling RocksDB compression, as our nodes have more than
>>> > > sufficient RAM and CPU. Unfortunately, monitors seem to completely 
>>> > > ignore
>>> > > the compression setting:
>>> > >
>>> > > I tried:
>>> > >
>>> > > - setting ceph config set mon.ceph05 mon_rocksdb_options
>>> > > "write_buffer_size=33554432,compression=kLZ4Compression,level_compaction_dynamic_level_bytes=true",
>>> > > restarting the test monitor. The monitor started with no RocksDB
>>> > > compression:
>>> > >
>>> > > debug 2023-10-13T19:47:00.403+ 7f1cd967a880  4 rocksdb: Compression
>>> > > algorithms supported:
>>> > > debug 2023-10-13T19:47:00.403+ 7f1cd967a880  4 rocksdb:
>>> > > kZSTDNotFinalCompression supported: 0
>>> > > debug 2023-10-13T19:47:00.403+ 7f1cd967a880  4 rocksdb:
>>> > > kXpressCompression supported: 0
>>> > > debug 2023-10-13T19:47:00.403+ 7f1cd967a880  4 rocksdb:
>>> > > kLZ4HCCompression supported: 1
>>> > > debug 2023-10-13T19:47:00.403+ 7f1cd967a880  4 rocksdb:
>>> > > kLZ4Compression supported: 1
>>> > > debug 2023-10-13T19:47:00.403+ 7f1cd967a880  4 rocksdb:
>>> > > kBZip2Compression supported: 0
>>> > > debug 2023-10-13T19:47:00.403+ 7f1cd967a880  4 rocksdb:
>>> > > kZlibCompression supported: 1
>>> > > debug 2023-10-13T19:47:00.403+ 7f1cd967a880  4 rocksdb:
>>> > > kSnappyCompression supported: 1
>>> > > ...
>>> > > debug 2023-10-13T19:47:00.403+ 7f1cd967a880  4 rocksdb:
>>> > >  Options.compression: NoCompression
>>> > > debug 2023-10-13T19:47:00.403+ 7f1cd9

[ceph-users] Re: Ceph 16.2.14: how to set mon_rocksdb_options to enable RocksDB compression?

2023-10-14 Thread Josh Baergen
Apologies if you tried this already and I missed it - have you tried
configuring that setting in /etc/ceph/ceph.conf (or wherever your conf
file is) instead of via 'ceph config'? I wonder if mon settings like
this one won't actually apply the way you want because they're needed
before the mon has the ability to obtain configuration from,
effectively, itself.

Josh

On Sat, Oct 14, 2023 at 1:32 AM Zakhar Kirpichenko  wrote:
>
> I also tried setting RocksDB compression options and deploying a new
> monitor. The monitor started with no RocksDB compression again.
>
> Ceph monitors seem to ignore mon_rocksdb_options set at runtime, at mon
> start and at mon deploy. How can I enable RocksDB compression in Ceph
> monitors?
>
> Any input from anyone, please?
>
> /Z
>
> On Fri, 13 Oct 2023 at 23:01, Zakhar Kirpichenko  wrote:
>
> > Hi,
> >
> > I'm still trying to fight large Ceph monitor writes. One option I
> > considered is enabling RocksDB compression, as our nodes have more than
> > sufficient RAM and CPU. Unfortunately, monitors seem to completely ignore
> > the compression setting:
> >
> > I tried:
> >
> > - setting ceph config set mon.ceph05 mon_rocksdb_options
> > "write_buffer_size=33554432,compression=kLZ4Compression,level_compaction_dynamic_level_bytes=true",
> > restarting the test monitor. The monitor started with no RocksDB
> > compression:
> >
> > debug 2023-10-13T19:47:00.403+ 7f1cd967a880  4 rocksdb: Compression
> > algorithms supported:
> > debug 2023-10-13T19:47:00.403+ 7f1cd967a880  4 rocksdb:
> > kZSTDNotFinalCompression supported: 0
> > debug 2023-10-13T19:47:00.403+ 7f1cd967a880  4 rocksdb:
> > kXpressCompression supported: 0
> > debug 2023-10-13T19:47:00.403+ 7f1cd967a880  4 rocksdb:
> > kLZ4HCCompression supported: 1
> > debug 2023-10-13T19:47:00.403+ 7f1cd967a880  4 rocksdb:
> > kLZ4Compression supported: 1
> > debug 2023-10-13T19:47:00.403+ 7f1cd967a880  4 rocksdb:
> > kBZip2Compression supported: 0
> > debug 2023-10-13T19:47:00.403+ 7f1cd967a880  4 rocksdb:
> > kZlibCompression supported: 1
> > debug 2023-10-13T19:47:00.403+ 7f1cd967a880  4 rocksdb:
> > kSnappyCompression supported: 1
> > ...
> > debug 2023-10-13T19:47:00.403+ 7f1cd967a880  4 rocksdb:
> >  Options.compression: NoCompression
> > debug 2023-10-13T19:47:00.403+ 7f1cd967a880  4 rocksdb:
> >Options.bottommost_compression: Disabled
> >
> > - setting ceph config set mon mon_rocksdb_options
> > "write_buffer_size=33554432,compression=kLZ4Compression,level_compaction_dynamic_level_bytes=true",
> > restarting the test monitor. The monitor started with no RocksDB
> > compression, the same way as above.
> >
> > In each case config options were correctly set and readable with config
> > get. I also found a suggestion in ceph-users (
> > https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/KJM232IHN7FKYI5LODUREN7SVO45BL42/)
> > to set compression in a similar manner. Unfortunately, these options appear
> > to be ignored.
> >
> > How can I enable RocksDB compression in Ceph monitors?
> >
> > I would very much appreciate your advices and comments.
> >
> > Best regards,
> > Zakhar
> >
> >
> >
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph osd down doesn't seem to work

2023-10-03 Thread Josh Baergen
Hi Simon,

If the OSD is actually up, using 'ceph osd down` will cause it to flap
but come back immediately. To prevent this, you would want to 'ceph
osd set noup'. However, I don't think this is what you actually want:

> I'm thinking (but perhaps incorrectly?) that it would be good to keep the OSD 
> down+in, to try to read from it as long as possible

In this case, you actually want it up+out ('ceph osd out XXX'), though
if it's replicated then marking it out will switch primaries around so
that it's not actually read from anymore. It doesn't look like you
have that much recovery backfill left, so hopefully you'll be in a
clean state soon, though you'll have to deal with those 'inconsistent'
and 'recovery_unfound' PGs.
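
Roughly, and only as a sketch - the repair/mark_unfound steps have
data-integrity implications you'll want to read up on first (XXX and <pgid>
are placeholders):

    ceph osd out XXX                          # keep it up, but stop mapping PGs to it
    ceph pg repair <pgid>                     # for the active+clean+inconsistent PG
    ceph pg <pgid> mark_unfound_lost revert   # last resort for the recovery_unfound PG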

Josh

On Tue, Oct 3, 2023 at 10:14 AM Simon Oosthoek  wrote:
>
> Hi
>
> I'm trying to mark one OSD as down, so we can clean it out and replace
> it. It keeps getting medium read errors, so it's bound to fail sooner
> rather than later. When I command ceph from the mon to mark the osd
> down, it doesn't actually do it. When the service on the osd stops, it
> is also marked out and I'm thinking (but perhaps incorrectly?) that it
> would be good to keep the OSD down+in, to try to read from it as long as
> possible. Why doesn't it get marked down and stay that way when I
> command it?
>
> Context: Our cluster is in a bit of a less optimal state (see below),
> this is after one of OSD nodes had failed and took a week to get back up
> (long story). Due to a seriously unbalanced filling of our OSDs we kept
> having to reweight OSDs to keep below the 85% threshold. Several disks
> are starting to fail now (they're 4+ years old and failures are expected
> to occur more frequently).
>
> I'm open to suggestions to help get us back to health_ok more quickly,
> but I think we'll get there eventually anyway...
>
> Cheers
>
> /Simon
>
> 
>
> # ceph -s
>cluster:
>  health: HEALTH_ERR
>  1 clients failing to respond to cache pressure
>  1/843763422 objects unfound (0.000%)
>  noout flag(s) set
>  14 scrub errors
>  Possible data damage: 1 pg recovery_unfound, 1 pg inconsistent
>  Degraded data redundancy: 13795525/7095598195 objects
> degraded (0.194%), 13 pgs degraded, 12 pgs undersized
>  70 pgs not deep-scrubbed in time
>  65 pgs not scrubbed in time
>
>services:
>  mon: 3 daemons, quorum cephmon3,cephmon1,cephmon2 (age 11h)
>  mgr: cephmon3(active, since 35h), standbys: cephmon1
>  mds: 1/1 daemons up, 1 standby
>  osd: 264 osds: 264 up (since 2m), 264 in (since 75m); 227 remapped pgs
>   flags noout
>  rgw: 8 daemons active (4 hosts, 1 zones)
>
>data:
>  volumes: 1/1 healthy
>  pools:   15 pools, 3681 pgs
>  objects: 843.76M objects, 1.2 PiB
>  usage:   2.0 PiB used, 847 TiB / 2.8 PiB avail
>  pgs: 13795525/7095598195 objects degraded (0.194%)
>   54839263/7095598195 objects misplaced (0.773%)
>   1/843763422 objects unfound (0.000%)
>   3374 active+clean
>   195  active+remapped+backfill_wait
>   65   active+clean+scrubbing+deep
>   20   active+remapped+backfilling
>   11   active+clean+snaptrim
>   10   active+undersized+degraded+remapped+backfill_wait
>   2active+undersized+degraded+remapped+backfilling
>   2active+clean+scrubbing
>   1active+recovery_unfound+degraded
>   1active+clean+inconsistent
>
>progress:
>  Global Recovery Event (8h)
>[==..] (remaining: 2h)
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Error EPERM: error setting 'osd_op_queue' to 'wpq': (1) Operation not permitted

2023-09-18 Thread Josh Baergen
My guess is that this is because this setting can't be changed at
runtime, though if so that's a new enforcement behaviour in Quincy
that didn't exist in prior versions.

I think what you want to do is 'config set osd osd_op_queue wpq'
(assuming you want this set for all OSDs) and then restart your OSDs
in a safe manner.
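
A sketch of that sequence (the restart mechanics depend on your deployment):

    ceph config set osd osd_op_queue wpq
    ceph config get osd osd_op_queue      # confirm the stored value
    # then restart OSDs one failure domain at a time, e.g.:
    systemctl restart ceph-osd@<id>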

Josh

On Mon, Sep 18, 2023 at 4:43 AM Nikolaos Dandoulakis  wrote:
>
> Hi,
>
> After upgrading our cluster to 17.2.6 all OSDs appear to have "osd_op_queue": 
> "mclock_scheduler" (used to be wpq). As we see several OSDs reporting 
> unjustifiable heavy load, we would like to revert this back to "wpq" but any 
> attempt yells the following error:
>
> root@store14:~# ceph tell osd.71 config set osd_op_queue wpq
> Error EPERM: error setting 'osd_op_queue' to 'wpq': (1) Operation not 
> permitted
>
> I cannot find anywhere why this is happening, I am guessing another setting 
> needs to be changed as well. Has anybody resolved this?
>
> Best,
> Nick
> The University of Edinburgh is a charitable body, registered in Scotland, 
> with registration number SC005336. Is e buidheann carthannais a th' ann an 
> Oilthigh Dhùn Èideann, clàraichte an Alba, àireamh clàraidh SC005336.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Backfill Performance for

2023-08-08 Thread Josh Baergen
Hi Jonathan,

> - All PGs seem to be backfilling at the same time which seems to be in
> violation of osd_max_backfills. I understand that there should be 6 readers
> and 6 writers at a time, but I'm seeing a given OSD participate in more
> than 6 PG backfills. Is an OSD only considered as backfilling if it is not
> present in both the UP and ACTING groups (e.g. it will have its data
> altered)?

Say you have a PG that looks like this:
1.7ffe   active+remapped+backfill_wait[983,1112,486] 983
[983,1423,1329] 983

If this is a replicated cluster, 983 (the primary OSD) will be the
data read source, and 1423/1329 will of course be targets. If this is
EC, then 1112 will be the read source for the 1423 backfill, and 486
will be the read source for the 1329 backfill. (Unless the PG is
degraded, in which case backfill reads may become normal PG reads.)

Backfill locks are taken on the primary OSD (983 in the example above)
and then all the backfill targets (1329, 1423). Locks are _not_ taken
on read sources for EC backfills, so it's possible to have any number
of backfills reading from a single OSD during EC backfill with no
direct control over this.

> - Some PGs are recovering at a much slower rate than others (some as little
> as kilobytes per second) despite the disks being all of a similar speed. Is
> there some way to dig into why that may be?

Where I would start with this is looking at whether the read sources
or write targets are overloaded at the disk level.

> - In general, the recovery is happening very slowly (between 1 and 5
> objects per second per PG). Is it possible the settings above are too
> aggressive and causing performance degradation due to disk thrashing?

Maybe - which settings are appropriate depend on your configuration
(replicated vs. EC); if you have a replicated pool, then those
settings are probably way too aggressive, and max backfills should be
reduced. If it's EC, the max backfills might be OK. In either case,
the sleep should be increased, but it's unlikely that the sleep
setting is affecting per-PG backfill speed that much (though it could
make it uneven).

> - Currently, all misplaced PGs are backfilling, if I were to change some of
> the settings above (specifically `osd_max_backfills`) would that
> essentially pause backfilling PGs or will those backfills have to end and
> then start over when it is done waiting?

It effectively pauses backfill.
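
For example, to throttle things down without losing progress (PGs pushed back
to backfill_wait resume where they left off):

    ceph config set osd osd_max_backfills 1
    ceph config set osd osd_recovery_sleep_hdd 0.5   # raise the sleep as well if needed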

> - Given that all PGs are backfilling simultaneously there is no way to
> prioritize one PG over another (we have some disks with very high usage
> that we're trying to reduce). Would reducing those max backfills allow for
> proper prioritization of PGs with force-backfill?

There's no great way to affect backfill prioritization. The backfill
lock acquisition I noted above is blocking without backoff, so
high-priority backfills could be waiting in line for a while until
they get a chance to run.

> - We have had some OSDs restart during the process and their misplaced
> object count is now zero but they are incrementing their recovering objects
> bytes. Is that expected and is there a way to estimate when that will
> complete?

Not sure - this gets messy.

FWIW, this situation is one of the reasons why we built
https://github.com/digitalocean/pgremapper (inspired by a procedure
and some tooling that CERN built for the same reason). You might be
interested in 
https://github.com/digitalocean/pgremapper#example---cancel-all-backfill-in-the-system-as-a-part-of-an-augment,
or using cancel-backfill plus an undo-upmaps loop.

Josh
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MON sync time depends on outage duration

2023-07-11 Thread Josh Baergen
Out of curiosity, what is your require_osd_release set to? (ceph osd
dump | grep require_osd_release)

Josh

On Tue, Jul 11, 2023 at 5:11 AM Eugen Block  wrote:
>
> I'm not so sure anymore if that could really help here. The dump-keys
> output from the mon contains 42 million osd_snap prefix entries, 39
> million of them are "purged_snap" keys. I also compared to other
> clusters as well, those aren't tombstones but expected "history" of
> purged snapshots. So I don't think removing a couple of hundred trash
> snapshots will actually reduce the number of osd_snap keys. At least
> doubling the payload_size seems to have a positive impact. The
> compaction during the sync has a negative impact, of course, same as
> not having the mon store on SSDs.
> I'm currently playing with a test cluster, removing all "purged_snap"
> entries from the mon db (not finished yet) to see what that will do
> with the mon and if it will even start correctly. But has anyone done
> that, removing keys from the mon store? Not sure what to expect yet...
>
> Zitat von Dan van der Ster :
>
> > Oh yes, sounds like purging the rbd trash will be the real fix here!
> > Good luck!
> >
> > __
> > Clyso GmbH | Ceph Support and Consulting | https://www.clyso.com
> >
> >
> >
> >
> > On Mon, Jul 10, 2023 at 6:10 AM Eugen Block  wrote:
> >
> >> Hi,
> >> I got a customer response with payload size 4096, that made things
> >> even worse. The mon startup time was now around 40 minutes. My doubts
> >> wrt decreasing the payload size seem confirmed. Then I read Dan's
> >> response again which also mentions that the default payload size could
> >> be too small. So I asked them to double the default (2M instead of 1M)
> >> and am now waiting for a new result. I'm still wondering why this only
> >> happens when the mon is down for more than 5 minutes. Does anyone have
> >> an explanation for that time factor?
> >> Another thing they're going to do is to remove lots of snapshot
> >> tombstones (rbd mirroring snapshots in the trash namespace), maybe
> >> that will reduce the osd_snap keys in the mon db, which then would
> >> increase the startup time. We'll see...
> >>
> >> Zitat von Eugen Block :
> >>
> >> > Thanks, Dan!
> >> >
> >> >> Yes that sounds familiar from the luminous and mimic days.
> >> >> The workaround for zillions of snapshot keys at that time was to use:
> >> >>   ceph config set mon mon_sync_max_payload_size 4096
> >> >
> >> > I actually did search for mon_sync_max_payload_keys, not bytes so I
> >> > missed your thread, it seems. Thanks for pointing that out. So the
> >> > defaults seem to be these in Octopus:
> >> >
> >> > "mon_sync_max_payload_keys": "2000",
> >> > "mon_sync_max_payload_size": "1048576",
> >> >
> >> >> So it could be in your case that the sync payload is just too small to
> >> >> efficiently move 42 million osd_snap keys? Using debug_paxos and
> >> debug_mon
> >> >> you should be able to understand what is taking so long, and tune
> >> >> mon_sync_max_payload_size and mon_sync_max_payload_keys accordingly.
> >> >
> >> > I'm confused, if the payload size is too small, why would decreasing
> >> > it help? Or am I misunderstanding something? But it probably won't
> >> > hurt to try it with 4096 and see if anything changes. If not we can
> >> > still turn on debug logs and take a closer look.
> >> >
> >> >> And in addition to Dan's suggestion, HDDs are not a good choice for
> >> >> RocksDB, which is most likely the reason for this thread; I think
> >> >> that from the 3rd time the database just goes into compaction
> >> >> maintenance
> >> >
> >> > Believe me, I know... but there's not much they can currently do
> >> > about it, quite a long story... But I have been telling them that
> >> > for months now. Anyway, I will make some suggestions and report back
> >> > if it worked in this case as well.
> >> >
> >> > Thanks!
> >> > Eugen
> >> >
> >> > Zitat von Dan van der Ster :
> >> >
> >> >> Hi Eugen!
> >> >>
> >> >> Yes that sounds familiar from the luminous and mimic days.
> >> >>
> >> >> Check this old thread:
> >> >>
> >> https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/F3W2HXMYNF52E7LPIQEJFUTAD3I7QE25/
> >> >> (that thread is truncated but I can tell you that it worked for Frank).
> >> >> Also the even older referenced thread:
> >> >>
> >> https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/M5ZKF7PTEO2OGDDY5L74EV4QS5SDCZTH/
> >> >>
> >> >> The workaround for zillions of snapshot keys at that time was to use:
> >> >>   ceph config set mon mon_sync_max_payload_size 4096
> >> >>
> >> >> That said, that sync issue was supposed to be fixed by way of adding the
> >> >> new option mon_sync_max_payload_keys, which has been around since
> >> nautilus.
> >> >>
> >> >> So it could be in your case that the sync payload is just too small to
> >> >> efficiently move 42 million osd_snap keys? Using debug_paxos and
> >> debug_mon
> >> >> you 

[ceph-users] Re: RBD with PWL cache shows poor performance compared to cache device

2023-06-27 Thread Josh Baergen
On Tue, Jun 27, 2023 at 11:50 AM Matthew Booth  wrote:
> What do you mean by saturated here? FWIW I was using the default cache
> size of 1G and each test run only wrote ~100MB of data, so I don't
> think I ever filled the cache, even with multiple runs.

Ah, my apologies - I saw that fio had been invoked in time-based mode
and assumed it was exceeding the pwl size, but doing the math on the
latency and block size, you should be fine. You're right, I wouldn't
expect that you would be filling the pwl cache, and thus the results
you are getting are Not Good(tm).

Josh
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RBD with PWL cache shows poor performance compared to cache device

2023-06-27 Thread Josh Baergen
Hi Matthew,

We've done a limited amount of work on characterizing the pwl and I think
it suffers the classic problem of some writeback caches in that, once the
cache is saturated, it's actually worse than just being in writethrough.
IIRC the pwl does try to preserve write ordering (unlike the other
writeback/writearound modes) which limits it in the concurrency it can
issue to the backend, which means that even an iodepth=1 test can saturate
the pwl, assuming the backend latency is higher than the pwl latency.

I _think_ that if you were able to devise a burst test with bursts smaller
than the pwl capacity and gaps in between large enough for the cache to
flush, or if you were to ratelimit I/Os to the pwl, that you should see
closer to the lower latencies that you would expect.
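
Something along these lines might approximate such a burst test, reusing the
parameters from your etcd-perf run; the thinktime values are guesses to tune
against your pwl size and flush rate:

    fio --name=pwl_burst --rw=write --ioengine=sync --fdatasync=1 \
        --directory=/var/lib/etcd --bs=8000 --size=100m --time_based=1 --runtime=60 \
        --thinktime=2s --thinktime_blocks=1000
    # ~8 MB bursts (1000 x 8000 bytes) with 2 s pauses for the cache to drain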

Josh

On Tue, Jun 27, 2023 at 9:04 AM Matthew Booth  wrote:

> ** TL;DR
>
> In testing, the write latency performance of a PWL-cache backed RBD
> disk was 2 orders of magnitude worse than the disk holding the PWL
> cache.
>
> ** Summary
>
> I was hoping that PWL cache might be a good solution to the problem of
> write latency requirements of etcd when running a kubernetes control
> plane on ceph. Etcd is extremely write latency sensitive and becomes
> unstable if write latency is too high. The etcd workload can be
> characterised by very small (~4k) writes with a queue depth of 1.
> Throughput, even on a busy system, is normally very low. As etcd is
> distributed and can safely handle the loss of un-flushed data from a
> single node, a local ssd PWL cache for etcd looked like an ideal
> solution.
>
> My expectation was that adding a PWL cache on a local SSD to an
> RBD-backed would improve write latency to something approaching the
> write latency performance of the local SSD. However, in my testing
> adding a PWL cache to an rbd-backed VM increased write latency by
> approximately 4x over not using a PWL cache. This was over 100x more
> than the write latency performance of the underlying SSD.
>
> My expectation was based on the documentation here:
> https://docs.ceph.com/en/quincy/rbd/rbd-persistent-write-log-cache/
>
> “The cache provides two different persistence modes. In
> persistent-on-write mode, the writes are completed only when they are
> persisted to the cache device and will be readable after a crash. In
> persistent-on-flush mode, the writes are completed as soon as it no
> longer needs the caller’s data buffer to complete the writes, but does
> not guarantee that writes will be readable after a crash. The data is
> persisted to the cache device when a flush request is received.”
>
> ** Method
>
> 2 systems, 1 running single-node Ceph Quincy (17.2.6), the other
> running libvirt and mounting a VM’s disk with librbd (also 17.2.6)
> from the first node.
>
> All performance testing is from the libvirt system. I tested write
> latency performance:
>
> * Inside the VM without a PWL cache
> * Of the PWL device directly from the host (direct to filesystem, no VM)
> * Inside the VM with a PWL cache
>
> I am testing with fio. Specifically I am running a containerised test,
> executed with:
>   podman run --volume .:/var/lib/etcd:Z quay.io/openshift-scale/etcd-perf
>
> This container runs:
>   fio --rw=write --ioengine=sync --fdatasync=1
> --directory=/var/lib/etcd --size=100m --bs=8000 --name=etcd_perf
> --output-format=json --runtime=60 --time_based=1
>
> And extracts sync.lat_ns.percentile["99.00"]
>
> ** Results
>
> All results were stable across multiple runs within a small margin of
> error.
>
> * rbd no cache: 1417216 ns
> * pwl cache device: 44288 ns
> * rbd with pwl cache: 5210112 ns
>
> Note that by adding a PWL cache we increase write latency by
> approximately 4x, which is more than 100x than the underlying device.
>
> ** Hardware
>
> 2 x Dell R640s, each with Xeon Silver 4216 CPU @ 2.10GHz and 192G RAM
> Storage under test: 2 x SAMSUNG MZ7KH480HAHQ0D3 SSDs attached to PERC
> H730P Mini (Embedded)
>
> OS installed on rotational disks
>
> N.B. Linux incorrectly detects these disks as rotational, which I
> assume relates to weird behaviour by the PERC controller. I remembered
> to manually correct this on the ‘client’ machine for the PWL cache,
> but at OSD configuration time ceph would have detected them as
> rotational. They are not rotational.
>
> ** Ceph Configuration
>
> CentOS Stream 9
>
>   # ceph version
>   ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy
> (stable)
>
> Single node installation with cephadm. 2 OSDs, one on each SSD.
> 1 pool with size 2
>
> ** Client Configuration
>
> Fedora 38
> Librbd1-17.2.6-3.fc38.x86_64
>
> PWL cache is XFS filesystem with 4k block size, matching the
> underlying device. The filesystem uses the whole block device. There
> is no other load on the system.
>
> ** RBD Configuration
>
> # rbd config image list libvirt-pool/pwl-test | grep cache
> rbd_cachetrue
>  config
> rbd_cache_block_writes_upfront 

[ceph-users] Re: 16.2.13: ERROR:ceph-crash:directory /var/lib/ceph/crash/posted does not exist; please create

2023-06-01 Thread Josh Baergen
Hi Zakhar,

I'm going to guess that it's a permissions issue arising from
https://github.com/ceph/ceph/pull/48804, which was included in 16.2.13. You
may need to change the directory permissions, assuming that you manage the
directories yourself. If this is managed by cephadm or something like that,
then that seems like some sort of missing migration in the upgrade.
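
If it is a permissions problem, something like the following is worth
checking; the uid/gid of 167 for the in-container ceph user is my assumption
to verify against your image:

    ls -ln /var/lib/ceph/<fsid>/crash           # numeric owner as the container sees it
    chown -R 167:167 /var/lib/ceph/<fsid>/crash # re-own the tree if it's wrong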

Josh

On Thu, Jun 1, 2023 at 12:34 PM Zakhar Kirpichenko  wrote:

> Hi,
>
> I'm having an issue with crash daemons on Pacific 16.2.13 hosts. ceph-crash
> throws the following error on all hosts:
>
> ERROR:ceph-crash:directory /var/lib/ceph/crash/posted does not exist;
> please create
> ERROR:ceph-crash:directory /var/lib/ceph/crash/posted does not exist;
> please create
> ERROR:ceph-crash:directory /var/lib/ceph/crash/posted does not exist;
> please create
>
> ceph-crash runs in docker, the container has the directory mounted: -v
>
> /var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/crash:/var/lib/ceph/crash:z
>
> The mount works correctly:
>
> 18:26 [root@ceph02 /var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86]# ls
> -al crash/posted/
> total 8
> drwx-- 2 nobody nogroup 4096 May  6  2021 .
> drwx-- 3 nobody nogroup 4096 May  6  2021 ..
>
> 18:26 [root@ceph02 /var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86]#
> touch crash/posted/a
>
> 18:26 [root@ceph02 /var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86]#
> docker exec -it c0cd2b8022d8 bash
>
> [root@ceph02 /]# ls -al /var/lib/ceph/crash/posted/
> total 8
> drwx-- 2 nobody nobody 4096 Jun  1 18:26 .
> drwx-- 3 nobody nobody 4096 May  6  2021 ..
> -rw-r--r-- 1 root   root  0 Jun  1 18:26 a
>
> I.e. the directory actually exists and is correctly mounted in the crash
> container, yet ceph-crash says it doesn't exist. How can I convince it
> that the directory is there?
>
> Best regards,
> Zakhar
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Best practice for expanding Ceph cluster

2023-05-04 Thread Josh Baergen
Hi Samuel,

Both pgremapper and the CERN scripts were developed against Luminous,
and in my experience 12.2.13 has all of the upmap patches needed for
the scheme that Janne outlined to work. However, if you have a complex
CRUSH map sometimes the upmap balancer can struggle, and I think
that's true of any release so far.
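
A condensed sketch of that scheme, as Janne describes it below (the OSD id
and weight are placeholders, and upmap-remapped.py is the CERN script linked
in his message):

    ceph osd set noout
    ceph osd set norebalance
    # raise each new OSD from its tiny initial weight to its real size, e.g.:
    ceph osd crush reweight osd.123 10.91
    # pin every misplaced PG to where it currently lives; rerun until stable:
    ./upmap-remapped.py | sh
    ceph osd unset norebalance
    ceph osd unset noout
    ceph balancer mode upmap
    ceph balancer on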

Josh

On Thu, May 4, 2023 at 5:58 AM huxia...@horebdata.cn
 wrote:
>
> Janne,
>
> thanks a lot for the detailed scheme. I totally agree that the upmap approach
> would be one of the best methods; however, my current cluster is running
> Luminous 12.2.13 and upmap does not seem to work reliably on Luminous.
>
> samuel
>
>
>
> huxia...@horebdata.cn
>
> From: Janne Johansson
> Date: 2023-05-04 11:56
> To: huxia...@horebdata.cn
> CC: ceph-users
> Subject: Re: [ceph-users] Best practice for expanding Ceph cluster
> Den tors 4 maj 2023 kl 10:39 skrev huxia...@horebdata.cn
> :
> > Dear Ceph folks,
> >
> > I am writing to ask for advice on best practice for expanding a Ceph cluster.
> > We are running an 8-node Ceph cluster and RGW, and would like to add
> > another 10 nodes, each of which has 10x 12TB HDDs. The current 8 nodes hold
> > ca. 400TB of user data.
> >
> > I am wondering whether to add the 10 nodes in one shot and let the cluster
> > rebalance, or to divide this into 5 steps, each of which adds 2 nodes and
> > rebalances step by step? I do not know what the advantages or disadvantages
> > would be of the one-shot scheme vs 5 batches of adding 2 nodes step-by-step.
> >
> > Any suggestions, experience sharing or advice are highly appreciated.
>
> If you add one or two hosts, it will rebalance involving all hosts to
> even out the data. Then you add two more and it has to even all data
> again more or less. Then two more and all old hosts have to redo the
> same work again.
>
> I would suggest that you add all new hosts and make the OSDs start
> with a super-low initial weight (0.0001 or so), which means they will
> be in and up, but not receive any PGs.
>
> Then you set "noout" and "norebalance" and ceph osd crush reweight the
> new OSDs to their correct size, perhaps with a sleep 30 in between or
> so, to let the dust settle after you change weights.
>
> After all new OSDs are of the correct crush weight, there will be a
> lot of PGs misplaced/remapped but not moving. Now you grab one of the
> programs/scripts[1] which talks to upmap and tells it that every
> misplaced PG actually is where you want it to be. You might need to
> run several times, but it usually goes quite fast on the second/third
> run. Even if it never gets 100% of the PGs happy, it is quite
> sufficient if 95-99% are thinking they are at their correct place.
>
> Now, if you enable the ceph balancer (or already have it enabled) in
> upmap mode and unset "noout" and "norebalance" the mgr balancer will
> take a certain amount of PGs (some 3% by default[2] ) and remove the
> temporary "upmap" setting that says the PG is at the right place even
> when it isn't. This means that the balancer takes a small amount of
> PGs, lets them move to where they actually want to be, then picks a
> few more PGs and repeats until the final destination is correct for
> all PGs, evened out on all OSDs as you wanted.
>
> This is the method that I think has the least impact on client IO,
> scrubs and all that, should be quite safe but will take a while in
> calendar time to finish. The best part is that the admin work needed
> comes only in at the beginning, the rest is automatic.
>
> [1] Tools:
> https://raw.githubusercontent.com/HeinleinSupport/cern-ceph-scripts/master/tools/upmap/upmap-remapped.py
> https://github.com/digitalocean/pgremapper
> I think this one works too, haven't tried it:
> https://github.com/TheJJ/ceph-balancer
>
> [2] Percent to have moving at any moment:
> https://docs.ceph.com/en/latest/rados/operations/balancer/#throttling
>
> --
> May the most significant bit of your life be positive.
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: For suggestions and best practices on expanding Ceph cluster and removing old nodes

2023-04-25 Thread Josh Baergen
Hi Samuel,

While the second method would probably work fine in the happy path, if
something goes wrong I think you'll be happier having a uniform
release installed. In general, we've found the backfill experience to
be better on Nautilus than Luminous, so my vote would be for the first
method. Given that your usage is RGW, just note that the OMAP format
change that happens between Luminous and Nautilus can sometimes take a
while.

Josh

On Tue, Apr 25, 2023 at 10:31 AM huxia...@horebdata.cn
 wrote:
>
> Dear Ceph folks,
>
> I would like to ask your advice on the following topic: We have a
> 6-node Ceph cluster (for RGW usage only) running on Luminous 12.2.12, and
> will now add 10 new nodes. Our plan is to phase out the old 6 nodes and run
> the RGW Ceph cluster with the new 10 nodes on Nautilus.
>
> I can think of two ways to achieve the above goal. The first method would be: 
>   1) Upgrade the current 6-node cluster from Luminous 12.2.12 to Nautilus 
> 14.2.22;  2) Expand the cluster with the 10 new nodes, and then re-balance;  
> 3) After rebalance completes, remove the 6 old nodes from the cluster
>
> The second method would skip upgrading the old 6 nodes from Luminous to
> Nautilus, since those nodes will be phased out anyway. But then we would have
> to deal with a hybrid cluster, with 6 nodes on Luminous 12.2.12 and 10 nodes
> on Nautilus, and after rebalancing we would remove the 6 old nodes from the
> cluster.
>
> Any suggestions, advice, or best practice would be highly appreciated.
>
> best regards,
>
>
> Samuel
>
>
>
> huxia...@horebdata.cn
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: pg wait too long when osd restart

2023-03-13 Thread Josh Baergen
(trimming out the dev list and Radoslaw's email)

Hello,

I think the two critical PRs were:
* https://github.com/ceph/ceph/pull/44585 - included in 15.2.16
* https://github.com/ceph/ceph/pull/45655 - included in 15.2.17

I don't have any comments on tweaking those configuration values, and
what safe values would be.

Josh

On Sun, Mar 12, 2023 at 9:43 PM yite gu  wrote:
>
> Hello, Baergen
> Thanks for your reply. The OSD restarts are planned, but my version is 15.2.7,
> so I may have encountered the problem you described. Could you point me to the
> PRs that optimize this mechanism? Besides that, if I don't want to upgrade in
> the near future, is lowering osd_pool_default_read_lease_ratio a good approach?
> For example, 0.4 or 0.2, to stay within the users' tolerance.
>
> Yite Gu
>
> Josh Baergen  于2023年3月10日周五 22:09写道:
>>
>> Hello,
>>
>> When you say "osd restart", what sort of restart are you referring to
>> - planned (e.g. for upgrades or maintenance) or unplanned (OSD
>> hang/crash, host issue, etc.)? If it's the former, then these
>> parameters shouldn't matter provided that you're running a recent
>> enough Ceph with default settings - it's supposed to handle planned
>> restarts with little I/O wait time. There were some issues with this
>> mechanism before Octopus 15.2.17 / Pacific 16.2.8 that could cause
>> planned restarts to wait for the read lease timeout in some
>> circumstances.
>>
>> Josh
>>
>> On Fri, Mar 10, 2023 at 1:31 AM yite gu  wrote:
>> >
>> > Hi all,
>> > osd_heartbeat_grace = 20 and osd_pool_default_read_lease_ratio = 0.8 by
>> > default, so a PG will wait 16s when an OSD restarts in the worst case. This
>> > wait time is too long and is unacceptable for client I/O. I think lowering
>> > osd_pool_default_read_lease_ratio is a good way to address it. Are there any
>> > good suggestions for reducing the PG wait time?
>> >
>> > Best Regard
>> > Yite Gu
>> > ___
>> > ceph-users mailing list -- ceph-users@ceph.io
>> > To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: pg wait too long when osd restart

2023-03-10 Thread Josh Baergen
Hello,

When you say "osd restart", what sort of restart are you referring to
- planned (e.g. for upgrades or maintenance) or unplanned (OSD
hang/crash, host issue, etc.)? If it's the former, then these
parameters shouldn't matter provided that you're running a recent
enough Ceph with default settings - it's supposed to handle planned
restarts with little I/O wait time. There were some issues with this
mechanism before Octopus 15.2.17 / Pacific 16.2.8 that could cause
planned restarts to wait for the read lease timeout in some
circumstances.

Josh

On Fri, Mar 10, 2023 at 1:31 AM yite gu  wrote:
>
> Hi all,
> osd_heartbeat_grace = 20 and osd_pool_default_read_lease_ratio = 0.8 by
> default, so a PG will wait 16s when an OSD restarts in the worst case. This
> wait time is too long and is unacceptable for client I/O. I think lowering
> osd_pool_default_read_lease_ratio is a good way to address it. Are there any
> good suggestions for reducing the PG wait time?
>
> Best Regard
> Yite Gu
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: avg apply latency went up after update from octopus to pacific

2023-02-28 Thread Josh Baergen
Hi Boris,

OK, what I'm wondering is whether
https://tracker.ceph.com/issues/58530 is involved. There are two
aspects to that ticket:
* A measurable increase in the number of bytes written to disk in
Pacific as compared to Nautilus
* The same, but for IOPS

Per the current theory, both are due to the loss of rocksdb log
recycling when using default recovery options in rocksdb 6.8; Octopus
uses version 6.1.2, Pacific uses 6.8.1.

16.2.11 largely addressed the bytes-written amplification, but the
IOPS amplification remains. In practice, whether this results in a
write performance degradation depends on the speed of the underlying
media and the workload, and thus the things I mention in the next
paragraph may or may not be applicable to you.

There's no known workaround or solution for this at this time. In some
cases I've seen that disabling bluefs_buffered_io (which itself can
cause IOPS amplification in some cases) can help; I think most folks
do this by setting it in local conf and then restarting OSDs in order
to gain the config change. Something else to consider is
https://docs.ceph.com/en/quincy/start/hardware-recommendations/#write-caches,
as sometimes disabling these write caches can improve the IOPS
performance of SSDs.
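
If you do end up experimenting with bluefs_buffered_io, it would look
something like this (just a sketch - test it on a subset of OSDs first):

# in ceph.conf on each OSD host:
[osd]
bluefs_buffered_io = false
# or centrally via the config database:
ceph config set osd bluefs_buffered_io false
# then restart the OSDs so they pick up the change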

Josh

On Tue, Feb 28, 2023 at 7:19 AM Boris Behrens  wrote:
>
> Hi Josh,
> we upgraded 15.2.17 -> 16.2.11 and we only use rbd workload.
>
>
>
> Am Di., 28. Feb. 2023 um 15:00 Uhr schrieb Josh Baergen 
> :
>>
>> Hi Boris,
>>
>> Which version did you upgrade from and to, specifically? And what
>> workload are you running (RBD, etc.)?
>>
>> Josh
>>
>> On Tue, Feb 28, 2023 at 6:51 AM Boris Behrens  wrote:
>> >
>> > Hi,
>> > today I did the first update from octopus to pacific, and it looks like the
>> > avg apply latency went up from 1ms to 2ms.
>> >
>> > All 36 OSDs are 4TB SSDs and nothing else changed.
>> > Someone knows if this is an issue, or am I just missing a config value?
>> >
>> > Cheers
>> >  Boris
>> > ___
>> > ceph-users mailing list -- ceph-users@ceph.io
>> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
>
> --
> The self-help group "UTF-8-Probleme" will meet in the large hall this time,
> as an exception.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: avg apply latency went up after update from octopus to pacific

2023-02-28 Thread Josh Baergen
Hi Boris,

Which version did you upgrade from and to, specifically? And what
workload are you running (RBD, etc.)?

Josh

On Tue, Feb 28, 2023 at 6:51 AM Boris Behrens  wrote:
>
> Hi,
> today I did the first update from octopus to pacific, and it looks like the
> avg apply latency went up from 1ms to 2ms.
>
> All 36 OSDs are 4TB SSDs and nothing else changed.
> Someone knows if this is an issue, or am I just missing a config value?
>
> Cheers
>  Boris
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: increasing PGs OOM kill SSD OSDs (octopus) - unstable OSD behavior

2023-02-21 Thread Josh Baergen
Hi Boris,

This sounds a bit like https://tracker.ceph.com/issues/53729.
https://tracker.ceph.com/issues/53729#note-65 might help you diagnose
whether this is the case.

Josh

On Tue, Feb 21, 2023 at 9:29 AM Boris Behrens  wrote:
>
> Hi,
> today I wanted to increase the PGs from 2k -> 4k and random OSDs went
> offline in the cluster.
> After some investigation we saw that the OSDs got OOM killed (I've seen a
> host go from 90GB of used memory to 190GB before the OOM kills happened).
>
> We have around 24 SSD OSDs per host and 128GB/190GB/265GB memory in these
> hosts. All of them experienced OOM kills.
> All hosts are octopus / ubuntu 20.04.
>
> And on every step new OSDs crashed with OOM. (We now set the pg_num/pgp_num
> to 2516 to stop the process).
> The OSD logs do not show anything why this might happen.
> Some OSDs also segfault.
>
> I now started to stop all OSDs on a host and do a "ceph-bluestore-tool
> repair" and a "ceph-kvstore-tool bluestore-kv compact" on all OSDs. This
> takes around 30 minutes for the 8GB OSDs. When I start the OSDs I instantly
> get a lot of slow OPS from all the other OSDs as they come up (the 8TB
> OSDs take around 10 minutes in "load_pgs").
>
> I am unsure what I can do to restore normal cluster performance. Any ideas
> or suggestions or maybe even known bugs?
> Maybe a hint about what I can search for in the logs.
>
> Cheers
>  Boris
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Adding Labels Section to Perf Counters Output

2023-02-01 Thread Josh Baergen
Do the counters need to be moved under a separate key? That would
break anything today that currently tries to parse them. We have quite
a bit of internal monitoring that relies on "perf dump" output, but
it's mostly not output that I would expect to gain labels in general
(e.g. bluestore stats).

Taking a step back, though, from a json perspective, the proposal
seems a bit odd to me. I think what's being proposed is something like
this (using a random stat name that may or may not exist):
"bucket_stats": {
  "labels": { "bucket": "one", }
  "counters": { ... },
}
"bucket_stats": {
  "labels": { "bucket": "two", }
  "counters": { ... },
}

Many standard json parsers (e.g. golang's) won't be able to
meaningfully parse this due to the "bucket_stats" key repeating. They
would expect something like this instead:
"bucket_stats": [
  {
"labels": { "bucket": "one", }
"counters": { ... },
  },
  {
"labels": { "bucket": "two", }
"counters": { ... },
  }
]

Josh

On Tue, Jan 31, 2023 at 8:55 PM Ali Maredia  wrote:
>
> Hi Ceph Developers and Users,
>
> Various upstream developers and I are working on adding labels to perf
> counters (https://github.com/ceph/ceph/pull/48657).
>
> We would like to understand the ramifications of changing the format of the
> json dumped by the `perf dump` command for the Reef Release on users and
> components of Ceph.
>
> As an example given in the PR, currently unlabeled counters are dumped like
> this in comparison with their new labeled counterparts.
>
> "some unlabeled_counter": {
> "put_b": 1048576,
> },
> "some labeled_counter": {
> "labels": {
> "Bucket": "bkt1",
> "User": "user1",
> },
> "counters": {
> "put_b": 1048576,
> },
> },
>
> Here is an example given in the PR of the old style unlabeled counters
> being dumped in the same format as the labeled counters:
>
> "some unlabeled": {
> "labels": {
> },
> "counters": {
> "put_b": 1048576,
> },
> },
> "some labeled": {
> "labels": {
> "Bucket": "bkt1",
> "User": "user1",
> },
> "counters": {
> "put_b": 1048576,
> },
> },
>
> Would users/consumers of these counters be opposed to changing the format?
> Why is this the case?
>
> As far as I know there are ceph-mgr modules related to Prometheus and
> telemetry that are consuming the current unlabeled counters. Also this
> topic will be discussed at the upcoming Ceph Developer Monthly EMEA as well.
>
> Best,
> Ali
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: All pgs unknown

2023-01-29 Thread Josh Baergen
This often indicates that something is up with your mgr process. Based
on ceph status, it looks like both the mgr and mon had recently
restarted. Is that expected?
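
If it's not expected, I'd check the mgr logs on that host and bounce the mgr
daemon; assuming this is a cephadm deployment, something like the following
(unit name built from the fsid and mgr name in your status output):

systemctl restart ceph-ddb7ebd8-65b5-11ed-84d7-22aca0408523@mgr.flucky-server.cupbak.service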

Josh

On Sun, Jan 29, 2023 at 3:36 AM Daniel Brunner  wrote:
>
> Hi,
>
> my ceph cluster started to show HEALTH_WARN; there are no healthy pgs left,
> all are unknown, but it seems my cephfs is still readable. How can I
> investigate this further?
>
> $ sudo ceph -s
>   cluster:
> id: ddb7ebd8-65b5-11ed-84d7-22aca0408523
> health: HEALTH_WARN
> failed to probe daemons or devices
> noout flag(s) set
> Reduced data availability: 339 pgs inactive
>
>   services:
> mon: 1 daemons, quorum flucky-server (age 3m)
> mgr: flucky-server.cupbak(active, since 3m)
> mds: 1/1 daemons up
> osd: 18 osds: 18 up (since 26h), 18 in (since 7w)
>  flags noout
> rgw: 1 daemon active (1 hosts, 1 zones)
>
>   data:
> volumes: 1/1 healthy
> pools:   11 pools, 339 pgs
> objects: 0 objects, 0 B
> usage:   0 B used, 0 B / 0 B avail
> pgs: 100.000% pgs unknown
>  339 unknown
>
>
>
> $ sudo ceph fs status
> cephfs - 2 clients
> ==
> RANK  STATE   MDS ACTIVITY DNSINOS
> DIRS   CAPS
>  0active  cephfs.flucky-server.ldzavv  Reqs:0 /s  61.9k  61.9k
>  17.1k  54.5k
>   POOL TYPE USED  AVAIL
> cephfs_metadata  metadata 0  0
>   cephfs_data  data   0  0
> MDS version: ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757)
> quincy (stable)
>
>
>
> $ docker logs ceph-ddb7ebd8-65b5-11ed-84d7-22aca0408523-mon-flucky-server
> cluster 2023-01-27T12:15:30.437140+ mgr.flucky-server.cupbak
> (mgr.144098) 200 : cluster [DBG] pgmap v189: 339 pgs: 339 unknown; 0 B
> data, 0 B used, 0 B / 0 B avail
>
>
> debug 2023-01-27T12:15:31.995+ 7fa90b3f7700  1
> mon.flucky-server@0(leader).osd
> e50043 _set_new_cache_sizes cache_size:1020054731 inc_alloc: 348127232
> full_alloc: 348127232 kv_alloc: 322961408
>
>
> cluster 2023-01-27T12:15:32.437854+ mgr.flucky-server.cupbak
> (mgr.144098) 201 : cluster [DBG] pgmap v190: 339 pgs: 339 unknown; 0 B
> data, 0 B used, 0 B / 0 B avail
>
>
> cluster 2023-01-27T12:15:32.373735+ osd.9 (osd.9) 123948 : cluster
> [DBG] 9.a deep-scrub starts
>
>
>
> cluster 2023-01-27T12:15:33.013990+ osd.2 (osd.2) 41797 : cluster [DBG]
> 5.6 scrub starts
>
>
>
> cluster 2023-01-27T12:15:33.402881+ osd.9 (osd.9) 123949 : cluster
> [DBG] 9.13 scrub starts
>
>
>
> cluster 2023-01-27T12:15:34.438591+ mgr.flucky-server.cupbak
> (mgr.144098) 202 : cluster [DBG] pgmap v191: 339 pgs: 339 unknown; 0 B
> data, 0 B used, 0 B / 0 B avail
>
>
> cluster 2023-01-27T12:15:35.461575+ osd.9 (osd.9) 123950 : cluster
> [DBG] 7.16 deep-scrub starts
>
>
>
> debug 2023-01-27T12:15:37.005+ 7fa90b3f7700  1
> mon.flucky-server@0(leader).osd
> e50043 _set_new_cache_sizes cache_size:1020054731 inc_alloc: 348127232
> full_alloc: 348127232 kv_alloc: 322961408
>
>
> cluster 2023-01-27T12:15:36.439416+ mgr.flucky-server.cupbak
> (mgr.144098) 203 : cluster [DBG] pgmap v192: 339 pgs: 339 unknown; 0 B
> data, 0 B used, 0 B / 0 B avail
>
>
> cluster 2023-01-27T12:15:36.925368+ osd.2 (osd.2) 41798 : cluster [DBG]
> 7.15 deep-scrub starts
>
>
>
> cluster 2023-01-27T12:15:37.960907+ osd.2 (osd.2) 41799 : cluster [DBG]
> 6.6 scrub starts
>
>
>
> cluster 2023-01-27T12:15:38.440099+ mgr.flucky-server.cupbak
> (mgr.144098) 204 : cluster [DBG] pgmap v193: 339 pgs: 339 unknown; 0 B
> data, 0 B used, 0 B / 0 B avail
>
>
> cluster 2023-01-27T12:15:38.482333+ osd.9 (osd.9) 123951 : cluster
> [DBG] 2.2 scrub starts
>
>
>
> cluster 2023-01-27T12:15:38.959557+ osd.2 (osd.2) 41800 : cluster [DBG]
> 9.47 scrub starts
>
>
>
> cluster 2023-01-27T12:15:39.519980+ osd.9 (osd.9) 123952 : cluster
> [DBG] 4.b scrub starts
>
>
>
> cluster 2023-01-27T12:15:40.440711+ mgr.flucky-server.cupbak
> (mgr.144098) 205 : cluster [DBG] pgmap v194: 339 pgs: 339 unknown; 0 B
> data, 0 B used, 0 B / 0 B avail
>
>
> debug 2023-01-27T12:15:42.012+ 7fa90b3f7700  1
> mon.flucky-server@0(leader).osd
> e50043 _set_new_cache_sizes cache_size:1020054731 inc_alloc: 348127232
> full_alloc: 348127232 kv_alloc: 322961408
>
>
> cluster 2023-01-27T12:15:41.536421+ osd.9 (osd.9) 123953 : cluster
> [DBG] 2.7 scrub starts
>
>
>
> cluster 2023-01-27T12:15:42.441314+ mgr.flucky-server.cupbak
> (mgr.144098) 206 : cluster [DBG] pgmap v195: 339 pgs: 339 unknown; 0 B
> data, 0 B used, 0 B / 0 B avail
>
>
> cluster 2023-01-27T12:15:43.954128+ osd.2 (osd.2) 41801 : cluster [DBG]
> 9.4f scrub starts
>
>
>
> cluster 2023-01-27T12:15:44.441897+ mgr.flucky-server.cupbak
> (mgr.144098) 207 : cluster [DBG] pgmap v196: 339 pgs: 339 unknown; 0 B
> data, 0 B used, 0 B / 0 B avail
>
>
> cluster 2023-01-27T12:15:45.944038+ osd.2 (osd.2) 41802 : cluster [DBG]
> 1.1f deep-scrub starts
>
>
>
> debug 

[ceph-users] Re: Very slow snaptrim operations blocking client I/O

2023-01-27 Thread Josh Baergen
This might be due to tombstone accumulation in rocksdb. You can try to
issue a compact to all of your OSDs and see if that helps (ceph tell
osd.XXX compact). I usually prefer to do this one host at a time just
in case it causes issues, though on a reasonably fast RBD cluster you
can often get away with compacting everything at once.
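
As a sketch, one host at a time (the OSD IDs here are hypothetical - use the
OSDs that live on that host):

for osd in 0 1 2 3; do
    ceph tell osd.$osd compact
done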

Josh

On Fri, Jan 27, 2023 at 6:52 AM Victor Rodriguez
 wrote:
>
> Hello,
>
> Asking for help with an issue. Maybe someone has a clue about what's
> going on.
>
> Using ceph 15.2.17 on Proxmox 7.3. A big VM had a snapshot and I removed
> it. A bit later, nearly half of the PGs of the pool entered snaptrim and
> snaptrim_wait state, as expected. The problem is that such operations
> ran extremely slow and client I/O was nearly nothing, so all VMs in the
> cluster got stuck as they could not I/O to the storage. Taking and
> removing big snapshots is a normal operation that we do often and this
> is the first time I see this issue in any of my clusters.
>
> Disks are all Samsung PM1733 and network is 25G. It gives us plenty of
> performance for the use case and never had an issue with the hardware.
>
> Both disk I/O and network I/O was very low. Still, client I/O seemed to
> get queued forever. Disabling snaptrim (ceph osd set nosnaptrim) stops
> any active snaptrim operation and client I/O resumes back to normal.
> Enabling snaptrim again makes client I/O to almost halt again.
>
> I've been playing with some settings:
>
> ceph tell 'osd.*' injectargs '--osd-max-trimming-pgs 1'
> ceph tell 'osd.*' injectargs '--osd-snap-trim-sleep 30'
> ceph tell 'osd.*' injectargs '--osd-snap-trim-sleep-ssd 30'
> ceph tell 'osd.*' injectargs '--osd-pg-max-concurrent-snap-trims 1'
>
> None really seemed to help. Also tried restarting OSD services.
>
> This cluster was upgraded from 14.2.x to 15.2.17 a couple of months. Is
> there any setting that must be changed which may cause this problem?
>
> I have scheduled a maintenance window, what should I look for to
> diagnose this problem?
>
> Any help is very appreciated. Thanks in advance.
>
> Victor
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [SPAM] Ceph upgrade advice - Luminous to Pacific with OS upgrade

2022-12-06 Thread Josh Baergen
> - you will need to love those filestore OSD’s to Bluestore before hitting 
> Pacific, might even be part of the Nautilus upgrade. This takes some time if 
> I remember correctly.

I don't think this is necessary. It _is_ necessary to convert all
leveldb to rocksdb before upgrading to Pacific, on both mons and any
filestore OSDs.

Quincy will warn you about filestore OSDs, and Reef will no longer
support filestore.

Josh
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Implications of pglog_hardlimit

2022-11-29 Thread Josh Baergen
It's also possible you're running into large pglog entries - any
chance you're running RGW and there's an s3:CopyObject workload
hitting an object that was uploaded with MPU?
https://tracker.ceph.com/issues/56707

If that's the case, you can inject a much smaller value for
osd_min_pg_log_entries and osd_max_pg_log_entries (ceph tell osd.*
config set osd_min_pg_log_entries 500 - repeat for max) to relieve
memory pressure.
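
For completeness, that would be something like the following (on Luminous you
may need the injectargs form instead of 'config set'):

ceph tell 'osd.*' config set osd_min_pg_log_entries 500
ceph tell 'osd.*' config set osd_max_pg_log_entries 500
# Luminous-era alternative:
ceph tell 'osd.*' injectargs '--osd_min_pg_log_entries 500 --osd_max_pg_log_entries 500'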

Josh

On Tue, Nov 29, 2022 at 3:10 PM Frank Schilder  wrote:
>
> Hi, it sounds like you might be affected by the pg_log dup bug:
>
> # Check if any OSDs are affected by the pg dup problem
> sudo -i ceph tell "osd.*" perf dump | grep -e pglog -e "osd\\."
>
> If any osd_pglog_items>>1M check 
> https://www.clyso.com/blog/osds-with-unlimited-ram-growth/
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Gregory Farnum 
> Sent: 29 November 2022 22:25:54
> To: Joshua Timmer
> Cc: ceph-users@ceph.io
> Subject: [ceph-users] Re: Implications of pglog_hardlimit
>
> On Tue, Nov 29, 2022 at 1:18 PM Joshua Timmer 
> wrote:
>
> > I've got a cluster in a precarious state because several nodes have run
> > out of memory due to extremely large pg logs on the osds. I came across
> > the pglog_hardlimit flag which sounds like the solution to the issue,
> > but I'm concerned that enabling it will immediately truncate the pg logs
> > and possibly drop some information needed to recover the pgs. There are
> > many in degraded and undersized states right now as nodes are down. Is
> > it safe to enable the flag in this state? The cluster is running
> > luminous 12.2.13 right now.
>
>
> The hard limit will truncate the log, but all the data goes into the
> backing bluestore/filestore instance at the same time. The pglogs are used
> for two things:
> 1) detecting replayed client operations and sending the same answer back on
> replays, so shorter logs means a shorter time window of detection but
> shouldn’t be an issue;
> 2) enabling log-based recovery of pgs where OSDs with overlapping logs can
> identify exactly which objects have been modified and only moving them.
>
> So if you set the hard limit, it’s possible you’ll induce more backfill as
> fewer logs overlap. But no data will be lost.
> -Greg
>
>
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> >
> >
> >
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Question about quorum

2022-11-03 Thread Josh Baergen
Hi Murilo,

This is briefly referred to by
https://docs.ceph.com/en/octopus/rados/deployment/ceph-deploy-mon/,
but in order to avoid split brain issues it's common that distributed
consensus algorithms require a strict majority in order to maintain
quorum. This is why production deployments of mons should usually be
an odd number: Adding one more mon to an odd number (to end up with an
even number) doesn't provide materially better availability.

Josh

On Thu, Nov 3, 2022 at 1:55 PM Murilo Morais  wrote:
>
> Good afternoon everyone!
>
> I have a lab with 4 mons. I was testing the behavior when a certain number
> of hosts go offline, and as soon as the second one went offline
> everything stopped. It would be nice to have a fifth node to
> ensure that everything keeps working if two fail, but why did everything stop
> with only 2 nodes down, when a cluster of 3 nodes with one down
> would still be working? Is there no way to get this behavior
> with 4 nodes?
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: 1 pg stale, 1 pg undersized

2022-10-27 Thread Josh Baergen
Hi Alexander,

I'd be suspicious that something is up with pool 25. Which pool is
that? ('ceph osd pool ls detail') Knowing the pool and the CRUSH rule
it's using is a good place to start. Then that can be compared to your
CRUSH map (e.g. 'ceph osd tree') to see why Ceph is struggling to map
that PG to a valid up set.

Josh

On Tue, Oct 25, 2022 at 6:45 AM Alexander Fiedler
 wrote:
>
> Hello,
>
> we run a ceph cluster with the following error which came up suddenly without 
> any maintenance/changes:
>
> HEALTH_WARN Reduced data availability: 1 pg stale; Degraded data redundancy: 
> 1 pg undersized
>
> The PG in question is PG 25
>
> Output of ceph pg dump_stuck stale:
>
> PG_STAT  STATE UP  UP_PRIMARY  ACTING   
> ACTING_PRIMARY
> 25.0 stale+active+undersized+remapped  []  -1  [66,64]
>   66
>
> Both acting OSDs and the mons+managers were rebooted. All OSDs in the cluster 
> are up.
>
> Do you have any idea why 1 PG is stuck?
>
> Best regards
>
> Alexander Fiedler
>
>
> --
> imos Gesellschaft fuer Internet-Marketing und Online-Services mbH
> Alfons-Feifel-Str. 9 // D-73037 Goeppingen // Stauferpark Ost
> Tel: 07161 93339- // Fax: 07161 93339-99 // Internet: www.imos.net
>
> Eingetragen im Handelsregister des Amtsgerichts Ulm, HRB 532571
> Vertreten durch die Geschaeftsfuehrer Alfred und Rolf Wallender
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Advice on balancing data across OSDs

2022-10-24 Thread Josh Baergen
Hi Tim,

Ah, it didn't sink in for me at first how many pools there were here.
I think you might be hitting the issue that the author of
https://github.com/TheJJ/ceph-balancer ran into, and thus their
balancer might help in this case.

Josh

On Mon, Oct 24, 2022 at 8:37 AM Tim Bishop  wrote:
>
> Hi Josh,
>
> On Mon, Oct 24, 2022 at 07:20:46AM -0600, Josh Baergen wrote:
> > > I've included the osd df output below, along with pool and crush rules.
> >
> > Looking at these, the balancer module should be taking care of this
> > imbalance automatically. What does "ceph balancer status" say?
>
> # ceph balancer status
> {
> "active": true,
> "last_optimize_duration": "0:00:00.038795",
> "last_optimize_started": "Mon Oct 24 15:35:43 2022",
> "mode": "upmap",
> "optimize_result": "Optimization plan created successfully",
> "plans": []
> }
>
> Looks healthy?
>
> This cluster is on pacific but has been upgraded through numerous
> previous releases, so it is possible some settings have been inherited
> and are not the same defaults as a new cluster.
>
> Tim.
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Advice on balancing data across OSDs

2022-10-24 Thread Josh Baergen
Hi Tim,

> I've included the osd df output below, along with pool and crush rules.

Looking at these, the balancer module should be taking care of this
imbalance automatically. What does "ceph balancer status" say?

Josh
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Iinfinite backfill loop + number of pgp groups stuck at wrong value

2022-10-07 Thread Josh Baergen
As of Nautilus+, when you set pg_num, it actually internally sets
pg(p)_num_target, and then slowly increases (or decreases, if you're
merging) pg_num and then pgp_num until it reaches the target. The
amount of backfill scheduled into the system is controlled by
target_max_misplaced_ratio.
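
If you want to watch the convergence happen, something like this should show it
(pool name taken from your example):

ceph osd pool ls detail | grep wizard_data      # pg_num/pgp_num vs their *_target values
ceph config get mgr target_max_misplaced_ratio  # defaults to 0.05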

Josh

On Fri, Oct 7, 2022 at 3:50 AM Nicola Mori  wrote:
>
> The situation got solved by itself, since probably there was no error. I
> manually increased the number of PGs and PGPs to 128 some days ago, and
> the PGP count was being updated step by step. Actually after a bump from
> 5% to 7% in the count of misplaced objects I noticed that the number of
> PGPs was updated to 126, and after a last bump it is now at 128 with a
> ~4% of misplaced objects currently decreasing.
> Sorry for the noise,
>
> Nicola
>
> On 07/10/22 09:15, Nicola Mori wrote:
> > Dear Ceph users,
> >
> > my cluster is stuck since several days with some PG backfilling. The
> > number of misplaced objects slowly decreases down to 5%, and at that
> > point jumps up again to about 7%, and so on. I found several possible
> > reasons for this behavior. One is related to the balancer, which anyway
> > I think is not operating:
> >
> > # ceph balancer status
> > {
> >  "active": false,
> >  "last_optimize_duration": "0:00:00.000938",
> >  "last_optimize_started": "Thu Oct  6 16:19:59 2022",
> >  "mode": "upmap",
> >  "optimize_result": "Too many objects (0.071539 > 0.05) are
> > misplaced; try again later",
> >  "plans": []
> > }
> >
> > (the lase optimize result is from yesterday when I disabled it, and
> > since then the backfill loop has happened several times).
> > Another possible reason seems to be an imbalance of PG and PGP numbers.
> > Effectively I found such an imbalance on one of my pools:
> >
> > # ceph osd pool get wizard_data pg_num
> > pg_num: 128
> > # ceph osd pool get wizard_data pgp_num
> > pgp_num: 123
> >
> > but I cannot fix it:
> > # ceph osd pool set wizard_data pgp_num 128
> > set pool 3 pgp_num to 128
> > # ceph osd pool get wizard_data pgp_num
> > pgp_num: 123
> >
> > The autoscaler is off for that pool:
> >
> > POOL   SIZE  TARGET SIZERATE  RAW CAPACITY
> > RATIO  TARGET RATIO  EFFECTIVE RATIO  BIAS  PG_NUM  NEW PG_NUM
> > AUTOSCALE  BULK
> > wizard_data   8951G   1.333730697632152.8T
> > 0.0763  1.0 128  off
> > False
> >
> > so I don't understand why the PGP number is stuck at 123.
> > Thanks in advance for any help,
> >
> > Nicola
>
> --
> Nicola Mori, Ph.D.
> INFN sezione di Firenze
> Via Bruno Rossi 1, 50019 Sesto F.no (Italy)
> +390554572660
> m...@fi.infn.it
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Upgrade from Octopus to Quincy fails on third ceph-mon

2022-09-28 Thread Josh Baergen
FWIW, this is what the Quincy release notes say: LevelDB support has
been removed. WITH_LEVELDB is no longer a supported build option.
Users should migrate their monitors and OSDs to RocksDB before
upgrading to Quincy.

Josh

On Wed, Sep 28, 2022 at 4:20 AM Eugen Block  wrote:
>
> Hi,
>
> there was a thread about deprecating leveldb [1], but I didn't get the
> impression that it already has been deprecated. But the thread
> mentions that it's not tested anymore, so that might explain it. To
> confirm that you use leveldb you can run:
>
> cat /var/lib/ceph/mon/ceph-/kv_backend
>
> So you already have successfully upgraded other MONs, what kv_backend
> do they use? If this is the last one with leveldb you can probably
> move the old store content and recreate an empty MON.
>
> [1]
> https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/K4OSAA4AJS2V7FQI6GNCKCK3IRQDBQRS/
>
> Zitat von "Ackermann, Christoph" :
>
> > Hello List,
> >
> > I'm in the process of upgrading our "non cephadm" cluster from Octopus to
> > Quincy. It fails/gets stuck on the third ceph-mon, ceph1n021, with a strange error:
> >
> > 2022-09-28T11:04:27.691+0200 7f8681543880 -1 _open error initializing
> > leveldb db back storage in /var/lib/ceph/mon/ceph-ceph1n021/store.db
> >
> > This monitor contains a lot of ldb files; I'm wondering whether we no longer
> > use leveldb at all...
> >
> > [root@ceph1n021 ~]# ls  /var/lib/ceph/mon/ceph-ceph1n021/store.db/
> > 3251327.ldb  5720254.ldb  6568574.ldb  6652800.ldb  6726397.ldb
> >  6726468.ldb  6726623.ldb  6726631.ldb  6726638.ldb  6726646.ldb
> >  6726653.ldb  IDENTITY
> > 3251520.ldb  6497196.ldb  6575398.ldb  6654280.ldb  6726398.ldb
> >  6726469.ldb  6726624.ldb  6726632.ldb  6726639.ldb  6726647.ldb
> >  6726654.ldb  LOCK
> > 3251566.ldb  6517010.ldb  6595757.ldb  668.ldb  6726399.ldb
> >  6726588.ldb  6726627.ldb  6726634.ldb  6726642.ldb  6726648.ldb
> >  6726655.ldb  MANIFEST-5682438
> > 3251572.ldb  6523701.ldb  6601653.ldb  6699521.ldb  6726400.ldb
> >  6726608.ldb  6726628.ldb  6726635.ldb  6726643.ldb  6726649.ldb
> >  6726656.ldb  OPTIONS-05
> > 3251583.ldb  6543819.ldb  6624261.ldb  6706116.ldb  6726401.ldb
> >  6726618.log  6726629.ldb  6726636.ldb  6726644.ldb  6726650.ldb
> >  6726657.ldb
> > 3251584.ldb  6549696.ldb  6627961.ldb  6725307.ldb  6726467.ldb
> >  6726622.ldb  6726630.ldb  6726637.ldb  6726645.ldb  6726651.ldb  CURRENT
> >
> > All other ceph-mon "store.db" folders contain only the expected files, like:
> >
> > [root@ceph1n020 ~]# ls -l  /var/lib/ceph/mon/ceph-ceph1n020/store.db/
> > total 153252
> > -rw---. 1 ceph ceph 11230512 Sep 28 05:13 1040392.log
> > -rw---. 1 ceph ceph 67281589 Sep 28 05:11 1040394.sst
> > -rw---. 1 ceph ceph 40121324 Sep 28 05:11 1040395.sst
> > -rw---. 1 ceph ceph   16 Aug 19 06:29 CURRENT
> > -rw---. 1 ceph ceph   37 Feb 21  2022 IDENTITY
> > -rw-r--r--. 1 ceph ceph0 Feb 21  2022 LOCK
> > -rw---. 1 ceph ceph  8465618 Sep 28 05:11 MANIFEST-898389
> > -rw---. 1 ceph ceph 4946 Aug 19 04:51 OPTIONS-898078
> > -rw---. 1 ceph ceph 4946 Aug 19 06:29 OPTIONS-898392
> >
> >
> > "mon": {
> > "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4)
> > octopus (stable)": 3,
> > "ceph version 17.2.3 (dff484dfc9e19a9819f375586300b3b79d80034d)
> > quincy (stable)":(ceph-mon@ceph1n011 and ceph-mon@ceph1n012)
> >
> > Is it safe to go ahead and restart the rest of these monitors (ceph1n019
> > and ceph1n020), and what can we do to fix the errors on ceph-mon@ceph1n021?
> >
> > Best regards,
> > Christoph
> >
> >
> >
> > Christoph Ackermann | System Engineer
> > INFOSERVE GmbH | Am Felsbrunnen 15 | D-66119 Saarbrücken
> > Fon +49 (0)681 88008-59 | Fax +49 (0)681 88008-33 | c.ackerm...@infoserve.de
> > | www.infoserve.de
> > INFOSERVE Datenschutzhinweise: www.infoserve.de/datenschutz
> > Handelsregister: Amtsgericht Saarbrücken, HRB 11001 | Erfüllungsort:
> > Saarbrücken
> > Geschäftsführer: Dr. Stefan Leinenbach | Ust-IdNr.: DE168970599
> >
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Balancer Distribution Help

2022-09-23 Thread Josh Baergen
Hey Wyll,

> $ pgremapper cancel-backfill --yes   # to stop all pending operations
> $ placementoptimizer.py balance --max-pg-moves 100 | tee upmap-moves
> $ bash upmap-moves
>
> Repeat the above 3 steps until balance is achieved, then re-enable the 
> balancer and unset the "no" flags set earlier?

You don't want to run cancel-backfill after placementoptimizer,
otherwise it will undo the balancing backfill.
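
In other words, roughly (same commands as yours, just with cancel-backfill run
only once up front):

pgremapper cancel-backfill --yes    # once, to zero out the pending backfill
# then repeat until balanced:
placementoptimizer.py balance --max-pg-moves 100 | tee upmap-moves
bash upmap-moves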

Josh
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Question about recovery priority

2022-09-23 Thread Josh Baergen
Hi Fulvio,

> leads to a much shorter and less detailed page, and I assumed Nautilus
> was far behind Quincy in managing this...

The only major change I'm aware of between Nautilus and Quincy is that
in Quincy the mClock scheduler is able to automatically tune up/down
backfill parameters to achieve better speed and/or balance with client
I/O. The reservation mechanics themselves are unchanged.

> Thanks for "pgremapper", will give it a try once I have finished current
> data movement: will it still work after I upgrade to Pacific?

We're not aware of any Pacific incompatibilities at this time (we've
tested it there and community members have used it against Pacific),
though the tool has most heavily been used on Luminous and Nautilus,
as the README implies.

> You are correct, it would be best to drain OSDs cleanly, and I see
> pgremapper has an option for this, great!

Despite its name, I don't usually recommend using the "drain" command
for draining a batch of OSDs. Confusing, I know! "Drain" is best used
when you intend to move the data back afterwards, and if you give it
multiple targets, it won't balance data across those targets. The
reason for this is that "drain" doesn't pay attention to the
CRUSH-preferred PG location or target fullness, and thus it can make
suboptimal placement choices.

For your usecase, I would recommend using a downweight of OSDS on host
to 0.001 (can't be 0 - upmaps won't work) -> cancel-backfill (to map
data back to the host) -> undo-upmaps in a loop to optimally drain the
host.
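
As a rough sketch (the OSD IDs are placeholders for the OSDs on the host being
drained; check pgremapper's help for the exact undo-upmaps invocation):

for osd in 10 11 12; do ceph osd crush reweight osd.$osd 0.001; done
pgremapper cancel-backfill --yes   # pins the PGs where they are via upmaps
# then run 'pgremapper undo-upmaps' against those OSDs in a loop,
# letting each batch of backfill complete before starting the next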

Josh
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Question about recovery priority

2022-09-22 Thread Josh Baergen
Hi Fulvio,

https://docs.ceph.com/en/quincy/dev/osd_internals/backfill_reservation/
describes the prioritization and reservation mechanism used for
recovery and backfill. AIUI, unless a PG is below min_size, all
backfills for a given pool will be at the same priority.
force-recovery will modify the PG priority but doing so can have a
very delayed effect because a given backfill can be waiting behind a
bunch of other backfills that have acquired partial reservations,
which in turn are waiting behind other backfills that have partial
reservations, etc. etc. Once one is doing degraded backfill, they've
lost a lot of control over their system.

Rather than ripping out hosts like you did here, operators that want
to retain control will drain hosts without degradation.
https://github.com/digitalocean/pgremapper is one tool that can help
with this, though depending on the size of the system one can
sometimes simply downweight the host and then wait.

Josh

On Thu, Sep 22, 2022 at 6:35 AM Fulvio Galeazzi  wrote:
>
> Hello all,
>   taking advantage of the redundancy of my EC pool, I destroyed a
> couple of servers in order to reinstall them with a new operating system.
>I am on Nautilus (but will soon move to Pacific), and today I am
> not in "emergency mode": this is just for my education.  :-)
>
> "ceph pg dump" shows a couple pg's with 3 missing chunks, some other
> with 2, several with 1 missing chunk: that's fine and expected.
> Having looked at it for a while, I think I understand the recovery queue
> is unique: there is no internal higher priority for 3-missing-chunks PGs
> wrt 1-missing-chunk PGs, right?
> I tried to issue "ceph pg force-recovery" on the few worst-degraded PGs
> but, apparently, numbers of 3-missing 2-missing and 1-missing are going
> down at the same relative speed.
> Is this expected? Can I do something to "guide" the process?
>
> Thanks for your hints
>
> Fulvio
>
> --
> Fulvio Galeazzi
> GARR-CSD Department
> skype: fgaleazzi70
> tel.: +39-334-6533-250
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: force-create-pg not working

2022-09-20 Thread Josh Baergen
Hi Jesper,

Given that the PG is marked recovery_unfound, I think you need to
follow 
https://docs.ceph.com/en/quincy/rados/troubleshooting/troubleshooting-pg/#unfound-objects.
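
In short, something like the following (pg ID from your output; only mark the
objects lost once you've accepted that their data will come from backup):

ceph pg 20.13f list_unfound
ceph pg 20.13f mark_unfound_lost revert    # or 'delete' if no prior version exists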

Josh

On Tue, Sep 20, 2022 at 12:56 AM Jesper Lykkegaard Karlsen
 wrote:
>
> Dear all,
>
> System: latest Octopus, 8+3 erasure Cephfs
>
> I have a PG that has been driving me crazy.
> It had gotten into a bad state after heavy backfilling, combined with OSDs
> going down in turn.
>
> State is:
>
> active+recovery_unfound+undersized+degraded+remapped
>
> I have tried repairing it with ceph-objectstore-tool, but no luck so far.
> Given the time recovery takes this way and since data are under backup, I 
> thought that I would do the "easy" approach instead and:
>
>   *   scan pg_files with cephfs-data-scan
>   *   delete data beloging to that pool
>   *   recreate PG with "ceph osd force-create-pg"
>   *   restore data
>
> However, this has turned out not to be so easy after all.
>
> ceph osd force-create-pg 20.13f --yes-i-really-mean-it
>
> seems to be accepted well enough with "pg 20.13f now creating, ok", but then 
> nothing happens.
> Issuing the command again just gives a "pg 20.13f already creating" response.
>
> If I restart the primary OSD, then the pending force-create-pg disappears.
>
> I read that this could be due to crush map issue, but I have checked and that 
> does not seem to be the case.
>
> Would it, for instance, be possible to do the force-create-pg manually with 
> something like this?:
>
>   *   set nobackfill and norecovery
>   *   delete the pgs shards one by one
>   *   unset nobackfill and norecovery
>
>
> Any idea on how to proceed from here is most welcome.
>
> Thanks,
> Jesper
>
>
> --
> Jesper Lykkegaard Karlsen
> Scientific Computing
> Centre for Structural Biology
> Department of Molecular Biology and Genetics
> Aarhus University
> Universitetsbyen 81
> 8000 Aarhus C
>
> E-mail: je...@mbg.au.dk
> Tlf:+45 50906203
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: config db host filter issue

2021-10-20 Thread Josh Baergen
Hey Richard,

On Tue, Oct 19, 2021 at 8:37 PM Richard Bade  wrote:
> user@cstor01 DEV:~$ sudo ceph config set osd/host:cstor01 osd_max_backfills 2
> user@cstor01 DEV:~$ sudo ceph config get osd.0 osd_max_backfills
> 2
> ...
> Are others able to reproduce?

Yes, we've found the same thing on Nautilus. root-based filtering
works, but nothing else that we've tried so far. We were going to
investigate at some point whether this is fixed in Octopus/Pacific
before filing a ticket.
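
For what it's worth, the only mask form we've had success with looks like this
(assuming your OSDs sit under the 'default' root):

ceph config set osd/root:default osd_max_backfills 2
ceph config get osd.0 osd_max_backfills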

Josh
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Tool to cancel pending backfills

2021-10-01 Thread Josh Baergen
Hi Peter,

> When I check for circles I found that running the upmap balancer alone never 
> seems to create
> any kind of circle in the graph

By a circle, do you mean something like this?
pg 1.a: 1->2 (upmap to put a chunk on 2 instead of 1)
pg 1.b: 2->3
pg 1.c: 3->1

If so, then it's not surprising that the upmap balancer wouldn't
create this situation by itself, since there's no reason for this set
of upmaps to exist purely for balance reasons. I don't think the
balancer needs any explicit code to avoid the situation because of
this.

> Running pgremapper + balancer created circles with sometimes several dozen 
> nodes. I would update the docs of the pgremapper
> to warn about this fact and guide the users to use undo-upmap to slowly 
> remove the upmaps created by cancel-backfill.

This is again not surprising, since cancel-backfill will do whatever's
necessary to undo a set of CRUSH changes (and some CRUSH changes
regularly lead to movement cycles like this), and then using the upmap
balancer will only make enough changes to achieve balance, not undo
everything that's there.

> It might be a nice addition to pgremapper to add an option to optimze the 
> upmap table.

What I'm still missing here is the value in this. Are there
demonstrable problems presented by a large upmap exception table (e.g.
performance or operational)?

Josh
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Tool to cancel pending backfills

2021-09-27 Thread Josh Baergen
> I have a question regarding the last step. It seems to me that the ceph 
> balancer is not able to remove the upmaps
> created by pgremapper, but instead creates new upmaps to balance the pgs 
> among osds.

The balancer will prefer to remove existing upmaps[1], but it's not
guaranteed. The upmap has to exist between the source and target OSD
already decided on by the balancer in order for this to happen. The
reality, though, is that the upmap balancer will need to create many
upmap exception table entries to balance any sizable system.

Like you, I would prefer to have as few upmap exception table entries
as possible in a system (fewer surprises when an OSD fails), but we
regularly run systems that have thousands of entries without any
discernible impact and haven't had any major operational issues that
result from it, except for on really old systems that are just awful
to work with in the first place.

Josh

[1] I think this is the implementation:
https://github.com/ceph/ceph/blob/bc8c846b36288ff7ac65005087b0dda0e4b857f4/src/osd/OSDMap.cc#L4794-L4832
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Is it normal Ceph reports "Degraded data redundancy" in normal use?

2021-09-16 Thread Josh Baergen
> I assume it's the balancer module. If you write lots of data quickly
> into the cluster the distribution can vary and the balancer will try
> to even out the placement.

The balancer won't cause degradation, only misplaced objects.

> Degraded data redundancy: 260/11856050 objects degraded
> (0.014%), 1 pg degraded

That status definitely indicates that something is wrong. Check your
cluster logs on your mons (/var/log/ceph/ceph.log) for the cause; my
guess is that you have OSDs flapping (rapidly going down and up again)
due to either overload (disk or network) or some sort of
misconfiguration.

Josh
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephfs small files expansion

2021-09-14 Thread Josh Baergen
Hey Seb,

> I have a test cluster on which I created rbd and cephfs pools (octopus). When
> I copy a directory containing many small files to an rbd pool, the USED column
> of the ceph df command looks normal; on cephfs, on the other hand, USED
> looks really abnormal. I tried changing the block size
> (bluestore_min_alloc_size) but it didn't change anything. Would the solution be
> to re-create the pool, or the OSDs outright?

bluestore_min_alloc_size has an effect only at OSD creation time; if
you changed it after creating the OSDs, it will have had no effect
yet. If your pool is on HDDs and this is pre-Pacific, then the default
of 64k will have a huge amplification effect for small objects.
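
If you want to confirm what new OSDs would be created with, you can check the
per-device-class defaults like this (the value baked into an existing OSD only
changes if you redeploy it):

ceph config get osd bluestore_min_alloc_size_hdd
ceph config get osd bluestore_min_alloc_size_ssd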

Josh
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Mon-map inconsistency?

2021-09-06 Thread Josh Baergen
Hi Melanie,

On Mon, Sep 6, 2021 at 10:06 AM Desaive, Melanie
 wrote:
> When I execute "ceph mon_status --format json-pretty" from our 
> ceph-management VM, the correct mon nodes are returned.
>
> But when I execute "ceph daemon osd.xx config show | grep mon_host" on the 
> respective storage node the old mon node IPs are returned.
>
> I am now unsure, that if I change more mon nodes, the information known to 
> the OSDs could become invalid one after the other and we could run into heavy 
> problems?

"config show" is showing you the mon IPs read from the OSDs'
ceph.conf, and is what is used to initially connect to the mons. After
that, my understanding is that those IPs don't matter as the OSDs will
use the IPs from the mons for further connections/communication.
However, I'm not certain what happens if, for example, all of your
mons were to go down for a period of time; do the OSDs use the last
monmap for reconnecting to the mons or do they revert to using the
configured mon IPs?

At the very least, you should be fine to replace all of your mons and
update each ceph.conf with the new info without needing to restart the
OSDs. After that it may be wise to restart the OSDs to both update the
configured mon IPs as well as test to make sure that they can
reconnect to the new mons without issue in case of a future outage.
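
Concretely, that would be something like the following on each OSD host (the
IPs are placeholders, and the restart command depends on how your OSDs are
deployed):

ceph mon dump    # confirm the cluster-side monmap lists the new mons
# then in /etc/ceph/ceph.conf on each host:
# [global]
# mon_host = 10.0.0.11,10.0.0.12,10.0.0.13
# and restart the OSDs on that host one at a time, e.g.:
systemctl restart ceph-osd@12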

Josh
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: After adding New Osd's, Pool Max Avail did not changed.

2021-09-01 Thread Josh Baergen
Well, if I read your df output correctly, your emptiest SSD is 22%
full and the fullest is 55%. That's a huge spread, and max-avail takes
imbalances like this into account. Assuming triple-replication, a
max-avail of 3.5 TiB makes sense given this imbalance.

Similarly, OSD 209 has 65 PGs on it, whereas some OSDs have 104 PGs on
them. At least within a host, the upmap balancer should be capable of
bringing the PG count within a variance of 1 PG or so (if configured
to do so). I admit that I don't have experience with class-based
rules, though, so it's possible that the upmap balancer also has
limitations when it comes to this sort of rule.

Having said that, PG count imbalance isn't the only source of
imbalance here; OSD 19 is 45% full with 102 PGs whereas OSD 208 is 23%
with 95 PGs; it would seem that there is a data imbalance on a per-PG
basis, or perhaps that's just an OSD needing compaction to clean up
some space or something like that. Not sure.

Josh

On Wed, Sep 1, 2021 at 8:19 AM mhnx  wrote:
>
> I've tried the upmap balancer but the pg distribution was not good. I think
> crush-compat is way better with Nautilus, and it's also the default option. As
> you can see in the df tree, the pg distribution is not that bad. Also, max
> avail never changed.
>
> If that's not the reason for max avail, then what is it?
>
> Maybe I should create a new pool and check its max avail. If the new pool's max
> avail is greater than the other pools', then the ceph recalculation is buggy or
> needs to be triggered somehow.
>
> 1 Eyl 2021 Çar 17:07 tarihinde Josh Baergen  şunu 
> yazdı:
>>
>> Googling for that balancer error message, I came across
>> https://tracker.ceph.com/issues/22814, which was closed/wont-fix, and
>> some threads that claimed that class-based crush rules actually use
>> some form of shadow trees in the background. I'm not sure how accurate
>> that is.
>>
>> The only suggestion I have, which is what was also suggested in one of
>> the above threads, is to use the upmap balancer instead if possible.
>>
>> Josh
>>
>> On Wed, Sep 1, 2021 at 2:38 AM mhnx  wrote:
>> >
>> > ceph osd crush tree (I only have one subtree and its root default)
>> > ID  CLASS WEIGHT (compat)  TYPE NAME
>> >  -1   2785.87891   root default
>> >  -3280.04803 280.04803 host NODE-1
>> >   0   hdd   14.60149  14.60149 osd.0
>> >  19   ssd0.87320   0.87320 osd.19
>> > 208   ssd0.87329   0.87329 osd.208
>> > 209   ssd0.87329   0.87329 osd.209
>> >  -7280.04803 280.04803 host NODE-2
>> >  38   hdd   14.60149  14.60149 osd.38
>> >  39   ssd0.87320   0.87320 osd.39
>> > 207   ssd0.87329   0.87329 osd.207
>> > 210   ssd0.87329   0.87329 osd.210
>> > -10280.04803 280.04803 host NODE-3
>> >  58   hdd   14.60149  14.60149 osd.58
>> >  59   ssd0.87320   0.87320 osd.59
>> > 203   ssd0.87329   0.87329 osd.203
>> > 211   ssd0.87329   0.87329 osd.211
>> > -13280.04803 280.04803 host NODE-4
>> >  78   hdd   14.60149  14.60149 osd.78
>> >  79   ssd0.87320   0.87320 osd.79
>> > 206   ssd0.87329   0.87329 osd.206
>> > 212   ssd0.87329   0.87329 osd.212
>> > -16280.04803 280.04803 host NODE-5
>> >  98   hdd   14.60149  14.60149 osd.98
>> >  99   ssd0.87320   0.87320 osd.99
>> > 205   ssd0.87329   0.87329 osd.205
>> > 213   ssd0.87329   0.87329 osd.213
>> > -19265.44662 265.44662 host NODE-6
>> > 118   hdd   14.60149  14.60149 osd.118
>> > 114   ssd0.87329   0.87329 osd.114
>> > 200   ssd0.87329   0.87329 osd.200
>> > 214   ssd0.87329   0.87329 osd.214
>> > -22280.04803 280.04803 host NODE-7
>> > 138   hdd   14.60149  14.60149 osd.138
>> > 139   ssd0.87320   0.87320 osd.139
>> > 204   ssd0.87329   0.87329 osd.204
>> > 215   ssd0.87329   0.87329 osd.215
>> > -25280.04810 280.04810 host NODE-8
>> > 158   hdd   14.60149  14.60149 osd.158
>> > 119   ssd0.87329   0.87329 osd.119
>> > 159   ssd0.87329   0.87329 osd.159
>> > 216   ssd0.87329   0.87329 osd.216
>> > -28280.04810 280.04810 host NODE-9
>> > 178   hdd   14.60149  14.60149 osd.178
>> > 179   ssd0.87329   0.87329   

[ceph-users] Re: After adding New Osd's, Pool Max Avail did not changed.

2021-09-01 Thread Josh Baergen
Googling for that balancer error message, I came across
https://tracker.ceph.com/issues/22814, which was closed/wont-fix, and
some threads that claimed that class-based crush rules actually use
some form of shadow trees in the background. I'm not sure how accurate
that is.

The only suggestion I have, which is what was also suggested in one of
the above threads, is to use the upmap balancer instead if possible.

Josh

On Wed, Sep 1, 2021 at 2:38 AM mhnx  wrote:
>
> ceph osd crush tree (I only have one subtree and its root default)
> ID  CLASS WEIGHT (compat)  TYPE NAME
>  -1   2785.87891   root default
>  -3280.04803 280.04803 host NODE-1
>   0   hdd   14.60149  14.60149 osd.0
>  19   ssd0.87320   0.87320 osd.19
> 208   ssd0.87329   0.87329 osd.208
> 209   ssd0.87329   0.87329 osd.209
>  -7280.04803 280.04803 host NODE-2
>  38   hdd   14.60149  14.60149 osd.38
>  39   ssd0.87320   0.87320 osd.39
> 207   ssd0.87329   0.87329 osd.207
> 210   ssd0.87329   0.87329 osd.210
> -10280.04803 280.04803 host NODE-3
>  58   hdd   14.60149  14.60149 osd.58
>  59   ssd0.87320   0.87320 osd.59
> 203   ssd0.87329   0.87329 osd.203
> 211   ssd0.87329   0.87329 osd.211
> -13280.04803 280.04803 host NODE-4
>  78   hdd   14.60149  14.60149 osd.78
>  79   ssd0.87320   0.87320 osd.79
> 206   ssd0.87329   0.87329 osd.206
> 212   ssd0.87329   0.87329 osd.212
> -16280.04803 280.04803 host NODE-5
>  98   hdd   14.60149  14.60149 osd.98
>  99   ssd0.87320   0.87320 osd.99
> 205   ssd0.87329   0.87329 osd.205
> 213   ssd0.87329   0.87329 osd.213
> -19265.44662 265.44662 host NODE-6
> 118   hdd   14.60149  14.60149 osd.118
> 114   ssd0.87329   0.87329 osd.114
> 200   ssd0.87329   0.87329 osd.200
> 214   ssd0.87329   0.87329 osd.214
> -22280.04803 280.04803 host NODE-7
> 138   hdd   14.60149  14.60149 osd.138
> 139   ssd0.87320   0.87320 osd.139
> 204   ssd0.87329   0.87329 osd.204
> 215   ssd0.87329   0.87329 osd.215
> -25280.04810 280.04810 host NODE-8
> 158   hdd   14.60149  14.60149 osd.158
> 119   ssd0.87329   0.87329 osd.119
> 159   ssd0.87329   0.87329 osd.159
> 216   ssd0.87329   0.87329 osd.216
> -28280.04810 280.04810 host NODE-9
> 178   hdd   14.60149  14.60149 osd.178
> 179   ssd0.87329   0.87329 osd.179
> 201   ssd0.87329   0.87329 osd.201
> 217   ssd0.87329   0.87329 osd.217
> -31280.04803 280.04803 host NODE-10
> 180   hdd   14.60149  14.60149 osd.180
> 199   ssd0.87320   0.87320 osd.199
> 202   ssd0.87329   0.87329 osd.202
> 218   ssd0.87329   0.87329 osd.218
>
> This pg "6.dc" is on 199,213,217 OSD's.
>
> 6.dc812  00 0   0   1369675264
>0  0 3005 3005active+clean 2021-08-31 
> 16:36:06.64520832265'415965  32265:287175109
> [199,213,217]
>
> ceph osd df tree | grep "CLASS\|ssd" | grep ".199\|.213\|217"
> 199   ssd0.87320  1.0 894 GiB 281 GiB 119 GiB 159 GiB 2.5 GiB 614 GiB 
> 31.38 0.52 103 up osd.199
> 213   ssd0.87329  1.0 894 GiB 291 GiB  95 GiB 195 GiB 2.3 GiB 603 GiB 
> 32.59 0.54  95 up osd.213
> 217   ssd0.87329  1.0 894 GiB 261 GiB  83 GiB 176 GiB 2.3 GiB 633 GiB 
> 29.18 0.48  89 up osd.217
>
> As you can see, the PG lives on 3 SSD OSDs and one of them is the new one, so
> we cannot say it belongs to someone else.
>
> rule ssd-rule {
> id 1
> type replicated
> step take default class ssd
> step chooseleaf firstn 0 type host
> step emit
> }
>
> pool 54 'rgw.buckets.index' replicated size 3 min_size 1 crush_rule 1 
> object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode warn last_change 
> 31607 lfor 0/0/30823 flags hashpspool stripe_width 0 compression_algorithm 
> lz4 compression_mode aggressive application rgw
>
> What is the next step?
>
>
> Josh Baergen , 1 Eyl 2021 Çar, 04:03 tarihinde 
> şunu yazdı:
>>
>> Yeah, I would suggest inspecting your CRUSH tree. Unfortunately the
>> grep above removed that information from 'df tree', but from the
>> information you provided there does appear to be a significant
>> imbalance remaining.
>>

[ceph-users] Re: After adding New Osd's, Pool Max Avail did not changed.

2021-08-31 Thread Josh Baergen
Yeah, I would suggest inspecting your CRUSH tree. Unfortunately the
grep above removed that information from 'df tree', but from the
information you provided there does appear to be a significant
imbalance remaining.

Josh

On Tue, Aug 31, 2021 at 6:02 PM mhnx  wrote:
>
> Hello Josh!
>
> I use balancer active - crush-compat. Balance is done and there are no 
> remapped pgs at ceph -s
>
> ceph osd df tree | grep 'CLASS\|ssd'
>
> ID  CLASS WEIGHT REWEIGHT SIZERAW USE DATAOMAPMETAAVAIL   
> %USE  VAR  PGS STATUS TYPE NAME
>  19   ssd0.87320  1.0 894 GiB 402 GiB 117 GiB 281 GiB 3.0 GiB 492 GiB 
> 44.93 0.74 102 up osd.19
> 208   ssd0.87329  1.0 894 GiB 205 GiB  85 GiB 113 GiB 6.6 GiB 690 GiB 
> 22.89 0.38  95 up osd.208
> 209   ssd0.87329  1.0 894 GiB 204 GiB  87 GiB 114 GiB 2.7 GiB 690 GiB 
> 22.84 0.38  65 up osd.209
> 199   ssd0.87320  1.0 894 GiB 281 GiB 118 GiB 159 GiB 2.8 GiB 614 GiB 
> 31.37 0.52 103 up osd.199
> 202   ssd0.87329  1.0 894 GiB 278 GiB  89 GiB 183 GiB 6.3 GiB 616 GiB 
> 31.08 0.51  97 up osd.202
> 218   ssd0.87329  1.0 894 GiB 201 GiB  75 GiB 124 GiB 1.8 GiB 693 GiB 
> 22.46 0.37  84 up osd.218
>  39   ssd0.87320  1.0 894 GiB 334 GiB  86 GiB 242 GiB 5.3 GiB 560 GiB 
> 37.34 0.61  91 up osd.39
> 207   ssd0.87329  1.0 894 GiB 232 GiB  88 GiB 138 GiB 7.0 GiB 662 GiB 
> 25.99 0.43  81 up osd.207
> 210   ssd0.87329  1.0 894 GiB 270 GiB 109 GiB 160 GiB 1.4 GiB 624 GiB 
> 30.18 0.50  99 up osd.210
>  59   ssd0.87320  1.0 894 GiB 374 GiB 127 GiB 244 GiB 3.1 GiB 520 GiB 
> 41.79 0.69  97 up osd.59
> 203   ssd0.87329  1.0 894 GiB 314 GiB  96 GiB 210 GiB 7.5 GiB 581 GiB 
> 35.06 0.58 104 up osd.203
> 211   ssd0.87329  1.0 894 GiB 231 GiB  60 GiB 169 GiB 1.7 GiB 663 GiB 
> 25.82 0.42  81 up osd.211
>  79   ssd0.87320  1.0 894 GiB 409 GiB 109 GiB 298 GiB 2.0 GiB 486 GiB 
> 45.70 0.75 102 up osd.79
> 206   ssd0.87329  1.0 894 GiB 284 GiB 107 GiB 175 GiB 1.9 GiB 610 GiB 
> 31.79 0.52  94 up osd.206
> 212   ssd0.87329  1.0 894 GiB 239 GiB  85 GiB 152 GiB 2.0 GiB 655 GiB 
> 26.71 0.44  80 up osd.212
>  99   ssd0.87320  1.0 894 GiB 392 GiB  73 GiB 314 GiB 4.7 GiB 503 GiB 
> 43.79 0.72  85 up osd.99
> 205   ssd0.87329  1.0 894 GiB 445 GiB  87 GiB 353 GiB 4.8 GiB 449 GiB 
> 49.80 0.82  95 up osd.205
> 213   ssd0.87329  1.0 894 GiB 291 GiB  94 GiB 194 GiB 2.3 GiB 603 GiB 
> 32.57 0.54  95 up osd.213
> 114   ssd0.87329  1.0 894 GiB 319 GiB 125 GiB 191 GiB 3.0 GiB 575 GiB 
> 35.67 0.59  99 up osd.114
> 200   ssd0.87329  1.0 894 GiB 231 GiB  78 GiB 150 GiB 2.9 GiB 663 GiB 
> 25.83 0.42  90 up osd.200
> 214   ssd0.87329  1.0 894 GiB 296 GiB 106 GiB 187 GiB 2.6 GiB 598 GiB 
> 33.09 0.54 100 up osd.214
> 139   ssd0.87320  1.0 894 GiB 270 GiB  98 GiB 169 GiB 2.3 GiB 624 GiB 
> 30.18 0.50  96 up osd.139
> 204   ssd0.87329  1.0 894 GiB 301 GiB 117 GiB 181 GiB 2.9 GiB 593 GiB 
> 33.64 0.55 104 up osd.204
> 215   ssd0.87329  1.0 894 GiB 203 GiB  78 GiB 122 GiB 3.3 GiB 691 GiB 
> 22.69 0.37  81 up osd.215
> 119   ssd0.87329  1.0 894 GiB 200 GiB 106 GiB  92 GiB 2.0 GiB 694 GiB 
> 22.39 0.37  99 up osd.119
> 159   ssd0.87329  1.0 894 GiB 213 GiB  96 GiB 113 GiB 3.2 GiB 682 GiB 
> 23.77 0.39  93 up osd.159
> 216   ssd0.87329  1.0 894 GiB 322 GiB 109 GiB 211 GiB 1.8 GiB 573 GiB 
> 35.96 0.59 101 up osd.216
> 179   ssd0.87329  1.0 894 GiB 389 GiB  85 GiB 300 GiB 3.2 GiB 505 GiB 
> 43.49 0.71 104 up osd.179
> 201   ssd0.87329  1.0 894 GiB 494 GiB 104 GiB 386 GiB 4.1 GiB 401 GiB 
> 55.20 0.91 103 up osd.201
> 217   ssd0.87329  1.0 894 GiB 261 GiB  83 GiB 176 GiB 2.3 GiB 634 GiB 
> 29.15 0.48  89 up osd.217
>
>
> When I checked the balancer status I saw this: "optimize_result": "Some osds
> belong to multiple subtrees"
> Do I need to check the crushmap?
>
>
>
> Josh Baergen , 31 Ağu 2021 Sal, 22:32 tarihinde 
> şunu yazdı:
>>
>> Hi there,
>>
>> Could you post the output of "ceph osd df tree"? I would highly
>> suspect that this is a result of imbalance, and that's the easiest way
>> to see if that's the case. It would also confirm that the new disks
>> have taken on PGs.
>>
>> Josh
>>

[ceph-users] Re: After adding New Osd's, Pool Max Avail did not changed.

2021-08-31 Thread Josh Baergen
Hi there,

Could you post the output of "ceph osd df tree"? I would highly
suspect that this is a result of imbalance, and that's the easiest way
to see if that's the case. It would also confirm that the new disks
have taken on PGs.

Josh

On Tue, Aug 31, 2021 at 10:50 AM mhnx  wrote:
>
> I'm using Nautilus 14.2.16
>
> I had 20 ssd OSDs in my cluster and I added 10 more. "Each SSD=960GB"
> The Size increased to *(26TiB)* as expected but the Replicated (3) Pool Max
> Avail didn't change *(3.5TiB)*.
> I've increased pg_num and PG rebalance is also done.
>
> Do I need any special treatment to expand the pool Max Avail?
>
> CLASS SIZEAVAIL   USEDRAW USED %RAW USED
> hdd   2.7 PiB 1.0 PiB 1.6 PiB  1.6 PiB 61.12
> ssd*26 TiB*  18 TiB 2.8 TiB  8.7 TiB 33.11
> TOTAL 2.7 PiB 1.1 PiB 1.6 PiB  1.7 PiB 60.85
>
> POOLS:
> POOLID PGS  STORED  OBJECTS
>  USED%USED MAX AVAIL
> xxx.rgw.buckets.index  54  128 541 GiB 435.69k 541
> GiB  4.82   *3.5 TiB*
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Is there any way to obtain the maximum number of node failure in ceph without data loss?

2021-07-26 Thread Josh Baergen
 systemctl stop ceph-osd@14
> >3. ceph osd out 14
> >
> > The used CRUSH rule to create the EC8+3 pool is described as below:
> > # ceph osd crush rule dump erasure_hdd_mhosts
> > {
> > "rule_id": 8,
> > "rule_name": "erasure_hdd_mhosts",
> > "ruleset": 8,
> > "type": 3,
> > "min_size": 1,
> > "max_size": 16,
> > "steps": [
> > {
> > "op": "take",
> > "item": -1,
> > "item_name": "default"
> > },
> > {
> > "op": "chooseleaf_indep",
> > "num": 0,
> > "type": "host"
> > },
> > {
> > "op": "emit"
> > }
> > ]
> > }
> >
> > And the output of `ceph osd tree` is also attached:
> > [~] # ceph osd tree
> > ID   CLASS  WEIGHT   TYPE NAME  STATUS
> > REWEIGHT  PRI-AFF
> >  -132.36148  root default
> > -13 2.69679  host jceph-n01
> >   0hdd  0.89893  osd.0  up
> > 1.0  1.0
> >   1hdd  0.89893  osd.1  up
> > 1.0  1.0
> >   2hdd  0.89893  osd.2  up
> > 1.0  1.0
> > -17 2.69679  host jceph-n02
> >   3hdd  0.89893  osd.3  up
> > 1.0  1.0
> >   4hdd  0.89893  osd.4  up
> > 1.0  1.0
> >   5hdd  0.89893  osd.5  up
> > 1.0  1.0
> > -21 2.69679  host jceph-n03
> >   6hdd  0.89893  osd.6  up
> > 1.0  1.0
> >   7hdd  0.89893  osd.7  up
> > 1.0  1.0
> >   8hdd  0.89893  osd.8  up
> > 1.0  1.0
> > -25 2.69679  host jceph-n04
> >   9hdd  0.89893  osd.9  up
> > 1.0  1.0
> >  10hdd  0.89893  osd.10 up
> > 1.0  1.0
> >  11hdd  0.89893  osd.11 up
> > 1.0  1.0
> > -29 2.69679  host jceph-n05
> >  12hdd  0.89893  osd.12 up
> > 1.0  1.0
> >  13hdd  0.89893  osd.13 up
> > 1.0  1.0
> >  14hdd  0.89893  osd.14 up
> > 1.0  1.0
> > -33 2.69679  host jceph-n06
> >  15hdd  0.89893  osd.15 up
> > 1.0  1.0
> >  16hdd  0.89893  osd.16 up
> > 1.0  1.0
> >  17hdd  0.89893  osd.17 up
> > 1.0  1.0
> > -37 2.69679  host jceph-n07
> >  18hdd  0.89893  osd.18 up
> > 1.0  1.0
> >  19hdd  0.89893  osd.19 up
> > 1.0  1.0
> >  20hdd  0.89893  osd.20 up
> > 1.0  1.0
> > -41 2.69679  host jceph-n08
> >  21hdd  0.89893  osd.21 up
> > 1.0  1.0
> >  22hdd  0.89893  osd.22 up
> > 1.0  1.0
> >  23hdd  0.89893  osd.23 up
> > 1.0  1.0
> > -45 2.69679  host jceph-n09
> >  24hdd  0.89893  osd.24 up
> > 1.0  1.0
> >  25hdd  0.89893  osd.25 up
> > 1.0  1.0
> >  26hdd  0.89893  osd.26 up
> > 1.0  1.0
> > -49 2.69679  host jceph-n10
> >  27hdd  0.89893  osd.27 up
> > 1.0  1.0
> >  28hdd  0.89893  osd.28 up
> > 1.0  1.0
> >  29hdd  0.89893  osd.29 up
> > 1.0  1.0
> > -53 2.69679  host jceph-n11
> >  30hdd  0.89893  osd.30 up
> > 1.0  1.0
> >  31hdd  0.89893  osd.31  

[ceph-users] Re: Is there any way to obtain the maximum number of node failure in ceph without data loss?

2021-07-23 Thread Josh Baergen
Hi Jerry,

In general, your CRUSH rules should define the behaviour you're
looking for. Based on what you've stated about your configuration,
after failing a single node or an OSD on a single node, then you
should still be able to tolerate two more failures in the system
without losing data (or losing access to data, given that min_size=k,
though I believe it's recommended to set min_size=k+1).
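
Checking and bumping that is a one-liner per pool; a sketch, with the pool
name as a placeholder:

# ceph osd pool get ec83pool min_size
# ceph osd pool set ec83pool min_size 9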

However, that sequence of acting sets doesn't make a whole lot of
sense to me for a single OSD failure (though perhaps I'm misreading
them). Can you clarify exactly how you simulated the osd.14 failure?
It might also be helpful to post your CRUSH rule and "ceph osd tree".

Josh

On Fri, Jul 23, 2021 at 1:42 AM Jerry Lee  wrote:
>
> Hello,
>
> I would like to know the maximum number of node failures for an EC8+3
> pool in a 12-node cluster with 3 OSDs in each node.  The size and
> min_size of the EC8+3 pool are configured as 11 and 8, and OSDs of each
> PG are selected by host.  When there is no node failure, the maximum
> number of node failures is 3, right?  After unplugging an OSD (osd.14)
> in the cluster, I check the PG acting set changes and one of the
> results is shown as below:
>
> T0:
> [15,31,11,34,28,1,8,26,14,19,5]
>
> T1: after unplugging a OSD (osd.14) and recovery started
> [15,31,11,34,28,1,8,26,NONE,19,5]
>
> T2:
> [15,31,11,34,21,1,8,26,19,29,5]
>
> T3:
> [15,31,11,34,NONE,1,8,26,NONE,NONE,5]
>
> T4: recovery was done
> [15,31,11,34,21,1,8,26,19,29,5]
>
> For the PG, 3 OSD peers changed during the recovery process
> ([_,_,_,_,28->21,_,_,_,14->19,19->29,_]).  It seems that min_size (8)
> chunks of the EC8+3 pool are kept during recovery.  Does it mean
> that no more node failures can be tolerated during T3 to T4?  Can we
> calculate the maximum number of node failures by examining all the
> acting sets of the PGs?  Is there some simple way to obtain such
> information?  Any ideas and feedback are appreciated, thanks!
>
> - Jerry
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: NVME hosts added to the clusters and it made old ssd hosts flapping osds

2021-07-08 Thread Josh Baergen
Have you confirmed that all OSD hosts can see each other (on both the front
and back networks if you use split networks)? If there's not full
connectivity, then that can lead to the issues you see here.

Checking the logs on the mons can be helpful, as it will usually indicate
why a given OSD is being marked down (e.g. which OSDs are indicating that
it's down). The OSD logs may also be helpful.
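
A rough way to check both, with the host address and log path as
placeholders (paths differ for containerized deployments): from each OSD
host, ping every other OSD host on both the public and cluster networks,

ping -c 1 <other-osd-host-cluster-ip>

and on a mon host look for messages along the lines of "osd.X reported
failed by osd.Y":

grep "reported failed" /var/log/ceph/ceph.log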

Josh

On Thu, Jul 8, 2021 at 5:18 AM Szabo, Istvan (Agoda) 
wrote:

> Hi,
>
> I've added 4 nvme hosts with 2 osd/nvme to my cluster and it made all the
> ssd osds flap; I don't understand why.
> It is under the same root but 2 different device classes, nvme and ssd.
> The pools are on the ssd; on the nvme there is nothing at the moment.
> The only way to bring the ssd osds back alive is to shut down the nvmes.
> The new nvme servers have 25Gb nics; the old servers and the mons have 10Gb
> but in aggregated mode.
>
> This is the crush rule dump:
> [
> {
> "rule_id": 0,
> "rule_name": "replicated_ssd",
> "ruleset": 0,
> "type": 1,
> "min_size": 1,
> "max_size": 10,
> "steps": [
> {
> "op": "take",
> "item": -21,
> "item_name": "default~ssd"
> },
> {
> "op": "chooseleaf_firstn",
> "num": 0,
> "type": "host"
> },
> {
> "op": "emit"
> }
> ]
> },
> {
> "rule_id": 1,
> "rule_name": "replicated_nvme",
> "ruleset": 1,
> "type": 1,
> "min_size": 1,
> "max_size": 10,
> "steps": [
> {
> "op": "take",
> "item": -10,
> "item_name": "default~nvme"
> },
> {
> "op": "chooseleaf_firstn",
> "num": 0,
> "type": "host"
> },
> {
> "op": "emit"
> }
> ]
> }
> ]
>
>
> This is the osd tree
>
> ID  CLASS WEIGHTTYPE NAMESTATUS REWEIGHT PRI-AFF
> -19   561.15057 root default
> -138.03099 host server-2001
>   0   ssd   2.0 osd.0up  1.0 1.0
> 10   ssd   6.98499 osd.10   up  1.0 1.0
> 11   ssd   6.98599 osd.11   up  1.0 1.0
> 12   ssd   2.29799 osd.12   up  1.0 1.0
> 13   ssd   2.29799 osd.13   up  1.0 1.0
> 14   ssd   3.49300 osd.14   up  1.0 1.0
> 41   ssd   6.98499 osd.41   up  1.0 1.0
> 42   ssd   6.98599 osd.42   up  1.0 1.0
> -338.03099 host server-2002
>   1   ssd   2.0 osd.1up  1.0 1.0
> 24   ssd   6.98499 osd.24   up  1.0 1.0
> 25   ssd   6.98599 osd.25   up  1.0 1.0
> 27   ssd   2.29799 osd.27   up  1.0 1.0
> 28   ssd   2.29799 osd.28   up  1.0 1.0
> 29   ssd   3.49300 osd.29   up  1.0 1.0
> 43   ssd   6.98499 osd.43   up  1.0 1.0
> 44   ssd   6.98599 osd.44   up  1.0 1.0
> -638.03000 host server-2003
>   2   ssd   2.0 osd.2up  1.0 1.0
> 26   ssd   6.98499 osd.26   up  1.0 1.0
> 38   ssd   2.2 osd.38   up  1.0 1.0
> 39   ssd   2.29500 osd.39   up  1.0 1.0
> 40   ssd   3.49300 osd.40   up  1.0 1.0
> 45   ssd   6.98499 osd.45   up  1.0 1.0
> 46   ssd   6.98599 osd.46   up  1.0 1.0
> 47   ssd   6.98599 osd.47   up  1.0 1.0
> -17   111.76465 host server-2004
>   5  nvme   6.98529 osd.5  down0 1.0
>   9  nvme   6.98529 osd.9  down0 1.0
> 18  nvme   6.98529 osd.18 down0 1.0
> 22  nvme   6.98529 osd.22 down0 1.0
> 32  nvme   6.98529 osd.32 down0 1.0
> 36  nvme   6.98529 osd.36 down0 1.0
> 50  nvme   6.98529 osd.50 down0 1.0
> 54  nvme   6.98529 osd.54 down0 1.0
> 58  nvme   6.98529 osd.58 down0 1.0
> 62  nvme   6.98529 osd.62 down0 1.0
> 66  nvme   6.98529 osd.66 down0 1.0
> 70  nvme   6.98529 osd.70 down0 1.0
> 74  nvme   6.98529 osd.74 down0 1.0
> 78  nvme   6.98529 osd.78

[ceph-users] Re: ceph df (octopus) shows USED is 7 times higher than STORED in erasure coded pool

2021-07-06 Thread Josh Baergen
Oh, I just read your message again, and I see that I didn't answer your
question. :D I admit I don't know how MAX AVAIL is calculated, and whether
it takes things like imbalance into account (it might).

Josh

On Tue, Jul 6, 2021 at 7:41 AM Josh Baergen 
wrote:

> Hey Wladimir,
>
> That output looks like it's from Nautilus or later. My understanding is
> that the USED column is in raw bytes, whereas STORED is "user" bytes. If
> you're using EC 2:1 for all of those pools, I would expect USED to be at
> least 1.5x STORED, which looks to be the case for jerasure21. Perhaps your
> libvirt pool is 3x replicated, in which case the numbers add up as well.
>
> Josh
>
> On Tue, Jul 6, 2021 at 5:51 AM Wladimir Mutel  wrote:
>
>> I started my experimental 1-host/8-HDDs setup in 2018 with
>> Luminous,
>> and I read
>> https://ceph.io/community/new-luminous-erasure-coding-rbd-cephfs/ ,
>> which had interested me in using Bluestore and rewriteable EC
>> pools for RBD data.
>> I have about 22 TiB of raw storage, and ceph df shows this:
>>
>> --- RAW STORAGE ---
>> CLASSSIZEAVAILUSED  RAW USED  %RAW USED
>> hdd22 TiB  2.7 TiB  19 TiB19 TiB  87.78
>> TOTAL  22 TiB  2.7 TiB  19 TiB19 TiB  87.78
>>
>> --- POOLS ---
>> POOL   ID  PGS   STORED  OBJECTS USED  %USED  MAX
>> AVAIL
>> jerasure21  1  256  9.0 TiB2.32M   13 TiB  97.06276
>> GiB
>> libvirt 2  128  1.5 TiB  413.60k  4.5 TiB  91.77140
>> GiB
>> rbd 3   32  798 KiB5  2.7 MiB  0138
>> GiB
>> iso 4   32  2.3 MiB   10  8.0 MiB  0138
>> GiB
>> device_health_metrics   51   31 MiB9   94 MiB   0.02138
>> GiB
>>
>> If I add USED for libvirt and jerasure21 , I get 17.5 TiB, and
>> 2.7 TiB is shown at RAW STORAGE/AVAIL
>> Sum of POOLS/MAX AVAIL is about 840 GiB, where are my other
>> 2.7-0.840 =~ 1.86 TiB ???
>> Or in different words, where are my (RAW STORAGE/RAW
>> USED)-(SUM(POOLS/USED)) = 19-17.5 = 1.5 TiB ?
>>
>> As it does not seem I would get any more hosts for this setup,
>> I am seriously thinking of bringing down this Ceph
>> and setting up instead a Btrfs storing qcow2 images served over
>> iSCSI
>> which looks simpler to me for single-host situation.
>>
>> Josh Baergen wrote:
>> > Hey Wladimir,
>> >
>> > I actually don't know where this is referenced in the docs, if
>> anywhere. Googling around shows many people discovering this overhead the
>> hard way on ceph-users.
>> >
>> > I also don't know the rbd journaling mechanism in enough depth to
>> comment on whether it could be causing this issue for you. Are you seeing a
>> high
>> > allocated:stored ratio on your cluster?
>> >
>> > Josh
>> >
>> > On Sun, Jul 4, 2021 at 6:52 AM Wladimir Mutel <m...@mwg.dp.ua> wrote:
>> >
>> > Dear Mr Baergen,
>> >
>> > thanks a lot for your very concise explanation,
>> > however I would like to learn more why default Bluestore alloc.size
>> causes such a big storage overhead,
>> > and where in the Ceph docs it is explained how and what to watch
>> for to avoid hitting this phenomenon again and again.
>> > I have a feeling this is what I get on my experimental Ceph setup
>> with simplest JErasure 2+1 data pool.
>> > Could it be caused by journaled RBD writes to EC data-pool ?
>> >
>> > Josh Baergen wrote:
>> >  > Hey Arkadiy,
>> >  >
>> >  > If the OSDs are on HDDs and were created with the default
>> >  > bluestore_min_alloc_size_hdd, which is still 64KiB in Octopus,
>> then in
>> >  > effect data will be allocated from the pool in 640KiB chunks
>> (64KiB *
>> >  > (k+m)). 5.36M objects taking up 501GiB is an average object size
>> of 98KiB
>> >  > which results in a ratio of 6.53:1 allocated:stored, which is
>> pretty close
>> >  > to the 7:1 observed.
>> >  >
>> >  > If my assumption about your configuration is correct, then the
>> only way to
>> >  > fix this is to adjust bluestore_min_alloc_size_hdd and recreate
>> all your
>> >  > OSDs, which will take a while...
>> >  >
>> >  > Josh
>> >  >
>> > 

[ceph-users] Re: ceph df (octopus) shows USED is 7 times higher than STORED in erasure coded pool

2021-07-06 Thread Josh Baergen
Hey Wladimir,

That output looks like it's from Nautilus or later. My understanding is
that the USED column is in raw bytes, whereas STORED is "user" bytes. If
you're using EC 2:1 for all of those pools, I would expect USED to be at
least 1.5x STORED, which looks to be the case for jerasure21. Perhaps your
libvirt pool is 3x replicated, in which case the numbers add up as well.
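
Working through the numbers from the df output quoted below, assuming
jerasure21 is EC 2+1 and libvirt is 3x replicated:

jerasure21:  9.0 TiB stored x 1.5 (k=2, m=1)   = 13.5 TiB  (vs 13 TiB USED)
libvirt:     1.5 TiB stored x 3.0 (replicated) =  4.5 TiB  (vs 4.5 TiB USED)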

Josh

On Tue, Jul 6, 2021 at 5:51 AM Wladimir Mutel  wrote:

> I started my experimental 1-host/8-HDDs setup in 2018 with
> Luminous,
> and I read
> https://ceph.io/community/new-luminous-erasure-coding-rbd-cephfs/ ,
> which had interested me in using Bluestore and rewriteable EC
> pools for RBD data.
> I have about 22 TiB of raw storage, and ceph df shows this:
>
> --- RAW STORAGE ---
> CLASSSIZEAVAILUSED  RAW USED  %RAW USED
> hdd22 TiB  2.7 TiB  19 TiB19 TiB  87.78
> TOTAL  22 TiB  2.7 TiB  19 TiB19 TiB  87.78
>
> --- POOLS ---
> POOL   ID  PGS   STORED  OBJECTS USED  %USED  MAX AVAIL
> jerasure21  1  256  9.0 TiB2.32M   13 TiB  97.06276 GiB
> libvirt 2  128  1.5 TiB  413.60k  4.5 TiB  91.77140 GiB
> rbd 3   32  798 KiB5  2.7 MiB  0138 GiB
> iso 4   32  2.3 MiB   10  8.0 MiB  0138 GiB
> device_health_metrics   51   31 MiB9   94 MiB   0.02138 GiB
>
> If I add USED for libvirt and jerasure21 , I get 17.5 TiB, and 2.7
> TiB is shown at RAW STORAGE/AVAIL
> Sum of POOLS/MAX AVAIL is about 840 GiB, where are my other
> 2.7-0.840 =~ 1.86 TiB ???
> Or in different words, where are my (RAW STORAGE/RAW
> USED)-(SUM(POOLS/USED)) = 19-17.5 = 1.5 TiB ?
>
> As it does not seem I would get any more hosts for this setup,
> I am seriously thinking of bringing down this Ceph
> and setting up instead a Btrfs storing qcow2 images served over
> iSCSI
> which looks simpler to me for single-host situation.
>
> Josh Baergen wrote:
> > Hey Wladimir,
> >
> > I actually don't know where this is referenced in the docs, if anywhere.
> Googling around shows many people discovering this overhead the hard way on
> ceph-users.
> >
> > I also don't know the rbd journaling mechanism in enough depth to
> comment on whether it could be causing this issue for you. Are you seeing a
> high
> > allocated:stored ratio on your cluster?
> >
> > Josh
> >
> > On Sun, Jul 4, 2021 at 6:52 AM Wladimir Mutel <m...@mwg.dp.ua> wrote:
> >
> > Dear Mr Baergen,
> >
> > thanks a lot for your very concise explanation,
> > however I would like to learn more why default Bluestore alloc.size
> causes such a big storage overhead,
> > and where in the Ceph docs it is explained how and what to watch for
> to avoid hitting this phenomenon again and again.
> > I have a feeling this is what I get on my experimental Ceph setup
> with simplest JErasure 2+1 data pool.
> > Could it be caused by journaled RBD writes to EC data-pool ?
> >
> > Josh Baergen wrote:
> >  > Hey Arkadiy,
> >  >
> >  > If the OSDs are on HDDs and were created with the default
> >  > bluestore_min_alloc_size_hdd, which is still 64KiB in Octopus,
> then in
> >  > effect data will be allocated from the pool in 640KiB chunks
> (64KiB *
> >  > (k+m)). 5.36M objects taking up 501GiB is an average object size
> of 98KiB
> >  > which results in a ratio of 6.53:1 allocated:stored, which is
> pretty close
> >  > to the 7:1 observed.
> >  >
> >  > If my assumption about your configuration is correct, then the
> only way to
> >  > fix this is to adjust bluestore_min_alloc_size_hdd and recreate
> all your
> >  > OSDs, which will take a while...
> >  >
> >  > Josh
> >  >
> >  > On Tue, Jun 29, 2021 at 3:07 PM Arkadiy Kulev <e...@ethaniel.com> wrote:
> >  >
> >  >> The pool *default.rgw.buckets.data* has *501 GiB* stored, but
> USED shows
> >  >> *3.5
> >  >> TiB *(7 times higher!)*:*
> >  >>
> >  >> root@ceph-01:~# ceph df
> >  >> --- RAW STORAGE ---
> >  >> CLASS  SIZE AVAILUSED RAW USED  %RAW USED
> >  >> hdd196 TiB  193 TiB  3.5 TiB   3.6 TiB   1.85
> >  >> TOTAL  196 TiB  193 TiB  3.5 TiB   3.6 TiB   1.85
> >  >>
> >  >> --- POOLS ---

[ceph-users] Re: ceph df (octopus) shows USED is 7 times higher than STORED in erasure coded pool

2021-07-05 Thread Josh Baergen
Hey Wladimir,

I actually don't know where this is referenced in the docs, if anywhere.
Googling around shows many people discovering this overhead the hard way on
ceph-users.

I also don't know the rbd journaling mechanism in enough depth to comment
on whether it could be causing this issue for you. Are you seeing a high
allocated:stored ratio on your cluster?

Josh

On Sun, Jul 4, 2021 at 6:52 AM Wladimir Mutel  wrote:

> Dear Mr Baergen,
>
> thanks a lot for your very concise explanation,
> however I would like to learn more why default Bluestore alloc.size causes
> such a big storage overhead,
> and where in the Ceph docs it is explained how and what to watch for to
> avoid hitting this phenomenon again and again.
> I have a feeling this is what I get on my experimental Ceph setup with
> simplest JErasure 2+1 data pool.
> Could it be caused by journaled RBD writes to EC data-pool ?
>
> Josh Baergen wrote:
> > Hey Arkadiy,
> >
> > If the OSDs are on HDDs and were created with the default
> > bluestore_min_alloc_size_hdd, which is still 64KiB in Octopus, then in
> > effect data will be allocated from the pool in 640KiB chunks (64KiB *
> > (k+m)). 5.36M objects taking up 501GiB is an average object size of 98KiB
> > which results in a ratio of 6.53:1 allocated:stored, which is pretty
> close
> > to the 7:1 observed.
> >
> > If my assumption about your configuration is correct, then the only way
> to
> > fix this is to adjust bluestore_min_alloc_size_hdd and recreate all your
> > OSDs, which will take a while...
> >
> > Josh
> >
> > On Tue, Jun 29, 2021 at 3:07 PM Arkadiy Kulev  wrote:
> >
> >> The pool *default.rgw.buckets.data* has *501 GiB* stored, but USED shows
> >> *3.5
> >> TiB *(7 times higher!)*:*
> >>
> >> root@ceph-01:~# ceph df
> >> --- RAW STORAGE ---
> >> CLASS  SIZE AVAILUSED RAW USED  %RAW USED
> >> hdd196 TiB  193 TiB  3.5 TiB   3.6 TiB   1.85
> >> TOTAL  196 TiB  193 TiB  3.5 TiB   3.6 TiB   1.85
> >>
> >> --- POOLS ---
> >> POOL   ID  PGS  STORED   OBJECTS  USED %USED
> MAX
> >> AVAIL
> >> device_health_metrics   11   19 KiB   12   56 KiB  0
> >>   61 TiB
> >> .rgw.root   2   32  2.6 KiB6  1.1 MiB  0
> >>   61 TiB
> >> default.rgw.log 3   32  168 KiB  210   13 MiB  0
> >>   61 TiB
> >> default.rgw.control 4   32  0 B8  0 B  0
> >>   61 TiB
> >> default.rgw.meta58  4.8 KiB   11  1.9 MiB  0
> >>   61 TiB
> >> default.rgw.buckets.index   68  1.6 GiB  211  4.7 GiB  0
> >>   61 TiB
> >>
> >> default.rgw.buckets.data   10  128  501 GiB5.36M  3.5 TiB   1.90
> >> 110 TiB
> >>
> >> The *default.rgw.buckets.data* pool is using erasure coding:
> >>
> >> root@ceph-01:~# ceph osd erasure-code-profile get EC_RGW_HOST
> >> crush-device-class=hdd
> >> crush-failure-domain=host
> >> crush-root=default
> >> jerasure-per-chunk-alignment=false
> >> k=6
> >> m=4
> >> plugin=jerasure
> >> technique=reed_sol_van
> >> w=8
> >>
> >> If anyone could help explain why it's using up 7 times more space, it
> would
> >> help a lot. Versioning is disabled. ceph version 15.2.13 (octopus
> stable).
> >>
> >> Sincerely,
> >> Ark.
> >> ___
> >> ceph-users mailing list -- ceph-users@ceph.io
> >> To unsubscribe send an email to ceph-users-le...@ceph.io
> >>
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> >
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph df (octopus) shows USED is 7 times higher than STORED in erasure coded pool

2021-06-29 Thread Josh Baergen
Hey Arkadiy,

If the OSDs are on HDDs and were created with the default
bluestore_min_alloc_size_hdd, which is still 64KiB in Octopus, then in
effect data will be allocated from the pool in 640KiB chunks (64KiB *
(k+m)). 5.36M objects taking up 501GiB is an average object size of 98KiB
which results in a ratio of 6.53:1 allocated:stored, which is pretty close
to the 7:1 observed.

If my assumption about your configuration is correct, then the only way to
fix this is to adjust bluestore_min_alloc_size_hdd and recreate all your
OSDs, which will take a while...
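
As a sketch (osd.0 is just an example): to see what newly created HDD OSDs
would use, and to lower it for OSDs created from here on,

# ceph daemon osd.0 config get bluestore_min_alloc_size_hdd
# ceph config set osd bluestore_min_alloc_size_hdd 4096

The setting only takes effect at OSD creation time, hence the rebuild.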

Josh

On Tue, Jun 29, 2021 at 3:07 PM Arkadiy Kulev  wrote:

> The pool *default.rgw.buckets.data* has *501 GiB* stored, but USED shows
> *3.5
> TiB *(7 times higher!)*:*
>
> root@ceph-01:~# ceph df
> --- RAW STORAGE ---
> CLASS  SIZE AVAILUSED RAW USED  %RAW USED
> hdd196 TiB  193 TiB  3.5 TiB   3.6 TiB   1.85
> TOTAL  196 TiB  193 TiB  3.5 TiB   3.6 TiB   1.85
>
> --- POOLS ---
> POOL   ID  PGS  STORED   OBJECTS  USED %USED  MAX
> AVAIL
> device_health_metrics   11   19 KiB   12   56 KiB  0
>  61 TiB
> .rgw.root   2   32  2.6 KiB6  1.1 MiB  0
>  61 TiB
> default.rgw.log 3   32  168 KiB  210   13 MiB  0
>  61 TiB
> default.rgw.control 4   32  0 B8  0 B  0
>  61 TiB
> default.rgw.meta58  4.8 KiB   11  1.9 MiB  0
>  61 TiB
> default.rgw.buckets.index   68  1.6 GiB  211  4.7 GiB  0
>  61 TiB
>
> default.rgw.buckets.data   10  128  501 GiB5.36M  3.5 TiB   1.90
> 110 TiB
>
> The *default.rgw.buckets.data* pool is using erasure coding:
>
> root@ceph-01:~# ceph osd erasure-code-profile get EC_RGW_HOST
> crush-device-class=hdd
> crush-failure-domain=host
> crush-root=default
> jerasure-per-chunk-alignment=false
> k=6
> m=4
> plugin=jerasure
> technique=reed_sol_van
> w=8
>
> If anyone could help explain why it's using up 7 times more space, it would
> help a lot. Versioning is disabled. ceph version 15.2.13 (octopus stable).
>
> Sincerely,
> Ark.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] pgremapper released

2021-05-05 Thread Josh Baergen
Hello all,

I just wanted to let you know that DigitalOcean has open-sourced a
tool we've developed called pgremapper.

Originally inspired by CERN's upmap exception table manipulation
scripts, pgremapper is a CLI written in Go which exposes a number of
upmap-based algorithms for backfill-related usecases: Canceling
backfill (like CERN's upmap-remapped.py, but with some extra tricks up
its sleeve), draining PGs off of an OSD, undoing upmaps in a
controlled and concurrent manner, and more.

If you're interested, please read the details in the repo's README:
https://github.com/digitalocean/pgremapper

Josh
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: EC Backfill Observations

2021-04-21 Thread Josh Baergen
Hey Josh,

Thanks for the info!

> With respect to reservations, it seems like an oversight that
> we don't reserve other shards for backfilling. We reserve all
> shards for recovery [0].

Very interesting that there is a reservation difference between
backfill and recovery.

> On the other hand, overload from recovery is handled better in
> pacific and beyond with mclock-based QoS, which provides much
> more effective control of recovery traffic [1][2].

Indeed, I was wondering if mclock was ultimately the answer here,
though I wonder how mclock acts in the case where a source OSD gets
overloaded in the way that I described. Will it throttle backfill too
aggressively, for example, compared to if the reservation was in
place, preventing overload in the first place?

One more question in this space: Has there ever been discussion about
a back-off mechanism when one of the remote reservations is blocked?
Another issue that we've commonly seen is that a backfill that should
be able to make progress can't because of a backfill_wait that holds
some of its reservations but is waiting for others. Example (with
simplified up/acting sets):

1.1  active+remapped+backfilling   [0,2]  0   [0,1]  0
1.2  active+remapped+backfill_wait   [3,2]  3   [3,1]  3
1.3  active+remapped+backfill_wait   [3,5]  3   [3,4]  3

1.3's backfill could make progress independent of 1.1, but is blocked
behind 1.2 because the latter is holding the local reservation on
osd.3 and is waiting for the remote reservation on osd.2.

Josh
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] EC Backfill Observations

2021-04-19 Thread Josh Baergen
Hey all,

I wanted to confirm my understanding of some of the mechanics of
backfill in EC pools. I've yet to find a document that outlines this
in detail; if there is one, please send it my way. :) Some of what I
write below is likely in the "well, duh" category, but I tended
towards completeness.

First off, I understand that backfill reservations work the same way
between replicated pools and EC pools. A local reservation is taken on
the primary OSD, then a remote reservation on the backfill target(s),
before the backfill is allowed to begin. Until this point, the
backfill is in the backfill_wait state.
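
(As an aside, the reservation state on a given OSD can be inspected via
the admin socket, e.g. "ceph daemon osd.0 dump_recovery_reservations",
with osd.0 as an example; output details vary by release.)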

When the backfill begins, though, is when the differences begin. Let's
say we have an EC 3:2 PG that's backfilling from OSD 2 to OSD 5
(formatted here like pgs_brief):

1.1  active+remapped+backfilling   [0,1,5,3,4]  0   [0,1,2,3,4]  0

The question in my mind was: Where is the data for this backfill
coming from? In replicated pools, all reads come from the primary.
However, in this case, the primary does not have the data in question;
the primary has to either read the EC chunk from OSD 2, or it has to
reconstruct it by reading from 3 of the OSDs in the acting set.

Based on observation, I _think_ this is what happens:
1. As long as the PG is not degraded, the backfill read is simply
forwarded by the primary to OSD 2.
2. Once the PG becomes degraded, the backfill read needs to use the
reconstructing path, and begins reading from 3 of the OSDs in the
acting set.

Questions:
1. Can anyone confirm or correct my description of how EC backfill
operates? In particular, in case 2 above, does it matter whether OSD 2
is the cause of degradation, for example? Does the read still get
forwarded to a single OSD when it's parity chunks that are being moved
via backfill?
2. I'm curious as to why a 3rd reservation, for the source OSD, wasn't
introduced as a part of EC in Ceph. We've occasionally seen an OSD
become overloaded because several backfills were reading from it
simultaneously, and there's no way to control this via the normal
osd_max_backfills mechanism. Is anyone aware of discussions to this
effect?

Thanks!
Josh
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph Nautilus lost two disk over night everything hangs

2021-03-30 Thread Josh Baergen
I thought that recovery below min_size for EC pools wasn't expected to work
until Octopus. From the Octopus release notes: "Ceph will allow recovery
below min_size for Erasure coded pools, wherever possible."

Josh

On Tue, Mar 30, 2021 at 6:53 AM Frank Schilder  wrote:

> Dear Rainer,
>
> hmm, maybe the option is ignored or not implemented properly. This option
> set to true should have the same effect as reducing min_size *except* that
> new writes will not go to non-redundant storage. When reducing min-size, a
> critically degraded PG will accept new writes, which is the danger of
> data-loss mentioned before and avoided if only recovery ops are allowed on
> such PGs.
>
> Can you open a tracker about your observation that reducing min-size was
> necessary and helped despite osd_allow_recovery_below_min_size=true?
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Rainer Krienke 
> Sent: 30 March 2021 13:30:00
> To: Frank Schilder; Eugen Block; ceph-users@ceph.io
> Subject: Re: [ceph-users] Re: ceph Nautilus lost two disk over night
> everything hangs
>
> Hello Frank,
>
> the option is actually set. On one of my monitors:
>
> # ceph daemon /var/run/ceph/ceph-mon.*.asok config show|grep
> osd_allow_recovery_below_min_size
>  "osd_allow_recovery_below_min_size": "true",
>
> Thank you very much
> Rainer
>
> Am 30.03.21 um 13:20 schrieb Frank Schilder:
> > Hi, this is odd. The problem with recovery when sufficiently many but
> less than min_size shards are present should have been resolved with
> osd_allow_recovery_below_min_size=true. It is really dangerous to reduce
> min_size below k+1 and, in fact, should never be necessary for recovery.
> Can you check if this option is present and set to true? If it is not
> working as intended, a tracker ticker might be in order.
> >
> > Best regards,
> > =
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> >
>
>   --
> Rainer Krienke, Uni Koblenz, Rechenzentrum, A22, Universitaetsstrasse  1
> 56070 Koblenz, Web: http://www.uni-koblenz.de/~krienke, Tel: +49261287
> 1312
> PGP: http://www.uni-koblenz.de/~krienke/mypgp.html, Fax: +49261287
> 1001312
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Do I need to update ceph.conf and restart each OSD after adding more MONs?

2021-03-29 Thread Josh Baergen
As was mentioned in this thread, all of the mon clients (OSDs included)
learn about other mons through monmaps, which are distributed when mon
membership and election changes. Thus, your OSDs should already know about
the new mons.

mon_host indicates the list of mons that mon clients should try to contact
at boot. Thus, it's important to have it correct in the config, but it doesn't
need to be updated after the process starts.

At least that's how I understand it; the config docs aren't terribly clear
on this behaviour.
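
A quick sanity check, as a sketch with osd.0 as an example: compare the
monmap the daemons have learned against what the config reports,

# ceph mon dump
# ceph config show osd.0 | grep mon_host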

Josh


On Sat., Mar. 27, 2021, 2:07 p.m. Tony Liu,  wrote:

> Just realized that all config files (/var/lib/ceph/<fsid>/<daemon id>/config)
> on all nodes are already updated properly. It must be handled as part of
> adding MONs. But "ceph config show" shows only a single host.
>
> mon_host   [v2:
> 10.250.50.80:3300/0,v1:10.250.50.80:6789/0]  file
>
> That means I still need to restart all services to apply the update, right?
> Is this supposed to be part of adding MONs as well, or additional manual
> step?
>
>
> Thanks!
> Tony
> 
> From: Tony Liu 
> Sent: March 27, 2021 12:53 PM
> To: Stefan Kooman; ceph-users@ceph.io
> Subject: [ceph-users] Re: Do I need to update ceph.conf and restart each
> OSD after adding more MONs?
>
> # ceph config set osd.0 mon_host [v2:
> 10.250.50.80:3300/0,v1:10.250.50.80:6789/0,v2:10.250.50.81:3300/0,v1:10.250.50.81:6789/0,v2:10.250.50.82:3300/0,v1:10.250.50.82:6789/0
> ]
> Error EINVAL: mon_host is special and cannot be stored by the mon
>
> It seems that the only option is to update ceph.conf and restart service.
>
>
> Tony
> 
> From: Tony Liu 
> Sent: March 27, 2021 12:20 PM
> To: Stefan Kooman; ceph-users@ceph.io
> Subject: [ceph-users] Re: Do I need to update ceph.conf and restart each
> OSD after adding more MONs?
>
> I expanded MON from 1 to 3 by updating orch service "ceph orch apply".
> "mon_host" in all services (MON, MGR, OSDs) is not updated. It's still
> single
> host from source "file".
> What's the guidance here to update "mon_host" for all services? I am
> talking
> about Ceph services, not client side.
> Should I update ceph.conf for all services and restart all of them?
> Or I can update it on-the-fly by "ceph config set"?
> In the latter case, where the updated configuration is stored? Is it going
> to
> be overridden by ceph.conf when restart service?
>
>
> Thanks!
> Tony
>
> 
> From: Stefan Kooman 
> Sent: March 26, 2021 12:22 PM
> To: Tony Liu; ceph-users@ceph.io
> Subject: Re: [ceph-users] Do I need to update ceph.conf and restart each
> OSD after adding more MONs?
>
> On 3/26/21 6:06 PM, Tony Liu wrote:
> > Hi,
> >
> > Do I need to update ceph.conf and restart each OSD after adding more
> MONs?
>
> This should not be necessary, as the OSDs should learn about these
> changes through monmaps. Updating the ceph.conf after the mons have been
> updated is advised.
>
> > This is with 15.2.8 deployed by cephadm.
> >
> > When adding MON, "mon_host" should be updated accordingly.
> > Given [1], is that update "the monitor cluster’s centralized
> configuration
> > database" or "runtime overrides set by an administrator"?
>
> No need to put that in the centralized config database. I *think* they
> mean ceph.conf file on the clients and hosts. At least, that's what you
> would normally do (if not using DNS).
>
> Gr. Stefan
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: memory consumption by osd

2021-03-29 Thread Josh Baergen
Linux will automatically make use of all available memory for the buffer
cache, freeing buffers when it needs more memory for other things. This is
why MemAvailable is more useful than MemFree; the former indicates how much
memory could be used between Free, buffer cache, and anything else that
could be freed up. If you'd like to learn more about the buffer cache and
Linux's management of it, there are plenty of resources a search away.

My guess is that you're using a Ceph release that has bluefs_buffered_io
set to true by default, which will cause the OSDs to use the buffer cache
for some of their IO. What you're seeing is normal behaviour in this case.
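
You can confirm that on a running OSD via the admin socket (osd.0 is just
an example):

# ceph daemon osd.0 config get bluefs_buffered_io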

Josh

On Sat., Mar. 27, 2021, 8:59 p.m. Tony Liu,  wrote:

> I don't see any problems yet. All OSDs are working fine.
> Just that 1.8GB free memory concerns me.
> I know 256GB memory for 10 OSDs (16TB HDD) is a lot, I am planning to
> reduce it or increate osd_memory_target (if that's what you meant) to
> boost performance. But before doing that, I'd like to understand what's
> taking so much buff/cache and if there is any option to control it.
>
>
> Thanks!
> Tony
> 
> From: Anthony D'Atri 
> Sent: March 27, 2021 07:27 PM
> To: ceph-users
> Subject: [ceph-users] Re: memory consumption by osd
>
>
> Depending on your kernel version, MemFree can be misleading.  Attend to
> the value of MemAvailable instead.
>
> Your OSDs all look to be well below the target, I wouldn’t think you have
> any problems.  In fact 256GB for just 10 OSDs is an embarrassment of
> riches.  What type of drives are you using, and what’s the cluster used
> for?  If anything I might advise *raising* the target.
>
> You might check tcmalloc usage
>
>
> https://ceph-devel.vger.kernel.narkive.com/tYp0KkIT/ceph-daemon-memory-utilization-heap-release-drops-use-by-50
>
> but I doubt this is an issue for you.
>
> > What's taking that much buffer?
> > # free -h
> >  totalusedfree  shared  buff/cache
>  available
> > Mem:  251Gi31Gi   1.8Gi   1.6Gi   217Gi
>  215Gi
> >
> > # cat /proc/meminfo
> > MemTotal:   263454780 kB
> > MemFree: 2212484 kB
> > MemAvailable:   226842848 kB
> > Buffers:219061308 kB
> > Cached:  2066532 kB
> > SwapCached:  928 kB
> > Active: 142272648 kB
> > Inactive:   109641772 kB
> > ..
> >
> >
> > Thanks!
> > Tony
> > 
> > From: Tony Liu 
> > Sent: March 27, 2021 01:25 PM
> > To: ceph-users
> > Subject: [ceph-users] memory consumption by osd
> >
> > Hi,
> >
> > Here is a snippet from top on a node with 10 OSDs.
> > ===
> > MiB Mem : 257280.1 total,   2070.1 free,  31881.7 used, 223328.3
> buff/cache
> > MiB Swap: 128000.0 total, 126754.7 free,   1245.3 used. 221608.0 avail
> Mem
> >
> >PID USER  PR  NIVIRTRESSHR S  %CPU  %MEM TIME+
> COMMAND
> >  30492 167   20   0 4483384   2.9g  16696 S   6.0   1.2 707:05.25
> ceph-osd
> >  35396 167   20   0 952   2.8g  16468 S   5.0   1.1 815:58.52
> ceph-osd
> >  33488 167   20   0 4161872   2.8g  16580 S   4.7   1.1 496:07.94
> ceph-osd
> >  36371 167   20   0 4387792   3.0g  16748 S   4.3   1.2 762:37.64
> ceph-osd
> >  39185 167   20   0 5108244   3.1g  16576 S   4.0   1.2 998:06.73
> ceph-osd
> >  38729 167   20   0 4748292   2.8g  16580 S   3.3   1.1 895:03.67
> ceph-osd
> >  34439 167   20   0 4492312   2.8g  16796 S   2.0   1.1 921:55.50
> ceph-osd
> >  31473 167   20   0 4314500   2.9g  16684 S   1.3   1.2 680:48.09
> ceph-osd
> >  32495 167   20   0 4294196   2.8g  16552 S   1.0   1.1 545:14.53
> ceph-osd
> >  37230 167   20   0 4586020   2.7g  16620 S   1.0   1.1 844:12.23
> ceph-osd
> > ===
> > Does it look OK with 2GB free?
> > I can't tell how that 220GB is used for buffer/cache.
> > Is that used by OSDs? Is it controlled by configuration or auto scaling
> based
> > on physical memory? Any clarifications would be helpful.
> >
> >
> > Thanks!
> > Tony
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Where has my capacity gone?

2021-01-28 Thread Josh Baergen
Hi George,

> May I ask if enabling pool compression helps for the future space 
> amplification?

If the amplification is indeed due to min_alloc_size, then I don't
think that compression will help. My understanding is that compression
is applied post-EC (and thus probably won't even activate due to the
small chunks), and that the compressed bits will still be stored on
disk in the same way as before (min_alloc_size still applies). More
info here: https://www.suse.com/support/kb/doc/?id=19629

It's possible, though, that turning on compression and tuning its
settings could reduce the overall number of blocks allocated, which
would compensate slightly for the amplification. To confirm that you'd
have to analyze the object sizes of your data set. There are also
pathological cases where perhaps most of your EC chunks are slightly
over 64K and by forcing them to compress (they won't by default) you
actually cut allocated blocks in half. Again, that would take analysis
to determine.
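
If you do want to experiment with it anyway, compression is just a pair of
pool properties; a sketch on your data pool, with the algorithm and mode
choices as examples:

# ceph osd pool set secondaryzone.rgw.buckets.data compression_algorithm lz4
# ceph osd pool set secondaryzone.rgw.buckets.data compression_mode aggressive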

Josh
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Where has my capacity gone?

2021-01-27 Thread Josh Baergen
On Wed, Jan 27, 2021 at 12:24 AM George Yil  wrote:
> May I ask if it can be dynamically changed and any disadvantages should be 
> expected?

Unless there's some magic I'm unaware of, there is no way to
dynamically change this. Each OSD must be recreated with the new
min_alloc_size setting. In production systems this can be quite the
chore, since the safest way to accomplish this is to drain the OSD
(set it 'out', use CRUSH map changes, or use upmaps), recreate it, and
then repopulate it. With automation this can run in the background.
Given how much room you have currently you may be able to do this
host-at-a-time by storing a host's data on the other hosts in a given
rack (though I don't remember what your CRUSH tree looks like so maybe
you can't do this and maintain host independence).
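
A rough sketch of the per-OSD cycle (the ID and device path are
placeholders, and bluestore_min_alloc_size_hdd must already be set to the
new value before the recreate step):

# ceph osd out 12            (then wait for backfill to complete)
# ceph osd safe-to-destroy osd.12
# ceph osd destroy 12 --yes-i-really-mean-it
# ceph-volume lvm zap /dev/sdX --destroy
# ceph-volume lvm create --osd-id 12 --data /dev/sdX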

The downside is potentially more tracking metadata at the OSD level,
though I understand that Nautilus has made improvements here. I'm not
up to speed on the latest state in this area, though.

Josh
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Where has my capacity gone?

2021-01-26 Thread Josh Baergen
> I created radosgw pools. secondaryzone.rgw.buckets.data pool is
configured as EC 8+2 (jerasure).

Did you override the default bluestore_min_alloc_size_hdd (64k in that
version IIRC) when creating your hdd OSDs? If not, all of the small objects
produced by that EC configuration will be leading to significant on-disk
allocation overhead.

Josh
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io