[ceph-users] Re: Different behaviors for ceph kernel client in limiting IOPS when data pool enters `nearfull`?

2023-11-16 Thread Matt Larson
Ilya,

 Thank you for providing these discussion threads on the kernel fixes
where the behavior changed, and for the details on how this affects the
clients.

 What is the expected behavior in CephFS client when there are multiple
data pools in the CephFS? Does having 'nearfull' in any data pool in the
CephFS then trigger the synchronous writes for clients even if they would
be writing to a CephFS location mapped to a non-nearfull data pool? I.e. is
'nearfull' / sync behavior global across the same CephFS filesystem?

 Thanks,
  Matt

On Thu, Nov 16, 2023 at 8:39 AM Ilya Dryomov  wrote:

> On Thu, Nov 16, 2023 at 3:21 AM Xiubo Li  wrote:
> >
> > Hi Matt,
> >
> > On 11/15/23 02:40, Matt Larson wrote:
> > > On CentOS 7 systems with the CephFS kernel client, if the data pool
> > > has a `nearfull` status there is a slight reduction in write speeds
> > > (possibly 20-50% fewer IOPS).
> > >
> > > On a similar Rocky 8 system with the CephFS kernel client, if the data
> > > pool has `nearfull` status, a similar test of write speeds at different
> > > block sizes shows IOPS bottlenecked below 150, vs. the typical write
> > > performance that might be 2-3 IOPS at a particular block size.
> > >
> > > Is there any way to avoid the extremely bottlenecked IOPS seen on the
> > > Rocky 8 system CephFS kernel clients during the `nearfull` condition,
> > > or to have behavior more similar to the CentOS 7 CephFS clients?
> > >
> > > Do different OSes or Linux kernels differ greatly in how they respond
> > > to or limit IOPS? Are there any options to adjust how they limit IOPS?
> >
> > Just to be clear, the kernel on CentOS 7 is older than the kernel on
> > Rocky 8, so they may behave differently in some ways. BTW, are the Ceph
> > versions the same for your tests on CentOS 7 and Rocky 8?
> >
> > I saw that libceph.ko has some code to handle the OSD FULL case, but I
> > didn't find handling for the nearfull case, so let's get help from Ilya
> > on this.
> >
> > @Ilya,
> >
> > Do you know whether the osdc will behave differently when it detects
> > that the pool is near full?
>
> Hi Xiubo,
>
> It's not libceph or osdc, but CephFS itself.  I think Matt is running
> against this fix:
>
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7614209736fbc4927584d4387faade4f31444fce
>
> It was previously discussed in detail here:
>
>
> https://lore.kernel.org/ceph-devel/caoi1vp_k2ybx9+jffmuhcuxsyngftqjyh+frusyy4ureprk...@mail.gmail.com/
>
> The solution is to add additional capacity or bump the nearfull
> threshold:
>
>
> https://lore.kernel.org/ceph-devel/23f46ca6dd1f45a78beede92fc91d...@mpinat.mpg.de/
>
> Thanks,
>
> Ilya
>
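
For anyone hitting the same thing: the thresholds Ilya mentions can be
checked and, if there is headroom, bumped with the standard commands below
(a sketch; 0.90 is only an example value, and raising the ratio is a
stopgap, not a substitute for adding capacity):

 ceph osd dump | grep ratio        # shows full_ratio / backfillfull_ratio / nearfull_ratio
 ceph df                           # how close the OSDs actually are
 ceph osd set-nearfull-ratio 0.90  # example: raise the nearfull threshold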


-- 
Matt Larson, PhD
Madison, WI  53705 U.S.A.


[ceph-users] Different behaviors for ceph kernel client in limiting IOPS when data pool enters `nearfull`?

2023-11-14 Thread Matt Larson
On CentOS 7 systems with the CephFS kernel client, if the data pool has a
`nearfull` status there is a slight reduction in write speeds (possibly
20-50% fewer IOPS).

On a similar Rocky 8 system with the CephFS kernel client, if the data pool
has `nearfull` status, a similar test of write speeds at different block
sizes shows IOPS bottlenecked below 150, vs. the typical write
performance that might be 2-3 IOPS at a particular block size.

Is there any way to avoid the extremely bottlenecked IOPS seen on the Rocky
8 system CephFS kernel clients during the `nearfull` condition or to have
behavior more similar to the CentOS 7 CephFS clients?

Do different OSes or Linux kernels differ greatly in how they respond
to or limit IOPS? Are there any options to adjust how they limit IOPS?

Thanks,
  Matt

-- 
Matt Larson, PhD
Madison, WI  53705 U.S.A.


[ceph-users] Re: Moving devices to a different device class?

2023-10-26 Thread Matt Larson
Thanks Janne,

 It is good to know that moving the devices over to a new class is a safe
operation.

On Tue, Oct 24, 2023 at 2:16 PM Janne Johansson  wrote:

>
>> The documentation describes that I could set a device class for an OSD
>> with
>> a command like:
>>
>> `ceph osd crush set-device-class CLASS OSD_ID [OSD_ID ..]`
>>
>> Class names can be arbitrary strings like 'big_nvme'.  Before setting a
>> new device class on an OSD that already has an assigned device class, one
>> should first use `ceph osd crush rm-device-class ssd osd.XX`.
>>
>
> Yes, you can re-"name" them by removing old class and setting a new one.
>
>
>> Can I proceed to directly remove these OSDs from the current device class
>> and assign to a new device class? Should they be moved one by one? What is
>> the way to safely protect data from the existing pool that they are mapped
>> to?
>>
>>
> Yes, the PGs on them will be misplaced, so if their pool aims to only use
> "ssd"
> and you re-label them to big-nvme instead, the PGs will look for other
> "ssd"-named
> OSDs to land on, and move themselves if possible. It is a fairly safe
> operation where
> they continue to work, but will try to evacuate the PGs which should not
> be there.
>
> Worst case, your planning is wrong, and the "ssd" OSDs can't accept them,
> and you
> can just undo the relabel and the PGs come back.
>
> --
> May the most significant bit of your life be positive.
>


-- 
Matt Larson, PhD
Madison, WI  53705 U.S.A.


[ceph-users] Re: Moving devices to a different device class?

2023-10-24 Thread Matt Larson
Anthony,

 Thank you! This is very helpful information and thanks for the specific
advice for these drive types on choosing a 64KB min_alloc_size. I will do
some more review as I believe they are likely at the 4KB min_alloc_size if
that is the default for the `ssd` device-class.

  I will try to use the 64K *min_alloc_size* default, if I can do so for a
new device-class, and then `destroy` each of these OSDs and create them anew
with the better `min_alloc_size`. These steps could then be done one by one
for each of the OSDs of this type before creating the new pool.
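
 As a sketch of what I mean (the value and OSD id below are examples only,
and as I understand it the new default only applies to OSDs created after
the change):

 ceph config get osd bluestore_min_alloc_size_ssd       # current default for newly created SSD OSDs
 ceph config set osd bluestore_min_alloc_size_ssd 65536
 ceph osd destroy 42 --yes-i-really-mean-it             # then recreate osd.42 so it picks up the new value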

  If I cannot do that, then I would use `ceph osd crush rm-device-class ssd
osd.XX` followed by `ceph osd crush set-device-class qlc osd.XX` to
individually reassign the drives to a new class with a simple name like
`qlc`, to avoid issues with special characters in the class name. This could
be done one by one, watching that the PGs rebalance to the other SSDs in the
original pool.
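
 Roughly the per-OSD sequence I am picturing (example OSD id; `qlc-rule` and
`<new-pool>` are placeholder names, the rule only needs creating once, and
an EC pool would instead set crush-device-class in its erasure-code profile):

 ceph osd crush rm-device-class osd.12
 ceph osd crush set-device-class qlc osd.12
 ceph osd crush rule create-replicated qlc-rule default host qlc
 ceph osd pool set <new-pool> crush_rule qlc-rule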

 Thanks again,
   Matt

On Tue, Oct 24, 2023 at 12:11 PM Anthony D'Atri  wrote:

> Ah, our old friend the P5316.
>
> A few things to remember about these:
>
> * 64KB IU means that you'll burn through endurance if you do a lot of
> writes smaller than that.  The firmware will try to coalesce smaller
> writes, especially if they're sequential.  You probably want to keep your
> RGW / CephFS index / metadata pools on other media.
>
>
> * With Quincy or later and a reasonably recent kernel you can set
> bluestore_use_optimal_io_size_for_min_alloc_size to true and OSDs deployed
> on these should automatically be created with a 64KB min_alloc_size.  If
> you're writing a lot of objects smaller than, say, 256KB -- especially if
> using EC -- a more nuanced approach may be warranted.  ISTR that your data
> are large sequential files, so probably you can exploit this.  For sure you
> want these OSDs to not have the default 4KB min_alloc_size; that would
> result in lowered write performance and especially endurance burn.  The
> min_alloc_size cannot be changed after an OSD is created; instead one would
> need to destroy and recreate.
>
> cf. https://github.com/ceph/ceph/pulls?q=is%3Apr+author%3Acurtbruns
>
> Optimizing RGW Object Storage Mixed Media through Storage Classes and Lua
> Scripting: https://www.youtube.com/watch?v=w91e0EjWD6E
>
> On Oct 24, 2023, at 11:42, Matt Larson  wrote:
>
> I am looking to create a new pool that would be backed by a particular set
> of drives that are larger nVME SSDs (Intel SSDPF2NV153TZ, 15TB drives).
> Particularly, I am wondering about what is the best way to move devices
> from one pool and to direct them to be used in a new pool to be created. In
> this case, the documentation suggests I could want to assign them to a new
> device-class and have a placement rule that targets that device-class in
> the new pool.
>
>
> If you're using cephadm / ceph orch you can craft an OSD spec that uses or
> ignores drives based on size or model.
>
> Multiple pools can share OSDs, for your use-case though you probably don't
> want to.
>
>
> Currently the Ceph cluster has two device classes 'hdd' and 'ssd', and the
> larger 15TB drives were automatically assigned to the 'ssd' device class
> that is in use by a different pool. The `ssd` device classes are used in a
> placement rule targeting that class.
>
>
> The names of device classes are actually semi-arbitrary.  The above
> distinction is made on the basis of whether or not the kernel believes a
> given device to rotate.
>
>
> The documentation describes that I could set a device class for an OSD with
> a command like:
>
> `ceph osd crush set-device-class CLASS OSD_ID [OSD_ID ..]`
>
> Class names can be arbitrary strings like 'big_nvme'.
>
>
> or "qlc"
>
> Before setting a new
> device class on an OSD that already has an assigned device class, one
> should first use `ceph osd crush rm-device-class ssd osd.XX`.
>
>
> Yep.  I suspect that's a guardrail to prevent inadvertently trampling.
>
>
> Can I proceed to directly remove these OSDs from the current device class
> and assign to a new device class?
>
>
> Carpe NAND!
>
> Should they be moved one by one? What is
> the way to safely protect data from the existing pool that they are mapped
> to?
>
>
> Are there other SSDs in said existing pool?  If you reassign all of these,
> will there be enough survivors to meet replication policy and hold all the
> data?
>
> One by one would be safe.  Doing more than one might be faster and more
> efficient, depending on your hardware and topology

[ceph-users] Moving devices to a different device class?

2023-10-24 Thread Matt Larson
I am looking to create a new pool that would be backed by a particular set
of drives that are larger nVME SSDs (Intel SSDPF2NV153TZ, 15TB drives).
Particularly, I am wondering about what is the best way to move devices
from one pool and to direct them to be used in a new pool to be created. In
this case, the documentation suggests I could want to assign them to a new
device-class and have a placement rule that targets that device-class in
the new pool.

Currently the Ceph cluster has two device classes 'hdd' and 'ssd', and the
larger 15TB drives were automatically assigned to the 'ssd' device class
that is in use by a different pool. The `ssd` device classes are used in a
placement rule targeting that class.

The documentation describes that I could set a device class for an OSD with
a command like:

`ceph osd crush set-device-class CLASS OSD_ID [OSD_ID ..]`

Class names can be arbitrary strings like 'big_nvme'.  Before setting a new
device class on an OSD that already has an assigned device class, one should
first use `ceph osd crush rm-device-class ssd osd.XX`.

Can I proceed to directly remove these OSDs from the current device class
and assign to a new device class? Should they be moved one by one? What is
the way to safely protect data from the existing pool that they are mapped
to?

Thanks,
  Matt

-- 
Matt Larson, PhD
Madison, WI  53705 U.S.A.


[ceph-users] Removing failing OSD with cephadm?

2023-02-17 Thread Matt Larson
I have an OSD that is causing slow ops, and appears to be backed by a
failing drive according to smartctl outputs.  I am using cephadm, and
wondering what is the best way to remove this drive from the cluster and
proper steps to replace the disk?

Mark the osd.35 as out.

`sudo ceph osd out osd.35`

Then mark osd.35 as down.

`sudo ceph osd down osd.35`

 The OSD is marked as out, but it does come back up after a couple of
seconds.  I do not know if that is a problem, or whether to just let the
drive stay online as long as it lasts during the removal from the cluster.

 After the recovery completes, I would then `destroy` the osd:

`ceph osd destroy {id} --yes-i-really-mean-it`

(https://docs.ceph.com/en/latest/rados/operations/add-or-rm-osds/)

Besides checking the steps above, my question now is: if the drive is acting
very slow and causing slow ops, should I be trying to shut down its OSD
and keep it down? There is an example of stopping the OSD on the server using
systemctl, outside of cephadm:

ssh {osd-host}
sudo systemctl stop ceph-osd@{osd-num}
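
With cephadm-managed daemons, I believe the same can also be driven through
the orchestrator, along these lines (assuming osd.35):

ceph orch daemon stop osd.35
ceph orch osd rm 35 --replace
ceph orch osd rm status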


Thanks,
  Matt

-- 
Matt Larson, PhD
Madison, WI  53705 U.S.A.


[ceph-users] Re: What to expect on rejoining a host to cluster?

2022-12-05 Thread Matt Larson
Frank,

 Then if you have only a few OSDs with excessive PG counts / usage, do you
reweight them down by something like 10-20% to achieve a better distribution
and improve capacity?  Do you weight them back to normal after the PGs have
moved?

 I wondered if manually picking some of the higher data-usage OSDs could
get to a good outcome and avoid continuous rebalancing or other issues.
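
 For concreteness, what I mean is something like the following (example OSD
id and weights):

 ceph osd df tree              # find the fullest OSDs
 ceph osd reweight osd.57 0.85 # temporary override weight, not the crush weight
 ceph osd reweight osd.57 1.0  # set back to normal later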

 Thanks,
   Matt

On Mon, Dec 5, 2022 at 4:32 AM Frank Schilder  wrote:

> Hi Matt,
>
> I can't comment on balancers, I don't use them. I manually re-weight OSDs,
> which fits well with our pools' OSD allocation. Also, we don't aim for
> perfect balance, we just remove the peak of allocation on the fullest few
> OSDs to avoid excessive capacity loss. Not balancing too much has the pro
> of being fairly stable under OSD failures/additions at the expense of a few
> % less capacity.
>
> Maybe someone else can help here?
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ____
> From: Matt Larson 
> Sent: 04 December 2022 02:00:11
> To: Eneko Lacunza
> Cc: Frank Schilder; ceph-users
> Subject: Re: [ceph-users] Re: What to expect on rejoining a host to
> cluster?
>
> Thank you Frank and Eneko,
>
>  Without help and support from ceph admins like you, I would be adrift.  I
> really appreciate this.
>
>  I rejoined the host now one week ago, and the cluster has been dealing
> with the misplaced objects and recovering well.
>
> I will use this strategy in the future:
>
> "If you consider replacing the host and all disks, get a new host first
> and give it the host name in the crush map. Just before you deploy the new
> host, simply purge all down OSDs in its bucket (set norebalance) and
> deploy. Then, the data movement is restricted to re-balancing to the new
> host.
>
> If you just want to throw out the old host, destroy the OSDs but keep the
> IDs intact (ceph osd destroy). Then, no further re-balancing will happen
> and you can re-use the OSD ids later when adding a new host. That's a
> stable situation from an operations point of view."
>
> The last question I have: I am now seeing that some OSDs have an uneven
> load of PGs. Which balancer do you recommend, and are there any caveats
> for how the balancer operations can affect/slow the cluster?
>
> Thanks,
>   Matt
>
> On Mon, Nov 28, 2022 at 2:23 AM Eneko Lacunza <elacu...@binovo.es> wrote:
> Hi Matt,
>
> Also, make sure that the rejoining host has the correct time. I have seen
> clusters go down when rejoining hosts that were down for maintenance for
> several weeks and came in with datetime deltas of some months (no idea why
> that happened, I arrived with the firefighter team ;-) )
>
> Cheers
>
> El 27/11/22 a las 13:27, Frank Schilder escribió:
>
> Hi Matt,
>
> if you didn't touch the OSDs on that host, they will join and only objects
> that have been modified will actually be updated. Ceph keeps some basic
> history information and can detect changes. 2 weeks is not a very long
> time. If you have a lot of cold data, re-integration will go fast.
>
> Initially, you will see a huge amount of misplaced objects. However, this
> count will go down much faster than objects/s recovery.
>
> Before you rejoin the host, I would fix its issues though. Now that you
> have it out of the cluster, do the maintenance first. There is no rush. In
> fact, you can buy a new host, install the OSDs in the new one and join that
> to the cluster with the host-name of the old host.
>
> If you consider replacing the host and all disks, then get a new host first
> and give it the host name in the crush map. Just before you deploy the new
> host, simply purge all down OSDs in its bucket (set norebalance) and
> deploy. Then, the data movement is restricted to re-balancing to the new
> host.
>
> If you just want to throw out the old host, destroy the OSDs but keep the
> IDs intact (ceph osd destroy). Then, no further re-balancing will happen
> and you can re-use the OSD ids later when adding a new host. That's a
> stable situation from an operations point of view.
>
> Hope that helps.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Matt Larson <larsonma...@gmail.com>
> Sent: 26 November 2022 21:07:41
> To: ceph-users
> Subject: [ceph-users] What to expect on rejoining a host to cluster?
>
> Hi all,
>
>  I have had a host with 16 OSDs, each 14TB in capacity that started having
> hardware issues causing it to crash.  I took this host down 2 weeks ago,
> and the data rebalanced to th

[ceph-users] Re: What to expect on rejoining a host to cluster?

2022-12-03 Thread Matt Larson
Thank you Frank and Eneko,

 Without help and support from ceph admins like you, I would be adrift.  I
really appreciate this.

 I rejoined the host now one week ago, and the cluster has been dealing
with the misplaced objects and recovering well.

I will use this strategy in the future:

"If you consider replacing the host and all disks, get a new host first and
give it the host name in the crush map. Just before you deploy the new
host, simply purge all down OSDs in its bucket (set norebalance) and
deploy. Then, the data movement is restricted to re-balancing to the new
host.

If you just want to throw out the old host, destroy the OSDs but keep the
IDs intact (ceph osd destroy). Then, no further re-balancing will happen
and you can re-use the OSD ids later when adding a new host. That's a
stable situation from an operations point of view."

The last question I have: I am now seeing that some OSDs have an uneven
load of PGs. Which balancer do you recommend, and are there any caveats
for how the balancer operations can affect/slow the cluster?

Thanks,
  Matt

On Mon, Nov 28, 2022 at 2:23 AM Eneko Lacunza  wrote:

> Hi Matt,
>
> Also, make sure that the rejoining host has the correct time. I have seen
> clusters go down when rejoining hosts that were down for maintenance for
> several weeks and came in with datetime deltas of some months (no idea why
> that happened, I arrived with the firefighter team ;-) )
>
> Cheers
>
> El 27/11/22 a las 13:27, Frank Schilder escribió:
>
> Hi Matt,
>
> if you didn't touch the OSDs on that host, they will join and only objects 
> that have been modified will actually be updated. Ceph keeps some basic 
> history information and can detect changes. 2 weeks is not a very long time. 
> If you have a lot of cold data, re-integration will go fast.
>
> Initially, you will see a huge amount of misplaced objects. However, this 
> count will go down much faster than objects/s recovery.
>
> Before you rejoin the host, I would fix its issues though. Now that you have 
> it out of the cluster, do the maintenance first. There is no rush. In fact, 
> you can buy a new host, install the OSDs in the new one and join that to the 
> cluster with the host-name of the old host.
>
> If you consider replacing the host and all disks, then get a new host first
> and give it the host name in the crush map. Just before you deploy the new 
> host, simply purge all down OSDs in its bucket (set norebalance) and deploy. 
> Then, the data movement is restricted to re-balancing to the new host.
>
> If you just want to throw out the old host, destroy the OSDs but keep the IDs 
> intact (ceph osd destroy). Then, no further re-balancing will happen and you 
> can re-use the OSD ids later when adding a new host. That's a stable 
> situation from an operations point of view.
>
> Hope that helps.
>
> Best regards,
> =========
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Matt Larson  
> Sent: 26 November 2022 21:07:41
> To: ceph-users
> Subject: [ceph-users] What to expect on rejoining a host to cluster?
>
> Hi all,
>
>  I have had a host with 16 OSDs, each 14TB in capacity that started having
> hardware issues causing it to crash.  I took this host down 2 weeks ago,
> and the data rebalanced to the remaining 11 server hosts in the Ceph
> cluster over this time period.
>
>  My initial goal was to then remove the host completely from the cluster
> with `ceph osd rm XX` and `ceph osd purge XX` (Adding/Removing OSDs — Ceph
> Documentation <https://docs.ceph.com/en/latest/rados/operations/add-or-rm-osds/>).
> However, I found that after the large amount of data migration from the
> recovery, that the purge and removal from the crush map for an OSDs still
> required another large data move.  It appears that it would have been a
> better strategy to assign a 0 weight to an OSD to have only a single larger
> data move instead of twice.
>
>  I'd like to join the downed server back into the Ceph cluster.  It still
> has 14 OSDs that are listed as out/down that would be brought back online.
> My question is what can I expect if I bring this host online?  Will the
> OSDs of a host that has been offline for an extended period of time and out
> of the cluster have PGs that are now quite different or inconsistent?  Will
> this be problematic?
>
>  Thanks for any advice,
>Matt
>
> --
> Matt Larson, PhD
> Madison, WI  53705 U.S.A.

[ceph-users] What to expect on rejoining a host to cluster?

2022-11-26 Thread Matt Larson
Hi all,

 I have had a host with 16 OSDs, each 14TB in capacity that started having
hardware issues causing it to crash.  I took this host down 2 weeks ago,
and the data rebalanced to the remaining 11 server hosts in the Ceph
cluster over this time period.

 My initial goal was to then remove the host completely from the cluster
with `ceph osd rm XX` and `ceph osd purge XX` (Adding/Removing OSDs — Ceph
Documentation
<https://docs.ceph.com/en/latest/rados/operations/add-or-rm-osds/>).
However, I found that after the large amount of data migration from the
recovery, the purge and removal of an OSD from the crush map still
required another large data move.  It appears that it would have been a
better strategy to assign a 0 weight to an OSD to have only a single larger
data move instead of twice.

 I'd like to join the downed server back into the Ceph cluster.  It still
has 14 OSDs that are listed as out/down that would be brought back online.
My question is what can I expect if I bring this host online?  Will the
OSDs of a host that has been offline for an extended period of time and out
of the cluster have PGs that are now quite different or inconsistent?  Will
this be problematic?

 Thanks for any advice,
   Matt

-- 
Matt Larson, PhD
Madison, WI  53705 U.S.A.


[ceph-users] Best practice for removing failing host from cluster?

2022-11-09 Thread Matt Larson
We have a Ceph cluster running Octopus v15.2.3, and 1 of the 12 hosts
in the cluster has started having what appears to be a hardware issue
causing it to freeze.  This began with a freeze and reported 'CATERR' in
the server logs. The host has been having repeated freeze issues over the
last week.

I'm looking to safely isolate this host from the cluster while
troubleshooting.  I started trying to remove OSDs from the host with `ceph
orch osd rm XX` for one of the drives on the node to rebalance the data
from the host.  The host is now having difficulties remaining online for
extended periods of time, and so I was planning to remove this host from
the cluster / to remove all the remaining OSDs from the node.  What would
be the best way to do this?

Should I use `ceph orch osd rm XX` for each of the OSDs of this host
or should I set the weights of each of the OSDs as 0?  Can I do this while
the host is offline, or should I bring it online first before setting
weights or using `ceph orch osd rm`?
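
For context, the orchestrator route I am considering would be roughly (with
placeholder OSD ids; as I understand it, `ceph orch osd rm` should drain and
then remove them):

ceph orch osd rm 20 21 22
ceph orch osd rm status

versus the weight route of `ceph osd crush reweight osd.XX 0` per OSD.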

Thanks,
  Matt

-- 
Matt Larson, PhD
Madison, WI  53705 U.S.A.


[ceph-users] Upgrading ceph to latest version, skipping minor versions?

2021-06-14 Thread Matt Larson
Looking at the documentation (
https://docs.ceph.com/en/latest/cephadm/upgrade/), I have a question about
whether you need to upgrade sequentially through each minor version, 15.2.1 ->
15.2.3 -> ... -> 15.2.XX?

Can you safely upgrade by directly specifying the latest version from
several minor versions behind?

 ceph orch upgrade start --ceph-version 15.2.13
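
 and then, I assume, monitor it with something like:

 ceph orch upgrade status
 ceph -W cephadm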

-Matt

-- 
Matt Larson, PhD
Madison, WI  53705 U.S.A.


[ceph-users] Re: Updated ceph-osd package, now get -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2,1]

2021-06-14 Thread Matt Larson
It appears what happened:

1. This host was a MON host, and was the only listed MON in the
/etc/ceph/ceph.conf file.
2. The MON went out of quorum with the other 4 monitors.
3. After adding all of the MONs to /etc/ceph/ceph.conf, the `ceph -s`
command worked again.
4. After restarting the host, the MON was able to get back into quorum
with the other monitors.

I think the issue was that when the MON was out of quorum, the ceph client
could no longer connect because it had only that one MON as an option.
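
For reference, the config now lists every monitor, along these lines (the
addresses below are placeholders):

 [global]
 mon_host = 10.0.0.11,10.0.0.12,10.0.0.13,10.0.0.14,10.0.0.15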

Problem is solved -

-Matt

On Mon, Jun 14, 2021 at 2:07 PM Matt Larson  wrote:

> I was working through updates of CentOS and packages on each of the Ceph
> storage nodes of a cluster, and I hit an issue after I updated a package
> `ceph-osd` on the original Ceph node of the cluster.
>
> 1. Before the updates, the Ceph cluster was running 15.2.3 version
> 2. After the update of the package with `dnf install ceph-osd`, I was
> unable to check the status any more with `ceph -s`:
>
>  2021-06-14T14:01:53.216-0500 7efdef7fe700 -1 monclient(hunting):
> handle_auth_bad_method server allowed_methods [2] but i only support [2]
>
> [errno 13] RADOS permission denied (error connecting to the cluster)
>
> 3. The `ceph -v` command now shows as version 15.2.13.
>
> Based on other reported issues, this could be due to recent changes with
> security fixes ( https://docs.ceph.com/en/latest/security/CVE-2021-20288/
> ) that were introduced in 15.2.11.
>
> This cluster had originally been built with `cephadm` tool and
> containerized daemons.
>
> How can I restore the ability to connect with the command-line `ceph`
> client to check the status and all other interactions?
>
> Thanks,
>   Matt
>
> --
> Matt Larson, PhD
> Madison, WI  53705 U.S.A.
>


-- 
Matt Larson, PhD
Madison, WI  53705 U.S.A.


[ceph-users] Updated ceph-osd package, now get -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2,1]

2021-06-14 Thread Matt Larson
I was working through updates of CentOS and packages on each of the Ceph
storage nodes of a cluster, and I hit an issue after I updated a package
`ceph-osd` on the original Ceph node of the cluster.

1. Before the updates, the Ceph cluster was running 15.2.3 version
2. After the update of the package with `dnf install ceph-osd`, I was
unable to check the status any more with `ceph -s`:

 2021-06-14T14:01:53.216-0500 7efdef7fe700 -1 monclient(hunting):
handle_auth_bad_method server allowed_methods [2] but i only support [2]

[errno 13] RADOS permission denied (error connecting to the cluster)

3. The `ceph -v` command now shows as version 15.2.13.

Based on other reported issues, this could be due to recent changes with
security fixes ( https://docs.ceph.com/en/latest/security/CVE-2021-20288/ )
that were introduced in 15.2.11.

This cluster had originally been built with `cephadm` tool and
containerized daemons.

How can I restore the ability to connect with the command-line `ceph`
client to check the status and all other interactions?

Thanks,
  Matt

-- 
Matt Larson, PhD
Madison, WI  53705 U.S.A.


[ceph-users] Building ceph clusters with 8TB SSD drives?

2021-05-07 Thread Matt Larson
Is anyone trying Ceph clusters containing larger (4-8TB) SSD drives?

8TB SSDs are described here (
https://www.anandtech.com/show/16136/qlc-8tb-ssd-review-samsung-870-qvo-sabrent-rocket-q
) and make use of QLC NAND flash memory to reach those costs and capacities.
Currently, the 8TB Samsung 870 SSD is $800/ea at some online retail stores.

SATA form-factor SSDs can reach read/write rates of 560/520 MB/s, which,
while not as fast as NVMe drives, is still several times faster than 7200 RPM
drives. SSDs now appear to have much lower failure rates than HDDs in 2021 (
https://www.techspot.com/news/89590-backblaze-latest-storage-reliability-figures-add-ssd-boot.html
).

Are there any major caveats to considering working with larger SSDs for
data pools?

Thanks,
  Matt

-- 
Matt Larson, PhD
Madison, WI  53705 U.S.A.


[ceph-users] Re: Unable to clarify error using vfs_ceph (Samba gateway for CephFS)

2020-11-12 Thread Matt Larson
Thank you Frank,

 That was a good suggestion to make sure the mount wasn't the issue. I
tried changing the `client.samba.upload` to have read access directly
to '/' rather than '/upload' and to also change smb.conf to directly
use 'path = /'. Still getting the same issue (log level 10 content
below).

 It appears that it is correctly reading `/etc/ceph/ceph.conf`. The
failure does appear to happen in the ceph_mount call.

 It would be great to have vfs_ceph working, but if I cannot I'll try
to find other approaches.
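
In case it matters, granting the client read access at the root can be done
with something like this ('cephfs' stands in for the filesystem name):

 ceph fs authorize cephfs client.samba.upload / r
 ceph auth get client.samba.upload   # shows the resulting mds/mon/osd caps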

[2020/11/12 10:47:39.360943, 10, pid=2723021, effective(0, 0), real(0,
0), class=vfs] ../../source3/smbd/vfs.c:65(vfs_find_backend_entry)

  vfs_find_backend_entry called for ceph
  Successfully loaded vfs module [ceph] with the new modules system
[2020/11/12 10:47:39.360966, 10, pid=2723021, effective(0, 0), real(0,
0), class=vfs] ../../source3/modules/vfs_ceph.c:103(cephwrap_connect)
  cephwrap_connect: [CEPH] calling: ceph_create
[2020/11/12 10:47:39.365668, 10, pid=2723021, effective(0, 0), real(0,
0), class=vfs] ../../source3/modules/vfs_ceph.c:110(cephwrap_connect)
  cephwrap_connect: [CEPH] calling: ceph_conf_read_file with /etc/ceph/ceph.conf
[2020/11/12 10:47:39.368842, 10, pid=2723021, effective(0, 0), real(0,
0), class=vfs] ../../source3/modules/vfs_ceph.c:116(cephwrap_connect)
  cephwrap_connect: [CEPH] calling: ceph_conf_get
[2020/11/12 10:47:39.368895, 10, pid=2723021, effective(0, 0), real(0,
0), class=vfs] ../../source3/modules/vfs_ceph.c:133(cephwrap_connect)
  cephwrap_connect: [CEPH] calling: ceph_mount
[2020/11/12 10:47:39.373319, 10, pid=2723021, effective(0, 0), real(0,
0), class=vfs] ../../source3/modules/vfs_ceph.c:160(cephwrap_connect)
  cephwrap_connect: [CEPH] Error return: No such file or directory
[2020/11/12 10:47:39.373357,  1, pid=2723021, effective(0, 0), real(0,
0)] ../../source3/smbd/service.c:668(make_connection_snum)
  make_connection_snum: SMB_VFS_CONNECT for service 'cryofs_upload' at
'/' failed: No such file or directory

On Thu, Nov 12, 2020 at 2:29 AM Frank Schilder  wrote:
>
> You might face the same issue I had. vfs_ceph wants to have a key for the 
> root of the cephfs, it is currently not possible to restrict access to a 
> sub-directory mount. For this reason, I decided to go for a re-export of a 
> kernel client mount.
>
> I consider this a serious security issue in vfs_ceph and will not use it 
> until it is possible to do sub-directory mounts.
>
> I don't think it's difficult to patch the vfs_ceph source code, if you need to 
> use vfs_ceph and cannot afford to give access to "/" of the cephfs.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Matt Larson 
> Sent: 12 November 2020 00:40:21
> To: ceph-users
> Subject: [ceph-users] Unable to clarify error using vfs_ceph (Samba gateway 
> for CephFS)
>
> I am getting an error in the log.smbd from the Samba gateway that I
> don’t understand and looking for help from anyone who has gotten the
> vfs_ceph working.
>
> Background:
>
> I am trying to get a Samba gateway with CephFS working with the
> vfs_ceph module. I observed that the default Samba package on CentOS
> 7.7 did not come with the ceph.so vfs_ceph module, so I tried to
> compile a working Samba version with vfs_ceph.
>
> Newer Samba versions have a requirement for GnuTLS >= 3.4.7, which is
> not an available package on CentOS 7.7 without a custom repository. I
> opted to build an earlier version of Samba.
>
> On CentOS 7.7, I built Samba 4.11.16 with
>
> [global]
> security = user
> map to guest = Bad User
> username map = /etc/samba/smbusers
> log level = 4
> load printers = no
> printing = bsd
> printcap name = /dev/null
> disable spoolss = yes
>
> [cryofs_upload]
> public = yes
> read only = yes
> guest ok = yes
> vfs objects = ceph
> path = /upload
> kernel share modes = no
> ceph:user_id = samba.upload
> ceph:config_file = /etc/ceph/ceph.conf
>
> I have a file at /etc/ceph/ceph.conf including:
> fsid = redacted
> mon_host = redacted
> auth_cluster_required = cephx
> auth_service_required = cephx
> auth_client_required = cephx
>
>
> I have an /etc/ceph/client.samba.upload.keyring /w key for the user
> `samba.upload`
>
> However, connecting fails:
>
> smbclient localhost\\cryofs_upload -U guest
> Enter guest's password:
> tree connect failed: NT_STATUS_UNSUCCESSFUL
>
>
> The log.smbd gives these errors:
>
>   Initialising custom vfs hooks from [ceph]
> [2020/11/11 17:24:37.388460,  3]
> ../../lib/util/modules.c:167(load_module_absolute_p

[ceph-users] Unable to clarify error using vfs_ceph (Samba gateway for CephFS)

2020-11-11 Thread Matt Larson
I am getting an error in the log.smbd from the Samba gateway that I
don’t understand and looking for help from anyone who has gotten the
vfs_ceph working.

Background:

I am trying to get a Samba gateway with CephFS working with the
vfs_ceph module. I observed that the default Samba package on CentOS
7.7 did not come with the ceph.so vfs_ceph module, so I tried to
compile a working Samba version with vfs_ceph.

Newer Samba versions have a requirement for GnuTLS >= 3.4.7, which is
not an available package on CentOS 7.7 without a custom repository. I
opted to build an earlier version of Samba.

On CentOS 7.7, I built Samba 4.11.16 with

[global]
security = user
map to guest = Bad User
username map = /etc/samba/smbusers
log level = 4
load printers = no
printing = bsd
printcap name = /dev/null
disable spoolss = yes

[cryofs_upload]
public = yes
read only = yes
guest ok = yes
vfs objects = ceph
path = /upload
kernel share modes = no
ceph:user_id = samba.upload
ceph:config_file = /etc/ceph/ceph.conf

I have a file at /etc/ceph/ceph.conf including:
fsid = redacted
mon_host = redacted
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx


I have an /etc/ceph/client.samba.upload.keyring with the key for the user
`samba.upload`

However, connecting fails:

smbclient localhost\\cryofs_upload -U guest
Enter guest's password:
tree connect failed: NT_STATUS_UNSUCCESSFUL


The log.smbd gives these errors:

  Initialising custom vfs hooks from [ceph]
[2020/11/11 17:24:37.388460,  3]
../../lib/util/modules.c:167(load_module_absolute_path)
  load_module_absolute_path: Module '/usr/local/samba/lib/vfs/ceph.so' loaded
[2020/11/11 17:24:37.402026,  1]
../../source3/smbd/service.c:668(make_connection_snum)
  make_connection_snum: SMB_VFS_CONNECT for service 'cryofs_upload' at
'/upload' failed: No such file or directory

There is an /upload directory for which the samba.upload user has read
access to in the CephFS.

What does this error mean: ‘no such file or directory’? Is it that
vfs_ceph isn’t finding `/upload`, or is some other file that vfs_ceph
depends on not being found? I have also tried specifying a local path
rather than a CephFS path and get the same error.

Is there any good guide that describes not just the Samba smb.conf,
but also what should be in /etc/ceph/ceph.conf, and how to provide the
key for the ceph:user_id ? I am really struggling to find good
first-hand documentation for this.
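
For reference, the sort of layout I am assuming should work (file names are
my guesses based on the default /etc/ceph/$cluster.$name.keyring search
path, plus an explicit override in ceph.conf just in case):

 # /etc/ceph/ceph.client.samba.upload.keyring
 [client.samba.upload]
 key = <redacted>

 # added to /etc/ceph/ceph.conf
 [client.samba.upload]
 keyring = /etc/ceph/ceph.client.samba.upload.keyring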

Thanks,
  Matt

-- 
Matt Larson, PhD
Madison, WI  53705 U.S.A.


[ceph-users] Re: ceph-volume quite buggy compared to ceph-disk

2020-10-01 Thread Matt Larson
Hi Marc,

 Did you have any success with `ceph-volume` for activating your OSD?

 I am having a similar problem where `ceph-bluestore-tool` fails to
read the label of a previously created OSD on an LVM partition. I had
previously been using the OSD without issues, but after a reboot it
fails to come up.

 1. I had initially created my OSD using Ceph Octopus 15.x with `ceph
orch daemon add osd :boot/cephfs_meta` that was able to
create an OSD on the LVM partition and bring up an OSD.
 2. After a reboot, the OSD fails to come up, with error from
`ceph-bluestore-tool` happening inside the container specifically
being unable to read the label of the device.
 3. When I query the symlinked device /dev/boot/cephfs_meta ->
/dev/dm-3 with `dmsetup info /dev/dm-3`, I can see the state is active
and that it has a UUID, etc.
 4. I installed the `ceph-osd` CentOS package, which provides
ceph-bluestore-tool, and tested manually: `sudo ceph-bluestore-tool
show-label --dev /dev/dm-3` fails to read the label. When I try with
other OSDs that were created on entire disks, this command is able to
read the label and print out the information.
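
 For completeness, other checks that seem relevant here (a sketch, run on
the OSD host; "boot" is the VG and "cephfs_meta" the LV from the paths
above):

 sudo ceph-volume lvm list                                # what ceph-volume recorded about this OSD's LV and tags
 sudo lvs -o lv_name,lv_path,lv_tags boot/cephfs_meta
 sudo ceph-bluestore-tool show-label --dev /dev/boot/cephfs_meta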

 I am considering submitting a ticket to the ceph issue tracker, as I
am unable to figure out why the ceph-bluestore-tool cannot read the
labels and it seems either the OSD was initially created incorrectly
or there is a bug in ceph-bluestore-tool.

 One possibility is that I did not have the LVM2 package installed on
this host prior to the `ceph orch daemon add ..` command and this
caused a particular issue with the LVM partition OSD.

 -Matt

On Sat, Sep 19, 2020 at 9:11 AM Marc Roos  wrote:
>
>
>
>
> [@]# ceph-volume lvm activate 36 82b94115-4dfb-4ed0-8801-def59a432b0a
> Running command: /usr/bin/mount -t tmpfs tmpfs /var/lib/ceph/osd/ceph-36
> Running command: /usr/bin/ceph-authtool
> /var/lib/ceph/osd/ceph-36/lockbox.keyring --create-keyring --name
> client.osd-lockbox.82b94115-4dfb-4ed0-8801-def59a432b0a --add-key
> AQBxA2Zfj6avOBAAIIHqNNY2J22EnOZV+dNzFQ==
>  stdout: creating /var/lib/ceph/osd/ceph-36/lockbox.keyring
> added entity client.osd-lockbox.82b94115-4dfb-4ed0-8801-def59a432b0a
> auth(key=AQBxA2Zfj6avOBAAIIHqNNY2J22EnOZV+dNzFQ==)
> Running command: /usr/bin/chown -R ceph:ceph
> /var/lib/ceph/osd/ceph-36/lockbox.keyring
> Running command: /usr/bin/ceph --cluster ceph --name
> client.osd-lockbox.82b94115-4dfb-4ed0-8801-def59a432b0a --keyring
> /var/lib/ceph/osd/ceph-36/lockbox.keyring config-key get
> dm-crypt/osd/82b94115-4dfb-4ed0-8801-def59a432b0a/luks
> Running command: /usr/sbin/cryptsetup --key-file - --allow-discards
> luksOpen
> /dev/ceph-9263e83b-7660-4f5b-843a-2111e882a17e/osd-block-82b94115-4dfb-4
> ed0-8801-def59a432b0a I8MyTZ-RQjx-gGmd-XSRw-kfa1-L60n-fgQpCb
>  stderr: Device I8MyTZ-RQjx-gGmd-XSRw-kfa1-L60n-fgQpCb already exists.
> Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-36
> Running command: /usr/bin/ceph-bluestore-tool --cluster=ceph
> prime-osd-dir --dev /dev/mapper/I8MyTZ-RQjx-gGmd-XSRw-kfa1-L60n-fgQpCb
> --path /var/lib/ceph/osd/ceph-36 --no-mon-config
>  stderr: failed to read label for
> /dev/mapper/I8MyTZ-RQjx-gGmd-XSRw-kfa1-L60n-fgQpCb: (2) No such file or
> directory
> -->  RuntimeError: command returned non-zero exit status: 1
>
> dmsetup ls lists this
>
> Where is an option to set the weight? As far as I can see you can only
> set this after peering started?
>
> How can I mount this tmpfs manually to inspect this? Maybe put in the
> manual[1]?
>
>
> [1]
> https://docs.ceph.com/en/latest/ceph-volume/lvm/activate/
>



-- 
Matt Larson, PhD
Madison, WI  53705 U.S.A.


[ceph-users] Re: objects misplaced jumps up at 5%

2020-09-29 Thread Matt Larson
Continuing on this topic, is it only possible to increase the placement
group (PG) count quickly, while the associated placement-group-for-placement
(PGP) value can only increase in smaller increments of 1-3? Does each
increase of the PGP count require another round of rebalancing and
backfilling of lots of PGs?

I am working with an erasure-coded pool whose target PG count I recently
increased via the pg_autoscaler's `target_size_ratio`, to be closer to
what I expect the pool's data size to grow to. I am wondering if this
pool will constantly hit 5% misplaced, incrementally add PGP count, and
repeat backfilling PGs while the PGs sit unscrubbed.

I have 160 OSDs in my pool and now have a target PG count of 2048.

I've seen a similar 5% misplaced and never-ending backfills described
in a thread by Paul Mezannini
(https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/OUEZJAEKFV74H5RCZSICNXS3P5JHYRK6/).

It'd be nice to know the correct strategy to adjust the overall PG
count and have a pool return to a healthy + balanced state.
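
Concretely, the things to look at here seem to be the following, plus the
knob that I believe controls the 5% step size (.05 is the default; .10 below
is only an example):

ceph osd pool ls detail                              # compare pg_num vs pgp_num and their targets
ceph config get mgr target_max_misplaced_ratio
ceph config set mgr target_max_misplaced_ratio .10   # lets each pgp_num step move more data at once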

-Matt


On Tue, Sep 29, 2020 at 9:10 AM Paul Emmerich  wrote:
>
> On Tue, Sep 29, 2020 at 12:34 PM Jake Grimmett 
> wrote:
>
> > I think you found the answer!
> >
> > When adding 100 new OSDs to the cluster, I increased both pg and pgp
> > from 4096 to 16,384
> >
>
> Too much for your cluster, 4096 seems sufficient for a pool of size 10.
> You can still reduce it relatively cheaply while it hasn't been fully
> actuated yet
>
>
> Paul
>
>
> >
> > **
> > [root@ceph1 ~]# ceph osd pool set ec82pool pg_num 16384
> > set pool 5 pg_num to 16384
> >
> > [root@ceph1 ~]# ceph osd pool set ec82pool pgp_num 16384
> > set pool 5 pgp_num to 16384
> >
> > **
> >
> > The pg number increased immediately as seen with "ceph -s"
> >
> > But unknown to me, the pgp number did not increase immediately.
> >
> > "ceph osd pool ls detail" shows that pgp is currently 11412
> >
> > Each time we hit 5.000% misplaced, the pgp number increases by 1 or 2,
> > this causes the % misplaced to increase again to ~5.1%
> > ... which is why we thought the cluster was not re-balancing.
> >
> >
> > If I'd looked at the ceph.audit.log there are entries like this:
> >
> > 2020-09-23 01:13:11.564384 mon.ceph3b (mon.1) 50747 : audit [INF]
> > from='mgr.90414409 10.1.0.80:0/7898' entity='mgr.ceph2' cmd=[{"prefix":
> > "osd pool set", "pool": "ec82pool", "var": "pgp_num_actual", "val":
> > "5076"}]: dispatch
> > 2020-09-23 01:13:11.565598 mon.ceph1b (mon.0) 85947 : audit [INF]
> > from='mgr.90414409 ' entity='mgr.ceph2' cmd=[{"prefix": "osd pool set",
> > "pool": "ec82pool", "var": "pgp_num_actual", "val": "5076"}]: dispatch
> > 2020-09-23 01:13:12.530584 mon.ceph1b (mon.0) 85949 : audit [INF]
> > from='mgr.90414409 ' entity='mgr.ceph2' cmd='[{"prefix": "osd pool set",
> > "pool": "ec82pool", "var": "pgp_num_actual", "val": "5076"}]': finished
> >
> >
> > Our assumption is that the pgp number will continue to increase till it
> > reaches its set level, at which point the cluster will complete it's
> > re-balance...
> >
> > again, many thanks to you both for your help,
> >
> > Jake
> >
> > On 28/09/2020 17:35, Paul Emmerich wrote:
> > > Hi,
> > >
> > > 5% misplaced is the default target ratio for misplaced PGs when any
> > > automated rebalancing happens, the sources for this are either the
> > > balancer or pg scaling.
> > > So I'd suspect that there's a PG change ongoing (either pg autoscaler or
> > > a manual change, both obey the target misplaced ratio).
> > > You can check this by running "ceph osd pool ls detail" and check for
> > > the value of pg target.
> > >
> > > Also: Looks like you've set osd_scrub_during_recovery = false, this
> > > setting can be annoying on large erasure-coded setups on HDDs that see
> > > long recovery times. It's better to get IO priorities right; search
> > > mailing list for osd op queue cut off high.
> > >
> > > Paul
> >
> > --
> > Dr Jake Grimmett
> > Head Of Scientific Computing
> > MRC Laboratory of Molecular Biology
> > Francis Crick Avenue,
> > Cambridge CB2 0QH, UK.
> >
> >
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io



--
Matt Larson, PhD
Madison, WI  53705 U.S.A.


[ceph-users] Unable to restart OSD assigned to LVM partition on Ceph 15.1.2?

2020-09-24 Thread Matt Larson
Hi,

 I recently restarted a storage node for our Ceph cluster and had an
issue bringing one of the OSDs back online. This storage node has
multiple HDs each as a devoted OSD for a data pool, and a single nVME
drive with an LVM partition assigned as an OSD in a metadata pool.
After rebooting the host, the OSD using an LVM partition did not
restart. When trying to manually start the OSD using systemctl, I can
follow the launch of a podman container and see an error message prior
to the container shutting down again:

 Sep 23 14:02:06 X bash[30318]: Running command:
/usr/bin/ceph-bluestore-tool --cluster=ceph prime-osd-dir --dev
/dev/boot/cephfs_meta --path /var/lib/ceph/osd/ceph-165
--no-mon-config
Sep 23 14:02:06 X bash[30318]: stderr: failed to read label for
/dev/boot/cephfs_meta: (2) No such file or directory
Sep 23 14:02:06 X bash[30318]: -->  RuntimeError: command returned
non-zero exit status: 1

 1. I can see the existence of the /dev/boot/cephfs_meta symlink to a
device ../dm-3
 2. `lsblk` shows the lvm partition 'boot-cephfs_meta' under nvme0n1p3
 3. `sudo lvscan --all` shows it as activated:
`  ACTIVE'/dev/boot/cephfs_meta' [3.42 TiB] inherit`

 This is on a CentOS 8 system, with ceph version 15.2.1
(9fd2f65f91d9246fae2c841a6222d34d121680ee) octopus (stable)

 Related issues I have found include:
 1. https://github.com/rook/rook/issues/2591
 2. https://github.com/rook/rook/issues/3289

 There were indicated solutions for these involving installing the
LVM2 package, which I completed with `sudo dnf install lvm2`, then
tried a restart of the system and restart of the container. This was
not able to resolve the problem for LVM-partition based OSD.

 This LVM-based OSD was initially created with a `ceph-volume`
command: `ceph-volume lvm create --bluestore --data /dev/sd
--block.db
/dev/nvme0n1`

 Is there a workaround for this problem where the container process is
unable to read the label of the LVM partition and fails to start the
OSD?
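
 In case it is relevant, the ceph-volume tooling can also be run from inside
a cephadm shell on the host; a sketch of what might shed more light here:

 sudo cephadm shell
 ceph-volume lvm list
 ceph-volume lvm activate --all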

 Thanks,
  Matt

-- 
Matt Larson, PhD
Madison, WI  53705 U.S.A.


[ceph-users] Re: Troubleshooting stuck unclean PGs?

2020-09-21 Thread Matt Larson
I tried this:

`sudo ceph tell 'osd.*' injectargs '--osd-max-backfills 4'`

Which has increased things to 10 simultaneous backfills and a roughly
10x higher rate of data movement. It looks like I could increase this
further by increasing the number of simultaneous recovery operations,
but changing that parameter to 20 didn't cause a change. The command
warned that OSDs may need to be restarted before this takes effect:

sudo ceph tell 'osd.*' injectargs '--osd-recovery-max-active 20'

I'll let it run overnight with a higher backfill rate and see if that
is sufficient to let the cluster catch up.

The commands are from
(http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-January/023844.html)
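
For the cephadm / Octopus case, I believe the persistent way to set these
(injectargs does not survive an OSD restart) is the config database, e.g.:

ceph config set osd osd_max_backfills 4
ceph config set osd osd_recovery_max_active 4
ceph config show osd.0 osd_max_backfills   # confirm what a given OSD is actually using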

-Matt

On Mon, Sep 21, 2020 at 7:20 PM Matt Larson  wrote:
>
> Hi Wout,
>
>  None of the OSDs are greater than 20% full. However, only 1 PG is
> backfilling at a time, while the others are backfill_wait. I had
> recently added a large amount of data to the Ceph cluster, and this
> may have caused the # of PGs to increase causing the need to rebalance
> or move objects.
>
>  It appears that I could increase the # of backfill operations that
> happen simultaneously by increasing `osd_max_backfills` and/or
> `osd_recovery_max_active`. It looks like I should maybe consider
> increasing the number of max backfills happening at a time because the
> overall io during the backfill is pretty small.
>
>  Does this seem reasonable? If so, with Ceph Octopus/cephadm, how can
> adjust the parameters?
>
>  Thanks,
>Matt
>
> On Mon, Sep 21, 2020 at 2:21 PM Wout van Heeswijk  wrote:
> >
> > Hi Matt,
> >
> > The mon data can grow while PGs are stuck unclean. Don't restart the 
> > mons.
> >
> > You need to find out why your placement groups are "backfill_wait". Likely 
> > some of your OSDs are (near)full.
> >
> > If you have space elsewhere you can use the ceph balancer module or 
> > reweighting of OSDs to rebalance data.
> >
> > Scrubbing will continue once the PGs are "active+clean"
> >
> > Kind regards,
> >
> > Wout
> > 42on
> >
> > 
> > From: Matt Larson 
> > Sent: Monday, September 21, 2020 6:22 PM
> > To: ceph-users@ceph.io
> > Subject: [ceph-users] Troubleshooting stuck unclean PGs?
> >
> > Hi,
> >
> >  Our Ceph cluster is reporting several PGs that have not been scrubbed
> > or deep scrubbed in time. It is over a week for these PGs to have been
> > scrubbed. When I checked the `ceph health detail`, there are 29 pgs
> > not deep-scrubbed in time and 22 pgs not scrubbed in time. I tried to
> > manually start a scrub on the PGs, but it appears that they are
> > actually in an unclean state that needs to be resolved first.
> >
> > This is a cluster running:
> >  ceph version 15.2.1 (9fd2f65f91d9246fae2c841a6222d34d121680ee) octopus 
> > (stable)
> >
> >  Following the information at [Troubleshooting
> > PGs](https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-pg/),
> > I checked for PGs that are stuck stale | inactive | unclean. There
> > were no PGs that are stale or inactive, but there are several that are
> > stuck unclean:
> >
> >  ```
> > PG_STAT  STATE  UP
> >UP_PRIMARY  ACTINGACTING_PRIMARY
> > 8.3c active+remapped+backfill_wait
> > [124,41,108,8,87,16,79,157,49] 124
> > [139,57,16,125,154,65,109,86,45] 139
> > 8.3e active+remapped+backfill_wait
> > [108,2,58,146,130,29,37,66,118] 108
> > [127,92,24,50,33,6,130,66,149] 127
> > 8.3f active+remapped+backfill_wait
> > [19,34,86,132,59,78,153,99,6]  19
> > [90,45,147,4,105,61,30,66,125]  90
> > 8.40 active+remapped+backfill_wait
> > [19,131,80,76,42,101,61,3,144]  19
> > [28,106,132,3,151,36,65,60,83]  28
> > 8.3a   active+remapped+backfilling
> > [32,72,151,30,103,131,62,84,120]  32
> > [91,60,7,133,101,117,78,20,158]  91
> > 8.7e active+remapped+backfill_wait
> > [108,2,58,146,130,29,37,66,118] 108
> > [127,92,24,50,33,6,130,66,149] 127
> > 8.3b active+remapped+backfill_wait
> > [34,113,148,63,18,95,70,129,13]  34
> > [66,17,132,90,14,52,101,47,115]  66
> > 8.7f active+remapped+backfill_wait
> > [19,34,86,132,59,78,153,99,6]  19
> > [90,45,147,4,105,61,30,66,125]  90
> > 8.78 active+remapped+backfill_wait
> > [96,113,159,63,29,133,73,8,89]  

[ceph-users] Re: Troubleshooting stuck unclean PGs?

2020-09-21 Thread Matt Larson
Hi Wout,

 None of the OSDs are greater than 20% full. However, only 1 PG is
backfilling at a time, while the others are backfill_wait. I had
recently added a large amount of data to the Ceph cluster, and this
may have caused the # of PGs to increase causing the need to rebalance
or move objects.

 It appears that I could increase the # of backfill operations that
happen simultaneously by increasing `osd_max_backfills` and/or
`osd_recovery_max_active`. It looks like I should maybe consider
increasing the number of max backfills happening at a time because the
overall io during the backfill is pretty small.

 Does this seem reasonable? If so, with Ceph Octopus/cephadm, how can I
adjust the parameters?

 Thanks,
   Matt

On Mon, Sep 21, 2020 at 2:21 PM Wout van Heeswijk  wrote:
>
> Hi Matt,
>
> The mon data can grow while PGs are stuck unclean. Don't restart the 
> mons.
>
> You need to find out why your placement groups are "backfill_wait". Likely 
> some of your OSDs are (near)full.
>
> If you have space elsewhere you can use the ceph balancer module or 
> reweighting of OSDs to rebalance data.
>
> Scrubbing will continue once the PGs are "active+clean"
>
> Kind regards,
>
> Wout
> 42on
>
> 
> From: Matt Larson 
> Sent: Monday, September 21, 2020 6:22 PM
> To: ceph-users@ceph.io
> Subject: [ceph-users] Troubleshooting stuck unclean PGs?
>
> Hi,
>
>  Our Ceph cluster is reporting several PGs that have not been scrubbed
> or deep scrubbed in time. It is over a week for these PGs to have been
> scrubbed. When I checked the `ceph health detail`, there are 29 pgs
> not deep-scrubbed in time and 22 pgs not scrubbed in time. I tried to
> manually start a scrub on the PGs, but it appears that they are
> actually in an unclean state that needs to be resolved first.
>
> This is a cluster running:
>  ceph version 15.2.1 (9fd2f65f91d9246fae2c841a6222d34d121680ee) octopus 
> (stable)
>
>  Following the information at [Troubleshooting
> PGs](https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-pg/),
> I checked for PGs that are stuck stale | inactive | unclean. There
> were no PGs that are stale or inactive, but there are several that are
> stuck unclean:
>
>  ```
> PG_STAT  STATE  UP
>UP_PRIMARY  ACTINGACTING_PRIMARY
> 8.3c active+remapped+backfill_wait
> [124,41,108,8,87,16,79,157,49] 124
> [139,57,16,125,154,65,109,86,45] 139
> 8.3e active+remapped+backfill_wait
> [108,2,58,146,130,29,37,66,118] 108
> [127,92,24,50,33,6,130,66,149] 127
> 8.3f active+remapped+backfill_wait
> [19,34,86,132,59,78,153,99,6]  19
> [90,45,147,4,105,61,30,66,125]  90
> 8.40 active+remapped+backfill_wait
> [19,131,80,76,42,101,61,3,144]  19
> [28,106,132,3,151,36,65,60,83]  28
> 8.3a   active+remapped+backfilling
> [32,72,151,30,103,131,62,84,120]  32
> [91,60,7,133,101,117,78,20,158]  91
> 8.7e active+remapped+backfill_wait
> [108,2,58,146,130,29,37,66,118] 108
> [127,92,24,50,33,6,130,66,149] 127
> 8.3b active+remapped+backfill_wait
> [34,113,148,63,18,95,70,129,13]  34
> [66,17,132,90,14,52,101,47,115]  66
> 8.7f active+remapped+backfill_wait
> [19,34,86,132,59,78,153,99,6]  19
> [90,45,147,4,105,61,30,66,125]  90
> 8.78 active+remapped+backfill_wait
> [96,113,159,63,29,133,73,8,89]  96
> [138,121,15,103,55,41,146,69,18] 138
> 8.7d   active+remapped+backfilling
> [0,90,60,124,159,19,71,101,135]   0
> [150,72,124,129,63,10,94,29,41] 150
> 8.7c active+remapped+backfill_wait
> [124,41,108,8,87,16,79,157,49] 124
> [139,57,16,125,154,65,109,86,45] 139
> 8.79 active+remapped+backfill_wait
> [59,15,41,82,131,20,73,156,113]  59
> [13,51,120,102,29,149,42,79,132]  13
> ```
>
> If I query one of the PGs that is backfilling, 8.3a, it shows it's state as :
> "recovery_state": [
> {
> "name": "Started/Primary/Active",
> "enter_time": "2020-09-19T20:45:44.027759+",
> "might_have_unfound": [],
> "recovery_progress": {
> "backfill_targets": [
> "30(3)",
> "32(0)",
> "62(6)",
> "72(1)",
> "84(7)",
> "103(4)",

[ceph-users] Troubleshooting stuck unclean PGs?

2020-09-21 Thread Matt Larson
Hi,

 Our Ceph cluster is reporting several PGs that have not been scrubbed
or deep scrubbed in time. It has been over a week since these PGs were
last scrubbed. When I checked `ceph health detail`, there are 29 pgs
not deep-scrubbed in time and 22 pgs not scrubbed in time. I tried to
manually start a scrub on the PGs, but it appears that they are
actually in an unclean state that needs to be resolved first.
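
Roughly the commands involved, for reference (a sketch; the PG id is
just one example from the listing further down):

```
# List PGs the cluster considers stuck in a given state.
ceph pg dump_stuck stale
ceph pg dump_stuck inactive
ceph pg dump_stuck unclean

# Manually request a scrub / deep scrub of a single PG, e.g. 8.3c.
ceph pg scrub 8.3c
ceph pg deep-scrub 8.3c
```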

This is a cluster running:
 ceph version 15.2.1 (9fd2f65f91d9246fae2c841a6222d34d121680ee) octopus (stable)

 Following the information at [Troubleshooting
PGs](https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-pg/),
I checked for PGs that are stuck stale | inactive | unclean. There
were no PGs that are stale or inactive, but there are several that are
stuck unclean:

 ```
PG_STAT  STATE  UP  UP_PRIMARY  ACTING  ACTING_PRIMARY
8.3c  active+remapped+backfill_wait  [124,41,108,8,87,16,79,157,49]  124  [139,57,16,125,154,65,109,86,45]  139
8.3e  active+remapped+backfill_wait  [108,2,58,146,130,29,37,66,118]  108  [127,92,24,50,33,6,130,66,149]  127
8.3f  active+remapped+backfill_wait  [19,34,86,132,59,78,153,99,6]  19  [90,45,147,4,105,61,30,66,125]  90
8.40  active+remapped+backfill_wait  [19,131,80,76,42,101,61,3,144]  19  [28,106,132,3,151,36,65,60,83]  28
8.3a  active+remapped+backfilling  [32,72,151,30,103,131,62,84,120]  32  [91,60,7,133,101,117,78,20,158]  91
8.7e  active+remapped+backfill_wait  [108,2,58,146,130,29,37,66,118]  108  [127,92,24,50,33,6,130,66,149]  127
8.3b  active+remapped+backfill_wait  [34,113,148,63,18,95,70,129,13]  34  [66,17,132,90,14,52,101,47,115]  66
8.7f  active+remapped+backfill_wait  [19,34,86,132,59,78,153,99,6]  19  [90,45,147,4,105,61,30,66,125]  90
8.78  active+remapped+backfill_wait  [96,113,159,63,29,133,73,8,89]  96  [138,121,15,103,55,41,146,69,18]  138
8.7d  active+remapped+backfilling  [0,90,60,124,159,19,71,101,135]  0  [150,72,124,129,63,10,94,29,41]  150
8.7c  active+remapped+backfill_wait  [124,41,108,8,87,16,79,157,49]  124  [139,57,16,125,154,65,109,86,45]  139
8.79  active+remapped+backfill_wait  [59,15,41,82,131,20,73,156,113]  59  [13,51,120,102,29,149,42,79,132]  13
```

If I query one of the PGs that is backfilling, 8.3a, it shows its state as:
"recovery_state": [
{
"name": "Started/Primary/Active",
"enter_time": "2020-09-19T20:45:44.027759+",
"might_have_unfound": [],
"recovery_progress": {
"backfill_targets": [
"30(3)",
"32(0)",
"62(6)",
"72(1)",
"84(7)",
"103(4)",
"120(8)",
"131(5)",
"151(2)"
],
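
That excerpt comes from querying the PG directly; a minimal way to
reproduce it (assuming `jq` is installed, purely for readability):

```
# Full PG query; the recovery_state section is what is excerpted above.
ceph pg 8.3a query | jq '.recovery_state'
```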

Q1: Is there anything that I should check or fix to help these PGs get
out of the `unclean` state? (See the command sketch below.)
Q2: I have also seen that the podman containers on one of our OSD
servers are taking a large amount of disk space. Is there a way to
limit the disk space used by podman containers when administering a
Ceph cluster with the `cephadm` tools? At last check, a server running
16 OSDs and 1 MON is using 39G of disk space for its running
containers. Can restarting the containers help to start with a fresh
slate or reduce the disk use?
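
A sketch of commands for both questions, assuming a fairly standard
cephadm/podman deployment (illustrative, not prescriptive):

```
# Q1: backfill_wait means data is queued to move; check OSD fullness and
# whether the balancer can even out utilization.
ceph osd df tree           # per-OSD utilization and variance
ceph balancer status
ceph balancer mode upmap   # upmap requires all clients to be >= luminous
ceph balancer on

# Q2: see where podman's disk space is going; pruning dangling images is
# generally safe, but avoid removing images still used by running Ceph
# containers.
podman system df
podman image prune         # removes dangling images only
```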

Thanks,
  Matt



Matt Larson
Associate Scientist
Computer Scientist/System Administrator
UW-Madison Cryo-EM Research Center
433 Babcock Drive, Madison, WI 53706
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Error with zabbix module on Ceph Octopus

2020-05-06 Thread Matt Larson
Hi,

 I am trying to set up the Zabbix reporting module, but it is giving an
error that looks like a Python error:

ceph zabbix config-show
Error EINVAL: TypeError: __init__() got an unexpected keyword argument 'index'

I have configured the zabbix_host and identifier already at this point.

The command `ceph zabbix send` also fails, reporting 'Failed to send
data to Zabbix'.

I am running:
 - CentOS 8.1
 - ceph version 15.2.1 (9fd2f65f91d9246fae2c841a6222d34d121680ee)
octopus (stable)
 - Zabbix 4.4-1.el8 release
(https://repo.zabbix.com/zabbix/4.4/rhel/8/x86_64/zabbix-release-4.4-1.el8.noarch.rpm)
 - Python version 3.6.8

Any suggestions? I am wondering whether this module might require Python 2.7 to run.
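
In case it helps anyone hitting the same error, a few checks that can
narrow it down (the hostname, identifier, and item key below are
placeholders, not values from this cluster):

```
# Is the module enabled, and does the mgr report it as failed?
ceph mgr module ls | grep -i zabbix
ceph status                # a failed mgr module shows up as a health warning

# Re-apply the configuration and retry a send.
ceph zabbix config-set zabbix_host zabbix.example.com
ceph zabbix config-set identifier ceph-cluster
ceph zabbix send

# The module shells out to zabbix_sender, so testing that directly from
# the active mgr host separates Ceph-side from Zabbix-side problems.
# (Use an item key from the imported Ceph template.)
zabbix_sender -z zabbix.example.com -s ceph-cluster -k ceph.test -o 1 -vv
```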

-- 
Matt Larson, PhD
Madison, WI  53705 U.S.A.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Benefits of high RAM on a metadata server?

2020-02-06 Thread Matt Larson
Hi Bogdan,

Do the "client failing to respond" messages indicate that you are
actually exceeding the 128 GB of RAM on your MDS hosts?

The MDS servers are not planned to have SSD drives. The storage
servers would have HDDs and one NVMe SSD drive that could hold the
metadata volumes.
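
For context on the RAM question: the MDS cache is bounded by
`mds_cache_memory_limit`, and the "failing to respond to cache pressure"
warnings appear when the MDS asks clients to release capabilities so it
can stay under that limit and a client is slow to comply. A rough sketch
of the knobs involved (the 64 GiB figure is only an example sized for a
128 GB host, not a recommendation):

```
# Current limit (in bytes); the MDS process typically uses noticeably
# more RAM than this limit, so leave headroom on the host.
ceph config get mds mds_cache_memory_limit
ceph config set mds mds_cache_memory_limit 68719476736   # 64 GiB (example)

# Per-MDS dentry/inode counts to see how much cache is actually in use.
ceph fs status
```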


On Thu, Feb 6, 2020 at 4:11 PM Bogdan Adrian Velica  wrote:
>
> Hi,
> I am running 3 MDS servers (1 active and 2 backups, and I recommend that),
> each with 128 GB of RAM (the clients are running ML analysis), and I have
> about 20 million inodes loaded in RAM. It's working fine except for some
> warnings: "client X is failing to respond to cache pressure."
> Besides that there are no complaints, but I think you would need the 256 GB
> of RAM, especially if the datasets will increase... just my 2 cents.
>
> Will you have SSDs?
>
>
>
> On Fri, Feb 7, 2020 at 12:02 AM Matt Larson  wrote:
>>
>> Hi, we are planning out a Ceph storage cluster and were choosing
>> between 64GB, 128GB, or even 256GB on metadata servers. We are
>> considering having 2 metadata servers overall.
>>
>> Does going to high levels of RAM possibly yield any performance
>> benefits? Is there a size beyond which there are just diminishing
>> returns vs cost?
>>
>> The expected use case would be for a cluster where there might be
>> 10-20 concurrent users working on individual datasets of 5TB in size.
>> I expect there would be lots of reads of the 5TB datasets matched with
>> the creation of hundreds to thousands of smaller files during
>> processing of the images.
>>
>> Thanks!
>> -Matt
>>
>> --
>> Matt Larson, PhD
>> Madison, WI  53705 U.S.A.
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io



-- 
Matt Larson, PhD
Madison, WI  53705 U.S.A.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Benefits of high RAM on a metadata server?

2020-02-06 Thread Matt Larson
Hi, we are planning out a Ceph storage cluster and are choosing
between 64GB, 128GB, or even 256GB of RAM on the metadata servers. We
are considering having 2 metadata servers overall.

Does going to higher levels of RAM yield any performance benefit? Is
there a size beyond which there are just diminishing returns versus
cost?

The expected use case would be for a cluster where there might be
10-20 concurrent users working on individual datasets of 5TB in size.
I expect there would be lots of reads of the 5TB datasets matched with
the creation of hundreds to thousands of smaller files during
processing of the images.

Thanks!
-Matt

-- 
Matt Larson, PhD
Madison, WI  53705 U.S.A.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io