[ceph-users] Re: MDS Behind on Trimming...

2024-04-21 Thread Xiubo Li

Hi Erich,

I raised one tracker for this https://tracker.ceph.com/issues/65607.

Currently I haven't figured out what was holding the 'dn->lock' in the 
'lookup' request or elsewhere, since there are no debug logs.


Hopefully we can get the debug logs, so we can push this further.

Thanks

- Xiubo

On 4/19/24 23:55, Erich Weiler wrote:

Hi Xiubo,

Never mind, I was wrong: most of the blocked ops were 12 hours old. Ugh.

I restarted the MDS daemon to clear them.

I just reset to having one active MDS instead of two, let's see if 
that makes a difference.


I am beginning to think it may be impossible to catch the logs that 
matter here.  I feel like sometimes the blocked ops are just waiting 
because of load, and sometimes they are waiting because they are stuck. 
But it's really hard to tell which without waiting a while, and I 
can't wait with debug turned on because my root disks (which are 
150 GB) fill up with debug logs in 20 minutes.  So it almost seems 
that unless I can somehow store many TB of debug logs, we won't 
be able to catch this.
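
One workaround might be to point the MDS log at a larger volume and rotate it aggressively while debugging. A rough sketch, assuming a package-based (non-containerized) MDS and placeholder paths:

# /etc/ceph/ceph.conf on the MDS host: send the MDS log to a bigger mount
[mds]
    log file = /mnt/bigdisk/ceph/$cluster-$name.log

# /etc/logrotate.d/ceph-mds-debug (placeholder): rotate aggressively;
# 'hourly' only takes effect if logrotate is invoked hourly, e.g. from cron.hourly
/mnt/bigdisk/ceph/*.log {
    hourly
    rotate 48
    compress
    missingok
    copytruncate
}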


Let's see how having one MDS helps.  Or maybe I actually need like 4 
MDSs because the load is too high for only one or two.  I don't know. 
Or maybe it's the lock issue you've been working on.  I guess I can 
test the lock order fix when it's available to test.


-erich

On 4/19/24 7:26 AM, Erich Weiler wrote:
So I woke up this morning and checked the blocked_ops again, there 
were 150 of them.  But the age of each ranged from 500 to 4300 
seconds.  So it seems as if they are eventually being processed.


I wonder if we are thinking about this in the wrong way?  Maybe I 
should be *adding* MDS daemons because my current ones are overloaded?


Can a single server hold multiple MDS daemons?  Right now I have 
three physical servers each with one MDS daemon on it.


I can still try reducing to one.  And I'll keep an eye on blocked ops 
to see if any get to a very old age (and are thus wedged).
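
A rough way to watch that from the admin socket (the daemon name is a placeholder, and this assumes the blocked-ops dump has the same per-op "age" fields as dump_ops_in_flight):

# how many ops are currently blocked
ceph daemon mds.$(hostname -s) dump_blocked_ops | grep -c '"description"'
# the oldest ages, in seconds
ceph daemon mds.$(hostname -s) dump_blocked_ops | grep '"age"' | sort -t: -k2 -rn | head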


-erich

On 4/18/24 8:55 PM, Xiubo Li wrote:

Okay, please try setting only one active MDS.
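
For reference, a minimal sketch, assuming the filesystem is named "cephfs":

# drop to a single active MDS; the second active daemon will return to standby
ceph fs set cephfs max_mds 1
# confirm the new layout
ceph fs status cephfs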


On 4/19/24 11:54, Erich Weiler wrote:

We have 2 active MDS daemons and one standby.

On 4/18/24 8:52 PM, Xiubo Li wrote:

BTW, how many active MDS daemons are you using?


On 4/19/24 10:55, Erich Weiler wrote:
OK, I'm sure I caught it in the right order this time, the logs 
should definitely show when the blocked/slow requests start.  
Check out these logs and dumps:


http://hgwdev.gi.ucsc.edu/~weiler/

It's a 762 MB tarball but it uncompresses to 16 GB.

-erich


On 4/18/24 6:57 PM, Xiubo Li wrote:

Okay, could you try this with 18.2.0?

I suspect it was introduced by:

commit e610179a6a59c463eb3d85e87152ed3268c808ff
Author: Patrick Donnelly 
Date:   Mon Jul 17 16:10:59 2023 -0400

    mds: drop locks and retry when lock set changes

    An optimization was added to avoid an unnecessary gather on the inode
    filelock when the client can safely get the file size without also
    getting issued the requested caps. However, if a retry of getattr
    is necessary, this conditional inclusion of the inode filelock
    can cause lock-order violations resulting in deadlock.

    So, if we've already acquired some of the inode's locks then we must
    drop locks and retry.

    Fixes: https://tracker.ceph.com/issues/62052
    Fixes: c822b3e2573578c288d170d1031672b74e02dced
    Signed-off-by: Patrick Donnelly 
    (cherry picked from commit b5719ac32fe6431131842d62ffaf7101c03e9bac)



On 4/19/24 09:54, Erich Weiler wrote:
I'm on 18.2.1.  I think I may have gotten the timing off on the 
logs and dumps so I'll try again.  Just really hard to capture 
because I need to kind of be looking at it in real time to 
capture it. Hang on, lemme see if I can get another capture...


-erich

On 4/18/24 6:35 PM, Xiubo Li wrote:


BTW, which Ceph version are you using?



On 4/12/24 04:22, Erich Weiler wrote:
BTW - it just happened again, I upped the debugging settings 
as you instructed and got more dumps (then returned the debug 
settings to normal).


Attached are the new dumps.

Thanks again,
erich

On 4/9/24 9:00 PM, Xiubo Li wrote:


On 4/10/24 11:48, Erich Weiler wrote:
Does that mean it could be the lock order bug 
(https://tracker.ceph.com/issues/62123), as Xiubo suggested?


I have raised a PR to fix the lock order issue; if possible, 
please give it a try to see whether it resolves this issue.


Thank you!  Yeah, this issue is happening every couple days 
now. It just happened again today and I got more MDS dumps. 
If it would help, let me know and I can send them!


Once this happens, it would be better if you could enable the MDS debug logs:


debug mds = 20

debug ms = 1

And then provide the debug logs together with the MDS dumps.
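
A sketch of doing that at runtime via the central config, then reverting once the dumps are captured:

# raise debugging for all MDS daemons
ceph config set mds debug_mds 20
ceph config set mds debug_ms 1
# ... reproduce the slow/blocked requests, collect the logs and dumps ...
# drop back to the defaults
ceph config rm mds debug_mds
ceph config rm mds debug_ms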


I assume if this fix is approved and backported it will 
then appear in like 18.2.3 or something?



Yeah, it will be backported after being well tested.

- Xiubo


Thanks again,
erich

[ceph-users] Re: MDS crash

2024-04-21 Thread Xiubo Li

Hi Alexey,

This looks a new issue for me. Please create a tracker for it and 
provide the detail call trace there.


Thanks

- Xiubo

On 4/19/24 05:42, alexey.gerasi...@opencascade.com wrote:

Dear colleagues, hope that anybody can help us.

The initial point: a Ceph cluster v15.2 (installed and controlled by Proxmox) 
with 3 nodes based on physical servers rented from a cloud provider. 
CephFS is installed as well.

Yesterday we discovered that some of the applications had stopped working. During 
the investigation we recognized that we have a problem with Ceph, more 
precisely with CephFS - the MDS daemons suddenly crashed. We tried to restart them 
and found that they crashed again immediately after starting. The crash 
information:
2024-04-17T17:47:42.841+ 7f959ced9700  1 mds.0.29134 recovery_done -- 
successful recovery!
2024-04-17T17:47:42.853+ 7f959ced9700  1 mds.0.29134 active_start
2024-04-17T17:47:42.881+ 7f959ced9700  1 mds.0.29134 cluster recovered.
2024-04-17T17:47:43.825+ 7f959aed5700 -1 ./src/mds/OpenFileTable.cc: In 
function 'void OpenFileTable::commit(MDSContext*, uint64_t, int)' thread 
7f959aed5700 time 2024-04-17T17:47:43.831243+
./src/mds/OpenFileTable.cc: 549: FAILED ceph_assert(count > 0)

Over the next hours we read tons of articles, studied the documentation, and checked 
the overall state of the Ceph cluster with various diagnostic commands – but 
didn't find anything wrong. In the evening we decided to upgrade to v16, and 
finally to v17.2.7. Unfortunately, that didn't solve the problem; the MDS continues to 
crash with the same error. The only difference we found is “1 MDSs report 
damaged metadata” in the output of ceph -s – see it below.

I supposed it might be a well-known bug, but couldn't find a matching one on 
https://tracker.ceph.com - there are several bugs associated with 
OpenFileTable.cc, but none related to ceph_assert(count > 0).

We also checked the source code of OpenFileTable.cc; here is a fragment 
of it, in the function OpenFileTable::_journal_finish:
   int omap_idx = anchor.omap_idx;
   unsigned& count = omap_num_items.at(omap_idx);
   ceph_assert(count > 0);
So we guess that the object map is empty for some object in Ceph, which is 
unexpected behavior. But again, we found nothing wrong in our cluster…

Next, we started with the 
https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/ article – 
we tried to reset the journal (despite it having been fine the whole time) and wiped the 
sessions using the "cephfs-table-tool all reset session" command. No result…
I then decided to continue following that article and run the "cephfs-data-scan 
scan_extents" command, which is running right now. But I doubt it will 
solve the issue, because there seems to be no problem with our objects in Ceph.
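
For reference, a condensed sketch of the sequence from that page, with a journal backup taken first. The data pool name below is a placeholder, the MDS must be stopped, and exact flags can vary by release - this is not something to run without being sure it is needed:

# back up the journal before touching anything
cephfs-journal-tool --rank=cephfs:0 journal export /root/cephfs-journal.backup.bin
# salvage what can be salvaged, then reset
cephfs-journal-tool --rank=cephfs:0 event recover_dentries summary
cephfs-journal-tool --rank=cephfs:0 journal reset
cephfs-table-tool all reset session
# full data scan (slow with ~30M objects); "cephfs_data" is a placeholder pool name
cephfs-data-scan scan_extents cephfs_data
cephfs-data-scan scan_inodes cephfs_data
cephfs-data-scan scan_links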

Is this a new bug, or something else? Any idea is welcome!

The important outputs:

- ceph -s
   cluster:
 id: 4cd1c477-c8d0-4855-a1f1-cb71d89427ed
 health: HEALTH_ERR
 1 MDSs report damaged metadata
 insufficient standby MDS daemons available
 83 daemons have recently crashed
 3 mgr modules have recently crashed

   services:
 mon: 3 daemons, quorum asrv-dev-stor-2,asrv-dev-stor-3,asrv-dev-stor-1 
(age 22h)
 mgr: asrv-dev-stor-2(active, since 22h), standbys: asrv-dev-stor-1
 mds: 1/1 daemons up
 osd: 18 osds: 18 up (since 22h), 18 in (since 29h)

   data:
 volumes: 1/1 healthy
 pools:   5 pools, 289 pgs
 objects: 29.72M objects, 5.6 TiB
 usage:   21 TiB used, 47 TiB / 68 TiB avail
 pgs: 287 active+clean
  2   active+clean+scrubbing+deep

   io:
 client:   2.5 KiB/s rd, 172 KiB/s wr, 261 op/s rd, 195 op/s wr

-ceph fs dump
e29480
enable_multiple, ever_enabled_multiple: 0,1
default compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable 
ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses 
versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no 
anchor table,9=file layout v2,10=snaprealm v2}
legacy client fscid: 1

Filesystem 'cephfs' (1)
fs_name cephfs
epoch   29480
flags   12 joinable allow_snaps allow_multimds_snaps
created 2022-11-25T15:56:08.507407+
modified2024-04-18T16:52:29.970504+
tableserver 0
root0
session_timeout 60
session_autoclose   300
max_file_size   1099511627776
required_client_features{}
last_failure0
last_failure_osd_epoch  14728
compat  compat={},rocompat={},incompat={1=base v0.20,2=client writeable 
ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses 
versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no 
anchor table,9=file layout v2,10=snaprealm v2}
max_mds 1
in  0
up  {0=156636152}
failed
damaged
stopped
data_pools  [5]
metadata_pool   6
inline_data disabled
balancer
standby_count_wanted1
[mds.asrv-dev-stor-1{0:156636152} state up:active seq 6 laggy 

[ceph-users] Re: Best practice and expected benefits of using separate WAL and DB devices with Bluestore

2024-04-21 Thread Anthony D'Atri
>Do you have any data on the reliability of QLC NVMe drives? 

They were my job for a year, so yes, I do.  The published specs are accurate.  
A QLC drive built from the same NAND as a TLC drive will have more capacity, 
but less endurance.  Depending on the model, you may wish to enable 
`bluestore_use_optimal_io_size_for_min_alloc_size` when creating your OSDs.  
The Intel / Solidigm P5316, for example, has a 64KB IU size, so performance and 
endurance will benefit from aligning the OSD `min_alloc_size` to that value.  Note 
that this is baked in at creation; you cannot change it on a given OSD after 
the fact, but you can redeploy the OSD and let it recover.
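
A sketch of what that looks like in practice (the setting is only honored at OSD creation, so it has to be in place before the OSDs are deployed; the osd metadata check is an assumption about recent releases):

# let BlueStore derive min_alloc_size from the drive's reported optimal IO size
ceph config set osd bluestore_use_optimal_io_size_for_min_alloc_size true
# or pin it explicitly for a 64KB-IU drive such as the P5316
ceph config set osd bluestore_min_alloc_size_ssd 65536
# after redeploying an OSD, check what it was created with (OSD id is a placeholder)
ceph osd metadata 12 | grep -i min_alloc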

Other SKUs have 8KB or 16KB IU sizes, some have 4KB which requires no specific 
min_alloc_size.  Note that QLC is a good fit for workloads where writes tend to 
be sequential and reasonably large on average and infrequent.  I know of 
successful QLC RGW clusters that see 0.01 DWPD.  Yes, that decimal point is in 
the correct place.  Millions of 1KB files overwritten once an hour aren't a 
good workload for QLC. Backups, archives, even something like an OpenStack 
Glance pool are good fits.  I'm about to trial QLC as Prometheus LTS as well. 
Read-mostly workloads are good fits, as the read performance is in the ballpark 
of TLC.  Write performance is still going to be way better than any HDD, and 
you aren't stuck with legacy SATA slots.  You also don't have to buy or manage 
a fussy HBA.

> How old is your deep archive cluster, how many NVMes it has, and how many did 
> you
> have to replace?

I don't personally have one at the moment.

Even with TLC, endurance is, dare I say, overrated.  99% of enterprise SSDs 
never burn more than 15% of their rated endurance.  SSDs from at least some 
manufacturers have a timed workload feature in firmware that will estimate 
drive lifetime when presented with a real-world workload -- this is based on 
observed PE cycles.

Pretty much any SSD will report lifetime used or remaining, so whether TLC, QLC, or even 
MLC or SLC, you should collect those metrics in your time-series DB and watch 
both for drives nearing EOL and for their burn rates.  
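
For example, the NVMe "Percentage Used" counter is exposed by both smartctl and nvme-cli (device names are placeholders), and is easy to feed into Prometheus via node_exporter or similar:

smartctl -a /dev/nvme0n1 | grep -i 'percentage used'
nvme smart-log /dev/nvme0 | grep -i percentage_used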

> 
> On Sun, Apr 21, 2024 at 11:06 PM Anthony D'Atri  
> wrote:
>> 
>> A deep archive cluster benefits from NVMe too.  You can use QLC up to 60TB 
>> in size, 32 of those in one RU makes for a cluster that doesn’t take up the 
>> whole DC.
>> 
>>> On Apr 21, 2024, at 5:42 AM, Darren Soothill  
>>> wrote:
>>> 
>>> Hi Niklaus,
>>> 
>>> Lots of questions here but let me tray and get through some of them.
>>> 
>>> Personally unless a cluster is for deep archive then I would never suggest 
>>> configuring or deploying a cluster without Rocks DB and WAL on NVME.
>>> There are a number of benefits to this in terms of performance and 
>>> recovery. Small writes go to the NVME first before being written to the HDD 
>>> and it makes many recovery operations far more efficient.
>>> 
>>> As to how much faster it makes things that very much depends on the type of 
>>> workload you have on the system. Lots of small writes will make a 
>>> significant difference. Very large writes not as much of a difference.
>>> Things like compactions of the RocksDB database are a lot faster as they 
>>> are now running from NVME and not from the HDD.
>>> 
>>> We normally work with  a upto 1:12 ratio so 1 NVME for every 12 HDD’s. This 
>>> is assuming the NVME’s being used are good mixed use enterprise NVME’s with 
>>> power loss protection.
>>> 
>>> As to failures yes a failure of the NVME would mean a loss of 12 OSD’s but 
>>> this is no worse than a failure of an entire node. This is something Ceph 
>>> is designed to handle.
>>> 
>>> I certainly wouldn’t be thinking about putting the NVME’s into raid sets as 
>>> that will degrade the performance of them when you are trying to get better 
>>> performance.
>>> 
>>> 
>>> 
>>> Darren Soothill
>>> 
>>> 
>>> Looking for help with your Ceph cluster? Contact us at https://croit.io/
>>> 
>>> croit GmbH, Freseniusstr. 31h, 81247 Munich
>>> CEO: Martin Verges - VAT-ID: DE310638492
>>> Com. register: Amtsgericht Munich HRB 231263
>>> Web: https://croit.io/ | YouTube: https://goo.gl/PGE1Bx
>>> 
>>> 
>>> 
>>> 
>>> ___
>>> ceph-users mailing list -- ceph-users@ceph.io
>>> To unsubscribe send an email to ceph-users-le...@ceph.io
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
> 
> 
> 
> -- 
> Alexander E. Patrakov
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RGW: Cannot write to bucket anymore

2024-04-21 Thread Malte Stroem

Hello Robin,

thank you.

The object-stat did not show anything suspicious.

And the logs do show

s3:get_obj decode_policy Read AccessControlPolicy xmlns="http://s3.amazonaws.com/doc/2006-03-01/">XY 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:type="CanonicalUser">XY 
FULL_CONTROL

and then it fails with

s3:put_obj http status=403

So we do not see any errors or anything obviously wrong.

Everything looks the same to the other working buckets.

No versioning.

But there has to be something.

Where can I have a look?

I tried almost everything with the aws cli to find something, but there is 
nothing.


Are there any rados or other commands to debug this?
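
For reference, a few radosgw-admin calls that might help narrow it down (bucket/user names are placeholders):

radosgw-admin bucket stats --bucket=mybucket
radosgw-admin bucket check --bucket=mybucket
radosgw-admin policy --bucket=mybucket       # dump the bucket ACL/policy
radosgw-admin user info --uid=myuser         # suspended flag, caps, quota section
radosgw-admin lc list                        # lifecycle status per bucket

If I remember correctly, an exceeded user or bucket quota also surfaces as a 403 on put_obj while reads keep working, so the quota section of `user info` is worth a glance.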

Best,
Malte

On 22.03.24 02:35, Robin H. Johnson wrote:

On Thu, Mar 21, 2024 at 11:20:44AM +0100, Malte Stroem wrote:

Hello Robin,

thanks a lot.

Yes, I set debug to debug_rgw=20 & debug_ms=1.

It's that 403 I always get.

There is no versioning enabled.

There is a lifecycle policy for removing the files after one day.

Did the object stat call return anything?

Can you show more of the debug output (redact the keys/hostname/filename)?


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Best practice and expected benefits of using separate WAL and DB devices with Bluestore

2024-04-21 Thread Alexander E. Patrakov
Hello Anthony,

Do you have any data on the reliability of QLC NVMe drives? How old is
your deep archive cluster, how many NVMes it has, and how many did you
have to replace?

On Sun, Apr 21, 2024 at 11:06 PM Anthony D'Atri  wrote:
>
> A deep archive cluster benefits from NVMe too.  You can use QLC up to 60TB in 
> size, 32 of those in one RU makes for a cluster that doesn’t take up the 
> whole DC.
>
> > On Apr 21, 2024, at 5:42 AM, Darren Soothill  
> > wrote:
> >
> > Hi Niklaus,
> >
> > Lots of questions here but let me tray and get through some of them.
> >
> > Personally unless a cluster is for deep archive then I would never suggest 
> > configuring or deploying a cluster without Rocks DB and WAL on NVME.
> > There are a number of benefits to this in terms of performance and 
> > recovery. Small writes go to the NVME first before being written to the HDD 
> > and it makes many recovery operations far more efficient.
> >
> > As to how much faster it makes things that very much depends on the type of 
> > workload you have on the system. Lots of small writes will make a 
> > significant difference. Very large writes not as much of a difference.
> > Things like compactions of the RocksDB database are a lot faster as they 
> > are now running from NVME and not from the HDD.
> >
> > We normally work with  a upto 1:12 ratio so 1 NVME for every 12 HDD’s. This 
> > is assuming the NVME’s being used are good mixed use enterprise NVME’s with 
> > power loss protection.
> >
> > As to failures yes a failure of the NVME would mean a loss of 12 OSD’s but 
> > this is no worse than a failure of an entire node. This is something Ceph 
> > is designed to handle.
> >
> > I certainly wouldn’t be thinking about putting the NVME’s into raid sets as 
> > that will degrade the performance of them when you are trying to get better 
> > performance.
> >
> >
> >
> > Darren Soothill
> >
> >
> > Looking for help with your Ceph cluster? Contact us at https://croit.io/
> >
> > croit GmbH, Freseniusstr. 31h, 81247 Munich
> > CEO: Martin Verges - VAT-ID: DE310638492
> > Com. register: Amtsgericht Munich HRB 231263
> > Web: https://croit.io/ | YouTube: https://goo.gl/PGE1Bx
> >
> >
> >
> >
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io



-- 
Alexander E. Patrakov
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Upgrading Ceph 15 to 18

2024-04-21 Thread Malte Stroem

Thanks, Anthony.

We'll try 15 -> 16 -> 18.

Best,
Malte

On 21.04.24 17:02, Anthony D'Atri wrote:

It should.


On Apr 21, 2024, at 5:48 AM, Malte Stroem  wrote:

Thank you, Anthony.

But does it work to upgrade from the latest 15 to the latest 16, too?

We'd like to be careful.

And then from the latest 16 to the latest 18?

Best,
Malte


On 21.04.24 04:14, Anthony D'Atri wrote:
The party line is to jump no more than 2 major releases at once.
So that would be Octopus (15) to Quincy (17) to Reef (18).
Squid (19) is due out soon, so you may want to pause at Quincy until Squid is 
released and has some runtime and maybe 19.2.1, then go straight to Squid from 
Quincy to save a step.
If you can test the upgrades on a lab cluster first, so much the better.  Be 
sure to read the release notes for every release in case there are specific 
additional actions or NBs.

On Apr 20, 2024, at 18:42, Malte Stroem  wrote:


Hello,

we'd like to upgrade our cluster from the latest Ceph 15 to Ceph 18.

It's running with cephadm.

What's the right way to do it?

Latest Ceph 15 to latest 16 and then to the latest 17 and then the latest 18?

Does that work?

Or is it possible to jump from the latest Ceph 16 to the latest Ceph 18?

Latest Ceph 15 -> latest Ceph 16 -> latest Ceph 18.

Best,
Malte
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Why CEPH is better than other storage solutions?

2024-04-21 Thread William Edwards

> On 21 Apr 2024, at 17:14, Anthony D'Atri  wrote the 
> following:
> 
> Vendor lock-in only benefits vendors.

Strictly speaking, that isn’t necessarily true. Proprietary standards and the 
like *can* enhance user experience in some cases. Making it intentionally 
difficult to migrate is another story. 

> You’ll pay outrageously for support / maint then your gear goes EOL and 
> you’re trolling eBay for parts.   
> 
> With Ceph you use commodity servers, you can swap 100% of the hardware 
> without taking downtime with servers and drives of your choice.  And you get 
> the source code so worst case you can fix or customize.  Ask me sometime 
> about my experience with a certain proprietary HW vendor.  
> 
> Longhorn , openEBS I don’t know much about.  I suspect that they don’t offer 
> the richness of Ceph and that their communities are much smaller.  
> 
> Of course we’re biased here;)
> 
>> On Apr 21, 2024, at 5:21 AM, sebci...@o2.pl wrote:
>> 
>> Hi,
>> I have problem to answer to this question:
>> Why CEPH is better than other storage solutions?
>> 
>> I know this high level texts about
>> - scalability,
>> - flexibility,
>> - distributed,
>> - cost-Effectiveness
>> 
>> What convince me, but could be received also against, is ceph as a product 
>> has everything what I need it mean:
>> block storage (RBD),
>> file storage (CephFS),
>> object storage (S3, Swift)
>> and "plugins" to run NFS, NVMe over Fabric, NFS on object storage.
>> 
>> Also many other features which are usually sold as a option (mirroring, geo 
>> replication, etc) in paid solutions.
>> I have problem to write it done piece by piece.
>> I want convince my managers we are going in good direction.
>> 
>> Why not something from robin.io or purestorage, netapp, dell/EMC. From 
>> opensource longhorn or openEBS.
>> 
>> If you have ideas please write it.
>> 
>> Thanks,
>> S.
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Why CEPH is better than other storage solutions?

2024-04-21 Thread Anthony D'Atri
Vendor lock-in only benefits vendors.  You’ll pay outrageously for support / 
maint then your gear goes EOL and you’re trolling eBay for parts.   

With Ceph you use commodity servers, you can swap 100% of the hardware without 
taking downtime with servers and drives of your choice.  And you get the source 
code so worst case you can fix or customize.  Ask me sometime about my 
experience with a certain proprietary HW vendor.  

Longhorn , openEBS I don’t know much about.  I suspect that they don’t offer 
the richness of Ceph and that their communities are much smaller.  

Of course we’re biased here;)

> On Apr 21, 2024, at 5:21 AM, sebci...@o2.pl wrote:
> 
> Hi,
> I have problem to answer to this question:
> Why CEPH is better than other storage solutions?
> 
> I know this high level texts about
> - scalability,
> - flexibility,
> - distributed,
> - cost-Effectiveness
> 
> What convince me, but could be received also against, is ceph as a product 
> has everything what I need it mean:
> block storage (RBD),
> file storage (CephFS),
> object storage (S3, Swift)
> and "plugins" to run NFS, NVMe over Fabric, NFS on object storage.
> 
> Also many other features which are usually sold as a option (mirroring, geo 
> replication, etc) in paid solutions.
> I have problem to write it done piece by piece.
> I want convince my managers we are going in good direction.
> 
> Why not something from robin.io or purestorage, netapp, dell/EMC. From 
> opensource longhorn or openEBS.
> 
> If you have ideas please write it.
> 
> Thanks,
> S.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Best practice and expected benefits of using separate WAL and DB devices with Bluestore

2024-04-21 Thread Anthony D'Atri
A deep archive cluster benefits from NVMe too.  You can use QLC up to 60TB in 
size, 32 of those in one RU makes for a cluster that doesn’t take up the whole 
DC.  

> On Apr 21, 2024, at 5:42 AM, Darren Soothill  wrote:
> 
> Hi Niklaus,
> 
> Lots of questions here but let me tray and get through some of them.
> 
> Personally unless a cluster is for deep archive then I would never suggest 
> configuring or deploying a cluster without Rocks DB and WAL on NVME.
> There are a number of benefits to this in terms of performance and recovery. 
> Small writes go to the NVME first before being written to the HDD and it 
> makes many recovery operations far more efficient.
> 
> As to how much faster it makes things that very much depends on the type of 
> workload you have on the system. Lots of small writes will make a significant 
> difference. Very large writes not as much of a difference.
> Things like compactions of the RocksDB database are a lot faster as they are 
> now running from NVME and not from the HDD.
> 
> We normally work with  a upto 1:12 ratio so 1 NVME for every 12 HDD’s. This 
> is assuming the NVME’s being used are good mixed use enterprise NVME’s with 
> power loss protection.
> 
> As to failures yes a failure of the NVME would mean a loss of 12 OSD’s but 
> this is no worse than a failure of an entire node. This is something Ceph is 
> designed to handle.
> 
> I certainly wouldn’t be thinking about putting the NVME’s into raid sets as 
> that will degrade the performance of them when you are trying to get better 
> performance.
> 
> 
> 
> Darren Soothill
> 
> 
> Looking for help with your Ceph cluster? Contact us at https://croit.io/
> 
> croit GmbH, Freseniusstr. 31h, 81247 Munich
> CEO: Martin Verges - VAT-ID: DE310638492
> Com. register: Amtsgericht Munich HRB 231263
> Web: https://croit.io/ | YouTube: https://goo.gl/PGE1Bx
> 
> 
> 
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Upgrading Ceph 15 to 18

2024-04-21 Thread Anthony D'Atri
It should.  

> On Apr 21, 2024, at 5:48 AM, Malte Stroem  wrote:
> 
> Thank you, Anthony.
> 
> But does it work to upgrade from the latest 15 to the latest 16, too?
> 
> We'd like to be careful.
> 
> And then from the latest 16 to the latest 18?
> 
> Best,
> Malte
> 
>> On 21.04.24 04:14, Anthony D'Atri wrote:
>> The party line is to jump no more than 2 major releases at once.
>> So that would be Octopus (15) to Quincy (17) to Reef (18).
>> Squid (19) is due out soon, so you may want to pause at Quincy until Squid 
>> is released and has some runtime and maybe 19.2.1, then go straight to Squid 
>> from Quincy to save a step.
>> If you can test the upgrades on a lab cluster first, so much the better.  Be 
>> sure to read the release notes for every release in case there are specific 
>> additional actions or NBs.
 On Apr 20, 2024, at 18:42, Malte Stroem  wrote:
>>> 
>>> Hello,
>>> 
>>> we'd like to upgrade our cluster from the latest Ceph 15 to Ceph 18.
>>> 
>>> It's running with cephadm.
>>> 
>>> What's the right way to do it?
>>> 
>>> Latest Ceph 15 to latest 16 and then to the latest 17 and then the latest 
>>> 18?
>>> 
>>> Does that work?
>>> 
>>> Or is it possible to jump from the latest Ceph 16 to the latest Ceph 18?
>>> 
>>> Latest Ceph 15 -> latest Ceph 16 -> latest Ceph 18.
>>> 
>>> Best,
>>> Malte
>>> ___
>>> ceph-users mailing list -- ceph-users@ceph.io
>>> To unsubscribe send an email to ceph-users-le...@ceph.io
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MDS crash

2024-04-21 Thread Eugen Block

What’s the output of:

ceph tell mds.0 damage ls

Zitat von alexey.gerasi...@opencascade.com:


Dear colleagues, hope that anybody can help us.

The initial point:  Ceph cluster v15.2 (installed and controlled by  
the Proxmox) with 3 nodes based on physical servers rented from a  
cloud provider. CephFS is installed also.


Yesterday we discovered that some of the applications stopped  
working. During the investigation we recognized that we have the  
problem with Ceph, more precisely with CephFS - MDS daemons suddenly  
crashed. We tried to restart them and found that they crashed again  
immediately after the start. The crash information:
2024-04-17T17:47:42.841+ 7f959ced9700  1 mds.0.29134  
recovery_done -- successful recovery!

2024-04-17T17:47:42.853+ 7f959ced9700  1 mds.0.29134 active_start
2024-04-17T17:47:42.881+ 7f959ced9700  1 mds.0.29134 cluster recovered.
2024-04-17T17:47:43.825+ 7f959aed5700 -1  
./src/mds/OpenFileTable.cc: In function 'void  
OpenFileTable::commit(MDSContext*, uint64_t, int)' thread  
7f959aed5700 time 2024-04-17T17:47:43.831243+

./src/mds/OpenFileTable.cc: 549: FAILED ceph_assert(count > 0)

Next hours we read the tons of articles, studied the documentation,  
and checked the common state of Ceph cluster by the various  
diagnostic commands – but didn’t find anything wrong. At evening we  
decided to upgrade it up to v16, and finally to v17.2.7.  
Unfortunately, it didn’t solve the problem, MDS continue to crash  
with the same error. The only difference that we found is “1 MDSs  
report damaged metadata” in the output of ceph -s – see it below.


I supposed that it may be the well-known bug, but couldn’t find the  
same one on https://tracker.ceph.com - there are several bugs  
associated with file OpenFileTable.cc but not related to  
ceph_assert(count > 0)


We tried to check the source code of OpenFileTable.cc also, here is  
a fragment of it, in function OpenFileTable::_journal_finish

  int omap_idx = anchor.omap_idx;
  unsigned& count = omap_num_items.at(omap_idx);
  ceph_assert(count > 0);
So, we guess that the object map is empty for some object in Ceph,  
and it is unexpected behavior. But again, we found nothing wrong in  
our cluster…


Next, we started with  
https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/  
article – tried to reset the journal (despite that it was Ok all the  
time) and wipe the sessions using cephfs-table-tool all reset  
session command. No result…
Now I decided to continue following this article and run  
cephfs-data-scan scan_extents command, it is working just now. But I  
have a doubt that it will solve the issue because of no problem with  
our objects in Ceph.


Is it the new bug? or something else? Any idea is welcome!

The important outputs:

- ceph -s
  cluster:
id: 4cd1c477-c8d0-4855-a1f1-cb71d89427ed
health: HEALTH_ERR
1 MDSs report damaged metadata
insufficient standby MDS daemons available
83 daemons have recently crashed
3 mgr modules have recently crashed

  services:
mon: 3 daemons, quorum  
asrv-dev-stor-2,asrv-dev-stor-3,asrv-dev-stor-1 (age 22h)

mgr: asrv-dev-stor-2(active, since 22h), standbys: asrv-dev-stor-1
mds: 1/1 daemons up
osd: 18 osds: 18 up (since 22h), 18 in (since 29h)

  data:
volumes: 1/1 healthy
pools:   5 pools, 289 pgs
objects: 29.72M objects, 5.6 TiB
usage:   21 TiB used, 47 TiB / 68 TiB avail
pgs: 287 active+clean
 2   active+clean+scrubbing+deep

  io:
client:   2.5 KiB/s rd, 172 KiB/s wr, 261 op/s rd, 195 op/s wr

-ceph fs dump
e29480
enable_multiple, ever_enabled_multiple: 0,1
default compat: compat={},rocompat={},incompat={1=base  
v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir  
inode in separate object,5=mds uses versioned encoding,6=dirfrag is  
stored in omap,7=mds uses inline data,8=no anchor table,9=file  
layout v2,10=snaprealm v2}

legacy client fscid: 1

Filesystem 'cephfs' (1)
fs_name cephfs
epoch   29480
flags   12 joinable allow_snaps allow_multimds_snaps
created 2022-11-25T15:56:08.507407+
modified2024-04-18T16:52:29.970504+
tableserver 0
root0
session_timeout 60
session_autoclose   300
max_file_size   1099511627776
required_client_features{}
last_failure0
last_failure_osd_epoch  14728
compat  compat={},rocompat={},incompat={1=base v0.20,2=client  
writeable ranges,3=default file layouts on dirs,4=dir inode in  
separate object,5=mds uses versioned encoding,6=dirfrag is stored in  
omap,7=mds uses inline data,8=no anchor table,9=file layout  
v2,10=snaprealm v2}

max_mds 1
in  0
up  {0=156636152}
failed
damaged
stopped
data_pools  [5]
metadata_pool   6
inline_data disabled
balancer
standby_count_wanted1
[mds.asrv-dev-stor-1{0:156636152} state up:active seq 6 laggy since  
2024-04-18T16:52:29.970479+ addr  

[ceph-users] Re: Why CEPH is better than other storage solutions?

2024-04-21 Thread Marc
> I know this high level texts about
> - scalability,
> - flexibility,
> - distributed,
> - cost-Effectiveness

If you are careful not to over estimate the performance, then you are ok.

> 
> Why not something from robin.io or purestorage, netapp, dell/EMC. From
> opensource longhorn or openEBS.
> 

:) difficult to say. I can remember, years ago at a trade fair, telling some 
Arabs to look at Ceph when they were standing at the EMC booth. ;P If you want 
answers to such questions, I guess you are stuck with doing some really 
thorough research. I did not have such time, so for me it was that companies 
like CERN and NASA are (were?) using this at very large scale (thousands of disks, 
for years) and contributing to the development.

Needless to say, such a storage solution is critical for your business, so you need 
to have something reliable for the future. You don't want to do business with 
companies that all of a sudden change licensing like Elasticsearch did 
(robin.io -> rakuten cloud?) or that are looking only for a quick buyout. You want 
experienced, competent people developing this for you. Although I am quite a bit 
annoyed with RedHat lately, my compliments really go out to this Ceph 
development team, as do my compliments to the universities contributing 
here.





___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Why CEPH is better than other storage solutions?

2024-04-21 Thread Daniel Brown


Suggestion:  Start with your requirements, vs the “ilities” of the storage 
system. 


By “ilities” I mean scalability, flexibility, distributability, durability, 
manageability, and so on - any storage system can and will lay (at least some) 
claim to those. 


What are the needs of your project? 

How does CEPH meet the needs of your project? 

How do the other systems NOT meet those requirements?





> On Apr 17, 2024, at 4:06 PM, sebci...@o2.pl wrote:
> 
> Hi, 
> I have problem to answer to this question:
> Why CEPH is better than other storage solutions? 
> 
> I know this high level texts about 
> - scalability,
> - flexibility,
> - distributed,
> - cost-Effectiveness
> 
> What convince me, but could be received also against, is ceph as a product 
> has everything what I need it mean:
> block storage (RBD),
> file storage (CephFS),
> object storage (S3, Swift)
> and "plugins" to run NFS, NVMe over Fabric, NFS on object storage.
> 
> Also many other features which are usually sold as a option (mirroring, geo 
> replication, etc) in paid solutions. 
> I have problem to write it done piece by piece. 
> I want convince my managers we are going in good direction.
> 
> Why not something from robin.io or purestorage, netapp, dell/EMC. From 
> opensource longhorn or openEBS.
> 
> If you have ideas please write it.
> 
> Thanks,
> S.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Upgrading Ceph 15 to 18

2024-04-21 Thread Malte Stroem

Thank you, Anthony.

But does it work to upgrade from the latest 15 to the latest 16, too?

We'd like to be careful.

And then from the latest 16 to the latest 18?

Best,
Malte

On 21.04.24 04:14, Anthony D'Atri wrote:

The party line is to jump no more than 2 major releases at once.

So that would be Octopus (15) to Quincy (17) to Reef (18).

Squid (19) is due out soon, so you may want to pause at Quincy until Squid is 
released and has some runtime and maybe 19.2.1, then go straight to Squid from 
Quincy to save a step.

If you can test the upgrades on a lab cluster first, so much the better.  Be 
sure to read the release notes for every release in case there are specific 
additional actions or NBs.


On Apr 20, 2024, at 18:42, Malte Stroem  wrote:

Hello,

we'd like to upgrade our cluster from the latest Ceph 15 to Ceph 18.

It's running with cephadm.

What's the right way to do it?

Latest Ceph 15 to latest 16 and then to the latest 17 and then the latest 18?

Does that work?

Or is it possible to jump from the latest Ceph 16 to the latest Ceph 18?

Latest Ceph 15 -> latest Ceph 16 -> latest Ceph 18.

Best,
Malte
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Working ceph cluster reports large amount of pgs in state unknown/undersized and objects degraded

2024-04-21 Thread Alwin Antreich
Hi Tobias,

April 18, 2024 at 10:43 PM, "Tobias Langner"  wrote:


> While trying to dig up a bit more information, I noticed that the mgr web UI 
> was down, which is why we failed the active mgr to have one of the standbys 
> to take over, without thinking much...
> 
> Lo and behold, this completely resolved the issue from one moment to the 
> other. Now `ceph -s` return 338 active+clean pgs, as expected and desired...
> 
> While we are naturally pretty happy that the problem resolved itself, it 
> would still be good to understand
Thank you that confirms my thought.

> 
> 1. what caused this weird state in which `ceph -s` output did not match
The MGR provides the stats for it.
 
> 
> 2. how a mgr failover could cause changes in `ceph -s` output, thereby
See above.
 
> 
> 3. why `ceph osd df tree` reported a weird split state with only few
Likely the same.

You'd need to go through the MGR log and see what caused the MGR to hang.
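
In practice that usually means failing over the active mgr and then reading its log, along these lines (the mgr name is taken from the earlier `ceph -s` output; the journalctl unit name depends on how the daemons are deployed):

ceph mgr stat                      # shows which mgr is active
ceph mgr fail mgr-102              # force a standby to take over
journalctl -u ceph-mgr@mgr-102 --since "2 hours ago"   # on the host of the old active mgr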


Cheers,
Alwin
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Best practice and expected benefits of using separate WAL and DB devices with Bluestore

2024-04-21 Thread Darren Soothill
Hi Niklaus,

Lots of questions here but let me try and get through some of them.

Personally, unless a cluster is for deep archive, I would never suggest 
configuring or deploying a cluster without RocksDB and the WAL on NVMe.
There are a number of benefits to this in terms of performance and recovery. 
Small writes go to the NVME first before being written to the HDD and it makes 
many recovery operations far more efficient.

As to how much faster it makes things that very much depends on the type of 
workload you have on the system. Lots of small writes will make a significant 
difference. Very large writes not as much of a difference.
Things like compactions of the RocksDB database are a lot faster as they are 
now running from NVME and not from the HDD.

We normally work with up to a 1:12 ratio, so 1 NVMe for every 12 HDDs. This is 
assuming the NVMe drives being used are good mixed-use enterprise NVMe drives with power 
loss protection.
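
For what it's worth, with a cephadm-managed cluster that layout is usually expressed as an OSD service spec along these lines (a sketch; the device filters are placeholders, and cephadm carves the NVMe into one DB slice per HDD it serves):

service_type: osd
service_id: hdd-with-nvme-db
placement:
  host_pattern: '*'
spec:
  data_devices:
    rotational: 1
  db_devices:
    rotational: 0

Applied with `ceph orch apply -i osd-spec.yaml`.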

As to failures, yes, a failure of the NVMe would mean a loss of 12 OSDs, but this 
is no worse than a failure of an entire node. This is something Ceph is 
designed to handle.

I certainly wouldn't be thinking about putting the NVMe drives into RAID sets, as 
that will degrade their performance when you are trying to get better 
performance.



Darren Soothill


Looking for help with your Ceph cluster? Contact us at https://croit.io/
 
croit GmbH, Freseniusstr. 31h, 81247 Munich 
CEO: Martin Verges - VAT-ID: DE310638492 
Com. register: Amtsgericht Munich HRB 231263 
Web: https://croit.io/ | YouTube: https://goo.gl/PGE1Bx




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] MDS crash

2024-04-21 Thread alexey . gerasimov
Dear colleagues, hope that anybody can help us.

The initial point: a Ceph cluster v15.2 (installed and controlled by Proxmox) 
with 3 nodes based on physical servers rented from a cloud provider. 
CephFS is installed as well.

Yesterday we discovered that some of the applications had stopped working. During 
the investigation we recognized that we have a problem with Ceph, more 
precisely with CephFS - the MDS daemons suddenly crashed. We tried to restart them 
and found that they crashed again immediately after starting. The crash 
information:
2024-04-17T17:47:42.841+ 7f959ced9700  1 mds.0.29134 recovery_done -- 
successful recovery!
2024-04-17T17:47:42.853+ 7f959ced9700  1 mds.0.29134 active_start
2024-04-17T17:47:42.881+ 7f959ced9700  1 mds.0.29134 cluster recovered.
2024-04-17T17:47:43.825+ 7f959aed5700 -1 ./src/mds/OpenFileTable.cc: In 
function 'void OpenFileTable::commit(MDSContext*, uint64_t, int)' thread 
7f959aed5700 time 2024-04-17T17:47:43.831243+
./src/mds/OpenFileTable.cc: 549: FAILED ceph_assert(count > 0)

Over the next hours we read tons of articles, studied the documentation, and checked 
the overall state of the Ceph cluster with various diagnostic commands – but 
didn't find anything wrong. In the evening we decided to upgrade to v16, and 
finally to v17.2.7. Unfortunately, that didn't solve the problem; the MDS continues to 
crash with the same error. The only difference we found is “1 MDSs report 
damaged metadata” in the output of ceph -s – see it below.

I supposed it might be a well-known bug, but couldn't find a matching one on 
https://tracker.ceph.com - there are several bugs associated with 
OpenFileTable.cc, but none related to ceph_assert(count > 0).

We also checked the source code of OpenFileTable.cc; here is a fragment 
of it, in the function OpenFileTable::_journal_finish:
   int omap_idx = anchor.omap_idx;
   unsigned& count = omap_num_items.at(omap_idx);
   ceph_assert(count > 0);
So we guess that the object map is empty for some object in Ceph, which is 
unexpected behavior. But again, we found nothing wrong in our cluster…

Next, we started with the 
https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/ article – 
we tried to reset the journal (despite it having been fine the whole time) and wiped the 
sessions using the "cephfs-table-tool all reset session" command. No result…
I then decided to continue following that article and run the "cephfs-data-scan 
scan_extents" command, which is running right now. But I doubt it will 
solve the issue, because there seems to be no problem with our objects in Ceph.

Is this a new bug, or something else? Any idea is welcome!

The important outputs:

- ceph -s
  cluster:
id: 4cd1c477-c8d0-4855-a1f1-cb71d89427ed
health: HEALTH_ERR
1 MDSs report damaged metadata
insufficient standby MDS daemons available
83 daemons have recently crashed
3 mgr modules have recently crashed

  services:
mon: 3 daemons, quorum asrv-dev-stor-2,asrv-dev-stor-3,asrv-dev-stor-1 (age 
22h)
mgr: asrv-dev-stor-2(active, since 22h), standbys: asrv-dev-stor-1
mds: 1/1 daemons up
osd: 18 osds: 18 up (since 22h), 18 in (since 29h)

  data:
volumes: 1/1 healthy
pools:   5 pools, 289 pgs
objects: 29.72M objects, 5.6 TiB
usage:   21 TiB used, 47 TiB / 68 TiB avail
pgs: 287 active+clean
 2   active+clean+scrubbing+deep

  io:
client:   2.5 KiB/s rd, 172 KiB/s wr, 261 op/s rd, 195 op/s wr

-ceph fs dump
e29480
enable_multiple, ever_enabled_multiple: 0,1
default compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable 
ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses 
versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no 
anchor table,9=file layout v2,10=snaprealm v2}
legacy client fscid: 1

Filesystem 'cephfs' (1)
fs_name cephfs
epoch   29480
flags   12 joinable allow_snaps allow_multimds_snaps
created 2022-11-25T15:56:08.507407+
modified2024-04-18T16:52:29.970504+
tableserver 0
root0
session_timeout 60
session_autoclose   300
max_file_size   1099511627776
required_client_features{}
last_failure0
last_failure_osd_epoch  14728
compat  compat={},rocompat={},incompat={1=base v0.20,2=client writeable 
ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses 
versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no 
anchor table,9=file layout v2,10=snaprealm v2}
max_mds 1
in  0
up  {0=156636152}
failed
damaged
stopped
data_pools  [5]
metadata_pool   6
inline_data disabled
balancer
standby_count_wanted1
[mds.asrv-dev-stor-1{0:156636152} state up:active seq 6 laggy since 
2024-04-18T16:52:29.970479+ addr 
[v2:172.22.2.91:6800/2487054023,v1:172.22.2.91:6801/2487054023] compat 
{c=[1],r=[1],i=[7ff]}]

-cephfs-journal-tool --rank=cephfs:0 journal inspect
Overall journal 

[ceph-users] Re: Working ceph cluster reports large amount of pgs in state unknown/undersized and objects degraded

2024-04-21 Thread c+gvihgmke
Some additional information: even though the PGs report unknown, directly 
querying them shows that they are actually up and active.

What could be causing this disconnect in the reported PG states?

```
$ ceph pg dump_stuck inactive | head
ok
PG_STAT  STATEUP  UP_PRIMARY  ACTING  ACTING_PRIMARY
16.fcunknown  []  -1  []  -1
16.fbunknown  []  -1  []  -1
16.faunknown  []  -1  []  -1
...
...
...
```

```
$ ceph osd dump | grep epoch
epoch 54661
```

```
$ ceph pg 16.fc query
{
"snap_trimq": "[]",
"snap_trimq_len": 0,
"state": "active+clean",
"epoch": 54661,
"up": [
5,
0,
4
],
"acting": [
5,
0,
4
],
"acting_recovery_backfill": [
"0(1)",
"4(2)",
"5(0)"
],
"info": {
"pgid": "16.fcs0",
"last_update": "54599'266234",
"last_complete": "54599'266234",
"log_tail": "54406'263772",
"last_user_version": 266234,
"last_backfill": "MAX",
"purged_snaps": [],
"history": {
"epoch_created": 14243,
"epoch_pool_created": 2798,
"last_epoch_started": 54660,
"last_interval_started": 54659,
"last_epoch_clean": 54660,
"last_interval_clean": 54659,
"last_epoch_split": 14243,
"last_epoch_marked_full": 24149,
"same_up_since": 54659,
"same_interval_since": 54659,
"same_primary_since": 54586,
"last_scrub": "54579'266179",
"last_scrub_stamp": "2024-04-18T01:20:28.226815+0200",
"last_deep_scrub": "53783'261235",
"last_deep_scrub_stamp": "2024-03-26T03:29:22.874529+0100",
"last_clean_scrub_stamp": "2024-04-18T01:20:28.226815+0200",
"prior_readable_until_ub": 0
},
"stats": {
"version": "54599'266234",
"reported_seq": 1432180,
"reported_epoch": 54661,
"state": "active+clean",
"last_fresh": "2024-04-18T21:09:57.484866+0200",
"last_change": "2024-04-18T20:57:24.855741+0200",
"last_active": "2024-04-18T21:09:57.484866+0200",
"last_peered": "2024-04-18T21:09:57.484866+0200",
"last_clean": "2024-04-18T21:09:57.484866+0200",
"last_became_active": "2024-04-18T20:57:24.86+0200",
"last_became_peered": "2024-04-18T20:57:24.86+0200",
"last_unstale": "2024-04-18T21:09:57.484866+0200",
"last_undegraded": "2024-04-18T21:09:57.484866+0200",
"last_fullsized": "2024-04-18T21:09:57.484866+0200",
"mapping_epoch": 54659,
"log_start": "54406'263772",
"ondisk_log_start": "54406'263772",
"created": 14243,
"last_epoch_clean": 54660,
"parent": "0.0",
"parent_split_bits": 8,
"last_scrub": "54579'266179",
"last_scrub_stamp": "2024-04-18T01:20:28.226815+0200",
"last_deep_scrub": "53783'261235",
"last_deep_scrub_stamp": "2024-03-26T03:29:22.874529+0100",
"last_clean_scrub_stamp": "2024-04-18T01:20:28.226815+0200",
"log_size": 2462,
"ondisk_log_size": 2462,
"stats_invalid": false,
"dirty_stats_invalid": false,
"omap_stats_invalid": false,
"hitset_stats_invalid": false,
"hitset_bytes_stats_invalid": false,
"pin_stats_invalid": false,
"manifest_stats_invalid": false,
"snaptrimq_len": 0,
"stat_sum": {
"num_bytes": 100198672069,
"num_objects": 24008,
"num_object_clones": 0,
"num_object_copies": 72024,
"num_objects_missing_on_primary": 0,
"num_objects_missing": 0,
"num_objects_degraded": 0,
"num_objects_misplaced": 0,
"num_objects_unfound": 0,
"num_objects_dirty": 24008,
"num_whiteouts": 0,
"num_read": 481265,
"num_read_kb": 589512788,
"num_write": 160947,
"num_write_kb": 164785104,
"num_scrub_errors": 0,
"num_shallow_scrub_errors": 0,
"num_deep_scrub_errors": 0,
"num_objects_recovered": 132312,
"num_bytes_recovered": 548853441237,
"num_keys_recovered": 0,
"num_objects_omap": 0,
"num_objects_hit_set_archive": 0,
"num_bytes_hit_set_archive": 0,
"num_flush": 0,
"num_flush_kb": 0,
"num_evict": 0,
"num_evict_kb": 0,
"num_promote": 0,
"num_flush_mode_high": 0,
"num_flush_mode_low": 0,
   

[ceph-users] Re: Working ceph cluster reports large amount of pgs in state unknown/undersized and objects degraded

2024-04-21 Thread Alwin Antreich
Hi Tobias,

April 18, 2024 at 8:08 PM, "Tobias Langner"  wrote:



> 
> We operate a tiny ceph cluster (v16.2.7) across three machines, each 
> 
> running two OSDs and one of each mds, mgr, and mon. The cluster serves 
> 
> one main erasure-coded (2+1) storage pool and a few other 
I'd assume (without seeing the pool config) that the EC 2+1 is what's putting PGs 
inactive, because for EC you need n-2 for redundancy and n-1 for availability.

The output got a bit mangled. Could you please provide it in a pastebin, 
maybe?

Can you please post the crush rule and pool settings, to better understand the 
data distribution? And what do the logs show on one of the affected OSDs?

Cheers,
Alwin
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] RGWs stop processing requests after upgrading to Reef

2024-04-21 Thread Iain Stott
Hi,

We have recently upgraded one of our clusters from Quincy 17.2.6 to Reef 
18.2.1; since then we have had 3 instances of our RGWs stopping processing 
requests. We have 3 hosts that each run a single instance of RGW, and all 3 
just seem to stop processing requests at the same time, causing our storage to 
become unavailable. A restart or redeploy of the RGW service brings them back 
OK. The cluster was deployed using ceph-ansible, but we have since adopted it 
into cephadm, which is how the upgrade was performed.

We have enabled debug logging as there was nothing out of the ordinary in 
normal logs and are currently sifting through them from the last crash.

We are just wondering if it is possible to run Quincy RGWs instead of Reef, as we 
didn't have this issue prior to the upgrade?

We have 3 clusters in a multisite setup, we are holding off on upgrading the 
other 2 clusters due to this issue.


Thanks
Iain

Iain Stott
OpenStack Engineer
iain.st...@thg.com
www.thg.com

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Prevent users to create buckets

2024-04-21 Thread Michel Raabe

Hi Sinan,

On 17.04.24 14:45, si...@turka.nl wrote:

Hello,

I am using Ceph RGW for S3. Is it possible to create (sub)users that 
cannot create/delete buckets and are limited to specific buckets?


At the end, I want to create 3 separate users and for each user I want 
to create a bucket. The users should only have access to their own 
bucket and should not be able to create new or delete buckets.


One approach could be to limit max_buckets to 1 so the user cannot 
create new buckets, but the user will still have access to other buckets and 
will be able to delete buckets.


Any advice here? Thanks!


You need to set max_buckets to -1 to prevent a user from creating buckets.

And use ACLs or policies to give a user read/write permissions to 
specific buckets.
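
A sketch with placeholder names (the policy JSON uses the subset of AWS bucket policy that RGW supports):

# stop the user from creating buckets
radosgw-admin user modify --uid=alice --max-buckets=-1

# bucket policy granting alice access to exactly one bucket, applied by the bucket owner
cat > policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {"AWS": ["arn:aws:iam:::user/alice"]},
    "Action": ["s3:ListBucket", "s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
    "Resource": ["arn:aws:s3:::bucket-a", "arn:aws:s3:::bucket-a/*"]
  }]
}
EOF
aws --endpoint-url https://rgw.example.com s3api put-bucket-policy --bucket bucket-a --policy file://policy.json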


hth,
Michel
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph Community Management Update

2024-04-21 Thread Noah Lehman
Thanks for the introduction, Josh!

Hi Ceph Community, I look forward to working with you and learning more
about your community. If anyone has content you'd like shared on social
media, or if you have PR questions, feel free to reach out!

Best,

Noah


On Tue, Apr 16, 2024 at 9:00 AM Josh Durgin  wrote:

> Hi everyone,
>
> I’d like to extend a warm thank you to Mike Perez for his years of service
> as community manager for Ceph. He is changing focuses now to engineering.
>
> The Ceph Foundation board decided to use services from the Linux
> Foundation to fulfill some community management responsibilities, rather
> than rely on a single member organization employing a community manager.
> The Linux Foundation will assist with Ceph Foundation membership and
> governance matters.
>
> Please welcome Noah Lehman (cc’d) as our social media and marketing point
> person - for anything related to this area, including the Ceph YouTube
> channel, please reach out to him.
>
> Ceph days will continue to be organized and funded by organizations around
> the world, with the help of the Ceph Ambassadors (
> https://ceph.io/en/community/ambassadors/). Gaurav Sitlani (cc’d) will
> help organize the ambassadors going forward.
>
> For other matters, please contact coun...@ceph.io and we’ll direct the
> matter to the appropriate people.
>
> Thanks,
> Neha Ojha, Dan van der Ster, Josh Durgin
> Ceph Executive Council
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Why CEPH is better than other storage solutions?

2024-04-21 Thread sebcio_t
Hi, 
I have a problem answering this question:
Why is Ceph better than other storage solutions? 

I know the high-level points about 
- scalability,
- flexibility,
- being distributed,
- cost-effectiveness

What convinces me (though it could also be held against it) is that Ceph as a product has 
everything I need, meaning:
block storage (RBD),
file storage (CephFS),
object storage (S3, Swift)
and "plugins" to run NFS, NVMe over Fabrics, and NFS on object storage.

There are also many other features which are usually sold as options (mirroring, geo-
replication, etc.) in paid solutions. 
I have trouble writing it down piece by piece. 
I want to convince my managers that we are going in a good direction.

Why not something from robin.io or purestorage, netapp, dell/EMC? Or, from 
open source, longhorn or openEBS?

If you have ideas, please write them.

Thanks,
S.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Working ceph cluster reports large amount of pgs in state unknown/undersized and objects degraded

2024-04-21 Thread Tobias Langner
We operate a tiny ceph cluster across three machines, each running two 
OSDs and one of each mds, mgr, and mon. The cluster serves one main 
erasure-coded (2+1) storage pool and a few other management-related 
pools. The cluster has been running smoothly for several months.
A few weeks ago we noticed a health warning reporting 
backfillfull/nearfull osds and pools. Here is the output of `ceph -s` at 
that point (extracted from logs):



  cluster:
    health: HEALTH_WARN
            1 backfillfull osd(s)
            2 nearfull osd(s)
            Reduced data availability: 163 pgs inactive, 1 pg peering
            Low space hindering backfill (add storage if this doesn't resolve itself): 2 pgs backfill_toofull
            Degraded data redundancy: 1486709/10911157 objects degraded (13.626%), 68 pgs degraded, 68 pgs undersized
            162 pgs not scrubbed in time
            6 pool(s) backfillfull

  services:
    mon: 3 daemons, quorum mon.101,mon.102,mon.100 (age 5m)
    mgr: mgr-102(active, since 54m), standbys: mgr-101, mgr-100
    mds: 1/1 daemons up, 1 standby, 1 hot standby
    osd: 6 osds: 6 up (since 4m), 6 in (since 2w); 7 remapped pgs

  data:
    volumes: 1/1 healthy
    pools:   6 pools, 338 pgs
    objects: 3.64M objects, 14 TiB
    usage:   13 TiB used, 1.7 TiB / 15 TiB avail
    pgs:     47.929% pgs unknown
             0.296% pgs not active
             1486709/10911157 objects degraded (13.626%)
             52771/10911157 objects misplaced (0.484%)
             162 unknown
             106 active+clean
             67  active+undersized+degraded
             1   active+undersized+degraded+remapped+backfill_toofull
             1   remapped+peering
             1   active+remapped+backfill_toofull


In hindsight, the large number of pgs in state unknown and the fact that a
significant fraction of objects was degraded despite all osds being up are
clearly visible here, but we didn't notice this back then.
Because the cluster continued to behave fine from an FS access perspective,
we did not intervene. From then on, things have mostly gone downhill. Now
`ceph -s` reports the following:



  cluster:
    health: HEALTH_WARN
            noout flag(s) set
            Reduced data availability: 117 pgs inactive
            Degraded data redundancy: 2095625/12121767 objects degraded (17.288%), 114 pgs degraded, 114 pgs undersized
            117 pgs not scrubbed in time

  services:
    mon: 3 daemons, quorum mon.101,mon.102,mon.100 (age 15h)
    mgr: mgr-102(active, since 7d), standbys: mgr-100, mgr-101
    mds: 1/1 daemons up, 1 standby, 1 hot standby
    osd: 6 osds: 6 up (since 55m), 6 in (since 3w)
         flags noout

  data:
    volumes: 1/1 healthy
    pools:   6 pools, 338 pgs
    objects: 4.04M objects, 15 TiB
    usage:   12 TiB used, 2.8 TiB / 15 TiB avail
    pgs:     34.615% pgs unknown
             2095625/12121767 objects degraded (17.288%)
             117 unknown
             114 active+undersized+degraded
             107 active+clean


Note in particular the still very large number of pgs in state unknown, which
hasn't changed in days. The same goes for the degraded pgs. Also, the cluster
should have around 37 TiB of storage available, but it now reports only 15 TiB.
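For reference, these are the kinds of commands we have been using to poke at
the unknown pgs (the pg id below is just an example):

   ceph health detail            # lists the pgs behind each warning
   ceph pg dump_stuck inactive   # pgs that are not active, including the unknown ones
   ceph pg 2.1f query            # peering state of a single pg, when the osds respond at all
   ceph osd tree                 # CRUSH view of the osds and their up/in state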
We did a bit of digging around but couldn't really get to the bottom of the
unknown pgs or how to recover from them. One other data point is that the
command `ceph osd df tree` gets stuck on two of the three machines, and on the
one where it does return something, the output looks like this:



ID   CLASS  WEIGHT    REWEIGHT  SIZE  RAW USE  DATA  OMAP  META  AVAIL  %USE  VAR  PGS  STATUS  TYPE NAME
 -1         47.67506         -   0 B      0 B   0 B   0 B   0 B    0 B     0    0    -          root default
-13         18.26408         -   0 B      0 B   0 B   0 B   0 B    0 B     0    0    -          datacenter dc.100
 -5         18.26408         -   0 B      0 B   0 B   0 B   0 B    0 B     0    0    -          host osd-100
  3  hdd    10.91409   1.0     0 B      0 B   0 B   0 B   0 B    0 B     0    0   91      up    osd.3
  5  hdd     7.34999   1.0     0 B      0 B   0 B   0 B   0 B    0 B     0    0   48      up    osd.5
 -9         14.69998         -   0 B      0 B   0 B   0 B   0 B    0 B     0    0    -          datacenter dc.101
 -7         14.69998         -   0 B      0 B   0 B   0 B   0 B    0 B     0    0    -          host osd-101
  0  hdd     7.34999

[ceph-users] stretched cluster new pool and second pool with nvme

2024-04-21 Thread ronny.lippold

hi ... i'm running into a wall and need your help, again.

our test stretch cluster is running fine.
now i have 2 questions.

what's the right way to add another pool?
just create the pool with size 4 / min_size 2 and use the stretch-mode crush
rule, and that's it?
the existing pools were automatically set to 4/2 after "ceph mon
enable_stretch_mode".


the second question: we want to use ssd and nvme together,
so we need a second pool restricted to the nvme device class.

i don't know how to set up a second crush rule for the nvme class.
i thought i need two rules, one filtering on each class. is that
correct?
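for illustration, the kind of rule i was thinking of (untested sketch, based on
the stretch-mode rule from the docs with a device class filter added; the rule
id and the bucket names DC1/DC2 are placeholders):

   rule stretch_rule_nvme {
       id 2
       type replicated
       step take DC1 class nvme
       step chooseleaf firstn 2 type host
       step emit
       step take DC2 class nvme
       step chooseleaf firstn 2 type host
       step emit
   }

   # edit the decompiled crush map and inject it again:
   ceph osd getcrushmap -o crush.bin
   crushtool -d crush.bin -o crush.txt
   # ... add the rule above to crush.txt ...
   crushtool -c crush.txt -o crush.new
   ceph osd setcrushmap -i crush.new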



thanks for help,
ronny
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] MDS daemons crash

2024-04-21 Thread Alexey GERASIMOV
Hello all! I hope somebody can help us.

The initial situation: a Ceph cluster v15.2 (installed and managed by Proxmox) with
3 nodes based on physical servers rented from a cloud provider. Volumes are provided
by Ceph via both CephFS and RBD. We run 2 MDS daemons but use max_mds=1, so one
daemon was active and the other was standby.
On Thursday some of the applications stopped working. After investigation it was
clear that we had a problem with Ceph, more precisely with CephFS: both MDS daemons
had suddenly crashed. We tried to restart them and found that they crashed again
immediately after starting. The crash information:

2024-04-17T17:47:42.841+ 7f959ced9700  1 mds.0.29134 recovery_done -- 
successful recovery!
2024-04-17T17:47:42.853+ 7f959ced9700  1 mds.0.29134 active_start
2024-04-17T17:47:42.881+ 7f959ced9700  1 mds.0.29134 cluster recovered.
2024-04-17T17:47:43.825+ 7f959aed5700 -1 ./src/mds/OpenFileTable.cc: In 
function 'void OpenFileTable::commit(MDSContext*, uint64_t, int)' thread 
7f959aed5700 time 2024-04-17T17:47:43.831243+
./src/mds/OpenFileTable.cc: 549: FAILED ceph_assert(count > 0)

Over the next hours we read tons of articles, studied the documentation, and checked
the overall cluster status with various diagnostic commands, but didn't find anything
wrong. In the evening we decided to upgrade the Ceph cluster, first to v16 and finally
to v17.2.7. Unfortunately, that didn't solve the problem; the MDS daemons continue to
crash with the same error. The only difference we found is the "1 MDSs report damaged
metadata" message in the output of ceph -s - see it below.

I supposed it might be a well-known bug, but couldn't find a matching one on
https://tracker.ceph.com - there are several bugs associated with OpenFileTable.cc,
but none related to ceph_assert(count > 0).
We also checked the source code of OpenFileTable.cc; here is a fragment of it, from
the function OpenFileTable::_journal_finish:
  int omap_idx = anchor.omap_idx;
  unsigned& count = omap_num_items.at(omap_idx);
  ceph_assert(count > 0);

So we guess that the tracked omap item count is zero for some object, which is
unexpected behavior. But again, we found nothing obviously wrong in our cluster...
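If it helps to check that theory, the open file table objects live in the CephFS
metadata pool and their omap can be inspected directly; the pool and object names
below are our best guess, not verified:

   rados -p cephfs_metadata ls | grep openfiles
   rados -p cephfs_metadata listomapkeys mds0_openfiles.0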

Next, we started with the
https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/ article: we tried
resetting the journal (even though it looked fine the whole time) and wiping the
sessions with the "cephfs-table-tool all reset session" command. No result...
I then decided to continue following the article and run the "cephfs-data-scan
scan_extents" command. We started it on Friday and it is still running (2 of 3
workers have finished, so I'm waiting for the last one; maybe I need more workers for
the next command, "cephfs-data-scan scan_inodes", which I plan to run). But I doubt
it will solve the issue because, again, we suspect that the problem is not with the
data objects in Ceph but with the metadata only...
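For the record, this is roughly how we split the scan across workers (following the
worker_n/worker_m scheme from the disaster-recovery docs; the data pool name is a
placeholder for ours):

   for i in 0 1 2; do
       cephfs-data-scan scan_extents --worker_n $i --worker_m 3 cephfs_data &
   done
   wait
   # the plan is to repeat the same split afterwards for:
   #   cephfs-data-scan scan_inodes --worker_n $i --worker_m 3 cephfs_data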

Is this a new bug, or something else? What else should we try in order to get our
MDS daemon running? Any idea is welcome!
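For completeness, the inspection commands we can still run and share output from on
request (a sketch; the mds daemon name and crash id are placeholders):

   ceph tell mds.<daemon-name> damage ls   # details behind "1 MDSs report damaged metadata"
   ceph crash ls                           # recent crash reports kept by the mgr
   ceph crash info <crash-id>              # full backtrace of a single crash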

The important outputs:
ceph -s
  cluster:
id: 4cd1c477-c8d0-4855-a1f1-cb71d89427ed
health: HEALTH_ERR
1 MDSs report damaged metadata
insufficient standby MDS daemons available
83 daemons have recently crashed
3 mgr modules have recently crashed

  services:
mon: 3 daemons, quorum asrv-dev-stor-2,asrv-dev-stor-3,asrv-dev-stor-1 (age 
22h)
mgr: asrv-dev-stor-2(active, since 22h), standbys: asrv-dev-stor-1
mds: 1/1 daemons up
osd: 18 osds: 18 up (since 22h), 18 in (since 29h)

  data:
volumes: 1/1 healthy
pools:   5 pools, 289 pgs
objects: 29.72M objects, 5.6 TiB
usage:   21 TiB used, 47 TiB / 68 TiB avail
pgs: 287 active+clean
 2   active+clean+scrubbing+deep

  io:
client:   2.5 KiB/s rd, 172 KiB/s wr, 261 op/s rd, 195 op/s wr

ceph fs dump
e29480
enable_multiple, ever_enabled_multiple: 0,1
default compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable 
ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses 
versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no 
anchor table,9=file layout v2,10=snaprealm v2}
legacy client fscid: 1

Filesystem 'cephfs' (1)
fs_name cephfs
epoch   29480
flags   12 joinable allow_snaps allow_multimds_snaps
created 2022-11-25T15:56:08.507407+
modified2024-04-18T16:52:29.970504+
tableserver 0
root0
session_timeout 60
session_autoclose   300
max_file_size   1099511627776
required_client_features{}
last_failure0
last_failure_osd_epoch  14728
compat  compat={},rocompat={},incompat={1=base v0.20,2=client writeable 
ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses 
versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no 
anchor table,9=file layout v2,10=snaprealm v2}
max_mds 1
in  0
up