[ceph-users] Re: RBD Disk Usage
When you delete files, they're not normally scrubbed from the disk; the file system just forgets the deleted files are there. To fully remove the data you need something like TRIM: fstrim -v /the_file_system Simon On 07/08/2023 15:15, mahnoosh shahidi wrote: Hi all, I have an rbd image that `rbd disk-usage` shows it has 31GB usage but in the filesystem `du` shows its usage is 40KB. Does anyone know the reason for this difference? Best Regards, Mahnoosh ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
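For anyone following along later, a quick way to compare the two views and reclaim the space might look like this sketch - the pool/image name and mount point are placeholders, and discard support must be enabled on the guest's disk (e.g. virtio-scsi with discard='unmap') for fstrim to reach the RBD layer:

```shell
# Space as seen by Ceph (allocated RBD extents) - hypothetical pool/image
rbd disk-usage rbd/myimage

# Space as seen by the filesystem mounted on top of the image
du -sh /the_file_system

# Release extents the filesystem no longer uses back to Ceph
fstrim -v /the_file_system

# Re-check: the rbd disk-usage figure should shrink towards the du figure
rbd disk-usage rbd/myimage
```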
[ceph-users] Re: v18.2.1 Reef released
Hi All, We're deploying a fresh Reef cluster now and noticed that cephadm bootstrap deploys 18.2.0 and not 18.2.1. It appears this is because the v18, v18.2 (and v18.2.0) tags are all pointing to the v18.2.0-20231212 tag since 16th December here: https://quay.io/repository/ceph/ceph?tab=history Is this intentional? I.e. was 18.2.1 rolled back? Or is this a mistake? Thanks, Simon. On 18/12/2023 21:20, Yuri Weinstein wrote: We're happy to announce the 1st backport release in the Reef series. This is the first backport release in the Reef series, and the first with Debian packages, for Debian Bookworm. We recommend all users update to this release. https://ceph.io/en/news/blog/2023/v18-2-1-reef-released/ Notable Changes --- * RGW: S3 multipart uploads using Server-Side Encryption now replicate correctly in multi-site. Previously, the replicas of such objects were corrupted on decryption. A new tool, ``radosgw-admin bucket resync encrypted multipart``, can be used to identify these original multipart uploads. The ``LastModified`` timestamp of any identified object is incremented by 1ns to cause peer zones to replicate it again. For multi-site deployments that make any use of Server-Side Encryption, we recommend running this command against every bucket in every zone after all zones have upgraded. * CEPHFS: MDS evicts clients which are not advancing their request tids, which causes a large buildup of session metadata, resulting in the MDS going read-only due to the RADOS operation exceeding the size threshold. The `mds_session_metadata_threshold` config controls the maximum size that an (encoded) session's metadata can grow to. * RGW: New tools have been added to radosgw-admin for identifying and correcting issues with versioned bucket indexes. Historical bugs with the versioned bucket index transaction workflow made it possible for the index to accumulate extraneous "book-keeping" olh entries and plain placeholder entries. 
In some specific scenarios where clients made concurrent requests referencing the same object key, it was likely that a lot of extra index entries would accumulate. When a significant number of these entries are present in a single bucket index shard, they can cause high bucket listing latencies and lifecycle processing failures. To check whether a versioned bucket has unnecessary olh entries, users can now run ``radosgw-admin bucket check olh``. If the ``--fix`` flag is used, the extra entries will be safely removed. Distinct from the issue described thus far, it is also possible that some versioned buckets are maintaining extra unlinked objects that are not listable from the S3/Swift APIs. These extra objects are typically a result of PUT requests that exited abnormally, in the middle of a bucket index transaction - so the client would not have received a successful response. Bugs in prior releases made these unlinked objects easy to reproduce with any PUT request that was made on a bucket that was actively resharding. Besides the extra space that these hidden, unlinked objects consume, there can be another side effect in certain scenarios, caused by the nature of the failure mode that produced them, where a client of a bucket that was a victim of this bug may find the object associated with the key to be in an inconsistent state. To check whether a versioned bucket has unlinked entries, users can now run ``radosgw-admin bucket check unlinked``. If the ``--fix`` flag is used, the unlinked objects will be safely removed. Finally, a third issue made it possible for versioned bucket index stats to be accounted inaccurately. The tooling for recalculating versioned bucket stats also had a bug, and was not previously capable of fixing these inaccuracies. This release resolves those issues and users can now expect that the existing ``radosgw-admin bucket check`` command will produce correct results. 
We recommend that users with versioned buckets, especially those that existed on prior releases, use these new tools to check whether their buckets are affected and to clean them up accordingly. * mgr/snap-schedule: For clusters with multiple CephFS file systems, all the snap-schedule commands now expect the '--fs' argument. * RADOS: A POOL_APP_NOT_ENABLED health warning will now be reported if the application is not enabled for the pool, irrespective of whether the pool is in use or not. Always add an ``application`` label to a pool to avoid the POOL_APP_NOT_ENABLED health warning being reported for that pool. The user might temporarily mute this warning using ``ceph health mute POOL_APP_NOT_ENABLED``. Getting Ceph * Git at git://github.com/ceph/ceph.git * Tarball at https://download.ceph.com/tarballs/ceph-18.2.1.tar.gz * Containers at https://quay.io/repository/ceph/ceph * For packages,
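Putting the versioned-bucket tooling above together, the check/fix workflow might be sketched as follows - the bucket name is a placeholder and the exact subcommand syntax should be confirmed against your installed 18.2.1 radosgw-admin, but the intent is to run the read-only checks first and only then apply --fix:

```shell
# Dry-run checks (no changes made) - replace "mybucket" with a real bucket
radosgw-admin bucket check olh --bucket=mybucket
radosgw-admin bucket check unlinked --bucket=mybucket

# Apply fixes once the dry-run output has been reviewed
radosgw-admin bucket check olh --bucket=mybucket --fix
radosgw-admin bucket check unlinked --bucket=mybucket --fix

# Recalculate bucket index stats with the repaired tooling
radosgw-admin bucket check --bucket=mybucket --fix
```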
[ceph-users] Re: Building new cluster had a couple of questions
On 21/12/2023 13:50, Drew Weaver wrote: Howdy, I am going to be replacing an old cluster pretty soon and I am looking for a few suggestions. #1 cephadm or ceph-ansible for management? #2 Since the whole... CentOS thing... what distro appears to be the most straightforward to use with Ceph? I was going to try and deploy it on Rocky 9. I'm in the same boat and have used cephadm on Rocky 9 and the standard podman packages that come with the distro. Installation went without a hitch, a breeze actually compared to the old ceph-deploy/Nautilus install it's going to replace. Cheers, Simon.
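For anyone else starting from scratch, the cephadm bootstrap on a fresh Rocky 9 host is roughly the following sketch - the monitor IP is a placeholder, and it assumes podman, python3 and chrony are already present from the base install:

```shell
# Install the cephadm binary from the repos (method varies by setup)
dnf install -y cephadm

# Bootstrap the first monitor/manager on this host (placeholder mon IP)
cephadm bootstrap --mon-ip 192.0.2.10

# Sanity-check the new cluster from the containerised ceph CLI
cephadm shell -- ceph -s
```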
[ceph-users] Re: Worst thing that can happen if I have size=2
On 03/02/2021 09:24, Mario Giammarco wrote: Hello, Imagine this situation: - 3 servers with ceph - a pool with size 2 min 1 I know perfectly well that size 3 and min 2 is better. I would like to know what is the worst thing that can happen: Hi Mario, This thread is worth a read, it's an oldie but a goodie: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-December/014846.html Especially this post, which helped me understand the importance of min_size=2 http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-December/014892.html Cheers, Simon
[ceph-users] Re: Worst thing that can happen if I have size=2
On 03/02/2021 19:48, Mario Giammarco wrote: It is obvious and a bit paranoid because many servers at many customers run on raid1, and so you are saying: yeah you have two copies of the data but you can break both. Consider that in ceph recovery is automatic, while with raid1 someone must manually go to the customer and change disks. So ceph is already an improvement in this case even with size=2. With size 3 and min 2 it is a bigger improvement, I know. To labour Dan's point a bit further, maybe a RAID5/6 analogy is better than RAID1. Yes, I know we're not talking erasure coding pools here but this is similar to the reasons why people moved from RAID5 (size=2, kind of) to RAID6 (size=3, kind of). I.e. the more disks you have in an array (cluster, in our case) and the bigger those disks are, the greater the chance you have of encountering a second problem during a recovery. What I ask is this: what happens with min_size=1 and split brain, network down or similar things: does ceph block writes because it has no quorum on monitors? Are there some failure scenarios that I have not considered? It sounds like in your example you would have 3 physical servers in total. So would you have both monitor and OSD processes on each server? If so, it's not really related to min_size=1 but to answer your question you could lose one monitor and the cluster would continue. Losing a second monitor will stop your cluster until this is resolved. In your example setup (with colocated mons & OSDs) this would presumably also mean you'd have lost two OSD servers too, so you'd have bigger problems. HTH, Simon
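For reference, the pool settings being debated are set like this - the pool name is a placeholder:

```shell
# Replicated pool with three copies; writes allowed while two remain up
ceph osd pool set mypool size 3
ceph osd pool set mypool min_size 2

# Verify the settings took effect
ceph osd pool get mypool size
ceph osd pool get mypool min_size
```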
[ceph-users] Re: Worst thing that can happen if I have size=2
On 05/02/2021 20:10, Mario Giammarco wrote: It is not that one morning I wake up and put some random hardware together; I followed guidelines. The result should be: - if a disk (or more) breaks, work goes on - if a server breaks, the VMs on the server start on another server and work goes on. The result is: one disk breaks, ceph fills the other one in the same server, reaches 90% and EVERYTHING stops, including all VMs, and the customer has lost unsaved data and cannot run the VMs it needs to continue working. Not very "HA" as hoped. With three OSD hosts, each with two disks, size=3 and default CRUSH rules (i.e. each replica goes to a different host) then each OSD host would expect to get roughly 1/3 of the total data. Under normal running this would mean each disk sees 1/6 of the total data. When a single disk failed in your scenario above, all three hosts were still available and still got 1/3 of the total data. Because one disk failed, the surviving disk has to store the replicas that were on the failed disk as well as its own (so, 2/6 total data - double what it had before). To have reached 90% full on the surviving disk suggests that it was (at least) 45% full under normal running. Ceph is doing what it's supposed to in this case, the issue is that the disks haven't been sized large enough to allow for this failure. Simon
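The arithmetic above can be checked with a quick sketch - the numbers are just the illustrative ones from this thread (3 hosts, 2 disks each, size=3, one disk failure):

```shell
awk 'BEGIN {
  # Under normal running each disk holds 1/6 of the total data.
  normal = 1.0 / 6
  # After one disk fails, its partner on the same host holds 2/6.
  after  = 2.0 / 6
  # A disk that must stay below 90% full after a failure can therefore
  # be at most 45% full beforehand, since its load doubles.
  safe   = 90 / 2
  printf "normal per-disk share: %.3f of total data\n", normal
  printf "post-failure share:    %.3f of total data\n", after
  printf "safe fill level:       %d%% per disk\n", safe
}'
```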
[ceph-users] One PG keeps going inconsistent (stat mismatch)
Hi All, I have a recurring single PG that keeps going inconsistent. A scrub is enough to pick up the problem. The primary OSD log shows something like: 2021-09-22 18:08:18.502 7f5bdcb11700 0 log_channel(cluster) log [DBG] : 1.3ff scrub starts 2021-09-22 18:08:18.880 7f5bdcb11700 -1 log_channel(cluster) log [ERR] : 1.3ff scrub : stat mismatch, got 3243/3244 objects, 67/67 clones, 3243/3244 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 1/1 whiteouts, 13247338496/13251516416 bytes, 0/0 manifest objects, 0/0 hit_set_archive bytes. 2021-09-22 18:08:18.880 7f5bdcb11700 -1 log_channel(cluster) log [ERR] : 1.3ff scrub 1 errors It always repairs ok when I run ceph pg repair 1.3ff: 2021-09-22 18:08:47.533 7f5bdcb11700 0 log_channel(cluster) log [DBG] : 1.3ff repair starts 2021-09-22 18:15:58.218 7f5bdcb11700 -1 log_channel(cluster) log [ERR] : 1.3ff repair : stat mismatch, got 3243/3244 objects, 67/67 clones, 3243/3244 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 1/1 whiteouts, 13247338496/13251516416 bytes, 0/0 manifest objects, 0/0 hit_set_archive bytes. 2021-09-22 18:15:58.218 7f5bdcb11700 -1 log_channel(cluster) log [ERR] : 1.3ff repair 1 errors, 1 fixed It's happened multiple times and always with the same PG number, no other PG is doing this. It's a Nautilus v14.2.5 cluster using spinning disks with separate DB/WAL on SSDs. I don't believe there's an underlying hardware problem but in a bid to make sure I reweighted the primary OSD for this PG to 0 to get it to move to another disk. The backfilling is complete but on manually scrubbing the PG again it showed inconsistent as above. In case it's relevant the only major activity I've performed recently has been gradually adding new OSD nodes and disks to the cluster, prior to this it had been up without issue for well over a year. The primary OSD for this PG was on the first new OSD I added when this issue first presented. 
The inconsistent PG issue didn't start happening immediately after adding it though, it was some weeks later. Any suggestions as to how I can get rid of this problem? Should I try reweighting the other two OSDs for this PG to 0? Or is this a known bug that requires some specific work or just an upgrade? Thanks, Simon.
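For anyone hitting the same thing, the inspect/repair loop described above is roughly the following - the PG id 1.3ff is from this cluster, so substitute your own:

```shell
# Trigger a deep scrub of the suspect PG and watch the OSD log for errors
ceph pg deep-scrub 1.3ff

# List exactly which objects/stats the scrub flagged as inconsistent
rados list-inconsistent-obj 1.3ff --format=json-pretty

# Repair, as in the log excerpts above
ceph pg repair 1.3ff

# Confirm the cluster returns to HEALTH_OK
ceph health detail
```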
[ceph-users] Re: One PG keeps going inconsistent (stat mismatch)
Bump for any pointers here? tl;dr - I've got a single PG that keeps going inconsistent (stat mismatch). It always repairs ok but comes back every day now when it's scrubbed. If there's no suggestions I'll try upgrading to 14.2.22 and then reweighting the other OSDs (I've already done the primary) that serve this PG to 0 to try to force its recreation. Thanks, Simon. On 22/09/2021 18:50, Simon Ironside wrote: Hi All, I have a recurring single PG that keeps going inconsistent. A scrub is enough to pick up the problem. The primary OSD log shows something like: 2021-09-22 18:08:18.502 7f5bdcb11700 0 log_channel(cluster) log [DBG] : 1.3ff scrub starts 2021-09-22 18:08:18.880 7f5bdcb11700 -1 log_channel(cluster) log [ERR] : 1.3ff scrub : stat mismatch, got 3243/3244 objects, 67/67 clones, 3243/3244 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 1/1 whiteouts, 13247338496/13251516416 bytes, 0/0 manifest objects, 0/0 hit_set_archive bytes. 2021-09-22 18:08:18.880 7f5bdcb11700 -1 log_channel(cluster) log [ERR] : 1.3ff scrub 1 errors It always repairs ok when I run ceph pg repair 1.3ff: 2021-09-22 18:08:47.533 7f5bdcb11700 0 log_channel(cluster) log [DBG] : 1.3ff repair starts 2021-09-22 18:15:58.218 7f5bdcb11700 -1 log_channel(cluster) log [ERR] : 1.3ff repair : stat mismatch, got 3243/3244 objects, 67/67 clones, 3243/3244 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 1/1 whiteouts, 13247338496/13251516416 bytes, 0/0 manifest objects, 0/0 hit_set_archive bytes. 2021-09-22 18:15:58.218 7f5bdcb11700 -1 log_channel(cluster) log [ERR] : 1.3ff repair 1 errors, 1 fixed It's happened multiple times and always with the same PG number, no other PG is doing this. It's a Nautilus v14.2.5 cluster using spinning disks with separate DB/WAL on SSDs. I don't believe there's an underlying hardware problem but in a bid to make sure I reweighted the primary OSD for this PG to 0 to get it to move to another disk. 
The backfilling is complete but on manually scrubbing the PG again it showed inconsistent as above. In case it's relevant the only major activity I've performed recently has been gradually adding new OSD nodes and disks to the cluster, prior to this it had been up without issue for well over a year. The primary OSD for this PG was on the first new OSD I added when this issue first presented. The inconsistent PG issue didn't start happening immediately after adding it though, it was some weeks later. Any suggestions as to how I can get rid of this problem? Should I try reweighting the other two OSDs for this PG to 0? Or is this a known bug that requires some specific work or just an upgrade? Thanks, Simon.
[ceph-users] Re: Are there 'tuned profiles' for various ceph scenarios?
Here's an example for SCSI disks (the main benefit vs VirtIO is discard/unmap/TRIM support): discard='unmap'/> You also need a VirtIO-SCSI controller to use these, which will look something like: function='0x0'/> Cheers, Simon. On 01/07/2020 20:52, Harry G. Coin wrote: [Resent to correct title] Marc: Here's a template that works here. You'll need to do some steps to create the 'secret' and make the block devs and so on: Glad I could contribute something. Sure would appreciate leads for the suggested sysctls/etc either apart or as tuned profiles. Harry On 7/1/20 2:44 PM, Marc Roos wrote: Just curious, how does the libvirt xml part look like of a 'direct virtio->rados link' and 'kernel-mounted rbd' -Original Message- To: ceph-users@ceph.io Subject: *SPAM* [ceph-users] Are there 'tuned profiles' for various ceph scenarios? Hi Are there any 'official' or even 'works for us' pointers to 'tuned profiles' for such common uses as 'ceph baremetal osd host' 'ceph osd + libvirt host' 'ceph mon/mgr' 'guest vm based on a kernel-mounted rbd' 'guest vm based on a direct virtio->rados link' I suppose there are a few other common configurations, but you get the idea. If you haven't used or know of 'tuned'-- it's a nice way to collect a great whole lot of sysctl and other low level configuration options in one spot. https://tuned-project.org/ Thanks Harry Coin
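Since the archive has stripped most of the XML tags from the message above, a hedged reconstruction of the two fragments being described (a discard-capable RBD disk plus its virtio-scsi controller) might look like the following - the pool/image name, monitor host, auth secret UUID and PCI address are all placeholders, so check `virsh dumpxml` on a working guest for real values:

```xml
<disk type='network' device='disk'>
  <!-- cache and discard settings: the fragment above shows discard='unmap' -->
  <driver name='qemu' type='raw' cache='writeback' discard='unmap'/>
  <auth username='libvirt'>
    <secret type='ceph' uuid='00000000-0000-0000-0000-000000000000'/>
  </auth>
  <source protocol='rbd' name='rbd/myimage'>
    <host name='mon1.example.com' port='6789'/>
  </source>
  <!-- bus='scsi' is what routes the disk through the virtio-scsi controller -->
  <target dev='sda' bus='scsi'/>
</disk>

<controller type='scsi' index='0' model='virtio-scsi'>
  <!-- the fragment above ends with function='0x0'; slot is a placeholder -->
  <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
</controller>
```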
[ceph-users] Re: Slow write speed on 3-node cluster with 6* SATA Harddisks (~ 3.5 MB/s)
Hi, My three-node lab cluster is similar to yours but with 3x bluestore OSDs per node (4TB SATA spinning disks) and 1x shared DB/WAL (240GB SATA SSD) device per node. I'm only using gigabit networking (one interface public, one interface cluster), also Ceph 14.2.4 with 3x replicas. I would have expected your dd commands to use the cache, try these instead inside your VM: # Write test dd if=/dev/zero of=/zero.file bs=32M oflag=direct status=progress # Read test dd if=/zero.file of=/dev/null bs=32M iflag=direct status=progress You can obviously delete /zero.file when you're finished. - bs=32M tells dd to read/write 32MB at a time, I think the default is something like 512 bytes which slows things down significantly without a cache. - oflag/iflag=direct will use direct I/O bypassing the cache. - status=progress is just instead of where you're using pv to show the transfer rate. On my cluster I get 124MB/sec read (maxing out the network) and 74MB/sec write. Without bs=32M I get more like 1MB/sec read and write. The VM I'm using for this test is cache=writeback and virtio-scsi (i.e. sda rather than vda). Simon On 05/11/2019 11:31, Hermann Himmelbauer wrote: Hi, Thank you for your quick reply, Proxmox offers me "writeback" (cache=writeback) and "writeback unsafe" (cache=unsafe), however, for my "dd" test, this makes no difference at all. I still have write speeds of ~ 4,5 MB/s. Perhaps "dd" disables the write cache? Would it perhaps help to put the journal or something else on a SSD? Best Regards, Hermann
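If dd's numbers still look suspect, fio gives a more controlled measurement; a sketch along the same lines (file path and size are placeholders, and fio needs to be installed in the guest):

```shell
# Sequential write, direct I/O, 32MB blocks - comparable to the dd test above
fio --name=seqwrite --filename=/fio.test --rw=write --bs=32M \
    --size=1G --direct=1 --numjobs=1

# Sequential read of the same file
fio --name=seqread --filename=/fio.test --rw=read --bs=32M \
    --size=1G --direct=1 --numjobs=1

# Clean up the test file afterwards
rm -f /fio.test
```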
[ceph-users] Re: Possible data corruption with 14.2.3 and 14.2.4
Hi, I have two new-ish 14.2.4 clusters that began life on 14.2.0, all with HDD OSDs with SSD DB/WALs, but neither has experienced obvious problems yet. What's the impact of this? Does possible data corruption mean possible silent data corruption? Or does the corruption cause the OSD failures mentioned on the tracker, and you're basically ok if you either haven't had a failure or if you keep on top of failures the way you would if they were normal disk failures? Thanks, Simon On 14/11/2019 16:10, Sage Weil wrote: Hi everyone, We've identified a data corruption bug[1], first introduced[2] (by yours truly) in 14.2.3 and affecting both 14.2.3 and 14.2.4. The corruption appears as a rocksdb checksum error or assertion that looks like os/bluestore/fastbmap_allocator_impl.h: 750: FAILED ceph_assert(available >= allocated) or in some cases a rocksdb checksum error. It only affects BlueStore OSDs that have a separate 'db' or 'wal' device. We have a fix[3] that is working its way through testing, and will expedite the next Nautilus point release (14.2.5) once it is ready. If you are running 14.2.2 or 14.2.1 and use BlueStore OSDs with separate 'db' volumes, you should consider waiting to upgrade until 14.2.5 is released. A big thank you to Igor Fedotov and several *extremely* helpful users who managed to reproduce and track down this problem! sage [1] https://tracker.ceph.com/issues/42223 [2] https://github.com/ceph/ceph/commit/096033b9d931312c0688c2eea7e14626bfde0ad7#diff-618db1d3389289a9d25840a4500ef0b0 [3] https://github.com/ceph/ceph/pull/31621
[ceph-users] Re: Possible data corruption with 14.2.3 and 14.2.4
Hi Igor, On 15/11/2019 14:22, Igor Fedotov wrote: Do you mean both standalone DB and(!!) standalone WAL devices/partitions by having SSD DB/WAL? No, 1x combined DB/WAL partition on an SSD and 1x data partition on an HDD per OSD. I.e. created like: ceph-deploy osd create --data /dev/sda --block-db ssd0/ceph-db-disk0 ceph-deploy osd create --data /dev/sdb --block-db ssd0/ceph-db-disk1 ceph-deploy osd create --data /dev/sdc --block-db ssd0/ceph-db-disk2 --block-wal wasn't used. If so then BlueFS might eventually overwrite some data at your DB volume with BlueFS log content. Which most probably makes the OSD crash and unable to restart one day. This is quite a random and not very frequent event which is to some degree dependent on cluster loading. And the period between actual data corruption and any evidence of this is non-zero most of the time - we tend to see it mostly when RocksDB was performing compaction. So this, if I've understood you correctly, is for those with 3 separate (DB + WAL + Data) devices per OSD. Not my setup. Other OSD configurations which might suffer from the issue are main device + WAL device. Much less failure probability exists for the main + DB layout. It requires an almost full DB to have any chance of appearing. This sounds like my setup: 2 separate (DB/WAL combined + Data) devices per OSD. Main-only device configurations aren't under threat as far as I can tell. And this is for all-in-one devices that aren't at risk. Understood. While we're waiting for 14.2.5 to be released, what should 14.2.3/4 users with an at risk setup do in the meantime, if anything? - Check how full their DB devices are? - Avoid adding new data/load to the cluster? - Would deep scrubbing detect any undiscovered corruption? - Get backups ready to restore? I mean, how bad is this? Thanks, Simon.
[ceph-users] Re: Possible data corruption with 14.2.3 and 14.2.4
Hi Igor, Thanks very much for providing all this detail. On 18/11/2019 10:43, Igor Fedotov wrote: - Check how full their DB devices are? For your case it makes sense to check this. And then safely wait for 14.2.5 if it's not full. bluefs.db_used_bytes / bluefs_db_total_bytes is only around 1-2% (I am almost exclusively RBD and using a 64GB DB/WAL partition) and bluefs_slow_used_bytes is 0 on them all so it would seem I have little to worry about here with an essentially zero chance of corruption so far. I will sit tight and wait for 14.2.5. Thanks again, Simon.
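For anyone wanting to check the same counters, they come from the OSD admin socket on the host running that OSD - the OSD id is a placeholder and jq is optional:

```shell
# Dump the bluefs usage counters for one OSD
ceph daemon osd.0 perf dump bluefs

# Or pick out just the relevant fields with jq
ceph daemon osd.0 perf dump | \
    jq '.bluefs | {db_total_bytes, db_used_bytes, slow_used_bytes}'
```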
[ceph-users] Re: Possible data corruption with 14.2.3 and 14.2.4
Any word on 14.2.5? Nervously waiting here . . . Thanks, Simon. On 18/11/2019 11:29, Simon Ironside wrote: I will sit tight and wait for 14.2.5. Thanks again, Simon.
[ceph-users] Re: v14.2.5 Nautilus released
Thanks all! On 10/12/2019 09:45, Abhishek Lekshmanan wrote: This is the fifth release of the Ceph Nautilus release series. Among the many notable changes, this release fixes a critical BlueStore bug that was introduced in 14.2.3. All Nautilus users are advised to upgrade to this release. For the complete changelog entry, please visit the release blog at https://ceph.io/releases/v14-2-5-nautilus-released/
[ceph-users] Re: moving small production cluster to different datacenter
And us too, exactly as below. One at a time then wait for things to recover before moving the next host. We didn't have any issues with this approach either. Regards, Simon. On 28/01/2020 13:03, Tobias Urdin wrote: We did this as well, pretty much the same as Wido. We had a fiber connection with good latency between the locations. We installed a virtual monitor in the destination datacenter to always keep quorum, then we simply moved one node at a time after setting noout. When we took a node up on the destination we had a small moving of data, then the cluster was back to healthy again. We had higher apply and commit latency until all the nodes were on the destination side but we never noticed any performance issues that caused problems for us.
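The per-host procedure being described boils down to something like this sketch:

```shell
# Before powering down the first host, stop OSDs being marked out
ceph osd set noout

# ...shut down the host, physically move it, power it back up, then wait:
ceph -s       # until health shows only the noout warning again
ceph pg stat  # until all PGs are active+clean

# Repeat for each remaining host, then re-enable automatic rebalancing
ceph osd unset noout
```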
[ceph-users] Re: Possible bug with rbd export/import?
On 10/03/2020 19:31, Matt Dunavant wrote: We're using rbd images for VM drives both with and without custom stripe sizes. When we try to export/import the drive to another ceph cluster, the VM always comes up in a busted state it can't recover from. Don't shoot me for asking but is the VM being exported still started up and in use? Asking since you don't mention using snapshots. Simon
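If the VM is live, exporting via a snapshot at least gives a crash-consistent image; a sketch of that approach - pool, image and snapshot names plus the destination host are all placeholders:

```shell
# Take a point-in-time snapshot of the (possibly running) image
rbd snap create mypool/myimage@migrate

# Export the snapshot and stream it into the destination cluster
rbd export mypool/myimage@migrate - | \
    ssh dest-host 'rbd import - mypool/myimage'

# Clean up the snapshot afterwards
rbd snap rm mypool/myimage@migrate
```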