Re: [ceph-users] Unexpected disk write activity with btrfs OSDs
I don’t run Ceph on btrfs, but isn’t this related to the btrfs snapshotting
feature ceph uses to ensure a consistent journal?

Jan

> On 19 Jun 2015, at 14:26, Lionel Bouton <lionel+c...@bouton.name> wrote:
>
> On 06/19/15 13:42, Burkhard Linke wrote:
>> Forget the reply to the list...
>>
>> Forwarded Message
>> Subject: Re: [ceph-users] Unexpected disk write activity with btrfs OSDs
>> Date: Fri, 19 Jun 2015 09:06:33 +0200
>> From: Burkhard Linke <burkhard.li...@computational.bio.uni-giessen.de>
>> To: Lionel Bouton <lionel+c...@bouton.name>
>>
>> Hi,
>>
>> On 06/18/2015 11:28 PM, Lionel Bouton wrote:
>>> Hi,
>>>
>>> *snipsnap*
>>>
>>> - Disks with btrfs OSDs have a spike of activity every 30s (2 intervals
>>> of 10s with nearly 0 activity, one interval with a total amount of
>>> writes of ~120MB). The averages are: 4MB/s, 100 IO/s.
>>
>> Just a guess: btrfs has a commit interval which defaults to 30 seconds.
>> You can verify this by changing the interval with the commit=XYZ mount
>> option.
>
> I know, and I tested commit intervals of 60 and 120 seconds without any
> change. As this is directly linked to "filestore max sync interval" I
> didn't report this test result.
>
> Best regards,
>
> Lionel

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Unexpected disk write activity with btrfs OSDs
On Tue, Jun 23, 2015 at 9:50 AM, Erik Logtenberg <e...@logtenberg.eu> wrote:
> Thanks! Just so I understand correctly, the btrfs snapshots are mainly
> useful if the journals are on the same disk as the OSD, right? Is it
> indeed safe to turn them off if the journals are on a separate SSD?

That's not quite it... it *is* safe to turn off btrfs snapshots, but by
doing so you get the same behavior as XFS does by default. With btrfs
snapshots enabled, the OSD uses snapshots for consistent checkpoints and
doesn't need to be quite as careful in its ordering of writes to the
backing filesystem versus the journal. Our use of snapshots is pretty
abusive though, so you may well find better performance without them. :(

The location of the journal on- or off-disk has nothing to do with it,
though. :)
-Greg
Re: [ceph-users] Unexpected disk write activity with btrfs OSDs
Thanks! Just so I understand correctly, the btrfs snapshots are mainly
useful if the journals are on the same disk as the OSD, right? Is it
indeed safe to turn them off if the journals are on a separate SSD?

Kind regards,

Erik.

On 22-06-15 20:18, Krzysztof Nowicki wrote:
> On Mon, 22.06.2015 at 20:09, Lionel Bouton
> <lionel-subscript...@bouton.name> wrote:
>> On 06/22/15 17:21, Erik Logtenberg wrote:
>>> I have the journals on a separate disk too. How do you disable the
>>> snapshotting on the OSD?
>>
>> http://ceph.com/docs/master/rados/configuration/filestore-config-ref/ :
>> filestore btrfs snap = false
>
> Once this is done and verified working (after a restart of the OSD),
> make sure to remove the now unnecessary snapshots (snap_xxx) from the
> OSD filesystem, as failing to do so will cause occupied space to grow
> over time (old and unneeded versions of objects will remain stored).
> This can be done by running
> 'sudo btrfs subvolume delete /var/lib/ceph/osd/ceph-xx/snap_yy'.
> To verify that the option change is effective you can observe the
> 'snap_xxx' directories - after disabling snapshotting their revision
> number should not increase any more.
Re: [ceph-users] Unexpected disk write activity with btrfs OSDs
On Tue, Jun 23, 2015 at 12:17 PM, Lionel Bouton
<lionel-subscript...@bouton.name> wrote:
> On 06/23/15 11:43, Gregory Farnum wrote:
>> On Tue, Jun 23, 2015 at 9:50 AM, Erik Logtenberg <e...@logtenberg.eu> wrote:
>>> Thanks! Just so I understand correctly, the btrfs snapshots are mainly
>>> useful if the journals are on the same disk as the OSD, right? Is it
>>> indeed safe to turn them off if the journals are on a separate SSD?
>>
>> That's not quite it... it *is* safe to turn off btrfs snapshots, but by
>> doing so you get the same behavior as XFS does by default.
>
> I just disabled snapshots and the OSD logged this:
>
> mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
>
> which I assume means that I don't have to change the following
> configuration parameters; the OSD takes care of using sensible values
> for them:
>
> filestore journal parallel
> filestore journal writeahead

Right. You probably shouldn't mess with these in any case; the OSD
selects the right mode based on other things.

> From the limited feedback I got from our monitoring, our disk writes are
> now ~1MB/s instead of ~4MB/s when the cluster is mostly idle. There are
> still spikes of activity (compared to XFS) but they might just be linked
> to the default btrfs commit delay and be harmless. XFS OSDs still see a
> lower amount of writes, but this is expected when comparing a COW
> filesystem to a classic one.
>
> Note that these numbers might push you from an Intel DC S3500 to an
> S3610 (for example) if you plan to use btrfs on Intel SSD OSDs: ~1MB/s
> is 30+TB/year... With btrfs snapshots enabled and 4MB/s this is
> 120+TB/year.
>
> Lionel
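Lionel's 30+TB/year figure can be checked with a quick sketch (this assumes the rate is sustained year-round and ignores SSD-internal write amplification):

```shell
# TB written per year for a sustained average write rate, in MB/s:
# rate * 86400 s/day * 365 days / 1e6 MB-per-TB
for rate in 1 4; do
  awk -v r="$rate" 'BEGIN { printf "%d MB/s -> %.1f TB/year\n", r, r * 86400 * 365 / 1e6 }'
done
# -> 1 MB/s -> 31.5 TB/year
# -> 4 MB/s -> 126.1 TB/year
```

So even the "idle" btrfs write load, sustained, lands in the endurance range where drive write ratings start to matter.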
Re: [ceph-users] Unexpected disk write activity with btrfs OSDs
On 06/22/15 17:21, Erik Logtenberg wrote:
> I have the journals on a separate disk too. How do you disable the
> snapshotting on the OSD?

http://ceph.com/docs/master/rados/configuration/filestore-config-ref/ :

filestore btrfs snap = false
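As a ceph.conf fragment this would look like the sketch below (placing it in the [osd] section is my assumption; per the rest of the thread, it only takes effect after an OSD restart):

```ini
[osd]
# Stop using btrfs snapshots as consistency checkpoints; the OSD then
# falls back to writeahead journaling, the same behaviour as on XFS.
filestore btrfs snap = false
```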
Re: [ceph-users] Unexpected disk write activity with btrfs OSDs
AFAIK the snapshots are useful when the journal sits inside the OSD
filesystem. In case the journal is on a separate filesystem/device, OSD
btrfs snapshots can be safely disabled. I have done so on my OSDs, as they
all use external journals, and experienced a reduction in periodic writes,
but they are not completely gone.

On Mon, 22.06.2015 at 11:27, Jan Schermer <j...@schermer.cz> wrote:
> I don’t run Ceph on btrfs, but isn’t this related to the btrfs
> snapshotting feature ceph uses to ensure a consistent journal?
>
> Jan
>
> *snipsnap*
Re: [ceph-users] Unexpected disk write activity with btrfs OSDs
On 06/19/15 13:23, Erik Logtenberg wrote:
> I believe this may be the same issue I reported some time ago, which is
> as of yet unsolved.
>
> https://www.mail-archive.com/ceph-users@lists.ceph.com/msg19770.html
>
> I used strace to figure out that the OSDs were doing an incredible
> amount of getxattr, setxattr and removexattr calls, for no apparent
> reason. Do you see the same write pattern? My OSDs are also
> btrfs-backed.

Thanks for the heads-up. Did you witness this with no activity at all?
From your report, this was happening during CephFS reads, and we don't
use CephFS, only RBD volumes. The amount of written data in our case is
fairly consistent too.

I'll try to launch a strace but I'm not sure I will have the time before
we add SSDs to our current HDD-only setup. If I can strace a btrfs OSD
without SSD journals I'll report here.

Lionel
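To reproduce Erik's measurement, something like the following could work. This is only a sketch: it assembles and prints the strace invocation rather than running it (running it needs root and a live ceph-osd), and the syscall list is my guess at the relevant calls:

```shell
# Build an strace command that counts xattr syscalls (-c prints a summary
# when the trace ends) for the first ceph-osd process found.
pid="$(pidof ceph-osd 2>/dev/null | awk '{print $1}')"
pid="${pid:-<osd-pid>}"   # placeholder when no OSD is running here
echo "strace -f -c -e trace=getxattr,fgetxattr,setxattr,fsetxattr,removexattr -p $pid"
```

Run the printed command as root and stop it with Ctrl-C after a few sync intervals to get per-syscall counts.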
Re: [ceph-users] Unexpected disk write activity with btrfs OSDs
On 06/22/15 11:27, Jan Schermer wrote:
> I don’t run Ceph on btrfs, but isn’t this related to the btrfs
> snapshotting feature ceph uses to ensure a consistent journal?

It's possible: if I understand the code correctly, the btrfs filestore
backend creates a snapshot when syncing the journal.

I'm a little surprised that btrfs would need approximately 120MB written
to disk to perform a snapshot of a subvolume with ~160k files (and the
removal of the oldest one, as the OSD maintains 2 active), but snapshots
aren't guaranteed to be dirt cheap and probably weren't optimised for
this frequency. I'm surprised because I was under the impression that a
snapshot on btrfs was only a matter of keeping a reference to the root of
the filesystem btree, which (at least in theory) seems cheap. In fact,
thinking while writing this, I realise it might very well be the release
of a previous snapshot, with its associated cleanups, that is costly, not
the snapshot creation.

We are about to add Intel DC SSDs for journals and I believe Krzysztof is
right: we should be able to disable the snapshots safely then. The main
reasons for us to use btrfs are compression and crc at the fs level. It
seems performance could be one too: we get consistently better latencies
vs XFS in our configuration. So I'm not particularly bothered by this; it
may be something useful to document (and at least leave a trace here for
others to find): btrfs with the default "filestore max sync interval"
(5 seconds) may have serious performance problems in most configurations.

I'm not sure I will have the time to trace the OSD processes to check
whether I see what Erik saw with CephFS (lots of xattr activity,
including setxattr and removexattr): I'm not using CephFS, and his
findings didn't specify whether he was using btrfs- and/or xfs-backed
OSDs (we only see this behaviour on btrfs).

Best regards,

Lionel
Re: [ceph-users] Unexpected disk write activity with btrfs OSDs
On Mon, 22.06.2015 at 13:11, Lionel Bouton <lionel+c...@bouton.name> wrote:
> On 06/22/15 11:27, Jan Schermer wrote:
>> I don’t run Ceph on btrfs, but isn’t this related to the btrfs
>> snapshotting feature ceph uses to ensure a consistent journal?
>
> It's possible: if I understand the code correctly, the btrfs filestore
> backend creates a snapshot when syncing the journal.
>
> *snipsnap*
>
> In fact, thinking while writing this, I realise it might very well be
> the release of a previous snapshot, with its associated cleanups, that
> is costly, not the snapshot creation.

I think it's not the snapshot creation that causes I/O, but deleting and
cleaning up old snapshots. I've noticed that it's the btrfs-cleaner
process that usually shows the highest I/O.

> *snipsnap*
Re: [ceph-users] Unexpected disk write activity with btrfs OSDs
On Mon, 22.06.2015 at 20:09, Lionel Bouton
<lionel-subscript...@bouton.name> wrote:
> On 06/22/15 17:21, Erik Logtenberg wrote:
>> I have the journals on a separate disk too. How do you disable the
>> snapshotting on the OSD?
>
> http://ceph.com/docs/master/rados/configuration/filestore-config-ref/ :
> filestore btrfs snap = false

Once this is done and verified working (after a restart of the OSD), make
sure to remove the now unnecessary snapshots (snap_xxx) from the OSD
filesystem, as failing to do so will cause occupied space to grow over
time (old and unneeded versions of objects will remain stored). This can
be done by running
'sudo btrfs subvolume delete /var/lib/ceph/osd/ceph-xx/snap_yy'.
To verify that the option change is effective you can observe the
'snap_xxx' directories - after disabling snapshotting their revision
number should not increase any more.
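The cleanup step above can be scripted over all leftover snapshots of one OSD. This sketch is a dry run: it only prints the delete commands, and the ceph-0 path is a placeholder for your OSD id:

```shell
# Print (dry run) the delete command for every leftover snap_* subvolume
# of one OSD. Drop the 'echo' to actually delete them.
osd=/var/lib/ceph/osd/ceph-0
for snap in "$osd"/snap_*; do
  [ -e "$snap" ] || continue   # glob didn't match: no leftover snapshots
  echo sudo btrfs subvolume delete "$snap"
done
```

Review the printed commands before removing the `echo`; deleting a snapshot while `filestore btrfs snap` is still enabled would defeat the OSD's checkpointing.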
Re: [ceph-users] Unexpected disk write activity with btrfs OSDs
I just realized I forgot to add the proper context: this is with Firefly
0.80.9, and the btrfs OSDs are running on kernel 4.0.5 (this was happening
with previous kernel versions according to our monitoring history); xfs
OSDs run on 4.0.5 or 3.18.9. There are 23 OSDs total and 2 of them are
using btrfs.

On 06/18/15 23:28, Lionel Bouton wrote:
> Hi,
>
> I've just noticed an odd behaviour with the btrfs OSDs. We monitor the
> amount of disk writes on each device; our granularity is 10s (every 10s
> the monitoring system collects the total amount of sectors written and
> write IOs performed since boot and computes both the B/s and IO/s).
>
> With only residual write activity on our storage network (~450kB/s total
> for the whole Ceph cluster, which amounts to a theoretical ~120kB/s on
> each OSD once replication, double writes due to the journal and the
> number of OSDs are factored in):
>
> - Disks with btrfs OSDs have a spike of activity every 30s (2 intervals
>   of 10s with nearly 0 activity, one interval with a total amount of
>   writes of ~120MB). The averages are: 4MB/s, 100 IO/s.
> - Disks with xfs OSDs (with the journal on a separate partition but the
>   same disk) don't have these spikes of activity and the averages are
>   far lower: 160kB/s and 5 IO/s. This is not far off what is expected
>   from the whole cluster's write activity.
>
> There's a setting of 30s on our platform: filestore max sync interval.
> I changed it to 60s with
>
> ceph tell osd.* injectargs '--filestore-max-sync-interval 60'
>
> and the amount of writes was lowered to ~2.5MB/s. I changed it to 5s
> (the default) with
>
> ceph tell osd.* injectargs '--filestore-max-sync-interval 5'
>
> and the amount of writes to the device rose to an average of 10MB/s
> (and, given our sampling interval of 10s, appeared constant). During
> these tests the activity on disks hosting XFS OSDs didn't change much.
>
> So it seems filestore syncs generate far more activity on btrfs OSDs
> compared to XFS OSDs (journal activity included for both).
>
> Note that autodefrag is disabled on our btrfs OSDs. We use our own
> scheduler, which in the case of our OSDs limits the amount of
> defragmented data to ~10MB per minute in the worst case and usually
> (during low write activity, which was the case here) triggers a single
> file defragmentation every 2 minutes (which amounts to a 4MB write, as
> we only host RBDs with the default order value). So defragmentation
> shouldn't be an issue here.
>
> This doesn't seem to generate too much stress when "filestore max sync
> interval" is 30s (our btrfs OSDs are faster than xfs OSDs with the same
> amount of data, according to apply latencies), but at 5s the btrfs OSDs
> are far slower than our xfs OSDs, with 10x the average apply latency (we
> didn't let this continue more than 10 minutes as it began to make some
> VMs wait too long for IOs).
>
> Does anyone know if this is normal and why it is happening?
>
> Best regards,
>
> Lionel
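Lionel's "theoretical ~120kB/s per OSD" can be reconstructed as follows (a sketch; the 3x replication factor is my assumption, since the message doesn't state the pool size):

```shell
# Theoretical per-OSD disk write rate from cluster-wide client writes:
# client_kBps * replication * 2 (journal write + filestore write) / OSDs
awk 'BEGIN { printf "%.0f kB/s per OSD\n", 450 * 3 * 2 / 23 }'
# -> 117 kB/s per OSD
```

That matches the xfs OSDs' measured ~160kB/s reasonably well, which is what makes the btrfs OSDs' 4MB/s average stand out as anomalous.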