Re: [ceph-users] Unexpected disk write activity with btrfs OSDs

2015-06-23 Thread Gregory Farnum
On Tue, Jun 23, 2015 at 9:50 AM, Erik Logtenberg e...@logtenberg.eu wrote:
 Thanks!

 Just so I understand correctly, the btrfs snapshots are mainly useful if
 the journals are on the same disk as the osd, right? Is it indeed safe
 to turn them off if the journals are on a separate ssd?

That's not quite it...it *is* safe to turn off btrfs snapshots, but by
doing so you get the same behavior as XFS does by default.

With btrfs snapshots enabled the OSD uses snapshots for consistent
checkpoints and doesn't need to be quite as careful in its ordering of
writes to the backing filesystem versus the journaling. Our use of
snapshots is pretty abusive though, so you may well find better
performance without them. :( The location of the journal on- or
off-disk has nothing to do with it, though. :)
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Unexpected disk write activity with btrfs OSDs

2015-06-23 Thread Erik Logtenberg
Thanks!

Just so I understand correctly, the btrfs snapshots are mainly useful if
the journals are on the same disk as the osd, right? Is it indeed safe
to turn them off if the journals are on a separate ssd?

Kind regards,

Erik.


On 22-06-15 20:18, Krzysztof Nowicki wrote:
 On Mon, 22 Jun 2015 at 20:09, Lionel Bouton lionel-subscript...@bouton.name wrote:
 
 On 06/22/15 17:21, Erik Logtenberg wrote:
  I have the journals on a separate disk too. How do you disable the
  snapshotting on the OSD?
 http://ceph.com/docs/master/rados/configuration/filestore-config-ref/ :
 
 filestore btrfs snap = false
 
 Once this is done and verified working (after a restart of the OSD), make
 sure to remove the now-unnecessary snapshots (snap_xxx) from the OSD
 filesystem, as failing to do so will cause occupied space to grow over
 time (old and unneeded versions of objects will remain stored).
 This can be done by running 'sudo btrfs subvolume delete
 /var/lib/ceph/osd/ceph-xx/snap_yy'. To verify that the option change is
 effective, you can observe the 'snap_xxx' directories - after disabling
 snapshotting, their revision number should not increase any more.
 
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Unexpected disk write activity with btrfs OSDs

2015-06-23 Thread Gregory Farnum
On Tue, Jun 23, 2015 at 12:17 PM, Lionel Bouton
lionel-subscript...@bouton.name wrote:
 On 06/23/15 11:43, Gregory Farnum wrote:
 On Tue, Jun 23, 2015 at 9:50 AM, Erik Logtenberg e...@logtenberg.eu wrote:
 Thanks!

 Just so I understand correctly, the btrfs snapshots are mainly useful if
 the journals are on the same disk as the osd, right? Is it indeed safe
 to turn them off if the journals are on a separate ssd?
 That's not quite it...it *is* safe to turn off btrfs snapshots, but by
 doing so you get the same behavior as XFS does by default.

 I just disabled snapshots and the OSD logged this:

 mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled

 which I assume means that I don't have to change the following
 configuration parameters, the OSD takes care of using sensible values
 for them:

 filestore journal parallel
 filestore journal writeahead

Right. You probably shouldn't mess with these in any case; the OSD
selects the right mode based on other things.
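
(For reference, the journal mode the OSD settled on is logged at startup, as
the line quoted above shows; a quick way to check it, assuming the default log
path and OSD id 0:

grep -i 'journal mode' /var/log/ceph/ceph-osd.0.log
)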


 From the limited feedback I got from our monitoring, our disk writes are
 now ~1MB/s instead of ~4MB/s when the cluster is mostly idle. There are
 still spikes of activity (compared to XFS) but they might just be linked
 to the default btrfs commit delay and be harmless. XFS OSDs still have a
 lower amount of writes, though this is expected when comparing a COW
 filesystem to a classic one.

 Note that these numbers might push you from Intel DC S3500 to S3610 (for
 example) if you plan to use btrfs on Intel SSD OSDs: ~1MB/s is 30+TB/year...
 With btrfs snapshots enabled and 4MB/s this is 120+TB/year.
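
(As a quick check on those figures: 1 MB/s sustained is 1 MB x 86,400 s/day x
365 days ≈ 31.5 TB written per year, and 4 MB/s is ≈ 126 TB per year.)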

 Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Unexpected disk write activity with btrfs OSDs

2015-06-22 Thread Lionel Bouton
On 06/22/15 17:21, Erik Logtenberg wrote:
 I have the journals on a separate disk too. How do you disable the
 snapshotting on the OSD?
http://ceph.com/docs/master/rados/configuration/filestore-config-ref/ :

filestore btrfs snap = false
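
For example, a minimal ceph.conf sketch (the [osd] section placement is my
assumption; the OSDs need a restart for the change to take effect):

[osd]
filestore btrfs snap = false
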
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Unexpected disk write activity with btrfs OSDs

2015-06-22 Thread Krzysztof Nowicki
AFAIK the snapshots are useful when the journal sits inside the OSD
filesystem. If the journal is on a separate filesystem/device, the OSD's
btrfs snapshots can be safely disabled. I have done so on my OSDs, as they
all use external journals, and saw a reduction in periodic writes, though
they are not completely gone.
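
A quick way to tell which case applies (a sketch; OSD id 0 and the default
data path are placeholders): with FileStore the journal is either a plain file
inside the OSD filesystem or a symlink to a separate device/partition, so

ls -l /var/lib/ceph/osd/ceph-0/journal

shows a symlink to another block device when the journal is external.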

On Mon, 22 Jun 2015 at 11:27, Jan Schermer j...@schermer.cz wrote:

 I don’t run Ceph on btrfs, but isn’t this related to the btrfs
 snapshotting feature ceph uses to ensure a consistent journal?

 Jan

 On 19 Jun 2015, at 14:26, Lionel Bouton lionel+c...@bouton.name wrote:

  On 06/19/15 13:42, Burkhard Linke wrote:


 Forgot the reply to the list...

  Forwarded Message 
 Subject: Re: [ceph-users] Unexpected disk write activity with btrfs OSDs
 Date: Fri, 19 Jun 2015 09:06:33 +0200
 From: Burkhard Linke burkhard.li...@computational.bio.uni-giessen.de
 To: Lionel Bouton lionel+c...@bouton.name

 Hi,

 On 06/18/2015 11:28 PM, Lionel Bouton wrote:
  Hi,
 *snipsnap*

  - Disks with btrfs OSD have a spike of activity every 30s (2 intervals
  of 10s with nearly 0 activity, one interval with a total amount of
  writes of ~120MB). The averages are : 4MB/s, 100 IO/s.

 Just a guess:

 btrfs has a commit interval which defaults to 30 seconds.

 You can verify this by changing the interval with the commit=XYZ mount
 option.


 I know and I tested commit intervals of 60 and 120 seconds without any
 change. As this is directly linked to filestore max sync interval I didn't
 report this test result.

 Best regards,

 Lionel
  ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Unexpected disk write activity with btrfs OSDs

2015-06-22 Thread Lionel Bouton
On 06/19/15 13:23, Erik Logtenberg wrote:
 I believe this may be the same issue I reported some time ago, which is
 as of yet unsolved.

 https://www.mail-archive.com/ceph-users@lists.ceph.com/msg19770.html

 I used strace to figure out that the OSDs were doing an incredible
 amount of getxattr, setxattr and removexattr calls, for no apparent
 reason. Do you see the same write pattern?

 My OSDs are also btrfs-backed.

Thanks for the heads-up.

Did you witness this with no activity at all?
From your report, this was happening during CephFS reads and we don't
use CephFS, only RBD volumes.

The amount of written data in our case is fairly consistent too.
I'll try to launch a strace, but I'm not sure I will have the time
before we add SSDs to our current HDD-only setup.

If I can strace btrfs OSD without SSD journals I'll report here.
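
For reference, a minimal strace invocation along those lines (a sketch; the
OSD PID is a placeholder and the syscall list simply mirrors what Erik
reported):

strace -f -c -e trace=getxattr,setxattr,removexattr,fgetxattr,fsetxattr,fremovexattr -p <osd-pid>

-f follows the OSD's threads and -c prints a per-syscall count summary when
the trace is interrupted, which should make any excessive xattr traffic
obvious.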

Lionel


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Unexpected disk write activity with btrfs OSDs

2015-06-22 Thread Lionel Bouton
On 06/22/15 11:27, Jan Schermer wrote:
 I don’t run Ceph on btrfs, but isn’t this related to the btrfs
 snapshotting feature ceph uses to ensure a consistent journal?

It's possible: if I understand the code correctly, the btrfs filestore
backend creates a snapshot when syncing the journal. I'm a little
surprised that btrfs would need approximately 120MB written to disk to
snapshot a subvolume with ~160k files (and remove the oldest snapshot, as
the OSD maintains 2 active ones), but snapshots aren't guaranteed to be
dirt cheap and probably weren't optimised for this frequency. I'm
surprised because I was under the impression that a snapshot on btrfs is
only a matter of keeping a reference to the root of the filesystem btree,
which (at least in theory) seems cheap. In fact, thinking about it while
writing this, I realise it might very well be the release of the previous
snapshot, with its associated cleanup, that is costly, not the snapshot
creation.

We are about to add Intel DC SSDs for journals and I believe Krzysztof
is right: we should be able to disable the snapshots safely then. The
main reason for us to use btrfs is compression and crc at the fs level.
Performance may be another: we get consistently better latencies than
xfs in our configuration. So I'm not particularly bothered by this; it
may be something useful to document (and at least leave a trace here for
others to find): btrfs with the default filestore max sync interval (5
seconds) may have serious performance problems in most configurations.

I'm not sure I will have the time to trace the OSD processes to check
whether I see what Erik saw with CephFS (lots of xattr activity including
setxattr and removexattr): I'm not using CephFS, and his findings didn't
specify whether he was using btrfs- and/or xfs-backed OSDs (we only see
this behaviour on btrfs).

Best regards,

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Unexpected disk write activity with btrfs OSDs

2015-06-22 Thread Krzysztof Nowicki
On Mon, 22 Jun 2015 at 13:11, Lionel Bouton lionel+c...@bouton.name wrote:

 On 06/22/15 11:27, Jan Schermer wrote:

 I don’t run Ceph on btrfs, but isn’t this related to the btrfs
 snapshotting feature ceph uses to ensure a consistent journal?


 It's possible: if I understand the code correctly, the btrfs filestore
 backend creates a snapshot when syncing the journal. I'm a little surprised
 that btrfs would need approximately 120MB written to disk to snapshot a
 subvolume with ~160k files (and remove the oldest snapshot, as the OSD
 maintains 2 active ones), but snapshots aren't guaranteed to be dirt cheap
 and probably weren't optimised for this frequency. I'm surprised because I
 was under the impression that a snapshot on btrfs is only a matter of
 keeping a reference to the root of the filesystem btree, which (at least in
 theory) seems cheap. In fact, thinking about it while writing this, I
 realise it might very well be the release of the previous snapshot, with
 its associated cleanup, that is costly, not the snapshot creation.


I think it's not the snapshot creation that causes I/O, but deleting and
cleaning up old snapshots. I've noticed that it's the btrfs-cleaner process
that usually shows the highest I/O.
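
One way to confirm that on an OSD host (a sketch; assumes iotop is
installed):

sudo iotop -o -b -n 3 | grep btrfs

-o restricts the output to tasks actually doing I/O, -b/-n 3 take three
batch samples, and the grep keeps only the btrfs kernel worker threads
(btrfs-cleaner, btrfs-transaction, etc.).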


 We are about to add Intel DC SSDs for journals and I believe Krzysztof is
 right: we should be able to disable the snapshots safely then. The main
 reason for us to use btrfs is compression and crc at the fs level.
 Performance may be another: we get consistently better latencies than xfs
 in our configuration. So I'm not particularly bothered by this; it may be
 something useful to document (and at least leave a trace here for others to
 find): btrfs with the default filestore max sync interval (5 seconds) may
 have serious performance problems in most configurations.

 I'm not sure I will have the time to trace the OSD processes to check
 whether I see what Erik saw with CephFS (lots of xattr activity including
 setxattr and removexattr): I'm not using CephFS, and his findings didn't
 specify whether he was using btrfs- and/or xfs-backed OSDs (we only see
 this behaviour on btrfs).

 Best regards,

 Lionel
  ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Unexpected disk write activity with btrfs OSDs

2015-06-22 Thread Jan Schermer
I don’t run Ceph on btrfs, but isn’t this related to the btrfs snapshotting 
feature ceph uses to ensure a consistent journal?

Jan

 On 19 Jun 2015, at 14:26, Lionel Bouton lionel+c...@bouton.name wrote:
 
 On 06/19/15 13:42, Burkhard Linke wrote:
 
 Forgot the reply to the list...
 
  Forwarded Message 
 Subject: Re: [ceph-users] Unexpected disk write activity with btrfs OSDs
 Date: Fri, 19 Jun 2015 09:06:33 +0200
 From: Burkhard Linke burkhard.li...@computational.bio.uni-giessen.de
 To: Lionel Bouton lionel+c...@bouton.name
 
 Hi,
 
 On 06/18/2015 11:28 PM, Lionel Bouton wrote:
  Hi,
 *snipsnap*
 
  - Disks with btrfs OSD have a spike of activity every 30s (2 intervals
  of 10s with nearly 0 activity, one interval with a total amount of
  writes of ~120MB). The averages are : 4MB/s, 100 IO/s.
 
 Just a guess:
 
 btrfs has a commit interval which defaults to 30 seconds.
 
 You can verify this by changing the interval with the commit=XYZ mount 
 option.
 
 I know and I tested commit intervals of 60 and 120 seconds without any 
 change. As this is directly linked to filestore max sync interval I didn't 
 report this test result.
 
 Best regards,
 
 Lionel
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Unexpected disk write activity with btrfs OSDs

2015-06-22 Thread Krzysztof Nowicki
On Mon, 22 Jun 2015 at 20:09, Lionel Bouton lionel-subscript...@bouton.name wrote:

 On 06/22/15 17:21, Erik Logtenberg wrote:
  I have the journals on a separate disk too. How do you disable the
  snapshotting on the OSD?
 http://ceph.com/docs/master/rados/configuration/filestore-config-ref/ :

 filestore btrfs snap = false

Once this is done and verified working (after a restart of the OSD), make
sure to remove the now-unnecessary snapshots (snap_xxx) from the OSD
filesystem, as failing to do so will cause occupied space to grow over
time (old and unneeded versions of objects will remain stored). This can
be done by running 'sudo btrfs subvolume delete
/var/lib/ceph/osd/ceph-xx/snap_yy'. To verify that the option change is
effective, you can observe the 'snap_xxx' directories - after disabling
snapshotting, their revision number should not increase any more.
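
A possible sequence, sketched for a single OSD (OSD id 0 and the snapshot
name are illustrative; adjust them to your setup):

# after restarting the OSD with 'filestore btrfs snap = false'
sudo btrfs subvolume list /var/lib/ceph/osd/ceph-0 | grep snap_
# delete each leftover snapshot reported above
sudo btrfs subvolume delete /var/lib/ceph/osd/ceph-0/snap_<id>

If no new snap_xxx subvolumes appear afterwards, the option change is
effective.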

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Unexpected disk write activity with btrfs OSDs

2015-06-18 Thread Lionel Bouton
I just realized I forgot to add proper context:

this is with Firefly 0.80.9; the btrfs OSDs are running on kernel
4.0.5 (this was already happening with previous kernel versions according
to our monitoring history), and the xfs OSDs run on 4.0.5 or 3.18.9. There
are 23 OSDs in total and 2 of them are using btrfs.

On 06/18/15 23:28, Lionel Bouton wrote:
 Hi,

 I've just noticed an odd behaviour with the btrfs OSDs. We monitor the
 amount of disk writes on each device with a granularity of 10s (every 10s
 the monitoring system collects the total number of sectors written and
 write IOs performed since boot and computes both the B/s and IO/s).
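
(For reference, a rough shell equivalent of that sampling, assuming the OSD
disk is /dev/sdb; in /proc/diskstats, field 10 is sectors written, field 8 is
completed writes, and sectors are 512 bytes:

s1=$(awk '$3=="sdb"{print $10}' /proc/diskstats); w1=$(awk '$3=="sdb"{print $8}' /proc/diskstats)
sleep 10
s2=$(awk '$3=="sdb"{print $10}' /proc/diskstats); w2=$(awk '$3=="sdb"{print $8}' /proc/diskstats)
echo "write B/s: $(( (s2-s1)*512/10 ))  write IO/s: $(( (w2-w1)/10 ))"
)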

 With only residual write activity on our storage network (~450kB/s total
 for the whole Ceph cluster, which amounts to a theoretical ~120kB/s on
 each OSD once replication, double writes due to the journal and the number
 of OSDs are factored in):
 - Disks with btrfs OSD have a spike of activity every 30s (2 intervals
 of 10s with nearly 0 activity, one interval with a total amount of
 writes of ~120MB). The averages are : 4MB/s, 100 IO/s.
 - Disks with xfs OSD (with journal on a separate partition but same
 disk) don't have these spikes of activity and the averages are far lower
 : 160kB/s and 5 IO/s. This is not far off what is expected from the
 whole cluster write activity.
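
(For reference, the theoretical ~120kB/s per OSD works out if one assumes 3x
replication: 450 kB/s x 3 replicas x 2 (journal) / 23 OSDs ≈ 117 kB/s. The
replication factor is my assumption, not stated in the thread.)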

 There's a setting of 30s on our platform :
 filestore max sync interval
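
(For reference, such a value would typically be set in ceph.conf, e.g. under
[osd] - a sketch of the existing configuration, not a recommendation:

filestore max sync interval = 30
)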

 I changed it to 60s with
 ceph tell osd.* injectargs '--filestore-max-sync-interval 60'
 and the amount of writes was lowered to ~2.5MB/s.

 I changed it to 5s (the default) with
 ceph tell osd.* injectargs '--filestore-max-sync-interval 5'
 the amount of writes to the device rose to an average of 10MB/s (and
 given our sampling interval of 10s appeared constant).

 During these tests the activity on disks hosting XFS OSDs didn't change
 much.

 So it seems filestore syncs generate far more activity on btrfs OSDs
 compared to XFS OSDs (journal activity included for both).

 Note that autodefrag is disabled on our btrfs OSDs. We use our own
 scheduler, which in the case of our OSDs limits the amount of defragmented
 data to ~10MB per minute in the worst case and usually (during low write
 activity, which was the case here) triggers a single file defragmentation
 every 2 minutes (which amounts to a 4MB write, as we only host RBDs with
 the default order value). So defragmentation shouldn't be an issue here.

 This doesn't seem to generate too much stress when filestore max sync
 interval is 30s (our btrfs OSDs are faster than xfs OSDs with the same
 amount of data according to apply latencies) but at 5s the btrfs OSDs
 are far slower than our xfs OSDs with 10x the average apply latency (we
 didn't let this continue more than 10 minutes as it began to make some
 VMs wait for IOs too much).

 Does anyone know if this is normal and why it is happening?

 Best regards,

 Lionel
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com