Re: [ceph-users] Synchronous writes - tuning and some thoughts about them?

2015-06-04 Thread Josh Durgin

On 06/03/2015 04:15 AM, Jan Schermer wrote:

Thanks for a very helpful answer.
So if I understand it correctly, then what I want (crash consistency with RPO=0)
isn’t possible now in any way.
If there is no ordering in the RBD cache, then ignoring barriers sounds like a very
bad idea as well.


Yes, that's why the default rbd cache configuration in hammer stays in
writethrough mode until it sees a flush from the guest.
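
For reference, the client-side knobs involved look roughly like this in ceph.conf
(values are illustrative defaults, not a tuning recommendation):

    [client]
    rbd cache = true
    # stay in writethrough until the guest proves it sends flushes
    rbd cache writethrough until flush = true
    rbd cache size = 33554432          # 32 MB
    rbd cache max dirty = 25165824     # dirty bytes allowed once in writeback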


Any thoughts on ext4 with journal_async_commit? That should be safe in any 
circumstance, but it’s pretty hard to test that assumption…


It doesn't sound incredibly well-tested in general. It does something
like what you want, allowing some data to be lost but theoretically
preventing fs corruption, but I wouldn't trust it without a lot of
testing.
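
If you do want to experiment with it anyway, the mount options in question look
roughly like this (device and mountpoint are placeholders; note that some kernels
refuse journal_async_commit in combination with data=ordered):

    mount -o journal_checksum,journal_async_commit /dev/vdb /var/lib/db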

It seems like db-specific options for controlling how much data they
can lose may be best for your use case right now.
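
For example, MySQL/InnoDB exposes exactly that kind of trade-off - shown here only
as an illustration of the sort of knob I mean, other databases have their own
equivalents:

    [mysqld]
    # flush the redo log roughly once per second instead of on every commit;
    # up to ~1 second of committed transactions can be lost on a crash
    innodb_flush_log_at_trx_commit = 2
    # let the OS decide when the binlog reaches disk
    sync_binlog = 0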


Is anyone running big database (OLTP) workloads on Ceph? What did you do
to make them run? Out of the box we are all limited to the same ~100 tps (with
5 ms write latency)…


There is a lot of work going on to improve performance, and latency in
particular:

http://pad.ceph.com/p/performance_weekly

If you haven't seen them, Mark has a config optimized for latency at
the end of this:

http://nhm.ceph.com/Ceph_SSD_OSD_Performance.pdf

Josh



Re: [ceph-users] Synchronous writes - tuning and some thoughts about them?

2015-06-03 Thread Jan Schermer
Thanks for a very helpful answer.
So if I understand it correctly, then what I want (crash consistency with RPO=0)
isn’t possible now in any way.
If there is no ordering in the RBD cache, then ignoring barriers sounds like a very
bad idea as well.

Any thoughts on ext4 with journal_async_commit? That should be safe in any 
circumstance, but it’s pretty hard to test that assumption…

Is anyone running big database (OLTP) workloads on Ceph? What did you do
to make them run? Out of the box we are all limited to the same ~100 tps (with
5 ms write latency)…

Jan

 On 03 Jun 2015, at 02:08, Josh Durgin jdur...@redhat.com wrote:
 
 On 06/01/2015 03:41 AM, Jan Schermer wrote:
 Thanks, that’s it exactly.
 But I think that’s really too much work for now, that’s why I really would 
 like to see a quick-win by using the local RBD cache for now - that would 
 suffice for most workloads (not too many people run big databases on CEPH 
 now, those who do must be aware of this).
 
 The issue is - and I have not yet seen an answer to that - would it be safe 
 as it is now if the flushes were ignored (rbd cache = unsafe) or will it 
 completely b0rk the filesystem when not flushed properly?
 
 Generally the latter. Right now flushes are the only thing enforcing
 ordering for rbd. As a block device it doesn't guarantee that e.g. the
 extent at offset 0 is written before the extent at offset 4096 unless
 it sees a flush between the writes.
 
 As suggested earlier in this thread, maintaining order during writeback
 would make not sending flushes (via mount -o nobarrier in the guest or
 cache=unsafe for qemu) safer from a crash-consistency point of view.
 
 An fs or database on top of rbd would still have to replay their
 internal journal, and could lose some writes, but should be able to
 end up in a consistent state that way. This would make larger caches
 more useful, and would be a simple way to use a large local cache
 device as an rbd cache backend. Live migration should still work in
 such a system because qemu will still tell rbd to flush data at that
 point.
 
 A distributed local cache like [1] might be better long term, but
 much more complicated to implement.
 
 Josh
 
 [1] 
 https://www.usenix.org/conference/fast15/technical-sessions/presentation/bhagwat
 



Re: [ceph-users] Synchronous writes - tuning and some thoughts about them?

2015-06-02 Thread Josh Durgin

On 06/01/2015 03:41 AM, Jan Schermer wrote:

Thanks, that’s it exactly.
But I think that’s really too much work for now, that’s why I really would like 
to see a quick-win by using the local RBD cache for now - that would suffice 
for most workloads (not too many people run big databases on CEPH now, those 
who do must be aware of this).

The issue is - and I have not yet seen an answer to that - would it be safe as 
it is now if the flushes were ignored (rbd cache = unsafe) or will it 
completely b0rk the filesystem when not flushed properly?


Generally the latter. Right now flushes are the only thing enforcing
ordering for rbd. As a block device it doesn't guarantee that e.g. the
extent at offset 0 is written before the extent at offset 4096 unless
it sees a flush between the writes.

As suggested earlier in this thread, maintaining order during writeback
would make not sending flushes (via mount -o nobarrier in the guest or
cache=unsafe for qemu) safer from a crash-consistency point of view.
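
For clarity, those two mechanisms look something like this today (not something I'd
recommend, for the reasons above; image, device and mount paths are placeholders):

    # qemu: don't propagate guest flushes to rbd
    qemu-system-x86_64 ... -drive format=raw,file=rbd:rbd/vm-disk,cache=unsafe

    # or inside the guest: stop the filesystem from issuing barriers at all
    mount -o nobarrier /dev/vda1 /data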

An fs or database on top of rbd would still have to replay their
internal journal, and could lose some writes, but should be able to
end up in a consistent state that way. This would make larger caches
more useful, and would be a simple way to use a large local cache
device as an rbd cache backend. Live migration should still work in
such a system because qemu will still tell rbd to flush data at that
point.

A distributed local cache like [1] might be better long term, but
much more complicated to implement.

Josh

[1] 
https://www.usenix.org/conference/fast15/technical-sessions/presentation/bhagwat




Re: [ceph-users] Synchronous writes - tuning and some thoughts about them?

2015-06-01 Thread Nick Fisk


 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
 Mark Nelson
 Sent: 27 May 2015 16:00
 To: ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] Synchronous writes - tuning and some thoughts
 about them?
 
 On 05/27/2015 09:33 AM, Jan Schermer wrote:
  Hi Nick,
  responses inline, again ;-)
 
  Thanks
 
  Jan
 
  On 27 May 2015, at 12:29, Nick Fisk n...@fisk.me.uk wrote:
 
  Hi Jan,
 
  Responses inline below
 
  -Original Message-
  From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
  Behalf Of Jan Schermer
  Sent: 25 May 2015 21:14
  To: Nick Fisk
  Cc: ceph-users@lists.ceph.com
  Subject: Re: [ceph-users] Synchronous writes - tuning and some
  thoughts about them?
 
  Hi Nick,
 
  flashcache doesn’t support barriers, so I haven’t even considered
  it. I used it a few years ago to speed up some workloads out of
  curiosity and it worked well, but I can’t use it to cache this kind of
 workload.
 
  EnhanceIO passed my initial testing (although the documentation is
  very sketchy and the project seems abandoned AFAIK), and is supposed to
  respect barriers/flushes. I was only interested in a “volatile
  cache” scenario - create a ramdisk in the guest (for example 1GB)
  and use it to cache the virtual block device (and of course flush
  and remove it before rebooting). All worked pretty well during my
  testing with fio & stuff until I ran the actual workload - in my
  case a DB2 9.7 database. It took just minutes for the kernel to
  panic (I can share a screenshot if you’d like). So it was not a host
  failure but a guest failure and it managed to fail on two fronts -
  stability and crash consistency - at the same time. The filesystem
  was completely broken afterwards - while it could be mounted
  “cleanly” (journal appeared consistent), there was massive damage to
  the files. I expected the open files to be zeroed or missing or
  damaged, but it did very random damage all over the place including
 binaries in /bin, manpage files and so on - things that nobody was even
 touching. Scary.
 
  I see, so just to confirm you don't want to use a caching solution
  with an SSD, just a ram disk? I think that’s where our approaches
  differed and I can understand why you are probably having problems when
  the OS crashes or suffers powerloss. I was going the SSD route, with
  something like:-
 
 
  This actually proves that EnhanceIO doesn’t really respect barriers, at 
  least
 not when flushing blocks to the underlying device.
  To be fair, maybe using a (mirrored!) SSD makes it crash-consistent, maybe
 it has an internal journal and just replays whatever is in cache - I will not 
 read
 the source code to confirm that because to me that’s clearly not what I need.
 
 FWIW, I think both dm-cache and bcache properly respect barriers, though I
 haven't read through the source.
 
 
 
  http://www.storagereview.com/hgst_ultrastar_ssd800mm_enterprise_ssd_review
 
  On my iSCSI head nodes, but if you are exporting RBD's to lots of different
 servers I guess this wouldn't work quite the same.
 
 
  Exactly. If you want to maintain elasticity, want to be able to migrate
 instances freely, then using any local storage is a no-go.
 
  I don't really see a solution that could work for you without using SSD's 
  for
 the cache. You seem to be suffering from slow sync writes and want to cache
 them in a volatile medium, but by their very nature sync writes are meant to
 be safe once the write confirmation arrives. I guess in any caching solution
 barriers go some length to help guard against  data corruption but if properly
 implemented they will probably also slow the speed down to what you can
 achieve with just RBD caching. Much like Hardware Raid Controllers, they
 only enable writeback cache if they can guarantee data security, either by a
 functioning battery backup or flash device.
 
 
  You are right. Sync writes and barriers are supposed to be flushed to
 physical medium when returning (though in practice lots of RAID controllers
 and _all_ arrays will lie about that, slightly breaking the spec but still 
 being
 safe if you don’t let the battery die).
  I don’t want to lose crash consistency, but I don’t need to have the latest
 completed transaction flushed to the disk - I don’t care if power outage
 wipes the last 1 minute of records from the database even though they were
  “committed” by the database and should thus be flushed to disk, and I don’t
 think too many people care either as long as it’s fast.
  Of course, this won’t work for everyone and in that respect the current rbd
 cache behaviour is 100% correct.

Another potential option which honours barriers

http://www.lessfs.com/wordpress/?p=699

But I still don't see how you are going to differentiate between when you want 
to flush and when you don't. I might still be misunderstanding what you want to 
achieve. But it seems you want to honour barriers, but only once every minute, 
the rest

Re: [ceph-users] Synchronous writes - tuning and some thoughts about them?

2015-06-01 Thread Jan Schermer
Thanks, that’s it exactly.
But I think that’s really too much work for now, that’s why I really would like 
to see a quick-win by using the local RBD cache for now - that would suffice 
for most workloads (not too many people run big databases on CEPH now, those 
who do must be aware of this).

The issue is - and I have not yet seen an answer to that - would it be safe as 
it is now if the flushes were ignored (rbd cache = unsafe) or will it 
completely b0rk the filesystem when not flushed properly?

Jan

 On 01 Jun 2015, at 12:37, Nick Fisk n...@fisk.me.uk wrote:
 
 Hi Mark, I think the real problem is that even tuning Ceph to the max, it is 
 still potentially 100x slower than a hardware RAID card for doing these very 
 important sync writes. Especially with DBs that have been designed to rely on 
 the fact that they can submit a long chain of very small IOs, without some sort 
 of cache sitting at the front of the whole Ceph infrastructure (journals and 
 cache tiering are too far back), Ceph just doesn't provide the required 
 latency. I know it would be quite a large piece of work, but 
 implementing some sort of distributed cache with a very low overhead that 
 could plumb directly into librbd would dramatically improve performance, 
 especially in a lot of enterprise workloads.


Re: [ceph-users] Synchronous writes - tuning and some thoughts about them?

2015-05-27 Thread Nick Fisk
Hi Jan,

Responses inline below

 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
 Jan Schermer
 Sent: 25 May 2015 21:14
 To: Nick Fisk
 Cc: ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] Synchronous writes - tuning and some thoughts
 about them?
 
 Hi Nick,
 
 flashcache doesn’t support barriers, so I haven’t even considered it. I used it a
 few years ago to speed up some workloads out of curiosity and it worked
 well, but I can’t use it to cache this kind of workload.
 
 EnhanceIO passed my initial testing (although the documentation is very
 sketchy and the project seems abandoned AFAIK), and is supposed to respect
 barriers/flushes. I was only interested in a “volatile cache” scenario - 
 create a
 ramdisk in the guest (for example 1GB) and use it to cache the virtual block
 device (and of course flush and remove it before rebooting). All worked
 pretty well during my testing with fio & stuff until I ran the actual
 workload -
 in my case a DB2 9.7 database. It took just minutes for the kernel to panic (I
 can share a screenshot if you’d like). So it was not a host failure but a 
 guest
 failure and it managed to fail on two fronts - stability and crash 
 consistency -
 at the same time. The filesystem was completely broken afterwards - while it
 could be mounted “cleanly” (journal appeared consistent), there was
 massive damage to the files. I expected the open files to be zeroed or
 missing or damaged, but it did very random damage all over the place
 including binaries in /bin, manpage files and so on - things that nobody was
 even touching. Scary.

I see, so just to confirm you don't want to use a caching solution with an SSD, 
just a ram disk? I think that’s where our approaches differed and I can
understand why you are probably having problems when the OS crashes or suffers 
powerloss. I was going the SSD route, with something like:-

http://www.storagereview.com/hgst_ultrastar_ssd800mm_enterprise_ssd_review

On my iSCSI head nodes, but if you are exporting RBD's to lots of different 
servers I guess this wouldn't work quite the same.

I don't really see a solution that could work for you without using SSD's for 
the cache. You seem to be suffering from slow sync writes and want to cache 
them in a volatile medium, but by their very nature sync writes are meant to be 
safe once the write confirmation arrives. I guess in any caching solution 
barriers go some length to help guard against  data corruption but if properly 
implemented they will probably also slow the speed down to what you can achieve 
with just RBD caching. Much like Hardware Raid Controllers, they only enable 
writeback cache if they can guarantee data security, either by a functioning 
battery backup or flash device.


 
 I don’t really understand your question about flashcache - do you run it in
 writeback mode? It’s been years since I used it so I won’t be much help here
 - I disregarded it as unsafe right away because of barriers and wouldn’t use 
 it
 in production.
 

What I mean is that for every IO that passed through flashcache, I see it write 
to the SSD with no delay/buffering. So from a Kernel Panic/Powerloss situation, 
as long as the SSD has powerloss caps and the flashcache device is assembled 
correctly before mounting, I don't see a way for data to be lost. Although I
haven't done a lot of testing around this yet, so I could be missing something.
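
For what it's worth, by assembled correctly I just mean reloading the existing
writeback cache at boot rather than recreating it - roughly like this, with
placeholder device names (I haven't verified this end to end):

    # reload the dirty blocks sitting on the SSD; running flashcache_create
    # again instead would throw them away
    flashcache_load /dev/sdb cachedev
    mount /dev/mapper/cachedev /srv/data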

 I don’t think a persistent cache is something to do right now, it would be
 overly complex to implement, it would limit migration, and it can be done on
 the guest side with (for example) bcache if really needed - you can always
 expose a local LVM volume to the guest and use it for caching (and that’s
 something I might end up doing) with mostly the same effect.
 For most people (and that’s my educated guess) the only needed features
 are that it needs to be fast(-er) and it needs to come up again after a crash
 without recovering from backup - that’s something that could be just a slight
 modification to the existing RBD cache - just don’t flush it on every fsync()
 but maintain ordering - and it’s done? I imagine some ordering is there
 already, it must be flushed when the guest is migrated, and it’s production-
 grade and not just some hackish attempt. It just doesn’t really cache the
 stuff that matters most in my scenario…

My initial idea was just to be able to specify a block device to use for 
writeback caching in librbd. This could either be a local block device (dual 
port SAS for failover/clustering) or an iSCSI device if it needs to be shared
around a larger cluster of hypervisors...etc

Ideally though this would all be managed through Ceph with some sort of 
OSD-lite device which is optimized for sync writes but misses out on a lot of 
the distributed functionality of a full fat OSD. This way you could create a 
writeback pool and then just specify it in the librbd config.

 
 I wonder if cache=unsafe does what

Re: [ceph-users] Synchronous writes - tuning and some thoughts about them?

2015-05-27 Thread Mark Nelson

On 05/27/2015 09:33 AM, Jan Schermer wrote:

Hi Nick,
responses inline, again ;-)

Thanks

Jan


On 27 May 2015, at 12:29, Nick Fisk n...@fisk.me.uk wrote:

Hi Jan,

Responses inline below


-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
Jan Schermer
Sent: 25 May 2015 21:14
To: Nick Fisk
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Synchronous writes - tuning and some thoughts
about them?

Hi Nick,

flashcache doesn’t support barriers, so I haven’t even considered it. I used it a
few years ago to speed up some workloads out of curiosity and it worked
well, but I can’t use it to cache this kind of workload.

EnhanceIO passed my initial testing (although the documentation is very
sketchy and the project seems abandoned AFAIK), and is supposed to respect
barriers/flushes. I was only interested in a “volatile cache” scenario - create 
a
ramdisk in the guest (for example 1GB) and use it to cache the virtual block
device (and of course flush and remove it before rebooting). All worked
pretty well during my testing with fio & stuff until I ran the actual workload -
in my case a DB2 9.7 database. It took just minutes for the kernel to panic (I
can share a screenshot if you’d like). So it was not a host failure but a guest
failure and it managed to fail on two fronts - stability and crash consistency -
at the same time. The filesystem was completely broken afterwards - while it
could be mounted “cleanly” (journal appeared consistent), there was
massive damage to the files. I expected the open files to be zeroed or
missing or damaged, but it did very random damage all over the place
including binaries in /bin, manpage files and so on - things that nobody was
even touching. Scary.


I see, so just to confirm you don't want to use a caching solution with an SSD, 
just a ram disk? I think that’s where our approaches differed and I can
understand why you are probably having problems when the OS crashes or suffers 
powerloss. I was going the SSD route, with something like:-



This actually proves that EnhanceIO doesn’t really respect barriers, at least 
not when flushing blocks to the underlying device.
To be fair, maybe using a (mirrored!) SSD makes it crash-consistent, maybe it 
has an internal journal and just replays whatever is in cache - I will not read 
the source code to confirm that because to me that’s clearly not what I need.


FWIW, I think both dm-cache and bcache properly respect barriers, though 
I haven't read through the source.





http://www.storagereview.com/hgst_ultrastar_ssd800mm_enterprise_ssd_review

On my iSCSI head nodes, but if you are exporting RBD's to lots of different 
servers I guess this wouldn't work quite the same.



Exactly. If you want to maintain elasticity, want to be able to migrate 
instances freely, then using any local storage is a no-go.


I don't really see a solution that could work for you without using SSD's for 
the cache. You seem to be suffering from slow sync writes and want to cache 
them in a volatile medium, but by their very nature sync writes are meant to be 
safe once the write confirmation arrives. I guess in any caching solution 
barriers go some length to help guard against  data corruption but if properly 
implemented they will probably also slow the speed down to what you can achieve 
with just RBD caching. Much like Hardware Raid Controllers, they only enable 
writeback cache if they can guarantee data security, either by a functioning 
battery backup or flash device.



You are right. Sync writes and barriers are supposed to be flushed to physical 
medium when returning (though in practice lots of RAID controllers and _all_ 
arrays will lie about that, slightly breaking the spec but still being safe if 
you don’t let the battery die).
I don’t want to lose crash consistency, but I don’t need to have the latest 
completed transaction flushed to the disk - I don’t care if power outage wipes 
the last 1 minute of records from the database even though they were “committed”
by the database and should thus be flushed to disk, and I don’t think too many
people care either as long as it’s fast.
Of course, this won’t work for everyone and in that respect the current rbd 
cache behaviour is 100% correct.
And of course it won’t solve all problems - if you have an underlying device 
that can do 200 IOPS but your workload needs 300 IOPS at all times, then 
caching the writes is a bit futile - it may help for a few seconds and then it 
gets back to 200 IOPS at best. It might, however, help if you rewrite the same
blocks again and again, incrementing a counter or updating one set of data -
there it will just update the dirty block in cache and flush it from time to 
time. It can also turn some random-io into sequential-io, coalescing adjacent 
blocks into one re/write or journaling it in some way (CEPH journal does 
exactly this I think).






I don’t really understand your question about flashcache - do you run

Re: [ceph-users] Synchronous writes - tuning and some thoughts about them?

2015-05-27 Thread Jan Schermer
Hi Nick,
responses inline, again ;-)

Thanks

Jan

 On 27 May 2015, at 12:29, Nick Fisk n...@fisk.me.uk wrote:
 
 Hi Jan,
 
 Responses inline below
 
 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
 Jan Schermer
 Sent: 25 May 2015 21:14
 To: Nick Fisk
 Cc: ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] Synchronous writes - tuning and some thoughts
 about them?
 
 Hi Nick,
 
 flashcache doesn’t support barriers, so I haven’t even considered it. I used it a
 few years ago to speed up some workloads out of curiosity and it worked
 well, but I can’t use it to cache this kind of workload.
 
 EnhanceIO passed my initial testing (although the documentation is very
 sketchy and the project seems abandoned AFAIK), and is supposed to respect
 barriers/flushes. I was only interested in a “volatile cache” scenario - 
 create a
 ramdisk in the guest (for example 1GB) and use it to cache the virtual block
 device (and of course flush and remove it before rebooting). All worked
 pretty well during my testing with fio & stuff until I ran the actual
 workload -
 in my case a DB2 9.7 database. It took just minutes for the kernel to panic 
 (I
 can share a screenshot if you’d like). So it was not a host failure but a 
 guest
 failure and it managed to fail on two fronts - stability and crash 
 consistency -
 at the same time. The filesystem was completely broken afterwards - while it
 could be mounted “cleanly” (journal appeared consistent), there was
 massive damage to the files. I expected the open files to be zeroed or
 missing or damaged, but it did very random damage all over the place
 including binaries in /bin, manpage files and so on - things that nobody was
 even touching. Scary.
 
 I see, so just to confirm you don't want to use a caching solution with an 
 SSD, just a ram disk? I think that’s where our approaches differed and I can
 understand why you are probably having problems when the OS crashes or 
 suffers powerloss. I was going the SSD route, with something like:-
 

This actually proves that EnhanceIO doesn’t really respect barriers, at least 
not when flushing blocks to the underlying device.
To be fair, maybe using a (mirrored!) SSD makes it crash-consistent, maybe it 
has an internal journal and just replays whatever is in cache - I will not read 
the source code to confirm that because to me that’s clearly not what I need.

 http://www.storagereview.com/hgst_ultrastar_ssd800mm_enterprise_ssd_review
 
 On my iSCSI head nodes, but if you are exporting RBD's to lots of different 
 servers I guess this wouldn't work quite the same.
 

Exactly. If you want to maintain elasticity, want to be able to migrate 
instances freely, then using any local storage is a no-go.

 I don't really see a solution that could work for you without using SSD's for 
 the cache. You seem to be suffering from slow sync writes and want to cache 
 them in a volatile medium, but by their very nature sync writes are meant to 
 be safe once the write confirmation arrives. I guess in any caching solution 
 barriers go some length to help guard against  data corruption but if 
 properly implemented they will probably also slow the speed down to what you 
 can achieve with just RBD caching. Much like Hardware Raid Controllers, they 
 only enable writeback cache if they can guarantee data security, either by a 
 functioning battery backup or flash device.
 

You are right. Sync writes and barriers are supposed to be flushed to physical 
medium when returning (though in practice lots of RAID controllers and _all_ 
arrays will lie about that, slightly breaking the spec but still being safe if 
you don’t let the battery die).
I don’t want to lose crash consistency, but I don’t need to have the latest 
completed transaction flushed to the disk - I don’t care if power outage wipes 
the last 1 minute of records from the database even though they were “committed”
by the database and should thus be flushed to disk, and I don’t think too many
people care either as long as it’s fast.
Of course, this won’t work for everyone and in that respect the current rbd 
cache behaviour is 100% correct.
And of course it won’t solve all problems - if you have an underlying device 
that can do 200 IOPS but your workload needs 300 IOPS at all times, then 
caching the writes is a bit futile - it may help for a few seconds and then it 
gets back to 200 IOPS at best. It might, however, help if you rewrite the same
blocks again and again, incrementing a counter or updating one set of data -
there it will just update the dirty block in cache and flush it from time to 
time. It can also turn some random-io into sequential-io, coalescing adjacent 
blocks into one re/write or journaling it in some way (CEPH journal does 
exactly this I think).


 
 
 I don’t really understand your question about flashcache - do you run it in
 writeback mode? It’s been years since I used it so I won’t be much help here
 - I disregarded

[ceph-users] Synchronous writes - tuning and some thoughts about them?

2015-05-25 Thread Jan Schermer
Hi,
I have a full-ssd cluster on my hands, currently running Dumpling, with plans 
to upgrade soon, and OpenStack with RBD on top of that. While I am overall
quite happy with the performance (scales well across clients), there is one
area where it really fails badly - big database workloads.

Typically, what a well-behaved database does is commit to disk every 
transaction before confirming it, so on a “typical” cluster with a write 
latency of 5ms (with SSD journal) the maximum number of transactions per second 
for a single client is 200 (likely more like 100 depending on the filesystem). 
Now, that’s not _too_ bad when running hundreds of small databases, but it’s 
nowhere near the required performance to substitute for an existing SAN or even just
a simple RAID array with writeback cache.
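
(That ceiling is easy to reproduce with fio - a single-threaded 4k write job with an
fsync after every write is a reasonable stand-in for a database commit; the path and
size below are just examples:

    fio --name=txlog --filename=/mnt/db/fio.test --size=512m \
        --rw=write --bs=4k --ioengine=sync --fsync=1 \
        --runtime=60 --time_based

The IOPS that job reports is roughly the per-client transactions-per-second limit.)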

First hope was that enabling RBD cache will help - but it really doesn’t 
because all the flushes (O_DIRECT writes) end on the drives and not in the 
cache. Disabling barriers in the client helps, but that makes it not crash 
consistent (unless one uses ext4 with journal_checksum etc., I am going to test 
that soon).

Are there any plans to change this behaviour - i.e. make the cache a real 
writeback cache?

I know there are good reasons not to do this, and I commend the developers for 
designing the cache this way, but real world workloads demand shortcuts from 
time to time - for example MySQL with its InnoDB engine has an option to only 
commit to disk every Nth transaction - and this is exactly the kind of thing 
I’m looking for. Not having every confirmed transaction/write on the disk is 
not a huge problem, having a b0rked filesystem is, so this should be safe as 
long as I/O order is preserved. Sadly, my database is not an InnoDB where I can 
tune something, but an enterprise behemoth that traditionally runs on FC 
arrays, it has no parallelism (that I could find), and always uses O_DIRECT for 
txlog.

(For the record - while the array is able to swallow 30K IOps for a minute, 
once the cache is full it slows to ~3 IOps, while CEPH happily gives the same 
200 IOps forever, bottom line is you always need more disks or more cache, and 
your workload should always be able to run without the cache anyway  - even 
enterprise arrays fail, and write cache is not always available, contrary to 
popular belief).

Is there some option that we could use right now to turn on a true writeback 
caching? Losing a few transactions is fine as long as ordering is preserved.
I was thinking “cache=unsafe” but I have no idea whether I/O order is preserved 
with that.
I already mentioned turning off barriers, which could be safe in some setups 
but needs testing.
Upgrading from Dumpling will probably help with scaling, but will it help write 
latency? I would need to get from 5ms/write to 1ms/write.
I investigated guest-side caching (enhanceio/flashcache) but that fails really 
badly when the guest or host crashes - lots of corruption. EnhanceIO in
particular looked very nice and claims to respect barriers… not in my 
experience, though.

It might seem that what I want is evil, and it really is if you’re running a 
banking database, but for most people this is exactly what is missing to make 
their workloads run without having some sort of 80s SAN system in their 
datacentre, I think everyone here would appreciate that :-)

Thanks

Jan


Re: [ceph-users] Synchronous writes - tuning and some thoughts about them?

2015-05-25 Thread Nick Fisk
Hi Jan,

I share your frustrations with slow sync writes. I'm exporting RBD's via iSCSI 
to ESX, which seems to do most operations in 64k sync IO's. You can do a fio 
run and impress yourself with the numbers that you can get out of the cluster, 
but this doesn't translate into what you can achieve when using sync writes 
with a client.

I too have been experimenting with flashcache/enhanceio, with the goal of using
Dual Port SAS SSD's to allow for HA iSCSI gateways. Currently I'm just testing 
with a single iSCSI server and see a massive improvement. I'm interested in the 
corruptions you have been experiencing on host crashes, are you implying that 
you think flashcache is buffering writes before submitting them to the SSD? 
When watching its behaviour using iostat it looks like it submits everything in 
4k IO's to the SSD which to me looks like it is not buffering.
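
(Something along these lines is what I'm watching - the cache SSD's name is a
placeholder, and 4k requests show up as avgrq-sz = 8 since iostat reports it in
512-byte sectors:

    iostat -dxm sdb 1
)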

I did raise a topic a few months back asking about the possibility of librbd 
supporting persistent caching to SSD's, which would allow write back caching 
regardless of whether the client requests a flush. Although there was some interest in
the idea, I didn't get the feeling it would be at the top of anyone's 
priorities.

Nick

 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
 Jan Schermer
 Sent: 25 May 2015 09:59
 To: ceph-users@lists.ceph.com
 Subject: [ceph-users] Synchronous writes - tuning and some thoughts about
 them?
 
 Hi,
 I have a full-ssd cluster on my hands, currently running Dumpling, with plans
 to upgrade soon, and OpenStack with RBD on top of that. While I am overall
 quite happy with the performance (scales well across clients), there is one
 area where it really fails badly - big database workloads.
 
 Typically, what a well-behaved database does is commit to disk every
 transaction before confirming it, so on a “typical” cluster with a write 
 latency
 of 5ms (with SSD journal) the maximum number of transactions per second
 for a single client is 200 (likely more like 100 depending on the filesystem).
 Now, that’s not _too_ bad when running hundreds of small databases, but
 it’s nowhere near the required performance to substitute for an existing SAN or
 even just a simple RAID array with writeback cache.
 
 First hope was that enabling RBD cache will help - but it really doesn’t
 because all the flushes (O_DIRECT writes) end on the drives and not in the
 cache. Disabling barriers in the client helps, but that makes it not crash
 consistent (unless one uses ext4 with journal_checksum etc., I am going to
 test that soon).
 
 Are there any plans to change this behaviour - i.e. make the cache a real
 writeback cache?
 
 I know there are good reasons not to do this, and I commend the developers
 for designing the cache this way, but real world workloads demand shortcuts
 from time to time - for example MySQL with its InnoDB engine has an option
 to only commit to disk every Nth transaction - and this is exactly the kind of
 thing I’m looking for. Not having every confirmed transaction/write on the
 disk is not a huge problem, having a b0rked filesystem is, so this should be
 safe as long as I/O order is preserved. Sadly, my database is not an InnoDB
 where I can tune something, but an enterprise behemoth that traditionally
 runs on FC arrays, it has no parallelism (that I could find), and always uses
 O_DIRECT for txlog.
 
 (For the record - while the array is able to swallow 30K IOps for a minute,
 once the cache is full it slows to ~3 IOps, while CEPH happily gives the same
 200 IOps forever, bottom line is you always need more disks or more cache,
 and your workload should always be able to run without the cache anyway  -
 even enterprise arrays fail, and write cache is not always available, contrary
 to popular belief).
 
 Is there some option that we could use right now to turn on a true writeback
 caching? Losing a few transactions is fine as long as ordering is preserved.
 I was thinking “cache=unsafe” but I have no idea whether I/O order is
 preserved with that.
 I already mentioned turning off barriers, which could be safe in some setups
 but needs testing.
 Upgrading from Dumpling will probably help with scaling, but will it help 
 write
 latency? I would need to get from 5ms/write to 1ms/write.
 I investigated guest-side caching (enhanceio/flashcache) but that fails really
 badly when the guest or host crashes - lots of corruption. EnhanceIO in
 particular looked very nice and claims to respect barriers… not in my
 experience, though.
 
 It might seem that what I want is evil, and it really is if you’re running a
 banking database, but for most people this is exactly what is missing to make
 their workloads run without having some sort of 80s SAN system in their
 datacentre, I think everyone here would appreciate that :-)
 
 Thanks
 
 Jan

Re: [ceph-users] Synchronous writes - tuning and some thoughts about them?

2015-05-25 Thread Jan Schermer
Hi Nick,

flashcache doesn’t support barriers, so I haven’t even considered it. I used it a
few years ago to speed up some workloads out of curiosity and it worked well, 
but I can’t use it to cache this kind of workload.

EnhanceIO passed my initial testing (although the documentation is very sketchy 
and the project seems abandoned AFAIK), and is supposed to respect barriers/flushes.
I was only interested in a “volatile cache” scenario - create a ramdisk in the 
guest (for example 1GB) and use it to cache the virtual block device (and of 
course flush and remove it before rebooting). All worked pretty well during my 
testing with fio & stuff until I ran the actual workload - in my case a DB2 9.7
database. It took just minutes for the kernel to panic (I can share a 
screenshot if you’d like). So it was not a host failure but a guest failure and 
it managed to fail on two fronts - stability and crash consistency - at the 
same time. The filesystem was completely broken afterwards - while it could be 
mounted “cleanly” (journal appeared consistent), there was massive damage to 
the files. I expected the open files to be zeroed or missing or damaged, but it 
did very random damage all over the place including binaries in /bin, manpage
files and so on - things that nobody was even touching. Scary.

I don’t really understand your question about flashcache - do you run it in 
writeback mode? It’s been years since I used it so I won’t be much help here - 
I disregarded it as unsafe right away because of barriers and wouldn’t use it 
in production.

I don’t think a persistent cache is something to do right now, it would be 
overly complex to implement, it would limit migration, and it can be done on 
the guest side with (for example) bcache if really needed - you can always 
expose a local LVM volume to the guest and use it for caching (and that’s 
something I might end up doing) with mostly the same effect.
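
(Roughly like this inside the guest, assuming bcache-tools, with vdb the RBD-backed
disk and vdc the local LVM volume passed through - untested on my side, just to
illustrate the idea:

    make-bcache -B /dev/vdb                  # backing device (the RBD disk)
    make-bcache -C /dev/vdc                  # cache device (local LVM volume)
    bcache-super-show /dev/vdc | grep cset   # note the cache set UUID
    echo <cset-uuid> > /sys/block/bcache0/bcache/attach
    echo writeback > /sys/block/bcache0/bcache/cache_mode
)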
For most people (and that’s my educated guess) the only needed features are 
that it needs to be fast(-er) and it needs to come up again after a crash 
without recovering from backup - that’s something that could be just a slight
modification to the existing RBD cache - just don’t flush it on every fsync() 
but maintain ordering - and it’s done? I imagine some ordering is there 
already, it must be flushed when the guest is migrated, and it’s 
production-grade and not just some hackish attempt. It just doesn’t really 
cache the stuff that matters most in my scenario…

I wonder if cache=unsafe does what I want, but it’s hard to test that 
assumption unless something catastrophic happens like it did with EIO…

Jan

 On 25 May 2015, at 19:58, Nick Fisk n...@fisk.me.uk wrote:
 
 Hi Jan,
 
 I share your frustrations with slow sync writes. I'm exporting RBD's via 
 iSCSI to ESX, which seems to do most operations in 64k sync IO's. You can do 
 a fio run and impress yourself with the numbers that you can get out of the 
 cluster, but this doesn't translate into what you can achieve when using sync 
 writes with a client.
 
 I too have been experimenting with flashcache/enhanceio, with the goal of using
 Dual Port SAS SSD's to allow for HA iSCSI gateways. Currently I'm just 
 testing with a single iSCSI server and see a massive improvement. I'm 
 interested in the corruptions you have been experiencing on host crashes - are
 you implying that you think flashcache is buffering writes before submitting 
 them to the SSD? When watching its behaviour using iostat it looks like it 
 submits everything in 4k IO's to the SSD which to me looks like it is not 
 buffering.
 
 I did raise a topic a few months back asking about the possibility of librbd 
 supporting persistent caching to SSD's, which would allow write back caching 
 regardless of whether the client requests a flush. Although there was some interest
 in the idea, I didn't get the feeling it would be at the top of anyone's 
 priorities.
 
 Nick
 
 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
 Jan Schermer
 Sent: 25 May 2015 09:59
 To: ceph-users@lists.ceph.com
 Subject: [ceph-users] Synchronous writes - tuning and some thoughts about
 them?
 
 Hi,
 I have a full-ssd cluster on my hands, currently running Dumpling, with plans
 to upgrade soon, and OpenStack with RBD on top of that. While I am overall
 quite happy with the performance (scales well across clients), there is one
 area where it really fails badly - big database workloads.
 
 Typically, what a well-behaved database does is commit to disk every
 transaction before confirming it, so on a “typical” cluster with a write 
 latency
 of 5ms (with SSD journal) the maximum number of transactions per second
 for a single client is 200 (likely more like 100 depending on the 
 filesystem).
 Now, that’s not _too_ bad when running hundreds of small databases, but
 it’s nowhere near the required performance to substitute for an existing SAN or
 even just a simple RAID array