Re: [ceph-users] Local SSD cache for ceph on each compute node.
> Let us know how it works out for you, I know that they are very interested
> in making sure things are useful :)

Hi Ric,

I have given it another test and unfortunately it seems it's still not giving the improvements that I was expecting.
Here is a rough description of my test:

- 10GB RBD
- 1GB ZRAM kernel device for cache (testing only)

dmsetup status:

0 20971520 cache 8 106/4096 64 32768/32768 2492 1239 349993 113194 47157 47157 0 1 writeback 2 migration_threshold 8192 mq 10 random_threshold 0 sequential_threshold 0 discard_promote_adjustment 1 read_promote_adjustment 4 write_promote_adjustment 0 rw -

I'm then running a direct-IO 64kb sequential write QD=1 bench with fio to the DM device. What I expected to happen was for this sequential stream of 64kb IOs to be coalesced into 4MB IOs and written out to the RBD at as high a queue depth as possible/required, effectively meaning my 64kb sequential bandwidth should match the 4MB sequential bandwidth limit of my cluster. I'm more interested in replicating the behaviour of a write cache on a battery-backed RAID card than a read/write SSD cache, if that makes sense. An example real-life scenario would be sitting underneath an iSCSI target; something like ESXi generates that IO pattern when moving VMs between datastores.

What I was seeing is that I get a sudden burst of speed at the start of the fio test, but then it quickly drops down to the speed of the underlying RBD device. The dirty blocks counter never seems to go too high, so I don't think it's a cache-full problem. The counter is probably no more than about 40% when the slowdown starts, and then it drops to less than 10% for the remainder of the test as it crawls along. It feels like it hits some sort of throttle and never recovers.

I've done similar tests with flashcache and it gets more stable performance over a longer period of time, but the associative hit-set behaviour seems to cause write misses due to the sequential IO pattern, which limits overall top performance.

Nick

> ric

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
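For anyone wanting to watch the occupancy and dirty counters during a run like this, the status line above can be split apart programmatically. This is a small sketch (in Python rather than shell, purely for illustration); the field order follows the dm-cache status format documented in the kernel's device-mapper cache documentation:

```python
# Parse a dm-cache "dmsetup status" line. Field layout after the "cache"
# target name: <metadata block size> <used>/<total metadata blocks>
# <cache block size> <used>/<total cache blocks> <read hits> <read misses>
# <write hits> <write misses> <demotions> <promotions> <dirty> <features...>
status = ("0 20971520 cache 8 106/4096 64 32768/32768 2492 1239 349993 "
          "113194 47157 47157 0 1 writeback 2 migration_threshold 8192 "
          "mq 10 random_threshold 0 sequential_threshold 0 "
          "discard_promote_adjustment 1 read_promote_adjustment 4 "
          "write_promote_adjustment 0 rw -")

def parse_cache_status(line):
    f = line.split()
    assert f[2] == "cache", "not a dm-cache target"
    used, total = map(int, f[6].split("/"))
    return {
        "cache_block_sectors": int(f[5]),
        "cache_used_pct": 100.0 * used / total,
        "read_hits": int(f[7]), "read_misses": int(f[8]),
        "write_hits": int(f[9]), "write_misses": int(f[10]),
        "demotions": int(f[11]), "promotions": int(f[12]),
        "dirty_blocks": int(f[13]),
    }

s = parse_cache_status(status)
print(s["cache_used_pct"], s["dirty_blocks"])  # → 100.0 0
```

Note what the line above actually shows at this snapshot: the cache is 100% allocated (32768/32768 blocks), demotions equal promotions (47157), and the dirty count is 0 — consistent with the cache constantly churning blocks rather than being full of un-destaged writes.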
Re: [ceph-users] Local SSD cache for ceph on each compute node.
On 03/29/2016 04:53 PM, Nick Fisk wrote:
> Thanks for the heads up, just from a quick flick through I can see there
> are now separate read and write promotion thresholds, so I can see just
> from that it would be a lot more suitable for what I intended. I might
> try and find some time to give it another test.
>
> Nick

Let us know how it works out for you, I know that they are very interested in making sure things are useful :)

ric
Re: [ceph-users] Local SSD cache for ceph on each compute node.
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Ric Wheeler
> Sent: 29 March 2016 14:40
> To: Nick Fisk ; 'Sage Weil'
> Cc: ceph-users@lists.ceph.com; device-mapper development <de...@redhat.com>
> Subject: Re: [ceph-users] Local SSD cache for ceph on each compute node.
>
> On 03/29/2016 04:35 PM, Nick Fisk wrote:
> > One thing I picked up on when looking at dm-cache for doing caching
> > with RBDs is that it wasn't really designed to be used as a writeback
> > cache for new writes [...]
>
> Some of the caching policies have changed recently as the team has looked
> at different workloads.
>
> Happy to introduce you to them if you want to discuss offline or post
> comments over on their list: device-mapper development <de...@redhat.com>
>
> thanks!
>
> Ric

Hi Ric,

Thanks for the heads up, just from a quick flick through I can see there are now separate read and write promotion thresholds, so I can see just from that it would be a lot more suitable for what I intended. I might try and find some time to give it another test.
Nick
Re: [ceph-users] Local SSD cache for ceph on each compute node.
On 03/29/2016 04:35 PM, Nick Fisk wrote:
> One thing I picked up on when looking at dm-cache for doing caching with
> RBDs is that it wasn't really designed to be used as a writeback cache
> for new writes, in the way you would expect a traditional writeback
> cache to work. It seems all the policies are designed around the idea
> that writes go to cache only if the block is already in the cache
> (through reads) or it's hot enough to promote. Although there did seem
> to be some tunables to alter this behaviour, posts on the mailing list
> seemed to suggest this wasn't how it was designed to be used. I'm not
> sure if this has been addressed since I last looked at it though.
>
> Depending on whether you are trying to accelerate all writes, or just
> your "hot" blocks, this may or may not matter. Even <1GB local caches
> can make a huge difference to sync writes.

Hi Nick,

Some of the caching policies have changed recently as the team has looked at different workloads.

Happy to introduce you to them if you want to discuss offline or post comments over on their list: device-mapper development

thanks!

Ric
Re: [ceph-users] Local SSD cache for ceph on each compute node.
On 03/29/2016 03:42 PM, Sage Weil wrote:
> On Tue, 29 Mar 2016, Ric Wheeler wrote:
>>> However, if the write cache would be "flushed in-order" to Ceph you
>>> would just lose x seconds of data and, hopefully, not have a corrupted
>>> disk. That could be acceptable for some people. I was just stressing
>>> that that isn't the case.
>>
>> This in-order assumption - speaking as someone who has a long history
>> in kernel file and storage - is the wrong assumption.
>>
>> Don't think of the cache device and RBD as separate devices; once they
>> are configured like this, they are the same device from the point of
>> view of the file system (or whatever) that runs on top of them.
>>
>> The cache and its caching policy can vary, but it is perfectly
>> reasonable to have data live only in that caching layer pretty much
>> forever. Local disk caches can also do this by the way :)
>
> That's true for current caching devices like dm-cache, but does not need
> to be true--and I think that's what Robert is getting at. The plan for
> RBD, for example, is to implement a client-side cache that has an
> ordered writeback policy, similar to the one described in this paper:
>
> https://www.usenix.org/conference/fast13/technical-sessions/presentation/koller
>
> In that scenario, loss of the cache device leaves you with a stale but
> crash-consistent image on the base device.

Certainly if we design a distributed-system-aware caching layer, things can be different.

I think that the trade-offs for using a cache local to the client are certainly a bit confusing for normal users, but they are pretty popular these days even given the limits. Enterprise storage systems (big EMC/Hitachi/etc arrays) have been effectively implemented as distributed systems for quite a long time, and these server-local caches are routinely used by clients for their virtual LUNs.

Adding a performance-boosting caching layer (especially for a virtual guest) to use a local SSD has a lot of upsides, I think, even if it does not solve the problem for the migrating-device case.
>> The whole flushing order argument is really not relevant here. I could
>> "flush in order" after a minute, a week or a year. If the cache is large
>> enough, you might have zero data land on the backing store (even if the
>> destage policy does it, as you suggest, in order).
>
> I think the assumption is you would place a bound on the amount of dirty
> data in the cache. Since you need to checkpoint cache content (on, say,
> flush boundaries), that roughly bounds the size of the cache by the
> amount of data written, even if it is repeatedly scribbling over the
> same blocks.

Keep in mind that the device mapper device is the device you see at the client - when you flush, you are flushing to it. It is designed as a cache for a local device; the fact that ceph rbd is under it (and has a distributed backend) is not really of interest in the current generation, at least.

It is perfectly legal to have data live only in the SSD, for example, and never land in the backing device.

How we bound and manage the cache and the life cycle of the data is something that Joe and the device mapper people have been actively working on.

I don't think that ordering alone is enough for any local linux file system. The promises made from the storage layer up to the kernel file system stack are basically that any transaction we commit (using synchronize_cache or similar mechanisms) is durable across a power failure.
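To make that contract concrete from the userspace side: the file system does not get (or assume) ordering between writes, only a guarantee that data is durable once an explicit flush returns. A minimal illustration (Python for brevity; the file name is made up):

```python
# Durability, not ordering, is the contract: data is only guaranteed to
# survive power loss once an explicit flush (fsync/fdatasync) returns --
# that call is what drives the cache-flush/FUA machinery described above.
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "journal.log")
fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
os.write(fd, b"commit record\n")   # may sit in volatile caches indefinitely
os.fdatasync(fd)                   # returns only once the stack reports it durable
os.close(fd)

with open(path, "rb") as f:
    print(f.read())   # → b'commit record\n'
```

Writes issued after the fdatasync() carry no ordering promise relative to each other until the next flush, which is exactly why "in-order destage" on its own is not something local file systems rely on.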
In effect, the reliability of the storage is at best as reliable as the least reliable part of the pair. The goal is to add a new choice on the spectrum between (1) all writes are replicated across the cluster in order to get a consistent and up-to-date image when the client+cache fail, and (2) a writeback that gives you fast writes but leaves you with a corrupted (and stale) image after such a failer. Ordered writeback (1.5) gives you low write latency and a stale but crash consistent image. I suspect this will be a sensible choice for a lot of different use cases and workloads. Writeback mode with dm-cache does not give you corruption after a reboot or power failure, etc. It only gives you problems when the physical card dies or we try to use the underlying device without the caching device present. The thing that makes dm-cache different than what this paper was designed for I think is that the dm-cache is durable across power outages. As l
Re: [ceph-users] Local SSD cache for ceph on each compute node.
On Tue, 29 Mar 2016, Ric Wheeler wrote:
>> However, if the write cache would be "flushed in-order" to Ceph you
>> would just lose x seconds of data and, hopefully, not have a corrupted
>> disk. That could be acceptable for some people. I was just stressing
>> that that isn't the case.
>
> This in-order assumption - speaking as someone who has a long history in
> kernel file and storage - is the wrong assumption.
>
> Don't think of the cache device and RBD as separate devices; once they
> are configured like this, they are the same device from the point of
> view of the file system (or whatever) that runs on top of them.
>
> The cache and its caching policy can vary, but it is perfectly
> reasonable to have data live only in that caching layer pretty much
> forever. Local disk caches can also do this by the way :)

That's true for current caching devices like dm-cache, but does not need to be true--and I think that's what Robert is getting at. The plan for RBD, for example, is to implement a client-side cache that has an ordered writeback policy, similar to the one described in this paper:

https://www.usenix.org/conference/fast13/technical-sessions/presentation/koller

In that scenario, loss of the cache device leaves you with a stale but crash-consistent image on the base device.

> The whole flushing order argument is really not relevant here. I could
> "flush in order" after a minute, a week or a year. If the cache is large
> enough, you might have zero data land on the backing store (even if the
> destage policy does it, as you suggest, in order).

I think the assumption is you would place a bound on the amount of dirty data in the cache. Since you need to checkpoint cache content (on, say, flush boundaries), that roughly bounds the size of the cache by the amount of data written, even if it is repeatedly scribbling over the same blocks.
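The checkpoint idea can be sketched in a few lines. This is a toy model, not the proposed RBD implementation: dirty writes are sealed into a checkpoint at each flush boundary, and the background writeback destages only whole checkpoints, oldest first, so losing the cache device leaves the backing image stale but never half-updated:

```python
# Toy model of ordered (checkpoint-based) writeback, in the spirit of the
# Koller et al. paper cited above. Names here are illustrative only.
class OrderedWritebackCache:
    def __init__(self, backing):
        self.backing = backing     # dict: block -> data already destaged
        self.current = {}          # writes accumulated since the last flush
        self.checkpoints = []      # ordered list of sealed write sets

    def write(self, block, data):
        self.current[block] = data

    def flush(self):
        # Flush boundary: seal the accumulated writes into a checkpoint.
        if self.current:
            self.checkpoints.append(self.current)
            self.current = {}

    def destage_one(self):
        # Background writeback: apply the OLDEST checkpoint atomically.
        if self.checkpoints:
            self.backing.update(self.checkpoints.pop(0))

backing = {}
cache = OrderedWritebackCache(backing)
cache.write(1, "journal")
cache.flush()                       # checkpoint A sealed
cache.write(2, "data")
cache.write(1, "commit")
cache.flush()                       # checkpoint B sealed
cache.destage_one()                 # only checkpoint A has reached backing
# If the cache device is lost here, the backing image holds a state the
# file system actually passed through (A), never a partial checkpoint B.
print(backing)   # → {1: 'journal'}
```

An unordered writeback cache, by contrast, might have destaged block 2 of checkpoint B but not block 1, producing a backing image the file system never saw - which is the corruption case discussed above.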
> That all said, the reason to use a write cache on top of a client block
> device - rbd or other - is to improve performance for the client.
>
> Any time we make our failure domain require fully operating two devices
> (the cache device and the original device), we increase the probability
> of a non-recoverable failure. In effect, the reliability of the storage
> is at best as reliable as the least reliable part of the pair.

The goal is to add a new choice on the spectrum between (1) all writes are replicated across the cluster in order to get a consistent and up-to-date image when the client+cache fail, and (2) a writeback that gives you fast writes but leaves you with a corrupted (and stale) image after such a failure. Ordered writeback (1.5) gives you low write latency and a stale but crash-consistent image. I suspect this will be a sensible choice for a lot of different use cases and workloads.

Is anything like this on the dm-cache roadmap as well? It's probably less useful when the cache device lives in the same host (compared to a client/cluster arrangement more typical of RBD, where a client host failure takes out the cache device but not the base image), but it might still be worth considering.

sage
Re: [ceph-users] Local SSD cache for ceph on each compute node.
On 03/29/2016 01:35 PM, Van Leeuwen, Robert wrote:
>>>> If you try to look at the rbd device under dm-cache from another
>>>> host, of course any data that was cached on the dm-cache layer will
>>>> be missing, since the dm-cache device itself is local to the host you
>>>> wrote the data from originally.
>>>
>>> And here it can (and probably will) go horribly wrong. If you miss the
>>> dm-cache device (cache/hypervisor failure) you will probably end up
>>> with an inconsistent filesystem. This is because dm-cache is not an
>>> ordered write-back cache afaik.
>>
>> I think that you are conflating two unrelated points.
>>
>> dm-cache does do proper ordering.
>>
>> If you use it to cache writes and then take it effectively out of the
>> picture (i.e., never destage that data from cache), you end up with
>> holes in a file system. Nothing to do with ordering, all to do with
>> having a write-back cache enabled and then chopping the write-back
>> cached data out of the picture.
>
> Way back in this thread it was mentioned you would just lose a few
> seconds of data when you lose the cache device. My point was that when
> you lose the cache device you do not just miss x seconds of data but
> probably lose the whole filesystem.

True.

> This is because the cache is not "ordered" and random parts, probably
> the "hot data" you care about, never made it from the cache device into
> ceph.

Totally unrelated.

> However, if the write cache would be "flushed in-order" to Ceph you
> would just lose x seconds of data and, hopefully, not have a corrupted
> disk. That could be acceptable for some people. I was just stressing
> that that isn't the case.

This in-order assumption - speaking as someone who has a long history in kernel file and storage - is the wrong assumption.

Don't think of the cache device and RBD as separate devices; once they are configured like this, they are the same device from the point of view of the file system (or whatever) that runs on top of them.
The cache and its caching policy can vary, but it is perfectly reasonable to have data live only in that caching layer pretty much forever. Local disk caches can also do this by the way :)

The whole flushing order argument is really not relevant here. I could "flush in order" after a minute, a week or a year. If the cache is large enough, you might have zero data land on the backing store (even if the destage policy does it, as you suggest, in order).

That all said, the reason to use a write cache on top of a client block device - rbd or other - is to improve performance for the client.

Any time we make our failure domain require fully operating two devices (the cache device and the original device), we increase the probability of a non-recoverable failure. In effect, the reliability of the storage is at best as reliable as the least reliable part of the pair.

>> If you use dm-cache as a write-through cache, this is not a problem
>> (i.e., it would only be used to cache reads).
>
> Caching reads is fine :)
Re: [ceph-users] Local SSD cache for ceph on each compute node.
>>> If you try to look at the rbd device under dm-cache from another host,
>>> of course any data that was cached on the dm-cache layer will be
>>> missing, since the dm-cache device itself is local to the host you
>>> wrote the data from originally.
>>
>> And here it can (and probably will) go horribly wrong. If you miss the
>> dm-cache device (cache/hypervisor failure) you will probably end up
>> with an inconsistent filesystem. This is because dm-cache is not an
>> ordered write-back cache afaik.
>
> I think that you are conflating two unrelated points.
>
> dm-cache does do proper ordering.
>
> If you use it to cache writes and then take it effectively out of the
> picture (i.e., never destage that data from cache), you end up with
> holes in a file system.
>
> Nothing to do with ordering, all to do with having a write-back cache
> enabled and then chopping the write-back cached data out of the picture.

Way back in this thread it was mentioned you would just lose a few seconds of data when you lose the cache device. My point was that when you lose the cache device you do not just miss x seconds of data but probably lose the whole filesystem. This is because the cache is not "ordered" and random parts, probably the "hot data" you care about, never made it from the cache device into ceph.

However, if the write cache would be "flushed in-order" to Ceph you would just lose x seconds of data and, hopefully, not have a corrupted disk. That could be acceptable for some people. I was just stressing that that isn't the case.

> If you use dm-cache as a write-through cache, this is not a problem
> (i.e., it would only be used to cache reads).

Caching reads is fine :)
Re: [ceph-users] Local SSD cache for ceph on each compute node.
On 03/29/2016 10:06 AM, Van Leeuwen, Robert wrote:
>> I might be misunderstanding your point above, but dm-cache provides
>> persistent storage. It will be there when you reboot and look for data
>> on that same box. dm-cache is also power-failure safe and tested to
>> survive this kind of outage.
>
> Correct, dm-cache is power-failure safe assuming all hardware survives
> the reboot.
>
>> If you try to look at the rbd device under dm-cache from another host,
>> of course any data that was cached on the dm-cache layer will be
>> missing, since the dm-cache device itself is local to the host you
>> wrote the data from originally.
>
> And here it can (and probably will) go horribly wrong. If you miss the
> dm-cache device (cache/hypervisor failure) you will probably end up with
> an inconsistent filesystem. This is because dm-cache is not an ordered
> write-back cache afaik.
I think that you are conflating two unrelated points.

dm-cache does do proper ordering.

If you use it to cache writes and then take it effectively out of the picture (i.e., never destage that data from cache), you end up with holes in a file system.

Nothing to do with ordering, all to do with having a write-back cache enabled and then chopping the write-back cached data out of the picture. This is no different than any other write cache: you need both the write cache device and the back-end storage to be present to see all of the data.

If you use dm-cache as a write-through cache, this is not a problem (i.e., it would only be used to cache reads).

Ric
Re: [ceph-users] Local SSD cache for ceph on each compute node.
On 3/27/16, 9:59 AM, "Ric Wheeler" wrote:
> On 03/16/2016 12:15 PM, Van Leeuwen, Robert wrote:
>>> My understanding of how a writeback cache should work is that it
>>> should only take a few seconds for writes to be streamed onto the
>>> network and is focussed on resolving the speed issue of small sync
>>> writes. The writes would be bundled into larger writes that are not
>>> time sensitive.
>>>
>>> So there is potential for a few seconds data loss, but compared to the
>>> current trend of using ephemeral storage to solve this issue, it's a
>>> major improvement.
>>
>> I think it is a bit worse than just a few seconds of data: as mentioned
>> in the blueprint for ceph, you would need some kind of ordered
>> write-back cache that maintains checkpoints internally.
>>
>> I am not that familiar with the internals of dm-cache but I do not
>> think it guarantees any write order. E.g. by default it will bypass the
>> cache for sequential IO.
>>
>> So I think it is very likely the "few seconds of data loss" in this
>> case means the filesystem is corrupt and you could lose the whole
>> thing. At the very least you will need to run fsck on it and hope it
>> can sort out all of the errors with minimal data loss.
>
> I might be misunderstanding your point above, but dm-cache provides
> persistent storage. It will be there when you reboot and look for data
> on that same box. dm-cache is also power-failure safe and tested to
> survive this kind of outage.

Correct, dm-cache is power-failure safe assuming all hardware survives the reboot.

> If you try to look at the rbd device under dm-cache from another host,
> of course any data that was cached on the dm-cache layer will be
> missing, since the dm-cache device itself is local to the host you wrote
> the data from originally.

And here it can (and probably will) go horribly wrong. If you miss the dm-cache device (cache/hypervisor failure) you will probably end up with an inconsistent filesystem.
This is because dm-cache is not an ordered write-back cache afaik.

Cheers,
Robert van Leeuwen
Re: [ceph-users] Local SSD cache for ceph on each compute node.
Hi all,

On 16/03/16 18:11, Van Leeuwen, Robert wrote:
>> Indeed, well understood.
>>
>> As a shorter-term workaround, if you have control over the VMs, you
>> could always just slice out an LVM volume from local SSD/NVMe and pass
>> it through to the guest. Within the guest, use dm-cache (or similar) to
>> add a cache front-end to your RBD volume.
>
> If you do this you need to set up your cache as read-cache only.
> Caching writes would be bad because a hypervisor failure would result in
> loss of the cache, which pretty much guarantees inconsistent data on the
> ceph volume. Also live-migration will become problematic compared to
> running everything from ceph, since you will also need to migrate the
> local storage.
>
> The question will be if adding more RAM (== more read cache) would not
> be more convenient and cheaper in the end (considering the time required
> for setting up and maintaining the extra caching layer on each VM,
> unless you work for free ;-). Also reads from ceph are pretty fast
> compared to the biggest bottleneck: (small) sync writes. So it is
> debatable how much performance you would win, except for some use cases
> with lots of reads on very large data sets which are also very latency
> sensitive.

Been following this discussion from a distance for a while, and have personally experimented with trying to introduce VM-local caching.

Our set-up: we have 3 storage nodes which each run a monitor and two OSDs. The machines have two gigabit Ethernet cards, one public-facing, the other on a private network for storage cluster communications. The machines themselves are Core i3s with 8GB RAM. The cluster is set to put replicas on each of the three nodes, the thought being that the load would be spread across the three nodes, so we should get close to SATA-1 type speeds in ideal cases.

I tried running without any sort of caching other than what the kernel RBD driver / librbd provided out of the box.
Virtual machines are running on KVM managed by OpenNebula. If I used VirtIO storage, things were a little better, but trying out some HyperV images, the VMs choked on disk I/O. I did some research and, after seeing a post by Sébastien Han regarding use of FlashCache with Ceph, I tried patching OpenNebula to support this caching scheme: http://dev.opennebula.org/issues/2827 The way this worked is it would slice a bit of SSD using LVM, map the RBD using the in-kernel driver, then set up FlashCache to combine the two. There was a bit of cache pre-seeding done too. This worked, and gave big speed improvements, but my implementation in OpenNebula is a bit of a house of cards. I've since bought my own hardware and intend to look into this at home: https://hackaday.io/project/10529-solar-powered-cloud-computing Something built into Ceph's librbd or kernel driver would be fantastic though as it would then be usable by OpenStack, libvirt, etc. -- Stuart Longland - Systems Engineer, VRT Systems, 38b Douglas Street, Milton QLD 4064 | T: +61 7 3535 9619 | F: +61 7 3535 9699 | http://www.vrt.com.au ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
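[Editor's sketch] The stack Stuart describes (slice an SSD with LVM, map the RBD with the kernel driver, bind them with flashcache) can be summarised in a few commands. All names here (vg_ssd, one/vm42-disk0, cache_vm42) are illustrative assumptions, and the `run` helper only prints each command so the sketch is a dry run:

```shell
# Dry-run sketch of the setup described above: slice an SSD-backed LV,
# map the RBD with the kernel driver, then bind them with flashcache.
# All names are illustrative only; drop the `run` wrapper (and run as
# root with the flashcache module loaded) to actually build the stack.
run() { echo "$@"; }

run lvcreate -L 10G -n cache_vm42 vg_ssd     # slice of the local SSD VG
run rbd map one/vm42-disk0                   # kernel RBD mapping -> /dev/rbd0
run flashcache_create -p back cache_vm42 \
    /dev/vg_ssd/cache_vm42 /dev/rbd0         # writeback cache device
```

The guest would then attach /dev/mapper/cache_vm42 in place of the bare RBD device.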
Re: [ceph-users] Local SSD cache for ceph on each compute node.
Thanks Wouldn't it be amazing to put a 2TB NVMe card in each compute node, make 1 config change and presto! Users see a 10 fold increase in performance :) with 95% reads going to cache and all writes being acknowledged after being written on cache. For writes you might want dual NVMe in a raid 1 so you're fully covered. -Original Message- From: Ric Wheeler [mailto:rwhee...@redhat.com] Sent: 27 March 2016 09:27 To: Daniel Niasoff ; Van Leeuwen, Robert ; Jason Dillaman Cc: ceph-users@lists.ceph.com; Mike Snitzer ; Joe Thornber Subject: Re: [ceph-users] Local SSD cache for ceph on each compute node. On 03/27/2016 11:13 AM, Daniel Niasoff wrote: > Hi Ric, > > But you would still have to set a dm-cache per rbd volume which makes it > difficult to manage. > > There needs to be a global setting either within kvm or ceph that caches > reads/writes before they hit the rbd device. > > Thanks > > Daniel Correct, it is per block device - effectively it is a layer on top of the rbd device if you want to set up a caching layer like this. As you mention, you can cache at other layers of the system as well. How difficult that is to manage and assemble depends on tooling. I don't see doing it in kvm as really easier than doing it under kvm, but I am a big believer in the need for much better tools to help manage things like this so that users don't see the complexity. Ric > > -Original Message- > From: Ric Wheeler [mailto:rwhee...@redhat.com] > Sent: 27 March 2016 09:00 > To: Van Leeuwen, Robert ; Daniel Niasoff > ; Jason Dillaman > Cc: ceph-users@lists.ceph.com; Mike Snitzer ; Joe > Thornber > Subject: Re: [ceph-users] Local SSD cache for ceph on each compute node. > > On 03/16/2016 12:15 PM, Van Leeuwen, Robert wrote: >>> My understanding of how a writeback cache should work is that it should >>> only take a few seconds for writes to be streamed onto the network and is >>> focussed on resolving the speed issue of small sync writes.
The writes >>> would be bundled into larger writes that are not time sensitive. >>> >>> So there is potential for a few seconds data loss but compared to the >>> current trend of using ephemeral storage to solve this issue, it's a major >>> improvement. >> I think it is a bit worse than just a few seconds of data: >> As mentioned in the blueprint for ceph you would need some kind of ordered >> write-back cache that maintains checkpoints internally. >> >> I am not that familiar with the internals of dm-cache but I do not think it >> guarantees any write order. >> E.g. by default it will bypass the cache for sequential IO. >> >> So I think it is very likely the "few seconds of data loss" in this case >> means the filesystem is corrupt and you could lose the whole thing. >> At the very least you will need to run fsck on it and hope it can sort out >> all of the errors with minimal data loss. >> >> >> So, for me, it seems conflicting to use persistent storage and then >> hoping your volumes survive a power outage. >> >> If you can survive missing that data you are probably better off running >> fully from ephemeral storage in the first place. >> >> Cheers, >> Robert van Leeuwen > Hi Robert, > > I might be misunderstanding your point above, but dm-cache provides > persistent storage. It will be there when you reboot and look for data on > that same box. > dm-cache is also power failure safe and tested to survive this kind of outage. > > If you try to look at the rbd device under dm-cache from another host, of > course any data that was cached on the dm-cache layer will be missing since > the dm-cache device itself is local to the host you wrote the data from > originally. > > In a similar way, using dm-cache for write caching (or any write cache local > to a client) will also mean that your data has a single point of failure > since that data will not be replicated out to the backing store until it is > destaged from cache.
> > I would note that this is exactly the kind of write cache that is popular > these days in front of enterprise storage arrays on clients so this is not > really uncommon. > > Regards, > > Ric >
Re: [ceph-users] Local SSD cache for ceph on each compute node.
On 03/27/2016 11:13 AM, Daniel Niasoff wrote: Hi Ric, But you would still have to set a dm-cache per rbd volume which makes it difficult to manage. There needs to be a global setting either within kvm or ceph that caches reads/writes before they hit the rbd device. Thanks Daniel Correct, it is per block device - effectively it is a layer on top of the rbd device if you want to set up a caching layer like this. As you mention, you can cache at other layers of the system as well. How difficult that is to manage and assemble depends on tooling. I don't see doing it in kvm as really easier than doing it under kvm, but I am a big believer in the need for much better tools to help manage things like this so that users don't see the complexity. Ric -Original Message- From: Ric Wheeler [mailto:rwhee...@redhat.com] Sent: 27 March 2016 09:00 To: Van Leeuwen, Robert ; Daniel Niasoff ; Jason Dillaman Cc: ceph-users@lists.ceph.com; Mike Snitzer ; Joe Thornber Subject: Re: [ceph-users] Local SSD cache for ceph on each compute node. On 03/16/2016 12:15 PM, Van Leeuwen, Robert wrote: My understanding of how a writeback cache should work is that it should only take a few seconds for writes to be streamed onto the network and is focussed on resolving the speed issue of small sync writes. The writes would be bundled into larger writes that are not time sensitive. So there is potential for a few seconds data loss but compared to the current trend of using ephemeral storage to solve this issue, it's a major improvement. I think it is a bit worse than just a few seconds of data: As mentioned in the blueprint for ceph you would need some kind of ordered write-back cache that maintains checkpoints internally. I am not that familiar with the internals of dm-cache but I do not think it guarantees any write order. E.g. by default it will bypass the cache for sequential IO.
So I think it is very likely the "few seconds of data loss" in this case means the filesystem is corrupt and you could lose the whole thing. At the very least you will need to run fsck on it and hope it can sort out all of the errors with minimal data loss. So, for me, it seems conflicting to use persistent storage and then hoping your volumes survive a power outage. If you can survive missing that data you are probably better off running fully from ephemeral storage in the first place. Cheers, Robert van Leeuwen Hi Robert, I might be misunderstanding your point above, but dm-cache provides persistent storage. It will be there when you reboot and look for data on that same box. dm-cache is also power failure safe and tested to survive this kind of outage. If you try to look at the rbd device under dm-cache from another host, of course any data that was cached on the dm-cache layer will be missing since the dm-cache device itself is local to the host you wrote the data from originally. In a similar way, using dm-cache for write caching (or any write cache local to a client) will also mean that your data has a single point of failure since that data will not be replicated out to the backing store until it is destaged from cache. I would note that this is exactly the kind of write cache that is popular these days in front of enterprise storage arrays on clients so this is not really uncommon. Regards, Ric
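[Editor's sketch] The per-block-device layering Ric describes boils down to a dm-cache table whose origin device is the mapped RBD. A minimal sketch, assuming the device paths and sizes shown (sizes are in 512-byte sectors; the table is only echoed here, the `dmsetup create` step needing root is left as a comment):

```shell
META=/dev/ssd/cache_meta    # small LV for dm-cache metadata (assumed name)
CACHE=/dev/ssd/cache_data   # fast LV holding the cached blocks (assumed name)
ORIGIN=/dev/rbd0            # mapped RBD image, the slow origin device
SECTORS=20971520            # origin size in sectors: blockdev --getsz $ORIGIN

# dm-cache table format: start len cache <metadata dev> <cache dev>
#   <origin dev> <block size> <#feature args> <feature args...>
#   <policy> <#policy args>
TABLE="0 $SECTORS cache $META $CACHE $ORIGIN 512 1 writeback mq 0"
echo "$TABLE"
# To instantiate (needs root): dmsetup create rbd_cached --table "$TABLE"
```

The guest or hypervisor then uses /dev/mapper/rbd_cached instead of the bare RBD; tearing it down cleanly requires flushing dirty blocks first.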
Re: [ceph-users] Local SSD cache for ceph on each compute node.
Hi Ric, But you would still have to set a dm-cache per rbd volume which makes it difficult to manage. There needs to be a global setting either within kvm or ceph that caches reads/writes before they hit the rbd device. Thanks Daniel -Original Message- From: Ric Wheeler [mailto:rwhee...@redhat.com] Sent: 27 March 2016 09:00 To: Van Leeuwen, Robert ; Daniel Niasoff ; Jason Dillaman Cc: ceph-users@lists.ceph.com; Mike Snitzer ; Joe Thornber Subject: Re: [ceph-users] Local SSD cache for ceph on each compute node. On 03/16/2016 12:15 PM, Van Leeuwen, Robert wrote: >> My understanding of how a writeback cache should work is that it should only >> take a few seconds for writes to be streamed onto the network and is >> focussed on resolving the speed issue of small sync writes. The writes would >> be bundled into larger writes that are not time sensitive. >> >> So there is potential for a few seconds data loss but compared to the >> current trend of using ephemeral storage to solve this issue, it's a major >> improvement. > I think it is a bit worse than just a few seconds of data: > As mentioned in the blueprint for ceph you would need some kind of ordered > write-back cache that maintains checkpoints internally. > > I am not that familiar with the internals of dm-cache but I do not think it > guarantees any write order. > E.g. by default it will bypass the cache for sequential IO. > > So I think it is very likely the "few seconds of data loss" in this case > means the filesystem is corrupt and you could lose the whole thing. > At the very least you will need to run fsck on it and hope it can sort out > all of the errors with minimal data loss. > > > So, for me, it seems conflicting to use persistent storage and then > hoping your volumes survive a power outage. > > If you can survive missing that data you are probably better off running fully > from ephemeral storage in the first place.
> > Cheers, > Robert van Leeuwen Hi Robert, I might be misunderstanding your point above, but dm-cache provides persistent storage. It will be there when you reboot and look for data on that same box. dm-cache is also power failure safe and tested to survive this kind of outage. If you try to look at the rbd device under dm-cache from another host, of course any data that was cached on the dm-cache layer will be missing since the dm-cache device itself is local to the host you wrote the data from originally. In a similar way, using dm-cache for write caching (or any write cache local to a client) will also mean that your data has a single point of failure since that data will not be replicated out to the backing store until it is destaged from cache. I would note that this is exactly the kind of write cache that is popular these days in front of enterprise storage arrays on clients so this is not really uncommon. Regards, Ric
Re: [ceph-users] Local SSD cache for ceph on each compute node.
On 03/16/2016 12:15 PM, Van Leeuwen, Robert wrote: My understanding of how a writeback cache should work is that it should only take a few seconds for writes to be streamed onto the network and is focussed on resolving the speed issue of small sync writes. The writes would be bundled into larger writes that are not time sensitive. So there is potential for a few seconds data loss but compared to the current trend of using ephemeral storage to solve this issue, it's a major improvement. I think it is a bit worse than just a few seconds of data: As mentioned in the blueprint for ceph you would need some kind of ordered write-back cache that maintains checkpoints internally. I am not that familiar with the internals of dm-cache but I do not think it guarantees any write order. E.g. by default it will bypass the cache for sequential IO. So I think it is very likely the "few seconds of data loss" in this case means the filesystem is corrupt and you could lose the whole thing. At the very least you will need to run fsck on it and hope it can sort out all of the errors with minimal data loss. So, for me, it seems conflicting to use persistent storage and then hoping your volumes survive a power outage. If you can survive missing that data you are probably better off running fully from ephemeral storage in the first place. Cheers, Robert van Leeuwen Hi Robert, I might be misunderstanding your point above, but dm-cache provides persistent storage. It will be there when you reboot and look for data on that same box. dm-cache is also power failure safe and tested to survive this kind of outage. If you try to look at the rbd device under dm-cache from another host, of course any data that was cached on the dm-cache layer will be missing since the dm-cache device itself is local to the host you wrote the data from originally.
In a similar way, using dm-cache for write caching (or any write cache local to a client) will also mean that your data has a single point of failure since that data will not be replicated out to the backing store until it is destaged from cache. I would note that this is exactly the kind of write cache that is popular these days in front of enterprise storage arrays on clients so this is not really uncommon. Regards, Ric
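[Editor's sketch] Since undestaged data is exactly the exposure Ric points out, it helps to know how much is still dirty. A small sketch parsing a `dmsetup status` line for a cache target: in the kernel's status format, field 7 is "used/total" cache blocks and field 14 is the dirty-block count (the sample line below is from a writeback dm-cache device):

```shell
# Report how many cache blocks are dirty, i.e. not yet destaged to the
# origin device (here, the RBD), from a `dmsetup status <dev>` line.
dirty_pct() {
  awk '{ split($7, b, "/"); printf "%d/%d (%.1f%%)\n", $14, b[2], 100 * $14 / b[2] }'
}

# Example status line from a writeback dm-cache device with nothing dirty.
# Prints: 0/32768 (0.0%)
echo "0 20971520 cache 8 106/4096 64 32768/32768 2492 1239 349993 113194 47157 47157 0 1 writeback 2 migration_threshold 8192 mq 10" | dirty_pct
```

In practice one would run `dmsetup status <cachedev> | dirty_pct` and wait for 0 dirty blocks before doing anything that bypasses the cache, such as mapping the RBD from another host.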
Re: [ceph-users] Local SSD cache for ceph on each compute node.
> -Original Message- > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of > Daniel Niasoff > Sent: 16 March 2016 21:02 > To: Nick Fisk ; 'Van Leeuwen, Robert' > ; 'Jason Dillaman' > Cc: ceph-users@lists.ceph.com > Subject: Re: [ceph-users] Local SSD cache for ceph on each compute node. > > Hi Nick, > > Your solution requires manual configuration for each VM and cannot be > set up as part of an automated OpenStack deployment. Absolutely, potentially flaky as well. > > It would be really nice if it was a hypervisor based setting as opposed to a VM > based setting. Yes, I can't wait until we can just specify "rbd_cache_device=/dev/ssd" in the ceph.conf and get it to write to that instead. Ideally ceph would also provide some sort of lightweight replication for the cache devices, but otherwise an iSCSI SSD farm or switched SAS could be used so that the caching device is not tied to one physical host. > > Thanks > > Daniel > > -Original Message- > From: Nick Fisk [mailto:n...@fisk.me.uk] > Sent: 16 March 2016 08:59 > To: Daniel Niasoff ; 'Van Leeuwen, Robert' > ; 'Jason Dillaman' > Cc: ceph-users@lists.ceph.com > Subject: RE: [ceph-users] Local SSD cache for ceph on each compute node. > > > > > -Original Message- > > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf > > Of Daniel Niasoff > > Sent: 16 March 2016 08:26 > > To: Van Leeuwen, Robert ; Jason Dillaman > > > > Cc: ceph-users@lists.ceph.com > > Subject: Re: [ceph-users] Local SSD cache for ceph on each compute node. > > > > Hi Robert, > > > > >Caching writes would be bad because a hypervisor failure would result > > >in > > loss of the cache which pretty much guarantees inconsistent data on > > the ceph volume. > > >Also live-migration will become problematic compared to running > > everything from ceph since you will also need to migrate the > local-storage. > > I tested a solution using iSCSI for the cache devices.
Each VM was using > flashcache with a combination of an iSCSI LUN from an SSD and an RBD. This gets > around the problem of moving things around or if the hypervisor goes down. > It's not local caching but the write latency is at least 10x lower than the RBD. > Note I tested it, I didn't put it into production :-) > > > > > My understanding of how a writeback cache should work is that it > > should only take a few seconds for writes to be streamed onto the > > network and is focussed on resolving the speed issue of small sync > > writes. The writes > would > > be bundled into larger writes that are not time sensitive. > > > > So there is potential for a few seconds data loss but compared to the > current > > trend of using ephemeral storage to solve this issue, it's a major > > improvement. > > Yeah, problem is a couple of seconds data loss means different things to > different people. > > > > > > (considering the time required for setting up and maintaining the > > > extra > > caching layer on each vm, unless you work for free ;-) > > > > Couldn't agree more there. > > > > I am just so surprised how the openstack community haven't looked to > > resolve this issue. Ephemeral storage is a HUGE compromise unless you > > have built in failure into every aspect of your application but many > > people use openstack as a general purpose devstack. > > > > (Jason pointed out his blueprint but I guess it's at least a year or 2 > away - > > http://tracker.ceph.com/projects/ceph/wiki/Rbd_-_ordered_crash- > > consistent_write-back_caching_extension) > > > > I see articles discussing the idea such as this one > > > > http://www.sebastien-han.fr/blog/2014/06/10/ceph-cache-pool-tiering- > > scalable-cache/ > > > > but no real straightforward validated setup instructions.
> > > > Thanks > > > > Daniel > > > > > > -Original Message- > > From: Van Leeuwen, Robert [mailto:rovanleeu...@ebay.com] > > Sent: 16 March 2016 08:11 > > To: Jason Dillaman ; Daniel Niasoff > > > > Cc: ceph-users@lists.ceph.com > > Subject: Re: [ceph-users] Local SSD cache for ceph on each compute node. > > > > >Indeed, well understood. > > > > > >As a shorter term workaround, if you have control over the VMs, you > > >could > > always just slice out an LVM volume from local SSD/NVMe and pass it > > through to the guest. Within the guest, use
Re: [ceph-users] Local SSD cache for ceph on each compute node.
Hi Nick, Your solution requires manual configuration for each VM and cannot be set up as part of an automated OpenStack deployment. It would be really nice if it was a hypervisor based setting as opposed to a VM based setting. Thanks Daniel -Original Message- From: Nick Fisk [mailto:n...@fisk.me.uk] Sent: 16 March 2016 08:59 To: Daniel Niasoff ; 'Van Leeuwen, Robert' ; 'Jason Dillaman' Cc: ceph-users@lists.ceph.com Subject: RE: [ceph-users] Local SSD cache for ceph on each compute node. > -Original Message- > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf > Of Daniel Niasoff > Sent: 16 March 2016 08:26 > To: Van Leeuwen, Robert ; Jason Dillaman > > Cc: ceph-users@lists.ceph.com > Subject: Re: [ceph-users] Local SSD cache for ceph on each compute node. > > Hi Robert, > > >Caching writes would be bad because a hypervisor failure would result > >in > loss of the cache which pretty much guarantees inconsistent data on > the ceph volume. > >Also live-migration will become problematic compared to running > everything from ceph since you will also need to migrate the local-storage. I tested a solution using iSCSI for the cache devices. Each VM was using flashcache with a combination of an iSCSI LUN from an SSD and an RBD. This gets around the problem of moving things around or if the hypervisor goes down. It's not local caching but the write latency is at least 10x lower than the RBD. Note I tested it, I didn't put it into production :-) > > My understanding of how a writeback cache should work is that it > should only take a few seconds for writes to be streamed onto the > network and is focussed on resolving the speed issue of small sync > writes. The writes would > be bundled into larger writes that are not time sensitive. > > So there is potential for a few seconds data loss but compared to the current > trend of using ephemeral storage to solve this issue, it's a major > improvement.
Yeah, problem is a couple of seconds data loss means different things to different people. > > > (considering the time required for setting up and maintaining the > > extra > caching layer on each vm, unless you work for free ;-) > > Couldn't agree more there. > > I am just so surprised how the openstack community haven't looked to > resolve this issue. Ephemeral storage is a HUGE compromise unless you > have built in failure into every aspect of your application but many > people use openstack as a general purpose devstack. > > (Jason pointed out his blueprint but I guess it's at least a year or 2 away - > http://tracker.ceph.com/projects/ceph/wiki/Rbd_-_ordered_crash- > consistent_write-back_caching_extension) > > I see articles discussing the idea such as this one > > http://www.sebastien-han.fr/blog/2014/06/10/ceph-cache-pool-tiering- > scalable-cache/ > > but no real straightforward validated setup instructions. > > Thanks > > Daniel > > > -----Original Message----- > From: Van Leeuwen, Robert [mailto:rovanleeu...@ebay.com] > Sent: 16 March 2016 08:11 > To: Jason Dillaman ; Daniel Niasoff > > Cc: ceph-users@lists.ceph.com > Subject: Re: [ceph-users] Local SSD cache for ceph on each compute node. > > >Indeed, well understood. > > > >As a shorter term workaround, if you have control over the VMs, you > >could > always just slice out an LVM volume from local SSD/NVMe and pass it > through to the guest. Within the guest, use dm-cache (or similar) to > add a > cache front-end to your RBD volume. > > If you do this you need to setup your cache as read-cache only. > Caching writes would be bad because a hypervisor failure would result > in loss > of the cache which pretty much guarantees inconsistent data on the > ceph volume. > Also live-migration will become problematic compared to running > everything from ceph since you will also need to migrate the local-storage.
> > The question will be if adding more ram (== more read cache) would not > be more convenient and cheaper in the end. > (considering the time required for setting up and maintaining the > extra caching layer on each vm, unless you work for free ;-) Also > reads from ceph > are pretty fast compared to the biggest bottleneck: (small) sync writes. > So it is debatable how much performance you would win except for some > use-cases with lots of reads on very large data sets which are also > very latency sensitive. > > Cheers, > Robert van Leeuwen
Re: [ceph-users] Local SSD cache for ceph on each compute node.
I’d rather like to see this implemented at the hypervisor level, i.e.: QEMU, so we can have a common layer for all the storage backends. Although this is less portable... > On 17 Mar 2016, at 11:00, Nick Fisk wrote: > > > >> -Original Message- >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of >> Daniel Niasoff >> Sent: 16 March 2016 21:02 >> To: Nick Fisk ; 'Van Leeuwen, Robert' >> ; 'Jason Dillaman' >> Cc: ceph-users@lists.ceph.com >> Subject: Re: [ceph-users] Local SSD cache for ceph on each compute node. >> >> Hi Nick, >> >> Your solution requires manual configuration for each VM and cannot be >> setup as part of an automated OpenStack deployment. > > Absolutely, potentially flaky as well. > >> >> It would be really nice if it was a hypervisor based setting as opposed to > a VM >> based setting. > > Yes, I can't wait until we can just specify "rbd_cache_device=/dev/ssd" in > the ceph.conf and get it to write to that instead. Ideally ceph would also > provide some sort of lightweight replication for the cache devices, but > otherwise a iSCSI SSD farm or switched SAS could be used so that the caching > device is not tied to one physical host. > >> >> Thanks >> >> Daniel >> >> -Original Message----- >> From: Nick Fisk [mailto:n...@fisk.me.uk] >> Sent: 16 March 2016 08:59 >> To: Daniel Niasoff ; 'Van Leeuwen, Robert' >> ; 'Jason Dillaman' >> Cc: ceph-users@lists.ceph.com >> Subject: RE: [ceph-users] Local SSD cache for ceph on each compute node. >> >> >> >>> -Original Message- >>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf >>> Of Daniel Niasoff >>> Sent: 16 March 2016 08:26 >>> To: Van Leeuwen, Robert ; Jason Dillaman >>> >>> Cc: ceph-users@lists.ceph.com >>> Subject: Re: [ceph-users] Local SSD cache for ceph on each compute node. 
>>> >>> Hi Robert, >>> >>>> Caching writes would be bad because a hypervisor failure would result >>>> in >>> loss of the cache which pretty much guarantees inconsistent data on >>> the ceph volume. >>>> Also live-migration will become problematic compared to running >>> everything from ceph since you will also need to migrate the >> local-storage. >> >> I tested a solution using iSCSI for the cache devices. Each VM was using >> flashcache with a combination of a iSCSI LUN from a SSD and a RBD. This > gets >> around the problem of moving things around or if the hypervisor goes down. >> It's not local caching but the write latency is at least 10x lower than > the RBD. >> Note I tested it, I didn't put it into production :-) >> >>> >>> My understanding of how a writeback cache should work is that it >>> should only take a few seconds for writes to be streamed onto the >>> network and is focussed on resolving the speed issue of small sync >>> writes. The writes >> would >>> be bundled into larger writes that are not time sensitive. >>> >>> So there is potential for a few seconds data loss but compared to the >> current >>> trend of using ephemeral storage to solve this issue, it's a major >>> improvement. >> >> Yeah, problem is a couple of seconds data loss mean different things to >> different people. >> >>> >>>> (considering the time required for setting up and maintaining the >>>> extra >>> caching layer on each vm, unless you work for free ;-) >>> >>> Couldn't agree more there. >>> >>> I am just so surprised how the openstack community haven't looked to >>> resolve this issue. Ephemeral storage is a HUGE compromise unless you >>> have built in failure into every aspect of your application but many >>> people use openstack as a general purpose devstack. 
>>> >>> (Jason pointed out his blueprint but I guess it's at least a year or 2 >> away - >>> http://tracker.ceph.com/projects/ceph/wiki/Rbd_-_ordered_crash- >>> consistent_write-back_caching_extension) >>> >>> I see articles discussing the idea such as this one >>> >>> http://www.sebastien-han.fr/blog/2014/06/10/ceph-cache-pool-tiering- >>> scalable-cache/ >>> >>> but no real straightforward validated setup instructions. >>> >>> Thanks >>> &
Re: [ceph-users] Local SSD cache for ceph on each compute node.
Hi Robert, It seems I have to give up on this goal for now but wanted to be sure I wasn't missing something obvious. >If you can survive missing that data you are probably better off running fully >from ephemeral storage in the first place. What, and lose the entire ephemeral disk since the VM was created? Am I missing something here or is there an automated way of syncing ephemeral disks from time to time with a ceph back end? Thanks Daniel -Original Message- From: Van Leeuwen, Robert [mailto:rovanleeu...@ebay.com] Sent: 16 March 2016 10:15 To: Daniel Niasoff ; Jason Dillaman Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] Local SSD cache for ceph on each compute node. > >My understanding of how a writeback cache should work is that it should only >take a few seconds for writes to be streamed onto the network and is focussed >on resolving the speed issue of small sync writes. The writes would be bundled >into larger writes that are not time sensitive. > >So there is potential for a few seconds data loss but compared to the current >trend of using ephemeral storage to solve this issue, it's a major improvement. I think it is a bit worse than just a few seconds of data: As mentioned in the blueprint for ceph you would need some kind of ordered write-back cache that maintains checkpoints internally. I am not that familiar with the internals of dm-cache but I do not think it guarantees any write order. E.g. by default it will bypass the cache for sequential IO. So I think it is very likely the "few seconds of data loss" in this case means the filesystem is corrupt and you could lose the whole thing. At the very least you will need to run fsck on it and hope it can sort out all of the errors with minimal data loss. So, for me, it seems conflicting to use persistent storage and then hoping your volumes survive a power outage. If you can survive missing that data you are probably better off running fully from ephemeral storage in the first place.
Cheers, Robert van Leeuwen
Re: [ceph-users] Local SSD cache for ceph on each compute node.
> >My understanding of how a writeback cache should work is that it should only >take a few seconds for writes to be streamed onto the network and is focussed >on resolving the speed issue of small sync writes. The writes would be bundled >into larger writes that are not time sensitive. > >So there is potential for a few seconds data loss but compared to the current >trend of using ephemeral storage to solve this issue, it's a major improvement. I think it is a bit worse than just a few seconds of data: As mentioned in the blueprint for ceph you would need some kind of ordered write-back cache that maintains checkpoints internally. I am not that familiar with the internals of dm-cache but I do not think it guarantees any write order. E.g. by default it will bypass the cache for sequential IO. So I think it is very likely the "few seconds of data loss" in this case means the filesystem is corrupt and you could lose the whole thing. At the very least you will need to run fsck on it and hope it can sort out all of the errors with minimal data loss. So, for me, it seems conflicting to use persistent storage and then hoping your volumes survive a power outage. If you can survive missing that data you are probably better off running fully from ephemeral storage in the first place. Cheers, Robert van Leeuwen
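[Editor's sketch] The sequential-IO bypass Robert mentions is governed by tunables of the mq cache policy, which can be changed at runtime with `dmsetup message`. A dry-run sketch (the device name rbd_cached is an assumption, and the `run` helper prints rather than executes, since the real commands need root and an existing cache device):

```shell
run() { echo "$@"; }    # print instead of executing; drop for real use (root)

# mq policy tunables governing when an IO stream is classified as
# sequential (and thus sent straight to the origin, bypassing the cache).
# Zero values were used in tests elsewhere in this thread.
run dmsetup message rbd_cached 0 sequential_threshold 0
run dmsetup message rbd_cached 0 random_threshold 0
```

Whether this makes dm-cache behave like a true writeback cache for new sequential writes is exactly the open question of this thread; the tunables only change classification, not the ordering guarantees Robert is concerned about.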
Re: [ceph-users] Local SSD cache for ceph on each compute node.
> -Original Message- > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of > Daniel Niasoff > Sent: 16 March 2016 08:26 > To: Van Leeuwen, Robert ; Jason Dillaman > > Cc: ceph-users@lists.ceph.com > Subject: Re: [ceph-users] Local SSD cache for ceph on each compute node. > > Hi Robert, > > >Caching writes would be bad because a hypervisor failure would result in > loss of the cache which pretty much guarantees inconsistent data on the > ceph volume. > >Also live-migration will become problematic compared to running > everything from ceph since you will also need to migrate the local-storage. I tested a solution using iSCSI for the cache devices. Each VM was using flashcache with a combination of an iSCSI LUN from an SSD and an RBD. This gets around the problem of moving things around or if the hypervisor goes down. It's not local caching but the write latency is at least 10x lower than the RBD. Note I tested it, I didn't put it into production :-) > > My understanding of how a writeback cache should work is that it should > only take a few seconds for writes to be streamed onto the network and is > focussed on resolving the speed issue of small sync writes. The writes would > be bundled into larger writes that are not time sensitive. > > So there is potential for a few seconds data loss but compared to the current > trend of using ephemeral storage to solve this issue, it's a major > improvement. Yeah, problem is a couple of seconds data loss means different things to different people. > > > (considering the time required for setting up and maintaining the extra > caching layer on each vm, unless you work for free ;-) > > Couldn't agree more there. > > I am just so surprised how the openstack community haven't looked to > resolve this issue. Ephemeral storage is a HUGE compromise unless you have > built in failure into every aspect of your application but many people use > openstack as a general purpose devstack.
> (Jason pointed out his blueprint, but I guess it's at least a year or two away -
> http://tracker.ceph.com/projects/ceph/wiki/Rbd_-_ordered_crash-consistent_write-back_caching_extension)
>
> I see articles discussing the idea, such as this one:
>
> http://www.sebastien-han.fr/blog/2014/06/10/ceph-cache-pool-tiering-scalable-cache/
>
> but no real straightforward validated setup instructions.
>
> Thanks
>
> Daniel
>
> -----Original Message-----
> From: Van Leeuwen, Robert [mailto:rovanleeu...@ebay.com]
> Sent: 16 March 2016 08:11
> To: Jason Dillaman ; Daniel Niasoff
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Local SSD cache for ceph on each compute node.
>
> >Indeed, well understood.
> >
> >As a shorter term workaround, if you have control over the VMs, you could always just slice out an LVM volume from local SSD/NVMe and pass it through to the guest. Within the guest, use dm-cache (or similar) to add a cache front-end to your RBD volume.
>
> If you do this you need to set up your cache as read-cache only.
> Caching writes would be bad because a hypervisor failure would result in loss of the cache, which pretty much guarantees inconsistent data on the ceph volume.
> Also, live-migration will become problematic compared to running everything from ceph, since you will also need to migrate the local storage.
>
> The question will be whether adding more RAM (== more read cache) would not be more convenient and cheaper in the end (considering the time required for setting up and maintaining the extra caching layer on each VM, unless you work for free ;-). Also, reads from ceph are pretty fast compared to the biggest bottleneck: (small) sync writes.
> So it is debatable how much performance you would win, except for some use-cases with lots of reads on very large data sets which are also very latency sensitive.
> Cheers,
> Robert van Leeuwen

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
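The iSCSI-plus-flashcache arrangement Robert describes above can be sketched roughly as below. This is a hypothetical reconstruction, not his actual setup: the portal address, IQN, image and device names are all placeholders, and writeback mode (`-p back`) carries exactly the consistency caveats debated in this thread.

```shell
# Attach the SSD-backed cache LUN exported by a separate iSCSI target
# (portal address and IQN are placeholders for this sketch).
iscsiadm -m discovery -t sendtargets -p 192.0.2.10
iscsiadm -m node -T iqn.2016-03.example:ssd-cache0 -p 192.0.2.10 --login

# Map the Ceph image that holds the actual VM data.
rbd map rbd/vm-disk0          # appears as e.g. /dev/rbd0

# Combine them: flashcache_create -p <mode> <name> <ssd_dev> <backing_dev>.
# "-p back" selects writeback, so sync writes complete on the SSD LUN
# instead of waiting for the Ceph round trip.
flashcache_create -p back vm_cached /dev/sdb /dev/rbd0

# The cached device shows up under /dev/mapper and is handed to the VM.
ls -l /dev/mapper/vm_cached
```

Because the cache LUN lives on a separate iSCSI target rather than on the hypervisor, a hypervisor failure does not take the dirty blocks with it, which is the property Robert is relying on.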
Re: [ceph-users] Local SSD cache for ceph on each compute node.
Hi Robert,

>Caching writes would be bad because a hypervisor failure would result in loss of the cache, which pretty much guarantees inconsistent data on the ceph volume.
>Also live-migration will become problematic compared to running everything from ceph since you will also need to migrate the local-storage.

My understanding of how a writeback cache should work is that it should only take a few seconds for writes to be streamed onto the network; it is focussed on resolving the speed issue of small sync writes. The writes would be bundled into larger writes that are not time sensitive.

So there is potential for a few seconds of data loss, but compared to the current trend of using ephemeral storage to solve this issue, it's a major improvement.

> (considering the time required for setting up and maintaining the extra caching layer on each vm, unless you work for free ;-)

Couldn't agree more there.

I am just so surprised that the openstack community hasn't looked to resolve this issue. Ephemeral storage is a HUGE compromise unless you have built failure into every aspect of your application, but many people use openstack as a general purpose devstack.

(Jason pointed out his blueprint, but I guess it's at least a year or two away - http://tracker.ceph.com/projects/ceph/wiki/Rbd_-_ordered_crash-consistent_write-back_caching_extension)

I see articles discussing the idea, such as this one:

http://www.sebastien-han.fr/blog/2014/06/10/ceph-cache-pool-tiering-scalable-cache/

but no real straightforward validated setup instructions.

Thanks

Daniel

-----Original Message-----
From: Van Leeuwen, Robert [mailto:rovanleeu...@ebay.com]
Sent: 16 March 2016 08:11
To: Jason Dillaman ; Daniel Niasoff
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Local SSD cache for ceph on each compute node.

>Indeed, well understood.
>
>As a shorter term workaround, if you have control over the VMs, you could always just slice out an LVM volume from local SSD/NVMe and pass it through to the guest. Within the guest, use dm-cache (or similar) to add a cache front-end to your RBD volume.

If you do this you need to set up your cache as read-cache only.
Caching writes would be bad because a hypervisor failure would result in loss of the cache, which pretty much guarantees inconsistent data on the ceph volume.
Also, live-migration will become problematic compared to running everything from ceph, since you will also need to migrate the local storage.

The question will be whether adding more RAM (== more read cache) would not be more convenient and cheaper in the end (considering the time required for setting up and maintaining the extra caching layer on each VM, unless you work for free ;-). Also, reads from ceph are pretty fast compared to the biggest bottleneck: (small) sync writes.
So it is debatable how much performance you would win, except for some use-cases with lots of reads on very large data sets which are also very latency sensitive.

Cheers,
Robert van Leeuwen
Re: [ceph-users] Local SSD cache for ceph on each compute node.
>Indeed, well understood.
>
>As a shorter term workaround, if you have control over the VMs, you could always just slice out an LVM volume from local SSD/NVMe and pass it through to the guest. Within the guest, use dm-cache (or similar) to add a cache front-end to your RBD volume.

If you do this you need to set up your cache as read-cache only.
Caching writes would be bad because a hypervisor failure would result in loss of the cache, which pretty much guarantees inconsistent data on the ceph volume.
Also, live-migration will become problematic compared to running everything from ceph, since you will also need to migrate the local storage.

The question will be whether adding more RAM (== more read cache) would not be more convenient and cheaper in the end (considering the time required for setting up and maintaining the extra caching layer on each VM, unless you work for free ;-). Also, reads from ceph are pretty fast compared to the biggest bottleneck: (small) sync writes.
So it is debatable how much performance you would win, except for some use-cases with lots of reads on very large data sets which are also very latency sensitive.

Cheers,
Robert van Leeuwen
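An in-guest sketch of the LVM-slice idea, with Robert's "read-cache only" advice applied. This is a minimal hypothetical setup, not validated instructions: the device names assume the RBD arrives as /dev/vda and the passed-through SSD slice as /dev/vdb. dm-cache's writethrough mode keeps the origin (RBD) authoritative for every write, which is what makes losing the local cache safe.

```shell
# Both the passed-through SSD slice and the RBD join one volume group
# (device names are placeholders for this sketch).
pvcreate /dev/vda /dev/vdb
vgcreate vg0 /dev/vda /dev/vdb

# Data LV on the RBD, cache LV on the SSD slice.
lvcreate -n data -l 100%PVS vg0 /dev/vda
lvcreate -n fastcache -l 90%PVS vg0 /dev/vdb

# Writethrough: every write also completes on the RBD before being
# acknowledged, so a hypervisor failure cannot strand dirty blocks.
lvconvert -y --type cache-pool --cachemode writethrough vg0/fastcache
lvconvert -y --type cache --cachepool vg0/fastcache vg0/data
```

The guest then mounts /dev/vg0/data as usual; reads of hot blocks are served from the SSD while write safety stays identical to plain RBD.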
Re: [ceph-users] Local SSD cache for ceph on each compute node.
I am using openstack, so I need this to be fully automated and to apply to all my VMs.

If I could do what you mention at the hypervisor level, that would be much easier.

The options that you mention are, I guess, for very specific use cases and need to be configured on a per-VM basis, whilst I am looking for a general "ceph on steroids" approach for all my VMs without any maintenance.

Thanks again :)

-----Original Message-----
From: Jason Dillaman [mailto:dilla...@redhat.com]
Sent: 16 March 2016 01:42
To: Daniel Niasoff
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Local SSD cache for ceph on each compute node.

Indeed, well understood.

As a shorter term workaround, if you have control over the VMs, you could always just slice out an LVM volume from local SSD/NVMe and pass it through to the guest. Within the guest, use dm-cache (or similar) to add a cache front-end to your RBD volume. Others have also reported improvements by using the QEMU x-data-plane option and RAIDing several RBD images together within the VM.

--

Jason Dillaman

----- Original Message -----
> From: "Daniel Niasoff"
> To: "Jason Dillaman"
> Cc: ceph-users@lists.ceph.com
> Sent: Tuesday, March 15, 2016 9:32:50 PM
> Subject: RE: [ceph-users] Local SSD cache for ceph on each compute node.
>
> Thanks.
>
> Reassuring, but I could do with something today :)
>
> -----Original Message-----
> From: Jason Dillaman [mailto:dilla...@redhat.com]
> Sent: 16 March 2016 01:25
> To: Daniel Niasoff
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Local SSD cache for ceph on each compute node.
>
> The good news is such a feature is in the early stage of design [1].
> Hopefully this is a feature that will land in the Kraken release timeframe.
> [1]
> http://tracker.ceph.com/projects/ceph/wiki/Rbd_-_ordered_crash-consistent_write-back_caching_extension
>
> --
>
> Jason Dillaman
>
> ----- Original Message -----
> > From: "Daniel Niasoff"
> > To: ceph-users@lists.ceph.com
> > Sent: Tuesday, March 15, 2016 8:47:04 PM
> > Subject: [ceph-users] Local SSD cache for ceph on each compute node.
> >
> > Hi,
> >
> > Let me start. Ceph is amazing, no it really is!
> >
> > But a hypervisor reading and writing all its data off the network will add some latency to reads and writes.
> >
> > So the hypervisor could do with a local cache, possibly SSD or even NVMe.
> >
> > Spent a while looking into this, but it seems really strange that few people see the value of this.
> >
> > Basically the cache would be used in two ways:
> >
> > a) cache hot data
> > b) writeback cache for ceph writes
> >
> > There is the RBD cache, but that isn't disk based, and on a hypervisor memory is at a premium.
> >
> > A simple solution would be to put a journal on each compute node and get each hypervisor to use its own journal. Would this work?
> >
> > Something like this:
> > http://sebastien-han.fr/images/ceph-cache-pool-compute-design.png
> >
> > Can this be achieved?
> >
> > A better explanation of what I am trying to achieve is here:
> >
> > http://opennebula.org/cached-ssd-storage-infrastructure-for-vms/
> >
> > This talk, if it gets voted in, looks interesting -
> > https://www.openstack.org/summit/austin-2016/vote-for-speakers/Presentation/6827
> >
> > Can anyone help?
> >
> > Thanks
> >
> > Daniel
Re: [ceph-users] Local SSD cache for ceph on each compute node.
Indeed, well understood.

As a shorter term workaround, if you have control over the VMs, you could always just slice out an LVM volume from local SSD/NVMe and pass it through to the guest. Within the guest, use dm-cache (or similar) to add a cache front-end to your RBD volume. Others have also reported improvements by using the QEMU x-data-plane option and RAIDing several RBD images together within the VM.

--

Jason Dillaman

----- Original Message -----
> From: "Daniel Niasoff"
> To: "Jason Dillaman"
> Cc: ceph-users@lists.ceph.com
> Sent: Tuesday, March 15, 2016 9:32:50 PM
> Subject: RE: [ceph-users] Local SSD cache for ceph on each compute node.
>
> Thanks.
>
> Reassuring, but I could do with something today :)
>
> -----Original Message-----
> From: Jason Dillaman [mailto:dilla...@redhat.com]
> Sent: 16 March 2016 01:25
> To: Daniel Niasoff
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Local SSD cache for ceph on each compute node.
>
> The good news is such a feature is in the early stage of design [1].
> Hopefully this is a feature that will land in the Kraken release timeframe.
>
> [1]
> http://tracker.ceph.com/projects/ceph/wiki/Rbd_-_ordered_crash-consistent_write-back_caching_extension
>
> --
>
> Jason Dillaman
>
> ----- Original Message -----
> > From: "Daniel Niasoff"
> > To: ceph-users@lists.ceph.com
> > Sent: Tuesday, March 15, 2016 8:47:04 PM
> > Subject: [ceph-users] Local SSD cache for ceph on each compute node.
> >
> > Hi,
> >
> > Let me start. Ceph is amazing, no it really is!
> >
> > But a hypervisor reading and writing all its data off the network will add some latency to reads and writes.
> >
> > So the hypervisor could do with a local cache, possibly SSD or even NVMe.
> >
> > Spent a while looking into this, but it seems really strange that few people see the value of this.
> >
> > Basically the cache would be used in two ways:
> >
> > a) cache hot data
> > b) writeback cache for ceph writes
> >
> > There is the RBD cache, but that isn't disk based, and on a hypervisor memory is at a premium.
> >
> > A simple solution would be to put a journal on each compute node and get each hypervisor to use its own journal. Would this work?
> >
> > Something like this:
> > http://sebastien-han.fr/images/ceph-cache-pool-compute-design.png
> >
> > Can this be achieved?
> >
> > A better explanation of what I am trying to achieve is here:
> >
> > http://opennebula.org/cached-ssd-storage-infrastructure-for-vms/
> >
> > This talk, if it gets voted in, looks interesting -
> > https://www.openstack.org/summit/austin-2016/vote-for-speakers/Presentation/6827
> >
> > Can anyone help?
> >
> > Thanks
> >
> > Daniel
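Jason's two suggestions above, in rough command form. This is a hedged sketch, not a validated setup: the image and device names are invented, and x-data-plane was an experimental QEMU virtio-blk option at the time, so the exact property names can vary between QEMU versions.

```shell
# Host side: QEMU virtio-blk with the (then experimental) dataplane,
# which moves I/O submission off the main loop; early versions also
# required scsi=off on the device.
qemu-system-x86_64 \
  -drive if=none,id=drive0,file=rbd:rbd/vm-disk0,cache=none \
  -device virtio-blk-pci,drive=drive0,scsi=off,x-data-plane=on \
  # ... remaining machine/network options omitted ...

# Guest side: stripe several RBD-backed virtual disks so that requests
# fan out across more OSDs in parallel (disk names are placeholders).
mdadm --create /dev/md0 --level=0 --raid-devices=4 \
      /dev/vdb /dev/vdc /dev/vdd /dev/vde
mkfs.xfs /dev/md0
```

The RAID-0 trick trades away per-image snapshots and easy resizing for higher parallelism, which is why it tends to be reported for benchmark-style workloads rather than general fleets.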
Re: [ceph-users] Local SSD cache for ceph on each compute node.
Thanks.

Reassuring, but I could do with something today :)

-----Original Message-----
From: Jason Dillaman [mailto:dilla...@redhat.com]
Sent: 16 March 2016 01:25
To: Daniel Niasoff
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Local SSD cache for ceph on each compute node.

The good news is such a feature is in the early stage of design [1].
Hopefully this is a feature that will land in the Kraken release timeframe.

[1]
http://tracker.ceph.com/projects/ceph/wiki/Rbd_-_ordered_crash-consistent_write-back_caching_extension

--

Jason Dillaman

----- Original Message -----
> From: "Daniel Niasoff"
> To: ceph-users@lists.ceph.com
> Sent: Tuesday, March 15, 2016 8:47:04 PM
> Subject: [ceph-users] Local SSD cache for ceph on each compute node.
>
> Hi,
>
> Let me start. Ceph is amazing, no it really is!
>
> But a hypervisor reading and writing all its data off the network will add some latency to reads and writes.
>
> So the hypervisor could do with a local cache, possibly SSD or even NVMe.
>
> Spent a while looking into this, but it seems really strange that few people see the value of this.
>
> Basically the cache would be used in two ways:
>
> a) cache hot data
> b) writeback cache for ceph writes
>
> There is the RBD cache, but that isn't disk based, and on a hypervisor memory is at a premium.
>
> A simple solution would be to put a journal on each compute node and get each hypervisor to use its own journal. Would this work?
>
> Something like this:
> http://sebastien-han.fr/images/ceph-cache-pool-compute-design.png
>
> Can this be achieved?
>
> A better explanation of what I am trying to achieve is here:
>
> http://opennebula.org/cached-ssd-storage-infrastructure-for-vms/
>
> This talk, if it gets voted in, looks interesting -
> https://www.openstack.org/summit/austin-2016/vote-for-speakers/Presentation/6827
>
> Can anyone help?
>
> Thanks
>
> Daniel
Re: [ceph-users] Local SSD cache for ceph on each compute node.
The good news is such a feature is in the early stage of design [1].
Hopefully this is a feature that will land in the Kraken release timeframe.

[1] http://tracker.ceph.com/projects/ceph/wiki/Rbd_-_ordered_crash-consistent_write-back_caching_extension

--

Jason Dillaman

----- Original Message -----
> From: "Daniel Niasoff"
> To: ceph-users@lists.ceph.com
> Sent: Tuesday, March 15, 2016 8:47:04 PM
> Subject: [ceph-users] Local SSD cache for ceph on each compute node.
>
> Hi,
>
> Let me start. Ceph is amazing, no it really is!
>
> But a hypervisor reading and writing all its data off the network will add some latency to reads and writes.
>
> So the hypervisor could do with a local cache, possibly SSD or even NVMe.
>
> Spent a while looking into this, but it seems really strange that few people see the value of this.
>
> Basically the cache would be used in two ways:
>
> a) cache hot data
> b) writeback cache for ceph writes
>
> There is the RBD cache, but that isn't disk based, and on a hypervisor memory is at a premium.
>
> A simple solution would be to put a journal on each compute node and get each hypervisor to use its own journal. Would this work?
>
> Something like this:
> http://sebastien-han.fr/images/ceph-cache-pool-compute-design.png
>
> Can this be achieved?
>
> A better explanation of what I am trying to achieve is here:
>
> http://opennebula.org/cached-ssd-storage-infrastructure-for-vms/
>
> This talk, if it gets voted in, looks interesting -
> https://www.openstack.org/summit/austin-2016/vote-for-speakers/Presentation/6827
>
> Can anyone help?
>
> Thanks
>
> Daniel
[ceph-users] Local SSD cache for ceph on each compute node.
Hi,

Let me start. Ceph is amazing, no it really is!

But a hypervisor reading and writing all its data off the network will add some latency to reads and writes.

So the hypervisor could do with a local cache, possibly SSD or even NVMe.

Spent a while looking into this, but it seems really strange that few people see the value of this.

Basically the cache would be used in two ways:

a) cache hot data
b) writeback cache for ceph writes

There is the RBD cache, but that isn't disk based, and on a hypervisor memory is at a premium.

A simple solution would be to put a journal on each compute node and get each hypervisor to use its own journal. Would this work?

Something like this:
http://sebastien-han.fr/images/ceph-cache-pool-compute-design.png

Can this be achieved?

A better explanation of what I am trying to achieve is here:

http://opennebula.org/cached-ssd-storage-infrastructure-for-vms/

This talk, if it gets voted in, looks interesting -
https://www.openstack.org/summit/austin-2016/vote-for-speakers/Presentation/6827

Can anyone help?

Thanks

Daniel
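The latency Daniel describes is easiest to see with a small-block sync-write benchmark run once against the raw RBD and once against whatever cache device sits in front of it. A rough fio invocation for that comparison (the device path is a placeholder; run against a scratch device, as this writes to it):

```shell
# QD=1 4k sync writes - the worst case a local writeback cache is
# meant to absorb. Compare the "clat" percentiles between the bare
# RBD and the cached device.
fio --name=syncwrite --filename=/dev/rbd0 \
    --rw=write --bs=4k --ioengine=libaio --iodepth=1 \
    --direct=1 --sync=1 --runtime=30 --time_based
```

On a networked RBD each of these writes pays at least one cluster round trip, which is why even a small local SSD cache can change the numbers dramatically.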