Re: [ceph-users] Local SSD cache for ceph on each compute node.

2016-03-30 Thread Nick Fisk
> >>
> >> On 03/29/2016 04:35 PM, Nick Fisk wrote:
> >>> One thing I picked up on when looking at dm-cache for doing caching
> >>> with RBD's is that it wasn't really designed to be used as a
> >>> writeback cache for new writes, as in how you would expect a
> >>> traditional writeback cache to work. It seems all the policies are
> >>> designed around the idea that writes go to cache only if the block
> >>> is already in the cache (through reads) or it's hot enough to
> >>> promote. Although there did seem to be some tunables to alter this
> >>> behaviour, posts on the mailing list seemed to suggest this wasn't
> >>> how it was designed to be used. I'm not sure if this has been addressed
> since I last looked at it though.
> >>>
> >>> Depending on if you are trying to accelerate all writes, or just
> >>> your
> > "hot"
> >>> blocks, this may or may not matter. Even <1GB local caches can make
> >>> a huge difference to sync writes.
> >> Hi Nick,
> >>
> >> Some of the caching policies have changed recently as the team has
> >> looked at different workloads.
> >>
> >> Happy to introduce you to them if you want to discuss offline or post
> >> comments over on their list: device-mapper development <de...@redhat.com>
> >>
> >> thanks!
> >>
> >> Ric
> > Hi Ric,
> >
> > Thanks for the heads up, just from a quick flick through I can see
> > there are now separate read and write promotion thresholds, so I can
> > see just from that it would be a lot more suitable for what I
> > intended. I might try and find some time to give it another test.
> >
> > Nick
> 
> Let us know how it works out for you, I know that they are very interested in
> making sure things are useful :)

Hi Ric,

I have given it another test and unfortunately it seems it's still not giving 
the improvements that I was expecting.

Here is a rough description of my test

10GB RBD
1GB ZRAM kernel device for cache (Testing only)

0 20971520 cache 8 106/4096 64 32768/32768 2492 1239 349993 113194 47157 47157 0 1 writeback 2 migration_threshold 8192 mq 10 random_threshold 0 sequential_threshold 0 discard_promote_adjustment 1 read_promote_adjustment 4 write_promote_adjustment 0 rw -
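
For reference, a test rig along these lines could be assembled roughly as
follows. This is only a sketch: the device names, the zram split and the
policy arguments are assumptions inferred from the status line above, not
the exact commands I used.

  # 1GB ZRAM device to act as the cache (testing only)
  modprobe zram num_devices=1
  echo 1G > /sys/block/zram0/disksize

  # map the 10GB test image; appears as e.g. /dev/rbd0
  rbd map rbd/dmcache-test

  # dm-cache needs separate metadata and data devices, carved out of the
  # zram device here with dm-linear (sizes in 512-byte sectors)
  dmsetup create cmeta --table '0 65536 linear /dev/zram0 0'
  dmsetup create cdata --table '0 2031616 linear /dev/zram0 65536'

  # 20971520 sectors = 10GiB origin, 64-sector (32KiB) cache blocks,
  # writeback mode, mq policy with write_promote_adjustment 0 so that
  # new writes are promoted into the cache
  dmsetup create rbd-cached --table "0 20971520 cache /dev/mapper/cmeta \
    /dev/mapper/cdata /dev/rbd0 64 1 writeback mq 2 write_promote_adjustment 0"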

I'm then running a directio 64kb seq write QD=1 bench with fio to the DM device.
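
That workload can be expressed with something like the following fio job
(the device path is an assumption matching the sketch above):

  fio --name=seq-write-64k --filename=/dev/mapper/rbd-cached \
      --rw=write --bs=64k --iodepth=1 --direct=1 \
      --ioengine=libaio --runtime=300 --time_based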

What I expect to happen is for this sequential stream of 64kb IOs to be 
coalesced into 4MB IOs and written out to the RBD at as high a queue depth as 
possible/required, effectively meaning my 64kb sequential write bandwidth 
should match the 4MB sequential bandwidth limit of my cluster. I'm more 
interested in replicating the behaviour of a write cache on a battery-backed 
RAID card than a RW SSD cache, if that makes sense?

An example real-life scenario would be sitting underneath an iSCSI target; 
something like ESXi generates that IO pattern when moving VMs between 
datastores.

What I see is a sudden burst of speed at the start of the fio test, but it 
quickly drops down to the speed of the underlying RBD device. The dirty-blocks 
counter never seems to go very high, so I don't think it's a cache-full 
problem. The counter is probably no more than about 40% when the slowdown 
starts, and then it drops to less than 10% for the remainder of the test as it 
crawls along. It feels like it hits some sort of throttle and never recovers.
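
For anyone wanting to reproduce this, the dirty/destage behaviour can be
watched during the run; in the cache target's status line the dirty-block
count is the field just before the feature list (the "0" before "1 writeback"
in the output above). Device name assumed as in the sketch earlier:

  watch -n 1 'dmsetup status rbd-cached'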

I've done similar tests with flashcache and it gets more stable performance 
over a longer period of time, but the associative hit set behaviour seems to 
cause write misses due to the sequential IO pattern, which limits overall top 
performance. 


Nick

> 
> ric
> 




Re: [ceph-users] Local SSD cache for ceph on each compute node.

2016-03-29 Thread Ric Wheeler

On 03/29/2016 04:53 PM, Nick Fisk wrote:

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
Ric Wheeler
Sent: 29 March 2016 14:40
To: Nick Fisk <n...@fisk.me.uk>; 'Sage Weil' <s...@newdream.net>
Cc: ceph-users@lists.ceph.com; device-mapper development 
Subject: Re: [ceph-users] Local SSD cache for ceph on each compute node.

On 03/29/2016 04:35 PM, Nick Fisk wrote:

One thing I picked up on when looking at dm-cache for doing caching
with RBD's is that it wasn't really designed to be used as a writeback
cache for new writes, as in how you would expect a traditional
writeback cache to work. It seems all the policies are designed around
the idea that writes go to cache only if the block is already in the
cache (through reads) or it's hot enough to promote. Although there did
seem to be some tunables to alter this behaviour, posts on the mailing
list seemed to suggest this wasn't how it was designed to be used. I'm
not sure if this has been addressed since I last looked at it though.

Depending on if you are trying to accelerate all writes, or just your

"hot"

blocks, this may or may not matter. Even <1GB local caches can make a
huge difference to sync writes.

Hi Nick,

Some of the caching policies have changed recently as the team has looked
at different workloads.

Happy to introduce you to them if you want to discuss offline or post
comments over on their list: device-mapper development 

thanks!

Ric

Hi Ric,

Thanks for the heads up, just from a quick flick through I can see there are
now separate read and write promotion thresholds, so I can see just from
that it would be a lot more suitable for what I intended. I might try and
find some time to give it another test.

Nick


Let us know how it works out for you, I know that they are very interested in 
making sure things are useful :)


ric




Re: [ceph-users] Local SSD cache for ceph on each compute node.

2016-03-29 Thread Nick Fisk
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Ric Wheeler
> Sent: 29 March 2016 14:40
> To: Nick Fisk <n...@fisk.me.uk>; 'Sage Weil' <s...@newdream.net>
> Cc: ceph-users@lists.ceph.com; device-mapper development <de...@redhat.com>
> Subject: Re: [ceph-users] Local SSD cache for ceph on each compute node.
> 
> On 03/29/2016 04:35 PM, Nick Fisk wrote:
> > One thing I picked up on when looking at dm-cache for doing caching
> > with RBD's is that it wasn't really designed to be used as a writeback
> > cache for new writes, as in how you would expect a traditional
> > writeback cache to work. It seems all the policies are designed around
> > the idea that writes go to cache only if the block is already in the
> > cache (through reads) or it's hot enough to promote. Although there did
> > seem to be some tunables to alter this behaviour, posts on the mailing
> > list seemed to suggest this wasn't how it was designed to be used. I'm
> > not sure if this has been addressed since I last looked at it though.
> >
> > Depending on if you are trying to accelerate all writes, or just your
"hot"
> > blocks, this may or may not matter. Even <1GB local caches can make a
> > huge difference to sync writes.
> 
> Hi Nick,
> 
> Some of the caching policies have changed recently as the team has looked
> at different workloads.
> 
> Happy to introduce you to them if you want to discuss offline or post
> comments over on their list: device-mapper development <de...@redhat.com>
> 
> thanks!
> 
> Ric

Hi Ric,

Thanks for the heads up, just from a quick flick through I can see there are
now separate read and write promotion thresholds, so I can see just from
that it would be a lot more suitable for what I intended. I might try and
find some time to give it another test.

Nick

> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



Re: [ceph-users] Local SSD cache for ceph on each compute node.

2016-03-29 Thread Ric Wheeler

On 03/29/2016 04:35 PM, Nick Fisk wrote:

One thing I picked up on when looking at dm-cache for doing caching with
RBD's is that it wasn't really designed to be used as a writeback cache for
new writes, as in how you would expect a traditional writeback cache to
work. It seems all the policies are designed around the idea that writes go
to cache only if the block is already in the cache (through reads) or it's
hot enough to promote. Although there did seem to be some tunables to alter
this behaviour, posts on the mailing list seemed to suggest this wasn't how
it was designed to be used. I'm not sure if this has been addressed since I
last looked at it though.

Depending on if you are trying to accelerate all writes, or just your "hot"
blocks, this may or may not matter. Even <1GB local caches can make a huge
difference to sync writes.


Hi Nick,

Some of the caching policies have changed recently as the team has looked at 
different workloads.


Happy to introduce you to them if you want to discuss offline or post comments 
over on their list: device-mapper development 


thanks!

Ric



Re: [ceph-users] Local SSD cache for ceph on each compute node.

2016-03-29 Thread Nick Fisk
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Ric Wheeler
> Sent: 29 March 2016 14:07
> To: Sage Weil <s...@newdream.net>
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Local SSD cache for ceph on each compute node.
> 
> On 03/29/2016 03:42 PM, Sage Weil wrote:
> > On Tue, 29 Mar 2016, Ric Wheeler wrote:
> >>> However, if the write cache would be "flushed in-order" to
> >>> Ceph you would just lose x seconds of data and, hopefully, not have
> >>> a corrupted disk. That could be acceptable for some people. I was
> >>> just stressing that that isn't the case.
> >> This in order assumption - speaking as some one who has a long
> >> history in kernel file and storage - is the wrong assumption.
> >>
> >> Don't think of the cache device and RBD as separate devices, once
> >> they are configured like this, they are the same device from the
> >> point of view of the file system (or whatever) that runs on top of
them.
> >>
> >> The cache and its caching policy can vary, but it is perfectly
> >> reasonable to have data live only in that caching layer pretty much
> >> forever. Local disk caches can also do this by the way :)
> > That's true for current caching devices like dm-cache, but does not
> > need to be true--and I think that's what Robert is getting at.  The
> > plan for RBD, for example, is to implement a client-side cache that
> > has an ordered writeback policy, similar to the one described in this
paper:
> >
> >
> > https://www.usenix.org/conference/fast13/technical-sessions/presentati
> > on/koller
> >
> > In that scenario, loss of the cache devices leaves you with a stale
> > but crash-consistent image on the base device.
> 
> Certainly if we design a distributed system aware caching layer, things
can be
> different.
> 
> I think that the trade offs for using a local to the client cache are
certainly a bit
> confusing for normal users, but they are pretty popular these days even
> given the limits.
> 
> Enterprise storage systems (big EMC/Hitachi/etc arrays) have been
> effectively implemented as distributed systems for quite a long time and
> these server local caches are routinely used for clients for their virtual
LUNs.
> 
> Adding a performance-boosting caching layer (especially for a virtual
guest) to
> use a local SSD I think has a lot of upsides even if it does not solve the
> problem for the migrating device case.
> 
> >
> >> The whole flushing order argument is really not relevant here. I could
> >> "flush in order" after a minute, a week or a year. If the cache is
large
> >> enough, you might have zero data land on the backing store (even if the
> >> destage policy does it as you suggest as in order).
> > I think the assumption is you would place a bound on the amount of dirty
> > data in the cache.  Since you need to checkpoint cache content (on, say,
> > flush boundaries), that roughly bounds the size of the cache by the
amount
> > of data written, even if it is repeatedly scribbling over the same
blocks.
> 
> Keep in mind that the device mapper device is the device you see at the
> client -
> when you flush, you are flushing to it.  It is designed as a cache for a
local
> device, the fact that ceph rbd is under it (and has a distributed backend)
is
> not really of interest in the current generation at least.
> 
> Perfectly legal to have data live only in the SSD for example, not land in
the
> backing device.
> 
> How we bound and manage the cache and the life cycle of the data is
> something
> that Joe and the device mapper people have been actively working on.
> 
> I don't think that ordering alone is enough for any local linux file
system.
> 
> The promises made from the storage layer up to the kernel file system
stack
> are
> basically that any transaction we commit (using synchronize_cache or
similar
> mechanisms) is durable across a power failure. We don't have assumptions
> on
> ordering with regards to other writes (i.e., when we care, we flush the
world
> which is a whole device sync, or sync and FUA (gory details of the
multiple
> incarnations of this live in the kernel Documentation subtree in
> block/writeback_cache_control.txt).
> 
> >
> >> That all said, the reason to use a write cache on top of client block
> >> device - rbd or other - is to improve performance for the client.
> >>
> >> Any time we make our failure domain require fully operating two devices
> >> (the cache device and the original device

Re: [ceph-users] Local SSD cache for ceph on each compute node.

2016-03-29 Thread Sage Weil
On Tue, 29 Mar 2016, Ric Wheeler wrote:
> > However, if the write cache would be "flushed in-order" to Ceph 
> > you would just lose x seconds of data and, hopefully, not have a 
> > corrupted disk. That could be acceptable for some people. I was just 
> > stressing that that isn’t the case.
> 
> This in order assumption - speaking as some one who has a long history 
> in kernel file and storage - is the wrong assumption.
> 
> Don't think of the cache device and RBD as separate devices, once they 
> are configured like this, they are the same device from the point of 
> view of the file system (or whatever) that runs on top of them.
> 
> The cache and its caching policy can vary, but it is perfectly 
> reasonable to have data live only in that caching layer pretty much 
> forever. Local disk caches can also do this by the way :)

That's true for current caching devices like dm-cache, but does not need 
to be true--and I think that's what Robert is getting at.  The plan for 
RBD, for example, is to implement a client-side cache that has an ordered 
writeback policy, similar to the one described in this paper:

 https://www.usenix.org/conference/fast13/technical-sessions/presentation/koller

In that scenario, loss of the cache devices leaves you with a stale but 
crash-consistent image on the base device.

> The whole flushing order argument is really not relevant here. I could 
> "flush in order" after a minute, a week or a year. If the cache is large 
> enough, you might have zero data land on the backing store (even if the 
> destage policy does it as you suggest as in order).

I think the assumption is you would place a bound on the amount of dirty 
data in the cache.  Since you need to checkpoint cache content (on, say, 
flush boundaries), that roughly bounds the size of the cache by the amount 
of data written, even if it is repeatedly scribbling over the same blocks.

> That all said, the reason to use a write cache on top of client block 
> device - rbd or other - is to improve performance for the client.
> 
> Any time we make our failure domain require fully operating two devices 
> (the cache device and the original device), we increase the probability 
> of a non-recoverable failure.  In effect, the reliability of the storage 
> is at best as reliable as the least reliable part of the pair.

The goal is to add a new choice on the spectrum between (1) all writes are 
replicated across the cluster in order to get a consistent and up-to-date 
image when the client+cache fail, and (2) a writeback that gives you fast 
writes but leaves you with a corrupted (and stale) image after such a 
failure.  Ordered writeback (1.5) gives you low write latency and a stale 
but crash consistent image.  I suspect this will be a sensible choice for 
a lot of different use cases and workloads.

Is anything like this on the dm-cache roadmap as well?  It's probably less 
useful when the cache device lives in the same host (compared to a 
client/cluster arrangement more typical of RBD where a client host failure 
takes out the cache device but not the base image), but it might still be 
worth considering.

sage


Re: [ceph-users] Local SSD cache for ceph on each compute node.

2016-03-29 Thread Ric Wheeler

On 03/29/2016 01:35 PM, Van Leeuwen, Robert wrote:

If you try to look at the rbd device under dm-cache from another host, of course
any data that was cached on the dm-cache layer will be missing since the
dm-cache device itself is local to the host you wrote the data from originally.

And here it can (and probably will) go horribly wrong.
If you miss the dm-cache device (cache/hypervisor failure) you will probably 
end up with an inconsistent filesystem.
This is because dm-cache is not an ordered write-back cache afaik.


I think that you are twisting together two unrelated points.

dm-cache does do proper ordering.

If you use it to cache writes and then take it effectively out of the picture
(i.e., never destage that data from cache), you end up with holes in a file 
system.

Nothing to do with ordering, all to do with having a write back cache enabled
and then chopping the write back cached data out of the picture.

Way back in this thread it was mentioned you would just lose a few seconds of 
data when you lose the cache device.

My point was that when you lose the cache device you do not just miss x seconds 
of data but probably lose the whole filesystem.


True.


This is because the cache is not “ordered” and random parts, probably the “hot 
data” you care about, never made it from the cache device into ceph.


Totally unrelated.



However, if the write cache would be "flushed in-order" to Ceph you would 
just lose x seconds of data and, hopefully, not have a corrupted disk.
That could be acceptable for some people. I was just stressing that that isn’t 
the case.


This in order assumption - speaking as some one who has a long history in kernel 
file and storage - is the wrong assumption.


Don't think of the cache device and RBD as separate devices, once they are 
configured like this, they are the same device from the point of view of the 
file system (or whatever) that runs on top of them.


The cache and its caching policy can vary, but it is perfectly reasonable to 
have data live only in that caching layer pretty much forever. Local disk caches 
can also do this by the way :)


The whole flushing order argument is really not relevant here. I could "flush in 
order" after a minute, a week or a year. If the cache is large enough, you might 
have zero data land on the backing store (even if the destage policy does it as 
you suggest as in order).


That all said, the reason to use a write cache on top of client block device - 
rbd or other - is to improve performance for the client.


Any time we make our failure domain require fully operating two devices (the 
cache device and the original device), we increase the probability of a 
non-recoverable failure.  In effect, the reliability of the storage is at best 
as reliable as the least reliable part of the pair.



If you use dm-cache as a write through cache, this is not a problem (i.e., we
would only be used to cache reads).

Caching reads is fine :)




Re: [ceph-users] Local SSD cache for ceph on each compute node.

2016-03-29 Thread Van Leeuwen, Robert

>>> If you try to look at the rbd device under dm-cache from another host, of 
>>> course
>>> any data that was cached on the dm-cache layer will be missing since the
>>> dm-cache device itself is local to the host you wrote the data from 
>>> originally.
>> And here it can (and probably will) go horribly wrong.
>> If you miss the dm-cache device (cache/hypervisor failure) you will probably 
>> end up with an inconsistent filesystem.
>> This is because dm-cache is not an ordered write-back cache afaik.
>>
>
>I think that you are twisting together two unrelated points.
>
>dm-cache does do proper ordering.
>
>If you use it to cache writes and then take it effectively out of the picture 
>(i.e., never destage that data from cache), you end up with holes in a file 
>system.
>
>Nothing to do with ordering, all to do with having a write back cache enabled 
>and then chopping the write back cached data out of the picture.

Way back in this thread it was mentioned you would just lose a few seconds of 
data when you lose the cache device.

My point was that when you lose the cache device you do not just miss x seconds 
of data but probably lose the whole filesystem.
This is because the cache is not “ordered” and random parts, probably the “hot 
data” you care about, never made it from the cache device into ceph.

However, if the write cache would be "flushed in-order" to Ceph you would 
just lose x seconds of data and, hopefully, not have a corrupted disk.
That could be acceptable for some people. I was just stressing that that isn’t 
the case.

>
>If you use dm-cache as a write through cache, this is not a problem (i.e., we 
>would only be used to cache reads).

Caching reads is fine :)


Re: [ceph-users] Local SSD cache for ceph on each compute node.

2016-03-29 Thread Ric Wheeler

On 03/29/2016 10:06 AM, Van Leeuwen, Robert wrote:

On 3/27/16, 9:59 AM, "Ric Wheeler"  wrote:






On 03/16/2016 12:15 PM, Van Leeuwen, Robert wrote:

My understanding of how a writeback cache should work is that it should only 
take a few seconds for writes to be streamed onto the network and is focussed 
on resolving the speed issue of small sync writes. The writes would be bundled 
into larger writes that are not time sensitive.

So there is potential for a few seconds data loss but compared to the current 
trend of using ephemeral storage to solve this issue, it's a major improvement.

I think it is a bit worse than just a few seconds of data:
As mentioned in the blueprint for ceph you would need some kind of ordered 
write-back cache that maintains checkpoints internally.

I am not that familiar with the internals of dm-cache but I do not think it 
guarantees any write order.
E.g. By default it will bypass the cache for sequential IO.

So I think it is very likely the “few seconds of data loss" in this case means 
the filesystem is corrupt and you could lose the whole thing.
At the very least you will need to run fsck on it and hope it can sort out all 
of the errors with minimal data loss.



I might be misunderstanding your point above, but dm-cache provides persistent
storage. It will be there when you reboot and look for data on that same box.
dm-cache is also power failure safe and tested to survive this kind of outage.

Correct, dm-cache is power-failure safe assuming all hardware survives the 
reboot.


If you try to look at the rbd device under dm-cache from another host, of course
any data that was cached on the dm-cache layer will be missing since the
dm-cache device itself is local to the host you wrote the data from originally.

And here it can (and probably will) go horribly wrong.
If you miss the dm-cache device (cache/hypervisor failure) you will probably 
end up with an inconsistent filesystem.
This is because dm-cache is not an ordered write-back cache afaik.



I think that you are twisting together two unrelated points.

dm-cache does do proper ordering.

If you use it to cache writes and then take it effectively out of the picture 
(i.e., never destage that data from cache), you end up with holes in a file system.


Nothing to do with ordering, all to do with having a write back cache enabled 
and then chopping the write back cached data out of the picture.


This is no different than any other write cache, you need the write cache device 
and the back end storage both to be present to see all of the data.


If you use dm-cache as a write through cache, this is not a problem (i.e., we 
would only be used to cache reads).


Ric




Re: [ceph-users] Local SSD cache for ceph on each compute node.

2016-03-29 Thread Van Leeuwen, Robert

On 3/27/16, 9:59 AM, "Ric Wheeler"  wrote:





>On 03/16/2016 12:15 PM, Van Leeuwen, Robert wrote:
>>> My understanding of how a writeback cache should work is that it should 
>>> only take a few seconds for writes to be streamed onto the network and is 
>>> focussed on resolving the speed issue of small sync writes. The writes 
>>> would be bundled into larger writes that are not time sensitive.
>>>
>>> So there is potential for a few seconds data loss but compared to the 
>>> current trend of using ephemeral storage to solve this issue, it's a major 
>>> improvement.
>> I think it is a bit worse than just a few seconds of data:
>> As mentioned in the blueprint for ceph you would need some kind of ordered 
>> write-back cache that maintains checkpoints internally.
>>
>> I am not that familiar with the internals of dm-cache but I do not think it 
>> guarantees any write order.
>> E.g. By default it will bypass the cache for sequential IO.
>>
>> So I think it is very likely the “few seconds of data loss" in this case 
>> means the filesystem is corrupt and you could lose the whole thing.
>> At the very least you will need to run fsck on it and hope it can sort out 
>> all of the errors with minimal data loss.
>>
>>
>I might be misunderstanding your point above, but dm-cache provides persistent 
>storage. It will be there when you reboot and look for data on that same box. 
>dm-cache is also power failure safe and tested to survive this kind of outage.

Correct, dm-cache is power-failure safe assuming all hardware survives the 
reboot.

>If you try to look at the rbd device under dm-cache from another host, of 
>course 
>any data that was cached on the dm-cache layer will be missing since the 
>dm-cache device itself is local to the host you wrote the data from originally.

And here it can (and probably will) go horribly wrong.
If you miss the dm-cache device (cache/hypervisor failure) you will probably 
end up with an inconsistent filesystem.
This is because dm-cache is not an ordered write-back cache afaik.


Cheers,
Robert van Leeuwen


Re: [ceph-users] Local SSD cache for ceph on each compute node.

2016-03-27 Thread Daniel Niasoff
Hi Ric,

But you would still have to set a dm-cache per rbd volume which makes it 
difficult to manage.

There needs to be a global setting either within kvm or ceph that caches 
reads/writes before they hit the rbd device.

Thanks

Daniel

-Original Message-
From: Ric Wheeler [mailto:rwhee...@redhat.com] 
Sent: 27 March 2016 09:00
To: Van Leeuwen, Robert <rovanleeu...@ebay.com>; Daniel Niasoff 
<dan...@redactus.co.uk>; Jason Dillaman <dilla...@redhat.com>
Cc: ceph-users@lists.ceph.com; Mike Snitzer <snit...@redhat.com>; Joe Thornber 
<thorn...@redhat.com>
Subject: Re: [ceph-users] Local SSD cache for ceph on each compute node.

On 03/16/2016 12:15 PM, Van Leeuwen, Robert wrote:
>> My understanding of how a writeback cache should work is that it should only 
>> take a few seconds for writes to be streamed onto the network and is 
>> focussed on resolving the speed issue of small sync writes. The writes would 
>> be bundled into larger writes that are not time sensitive.
>>
>> So there is potential for a few seconds data loss but compared to the 
>> current trend of using ephemeral storage to solve this issue, it's a major 
>> improvement.
> I think it is a bit worse than just a few seconds of data:
> As mentioned in the blueprint for ceph you would need some kind of ordered 
> write-back cache that maintains checkpoints internally.
>
> I am not that familiar with the internals of dm-cache but I do not think it 
> guarantees any write order.
> E.g. By default it will bypass the cache for sequential IO.
>
> So I think it is very likely the “few seconds of data loss" in this case 
> means the filesystem is corrupt and you could lose the whole thing.
> At the very least you will need to run fsck on it and hope it can sort out 
> all of the errors with minimal data loss.
>
>
> So, to me, it seems contradictory to use persistent storage and then hope 
> your volumes survive a power outage.
>
> If you can survive missing that data you are probably better off running fully 
> from ephemeral storage in the first place.
>
> Cheers,
> Robert van Leeuwen

Hi Robert,

I might be misunderstanding your point above, but dm-cache provides persistent 
storage. It will be there when you reboot and look for data on that same box. 
dm-cache is also power failure safe and tested to survive this kind of outage.

If you try to look at the rbd device under dm-cache from another host, of 
course any data that was cached on the dm-cache layer will be missing since the 
dm-cache device itself is local to the host you wrote the data from originally.

In a similar way, using dm-cache for write caching (or any write cache local to 
a client) will also mean that your data has a single point of failure since 
that data will not be replicated out to the backing store until it is destaged 
from cache.

I would note that this is exactly the kind of write cache that is popular these 
days in front of enterprise storage arrays on clients so this is not really 
uncommon.

Regards,

Ric




Re: [ceph-users] Local SSD cache for ceph on each compute node.

2016-03-27 Thread Ric Wheeler

On 03/16/2016 12:15 PM, Van Leeuwen, Robert wrote:

My understanding of how a writeback cache should work is that it should only 
take a few seconds for writes to be streamed onto the network and is focussed 
on resolving the speed issue of small sync writes. The writes would be bundled 
into larger writes that are not time sensitive.

So there is potential for a few seconds data loss but compared to the current 
trend of using ephemeral storage to solve this issue, it's a major improvement.

I think it is a bit worse than just a few seconds of data:
As mentioned in the blueprint for ceph you would need some kind of ordered 
write-back cache that maintains checkpoints internally.

I am not that familiar with the internals of dm-cache but I do not think it 
guarantees any write order.
E.g. By default it will bypass the cache for sequential IO.

So I think it is very likely the “few seconds of data loss" in this case means 
the filesystem is corrupt and you could lose the whole thing.
At the very least you will need to run fsck on it and hope it can sort out all 
of the errors with minimal data loss.


So, to me, it seems contradictory to use persistent storage and then hope 
your volumes survive a power outage.

If you can survive missing that data you are probably better off running fully 
from ephemeral storage in the first place.

Cheers,
Robert van Leeuwen


Hi Robert,

I might be misunderstanding your point above, but dm-cache provides persistent 
storage. It will be there when you reboot and look for data on that same box. 
dm-cache is also power failure safe and tested to survive this kind of outage.


If you try to look at the rbd device under dm-cache from another host, of course 
any data that was cached on the dm-cache layer will be missing since the 
dm-cache device itself is local to the host you wrote the data from originally.


In a similar way, using dm-cache for write caching (or any write cache local to 
a client) will also mean that your data has a single point of failure since that 
data will not be replicated out to the backing store until it is destaged from 
cache.


I would note that this is exactly the kind of write cache that is popular these 
days in front of enterprise storage arrays on clients so this is not really 
uncommon.


Regards,

Ric




Re: [ceph-users] Local SSD cache for ceph on each compute node.

2016-03-19 Thread Nick Fisk


> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Daniel Niasoff
> Sent: 16 March 2016 21:02
> To: Nick Fisk <n...@fisk.me.uk>; 'Van Leeuwen, Robert'
> <rovanleeu...@ebay.com>; 'Jason Dillaman' <dilla...@redhat.com>
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Local SSD cache for ceph on each compute node.
> 
> Hi Nick,
> 
> Your solution requires manual configuration for each VM and cannot be
> set up as part of an automated OpenStack deployment.

Absolutely, potentially flaky as well.

> 
> It would be really nice if it was a hypervisor based setting as opposed to
a VM
> based setting.

Yes, I can't wait until we can just specify "rbd_cache_device=/dev/ssd" in
the ceph.conf and get it to write to that instead. Ideally ceph would also
provide some sort of lightweight replication for the cache devices, but
otherwise an iSCSI SSD farm or switched SAS could be used so that the caching
device is not tied to one physical host.

> 
> Thanks
> 
> Daniel
> 
> -Original Message-
> From: Nick Fisk [mailto:n...@fisk.me.uk]
> Sent: 16 March 2016 08:59
> To: Daniel Niasoff <dan...@redactus.co.uk>; 'Van Leeuwen, Robert'
> <rovanleeu...@ebay.com>; 'Jason Dillaman' <dilla...@redhat.com>
> Cc: ceph-users@lists.ceph.com
> Subject: RE: [ceph-users] Local SSD cache for ceph on each compute node.
> 
> 
> 
> > -Original Message-
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> > Of Daniel Niasoff
> > Sent: 16 March 2016 08:26
> > To: Van Leeuwen, Robert <rovanleeu...@ebay.com>; Jason Dillaman
> > <dilla...@redhat.com>
> > Cc: ceph-users@lists.ceph.com
> > Subject: Re: [ceph-users] Local SSD cache for ceph on each compute node.
> >
> > Hi Robert,
> >
> > >Caching writes would be bad because a hypervisor failure would result
> > >in
> > loss of the cache which pretty much guarantees inconsistent data on
> > the ceph volume.
> > >Also live-migration will become problematic compared to running
> > everything from ceph since you will also need to migrate the
> local-storage.
> 
> I tested a solution using iSCSI for the cache devices. Each VM was using
> flashcache with a combination of an iSCSI LUN from an SSD and an RBD. This
gets
> around the problem of moving things around or if the hypervisor goes down.
> It's not local caching but the write latency is at least 10x lower than
the RBD.
> Note I tested it, I didn't put it into production :-)
> 
> >
> > My understanding of how a writeback cache should work is that it
> > should only take a few seconds for writes to be streamed onto the
> > network and is focussed on resolving the speed issue of small sync
> > writes. The writes
> would
> > be bundled into larger writes that are not time sensitive.
> >
> > So there is potential for a few seconds data loss but compared to the
> current
> > trend of using ephemeral storage to solve this issue, it's a major
> > improvement.
> 
> Yeah, problem is a couple of seconds' data loss means different things to
> different people.
> 
> >
> > > (considering the time required for setting up and maintaining the
> > > extra
> > caching layer on each vm, unless you work for free ;-)
> >
> > Couldn't agree more there.
> >
> > I am just so surprised how the openstack community haven't looked to
> > resolve this issue. Ephemeral storage is a HUGE compromise unless you
> > have built in failure into every aspect of your application but many
> > people use openstack as a general purpose devstack.
> >
> > (Jason pointed out his blueprint but I guess it's at least a year or 2
> away -
> > http://tracker.ceph.com/projects/ceph/wiki/Rbd_-_ordered_crash-
> > consistent_write-back_caching_extension)
> >
> > I see articles discussing the idea such as this one
> >
> > http://www.sebastien-han.fr/blog/2014/06/10/ceph-cache-pool-tiering-
> > scalable-cache/
> >
> > but no real straightforward  validated setup instructions.
> >
> > Thanks
> >
> > Daniel
> >
> >
> > -Original Message-
> > From: Van Leeuwen, Robert [mailto:rovanleeu...@ebay.com]
> > Sent: 16 March 2016 08:11
> > To: Jason Dillaman <dilla...@redhat.com>; Daniel Niasoff
> > <dan...@redactus.co.uk>
> > Cc: ceph-users@lists.ceph.com
> > Subject: Re: [ceph-users] Local SSD cache for ceph on each compute node.
> >
> > >Indeed, well understood.
> > >
> > >As a shorter t

Re: [ceph-users] Local SSD cache for ceph on each compute node.

2016-03-19 Thread Daniel Niasoff
Hi Nick,

Your solution requires manual configuration for each VM and cannot be set up as 
part of an automated OpenStack deployment.

It would be really nice if it was a hypervisor based setting as opposed to a VM 
based setting.

Thanks 

Daniel

-Original Message-
From: Nick Fisk [mailto:n...@fisk.me.uk] 
Sent: 16 March 2016 08:59
To: Daniel Niasoff <dan...@redactus.co.uk>; 'Van Leeuwen, Robert' 
<rovanleeu...@ebay.com>; 'Jason Dillaman' <dilla...@redhat.com>
Cc: ceph-users@lists.ceph.com
Subject: RE: [ceph-users] Local SSD cache for ceph on each compute node.



> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf 
> Of Daniel Niasoff
> Sent: 16 March 2016 08:26
> To: Van Leeuwen, Robert <rovanleeu...@ebay.com>; Jason Dillaman 
> <dilla...@redhat.com>
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Local SSD cache for ceph on each compute node.
> 
> Hi Robert,
> 
> >Caching writes would be bad because a hypervisor failure would result 
> >in
> loss of the cache which pretty much guarantees inconsistent data on 
> the ceph volume.
> >Also live-migration will become problematic compared to running
> everything from ceph since you will also need to migrate the
local-storage.

I tested a solution using iSCSI for the cache devices. Each VM was using 
flashcache with a combination of an iSCSI LUN from an SSD and an RBD. This gets 
around the problem of moving things around or if the hypervisor goes down.
It's not local caching but the write latency is at least 10x lower than the 
RBD. Note I tested it, I didn't put it into production :-)

> 
> My understanding of how a writeback cache should work is that it 
> should only take a few seconds for writes to be streamed onto the 
> network and is focussed on resolving the speed issue of small sync 
> writes. The writes
would
> be bundled into larger writes that are not time sensitive.
> 
> So there is potential for a few seconds data loss but compared to the
current
> trend of using ephemeral storage to solve this issue, it's a major 
> improvement.

Yeah, problem is a couple of seconds' data loss means different things to 
different people.

> 
> > (considering the time required for setting up and maintaining the 
> > extra
> caching layer on each vm, unless you work for free ;-)
> 
> Couldn't agree more there.
> 
> I am just so surprised how the openstack community haven't looked to 
> resolve this issue. Ephemeral storage is a HUGE compromise unless you 
> have built in failure into every aspect of your application but many 
> people use openstack as a general purpose devstack.
> 
> (Jason pointed out his blueprint but I guess it's at least a year or 2
away -
> http://tracker.ceph.com/projects/ceph/wiki/Rbd_-_ordered_crash-
> consistent_write-back_caching_extension)
> 
> I see articles discussing the idea such as this one
> 
> http://www.sebastien-han.fr/blog/2014/06/10/ceph-cache-pool-tiering-
> scalable-cache/
> 
> but no real straightforward  validated setup instructions.
> 
> Thanks
> 
> Daniel
> 
> 
> -Original Message-
> From: Van Leeuwen, Robert [mailto:rovanleeu...@ebay.com]
> Sent: 16 March 2016 08:11
> To: Jason Dillaman <dilla...@redhat.com>; Daniel Niasoff 
> <dan...@redactus.co.uk>
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Local SSD cache for ceph on each compute node.
> 
> >Indeed, well understood.
> >
> >As a shorter term workaround, if you have control over the VMs, you 
> >could
> always just slice out an LVM volume from local SSD/NVMe and pass it 
> through to the guest.  Within the guest, use dm-cache (or similar) to 
> add
a
> cache front-end to your RBD volume.
> 
> If you do this you need to setup your cache as read-cache only.
> Caching writes would be bad because a hypervisor failure would result 
> in
loss
> of the cache which pretty much guarantees inconsistent data on the 
> ceph volume.
> Also live-migration will become problematic compared to running 
> everything from ceph since you will also need to migrate the local-storage.
> 
> The question will be if adding more ram (== more read cache) would not 
> be more convenient and cheaper in the end.
> (considering the time required for setting up and maintaining the 
> extra caching layer on each vm, unless you work for free ;-) Also 
> reads from
ceph
> are pretty fast compared to the biggest bottleneck: (small) sync writes.
> So it is debatable how much performance you would win except for some 
> use-cases with lots of reads on very large data sets which are also 
> very latency sensitive.
> 
> Cheers,
> Robert van Leeuwen
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



Re: [ceph-users] Local SSD cache for ceph on each compute node.

2016-03-18 Thread Sebastien Han
I’d rather like to see this implemented at the hypervisor level, i.e.: QEMU, so 
we can have a common layer for all the storage backends.
Although this is less portable...

> On 17 Mar 2016, at 11:00, Nick Fisk <n...@fisk.me.uk> wrote:
> 
> 
> 
>> -Original Message-
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>> Daniel Niasoff
>> Sent: 16 March 2016 21:02
>> To: Nick Fisk <n...@fisk.me.uk>; 'Van Leeuwen, Robert'
>> <rovanleeu...@ebay.com>; 'Jason Dillaman' <dilla...@redhat.com>
>> Cc: ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users] Local SSD cache for ceph on each compute node.
>> 
>> Hi Nick,
>> 
>> Your solution requires manual configuration for each VM and cannot be
>> set up as part of an automated OpenStack deployment.
> 
> Absolutely, potentially flaky as well.
> 
>> 
>> It would be really nice if it was a hypervisor based setting as opposed to
> a VM
>> based setting.
> 
> Yes, I can't wait until we can just specify "rbd_cache_device=/dev/ssd" in
> the ceph.conf and get it to write to that instead. Ideally ceph would also
> provide some sort of lightweight replication for the cache devices, but
> otherwise an iSCSI SSD farm or switched SAS could be used so that the caching
> device is not tied to one physical host.
> 
>> 
>> Thanks
>> 
>> Daniel
>> 
>> -Original Message-
>> From: Nick Fisk [mailto:n...@fisk.me.uk]
>> Sent: 16 March 2016 08:59
>> To: Daniel Niasoff <dan...@redactus.co.uk>; 'Van Leeuwen, Robert'
>> <rovanleeu...@ebay.com>; 'Jason Dillaman' <dilla...@redhat.com>
>> Cc: ceph-users@lists.ceph.com
>> Subject: RE: [ceph-users] Local SSD cache for ceph on each compute node.
>> 
>> 
>> 
>>> -Original Message-----
>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
>>> Of Daniel Niasoff
>>> Sent: 16 March 2016 08:26
>>> To: Van Leeuwen, Robert <rovanleeu...@ebay.com>; Jason Dillaman
>>> <dilla...@redhat.com>
>>> Cc: ceph-users@lists.ceph.com
>>> Subject: Re: [ceph-users] Local SSD cache for ceph on each compute node.
>>> 
>>> Hi Robert,
>>> 
>>>> Caching writes would be bad because a hypervisor failure would result
>>>> in
>>> loss of the cache which pretty much guarantees inconsistent data on
>>> the ceph volume.
>>>> Also live-migration will become problematic compared to running
>>> everything from ceph since you will also need to migrate the
>> local-storage.
>> 
>> I tested a solution using iSCSI for the cache devices. Each VM was using
>> flashcache with a combination of an iSCSI LUN from an SSD and an RBD. This
> gets
>> around the problem of moving things around or if the hypervisor goes down.
>> It's not local caching but the write latency is at least 10x lower than
> the RBD.
>> Note I tested it, I didn't put it into production :-)
>> 
>>> 
>>> My understanding of how a writeback cache should work is that it
>>> should only take a few seconds for writes to be streamed onto the
>>> network and is focussed on resolving the speed issue of small sync
>>> writes. The writes
>> would
>>> be bundled into larger writes that are not time sensitive.
>>> 
>>> So there is potential for a few seconds data loss but compared to the
>> current
>>> trend of using ephemeral storage to solve this issue, it's a major
>>> improvement.
>> 
>> Yeah, problem is a couple of seconds' data loss means different things to
>> different people.
>> 
>>> 
>>>> (considering the time required for setting up and maintaining the
>>>> extra
>>> caching layer on each vm, unless you work for free ;-)
>>> 
>>> Couldn't agree more there.
>>> 
>>> I am just so surprised how the openstack community haven't looked to
>>> resolve this issue. Ephemeral storage is a HUGE compromise unless you
>>> have built in failure into every aspect of your application but many
>>> people use openstack as a general purpose devstack.
>>> 
>>> (Jason pointed out his blueprint but I guess it's at least a year or 2
>> away -
>>> http://tracker.ceph.com/projects/ceph/wiki/Rbd_-_ordered_crash-
>>> consistent_write-back_caching_extension)
>>> 
>>> I see articles discussing the idea such as this one
>>> 
>>> http://www.sebastien-han.fr/blog/2014/06/10/ceph-cache-pool-tierin

Re: [ceph-users] Local SSD cache for ceph on each compute node.

2016-03-16 Thread Daniel Niasoff
Hi Robert,

It seems I have to give up on this goal for now but wanted to be sure I wasn't 
missing something obvious.

>If you can survive missing that data you are probably better off running fully 
>from ephemeral storage in the first place.

What and lose the entire ephemeral disk since the VM was created? Am I missing 
something here or is there an automated way of syncing ephemeral disks from 
time to time with a ceph back end?

Thanks

Daniel

-Original Message-
From: Van Leeuwen, Robert [mailto:rovanleeu...@ebay.com] 
Sent: 16 March 2016 10:15
To: Daniel Niasoff <dan...@redactus.co.uk>; Jason Dillaman <dilla...@redhat.com>
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Local SSD cache for ceph on each compute node.

>
>My understanding of how a writeback cache should work is that it should only 
>take a few seconds for writes to be streamed onto the network and is focussed 
>on resolving the speed issue of small sync writes. The writes would be bundled 
>into larger writes that are not time sensitive.
>
>So there is potential for a few seconds data loss but compared to the current 
>trend of using ephemeral storage to solve this issue, it's a major improvement.

I think it is a bit worse than just a few seconds of data:
As mentioned in the blueprint for ceph you would need some kind of ordered 
write-back cache that maintains checkpoints internally.

I am not that familiar with the internals of dm-cache but I do not think it 
guarantees any write order.
E.g. By default it will bypass the cache for sequential IO.

So I think it is very likely the “few seconds of data loss" in this case means 
the filesystem is corrupt and you could lose the whole thing.
At the very least you will need to run fsck on it and hope it can sort out all 
of the errors with minimal data loss.


So, to me, it seems contradictory to use persistent storage and then hope 
your volumes survive a power outage.

If you can survive missing that data you are probably better off running fully 
from ephemeral storage in the first place.

Cheers,
Robert van Leeuwen



Re: [ceph-users] Local SSD cache for ceph on each compute node.

2016-03-16 Thread Van Leeuwen, Robert
>
>My understanding of how a writeback cache should work is that it should only 
>take a few seconds for writes to be streamed onto the network and is focussed 
>on resolving the speed issue of small sync writes. The writes would be bundled 
>into larger writes that are not time sensitive.
>
>So there is potential for a few seconds data loss but compared to the current 
>trend of using ephemeral storage to solve this issue, it's a major improvement.

I think it is a bit worse than just a few seconds of data:
As mentioned in the blueprint for ceph you would need some kind of ordered 
write-back cache that maintains checkpoints internally.

I am not that familiar with the internals of dm-cache but I do not think it 
guarantees any write order.
E.g. By default it will bypass the cache for sequential IO.
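
For what it's worth, the sequential bypass is an mq policy tunable: it shows
up in the cache table as sequential_threshold and can be changed by reloading
the table. A sketch only, with the device name and the rest of the table
assumed, and sequential_threshold 0 intended to stop sequential IO from being
routed around the cache:

  dmsetup table my-cache        # print the current cache table
  dmsetup reload my-cache --table "0 20971520 cache /dev/mapper/cmeta \
    /dev/mapper/cdata /dev/rbd0 64 1 writeback mq 2 sequential_threshold 0"
  dmsetup suspend my-cache && dmsetup resume my-cache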

So I think it is very likely the “few seconds of data loss" in this case means 
the filesystem is corrupt and you could lose the whole thing.
At the very least you will need to run fsck on it and hope it can sort out all 
of the errors with minimal data loss.


So, to me, it seems contradictory to use persistent storage and then hope 
your volumes survive a power outage.

If you can survive missing that data you are probably better off running fully 
from ephemeral storage in the first place.

Cheers,
Robert van Leeuwen



Re: [ceph-users] Local SSD cache for ceph on each compute node.

2016-03-16 Thread Nick Fisk


> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Daniel Niasoff
> Sent: 16 March 2016 08:26
> To: Van Leeuwen, Robert <rovanleeu...@ebay.com>; Jason Dillaman
> <dilla...@redhat.com>
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Local SSD cache for ceph on each compute node.
> 
> Hi Robert,
> 
> >Caching writes would be bad because a hypervisor failure would result in
> loss of the cache which pretty much guarantees inconsistent data on the
> ceph volume.
> >Also live-migration will become problematic compared to running
> everything from ceph since you will also need to migrate the
local-storage.

I tested a solution using iSCSI for the cache devices. Each VM was using
flashcache with a combination of an iSCSI LUN from an SSD and an RBD. This gets
around the problem of moving things around or if the hypervisor goes down.
It's not local caching but the write latency is at least 10x lower than the
RBD. Note I tested it, I didn't put it into production :-)
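
From inside a guest, that arrangement would look roughly like the sketch below
(the portal address, target IQN and device paths are all assumptions):

  # log in to the SSD-backed iSCSI LUN exported for this VM
  iscsiadm -m discovery -t sendtargets -p 10.0.0.10
  iscsiadm -m node -T iqn.2016-03.example:vm0-ssd -p 10.0.0.10 --login

  # map the RBD image backing this VM's data disk, e.g. /dev/rbd0
  rbd map rbd/vm0-data

  # flashcache in writeback mode: the iSCSI SSD LUN (/dev/sdc here)
  # sits in front of the RBD device
  flashcache_create -p back vm0cache /dev/sdc /dev/rbd0
  mkfs.xfs /dev/mapper/vm0cache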

> 
> My understanding of how a writeback cache should work is that it should
> only take a few seconds for writes to be streamed onto the network and is
> focussed on resolving the speed issue of small sync writes. The writes
would
> be bundled into larger writes that are not time sensitive.
> 
> So there is potential for a few seconds data loss but compared to the
current
> trend of using ephemeral storage to solve this issue, it's a major
> improvement.

Yeah, problem is a couple of seconds' data loss means different things to
different people.

> 
> > (considering the time required for setting up and maintaining the extra
> caching layer on each vm, unless you work for free ;-)
> 
> Couldn't agree more there.
> 
> I am just so surprised how the openstack community haven't looked to
> resolve this issue. Ephemeral storage is a HUGE compromise unless you have
> built in failure into every aspect of your application but many people use
> openstack as a general purpose devstack.
> 
> (Jason pointed out his blueprint but I guess it's at least a year or 2
away -
> http://tracker.ceph.com/projects/ceph/wiki/Rbd_-_ordered_crash-
> consistent_write-back_caching_extension)
> 
> I see articles discussing the idea such as this one
> 
> http://www.sebastien-han.fr/blog/2014/06/10/ceph-cache-pool-tiering-
> scalable-cache/
> 
> but no real straightforward  validated setup instructions.
> 
> Thanks
> 
> Daniel
> 
> 
> -Original Message-
> From: Van Leeuwen, Robert [mailto:rovanleeu...@ebay.com]
> Sent: 16 March 2016 08:11
> To: Jason Dillaman <dilla...@redhat.com>; Daniel Niasoff
> <dan...@redactus.co.uk>
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Local SSD cache for ceph on each compute node.
> 
> >Indeed, well understood.
> >
> >As a shorter term workaround, if you have control over the VMs, you could
> always just slice out an LVM volume from local SSD/NVMe and pass it
> through to the guest.  Within the guest, use dm-cache (or similar) to add
a
> cache front-end to your RBD volume.
> 
> If you do this you need to setup your cache as read-cache only.
> Caching writes would be bad because a hypervisor failure would result in
loss
> of the cache which pretty much guarantees inconsistent data on the ceph
> volume.
> Also live-migration will become problematic compared to running everything
> from ceph since you will also need to migrate the local-storage.
> 
> The question will be if adding more ram (== more read cache) would not be
> more convenient and cheaper in the end.
> (considering the time required for setting up and maintaining the extra
> caching layer on each vm, unless you work for free ;-) Also reads from
ceph
> are pretty fast compared to the biggest bottleneck: (small) sync writes.
> So it is debatable how much performance you would win except for some
> use-cases with lots of reads on very large data sets which are also very
> latency sensitive.
> 
> Cheers,
> Robert van Leeuwen
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



Re: [ceph-users] Local SSD cache for ceph on each compute node.

2016-03-16 Thread Daniel Niasoff
Hi Robert,

>Caching writes would be bad because a hypervisor failure would result in loss 
>of the cache which pretty much guarantees inconsistent data on the ceph volume.
>Also live-migration will become problematic compared to running everything 
>from ceph since you will also need to migrate the local-storage.

My understanding of how a writeback cache should work is that it should only 
take a few seconds for writes to be streamed onto the network and is focussed 
on resolving the speed issue of small sync writes. The writes would be bundled 
into larger writes that are not time sensitive.

So there is potential for a few seconds data loss but compared to the current 
trend of using ephemeral storage to solve this issue, it's a major improvement.

> (considering the time required for setting up and maintaining the extra 
> caching layer on each vm, unless you work for free ;-)

Couldn't agree more there.

I am just so surprised how the openstack community haven't looked to resolve 
this issue. Ephemeral storage is a HUGE compromise unless you have built in 
failure into every aspect of your application but many people use openstack as 
a general purpose devstack.

(Jason pointed out his blueprint but I guess it's at least a year or 2 away - 
http://tracker.ceph.com/projects/ceph/wiki/Rbd_-_ordered_crash-consistent_write-back_caching_extension)

I see articles discussing the idea such as this one 

http://www.sebastien-han.fr/blog/2014/06/10/ceph-cache-pool-tiering-scalable-cache/

but no real straightforward  validated setup instructions.

Thanks 

Daniel




Re: [ceph-users] Local SSD cache for ceph on each compute node.

2016-03-16 Thread Van Leeuwen, Robert
>Indeed, well understood.
>
>As a shorter term workaround, if you have control over the VMs, you could 
>always just slice out an LVM volume from local SSD/NVMe and pass it through to 
>the guest.  Within the guest, use dm-cache (or similar) to add a cache 
>front-end to your RBD volume.  

If you do this you need to set up your cache as read-cache only. 
Caching writes would be bad because a hypervisor failure would result in loss 
of the cache, which pretty much guarantees inconsistent data on the ceph volume.
Also, live-migration will become problematic compared to running everything from 
ceph, since you will also need to migrate the local storage.
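
As a rough sketch of that read-only approach inside the guest (assuming /dev/vdb 
is the RBD-backed data disk and /dev/vdc is the local-SSD slice passed through 
from the hypervisor; names and sizes are only illustrative), lvmcache in 
writethrough mode would look something like:

pvcreate /dev/vdb /dev/vdc
vgcreate vgdata /dev/vdb /dev/vdc
# origin LV on the RBD-backed disk only
lvcreate -n data -l 100%PVS vgdata /dev/vdb
# cache pool on the local SSD slice
lvcreate --type cache-pool -L 20G -n cpool vgdata /dev/vdc
# writethrough keeps ceph authoritative, so losing the SSD loses no data
lvconvert --type cache --cachepool vgdata/cpool --cachemode writethrough vgdata/data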

The question is whether adding more RAM (== more read cache) would not be more 
convenient and cheaper in the end (considering the time required for setting up 
and maintaining the extra caching layer on each VM, unless you work for free ;-)
Also, reads from ceph are pretty fast compared to the biggest bottleneck: 
(small) sync writes.
So it is debatable how much performance you would gain, except for some use cases 
with lots of reads on very large data sets which are also very latency 
sensitive.

Cheers,
Robert van Leeuwen



Re: [ceph-users] Local SSD cache for ceph on each compute node.

2016-03-15 Thread Daniel Niasoff
I am using OpenStack, so I need this to be fully automated and applied to all my 
VMs.

If I could do what you mention at the hypervisor level, that would be much 
easier.

The options you mention are, I guess, for very specific use cases and need 
to be configured on a per-VM basis, whilst I am looking for a general "ceph on 
steroids" approach for all my VMs without any maintenance.

Thanks again :)



Re: [ceph-users] Local SSD cache for ceph on each compute node.

2016-03-15 Thread Jason Dillaman
Indeed, well understood.

As a shorter term workaround, if you have control over the VMs, you could 
always just slice out an LVM volume from local SSD/NVMe and pass it through to 
the guest.  Within the guest, use dm-cache (or similar) to add a cache 
front-end to your RBD volume.  Others have also reported improvements by using 
the QEMU x-data-plane option and RAIDing several RBD images together within the 
VM.
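
To make the hypervisor side of that concrete (the volume group, domain and 
target names here are made up purely for illustration), it would look roughly 
like:

# carve a slice out of a VG sitting on the local NVMe/SSD
lvcreate -L 50G -n cache-vm01 nvme-vg
# hand it to the guest as an extra virtio disk (vdc here)
virsh attach-disk vm01 /dev/nvme-vg/cache-vm01 vdc --persistent

# and, inside the guest, the RAID suggestion amounts to striping several
# RBD-backed disks, e.g.:
mdadm --create /dev/md0 --level=0 --raid-devices=4 /dev/vd[b-e]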

-- 

Jason Dillaman 




Re: [ceph-users] Local SSD cache for ceph on each compute node.

2016-03-15 Thread Daniel Niasoff
Thanks.

Reassuring but I could do with something today :)



Re: [ceph-users] Local SSD cache for ceph on each compute node.

2016-03-15 Thread Jason Dillaman
The good news is that such a feature is in the early stages of design [1]. 
Hopefully it will land in the Kraken release timeframe.

[1] 
http://tracker.ceph.com/projects/ceph/wiki/Rbd_-_ordered_crash-consistent_write-back_caching_extension

-- 

Jason Dillaman 




[ceph-users] Local SSD cache for ceph on each compute node.

2016-03-15 Thread Daniel Niasoff
Hi,

Let me start. Ceph is amazing, no it really is!

But a hypervisor reading and writing all its data off the network will add some 
latency to reads and writes.

So the hypervisor could do with a local cache, possibly SSD or even NVMe.

I've spent a while looking into this, but it seems really strange that so few 
people see the value of it.

Basically, the cache would be used in two ways:

a) cache hot data
b) writeback cache for ceph writes

There is the RBD cache, but that isn't disk-based, and on a hypervisor memory is 
at a premium.
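
For what it's worth, that in-memory cache is tuned per client in ceph.conf; a 
minimal sketch (sizes in bytes, picked purely for illustration, and applied per 
volume) of the settings that end up eating hypervisor RAM:

cat >> /etc/ceph/ceph.conf <<'EOF'
[client]
rbd cache = true
rbd cache size = 67108864
rbd cache max dirty = 50331648
rbd cache writethrough until flush = true
EOF

Setting rbd cache max dirty = 0 makes it behave as a writethrough cache.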

A simple solution would be to put a journal on each compute node and get each 
hypervisor to use its own journal. Would this work?

Something like this  
http://sebastien-han.fr/images/ceph-cache-pool-compute-design.png

Can this be achieved?

A better explanation of what I am trying to achieve is here

http://opennebula.org/cached-ssd-storage-infrastructure-for-vms/

This talk, if it was voted in, looks interesting - 
https://www.openstack.org/summit/austin-2016/vote-for-speakers/Presentation/6827

Can anyone help?

Thanks

Daniel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com