Re: wip-librbd-caching

2012-04-18 Thread Sage Weil
On Wed, 18 Apr 2012, Martin Mailand wrote:
> On 12.04.2012 21:45, Sage Weil wrote:
> > The config options you'll want to look at are client_oc_* (in case you
> > didn't see that already :).  "oc" is short for objectcacher, and it isn't
> > only used for client (libcephfs), so it might be worth renaming these
> > options before people start using them.
> 
> Hi,
> 
> I changed the values and the performance is still very good and the memory
> footprint is much smaller.
> 
> OPTION(client_oc_size, OPT_INT, 1024*1024* 50)         // MB * n
> OPTION(client_oc_max_dirty, OPT_INT, 1024*1024* 25)    // MB * n  (dirty OR tx.. bigish)
> OPTION(client_oc_target_dirty, OPT_INT, 1024*1024* 8)  // target dirty (keep this smallish)
> // note: the max amount of "in flight" dirty data is roughly (max - target)
> 
> But I am not quite sure about the meaning of the values.
> client_oc_size Max size of the cache?

yes

> client_oc_max_dirty max dirty value before the writeback starts?

before writes block and wait for writeback to bring the dirty level down

> client_oc_target_dirty ???

before writeback starts
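
For reference, here is how the sizes Martin is testing would look pulled 
together in ceph.conf; this is just a sketch under the [client] section, 
using the option names as they are spelled today (they may change with the 
rename below):

[client]
    # 50 MB total cache size
    client_oc_size = 52428800
    # 25 MB: writes block once this much is dirty
    client_oc_max_dirty = 26214400
    # 8 MB: writeback starts once this much is dirty
    client_oc_target_dirty = 8388608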

BTW I renamed 'rbd cache enabled' -> 'rbd cache'.  I'd like to rename the 
objectcacher settings too so they aren't nested under client_ (which is 
the fs client code).

objectcacher_*?

sage


Re: wip-librbd-caching

2012-04-18 Thread Greg Farnum
On Wednesday, April 18, 2012 at 5:50 AM, Martin Mailand wrote:
> Hi,
>  
> I changed the values and the performance is still very good and the  
> memory footprint is much smaller.
>  
> OPTION(client_oc_size, OPT_INT, 1024*1024* 50)         // MB * n
> OPTION(client_oc_max_dirty, OPT_INT, 1024*1024* 25)    // MB * n  (dirty OR tx.. bigish)
> OPTION(client_oc_target_dirty, OPT_INT, 1024*1024* 8)  // target dirty (keep this smallish)
> // note: the max amount of "in flight" dirty data is roughly (max - target)
>  
> But I am not quite sure about the meaning of the values.
> client_oc_size Max size of the cache?
> client_oc_max_dirty max dirty value before the writeback starts?
> client_oc_target_dirty ???
>  

Right now the cache writeout algorithms are based on the amount of dirty 
data, rather than on something like how long the data has been dirty.
client_oc_size is the max (and therefore typical) size of the cache.
client_oc_max_dirty is the largest amount of dirty data allowed in the cache: 
if this much is already dirty and you try to dirty more, the dirtying 
operation (a write of some kind) will block until some of the existing dirty 
data has been committed.
client_oc_target_dirty is the amount of dirty data that will trigger the 
cache to start flushing data out.
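
To make the relationship between the two thresholds concrete, here is a 
minimal self-contained C++ sketch; it is not the actual ObjectCacher code, 
and the names (DirtyThrottle, kick_writeback, flushed) are invented purely 
for illustration:

#include <condition_variable>
#include <cstdint>
#include <mutex>

// Illustrative model of the two thresholds described above: writeback
// starts once dirty data exceeds target_dirty, and a writer blocks once
// accepting its data would push the cache past max_dirty.
struct DirtyThrottle {
  std::uint64_t dirty = 0;
  std::uint64_t target_dirty = 8 * 1024 * 1024;   // start flushing above this
  std::uint64_t max_dirty = 25 * 1024 * 1024;     // block writers at this level
  std::mutex m;
  std::condition_variable cv;

  void write(std::uint64_t len) {
    std::unique_lock<std::mutex> lock(m);
    // The writer waits here while the cache is already at max_dirty;
    // flushed() wakes it up as dirty data commits.
    cv.wait(lock, [&] { return dirty + len <= max_dirty; });
    dirty += len;
    if (dirty > target_dirty)
      kick_writeback();               // hypothetical: wake a background flusher
  }

  void flushed(std::uint64_t len) {   // called as dirty data is committed
    std::lock_guard<std::mutex> lock(m);
    dirty -= len;
    cv.notify_all();
  }

  void kick_writeback() {}            // placeholder for the flusher wakeup
};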



Re: wip-librbd-caching

2012-04-18 Thread Martin Mailand

On 12.04.2012 21:45, Sage Weil wrote:

The config options you'll want to look at are client_oc_* (in case you
didn't see that already :).  "oc" is short for objectcacher, and it isn't
only used for client (libcephfs), so it might be worth renaming these
options before people start using them.


Hi,

I changed the values and the performance is still very good and the 
memory footprint is much smaller.


OPTION(client_oc_size, OPT_INT, 1024*1024* 50)         // MB * n
OPTION(client_oc_max_dirty, OPT_INT, 1024*1024* 25)    // MB * n  (dirty OR tx.. bigish)
OPTION(client_oc_target_dirty, OPT_INT, 1024*1024* 8)  // target dirty (keep this smallish)

// note: the max amount of "in flight" dirty data is roughly (max - target)

But I am not quite sure about the meaning of the values.
client_oc_size Max size of the cache?
client_oc_max_dirty max dirty value before the writeback starts?
client_oc_target_dirty ???


-martin


Re: wip-librbd-caching

2012-04-12 Thread Sage Weil
On Thu, 12 Apr 2012, Tommi Virtanen wrote:
> On Thu, Apr 12, 2012 at 12:45, Sage Weil  wrote:
> >> So maybe we could reduce the memory footprint of the cache, but keep its
> >> performance.
> >
> > I'm not familiar with the performance implications of KSM, but the
> > objectcacher doesn't modify existing buffers in place, so I suspect it's a
> > good candidate.  And it looks like there's minimal effort in enabling
> > it...
> 
> Are the objectcacher cache entries full pages, page aligned, with no
> bookkeeping data inside the page? Those are pretty much the
> requirements for page-granularity dedup to work..

Some buffers are, some aren't, but we'd only want to madvise on the 
page-aligned ones.  The messenger is careful to read things into aligned 
memory, and librbd will only be getting block-sized (probably page-sized, 
if we say we have 4k blocks) IO... so that should include every buffer in 
this case.
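
For what it's worth, the KSM hook itself is just madvise(2) with 
MADV_MERGEABLE; the call only registers the region, and the ksmd kernel 
thread does the scanning and merging asynchronously.  A minimal sketch of 
marking a page-aligned buffer as a merge candidate (illustrative only, not 
objectcacher/librbd code):

#include <sys/mman.h>
#include <cstdio>
#include <cstdlib>

// Allocate a page-aligned buffer and mark it as a KSM merge candidate.
// KSM operates on whole pages, hence the 4k alignment discussed above.
static void *alloc_mergeable(size_t len) {
  void *buf = nullptr;
  if (posix_memalign(&buf, 4096, len) != 0)
    return nullptr;
  if (madvise(buf, len, MADV_MERGEABLE) != 0)
    std::perror("madvise(MADV_MERGEABLE)");  // e.g. kernel built without CONFIG_KSM
  return buf;
}

int main() {
  void *p = alloc_mergeable(4 * 1024 * 1024);  // 4 MB candidate region
  std::free(p);
  return p ? 0 : 1;
}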

sage

Re: wip-librbd-caching

2012-04-12 Thread Greg Farnum
On Thursday, April 12, 2012 at 12:45 PM, Sage Weil wrote:
> On Thu, 12 Apr 2012, Martin Mailand wrote:
> > The other point is that the cache is not KSM enabled, so identical pages
> > will not be merged. Could that be changed, and what would be the downside?
> >  
> > So maybe we could reduce the memory footprint of the cache, but keep its
> > performance.
>  
>  
>  
> I'm not familiar with the performance implications of KSM, but the  
> objectcacher doesn't modify existing buffers in place, so I suspect it's a  
> good candidate. And it looks like there's minimal effort in enabling  
> it...


But if you're supposed to advise the kernel that the memory is a good 
candidate, then we probably shouldn't be making that madvise call on every 
buffer (I imagine it's doing a sha1 on each page and then examining a tree), 
especially since we (probably) flush all that data out relatively quickly. And 
RBD doesn't currently have any information about whether the data is OS or user 
data… (I guess in the future, with layering, we could call madvise on pages 
which were read from an underlying gold image.)
Also, TV is wondering whether the data is even page-aligned; I can't recall 
off-hand.
-Greg



Re: wip-librbd-caching

2012-04-12 Thread Tommi Virtanen
On Thu, Apr 12, 2012 at 12:45, Sage Weil  wrote:
>> So maybe we could reduce the memory footprint of the cache, but keep its
>> performance.
>
> I'm not familiar with the performance implications of KSM, but the
> objectcacher doesn't modify existing buffers in place, so I suspect it's a
> good candidate.  And it looks like there's minimal effort in enabling
> it...

Are the objectcacher cache entries full pages, page aligned, with no
bookkeeping data inside the page? Those are pretty much the
requirements for page-granularity dedup to work..


Re: wip-librbd-caching

2012-04-12 Thread Damien Churchill
On 12 April 2012 20:45, Sage Weil  wrote:
> I'm not familiar with the performance implications of KSM, but the
> objectcacher doesn't modify existing buffers in place, so I suspect it's a
> good candidate.  And it looks like there's minimal effort in enabling
> it...

It uses some CPU when calculating hashes, although I believe that if it
becomes too resource-consuming it is possible to disable it and keep using
the shared pages that have already been merged, just without updating them
or checking for any others that could be shared.


Re: wip-librbd-caching

2012-04-12 Thread Sage Weil
On Thu, 12 Apr 2012, Martin Mailand wrote:
> Hi,
> 
> today I tried the wip-librbd-caching branch. The performance improvement is
> very good, particularly for small writes.
> I tested from within a vm with fio:
> 
> rbd_cache_enabled=1
> 
> fio -name iops -rw=write -size=10G -iodepth 1 -filename /tmp/bigfile -ioengine libaio -direct 1 -bs 4k
> 
> I get over 10k iops
> 
> With an iodepth 4 I get over 30k iops
> 
> In comparison with the rbd_writebackwindow I get around 5k iops with an
> iodepth of 1.
> 
> So far the whole cluster is running stable for over 12 hours.

Great to hear!
 
> But there is also a downside.
> My typical VMs are 1 GB in size, and the default cache size is 200 MB, which is
> 20% more memory usage. Maybe 50 MB or less will be enough?
> I am going to test that.

The config options you'll want to look at are client_oc_* (in case you 
didn't see that already :).  "oc" is short for objectcacher, and it isn't 
only used for client (libcephfs), so it might be worth renaming these 
options before people start using them.

> The other point is that the cache is not KSM enabled, so identical pages will
> not be merged. Could that be changed, and what would be the downside?
> 
> So maybe we could reduce the memory footprint of the cache, but keep its
> performance.

I'm not familiar with the performance implications of KSM, but the 
objectcacher doesn't modify existing buffers in place, so I suspect it's a 
good candidate.  And it looks like there's minimal effort in enabling 
it...

sage


wip-librbd-caching

2012-04-12 Thread Martin Mailand

Hi,

today I tried the wip-librbd-caching branch. The performance improvement 
is very good, particularly for small writes.

I tested from within a vm with fio:

rbd_cache_enabled=1

fio -name iops -rw=write -size=10G -iodepth 1 -filename /tmp/bigfile -ioengine libaio -direct 1 -bs 4k
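
The same run expressed as a fio job file, for reference (same parameters as 
the command line above, just in job-file form):

; equivalent job file for the fio command above
[iops]
rw=write
size=10G
iodepth=1
filename=/tmp/bigfile
ioengine=libaio
direct=1
bs=4k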


I get over 10k iops

With an iodepth 4 I get over 30k iops

In comparison with the rbd_writebackwindow I get around 5k iops with an 
iodepth of 1.


So far the whole cluster is running stable for over 12 hours.

But there is also a downside.
My typical VMs are 1 GB in size, and the default cache size is 200 MB, 
which is 20% more memory usage. Maybe 50 MB or less will be enough?

I am going to test that.

The other point is that the cache is not KSM enabled, so identical pages 
will not be merged. Could that be changed, and what would be the downside?

So maybe we could reduce the memory footprint of the cache, but keep its 
performance.


-martin