Re: [ceph-users] Best method to limit snapshot/clone space overhead

2015-07-27 Thread Jason Dillaman
 If I understand correctly you want to look at how many “guest filesystem
 block size” blocks there are that are empty?
 This might not be that precise because we do not discard blocks inside the
 guests, but if you tell me how to gather this - I can certainly try that.
 I’m not sure if my bash-fu is enough to do this.

Here is a quick-n-dirty script to calculate the number of zeroed 4K blocks in 
all your RADOS objects [1].  If you have a smaller or larger block size on your 
OSD FS, feel free to tweak the block size variable.  For each clone image 
within a pool, it will locate all associated RADOS objects, download the 
objects one at a time, and perform a scan for fully zeroed blocks.  It's not 
the most CPU-efficient script, but it should get the job done.

[1] http://fpaste.org/248755/43803526/
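
In case the fpaste link above expires, a rough sketch of that kind of scan could
look like the following. This is not the original script; the pool argument, the
4K block size, and the parent check used to identify clones are assumptions to
adjust for your environment:

#!/bin/bash
# Rough sketch only (not the original fpaste script): count fully zeroed
# blocks in the RADOS objects backing each clone image in a pool.
# Assumptions: format-2 RBD images, the "rbd"/"rados" CLIs available, and
# the pool name passed as the first argument.
POOL=${1:-rbd}
BS=4096                       # block size to test; match your OSD filesystem

TMP=$(mktemp)
trap 'rm -f "$TMP"' EXIT

for IMG in $(rbd ls -p "$POOL"); do
    # Only look at clones (images that have a parent).
    rbd info "$POOL/$IMG" | grep -q 'parent: ' || continue

    # Object name prefix, e.g. rbd_data.1014742ae8944a
    PREFIX=$(rbd info "$POOL/$IMG" | awk '/block_name_prefix/ {print $2}')

    total=0; zero=0
    for OBJ in $(rados -p "$POOL" ls | grep "^$PREFIX"); do
        rados -p "$POOL" get "$OBJ" "$TMP"
        SIZE=$(stat -c %s "$TMP")
        BLOCKS=$(( (SIZE + BS - 1) / BS ))
        total=$(( total + BLOCKS ))
        for (( i = 0; i < BLOCKS; i++ )); do
            # A block counts as zeroed if it contains nothing but NUL bytes.
            if [ "$(dd if="$TMP" bs=$BS skip=$i count=1 2>/dev/null | tr -d '\0' | wc -c)" -eq 0 ]; then
                zero=$(( zero + 1 ))
            fi
        done
    done
    echo "$POOL/$IMG: $zero of $total ${BS}-byte blocks are zeroed"
done

Dividing the zeroed count by the total per image gives the percentage of
sub-object waste discussed below.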

-- 

Jason Dillaman 
Red Hat Ceph Storage Engineering 
dilla...@redhat.com 
http://www.redhat.com


Re: [ceph-users] Best method to limit snapshot/clone space overhead

2015-07-24 Thread Jason Dillaman
 Hi all,
 I have been looking for a way to alleviate the overhead of RBD snapshots/clones
 for some time.
 
 In our scenario there are a few “master” volumes that contain production
 data, and are frequently snapshotted and cloned for dev/qa use. Those
 snapshots/clones live for a few days to a few weeks before they get dropped,
 and they sometimes grow very fast (databases, etc.).
 
 With the default 4MB object size there seems to be a huge overhead involved
 with this; could someone give me some hints on how to solve that?


Do you have any statistics (or can you gather any statistics) that indicate the 
percentage of block-sized, zeroed extents within the clone images' RADOS 
objects?  If there is a large amount of waste, it might be possible / 
worthwhile to optimize how RBD handles copy-on-write operations against the 
clone.
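
A rough sketch of one way to gather a first-order statistic at object
granularity (how many of a clone's possible objects have actually been copied
up or written): the pool/image names below are placeholders, and the rbd info
parsing assumes the usual plain-text output of your release.

POOL=rbd                 # placeholder pool name
IMG=dev-clone-1          # placeholder clone name

# Object name prefix for the image's data objects, e.g. rbd_data.1014742ae8944a
PREFIX=$(rbd info "$POOL/$IMG" | awk '/block_name_prefix/ {print $2}')

# Total number of objects the image would use if fully allocated
# (taken from the "size ... in N objects" line of rbd info).
TOTAL=$(rbd info "$POOL/$IMG" | awk '/size.*objects/ {print $(NF-1)}')

# Objects that actually exist in RADOS, i.e. have been copied up by COW
# (or written directly) since the clone was created.
COPIED=$(rados -p "$POOL" ls | grep -c "^$PREFIX")

echo "$COPIED of $TOTAL possible objects allocated for $POOL/$IMG"

The per-4K-block scan in the 2015-07-27 follow-up then refines this to the
sub-object zeroed extents asked about here.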

-- 

Jason Dillaman 
Red Hat 
dilla...@redhat.com 
http://www.redhat.com


Re: [ceph-users] Best method to limit snapshot/clone space overhead

2015-07-24 Thread Haomai Wang
On Fri, Jul 24, 2015 at 11:55 PM, Jason Dillaman dilla...@redhat.com wrote:
 Hi all,
 I have been looking for a way to alleviate the overhead of RBD snapshots/clones
 for some time.

 In our scenario there are a few “master” volumes that contain production
 data, and are frequently snapshotted and cloned for dev/qa use. Those
 snapshots/clones live for a few days to a few weeks before they get dropped,
 and they sometimes grow very fast (databases, etc.).

 With the default 4MB object size there seems to be a huge overhead involved
 with this; could someone give me some hints on how to solve that?


 Do you have any statistics (or can you gather any statistics) that indicate 
 the percentage of block-sized, zeroed extents within the clone images' RADOS 
 objects?  If there is a large amount of waste, it might be possible / 
 worthwhile to optimize how RBD handles copy-on-write operations against the 
 clone.

I think fiemap/seek_hole would mostly benefit rbd objects after
recovery or backfill.


 --

 Jason Dillaman
 Red Hat
 dilla...@redhat.com
 http://www.redhat.com



-- 
Best Regards,

Wheat


Re: [ceph-users] Best method to limit snapshot/clone space overhead

2015-07-24 Thread Jan Schermer
Hello,

If I understand correctly you want to look at how many “guest filesystem block 
size” blocks there are that are empty?
This might not be that precise because we do not discard blocks inside the 
guests, but if you tell me how to gather this - I can certainly try that. I’m 
not sure if my bash-fu is enough to do this.

Anyway, if I understand how COW works with Ceph clones, then a single 1-byte
write inside a clone image will cause the whole object to be replicated, so
that 1-byte write eats 4MB of space. I’m not sure whether FIEMAP actually
creates sparse files if the source is “empty”, or whether it just doesn’t bother
reading the holes.
The same granularity probably applies even without clones, though. If I
remember correctly, “mkfs.ext4” (which writes here and there across the volume)
cost me ~200GB of space on a 2500GB volume (at least according to the stats).
I’m not sure how much data it writes from the guest’s perspective (5GB? 20GB?
Something like that, spread over the volume).
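
For a rough cross-check of such numbers without touching FIEMAP, summing the
extents reported by "rbd diff" is a common trick. A sketch, with placeholder
image and snapshot names, assuming the default plain-text diff output:

# Approximate allocated data in an image by summing the extent lengths
# that "rbd diff" reports.
rbd diff rbd/master | awk '{ SUM += $2 } END { print SUM/1024/1024 " MB allocated" }'

# Changes written since one of the image's own snapshots (snapshot name is
# a placeholder), e.g. to see how fast a dev/qa image is growing.
rbd diff --from-snap snap1 rbd/dev-clone-1 | awk '{ SUM += $2 } END { print SUM/1024/1024 " MB changed" }'

This only approximates allocated space (it reflects the extents librbd reports
as existing or changed, not NULL content inside them), but it is cheap to run.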

Thanks
Jan


 On 24 Jul 2015, at 17:55, Jason Dillaman dilla...@redhat.com wrote:
 
 Hi all,
 I have been looking for a way to alleviate the overhead of RBD snapshots/clones
 for some time.
 
 In our scenario there are a few “master” volumes that contain production
 data, and are frequently snapshotted and cloned for dev/qa use. Those
 snapshots/clones live for a few days to a few weeks before they get dropped,
 and they sometimes grow very fast (databases, etc.).
 
 With the default 4MB object size there seems to be a huge overhead involved
 with this; could someone give me some hints on how to solve that?
 
 
 Do you have any statistics (or can you gather any statistics) that indicate 
 the percentage of block-sized, zeroed extents within the clone images' RADOS 
 objects?  If there is a large amount of waste, it might be possible / 
 worthwhile to optimize how RBD handles copy-on-write operations against the 
 clone.
 
 -- 
 
 Jason Dillaman 
 Red Hat 
 dilla...@redhat.com 
 http://www.redhat.com



Re: [ceph-users] Best method to limit snapshot/clone space overhead

2015-07-24 Thread Reistlin
No need for thanks at all.
I was thinking about ZFS deduplication in a slightly different context of using
snapshots. We determined that platter HDDs work better with a big object size,
but that causes a big performance overhead with snapshots. For example, say you
have a 32 MB block size and an image snapshot. If only 1 byte in that object has
to be written, the COW mechanism will still write the whole 32 MB of the initial
object, and only then change that byte. That has a big impact on the performance
of slow HDDs.
I am really sorry for my awful English.
 
 On 25 Jul 2015, at 0:49, Jan Schermer j...@schermer.cz wrote:
 
 We use ZFS for other purposes and deduplication is overrated - it is quite 
 useful with big block sizes (and assuming your data don’t “shift” in the 
 blocks), but you can usually achieve much higher space savings with 
 compression - and it usually is faster, too :-) You need lots and lots of RAM 
 for it to be reasonably fast and it’s usually cheaper to just get more drives 
 anyway.
 That said, we're looking into creating a read-only replica of our pools on ZFS (once 
 we upgrade to Firefly, which has the primary affinity setting), but I don’t think 
 we’re even going to try it on a production r/w workload instead of xfs/ext4. 
 At least not until someone from Inktank/RH says it’s 100% safe and stable.
 I can imagine an OSD running on top of ZFS, using the ZFS clone semantics for RBD 
 image and pool clones/snapshots; that would be quite nice (and fast, proven 
 and just pretty much awesome). Maybe someone from RH will share this dream 
 (wink wink :))
 
 Sorry for being slightly off-topic. In short ZFS is not the solution here and 
 now. But thank you for the idea.



Re: [ceph-users] Best method to limit snapshot/clone space overhead

2015-07-24 Thread Reistlin
Hi! Did you try ZFS and its deduplication mechanism? It could radically decrease 
writes during COW.



Re: [ceph-users] Best method to limit snapshot/clone space overhead

2015-07-24 Thread Jan Schermer
We use ZFS for other purposes and deduplication is overrated - it is quite 
useful with big block sizes (and assuming your data don’t “shift” in the 
blocks), but you can usually achieve much higher space savings with compression 
- and it usually is faster, too :-) You need lots and lots of RAM for it to be 
reasonably fast and it’s usually cheaper to just get more drives anyway.
That said, we're looking into creating a read-only replica of our pools on ZFS (once 
we upgrade to Firefly, which has the primary affinity setting), but I don’t think 
we’re even going to try it on a production r/w workload instead of xfs/ext4. At 
least not until someone from Inktank/RH says it’s 100% safe and stable.
I can imagine an OSD running on top of ZFS, using the ZFS clone semantics for RBD 
image and pool clones/snapshots; that would be quite nice (and fast, proven and 
just pretty much awesome). Maybe someone from RH will share this dream (wink 
wink :))

Sorry for being slightly off-topic. In short ZFS is not the solution here and 
now. But thank you for the idea.

Jan


 On 24 Jul 2015, at 21:38, Reistlin reistli...@yandex.ru wrote:
 
 Hi! Did you try ZFS and its deduplication mechanism? It could radically decrease 
 writes during COW.
 



Re: [ceph-users] Best method to limit snapshot/clone space overhead

2015-07-23 Thread Josh Durgin

On 07/23/2015 06:31 AM, Jan Schermer wrote:

Hi all,
I have been looking for a way to alleviate the overhead of RBD snapshots/clones
for some time.

In our scenario there are a few “master” volumes that contain production data, 
and are frequently snapshotted and cloned for dev/qa use. Those 
snapshots/clones live for a few days to a few weeks before they get dropped, 
and they sometimes grow very fast (databases, etc.).

With the default 4MB object size there seems to be a huge overhead involved with
this; could someone give me some hints on how to solve that?

I have some hope in two things:

1) FIEMAP
I’ve calculated that files on my OSDs are approx. 30% filled with NULLs, so I
suppose this is what it could save (in the best case), and it should also make
COW operations much faster.
But there are lots of bugs in FIEMAP in kernels (I saw some reference to the
CentOS 6.5 kernel being buggy, which is what we use) and in filesystems (like
XFS). No idea about ext4, which we’d like to use in the future.

Is enabling FIEMAP a good idea at all? I saw some mention of it being replaced
with SEEK_DATA and SEEK_HOLE.


fiemap (and ceph's use of it) has been buggy on all fses in the past.
SEEK_DATA and SEEK_HOLE are the proper interfaces to use for these
purposes. That said, it's not incredibly well tested since it's off by
default, so I wouldn't recommend using it without careful testing on
the fs you're using. I wouldn't expect it to make much of a difference
if you use small objects.
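
For anyone who does want to experiment on a test cluster, the FileStore option
involved should be "filestore fiemap" (off by default, as noted). A hedged
sketch of checking and toggling it; the OSD id is a placeholder, and option
names should be verified against your release's config reference:

# Check the current value on a running OSD via its admin socket
# (osd.0 is a placeholder id).
ceph daemon osd.0 config get filestore_fiemap

# Flip it at runtime for testing; injectargs changes do not persist
# across restarts.
ceph tell 'osd.*' injectargs '--filestore_fiemap true'

# To make it persistent, add it to the [osd] section of ceph.conf and
# restart the OSDs:
#   filestore fiemap = true

Whether the SEEK_DATA/SEEK_HOLE path has its own toggle depends on the release,
so check the config reference before relying on it.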


2) object size < 4MB for clones
I did some quick performance testing, and setting this lower for production is
probably not a good idea. My sweet spot is an 8MB object size; however, this
would make the overhead for clones even worse than it already is.
But I could create the cloned images with a different block size from the
snapshot (at least according to the docs). Does anyone use it like that? Any
caveats? That way I could have the production data with an 8MB block size but
make the development snapshots with, for example, 64KiB granularity, probably
at the expense of some performance, but most of the data would remain in the
(faster) master snapshot anyway. This should drop the overhead tremendously,
maybe even more than enabling FIEMAP. (Even better when working in tandem, I
suppose?)


Since these clones are relatively short-lived this seems like a better
way to go in the short term. 64k may be extreme, but if there aren't
too many of these clones it's not a big deal. There is more overhead
for recovery and scrub with smaller objects, so I wouldn't recommend
using tiny objects in general.
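
For reference, a sketch of what that setup could look like with the CLI. Names
and sizes are placeholders; --order sets the object size to 2^order bytes, so
23 gives 8MB objects and 16 gives 64KiB objects:

# Production image with 8MB objects, snapshotted and protected for cloning.
rbd create --image-format 2 --order 23 --size 102400 rbd/master
rbd snap create rbd/master@base
rbd snap protect rbd/master@base

# Dev/qa clone with 64KiB objects; the clone's object size does not have to
# match the parent's.
rbd clone --order 16 rbd/master@base rbd/dev-clone-1

A small write into the clone should then copy up at most 64KiB per touched
object instead of 8MB, at the cost of many more objects for recovery and scrub
to track.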

It'll be interesting to see your results. I'm not sure many folks
have looked at optimizing this use case.

Josh