Re: [ceph-users] Best method to limit snapshot/clone space overhead
> If I understand correctly you want to look at how many “guest filesystem block size” blocks there are that are empty? This might not be that precise because we do not discard blocks inside the guests, but if you tell me how to gather this - I can certainly try that. I’m not sure if my bash-fu is enough to do this.

Here is a quick-n-dirty script to calculate the number of zeroed 4K blocks in all your RADOS objects [1]. If you have a smaller or larger block size on your OSD FS, feel free to tweak the block size variable. For each clone image within a pool, it will locate all associated RADOS objects, download the objects one at a time, and perform a scan for fully zeroed blocks. It's not the most CPU efficient script, but it should get the job done.

[1] http://fpaste.org/248755/43803526/

--
Jason Dillaman
Red Hat Ceph Storage Engineering
dilla...@redhat.com
http://www.redhat.com
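In case the paste at [1] expires, a rough sketch of an equivalent approach follows (this is not the original script; it assumes the rbd and rados CLI tools, a hypothetical pool name passed as the first argument, and a scratch file under /tmp):

#!/bin/bash
# Sketch: count fully zeroed fixed-size blocks in the RADOS objects that back
# each clone image in a pool. Slow but simple: it downloads every object.
POOL=${1:-rbd}   # pool to scan (assumption: passed as the first argument)
BS=4096          # block size to test against; tweak for your OSD filesystem

for IMG in $(rbd ls -p "$POOL"); do
    # only clone images have a "parent:" line in their info output
    rbd info -p "$POOL" "$IMG" | grep -q 'parent:' || continue
    PREFIX=$(rbd info -p "$POOL" "$IMG" | awk '/block_name_prefix/ {print $2}')
    TOTAL=0; ZERO=0
    for OBJ in $(rados -p "$POOL" ls | grep "^${PREFIX}"); do
        rados -p "$POOL" get "$OBJ" /tmp/zscan.$$
        SIZE=$(stat -c %s /tmp/zscan.$$)
        BLOCKS=$(( (SIZE + BS - 1) / BS ))
        for ((i = 0; i < BLOCKS; i++)); do
            # a block counts as zeroed if it compares equal to /dev/zero
            if cmp -s <(dd if=/tmp/zscan.$$ bs=$BS skip=$i count=1 2>/dev/null) \
                      <(dd if=/dev/zero bs=$BS count=1 2>/dev/null); then
                ZERO=$((ZERO + 1))
            fi
            TOTAL=$((TOTAL + 1))
        done
        rm -f /tmp/zscan.$$
    done
    echo "$IMG: $ZERO of $TOTAL ${BS}-byte blocks are zeroed"
done

The per-image ratio it prints is roughly the space that finer-grained copy-on-write (or sparse object handling) could reclaim.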
Re: [ceph-users] Best method to limit snapshot/clone space overhead
> Hi all,
> I have been looking for a way to alleviate the overhead of RBD snapshots/clones for some time. In our scenario there are a few “master” volumes that contain production data, and are frequently snapshotted and cloned for dev/qa use. Those snapshots/clones live for a few days to a few weeks before they get dropped, and they sometimes grow very fast (databases, etc.). With the default 4MB object size there seems to be huge overhead involved with this; could someone give me some hints on how to solve that?

Do you have any statistics (or can you gather any statistics) that indicate the percentage of block-size, zeroed extents within the clone images' RADOS objects? If there is a large amount of waste, it might be possible / worthwhile to optimize how RBD handles copy-on-write operations against the clone.

--
Jason Dillaman
Red Hat
dilla...@redhat.com
http://www.redhat.com
Re: [ceph-users] Best method to limit snapshot/clone space overhead
On Fri, Jul 24, 2015 at 11:55 PM, Jason Dillaman <dilla...@redhat.com> wrote:
>> Hi all,
>> I have been looking for a way to alleviate the overhead of RBD snapshots/clones for some time. In our scenario there are a few “master” volumes that contain production data, and are frequently snapshotted and cloned for dev/qa use. Those snapshots/clones live for a few days to a few weeks before they get dropped, and they sometimes grow very fast (databases, etc.). With the default 4MB object size there seems to be huge overhead involved with this; could someone give me some hints on how to solve that?
>
> Do you have any statistics (or can you gather any statistics) that indicate the percentage of block-size, zeroed extents within the clone images' RADOS objects? If there is a large amount of waste, it might be possible / worthwhile to optimize how RBD handles copy-on-write operations against the clone.

I think fiemap/seek_hole would mostly benefit rbd objects after recovery or backfill.

> --
> Jason Dillaman
> Red Hat
> dilla...@redhat.com
> http://www.redhat.com

--
Best Regards,
Wheat
Re: [ceph-users] Best method to limit snapshot/clone space overhead
Hello,

If I understand correctly you want to look at how many “guest filesystem block size” blocks there are that are empty? This might not be that precise because we do not discard blocks inside the guests, but if you tell me how to gather this - I can certainly try that. I’m not sure if my bash-fu is enough to do this.

Anyway, if I understand how COW works with Ceph clones, then a single 1-byte write inside a clone image will cause the whole object to be copied, so that 1-byte write eats 4MB of space. I’m not sure whether FIEMAP actually creates sparse files if the source is “empty”, or whether it just doesn’t bother reading the holes.

The same granularity probably applies even without clones, though. If I remember correctly, “mkfs.ext4”, which writes here and there on a volume, cost me ~200GB of space on a 2500GB volume (at least according to the stats). I'm not sure how much data it writes from the guest perspective (5GB? 20GB? It will be something like that, spread over the volume).

Thanks
Jan

On 24 Jul 2015, at 17:55, Jason Dillaman <dilla...@redhat.com> wrote:

>> Hi all,
>> I have been looking for a way to alleviate the overhead of RBD snapshots/clones for some time. In our scenario there are a few “master” volumes that contain production data, and are frequently snapshotted and cloned for dev/qa use. Those snapshots/clones live for a few days to a few weeks before they get dropped, and they sometimes grow very fast (databases, etc.). With the default 4MB object size there seems to be huge overhead involved with this; could someone give me some hints on how to solve that?
>
> Do you have any statistics (or can you gather any statistics) that indicate the percentage of block-size, zeroed extents within the clone images' RADOS objects? If there is a large amount of waste, it might be possible / worthwhile to optimize how RBD handles copy-on-write operations against the clone.
>
> --
> Jason Dillaman
> Red Hat
> dilla...@redhat.com
> http://www.redhat.com
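To make that copy-up granularity concrete, here is a sketch of a small experiment (hypothetical pool and image names; it assumes format-2 images and an rbd kernel client recent enough to map clones, otherwise a librbd-based tool can be substituted) showing that a single-byte write to a clone materializes a whole object on the OSDs:

#!/bin/bash
# Sketch: demonstrate object-granularity copy-on-write for RBD clones.
set -e

# 64MB master image, filled with non-zero data so its RADOS objects exist
rbd create --image-format 2 --size 64 rbd/cow-master
DEV=$(rbd map rbd/cow-master)
dd if=/dev/urandom of="$DEV" bs=4M count=16 oflag=direct
rbd unmap "$DEV"

rbd snap create rbd/cow-master@base
rbd snap protect rbd/cow-master@base
rbd clone rbd/cow-master@base rbd/cow-clone

PREFIX=$(rbd info rbd/cow-clone | awk '/block_name_prefix/ {print $2}')
echo "clone objects before write: $(rados -p rbd ls | grep -c "^${PREFIX}")"   # expect 0

# write one byte somewhere inside the first 4MB of the clone
DEV=$(rbd map rbd/cow-clone)
echo -n X | dd of="$DEV" bs=1 count=1 seek=123456 conv=notrunc
sync
rbd unmap "$DEV"

# one full parent object has now been copied up into the clone
echo "clone objects after write:  $(rados -p rbd ls | grep -c "^${PREFIX}")"   # expect 1
rados -p rbd stat "${PREFIX}.0000000000000000"

So the on-disk cost of that write is one whole object (4MB by default), regardless of how few bytes the guest actually changed.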
Re: [ceph-users] Best method to limit snapshot/clone space overhead
No need for thanks at all. I was thinking about ZFS deduplication in a slightly different context of snapshot use.

We determined that platter HDDs work better with a big object size, but that causes a big performance overhead with snapshots. For example, say you have a 32MB object size and a snapshot of the image. If only 1 byte in an object has to be written, the COW mechanism will still copy the whole 32MB of the initial object, and only then change that byte. That has a big impact on the performance of slow HDDs.

I am really sorry for my awful English.

On 25 Jul 2015, at 0:49, Jan Schermer <j...@schermer.cz> wrote:

> We use ZFS for other purposes, and deduplication is overrated - it is quite useful with big block sizes (and assuming your data don’t “shift” in the blocks), but you can usually achieve much higher space savings with compression - and it usually is faster, too :-) You need lots and lots of RAM for it to be reasonably fast and it’s usually cheaper to just get more drives anyway.
>
> But we're looking into creating a read-only replica of our pools on ZFS (once we upgrade to Firefly, which has the primary affinity setting), but I don’t think we’re even going to try it on a production r/w workload instead of xfs/ext4. At least not until someone from Inktank/RH says it’s 100% safe and stable.
>
> I can imagine an OSD running on top of ZFS using the ZFS clone semantics for RBD image and pool clones/snapshots - that would be quite nice (and fast, proven and just pretty much awesome). Maybe someone from RH will share this dream (wink wink :))
>
> Sorry for being slightly off-topic. In short, ZFS is not the solution here and now. But thank you for the idea.
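For reference, the object size is a per-image property set through the image order (object size = 2^order bytes), so the trade-off described above can be reproduced directly; a sketch with hypothetical image names:

# 32MB objects (order 25): good for sequential HDD throughput, but every
# COW under a snapshot has to copy a full 32MB object
rbd create --image-format 2 --order 25 --size 10240 rbd/big-objects

# default 4MB objects (order 22) for comparison
rbd create --image-format 2 --order 22 --size 10240 rbd/default-objects

rbd info rbd/big-objects | grep order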
Re: [ceph-users] Best method to limit snapshot/clone space overhead
Hi!

Did you try ZFS and its deduplication mechanism? It could radically decrease the amount written during COW.
Re: [ceph-users] Best method to limit snapshot/clone space overhead
We use ZFS for other purposes, and deduplication is overrated - it is quite useful with big block sizes (and assuming your data don’t “shift” in the blocks), but you can usually achieve much higher space savings with compression - and it usually is faster, too :-) You need lots and lots of RAM for it to be reasonably fast and it’s usually cheaper to just get more drives anyway.

But we're looking into creating a read-only replica of our pools on ZFS (once we upgrade to Firefly, which has the primary affinity setting), but I don’t think we’re even going to try it on a production r/w workload instead of xfs/ext4. At least not until someone from Inktank/RH says it’s 100% safe and stable.

I can imagine an OSD running on top of ZFS using the ZFS clone semantics for RBD image and pool clones/snapshots - that would be quite nice (and fast, proven and just pretty much awesome). Maybe someone from RH will share this dream (wink wink :))

Sorry for being slightly off-topic. In short, ZFS is not the solution here and now. But thank you for the idea.

Jan

On 24 Jul 2015, at 21:38, Reistlin <reistli...@yandex.ru> wrote:

> Hi! Did you try ZFS and its deduplication mechanism? It could radically decrease the amount written during COW.
Re: [ceph-users] Best method to limit snapshot/clone space overhead
On 07/23/2015 06:31 AM, Jan Schermer wrote:
> Hi all,
> I have been looking for a way to alleviate the overhead of RBD snapshots/clones for some time. In our scenario there are a few “master” volumes that contain production data, and are frequently snapshotted and cloned for dev/qa use. Those snapshots/clones live for a few days to a few weeks before they get dropped, and they sometimes grow very fast (databases, etc.). With the default 4MB object size there seems to be huge overhead involved with this; could someone give me some hints on how to solve that?
>
> I have some hope in:
>
> 1) FIEMAP
> I’ve calculated that files on my OSDs are approx. 30% filled with NULLs - I suppose this is what it could save (best-case scenario) and it should also make COW operations much faster. But there are lots of bugs in FIEMAP in kernels (I saw some reference to the CentOS 6.5 kernel being buggy - which is what we use) and filesystems (like XFS). No idea about ext4, which we’d like to use in the future.
> Is enabling FIEMAP a good idea at all? I saw some mention of it being replaced with SEEK_DATA and SEEK_HOLE.

fiemap (and ceph's use of it) has been buggy on all fses in the past. SEEK_DATA and SEEK_HOLE are the proper interfaces to use for these purposes. That said, it's not incredibly well tested since it's off by default, so I wouldn't recommend using it without careful testing on the fs you're using. I wouldn't expect it to make much of a difference if you use small objects.

> 2) object size < 4MB for clones
> I did some quick performance testing and setting this lower for production is probably not a good idea. My sweet spot is 8MB object size, however this would make the overhead for clones even worse than it already is. But I could make the cloned images with a different block size from the snapshot (at least according to the docs). Does anyone use it like that? Any caveats?
> That way I could have the production data with 8MB block size but make the development snapshots with, for example, 64KiB granularity, probably at the expense of some performance, but most of the data would remain in the (faster) master snapshot anyway. This should drop overhead tremendously, maybe even more than enabling FIEMAP. (Even better when working in tandem, I suppose?)

Since these clones are relatively short-lived this seems like a better way to go in the short term. 64k may be extreme, but if there aren't too many of these clones it's not a big deal. There is more overhead for recovery and scrub with smaller objects, so I wouldn't recommend using tiny objects in general.

It'll be interesting to see your results. I'm not sure many folks have looked at optimizing this use case.

Josh
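As a concrete illustration of the mixed-object-size approach Josh describes, the clone's object size can be chosen independently of the parent at clone time via the order parameter; a sketch with hypothetical pool and image names (2^23 = 8MB, 2^16 = 64KiB):

# production master keeps large 8MB objects (order 23)
rbd create --image-format 2 --order 23 --size 102400 rbd/prod-master
rbd snap create rbd/prod-master@qa
rbd snap protect rbd/prod-master@qa

# dev/qa clone uses 64KiB objects (order 16), so copy-on-write only
# duplicates a 64KiB extent instead of a full parent-sized object
rbd clone --order 16 rbd/prod-master@qa rbd/qa-clone

rbd info rbd/qa-clone | grep -E 'order|parent'

The parent's performance is unaffected; only writes landing in the clone pay the finer-grained COW cost, plus the extra recovery/scrub overhead of carrying many more small objects.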
[ceph-users] Best method to limit snapshot/clone space overhead
Hi all,

I have been looking for a way to alleviate the overhead of RBD snapshots/clones for some time. In our scenario there are a few “master” volumes that contain production data, and are frequently snapshotted and cloned for dev/qa use. Those snapshots/clones live for a few days to a few weeks before they get dropped, and they sometimes grow very fast (databases, etc.). With the default 4MB object size there seems to be huge overhead involved with this; could someone give me some hints on how to solve that?

I have some hope in:

1) FIEMAP
I’ve calculated that files on my OSDs are approx. 30% filled with NULLs - I suppose this is what it could save (best-case scenario) and it should also make COW operations much faster. But there are lots of bugs in FIEMAP in kernels (I saw some reference to the CentOS 6.5 kernel being buggy - which is what we use) and filesystems (like XFS). No idea about ext4, which we’d like to use in the future.
Is enabling FIEMAP a good idea at all? I saw some mention of it being replaced with SEEK_DATA and SEEK_HOLE.

2) object size < 4MB for clones
I did some quick performance testing and setting this lower for production is probably not a good idea. My sweet spot is 8MB object size, however this would make the overhead for clones even worse than it already is. But I could make the cloned images with a different block size from the snapshot (at least according to the docs). Does anyone use it like that? Any caveats?
That way I could have the production data with 8MB block size but make the development snapshots with, for example, 64KiB granularity, probably at the expense of some performance, but most of the data would remain in the (faster) master snapshot anyway. This should drop overhead tremendously, maybe even more than enabling FIEMAP. (Even better when working in tandem, I suppose?)

Your thoughts?

Thanks
Jan
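For anyone wanting to experiment with option 1), the sparse handling is controlled by OSD-side FileStore options, both off by default; a sketch of the relevant ceph.conf section follows (the SEEK_DATA/SEEK_HOLE option only exists in newer releases, and the exact option names should be verified against your version, for example with "ceph daemon osd.0 config show"):

[osd]
# enable FIEMAP-based sparse reads/copies in FileStore (disabled by default)
filestore fiemap = true

# newer releases can use SEEK_DATA/SEEK_HOLE instead (also disabled by default)
filestore seek data hole = true

Restart the OSDs (or check whether your version picks the change up via injectargs) and, as Josh cautions above, test carefully on a throwaway pool first since neither path is heavily exercised.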