It really sounds like you're looking for a better RAID system, not a
distributed storage system.
I've been using ZFS on FreeBSD for years. The Linux port meets nearly
all of your needs, while acting more like a conventional software RAID.
Btrfs also has a lot of these features, but I'm not familiar enough with
it to advocate for it.
> I feel that Ceph is better than mdraid because:
> 1) When ceph cluster is far from being full, 'rebuilding' will be much
> faster vs mdraid
ZFS only rebuilds allocated parts of the disk, same as Ceph.
> 2) You can easily change the number of replicas
This is not as straightforward, but it is available. ZFS gives you
several different RAID-like levels, and it lets you control the number
of copies you keep on disk. So you can create something that looks like
RAID10 (stripes of mirrors), or a RAID5/RAID6. With 6 disks, I'd go
RAIDZ-2 (2 parity disks, for ~12TB usable). RAIDZ-2 is faster than
RAID10-like (in my PostgreSQL benchmarks, YMMV), and safer: with 2
parity disks, you'd have to lose 3 disks to lose data. Just keep in mind
that ZFS is not RAID, just RAID-like. I still call the volumes a RAID10
or RAID5, but the analogy breaks down below the volume level.
If you have really important data, you can also tell it to keep 2 (or
more) copies of the file, regardless of type of RAID. You can set that
replica policy per file, or per filesystem.
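As a rough sketch of both knobs (pool, filesystem, and device names here
are all made up for the example):

```shell
# Create a RAIDZ-2 pool from 6 disks: any 2 can fail without data loss.
zpool create tank raidz2 da0 da1 da2 da3 da4 da5

# Create a filesystem for important data and keep 2 copies of every
# block, on top of the RAIDZ-2 redundancy.
zfs create tank/important
zfs set copies=2 tank/important

# Verify the replica policy later.
zfs get copies tank/important
```

The copies property is inherited, so setting it on a filesystem covers
everything below it.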
> 3) When multiple disks have bad sectors, I suspect ceph will be much
> easier to recover data from than from mdraid which will simply never
> finish rebuilding.
ZFS checksums every block. If you're using RAID10-like, it will recover
blocks that failed the checksum from the mirror. If you're using
RAID5-like, it will rebuild them from parity. Because it has a checksum
of every block, it only rebuilds the failed ones. It does have to
checksum every block to find the failed ones, though. My 10TB volume
takes about 12 hours to replace a failed 2TB disk.
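The checksum-and-repair cycle above comes down to two commands (pool and
device names are illustrative):

```shell
# Walk every allocated block and verify its checksum; blocks that fail
# are repaired from the mirror or parity as they are found.
zpool scrub tank

# Swap out a failed disk; ZFS resilvers only the allocated blocks onto
# the replacement, not the whole device.
zpool replace tank da3 da6

# Watch scrub/resilver progress and per-device error counts.
zpool status -v tank
```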
> 4) If we need to migrate data over to a different server with no
> downtime, we just add more OSDs, wait, and then remove the old ones :-)
zfs snapshot && zfs send. It's not completely online, but I've moved 5TB
to a new server with a 5-minute outage window (pre-copy all the data,
shut down, send a final snapshot, flip the clients to the new server).
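That migration sketched out (hostnames and dataset names are made up):

```shell
# On the old server: snapshot and pre-copy the bulk of the data while
# clients are still online.
zfs snapshot tank/data@migrate1
zfs send tank/data@migrate1 | ssh newserver zfs receive pool/data

# During the outage window: stop writers, snapshot again, and send only
# the blocks changed since the first snapshot (incremental send).
zfs snapshot tank/data@migrate2
zfs send -i tank/data@migrate1 tank/data@migrate2 | \
    ssh newserver zfs receive pool/data

# Then point the clients at newserver.
```

The incremental send is what keeps the outage window short: it only has
to move the writes that landed during the pre-copy.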
If you can't tell, I'm a big fan of ZFS. I'm hoping to run my dev Ceph
cluster on ZFS soon.
*Craig Lewis*
Senior Systems Engineer
Office +1.714.602.1309
Email cle...@centraldesktop.com <mailto:cle...@centraldesktop.com>
*Central Desktop. Work together in ways you never thought possible.*
Connect with us Website <http://www.centraldesktop.com/> | Twitter
<http://www.twitter.com/centraldesktop> | Facebook
<http://www.facebook.com/CentralDesktop> | LinkedIn
<http://www.linkedin.com/groups?gid=147417> | Blog
<http://cdblog.centraldesktop.com/>
On 8/13/13 00:47, Dmitry Postrigan wrote:
> This will be a single server configuration, the goal is to replace
> mdraid, hence I tried to use localhost (nothing more will be added to
> the cluster). Are you saying it will be less fault tolerant than a
> RAID-10?
>
>> Ceph is a distributed object store. If you stay within a single
>> machine, keep using a local RAID solution (hardware or software).
>> Why would you want to make this switch?
>
> I do not think RAID-10 on 6 3TB disks is going to be reliable at all.
> I have simulated several failures, and it looks like a rebuild will
> take a lot of time. Funnily, during one of these experiments, another
> drive failed, and I lost the entire array. Good luck recovering from
> that...
>
> I feel that Ceph is better than mdraid because:
> 1) When ceph cluster is far from being full, 'rebuilding' will be much
> faster vs mdraid
> 2) You can easily change the number of replicas
> 3) When multiple disks have bad sectors, I suspect ceph will be much
> easier to recover data from than from mdraid which will simply never
> finish rebuilding.
> 4) If we need to migrate data over to a different server with no
> downtime, we just add more OSDs, wait, and then remove the old ones :-)
>
> This is my initial observation though, so please correct me if I am
> wrong.
>
> Dmitry
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com