Hi,

On 02/10/2015 18:15, Christian Balzer wrote:
> Hello,
> On Fri, 2 Oct 2015 15:31:11 +0200 Javier C.A. wrote:
>
> Firstly, this has been discussed countless times here.
> For one of the latest recurrences, check the archive for:
>
> "calculating maximum number of disk and node failure that can
> be handled by cluster with out data loss"
>
>
>> A classic RAID5 system takes a looong time to rebuild the array, so I
>> would say NO, but how long does it take for Ceph to rebuild the
>> placement group?
>>
> A placement group resides on an OSD. 
> Until the LAST PG on a failed OSD has been recovered, you are prone to
> data loss.
> And a single lost PG might affect ALL your images...

True.

>
> So while your OSDs are mostly empty, recovery will be faster than a RAID5.
>
> Once it gets fuller AND you realize that rebuilding OSDs SEVERELY impacts
> your cluster performance (at least in your smallish example) you are
> likely to tune down the recovery and backfill parameters to a level where
> it takes LONGER than a typical RAID controller recovery.

No, it doesn't. At least it shouldn't: in a RAID5 array, you need to
read all blocks from all the other devices to rebuild the data on your
replacement device.
To rebuild an OSD, you only have to read the amount of data that will
be stored on the replacement device, which is <n-1> times fewer reads
and the same amount of writes as with RAID5. This is more easily
compared to what happens with a RAID10 array.
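
To put rough numbers on it, here is a small Python sketch; the device
count and data size are made-up example values, not taken from any
real cluster:

# Back-of-the-envelope rebuild I/O: RAID5 vs a single Ceph OSD.
# n and data_tb are made-up example values for illustration only.

n = 6            # devices in the RAID5 array (or replica sources)
data_tb = 1.0    # data that ends up on the replacement device, in TB

# RAID5: rebuilding one device means reading every block of the
# n-1 surviving devices and writing the reconstructed blocks.
raid5_reads = (n - 1) * data_tb
raid5_writes = data_tb

# Ceph: you only read, from the surviving replicas, the data that
# will actually be stored on the replacement OSD.
ceph_reads = data_tb
ceph_writes = data_tb

print(f"RAID5: read {raid5_reads} TB, write {raid5_writes} TB")
print(f"Ceph : read {ceph_reads} TB, write {ceph_writes} TB")

Same writes in both cases, but RAID5 reads <n-1> times more data to
produce them.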

But if you care more about redundancy than about minimizing the total
amount of IO pressure caused by rebalancing the cluster, you won't
rebuild the OSD: you will let the failed one go out and have the data
reorganized in addition to the missing replicas being reconstructed.
In this case you distribute *both* the reads and the writes across all
devices.

PGs will be moved around, which adds some read/write load on the
cluster (this is why this approach puts more IO pressure on the
cluster overall). One of the jobs of the CRUSH algorithm is to
minimize the amount of such movement.

That said, even though there are additional movements, they don't help
with redundancy: the only process that matters for redundancy is the
missing replica being rebuilt for each PG in a degraded state, which
should be far faster than what RAID5 allows (if Ceph prioritizes the
backfills and recoveries that move PGs from degraded to clean, which I
suppose it does but can't find a reference for, then replace "should
be far faster" with "is far faster").

Best regards,

Lionel