Hi Loic,

Thanks for the reply and the interesting discussion.
On 26 August 2014 23:25, Loic Dachary <l...@dachary.org> wrote:
> Each time an OSD is lost, there is a 0.001*0.001 = 0.000001% chance that two
> other disks are lost before recovery. Since the disk that failed initially
> participates in 100 PG, that is 0.000001% x 100 = 0.0001% chance that a PG is
> lost.

Seems okay, so you're just taking the max PG spread as the worst case (noting, as demonstrated with my numbers, that the spread could be lower).

...actually, I could be way off here, but if the chance of any one disk failing in that window is 0.0001%, then assuming the first failure has already happened I'd have thought it would be more like:

(0.0001% / 2) * 99 * (0.0001% / 2) * 98 ?

That is, you're essentially calculating the probability of one more disk out of the remaining 99 failing, and then another out of the remaining 98 (and so on), within the repair window - dividing by the number of remaining replicas for which the probability is being calculated, since otherwise you'd be counting their chance of failure in the recovery window multiple times. And of course this all assumes the recovery continues gracefully from the remaining replica/s when another failure occurs...?

Taking your follow-up correcting the base chance of failure into account, that works out to roughly:

99 * (1/100000) * 98 * (1/100000) = 9.702e-7, i.e. about 1 in 1030715

(or around a quarter of that if you include the halving factors above - there's a rough sketch of this arithmetic further down).

I'm also skeptical about the 1h recovery time - at the very least the issues around stalling client ops come into play here and may push max_backfills down for operational reasons (after all, you can't have a general purpose volume storage service that periodically spikes latency due to normal operational tasks like recoveries).

> Or the entire pool if it is used in a way that losing a PG means losing all
> data in the pool (as in your example, where it contains RBD volumes and each
> of the RBD volume uses all the available PG).

Well, there's actually another whole interesting conversation in here - assuming a decent filesystem is sitting on top of those RBDs, it should be possible to get those filesystems back into working order and identify any lost inodes, and then, if you've got one, you can recover from tape backup. BUT, if you have just one pool for these RBDs spread over the entire cluster, then the amount of work to do that fsck-ing quickly becomes problematic - you'd have to fsck every RBD! So I wonder if there is cause for partitioning large clusters into multiple pools, so that such a failure would (hopefully) have a more limited scope (the second sketch below puts some toy numbers on this). Backups for DR purposes are only worth having (and paying for) if the DR plan is actually practical.
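For what it's worth, here's a very rough back-of-the-envelope sketch (Python) of that failure arithmetic. The cluster size, the per-disk failure chance within the recovery window, and the recovery windows themselves are all assumed, illustrative numbers rather than measurements:

    # Rough sketch of the PG-loss arithmetic above. All inputs are assumptions
    # chosen for illustration (100 OSDs, a 1-in-100000 chance of a given disk
    # failing within the recovery window); none of this is measured data.

    osds = 100
    p_fail = 1.0 / 100000   # assumed per-disk failure chance during recovery

    # One more failure out of the remaining 99 disks, then another out of the
    # remaining 98, within the same recovery window:
    p_pg_loss = (osds - 1) * p_fail * (osds - 2) * p_fail
    print("P(PG loss) ~ %.3e (1 in %.0f)" % (p_pg_loss, 1.0 / p_pg_loss))
    # -> ~9.702e-07, i.e. roughly 1 in 1030715; including the halving factors
    # discussed above would bring it down to ~2.4e-07.

    # If recovery actually takes longer than an hour (say because max_backfills
    # is wound down to protect client latency), then assuming a constant
    # per-hour failure rate the per-disk chance of failing inside the window
    # grows roughly linearly, and the PG-loss chance with its square:
    for hours in (1, 4, 24):
        p = p_fail * hours
        est = (osds - 1) * p * (osds - 2) * p
        print("%2dh recovery window: P(PG loss) ~ %.3e" % (hours, est))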
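And a similarly rough sketch of the blast-radius point about pools - the image and pool counts below are made up, the point is just that a large RBD image touches essentially every PG in its pool, so losing one PG in a single cluster-wide pool means fsck-ing every image, while splitting images across pools caps the damage at one pool's worth:

    # Illustrative only: how many RBD images would need an fsck after losing a
    # single PG, if the same number of images were spread evenly over a given
    # number of pools. A large image stripes over (essentially) all PGs in its
    # pool, so a lost PG touches every image in that pool. The image count is
    # made up.

    total_images = 2000

    for pools in (1, 4, 16):
        images_per_pool = total_images // pools
        print("%2d pool(s): a lost PG means fsck-ing ~%d images"
              % (pools, images_per_pool))
    # With one cluster-wide pool that's all 2000 images; with 16 pools it's
    # ~125, which is a far more practical DR exercise.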
> If the pool is using at least two datacenters operated by two different
> organizations, this calculation makes sense to me. However, if the cluster is
> in a single datacenter, isn't it possible that some event independent of Ceph
> has a greater probability of permanently destroying the data ? A month ago I
> lost three machines in a Ceph cluster and realized on that occasion that the
> crushmap was not configured properly and that PG were lost as a result.
> Fortunately I was able to recover the disks and plug them in another machine
> to recover the lost PGs. I'm not a system administrator and the probability
> of me failing to do the right thing is higher than normal: this is just an
> example of a high probability event leading to data loss. In other words, I
> wonder if this 0.0001% chance of losing a PG within the hour following a disk
> failure matters or if it is dominated by other factors. What do you think ?

I wouldn't expect that number to be dominated by the chances of total-loss/Godzilla events, but I'm no datacentre reliability guru (at least we don't have Godzilla here in Melbourne yet, anyway). I couldn't quickly find any stats on "one-in-one-hundred-year" events that might actually destroy a datacentre. Availability is another question altogether - as you probably know, the Uptime Institute has specific figures for tiers 1-4. But in my mind you should expect datacentre power outages as an operational (rather than disaster) event, and you'd want your Ceph cluster to survive them unscathed.

If that Copysets paper mentioned a while ago has any merit (see http://hackingdistributed.com/2014/02/14/chainsets/ for more on that), then it seems the chances of drive loss following an availability event are much higher than normal.

--
Cheers,
~Blairo