Re: [ceph-users] 2x replication: A BIG warning

2016-12-12 Thread Oliver Humpage

> On 12 Dec 2016, at 07:59, Wido den Hollander  wrote:
> 
> As David already said, when all OSDs are up and in for a PG Ceph will wait 
> for ALL OSDs to Ack the write. Writes in RADOS are always synchronous.

Apologies, I missed that.

Clearly I’ve been misunderstanding min_size for a while then: thanks for 
clearing it up. For most size=3 use cases it seems sensible to set it to 2 
rather than 1.

Oliver.


Re: [ceph-users] 2x replication: A BIG warning

2016-12-11 Thread Wido den Hollander

> On 9 December 2016 at 22:31, Oliver Humpage wrote:
> 
> 
> 
> > On 7 Dec 2016, at 15:01, Wido den Hollander  wrote:
> > 
> > I would always run with min_size = 2 and manually switch to min_size = 1 if 
> > the situation really requires it at that moment.
> 
> Thanks for this thread, it’s been really useful.
> 
> I might have misunderstood, but does min_size=2 also mean that writes have to 
> wait for at least 2 OSDs to have data written before the write is confirmed? 
> I always assumed this would have a noticeable effect on performance and so 
> left it at 1.
> 
> Our use case is RBDs being exported as iSCSI for ESXi. OSDs are journalled on 
> enterprise SSDs, servers are linked with 10Gb, and we’re generally getting 
> very acceptable speeds. Any idea as to how upping min_size to 2 might affect 
> things, or should we just try it and see?
> 

As David already said, when all OSDs are up and in for a PG, Ceph will wait for 
ALL OSDs to ack the write. Writes in RADOS are always synchronous.

Only when OSDs go down do you need at least min_size OSDs up before writes or 
reads are accepted.

So if min_size = 2 and size = 3 you need at least 2 OSDs online for I/O to take 
place.
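
To make that concrete, a quick sketch of checking and adjusting these settings
(the pool name 'rbd' is just an example):

    # show the current replication settings of a pool
    ceph osd pool get rbd size
    ceph osd pool get rbd min_size

    # with size = 3, keep min_size at 2 so a PG that is down to a single copy
    # stops serving I/O instead of accepting writes on one replica
    ceph osd pool set rbd min_size 2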

Wido

> Oliver.
> 


Re: [ceph-users] 2x replication: A BIG warning

2016-12-09 Thread David Turner
I'm pretty certain that the write returns as complete only after all active 
OSDs for a PG have completed the write regardless of min_size.



David Turner | Cloud Operations Engineer | StorageCraft Technology Corporation
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2760 | Mobile: 385.224.2943




From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Oliver 
Humpage [oli...@watershed.co.uk]
Sent: Friday, December 09, 2016 2:31 PM
To: ceph-us...@ceph.com
Subject: Re: [ceph-users] 2x replication: A BIG warning


On 7 Dec 2016, at 15:01, Wido den Hollander wrote:

I would always run with min_size = 2 and manually switch to min_size = 1 if the 
situation really requires it at that moment.

Thanks for this thread, it’s been really useful.

I might have misunderstood, but does min_size=2 also mean that writes have to 
wait for at least 2 OSDs to have data written before the write is confirmed? I 
always assumed this would have a noticeable effect on performance and so left 
it at 1.

Our use case is RBDs being exported as iSCSI for ESXi. OSDs are journalled on 
enterprise SSDs, servers are linked with 10Gb, and we’re generally getting very 
acceptable speeds. Any idea as to how upping min_size to 2 might affect things, 
or should we just try it and see?

Oliver.



Re: [ceph-users] 2x replication: A BIG warning

2016-12-09 Thread Oliver Humpage

> On 7 Dec 2016, at 15:01, Wido den Hollander  wrote:
> 
> I would always run with min_size = 2 and manually switch to min_size = 1 if 
> the situation really requires it at that moment.

Thanks for this thread, it’s been really useful.

I might have misunderstood, but does min_size=2 also mean that writes have to 
wait for at least 2 OSDs to have data written before the write is confirmed? I 
always assumed this would have a noticeable effect on performance and so left 
it at 1.

Our use case is RBDs being exported as iSCSI for ESXi. OSDs are journalled on 
enterprise SSDs, servers are linked with 10Gb, and we’re generally getting very 
acceptable speeds. Any idea as to how upping min_size to 2 might affect things, 
or should we just try it and see?

Oliver.



Re: [ceph-users] 2x replication: A BIG warning

2016-12-09 Thread Kees Meijs
Hi Wido,

Since it's a Friday night, I decided to just go for it. ;-)

It took a while to rebalance the cache tier but all went well. Thanks
again for your valuable advice!

Best regards, enjoy your weekend,
Kees

On 07-12-16 14:58, Wido den Hollander wrote:
>> Anyway, any things to consider or could we just:
>>
>>  1. Run "ceph osd pool set cache size 3".
>>  2. Wait for rebalancing to complete.
>>  3. Run "ceph osd pool set cache min_size 2".
>>
> Indeed! It is as simple as that.
>
> Your cache pool can also contain very valuable data you do not want to lose.



Re: [ceph-users] 2x replication: A BIG warning

2016-12-07 Thread Wido den Hollander

> On 7 December 2016 at 15:54, LOIC DEVULDER wrote:
> 
> 
> Hi Wido,
> 
> > As a Ceph consultant I get numerous calls throughout the year to help people
> > with getting their broken Ceph clusters back online.
> > 
> > The causes of downtime vary vastly, but one of the biggest causes is that
> > people use replication 2x. size = 2, min_size = 1.
> 
> We are building a Ceph cluster for our OpenStack and for data integrity 
> reasons we have chosen to set size=3. But we want to continue to access data 
> if 2 of our 3 OSD servers are dead, so we decided to set min_size=1.
> 
> Is it a (very) bad idea?
> 

I would say so. Yes, downtime is annoying on your cloud, but data loss is even 
worse, much worse.

I would always run with min_size = 2 and manually switch to min_size = 1 if the 
situation really requires it at that moment.

Losing two disks at the same time is something that doesn't happen that often, 
but if it does you don't want to modify any data on the only copy you still 
have left.

Setting min_size to 1 should be a manual action, imho, when size = 3 and you 
lose two copies. In that case YOU decide at that moment whether it is the right 
course of action.
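
If it ever comes to that, the intervention could look roughly like this (a
sketch; the pool name is an example):

    # emergency only: accept I/O on the last remaining copy of pool 'rbd'
    ceph osd pool set rbd min_size 1
    # ... let recovery restore the missing replicas ...
    ceph osd pool set rbd min_size 2   # put the safety net back afterwards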

Wido

> Regards / Cordialement,
> ___
> PSA Groupe
> Loïc Devulder (loic.devul...@mpsa.com)
> Senior Linux System Engineer / Linux HPC Specialist
> DF/DDCE/ISTA/DSEP/ULES - Linux Team
> BESSONCOURT / EXTENSION RIVE DROITE / B19
> Internal postal address: SX.BES.15
> Phone Incident - Level 3: 22 94 39
> Phone Incident - Level 4: 22 92 40
> Office: +33 (0)9 66 66 69 06 (27 69 06)
> Mobile: +33 (0)6 87 72 47 31


Re: [ceph-users] 2x replication: A BIG warning

2016-12-07 Thread LOIC DEVULDER
> -----Original Message-----
> From: Wido den Hollander [mailto:w...@42on.com]
> Sent: Wednesday, 7 December 2016 16:01
> To: ceph-us...@ceph.com; LOIC DEVULDER - U329683
> Subject: RE: [ceph-users] 2x replication: A BIG warning
> 
> 
> > On 7 December 2016 at 15:54, LOIC DEVULDER wrote:
> >
> >
> > Hi Wido,
> >
> > > As a Ceph consultant I get numerous calls throughout the year to
> > > help people with getting their broken Ceph clusters back online.
> > >
> > > The causes of downtime vary vastly, but one of the biggest causes is
> > > that people use replication 2x. size = 2, min_size = 1.
> >
> > We are building a Ceph cluster for our OpenStack and for data integrity
> reasons we have chosen to set size=3. But we want to continue to access
> data if 2 of our 3 OSD servers are dead, so we decided to set min_size=1.
> >
> > Is it a (very) bad idea?
> >
> 
> I would say so. Yes, downtime is annoying on your cloud, but data loss is
> even worse, much worse.
> 
> I would always run with min_size = 2 and manually switch to min_size = 1
> if the situation really requires it at that moment.
> 
> Losing two disks at the same time is something which doesn't happen that
> much, but if it happens you don't want to modify any data on the only copy
> which you still have left.
> 
> Setting min_size to 1 should be a manual action imho when size = 3 and you
> lose two copies. In that case YOU decide at that moment if it is the
> right course of action.
> 
> Wido

Thanks for your quick response!

That makes sense; I will try to convince my colleagues :-)

Loic


Re: [ceph-users] 2x replication: A BIG warning

2016-12-07 Thread LOIC DEVULDER
Hi Wido,

> As a Ceph consultant I get numerous calls throughout the year to help people
> with getting their broken Ceph clusters back online.
> 
> The causes of downtime vary vastly, but one of the biggest causes is that
> people use replication 2x. size = 2, min_size = 1.

We are building a Ceph cluster for our OpenStack and for data integrity reasons 
we have chosen to set size=3. But we want to continue to be able to access data 
if 2 of our 3 OSD servers are dead, so we decided to set min_size=1.

Is it a (very) bad idea?

Regards / Cordialement,
___
PSA Groupe
Loïc Devulder (loic.devul...@mpsa.com)
Senior Linux System Engineer / Linux HPC Specialist
DF/DDCE/ISTA/DSEP/ULES - Linux Team
BESSONCOURT / EXTENSION RIVE DROITE / B19
Internal postal address: SX.BES.15
Phone Incident - Level 3: 22 94 39
Phone Incident - Level 4: 22 92 40
Office: +33 (0)9 66 66 69 06 (27 69 06)
Mobile: +33 (0)6 87 72 47 31


Re: [ceph-users] 2x replication: A BIG warning

2016-12-07 Thread Peter Maloney
On 12/07/16 14:58, Wido den Hollander wrote:
>> On 7 December 2016 at 11:29, Kees Meijs wrote:
>>
>>
>> Hi Wido,
>>
>> Valid point. At this moment, we're using a cache pool with size = 2 and
>> would like to "upgrade" to size = 3.
>>
>> Again, you're absolutely right... ;-)
>>
>> Anyway, any things to consider or could we just:
>>
>>  1. Run "ceph osd pool set cache size 3".
>>  2. Wait for rebalancing to complete.
>>  3. Run "ceph osd pool set cache min_size 2".
>>
> Indeed! It is as simple as that.
>
> Your cache pool can also contain very valuable data you do not want to lose.
>
> Wido
Almost as simple as that...

First make sure there is free space. Then when you run it, also monitor
that there are no side effects... bad performance, blocked requests, etc.

And if there are issues, be ready to stop it with:
ceph osd set nobackfill
ceph osd set norecover

And then figure out some tuning... e.g. (very minimal settings):
# more than likely you can handle more than 1 on a small cluster, and maybe much more
ceph tell osd.* injectargs --osd_max_backfills=1
# manuals/emails/something I read suggest a number like 0.05 ... I find that does
# nothing in times of real trouble, but this really slows down recovery
ceph tell osd.* injectargs --osd_recovery_sleep=0.6
ceph osd set noscrub
ceph osd set nodeep-scrub
# also I think this one is highly relevant, but not sure what to suggest for it...
# others suggest 12[1] to 16[2], and so far I found 8 works better than 12-32
# for my frequent "blocked requests" small cluster
# --osd_op_threads=...

And then unset those flags to resume it. And when done, consider
unsetting your new settings (I would unset the noscrub at least).
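
For reference, resuming and cleaning up afterwards would be along these lines
(same flags as above):

    # resume backfill/recovery once things look stable again
    ceph osd unset nobackfill
    ceph osd unset norecover
    # and re-enable scrubbing when the dust has settled
    ceph osd unset noscrub
    ceph osd unset nodeep-scrub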

[1]
https://www.openstack.org/summit/vancouver-2015/summit-videos/presentation/a-year-with-cinder-and-ceph-at-twc
[2] http://www.spinics.net/lists/ceph-users/msg32368.html  somewhere in
this thread, but can't find it online  "We have recently increase osd op
threads from 2 (default value) to 16 because CPU usage on DN was very
low. We have the impression it has increased overall ceph cluster
performances and reduced block ops occurrences."

-- 


Peter Maloney
Brockmann Consult
Max-Planck-Str. 2
21502 Geesthacht
Germany
Tel: +49 4152 889 300
Fax: +49 4152 889 333
E-mail: peter.malo...@brockmann-consult.de
Internet: http://www.brockmann-consult.de




Re: [ceph-users] 2x replication: A BIG warning

2016-12-07 Thread Wido den Hollander

> On 7 December 2016 at 11:29, Kees Meijs wrote:
> 
> 
> Hi Wido,
> 
> Valid point. At this moment, we're using a cache pool with size = 2 and
> would like to "upgrade" to size = 3.
> 
> Again, you're absolutely right... ;-)
> 
> Anyway, any things to consider or could we just:
> 
>  1. Run "ceph osd pool set cache size 3".
>  2. Wait for rebalancing to complete.
>  3. Run "ceph osd pool set cache min_size 2".
> 

Indeed! It is as simple as that.

Your cache pool can also contain very valuable data you do not want to lose.
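
For completeness, the whole sequence might look like this (a rough sketch,
using the pool name 'cache' from the steps above):

    ceph osd pool set cache size 3       # start creating the third replica
    ceph -s                              # watch until all PGs are active+clean again
    ceph osd pool set cache min_size 2   # raise min_size only after rebalancing finishes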

Wido

> Thanks!
> 
> Regards,
> Kees
> 
> On 07-12-16 09:08, Wido den Hollander wrote:
> > As a Ceph consultant I get numerous calls throughout the year to help 
> > people with getting their broken Ceph clusters back online.
> >
> > The causes of downtime vary vastly, but one of the biggest causes is that 
> > people use replication 2x. size = 2, min_size = 1.
> >
> > In 2016 the amount of cases I have where data was lost due to these 
> > settings grew exponentially.
> >
> > Usually a disk failed, recovery kicks in and while recovery is happening a 
> > second disk fails. Causing PGs to become incomplete.
> >
> > There have been too many times where I had to use xfs_repair on broken disks 
> > and use ceph-objectstore-tool to export/import PGs.
> >
> > I really don't like these cases, mainly because they can be prevented 
> > easily by using size = 3 and min_size = 2 for all pools.
> >
> > With size = 2 you go into the danger zone as soon as a single disk/daemon 
> > fails. With size = 3 you always have two additional copies left thus 
> > keeping your data safe(r).
> >
> > If you are running CephFS, at least consider running the 'metadata' pool 
> > with size = 3 to keep the MDS happy.
> >
> > Please, let this be a big warning to everybody who is running with size = 
> > 2. The downtime and problems caused by missing objects/replicas are usually 
> > big and it takes days to recover from those. But very often data is lost 
> > and/or corrupted which causes even more problems.
> >
> > I can't stress this enough. Running with size = 2 in production is a 
> > SERIOUS hazard and should not be done imho.
> >
> > To anyone out there running with size = 2, please reconsider this!
> 


Re: [ceph-users] 2x replication: A BIG warning

2016-12-07 Thread Wido den Hollander

> On 7 December 2016 at 10:06, Dan van der Ster wrote:
> 
> 
> Hi Wido,
> 
> Thanks for the warning. We have one pool as you described (size 2,
> min_size 1), simply because 3 replicas would be too expensive and
> erasure coding didn't meet our performance requirements. We are well
> aware of the risks, but of course this is a balancing act between risk
> and cost.
> 

Well, that is good. You are aware of the risk.

> Anyway, I'm curious if you ever use
> osd_find_best_info_ignore_history_les in order to recover incomplete
> PGs (while accepting the possibility of data loss). I've used this on
> two colleagues' clusters over the past few months and as far as they
> could tell there was no detectable data loss in either case.
> 

No, not really. Most cases were a true drive failure and a second one during 
recovery. XFS was broken underneath.

> So I run with size = 2 because if something bad happens I'll
> re-activate the PG with osd_find_best_info_ignore_history_les, then
> re-scrub both within Ceph and via our external application.
> 
> Any thoughts on that?
> 

No real thoughts, but it will mainly be useful in a flapping case where an OSD 
might have outdated data, but that's still better than nothing.
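
For anyone curious, this is roughly how that option tends to be applied (my
sketch, not something from this thread; it can discard writes, so treat it
strictly as a last resort and verify against your Ceph version first):

    # ceph.conf on the host whose OSD holds the surviving copy of the
    # incomplete PG (osd.12 is a made-up example)
    [osd.12]
        osd find best info ignore history les = true
    # restart that OSD, let the PG peer and go active, then remove the
    # option again and restart once more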

Wido

> Cheers, Dan
> 
> P.S. we're going to retry erasure coding for this cluster in 2017,
> because clearly 4+2 or similar would be much safer than size 2,
> provided we can get the needed performance.
> 
> 
> 
> On Wed, Dec 7, 2016 at 9:08 AM, Wido den Hollander  wrote:
> > Hi,
> >
> > As a Ceph consultant I get numerous calls throughout the year to help 
> > people with getting their broken Ceph clusters back online.
> >
> > The causes of downtime vary vastly, but one of the biggest causes is that 
> > people use replication 2x. size = 2, min_size = 1.
> >
> > In 2016 the amount of cases I have where data was lost due to these 
> > settings grew exponentially.
> >
> > Usually a disk failed, recovery kicks in and while recovery is happening a 
> > second disk fails. Causing PGs to become incomplete.
> >
> > There have been too many times where I had to use xfs_repair on broken disks 
> > and use ceph-objectstore-tool to export/import PGs.
> >
> > I really don't like these cases, mainly because they can be prevented 
> > easily by using size = 3 and min_size = 2 for all pools.
> >
> > With size = 2 you go into the danger zone as soon as a single disk/daemon 
> > fails. With size = 3 you always have two additional copies left thus 
> > keeping your data safe(r).
> >
> > If you are running CephFS, at least consider running the 'metadata' pool 
> > with size = 3 to keep the MDS happy.
> >
> > Please, let this be a big warning to everybody who is running with size = 
> > 2. The downtime and problems caused by missing objects/replicas are usually 
> > big and it takes days to recover from those. But very often data is lost 
> > and/or corrupted which causes even more problems.
> >
> > I can't stress this enough. Running with size = 2 in production is a 
> > SERIOUS hazard and should not be done imho.
> >
> > To anyone out there running with size = 2, please reconsider this!
> >
> > Thanks,
> >
> > Wido


Re: [ceph-users] 2x replication: A BIG warning

2016-12-07 Thread Дмитрий Глушенок
Hi,

The assumptions are:
- the OSD is nearly full
- the HDD vendor is not hiding a much better real LSE (latent sector error) 
rate, like 1 in 10^18, behind the spec of "not more than 1 unrecoverable error 
in 10^15 bits read"

When a disk (OSD) fails, Ceph has to read the copies of its data from other 
nodes (to restore redundancy). The more you read, the higher the chance that an 
LSE will occur on one of the nodes Ceph is reading from (all of them have the 
same LSE rate). When an LSE hits, Ceph cannot recover the data because the error 
is unrecoverable and there is no other place to read the data from (in contrast 
to size=3, where the third copy can be used to recover from the error).

Here is a good paper about LSE influence on RAID5: 
http://www.snia.org/sites/default/orig/sdc_archives/2010_presentations/tuesday/JasonResch_%20Solving-Data-Loss.pdf

> On 7 Dec 2016, at 15:07, Wolfgang Link wrote:
> 
> Hi
> 
> I'm very interested in this calculation.
> What assumption do you have done?
> Network speed, osd degree of fulfilment, etc?
> 
> Thanks
> 
> Wolfgang
> 
> On 12/07/2016 11:16 AM, Дмитрий Глушенок wrote:
>> Hi,
>> 
>> Let me add a little math to your warning: with LSE rate of 1 in 10^15 on
>> modern 8 TB disks there is 5,8% chance to hit LSE during recovery of 8
>> TB disk. So, every 18th recovery will probably fail. Similarly to RAID6
>> (two parity disks) size=3 mitigates the problem.
>> By the way - why it is a common opinion that using RAID (RAID6) with
>> Ceph (size=2) is bad idea? It is cheaper than size=3, all hardware disk
>> errors are handled by RAID (instead of OS/Ceph), decreases OSD count,
>> adds some battery-backed cache and increases performance of single OSD.
>> 
>>> On 7 Dec 2016, at 11:08, Wido den Hollander wrote:
>>> 
>>> Hi,
>>> 
>>> As a Ceph consultant I get numerous calls throughout the year to help
>>> people with getting their broken Ceph clusters back online.
>>> 
>>> The causes of downtime vary vastly, but one of the biggest causes is
>>> that people use replication 2x. size = 2, min_size = 1.
>>> 
>>> In 2016 the amount of cases I have where data was lost due to these
>>> settings grew exponentially.
>>> 
>>> Usually a disk failed, recovery kicks in and while recovery is
>>> happening a second disk fails. Causing PGs to become incomplete.
>>> 
>>> There have been too many times where I had to use xfs_repair on broken
>>> disks and use ceph-objectstore-tool to export/import PGs.
>>> 
>>> I really don't like these cases, mainly because they can be prevented
>>> easily by using size = 3 and min_size = 2 for all pools.
>>> 
>>> With size = 2 you go into the danger zone as soon as a single
>>> disk/daemon fails. With size = 3 you always have two additional copies
>>> left thus keeping your data safe(r).
>>> 
>>> If you are running CephFS, at least consider running the 'metadata'
>>> pool with size = 3 to keep the MDS happy.
>>> 
>>> Please, let this be a big warning to everybody who is running with
>>> size = 2. The downtime and problems caused by missing objects/replicas
>>> are usually big and it takes days to recover from those. But very
>>> often data is lost and/or corrupted which causes even more problems.
>>> 
>>> I can't stress this enough. Running with size = 2 in production is a
>>> SERIOUS hazard and should not be done imho.
>>> 
>>> To anyone out there running with size = 2, please reconsider this!
>>> 
>>> Thanks,
>>> 
>>> Wido
>> 
>> --
>> Dmitry Glushenok
>> Jet Infosystems
>> 
>> 
>> 
> 

--
Dmitry Glushenok
Jet Infosystems



Re: [ceph-users] 2x replication: A BIG warning

2016-12-07 Thread Christian Balzer

Hello,

On Wed, 7 Dec 2016 14:49:28 +0300 Дмитрий Глушенок wrote:

> RAID10 will also suffer from LSE on big disks, won't it?
>
If LSE stands for latent sector error, then yes, but that's not limited 
to large disks per se.

And you counter it by having another replica and checksums like in ZFS
or hopefully in Bluestore.
The current scrubbing in Ceph is a pretty weak defense against it unless
it's 100% clear from drive SMART checks which one is the bad source.
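
As a sketch of what chasing such an inconsistency can look like on a
Jewel-era cluster (the PG id and device are made up):

    ceph health detail | grep inconsistent                  # find PGs with scrub errors
    rados list-inconsistent-obj 2.1a --format=json-pretty   # see which copies disagree
    smartctl -a /dev/sdc                                    # check SMART on the suspect drive
    ceph pg repair 2.1a            # only once you trust the copy that will win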

Christian
 
> > On 7 Dec 2016, at 13:35, Christian Balzer wrote:
> > 
> > 
> > 
> > Hello,
> > 
> > On Wed, 7 Dec 2016 13:16:45 +0300 Дмитрий Глушенок wrote:
> > 
> >> Hi,
> >> 
> >> Let me add a little math to your warning: with LSE rate of 1 in 10^15 on 
> >> modern 8 TB disks there is 5,8% chance to hit LSE during recovery of 8 TB 
> >> disk. So, every 18th recovery will probably fail. Similarly to RAID6 (two 
> >> parity disks) size=3 mitigates the problem.
> > 
> > Indeed.
> > That math changes significantly of course if you have very reliable,
> > endurable, well monitored and fast SSDs of not too big a size.
> > Something that will recover in less than hour.
> > 
> > So people with SSD pools might have an acceptable risk.
> > 
> > That being said, I'd prefer size 3 for my SSD pool as well, alas both cost
> > and the increased latency stopped me for this time.
> > Next round I'll upgrade my HW requirements and budget.
> > 
> >> By the way - why it is a common opinion that using RAID (RAID6) with Ceph 
> >> (size=2) is bad idea? It is cheaper than size=3, all hardware disk errors 
> >> are handled by RAID (instead of OS/Ceph), decreases OSD count, adds some 
> >> battery-backed cache and increases performance of single OSD.
> >> 
> > 
> > I did run something like that and if your IOPS needs are low enough it
> > works well (the larger HW cache the better).
> > But once you exceed the combined speed of HW cache coalescing, it degrades
> > badly, something that's usually triggered by very mixed R/W ops and/or
> > deep scrubs.
> > It also depends on your cluster size, if you have dozens of OSDs based on
> > such a design, it will work a lot better than with a few.
> > 
> > I changed it to RAID10s with 4 HDDs each since I needed the speed (IOPS)
> > and didn't require all the space.
> > 
> > Christian
> > 
> >>> On 7 Dec 2016, at 11:08, Wido den Hollander wrote:
> >>> 
> >>> Hi,
> >>> 
> >>> As a Ceph consultant I get numerous calls throughout the year to help 
> >>> people with getting their broken Ceph clusters back online.
> >>> 
> >>> The causes of downtime vary vastly, but one of the biggest causes is that 
> >>> people use replication 2x. size = 2, min_size = 1.
> >>> 
> >>> In 2016 the amount of cases I have where data was lost due to these 
> >>> settings grew exponentially.
> >>> 
> >>> Usually a disk failed, recovery kicks in and while recovery is happening 
> >>> a second disk fails. Causing PGs to become incomplete.
> >>> 
> >>> There have been too many times where I had to use xfs_repair on broken 
> >>> disks and use ceph-objectstore-tool to export/import PGs.
> >>> 
> >>> I really don't like these cases, mainly because they can be prevented 
> >>> easily by using size = 3 and min_size = 2 for all pools.
> >>> 
> >>> With size = 2 you go into the danger zone as soon as a single disk/daemon 
> >>> fails. With size = 3 you always have two additional copies left thus 
> >>> keeping your data safe(r).
> >>> 
> >>> If you are running CephFS, at least consider running the 'metadata' pool 
> >>> with size = 3 to keep the MDS happy.
> >>> 
> >>> Please, let this be a big warning to everybody who is running with size = 
> >>> 2. The downtime and problems caused by missing objects/replicas are 
> >>> usually big and it takes days to recover from those. But very often data 
> >>> is lost and/or corrupted which causes even more problems.
> >>> 
> >>> I can't stress this enough. Running with size = 2 in production is a 
> >>> SERIOUS hazard and should not be done imho.
> >>> 
> >>> To anyone out there running with size = 2, please reconsider this!
> >>> 
> >>> Thanks,
> >>> 
> >>> Wido
> >> 
> >> --
> >> Dmitry Glushenok
> >> Jet Infosystems
> >> 
> > 
> > 
> > -- 
> > Christian BalzerNetwork/Systems Engineer
> > ch...@gol.com Global OnLine Japan/Rakuten 
> > Communications
> > http://www.gol.com/ 
> --
> Dmitry Glushenok
> Jet Infosystems
> +7-910-453-2568
> 


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/


Re: [ceph-users] 2x replication: A BIG warning

2016-12-07 Thread Wolfgang Link
Hi

I'm very interested in this calculation.
What assumptions did you make?
Network speed, OSD fill level, etc.?

Thanks

Wolfgang

On 12/07/2016 11:16 AM, Дмитрий Глушенок wrote:
> Hi,
> 
> Let me add a little math to your warning: with LSE rate of 1 in 10^15 on
> modern 8 TB disks there is 5,8% chance to hit LSE during recovery of 8
> TB disk. So, every 18th recovery will probably fail. Similarly to RAID6
> (two parity disks) size=3 mitigates the problem.
> By the way - why it is a common opinion that using RAID (RAID6) with
> Ceph (size=2) is bad idea? It is cheaper than size=3, all hardware disk
> errors are handled by RAID (instead of OS/Ceph), decreases OSD count,
> adds some battery-backed cache and increases performance of single OSD.
> 
>> On 7 Dec 2016, at 11:08, Wido den Hollander wrote:
>>
>> Hi,
>>
>> As a Ceph consultant I get numerous calls throughout the year to help
>> people with getting their broken Ceph clusters back online.
>>
>> The causes of downtime vary vastly, but one of the biggest causes is
>> that people use replication 2x. size = 2, min_size = 1.
>>
>> In 2016 the amount of cases I have where data was lost due to these
>> settings grew exponentially.
>>
>> Usually a disk failed, recovery kicks in and while recovery is
>> happening a second disk fails. Causing PGs to become incomplete.
>>
>> There have been too many times where I had to use xfs_repair on broken
>> disks and use ceph-objectstore-tool to export/import PGs.
>>
>> I really don't like these cases, mainly because they can be prevented
>> easily by using size = 3 and min_size = 2 for all pools.
>>
>> With size = 2 you go into the danger zone as soon as a single
>> disk/daemon fails. With size = 3 you always have two additional copies
>> left thus keeping your data safe(r).
>>
>> If you are running CephFS, at least consider running the 'metadata'
>> pool with size = 3 to keep the MDS happy.
>>
>> Please, let this be a big warning to everybody who is running with
>> size = 2. The downtime and problems caused by missing objects/replicas
>> are usually big and it takes days to recover from those. But very
>> often data is lost and/or corrupted which causes even more problems.
>>
>> I can't stress this enough. Running with size = 2 in production is a
>> SERIOUS hazard and should not be done imho.
>>
>> To anyone out there running with size = 2, please reconsider this!
>>
>> Thanks,
>>
>> Wido
> 
> --
> Dmitry Glushenok
> Jet Infosystems
> 
> 
> 
> 



Re: [ceph-users] 2x replication: A BIG warning

2016-12-07 Thread Дмитрий Глушенок
RAID10 will also suffer from LSE on big disks, won't it?

> On 7 Dec 2016, at 13:35, Christian Balzer wrote:
> 
> 
> 
> Hello,
> 
> On Wed, 7 Dec 2016 13:16:45 +0300 Дмитрий Глушенок wrote:
> 
>> Hi,
>> 
>> Let me add a little math to your warning: with LSE rate of 1 in 10^15 on 
>> modern 8 TB disks there is 5,8% chance to hit LSE during recovery of 8 TB 
>> disk. So, every 18th recovery will probably fail. Similarly to RAID6 (two 
>> parity disks) size=3 mitigates the problem.
> 
> Indeed.
> That math changes significantly of course if you have very reliable,
> endurable, well monitored and fast SSDs of not too big a size.
> Something that will recover in less than hour.
> 
> So people with SSD pools might have an acceptable risk.
> 
> That being said, I'd prefer size 3 for my SSD pool as well, alas both cost
> and the increased latency stopped me for this time.
> Next round I'll upgrade my HW requirements and budget.
> 
>> By the way - why it is a common opinion that using RAID (RAID6) with Ceph 
>> (size=2) is bad idea? It is cheaper than size=3, all hardware disk errors 
>> are handled by RAID (instead of OS/Ceph), decreases OSD count, adds some 
>> battery-backed cache and increases performance of single OSD.
>> 
> 
> I did run something like that and if your IOPS needs are low enough it
> works well (the larger HW cache the better).
> But once you exceed the combined speed of HW cache coalescing, it degrades
> badly, something that's usually triggered by very mixed R/W ops and/or
> deep scrubs.
> It also depends on your cluster size, if you have dozens of OSDs based on
> such a design, it will work a lot better than with a few.
> 
> I changed it to RAID10s with 4 HDDs each since I needed the speed (IOPS)
> and didn't require all the space.
> 
> Christian
> 
>>> On 7 Dec 2016, at 11:08, Wido den Hollander wrote:
>>> 
>>> Hi,
>>> 
>>> As a Ceph consultant I get numerous calls throughout the year to help 
>>> people with getting their broken Ceph clusters back online.
>>> 
>>> The causes of downtime vary vastly, but one of the biggest causes is that 
>>> people use replication 2x. size = 2, min_size = 1.
>>> 
>>> In 2016 the amount of cases I have where data was lost due to these 
>>> settings grew exponentially.
>>> 
>>> Usually a disk failed, recovery kicks in and while recovery is happening a 
>>> second disk fails. Causing PGs to become incomplete.
>>> 
>>> There have been too many times where I had to use xfs_repair on broken disks 
>>> and use ceph-objectstore-tool to export/import PGs.
>>> 
>>> I really don't like these cases, mainly because they can be prevented 
>>> easily by using size = 3 and min_size = 2 for all pools.
>>> 
>>> With size = 2 you go into the danger zone as soon as a single disk/daemon 
>>> fails. With size = 3 you always have two additional copies left thus 
>>> keeping your data safe(r).
>>> 
>>> If you are running CephFS, at least consider running the 'metadata' pool 
>>> with size = 3 to keep the MDS happy.
>>> 
>>> Please, let this be a big warning to everybody who is running with size = 
>>> 2. The downtime and problems caused by missing objects/replicas are usually 
>>> big and it takes days to recover from those. But very often data is lost 
>>> and/or corrupted which causes even more problems.
>>> 
>>> I can't stress this enough. Running with size = 2 in production is a 
>>> SERIOUS hazard and should not be done imho.
>>> 
>>> To anyone out there running with size = 2, please reconsider this!
>>> 
>>> Thanks,
>>> 
>>> Wido
>> 
>> --
>> Dmitry Glushenok
>> Jet Infosystems
>> 
> 
> 
> -- 
> Christian BalzerNetwork/Systems Engineer
> ch...@gol.com   Global OnLine Japan/Rakuten 
> Communications
> http://www.gol.com/ 
--
Dmitry Glushenok
Jet Infosystems
+7-910-453-2568



Re: [ceph-users] 2x replication: A BIG warning

2016-12-07 Thread Christian Balzer


Hello,

On Wed, 7 Dec 2016 13:16:45 +0300 Дмитрий Глушенок wrote:

> Hi,
> 
> Let me add a little math to your warning: with LSE rate of 1 in 10^15 on 
> modern 8 TB disks there is 5,8% chance to hit LSE during recovery of 8 TB 
> disk. So, every 18th recovery will probably fail. Similarly to RAID6 (two 
> parity disks) size=3 mitigates the problem.

Indeed.
That math changes significantly, of course, if you have very reliable,
endurable, well-monitored and fast SSDs of not too big a size: something
that will recover in less than an hour.

So people with SSD pools might have an acceptable risk.

That being said, I'd prefer size 3 for my SSD pool as well; alas, both cost
and the increased latency stopped me this time.
Next round I'll upgrade my HW requirements and budget.

> By the way - why it is a common opinion that using RAID (RAID6) with Ceph 
> (size=2) is bad idea? It is cheaper than size=3, all hardware disk errors are 
> handled by RAID (instead of OS/Ceph), decreases OSD count, adds some 
> battery-backed cache and increases performance of single OSD.
>

I did run something like that and if your IOPS needs are low enough it
works well (the larger HW cache the better).
But once you exceed the combined speed of HW cache coalescing, it degrades
badly, something that's usually triggered by very mixed R/W ops and/or
deep scrubs.
It also depends on your cluster size, if you have dozens of OSDs based on
such a design, it will work a lot better than with a few.

I changed it to RAID10s with 4 HDDs each since I needed the speed (IOPS)
and didn't require all the space.

Christian
 
> > On 7 Dec 2016, at 11:08, Wido den Hollander wrote:
> > 
> > Hi,
> > 
> > As a Ceph consultant I get numerous calls throughout the year to help 
> > people with getting their broken Ceph clusters back online.
> > 
> > The causes of downtime vary vastly, but one of the biggest causes is that 
> > people use replication 2x. size = 2, min_size = 1.
> > 
> > In 2016 the amount of cases I have where data was lost due to these 
> > settings grew exponentially.
> > 
> > Usually a disk failed, recovery kicks in and while recovery is happening a 
> > second disk fails. Causing PGs to become incomplete.
> > 
> > There have been too many times where I had to use xfs_repair on broken disks 
> > and use ceph-objectstore-tool to export/import PGs.
> > 
> > I really don't like these cases, mainly because they can be prevented 
> > easily by using size = 3 and min_size = 2 for all pools.
> > 
> > With size = 2 you go into the danger zone as soon as a single disk/daemon 
> > fails. With size = 3 you always have two additional copies left thus 
> > keeping your data safe(r).
> > 
> > If you are running CephFS, at least consider running the 'metadata' pool 
> > with size = 3 to keep the MDS happy.
> > 
> > Please, let this be a big warning to everybody who is running with size = 
> > 2. The downtime and problems caused by missing objects/replicas are usually 
> > big and it takes days to recover from those. But very often data is lost 
> > and/or corrupted which causes even more problems.
> > 
> > I can't stress this enough. Running with size = 2 in production is a 
> > SERIOUS hazard and should not be done imho.
> > 
> > To anyone out there running with size = 2, please reconsider this!
> > 
> > Thanks,
> > 
> > Wido
> 
> --
> Dmitry Glushenok
> Jet Infosystems
> 


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/


Re: [ceph-users] 2x replication: A BIG warning

2016-12-07 Thread Kees Meijs
Hi Wido,

Valid point. At this moment, we're using a cache pool with size = 2 and
would like to "upgrade" to size = 3.

Again, you're absolutely right... ;-)

Anyway, anything to consider, or could we just:

 1. Run "ceph osd pool set cache size 3".
 2. Wait for rebalancing to complete.
 3. Run "ceph osd pool set cache min_size 2".

Thanks!

Regards,
Kees

On 07-12-16 09:08, Wido den Hollander wrote:
> As a Ceph consultant I get numerous calls throughout the year to help people 
> with getting their broken Ceph clusters back online.
>
> The causes of downtime vary vastly, but one of the biggest causes is that 
> people use replication 2x. size = 2, min_size = 1.
>
> In 2016 the amount of cases I have where data was lost due to these settings 
> grew exponentially.
>
> Usually a disk failed, recovery kicks in and while recovery is happening a 
> second disk fails. Causing PGs to become incomplete.
>
> There have been too many times where I had to use xfs_repair on broken disks 
> and use ceph-objectstore-tool to export/import PGs.
>
> I really don't like these cases, mainly because they can be prevented easily 
> by using size = 3 and min_size = 2 for all pools.
>
> With size = 2 you go into the danger zone as soon as a single disk/daemon 
> fails. With size = 3 you always have two additional copies left thus keeping 
> your data safe(r).
>
> If you are running CephFS, at least consider running the 'metadata' pool with 
> size = 3 to keep the MDS happy.
>
> Please, let this be a big warning to everybody who is running with size = 2. 
> The downtime and problems caused by missing objects/replicas are usually big 
> and it takes days to recover from those. But very often data is lost and/or 
> corrupted which causes even more problems.
>
> I can't stress this enough. Running with size = 2 in production is a SERIOUS 
> hazard and should not be done imho.
>
> To anyone out there running with size = 2, please reconsider this!



Re: [ceph-users] 2x replication: A BIG warning

2016-12-07 Thread Дмитрий Глушенок
Hi,

Let me add a little math to your warning: with an LSE rate of 1 in 10^15 on 
modern 8 TB disks there is a 5.8% chance to hit an LSE during recovery of an 
8 TB disk. So roughly every 18th recovery will fail. Similarly to RAID6 (two 
parity disks), size=3 mitigates the problem.
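
A quick back-of-the-envelope check of that figure (my approximation: treat
every bit read as an independent 10^-15 failure chance):

    python3 -c 'import math; print("%.1f%%" % (100*(1 - math.exp(-8e12*8*1e-15))))'
    # prints 6.2% -- the same ballpark as 5.8%; the exact number depends on how
    # you count the bits read and on the drive's real error rate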
By the way, why is it a common opinion that using RAID (RAID6) with Ceph 
(size=2) is a bad idea? It is cheaper than size=3, all hardware disk errors are 
handled by the RAID controller (instead of OS/Ceph), it decreases the OSD count, 
adds some battery-backed cache and increases the performance of a single OSD.

> On 7 Dec 2016, at 11:08, Wido den Hollander wrote:
> 
> Hi,
> 
> As a Ceph consultant I get numerous calls throughout the year to help people 
> with getting their broken Ceph clusters back online.
> 
> The causes of downtime vary vastly, but one of the biggest causes is that 
> people use replication 2x. size = 2, min_size = 1.
> 
> In 2016 the amount of cases I have where data was lost due to these settings 
> grew exponentially.
> 
> Usually a disk failed, recovery kicks in and while recovery is happening a 
> second disk fails. Causing PGs to become incomplete.
> 
> There have been too many times where I had to use xfs_repair on broken disks 
> and use ceph-objectstore-tool to export/import PGs.
> 
> I really don't like these cases, mainly because they can be prevented easily 
> by using size = 3 and min_size = 2 for all pools.
> 
> With size = 2 you go into the danger zone as soon as a single disk/daemon 
> fails. With size = 3 you always have two additional copies left thus keeping 
> your data safe(r).
> 
> If you are running CephFS, at least consider running the 'metadata' pool with 
> size = 3 to keep the MDS happy.
> 
> Please, let this be a big warning to everybody who is running with size = 2. 
> The downtime and problems caused by missing objects/replicas are usually big 
> and it takes days to recover from those. But very often data is lost and/or 
> corrupted which causes even more problems.
> 
> I can't stress this enough. Running with size = 2 in production is a SERIOUS 
> hazard and should not be done imho.
> 
> To anyone out there running with size = 2, please reconsider this!
> 
> Thanks,
> 
> Wido

--
Dmitry Glushenok
Jet Infosystems



Re: [ceph-users] 2x replication: A BIG warning

2016-12-07 Thread Dan van der Ster
Hi Wido,

Thanks for the warning. We have one pool as you described (size 2,
min_size 1), simply because 3 replicas would be too expensive and
erasure coding didn't meet our performance requirements. We are well
aware of the risks, but of course this is a balancing act between risk
and cost.

Anyway, I'm curious if you ever use
osd_find_best_info_ignore_history_les in order to recover incomplete
PGs (while accepting the possibility of data loss). I've used this on
two colleagues' clusters over the past few months and as far as they
could tell there was no detectable data loss in either case.

So I run with size = 2 because if something bad happens I'll
re-activate the PG with osd_find_best_info_ignore_history_les, then
re-scrub both within Ceph and via our external application.

Any thoughts on that?

Cheers, Dan

P.S. we're going to retry erasure coding for this cluster in 2017,
because clearly 4+2 or similar would be much safer than size 2,
provided we can get the needed performance.
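
For reference, a 4+2 erasure-coded pool is created along these lines (a
sketch; the profile name, pool name and PG count are made up):

    ceph osd erasure-code-profile set ec42 k=4 m=2 ruleset-failure-domain=host
    ceph osd pool create mypool_ec 1024 1024 erasure ec42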



On Wed, Dec 7, 2016 at 9:08 AM, Wido den Hollander  wrote:
> Hi,
>
> As a Ceph consultant I get numerous calls throughout the year to help people 
> with getting their broken Ceph clusters back online.
>
> The causes of downtime vary vastly, but one of the biggest causes is that 
> people use replication 2x. size = 2, min_size = 1.
>
> In 2016 the amount of cases I have where data was lost due to these settings 
> grew exponentially.
>
> Usually a disk failed, recovery kicks in and while recovery is happening a 
> second disk fails. Causing PGs to become incomplete.
>
> There have been too many times where I had to use xfs_repair on broken disks 
> and use ceph-objectstore-tool to export/import PGs.
>
> I really don't like these cases, mainly because they can be prevented easily 
> by using size = 3 and min_size = 2 for all pools.
>
> With size = 2 you go into the danger zone as soon as a single disk/daemon 
> fails. With size = 3 you always have two additional copies left thus keeping 
> your data safe(r).
>
> If you are running CephFS, at least consider running the 'metadata' pool with 
> size = 3 to keep the MDS happy.
>
> Please, let this be a big warning to everybody who is running with size = 2. 
> The downtime and problems caused by missing objects/replicas are usually big 
> and it takes days to recover from those. But very often data is lost and/or 
> corrupted which causes even more problems.
>
> I can't stress this enough. Running with size = 2 in production is a SERIOUS 
> hazard and should not be done imho.
>
> To anyone out there running with size = 2, please reconsider this!
>
> Thanks,
>
> Wido


[ceph-users] 2x replication: A BIG warning

2016-12-07 Thread Wido den Hollander
Hi,

As a Ceph consultant I get numerous calls throughout the year to help people 
with getting their broken Ceph clusters back online.

The causes of downtime vary vastly, but one of the biggest causes is that 
people use replication 2x. size = 2, min_size = 1.

In 2016 the number of cases I saw where data was lost due to these settings 
grew exponentially.

Usually a disk fails, recovery kicks in, and while recovery is happening a 
second disk fails, causing PGs to become incomplete.

There have been too many times where I had to use xfs_repair on broken disks and 
use ceph-objectstore-tool to export/import PGs.

I really don't like these cases, mainly because they can be prevented easily by 
using size = 3 and min_size = 2 for all pools.

With size = 2 you go into the danger zone as soon as a single disk/daemon 
fails. With size = 3 you always have two additional copies left thus keeping 
your data safe(r).

If you are running CephFS, at least consider running the 'metadata' pool with 
size = 3 to keep the MDS happy.
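
As an example, bumping an existing CephFS metadata pool (substitute whatever
name you created it with; 'cephfs_metadata' here is an assumption):

    ceph osd pool set cephfs_metadata size 3
    ceph osd pool set cephfs_metadata min_size 2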

Please, let this be a big warning to everybody who is running with size = 2. 
The downtime and problems caused by missing objects/replicas are usually big 
and it takes days to recover from those. But very often data is lost and/or 
corrupted which causes even more problems.

I can't stress this enough. Running with size = 2 in production is a SERIOUS 
hazard and should not be done imho.

To anyone out there running with size = 2, please reconsider this!

Thanks,

Wido