Re: [ceph-users] What happens if all replica OSDs journals are broken?

2016-12-14 Thread Kevin Olbrich
2016-12-14 2:37 GMT+01:00 Christian Balzer :

>
> Hello,
>

Hi!


>
> On Wed, 14 Dec 2016 00:06:14 +0100 Kevin Olbrich wrote:
>
> > Ok, thanks for your explanation!
> > I read those warnings about size 2 + min_size 1 (we are using ZFS as RAID6,
> > called raidz2) as OSDs.
> >
> This is similar to my RAID6 or RAID10 backed OSDs with regards to having
> very resilient, extremely unlikely to fail OSDs.
>

This was our intention (unlikely to fail, data security > performance).
We use Ceph for OpenStack (Cinder RBD).


> As such a Ceph replication of 2 with min_size 1 is a calculated risk,
> acceptable for me and others in certain use cases.
> This is also with very few (2-3) journals per SSD.
>

We are running 14x 500G as ZFS RAID6 (raidz2) per host (1x journal, 1x OSD,
32 GB RAM).
The ZFS pools use an L2ARC cache on 128 GB Samsung 850 PROs.
Hint: that was a bad idea; it would have been better to split the ZFS pools.
(ZFS performance itself was very good, but double parity with Ceph's 4k random
sync writes takes very long, resulting in XXX requests blocked for more than
32 seconds.)
Currently I am waiting for a lab cluster to test "osd op threads" for these
single-OSD hosts.
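(If it is useful to anyone, a minimal sketch of what I plan to try in the lab;
the value of 8 threads is just an arbitrary starting point, not a
recommendation:)

  # ceph.conf on the single-OSD hosts (FileStore/Jewel-era option)
  [osd]
  osd op threads = 8

  # or inject it at runtime for testing:
  ceph tell osd.* injectargs '--osd-op-threads 8'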


> If:
>
> 1. Your journal SSDs are well trusted and monitored (Intel DC S36xx, 37xx)
>

Indeed, Intel DC P3700 400GB for Ceph. We had Samsung 850 PROs before I learned
that 4k random writes with DSYNC are a very bad idea on them... ;-)
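(For anyone else evaluating journal SSDs: a rough sketch of the kind of
direct+dsync write test that exposes this; /dev/sdX is a placeholder, and
writing to the raw device is destructive, so only use a disposable drive or a
test file:)

  dd if=/dev/zero of=/dev/sdX bs=4k count=10000 oflag=direct,dsync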

> 2. Your failure domain represented by a journal SSD is small enough
> (meaning that replicating the lost OSDs can be done quickly)
>

The OSDs are rather large, but we are "just" using 8 TB (size 2) in the whole
cluster (each OSD is about 24% full).
Before we moved from Infernalis to Jewel, recovering an OSD that had been
offline for 8 hours took approx. one hour to get back in sync.

> it may be an acceptable risk for you as well.


We have had reliable backups in the past, but downtime is the bigger problem
for us.


>
>
> > Time to raise replication!
> >
> If you can afford that (money, space, latency), definitely go for it.
>

It's more the double journal failure that scares me than the OSDs themselves
(ZFS has been very reliable for us in the past).


Kevin


> Christian
> > Kevin
> >
> > 2016-12-13 0:00 GMT+01:00 Christian Balzer :
> >
> > > On Mon, 12 Dec 2016 22:41:41 +0100 Kevin Olbrich wrote:
> > >
> > > > Hi,
> > > >
> > > > just in case: What happens when all replica journal SSDs are broken at
> > > > once?
> > > >
> > > That would be bad, as in BAD.
> > >
> > > In theory you just "lost" all the associated OSDs and their data.
> > >
> > > In practice everything but the in-flight data at the time is still on
> > > the actual OSDs (HDDs), but it's inconsistent and inaccessible as far as
> > > Ceph is concerned.
> > >
> > > So with some trickery and an experienced data-recovery Ceph consultant you
> > > _may_ get things running with limited data loss/corruption, but that's
> > > speculation and may be wishful thinking on my part.
> > >
> > > Another data point in favor of deploying only well known/monitored/trusted
> > > SSDs and having 3x replication.
> > >
> > > The PGs most likely will be stuck inactive but as I read, the journals
> > > just need to be replaced
> > > (http://ceph.com/planet/ceph-recover-osds-after-ssd-journal-failure/).
> > > >
> > > > Does this also work in this case?
> > > >
> > > Not really, no.
> > >
> > > The above works by still having a valid state and operational OSDs from
> > > which the "broken" one can recover.
> > >
> > > Christian
> > > --
> > > Christian Balzer                Network/Systems Engineer
> > > ch...@gol.com   Global OnLine Japan/Rakuten Communications
> > > http://www.gol.com/
> > >
>
>
> --
> Christian Balzer                Network/Systems Engineer
> ch...@gol.com   Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] What happens if all replica OSDs journals are broken?

2016-12-13 Thread Christian Balzer

Hello,

On Wed, 14 Dec 2016 00:06:14 +0100 Kevin Olbrich wrote:

> Ok, thanks for your explanation!
> I read those warnings about size 2 + min_size 1 (we are using ZFS as RAID6,
> called raidz2) as OSDs.
>
This is similar to my RAID6 or RAID10 backed OSDs with regards to having
very resilient, extremely unlikely to fail OSDs.
As such a Ceph replication of 2 with min_size 1 is a calculated risk,
acceptable for me and others in certain use cases.
This is also with very few (2-3) journals per SSD.

If:

1. Your journal SSDs are well trusted and monitored (Intel DC S36xx, 37xx)
2. Your failure domain represented by a journal SSD is small enough
(meaning that replicating the lost OSDs can be done quickly)

it may be an acceptable risk for you as well.

> Time to raise replication!
>
If you can afford that (money, space, latency), definitely go for it.
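(For reference, a minimal sketch of the change; "rbd" is just a placeholder
pool name here, and expect a period of backfill while the third copies are
created:)

  ceph osd pool set rbd size 3
  ceph osd pool set rbd min_size 2
  ceph -s    # watch the backfill/recovery progress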
 
Christian
> Kevin
> 
> 2016-12-13 0:00 GMT+01:00 Christian Balzer :
> 
> > On Mon, 12 Dec 2016 22:41:41 +0100 Kevin Olbrich wrote:
> >
> > > Hi,
> > >
> > > just in case: What happens when all replica journal SSDs are broken at
> > > once?
> > >
> > That would be bad, as in BAD.
> >
> > In theory you just "lost" all the associated OSDs and their data.
> >
> > In practice everything but the in-flight data at the time is still on
> > the actual OSDs (HDDs), but it's inconsistent and inaccessible as far as
> > Ceph is concerned.
> >
> > So with some trickery and an experienced data-recovery Ceph consultant you
> > _may_ get things running with limited data loss/corruption, but that's
> > speculation and may be wishful thinking on my part.
> >
> > Another data point in favor of deploying only well known/monitored/trusted
> > SSDs and having 3x replication.
> >
> > > The PGs most likely will be stuck inactive but as I read, the journals
> > > just need to be replaced
> > > (http://ceph.com/planet/ceph-recover-osds-after-ssd-journal-failure/).
> > >
> > > Does this also work in this case?
> > >
> > Not really, no.
> >
> > The above works by still having a valid state and operational OSDs from
> > which the "broken" one can recover.
> >
> > Christian
> > --
> > Christian Balzer                Network/Systems Engineer
> > ch...@gol.com   Global OnLine Japan/Rakuten Communications
> > http://www.gol.com/
> >
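(For completeness, replacing a journal boils down to roughly the following
FileStore-era steps; OSD id 3 and the journal device are placeholders, and as
said above this only helps while healthy replicas of the data still exist
elsewhere:)

  systemctl stop ceph-osd@3        # stop the affected OSD
  ceph-osd -i 3 --flush-journal    # only possible if the old journal is still readable
  # swap in the new SSD/partition and point the OSD's journal symlink at it, then:
  ceph-osd -i 3 --mkjournal
  systemctl start ceph-osd@3       # the OSD peers again and recovers from its replicas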


-- 
Christian Balzer                Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] What happens if all replica OSDs journals are broken?

2016-12-13 Thread Kevin Olbrich
Ok, thanks for your explanation!
I read those warnings about size 2 + min_size 1 (we are using ZFS as RAID6,
called raidz2) as OSDs.
Time to raise replication!

Kevin

2016-12-13 0:00 GMT+01:00 Christian Balzer :

> On Mon, 12 Dec 2016 22:41:41 +0100 Kevin Olbrich wrote:
>
> > Hi,
> >
> > just in case: What happens when all replica journal SSDs are broken at
> > once?
> >
> That would be bad, as in BAD.
>
> In theory you just "lost" all the associated OSDs and their data.
>
> In practice everything but the in-flight data at the time is still on
> the actual OSDs (HDDs), but it's inconsistent and inaccessible as far as
> Ceph is concerned.
>
> So with some trickery and an experienced data-recovery Ceph consultant you
> _may_ get things running with limited data loss/corruption, but that's
> speculation and may be wishful thinking on my part.
>
> Another data point in favor of deploying only well known/monitored/trusted
> SSDs and having 3x replication.
>
> > The PGs most likely will be stuck inactive but as I read, the journals
> > just need to be replaced
> > (http://ceph.com/planet/ceph-recover-osds-after-ssd-journal-failure/).
> >
> > Does this also work in this case?
> >
> Not really, no.
>
> The above works by still having a valid state and operational OSDs from
> which the "broken" one can recover.
>
> Christian
> --
> Christian Balzer                Network/Systems Engineer
> ch...@gol.com   Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] What happens if all replica OSDs journals are broken?

2016-12-13 Thread Wojciech Kobryń
Hi,

Recently I lost 5 out of 12 journal OSDs (2x SSD failure at the same time),
with size=2, min_size=1. I know, it should rather be 3/2; I have plans to
switch to that asap.

Ceph started to throw many failures, so I removed these two SSDs and recreated
the journal OSDs from scratch. In my case, all data on the main OSDs was still
there, and Ceph did the best it could to block writes to the affected OSDs and
keep the data consistent.
After re-creating all 5 journal OSDs on another HDD, recovery+backfill started
to work. After a couple of hours it reported 7 "unfound" objects (6 in the
data OSDs and 1 hitset in the cache tier). I found out which files were
affected and hoped not to lose important data. I then tried to revert these 6
unfound objects to their previous versions, but that was unsuccessful, so I
just deleted them. The most important problem we found was that single hitset
file, which we couldn't just delete; instead we took another hitset file and
copied it onto the missing one. The cache tier then recognized this hitset and
invalidated it, which allowed all the backfill+recovery to finish, and finally
the entire Ceph cluster went back to HEALTH_OK. Finally I ran fsck wherever
these 6 unfound files could have had an effect, and fortunately the lost
blocks were not important and contained empty data, so the fsck recovery was
successful in all cases. That was a very stressful time :)
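(For anyone hitting the same situation, a sketch of the relevant commands; 2.5
is a placeholder PG id, and you should double-check exactly what you are about
to revert or delete first:)

  ceph health detail                       # shows which PGs have unfound objects
  ceph pg 2.5 list_missing                 # list them (newer releases call this list_unfound)
  ceph pg 2.5 mark_unfound_lost revert     # roll back to a previous version where one exists
  ceph pg 2.5 mark_unfound_lost delete     # or forget the objects entirely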

-- 
Wojtek

On Tue, 13.12.2016 at 00:01, Christian Balzer  wrote:

> On Mon, 12 Dec 2016 22:41:41 +0100 Kevin Olbrich wrote:
>
> > Hi,
> >
> > just in case: What happens when all replica journal SSDs are broken at
> > once?
> >
> That would be bad, as in BAD.
>
> In theory you just "lost" all the associated OSDs and their data.
>
> In practice everything but the in-flight data at the time is still on
> the actual OSDs (HDDs), but it's inconsistent and inaccessible as far as
> Ceph is concerned.
>
> So with some trickery and an experienced data-recovery Ceph consultant you
> _may_ get things running with limited data loss/corruption, but that's
> speculation and may be wishful thinking on my part.
>
> Another data point in favor of deploying only well known/monitored/trusted
> SSDs and having 3x replication.
>
> > The PGs most likely will be stuck inactive but as I read, the journals
> > just need to be replaced
> > (http://ceph.com/planet/ceph-recover-osds-after-ssd-journal-failure/).
> >
> > Does this also work in this case?
> >
> Not really, no.
>
> The above works by still having a valid state and operational OSDs from
> which the "broken" one can recover.
>
> Christian
> --
> Christian Balzer                Network/Systems Engineer
> ch...@gol.com   Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] What happens if all replica OSDs journals are broken?

2016-12-12 Thread Kevin Olbrich
Hi,

just in case: What happens when all replica journal SSDs are broken at once?
The PGs most likely will be stuck inactive but as I read, the journals just
need to be replaced
(http://ceph.com/planet/ceph-recover-osds-after-ssd-journal-failure/).

Does this also work in this case?

Kind regards,
Kevin
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com