Re: [ceph-users] What happens if all replica OSDs journals are broken?
2016-12-14 2:37 GMT+01:00 Christian Balzer:
>
> Hello,

Hi!

> On Wed, 14 Dec 2016 00:06:14 +0100 Kevin Olbrich wrote:
>
> > Ok, thanks for your explanation!
> > I read those warnings about size 2 + min_size 1 (we are using ZFS as
> > RAID6, called zraid2) as OSDs.
>
> This is similar to my RAID6 or RAID10 backed OSDs with regards to having
> very resilient, extremely unlikely to fail OSDs.

This was our intention (unlikely to fail, data security > performance).
We use Ceph for OpenStack (Cinder RBD).

> As such a Ceph replication of 2 with min_size 1 is a calculated risk,
> acceptable for me and others in certain use cases.
> This is also with very few (2-3) journals per SSD.

We are running 14x 500G RAID6 ZFS-RAID per host (1x journal, 1x OSD, 32 GB
RAM). The ZFS pools use an L2ARC cache on Samsung 850 PRO 128 GB drives.
Hint: that was a bad idea; splitting the ZFS pools would have been better.
(ZFS performance was very good, but double-parity 4k random sync writes
under Ceph take very long, resulting in "XXX requests blocked more than 32
seconds".)
Currently I am waiting for a lab cluster to test "osd op threads" for
these single-OSD hosts.

> If:
>
> 1. Your journal SSDs are well trusted and monitored (Intel DC S36xx, 37xx)

Indeed, Intel DC P3700 400GB for Ceph. We had Samsung 850 PRO before I
learned that 4k random writes with DSYNC are a very bad idea... ;-)

> 2. Your failure domain represented by a journal SSD is small enough
> (meaning that replicating the lost OSDs can be done quickly)

OSDs are rather large, but we are "just" using 8 TB (size 2) in the whole
cluster (each OSD is 24% full). Before we moved from infernalis to jewel,
a recovery from an OSD which was offline for 8 hours took approx. one hour
to be back in sync.

> it may be an acceptable risk for you as well.

We have had reliable backups in the past, but downtime is the greater
problem.

> > Time to raise replication!
>
> If you can afford that (money, space, latency), definitely go for it.
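[The DSYNC problem mentioned above can be checked before buying: a Ceph journal issues small synchronous direct writes, so an SSD has to be benchmarked that way rather than with the vendor's cached-write numbers. A commonly used fio sketch follows; the device path is a placeholder and the run is destructive, so point it at a scratch device only.]

```shell
# Measure 4k synchronous direct writes -- the journal's I/O pattern.
# WARNING: writes directly to the raw device and destroys its contents.
# /dev/sdX is a placeholder for a scratch disk.
fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k \
    --numjobs=1 --iodepth=1 --runtime=60 --time_based \
    --group_reporting --name=journal-test
```

[Consumer drives like the 850 PRO typically collapse to a few hundred IOPS here, while DC-class drives with power-loss protection stay in the tens of thousands.]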
It's more the double journal failure which scares me, compared to the OSD
itself (as ZFS has been very reliable in the past).

Kevin

> Christian
>
> > Kevin
> >
> > 2016-12-13 0:00 GMT+01:00 Christian Balzer:
> >
> > > On Mon, 12 Dec 2016 22:41:41 +0100 Kevin Olbrich wrote:
> > >
> > > > Hi,
> > > >
> > > > just in case: What happens when all replica journal SSDs are broken
> > > > at once?
> > > >
> > > That would be bad, as in BAD.
> > >
> > > In theory you just "lost" all the associated OSDs and their data.
> > >
> > > In practice everything but the in-flight data at the time is still on
> > > the actual OSDs (HDDs), but it's inconsistent and inaccessible as far
> > > as Ceph is concerned.
> > >
> > > So with some trickery and an experienced data-recovery Ceph
> > > consultant you _may_ get things running with limited data
> > > loss/corruption, but that's speculation and may be wishful thinking
> > > on my part.
> > >
> > > Another data point to deploy only well known/monitored/trusted SSDs
> > > and have a 3x replication.
> > >
> > > > The PGs most likely will be stuck inactive but as I read, the
> > > > journals just need to be replaced
> > > > (http://ceph.com/planet/ceph-recover-osds-after-ssd-journal-failure/).
> > > >
> > > > Does this also work in this case?
> > > >
> > > Not really, no.
> > >
> > > The above works by having still a valid state and operational OSDs
> > > from which the "broken" one can recover.
> > >
> > > Christian
> > > --
> > > Christian Balzer        Network/Systems Engineer
> > > ch...@gol.com   Global OnLine Japan/Rakuten Communications
> > > http://www.gol.com/
>
> --
> Christian Balzer        Network/Systems Engineer
> ch...@gol.com   Global OnLine Japan/Rakuten Communications
> http://www.gol.com/

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
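[The double-journal-failure worry above can be put in rough numbers. A back-of-envelope sketch, with a made-up per-SSD failure probability purely for illustration and assuming independent failures:]

```shell
# Rough odds that every journal SSD backing a PG fails within the same
# recovery window. p is an invented per-SSD probability of dying during
# that window -- illustration only, not a measured value.
p=0.001
awk -v p="$p" 'BEGIN { printf "size2=%.2e size3=%.2e\n", p*p, p*p*p }'
# prints: size2=1.00e-06 size3=1.00e-09
```

[A third replica buys roughly three orders of magnitude here, though correlated failures (same SSD model, same firmware bug, same wear level) can erase that advantage in practice.]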
Re: [ceph-users] What happens if all replica OSDs journals are broken?
Hello,

On Wed, 14 Dec 2016 00:06:14 +0100 Kevin Olbrich wrote:

> Ok, thanks for your explanation!
> I read those warnings about size 2 + min_size 1 (we are using ZFS as
> RAID6, called zraid2) as OSDs.
>
This is similar to my RAID6 or RAID10 backed OSDs with regards to having
very resilient, extremely unlikely to fail OSDs.

As such a Ceph replication of 2 with min_size 1 is a calculated risk,
acceptable for me and others in certain use cases.
This is also with very few (2-3) journals per SSD.

If:

1. Your journal SSDs are well trusted and monitored (Intel DC S36xx, 37xx)

2. Your failure domain represented by a journal SSD is small enough
(meaning that replicating the lost OSDs can be done quickly)

it may be an acceptable risk for you as well.

> Time to raise replication!
>
If you can afford that (money, space, latency), definitely go for it.

Christian

> Kevin
>
> 2016-12-13 0:00 GMT+01:00 Christian Balzer:
>
> > On Mon, 12 Dec 2016 22:41:41 +0100 Kevin Olbrich wrote:
> >
> > > Hi,
> > >
> > > just in case: What happens when all replica journal SSDs are broken
> > > at once?
> > >
> > That would be bad, as in BAD.
> >
> > In theory you just "lost" all the associated OSDs and their data.
> >
> > In practice everything but the in-flight data at the time is still on
> > the actual OSDs (HDDs), but it's inconsistent and inaccessible as far
> > as Ceph is concerned.
> >
> > So with some trickery and an experienced data-recovery Ceph consultant
> > you _may_ get things running with limited data loss/corruption, but
> > that's speculation and may be wishful thinking on my part.
> >
> > Another data point to deploy only well known/monitored/trusted SSDs
> > and have a 3x replication.
> >
> > > The PGs most likely will be stuck inactive but as I read, the
> > > journals just need to be replaced
> > > (http://ceph.com/planet/ceph-recover-osds-after-ssd-journal-failure/).
> > >
> > > Does this also work in this case?
> > >
> > Not really, no.
> >
> > The above works by having still a valid state and operational OSDs
> > from which the "broken" one can recover.
> >
> > Christian
> > --
> > Christian Balzer        Network/Systems Engineer
> > ch...@gol.com   Global OnLine Japan/Rakuten Communications
> > http://www.gol.com/

--
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
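[For reference, the single-journal-failure procedure from the article linked in this thread boils down to a few commands, sketched here with a placeholder OSD id. As noted above, this only helps while surviving replicas exist and the journal can still be flushed; if the journal device died outright, the flush step is impossible, which is exactly why the all-journals-dead case is so much worse.]

```shell
# Planned replacement of a journal device (sketch; osd id 0 is a placeholder).
systemctl stop ceph-osd@0       # take the OSD down cleanly
ceph-osd -i 0 --flush-journal   # commit pending journal entries to the store
# ...swap the SSD and recreate the journal partition/symlink here...
ceph-osd -i 0 --mkjournal       # initialize the new, empty journal
systemctl start ceph-osd@0      # OSD rejoins and recovers from its peers
```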
Re: [ceph-users] What happens if all replica OSDs journals are broken?
Ok, thanks for your explanation!
I read those warnings about size 2 + min_size 1 (we are using ZFS as
RAID6, called zraid2) as OSDs.

Time to raise replication!

Kevin

2016-12-13 0:00 GMT+01:00 Christian Balzer:

> On Mon, 12 Dec 2016 22:41:41 +0100 Kevin Olbrich wrote:
>
> > Hi,
> >
> > just in case: What happens when all replica journal SSDs are broken
> > at once?
> >
> That would be bad, as in BAD.
>
> In theory you just "lost" all the associated OSDs and their data.
>
> In practice everything but the in-flight data at the time is still on
> the actual OSDs (HDDs), but it's inconsistent and inaccessible as far as
> Ceph is concerned.
>
> So with some trickery and an experienced data-recovery Ceph consultant
> you _may_ get things running with limited data loss/corruption, but
> that's speculation and may be wishful thinking on my part.
>
> Another data point to deploy only well known/monitored/trusted SSDs and
> have a 3x replication.
>
> > The PGs most likely will be stuck inactive but as I read, the journals
> > just need to be replaced
> > (http://ceph.com/planet/ceph-recover-osds-after-ssd-journal-failure/).
> >
> > Does this also work in this case?
> >
> Not really, no.
>
> The above works by having still a valid state and operational OSDs from
> which the "broken" one can recover.
>
> Christian
> --
> Christian Balzer        Network/Systems Engineer
> ch...@gol.com   Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
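["Raising replication" comes down to two per-pool settings; a minimal sketch, assuming a pool named rbd (substitute your own pool names):]

```shell
# Move a pool from 2/1 to 3/2 (pool name "rbd" is an example).
ceph osd pool set rbd size 3      # keep three replicas of every object
ceph osd pool set rbd min_size 2  # block client I/O below two live copies
ceph -s                           # watch the backfill this change triggers
```

[Expect a burst of backfill traffic and roughly 50% more space consumption for the pool once the third copies are in place.]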
Re: [ceph-users] What happens if all replica OSDs journals are broken?
Hi,

Recently I lost 5 out of 12 journal OSDs (2x SSD failure at one time).
size=2, min_size=1. I know, it should rather be 3/2; I have plans to
switch to that asap.

Ceph started to throw many failures, so I removed these two SSDs and
recreated the journal OSDs from scratch. In my case, all data on the main
OSDs was still there, and Ceph did the best it could to disable writes to
the OSDs and keep the data consistent.

After re-creating all 5 journal OSDs on another HDD, recovery+backfill
started to work. After a couple of hours it discovered 7 "unfound" objects
(6 in data OSDs and 1 hitset in the cache tier). I found out which files
were affected, and hoped not to lose important data.

I then tried to revert these 6 unfound objects to their previous versions,
but that was unsuccessful, so I just deleted them. The most important
problem we found was the single hitset file, which we couldn't simply
delete; instead we took another hitset file and copied it onto the missing
one. The cache tier then recognized this hitset and invalidated it, which
allowed all the backfill+recovery to finish, and finally the entire Ceph
cluster went back to HEALTH_OK.

Finally I ran fsck wherever these 6 unfound files could have had an
effect, and fortunately the lost blocks were not important and contained
empty data, so fsck recovery was successful in all cases.

That was a very stressful time :)

--
Wojtek

On Tue, 13 Dec 2016 at 00:01, Christian Balzer wrote:

> On Mon, 12 Dec 2016 22:41:41 +0100 Kevin Olbrich wrote:
>
> > Hi,
> >
> > just in case: What happens when all replica journal SSDs are broken
> > at once?
> >
> That would be bad, as in BAD.
>
> In theory you just "lost" all the associated OSDs and their data.
>
> In practice everything but the in-flight data at the time is still on
> the actual OSDs (HDDs), but it's inconsistent and inaccessible as far as
> Ceph is concerned.
>
> So with some trickery and an experienced data-recovery Ceph consultant
> you _may_ get things running with limited data loss/corruption, but
> that's speculation and may be wishful thinking on my part.
>
> Another data point to deploy only well known/monitored/trusted SSDs and
> have a 3x replication.
>
> > The PGs most likely will be stuck inactive but as I read, the journals
> > just need to be replaced
> > (http://ceph.com/planet/ceph-recover-osds-after-ssd-journal-failure/).
> >
> > Does this also work in this case?
> >
> Not really, no.
>
> The above works by having still a valid state and operational OSDs from
> which the "broken" one can recover.
>
> Christian
> --
> Christian Balzer        Network/Systems Engineer
> ch...@gol.com   Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
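[The unfound-object handling described in this recovery report maps onto a few standard commands, sketched here with a placeholder PG id:]

```shell
# Inspect and resolve unfound objects after journal loss.
# The PG id 2.5 is a placeholder -- take real ids from health detail.
ceph health detail | grep unfound      # which PGs report unfound objects
ceph pg 2.5 list_missing               # enumerate the missing objects
ceph pg 2.5 mark_unfound_lost revert   # try rolling back to an older version
ceph pg 2.5 mark_unfound_lost delete   # ...or, failing that, give them up
```

[`revert` only works when an older version of the object still exists somewhere; otherwise `delete` is the last resort, as in the report above.]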
[ceph-users] What happens if all replica OSDs journals are broken?
Hi,

just in case: What happens when all replica journal SSDs are broken at
once?

The PGs most likely will be stuck inactive but as I read, the journals
just need to be replaced
(http://ceph.com/planet/ceph-recover-osds-after-ssd-journal-failure/).

Does this also work in this case?

Kind regards,
Kevin