Re: [ceph-users] I/O hangs with 2 node failure even if one node isn't involved in I/O

2017-03-22 Thread Kjetil Jørgensen
Hi,


I should clarify. When you worry about concurrent OSD failures, the more
likely source is something shared - network, rack, or power - so you
spread your OSDs across those failure domains and tell CRUSH to put each
replica in a separate one. E.g. you have 3 or more racks, each with its
own TOR switch and hopefully its own power circuit, and you tell CRUSH to
spread your 3 replicas so that they end up in separate racks.

We run min_size=2, size=3, with OSDs spread across multiple racks and the
3 replicas required to be in 3 different racks. Our reasoning: two or more
machines failing at the same instant for a reason other than switch/power
is unlikely enough that we'll happily live with it, and so far that has
served us well.
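
For reference, a rack-level replication rule in a decompiled crushmap looks
roughly like the below. The rule name, ruleset id and root are illustrative,
not taken from our cluster, and the min_size/max_size here are crush-rule
fields, not the pool min_size discussed above:

rule replicated_racks {
        ruleset 1
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type rack
        step emit
}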

-KJ

On Wed, Mar 22, 2017 at 7:06 PM, Kjetil Jørgensen 
wrote:

>
> For the most part - I'm assuming min_size=2, size=3. In the min_size=3,
> size=3 case this changes.
>
> size is how many replicas of an object to maintain; min_size is how many
> replica writes need to succeed before the primary can ack the operation
> to the client.
>
> A larger min_size most likely means higher latency for writes.
>
> On Wed, Mar 22, 2017 at 8:05 AM, Adam Carheden  wrote:
>
>> On Tue, Mar 21, 2017 at 1:54 PM, Kjetil Jørgensen 
>> wrote:
>>
>> >> c. Reads can continue from the single online OSD even in pgs that
>> >> happened to have two of 3 osds offline.
>> >>
>> >
>> > Hypothetically (This is partially informed guessing on my part):
>> > If the survivor happens to be the acting primary and it were up-to-date
>> at
>> > the time,
>> > it can in theory serve reads. (Only the primary serves reads).
>>
>> It makes no sense that only the primary could serve reads. That would
>> mean that even if only a single OSD failed, all PGs for which that OSD
>> was primary would be unreadable.
>>
>
> Acting [1, 2, 3] - primary is 1, and only 1 serves reads. If 1 fails, 2 is
> now the new primary. It'll probably check with 3 to determine whether
> there were any writes it is itself unaware of - and peer if there were.
> Promotion should be near instantaneous (well, you'd in all likelihood be
> able to measure it).
>
>
>> There must be an algorithm to appoint a new primary. So in a 2 OSD
>> failure scenario, a new primary should be appointed after the first
>> failure, no? Would the final remaining OSD not appoint itself as
>> primary after the 2nd failure?
>>
>>
> Assuming min_size=2, size=3 - if 2 OSDs fail at the same instant, you
> have no guarantee that the survivor has all writes.
>
> Assuming min_size=3 and size=3 - then yes, you're good: the surviving
> OSD can safely be promoted. You're severely degraded, but promotion is
> safe.
>
> If you genuinely worry about concurrent failures of 2 machines, run with
> min_size=3; the price you pay is slightly increased mean/median write
> latency.
>
>> This makes sense in the context of Ceph's synchronous writes too. A
>> write isn't complete until all 3 OSDs in the PG have the data,
>> correct? So shouldn't any one of them be able to act as primary at any
>> time?
>
>
> See distinction between size and min_size.
>
>
>> I don't see how that would change even if 2 of 3 OSDs fail at exactly
>> the same time.
>>
>
>
>
> --
> Kjetil Joergensen 
> SRE, Medallia Inc
> Phone: +1 (650) 739-6580
>



-- 
Kjetil Joergensen 
SRE, Medallia Inc
Phone: +1 (650) 739-6580


Re: [ceph-users] I/O hangs with 2 node failure even if one node isn't involved in I/O

2017-03-22 Thread Kjetil Jørgensen
For the most part - I'm assuming min_size=2, size=3. In the min_size=3,
size=3 case this changes.

size is how many replicas of an object to maintain; min_size is how many
replica writes need to succeed before the primary can ack the operation to
the client.

A larger min_size most likely means higher latency for writes.
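
To check or adjust these on a pool - using the rbd pool (pool 0 in your
osdmaptool output) as an example, with output roughly like:

# ceph osd pool get rbd size
size: 3
# ceph osd pool get rbd min_size
min_size: 2
# ceph osd pool set rbd min_size 3
set pool 0 min_size to 3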

On Wed, Mar 22, 2017 at 8:05 AM, Adam Carheden  wrote:

> On Tue, Mar 21, 2017 at 1:54 PM, Kjetil Jørgensen 
> wrote:
>
> >> c. Reads can continue from the single online OSD even in pgs that
> >> happened to have two of 3 osds offline.
> >>
> >
> > Hypothetically (This is partially informed guessing on my part):
> > If the survivor happens to be the acting primary and it were up-to-date
> at
> > the time,
> > it can in theory serve reads. (Only the primary serves reads).
>
> It makes no sense that only the primary could serve reads. That would
> mean that even if only a single OSD failed, all PGs for which that OSD
> was primary would be unreadable.
>

Acting [1, 2, 3] - primary is 1, and only 1 serves reads. If 1 fails, 2 is
now the new primary. It'll probably check with 3 to determine whether there
were any writes it is itself unaware of - and peer if there were. Promotion
should be near instantaneous (well, you'd in all likelihood be able to
measure it).
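
You can also check which OSD is currently the acting primary for a given
pg - e.g. for the pg your osdmaptool run mapped rbd_id.vm-100-disk-1 to.
The first OSD listed in the acting set is the primary; output is roughly:

# ceph pg map 0.1ea
osdmap e1043 pg 0.1ea (0.1ea) -> up [11,5,3] acting [11,5,3]

"ceph pg 0.1ea query" gives the full peering detail if you want to watch a
promotion happen.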


> There must be an algorithm to appoint a new primary. So in a 2 OSD
> failure scenario, a new primary should be appointed after the first
> failure, no? Would the final remaining OSD not appoint itself as
> primary after the 2nd failure?
>
>
Assuming min_size=2, size=3 - if 2 OSDs fail at the same instant, you
have no guarantee that the survivor has all writes.

Assuming min_size=3 and size=3 - then yes, you're good: the surviving
OSD can safely be promoted. You're severely degraded, but promotion is
safe.

If you genuinely worry about concurrent failures of 2 machines, run with
min_size=3; the price you pay is slightly increased mean/median write
latency.

> This makes sense in the context of Ceph's synchronous writes too. A
> write isn't complete until all 3 OSDs in the PG have the data,
> correct? So shouldn't any one of them be able to act as primary at any
> time?


See distinction between size and min_size.


> I don't see how that would change even if 2 of 3 OSDs fail at exactly
> the same time.
>



-- 
Kjetil Joergensen 
SRE, Medallia Inc
Phone: +1 (650) 739-6580


Re: [ceph-users] I/O hangs with 2 node failure even if one node isn't involved in I/O

2017-03-22 Thread Adam Carheden
On Tue, Mar 21, 2017 at 1:54 PM, Kjetil Jørgensen  wrote:

>> c. Reads can continue from the single online OSD even in pgs that
>> happened to have two of 3 osds offline.
>>
>
> Hypothetically (This is partially informed guessing on my part):
> If the survivor happens to be the acting primary and it were up-to-date at
> the time,
> it can in theory serve reads. (Only the primary serves reads).

It makes no sense that only the primary could serve reads. That would
mean that even if only a single OSD failed, all PGs for which that OSD
was primary would be unreadable.

There must be an algorithm to appoint a new primary. So in a 2 OSD
failure scenario, a new primary should be appointed after the first
failure, no? Would the final remaining OSD not appoint itself as
primary after the 2nd failure?

This makes sense in the context of Ceph's synchronous writes too. A
write isn't complete until all 3 OSDs in the PG have the data,
correct? So shouldn't any one of them be able to act as primary at any
time?

I don't see how that would change even if 2 of 3 OSDs fail at exactly
the same time.


Re: [ceph-users] I/O hangs with 2 node failure even if one node isn't involved in I/O

2017-03-21 Thread Kjetil Jørgensen
Hi,

On Tue, Mar 21, 2017 at 11:59 AM, Adam Carheden  wrote:

> Let's see if I got this. 4 host cluster. size=3, min_size=2. 2 hosts
> fail. Are all of the following accurate?
>
> a. An RBD image is split into lots of objects, parts of which will probably
> exist on all 4 hosts.
>

Correct.


>
> b. Some objects will have 2 of their 3 replicas on 2 of the offline OSDs.
>
> Likely correct.


> c. Reads can continue from the single online OSD even in pgs that
> happened to have two of 3 osds offline.
>
>
Hypothetically (this is partially informed guessing on my part):
if the survivor happens to be the acting primary and it was up-to-date at
the time, it can in theory serve reads. (Only the primary serves reads.)

If the survivor wasn't the acting primary, you don't have any guarantee
that it had the most up-to-date version of any given object. I don't know
whether enough state is tracked outside of the OSDs to make this
determination, but I doubt it (it feels costly to maintain).

Regardless of scenario - I'd guess - the PG is marked as down, and will
stay that way until you either revive one of the deceased OSDs or
essentially tell Ceph that they're a lost cause and accept the potential
data loss that comes with it (see: ceph osd lost).
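
For completeness, marking an OSD lost looks like the below (osd.11 is just
an example taken from your tree) - only do this once you're sure the data
on it is really gone, since it may throw away writes only that OSD had:

# ceph osd lost 11 --yes-i-really-mean-it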

> d. Writes hang for pgs that have 2 offline OSDs because CRUSH can't meet
> the min_size=2 constraint.
>

Correct.


> e. Rebalancing does not occur because with only two hosts online there
> is no way for CRUSH to meet the size=3 constraint even if it were to
> rebalance.
>

Partially correct, see c)

> f. I/O can be restored by setting min_size=1.
>

See c)


> g. Alternatively, I/O can be restored by setting size=2, which would
> kick off rebalancing and restored I/O as the pgs come into compliance
> with the size=2 constraint.
>

See c)


> h. If I instead have a cluster with 10 hosts, size=3 and min_size=2 and
> two hosts fail, some pgs would have only 1 OSD online, but rebalancing
> would start immediately since CRUSH can honor the size=3 constraint by
> rebalancing. This means more nodes makes for a more reliable cluster.
>

See c)

Side-note: This is where you start using CRUSH to enumerate what you'd
consider the likely failure domains for concurrent failures. E.g. if you
have racks with distinct power circuits and TOR switches, your most likely
large-scale failure will be a rack, so you tell CRUSH to maintain replicas
in distinct racks, as sketched below.
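
A rough sketch of that (rule and pool names illustrative - double-check the
ruleset id before repointing a production pool):

# ceph osd crush rule create-simple replicated_racks default rack
# ceph osd crush rule dump replicated_racks
# ceph osd pool set rbd crush_ruleset <ruleset-id>

The dump shows the ruleset id to plug into the last command.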

> i. If I wanted to force CRUSH to bring I/O back online with size=3 and
> min_size=2 but only 2 hosts online, I could remove the host bucket from
> the crushmap. CRUSH would then rebalance, but some PGs would likely end
> up with 3 OSDs all on the same host. (This is theory. I promise not to
> do any such thing to a production system ;)
>

Partially correct, see c).



> Thanks
> --
> Adam Carheden
>
>
> On 03/21/2017 11:48 AM, Wes Dillingham wrote:
> > If you had set min_size to 1 you would not have seen the writes pause. a
> > min_size of 1 is dangerous though because it means you are 1 hard disk
> > failure away from losing the objects within that placement group
> > entirely. a min_size of 2 is generally considered the minimum you want
> > but many people ignore that advice, some wish they hadn't.
> >
> > On Tue, Mar 21, 2017 at 11:46 AM, Adam Carheden  > > wrote:
> >
> > Thanks everyone for the replies. Very informative. However, should I
> > have expected writes to pause if I'd had min_size set to 1 instead
> of 2?
> >
> > And yes, I was under the false impression that my rdb devices was a
> > single object. That explains what all those other things are on a
> test
> > cluster where I only created a single object!
> >
> >
> > --
> > Adam Carheden
> >
> > On 03/20/2017 08:24 PM, Wes Dillingham wrote:
> > > This is because of the min_size specification. I would bet you
> have it
> > > set at 2 (which is good).
> > >
> > > ceph osd pool get rbd min_size
> > >
> > > With 4 hosts, and a size of 3, removing 2 of the hosts (or 2 drives 1
> > > from each hosts) results in some of the objects only having 1 replica
> > > min_size dictates that IO freezes for those objects until min_size is
> > > achieved. http://docs.ceph.com/docs/jewel/rados/operations/pools/#set-the-number-of-object-replicas
> > >
> > > I cant tell if your under the impression that your RBD device is a
> > > single object. It is not. It is chunked up into many objects and
> spread
> > > throughout the cluster, as Kjeti mentioned earlier.
> > >
> > > On Mon, Mar 20, 2017 at 8:48 PM, Kjetil Jørgensen <kje...@medallia.com> wrote:
> > >
> > > Hi,
> > >
> > > rbd_id.vm-100-disk-1 is only a "meta 

Re: [ceph-users] I/O hangs with 2 node failure even if one node isn't involved in I/O

2017-03-21 Thread Adam Carheden
Let's see if I got this. 4 host cluster. size=3, min_size=2. 2 hosts
fail. Are all of the following accurate?

a. An RBD image is split into lots of objects, parts of which will probably
exist on all 4 hosts.

b. Some objects will have 2 of their 3 replicas on 2 of the offline OSDs.

c. Reads can continue from the single online OSD even in pgs that
happened to have two of 3 osds offline.

d. Writes hang for pgs that have 2 offline OSDs because CRUSH can't meet
the min_size=2 constraint.

e. Rebalancing does not occur because with only two hosts online there
is no way for CRUSH to meet the size=3 constraint even if it were to
rebalance.

f. I/O can be restored by setting min_size=1.

g. Alternatively, I/O can be restored by setting size=2, which would
kick off rebalancing and restore I/O as the PGs come into compliance
with the size=2 constraint.

h. If I instead have a cluster with 10 hosts, size=3 and min_size=2 and
two hosts fail, some pgs would have only 1 OSD online, but rebalancing
would start immediately since CRUSH can honor the size=3 constraint by
rebalancing. This means more nodes make for a more reliable cluster.

i. If I wanted to force CRUSH to bring I/O back online with size=3 and
min_size=2 but only 2 hosts online, I could remove the host bucket from
the crushmap. CRUSH would then rebalance, but some PGs would likely end
up with 3 OSDs all on the same host. (This is theory. I promise not to
do any such thing to a production system ;)

Thanks
-- 
Adam Carheden


On 03/21/2017 11:48 AM, Wes Dillingham wrote:
> If you had set min_size to 1 you would not have seen the writes pause. a
> min_size of 1 is dangerous though because it means you are 1 hard disk
> failure away from losing the objects within that placement group
> entirely. a min_size of 2 is generally considered the minimum you want
> but many people ignore that advice, some wish they hadn't. 
> 
> On Tue, Mar 21, 2017 at 11:46 AM, Adam Carheden  > wrote:
> 
> Thanks everyone for the replies. Very informative. However, should I
> have expected writes to pause if I'd had min_size set to 1 instead of 2?
> 
> And yes, I was under the false impression that my rdb devices was a
> single object. That explains what all those other things are on a test
> cluster where I only created a single object!
> 
> 
> --
> Adam Carheden
> 
> On 03/20/2017 08:24 PM, Wes Dillingham wrote:
> > This is because of the min_size specification. I would bet you have it
> > set at 2 (which is good).
> >
> > ceph osd pool get rbd min_size
> >
> > With 4 hosts, and a size of 3, removing 2 of the hosts (or 2 drives 1
> > from each hosts) results in some of the objects only having 1 replica
> > min_size dictates that IO freezes for those objects until min_size is
> > achieved. 
> http://docs.ceph.com/docs/jewel/rados/operations/pools/#set-the-number-of-object-replicas
> 
> 
> >
> > I cant tell if your under the impression that your RBD device is a
> > single object. It is not. It is chunked up into many objects and spread
> > throughout the cluster, as Kjeti mentioned earlier.
> >
> > On Mon, Mar 20, 2017 at 8:48 PM, Kjetil Jørgensen  
> > >> wrote:
> >
> > Hi,
> >
> > rbd_id.vm-100-disk-1 is only a "meta object", IIRC, it's contents
> > will get you a "prefix", which then gets you on to
> > rbd_header., rbd_header.prefix contains block size,
> > striping, etc. The actual data bearing objects will be named
> > something like rbd_data.prefix.%-016x.
> >
> > Example - vm-100-disk-1 has the prefix 86ce2ae8944a, the first
> >  of that image will be named rbd_data.
> > 86ce2ae8944a., the second  will be
> > 86ce2ae8944a.0001, and so on, chances are that one of these
> > objects are mapped to a pg which has both host3 and host4 among it's
> > replicas.
> >
> > An rbd image will end up scattered across most/all osds of the pool
> > it's in.
> >
> > Cheers,
> > -KJ
> >
> > On Fri, Mar 17, 2017 at 12:30 PM, Adam Carheden  
> > >> wrote:
> >
> > I have a 4 node cluster shown by `ceph osd tree` below.
> Monitors are
> > running on hosts 1, 2 and 3. It has a single replicated
> pool of size
> > 3. I have a VM with its hard drive replicated to OSDs
> 11(host3),
> > 5(host1) and 3(host2).
> >
> > I can 'fail' any one host by disabling the SAN network
> interface and
> > 

Re: [ceph-users] I/O hangs with 2 node failure even if one node isn't involved in I/O

2017-03-21 Thread Wes Dillingham
If you had set min_size to 1 you would not have seen the writes pause. A
min_size of 1 is dangerous, though, because it means you are 1 hard disk
failure away from losing the objects within that placement group entirely.
A min_size of 2 is generally considered the minimum you want, but many
people ignore that advice - some wish they hadn't.

On Tue, Mar 21, 2017 at 11:46 AM, Adam Carheden  wrote:

> Thanks everyone for the replies. Very informative. However, should I
> have expected writes to pause if I'd had min_size set to 1 instead of 2?
>
> And yes, I was under the false impression that my rdb devices was a
> single object. That explains what all those other things are on a test
> cluster where I only created a single object!
>
>
> --
> Adam Carheden
>
> On 03/20/2017 08:24 PM, Wes Dillingham wrote:
> > This is because of the min_size specification. I would bet you have it
> > set at 2 (which is good).
> >
> > ceph osd pool get rbd min_size
> >
> > With 4 hosts, and a size of 3, removing 2 of the hosts (or 2 drives 1
> > from each hosts) results in some of the objects only having 1 replica
> > min_size dictates that IO freezes for those objects until min_size is
> > achieved. http://docs.ceph.com/docs/jewel/rados/operations/pools/#
> set-the-number-of-object-replicas
> >
> > I cant tell if your under the impression that your RBD device is a
> > single object. It is not. It is chunked up into many objects and spread
> > throughout the cluster, as Kjeti mentioned earlier.
> >
> > On Mon, Mar 20, 2017 at 8:48 PM, Kjetil Jørgensen  > > wrote:
> >
> > Hi,
> >
> > rbd_id.vm-100-disk-1 is only a "meta object", IIRC, it's contents
> > will get you a "prefix", which then gets you on to
> > rbd_header., rbd_header.prefix contains block size,
> > striping, etc. The actual data bearing objects will be named
> > something like rbd_data.prefix.%-016x.
> >
> > Example - vm-100-disk-1 has the prefix 86ce2ae8944a, the first
> >  of that image will be named rbd_data.
> > 86ce2ae8944a., the second  will be
> > 86ce2ae8944a.0001, and so on, chances are that one of these
> > objects are mapped to a pg which has both host3 and host4 among it's
> > replicas.
> >
> > An rbd image will end up scattered across most/all osds of the pool
> > it's in.
> >
> > Cheers,
> > -KJ
> >
> > On Fri, Mar 17, 2017 at 12:30 PM, Adam Carheden  > > wrote:
> >
> > I have a 4 node cluster shown by `ceph osd tree` below. Monitors
> are
> > running on hosts 1, 2 and 3. It has a single replicated pool of
> size
> > 3. I have a VM with its hard drive replicated to OSDs 11(host3),
> > 5(host1) and 3(host2).
> >
> > I can 'fail' any one host by disabling the SAN network interface
> and
> > the VM keeps running with a simple slowdown in I/O performance
> > just as
> > expected. However, if 'fail' both nodes 3 and 4, I/O hangs on
> > the VM.
> > (i.e. `df` never completes, etc.) The monitors on hosts 1 and 2
> > still
> > have quorum, so that shouldn't be an issue. The placement group
> > still
> > has 2 of its 3 replicas online.
> >
> > Why does I/O hang even though host4 isn't running a monitor and
> > doesn't have anything to do with my VM's hard drive.
> >
> >
> > Size?
> > # ceph osd pool get rbd size
> > size: 3
> >
> > Where's rbd_id.vm-100-disk-1?
> > # ceph osd getmap -o /tmp/map && osdmaptool --pool 0
> > --test-map-object
> > rbd_id.vm-100-disk-1 /tmp/map
> > got osdmap epoch 1043
> > osdmaptool: osdmap file '/tmp/map'
> >  object 'rbd_id.vm-100-disk-1' -> 0.1ea -> [11,5,3]
> >
> > # ceph osd tree
> > ID WEIGHT  TYPE NAME  UP/DOWN REWEIGHT PRIMARY-AFFINITY
> > -1 8.06160 root default
> > -7 5.50308 room A
> > -3 1.88754 host host1
> >  4 0.40369 osd.4   up  1.0  1.0
> >  5 0.40369 osd.5   up  1.0  1.0
> >  6 0.54008 osd.6   up  1.0  1.0
> >  7 0.54008 osd.7   up  1.0  1.0
> > -2 3.61554 host host2
> >  0 0.90388 osd.0   up  1.0  1.0
> >  1 0.90388 osd.1   up  1.0  1.0
> >  2 0.90388 osd.2   up  1.0  1.0
> >  3 0.90388 osd.3   up  1.0  1.0
> > -6 2.55852 room B
> > -4 1.75114 host host3
> >  8 0.40369 osd.8   up  1.0  1.0
> >  9 0.40369 osd.9   up  1.0   

Re: [ceph-users] I/O hangs with 2 node failure even if one node isn't involved in I/O

2017-03-21 Thread Adam Carheden
Thanks everyone for the replies. Very informative. However, should I
have expected writes to pause if I'd had min_size set to 1 instead of 2?

And yes, I was under the false impression that my RBD device was a
single object. That explains what all those other things are on a test
cluster where I only created a single object!


-- 
Adam Carheden

On 03/20/2017 08:24 PM, Wes Dillingham wrote:
> This is because of the min_size specification. I would bet you have it
> set at 2 (which is good). 
> 
> ceph osd pool get rbd min_size
> 
> With 4 hosts, and a size of 3, removing 2 of the hosts (or 2 drives 1
> from each hosts) results in some of the objects only having 1 replica
> min_size dictates that IO freezes for those objects until min_size is
> achieved. 
> http://docs.ceph.com/docs/jewel/rados/operations/pools/#set-the-number-of-object-replicas
> 
> I cant tell if your under the impression that your RBD device is a
> single object. It is not. It is chunked up into many objects and spread
> throughout the cluster, as Kjeti mentioned earlier.
> 
> On Mon, Mar 20, 2017 at 8:48 PM, Kjetil Jørgensen  > wrote:
> 
> Hi,
> 
> rbd_id.vm-100-disk-1 is only a "meta object", IIRC, it's contents
> will get you a "prefix", which then gets you on to
> rbd_header., rbd_header.prefix contains block size,
> striping, etc. The actual data bearing objects will be named
> something like rbd_data.prefix.%-016x.
> 
> Example - vm-100-disk-1 has the prefix 86ce2ae8944a, the first
>  of that image will be named rbd_data.
> 86ce2ae8944a., the second  will be
> 86ce2ae8944a.0001, and so on, chances are that one of these
> objects are mapped to a pg which has both host3 and host4 among it's
> replicas.
> 
> An rbd image will end up scattered across most/all osds of the pool
> it's in.
> 
> Cheers,
> -KJ
> 
> On Fri, Mar 17, 2017 at 12:30 PM, Adam Carheden  > wrote:
> 
> I have a 4 node cluster shown by `ceph osd tree` below. Monitors are
> running on hosts 1, 2 and 3. It has a single replicated pool of size
> 3. I have a VM with its hard drive replicated to OSDs 11(host3),
> 5(host1) and 3(host2).
> 
> I can 'fail' any one host by disabling the SAN network interface and
> the VM keeps running with a simple slowdown in I/O performance
> just as
> expected. However, if 'fail' both nodes 3 and 4, I/O hangs on
> the VM.
> (i.e. `df` never completes, etc.) The monitors on hosts 1 and 2
> still
> have quorum, so that shouldn't be an issue. The placement group
> still
> has 2 of its 3 replicas online.
> 
> Why does I/O hang even though host4 isn't running a monitor and
> doesn't have anything to do with my VM's hard drive.
> 
> 
> Size?
> # ceph osd pool get rbd size
> size: 3
> 
> Where's rbd_id.vm-100-disk-1?
> # ceph osd getmap -o /tmp/map && osdmaptool --pool 0
> --test-map-object
> rbd_id.vm-100-disk-1 /tmp/map
> got osdmap epoch 1043
> osdmaptool: osdmap file '/tmp/map'
>  object 'rbd_id.vm-100-disk-1' -> 0.1ea -> [11,5,3]
> 
> # ceph osd tree
> ID WEIGHT  TYPE NAME  UP/DOWN REWEIGHT PRIMARY-AFFINITY
> -1 8.06160 root default
> -7 5.50308 room A
> -3 1.88754 host host1
>  4 0.40369 osd.4   up  1.0  1.0
>  5 0.40369 osd.5   up  1.0  1.0
>  6 0.54008 osd.6   up  1.0  1.0
>  7 0.54008 osd.7   up  1.0  1.0
> -2 3.61554 host host2
>  0 0.90388 osd.0   up  1.0  1.0
>  1 0.90388 osd.1   up  1.0  1.0
>  2 0.90388 osd.2   up  1.0  1.0
>  3 0.90388 osd.3   up  1.0  1.0
> -6 2.55852 room B
> -4 1.75114 host host3
>  8 0.40369 osd.8   up  1.0  1.0
>  9 0.40369 osd.9   up  1.0  1.0
> 10 0.40369 osd.10  up  1.0  1.0
> 11 0.54008 osd.11  up  1.0  1.0
> -5 0.80737 host host4
> 12 0.40369 osd.12  up  1.0  1.0
> 13 0.40369 osd.13  up  1.0  1.0
> 
> 
> --
> Adam Carheden

Re: [ceph-users] I/O hangs with 2 node failure even if one node isn't involved in I/O

2017-03-20 Thread Wes Dillingham
This is because of the min_size specification. I would bet you have it set
at 2 (which is good).

ceph osd pool get rbd min_size

With 4 hosts and a size of 3, removing 2 of the hosts (or 2 drives, 1 from
each host) results in some of the objects only having 1 replica.
min_size dictates that I/O freezes for those objects until min_size is
achieved.
http://docs.ceph.com/docs/jewel/rados/operations/pools/#set-the-number-of-object-replicas

I can't tell if you're under the impression that your RBD device is a single
object. It is not. It is chunked up into many objects and spread throughout
the cluster, as Kjetil mentioned earlier.
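
If you want to see exactly which pgs are blocked below min_size while the
hosts are down, something like the following should show them (exact state
names vary a bit by version):

# ceph health detail
# ceph pg dump_stuck inactive
# ceph pg dump_stuck undersized

And if you accept the risk described above, "ceph osd pool set rbd
min_size 1" will let I/O resume until the missing hosts come back.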

On Mon, Mar 20, 2017 at 8:48 PM, Kjetil Jørgensen 
wrote:

> Hi,
>
> rbd_id.vm-100-disk-1 is only a "meta object"; IIRC, its contents will get
> you a "prefix", which then gets you on to rbd_header.<prefix>.
> rbd_header.<prefix> contains block size, striping, etc. The actual
> data-bearing objects will be named something like rbd_data.<prefix>.%016x.
>
> Example - vm-100-disk-1 has the prefix 86ce2ae8944a, so the first <object
> size> chunk of that image will be named rbd_data.86ce2ae8944a.0000000000000000,
> the second will be rbd_data.86ce2ae8944a.0000000000000001, and so on.
> Chances are that one of these objects is mapped to a pg which has both
> host3 and host4 among its replicas.
>
> An rbd image will end up scattered across most/all osds of the pool it's
> in.
>
> Cheers,
> -KJ
>
> On Fri, Mar 17, 2017 at 12:30 PM, Adam Carheden  wrote:
>
>> I have a 4 node cluster shown by `ceph osd tree` below. Monitors are
>> running on hosts 1, 2 and 3. It has a single replicated pool of size
>> 3. I have a VM with its hard drive replicated to OSDs 11(host3),
>> 5(host1) and 3(host2).
>>
>> I can 'fail' any one host by disabling the SAN network interface and
>> the VM keeps running with a simple slowdown in I/O performance just as
>> expected. However, if 'fail' both nodes 3 and 4, I/O hangs on the VM.
>> (i.e. `df` never completes, etc.) The monitors on hosts 1 and 2 still
>> have quorum, so that shouldn't be an issue. The placement group still
>> has 2 of its 3 replicas online.
>>
>> Why does I/O hang even though host4 isn't running a monitor and
>> doesn't have anything to do with my VM's hard drive.
>>
>>
>> Size?
>> # ceph osd pool get rbd size
>> size: 3
>>
>> Where's rbd_id.vm-100-disk-1?
>> # ceph osd getmap -o /tmp/map && osdmaptool --pool 0 --test-map-object
>> rbd_id.vm-100-disk-1 /tmp/map
>> got osdmap epoch 1043
>> osdmaptool: osdmap file '/tmp/map'
>>  object 'rbd_id.vm-100-disk-1' -> 0.1ea -> [11,5,3]
>>
>> # ceph osd tree
>> ID WEIGHT  TYPE NAME  UP/DOWN REWEIGHT PRIMARY-AFFINITY
>> -1 8.06160 root default
>> -7 5.50308 room A
>> -3 1.88754 host host1
>>  4 0.40369 osd.4   up  1.0  1.0
>>  5 0.40369 osd.5   up  1.0  1.0
>>  6 0.54008 osd.6   up  1.0  1.0
>>  7 0.54008 osd.7   up  1.0  1.0
>> -2 3.61554 host host2
>>  0 0.90388 osd.0   up  1.0  1.0
>>  1 0.90388 osd.1   up  1.0  1.0
>>  2 0.90388 osd.2   up  1.0  1.0
>>  3 0.90388 osd.3   up  1.0  1.0
>> -6 2.55852 room B
>> -4 1.75114 host host3
>>  8 0.40369 osd.8   up  1.0  1.0
>>  9 0.40369 osd.9   up  1.0  1.0
>> 10 0.40369 osd.10  up  1.0  1.0
>> 11 0.54008 osd.11  up  1.0  1.0
>> -5 0.80737 host host4
>> 12 0.40369 osd.12  up  1.0  1.0
>> 13 0.40369 osd.13  up  1.0  1.0
>>
>>
>> --
>> Adam Carheden
>>
>
>
>
> --
> Kjetil Joergensen 
> SRE, Medallia Inc
> Phone: +1 (650) 739-6580
>
>
>


-- 
Respectfully,

Wes Dillingham
wes_dilling...@harvard.edu
Research Computing | Infrastructure Engineer
Harvard University | 38 Oxford Street, Cambridge, Ma 02138 | Room 210


Re: [ceph-users] I/O hangs with 2 node failure even if one node isn't involved in I/O

2017-03-20 Thread Kjetil Jørgensen
Hi,

rbd_id.vm-100-disk-1 is only a "meta object"; IIRC, its contents will get
you a "prefix", which then gets you on to rbd_header.<prefix>.
rbd_header.<prefix> contains block size, striping, etc. The actual
data-bearing objects will be named something like rbd_data.<prefix>.%016x.

Example - vm-100-disk-1 has the prefix 86ce2ae8944a, so the first <object
size> chunk of that image will be named rbd_data.86ce2ae8944a.0000000000000000,
the second will be rbd_data.86ce2ae8944a.0000000000000001, and so on.
Chances are that one of these objects is mapped to a pg which has both
host3 and host4 among its replicas.

An RBD image will end up scattered across most/all OSDs of the pool it's in.
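
You can see this for yourself, roughly like the below (output trimmed, and
the prefix will of course differ per image):

# rbd info vm-100-disk-1 | grep block_name_prefix
        block_name_prefix: rbd_data.86ce2ae8944a
# rados -p rbd ls | grep rbd_data.86ce2ae8944a | head -2
rbd_data.86ce2ae8944a.0000000000000000
rbd_data.86ce2ae8944a.0000000000000001

"ceph osd map rbd <object name>" will then tell you which pg and OSDs any
one of those objects lands on.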

Cheers,
-KJ

On Fri, Mar 17, 2017 at 12:30 PM, Adam Carheden  wrote:

> I have a 4 node cluster shown by `ceph osd tree` below. Monitors are
> running on hosts 1, 2 and 3. It has a single replicated pool of size
> 3. I have a VM with its hard drive replicated to OSDs 11(host3),
> 5(host1) and 3(host2).
>
> I can 'fail' any one host by disabling the SAN network interface and
> the VM keeps running with a simple slowdown in I/O performance just as
> expected. However, if 'fail' both nodes 3 and 4, I/O hangs on the VM.
> (i.e. `df` never completes, etc.) The monitors on hosts 1 and 2 still
> have quorum, so that shouldn't be an issue. The placement group still
> has 2 of its 3 replicas online.
>
> Why does I/O hang even though host4 isn't running a monitor and
> doesn't have anything to do with my VM's hard drive.
>
>
> Size?
> # ceph osd pool get rbd size
> size: 3
>
> Where's rbd_id.vm-100-disk-1?
> # ceph osd getmap -o /tmp/map && osdmaptool --pool 0 --test-map-object
> rbd_id.vm-100-disk-1 /tmp/map
> got osdmap epoch 1043
> osdmaptool: osdmap file '/tmp/map'
>  object 'rbd_id.vm-100-disk-1' -> 0.1ea -> [11,5,3]
>
> # ceph osd tree
> ID WEIGHT  TYPE NAME  UP/DOWN REWEIGHT PRIMARY-AFFINITY
> -1 8.06160 root default
> -7 5.50308 room A
> -3 1.88754 host host1
>  4 0.40369 osd.4   up  1.0  1.0
>  5 0.40369 osd.5   up  1.0  1.0
>  6 0.54008 osd.6   up  1.0  1.0
>  7 0.54008 osd.7   up  1.0  1.0
> -2 3.61554 host host2
>  0 0.90388 osd.0   up  1.0  1.0
>  1 0.90388 osd.1   up  1.0  1.0
>  2 0.90388 osd.2   up  1.0  1.0
>  3 0.90388 osd.3   up  1.0  1.0
> -6 2.55852 room B
> -4 1.75114 host host3
>  8 0.40369 osd.8   up  1.0  1.0
>  9 0.40369 osd.9   up  1.0  1.0
> 10 0.40369 osd.10  up  1.0  1.0
> 11 0.54008 osd.11  up  1.0  1.0
> -5 0.80737 host host4
> 12 0.40369 osd.12  up  1.0  1.0
> 13 0.40369 osd.13  up  1.0  1.0
>
>
> --
> Adam Carheden
>



-- 
Kjetil Joergensen 
SRE, Medallia Inc
Phone: +1 (650) 739-6580


[ceph-users] I/O hangs with 2 node failure even if one node isn't involved in I/O

2017-03-17 Thread Adam Carheden
I have a 4 node cluster shown by `ceph osd tree` below. Monitors are
running on hosts 1, 2 and 3. It has a single replicated pool of size
3. I have a VM with its hard drive replicated to OSDs 11(host3),
5(host1) and 3(host2).

I can 'fail' any one host by disabling the SAN network interface and
the VM keeps running with a simple slowdown in I/O performance just as
expected. However, if I 'fail' both nodes 3 and 4, I/O hangs on the VM.
(i.e. `df` never completes, etc.) The monitors on hosts 1 and 2 still
have quorum, so that shouldn't be an issue. The placement group still
has 2 of its 3 replicas online.

Why does I/O hang even though host4 isn't running a monitor and
doesn't have anything to do with my VM's hard drive?


Size?
# ceph osd pool get rbd size
size: 3

Where's rbd_id.vm-100-disk-1?
# ceph osd getmap -o /tmp/map && osdmaptool --pool 0 --test-map-object
rbd_id.vm-100-disk-1 /tmp/map
got osdmap epoch 1043
osdmaptool: osdmap file '/tmp/map'
 object 'rbd_id.vm-100-disk-1' -> 0.1ea -> [11,5,3]

# ceph osd tree
ID WEIGHT  TYPE NAME  UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 8.06160 root default
-7 5.50308 room A
-3 1.88754 host host1
 4 0.40369 osd.4   up  1.0  1.0
 5 0.40369 osd.5   up  1.0  1.0
 6 0.54008 osd.6   up  1.0  1.0
 7 0.54008 osd.7   up  1.0  1.0
-2 3.61554 host host2
 0 0.90388 osd.0   up  1.0  1.0
 1 0.90388 osd.1   up  1.0  1.0
 2 0.90388 osd.2   up  1.0  1.0
 3 0.90388 osd.3   up  1.0  1.0
-6 2.55852 room B
-4 1.75114 host host3
 8 0.40369 osd.8   up  1.0  1.0
 9 0.40369 osd.9   up  1.0  1.0
10 0.40369 osd.10  up  1.0  1.0
11 0.54008 osd.11  up  1.0  1.0
-5 0.80737 host host4
12 0.40369 osd.12  up  1.0  1.0
13 0.40369 osd.13  up  1.0  1.0


-- 
Adam Carheden