Re: [ceph-users] Enclosure power failure pausing client IO till all connected hosts up

2015-08-03 Thread Somnath Roy
Hi Max A. Krasilnikov,
Could you please explain why we need 3+ nodes in case of replication factor of 
2 ?
My understanding is client io depends on min_size , which is 1 in this case.

Thanks & Regards
Somnath

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Max A. 
Krasilnikov
Sent: Monday, July 27, 2015 4:07 AM
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Enclosure power failure pausing client IO till all 
connected hosts up

Hello!

On Tue, Jul 07, 2015 at 02:21:56PM +0530, mallikarjuna.biradar wrote:

 Hi all,

 Setup details:
 Two storage enclosures, each connected to 4 OSD nodes (shared storage).
 The failure domain is chassis (enclosure) level. The replication count is 2.
 Each host is allotted 4 drives.

 I have active client IO running on the cluster (random write profile with
 4M block size & 64 queue depth).

 One of the enclosures had a power loss, so all OSDs on the hosts
 connected to that enclosure went down, as expected.

 But client IO got paused. After some time the enclosure & the hosts
 connected to it came up, and all OSDs on those hosts came up.

 Until then, the cluster was not serving IO. Once all hosts & OSDs
 pertaining to that enclosure came up, client IO resumed.

 Can anybody help me understand why the cluster was not serving IO
 during the enclosure failure? Or is it a bug?

With replication factor 2 you have to have 3+ nodes in order to serve clients,
if chooseleaf type > 0.
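For readers following along: "chooseleaf type" refers to the bucket type named
in the pool's CRUSH rule. A minimal sketch of the kind of chassis-level rule
being discussed (names and numbers are illustrative, hammer-era syntax, not the
poster's actual map):

```
# Hypothetical CRUSH rule: "type chassis" forces each replica into a
# distinct chassis bucket, so a size-2 pool needs two healthy chassis,
# and with only two chassis, losing one leaves every PG with one copy.
rule replicated_chassis {
    ruleset 1
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type chassis
    step emit
}
```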

--
WBR, Max A. Krasilnikov
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com






Re: [ceph-users] Enclosure power failure pausing client IO till all connected hosts up

2015-07-27 Thread Max A. Krasilnikov
Hello!

On Tue, Jul 07, 2015 at 02:21:56PM +0530, mallikarjuna.biradar wrote:

 Hi all,

 Setup details:
 Two storage enclosures, each connected to 4 OSD nodes (shared storage).
 The failure domain is chassis (enclosure) level. The replication count is 2.
 Each host is allotted 4 drives.

 I have active client IO running on the cluster (random write profile with
 4M block size & 64 queue depth).

 One of the enclosures had a power loss, so all OSDs on the hosts
 connected to that enclosure went down, as expected.

 But client IO got paused. After some time the enclosure & the hosts
 connected to it came up, and all OSDs on those hosts came up.

 Until then, the cluster was not serving IO. Once all hosts & OSDs
 pertaining to that enclosure came up, client IO resumed.

 Can anybody help me understand why the cluster was not serving IO
 during the enclosure failure? Or is it a bug?

With replication factor 2 you have to have 3+ nodes in order to serve clients,
if chooseleaf type > 0.

-- 
WBR, Max A. Krasilnikov


Re: [ceph-users] Enclosure power failure pausing client IO till all connected hosts up

2015-07-24 Thread Robert LeBlanc
Sorry, autocorrect. Decompiled crush map.

Robert LeBlanc

Sent from a mobile device, please excuse any typos.
On Jul 24, 2015 9:44 AM, Robert LeBlanc rob...@leblancnet.us wrote:

 Please provide the recompiled crush map.

 Robert LeBlanc

 Sent from a mobile device, please excuse any typos.
 On Jul 23, 2015 7:05 AM, Mallikarjun Biradar 
 mallikarjuna.bira...@gmail.com wrote:

 Hi all,

 Setup details:
 Two storage enclosures, each connected to 4 OSD nodes (shared storage).
 The failure domain is chassis (enclosure) level. The replication count is 2.
 Each host is allotted 4 drives.

 I have active client IO running on the cluster (random write profile with
 4M block size & 64 queue depth).

 One of the enclosures had a power loss, so all OSDs on the hosts
 connected to that enclosure went down, as expected.

 But client IO got paused. After some time the enclosure & the hosts
 connected to it came up, and all OSDs on those hosts came up.

 Until then, the cluster was not serving IO. Once all hosts & OSDs
 pertaining to that enclosure came up, client IO resumed.

 Can anybody help me understand why the cluster was not serving IO
 during the enclosure failure? Or is it a bug?

 -Thanks & regards,
 Mallikarjun Biradar


Re: [ceph-users] Enclosure power failure pausing client IO till all connected hosts up

2015-07-23 Thread Varada Kari
(Adding devel list to the CC)
Hi Eric,

To add more context to the problem:

Min_size was set to 1 and replication size is 2.

There was a flaky power connection to one of the enclosures. With min_size 1,
we were able to continue the IO, and recovery was active once the power came
back. But if there is a power failure again while recovery is in progress, some
of the PGs go to the down+peering state.

Extract from pg query.

$ ceph pg 1.143 query
{ "state": "down+peering",
  "snap_trimq": "[]",
  "epoch": 3918,
  "up": [
17],
  "acting": [
17],
  "info": { "pgid": "1.143",
      "last_update": "3166'40424",
      "last_complete": "3166'40424",
      "log_tail": "2577'36847",
      "last_user_version": 40424,
      "last_backfill": "MAX",
      "purged_snaps": "[]",

.. "recovery_state": [
        { "name": "Started\/Primary\/Peering\/GetInfo",
          "enter_time": "2015-07-15 12:48:51.372676",
          "requested_info_from": []},
        { "name": "Started\/Primary\/Peering",
          "enter_time": "2015-07-15 12:48:51.372675",
          "past_intervals": [
                { "first": 3147,
                  "last": 3166,
                  "maybe_went_rw": 1,
                  "up": [
                        17,
                        4],
                  "acting": [
                        17,
                        4],
                  "primary": 17,
                  "up_primary": 17},
                { "first": 3167,
                  "last": 3167,
                  "maybe_went_rw": 0,
                  "up": [
                        10,
                        20],
                  "acting": [
                        10,
                        20],
                  "primary": 10,
                  "up_primary": 10},
                { "first": 3168,
                  "last": 3181,
                  "maybe_went_rw": 1,
                  "up": [
                        10,
                        20],
                  "acting": [
                        10,
                        4],
                  "primary": 10,
                  "up_primary": 10},
                { "first": 3182,
                  "last": 3184,
                  "maybe_went_rw": 0,
                  "up": [
                        20],
                  "acting": [
                        4],
                  "primary": 4,
                  "up_primary": 20},
                { "first": 3185,
                  "last": 3188,
                  "maybe_went_rw": 1,
                  "up": [
                        20],
                  "acting": [
                        20],
                  "primary": 20,
                  "up_primary": 20}],
          "probing_osds": [
                17,
                20],
          "blocked": "peering is blocked due to down osds",
          "down_osds_we_would_probe": [
                4,
                10],
          "peering_blocked_by": [
                { "osd": 4,
                  "current_lost_at": 0,
                  "comment": "starting or marking this osd lost may let us proceed"},
                { "osd": 10,
                  "current_lost_at": 0,
                  "comment": "starting or marking this osd lost may let us proceed"}]},
        { "name": "Started",
          "enter_time": "2015-07-15 12:48:51.372671"}],
  "agent_state": {}}

And the PGs do not come back to active+clean until power is restored again.
During this period no IO is allowed to the cluster. I am not able to follow why
the PGs end up in the peering state. Each PG has two copies, one in each of the
enclosures. If one enclosure is down for some time, we should be able to serve
IO from the second one. That was true as long as no recovery IO was involved;
in case of any recovery, we end up with some PGs in the down+peering state.
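The pg query output above already names the blocking OSDs. A sketch of how one
might proceed (pg id and osd ids are the ones from this thread; note that
restarting the down OSDs is always preferable, since marking an OSD lost can
discard writes that only it held):

```shell
# Show exactly what is blocking peering for the stuck PG.
ceph pg 1.143 query | grep -A 4 peering_blocked_by

# Preferred fix: bring osd.4 and osd.10 back up. Last resort, as the
# query output itself suggests: mark a down OSD lost so peering can
# proceed (destructive; may lose unreplicated writes).
ceph osd lost 4 --yes-i-really-mean-it
```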

Thanks,
Varada


-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Eric 
Eastman
Sent: Thursday, July 23, 2015 8:37 PM
To: Mallikarjun Biradar mallikarjuna.bira...@gmail.com
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Enclosure power failure pausing client IO till all 
connected hosts up

You may want to check the min_size value for your pools.  If it is set to the
pool size value, then the cluster will not do I/O if you lose a chassis.
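As a quick sketch of how to inspect and adjust this (the pool name "rbd" is
just an example; lowering min_size trades safety for availability, as discussed
elsewhere in this thread):

```shell
ceph osd pool get rbd size       # replica count for the pool
ceph osd pool get rbd min_size   # live replicas required to serve IO
ceph osd pool set rbd min_size 1 # allow IO with a single live replica
```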

On Sun, Jul 5, 2015 at 11:04 PM, Mallikarjun Biradar 
mallikarjuna.bira...@gmail.com wrote:
 Hi all,

 Setup details:
 Two storage enclosures, each connected to 4 OSD nodes (shared storage).
 The failure domain is chassis (enclosure) level. The replication count is 2.
 Each host is allotted 4 drives.

 I have active client IO running on the cluster (random write profile with
 4M block size & 64 queue depth).

 One of the enclosures had a power loss, so all OSDs on the hosts
 connected to that enclosure went down, as expected.

 But client IO got paused. After some time the enclosure & the hosts
 connected to it came up, and all OSDs on those hosts came up.

 Until then, the cluster was not serving IO. Once all hosts & OSDs
 pertaining to that enclosure came up, client IO resumed.

 Can anybody help me understand why the cluster was not serving IO
 during the enclosure failure? Or is it a bug?

Re: [ceph-users] Enclosure power failure pausing client IO till all connected hosts up

2015-07-23 Thread Eric Eastman
You may want to check the min_size value for your pools.  If it is
set to the pool size value, then the cluster will not do I/O if you
lose a chassis.

On Sun, Jul 5, 2015 at 11:04 PM, Mallikarjun Biradar
mallikarjuna.bira...@gmail.com wrote:
 Hi all,

 Setup details:
 Two storage enclosures, each connected to 4 OSD nodes (shared storage).
 The failure domain is chassis (enclosure) level. The replication count is 2.
 Each host is allotted 4 drives.

 I have active client IO running on the cluster (random write profile with
 4M block size & 64 queue depth).

 One of the enclosures had a power loss, so all OSDs on the hosts
 connected to that enclosure went down, as expected.

 But client IO got paused. After some time the enclosure & the hosts
 connected to it came up, and all OSDs on those hosts came up.

 Until then, the cluster was not serving IO. Once all hosts & OSDs
 pertaining to that enclosure came up, client IO resumed.

 Can anybody help me understand why the cluster was not serving IO
 during the enclosure failure? Or is it a bug?

 -Thanks & regards,
 Mallikarjun Biradar



Re: [ceph-users] Enclosure power failure pausing client IO till all connected hosts up

2015-07-15 Thread Mallikarjun Biradar
Sorry for the delay in replying to this; I was doing some retries on
this issue so that I could summarise.


Tony,
Setup details:
Two storage boxes (each with 12 drives), each connected to 4 hosts.
Each host owns 3 disks from its storage box, for a total of 24 OSDs.
The failure domain is at the chassis level.

OSD tree:
 -1  164.2   root default
-7  82.08   chassis chassis1
-2  20.52   host host-1
0   6.84osd.0   up  1
1   6.84osd.1   up  1
2   6.84osd.2   up  1
-3  20.52   host host-2
3   6.84osd.3   up  1
4   6.84osd.4   up  1
5   6.84osd.5   up  1
-4  20.52   host host-3
6   6.84osd.6   up  1
7   6.84osd.7   up  1
8   6.84osd.8   up  1
-5  20.52   host host-4
9   6.84osd.9   up  1
10  6.84osd.10  up  1
11  6.84osd.11  up  1
-8  82.08   chassis chassis2
-6  20.52   host host-5
12  6.84osd.12  up  1
13  6.84osd.13  up  1
14  6.84osd.14  up  1
-9  20.52   host host-6
15  6.84osd.15  up  1
16  6.84osd.16  up  1
17  6.84osd.17  up  1
-10 20.52   host host-7
18  6.84osd.18  up  1
19  6.84osd.19  up  1
20  6.84osd.20  up  1
-11 20.52   host host-8
21  6.84osd.21  up  1
22  6.84osd.22  up  1
23  6.84osd.23  up  1

Cluster had ~30TB of data, with client IO in progress on the cluster.
After chassis1 underwent a power cycle:
1) all OSDs under chassis2 were intact, up & running;
2) all OSDs under chassis1 were down, as expected.

But client IO was paused until all the hosts/OSDs under chassis1
came back up. This issue was observed twice out of 5 attempts.

Size is 2 & min_size is 1.

-Thanks,
Mallikarjun


On Thu, Jul 9, 2015 at 8:01 PM, Tony Harris neth...@gmail.com wrote:
 Sounds to me like you've put yourself at too much risk - *if* I'm reading
 your message right about your configuration, you have multiple hosts
 accessing OSDs that are stored on a single shared box - so if that single
 shared box (single point of failure for multiple nodes) goes down it's
 possible for multiple replicas to disappear at the same time which could
 halt the operation of your cluster if the masters and the replicas are both
 on OSDs within that single shared storage system...

 On Thu, Jul 9, 2015 at 5:42 AM, Mallikarjun Biradar
 mallikarjuna.bira...@gmail.com wrote:

 Hi all,

 Setup details:
 Two storage enclosures each connected to 4 OSD nodes (Shared storage).
 Failure domain is Chassis (enclosure) level. Replication count is 2.
 Each host has allotted with 4 drives.

 I have active client IO running on cluster. (Random write profile with
 4M block size  64 Queue depth).

 One of enclosure had power loss. So all OSD's from hosts that are
 connected to this enclosure went down as expected.

 But client IO got paused. After some time enclosure  hosts connected
 to it came up.
 And all OSD's on that hosts came up.

 Till this time, cluster was not serving IO. Once all hosts  OSD's
 pertaining to that enclosure came up, client IO resumed.


 Can anybody help me why cluster not serving IO during enclosure
 failure. OR its a bug?

 -Thanks  regards,
 Mallikarjun Biradar


Re: [ceph-users] Enclosure power failure pausing client IO till all connected hosts up

2015-07-15 Thread Mallikarjun Biradar
cluster state:
 osdmap e3240: 24 osds: 12 up, 12 in
  pgmap v46050: 1088 pgs, 2 pools, 20322 GB data, 5080 kobjects
        22224 GB used, 61841 GB / 84065 GB avail
4745644/10405374 objects degraded (45.608%);
3688079/10405374 objects misplaced (35.444%)
   5 stale+active+clean
  59 active+clean
  74 active+undersized+degraded+remapped+backfilling
  53 active+remapped
 577 active+undersized+degraded
  37 down+peering
 283 active+undersized+degraded+remapped+wait_backfill
recovery io 844 MB/s, 211 objects/s
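The degraded/misplaced percentages in the pgmap line above can be reproduced
directly, which also clarifies what they count: degraded (or misplaced) object
instances over the total number of object instances:

```python
# Recompute the percentages reported in the pgmap status above.
degraded, misplaced, total = 4745644, 3688079, 10405374

print(round(100 * degraded / total, 3))   # 45.608, as reported
print(round(100 * misplaced / total, 3))  # 35.444, as reported
```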

On Wed, Jul 15, 2015 at 2:29 PM, Mallikarjun Biradar
mallikarjuna.bira...@gmail.com wrote:
 Sorry for the delay in replying to this; I was doing some retries on
 this issue so that I could summarise.


 Tony,
 Setup details:
 Two storage boxes (each with 12 drives), each connected to 4 hosts.
 Each host owns 3 disks from its storage box, for a total of 24 OSDs.
 The failure domain is at the chassis level.

 OSD tree:
  -1  164.2   root default
 -7  82.08   chassis chassis1
 -2  20.52   host host-1
 0   6.84osd.0   up  1
 1   6.84osd.1   up  1
 2   6.84osd.2   up  1
 -3  20.52   host host-2
 3   6.84osd.3   up  1
 4   6.84osd.4   up  1
 5   6.84osd.5   up  1
 -4  20.52   host host-3
 6   6.84osd.6   up  1
 7   6.84osd.7   up  1
 8   6.84osd.8   up  1
 -5  20.52   host host-4
 9   6.84osd.9   up  1
 10  6.84osd.10  up  1
 11  6.84osd.11  up  1
 -8  82.08   chassis chassis2
 -6  20.52   host host-5
 12  6.84osd.12  up  1
 13  6.84osd.13  up  1
 14  6.84osd.14  up  1
 -9  20.52   host host-6
 15  6.84osd.15  up  1
 16  6.84osd.16  up  1
 17  6.84osd.17  up  1
 -10 20.52   host host-7
 18  6.84osd.18  up  1
 19  6.84osd.19  up  1
 20  6.84osd.20  up  1
 -11 20.52   host host-8
 21  6.84osd.21  up  1
 22  6.84osd.22  up  1
 23  6.84osd.23  up  1

 Cluster had ~30TB of data, with client IO in progress on the cluster.
 After chassis1 underwent a power cycle:
 1) all OSDs under chassis2 were intact, up & running;
 2) all OSDs under chassis1 were down, as expected.

 But client IO was paused until all the hosts/OSDs under chassis1
 came back up. This issue was observed twice out of 5 attempts.

 Size is 2 & min_size is 1.

 -Thanks,
 Mallikarjun


 On Thu, Jul 9, 2015 at 8:01 PM, Tony Harris neth...@gmail.com wrote:
 Sounds to me like you've put yourself at too much risk - *if* I'm reading
 your message right about your configuration, you have multiple hosts
 accessing OSDs that are stored on a single shared box - so if that single
 shared box (single point of failure for multiple nodes) goes down it's
 possible for multiple replicas to disappear at the same time which could
 halt the operation of your cluster if the masters and the replicas are both
 on OSDs within that single shared storage system...

 On Thu, Jul 9, 2015 at 5:42 AM, Mallikarjun Biradar
 mallikarjuna.bira...@gmail.com wrote:

 Hi all,

 Setup details:
 Two storage enclosures, each connected to 4 OSD nodes (shared storage).
 The failure domain is chassis (enclosure) level. The replication count is 2.
 Each host is allotted 4 drives.

 I have active client IO running on the cluster (random write profile with
 4M block size & 64 queue depth).

 One of the enclosures had a power loss, so all OSDs on the hosts
 connected to that enclosure went down, as expected.

 But client IO got paused. After some time the enclosure & the hosts
 connected to it came up, and all OSDs on those hosts came up.

 Until then, the cluster was not serving IO. Once all hosts & OSDs
 pertaining to that enclosure came up, client IO resumed.

 Can anybody help me understand why the cluster was not serving IO
 during the enclosure failure? Or is it a bug?

 -Thanks & regards,
 Mallikarjun Biradar



Re: [ceph-users] Enclosure power failure pausing client IO till all connected hosts up

2015-07-09 Thread Tony Harris
Sounds to me like you've put yourself at too much risk - *if* I'm reading
your message right about your configuration, you have multiple hosts
accessing OSDs that are stored on a single shared box - so if that single
shared box (single point of failure for multiple nodes) goes down it's
possible for multiple replicas to disappear at the same time which could
halt the operation of your cluster if the masters and the replicas are both
on OSDs within that single shared storage system...

On Thu, Jul 9, 2015 at 5:42 AM, Mallikarjun Biradar 
mallikarjuna.bira...@gmail.com wrote:

 Hi all,

 Setup details:
 Two storage enclosures, each connected to 4 OSD nodes (shared storage).
 The failure domain is chassis (enclosure) level. The replication count is 2.
 Each host is allotted 4 drives.

 I have active client IO running on the cluster (random write profile with
 4M block size & 64 queue depth).

 One of the enclosures had a power loss, so all OSDs on the hosts
 connected to that enclosure went down, as expected.

 But client IO got paused. After some time the enclosure & the hosts
 connected to it came up, and all OSDs on those hosts came up.

 Until then, the cluster was not serving IO. Once all hosts & OSDs
 pertaining to that enclosure came up, client IO resumed.

 Can anybody help me understand why the cluster was not serving IO
 during the enclosure failure? Or is it a bug?

 -Thanks & regards,
 Mallikarjun Biradar


Re: [ceph-users] Enclosure power failure pausing client IO till all connected hosts up

2015-07-09 Thread Jan Schermer
What is the min_size setting for the pool? If you have size=2 and min_size=2, 
then all your data is safe when one replica is down, but the IO is paused. If 
you want to continue IO you need to set min_size=1.
But be aware that a single failure after that causes you to lose all the data, 
you’d have to revert to the other replica if it comes up and works - no idea 
how that works in ceph but will likely be a PITA to do.

Jan
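Jan's rule of thumb can be captured in a toy model (an illustration of the
behaviour described above, not Ceph source code):

```python
# A replicated PG accepts client IO only while the number of live
# replicas is at least the pool's min_size.
def pg_accepts_io(live_replicas: int, min_size: int) -> bool:
    return live_replicas >= min_size

# size=2, min_size=2: one enclosure down -> 1 live replica -> IO pauses.
assert pg_accepts_io(live_replicas=1, min_size=2) is False
# size=2, min_size=1: the surviving replica keeps serving IO,
# at the cost of having no redundancy until recovery completes.
assert pg_accepts_io(live_replicas=1, min_size=1) is True
```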

 On 09 Jul 2015, at 12:42, Mallikarjun Biradar 
 mallikarjuna.bira...@gmail.com wrote:
 
 Hi all,

 Setup details:
 Two storage enclosures, each connected to 4 OSD nodes (shared storage).
 The failure domain is chassis (enclosure) level. The replication count is 2.
 Each host is allotted 4 drives.

 I have active client IO running on the cluster (random write profile with
 4M block size & 64 queue depth).

 One of the enclosures had a power loss, so all OSDs on the hosts
 connected to that enclosure went down, as expected.

 But client IO got paused. After some time the enclosure & the hosts
 connected to it came up, and all OSDs on those hosts came up.

 Until then, the cluster was not serving IO. Once all hosts & OSDs
 pertaining to that enclosure came up, client IO resumed.

 Can anybody help me understand why the cluster was not serving IO
 during the enclosure failure? Or is it a bug?

 -Thanks & regards,
 Mallikarjun Biradar


Re: [ceph-users] Enclosure power failure pausing client IO till all connected hosts up

2015-07-09 Thread Jan Schermer
And are the OSDs getting marked down during the outage?
Are all the MONs still up?

Jan

 On 09 Jul 2015, at 13:20, Mallikarjun Biradar 
 mallikarjuna.bira...@gmail.com wrote:
 
 I have size=2 & min_size=1 and IO is paused till all hosts come back.
 
 On Thu, Jul 9, 2015 at 4:41 PM, Jan Schermer j...@schermer.cz wrote:
 What is the min_size setting for the pool? If you have size=2 and 
 min_size=2, then all your data is safe when one replica is down, but the IO 
 is paused. If you want to continue IO you need to set min_size=1.
 But be aware that a single failure after that causes you to lose all the 
 data, you’d have to revert to the other replica if it comes up and works - 
 no idea how that works in ceph but will likely be a PITA to do.
 
 Jan
 
 On 09 Jul 2015, at 12:42, Mallikarjun Biradar 
 mallikarjuna.bira...@gmail.com wrote:
 
 Hi all,

 Setup details:
 Two storage enclosures, each connected to 4 OSD nodes (shared storage).
 The failure domain is chassis (enclosure) level. The replication count is 2.
 Each host is allotted 4 drives.

 I have active client IO running on the cluster (random write profile with
 4M block size & 64 queue depth).

 One of the enclosures had a power loss, so all OSDs on the hosts
 connected to that enclosure went down, as expected.

 But client IO got paused. After some time the enclosure & the hosts
 connected to it came up, and all OSDs on those hosts came up.

 Until then, the cluster was not serving IO. Once all hosts & OSDs
 pertaining to that enclosure came up, client IO resumed.

 Can anybody help me understand why the cluster was not serving IO
 during the enclosure failure? Or is it a bug?

 -Thanks & regards,
 Mallikarjun Biradar


Re: [ceph-users] Enclosure power failure pausing client IO till all connected hosts up

2015-07-09 Thread Mallikarjun Biradar
Yeah. All OSDs are down and the monitors are still up.

On Thu, Jul 9, 2015 at 4:51 PM, Jan Schermer j...@schermer.cz wrote:
 And are the OSDs getting marked down during the outage?
 Are all the MONs still up?

 Jan

 On 09 Jul 2015, at 13:20, Mallikarjun Biradar 
 mallikarjuna.bira...@gmail.com wrote:

 I have size=2 & min_size=1 and IO is paused till all hosts come back.

 On Thu, Jul 9, 2015 at 4:41 PM, Jan Schermer j...@schermer.cz wrote:
 What is the min_size setting for the pool? If you have size=2 and 
 min_size=2, then all your data is safe when one replica is down, but the IO 
 is paused. If you want to continue IO you need to set min_size=1.
 But be aware that a single failure after that causes you to lose all the 
 data, you’d have to revert to the other replica if it comes up and works - 
 no idea how that works in ceph but will likely be a PITA to do.

 Jan

 On 09 Jul 2015, at 12:42, Mallikarjun Biradar 
 mallikarjuna.bira...@gmail.com wrote:

 Hi all,

 Setup details:
 Two storage enclosures, each connected to 4 OSD nodes (shared storage).
 The failure domain is chassis (enclosure) level. The replication count is 2.
 Each host is allotted 4 drives.

 I have active client IO running on the cluster (random write profile with
 4M block size & 64 queue depth).

 One of the enclosures had a power loss, so all OSDs on the hosts
 connected to that enclosure went down, as expected.

 But client IO got paused. After some time the enclosure & the hosts
 connected to it came up, and all OSDs on those hosts came up.

 Until then, the cluster was not serving IO. Once all hosts & OSDs
 pertaining to that enclosure came up, client IO resumed.

 Can anybody help me understand why the cluster was not serving IO
 during the enclosure failure? Or is it a bug?

 -Thanks & regards,
 Mallikarjun Biradar


Re: [ceph-users] Enclosure power failure pausing client IO till all connected hosts up

2015-07-09 Thread Mallikarjun Biradar
I have size=2 & min_size=1 and IO is paused till all hosts come back.

On Thu, Jul 9, 2015 at 4:41 PM, Jan Schermer j...@schermer.cz wrote:
 What is the min_size setting for the pool? If you have size=2 and min_size=2, 
 then all your data is safe when one replica is down, but the IO is paused. If 
 you want to continue IO you need to set min_size=1.
 But be aware that a single failure after that causes you to lose all the 
 data, you’d have to revert to the other replica if it comes up and works - no 
 idea how that works in ceph but will likely be a PITA to do.

 Jan

 On 09 Jul 2015, at 12:42, Mallikarjun Biradar 
 mallikarjuna.bira...@gmail.com wrote:

 Hi all,

 Setup details:
 Two storage enclosures, each connected to 4 OSD nodes (shared storage).
 The failure domain is chassis (enclosure) level. The replication count is 2.
 Each host is allotted 4 drives.

 I have active client IO running on the cluster (random write profile with
 4M block size & 64 queue depth).

 One of the enclosures had a power loss, so all OSDs on the hosts
 connected to that enclosure went down, as expected.

 But client IO got paused. After some time the enclosure & the hosts
 connected to it came up, and all OSDs on those hosts came up.

 Until then, the cluster was not serving IO. Once all hosts & OSDs
 pertaining to that enclosure came up, client IO resumed.

 Can anybody help me understand why the cluster was not serving IO
 during the enclosure failure? Or is it a bug?

 -Thanks & regards,
 Mallikarjun Biradar


Re: [ceph-users] Enclosure power failure pausing client IO till all connected hosts up

2015-07-09 Thread Gregory Farnum
Your first point of troubleshooting is pretty much always to look at
ceph -s and see what it says. In this case it's probably telling you
that some PGs are down, and then you can look at why (but perhaps it's
something else).
-Greg
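A few hammer-era commands for that first look (illustrative; the pg id is the
one quoted earlier in this thread):

```shell
ceph -s                       # overall status and PG state counts
ceph health detail            # names the specific down/stuck PGs
ceph pg dump_stuck inactive   # PGs currently unable to serve IO
ceph pg 1.143 query           # per-PG detail, e.g. "peering_blocked_by"
```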

On Thu, Jul 9, 2015 at 12:22 PM, Mallikarjun Biradar
mallikarjuna.bira...@gmail.com wrote:
 Yeah. All OSDs are down and the monitors are still up.

 On Thu, Jul 9, 2015 at 4:51 PM, Jan Schermer j...@schermer.cz wrote:
 And are the OSDs getting marked down during the outage?
 Are all the MONs still up?

 Jan

 On 09 Jul 2015, at 13:20, Mallikarjun Biradar 
 mallikarjuna.bira...@gmail.com wrote:

 I have size=2 & min_size=1 and IO is paused till all hosts come back.

 On Thu, Jul 9, 2015 at 4:41 PM, Jan Schermer j...@schermer.cz wrote:
 What is the min_size setting for the pool? If you have size=2 and 
 min_size=2, then all your data is safe when one replica is down, but the 
 IO is paused. If you want to continue IO you need to set min_size=1.
 But be aware that a single failure after that causes you to lose all the 
 data, you’d have to revert to the other replica if it comes up and works - 
 no idea how that works in ceph but will likely be a PITA to do.

 Jan

 On 09 Jul 2015, at 12:42, Mallikarjun Biradar 
 mallikarjuna.bira...@gmail.com wrote:

 Hi all,

 Setup details:
 Two storage enclosures, each connected to 4 OSD nodes (shared storage).
 The failure domain is chassis (enclosure) level. The replication count is 2.
 Each host is allotted 4 drives.

 I have active client IO running on the cluster (random write profile with
 4M block size & 64 queue depth).

 One of the enclosures had a power loss, so all OSDs on the hosts
 connected to that enclosure went down, as expected.

 But client IO got paused. After some time the enclosure & the hosts
 connected to it came up, and all OSDs on those hosts came up.

 Until then, the cluster was not serving IO. Once all hosts & OSDs
 pertaining to that enclosure came up, client IO resumed.

 Can anybody help me understand why the cluster was not serving IO
 during the enclosure failure? Or is it a bug?

 -Thanks & regards,
 Mallikarjun Biradar