Re: [ceph-users] Enclosure power failure pausing client IO till all connected hosts up
Hi Max,

Could you please explain why we need 3+ nodes in the case of a replication factor of 2? My understanding is that client IO depends on min_size, which is 1 in this case.

Thanks & Regards,
Somnath

-----Original Message-----
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Max A. Krasilnikov
Sent: Monday, July 27, 2015 4:07 AM
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Enclosure power failure pausing client IO till all connected hosts up

Hello!

On Tue, Jul 07, 2015 at 02:21:56PM +0530, mallikarjuna.biradar wrote:

> Hi all,
>
> Setup details: Two storage enclosures, each connected to 4 OSD nodes (shared storage). The failure domain is chassis (enclosure) level, the replication count is 2, and each host is allotted 4 drives.
>
> I have active client IO running on the cluster (random-write profile, 4M block size, queue depth 64). One of the enclosures lost power, so all OSDs on the hosts connected to that enclosure went down, as expected. But client IO got paused. After some time the enclosure and the hosts connected to it came back up, and all OSDs on those hosts came up. Until then, the cluster was not serving IO; once all OSDs pertaining to that enclosure came up, client IO resumed.
>
> Can anybody help me understand why the cluster was not serving IO during the enclosure failure? Or is it a bug?

With replication factor 2 you have to use 3+ nodes in order to serve clients, if chooseleaf type is 0.

--
WBR, Max A. Krasilnikov

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
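An aside on the point Max and Somnath are debating: whether min_size=1 keeps IO flowing depends on where CRUSH picks its leaves. The following is a toy Python model of replica placement (illustrative only, not Ceph code, using a made-up 8-OSD/2-enclosure topology): with OSD-level placement (chooseleaf type 0), some PGs get both replicas inside one enclosure, so losing that enclosure stops their IO regardless of min_size, while chassis-level placement always leaves a live replica.

```python
import itertools

# Toy cluster: 8 OSDs, 4 per enclosure (chassis). This mirrors the
# reported topology, scaled down; the names are illustrative only.
OSD_CHASSIS = {0: "c1", 1: "c1", 2: "c1", 3: "c1",
               4: "c2", 5: "c2", 6: "c2", 7: "c2"}

def pg_available(replicas, failed_chassis, min_size=1):
    """A PG can serve IO if at least min_size replicas are on live OSDs."""
    alive = [osd for osd in replicas if OSD_CHASSIS[osd] != failed_chassis]
    return len(alive) >= min_size

# chooseleaf type 0 (OSD level): any pair of distinct OSDs is a legal
# placement, including two OSDs inside the same enclosure.
osd_level = list(itertools.combinations(OSD_CHASSIS, 2))

# chooseleaf type chassis: replicas must sit in different enclosures.
chassis_level = [p for p in osd_level
                 if OSD_CHASSIS[p[0]] != OSD_CHASSIS[p[1]]]

# With chassis-level placement, every PG survives the loss of one enclosure...
assert all(pg_available(p, "c1") for p in chassis_level)
# ...but with OSD-level placement some PGs lose both copies at once.
unavailable = [p for p in osd_level if not pg_available(p, "c1")]
print(f"{len(unavailable)} of {len(osd_level)} placements lose both replicas")
```

In this toy topology, 6 of the 28 OSD-level placements put both replicas behind chassis c1, which is the situation Max's "3+ nodes" warning guards against.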
Re: [ceph-users] Enclosure power failure pausing client IO till all connected hosts up
Sorry, autocorrect. Decompiled crush map.

Robert LeBlanc
Sent from a mobile device, please excuse any typos.

On Jul 24, 2015 9:44 AM, Robert LeBlanc <rob...@leblancnet.us> wrote:
> Please provide the recompiled crush map.
>
> Robert LeBlanc
> Sent from a mobile device, please excuse any typos.
>
> On Jul 23, 2015 7:05 AM, Mallikarjun Biradar <mallikarjuna.bira...@gmail.com> wrote:
>> [snip: original report, quoted in full above]
Re: [ceph-users] Enclosure power failure pausing client IO till all connected hosts up
(Adding the devel list to the CC)

Hi Eric,

To add more context to the problem: min_size was set to 1 and the replication size is 2. There was a flaky power connection to one of the enclosures. With min_size 1 we were able to continue the IOs, and recovery was active once the power came back. But if there is another power failure while recovery is in progress, some of the PGs go into the down+peering state.

Extract from pg query:

$ ceph pg 1.143 query
{ "state": "down+peering",
  "snap_trimq": "[]",
  "epoch": 3918,
  "up": [17],
  "acting": [17],
  "info": { "pgid": "1.143",
            "last_update": "3166'40424",
            "last_complete": "3166'40424",
            "log_tail": "2577'36847",
            "last_user_version": 40424,
            "last_backfill": "MAX",
            "purged_snaps": "[]",
            ..
  "recovery_state": [
      { "name": "Started\/Primary\/Peering\/GetInfo",
        "enter_time": "2015-07-15 12:48:51.372676",
        "requested_info_from": []},
      { "name": "Started\/Primary\/Peering",
        "enter_time": "2015-07-15 12:48:51.372675",
        "past_intervals": [
            { "first": 3147, "last": 3166, "maybe_went_rw": 1,
              "up": [17, 4], "acting": [17, 4], "primary": 17, "up_primary": 17},
            { "first": 3167, "last": 3167, "maybe_went_rw": 0,
              "up": [10, 20], "acting": [10, 20], "primary": 10, "up_primary": 10},
            { "first": 3168, "last": 3181, "maybe_went_rw": 1,
              "up": [10, 20], "acting": [10, 4], "primary": 10, "up_primary": 10},
            { "first": 3182, "last": 3184, "maybe_went_rw": 0,
              "up": [20], "acting": [4], "primary": 4, "up_primary": 20},
            { "first": 3185, "last": 3188, "maybe_went_rw": 1,
              "up": [20], "acting": [20], "primary": 20, "up_primary": 20}],
        "probing_osds": [17, 20],
        "blocked": "peering is blocked due to down osds",
        "down_osds_we_would_probe": [4, 10],
        "peering_blocked_by": [
            { "osd": 4, "current_lost_at": 0,
              "comment": "starting or marking this osd lost may let us proceed"},
            { "osd": 10, "current_lost_at": 0,
              "comment": "starting or marking this osd lost may let us proceed"}]},
      { "name": "Started", "enter_time": "2015-07-15 12:48:51.372671"}],
  "agent_state": {}}

And the PGs do not come back to active+clean until power is restored. During this period no IOs are allowed to the cluster.

I am not able to follow why the PGs end up in the peering state. Each PG has two copies, one in each enclosure; if one enclosure is down for some time, we should be able to serve IOs from the second one. That holds as long as no recovery IO is involved; when recovery is in progress, we end up with some PGs in the down+peering state.

Thanks,
Varada

-----Original Message-----
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Eric Eastman
Sent: Thursday, July 23, 2015 8:37 PM
To: Mallikarjun Biradar <mallikarjuna.bira...@gmail.com>
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Enclosure power failure pausing client IO till all connected hosts up

> [snip: Eric's reply and the original report; his message appears in full below]
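The pg query above contains the answer in down_osds_we_would_probe. A rough Python sketch of the idea (a simplification for intuition only, not Ceph's actual peering logic): any past interval that maybe_went_rw and whose acting OSDs are now down blocks peering, because writes from that interval may exist only on those OSDs. That is also why the query says that marking osd.4 and osd.10 lost "may let us proceed": it tells peering to give up on those possible writes.

```python
# Toy model of why a PG sits in down+peering even with min_size=1.
# During peering, Ceph considers every past interval in which the PG may
# have accepted writes ("maybe_went_rw"); if OSDs from such an interval
# are down and cannot be probed, the surviving replicas cannot prove
# they hold the newest data.  This is a loose sketch of that rule, not
# Ceph's peering implementation.

# Past intervals taken from the pg 1.143 query above (epochs omitted).
past_intervals = [
    {"acting": [17, 4],  "maybe_went_rw": True},
    {"acting": [10, 20], "maybe_went_rw": False},
    {"acting": [10, 4],  "maybe_went_rw": True},
    {"acting": [4],      "maybe_went_rw": False},
    {"acting": [20],     "maybe_went_rw": True},
]
up_osds = {17, 20}  # the probing_osds in the query; 4 and 10 are down

def down_osds_we_would_probe(intervals, up):
    """Down OSDs belonging to some interval that may have gone rw --
    the set the pg query reports as down_osds_we_would_probe."""
    need = set()
    for iv in intervals:
        if iv["maybe_went_rw"]:
            need |= {osd for osd in iv["acting"] if osd not in up}
    return need

blocked_on = down_osds_we_would_probe(past_intervals, up_osds)
print(sorted(blocked_on))  # [4, 10], matching the pg query
```

In this model, peering stays blocked while the set is non-empty; restarting those OSDs, or declaring them lost (accepting possible data loss), is what unblocks the PG.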
Re: [ceph-users] Enclosure power failure pausing client IO till all connected hosts up
You may want to check the min_size value for your pools. If it is set to the pool's size value, then the cluster will not do IO if you lose a chassis.

On Sun, Jul 5, 2015 at 11:04 PM, Mallikarjun Biradar <mallikarjuna.bira...@gmail.com> wrote:
> [snip: original report, quoted in full above]
Re: [ceph-users] Enclosure power failure pausing client IO till all connected hosts up
Sorry for the delay in replying; I was doing some retries on this issue so I could summarise.

Tony,

Setup details: Two storage boxes (each with 12 drives), each connected to 4 hosts. Each host owns 3 disks from its storage box, for a total of 24 OSDs. The failure domain is at the chassis level.

OSD tree:

-1  164.2  root default
-7   82.08   chassis chassis1
-2   20.52     host host-1
 0    6.84       osd.0   up  1
 1    6.84       osd.1   up  1
 2    6.84       osd.2   up  1
-3   20.52     host host-2
 3    6.84       osd.3   up  1
 4    6.84       osd.4   up  1
 5    6.84       osd.5   up  1
-4   20.52     host host-3
 6    6.84       osd.6   up  1
 7    6.84       osd.7   up  1
 8    6.84       osd.8   up  1
-5   20.52     host host-4
 9    6.84       osd.9   up  1
10    6.84       osd.10  up  1
11    6.84       osd.11  up  1
-8   82.08   chassis chassis2
-6   20.52     host host-5
12    6.84       osd.12  up  1
13    6.84       osd.13  up  1
14    6.84       osd.14  up  1
-9   20.52     host host-6
15    6.84       osd.15  up  1
16    6.84       osd.16  up  1
17    6.84       osd.17  up  1
-10  20.52     host host-7
18    6.84       osd.18  up  1
19    6.84       osd.19  up  1
20    6.84       osd.20  up  1
-11  20.52     host host-8
21    6.84       osd.21  up  1
22    6.84       osd.22  up  1
23    6.84       osd.23  up  1

The cluster had ~30TB of data, with client IO in progress. After chassis1 underwent a power cycle:

1. All OSDs under chassis2 were intact, up and running.
2. All OSDs under chassis1 were down, as expected. But client IO was paused until all the hosts/OSDs under chassis1 came back up.

This issue was observed twice out of 5 attempts. Size is 2 and min_size is 1.

-Thanks,
Mallikarjun

On Thu, Jul 9, 2015 at 8:01 PM, Tony Harris <neth...@gmail.com> wrote:
> [snip: Tony's reply and the original report; his message appears in full below]
Re: [ceph-users] Enclosure power failure pausing client IO till all connected hosts up
Cluster state:

osdmap e3240: 24 osds: 12 up, 12 in
pgmap v46050: 1088 pgs, 2 pools, 20322 GB data, 5080 kobjects
      22224 GB used, 61841 GB / 84065 GB avail
      4745644/10405374 objects degraded (45.608%)
      3688079/10405374 objects misplaced (35.444%)
           5 stale+active+clean
          59 active+clean
          74 active+undersized+degraded+remapped+backfilling
          53 active+remapped
         577 active+undersized+degraded
          37 down+peering
         283 active+undersized+degraded+remapped+wait_backfill
recovery io 844 MB/s, 211 objects/s

On Wed, Jul 15, 2015 at 2:29 PM, Mallikarjun Biradar <mallikarjuna.bira...@gmail.com> wrote:
> [snip: setup details, OSD tree, and Tony's reply, quoted in full above]
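A quick sanity check on those numbers (my own arithmetic, not part of the original message): the degraded and misplaced percentages are taken over object copies, i.e. objects times the replication factor, which is why the denominator 10405374 is roughly twice the reported 5080 kobjects.

```python
# Recomputing the degraded/misplaced percentages from the cluster state
# above.  The denominator counts object copies (objects x replication
# factor), not unique objects.
degraded, misplaced, copies = 4_745_644, 3_688_079, 10_405_374

def pct(n, total=copies):
    return round(100.0 * n / total, 3)

print(pct(degraded), pct(misplaced))  # 45.608 35.444
```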
Re: [ceph-users] Enclosure power failure pausing client IO till all connected hosts up
Sounds to me like you've put yourself at too much risk. *If* I'm reading your message right about your configuration, you have multiple hosts accessing OSDs that are stored on a single shared box. So if that single shared box (a single point of failure for multiple nodes) goes down, it's possible for multiple replicas to disappear at the same time, which could halt the operation of your cluster if the primaries and the replicas are both on OSDs within that single shared storage system.

On Thu, Jul 9, 2015 at 5:42 AM, Mallikarjun Biradar <mallikarjuna.bira...@gmail.com> wrote:
> [snip: original report, quoted in full above]
Re: [ceph-users] Enclosure power failure pausing client IO till all connected hosts up
What is the min_size setting for the pool? If you have size=2 and min_size=2, then all your data is safe when one replica is down, but IO is paused. If you want IO to continue, you need to set min_size=1. But be aware that a single failure after that causes you to lose all the data; you'd have to revert to the other replica if it comes up and works. No idea how that works in Ceph, but it will likely be a PITA to do.

Jan

On 09 Jul 2015, at 12:42, Mallikarjun Biradar <mallikarjuna.bira...@gmail.com> wrote:
> [snip: original report, quoted in full above]
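Jan's rule of thumb can be written as a one-line predicate. A minimal sketch, assuming a simplified model of the replicated-pool IO gate (this is not Ceph source code):

```python
def io_allowed(up_replicas, min_size):
    """A replicated PG accepts client IO only while at least min_size
    of its replicas are up and in the acting set (simplified model)."""
    return up_replicas >= min_size

# size=2 pool; losing one enclosure leaves one replica per PG:
assert io_allowed(1, min_size=1)      # IO continues (no redundancy left)
assert not io_allowed(1, min_size=2)  # IO pauses until the peer returns
assert io_allowed(2, min_size=2)      # healthy case
print("min_size gate behaves as described")
```

Since the reporter already runs min_size=1, this gate alone cannot explain the pause, which is what points the investigation toward PG peering state instead.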
Re: [ceph-users] Enclosure power failure pausing client IO till all connected hosts up
And are the OSDs getting marked down during the outage? Are all the MONs still up?

Jan

On 09 Jul 2015, at 13:20, Mallikarjun Biradar <mallikarjuna.bira...@gmail.com> wrote:
> I have size=2, min_size=1, and IO is paused till all hosts come back.
>
> [snip: earlier replies, quoted in full elsewhere in the thread]
Re: [ceph-users] Enclosure power failure pausing client IO till all connected hosts up
Yeah. All OSDs down and monitors still up.

On Thu, Jul 9, 2015 at 4:51 PM, Jan Schermer <j...@schermer.cz> wrote:
> And are the OSDs getting marked down during the outage? Are all the MONs still up?
>
> [snip: earlier replies, quoted in full elsewhere in the thread]
Re: [ceph-users] Enclosure power failure pausing client IO till all connected hosts up
I have size=2, min_size=1, and IO is paused till all hosts come back.

On Thu, Jul 9, 2015 at 4:41 PM, Jan Schermer <j...@schermer.cz> wrote:
> [snip: Jan's reply and the original report, quoted in full above]
Re: [ceph-users] Enclosure power failure pausing client IO till all connected hosts up
Your first point of troubleshooting is pretty much always to look at "ceph -s" and see what it says. In this case it's probably telling you that some PGs are down, and then you can look at why (but perhaps it's something else).
-Greg

On Thu, Jul 9, 2015 at 12:22 PM, Mallikarjun Biradar <mallikarjuna.bira...@gmail.com> wrote:
> Yeah. All OSDs down and monitors still up.
>
> [snip: earlier replies, quoted in full elsewhere in the thread]
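Greg's advice can be mechanised a little. A small hypothetical helper (the PG state strings are real Ceph states taken from the cluster status pasted earlier in the thread; the summary-dict format and the function are my own illustration): client IO is served only by PGs whose state includes "active", so filtering for states without it surfaces the blocked ones.

```python
# Summarised PG states, as reported in the cluster state earlier in
# the thread.  The dict format is illustrative; ceph -s prints these
# as "count state" lines.
pg_summary = {
    "active+clean": 59,
    "active+undersized+degraded": 577,
    "active+undersized+degraded+remapped+backfilling": 74,
    "down+peering": 37,
    "stale+active+clean": 5,
}

def blocked_pgs(summary):
    """PG states lacking 'active': client IO to these PGs is paused.
    (stale+active states are also suspect, since the monitor's view of
    them is out of date, but they at least claim to be serving IO.)"""
    return {state: n for state, n in summary.items()
            if "active" not in state.split("+")}

print(blocked_pgs(pg_summary))  # {'down+peering': 37}
```

Here the 37 down+peering PGs are exactly the ones pausing client IO, consistent with the peering-blocked pg query discussed earlier in the thread.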