Re: [ceph-users] Cluster unusable after 50% full, even with index sharding

2018-04-13 Thread Christian Balzer

Hello,

On Fri, 13 Apr 2018 11:59:01 -0500 Robert Stanford wrote:

>  I have 65TB stored on 24 OSDs on 3 hosts (8 OSDs per host).  SSD journals
> and spinning disks.  Our performance before was acceptable for our purposes
> - 300+MB/s simultaneous transmit and receive.  Now that we're up to about
> 50% of our total storage capacity (65/120TB, say), the write performance is
> still ok, but the read performance is unworkable (35MB/s!)
> 
As always, full details.
Versions, HW, what SSDs, what HDDs and how connected, what FS on the
OSDs, etc.
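
For reference, a minimal sketch of commands that would collect most of that
(the device name below is an example):

  # Ceph and kernel versions
  ceph --version
  uname -r

  # Disk models, rotational flag and transport for the OSD devices
  lsblk -o NAME,MODEL,SIZE,ROTA,TRAN
  smartctl -i /dev/sdb

  # Filesystem and mount options behind the FileStore OSDs
  mount | grep /var/lib/ceph/osd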
 
>  I am using index sharding, with 256 shards.  I don't see any CPUs
> saturated on any host (we are using radosgw by the way, and the load is
> light there as well).  The hard drives don't seem to be *too* busy (a
> random OSD shows ~10 wa in top).  The network's fine, as we were doing much
> better in terms of speed before we filled up.
>
top is an abysmal tool for these things, use atop in a big terminal window
on all 3 hosts for full situational awareness.
"iostat -x 3" might do in a pinch for IO related bits, too.

Keep in mind that a single busy OSD will drag the performance of the whole
cluster down. 
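
A quick way to spot one overloaded OSD/disk, for example (FileStore layout
under /var/lib/ceph/osd assumed):

  # Per-OSD commit/apply latencies as reported by the cluster
  ceph osd perf

  # Per-device view, refreshed every 3 seconds; look for a disk pinned near
  # 100% %util or with a much higher await than its peers
  iostat -x 3

  # Map a suspicious device back to its OSD
  df | grep /var/lib/ceph/osd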

Other things to check and verify (a rough command sketch follows this list):
1. Are the OSDs reasonably balanced PG-wise?
2. How fragmented are the OSD filesystems?
3. Is a deep scrub running during the low-performance times?
4. Have you run out of RAM for the pagecache and, more importantly, the SLAB
for dir_entries due to the number of objects (files)?
If so, reads will require many more disk accesses than otherwise.
This is a typical wall to run into and can be mitigated by more RAM and
sysctl tuning.
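
A rough sketch of those four checks on the command line (FileStore on XFS
assumed; device names are examples):

  # 1. PG and space balance per OSD ('ceph osd df' needs Hammer or later)
  ceph osd df

  # 2. XFS fragmentation of an OSD data partition (read-only check)
  xfs_db -r -c frag /dev/sdb1

  # 3. Any (deep) scrubs running right now?
  ceph -s
  ceph pg dump 2>/dev/null | grep -c scrubbing

  # 4. Page cache and dentry/inode SLAB pressure
  free -m
  slabtop -o | head -20          # watch the dentry and xfs_inode lines
  sysctl vm.vfs_cache_pressure   # lowering this favours keeping dentries cached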

Christian
 
>   Is there anything we can do about this, short of replacing hardware?  Is
> it really a limitation of Ceph that getting 50% full makes your cluster
> unusable?  Index sharding has seemed to not help at all (I did some
> benchmarking, with 128 shards and then 256; same result each time.)
> 
>  Or are we out of luck?


-- 
Christian Balzer                Network/Systems Engineer
ch...@gol.com                   Rakuten Communications
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Cluster unusable after 50% full, even with index sharding

2018-04-13 Thread Robert Stanford
 I have 65TB stored on 24 OSDs on 3 hosts (8 OSDs per host).  SSD journals
and spinning disks.  Our performance before was acceptable for our purposes
- 300+MB/s simultaneous transmit and receive.  Now that we're up to about
50% of our total storage capacity (65/120TB, say), the write performance is
still ok, but the read performance is unworkable (35MB/s!)

 I am using index sharding, with 256 shards.  I don't see any CPUs
saturated on any host (we are using radosgw by the way, and the load is
light there as well).  The hard drives don't seem to be *too* busy (a
random OSD shows ~10 wa in top).  The network's fine, as we were doing much
better in terms of speed before we filled up.

  Is there anything we can do about this, short of replacing hardware?  Is
it really a limitation of Ceph that getting 50% full makes your cluster
unusable?  Index sharding has seemed to not help at all (I did some
benchmarking, with 128 shards and then 256; same result each time.)

 Or are we out of luck?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Cluster unusable

2014-12-23 Thread Francois Petit


Hi,

We use Ceph 0.80.7 for our IceHouse PoC.
3 MONs, 3 OSD nodes (ids 10,11,12) with 2 OSDs each, 1.5TB of storage,
total.
4 pools for RBD, size=2,  512 PGs per pool

Everything was fine until mid of last week, and here's what happened:
- OSD node #12 passed away
- AFAICR, ceph recovered fine
- I installed a fresh new node #12 (which inadvertently erased its 2
attached OSDs), and used ceph-deploy to make the node and its 2 OSDs join
the cluster
- it was looking okay, except that the weight for the 2 OSDs (osd.0 and
osd.4) was a solid -3.052e-05.
- I applied the workaround from http://tracker.ceph.com/issues/9998: 'ceph
osd crush reweight' on both OSDs (sketched below, after this list)
- ceph was then busy redistributing PGs on the 6 OSDs. This was on Friday
evening
- on Monday morning (yesterday), ceph was still busy. Actually the two new
OSDs were flapping (the log message 'map eX wrongly marked me down' appeared every minute)
- I found the root cause was the firewall on node #12. I opened tcp ports
6789-6900 and this solved the flapping issue
- ceph kept on reorganising PGs and reached this unhealthy state:
--- 900 PGs stuck unclean
--- some 'requests are blocked > 32 sec'
--- command 'rbd info images/image_id' hung
--- all tested VMs hung
- So I tried this:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-August/032929.html,
 and removed the 2 new OSDs
- ceph again started rebalancing data, and things were looking better (VMs
responding, although pretty slowly)
- but at the end, which is the current state, the cluster was back to an
unhealthy state, and our PoC is stuck.
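
For reference, the two fixes mentioned above amount to roughly the following
(weights taken from the osd tree below; firewalld is only one way to open the
ports, plain iptables works just as well):

  # Workaround from tracker issue 9998: give the new OSDs a real crush weight
  ceph osd crush reweight osd.0 0.27
  ceph osd crush reweight osd.4 0.27

  # Open the monitor/OSD port range on node #12
  firewall-cmd --permanent --add-port=6789-6900/tcp
  firewall-cmd --reload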


Fortunately, the PoC users are out for Christmas. I'm here until Wed 4pm
UTC+1 and then back on Jan 5, so there are around 30 hours left for solving
this PoC sev1 issue. I hope that the community can help me find a
solution before Christmas.



Here are the details (actual host and DC names not shown in these outputs).

[root@MON ~]# date;for im in $(rbd ls images);do echo $im;time rbd info
images/$im;done
Tue Dec 23 06:53:15 GMT 2014
0dde9837-3e45-414d-a2c5-902adee0cfe9

no reply for 2 hours, still ongoing...

[root@MON ]# rbd ls images | head -5
0dde9837-3e45-414d-a2c5-902adee0cfe9
2b62a79c-bdbc-43dc-ad88-dfbfaa9d005e
3917346f-12b4-46b8-a5a1-04296ea0a826
4bde285b-28db-4bef-99d5-47ce07e2463d
7da30b4c-4547-4b4c-a96e-6a3528e03214
[root@MON ]#

[cloud-user@francois-vm2 ~]$ ls -lh /tmp/file
-rw-rw-r--. 1 cloud-user cloud-user 552M Dec 22 22:19 /tmp/file
[cloud-user@francois-vm2 ~]$ rm /tmp/file

no reply for 1 hour, still ongoing. The RBD image used by that VM is
'volume-2e989ca0-b620-42ca-a16f-e218aea32000'


[root@MON ~]# ceph -s
cluster f0e3957f-1df5-4e55-baeb-0b2236ff6e03
 health HEALTH_WARN 1 pgs peering; 3 pgs stuck inactive; 3 pgs stuck
unclean; 103 requests are blocked > 32 sec; noscrub,nodeep-scrub flag(s)
set
 monmap e6: 3 mons at
{MON01=10.60.9.11:6789/0,MON06=10.60.9.16:6789/0,MON09=10.60.9.19:6789/0},
 election epoch 1338, quorum 0,1,2 MON01,MON06,MON09
 osdmap e42050: 6 osds: 6 up, 6 in
flags noscrub,nodeep-scrub
  pgmap v3290710: 2048 pgs, 4 pools, 301 GB data, 58987 objects
600 GB used, 1031 GB / 1632 GB avail
   2 inactive
2045 active+clean
   1 remapped+peering
  client io 818 B/s wr, 0 op/s

[root@MON ~]# ceph health detail
HEALTH_WARN 1 pgs peering; 3 pgs stuck inactive; 3 pgs stuck unclean; 103
requests are blocked > 32 sec; 2 osds have slow requests;
noscrub,nodeep-scrub flag(s) set
pg 5.a7 is stuck inactive for 54776.026394, current state inactive, last
acting [2,1]
pg 5.ae is stuck inactive for 54774.738938, current state inactive, last
acting [2,1]
pg 5.b3 is stuck inactive for 71579.365205, current state remapped+peering,
last acting [1,0]
pg 5.a7 is stuck unclean for 299118.648789, current state inactive, last
acting [2,1]
pg 5.ae is stuck unclean for 286227.592617, current state inactive, last
acting [2,1]
pg 5.b3 is stuck unclean for 71579.365263, current state remapped+peering,
last acting [1,0]
pg 5.b3 is remapped+peering, acting [1,0]
87 ops are blocked > 67108.9 sec
16 ops are blocked > 33554.4 sec
84 ops are blocked > 67108.9 sec on osd.1
16 ops are blocked > 33554.4 sec on osd.1
3 ops are blocked > 67108.9 sec on osd.2
2 osds have slow requests
noscrub,nodeep-scrub flag(s) set


[root@MON]# ceph osd tree
# id    weight  type name                       up/down reweight
-1      1.08    root default
-5      0.54        datacenter dc_TWO
-2      0.54            host node10
1       0.27                osd.1               up      1
5       0.27                osd.5               up      1
-4      0           host node12
-6      0.54        datacenter dc_ONE
-3      0.54            host node11
2       0.27                osd.2               up      1
3       0.27                osd.3               up      1
0       0       osd.0                           up      1
4       0       osd.4                           up      1

(I'm concerned about the above two ghost osd.0 and osd.4...)




Re: [ceph-users] Cluster unusable

2014-12-23 Thread Loic Dachary
Hi François,

Could you paste the output of 'ceph report' somewhere so the pg dump can be
checked? (It's probably going to be a little too big for the mailing list.)
You can bring osd.0 and osd.4 back into the host to which they belong (instead
of sitting at the root of the crush map) with 'ceph osd crush set':

http://ceph.com/docs/master/rados/operations/crush-map/#add-move-an-osd

They won't be used by ruleset 0 because they are not under the default
bucket. To make sure this placement happens automagically, you may consider
using osd_crush_update_on_start=true:

http://ceph.com/docs/master/rados/operations/crush-map/#ceph-crush-location-hook
http://workbench.dachary.org/ceph/ceph/blob/firefly/src/upstart/ceph-osd.conf#L18
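
A sketch of what that could look like here (weights and bucket names taken from
the osd tree in your mail; double-check them against your actual crush map):

  # Put the two OSDs back under their host, inside the default root
  ceph osd crush set osd.0 0.27 root=default datacenter=dc_TWO host=node12
  ceph osd crush set osd.4 0.27 root=default datacenter=dc_TWO host=node12

  # Or let each OSD place itself on startup (ceph.conf on the OSD nodes);
  # by default this puts the OSD under host=$(hostname) root=default
  [osd]
  osd crush update on start = true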

Cheers

On 23/12/2014 09:56, Francois Petit wrote:
 Hi,
 
 We use Ceph 0.80.7 for our IceHouse PoC.
 3 MONs, 3 OSD nodes (ids 10,11,12) with 2 OSDs each, 1.5TB of storage, total.
 4 pools for RBD, size=2,  512 PGs per pool
 
 Everything was fine until mid of last week, and here's what happened:
 - OSD node #12 passed away
 - AFAICR, ceph recovered fine
 - I installed a fresh new node #12 (which inadvertently erased its 2 attached 
 OSDs), and used ceph-deploy to make the node and its 2 OSDs join the cluster
 - it was looking okay, except that the weight for the 2 OSDs (osd.0 and 
 osd.4) was a solid -3.052e-05.
 - I applied the workaround from http://tracker.ceph.com/issues/9998 : 'ceph 
 osd crush reweight' on both OSDs
 - ceph was then busy redistributing PGs on the 6 OSDs. This was on Friday 
 evening
 - on Monday morning (yesterday), ceph was still busy. Actually the two new 
 OSDs were flapping (msg map eX wrongly marked me down every minute)
 - I found the root cause was the firewall on node #12. I opened tcp ports 
 6789-6900 and this solved the flapping issue
 - ceph kept on reorganising PGs and reached this unhealthy state:
 --- 900 PGs stuck unclean
 --- some 'requests are blocked > 32 sec'
 --- command 'rbd info images/image_id' hung
 --- all tested VMs hung
 - So I tried this: 
 http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-August/032929.html, 
 and removed the 2 new OSDs
 - ceph again started rebalancing data, and things were looking better (VMs 
 responding, although pretty slowly)
 - but at the end, which is the current state, the cluster was back to an 
 unhealthy state, and our PoC is stuck.
 
 
 Fortunately, the PoC users are out for Christmas. I'm here until Wed 4pm 
 UTC+1 and then back on Jan 5. So there are around 30 hours left for solving 
 this PoC sev1  issue. So I hope that the community can help me find a 
 solution before Christmas.
 
 
 
 Here are the details (actual host and DC names not shown in these outputs).
 
 [root@MON ~]# date;for im in $(rbd ls images);do echo $im;time rbd info 
 images/$im;done
 Tue Dec 23 06:53:15 GMT 2014
 0dde9837-3e45-414d-a2c5-902adee0cfe9
 
 no reply for 2 hours, still ongoing...
 
 [root@MON ]# rbd ls images | head -5
 0dde9837-3e45-414d-a2c5-902adee0cfe9
 2b62a79c-bdbc-43dc-ad88-dfbfaa9d005e
 3917346f-12b4-46b8-a5a1-04296ea0a826
 4bde285b-28db-4bef-99d5-47ce07e2463d
 7da30b4c-4547-4b4c-a96e-6a3528e03214
 [root@MON ]#
 
 [cloud-user@francois-vm2 ~]$ ls -lh /tmp/file
 -rw-rw-r--. 1 cloud-user cloud-user 552M Dec 22 22:19 /tmp/file
 [cloud-user@francois-vm2 ~]$ rm /tmp/file
 
 no reply for 1 hour, still ongoing. The RBD image used by that VM is 
 'volume-2e989ca0-b620-42ca-a16f-e218aea32000'
 
 
 [root@MON ~]# ceph -s
 cluster f0e3957f-1df5-4e55-baeb-0b2236ff6e03
  health HEALTH_WARN 1 pgs peering; 3 pgs stuck inactive; 3 pgs stuck 
 unclean; 103 requests are blocked > 32 sec; noscrub,nodeep-scrub flag(s) set
  monmap e6: 3 mons at 
 {MON01=10.60.9.11:6789/0,MON06=10.60.9.16:6789/0,MON09=10.60.9.19:6789/0},
  election epoch 1338, quorum 0,1,2 MON01,MON06,MON09
  osdmap e42050: 6 osds: 6 up, 6 in
 flags noscrub,nodeep-scrub
   pgmap v3290710: 2048 pgs, 4 pools, 301 GB data, 58987 objects
 600 GB used, 1031 GB / 1632 GB avail
2 inactive
 2045 active+clean
1 remapped+peering
   client io 818 B/s wr, 0 op/s
 
 [root@MON ~]# ceph health detail
 HEALTH_WARN 1 pgs peering; 3 pgs stuck inactive; 3 pgs stuck unclean; 103 
 requests are blocked > 32 sec; 2 osds have slow requests; 
 noscrub,nodeep-scrub flag(s) set
 pg 5.a7 is stuck inactive for 54776.026394, current state inactive, last 
 acting [2,1]
 pg 5.ae is stuck inactive for 54774.738938, current state inactive, last 
 acting [2,1]
 pg 5.b3 is stuck inactive for 71579.365205, current state remapped+peering, 
 last acting [1,0]
 pg 5.a7 is stuck unclean for 299118.648789, current state inactive, last 
 acting [2,1]
 pg 5.ae is stuck unclean for 286227.592617, current state inactive, last 
 acting [2,1]
 pg 5.b3 is stuck unclean for 71579.365263, current state remapped+peering, 
 last acting [1,0]
 pg 5.b3 is remapped+peering, acting [1,0]
 87 ops are blocked > 67108.9 sec
 16 ops are blocked > 33554.4 sec

Re: [ceph-users] Cluster unusable

2014-12-23 Thread francois.pe...@san-services.com
Hi Loïc,
 
Thanks.
Am trying to find where I can make the report available to you
[root@qvitblhat06 ~]# ceph report > /tmp/ceph_report
report 3298035134
[root@qvitblhat06 ~]# ls -lh /tmp/ceph_report
-rw-r--r--. 1 root root 4.7M Dec 23 10:38 /tmp/ceph_report
[root@qvitblhat06 ~]#

(Sorry guys for the unwanted ad that was sent in my first email...)
 
Francois
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cluster unusable

2014-12-23 Thread francois.pe...@san-services.com
Here you go: http://www.filedropper.com/cephreport
 
Francois
 ___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cluster unusable

2014-12-23 Thread francois.pe...@san-services.com
Hi,


I got a recommendation from Stephan to restart the OSDs one by one.
So I did. It helped a bit (some IOs completed), but in the end the state was
the same as before, and new IOs still hung.

Loïc, thanks for the advice on moving osd.0 and osd.4 back into the game.

Actually this was done simply by restarting ceph on that node:
[root@qvitblhat12 ~]# date;service ceph status
Tue Dec 23 14:36:11 UTC 2014
=== osd.0 ===
osd.0: running {"version":"0.80.7"}
=== osd.4 ===
osd.4: running {"version":"0.80.7"}
[root@qvitblhat12 ~]# date;service ceph restart
Tue Dec 23 14:36:17 UTC 2014
=== osd.0 ===
=== osd.0 ===
Stopping Ceph osd.0 on qvitblhat12...kill 4527...kill 4527...done
=== osd.0 ===
create-or-move updating item name 'osd.0' weight 0.27 at location
{host=qvitblhat12,root=default} to crush map
Starting Ceph osd.0 on qvitblhat12...
Running as unit run-4398.service.
=== osd.4 ===
=== osd.4 ===
Stopping Ceph osd.4 on qvitblhat12...kill 5375...done
=== osd.4 ===
create-or-move updating item name 'osd.4' weight 0.27 at location
{host=qvitblhat12,root=default} to crush map
Starting Ceph osd.4 on qvitblhat12...
Running as unit run-4720.service.

[root@qvitblhat06 ~]# ceph osd tree
# id    weight  type name                       up/down reweight
-1      1.62    root default
-5      1.08        datacenter dc_XAT
-2      0.54            host qvitblhat10
1       0.27                osd.1               up      1
5       0.27                osd.5               up      1
-4      0.54            host qvitblhat12
0       0.27                osd.0               up      1
4       0.27                osd.4               up      1
-6      0.54        datacenter dc_QVI
-3      0.54            host qvitblhat11
2       0.27                osd.2               up      1
3       0.27                osd.3               up      1
[root@qvitblhat06 ~]#

This change made ceph rebalance the data, and then came the miracle: all PGs
ended up active+clean.

[root@qvitblhat06 ~]# ceph health detail
HEALTH_WARN noscrub,nodeep-scrub flag(s) set
noscrub,nodeep-scrub flag(s) set

Well, apart from being happy that the cluster is now healthy, I find it a little
bit scary to have to shake it in one direction and then another and hope that it
will eventually recover, while in the meantime my users' IOs are stuck...

So is there a way to understand what happened?

Francois
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com