Re: [ceph-users] Ceph cluster NO read / write performance :: Ops are blocked
Hello again,

Well, I disabled offloads on the NIC -- didn’t work for me. I also tried setting net.ipv4.tcp_moderate_rcvbuf = 0 as suggested elsewhere in the thread, to no avail.

Today I was watching iostat on an OSD box ('iostat -xm 5') when the cluster got into the “slow” state:

Device:  rrqm/s  wrqm/s    r/s     w/s   rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  svctm  %util
sdb        0.00   13.57  84.23  167.47    0.45   2.78     26.26      2.06   8.18   3.85  96.93
sdc        0.00   46.71   5.59  289.22    0.03   2.54     17.85      3.18  10.77   0.97  28.72
sdd        0.00   16.57  45.11   91.62    0.25   0.55     12.01      0.75   5.51   2.45  33.47
sde        0.00   13.57   6.99  143.31    0.03   2.53     34.97      1.99  13.27   2.12  31.86
sdf        0.00   18.76   4.99  158.48    0.10   1.09     14.88      1.26   7.69   1.24  20.26
sdg        0.00   25.55  81.64  237.52    0.44   2.89     21.36      4.14  12.99   2.58  82.22
sdh        0.00   89.42  16.17  492.42    0.09   3.81     15.69     17.12  33.66   0.73  36.95
sdi        0.00   20.16  17.76  189.62    0.10   1.67     17.46      3.45  16.63   1.57  32.55
sdj        0.00   31.54   0.00  185.23    0.00   1.91     21.15      3.33  18.00   0.03   0.62
sdk        0.00   26.15   2.40  133.33    0.01   0.84     12.79      1.07   7.87   0.85  11.58
sdl        0.00   25.55   9.38  123.95    0.05   1.15     18.44      0.50   3.74   1.58  21.10
sdm        0.00    6.39  92.61   47.11    0.47   0.26     10.65      1.27   9.07   6.92  96.73

The %util is rather high on some disks, but I’m not an expert at looking at iostat so I’m not sure how worrisome this is. Does anything here stand out to anyone?

At the time of that iostat, Ceph was reporting a lot of blocked ops on the OSD associated with sde (as well as about 30 other OSDs), but it doesn’t look all that busy. Some simple ‘dd’ tests seem to indicate the disk is fine.

Similarly, iotop seems OK on this host:

    TID  PRIO  USER  DISK READ  DISK WRITE  SWAPIN    IO>  COMMAND
 472477  be/4  root   0.00 B/s    5.59 M/s  0.00 %  0.57 %  ceph-osd -i 111 --pid-file /var/run/ceph/osd.111.pid -c /etc/ceph/ceph.conf --cluster ceph
 470621  be/4  root   0.00 B/s   10.09 M/s  0.00 %  0.40 %  ceph-osd -i 111 --pid-file /var/run/ceph/osd.111.pid -c /etc/ceph/ceph.conf --cluster ceph
3495447  be/4  root   0.00 B/s  272.19 K/s  0.00 %  0.36 %  ceph-osd -i 114 --pid-file /var/run/ceph/osd.114.pid -c /etc/ceph/ceph.conf --cluster ceph
3488389  be/4  root   0.00 B/s  596.80 K/s  0.00 %  0.16 %  ceph-osd -i 109 --pid-file /var/run/ceph/osd.109.pid -c /etc/ceph/ceph.conf --cluster ceph
3488060  be/4  root   0.00 B/s  600.83 K/s  0.00 %  0.15 %  ceph-osd -i 108 --pid-file /var/run/ceph/osd.108.pid -c /etc/ceph/ceph.conf --cluster ceph
3505573  be/4  root   0.00 B/s  528.25 K/s  0.00 %  0.10 %  ceph-osd -i 119 --pid-file /var/run/ceph/osd.119.pid -c /etc/ceph/ceph.conf --cluster ceph
3495434  be/4  root   0.00 B/s    2.02 K/s  0.00 %  0.10 %  ceph-osd -i 114 --pid-file /var/run/ceph/osd.114.pid -c /etc/ceph/ceph.conf --cluster ceph
3502327  be/4  root   0.00 B/s  506.07 K/s  0.00 %  0.09 %  ceph-osd -i 118 --pid-file /var/run/ceph/osd.118.pid -c /etc/ceph/ceph.conf --cluster ceph
3489100  be/4  root   0.00 B/s  106.86 K/s  0.00 %  0.09 %  ceph-osd -i 110 --pid-file /var/run/ceph/osd.110.pid -c /etc/ceph/ceph.conf --cluster ceph
3496631  be/4  root   0.00 B/s  229.85 K/s  0.00 %  0.05 %  ceph-osd -i 115 --pid-file /var/run/ceph/osd.115.pid -c /etc/ceph/ceph.conf --cluster ceph
3505561  be/4  root   0.00 B/s    2.02 K/s  0.00 %  0.03 %  ceph-osd -i 119 --pid-file /var/run/ceph/osd.119.pid -c /etc/ceph/ceph.conf --cluster ceph
3488059  be/4  root   0.00 B/s    2.02 K/s  0.00 %  0.03 %  ceph-osd -i 108 --pid-file /var/run/ceph/osd.108.pid -c /etc/ceph/ceph.conf --cluster ceph
3488391  be/4  root  46.37 K/s  431.47 K/s  0.00 %  0.02 %  ceph-osd -i 109 --pid-file /var/run/ceph/osd.109.pid -c /etc/ceph/ceph.conf --cluster ceph
3500639  be/4  root   0.00 B/s  221.78 K/s  0.00 %  0.02 %  ceph-osd -i 117 --pid-file /var/run/ceph/osd.117.pid -c /etc/ceph/ceph.conf --cluster ceph
3488392  be/4  root  34.28 K/s  185.49 K/s  0.00 %  0.02 %  ceph-osd -i 109 --pid-file /var/run/ceph/osd.109.pid -c /etc/ceph/ceph.conf --cluster ceph
3488062  be/4  root   4.03 K/s   66.54 K/s  0.00 %  0.02 %  ceph-osd -i 108 --pid-file /var/run/ceph/osd.108.pid -c /etc/ceph/ceph.conf --cluster ceph

These are all 6TB Seagates in single-disk RAID 0 on a PERC H730 Mini controller. I did try removing the disk with 20k non-medium errors, but that didn’t seem to help.

Thanks for any insight!

Cheers,
Lincoln Bryant

> On Sep 9, 2015, at 1:09 PM, Lincoln Bryant wrote:
>
> Hi Jan,
>
> I’ll take a look at all of those things
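A note on the ‘dd’ check above: sequential dd mostly exercises streaming writes and can look fine even when a spindle struggles with the small random IO an OSD generates. A minimal fio sketch for probing random-write latency instead; the OSD id, data path and sizes are placeholders, and the scratch file should only be written somewhere with enough free space:

# one-minute 4k random-write test against a scratch file on the suspect OSD's filesystem
fio --name=osd-randwrite --filename=/var/lib/ceph/osd/ceph-111/fio-scratch.tmp \
    --size=4G --rw=randwrite --bs=4k --direct=1 --ioengine=libaio \
    --iodepth=32 --runtime=60 --time_based --group_reporting
# remove the scratch file afterwards
rm /var/lib/ceph/osd/ceph-111/fio-scratch.tmp

Comparing the completion-latency percentiles between one of the ~97%-util disks (sdb, sdm) and a quiet one should show whether a single spindle or the controller path is the outlier.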
Re: [ceph-users] Ceph cluster NO read / write performance :: Ops are blocked
Hi Nick, Thanks for responding. Yes, I am. —Lincoln > On Sep 17, 2015, at 11:53 AM, Nick Fisk <n...@fisk.me.uk> wrote: > > You are getting a fair amount of reads on the disks whilst doing these > writes. You're not using cache tiering are you? > >> -Original Message- >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of >> Lincoln Bryant >> Sent: 17 September 2015 17:42 >> To: ceph-users@lists.ceph.com >> Subject: Re: [ceph-users] Ceph cluster NO read / write performance :: Ops >> are blocked >> >> Hello again, >> >> Well, I disabled offloads on the NIC -- didn’t work for me. I also tried >> setting >> net.ipv4.tcp_moderate_rcvbuf = 0 as suggested elsewhere in the thread to >> no avail. >> >> Today I was watching iostat on an OSD box ('iostat -xm 5') when the cluster >> got into “slow” state: >> >> Device: rrqm/s wrqm/s r/s w/srMB/swMB/s avgrq-sz >> avgqu-sz >> await svctm %util >> sdb 0.0013.57 84.23 167.47 0.45 2.7826.26 >> 2.068.18 3.85 >> 96.93 >> sdc 0.0046.715.59 289.22 0.03 2.5417.85 >> 3.18 10.77 0.97 >> 28.72 >> sdd 0.0016.57 45.11 91.62 0.25 0.5512.01 >> 0.755.51 2.45 >> 33.47 >> sde 0.0013.576.99 143.31 0.03 2.5334.97 >> 1.99 13.27 2.12 >> 31.86 >> sdf 0.0018.764.99 158.48 0.10 1.0914.88 >> 1.267.69 1.24 >> 20.26 >> sdg 0.0025.55 81.64 237.52 0.44 2.8921.36 >> 4.14 12.99 2.58 >> 82.22 >> sdh 0.0089.42 16.17 492.42 0.09 3.8115.69 >> 17.12 33.66 0.73 >> 36.95 >> sdi 0.0020.16 17.76 189.62 0.10 1.6717.46 >> 3.45 16.63 1.57 >> 32.55 >> sdj 0.0031.540.00 185.23 0.00 1.9121.15 >> 3.33 18.00 0.03 >> 0.62 >> sdk 0.0026.152.40 133.33 0.01 0.8412.79 >> 1.077.87 0.85 >> 11.58 >> sdl 0.0025.559.38 123.95 0.05 1.1518.44 >> 0.503.74 1.58 >> 21.10 >> sdm 0.00 6.39 92.61 47.11 0.47 0.2610.65 >> 1.279.07 6.92 >> 96.73 >> >> The %util is rather high on some disks, but I’m not an expert at looking at >> iostat so I’m not sure how worrisome this is. Does anything here stand out to >> anyone? >> >> At the time of that iostat, Ceph was reporting a lot of blocked ops on the >> OSD >> associated with sde (as well as about 30 other OSDs), but it doesn’t look all >> that busy. Some simple ‘dd’ tests seem to indicate the disk is fine. 
>> >> Similarly, iotop seems OK on this host: >> >> TID PRIO USER DISK READ DISK WRITE SWAPIN IO>COMMAND >> 472477 be/4 root0.00 B/s5.59 M/s 0.00 % 0.57 % ceph-osd -i 111 >> --pid- >> file /var/run/ceph/osd.111.pid -c /etc/ceph/ceph.conf --cluster ceph >> 470621 be/4 root0.00 B/s 10.09 M/s 0.00 % 0.40 % ceph-osd -i 111 >> --pid- >> file /var/run/ceph/osd.111.pid -c /etc/ceph/ceph.conf --cluster ceph >> 3495447 be/4 root0.00 B/s 272.19 K/s 0.00 % 0.36 % ceph-osd -i >> 114 -- >> pid-file /var/run/ceph/osd.114.pid -c /etc/ceph/ceph.conf --cluster ceph >> 3488389 be/4 root 0.00 B/s 596.80 K/s 0.00 % 0.16 % ceph-osd -i 109 -- >> pid-file /var/run/ceph/osd.109.pid -c /etc/ceph/ceph.conf --cluster ceph >> 3488060 be/4 root0.00 B/s 600.83 K/s 0.00 % 0.15 % ceph-osd -i >> 108 -- >> pid-file /var/run/ceph/osd.108.pid -c /etc/ceph/ceph.conf --cluster ceph >> 3505573 be/4 root0.00 B/s 528.25 K/s 0.00 % 0.10 % ceph-osd -i >> 119 -- >> pid-file /var/run/ceph/osd.119.pid -c /etc/ceph/ceph.conf --cluster ceph >> 3495434 be/4 root0.00 B/s2.02 K/s 0.00 % 0.10 % ceph-osd -i >> 114 --pid- >> file /var/run/ceph/osd.114.pid -c /etc/ceph/ceph.conf --cluster ceph >> 3502327 be/4 root0.00 B/s 506.07 K/s 0.00 % 0.09 % ceph-osd -i >> 118 -- >> pid-file /var/run/ceph/osd.118.pid -c /etc/ceph/ceph.conf --cluster ceph >> 3489100 be/4 root0.00 B/s 106.86 K/s 0.00 % 0.09 % ceph-osd -i >> 110 -- >> pid-file /var/run/ceph/osd.110.pid -c /etc/ceph/ceph.conf --cluster ceph >> 3496631 be/4 root0.00 B/s 229.
Re: [ceph-users] Ceph cluster NO read / write performance :: Ops are blocked
Just a small update — the blocked ops did disappear after doubling the target_max_bytes. We’ll see if it sticks! I’ve thought I’ve solved this blocked ops problem about 10 times now :) Assuming this is the issue, is there any workaround for this problem (or is it working as intended)? (Should I set up a cron to run cache-try-flush-evict-all every night? :)) Another curious thing is that a rolling restart of all OSDs also seems to fix the problem — for a time. I’m not sure how that would fit in if this is the problem. —Lincoln > On Sep 17, 2015, at 12:07 PM, Lincoln Bryant <linco...@uchicago.edu> wrote: > > We have CephFS utilizing a cache tier + EC backend. The cache tier and ec > pool sit on the same spinners — no SSDs. Our cache tier has a > target_max_bytes of 5TB and the total storage is about 1PB. > > I do have a separate test pool with 3x replication and no cache tier, and I > still see significant performance drops and blocked ops with no/minimal > client I/O from CephFS. Right now I have 530 blocked ops with 20MB/s of > client write I/O and no active scrubs. The rados bench on my test pool looks > like this: > > sec Cur ops started finished avg MB/s cur MB/s last lat avg lat >0 0 0 0 0 0 - 0 >1 319463 251.934 252 0.31017 0.217719 >2 31 10372 143.96936 0.978544 0.260631 >3 31 10372 95.9815 0 - 0.260631 >4 31 11180 79.985616 2.29218 0.476458 >5 31 11281 64.7886 42.5559 0.50213 >6 31 11281 53.9905 0 - 0.50213 >7 31 11584 47.9917 6 3.71826 0.615882 >8 31 11584 41.9928 0 - 0.615882 >9 31 1158437.327 0 - 0.615882 > 10 31 11786 34.3942 2.7 6.73678 0.794532 > > I’m really leaning more toward it being a weird controller/disk problem. > > As a test, I suppose I could double the target_max_bytes, just so the cache > tier stops evicting while client I/O is writing? > > —Lincoln > >> On Sep 17, 2015, at 11:59 AM, Nick Fisk <n...@fisk.me.uk> wrote: >> >> Ah rightthis is where it gets interesting. >> >> You are probably hitting a cache full on a PG somewhere which is either >> making everything wait until it flushes or something like that. >> >> What cache settings have you got set? >> >> I assume you have SSD's for the cache tier? Can you share the size of the >> pool. >> >> If possible could you also create a non tiered test pool and do some >> benchmarks on that to rule out any issue with the hardware and OSD's. >> >>> -Original Message- >>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of >>> Lincoln Bryant >>> Sent: 17 September 2015 17:54 >>> To: Nick Fisk <n...@fisk.me.uk> >>> Cc: ceph-users@lists.ceph.com >>> Subject: Re: [ceph-users] Ceph cluster NO read / write performance :: Ops >>> are blocked >>> >>> Hi Nick, >>> >>> Thanks for responding. Yes, I am. >>> >>> —Lincoln >>> >>>> On Sep 17, 2015, at 11:53 AM, Nick Fisk <n...@fisk.me.uk> wrote: >>>> >>>> You are getting a fair amount of reads on the disks whilst doing these >>> writes. You're not using cache tiering are you? >>>> >>>>> -Original Message- >>>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf >>>>> Of Lincoln Bryant >>>>> Sent: 17 September 2015 17:42 >>>>> To: ceph-users@lists.ceph.com >>>>> Subject: Re: [ceph-users] Ceph cluster NO read / write performance :: >>>>> Ops are blocked >>>>> >>>>> Hello again, >>>>> >>>>> Well, I disabled offloads on the NIC -- didn’t work for me. I also >>>>> tried setting net.ipv4.tcp_moderate_rcvbuf = 0 as suggested elsewhere >>>>> in the thread to no avail. 
>>>>> >>>>> Today I was watching iostat on an OSD box ('iostat -xm 5') when the >>>>> cluster got into “slow” state: >>>>> >>>>> Device: rrqm/s wrqm/s r/s w/srMB/swMB/s >>>>> avgrq-sz avgqu- >>> sz >>>>> await svctm %util >>>>> sdb 0.0013.57
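For reference, a sketch of the cache-tier knobs being discussed, assuming the cache pool is named "cachepool" (substitute the real pool name); 10995116277760 bytes is 10 TiB, i.e. double the 5 TB target mentioned elsewhere in the thread:

# raise the eviction target
ceph osd pool set cachepool target_max_bytes 10995116277760
# alternative to a nightly cron: make flushing start earlier by lowering the ratios
ceph osd pool set cachepool cache_target_dirty_ratio 0.4
ceph osd pool set cachepool cache_target_full_ratio 0.8
# the manual flush/evict mentioned above
rados -p cachepool cache-try-flush-evict-all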
Re: [ceph-users] Ceph cluster NO read / write performance :: Ops are blocked
You are getting a fair amount of reads on the disks whilst doing these writes. You're not using cache tiering are you? > -Original Message- > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of > Lincoln Bryant > Sent: 17 September 2015 17:42 > To: ceph-users@lists.ceph.com > Subject: Re: [ceph-users] Ceph cluster NO read / write performance :: Ops > are blocked > > Hello again, > > Well, I disabled offloads on the NIC -- didn’t work for me. I also tried > setting > net.ipv4.tcp_moderate_rcvbuf = 0 as suggested elsewhere in the thread to > no avail. > > Today I was watching iostat on an OSD box ('iostat -xm 5') when the cluster > got into “slow” state: > > Device: rrqm/s wrqm/s r/s w/srMB/swMB/s avgrq-sz > avgqu-sz > await svctm %util > sdb 0.0013.57 84.23 167.47 0.45 2.7826.26 > 2.068.18 3.85 > 96.93 > sdc 0.0046.715.59 289.22 0.03 2.5417.85 > 3.18 10.77 0.97 > 28.72 > sdd 0.0016.57 45.11 91.62 0.25 0.5512.01 > 0.755.51 2.45 > 33.47 > sde 0.0013.576.99 143.31 0.03 2.5334.97 > 1.99 13.27 2.12 > 31.86 > sdf 0.0018.764.99 158.48 0.10 1.0914.88 > 1.267.69 1.24 > 20.26 > sdg 0.0025.55 81.64 237.52 0.44 2.8921.36 > 4.14 12.99 2.58 > 82.22 > sdh 0.0089.42 16.17 492.42 0.09 3.8115.69 > 17.12 33.66 0.73 > 36.95 > sdi 0.0020.16 17.76 189.62 0.10 1.6717.46 > 3.45 16.63 1.57 > 32.55 > sdj 0.0031.540.00 185.23 0.00 1.9121.15 > 3.33 18.00 0.03 > 0.62 > sdk 0.0026.152.40 133.33 0.01 0.8412.79 > 1.077.87 0.85 > 11.58 > sdl 0.0025.559.38 123.95 0.05 1.1518.44 > 0.503.74 1.58 > 21.10 > sdm 0.00 6.39 92.61 47.11 0.47 0.2610.65 > 1.279.07 6.92 > 96.73 > > The %util is rather high on some disks, but I’m not an expert at looking at > iostat so I’m not sure how worrisome this is. Does anything here stand out to > anyone? > > At the time of that iostat, Ceph was reporting a lot of blocked ops on the OSD > associated with sde (as well as about 30 other OSDs), but it doesn’t look all > that busy. Some simple ‘dd’ tests seem to indicate the disk is fine. 
> > Similarly, iotop seems OK on this host: > > TID PRIO USER DISK READ DISK WRITE SWAPIN IO>COMMAND > 472477 be/4 root0.00 B/s5.59 M/s 0.00 % 0.57 % ceph-osd -i 111 > --pid- > file /var/run/ceph/osd.111.pid -c /etc/ceph/ceph.conf --cluster ceph > 470621 be/4 root0.00 B/s 10.09 M/s 0.00 % 0.40 % ceph-osd -i 111 > --pid- > file /var/run/ceph/osd.111.pid -c /etc/ceph/ceph.conf --cluster ceph > 3495447 be/4 root0.00 B/s 272.19 K/s 0.00 % 0.36 % ceph-osd -i 114 > -- > pid-file /var/run/ceph/osd.114.pid -c /etc/ceph/ceph.conf --cluster ceph > 3488389 be/4 root 0.00 B/s 596.80 K/s 0.00 % 0.16 % ceph-osd -i 109 -- > pid-file /var/run/ceph/osd.109.pid -c /etc/ceph/ceph.conf --cluster ceph > 3488060 be/4 root0.00 B/s 600.83 K/s 0.00 % 0.15 % ceph-osd -i 108 > -- > pid-file /var/run/ceph/osd.108.pid -c /etc/ceph/ceph.conf --cluster ceph > 3505573 be/4 root0.00 B/s 528.25 K/s 0.00 % 0.10 % ceph-osd -i 119 > -- > pid-file /var/run/ceph/osd.119.pid -c /etc/ceph/ceph.conf --cluster ceph > 3495434 be/4 root0.00 B/s2.02 K/s 0.00 % 0.10 % ceph-osd -i 114 > --pid- > file /var/run/ceph/osd.114.pid -c /etc/ceph/ceph.conf --cluster ceph > 3502327 be/4 root0.00 B/s 506.07 K/s 0.00 % 0.09 % ceph-osd -i 118 > -- > pid-file /var/run/ceph/osd.118.pid -c /etc/ceph/ceph.conf --cluster ceph > 3489100 be/4 root0.00 B/s 106.86 K/s 0.00 % 0.09 % ceph-osd -i 110 > -- > pid-file /var/run/ceph/osd.110.pid -c /etc/ceph/ceph.conf --cluster ceph > 3496631 be/4 root0.00 B/s 229.85 K/s 0.00 % 0.05 % ceph-osd -i 115 > -- > pid-file /var/run/ceph/osd.115.pid -c /etc/ceph/ceph.conf --cluster ceph > 3505561 be/4 root 0.00 B/s2.02 K/s 0.00 % 0.03 % ceph-osd -i 119 -- > pid-file /var/run/ceph/osd.119.pid -c /etc/ceph/ceph.conf --cluster ceph > 3488059 be/4 root0.00 B/s2.02 K/s 0.00 % 0.03 % ceph-osd -i 108 > --pid- > file /var/run/ceph/osd.108.pid -c /etc/ceph/ceph.conf --cluster ceph > 3488391 be/4 root 46.37 K/s 431.47 K/s 0.00 % 0.02 % ceph-osd -i 109 &
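A quick way to answer the cache-tiering question above from the command line; a minimal sketch, the grep pattern is only a convenience:

# tiered pools show a cache_mode and tier/read_tier references in the OSD map
ceph osd dump | grep -E 'cache_mode|tier'
# or, per pool
ceph osd pool ls detail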
Re: [ceph-users] Ceph cluster NO read / write performance :: Ops are blocked
Ah rightthis is where it gets interesting. You are probably hitting a cache full on a PG somewhere which is either making everything wait until it flushes or something like that. What cache settings have you got set? I assume you have SSD's for the cache tier? Can you share the size of the pool. If possible could you also create a non tiered test pool and do some benchmarks on that to rule out any issue with the hardware and OSD's. > -Original Message- > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of > Lincoln Bryant > Sent: 17 September 2015 17:54 > To: Nick Fisk <n...@fisk.me.uk> > Cc: ceph-users@lists.ceph.com > Subject: Re: [ceph-users] Ceph cluster NO read / write performance :: Ops > are blocked > > Hi Nick, > > Thanks for responding. Yes, I am. > > —Lincoln > > > On Sep 17, 2015, at 11:53 AM, Nick Fisk <n...@fisk.me.uk> wrote: > > > > You are getting a fair amount of reads on the disks whilst doing these > writes. You're not using cache tiering are you? > > > >> -Original Message- > >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf > >> Of Lincoln Bryant > >> Sent: 17 September 2015 17:42 > >> To: ceph-users@lists.ceph.com > >> Subject: Re: [ceph-users] Ceph cluster NO read / write performance :: > >> Ops are blocked > >> > >> Hello again, > >> > >> Well, I disabled offloads on the NIC -- didn’t work for me. I also > >> tried setting net.ipv4.tcp_moderate_rcvbuf = 0 as suggested elsewhere > >> in the thread to no avail. > >> > >> Today I was watching iostat on an OSD box ('iostat -xm 5') when the > >> cluster got into “slow” state: > >> > >> Device: rrqm/s wrqm/s r/s w/srMB/swMB/s avgrq-sz > >> avgqu- > sz > >> await svctm %util > >> sdb 0.0013.57 84.23 167.47 0.45 2.7826.26 > >> 2.068.18 > 3.85 > >> 96.93 > >> sdc 0.0046.715.59 289.22 0.03 2.5417.85 > >> 3.18 10.77 > 0.97 > >> 28.72 > >> sdd 0.0016.57 45.11 91.62 0.25 0.5512.01 > >> 0.755.51 > 2.45 > >> 33.47 > >> sde 0.0013.576.99 143.31 0.03 2.5334.97 > >> 1.99 13.27 > 2.12 > >> 31.86 > >> sdf 0.0018.764.99 158.48 0.10 1.0914.88 > >> 1.267.69 1.24 > >> 20.26 > >> sdg 0.0025.55 81.64 237.52 0.44 2.8921.36 > >> 4.14 12.99 > 2.58 > >> 82.22 > >> sdh 0.0089.42 16.17 492.42 0.09 3.8115.69 > >>17.12 33.66 > 0.73 > >> 36.95 > >> sdi 0.0020.16 17.76 189.62 0.10 1.6717.46 > >> 3.45 16.63 > 1.57 > >> 32.55 > >> sdj 0.0031.540.00 185.23 0.00 1.9121.15 > >> 3.33 18.00 > 0.03 > >> 0.62 > >> sdk 0.0026.152.40 133.33 0.01 0.8412.79 > >> 1.077.87 > 0.85 > >> 11.58 > >> sdl 0.0025.559.38 123.95 0.05 1.1518.44 > >> 0.503.74 1.58 > >> 21.10 > >> sdm 0.00 6.39 92.61 47.11 0.47 0.2610.65 > >> 1.279.07 > 6.92 > >> 96.73 > >> > >> The %util is rather high on some disks, but I’m not an expert at > >> looking at iostat so I’m not sure how worrisome this is. Does > >> anything here stand out to anyone? > >> > >> At the time of that iostat, Ceph was reporting a lot of blocked ops > >> on the OSD associated with sde (as well as about 30 other OSDs), but > >> it doesn’t look all that busy. Some simple ‘dd’ tests seem to indicate the > disk is fine. 
> >> > >> Similarly, iotop seems OK on this host: > >> > >> TID PRIO USER DISK READ DISK WRITE SWAPIN IO>COMMAND > >> 472477 be/4 root0.00 B/s5.59 M/s 0.00 % 0.57 % ceph-osd -i > >> 111 -- > pid- > >> file /var/run/ceph/osd.111.pid -c /etc/ceph/ceph.conf --cluster ceph > >> 470621 be/4 root0.00 B/s 10.09 M/s 0.00 % 0.40 % ceph-osd -i > >> 111 -- > pid- > >> file /var/run/ceph/osd.111.pid -c /etc/ceph/ceph.conf --cluster ceph > >> 3495447 be/4 root0.00 B/s 272.19 K/s 0.00 %
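A sketch of how the two requests above could be satisfied, assuming "cachepool" for the cache tier and 512 PGs for a throwaway pool (both placeholders, to be sized for the actual cluster):

# current cache-tier settings
ceph osd pool get cachepool target_max_bytes
ceph osd pool get cachepool cache_target_dirty_ratio
ceph osd pool get cachepool cache_target_full_ratio
# throwaway 3x-replicated pool with no tiering, then benchmark it
ceph osd pool create bench-test 512 512 replicated
ceph osd pool set bench-test size 3
rados bench -p bench-test 60 write --no-cleanup
rados bench -p bench-test 60 seq
rados -p bench-test cleanup
ceph osd pool delete bench-test bench-test --yes-i-really-really-mean-it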
Re: [ceph-users] Ceph cluster NO read / write performance :: Ops are blocked
We have CephFS utilizing a cache tier + EC backend. The cache tier and ec pool sit on the same spinners — no SSDs. Our cache tier has a target_max_bytes of 5TB and the total storage is about 1PB. I do have a separate test pool with 3x replication and no cache tier, and I still see significant performance drops and blocked ops with no/minimal client I/O from CephFS. Right now I have 530 blocked ops with 20MB/s of client write I/O and no active scrubs. The rados bench on my test pool looks like this: sec Cur ops started finished avg MB/s cur MB/s last lat avg lat 0 0 0 0 0 0 - 0 1 319463 251.934 252 0.31017 0.217719 2 31 10372 143.96936 0.978544 0.260631 3 31 10372 95.9815 0 - 0.260631 4 31 11180 79.985616 2.29218 0.476458 5 31 11281 64.7886 42.5559 0.50213 6 31 11281 53.9905 0 - 0.50213 7 31 11584 47.9917 6 3.71826 0.615882 8 31 11584 41.9928 0 - 0.615882 9 31 1158437.327 0 - 0.615882 10 31 11786 34.3942 2.7 6.73678 0.794532 I’m really leaning more toward it being a weird controller/disk problem. As a test, I suppose I could double the target_max_bytes, just so the cache tier stops evicting while client I/O is writing? —Lincoln > On Sep 17, 2015, at 11:59 AM, Nick Fisk <n...@fisk.me.uk> wrote: > > Ah rightthis is where it gets interesting. > > You are probably hitting a cache full on a PG somewhere which is either > making everything wait until it flushes or something like that. > > What cache settings have you got set? > > I assume you have SSD's for the cache tier? Can you share the size of the > pool. > > If possible could you also create a non tiered test pool and do some > benchmarks on that to rule out any issue with the hardware and OSD's. > >> -Original Message- >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of >> Lincoln Bryant >> Sent: 17 September 2015 17:54 >> To: Nick Fisk <n...@fisk.me.uk> >> Cc: ceph-users@lists.ceph.com >> Subject: Re: [ceph-users] Ceph cluster NO read / write performance :: Ops >> are blocked >> >> Hi Nick, >> >> Thanks for responding. Yes, I am. >> >> —Lincoln >> >>> On Sep 17, 2015, at 11:53 AM, Nick Fisk <n...@fisk.me.uk> wrote: >>> >>> You are getting a fair amount of reads on the disks whilst doing these >> writes. You're not using cache tiering are you? >>> >>>> -Original Message----- >>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf >>>> Of Lincoln Bryant >>>> Sent: 17 September 2015 17:42 >>>> To: ceph-users@lists.ceph.com >>>> Subject: Re: [ceph-users] Ceph cluster NO read / write performance :: >>>> Ops are blocked >>>> >>>> Hello again, >>>> >>>> Well, I disabled offloads on the NIC -- didn’t work for me. I also >>>> tried setting net.ipv4.tcp_moderate_rcvbuf = 0 as suggested elsewhere >>>> in the thread to no avail. >>>> >>>> Today I was watching iostat on an OSD box ('iostat -xm 5') when the >>>> cluster got into “slow” state: >>>> >>>> Device: rrqm/s wrqm/s r/s w/srMB/swMB/s avgrq-sz >>>> avgqu- >> sz >>>> await svctm %util >>>> sdb 0.0013.57 84.23 167.47 0.45 2.7826.26 >>>> 2.068.18 >> 3.85 >>>> 96.93 >>>> sdc 0.0046.715.59 289.22 0.03 2.5417.85 >>>> 3.18 10.77 >> 0.97 >>>> 28.72 >>>> sdd 0.0016.57 45.11 91.62 0.25 0.5512.01 >>>> 0.755.51 >> 2.45 >>>> 33.47 >>>> sde 0.0013.576.99 143.31 0.03 2.5334.97 >>>> 1.99 13.27 >> 2.12 >>>> 31.86 >>>> sdf 0.0018.764.99 158.48 0.10 1.0914.88 >>>> 1.267.69 1.24 >>>> 20.26 >>>> sdg 0.0025.55 81.64 237.52 0.44 2.8921.36 >>>> 4.14 12.99 >> 2.58 >>>> 82.22 >>>> sdh 0.0089.42 16.17 492.42 0.09
Re: [ceph-users] Ceph cluster NO read / write performance :: Ops are blocked
Hi Nick, Thanks for the detailed response and insight. SSDs are indeed definitely on the to-buy list. I will certainly try to rule out any hardware issues in the meantime. Cheers, Lincoln > On Sep 17, 2015, at 12:53 PM, Nick Fisk <n...@fisk.me.uk> wrote: > > It's probably helped but I fear that your overall design is not going to work > well for you. Cache Tier + Base tier + journals on the same disks is going to > really hurt. > > The problem when using cache tiering (especially with EC pools in future > releases) is that to modify a block that isn't in the cache tier you have to > promote it 1st, which often kicks another block out the cache. > > So worse case you could have for a single write > > R from EC -> W to CT + jrnl W -> W actual data to CT + jrnl W -> R from CT -> > W to EC + jrnl W > > Plus any metadata updates. Either way you looking at probably somewhere near > a 10x write amplification for 4MB writes, which will quickly overload your > disks leading to very slow performance. Smaller IO's would still cause 4MB > blocks to be shifted between pools. What makes it worse is that these > promotions/evictions tend to happen to hot PG's and not spread round the > whole cluster meaning that a single hot OSD can hold up writes across the > whole pool. > > I know it's not what you want to hear, but I can't think of anything you can > do to help in this instance unless you are willing to get some SSD journals > and maybe move the Cache pool on to separate disks or SSD's. Basically try > and limit the amount of random IO the disks have to do. > > Of course please do try and find a time to stop all IO and then run the test > on the test 3 way pool, to rule out any hardware/OS issues. > > >> -Original Message- >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of >> Lincoln Bryant >> Sent: 17 September 2015 18:36 >> To: Nick Fisk <n...@fisk.me.uk> >> Cc: ceph-users@lists.ceph.com >> Subject: Re: [ceph-users] Ceph cluster NO read / write performance :: Ops >> are blocked >> >> Just a small update — the blocked ops did disappear after doubling the >> target_max_bytes. We’ll see if it sticks! I’ve thought I’ve solved this >> blocked >> ops problem about 10 times now :) >> >> Assuming this is the issue, is there any workaround for this problem (or is >> it >> working as intended)? (Should I set up a cron to run >> cache-try-flush-evict-all >> every night? :)) >> >> Another curious thing is that a rolling restart of all OSDs also seems to >> fix the >> problem — for a time. I’m not sure how that would fit in if this is the >> problem. >> >> —Lincoln >> >>> On Sep 17, 2015, at 12:07 PM, Lincoln Bryant <linco...@uchicago.edu> >> wrote: >>> >>> We have CephFS utilizing a cache tier + EC backend. The cache tier and ec >> pool sit on the same spinners — no SSDs. Our cache tier has a >> target_max_bytes of 5TB and the total storage is about 1PB. >>> >>> I do have a separate test pool with 3x replication and no cache tier, and I >> still see significant performance drops and blocked ops with no/minimal >> client I/O from CephFS. Right now I have 530 blocked ops with 20MB/s of >> client write I/O and no active scrubs. 
The rados bench on my test pool looks >> like this: >>> >>> sec Cur ops started finished avg MB/s cur MB/s last lat avg lat >>> 0 0 0 0 0 0 - 0 >>> 1 319463 251.934 252 0.31017 0.217719 >>> 2 31 10372 143.96936 0.978544 0.260631 >>> 3 31 10372 95.9815 0 - 0.260631 >>> 4 31 11180 79.985616 2.29218 0.476458 >>> 5 31 11281 64.7886 42.5559 0.50213 >>> 6 31 11281 53.9905 0 - 0.50213 >>> 7 31 11584 47.9917 6 3.71826 0.615882 >>> 8 31 11584 41.9928 0 - 0.615882 >>> 9 31 1158437.327 0 - 0.615882 >>> 10 31 11786 34.3942 2.7 6.73678 0.794532 >>> >>> I’m really leaning more toward it being a weird controller/disk problem. >>> >>> As a test, I suppose I could double the target_max_bytes, just so the cache >> tier stops evicting while client I/O is writing? >>> >>> —Linco
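To help separate a controller/disk problem from a tiering problem, a few read-only checks that point at individual slow OSDs; osd.111 is just an example id:

# per-OSD commit and apply latency; one consistently high OSD suggests a bad disk or controller path
ceph osd perf
# which OSDs currently have blocked requests
ceph health detail | grep -i blocked
# on the OSD host: the slowest recent ops and which step they spent their time in
ceph daemon osd.111 dump_historic_ops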
Re: [ceph-users] Ceph cluster NO read / write performance :: Ops are blocked
It's probably helped but I fear that your overall design is not going to work well for you. Cache Tier + Base tier + journals on the same disks is going to really hurt. The problem when using cache tiering (especially with EC pools in future releases) is that to modify a block that isn't in the cache tier you have to promote it 1st, which often kicks another block out the cache. So worse case you could have for a single write R from EC -> W to CT + jrnl W -> W actual data to CT + jrnl W -> R from CT -> W to EC + jrnl W Plus any metadata updates. Either way you looking at probably somewhere near a 10x write amplification for 4MB writes, which will quickly overload your disks leading to very slow performance. Smaller IO's would still cause 4MB blocks to be shifted between pools. What makes it worse is that these promotions/evictions tend to happen to hot PG's and not spread round the whole cluster meaning that a single hot OSD can hold up writes across the whole pool. I know it's not what you want to hear, but I can't think of anything you can do to help in this instance unless you are willing to get some SSD journals and maybe move the Cache pool on to separate disks or SSD's. Basically try and limit the amount of random IO the disks have to do. Of course please do try and find a time to stop all IO and then run the test on the test 3 way pool, to rule out any hardware/OS issues. > -Original Message- > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of > Lincoln Bryant > Sent: 17 September 2015 18:36 > To: Nick Fisk <n...@fisk.me.uk> > Cc: ceph-users@lists.ceph.com > Subject: Re: [ceph-users] Ceph cluster NO read / write performance :: Ops > are blocked > > Just a small update — the blocked ops did disappear after doubling the > target_max_bytes. We’ll see if it sticks! I’ve thought I’ve solved this > blocked > ops problem about 10 times now :) > > Assuming this is the issue, is there any workaround for this problem (or is it > working as intended)? (Should I set up a cron to run cache-try-flush-evict-all > every night? :)) > > Another curious thing is that a rolling restart of all OSDs also seems to fix > the > problem — for a time. I’m not sure how that would fit in if this is the > problem. > > —Lincoln > > > On Sep 17, 2015, at 12:07 PM, Lincoln Bryant <linco...@uchicago.edu> > wrote: > > > > We have CephFS utilizing a cache tier + EC backend. The cache tier and ec > pool sit on the same spinners — no SSDs. Our cache tier has a > target_max_bytes of 5TB and the total storage is about 1PB. > > > > I do have a separate test pool with 3x replication and no cache tier, and I > still see significant performance drops and blocked ops with no/minimal > client I/O from CephFS. Right now I have 530 blocked ops with 20MB/s of > client write I/O and no active scrubs. The rados bench on my test pool looks > like this: > > > > sec Cur ops started finished avg MB/s cur MB/s last lat avg lat > >0 0 0 0 0 0 - 0 > >1 319463 251.934 252 0.31017 0.217719 > >2 31 10372 143.96936 0.978544 0.260631 > >3 31 10372 95.9815 0 - 0.260631 > >4 31 11180 79.985616 2.29218 0.476458 > >5 31 11281 64.7886 42.5559 0.50213 > >6 31 11281 53.9905 0 - 0.50213 > >7 31 11584 47.9917 6 3.71826 0.615882 > >8 31 11584 41.9928 0 - 0.615882 > >9 31 1158437.327 0 - 0.615882 > > 10 31 11786 34.3942 2.7 6.73678 0.794532 > > > > I’m really leaning more toward it being a weird controller/disk problem. 
> > > > As a test, I suppose I could double the target_max_bytes, just so the cache > tier stops evicting while client I/O is writing? > > > > —Lincoln > > > >> On Sep 17, 2015, at 11:59 AM, Nick Fisk <n...@fisk.me.uk> wrote: > >> > >> Ah rightthis is where it gets interesting. > >> > >> You are probably hitting a cache full on a PG somewhere which is either > making everything wait until it flushes or something like that. > >> > >> What cache settings have you got set? > >> > >> I assume you have SSD's for the cache tier? Can you share the size of the > pool. > >> > >> If possible could you also create a non tiered test pool and do some > benchmarks on that to rule o
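Spelling out the promotion chain above for a single 4 MB client write that misses the cache (a rough tally that ignores metadata and assumes the cache tier is replicated on the same spindles):

promotion: read 4 MB from the EC pool, write 4 MB to the cache tier plus 4 MB to its journal
client write: another 4 MB to the cache tier plus 4 MB to its journal
later flush/evict: read 4 MB back from the cache tier, write ~4 MB of k+m chunks to the EC pool plus its journal

That is already roughly 24 MB written and 8 MB read on the same disks per 4 MB of client IO, before cache-tier replication multiplies the cache-tier writes again — which is where a ~10x (or worse) amplification figure comes from.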
Re: [ceph-users] Ceph cluster NO read / write performance :: Ops are blocked
If you really want to improve performance of *distributed* filesystem like Ceph, Lustre, GPFS, you must consider from networking of the linux kernel. L5: Socket L4: TCP L3: IP L2: Queuing In this discussion, problem could be in L2 which is queuing in descriptor. We may have to take a closer look at qdisc, if qlen is good enough or not. But this case: > 399 16 32445 32429 325.054 84 0.0233839 0.193655 to > 400 16 32445 32429 324.241 0 - 0.193655 probably different story -; > needless to say, very strange. Yes, it is quite strange like my English... Shinobu - Original Message - From: "Vickey Singh" <vickey.singh22...@gmail.com> To: "Jan Schermer" <j...@schermer.cz> Cc: ceph-users@lists.ceph.com Sent: Thursday, September 10, 2015 2:22:22 AM Subject: Re: [ceph-users] Ceph cluster NO read / write performance :: Ops are blocked Hello Jan On Wed, Sep 9, 2015 at 11:59 AM, Jan Schermer < j...@schermer.cz > wrote: Just to recapitulate - the nodes are doing "nothing" when it drops to zero? Not flushing something to drives (iostat)? Not cleaning pagecache (kswapd and similiar)? Not out of any type of memory (slab, min_free_kbytes)? Not network link errors, no bad checksums (those are hard to spot, though)? Unless you find something I suggest you try disabling offloads on the NICs and see if the problem goes away. Could you please elaborate this point , how do you disable / offload on the NIC ? what does it mean ? how to do it ? how its gonna help. Sorry i don't know about this. - Vickey - Jan > On 08 Sep 2015, at 18:26, Lincoln Bryant < linco...@uchicago.edu > wrote: > > For whatever it’s worth, my problem has returned and is very similar to > yours. Still trying to figure out what’s going on over here. > > Performance is nice for a few seconds, then goes to 0. This is a similar > setup to yours (12 OSDs per box, Scientific Linux 6, Ceph 0.94.3, etc) > > 384 16 29520 29504 307.287 1188 0.0492006 0.208259 > 385 16 29813 29797 309.532 1172 0.0469708 0.206731 > 386 16 30105 30089 311.756 1168 0.0375764 0.205189 > 387 16 30401 30385 314.009 1184 0.036142 0.203791 > 388 16 30695 30679 316.231 1176 0.0372316 0.202355 > 389 16 30987 30971 318.42 1168 0.0660476 0.200962 > 390 16 31282 31266 320.628 1180 0.0358611 0.199548 > 391 16 31568 31552 322.734 1144 0.0405166 0.198132 > 392 16 31857 31841 324.859 1156 0.0360826 0.196679 > 393 16 32090 32074 326.404 932 0.0416869 0.19549 > 394 16 32205 32189 326.743 460 0.0251877 0.194896 > 395 16 32302 32286 326.897 388 0.0280574 0.194395 > 396 16 32348 32332 326.537 184 0.0256821 0.194157 > 397 16 32385 32369 326.087 148 0.0254342 0.193965 > 398 16 32424 32408 325.659 156 0.0263006 0.193763 > 399 16 32445 32429 325.054 84 0.0233839 0.193655 > 2015-09-08 11:22:31.940164 min lat: 0.0165045 max lat: 67.6184 avg lat: > 0.193655 > sec Cur ops started finished avg MB/s cur MB/s last lat avg lat > 400 16 32445 32429 324.241 0 - 0.193655 > 401 16 32445 32429 323.433 0 - 0.193655 > 402 16 32445 32429 322.628 0 - 0.193655 > 403 16 32445 32429 321.828 0 - 0.193655 > 404 16 32445 32429 321.031 0 - 0.193655 > 405 16 32445 32429 320.238 0 - 0.193655 > 406 16 32445 32429 319.45 0 - 0.193655 > 407 16 32445 32429 318.665 0 - 0.193655 > > needless to say, very strange. > > —Lincoln > > >> On Sep 7, 2015, at 3:35 PM, Vickey Singh < vickey.singh22...@gmail.com > >> wrote: >> >> Adding ceph-users. 
>> >> On Mon, Sep 7, 2015 at 11:31 PM, Vickey Singh < vickey.singh22...@gmail.com >> > wrote: >> >> >> On Mon, Sep 7, 2015 at 10:04 PM, Udo Lembke < ulem...@polarzone.de > wrote: >> Hi Vickey, >> Thanks for your time in replying to my problem. >> >> I had the same rados bench output after changing the motherboard of the >> monitor node with the lowest IP... >> Due to the new mainboard, I assume the hw-clock was wrong during startup. >> Ceph health show no errors, but all VMs aren't able to do IO (very high load >> on the VMs - but no traffic). >> I stopped the mon, but this don't changed anything. I had to restart all >> other mons to get IO again. After that I started the first mon also (with >> the right time now) and all worked fine again... >> >> Thanks i will try to restart all OSD / MONS and report back , if it solves >> my problem >> >> Another posibility: >> Do you use journal on SSDs? Perhaps the SSDs can't write to garbage >> collection? >> >> No i don't have journals on SSD , they are on the same OSD disk. >> >> >> >> Udo >> >> >> On 07.09.2015 16:36,
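A minimal sketch of the qdisc/qlen check being suggested, assuming eth0 is the cluster-facing interface:

# queueing discipline statistics; look for non-zero "dropped" or "overlimits"
tc -s qdisc show dev eth0
# interface-level RX/TX errors and drops, plus the current qlen
ip -s link show eth0
# if the transmit queue looks too short it can be raised, e.g.
ip link set dev eth0 txqueuelen 10000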
Re: [ceph-users] Ceph cluster NO read / write performance :: Ops are blocked
Dropwatch.stp would help us see who dropped packets, here packets were dropped at. To do further investigation regarding to networking, I always check: /sys/class/net//statistics/* tc command also is quite useful. Have we already check if there is any bo or not using vmstat? Using vmstat and tcpdump, tc would give you more concu- rrency information to solve the problem. Shinobu - Original Message - From: "Shinobu Kinjo" <ski...@redhat.com> To: "Vickey Singh" <vickey.singh22...@gmail.com> Cc: ceph-users@lists.ceph.com Sent: Friday, September 11, 2015 10:32:27 PM Subject: Re: [ceph-users] Ceph cluster NO read / write performance :: Ops are blocked If you really want to improve performance of *distributed* filesystem like Ceph, Lustre, GPFS, you must consider from networking of the linux kernel. L5: Socket L4: TCP L3: IP L2: Queuing In this discussion, problem could be in L2 which is queuing in descriptor. We may have to take a closer look at qdisc, if qlen is good enough or not. But this case: > 399 16 32445 32429 325.054 84 0.0233839 0.193655 to > 400 16 32445 32429 324.241 0 - 0.193655 probably different story -; > needless to say, very strange. Yes, it is quite strange like my English... Shinobu - Original Message - From: "Vickey Singh" <vickey.singh22...@gmail.com> To: "Jan Schermer" <j...@schermer.cz> Cc: ceph-users@lists.ceph.com Sent: Thursday, September 10, 2015 2:22:22 AM Subject: Re: [ceph-users] Ceph cluster NO read / write performance :: Ops are blocked Hello Jan On Wed, Sep 9, 2015 at 11:59 AM, Jan Schermer < j...@schermer.cz > wrote: Just to recapitulate - the nodes are doing "nothing" when it drops to zero? Not flushing something to drives (iostat)? Not cleaning pagecache (kswapd and similiar)? Not out of any type of memory (slab, min_free_kbytes)? Not network link errors, no bad checksums (those are hard to spot, though)? Unless you find something I suggest you try disabling offloads on the NICs and see if the problem goes away. Could you please elaborate this point , how do you disable / offload on the NIC ? what does it mean ? how to do it ? how its gonna help. Sorry i don't know about this. - Vickey - Jan > On 08 Sep 2015, at 18:26, Lincoln Bryant < linco...@uchicago.edu > wrote: > > For whatever it’s worth, my problem has returned and is very similar to > yours. Still trying to figure out what’s going on over here. > > Performance is nice for a few seconds, then goes to 0. 
This is a similar > setup to yours (12 OSDs per box, Scientific Linux 6, Ceph 0.94.3, etc) > > 384 16 29520 29504 307.287 1188 0.0492006 0.208259 > 385 16 29813 29797 309.532 1172 0.0469708 0.206731 > 386 16 30105 30089 311.756 1168 0.0375764 0.205189 > 387 16 30401 30385 314.009 1184 0.036142 0.203791 > 388 16 30695 30679 316.231 1176 0.0372316 0.202355 > 389 16 30987 30971 318.42 1168 0.0660476 0.200962 > 390 16 31282 31266 320.628 1180 0.0358611 0.199548 > 391 16 31568 31552 322.734 1144 0.0405166 0.198132 > 392 16 31857 31841 324.859 1156 0.0360826 0.196679 > 393 16 32090 32074 326.404 932 0.0416869 0.19549 > 394 16 32205 32189 326.743 460 0.0251877 0.194896 > 395 16 32302 32286 326.897 388 0.0280574 0.194395 > 396 16 32348 32332 326.537 184 0.0256821 0.194157 > 397 16 32385 32369 326.087 148 0.0254342 0.193965 > 398 16 32424 32408 325.659 156 0.0263006 0.193763 > 399 16 32445 32429 325.054 84 0.0233839 0.193655 > 2015-09-08 11:22:31.940164 min lat: 0.0165045 max lat: 67.6184 avg lat: > 0.193655 > sec Cur ops started finished avg MB/s cur MB/s last lat avg lat > 400 16 32445 32429 324.241 0 - 0.193655 > 401 16 32445 32429 323.433 0 - 0.193655 > 402 16 32445 32429 322.628 0 - 0.193655 > 403 16 32445 32429 321.828 0 - 0.193655 > 404 16 32445 32429 321.031 0 - 0.193655 > 405 16 32445 32429 320.238 0 - 0.193655 > 406 16 32445 32429 319.45 0 - 0.193655 > 407 16 32445 32429 318.665 0 - 0.193655 > > needless to say, very strange. > > —Lincoln > > >> On Sep 7, 2015, at 3:35 PM, Vickey Singh < vickey.singh22...@gmail.com > >> wrote: >> >> Adding ceph-users. >> >> On Mon, Sep 7, 2015 at 11:31 PM, Vickey Singh < vickey.singh22...@gmail.com >> > wrote: >> >> >> On Mon, Sep 7, 2015 at 10:04 PM, Udo Lembke < ulem...@polarzone.de > wrote: >> Hi Vickey, >> Thanks for your time in replying to my problem. >> >> I had the same rados bench output after changing the motherboard of the >> monitor node with the lowest IP... >> Due to the new mainboard, I assume the hw-clock was wrong during startup. >> Ceph health show no errors, but all VMs aren't able to do IO (v
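A sketch of the checks mentioned above, with eth0 as a placeholder interface; dropwatch is a separate package and may not be installed:

# per-interface drop and error counters
grep . /sys/class/net/eth0/statistics/*drop* /sys/class/net/eth0/statistics/*err*
# watch blocks out (bo) and IO wait (wa) while the stall happens
vmstat 5
# interactive tool that reports where in the kernel packets are being dropped
dropwatch -l kas      # then type "start" at its prompt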
Re: [ceph-users] Ceph cluster NO read / write performance :: Ops are blocked
Just to recapitulate - the nodes are doing "nothing" when it drops to zero? Not flushing something to drives (iostat)? Not cleaning pagecache (kswapd and similiar)? Not out of any type of memory (slab, min_free_kbytes)? Not network link errors, no bad checksums (those are hard to spot, though)? Unless you find something I suggest you try disabling offloads on the NICs and see if the problem goes away. Jan > On 08 Sep 2015, at 18:26, Lincoln Bryantwrote: > > For whatever it’s worth, my problem has returned and is very similar to > yours. Still trying to figure out what’s going on over here. > > Performance is nice for a few seconds, then goes to 0. This is a similar > setup to yours (12 OSDs per box, Scientific Linux 6, Ceph 0.94.3, etc) > > 384 16 29520 29504 307.287 1188 0.0492006 0.208259 > 385 16 29813 29797 309.532 1172 0.0469708 0.206731 > 386 16 30105 30089 311.756 1168 0.0375764 0.205189 > 387 16 30401 30385 314.009 1184 0.036142 0.203791 > 388 16 30695 30679 316.231 1176 0.0372316 0.202355 > 389 16 30987 30971318.42 1168 0.0660476 0.200962 > 390 16 31282 31266 320.628 1180 0.0358611 0.199548 > 391 16 31568 31552 322.734 1144 0.0405166 0.198132 > 392 16 31857 31841 324.859 1156 0.0360826 0.196679 > 393 16 32090 32074 326.404 932 0.0416869 0.19549 > 394 16 32205 32189 326.743 460 0.0251877 0.194896 > 395 16 32302 32286 326.897 388 0.0280574 0.194395 > 396 16 32348 32332 326.537 184 0.0256821 0.194157 > 397 16 32385 32369 326.087 148 0.0254342 0.193965 > 398 16 32424 32408 325.659 156 0.0263006 0.193763 > 399 16 32445 32429 325.05484 0.0233839 0.193655 > 2015-09-08 11:22:31.940164 min lat: 0.0165045 max lat: 67.6184 avg lat: > 0.193655 > sec Cur ops started finished avg MB/s cur MB/s last lat avg lat > 400 16 32445 32429 324.241 0 - 0.193655 > 401 16 32445 32429 323.433 0 - 0.193655 > 402 16 32445 32429 322.628 0 - 0.193655 > 403 16 32445 32429 321.828 0 - 0.193655 > 404 16 32445 32429 321.031 0 - 0.193655 > 405 16 32445 32429 320.238 0 - 0.193655 > 406 16 32445 32429319.45 0 - 0.193655 > 407 16 32445 32429 318.665 0 - 0.193655 > > needless to say, very strange. > > —Lincoln > > >> On Sep 7, 2015, at 3:35 PM, Vickey Singh wrote: >> >> Adding ceph-users. >> >> On Mon, Sep 7, 2015 at 11:31 PM, Vickey Singh >> wrote: >> >> >> On Mon, Sep 7, 2015 at 10:04 PM, Udo Lembke wrote: >> Hi Vickey, >> Thanks for your time in replying to my problem. >> >> I had the same rados bench output after changing the motherboard of the >> monitor node with the lowest IP... >> Due to the new mainboard, I assume the hw-clock was wrong during startup. >> Ceph health show no errors, but all VMs aren't able to do IO (very high load >> on the VMs - but no traffic). >> I stopped the mon, but this don't changed anything. I had to restart all >> other mons to get IO again. After that I started the first mon also (with >> the right time now) and all worked fine again... >> >> Thanks i will try to restart all OSD / MONS and report back , if it solves >> my problem >> >> Another posibility: >> Do you use journal on SSDs? Perhaps the SSDs can't write to garbage >> collection? >> >> No i don't have journals on SSD , they are on the same OSD disk. >> >> >> >> Udo >> >> >> On 07.09.2015 16:36, Vickey Singh wrote: >>> Dear Experts >>> >>> Can someone please help me , why my cluster is not able write data. >>> >>> See the below output cur MB/S is 0 and Avg MB/s is decreasing. 
>>> >>> >>> Ceph Hammer 0.94.2 >>> CentOS 6 (3.10.69-1) >>> >>> The Ceph status says OPS are blocked , i have tried checking , what all i >>> know >>> >>> - System resources ( CPU , net, disk , memory )-- All normal >>> - 10G network for public and cluster network -- no saturation >>> - Add disks are physically healthy >>> - No messages in /var/log/messages OR dmesg >>> - Tried restarting OSD which are blocking operation , but no luck >>> - Tried writing through RBD and Rados bench , both are giving same problemm >>> >>> Please help me to fix this problem. >>> >>> # rados bench -p rbd 60 write >>> Maintaining 16 concurrent writes of 4194304 bytes for up to 60 seconds or 0 >>> objects >>> Object prefix: benchmark_data_stor1_1791844 >>> sec Cur ops started finished
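For completeness, one way to run down the checklist above while the benchmark is stalled (interface name is an assumption):

# dirty pagecache being flushed?
grep -iE 'dirty|writeback' /proc/meminfo
# memory reserves and slab pressure
cat /proc/sys/vm/min_free_kbytes
slabtop -o | head -15
# NIC errors/drops, and TCP symptoms of buffer trouble (checksum errors, retransmits, pruned/collapsed receive queues)
ethtool -S eth0 | grep -iE 'err|drop'
netstat -s | grep -iE 'checksum|retrans|prune|collapse'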
Re: [ceph-users] Ceph cluster NO read / write performance :: Ops are blocked
Hey Lincoln On Tue, Sep 8, 2015 at 7:26 PM, Lincoln Bryantwrote: > For whatever it’s worth, my problem has returned and is very similar to > yours. Still trying to figure out what’s going on over here. > > Performance is nice for a few seconds, then goes to 0. This is a similar > setup to yours (12 OSDs per box, Scientific Linux 6, Ceph 0.94.3, etc) > > 384 16 29520 29504 307.287 1188 0.0492006 0.208259 > 385 16 29813 29797 309.532 1172 0.0469708 0.206731 > 386 16 30105 30089 311.756 1168 0.0375764 0.205189 > 387 16 30401 30385 314.009 1184 0.036142 0.203791 > 388 16 30695 30679 316.231 1176 0.0372316 0.202355 > 389 16 30987 30971318.42 1168 0.0660476 0.200962 > 390 16 31282 31266 320.628 1180 0.0358611 0.199548 > 391 16 31568 31552 322.734 1144 0.0405166 0.198132 > 392 16 31857 31841 324.859 1156 0.0360826 0.196679 > 393 16 32090 32074 326.404 932 0.0416869 0.19549 > 394 16 32205 32189 326.743 460 0.0251877 0.194896 > 395 16 32302 32286 326.897 388 0.0280574 0.194395 > 396 16 32348 32332 326.537 184 0.0256821 0.194157 > 397 16 32385 32369 326.087 148 0.0254342 0.193965 > 398 16 32424 32408 325.659 156 0.0263006 0.193763 > 399 16 32445 32429 325.05484 0.0233839 0.193655 > 2015-09-08 11:22:31.940164 min lat: 0.0165045 max lat: 67.6184 avg lat: > 0.193655 > sec Cur ops started finished avg MB/s cur MB/s last lat avg lat > 400 16 32445 32429 324.241 0 - 0.193655 > 401 16 32445 32429 323.433 0 - 0.193655 > 402 16 32445 32429 322.628 0 - 0.193655 > 403 16 32445 32429 321.828 0 - 0.193655 > 404 16 32445 32429 321.031 0 - 0.193655 > 405 16 32445 32429 320.238 0 - 0.193655 > 406 16 32445 32429319.45 0 - 0.193655 > 407 16 32445 32429 318.665 0 - 0.193655 > > needless to say, very strange. > Its indeed very strange ( The solution that you gave me in the below email ) Have you tried restarting all OSD's ? By the way my problem got fixed ( but i am afraid , it can come back any time ) by doing # service ceph restart osd on all OSD nodes ( this didn't helped ) # set noout,nodown,nobackfill,norecover and then reboot all OSD nodes ( It worked ) After they all the rados bench write started to work. [ i know its hilarious , feels like i am watching *The IT Crowd* ' Hello IT , Have you tried turning it OFF and ON again ' ] It would be really helpful if someone provides a real solution. > > —Lincoln > > > > On Sep 7, 2015, at 3:35 PM, Vickey Singh > wrote: > > > > Adding ceph-users. > > > > On Mon, Sep 7, 2015 at 11:31 PM, Vickey Singh < > vickey.singh22...@gmail.com> wrote: > > > > > > On Mon, Sep 7, 2015 at 10:04 PM, Udo Lembke > wrote: > > Hi Vickey, > > Thanks for your time in replying to my problem. > > > > I had the same rados bench output after changing the motherboard of the > monitor node with the lowest IP... > > Due to the new mainboard, I assume the hw-clock was wrong during > startup. Ceph health show no errors, but all VMs aren't able to do IO (very > high load on the VMs - but no traffic). > > I stopped the mon, but this don't changed anything. I had to restart all > other mons to get IO again. After that I started the first mon also (with > the right time now) and all worked fine again... > > > > Thanks i will try to restart all OSD / MONS and report back , if it > solves my problem > > > > Another posibility: > > Do you use journal on SSDs? Perhaps the SSDs can't write to garbage > collection? > > > > No i don't have journals on SSD , they are on the same OSD disk. 
> > > > > > > > Udo > > > > > > On 07.09.2015 16:36, Vickey Singh wrote: > >> Dear Experts > >> > >> Can someone please help me , why my cluster is not able write data. > >> > >> See the below output cur MB/S is 0 and Avg MB/s is decreasing. > >> > >> > >> Ceph Hammer 0.94.2 > >> CentOS 6 (3.10.69-1) > >> > >> The Ceph status says OPS are blocked , i have tried checking , what all > i know > >> > >> - System resources ( CPU , net, disk , memory )-- All normal > >> - 10G network for public and cluster network -- no saturation > >> - Add disks are physically healthy > >> - No messages in /var/log/messages OR dmesg > >> - Tried restarting OSD which are blocking operation , but no luck > >> - Tried writing through RBD and Rados bench , both are giving same > problemm > >> > >> Please help me to
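For anyone repeating the “turn it off and on again” workaround described above, the flag dance looks roughly like this (restart or reboot the OSD nodes one at a time between set and unset):

# stop the cluster from reacting while nodes bounce
ceph osd set noout
ceph osd set nodown
ceph osd set nobackfill
ceph osd set norecover
# ... restart/reboot each OSD node in turn ...
# then remove the flags again
ceph osd unset norecover
ceph osd unset nobackfill
ceph osd unset nodown
ceph osd unset noout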
Re: [ceph-users] Ceph cluster NO read / write performance :: Ops are blocked
We were experiencing something similar in our setup (rados bench does some work, then comes to a screeching halt). No pattern to which OSD's were causing the problem, though. Sounds like similar hardware (This was on Dell R720xd, and yeah, that controller is suuuper frustrating). For us, setting tcp_moderate_rcvbuf to 0 on all nodes solved the issue. echo 0 > /proc/sys/net/ipv4/tcp_moderate_rcvbuf Or set it in /etc/sysctl.conf: net.ipv4.tcp_moderate_rcvbuf = 0 We figured this out independently after I posted this thread, "Slow/Hung IOs": http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-January/045674.html Hope this helps Bill Sanders On Wed, Sep 9, 2015 at 11:09 AM, Lincoln Bryantwrote: > Hi Jan, > > I’ll take a look at all of those things and report back (hopefully :)) > > I did try setting all of my OSDs to writethrough instead of writeback on > the controller, which was significantly more consistent in performance > (from 1100MB/s down to 300MB/s, but still occasionally dropping to 0MB/s). > Still plenty of blocked ops. > > I was wondering if not-so-nicely failing OSD(s) might be the cause. My > controller (PERC H730 Mini) seems frustratingly terse with SMART > information, but at least one disk has a “Non-medium error count” of over > 20,000.. > > I’ll try disabling offloads as well. > > Thanks much for the suggestions! > > Cheers, > Lincoln > > > On Sep 9, 2015, at 3:59 AM, Jan Schermer wrote: > > > > Just to recapitulate - the nodes are doing "nothing" when it drops to > zero? Not flushing something to drives (iostat)? Not cleaning pagecache > (kswapd and similiar)? Not out of any type of memory (slab, > min_free_kbytes)? Not network link errors, no bad checksums (those are hard > to spot, though)? > > > > Unless you find something I suggest you try disabling offloads on the > NICs and see if the problem goes away. > > > > Jan > > > >> On 08 Sep 2015, at 18:26, Lincoln Bryant wrote: > >> > >> For whatever it’s worth, my problem has returned and is very similar to > yours. Still trying to figure out what’s going on over here. > >> > >> Performance is nice for a few seconds, then goes to 0. 
This is a > similar setup to yours (12 OSDs per box, Scientific Linux 6, Ceph 0.94.3, > etc) > >> > >> 384 16 29520 29504 307.287 1188 0.0492006 0.208259 > >> 385 16 29813 29797 309.532 1172 0.0469708 0.206731 > >> 386 16 30105 30089 311.756 1168 0.0375764 0.205189 > >> 387 16 30401 30385 314.009 1184 0.036142 0.203791 > >> 388 16 30695 30679 316.231 1176 0.0372316 0.202355 > >> 389 16 30987 30971318.42 1168 0.0660476 0.200962 > >> 390 16 31282 31266 320.628 1180 0.0358611 0.199548 > >> 391 16 31568 31552 322.734 1144 0.0405166 0.198132 > >> 392 16 31857 31841 324.859 1156 0.0360826 0.196679 > >> 393 16 32090 32074 326.404 932 0.0416869 0.19549 > >> 394 16 32205 32189 326.743 460 0.0251877 0.194896 > >> 395 16 32302 32286 326.897 388 0.0280574 0.194395 > >> 396 16 32348 32332 326.537 184 0.0256821 0.194157 > >> 397 16 32385 32369 326.087 148 0.0254342 0.193965 > >> 398 16 32424 32408 325.659 156 0.0263006 0.193763 > >> 399 16 32445 32429 325.05484 0.0233839 0.193655 > >> 2015-09-08 11:22:31.940164 min lat: 0.0165045 max lat: 67.6184 avg lat: > 0.193655 > >> sec Cur ops started finished avg MB/s cur MB/s last lat avg lat > >> 400 16 32445 32429 324.241 0 - 0.193655 > >> 401 16 32445 32429 323.433 0 - 0.193655 > >> 402 16 32445 32429 322.628 0 - 0.193655 > >> 403 16 32445 32429 321.828 0 - 0.193655 > >> 404 16 32445 32429 321.031 0 - 0.193655 > >> 405 16 32445 32429 320.238 0 - 0.193655 > >> 406 16 32445 32429319.45 0 - 0.193655 > >> 407 16 32445 32429 318.665 0 - 0.193655 > >> > >> needless to say, very strange. > >> > >> —Lincoln > >> > >> > >>> On Sep 7, 2015, at 3:35 PM, Vickey Singh > wrote: > >>> > >>> Adding ceph-users. > >>> > >>> On Mon, Sep 7, 2015 at 11:31 PM, Vickey Singh < > vickey.singh22...@gmail.com> wrote: > >>> > >>> > >>> On Mon, Sep 7, 2015 at 10:04 PM, Udo Lembke > wrote: > >>> Hi Vickey, > >>> Thanks for your time in replying to my problem. > >>> > >>> I had the same rados bench output after changing the motherboard of > the monitor node with the lowest IP... > >>> Due to the new mainboard, I assume the hw-clock was wrong during > startup. Ceph health
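If that setting helps, it needs to be applied on every OSD (and ideally client) node and made persistent; a sketch, with the host list as a placeholder:

for h in ceph-osd1 ceph-osd2 ceph-osd3; do
  ssh "$h" 'echo "net.ipv4.tcp_moderate_rcvbuf = 0" >> /etc/sysctl.conf && sysctl -p && sysctl net.ipv4.tcp_moderate_rcvbuf'
done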
Re: [ceph-users] Ceph cluster NO read / write performance :: Ops are blocked
Hello Jan On Wed, Sep 9, 2015 at 11:59 AM, Jan Schermerwrote: > Just to recapitulate - the nodes are doing "nothing" when it drops to > zero? Not flushing something to drives (iostat)? Not cleaning pagecache > (kswapd and similiar)? Not out of any type of memory (slab, > min_free_kbytes)? Not network link errors, no bad checksums (those are hard > to spot, though)? > > Unless you find something I suggest you try disabling offloads on the NICs > and see if the problem goes away. > Could you please elaborate this point , how do you disable / offload on the NIC ? what does it mean ? how to do it ? how its gonna help. Sorry i don't know about this. - Vickey - > > Jan > > > On 08 Sep 2015, at 18:26, Lincoln Bryant wrote: > > > > For whatever it’s worth, my problem has returned and is very similar to > yours. Still trying to figure out what’s going on over here. > > > > Performance is nice for a few seconds, then goes to 0. This is a similar > setup to yours (12 OSDs per box, Scientific Linux 6, Ceph 0.94.3, etc) > > > > 384 16 29520 29504 307.287 1188 0.0492006 0.208259 > > 385 16 29813 29797 309.532 1172 0.0469708 0.206731 > > 386 16 30105 30089 311.756 1168 0.0375764 0.205189 > > 387 16 30401 30385 314.009 1184 0.036142 0.203791 > > 388 16 30695 30679 316.231 1176 0.0372316 0.202355 > > 389 16 30987 30971318.42 1168 0.0660476 0.200962 > > 390 16 31282 31266 320.628 1180 0.0358611 0.199548 > > 391 16 31568 31552 322.734 1144 0.0405166 0.198132 > > 392 16 31857 31841 324.859 1156 0.0360826 0.196679 > > 393 16 32090 32074 326.404 932 0.0416869 0.19549 > > 394 16 32205 32189 326.743 460 0.0251877 0.194896 > > 395 16 32302 32286 326.897 388 0.0280574 0.194395 > > 396 16 32348 32332 326.537 184 0.0256821 0.194157 > > 397 16 32385 32369 326.087 148 0.0254342 0.193965 > > 398 16 32424 32408 325.659 156 0.0263006 0.193763 > > 399 16 32445 32429 325.05484 0.0233839 0.193655 > > 2015-09-08 11:22:31.940164 min lat: 0.0165045 max lat: 67.6184 avg lat: > 0.193655 > > sec Cur ops started finished avg MB/s cur MB/s last lat avg lat > > 400 16 32445 32429 324.241 0 - 0.193655 > > 401 16 32445 32429 323.433 0 - 0.193655 > > 402 16 32445 32429 322.628 0 - 0.193655 > > 403 16 32445 32429 321.828 0 - 0.193655 > > 404 16 32445 32429 321.031 0 - 0.193655 > > 405 16 32445 32429 320.238 0 - 0.193655 > > 406 16 32445 32429319.45 0 - 0.193655 > > 407 16 32445 32429 318.665 0 - 0.193655 > > > > needless to say, very strange. > > > > —Lincoln > > > > > >> On Sep 7, 2015, at 3:35 PM, Vickey Singh > wrote: > >> > >> Adding ceph-users. > >> > >> On Mon, Sep 7, 2015 at 11:31 PM, Vickey Singh < > vickey.singh22...@gmail.com> wrote: > >> > >> > >> On Mon, Sep 7, 2015 at 10:04 PM, Udo Lembke > wrote: > >> Hi Vickey, > >> Thanks for your time in replying to my problem. > >> > >> I had the same rados bench output after changing the motherboard of the > monitor node with the lowest IP... > >> Due to the new mainboard, I assume the hw-clock was wrong during > startup. Ceph health show no errors, but all VMs aren't able to do IO (very > high load on the VMs - but no traffic). > >> I stopped the mon, but this don't changed anything. I had to restart > all other mons to get IO again. After that I started the first mon also > (with the right time now) and all worked fine again... > >> > >> Thanks i will try to restart all OSD / MONS and report back , if it > solves my problem > >> > >> Another posibility: > >> Do you use journal on SSDs? Perhaps the SSDs can't write to garbage > collection? 
> >> > >> No i don't have journals on SSD , they are on the same OSD disk. > >> > >> > >> > >> Udo > >> > >> > >> On 07.09.2015 16:36, Vickey Singh wrote: > >>> Dear Experts > >>> > >>> Can someone please help me , why my cluster is not able write data. > >>> > >>> See the below output cur MB/S is 0 and Avg MB/s is decreasing. > >>> > >>> > >>> Ceph Hammer 0.94.2 > >>> CentOS 6 (3.10.69-1) > >>> > >>> The Ceph status says OPS are blocked , i have tried checking , what > all i know > >>> > >>> - System resources ( CPU , net, disk , memory )-- All normal > >>> - 10G network for public and cluster network -- no saturation > >>> - Add disks are physically healthy > >>> - No messages in /var/log/messages OR dmesg > >>> - Tried
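To answer the question above directly: offloads are NIC features (TSO/GSO/GRO, checksum offload) where the card, rather than the kernel, segments and checksums packets; buggy firmware or drivers can corrupt or stall traffic, so turning them off is a common test. A sketch with ethtool, assuming eth0; the change is not persistent across reboots unless added to the interface config (e.g. ETHTOOL_OPTS in ifcfg-eth0 on CentOS):

# show current offload settings
ethtool -k eth0
# turn the main ones off
ethtool -K eth0 tso off gso off gro off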
Re: [ceph-users] Ceph cluster NO read / write performance :: Ops are blocked
Hi Jan,

I’ll take a look at all of those things and report back (hopefully :))

I did try setting all of my OSDs to writethrough instead of writeback on the controller, which was significantly more consistent in performance (from 1100MB/s down to 300MB/s, but still occasionally dropping to 0MB/s). Still plenty of blocked ops.

I was wondering if one or more not-so-nicely failing OSDs might be the cause. My controller (PERC H730 Mini) seems frustratingly terse with SMART information, but at least one disk has a “Non-medium error count” of over 20,000. I’ll try disabling offloads as well.

Thanks much for the suggestions!

Cheers,
Lincoln

> On Sep 9, 2015, at 3:59 AM, Jan Schermer wrote:
>
> Just to recapitulate - the nodes are doing "nothing" when it drops to zero?
> Not flushing something to drives (iostat)? Not cleaning pagecache (kswapd and
> similiar)? Not out of any type of memory (slab, min_free_kbytes)? Not network
> link errors, no bad checksums (those are hard to spot, though)?
>
> Unless you find something I suggest you try disabling offloads on the NICs
> and see if the problem goes away.
>
> Jan
>
>> On 08 Sep 2015, at 18:26, Lincoln Bryant wrote:
>>
>> For whatever it’s worth, my problem has returned and is very similar to
>> yours. Still trying to figure out what’s going on over here.
>>
>> Performance is nice for a few seconds, then goes to 0. This is a similar
>> setup to yours (12 OSDs per box, Scientific Linux 6, Ceph 0.94.3, etc)
>>
>> 384 16 29520 29504 307.287 1188 0.0492006 0.208259
>> 385 16 29813 29797 309.532 1172 0.0469708 0.206731
>> 386 16 30105 30089 311.756 1168 0.0375764 0.205189
>> 387 16 30401 30385 314.009 1184 0.036142 0.203791
>> 388 16 30695 30679 316.231 1176 0.0372316 0.202355
>> 389 16 30987 30971 318.42 1168 0.0660476 0.200962
>> 390 16 31282 31266 320.628 1180 0.0358611 0.199548
>> 391 16 31568 31552 322.734 1144 0.0405166 0.198132
>> 392 16 31857 31841 324.859 1156 0.0360826 0.196679
>> 393 16 32090 32074 326.404 932 0.0416869 0.19549
>> 394 16 32205 32189 326.743 460 0.0251877 0.194896
>> 395 16 32302 32286 326.897 388 0.0280574 0.194395
>> 396 16 32348 32332 326.537 184 0.0256821 0.194157
>> 397 16 32385 32369 326.087 148 0.0254342 0.193965
>> 398 16 32424 32408 325.659 156 0.0263006 0.193763
>> 399 16 32445 32429 325.054 84 0.0233839 0.193655
>> 2015-09-08 11:22:31.940164 min lat: 0.0165045 max lat: 67.6184 avg lat: 0.193655
>> sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
>> 400 16 32445 32429 324.241 0 - 0.193655
>> 401 16 32445 32429 323.433 0 - 0.193655
>> 402 16 32445 32429 322.628 0 - 0.193655
>> 403 16 32445 32429 321.828 0 - 0.193655
>> 404 16 32445 32429 321.031 0 - 0.193655
>> 405 16 32445 32429 320.238 0 - 0.193655
>> 406 16 32445 32429 319.45 0 - 0.193655
>> 407 16 32445 32429 318.665 0 - 0.193655
>>
>> needless to say, very strange.
>>
>> —Lincoln
>>
>>> On Sep 7, 2015, at 3:35 PM, Vickey Singh wrote:
>>>
>>> Adding ceph-users.
>>>
>>> On Mon, Sep 7, 2015 at 11:31 PM, Vickey Singh wrote:
>>>
>>> On Mon, Sep 7, 2015 at 10:04 PM, Udo Lembke wrote:
>>> Hi Vickey,
>>> Thanks for your time in replying to my problem.
>>>
>>> I had the same rados bench output after changing the motherboard of the
>>> monitor node with the lowest IP...
>>> Due to the new mainboard, I assume the hw-clock was wrong during startup.
>>> Ceph health show no errors, but all VMs aren't able to do IO (very high
>>> load on the VMs - but no traffic).
>>> I stopped the mon, but this don't changed anything. I had to restart all
>>> other mons to get IO again. After that I started the first mon also (with
>>> the right time now) and all worked fine again...
>>>
>>> Thanks i will try to restart all OSD / MONS and report back , if it solves
>>> my problem
>>>
>>> Another posibility:
>>> Do you use journal on SSDs? Perhaps the SSDs can't write to garbage
>>> collection?
>>>
>>> No i don't have journals on SSD , they are on the same OSD disk.
>>>
>>> Udo
>>>
>>> On 07.09.2015 16:36, Vickey Singh wrote:
>>>> Dear Experts
>>>>
>>>> Can someone please help me , why my cluster is not able write data.
>>>>
>>>> See the below output cur MB/S is 0 and Avg MB/s is decreasing.
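On the SMART point above: the PERC H730 is a MegaRAID-family controller, so per-disk SMART data can usually be read through the controller with smartmontools rather than from the exported virtual disks. A sketch, assuming smartmontools is installed; the drive number after "megaraid," is just a placeholder that has to match the controller's target id:

# smartctl --scan
(lists the devices and the -d types smartctl thinks it can address)
# smartctl -a -d megaraid,5 /dev/sda
(queries the physical drive at controller target 5; for SAS drives the output includes the grown defect list and the non-medium error counter Lincoln mentions)

The write-back/write-through cache policy itself is normally changed with the vendor CLI (MegaCli or perccli) or from the controller BIOS; the exact syntax varies by tool version, so check the tool's own help.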
Re: [ceph-users] Ceph cluster NO read / write performance :: Ops are blocked
For whatever it’s worth, my problem has returned and is very similar to yours. Still trying to figure out what’s going on over here. Performance is nice for a few seconds, then goes to 0. This is a similar setup to yours (12 OSDs per box, Scientific Linux 6, Ceph 0.94.3, etc) 384 16 29520 29504 307.287 1188 0.0492006 0.208259 385 16 29813 29797 309.532 1172 0.0469708 0.206731 386 16 30105 30089 311.756 1168 0.0375764 0.205189 387 16 30401 30385 314.009 1184 0.036142 0.203791 388 16 30695 30679 316.231 1176 0.0372316 0.202355 389 16 30987 30971318.42 1168 0.0660476 0.200962 390 16 31282 31266 320.628 1180 0.0358611 0.199548 391 16 31568 31552 322.734 1144 0.0405166 0.198132 392 16 31857 31841 324.859 1156 0.0360826 0.196679 393 16 32090 32074 326.404 932 0.0416869 0.19549 394 16 32205 32189 326.743 460 0.0251877 0.194896 395 16 32302 32286 326.897 388 0.0280574 0.194395 396 16 32348 32332 326.537 184 0.0256821 0.194157 397 16 32385 32369 326.087 148 0.0254342 0.193965 398 16 32424 32408 325.659 156 0.0263006 0.193763 399 16 32445 32429 325.05484 0.0233839 0.193655 2015-09-08 11:22:31.940164 min lat: 0.0165045 max lat: 67.6184 avg lat: 0.193655 sec Cur ops started finished avg MB/s cur MB/s last lat avg lat 400 16 32445 32429 324.241 0 - 0.193655 401 16 32445 32429 323.433 0 - 0.193655 402 16 32445 32429 322.628 0 - 0.193655 403 16 32445 32429 321.828 0 - 0.193655 404 16 32445 32429 321.031 0 - 0.193655 405 16 32445 32429 320.238 0 - 0.193655 406 16 32445 32429319.45 0 - 0.193655 407 16 32445 32429 318.665 0 - 0.193655 needless to say, very strange. —Lincoln > On Sep 7, 2015, at 3:35 PM, Vickey Singhwrote: > > Adding ceph-users. > > On Mon, Sep 7, 2015 at 11:31 PM, Vickey Singh > wrote: > > > On Mon, Sep 7, 2015 at 10:04 PM, Udo Lembke wrote: > Hi Vickey, > Thanks for your time in replying to my problem. > > I had the same rados bench output after changing the motherboard of the > monitor node with the lowest IP... > Due to the new mainboard, I assume the hw-clock was wrong during startup. > Ceph health show no errors, but all VMs aren't able to do IO (very high load > on the VMs - but no traffic). > I stopped the mon, but this don't changed anything. I had to restart all > other mons to get IO again. After that I started the first mon also (with the > right time now) and all worked fine again... > > Thanks i will try to restart all OSD / MONS and report back , if it solves my > problem > > Another posibility: > Do you use journal on SSDs? Perhaps the SSDs can't write to garbage > collection? > > No i don't have journals on SSD , they are on the same OSD disk. > > > > Udo > > > On 07.09.2015 16:36, Vickey Singh wrote: >> Dear Experts >> >> Can someone please help me , why my cluster is not able write data. >> >> See the below output cur MB/S is 0 and Avg MB/s is decreasing. >> >> >> Ceph Hammer 0.94.2 >> CentOS 6 (3.10.69-1) >> >> The Ceph status says OPS are blocked , i have tried checking , what all i >> know >> >> - System resources ( CPU , net, disk , memory )-- All normal >> - 10G network for public and cluster network -- no saturation >> - Add disks are physically healthy >> - No messages in /var/log/messages OR dmesg >> - Tried restarting OSD which are blocking operation , but no luck >> - Tried writing through RBD and Rados bench , both are giving same problemm >> >> Please help me to fix this problem. 
>> >> # rados bench -p rbd 60 write >> Maintaining 16 concurrent writes of 4194304 bytes for up to 60 seconds or 0 >> objects >> Object prefix: benchmark_data_stor1_1791844 >>sec Cur ops started finished avg MB/s cur MB/s last lat avg lat >> 0 0 0 0 0 0 - 0 >> 1 16 125 109 435.873 436 0.022076 0.0697864 >> 2 16 139 123 245.94856 0.246578 0.0674407 >> 3 16 139 123 163.969 0 - 0.0674407 >> 4 16 139 123 122.978 0 - 0.0674407 >> 5 16 139 12398.383 0 - 0.0674407 >> 6 16 139 123 81.9865 0 - 0.0674407 >> 7 16
[ceph-users] Ceph cluster NO read / write performance :: Ops are blocked
Dear Experts Can someone please help me , why my cluster is not able write data. See the below output cur MB/S is 0 and Avg MB/s is decreasing. Ceph Hammer 0.94.2 CentOS 6 (3.10.69-1) The Ceph status says OPS are blocked , i have tried checking , what all i know - System resources ( CPU , net, disk , memory )-- All normal - 10G network for public and cluster network -- no saturation - Add disks are physically healthy - No messages in /var/log/messages OR dmesg - Tried restarting OSD which are blocking operation , but no luck - Tried writing through RBD and Rados bench , both are giving same problemm Please help me to fix this problem. # rados bench -p rbd 60 write Maintaining 16 concurrent writes of 4194304 bytes for up to 60 seconds or 0 objects Object prefix: benchmark_data_stor1_1791844 sec Cur ops started finished avg MB/s cur MB/s last lat avg lat 0 0 0 0 0 0 - 0 1 16 125 109 435.873 436 0.022076 0.0697864 2 16 139 123 245.94856 0.246578 0.0674407 3 16 139 123 163.969 0 - 0.0674407 4 16 139 123 122.978 0 - 0.0674407 5 16 139 12398.383 0 - 0.0674407 6 16 139 123 81.9865 0 - 0.0674407 7 16 139 123 70.2747 0 - 0.0674407 8 16 139 123 61.4903 0 - 0.0674407 9 16 139 123 54.6582 0 - 0.0674407 10 16 139 123 49.1924 0 - 0.0674407 11 16 139 123 44.7201 0 - 0.0674407 12 16 139 123 40.9934 0 - 0.0674407 13 16 139 123 37.8401 0 - 0.0674407 14 16 139 123 35.1373 0 - 0.0674407 15 16 139 123 32.7949 0 - 0.0674407 16 16 139 123 30.7451 0 - 0.0674407 17 16 139 123 28.9364 0 - 0.0674407 18 16 139 123 27.3289 0 - 0.0674407 19 16 139 123 25.8905 0 - 0.0674407 2015-09-07 15:54:52.694071min lat: 0.022076 max lat: 0.46117 avg lat: 0.0674407 sec Cur ops started finished avg MB/s cur MB/s last lat avg lat 20 16 139 12324.596 0 - 0.0674407 21 16 139 123 23.4247 0 - 0.0674407 22 16 139 123 22.36 0 - 0.0674407 23 16 139 123 21.3878 0 - 0.0674407 24 16 139 123 20.4966 0 - 0.0674407 25 16 139 123 19.6768 0 - 0.0674407 26 16 139 123 18.92 0 - 0.0674407 27 16 139 123 18.2192 0 - 0.0674407 28 16 139 123 17.5686 0 - 0.0674407 29 16 139 123 16.9628 0 - 0.0674407 30 16 139 123 16.3973 0 - 0.0674407 31 16 139 123 15.8684 0 - 0.0674407 32 16 139 123 15.3725 0 - 0.0674407 33 16 139 123 14.9067 0 - 0.0674407 34 16 139 123 14.4683 0 - 0.0674407 35 16 139 123 14.0549 0 - 0.0674407 36 16 139 123 13.6645 0 - 0.0674407 37 16 139 123 13.2952 0 - 0.0674407 38 16 139 123 12.9453 0 - 0.0674407 39 16 139 123 12.6134 0 - 0.0674407 2015-09-07 15:55:12.697124min lat: 0.022076 max lat: 0.46117 avg lat: 0.0674407 sec Cur ops started finished avg MB/s cur MB/s last lat avg lat 40 16 139 123 12.2981 0 - 0.0674407 41 16 139 123 11.9981 0 - 0.0674407 cluster 86edf8b8-b353-49f1-ab0a-a4827a9ea5e8 health HEALTH_WARN 1 requests are blocked > 32 sec monmap e3: 3 mons at {stor0111= 10.100.1.111:6789/0,stor0113=10.100.1.113:6789/0,stor011 5=10.100.1.115:6789/0} election epoch 32, quorum 0,1,2 stor0111,stor0113,stor0115 osdmap e19536: 50 osds: 50 up, 50 in pgmap v928610: 2752 pgs, 9 pools, 30476 GB data, 4183 kobjects 91513 GB used, 47642 GB / 135 TB avail 2752 active+clean Tried using RBD # dd if=/dev/zero of=file1 bs=4K count=1 oflag=direct 1+0 records in 1+0 records out 4096 bytes (41
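When the cluster is in this state, it can help to pin down exactly where the blocked requests are sitting rather than just counting them. A sketch, with osd.111 as a placeholder id; the admin-socket commands have to be run on the host that carries that OSD:

# ceph health detail
(names the specific OSDs reporting requests blocked > 32 sec, not just the total)
# ceph --admin-daemon /var/run/ceph/ceph-osd.111.asok dump_ops_in_flight
# ceph --admin-daemon /var/run/ceph/ceph-osd.111.asok dump_historic_ops
(show the currently queued and recently slowest ops, including the stage each op spent its time in, e.g. waiting for subops or for the journal)

If the slow ops are consistently waiting on subops from one particular peer OSD, that peer (or its disk/controller) is a better suspect than the OSD doing the reporting.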
Re: [ceph-users] Ceph cluster NO read / write performance :: Ops are blocked
Hi Vickey, I had this exact same problem last week, resolved by rebooting all of my OSD nodes. I have yet to figure out why it happened, though. I _suspect_ in my case it's due to a failing controller on a particular box I've had trouble with in the past. I tried setting 'noout', stopping my OSDs one host at a time, then rerunning RADOS bench between to see if I could nail down the problematic machine. Depending on your # of hosts, this might work for you. Admittedly, I got impatient with this approach though and just ended up restarting everything (which worked!) :) If you have a bunch of blocked ops, you could maybe try a 'pg query' on the PGs involved and see if there's a common OSD with all of your blocked ops. In my experience, it's not necessarily the one reporting. Anecdotally, I've had trouble with Intel 10Gb NICs and custom kernels as well. I've seen a NIC appear to be happy (no message in dmesg, machine appears to be communicating normally, etc) but when I went to iperf it, I was getting super pitiful performance (like KB/s). I don't know what kind of NICs you're using, but you may want to iperf everything just in case. --Lincoln On 9/7/2015 9:36 AM, Vickey Singh wrote: Dear Experts Can someone please help me , why my cluster is not able write data. See the below output cur MB/S is 0 and Avg MB/s is decreasing. Ceph Hammer 0.94.2 CentOS 6 (3.10.69-1) The Ceph status says OPS are blocked , i have tried checking , what all i know - System resources ( CPU , net, disk , memory )-- All normal - 10G network for public and cluster network -- no saturation - Add disks are physically healthy - No messages in /var/log/messages OR dmesg - Tried restarting OSD which are blocking operation , but no luck - Tried writing through RBD and Rados bench , both are giving same problemm Please help me to fix this problem. # rados bench -p rbd 60 write Maintaining 16 concurrent writes of 4194304 bytes for up to 60 seconds or 0 objects Object prefix: benchmark_data_stor1_1791844 sec Cur ops started finished avg MB/s cur MB/s last lat avg lat 0 0 0 0 0 0 - 0 1 16 125 109 435.873 436 0.022076 0.0697864 2 16 139 123 245.94856 0.246578 0.0674407 3 16 139 123 163.969 0 - 0.0674407 4 16 139 123 122.978 0 - 0.0674407 5 16 139 12398.383 0 - 0.0674407 6 16 139 123 81.9865 0 - 0.0674407 7 16 139 123 70.2747 0 - 0.0674407 8 16 139 123 61.4903 0 - 0.0674407 9 16 139 123 54.6582 0 - 0.0674407 10 16 139 123 49.1924 0 - 0.0674407 11 16 139 123 44.7201 0 - 0.0674407 12 16 139 123 40.9934 0 - 0.0674407 13 16 139 123 37.8401 0 - 0.0674407 14 16 139 123 35.1373 0 - 0.0674407 15 16 139 123 32.7949 0 - 0.0674407 16 16 139 123 30.7451 0 - 0.0674407 17 16 139 123 28.9364 0 - 0.0674407 18 16 139 123 27.3289 0 - 0.0674407 19 16 139 123 25.8905 0 - 0.0674407 2015-09-07 15:54:52.694071min lat: 0.022076 max lat: 0.46117 avg lat: 0.0674407 sec Cur ops started finished avg MB/s cur MB/s last lat avg lat 20 16 139 12324.596 0 - 0.0674407 21 16 139 123 23.4247 0 - 0.0674407 22 16 139 123 22.36 0 - 0.0674407 23 16 139 123 21.3878 0 - 0.0674407 24 16 139 123 20.4966 0 - 0.0674407 25 16 139 123 19.6768 0 - 0.0674407 26 16 139 123 18.92 0 - 0.0674407 27 16 139 123 18.2192 0 - 0.0674407 28 16 139 123 17.5686 0 - 0.0674407 29 16 139 123 16.9628 0 - 0.0674407 30 16 139 123 16.3973 0 - 0.0674407 31 16 139 123 15.8684 0 - 0.0674407 32 16 139 123 15.3725 0 - 0.0674407 33 16 139 123 14.9067 0 - 0.0674407 34 16 139 123 14.4683 0 - 0.0674407 35 16 139 123 14.0549 0
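For the 'pg query' and iperf suggestions above, the commands look roughly like this; the pg id, hostname and port are placeholders to be taken from your own 'ceph health detail' output and node list:

# ceph pg 2.3f query
(dumps the acting set and recovery state of one problem PG; if the same OSD keeps appearing across several stuck PGs, that OSD is a good suspect)
# iperf -s
(on one storage node)
# iperf -c stor0111
(from another node; repeat between pairs of hosts so every NIC gets exercised in both directions)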
Re: [ceph-users] Ceph cluster NO read / write performance :: Ops are blocked
On Mon, Sep 7, 2015 at 7:39 PM, Lincoln Bryant wrote:
> Hi Vickey,

Thanks a lot for replying to my problem.

> I had this exact same problem last week, resolved by rebooting all of my
> OSD nodes. I have yet to figure out why it happened, though. I _suspect_ in
> my case it's due to a failing controller on a particular box I've had
> trouble with in the past.

Mine is a 5-node cluster with 12 OSDs per node, and in the past there have never been any hardware problems.

> I tried setting 'noout', stopping my OSDs one host at a time, then
> rerunning RADOS bench in between to see if I could nail down the problematic
> machine. Depending on your # of hosts, this might work for you. Admittedly,
> I got impatient with this approach though and just ended up restarting
> everything (which worked!) :)

So do you mean you intentionally brought one node's OSDs down, so that some OSDs were down but none of them were out (noout)? Then you waited for some time for the cluster to become healthy, and then reran rados bench?

> If you have a bunch of blocked ops, you could maybe try a 'pg query' on
> the PGs involved and see if there's a common OSD with all of your blocked
> ops. In my experience, it's not necessarily the one reporting.

Yeah, I have 55 OSDs and every time some random OSD shows blocked OPS, so I can't blame any specific OSD. After a few minutes that blocked OSD becomes clean, and after some time some other OSD blocks ops.

Thanks, I will try to restart all OSD / monitor daemons and see if that fixes it. Is there anything I need to keep in mind when restarting OSDs (apart from setting nodown / noout)?

> Anecdotally, I've had trouble with Intel 10Gb NICs and custom kernels as
> well. I've seen a NIC appear to be happy (no message in dmesg, machine
> appears to be communicating normally, etc) but when I went to iperf it, I
> was getting super pitiful performance (like KB/s). I don't know what kind
> of NICs you're using, but you may want to iperf everything just in case.

Yeah, I did that; iperf shows no problem. Is there anything else I should do?

> --Lincoln
>
> On 9/7/2015 9:36 AM, Vickey Singh wrote:
>> Dear Experts
>>
>> Can someone please help me , why my cluster is not able write data.
>>
>> See the below output cur MB/S is 0 and Avg MB/s is decreasing.
>>
>> Ceph Hammer 0.94.2
>> CentOS 6 (3.10.69-1)
>>
>> The Ceph status says OPS are blocked , i have tried checking , what all i know
>>
>> - System resources ( CPU , net, disk , memory )-- All normal
>> - 10G network for public and cluster network -- no saturation
>> - Add disks are physically healthy
>> - No messages in /var/log/messages OR dmesg
>> - Tried restarting OSD which are blocking operation , but no luck
>> - Tried writing through RBD and Rados bench , both are giving same problemm
>>
>> Please help me to fix this problem.
> > # rados bench -p rbd 60 write > Maintaining 16 concurrent writes of 4194304 bytes for up to 60 seconds or > 0 objects > Object prefix: benchmark_data_stor1_1791844 >sec Cur ops started finished avg MB/s cur MB/s last lat avg lat > 0 0 0 0 0 0 - 0 > 1 16 125 109 435.873 436 0.022076 0.0697864 > 2 16 139 123 245.94856 0.246578 0.0674407 > 3 16 139 123 163.969 0 - 0.0674407 > 4 16 139 123 122.978 0 - 0.0674407 > 5 16 139 12398.383 0 - 0.0674407 > 6 16 139 123 81.9865 0 - 0.0674407 > 7 16 139 123 70.2747 0 - 0.0674407 > 8 16 139 123 61.4903 0 - 0.0674407 > 9 16 139 123 54.6582 0 - 0.0674407 > 10 16 139 123 49.1924 0 - 0.0674407 > 11 16 139 123 44.7201 0 - 0.0674407 > 12 16 139 123 40.9934 0 - 0.0674407 > 13 16 139 123 37.8401 0 - 0.0674407 > 14 16 139 123 35.1373 0 - 0.0674407 > 15 16 139 123 32.7949 0 - 0.0674407 > 16 16 139 123 30.7451 0 - 0.0674407 > 17 16 139 123 28.9364 0 - 0.0674407 > 18 16 139 123 27.3289 0 - 0.0674407 > 19 16 139 123 25.8905 0 - 0.0674407 > 2015-09-07 15:54:52.694071min lat: 0.022076 max lat: 0.46117 avg lat: > 0.0674407 >sec Cur ops started finished avg MB/s cur MB/s last lat avg lat > 20 16 139 12324.596 0 - 0.0674407 > 21 16 139 123 23.4247 0 - 0.0674407 > 22 16 139 123
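On the question of what to keep in mind while restarting OSDs: the usual pattern is to stop automatic rebalancing first and go one host at a time, waiting for PGs to settle in between. A sketch, assuming the sysvinit scripts that ship with Hammer on CentOS 6:

# ceph osd set noout
# service ceph restart osd
(run on one storage node; restarts every OSD defined on that host)
# ceph -s
(wait until the PGs are back to active+clean before moving to the next host)
# ceph osd unset noout
(once every host has been done)

Setting noout keeps CRUSH from marking the briefly-stopped OSDs out and kicking off recovery traffic while they restart.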
Re: [ceph-users] Ceph cluster NO read / write performance :: Ops are blocked
Adding ceph-users. On Mon, Sep 7, 2015 at 11:31 PM, Vickey Singhwrote: > > > On Mon, Sep 7, 2015 at 10:04 PM, Udo Lembke wrote: > >> Hi Vickey, >> > Thanks for your time in replying to my problem. > > >> I had the same rados bench output after changing the motherboard of the >> monitor node with the lowest IP... >> Due to the new mainboard, I assume the hw-clock was wrong during startup. >> Ceph health show no errors, but all VMs aren't able to do IO (very high >> load on the VMs - but no traffic). >> I stopped the mon, but this don't changed anything. I had to restart all >> other mons to get IO again. After that I started the first mon also (with >> the right time now) and all worked fine again... >> > > Thanks i will try to restart all OSD / MONS and report back , if it solves > my problem > >> >> Another posibility: >> Do you use journal on SSDs? Perhaps the SSDs can't write to garbage >> collection? >> > > No i don't have journals on SSD , they are on the same OSD disk. > >> >> >> >> Udo >> >> >> On 07.09.2015 16:36, Vickey Singh wrote: >> >> Dear Experts >> >> Can someone please help me , why my cluster is not able write data. >> >> See the below output cur MB/S is 0 and Avg MB/s is decreasing. >> >> >> Ceph Hammer 0.94.2 >> CentOS 6 (3.10.69-1) >> >> The Ceph status says OPS are blocked , i have tried checking , what all i >> know >> >> - System resources ( CPU , net, disk , memory )-- All normal >> - 10G network for public and cluster network -- no saturation >> - Add disks are physically healthy >> - No messages in /var/log/messages OR dmesg >> - Tried restarting OSD which are blocking operation , but no luck >> - Tried writing through RBD and Rados bench , both are giving same >> problemm >> >> Please help me to fix this problem. >> >> # rados bench -p rbd 60 write >> Maintaining 16 concurrent writes of 4194304 bytes for up to 60 seconds >> or 0 objects >> Object prefix: benchmark_data_stor1_1791844 >>sec Cur ops started finished avg MB/s cur MB/s last lat avg lat >> 0 0 0 0 0 0 - 0 >> 1 16 125 109 435.873 436 0.022076 0.0697864 >> 2 16 139 123 245.94856 0.246578 0.0674407 >> 3 16 139 123 163.969 0 - 0.0674407 >> 4 16 139 123 122.978 0 - 0.0674407 >> 5 16 139 12398.383 0 - 0.0674407 >> 6 16 139 123 81.9865 0 - 0.0674407 >> 7 16 139 123 70.2747 0 - 0.0674407 >> 8 16 139 123 61.4903 0 - 0.0674407 >> 9 16 139 123 54.6582 0 - 0.0674407 >> 10 16 139 123 49.1924 0 - 0.0674407 >> 11 16 139 123 44.7201 0 - 0.0674407 >> 12 16 139 123 40.9934 0 - 0.0674407 >> 13 16 139 123 37.8401 0 - 0.0674407 >> 14 16 139 123 35.1373 0 - 0.0674407 >> 15 16 139 123 32.7949 0 - 0.0674407 >> 16 16 139 123 30.7451 0 - 0.0674407 >> 17 16 139 123 28.9364 0 - 0.0674407 >> 18 16 139 123 27.3289 0 - 0.0674407 >> 19 16 139 123 25.8905 0 - 0.0674407 >> 2015-09-07 15:54:52.694071min lat: 0.022076 max lat: 0.46117 avg lat: >> 0.0674407 >>sec Cur ops started finished avg MB/s cur MB/s last lat avg lat >> 20 16 139 12324.596 0 - 0.0674407 >> 21 16 139 123 23.4247 0 - 0.0674407 >> 22 16 139 123 22.36 0 - 0.0674407 >> 23 16 139 123 21.3878 0 - 0.0674407 >> 24 16 139 123 20.4966 0 - 0.0674407 >> 25 16 139 123 19.6768 0 - 0.0674407 >> 26 16 139 123 18.92 0 - 0.0674407 >> 27 16 139 123 18.2192 0 - 0.0674407 >> 28 16 139 123 17.5686 0 - 0.0674407 >> 29 16 139 123 16.9628 0 - 0.0674407 >> 30 16 139 123 16.3973 0 - 0.0674407 >> 31 16 139 123 15.8684 0 - 0.0674407 >> 32 16 139 123 15.3725 0 - 0.0674407 >> 33 16 139 123 14.9067 0 - 0.0674407 >> 34 16 139 123 14.4683 0 - 0.0674407 >> 35
Re: [ceph-users] Ceph cluster NO read / write performance :: Ops are blocked
Hi Vickey, I had the same rados bench output after changing the motherboard of the monitor node with the lowest IP... Due to the new mainboard, I assume the hw-clock was wrong during startup. Ceph health show no errors, but all VMs aren't able to do IO (very high load on the VMs - but no traffic). I stopped the mon, but this don't changed anything. I had to restart all other mons to get IO again. After that I started the first mon also (with the right time now) and all worked fine again... Another posibility: Do you use journal on SSDs? Perhaps the SSDs can't write to garbage collection? Udo On 07.09.2015 16:36, Vickey Singh wrote: > Dear Experts > > Can someone please help me , why my cluster is not able write data. > > See the below output cur MB/S is 0 and Avg MB/s is decreasing. > > > Ceph Hammer 0.94.2 > CentOS 6 (3.10.69-1) > > The Ceph status says OPS are blocked , i have tried checking , what > all i know > > - System resources ( CPU , net, disk , memory )-- All normal > - 10G network for public and cluster network -- no saturation > - Add disks are physically healthy > - No messages in /var/log/messages OR dmesg > - Tried restarting OSD which are blocking operation , but no luck > - Tried writing through RBD and Rados bench , both are giving same > problemm > > Please help me to fix this problem. > > # rados bench -p rbd 60 write > Maintaining 16 concurrent writes of 4194304 bytes for up to 60 > seconds or 0 objects > Object prefix: benchmark_data_stor1_1791844 >sec Cur ops started finished avg MB/s cur MB/s last lat avg lat > 0 0 0 0 0 0 - 0 > 1 16 125 109 435.873 436 0.022076 0.0697864 > 2 16 139 123 245.94856 0.246578 0.0674407 > 3 16 139 123 163.969 0 - 0.0674407 > 4 16 139 123 122.978 0 - 0.0674407 > 5 16 139 12398.383 0 - 0.0674407 > 6 16 139 123 81.9865 0 - 0.0674407 > 7 16 139 123 70.2747 0 - 0.0674407 > 8 16 139 123 61.4903 0 - 0.0674407 > 9 16 139 123 54.6582 0 - 0.0674407 > 10 16 139 123 49.1924 0 - 0.0674407 > 11 16 139 123 44.7201 0 - 0.0674407 > 12 16 139 123 40.9934 0 - 0.0674407 > 13 16 139 123 37.8401 0 - 0.0674407 > 14 16 139 123 35.1373 0 - 0.0674407 > 15 16 139 123 32.7949 0 - 0.0674407 > 16 16 139 123 30.7451 0 - 0.0674407 > 17 16 139 123 28.9364 0 - 0.0674407 > 18 16 139 123 27.3289 0 - 0.0674407 > 19 16 139 123 25.8905 0 - 0.0674407 > 2015-09-07 15:54:52.694071min lat: 0.022076 max lat: 0.46117 avg lat: > 0.0674407 >sec Cur ops started finished avg MB/s cur MB/s last lat avg lat > 20 16 139 12324.596 0 - 0.0674407 > 21 16 139 123 23.4247 0 - 0.0674407 > 22 16 139 123 22.36 0 - 0.0674407 > 23 16 139 123 21.3878 0 - 0.0674407 > 24 16 139 123 20.4966 0 - 0.0674407 > 25 16 139 123 19.6768 0 - 0.0674407 > 26 16 139 123 18.92 0 - 0.0674407 > 27 16 139 123 18.2192 0 - 0.0674407 > 28 16 139 123 17.5686 0 - 0.0674407 > 29 16 139 123 16.9628 0 - 0.0674407 > 30 16 139 123 16.3973 0 - 0.0674407 > 31 16 139 123 15.8684 0 - 0.0674407 > 32 16 139 123 15.3725 0 - 0.0674407 > 33 16 139 123 14.9067 0 - 0.0674407 > 34 16 139 123 14.4683 0 - 0.0674407 > 35 16 139 123 14.0549 0 - 0.0674407 > 36 16 139 123 13.6645 0 - 0.0674407 > 37 16 139 123 13.2952 0 - 0.0674407 > 38 16 139 123 12.9453 0 - 0.0674407 > 39 16 139 123 12.6134 0 - 0.0674407 > 2015-09-07 15:55:12.697124min lat: 0.022076 max lat: 0.46117 avg lat: > 0.0674407 >sec Cur ops started finished avg MB/s cur MB/s last lat avg lat >
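Since a wrong hardware clock is the suspect in Udo's case: monitor clock skew is worth ruling out explicitly, because (as he describes) it can stall client IO while the rest of the health output looks close to normal. A quick check, assuming ntpd and the sysvinit scripts, with stor0111 standing in for whichever monitor host is affected:

# ceph status
(a skewed monitor usually, though not always, shows up as 'clock skew detected on mon.xxx' in the health section)
# ntpq -p
(run on each monitor host; the offset column should be within a few milliseconds)
# service ceph restart mon
(on the affected monitor host, once its clock has been corrected)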