[ceph-users] Re: Ceph nvme timeout and then aborting

2021-02-23 Thread Marc



I don't think anyone here advises using consumer-grade SSDs/NVMes. Enterprise SSDs usually have a higher endurance rating (DWPD) and simply stay stable under constant high load.
My 1.5-year-old SM863a still reports a wear-level value of 099 and power-on-hours of 097; another SM863a of 3.8 years reports 099 wear level and 093 power-on hours.

When starting out with Ceph, try to keep your environment as simple and as standard as possible. You need to be aware of quite a few details if you are trying to squeeze out the last few percent Ceph can offer.
For instance, disabling the volatile write cache of the SSD/HDD drives gives you better performance (I do not really notice it, but that is probably because my cluster is under low load).
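In case it is useful, this is roughly how the volatile write cache is toggled on Linux (the device path is a placeholder, check it against your own disks first):

  hdparm -W 0 /dev/sdX          # SATA: disable the drive's volatile write cache (-W 1 re-enables it)
  sdparm --set=WCE=0 /dev/sdX   # SCSI/SAS alternative: clear the Write Cache Enable bit
  hdparm -W /dev/sdX            # query the current write-cache setting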

Also realize, from Vitalif's page (https://yourcmc.ru/wiki/Ceph_performance), that at some point it will not get any faster. I was also thinking of putting the WAL/DB on SSDs for the HDD pool, but I skipped that for now. The HDD pool is not fast, but I am also not complaining about it.


This is my fio job file [1] and the result for a Micron SATA SSD drive [2].


[1]
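# Job file note: every job below uses 'stonewall', so the jobs run one at a time and
# each block size / access pattern is measured in isolation at the stated iodepth.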
[global]
ioengine=libaio
#ioengine=posixaio
invalidate=1
ramp_time=30
iodepth=1
runtime=180
time_based
direct=1
filename=/dev/sdX
#filename=/mnt/cephfs/ssd/fio-bench.img

[write-4k-seq]
stonewall
bs=4k
rw=write

[randwrite-4k-seq]
stonewall
bs=4k
rw=randwrite
fsync=1

[randwrite-4k-d32-seq]
stonewall
bs=4k
rw=randwrite
iodepth=32

[read-4k-seq]
stonewall
bs=4k
rw=read

[randread-4k-seq]
stonewall
bs=4k
rw=randread
fsync=1

[randread-4k-d32-seq]
stonewall
bs=4k
rw=randread
iodepth=32

[rw-4k-seq]
stonewall
bs=4k
rw=rw

[randrw-4k-seq]
stonewall
bs=4k
rw=randrw

[randrw-4k-d4-seq]
stonewall
bs=4k
rw=randrw
iodepth=4

[write-128k-seq]
stonewall
bs=128k
rw=write

[randwrite-128k-seq]
stonewall
bs=128k
rw=randwrite

[read-128k-seq]
stonewall
bs=128k
rw=read

[randread-128k-seq]
stonewall
bs=128k
rw=randread

[rw-128k-seq]
stonewall
bs=128k
rw=rw

[randrw-128k-seq]
stonewall
bs=128k
rw=randrw

[write-1024k-seq]
stonewall
bs=1024k
rw=write

[randwrite-1024k-seq]
stonewall
bs=1024k
rw=randwrite

[read-1024k-seq]
stonewall
bs=1024k
rw=read

[randread-1024k-seq]
stonewall
bs=1024k
rw=randread

[rw-1024k-seq]
stonewall
bs=1024k
rw=rw

[randrw-1024k-seq]
stonewall
bs=1024k
rw=randrw

[write-4096k-seq]
stonewall
bs=4096k
rw=write

[write-4096k-d16-seq]
stonewall
bs=4M
rw=write
iodepth=16

[randwrite-4096k-seq]
stonewall
bs=4096k
rw=randwrite

[read-4096k-seq]
stonewall
bs=4096k
rw=read

[read-4096k-d16-seq]
stonewall
bs=4M
rw=read
iodepth=16

[randread-4096k-seq]
stonewall
bs=4096k
rw=randread

[rw-4096k-seq]
stonewall
bs=4096k
rw=rw

[randrw-4096k-seq]
stonewall
bs=4096k
rw=randrw

[2]
write-4k-seq: (groupid=0, jobs=1): err= 0: pid=982502: Sun Oct  4 16:13:28 2020
  write: IOPS=15.3k, BW=59.7MiB/s (62.6MB/s)(10.5GiB/180001msec)
slat (usec): min=6, max=706, avg=12.09, stdev= 5.40
clat (nsec): min=1618, max=1154.4k, avg=50455.96, stdev=18670.50
 lat (usec): min=39, max=1161, avg=62.85, stdev=21.79
clat percentiles (usec):
 |  1.00th=[   39],  5.00th=[   40], 10.00th=[   41], 20.00th=[   42],
 | 30.00th=[   43], 40.00th=[   43], 50.00th=[   45], 60.00th=[   48],
 | 70.00th=[   51], 80.00th=[   54], 90.00th=[   58], 95.00th=[   87],
 | 99.00th=[  141], 99.50th=[  153], 99.90th=[  178], 99.95th=[  188],
 | 99.99th=[  235]
   bw (  KiB/s): min=37570, max=63946, per=69.50%, avg=42495.21, stdev=3251.18, 
samples=359
   iops: min= 9392, max=15986, avg=10623.45, stdev=812.82, samples=359
  lat (usec)   : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=66.19%
  lat (usec)   : 100=30.09%, 250=3.70%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%
  cpu  : usr=9.73%, sys=29.92%, ctx=2751526, majf=0, minf=53
  IO depths: 1=116.8%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
 submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 issued rwts: total=0,2751607,0,0 short=0,0,0,0 dropped=0,0,0,0
 latency   : target=0, window=0, percentile=100.00%, depth=1
randwrite-4k-seq: (groupid=1, jobs=1): err= 0: pid=983595: Sun Oct  4 16:13:28 2020
  write: IOPS=14.9k, BW=58.2MiB/s (61.0MB/s)(10.2GiB/180001msec)
slat (usec): min=6, max=304, avg=10.89, stdev= 4.80
clat (nsec): min=1355, max=1258.5k, avg=49272.39, stdev=17923.95
 lat (usec): min=42, max=1265, avg=60.46, stdev=20.51
clat percentiles (usec):
 |  1.00th=[   39],  5.00th=[   40], 10.00th=[   41], 20.00th=[   41],
 | 30.00th=[   42], 40.00th=[   43], 50.00th=[   43], 60.00th=[   46],
 | 70.00th=[   50], 80.00th=[   53], 90.00th=[   58], 95.00th=[   84],
 | 99.00th=[  137], 99.50th=[  151], 99.90th=[  174], 99.95th=[  184],
 | 99.99th=[  231]
   bw (  KiB/s): min=37665, max=62936, per=69.49%, avg=41402.23, stdev=2934.41, 
samples=359
   iops: min= 9416, max=15734, avg=10350.21, stdev=733.65, samples=359
  lat (usec)   : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=71.60%
  

[ceph-users] Re: Ceph nvme timeout and then aborting

2021-02-22 Thread zxcs
From the Ceph documentation I understand that using a fast device for the WAL/DB can improve 
performance (https://docs.ceph.com/en/latest/start/hardware-recommendations/).
So we use one 2TB or two 1TB Samsung 970 Pro NVMes as WAL/DB here, and yes, we have two data 
pools, an SSD pool and an HDD pool; the SSD pool uses Samsung 860 Pro drives.
The 970 Pro serves as WAL/DB for both the SSD pool and the HDD pool.
I haven't done a test comparing the performance of the SSD pool WITH the NVMe as WAL/DB 
against the SSD pool WITHOUT it.
I just followed the documentation, which says to use a fast device, and we know NVMe is 
faster than a normal SSD.

I also have another question: some documents say we only need to put the DB on the fast 
device and do not need to create a separate WAL (i.e. use the NVMe or SSD as DB for the 
HDD pool, with no separate WAL). Do you agree?
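For reference, a DB-only OSD is roughly created like this (placeholder device paths; when only --block.db is given, BlueStore keeps the WAL on that same fast device, so no separate --block.wal is needed):

  # /dev/sdb = HDD data device, /dev/nvme0n1p1 = pre-created DB partition or LV (placeholders)
  ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1p1
  # --block.wal would only be added if there were a device even faster than the DB device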

We will scale out the cluster soon (to replace the two crashed nodes) and haven't decided 
on the devices yet. One option may be:
1 NVMe (Samsung 980 Pro) providing the DB (no separate WAL) for the HDD pool;
no NVMe for the SSD pool, with the SSD OSDs on Samsung 883 DCT.
 
Would you experts please help shed some light here? Thanks a ton!
  
Thanks,
zx

> On 23 Feb 2021, at 05:32, Mark Lehrer wrote:
> 
>> Yes, it a Nvme, and on node has two Nvmes as db/wal, one
>> for ssd(0-2) and another for hdd(3-6).  I have no spare to try.
>> ...
>> I/O 517 QID 7 timeout, aborting
>> Input/output error
> 
> If you are seeing errors like these, it is almost certainly a bad
> drive unless you are using fabric.
> 
> Why are you putting the wal on an SSD in the first place?  Are you
> sure it is even necessary, especially when one of your pools is
> already SSD?
> 
> Adding this complexity just means that there are more things to break
> when you least expect it. Putting the db/wal on a separate drive is
> usually premature optimization that is only useful for benchmarkers.
> My opinion of course.
> 
> Mark
> 
> 
> 
> 
> 
> 
> 
> 
> On Sun, Feb 21, 2021 at 7:16 PM zxcs  wrote:
>> 
>> Thanks for you reply!
>> 
>> Yes, it a Nvme, and on node has two Nvmes as db/wal, one for ssd(0-2) and 
>> another for hdd(3-6).
>> I have no spare to try.
>> It’s  very strange, the load not very high at that time. and both ssd and 
>> nvme seems healthy.
>> 
>> If cannot fix it.  I am afraid I need to setup more nodes and set out remove 
>> these OSDs which using this Nvme?
>> 
>> Thanks,
>> zx
>> 
>> 
>>> On 22 Feb 2021, at 10:07, Mark Lehrer wrote:
>>> 
 One nvme  sudden crash again. Could anyone please help shed some light 
 here?
>>> 
>>> It looks like a flaky NVMe drive.  Do you have a spare to try?
>>> 
>>> 
>>> On Mon, Feb 22, 2021 at 1:56 AM zxcs  wrote:
 
 One nvme  sudden crash again. Could anyone please help shed some light 
 here? Thank a ton!!!
 Below are syslog and ceph log.
 
 From  /var/log/syslog
 Feb 21 19:38:33 ip kernel: [232562.847916] nvme :03:00.0: I/O 943 QID 
 7 timeout, aborting
 Feb 21 19:38:34 ip kernel: [232563.847946] nvme :03:00.0: I/O 911 QID 
 18 timeout, aborting
 Feb 21 19:38:34 ip kernel: [232563.847964] nvme :03:00.0: I/O 776 QID 
 28 timeout, aborting
 Feb 21 19:38:36 ip ceph-osd[3241]: 2021-02-21 19:38:36.218 7f023b58f700 -1 
 osd.16 7868 get_health_metrics reporting 1 slow ops, oldest is 
 osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
 [create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
 ondisk+write+known_if_redirected+full_force e7868)
 Feb 21 19:38:36 ip kernel: [232565.851961] nvme :03:00.0: I/O 442 QID 
 2 timeout, aborting
 Feb 21 19:38:36 ip kernel: [232565.851982] nvme :03:00.0: I/O 912 QID 
 18 timeout, aborting
 Feb 21 19:38:37 ip ceph-osd[3241]: 2021-02-21 19:38:37.254 7f023b58f700 -1 
 osd.16 7868 get_health_metrics reporting 1 slow ops, oldest is 
 osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
 [create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
 ondisk+write+known_if_redirected+full_force e7868)
 Feb 21 19:38:38 ip ceph-osd[3241]: 2021-02-21 19:38:38.286 7f023b58f700 -1 
 osd.16 7868 get_health_metrics reporting 1 slow ops, oldest is 
 osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
 [create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
 ondisk+write+known_if_redirected+full_force e7868)
 Feb 21 19:38:39 ip ceph-osd[3241]: 2021-02-21 19:38:39.334 7f023b58f700 -1 
 osd.16 7868 get_health_metrics reporting 1 slow ops, oldest is 
 osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
 [create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
 ondisk+write+known_if_redirected+full_force e7868)
 Feb 21 19:38:40 ip ceph-osd[3241]: 2021-02-21 19:38:40.322 7f023b58f700 -1 
 osd.16 7868 get_health_metrics reporting 1 slow ops, oldest is 
 osd_op(mds.1.51327:1034954064 2.80 

[ceph-users] Re: Ceph nvme timeout and then aborting

2021-02-22 Thread zxcs
Thanks a lot, Marc!

I will run a fio test on the crashing disks when there is no traffic in our cluster.
We are using Samsung 970 Pro NVMes as WAL/DB and Samsung 860 Pro drives as the SSDs, and 
the NVMe disappears after the SSDs hit timeouts. Maybe we also need to throw the 970 Pro away?
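If we take the OSDs that share that NVMe down for the test, I understand the usual precaution is to pause rebalancing first, something like this (a sketch; osd.16 is just the id from our log, and a raw fio write test on the WAL/DB device would destroy those OSDs, so a read-only test or a spare device is safer):

  ceph osd set noout            # keep the stopped OSDs from being marked out and rebalanced
  systemctl stop ceph-osd@16    # stop the OSD(s) backed by the suspect NVMe
  # ... run the fio test against the raw device ...
  systemctl start ceph-osd@16
  ceph osd unset noout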
Thanks,
zx 

> On 22 Feb 2021, at 21:25, Marc wrote:
> 
> So on the disks that crash anyway, do the fio test. If it crashes, you will 
> know it has nothing to do with ceph. If it does not crash you will probably 
> get poor fio result, which would explain the problems with ceph.
> 
> This is what someone wrote in the past. If you did not do your research on 
> drives, I think it is probably your drives.
> 
> " just throw away your crappy Samsung SSD 860 Pro "
> https://www.mail-archive.com/ceph-users@ceph.io/msg06820.html
> 
> 
> 
>> -Original Message-
>> From: zxcs 
>> Sent: 22 February 2021 13:10
>> To: Marc 
>> Cc: Mark Lehrer ; Konstantin Shalygin
>> ; ceph-users 
>> Subject: Re: [ceph-users] Ceph nvme timeout and then aborting
>> 
>> Haven’t do any fio test for single  disk , but did fio for the ceph
>> cluster, actually the cluster has 12 nodes, and each node has same
>> disks(means, 2 nvmes for cache, and 3 ssds as osd, 4 hdds also as osd).
>> Only two nodes has such problem. And these two nodes are crash many
>> times(at least 4 times). The others are good.  So it strange.
>> This cluster has run more than half years.
>> 
>> 
>> Thanks,
>> zx
>> 
>>> On 22 Feb 2021, at 18:37, Marc wrote:
>>> 
>>> Don't you have problems, just because the Samsung 970 PRO is not
>> suitable for this? Have you run fio tests to make sure it would work ok?
>>> 
>>> https://yourcmc.ru/wiki/Ceph_performance
>>> https://docs.google.com/spreadsheets/d/1E9-eXjzsKboiCCX-
>> 0u0r5fAjjufLKayaut_FOPxYZjc/edit#gid=0
>>> 
>>> 
>>> 
>>>> -Original Message-
>>>> Sent: 22 February 2021 03:16
>>>> us...@ceph.io>
>>>> Subject: [ceph-users] Re: Ceph nvme timeout and then aborting
>>>> 
>>>> Thanks for you reply!
>>>> 
>>>> Yes, it a Nvme, and on node has two Nvmes as db/wal, one for ssd(0-2)
>>>> and another for hdd(3-6).
>>>> I have no spare to try.
>>>> It’s  very strange, the load not very high at that time. and both ssd
>>>> and nvme seems healthy.
>>>> 
>>>> If cannot fix it.  I am afraid I need to setup more nodes and set out
>>>> remove these OSDs which using this Nvme?
>>>> 
>>>> Thanks,
>>>> zx
>>>> 
>>> 
>>> ___
>>> ceph-users mailing list -- ceph-users@ceph.io
>>> To unsubscribe send an email to ceph-users-le...@ceph.io
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph nvme timeout and then aborting

2021-02-22 Thread Mark Lehrer
> Yes, it a Nvme, and on node has two Nvmes as db/wal, one
> for ssd(0-2) and another for hdd(3-6).  I have no spare to try.
> ...
> I/O 517 QID 7 timeout, aborting
> Input/output error

If you are seeing errors like these, it is almost certainly a bad
drive unless you are using fabric.

Why are you putting the wal on an SSD in the first place?  Are you
sure it is even necessary, especially when one of your pools is
already SSD?

Adding this complexity just means that there are more things to break
when you least expect it. Putting the db/wal on a separate drive is
usually premature optimization that is only useful for benchmarkers.
My opinion of course.

Mark








On Sun, Feb 21, 2021 at 7:16 PM zxcs  wrote:
>
> Thanks for you reply!
>
> Yes, it a Nvme, and on node has two Nvmes as db/wal, one for ssd(0-2) and 
> another for hdd(3-6).
> I have no spare to try.
> It’s  very strange, the load not very high at that time. and both ssd and 
> nvme seems healthy.
>
> If cannot fix it.  I am afraid I need to setup more nodes and set out remove 
> these OSDs which using this Nvme?
>
> Thanks,
> zx
>
>
> > On 22 Feb 2021, at 10:07, Mark Lehrer wrote:
> >
> >> One nvme  sudden crash again. Could anyone please help shed some light 
> >> here?
> >
> > It looks like a flaky NVMe drive.  Do you have a spare to try?
> >
> >
> > On Mon, Feb 22, 2021 at 1:56 AM zxcs  wrote:
> >>
> >> One nvme  sudden crash again. Could anyone please help shed some light 
> >> here? Thank a ton!!!
> >> Below are syslog and ceph log.
> >>
> >> From  /var/log/syslog
> >> Feb 21 19:38:33 ip kernel: [232562.847916] nvme :03:00.0: I/O 943 QID 
> >> 7 timeout, aborting
> >> Feb 21 19:38:34 ip kernel: [232563.847946] nvme :03:00.0: I/O 911 QID 
> >> 18 timeout, aborting
> >> Feb 21 19:38:34 ip kernel: [232563.847964] nvme :03:00.0: I/O 776 QID 
> >> 28 timeout, aborting
> >> Feb 21 19:38:36 ip ceph-osd[3241]: 2021-02-21 19:38:36.218 7f023b58f700 -1 
> >> osd.16 7868 get_health_metrics reporting 1 slow ops, oldest is 
> >> osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
> >> [create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
> >> ondisk+write+known_if_redirected+full_force e7868)
> >> Feb 21 19:38:36 ip kernel: [232565.851961] nvme :03:00.0: I/O 442 QID 
> >> 2 timeout, aborting
> >> Feb 21 19:38:36 ip kernel: [232565.851982] nvme :03:00.0: I/O 912 QID 
> >> 18 timeout, aborting
> >> Feb 21 19:38:37 ip ceph-osd[3241]: 2021-02-21 19:38:37.254 7f023b58f700 -1 
> >> osd.16 7868 get_health_metrics reporting 1 slow ops, oldest is 
> >> osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
> >> [create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
> >> ondisk+write+known_if_redirected+full_force e7868)
> >> Feb 21 19:38:38 ip ceph-osd[3241]: 2021-02-21 19:38:38.286 7f023b58f700 -1 
> >> osd.16 7868 get_health_metrics reporting 1 slow ops, oldest is 
> >> osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
> >> [create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
> >> ondisk+write+known_if_redirected+full_force e7868)
> >> Feb 21 19:38:39 ip ceph-osd[3241]: 2021-02-21 19:38:39.334 7f023b58f700 -1 
> >> osd.16 7868 get_health_metrics reporting 1 slow ops, oldest is 
> >> osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
> >> [create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
> >> ondisk+write+known_if_redirected+full_force e7868)
> >> Feb 21 19:38:40 ip ceph-osd[3241]: 2021-02-21 19:38:40.322 7f023b58f700 -1 
> >> osd.16 7868 get_health_metrics reporting 1 slow ops, oldest is 
> >> osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
> >> [create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
> >> ondisk+write+known_if_redirected+full_force e7868)
> >> Feb 21 19:38:41 ip ceph-osd[3241]: 2021-02-21 19:38:41.326 7f023b58f700 -1 
> >> osd.16 7868 get_health_metrics reporting 1 slow ops, oldest is 
> >> osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
> >> [create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
> >> ondisk+write+known_if_redirected+full_force e7868)
> >> Feb 21 19:38:41 ip kernel: [232570.852035] nvme :03:00.0: I/O 860 QID 
> >> 9 timeout, aborting
> >> Feb 21 19:38:42 ip ceph-osd[3241]: 2021-02-21 19:38:42.298 7f023b58f700 -1 
> >> osd.16 7868 get_health_metrics reporting 1 slow ops, oldest is 
> >> osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
> >> [create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
> >> ondisk+write+known_if_redirected+full_force e7868)
> >> Feb 21 19:38:43 ip ceph-osd[3241]: 2021-02-21 19:38:43.258 7f023b58f700 -1 
> >> osd.16 7868 get_health_metrics reporting 1 slow ops, oldest is 
> >> osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
> >> [create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
> >> 

[ceph-users] Re: Ceph nvme timeout and then aborting

2021-02-22 Thread Marc
So on the disks that crash anyway, do the fio test. If the drive crashes, you will know it 
has nothing to do with Ceph. If it does not crash, you will probably get a poor fio result, 
which would explain the problems with Ceph.

This is what someone wrote in the past. If you did not do your research on 
drives, I think it is probably your drives.

" just throw away your crappy Samsung SSD 860 Pro "
https://www.mail-archive.com/ceph-users@ceph.io/msg06820.html
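For a quick single-disk check, something along these lines (as described on the Ceph_performance wiki linked earlier in this thread) usually tells you enough; it writes directly to /dev/sdX (a placeholder), so only run it on a drive with no data:

  fio --name=sync-4k-randwrite --filename=/dev/sdX --direct=1 --fsync=1 \
      --rw=randwrite --bs=4k --iodepth=1 --numjobs=1 --runtime=60 --time_based

Consumer drives without power-loss protection often drop to a few hundred or a few thousand IOPS on this test, while enterprise SSDs with PLP stay far higher.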



> -Original Message-
> From: zxcs 
> Sent: 22 February 2021 13:10
> To: Marc 
> Cc: Mark Lehrer ; Konstantin Shalygin
> ; ceph-users 
> Subject: Re: [ceph-users] Ceph nvme timeout and then aborting
> 
> Haven’t do any fio test for single  disk , but did fio for the ceph
> cluster, actually the cluster has 12 nodes, and each node has same
> disks(means, 2 nvmes for cache, and 3 ssds as osd, 4 hdds also as osd).
> Only two nodes has such problem. And these two nodes are crash many
> times(at least 4 times). The others are good.  So it strange.
> This cluster has run more than half years.
> 
> 
> Thanks,
> zx
> 
> > On 22 Feb 2021, at 18:37, Marc wrote:
> >
> > Don't you have problems, just because the Samsung 970 PRO is not
> suitable for this? Have you run fio tests to make sure it would work ok?
> >
> > https://yourcmc.ru/wiki/Ceph_performance
> > https://docs.google.com/spreadsheets/d/1E9-eXjzsKboiCCX-
> 0u0r5fAjjufLKayaut_FOPxYZjc/edit#gid=0
> >
> >
> >
> >> -----Original Message-
> >> Sent: 22 February 2021 03:16
> >> us...@ceph.io>
> >> Subject: [ceph-users] Re: Ceph nvme timeout and then aborting
> >>
> >> Thanks for you reply!
> >>
> >> Yes, it a Nvme, and on node has two Nvmes as db/wal, one for ssd(0-2)
> >> and another for hdd(3-6).
> >> I have no spare to try.
> >> It’s  very strange, the load not very high at that time. and both ssd
> >> and nvme seems healthy.
> >>
> >> If cannot fix it.  I am afraid I need to setup more nodes and set out
> >> remove these OSDs which using this Nvme?
> >>
> >> Thanks,
> >> zx
> >>
> >
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph nvme timeout and then aborting

2021-02-22 Thread zxcs
I haven't done a fio test on a single disk, but I did run fio against the Ceph cluster. The 
cluster has 12 nodes, and each node has the same disks (2 NVMes for cache, 3 SSDs as OSDs 
and 4 HDDs also as OSDs).
Only two nodes have this problem, and those two nodes have crashed many times (at least 4 
times). The others are fine, so it is strange.
This cluster has been running for more than half a year.


Thanks,
zx

> On 22 Feb 2021, at 18:37, Marc wrote:
> 
> Don't you have problems, just because the Samsung 970 PRO is not suitable for 
> this? Have you run fio tests to make sure it would work ok?
> 
> https://yourcmc.ru/wiki/Ceph_performance
> https://docs.google.com/spreadsheets/d/1E9-eXjzsKboiCCX-0u0r5fAjjufLKayaut_FOPxYZjc/edit#gid=0
> 
> 
> 
>> -Original Message-
>> Sent: 22 February 2021 03:16
>> us...@ceph.io>
>> Subject: [ceph-users] Re: Ceph nvme timeout and then aborting
>> 
>> Thanks for you reply!
>> 
>> Yes, it a Nvme, and on node has two Nvmes as db/wal, one for ssd(0-2)
>> and another for hdd(3-6).
>> I have no spare to try.
>> It’s  very strange, the load not very high at that time. and both ssd
>> and nvme seems healthy.
>> 
>> If cannot fix it.  I am afraid I need to setup more nodes and set out
>> remove these OSDs which using this Nvme?
>> 
>> Thanks,
>> zx
>> 
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph nvme timeout and then aborting

2021-02-22 Thread Marc
Aren't your problems simply because the Samsung 970 PRO is not suitable for this? Have you 
run fio tests to make sure it would work OK?

https://yourcmc.ru/wiki/Ceph_performance
https://docs.google.com/spreadsheets/d/1E9-eXjzsKboiCCX-0u0r5fAjjufLKayaut_FOPxYZjc/edit#gid=0



> -Original Message-
> Sent: 22 February 2021 03:16
> us...@ceph.io>
> Subject: [ceph-users] Re: Ceph nvme timeout and then aborting
> 
> Thanks for you reply!
> 
> Yes, it a Nvme, and on node has two Nvmes as db/wal, one for ssd(0-2)
> and another for hdd(3-6).
> I have no spare to try.
> It’s  very strange, the load not very high at that time. and both ssd
> and nvme seems healthy.
> 
> If cannot fix it.  I am afraid I need to setup more nodes and set out
> remove these OSDs which using this Nvme?
> 
> Thanks,
> zx
> 

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph nvme timeout and then aborting

2021-02-21 Thread zxcs
Thanks for your reply!

Yes, it is an NVMe, and each node has two NVMes for DB/WAL, one for the SSDs (OSDs 0-2) and 
another for the HDDs (OSDs 3-6).
I have no spare to try.
It's very strange; the load was not very high at that time, and both the SSDs and the NVMe 
seem healthy.

If it cannot be fixed, I am afraid I will need to set up more nodes and then remove the 
OSDs that use this NVMe?

Thanks,
zx


> On 22 Feb 2021, at 10:07, Mark Lehrer wrote:
> 
>> One nvme  sudden crash again. Could anyone please help shed some light here?
> 
> It looks like a flaky NVMe drive.  Do you have a spare to try?
> 
> 
> On Mon, Feb 22, 2021 at 1:56 AM zxcs  wrote:
>> 
>> One nvme  sudden crash again. Could anyone please help shed some light here? 
>> Thank a ton!!!
>> Below are syslog and ceph log.
>> 
>> From  /var/log/syslog
>> Feb 21 19:38:33 ip kernel: [232562.847916] nvme :03:00.0: I/O 943 QID 7 
>> timeout, aborting
>> Feb 21 19:38:34 ip kernel: [232563.847946] nvme :03:00.0: I/O 911 QID 18 
>> timeout, aborting
>> Feb 21 19:38:34 ip kernel: [232563.847964] nvme :03:00.0: I/O 776 QID 28 
>> timeout, aborting
>> Feb 21 19:38:36 ip ceph-osd[3241]: 2021-02-21 19:38:36.218 7f023b58f700 -1 
>> osd.16 7868 get_health_metrics reporting 1 slow ops, oldest is 
>> osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
>> [create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
>> ondisk+write+known_if_redirected+full_force e7868)
>> Feb 21 19:38:36 ip kernel: [232565.851961] nvme :03:00.0: I/O 442 QID 2 
>> timeout, aborting
>> Feb 21 19:38:36 ip kernel: [232565.851982] nvme :03:00.0: I/O 912 QID 18 
>> timeout, aborting
>> Feb 21 19:38:37 ip ceph-osd[3241]: 2021-02-21 19:38:37.254 7f023b58f700 -1 
>> osd.16 7868 get_health_metrics reporting 1 slow ops, oldest is 
>> osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
>> [create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
>> ondisk+write+known_if_redirected+full_force e7868)
>> Feb 21 19:38:38 ip ceph-osd[3241]: 2021-02-21 19:38:38.286 7f023b58f700 -1 
>> osd.16 7868 get_health_metrics reporting 1 slow ops, oldest is 
>> osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
>> [create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
>> ondisk+write+known_if_redirected+full_force e7868)
>> Feb 21 19:38:39 ip ceph-osd[3241]: 2021-02-21 19:38:39.334 7f023b58f700 -1 
>> osd.16 7868 get_health_metrics reporting 1 slow ops, oldest is 
>> osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
>> [create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
>> ondisk+write+known_if_redirected+full_force e7868)
>> Feb 21 19:38:40 ip ceph-osd[3241]: 2021-02-21 19:38:40.322 7f023b58f700 -1 
>> osd.16 7868 get_health_metrics reporting 1 slow ops, oldest is 
>> osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
>> [create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
>> ondisk+write+known_if_redirected+full_force e7868)
>> Feb 21 19:38:41 ip ceph-osd[3241]: 2021-02-21 19:38:41.326 7f023b58f700 -1 
>> osd.16 7868 get_health_metrics reporting 1 slow ops, oldest is 
>> osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
>> [create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
>> ondisk+write+known_if_redirected+full_force e7868)
>> Feb 21 19:38:41 ip kernel: [232570.852035] nvme :03:00.0: I/O 860 QID 9 
>> timeout, aborting
>> Feb 21 19:38:42 ip ceph-osd[3241]: 2021-02-21 19:38:42.298 7f023b58f700 -1 
>> osd.16 7868 get_health_metrics reporting 1 slow ops, oldest is 
>> osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
>> [create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
>> ondisk+write+known_if_redirected+full_force e7868)
>> Feb 21 19:38:43 ip ceph-osd[3241]: 2021-02-21 19:38:43.258 7f023b58f700 -1 
>> osd.16 7868 get_health_metrics reporting 1 slow ops, oldest is 
>> osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
>> [create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
>> ondisk+write+known_if_redirected+full_force e7868)
>> Feb 21 19:38:44 ip ceph-osd[3241]: 2021-02-21 19:38:44.258 7f023b58f700 -1 
>> osd.16 7868 get_health_metrics reporting 2 slow ops, oldest is 
>> osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
>> [create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
>> ondisk+write+known_if_redirected+full_force e7868)
>> Feb 21 19:38:45 ip ntpd[3480]: Soliciting pool server 84.16.67.12
>> Feb 21 19:38:45 ip ceph-osd[3241]: 2021-02-21 19:38:45.286 7f023b58f700 -1 
>> osd.16 7868 get_health_metrics reporting 2 slow ops, oldest is 
>> osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
>> [create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
>> ondisk+write+known_if_redirected+full_force e7868)
>> Feb 21 19:38:46 ip ceph-osd[3241]: 2021-02-21 19:38:46.254 7f023b58f700 -1 
>> osd.16 

[ceph-users] Re: Ceph nvme timeout and then aborting

2021-02-21 Thread Mark Lehrer
> One nvme  sudden crash again. Could anyone please help shed some light here?

It looks like a flaky NVMe drive.  Do you have a spare to try?
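
If a replacement takes a while, one workaround that gets suggested for these "I/O ... timeout, aborting" / "reset controller" messages is raising the NVMe command timeout (only a sketch; it gives a marginal device more time and does not fix anything):

  cat /sys/module/nvme_core/parameters/io_timeout          # current per-command timeout, default 30s
  echo 255 > /sys/module/nvme_core/parameters/io_timeout   # raise it at runtime (where the parameter is writable)
  # or persistently on the kernel command line: nvme_core.io_timeout=255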


On Mon, Feb 22, 2021 at 1:56 AM zxcs  wrote:
>
> One nvme  sudden crash again. Could anyone please help shed some light here? 
> Thank a ton!!!
> Below are syslog and ceph log.
>
> From  /var/log/syslog
> Feb 21 19:38:33 ip kernel: [232562.847916] nvme :03:00.0: I/O 943 QID 7 
> timeout, aborting
> Feb 21 19:38:34 ip kernel: [232563.847946] nvme :03:00.0: I/O 911 QID 18 
> timeout, aborting
> Feb 21 19:38:34 ip kernel: [232563.847964] nvme :03:00.0: I/O 776 QID 28 
> timeout, aborting
> Feb 21 19:38:36 ip ceph-osd[3241]: 2021-02-21 19:38:36.218 7f023b58f700 -1 
> osd.16 7868 get_health_metrics reporting 1 slow ops, oldest is 
> osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
> [create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
> ondisk+write+known_if_redirected+full_force e7868)
> Feb 21 19:38:36 ip kernel: [232565.851961] nvme :03:00.0: I/O 442 QID 2 
> timeout, aborting
> Feb 21 19:38:36 ip kernel: [232565.851982] nvme :03:00.0: I/O 912 QID 18 
> timeout, aborting
> Feb 21 19:38:37 ip ceph-osd[3241]: 2021-02-21 19:38:37.254 7f023b58f700 -1 
> osd.16 7868 get_health_metrics reporting 1 slow ops, oldest is 
> osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
> [create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
> ondisk+write+known_if_redirected+full_force e7868)
> Feb 21 19:38:38 ip ceph-osd[3241]: 2021-02-21 19:38:38.286 7f023b58f700 -1 
> osd.16 7868 get_health_metrics reporting 1 slow ops, oldest is 
> osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
> [create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
> ondisk+write+known_if_redirected+full_force e7868)
> Feb 21 19:38:39 ip ceph-osd[3241]: 2021-02-21 19:38:39.334 7f023b58f700 -1 
> osd.16 7868 get_health_metrics reporting 1 slow ops, oldest is 
> osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
> [create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
> ondisk+write+known_if_redirected+full_force e7868)
> Feb 21 19:38:40 ip ceph-osd[3241]: 2021-02-21 19:38:40.322 7f023b58f700 -1 
> osd.16 7868 get_health_metrics reporting 1 slow ops, oldest is 
> osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
> [create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
> ondisk+write+known_if_redirected+full_force e7868)
> Feb 21 19:38:41 ip ceph-osd[3241]: 2021-02-21 19:38:41.326 7f023b58f700 -1 
> osd.16 7868 get_health_metrics reporting 1 slow ops, oldest is 
> osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
> [create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
> ondisk+write+known_if_redirected+full_force e7868)
> Feb 21 19:38:41 ip kernel: [232570.852035] nvme :03:00.0: I/O 860 QID 9 
> timeout, aborting
> Feb 21 19:38:42 ip ceph-osd[3241]: 2021-02-21 19:38:42.298 7f023b58f700 -1 
> osd.16 7868 get_health_metrics reporting 1 slow ops, oldest is 
> osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
> [create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
> ondisk+write+known_if_redirected+full_force e7868)
> Feb 21 19:38:43 ip ceph-osd[3241]: 2021-02-21 19:38:43.258 7f023b58f700 -1 
> osd.16 7868 get_health_metrics reporting 1 slow ops, oldest is 
> osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
> [create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
> ondisk+write+known_if_redirected+full_force e7868)
> Feb 21 19:38:44 ip ceph-osd[3241]: 2021-02-21 19:38:44.258 7f023b58f700 -1 
> osd.16 7868 get_health_metrics reporting 2 slow ops, oldest is 
> osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
> [create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
> ondisk+write+known_if_redirected+full_force e7868)
> Feb 21 19:38:45 ip ntpd[3480]: Soliciting pool server 84.16.67.12
> Feb 21 19:38:45 ip ceph-osd[3241]: 2021-02-21 19:38:45.286 7f023b58f700 -1 
> osd.16 7868 get_health_metrics reporting 2 slow ops, oldest is 
> osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
> [create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
> ondisk+write+known_if_redirected+full_force e7868)
> Feb 21 19:38:46 ip ceph-osd[3241]: 2021-02-21 19:38:46.254 7f023b58f700 -1 
> osd.16 7868 get_health_metrics reporting 2 slow ops, oldest is 
> osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
> [create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
> ondisk+write+known_if_redirected+full_force e7868)
> Feb 21 19:38:47 ip ceph-osd[3241]: 2021-02-21 19:38:47.226 7f023b58f700 -1 
> osd.16 7868 get_health_metrics reporting 2 slow ops, oldest is 
> osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
> [create,setxattr 

[ceph-users] Re: Ceph nvme timeout and then aborting

2021-02-21 Thread zxcs
One NVMe suddenly crashed again. Could anyone please help shed some light here? Thanks a ton!!!
Below are the syslog and Ceph logs.

From  /var/log/syslog
Feb 21 19:38:33 ip kernel: [232562.847916] nvme :03:00.0: I/O 943 QID 7 
timeout, aborting
Feb 21 19:38:34 ip kernel: [232563.847946] nvme :03:00.0: I/O 911 QID 18 
timeout, aborting
Feb 21 19:38:34 ip kernel: [232563.847964] nvme :03:00.0: I/O 776 QID 28 
timeout, aborting
Feb 21 19:38:36 ip ceph-osd[3241]: 2021-02-21 19:38:36.218 7f023b58f700 -1 
osd.16 7868 get_health_metrics reporting 1 slow ops, oldest is 
osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
[create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
ondisk+write+known_if_redirected+full_force e7868)
Feb 21 19:38:36 ip kernel: [232565.851961] nvme :03:00.0: I/O 442 QID 2 
timeout, aborting
Feb 21 19:38:36 ip kernel: [232565.851982] nvme :03:00.0: I/O 912 QID 18 
timeout, aborting
Feb 21 19:38:37 ip ceph-osd[3241]: 2021-02-21 19:38:37.254 7f023b58f700 -1 
osd.16 7868 get_health_metrics reporting 1 slow ops, oldest is 
osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
[create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
ondisk+write+known_if_redirected+full_force e7868)
Feb 21 19:38:38 ip ceph-osd[3241]: 2021-02-21 19:38:38.286 7f023b58f700 -1 
osd.16 7868 get_health_metrics reporting 1 slow ops, oldest is 
osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
[create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
ondisk+write+known_if_redirected+full_force e7868)
Feb 21 19:38:39 ip ceph-osd[3241]: 2021-02-21 19:38:39.334 7f023b58f700 -1 
osd.16 7868 get_health_metrics reporting 1 slow ops, oldest is 
osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
[create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
ondisk+write+known_if_redirected+full_force e7868)
Feb 21 19:38:40 ip ceph-osd[3241]: 2021-02-21 19:38:40.322 7f023b58f700 -1 
osd.16 7868 get_health_metrics reporting 1 slow ops, oldest is 
osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
[create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
ondisk+write+known_if_redirected+full_force e7868)
Feb 21 19:38:41 ip ceph-osd[3241]: 2021-02-21 19:38:41.326 7f023b58f700 -1 
osd.16 7868 get_health_metrics reporting 1 slow ops, oldest is 
osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
[create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
ondisk+write+known_if_redirected+full_force e7868)
Feb 21 19:38:41 ip kernel: [232570.852035] nvme :03:00.0: I/O 860 QID 9 
timeout, aborting
Feb 21 19:38:42 ip ceph-osd[3241]: 2021-02-21 19:38:42.298 7f023b58f700 -1 
osd.16 7868 get_health_metrics reporting 1 slow ops, oldest is 
osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
[create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
ondisk+write+known_if_redirected+full_force e7868)
Feb 21 19:38:43 ip ceph-osd[3241]: 2021-02-21 19:38:43.258 7f023b58f700 -1 
osd.16 7868 get_health_metrics reporting 1 slow ops, oldest is 
osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
[create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
ondisk+write+known_if_redirected+full_force e7868)
Feb 21 19:38:44 ip ceph-osd[3241]: 2021-02-21 19:38:44.258 7f023b58f700 -1 
osd.16 7868 get_health_metrics reporting 2 slow ops, oldest is 
osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
[create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
ondisk+write+known_if_redirected+full_force e7868)
Feb 21 19:38:45 ip ntpd[3480]: Soliciting pool server 84.16.67.12
Feb 21 19:38:45 ip ceph-osd[3241]: 2021-02-21 19:38:45.286 7f023b58f700 -1 
osd.16 7868 get_health_metrics reporting 2 slow ops, oldest is 
osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
[create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
ondisk+write+known_if_redirected+full_force e7868)
Feb 21 19:38:46 ip ceph-osd[3241]: 2021-02-21 19:38:46.254 7f023b58f700 -1 
osd.16 7868 get_health_metrics reporting 2 slow ops, oldest is 
osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
[create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
ondisk+write+known_if_redirected+full_force e7868)
Feb 21 19:38:47 ip ceph-osd[3241]: 2021-02-21 19:38:47.226 7f023b58f700 -1 
osd.16 7868 get_health_metrics reporting 2 slow ops, oldest is 
osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
[create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
ondisk+write+known_if_redirected+full_force e7868)
Feb 21 19:39:04 ip kernel: [232593.860464] nvme :03:00.0: I/O 943 QID 7 
timeout, reset controller
Feb 21 19:39:33 ip kernel: [232622.868975] nvme :03:00.0: I/O 0 QID 0 
timeout, reset controller
Feb 21 19:40:35 ip ceph-osd[3241]: 2021-02-21 

[ceph-users] Re: Ceph nvme timeout and then aborting

2021-02-19 Thread Konstantin Shalygin
Looks good, what is your hardware? Server model & NVMes?
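
Also worth pulling the controller's own error log and the kernel messages, for example (assuming nvme-cli is installed; the device path is a placeholder):

  nvme error-log /dev/nvme0n1                         # controller-side error log entries
  nvme id-ctrl /dev/nvme0n1 | grep -E '^(mn|sn|fr) '  # model, serial and firmware revision
  dmesg | grep -i nvme                                # kernel-side timeouts / controller resets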



k

> On 19 Feb 2021, at 13:22, zxcs  wrote:
> 
> BTW, I actually have two nodes with the same issue, and the other failing node's NVMe 
> output is below:
> 
> Smart Log for NVME device:nvme0n1 namespace-id:
> critical_warning: 0
> temperature : 29 C
> available_spare : 100%
> available_spare_threshold   : 10%
> percentage_used : 1%
> data_units_read : 592,340,175
> data_units_written  : 26,443,352
> host_read_commands  : 5,341,278,662
> host_write_commands : 515,730,885
> controller_busy_time: 14,052
> power_cycles: 8
> power_on_hours  : 4,294
> unsafe_shutdowns: 6
> media_errors: 0
> num_err_log_entries : 0
> Warning Temperature Time: 0
> Critical Composite Temperature Time : 0
> Temperature Sensor 1: 29 C
> Temperature Sensor 2: 46 C
> Temperature Sensor 3: 0 C
> Temperature Sensor 4: 0 C
> Temperature Sensor 5: 0 C
> Temperature Sensor 6: 0 C
> Temperature Sensor 7: 0 C
> Temperature Sensor 8: 0 C
> 
> 
> For comparison, here is the NVMe output from one healthy node:
> 
> Smart Log for NVME device:nvme0n1 namespace-id:
> critical_warning: 0
> temperature : 27 C
> available_spare : 100%
> available_spare_threshold   : 10%
> percentage_used : 1%
> data_units_read : 579,829,652
> data_units_written  : 28,271,336
> host_read_commands  : 5,237,750,233
> host_write_commands : 518,979,861
> controller_busy_time: 14,166
> power_cycles: 3
> power_on_hours  : 4,252
> unsafe_shutdowns: 1
> media_errors: 0
> num_err_log_entries : 0
> Warning Temperature Time: 0
> Critical Composite Temperature Time : 0
> Temperature Sensor 1: 27 C
> Temperature Sensor 2: 39 C
> Temperature Sensor 3: 0 C
> Temperature Sensor 4: 0 C
> Temperature Sensor 5: 0 C
> Temperature Sensor 6: 0 C
> Temperature Sensor 7: 0 C
> Temperature Sensor 8: 0 C

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph nvme timeout and then aborting

2021-02-19 Thread zxcs
BTW, I actually have two nodes with the same issue, and the other failing node's NVMe 
output is below:

Smart Log for NVME device:nvme0n1 namespace-id:
critical_warning: 0
temperature : 29 C
available_spare : 100%
available_spare_threshold   : 10%
percentage_used : 1%
data_units_read : 592,340,175
data_units_written  : 26,443,352
host_read_commands  : 5,341,278,662
host_write_commands : 515,730,885
controller_busy_time: 14,052
power_cycles: 8
power_on_hours  : 4,294
unsafe_shutdowns: 6
media_errors: 0
num_err_log_entries : 0
Warning Temperature Time: 0
Critical Composite Temperature Time : 0
Temperature Sensor 1: 29 C
Temperature Sensor 2: 46 C
Temperature Sensor 3: 0 C
Temperature Sensor 4: 0 C
Temperature Sensor 5: 0 C
Temperature Sensor 6: 0 C
Temperature Sensor 7: 0 C
Temperature Sensor 8: 0 C


For comparison, here is the NVMe output from one healthy node:

Smart Log for NVME device:nvme0n1 namespace-id:
critical_warning: 0
temperature : 27 C
available_spare : 100%
available_spare_threshold   : 10%
percentage_used : 1%
data_units_read : 579,829,652
data_units_written  : 28,271,336
host_read_commands  : 5,237,750,233
host_write_commands : 518,979,861
controller_busy_time: 14,166
power_cycles: 3
power_on_hours  : 4,252
unsafe_shutdowns: 1
media_errors: 0
num_err_log_entries : 0
Warning Temperature Time: 0
Critical Composite Temperature Time : 0
Temperature Sensor 1: 27 C
Temperature Sensor 2: 39 C
Temperature Sensor 3: 0 C
Temperature Sensor 4: 0 C
Temperature Sensor 5: 0 C
Temperature Sensor 6: 0 C
Temperature Sensor 7: 0 C
Temperature Sensor 8: 0 C


Thanks,
zx


> On 19 Feb 2021, at 18:08, zxcs wrote:
> 
> Thank you very much, Konstantin!
> 
> Here is the output of `nvme smart-log /dev/nvme0n1`
> 
> Smart Log for NVME device:nvme0n1 namespace-id:
> critical_warning: 0
> temperature : 27 C
> available_spare : 100%
> available_spare_threshold   : 10%
> percentage_used : 1%
> data_units_read : 602,417,903
> data_units_written  : 24,350,864
> host_read_commands  : 5,610,227,794
> host_write_commands : 519,030,512
> controller_busy_time: 14,356
> power_cycles: 7
> power_on_hours  : 4,256
> unsafe_shutdowns: 5
> media_errors: 0
> num_err_log_entries : 0
> Warning Temperature Time: 0
> Critical Composite Temperature Time : 0
> Temperature Sensor 1: 27 C
> Temperature Sensor 2: 41 C
> Temperature Sensor 3: 0 C
> Temperature Sensor 4: 0 C
> Temperature Sensor 5: 0 C
> Temperature Sensor 6: 0 C
> Temperature Sensor 7: 0 C
> Temperature Sensor 8: 0 C
> 
> 
> Thanks,
> 
> zx
> 
>> On 19 Feb 2021, at 18:01, Konstantin Shalygin wrote:
>> 
>> Please paste your `nvme smart-log /dev/nvme0n1` output
>> 
>> 
>> 
>> k
>> 
>>> On 19 Feb 2021, at 12:53, zxcs wrote:
>>> 
>>> I have a Ceph cluster on Nautilus 14.2.10, and each node has 3 SSDs and 4 HDDs.
>>> Each node also has two NVMes as cache (nvme0n1 is the cache for SSDs 0-2 and 
>>> nvme1n1 the cache for HDDs 3-7).
>>> 
>>> But one node's nvme0n1 always hits the issues below (see the nvme ... I/O ... 
>>> timeout, aborting messages), and then this nvme0n1 suddenly disappears.
>>> After that I need to reboot the node to recover.
>>> Has anyone hit the same issue, and how can I solve it? Any suggestions are welcome. 
>>> Thanks in advance!
>>> I once googled the issue and found the link below, but it did not help:
>>> https://askubuntu.com/questions/981657/cannot-suspend-with-nvme-m-2-ssd 
>>> 

[ceph-users] Re: Ceph nvme timeout and then aborting

2021-02-19 Thread zxcs
Thank you very much, Konstantin!

Here is the output of `nvme smart-log /dev/nvme0n1`

Smart Log for NVME device:nvme0n1 namespace-id:
critical_warning: 0
temperature : 27 C
available_spare : 100%
available_spare_threshold   : 10%
percentage_used : 1%
data_units_read : 602,417,903
data_units_written  : 24,350,864
host_read_commands  : 5,610,227,794
host_write_commands : 519,030,512
controller_busy_time: 14,356
power_cycles: 7
power_on_hours  : 4,256
unsafe_shutdowns: 5
media_errors: 0
num_err_log_entries : 0
Warning Temperature Time: 0
Critical Composite Temperature Time : 0
Temperature Sensor 1: 27 C
Temperature Sensor 2: 41 C
Temperature Sensor 3: 0 C
Temperature Sensor 4: 0 C
Temperature Sensor 5: 0 C
Temperature Sensor 6: 0 C
Temperature Sensor 7: 0 C
Temperature Sensor 8: 0 C


Thanks,

zx

> On 19 Feb 2021, at 18:01, Konstantin Shalygin wrote:
> 
> Please paste your `nvme smart-log /dev/nvme0n1` output
> 
> 
> 
> k
> 
>> On 19 Feb 2021, at 12:53, zxcs wrote:
>> 
>> I have a Ceph cluster on Nautilus 14.2.10, and each node has 3 SSDs and 4 HDDs.
>> Each node also has two NVMes as cache (nvme0n1 is the cache for SSDs 0-2 and 
>> nvme1n1 the cache for HDDs 3-7).
>> 
>> But one node's nvme0n1 always hits the issues below (see the nvme ... I/O ... 
>> timeout, aborting messages), and then this nvme0n1 suddenly disappears.
>> After that I need to reboot the node to recover.
>> Has anyone hit the same issue, and how can I solve it? Any suggestions are welcome. 
>> Thanks in advance!
>> I once googled the issue and found the link below, but it did not help:
>> https://askubuntu.com/questions/981657/cannot-suspend-with-nvme-m-2-ssd 
>> 
>> From syslog
>> Feb 19 01:31:52 ip kernel: [1275313.393211] nvme :03:00.0: I/O 949 QID 
>> 12 timeout, aborting
>> Feb 19 01:31:53 ip kernel: [1275314.389232] nvme :03:00.0: I/O 728 QID 5 
>> timeout, aborting
>> Feb 19 01:31:53 ip kernel: [1275314.389247] nvme :03:00.0: I/O 515 QID 7 
>> timeout, aborting
>> Feb 19 01:31:53 ip kernel: [1275314.389252] nvme :03:00.0: I/O 516 QID 7 
>> timeout, aborting
>> Feb 19 01:31:53 ip kernel: [1275314.389257] nvme :03:00.0: I/O 517 QID 7 
>> timeout, aborting
>> Feb 19 01:31:53 ip kernel: [1275314.389263] nvme :03:00.0: I/O 82 QID 9 
>> timeout, aborting
>> Feb 19 01:31:53 ip kernel: [1275314.389271] nvme :03:00.0: I/O 853 QID 
>> 13 timeout, aborting
>> Feb 19 01:31:53 ip kernel: [1275314.389275] nvme :03:00.0: I/O 854 QID 
>> 13 timeout, aborting
>> Feb 19 01:32:23 ip kernel: [1275344.401708] nvme :03:00.0: I/O 728 QID 5 
>> timeout, reset controller
>> Feb 19 01:32:52 ip kernel: [1275373.394112] nvme :03:00.0: I/O 0 QID 0 
>> timeout, reset controller
>> Feb 19 01:33:53 ip ceph-osd[3179]: /build/ceph-14.2.10/src/common/HeartbeatMap.cc: In 
>> function 'bool ceph::HeartbeatMap::_check(const ceph::heartbeat_handle_d*, const char*, 
>> ceph::time_detail::coarse_mono_clock::rep)' thread 7f36c03fb700 time 
>> 2021-02-19 01:33:53.436018
>> Feb 19 01:33:53 ip ceph-osd[3179]: /build/ceph-14.2.10/src/common/HeartbeatMap.cc: 82: 
>> ceph_abort_msg("hit suicide timeout")
>> Feb 19 01:33:53 ip ceph-osd[3179]:  ceph version 14.2.10 
>> (b340acf629a010a74d90da5782a2c5fe0b54ac20) nautilus (stable)
>> Feb 19 01:33:53 ip ceph-osd[3179]:  1: (ceph::__ceph_abort(char const*, int, 
>> char const*, std::__cxx11::basic_string<char, std::char_traits<char>, 
>> std::allocator<char> > const&)+0xdf) [0x83eb8c]
>> Feb 19 01:33:53 ip ceph-osd[3179]:  2: 
>> (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, 
>> unsigned long)+0x4a5) [0xec56f5]
>> Feb 19 01:33:53 ip ceph-osd[3179]:  3: 
>> (ceph::HeartbeatMap::is_healthy()+0x106) [0xec6846]
>> Feb 19 01:33:53 ip ceph-osd[3179]:  4: 
>> (OSD::handle_osd_ping(MOSDPing*)+0x67c) [0x8aaf0c]
>> Feb 19 01:33:53 ip ceph-osd[3179]:  5: 
>> (OSD::heartbeat_dispatch(Message*)+0x1eb) [0x8b3f4b]