[ceph-users] OSD PGs are not being removed - Full OSD issues

2019-10-16 Thread Philippe D'Anjou
This is related to https://tracker.ceph.com/issues/42341 and to 
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-October/037017.html

After closer inspection yesterday we found that PGs are not being removed from 
OSDs, which then leads to nearfull errors and explains why reweights don't work. 
This is a BIG issue because I have to constantly intervene manually to keep the 
cluster from dying. This is 14.2.4, a fresh setup, all defaults. The PG balancer 
is turned off now; I'm beginning to wonder if it's at fault.

My crush map: https://termbin.com/3t8l
It was mentioned that the bucket weights are WEIRD; I never touched them. The 
crush weights that are unusual are for nearfull osd53, and some are set to 10 
from a previous manual intervention.
That the PGs are not being purged is one issue; the original issue is why on 
earth Ceph fills ONLY my nearfull OSDs in the first place. It seems to always 
select the fullest OSD to write more data onto. If I reweight it, it starts 
giving alerts for another almost-full OSD because it intends to write 
everything there, despite everything else being only at about 60%.
I don't know how to debug this; it's a MAJOR PITA.
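
For reference, a minimal set of commands to see how full each OSD is, which PGs 
the cluster still maps to a nearfull OSD, and which PGs are physically left on 
its disk (osd.53 is only used as an example here; the last command assumes the 
OSD is stopped and uses the default data path):

  ceph osd df tree          # utilization, weights and PG count per OSD
  ceph pg ls-by-osd 53      # PGs the cluster currently maps to osd.53
  ceph balancer status      # confirm the balancer really is off
  # with osd.53 stopped, list the PGs still present on its store:
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-53 --op list-pgs

Comparing the last two lists should show whether old PGs are really not being 
deleted from the OSD.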


Hope someone has an idea, because I can't fight this 24/7; I'm getting pretty 
tired of it.
Thanks


Re: [ceph-users] ceph iscsi question

2019-10-16 Thread 展荣臻(信泰)



> -----Original Message-----
> From: "Jason Dillaman" 
> Sent: 2019-10-17 09:54:30 (Thursday)
> To: "展荣臻(信泰)" 
> Cc: dillaman , ceph-users 
> Subject: Re: [ceph-users] ceph iscsi question
> 
> On Wed, Oct 16, 2019 at 9:52 PM 展荣臻(信泰)  wrote:
> >
> >
> >
> >
> > > -----Original Message-----
> > > From: "Jason Dillaman" 
> > > Sent: 2019-10-16 20:33:47 (Wednesday)
> > > To: "展荣臻(信泰)" 
> > > Cc: ceph-users 
> > > Subject: Re: [ceph-users] ceph iscsi question
> > >
> > > On Wed, Oct 16, 2019 at 2:35 AM 展荣臻(信泰)  wrote:
> > > >
> > > > hi, all
> > > >   we deploy ceph with ceph-ansible. OSDs, mons and the iscsi daemons run 
> > > > in docker.
> > > >   I created an iscsi target according to 
> > > > https://docs.ceph.com/docs/luminous/rbd/iscsi-target-cli/.
> > > >   I discovered and logged in to the iscsi target on another host, as shown below:
> > > >
> > > > [root@node1 tmp]# iscsiadm -m discovery -t sendtargets -p 192.168.42.110
> > > > 192.168.42.110:3260,1 iqn.2003-01.com.teamsun.iscsi-gw:iscsi-igw
> > > > 192.168.42.111:3260,2 iqn.2003-01.com.teamsun.iscsi-gw:iscsi-igw
> > > > [root@node1 tmp]# iscsiadm -m node -T 
> > > > iqn.2003-01.com.teamsun.iscsi-gw:iscsi-igw -p 192.168.42.110 -l
> > > > Logging in to [iface: default, target: 
> > > > iqn.2003-01.com.teamsun.iscsi-gw:iscsi-igw, portal: 
> > > > 192.168.42.110,3260] (multiple)
> > > > Login to [iface: default, target: 
> > > > iqn.2003-01.com.teamsun.iscsi-gw:iscsi-igw, portal: 
> > > > 192.168.42.110,3260] successful.
> > > >
> > > >  /dev/sde is mapped; when I run mkfs.xfs -f /dev/sde, an error occurs:
> > > >
> > > > [root@node1 tmp]# mkfs.xfs -f /dev/sde
> > > > meta-data=/dev/sde   isize=512agcount=4, agsize=1966080 
> > > > blks
> > > >  =   sectsz=512   attr=2, projid32bit=1
> > > >  =   crc=1finobt=0, sparse=0
> > > > data =   bsize=4096   blocks=7864320, imaxpct=25
> > > >  =   sunit=0  swidth=0 blks
> > > > naming   =version 2  bsize=4096   ascii-ci=0 ftype=1
> > > > log  =internal log   bsize=4096   blocks=3840, version=2
> > > >  =   sectsz=512   sunit=0 blks, lazy-count=1
> > > > realtime =none   extsz=4096   blocks=0, rtextents=0
> > > > existing superblock read failed: Input/output error
> > > > mkfs.xfs: pwrite64 failed: Input/output error
> > > >
> > > > message in /var/log/messages:
> > > > Oct 16 14:01:44 localhost kernel: Dev sde: unable to read RDB block 0
> > > > Oct 16 14:01:44 localhost kernel: sde: unable to read partition table
> > > > Oct 16 14:02:17 localhost kernel: Dev sde: unable to read RDB block 0
> > > > Oct 16 14:02:17 localhost kernel: sde: unable to read partition table
> > > >
> > > > We use Luminous Ceph.
> > > > What causes this error? How can I debug it? Any suggestion is appreciated.
> > >
> > > Please use the associated multipath device, not the raw block device.
> > >
> > hi, Jason
> >   Thanks for your reply.
> >   The multipath device gives the same error as the raw block device.
> >
> 
> What does "multipath -ll" show?
> 
[root@node1 ~]# multipath -ll
mpathf (36001405366100aeda2044f286329b57a) dm-2 LIO-ORG ,TCMU device 
size=30G features='0' hwhandler='0' wp=rw
|-+- policy='service-time 0' prio=0 status=enabled
| `- 13:0:0:0 sde 8:64 failed faulty running
`-+- policy='service-time 0' prio=0 status=enabled
  `- 14:0:0:0 sdf 8:80 failed faulty running
[root@node1 ~]# 

I don't know if it is related to the fact that all our daemons run in docker, 
while docker itself runs on kvm.
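
For reference, the Linux initiator section of the Ceph iSCSI documentation 
suggests a multipath.conf device entry for LIO-ORG TCMU devices roughly like 
the following (paraphrased from the Luminous-era docs, so double-check it 
against the version actually deployed), followed by reloading multipathd:

  devices {
          device {
                  vendor                 "LIO-ORG"
                  hardware_handler       "1 alua"
                  path_grouping_policy   "failover"
                  path_selector          "queue-length 0"
                  failback               60
                  path_checker           tur
                  prio                   alua
                  prio_args              exclusive_pref_bit
                  fast_io_fail_tmo       25
                  no_path_retry          queue
          }
  }

  systemctl reload multipathd

The hwhandler='0' and prio=0 in the output above would be consistent with such 
a device section not being applied, though that is only a guess.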









Re: [ceph-users] ceph iscsi question

2019-10-16 Thread Jason Dillaman
On Wed, Oct 16, 2019 at 9:52 PM 展荣臻(信泰)  wrote:
>
>
>
>
> > -----Original Message-----
> > From: "Jason Dillaman" 
> > Sent: 2019-10-16 20:33:47 (Wednesday)
> > To: "展荣臻(信泰)" 
> > Cc: ceph-users 
> > Subject: Re: [ceph-users] ceph iscsi question
> >
> > On Wed, Oct 16, 2019 at 2:35 AM 展荣臻(信泰)  wrote:
> > >
> > > hi, all
> > >   we deploy ceph with ceph-ansible. OSDs, mons and the iscsi daemons run in 
> > > docker.
> > >   I created an iscsi target according to 
> > > https://docs.ceph.com/docs/luminous/rbd/iscsi-target-cli/.
> > >   I discovered and logged in to the iscsi target on another host, as shown below:
> > >
> > > [root@node1 tmp]# iscsiadm -m discovery -t sendtargets -p 192.168.42.110
> > > 192.168.42.110:3260,1 iqn.2003-01.com.teamsun.iscsi-gw:iscsi-igw
> > > 192.168.42.111:3260,2 iqn.2003-01.com.teamsun.iscsi-gw:iscsi-igw
> > > [root@node1 tmp]# iscsiadm -m node -T 
> > > iqn.2003-01.com.teamsun.iscsi-gw:iscsi-igw -p 192.168.42.110 -l
> > > Logging in to [iface: default, target: 
> > > iqn.2003-01.com.teamsun.iscsi-gw:iscsi-igw, portal: 192.168.42.110,3260] 
> > > (multiple)
> > > Login to [iface: default, target: 
> > > iqn.2003-01.com.teamsun.iscsi-gw:iscsi-igw, portal: 192.168.42.110,3260] 
> > > successful.
> > >
> > >  /dev/sde is mapped; when I run mkfs.xfs -f /dev/sde, an error occurs:
> > >
> > > [root@node1 tmp]# mkfs.xfs -f /dev/sde
> > > meta-data=/dev/sde   isize=512agcount=4, agsize=1966080 
> > > blks
> > >  =   sectsz=512   attr=2, projid32bit=1
> > >  =   crc=1finobt=0, sparse=0
> > > data =   bsize=4096   blocks=7864320, imaxpct=25
> > >  =   sunit=0  swidth=0 blks
> > > naming   =version 2  bsize=4096   ascii-ci=0 ftype=1
> > > log  =internal log   bsize=4096   blocks=3840, version=2
> > >  =   sectsz=512   sunit=0 blks, lazy-count=1
> > > realtime =none   extsz=4096   blocks=0, rtextents=0
> > > existing superblock read failed: Input/output error
> > > mkfs.xfs: pwrite64 failed: Input/output error
> > >
> > > message in /var/log/messages:
> > > Oct 16 14:01:44 localhost kernel: Dev sde: unable to read RDB block 0
> > > Oct 16 14:01:44 localhost kernel: sde: unable to read partition table
> > > Oct 16 14:02:17 localhost kernel: Dev sde: unable to read RDB block 0
> > > Oct 16 14:02:17 localhost kernel: sde: unable to read partition table
> > >
> > > We use Luminous Ceph.
> > > What causes this error? How can I debug it? Any suggestion is appreciated.
> >
> > Please use the associated multipath device, not the raw block device.
> >
> hi, Jason
>   Thanks for your reply.
>   The multipath device gives the same error as the raw block device.
>

What does "multipath -ll" show?

>
>


-- 
Jason


Re: [ceph-users] ceph iscsi question

2019-10-16 Thread 展荣臻(信泰)



> -----Original Message-----
> From: "Jason Dillaman" 
> Sent: 2019-10-16 20:33:47 (Wednesday)
> To: "展荣臻(信泰)" 
> Cc: ceph-users 
> Subject: Re: [ceph-users] ceph iscsi question
> 
> On Wed, Oct 16, 2019 at 2:35 AM 展荣臻(信泰)  wrote:
> >
> > hi, all
> >   we deploy ceph with ceph-ansible. OSDs, mons and the iscsi daemons run in 
> > docker.
> >   I created an iscsi target according to 
> > https://docs.ceph.com/docs/luminous/rbd/iscsi-target-cli/.
> >   I discovered and logged in to the iscsi target on another host, as shown below:
> >
> > [root@node1 tmp]# iscsiadm -m discovery -t sendtargets -p 192.168.42.110
> > 192.168.42.110:3260,1 iqn.2003-01.com.teamsun.iscsi-gw:iscsi-igw
> > 192.168.42.111:3260,2 iqn.2003-01.com.teamsun.iscsi-gw:iscsi-igw
> > [root@node1 tmp]# iscsiadm -m node -T 
> > iqn.2003-01.com.teamsun.iscsi-gw:iscsi-igw -p 192.168.42.110 -l
> > Logging in to [iface: default, target: 
> > iqn.2003-01.com.teamsun.iscsi-gw:iscsi-igw, portal: 192.168.42.110,3260] 
> > (multiple)
> > Login to [iface: default, target: 
> > iqn.2003-01.com.teamsun.iscsi-gw:iscsi-igw, portal: 192.168.42.110,3260] 
> > successful.
> >
> >  /dev/sde is mapped; when I run mkfs.xfs -f /dev/sde, an error occurs:
> >
> > [root@node1 tmp]# mkfs.xfs -f /dev/sde
> > meta-data=/dev/sde   isize=512agcount=4, agsize=1966080 blks
> >  =   sectsz=512   attr=2, projid32bit=1
> >  =   crc=1finobt=0, sparse=0
> > data =   bsize=4096   blocks=7864320, imaxpct=25
> >  =   sunit=0  swidth=0 blks
> > naming   =version 2  bsize=4096   ascii-ci=0 ftype=1
> > log  =internal log   bsize=4096   blocks=3840, version=2
> >  =   sectsz=512   sunit=0 blks, lazy-count=1
> > realtime =none   extsz=4096   blocks=0, rtextents=0
> > existing superblock read failed: Input/output error
> > mkfs.xfs: pwrite64 failed: Input/output error
> >
> > message in /var/log/messages:
> > Oct 16 14:01:44 localhost kernel: Dev sde: unable to read RDB block 0
> > Oct 16 14:01:44 localhost kernel: sde: unable to read partition table
> > Oct 16 14:02:17 localhost kernel: Dev sde: unable to read RDB block 0
> > Oct 16 14:02:17 localhost kernel: sde: unable to read partition table
> >
> > We use Luminous Ceph.
> > What causes this error? How can I debug it? Any suggestion is appreciated.
> 
> Please use the associated multipath device, not the raw block device.
> 
hi, Jason
  Thanks for your reply.
  The multipath device gives the same error as the raw block device.






Re: [ceph-users] Openstack VM IOPS drops dramatically during Ceph recovery

2019-10-16 Thread Robert LeBlanc
On Wed, Oct 16, 2019 at 11:53 AM huxia...@horebdata.cn
 wrote:
>
> My Ceph version is Luminous 12.2.12. Do you think I should upgrade to 
> Nautilus, or will Nautilus have better control of recovery/backfilling?

We have a Jewel cluster and a Luminous cluster that we have changed
these settings on, and it really helped both of them.
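
For anyone looking for the concrete steps, a minimal sketch of applying those
settings (the OSD id in the restart command is just a placeholder; restart
OSDs one at a time and wait for the cluster to become healthy in between):

  # /etc/ceph/ceph.conf on every OSD host, in the [osd] section
  [osd]
  osd op queue = wpq
  osd op queue cut off = high

  # then restart each OSD on that host, one at a time
  systemctl restart ceph-osd@<id>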


Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


Re: [ceph-users] Openstack VM IOPS drops dramatically during Ceph recovery

2019-10-16 Thread huxia...@horebdata.cn
My Ceph version is Luminous 12.2.12. Do you think I should upgrade to 
Nautilus, or will Nautilus have better control of recovery/backfilling?

best regards,

Samuel



huxia...@horebdata.cn
 
From: Robert LeBlanc
Date: 2019-10-14 16:27
To: huxia...@horebdata.cn
CC: ceph-users
Subject: Re: [ceph-users] Openstack VM IOPS drops dramatically during Ceph 
recovery
On Thu, Oct 10, 2019 at 2:23 PM huxia...@horebdata.cn
 wrote:
>
> Hi, folks,
>
> I have a middle-sized Ceph cluster as cinder backup for openstack (queens). 
> During testing, one Ceph node went down unexpectedly and powered up again ca. 10 
> minutes later, and the Ceph cluster started PG recovery. To my surprise, VM IOPS 
> dropped dramatically during Ceph recovery, from ca. 13K IOPS to about 400, a 
> factor of 1/30, even though I had put stringent throttling on backfill and 
> recovery with the following ceph parameters:
>
> osd_max_backfills = 1
> osd_recovery_max_active = 1
> osd_client_op_priority=63
> osd_recovery_op_priority=1
> osd_recovery_sleep = 0.5
>
> The weirdest thing is:
> 1) when there is no IO activity from any VM (ALL VMs are quiet except the 
> recovery IO), the recovery bandwidth is ca. 10MiB/s, 2 objects/s. It seems the 
> recovery throttle settings are working properly.
> 2) when running FIO tests inside a VM, the recovery bandwidth goes up 
> quickly, reaching above 200MiB/s, 60 objects/s. FIO IOPS performance inside the 
> VM, however, is only about 400 IOPS (8KiB block size), around 3MiB/s. Obviously 
> recovery throttling DOES NOT work properly.
> 3) If I stop the FIO testing in the VM, the recovery bandwidth goes back down to 
> 10MiB/s, 2 objects/s, strangely enough.
>
> How can this weird behavior happen? I just wonder, is there a method to 
> configure recovery bandwidth to a specific value, or the number of recovery 
> objects per second? This may give better control of backfilling/recovery, 
> instead of relying on the apparently faulty logic of relative 
> osd_client_op_priority vs osd_recovery_op_priority.
>
> any ideas or suggests to make the recovery under control?
>
> best regards,
>
> Samuel
 
Not sure which version of Ceph you are on, but add these to your
/etc/ceph/ceph.conf on all your OSDs and restart them.
 
osd op queue = wpq
osd op queue cut off = high
 
That should really help and make backfills and recovery be
non-impactful. This will be the default in Octopus.
 

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
 


Re: [ceph-users] ceph iscsi question

2019-10-16 Thread Jason Dillaman
On Wed, Oct 16, 2019 at 2:35 AM 展荣臻(信泰)  wrote:
>
> hi, all
>   we deploy ceph with ceph-ansible. OSDs, mons and the iscsi daemons run in 
> docker.
>   I created an iscsi target according to 
> https://docs.ceph.com/docs/luminous/rbd/iscsi-target-cli/.
>   I discovered and logged in to the iscsi target on another host, as shown below:
>
> [root@node1 tmp]# iscsiadm -m discovery -t sendtargets -p 192.168.42.110
> 192.168.42.110:3260,1 iqn.2003-01.com.teamsun.iscsi-gw:iscsi-igw
> 192.168.42.111:3260,2 iqn.2003-01.com.teamsun.iscsi-gw:iscsi-igw
> [root@node1 tmp]# iscsiadm -m node -T 
> iqn.2003-01.com.teamsun.iscsi-gw:iscsi-igw -p 192.168.42.110 -l
> Logging in to [iface: default, target: 
> iqn.2003-01.com.teamsun.iscsi-gw:iscsi-igw, portal: 192.168.42.110,3260] 
> (multiple)
> Login to [iface: default, target: iqn.2003-01.com.teamsun.iscsi-gw:iscsi-igw, 
> portal: 192.168.42.110,3260] successful.
>
>  /dev/sde is mapped; when I run mkfs.xfs -f /dev/sde, an error occurs:
>
> [root@node1 tmp]# mkfs.xfs -f /dev/sde
> meta-data=/dev/sde   isize=512agcount=4, agsize=1966080 blks
>  =   sectsz=512   attr=2, projid32bit=1
>  =   crc=1finobt=0, sparse=0
> data =   bsize=4096   blocks=7864320, imaxpct=25
>  =   sunit=0  swidth=0 blks
> naming   =version 2  bsize=4096   ascii-ci=0 ftype=1
> log  =internal log   bsize=4096   blocks=3840, version=2
>  =   sectsz=512   sunit=0 blks, lazy-count=1
> realtime =none   extsz=4096   blocks=0, rtextents=0
> existing superblock read failed: Input/output error
> mkfs.xfs: pwrite64 failed: Input/output error
>
> message in /var/log/messages:
> Oct 16 14:01:44 localhost kernel: Dev sde: unable to read RDB block 0
> Oct 16 14:01:44 localhost kernel: sde: unable to read partition table
> Oct 16 14:02:17 localhost kernel: Dev sde: unable to read RDB block 0
> Oct 16 14:02:17 localhost kernel: sde: unable to read partition table
>
> We use Luminous Ceph.
> What causes this error? How can I debug it? Any suggestion is appreciated.

Please use the associated multipath device, not the raw block device.
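
For example, assuming the device-mapper name reported by "multipath -ll" is
mpathf (the name here is only an illustration), the filesystem would be created
on the multipath device instead:

  [root@node1 tmp]# mkfs.xfs -f /dev/mapper/mpathf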

>



-- 
Jason



[ceph-users] Increase of Ceph-mon memory usage - Luminous

2019-10-16 Thread nokia ceph
Hi Team,

We have noticed that the memory usage of ceph-monitor processes increased by
1GB in 4 days.
We monitored the ceph-monitor memory usage every minute and can see it
increase and decrease by a few hundred MB at any point, but over time the
memory usage grows. We also noticed some monitor processes use up to
8GB.
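
A couple of ways to see where a monitor's memory is going (a sketch; the mon
IDs and hostnames are placeholders, and the heap commands assume a tcmalloc
build):

  # per-component memory accounting via the admin socket, on the monitor host
  ceph daemon mon.$(hostname -s) dump_mempools

  # tcmalloc heap statistics, and an explicit release of freed memory
  ceph tell mon.node1 heap stats
  ceph tell mon.node1 heap release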

Environment -
6 node Luminous cluster - 12.2.2
67 OSDs per node, monitor process on each node

Is this amount of memory usage expected for ceph-monitor processes?

Thanks,