[ceph-users] Re: Recovery stuck and Multiple PG fails

2021-08-14 Thread Suresh Rama
Amudhan,

Have you looked at the logs, and have you tried enabling debug logging to see
why the OSDs are being marked down? There should be some reason. Focus on the
MONs and take one node/OSD at a time with debug enabled to see what is happening.
https://docs.ceph.com/en/latest/cephadm/operations/.
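
As a rough sketch of what I mean (daemon IDs are examples; adjust for your
cephadm/podman layout):

ceph health detail                         # which OSDs/PGs are flagged and why
ceph osd tree down                         # only the OSDs currently marked down
ceph config set osd.12 debug_osd 10/10     # raise debug on one example OSD
ceph config set osd.12 debug_ms 1/1
cephadm logs --name osd.12                 # or journalctl -u ceph-<fsid>@osd.12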

Thanks,
Suresh

On Sat, Aug 14, 2021, 9:53 AM Amudhan P  wrote:

> Hi,
> I am stuck with a Ceph cluster showing multiple PG errors after multiple OSDs
> stopped, and starting the OSDs manually again didn't help. The OSD service
> stops again; there is no issue with the HDDs for sure, but for some reason the
> OSDs keep stopping.
>
> I am running Ceph version 15.2.5 in Podman containers.
>
> How do I recover from these PG failures?
>
> Can someone help me recover this, or point me to where to look further?
>
> pgs: 0.360% pgs not active
>  124186/5082364 objects degraded (2.443%)
>  29899/5082364 objects misplaced (0.588%)
>  670 active+clean
>  69  active+undersized+remapped
>  26  active+undersized+degraded+remapped+backfill_wait
>  16  active+undersized+remapped+backfill_wait
>  15  active+undersized+degraded+remapped
>  13  active+clean+remapped
>  9   active+recovery_wait+degraded
>  4   active+remapped+backfill_wait
>  3   stale+down
>  3   active+undersized+remapped+inconsistent
>  2   active+recovery_wait+degraded+remapped
>  1   active+recovering+degraded+remapped
>  1   active+clean+remapped+inconsistent
>  1   active+recovering+degraded
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Issue with Nautilus upgrade from Luminous

2021-07-19 Thread Suresh Rama
Hi Dominic,  All,

After going through the errors in detail and looking through "ceph
features", I set *ceph osd set-require-min-compat-client luminous* and
cleared the warning. I have fixed the remaining warnings too, and the
cluster is now healthy.
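
For anyone hitting the same warning, the two commands involved were roughly:

ceph features                                      # confirm every connected client reports luminous or newer
ceph osd set-require-min-compat-client luminous    # then raise the required client release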

Thank you everyone for taking time to respond.

Regards,
suresh

On Fri, Jul 9, 2021 at 3:37 PM  wrote:

> Suresh;
>
> I don't believe we use tunables, so I'm not terribly familiar with them.
>
> A quick Google search ("ceph tunable") supplied the following pages:
>
> https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/1.2.3/html/storage_strategies/crush_tunables
> https://docs.ceph.com/en/latest/rados/operations/crush-map/#tunables
>
> Thank you,
>
> Dominic L. Hilsbos, MBA
> Vice President - Information Technology
> Perform Air International Inc.
> dhils...@performair.com
> www.PerformAir.com
>
> -Original Message-
> From: Suresh Rama [mailto:sstk...@gmail.com]
> Sent: Thursday, July 8, 2021 7:25 PM
> To: ceph-users
> Subject: [ceph-users] Issue with Nautilus upgrade from Luminous
>
> Dear All,
>
> We have 13 Ceph clusters and we started upgrading them one by one from
> Luminous to Nautilus. Post upgrade we started fixing the warning alerts, but
> setting "ceph config set mon mon_crush_min_required_version firefly"
> yielded no results. Updating the mon config and restarting the daemons did
> not make the warning go away.
>
> I have also tried setting it to hammer, with no luck; the warning is still
> there. Do you have any recommendations? I thought of changing it to
> hammer so I could use straw2, but I was stuck with the warning message. I have
> also bounced the nodes and the issue remains the same.
>
> Please review and share your inputs.
>
>   cluster:
> id: xxx
> health: HEALTH_WARN
> crush map has legacy tunables (require firefly, min is hammer)
> 1 pools have many more objects per pg than average
> 15252 pgs not deep-scrubbed in time
> 21399 pgs not scrubbed in time
> clients are using insecure global_id reclaim
> mons are allowing insecure global_id reclaim
> 3 monitors have not enabled msgr2
>
>
> ceph daemon mon.$(hostname -s) config show | grep -i mon_crush_min_required_version
> "mon_crush_min_required_version": "firefly",
>
> ceph osd crush show-tunables
> {
> "choose_local_tries": 0,
> "choose_local_fallback_tries": 0,
> "choose_total_tries": 50,
> "chooseleaf_descend_once": 1,
> "chooseleaf_vary_r": 1,
> "chooseleaf_stable": 0,
> "straw_calc_version": 1,
> "allowed_bucket_algs": 22,
> "profile": "firefly",
> "optimal_tunables": 0,
> "legacy_tunables": 0,
> "minimum_required_version": "firefly",
> "require_feature_tunables": 1,
> "require_feature_tunables2": 1,
> "has_v2_rules": 0,
> "require_feature_tunables3": 1,
> "has_v3_rules": 0,
> "has_v4_buckets": 0,
> "require_feature_tunables5": 0,
> "has_v5_rules": 0
> }
>
> ceph config dump
> WHO   MASK   LEVEL      OPTION                           VALUE    RO
> mon          advanced   mon_crush_min_required_version   firefly  *
>
> ceph versions
> {
> "mon": {
> "ceph version 14.2.22 (ca74598065096e6fcbd8433c8779a2be0c889351)
> nautilus (stable)": 3
> },
> "mgr": {
> "ceph version 14.2.22 (ca74598065096e6fcbd8433c8779a2be0c889351)
> nautilus (stable)": 3
> },
> "osd": {
> "ceph version 14.2.21 (5ef401921d7a88aea18ec7558f7f9374ebd8f5a6)
> nautilus (stable)": 549,
> "ceph version 14.2.22 (ca74598065096e6fcbd8433c8779a2be0c889351)
> nautilus (stable)": 226
> },
> "mds": {},
> "rgw": {
> "ceph version 14.2.22 (ca74598065096e6fcbd8433c8779a2be0c889351)
> nautilus (stable)": 2
> },
> "overall": {
> "ceph version 14.2.21 (5ef401921d7a88aea18ec7558f7f9374ebd8f5a6)
> nautilus (stable)": 549,
> "ceph version 14.2.22 (ca74598065096e6fcbd8433c8779a2be0c889351)
> nautilus (stable)": 234
> }
> }
>
> ceph -s
>   cluster:
> id:xx
> health: HEALTH_WARN
> crush map has legacy tunables (require firefly, min is hammer)
> 1 pools have many more objects per pg than average

[ceph-users] Issue with Nautilus upgrade from Luminous

2021-07-08 Thread Suresh Rama
Dear All,

We have 13 Ceph clusters and we started upgrading them one by one from Luminous
to Nautilus. Post upgrade we started fixing the warning alerts, but setting
"ceph config set mon mon_crush_min_required_version firefly" yielded no
results. Updating the mon config and restarting the daemons did not make the
warning go away.

I have also tried setting it to hammer, with no luck; the warning is still
there. Do you have any recommendations? I thought of changing it to
hammer so I could use straw2, but I was stuck with the warning message. I have
also bounced the nodes and the issue remains the same.

Please review and share your inputs.

  cluster:
id: xxx
health: HEALTH_WARN
crush map has legacy tunables (require firefly, min is hammer)
1 pools have many more objects per pg than average
15252 pgs not deep-scrubbed in time
21399 pgs not scrubbed in time
clients are using insecure global_id reclaim
mons are allowing insecure global_id reclaim
3 monitors have not enabled msgr2
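
One hedged way to clear the tunables warning is to move the CRUSH profile
forward rather than lowering the required minimum (note this rewrites the
tunables and will trigger data movement), and the other warnings above have
their own switches once all clients are confirmed compatible; a sketch:

ceph osd crush tunables hammer        # raises the profile (and allows straw2 buckets); expect rebalancing
ceph mon enable-msgr2                 # clears "3 monitors have not enabled msgr2"
ceph config set mon auth_allow_insecure_global_id_reclaim false   # only once every client is patched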


ceph daemon mon.$(hostname -s) config show | grep -i mon_crush_min_required_version
"mon_crush_min_required_version": "firefly",

ceph osd crush show-tunables
{
"choose_local_tries": 0,
"choose_local_fallback_tries": 0,
"choose_total_tries": 50,
"chooseleaf_descend_once": 1,
"chooseleaf_vary_r": 1,
"chooseleaf_stable": 0,
"straw_calc_version": 1,
"allowed_bucket_algs": 22,
"profile": "firefly",
"optimal_tunables": 0,
"legacy_tunables": 0,
"minimum_required_version": "firefly",
"require_feature_tunables": 1,
"require_feature_tunables2": 1,
"has_v2_rules": 0,
"require_feature_tunables3": 1,
"has_v3_rules": 0,
"has_v4_buckets": 0,
"require_feature_tunables5": 0,
"has_v5_rules": 0
}

ceph config dump
WHO   MASK   LEVEL      OPTION                           VALUE    RO
mon          advanced   mon_crush_min_required_version   firefly  *

ceph versions
{
"mon": {
"ceph version 14.2.22 (ca74598065096e6fcbd8433c8779a2be0c889351)
nautilus (stable)": 3
},
"mgr": {
"ceph version 14.2.22 (ca74598065096e6fcbd8433c8779a2be0c889351)
nautilus (stable)": 3
},
"osd": {
"ceph version 14.2.21 (5ef401921d7a88aea18ec7558f7f9374ebd8f5a6)
nautilus (stable)": 549,
"ceph version 14.2.22 (ca74598065096e6fcbd8433c8779a2be0c889351)
nautilus (stable)": 226
},
"mds": {},
"rgw": {
"ceph version 14.2.22 (ca74598065096e6fcbd8433c8779a2be0c889351)
nautilus (stable)": 2
},
"overall": {
"ceph version 14.2.21 (5ef401921d7a88aea18ec7558f7f9374ebd8f5a6)
nautilus (stable)": 549,
"ceph version 14.2.22 (ca74598065096e6fcbd8433c8779a2be0c889351)
nautilus (stable)": 234
}
}

ceph -s
  cluster:
id:xx
health: HEALTH_WARN
crush map has legacy tunables (require firefly, min is hammer)
1 pools have many more objects per pg than average
13811 pgs not deep-scrubbed in time
19994 pgs not scrubbed in time
clients are using insecure global_id reclaim
mons are allowing insecure global_id reclaim
3 monitors have not enabled msgr2

  services:
mon: 3 daemons, quorum
pistoremon-ho-c01,pistoremon-ho-c02,pistoremon-ho-c03 (age 24s)
mgr: pistoremon-ho-c02(active, since 2m), standbys: pistoremon-ho-c01,
pistoremon-ho-c03
osd: 800 osds: 775 up (since 105m), 775 in
rgw: 2 daemons active (pistorergw-ho-c01, pistorergw-ho-c02)

  task status:

  data:
pools:   28 pools, 27336 pgs
objects: 107.19M objects, 428 TiB
usage:   1.3 PiB used, 1.5 PiB / 2.8 PiB avail
pgs: 27177 active+clean
 142   active+clean+scrubbing+deep
 17active+clean+scrubbing

  io:
client:   220 MiB/s rd, 1.9 GiB/s wr, 7.07k op/s rd, 25.42k op/s wr

-- 
Regards,
Suresh
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: scrub errors: inconsistent PGs

2021-01-28 Thread Suresh Rama
Just query the PG to see what it is reporting and take action
accordingly.
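
For example, with one of the PGs from the report quoted below (repair only
after working out which replica is bad):

ceph pg 2.758 query                          # peering state and last scrub stamps
ceph pg deep-scrub 2.758                     # an empty inconsistent list is often cleared by peering; a fresh deep scrub repopulates it
rados list-inconsistent-obj 2.758 --format=json-pretty
ceph pg repair 2.758                         # last step, once the bad copy is identified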

On Thu, Jan 28, 2021, 7:13 PM Void Star Nill 
wrote:

> Hello all,
>
> One of our clusters running nautilus release 14.2.15 is reporting health
> error. It reports that there are inconsistent PGs. However, when I inspect
> each of the reported PGs, I dont see any inconsistencies. Any inputs on
> what's going on?
>
> $ sudo ceph health detail
> HEALTH_ERR 3 scrub errors; Possible data damage: 3 pgs inconsistent
> OSD_SCRUB_ERRORS 3 scrub errors
> PG_DAMAGED Possible data damage: 3 pgs inconsistent
> pg 2.a4 is active+clean+inconsistent, acting [2,60,73]
> pg 2.2b3 is active+clean+inconsistent, acting [15,3,38]
> pg 2.758 is active+clean+inconsistent, acting [4,40,35]
>
> $ rados list-inconsistent-obj 2.758 --format=json-pretty
> {
> "epoch": 9211,
> "inconsistents": []
> }
>
> $ rados list-inconsistent-obj 2.a4 --format=json-pretty
> {
> "epoch": 9213,
> "inconsistents": []
> }
>
> $ rados list-inconsistent-obj 2.758 --format=json-pretty
> {
> "epoch": 9211,
> "inconsistents": []
> }
>
> Regards,
> Shridhar
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] device management and failure prediction

2020-12-26 Thread Suresh Rama
Dear All,

Hope you all had a great Christmas and much needed time off with family!
Have any of you used "*device management and failure prediction"* in
Nautilus?   If yes, what is your feedback?  Do you use LOCAL or CLOUD
prediction models?

https://ceph.io/update/new-in-nautilus-device-management-and-failure-prediction/
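
For context, the commands involved look roughly like this (a sketch based on
that blog post; the device ID is a placeholder):

ceph device monitoring on                        # enable SMART scraping by the mgr
ceph device ls                                   # devices and the daemons using them
ceph device get-health-metrics <devid>           # raw SMART data collected so far
ceph mgr module enable diskprediction_local      # local prediction model
ceph config set global device_failure_prediction_mode local
ceph device predict-life-expectancy <devid>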


Your feedback and input is valuable.
-- 
Regards,
Suresh
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Monitors not starting, getting "e3 handle_auth_request failed to assign global_id"

2020-12-16 Thread Suresh Rama
We had the same issue, and it has been stable since upgrading from 14.2.11 to
14.2.15. Also, the size of the DB is not the same for the mon that failed to
join, since the amount of information it has to sync is huge. The compact on
start does the job, but it takes a long time to catch up. You can force the
join with quorum enter, but it won't help. The upgrade helped in our case.
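
For reference, a sketch of the compaction knobs being referred to (the mon ID
and path are examples):

du -sh /var/lib/ceph/mon/ceph-$(hostname -s)/store.db   # how big the mon store really is
ceph tell mon.$(hostname -s) compact                    # online compaction
ceph config set mon mon_compact_on_start true           # compact automatically at mon startup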

On Mon, Dec 14, 2020, 4:42 PM Wesley Dillingham 
wrote:

> We had to rebuild our mons on a few occasions because of this. Only one mon
> was ever dropped from quorum at a time in our case. In other scenarios with
> the same error the mon was able to rejoin after thirty minutes or so. We
> believe we may have tracked it down (in our case) to the upgrade of an AV /
> packet inspection security technology being run on the servers. Perhaps
> you've made similar updates.
>
> Respectfully,
>
> *Wes Dillingham*
> w...@wesdillingham.com
> LinkedIn 
>
>
> On Tue, Dec 8, 2020 at 7:46 PM Wesley Dillingham 
> wrote:
>
> > We have also had this issue multiple times in 14.2.11
> >
> > On Tue, Dec 8, 2020, 5:11 PM  wrote:
> >
> >> I have same issue. My cluster runing 14.2.11 versions. What is your
> >> version ceph?
> >> ___
> >> ceph-users mailing list -- ceph-users@ceph.io
> >> To unsubscribe send an email to ceph-users-le...@ceph.io
> >>
> >
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph Outage (Nautilus) - 14.2.11 [EXT]

2020-12-16 Thread Suresh Rama
Thanks, Stefan. I will review your feedback; Matt suggested the same.

On Wed, Dec 16, 2020, 4:38 AM Stefan Kooman  wrote:

> On 12/16/20 10:21 AM, Matthew Vernon wrote:
> > Hi,
> >
> > On 15/12/2020 20:44, Suresh Rama wrote:
> >
> > TL;DR: use a real NTP client, not systemd-timesyncd
>
> +1. We have a lot of "ntp" daemons running, but on Ceph we use "chrony",
> and it's way faster at converging (especially with very unstable clock
> sources). It might be worth checking it out.
>
> Gr. Stefan
>
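
For anyone wanting to make that switch on Ubuntu, a minimal sketch (package
and service names assume Ubuntu 18.04; verify the offset before and after):

timedatectl set-ntp off          # stop systemd-timesyncd from steering the clock
apt install chrony
systemctl enable --now chrony
chronyc tracking                 # offset, stratum and convergence speed
ceph time-sync-status            # clock skew as the mons see it
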
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Ceph Outage (Nautilus) - 14.2.11

2020-12-15 Thread Suresh Rama
Dear All,

We have a 38-node HP Apollo cluster with 24 x 3.7T spinning disks and 2 NVMe
devices for journal. This is one of our 13 clusters, which was upgraded from
Luminous to Nautilus (14.2.11). When one of our OpenStack customers, who uses
Elasticsearch to offer Logging as a Service to their end users, reported IO
latency issues, our SME rebooted two nodes that he felt were leaking memory.
The reboot didn't help and rather worsened the situation, and he went ahead
and recycled the entire cluster one node at a time to fix the slow ops
reported by the OSDs. This caused a huge issue, and the MONs were not able to
withstand the spam and started crashing.

1) We audited the network (inspecting the TORs, iperf, MTR) and nothing
indicated any issue, but the OSD logs kept complaining about
BADAUTHORIZER:

 2020-12-13 15:32:31.607 7fea5e3a2700  0 --1- 10.146.126.200:0/464096978 >>
v1:10.146.127.122:6809/1803700 conn(0x7fea3c1ba990 0x7fea3c1bf600 :-1
s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=1).handle_connect_reply_2
connect got BADAUTHORIZER
2020-12-13 15:32:31.607 7fea5e3a2700  0 --1- 10.146.126.200:0/464096978 >>
v1:10.146.127.122:6809/1803700 conn(0x7fea3c1c1e20 0x7fea3c1bcdf0 :-1
s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=1).handle_connect_reply_2
connect got BADAUTHORIZER
2020-12-13 15:32:31.607 7fea5e3a2700  0 --1- 10.146.126.200:0/464096978 >>
v1:10.146.127.122:6809/1803700 conn(0x7fea3c1ba990 0x7fea3c1bf600 :-1
s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=1).handle_connect_reply_2
connect got BADAUTHORIZER
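
As an aside, BADAUTHORIZER between daemons frequently comes down to clock
skew or stale cephx tickets rather than the network itself; a quick hedged
check:

ceph time-sync-status                          # per-mon clock skew as seen by the leader
ceph config get mon mon_clock_drift_allowed    # threshold for the mon clock-skew warning (default 0.05 s)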

2) Made sure there was no clock skew; we use timesyncd. After taking out a
couple of OSDs that were flagged as slow in the ceph health output, the
situation didn't improve. After 3 days of troubleshooting we upgraded the
MONs to 14.2.15, and the situation seems to have improved a little, but it is
still reporting 61308 slow ops, which we really struggled to trace to bad
OSDs, as moving a couple of them didn't help. One of the MONs (mon 2) failed
to join the cluster; it was always compacting and never able to join (see the
store sizes below). I suspect that could be because the key-value store on
mon 2 is not up to date with mons 1 and 3. At times we had to stop and
restart the compaction to get a better response from the Ceph MONs (keeping
only a single MON running).

root@pistoremon-as-c01:~# du -sh /var/lib/ceph/mon
391G /var/lib/ceph/mon
root@pistoremon-as-c03:~# du -sh /var/lib/ceph/mon
337G /var/lib/ceph/mon
root@pistoremon-as-c02:~# du -sh /var/lib/ceph/mon
13G /var/lib/ceph/mon

 cluster:
id: bac20301-d458-4828-9dd9-a8406acf5d0f
health: HEALTH_WARN
noout,noscrub,nodeep-scrub flag(s) set
1 pools have many more objects per pg than average
10969 pgs not deep-scrubbed in time
46 daemons have recently crashed
61308 slow ops, oldest one blocked for 2572 sec, daemons
[mon.pistoremon-as-c01,mon.pistoremon-as-c03] have slow ops.
mons pistoremon-as-c01,pistoremon-as-c03 are using a lot of
disk space
1/3 mons down, quorum pistoremon-as-c01,pistoremon-as-c03

  services:
mon: 3 daemons, quorum pistoremon-as-c01,pistoremon-as-c03 (age 52m),
out of quorum: pistoremon-as-c02
mgr: pistoremon-as-c01(active, since 2h), standbys: pistoremon-as-c03,
pistoremon-as-c02
osd: 911 osds: 888 up (since 68m), 888 in
 flags noout,noscrub,nodeep-scrub
rgw: 2 daemons active (pistorergw-as-c01, pistorergw-as-c02)

  task status:

  data:
pools:   17 pools, 32968 pgs
objects: 62.98M objects, 243 TiB
usage:   748 TiB used, 2.4 PiB / 3.2 PiB avail
pgs: 32968 active+clean

  io:
client:   56 MiB/s rd, 95 MiB/s wr, 1.78k op/s rd, 4.27k op/s wr

3) When looking through ceph.log on the mon with tail -f, I was seeing a lot
of different timestamps reported in the logs on MON1, which is the leader.
I am confused about why the live log reports various timestamps.

stat,write 2166784~4096] snapc 0=[] ondisk+write+known_if_redirected
e951384) initiated 2020-12-13 06:16:58.873964 currently delayed
2020-12-13 06:39:37.169504 osd.1224 (osd.1224) 325855 : cluster [WRN] slow
request osd_op(client.461445583.0:8881223 1.16aa
1:55684db0:::rbd_data.9ede65fc7af15.:head [stat,write
3547136~4096] snapc 0=[] ondisk+write+known_if_redirected e951384)
initiated 2020-12-13 06:16:59.082012 currently delayed
2020-12-13 06:39:37.169510 osd.1224 (osd.1224) 325856 : cluster [WRN] slow
request osd_op(client.461445583.0:8881191 1.16aa
1:55684db0:::rbd_data.9ede65fc7af15.:head [stat,write
2314240~4096] snapc 0=[] ondisk+write+known_if_redirected e951384)
initiated 2020-12-13 06:16:58.874031 currently delayed
2020-12-13 06:39:37.169513 osd.1224 (osd.1224) 325857 : cluster [WRN] slow
request osd_op(client.461445583.0:8881224 1.16aa
1:55684db0:::rbd_data.9ede65fc7af15.:head [stat,write
3571712~8192] snapc 0=[] ondisk+write+known_if_redirected e951384)
initiated 2020-12-13 06:16:59.082094 currently delayed
^C


[ceph-users] Fwd: Ceph Upgrade Issue - Luminous to Nautilus (14.2.11 ) using ceph-ansible

2020-08-27 Thread Suresh Rama
Hi All,

We encountered an issue while upgrading our Ceph cluster from Luminous
12.2.12 to Nautilus 14.2.11.   We used
https://docs.ceph.com/docs/master/releases/nautilus/#upgrading-from-mimic-or-luminous
and ceph-ansible to upgrade the cluster. We use HDDs for data and NVMe for
WAL and DB.

*Cluster Background:*
HP DL360
24 x 3.6T SATA
2x1.6T NVME for Journal
osd_scenario: non-collocated
current version: Luminous  12.2.12  & 12.2.5
type: bluestore

The upgrade went well for the MONs (though I had to overcome the systemd
masking issues). While testing the OSD upgrade on one OSD node, we
encountered an issue with the OSD daemons failing quickly after startup.
After comparing and checking the block device mappings, everything looked
fine. The node had been up for more than 700 days, so I decided to do a clean
reboot. After that I noticed the mount points were completely missing, and
ceph-disk is no longer part of Nautilus. I had to manually mount the
partitions after checking the disk partitions and whoami information. After
manually mounting osd.108, it is now throwing a permission error which I'm
still reviewing (bdev(0xd1be000 /var/lib/ceph/osd/ceph-108/block) open open
got: (13) Permission denied). The full OSD log is enclosed for review:
https://pastebin.com/7k0xBfDV.

*Questions*:
What could have gone wrong here, and how can we fix it?
Do we need to migrate the Luminous cluster from ceph-disk to ceph-volume
before attempting the upgrade, or is there any other best practice we should
follow?
What's the best upgrade method using ceph-ansible to move from Luminous to
Nautilus? A manual upgrade of ceph-ansible?
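
On the ceph-disk point: since Nautilus dropped ceph-disk, OSDs created with it
generally have to be adopted with "ceph-volume simple" so the mounts and
systemd units come back after a reboot, and a Permission denied on the block
device often just means the device nodes ended up owned by root. A hedged
sketch, not a confirmed fix for the assert in the log (osd.108 and the paths
come from above; everything else is an assumption):

ceph-volume simple scan /var/lib/ceph/osd/ceph-108        # writes /etc/ceph/osd/108-<osd-fsid>.json
ceph-volume simple activate --all                         # recreates mounts and systemd units
chown -h ceph:ceph /var/lib/ceph/osd/ceph-108/block*      # fix symlink ownership
chown ceph:ceph $(readlink -f /var/lib/ceph/osd/ceph-108/block)   # ...and the underlying device node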

We have also started thinking about the Octopus release, which uses containers:
what is the best transition path for the long run? We don't want to destroy and
rebuild the entire cluster; we can do one node at a time, but that would be a
very lengthy process for 2500+ systems across 13 clusters. Looking for help and
expert comments on the transition path.

Any help would be greatly appreciated.

2020-08-27 14:41:01.132 7f0e0ebf2c00  0 bdev(0xb7e2a80
/var/lib/ceph/osd/ceph-108/block.wal) ioctl(F_SET_FILE_RW_HINT) on
/var/lib/ceph/osd/ceph-108/block.wal failed: (22) Invalid argument
2020-08-27 14:41:01.132 7f0e0ebf2c00  1 bdev(0xb7e2a80
/var/lib/ceph/osd/ceph-108/block.wal) open size 1073741824 (0x4000, 1
GiB) block_size 4096 (4 KiB) non-rotational discard supported
2020-08-27 14:41:01.132 7f0e0ebf2c00  1 bluefs add_block_device bdev 0 path
/var/lib/ceph/osd/ceph-108/block.wal size 1 GiB
2020-08-27 14:41:01.132 7f0e0ebf2c00  0  set rocksdb option
compaction_style = kCompactionStyleLevel
2020-08-27 14:41:01.132 7f0e0ebf2c00 -1 rocksdb: Invalid argument: Can't
parse option compaction_threads
2020-08-27 14:41:01.136 7f0e0ebf2c00 -1
/build/ceph-14.2.11/src/os/bluestore/BlueStore.cc: In function 'int
BlueStore::_upgrade_super()' thread 7f0e0ebf2c00 time 2020-08-27
14:41:01.135973
/build/ceph-14.2.11/src/os/bluestore/BlueStore.cc: 10249: FAILED
ceph_assert(ondisk_format > 0)

 ceph version 14.2.11 (f7fdb2f52131f54b891a2ec99d8205561242cdaf) nautilus
(stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x152) [0x846368]
 2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*,
char const*, ...)+0) [0x846543]
 3: (BlueStore::_upgrade_super()+0x4b6) [0xd62346]
 4: (BlueStore::_mount(bool, bool)+0x592) [0xdb0b52]
 5: (OSD::init()+0x3f3) [0x8f5483]
 6: (main()+0x27e2) [0x84c462]
 7: (__libc_start_main()+0xf0) [0x7f0e0bda3830]
 8: (_start()+0x29) [0x880389]

Journalctl -xu log

Aug 27 20:18:39 pistore-as-b03 ceph-osd[345903]: 2020-08-27 20:18:39.309
7fe9410bfc00 -1 bluestore(/var/lib/ceph/osd/ceph-108/block)
_read_bdev_label failed to op
Aug 27 20:18:39 pistore-as-b03 ceph-osd[345903]: 2020-08-27 20:18:39.309
7fe9410bfc00 -1 bluestore(/var/lib/ceph/osd/ceph-108/block)
_read_bdev_label failed to op
Aug 27 20:18:39 pistore-as-b03 ceph-osd[345903]: 2020-08-27 20:18:39.309
7fe9410bfc00 -1 bluestore(/var/lib/ceph/osd/ceph-108/block)
_read_bdev_label failed to op
Aug 27 20:18:39 pistore-as-b03 ceph-osd[345903]: 2020-08-27 20:18:39.309
7fe9410bfc00 -1 bluestore(/var/lib/ceph/osd/ceph-108/block)
_read_bdev_label failed to op
Aug 27 20:18:39 pistore-as-b03 ceph-osd[345903]: 2020-08-27 20:18:39.309
7fe9410bfc00 -1 bluestore(/var/lib/ceph/osd/ceph-108/block)
_read_bdev_label failed to op
Aug 27 20:18:39 pistore-as-b03 ceph-osd[345903]: 2020-08-27 20:18:39.309
7fe9410bfc00 -1 bdev(0xd1be000 /var/lib/ceph/osd/ceph-108/block) open open
got: (13) Perm
Aug 27 20:18:39 pistore-as-b03 ceph-osd[345903]: 2020-08-27 20:18:39.317
7fe9410bfc00 -1 bdev(0xd1be000 /var/lib/ceph/osd/ceph-108/block) open open
got: (13) Perm
Aug 27 20:18:39 pistore-as-b03 ceph-osd[345903]: 2020-08-27 20:18:39.317
7fe9410bfc00 -1 bluestore(/var/lib/ceph/osd/ceph-108/block)
_read_bdev_label failed to op
Aug 27 20:18:39 pistore-as-b03 ceph-osd[345903]: 2020-08-27 20:18:39.317
7fe9410bfc00 -1 bluestore(/var/lib/ceph/osd/ceph-108/block)

[ceph-users] Re: Cephadm cluster network

2020-06-07 Thread Suresh Rama
This should give you the answer to your question:
https://docs.ceph.com/docs/master/rados/configuration/network-config-ref/
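
In short, with cephadm the cluster network is typically set through the
centralized config rather than ceph.conf; a minimal sketch (the subnet is just
an example):

ceph config set global cluster_network 192.168.10.0/24   # the subnet on the second NICs
ceph config get osd cluster_network                      # confirm it took
# restart the OSD daemons afterwards so they re-bind to the new network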

Regards,
Suresh

On Sun, Jun 7, 2020, 5:10 AM  wrote:

> I am new to Ceph so I hope this is not a question of me not reading the
> documentation well enough.
>
> I have setup a small cluster to learn with three physical hosts each with
> two nics.
>
> The cluster is upp and running but I have not figured out how to tie the
> OSD:s to my second interface for a separate cluster network, as it is now
> all communication goes thru the public network.
>
> Is it possible to define the cluster network with cephadm in some way?
>
> /Jimmy
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph Nautius not working after setting MTU 9000

2020-05-24 Thread Suresh Rama
As I said, a ping with a 9000-byte payload won't get a response; it should be
8972. Glad it is working, but you should know what happened so you can avoid
this issue later.
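
For the record, the check looks like this (interface and address are examples;
8972 = 9000 minus 20 bytes of IP and 8 bytes of ICMP header):

ip link show dev eth0 | grep mtu       # confirm the interface really has MTU 9000
ping -M do -s 8972 -c 3 10.0.0.2       # -M do forbids fragmentation, so only a clean jumbo path answers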

On Sun, May 24, 2020, 3:04 AM Amudhan P  wrote:

> No, ping with MTU size 9000 didn't work.
>
> On Sun, May 24, 2020 at 12:26 PM Khodayar Doustar 
> wrote:
>
> > Does your ping work or not?
> >
> >
> > On Sun, May 24, 2020 at 6:53 AM Amudhan P  wrote:
> >
> >> Yes, I have set setting on the switch side also.
> >>
> >> On Sat 23 May, 2020, 6:47 PM Khodayar Doustar, 
> >> wrote:
> >>
> >>> Problem should be with the network. When you change the MTU it should be
> >>> changed all over the network; every single hop on your network should
> >>> speak and accept 9000 MTU packets. You can check it on your hosts with
> >>> the "ifconfig" command, and there are equivalent commands for other
> >>> network/security devices.
> >>>
> >>> If you have just one node which is not correctly configured for MTU 9000
> >>> it wouldn't work.
> >>>
> >>> On Sat, May 23, 2020 at 2:30 PM si...@turka.nl  wrote:
> >>>
>  Can the servers/nodes ping eachother using large packet sizes? I guess
>  not.
> 
>  Sinan Polat
> 
>  > Op 23 mei 2020 om 14:21 heeft Amudhan P  het
>  volgende geschreven:
>  >
>  > In OSD logs "heartbeat_check: no reply from OSD"
>  >
>  >> On Sat, May 23, 2020 at 5:44 PM Amudhan P 
>  wrote:
>  >>
>  >> Hi,
>  >>
>  >> I have set Network switch with MTU size 9000 and also in my netplan
>  >> configuration.
>  >>
>  >> What else needs to be checked?
>  >>
>  >>
>  >>> On Sat, May 23, 2020 at 3:39 PM Wido den Hollander  >
>  wrote:
>  >>>
>  >>>
>  >>>
>   On 5/23/20 12:02 PM, Amudhan P wrote:
>   Hi,
>  
>   I am using ceph Nautilus in Ubuntu 18.04 working fine wit MTU
> size
>  1500
>   (default) recently i tried to update MTU size to 9000.
>   After setting Jumbo frame running ceph -s is timing out.
>  >>>
>  >>> Ceph can run just fine with an MTU of 9000. But there is probably
>  >>> something else wrong on the network which is causing this.
>  >>>
>  >>> Check the Jumbo Frames settings on all the switches as well to
> make
>  sure
>  >>> they forward all the packets.
>  >>>
>  >>> This is definitely not a Ceph issue.
>  >>>
>  >>> Wido
>  >>>
>  
>   regards
>   Amudhan P
>   ___
>   ceph-users mailing list -- ceph-users@ceph.io
>   To unsubscribe send an email to ceph-users-le...@ceph.io
>  
>  >>> ___
>  >>> ceph-users mailing list -- ceph-users@ceph.io
>  >>> To unsubscribe send an email to ceph-users-le...@ceph.io
>  >>>
>  >>
>  > ___
>  > ceph-users mailing list -- ceph-users@ceph.io
>  > To unsubscribe send an email to ceph-users-le...@ceph.io
> 
>  ___
>  ceph-users mailing list -- ceph-users@ceph.io
>  To unsubscribe send an email to ceph-users-le...@ceph.io
> 
> >>>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph Nautius not working after setting MTU 9000

2020-05-23 Thread Suresh Rama
Hi,

It should be: ping -M do -s 8972 <IP address>.

You can't ping with a 9000-byte payload. If you can't ping with an 8972-byte
payload either, then the MTU configuration is wrong somewhere in the path.

Regards,
Suresh

On Sat, May 23, 2020, 1:35 PM apely agamakou  wrote:

> Hi,
>
> Please check your MTU limit at the switch level, and check other resources
> with ICMP ping.
> Try adding 14 bytes for the Ethernet header at your switch level, meaning an
> MTU of 9014? Are you using Juniper?
>
> Example: ping -D -s 9 other_ip
>
>
>
> Le sam. 23 mai 2020 à 15:18, Khodayar Doustar  a
> écrit :
>
> > > Problem should be with the network. When you change the MTU it should be
> > > changed all over the network; every single hop on your network should speak
> > > and accept 9000 MTU packets. You can check it on your hosts with the
> > > "ifconfig" command, and there are equivalent commands for other
> > > network/security devices.
> > >
> > > If you have just one node which is not correctly configured for MTU 9000 it
> > > wouldn't work.
> >
> > On Sat, May 23, 2020 at 2:30 PM si...@turka.nl  wrote:
> >
> > > Can the servers/nodes ping eachother using large packet sizes? I guess
> > not.
> > >
> > > Sinan Polat
> > >
> > > > Op 23 mei 2020 om 14:21 heeft Amudhan P  het
> > > volgende geschreven:
> > > >
> > > > In OSD logs "heartbeat_check: no reply from OSD"
> > > >
> > > >> On Sat, May 23, 2020 at 5:44 PM Amudhan P 
> > wrote:
> > > >>
> > > >> Hi,
> > > >>
> > > >> I have set Network switch with MTU size 9000 and also in my netplan
> > > >> configuration.
> > > >>
> > > >> What else needs to be checked?
> > > >>
> > > >>
> > > >>> On Sat, May 23, 2020 at 3:39 PM Wido den Hollander 
> > > wrote:
> > > >>>
> > > >>>
> > > >>>
> > >  On 5/23/20 12:02 PM, Amudhan P wrote:
> > >  Hi,
> > > 
> > >  I am using ceph Nautilus in Ubuntu 18.04 working fine wit MTU size
> > > 1500
> > >  (default) recently i tried to update MTU size to 9000.
> > >  After setting Jumbo frame running ceph -s is timing out.
> > > >>>
> > > >>> Ceph can run just fine with an MTU of 9000. But there is probably
> > > >>> something else wrong on the network which is causing this.
> > > >>>
> > > >>> Check the Jumbo Frames settings on all the switches as well to make
> > > sure
> > > >>> they forward all the packets.
> > > >>>
> > > >>> This is definitely not a Ceph issue.
> > > >>>
> > > >>> Wido
> > > >>>
> > > 
> > >  regards
> > >  Amudhan P
> > >  ___
> > >  ceph-users mailing list -- ceph-users@ceph.io
> > >  To unsubscribe send an email to ceph-users-le...@ceph.io
> > > 
> > > >>> ___
> > > >>> ceph-users mailing list -- ceph-users@ceph.io
> > > >>> To unsubscribe send an email to ceph-users-le...@ceph.io
> > > >>>
> > > >>
> > > > ___
> > > > ceph-users mailing list -- ceph-users@ceph.io
> > > > To unsubscribe send an email to ceph-users-le...@ceph.io
> > >
> > > ___
> > > ceph-users mailing list -- ceph-users@ceph.io
> > > To unsubscribe send an email to ceph-users-le...@ceph.io
> > >
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> >
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Request to guide on ceph-deploy install command for luminuous 12.2.12 release

2019-08-18 Thread Suresh Rama
Since you deleted the stack, this is moot. You can simply delete the
volumes from the pool using rbd.

The proper way is to delete the volumes before destroying the stack.
If the stack is alive and you have issues deleting them, you can take two
approaches:

1) Run openstack volume delete with --debug to see what happens.

2) Use rbd status pool/volume to see whether there is a watcher on the volume.
If yes, then the instance has not been removed, or was not properly removed.
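
Roughly (pool and image names are examples; "volumes" is the usual Cinder RBD
pool):

openstack --debug volume delete <volume-id>
rbd status volumes/volume-<uuid>       # a listed watcher means something still has the image mapped/attached
rbd rm volumes/volume-<uuid>           # last resort, only once Cinder no longer tracks the volume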

Thanks,
Suresh

On Fri, Aug 16, 2019, 8:21 PM Nerurkar, Ruchir (Nokia - US/Mountain View) <
ruchir.nerur...@nokia.com> wrote:

> Hello,
>
>
>
> So I am able to install Ceph on CentOS 7.4 and I can successfully
> integrate my Openstack testbed with it.
>
>
>
> However, I have recently been facing an issue where, after deleting the
> stack, my Cinder volumes are not deleted and get stuck.
>
>
>
> Any idea on this issue?
>
>
>
> Best Regards,
>
> Ruchir Nerurkar
>
> 857-701-3405
>
>
>
> *From:* Götz Reinicke 
> *Sent:* Monday, August 12, 2019 1:17 PM
> *To:* Nerurkar, Ruchir (Nokia - US/Mountain View) <
> ruchir.nerur...@nokia.com>
> *Cc:* ceph-users@ceph.io
> *Subject:* Re: [ceph-users] Request to guide on ceph-deploy install
> command for luminuous 12.2.12 release
>
>
>
> Hi,
>
>
>
> Am 07.08.2019 um 20:20 schrieb Nerurkar, Ruchir (Nokia - US/Mountain View)
> :
>
>
>
> Hello,
>
>
>
> I work in Nokia as a Software QA engineer and I am trying to install Ceph
> on centOS 7.4 version.
>
>
>
> Do you have to stick at 7.4?
>
>
>
>
>
> But I am getting this error with the following output: -
>
>
>
> vmgoscephcontrollerluminous][WARNIN] Error: Package:
> 2:ceph-common-12.2.12-0.el7.x86_64 (Ceph)
>
> [mvmgoscephcontrollerluminous][WARNIN]Requires:
> liblz4.so.1()(64bit)
>
> [mvmgoscephcontrollerluminous][WARNIN] Error: Package:
> 2:ceph-osd-12.2.12-0.el7.x86_64 (Ceph)
>
> [mvmgoscephcontrollerluminous][WARNIN]Requires:
> liblz4.so.1()(64bit)
>
> [mvmgoscephcontrollerluminous][WARNIN] Error: Package:
> 2:ceph-base-12.2.12-0.el7.x86_64 (Ceph)
>
> [mvmgoscephcontrollerluminous][WARNIN]Requires:
> gperftools-libs >= 2.6.1
>
> [mvmgoscephcontrollerluminous][WARNIN]Available:
> gperftools-libs-2.4-8.el7.i686 (base)
>
> [mvmgoscephcontrollerluminous][DEBUG ]  You could try using --skip-broken
> to work around the problem
>
> [mvmgoscephcontrollerluminous][WARNIN]gperftools-libs =
> 2.4-8.el7
>
> [mvmgoscephcontrollerluminous][WARNIN] Error: Package:
> 2:ceph-mon-12.2.12-0.el7.x86_64 (Ceph)
>
> [mvmgoscephcontrollerluminous][WARNIN]Requires:
> liblz4.so.1()(64bit)
>
> [mvmgoscephcontrollerluminous][WARNIN] Error: Package:
> 2:ceph-base-12.2.12-0.el7.x86_64 (Ceph)
>
> [mvmgoscephcontrollerluminous][WARNIN]Requires:
> liblz4.so.1()(64bit)
>
> [mvmgoscephcontrollerluminous][DEBUG ]  You could try running: rpm -Va
> --nofiles --nodigest
>
> [mvmgoscephcontrollerluminous][ERROR ] RuntimeError: command returned
> non-zero exit status: 1
>
> [ceph_deploy][ERROR ] RuntimeError: Failed to execute command: yum -y
> install ceph ceph-radosgw
>
>
>
> Does anyone know about this issue, i.e. how I can resolve these package
> dependencies?
>
>
>
> My first shot: Update Centos to min. 7.5 (which
> includes gperftools-libs-2.6.1)
>
>
>
> And install lz4 too.
>
>
>
> Regards . Götz
>
>
>
>
>
>
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io