[ceph-users] Re: ceph mon failing to start

2022-03-28 Thread Dan van der Ster
Are the two running mons also running 14.2.9?
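
For reference, a quick way to compare daemon versions across the cluster
(pxmx1 is taken from the ceph -s output quoted below; these are standard
commands available on Nautilus and later, not something specific to this setup):

    ceph versions                 # per-daemon-type version summary from the mons
    ceph tell mon.pxmx1 version   # ask one running mon directly
    ceph-mon --version            # version of the locally installed binary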

--- dan

On Mon, Mar 28, 2022 at 8:27 AM Tomáš Hodek  wrote:
>
> Hi, I have a 3-node ceph cluster (managed via Proxmox). It suffered a
> single-node fatal failure and I replaced the node. The OS boots correctly;
> however, the monitor on the failed node did not start successfully. The
> other 2 monitors are OK and ceph status is healthy:
>
> ceph -s
> cluster:
> id: 845868a1-9902-4b61-aa06-0767cb09f1c2
> health: HEALTH_OK
>
> services:
> mon: 2 daemons, quorum pxmx1,pxmx3 (age 2h)
> mgr: pxmx1(active, since 56m), standbys: pxmx3
> osd: 18 osds: 18 up (since 111m), 18 in (since 3h)
>
> data:
> pools: 1 pools, 256 pgs
> objects: 2.12M objects, 8.1 TiB
> usage: 24 TiB used, 21 TiB / 45 TiB avail
> pgs: 256 active+clean
>
> Content of ceph.conf:
>
> [global]
> auth_client_required = cephx
> auth_cluster_required = cephx
> auth_service_required = cephx
> cluster_network = 10.60.10.1/24
> fsid = 845868a1-9902-4b61-aa06-0767cb09f1c2
> mon_allow_pool_delete = true
> mon_host = 10.60.10.1 10.60.10.3 10.60.10.2
> osd_pool_default_min_size = 2
> osd_pool_default_size = 3
> public_network = 10.60.10.1/24
>
> [client]
> keyring = /etc/pve/priv/$cluster.$name.keyring
>
> [mds]
> keyring = /var/lib/ceph/mds/ceph-$id/keyring
>
> The monitor is failing (at least as I understand the problem) with the
> following logged error:
>
> mon.pxmx2@-1(probing) e4 handle_auth_request failed to assign global_id
>
> The whole mon log is attached.
>
> I have tried to scrap the dead monitor and recreate it via the Proxmox GUI
> and shell, and have even created the contents of /var/lib/ceph/mon/ manually
> and tried to run the monitor from a terminal. It starts and listens for
> connections on ports 3300 and 6789, but does not communicate properly with
> the other remaining mons.
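>
> For reference, a minimal sketch of the usual manual re-create procedure
> (assumptions: mon ID pxmx2, cluster name "ceph" and the default
> /var/lib/ceph paths; adapt to the Proxmox layout):
>
>     # drop the dead mon from the monmap first
>     ceph mon remove pxmx2
>     # fetch the current monmap and mon. keyring from the healthy quorum
>     ceph mon getmap -o /tmp/monmap
>     ceph auth get mon. -o /tmp/mon.keyring
>     # rebuild the mon data directory and start the daemon
>     ceph-mon --mkfs -i pxmx2 --monmap /tmp/monmap --keyring /tmp/mon.keyring
>     chown -R ceph:ceph /var/lib/ceph/mon/ceph-pxmx2
>     systemctl start ceph-mon@pxmx2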
>
> Thanks for any info.
>
> Tomas Hodek
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RBD Exclusive lock to shared lock

2022-03-28 Thread Marc
> 
> My use case would be an HA cluster where a VM maps an RBD image and then
> encounters some network issue. Another node of the HA cluster could start
> the VM and map the image again, but once networking is fixed on the first
> VM, that VM would keep using the already mapped image. Here, if I could
> instruct my second VM to treat the lock as exclusive after an automatic
> failover, then I'm protected against data corruption when the networking of
> the initial VM is fixed. But I assume that a STONITH kind of fencing could
> also do the job (if it can be implemented).

Is it already possible to configure libvirt directly for this, or are the hook 
scripts still necessary?
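
For reference, a minimal sketch of how exclusive-lock based fencing is
commonly done with the kernel RBD client (the pool/image name and client
address below are made-up placeholders; the --exclusive map flag and the
blocklist command are the generally documented ones, not something verified
in this thread):

    # map the image so this client takes and holds the exclusive lock
    rbd map --exclusive rbd/vm-100-disk-0

    # inspect current lockers and watchers on the image
    rbd lock ls rbd/vm-100-disk-0
    rbd status rbd/vm-100-disk-0

    # explicitly fence the stale client by blocklisting its address
    # ("ceph osd blacklist add" on pre-Pacific releases)
    ceph osd blocklist add 192.168.1.50:0/123456789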


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] PG down, due to 3 OSD failing

2022-03-28 Thread Fulvio Galeazzi

Hallo,
all of a sudden, 3 of my OSDs failed, showing similar messages in 
the log:


.
-5> 2022-03-28 14:19:02.451 7fc20fe99700  5 osd.145 pg_epoch: 
616454 pg[70.2c6s1( empty local-lis/les=612106/612107 n=0 
ec=148456/148456 lis/c
 612106/612106 les/c/f 612107/612107/0 612106/612106/612101) 
[168,145,102,96,112,124,128,134,56,34]p168(0) r=1 lpr=616429 crt=0'0 
unknown mbc={}]

enter Started
-4> 2022-03-28 14:19:02.451 7fc20fe99700  5 osd.145 pg_epoch: 
616454 pg[70.2c6s1( empty local-lis/les=612106/612107 n=0 
ec=148456/148456 lis/c
 612106/612106 les/c/f 612107/612107/0 612106/612106/612101) 
[168,145,102,96,112,124,128,134,56,34]p168(0) r=1 lpr=616429 crt=0'0 
unknown mbc={}]

enter Start
-3> 2022-03-28 14:19:02.451 7fc20fe99700  1 osd.145 pg_epoch: 
616454 pg[70.2c6s1( empty local-lis/les=612106/612107 n=0 
ec=148456/148456 lis/c
 612106/612106 les/c/f 612107/612107/0 612106/612106/612101) 
[168,145,102,96,112,124,128,134,56,34]p168(0) r=1 lpr=616429 crt=0'0 
unknown mbc={}]

state: transitioning to Stray
-2> 2022-03-28 14:19:02.451 7fc20fe99700  5 osd.145 pg_epoch: 
616454 pg[70.2c6s1( empty local-lis/les=612106/612107 n=0 
ec=148456/148456 lis/c
 612106/612106 les/c/f 612107/612107/0 612106/612106/612101) 
[168,145,102,96,112,124,128,134,56,34]p168(0) r=1 lpr=616429 crt=0'0 
unknown mbc={}]

exit Start 0.08 0 0.00
-1> 2022-03-28 14:19:02.451 7fc20fe99700  5 osd.145 pg_epoch: 
616454 pg[70.2c6s1( empty local-lis/les=612106/612107 n=0 
ec=148456/148456 lis/c
 612106/612106 les/c/f 612107/612107/0 612106/612106/612101) 
[168,145,102,96,112,124,128,134,56,34]p168(0) r=1 lpr=616429 crt=0'0 
unknown mbc={}]

enter Started/Stray
 0> 2022-03-28 14:19:02.451 7fc20f698700 -1 *** Caught signal 
(Aborted) **

 in thread 7fc20f698700 thread_name:tp_osd_tp

 ceph version 14.2.22 (ca74598065096e6fcbd8433c8779a2be0c889351) 
nautilus (stable)

 1: (()+0x12ce0) [0x7fc2327dcce0]
 2: (gsignal()+0x10f) [0x7fc231452a4f]
 3: (abort()+0x127) [0x7fc231425db5]
 4: (ceph::__ceph_abort(char const*, int, char const*, 
std::__cxx11::basic_string, 
std::allocator > const&)+0x1b4) [0x55b8139cb671]

 5: (PG::check_past_interval_bounds() const+0xc16) [0x55b813b586f6]
 6: (PG::RecoveryState::Reset::react(PG::AdvMap const&)+0x3e8) 
[0x55b813b963d8]
 7: (boost::statechart::simple_statePG::RecoveryState::RecoveryMachine, boost::mpl::listmpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, 
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, 
mpl_::na, mpl_::na, mpl_::na, mpl_::na>, 
(boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base 
const&, void const*)+0x7d) [0x55b813bdd32d]
 8: (PG::handle_advance_map(std::shared_ptr, 
std::shared_ptr, std::vector >&, 
int, std::vector >&, int, 
PG::RecoveryCtx*)+0x39d) [0x55b813b7b5fd]
 9: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, 
PG::RecoveryCtx*)+0x2e9) [0x55b813ad14e9]
 10: (OSD::dequeue_peering_evt(OSDShard*, PG*, 
std::shared_ptr, ThreadPool::TPHandle&)+0xaa) 
[0x55b813ae345a]
 11: (PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr&, 
ThreadPool::TPHandle&)+0x55) [0x55b813d66c15]
 12: (OSD::ShardedOpWQ::_process(unsigned int, 
ceph::heartbeat_handle_d*)+0x1366) [0x55b813adff46]
 13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) 
[0x55b8140dc944]

 14: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x55b8140df514]
 15: (()+0x81cf) [0x7fc2327d21cf]
 16: (clone()+0x43) [0x7fc23143dd83]

Trying to "activate --all", rebotting server, and such, did not help.

I am now stuck with one PG (85.25) down; find below the output from "query".

The PG belongs to a 3+2 erasure-coded pool.
As the devices corresponding to the 3 down OSDs are properly mounted, is 
there a way to get PG.ID=85.25 from the devices and copy it elsewhere?

Actually, I tried to find 85.25 in the 3 down OSDs with the command:
~]# ceph-objectstore-tool  --data-path /var/lib/ceph/osd/cephpa1-158/ 
--no-mon-config --pgid 85.25 --op export --file /tmp/pg_85-25

PG '85.25' not found
  which puzzled me... Is there a way to search for such a PG ID over the 
whole cluster?


  Thanks for your help!

Fulvio



~]# ceph --cluster cephpa1 health detail | grep down
.
PG_AVAILABILITY Reduced data availability: 1 pg inactive, 1 pg down
pg 85.25 is down+remapped, acting 
[2147483647,2147483647,96,2147483647,2147483647]

~]# ceph --cluster cephpa1 pg 85.25 query
{
"state": "down+remapped",
"snap_trimq": "[]",
"snap_trimq_len": 0,
"epoch": 617667,
"up": [
2147483647,
2147483647,
2147483647,
2147483647,
2147483647
],
"acting": [
2147483647,
2147483647,
96,
2147483647,
2147483647
],
"info": {
"pgid": "85.25s2",
"last_update": "606021'521273",
"last_complete": "606021'521273",
"log_tail": "605873'518175",
"last

[ceph-users] Re: PG down, due to 3 OSD failing

2022-03-28 Thread Dan van der Ster
Hi Fulvio,

You can check (offline) which PGs are on an OSD with the list-pgs op, e.g.

ceph-objectstore-tool  --data-path /var/lib/ceph/osd/cephpa1-158/  --op list-pgs

The EC PGs have a naming convention like 85.25s1, etc., for the various
k/m EC shards.
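
For example, something along these lines (the paths are assumptions based on
the cephpa1-158 directory you quoted, and the OSD must be stopped while
ceph-objectstore-tool runs against its data path):

    # list the PG shards on one of the down OSDs and look for 85.25
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/cephpa1-158/ \
        --no-mon-config --op list-pgs | grep '^85\.25'

    # if a shard such as 85.25s1 shows up, export it using the full shard name
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/cephpa1-158/ \
        --no-mon-config --pgid 85.25s1 --op export --file /tmp/pg_85-25s1

    # and import it into another (stopped) OSD, e.g. a spare one
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/cephpa1-NNN/ \
        --no-mon-config --op import --file /tmp/pg_85-25s1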

-- dan


On Mon, Mar 28, 2022 at 2:29 PM Fulvio Galeazzi  wrote:
>
> Hallo,
>  all of a sudden, 3 of my OSDs failed, showing similar messages in
> the log:
>
> .
>  -5> 2022-03-28 14:19:02.451 7fc20fe99700  5 osd.145 pg_epoch:
> 616454 pg[70.2c6s1( empty local-lis/les=612106/612107 n=0
> ec=148456/148456 lis/c
>   612106/612106 les/c/f 612107/612107/0 612106/612106/612101)
> [168,145,102,96,112,124,128,134,56,34]p168(0) r=1 lpr=616429 crt=0'0
> unknown mbc={}]
> enter Started
>  -4> 2022-03-28 14:19:02.451 7fc20fe99700  5 osd.145 pg_epoch:
> 616454 pg[70.2c6s1( empty local-lis/les=612106/612107 n=0
> ec=148456/148456 lis/c
>   612106/612106 les/c/f 612107/612107/0 612106/612106/612101)
> [168,145,102,96,112,124,128,134,56,34]p168(0) r=1 lpr=616429 crt=0'0
> unknown mbc={}]
> enter Start
>  -3> 2022-03-28 14:19:02.451 7fc20fe99700  1 osd.145 pg_epoch:
> 616454 pg[70.2c6s1( empty local-lis/les=612106/612107 n=0
> ec=148456/148456 lis/c
>   612106/612106 les/c/f 612107/612107/0 612106/612106/612101)
> [168,145,102,96,112,124,128,134,56,34]p168(0) r=1 lpr=616429 crt=0'0
> unknown mbc={}]
> state: transitioning to Stray
>  -2> 2022-03-28 14:19:02.451 7fc20fe99700  5 osd.145 pg_epoch:
> 616454 pg[70.2c6s1( empty local-lis/les=612106/612107 n=0
> ec=148456/148456 lis/c
>   612106/612106 les/c/f 612107/612107/0 612106/612106/612101)
> [168,145,102,96,112,124,128,134,56,34]p168(0) r=1 lpr=616429 crt=0'0
> unknown mbc={}]
> exit Start 0.08 0 0.00
>  -1> 2022-03-28 14:19:02.451 7fc20fe99700  5 osd.145 pg_epoch:
> 616454 pg[70.2c6s1( empty local-lis/les=612106/612107 n=0
> ec=148456/148456 lis/c
>   612106/612106 les/c/f 612107/612107/0 612106/612106/612101)
> [168,145,102,96,112,124,128,134,56,34]p168(0) r=1 lpr=616429 crt=0'0
> unknown mbc={}]
> enter Started/Stray
>   0> 2022-03-28 14:19:02.451 7fc20f698700 -1 *** Caught signal
> (Aborted) **
>   in thread 7fc20f698700 thread_name:tp_osd_tp
>
>   ceph version 14.2.22 (ca74598065096e6fcbd8433c8779a2be0c889351)
> nautilus (stable)
>   1: (()+0x12ce0) [0x7fc2327dcce0]
>   2: (gsignal()+0x10f) [0x7fc231452a4f]
>   3: (abort()+0x127) [0x7fc231425db5]
>   4: (ceph::__ceph_abort(char const*, int, char const*,
> std::__cxx11::basic_string,
> std::allocator > const&)+0x1b4) [0x55b8139cb671]
>   5: (PG::check_past_interval_bounds() const+0xc16) [0x55b813b586f6]
>   6: (PG::RecoveryState::Reset::react(PG::AdvMap const&)+0x3e8)
> [0x55b813b963d8]
>   7: (boost::statechart::simple_state PG::RecoveryState::RecoveryMachine, boost::mpl::list mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
> mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
> mpl_::na, mpl_::na, mpl_::na, mpl_::na>,
> (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base
> const&, void const*)+0x7d) [0x55b813bdd32d]
>   8: (PG::handle_advance_map(std::shared_ptr,
> std::shared_ptr, std::vector >&,
> int, std::vector >&, int,
> PG::RecoveryCtx*)+0x39d) [0x55b813b7b5fd]
>   9: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&,
> PG::RecoveryCtx*)+0x2e9) [0x55b813ad14e9]
>   10: (OSD::dequeue_peering_evt(OSDShard*, PG*,
> std::shared_ptr, ThreadPool::TPHandle&)+0xaa)
> [0x55b813ae345a]
>   11: (PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr&,
> ThreadPool::TPHandle&)+0x55) [0x55b813d66c15]
>   12: (OSD::ShardedOpWQ::_process(unsigned int,
> ceph::heartbeat_handle_d*)+0x1366) [0x55b813adff46]
>   13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4)
> [0x55b8140dc944]
>   14: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x55b8140df514]
>   15: (()+0x81cf) [0x7fc2327d21cf]
>   16: (clone()+0x43) [0x7fc23143dd83]
>
> Trying to "activate --all", rebotting server, and such, did not help.
>
> I am now stuck with one PG (85.25) down, find below the output from "query".
>
> The PG belongs to a 3+2 erasure-coded pool.
> As the devices corresponding to the 3 down OSDs are properly mounted, is
> there a way to get PG.ID=85.25 from the devices and copy it elsewhere?
> Actually, I tried to find 85.25 in the 3 down OSDs with the command:
> ~]# ceph-objectstore-tool  --data-path /var/lib/ceph/osd/cephpa1-158/
> --no-mon-config --pgid 85.25 --op export --file /tmp/pg_85-25
> PG '85.25' not found
>    which puzzled me... Is there a way to search for such a PG ID over the
> whole cluster?
>
>Thanks for your help!
>
> Fulvio
>
> 
>
> ~]# ceph --cluster cephpa1 health detail | grep down
> .
> PG_AVAILABILITY Reduced data availability: 1 pg inactive, 1 pg down
>  pg 85.25 is down+remapped, acting
> [2147483647,2147483647,96,2147483647,2147483647]
> ~]# ceph --cluster cephpa1 pg 85.25 query
> {
>

[ceph-users] Re: quincy v17.2.0 QE Validation status

2022-03-28 Thread Neha Ojha
On Mon, Mar 28, 2022 at 2:48 PM Yuri Weinstein  wrote:
>
> We are trying to release v17.2.0 as soon as possible.
> And need to do a quick approval of tests and review failures.
>
> Still outstanding are two PRs:
> https://github.com/ceph/ceph/pull/45673
> https://github.com/ceph/ceph/pull/45604
>
> The build is failing and I need help to fix it ASAP.
> (
> https://shaman.ceph.com/builds/ceph/wip-yuri11-testing-2022-03-28-0907-quincy/61b142c76c991abe3fe77390e384b025e1711757/
> )
>
> Details of this release are summarized here:
>
> https://tracker.ceph.com/issues/55089
> Release Notes - https://github.com/ceph/ceph/pull/45048
>
> Seeking approvals for:
>
> smoke - Neha, Josh (the failure appears reproducible)

smoke/basic/{clusters/{fixed-3-cephfs openstack}
objectstore/bluestore-bitmap supported-random-distro$/{ubuntu_latest}
tasks/{0-install test/kclient_workunit_suites_pjd}} is failing
consistently in the kclient task. Venky, can you please check whether
this is a configuration/test issue or a bug? Here's an example
http://pulpito.front.sepia.ceph.com/yuriw-2022-03-28_19:24:40-smoke-quincy-distro-default-smithi/6765567/.

Thanks,
Neha




> rgw - Casey
> fs - Venky, Gerg
> rbd - Ilya, Deepika
> krbd  Ilya, Deepika
> upgrade/octopus-x - Casey
> powercycle - Brag (SELinux denials)
> ceph-volume - Guillaume, David G
>
> Please reply to this email with approval and/or trackers of known issues/PRs
> to address them.
>
> Thx
> YuriW
> ___
> Dev mailing list -- d...@ceph.io
> To unsubscribe send an email to dev-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: quincy v17.2.0 QE Validation status

2022-03-28 Thread Venky Shankar
Hey Yuri,


On Tue, Mar 29, 2022 at 3:18 AM Yuri Weinstein  wrote:
>
> We are trying to release v17.2.0 as soon as possible.
> And need to do a quick approval of tests and review failures.
>
> Still outstanding are two PRs:
> https://github.com/ceph/ceph/pull/45673
> https://github.com/ceph/ceph/pull/45604

There are some outstanding CephFS PRs - labeled needs-qa, quincy-batch-1.

Do you plan to include those?

>
> The build is failing and I need help to fix it ASAP.
> (
> https://shaman.ceph.com/builds/ceph/wip-yuri11-testing-2022-03-28-0907-quincy/61b142c76c991abe3fe77390e384b025e1711757/
> )
>
> Details of this release are summarized here:
>
> https://tracker.ceph.com/issues/55089
> Release Notes - https://github.com/ceph/ceph/pull/45048
>
> Seeking approvals for:
>
> smoke - Neha, Josh (the failure appears reproducible)
> rgw - Casey
> fs - Venky, Gerg
> rbd - Ilya, Deepika
> krbd  Ilya, Deepika
> upgrade/octopus-x - Casey
> powercycle - Brag (SELinux denials)
> ceph-volume - Guillaume, David G
>
> Please reply to this email with approval and/or trackers of known issues/PRs
> to address them.
>
> Thx
> YuriW
> ___
> Dev mailing list -- d...@ceph.io
> To unsubscribe send an email to dev-le...@ceph.io



-- 
Cheers,
Venky

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: quincy v17.2.0 QE Validation status

2022-03-28 Thread Venky Shankar
On Tue, Mar 29, 2022 at 10:56 AM Venky Shankar  wrote:
>
> Hey Yuri,
>
>
> On Tue, Mar 29, 2022 at 3:18 AM Yuri Weinstein  wrote:
> >
> > We are trying to release v17.2.0 as soon as possible.
> > And need to do a quick approval of tests and review failures.
> >
> > Still outstanding are two PRs:
> > https://github.com/ceph/ceph/pull/45673
> > https://github.com/ceph/ceph/pull/45604
>
> There are some outstanding CephFS PRs - labeled needs-qa, quincy-batch-1.
>
> Do you plan to include those?

We would need PR

https://github.com/ceph/ceph/pull/45558

at the minimum to avoid a regression.

>
> >
> > The build is failing and I need help to fix it ASAP.
> > (
> > https://shaman.ceph.com/builds/ceph/wip-yuri11-testing-2022-03-28-0907-quincy/61b142c76c991abe3fe77390e384b025e1711757/
> > )
> >
> > Details of this release are summarized here:
> >
> > https://tracker.ceph.com/issues/55089
> > Release Notes - https://github.com/ceph/ceph/pull/45048
> >
> > Seeking approvals for:
> >
> > smoke - Neha, Josh (the failure appears reproducible)
> > rgw - Casey
> > fs - Venky, Gerg
> > rbd - Ilya, Deepika
> > krbd  Ilya, Deepika
> > upgrade/octopus-x - Casey
> > powercycle - Brag (SELinux denials)
> > ceph-volume - Guillaume, David G
> >
> > Please reply to this email with approval and/or trackers of known
> > issues/PRs to address them.
> >
> > Thx
> > YuriW
> > ___
> > Dev mailing list -- d...@ceph.io
> > To unsubscribe send an email to dev-le...@ceph.io
>
>
>
> --
> Cheers,
> Venky



-- 
Cheers,
Venky

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io