Re: [ceph-users] Crashed MDS (segfault)

2019-10-22 Thread Yan, Zheng
On Tue, Oct 22, 2019 at 1:49 AM Gustavo Tonini  wrote:
>
> Is there a possibility to lose data if I use "cephfs-data-scan init  
> --force-init"?
>

It only causes an incorrect stat on the root inode; it can't cause data loss.

Running 'ceph daemon mds.a scrub_path / force repair' after the MDS
restarts can fix the incorrect stat.
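
For reference, the full sequence is roughly this (assuming rank 0 and an MDS
daemon named mds.a; adjust the names to your cluster):

# recreate the root inode metadata (only its stat info is affected)
cephfs-data-scan init --force-init
# restart the MDS, then repair the incorrect stat on the root inode
ceph daemon mds.a scrub_path / force repair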

> On Mon, Oct 21, 2019 at 4:36 AM Yan, Zheng  wrote:
>>
>> On Fri, Oct 18, 2019 at 9:10 AM Gustavo Tonini  
>> wrote:
>> >
>> > Hi Zheng,
>> > the cluster is running ceph mimic. This warning about network only appears 
>> > when using nautilus' cephfs-journal-tool.
>> >
>> > "cephfs-data-scan scan_links" does not report any issue.
>> >
>> > How could variable "newparent" be NULL at 
>> > https://github.com/ceph/ceph/blob/master/src/mds/SnapRealm.cc#L599 ? Is 
>> > there a way to fix this?
>> >
>>
>>
>> try 'cephfs-data-scan init'. It will setup root inode's snaprealm.
>>
>> > On Thu, Oct 17, 2019 at 9:58 PM Yan, Zheng  wrote:
>> >>
>> >> On Thu, Oct 17, 2019 at 10:19 PM Gustavo Tonini  
>> >> wrote:
>> >> >
>> >> > No. The cluster was just rebalancing.
>> >> >
>> >> > The journal seems damaged:
>> >> >
>> >> > ceph@deployer:~$ cephfs-journal-tool --rank=fs_padrao:0 journal inspect
>> >> > 2019-10-16 17:46:29.596 7fcd34cbf700 -1 NetHandler create_socket 
>> >> > couldn't create socket (97) Address family not supported by protocol
>> >>
A corrupted journal shouldn't cause an error like this. This looks more like
a network issue; please double-check your cluster's network configuration.
>> >>
>> >> > Overall journal integrity: DAMAGED
>> >> > Corrupt regions:
>> >> > 0x1c5e4d904ab-1c5e4d9ddbc
>> >> > ceph@deployer:~$
>> >> >
>> >> > Could a journal reset help with this?
>> >> >
>> >> > I could snapshot all FS pools and export the journal beforehand to 
>> >> > guarantee a rollback to this state if something goes wrong with the 
>> >> > journal reset.
>> >> >
>> >> > On Thu, Oct 17, 2019, 09:07 Yan, Zheng  wrote:
>> >> >>
>> >> >> On Tue, Oct 15, 2019 at 12:03 PM Gustavo Tonini 
>> >> >>  wrote:
>> >> >> >
>> >> >> > Dear ceph users,
>> >> >> > we're experiencing a segfault during MDS startup (replay process) 
>> >> >> > which is making our FS inaccessible.
>> >> >> >
>> >> >> > MDS log messages:
>> >> >> >
>> >> >> > Oct 15 03:41:39.894584 mds1 ceph-mds:   -472> 2019-10-15 
>> >> >> > 00:40:30.201 7f3c08f49700  1 -- 192.168.8.195:6800/3181891717 <== 
>> >> >> > osd.26 192.168.8.209:6821/2419345 3  osd_op_reply(21 1. 
>> >> >> > [getxattr] v0'0 uv0 ondisk = -61 ((61) No data available)) v8  
>> >> >> > 154+0+0 (3715233608 0 0) 0x2776340 con 0x18bd500
>> >> >> > Oct 15 03:41:39.894584 mds1 ceph-mds:   -472> 2019-10-15 
>> >> >> > 00:40:30.201 7f3c00589700 10 MDSIOContextBase::complete: 
>> >> >> > 18C_IO_Inode_Fetched
>> >> >> > Oct 15 03:41:39.894658 mds1 ceph-mds:   -472> 2019-10-15 
>> >> >> > 00:40:30.201 7f3c00589700 10 mds.0.cache.ino(0x100) _fetched got 0 
>> >> >> > and 544
>> >> >> > Oct 15 03:41:39.894658 mds1 ceph-mds:   -472> 2019-10-15 
>> >> >> > 00:40:30.201 7f3c00589700 10 mds.0.cache.ino(0x100)  magic is 'ceph 
>> >> >> > fs volume v011' (expecting 'ceph fs volume v011')
>> >> >> > Oct 15 03:41:39.894735 mds1 ceph-mds:   -472> 2019-10-15 
>> >> >> > 00:40:30.201 7f3c00589700 10  mds.0.cache.snaprealm(0x100 seq 1 
>> >> >> > 0x1799c00) open_parents [1,head]
>> >> >> > Oct 15 03:41:39.894735 mds1 ceph-mds:   -472> 2019-10-15 
>> >> >> > 00:40:30.201 7f3c00589700 10 mds.0.cache.ino(0x100) _fetched [inode 
>> >> >> > 0x100 [...2,head] ~mds0/ auth v275131 snaprealm=0x1799c00 f(v0 
>> >> >> > 1=1+0) n(v76166 rc2020-07-17 15:29:27.00 b41838692297 
>> >> >> > -3184=-3168+-16)/n() (iversion lock) 0x18bf800]
>> >> >> > Oct 15 03:41:39.894821 mds1 ceph-mds:   -472> 2019-10-15 
>> >> >> > 00:40:30.201 7f3c00589700 10 MDSIOContextBase::complete: 
>> >> >> > 18C_IO_Inode_Fetched
>> >> >> > Oct 15 03:41:39.894821 mds1 ceph-mds:   -472> 2019-10-15 
>> >> >> > 00:40:30.201 7f3c00589700 10 mds.0.cache.ino(0x1) _fetched got 0 and 
>> >> >> > 482
>> >> >> > Oct 15 03:41:39.894891 mds1 ceph-mds:   -472> 2019-10-15 
>> >> >> > 00:40:30.201 7f3c00589700 10 mds.0.cache.ino(0x1)  magic is 'ceph fs 
>> >> >> > volume v011' (expecting 'ceph fs volume v011')
>> >> >> > Oct 15 03:41:39.894958 mds1 ceph-mds:   -472> 2019-10-15 00:40:30.205 7f3c00589700 -1 *** Caught signal (Segmentation fault) **
>> >> >> >  in thread 7f3c00589700 thread_name:fn_anonymous
>> >> >> > 
>> >> >> >  ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable)
>> >> >> >  1: (()+0x11390) [0x7f3c0e48a390]
>> >> >> >  2: (operator<<(std::ostream&, SnapRealm const&)+0x42) [0x72cb92]
>> >> >> >  3: (SnapRealm::merge_to(SnapRealm*)+0x308) [0x72f488]
>> >> >> >  4: (CInode::decode_snap_blob(ceph::buffer::list&)+0x53) [0x6e1f63]
>> >> >> >  5: (CInode::decode_store(ceph::buffer::list::iterator&)+0x76) [0x702b86]
>> >> >> >  6: (CInode::_fetched(ceph::buffer::list&, 

[ceph-users] How does IOPS/latency scale for additional OSDs? (Intel S3610 SATA SSD, for block storage use case)

2019-10-22 Thread Victor Hooi
Hi,

I'm running a 3-node Ceph cluster for VM block storage (Proxmox/KVM).

Replication is set to 3.

Previously, we were running 1 x Intel Optane 905P 960 GB disk per node, with
4 x OSDs per drive, for total usable storage of 960 GB.

Performance was good, even without significant tuning, I assume largely
because of the Optane disks.

However, we need more storage space.

We have some old 800 GB SSDs we could potentially use (Intel S3610).

I know it's possible to put the WAL/RocksDB on an Optane disk, and have
normal SSDs for the OSDs. I assume we'd go down to a single OSD per disk if
running normal SATA SSDs. However, other people are saying the performance
gain from this isn't that great (e.g.
https://yourcmc.ru/wiki/Ceph_performance).
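
For reference, putting the RocksDB/WAL on the Optane while keeping the data on
a SATA SSD would look something like this per OSD (device paths here are just
placeholders for our layout):

ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1p1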

Each of our 3 nodes has 8 drive bays, so we could populate them with 24 x
800 GB SSDs in total. My questions are:

   1. For the Intel S3610 - should we still run with 1 OSD per disk?
   2. How does performance (IOPS and latency) scale as the number of disks
   increase? (This is for VM block storage).

Thanks,
Victor


Re: [ceph-users] ceph balancer do not start

2019-10-22 Thread David Turner
Off the top of my head, I'd say your cluster might have the wrong tunables
for crush-compat. I know I ran into that when I first set up the balancer
and nothing obviously said that was the problem; only research turned it up
for me.

My real question, though, is why aren't you using upmap? It is
significantly better than crush-compat. Unless you have clients on really
old kernels that can't update or that are on pre-luminous Ceph versions
that can't update, there's really no reason not to use upmap.
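
The switch itself is only a few commands (a rough sketch -- check client
compatibility with `ceph features` first):

ceph osd set-require-min-compat-client luminous
ceph balancer mode upmap
ceph balancer on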

On Mon, Oct 21, 2019, 8:08 AM Jan Peters  wrote:

> Hello,
>
> I use ceph 12.2.12 and would like to activate the ceph balancer.
>
> unfortunately no redistribution of the PGs is started:
>
> ceph balancer status
> {
> "active": true,
> "plans": [],
> "mode": "crush-compat"
> }
>
> ceph balancer eval
> current cluster score 0.023776 (lower is better)
>
>
> ceph config-key dump
> {
> "initial_mon_keyring":
> "AQBLchlbABAA+5CuVU+8MB69xfc3xAXkjQ==",
> "mgr/balancer/active": "1",
> "mgr/balancer/max_misplaced:": "0.01",
> "mgr/balancer/mode": "crush-compat"
> }
>
>
> What am I not doing correctly?
>
> best regards


Re: [ceph-users] Decreasing the impact of reweighting osds

2019-10-22 Thread David Turner
Most of the time you are better served with simpler settings like
osd_recovery_sleep, which has 3 variants if you have multiple types of OSDs
in your cluster (osd_recovery_sleep_hdd, osd_recovery_sleep_ssd,
osd_recovery_sleep_hybrid).
Using those you can tweak a specific type of OSD that might be having
problems during recovery/backfill while allowing the others to continue to
backfill at regular speeds.
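
For example, to slow recovery on just the spinners at runtime (the value is
only illustrative, tune it to your hardware):

ceph tell osd.* injectargs '--osd_recovery_sleep_hdd 0.1'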

Additionally you mentioned reweighting OSDs, but it sounded like you do
this manually. The balancer module, especially in upmap mode, can be
configured quite well to minimize client IO impact while balancing. You can
specify times of day that it can move data (only in UTC, it ignores local
timezones), a threshold of misplaced data that it will stop moving PGs at,
the increment size it will change weights with per operation, how many
weights it will adjust with each pass, etc.
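
A rough sketch of how those knobs are set (the exact key names vary a bit
between releases, so double-check them for your version):

ceph config-key set mgr/balancer/begin_time 2300
ceph config-key set mgr/balancer/end_time 0600
ceph config-key set mgr/balancer/max_misplaced 0.01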

On Tue, Oct 22, 2019, 6:07 PM Mark Kirkwood 
wrote:

> Thanks - that's a good suggestion!
>
> However I'd still like to know the answers to my 2 questions.
>
> regards
>
> Mark
>
> On 22/10/19 11:22 pm, Paul Emmerich wrote:
> > getting rid of filestore solves most latency spike issues during
> > recovery because they are often caused by random XFS hangs (splitting
> > dirs or just xfs having a bad day)
> >
> >
> > Paul
> >


Re: [ceph-users] How to reset compat weight-set changes caused by PG balancer module?

2019-10-22 Thread Konstantin Shalygin

Apparently the PG balancer crush-compat mode adds some crush bucket weights. 
Those cause major havoc in our cluster, our PG distribution is all over the 
place.
Seeing things like this:...
  97   hdd 9.09470  1.0 9.1 TiB 6.3 TiB 6.3 TiB  32 KiB  17 GiB 2.8 TiB 
69.03 1.08  28 up
  98   hdd 9.09470  1.0 9.1 TiB 4.5 TiB 4.5 TiB  96 KiB  11 GiB 4.6 TiB 
49.51 0.77  20 up
  99   hdd 9.09470  1.0 9.1 TiB 7.0 TiB 6.9 TiB  80 KiB  18 GiB 2.1 TiB 
76.47 1.20  31 up
Fill rates range from 50 to 90%. Unfortunately, reweighting doesn't seem to help 
and I suspect it's because of the bucket weights, which are WEIRD:
     bucket_id -42
     weight_set [
   [ 7.846 11.514 9.339 9.757 10.173 8.900 9.164 6.759 ]


I disabled the module already but the rebalance is broken now.
Do I have to hand reset this and push a new crush map? This is a sensitive 
production cluster, I don't feel pretty good about that.
Thanks for any ideas..


Run `ceph osd crush weight-set rm-compat` and use upmap mode instead.
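
i.e. roughly this sequence (assuming all clients are Luminous or newer):

ceph osd crush weight-set rm-compat
ceph osd set-require-min-compat-client luminous
ceph balancer mode upmap
ceph balancer on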



k



Re: [ceph-users] Decreasing the impact of reweighting osds

2019-10-22 Thread Mark Kirkwood

Thanks - that's a good suggestion!

However I'd still like to know the answers to my 2 questions.

regards

Mark

On 22/10/19 11:22 pm, Paul Emmerich wrote:

getting rid of filestore solves most latency spike issues during
recovery because they are often caused by random XFS hangs (splitting
dirs or just xfs having a bad day)


Paul




Re: [ceph-users] TCMU Runner: Could not check lock ownership. Error: Cannot send after transport endpoint shutdown

2019-10-22 Thread Mike Christie
Ignore my log request. I think I know what is going on.

Let me do some testing here and I will make a test rpm for you.


On 10/22/2019 04:38 PM, Kilian Ries wrote:
> - Each LUN is exported to multiple clients (at the same time)
> 
> - yes, IO is done to the LUNs (read and write); (oVirt runs VMs on each
> of the LUNs)
> 
> 
> Ok, i'll update this tomorrow with the logs you asked for ...
> 
> 
> *Von:* Mike Christie 
> *Gesendet:* Dienstag, 22. Oktober 2019 19:43:40
> *An:* Kilian Ries; ceph-users@lists.ceph.com
> *Betreff:* Re: [ceph-users] TCMU Runner: Could not check lock ownership.
> Error: Cannot send after transport endpoint shutdown
>  
> On 10/22/2019 03:20 AM, Kilian Ries wrote:
>> Hi,
>> 
>> 
>> I'm running a Ceph cluster with 4x iSCSI exporter nodes and oVirt on the
>> client side. In the tcmu-runner logs I see the following happening every
>> few seconds:
>> 
>> 
> 
> Are you exporting a LUN to one client or multiple clients at the same time?
> 
>> 
>> tcmu-runner-1.4.0-106.gd17d24e.el7.x86_64
> 
> Are you doing any IO to the iscsi LUN?
> 
> If not, then we normally saw this with an older version. It would start
> at dm-multipath initialization and then just continue forever. Your
> package looks like it has the fix:
> 
> commit dd7dd51c6cafa8bbcd3ca0eef31fb378b27ff499
> Author: Mike Christie 
> Date:   Mon Jan 14 17:06:27 2019 -0600
> 
> Allow some commands to run while taking lock
> 
> 
> so we should not be seeing it.
> 
> Could you turn on tcmu-runner debugging? Open the file:
> 
> /etc/tcmu/tcmu.conf
> 
> and set:
> 
> log_level = 5
> 
> Do this while you are hitting this bug. I only need a couple seconds so
> I can see what commands are being sent.
> 



Re: [ceph-users] TCMU Runner: Could not check lock ownership. Error: Cannot send after transport endpoint shutdown

2019-10-22 Thread Kilian Ries
- Each LUN is exported to multiple clients (at the same time)

- yes, IO is done to the LUNs (read and write); (oVirt runs VMs on each of the 
LUNs)


Ok, i'll update this tomorrow with the logs you asked for ...


Von: Mike Christie 
Gesendet: Dienstag, 22. Oktober 2019 19:43:40
An: Kilian Ries; ceph-users@lists.ceph.com
Betreff: Re: [ceph-users] TCMU Runner: Could not check lock ownership. Error: 
Cannot send after transport endpoint shutdown

On 10/22/2019 03:20 AM, Kilian Ries wrote:
> Hi,
>
>
> I'm running a Ceph cluster with 4x iSCSI exporter nodes and oVirt on the
> client side. In the tcmu-runner logs I see the following happening every
> few seconds:
>
>

Are you exporting a LUN to one client or multiple clients at the same time?

>
> tcmu-runner-1.4.0-106.gd17d24e.el7.x86_64

Are you doing any IO to the iscsi LUN?

If not, then we normally saw this with an older version. It would start
at dm-multipath initialization and then just continue forever. Your
package looks like it has the fix:

commit dd7dd51c6cafa8bbcd3ca0eef31fb378b27ff499
Author: Mike Christie 
Date:   Mon Jan 14 17:06:27 2019 -0600

Allow some commands to run while taking lock


so we should not be seeing it.

Could you turn on tcmu-runner debugging? Open the file:

/etc/tcmu/tcmu.conf

and set:

log_level = 5

Do this while you are hitting this bug. I only need a couple seconds so
I can see what commands are being sent.



Re: [ceph-users] clust recovery stuck

2019-10-22 Thread Andras Pataki

Hi Philipp,

Given 256 PGs triple-replicated onto 4 up OSDs (roughly 256 x 3 / 4 = 192 
PGs per OSD on average), you might be hitting the "PG overdose protection" 
of OSDs.  Take a look at 'ceph osd df' and check the number of PGs mapped 
to each OSD (last column or near the last).  The default limit is 200, so 
if any OSD exceeds that, it would explain the freeze, since the OSD will 
simply ignore the excess.  In that case, try increasing mon_max_pg_per_osd 
to, say, 400 and see if that helps.  This would allow the recovery to 
proceed - but you should consider adding OSDs (or at least increasing the 
memory allocated to OSDs above the defaults).
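
A minimal sketch of raising the limit at runtime (this assumes a release that
actually has the option; it may also require a restart to take full effect):

ceph tell mon.* injectargs '--mon_max_pg_per_osd 400'
ceph tell osd.* injectargs '--mon_max_pg_per_osd 400'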


Andras

On 10/22/19 3:02 PM, Philipp Schwaha wrote:

hi,

On 2019-10-22 08:05, Eugen Block wrote:

Hi,

can you share `ceph osd tree`? What crush rules are in use in your
cluster? I assume that the two failed OSDs prevent the remapping because
the rules can't be applied.


ceph osd tree gives:

ID WEIGHT   TYPE NAMEUP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 27.94199 root default
-2  9.31400 host alpha.local
  0  4.65700 osd.0   down0  1.0
  3  4.65700 osd.3 up  1.0  1.0
-3  9.31400 host beta.local
  1  4.65700 osd.1 up  1.0  1.0
  6  4.65700 osd.6   down0  1.0
-4  9.31400 host gamma.local
  2  4.65700 osd.2 up  1.0  1.0
  4  4.65700 osd.4 up  1.0  1.0


the crush rules should be fairly simple, nothing particularly customized
as far as I can tell:
'ceph osd crush tree' gives:
[
 {
 "id": -1,
 "name": "default",
 "type": "root",
 "type_id": 10,
 "items": [
 {
 "id": -2,
 "name": "alpha.local",
 "type": "host",
 "type_id": 1,
 "items": [
 {
 "id": 0,
 "name": "osd.0",
 "type": "osd",
 "type_id": 0,
 "crush_weight": 4.656998,
 "depth": 2
 },
 {
 "id": 3,
 "name": "osd.3",
 "type": "osd",
 "type_id": 0,
 "crush_weight": 4.656998,
 "depth": 2
 }
 ]
 },
 {
 "id": -3,
 "name": "beta.local",
 "type": "host",
 "type_id": 1,
 "items": [
 {
 "id": 1,
 "name": "osd.1",
 "type": "osd",
 "type_id": 0,
 "crush_weight": 4.656998,
 "depth": 2
 },
 {
 "id": 6,
 "name": "osd.6",
 "type": "osd",
 "type_id": 0,
 "crush_weight": 4.656998,
 "depth": 2
 }
 ]
 },
 {
 "id": -4,
 "name": "gamma.local",
 "type": "host",
 "type_id": 1,
 "items": [
 {
 "id": 2,
 "name": "osd.2",
 "type": "osd",
 "type_id": 0,
 "crush_weight": 4.656998,
 "depth": 2
 },
 {
 "id": 4,
 "name": "osd.4",
 "type": "osd",
 "type_id": 0,
 "crush_weight": 4.656998,
 "depth": 2
 }
 ]
 }
 ]
 }
]

and 'ceph osd crush rule dump' gives:
[
 {
 "rule_id": 0,
 "rule_name": "replicated_ruleset",
 "ruleset": 0,
 "type": 1,
 "min_size": 1,
 "max_size": 10,
 "steps": [
 {
 "op": "take",
 "item": -1,
 "item_name": "default"
 },
 {
 "op": "chooseleaf_firstn",
 "num": 0,
 "type": "host"
 },
 {
 "op": "emit"
 }
 ]
 }
]

the cluster actually reached health ok after osd.0 went down, but when
osd.6 went down it did not recover. the cluster is running ceph version
10.2.2.

any help is greatly appreciated!

thanks & cheers
Philipp


Zitat von Philipp Schwaha :


hi,

I have a problem with a cluster 

Re: [ceph-users] clust recovery stuck

2019-10-22 Thread Philipp Schwaha
hi,

On 2019-10-22 08:05, Eugen Block wrote:
> Hi,
> 
> can you share `ceph osd tree`? What crush rules are in use in your
> cluster? I assume that the two failed OSDs prevent the remapping because
> the rules can't be applied.
> 

ceph osd tree gives:

ID WEIGHT   TYPE NAMEUP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 27.94199 root default
-2  9.31400 host alpha.local
 0  4.65700 osd.0   down0  1.0
 3  4.65700 osd.3 up  1.0  1.0
-3  9.31400 host beta.local
 1  4.65700 osd.1 up  1.0  1.0
 6  4.65700 osd.6   down0  1.0
-4  9.31400 host gamma.local
 2  4.65700 osd.2 up  1.0  1.0
 4  4.65700 osd.4 up  1.0  1.0


the crush rules should be fairly simple, nothing particularly customized
as far as I can tell:
'ceph osd crush tree' gives:
[
{
"id": -1,
"name": "default",
"type": "root",
"type_id": 10,
"items": [
{
"id": -2,
"name": "alpha.local",
"type": "host",
"type_id": 1,
"items": [
{
"id": 0,
"name": "osd.0",
"type": "osd",
"type_id": 0,
"crush_weight": 4.656998,
"depth": 2
},
{
"id": 3,
"name": "osd.3",
"type": "osd",
"type_id": 0,
"crush_weight": 4.656998,
"depth": 2
}
]
},
{
"id": -3,
"name": "beta.local",
"type": "host",
"type_id": 1,
"items": [
{
"id": 1,
"name": "osd.1",
"type": "osd",
"type_id": 0,
"crush_weight": 4.656998,
"depth": 2
},
{
"id": 6,
"name": "osd.6",
"type": "osd",
"type_id": 0,
"crush_weight": 4.656998,
"depth": 2
}
]
},
{
"id": -4,
"name": "gamma.local",
"type": "host",
"type_id": 1,
"items": [
{
"id": 2,
"name": "osd.2",
"type": "osd",
"type_id": 0,
"crush_weight": 4.656998,
"depth": 2
},
{
"id": 4,
"name": "osd.4",
"type": "osd",
"type_id": 0,
"crush_weight": 4.656998,
"depth": 2
}
]
}
]
}
]

and 'ceph osd crush rule dump' gives:
[
{
"rule_id": 0,
"rule_name": "replicated_ruleset",
"ruleset": 0,
"type": 1,
"min_size": 1,
"max_size": 10,
"steps": [
{
"op": "take",
"item": -1,
"item_name": "default"
},
{
"op": "chooseleaf_firstn",
"num": 0,
"type": "host"
},
{
"op": "emit"
}
]
}
]

the cluster actually reached health ok after osd.0 went down, but when
osd.6 went down it did not recover. the cluster is running ceph version
10.2.2.

any help is greatly appreciated!

thanks & cheers
Philipp

> 
> Zitat von Philipp Schwaha :
> 
>> hi,
>>
>> I have a problem with a cluster being stuck in recovery after osd
>> failure. At first recovery was doing quite well, but now it just sits
>> there without any progress. It currently looks like this:
>>
>>  health HEALTH_ERR
>>     36 pgs are stuck inactive for more than 300 seconds
>>     50 pgs backfill_wait
>>     52 pgs degraded
>>     36 pgs down
>>     36 pgs peering
>>     1 pgs recovering
>>     1 pgs recovery_wait
>>     36 pgs stuck inactive
>>     52 pgs stuck unclean
>>     52 pgs undersized
>>     recovery 261632/2235446 objects degraded (11.704%)
>>     recovery 259813/2235446 objects misplaced (11.622%)
>>     recovery 2/1117723 unfound (0.000%)
>>  

Re: [ceph-users] TCMU Runner: Could not check lock ownership. Error: Cannot send after transport endpoint shutdown

2019-10-22 Thread Mike Christie
On 10/22/2019 03:20 AM, Kilian Ries wrote:
> Hi,
> 
> 
> I'm running a Ceph cluster with 4x iSCSI exporter nodes and oVirt on the
> client side. In the tcmu-runner logs I see the following happening every
> few seconds:
> 
> 

Are you exporting a LUN to one client or multiple clients at the same time?

> 
> tcmu-runner-1.4.0-106.gd17d24e.el7.x86_64

Are you doing any IO to the iscsi LUN?

If not, then we normally saw this with an older version. It would start
at dm-multipath initialization and then just continue forever. Your
package looks like it has the fix:

commit dd7dd51c6cafa8bbcd3ca0eef31fb378b27ff499
Author: Mike Christie 
Date:   Mon Jan 14 17:06:27 2019 -0600

Allow some commands to run while taking lock


so we should not be seeing it.

Could you turn on tcmu-runner debugging? Open the file:

/etc/tcmu/tcmu.conf

and set:

log_level = 5

Do this while you are hitting this bug. I only need a couple seconds so
I can see what commands are being sent.



Re: [ceph-users] Updating crush location on all nodes of a cluster

2019-10-22 Thread Alexandre Berthaud
Hey Martin,

Alright then, we'll just go with the update of every osd's location at once
then; just wanted to be sure this was not a problem. :)

On Tue, Oct 22, 2019 at 1:21 PM Martin Verges 
wrote:

> Hello Alexandre,
>
> maybe you take a look into https://www.youtube.com/watch?v=V33f7ipw9d4 where
> you can see how easy Ceph CRUSH can be managed.
>
> 1. Changing the locations of all hosts at once
>> We are worried that this will generate too much IO and network activity
>> (and there is no way to pause / throttle this AFAIK). Maybe this is not
>> actually an issue?
>
>
> Just configure the cluster to allow slow recovery before changing the
> crush map. Typical options that might help you are "osd recovery sleep
> hdd|hybrid|ssd" and "osd max backfills".
>
> 2. Changing the locations of a couple hosts to reduce data movement
>> We are afraid that if we set 2 hosts to DC1, 2 hosts to DC2 and leave the
>> rest as-is; Ceph will behave as if there are 3 DCs and will try and fill
>> those 4 hosts with as many replicas as possible until they are full.
>>
>
> If you leave any data unsorted, you will never know what data copies are
> getting unavailable. In fact, you will produce service impact with such
> setups in case of one data center fails.
> Do you use any EC configuration suitable for 2 DC configurations, or do
> you use replica and want to tolerate having 2 missing copies at the same
> time?
>
> 3. Try and move PGs ahead of the change?
>> Maybe we could move PGs so that each PG has a replica on an OSD of each
>> DC *before* updating the crush map so that the update does not have to
>> actually move any data? (which would allow us to do this at the desired
>> pace)
>>
>
> Maybe the PG UPMAP is something that you can use for this, but your
> cluster hardware and configuration should always be configured to handle
> workloads like this rebalance without impacting your clients. See 1.
>
> 4. Something else?
>> Thank you for your time and your help. :)
>>
>
> You are welcome as every Ceph user! ;)
>
> --
> Martin Verges
> Managing director
>
> Mobile: +49 174 9335695
> E-Mail: martin.ver...@croit.io
> Chat: https://t.me/MartinVerges
>
> croit GmbH, Freseniusstr. 31h, 81247 Munich
> CEO: Martin Verges - VAT-ID: DE310638492
> Com. register: Amtsgericht Munich HRB 231263
>
> Web: https://croit.io
> YouTube: https://goo.gl/PGE1Bx
>
>
> Am Di., 22. Okt. 2019 um 11:37 Uhr schrieb Alexandre Berthaud <
> alexandre.berth...@clever-cloud.com>:
>
>> Hello everyone,
>>
>> We have a Ceph cluster (running 14.2.2) which has already dozens of TB of
>> data and... we did not set the location of the OSD hosts. The hosts are
>> located in 2 datacenters. We would like to update the locations of all the
>> hosts so not all replicas end up in a single DC.
>>
>> We are wondering how we should go about this.
>>
>> 1. Changing the locations of all hosts at once
>>
>> We are worried that this will generate too much IO and network activity
>> (and there is no way to pause / throttle this AFAIK). Maybe this is not
>> actually an issue?
>>
>> 2. Changing the locations of a couple hosts to reduce data movement
>>
>> We are afraid that if we set 2 hosts to DC1, 2 hosts to DC2 and leave the
>> rest as-is; Ceph will behave as if there are 3 DCs and will try and fill
>> those 4 hosts with as many replicas as possible until they are full.
>>
>> 3. Try and move PGs ahead of the change?
>>
>> Maybe we could move PGs so that each PG has a replica on an OSD of each
>> DC *before* updating the crush map so that the update does not have to
>> actually move any data? (which would allow us to do this at the desired
>> pace)
>>
>> 4. Something else?
>>
>> Thank you for your time and your help. :)


Re: [ceph-users] Updating crush location on all nodes of a cluster

2019-10-22 Thread Martin Verges
Hello Alexandre,

Maybe take a look at https://www.youtube.com/watch?v=V33f7ipw9d4, where
you can see how easily Ceph CRUSH can be managed.

1. Changing the locations of all hosts at once
> We are worried that this will generate too much IO and network activity
> (and there is no way to pause / throttle this AFAIK). Maybe this is not
> actually an issue?


Just configure the cluster to allow slow recovery before changing the crush
map. Typical options that might help you are "osd recovery sleep
hdd|hybrid|ssd" and "osd max backfills".

2. Changing the locations of a couple hosts to reduce data movement
> We are afraid that if we set 2 hosts to DC1, 2 hosts to DC2 and leave the
> rest as-is; Ceph will behave as if there are 3 DCs and will try and fill
> those 4 hosts with as many replicas as possible until they are full.
>

If you leave any data unsorted, you will never know which data copies become
unavailable. In fact, you will see service impact with such a setup if one
data center fails.
Do you use any EC configuration suitable for 2 DC configurations, or do you
use replica and want to tolerate having 2 missing copies at the same time?

3. Try and move PGs ahead of the change?
> Maybe we could move PGs so that each PG has a replica on an OSD of each DC
> *before* updating the crush map so that the update does not have to
> actually move any data? (which would allow us to do this at the desired
> pace)
>

Maybe the PG UPMAP is something that you can use for this, but your cluster
hardware and configuration should always be configured to handle workloads
like this rebalance without impacting your clients. See 1.
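
For completeness, a hand-placed upmap entry looks roughly like this (the PG
and OSD ids are made up, and it requires require-min-compat-client luminous):

# move the copy of PG 2.1a that sits on osd.7 over to osd.12
ceph osd pg-upmap-items 2.1a 7 12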

4. Something else?
> Thank you for your time and your help. :)
>

You are welcome as every Ceph user! ;)

--
Martin Verges
Managing director

Mobile: +49 174 9335695
E-Mail: martin.ver...@croit.io
Chat: https://t.me/MartinVerges

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263

Web: https://croit.io
YouTube: https://goo.gl/PGE1Bx


Am Di., 22. Okt. 2019 um 11:37 Uhr schrieb Alexandre Berthaud <
alexandre.berth...@clever-cloud.com>:

> Hello everyone,
>
> We have a Ceph cluster (running 14.2.2) which has already dozens of TB of
> data and... we did not set the location of the OSD hosts. The hosts are
> located in 2 datacenters. We would like to update the locations of all the
> hosts so not all replicas end up in a single DC.
>
> We are wondering how we should go about this.
>
> 1. Changing the locations of all hosts at once
>
> We are worried that this will generate too much IO and network activity
> (and there is no way to pause / throttle this AFAIK). Maybe this is not
> actually an issue?
>
> 2. Changing the locations of a couple hosts to reduce data movement
>
> We are afraid that if we set 2 hosts to DC1, 2 hosts to DC2 and leave the
> rest as-is; Ceph will behave as if there are 3 DCs and will try and fill
> those 4 hosts with as many replicas as possible until they are full.
>
> 3. Try and move PGs ahead of the change?
>
> Maybe we could move PGs so that each PG has a replica on an OSD of each DC
> *before* updating the crush map so that the update does not have to
> actually move any data? (which would allow us to do this at the desired
> pace)
>
> 4. Something else?
>
> Thank you for your time and your help. :)


Re: [ceph-users] Decreasing the impact of reweighting osds

2019-10-22 Thread Paul Emmerich
getting rid of filestore solves most latency spike issues during
recovery because they are often caused by random XFS hangs (splitting
dirs or just xfs having a bad day)


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Tue, Oct 22, 2019 at 6:02 AM Mark Kirkwood
 wrote:
>
> We recently needed to reweight a couple of OSDs on one of our clusters
> (luminous on Ubuntu, 8 hosts, 8 OSDs/host). I think we reweighted by
> approx 0.2. This was perhaps too much, as IO latency on RBD drives
> spiked to several seconds at times.
>
> We'd like to lessen this effect as much as we can. So we are looking at
> priority and queue parameters (OSDs are Filestore based with S3700 SSD
> or similar NVME journals):
>
> # priorities
> osd_client_op_priority
> osd_recovery_op_priority
> osd_recovery_priority
> osd_scrub_priority
> osd_snap_trim_priority
>
> # queue tuning
> filestore_queue_max_ops
> filestore_queue_low_threshhold
> filestore_queue_high_threshhold
> filestore_expected_throughput_ops
> filestore_queue_high_delay_multiple
> filestore_queue_max_delay_multiple
>
> My first question is this - do these parameters require the CFQ
> scheduler (like osd_disk_thread_ioprio_priority does)? We are currently
> using deadline (we have not tweaked queue/iosched/write_expire down from
> 5000 to 1500 which might be good to do).
>
> My 2nd question is - should we consider increasing
> osd_disk_thread_ioprio_priority (and hence changing to CFQ scheduler)? I
> usually see this parameter discussed WRT scrubbing, and we are not
> having issues with that.
>
> regards
>
> Mark
>
>


[ceph-users] Updating crush location on all nodes of a cluster

2019-10-22 Thread Alexandre Berthaud
Hello everyone,

We have a Ceph cluster (running 14.2.2) which has already dozens of TB of
data and... we did not set the location of the OSD hosts. The hosts are
located in 2 datacenters. We would like to update the locations of all the
hosts so not all replicas end up in a single DC.

We are wondering how we should go about this.

1. Changing the locations of all hosts at once

We are worried that this will generate too much IO and network activity
(and there is no way to pause / throttle this AFAIK). Maybe this is not
actually an issue?

2. Changing the locations of a couple hosts to reduce data movement

We are afraid that if we set 2 hosts to DC1, 2 hosts to DC2 and leave the
rest as-is; Ceph will behave as if there are 3 DCs and will try and fill
those 4 hosts with as many replicas as possible until they are full.

3. Try and move PGs ahead of the change?

Maybe we could move PGs so that each PG has a replica on an OSD of each DC
*before* updating the crush map so that the update does not have to
actually move any data? (which would allow us to do this at the desired
pace)

4. Something else?

Thank you for your time and your help. :)


[ceph-users] TCMU Runner: Could not check lock ownership. Error: Cannot send after transport endpoint shutdown

2019-10-22 Thread Kilian Ries
Hi,


I'm running a Ceph cluster with 4x iSCSI exporter nodes and oVirt on the client 
side. In the tcmu-runner logs I see the following happening every few seconds:


###

2019-10-22 10:11:11.231 1710 [WARN] tcmu_rbd_lock:762 rbd/image.lun0: Acquired 
exclusive lock.

2019-10-22 10:11:11.395 1710 [ERROR] tcmu_rbd_has_lock:516 rbd/image.lun2: 
Could not check lock ownership. Error: Cannot send after transport endpoint 
shutdown.

2019-10-22 10:11:12.346 1710 [WARN] tcmu_notify_lock_lost:222 rbd/image.lun0: 
Async lock drop. Old state 1

2019-10-22 10:11:12.353 1710 [INFO] alua_implicit_transition:566 
rbd/image.lun0: Starting lock acquisition operation.

2019-10-22 10:11:13.325 1710 [INFO] alua_implicit_transition:566 
rbd/image.lun0: Starting lock acquisition operation.

2019-10-22 10:11:13.852 1710 [ERROR] tcmu_rbd_has_lock:516 rbd/image.lun2: 
Could not check lock ownership. Error: Cannot send after transport endpoint 
shutdown.

2019-10-22 10:11:13.854 1710 [ERROR] tcmu_rbd_has_lock:516 rbd/image.lun1: 
Could not check lock ownership. Error: Cannot send after transport endpoint 
shutdown.

2019-10-22 10:11:13.863 1710 [ERROR] tcmu_rbd_has_lock:516 rbd/image.lun1: 
Could not check lock ownership. Error: Cannot send after transport endpoint 
shutdown.

2019-10-22 10:11:14.202 1710 [INFO] alua_implicit_transition:566 
rbd/image.lun0: Starting lock acquisition operation.

2019-10-22 10:11:14.285 1710 [ERROR] tcmu_rbd_has_lock:516 rbd/image.lun2: 
Could not check lock ownership. Error: Cannot send after transport endpoint 
shutdown.

2019-10-22 10:11:15.217 1710 [WARN] tcmu_rbd_lock:762 rbd/image.lun0: Acquired 
exclusive lock.

2019-10-22 10:11:15.873 1710 [ERROR] tcmu_rbd_has_lock:516 rbd/image.lun2: 
Could not check lock ownership. Error: Cannot send after transport endpoint 
shutdown.

2019-10-22 10:11:16.696 1710 [WARN] tcmu_notify_lock_lost:222 rbd/image.lun0: 
Async lock drop. Old state 1

2019-10-22 10:11:16.696 1710 [INFO] alua_implicit_transition:566 
rbd/image.lun0: Starting lock acquisition operation.

2019-10-22 10:11:16.696 1710 [WARN] tcmu_notify_lock_lost:222 rbd/image.lun0: 
Async lock drop. Old state 2

2019-10-22 10:11:16.992 1710 [ERROR] tcmu_rbd_has_lock:516 rbd/image.lun2: 
Could not check lock ownership. Error: Cannot send after transport endpoint 
shutdown.

###



This happens on all of my four iscsi exporter nodes. Blacklist gives me the 
following (number of blacklisted objects does not really shrink):


###

ceph osd blacklist ls


listed 10579 entries

###



On the client site i configured the multipath config like this:


###

device {

vendor "LIO-ORG"

hardware_handler   "1 alua"

path_grouping_policy   "failover"

path_selector  "queue-length 0"

failback   60

path_checker   tur

prio   alua

prio_args  exclusive_pref_bit

fast_io_fail_tmo   25

no_path_retry  queue

}

###


And multipath -ll shows me all four path as "active ready" without errors.



To me this looks like tcmu-runner cannot acquire the exclusive lock and it is 
flapping between nodes. In addition, in the Ceph GUI / dashboard I can see the 
LUNs in the "active / optimized" state flapping between nodes ...




I'm have installed the following versions (CentOS 7.7, Ceph 13.2.6):


###

rpm -qa |egrep "ceph|iscsi|tcmu|rst|kernel"


python-cephfs-13.2.6-0.el7.x86_64

ceph-selinux-13.2.6-0.el7.x86_64

kernel-3.10.0-957.5.1.el7.x86_64

kernel-3.10.0-957.1.3.el7.x86_64

kernel-tools-libs-3.10.0-1062.1.2.el7.x86_64

libcephfs2-13.2.6-0.el7.x86_64

libtcmu-1.4.0-106.gd17d24e.el7.x86_64

ceph-common-13.2.6-0.el7.x86_64

ceph-osd-13.2.6-0.el7.x86_64

tcmu-runner-1.4.0-106.gd17d24e.el7.x86_64

kernel-3.10.0-1062.1.2.el7.x86_64

ceph-iscsi-3.3-1.el7.noarch

kernel-headers-3.10.0-1062.1.2.el7.x86_64

kernel-3.10.0-862.14.4.el7.x86_64

ceph-base-13.2.6-0.el7.x86_64

kernel-tools-3.10.0-1062.1.2.el7.x86_64

###


Greets,

Kilian


[ceph-users] How to reset compat weight-set changes caused by PG balancer module?

2019-10-22 Thread Philippe D'Anjou
Apparently the PG balancer crush-compat mode adds some crush bucket weights. 
Those cause major havoc in our cluster, our PG distribution is all over the 
place. 
Seeing things like this:...
 97   hdd 9.09470  1.0 9.1 TiB 6.3 TiB 6.3 TiB  32 KiB  17 GiB 2.8 TiB 
69.03 1.08  28 up 
 98   hdd 9.09470  1.0 9.1 TiB 4.5 TiB 4.5 TiB  96 KiB  11 GiB 4.6 TiB 
49.51 0.77  20 up 
 99   hdd 9.09470  1.0 9.1 TiB 7.0 TiB 6.9 TiB  80 KiB  18 GiB 2.1 TiB 
76.47 1.20  31 up
Fill rates range from 50 to 90%. Unfortunately, reweighting doesn't seem to help 
and I suspect it's because of the bucket weights, which are WEIRD:
    bucket_id -42
    weight_set [
  [ 7.846 11.514 9.339 9.757 10.173 8.900 9.164 6.759 ]


I disabled the module already but the rebalance is broken now.
Do I have to hand reset this and push a new crush map? This is a sensitive 
production cluster, I don't feel pretty good about that.
Thanks for any ideas..




Re: [ceph-users] clust recovery stuck

2019-10-22 Thread Eugen Block

Hi,

can you share `ceph osd tree`? What crush rules are in use in your  
cluster? I assume that the two failed OSDs prevent the remapping  
because the rules can't be applied.



Regards,
Eugen


Zitat von Philipp Schwaha :


hi,

I have a problem with a cluster being stuck in recovery after osd
failure. At first recovery was doing quite well, but now it just sits
there without any progress. It currently looks like this:

 health HEALTH_ERR
36 pgs are stuck inactive for more than 300 seconds
50 pgs backfill_wait
52 pgs degraded
36 pgs down
36 pgs peering
1 pgs recovering
1 pgs recovery_wait
36 pgs stuck inactive
52 pgs stuck unclean
52 pgs undersized
recovery 261632/2235446 objects degraded (11.704%)
recovery 259813/2235446 objects misplaced (11.622%)
recovery 2/1117723 unfound (0.000%)
 monmap e3: 3 mons at
{0=192.168.19.13:6789/0,1=192.168.19.17:6789/0,2=192.168.19.23:6789/0}
election epoch 78, quorum 0,1,2 0,1,2
 osdmap e7430: 6 osds: 4 up, 4 in; 88 remapped pgs
flags sortbitwise
  pgmap v20023893: 256 pgs, 1 pools, 4366 GB data, 1091 kobjects
8421 GB used, 10183 GB / 18629 GB avail
261632/2235446 objects degraded (11.704%)
259813/2235446 objects misplaced (11.622%)
2/1117723 unfound (0.000%)
 168 active+clean
  50 active+undersized+degraded+remapped+wait_backfill
  36 down+remapped+peering
   1 active+recovering+undersized+degraded+remapped
   1 active+recovery_wait+undersized+degraded+remapped

Is there any way to motivate it to resume recovery?

Thanks
Philipp






[ceph-users] Replace ceph osd in a container

2019-10-22 Thread Alex Litvak

Hello cephers,

So I am having trouble with a new hardware system showing strange OSD behavior, 
and I want to replace a disk with a brand new one to test the theory.

I run all daemons in containers and on one of the nodes I have mon, mgr, and 6 
osds.  So following 
https://docs.ceph.com/docs/master/rados/operations/add-or-rm-osds/#replacing-an-osd

I stopped the container with osd.23, waited until it was down and out, ran 
the safe-to-destroy loop, and then destroyed the OSD, all using the monitor 
container on this node.  All good.

Then I swapped the SSDs and started running the additional steps (from step 3) 
using the same mon container.  I have no ceph packages installed on the bare 
metal box. It looks like the mon container doesn't see the disk.


podman exec -it ceph-mon-storage2n2-la ceph-volume lvm zap /dev/sdh
 stderr: lsblk: /dev/sdh: not a block device
 stderr: error: /dev/sdh: No such file or directory
 stderr: Unknown device, --name=, --path=, or absolute path in /dev/ or /sys 
expected.
usage: ceph-volume lvm zap [-h] [--destroy] [--osd-id OSD_ID]
   [--osd-fsid OSD_FSID]
   [DEVICES [DEVICES ...]]
ceph-volume lvm zap: error: Unable to proceed with non-existing device: /dev/sdh
Error: exit status 2
root@storage2n2-la:~# ls -l /dev/sd
sda   sdc   sdd   sde   sdf   sdg   sdg1  sdg2  sdg5  sdh
root@storage2n2-la:~# podman exec -it ceph-mon-storage2n2-la ceph-volume lvm 
zap sdh
 stderr: lsblk: sdh: not a block device
 stderr: error: sdh: No such file or directory
 stderr: Unknown device, --name=, --path=, or absolute path in /dev/ or /sys 
expected.
usage: ceph-volume lvm zap [-h] [--destroy] [--osd-id OSD_ID]
   [--osd-fsid OSD_FSID]
   [DEVICES [DEVICES ...]]
ceph-volume lvm zap: error: Unable to proceed with non-existing device: sdh
Error: exit status 2

I execute lsblk and it sees device sdh
root@storage2n2-la:~# podman exec -it ceph-mon-storage2n2-la lsblk
lsblk: dm-1: failed to get device path
lsblk: dm-2: failed to get device path
lsblk: dm-4: failed to get device path
lsblk: dm-6: failed to get device path
lsblk: dm-4: failed to get device path
lsblk: dm-2: failed to get device path
lsblk: dm-1: failed to get device path
lsblk: dm-0: failed to get device path
lsblk: dm-0: failed to get device path
lsblk: dm-7: failed to get device path
lsblk: dm-5: failed to get device path
lsblk: dm-7: failed to get device path
lsblk: dm-6: failed to get device path
lsblk: dm-5: failed to get device path
lsblk: dm-3: failed to get device path
NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sdf  8:80   0   1.8T  0 disk
sdd  8:48   0   1.8T  0 disk
sdg  8:96   0 223.5G  0 disk
|-sdg5   8:101  0   223G  0 part
|-sdg1   8:97   487M  0 part
`-sdg2   8:98 1K  0 part
sde  8:64   0   1.8T  0 disk
sdc  8:32   0   3.5T  0 disk
sda  8:00   3.5T  0 disk
sdh  8:112  0   3.5T  0 disk

So I use a fellow osd container (osd.5) on the same node and run all of the 
operations (zap and prepare) successfully.

I am suspecting that mon or mgr have no access to /dev or /var/lib while osd 
containers do.  Cluster configured originally by ceph-ansible (nautilus 14.2.2)
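
As an illustration of the kind of thing I have in mind (the image name and
mounts are just an assumption, I have not tested this exact command):

podman run --rm --privileged --net=host \
  -v /dev:/dev -v /run/udev:/run/udev \
  -v /var/lib/ceph:/var/lib/ceph -v /etc/ceph:/etc/ceph \
  --entrypoint ceph-volume docker.io/ceph/daemon:latest-nautilus \
  lvm zap /dev/sdh --destroy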

The question is: if I want to replace all disks on a single node, and I have 6 
nodes with pools at replication 3, is it safe to restart the mgr container with 
/dev and /var/lib/ceph volumes mounted (they are not configured that way right 
now)?

I cannot use the other OSD containers on the same box because my controller 
reverts from RAID to non-RAID mode with all disks lost, not just a single one.  
So I need to replace all 6 OSDs to get them running back in containers, and the 
only things that will remain operational on the node are the mon and mgr 
containers.


I prefer not to install a full cluster or client on the bare metal node if 
possible.

Thank you for your help,
