[ceph-users] Re: ceph-volume deactivate errs out with "findmnt: invalid option -- M"

2021-02-05 Thread Frank Schilder
More issues:
  ceph-volume simple activate --file /etc/ceph/osd/33-615e6d0c-e3e9-4f55-9b6a-94243faa848b.json --no-systemd
  Running command: /usr/bin/mount -v /dev/sdb1 /var/lib/ceph/osd/ceph-33
  stderr: mount: mount point /var/lib/ceph/osd/ceph-33 does not exist
Shouldn't the mount point creation
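A likely workaround here (not verified, and assuming the same OSD and json file as above) is to create the mount point by hand and re-run the activation:

  mkdir -p /var/lib/ceph/osd/ceph-33
  ceph-volume simple activate --file /etc/ceph/osd/33-615e6d0c-e3e9-4f55-9b6a-94243faa848b.json --no-systemd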

[ceph-users] Re: mon db high iops

2021-02-05 Thread Seena Fallah
After disabling the insights module in the mgr, the mons' rocksdb submit sync latency went down and my problem was solved! On Fri, Feb 5, 2021 at 2:36 PM Seena Fallah wrote: > Is there any suggestion on disk spec? I can't find any doc about it for Ceph either! > > On Fri, Feb 5, 2021 at 11:37 AM Eugen Block
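For reference, disabling a mgr module is a one-liner; a sketch of the commands presumably used here (checking the module list first):

  ceph mgr module ls | grep insights
  ceph mgr module disable insights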

[ceph-users] ceph-volume deactivate errs out with "findmnt: invalid option -- M"

2021-02-05 Thread Frank Schilder
Hi all, I'm experimenting with ceph-volume on CentOS 7, ceph mimic 13.2.10. When I execute "ceph-volume deactivate ..." on a previously activated OSD, I get this error:
  # ceph-volume lvm deactivate 12 0bbf481c-6a3d-4724-9a27-3a845eb05911
  stderr: /usr/bin/findmnt: invalid option -- 'M'
  stderr:
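One way to check whether the installed findmnt even supports the -M/--mountpoint option that ceph-volume calls (CentOS 7 ships a fairly old util-linux, which may be the culprit; this is an assumption, not confirmed):

  findmnt --version
  findmnt --help | grep -- '--mountpoint'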

[ceph-users] Re: Worst thing that can happen if I have size= 2

2021-02-05 Thread Mark Lehrer
> Redhat/Micron/Samsung/Supermicro have all put out white papers backing the > idea of 2 copies on NVMe's as safe for production. It's not like you can just jump from "unsafe" to "safe" -- it is about comparing the probability of losing data against how valuable that data is. A vendor's

[ceph-users] Re: Worst thing that can happen if I have size= 2

2021-02-05 Thread Simon Ironside
On 05/02/2021 20:10, Mario Giammarco wrote: It is not that one morning I wake up and put some random hardware together; I followed guidelines. The result should be:
- if a disk (or more) breaks, work goes on
- if a server breaks, the VMs on that server start on another server and work goes on. The

[ceph-users] Re: NVMe and 2x Replica

2021-02-05 Thread Mark Lehrer
I have just one more suggestion for you: > but even our Supermicro contact that we worked the > config out with was in agreement with 2x on NVMe These kinds of settings aren't set in stone; it is a one-line command to rebalance (admittedly you wouldn't want to just do this casually). I don't
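As a sketch of the "one line command" being referred to (using a hypothetical pool name "rbd"; raising the replica count triggers the rebalance):

  ceph osd pool set rbd size 3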

[ceph-users] Re: Worst thing that can happen if I have size= 2

2021-02-05 Thread Mario Giammarco
On Thu, Feb 4, 2021 at 12:19, Eneko Lacunza wrote: > Hi all, > > On 4/2/21 at 11:56, Frank Schilder wrote: > >> - three servers > >> - three monitors > >> - 6 osd (two per server) > >> - size=3 and min_size=2 > > This is a set-up that I would not run at all. The first one

[ceph-users] Re: NVMe and 2x Replica

2021-02-05 Thread Frank Schilder
I don't run a secondary site and don't know if short windows of read-only access are terrible. From the data security point of view, min_size 2 is fine. It's the min_size 1 that really is dangerous, because it accepts non-redundant writes. Even if you lose the second site entirely, you can
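A quick way to audit this on an existing cluster (hypothetical pool name "rbd"):

  ceph osd pool get rbd min_size
  ceph osd pool set rbd min_size 2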

[ceph-users] CephFS Octopus snapshots / kworker at 100% / kernel vs. fuse client

2021-02-05 Thread Sebastian Knust
Hi, I am running a Ceph Octopus (15.2.8) cluster primarily for CephFS. Metadata is stored on SSD, data is stored in three different pools on HDD. Currently, I use 22 subvolumes. I am rotating snapshots on 16 subvolumes, all in the same pool, which is the primary data pool for CephFS.
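For context, snapshot rotation on subvolumes is typically driven by commands along these lines (a sketch only; filesystem, subvolume and snapshot names are placeholders, not taken from the poster's setup):

  ceph fs subvolume snapshot create cephfs mysubvol snap-2021-02-05
  ceph fs subvolume snapshot ls cephfs mysubvol
  ceph fs subvolume snapshot rm cephfs mysubvol snap-2021-01-01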

[ceph-users] Re: NVMe and 2x Replica

2021-02-05 Thread Nathan Fish
Why would you use RAID underneath Ceph? The only reason I've seen to do that is if you don't have enough CPU to run enough OSDs. On Fri, Feb 5, 2021 at 11:09 AM Jack wrote: > > Is RAID1 dangerous? > Is RAID5 dangerous? > > They both allow non-redundant writes > > > On 2/5/21 4:19 PM, Frank

[ceph-users] Re: NVMe and 2x Replica

2021-02-05 Thread Anthony D'Atri
Analogies between a distributed system and one that isn’t can be a bit strained or nuanced. The question really isn’t IF a given solution is dangerous, but HOW dangerous it is. There is always a long tail; one picks a point along it based on capex, business needs, etc. I sometimes read

[ceph-users] Re: NVMe and 2x Replica

2021-02-05 Thread Frank Schilder
> Picture this, using size=3, min_size=2: > - One node is down for maintenance > - You lose a couple of devices > - You lose data > > Is it likely that an NVMe device dies during a short maintenance window? > Is it likely that two devices die at the same time? If you just look at it from this

[ceph-users] Re: NVMe and 2x Replica

2021-02-05 Thread Jack
Is RAID1 dangerous? Is RAID5 dangerous? They both allow non-redundant writes On 2/5/21 4:19 PM, Frank Schilder wrote: I don't run a secondary site and don't know if short windows of read-only access are terrible. From the data security point of view, min_size 2 is fine. It's the min_size 1

[ceph-users] Re: can't query most pgs after restart

2021-02-05 Thread Jeremy Austin
I'll power the cluster up today or tomorrow and take a look again, Dan, but the initial problem is that many of the pgs can't be queried — the requests time out. I don't know if it's purely the stale, or just the unknown pgs, that can't be queried, but I'll investigate if there's something wrong

[ceph-users] Re: can't query most pgs after restart

2021-02-05 Thread Dan van der Ster
Eeek! Don't run `osd_find_best_info_ignore_history_les = true` -- that leads to data loss, even in cases you don't expect. Are you sure all OSDs are up? Query a PG to find out why it is unknown: `ceph pg query`. Feel free to share that. In fact, the 'unknown' state means the MGR doesn't know
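A sketch of the kind of inspection being suggested (the PG id is a placeholder):

  ceph pg 2.1f query
  ceph pg dump_stuck stale
  ceph pg dump_stuck inactive
  ceph osd stat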

[ceph-users] can't query most pgs after restart

2021-02-05 Thread Jeremy Austin
I was in the middle of a rebalance on a small test cluster with about 1% of pgs degraded, and shut the cluster entirely down for maintenance. On startup, many pgs are entirely unknown, and most stale. In fact most pgs can't be queried! No mon failures. Would osd logs tell me why pgs aren't even

[ceph-users] Re: NVMe and 2x Replica

2021-02-05 Thread Adam Boyhan
Those are my thoughts as well. We have 40Gbit/s of dedicated dark fiber that we manage between the two sites.
From: "Frank Schilder"
To: "adamb"
Cc: "Jack", "ceph-users"
Sent: Friday, February 5, 2021 10:19:06 AM
Subject: Re: [ceph-users] Re: NVMe and 2x Replica
I don't run a

[ceph-users] Re: reinstalling node with orchestrator/cephadm

2021-02-05 Thread Eugen Block
Hi Kenneth, I managed to succeed with this just now. It's a lab environment and the OSDs are not encrypted, but I was able to get the OSDs up again. The ceph-volume commands also worked (just activation didn't), so I had the required information about those OSDs. What I did was - collect
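A rough sketch of the commands typically involved in bringing existing OSDs back up on a reinstalled host (assuming LVM-based OSDs and that manual activation is acceptable; not necessarily the exact steps used here):

  ceph-volume lvm list          # collect the osd id and osd fsid for each device
  ceph-volume lvm activate --all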

[ceph-users] Re: Worst thing that can happen if I have size= 2

2021-02-05 Thread Frank Schilder
I think the answer is very simple: Data loss. You are setting yourself up for data loss. Having only +1 redundancy is a design flaw and you will be fully responsible for losing data on such a set-up. If this is not a problem, then that's an option. If this will get you fired, it's not. > There

[ceph-users] Re: NVMe and 2x Replica

2021-02-05 Thread Adam Boyhan
This turned into a great thread. Lots of good information and clarification. I am 100% on board with 3 copies for the primary. What does everyone think about possibly only doing 2 copies on the secondary? Keeping in mind that I would keep min=2, which I think will be reasonable for a

[ceph-users] Re: NVMe and 2x Replica

2021-02-05 Thread Jack
In the end, this is nothing but probability. Picture this, using size=3, min_size=2:
- One node is down for maintenance
- You lose a couple of devices
- You lose data
Is it likely that an NVMe device dies during a short maintenance window? Is it likely that two devices die at the same

[ceph-users] Re: NVMe and 2x Replica

2021-02-05 Thread Wido den Hollander
On 04/02/2021 18:57, Adam Boyhan wrote: All great input and points guys. Helps me lean towards 3 copies a bit more. I mean honestly NVMe cost per TB isn't that much more than SATA SSD now. Somewhat surprised the salesmen aren't pitching 3x replication as it makes them more money. To add

[ceph-users] Re: mon db high iops

2021-02-05 Thread Seena Fallah
Is there any suggestion on disk spec? I can't find any doc about it for Ceph either! On Fri, Feb 5, 2021 at 11:37 AM Eugen Block wrote: > Hi, > > > My disk latency is 25ms because of the high block size that rocksdb is > > using. > > Should I provide a higher-performance disk than the one I'm using for my

[ceph-users] Re: NVMe and 2x Replica

2021-02-05 Thread Janne Johansson
On Fri, Feb 5, 2021 at 07:38, Pascal Ehlert wrote: > Sorry to jump in here, but would you care to explain why the total disk > usage should stay under 60%? > This is not something I have heard before and a quick Google search > didn't return anything useful. > If you have 3 hosts with 3 drives
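Checking how much headroom is actually available per pool and per OSD, which is what the 60% figure is about, can be done with the standard commands:

  ceph df
  ceph osd df tree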

[ceph-users] Re: Worst thing that can happen if I have size= 2

2021-02-05 Thread Dan van der Ster
On Thu, Feb 4, 2021 at 10:30 PM huxia...@horebdata.cn wrote: > > >IMO with a cluster this size, you should not ever mark out any OSDs -- > >rather, you should leave the PGs degraded, replace the disk (keep the > >same OSD ID), then recover those objects to the new disk. > >Or, keep it <40% used
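One common way to keep PGs merely degraded, rather than letting the cluster mark the OSD out and rebalance during a disk swap, is the following sketch (whether it fits the cluster in question is not confirmed here):

  ceph osd set noout
  # replace the disk and recreate the OSD with the same ID
  ceph osd unset noout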

[ceph-users] Re: mon db high iops

2021-02-05 Thread Eugen Block
Hi,
> My disk latency is 25ms because of the high block size that rocksdb is using.
> Should I provide a higher-performance disk than the one I'm using for my monitor nodes?
What are you currently using on the MON nodes? There are recommendations out there [1] to set up MONs with SSDs: An SSD or other
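To get a feel for the latency and store size being discussed, something like the following on a MON node can help (a sketch; the mon store path is the usual default and may differ per deployment):

  iostat -x 1
  du -sh /var/lib/ceph/mon/*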