[ceph-users] Re: cephfs needs access from two networks

2020-08-31 Thread Simon Sutter
Hello again

So I have changed the network configuration.
Now my Ceph cluster is reachable from outside, which also means all OSDs on all nodes
are reachable.
I still see the same behaviour, which is a timeout.

The client can resolve all nodes with their hostnames.
The mons are still listening on the internal network, so the NAT rule is still
in place.
I have set “public bind addr” to the external IP and restarted the mon, but it’s
still not working.

[root@testnode1 ~]# ceph config get mon.public_bind_addr
WHO   MASK      LEVEL     OPTION            VALUE              RO
mon   advanced  public_bind_addr  v2:[ext-addr]:0/0 *
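
(A quick cross-check, as a rough sketch: as far as I understand, clients are
handed the mon addresses from the monmap, not from public_bind_addr alone, so
it is worth comparing against what the monmap actually advertises:

    ceph mon dump        # lists the addresses in the monmap

If the monmap still only contains the internal 10.99.10.x address, external
clients will keep trying to reach that, regardless of public_bind_addr.)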

Do I have to change them somewhere else too?

Thanks in advance,
Simon


From: Janne Johansson [mailto:icepic...@gmail.com]
Sent: 27 August 2020 20:01
To: Simon Sutter
Subject: Re: [ceph-users] cephfs needs access from two networks

On Thu 27 Aug 2020 at 12:05, Simon Sutter <ssut...@hosttech.ch> wrote:
Hello Janne

Oh I missed that point. No, the client cannot talk directly to the osds.
In this case it’s extremely difficult to set this up.

This is an absolute requirement to be a ceph client.

How does the mon tell the client which host and port of the OSD it should
connect to?

The same port and IP that the OSD reported to the mon when it started up
and joined the cluster.

Can I have an influence on it?


Well, you set the ip on the OSD hosts, and the port ranges in use for OSDs are 
changeable/settable, but it would not really help the above-mentioned client.
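
(For reference, a rough sketch of the knobs meant here, in ceph.conf form --
values are just the defaults as far as I recall:

    [osd]
    public addr      = <address reachable by the clients>
    ms bind port min = 6800
    ms bind port max = 7300

Changing the port range is possible, but as said above it does not help a
client that has no route to the OSD addresses in the first place.)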

From: Janne Johansson [mailto:icepic...@gmail.com]
Sent: 26 August 2020 15:09
To: Simon Sutter <ssut...@hosttech.ch>
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] cephfs needs access from two networks

On Wed 26 Aug 2020 at 14:16, Simon Sutter <ssut...@hosttech.ch> wrote:
Hello,
As far as I know, the mon service can only bind to one IP.
But I have to make it accessible from two networks, because internal and external
servers have to mount the CephFS.
The internal ip is 10.99.10.1 and the external is some public-ip.
I tried NAT'ing it with this: "firewall-cmd --zone=public
--add-forward-port=port=6789:proto=tcp:toport=6789:toaddr=10.99.10.1 --permanent"

So the NAT is working, because I get a "ceph v027" banner (along with some
gibberish) when I do a telnet: "telnet *public-ip* 6789"
But when I try to mount it, I get just a timeout:
mount -t ceph *public-ip*:6789:/testing /mnt -o
name=test,secretfile=/root/ceph.client.test.key
mount error 110 = Connection timed out

The tcpdump also recognizes a "Ceph Connect" packet, coming from the mon.

How can I get around this problem?
Is there something I have missed?

Any ceph client will also need direct access to all OSDs involved. Your mail
doesn't really say whether the cephfs-mounting client can talk to the OSDs?

In ceph, traffic is not shuffled via mons, mons only tell the client which OSDs 
it needs to talk to, then all IO goes directly from client to any involved OSD 
servers.
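
(A rough way to see exactly which addresses that means: the OSD map handed to
the client contains one line per OSD with its public address, e.g.

    ceph osd dump | grep '^osd'

Every address and port listed there has to be reachable from the mounting
client.)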

--
May the most significant bit of your life be positive.


--
May the most significant bit of your life be positive.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] cephadm daemons vs cephadm services -- what's the difference?

2020-08-31 Thread John Zachary Dover
What is the difference between services and daemons?

Specifically, what does it mean that "orch ps" lists cephadm daemons and
"orch ls" lists cephadm services?

This question will help me close this bug:
https://tracker.ceph.com/issues/47142
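
(A rough illustration of the distinction, as I understand it -- the names in
the comments are made up:

    ceph orch ls    # one row per service spec, e.g. mon, mgr, osd.default, rgw.myzone
    ceph orch ps    # one row per deployed daemon, e.g. mon.host1, osd.12, rgw.myzone.host2.abcdef

In other words, a service is the declarative specification cephadm manages,
and daemons are the individual processes it deploys to satisfy that spec.)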

Zac Dover
Upstream Docs
Ceph
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: setting bucket quota using admin API does not work

2020-08-31 Thread Youzhong Yang
Figured it out:
admin/bucket?quota works, but it does not seem to be documented.
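
(For reference, a rough sketch of the request shape implied above -- endpoint
and JSON fields taken from the original attempt, authentication omitted; the
request presumably still has to be signed with the credentials of an RGW admin
user, e.g. one with "buckets=write" caps:

    PUT /admin/bucket?quota&uid=bse&bucket=test

    {
        "enabled": true,
        "max_size_kb": 1073741824,
        "max_objects": -1
    }
)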

On Mon, Aug 31, 2020 at 4:16 PM Youzhong Yang  wrote:

> Hi all,
>
> I tried to set bucket quota using admin API as shown below:
>
> admin/user?quota&uid=bse&bucket=test&quota-type=bucket
>
> with payload in json format:
> {
> "enabled": true,
> "max_size": 1099511627776,
> "max_size_kb": 1073741824,
> "max_objects": -1
> }
>
> it returned success but the quota change did not happen, as confirmed by
> 'radosgw-admin bucket stats --bucket=test' command.
>
> Am I missing something obvious? Please kindly advise/suggest.
>
> By the way, I am using ceph mimic (v13.2.4). Setting quota by
> radosgw-admin quota set --bucket=${BUCK} --max-size=1T --quota-scope=bucket
> works, but I want to do it programmatically.
>
> Thanks in advance,
> -Youzhong
>
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Delete OSD spec (mgr)?

2020-08-31 Thread Darrell Enns
Is there a way to remove an OSD spec from the mgr? I've got one in there that I 
don't want. It shows up when I do "ceph orch osd spec --preview", and I can't 
find any way to get rid of it.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Cluster degraded after adding OSDs to increase capacity

2020-08-31 Thread DHilsbos
Dallas;

First, I should point out that you have an issue with your units.  Your cluster
is reporting 81 TiB (1024^4) of available space, not 81 TB (1000^4).  Similarly,
it's reporting 22.8 TiB of free space in the pool, not 22.8 TB.  For comparison,
your 5.5 TB drives (TB is the correct unit here) are only 5.02 TiB each.  Hard
drive manufacturers market in one set of units, while software systems report
in another.  Thus, while you added 66 TB to your cluster, that is only 60 TiB.
For background information, these pages are interesting:
https://en.wikipedia.org/wiki/Tebibyte
https://en.wikipedia.org/wiki/Binary_prefix#Consumer_confusion
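
For example, the drive conversion spelled out (same arithmetic as above):

    5.5 TB      = 5.5 x 10^12 B, and 5.5 x 10^12 / 1024^4 ~= 5.0 TiB
    12 x 5.5 TB = 66 TB, and 66 x 10^12 / 1024^4 ~= 60 TiB of added raw capacity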

It looks like you're using a replicated rule (3 copies) for your cephfs_data pool.  With
81.2 TiB available in the cluster, the maximum free space you can expect is
27.06 TiB (81.2 / 3 = 27.06).  As we've seen, you can't actually fill a cluster
to 100%.  It might be worth noting that the discrepancy (81.2 - 3 x 22.8 = 12.8 TiB)
is roughly 10% of your entire cluster's raw capacity (~122.8 TB).

From your previously provided OSD map, I'm seeing some reweights that aren't 1. 
 It's possible that has some impact.
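
(If it helps, a rough sketch of how to review and reset those -- the OSD id is
just an example:

    ceph osd df tree          # the REWEIGHT column shows anything other than 1.00000
    ceph osd reweight 19 1.0  # return a single OSD to full weight, if desired

A reweight below 1 pushes data away from that OSD, which can also lower the
free space the pool reports.)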

It's also possible that your cluster is "reserving" space on your HDDs for DB 
and WAL operations.

It would take someone that is more familiar with the CephFS and Dashboard code 
than I am, to answer your question definitively.

Thank you,

Dominic L. Hilsbos, MBA 
Director – Information Technology 
Perform Air International, Inc.
dhils...@performair.com 
www.PerformAir.com


From: Dallas Jones [mailto:djo...@tech4learning.com] 
Sent: Monday, August 31, 2020 2:59 PM
To: Dominic Hilsbos
Cc: ceph-users@ceph.io
Subject: [ceph-users] Re: Cluster degraded after adding OSDs to increase 
capacity

Thanks to everyone who replied. After setting osd_recovery_sleep_hdd to 0 and
changing osd_max_backfills to 16, my recovery throughput increased from < 1 MB/s
to 40-60 MB/s
and finished up late last night.

The cluster is mopping up a bunch of queued deep scrubs, but is otherwise now 
healthy.

I do have one remaining question - the cluster now shows 81TB of free space, 
but the data pool only shows 22.8TB of free space. I was expecting/hoping to 
see the free space value for the pool
grow more after doubling the capacity of the cluster (it previously had 21 OSDs 
w/ 2.7TB SAS drives; I just added 12 more OSDs w/ 5.5TB drives).

Are my expectations flawed, or is there something I can do to prod Ceph into 
growing the data pool free space?








On Fri, Aug 28, 2020 at 9:37 AM  wrote:
Dallas;

I would expect so, yes.

I wouldn't be surprised to see the used percentage slowly drop as the recovery 
/ rebalance progresses.  I believe that the pool free space number is based on 
the free space of the most filled OSD under any of the PGs, so I expect the 
free space will go up as your near-full OSDs drain.
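
(A rough way to watch that while the rebalance runs, using just the standard
tools:

    ceph df       # per-pool MAX AVAIL, which -- as described above -- is derived
                  # from the most-filled OSD under the pool's CRUSH rule
    ceph osd df   # per-OSD %USE, to see the near-full OSDs draining
)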

I've added OSDs to one of our clusters, once, and the recovery / rebalance 
completed fairly quickly.  I don't remember how the pool sizes progressed.  I'm 
going to need to expand our other cluster in the next couple of months, so 
follow up on how this proceeds would be appreciated.

Thank you,

Dominic L. Hilsbos, MBA 
Director – Information Technology 
Perform Air International, Inc.
dhils...@performair.com 
www.PerformAir.com


From: Dallas Jones [mailto:djo...@tech4learning.com] 
Sent: Friday, August 28, 2020 7:58 AM
To: Florian Pritz
Cc: ceph-users@ceph.io; Dominic Hilsbos
Subject: Re: [ceph-users] Re: Cluster degraded after adding OSDs to increase 
capacity

Thanks for the reply. I dialed up the value for max backfills yesterday, which 
increased my recovery throughput from about 1 MB/s to 5-ish. After tweaking
osd_recovery_sleep_hdd, I'm seeing 50-60 MB/s - which is fairly epic. No clients
are currently using this cluster, so I'm not worried about tanking client 
performance.

One remaining question: Will the pool sizes begin to adjust once the recovery 
process is complete? Per the following screenshot, my data pool is ~94% full...



On Fri, Aug 28, 2020 at 4:31 AM Florian Pritz  
wrote:
On Thu, Aug 27, 2020 at 05:56:22PM +, dhils...@performair.com wrote:
> 2)  Adjust performance settings to allow the data movement to go faster.  
> Again, I don't have those setting immediately to hand, but Googling something 
> like 'ceph recovery tuning,' or searching this list, should point you in the 
> right direction. Notice that you only have 6 PGs trying to move at a time, 
> with 2 blocked on your near-full OSDs (8 & 19).  I believe; by default, each 
> OSD daemon is only involved in 1 data movement at a time.  The tradeoff here 
> is user activity suffers if you adjust to favor recovery, however, with the 
> cluster in ERROR status, I suspect user activity is already suffering.

We've set osd_max_backfills to 16 in the config and when necessary we
manually change the runtime value of osd_recovery_sleep_hdd. It defaults
to 0.1 seconds of wait time between objects (I think?). If you really
want fast recovery try this additional change:

ceph tell osd.\* config set osd_recovery_sleep_hdd 0

[ceph-users] Re: Cluster degraded after adding OSDs to increase capacity

2020-08-31 Thread Dallas Jones
Thanks to everyone who replied. After setting osd_recovery_sleep_hdd to 0
and changing osd_max_backfills to 16, my recovery throughput increased from
< 1 MB/s to 40-60 MB/s
and finished up late last night.

The cluster is mopping up a bunch of queued deep scrubs, but is otherwise
now healthy.

I do have one remaining question - the cluster now shows 81TB of free
space, but the data pool only shows 22.8TB of free space. I was
expecting/hoping to see the free space value for the pool
grow more after doubling the capacity of the cluster (it previously had 21
OSDs w/ 2.7TB SAS drives; I just added 12 more OSDs w/ 5.5TB drives).

Are my expectations flawed, or is there something I can do to prod Ceph
into growing the data pool free space?


[image: image.png]

[image: image.png]



On Fri, Aug 28, 2020 at 9:37 AM  wrote:

> Dallas;
>
> I would expect so, yes.
>
> I wouldn't be surprised to see the used percentage slowly drop as the
> recovery / rebalance progresses.  I believe that the pool free space number
> is based on the free space of the most filled OSD under any of the PGs, so
> I expect the free space will go up as your near-full OSDs drain.
>
> I've added OSDs to one of our clusters, once, and the recovery / rebalance
> completed fairly quickly.  I don't remember how the pool sizes progressed.
> I'm going to need to expand our other cluster in the next couple of months,
> so follow up on how this proceeds would be appreciated.
>
> Thank you,
>
> Dominic L. Hilsbos, MBA
> Director – Information Technology
> Perform Air International, Inc.
> dhils...@performair.com
> www.PerformAir.com
>
>
> From: Dallas Jones [mailto:djo...@tech4learning.com]
> Sent: Friday, August 28, 2020 7:58 AM
> To: Florian Pritz
> Cc: ceph-users@ceph.io; Dominic Hilsbos
> Subject: Re: [ceph-users] Re: Cluster degraded after adding OSDs to
> increase capacity
>
> Thanks for the reply. I dialed up the value for max backfills yesterday,
> which increased my recovery throughput from about 1mbps to 5ish. After
> tweaking osd_recovery_sleep_hdd, I'm seeing 50-60MBPS - which is fairly
> epic. No clients are currently using this cluster, so I'm not worried about
> tanking client performance.
>
> One remaining question: Will the pool sizes begin to adjust once the
> recovery process is complete? Per the following screenshot, my data pool is
> ~94% full...
>
>
>
> On Fri, Aug 28, 2020 at 4:31 AM Florian Pritz <
> florian.pr...@rise-world.com> wrote:
> On Thu, Aug 27, 2020 at 05:56:22PM +, dhils...@performair.com wrote:
> > 2)  Adjust performance settings to allow the data movement to go
> faster.  Again, I don't have those setting immediately to hand, but
> Googling something like 'ceph recovery tuning,' or searching this list,
> should point you in the right direction. Notice that you only have 6 PGs
> trying to move at a time, with 2 blocked on your near-full OSDs (8 & 19).
> I believe; by default, each OSD daemon is only involved in 1 data movement
> at a time.  The tradeoff here is user activity suffers if you adjust to
> favor recovery, however, with the cluster in ERROR status, I suspect user
> activity is already suffering.
>
> We've set osd_max_backfills to 16 in the config and when necessary we
> manually change the runtime value of osd_recovery_sleep_hdd. It defaults
> to 0.1 seconds of wait time between objects (I think?). If you really
> want fast recovery try this additional change:
>
> ceph tell osd.\* config set osd_recovery_sleep_hdd 0
>
> Be warned though, this will seriously affect client performance. Then
> again it can bump your recovery speed by multiple orders of magnitude.
> If you want to go back to how things were, set it back to 0.1 instead of
> 0. It may take a couple of seconds (maybe a minute) until performance
> for clients starts to improve. I guess the OSDs are too busy with
> recovery to instantly accept the changed value.
>
> Florian
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD memory leak?

2020-08-31 Thread Frank Schilder
Looks like the image attachment got removed. Please find it here: 
https://imgur.com/a/3tabzCN

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Frank Schilder 
Sent: 31 August 2020 14:42
To: Mark Nelson; Dan van der Ster; ceph-users
Subject: [ceph-users] Re: OSD memory leak?

Hi Dan and Mark,

sorry, took a bit longer. I uploaded a new archive containing files with the 
following format 
(https://files.dtu.dk/u/jb0uS6U9LlCfvS5L/heap_profiling-2020-08-31.tgz?l - 
valid 60 days):

- osd.195.profile.*.heap - raw heap dump file
- osd.195.profile.*.heap.txt - output of conversion with --text
- osd.195.profile.*.heap-base0001.txt - output of conversion with --text 
against first dump as base
- osd.195.*.heap_stats - output of ceph daemon osd.195 heap stats, every hour
- osd.195.*.mempools - output of ceph daemon osd.195 dump_mempools, every hour
- osd.195.*.perf - output of ceph daemon osd.195 perf dump, every hour, 
counters are reset

Only for the last couple of days are converted files included, post-conversion 
of everything simply takes too long.
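
(For reference, the dumps and conversions were presumably produced along these
lines -- a rough sketch, file numbers made up:

    ceph tell osd.195 heap start_profiler
    ceph tell osd.195 heap dump
    pprof --text /usr/bin/ceph-osd osd.195.profile.0008.heap \
        > osd.195.profile.0008.heap.txt
    pprof --text --base=osd.195.profile.0001.heap /usr/bin/ceph-osd \
        osd.195.profile.0008.heap > osd.195.profile.0008.heap-base0001.txt
)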

Please find also attached a recording of memory usage on one of the relevant 
OSD nodes. I marked restarts of all OSDs/the host with vertical red lines. What 
is worrying is the self-amplifying nature of the leak. ts not a linear process, 
it looks at least quadratic if not exponential. What we are looking for is, 
given the comparably short uptime, probably still in the lower percentages with 
increasing rate. The OSDs just started to overrun their limit:

top - 14:38:49 up 155 days, 19:17,  1 user,  load average: 5.99, 4.59, 4.59
Tasks: 684 total,   1 running, 293 sleeping,   0 stopped,   0 zombie
%Cpu(s):  1.9 us,  0.9 sy,  0.0 ni, 89.6 id,  7.6 wa,  0.0 hi,  0.1 si,  0.0 st
KiB Mem : 65727628 total,  6937548 free, 41921260 used, 16868820 buff/cache
KiB Swap: 93532160 total, 90199040 free,  120 used.  6740136 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
4099023 ceph  20   0 5918704   3.8g   9700 S   1.7  6.1 378:37.01 
/usr/bin/ceph-osd --cluster ceph -f -i 35 --setuser cep+
4097639 ceph  20   0 5340924   3.0g  11428 S  87.1  4.7  14636:30 
/usr/bin/ceph-osd --cluster ceph -f -i 195 --setuser ce+
4097974 ceph  20   0 3648188   2.3g   9628 S   8.3  3.6   1375:58 
/usr/bin/ceph-osd --cluster ceph -f -i 201 --setuser ce+
4098322 ceph  20   0 3478980   2.2g   9688 S   5.3  3.6   1426:05 
/usr/bin/ceph-osd --cluster ceph -f -i 223 --setuser ce+
4099374 ceph  20   0 3446784   2.2g   9252 S   4.6  3.5   1142:14 
/usr/bin/ceph-osd --cluster ceph -f -i 205 --setuser ce+
4098679 ceph  20   0 3832140   2.2g   9796 S   6.6  3.5   1248:26 
/usr/bin/ceph-osd --cluster ceph -f -i 132 --setuser ce+
4100782 ceph  20   0 3641608   2.2g   9652 S   7.9  3.5   1278:10 
/usr/bin/ceph-osd --cluster ceph -f -i 207 --setuser ce+
4095944 ceph  20   0 3375672   2.2g   8968 S   7.3  3.5   1250:02 
/usr/bin/ceph-osd --cluster ceph -f -i 108 --setuser ce+
4096956 ceph  20   0 3509376   2.2g   9456 S   7.9  3.5   1157:27 
/usr/bin/ceph-osd --cluster ceph -f -i 203 --setuser ce+
4099731 ceph  20   0 3563652   2.2g   8972 S   3.6  3.5   1421:48 
/usr/bin/ceph-osd --cluster ceph -f -i 61 --setuser cep+
4096262 ceph  20   0 3531988   2.2g   9040 S   9.9  3.5   1600:15 
/usr/bin/ceph-osd --cluster ceph -f -i 121 --setuser ce+
4100442 ceph  20   0 3359736   2.1g   9804 S   4.3  3.4   1185:53 
/usr/bin/ceph-osd --cluster ceph -f -i 226 --setuser ce+
4096617 ceph  20   0 3443060   2.1g   9432 S   5.0  3.4   1449:29 
/usr/bin/ceph-osd --cluster ceph -f -i 199 --setuser ce+
4097298 ceph  20   0 3483532   2.1g   9600 S   5.6  3.3   1265:28 
/usr/bin/ceph-osd --cluster ceph -f -i 97 --setuser cep+
4100093 ceph  20   0 3428348   2.0g   9568 S   3.3  3.2   1298:53 
/usr/bin/ceph-osd --cluster ceph -f -i 197 --setuser ce+
4095630 ceph  20   0 3440160   2.0g   8976 S   3.6  3.2   1451:35 
/usr/bin/ceph-osd --cluster ceph -f -i 62 --setuser cep+

Generally speaking, increasing the cache minimum seems to help with keeping 
important information in RAM. Unfortunately, it also means that swap usage 
starts much earlier.

Best regards and thanks for your help,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MDS troubleshooting documentation: ceph daemon mds. dump cache

2020-08-31 Thread Patrick Donnelly
On Mon, Aug 31, 2020 at 5:02 AM Stefan Kooman  wrote:
>
> Hi list,
>
> We had some stuck ops on our MDS. In order to figure out why, we looked
> up the documention. The first thing it mentions is the following:
>
> ceph daemon mds.<name> dump cache /tmp/dump.txt
>
> Our MDS had 170 GB in cache at that moment.
>
> Turns out that is a sure way to get your active MDS replaced by a standby.
>
> Is this supposed to work on an MDS with a large cache size? If not, then a
> big warning sign to prohibit running this on MDSes with large caches
> would be appropriate.
>
> Gr. Stefan
>
> P.s. I think our only option was to get the active restarted at that
> point, but still.

Yes, there should be a note in the docs about that. It seems a new PR
is up to respond to this issue:
https://github.com/ceph/ceph/pull/36823


-- 
Patrick Donnelly, Ph.D.
He / Him / His
Principal Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph Filesystem recovery with intact pools

2020-08-31 Thread Cyclic 3
Both the MDS maps and the keyrings are lost as a side effect of the monitor
recovery process I mentioned in my initial email, detailed here
https://docs.ceph.com/docs/mimic/rados/troubleshooting/troubleshooting-mon/#monitor-store-failures
.

On Mon, 31 Aug 2020 at 21:10, Eugen Block  wrote:

> I don’t understand, what happened to the previous MDS? If there are
> cephfs pools there also was an old MDS, right? Can you explain that
> please?
>
>
> Quoting cyclic3@gmail.com:
>
> > I added an MDS, but there was no change in either output (apart from
> > recognising the existence of an MDS)
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] setting bucket quota using admin API does not work

2020-08-31 Thread Youzhong Yang
 Hi all,

I tried to set bucket quota using admin API as shown below:

admin/user?quota&uid=bse&bucket=test&quota-type=bucket

with payload in json format:
{
"enabled": true,
"max_size": 1099511627776,
"max_size_kb": 1073741824,
"max_objects": -1
}

it returned success but the quota change did not happen, as confirmed by
'radosgw-admin bucket stats --bucket=test' command.

Am I missing something obvious? Please kindly advise/suggest.

By the way, I am using ceph mimic (v13.2.4). Setting quota by radosgw-admin
quota set --bucket=${BUCK} --max-size=1T --quota-scope=bucket works, but I
want to do it programmatically.

Thanks in advance,
-Youzhong
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph Filesystem recovery with intact pools

2020-08-31 Thread Eugen Block
I don’t understand, what happened to the previous MDS? If there are  
cephfs pools there also was an old MDS, right? Can you explain that  
please?



Quoting cyclic3@gmail.com:

I added an MDS, but there was no change in either output (apart from  
recognising the existence of an MDS)

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph Filesystem recovery with intact pools

2020-08-31 Thread cyclic3 . git
This sounds rather risky; will this definitely not lose any of my data?
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph Filesystem recovery with intact pools

2020-08-31 Thread cyclic3 . git
I added an MDS, but there was no change in either output (apart from 
recognising the existence of an MDS)
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Can 16 server grade ssd's be slower then 60 hdds? (no extra journals)

2020-08-31 Thread Frank Schilder
I was talking about on-disk cache, but, yes, the controller cache needs to be 
disabled too. The first can be done with smartctl or hdparm. Check cache status 
with something like  'smartctl -g wcache /dev/sda' and disable with something 
like 'smartctl -s wcache=off /dev/sda'.
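
(The hdparm equivalent, for what it's worth -- a rough sketch, device name
assumed:

    hdparm -W /dev/sda     # report the current write-cache setting
    hdparm -W 0 /dev/sda   # disable the volatile write cache

On some drives this does not survive a power cycle, which is why people
automate it via udev rules or a boot-time script.)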

Controller cache needs to be disabled in the BIOS. By the way, if you can't use 
pass-through, you should disable controller cache for every disk, including the 
HDDs. There are cases in the list demonstrating that controller cache enabled 
can lead to data loss on power outage.

As I recommended before, please search the ceph-user list, you will find 
detailed instructions and also links to explanations and typical benchmarks.
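
(The benchmark those threads usually point to is a single-threaded sync-write
test against the raw device -- destructive, so only run it on an empty disk;
device name assumed:

    fio --name=synctest --filename=/dev/sdX --direct=1 --sync=1 \
        --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based

Enterprise SSDs with power-loss protection tend to sustain thousands of IOPS
here, while drives that depend on their volatile cache often drop to a few
hundred once that cache is disabled.)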

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: VELARTIS Philipp Dürhammer 
Sent: 31 August 2020 14:44:07
To: Frank Schilder; 'ceph-users@ceph.io'
Subject: AW: Can 16 server grade ssd's be slower then 60 hdds? (no extra 
journals)

We have older LSI RAID controllers with no HBA/JBOD option, so we expose the
single disks as RAID-0 devices. Ceph should not be aware of the cache status?
But digging deeper into it, it seems that 1 out of 4 servers is performing a lot
better and has super low commit/apply rates, while the others have a lot more
(20+) on heavy writes. This only applies to the SSDs; for the HDDs I can't see a
difference...

-----Original Message-----
From: Frank Schilder
Sent: Monday, 31 August 2020 13:19
To: VELARTIS Philipp Dürhammer; 'ceph-users@ceph.io'
Subject: Re: Can 16 server grade ssd's be slower then 60 hdds? (no extra
journals)

Yes, they can - if volatile write cache is not disabled. There are many threads 
on this, also recent. Search for "disable write cache" and/or "disable volatile 
write cache".

You will also find different methods of doing this automatically.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: VELARTIS Philipp Dürhammer 
Sent: 31 August 2020 13:02:45
To: 'ceph-users@ceph.io'
Subject: [ceph-users] Can 16 server grade ssd's be slower then 60 hdds? (no 
extra journals)

I have a production cluster with 60 OSDs and no extra journals. It's performing
okay. Now I added an extra SSD pool with 16 Micron 5100 MAX drives, and the
performance is a little slower than or equal to the 60-HDD pool, for 4K random
as well as sequential reads. All on a dedicated 2x 10G network. The HDDs are
still on filestore, the SSDs on bluestore. Ceph Luminous.
What should be possible with 16 SSDs vs. 60 HDDs with no extra journals?

___
ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to 
ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Xfs kernel panic during rbd mount

2020-08-31 Thread Shain Miley
Ilya,
Thank you for the quick response; it was very helpful in getting this
resolved.

Restarting those 3 osds has allowed the rbd image to mount successfully.
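
(For the archives -- a rough sketch, assuming systemd-managed OSDs; the OSD ids
are the ones named in the health output quoted earlier (6, 99 and 152):

    ceph health detail                  # shows which OSDs are implicated in blocked requests
    systemctl restart ceph-osd@152      # then the others, one at a time
)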

I really appreciate all your help on this.

Shain



On 8/31/20, 12:41 PM, "Ilya Dryomov"  wrote:

On Mon, Aug 31, 2020 at 6:21 PM Shain Miley  wrote:
>
> Hi,
> A few weeks ago several of our rdb images became unresponsive after a few 
of our OSDs reached a near full state.
>
> Another member of the team rebooted the server that the rbd images are 
mounted on in an attempt to resolve the issue.
> In the meantime I added several more nodes to the cluster in order to get 
additional space.
>
> Here are some cluster details:
>
> root@rbd1:/var/log# ceph -v
> ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous 
(stable)
>
> root@rbd1:/var/log# uname -va
> Linux rbd1 4.15.0-48-generic #51~16.04.1-Ubuntu SMP Fri Apr 5 12:01:12 
UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
>
> root@rbd1:/var/log# ceph -s
>   cluster:
> id: 504b5794-34bd-44e7-a8c3-0494cf800c23
> health: HEALTH_ERR
> crush map has legacy tunables (require argonaut, min is 
firefly)
> full ratio(s) out of order
> 2091546/274437905 objects misplaced (0.762%)
> Reduced data availability: 114 pgs inactive
> 1 slow requests are blocked > 32 sec. Implicated osds 152
> 4 stuck requests are blocked > 4096 sec. Implicated osds 6,99
>   services:
> mon: 3 daemons, quorum hqceph1,hqceph2,hqceph3
> mgr: hqceph2(active), standbys: hqceph3
> osd: 291 osds: 283 up, 265 in; 116 remapped pgs
> rgw: 1 daemon active
>   data:
> pools:   17 pools, 8199 pgs
> objects: 91.48M objects, 292TiB
> usage:   880TiB used, 758TiB / 1.60PiB avail
> pgs: 1.390% pgs not active
>  2091546/274437905 objects misplaced (0.762%)
>  8040 active+clean
>  114  activating+remapped
>  40   active+clean+scrubbing+deep
>  3active+clean+scrubbing
>  2active+remapped+backfilling
>   io:
> recovery: 41.4MiB/s, 12objects/s
>
> This morning I got on the server in order to map and mount the 5 or so 
RBD images that are shared out via samba on this server.
>
> After waiting about 10 minutes it was clear something was not 100% 
correct…here is what I found in the kern.log file:
>
>
> Aug 31 11:43:16 rbd1 kernel: [2158818.570948] libceph: mon0 
10.35.1.201:6789 session established
> Aug 31 11:43:16 rbd1 kernel: [2158818.576306] libceph: client182797617 
fsid 504b5794-34bd-44e7-a8c3-0494cf800c23
> Aug 31 11:43:16 rbd1 kernel: [2158818.710199] rbd: rbd0: capacity 
54975581388800 features 0x0
> Aug 31 11:43:22 rbd1 kernel: [2158824.685353] XFS (rbd0): Mounting V4 
Filesystem
> Aug 31 11:44:19 rbd1 kernel: [2158881.536794] XFS (rbd0): Starting 
recovery (logdev: internal)
> Aug 31 11:47:06 rbd1 kernel: [2159048.202835] INFO: task mount:33177 
blocked for more than 120 seconds.
> Aug 31 11:47:06 rbd1 kernel: [2159048.203053]   Not tainted 
4.15.0-48-generic #51~16.04.1-Ubuntu
> Aug 31 11:47:06 rbd1 kernel: [2159048.203260] "echo 0 > 
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Aug 31 11:47:06 rbd1 kernel: [2159048.203523] mount   D0 
33177  33011 0x
> Aug 31 11:47:06 rbd1 kernel: [2159048.203527] Call Trace:
> Aug 31 11:47:06 rbd1 kernel: [2159048.203538]  __schedule+0x3d6/0x8b0
> Aug 31 11:47:06 rbd1 kernel: [2159048.203542]  ? __switch_to_asm+0x34/0x70
> Aug 31 11:47:06 rbd1 kernel: [2159048.203546]  schedule+0x36/0x80
> Aug 31 11:47:06 rbd1 kernel: [2159048.203551]  
schedule_timeout+0x1db/0x370
> Aug 31 11:47:06 rbd1 kernel: [2159048.203629]  ? 
xfs_trans_read_buf_map+0xf8/0x330 [xfs]
> Aug 31 11:47:06 rbd1 kernel: [2159048.203634]  
wait_for_completion+0xb4/0x140
> Aug 31 11:47:06 rbd1 kernel: [2159048.203637]  ? wake_up_q+0x70/0x70
> Aug 31 11:47:06 rbd1 kernel: [2159048.203691]  ? 
xfs_trans_read_buf_map+0xf8/0x330 [xfs]
> Aug 31 11:47:06 rbd1 kernel: [2159048.203740]  ? _xfs_buf_read+0x23/0x30 
[xfs]
> Aug 31 11:47:06 rbd1 kernel: [2159048.203787]  
xfs_buf_submit_wait+0x7f/0x220 [xfs]
> Aug 31 11:47:06 rbd1 kernel: [2159048.203839]  ? 
xfs_trans_read_buf_map+0xf8/0x330 [xfs]
> Aug 31 11:47:06 rbd1 kernel: [2159048.203887]  _xfs_buf_read+0x23/0x30 
[xfs]
> Aug 31 11:47:06 rbd1 kernel: [2159048.203933]  
xfs_buf_read_map+0x10a/0x190 [xfs]
> Aug 31 11:47:06 rbd1 kernel: [2159048.203985]  
xfs_trans_read_buf_map+0xf8/0x330 [xfs]
> Aug 31 11:47:06 rbd1 kernel: [2159048.204022]  xfs_read_agf+0x90/0x120 
[xfs]
> Aug 31 11:47:06 rbd1 kernel: [2159048.204058]  
xfs_alloc_read_agf+0x49/

[ceph-users] Re: OSD memory leak?

2020-08-31 Thread Frank Schilder
Hi Dan and Mark,

sorry, took a bit longer. I uploaded a new archive containing files with the 
following format 
(https://files.dtu.dk/u/jb0uS6U9LlCfvS5L/heap_profiling-2020-08-31.tgz?l - 
valid 60 days):

- osd.195.profile.*.heap - raw heap dump file
- osd.195.profile.*.heap.txt - output of conversion with --text
- osd.195.profile.*.heap-base0001.txt - output of conversion with --text 
against first dump as base
- osd.195.*.heap_stats - output of ceph daemon osd.195 heap stats, every hour
- osd.195.*.mempools - output of ceph daemon osd.195 dump_mempools, every hour
- osd.195.*.perf - output of ceph daemon osd.195 perf dump, every hour, 
counters are reset

Only for the last couple of days are converted files included, post-conversion 
of everything simply takes too long.

Please find also attached a recording of memory usage on one of the relevant 
OSD nodes. I marked restarts of all OSDs/the host with vertical red lines. What 
is worrying is the self-amplifying nature of the leak. It's not a linear process, 
it looks at least quadratic if not exponential. What we are looking for is, 
given the comparably short uptime, probably still in the lower percentages with 
increasing rate. The OSDs just started to overrun their limit:

top - 14:38:49 up 155 days, 19:17,  1 user,  load average: 5.99, 4.59, 4.59
Tasks: 684 total,   1 running, 293 sleeping,   0 stopped,   0 zombie
%Cpu(s):  1.9 us,  0.9 sy,  0.0 ni, 89.6 id,  7.6 wa,  0.0 hi,  0.1 si,  0.0 st
KiB Mem : 65727628 total,  6937548 free, 41921260 used, 16868820 buff/cache
KiB Swap: 93532160 total, 90199040 free,  120 used.  6740136 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
4099023 ceph  20   0 5918704   3.8g   9700 S   1.7  6.1 378:37.01 
/usr/bin/ceph-osd --cluster ceph -f -i 35 --setuser cep+ 
4097639 ceph  20   0 5340924   3.0g  11428 S  87.1  4.7  14636:30 
/usr/bin/ceph-osd --cluster ceph -f -i 195 --setuser ce+ 
4097974 ceph  20   0 3648188   2.3g   9628 S   8.3  3.6   1375:58 
/usr/bin/ceph-osd --cluster ceph -f -i 201 --setuser ce+ 
4098322 ceph  20   0 3478980   2.2g   9688 S   5.3  3.6   1426:05 
/usr/bin/ceph-osd --cluster ceph -f -i 223 --setuser ce+ 
4099374 ceph  20   0 3446784   2.2g   9252 S   4.6  3.5   1142:14 
/usr/bin/ceph-osd --cluster ceph -f -i 205 --setuser ce+ 
4098679 ceph  20   0 3832140   2.2g   9796 S   6.6  3.5   1248:26 
/usr/bin/ceph-osd --cluster ceph -f -i 132 --setuser ce+ 
4100782 ceph  20   0 3641608   2.2g   9652 S   7.9  3.5   1278:10 
/usr/bin/ceph-osd --cluster ceph -f -i 207 --setuser ce+ 
4095944 ceph  20   0 3375672   2.2g   8968 S   7.3  3.5   1250:02 
/usr/bin/ceph-osd --cluster ceph -f -i 108 --setuser ce+ 
4096956 ceph  20   0 3509376   2.2g   9456 S   7.9  3.5   1157:27 
/usr/bin/ceph-osd --cluster ceph -f -i 203 --setuser ce+ 
4099731 ceph  20   0 3563652   2.2g   8972 S   3.6  3.5   1421:48 
/usr/bin/ceph-osd --cluster ceph -f -i 61 --setuser cep+ 
4096262 ceph  20   0 3531988   2.2g   9040 S   9.9  3.5   1600:15 
/usr/bin/ceph-osd --cluster ceph -f -i 121 --setuser ce+ 
4100442 ceph  20   0 3359736   2.1g   9804 S   4.3  3.4   1185:53 
/usr/bin/ceph-osd --cluster ceph -f -i 226 --setuser ce+ 
4096617 ceph  20   0 3443060   2.1g   9432 S   5.0  3.4   1449:29 
/usr/bin/ceph-osd --cluster ceph -f -i 199 --setuser ce+ 
4097298 ceph  20   0 3483532   2.1g   9600 S   5.6  3.3   1265:28 
/usr/bin/ceph-osd --cluster ceph -f -i 97 --setuser cep+ 
4100093 ceph  20   0 3428348   2.0g   9568 S   3.3  3.2   1298:53 
/usr/bin/ceph-osd --cluster ceph -f -i 197 --setuser ce+ 
4095630 ceph  20   0 3440160   2.0g   8976 S   3.6  3.2   1451:35 
/usr/bin/ceph-osd --cluster ceph -f -i 62 --setuser cep+ 

Generally speaking, increasing the cache minimum seems to help with keeping 
important information in RAM. Unfortunately, it also means that swap usage 
starts much earlier.

Best regards and thanks for your help,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Frank Schilder 
Sent: 20 August 2020 22:40
To: Mark Nelson; Dan van der Ster; ceph-users
Subject: [ceph-users] Re: OSD memory leak?

Hi Mark and Dan,

I can generate text files. Can you let me know what you would like to see? 
Without further instructions, I can do a simple conversion and a conversion 
against the first dump as a base. I will upload an archive with converted files 
added tomorrow afternoon.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Mark Nelson 
Sent: 20 August 2020 21:52
To: Frank Schilder; Dan van der Ster; ceph-users
Subject: Re: [ceph-users] Re: OSD memory leak?

Hi Frank,


  I downloaded but haven't had time to get the environment setup yet
either.  It might be better to just generate the txt files if you can.


Thanks!

Mark

[ceph-users] Re: Xfs kernel panic during rbd mount

2020-08-31 Thread Ilya Dryomov
On Mon, Aug 31, 2020 at 6:21 PM Shain Miley  wrote:
>
> Hi,
> A few weeks ago several of our rdb images became unresponsive after a few of 
> our OSDs reached a near full state.
>
> Another member of the team rebooted the server that the rbd images are 
> mounted on in an attempt to resolve the issue.
> In the meantime I added several more nodes to the cluster in order to get 
> additional space.
>
> Here are some cluster details:
>
> root@rbd1:/var/log# ceph -v
> ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous 
> (stable)
>
> root@rbd1:/var/log# uname -va
> Linux rbd1 4.15.0-48-generic #51~16.04.1-Ubuntu SMP Fri Apr 5 12:01:12 UTC 
> 2019 x86_64 x86_64 x86_64 GNU/Linux
>
> root@rbd1:/var/log# ceph -s
>   cluster:
> id: 504b5794-34bd-44e7-a8c3-0494cf800c23
> health: HEALTH_ERR
> crush map has legacy tunables (require argonaut, min is firefly)
> full ratio(s) out of order
> 2091546/274437905 objects misplaced (0.762%)
> Reduced data availability: 114 pgs inactive
> 1 slow requests are blocked > 32 sec. Implicated osds 152
> 4 stuck requests are blocked > 4096 sec. Implicated osds 6,99
>   services:
> mon: 3 daemons, quorum hqceph1,hqceph2,hqceph3
> mgr: hqceph2(active), standbys: hqceph3
> osd: 291 osds: 283 up, 265 in; 116 remapped pgs
> rgw: 1 daemon active
>   data:
> pools:   17 pools, 8199 pgs
> objects: 91.48M objects, 292TiB
> usage:   880TiB used, 758TiB / 1.60PiB avail
> pgs: 1.390% pgs not active
>  2091546/274437905 objects misplaced (0.762%)
>  8040 active+clean
>  114  activating+remapped
>  40   active+clean+scrubbing+deep
>  3active+clean+scrubbing
>  2active+remapped+backfilling
>   io:
> recovery: 41.4MiB/s, 12objects/s
>
> This morning I got on the server in order to map and mount the 5 or so RBD 
> images that are shared out via samba on this server.
>
> After waiting about 10 minutes it was clear something was not 100% 
> correct…here is what I found in the kern.log file:
>
>
> Aug 31 11:43:16 rbd1 kernel: [2158818.570948] libceph: mon0 10.35.1.201:6789 
> session established
> Aug 31 11:43:16 rbd1 kernel: [2158818.576306] libceph: client182797617 fsid 
> 504b5794-34bd-44e7-a8c3-0494cf800c23
> Aug 31 11:43:16 rbd1 kernel: [2158818.710199] rbd: rbd0: capacity 
> 54975581388800 features 0x0
> Aug 31 11:43:22 rbd1 kernel: [2158824.685353] XFS (rbd0): Mounting V4 
> Filesystem
> Aug 31 11:44:19 rbd1 kernel: [2158881.536794] XFS (rbd0): Starting recovery 
> (logdev: internal)
> Aug 31 11:47:06 rbd1 kernel: [2159048.202835] INFO: task mount:33177 blocked 
> for more than 120 seconds.
> Aug 31 11:47:06 rbd1 kernel: [2159048.203053]   Not tainted 
> 4.15.0-48-generic #51~16.04.1-Ubuntu
> Aug 31 11:47:06 rbd1 kernel: [2159048.203260] "echo 0 > 
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Aug 31 11:47:06 rbd1 kernel: [2159048.203523] mount   D0 33177  
> 33011 0x
> Aug 31 11:47:06 rbd1 kernel: [2159048.203527] Call Trace:
> Aug 31 11:47:06 rbd1 kernel: [2159048.203538]  __schedule+0x3d6/0x8b0
> Aug 31 11:47:06 rbd1 kernel: [2159048.203542]  ? __switch_to_asm+0x34/0x70
> Aug 31 11:47:06 rbd1 kernel: [2159048.203546]  schedule+0x36/0x80
> Aug 31 11:47:06 rbd1 kernel: [2159048.203551]  schedule_timeout+0x1db/0x370
> Aug 31 11:47:06 rbd1 kernel: [2159048.203629]  ? 
> xfs_trans_read_buf_map+0xf8/0x330 [xfs]
> Aug 31 11:47:06 rbd1 kernel: [2159048.203634]  wait_for_completion+0xb4/0x140
> Aug 31 11:47:06 rbd1 kernel: [2159048.203637]  ? wake_up_q+0x70/0x70
> Aug 31 11:47:06 rbd1 kernel: [2159048.203691]  ? 
> xfs_trans_read_buf_map+0xf8/0x330 [xfs]
> Aug 31 11:47:06 rbd1 kernel: [2159048.203740]  ? _xfs_buf_read+0x23/0x30 [xfs]
> Aug 31 11:47:06 rbd1 kernel: [2159048.203787]  xfs_buf_submit_wait+0x7f/0x220 
> [xfs]
> Aug 31 11:47:06 rbd1 kernel: [2159048.203839]  ? 
> xfs_trans_read_buf_map+0xf8/0x330 [xfs]
> Aug 31 11:47:06 rbd1 kernel: [2159048.203887]  _xfs_buf_read+0x23/0x30 [xfs]
> Aug 31 11:47:06 rbd1 kernel: [2159048.203933]  xfs_buf_read_map+0x10a/0x190 
> [xfs]
> Aug 31 11:47:06 rbd1 kernel: [2159048.203985]  
> xfs_trans_read_buf_map+0xf8/0x330 [xfs]
> Aug 31 11:47:06 rbd1 kernel: [2159048.204022]  xfs_read_agf+0x90/0x120 [xfs]
> Aug 31 11:47:06 rbd1 kernel: [2159048.204058]  xfs_alloc_read_agf+0x49/0x1d0 
> [xfs]
> Aug 31 11:47:06 rbd1 kernel: [2159048.204094]  xfs_alloc_pagf_init+0x29/0x60 
> [xfs]
> Aug 31 11:47:06 rbd1 kernel: [2159048.204141]  
> xfs_initialize_perag_data+0x99/0x110 [xfs]
> Aug 31 11:47:06 rbd1 kernel: [2159048.204193]  xfs_mountfs+0x79b/0x950 [xfs]
> Aug 31 11:47:06 rbd1 kernel: [2159048.204243]  ? 
> xfs_mru_cache_create+0x12b/0x170 [xfs]
> Aug 31 11:47:06 rbd1 kernel: [2159048.204294]  xfs_fs_fill_super+0x428/0x5e0 
> [xfs]
> Aug 31 11:47:06 rbd1 kernel: [2159048.204300]  mount_bdev+0x246/0x290
>

[ceph-users] Xfs kernel panic during rbd mount

2020-08-31 Thread Shain Miley
Hi,
A few weeks ago several of our rbd images became unresponsive after a few of 
our OSDs reached a near full state.

Another member of the team rebooted the server that the rbd images are mounted 
on in an attempt to resolve the issue.
In the meantime I added several more nodes to the cluster in order to get 
additional space.

Here are some cluster details:

root@rbd1:/var/log# ceph -v
ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous 
(stable)

root@rbd1:/var/log# uname -va
Linux rbd1 4.15.0-48-generic #51~16.04.1-Ubuntu SMP Fri Apr 5 12:01:12 UTC 2019 
x86_64 x86_64 x86_64 GNU/Linux

root@rbd1:/var/log# ceph -s
  cluster:
id: 504b5794-34bd-44e7-a8c3-0494cf800c23
health: HEALTH_ERR
crush map has legacy tunables (require argonaut, min is firefly)
full ratio(s) out of order
2091546/274437905 objects misplaced (0.762%)
Reduced data availability: 114 pgs inactive
1 slow requests are blocked > 32 sec. Implicated osds 152
4 stuck requests are blocked > 4096 sec. Implicated osds 6,99
  services:
mon: 3 daemons, quorum hqceph1,hqceph2,hqceph3
mgr: hqceph2(active), standbys: hqceph3
osd: 291 osds: 283 up, 265 in; 116 remapped pgs
rgw: 1 daemon active
  data:
pools:   17 pools, 8199 pgs
objects: 91.48M objects, 292TiB
usage:   880TiB used, 758TiB / 1.60PiB avail
pgs: 1.390% pgs not active
 2091546/274437905 objects misplaced (0.762%)
 8040 active+clean
 114  activating+remapped
 40   active+clean+scrubbing+deep
 3active+clean+scrubbing
 2active+remapped+backfilling
  io:
recovery: 41.4MiB/s, 12objects/s

This morning I got on the server in order to map and mount the 5 or so RBD 
images that are shared out via samba on this server.

After waiting about 10 minutes it was clear something was not 100% correct…here 
is what I found in the kern.log file:


Aug 31 11:43:16 rbd1 kernel: [2158818.570948] libceph: mon0 10.35.1.201:6789 
session established
Aug 31 11:43:16 rbd1 kernel: [2158818.576306] libceph: client182797617 fsid 
504b5794-34bd-44e7-a8c3-0494cf800c23
Aug 31 11:43:16 rbd1 kernel: [2158818.710199] rbd: rbd0: capacity 
54975581388800 features 0x0
Aug 31 11:43:22 rbd1 kernel: [2158824.685353] XFS (rbd0): Mounting V4 Filesystem
Aug 31 11:44:19 rbd1 kernel: [2158881.536794] XFS (rbd0): Starting recovery 
(logdev: internal)
Aug 31 11:47:06 rbd1 kernel: [2159048.202835] INFO: task mount:33177 blocked 
for more than 120 seconds.
Aug 31 11:47:06 rbd1 kernel: [2159048.203053]   Not tainted 
4.15.0-48-generic #51~16.04.1-Ubuntu
Aug 31 11:47:06 rbd1 kernel: [2159048.203260] "echo 0 > 
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 31 11:47:06 rbd1 kernel: [2159048.203523] mount   D0 33177  
33011 0x
Aug 31 11:47:06 rbd1 kernel: [2159048.203527] Call Trace:
Aug 31 11:47:06 rbd1 kernel: [2159048.203538]  __schedule+0x3d6/0x8b0
Aug 31 11:47:06 rbd1 kernel: [2159048.203542]  ? __switch_to_asm+0x34/0x70
Aug 31 11:47:06 rbd1 kernel: [2159048.203546]  schedule+0x36/0x80
Aug 31 11:47:06 rbd1 kernel: [2159048.203551]  schedule_timeout+0x1db/0x370
Aug 31 11:47:06 rbd1 kernel: [2159048.203629]  ? 
xfs_trans_read_buf_map+0xf8/0x330 [xfs]
Aug 31 11:47:06 rbd1 kernel: [2159048.203634]  wait_for_completion+0xb4/0x140
Aug 31 11:47:06 rbd1 kernel: [2159048.203637]  ? wake_up_q+0x70/0x70
Aug 31 11:47:06 rbd1 kernel: [2159048.203691]  ? 
xfs_trans_read_buf_map+0xf8/0x330 [xfs]
Aug 31 11:47:06 rbd1 kernel: [2159048.203740]  ? _xfs_buf_read+0x23/0x30 [xfs]
Aug 31 11:47:06 rbd1 kernel: [2159048.203787]  xfs_buf_submit_wait+0x7f/0x220 
[xfs]
Aug 31 11:47:06 rbd1 kernel: [2159048.203839]  ? 
xfs_trans_read_buf_map+0xf8/0x330 [xfs]
Aug 31 11:47:06 rbd1 kernel: [2159048.203887]  _xfs_buf_read+0x23/0x30 [xfs]
Aug 31 11:47:06 rbd1 kernel: [2159048.203933]  xfs_buf_read_map+0x10a/0x190 
[xfs]
Aug 31 11:47:06 rbd1 kernel: [2159048.203985]  
xfs_trans_read_buf_map+0xf8/0x330 [xfs]
Aug 31 11:47:06 rbd1 kernel: [2159048.204022]  xfs_read_agf+0x90/0x120 [xfs]
Aug 31 11:47:06 rbd1 kernel: [2159048.204058]  xfs_alloc_read_agf+0x49/0x1d0 
[xfs]
Aug 31 11:47:06 rbd1 kernel: [2159048.204094]  xfs_alloc_pagf_init+0x29/0x60 
[xfs]
Aug 31 11:47:06 rbd1 kernel: [2159048.204141]  
xfs_initialize_perag_data+0x99/0x110 [xfs]
Aug 31 11:47:06 rbd1 kernel: [2159048.204193]  xfs_mountfs+0x79b/0x950 [xfs]
Aug 31 11:47:06 rbd1 kernel: [2159048.204243]  ? 
xfs_mru_cache_create+0x12b/0x170 [xfs]
Aug 31 11:47:06 rbd1 kernel: [2159048.204294]  xfs_fs_fill_super+0x428/0x5e0 
[xfs]
Aug 31 11:47:06 rbd1 kernel: [2159048.204300]  mount_bdev+0x246/0x290
Aug 31 11:47:06 rbd1 kernel: [2159048.204349]  ? 
xfs_test_remount_options.isra.16+0x60/0x60 [xfs]
Aug 31 11:47:06 rbd1 kernel: [2159048.204398]  xfs_fs_mount+0x15/0x20 [xfs]
Aug 31 11:47:06 rbd1 kernel: [2159048.204402]  mount_fs+0x3d/0x150
Aug 31 11:47:06 rbd

[ceph-users] MDS troubleshooting documentation: ceph daemon mds. dump cache

2020-08-31 Thread Stefan Kooman
Hi list,

We had some stuck ops on our MDS. In order to figure out why, we looked
up the documention. The first thing it mentions is the following:

ceph daemon mds.<name> dump cache /tmp/dump.txt

Our MDS had 170 GB in cache at that moment.

Turns out that is a sure way to get your active MDS replaced by a standby.

Is this supposed to work on an MDS with a large cache size? If not, then a
big warning sign to prohibit running this on MDSes with large caches
would be appropriate.
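
(A less invasive first step, as a rough sketch: check how big the cache
actually is before trying to dump it, e.g.

    ceph daemon mds.<name> cache status
    ceph daemon mds.<name> perf dump | grep -A5 mds_mem

and only dump it when it is reasonably small.)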

Gr. Stefan

P.s. I think our only option was to get the active restarted at that
point, but still.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Default data pool in CEPH

2020-08-31 Thread Gabriel Medve

Hi,

I have Ceph 15.2.4 running in Docker. How do I configure it to use a specific
data pool? I tried putting the following line in ceph.conf, but the change is
not working.


[client.myclient]
rbd default data pool = Mydatapool

I need to configure this for an erasure-coded pool with CloudStack.

Can anyone help me? Where is the ceph.conf that I need to configure?
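
(A rough sketch of what is usually needed for an erasure-coded RBD data pool --
pool and client names taken from the snippet above: the EC pool must allow
overwrites, and the option has to be in the ceph.conf read by the client side
(i.e. the hypervisor that CloudStack uses), not by the daemons:

    ceph osd pool set Mydatapool allow_ec_overwrites true

    # ceph.conf on the client/hypervisor:
    [client.myclient]
    rbd default data pool = Mydatapool

The replicated base pool handed to CloudStack keeps the image metadata; only
the data objects land in the EC pool.)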

Thanks.

--

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] How to query status of scheduled commands.

2020-08-31 Thread Frank Schilder
Hi all,

can anyone help me with this? In mimic, for any of these commands:

ceph osd [deep-]scrub ID
ceph pg [deep-]scrub ID
ceph pg repair ID

an operation is scheduled asynchronously. How can I check the following states:

1) Operation is pending (scheduled, not started).
2) Operation is running.
3) Operation has completed.
4) Exit code and error messages if applicable.
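
(Partial answers only, as far as I can tell -- a rough sketch of what can be
inspected:

    ceph pg ls scrubbing           # PGs with a scrub currently running; repairs also show up as a PG state
    ceph pg <pgid> query           # per-PG details, including the scrubber section
    ceph pg dump pgs | grep <pgid> # the last (deep-)scrub stamps move forward once it completed

I am not aware of a queue that lists pending, not-yet-started requests, and
errors end up in the cluster log / "ceph health detail" (e.g. inconsistent PGs)
rather than as an exit code.)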

Many thanks!
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Can 16 server grade ssd's be slower then 60 hdds? (no extra journals)

2020-08-31 Thread Frank Schilder
Yes, they can - if volatile write cache is not disabled. There are many threads 
on this, also recent. Search for "disable write cache" and/or "disable volatile 
write cache".

You will also find different methods of doing this automatically.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: VELARTIS Philipp Dürhammer 
Sent: 31 August 2020 13:02:45
To: 'ceph-users@ceph.io'
Subject: [ceph-users] Can 16 server grade ssd's be slower then 60 hdds? (no 
extra journals)

I have a production cluster with 60 OSDs and no extra journals. It's performing
okay. Now I added an extra SSD pool with 16 Micron 5100 MAX drives, and the
performance is a little slower than or equal to the 60-HDD pool, for 4K random
as well as sequential reads. All on a dedicated 2x 10G network. The HDDs are
still on filestore, the SSDs on bluestore. Ceph Luminous.
What should be possible with 16 SSDs vs. 60 HDDs with no extra journals?

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Bluestore does not defer writes

2020-08-31 Thread Dennis Benndorf
Here is the blog I wrote about:

https://yourcmc.ru/wiki/index.php?title=Ceph_performance&mobileaction=toggle_view_desktop

HDD for data + SSD for journal
Filestore writes everything to the journal and only starts to flush 
it to the data device when the journal fills up to the configured 
percent. This is very convenient because it makes journal act as a 
«temporary buffer» that absorbs random write bursts.

Bluestore can’t do the same even when you put its WAL+DB on SSD. 
It also has sort of a «journal» which is called «deferred write
queue», 
but it’s very small (only 64 requests) and it lacks any kind of 
background flush threads. So you actually can increase the maximum 
number of deferred requests, but after the queue fills up the 
performance will drop until OSD restarts.




Maybe I have been hit by that. Is there any change planned or done on this?
I can't imagine users regularly restarting their OSDs to get the performance
back.
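
(A rough way to check whether an OSD is still deferring small writes at all --
OSD id assumed:

    ceph daemon osd.0 config get bluestore_prefer_deferred_size_hdd
    ceph daemon osd.0 perf dump | grep deferred_write

The bluestore perf counters include deferred_write_ops / deferred_write_bytes;
if they stop increasing between two dumps while small writes are coming in,
the OSD is no longer deferring them.)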

Regards,
Dennis


-------- Forwarded Message --------
From: Wido den Hollander <w...@42on.com>
To: Dennis Benndorf, ceph-users@ceph.io
Subject: Re: [ceph-users] Bluestore does not defer writes
Date: Mon, 31 Aug 2020 16:06:37 +0200

On 31/08/2020 11:00, Dennis Benndorf wrote:
Hi,

today I recognized bad performance in our cluster. Running "watch ceph osd
perf | sort -hk 2 -r" I found that all bluestore OSDs are slow on commit and
that the commit timings are equal to their apply timings. For example:

Every 2.0s: ceph osd perf | sort -hk 2 -r

[output garbled in the archive: the bluestore OSDs show commit/apply latencies
in the 430-450 ms range, while the filestore OSDs (93-99) show commit latencies
of 0 and apply latencies in the single to low double digits]

The ones with zero commit timings are filestore and the others are bluestore
OSDs. I did not see this after installing the new bluestore OSDs (maybe this
occurred later). Both types of OSDs have NVMes as journal/db. Servers have
equal CPUs/RAM etc.

The only tuning regarding bluestore is:
   bluestore_block_db_size = 69793218560
   bluestore_prefer_deferred_size_hdd = 524288
in order to get a filestore-like behaviour, but that does not seem to work.

As far as I know, with BlueStore apply and commit latencies are equal.

Where did you get the idea that you could influence this with these settings?

Wido

Any tips?

Regards Dennis
___
ceph-users mailing list -- ceph-us...@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Persistent problem with slow metadata

2020-08-31 Thread Momčilo Medić
On Mon, 2020-08-31 at 14:36 +, Eugen Block wrote:
> > Disks are utilized roughly between 70 and 80 percent. Not sure why
> > would operations slow down when disks are getting more utilization.
> > If that would be the case, I'd expect Ceph to issue a warning.
> 
> It is warning you, that's why you see slow requests. ;-) But just
> to  
> be clear, by utilization I mean more than just the filling level of  
> the OSD, have you watched iostat (or something similar) for your
> disks  
> during usual and high load? Heavy metadata operation on rocksDB  
> increases the load on the main device. I'm not sure if you
> mentioned  
> it before, do you have stand-alone OSDs or with faster db devices?
> I  
> believe you only mentioned cephfs_metadata on SSD.

Indeed DB is stored on HDDs and only metadata resides on SSDs.

I accidentally stumbled upon someone mentioning disk caching should be
disabled to increase performance.
I'm now looking into how to configure that on these:
- PERC H730P Adapter (has cache on controller)
- Dell HBA330 Adp (doesn't have cache on controller)

It is not as easy as executing "hdparm" command :(
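
(In case it helps -- tool and syntax from memory, so please check against
Dell's documentation: the H730P is usually driven with perccli/storcli, where
something along the lines of

    perccli /c0/vall show all          # current cache policy per virtual disk
    perccli /c0/vall set wrcache=wt    # switch the controller write cache to write-through

changes the controller cache policy. The HBA330 has no cache of its own, so
only the per-disk volatile cache needs handling there, e.g. with hdparm or
smartctl.)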

I'll also look into code that does the "truncating" as it may not be
Ceph-friendly :/

> > Have I understood correctly that the expectation is that if I used
> > larger drives I wouldn't be seeing these warnings?
> > I can understand that adding more disks would create better
> > parallelisation, that's why I'm asking about larger drives.
> 
> I don't think larger drives would improve that, probably even the  
> opposite, depending on the drives, of course. More drives should  
> scale, yes, but there's more to it.
> 
> 
> Quoting Momčilo Medić:
> 
> > Hey Eugen,
> > 
> > On Wed, 2020-08-26 at 09:29 +, Eugen Block wrote:
> > > Hi,
> > > 
> > > > > root@cephosd01:~# ceph config get mds.cephosd01 osd_op_queue
> > > > > wpq
> > > > > root@0cephosd01:~# ceph config get mds.cephosd01
> > > > > osd_op_queue_cut_off
> > > > > high
> > > 
> > > just to make sure, I referred to OSD not MDS settings, maybe
> > > check
> > > again?
> > 
> > root@cephosd01:~# ceph config get osd.* osd_op_queue
> > wpq
> > root@cephosd01:~# ceph config get osd.* osd_op_queue_cut_off
> > high
> > root@cephosd01:~# ceph config get mon.* osd_op_queue
> > wpq
> > root@cephosd01:~# ceph config get mon.* osd_op_queue_cut_off
> > high
> > root@cephosd01:~# ceph config get mds.* osd_op_queue
> > wpq
> > root@cephosd01:~# ceph config get mds.* osd_op_queue_cut_off
> > high
> > root@cephosd01:~#
> > 
> > It seems no matter which setting I query, it's always the same.
> > Also, documentation for OSD clearly states[1] that it is the
> > default.
> > 
> > > I wouldn't focus too much on the MDS service, 64 GB RAM should be
> > > enough, but you could and should also check the actual RAM usage,
> > > of
> > > course. But in our case it's pretty clear that the hard disks are
> > > the
> > > bottleneck although we  have rocksDB on SSD for all OSDs. We seem
> > > to
> > > have a similar use case (we have nightly compile jobs running in
> > > cephfs) just with fewer clients. Our HDDs are saturated
> > > especially
> > > if
> > > we also run deep-scrubs during the night,  but the slow requests
> > > have
> > > been reduced since we changed the osd_op_queue settings for our
> > > OSDs.
> > > 
> > > Have you checked your disk utilization?
> > 
> > Disks are utilized roughly between 70 and 80 percent. Not sure why
> > would operations slow down when disks are getting more utilization.
> > If that would be the case, I'd expect Ceph to issue a warning.
> > 
> > Have I understood correctly that the expectation is that if I used
> > larger drives I wouldn't be seeing these warnings?
> > I can understand that adding more disks would create better
> > parallelisation, that's why I'm asking about larger drives.
> > 
> > Thank you for discussing this with me, it's highly appreciated.
> > 
> > 
> > 
> > [1]
> > 
https://docs.ceph.com/docs/master/rados/configuration/osd-config-ref/#operations
> > 
> > Kind regards,
> > Momo.
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Bluestore does not defer writes

2020-08-31 Thread Dennis Benndorf
Hi Wido,
bluestore_prefer_deferred_size is the parameter for the size of IOs whose
writes are deferred. There was a bug report involving Igor, but I can't find
it; it was discussed there that this value must not exceed 512 KB...

Regarding osd perf, it is just my assumption that when the commit and apply
timings are equal there is no deferring at all. Looking at iostat, the IO
sizes going to the spinning disks are now a lot smaller (~4-80 KB) than before
(~>300 KB).

I could imagine I hit some bug or filled my WAL and now the OSDs are not
deferring anymore. I remember that someone wrote a blog about bluestore
mentioning that bluestore "journaling" only works up to a specific point; if
that is reached, deferring is deactivated. Could someone shed a bit of light
on this?

At the moment I get the write latency of my HDDs.

Regards,
Dennis
-------- Forwarded Message --------
From: Wido den Hollander <w...@42on.com>
To: Dennis Benndorf, ceph-users@ceph.io
Subject: Re: [ceph-users] Bluestore does not defer writes
Date: Mon, 31 Aug 2020 16:06:37 +0200

On 31/08/2020 11:00, Dennis Benndorf wrote:
Hi,
today I recognized bad performance in our cluster. Running "watch ceph
osd perf |sort -hk 2 -r" I found that all bluestore OSDs are slow on
commit and that the commit timings are equal to their apply timings:

For example
Every 2.0s: ceph osd perf |sort -hk 2 -r

440  82  82
430  58  58
435  56  56
449  53  53
442  40  40
441  30  30
439  27  27
 99   0   1
 98   0   0
 97   0   2
 96   0   6
 95   0   2
 94   0   6
 93   0  13

The ones with zero commit timings are filestore and the others are
bluestore osds. I did not see this after installing the new bluestore
osds (maybe this occurred later). Both types of osds have nvmes as
journal/db. Servers have equal cpus/ram etc.

The only tuning regarding bluestore is:
   bluestore_block_db_size = 69793218560
   bluestore_prefer_deferred_size_hdd = 524288
In order to make a filestore like behavior, but that does not seem to work.
As far as I know, with BlueStore the apply and commit latencies are equal.
Where did you get the idea that you could influence this with these
settings?
Wido

Any tips?
Regards Dennis
___
ceph-users mailing list -- ceph-us...@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Persistent problem with slow metadata

2020-08-31 Thread Eugen Block

Disks are utilized roughly between 70 and 80 percent. Not sure why
would operations slow down when disks are getting more utilization.
If that would be the case, I'd expect Ceph to issue a warning.


It is warning you, that's why you see slow requests. ;-) But just to  
be clear, by utilization I mean more than just the filling level of  
the OSD, have you watched iostat (or something similar) for your disks  
during usual and high load? Heavy metadata operation on rocksDB  
increases the load on the main device. I'm not sure if you mentioned  
it before, do you have stand-alone OSDs or with faster db devices? I  
believe you only mentioned cephfs_metadata on SSD.
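
For example, something like this (just a sketch; it assumes the sysstat
package is installed and sdX stands for one of your OSD data disks) makes
the saturation visible:

iostat -x 5 /dev/sdX

Watching columns like %util and the await values during the nightly jobs
and during a deep-scrub window usually shows quickly whether the HDDs are
the limiting factor.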




Have I understood correctly that the expectation is that if I used
larger drives I wouldn't be seeing these warnings?
I can understand that adding more disks would create better
parallelisation, that's why I'm asking about larger drives.


I don't think larger drives would improve that, probably even the  
opposite, depending on the drives, of course. More drives should  
scale, yes, but there's more to it.



Zitat von Momčilo Medić :


Hey Eugen,

On Wed, 2020-08-26 at 09:29 +, Eugen Block wrote:

Hi,

> > root@cephosd01:~# ceph config get mds.cephosd01 osd_op_queue
> > wpq
> > root@0cephosd01:~# ceph config get mds.cephosd01
> > osd_op_queue_cut_off
> > high

just to make sure, I referred to OSD not MDS settings, maybe check
again?


root@cephosd01:~# ceph config get osd.* osd_op_queue
wpq
root@cephosd01:~# ceph config get osd.* osd_op_queue_cut_off
high
root@cephosd01:~# ceph config get mon.* osd_op_queue
wpq
root@cephosd01:~# ceph config get mon.* osd_op_queue_cut_off
high
root@cephosd01:~# ceph config get mds.* osd_op_queue
wpq
root@cephosd01:~# ceph config get mds.* osd_op_queue_cut_off
high
root@cephosd01:~#

It seems no matter which setting I query, it's always the same.
Also, documentation for OSD clearly states[1] that it is the default.


I wouldn't focus too much on the MDS service, 64 GB RAM should be
enough, but you could and should also check the actual RAM usage,
of
course. But in our case it's pretty clear that the hard disks are
the
bottleneck although we  have rocksDB on SSD for all OSDs. We seem
to
have a similar use case (we have nightly compile jobs running in
cephfs) just with fewer clients. Our HDDs are saturated especially
if
we also run deep-scrubs during the night,  but the slow requests
have
been reduced since we changed the osd_op_queue settings for our OSDs.

Have you checked your disk utilization?


Disks are utilized roughly between 70 and 80 percent. Not sure why
operations would slow down when disks are getting more utilized.
If that were the case, I'd expect Ceph to issue a warning.

Have I understood correctly that the expectation is that if I used
larger drives I wouldn't be seeing these warnings?
I can understand that adding more disks would create better
parallelisation, that's why I'm asking about larger drives.

Thank you for discussing this with me, it's highly appreciated.



[1]
https://docs.ceph.com/docs/master/rados/configuration/osd-config-ref/#operations

Kind regards,
Momo.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Large RocksDB (db_slow_bytes) on OSD which is marked as out

2020-08-31 Thread Igor Fedotov

Could you please run:  ceph daemon osd.<id> calc_objectstore_db_histogram

and share the output?


On 8/31/2020 4:33 PM, Wido den Hollander wrote:



On 31/08/2020 12:31, Igor Fedotov wrote:

Hi Wido,

'b' prefix relates to free list manager which keeps all the free 
extents for main device in a bitmap. Its records have fixed size 
hence you can easily estimate the overall size for these type of data.




Yes, so I figured.

But I doubt it takes that much. I presume that DB just lacks the 
proper compaction. Which could happen eventually but looks like you 
interrupted the process by going offline.


May be try manual compaction with ceph-kvstore-tool?



This cluster is suffering from a lot of spillovers. So we tested with 
marking one OSD as out.


After being marked as out it still had this large DB. A compact didn't 
work, the RocksDB database just stayed so large.


New OSDs coming into the cluster aren't suffering from this and they 
have a RocksDB of a couple of MB in size.


Old OSDs installed with Luminous and now upgraded to Nautilus are 
suffering from this.


It kind of seems like garbage data stays behind in RocksDB which is never cleaned up.


Wido



Thanks,

Igor



On 8/31/2020 10:57 AM, Wido den Hollander wrote:

Hello,

On a Nautilus 14.2.8 cluster I am seeing large RocksDB database with 
many slow DB bytes in use.


To investigate this further I marked one OSD as out and waited for all the backfilling to complete.


Once the backfilling was completed I exported BlueFS and 
investigated the RocksDB using 'ceph-kvstore-tool'. This resulted in 
22GB of data.


Listing all the keys in the RocksDB shows me there are 747.000 keys 
in the DB. A small portion are osdmaps, but the biggest amount are 
keys prefixed with 'b'.


I dumped the stats of the RocksDB and this shows me:

L1: 1/0: 439.32 KB
L2: 1/0: 2.65 MB
L3: 5/0: 14.36 MB
L4: 127/0: 7.22 GB
L5: 217/0: 13.73 GB
Sum: 351/0: 20.98 GB

So there is almost 21GB of data in this RocksDB database. Why? Where 
is this coming from?


Throughout this cluster OSDs are suffering from many slow bytes used 
and I can't figure out why.


Has anybody seen this or has a clue on what is going on?

I have an external copy of this RocksDB database to do 
investigations on.


Thank you,

Wido
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Bluestore does not defer writes

2020-08-31 Thread Wido den Hollander




On 31/08/2020 11:00, Dennis Benndorf wrote:

Hi,

today I recognized bad performance in our cluster. Running "watch ceph
osd perf |sort -hk 2 -r" I found that all bluestore OSDs are slow on
commit and that the commit timings are equal to their apply timings:

For example
Every 2.0s: ceph osd perf |sort -hk 2
-r
  
440 8282

430 5858
435 5656
449 5353
442 4040
441 3030
439 2727
  99  0 1
  98  0 0
  97  0 2
  96  0 6
  95  0 2
  94  0 6
  93  013

The ones with zero commit timings are filestore and the others are
bluestore osds.
I did not see this after installing the new bluestore osds (maybe this
occurred later).
Both types of osds have nvmes as journal/db. Servers have equal
cpus/ram etc.

The only tuning regarding bluestore is:
   bluestore_block_db_size = 69793218560
   bluestore_prefer_deferred_size_hdd = 524288
In order to make a filestore like behavior, but that does not seem to
work.


As far as I know, with BlueStore the apply and commit latencies are equal.


Where did you get the idea that you could influence this with these 
settings?


Wido



Any tips?

Regards Dennis
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: osd regularly wrongly marked down

2020-08-31 Thread Wido den Hollander



On 31/08/2020 15:44, Francois Legrand wrote:

Thanks Igor for your answer,

We could try to do a compaction of RocksDB manually, but it's not clear to 
me if we have to compact on the mon with something like

ceph-kvstore-tool rocksdb  /var/lib/ceph/mon/mon01/store.db/ compact
or on the concerned osd with
ceph-kvstore-tool rocksdb  /var/lib/ceph/osd/ceph-16/ compact
(or for all osd with a script like in 
https://gist.github.com/wido/b0f0200bd1a2cbbe3307265c5cfb2771 )


You would compact the OSDs, not the MONs. So the last command or my 
script which you linked there.


Out of curiosity, how does compaction work? Is it done automatically in the
background, regularly, or at startup?


Usually it's done by the OSD in the background, but sometimes an offline 
compact works best.
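
For an offline compaction that would look something like this (a sketch,
assuming the usual systemd unit name and data path, with osd.16 as the
example from your log):

systemctl stop ceph-osd@16
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-16 compact
systemctl start ceph-osd@16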


Because in the logs of the osd we have, every 10 minutes, some reports about
compaction (which suggests that compaction occurs regularly), like:




Yes, that is normal. But the offline compaction is sometimes more 
effective than the online ones are.



2020-08-31 15:06:55.448 7f03fb398700  4 rocksdb: [db/db_impl.cc:777] 
--- DUMPING STATS ---

2020-08-31 15:06:55.448 7f03fb398700  4 rocksdb: [db/db_impl.cc:778]
** DB Stats **
Uptime(secs): 449404.8 total, 600.0 interval
Cumulative writes: 136K writes, 692K keys, 136K commit groups, 1.0 
writes per commit group, ingest: 0.28 GB, 0.00 MB/s
Cumulative WAL: 136K writes, 67K syncs, 2.04 writes per sync, written: 
0.28 GB, 0.00 MB/s

Cumulative stall: 00:00:0.000 H:M:S, 0.0 percent
Interval writes: 128 writes, 336 keys, 128 commit groups, 1.0 writes per 
commit group, ingest: 0.22 MB, 0.00 MB/s
Interval WAL: 128 writes, 64 syncs, 1.97 writes per sync, written: 0.00 
MB, 0.00 MB/s

Interval stall: 00:00:0.000 H:M:S, 0.0 percent

** Compaction Stats [default] **
Level    Files   Size       Score  Read(GB)  Rn(GB)  Rnp1(GB)  Write(GB)  Wnew(GB)  Moved(GB)  W-Amp  Rd(MB/s)  Wr(MB/s)  Comp(sec)  CompMergeCPU(sec)  Comp(cnt)  Avg(sec)  KeyIn  KeyDrop
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  L0     1/0     60.48 MB   0.2    0.0       0.0     0.0       0.1        0.1       0.0        1.0    0.0       163.7     0.52       0.40               2          0.258     0      0
  L1     0/0     0.00 KB    0.0    0.1       0.1     0.0       0.1        0.1       0.0        0.5    48.2      26.1      2.32       0.64               1          2.319     920K   197K
  L2     17/0    1.00 GB    0.8    1.1       0.1     1.1       1.1        0.0       0.0        18.3   69.8      67.5      16.38      4.97               1          16.380    4747K  82K
  L3     81/0    4.50 GB    0.9    0.6       0.1     0.5       0.3        -0.2      0.0        4.3    66.9      36.6      9.23       4.95               2          4.617     9544K  802K
  L4     285/0   16.64 GB   0.1    2.4       0.3     2.0       0.2        -1.8      0.0        0.8    110.3     11.7      21.92      4.37               5          4.384     12M    12M
 Sum     384/0   22.20 GB   0.0    4.2       0.6     3.6       1.8        -1.8      0.0        21.8   85.2      36.6      50.37      15.32              11         4.579     28M    13M
 Int     0/0     0.00 KB    0.0    0.0       0.0     0.0       0.0        0.0       0.0        0.0    0.0       0.0       0.00       0.00               0          0.000     0      0


** Compaction Stats [default] **
Priority  Files   Size      Score  Read(GB)  Rn(GB)  Rnp1(GB)  Write(GB)  Wnew(GB)  Moved(GB)  W-Amp  Rd(MB/s)  Wr(MB/s)  Comp(sec)  CompMergeCPU(sec)  Comp(cnt)  Avg(sec)  KeyIn  KeyDrop
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Low      0/0     0.00 KB   0.0    4.2       0.6     3.6       1.7        -1.9      0.0        0.0    86.0      35.3      49.86      14.92              9          5.540     28M    13M
High      0/0     0.00 KB   0.0    0.0       0.0     0.0       0.1        0.1       0.0        0.0    0.0       150.2     0.40       0.40               1          0.403     0      0
User      0/0     0.00 KB   0.0    0.0       0.0     0.0       0.0        0.0       0.0        0.0    0.0       211.7     0.11       0.00               1          0.114     0      0

Uptime(secs): 449404.8 total, 600.0 interval
Flush(GB): cumulative 0.083, interval 0.000
AddFile(Total Files): cumulative 0, interval 0
AddFile(L0 Files): cumulative 0, interval 0
AddFile(Keys): cumulative 0, interval 0
Cumulative compaction: 1.80 GB write, 0.00 MB/s write, 4.19 GB read, 
0.01 MB/s read, 50.4 seconds
Interval compaction: 0.00 GB write, 0.00 MB/s write, 0.00 GB read, 0.00 
MB/s read, 0.0 seconds
Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 
level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for 
pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 
memtable_compaction, 0 memtable_slowdown, interval 0 total count




Concerning the data removal, I don't know if this could be the trigger. 
We had some osds marked down before starting the removal, but at that 
time the situation was so confusing that I cannot be sure that the origin 
of the problem was

[ceph-users] Re: Recover pgs from failed osds

2020-08-31 Thread Vahideh Alinouri
The osd_memory_target of the failed osd on one ceph-osd node was changed to
6G while the other osds' memory_target stayed at 3G. Starting the failed osd
with the 6G memory_target causes other osds to go "down" on that ceph-osd
node, and the failed osd is still down.

On Mon, Aug 31, 2020 at 2:19 PM Eugen Block  wrote:

> Can you try the opposite and turn up the memory_target and only try to
> start a single OSD?
>
>
> Zitat von Vahideh Alinouri :
>
> > osd_memory_target is changed to 3G, starting failed osd causes ceph-osd
> > nodes crash! and failed osd is still "down"
> >
> > On Fri, Aug 28, 2020 at 1:13 PM Vahideh Alinouri <
> vahideh.alino...@gmail.com>
> > wrote:
> >
> >> Yes, each osd node has 7 osds with 4 GB memory_target.
> >>
> >>
> >> On Fri, Aug 28, 2020, 12:48 PM Eugen Block  wrote:
> >>
> >>> Just to confirm, each OSD node has 7 OSDs with 4 GB memory_target?
> >>> That leaves only 4 GB RAM for the rest, and in case of heavy load the
> >>> OSDs use even more. I would suggest to reduce the memory_target to 3
> >>> GB and see if they start successfully.
> >>>
> >>>
> >>> Zitat von Vahideh Alinouri :
> >>>
> >>> > osd_memory_target is 4294967296.
> >>> > Cluster setup:
> >>> > 3 mon, 3 mgr, 21 osds on 3 ceph-osd nodes in lvm scenario.  ceph-osd
> >>> nodes
> >>> > resources are 32G RAM - 4 core CPU - osd disk 4TB - 9 osds have
> >>> > block.wal on SSDs.  Public network is 1G and cluster network is 10G.
> >>> > Cluster installed and upgraded using ceph-ansible.
> >>> >
> >>> > On Thu, Aug 27, 2020 at 7:01 PM Eugen Block  wrote:
> >>> >
> >>> >> What is the memory_target for your OSDs? Can you share more details
> >>> >> about your setup? You write about high memory, are the OSD nodes
> >>> >> affected by OOM killer? You could try to reduce the
> osd_memory_target
> >>> >> and see if that helps bring the OSDs back up. Splitting the PGs is a
> >>> >> very heavy operation.
> >>> >>
> >>> >>
> >>> >> Zitat von Vahideh Alinouri :
> >>> >>
> >>> >> > Ceph cluster is updated from nautilus to octopus. On ceph-osd
> nodes
> >>> we
> >>> >> have
> >>> >> > high I/O wait.
> >>> >> >
> >>> >> > After increasing one of pool’s pg_num from 64 to 128 according to
> >>> warning
> >>> >> > message (more objects per pg), this lead to high cpu load and ram
> >>> usage
> >>> >> on
> >>> >> > ceph-osd nodes and finally crashed the whole cluster. Three osds,
> >>> one on
> >>> >> > each host, stuck at down state (osd.34 osd.35 osd.40).
> >>> >> >
> >>> >> > Starting the down osd service causes high ram usage and cpu load
> and
> >>> >> > ceph-osd node to crash until the osd service fails.
> >>> >> >
> >>> >> > The active mgr service on each mon host will crash after consuming
> >>> almost
> >>> >> > all available ram on the physical hosts.
> >>> >> >
> >>> >> > I need to recover pgs and solving corruption. How can i recover
> >>> unknown
> >>> >> and
> >>> >> > down pgs? Is there any way to starting up failed osd?
> >>> >> >
> >>> >> >
> >>> >> > Below steps are done:
> >>> >> >
> >>> >> > 1- osd nodes’ kernel was upgraded to 5.4.2 before ceph cluster
> >>> upgrading.
> >>> >> > Reverting to previous kernel 4.2.1 is tested for iowait
> decreasing,
> >>> but
> >>> >> it
> >>> >> > had no effect.
> >>> >> >
> >>> >> > 2- Recovering 11 pgs on failed osds by export them using
> >>> >> > ceph-objectstore-tools utility and import them on other osds. The
> >>> result
> >>> >> > followed: 9 pgs are “down” and 2 pgs are “unknown”.
> >>> >> >
> >>> >> > 2-1) 9 pgs export and import successfully but status is “down”
> >>> because of
> >>> >> > "peering_blocked_by" 3 failed osds. I cannot lost osds because of
> >>> >> > preventing unknown pgs from getting lost. pgs size in K and M.
> >>> >> >
> >>> >> > "peering_blocked_by": [
> >>> >> >
> >>> >> > {
> >>> >> >
> >>> >> > "osd": 34,
> >>> >> >
> >>> >> > "current_lost_at": 0,
> >>> >> >
> >>> >> > "comment": "starting or marking this osd lost may let us proceed"
> >>> >> >
> >>> >> > },
> >>> >> >
> >>> >> > {
> >>> >> >
> >>> >> > "osd": 35,
> >>> >> >
> >>> >> > "current_lost_at": 0,
> >>> >> >
> >>> >> > "comment": "starting or marking this osd lost may let us proceed"
> >>> >> >
> >>> >> > },
> >>> >> >
> >>> >> > {
> >>> >> >
> >>> >> > "osd": 40,
> >>> >> >
> >>> >> > "current_lost_at": 0,
> >>> >> >
> >>> >> > "comment": "starting or marking this osd lost may let us proceed"
> >>> >> >
> >>> >> > }
> >>> >> >
> >>> >> > ]
> >>> >> >
> >>> >> >
> >>> >> > 2-2) 1 pg (2.39) export and import successfully, but after
> starting
> >>> osd
> >>> >> > service (pg import to it), ceph-osd node RAM and CPU consumption
> >>> increase
> >>> >> > and cause ceph-osd node to crash until the osd service fails.
> Other
> >>> osds
> >>> >> > become "down" on ceph-osd node. pg status is “unknown”. I cannot
> use
> >>> >> > "force-create-pg" because of data lost. pg 2.39 size is 19G.
> >>> >> >
> >>> >> > # ceph pg map 2.39
> >>> >> >
> >>> >> > osdmap e40347 pg 2.39 (2.39) -> up [32,37] acting [32,37]
> >>> >> >
> >>> >> > # ceph pg 2.39 query

[ceph-users] Re: osd regularly wrongly marked down

2020-08-31 Thread Francois Legrand

Thanks Igor for your answer,

We could try to do a compaction of RocksDB manually, but it's not clear to 
me if we have to compact on the mon with something like

ceph-kvstore-tool rocksdb  /var/lib/ceph/mon/mon01/store.db/ compact
or on the concerned osd with
ceph-kvstore-tool rocksdb  /var/lib/ceph/osd/ceph-16/ compact
(or for all osd with a script like in 
https://gist.github.com/wido/b0f0200bd1a2cbbe3307265c5cfb2771 )


Out of curiosity, how does compaction work? Is it done automatically in the
background, regularly, or at startup?
Because in the logs of the osd we have, every 10 minutes, some reports about
compaction (which suggests that compaction occurs regularly), like:


2020-08-31 15:06:55.448 7f03fb398700  4 rocksdb: [db/db_impl.cc:777] 
--- DUMPING STATS ---

2020-08-31 15:06:55.448 7f03fb398700  4 rocksdb: [db/db_impl.cc:778]
** DB Stats **
Uptime(secs): 449404.8 total, 600.0 interval
Cumulative writes: 136K writes, 692K keys, 136K commit groups, 1.0 
writes per commit group, ingest: 0.28 GB, 0.00 MB/s
Cumulative WAL: 136K writes, 67K syncs, 2.04 writes per sync, written: 
0.28 GB, 0.00 MB/s

Cumulative stall: 00:00:0.000 H:M:S, 0.0 percent
Interval writes: 128 writes, 336 keys, 128 commit groups, 1.0 writes per 
commit group, ingest: 0.22 MB, 0.00 MB/s
Interval WAL: 128 writes, 64 syncs, 1.97 writes per sync, written: 0.00 
MB, 0.00 MB/s

Interval stall: 00:00:0.000 H:M:S, 0.0 percent

** Compaction Stats [default] **
Level    Files   Size       Score  Read(GB)  Rn(GB)  Rnp1(GB)  Write(GB)  Wnew(GB)  Moved(GB)  W-Amp  Rd(MB/s)  Wr(MB/s)  Comp(sec)  CompMergeCPU(sec)  Comp(cnt)  Avg(sec)  KeyIn  KeyDrop
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  L0     1/0     60.48 MB   0.2    0.0       0.0     0.0       0.1        0.1       0.0        1.0    0.0       163.7     0.52       0.40               2          0.258     0      0
  L1     0/0     0.00 KB    0.0    0.1       0.1     0.0       0.1        0.1       0.0        0.5    48.2      26.1      2.32       0.64               1          2.319     920K   197K
  L2     17/0    1.00 GB    0.8    1.1       0.1     1.1       1.1        0.0       0.0        18.3   69.8      67.5      16.38      4.97               1          16.380    4747K  82K
  L3     81/0    4.50 GB    0.9    0.6       0.1     0.5       0.3        -0.2      0.0        4.3    66.9      36.6      9.23       4.95               2          4.617     9544K  802K
  L4     285/0   16.64 GB   0.1    2.4       0.3     2.0       0.2        -1.8      0.0        0.8    110.3     11.7      21.92      4.37               5          4.384     12M    12M
 Sum     384/0   22.20 GB   0.0    4.2       0.6     3.6       1.8        -1.8      0.0        21.8   85.2      36.6      50.37      15.32              11         4.579     28M    13M
 Int     0/0     0.00 KB    0.0    0.0       0.0     0.0       0.0        0.0       0.0        0.0    0.0       0.0       0.00       0.00               0          0.000     0      0


** Compaction Stats [default] **
Priority  Files   Size      Score  Read(GB)  Rn(GB)  Rnp1(GB)  Write(GB)  Wnew(GB)  Moved(GB)  W-Amp  Rd(MB/s)  Wr(MB/s)  Comp(sec)  CompMergeCPU(sec)  Comp(cnt)  Avg(sec)  KeyIn  KeyDrop
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Low      0/0     0.00 KB   0.0    4.2       0.6     3.6       1.7        -1.9      0.0        0.0    86.0      35.3      49.86      14.92              9          5.540     28M    13M
High      0/0     0.00 KB   0.0    0.0       0.0     0.0       0.1        0.1       0.0        0.0    0.0       150.2     0.40       0.40               1          0.403     0      0
User      0/0     0.00 KB   0.0    0.0       0.0     0.0       0.0        0.0       0.0        0.0    0.0       211.7     0.11       0.00               1          0.114     0      0

Uptime(secs): 449404.8 total, 600.0 interval
Flush(GB): cumulative 0.083, interval 0.000
AddFile(Total Files): cumulative 0, interval 0
AddFile(L0 Files): cumulative 0, interval 0
AddFile(Keys): cumulative 0, interval 0
Cumulative compaction: 1.80 GB write, 0.00 MB/s write, 4.19 GB read, 
0.01 MB/s read, 50.4 seconds
Interval compaction: 0.00 GB write, 0.00 MB/s write, 0.00 GB read, 0.00 
MB/s read, 0.0 seconds
Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 
level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for 
pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 
memtable_compaction, 0 memtable_slowdown, interval 0 total count




Concerning the data removal, I don't know if this could be the trigger. 
We had some osds marked down before starting the removal, but at that 
time the situation was so confusing that I cannot be sure that the origin 
of the problem was the same. Indeed, the large data removal concerned 
data in an old pool which has been destroyed since (thus all the pgs of 
that old pool no longer exist). And it seems that now the cluster 
is rather inactive due to holiday time (o

[ceph-users] Re: Large RocksDB (db_slow_bytes) on OSD which is marked as out

2020-08-31 Thread Wido den Hollander




On 31/08/2020 12:31, Igor Fedotov wrote:

Hi Wido,

'b' prefix relates to free list manager which keeps all the free extents 
for main device in a bitmap. Its records have fixed size hence you can 
easily estimate the overall size for these type of data.




Yes, so I figured.

But I doubt it takes that much. I presume that DB just lacks the proper 
compaction. Which could happen eventually but looks like you interrupted 
the process by going offline.


May be try manual compaction with ceph-kvstore-tool?



This cluster is suffering from a lot of spillovers. So we tested with 
marking one OSD as out.


After being marked as out it still had this large DB. A compact didn't 
work, the RocksDB database just stayed so large.


New OSDs coming into the cluster aren't suffering from this and they 
have a RocksDB of a couple of MB in size.


Old OSDs installed with Luminous and now upgraded to Nautilus are 
suffering from this.


It kind of seems like garbage data stays behind in RocksDB which is never cleaned up.


Wido



Thanks,

Igor



On 8/31/2020 10:57 AM, Wido den Hollander wrote:

Hello,

On a Nautilus 14.2.8 cluster I am seeing large RocksDB database with 
many slow DB bytes in use.


To investigate this further I marked one OSD as out and waited for all the backfilling to complete.


Once the backfilling was completed I exported BlueFS and investigated 
the RocksDB using 'ceph-kvstore-tool'. This resulted in 22GB of data.


Listing all the keys in the RocksDB shows me there are 747.000 keys in 
the DB. A small portion are osdmaps, but the biggest amount are keys 
prefixed with 'b'.


I dumped the stats of the RocksDB and this shows me:

L1: 1/0: 439.32 KB
L2: 1/0: 2.65 MB
L3: 5/0: 14.36 MB
L4: 127/0: 7.22 GB
L5: 217/0: 13.73 GB
Sum: 351/0: 20.98 GB

So there is almost 21GB of data in this RocksDB database. Why? Where 
is this coming from?


Throughout this cluster OSDs are suffering from many slow bytes used 
and I can't figure out why.


Has anybody seen this or has a clue on what is going on?

I have an external copy of this RocksDB database to do investigations on.

Thank you,

Wido
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Speeding up reconnection

2020-08-31 Thread William Edwards
I replaced the VMs taking care of routing between clients and MDSes with physical 
machines. The problems below are solved. It seems to have been related to issues 
with the virtual NIC. It seemed to work well with E1000 instead of VirtIO...


Met vriendelijke groeten,

William Edwards

- Original Message -
From: William Edwards (wedwa...@cyberfusion.nl)
Date: 08/11/20 11:38
To: ceph-users@ceph.io
Subject: Speeding up reconnection


Hello,

When connection is lost between kernel client, a few things happen:

1.
Caps become stale:

Aug 11 11:08:14 admin-cap kernel: [308405.227718] ceph: mds0 caps stale

2.
MDS evicts client for being unresponsive:

MDS log: 2020-08-11 11:12:08.923 7fd1f45ae700  0 log_channel(cluster) log [WRN] 
: evicting unresponsive client admin-cap.cf.ha.cyberfusion.cloud:DB0001-cap 
(144786749), after 300.978 seconds
Client log: Aug 11 11:12:11 admin-cap kernel: [308643.051006] ceph: mds0 hung

3.
Socket is closed:

Aug 11 11:22:57 admin-cap kernel: [309289.192705] libceph: mds0 
[fdb7:b01e:7b8e:0:10:10:10:1]:6849 socket closed (con state OPEN)

I am not sure whether the kernel client or MDS closes the connection. I think 
the kernel client does so, because nothing is logged at the MDS side at 11:22:57

4.
Connection is reset by MDS:

MDS log: 2020-08-11 11:22:58.831 7fd1f9e49700  0 --1- 
[v2:[fdb7:b01e:7b8e:0:10:10:10:1]:6800/3619156441,v1:[fdb7:b01e:7b8e:0:10:10:10:1]:6849/3619156441]
 >> v1:[fc00:b6d:cfc:951::7]:0/133007863 conn(0x55bfaf1c2880 0x55c16cb47000 
:6849 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 
l=0).handle_connect_message_2 accept we reset (peer sent cseq 1), sending 
RESETSESSION
Client log: Aug 11 11:22:58 admin-cap kernel: [309290.058222] libceph: mds0 
[fdb7:b01e:7b8e:0:10:10:10:1]:6849 connection reset

5.
Kernel client reconnects:

Aug 11 11:22:58 admin-cap kernel: [309290.058972] ceph: mds0 closed our session
Aug 11 11:22:58 admin-cap kernel: [309290.058973] ceph: mds0 reconnect start
Aug 11 11:22:58 admin-cap kernel: [309290.069979] ceph: mds0 reconnect denied
Aug 11 11:22:58 admin-cap kernel: [309290.069996] ceph: dropping file locks for 
6a23d9dd 1099625041446
Aug 11 11:22:58 admin-cap kernel: [309290.071135] libceph: mds0 
[fdb7:b01e:7b8e:0:10:10:10:1]:6849 socket closed (con state NEGOTIATING)

Question:

As you can see, there's 10 minutes between losing the connection and the 
reconnection attempt (11:12:08 - 11:22:58). I could not find any settings 
related to the period after which reconnection is attempted. I would like to 
change this value from 10 minutes to something like 1 minute. I also tried 
searching the Ceph docs for the string '600' (10 minutes), but did not find 
anything useful.

Hope someone can help.

Environment details:

Client kernel: 4.19.0-10-amd64
Ceph version: ceph version 14.2.9 (bed944f8c45b9c98485e99b70e11bbcec6f6659a) 
nautilus (stable)

Met vriendelijke groeten,

William Edwards

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Can 16 server grade ssd's be slower then 60 hdds? (no extra journals)

2020-08-31 Thread VELARTIS Philipp Dürhammer
We have older LSI RAID controllers with no HBA/JBOD option, so we expose the 
single disks as raid0 devices. Ceph should not be aware of the cache status?
But digging deeper into it, it seems that 1 out of 4 servers is performing a lot 
better and has super low commit/apply rates, while the others have a lot more 
(20+) on heavy writes. This just applies to the ssds. For the hdds I can't see a 
difference...

-Ursprüngliche Nachricht-
Von: Frank Schilder  
Gesendet: Montag, 31. August 2020 13:19
An: VELARTIS Philipp Dürhammer ; 'ceph-users@ceph.io' 

Betreff: Re: Can 16 server grade ssd's be slower then 60 hdds? (no extra 
journals)

Yes, they can - if volatile write cache is not disabled. There are many threads 
on this, also recent. Search for "disable write cache" and/or "disable volatile 
write cache".

You will also find different methods of doing this automatically.
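
As a rough sketch of what those threads describe (device names are
placeholders; check which variant applies to your drives before using it):

# SATA drives: turn off the volatile write cache
hdparm -W 0 /dev/sdX
# SAS/SCSI drives: clear the WCE bit
sdparm --set=WCE=0 /dev/sdX

A udev rule or a small systemd unit can apply this at boot so the setting
survives reboots.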

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: VELARTIS Philipp Dürhammer 
Sent: 31 August 2020 13:02:45
To: 'ceph-users@ceph.io'
Subject: [ceph-users] Can 16 server grade ssd's be slower then 60 hdds? (no 
extra journals)

I have a productive 60 osd's cluster. No extra journals. It's performing okay. 
Now I added an extra ssd pool with 16 Micron 5100 MAX. And the performance is 
a little slower than or equal to the 60 hdd pool. 4K random as well as sequential reads. 
All on a dedicated 2 times 10G network. HDDs are still on filestore. SSDs are on 
bluestore. Ceph Luminous.
What should be possible with 16 ssd's vs. 60 hdd's and no extra journals?

___
ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to 
ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: slow "rados ls"

2020-08-31 Thread Marcel Kuiper
The compaction of the bluestore-kv's helped indeed. The response is back to
acceptable levels.

Thanks for the help

> Thank you Stefan, I'm going to give that a try
>
> Kind Regards
>
> Marcel Kuiper
>
>> On 2020-08-27 13:29, Marcel Kuiper wrote:
>>> Sorry that had to be Wido/Stefan
>>
>> What does "ceph osd df" give you? There is a column with "OMAP" and
>> "META". OMAP is ~ 13 B, META 26 GB in our setup. Quite a few files in
>> cephfs (main reason we have large OMAP).
>>
>>>
>>> Another question is: hoe to use this ceph-kvstore-tool tool to compact
>>> the
>>> rocksdb? (can't find a lot of examples)
>>
>> If you want to do a whole host a the time:
>>
>> systemctl stop ceph-osd.target
>>
>> wait a few seconds till all processes are closed.
>>
>> for osd in `ls /var/lib/ceph/osd/`; do (ceph-kvstore-tool bluestore-kv
>> /var/lib/ceph/osd/$osd compact &);done
>>
>> This works for us (no seperate WAL/DB). Check the help of the
>> ceph-kvstore-tool if you have to do anything special with separate DB /
>> WAL devices.
>>
>> Gr. Stefan
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Can 16 server grade ssd's be slower then 60 hdds? (no extra journals)

2020-08-31 Thread Hans van den Bogert
Perhaps both clusters have the same bottleneck and you perceive them as 
equally fast.


Can you provide as much details of your clusters as possible?

Also please show outputs of the tests that you've run.
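
For example, a quick per-pool comparison could look like this (pool names
are placeholders; --no-cleanup keeps the benchmark objects around so the
read test has something to read):

rados bench -p ssd-pool 60 write -b 4096 -t 16 --no-cleanup
rados bench -p ssd-pool 60 rand -t 16
rados bench -p hdd-pool 60 write -b 4096 -t 16 --no-cleanup
rados bench -p hdd-pool 60 rand -t 16

together with the exact fio/rbd command lines you ran on the clients.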

On 8/31/20 1:02 PM, VELARTIS Philipp Dürhammer wrote:

I have a productive 60 osd's cluster. No extra journals. It's performing okay. 
Now I added an extra ssd pool with 16 Micron 5100 MAX. And the performance is 
a little slower than or equal to the 60 hdd pool. 4K random as well as sequential reads. 
All on a dedicated 2 times 10G network. HDDs are still on filestore. SSDs are on 
bluestore. Ceph Luminous.
What should be possible with 16 ssd's vs. 60 hdd's and no extra journals?

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Can 16 server grade ssd's be slower then 60 hdds? (no extra journals)

2020-08-31 Thread VELARTIS Philipp Dürhammer
I have a productive 60 osd's cluster. No extra journals. It's performing okay. 
Now I added an extra ssd pool with 16 Micron 5100 MAX. And the performance is 
a little slower than or equal to the 60 hdd pool. 4K random as well as sequential reads. 
All on a dedicated 2 times 10G network. HDDs are still on filestore. SSDs are on 
bluestore. Ceph Luminous.
What should be possible with 16 ssd's vs. 60 hdd's and no extra journals?

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Large RocksDB (db_slow_bytes) on OSD which is marked as out

2020-08-31 Thread Igor Fedotov

Hi Wido,

'b' prefix relates to free list manager which keeps all the free extents 
for main device in a bitmap. Its records have fixed size hence you can 
easily estimate the overall size for these type of data.


But I doubt it takes that much. I presume that DB just lacks the proper 
compaction. Which could happen eventually but looks like you interrupted 
the process by going offline.


May be try manual compaction with ceph-kvstore-tool?


Thanks,

Igor



On 8/31/2020 10:57 AM, Wido den Hollander wrote:

Hello,

On a Nautilus 14.2.8 cluster I am seeing large RocksDB database with 
many slow DB bytes in use.


To investigate this further I marked one OSD as out and waited for all the backfilling to complete.


Once the backfilling was completed I exported BlueFS and investigated 
the RocksDB using 'ceph-kvstore-tool'. This resulted in 22GB of data.
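
(For reference, a sketch of the kind of commands used for this, with osd.123
as a placeholder id and the OSD stopped; the stats subcommand may not exist
in older ceph-kvstore-tool versions:

ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-123 bluefs-export --out-dir /tmp/osd-123-bluefs
ceph-kvstore-tool rocksdb /tmp/osd-123-bluefs/db list | wc -l
ceph-kvstore-tool rocksdb /tmp/osd-123-bluefs/db stats
)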


Listing all the keys in the RocksDB shows me there are 747.000 keys in 
the DB. A small portion are osdmaps, but the biggest amount are keys 
prefixed with 'b'.


I dumped the stats of the RocksDB and this shows me:

L1: 1/0: 439.32 KB
L2: 1/0: 2.65 MB
L3: 5/0: 14.36 MB
L4: 127/0: 7.22 GB
L5: 217/0: 13.73 GB
Sum: 351/0: 20.98 GB

So there is almost 21GB of data in this RocksDB database. Why? Where 
is this coming from?


Throughout this cluster OSDs are suffering from many slow bytes used 
and I can't figure out why.


Has anybody seen this or has a clue on what is going on?

I have an external copy of this RocksDB database to do investigations on.

Thank you,

Wido
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: osd regularly wrongly marked down

2020-08-31 Thread Igor Fedotov

Hi Francois,

given that slow operations are observed for collection listings you 
might want to manually compact RocksDB using ceph-kvstore-tool.


The observed slowdown tends to happen after massive data removals. I've 
seen multiple complaints about this issue, including some posts in this 
mailing list. BTW I can see your post from Jun 24 about slow pool 
removal - couldn't this be a trigger?


Also wondering whether you have standalone fast(SSD/NVMe) drive for 
DB/WAL? Aren't there any BlueFS spillovers which might be relevant?



Thanks,

Igor


On 8/28/2020 11:33 AM, Francois Legrand wrote:

Hi all,

We have a ceph cluster in production with 6 osds servers (with 16x8TB 
disks), 3 mons/mgrs and 3 mdss. Both public and cluster networks are 
in 10GB and works well.


After a major crash in april, we turned the option bluefs_buffered_io 
to false  to workaround the large write bug when bluefs_buffered_io 
was true (we were in version 14.2.8 and the default value at this time 
was true).
Since that time, we regularly have some osds wrongly marked down by 
the cluster after heartbeat timeout (heartbeat_map is_healthy 
'OSD::osd_op_tp thread 0x7f03f1384700' had timed out after 15).


Generally the osd restarts and the cluster is back healthy, but several 
times, after many of these kick-offs, the osd reaches the 
osd_op_thread_suicide_timeout and goes down permanently.


We increased the osd_op_thread_timeout and 
osd_op_thread_suicide_timeout... The problems still occurs (but less 
frequently).


Few days ago, we upgraded to 14.2.11 and revert the timeout to their 
default value, hoping that it will solve the problem (we thought that 
it should be related to this bug 
https://tracker.ceph.com/issues/45943), but it didn't. We still have 
some osds wrongly marked down.


Can somebody help us to fix this problem ?
Thanks.

Here is an extract of an osd log at failure time:

-
2020-08-28 02:19:05.019 7f03f1384700  0 log_channel(cluster) log [DBG] 
: 44.7d scrub starts
2020-08-28 02:19:25.755 7f040e43d700  1 heartbeat_map is_healthy 
'OSD::osd_op_tp thread 0x7f03f1384700' had timed out after 15
2020-08-28 02:19:25.755 7f040dc3c700  1 heartbeat_map is_healthy 
'OSD::osd_op_tp thread 0x7f03f1384700' had timed out after 15

this last line is repeated more than 1000 times
...
2020-08-28 02:20:17.484 7f040d43b700  1 heartbeat_map is_healthy 
'OSD::osd_op_tp thread 0x7f03f1384700' had timed out after 15
2020-08-28 02:20:17.551 7f03f1384700  0 
bluestore(/var/lib/ceph/osd/ceph-16) log_latency_fn slow operation 
observed for _collection_list, latency = 67.3532s, lat = 67s cid 
=44.7d_head start GHMAX end GHMAX max 25

...
2020-08-28 02:20:22.600 7f040dc3c700  1 heartbeat_map is_healthy 
'OSD::osd_op_tp thread 0x7f03f1384700' had timed out after 15
2020-08-28 02:21:20.774 7f03f1384700  0 
bluestore(/var/lib/ceph/osd/ceph-16) log_latency_fn slow operation 
observed for _collection_list, latency = 63.223s, lat = 63s cid 
=44.7d_head start 
#44:beffc78d:::rbd_data.1e48e8ab988992.11bd:0# end #MAX# 
max 2147483647
2020-08-28 02:21:20.774 7f03f1384700  1 heartbeat_map reset_timeout 
'OSD::osd_op_tp thread 0x7f03f1384700' had timed out after 15
2020-08-28 02:21:20.805 7f03f1384700  0 log_channel(cluster) log [DBG] 
: 44.7d scrub ok
2020-08-28 02:21:21.099 7f03fd997700  0 log_channel(cluster) log [WRN] 
: Monitor daemon marked osd.16 down, but it is still running
2020-08-28 02:21:21.099 7f03fd997700  0 log_channel(cluster) log [DBG] 
: map e609411 wrongly marked me down at e609410
2020-08-28 02:21:21.099 7f03fd997700  1 osd.16 609411 
start_waiting_for_healthy

2020-08-28 02:21:21.119 7f03fd997700  1 osd.16 609411 start_boot
2020-08-28 02:21:21.124 7f03f0b83700  1 osd.16 pg_epoch: 609410 
pg[36.3d0( v 609409'481293 (449368'478292,609409'481293] 
local-lis/les=609403/609404 n=154651 ec=435353/435353 lis/c 
609403/609403 les/c/f 609404/609404/0 609410/609410/608752) [25,72] 
r=-1 lpr=609410 pi=[609403,609410)/1 luod=0'0 lua=609392'481198 
crt=609409'481293 lcod 609409'481292 active mbc={}] 
start_peering_interval up [25,72,16] -> [25,72], acting [25,72,16] -> 
[25,72], acting_primary 25 -> 25, up_primary 25 -> 25, role 2 -> -1, 
features acting 4611087854031667199 upacting 4611087854031667199

...
2020-08-28 02:21:21.166 7f03f0b83700  1 osd.16 pg_epoch: 609411 
pg[36.56( v 609409'480511 (449368'477424,609409'480511] 
local-lis/les=609403/609404 n=153854 ec=435353/435353 lis/c 
609403/609403 les/c/f 609404/609404/0 609410/609410/609410) [103,102] 
r=-1 lpr=609410 pi=[609403,609410)/1 crt=609409'480511 lcod 
609409'480510 unknown NOTIFY mbc={}] state: transitioning to Stray
2020-08-28 02:21:21.307 7f04073b0700  1 osd.16 609413 
set_numa_affinity public network em1 numa node 0
2020-08-28 02:21:21.307 7f04073b0700  1 osd.16 609413 
set_numa_affinity cluster network em2 numa node 0
2020-08-28 02:21:21.307 7f04073b0700  1 osd.16 609413 
set_numa_affinity objectstore and network numa nodes do not match

[ceph-users] Re: Persistent problem with slow metadata

2020-08-31 Thread Momčilo Medić
Hey Eugen,

On Wed, 2020-08-26 at 09:29 +, Eugen Block wrote:
> Hi,
> 
> > > root@cephosd01:~# ceph config get mds.cephosd01 osd_op_queue
> > > wpq
> > > root@0cephosd01:~# ceph config get mds.cephosd01
> > > osd_op_queue_cut_off
> > > high
> 
> just to make sure, I referred to OSD not MDS settings, maybe check
> again?

root@cephosd01:~# ceph config get osd.* osd_op_queue
wpq
root@cephosd01:~# ceph config get osd.* osd_op_queue_cut_off
high
root@cephosd01:~# ceph config get mon.* osd_op_queue
wpq
root@cephosd01:~# ceph config get mon.* osd_op_queue_cut_off
high
root@cephosd01:~# ceph config get mds.* osd_op_queue
wpq
root@cephosd01:~# ceph config get mds.* osd_op_queue_cut_off
high
root@cephosd01:~#

It seems no matter which setting I query, it's always the same.
Also, documentation for OSD clearly states[1] that it is the default.

> I wouldn't focus too much on the MDS service, 64 GB RAM should be  
> enough, but you could and should also check the actual RAM usage,
> of  
> course. But in our case it's pretty clear that the hard disks are
> the  
> bottleneck although we  have rocksDB on SSD for all OSDs. We seem
> to  
> have a similar use case (we have nightly compile jobs running in  
> cephfs) just with fewer clients. Our HDDs are saturated especially
> if  
> we also run deep-scrubs during the night,  but the slow requests
> have  
> been reduced since we changed the osd_op_queue settings for our OSDs.
> 
> Have you checked your disk utilization?

Disks are utilized roughly between 70 and 80 percent. Not sure why
operations would slow down when disks are getting more utilized.
If that were the case, I'd expect Ceph to issue a warning.

Have I understood correctly that the expectation is that if I used
larger drives I wouldn't be seeing these warnings?
I can understand that adding more disks would create better
parallelisation, that's why I'm asking about larger drives.

Thank you for discussing this with me, it's highly appreciated.



[1] 
https://docs.ceph.com/docs/master/rados/configuration/osd-config-ref/#operations

Kind regards,
Momo.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Recover pgs from failed osds

2020-08-31 Thread Eugen Block
Can you try the opposite and turn up the memory_target and only try to  
start a single OSD?
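
For example (just a sketch; osd.34 is one of the failed ones you mentioned,
and 6442450944 is 6 GiB):

ceph config set osd.34 osd_memory_target 6442450944
systemctl start ceph-osd@34

and watch the node's memory while it recovers before touching the next one.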



Zitat von Vahideh Alinouri :


osd_memory_target is changed to 3G, starting failed osd causes ceph-osd
nodes crash! and failed osd is still "down"

On Fri, Aug 28, 2020 at 1:13 PM Vahideh Alinouri 
wrote:


Yes, each osd node has 7 osds with 4 GB memory_target.


On Fri, Aug 28, 2020, 12:48 PM Eugen Block  wrote:


Just to confirm, each OSD node has 7 OSDs with 4 GB memory_target?
That leaves only 4 GB RAM for the rest, and in case of heavy load the
OSDs use even more. I would suggest to reduce the memory_target to 3
GB and see if they start successfully.


Zitat von Vahideh Alinouri :

> osd_memory_target is 4294967296.
> Cluster setup:
> 3 mon, 3 mgr, 21 osds on 3 ceph-osd nodes in lvm scenario.  ceph-osd
nodes
> resources are 32G RAM - 4 core CPU - osd disk 4TB - 9 osds have
> block.wal on SSDs.  Public network is 1G and cluster network is 10G.
> Cluster installed and upgraded using ceph-ansible.
>
> On Thu, Aug 27, 2020 at 7:01 PM Eugen Block  wrote:
>
>> What is the memory_target for your OSDs? Can you share more details
>> about your setup? You write about high memory, are the OSD nodes
>> affected by OOM killer? You could try to reduce the osd_memory_target
>> and see if that helps bring the OSDs back up. Splitting the PGs is a
>> very heavy operation.
>>
>>
>> Zitat von Vahideh Alinouri :
>>
>> > Ceph cluster is updated from nautilus to octopus. On ceph-osd nodes
we
>> have
>> > high I/O wait.
>> >
>> > After increasing one of pool’s pg_num from 64 to 128 according to
warning
>> > message (more objects per pg), this lead to high cpu load and ram
usage
>> on
>> > ceph-osd nodes and finally crashed the whole cluster. Three osds,
one on
>> > each host, stuck at down state (osd.34 osd.35 osd.40).
>> >
>> > Starting the down osd service causes high ram usage and cpu load and
>> > ceph-osd node to crash until the osd service fails.
>> >
>> > The active mgr service on each mon host will crash after consuming
almost
>> > all available ram on the physical hosts.
>> >
>> > I need to recover pgs and solving corruption. How can i recover
unknown
>> and
>> > down pgs? Is there any way to starting up failed osd?
>> >
>> >
>> > Below steps are done:
>> >
>> > 1- osd nodes’ kernel was upgraded to 5.4.2 before ceph cluster
upgrading.
>> > Reverting to previous kernel 4.2.1 is tested for iowait decreasing,
but
>> it
>> > had no effect.
>> >
>> > 2- Recovering 11 pgs on failed osds by export them using
>> > ceph-objectstore-tools utility and import them on other osds. The
result
>> > followed: 9 pgs are “down” and 2 pgs are “unknown”.
>> >
>> > 2-1) 9 pgs export and import successfully but status is “down”
because of
>> > "peering_blocked_by" 3 failed osds. I cannot lost osds because of
>> > preventing unknown pgs from getting lost. pgs size in K and M.
>> >
>> > "peering_blocked_by": [
>> >
>> > {
>> >
>> > "osd": 34,
>> >
>> > "current_lost_at": 0,
>> >
>> > "comment": "starting or marking this osd lost may let us proceed"
>> >
>> > },
>> >
>> > {
>> >
>> > "osd": 35,
>> >
>> > "current_lost_at": 0,
>> >
>> > "comment": "starting or marking this osd lost may let us proceed"
>> >
>> > },
>> >
>> > {
>> >
>> > "osd": 40,
>> >
>> > "current_lost_at": 0,
>> >
>> > "comment": "starting or marking this osd lost may let us proceed"
>> >
>> > }
>> >
>> > ]
>> >
>> >
>> > 2-2) 1 pg (2.39) export and import successfully, but after starting
osd
>> > service (pg import to it), ceph-osd node RAM and CPU consumption
increase
>> > and cause ceph-osd node to crash until the osd service fails. Other
osds
>> > become "down" on ceph-osd node. pg status is “unknown”. I cannot use
>> > "force-create-pg" because of data lost. pg 2.39 size is 19G.
>> >
>> > # ceph pg map 2.39
>> >
>> > osdmap e40347 pg 2.39 (2.39) -> up [32,37] acting [32,37]
>> >
>> > # ceph pg 2.39 query
>> >
>> > Error ENOENT: i don't have pgid 2.39
>> >
>> >
>> > *pg 2.39 info on failed osd:
>> >
>> > # ceph-objectstore-tool --data-path /var/lib/ceph/osd/*ceph-34* --op
info
>> > --pgid 2.39
>> >
>> > {
>> >
>> > "pgid": "2.39",
>> >
>> > "last_update": "35344'6456084",
>> >
>> > "last_complete": "35344'6456084",
>> >
>> > "log_tail": "35344'6453182",
>> >
>> > "last_user_version": 10595821,
>> >
>> > "last_backfill": "MAX",
>> >
>> > "purged_snaps": [],
>> >
>> > "history": {
>> >
>> > "epoch_created": 146,
>> >
>> > "epoch_pool_created": 79,
>> >
>> > "last_epoch_started": 25208,
>> >
>> > "last_interval_started": 25207,
>> >
>> > "last_epoch_clean": 25208,
>> >
>> > "last_interval_clean": 25207,
>> >
>> > "last_epoch_split": 370,
>> >
>> > "last_epoch_marked_full": 0,
>> >
>> > "same_up_since": 8347,
>> >
>> > "same_interval_since": 25207,
>> >
>> > "same_primary_since": 8321,
>> >
>> > "last_scrub": "35328'6440139",
>> >
>> > "last_scrub_stamp": "2020-08-19T12:00:59.377593+0430",
>> >
>> > "last_deep_scrub": "35261'6031075",
>> >
>> > "last_deep_scrub_stamp": "2020-08-1

[ceph-users] Re: Persistent problem with slow metadata

2020-08-31 Thread Momčilo Medić
Hi Dave,

On Tue, 2020-08-25 at 15:25 +0100, david.neal wrote:
> Hi Momo,
> 
> This can be caused by many things apart from the ceph sw.
> 
> For example I saw this once with the MTU in openvswitch not fully
> matching on a few nodes . We realised this using ping between nodes.
> For a 9000 MTU:
> 
> "linux- ping -M do -s 8972 -c 4  

I've performed a test as you suggested and had no hiccups for any node
(long ping log is at the bottom of this email for the curious).

> Perhaps starting from the ground up and testing might be the way to
> go?

This cluster is very new, deployed in December and has been kept up to
date since. Like I said previously, we didn't do any config
customization - everything is as vanilla as possible.

> Kind regards,
> 
> Dave



Kind regards,
Momo.


Log:

root@cephosd01:~# ping -M do -s 8972 -c 4 10.179.40.33
PING 10.179.40.33 (10.179.40.33) 8972(9000) bytes of data.
8980 bytes from 10.179.40.33: icmp_seq=1 ttl=64 time=0.020 ms
8980 bytes from 10.179.40.33: icmp_seq=2 ttl=64 time=0.047 ms
8980 bytes from 10.179.40.33: icmp_seq=3 ttl=64 time=0.030 ms
8980 bytes from 10.179.40.33: icmp_seq=4 ttl=64 time=0.027 ms

--- 10.179.40.33 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3062ms
rtt min/avg/max/mdev = 0.020/0.031/0.047/0.009 ms
root@cephosd01:~# ping -M do -s 8972 -c 4 10.179.40.34
PING 10.179.40.34 (10.179.40.34) 8972(9000) bytes of data.
8980 bytes from 10.179.40.34: icmp_seq=1 ttl=64 time=0.141 ms
8980 bytes from 10.179.40.34: icmp_seq=2 ttl=64 time=0.094 ms
8980 bytes from 10.179.40.34: icmp_seq=3 ttl=64 time=0.106 ms
8980 bytes from 10.179.40.34: icmp_seq=4 ttl=64 time=0.139 ms

--- 10.179.40.34 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3078ms
rtt min/avg/max/mdev = 0.094/0.120/0.141/0.020 ms
root@cephosd01:~# ping -M do -s 8972 -c 4 10.179.40.35
PING 10.179.40.35 (10.179.40.35) 8972(9000) bytes of data.
8980 bytes from 10.179.40.35: icmp_seq=1 ttl=64 time=0.113 ms
8980 bytes from 10.179.40.35: icmp_seq=2 ttl=64 time=0.169 ms
8980 bytes from 10.179.40.35: icmp_seq=3 ttl=64 time=0.138 ms
8980 bytes from 10.179.40.35: icmp_seq=4 ttl=64 time=0.081 ms

--- 10.179.40.35 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3062ms
rtt min/avg/max/mdev = 0.081/0.125/0.169/0.033 ms
root@cephosd01:~# ping -M do -s 8972 -c 4 10.179.40.36
PING 10.179.40.36 (10.179.40.36) 8972(9000) bytes of data.
8980 bytes from 10.179.40.36: icmp_seq=1 ttl=64 time=0.147 ms
8980 bytes from 10.179.40.36: icmp_seq=2 ttl=64 time=0.163 ms
8980 bytes from 10.179.40.36: icmp_seq=3 ttl=64 time=0.132 ms
8980 bytes from 10.179.40.36: icmp_seq=4 ttl=64 time=0.077 ms

--- 10.179.40.36 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3054ms
rtt min/avg/max/mdev = 0.077/0.129/0.163/0.035 ms
root@cephosd01:~# ping -M do -s 8972 -c 4 10.179.40.37
PING 10.179.40.37 (10.179.40.37) 8972(9000) bytes of data.
8980 bytes from 10.179.40.37: icmp_seq=1 ttl=64 time=0.095 ms
8980 bytes from 10.179.40.37: icmp_seq=2 ttl=64 time=0.153 ms
8980 bytes from 10.179.40.37: icmp_seq=3 ttl=64 time=0.145 ms
8980 bytes from 10.179.40.37: icmp_seq=4 ttl=64 time=0.122 ms

--- 10.179.40.37 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3070ms
rtt min/avg/max/mdev = 0.095/0.128/0.153/0.026 ms
root@cephosd01:~# ping -M do -s 8972 -c 4 10.179.40.38
PING 10.179.40.38 (10.179.40.38) 8972(9000) bytes of data.
8980 bytes from 10.179.40.38: icmp_seq=1 ttl=64 time=0.132 ms
8980 bytes from 10.179.40.38: icmp_seq=2 ttl=64 time=0.156 ms
8980 bytes from 10.179.40.38: icmp_seq=3 ttl=64 time=0.101 ms
8980 bytes from 10.179.40.38: icmp_seq=4 ttl=64 time=0.143 ms

--- 10.179.40.38 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3070ms
rtt min/avg/max/mdev = 0.101/0.133/0.156/0.020 ms
root@cephosd01:~# ping -M do -s 8972 -c 4 10.179.40.39
PING 10.179.40.39 (10.179.40.39) 8972(9000) bytes of data.
8980 bytes from 10.179.40.39: icmp_seq=1 ttl=64 time=0.140 ms
8980 bytes from 10.179.40.39: icmp_seq=2 ttl=64 time=0.094 ms
8980 bytes from 10.179.40.39: icmp_seq=3 ttl=64 time=0.155 ms
8980 bytes from 10.179.40.39: icmp_seq=4 ttl=64 time=0.155 ms

--- 10.179.40.39 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3054ms
rtt min/avg/max/mdev = 0.094/0.136/0.155/0.025 ms
root@cephosd01:~#
root@cephosd02:~# ping -M do -s 8972 -c 4 10.179.40.33
PING 10.179.40.33 (10.179.40.33) 8972(9000) bytes of data.
8980 bytes from 10.179.40.33: icmp_seq=1 ttl=64 time=0.143 ms
8980 bytes from 10.179.40.33: icmp_seq=2 ttl=64 time=0.149 ms
8980 bytes from 10.179.40.33: icmp_seq=3 ttl=64 time=0.137 ms
8980 bytes from 10.179.40.33: icmp_seq=4 ttl=64 time=0.081 ms

--- 10.179.40.33 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3080ms
rtt min/avg/max/mdev = 0.081/0.127/0.149/0.029 ms
root@cephosd02:~# ping -M do -s 8972 -c 4 10.179.40.34
PING 10.179.40.34 (10.179.40.

[ceph-users] Re: osd regularly wrongly marked down

2020-08-31 Thread Francois Legrand
We tried to raise the osd_memory_target from 4 to 8G but the problem 
still occurs (osds wrongly marked down a few times a day).

Does somebody have any clue?
F.



On Fri, Aug 28, 2020 at 10:34 AM Francois Legrand <f...@lpnhe.in2p3.fr> wrote:

Hi all,

We have a ceph cluster in production with 6 osds servers (with
16x8TB
disks), 3 mons/mgrs and 3 mdss. Both public and cluster
networks are in
10GB and works well.

After a major crash in april, we turned the option
bluefs_buffered_io to
false  to workaround the large write bug when
bluefs_buffered_io was
true (we were in version 14.2.8 and the default value at this
time was
true).
Since that time, we regularly have some osds wrongly marked
down by the
cluster after heartbeat timeout (heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f03f1384700' had timed out after 15).

Generally the osd restarts and the cluster is back healthy, but several
times, after many of these kick-offs, the osd reaches the
osd_op_thread_suicide_timeout and goes down permanently.

We increased the osd_op_thread_timeout and
osd_op_thread_suicide_timeout... The problems still occurs
(but less
frequently).

Few days ago, we upgraded to 14.2.11 and revert the timeout to
their
default value, hoping that it will solve the problem (we
thought that it
should be related to this bug
https://tracker.ceph.com/issues/45943),
but it didn't. We still have some osds wrongly marked down.

Can somebody help us to fix this problem ?
Thanks.

Here is an extract of an osd log at failure time:

-
2020-08-28 02:19:05.019 7f03f1384700  0 log_channel(cluster)
log [DBG] :
44.7d scrub starts
2020-08-28 02:19:25.755 7f040e43d700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f03f1384700' had timed out after 15
2020-08-28 02:19:25.755 7f040dc3c700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f03f1384700' had timed out after 15
this last line is repeated more than 1000 times
...
2020-08-28 02:20:17.484 7f040d43b700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f03f1384700' had timed out after 15
2020-08-28 02:20:17.551 7f03f1384700  0
bluestore(/var/lib/ceph/osd/ceph-16) log_latency_fn slow
operation
observed for _collection_list, latency = 67.3532s, lat = 67s cid
=44.7d_head start GHMAX end GHMAX max 25
...
2020-08-28 02:20:22.600 7f040dc3c700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f03f1384700' had timed out after 15
2020-08-28 02:21:20.774 7f03f1384700  0
bluestore(/var/lib/ceph/osd/ceph-16) log_latency_fn slow
operation
observed for _collection_list, latency = 63.223s, lat = 63s cid
=44.7d_head start
#44:beffc78d:::rbd_data.1e48e8ab988992.11bd:0# end
#MAX# max
2147483647
2020-08-28 02:21:20.774 7f03f1384700  1 heartbeat_map
reset_timeout
'OSD::osd_op_tp thread 0x7f03f1384700' had timed out after 15
2020-08-28 02:21:20.805 7f03f1384700  0 log_channel(cluster)
log [DBG] :
44.7d scrub ok
2020-08-28 02:21:21.099 7f03fd997700  0 log_channel(cluster)
log [WRN] :
Monitor daemon marked osd.16 down, but it is still running
2020-08-28 02:21:21.099 7f03fd997700  0 log_channel(cluster)
log [DBG] :
map e609411 wrongly marked me down at e609410
2020-08-28 02:21:21.099 7f03fd997700  1 osd.16 609411
start_waiting_for_healthy
2020-08-28 02:21:21.119 7f03fd997700  1 osd.16 609411 start_boot
2020-08-28 02:21:21.124 7f03f0b83700  1 osd.16 pg_epoch: 609410
pg[36.3d0( v 609409'481293 (449368'478292,609409'481293]
local-lis/les=609403/609404 n=154651 ec=435353/435353 lis/c
609403/609403 les/c/f 609404/609404/0 609410/609410/608752)
[25,72] r=-1
lpr=609410 pi=[609403,609410)/1 luod=0'0 lua=609392'481198
crt=609409'481293 lcod 609409'481292 active mbc={}]
start_peering_interval up [25,72,16] -> [25,72], acting
[25,72,16] ->
[25,72], acting_primary 25 -> 25, up_primary 25 -> 25, role 2
-> -1,
features acting 4611087854031667199 upacting 4611087854031667199
...
2020-08-28 02:21:21.166 7f03f0b83700  1 osd.16 pg_epoch: 609411
pg[36.56( v 609409'480511 (449368'477424,609409'480511]
local-lis/les=609403/609404 n=153854 ec=435353/435353 lis/c
609403/609403 les/c/f 609404/609404/0 609410/609410/609410)
[103,102]
r=-1 lpr=609410 pi=[609403,609410)/1 crt=609409'480511 lcod
609409'480510 

[ceph-users] How to repair rbd image corruption

2020-08-31 Thread Jared
Hi,
I have a rook cluster running with ceph 12.2.7 for almost one year.
Recently some pvc couldn’t be attached with error as below,
  Warning  FailedMount  7m19s   kubelet, 192.168.34.119  
MountVolume.SetUp failed for volume "pvc-8f4ca7ac-42ab-11ea-99d7-005056b84936" 
: mount command failed, status: Failure, reason: failed to mount volume 
/dev/rbd1 [ext4] to 
/var/lib/kubelet/plugins/rook.io/rook-ceph/mounts/pvc-8f4ca7ac-42ab-11ea-99d7-005056b84936,
 error 'fsck' found errors on device /dev/rbd1 but could not correct them: fsck 
from util-linux 2.23.2
/dev/rbd1: recovering journal
/dev/rbd1 contains a file system with errors, check forced.
/dev/rbd1: Inode 393244, end of extent exceeds allowed value
  (logical block 512, physical block 12091904, len 4388)

After mapping the storage to a block device and running "fsck -y" on it, it indicated 
that the filesystem was clean. Then, when retrying to mount the storage, it still 
reported the same error as above.

Response from "ceph status” is as below,
cluster:
id: 54a729b6-7b59-4e5b-bc09-7dc99109cbad
health: HEALTH_WARN
noscrub,nodeep-scrub flag(s) set
Degraded data redundancy: 50711/152133 objects degraded (33.333%), 
100 pgs degraded, 100 pgs undersized
mons rook-ceph-mon41,rook-ceph-mon44 are low on available space

  services:
mon: 3 daemons, quorum rook-ceph-mon44,rook-ceph-mon47,rook-ceph-mon41
mgr: a(active)
osd: 3 osds: 3 up, 3 in
 flags noscrub,nodeep-scrub

  data:
pools:   1 pools, 100 pgs
objects: 50711 objects, 190 GB
usage:   383 GB used,  GB / 1495 GB avail
pgs: 50711/152133 objects degraded (33.333%)
 100 active+undersized+degraded

  io:
client:   78795 B/s wr, 0 op/s rd, 11 op/s wr


From the output, why is scrubbing disabled? Could I trigger it manually?
And how could I check which PGs or objects are corrupted?
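
I assume the relevant commands would be something like the following (the pool
name and PG id are placeholders), but I am not sure whether it is safe to run
them in the current degraded state:

# ceph osd unset noscrub
# ceph osd unset nodeep-scrub
# ceph pg deep-scrub 1.0
# rados list-inconsistent-pg replicapool
# rados list-inconsistent-obj 1.0 --format=json-pretty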

Any advice for fixing the issue?

Thanks for your help.
Jared


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Bluestore does not defer writes

2020-08-31 Thread Dennis Benndorf
Hi,

today I noticed bad performance in our cluster. Running "watch 'ceph osd perf
| sort -hk 2 -r'" I found that all bluestore OSDs are slow on
commit and that their commit timings are equal to their apply timings:

For example
Every 2.0s: ceph osd perf | sort -hk 2 -r

osd  commit_latency(ms)  apply_latency(ms)
440                  82                 82
430                  58                 58
435                  56                 56
449                  53                 53
442                  40                 40
441                  30                 30
439                  27                 27
 99                   0                  1
 98                   0                  0
 97                   0                  2
 96                   0                  6
 95                   0                  2
 94                   0                  6
 93                   0                 13

The ones with zero commit timings are filestore OSDs and the others are
bluestore OSDs.
I did not see this right after installing the new bluestore OSDs (maybe it
occurred later).
Both types of OSDs have NVMes as journal/DB. The servers have equal
CPUs/RAM etc.

The only tuning regarding bluestore is:
  bluestore_block_db_size = 69793218560
  bluestore_prefer_deferred_size_hdd = 524288
This was meant to give filestore-like behaviour (deferring small writes), but it
does not seem to work.
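
To check whether small writes are actually being deferred, I assume looking at
the running config and the bluestore perf counters of a single OSD is enough
(osd.440 is just an example):

# ceph daemon osd.440 config get bluestore_prefer_deferred_size_hdd
# ceph daemon osd.440 perf dump | grep deferred_write

If deferred_write_ops stays at zero while small writes are coming in, nothing
is being deferred.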

Any tips?

Regards Dennis
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Erasure coding RBD pool for OpenStack

2020-08-31 Thread Lazuardi Nasution
Hi Max,

So, it seems that you prefer to use the image volume cache rather than allowing
cross access between Ceph users. That way, all communication is API based, and
the snapshot and CoW happen inside the same pool for a single Ceph client only,
isn't it? I'll consider this approach and compare it with the cross-pool access
approach (see the sketch below). Thank you for your guidance.
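
For my own notes, I assume the erasure-coded data pool plus replicated metadata
pool you describe would be created roughly like this (the pool names, PG counts
and EC profile are placeholders):

# ceph osd erasure-code-profile set ec-4-2 k=4 m=2 crush-failure-domain=host
# ceph osd pool create volumes-private 128 128 erasure ec-4-2
# ceph osd pool set volumes-private allow_ec_overwrites true
# ceph osd pool create volumes-private-meta 64 64 replicated
# rbd pool init volumes-private-meta
# rbd create --size 10G --data-pool volumes-private volumes-private-meta/test-image

The image metadata then lives in the replicated pool while the data objects go
to the EC pool, which matches the rbd_default_data_pool setting in your
ceph.conf example.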

Best regards,

On Mon, Aug 31, 2020 at 1:53 PM Max Krasilnikov 
wrote:

> Hello!
>
>  Mon, Aug 31, 2020 at 01:06:13AM +0700, mrxlazuardin wrote:
>
> > Hi Max,
> >
> > As far as I know, cross access to Ceph pools is needed for the copy-on-write
> > feature which enables fast cloning/snapshotting. For example, the nova and
> > cinder users need read access to the images pool to do copy on write from such an
> > image. So, it seems that the Ceph policy from the previous URL can be modified
> > to be like the following.
> >
> >
> > ceph auth get-or-create client.nova mon 'profile rbd' \
> >   osd 'profile rbd pool=vms, profile rbd-read-only pool=images' \
> >   mgr 'profile rbd pool=vms'
> > ceph auth get-or-create client.cinder mon 'profile rbd' \
> >   osd 'profile rbd pool=volumes, profile rbd-read-only pool=images' \
> >   mgr 'profile rbd pool=volumes'
> >
> > Since it is just read access, I think this will not be a problem. I hope you
> > are right that cross writing is API based. What do you think?
>
> I use the image volume cache, as creating volumes from images is common:
>
> https://docs.openstack.org/cinder/latest/admin/blockstorage-image-volume-cache.html
> Cinder-backup, AFAIR, uses snapshots too.
>
> > On Sun, Aug 30, 2020 at 2:05 AM Max Krasilnikov 
> > wrote:
> >
> > > Good day!
> > >
> > >  Sat, Aug 29, 2020 at 10:19:12PM +0700, mrxlazuardin wrote:
> > >
> > > > Hi Max,
> > > >
> > > > I see, it is very helpful and inspiring, thank you for that. I assume that
> > > > you use the same approach for Nova ephemeral storage (nova user to the vms pool).
> > >
> > > As for now I don't use any non-cinder volumes in the cluster.
> > >
> > > > How do you set the policy for cross-pool access between them? I mean the nova
> > > > user to the images and volumes pools, the cinder user to the images, vms and
> > > > backups pools, and of course the cinder-backup user to the volumes pool. I think
> > > > each user will need that cross-pool access, and it will not be a problem on
> > > > reading since the EC data pool is defined per RBD image on creation. But how
> > > > about writing, do you think that there will be no cross-pool writing?
> > >
> > > Note that all OpenStack services interact with each other using
> > > API calls and message queues, not by accessing data, databases and files
> > > directly.
> > > Any of them may be deployed standalone. The only glue between them is
> > > Keystone.
> > >
> > > > On Sat, Aug 29, 2020 at 2:21 PM Max Krasilnikov <
> pse...@avalon.org.ua>
> > > > wrote:
> > > >
> > > > > Hello!
> > > > >
> > > > >  Fri, Aug 28, 2020 at 09:18:05PM +0700, mrxlazuardin wrote:
> > > > >
> > > > > > Hi Max,
> > > > > >
> > > > > > Would you mind sharing some config examples? What happens if we create
> > > > > > an instance which boots from a newly created or an existing volume?
> > > > >
> > > > > In cinder.conf:
> > > > >
> > > > > [ceph]
> > > > > volume_driver = cinder.volume.drivers.rbd.RBDDriver
> > > > > volume_backend_name = ceph
> > > > > rbd_pool = volumes
> > > > > rbd_user = cinder
> > > > > rbd_secret_uuid = {{ rbd_uid }}
> > > > > 
> > > > >
> > > > > [ceph-private]
> > > > > volume_driver = cinder.volume.drivers.rbd.RBDDriver
> > > > > volume_backend_name = ceph-private
> > > > > rbd_pool = volumes-private-meta
> > > > > rbd_user = cinder-private
> > > > > rbd_secret_uuid = {{ rbd_uid_private }}
> > > > > 
> > > > >
> > > > > /etc/ceph/ceph.conf:
> > > > >
> > > > > [client.cinder-private]
> > > > > rbd_default_data_pool = volumes-private
> > > > >
> > > > > openstack volume type show private
> > > > > ...
> > > > > | properties | volume_backend_name='ceph-private'   |
> > > > > ...
> > > > >
> > > > > Erasure pool with metadata pool created as described here:
> > > > >
> > > > >
> > >
> https://docs.ceph.com/docs/master/rados/operations/erasure-code/#erasure-coding-with-overwrites
> > > > > So, the data pool is volumes-private and the metadata pool is a replicated
> > > > > pool named volumes-private-meta.
> > > > >
> > > > > Instances are running well with this config. All my instances boot from
> > > > > volumes, even those with the 'private' volume type.
> > > > >
> > > > > The metadata pool is quite small: it is 1.8 MiB used while the data pool is
> > > > > 279 GiB used. Your particular sizes may differ, but probably not by much.
> > > > >
> > > > > > Best regards,
> > > > > >
> > > > > >
> > > > > > On Fri, Aug 28, 2020 at 5:27 PM Max Krasilnikov <
> > > pse...@avalon.org.ua>
> > > > > > wrote:
> > > > > >
> > > > > > > Hello!
> > > > > > >
> > > > > > >  Fri, Aug 28, 2020 at 04:05:55PM +0700, mrxlazuardin wrote:
> > > > > > >
> > > > > > > > Hi Konstan

[ceph-users] issues with object-map in benji

2020-08-31 Thread Pavel Vondřička
Hello,

Recently we upgraded Ceph to version 15.2.4 and shortly after that we had
a blackout, which caused a restart of all servers at once (BTW, Ceph did
not come up well by itself). Since then we have been receiving lots of
complaints about problems with "object maps" with every Benji backup
(a tool for differential backups of RBD images). So, I turned on
"exclusive-lock" and "fast-diff" and rebuilt object-maps for all RBD
images. Since then I am receiving the following message for every RBD
image backup:

2020-08-24T06:25:10.451+0200 7f0b857fa700 -1
librbd::object_map::InvalidateRequest: 0x7f0b8000a690 invalidating
object map in-memory
2020-08-24T06:25:10.451+0200 7f0b857fa700 -1
librbd::object_map::InvalidateRequest: 0x7f0b8000a690 invalidating
object map on-disk
2020-08-24T06:25:10.451+0200 7f0b857fa700 -1
librbd::object_map::InvalidateRequest: 0x7f0b8000a690 should_complete: r=0

I am not sure whether this indicates a real problem limiting the
functionality of the backups, or whether it is just a natural part of the
process. Is it a problem with Ceph and the unexpected blackout causing
some corruption, or a problem with Benji and some new feature/issue of
Ceph version 15?

I wonder whether the object-maps were used before, since there were
never any reports about any problems of this kind, and even the features
mentioned above were (probably?) never turned on before (so there were
no object maps at all?). I am quite confused.
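
For reference, this is roughly what I ran per image to enable the features and
rebuild the maps, and how I now check whether a map is flagged as invalid
(pool and image names are placeholders):

# rbd feature enable rbd/vm-disk-1 exclusive-lock object-map fast-diff
# rbd object-map rebuild rbd/vm-disk-1
# rbd object-map check rbd/vm-disk-1
# rbd info rbd/vm-disk-1

In the "rbd info" output, an "object map invalid" flag would mean the map still
needs a rebuild.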

Thanks for any explanation,
Pavel
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Large RocksDB (db_slow_bytes) on OSD which is marked as out

2020-08-31 Thread Wido den Hollander

Hello,

On a Nautilus 14.2.8 cluster I am seeing large RocksDB databases with 
many slow DB bytes in use.


To investigate this further I marked one OSD as out and waited for 
all the backfilling to complete.


Once the backfilling was completed I exported BlueFS and investigated 
the RocksDB using 'ceph-kvstore-tool'. This resulted in 22GB of data.


Listing all the keys in the RocksDB shows me there are 747,000 keys in 
the DB. A small portion are osdmaps, but the largest share are keys 
prefixed with 'b'.
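
For anyone who wants to reproduce the inspection, this is roughly what I ran
against the stopped OSD (the OSD id and output directory are just examples):

# ceph-bluestore-tool bluefs-export --path /var/lib/ceph/osd/ceph-16 --out-dir /mnt/bluefs
# ceph-kvstore-tool rocksdb /mnt/bluefs/db stats
# ceph-kvstore-tool rocksdb /mnt/bluefs/db list > /tmp/keys.txt

The slow bytes of a running OSD can be checked through the admin socket:

# ceph daemon osd.16 perf dump bluefs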


I dumped the stats of the RocksDB and this shows me:

L1: 1/0: 439.32 KB
L2: 1/0: 2.65 MB
L3: 5/0: 14.36 MB
L4: 127/0: 7.22 GB
L5: 217/0: 13.73 GB
Sum: 351/0: 20.98 GB

So there is almost 21GB of data in this RocksDB database. Why? Where is 
this coming from?


Throughout this cluster OSDs are suffering from many slow bytes used and 
I can't figure out why.


Has anybody seen this or has a clue on what is going on?

I have an external copy of this RocksDB database to do investigations on.

Thank you,

Wido
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Recover pgs from failed osds

2020-08-31 Thread Vahideh Alinouri
osd_memory_target has been changed to 3G; starting the failed OSD still causes the
ceph-osd nodes to crash, and the failed OSD is still "down".
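
For reference, I assume the target is applied cluster-wide through the config
database roughly like this (3 GiB expressed in bytes):

# ceph config set osd osd_memory_target 3221225472
# ceph config get osd.34 osd_memory_target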

On Fri, Aug 28, 2020 at 1:13 PM Vahideh Alinouri 
wrote:

> Yes, each osd node has 7 osds with 4 GB memory_target.
>
>
> On Fri, Aug 28, 2020, 12:48 PM Eugen Block  wrote:
>
>> Just to confirm, each OSD node has 7 OSDs with 4 GB memory_target?
>> That leaves only 4 GB RAM for the rest, and in case of heavy load the
>> OSDs use even more. I would suggest to reduce the memory_target to 3
>> GB and see if they start successfully.
>>
>>
>> Zitat von Vahideh Alinouri :
>>
>> > osd_memory_target is 4294967296.
>> > Cluster setup:
>> > 3 mon, 3 mgr, 21 OSDs on 3 ceph-osd nodes, deployed with the lvm scenario. Each
>> > ceph-osd node has 32 GB RAM, a 4-core CPU and 4 TB OSD disks; 9 OSDs have
>> > block.wal on SSDs. The public network is 1G and the cluster network is 10G.
>> > The cluster was installed and upgraded using ceph-ansible.
>> >
>> > On Thu, Aug 27, 2020 at 7:01 PM Eugen Block  wrote:
>> >
>> >> What is the memory_target for your OSDs? Can you share more details
>> >> about your setup? You write about high memory, are the OSD nodes
>> >> affected by OOM killer? You could try to reduce the osd_memory_target
>> >> and see if that helps bring the OSDs back up. Splitting the PGs is a
>> >> very heavy operation.
>> >>
>> >>
>> >> Zitat von Vahideh Alinouri :
>> >>
>> >> > The Ceph cluster was updated from Nautilus to Octopus. On the ceph-osd nodes
>> >> > we have high I/O wait.
>> >> >
>> >> > After increasing one pool's pg_num from 64 to 128 in response to the warning
>> >> > message (too many objects per PG), CPU load and RAM usage on the ceph-osd
>> >> > nodes went up and finally the whole cluster crashed. Three OSDs, one on
>> >> > each host, are stuck in the down state (osd.34, osd.35, osd.40).
>> >> >
>> >> > Starting the down OSD's service causes high RAM usage and CPU load and makes the
>> >> > ceph-osd node crash until the OSD service fails.
>> >> >
>> >> > The active mgr service on each mon host will crash after consuming almost
>> >> > all available RAM on the physical host.
>> >> >
>> >> > I need to recover the PGs and resolve the corruption. How can I recover the
>> >> > unknown and down PGs? Is there any way to start up the failed OSDs?
>> >> >
>> >> >
>> >> > Below steps are done:
>> >> >
>> >> > 1- The OSD nodes' kernel was upgraded to 5.4.2 before the Ceph cluster upgrade.
>> >> > Reverting to the previous kernel 4.2.1 was tested to reduce iowait, but it
>> >> > had no effect.
>> >> >
>> >> > 2- Recovering 11 PGs from the failed OSDs by exporting them with the
>> >> > ceph-objectstore-tool utility and importing them onto other OSDs. The result
>> >> > is as follows: 9 PGs are "down" and 2 PGs are "unknown".
>> >> >
>> >> > 2-1) 9 PGs export and import successfully but their status is "down" because
>> >> > peering is blocked by the 3 failed OSDs ("peering_blocked_by"). I cannot mark the
>> >> > OSDs as lost because I want to prevent the unknown PGs from being lost. These PGs
>> >> > are small (KB and MB in size).
>> >> >
>> >> > "peering_blocked_by": [
>> >> >
>> >> > {
>> >> >
>> >> > "osd": 34,
>> >> >
>> >> > "current_lost_at": 0,
>> >> >
>> >> > "comment": "starting or marking this osd lost may let us proceed"
>> >> >
>> >> > },
>> >> >
>> >> > {
>> >> >
>> >> > "osd": 35,
>> >> >
>> >> > "current_lost_at": 0,
>> >> >
>> >> > "comment": "starting or marking this osd lost may let us proceed"
>> >> >
>> >> > },
>> >> >
>> >> > {
>> >> >
>> >> > "osd": 40,
>> >> >
>> >> > "current_lost_at": 0,
>> >> >
>> >> > "comment": "starting or marking this osd lost may let us proceed"
>> >> >
>> >> > }
>> >> >
>> >> > ]
>> >> >
>> >> >
>> >> > 2-2) 1 PG (2.39) exports and imports successfully, but after starting the OSD
>> >> > service it was imported to, RAM and CPU consumption on the ceph-osd node
>> >> > increase and make the node crash until the OSD service fails. Other OSDs
>> >> > become "down" on that ceph-osd node. The PG status is "unknown". I cannot use
>> >> > "force-create-pg" because that would lose data. PG 2.39 is 19G in size.
>> >> >
>> >> > # ceph pg map 2.39
>> >> >
>> >> > osdmap e40347 pg 2.39 (2.39) -> up [32,37] acting [32,37]
>> >> >
>> >> > # ceph pg 2.39 query
>> >> >
>> >> > Error ENOENT: i don't have pgid 2.39
>> >> >
>> >> >
>> >> > *pg 2.39 info on failed osd:
>> >> >
>> >> > # ceph-objectstore-tool --data-path /var/lib/ceph/osd/*ceph-34* --op
>> info
>> >> > --pgid 2.39
>> >> >
>> >> > {
>> >> >
>> >> > "pgid": "2.39",
>> >> >
>> >> > "last_update": "35344'6456084",
>> >> >
>> >> > "last_complete": "35344'6456084",
>> >> >
>> >> > "log_tail": "35344'6453182",
>> >> >
>> >> > "last_user_version": 10595821,
>> >> >
>> >> > "last_backfill": "MAX",
>> >> >
>> >> > "purged_snaps": [],
>> >> >
>> >> > "history": {
>> >> >
>> >> > "epoch_created": 146,
>> >> >
>> >> > "epoch_pool_created": 79,
>> >> >
>> >> > "last_epoch_started": 25208,
>> >> >
>> >> > "last_interval_started": 25207,
>> >> >
>> >> > "last_epoch_clean": 25208,
>> >> >
>> >> > "last_interval_clean": 25207,
>> >> >
>> >> > "last_ep

[ceph-users] Re: Ceph Filesystem recovery with intact pools

2020-08-31 Thread Yan, Zheng
On Sun, Aug 30, 2020 at 8:05 PM  wrote:
>
> Hi,
> I've had a complete monitor failure, which I have recovered from with the 
> steps here: 
> https://docs.ceph.com/docs/mimic/rados/troubleshooting/troubleshooting-mon/#monitor-store-failures
> The data and metadata pools are there and are completely intact, but ceph is 
> reporting that there are no filesystems, where (before the failure) there was 
> one.
>
> Is there any way of putting the filesystem back together again without having 
> to resort to having to rebuild a new metadata pool with cephfs-data-scan?
> I'm on ceph version 15.2.4 (7447c15c6ff58d7fce91843b705a268a1917325c) octopus 
> (stable)
>


'ceph fs new <fs_name> <metadata pool> <data pool> [--force]
[--allow-dangerous-metadata-overlay]'

The 'ceph fs new' command can create a filesystem using existing pools. Before
running the command, make sure there is no MDS running. After running the
'fs new' command, run 'ceph fs reset <fs_name> --yes-i-really-mean-it'
immediately.
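
A minimal sketch of the whole sequence, assuming the filesystem was called
'cephfs' and the old pools are still named cephfs_metadata and cephfs_data
(stop the MDS daemons on every host first, e.g. with
'systemctl stop ceph-mds.target'):

# ceph fs new cephfs cephfs_metadata cephfs_data --force --allow-dangerous-metadata-overlay
# ceph fs reset cephfs --yes-i-really-mean-it

Then start the MDS daemons again and check 'ceph fs status'.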




> Thanks,
> Harlan
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io