Responding partially to my own query, I have decided on the following
structure, in order to have encrypted OSDs/BlueStore journals and not wait
for proper ceph-volume support.
1) SSD(s), fully encrypted, acting as PV(s) for VG(s) to store LVs for the
Block DBs. My current setup is 1 SSD for 4 Bl
This is a very common issue. Deleting mdsX_openfiles.Y has become part of
my standard maintenance repertoire. As soon as you have a few more
clients and one of them starts opening and closing files in rapid
succession (or does other metadata-heavy things), it becomes very likely
that the MDS cras
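For reference, the cleanup itself is just removing the object from the
metadata pool while the MDS is stopped, roughly (pool name is a placeholder
for your metadata pool, rank 0 is just an example):
  # list the open-files objects in the metadata pool
  rados -p cephfs_metadata_pool ls | grep openfiles
  # remove the one for the affected rank
  rados -p cephfs_metadata_pool rm mds0_openfiles.0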
Hi,
I'm facing this issue too, and I see the same RocksDB log that Mark attached
in my cluster, which means there are burst reads on my block.db. I've sent
some information about my issue in this thread [1]. I hope you can help me
figure out what's going on in my cluster.
Thanks.
[1]:
https://lists.ceph.io/hyperkitt
Hello,
I have run some tests creating OSDs and I have found that there
are big issues with the ceph-volume functionality.
1) If using dmcrypt and separate data and db block devices, ceph-volume
creates crypto devices/PVs/VGs/LVs for both devices. This might seem
normal, until one consid
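The kind of invocation I mean is roughly this (device paths are
placeholders):
  # encrypted OSD with data and block.db on separate devices
  ceph-volume lvm prepare --bluestore --dmcrypt \
      --data /dev/sdc --block.db /dev/sdb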
I found that bluefs_max_prefetch is set to 1048576, which equals 1 MiB! So
why is it reading about 1 GiB/s?
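For what it's worth, this is how I'm reading the setting and the counter on
one OSD (osd.0 is just an example, and I'm assuming the
ceph_bluefs_read_prefetch_bytes metric maps to the bluefs perf counter of
the same name):
  # current value of the prefetch limit
  ceph daemon osd.0 config get bluefs_max_prefetch
  # bluefs perf counters, including read_prefetch_bytes
  ceph daemon osd.0 perf dump | grep -A 2 read_prefetch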
On Thu, Dec 3, 2020 at 8:03 PM Seena Fallah wrote:
> My first question is about this metric: ceph_bluefs_read_prefetch_bytes.
> What operation is related to this metric?
>
This is a completely new cluster, all SSD and NVMe :/
-Original Message-
From: Eugen Block
Sent: Friday, December 4, 2020 4:32 PM
To: ceph-users@ceph.io
Subject: [Suspicious newsletter] [ceph-users] Re: PG_DAMAGED
Hi,
Not sure if it's related to my 15.2.7 update, but today I got this
issue many times:
2020-12-04T15:14:23.910799+0700 osd.40 (osd.40) 11 : cluster [DBG] 11.2
deep-scrub starts
2020-12-04T15:14:23.947255+0700 osd.40 (osd.40) 12 : cluster [ERR] 11.2 soid
11:434f049b:::.dir.75333f99-93d0-4238-91a
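Is the right way to inspect and then repair such an inconsistent PG
something like this (PG id taken from the log above)?
  # show which objects/shards are inconsistent and why
  rados list-inconsistent-obj 11.2 --format=json-pretty
  # ask the primary OSD to repair the PG
  ceph pg repair 11.2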
Hi,
I recently attempted to run the 'rgw-orphan-list' tool against our cluster
(octopus 15.2.7) to identify any orphans and noticed that the 'radosgw-admin
bucket radoslist' command appeared to be stuck in a loop.
I saw in the 'radosgw-admin-XX.intermediate' output file the same sequence
o
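For anyone reproducing, the invocation is essentially this (our pool name
omitted, the default data pool shown as an example; the radoslist step is
the one that appeared stuck):
  # scan the rgw data pool for orphaned rados objects
  rgw-orphan-list default.rgw.buckets.data
  # the underlying listing command, optionally limited to one bucket
  radosgw-admin bucket radoslist --bucket=<bucket-name>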
Hi all,
We would need the same feature in our HPC cluster. I guess this is not an
infrequent problem; I was wondering if you guys found an alternative solution.
Best
--
Filippo Stenico
Services and Support for Science IT (S3IT)
Office Y11 F 52
University of Zürich
Winterthurerstrasse 190, CH-805
Excellent!
For the record, this PR is the plan to fix this:
https://github.com/ceph/ceph/pull/36089
(nautilus, octopus PRs here: https://github.com/ceph/ceph/pull/37382
https://github.com/ceph/ceph/pull/37383)
Cheers, Dan
On Fri, Dec 4, 2020 at 11:35 AM Anton Aleksandrov wrote:
>
> Thank you very much!
Thank you very much! This solution helped:
Stop all MDS, then:
# rados -p cephfs_metadata_pool rm mds0_openfiles.0
then start one MDS.
We are back online. Amazing!!! :)
On 04.12.2020 12:20, Dan van der Ster wrote:
Please also make sure the mds_beacon_grace is high on the mons too.
It doesn't matter which mds you select to be the running one.
In my experience, inconsistencies caused by IO errors always have a
SCSI Medium Error showing up in the kernel logs (dmesg, journalctl
-k, /var/log/messages, ...).
(Except in the case of one very bad non-enterprise SMR drive I run at
home, not at work).
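A quick way to scan for them:
  # look for medium errors in the kernel ring buffer and the journal
  dmesg -T | grep -i 'medium error'
  journalctl -k | grep -i 'medium error'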
-- dan
On Fri, Dec 4, 2020 at 11:03 AM Hans van
Please also make sure the mds_beacon_grace is high on the mons too.
It doesn't matter which mds you select to be the running one.
Is the process getting killed or restarted?
If you're confident that the mds is getting OOM killed during rejoin
step, then you might find this useful:
http://lists.
Yes, MDS eats all memory+swap, stays like this for a moment and then
frees memory.
mds_beacon_grace was already set to 1800
Also, on another MDS this message is seen: Map has assigned me to become a
standby.
Does it matter which MDS we stop and which we leave running?
Anton
On 04.12.2020 11:
Interesting, your comment implies that it is a replication issue, which
does not stem from a faulty disk. But couldn't the disk have had a bit
flip? Or would you argue that would've shown up as a disk read error
somewhere (because of ECC on the disk)?
On 12/4/20 10:51 AM, Dan van der Ster wrote:
No
How many active MDS's did you have? (max_mds == 1, right?)
Stop the other two MDS's so you can focus on getting exactly one running.
Tail the log file and see what it is reporting.
Increase mds_beacon_grace to 600 so that the mon doesn't fail this MDS
while it is rejoining.
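On a Mimic cluster like yours that could be, for example:
  # raise the grace period so the mon keeps the rejoining MDS
  ceph config set global mds_beacon_grace 600
  # or inject it into the running mons directly
  ceph tell 'mon.*' injectargs '--mds_beacon_grace=600'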
Is that single MDS run
Note that in this case the inconsistencies are not coming from object
reads, but from comparing the omap digests of an rgw index shard.
This seems to be a result of a replication issue sometime in the past
on this cluster.
On Fri, Dec 4, 2020 at 10:32 AM Eugen Block wrote:
>
> Hi,
>
> this is not
Hello community,
we are on ceph 13.2.8 - today something happened with one MDS and ceph
status reports that the filesystem is degraded. It won't mount either. I have
taken down the server with the MDS that was not working. There are 2 more MDS
servers, but they stay in "rejoin" state. Also only 1 is show
There's no guarantee that new disks can't be faulty. We had this last
year when we expanded our cluster with brand new servers and disks,
one of the new OSDs failed almost immediately.
You can wait and see how often this appears and if it's always the
same disk. Just keep it in mind.
Quoting
Hi,
this is not necessarily, but most likely, a hint at a (slowly) failing
disk. Check all OSDs for this PG for disk errors in dmesg and smartctl.
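Something along these lines (the PG id is from your log; the device path is
a placeholder):
  # find the OSDs serving the inconsistent PG
  ceph pg map 11.2
  # then check the backing disk of each of those OSDs
  smartctl -a /dev/sdX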
Regards,
Eugen
Quoting "Szabo, Istvan (Agoda)":
Hi,
Not sure if it's related to my 15.2.7 update, but today I got this
issue many times:
20