On 25-03-2024 23:07, Kai Stian Olstad wrote:
> On Mon, Mar 25, 2024 at 10:58:24PM +0100, Kai Stian Olstad wrote:
>> On Mon, Mar 25, 2024 at 09:28:01PM +0100, Torkil Svensgaard wrote:
>>> My tally came to 412 out of 539 OSDs showing up in a blocked_by list
>>> and that is about every OSD with data prior to adding ~100 empty OSDs.
>>> How 400 read targets and 100 write targets can only equal ~60
>>> backfills with [...]
>
> First try "ceph osd down 89"

Neither downing nor restarting the OSD cleared the bogus blocked_by. I
guess it makes no sense to look further at blocked_by as the cause when
the data can't be trusted and there is no obvious smoking gun like a few
OSDs blocking everything.
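For reference, the blocked_by tally above can be reproduced roughly as
follows; a sketch that assumes jq is available and that the "ceph pg dump"
JSON carries a blocked_by array per PG:

  # count distinct OSDs that appear in any PG's blocked_by list
  ceph pg dump --format json 2>/dev/null \
    | jq '[.pg_map.pg_stats[].blocked_by[]] | unique | length'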
On 24-03-2024 13:41, Tyler Stachecki wrote:
On Sat, Mar 23, 2024, 4:26 AM Torkil Svensgaard wrote:
> Hi
>
> ... Using mclock with high_recovery_ops profile.
>
> What is the bottleneck here? I would have expected a huge number of
> simultaneous backfills. Backfill reservation logjam?

mClock is very buggy in my experience and frequently [...]
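If mClock is suspected, one way to take it out of the equation is to let
the operator override the recovery limits; a sketch, assuming a
Quincy/Reef-era cluster where these options exist:

  # check which mClock profile is active
  ceph config get osd osd_mclock_profile
  # allow manual recovery/backfill limits despite mClock
  ceph config set osd osd_mclock_override_recovery_settings true
  ceph config set osd osd_max_backfills 3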
Hi Torkil,

Thanks for the update. Even though the improvement is small, it is
still an improvement, consistent with the osd_max_backfills value, and
it proves that there are still unsolved peering issues.

I have looked at both the old and the new state of the PG, but could
not find anything else [...]
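A quick way to compare the old and new state of the same PG is to diff
the two query dumps; a sketch, assuming bash, jq, and that the dumps were
saved as old.json and new.json (placeholder names):

  diff <(jq -S . old.json) <(jq -S . new.json) | less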
Hi Alex
New query output attached after restarting both OSDs. OSD 237 is no
longer mentioned but it unfortunately made no difference for the number
of backfills which went 59->62->62.
Mvh.
Torkil
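One way to get that backfill count over time; a sketch from the brief PG
dump, assuming the usual state names in current Ceph releases:

  # PGs currently backfilling vs. waiting on a backfill reservation
  ceph pg dump pgs_brief 2>/dev/null | grep -c 'backfilling'
  ceph pg dump pgs_brief 2>/dev/null | grep -c 'backfill_wait'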
On 23-03-2024 22:26, Alexander E. Patrakov wrote:

Hi Torkil,

I have looked at the files that you attached. They were helpful: pool
11 is problematic, it complains about degraded objects for no obvious
reason. I think that is the blocker.

I also noted that you mentioned peering problems, and I suspect that
they are not completely resolved. As a [...]
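To look at the degraded PGs of one pool specifically; a sketch, assuming
pool 11 as identified above, with 11.0 as a placeholder pgid:

  # list degraded PGs in pool 11, then query one of them
  ceph pg ls 11 degraded
  ceph pg 11.0 query | less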
Hi Torkil,

I have looked at the CRUSH rules, and the equivalent rules work on my
test cluster. So this cannot be the cause of the blockage.

What happens if you increase the osd_max_backfills setting temporarily?

It may be a good idea to investigate a few of the stalled PGs. Please
run commands [...]
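For the experiment suggested above, something like the following; a
sketch, and with mClock profiles active the override flag mentioned
earlier may also be needed:

  # raise backfill concurrency temporarily, watch "ceph -s", then revert
  ceph config set osd osd_max_backfills 8
  ceph config set osd osd_max_backfills 1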
On 23-03-2024 18:43, Alexander E. Patrakov wrote:
> Hi Torkil,
>
> Unfortunately, your files contain nothing obviously bad or suspicious,
> except for two things: more PGs than usual and bad balance.
>
> What's your "mon max pg per osd" setting?

[root@lazy ~]# ceph config get mon mon_max_pg_per_osd
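The actual per-OSD PG count can be compared against that limit; a sketch,
assuming jq and that the "ceph osd df" JSON exposes a pgs field per node:

  # highest PG count on any one OSD
  ceph osd df --format json | jq '[.nodes[].pgs] | max'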
On 23-03-2024 19:05, Alexander E. Patrakov wrote:

Sorry for replying to myself, but "ceph osd pool ls detail" by itself
is insufficient. For every erasure code profile mentioned in the
output, please also run something like this:

ceph osd erasure-code-profile get prf-for-ec-data

...where "prf-for-ec-data" is the name that appears after the [...]
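Or, to dump every profile in one go; a sketch, assuming a bash shell on a
node with admin credentials:

  for prf in $(ceph osd erasure-code-profile ls); do
      echo "== $prf =="
      ceph osd erasure-code-profile get "$prf"
  done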
Hi Torkil,

I take my previous response back.

You have an erasure-coded pool with nine shards but only three
datacenters. This, in general, cannot work. You need either nine
datacenters or a very custom CRUSH rule. The second option may not be
available if the current EC setup is already [...]
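The "very custom CRUSH rule" would be along these lines; a sketch only,
where the rule name and id are made up and the 3-per-datacenter split
assumes a k+m=9 profile:

  # decompile the CRUSH map, add a rule placing 3 shards in each of
  # 3 datacenters, recompile and inject it
  ceph osd getcrushmap -o crushmap.bin
  crushtool -d crushmap.bin -o crushmap.txt
  # append to crushmap.txt:
  #   rule ec93dc {
  #       id 99
  #       type erasure
  #       step take default
  #       step choose indep 3 type datacenter
  #       step chooseleaf indep 3 type host
  #       step emit
  #   }
  crushtool -c crushmap.txt -o crushmap.new
  ceph osd setcrushmap -i crushmap.new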
On Sat, Mar 23, 2024 at 12:09:29PM +0100, Torkil Svensgaard wrote:
> The other output is too big for pastebin and I'm not familiar with
> paste services, any suggestion for a preferred way to share such
> output?

You can attach files to the mail here on the list.

--
Kai Stian Olstad
On 23-03-2024 10:44, Alexander E. Patrakov wrote:
> Hello Torkil,

Hi Alexander

> It would help if you provided the whole "ceph osd df tree" and "ceph
> pg ls" outputs.

Of course, here's ceph osd df tree to start with:

https://pastebin.com/X50b2W0J

The other output is too big for pastebin and [...]
Hello Torkil,

It would help if you provided the whole "ceph osd df tree" and "ceph
pg ls" outputs.

On Sat, Mar 23, 2024 at 4:26 PM Torkil Svensgaard wrote:
>
> Hi
>
> We have this after adding some hosts and changing crush failure domain
> to datacenter:
>
> pgs: 1338512379/3162732055 [...]
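Assuming that truncated line is the usual objects-misplaced counter from
"ceph -s", the ratio 1338512379/3162732055 works out to roughly 42% of
object copies misplaced, which is the scale of rebalancing one would
expect after changing the CRUSH failure domain.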