[ceph-users] Re: restoring ceph cluster from osds

2023-03-09 Thread Eugen Block

Hi,

I still think the best approach would be to rebuild the MON store from
the OSDs as described here [2]. Just creating new MONs with the same
IDs might not be sufficient because they would be missing the OSD
keyrings etc., so you'd still have to do some work to get the cluster
up. It might be easier with the OSD approach, but other users might
have a better idea; it's been a while since I last went through that
troubleshooting section.
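
For reference, the procedure behind [2] boils down to something like
the following sketch (paths, the OSD list and the keyring location are
placeholders here; the docs show the full loop over all OSD hosts):

  ms=/tmp/monstore
  mkdir -p $ms
  # collect the cluster map fragments from every OSD's store
  for osd in /var/lib/ceph/osd/ceph-*; do
    ceph-objectstore-tool --data-path $osd --no-mon-config \
      --op update-mon-db --mon-store-path $ms
  done
  # rebuild the mon store with a keyring that contains at least
  # the mon. and client.admin keys
  ceph-monstore-tool $ms rebuild -- --keyring /path/to/admin.keyring
  # then replace store.db of each (stopped) mon with the rebuilt one

The doc also lists the known limitations of a rebuilt store, so read
it carefully before going down that road.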


Regards,
Eugen

[2]  
https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-mon/#recovery-using-osds


Quoting Ben :


Hi,

Yes, the old mon daemons are removed. In the first post the mon daemons were
started with mon data created from scratch. After some code search, I suspect
I could restore the cluster from all OSDs without the original mon data, but
I may be wrong on this. For now, I think it would require less configuration
if I could start a mon cluster with exactly the same IDs as the original one
(something like k, m, o). Any thoughts on this?

Ben

Eugen Block wrote on Thu, Mar 9, 2023 at 20:56:


Hi,

I'm not familiar with rook so the steps required may vary. If you try
to reuse the old mon stores you'll have the mentioned mismatch between
the new daemons and the old monmap (which still contains the old mon
daemons). It's not entirely clear what went wrong in the first place
and what you already tried exactly, so it's hard to tell if editing
the monmap is the way to go here. I guess the old mon daemons are
removed, is that assumption correct? In that case it could be worth a
try to edit the current monmap to contain only the new mons and inject
it (see [1] for details). If the mons start and form a quorum you'd
have a cluster, but I can't tell if the OSDs will register
successfully. I think the previous approach when the original mons
were up but the OSDs didn't start would have been more promising.
Anyway, maybe editing the monmap will fix this for you.

[1]

https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-mon/#recovering-a-monitor-s-broken-monmap

Quoting Ben :

> Hi Eugen,
>
> Thank you for help on this.
>
> Forget the log. A little progress, the monitors store were restored. I
> created a new ceph cluster to use the restored monitors store. But the
> monitor log complains:
>
> debug 2023-03-09T11:00:31.233+ 7fe95234f880  0 starting mon.a rank -1 at public addrs [v2:169.169.163.25:3300/0,v1:169.169.163.25:6789/0] at bind addrs [v2:197.166.206.27:3300/0,v1:197.166.206.27:6789/0] mon_data /var/lib/ceph/mon/ceph-a fsid 3f271841-6188-47c1-b3fd-90fd4f978c76
>
> debug 2023-03-09T11:00:31.234+ 7fe95234f880  1 mon.a@-1(???) e27 preinit fsid 3f271841-6188-47c1-b3fd-90fd4f978c76
>
> debug 2023-03-09T11:00:31.234+ 7fe95234f880 -1 mon.a@-1(???) e27 not in monmap and have been in a quorum before; must have been removed
>
> debug 2023-03-09T11:00:31.234+ 7fe95234f880 -1 mon.a@-1(???) e27 commit suicide!
>
> debug 2023-03-09T11:00:31.234+ 7fe95234f880 -1 failed to initialize
>
>
> The fact is original monitor clusters ids are k,m,o, however the new ones
> are a,b,d. It was deployed by rook. Any ideas to make this work?
>
>
> Ben
>
> Eugen Block wrote on Thu, Mar 9, 2023 at 16:00:
>
>> Hi,
>>
>> there's no attachment to your email, please use something like
>> pastebin to provide OSD logs.
>>
>> Thanks
>> Eugen
>>
>> Quoting Ben :
>>
>> > Hi,
>> >
>> > I ended up with having whole set of osds to get back original ceph
>> cluster.
>> > I figured out to make the cluster running. However, it's status is
>> > something as below:
>> >
>> > bash-4.4$ ceph -s
>> >
>> >   cluster:
>> >
>> > id: 3f271841-6188-47c1-b3fd-90fd4f978c76
>> >
>> > health: HEALTH_WARN
>> >
>> > 7 daemons have recently crashed
>> >
>> > 4 slow ops, oldest one blocked for 35077 sec, daemons
>> > [mon.a,mon.b] have slow ops.
>> >
>> >
>> >
>> >   services:
>> >
>> > mon: 3 daemons, quorum a,b,d (age 9h)
>> >
>> > mgr: b(active, since 14h), standbys: a
>> >
>> > osd: 4 osds: 0 up, 4 in (since 9h)
>> >
>> >
>> >
>> >   data:
>> >
>> > pools:   0 pools, 0 pgs
>> >
>> > objects: 0 objects, 0 B
>> >
>> > usage:   0 B used, 0 B / 0 B avail
>> >
>> > pgs:
>> >
>> >
>> > All osds are down.
>> >
>> >
>> > I checked the osds logs and attached with this.
>> >
>> >
>> > Please help and I wonder if it's possible to get the cluster back. I
have
>> > some backup for monitor's data. Till now I haven't restore that in the
>> > course.
>> >
>> >
>> > Thanks,
>> >
>> > Ben
>> > ___
>> > ceph-users mailing list -- ceph-users@ceph.io
>> > To unsubscribe send an email to ceph-users-le...@ceph.io
>>
>>
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>







___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

[ceph-users] Re: restoring ceph cluster from osds

2023-03-09 Thread Ben
Hi,

Yes, the old mon daemons are removed. In the first post the mon daemons were
started with mon data created from scratch. After some code search, I suspect
I could restore the cluster from all OSDs without the original mon data, but
I may be wrong on this. For now, I think it would require less configuration
if I could start a mon cluster with exactly the same IDs as the original one
(something like k, m, o). Any thoughts on this?

Ben

Eugen Block wrote on Thu, Mar 9, 2023 at 20:56:

> Hi,
>
> I'm not familiar with rook so the steps required may vary. If you try
> to reuse the old mon stores you'll have the mentioned mismatch between
> the new daemons and the old monmap (which still contains the old mon
> daemons). It's not entirely clear what went wrong in the first place
> and what you already tried exactly, so it's hard to tell if editing
> the monmap is the way to go here. I guess the old mon daemons are
> removed, is that assumption correct? In that case it could be worth a
> try to edit the current monmap to contain only the new mons and inject
> it (see [1] for details). If the mons start and form a quorum you'd
> have a cluster, but I can't tell if the OSDs will register
> successfully. I think the previous approach when the original mons
> were up but the OSDs didn't start would have been more promising.
> Anyway, maybe editing the monmap will fix this for you.
>
> [1]
>
> https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-mon/#recovering-a-monitor-s-broken-monmap
>
> Quoting Ben :
>
> > Hi Eugen,
> >
> > Thank you for help on this.
> >
> > Forget the log. A little progress, the monitors store were restored. I
> > created a new ceph cluster to use the restored monitors store. But the
> > monitor log complains:
> >
> > debug 2023-03-09T11:00:31.233+ 7fe95234f880  0 starting mon.a rank -1 at public addrs [v2:169.169.163.25:3300/0,v1:169.169.163.25:6789/0] at bind addrs [v2:197.166.206.27:3300/0,v1:197.166.206.27:6789/0] mon_data /var/lib/ceph/mon/ceph-a fsid 3f271841-6188-47c1-b3fd-90fd4f978c76
> >
> > debug 2023-03-09T11:00:31.234+ 7fe95234f880  1 mon.a@-1(???) e27 preinit fsid 3f271841-6188-47c1-b3fd-90fd4f978c76
> >
> > debug 2023-03-09T11:00:31.234+ 7fe95234f880 -1 mon.a@-1(???) e27 not in monmap and have been in a quorum before; must have been removed
> >
> > debug 2023-03-09T11:00:31.234+ 7fe95234f880 -1 mon.a@-1(???) e27 commit suicide!
> >
> > debug 2023-03-09T11:00:31.234+ 7fe95234f880 -1 failed to initialize
> >
> >
> > The fact is original monitor clusters ids are k,m,o, however the new ones
> > are a,b,d. It was deployed by rook. Any ideas to make this work?
> >
> >
> > Ben
> >
> > Eugen Block wrote on Thu, Mar 9, 2023 at 16:00:
> >
> >> Hi,
> >>
> >> there's no attachment to your email, please use something like
> >> pastebin to provide OSD logs.
> >>
> >> Thanks
> >> Eugen
> >>
> >> Quoting Ben :
> >>
> >> > Hi,
> >> >
> >> > I ended up with having whole set of osds to get back original ceph
> >> cluster.
> >> > I figured out to make the cluster running. However, it's status is
> >> > something as below:
> >> >
> >> > bash-4.4$ ceph -s
> >> >
> >> >   cluster:
> >> >
> >> > id: 3f271841-6188-47c1-b3fd-90fd4f978c76
> >> >
> >> > health: HEALTH_WARN
> >> >
> >> > 7 daemons have recently crashed
> >> >
> >> > 4 slow ops, oldest one blocked for 35077 sec, daemons
> >> > [mon.a,mon.b] have slow ops.
> >> >
> >> >
> >> >
> >> >   services:
> >> >
> >> > mon: 3 daemons, quorum a,b,d (age 9h)
> >> >
> >> > mgr: b(active, since 14h), standbys: a
> >> >
> >> > osd: 4 osds: 0 up, 4 in (since 9h)
> >> >
> >> >
> >> >
> >> >   data:
> >> >
> >> > pools:   0 pools, 0 pgs
> >> >
> >> > objects: 0 objects, 0 B
> >> >
> >> > usage:   0 B used, 0 B / 0 B avail
> >> >
> >> > pgs:
> >> >
> >> >
> >> > All osds are down.
> >> >
> >> >
> >> > I checked the osds logs and attached with this.
> >> >
> >> >
> >> > Please help and I wonder if it's possible to get the cluster back. I
> have
> >> > some backup for monitor's data. Till now I haven't restore that in the
> >> > course.
> >> >
> >> >
> >> > Thanks,
> >> >
> >> > Ben
> >> > ___
> >> > ceph-users mailing list -- ceph-users@ceph.io
> >> > To unsubscribe send an email to ceph-users-le...@ceph.io
> >>
> >>
> >> ___
> >> ceph-users mailing list -- ceph-users@ceph.io
> >> To unsubscribe send an email to ceph-users-le...@ceph.io
> >>
>
>
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Trying to throttle global backfill

2023-03-09 Thread Rice, Christian
I received a few suggestions, and resolved my issue.

Anthony D'Atri suggested mclock (newer than my nautilus version), adding 
"--osd_recovery_max_single_start 1” (didn’t seem to take), 
“osd_op_queue_cut_off=high” (which I didn’t get to checking), and pgremapper 
(from github).
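
For the record, the non-mclock knobs can be applied roughly like this
(illustrative only; as far as I know osd_op_queue_cut_off is read at
OSD start, so it needs OSD restarts to actually take effect):

  ceph config set osd osd_op_queue_cut_off high
  ceph tell osd.\* injectargs '--osd_recovery_max_single_start 1'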

Pgremapper did the trick to cancel the backfill which had been initiated by an 
unfortunate OSD name-changing sequence.  Big winner, achieved EXACTLY what I 
needed, which was to undo an unfortunate recalculation of placement groups.

Before: 310842802/17308319325 objects misplaced (1.796%)
Ran: pgremapper cancel-backfill --yes
After: 421709/17308356309 objects misplaced (0.002%)

The “before” scenario was causing over 10GiB/s of backfill traffic.  The 
“after” scenario was a very cool 300-400MiB/s, entirely within the realm of 
sanity.  The cluster is temporarily split between two datacenters, being 
physically lifted and shifted over a period of a month.

Alex Gorbachev also suggested setting osd-recovery-sleep.  That was probably 
the solution I was looking for to throttle backfill operations at the 
beginning, and I’ll be keeping that in my toolbox, as well.
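
For anyone finding this later: the sleep can be injected at runtime and
is split per device class; something like the following (the value is
just an example, not a recommendation):

  ceph tell osd.\* injectargs '--osd_recovery_sleep_hdd 0.25'
  # or, persistently, via the config database:
  ceph config set osd osd_recovery_sleep_hdd 0.25

Higher values throttle backfill/recovery harder; 0 disables the sleep.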

As always, I’m HUGELY appreciative of the community response.  I learned a lot 
in the process, had an outage-inducing scenario rectified very quickly, and got 
back to work.  Thanks so much!  Happy to answer any followup questions and 
return the favor when I can.

From: Rice, Christian 
Date: Wednesday, March 8, 2023 at 3:57 PM
To: ceph-users 
Subject: [EXTERNAL] [ceph-users] Trying to throttle global backfill
I have a large number of misplaced objects, and I have all osd settings to “1” 
already:

sudo ceph tell osd.\* injectargs '--osd_max_backfills=1 
--osd_recovery_max_active=1 --osd_recovery_op_priority=1'


How can I slow it down even more? The cluster is too large; it's impacting
other network traffic.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: restoring ceph cluster from osds

2023-03-09 Thread Eugen Block

Hi,

I'm not familiar with rook so the steps required may vary. If you try  
to reuse the old mon stores you'll have the mentioned mismatch between  
the new daemons and the old monmap (which still contains the old mon  
daemons). It's not entirely clear what went wrong in the first place  
and what you already tried exactly, so it's hard to tell if editing  
the monmap is the way to go here. I guess the old mon daemons are  
removed, is that assumption correct? In that case it could be worth a  
try to edit the current monmap to contain only the new mons and inject  
it (see [1] for details). If the mons start and form a quorum you'd  
have a cluster, but I can't tell if the OSDs will register  
successfully. I think the previous approach when the original mons  
were up but the OSDs didn't start would have been more promising.  
Anyway, maybe editing the monmap will fix this for you.
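
The monmap surgery in [1] looks roughly like this (illustrative only;
mon id, paths and address are placeholders, and all mons have to be
stopped while you do it):

  # extract the current monmap from one of the stopped mons
  ceph-mon -i a --extract-monmap /tmp/monmap
  monmaptool --print /tmp/monmap
  # drop the stale entries, add the mons that should be in there
  monmaptool --rm k --rm m --rm o /tmp/monmap
  monmaptool --add a <mon-ip>:6789 /tmp/monmap
  # inject the edited map into every mon before starting them again
  ceph-mon -i a --inject-monmap /tmp/monmap

With rook you'd presumably have to run this from inside the mon pod or
via the rook toolbox, but the idea is the same.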


[1]  
https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-mon/#recovering-a-monitor-s-broken-monmap


Quoting Ben :


Hi Eugen,

Thank you for help on this.

Forget the log. A little progress: the monitor store was restored. I
created a new ceph cluster to use the restored monitor store. But the
monitor log complains:

debug 2023-03-09T11:00:31.233+ 7fe95234f880  0 starting mon.a rank -1
at public addrs [v2:169.169.163.25:3300/0,v1:169.169.163.25:6789/0] at bind
addrs [v2:197.166.206.27:3300/0,v1:197.166.206.27:6789/0] mon_data
/var/lib/ceph/mon/ceph-a fsid 3f271841-6188-47c1-b3fd-90fd4f978c76

debug 2023-03-09T11:00:31.234+ 7fe95234f880  1 mon.a@-1(???) e27
preinit fsid 3f271841-6188-47c1-b3fd-90fd4f978c76

debug 2023-03-09T11:00:31.234+ 7fe95234f880 -1 mon.a@-1(???) e27 not in
monmap and have been in a quorum before; must have been removed

debug 2023-03-09T11:00:31.234+ 7fe95234f880 -1 mon.a@-1(???) e27 commit
suicide!

debug 2023-03-09T11:00:31.234+ 7fe95234f880 -1 failed to initialize


The fact is the original monitors' IDs are k, m, o, whereas the new ones
are a, b, d. It was deployed by rook. Any ideas to make this work?


Ben

Eugen Block wrote on Thu, Mar 9, 2023 at 16:00:


Hi,

there's no attachment to your email, please use something like
pastebin to provide OSD logs.

Thanks
Eugen

Quoting Ben :

> Hi,
>
> I ended up with having whole set of osds to get back original ceph
cluster.
> I figured out to make the cluster running. However, it's status is
> something as below:
>
> bash-4.4$ ceph -s
>
>   cluster:
>
> id: 3f271841-6188-47c1-b3fd-90fd4f978c76
>
> health: HEALTH_WARN
>
> 7 daemons have recently crashed
>
> 4 slow ops, oldest one blocked for 35077 sec, daemons
> [mon.a,mon.b] have slow ops.
>
>
>
>   services:
>
> mon: 3 daemons, quorum a,b,d (age 9h)
>
> mgr: b(active, since 14h), standbys: a
>
> osd: 4 osds: 0 up, 4 in (since 9h)
>
>
>
>   data:
>
> pools:   0 pools, 0 pgs
>
> objects: 0 objects, 0 B
>
> usage:   0 B used, 0 B / 0 B avail
>
> pgs:
>
>
> All osds are down.
>
>
> I checked the osds logs and attached with this.
>
>
> Please help and I wonder if it's possible to get the cluster back. I have
> some backup for monitor's data. Till now I haven't restore that in the
> course.
>
>
> Thanks,
>
> Ben
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: restoring ceph cluster from osds

2023-03-09 Thread Ben
Hi Eugen,

Thank you for help on this.

Forget the log. A little progress: the monitor store was restored. I
created a new ceph cluster to use the restored monitor store. But the
monitor log complains:

debug 2023-03-09T11:00:31.233+ 7fe95234f880  0 starting mon.a rank -1
at public addrs [v2:169.169.163.25:3300/0,v1:169.169.163.25:6789/0] at bind
addrs [v2:197.166.206.27:3300/0,v1:197.166.206.27:6789/0] mon_data
/var/lib/ceph/mon/ceph-a fsid 3f271841-6188-47c1-b3fd-90fd4f978c76

debug 2023-03-09T11:00:31.234+ 7fe95234f880  1 mon.a@-1(???) e27
preinit fsid 3f271841-6188-47c1-b3fd-90fd4f978c76

debug 2023-03-09T11:00:31.234+ 7fe95234f880 -1 mon.a@-1(???) e27 not in
monmap and have been in a quorum before; must have been removed

debug 2023-03-09T11:00:31.234+ 7fe95234f880 -1 mon.a@-1(???) e27 commit
suicide!

debug 2023-03-09T11:00:31.234+ 7fe95234f880 -1 failed to initialize


The fact is the original monitors' IDs are k, m, o, whereas the new ones
are a, b, d. It was deployed by rook. Any ideas to make this work?


Ben

Eugen Block wrote on Thu, Mar 9, 2023 at 16:00:

> Hi,
>
> there's no attachment to your email, please use something like
> pastebin to provide OSD logs.
>
> Thanks
> Eugen
>
> Quoting Ben :
>
> > Hi,
> >
> > I ended up with having whole set of osds to get back original ceph
> cluster.
> > I figured out to make the cluster running. However, it's status is
> > something as below:
> >
> > bash-4.4$ ceph -s
> >
> >   cluster:
> >
> > id: 3f271841-6188-47c1-b3fd-90fd4f978c76
> >
> > health: HEALTH_WARN
> >
> > 7 daemons have recently crashed
> >
> > 4 slow ops, oldest one blocked for 35077 sec, daemons
> > [mon.a,mon.b] have slow ops.
> >
> >
> >
> >   services:
> >
> > mon: 3 daemons, quorum a,b,d (age 9h)
> >
> > mgr: b(active, since 14h), standbys: a
> >
> > osd: 4 osds: 0 up, 4 in (since 9h)
> >
> >
> >
> >   data:
> >
> > pools:   0 pools, 0 pgs
> >
> > objects: 0 objects, 0 B
> >
> > usage:   0 B used, 0 B / 0 B avail
> >
> > pgs:
> >
> >
> > All osds are down.
> >
> >
> > I checked the osds logs and attached with this.
> >
> >
> > Please help and I wonder if it's possible to get the cluster back. I have
> > some backup for monitor's data. Till now I haven't restore that in the
> > course.
> >
> >
> > Thanks,
> >
> > Ben
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: rbd on EC pool with fast and extremely slow writes/reads

2023-03-09 Thread Andrej Filipcic


Thanks for the hint. I ran some short tests, all fine. I am not sure
it's a drive issue.


Some more digging: the file with bad performance has these segments:

[root@afsvos01 vicepa]# hdparm --fibmap $PWD/0

/vicepa/0:
filesystem blocksize 4096, begins at LBA 2048; assuming 512 byte sectors.
byte_offset  begin_LBA    end_LBA    sectors
  0 743232    2815039    2071808
 1060765696    3733064    5838279    2105216
 2138636288   70841232   87586575   16745344
10712252416   87586576   87635727  49152

Reading by segments:

# dd if=0 of=/tmp/0 bs=4M status=progress count=252
1052770304 bytes (1.1 GB, 1004 MiB) copied, 45 s, 23.3 MB/s
252+0 records in
252+0 records out

# dd if=0 of=/tmp/0 bs=4M status=progress skip=252 count=256
935329792 bytes (935 MB, 892 MiB) copied, 4 s, 234 MB/s
256+0 records in
256+0 records out

# dd if=0 of=/tmp/0 bs=4M status=progress skip=510
7885291520 bytes (7.9 GB, 7.3 GiB) copied, 12 s, 657 MB/s
2050+0 records in
2050+0 records out

So the first ~1 GB is very slow, the second segment is faster, and the rest
is quite fast; it's reproducible (caches dropped before each dd).


Now, the rbd is 3TB with 256 pgs (EC 8+3); I checked with rados that
objects are randomly distributed across pgs, e.g.


# rados --pgid 23.82 ls|grep rbd_data.20.2723bd3292f6f8
rbd_data.20.2723bd3292f6f8.0008
rbd_data.20.2723bd3292f6f8.000d
rbd_data.20.2723bd3292f6f8.01cb
rbd_data.20.2723bd3292f6f8.000601b2
rbd_data.20.2723bd3292f6f8.0009001b
rbd_data.20.2723bd3292f6f8.005b
rbd_data.20.2723bd3292f6f8.000900e8

where object ...05b, for example, corresponds to the first block of the file
I am testing. Well, if my understanding of rbd is correct, I assume
that LBA regions are mapped to consecutive rbd objects.
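
As a rough sanity check of that assumption (4 MiB objects, and assuming
the filesystem sits directly on the rbd device without an extra
partition offset), the object index for a device offset should be
floor(LBA * 512 / object_size), e.g. for the first extent above:

# printf '%016x\n' $(( 743232 * 512 / (4 * 1024 * 1024) ))
000000000000005a

which lands right around the ...05b object from the pg listing, so the
slow ~1 GB region should span roughly objects 0x5a through 0x157.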


So now I am completely confused, since the slow chunk of the file is
still mapped to ~256 objects on different pgs.


Maybe I misunderstood the whole thing.

Any other hints? we will still do hdd tests on all the drives
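
One more thing that might help narrow it down: mapping a few of the
slow objects to their acting OSD sets, to see whether they all share a
common OSD (pool and object names below are placeholders, reuse the
ones from the rados ls output above):

# ceph osd map <ec-pool-name> rbd_data.20.2723bd3292f6f8.<suffix>

If the slow objects consistently land on the same one or two OSDs,
that would point back at a drive or host problem rather than at the
rbd striping.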

Cheers,
Andrej

On 3/6/23 20:25, Paul Mezzanini wrote:

When I have seen behavior like this it was a dying drive. It only became
obvious when I did a SMART long test and got failed reads. It still reported
SMART OK though, so that was a lie.
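
For reference, the kind of check meant here is something like (device
name is a placeholder):

  smartctl -t long /dev/sdX       # kick off the long self-test
  smartctl -l selftest /dev/sdX   # check the result once it has finished
  smartctl -A /dev/sdX            # reallocated/pending sector counters

A drive can still report the overall health flag as PASSED while the
self-test log shows read failures, which is the "SMART OK but lying"
situation above.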



--

Paul Mezzanini
Platform Engineer III
Research Computing

Rochester Institute of Technology










From: Andrej Filipcic
Sent: Monday, March 6, 2023 8:51 AM
To: ceph-users
Subject: [ceph-users] rbd on EC pool with fast and extremely slow writes/reads


Hi,

I have a problem on one of ceph clusters I do not understand.
ceph 17.2.5 on 17 servers, 400 HDD OSDs, 10 and 25Gb/s NICs

3TB rbd image is on erasure coded 8+3 pool with 128pgs , xfs filesystem,
4MB objects in rbd image, mostly empy.

I have created a bunch of 10G files, most of them were written with
1.5GB/s, few of them were really slow, ~10MB/s, a factor of 100.

When reading these files back, the fast-written ones are read fast,
~2-2.5GB/s, the slowly-written are also extremely slow in reading, iotop
shows between 1 and 30 MB/s reading speed.

This does not happen at all on replicated images. There are some OSDs
with higher apply/commit latency, eg 200ms, but there are no slow ops.

The tests were done actually on proxmox vm with librbd, but the same
happens with krbd, and on bare metal with mounted krbd as well.

I have tried to check all OSDs for laggy drives, but they all look about
the same.

I have also copied entire image with "rados get...", object by object,
the strange thing here is that most of objects were copied within
0.1-0.2s, but quite some took more than 1s.
The cluster is quite busy with base traffic of ~1-2GB/s, so the speeds
can vary due to that. But I would not expect a factor of 100 slowdown
for some writes/reads with rbds.

Any clues on what might be wrong or what else to check? I have another
similar ceph cluster where everything looks fine.

Best,
Andrej

--
_
 prof. dr. Andrej Filipcic,   E-mail: andrej.filip...@ijs.si
 Department of Experimental High Energy Physics - F9
 Jozef Stefan Institute, Jamova 39, P.o.Box 3000
 SI-1001 Ljubljana, Slovenia
 Tel.: +386-1-477-3674    Fax: +386-1-477-3166
-
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



--
_
   prof. dr. Andrej Filipcic,   E-mail: andrej.filip...@ijs.si
   Department of Experimental High Energy Physics - F9
   Jozef Stefan Institute, Jamova 39, P.o.Box 3000
   SI-1001 Ljubljana, Slovenia
   Tel.: +386-1-477-3674    Fax: +386-1-477-3166
-
___
ceph-users mailing list -- ceph-users@ceph.io

[ceph-users] libceph: mds1 IP+PORT wrong peer at address

2023-03-09 Thread Frank Schilder
Hi all,

we seem to have hit a bug in the ceph fs kernel client and I just want to 
confirm what action to take. We get the error "wrong peer at address" in dmesg 
and some jobs on that server seem to get stuck in fs access; log extract below. 
I found these 2 tracker items https://tracker.ceph.com/issues/23883 and 
https://tracker.ceph.com/issues/41519, which don't seem to have fixes.

My questions:

- Is this harmless or does it indicate invalid/corrupted client cache entries?
- How to resolve, ignore, umount+mount or reboot?
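
Before deciding, what I plan to check is whether the address/nonce the
kernel client keeps asking for still matches what the MDS map
advertises, and whether this host still holds a session, roughly like
this (the mds name is a placeholder for our setup):

  ceph fs dump                      # shows each mds rank with its current addr/nonce
  ceph tell mds.<name> session ls   # does this client still have a session?

My understanding is that if the client insists on a stale nonce that no
longer exists, only umount+mount (or a reboot if the mount is stuck)
will make the kernel client forget it - but I'd like confirmation.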

Here an extract from the dmesg log, the error has survived a couple of MDS 
restarts already:

[Mon Mar  6 12:56:46 2023] libceph: mds1 192.168.32.87:6801 wrong peer at 
address
[Mon Mar  6 13:05:18 2023] libceph: wrong peer, want 
192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-1572619386
[Mon Mar  6 13:05:18 2023] libceph: mds1 192.168.32.87:6801 wrong peer at 
address
[Mon Mar  6 13:13:50 2023] libceph: wrong peer, want 
192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-1572619386
[Mon Mar  6 13:13:50 2023] libceph: mds1 192.168.32.87:6801 wrong peer at 
address
[Mon Mar  6 13:16:41 2023] libceph: mds1 192.168.32.87:6801 socket closed (con 
state OPEN)
[Mon Mar  6 13:16:41 2023] libceph: mds1 192.168.32.87:6801 socket closed (con 
state OPEN)
[Mon Mar  6 13:16:45 2023] ceph: mds1 reconnect start
[Mon Mar  6 13:16:45 2023] ceph: mds1 reconnect start
[Mon Mar  6 13:16:48 2023] ceph: mds1 reconnect success
[Mon Mar  6 13:16:48 2023] ceph: mds1 reconnect success
[Mon Mar  6 13:18:13 2023] ceph: update_snap_trace error -22
[Mon Mar  6 13:18:17 2023] libceph: mds7 192.168.32.88:6801 socket closed (con 
state OPEN)
[Mon Mar  6 13:18:17 2023] libceph: mds7 192.168.32.88:6801 socket closed (con 
state OPEN)
[Mon Mar  6 13:18:23 2023] ceph: mds1 recovery completed
[Mon Mar  6 13:18:23 2023] ceph: mds1 recovery completed
[Mon Mar  6 13:18:28 2023] ceph: mds7 reconnect start
[Mon Mar  6 13:18:28 2023] ceph: mds7 reconnect start
[Mon Mar  6 13:18:28 2023] ceph: mds7 reconnect success
[Mon Mar  6 13:18:29 2023] ceph: mds7 reconnect success
[Mon Mar  6 13:18:35 2023] ceph: update_snap_trace error -22
[Mon Mar  6 13:18:35 2023] ceph: mds7 recovery completed
[Mon Mar  6 13:18:35 2023] ceph: mds7 recovery completed
[Mon Mar  6 13:22:22 2023] libceph: wrong peer, want 
192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-453143347
[Mon Mar  6 13:22:22 2023] libceph: mds1 192.168.32.87:6801 wrong peer at 
address
[Mon Mar  6 13:30:54 2023] libceph: wrong peer, want 
192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-453143347
[...]
[Thu Mar  9 09:37:24 2023] slurm.epilog.cl (31457): drop_caches: 3
[Thu Mar  9 09:38:26 2023] libceph: wrong peer, want 
192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-453143347
[Thu Mar  9 09:38:26 2023] libceph: mds1 192.168.32.87:6801 wrong peer at 
address
[Thu Mar  9 09:46:58 2023] libceph: wrong peer, want 
192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-453143347
[Thu Mar  9 09:46:58 2023] libceph: mds1 192.168.32.87:6801 wrong peer at 
address
[Thu Mar  9 09:55:30 2023] libceph: wrong peer, want 
192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-453143347
[Thu Mar  9 09:55:30 2023] libceph: mds1 192.168.32.87:6801 wrong peer at 
address
[Thu Mar  9 10:04:02 2023] libceph: wrong peer, want 
192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-453143347
[Thu Mar  9 10:04:02 2023] libceph: mds1 192.168.32.87:6801 wrong peer at 
address

Thanks and best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] radosgw - octopus - 500 Bad file descriptor on upload

2023-03-09 Thread Boris Behrens
Hi,
we've observed HTTP 500 errors when uploading files to a single bucket, but
the problem went away after around 2 hours.

We've checked and saw the following error message:
2023-03-08T17:55:58.778+ 7f8062f15700 0 WARNING: set_req_state_err err_no=125 resorting to 500
2023-03-08T17:55:58.778+ 7f8062f15700 0 ERROR: RESTFUL_IO(s)->complete_header() returned err=Bad file descriptor
2023-03-08T17:55:58.778+ 7f8062f15700 1 == req done req=0x7f81d0189700 op status=-125 http_status=500 latency=65003730017ns ==
2023-03-08T17:55:58.778+ 7f8062f15700 1 beast: 0x7f81d0189700: IPADDRESS - - [2023-03-08T17:55:58.778961+] "PUT /BUCKET/OBJECT HTTP/1.1" 500 57 - "aws-sdk-php/3.257.11 OS/Linux/5.15.0-60-generic lang/php/8.2.3 GuzzleHttp/7" -

It only happened to a single bucket over a period of 1-2 hours (around 300
requests).
During the same time we had >20k PUT requests that were working fine on other
buckets.

This error also seems to happen to other buckets, but only very sporadically.
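
To get a feeling for how often that happens per bucket, we grep the
beast access lines for 500s, roughly like this (log path and field
layout depend on the setup, so treat it as a sketch):

  grep '" 500 ' /var/log/ceph/ceph-client.rgw.*.log \
    | grep -o 'PUT /[^/ ]*' | sort | uniq -c | sort -rn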

Did someone encounter this issue or know what it could be?

Cheers
 Boris
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: LRC k6m3l3, rack outage and availability

2023-03-09 Thread Eugen Block

Hi,

I haven't had the chance to play with LRC yet, so I can't really  
comment on that. But can you share your osd tree as well? I assume you  
already did, but can you verify that the crush rule works as expected  
and the chunks are distributed correctly?
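
A quick way to check that would be to look at the acting set of a few
PGs of that pool and map the OSD ids back to hosts/racks, e.g. (pool
name and PG id taken from your output below):

  ceph osd tree
  ceph pg ls-by-pool lrc_individual_pool | head
  ceph pg map 72.32

With k6m3l3 each PG should have 12 shards, 4 per rack; if that's not
what you see, the rule isn't doing what you expect.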


Regards,
Eugen

Quoting steve.bake...@gmail.com:

Hi, currently we are testing LRC codes and I have a cluster set up
with 3 racks and 4 hosts in each of those. What I want to achieve is
a storage-efficient erasure code (<=200% overhead) that also stays
available during a rack outage. In (my) theory, that should have
worked with LRC k6m3l3 with crush-locality=rack and
crush-failure-domain=host. But when I tested it, the PGs of the pool
all go into the "down" state.

So, when we've got k=6 data chunks and m=3 coding chunks, the data
should be reconstructable from 6 of these 9 chunks. With l=3, LRC
splits these 9 chunks into 3 groups of 3 and creates one additional
locality chunk per group, so we now have 3 groups of 4 chunks. These
3 groups get distributed over the 3 racks, and the 4 chunks of each
group get distributed over the 4 hosts of a rack. I thought that on a
full rack outage, the 6 remaining k/m chunks on the other 2 racks
should still be enough to keep up availability and the cluster could
proceed in a degraded state. But it does not, so I guess my thinking
is wrong :)

I wonder what the reason for this is - is it maybe some min_size
setting? The default min_size of this pool becomes 7; I also changed
it to 6 (yes, one shouldn't do that in production, I think) but got
the same result. Below I've added some details about the cluster,
pool creation and pg dumps. Any ideas? Can someone explain why this
does not work, or give another solution for how to achieve the
described specifications? Thx!
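
Just to spell out the arithmetic behind my expectation (as I
understand the LRC layering, so this may be exactly where I'm wrong):

  chunks per PG      = k + m + (k+m)/l = 6 + 3 + 3 = 12  (matches size 12)
  chunks per rack    = 12 / 3 racks = 4
  after 1 rack down  = 12 - 4 = 8 surviving chunks (6 of them k/m)
  surviving (8)     >= min_size (7)

So by that count the PGs should stay active (if degraded) with one
rack down, which is why the observed behaviour surprises me.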




Ceph version:


ceph --version
ceph version 16.2.11 (3cf40e2dca667f68c6ce3ff5cd94f01e711af894)  
pacific (stable)


#
Creation of the pool:
#

ceph osd erasure-code-profile set lrc_individual_profile plugin=lrc  
k=6 m=3 l=3 crush-failure-domain=host crush-locality=rack  
crush-root=default
ceph osd pool create lrc_individual_pool 1024 1024 erasure  
lrc_individual_profile

ceph osd pool set lrc_individual_pool pg_num 1024
ceph osd pool set lrc_individual_pool pg_num_min 1024
ceph osd pool set lrc_individual_pool pgp_num 1024
ceph osd pool set lrc_individual_pool pg_autoscale_mode warn
ceph osd pool set lrc_individual_pool bulk true


##
Resulting pool details:
##

ceph osd pool ls detail
pool 72 'lrc_individual_pool' erasure profile lrc_individual_profile  
size 12 min_size 7 crush_rule 1 object_hash rjenkins pg_num 1024  
pgp_num 1024 autoscale_mode warn last_change 140484 flags  
hashpspool,bulk stripe_width 24576 pg_num_min 1024


ceph osd pool get lrc_individual_pool all
size: 12
min_size: 7
pg_num: 1024
pgp_num: 1024
crush_rule: lrc_individual_pool
hashpspool: true
allow_ec_overwrites: false
nodelete: false
nopgchange: false
nosizechange: false
write_fadvise_dontneed: false
noscrub: false
nodeep-scrub: false
use_gmt_hitset: 1
erasure_code_profile: lrc_individual_profile
fast_read: 0
pg_autoscale_mode: warn
pg_num_min: 1024
bulk: true


#
Resulting crush rule:
#

ceph osd crush rule dump lrc_individual_pool
{
"rule_id": 1,
"rule_name": "lrc_individual_pool",
"ruleset": 1,
"type": 3,
"min_size": 3,
"max_size": 12,
"steps": [
{
"op": "set_chooseleaf_tries",
"num": 5
},
{
"op": "set_choose_tries",
"num": 100
},
{
"op": "take",
"item": -1,
"item_name": "default"
},
{
"op": "choose_indep",
"num": 3,
"type": "rack"
},
{
"op": "chooseleaf_indep",
"num": 4,
"type": "host"
},
{
"op": "emit"
}
]
}



Ceph status after the rack outage:


cluster:
id: ...
health: HEALTH_WARN
96 osds down
4 hosts (96 osds) down
1 rack (96 osds) down
Reduced data availability: 1024 pgs inactive, 1024 pgs down

  services:
mon: 3 daemons, quorum ...,...,... (age 4d)
mgr: ...(active, since 4d), standbys: ...,...
osd: 288 osds: 192 up (since 116s), 288 in (since 21h)

  data:
pools:   2 pools, 1025 pgs
objects: 291 objects, 0 B
usage:   199 GiB used, 524 TiB / 524 TiB avail
pgs: 99.902% pgs not active
 1024 down
 1active+clean


#
Section of pg dump:
#

72.32  0   0 0  00    00   0 0 0  down   2023-03-08T09:04:02.992141+0100

[ceph-users] Re: restoring ceph cluster from osds

2023-03-09 Thread Eugen Block

Hi,

there's no attachment to your email, please use something like  
pastebin to provide OSD logs.


Thanks
Eugen

Quoting Ben :


Hi,

I ended up with the whole set of OSDs to get back the original ceph cluster.
I managed to get the cluster running. However, its status is
something like below:

bash-4.4$ ceph -s

  cluster:

id: 3f271841-6188-47c1-b3fd-90fd4f978c76

health: HEALTH_WARN

7 daemons have recently crashed

4 slow ops, oldest one blocked for 35077 sec, daemons
[mon.a,mon.b] have slow ops.



  services:

mon: 3 daemons, quorum a,b,d (age 9h)

mgr: b(active, since 14h), standbys: a

osd: 4 osds: 0 up, 4 in (since 9h)



  data:

pools:   0 pools, 0 pgs

objects: 0 objects, 0 B

usage:   0 B used, 0 B / 0 B avail

pgs:


All osds are down.


I checked the osds logs and attached with this.


Please help; I wonder if it's possible to get the cluster back. I have
some backup of the monitors' data; so far I haven't restored it in the
process.


Thanks,

Ben
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io