[ceph-users] Re: Next (last) octopus point release

2022-07-01 Thread Laura Flores
>
> This is an important issue that I believe should be addressed before
> Quincy's last point release.


Apologies, this should say *Octopus's* last point release.

- Laura

On Fri, Jul 1, 2022 at 4:30 PM Laura Flores  wrote:

> There was a RocksDB PR that was merged a few days ago that I suspect is
> causing a regression. Details in this Tracker:
> https://tracker.ceph.com/issues/55636.
>
> It is Igor's PR, and I have asked him to take a look to verify whether
> it's related. This is an important issue that I believe should be addressed
> before Quincy's last point release.
>
> Yuri, I'm not sure how this affects your final QE process, but I want you
> to know that we may need a final fix/revert for this.
>
> - Laura
>
> On Fri, Jul 1, 2022 at 4:00 PM Yuri Weinstein  wrote:
>
>> We've been scraping for octopus PRs for a while now.
>>
>> I see only two PRs in the final stages of testing:
>>
>> https://github.com/ceph/ceph/pull/44731 - Venky is reviewing
>> https://github.com/ceph/ceph/pull/46912 - Ilya is reviewing
>>
>> Unless I hear strong voices that something else must be included, I will
>> start the final QE process.
>>
>> Thx
>> YuriW
>>
>> On Wed, Jun 22, 2022 at 8:10 AM Yuri Weinstein 
>> wrote:
>>
>>> There are still 38 open PRs and only 8 with the "needs-qa" label.
>>>
>>> If you want your PR to be included in the last octopus point release pls
>>> add the "needs-qa" label ASAP.
>>>
>>> We plan to start QE next week.
>>>
>>> Thx
>>>
>>> On Fri, May 13, 2022 at 2:13 PM Yuri Weinstein 
>>> wrote:
>>>
 We are nearing the EOL for octopus and planning to do a point release
 soon.

 Dev leads - pls scrub for any major outstanding fixes.

 I see 47 open PRs =>
 https://github.com/ceph/ceph/pulls?q=is%3Aopen+is%3Apr+milestone%3Aoctopus

 But only 18 with the "needs-qa" label.
 Pls add the "needs-qa" label if you want PRs to be included.

 Thx
 YuriW

>>
>
>
> --
>
> Laura Flores
>
> She/Her/Hers
>
> Associate Software Engineer, Ceph Storage
>
> Red Hat Inc. 
>
> La Grange Park, IL
>
> lflo...@redhat.com
> M: +17087388804
>
>

-- 

Laura Flores

She/Her/Hers

Associate Software Engineer, Ceph Storage

Red Hat Inc. 

La Grange Park, IL

lflo...@redhat.com
M: +17087388804


[ceph-users] Re: Conversion to Cephadm

2022-07-01 Thread Brent Kennedy
Interesting thought. Thanks for the reply :)

 

I have a mgr running on that same node, but that's what happened when I tried
to spin up a monitor. Based on this feedback, I went back to the node and
removed the mgr instance so it had nothing on it. I deleted all the images and
containers, downloaded the octopus script instead, changed the repo and
reinstalled cephadm. I then redeployed the mgr on that node and it went just
fine, but when I tried to deploy the mon container/instance, it failed with the
same error. I did the same thing to the CentOS 8 Stream node, which I upgraded
from CentOS 7 (out of desperation), and it worked; it is running both the mgr
and mon containers now. Oddly enough, the new containers are running 17.2.0
even though the installed cephadm is octopus. I had started an upgrade before
everything stopped working properly, so some of the containers are quincy and
some are octopus on the mgr and mon nodes.
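
For reference, a quick way to check which release each daemon is actually
running (generic ceph CLI commands, nothing cluster-specific assumed):

  # release summary across all daemons
  ceph versions

  # per-daemon view; the VERSION and IMAGE columns show what each container runs
  ceph orch ps

  # the image cephadm will use when (re)deploying daemons
  ceph config get mgr container_image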

 

One of the biggest things I see is that there seems to be no clear path from
CentOS 7 to CentOS Stream 8 (or Rocky) without blowing up the machine. During
the upgrade of the CentOS 7 node, it told me the ceph and python packages might
cause an issue and to remove them. I removed them, but that wiped out any ceph
configuration on the machine. Perhaps this isn't necessary. I am not really
worried about the monitor and access nodes as they are redundant, but the OSD
nodes are physical and host all the drives. Waiting for rebuilds with a
petabyte of data will be a long upgrade…

 

-Brent

 

From: Redouane Kachach Elhichou  
Sent: Monday, June 27, 2022 3:10 AM
To: Brent Kennedy 
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] Conversion to Cephadm

 

From the error message:

 

2022-06-25 21:51:59,798 7f4748727b80 INFO /usr/bin/ceph-mon: stderr too many
arguments:
[--default-log-to-journald=true,--default-mon-cluster-log-to-journald=true]

 

it seems that you are not using the cephadm that corresponds to your ceph
version. Please try to get cephadm for octopus.
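
A minimal sketch of one way to fetch and use the octopus cephadm, assuming the
GitHub raw path from the Ceph docs still applies:

  # download the cephadm script from the octopus branch
  curl --silent --remote-name --location \
      https://github.com/ceph/ceph/raw/octopus/src/cephadm/cephadm
  chmod +x cephadm

  # confirm it reports an octopus (15.2.x) version before using it
  ./cephadm version

  # optionally switch the package repo and reinstall the packaged cephadm
  ./cephadm add-repo --release octopus
  ./cephadm install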

 

-Redo

 

On Sun, Jun 26, 2022 at 4:07 AM Brent Kennedy <bkenn...@cfl.rr.com> wrote:

I successfully converted to cephadm after upgrading the cluster to octopus.
I am on CentOS 7 and am attempting to convert some of the nodes over to
rocky, but when I try to add a rocky node in and start the mgr or mon
service, it tries to start an octopus container and the service comes back
with an error.  Is there a way to force it to start a quincy container on
the new host?



I tried to start an upgrade, which did deploy the manager daemons to the new
hosts, but it failed converting the monitors and now one is dead (a CentOS 7
one). It seems it can spin up quincy containers on the new nodes, but because
the upgrade failed, it is still trying to deploy the octopus ones to the new
node.



Cephadm log on new node:



2022-06-25 21:51:34,427 7f4748727b80 DEBUG stat: Copying blob
sha256:7a0437f04f83f084b7ed68ad9c4a4947e12fc4e1b006b38129bac89114ec3621

2022-06-25 21:51:34,647 7f4748727b80 DEBUG stat: Copying blob
sha256:7a0437f04f83f084b7ed68ad9c4a4947e12fc4e1b006b38129bac89114ec3621

2022-06-25 21:51:34,652 7f4748727b80 DEBUG stat: Copying blob
sha256:731c3beff4deece7d4e54bc26ecf6d99988b19ea8414524277d83bc5a5d6eb70

2022-06-25 21:51:59,006 7f4748727b80 DEBUG stat: Copying config
sha256:2cf504fded3980c76b59a354fca8f301941f86e369215a08752874d1ddb69b73

2022-06-25 21:51:59,008 7f4748727b80 DEBUG stat: Writing manifest to image
destination

2022-06-25 21:51:59,008 7f4748727b80 DEBUG stat: Storing signatures

2022-06-25 21:51:59,239 7f4748727b80 DEBUG stat: 167 167

2022-06-25 21:51:59,703 7f4748727b80 DEBUG /usr/bin/ceph-mon: too many
arguments:
[--default-log-to-journald=true,--default-mon-cluster-log-to-journald=true]

2022-06-25 21:51:59,797 7f4748727b80 INFO Non-zero exit code 1 from
/bin/podman run --rm --ipc=host --stop-signal=SIGTERM --net=host
--entrypoint /usr/bin/ceph-mon --init -e
CONTAINER_IMAGE=docker.io/ceph/ceph:v15   -e 
NODE_NAME=tpixmon5 -e
CEPH_USE_RANDOM_NONCE=1 -v
/var/log/ceph/33ca8009-79d6-45cf-a67e-9753ab4dc861:/var/log/ceph:z -v
/var/lib/ceph/33ca8009-79d6-45cf-a67e-9753ab4dc861/mon.tpixmon5:/var/lib/cep
h/mon/ceph-tpixmon5:z -v /tmp/ceph-tmp7xmra8lk:/tmp/keyring:z -v
/tmp/ceph-tmp7mid2k57:/tmp/config:z docker.io/ceph/ceph:v15 
  --mkfs -i
tpixmon5 --fsid 33ca8009-79d6-45cf-a67e-9753ab4dc861 -c /tmp/config
--keyring /tmp/keyring --setuser ceph --setgroup ceph
--default-log-to-file=false --default-log-to-journald=true
--default-log-to-stderr=false --default-mon-cluster-log-to-file=false
--default-mon-cluster-log-to-journald=true
--default-mon-cluster-log-to-stderr=false

2022-06-25 21:51:59,798 7f4748727b80 INFO /usr/bin/ceph-mon: stderr too many
arguments:
[--default-log-to-journald=true,--default-mon-cluster-log-to-journald=true]



Podman Images:

REPOSITORY   TAG IMAGE ID  CREATEDSIZE


[ceph-users] Re: Next (last) octopus point release

2022-07-01 Thread Laura Flores
There was a RocksDB PR that was merged a few days ago that I suspect is
causing a regression. Details in this Tracker:
https://tracker.ceph.com/issues/55636.

It is Igor's PR, and I have asked him to take a look to verify whether it's
related. This is an important issue that I believe should be addressed
before Quincy's last point release.

Yuri, I'm not sure how this affects your final QE process, but I want you
to know that we may need a final fix/revert for this.

- Laura

On Fri, Jul 1, 2022 at 4:00 PM Yuri Weinstein  wrote:

> We've been scraping for octopus PRs for a while now.
>
> I see only two PRs in the final stages of testing:
>
> https://github.com/ceph/ceph/pull/44731 - Venky is reviewing
> https://github.com/ceph/ceph/pull/46912 - Ilya is reviewing
>
> Unless I hear strong voices that something else must be included, I will
> start the final QE process.
>
> Thx
> YuriW
>
> On Wed, Jun 22, 2022 at 8:10 AM Yuri Weinstein 
> wrote:
>
>> There are still 38 open PRs and only 8 with the "needs-qa" label.
>>
>> If you want your PR to be included in the last octopus point release pls
>> add the "needs-qa" label ASAP.
>>
>> We plan to start QE next week.
>>
>> Thx
>>
>> On Fri, May 13, 2022 at 2:13 PM Yuri Weinstein 
>> wrote:
>>
>>> We are nearing the EOL for octopus and planning to do a point release
>>> soon.
>>>
>>> Dev leads - pls scrub for any major outstanding fixes.
>>>
>>> I see 47 open PRs =>
>>> https://github.com/ceph/ceph/pulls?q=is%3Aopen+is%3Apr+milestone%3Aoctopus
>>>
>>> But only 18 with the "needs-qa" label.
>>> Pls add the "needs-qa" label if you want PRs to be included.
>>>
>>> Thx
>>> YuriW
>>>
>


-- 

Laura Flores

She/Her/Hers

Associate Software Engineer, Ceph Storage

Red Hat Inc. 

La Grange Park, IL

lflo...@redhat.com
M: +17087388804


[ceph-users] Re: Broken PTR record for new Ceph Redmine IP

2022-07-01 Thread David Galloway


On 7/1/22 12:13, Ilya Dryomov wrote:

On Fri, Jul 1, 2022 at 5:48 PM Konstantin Shalygin  wrote:


Hi,

Since Jun 28 04:05:58 postfix/smtpd[567382]: NOQUEUE: reject: RCPT from 
unknown[158.69.70.147]: 450 4.7.25 Client host rejected: cannot find your hostname, 
[158.69.70.147]; from= helo=

ipaddr was changed from 158.69.68.89 to 158.69.70.147, but PTR record was not 
moved simultaneously

√ ~ % drill -x 158.69.68.89 | grep PTR | tail -n1
89.68.69.158.in-addr.arpa.  21600   IN  PTR tracker.ceph.com.
√ ~ % drill -x 158.69.70.147 | grep PTR | tail -n1
147.70.69.158.in-addr.arpa. 86400   IN  PTR ip-158-69-70.eu.

Can somebody fix this? :)


Adding David -- he handled the migration earlier this week.



Ah, right, I forgot OVH even offered PTR records.  Will fix now.  Thank you!



[ceph-users] Re: Broken PTR record for new Ceph Redmine IP

2022-07-01 Thread Ilya Dryomov
On Fri, Jul 1, 2022 at 5:48 PM Konstantin Shalygin  wrote:
>
> Hi,
>
> Since Jun 28 04:05:58 postfix/smtpd[567382]: NOQUEUE: reject: RCPT from 
> unknown[158.69.70.147]: 450 4.7.25 Client host rejected: cannot find your 
> hostname, [158.69.70.147]; from= 
> helo=
>
> ipaddr was changed from 158.69.68.89 to 158.69.70.147, but PTR record was not 
> moved simultaneously
>
> √ ~ % drill -x 158.69.68.89 | grep PTR | tail -n1
> 89.68.69.158.in-addr.arpa.  21600   IN  PTR tracker.ceph.com.
> √ ~ % drill -x 158.69.70.147 | grep PTR | tail -n1
> 147.70.69.158.in-addr.arpa. 86400   IN  PTR ip-158-69-70.eu.
>
> Can somebody fix this? :)

Adding David -- he handled the migration earlier this week.

Thanks,

Ilya


[ceph-users] Re: [ext] Re: cephadm orch thinks hosts are offline

2022-07-01 Thread Kuhring, Mathias
We found a fix for our issue of ceph orch reporting wrong/outdated service
information:
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/DAFXD46NALFAFUBQEYODRIFWSD6SH2OL/

In our case, our DNS settings were messed up on the cluster hosts AND
also within the MGR daemon containers (cephadm deployed).
Not sure, but I could imagine this could also mess with proper host
detection.
So I guess it's worth at least confirming the settings in
/etc/resolv.conf on all your hosts and MGR containers.
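
A rough way to compare the two (the container name below is a placeholder;
take the real one from 'podman ps'):

  # on each cluster host
  cat /etc/resolv.conf

  # inside the MGR container on that host
  podman ps --filter name=mgr
  podman exec <mgr-container-id> cat /etc/resolv.conf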

Best, Mathias

On 6/29/2022 5:59 PM, Mathias Kuhring wrote:
> Hey all,
>
> just want to note that I'm also looking for some kind of way to
> restart/reset/refresh the orchestrator.
> But in my case it's not the hosts but the services which are 
> presumably wrongly reported and outdated:
> https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/NHEVEM3ESJYXZ4LPJ24BBCK6NCG4QRHP/
>  
>
>
> Don't know if this even can be related.
> But in case you find a solution, I'll just stick around here and check 
> if I can apply it.
>
> Best,
> Mathias
>
> On 6/27/2022 12:33 PM, Thomas Roth wrote:
>> Hi Adam,
>>
>> no, this is the 'feature' where the reboot of a mgr host causes all
>> known hosts to become unmanaged.
>>
>>
>> > # lxbk0375 # ceph cephadm check-host lxbk0374 10.20.2.161
>> > mgr.server reply reply (1) Operation not permitted check-host failed:
>> > Host 'lxbk0374' not found. Use 'ceph orch host ls' to see all 
>> managed hosts.
>>
>> In some email on this issue that I can't find at the moment, someone
>> describes a workaround that allows restarting the entire orchestrator
>> business. But that sounded risky.
>>
>> Regards
>> Thomas
>>
>>
>> On 23/06/2022 19.42, Adam King wrote:
>>> Hi Thomas,
>>>
>>> What happens if you run "ceph cephadm check-host " for one 
>>> of the
>>> hosts that is offline (and if that fails "ceph cephadm check-host
>>>  ")? Usually, the hosts get marked offline when 
>>> some ssh
>>> connection to them fails. The check-host command will attempt a 
>>> connection
>>> and maybe let us see why it's failing, or, if there is no longer an 
>>> issue
>>> connecting to the host, should mark the host online again.
>>>
>>> Thanks,
>>>    - Adam King
>>>
>>> On Thu, Jun 23, 2022 at 12:30 PM Thomas Roth  wrote:
>>>
 Hi all,

 found this bug https://tracker.ceph.com/issues/51629 (Octopus 
 15.2.13),
 reproduced it in Pacific and
 now again in Quincy:
 - new cluster
 - 3 mgr nodes
 - reboot active mgr node
 - (only in Quincy:) standby mgr node takes over, rebooted node becomes
 standby
 - `ceph orch host ls` shows all hosts as `offline`
 - add a new host: not offline

 In my setup, hostnames and IPs are well known, thus

 # ceph orch host ls
 HOST  ADDR LABELS  STATUS
 lxbk0374  10.20.2.161  _admin  Offline
 lxbk0375  10.20.2.162  Offline
 lxbk0376  10.20.2.163  Offline
 lxbk0377  10.20.2.164  Offline
 lxbk0378  10.20.2.165  Offline
 lxfs416   10.20.2.178  Offline
 lxfs417   10.20.2.179  Offline
 lxfs418   10.20.2.222  Offline
 lxmds22   10.20.6.67
 lxmds23   10.20.6.72   Offline
 lxmds24   10.20.6.74   Offline


 (All lxbk are mon nodes, the first 3 are mgr, 'lxmds22' was added 
 after
 the fatal reboot.)


 Does this matter at all?
 The old bug report is one year old, now with priority 'Low'. And some
 people must have rebooted one host or another in their clusters...

 There is a cephfs on our cluster, operations seem to be unaffected.


 Cheers
 Thomas

 -- 
 
 Thomas Roth
 Department: Informationstechnologie
 Location: SB3 2.291


 GSI Helmholtzzentrum für Schwerionenforschung GmbH
 Planckstraße 1, 64291 Darmstadt, Germany, www.gsi.de

 Commercial Register / Handelsregister: Amtsgericht Darmstadt, HRB 1528
 Managing Directors / Geschäftsführung:
 Professor Dr. Paolo Giubellino, Dr. Ulrich Breuer, Jörg Blaurock
 Chairman of the Supervisory Board / Vorsitzender des 
 GSI-Aufsichtsrats:
 State Secretary / Staatssekretär Dr. Volkmar Dietz


>>>
>>
-- 
Mathias Kuhring

Dr. rer. nat.
Bioinformatician
HPC & Core Unit Bioinformatics
Berlin Institute of Health at Charité (BIH)

E-Mail:  mathias.kuhr...@bih-charite.de
Mobile: +49 172 3475576



[ceph-users] Broken PTR record for new Ceph Redmine IP

2022-07-01 Thread Konstantin Shalygin
Hi,

Since Jun 28 04:05:58 postfix/smtpd[567382]: NOQUEUE: reject: RCPT from 
unknown[158.69.70.147]: 450 4.7.25 Client host rejected: cannot find your 
hostname, [158.69.70.147]; from= 
helo=

ipaddr was changed from 158.69.68.89 to 158.69.70.147, but PTR record was not 
moved simultaneously

√ ~ % drill -x 158.69.68.89 | grep PTR | tail -n1
89.68.69.158.in-addr.arpa.  21600   IN  PTR tracker.ceph.com.
√ ~ % drill -x 158.69.70.147 | grep PTR | tail -n1
147.70.69.158.in-addr.arpa. 86400   IN  PTR ip-158-69-70.eu.

Can somebody fix this? :)


Thanks,
k


[ceph-users] Re: Orchestrator informations wrong and outdated

2022-07-01 Thread Kuhring, Mathias
We noticed that our DNS settings were inconsistent and partially wrong.
The NetworkManager somehow set useless nameservers in the 
/etc/resolv.conf of our hosts.
But in particular, the DNS settings in the MGR containers needed fixing 
as well.
I fixed /etc/resolv.conf on our hosts and in the container of the active
MGR daemon.
This fixed all the issues I described, including the output of orch ps and
orch ls, as well as registry queries such as docker pull and upgrade ls.
Afterwards, I was able to do an upgrade to Quincy.
And as far as I can tell, the newly deployed MGR containers picked up 
the proper DNS settings from the hosts.
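
For anyone hitting the same thing, the relevant commands are roughly the
following (the target version below is just an example):

  # confirm the active MGR can reach the registry again
  ceph orch upgrade ls

  # then start and watch the upgrade
  ceph orch upgrade start --ceph-version 17.2.0
  ceph orch upgrade status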

Best, Mathias

On 6/29/2022 10:45 AM, Mathias Kuhring wrote:
> Dear Ceph community,
>
> we are in the curious situation that typical orchestrator queries 
> provide wrong or outdated information about different services.
> E.g. `ceph orch ls` will report wrong numbers on active services.
> Or `ceph orch ps` reports many OSDs as "starting" and many services 
> with an old version (15.2.14, but we are on 16.2.7).
> Also the refresh times seem way off (capital M == months?).
> However, the cluster is healthy (`ceph status` is happy).
> And sample validation of affected services with systemctl also shows 
> that they are up and ok.
>
> We already tried the following without success:
>
> a) re-registering cephadm as orchestrator backend
> 0|0[root@osd-1 ~]# ceph orch pause
> 0|0[root@osd-1 ~]# ceph orch set backend ''
> 0|0[root@osd-1 ~]# ceph mgr module disable cephadm
> 0|0[root@osd-1 ~]# ceph orch ls
> Error ENOENT: No orchestrator configured (try `ceph orch set backend`)
> 0|0[root@osd-1 ~]# ceph mgr module enable cephadm
> 0|0[root@osd-1 ~]# ceph orch set backend 'cephadm'
>
> b) a failover of the MGR (hoping it would restart/reset the 
> orchestrator module)
> 0|0[root@osd-1 ~]# ceph status | grep mgr
>     mgr:   osd-1(active, since 6m), standbys: osd-5.jcfyqe, 
> osd-4.oylrhe, osd-3
> 0|0[root@osd-1 ~]# ceph mgr fail
> 0|0[root@osd-1 ~]# ceph status | grep mgr
>     mgr:   osd-5.jcfyqe(active, since 7s), standbys: 
> osd-4.oylrhe, osd-3, osd-1
>
> Is there any other way to somehow reset the orchestrator 
> information/connection?
> I added different relevant outputs below.
>
> I also went through the MGR logs and found an issue with querying the
> docker repos.
> I attempted to upgrade the MGRs to 16.2.9 a few weeks ago due to a
> different bug, but this upgrade never went through, apparently because
> cephadm was not able to pull the image.
> Interestingly, I'm able to pull the image manually with docker pull,
> but cephadm is not.
> I also get an error with `ceph orch upgrade ls` when checking on
> available versions.
> I'm not sure if this is relevant to the orchestrator problem we have.
> But to be safe, I also added the logs/output below.
>
> Thank you for all your help!
>
> Best Wishes,
> Mathias
>
>
> 0|0[root@osd-1 ~]# ceph status
>   cluster:
>     id: 55633ec3-6c0c-4a02-990c-0f87e0f7a01f
>     health: HEALTH_OK
>
>   services:
>     mon:   5 daemons, quorum osd-1,osd-2,osd-5,osd-4,osd-3 (age 86m)
>     mgr:   osd-5.jcfyqe(active, since 21m), standbys: osd-4.oylrhe, osd-3, osd-1
>     mds:   1/1 daemons up, 1 standby
>     osd:   270 osds: 270 up (since 13d), 270 in (since 5w)
>     cephfs-mirror: 1 daemon active (1 hosts)
>     rgw:   3 daemons active (3 hosts, 2 zones)
>
>   data:
>     volumes: 1/1 healthy
>     pools:   17 pools, 6144 pgs
>     objects: 692.54M objects, 1.2 PiB
>     usage:   1.8 PiB used, 1.7 PiB / 3.5 PiB avail
>     pgs: 6114 active+clean
>  29   active+clean+scrubbing+deep
>  1    active+clean+scrubbing
>
>   io:
>     client:   0 B/s rd, 421 MiB/s wr, 52 op/s rd, 240 op/s wr
>
> 0|0[root@osd-1 ~]# ceph orch ls
> NAME                       PORTS                   RUNNING  REFRESHED  AGE  PLACEMENT
> alertmanager               ?:9093,9094             0/1      -          8M   count:1
> cephfs-mirror                                      0/1      -          5M   count:1
> crash                                              2/6      7M ago     4M   *
> grafana                    ?:3000                  0/1      -          8M   count:1
> ingress.rgw.default        172.16.39.131:443,1967  0/2      -          4M   osd-1
> ingress.rgw.ext            172.16.39.132:443,1968  4/2      7M ago     4M   osd-5
> ingress.rgw.ext-website    172.16.39.133:443,1969  0/2      -          4M   osd-3
> mds.cephfs                                         2/2      9M ago     4M   count-per-host:1;label:mds
> mgr                                                5/5      9M ago     9M   count:5
> mon                                                5/5      9M ago     9M   count:5
> node-exporter              ?:9100                  2/6      7M ago     7w   *
> osd.all-available-devices                          0        -          5w   *
> 

[ceph-users] Re: Ceph recovery network speed

2022-07-01 Thread Curt
On Wed, Jun 29, 2022 at 11:22 PM Curt  wrote:

>
>
> On Wed, Jun 29, 2022 at 9:55 PM Stefan Kooman  wrote:
>
>> On 6/29/22 19:34, Curt wrote:
>> > Hi Stefan,
>> >
>> > Thank you, that definitely helped. I bumped it to 20% for now and
>> that's
>> > giving me around 124 PGs backfilling at 187 MiB/s, 47 Objects/s.  I'll
>> > see how that runs and then increase it a bit more if the cluster
>> handles
>> > it ok.
>> >
>> > Do you think it's worth enabling scrubbing while backfilling?
>>
>> If the cluster can cope with the extra load, sure. If it slows down the
>> backfilling to levels that are too slow ... temporarily disable it.
>>
>> Since
>> > this is going to take a while. I do have 1 inconsistent PG that has now
>> > become 10 as it splits.
>>
>> Hmm. Well, if it finds broken PGs, for sure pause backfilling (ceph osd
>> set nobackfill) and have it handle this ASAP: ceph pg repair $pg.
>> Something is wrong, and you want to have this fixed sooner rather than
>> later.
>>
>
>  When I try to run a repair, nothing happens. If I try to list
> inconsistent-obj, I get "No scrub information available for 12.12". If I
> tell it to run a deep scrub, nothing. I'll set debug and see what I can
> find in the logs.
>
Just to give a quick update: this one was my fault, I missed a flag. Once it
was set correctly, the PG was scrubbed and repaired. It's now back to adding
more PGs, and backfill continues to get a bit faster as it expands. I'm now up
to pg_num 1362 and pgp_num 1234, with backfills happening at 250-300 MB/s,
60-70 objects/s.
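
For anyone searching the archive later, a sketch of the repair sequence
discussed above (pg 12.12 is just the example from earlier in the thread;
adjust to the affected PG):

  # pause backfill while dealing with the inconsistent PG
  ceph osd set nobackfill

  # force a deep scrub so the inconsistency details get populated
  ceph pg deep-scrub 12.12
  rados list-inconsistent-obj 12.12 --format=json-pretty

  # repair, then resume backfilling
  ceph pg repair 12.12
  ceph osd unset nobackfill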

Thanks for all the help.

>
>> Not sure what hardware you have, but you might benefit from disabling
>> write caches, see this link:
>>
>> https://docs.ceph.com/en/quincy/start/hardware-recommendations/#write-caches
>>
>> Thanks, I'm disabling cache and I'll see if it helps at all.

> Gr. Stefan
>>
>


[ceph-users] Re: bunch of " received unsolicited reservation grant from osd" messages in log

2022-07-01 Thread Denis Polom

OK, and when will it be backported to Pacific?

On 6/27/22 18:59, Neha Ojha wrote:

This issue should be addressed by https://github.com/ceph/ceph/pull/46860.

Thanks,
Neha

On Fri, Jun 24, 2022 at 2:53 AM Kenneth Waegeman
 wrote:

Hi,

I’ve updated the cluster to 17.2.0, but the log is still filled with these 
entries:

2022-06-24T11:45:12.408944+02:00 osd031 ceph-osd[22024]: osd.508 pg_epoch: 685367 
pg[5.166s0( v 685201'4130317 (680710'4123328,685201'4130317] 
local-lis/les=683269/683270 n=375262 ec=1104/1104 lis/c=683269/683269 
les/c/f=683270/683270/19430 sis=683269) 
[508,181,357,22,592,228,250,383,28,213,586]p508(0) r=0 lpr=683269 crt=685201'4130317 
lcod 685201'4130316 mlcod 685201'4130316 active+clean TIME_FOR_DEEP 
ps=[591~1,5a3~1,5a5~1,5a7~1,5a9~1,5ab~1,5ad~1,5af~1,5b1~2,5bb~1,5c5~14,5ed~c,5fd~1f]] 
scrubber : handle_scrub_reserve_grant: received unsolicited 
reservation grant from osd 357(2) (0x55e16d92f600)
2022-06-24T11:45:12.412196+02:00 osd031 ceph-osd[22024]: osd.508 pg_epoch: 685367 
pg[5.166s0( v 685201'4130317 (680710'4123328,685201'4130317] 
local-lis/les=683269/683270 n=375262 ec=1104/1104 lis/c=683269/683269 
les/c/f=683270/683270/19430 sis=683269) 
[508,181,357,22,592,228,250,383,28,213,586]p508(0) r=0 lpr=683269 crt=685201'4130317 
lcod 685201'4130316 mlcod 685201'4130316 active+clean TIME_FOR_DEEP 
ps=[591~1,5a3~1,5a5~1,5a7~1,5a9~1,5ab~1,5ad~1,5af~1,5b1~2,5bb~1,5c5~14,5ed~c,5fd~1f]] 
scrubber : handle_scrub_reserve_grant: received unsolicited 
reservation grant from osd 586(10) (0x55e1b57354a0)
2022-06-24T11:45:12.417867+02:00 osd031 ceph-osd[21674]: osd.560 pg_epoch: 685367 
pg[5.6e2s0( v 685198'4133308 (680724'4126463,685198'4133308] 
local-lis/les=675991/675992 n=375710 ec=1104/1104 lis/c=675991/675991 
les/c/f=675992/675992/19430 sis=675991) 
[560,259,440,156,324,358,338,218,191,335,256]p560(0) r=0 lpr=675991 
crt=685198'4133308 lcod 685198'4133307 mlcod 685198'4133307 active+clean 
TIME_FOR_DEEP 
ps=[591~1,5a3~1,5a5~1,5a7~1,5a9~1,5ab~1,5ad~1,5af~1,5b1~2,5bb~1,5c5~14,5ed~c,5fd~1f]] 
scrubber : handle_scrub_reserve_grant: received unsolicited 
reservation grant from osd 259(1) (0x559a5f371080)
2022-06-24T11:45:12.453294+02:00 osd031 ceph-osd[22024]: osd.508 pg_epoch: 685367 
pg[5.166s0( v 685201'4130317 (680710'4123328,685201'4130317] 
local-lis/les=683269/683270 n=375262 ec=1104/1104 lis/c=683269/683269 
les/c/f=683270/683270/19430 sis=683269) 
[508,181,357,22,592,228,250,383,28,213,586]p508(0) r=0 lpr=683269 crt=685201'4130317 
lcod 685201'4130316 mlcod 685201'4130316 active+clean TIME_FOR_DEEP 
ps=[591~1,5a3~1,5a5~1,5a7~1,5a9~1,5ab~1,5ad~1,5af~1,5b1~2,5bb~1,5c5~14,5ed~c,5fd~1f]] 
scrubber : handle_scrub_reserve_grant: received unsolicited 
reservation grant from osd 213(9) (0x55e1a9e922c0)

Is the bug still there, or is this something else?

Thanks!!

Kenneth



On 19 Dec 2021, at 11:05, Ronen Friedman  wrote:



On Sat, Dec 18, 2021 at 7:06 PM Ronen Friedman  wrote:

Hi all,

This was indeed a bug, which I've already fixed in 'master'.
I'll look for the backporting status tomorrow.

Ronen


The fix is part of a larger change (which fixes a more severe issue). Pending 
(non-trivial) backport.
I'll try to speed this up.

Ronen





On Fri, Dec 17, 2021 at 1:49 PM Kenneth Waegeman  
wrote:

Hi all,

> I'm also seeing these messages spamming the logs after updating from
> octopus to pacific 16.2.7.

Any clue yet what this means?

Thanks!!

Kenneth

On 29/10/2021 22:21, Alexander Y. Fomichev wrote:

Hello.
After upgrading to 'pacific' I found the log spammed with messages like this:
... active+clean]  scrubber pg(46.7aas0) handle_scrub_reserve_grant:
received unsolicited reservation grant from osd 138(1) (0x560e77c51600)

If I understand it correctly, this is exactly what it looks like, and this is
not good. Running with debug osd 1/5 doesn't help much, and Google brings me
nothing, so I'm stuck. Could anybody give a hint about what's happening or
where to dig?




[ceph-users] Re: persistent write-back cache and qemu

2022-07-01 Thread Ilya Dryomov
On Fri, Jul 1, 2022 at 8:32 AM Ansgar Jazdzewski
 wrote:
>
> Hi folks,
>
> I did a little testing with the persistent write-back cache (*1). We
> run ceph quincy 17.2.1 and qemu 6.2.0.
>
> rbd.fio works with the cache, but as soon as we start a VM we get something like
>
> error: internal error: process exited while connecting to monitor:
> Failed to open module: /usr/lib/x86_64-linux-gnu/qemu/block-rbd.so:
> undefined symbol: aio_bh_schedule_oneshot_full
> 2022-06-30T13:08:39.640532Z qemu-system-x86_64: -blockdev
>
> so my assumption is that we need to do a bit more to have it running
> with qemu. If you have some more information on how to get it running,
> please let me know!

Hi Ansgar,

The QEMU failure doesn't look related at first sight.  Have you tried
opening an image with the same QEMU without persistent cache enabled?

Thanks,

Ilya


[ceph-users] persistent write-back cache and qemu

2022-07-01 Thread Ansgar Jazdzewski
Hi folks,

I did a little testing with the persistent write-back cache (*1). We
run ceph quincy 17.2.1 and qemu 6.2.0.

rbd.fio works with the cache, but as soon as we start a VM we get something like

error: internal error: process exited while connecting to monitor:
Failed to open module: /usr/lib/x86_64-linux-gnu/qemu/block-rbd.so:
undefined symbol: aio_bh_schedule_oneshot_full
2022-06-30T13:08:39.640532Z qemu-system-x86_64: -blockdev

so my assumption is that we need to do a bit more to have it running
with qemu. If you have some more information on how to get it running,
please let me know!
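
For anyone trying to reproduce, the cache can be enabled per pool roughly like
this (pool name, cache path and size below are placeholders, not our exact
values; see the page under (*1) for the details):

  # enable the write-log plugin and SSD cache mode for a pool
  rbd config pool set rbd rbd_plugins pwl_cache
  rbd config pool set rbd rbd_persistent_cache_mode ssd

  # local cache directory and size on the client host
  rbd config pool set rbd rbd_persistent_cache_path /mnt/pwl-cache
  rbd config pool set rbd rbd_persistent_cache_size 10G

  # per-image cache state can be checked with
  rbd status rbd/test-image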

Thanks a lot,
Ansgar

(*1) 
https://docs.ceph.com/en/pacific/rbd/rbd-persistent-write-back-cache/#rbd-persistent-write-back-cache