[ceph-users] Re: reef 18.2.1 QE Validation status

2023-11-09 Thread Venky Shankar
Hi Yuri,

On Fri, Nov 10, 2023 at 4:55 AM Yuri Weinstein  wrote:
>
> I've updated all approvals and merged PRs in the tracker and it looks
> like we are ready for gibba, LRC upgrades pending approval/update from
> Venky.

The smoke test failure is caused by missing (kclient) patches in
Ubuntu 20.04 that certain parts of the fs suite (via smoke tests) rely
on. More details here

https://tracker.ceph.com/issues/63488#note-8

The kclient tests in smoke pass with other distros and the fs suite
tests have been reviewed and look good. Run details are here

https://tracker.ceph.com/projects/cephfs/wiki/Reef#07-Nov-2023

The smoke failure is noted as a known issue for now. Consider this run
as "fs approved".

>
> On Thu, Nov 9, 2023 at 1:31 PM Radoslaw Zarzynski  wrote:
> >
> > rados approved!
> >
> > Details are here: 
> > https://tracker.ceph.com/projects/rados/wiki/REEF#1821-Review.
> >
> > On Mon, Nov 6, 2023 at 10:33 PM Yuri Weinstein  wrote:
> > >
> > > Details of this release are summarized here:
> > >
> > > https://tracker.ceph.com/issues/63443#note-1
> > >
> > > Seeking approvals/reviews for:
> > >
> > > smoke - Laura, Radek, Prashant, Venky (POOL_APP_NOT_ENABLE failures)
> > > rados - Neha, Radek, Travis, Ernesto, Adam King
> > > rgw - Casey
> > > fs - Venky
> > > orch - Adam King
> > > rbd - Ilya
> > > krbd - Ilya
> > > upgrade/quincy-x (reef) - Laura PTL
> > > powercycle - Brad
> > > perf-basic - Laura, Prashant (POOL_APP_NOT_ENABLE failures)
> > >
> > > Please reply to this email with approval and/or trackers of known
> > > issues/PRs to address them.
> > >
> > > TIA
> > > YuriW
> > > ___
> > > Dev mailing list -- d...@ceph.io
> > > To unsubscribe send an email to dev-le...@ceph.io
> > >
> >
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io



-- 
Cheers,
Venky
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] mds hit find_exports balancer runs too long

2023-11-09 Thread zxcs
Hi, Experts,

we have a CephFS cluster running 16.2.* with multiple active MDS enabled, and
we found that the MDS sometimes complains with the message below:

mds.*.bal find_exports balancer runs too long


and we have already set the config below

 mds_bal_interval = 30
 mds_bal_sample_interval = 12

and then we can see slow MDS requests in `ceph -s`.

Our question is: why does the MDS complain about this, and how can we prevent
this problem from happening again?
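For illustration, a rough sketch of how settings like these can be applied and
verified at runtime, and of directory pinning as one way to take load off the
balancer (the mount path is only an example):

    ceph config set mds mds_bal_interval 30
    ceph config set mds mds_bal_sample_interval 12
    ceph config get mds mds_bal_interval            # verify the stored value
    # pinning busy directories to a single rank avoids balancer migrations:
    setfattr -n ceph.dir.pin -v 0 /mnt/cephfs/busy_dir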


Thanks a ton,


xiong
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph Dashboard - Community News Sticker [Feedback]

2023-11-09 Thread Nizamudeen A
Thank you everyone for the feedback!

It's always good to hear whether or not something adds value to the UI and to
its users before we go ahead and start building it.

And btw, if people are wondering whether we are short on features, the
short answer is no. Along with Multi-Cluster Management & monitoring
through the ceph-dashboard, some more management features will be
coming in the upcoming major release. The News Sticker was one
of the items on that list.

If you have more feedback on something that you guys would want to see in
the GUI, please let us know and we'll add it to our list and work on it.

Regards,
Nizam

On Thu, Nov 9, 2023 at 7:24 PM Anthony D'Atri 
wrote:

> IMHO we don't need yet another place to look for information, especially
> one that some operators never see.  ymmv.
>
> >
> >> Hello,
> >>
> >> We wanted to get some feedback on one of the features that we are
> planning
> >> to bring in for upcoming releases.
> >>
> >> On the Ceph GUI, we thought it could be interesting to show information
> >> regarding the community events, ceph release information (Release notes
> and
> >> changelogs) and maybe even notify about new blog post releases and also
> >> inform regarding the community group meetings. There would be options to
> >> subscribe to the events that you want to get notified.
> >>
> >> Before proceeding with its implementation, we thought it'd be good to
> get
> >> some community feedback around it. So please let us know what you think
> >> (the goods and the bads).
> >>
> >> Regards,
> >> --
> >>
> >> Nizamudeen A
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph fs (meta) data inconsistent

2023-11-09 Thread Xiubo Li


On 11/10/23 00:18, Frank Schilder wrote:

Hi Xiubo,

I will try to answer questions from all your 3 e-mails here together with some 
new information we have.

New: The problem occurs in newer python versions when using the shutil.copy function. There is also 
a function shutil.copy2 for which the problem does not show up. Copy2 behaves a bit like "cp 
-p" while copy is like "cp". The only code difference (linux) between these 2 
functions is that copy calls copyfile+copymode while copy2 calls copyfile+copystat. For now we 
asked our users to use copy2 to avoid the issue.

The copyfile function calls _fastcopy_sendfile on linux, which in turn calls 
os.sendfile, which seems to be part of libc:

#include <sys/sendfile.h>
ssize_t sendfile(int out_fd, int in_fd, off_t *offset, size_t count);

I'm wondering if using this function requires explicit meta-data updates or 
should be safe on ceph-fs. I'm also not sure if a user-space client even 
supports this function (seems to be meaningless). Should this function be safe 
to use on ceph kclient?


I didn't foresee any limit for this in kclient.

shutil.copy will only copy the contents of the file, while 
shutil.copy2 will also copy the metadata. I need to know what exactly 
shutil.copy and shutil.copy2 end up doing in kclient.
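One possible way to capture that, assuming strace is available on the client
(the paths are only examples):

    strace -f -e trace=openat,sendfile,copy_file_range,write,chmod,utimensat,setxattr \
        python3 -c 'import shutil; shutil.copy("src.dat", "/mnt/cephfs/dst.dat")'
    strace -f -e trace=openat,sendfile,copy_file_range,write,chmod,utimensat,setxattr \
        python3 -c 'import shutil; shutil.copy2("src.dat", "/mnt/cephfs/dst2.dat")'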



Answers to questions:


BTW, have you tested ceph-fuse with the same test? Is it also the same?

I don't have fuse clients available, so can't test right now.


Have you tried other ceph versions?

We are in the process of deploying a new test cluster, the old one is scrapped 
already. I can't test this at the moment.


It looks like the cap update request was dropped to the ground in MDS.
[...]
If you can reproduce it, then please provide the mds logs by setting:
[...]

I can do a test with MDS logs at a high debug level. Before I do that, and looking at the 
Python findings above, is this something that should work on Ceph, or is it a 
Python issue?


Not sure yet. I need to understand what exactly shutil.copy does in kclient.

Thanks

- Xiubo





Thanks for your help!
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MDS stuck in rejoin

2023-11-09 Thread Xiubo Li


On 11/9/23 23:41, Frank Schilder wrote:

Hi Xiubo,

Great! I'm not sure if we observed this particular issue, but we did have the 
"oldest_client_tid updates not advancing" message in a context that might be 
related.

If this fix is not too large, it would be really great if it could be included 
in the last Pacific point release.


Yeah, this will be backported after it gets merged. But for kclient we 
still need another patch.


Thanks

- Xiubo




Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Xiubo Li 
Sent: Wednesday, November 8, 2023 1:38 AM
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: MDS stuck in rejoin

Hi Frank,

Recently I found a new possible case that could cause this, please see
https://github.com/ceph/ceph/pull/54259. This is just a ceph-side fix;
after this we need to fix it in kclient too, which hasn't been done yet.

Thanks

- Xiubo

On 8/8/23 17:44, Frank Schilder wrote:

Dear Xiubo,

the nearfull pool is an RBD pool and has nothing to do with the file system. 
All pools for the file system have plenty of capacity.

I think we have an idea what kind of workload caused the issue. We had a user 
run a computation that reads the same file over and over again. He started 100 
such jobs in parallel and our storage servers were at 400% load. I saw 167K 
read IOP/s on an HDD pool that has an aggregated raw IOP/s budget of ca. 11K. 
Clearly, most of this was served from RAM.

It is possible that this extreme load situation triggered a race that remained 
undetected/unreported. There is literally no related message in any logs near 
the time the warning started popping up. It shows up out of nowhere.

We asked the user to change his workflow to use local RAM disk for the input 
files. I don't think we can reproduce the problem anytime soon.

About the bug fixes, I'm eagerly waiting for this and another one. Any idea 
when they might show up in distro kernels?

Thanks and best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Xiubo Li 
Sent: Tuesday, August 8, 2023 2:57 AM
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: MDS stuck in rejoin


On 8/7/23 21:54, Frank Schilder wrote:

Dear Xiubo,

I managed to collect some information. It looks like there is nothing in the 
dmesg log around the time the client failed to advance its TID. I collected 
short snippets around the critical time below. I have full logs in case you are 
interested. Its large files, I will need to do an upload for that.

I also have a dump of "mds session ls" output for clients that showed the same 
issue later. Unfortunately, no consistent log information for a single incident.

Here the summary, please let me know if uploading the full package makes sense:

- Status:

On July 29, 2023

ceph status/df/pool stats/health detail at 01:05:14:
 cluster:
   health: HEALTH_WARN
   1 pools nearfull

ceph status/df/pool stats/health detail at 01:05:28:
 cluster:
   health: HEALTH_WARN
   1 clients failing to advance oldest client/flush tid
   1 pools nearfull

Okay, then this could be the root cause.

If the pool is nearfull, it could block flushing the journal logs to the pool,
and then the MDS couldn't safely reply to the requests and would block them
like this.

Could you fix the pool nearfull issue first and then check whether you see
it again?



[...]

On July 31, 2023

ceph status/df/pool stats/health detail at 10:36:16:
 cluster:
   health: HEALTH_WARN
   1 clients failing to advance oldest client/flush tid
   1 pools nearfull

 cluster:
   health: HEALTH_WARN
   1 pools nearfull

- client evict command (date, time, command):

2023-07-31 10:36  ceph tell mds.ceph-11 client evict id=145678457

We have a 1h time difference between the date stamp of the command and the 
dmesg date stamps. However, there seems to be a weird 10min delay from issuing 
the evict command until it shows up in dmesg on the client.

- dmesg:

[Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey
[Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey
[Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey
[Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey
[Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey
[Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey
[Fri Jul 28 16:07:47 2023] slurm.epilog.cl (24175): drop_caches: 3
[Sat Jul 29 18:21:30 2023] libceph: mds2 192.168.32.75:6801 socket closed (con 
state OPEN)
[Sat Jul 29 18:21:30 2023] libceph: mds2 192.168.32.75:6801 socket closed (con 
state OPEN)
[Sat Jul 29 18:21:30 2023] libceph: mds2 192.168.32.75:6801 socket closed (con 
state OPEN)
[Sat Jul 29 18:21:42 2023] ceph: mds2 reconnect start
[Sat Jul 29 18:21:42 2023] ceph: mds2 reconnect start
[Sat Jul 

[ceph-users] Re: reef 18.2.1 QE Validation status

2023-11-09 Thread Yuri Weinstein
I've updated all approvals and merged PRs in the tracker and it looks
like we are ready for gibba, LRC upgrades pending approval/update from
Venky.

On Thu, Nov 9, 2023 at 1:31 PM Radoslaw Zarzynski  wrote:
>
> rados approved!
>
> Details are here: 
> https://tracker.ceph.com/projects/rados/wiki/REEF#1821-Review.
>
> On Mon, Nov 6, 2023 at 10:33 PM Yuri Weinstein  wrote:
> >
> > Details of this release are summarized here:
> >
> > https://tracker.ceph.com/issues/63443#note-1
> >
> > Seeking approvals/reviews for:
> >
> > smoke - Laura, Radek, Prashant, Venky (POOL_APP_NOT_ENABLE failures)
> > rados - Neha, Radek, Travis, Ernesto, Adam King
> > rgw - Casey
> > fs - Venky
> > orch - Adam King
> > rbd - Ilya
> > krbd - Ilya
> > upgrade/quincy-x (reef) - Laura PTL
> > powercycle - Brad
> > perf-basic - Laura, Prashant (POOL_APP_NOT_ENABLE failures)
> >
> > Please reply to this email with approval and/or trackers of known
> > issues/PRs to address them.
> >
> > TIA
> > YuriW
> > ___
> > Dev mailing list -- d...@ceph.io
> > To unsubscribe send an email to dev-le...@ceph.io
> >
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph fs (meta) data inconsistent

2023-11-09 Thread Frank Schilder
Hi Xiubo,

I will try to answer questions from all your 3 e-mails here together with some 
new information we have.

New: The problem occurs in newer python versions when using the shutil.copy 
function. There is also a function shutil.copy2 for which the problem does not 
show up. Copy2 behaves a bit like "cp -p" while copy is like "cp". The only 
code difference (linux) between these 2 functions is that copy calls 
copyfile+copymode while copy2 calls copyfile+copystat. For now we asked our 
users to use copy2 to avoid the issue.

The copyfile function calls _fastcopy_sendfile on linux, which in turn calls 
os.sendfile, which seems to be part of libc:

#include <sys/sendfile.h>
ssize_t sendfile(int out_fd, int in_fd, off_t *offset, size_t count);

I'm wondering if using this function requires explicit meta-data updates or 
should be safe on ceph-fs. I'm also not sure if a user-space client even 
supports this function (seems to be meaningless). Should this function be safe 
to use on ceph kclient?
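For reference, a minimal Python sketch of the difference described above
(the paths are only examples; the sendfile call mirrors what copyfile's
fast path does on Linux):

    import os
    import shutil

    src, dst = "input.dat", "/mnt/cephfs/output.dat"

    # copy(): file data plus permission bits only (copyfile + copymode)
    shutil.copy(src, dst)

    # copy2(): file data plus full metadata, e.g. mtime/atime (copyfile + copystat)
    shutil.copy2(src, dst)

    # rough equivalent of the zero-copy data path used by copyfile on Linux
    with open(src, "rb") as fin, open(dst + ".sendfile", "wb") as fout:
        os.sendfile(fout.fileno(), fin.fileno(), 0, os.stat(src).st_size)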

Answers to questions:

> BTW, have you tested ceph-fuse with the same test? Is it also the same?
I don't have fuse clients available, so can't test right now.

> Have you tried other ceph versions?
We are in the process of deploying a new test cluster, the old one is scrapped 
already. I can't test this at the moment.

> It looks like the cap update request was dropped to the ground in MDS.
> [...]
> If you can reproduce it, then please provide the mds logs by setting:
> [...]
I can do a test with MDS logs at a high debug level. Before I do that, and looking at the 
Python findings above, is this something that should work on Ceph, or is it a 
Python issue?

Thanks for your help!
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: reef 18.2.1 QE Validation status

2023-11-09 Thread Yuri Weinstein
On Wed, Nov 8, 2023 at 6:33 AM Travis Nielsen  wrote:
>
> Yuri, we need to add this issue as a blocker for 18.2.1. We discovered this 
> issue after the release of 17.2.7, and don't want to hit the same blocker in 
> 18.2.1 where some types of OSDs are failing to be created in new clusters, or 
> failing to start in upgraded clusters.
> https://tracker.ceph.com/issues/63391

https://tracker.ceph.com/issues/63391 was resolved and
https://github.com/ceph/ceph/pull/54395/ merged

>
> Thanks!
> Travis
>
> On Wed, Nov 8, 2023 at 4:41 AM Venky Shankar  wrote:
>>
>> Hi Yuri,
>>
>> On Wed, Nov 8, 2023 at 2:32 AM Yuri Weinstein  wrote:
>> >
>> > 3 PRs above mentioned were merged and I am returning some tests:
>> > https://pulpito.ceph.com/?sha1=55e3239498650453ff76a9b06a37f1a6f488c8fd
>> >
>> > Still seeking approvals.
>> > smoke - Laura, Radek, Prashant, Venky in progress
>> > rados - Neha, Radek, Travis, Ernesto, Adam King
>> > rgw - Casey in progress
>> > fs - Venky
>>
>> There's a failure in the fs suite
>>
>> 
>> https://pulpito.ceph.com/vshankar-2023-11-07_05:14:36-fs-reef-release-distro-default-smithi/7450325/
>>
>> Seems to be related to nfs-ganesha. I've reached out to Frank Filz
>> (#cephfs on ceph slack) to have a look. Will update as soon as
>> possible.
>>
>> > orch - Adam King
>> > rbd - Ilya approved
>> > krbd - Ilya approved
>> > upgrade/quincy-x (reef) - Laura PTL
>> > powercycle - Brad
>> > perf-basic - in progress
>> >
>> >
>> > On Tue, Nov 7, 2023 at 8:38 AM Casey Bodley  wrote:
>> > >
>> > > On Mon, Nov 6, 2023 at 4:31 PM Yuri Weinstein  
>> > > wrote:
>> > > >
>> > > > Details of this release are summarized here:
>> > > >
>> > > > https://tracker.ceph.com/issues/63443#note-1
>> > > >
>> > > > Seeking approvals/reviews for:
>> > > >
>> > > > smoke - Laura, Radek, Prashant, Venky (POOL_APP_NOT_ENABLE failures)
>> > > > rados - Neha, Radek, Travis, Ernesto, Adam King
>> > > > rgw - Casey
>> > >
>> > > rgw results are approved. https://github.com/ceph/ceph/pull/54371
>> > > merged to reef but is needed on reef-release
>> > >
>> > > > fs - Venky
>> > > > orch - Adam King
>> > > > rbd - Ilya
>> > > > krbd - Ilya
>> > > > upgrade/quincy-x (reef) - Laura PTL
>> > > > powercycle - Brad
>> > > > perf-basic - Laura, Prashant (POOL_APP_NOT_ENABLE failures)
>> > > >
>> > > > Please reply to this email with approval and/or trackers of known
>> > > > issues/PRs to address them.
>> > > >
>> > > > TIA
>> > > > YuriW
>> > > > ___
>> > > > ceph-users mailing list -- ceph-users@ceph.io
>> > > > To unsubscribe send an email to ceph-users-le...@ceph.io
>> > > >
>> > >
>> > ___
>> > ceph-users mailing list -- ceph-users@ceph.io
>> > To unsubscribe send an email to ceph-users-le...@ceph.io
>>
>>
>>
>> --
>> Cheers,
>> Venky
>> ___
>> Dev mailing list -- d...@ceph.io
>> To unsubscribe send an email to dev-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MDS stuck in rejoin

2023-11-09 Thread Frank Schilder
Hi Xiubo,

Great! I'm not sure if we observed this particular issue, but we did have the 
"oldest_client_tid updates not advancing" message in a context that might be 
related.

If this fix is not too large, it would be really great if it could be included 
in the last Pacific point release.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Xiubo Li 
Sent: Wednesday, November 8, 2023 1:38 AM
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: MDS stuck in rejoin

Hi Frank,

Recently I found a new possible case that could cause this, please see
https://github.com/ceph/ceph/pull/54259. This is just a ceph-side fix;
after this we need to fix it in kclient too, which hasn't been done yet.

Thanks

- Xiubo

On 8/8/23 17:44, Frank Schilder wrote:
> Dear Xiubo,
>
> the nearfull pool is an RBD pool and has nothing to do with the file system. 
> All pools for the file system have plenty of capacity.
>
> I think we have an idea what kind of workload caused the issue. We had a user 
> run a computation that reads the same file over and over again. He started 
> 100 such jobs in parallel and our storage servers were at 400% load. I saw 
> 167K read IOP/s on an HDD pool that has an aggregated raw IOP/s budget of ca. 
> 11K. Clearly, most of this was served from RAM.
>
> It is possible that this extreme load situation triggered a race that 
> remained undetected/unreported. There is literally no related message in any 
> logs near the time the warning started popping up. It shows up out of nowhere.
>
> We asked the user to change his workflow to use local RAM disk for the input 
> files. I don't think we can reproduce the problem anytime soon.
>
> About the bug fixes, I'm eagerly waiting for this and another one. Any idea 
> when they might show up in distro kernels?
>
> Thanks and best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Xiubo Li 
> Sent: Tuesday, August 8, 2023 2:57 AM
> To: Frank Schilder; ceph-users@ceph.io
> Subject: Re: [ceph-users] Re: MDS stuck in rejoin
>
>
> On 8/7/23 21:54, Frank Schilder wrote:
>> Dear Xiubo,
>>
>> I managed to collect some information. It looks like there is nothing in the 
>> dmesg log around the time the client failed to advance its TID. I collected 
>> short snippets around the critical time below. I have full logs in case you 
>> are interested. Its large files, I will need to do an upload for that.
>>
>> I also have a dump of "mds session ls" output for clients that showed the 
>> same issue later. Unfortunately, no consistent log information for a single 
>> incident.
>>
>> Here the summary, please let me know if uploading the full package makes 
>> sense:
>>
>> - Status:
>>
>> On July 29, 2023
>>
>> ceph status/df/pool stats/health detail at 01:05:14:
>> cluster:
>>   health: HEALTH_WARN
>>   1 pools nearfull
>>
>> ceph status/df/pool stats/health detail at 01:05:28:
>> cluster:
>>   health: HEALTH_WARN
>>   1 clients failing to advance oldest client/flush tid
>>   1 pools nearfull
> Okay, then this could be the root cause.
>
> If the pool is nearfull, it could block flushing the journal logs to the pool,
> and then the MDS couldn't safely reply to the requests and would block them
> like this.
>
> Could you fix the pool nearfull issue first and then check whether you see
> it again?
>
>
>> [...]
>>
>> On July 31, 2023
>>
>> ceph status/df/pool stats/health detail at 10:36:16:
>> cluster:
>>   health: HEALTH_WARN
>>   1 clients failing to advance oldest client/flush tid
>>   1 pools nearfull
>>
>> cluster:
>>   health: HEALTH_WARN
>>   1 pools nearfull
>>
>> - client evict command (date, time, command):
>>
>> 2023-07-31 10:36  ceph tell mds.ceph-11 client evict id=145678457
>>
>> We have a 1h time difference between the date stamp of the command and the 
>> dmesg date stamps. However, there seems to be a weird 10min delay from 
>> issuing the evict command until it shows up in dmesg on the client.
>>
>> - dmesg:
>>
>> [Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey
>> [Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey
>> [Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey
>> [Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey
>> [Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey
>> [Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey
>> [Fri Jul 28 16:07:47 2023] slurm.epilog.cl (24175): drop_caches: 3
>> [Sat Jul 29 18:21:30 2023] libceph: mds2 192.168.32.75:6801 socket closed 
>> (con state OPEN)
>> [Sat Jul 29 18:21:30 2023] libceph: mds2 192.168.32.75:6801 socket closed 
>> (con state OPEN)
>> [Sat Jul 29 18:21:30 2023] libceph: mds2 192.168.32.75:6801 socket closed 
>> (con state OPEN)
>> [Sat Jul 29 18:21:42 2023] ceph: mds2 

[ceph-users] Re: Redeploy ceph orch OSDs after reboot, but don't mark as 'unmanaged'

2023-11-09 Thread Janek Bevendorff



I meant this one: https://tracker.ceph.com/issues/55395


Ah, alright, almost forgot about that one.


Is there an "unmanaged: true" statement in this output?
ceph orch ls osd --export


No, it only contains the managed services that I configured.

Just out of curiosity, is there a "service_name" in your unit.meta for 
that OSD?


grep service_name /var/lib/ceph/{fsid}/osd.{id}/unit.meta


Indeed! It says "osd" for all the unmanaged OSDs. When I change it to 
the name of my managed service and restart the daemon, it shows up in 
ceph orch ps --service-name. I checked whether cephadm deploy perhaps 
has an undocumented flag for setting the service name, but couldn't find 
any. I could run deploy, change the service name and then restart the 
service, but that's quite ugly. Any better ideas?
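For the record, a sketch of that manual workaround (the fsid variable, OSD id
and spec name are only examples):

    fsid=$(ceph fsid)
    grep service_name /var/lib/ceph/$fsid/osd.96/unit.meta
    # rewrite "osd" to the name of the managed spec, then restart the daemon
    sed -i 's/"service_name": "osd"/"service_name": "osd.my-spec"/' \
        /var/lib/ceph/$fsid/osd.96/unit.meta
    systemctl restart ceph-$fsid@osd.96.service
    ceph orch ps --service-name osd.my-spec   # the OSD should now be listed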


Janek






Zitat von Janek Bevendorff :


Hi Eugen,

I stopped one OSD (which was deployed by ceph orch before) and this 
is what the MGR log says:


2023-11-09T13:35:36.941+ 7f067f1f0700  0 [cephadm DEBUG 
cephadm.services.osd] osd id 96 daemon already exists


Before and after that are JSON dumps of the LVM properties of all 
OSDs. I get the same messages when I delete all files under 
/var/lib/ceph//osd.96 and the OSD service symlink in 
/etc/systemd/system/.


ceph cephadm osd activate --verbose only shows this:

[{'flags': 8,
  'help': 'Start OSD containers for existing OSDs',
  'module': 'mgr',
  'perm': 'rw',
  'sig': [argdesc(, req=True, 
name=prefix, n=1, numseen=0, prefix=cephadm),
  argdesc(, req=True, 
name=prefix, n=1, numseen=0, prefix=osd),
  argdesc(, req=True, 
name=prefix, n=1, numseen=0, prefix=activate),
  argdesc(, req=True, 
name=host, n=N, numseen=0)]}]
Submitting command:  {'prefix': 'cephadm osd activate', 'host': 
['XXX'], 'target': ('mon-mgr', '')}
submit {"prefix": "cephadm osd activate", "host": ["XXX"], "target": 
["mon-mgr", ""]} to mon-mgr

Created no osd(s) on host XXX; already created?

I suspect that it doesn't work for OSDs that are not explicitly 
marked as managed by ceph orch. But how do I do that?



I also commented the tracker issue you referred to.


Which issue exactly do you mean?

Janek




Thanks,
Eugen

Zitat von Janek Bevendorff :

Actually, ceph cephadm osd activate doesn't do what I expected it 
to do. It  seems to be looking for new OSDs to create instead of 
looking for existing OSDs to activate. Hence, it does nothing on my 
hosts and only prints 'Created no osd(s) on host XXX; already 
created?' So this wouldn't be an option either, even if I were 
willing to deploy the admin key on the OSD hosts.



On 07/11/2023 11:41, Janek Bevendorff wrote:

Hi,

We have our cluster RAM-booted, so we start from a clean slate 
after every reboot. That means I need to redeploy all OSD daemons 
as well. At the moment, I run cephadm deploy via Salt on the 
rebooted node, which brings the deployed OSDs back up, but the 
problem with this is that the deployed OSD shows up as 'unmanaged' 
in ceph orch ps afterwards.


I could simply skip the cephadm call and wait for the Ceph 
orchestrator to reconcile and auto-activate the disks, but that 
can take up to 15 minutes, which is unacceptable. Running ceph 
cephadm osd activate is not an option either, since I don't have 
the admin keyring deployed on the OSD hosts (I could do that, but 
I don't want to).


How can I manually activate the OSDs after a reboot and hand over 
control to the Ceph orchestrator afterwards? I checked the 
deployments in /var/lib/ceph/, but the only difference I 
found between my manual cephadm deployment and what ceph orch does 
is that the device links to /dev/mapper/ceph--... instead of 
/dev/ceph-...


Any hints appreciated!

Janek



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io






smime.p7s
Description: S/MIME Cryptographic Signature
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: reef 18.2.1 QE Validation status

2023-11-09 Thread Casey Bodley
On Wed, Nov 8, 2023 at 11:10 AM Yuri Weinstein  wrote:
>
> We merged 3 PRs and rebuilt "reef-release" (Build 2)
>
> Seeking approvals/reviews for:
>
> smoke - Laura, Radek 2 jobs failed in "objectstore/bluestore" tests
> (see Build 2)
> rados - Neha, Radek, Travis, Ernesto, Adam King
> rgw - Casey reapprove on Build 2

rgw reapproved

> fs - Venky, approve on Build 2
> orch - Adam King
> upgrade/quincy-x (reef) - Laura PTL
> powercycle - Brad (known issues)
>
> We need to close
> https://tracker.ceph.com/issues/63391
> (https://github.com/ceph/ceph/pull/54392) - Travis, Guillaume
> https://tracker.ceph.com/issues/63151 - Adam King do we need anything for 
> this?
>
> On Wed, Nov 8, 2023 at 6:33 AM Travis Nielsen  wrote:
> >
> > Yuri, we need to add this issue as a blocker for 18.2.1. We discovered this 
> > issue after the release of 17.2.7, and don't want to hit the same blocker 
> > in 18.2.1 where some types of OSDs are failing to be created in new 
> > clusters, or failing to start in upgraded clusters.
> > https://tracker.ceph.com/issues/63391
> >
> > Thanks!
> > Travis
> >
> > On Wed, Nov 8, 2023 at 4:41 AM Venky Shankar  wrote:
> >>
> >> Hi Yuri,
> >>
> >> On Wed, Nov 8, 2023 at 2:32 AM Yuri Weinstein  wrote:
> >> >
> >> > 3 PRs above mentioned were merged and I am returning some tests:
> >> > https://pulpito.ceph.com/?sha1=55e3239498650453ff76a9b06a37f1a6f488c8fd
> >> >
> >> > Still seeking approvals.
> >> > smoke - Laura, Radek, Prashant, Venky in progress
> >> > rados - Neha, Radek, Travis, Ernesto, Adam King
> >> > rgw - Casey in progress
> >> > fs - Venky
> >>
> >> There's a failure in the fs suite
> >>
> >> 
> >> https://pulpito.ceph.com/vshankar-2023-11-07_05:14:36-fs-reef-release-distro-default-smithi/7450325/
> >>
> >> Seems to be related to nfs-ganesha. I've reached out to Frank Filz
> >> (#cephfs on ceph slack) to have a look. Will update as soon as
> >> possible.
> >>
> >> > orch - Adam King
> >> > rbd - Ilya approved
> >> > krbd - Ilya approved
> >> > upgrade/quincy-x (reef) - Laura PTL
> >> > powercycle - Brad
> >> > perf-basic - in progress
> >> >
> >> >
> >> > On Tue, Nov 7, 2023 at 8:38 AM Casey Bodley  wrote:
> >> > >
> >> > > On Mon, Nov 6, 2023 at 4:31 PM Yuri Weinstein  
> >> > > wrote:
> >> > > >
> >> > > > Details of this release are summarized here:
> >> > > >
> >> > > > https://tracker.ceph.com/issues/63443#note-1
> >> > > >
> >> > > > Seeking approvals/reviews for:
> >> > > >
> >> > > > smoke - Laura, Radek, Prashant, Venky (POOL_APP_NOT_ENABLE failures)
> >> > > > rados - Neha, Radek, Travis, Ernesto, Adam King
> >> > > > rgw - Casey
> >> > >
> >> > > rgw results are approved. https://github.com/ceph/ceph/pull/54371
> >> > > merged to reef but is needed on reef-release
> >> > >
> >> > > > fs - Venky
> >> > > > orch - Adam King
> >> > > > rbd - Ilya
> >> > > > krbd - Ilya
> >> > > > upgrade/quincy-x (reef) - Laura PTL
> >> > > > powercycle - Brad
> >> > > > perf-basic - Laura, Prashant (POOL_APP_NOT_ENABLE failures)
> >> > > >
> >> > > > Please reply to this email with approval and/or trackers of known
> >> > > > issues/PRs to address them.
> >> > > >
> >> > > > TIA
> >> > > > YuriW
> >> > > > ___
> >> > > > ceph-users mailing list -- ceph-users@ceph.io
> >> > > > To unsubscribe send an email to ceph-users-le...@ceph.io
> >> > > >
> >> > >
> >> > ___
> >> > ceph-users mailing list -- ceph-users@ceph.io
> >> > To unsubscribe send an email to ceph-users-le...@ceph.io
> >>
> >>
> >>
> >> --
> >> Cheers,
> >> Venky
> >> ___
> >> Dev mailing list -- d...@ceph.io
> >> To unsubscribe send an email to dev-le...@ceph.io
> ___
> Dev mailing list -- d...@ceph.io
> To unsubscribe send an email to dev-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: HDD cache

2023-11-09 Thread quag...@bol.com.br
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Redeploy ceph orch OSDs after reboot, but don't mark as 'unmanaged'

2023-11-09 Thread Eugen Block

I meant this one: https://tracker.ceph.com/issues/55395
Is there an "unmanaged: true" statement in this output?

ceph orch ls osd --export

Just out of curiosity, is there a "service_name" in your unit.meta for  
that OSD?


grep service_name /var/lib/ceph/{fsid}/osd.{id}/unit.meta


Zitat von Janek Bevendorff :


Hi Eugen,

I stopped one OSD (which was deployed by ceph orch before) and this  
is what the MGR log says:


2023-11-09T13:35:36.941+ 7f067f1f0700  0 [cephadm DEBUG  
cephadm.services.osd] osd id 96 daemon already exists


Before and after that are JSON dumps of the LVM properties of all  
OSDs. I get the same messages when I delete all files under  
/var/lib/ceph//osd.96 and the OSD service symlink in  
/etc/systemd/system/.


ceph cephadm osd activate --verbose only shows this:

[{'flags': 8,
  'help': 'Start OSD containers for existing OSDs',
  'module': 'mgr',
  'perm': 'rw',
  'sig': [argdesc(, req=True,  
name=prefix, n=1, numseen=0, prefix=cephadm),
  argdesc(, req=True,  
name=prefix, n=1, numseen=0, prefix=osd),
  argdesc(, req=True,  
name=prefix, n=1, numseen=0, prefix=activate),
  argdesc(, req=True,  
name=host, n=N, numseen=0)]}]
Submitting command:  {'prefix': 'cephadm osd activate', 'host':  
['XXX'], 'target': ('mon-mgr', '')}
submit {"prefix": "cephadm osd activate", "host": ["XXX"], "target":  
["mon-mgr", ""]} to mon-mgr

Created no osd(s) on host XXX; already created?

I suspect that it doesn't work for OSDs that are not explicitly  
marked as managed by ceph orch. But how do I do that?



I also commented the tracker issue you referred to.


Which issue exactly do you mean?

Janek




Thanks,
Eugen

Zitat von Janek Bevendorff :

Actually, ceph cephadm osd activate doesn't do what I expected it  
to do. It  seems to be looking for new OSDs to create instead of  
looking for existing OSDs to activate. Hence, it does nothing on  
my hosts and only prints 'Created no osd(s) on host XXX; already  
created?' So this wouldn't be an option either, even if I were  
willing to deploy the admin key on the OSD hosts.



On 07/11/2023 11:41, Janek Bevendorff wrote:

Hi,

We have our cluster RAM-booted, so we start from a clean slate  
after every reboot. That means I need to redeploy all OSD daemons  
as well. At the moment, I run cephadm deploy via Salt on the  
rebooted node, which brings the deployed OSDs back up, but the  
problem with this is that the deployed OSD shows up as  
'unmanaged' in ceph orch ps afterwards.


I could simply skip the cephadm call and wait for the Ceph  
orchestrator to reconcile and auto-activate the disks, but that  
can take up to 15 minutes, which is unacceptable. Running ceph  
cephadm osd activate is not an option either, since I don't have  
the admin keyring deployed on the OSD hosts (I could do that, but  
I don't want to).


How can I manually activate the OSDs after a reboot and hand over  
control to the Ceph orchestrator afterwards? I checked the  
deployments in /var/lib/ceph/, but the only difference I  
found between my manual cephadm deployment and what ceph orch  
does is that the device links to /dev/mapper/ceph--... instead of  
/dev/ceph-...


Any hints appreciated!

Janek



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


--
Bauhaus-Universität Weimar
Bauhausstr. 9a, R308
99423 Weimar, Germany

Phone: +49 3643 58 3577
www.webis.de



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Redeploy ceph orch OSDs after reboot, but don't mark as 'unmanaged'

2023-11-09 Thread Janek Bevendorff

Hi Eugen,

I stopped one OSD (which was deployed by ceph orch before) and this is 
what the MGR log says:


2023-11-09T13:35:36.941+ 7f067f1f0700  0 [cephadm DEBUG 
cephadm.services.osd] osd id 96 daemon already exists


Before and after that are JSON dumps of the LVM properties of all OSDs. 
I get the same messages when I delete all files under 
/var/lib/ceph//osd.96 and the OSD service symlink in 
/etc/systemd/system/.


ceph cephadm osd activate --verbose only shows this:

[{'flags': 8,
  'help': 'Start OSD containers for existing OSDs',
  'module': 'mgr',
  'perm': 'rw',
  'sig': [argdesc(, req=True, 
name=prefix, n=1, numseen=0, prefix=cephadm),
  argdesc(, req=True, 
name=prefix, n=1, numseen=0, prefix=osd),
  argdesc(, req=True, 
name=prefix, n=1, numseen=0, prefix=activate),
  argdesc(, req=True, 
name=host, n=N, numseen=0)]}]
Submitting command:  {'prefix': 'cephadm osd activate', 'host': ['XXX'], 
'target': ('mon-mgr', '')}
submit {"prefix": "cephadm osd activate", "host": ["XXX"], "target": 
["mon-mgr", ""]} to mon-mgr

Created no osd(s) on host XXX; already created?

I suspect that it doesn't work for OSDs that are not explicitly marked 
as managed by ceph orch. But how do I do that?



I also commented the tracker issue you referred to.


Which issue exactly do you mean?

Janek




Thanks,
Eugen

Zitat von Janek Bevendorff :

Actually, ceph cephadm osd activate doesn't do what I expected it to 
do. It  seems to be looking for new OSDs to create instead of looking 
for existing OSDs to activate. Hence, it does nothing on my hosts and 
only prints 'Created no osd(s) on host XXX; already created?' So this 
wouldn't be an option either, even if I were willing to deploy the 
admin key on the OSD hosts.



On 07/11/2023 11:41, Janek Bevendorff wrote:

Hi,

We have our cluster RAM-booted, so we start from a clean slate after 
every reboot. That means I need to redeploy all OSD daemons as well. 
At the moment, I run cephadm deploy via Salt on the rebooted node, 
which brings the deployed OSDs back up, but the problem with this is 
that the deployed OSD shows up as 'unmanaged' in ceph orch ps 
afterwards.


I could simply skip the cephadm call and wait for the Ceph 
orchestrator to reconcile and auto-activate the disks, but that can 
take up to 15 minutes, which is unacceptable. Running ceph cephadm 
osd activate is not an option either, since I don't have the admin 
keyring deployed on the OSD hosts (I could do that, but I don't want 
to).


How can I manually activate the OSDs after a reboot and hand over 
control to the Ceph orchestrator afterwards? I checked the 
deployments in /var/lib/ceph/, but the only difference I found 
between my manual cephadm deployment and what ceph orch does is that 
the device links to /dev/mapper/ceph--... instead of /dev/ceph-...


Any hints appreciated!

Janek



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


--
Bauhaus-Universität Weimar
Bauhausstr. 9a, R308
99423 Weimar, Germany

Phone: +49 3643 58 3577
www.webis.de



smime.p7s
Description: S/MIME Cryptographic Signature
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Help needed with Grafana password

2023-11-09 Thread Eugen Block
It's the '#' character: everything after it (including the '#' itself) is cut  
off. I tried with single and double quotes, which also failed. But as I  
already said, use a simple password and then change it within Grafana.  
That way you also don't have the actual password lying around in clear  
text in a yaml file...
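For reference, a sketch of the spec with a simple initial password (everything
except the password is taken from the spec further down in this thread; change
the password in the Grafana UI after the first login):

    service_type: grafana
    service_name: grafana
    placement:
      count: 2
      label: grafana
    spec:
      anonymous_access: true
      initial_admin_password: admin   # nothing after a '#' survives, so keep it simple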


Zitat von Eugen Block :

I just tried it on a 17.2.6 test cluster, although I don't have a  
stack trace the complicated password doesn't seem to be applied  
(don't know why yet). But since it's an "initial" password you can  
choose something simple like "admin", and during the first login you  
are asked to change it anyway. And then you can choose your more  
complicated password, I just verified that.


Zitat von Sake Ceph :

I tried everything at this point, even waited an hour, still no  
luck. Got it working once by accident, but with a placeholder  
for a password. Tried with the correct password, nothing, and trying  
again with the placeholder didn't work anymore.


So I thought to switch the manager, maybe something is not right  
(shouldn't happen). But applying the Grafana spec on the other mgr,  
I get the following error in the log files:


services/grafana/ceph-dashboard.yml.j2 Traceback (most recent call  
last): File "/usr/share/ceph/mgr/cephadm/template.py",
line 40, in render template = self.env.get_template(name) File  
"/lib/python3.6/site-packages/jinja2/environment.py",
line 830, in get_template return self._load_template(name,  
self.make_globals(globals)) File  
"/lib/python3.6/site-packages/jinja2/environment.py",
line 804, in _load_template template = self.loader.load(self, name,  
globals) File "/lib/python3.6/site-packages/jinja2/loaders.py",
line 113, in load source, filename, uptodate =  
self.get_source(environment, name) File  
"/lib/python3.6/site-packages/jinja2/loaders.py",
line 235, in get_source raise TemplateNotFound(template)  
jinja2.exceptions.TemplateNotFound:  
services/grafana/ceph-dashboard.yml.j2


During handling of the above exception, another exception occurred:  
Traceback (most recent call last): File  
"/usr/share/ceph/mgr/cephadm/serve.py",
line 1002, in _check_daemons self.mgr._daemon_action(daemon_spec,  
action=action) File "/usr/share/ceph/mgr/cephadm/module.py",
line 2131, in _daemon_action  
daemon_spec.daemon_type)].prepare_create(daemon_spec) File  
"/usr/share/ceph/mgr/cephadm/services/monitoring.py",
line 27, in prepare_create daemon_spec.final_config,  
daemon_spec.deps = self.generate_config(daemon_spec) File  
"/usr/share/ceph/mgr/cephadm/services/monitoring.py",
line 54, in generate_config  
'services/grafana/ceph-dashboard.yml.j2', {'hosts': prom_services,  
'loki_host': loki_host}) File  
"/usr/share/ceph/mgr/cephadm/template.py",
line 109, in render return self.engine.render(name, ctx) File  
"/usr/share/ceph/mgr/cephadm/template.py",
line 47, in render raise TemplateNotFoundError(e.message)  
cephadm.template.TemplateNotFoundError:  
services/grafana/ceph-dashboard.yml.j2


I use the following config for Grafana, nothing special.

service_type: grafana
service_name: grafana
placement:
 count: 2
 label: grafana
extra_container_args:
- -v=/opt/ceph_cert/host.cert:/etc/grafana/certs/cert_file:ro
- -v=/opt/ceph_cert/host.key:/etc/grafana/certs/cert_key:ro
spec:
 anonymous_access: true
 initial_admin_password: aPassw0rdWithSpecialChars-#



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph Dashboard - Community News Sticker [Feedback]

2023-11-09 Thread Anthony D'Atri
IMHO we don't need yet another place to look for information, especially one 
that some operators never see.  ymmv.

> 
>> Hello,
>> 
>> We wanted to get some feedback on one of the features that we are planning
>> to bring in for upcoming releases.
>> 
>> On the Ceph GUI, we thought it could be interesting to show information
>> regarding the community events, ceph release information (Release notes and
>> changelogs) and maybe even notify about new blog post releases and also
>> inform regarding the community group meetings. There would be options to
>> subscribe to the events that you want to get notified.
>> 
>> Before proceeding with its implementation, we thought it'd be good to get
>> some community feedback around it. So please let us know what you think
>> (the goods and the bads).
>> 
>> Regards,
>> --
>> 
>> Nizamudeen A
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Help needed with Grafana password

2023-11-09 Thread Eugen Block
I just tried it on a 17.2.6 test cluster, although I don't have a  
stack trace the complicated password doesn't seem to be applied (don't  
know why yet). But since it's an "initial" password you can choose  
something simple like "admin", and during the first login you are  
asked to change it anyway. And then you can choose your more  
complicated password, I just verified that.


Zitat von Sake Ceph :

I tried everything at this point, even waited an hour, still no luck.  
Got it working once by accident, but with a placeholder for a  
password. Tried with the correct password, nothing, and trying again with  
the placeholder didn't work anymore.


So I thought to switch the manager, maybe something is not right  
(shouldn't happen). But applying the Grafana spec on the other mgr,  
I get the following error in the log files:


services/grafana/ceph-dashboard.yml.j2 Traceback (most recent call  
last): File "/usr/share/ceph/mgr/cephadm/template.py",
line 40, in render template = self.env.get_template(name) File  
"/lib/python3.6/site-packages/jinja2/environment.py",
line 830, in get_template return self._load_template(name,  
self.make_globals(globals)) File  
"/lib/python3.6/site-packages/jinja2/environment.py",
line 804, in _load_template template = self.loader.load(self, name,  
globals) File "/lib/python3.6/site-packages/jinja2/loaders.py",
line 113, in load source, filename, uptodate =  
self.get_source(environment, name) File  
"/lib/python3.6/site-packages/jinja2/loaders.py",
line 235, in get_source raise TemplateNotFound(template)  
jinja2.exceptions.TemplateNotFound:  
services/grafana/ceph-dashboard.yml.j2


During handling of the above exception, another exception occurred:  
Traceback (most recent call last): File  
"/usr/share/ceph/mgr/cephadm/serve.py",
line 1002, in _check_daemons self.mgr._daemon_action(daemon_spec,  
action=action) File "/usr/share/ceph/mgr/cephadm/module.py",
line 2131, in _daemon_action  
daemon_spec.daemon_type)].prepare_create(daemon_spec) File  
"/usr/share/ceph/mgr/cephadm/services/monitoring.py",
line 27, in prepare_create daemon_spec.final_config,  
daemon_spec.deps = self.generate_config(daemon_spec) File  
"/usr/share/ceph/mgr/cephadm/services/monitoring.py",
line 54, in generate_config  
'services/grafana/ceph-dashboard.yml.j2', {'hosts': prom_services,  
'loki_host': loki_host}) File  
"/usr/share/ceph/mgr/cephadm/template.py",
line 109, in render return self.engine.render(name, ctx) File  
"/usr/share/ceph/mgr/cephadm/template.py",
line 47, in render raise TemplateNotFoundError(e.message)  
cephadm.template.TemplateNotFoundError:  
services/grafana/ceph-dashboard.yml.j2


I use the following config for Grafana, nothing special.

service_type: grafana
service_name: grafana
placement:
  count: 2
  label: grafana
extra_container_args:
- -v=/opt/ceph_cert/host.cert:/etc/grafana/certs/cert_file:ro
- -v=/opt/ceph_cert/host.key:/etc/grafana/certs/cert_key:ro
spec:
  anonymous_access: true
  initial_admin_password: aPassw0rdWithSpecialChars-#



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Stretch mode size

2023-11-09 Thread Sake Ceph
I believe they are working on it, or want to work on it, to allow reverting from a 
stretched cluster, for exactly the reason you mention: if the other datacenter 
is totally burned down, you may want to switch to a single-datacenter setup 
for the time being.

Best regards, 
Sake
> Op 09-11-2023 11:18 CET schreef Eugen Block :
> 
>  
> Hi,
> 
> I'd like to ask for confirmation how I understand the docs on stretch  
> mode [1]. It requires exact size 4 for the rule? Other sizes are not  
> supported/won't work, for example size 6? Are there clusters out there  
> which use this stretch mode?
> Once stretch mode is enabled, it's not possible to get out of it. How  
> would one deal with a burnt down datacenter which can take months to  
> rebuild? In a "self-managed" stretch cluster (let's say size 6) I  
> could simply change the crush rule to not consider the failed  
> datacenter anymore, deploy an additional mon somewhere and maybe  
> reduce the size/min_size. Am I missing something?
> 
> Thanks,
> Eugen
> 
> [1] https://docs.ceph.com/en/reef/rados/operations/stretch-mode/#id2
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Help needed with Grafana password

2023-11-09 Thread Sake Ceph
I tried everything at this point, even waited an hour, still no luck. Got it 
working once by accident, but with a placeholder for a password. Tried with 
the correct password, nothing, and trying again with the placeholder didn't 
work anymore. 

So I thought to switch the manager, maybe something is not right (shouldn't 
happen). But applying the Grafana spec on the other mgr, I get the following 
error in the log files:

services/grafana/ceph-dashboard.yml.j2
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/template.py", line 40, in render
    template = self.env.get_template(name)
  File "/lib/python3.6/site-packages/jinja2/environment.py", line 830, in get_template
    return self._load_template(name, self.make_globals(globals))
  File "/lib/python3.6/site-packages/jinja2/environment.py", line 804, in _load_template
    template = self.loader.load(self, name, globals)
  File "/lib/python3.6/site-packages/jinja2/loaders.py", line 113, in load
    source, filename, uptodate = self.get_source(environment, name)
  File "/lib/python3.6/site-packages/jinja2/loaders.py", line 235, in get_source
    raise TemplateNotFound(template)
jinja2.exceptions.TemplateNotFound: services/grafana/ceph-dashboard.yml.j2

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 1002, in _check_daemons
    self.mgr._daemon_action(daemon_spec, action=action)
  File "/usr/share/ceph/mgr/cephadm/module.py", line 2131, in _daemon_action
    daemon_spec.daemon_type)].prepare_create(daemon_spec)
  File "/usr/share/ceph/mgr/cephadm/services/monitoring.py", line 27, in prepare_create
    daemon_spec.final_config, daemon_spec.deps = self.generate_config(daemon_spec)
  File "/usr/share/ceph/mgr/cephadm/services/monitoring.py", line 54, in generate_config
    'services/grafana/ceph-dashboard.yml.j2', {'hosts': prom_services, 'loki_host': loki_host})
  File "/usr/share/ceph/mgr/cephadm/template.py", line 109, in render
    return self.engine.render(name, ctx)
  File "/usr/share/ceph/mgr/cephadm/template.py", line 47, in render
    raise TemplateNotFoundError(e.message)
cephadm.template.TemplateNotFoundError: services/grafana/ceph-dashboard.yml.j2

I use the following config for Grafana, nothing special. 

service_type: grafana
service_name: grafana
placement:
  count: 2
  label: grafana
extra_container_args:
- -v=/opt/ceph_cert/host.cert:/etc/grafana/certs/cert_file:ro
- -v=/opt/ceph_cert/host.key:/etc/grafana/certs/cert_key:ro
spec:
  anonymous_access: true
  initial_admin_password: aPassw0rdWithSpecialChars-#
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph Dashboard - Community News Sticker [Feedback]

2023-11-09 Thread Reto Gysi
Hi,

No, I don't think so; it's not very useful at best, and bad at worst.

In the IT organizations I've worked in so far, any systems that actually store
data were in the highest security zone, where no incoming or outgoing
connections to the internet were allowed. Our systems couldn't even resolve
internet DNS names. Our IT Risk team would never allow such connections.
Access to the dashboard would also be limited to a few IT operations
people, who wouldn't have the interest or time to look through community events
and blog posts. Software that assumes it has free access to the internet always
causes more work and hassle for the people who have to manage it in IT
environments with a no-internet-access policy.

Thanks & Regards

Reto

Am Do., 9. Nov. 2023 um 07:36 Uhr schrieb Nizamudeen A :

> Hello,
>
> We wanted to get some feedback on one of the features that we are planning
> to bring in for upcoming releases.
>
> On the Ceph GUI, we thought it could be interesting to show information
> regarding the community events, ceph release information (Release notes and
> changelogs) and maybe even notify about new blog post releases and also
> inform regarding the community group meetings. There would be options to
> subscribe to the events that you want to get notified.
>
> Before proceeding with its implementation, we thought it'd be good to get
> some community feedback around it. So please let us know what you think
> (the goods and the bads).
>
> Regards,
> --
>
> Nizamudeen A
>
> Software Engineer
>
> Red Hat 
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Memory footprint of increased PG number

2023-11-09 Thread Eugen Block
I was going through the hardware recommendations for a customer and  
wanted to cite the memory section from the current docs [1]:


Setting the osd_memory_target below 2GB is not recommended. Ceph may  
fail to keep the memory consumption under 2GB and extremely slow  
performance is likely.
Setting the memory target between 2GB and 4GB typically works but  
may result in degraded performance: metadata may need to be read  
from disk during IO unless the active data set is relatively small.
4GB is the current default value for osd_memory_target. This default  
was chosen for typical use cases, and is intended to balance RAM  
cost and OSD performance.
Setting the osd_memory_target higher than 4GB can improve  
performance when there are many (small) objects or when large (256GB/OSD  
or more) data sets are processed. This is especially true with fast  
NVMe OSDs.


And further:

We recommend budgeting at least 20% extra memory on your system to  
prevent OSDs from going OOM (Out Of Memory) during temporary spikes  
or due to delay in the kernel reclaiming freed pages.


[1] https://docs.ceph.com/en/quincy/start/hardware-recommendations/#memory
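As a sketch, assuming the setting is managed through the MON config database
(the host name is only an example):

    ceph config get osd osd_memory_target              # default: 4294967296 (4 GiB)
    ceph config set osd osd_memory_target 6442450944   # e.g. 6 GiB per OSD
    # per-host override via the osd/host:<name> mask, for hosts with more RAM:
    ceph config set osd/host:nvme-host-01 osd_memory_target 8589934592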

Zitat von Eugen Block :


Hi,

I don't think increasing the number of PGs has an impact on the OSDs' memory;  
at least I'm not aware of such reports and haven't seen it myself.  
But your cluster could get into trouble as it already is: only 24 GB  
for 16 OSDs is too low. It can work (and apparently does) when  
everything is calm, but during recovery the memory usage spikes. The  
default is 4 GB per OSD, and there have been several reports over the  
years where users couldn't get their OSDs back up after a failure  
because of low memory settings. I'd recommend increasing RAM.


Regards,
Eugen

Zitat von Nicola Mori :


Dear Ceph user,

I'm wondering how much an increase in the PG number would impact the  
memory occupancy of the OSD daemons. In my cluster I currently have 512  
PGs and I would like to increase that to 1024 to mitigate some disk  
occupancy issues, but having machines with a low amount of memory  
(down to 24 GB for 16 OSDs) I fear this could kill my cluster. Is  
it possible to estimate the relative increase in OSD memory  
footprint when doubling the number of PGs (hopefully not linear  
scaling)? Or is there a way to experiment without crashing  
everything?

Thank you.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: one cephfs volume becomes very slow

2023-11-09 Thread Eugen Block
Do you see high disk utilization? Are those OSDs HDD-only, or do they at  
least have their DB on SSDs? I'd say the HDDs are the bottleneck.  
There was a recent thread [1] where Zakhar explained nicely how many  
IOPS you can expect from an HDD-only cluster. Maybe that helps.


[1]  
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/FPMCNPYIHBHIJLWVVG2ECI2DSTR6DZIO/
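As a rough worked example of that point (the numbers are illustrative
assumptions, not measurements from this cluster): an HDD-only pool with 72
OSDs at roughly 150 random IOPS each and size=3 replication can sustain only
about 72 * 150 / 3 = 3600 client write IOPS, so a hundred busy clients will
saturate it quickly.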


Zitat von Ben :


Dear cephers,

we have a CephFS volume that is mounted by many clients with
concurrent read/write access. From time to time, perhaps when concurrency
goes as high as 100 clients, access becomes too slow to be useful at all.
The cluster has multiple active MDS daemons. All disks are HDD.
Any ideas to improve this?

here is one of mds log during the slow time, others are simillar:

{"log":"debug 2023-11-08T07:26:00.114+ 7f190b014700  0
log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest
blocked for \u003e 5.662970
secs\n","stream":"stderr","time":"2023-11-08T07:26:00.121996282Z"}

{"log":"debug 2023-11-08T07:26:00.114+ 7f190b014700  0
log_channel(cluster) log [WRN] : slow request 5.662970 seconds old,
received at 2023-11-08T07:25:54.458863+:
peer_request:client.12917739:8654334 currently
dispatched\n","stream":"stderr","time":"2023-11-08T07:26:00.122016551Z"}

{"log":"debug 2023-11-08T07:29:54.118+ 7f190b014700  0
log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest
blocked for \u003e 11.900602
secs\n","stream":"stderr","time":"2023-11-08T07:29:54.124567293Z"}

{"log":"debug 2023-11-08T07:29:54.118+ 7f190b014700  0
log_channel(cluster) log [WRN] : slow request 11.900601 seconds old,
received at 2023-11-08T07:29:42.223813+:
client_request(client.27494331:18564666 getattr pAsLsXsFs #0x70001830366
2023-11-08T07:29:42.219416+ caller_uid=0, caller_gid=0{}) currently
failed to rdlock,
waiting\n","stream":"stderr","time":"2023-11-08T07:29:54.124589613Z"}

{"log":"debug 2023-11-08T07:30:00.118+ 7f190b014700  0
log_channel(cluster) log [WRN] : 5 slow requests, 5 included below; oldest
blocked for \u003e 17.900670
secs\n","stream":"stderr","time":"2023-11-08T07:30:00.124691442Z"}

{"log":"debug 2023-11-08T07:30:00.118+ 7f190b014700  0
log_channel(cluster) log [WRN] : slow request 17.900670 seconds old,
received at 2023-11-08T07:29:42.223813+:
client_request(client.27494331:18564666 getattr pAsLsXsFs #0x70001830366
2023-11-08T07:29:42.219416+ caller_uid=0, caller_gid=0{}) currently
failed to rdlock,
waiting\n","stream":"stderr","time":"2023-11-08T07:30:00.124726772Z"}

{"log":"debug 2023-11-08T07:30:00.118+ 7f190b014700  0
log_channel(cluster) log [WRN] : slow request 6.649942 seconds old,
received at 2023-11-08T07:29:53.474541+: client_request(mds.1:305661
rename #0x70001851b32/91e670f9004ddb237a353b2a9ddc063208f5
#0x649/800019f1da7 caller_uid=0, caller_gid=0{}) currently failed to
acquire_locks\n","stream":"stderr","time":"2023-11-08T07:30:00.124731626Z"}

{"log":"debug 2023-11-08T07:30:00.118+ 7f190b014700  0
log_channel(cluster) log [WRN] : slow request 6.649864 seconds old,
received at 2023-11-08T07:29:53.474619+: client_request(mds.1:305662
rename #0x70001851b32/91e670f9004ddb237a353b2a9ddc063208f5
#0x649/800019f1da7 caller_uid=0, caller_gid=0{}) currently requesting
remote
authpins\n","stream":"stderr","time":"2023-11-08T07:30:00.124734415Z"}

{"log":"debug 2023-11-08T07:30:00.118+ 7f190b014700  0
log_channel(cluster) log [WRN] : slow request 6.649719 seconds old,
received at 2023-11-08T07:29:53.474764+:
client_request(client.27497255:25173 getattr pAsLsXsFs #0x800019f1da7
2023-11-08T07:29:53.473182+ caller_uid=0, caller_gid=0{}) currently
requesting remote
authpins\n","stream":"stderr","time":"2023-11-08T07:30:00.124736973Z"}

{"log":"debug 2023-11-08T07:30:00.118+ 7f190b014700  0
log_channel(cluster) log [WRN] : slow request 6.648454 seconds old,
received at 2023-11-08T07:29:53.476029+: client_request(mds.1:305663
rename #0x70001851b32/91e670f9004ddb237a353b2a9ddc063208f5
#0x649/800019f1da7 caller_uid=0, caller_gid=0{}) currently requesting
remote
authpins\n","stream":"stderr","time":"2023-11-08T07:30:00.124739607Z"}

{"log":"debug 2023-11-08T07:43:30.127+ 7f190b014700  0
log_channel(cluster) log [WRN] : 2 slow requests, 2 included below; oldest
blocked for \u003e 5.206645
secs\n","stream":"stderr","time":"2023-11-08T07:43:30.133682292Z"}

{"log":"debug 2023-11-08T07:43:30.127+ 7f190b014700  0
log_channel(cluster) log [WRN] : slow request 5.206644 seconds old,
received at 2023-11-08T07:43:24.926862+:
client_request(client.27430891:5371608 mkdir #0x700018317cd/13
2023-11-08T07:43:24.924423+ caller_uid=0, caller_gid=0{}) currently
submit entry:
journal_and_reply\n","stream":"stderr","time":"2023-11-08T07:43:30.133708161Z"}

{"log":"debug 2023-11-08T07:43:30.127+ 7f190b014700  0
log_channel(cluster) log [WRN] : slow request 5.206209 seconds old,
received at 

[ceph-users] High iowait when using Ceph NVME

2023-11-09 Thread Huy Nguyen
Hi,
Currently I'm testing Ceph v17.2.7 with NVMe. When mapping an RBD image to the 
physical compute host, "fio bs=4k iodepth=128 randwrite" gives 150k IOPS. In a 
VM located on that compute host, the same fio job gives ~40k IOPS with 50% 
iowait. I know there is a bottleneck somewhere, but I'm not sure where it is.

I have tried enabling iothreads in virsh, but nothing changed. Does anyone have 
any ideas?
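
Not from the original mail, but one way to narrow down where the IOPS are
lost is to run the same job on the hypervisor (against the krbd-mapped
device) and inside the guest, and watch iowait in both places. Pool, image
and device names below are just examples:

# on the compute host, against a kernel-mapped scratch image (destructive!)
rbd map testpool/testimage
fio --name=host-test --filename=/dev/rbd0 --rw=randwrite --bs=4k \
    --iodepth=128 --ioengine=libaio --direct=1 --runtime=60 --time_based

# inside the VM, against a scratch virtual disk (destructive!)
fio --name=guest-test --filename=/dev/vdb --rw=randwrite --bs=4k \
    --iodepth=128 --ioengine=libaio --direct=1 --runtime=60 --time_based

# in both places, watch queue depth and utilization while fio runs
iostat -x 5

If the host-side numbers stay around 150k IOPS, the loss is most likely in
the virtio/QEMU path (iothreads, queues, cache mode) rather than in Ceph
itself.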

Thanks in advance.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Help needed with Grafana password

2023-11-09 Thread Eugen Block
Usually, removing the grafana service should be enough. I also have  
this directory (custom_config_files/grafana.) but it's  
empty. Can you confirm that after running 'ceph orch rm grafana' the  
service is actually gone ('ceph orch ls grafana')? The directory  
underneath /var/lib/ceph/{fsid}/grafana. should also be  
gone, can you confirm? Removing a service can take some time, so maybe  
wait a few minutes and then check again. And after you removed it, did  
you deploy again with your initial password? Maybe share the exact  
commands you used and the content of your grafana.yaml (mask sensitive  
data).
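
A rough sketch of that sequence (assuming the spec file is called
grafana.yaml and contains initial_admin_password):

# remove the service and wait until it no longer shows up
ceph orch rm grafana
ceph orch ls grafana

# on the host that ran Grafana, the data directory should disappear as well
ls /var/lib/ceph/<fsid>/ | grep grafana

# then re-apply the spec with the initial password
ceph orch apply -i grafana.yaml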


Zitat von Sake Ceph :


Using podman version 4.4.1 on RHEL 8.8, Ceph 17.2.7

I used 'podman system prune -a -f' and 'podman volume prune -f' to  
clean up files, but this leaves a lot of files behind in  
/var/lib/containers/storage/overlay and an empty folder  
/var/lib/ceph//custom_config_files/grafana..

Found those files with 'find / -name "*grafana*"'.


Op 09-11-2023 09:53 CET schreef Eugen Block :


What doesn't work exactly? For me it did...

Zitat von Sake Ceph :

> Too bad, that doesn't work :(
>> Op 09-11-2023 09:07 CET schreef Sake Ceph :
>>
>>
>> Hi,
>>
>> Well to get promtail working with Loki, you need to setup a
>> password in Grafana.
>> But promtail wasn't working with the 17.2.6 release, the URL was
>> set to containers.local. So I stopped using it, but forgot to click
>> on save in KeePass :(
>>
>> I didn't configure anything special in Grafana, the default
>> dashboards are great! So a wipe isn't a problem, it's what I want.
>>
>> Best regards,
>> Sake
>> > Op 09-11-2023 08:19 CET schreef Eugen Block :
>> >
>> >
>> > Hi,
>> > you mean you forgot your password? You can remove the service with
>> > 'ceph orch rm grafana', then re-apply your grafana.yaml containing the
>> > initial password. Note that this would remove all of the grafana
>> > configs or custom dashboards etc., you would have to reconfigure them.
>> > So before doing that you should verify that this is actually what
>> > you're looking for. Not sure what this has to do with Loki though.
>> >
>> > Eugen
>> >
>> > Zitat von Sake Ceph :
>> >
>> > > I configured a password for Grafana because I want to use Loki. I
>> > > used the spec parameter initial_admin_password and this  
works fine for a
>> > > staging environment, where I never tried to use Grafana
with a password

>> > > for Loki. 
>> > > 
>> > >Using the username admin with the configured password gives a
>> > > credentials error on environment where I tried to use Grafana
>> with Loki in
>> > > the past (with 17.2.6 of Ceph/cephadm). I changed the password
>> in the past
>> > > within Grafana, but how can I overwrite this now? Or is  
there a way to

>> > > cleanup all Grafana files? 
>> > > 
>> > >Best regards, 
>> > >Sake
>> >
>> >
>> > ___
>> > ceph-users mailing list -- ceph-users@ceph.io
>> > To unsubscribe send an email to ceph-users-le...@ceph.io
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Stretch mode size

2023-11-09 Thread Eugen Block

Hi,

I'd like to ask for confirmation of how I understand the docs on stretch  
mode [1]. Does it require exactly size 4 for the rule? Are other sizes,  
for example size 6, not supported and simply won't work? Are there  
clusters out there which use this stretch mode?
Once stretch mode is enabled, it's not possible to get out of it. How  
would one deal with a burnt-down datacenter which could take months to  
rebuild? In a "self-managed" stretch cluster (let's say size 6) I  
could simply change the crush rule to not consider the failed  
datacenter anymore, deploy an additional mon somewhere and maybe  
reduce the size/min_size. Am I missing something?


Thanks,
Eugen

[1] https://docs.ceph.com/en/reef/rados/operations/stretch-mode/#id2
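
For context, the commands from [1] that this refers to look roughly like
this (sketch only; mon, rule and bucket names are examples):

ceph mon set election_strategy connectivity
ceph mon set_location e datacenter=site3        # the tiebreaker mon
ceph mon enable_stretch_mode e stretch_rule datacenter

# if the tiebreaker mon is lost later, it can at least be replaced
ceph mon set_new_tiebreaker mon.f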

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: reef 18.2.1 QE Validation status

2023-11-09 Thread Venky Shankar
Hi Yuri,

On Wed, Nov 8, 2023 at 4:10 PM Venky Shankar  wrote:
>
> Hi Yuri,
>
> On Wed, Nov 8, 2023 at 2:32 AM Yuri Weinstein  wrote:
> >
> > 3 PRs above mentioned were merged and I am returning some tests:
> > https://pulpito.ceph.com/?sha1=55e3239498650453ff76a9b06a37f1a6f488c8fd
> >
> > Still seeking approvals.
> > smoke - Laura, Radek, Prashant, Venky in progress
> > rados - Neha, Radek, Travis, Ernesto, Adam King
> > rgw - Casey in progress
> > fs - Venky
>
> There's a failure in the fs suite
>
> 
> https://pulpito.ceph.com/vshankar-2023-11-07_05:14:36-fs-reef-release-distro-default-smithi/7450325/
>
> Seems to be related to nfs-ganesha. I've reached out to Frank Filz
> (#cephfs on ceph slack) to have a look. Will update as soon as
> possible.

Frank confirmed that this is a bug with nfs-ganesha that got
introduced lately and is fixed in the latest version (v5.7).

So, the pending thing for CephFS is the smoke failure that Laura
reported. More on that soon.

>
> > orch - Adam King
> > rbd - Ilya approved
> > krbd - Ilya approved
> > upgrade/quincy-x (reef) - Laura PTL
> > powercycle - Brad
> > perf-basic - in progress
> >
> >
> > On Tue, Nov 7, 2023 at 8:38 AM Casey Bodley  wrote:
> > >
> > > On Mon, Nov 6, 2023 at 4:31 PM Yuri Weinstein  wrote:
> > > >
> > > > Details of this release are summarized here:
> > > >
> > > > https://tracker.ceph.com/issues/63443#note-1
> > > >
> > > > Seeking approvals/reviews for:
> > > >
> > > > smoke - Laura, Radek, Prashant, Venky (POOL_APP_NOT_ENABLE failures)
> > > > rados - Neha, Radek, Travis, Ernesto, Adam King
> > > > rgw - Casey
> > >
> > > rgw results are approved. https://github.com/ceph/ceph/pull/54371
> > > merged to reef but is needed on reef-release
> > >
> > > > fs - Venky
> > > > orch - Adam King
> > > > rbd - Ilya
> > > > krbd - Ilya
> > > > upgrade/quincy-x (reef) - Laura PTL
> > > > powercycle - Brad
> > > > perf-basic - Laura, Prashant (POOL_APP_NOT_ENABLE failures)
> > > >
> > > > Please reply to this email with approval and/or trackers of known
> > > > issues/PRs to address them.
> > > >
> > > > TIA
> > > > YuriW
> > > > ___
> > > > ceph-users mailing list -- ceph-users@ceph.io
> > > > To unsubscribe send an email to ceph-users-le...@ceph.io
> > > >
> > >
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
>
> --
> Cheers,
> Venky



-- 
Cheers,
Venky
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Help needed with Grafana password

2023-11-09 Thread Sake Ceph
Using podman version 4.4.1 on RHEL 8.8, Ceph 17.2.7

I used 'podman system prune -a -f' and 'podman volume prune -f' to clean up 
files, but this leaves a lot of files behind in 
/var/lib/containers/storage/overlay and an empty folder 
/var/lib/ceph//custom_config_files/grafana..
Found those files with 'find / -name "*grafana*"'.

> Op 09-11-2023 09:53 CET schreef Eugen Block :
> 
>  
> What doesn't work exactly? For me it did...
> 
> Zitat von Sake Ceph :
> 
> > Too bad, that doesn't work :(
> >> Op 09-11-2023 09:07 CET schreef Sake Ceph :
> >>
> >>
> >> Hi,
> >>
> >> Well to get promtail working with Loki, you need to setup a  
> >> password in Grafana.
> >> But promtail wasn't working with the 17.2.6 release, the URL was  
> >> set to containers.local. So I stopped using it, but forgot to click  
> >> on save in KeePass :(
> >>
> >> I didn't configure anything special in Grafana, the default  
> >> dashboards are great! So a wipe isn't a problem, it's what I want.
> >>
> >> Best regards,
> >> Sake
> >> > Op 09-11-2023 08:19 CET schreef Eugen Block :
> >> >
> >> >
> >> > Hi,
> >> > you mean you forgot your password? You can remove the service with
> >> > 'ceph orch rm grafana', then re-apply your grafana.yaml containing the
> >> > initial password. Note that this would remove all of the grafana
> >> > configs or custom dashboards etc., you would have to reconfigure them.
> >> > So before doing that you should verify that this is actually what
> >> > you're looking for. Not sure what this has to do with Loki though.
> >> >
> >> > Eugen
> >> >
> >> > Zitat von Sake Ceph :
> >> >
> >> > > I configured a password for Grafana because I want to use Loki. I
> >> > > used the spec parameter initial_admin_password and this works fine for 
> >> > > a
> >> > > staging environment, where I never tried to use Grafana with a
> >> > > password
> >> > > for Loki. 
> >> > > 
> >> > >Using the username admin with the configured password gives a
> >> > > credentials error on environment where I tried to use Grafana  
> >> with Loki in
> >> > > the past (with 17.2.6 of Ceph/cephadm). I changed the password  
> >> in the past
> >> > > within Grafana, but how can I overwrite this now? Or is there a way to
> >> > > cleanup all Grafana files? 
> >> > > 
> >> > >Best regards, 
> >> > >Sake
> >> >
> >> >
> >> > ___
> >> > ceph-users mailing list -- ceph-users@ceph.io
> >> > To unsubscribe send an email to ceph-users-le...@ceph.io
> >> ___
> >> ceph-users mailing list -- ceph-users@ceph.io
> >> To unsubscribe send an email to ceph-users-le...@ceph.io
> 
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Help needed with Grafana password

2023-11-09 Thread Eugen Block

What doesn't work exactly? For me it did...

Zitat von Sake Ceph :


Too bad, that doesn't work :(

Op 09-11-2023 09:07 CET schreef Sake Ceph :


Hi,

Well to get promtail working with Loki, you need to setup a  
password in Grafana.
But promtail wasn't working with the 17.2.6 release, the URL was  
set to containers.local. So I stopped using it, but forgot to click  
on save in KeePass :(


I didn't configure anything special in Grafana, the default  
dashboards are great! So a wipe isn't a problem, it's what I want.


Best regards,
Sake
> Op 09-11-2023 08:19 CET schreef Eugen Block :
>
>
> Hi,
> you mean you forgot your password? You can remove the service with
> 'ceph orch rm grafana', then re-apply your grafana.yaml containing the
> initial password. Note that this would remove all of the grafana
> configs or custom dashboards etc., you would have to reconfigure them.
> So before doing that you should verify that this is actually what
> you're looking for. Not sure what this has to do with Loki though.
>
> Eugen
>
> Zitat von Sake Ceph :
>
> > I configured a password for Grafana because I want to use Loki. I
> > used the spec parameter initial_admin_password and this works fine for a
> > staging environment, where I never tried to use Grafana with a password
> > for Loki. 
> > 
> >Using the username admin with the configured password gives a
> > credentials error on environment where I tried to use Grafana  
with Loki in
> > the past (with 17.2.6 of Ceph/cephadm). I changed the password  
in the past

> > within Grafana, but how can I overwrite this now? Or is there a way to
> > cleanup all Grafana files? 
> > 
> >Best regards, 
> >Sake
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Help needed with Grafana password

2023-11-09 Thread Sake Ceph
Too bad, that doesn't work :(
> Op 09-11-2023 09:07 CET schreef Sake Ceph :
> 
>  
> Hi, 
> 
> Well to get promtail working with Loki, you need to setup a password in 
> Grafana. 
> But promtail wasn't working with the 17.2.6 release, the URL was set to 
> containers.local. So I stopped using it, but forgot to click on save in 
> KeePass :(
> 
> I didn't configure anything special in Grafana, the default dashboards are 
> great! So a wipe isn't a problem, it's what I want. 
> 
> Best regards, 
> Sake 
> > Op 09-11-2023 08:19 CET schreef Eugen Block :
> > 
> >  
> > Hi,
> > you mean you forgot your password? You can remove the service with  
> > 'ceph orch rm grafana', then re-apply your grafana.yaml containing the  
> > initial password. Note that this would remove all of the grafana  
> > configs or custom dashboards etc., you would have to reconfigure them.  
> > So before doing that you should verify that this is actually what  
> > you're looking for. Not sure what this has to do with Loki though.
> > 
> > Eugen
> > 
> > Zitat von Sake Ceph :
> > 
> > > I configured a password for Grafana because I want to use Loki. I
> > > used the spec parameter initial_admin_password and this works fine for a
> > > staging environment, where I never tried to use Grafana with a password
> > > for Loki. 
> > > 
> > >Using the username admin with the configured password gives a
> > > credentials error on environment where I tried to use Grafana with Loki in
> > > the past (with 17.2.6 of Ceph/cephadm). I changed the password in the past
> > > within Grafana, but how can I overwrite this now? Or is there a way to
> > > cleanup all Grafana files? 
> > > 
> > >Best regards, 
> > >Sake
> > 
> > 
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Crush map & rule

2023-11-09 Thread David C.
(I wrote it freehand, test before applying)
If your goal is to have a replication of 3 within one row and to be able to
switch to the secondary row, then you need 2 rules, and you change the crush
rule on the pool side:

rule primary_location {
(...)
   step take primary class ssd
   step chooseleaf firstn 0 type host
   step emit
}

rule secondary_loc {
(...)
  step take secondary ...

If the aim is to make a 2-replica pool across the 2 rows (not recommended):

rule row_repli {
(...)
  step take default class ssd
  step chooseleaf firstn 0 type row
  step emit
}

If the aim is to distribute the replicas over the 2 rows (for example 2*2
or 2*3 replication):

type replicated
step take primary  class ssd
step chooseleaf firstn 2 type host
step emit
step take secondary  class ssd
step chooseleaf firstn 2 type host
step emit

as far as erasure code is concerned, I really don't see what's reasonably
possible on this architecture.
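
Following the "test before applying" note above, an edited map can be
dry-run with crushtool before it is injected (sketch; rule id and replica
count are examples):

ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# edit crushmap.txt, then recompile and simulate the placements
crushtool -c crushmap.txt -o crushmap-new.bin
crushtool -i crushmap-new.bin --test --rule 1 --num-rep 4 --show-mappings
# only once the mappings look right:
ceph osd setcrushmap -i crushmap-new.bin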


Regards,

*David CASIER*






On Thu, Nov 9, 2023 at 08:48, Albert Shih  wrote:

> On 08/11/2023 at 19:29:19+0100, David C. wrote
> Hi David.
>
> >
> > What would be the number of replicas (in total and on each row) and their
> > distribution on the tree ?
>
> Well “inside” a row that would be 3 in replica mode.
>
> Between row...well two ;-)
>
> Besides understanding how to write a rule a little more complex than the
> example in the official documentation, there is another purpose: to have
> a procedure for changing the hardware.
>
> For example, if «row primary» contains only old bare-metal servers and I
> add some new servers to the cluster, I want to be able to copy everything
> from «row primary» to «row secondary».
>
> Regards
>
> >
> >
> > On Wed, Nov 8, 2023 at 18:45, Albert Shih  wrote:
> >
> > Hi everyone,
> >
> > I'm a total newbie with Ceph, so sorry if I'm asking some stupid
> > questions.
> >
> > I'm trying to understand how the crush map & rules work; my goal is to
> > have two groups of 3 servers, so I'm using the “row” bucket
> >
> > ID   CLASS  WEIGHT    TYPE NAME                  STATUS  REWEIGHT  PRI-AFF
> >  -1         59.38367  root default
> > -15         59.38367      zone City
> > -17         29.69183          row primary
> >  -3          9.89728              host server1
> >   0    ssd   3.49309                  osd.0          up   1.0       1.0
> >   1    ssd   1.74660                  osd.1          up   1.0       1.0
> >   2    ssd   1.74660                  osd.2          up   1.0       1.0
> >   3    ssd   2.91100                  osd.3          up   1.0       1.0
> >  -5          9.89728              host server2
> >   4    ssd   1.74660                  osd.4          up   1.0       1.0
> >   5    ssd   1.74660                  osd.5          up   1.0       1.0
> >   6    ssd   2.91100                  osd.6          up   1.0       1.0
> >   7    ssd   3.49309                  osd.7          up   1.0       1.0
> >  -7          9.89728              host server3
> >   8    ssd   3.49309                  osd.8          up   1.0       1.0
> >   9    ssd   1.74660                  osd.9          up   1.0       1.0
> >  10    ssd   2.91100                  osd.10         up   1.0       1.0
> >  11    ssd   1.74660                  osd.11         up   1.0       1.0
> > -19         29.69183          row secondary
> >  -9          9.89728              host server4
> >  12    ssd   1.74660                  osd.12         up   1.0       1.0
> >  13    ssd   1.74660                  osd.13         up   1.0       1.0
> >  14    ssd   3.49309                  osd.14         up   1.0       1.0
> >  15    ssd   2.91100                  osd.15         up   1.0       1.0
> > -11          9.89728              host server5
> >  16    ssd   1.74660                  osd.16         up   1.0       1.0
> >  17    ssd   1.74660                  osd.17         up   1.0       1.0
> >  18    ssd   3.49309                  osd.18         up   1.0       1.0
> >  19    ssd   2.91100                  osd.19         up   1.0       1.0
> > -13          9.89728              host server6
> >  20    ssd   1.74660                  osd.20         up   1.0       1.0
> >  21    ssd   1.74660                  osd.21         up   1.0       1.0
> >  22    ssd   2.91100                  osd.22         up   1.0       1.0
> >
> > and I want to create some rules; first I'd like to have
> >
> >   a rule «replica» (over host) inside the «row» primary
> >   a rule «erasure» (over host)  inside the «row» primary
> >
> > but also two crush rules between primary/secondary, meaning I'd like to
> > have a replica (with only 1 copy of course) of a pool from 

[ceph-users] Re: Ceph Dashboard - Community News Sticker [Feedback]

2023-11-09 Thread Daniel Baumann
On 11/9/23 07:35, Nizamudeen A wrote:
> On the Ceph GUI, we thought it could be interesting to show information
> regarding the community events, ceph release information

like others have already said, it's not the right place to put that
information for lots of reasons.

one more to add: putting the information on the dashboard would make it
visible only to those who have direct access to the ceph cluster in an
organisation, which is usually a very small group of people. putting the
same (valuable!) information on the website, where it belongs, makes it
much more easily accessible to everyone interested.

(or in other words: this stuff should be on the normal website already
anyway, so, no extra value in duplicating it on the dashboard).


thanks for asking though. this sounds like the "we're out of things to
do" syndrome which can often be observed in mature projects, where people
start to integrate/duplicate things to make the project the one-and-only
entry point for $everything, which in turn makes it worse at what it
used to do perfectly in the first place.

Regards,
Daniel
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Redeploy ceph orch OSDs after reboot, but don't mark as 'unmanaged'

2023-11-09 Thread Eugen Block

Hi Janek,

I don't really have a solution, but I tend to disagree that 'ceph  
cephadm osd activate' looks for OSDs to create. It's specifically  
stated in the docs that it's for existing OSDs to activate, and it did  
work in my test environment. I also commented the tracker issue you  
referred to. So as I see it the question would be why it doesn't  
activate OSDs. And what it does differently when you deploy them via  
cephadm. Do you have the cephadm.log and the mgr log from the 'ceph  
cephadm osd activate' call?
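
In case it helps, those logs can usually be pulled like this (a sketch; the
host name is a placeholder):

# on the OSD host, the cephadm wrapper log
less /var/log/ceph/cephadm.log

# raise cephadm logging in the mgr, replay the command and watch the output
ceph config set mgr mgr/cephadm/log_to_cluster_level debug
ceph -W cephadm --watch-debug      # in a second terminal
ceph cephadm osd activate <host>
ceph log last cephadm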


Thanks,
Eugen

Zitat von Janek Bevendorff :

Actually, ceph cephadm osd activate doesn't do what I expected it to  
do. It  seems to be looking for new OSDs to create instead of  
looking for existing OSDs to activate. Hence, it does nothing on my  
hosts and only prints 'Created no osd(s) on host XXX; already  
created?' So this wouldn't be an option either, even if I were  
willing to deploy the admin key on the OSD hosts.



On 07/11/2023 11:41, Janek Bevendorff wrote:

Hi,

We have our cluster RAM-booted, so we start from a clean slate  
after every reboot. That means I need to redeploy all OSD daemons  
as well. At the moment, I run cephadm deploy via Salt on the  
rebooted node, which brings the deployed OSDs back up, but the  
problem with this is that the deployed OSD shows up as 'unmanaged'  
in ceph orch ps afterwards.


I could simply skip the cephadm call and wait for the Ceph  
orchestrator to reconcile and auto-activate the disks, but that can  
take up to 15 minutes, which is unacceptable. Running ceph cephadm  
osd activate is not an option either, since I don't have the admin  
keyring deployed on the OSD hosts (I could do that, but I don't  
want to).


How can I manually activate the OSDs after a reboot and hand over  
control to the Ceph orchestrator afterwards? I checked the  
deployments in /var/lib/ceph/, but the only difference I  
found between my manual cephadm deployment and what ceph orch does  
is that the device links to /dev/mapper/ceph--... instead of  
/dev/ceph-...


Any hints appreciated!

Janek



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: HDD cache

2023-11-09 Thread Konstantin Shalygin
Hi Peter,

> On Nov 8, 2023, at 20:32, Peter  wrote:
> 
> Can anyone who has experienced this advise?

You can try:

* check for current cache status

smartctl -x /dev/sda | grep "Write cache"

* turn off write cache

smartctl -s wcache-sct,off,p /dev/sda

* check again

smartctl -x /dev/sda | grep "Write cache"


Good luck,
k
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Memory footprint of increased PG number

2023-11-09 Thread Eugen Block

Hi,

I don't think increasing the PGs has an impact on the OSD's memory, at  
least I'm not aware of such reports and haven't seen it myself. But  
your cluster could get into trouble as it is: only 24 GB for 16  
OSDs is too low. It can work (and apparently does) when everything is  
calm, but during recovery the memory usage spikes. The default is 4 GB  
per OSD, and there have been several reports over the years of users who  
couldn't get their OSDs back up after a failure because of low memory  
settings. I'd recommend increasing RAM.


Regards,
Eugen

Zitat von Nicola Mori :


Dear Ceph user,

I'm wondering how much an increase in the PG number would impact the  
memory footprint of the OSD daemons. In my cluster I currently have 512  
PGs and I would like to increase that to 1024 to mitigate some disk  
occupancy issues, but with machines that have little memory  
(down to 24 GB for 16 OSDs) I fear this could kill my cluster. Is it  
possible to estimate the relative increase in OSD memory footprint  
when doubling the number of PGs (hopefully not a linear scaling)? Or  
is there a way to experiment without crashing everything?

Thank you.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Help needed with Grafana password

2023-11-09 Thread Sake Ceph
Hi, 

Well, to get promtail working with Loki, you need to set up a password in 
Grafana. 
But promtail wasn't working with the 17.2.6 release; the URL was set to 
containers.local. So I stopped using it, but forgot to click on save in 
KeePass :(

I didn't configure anything special in Grafana, the default dashboards are 
great! So a wipe isn't a problem, it's what I want. 

Best regards, 
Sake 
> Op 09-11-2023 08:19 CET schreef Eugen Block :
> 
>  
> Hi,
> you mean you forgot your password? You can remove the service with  
> 'ceph orch rm grafana', then re-apply your grafana.yaml containing the  
> initial password. Note that this would remove all of the grafana  
> configs or custom dashboards etc., you would have to reconfigure them.  
> So before doing that you should verify that this is actually what  
> you're looking for. Not sure what this has to do with Loki though.
> 
> Eugen
> 
> Zitat von Sake Ceph :
> 
> > I configured a password for Grafana because I want to use Loki. I
> > used the spec parameter initial_admin_password and this works fine for a
> > staging environment, where I never tried to use Grafana with a password
> > for Loki. 
> > 
> >Using the username admin with the configured password gives a
> > credentials error on environment where I tried to use Grafana with Loki in
> > the past (with 17.2.6 of Ceph/cephadm). I changed the password in the past
> > within Grafana, but how can I overwrite this now? Or is there a way to
> > cleanup all Grafana files? 
> > 
> >Best regards, 
> >Sake
> 
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io