[ceph-users] Re: Reef - what happened to OSD spec?

2023-08-28 Thread Nigel Williams
On Tue, 29 Aug 2023 at 10:09, Nigel Williams 
wrote:

> and when I give it a try it fails when it bumps into the root drive (which has
> an active LVM). I expect I can add a filter to avoid it.
>

I found the cause of this initial failure when applying the spec from the
web GUI. Even though I thought I had zapped the devices via ceph orch, they
needed a follow-up wipefs to clear them completely - then it was OK.
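
For anyone hitting the same thing, this is roughly the sequence that worked
here (device names are placeholders - adjust for your hosts):

    # zap the device via the orchestrator first
    ceph orch device zap <host> /dev/sdX --force
    # if cephadm still does not list the device as available, clear leftover
    # signatures on the host itself, then re-check
    wipefs -a /dev/sdX
    ceph orch device ls --refresh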
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Reef - what happened to OSD spec?

2023-08-28 Thread Nigel Williams
We upgraded to Reef from Quincy, all went smoothly (thanks Ceph developers!)

When adding OSDs the process seems to have changed: the docs no longer
mention the OSD spec, and when I give it a try it fails when it bumps into the
root drive (which has an active LVM). I expect I can add a filter to avoid it.

But is using the OSD spec (
https://docs.ceph.com/en/octopus/cephadm/drivegroups/) approach now
deprecated? Is the web-interface now the preferred way?
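
For reference, this is the kind of drive group spec I have in mind - a minimal
sketch, untested as written, with the host pattern and filters as placeholders:

    service_type: osd
    service_id: hdd_osds            # arbitrary name
    placement:
      host_pattern: '*'             # restrict this to your OSD hosts
    spec:
      data_devices:
        rotational: 1               # only pick spinners, so an SSD/NVMe root drive is skipped
        size: '1TB:'                # and/or a size filter: ignore anything smaller than 1 TB

Applied with "ceph orch apply -i osd-spec.yaml"; as far as I can tell the spec
approach still works alongside the web interface.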

thanks.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Questions since updating to 18.0.2

2023-08-28 Thread Curt
Hello,

We recently upgraded our cluster to version 18 and I've noticed some things
that I'd like feedback on before I go down a rabbit hole over non-issues.
cephadm was used for the upgrade and there were no issues. The cluster is
56 OSDs, all spinners, and for now it is only used for RBD images.

I've noticed more active scrubs/deep scrubs than before. I don't remember
seeing a large number previously - usually around 20-30 scrubs and 15 deep
scrubs, I think - but now I will have 70 scrubs and 70 deep scrubs happening.
I thought these were limited to 1 per OSD; am I misunderstanding
osd_max_scrubs? Everything on the cluster is currently at default values.
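
For reference, this is how I'm checking the effective limit (commands from
memory, so please correct me if there is a better way):

    # cluster-wide default/override
    ceph config get osd osd_max_scrubs
    # what one running OSD actually uses
    ceph tell osd.0 config get osd_max_scrubs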

The other thing I've noticed since the upgrade is that any time backfill
happens, client IO drops, even though neither is high to begin with: 30 MiB/s
of read/write client IO drops to 10-15 MiB/s with 200 MiB/s of backfill.
Before upgrading, backfill would hit 500-600 MiB/s with 30 MiB/s of client IO.
I realize lots of things could affect this and it could be unrelated to the
cluster - I'm still investigating - but I wanted to mention it in case someone
could recommend a check, or knows of a change in Reef that could cause this.
The mclock profile is client_io.
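
For completeness, this is how I'm looking at the mclock settings (assuming the
option names are unchanged in Reef):

    # which profile the OSDs run with (high_client_ops / balanced / high_recovery_ops)
    ceph config get osd osd_mclock_profile
    # to favour client traffic over recovery/backfill
    ceph config set osd osd_mclock_profile high_client_ops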

Thanks,
Curt
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] two ways of adding OSDs? LVM vs ceph orch daemon add

2023-08-28 Thread Giuliano Maggi
Hi,

I am learning about Ceph, and I found these two ways of adding OSDs:

https://docs.ceph.com/en/quincy/install/manual-deployment/#short-form (via LVM)
AND
https://docs.ceph.com/en/quincy/cephadm/services/osd/#creating-new-osds (ceph orch daemon add osd <host>:<device-path>)

Are these two ways equivalent?
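
To make the comparison concrete, these are the two command forms I mean (the
device path is just an example):

    # manual / ceph-volume way, run directly on the OSD host
    ceph-volume lvm create --data /dev/sdb
    # cephadm / orchestrator way, run from a host with orchestrator access
    ceph orch daemon add osd myhost:/dev/sdb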

Thanks,
Giuliano,
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: A couple OSDs not starting after host reboot

2023-08-28 Thread apeisker
Hi,

Thank you for your reply. I don’t think the device names changed, but ceph 
seems to be confused about which device the OSD is on. It’s reporting that 
there are 2 OSDs on the same device although this is not true.

ceph device ls-by-host  | grep sdu
ATA_HGST_HUH728080ALN600_VJH4GLUX sdu  osd.665
ATA_HGST_HUH728080ALN600_VJH60MAX sdu  osd.657

The osd.665 is actually on device sdm. Could this be the cause of the issue? Is 
there a way to correct it?
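
In case it helps, this is what I have been using to cross-check which device an
OSD really sits on (the OSD id is just an example):

    # what the OSD daemon itself reports
    ceph osd metadata 665 | grep -E '"devices"|"device_ids"|"bluestore_bdev_dev_node"'
    # what ceph-volume sees on the host (run on the OSD host itself)
    ceph-volume lvm list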
Thanks,
Alison
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephadm to setup wal/db on nvme

2023-08-28 Thread Satish Patel
I have replaced the Samsung with an Intel P4600 6.4TB NVMe (I have created 3
OSDs on top of the NVMe drive).

Here is the result:

(venv-openstack) root@os-ctrl1:~# rados -p test-nvme -t 64 -b 4096
bench 10 write
hints = 1
Maintaining 64 concurrent writes of 4096 bytes to objects of size 4096
for up to 10 seconds or 0 objects
Object prefix: benchmark_data_os-ctrl1_1030914
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)   avg lat(s)
    0       0         0         0         0         0            -            0
    1      63     31188     31125    121.56   121.582  0.000996695   0.00205185
    2      63     67419     67356   131.529   141.527   0.00158563   0.00189714
    3      63    101483    101420   132.033   133.062   0.00311369   0.00189039
    4      64    135147    135083   131.893   131.496   0.00132065   0.00189281
    5      63    169856    169793   132.628   135.586   0.00163604    0.0018825
    6      64    204437    204373   133.032   135.078  0.000880165   0.00187612
    7      63    239369    239306   133.518   136.457   0.00215911   0.00187017
    8      64    274318    274254    133.89   136.516   0.00130235   0.00186506
    9      63    309388    309325   134.233   136.996   0.00134813   0.00186031
   10       1    343849    343848   134.293   134.855   0.00205662   0.00185956
Total time run: 10.0018
Total writes made:  343849
Write size: 4096
Object size:4096
Bandwidth (MB/sec): 134.292
Stddev Bandwidth:   5.1937
Max bandwidth (MB/sec): 141.527
Min bandwidth (MB/sec): 121.582
Average IOPS:   34378
Stddev IOPS:1329.59
Max IOPS:   36231
Min IOPS:   31125
Average Latency(s): 0.00185956
Stddev Latency(s):  0.00161079
Max latency(s): 0.107432
Min latency(s): 0.000603733
Cleaning up (deleting benchmark objects)
Removed 343849 objects
Clean up completed and total clean up time :8.41907



On Fri, Aug 25, 2023 at 2:33 PM Anthony D'Atri 
wrote:

>
>
> > Thank you for reply,
> >
> > I have created two class SSD and NvME and assigned them to crush maps.
>
> You don't have enough drives to keep them separate.  Set the NVMe drives
> back to "ssd" and just make one pool.
>
> >
> > $ ceph osd crush rule ls
> > replicated_rule
> > ssd_pool
> > nvme_pool
> >
> >
> > Running benchmarks on nvme is the worst performing. SSD showing much
> better
> > results compared to NvME.
>
> You have more SATA SSDs and thus more OSDs, than NVMe SSDs.
>
>
> > NvME model is Samsung_SSD_980_PRO_1TB
>
> Client-grade, don't expect much from it.
>
>
> >
> >  NvME pool benchmark with 3x replication
> >
> > # rados -p test-nvme -t 64 -b 4096 bench 10 write
> > hints = 1
> > Maintaining 64 concurrent writes of 4096 bytes to objects of size 4096 for up to 10 seconds or 0 objects
> > Object prefix: benchmark_data_os-ctrl1_1931595
> >  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
> >    0       0        0        0         0         0           -            0
> >    1      64     5541     5477   21.3917   21.3945   0.0134898    0.0116529
> >    2      64    11209    11145   21.7641   22.1406  0.00939951    0.0114506
> >    3      64    17036    16972   22.0956   22.7617  0.00938263    0.0112938
> >    4      64    23187    23123   22.5776   24.0273  0.00863939    0.0110473
> >    5      64    29753    29689   23.1911   25.6484  0.00925603    0.0107662
> >    6      64    36222    36158   23.5369   25.2695   0.0100759     0.010606
> >    7      63    42997    42934   23.9551   26.4688  0.00902186    0.0104246
> >    8      64    49859    49795   24.3102   26.8008  0.00884379    0.0102765
> >    9      64    56429    56365   24.4601   25.6641  0.00989885    0.0102124
> >   10      31    62727    62696   24.4869   24.7305   0.0115833    0.0102027
> > Total time run: 10.0064
> > Total writes made:  62727
> > Write size: 4096
> > Object size:4096
> > Bandwidth (MB/sec): 24.4871
> > Stddev Bandwidth:   1.85423
> > Max bandwidth (MB/sec): 26.8008   <---- only 26 MB/s for the nvme disk
> > Min bandwidth (MB/sec): 21.3945
> > Average IOPS:   6268
> > Stddev IOPS:474.683
> > Max IOPS:   6861
> > Min IOPS:   5477
> > Average Latency(s): 0.0102022
> > Stddev Latency(s):  0.00170505
> > Max latency(s): 0.0365743
> > Min latency(s): 0.00641319
> > Cleaning up (deleting benchmark objects)
> > Removed 62727 objects
> > Clean up completed and total clean up time :8.23223
> >
> >
> >
> > ### SSD pool benchmark
> >
> > (venv-openstack) root@os-ctrl1:~# rados -p test-ssd -t 64 -b 4096 bench 10 write
> > hints = 1
> > Maintaining 64 concurrent writes of 4096 bytes to objects of size 4096 for up to 10 seconds or 0 objects
> > Object prefix: benchmark_data_os-ctrl1_1933383
> >  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
> >

[ceph-users] Re: 16.2.14 pacific QE validation status

2023-08-28 Thread Yuri Weinstein
I am waiting for checks to pass and will merge the one remaining PR:
https://github.com/ceph/ceph/pull/53157
I will start the build as soon as it is merged.

On Mon, Aug 28, 2023 at 4:57 AM Adam King  wrote:

> cephadm piece of rados can be approved. Failures all look known to me.
>
> On Fri, Aug 25, 2023 at 4:06 PM Radoslaw Zarzynski 
> wrote:
>
>> rados approved
>>
>> On Thu, Aug 24, 2023 at 12:33 AM Laura Flores  wrote:
>>
>>> Rados summary is here:
>>> https://tracker.ceph.com/projects/rados/wiki/PACIFIC#Pacific-v16214-httpstrackercephcomissues62527note-1
>>>
>>> Most are known, except for two new trackers I raised:
>>>
>>>1. https://tracker.ceph.com/issues/62557 - rados/dashboard:
>>>Teuthology test failure due to "MDS_CLIENTS_LAGGY" warning - Ceph - RADOS
>>>2. https://tracker.ceph.com/issues/62559 - rados/cephadm/dashboard:
>>>test times out due to host stuck in maintenance mode - Ceph - 
>>> Orchestrator
>>>
>>> #1 is related to a similar issue we saw where the MDS_CLIENTS_LAGGY
>>> warning was coming up in the Jenkins api check, where these kinds of
>>> conditions are expected. In that case, I would call #1 more of a test
>>> issue, and say that the fix is to whitelist the warning for that test.
>>> Would be good to have someone from CephFS weigh in though-- @Patrick
>>> Donnelly  @Dhairya Parmar 
>>>
>>> #2 looks new to me. @Adam King  can you take a look
>>> and see if it's something to be concerned about? The same test failed for a
>>> different reason in the rerun, so the failure did not reproduce.
>>>
>>> On Wed, Aug 23, 2023 at 1:08 PM Laura Flores  wrote:
>>>
 Thanks Yuri! I will take a look for rados and get back to this thread.

 On Wed, Aug 23, 2023 at 9:41 AM Yuri Weinstein 
 wrote:

> Details of this release are summarized here:
>
> https://tracker.ceph.com/issues/62527#note-1
> Release Notes - TBD
>
> Seeking approvals for:
>
> smoke - Venky
> rados - Radek, Laura
>   rook - Sébastien Han
>   cephadm - Adam K
>   dashboard - Ernesto
>
> rgw - Casey
> rbd - Ilya
> krbd - Ilya
> fs - Venky, Patrick
>
> upgrade/pacific-p2p - Laura
> powercycle - Brad (SELinux denials)
>
>
> Thx
> YuriW
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>


 --

 Laura Flores

 She/Her/Hers

 Software Engineer, Ceph Storage 

 Chicago, IL

 lflo...@ibm.com | lflo...@redhat.com 
 M: +17087388804



>>>
>>> --
>>>
>>> Laura Flores
>>>
>>> She/Her/Hers
>>>
>>> Software Engineer, Ceph Storage 
>>>
>>> Chicago, IL
>>>
>>> lflo...@ibm.com | lflo...@redhat.com 
>>> M: +17087388804
>>>
>>>
>>> ___
>>> Dev mailing list -- d...@ceph.io
>>> To unsubscribe send an email to dev-le...@ceph.io
>>>
>>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Status of diskprediction MGR module?

2023-08-28 Thread Anthony D'Atri


>> The module hasn't had new commits for more than two years
> 
> So diskprediction_local is unmaintained. Will it be removed?
> It looks like a nice feature but when you try to use it it's useless.

IIRC it supports only a specific set of drive models, and relies on a binary
blob from ProphetStor.

>> I suggest using smartctl_exporter [1] for monitoring drive health
> 
> I tried to deploy that with cephadm as a custom container.

I would deploy it outside of cephadm.

This exporter is promising, but note that if you have drives hidden behind
RAID-on-Chip HBA virtual disks (VDs), it does not have the ability to jump
through the requisite hoops to extract metrics from them.
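
If you run it outside of cephadm, something along these lines is a reasonable
starting point (a sketch only; image name, tag and port are from memory, so
check the smartctl_exporter README):

    # run on every node; privileged with /dev access so smartctl can reach the drives
    docker run -d --name smartctl-exporter --privileged \
      -v /dev:/dev:ro -p 9633:9633 \
      prometheuscommunity/smartctl-exporter:latest

Then point your Prometheus (cephadm-managed or not) at that port on each node.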


> 
> Follow-up questions:
> 
> How do I tell cephadm that smartctl_exporter has to run in a privileged
> container as root with all the devices?
> 
> How do I tell the cephadm managed Prometheus that it can scrape these new 
> exporters?
> 
> How do I add a dashboard in cephadm managed Grafana that shows the values 
> from smartctl_exporter? Where do I get such a dashboard?
> 
> How do I add alerts to the cephadm managed Alert-Manager? Where do I get 
> useful alert definitions for smartctl_exporter metrics?
> 
> Regards
> -- 
> Robert Sander
> Heinlein Consulting GmbH
> Schwedter Str. 8/9b, 10119 Berlin
> 
> https://www.heinlein-support.de
> 
> Tel: 030 / 405051-43
> Fax: 030 / 405051-19
> 
> Amtsgericht Berlin-Charlottenburg - HRB 220009 B
> Geschäftsführer: Peer Heinlein - Sitz: Berlin
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Status of diskprediction MGR module?

2023-08-28 Thread Robert Sander

On 8/28/23 13:26, Konstantin Shalygin wrote:


The module hasn't had new commits for more than two years


So diskprediction_local is unmaintained. Will it be removed?
It looks like a nice feature but when you try to use it it's useless.


I suggest using smartctl_exporter [1] for monitoring drive health


I tried to deploy that with cephadm as a custom container.

Follow-up questions:

How do I tell cephadm that smartctl_exporter has to run in a privileged
container as root with all the devices?


How do I tell the cephadm managed Prometheus that it can scrape these 
new exporters?


How do I add a dashboard in cephadm managed Grafana that shows the 
values from smartctl_exporter? Where do I get such a dashboard?


How do I add alerts to the cephadm managed Alert-Manager? Where do I get 
useful alert definitions for smartctl_exporter metrics?


Regards
--
Robert Sander
Heinlein Consulting GmbH
Schwedter Str. 8/9b, 10119 Berlin

https://www.heinlein-support.de

Tel: 030 / 405051-43
Fax: 030 / 405051-19

Amtsgericht Berlin-Charlottenburg - HRB 220009 B
Geschäftsführer: Peer Heinlein - Sitz: Berlin
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: 16.2.14 pacific QE validation status

2023-08-28 Thread Adam King
cephadm piece of rados can be approved. Failures all look known to me.

On Fri, Aug 25, 2023 at 4:06 PM Radoslaw Zarzynski 
wrote:

> rados approved
>
> On Thu, Aug 24, 2023 at 12:33 AM Laura Flores  wrote:
>
>> Rados summary is here:
>> https://tracker.ceph.com/projects/rados/wiki/PACIFIC#Pacific-v16214-httpstrackercephcomissues62527note-1
>>
>> Most are known, except for two new trackers I raised:
>>
>>1. https://tracker.ceph.com/issues/62557 - rados/dashboard:
>>Teuthology test failure due to "MDS_CLIENTS_LAGGY" warning - Ceph - RADOS
>>2. https://tracker.ceph.com/issues/62559 - rados/cephadm/dashboard:
>>test times out due to host stuck in maintenance mode - Ceph - Orchestrator
>>
>> #1 is related to a similar issue we saw where the MDS_CLIENTS_LAGGY
>> warning was coming up in the Jenkins api check, where these kinds of
>> conditions are expected. In that case, I would call #1 more of a test
>> issue, and say that the fix is to whitelist the warning for that test.
>> Would be good to have someone from CephFS weigh in though-- @Patrick
>> Donnelly  @Dhairya Parmar 
>>
>> #2 looks new to me. @Adam King  can you take a look
>> and see if it's something to be concerned about? The same test failed for a
>> different reason in the rerun, so the failure did not reproduce.
>>
>> On Wed, Aug 23, 2023 at 1:08 PM Laura Flores  wrote:
>>
>>> Thanks Yuri! I will take a look for rados and get back to this thread.
>>>
>>> On Wed, Aug 23, 2023 at 9:41 AM Yuri Weinstein 
>>> wrote:
>>>
 Details of this release are summarized here:

 https://tracker.ceph.com/issues/62527#note-1
 Release Notes - TBD

 Seeking approvals for:

 smoke - Venky
 rados - Radek, Laura
   rook - Sébastien Han
   cephadm - Adam K
   dashboard - Ernesto

 rgw - Casey
 rbd - Ilya
 krbd - Ilya
 fs - Venky, Patrick

 upgrade/pacific-p2p - Laura
 powercycle - Brad (SELinux denials)


 Thx
 YuriW
 ___
 ceph-users mailing list -- ceph-users@ceph.io
 To unsubscribe send an email to ceph-users-le...@ceph.io

>>>
>>>
>>> --
>>>
>>> Laura Flores
>>>
>>> She/Her/Hers
>>>
>>> Software Engineer, Ceph Storage 
>>>
>>> Chicago, IL
>>>
>>> lflo...@ibm.com | lflo...@redhat.com 
>>> M: +17087388804
>>>
>>>
>>>
>>
>> --
>>
>> Laura Flores
>>
>> She/Her/Hers
>>
>> Software Engineer, Ceph Storage 
>>
>> Chicago, IL
>>
>> lflo...@ibm.com | lflo...@redhat.com 
>> M: +17087388804
>>
>>
>> ___
>> Dev mailing list -- d...@ceph.io
>> To unsubscribe send an email to dev-le...@ceph.io
>>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Status of diskprediction MGR module?

2023-08-28 Thread Konstantin Shalygin
Hi,

> On 28 Aug 2023, at 12:45, Robert Sander  wrote:
> 
> Several years ago the diskprediction module was added to the MGR collecting 
> SMART data from the OSDs.
> 
> There were local and cloud modes available claiming different accuracies. Now 
> only the local mode remains.
> 
> What is the current status of that MGR module (diskprediction_local)?
> 
> We have a cluster where SMART data is available from the disks (tested with 
> smartctl and visible in the Ceph dashboard), but even with an enabled 
> diskprediction_local module no health and lifetime info is shown.

The module hasn't had new commits for more than two years.
I suggest using smartctl_exporter [1] for monitoring drive health.


[1] https://github.com/prometheus-community/smartctl_exporter
k
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: What does 'removed_snaps_queue' [d5~3] means?

2023-08-28 Thread Eugen Block
It would be helpful to know what exactly happened. Who creates the
snapshots and how? What are your clients, OpenStack compute nodes? If
an 'rbd ls' shows some output, does 'rbd status <pool>/<image>'
display any info as well, or does it return an error? This is a
recurring issue if client connections break; at least something
similar is reported once in a while on the openstack-discuss mailing
list. Sometimes a compute node reboot helps to get rid of the lock, or
a "blocklist" of the client. But without more details it's difficult
to say what exactly the issue is and what would help.
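
If it does turn out to be a stale client, this is roughly the sequence I mean
(pool/image and the client address are placeholders):

    # see who still watches/locks the image
    rbd status <pool>/<image>
    rbd lock ls <pool>/<image>
    # if that client is really gone, blocklist it and remove the lock
    # (on Octopus the command is still "ceph osd blacklist add")
    ceph osd blocklist add <client-address>
    rbd lock rm <pool>/<image> <lock-id> <locker>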


Zitat von Work Ceph :


I see, thanks for the reply!

BTW, while the snapshots are not removed yet, should we be able to delete
the image whose snapshots are being deleted?

We noticed the following for the images that had snapshots deleted, but not
actually removed from the system:
```
This means the image is still open or the client using it crashed. Try
again after closing/unmapping it or waiting 30s for the crashed client to
timeout.
```

After executing the "rbd rm", we receive that message, and the image still
displays with the "rbd ls" command. However, the "rbd info" returns a
message saying that the image does not exist. Is that a known
issue/situation?

On Sat, Aug 26, 2023 at 5:24 AM Eugen Block  wrote:


Hi,

that specifies a range of (to be) removed snapshots. Do you have rbd
mirroring configured or some scripted snapshot creation/deletion?
Snapshot deletion is an asynchronous operation, so they are added to
the queue and deleted at some point. Does the status/range change?
Which exact Octopus version are you running? I have two test clusters
(latest Octopus) with rbd mirroring and when I set that up I expected
to see something similar, in earlier Ceph versions that was visible in
the pool ls detail output. Anyway, I wouldn't worry about it as long
as the queue doesn't grow and the snaps are removed eventually. You
should see the snaptrimming in the 'ceph -s' output as well, the PGs
have a respective state (active+snaptrim or active+snaptrim_wait). I
write this from memory, so the PG state might differ a bit.
You just need to be aware of the impacts of many snapshots for many
images, I'm still investigating a customer issue, some of the results
I posted in this list [1].

Regards,
Eugen

[1]

https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/ZEMGKBLMEREBZB7SWOLDA6QZX3S7FLL3/#YAHVTTES6YU5IXZJ2UNXKURXSHM5HDEX

Zitat von Work Ceph :

> Hello guys,
> We are facing/seeing an unexpected mark in one of our pools. Do you guys
> know what does "removed_snaps_queue" it mean? We see some notation such
as
> "d5~3" after this tag. What does it mean? We tried to look into the docs,
> but could not find anything meaningful.
>
> We are running Ceph Octopus on top of Ubuntu 18.04.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Status of diskprediction MGR module?

2023-08-28 Thread Robert Sander

Hi,

Several years ago the diskprediction module was added to the MGR 
collecting SMART data from the OSDs.


There were local and cloud modes available claiming different 
accuracies. Now only the local mode remains.


What is the current status of that MGR module (diskprediction_local)?

We have a cluster where SMART data is available from the disks (tested 
with smartctl and visible in the Ceph dashboard), but even with an 
enabled diskprediction_local module no health and lifetime info is shown.
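
For reference, these are the commands I used to check (from memory, the device
id is a placeholder):

    # module enabled and selected as the prediction mode?
    ceph mgr module ls | grep diskprediction
    ceph config get mgr device_failure_prediction_mode
    # per-device health data and prediction
    ceph device ls
    ceph device get-health-metrics <devid>
    ceph device predict-life-expectancy <devid>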


Regards
--
Robert Sander
Heinlein Consulting GmbH
Schwedter Str. 8/9b, 10119 Berlin

https://www.heinlein-support.de

Tel: 030 / 405051-43
Fax: 030 / 405051-19

Amtsgericht Berlin-Charlottenburg - HRB 220009 B
Geschäftsführer: Peer Heinlein - Sitz: Berlin
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Windows 2016 RBD Driver install failure

2023-08-28 Thread Lucian Petrut
Hi,

Windows Server 2019 is the minimum supported version for rbd-wnbd 
(https://github.com/cloudbase/wnbd#requirements).

You may use ceph-dokan (cephfs) with Windows Server 2016 by disabling the WNBD 
driver when running the MSI installer.

Regards,
Lucian

From: Robert Ford
Sent: Tuesday, August 22, 2023 5:09 PM
To: ceph-users@ceph.io
Subject: [ceph-users] Windows 2016 RBD Driver install failure

Hello,

We have been running into an issue installing the Pacific Windows RBD
driver on Windows 2016. It has no issues with either 2019 or 2022. It
looks like it fails at checkpoint creation. We are installing it as
admin. Has anyone seen this before or know of a solution?

The closest thing I can find to why it won't install:

   *** Product: D:\software\ceph_pacific_beta.msi
   *** Action: INSTALL
   *** CommandLine: **
MSI (s) (CC:24) [12:31:30:315]: Machine policy value
'DisableUserInstalls' is 0
MSI (s) (CC:24) [12:31:30:315]: Note: 1: 2203 2:
C:\windows\Installer\inprogressinstallinfo.ipi 3: -2147287038
MSI (s) (CC:24) [12:31:30:315]: Machine policy value
'LimitSystemRestoreCheckpointing' is 0
MSI (s) (CC:24) [12:31:30:315]: Note: 1: 1715 2: Ceph for Windows
MSI (s) (CC:24) [12:31:30:315]: Calling SRSetRestorePoint API.
dwRestorePtType: 0, dwEventType: 102, llSequenceNumber: 0,
szDescription: "Installed Ceph for Windows".
MSI (s) (CC:24) [12:31:30:315]: The call to SRSetRestorePoint API
failed. Returned status: 0. GetLastError() returned: 127

--
--


Robert Ford
GoDaddy | SRE III
9519020587
Phoenix, AZ
rf...@godaddy.com
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] hardware setup recommendations wanted

2023-08-28 Thread Kai Zimmer

Dear listers,

my employer already has a production Ceph cluster running, but we need a
second one. I just wanted to ask your opinion on the following setup.
It is planned for 500 TB net capacity, expandable to 2 PB. I expect the
number of OSD servers to double in the next 4 years. Erasure coding 3+2
will be used for the OSDs. Usage will be file storage, RADOS block devices
and S3:


5x OSD servers (12x18 TB Toshiba MG09SCA18TE SAS spinning disks for 
data, 2x512 GB Samsung PM9A1 M.2 NVME SSD 0,55 DWPD for system, 1xAMD 
7313P CPU with 16 cores @3GHz, 256 GB RAM, LSI SAS 9500 HBA, Broadcom 
P425G network adapter 4x25 Gbit/s)


3x MON servers (1x2 TB Samsung PM9A1 M.2 NVME SSD 0,55 DWPD for system, 
2x1.6TB Kioxia CD6-V SSD 3.0 DWPD for data, 2x Broadcom P210/N210 
network 4x10 GBit/s, 1xAMD 7232P CPU with 8 cores @3.1 GHz, 64 GB RAM)


3x MDS servers (1x2 TB Samsung PM9A1 M.2 NVME SSD 0,55 DWPD for system, 
2x1.6 TB Kioxia CD6-V SSD 3.0 DWPD for data, 2x Broadcom P210/N210 
network 4x10 GBit/s, 1xAMD 7313P CPU with 16 cores @3 GHz, 128 GB RAM)


OSD servers will be connected via 2x25 GBit fibre interfaces "backend" to

2x Mikrotik CRS518-16XS-2XQ (which are connected for high-availability 
via 100 GBit)


For the "frontend" connection to servers/clients via 2x10 GBit we're 
looking into


3x Mikrotik CRS326-24S+2Q+RM (which are connected for high-availability 
via 40 GBit)


Especially for the "frontend" switches I'm looking for alternatives.
Currently we use Huawei C6810-32T16A4Q-LI models with 2x33 LACP
connections connected via 10 GBit/s RJ45. But those had ports blocking
after a number of errors, which resulted in some trouble. We'd like to
avoid IOS and its clones in general and would prefer a decent web interface.


Any comments/recommendations?

Best regards,

Kai
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: rbd export-diff/import-diff hangs

2023-08-28 Thread Mykola Golub
On Mon, Aug 28, 2023 at 6:21 AM Tony Liu  wrote:
>
> It's export-diff from an in-use image, both from-snapshot and to-snapshot 
> exist.
> The same from-snapshot exists in import image, which is the to-snapshot from 
> last diff.
> export/import is used for local backup, rbd-mirroring is used for remote 
> backup.

Just to make it clear, do you mean you are running export-diff for an
image that is being mirrored
(snapshot based)?

> Looking for options to get more info to troubleshoot.

I would split "rbd export-diff | rbd import-diff" into two commands:

   rbd export-diff > image.diff
   rbd import-diff < image.diff

and see if it gets stuck for the first one, so we are sure the
export-diff is the issue here.
The next step would be enabling rbd debug, something like this:

  rbd export-diff .. --debug-rbd=20
--log-file=/tmp/{image}_{from_snap}_{to_snap}.log
--log-to-stderr=false

Hopefully it will not use too much space and you will be able to get a log
for a case that gets stuck. Then please provide the log for review somehow.
Also, note the time when you interrupt the hanging export-diff.

--
Mykola Golub
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: rbd export-diff/import-diff hangs

2023-08-28 Thread Tony Liu
Figured it out. It's not an rbd issue. Sorry for the false alarm.

Thanks!
Tony

From: Tony Liu 
Sent: August 27, 2023 08:19 PM
To: Eugen Block; ceph-users@ceph.io
Subject: [ceph-users] Re: rbd export-diff/import-diff hangs

It's export-diff from an in-use image, both from-snapshot and to-snapshot exist.
The same from-snapshot exists in import image, which is the to-snapshot from 
last diff.
export/import is used for local backup, rbd-mirroring is used for remote backup.
Looking for options to get more info to troubleshoot.


Thanks!
Tony

From: Eugen Block 
Sent: August 27, 2023 11:53 AM
To: ceph-users@ceph.io
Subject: [ceph-users] Re: rbd export-diff/import-diff hangs

You mean the image is in use while you're exporting? Have you thought
about creating snapshots and exporting those? Or setting up rbd mirroring?

Zitat von Tony Liu :

> To update: the hanging happens when updating the local image, not the remote
> one, so networking is not a concern here. Any advice on how to look into it?
>
> Thanks!
> Tony
> 
> From: Tony Liu 
> Sent: August 26, 2023 10:43 PM
> To: d...@ceph.io; ceph-users@ceph.io
> Subject: [ceph-users] rbd export-diff/import-diff hangs
>
> Hi,
>
> I'm using rbd import and export to copy image from one cluster to another.
> Also using import-diff and export-diff to update image in remote cluster.
> For example, "rbd --cluster local export-diff ... | rbd --cluster
> remote import-diff ...".
> Sometimes, the whole command is stuck. I can't tell it's stuck on
> which end of the pipe.
> I did some search, [1] seems the same issue and [2] is also related.
>
> Wonder if there is any way to identify where it's stuck and get more
> debugging info.
> Given [2], I'd suspect the import-diff is stuck, because the rbd client is
> importing to the remote cluster. Networking latency could be involved here?
> Ping latency is 7~8 ms.
>
> Any comments are appreciated!
>
> [1] https://bugs.launchpad.net/cinder/+bug/2031897
> [2] https://stackoverflow.com/questions/69858763/ceph-rbd-import-hangs
>
> Thanks!
> Tony
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io