[ceph-users] Re: Ceph - Error ERANGE: (34) Numerical result out of range

2023-10-27 Thread Eugen Block

Hi,

please provide more information about your cluster, like 'ceph -s',  
'ceph osd tree' and the exact procedure you used to create the OSDs.  
From your last post it seems like the OSD creation failed and this  
might be just a consequence of that? Do you have the logs from the OSD  
creation as well? Not just the logs from the failing OSD start.


Thanks,
Eugen

Zitat von Pardhiv Karri :


Hi,
Trying to move a node/host under a new SSD root and getting below error.
Has anyone seen it and know the fix? the pg_num and pgp_num are same for
all pools so that is not the issue.

 [root@hbmon1 ~]# ceph osd crush move hbssdhost1 root=ssd
Error ERANGE: (34) Numerical result out of range
 [root@hbmon1 ~]#

Thanks,
Pardhiv
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [EXTERNAL] [Pacific] ceph orch device ls do not returns any HDD

2023-10-27 Thread Patrick Begou

Hi all,

First of all I apologize if I've not done things correctly but these are 
some tests results.


1) I've compiled the main branch in a fresh podman container (Alma Linux 
8) and installed. Successfull!
2) I have done a copy of the /etc/ceph directory of the host (member of 
the ceph cluster in Pacific 16.2.14) in this container (good or bad idea ?)

3) "ceph-volume inventory" works but with some error messages:

[root@74285dcfa91f etc]# ceph-volume inventory
 stderr: Unknown device, --name=, --path=, or absolute path in /dev/ or 
/sys expected.
 stderr: Unknown device, --name=, --path=, or absolute path in /dev/ or 
/sys expected.
 stderr: Unknown device, --name=, --path=, or absolute path in /dev/ or 
/sys expected.
 stderr: Unknown device, --name=, --path=, or absolute path in /dev/ or 
/sys expected.
 stderr: Unknown device, --name=, --path=, or absolute path in /dev/ or 
/sys expected.


Device Path   Size Device nodes    rotates available 
Model name
/dev/sdc  232.83 GB    sdc True True  
SAMSUNG HE253GJ
/dev/sda  232.83 GB    sda True False 
SAMSUNG HE253GJ
/dev/sdb  465.76 GB    sdb True False 
WDC WD5003ABYX-1

4) ceph version show:
[root@74285dcfa91f etc]# ceph -v
ceph version 18.0.0-6846-g2706ecac4a9 
(2706ecac4a90447420904e42d6e0445134dff2be) reef (dev)



5) lsblk works (container launched with "--privileged" flag)
[root@74285dcfa91f etc]# lsblk
NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda  8:0    1 232.9G  0 disk
|-sda1   8:1    3.9G  0 part
|-sda2   8:2    1   3.9G  0 part [SWAP]
`-sda3   8:3    1   225G  0 part
sdb  8:16   1 465.8G  0 disk
sdc  8:32   1 232.9G  0 disk

But some commands do not works (my setup or ceph ?)

[root@74285dcfa91f etc]# ceph orch device zap 
mostha1.legi.grenoble-inp.fr /dev/sdc --force
Error EINVAL: Device path '/dev/sdc' not found on host 
'mostha1.legi.grenoble-inp.fr'

[root@74285dcfa91f etc]#

[root@74285dcfa91f etc]# ceph orch device ls
[root@74285dcfa91f etc]#

Patrick


Le 24/10/2023 à 22:43, Zack Cerza a écrit :

That's correct - it's the removable flag that's causing the disks to
be excluded.

I actually just merged this PR last week:
https://github.com/ceph/ceph/pull/49954

One of the changes it made was to enable removable (but not USB)
devices, as there are vendors that report hot-swappable drives as
removable. Patrick, it looks like this may resolve your issue as well.


On Tue, Oct 24, 2023 at 5:57 AM Eugen Block  wrote:

Hi,


May be because they are hot-swappable hard drives.

yes, that's my assumption as well.


Zitat von Patrick Begou :


Hi Eugen,

Yes Eugen, all the devices /dev/sd[abc] have the removable flag set
to 1. May be because they are hot-swappable hard drives.

I have contacted the commit author Zack Cerza and he asked me for
some additional tests too this morning. I add him in copy to this
mail.

Patrick

Le 24/10/2023 à 12:57, Eugen Block a écrit :

Hi,

just to confirm, could you check that the disk which is *not*
discovered by 16.2.11 has a "removable" flag?

cat /sys/block/sdX/removable

I could reproduce it as well on a test machine with a USB thumb
drive (live distro) which is excluded in 16.2.11 but is shown in
16.2.10. Although I'm not a developer I tried to understand what
changes were made in
https://github.com/ceph/ceph/pull/46375/files#diff-330f9319b0fe352dff0486f66d3c4d6a6a3d48efd900b2ceb86551cfd88dc4c4R771
 and there's this
line:


if get_file_contents(os.path.join(_sys_block_path, dev,
'removable')) == "1":
continue

The thumb drive is removable, of course, apparently that is filtered here.

Regards,
Eugen

Zitat von Patrick Begou :


Le 23/10/2023 à 03:04, 544463...@qq.com a écrit :

I think you can try to roll back this part of the python code and
wait for your good news :)


Not so easy 😕


[root@e9865d9a7f41 ceph]# git revert
4fc6bc394dffaf3ad375ff29cbb0a3eb9e4dbefc
Auto-merging src/ceph-volume/ceph_volume/tests/util/test_device.py
CONFLICT (content): Merge conflict in
src/ceph-volume/ceph_volume/tests/util/test_device.py
Auto-merging src/ceph-volume/ceph_volume/util/device.py
CONFLICT (content): Merge conflict in
src/ceph-volume/ceph_volume/util/device.py
Auto-merging src/ceph-volume/ceph_volume/util/disk.py
CONFLICT (content): Merge conflict in
src/ceph-volume/ceph_volume/util/disk.py
error: could not revert 4fc6bc394df... ceph-volume: Optionally
consume loop devices

Patrick
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io




__

[ceph-users] Re: [ext] CephFS pool not releasing space after data deletion

2023-10-27 Thread Kuhring, Mathias
Dear ceph users,

We are wondering, if this might be the same issue as with this bug:
https://tracker.ceph.com/issues/52581

Except that we seem to have been snapshots dangling on the old pool.
And the bug report snapshots dangling on the new pool.
But maybe it's both?

I mean, once the global root layout was created to a new pool,
the new pool became in charge for snapshooting at least of new data, right?
What about data which is overwritten? Is there a conflict of responsibility?

We do have similar listings of snaps with "ceph osd pool ls detail", I 
think:

0|0[root@osd-1 ~]# ceph osd pool ls detail | grep -B 1 removed_snaps_queue
pool 1 'cephfs_data' replicated size 3 min_size 2 crush_rule 1 
object_hash rjenkins pg_num 115 pgp_num 107 pg_num_target 32 
pgp_num_target 32 autoscale_mode on last_change 803558 lfor 
0/803250/803248 flags hashpspool,selfmanaged_snaps stripe_width 0 
expected_num_objects 1 application cephfs
     removed_snaps_queue 
[3541~1,36e4~1,379f~2,3862~1,3876~1,387d~1,388b~1,389a~1,38a6~1,38bc~1,3993~1,3999~1,39a0~1,39a7~1,39ae~1,39b5~3,39be~1,39c5~1,39cc~1]
--
pool 3 'hdd_ec' erasure profile hdd_ec size 3 min_size 2 crush_rule 3 
object_hash rjenkins pg_num 2048 pgp_num 2048 autoscale_mode off 
last_change 803558 lfor 0/87229/87229 flags 
hashpspool,ec_overwrites,selfmanaged_snaps stripe_width 8192 application 
cephfs
     removed_snaps_queue 
[3541~1,36e4~1,379f~2,3862~1,3876~1,387d~1,388b~1,389a~1,38a6~1,38bc~1,3993~1,3999~1,39a0~1,39a7~1,39ae~1,39b5~3,39be~1,39c5~1,39cc~1]
--
pool 20 'hdd_ec_8_2_pool' erasure profile hdd_ec_8_2_profile size 10 
min_size 9 crush_rule 5 object_hash rjenkins pg_num 8192 pgp_num 8192 
autoscale_mode off last_change 803558 lfor 0/0/681917 flags 
hashpspool,ec_overwrites,selfmanaged_snaps stripe_width 32768 
application cephfs
     removed_snaps_queue 
[3541~1,36e4~1,379f~2,3862~1,3876~1,387d~1,388b~1,389a~1,38a6~1,38bc~1,3993~1,3999~1,39a0~1,39a7~1,39ae~1,39b5~3,39be~1,39c5~1,39cc~1]


Here, pool hdd_ec_8_2_pool is the one we recently assigned to the root 
layout.
Pool hdd_ec is the one which was assigned before and which won't release 
space (at least where I know of).

Is this removed_snaps_queue the same as removed_snaps in the bug issue 
(i.e. the label was renamed)?
And is it normal that all queues list the same info or should this be 
different per pool?
Might this be related to pools having now share responsibility over some 
snaps due to layout changes?

And for the big question:
How can I actually trigger/speedup the removal of those snaps?
I find the removed_snaps/removed_snaps_queue mentioned a few times in 
the user list.
But never with some conclusive answer how to deal with them.
And the only mentions in the docs are just change logs.

I also looked into and started cephfs stray scrubbing:
https://docs.ceph.com/en/latest/cephfs/scrub/#evaluate-strays-using-recursive-scrub
But according to the status output, no scrubbing is actually active.

I would appreciate any further ideas. Thanks a lot.

Best Wishes,
Mathias

On 10/23/2023 12:42 PM, Kuhring, Mathias wrote:
> Dear Ceph users,
>
> Our CephFS is not releasing/freeing up space after deleting hundreds of
> terabytes of data.
> By now, this drives us in a "nearfull" osd/pool situation and thus
> throttles IO.
>
> We are on ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5)
> quincy (stable).
>
> Recently, we moved a bunch of data to a new pool with better EC.
> This was done by adding a new EC pool to the FS.
> Then assigning the FS root to the new EC pool via the directory layout xattr
> (so all new data is written to the new pool).
> And finally copying old data to new folders.
>
> I swapped the data as follows to remain the old directory structures.
> I also made snapshots for validation purposes.
>
> So basically:
> cp -r mymount/mydata/ mymount/new/ # this creates copy on new pool
> mkdir mymount/mydata/.snap/tovalidate
> mkdir mymount/new/mydata/.snap/tovalidate
> mv mymount/mydata/ mymount/old/
> mv mymount/new/mydata mymount/
>
> I could see the increase of data in the new pool as expected (ceph df).
> I compared the snapshots with hashdeep to make sure the new data is alright.
>
> Then I went ahead deleting the old data, basically:
> rmdir mymount/old/mydata/.snap/* # this also included a bunch of other
> older snapshots
> rm -r mymount/old/mydata
>
> At first we had a bunch of PGs with snaptrim/snaptrim_wait.
> But they are done for quite some time now.
> And now, already two weeks later the size of the old pool still hasn't
> really decreased.
> I'm still waiting for around 500 TB to be released (and much more is
> planned).
>
> I honestly have no clue, where to go from here.
>   From my point of view (i.e. the CephFS mount), the data is gone.
> I also never hard/soft-linked it anywhere.
>
> This doesn't seem to be a regular issue.
> At least I couldn't find anything related or resolved in the docs or
> user list, yet.
> If anybody has an idea how to r

[ceph-users] Re: "cephadm version" in reef returns "AttributeError: 'CephadmContext' object has no attribute 'fsid'"

2023-10-27 Thread John Mulligan
On Friday, October 27, 2023 2:40:17 AM EDT Eugen Block wrote:
> Are the issues you refer to the same as before? I don't think this
> version issue is the root cause, I do see it as well in my test
> cluster(s) but the rest works properly except for the tag issue I
> already reported which you can easily fix by setting the config value
> for the default image
> (https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/LASBJCSPFGD
> YAWPVE2YLV2ZLF3HC5SLS/#LASBJCSPFGDYAWPVE2YLV2ZLF3HC5SLS). Or are there new
> issues you encountered?


I concur. That `cephadm version` failure is/was a known issue but should not 
be the cause of any other issues.  On the main branch `cephadm version` no 
longer fails this way - rather, it reports the version of a cephadm build and 
no longer inspects a container image.  We can look into backporting this 
before the next reef release.

The issue related to the container image tag that Eugen filed has also been 
fixed on reef. Thanks for filing that.

Martin you may want to retry things after the next reef release. 
Unfortunately, I don't know when that is planned but I think it's soonish. 

> 
> Zitat von Martin Conway :
> > I just had another look through the issues tracker and found this
> > bug already listed.
> > https://tracker.ceph.com/issues/59428
> > 
> > I need to go back to the other issues I am having and figure out if
> > they are related or something different.
> > 
> > 
> > Hi
> > 
> > I wrote before about issues I was having with cephadm in 18.2.0
> > Sorry, I didn't see the helpful replies because my mail service
> > binned the responses.
> > 
> > I still can't get the reef version of cephadm to work properly.
> > 
> > I had updated the system rpm to reef (ceph repo) and also upgraded
> > the containerised ceph daemons to reef before my first email.
> > 
> > Both the system package cephadm and the one found at
> > /var/lib/ceph/${fsid}/cephadm.* return the same error when running
> > "cephadm version"
> > 
> > Traceback (most recent call last):
> >   File
> > 
> > "./cephadm.059bfc99f5cf36ed881f2494b104711faf4cbf5fc86a9594423cc105cafd9b4
> > e", line 9468, in 
> > 
> > main()
> >   
> >   File
> > 
> > "./cephadm.059bfc99f5cf36ed881f2494b104711faf4cbf5fc86a9594423cc105cafd9b4
> > e", line 9456, in main
> > 
> > r = ctx.func(ctx)
> >   
> >   File
> > 
> > "./cephadm.059bfc99f5cf36ed881f2494b104711faf4cbf5fc86a9594423cc105cafd9b4
> > e", line 2108, in _infer_image
> > 
> > ctx.image = infer_local_ceph_image(ctx, ctx.container_engine.path)
> >   
> >   File
> > 
> > "./cephadm.059bfc99f5cf36ed881f2494b104711faf4cbf5fc86a9594423cc105cafd9b4
> > e", line 2191, in infer_local_ceph_image
> > 
> > container_info = get_container_info(ctx, daemon, daemon_name is not
> > None)
> >   
> >   File
> > 
> > "./cephadm.059bfc99f5cf36ed881f2494b104711faf4cbf5fc86a9594423cc105cafd9b4
> > e", line 2154, in get_container_info
> > 
> > matching_daemons = [d for d in daemons if daemon_name_or_type(d)
> > 
> > == daemon_filter and d['fsid'] == ctx.fsid]
> > 
> >   File
> > 
> > "./cephadm.059bfc99f5cf36ed881f2494b104711faf4cbf5fc86a9594423cc105cafd9b4
> > e", line 2154, in 
> > 
> > matching_daemons = [d for d in daemons if daemon_name_or_type(d)
> > 
> > == daemon_filter and d['fsid'] == ctx.fsid]
> > 
> >   File
> > 
> > "./cephadm.059bfc99f5cf36ed881f2494b104711faf4cbf5fc86a9594423cc105cafd9b4
> > e", line 217, in __getattr__
> > 
> > return super().__getattribute__(name)
> > 
> > AttributeError: 'CephadmContext' object has no attribute 'fsid'
> > 
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Stickyness of writing vs full network storage writing

2023-10-27 Thread Hans Kaiser

Hello list,

 

I am new to ceph and try to understand the ceph philosophy. We did a bunch of tests on a 3 node ceph cluster.

 

After tests, I see that our network is always the bottleneck of writing to very fast storages.

 

So let me explain first my point of view before I get down to the questions:
For my discussion I am assuming nowadays PCIe based NVMe drives, which are capable of writing about 8GiB/s, which is about 64GBit/s.

 

So having 2 or 4 of such drives in a local server we can write in ideal servers 128Gbit/s (2 drives) or 256Gbit/s

 

Considering the latencies here, is also a dream value if we use PCI 5.0 (or even 4.0)

 

Now considering the situation that you have 5 nodes each has 4 of that drives,

will make all small and mid-sized companies to go bankrupt ;-) only from buying the corresponding networking switches.

 

But the servers hardware is still a simplistic commodity hardware which can saturate the given any given commodity network hardware easily.

If I want to be able to use full 64GBit/s I would require at least 100GBit/s networking or tons of trunked ports and cabaling with lower bandwidth switches.

 

If we now also consider distributing the nodes over racks, building on same location or distributed datacenters, the costs will be even more painfull.


And IMHO it could be "easily" changed, if some "minor" different behavior would be available.

My target scenario would be to implement a ceph cluster with such named servers as above.
The ceph commit requirement will be 2 copies on different OSDs (comparable to a mirrored drive) and in total 3 or 4 copies on the cluster (comparable to a RAID with multiple disk redudancy)


In all our tests so far, we could not control the behavior of how ceph is persisting this 2 copies. It will always try to persist it somehow over the network.
Q1: Is this behavior mandatory?

 

Our common workload, and afaik nearly all webservice based applications are:

- a short burst of high bandwidth (e.g. multiple MiB/s or even GiB/s)

- and probably mostly 1write to 4read or even 1:6 ratio on utilizing the cluster

Hope I could explain the situation here well enough.
 

 

Now assuming my ideal world with ceph:

if ceph would do:
1. commit 2 copies to local drives to the node there ceph client is connected to
2. after commit sync (optimized/queued) the data over the network to fulfill the common needs of ceph storage with 4 copies
3. maybe optionally move 1 copy away from the intial node which still holds the 2 local copies...

 

this behaviour would ensure that:
- the felt performance of the OSD clients will be the full bandwidth of the local NVMes, since 2 copies are delivered to the local NVMes with 64GBit/s and the latency would be comparable as writing locally
- we would have 2 copies nearly "immediately" reported to any ceph client
- bandwidth utilization will be optimized, since we do not duplicate the stored data transfers on the network immediatelly, we defer it from the initial writing of the ceph client and can so utilize better a queing mechanism
- IMHO the scalability with commodity network would be far easier to implement, since the networking requirements are factors lower

 

Mabe I have a total wrong understanding of ceph cluster and data distribution of the copies.
Q2: If so plz let me know where I may read more about this?


So to bring it quickly down:
Q3: is it possible to configure ceph to behave like named above in my ideal world?
   means to first write n minimal copies to local drives, and deferred the syncing of the other copies into the network
Q4: if not, are there any plans into this direction?
Q5: if possible, is there a good documentation for it?
Q6: we would still like to be able to distribute over racks, enclosures and datacenters
 

best wishes

Hans
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Stickyness of writing vs full network storage writing

2023-10-27 Thread Hans Kaiser


(now hopefully as plain test only...)

 

Hello list,

 

I am new to ceph and try to understand the ceph philosophy. We did a bunch of tests on a 3 node ceph cluster.

 

After tests, I see that our network is always the bottleneck of writing to very fast storages.

 

So let me explain first my point of view before I get down to the questions:
For my discussion I am assuming nowadays PCIe based NVMe drives, which are capable of writing about 8GiB/s, which is about 64GBit/s.

 

So having 2 or 4 of such drives in a local server we can write in ideal servers 128Gbit/s (2 drives) or 256Gbit/s

 

Considering the latencies here, is also a dream value if we use PCI 5.0 (or even 4.0)

 

Now considering the situation that you have 5 nodes each has 4 of that drives,

will make all small and mid-sized companies to go bankrupt ;-) only from buying the corresponding networking switches.

 

But the servers hardware is still a simplistic commodity hardware which can saturate the given any given commodity network hardware easily.

If I want to be able to use full 64GBit/s I would require at least 100GBit/s networking or tons of trunked ports and cabaling with lower bandwidth switches.

 

If we now also consider distributing the nodes over racks, building on same location or distributed datacenters, the costs will be even more painfull.


And IMHO it could be "easily" changed, if some "minor" different behavior would be available.

My target scenario would be to implement a ceph cluster with such named servers as above.
The ceph commit requirement will be 2 copies on different OSDs (comparable to a mirrored drive) and in total 3 or 4 copies on the cluster (comparable to a RAID with multiple disk redudancy)


In all our tests so far, we could not control the behavior of how ceph is persisting this 2 copies. It will always try to persist it somehow over the network.
Q1: Is this behavior mandatory?

 

Our common workload, and afaik nearly all webservice based applications are:

- a short burst of high bandwidth (e.g. multiple MiB/s or even GiB/s)

- and probably mostly 1write to 4read or even 1:6 ratio on utilizing the cluster

Hope I could explain the situation here well enough.
 

 

Now assuming my ideal world with ceph:

if ceph would do:
1. commit 2 copies to local drives to the node there ceph client is connected to
2. after commit sync (optimized/queued) the data over the network to fulfill the common needs of ceph storage with 4 copies
3. maybe optionally move 1 copy away from the intial node which still holds the 2 local copies...

 

this behaviour would ensure that:
- the felt performance of the OSD clients will be the full bandwidth of the local NVMes, since 2 copies are delivered to the local NVMes with 64GBit/s and the latency would be comparable as writing locally
- we would have 2 copies nearly "immediately" reported to any ceph client
- bandwidth utilization will be optimized, since we do not duplicate the stored data transfers on the network immediatelly, we defer it from the initial writing of the ceph client and can so utilize better a queing mechanism
- IMHO the scalability with commodity network would be far easier to implement, since the networking requirements are factors lower

 

Mabe I have a total wrong understanding of ceph cluster and data distribution of the copies.
Q2: If so plz let me know where I may read more about this?


So to bring it quickly down:
Q3: is it possible to configure ceph to behave like named above in my ideal world?
   means to first write n minimal copies to local drives, and deferred the syncing of the other copies into the network
Q4: if not, are there any plans into this direction?
Q5: if possible, is there a good documentation for it?
Q6: we would still like to be able to distribute over racks, enclosures and datacenters
 

best wishes

Hans

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Stickyness of writing vs full network storage writing

2023-10-27 Thread Anthony D'Atri
Ceph is all about strong consistency and data durability.  There can also be a 
distinction between performance of the cluster in aggregate vs a single client, 
especially in a virtualization scenario where to avoid the noisy-neighbor 
dynamic you deliberately throttle iops and bandwidth per client.

> For my discussion I am assuming nowadays PCIe based NVMe drives, which are 
> capable of writing about 8GiB/s, which is about 64GBit/s.

Written how, though?  Benchmarks sometimes are written with 100% sequential 
workloads, top-SKU CPUs that mortals can't afford, and especially with a queue 
depth of like 256.

With most Ceph deployments, the IO a given drive experiences is often pretty 
much random and with lower QD.  And depending on the drive, significant read 
traffic may impact write bandwidth to a degree.  At . Mountpoint (Vancouver 
BC 2018) someone gave a presentation about the difficulties saturating NVMe 
bandwidth.  

> Now considering the situation that you have 5 nodes each has 4 of that drives,
> will make all small and mid-sized companies to go bankrupt ;-) only from 
> buying the corresponding networking switches.

Depending where you get your components...

* You probably don't need "mixed-use" (~3 DWPD) drives, for most purposes "read 
intensive" (~1DWPD) (or less, sometimes) are plenty.  But please please please 
stick with real enterprise-class drives.

* Chassis brands mark up their storage (and RAM) quite a bit.  You can often 
get SSDs elsewhere for half of what they cost from your chassis manufacturer.

>   But the servers hardware is still a simplistic commodity hardware which can 
> saturate the given any given commodity network hardware easily.
> If I want to be able to use full 64GBit/s I would require at least 100GBit/s 
> networking or tons of trunked ports and cabaling with lower bandwidth 
> switches.

Throughput and latency are different things, though.  Also, are you assuming 
here the traditional topology of separate public and 
cluster/private/replication networks?  With modern networking (and Ceph 
releases) that is often overkill and you can leave out the replication network.

Also, would your clients have the same networking provisioned?  If you're 

>   If we now also consider distributing the nodes over racks, building on same 
> location or distributed datacenters, the costs will be even more painfull.

Don't you already have multiple racks?  They don't need to be dedicated only to 
Ceph.

> The ceph commit requirement will be 2 copies on different OSDs (comparable to 
> a mirrored drive) and in total 3 or 4 copies on the cluster (comparable to a 
> RAID with multiple disk redudancy)

Not entirely comparable, but the distinctions mostly don't matter here.

> In all our tests so far, we could not control the behavior of how ceph is 
> persisting this 2 copies. It will always try to persist it somehow over the 
> network.
> Q1: Is this behavior mandatory?

It's a question of how important the data is, and how bad it would be to lose 
some.

>   Our common workload, and afaik nearly all webservice based applications are:
> - a short burst of high bandwidth (e.g. multiple MiB/s or even GiB/s)
> - and probably mostly 1write to 4read or even 1:6 ratio on utilizing the 
> cluster

QLC might help your costs, look into the D5-P5430, D5-P5366, etc.  Though these 
days if you shop smart you can get TLC for close the same cost.  Won't always 
be true though, and you can't get a 60TB TLC SKU ;)

> Hope I could explain the situation here well enough.
> Now assuming my ideal world with ceph:
> if ceph would do:
> 1. commit 2 copies to local drives to the node there ceph client is connected 
> to
> 2. after commit sync (optimized/queued) the data over the network to fulfill 
> the common needs of ceph storage with 4 copies

You could I think craft a CRUSH rule to do that.  Default for replicated pools 
FWIW is 3 copies not 4.

> 3. maybe optionally move 1 copy away from the intial node which still holds 
> the 2 local copies...

I don't know of an elegant way to change placement after the fact.

>   this behaviour would ensure that:
> - the felt performance of the OSD clients will be the full bandwidth of the 
> local NVMes, since 2 copies are delivered to the local NVMes with 64GBit/s 
> and the latency would be comparable as writing locally
> - we would have 2 copies nearly "immediately" reported to any ceph client

I was once told that writes return to the client when min_size copies are 
written; later I was told that it's actually not until all copies are written.

But say we could do this.  Think about what happens if one of those two local 
drives -- or the entire server -- dies.  Before any copies are persisted to 
other servers, or if only one copy is persisted to another server.  You risk 
data loss.

> - bandwidth utilization will be optimized, since we do not duplicate the 
> stored data transfers on the network immediatelly, we defer it from the 
> initial writ

[ceph-users] Re: Join us for the User + Dev Meeting, happening tomorrow!

2023-10-27 Thread Laura Flores
The archive has been updated with latest presentations, as well as the
meeting recording: https://ceph.io/en/community/meetups/user-dev-archive/

On Wed, Oct 18, 2023 at 10:45 AM Laura Flores  wrote:

> Hi Ceph users and developers,
>
> You are invited to join us at the User + Dev meeting tomorrow at 10:00 AM
> EST! See below for more meeting details.
>
> We have two guest speakers joining us tomorrow:
>
> 1. "CRUSH Changes at Scale" by Joshua Baergen, Digital Ocean
> In this talk, Joshua Baergen will discuss the problems that operators
> encounter with CRUSH changes at scale and how DigitalOcean built
> pg-remapper to control and speed up CRUSH-induced backfill.
>
> 2. "CephFS Management with Ceph Dashboard" by Pedro Gonzalez Gomez, IBM
> This talk will demonstrate new Dashboard behavior regarding CephFS
> management.
>
> The last part of the meeting will be dedicated to open discussion. Feel
> free to add questions for the speakers or additional topics under the "Open
> Discussion" section on the agenda:
> https://pad.ceph.com/p/ceph-user-dev-monthly-minutes
>
> If you have an idea for a focus topic you'd like to present at a future
> meeting, you are welcome to submit it to this Google Form:
> https://docs.google.com/forms/d/e/1FAIpQLSdboBhxVoBZoaHm8xSmeBoemuXoV_rmh4vJDGBrp6d-D3-BlQ/viewform?usp=sf_link
> Any Ceph user or developer is eligible to submit!
>
> Thanks,
> Laura Flores
>
> Meeting link: https://meet.jit.si/ceph-user-dev-monthly
>
> Time conversions:
> UTC:   Thursday, October 19, 14:00 UTC
> Mountain View, CA, US: Thursday, October 19,  7:00 PDT
> Phoenix, AZ, US:   Thursday, October 19,  7:00 MST
> Denver, CO, US:Thursday, October 19,  8:00 MDT
> Huntsville, AL, US:Thursday, October 19,  9:00 CDT
> Raleigh, NC, US:   Thursday, October 19, 10:00 EDT
> London, England:   Thursday, October 19, 15:00 BST
> Paris, France: Thursday, October 19, 16:00 CEST
> Helsinki, Finland: Thursday, October 19, 17:00 EEST
> Tel Aviv, Israel:  Thursday, October 19, 17:00 IDT
> Pune, India:   Thursday, October 19, 19:30 IST
> Brisbane, Australia:   Friday, October 20,  0:00 AEST
> Singapore, Asia:   Thursday, October 19, 22:00 +08
> Auckland, New Zealand: Friday, October 20,  3:00 NZDT
>
> --
>
> Laura Flores
>
> She/Her/Hers
>
> Software Engineer, Ceph Storage 
>
> Chicago, IL
>
> lflo...@ibm.com | lflo...@redhat.com 
> M: +17087388804
>
>
>

-- 

Laura Flores

She/Her/Hers

Software Engineer, Ceph Storage 

Chicago, IL

lflo...@ibm.com | lflo...@redhat.com 
M: +17087388804
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Problem with upgrade

2023-10-27 Thread Jorge Garcia
I think I figured it out. The problem was that my ceph.conf file only
listed the first machine in mon_initial_members and in mon_host. I'm not
sure why. I added the other monitors, restarted the monitors and the
managers, and everything is now working as expected. I have now upgraded
all the monitors and all the managers to Pacific and Rocky 9. Now on to the
OSDs. Well, maybe next week...

On Thu, Oct 26, 2023 at 5:37 PM Tyler Stachecki 
wrote:

> On Thu, Oct 26, 2023, 8:11 PM Jorge Garcia  wrote:
>
>> Oh, I meant that "ceph -s" just hangs. I didn't even try to look at the
>> I/O. Maybe I can do that, but the "ceph -s" hang just freaked me out.
>>
>> Also, I know that the recommended order is mon->mgr->osd->mds->rgw, but
>> when you run mgr on the same hardware as the monitors, it's hard to not
>> upgrade both at the same time. Particularly if you're upgrading the whole
>> machine at once. Here's where upgrading to the new container method will
>> help a lot! FWIW, the managers seem to be running fine.
>>
>
> I recently did something like this, so I understand that it's difficult.
> Most of my testing and prep-work was centered around exactly this problem,
> which was avoided by first upgrading mons/mgrs to an interim OS while
> remaining on Octopus -- solely for the purposes of opening an avenue from
> Octopus to Quincy separate from tbe OS upgrade.
>
> In my pre-prod resting, trying to upgrade the mons/mgrs without that
> middle step that allowed mgrs to be upgraded separately did result in `ceph
> -s` locking up. Client I/O remained non-impacted in this state though.
>
> Maybe look at which mgr is active and/or try stopping all but the Octopus
> mgr when stopping the mon as well?
>
> Cheers,
> Tyler
>
>
>> On Thu, Oct 26, 2023 at 4:57 PM Tyler Stachecki <
>> stachecki.ty...@gmail.com> wrote:
>>
>>> On Thu, Oct 26, 2023 at 6:52 PM Jorge Garcia 
>>> wrote:
>>> >
>>> > Hi Tyler,
>>> >
>>> > Maybe you didn't read the full message, but in the message you will
>>> notice that I'm doing exactly that, and the problem just occurred when I
>>> was doing the upgrade from Octopus to Pacific. I'm nowhere near Quincy yet.
>>> The original goal was to move from Nautilus to Quincy, but I have gone to
>>> Octopus (no problems) and now to Pacific (problems).
>>>
>>> I did not, apologies -- though do see my second message about ordering
>>> mon/mgr ordering...
>>>
>>> When you say "the cluster becomes unresponsive" -- does the client I/O
>>> lock up, or do you mean that `ceph -s` and such hangs?
>>>
>>> May help to look to Pacific mons via the asok and see if they respond
>>> in such a state (and their status) if I/O is not locked up and you can
>>> afford to leave it in that state for a couple minutes:
>>> $ ceph daemon mon.name mon_status
>>>
>>> Cheers,
>>> Tyler
>>>
>>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io