[ceph-users] does the RBD client block write when the Watcher times out?

2024-05-22 Thread Yuma Ogami
Hello.

I'm currently verifying the behavior of RBD on failure, and I'm wondering
about the consistency of RBD images after network failures. As a result of
my investigation, I found that RBD sets a watcher on an RBD image when a
client mounts the volume, to prevent multiple mounts. In addition, I found
that if the client is isolated from the network for a long time, the
watcher is released; however, the client still has the image mounted. In
this situation, another client can also mount the image, and if the image
is written from both clients, data corruption occurs. Could you tell me
whether this is a realistic scenario?

I tested the following case, in which the watcher was released by hand,
and detected data corruption:

1. Release the watcher on the node (A) that has the RBD mounted, using the
`ceph osd blocklist add` command
2. Another node (B) mounts the RBD volume.
3. Unblock node (A) using the `ceph osd blocklist rm` command
4. Write from node (B) (the write succeeds)
5. Write from node (A) (the write appears to succeed from the
application's point of view, but in fact it fails)
6. The data written from node (A) is lost.

In this case, I released the watcher by hand to emulate the timeout caused
by a network failure, because I couldn't emulate a real network failure in
this test environment.
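
Roughly, the watcher check and the blocklist steps looked like this (the
pool/image name and the client address are placeholders for my setup):

# check which client currently holds the watch on the image
rbd status rbd/test-image

# emulate the watcher timeout by blocklisting node (A)'s client address
ceph osd blocklist add 192.168.10.11:0/123456789

# later, remove the entry again
ceph osd blocklist rm 192.168.10.11:0/123456789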

I considered using the exclusive-lock feature to restrict write access to
a single node. However, we gave up on that because blocking writes
entirely would make snapshots non-functional.
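
For reference, this is the kind of setup I mean (the image name is a
placeholder, and my understanding of the --exclusive map option may be
incomplete):

# enable the exclusive-lock feature on the image
rbd feature enable rbd/test-image exclusive-lock
# map it so the lock is not handed over to other clients automatically
rbd map rbd/test-image --exclusive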

The version of Ceph we are using is v17.2.6.

Best regards,
Yuma.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephfs-data-scan orphan objects while mds active?

2024-05-22 Thread Olli Rajala
Hmm... it seems I might have been blinded and was looking in the wrong place.

I did some scripting and took a look at the "parent" xattrs of all the *.
objects on the pool. Nothing funky there, and no files with a backtrace
pointing to that deleted folder. There is no considerable number of these
inode object "sequences" with a missing . chunk either. So it's probably
not orphan objects at this level then :|
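
For the record, the per-object check was essentially this, repeated over
the pool (the object name is just an example; ceph-dencoder decodes the
backtrace stored in the xattr):

rados -p cephfs_ec22hdd_data getxattr 10000000000.00000000 parent > /tmp/parent
ceph-dencoder type inode_backtrace_t import /tmp/parent decode dump_json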

...and then I noticed that there is quite a considerable number of clones
despite there being no snapshots - or can there be some other reason for that?

# rados -p cephfs_ec22hdd_data lssnap
0 snaps
# rados -p cephfs_ec22hdd_data df
POOL_NAME            USED     OBJECTS   CLONES   COPIES     MISSING_ON_PRIMARY  UNFOUND  DEGRADED  RD_OPS  RD      WR_OPS     WR       USED COMPR  UNDER COMPR
cephfs_ec22hdd_data  179 TiB  68334318  8399291  273337272                   0        0         0   70158  86 TiB  234691728  117 TiB         0 B          0 B
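
A per-object view of which snapshot IDs those clones belong to can be had
with listsnaps (the object name is just an example):

rados -p cephfs_ec22hdd_data listsnaps 10000000000.00000000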

Is there some way to force these to get trimmed?

tnx,
---
Olli Rajala - Lead TD
Anima Vitae Ltd.
www.anima.fi
---


On Fri, May 17, 2024 at 6:48 AM Gregory Farnum  wrote:

> It's unfortunately more complicated than that. I don't think that
> forward scrub tag gets persisted to the raw objects; it's just a
> notation for you. And even if it was, it would only be on the first
> object in every file — larger files would have many more objects that
> forward scrub doesn't touch.
>
> This isn't a case anybody has really built tooling for. Your best bet
> is probably to live with the data leakage, or else find a time to turn
> it off and run the data-scan tools.
> -Greg
>
> On Tue, May 14, 2024 at 10:26 AM Olli Rajala  wrote:
> >
> > Tnx Gregory,
> >
> > Doesn't sound too safe then.
> >
> > The only reason to discover these orphans via scanning would be to delete
> > them again, and I know all these files were at least one year old... so I
> > wonder if I could somehow do something like:
> > 1) do a forward scrub with a custom tag
> > 2) iterate over all the objects in the pool and delete every object
> > without the tag that is older than one year
> >
> > Is there any tooling to do such an operation? Any risks or flawed logic
> > there?
> >
> > ...or any other ways to discover and get rid of these objects?
> >
> > Cheers!
> > ---
> > Olli Rajala - Lead TD
> > Anima Vitae Ltd.
> > www.anima.fi
> > ---
> >
> >
> > On Tue, May 14, 2024 at 9:41 AM Gregory Farnum 
> wrote:
> >
> > > The cephfs-data-scan tools are built with the expectation that they'll
> > > be run offline. Some portion of them could be run without damaging the
> > > live filesystem (NOT all, and I'd have to dig in to check which is
> > > which), but they will detect inconsistencies that don't really exist
> > > (due to updates that are committed to the journal but not fully
> > > flushed out to backing objects) and so I don't think it would do any
> > > good.
> > > -Greg
> > >
> > > On Mon, May 13, 2024 at 4:33 AM Olli Rajala 
> wrote:
> > > >
> > > > Hi,
> > > >
> > > > I suspect that I have some orphan objects on a data pool after quite
> > > > haphazardly evicting and removing a cache pool after deleting 17 TB
> > > > of files from cephfs. I have forward scrubbed the MDS and the
> > > > filesystem is in a clean state.
> > > >
> > > > This is a production system and I'm curious whether it would be safe
> > > > to run cephfs-data-scan scan_extents and scan_inodes while the fs is
> > > > online. Does it help if I give a custom tag while forward scrubbing
> > > > and then use --filter-tag on the backward scans?
> > > >
> > > > ...or is there some other way to check and cleanup orphans?
> > > >
> > > > tnx,
> > > > ---
> > > > Olli Rajala - Lead TD
> > > > Anima Vitae Ltd.
> > > > www.anima.fi
> > > > ---
> > > > ___
> > > > ceph-users mailing list -- ceph-users@ceph.io
> > > > To unsubscribe send an email to ceph-users-le...@ceph.io
> > > >
> > > ___
> > > ceph-users mailing list -- ceph-users@ceph.io
> > > To unsubscribe send an email to ceph-users-le...@ceph.io
> > >
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] User + Dev Meetup Tomorrow!

2024-05-22 Thread Laura Flores
Hi all,

The User + Dev Meetup will be held tomorrow at 10:00 AM EDT. We will be
discussing the results of the latest survey, and users who attend will have
the opportunity to provide additional feedback in real time.

See you there!
Laura Flores

Meeting Details:
https://www.meetup.com/ceph-user-group/events/300883526/

-- 

Laura Flores

She/Her/Hers

Software Engineer, Ceph Storage 

Chicago, IL

lflo...@ibm.com | lflo...@redhat.com 
M: +17087388804
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephadm bootstraps cluster with bad CRUSH map(?)

2024-05-22 Thread Matthew Vernon

Hi,

On 22/05/2024 12:44, Eugen Block wrote:


> you can specify the entire tree in the location statement, if you need to:


[snip]

Brilliant, that's just the ticket, thank you :)


> This should be made a bit clearer in the docs [0], I added Zac.


I've opened an MR to update the docs; I hope it's at least useful as a
starter-for-ten:

https://github.com/ceph/ceph/pull/57633

Thanks,

Matthew
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Reef RGWs stop processing requests

2024-05-22 Thread Enrico Bocchi

Hi Iain,

Can you check if it relates to this? -- 
https://tracker.ceph.com/issues/63373

There is a bug in bulk object deletion that causes the RGWs to deadlock.
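
If you want to confirm you are hitting the same deadlock, one generic way
(a sketch, not RGW-specific tooling; it assumes a single radosgw process on
the host) is to grab thread backtraces from the hung radosgw and compare
them with the stacks discussed in the tracker:

gdb -p "$(pgrep -f radosgw)" -batch -ex "thread apply all bt" > rgw_backtraces.txt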

Cheers,
Enrico


On 5/17/24 11:24, Iain Stott wrote:

Hi,

We are running 3 clusters in multisite. All 3 were running Quincy 17.2.6 and 
using cephadm. We upgraded one of the secondary sites to Reef 18.2.1 a couple 
of weeks ago and were planning on doing the rest shortly afterwards.

We run 3 RGW daemons on separate physical hosts behind an external HAProxy HA 
pair for each cluster.

Since we upgraded to Reef we have had issues with the RGWs stopping processing 
requests. We can see that they don't crash, as they still have entries in the 
logs about syncing, but as far as request processing goes, they just stop. 
While debugging this we have 1 of the 3 RGWs running a Quincy image, and this 
has never had an issue where it stops processing requests. Any Reef containers 
we deploy have always stopped within 48 hours of being deployed. We have tried 
Reef versions 18.2.1, 18.2.2 and 18.1.3 and all exhibit the same issue. We are 
running podman 4.6.1 on CentOS 8 with kernel 4.18.0-513.24.1.el8_9.x86_64.

We have enabled debug logs for the RGWs but we have been unable to find 
anything in them that would shed light on the cause.

We are just wondering if anyone has any ideas on what could be causing this or 
how to debug it further?

Thanks
Iain

Iain Stott
OpenStack Engineer
iain.st...@thg.com
www.thg.com
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


--
Enrico Bocchi
CERN European Laboratory for Particle Physics
IT - Storage & Data Management  - General Storage Services
Mailbox: G20500 - Office: 31-2-010
1211 Genève 23
Switzerland
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephadm bootstraps cluster with bad CRUSH map(?)

2024-05-22 Thread Eugen Block

Hi,

you can specify the entire tree in the location statement, if you need to:

ceph:~ # cat host-spec.yaml
service_type: host
hostname: ceph
addr: 
location:
  root: default
  rack: rack2


and after the bootstrap it looks as expected:

ceph:~ # ceph osd tree
ID  CLASS  WEIGHT  TYPE NAME   STATUS  REWEIGHT  PRI-AFF
-1  0  root default
-3  0  rack rack2
-2  0  host ceph


This should be made a bit clearer in the docs [0], I added Zac.

Regards,
Eugen

[0]  
https://docs.ceph.com/en/latest/cephadm/host-management/#setting-the-initial-crush-location-of-host
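
As a side note, a spec file like the one above can also be fed to bootstrap
directly (a sketch; the mon IP is a placeholder):

cephadm bootstrap --mon-ip <mon-ip> --apply-spec host-spec.yaml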


Quoting Matthew Vernon:


Hi,

Returning to this, it looks like the issue wasn't to do with how
osd_crush_chooseleaf_type was set; I destroyed and re-created my cluster as
before, and I have the same problem again:


pg 1.0 is stuck inactive for 10m, current state unknown, last acting []

as before, ceph osd tree:

root@moss-be1001:/# ceph osd tree
ID  CLASS  WEIGHT     TYPE NAME             STATUS  REWEIGHT  PRI-AFF
-7         176.11194  rack F3
-6         176.11194      host moss-be1003
13  hdd      7.33800          osd.13            up   1.00000  1.00000
15  hdd      7.33800          osd.15            up   1.00000  1.00000

And checking the crushmap, the default bucket is again empty:

root default {
        id -1           # do not change unnecessarily
        id -14 class hdd        # do not change unnecessarily
        # weight 0.0
        alg straw2
        hash 0  # rjenkins1
}

[by way of confirming that I didn't accidentally leave the old  
config fragment lying around, the replication rule has:

step chooseleaf firstn 0 type host
]

So it looks like setting location: in my spec is breaking the  
cluster bootstrap - the hosts aren't put into default, but neither  
are the declared racks. As a reminder, that spec has host entries  
like:


service_type: host
hostname: moss-be1003
addr: 10.64.136.22
location:
  rack: F3
labels:
  - _admin
  - NVMe

Is this expected behaviour? Presumably I can fix the cluster by  
using "ceph osd crush move F3 root=default" and similar for the  
others, but is there a way to have what I want done by cephadm  
bootstrap?
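
A sketch of that manual fix (only F3 is shown above; any other rack names
would follow the same pattern and are assumptions here):

ceph osd crush move F3 root=default
# ...and likewise for each other rack defined in the spec, e.g.:
# ceph osd crush move F4 root=default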


Thanks,

Matthew
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How network latency affects ceph performance really with NVME only storage?

2024-05-22 Thread Frank Schilder
Hi Stefan,

Ahh OK, I misunderstood your e-mail. It sounded like it was a custom profile,
not a standard one shipped with tuned.

Thanks for the clarification!
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Stefan Bauer 
Sent: Wednesday, May 22, 2024 12:44 PM
To: ceph-users@ceph.io
Subject: [ceph-users] Re: How network latency affects ceph performance really 
with NVME only storage?

Hi Frank,

it's pretty straightforward. Just follow the steps:

apt install tuned

tuned-adm profile network-latency

According to [1]:

network-latency
    A server profile focused on lowering network latency.
    This profile favors performance over power savings by setting
    intel_pstate and min_perf_pct=100. It disables transparent huge
    pages and automatic NUMA balancing. It also uses cpupower to set
    the performance cpufreq governor, and requests a cpu_dma_latency
    value of 1. It also sets busy_read and busy_poll times to 50 μs,
    and tcp_fastopen to 3.

[1]
https://access.redhat.com/documentation/de-de/red_hat_enterprise_linux/7/html/performance_tuning_guide/sect-red_hat_enterprise_linux-performance_tuning_guide-tool_reference-tuned_adm

Cheers.

Stefan

On 22.05.24 at 12:18, Frank Schilder wrote:
> Hi Stefan,
>
> can you provide a link to or copy of the contents of the tuned-profile so 
> others can also profit from it?
>
> Thanks!
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Stefan Bauer
> Sent: Wednesday, May 22, 2024 10:51 AM
> To: Anthony D'Atri;ceph-users@ceph.io
> Subject: [ceph-users] Re: How network latency affects ceph performance really 
> with NVME only storage?
>
> Hi Anthony and others,
>
> thank you for your reply.  To be honest, I'm not even looking for a
> solution, i just wanted to ask if latency affects the performance at all
> in my case and how others handle this ;)
>
> One of our partners delivered a solution with a latency-optimized
> profile for tuned-daemon. Now the latency is much better:
>
> apt install tuned
>
> tuned-adm profile network-latency
>
> # ping 10.1.4.13
> PING 10.1.4.13 (10.1.4.13) 56(84) bytes of data.
> 64 bytes from 10.1.4.13: icmp_seq=1 ttl=64 time=0.047 ms
> 64 bytes from 10.1.4.13: icmp_seq=2 ttl=64 time=0.028 ms
> 64 bytes from 10.1.4.13: icmp_seq=3 ttl=64 time=0.025 ms
> 64 bytes from 10.1.4.13: icmp_seq=4 ttl=64 time=0.020 ms
> 64 bytes from 10.1.4.13: icmp_seq=5 ttl=64 time=0.023 ms
> 64 bytes from 10.1.4.13: icmp_seq=6 ttl=64 time=0.026 ms
> 64 bytes from 10.1.4.13: icmp_seq=7 ttl=64 time=0.024 ms
> 64 bytes from 10.1.4.13: icmp_seq=8 ttl=64 time=0.023 ms
> 64 bytes from 10.1.4.13: icmp_seq=9 ttl=64 time=0.033 ms
> 64 bytes from 10.1.4.13: icmp_seq=10 ttl=64 time=0.021 ms
> ^C
> --- 10.1.4.13 ping statistics ---
> 10 packets transmitted, 10 received, 0% packet loss, time 9001ms
> rtt min/avg/max/mdev = 0.020/0.027/0.047/0.007 ms
>
> Am 21.05.24 um 15:08 schrieb Anthony D'Atri:
>> Check the netmask on your interfaces, is it possible that you're sending 
>> inter-node traffic up and back down needlessly?
>>
>>> On May 21, 2024, at 06:02, Stefan Bauer  wrote:
>>>
>>> Dear Users,
>>>
>>> i recently setup a new ceph 3 node cluster. Network is meshed between all 
>>> nodes (2 x 25G with DAC).
>>> Storage is flash only (Kioxia 3.2 TBBiCS FLASH 3D TLC, KCMYXVUG3T20)
>>>
>>> The latency with ping tests between the nodes shows:
>>>
>>> # ping 10.1.3.13
>>> PING 10.1.3.13 (10.1.3.13) 56(84) bytes of data.
>>> 64 bytes from 10.1.3.13: icmp_seq=1 ttl=64 time=0.145 ms
>>> 64 bytes from 10.1.3.13: icmp_seq=2 ttl=64 time=0.180 ms
>>> 64 bytes from 10.1.3.13: icmp_seq=3 ttl=64 time=0.180 ms
>>> 64 bytes from 10.1.3.13: icmp_seq=4 ttl=64 time=0.115 ms
>>> 64 bytes from 10.1.3.13: icmp_seq=5 ttl=64 time=0.110 ms
>>> 64 bytes from 10.1.3.13: icmp_seq=6 ttl=64 time=0.120 ms
>>> 64 bytes from 10.1.3.13: icmp_seq=7 ttl=64 time=0.124 ms
>>> 64 bytes from 10.1.3.13: icmp_seq=8 ttl=64 time=0.140 ms
>>> 64 bytes from 10.1.3.13: icmp_seq=9 ttl=64 time=0.127 ms
>>> 64 bytes from 10.1.3.13: icmp_seq=10 ttl=64 time=0.143 ms
>>> 64 bytes from 10.1.3.13: icmp_seq=11 ttl=64 time=0.129 ms
>>> --- 10.1.3.13 ping statistics ---
>>> 11 packets transmitted, 11 received, 0% packet loss, time 10242ms
>>> rtt min/avg/max/mdev = 0.110/0.137/0.180/0.022 ms
>>>
>>>
>>> On another cluster i have much better values, with 10G SFP+ and 
>>> fibre-cables:
>>>
>>> 64 bytes from large-ipv6-ip: icmp_seq=42 ttl=64 time=0.081 ms
>>> 64 bytes from large-ipv6-ip: icmp_seq=43 ttl=64 time=0.078 ms
>>> 64 bytes from large-ipv6-ip: icmp_seq=44 ttl=64 time=0.084 ms
>>> 64 bytes from large-ipv6-ip: icmp_seq=45 ttl=64 time=0.075 ms
>>> 64 bytes from large-ipv6-ip: icmp_seq=46 ttl=64 time=0.071 ms
>>> 64 bytes from large-ipv6-ip: icmp_seq=47 ttl=64 time=0.081 ms
>>> 64 bytes from large-ipv6-ip: icmp_seq=48 ttl=64 time=0.074 ms

[ceph-users] Re: How network latency affects ceph performance really with NVME only storage?

2024-05-22 Thread Stefan Bauer

Hi Frank,

it's pretty straightforward. Just follow the steps:

apt install tuned

tuned-adm profile network-latency

According to [1]:

network-latency
    A server profile focused on lowering network latency.
    This profile favors performance over power savings by setting
    intel_pstate and min_perf_pct=100. It disables transparent huge
    pages and automatic NUMA balancing. It also uses cpupower to set
    the performance cpufreq governor, and requests a cpu_dma_latency
    value of 1. It also sets busy_read and busy_poll times to 50 μs,
    and tcp_fastopen to 3.

[1] 
https://access.redhat.com/documentation/de-de/red_hat_enterprise_linux/7/html/performance_tuning_guide/sect-red_hat_enterprise_linux-performance_tuning_guide-tool_reference-tuned_adm
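
To double-check that the profile is actually active and that the expected
settings were applied, something like this can be used (the sysfs path is
the usual location for THP; verify on your distro):

tuned-adm active    # should report: Current active profile: network-latency
tuned-adm verify    # re-checks that the profile settings are still in place
cat /sys/kernel/mm/transparent_hugepage/enabled    # expect [never]
cpupower frequency-info --policy    # governor should be "performance"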


Cheers.

Stefan

On 22.05.24 at 12:18, Frank Schilder wrote:

Hi Stefan,

can you provide a link to or copy of the contents of the tuned-profile so 
others can also profit from it?

Thanks!
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Stefan Bauer
Sent: Wednesday, May 22, 2024 10:51 AM
To: Anthony D'Atri;ceph-users@ceph.io
Subject: [ceph-users] Re: How network latency affects ceph performance really 
with NVME only storage?

Hi Anthony and others,

thank you for your reply. To be honest, I'm not even looking for a
solution, I just wanted to ask if latency affects the performance at all
in my case and how others handle this ;)

One of our partners delivered a solution with a latency-optimized
profile for tuned-daemon. Now the latency is much better:

apt install tuned

tuned-adm profile network-latency

# ping 10.1.4.13
PING 10.1.4.13 (10.1.4.13) 56(84) bytes of data.
64 bytes from 10.1.4.13: icmp_seq=1 ttl=64 time=0.047 ms
64 bytes from 10.1.4.13: icmp_seq=2 ttl=64 time=0.028 ms
64 bytes from 10.1.4.13: icmp_seq=3 ttl=64 time=0.025 ms
64 bytes from 10.1.4.13: icmp_seq=4 ttl=64 time=0.020 ms
64 bytes from 10.1.4.13: icmp_seq=5 ttl=64 time=0.023 ms
64 bytes from 10.1.4.13: icmp_seq=6 ttl=64 time=0.026 ms
64 bytes from 10.1.4.13: icmp_seq=7 ttl=64 time=0.024 ms
64 bytes from 10.1.4.13: icmp_seq=8 ttl=64 time=0.023 ms
64 bytes from 10.1.4.13: icmp_seq=9 ttl=64 time=0.033 ms
64 bytes from 10.1.4.13: icmp_seq=10 ttl=64 time=0.021 ms
^C
--- 10.1.4.13 ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 9001ms
rtt min/avg/max/mdev = 0.020/0.027/0.047/0.007 ms

On 21.05.24 at 15:08, Anthony D'Atri wrote:

Check the netmask on your interfaces; is it possible that you're sending 
inter-node traffic up and back down needlessly?


On May 21, 2024, at 06:02, Stefan Bauer  wrote:

Dear Users,

I recently set up a new Ceph 3-node cluster. The network is meshed between all 
nodes (2 x 25G with DAC).
Storage is flash only (Kioxia 3.2 TB BiCS FLASH 3D TLC, KCMYXVUG3T20)

The latency with ping tests between the nodes shows:

# ping 10.1.3.13
PING 10.1.3.13 (10.1.3.13) 56(84) bytes of data.
64 bytes from 10.1.3.13: icmp_seq=1 ttl=64 time=0.145 ms
64 bytes from 10.1.3.13: icmp_seq=2 ttl=64 time=0.180 ms
64 bytes from 10.1.3.13: icmp_seq=3 ttl=64 time=0.180 ms
64 bytes from 10.1.3.13: icmp_seq=4 ttl=64 time=0.115 ms
64 bytes from 10.1.3.13: icmp_seq=5 ttl=64 time=0.110 ms
64 bytes from 10.1.3.13: icmp_seq=6 ttl=64 time=0.120 ms
64 bytes from 10.1.3.13: icmp_seq=7 ttl=64 time=0.124 ms
64 bytes from 10.1.3.13: icmp_seq=8 ttl=64 time=0.140 ms
64 bytes from 10.1.3.13: icmp_seq=9 ttl=64 time=0.127 ms
64 bytes from 10.1.3.13: icmp_seq=10 ttl=64 time=0.143 ms
64 bytes from 10.1.3.13: icmp_seq=11 ttl=64 time=0.129 ms
--- 10.1.3.13 ping statistics ---
11 packets transmitted, 11 received, 0% packet loss, time 10242ms
rtt min/avg/max/mdev = 0.110/0.137/0.180/0.022 ms


On another cluster i have much better values, with 10G SFP+ and fibre-cables:

64 bytes from large-ipv6-ip: icmp_seq=42 ttl=64 time=0.081 ms
64 bytes from large-ipv6-ip: icmp_seq=43 ttl=64 time=0.078 ms
64 bytes from large-ipv6-ip: icmp_seq=44 ttl=64 time=0.084 ms
64 bytes from large-ipv6-ip: icmp_seq=45 ttl=64 time=0.075 ms
64 bytes from large-ipv6-ip: icmp_seq=46 ttl=64 time=0.071 ms
64 bytes from large-ipv6-ip: icmp_seq=47 ttl=64 time=0.081 ms
64 bytes from large-ipv6-ip: icmp_seq=48 ttl=64 time=0.074 ms
64 bytes from large-ipv6-ip: icmp_seq=49 ttl=64 time=0.085 ms
64 bytes from large-ipv6-ip: icmp_seq=50 ttl=64 time=0.077 ms
64 bytes from large-ipv6-ip: icmp_seq=51 ttl=64 time=0.080 ms
64 bytes from large-ipv6-ip: icmp_seq=52 ttl=64 time=0.084 ms
64 bytes from large-ipv6-ip: icmp_seq=53 ttl=64 time=0.084 ms
^C
--- long-ipv6-ip ping statistics ---
53 packets transmitted, 53 received, 0% packet loss, time 53260ms
rtt min/avg/max/mdev = 0.071/0.082/0.111/0.006 ms

If I want best performance, does the latency difference matter at all? Should I 
change DAC to SFP transceivers with fibre cables to improve overall Ceph 
performance, or is this nitpicking?

Thanks a lot.

Stefan

[ceph-users] Re: How network latency affects ceph performance really with NVME only storage?

2024-05-22 Thread Frank Schilder
Hi Stefan,

can you provide a link to or copy of the contents of the tuned-profile so 
others can also profit from it?

Thanks!
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Stefan Bauer 
Sent: Wednesday, May 22, 2024 10:51 AM
To: Anthony D'Atri; ceph-users@ceph.io
Subject: [ceph-users] Re: How network latency affects ceph performance really 
with NVME only storage?

Hi Anthony and others,

thank you for your reply. To be honest, I'm not even looking for a
solution, I just wanted to ask if latency affects the performance at all
in my case and how others handle this ;)

One of our partners delivered a solution with a latency-optimized
profile for tuned-daemon. Now the latency is much better:

apt install tuned

tuned-adm profile network-latency

# ping 10.1.4.13
PING 10.1.4.13 (10.1.4.13) 56(84) bytes of data.
64 bytes from 10.1.4.13: icmp_seq=1 ttl=64 time=0.047 ms
64 bytes from 10.1.4.13: icmp_seq=2 ttl=64 time=0.028 ms
64 bytes from 10.1.4.13: icmp_seq=3 ttl=64 time=0.025 ms
64 bytes from 10.1.4.13: icmp_seq=4 ttl=64 time=0.020 ms
64 bytes from 10.1.4.13: icmp_seq=5 ttl=64 time=0.023 ms
64 bytes from 10.1.4.13: icmp_seq=6 ttl=64 time=0.026 ms
64 bytes from 10.1.4.13: icmp_seq=7 ttl=64 time=0.024 ms
64 bytes from 10.1.4.13: icmp_seq=8 ttl=64 time=0.023 ms
64 bytes from 10.1.4.13: icmp_seq=9 ttl=64 time=0.033 ms
64 bytes from 10.1.4.13: icmp_seq=10 ttl=64 time=0.021 ms
^C
--- 10.1.4.13 ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 9001ms
rtt min/avg/max/mdev = 0.020/0.027/0.047/0.007 ms

On 21.05.24 at 15:08, Anthony D'Atri wrote:
> Check the netmask on your interfaces, is it possible that you're sending 
> inter-node traffic up and back down needlessly?
>
>> On May 21, 2024, at 06:02, Stefan Bauer  wrote:
>>
>> Dear Users,
>>
>> i recently setup a new ceph 3 node cluster. Network is meshed between all 
>> nodes (2 x 25G with DAC).
>> Storage is flash only (Kioxia 3.2 TBBiCS FLASH 3D TLC, KCMYXVUG3T20)
>>
>> The latency with ping tests between the nodes shows:
>>
>> # ping 10.1.3.13
>> PING 10.1.3.13 (10.1.3.13) 56(84) bytes of data.
>> 64 bytes from 10.1.3.13: icmp_seq=1 ttl=64 time=0.145 ms
>> 64 bytes from 10.1.3.13: icmp_seq=2 ttl=64 time=0.180 ms
>> 64 bytes from 10.1.3.13: icmp_seq=3 ttl=64 time=0.180 ms
>> 64 bytes from 10.1.3.13: icmp_seq=4 ttl=64 time=0.115 ms
>> 64 bytes from 10.1.3.13: icmp_seq=5 ttl=64 time=0.110 ms
>> 64 bytes from 10.1.3.13: icmp_seq=6 ttl=64 time=0.120 ms
>> 64 bytes from 10.1.3.13: icmp_seq=7 ttl=64 time=0.124 ms
>> 64 bytes from 10.1.3.13: icmp_seq=8 ttl=64 time=0.140 ms
>> 64 bytes from 10.1.3.13: icmp_seq=9 ttl=64 time=0.127 ms
>> 64 bytes from 10.1.3.13: icmp_seq=10 ttl=64 time=0.143 ms
>> 64 bytes from 10.1.3.13: icmp_seq=11 ttl=64 time=0.129 ms
>> --- 10.1.3.13 ping statistics ---
>> 11 packets transmitted, 11 received, 0% packet loss, time 10242ms
>> rtt min/avg/max/mdev = 0.110/0.137/0.180/0.022 ms
>>
>>
>> On another cluster i have much better values, with 10G SFP+ and fibre-cables:
>>
>> 64 bytes from large-ipv6-ip: icmp_seq=42 ttl=64 time=0.081 ms
>> 64 bytes from large-ipv6-ip: icmp_seq=43 ttl=64 time=0.078 ms
>> 64 bytes from large-ipv6-ip: icmp_seq=44 ttl=64 time=0.084 ms
>> 64 bytes from large-ipv6-ip: icmp_seq=45 ttl=64 time=0.075 ms
>> 64 bytes from large-ipv6-ip: icmp_seq=46 ttl=64 time=0.071 ms
>> 64 bytes from large-ipv6-ip: icmp_seq=47 ttl=64 time=0.081 ms
>> 64 bytes from large-ipv6-ip: icmp_seq=48 ttl=64 time=0.074 ms
>> 64 bytes from large-ipv6-ip: icmp_seq=49 ttl=64 time=0.085 ms
>> 64 bytes from large-ipv6-ip: icmp_seq=50 ttl=64 time=0.077 ms
>> 64 bytes from large-ipv6-ip: icmp_seq=51 ttl=64 time=0.080 ms
>> 64 bytes from large-ipv6-ip: icmp_seq=52 ttl=64 time=0.084 ms
>> 64 bytes from large-ipv6-ip: icmp_seq=53 ttl=64 time=0.084 ms
>> ^C
>> --- long-ipv6-ip ping statistics ---
>> 53 packets transmitted, 53 received, 0% packet loss, time 53260ms
>> rtt min/avg/max/mdev = 0.071/0.082/0.111/0.006 ms
>>
>> If i want best performance, does the latency difference matter at all? 
>> Should i change DAC to SFP-transceivers wwith fibre-cables to improve 
>> overall ceph performance or is this nitpicking?
>>
>> Thanks a lot.
>>
>> Stefan
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io

--
Kind regards

Stefan Bauer
Schulstraße 5
83308 Trostberg
0179-1194767
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How network latency affects ceph performance really with NVME only storage?

2024-05-22 Thread Stefan Bauer

Hi Anthony and others,

thank you for your reply. To be honest, I'm not even looking for a 
solution, I just wanted to ask if latency affects the performance at all 
in my case and how others handle this ;)


One of our partners delivered a solution with a latency-optimized 
profile for tuned-daemon. Now the latency is much better:


apt install tuned

tuned-adm profile network-latency

# ping 10.1.4.13
PING 10.1.4.13 (10.1.4.13) 56(84) bytes of data.
64 bytes from 10.1.4.13: icmp_seq=1 ttl=64 time=0.047 ms
64 bytes from 10.1.4.13: icmp_seq=2 ttl=64 time=0.028 ms
64 bytes from 10.1.4.13: icmp_seq=3 ttl=64 time=0.025 ms
64 bytes from 10.1.4.13: icmp_seq=4 ttl=64 time=0.020 ms
64 bytes from 10.1.4.13: icmp_seq=5 ttl=64 time=0.023 ms
64 bytes from 10.1.4.13: icmp_seq=6 ttl=64 time=0.026 ms
64 bytes from 10.1.4.13: icmp_seq=7 ttl=64 time=0.024 ms
64 bytes from 10.1.4.13: icmp_seq=8 ttl=64 time=0.023 ms
64 bytes from 10.1.4.13: icmp_seq=9 ttl=64 time=0.033 ms
64 bytes from 10.1.4.13: icmp_seq=10 ttl=64 time=0.021 ms
^C
--- 10.1.4.13 ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 9001ms
rtt min/avg/max/mdev = 0.020/0.027/0.047/0.007 ms
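
As a side note, whether such a difference is visible end to end can be
checked with Ceph's own latency counters and a small latency-bound
benchmark (the pool name is just a placeholder):

ceph osd perf                                  # per-OSD commit/apply latency
rados -p testpool bench 10 write -b 4096 -t 1  # single-threaded 4 KiB writes are latency-bound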

On 21.05.24 at 15:08, Anthony D'Atri wrote:

Check the netmask on your interfaces; is it possible that you're sending 
inter-node traffic up and back down needlessly?


On May 21, 2024, at 06:02, Stefan Bauer  wrote:

Dear Users,

I recently set up a new Ceph 3-node cluster. The network is meshed between all 
nodes (2 x 25G with DAC).
Storage is flash only (Kioxia 3.2 TB BiCS FLASH 3D TLC, KCMYXVUG3T20)

The latency with ping tests between the nodes shows:

# ping 10.1.3.13
PING 10.1.3.13 (10.1.3.13) 56(84) bytes of data.
64 bytes from 10.1.3.13: icmp_seq=1 ttl=64 time=0.145 ms
64 bytes from 10.1.3.13: icmp_seq=2 ttl=64 time=0.180 ms
64 bytes from 10.1.3.13: icmp_seq=3 ttl=64 time=0.180 ms
64 bytes from 10.1.3.13: icmp_seq=4 ttl=64 time=0.115 ms
64 bytes from 10.1.3.13: icmp_seq=5 ttl=64 time=0.110 ms
64 bytes from 10.1.3.13: icmp_seq=6 ttl=64 time=0.120 ms
64 bytes from 10.1.3.13: icmp_seq=7 ttl=64 time=0.124 ms
64 bytes from 10.1.3.13: icmp_seq=8 ttl=64 time=0.140 ms
64 bytes from 10.1.3.13: icmp_seq=9 ttl=64 time=0.127 ms
64 bytes from 10.1.3.13: icmp_seq=10 ttl=64 time=0.143 ms
64 bytes from 10.1.3.13: icmp_seq=11 ttl=64 time=0.129 ms
--- 10.1.3.13 ping statistics ---
11 packets transmitted, 11 received, 0% packet loss, time 10242ms
rtt min/avg/max/mdev = 0.110/0.137/0.180/0.022 ms


On another cluster i have much better values, with 10G SFP+ and fibre-cables:

64 bytes from large-ipv6-ip: icmp_seq=42 ttl=64 time=0.081 ms
64 bytes from large-ipv6-ip: icmp_seq=43 ttl=64 time=0.078 ms
64 bytes from large-ipv6-ip: icmp_seq=44 ttl=64 time=0.084 ms
64 bytes from large-ipv6-ip: icmp_seq=45 ttl=64 time=0.075 ms
64 bytes from large-ipv6-ip: icmp_seq=46 ttl=64 time=0.071 ms
64 bytes from large-ipv6-ip: icmp_seq=47 ttl=64 time=0.081 ms
64 bytes from large-ipv6-ip: icmp_seq=48 ttl=64 time=0.074 ms
64 bytes from large-ipv6-ip: icmp_seq=49 ttl=64 time=0.085 ms
64 bytes from large-ipv6-ip: icmp_seq=50 ttl=64 time=0.077 ms
64 bytes from large-ipv6-ip: icmp_seq=51 ttl=64 time=0.080 ms
64 bytes from large-ipv6-ip: icmp_seq=52 ttl=64 time=0.084 ms
64 bytes from large-ipv6-ip: icmp_seq=53 ttl=64 time=0.084 ms
^C
--- long-ipv6-ip ping statistics ---
53 packets transmitted, 53 received, 0% packet loss, time 53260ms
rtt min/avg/max/mdev = 0.071/0.082/0.111/0.006 ms

If I want best performance, does the latency difference matter at all? Should I 
change DAC to SFP transceivers with fibre cables to improve overall Ceph 
performance, or is this nitpicking?

Thanks a lot.

Stefan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


--
Kind regards

Stefan Bauer
Schulstraße 5
83308 Trostberg
0179-1194767
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CephFS as Offline Storage

2024-05-22 Thread Joachim Kraftmayer
I have already installed multiple one-node Ceph clusters with CephFS for
non-production workloads in the last few years.
I had no major issues, e.g. once a broken HDD. The question is what kind of EC
or replication you will use. Also, I only ever powered off the node in a clean
and healthy state ;-)

What would interest me is pausing the cluster with "ceph osd pause" and
sending the nodes to hibernate.
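
A rough sketch of the sequence I have in mind (the flags follow the advice
quoted below; whether hibernate/resume behaves cleanly is exactly the open
question):

# before powering down / hibernating
ceph osd set noout
ceph osd set norebalance
ceph osd set nobackfill
ceph osd pause            # stop client I/O entirely (sets pauserd/pausewr)
systemctl hibernate       # or a clean poweroff

# after the node is back up and all OSDs are in
ceph osd unpause
ceph osd unset nobackfill
ceph osd unset norebalance
ceph osd unset noout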

Joachim


Am Mi., 22. Mai 2024 um 07:55 Uhr schrieb Matthias Ferdinand <
mf+ml.c...@mfedv.net>:

> On Tue, May 21, 2024 at 08:54:26PM +, Eugen Block wrote:
> > It’s usually no problem to shut down a cluster. Set at least the noout
> flag,
> > the other flags like norebalance, nobackfill etc won’t hurt either. Then
> > shut down the servers. I do that all the time with test clusters (they do
> > have data, just not important at all), and I’ve never had data loss after
> > powering them back on. When all OSDs are up, unset the flags and let it
> > scrub. Usually, the (deep-)scrubbing will start almost immediately.
>
> One surprise I have with the Ubuntu test cluster (non-containerized,
> Ubuntu packages) that I regularly shut down is that the signals from log
> rotation (I assume) to Ceph daemons interfere with Ceph startup. When
> rebooted after some days, it is the same on all nodes: no Ceph daemon is
> running.
> Workaround: another reboot
>
> Matthias
>
> >
> > Zitat von "adam.ther" :
> >
> > > Thanks guys,
> > >
> > > I think I'll just risk it since it's just for backup, then write
> > > something up later as a follow-up on what happens, in case others want
> > > to do similar. I agree it's not typical; I'm a bit of an odd-duck
> > > data hoarder.
> > >
> > > Regards,
> > >
> > > Adam
> > >
> > > On 5/21/24 14:21, Matt Vandermeulen wrote:
> > > > I would normally vouch for ZFS for this sort of thing, but the mix
> > > > of drive sizes will be... an inconvenience, at best. You could get
> > > > creative with the hierarchy (making raidz{2,3} of mirrors of
> > > > same-sized drives, or something), but it would be far from ideal. I
> > > > use ZFS for my own home machines; however, all the drives are
> > > > identical.
> > > >
> > > > I'm curious about this application of Ceph though, in home-lab use.
> > > > Performance likely isn't a top concern, just a durable persistent
> > > > storage target, so this is an interesting use case.
> > > >
> > > >
> > > > On 2024-05-21 17:02, adam.ther wrote:
> > > > > Hello,
> > > > >
> > > > > It's all non-corporate data; I'm just trying to cut back on
> > > > > wattage (it removes around 450 W of the 2.4 kW) by powering down
> > > > > backup servers that house 208 TB while they are not being backed
> > > > > up to or restored from.
> > > > >
> > > > > ZFS sounds interesting; however, does it play nicely with a mix of
> > > > > drive sizes? That's primarily why I use Ceph: it's okay (if not
> > > > > ideal) with 4x 22 TB, 8x 10 TB, 10x 4 TB.
> > > > >
> > > > > So that said, would Ceph have any known issues with long
> > > > > power-downs aside from it nagging about the scrubbing schedule?
> > > > > Mark, I see you said it wouldn't matter, but does Ceph not use a
> > > > > date-based scheduler?
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Adam
> > > > >
> > > > > On 5/21/24 13:29, Marc wrote:
> > > > > > > > I think it is his lab so maybe it is a test setup for
> production.
> > > > > > > Home production?
> > > > > > A home setup to test on, before he applies changes to his
> production
> > > > > >
> > > > > > Saluti  ;)
> > > > > >
> > > > > ___
> > > > > ceph-users mailing list -- ceph-users@ceph.io
> > > > > To unsubscribe send an email to ceph-users-le...@ceph.io
> > > > ___
> > > > ceph-users mailing list -- ceph-users@ceph.io
> > > > To unsubscribe send an email to ceph-users-le...@ceph.io
> > > ___
> > > ceph-users mailing list -- ceph-users@ceph.io
> > > To unsubscribe send an email to ceph-users-le...@ceph.io
> >
> >
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io