We came to the same conclusions as Alexander when we studied replacing Ceph's 
iSCSI implementation with Ceph's NFS-Ganesha implementation: HA was not working.
During failovers, the datastore would become inaccessible and vmkernel would 
log messages like this:
2023-01-14T09:39:27.200Z Wa(180) vmkwarning: cpu18:2098740)WARNING: NFS41: 
NFS41ProcessExidResult:2499: 'Cluster Mismatch due to different server scope. 
Probable server bug. Remount data store to access.' 
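
If you want to check whether your hosts hit the same thing, grepping the 
vmkernel log on the ESXi host is enough (a minimal sketch; the log path is the 
ESXi default):

# look for the NFS 4.1 server-scope mismatch during/after a failover
grep NFS41ProcessExidResult /var/log/vmkernel.log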

We replaced Ceph's iSCSI implementation with PetaSAN's iSCSI gateways plugged 
into our external Ceph cluster (an unsupported setup) and never looked back.
It's HA, active/active, highly scalable, robust whatever the situation (network 
issues, slow requests, ceph osd pause), and performance-wise it rocks.

We're using a PSP of type RR (Round Robin) with the SATP rule below:
esxcli storage nmp satp rule add -s VMW_SATP_ALUA -P VMW_PSP_RR -V PETASAN \
  -M RBD -c tpgs_on -o enable_action_OnRetryErrors -O "iops=1" \
  -e "Ceph iSCSI ALUA RR iops=1 PETASAN"
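
To verify the rule was registered and that the LUNs actually get claimed with 
ALUA/RR, something like the following should do (a sketch; output formatting 
varies between ESXi releases):

# confirm the claim rule exists
esxcli storage nmp satp rule list | grep -i PETASAN
# check that devices were claimed with VMW_SATP_ALUA and VMW_PSP_RR
esxcli storage nmp device list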

And these adapter settings:
esxcli system settings advanced set -o /ISCSI/MaxIoSizeKB -i 512
esxcli iscsi adapter param set -A $(esxcli iscsi adapter list | grep iscsi_vmk \
  | cut -d ' ' -f1) --key FirstBurstLength --value 524288
esxcli iscsi adapter param set -A $(esxcli iscsi adapter list | grep iscsi_vmk \
  | cut -d ' ' -f1) --key MaxBurstLength --value 524288
esxcli iscsi adapter param set -A $(esxcli iscsi adapter list | grep iscsi_vmk \
  | cut -d ' ' -f1) --key MaxRecvDataSegment --value 524288
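
To double-check that the adapter picked up the new values (a sketch; replace 
vmhba64 with your actual iscsi_vmk adapter name):

# show the software iSCSI adapter name
esxcli iscsi adapter list
# dump its parameters and keep the burst/segment ones
esxcli iscsi adapter param get -A vmhba64 | grep -iE 'Burst|MaxRecvDataSegment'
# confirm the advanced I/O size setting
esxcli system settings advanced list -o /ISCSI/MaxIoSizeKB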

Note that it's important **not** to use the object-map feature on RBD images, 
to avoid the data corruption issue described in [1].
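
For existing images you can check and drop the feature with rbd; for new ones, 
just don't enable it (a sketch; pool/image names are placeholders, and the 
exact feature set you keep depends on your setup; the point is that object-map, 
and fast-diff which depends on it, stay off):

# see which features are enabled on an image
rbd info rbd/vmware-lun0 | grep features
# disable fast-diff together with object-map (fast-diff requires object-map)
rbd feature disable rbd/vmware-lun0 fast-diff object-map
# or create new images without these features in the first place
rbd create rbd/vmware-lun1 --size 2T --image-feature layering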

Regarding NVMe-oF, an incredible amount of work has gone into it over the past 
year and a half, but we'll likely need to wait a bit longer for something fully 
production-ready (hopefully active/active).

Regards,
Frédéric.

[1] https://croit.io/blog/fixing-data-corruption

----- On 28 Jun 24, at 7:12, Alexander Patrakov patra...@gmail.com wrote:

> For NFS (e.g., as implemented by NFS-Ganesha), the situation is also
> quite stupid.
> 
> Without high availability (HA), it works (that is, until you update
> the NFS-Ganesha version), but corporate architects won't let you
> deploy any system without HA because, in their view, non-HA systems
> are not production-ready by definition. (And by the way, the current
> NVMe-oF gateway also has no multipath and thus no viable HA.)
> 
> With an attempt to set up HA for NFS, you'll get at least the
> following showstoppers:
> 
> For NFS v4.1:
> 
> * VMware refuses to work until manual admin intervention if it
> sees any change in the "owner" and "scope" fields of the EXCHANGE_ID
> message between the previous and the current NFS connection.
> * NFS-Ganesha sets both fields from the hostname by default, and the
> patch that makes these fields configurable is quite recent (it landed
> in version 4.3). This is important because otherwise every NFS server
> failover would trip up VMware, thus defeating the point of a
> high-availability setup.
> * There is a regression in NFS-Ganesha that manifests as a deadlock
> (easily triggerable even without Ceph by running xfstests), which is
> critical, because systemd cannot restart deadlocked services.
> Unfortunately, the last NFS-Ganesha version before the regression
> (4.0.8) does not contain the patch that allows manipulating the
> "owner" and "scope" fields.
> * Cephadm-based deployments do not set these configuration options anyway.
> * If you would like to use the "rados_cluster" NFSv4 recovery backend
> (used for grace periods), you need to be extra careful with the various
> "server names", because they are also used to decide whether to end the
> grace period. If the recovery backend has seen two server names
> (corresponding to two NFS-Ganesha instances, for scale-out), then both
> must be up for the grace period to end. If there is only one server
> name, you are allowed to run only one instance. If you want high
> availability together with scale-out, you need to be able to schedule
> two NFS-Ganesha instances (with names like "a" and "b", not
> corresponding to the names of the hosts where they run) on two out of
> three available servers. Orchestrators do not do this; you need to
> implement it on your own.
> 
> For NFS v3:
> 
> * NFS-Ganesha opens files and acquires MDS locks just in case, to make
> sure that another client cannot modify them while the original client
> might have cached something.
> * If NFS-Ganesha crashes or a server reboots, then the other
> NFS-Ganesha, brought up to replace the original one, will also stumble
> upon these locks, as the MDS recognizes it as a different client.
> Result: it waits until the locks time out, which is too long
> (minutes!), as the guest OS in VMware would then time out its storage.
> * To avoid the problem mentioned above and to get seamless fail-over,
> the replacement instance of NFS-Ganesha must present itself as the
> same client (i.e., as the same fake hostname) to the MDS, but no known
> orchestrators facilitate this.
> 
> Conclusion: please use iSCSI or sacrifice HA, as there are no working
> alternatives yet.
> 
> On Fri, Jun 28, 2024 at 1:31 AM Anthony D'Atri <anthony.da...@gmail.com> 
> wrote:
>>
>> There are folks actively working on this gateway and there's a Slack 
>> channel.  I
>> haven't used it myself yet.
>>
>> My understanding is that ESXi supports NFS.  Some people have had good 
>> success
>> mounting KRBD volumes on a gateway system or VM and re-exporting via NFS.
>>
>>
>>
>> > On Jun 27, 2024, at 09:01, Drew Weaver <drew.wea...@thenap.com> wrote:
>> >
>> > Howdy,
>> >
>> > I recently saw that Ceph has a gateway which allows VMware ESXi to
>> > connect to RBD.
>> >
>> > We had another gateway like this a while back: the iSCSI gateway.
>> >
>> > The iSCSI gateway ended up being... let's say, problematic.
>> >
>> > Is there any reason to believe that NVMe-oF will also end up on the
>> > floor, and has anyone that uses VMware extensively evaluated its
>> > viability?
>> >
>> > Just curious!
>> >
>> > Thanks,
>> > -Drew
>> >
> 
> 
> 
> --
> Alexander Patrakov