Oh, thanks, that does not sound very encouraging.
In our case it looked the same: we had to reboot three ESXi nodes via IPMI
because they got stuck during an ordinary soft reboot.

1. RecoveryTimeout is set to 25 on our nodes.
2. We have one dual-port adapter per node (ConnectX-5) and 4 iSCSI GWs in total,
one per OSD server. Multipath works; we tested it by randomly rebooting one of
the two switches or by manually shutting down a port (a quick way to
double-check the paths from ESXi is sketched below).
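
For the record, this is roughly how I double-check from the ESXi side that each
LUN really has one path per gateway; the device identifier below is only a
placeholder, not our actual ID:

    # list the active iSCSI sessions (we expect one per gateway portal)
    esxcli iscsi session list

    # show the NMP paths ESXi sees for a given LUN (placeholder device ID)
    esxcli storage nmp path list -d <naa_device_id>
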
One curious thing I did not mention before is that we see a number of dropped
Rx packets on each NIC attached to the iSCSI VLAN, and the increase in dropped
packets seems to correlate with the current IOPS load.
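
In case it helps anyone compare notes, these are the counters I am watching;
the interface and vmnic names are just examples from our setup:

    # on the Linux iSCSI gateways (ConnectX-5 port name is an example)
    ethtool -S ens1f0 | grep -i drop

    # on the ESXi hosts, per physical uplink carrying the iSCSI VLAN
    esxcli network nic stats get -n vmnic2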

I am beginning to settle on the theory that our cluster is generally quite low
on IOPS, so even a slight increase in traffic can significantly raise latency
on the iSCSI target, and ESXi is just being very touchy about that.
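
For what it's worth, my plan to test that theory is to measure raw latency on
one of the gateways while the ESXi workload is running, something along these
lines (pool name and numbers are just placeholders, not a recommendation):

    # 4K writes with 16 concurrent ops for 30 seconds; watch the average
    # and max latency that rados bench reports
    rados bench -p rbd 30 write -b 4096 -t 16

If the latency already looks bad at the RADOS level, then the iSCSI layer and
ESXi are probably just the messengers.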


> On 4 Oct 2020, at 18:59, Phil Regnauld <p...@x0.dk> wrote:
> 
> Yep, and we're still experiencing it every few months. One (and only one) of
> our ESXi nodes, which are otherwise identical, is experiencing total freeze
> of all I/O, and it won't recover - I mean, ESXi is so dead, we have to go into
> IPMI and reset the box...
> 
> We're using Croit's software, but the issue doesn't seem to be with CEPH so
> much as with vmware.
> 
> That said, there's a couple of things you should be looking at:
> 
> 1. Make sure you remember to set the RecoveryTimeout to 25?
> 
> https://docs.ceph.com/en/latest/rbd/iscsi-initiator-esx/
> 
> 2. Make sure you have got working multipath across more than 1 adapter.
> 
> What's possibly biting us right now is that, with 2 iscsi gateways in our
> cluster, and although both are autodiscovered at iscsi configuration time,
> the ESXi nodes will still only show one path to each LUN.
> 
> Currently these ESXi nodes have only 1 x 10gbit connected; it looks like
> I'll need to wire up the second connector and set up a second path to
> the iscsi gateway from that. It may not solve the problem, but it might
> lower the I/O on a single gateway enough that we won't see the problem
> anymore (and hopefully our customers stop getting pissed off).
> 
> Cheers,
> Phil
> 
> Golasowski Martin (martin.golasowski) writes:
>> For clarity, the issue has been reported also before:
>> 
>> https://www.spinics.net/lists/ceph-users/msg59798.html
>> 
>> https://www.spinics.net/lists/target-devel/msg10469.html
>> 
>> 
>> 
>>> On 4 Oct 2020, at 16:46, Steve Thompson <s...@vgersoft.com> wrote:
>>> 
>>> On Sun, 4 Oct 2020, Martin Verges wrote:
>>> 
>>>>> Does that mean that occasional iSCSI path drop-outs are somewhat
>>>>> expected?
>>>> Not that I'm aware of, but I have no HDD based ISCSI cluster at hand to
>>>> check. Sorry.
>>> 
>>> I use iscsi extensively, but for ZFS and not ceph. Path drop-outs are not 
>>> common; indeed, so far as I am aware, I have never had one. CentOS 7.8.
>>> 
>>> Steve
>>> -- 
>>> ----------------------------------------------------------------------------
>>> Steve Thompson                 E-mail:      smt AT vgersoft DOT com
>>> Voyager Software LLC           Web:         http://www DOT vgersoft DOT com
>>> 3901 N Charles St              VSW Support: support AT vgersoft DOT com
>>> Baltimore MD 21218
>>> "186,282 miles per second: it's not just a good idea, it's the law"
>>> ----------------------------------------------------------------------------
>> 

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
