From: Frédéric Nass [mailto:frederic.n...@univ-lorraine.fr] 
Sent: 22 July 2016 15:13
To: n...@fisk.me.uk
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] ceph + vmware

 

 

On 22/07/2016 14:10, Nick Fisk wrote:

 

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Frédéric Nass
Sent: 22 July 2016 11:19
To: n...@fisk.me.uk; 'Jake Young' <jak3...@gmail.com>; 'Jan Schermer' <j...@schermer.cz>
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] ceph + vmware

 

 

On 22/07/2016 11:48, Nick Fisk wrote:

 

 

From: Frédéric Nass [mailto:frederic.n...@univ-lorraine.fr] 
Sent: 22 July 2016 10:40
To: n...@fisk.me.uk; 'Jake Young' <jak3...@gmail.com>; 'Jan Schermer' <j...@schermer.cz>
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] ceph + vmware

 

 

On 22/07/2016 10:23, Nick Fisk wrote:

 

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Frédéric Nass
Sent: 22 July 2016 09:10
To: n...@fisk.me.uk; 'Jake Young' <jak3...@gmail.com>; 'Jan Schermer' <j...@schermer.cz>
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] ceph + vmware

 

 

On 22/07/2016 09:47, Nick Fisk wrote:

 

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Frédéric Nass
Sent: 22 July 2016 08:11
To: Jake Young <jak3...@gmail.com>; Jan Schermer <j...@schermer.cz>
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] ceph + vmware

 

 

On 20/07/2016 21:20, Jake Young wrote:



On Wednesday, July 20, 2016, Jan Schermer <j...@schermer.cz> wrote:


> On 20 Jul 2016, at 18:38, Mike Christie <mchri...@redhat.com> wrote:
>
> On 07/20/2016 03:50 AM, Frédéric Nass wrote:
>>
>> Hi Mike,
>>
>> Thanks for the update on the RHCS iSCSI target.
>>
>> Will the RHCS 2.1 iSCSI target be compatible with the VMware ESXi client? (Or is
>> it too early to say / announce?)
>
> No HA support for sure. We are looking into non HA support though.
>
>>
>> Knowing that an HA iSCSI target was on the roadmap, we chose iSCSI over NFS,
>> so we'll just have to remap the RBDs to RHCS targets when it's available.
>>
>> So we're currently running:
>>
>> - 2 LIO iSCSI targets exporting the same RBD images. Each iSCSI target
>> has all VAAI primitives enabled and runs the same configuration.
>> - RBD images are mapped on each target using the kernel client (so no
>> RBD cache).
>> - 6 ESXi hosts. Each ESXi host can access the same LUNs through both
>> targets, but in a failover manner, so each ESXi host always accesses a
>> given LUN through one target at a time.
>> - LUNs are VMFS datastores and VAAI primitives are enabled client side
>> (except UNMAP, as per the default).
>>
>> Do you see anything risky regarding this configuration?
>
> If you use an application that uses SCSI persistent reservations, then you
> could run into trouble, because some apps expect the reservation info
> to be on the failover nodes as well as the active ones.
>
> Depending on how you do failover and the issue that caused the
> failover, IO could be stuck on the old active node and cause data
> corruption. If the initially active node loses its network connectivity
> and you fail over, you have to make sure that the initially active node is
> fenced off and that IO stuck on that node will never be executed. So do
> something like add it to the Ceph monitor blacklist, and make sure IO on
> that node is flushed and failed before unblacklisting it.
>
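For illustration only: a rough sketch of the blacklist-based fencing described above, using the Python rados bindings. The ceph.conf path and the gateway's client address below are placeholders, and the exact mon command fields may vary between Ceph releases; the equivalent CLI is "ceph osd blacklist add <addr>".

    import json
    import rados

    # Connect to the cluster with admin credentials (conffile path is an assumption).
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    try:
        # Blacklist the failed gateway's client address so any IO still queued
        # on that node can never reach the OSDs ('10.0.0.1:0/0' is a placeholder).
        cmd = json.dumps({
            'prefix': 'osd blacklist',
            'blacklistop': 'add',
            'addr': '10.0.0.1:0/0',
        })
        ret, outbuf, outs = cluster.mon_command(cmd, b'')
        print(ret, outs)
    finally:
        cluster.shutdown()

Only after the blacklisted node's stuck IO has been flushed and failed should the entry be removed again (blacklistop 'rm').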

With iSCSI you can't really do hot failover unless you only use synchronous IO.

 

VMware only uses synchronous IO. Since the hypervisor can't tell what type
of data the VMs are writing, all IO is treated as needing to be synchronous.

 

(With any of the open-source target software available.)
Flushing the buffers doesn't really help, because you don't know which in-flight
IO completed before the outage and which didn't. You could end up with only part
of the "transaction" written to persistent storage.

If you only use synchronous IO all the way from the client to the persistent
storage shared between the iSCSI targets, then all should be fine; otherwise YMMV.
Some people run it like that without realizing the dangers and have never had a
problem, so it may be strictly theoretical, and it all depends on how often you
need to fail over and what data you are storing: corrupting a few images on a
gallery site could be fine, but corrupting a large database tablespace is no fun
at all.

 

No, it's not. VMFS corruption is pretty bad too and there is no fsck for VMFS...

 


Some (non-open-source) solutions exist; Solaris supposedly does this in some(?)
way, so maybe some iSCSI guru can chime in and tell us what magic they do, but I
don't think it's possible without client support (you essentially have to do
something like transactions and replay the last transaction on failover). Maybe
something can be enabled in the protocol to make the iSCSI IO synchronous, or at
least have it wait for some sort of ACK from the server (which would require
some sort of cache mirroring between the targets) without making it synchronous
all the way.

 

This is why the SAN vendors wrote their own clients and drivers. It is not
possible to dynamically make all OSes do what your iSCSI target expects.

 

Something like VMware does the right thing pretty much all the time (there are 
some iSCSI initiator bugs in earlier ESXi 5.x).  If you have control of your 
ESXi hosts then attempting to set up HA iSCSI targets is possible. 

 

If you have a mixed client environment with various versions of Windows 
connecting to the target, you may be better off buying some SAN appliances.

 


The one time I had to use it, I resorted to simply mirroring via mdraid on the
client side over two targets sharing the same DAS. This worked fine during
testing but never went to production in the end.

Jan

>
>>
>> Would you recommend the LIO or the STGT (with rbd bs-type) target for ESXi
>> clients?
>
> I can't say, because I have not used stgt with rbd bs-type support enough.

 

For starters, STGT doesn't implement VAAI properly and you will need to disable 
VAAI in ESXi.

 

LIO does seem to implement VAAI properly, but performance is not nearly as good
as STGT, even with VAAI's benefits. The assumed cause is that LIO currently uses
kernel rbd mapping, and kernel rbd performance is not as good as librbd.
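As a rough illustration of what the librbd path looks like (this is the userspace library that STGT's rbd backstore uses, as opposed to a mapped /dev/rbd device), here is a minimal sketch using the Python rados/rbd bindings; the pool name 'rbd' and image name 'esx-lun0' are placeholders, and whether any caching happens depends on the client-side rbd cache settings in ceph.conf.

    import rados
    import rbd

    # Open the cluster and a pool ('rbd' is a placeholder pool name).
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx('rbd')
        try:
            # 'esx-lun0' is a placeholder image name.
            image = rbd.Image(ioctx, 'esx-lun0')
            try:
                # This read goes through librbd in userspace (and its optional
                # cache) rather than through the kernel rbd block device.
                data = image.read(0, 4096)
            finally:
                image.close()
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()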

 

I recently did a simple test of creating an 80GB eager zeroed disk with STGT 
(VAAI disabled, no rbd client cache) and LIO (VAAI enabled) and found that STGT 
was actually slightly faster.

 

I think we're all holding our breath waiting for LIO librbd support via TCMU,
which seems to be right around the corner. That solution will combine the
performance benefits of librbd with the more feature-rich LIO iSCSI interface.
The lrbd configuration tool for LIO from SUSE is pretty cool, and it makes
configuring LIO easier than STGT.

 


Hi Jake,

The problem we're facing with LIO is that it has ESXi hosts disconnecting from
vCenter regularly. This is a result of the iSCSI datastore becoming unreachable.
It happens randomly; the last time was with almost no VM activity at all (only
6 VMs in the lab), when ESXi requested a write to the '.iormstats.sf' file,
which I suppose is related to Storage I/O Control, but I'm not sure of that.

Setting VMFS3.UseATSForHBOnVMFS5 to 0 didn't help. Restarting the LIO target 
almost instantly solves it.

Has any of you ever encountered this issue with the LIO target?

Yes, this is a currently known problem that will hopefully be resolved soon. When
there is a delay servicing IO, ESXi asks the target to cancel the IO and LIO tries
to do this, but from what I understand, RBD doesn't have an API that allows LIO to
reach into the Ceph cluster and cancel the in-flight IO. LIO responds back saying
it can't do this, and then ESXi asks again. And so LIO and ESXi enter a loop
forever.

 

Hi Nick,

Thanks for this explanation.

Are you aware of any workaround or ESXi initiator option to tweak (like an I/O
timeout value) to avoid that?

Or does this make the LIO target unusable with ESXi as of now?

Is STGT also affected, or does it respond better with the rbd (librbd) backstore?

 

Check out my response in this thread

 

http://ceph-users.ceph.narkive.com/JFwme605/suse-enterprise-storage3-rbd-lio-vmware-performance-bad

 

 

 


Nick,

What a great post (#5)! :-)

It clearly describes what I'm hitting with LIO (vmkernel.log):
2016-07-21T07:33:38.544Z cpu26:386324)WARNING: ScsiPath: 7154: Set retry 
timeout for failed TaskMgmt abort for CmdSN  0x0, status Failure, path 
vmhba40:C2:T1:L0

Have you tried STGT (with the rbd backstore)? I'll give SCST a try...

 

Yep, but see my point about being unable to stop when there is ongoing IO. This
makes clustering hard, as you have to start adding resource agents to
block/manipulate TCP packets to drain iSCSI connections. I gave up trying to get
it to work 100% reliably.

 



When you say 'NFS is very easy to configure for HA', how is that?
I thought it was hard to achieve, involving clustering software such as
Corosync, Pacemaker, DRBD or GFS2. Am I missing something? (NFS-Ganesha?)

 

Easy compared to iSCSI. Yes, you have to use pacemaker/corosync, but that’s the 
easy part of the whole process. 


OK. So this would be an active/passive scenario, right?

The hard part seems to be setting up the right fencing, with the right commands,
on each NFS node. :-/
It's not really clear to me whether an active NFS server under load will shut
down gracefully, so that you can unmap the RBD without fear and have it remapped
on the other node.

Frederic.

 

That’s where stonith comes into play. If the resource ever gets into a state
where it can’t stop, it will be marked unclean and then stonith will reboot the 
node to resolve the situation.

 


And then ESXi would drop its connections, follow the VIP moving to the other NFS
gateway, and resume its workload without pain? What about NFS locks?

Frederic.

 

Yes, that's what seems to happen, and quite smoothly too. ESXi doesn't use NFS
locking; it just creates a .lck file.

 

https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2136521

 

There are a lot of things that can go wrong doing clustered iSCSI, whereas I have
found NFS to be much simpler. ESXi seems to handle NFS failure better. With
iSCSI, unless you catch it quickly, everything goes APD/PDL and you end up with
all sorts of problems. NFS seems to be able to disappear and then pop back with
no drama, from what I have seen so far.

 



Again, thanks for your help,

Frederic.

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
