Re: [zfs-discuss] Does a zvol use the zil?

2010-10-25 Thread Miles Nordin
> "re" == Richard Elling  writes:

re> it seems the hypervisors try to do crazy things like make the
re> disks readonly,

haha!

re> which is perhaps the worst thing you can do to a guest OS
re> because now it needs to be rebooted

I might've set it up to ``pause'' the VM for most failures, and for
punts like this read-only case, maybe leave it paused until someone
comes along to turn it off or unpause it.  But for loss of connection
to an iSCSI-backed disk, I think that's wrong.  I guess the truly
correct failure handling would be to immediately poweroff the guest
VM: pausing it tempts the sysadmin to fix the iscsi connection and
unpause it, which in this case is the one action that really invites
disaster.  One would get a lot of complaints from sysadmins who don't
understand the iscsi write hole, but I think it's right.  so...in that
context, maybe read-only-until-reboot is actually not so dumb!

For guests unknowingly getting their disks via NFS, it would make
sense to pause the VM to stop (some of) its interval timers (and hope
the timer driving the ATA/SCSI/... driver is among the stopped ones),
because the guest's disk driver won't understand NFS
hard mount timeout rules---won't understand that, for certain errors,
you can pass ``stale file handle'' up the stack, but for other errors
you must wait forever.  Instead they'll enforce a 30-second timeout
like for an ATA disk.  I think you could probably still avoid losing
the 'write B' if the guest fired its ATA timeout with an NFS-backed
disk because the writes have already been handed off to the host.  It
might be a weird user experience in the VM manager, because whatever
process is doing the NFS writes will sit in an unkillable 'D' state
even if you power off the VM, but this weirdness is an expression of arcane
reality, not a bug.  It'd be better sysadmin experience to avoid the
guest ATA timeout, though: pause the VM and resume so that NFS server
reboots would freeze guests for a while, not require rebooting them,
just like they do for nonvirtual NFSv3 clients.  You would have to
figure out the maximum number of seconds the guests can go without
disk access, and deviously pause them before their buried, proprietary
disk timeouts can fire.
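
To make that policy concrete, here is a rough sketch.  The vmctl command
and its subcommands are invented names for illustration, not any real
hypervisor's CLI; the only point is the per-backend decision:

  # Hypothetical failure policy: pause NFS-backed guests (their timers
  # stop and they can resume once the server returns), power off
  # iSCSI-backed guests (a clean reboot beats replaying writes into the
  # write hole after the target reboots).
  handle_backend_failure() {
      vm=$1; backend=$2
      case $backend in
          nfs)    vmctl pause "$vm" ;;
          iscsi)  vmctl poweroff "$vm" ;;
          *)      vmctl pause "$vm" ;;   # leave anything else for the sysadmin
      esac
  }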




Re: [zfs-discuss] Does a zvol use the zil?

2010-10-23 Thread Richard Elling
On Oct 22, 2010, at 10:40 AM, Miles Nordin wrote:

>> "re" == Richard Elling  writes:
> 
>re> The risk here is not really different than that faced by
>re> normal disk drives which have nonvolatile buffers (eg
>re> virtually all HDDs and some SSDs).  This is why applications
>re> can send cache flush commands when they need to ensure the
>re> data is on the media.
> 
> It's probably different because of the iSCSI target reboot problem
> I've written about before:
> 
> iSCSI initiator           iSCSI target            nonvolatile medium
> 
> write A    ---------->
>            <----------   ack A
> write B    ---------->
>            <----------   ack B
>                                      ---------->  [A]
>                           [REBOOT]
> write C    ---------->
> [timeout!]
> reconnect  ---------->
>            <----------   ack Connected
> write C    ---------->
>            <----------   ack C
> flush      ---------->
>                                      ---------->  [C]
>            <----------   ack Flush
> 
> in the above time chart, the initiator thinks A, B, and C are written,
> but in fact only A and C are written.  I regard this as a failing of
> imagination in the SCSI protocol, but probably, with a better
> understanding of the details than I have, the initiator could be made
> to provably work around the problem.  My guess has always been that no
> current initiators actually do, though.
> 
> I think it could happen also with a directly-attached SATA disk if you
> remove power from the disk without rebooting the host, so as Richard
> said it is not really different, except that in the real world it's
> much more common for an iSCSI target to lose power without the
> initiator's also losing power than it is for a disk to lose power
> without its host adapter losing power.  

I agree. I'd like to have some good field information on this, but I think it
is safe to assume that for the average small server, when the local disks
lose power the server also loses power and the exposure to this issue
is lost in the general recovery of the server. For the geezers, who remember
such pain as Netware or Aegis, NFS was a breath of fresh air and led to
much more robust designs.  In that respect, iSCSI or even FC is a step
backwards, down the protocol stack.

> The ancient practice of unix
> filesystem design always considers cord-yanking as something happening
> to the entire machine, and failing disks are not the filesystem's
> responsibility to work around because how could it?  This assumption
> should have been changed and wasn't, when we entered the era of RAID
> and removable disks, where the connections to disks and disks
> themselves are both allowed to fail.  However, when NFS was designed,
> the assumption *WAS* changed, and indeed NFSv2 and earlier operated
> always with the write cache OFF to be safe from this, just as COMSTAR
> does in its (default?) abysmal-performance mode (so campuses bought
> Prestoserve cards (equivalent to a DDRdrive except much less silly
> because they have onboard batteries), or Auspex servers with included
> NVRAM, which are analogous outside the NFS world to netapp/hitachi/emc
> FC/iSCSI targets which always have big NVRAM's so they can leave the
> write cache off), and NFSv3 has a commit protocol that is smart enough
> to replay the 'write B' which makes the nonvolatile caches less
> necessary (so long as you're not closing files frequently, I guess?).

With COMSTAR, you can implement the commit to media policy in at least
three ways:
1. server side: disable writeback cache, per LUN
2. server side: change the sync policy to "always" for the zvol
3. client side: clear the write cache enable (WCE) bit per LUN

For choices 1 and 2, the ZIL and separate log come into play.
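
For example (a sketch only; the LU GUID, pool, and zvol names are
placeholders, and the property names should be checked against your
COMSTAR/ZFS release):

  # 1. server side: disable the LU's writeback cache (wcd = write cache disabled)
  stmfadm modify-lu -p wcd=true 600144F0...

  # 2. server side: force every write on the backing zvol through the ZIL/slog
  zfs set sync=always tank/vol1

  # 3. client side (Solaris initiator): clear the WCE bit on the LUN
  format -e     # select the LUN, then: cache -> write_cache -> disable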

> I think it would be smart to design more storage systems so NFS can
> replace the role of iSCSI, for disk access.  

I agree.

> In Isilon or Lustre
> clusters this trick is common when a node can settle with unshared
> access to a subtree: create an image file on the NFS/Lustre back-end
> and fill it with an ext3 or XFS, and writes to that inner filesystem
> become much faster because this Rube Goldberg arrangement discards the
> close-to-open consistency guarantee.  We might use it in the ZFS world
> for actual physical disk access instead of iSCSI, e.g., it should be
> possible to NFS-export a zvol and see a share with a single file in it
> named 'theTarget' or something, but this file would be without
> read-ahead.  Better yet, to accommodate VMWare limitations, would be to
> export a single fake /zvol share containing all NFS-shared zvol's, and
> as you export zvol's their files appear within this share.  Also it
> should be possible to mount vdev elements over NFS without
> deadlocks---I know that is difficult, but VMWare does it.  Perhaps it
> cannot be done through the existing NFS client, but obviously it can
> be done somehow, and it would both solve 

Re: [zfs-discuss] Does a zvol use the zil?

2010-10-22 Thread Miles Nordin
> "re" == Richard Elling  writes:

re> The risk here is not really different than that faced by
re> normal disk drives which have nonvolatile buffers (eg
re> virtually all HDDs and some SSDs).  This is why applications
re> can send cache flush commands when they need to ensure the
re> data is on the media.

It's probably different because of the iSCSI target reboot problem
I've written about before:

iSCSI initiator           iSCSI target            nonvolatile medium

write A    ---------->
           <----------   ack A
write B    ---------->
           <----------   ack B
                                     ---------->  [A]
                          [REBOOT]
write C    ---------->
[timeout!]
reconnect  ---------->
           <----------   ack Connected
write C    ---------->
           <----------   ack C
flush      ---------->
                                     ---------->  [C]
           <----------   ack Flush

in the above time chart, the initiator thinks A, B, and C are written,
but in fact only A and C are written.  I regard this as a failing of
imagination in the SCSI protocol, but probably, with a better
understanding of the details than I have, the initiator could be made
to provably work around the problem.  My guess has always been that no
current initiators actually do, though.

I think it could happen also with a directly-attached SATA disk if you
remove power from the disk without rebooting the host, so as Richard
said it is not really different, except that in the real world it's
much more common for an iSCSI target to lose power without the
initiator's also losing power than it is for a disk to lose power
without its host adapter losing power.  The ancient practice of unix
filesystem design always considers cord-yanking as something happening
to the entire machine, and failing disks are not the filesystem's
responsibility to work around because how could it?  This assumption
should have been changed and wasn't, when we entered the era of RAID
and removable disks, where the connections to disks and disks
themselves are both allowed to fail.  However, when NFS was designed,
the assumption *WAS* changed, and indeed NFSv2 and earlier operated
always with the write cache OFF to be safe from this, just as COMSTAR
does in its (default?) abysmal-performance mode (so campuses bought
Prestoserve cards (equivalent to a DDRdrive except much less silly
because they have onboard batteries), or Auspex servers with included
NVRAM, which are analogous outside the NFS world to netapp/hitachi/emc
FC/iSCSI targets which always have big NVRAM's so they can leave the
write cache off), and NFSv3 has a commit protocol that is smart enough
to replay the 'write B' which makes the nonvolatile caches less
necessary (so long as you're not closing files frequently, I guess?).

I think it would be smart to design more storage systems so NFS can
replace the role of iSCSI, for disk access.  In Isilon or Lustre
clusters this trick is common when a node can settle with unshared
access to a subtree: create an image file on the NFS/Lustre back-end
and fill it with an ext3 or XFS, and writes to that inner filesystem
become much faster because this Rube Goldberg arrangement discards the
close-to-open consistency guarantee.  We might use it in the ZFS world
for actual physical disk access instead of iSCSI, e.g., it should be
possible to NFS-export a zvol and see a share with a single file in it
named 'theTarget' or something, but this file would be without
read-ahead.  Better yet, to accommodate VMWare limitations, would be to
export a single fake /zvol share containing all NFS-shared zvol's, and
as you export zvol's their files appear within this share.  Also it
should be possible to mount vdev elements over NFS without
deadlocks---I know that is difficult, but VMWare does it.  Perhaps it
cannot be done through the existing NFS client, but obviously it can
be done somehow, and it would both solve the iSCSI target reboot
problem and also allow using more kinds of proprietary storage
backend---the same reasons VMWare wants to give admins a choice
apply to ZFS.  When NFS is used in this way the disk image file is
never closed, so the NFS server will not need a slog to give good
performance: the same job is accomplished by double-caching the
uncommitted data on the client so it can be replayed if the time
diagram above happens.




Re: [zfs-discuss] Does a zvol use the zil?

2010-10-21 Thread Richard Elling
On Oct 21, 2010, at 5:26 PM, Erik Trimble wrote:

> On Thu, 2010-10-21 at 17:09 -0700, Richard Elling wrote:
>> On Oct 21, 2010, at 6:19 AM, Eff Norwood wrote:
>>> Let me frame this specifically in the context of VMWare ESXi 4.x. If I
>>> create a zvol and give it to ESXi via iSCSI, our experience has been that it
>>> is very fast and guest response is excellent. If we use NFS without a zil
>>> accelerator (we use a DDRdrive X1, which is awesome), NFS performance is not
>>> very good, because VMWare uses sync (Stable = FSYNC) writes. Once we enable
>>> our zil accelerator, NFS performance is approximately as fast as iSCSI.
>>> Enabling or disabling the zil has no measurable impact on iSCSI performance
>>> for us.
>>> 
>>> Does a zvol use the zil then or not? If it does, then iSCSI performance 
>>> seems like it should also be slower without a zil accelerator but it's not. 
>>> If it doesn't, then is it true that if the power goes off when I'm doing a 
>>> write to iSCSI and I have no battery backed HBA or RAID card I'll lose data?
>> 
>> The risk here is not really different than that faced by normal disk drives 
>> which have
>> nonvolatile buffers (eg virtually all HDDs and some SSDs).  This is why 
>> applications
>> can send cache flush commands when they need to ensure the data is on the 
>> media.
>> -- richard
>> 
> 
> I think you mean "volatile buffers", right? You'll lose data if your HD
> or SSD has a volatile buffer (almost always DRAM chips with no battery
> or supercapacitor).

Indeed... to quote someone (Erik) recently, 

(a) I'm not infallible. :-)

 -- richard

-- 
OpenStorage Summit, October 25-27, Palo Alto, CA
http://nexenta-summit2010.eventbrite.com
USENIX LISA '10 Conference, November 7-12, San Jose, CA
ZFS and performance consulting
http://www.RichardElling.com


Re: [zfs-discuss] Does a zvol use the zil?

2010-10-21 Thread Erik Trimble
On Thu, 2010-10-21 at 17:09 -0700, Richard Elling wrote:
> On Oct 21, 2010, at 6:19 AM, Eff Norwood wrote:
> > Let me frame this specifically in the context of VMWare ESXi 4.x. If I
> > create a zvol and give it to ESXi via iSCSI, our experience has been that it
> > is very fast and guest response is excellent. If we use NFS without a zil
> > accelerator (we use a DDRdrive X1, which is awesome), NFS performance is not
> > very good, because VMWare uses sync (Stable = FSYNC) writes. Once we enable
> > our zil accelerator, NFS performance is approximately as fast as iSCSI.
> > Enabling or disabling the zil has no measurable impact on iSCSI performance
> > for us.
> > 
> > Does a zvol use the zil then or not? If it does, then iSCSI performance 
> > seems like it should also be slower without a zil accelerator but it's not. 
> > If it doesn't, then is it true that if the power goes off when I'm doing a 
> > write to iSCSI and I have no battery backed HBA or RAID card I'll lose data?
> 
> The risk here is not really different than that faced by normal disk drives 
> which have
> nonvolatile buffers (eg virtually all HDDs and some SSDs).  This is why 
> applications
> can send cache flush commands when they need to ensure the data is on the 
> media.
>  -- richard
> 

I think you mean "volatile buffers", right? You'll lose data if your HD
or SSD has a volatile buffer (almost always DRAM chips with no battery
or supercapacitor).



-- 
Erik Trimble
Java System Support
Mailstop:  usca22-317
Phone:  x67195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)



Re: [zfs-discuss] Does a zvol use the zil?

2010-10-21 Thread Richard Elling
On Oct 21, 2010, at 6:19 AM, Eff Norwood wrote:
> Let me frame this specifically in the context of VMWare ESXi 4.x. If I create
> a zvol and give it to ESXi via iSCSI, our experience has been that it is very
> fast and guest response is excellent. If we use NFS without a zil accelerator
> (we use a DDRdrive X1, which is awesome), NFS performance is not very good,
> because VMWare uses sync (Stable = FSYNC) writes. Once we enable our zil
> accelerator, NFS performance is approximately as fast as iSCSI. Enabling or
> disabling the zil has no measurable impact on iSCSI performance for us.
> 
> Does a zvol use the zil then or not? If it does, then iSCSI performance seems 
> like it should also be slower without a zil accelerator but it's not. If it 
> doesn't, then is it true that if the power goes off when I'm doing a write to 
> iSCSI and I have no battery backed HBA or RAID card I'll lose data?

The risk here is not really different than that faced by normal disk drives 
which have
nonvolatile buffers (eg virtually all HDDs and some SSDs).  This is why 
applications
can send cache flush commands when they need to ensure the data is on the media.
 -- richard

-- 
OpenStorage Summit, October 25-27, Palo Alto, CA
http://nexenta-summit2010.eventbrite.com
USENIX LISA '10 Conference, November 7-12, San Jose, CA
ZFS and performance consulting
http://www.RichardElling.com


Re: [zfs-discuss] Does a zvol use the zil?

2010-10-21 Thread Darren J Moffat

On 21/10/2010 18:59, Maurice Volaski wrote:

> Does the write cache referred to above refer to the "Writeback Cache"
> property listed by stmfadm list-lu -v (when a zvol is a target) or
> is that some other cache and if it is, how does it interact with the
> first one?

Yes it does; that basically results in the DKIOCGETWCE ioctl being called on
the ZVOL (though you won't see that in truss, because it is called from
the COMSTAR kernel modules, not directly from stmfadm in userland).
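
For reference, a quick way to check the current setting from the shell
(the LU GUID is a placeholder, and the wcd property name should be
verified against your COMSTAR release):

  stmfadm list-lu -v | grep -i 'writeback cache'
  stmfadm modify-lu -p wcd=false 600144F0...   # wcd=false re-enables the writeback cache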


--
Darren J Moffat


Re: [zfs-discuss] Does a zvol use the zil?

2010-10-21 Thread Maurice Volaski
Does the write cache referred to above refer to the "Writeback Cache" property
listed by stmfadm list-lu -v (when a zvol is a target), or is that some other
cache? If it is, how does it interact with the first one?


Re: [zfs-discuss] Does a zvol use the zil?

2010-10-21 Thread Darren J Moffat

Yes, ZVOLs do use the ZIL: they will if the write cache has been disabled
on the zvol by the DKIOCSETWCE ioctl, or if the sync property is set to
always.
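
For example, to check and force that policy on a zvol (the dataset name
is a placeholder):

  zfs get sync tank/vol1            # standard | always | disabled
  zfs set sync=always tank/vol1     # every write goes through the ZIL (and slog, if any)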


--
Darren J Moffat


[zfs-discuss] Does a zvol use the zil?

2010-10-21 Thread Eff Norwood
Let me frame this specifically in the context of VMWare ESXi 4.x. If I create a
zvol and give it to ESXi via iSCSI, our experience has been that it is very fast
and guest response is excellent. If we use NFS without a zil accelerator (we use
a DDRdrive X1, which is awesome), NFS performance is not very good, because
VMWare uses sync (Stable = FSYNC) writes. Once we enable our zil accelerator,
NFS performance is approximately as fast as iSCSI. Enabling or disabling the zil
has no measurable impact on iSCSI performance for us.

Does a zvol use the zil then or not? If it does, then iSCSI performance seems 
like it should also be slower without a zil accelerator but it's not. If it 
doesn't, then is it true that if the power goes off when I'm doing a write to 
iSCSI and I have no battery backed HBA or RAID card I'll lose data?
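
(For context: a "zil accelerator" here is attached to the pool as a dedicated
log device; a sketch with placeholder pool and device names:

  zpool add tank log c2t1d0    # dedicate the DDRdrive/SSD as a separate log
  zpool status tank            # the device shows up under the "logs" section
)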