Re: [zfs-discuss] cannot delete file when fs 100% full

2008-08-31 Thread Sanjeev
Thanks Michael for the clarification about the IDR! :-)
I was planning to give this explanation myself.

The fix I have in there is a temporary one.
I am currently looking at a better way of accounting for the
fatzap blocks to make sure we cover all the cases.
I have got some pointers from Mark Maybee and am looking into it
right now.

Thanks and regards,
Sanjeev.

On Fri, Aug 29, 2008 at 06:59:07AM -0700, Michael Schuster wrote:
> On 08/29/08 04:09, Tomas Ögren wrote:
> > On 15 August, 2008 - Tomas Ögren sent me these 0,4K bytes:
> > 
> >> On 14 August, 2008 - Paul Raines sent me these 2,9K bytes:
> >>
> >>> This problem is becoming a real pain to us again and I was wondering
> >>> if there has been in the past few month any known fix or workaround.
> >> Sun is sending me an IDR this/next week regarding this bug..
> > 
> > It seems to work, but I am unfortunately not allowed to pass this IDR
> 
> IDRs are "point patches", built against specific kernel builds (IIRC) and as 
> such not intended for wider distribution. Therefore they need to be 
> tracked so they can be replaced with the proper patch once that is available.
> If you believe you need the IDR, you need to get in touch with your local 
> services organisation and ask them to get it to you - they know the proper 
> procedures to make sure you get one that works on your machine(s) and that 
> you also get the patch once it's available.
> 
> HTH
> Michael
> -- 
> Michael Schuster  http://blogs.sun.com/recursion
> Recursion, n.: see 'Recursion'
> 


[zfs-discuss] RFE: allow zfs to interpret '.' as a dataset?

2008-08-31 Thread Gavin Maltby
Hi,

I'd like to be able to utter cmdlines such as

$ zfs set readonly=on .
$ zfs snapshot [EMAIL PROTECTED]

with '.' interpreted to mean the dataset corresponding to
the current working directory.

This would shorten what I find to be a very common operation -
that of discovering your current (working directory) dataset
and performing some operation on it.  I usually do this
with df and some cut and paste:

([EMAIL PROTECTED]:fx-review/fmaxvm-review2/usr/src/uts )-> df -h .
Filesystem size   used  avail capacity  Mounted on
tank/scratch/gavinm/fx-review/fmaxvm-review2
1.0T15G   287G 5%
/tank/scratch/gavinm/fx-review/fmaxvm-review2

([EMAIL PROTECTED]:fx-review/fmaxvm-review2/usr/src/uts )-> zfs set readonly=on 
tank/scratch/gavinm/fx-review/fmaxvm-review2

I know I could script this, but I'm thinking of general ease-of-use.
The failure semantics where . is not a zfs filesystem are clear;
perhaps one concern would be that it would be all too easy to
target the wrong dataset with something like 'zfs destroy .' - I'd
be happy to restrict the usage to non-destructive operations only.
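
For reference, a minimal sketch of the scripted workaround (the name
"zfsdot" is made up, ksh is assumed, and a dataset mounted at / is not
handled):

   #!/bin/ksh
   # zfsdot: run a zfs subcommand against the dataset whose mountpoint
   # is the longest prefix of the current working directory.
   cwd=$(pwd -P)/
   ds=$(zfs list -H -o name,mountpoint -t filesystem |
       awk -v cwd="$cwd" 'index(cwd, $2 "/") == 1 && length($2) > best {
           best = length($2); name = $1 } END { print name }')
   if [ -z "$ds" ]; then
       print -u2 "zfsdot: no dataset found for $(pwd)"
       exit 1
   fi
   exec zfs "$@" "$ds"

e.g. 'zfsdot set readonly=on' (a snapshot would still need the @name
appended by hand, so the built-in '.' syntax would remain nicer).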

Cheers

Gavin




Re: [zfs-discuss] Sidebar to ZFS Availability discussion

2008-08-31 Thread Richard Elling
Miles Nordin wrote:
>> "dc" == David Collier-Brown <[EMAIL PROTECTED]> writes:
>> 
>
> dc> one discovers latency growing without bound on disk
> dc> saturation,
>
> yeah, ZFS needs the same thing just for scrub.
>   

ZFS already schedules scrubs at a low priority.  However, once the
iops leave ZFS's queue, they can't be rescheduled by ZFS.
> I guess if the disks don't let you tag commands with priorities, then
> you have to run them at slightly below max throughput in order to QoS
> them.
>
> It's sort of like network QoS, but not quite, because: 
>
>   (a) you don't know exactly how big the ``pipe'' is, only
>   approximately, 
>
>   (b) you're not QoS'ing half of a bidirectional link---you get
>   instant feedback of how long it took to ``send'' each ``packet''
>   that you don't get with network QoS, and
>
>   (c) all the fabrics are lossless, so while there are queues which
>   undesirably fill up during congestion, these queues never drop
>   ``packets'' but instead exert back-pressure all the way up to
>   the top of the stack.
>
> I'm surprised we survive as well as we do without disk QoS.  Are the
> storage vendors already doing it somehow?
>   

Excellent question.  I hope someone will pipe up with an
answer.  In my experience, they get by through overprovisioning.
But I predict that SSDs will render this question moot, at least
for another generation or so.
 -- richard



Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-31 Thread Richard Elling
Ross Smith wrote:
> Triple mirroring you say?  That'd be me then :D
>
> The reason I really want to get ZFS timeouts sorted is that our long 
> term goal is to mirror that over two servers too, giving us a pool 
> mirrored across two servers, each of which is actually a zfs iscsi 
> volume hosted on triply mirrored disks.
>
> Oh, and we'll have two sets of online off-site backups running 
> raid-z2, plus a set of off-line backups too.
>
> All in all I'm pretty happy with the integrity of the data, wouldn't 
> want to use anything other than ZFS for that now.  I'd just like to 
> get the availability working a bit better, without having to go back 
> to buying raid controllers.  We have big plans for that too; once we 
> get the iSCSI / iSER timeout issue sorted our long term availability 
> goals are to have the setup I mentioned above hosted out from a pair 
> of clustered Solaris NFS / CIFS servers.
>
> Failover time on the cluster is currently in the order of 5-10 
> seconds, if I can get the detection of a bad iSCSI link down under 2 
> seconds we'll essentially have a worst case scenario of < 15 seconds 
> downtime.

I don't think this is possible for a stable system.  2 second failure 
detection
for IP networks is troublesome for a wide variety of reasons.  Even with
Solaris Clusters, we can show consistent failover times for NFS services on
the order of a minute (2-3 client retry intervals, including backoff).  But
getting to consistent sub-minute failover for a service like NFS might be a
bridge too far, given the current technology and the amount of 
customization
required to "make it work"^TM.

> Downtime that low means it's effectively transparent for our users as 
> all of our applications can cope with that seamlessly, and I'd really 
> love to be able to do that this calendar year.

I think most people (traders are a notable exception) and applications can
deal with larger recovery times, as long as human-intervention is not  
required.
 -- richard



Re: [zfs-discuss] Sidebar to ZFS Availability discussion

2008-08-31 Thread Miles Nordin
> "dc" == David Collier-Brown <[EMAIL PROTECTED]> writes:

dc> one discovers latency growing without bound on disk
dc> saturation,

yeah, ZFS needs the same thing just for scrub.

I guess if the disks don't let you tag commands with priorities, then
you have to run them at slightly below max throughput in order to QoS
them.

It's sort of like network QoS, but not quite, because: 

  (a) you don't know exactly how big the ``pipe'' is, only
  approximately, 

  (b) you're not QoS'ing half of a bidirectional link---you get
  instant feedback of how long it took to ``send'' each ``packet''
  that you don't get with network QoS, and

  (c) all the fabrics are lossless, so while there are queues which
      undesirably fill up during congestion, these queues never drop
  ``packets'' but instead exert back-pressure all the way up to
  the top of the stack.

I'm surprised we survive as well as we do without disk QoS.  Are the
storage vendors already doing it somehow?




Re: [zfs-discuss] Sidebar to ZFS Availability discussion

2008-08-31 Thread Richard Elling
David Collier-Brown wrote:
> Re Availability: ZFS needs to handle disk removal / 
>  driver failure better
>   
>>> A better option would be to not use this to perform FMA diagnosis, but
>>> instead work into the mirror child selection code.  This has already
>>> been alluded to before, but it would be cool to keep track of latency
>>> over time, and use this to both a) prefer one drive over another when
>>> selecting the child and b) proactively timeout/ignore results from one
>>> child and select the other if it's taking longer than some historical
>>> standard deviation.  This keeps away from diagnosing drives as faulty,
>>> but does allow ZFS to make better choices and maintain response times.
>>> It shouldn't be hard to keep track of the average and/or standard
>>> deviation and use it for selection; proactively timing out the slow I/Os
>>> is much trickier.
>>>   
>
>   Interestingly, tracking latency has come under discussion in the
> Linux world, too, as they start to deal with developing resource
> management for disks as well as CPU.
>
>   In fact, there are two cases where you can use a feedback loop to
> adjust disk behavior, and a third to detect problems. The first 
> loop is the one you identified, for dealing with near/far and
> fast/slow mirrors.
>   

[what usually concerns me is that the software people spec'ing device
drivers don't seem to have much training in control systems, which is
what is being designed]

The feedback loop is troublesome because there is usually at least one
queue, perhaps 3 queues between the host and the media.  At each
queue, iops can be reordered.  As Sommerfeld points out, we see the
same sort of thing in IP networks, but two things bother me about that:

1. latency for disk seeks, rotates, and cache hits look very different
   than random IP network latencies.  For example: a TNF trace I
   recently examined for an IDE disk (no queues which reorder)
   running a single thread read workload showed the following data:
    block      size   latency (ms)
    446464       48    1.18
    7180944      16   13.82   (long seek?)
    7181072     112    3.65   (some rotation?)
    7181184     112    2.16
    7181296      16    0.53   (track cache?)
    446512       16    0.57   (track cache?)

   This same system using a SATA disk might look very
   different, because there are 2 additional queues at
   work, and (I expect) NCQ. OK, so the easy way around
   this is to build in a substantial guard band... no
   problem, but if you get above about a second, then
   you aren't much different than the B_FAILFAST solution
   even though...

2. The algorithm *must* be computationally efficient.
   We are looking down the tunnel at I/O systems that can
   deliver on the order of 5 Million iops.  We really won't
   have many (any?) spare cycles to play with.
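
As a crude userland illustration of the kind of latency watching involved
(not the in-kernel algorithm itself), something like this flags devices
whose average service time drifts past an arbitrary 50 ms guard band:

   # column positions assume Solaris "iostat -xn" output; asvc_t is
   # field 8 and the device name is the last field
   iostat -xn 5 | awk '$8+0 > 50 { print $NF ": asvc_t = " $8 " ms" }'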

>   The second is for resource management, where one throttles
> disk-hog projects when one discovers latency growing without
> bound on disk saturation, and the third is in case of a fault
> other than the above.
>   

Resource management is difficult when you cannot directly attribute
physical I/O to a process.

>   For the latter to work well, I'd like to see the resource management
> and fast/slow mirror adaptation be something one turns on explicitly,
> because then when FMA discovered that you in fact have a fast/slow
> mirror or a Dr. Evil program saturating the array, the "fix"
> could be to notify the sysadmin that they had a problem and
> suggesting built-in tools to ameliorate it. 
>   

Agree 100%.

>  
> Ian Collins writes: 
>   
>> One solution (again, to be used with a remote mirror) is the three way 
>> mirror.  If two devices are local and one remote, data is safe once the 
>> two local writes return.  I guess the issue then changes from "is my 
>> data safe" to "how safe is my data".  I would be reluctant to deploy a 
>> remote mirror device without local redundancy, so this probably won't be 
>> an uncommon setup.  There would have to be an acceptable window of risk 
>> when local data isn't replicated.
>> 
>
>   And in this case too, I'd prefer the sysadmin provide the information
> to ZFS about what she wants, and have the system adapt to it, and
> report how big the risk window is.
>
>   This would effectively change the FMA behavior, you understand, so as 
> to have it report failures to complete the local writes in time t0 and 
> remote in time t1, much as the resource management or fast/slow cases would
> need to be visible to FMA.
>   

I think this can be reasonably accomplished within the scope of FMA.
Perhaps we should pick that up on fm-discuss?

But I think the bigger problem is that unless you can solve for the general
case, you *will* get nailed.  I might even argue that we need a way for
storage devices to notify hosts of their characteristics, which would 
requi

Re: [zfs-discuss] EMC - top of the table for efficiency, how well would ZFS do?

2008-08-31 Thread Ross Smith

Dear god.  Thanks Tim, that's useful info.

The sales rep we spoke to was really trying quite hard to persuade us that 
NetApp was the best solution for us; they spent a couple of months working with 
us, but ultimately we were put off because of those 'limitations'.  They knew 
full well that those were two of our major concerns, but never had an answer 
for us.  That was a big part of the reason we started seriously looking into 
ZFS instead of NetApp.

If nothing else at least I now know a firm to avoid when buying NetApp...

Date: Sun, 31 Aug 2008 11:06:16 -0500
From: [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Subject: Re: [zfs-discuss] EMC - top of the table for efficiency, how well 
would ZFS do?
CC: zfs-discuss@opensolaris.org

On Sun, Aug 31, 2008 at 10:39 AM, Ross Smith <[EMAIL PROTECTED]> wrote:

Hey Tim,

I'll admit I just quoted the blog without checking, I seem to remember the 
sales rep I spoke to recommending putting aside 20-50% of my disk for 
snapshots.  Compared to ZFS where I don't need to reserve any space it feels 
very old fashioned.  With ZFS, snapshots just take up as much space as I want 
them to.

Your sales rep was an idiot then.  Snapshot reserve isn't required at all. It 
isn't necessary to take snapshots.  It's simply a portion of space out of a 
volume that can only be used for snapshots, live data cannot enter into this 
space.  Snapshots, however, can exist on a volume with no snapshot reserve.  
They are in no way limited to the "snapshot reserve" you've set. Snapshot 
reserve is a guaranteed minimum amount of space out of a volume.  You can set 
it 90% as you mention below, and it will work just fine.


ZFS is no different than NetApp when it comes to snapshots.  I suggest until 
you have a basic understanding of how NetApp software works, not making ANY 
definitive statements about them.  You're sounding like a fool and/or someone 
working for one of their competitors.

 

The problem though for our usage with NetApp was that we actually couldn't 
reserve enough space for snapshots.  50% of the pool was their maximum, and 
we're interested in running ten years worth of snapshots here, which could see 
us with a pool with just 10% of live data and 90% of the space taken up by 
snapshots.  The NetApp approach was just too restrictive.


Ross

There is not, and never has been a "50% of the pool maximum".  That's also a 
lie.  If you want snapshots to take up 90% of the pool, ONTAP will GLADLY do 
so.  I've got a filer sitting in my lab and would be MORE than happy to post 
the df output of a volume that has snapshots taking up 90% of the volume.



--Tim


Re: [zfs-discuss] EMC - top of the table for efficiency, how well would ZFS do?

2008-08-31 Thread Brian Hechinger
On Sun, Aug 31, 2008 at 11:06:16AM -0500, Tim wrote:
> > The problem though for our usage with NetApp was that we actually couldn't
> > reserve enough space for snapshots.  50% of the pool was their maximum, and
> > we're interested in running ten years worth of snapshots here, which could
> > see us with a pool with just 10% of live data and 90% of the space taken up
> > by snapshots.  The NetApp approach was just too restrictive.
> 
> There is not, and never has been a "50% of the pool maximum".  That's also a
> lie.  If you want snapshots to take up 90% of the pool, ONTAP will GLADLY do
> so.  I've got a filer sitting in my lab and would be MORE than happy to post
> the df output of a volume that has snapshots taking up 90% of the volume.

Even so, I don't think snapshots are really what he needs.  It sounds a lot like
what he really needs is an HSM like SAM.  That's just my opinion though, maybe.
10 years of snapshots sounds an awful lot like backups to me, and there are much
better ways to handle that than with snapshots (on any filesystem).
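
For example, periodic incremental zfs send/receive to a separate backup pool
keeps the long history off the live pool (a sketch; the pool, dataset and
host names here are made up):

   zfs snapshot tank/data@2008-08-31
   zfs send -i tank/data@2008-08-30 tank/data@2008-08-31 | \
       ssh backuphost zfs receive backup/data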

-brian
-- 
"Coding in C is like sending a 3 year old to do groceries. You gotta
tell them exactly what you want or you'll end up with a cupboard full of
pop tarts and pancake mix." -- IRC User (http://www.bash.org/?841435)


Re: [zfs-discuss] EMC - top of the table for efficiency, how well would ZFS do?

2008-08-31 Thread Tim
On Sun, Aug 31, 2008 at 10:39 AM, Ross Smith <[EMAIL PROTECTED]> wrote:

>  Hey Tim,
>
> I'll admit I just quoted the blog without checking, I seem to remember the
> sales rep I spoke to recommending putting aside 20-50% of my disk for
> snapshots.  Compared to ZFS where I don't need to reserve any space it feels
> very old fashioned.  With ZFS, snapshots just take up as much space as I
> want them to.
>

Your sales rep was an idiot then.  Snapshot reserve isn't required at all.
It isn't necessary to take snapshots.  It's simply a portion of space out of
a volume that can only be used for snapshots, live data cannot enter into
this space.  Snapshots, however, can exist on a volume with no snapshot
reserve.  They are in no way limited to the "snapshot reserve" you've set.
Snapshot reserve is a guaranteed minimum amount of space out of a volume.
You can set it 90% as you mention below, and it will work just fine.

ZFS is no different than NetApp when it comes to snapshots.  I suggest until
you have a basic understanding of how NetApp software works, not making ANY
definitive statements about them.  You're sounding like a fool and/or
someone working for one of their competitors.


>
>
> The problem though for our usage with NetApp was that we actually couldn't
> reserve enough space for snapshots.  50% of the pool was their maximum, and
> we're interested in running ten years worth of snapshots here, which could
> see us with a pool with just 10% of live data and 90% of the space taken up
> by snapshots.  The NetApp approach was just too restrictive.
>
> Ross
>

There is not, and never has been a "50% of the pool maximum".  That's also a
lie.  If you want snapshots to take up 90% of the pool, ONTAP will GLADLY do
so.  I've got a filer sitting in my lab and would be MORE than happy to post
the df output of a volume that has snapshots taking up 90% of the volume.


--Tim


Re: [zfs-discuss] EMC - top of the table for efficiency, how well would ZFS do?

2008-08-31 Thread Ross Smith

Hey Tim,

I'll admit I just quoted the blog without checking, I seem to remember the 
sales rep I spoke to recommending putting aside 20-50% of my disk for 
snapshots.  Compared to ZFS where I don't need to reserve any space it feels 
very old fashioned.  With ZFS, snapshots just take up as much space as I want 
them to.

The problem though for our usage with NetApp was that we actually couldn't 
reserve enough space for snapshots.  50% of the pool was their maximum, and 
we're interested in running ten years worth of snapshots here, which could see 
us with a pool with just 10% of live data and 90% of the space taken up by 
snapshots.  The NetApp approach was just too restrictive.

Ross


> Date: Sun, 31 Aug 2008 08:08:09 -0700
> From: [EMAIL PROTECTED]
> To: [EMAIL PROTECTED]; zfs-discuss@opensolaris.org
> Subject: Re: [zfs-discuss] EMC - top of the table for efficiency, how well 
> would ZFS do?
> 
> NetApp does NOT recommend 100 percent.  Perhaps you should talk to
> NetApp or one of their partners who know their tech instead of their
> competitors next time.
> 
> ZFS, the way it's currently implemented, will require roughly the same
> as NetApp... which still isn't 100%.
> 
> 
> 
> On 8/30/08, Ross <[EMAIL PROTECTED]> wrote:
> > Just saw this blog post linked from the register, it's EMC pointing out that
> > their array wastes less disk space than either HP or NetApp.  I'm loving the
> > 10% of space they have to reserve for snapshots, and you can't add more o_0.
> >
> > HP similarly recommend 20% of reserved space for snapshots, and NetApp
> > recommend a whopping 100% (that was one reason we didn't buy NetApp
> > actually).
> >
> > Could anybody say how ZFS would match up to these figures?  I'd have thought
> > a 14+2 raid-z2 scheme similar to NetApp's would probably be fairest.
> >
> > http://chucksblog.typepad.com/chucks_blog/2008/08/your-storage-mi.html
> >
> > Ross


Re: [zfs-discuss] Proposed 2540 and ZFS configuration

2008-08-31 Thread Bob Friesenhahn
On Sun, 31 Aug 2008, Ross wrote:

> You could split this into two raid-z2 sets if you wanted, that would 
> have a bit better performance, but if you can cope with the speed of 
> a single pool for now I'd be tempted to start with that.  It's 
> likely that by Christmas you'll be able to buy flash devices to use 
> as read or write cache with ZFS, at which point the speed of the 
> disks becomes academic for many cases.

We have not heard how this log server is going to receive the log 
data.  Receiving the logs via the BSD logging protocol is much different 
from receiving them via an NFS mount.  If the logs are received via the 
BSD logging protocol then the writes will be asynchronous and there is 
no need at all for an NV write cache.  If the logs are received via 
NFS, then the writes are synchronous, so there may be a need for an NV 
write cache in order to maintain adequate performance.  Luckily the 
StorageTek 2540 provides a reasonable NV write cache already.
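
(For the BSD-protocol case the client side is just a syslog.conf forwarding
rule; the log host name below is made up, and the fields must be
tab-separated:)

   # /etc/syslog.conf on each client -- forward everything to the log server
   *.debug         @loghost.example.com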

Without performing any actual testing to prove it, I would assume that 
two raidz2 sets will offer almost 2X the transactional performance of 
one big raidz2 set, which may be important for a logging server which 
is receiving simultaneous input from many places.
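
(For a 12-disk 2540 that might look like the following; the pool and device
names are made up:)

   zpool create logpool \
       raidz2 c4t0d0 c4t1d0 c4t2d0 c4t3d0 c4t4d0 c4t5d0 \
       raidz2 c4t6d0 c4t7d0 c4t8d0 c4t9d0 c4t10d0 c4t11d0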

For reliability, I definitely recommend something like the BSD logging 
protocol if it can be used since it is more likely to capture all of 
the logs if there is a problem.

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/



Re: [zfs-discuss] Proposed 2540 and ZFS configuration

2008-08-31 Thread Tim
With the restriping: wouldn't it be as simple as creating a new
folder/dataset/whatever on the same pool and doing an rsync to the
same pool/new location?  This would obviously cause a short downtime
to switch over and delete the old dataset, but seems like it should
work fine.  If you're doubling the pool size, space shouldn't be an
issue.
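
Something along these lines, perhaps (dataset names are made up; a final
catch-up rsync during the brief downtime keeps the copies in sync):

   zfs create tank/data.new
   rsync -a /tank/data/ /tank/data.new/    # new copy striped across all vdevs
   # ...stop writers, run one last rsync -a, then swap:
   zfs destroy -r tank/data
   zfs rename tank/data.new tank/data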




On 8/31/08, Ross <[EMAIL PROTECTED]> wrote:
> Personally I'd go for an 11 disk raid-z2, with one hot spare.  You lose
> some capacity, but you've got more than enough for your current needs, and
> with 1TB disks single parity raid means a lot of time with your data
> unprotected when one fails.
>
> You could split this into two raid-z2 sets if you wanted, that would have a
> bit better performance, but if you can cope with the speed of a single pool
> for now I'd be tempted to start with that.  It's likely that by Christmas
> you'll be able to buy flash devices to use as read or write cache with ZFS,
> at which point the speed of the disks becomes academic for many cases.
>
> Adding a further 12 disks sounds fine, just as you suggest.  You can add
> another 11 disk raid-z2 set to your pool very easily.  ZFS can't yet
> restripe your existing data across the new disks, so you'll have some data
> on the old 12 disk array, some striped across all 24, and some on the new
> array.
>
> ZFS probably does add some overhead compared to hardware raid, but unless
> you have a lot of load on that box I wouldn't expect it to be a problem.  I
> don't know the T5220 servers though, so you might want to double check that.
>
> I do agree that you don't want to use the hardware raid though, ZFS has
> plenty of advantages and it's best to let it manage the whole lot.  Could
> you do me a favour though and see how ZFS copes on that array if you just
> pull a disk while the ZFS pool is running?  I've had some problems on a home
> built box after pulling disks, I suspect a proper raid array will cope fine
> but haven't been able to get that tested yet.
>
> thanks,
>
> Ross


Re: [zfs-discuss] EMC - top of the table for efficiency, how well would ZFS do?

2008-08-31 Thread Tim
NetApp does NOT recommend 100 percent.  Perhaps you should talk to
NetApp or one of their partners who know their tech instead of their
competitors next time.

ZFS, the way it's currently implemented, will require roughly the same
as NetApp... which still isn't 100%.



On 8/30/08, Ross <[EMAIL PROTECTED]> wrote:
> Just saw this blog post linked from the register, it's EMC pointing out that
> their array wastes less disk space than either HP or NetApp.  I'm loving the
> 10% of space they have to reserve for snapshots, and you can't add more o_0.
>
> HP similarly recommend 20% of reserved space for snapshots, and NetApp
> recommend a whopping 100% (that was one reason we didn't buy NetApp
> actually).
>
> Could anybody say how ZFS would match up to these figures?  I'd have thought
> a 14+2 raid-z2 scheme similar to NetApp's would probably be fairest.
>
> http://chucksblog.typepad.com/chucks_blog/2008/08/your-storage-mi.html
>
> Ross


[zfs-discuss] Sidebar to ZFS Availability discussion

2008-08-31 Thread David Collier-Brown
Re Availability: ZFS needs to handle disk removal / 
 driver failure better
>> A better option would be to not use this to perform FMA diagnosis, but
>> instead work into the mirror child selection code.  This has already
>> been alluded to before, but it would be cool to keep track of latency
>> over time, and use this to both a) prefer one drive over another when
>> selecting the child and b) proactively timeout/ignore results from one
>> child and select the other if it's taking longer than some historical
>> standard deviation.  This keeps away from diagnosing drives as faulty,
>> but does allow ZFS to make better choices and maintain response times.
>> It shouldn't be hard to keep track of the average and/or standard
>> deviation and use it for selection; proactively timing out the slow I/Os
>> is much trickier.

  Interestingly, tracking latency has come under discussion in the
Linux world, too, as they start to deal with developing resource
management for disks as well as CPU.

  In fact, there are two cases where you can use a feedback loop to
adjust disk behavior, and a third to detect problems. The first 
loop is the one you identified, for dealing with near/far and
fast/slow mirrors.

  The second is for resource management, where one throttles
disk-hog projects when one discovers latency growing without
bound on disk saturation, and the third is in case of a fault
other than the above.

  For the latter to work well, I'd like to see the resource management
and fast/slow mirror adaptation be something one turns on explicitly,
because then when FMA discovered that you in fact have a fast/slow
mirror or a Dr. Evil program saturating the array, the "fix"
could be to notify the sysadmin that they had a problem and
suggesting built-in tools to ameliorate it. 

 
Ian Collins writes: 
> One solution (again, to be used with a remote mirror) is the three way 
> mirror.  If two devices are local and one remote, data is safe once the 
> two local writes return.  I guess the issue then changes from "is my 
> data safe" to "how safe is my data".  I would be reluctant to deploy a 
> remote mirror device without local redundancy, so this probably won't be 
> an uncommon setup.  There would have to be an acceptable window of risk 
> when local data isn't replicated.

  And in this case too, I'd prefer the sysadmin provide the information
to ZFS about what she wants, and have the system adapt to it, and
report how big the risk window is.

  This would effectively change the FMA behavior, you understand, so as 
to have it report failures to complete the local writes in time t0 and 
remote in time t1, much as the resource management or fast/slow cases would
need to be visible to FMA.

--dave (at home) c-b

-- 
David Collier-Brown| Always do right. This will gratify
Sun Microsystems, Toronto  | some people and astonish the rest
[EMAIL PROTECTED] |  -- Mark Twain
cell: (647) 833-9377, bridge: (877) 385-4099 code: 506 9191#


Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-31 Thread Johan Hartzenberg
On Thu, Aug 28, 2008 at 11:21 PM, Ian Collins <[EMAIL PROTECTED]> wrote:

> Miles Nordin writes:
>
> > suggested that unlike the SVM feature it should be automatic, because
> > by so being it becomes useful as an availability tool rather than just
> > performance optimisation.
> >
> So on a server with a read workload, how would you know if the remote
> volume
> was working?
>

Even reads induce writes (last access time, if nothing else).

My question: If a pool becomes non-redundant (eg due to a timeout, hotplug
removal, bad data returned from device, or for whatever reason), do we want
the affected pool/vdev/system to hang?  Generally speaking I would say that
this is what currently happens with other solutions.

Conversely:  Can the current situation be improved by allowing a device to
be taken out of the pool for writes - eg be placed in read-only mode?  I
would assume it is possible to modify the CoW system / functions which
allocates blocks for writes to ignore certain devices, at least
temporarily.

This would also lay a groundwork for allowing devices to be removed from a
pool - eg: Step 1: Make the device read-only. Step 2: touch every allocated
block on that device (causing it to be copied to some other disk), step 3:
remove it from the pool for reads as well and finally remove it from the
pool permanently.

  _hartz


Re: [zfs-discuss] Proposed 2540 and ZFS configuration

2008-08-31 Thread Ross
Personally I'd go for an 11 disk raid-z2, with one hot spare.  You lose some 
capacity, but you've got more than enough for your current needs, and with 1TB 
disks single parity raid means a lot of time with your data unprotected when 
one fails.

You could split this into two raid-z2 sets if you wanted, that would have a bit 
better performance, but if you can cope with the speed of a single pool for now 
I'd be tempted to start with that.  It's likely that by Christmas you'll be 
able to buy flash devices to use as read or write cache with ZFS, at which 
point the speed of the disks becomes academic for many cases.

Adding a further 12 disks sounds fine, just as you suggest.  You can add 
another 11 disk raid-z2 set to your pool very easily.  ZFS can't yet restripe 
your existing data across the new disks, so you'll have some data on the old 12 
disk array, some striped across all 24, and some on the new array.
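
(Adding the second set later would be a single command; the device names are
made up, and 'zpool add -n' previews the resulting layout first:)

   zpool add tank raidz2 c5t0d0 c5t1d0 c5t2d0 c5t3d0 c5t4d0 c5t5d0 \
       c5t6d0 c5t7d0 c5t8d0 c5t9d0 c5t10d0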

ZFS probably does add some overhead compared to hardware raid, but unless you 
have a lot of load on that box I wouldn't expect it to be a problem.  I don't 
know the T5220 servers though, so you might want to double check that.

I do agree that you don't want to use the hardware raid though, ZFS has plenty 
of advantages and it's best to let it manage the whole lot.  Could you do me a 
favour though and see how ZFS copes on that array if you just pull a disk while 
the ZFS pool is running?  I've had some problems on a home built box after 
pulling disks, I suspect a proper raid array will cope fine but haven't been 
able to get that tested yet.

thanks,

Ross