Re: [zfs-discuss] A few questions

2010-12-20 Thread Richard Elling
On Dec 20, 2010, at 4:19 PM, Edward Ned Harvey wrote:

>> From: Erik Trimble [mailto:erik.trim...@oracle.com]
>> 
>> We can either (a) change how ZFS does resilvering or (b) repack the
>> zpool layouts to avoid the problem in the first place.
>> 
>> In case (a), my vote would be to seriously increase the number of
>> in-flight resilver slabs, AND allow for out-of-time-order slab
>> resilvering.  
> 
> Question for any clueful person:
> 
> Suppose you have a mirror to resilver, made of disk1 and disk2, where disk2
> failed and is resilvering.  If you have an algorithm to create a list of all
> the used blocks of disk1 in disk order, then you're able to resilver the
> mirror extremely fast, because all the reads will be sequential in nature,
> plus you get to skip past all the unused space.

Sounds like the definition of random access :-) 

> 
> Now suppose you have a raidz with 3 disks (disk1, disk2, disk3, where disk3
> is resilvering).  You find some way of ordering all the used blocks of
> disk1...  Which means disk1 will be able to read in optimal order and speed.

Sounds like prefetching :-)

> Does that necessarily imply disk2 will also work well?  Does the on-disk
> order of blocks of disk1 necessarily match the order of blocks on disk2?

This is an interesting question that will become more interesting
as the physical sector size gets bigger...
 -- richard

> 


Re: [zfs-discuss] A few questions

2010-12-20 Thread Richard Elling
On Dec 20, 2010, at 7:31 AM, Phil Harman  wrote:

> On 20/12/2010 13:59, Richard Elling wrote:
>> 
>> On Dec 20, 2010, at 2:42 AM, Phil Harman  wrote:
>> 
>>> 
 Why does resilvering take so long in raidz anyway?
>>> Because it's broken. There were some changes a while back that made it more 
>>> broken.
>> 
>> "broken" is the wrong term here. It functions as designed and correctly 
>> resilvers devices. Disagreeing with the design is quite different than
>> proving a defect.
> 
> It might be the wrong term in general, but I think it does apply in the 
> budget home media server context of this thread.

If you only have a few slow drives, you don't have performance.
Like trying to win the Indianapolis 500 with a tricycle...

> I think we can agree that ZFS currently doesn't play well on cheap disks. I 
> think we can also agree that the performance of ZFS resilvering is known to 
> be suboptimal under certain conditions.

... and those conditions are also a strength. For example, most file
systems are nowhere near full. With ZFS you only resilver data. For those
who recall the resilver throttles in SVM or VXVM, you will appreciate not
having to resilver non-data.

> For a long time at Sun, the rule was "correctness is a constraint, 
> performance is a goal". However, in the real world, performance is often also 
> a constraint (just as a quick but erroneous answer is a wrong answer, so 
> also, a slow but correct answer can also be "wrong").
> 
> Then one brave soul at Sun once ventured that "if Linux is faster, it's a 
> Solaris bug!" and to his surprise, the idea caught on. I later went on to 
> tell people that ZFS delivered RAID "where I = inexpensive", so I'm 
> just a little frustrated when that promise becomes less respected over time. 
> First it was USB drives (which I agreed with), now it's SATA (and I'm not so 
> sure).

"slow" doesn't begin with an "i" :-)

> 
>> 
>>> There has been a lot of discussion, anecdotes and some data on this list. 
>> 
>> "slow because I use devices with poor random write(!) performance"
>> is very different than "broken."
> 
> Again, context is everything. For example, if someone was building a business 
> critical NAS appliance from consumer grade parts, I'd be the first to say 
> "are you nuts?!"

Unfortunately, the math does not support your position...

> 
>> 
>>> The resilver doesn't do a single pass of the drives, but uses a "smarter" 
>>> temporal algorithm based on metadata.
>> 
>> A design that only does a single pass does not handle the temporal
>> changes. Many RAID implementations use a mix of spatial and temporal
>> resilvering and suffer with that design decision.
> 
> Actually, it's easy to see how a combined spatial and temporal approach could 
> be implemented to an advantage for mirrored vdevs.
> 
>> 
>>> However, the current implementation has difficulty finishing the job if 
>>> there's a steady flow of updates to the pool.
>> 
>> Please define current. There are many releases of ZFS, and
>> many improvements have been made over time. What has not
>> improved is the random write performance of consumer-grade
>> HDDs.
> 
> I was led to believe this was not yet fixed in Solaris 11, and that there are 
> therefore doubts about what Solaris 10 update may see the fix, if any.
> 
>> 
>>> As far as I'm aware, the only way to get bounded resilver times is to stop 
>>> the workload until resilvering is completed.
>> 
>> I know of no RAID implementation that bounds resilver times
>> for HDDs. I believe it is not possible. OTOH, whether a resilver
>> takes 10 seconds or 10 hours makes little difference in data
>> availability. Indeed, this is why we often throttle resilvering
>> activity. See previous discussions on this forum regarding the
>> dueling RFEs.
> 
> I don't share your disbelief or "little difference" analysis. If it is true 
> that no current implementation succeeds, isn't that a great opportunity to 
> change the rules? Wasn't resilver time vs. availability a major factor in 
> Adam Leventhal's paper introducing the need for RAIDZ3?

No, it wasn't. There are two failure modes we can model given the data
provided by disk vendors:
1. failures by time (MTBF)
2. failures by bits read (UER)

Over time, the MTBF has improved, but the failures by bits read has not
improved. Just a few years ago enterprise class HDDs had an MTBF
of around 1 million hours. Today, they are in the range of 1.6 million
hours. Just looking at the size of the numbers, the probability that a
drive will fail in any given hour is on the order of 1e-6.

By contrast, the failure rate by bits read has not improved much.
Consumer class HDDs are usually spec'ed at 1 error per 1e14
bits read.  To put this in perspective, a 2TB disk holds around 1.6e13
bits, so the probability of hitting an unrecoverable read if you read every
bit on a 2TB drive is now well above 10%. Some of the better enterprise class 
HDDs are rated two orders of magnitude better, but the only way to get
much better is
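
A quick back-of-the-envelope check of those figures (using the nominal
numbers above -- 1.6e13 bits on the disk, a spec'ed UER of 1 error per
1e14 bits read -- purely as an illustration):

  # expected chance of at least one unrecoverable read over a full-disk read
  awk 'BEGIN { bits = 2e12 * 8; uer = 1e14;
               printf("P(unrecoverable read) ~= %.2f\n", 1 - exp(-bits/uer)) }'

which lands around 0.15, consistent with "well above 10%" for a consumer
class 2TB drive.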

Re: [zfs-discuss] a single nfs file system shared out twice with different permissions

2010-12-20 Thread Geoff Nordli
>From: Darren J Moffat 
>Sent: Monday, December 20, 2010 4:15 AM
>Subject: Re: [zfs-discuss] a single nfs file system shared out twice with
different
>permissions
>
>On 18/12/2010 07:09, Geoff Nordli wrote:
>> I am trying to configure a system where I have two different NFS
>> shares which point to the same directory.  The idea is if you come in
>> via one path, you will have read-only access and can't delete any
>> files, if you come in the 2nd path, then you will have read/write access.
>
>That sounds very similar to what you would do with Trusted Extensions.
>The read/write label would be a higher classification than the read-only
one -
>since you can read down, can't see higher and need to be equal to modify.
>
>For more information on Trusted Extensions start with these resources:
>
>
>Oracle Solaris 11 Express Trusted Extensions Collection
>
>   http://docs.sun.com/app/docs/coll/2580.1?l=en
>
>OpenSolaris Security Community pages on TX:
>
>http://hub.opensolaris.org/bin/view/Community+Group+security/tx
>

Darren, thanks for the suggestion.  I think I am going to go back to using
CIFS.   It seems to be quite a bit simpler than what I am looking at with
NFS.

Have a great day!

Geoff  




Re: [zfs-discuss] a single nfs file system shared out twice with different permissions

2010-12-20 Thread Geoff Nordli
>From: Richard Elling 
>Sent: Monday, December 20, 2010 8:14 PM
>Subject: Re: [zfs-discuss] a single nfs file system shared out twice with
different
>permissions
>
>On Dec 20, 2010, at 11:26 AM, "Geoff Nordli"  wrote:
>
>>> From: Edward Ned Harvey
>>> Sent: Monday, December 20, 2010 9:25 AM
>>> Subject: RE: [zfs-discuss] a single nfs file system shared out twice
>>> with
>> different
>>> permissions
>>>
 From: Richard Elling

> zfs create tank/snapshots
> zfs set sharenfs=on tank/snapshots

 "on" by default sets the NFS share parameters to: "rw"
 You can set specific NFS share parameters by using a string that
 contains the parameters.  For example,

zfs set sharenfs=rw=192.168.12.13,ro=192.168.12.14 my/file/system

 sets readonly access for host 192.168.12.14 and read/write access
 for 192.168.12.13.
>>>
>>> Yeah, but for some reason, the OP didn't want to make it readonly for
>> different
>>> clients ... He wanted a single client to have it mounted twice on two
>> different
>>> directories, one with readonly, and the other with read-write.
>
>Is someone suggesting my solution won't work? Or are they just not up to
the
>challenge? :-)
>

It won't work :) 

The challenge is exporting two shares from the same folder.  Linux has a
"bind" mount (mount --bind) which will make this work, but from what I can
see there isn't an equivalent on OpenSolaris.  
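
For reference, the Linux recipe I had in mind looks roughly like this
(paths and client names are made up, and the exact export options vary by
distribution -- treat it as a sketch only):

  # expose the same directory under a second path, then export each path
  # with different options
  mount --bind /tank/snapshots /export/snapshots-ro

  # /etc/exports (illustrative):
  #   /tank/snapshots       client(rw,sync)
  #   /export/snapshots-ro  client(ro,sync)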

This isn't a big deal though; I can make it work using CIFS.   It isn't
something that has to be NFS, but I thought I would ask to see if there was
a simple solution I was missing.   

>>> I guess he has some application he can imprison into a specific
>>> read-only subdirectory, while some other application should be able
>>> to read/write or something like that, using the same username, on the
same
>machine.
>>
>> It is the same application, but for some functions it needs to use
>> read-only access or it will modify the files when I don't want it to.
>
>Sounds like a simple dtrace script should do the trick, too.

Unfortunately, there isn't anything I can do about the application, and it
really isn't a big deal.  There is a pretty straightforward workaround.


Have a great day!

Geoff 




Re: [zfs-discuss] a single nfs file system shared out twice with different permissions

2010-12-20 Thread Richard Elling
On Dec 20, 2010, at 11:26 AM, "Geoff Nordli"  wrote:

>> From: Edward Ned Harvey
>> Sent: Monday, December 20, 2010 9:25 AM
>> Subject: RE: [zfs-discuss] a single nfs file system shared out twice with
> different
>> permissions
>> 
>>> From: Richard Elling
>>> 
 zfs create tank/snapshots
 zfs set sharenfs=on tank/snapshots
>>> 
>>> "on" by default sets the NFS share parameters to: "rw"
>>> You can set specific NFS share parameters by using a string that
>>> contains the parameters.  For example,
>>> 
>>>zfs set sharenfs=rw=192.168.12.13,ro=192.168.12.14 my/file/system
>>> 
>>> sets readonly access for host 192.168.12.14 and read/write access for
>>> 192.168.12.13.
>> 
>> Yeah, but for some reason, the OP didn't want to make it readonly for
> different
>> clients ... He wanted a single client to have it mounted twice on two
> different
>> directories, one with readonly, and the other with read-write.

Is someone suggesting my solution won't work? Or are they just not
up to the challenge? :-)

>> I guess he has some application he can imprison into a specific read-only
>> subdirectory, while some other application should be able to read/write or
>> something like that, using the same username, on the same machine.
> 
> It is the same application, but for some functions it needs to use read-only
> access or it will modify the files when I don't want it to. 

Sounds like a simple dtrace script should do the trick, too.
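Something like this, for instance -- only a sketch with a hypothetical
path, and note DTrace can observe the unwanted writes but not block them:

  # log every open-for-write ((arg1 & 3) != 0 means O_WRONLY or O_RDWR),
  # then filter the output for the share's path
  dtrace -n 'syscall::open*:entry /(arg1 & 3) != 0/
      { printf("%s[%d] %s", execname, pid, copyinstr(arg0)); }' | grep /tank/snapshots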
 -- richard



Re: [zfs-discuss] A few questions

2010-12-20 Thread Mark Sandrock
It may well be that different methods are optimal for different use cases.

Mechanical disk vs. SSD; mirrored vs. raidz[123]; sparse vs. populated; etc.

It would be interesting to read more in this area, if papers are available.

I'll have to take a look. ... Or does someone have pointers?

Mark


On Dec 20, 2010, at 6:28 PM, Edward Ned Harvey wrote:

>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
>> boun...@opensolaris.org] On Behalf Of Erik Trimble
>> 
>>> In the case of resilvering on a mirrored disk, why not take a snapshot,
> and
>> then
>>> resilver by doing a pure block copy from the snapshot? It would be
>> sequential,
>> 
>> So, a
>> ZFS snapshot would be just as fragmented as the ZFS filesystem was at
>> the time.
> 
> I think Mark was suggesting something like "dd" copy device 1 onto device 2,
> in order to guarantee a first-pass sequential resilver.  And my response
> would be:  Creative thinking and suggestions are always a good thing.  In
> fact, the above suggestion is already faster than the present-day solution
> for what I'm calling "typical" usage, but there are an awful lot of use
> cases where the "dd" solution would be worse... Such as a pool which is
> largely sequential already, or largely empty, or made of high IOPS devices
> such as SSD.  However, there is a desire to avoid resilvering unused blocks,
> so I hope a better solution is possible... 
> 
> The fundamental requirement for a better optimized solution would be a way
> to resilver according to disk ordering...  And it's just a question for
> somebody that actually knows the answer ... How terrible is the idea of
> figuring out the on-disk order?
> 



Re: [zfs-discuss] Intermittent ZFS hang

2010-12-20 Thread Bob Friesenhahn

On Sun, 19 Dec 2010, Robin Axelsson wrote:


To conclude this (in case you don't view this message using a 
monospace font) all drives in the affected storage pool (c9t0d0 - 
c9t7d0) report 2 Illegal Requests (save c9t3d0 that reports 5 
illegal requests). There is one drive (c9t3d0) that looks like the 
black sheep where it also is reported to have 35 Hard Errors, 21 
Transport Errors and 30 Media Errors. Does this mean that the disk 
is about to give up and should be replaced? zpool status indicates 
that it is in the online state and reports no failures.


I agree that it is best to attend to the "black sheep".  First make 
sure that there is nothing odd about its mechanics such as a loose 
mount which might allow it to vibrate.
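
For what it's worth, a few commands that can help confirm whether c9t3d0
really is accumulating errors (output formats vary a bit by release):

   iostat -En c9t3d0      # cumulative soft/hard/transport error counters
   fmdump -eV | tail -40  # recent FMA error telemetry, if any
   zpool status -v        # per-vdev read/write/checksum error counts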


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/


Re: [zfs-discuss] A few questions

2010-12-20 Thread Eric D. Mudama

On Mon, Dec 20 at 19:19, Edward Ned Harvey wrote:

If there is no correlation between on-disk order of blocks for different
disks within the same vdev, then all hope is lost; it's essentially
impossible to optimize the resilver/scrub order unless the on-disk order of
multiple disks is highly correlated or equal by definition.


Very little is impossible.

Drives have been optimally ordering seeks for 35+ years.  I'm guessing
that the trick (difficult, but not impossible) is how to solve a
"travelling salesman" route pathing problem where you have billions or
trillions of transactions, and do it fast enough that the extra
computation is worth more than simply giving the device 32+ queued
commands at a time, aligned with the elements of each ordered
transaction ID.

Add to that all the complexity of unwinding the error recovery in the
event that you fail checksum validation on transaction N-1 after
moving past transaction N, which would be a required capability if you
wanted to queue more than a single transaction for verification at a
time.

Oh, and do all of the above without noticeably affecting the throughput
of the applications already running on the system.

--eric

--
Eric D. Mudama
edmud...@mail.bounceswoosh.org



Re: [zfs-discuss] A few questions

2010-12-20 Thread Edward Ned Harvey
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Erik Trimble
> 
> > In the case of resilvering on a mirrored disk, why not take a snapshot,
and
> then
> > resilver by doing a pure block copy from the snapshot? It would be
> sequential,
>
> So, a
> ZFS snapshot would be just as fragmented as the ZFS filesystem was at
> the time.

I think Mark was suggesting something like "dd" copy device 1 onto device 2,
in order to guarantee a first-pass sequential resilver.  And my response
would be:  Creative thinking and suggestions are always a good thing.  In
fact, the above suggestion is already faster than the present-day solution
for what I'm calling "typical" usage, but there are an awful lot of use
cases where the "dd" solution would be worse... Such as a pool which is
largely sequential already, or largely empty, or made of high IOPS devices
such as SSD.  However, there is a desire to avoid resilvering unused blocks,
so I hope a better solution is possible... 
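
To make the "dd" idea concrete (device names hypothetical, and to be clear
this only illustrates the sequential-copy concept -- it is not how ZFS
resilvers, and writing to a disk that is still part of a live pool would be
unsafe):

  # whole-device sequential copy from the surviving side of a mirror to a
  # replacement disk, using a large block size for streaming throughput
  dd if=/dev/rdsk/c0t1d0s0 of=/dev/rdsk/c0t2d0s0 bs=1024k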

The fundamental requirement for a better optimized solution would be a way
to resilver according to disk ordering...  And it's just a question for
somebody that actually knows the answer ... How terrible is the idea of
figuring out the on-disk order?



Re: [zfs-discuss] A few questions

2010-12-20 Thread Edward Ned Harvey
> From: Erik Trimble [mailto:erik.trim...@oracle.com]
> 
> We can either (a) change how ZFS does resilvering or (b) repack the
> zpool layouts to avoid the problem in the first place.
> 
> In case (a), my vote would be to seriously increase the number of
> in-flight resilver slabs, AND allow for out-of-time-order slab
> resilvering.  

Question for any clueful person:

Suppose you have a mirror to resilver, made of disk1 and disk2, where disk2
failed and is resilvering.  If you have an algorithm to create a list of all
the used blocks of disk1 in disk order, then you're able to resilver the
mirror extremely fast, because all the reads will be sequential in nature,
plus you get to skip past all the unused space.

Now suppose you have a raidz with 3 disks (disk1, disk2, disk3, where disk3
is resilvering).  You find some way of ordering all the used blocks of
disk1...  Which means disk1 will be able to read in optimal order and speed.
Does that necessarily imply disk2 will also work well?  Does the on-disk
order of blocks of disk1 necessarily match the order of blocks on disk2?

If there is no correlation between on-disk order of blocks for different
disks within the same vdev, then all hope is lost; it's essentially
impossible to optimize the resilver/scrub order unless the on-disk order of
multiple disks is highly correlated or equal by definition.



Re: [zfs-discuss] Resilvering - Scrubing whats the different

2010-12-20 Thread Alexander Lesle
Hello Erik Trimble and Ian Collins,

thanks for the quick answers.

My question is answered and I am glad.

-- 
Best Regards
Alexander
December 20, 2010

[1] mid:4d0fb4a4.1090...@oracle.com




Re: [zfs-discuss] A few questions

2010-12-20 Thread Mark Sandrock

On Dec 20, 2010, at 2:05 PM, Erik Trimble wrote:

> On 12/20/2010 11:56 AM, Mark Sandrock wrote:
>> Erik,
>> 
>>  just a hypothetical what-if ...
>> 
>> In the case of resilvering on a mirrored disk, why not take a snapshot, and 
>> then
>> resilver by doing a pure block copy from the snapshot? It would be 
>> sequential,
>> so long as the original data was unmodified; and random access in dealing 
>> with
>> the modified blocks only, right.
>> 
>> After the original snapshot had been replicated, a second pass would be done,
>> in order to update the clone to 100% live data.
>> 
>> Not knowing enough about the inner workings of ZFS snapshots, I don't know 
>> why
>> this would not be doable. (I'm biased towards mirrors for busy filesystems.)
>> 
>> I'm supposing that a block-level snapshot is not doable -- or is it?
>> 
>> Mark
> Snapshots on ZFS are true snapshots - they take a picture of the current 
> state of the system. They DON'T copy any data around when created. So, a ZFS 
> snapshot would be just as fragmented as the ZFS filesystem was at the time.

But if one does a raw (block) copy, there isn't any fragmentation -- except for 
the COW updates.

If there were no updates to the snapshot, then it becomes a 100% sequential 
block copy operation.

But even with COW updates, presumably the large majority of the copy would 
still be sequential i/o.

Maybe for the 2nd pass, the filesystem would have to be locked, so that the 
operation would eventually complete,
but if this is fairly short in relation to the overall resilvering time, then 
it could still be a win in many cases.

I'm probably not explaining it well, and may be way off, but it seemed an 
interesting notion.

Mark

> 
> 
> The problem is this:
> 
> Let's say I write block A, B, C, and D on a clean zpool (what kind, it 
> doesn't matter).  I now delete block C.  Later on, I write block E.   There 
> is a probability (increasing dramatically as time goes on) that the on-disk 
> layout will now look like:
> 
> A, B, E, D
> 
> rather than
> 
> A, B, [space], D, E
> 
> 
> So, in the first case, I can do a sequential read to get A & B, but then must 
> do a seek to get D, and a seek to get E.
> 
> The "fragmentation" problem is mainly due to file deletion, NOT to file 
> re-writing.  (though, in ZFS, being a C-O-W filesystem, re-writing generally 
> looks like a delete-then-write process, rather than a modify process).
> 
> 
> -- 
> Erik Trimble
> Java System Support
> Mailstop:  usca22-123
> Phone:  x17195
> Santa Clara, CA
> Timezone: US/Pacific (GMT-0800)
> 



Re: [zfs-discuss] A few questions

2010-12-20 Thread Bakul Shah
On Mon, 20 Dec 2010 11:27:41 PST Erik Trimble   wrote:
> 
> The problem boils down to this:
> 
> When ZFS does a resilver, it walks the METADATA tree to determine what 
> order to rebuild things from. That means, it resilvers the very first 
> slab ever written, then the next oldest, etc.   The problem here is that 
> slab "age" has nothing to do with where that data physically resides on 
> the actual disks. If you've used the zpool as a WORM device, then, sure, 
> there should be a strict correlation between increasing slab age and 
> locality on the disk.  However, in any reasonable case, files get 
> deleted regularly. This means that the probability that for a slab B, 
> written immediately after slab A, it WON'T be physically near slab A.
> 
> In the end, the problem is that using metadata order, while reducing the 
> total amount of work to do in the resilver (as you only resilver live 
> data, not every bit on the drive), increases the physical inefficiency 
> for each slab.  That is, seek time between cylinders begins to dominate 
> your slab reconstruction time.  In RAIDZ, this problem is magnified by 
> both the much larger average vdev size vs mirrors, and the necessity 
> that all drives containing a slab information return that data before 
> the corrected data can be written to the resilvering drive.
> 
> Thus, current ZFS resilvering tends to be seek-time limited, NOT 
> throughput limited.  This is really the "fault" of the underlying media, 
> not ZFS.  For instance, if you have a raidZ of SSDs (where seek time is 
> negligible, but throughput isn't),  they resilver really, really fast. 
> In fact, they resilver at the maximum write throughput rate.   However, 
> HDs are severely seek-limited, so that dominates HD resilver time.

You guys may be interested in a solution I used in a totally
different situation.  There an identical tree data structure
had to be maintained on every node of a distributed system.
When a new node was added, it needed to be initialized with
an identical copy before it could be put in operation. But
this had to be done while the rest of the system was
operational and there may even be updates from a central node
during the `mirroring' operation. Some of these updates could
completely change the tree!  Starting at the root was not
going to work since a subtree that was being copied may stop
existing in the middle and its space reused! In a way this is
a similar problem (but worse!). I needed something foolproof
and simple.

My algorithm started copying sequentially from the start.  If
N blocks were already copied when an update comes along,
updates of any block with block# > N are ignored (since the
sequential copy would get to them eventually).  Updates of
any block# <= N were queued up (further update of the same
block would overwrite the old update, to reduce work).
Periodically they would be flushed out to the new node. This
was paced so as to not affect normal operation much.

I should think a variation would work for active filesystems.
You sequentially read some amount of data from all the disks
from which data for the new disk is to be prepared, and write it
out sequentially. Each time, read enough data so that reading
time dominates any seek time. Handle concurrent updates as
above. If you dedicate N% of time to resilvering, the total
time to complete resilver will be 100/N times sequential read
time of the whole disk. (For example, 1TB disk, 100MBps io
speed, 25% for resilver => under 12 hours).  How much worse
this gets depends on the amount of updates during
resilvering.
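
A quick sanity check of that estimate, with the same illustrative numbers:

  # full-disk sequential read time, scaled by the fraction of time
  # dedicated to the resilver
  awk 'BEGIN { size = 1e12; rate = 100e6; frac = 0.25;
               printf("%.1f hours\n", size / rate / frac / 3600) }'

which prints about 11.1 hours, i.e. under 12.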

At the time of resilvering your FS is more likely to be near
full than near empty so I wouldn't worry about optimizing the
mostly empty FS case.

Bakul


Re: [zfs-discuss] A few questions

2010-12-20 Thread Erik Trimble

On 12/20/2010 11:56 AM, Mark Sandrock wrote:

Erik,

just a hypothetical what-if ...

In the case of resilvering on a mirrored disk, why not take a snapshot, and then
resilver by doing a pure block copy from the snapshot? It would be sequential,
so long as the original data was unmodified; and random access in dealing with
the modified blocks only, right.

After the original snapshot had been replicated, a second pass would be done,
in order to update the clone to 100% live data.

Not knowing enough about the inner workings of ZFS snapshots, I don't know why
this would not be doable. (I'm biased towards mirrors for busy filesystems.)

I'm supposing that a block-level snapshot is not doable -- or is it?

Mark
Snapshots on ZFS are true snapshots - they take a picture of the current 
state of the system. They DON'T copy any data around when created. So, a 
ZFS snapshot would be just as fragmented as the ZFS filesystem was at 
the time.



The problem is this:

Let's say I write block A, B, C, and D on a clean zpool (what kind, it 
doesn't matter).  I now delete block C.  Later on, I write block E.   
There is a probability (increasing dramatically as time goes on) that 
the on-disk layout will now look like:


A, B, E, D

rather than

A, B, [space], D, E


So, in the first case, I can do a sequential read to get A & B, but then 
must do a seek to get D, and a seek to get E.


The "fragmentation" problem is mainly due to file deletion, NOT to file 
re-writing.  (though, in ZFS, being a C-O-W filesystem, re-writing 
generally looks like a delete-then-write process, rather than a modify 
process).



--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)



Re: [zfs-discuss] A few questions

2010-12-20 Thread Mark Sandrock
Erik,

just a hypothetical what-if ...

In the case of resilvering on a mirrored disk, why not take a snapshot, and then
resilver by doing a pure block copy from the snapshot? It would be sequential,
so long as the original data was unmodified; and random access in dealing with
the modified blocks only, right.

After the original snapshot had been replicated, a second pass would be done,
in order to update the clone to 100% live data.

Not knowing enough about the inner workings of ZFS snapshots, I don't know why
this would not be doable. (I'm biased towards mirrors for busy filesystems.)

I'm supposing that a block-level snapshot is not doable -- or is it?

Mark

On Dec 20, 2010, at 1:27 PM, Erik Trimble wrote:

> On 12/20/2010 9:20 AM, Saxon, Will wrote:
>>> -Original Message-
>>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
>>> boun...@opensolaris.org] On Behalf Of Edward Ned Harvey
>>> Sent: Monday, December 20, 2010 11:46 AM
>>> To: 'Lanky Doodle'; zfs-discuss@opensolaris.org
>>> Subject: Re: [zfs-discuss] A few questions
>>> 
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Lanky Doodle
 
> I believe Oracle is aware of the problem, but most of
> the core ZFS team has left. And of course, a fix for
> Oracle Solaris no longer means a fix for the rest of
> us.
 OK, that is a bit concerning then. As good as ZFS may be, i'm not sure I
>>> want
 to commit to a file system that is 'broken' and may not be fully fixed, 
>>> if at all.
>>> 
>>> ZFS is not "broken."  It is, however, a weak spot, that resilver is very
>>> inefficient.  For example:
>>> 
>>> On my server, which is made up of 10krpm SATA drives, 1TB each...  My
>>> drives
>>> can each sustain 1Gbit/sec sequential read/write.  This means, if I needed
>>> to resilver the entire drive (in a mirror) sequentially, it would take ...
>>> 8,000 sec = 133 minutes.  About 2 hours.  In reality, I have ZFS mirrors,
>>> and disks are around 70% full, and resilver takes 12-14 hours.
>>> 
>>> So although resilver is "broken" by some standards, it is bounded, and you
>>> can limit it to something which is survivable, by using mirrors instead of
>>> raidz.  For most people, even using 5-disk, or 7-disk raidzN will still be
>>> fine.  But you start getting unsustainable if you get up to 21-disk raidz3
>>> for example.
>> This argument keeps coming up on the list, but I don't see where anyone has 
>> made a good suggestion about whether this can even be 'fixed' or how it 
>> would be done.
>> 
>> As I understand it, you have two basic types of array reconstruction: in a 
>> mirror you can make a block-by-block copy and that's easy, but in a parity 
>> array you have to perform a calculation on the existing data and/or existing 
>> parity to reconstruct the missing piece. This is pretty easy when you can 
>> guarantee that all your stripes are the same width, start/end on the same 
>> sectors/boundaries/whatever and thus know a piece of them lives on all 
>> drives in the set. I don't think this is possible with ZFS since we have 
>> variable stripe width. A failed disk d may or may not contain data from 
>> stripe s (or transaction t). This information has to be discovered by 
>> looking at the transaction records. Right?
>> 
>> Can someone speculate as to how you could rebuild a variable stripe width 
>> array without replaying all the available transactions? I am no filesystem 
>> engineer but I can't wrap my head around how this could be handled any 
>> better than it already is. I've read that resilvering is throttled - 
>> presumably to keep performance degradation to a minimum during the process - 
>> maybe this could be a tunable (e.g. priority: low, normal, high)?
>> 
>> Do we know if resilvers on a mirror are actually handled differently from 
>> those on a raidz?
>> 
>> Sorry if this has already been explained. I think this is an issue that 
>> everyone who uses ZFS should understand completely before jumping in, 
>> because the behavior (while not 'wrong') is clearly NOT the same as with 
>> more conventional arrays.
>> 
>> -Will
> the "problem" is NOT the checksum/error correction overhead. that's 
> relatively trivial.  The problem isn't really even variable width (i.e. 
> variable number of disks one crosses) slabs.
> 
> The problem boils down to this:
> 
> When ZFS does a resilver, it walks the METADATA tree to determine what order 
> to rebuild things from. That means, it resilvers the very first slab ever 
> written, then the next oldest, etc.   The problem here is that slab "age" has 
> nothing to do with where that data physically resides on the actual disks. If 
> you've used the zpool as a WORM device, then, sure, there should be a strict 
> correlation between increasing slab age and locality on the disk.  However, 
> in any reasonable case, files get deleted regularly. This means that the 
> probability that for a slab B,

Re: [zfs-discuss] Resilvering - Scrubing whats the different

2010-12-20 Thread Erik Trimble

On 12/20/2010 11:36 AM, Alexander Lesle wrote:

Hello All

I have been reading the thread "Resilver/scrub times?" for a few minutes
and I realize that I don't know the difference between
resilvering and scrubbing. Shame on me. :-(

I don't find an explanation in the man pages. I know the command
to start scrubbing is "zpool scrub tank",
but what is the command to start a resilver, and what is the difference?

Resilvering is reconstruction of a failed drive (or portion of that 
drive).  It involves walking the metadata tree of the pool, to see if 
all blocks are stored properly on the correct devices; if not, then a 
write is issued to the device which is missing the correct block.  It 
does NOT deal with checksums of the individual blocks.


Scrubbing is error-detection.  Scrubbing looks for blocks whose metadata 
checksums do not match the checksum computed from the data held in the 
block. If the data doesn't match, ONLY then does a new block get written 
out.


Scrubbing is independent of media failure (that is, it isn't triggered 
by the failure of a block or whole device) - it is initiated by a userland 
action.


Resilvering is dependent on device failure (whether permanent, or 
temporary) - it is triggered by a system condition.


--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)



Re: [zfs-discuss] Resilvering - Scrubing whats the different

2010-12-20 Thread Ian Collins

 On 12/21/10 08:36 AM, Alexander Lesle wrote:

Hello All

I have been reading the thread "Resilver/scrub times?" for a few minutes
and I realize that I don't know the difference between
resilvering and scrubbing. Shame on me. :-(

Scrubbing is used to check the contents of a pool by reading the data 
and verifying its checksum.


To quote the man page:

 Scrubbing and resilvering are very  similar  operations.
 The  difference  is  that resilvering only examines data
 that ZFS knows to be out  of  date  (for  example,  when
 attaching  a  new  device  to  a  mirror or replacing an
 existing device), whereas scrubbing examines all data to
 discover  silent  errors  due to hardware faults or disk
 failure.


I don't find an explanation in the man pages. I know the command
to start scrubbing is "zpool scrub tank",
but what is the command to start a resilver, and what is the difference?


There isn't one.  Resilvering starts automatically when a drive is replaced.
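
In other words (pool and device names are hypothetical, and the exact
progress wording varies between releases):

   zpool scrub tank                  # scrubbing is started explicitly
   zpool replace tank c1t3d0 c1t5d0  # replacing a device triggers a resilver
   zpool status tank                 # shows resilver/scrub progress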

--
Ian.



[zfs-discuss] Resilvering - Scrubing whats the different

2010-12-20 Thread Alexander Lesle
Hello All

I have been reading the thread "Resilver/scrub times?" for a few minutes
and I realize that I don't know the difference between
resilvering and scrubbing. Shame on me. :-(

I don't find an explanation in the man pages. I know the command
to start scrubbing is "zpool scrub tank",
but what is the command to start a resilver, and what is the difference?

-- 
Best Regards
Alexander
December 20, 2010



Re: [zfs-discuss] A few questions

2010-12-20 Thread Erik Trimble

On 12/20/2010 9:20 AM, Saxon, Will wrote:

-Original Message-
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
boun...@opensolaris.org] On Behalf Of Edward Ned Harvey
Sent: Monday, December 20, 2010 11:46 AM
To: 'Lanky Doodle'; zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] A few questions


From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
boun...@opensolaris.org] On Behalf Of Lanky Doodle


I believe Oracle is aware of the problem, but most of
the core ZFS team has left. And of course, a fix for
Oracle Solaris no longer means a fix for the rest of
us.

OK, that is a bit concerning then. As good as ZFS may be, i'm not sure I want
to commit to a file system that is 'broken' and may not be fully fixed,
if at all.

ZFS is not "broken."  It is, however, a weak spot, that resilver is very
inefficient.  For example:

On my server, which is made up of 10krpm SATA drives, 1TB each...  My
drives
can each sustain 1Gbit/sec sequential read/write.  This means, if I needed
to resilver the entire drive (in a mirror) sequentially, it would take ...
8,000 sec = 133 minutes.  About 2 hours.  In reality, I have ZFS mirrors,
and disks are around 70% full, and resilver takes 12-14 hours.

So although resilver is "broken" by some standards, it is bounded, and you
can limit it to something which is survivable, by using mirrors instead of
raidz.  For most people, even using 5-disk, or 7-disk raidzN will still be
fine.  But you start getting unsustainable if you get up to 21-disk raidz3
for example.

This argument keeps coming up on the list, but I don't see where anyone has 
made a good suggestion about whether this can even be 'fixed' or how it would 
be done.

As I understand it, you have two basic types of array reconstruction: in a 
mirror you can make a block-by-block copy and that's easy, but in a parity 
array you have to perform a calculation on the existing data and/or existing 
parity to reconstruct the missing piece. This is pretty easy when you can 
guarantee that all your stripes are the same width, start/end on the same 
sectors/boundaries/whatever and thus know a piece of them lives on all drives 
in the set. I don't think this is possible with ZFS since we have variable 
stripe width. A failed disk d may or may not contain data from stripe s (or 
transaction t). This information has to be discovered by looking at the 
transaction records. Right?

Can someone speculate as to how you could rebuild a variable stripe width array 
without replaying all the available transactions? I am no filesystem engineer 
but I can't wrap my head around how this could be handled any better than it 
already is. I've read that resilvering is throttled - presumably to keep 
performance degradation to a minimum during the process - maybe this could be a 
tunable (e.g. priority: low, normal, high)?

Do we know if resilvers on a mirror are actually handled differently from those 
on a raidz?

Sorry if this has already been explained. I think this is an issue that 
everyone who uses ZFS should understand completely before jumping in, because 
the behavior (while not 'wrong') is clearly NOT the same as with more 
conventional arrays.

-Will


As far as a possible fix, here's what I can see:

[Note:  I'm not a kernel or FS-level developer. I would love to be able 
to fix this myself, but I have neither the aptitude nor the [extensive] 
time to learn such skill]


We can either (a) change how ZFS does resilvering or (b) repack the 
zpool layouts to avoid the problem in the first place.


In case (a), my vote would be to seriously increase the number of 
in-flight resilver slabs, AND allow for out-of-time-order slab 
resilvering.  By that, I mean that ZFS would read several 
disk-sequential slabs, and then mark them as "done". This would mean a 
*lot* of scanning the metadata tree (since leaves all over the place 
could be "done").   Frankly, I can't say how bad that would be; the 
problem is that for ANY resilver, ZFS would have to scan the entire 
metadata tree to see if it had work to do, rather than simply look for 
the latest completed leaf, then assume everything after that needs to 
be done.  There'd also be the matter of determining *if* one should read 
a disk sector...


In case (b), we need the ability to move slabs around on the physical 
disk (via the mythical "Block Pointer Re-write" method).  If there is 
that underlying mechanism, then a "defrag" utility can be run to repack 
the zpool to the point where chronological creation time = physical 
layout.  Which then substantially mitigates the seek time problem.



I can't fix (a) - I don't understand the codebase well enough. Neither 
can I do the BP-rewrite implementation.  However, if I can get 
BP-rewrite, I've got a prototype defragger that seems to work well 
(under simulation). I'm sure it could use some performance improvement, 
but it works reasonably well on a simulated fragmented pool.



Please, Santa, can a good littl

Re: [zfs-discuss] A few questions

2010-12-20 Thread Erik Trimble

On 12/20/2010 9:20 AM, Saxon, Will wrote:

-Original Message-
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
boun...@opensolaris.org] On Behalf Of Edward Ned Harvey
Sent: Monday, December 20, 2010 11:46 AM
To: 'Lanky Doodle'; zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] A few questions


From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
boun...@opensolaris.org] On Behalf Of Lanky Doodle


I believe Oracle is aware of the problem, but most of
the core ZFS team has left. And of course, a fix for
Oracle Solaris no longer means a fix for the rest of
us.

OK, that is a bit concerning then. As good as ZFS may be, i'm not sure I want
to commit to a file system that is 'broken' and may not be fully fixed,
if at all.

ZFS is not "broken."  It is, however, a weak spot, that resilver is very
inefficient.  For example:

On my server, which is made up of 10krpm SATA drives, 1TB each...  My
drives
can each sustain 1Gbit/sec sequential read/write.  This means, if I needed
to resilver the entire drive (in a mirror) sequentially, it would take ...
8,000 sec = 133 minutes.  About 2 hours.  In reality, I have ZFS mirrors,
and disks are around 70% full, and resilver takes 12-14 hours.

So although resilver is "broken" by some standards, it is bounded, and you
can limit it to something which is survivable, by using mirrors instead of
raidz.  For most people, even using 5-disk, or 7-disk raidzN will still be
fine.  But you start getting unsustainable if you get up to 21-disk raidz3
for example.

This argument keeps coming up on the list, but I don't see where anyone has 
made a good suggestion about whether this can even be 'fixed' or how it would 
be done.

As I understand it, you have two basic types of array reconstruction: in a 
mirror you can make a block-by-block copy and that's easy, but in a parity 
array you have to perform a calculation on the existing data and/or existing 
parity to reconstruct the missing piece. This is pretty easy when you can 
guarantee that all your stripes are the same width, start/end on the same 
sectors/boundaries/whatever and thus know a piece of them lives on all drives 
in the set. I don't think this is possible with ZFS since we have variable 
stripe width. A failed disk d may or may not contain data from stripe s (or 
transaction t). This information has to be discovered by looking at the 
transaction records. Right?

Can someone speculate as to how you could rebuild a variable stripe width array 
without replaying all the available transactions? I am no filesystem engineer 
but I can't wrap my head around how this could be handled any better than it 
already is. I've read that resilvering is throttled - presumably to keep 
performance degradation to a minimum during the process - maybe this could be a 
tunable (e.g. priority: low, normal, high)?

Do we know if resilvers on a mirror are actually handled differently from those 
on a raidz?

Sorry if this has already been explained. I think this is an issue that 
everyone who uses ZFS should understand completely before jumping in, because 
the behavior (while not 'wrong') is clearly NOT the same as with more 
conventional arrays.

-Will
the "problem" is NOT the checksum/error correction overhead. that's 
relatively trivial.  The problem isn't really even variable width (i.e. 
variable number of disks one crosses) slabs.


The problem boils down to this:

When ZFS does a resilver, it walks the METADATA tree to determine what 
order to rebuild things from. That means, it resilvers the very first 
slab ever written, then the next oldest, etc.   The problem here is that 
slab "age" has nothing to do with where that data physically resides on 
the actual disks. If you've used the zpool as a WORM device, then, sure, 
there should be a strict correlation between increasing slab age and 
locality on the disk.  However, in any reasonable case, files get 
deleted regularly. This means that for a slab B, written immediately 
after slab A, the probability is high that it WON'T be physically near slab A.


In the end, the problem is that using metadata order, while reducing the 
total amount of work to do in the resilver (as you only resilver live 
data, not every bit on the drive), increases the physical inefficiency 
for each slab.  That is, seek time between cylinders begins to dominate 
your slab reconstruction time.  In RAIDZ, this problem is magnified by 
both the much larger average vdev size vs mirrors, and the necessity 
that all drives containing a slab information return that data before 
the corrected data can be written to the resilvering drive.


Thus, current ZFS resilvering tends to be seek-time limited, NOT 
throughput limited.  This is really the "fault" of the underlying media, 
not ZFS.  For instance, if you have a raidZ of SSDs (where seek time is 
negligible, but throughput isn't),  they resilver really, really fast. 
In fact, they resilver at the maximum write throughput rate.   Howev

Re: [zfs-discuss] a single nfs file system shared out twice with different permissions

2010-12-20 Thread Geoff Nordli
>From: Edward Ned Harvey
>Sent: Monday, December 20, 2010 9:25 AM
>Subject: RE: [zfs-discuss] a single nfs file system shared out twice with
different
>permissions
>
>> From: Richard Elling
>>
>> > zfs create tank/snapshots
>> > zfs set sharenfs=on tank/snapshots
>>
>> "on" by default sets the NFS share parameters to: "rw"
>> You can set specific NFS share parameters by using a string that
>> contains the parameters.  For example,
>>
>>  zfs set sharenfs=rw=192.168.12.13,ro=192.168.12.14 my/file/system
>>
>> sets readonly access for host 192.168.12.14 and read/write access for
>> 192.168.12.13.
>
>Yeah, but for some reason, the OP didn't want to make it readonly for
different
>clients ... He wanted a single client to have it mounted twice on two
different
>directories, one with readonly, and the other with read-write.
>
>I guess he has some application he can imprison into a specific read-only
>subdirectory, while some other application should be able to read/write or
>something like that, using the same username, on the same machine.

It is the same application, but for some functions it needs to use read-only
access or it will modify the files when I don't want it to. 

Have a great day!

Geoff 

   




Re: [zfs-discuss] AHCI or IDE?

2010-12-20 Thread Alexander Lesle
Hello Richard Elling and List,

thx for your answer.

I have these problems when I want to install and use Nexenta 3.0.4. I
wrote that I have bought the Supermicro board, and when all the items
are here I want to do a fresh installation.
I cannot send the output of zpool status -xv right now.

Maybe you are right that the BIOS is crappy.
Before I install the HBA in the Supermicro board I will reset
the HBA.

on December 19, 2010, 19:55  wrote in [1]:

> On Dec 16, 2010, at 12:08 PM, Alexander Lesle wrote:

>> Hello Pasi,
>> 
>> thx for the quick answer.
>> 
>> It sounds fine, because with the Asus board under Nexenta I always
>> get a message that my rpool is degraded when I set AHCI.

> That is very unusual.  Please send the output of "zpool status -xv"

>> Additionally, with the LSI HBA the board doesn't boot when AHCI is set.

> This sounds like a crappy BIOS or HBA firmware.  Also, it could be
> some combination of BIOS settings that confuse the issue.
>  -- richard

>> 
>> am Donnerstag, 16. Dezember 2010 um 20:53 hat  u.a.
>> in mid:20101216195305.gz2...@reaktio.net geschrieben:
>>> On Thu, Dec 16, 2010 at 08:43:02PM +0100, Alexander Lesle wrote:
 Hello All,
 
 I want to build a home file and media server now. After experimenting with an
 Asus board and running into unsolved problems I have bought this
 Supermicro Board X8SIA-F with Intel i3-560 and 8 GB Ram
 http://www.supermicro.com/products/motherboard/Xeon3000/3400/X8SIA.cfm?IPMI=Y
 also the LSI HBA SAS 9211-8i
 http://lsi.com/storage_home/products_home/host_bus_adapters/sas_hbas/internal/sas9211-8i/index.html
 
 rpool = 2vdev mirror
 tank = 2 x 2vdev mirror. For the future I want to have the option to
 expand up to 12 x 2vdev mirror.
 
 After reading the board manual I found at page 4-9 where I can set
 SATA#1 from IDE to AHCI.
 
 Can zfs handle AHCI for rpool?
 Can zfs handle AHCI for tank?
 
 Thx for helping.
 
>> 
>>> You definitely want to use AHCI and not the legacy IDE.
>> 
>>> AHCI enables:
>>>- disk hotswap.
>>>- NCQ (Native Command Queuing) to execute multiple commands at the 
>>> same time.
>> 
>> 
>>> -- Pasi
>> 
>> 
>> -- 
>> Wishing you a nice evening,
>> Alexander
>> mailto:gro...@tierarzt-mueller.de
>> email written with: The Bat! 4.0.38
>> under Windows Pro 
>> 
>> 
>> 
>> 
>> 
>> ___
>> zfs-discuss mailing list
>> zfs-discuss@opensolaris.org
>> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


-- 
Best Regards
Alexander
December 20, 2010

[1] mid:39b246ac-a158-4aed-a161-7bfa26858...@gmail.com




Re: [zfs-discuss] Faulted SSDs

2010-12-20 Thread Edward Ned Harvey
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Paul Piscuc
> 
>         NAME        STATE     READ WRITE CKSUM
>         zpool       ONLINE       0     0     0
>           raidz1-0  ONLINE       0     0     0
>             c2t0d0  ONLINE       0     0     0
>             c2t1d0  ONLINE       0     0     0
>             c2t2d0  ONLINE       0     0     0
>         cache
>           c2t3d0    FAULTED      0     0     0  too many errors
>           c2t4d0    FAULTED      0     0     0  too many errors
> 
> The cache disks are mirrored.

This may be irrelevant, but no, they are not mirrored.
(1) you can't mirror cache devices (nor is there any need to) and
(2) in the listing above, they are not represented as mirrors in any way.



Re: [zfs-discuss] a single nfs file system shared out twice with different permissions

2010-12-20 Thread Edward Ned Harvey
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Richard Elling
> 
> > zfs create tank/snapshots
> > zfs set sharenfs=on tank/snapshots
> 
> "on" by default sets the NFS share parameters to: "rw"
> You can set specific NFS share parameters by using a string that
> contains the parameters.  For example,
> 
>   zfs set sharenfs=rw=192.168.12.13,ro=192.168.12.14 my/file/system
> 
> sets readonly access for host 192.168.12.14 and read/write access
> for 192.168.12.13.

Yeah, but for some reason, the OP didn't want to make it readonly for
different clients ... He wanted a single client to have it mounted twice on
two different directories, one with readonly, and the other with read-write.

I guess he has some application he can imprison into a specific read-only
subdirectory, while some other application should be able to read/write or
something like that, using the same username, on the same machine.
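
In other words, something like this on the one client (server name and
paths are hypothetical):

  mount -F nfs -o rw server:/tank/snapshots /mnt/snapshots-rw
  mount -F nfs -o ro server:/tank/snapshots /mnt/snapshots-ro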



Re: [zfs-discuss] Interesting/Strange Problem

2010-12-20 Thread Edward Ned Harvey
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of artiepen
> 
> Sure, but it's really straightforward:
> 0,5,10,15,20,25,30,35,40,45,50,55 * * * * chown -R user1:group1
> /zpool1/test/share2/* 2> /dev/null ; chmod -R g+w /zpool1/test/share2/* 2>
> /dev/null

Do you have any spaces in the file/dir names under share2?  Your command
using the * will get expanded by the shell at runtime, so chown &
chmod operate on each expanded entry instead of recursively descending into
the share2 directory itself.  I suggest the
following instead:

chown -R user1:group1 /zpool1/test/share2 2> /dev/null ; chmod -R g+w
/zpool1/test/share2 2> /dev/null

Also, get a really good idea of what you're actually working on:
echo /zpool1/test/share2/*

And don't assume there are no symlinks.  Test it:
find /zpool1/test/share2 -type l

And look to see if any filesystem is mounted as a subdirectory of the other
filesystem.  
zfs list | grep zpool1/test/share2 | grep -v @


> To clarify how odd that is: /zpool1/test/share2 is mounted on a web server
at
> /mount/point. Going to /mount/point as root and chowning * caused the
> issue to happen with /zpool1/test/share1.

If the web server is a NFS client, you should consider whether or not any
previous mount/unmount request may have been at fault.  Such as lazy
dismount or forced dismount, and stuff like that.  To be really sure, you
could reboot the NFS client, which would guarantee no stale mounts lingering
around.


> This is reproducible, by the way. I can cause this to happen again, right
now if
> I wanted to...

Could you show the output of "ls -ld" on some file or directory, to show the
before and after?




Re: [zfs-discuss] A few questions

2010-12-20 Thread Saxon, Will
> -Original Message-
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Edward Ned Harvey
> Sent: Monday, December 20, 2010 11:46 AM
> To: 'Lanky Doodle'; zfs-discuss@opensolaris.org
> Subject: Re: [zfs-discuss] A few questions
> 
> > From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> > boun...@opensolaris.org] On Behalf Of Lanky Doodle
> >
> > > I believe Oracle is aware of the problem, but most of
> > > the core ZFS team has left. And of course, a fix for
> > > Oracle Solaris no longer means a fix for the rest of
> > > us.
> >
> > OK, that is a bit concerning then. As good as ZFS may be, i'm not sure I
> want
> > to commit to a file system that is 'broken' and may not be fully fixed,
> if at all.
> 
> ZFS is not "broken."  It is, however, a weak spot, that resilver is very
> inefficient.  For example:
> 
> On my server, which is made up of 10krpm SATA drives, 1TB each...  My
> drives
> can each sustain 1Gbit/sec sequential read/write.  This means, if I needed
> to resilver the entire drive (in a mirror) sequentially, it would take ...
> 8,000 sec = 133 minutes.  About 2 hours.  In reality, I have ZFS mirrors,
> and disks are around 70% full, and resilver takes 12-14 hours.
> 
> So although resilver is "broken" by some standards, it is bounded, and you
> can limit it to something which is survivable, by using mirrors instead of
> raidz.  For most people, even using 5-disk, or 7-disk raidzN will still be
> fine.  But you start getting unsustainable if you get up to 21-disk raidz3
> for example.

This argument keeps coming up on the list, but I don't see where anyone has 
made a good suggestion about whether this can even be 'fixed' or how it would 
be done.

As I understand it, you have two basic types of array reconstruction: in a 
mirror you can make a block-by-block copy and that's easy, but in a parity 
array you have to perform a calculation on the existing data and/or existing 
parity to reconstruct the missing piece. This is pretty easy when you can 
guarantee that all your stripes are the same width, start/end on the same 
sectors/boundaries/whatever and thus know a piece of them lives on all drives 
in the set. I don't think this is possible with ZFS since we have variable 
stripe width. A failed disk d may or may not contain data from stripe s (or 
transaction t). This information has to be discovered by looking at the 
transaction records. Right?

Can someone speculate as to how you could rebuild a variable stripe width array 
without replaying all the available transactions? I am no filesystem engineer 
but I can't wrap my head around how this could be handled any better than it 
already is. I've read that resilvering is throttled - presumably to keep 
performance degradation to a minimum during the process - maybe this could be a 
tunable (e.g. priority: low, normal, high)? 
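
For what it's worth, on OpenSolaris-era builds the throttle isn't a per-pool
property but a pair of kernel tunables (names as I recall them from the b128+
scan code -- treat this as a sketch, and at your own risk):

echo zfs_resilver_delay/W0t0 | mdb -kw           # drop the per-I/O resilver delay (default 2 ticks)
echo zfs_resilver_min_time_ms/W0t5000 | mdb -kw  # spend more of each txg resilvering (default 3000 ms)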

Do we know if resilvers on a mirror are actually handled differently from those 
on a raidz?

Sorry if this has already been explained. I think this is an issue that 
everyone who uses ZFS should understand completely before jumping in, because 
the behavior (while not 'wrong') is clearly NOT the same as with more 
conventional arrays.

-Will
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Disk failed, System not booting

2010-12-20 Thread Tim Cook
Just boot off a live cd, import the pool, and swap it that way.

I'm guessing you haven't changed your failmode to continue?
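
Roughly (pool and device names below are placeholders for whatever yours are
called):

zpool import -f tank         # from the live CD, by name or numeric id
zpool status tank            # confirm which device is the dead one
zpool replace tank c0t2d0    # start the resilver onto the new disk in the same slot
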
On Dec 20, 2010 10:48 AM, "Albert Frenz"  wrote:
> hi there,
>
> i got freenas installed with a raidz1 pool of 3 disks. one of them now
> failed and it gives me errors like "Unrecovered read errors: auto
> reallocate failed" or "MEDIUM ERROR asc:11,4" and the system won't even
> boot up. so i bought a replacement drive, but i am a bit concerned since
> normally you should detach the drive via terminal. i can't do it, since it
> won't boot up. so am i safe, if i just shut down the machine and replace the
> drive with the new one and resilver?
>
> thanks in advance
> adrian
> --
> This message posted from opensolaris.org
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Resilver/scrub times?

2010-12-20 Thread Edward Ned Harvey
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Tobias Lauridsen
> 
> Sorry to bring the old one up, but I think it is better than making a new
> one?  Is there someone who has some resilver times from a raidz1/2 pool with
> 5TB+ data on it?

Resilver & scrub times aren't primarily influenced by the number of bytes to
resilver or scrub.  They're primarily influenced by how fragmented that data
is...

If you have no snapshots, and never did, and your files were initially
written serially and quickly, then your scrub & resilver times should be very
quick, because all your data is laid out on disk serially and you're able to
cover a large number of GB/min.

But if you have been performing random reads/writes over a long period of
time, creating and destroying snapshots, etc., then it could be pretty awful.

All of this is, of course, no concern for SSD drives with really high IOPS.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A few questions

2010-12-20 Thread Edward Ned Harvey
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Lanky Doodle
> 
> Is there any argument against using the rpool for all data storage as well
as
> being the install volume?

Generally speaking, you can't do it.
The rpool is only supported on mirrors, not raidz.  I believe this is
because you need rpool in order to load the kernel, and until the kernel is
loaded, there's just no reasonable way to have a fully zfs-aware,
supports-every-feature bootloader able to read rpool in order to fetch the
kernel.

Normally, you'll dedicate 2 disks to the OS, and then you build additional
separate data pools.  If you absolutely need all the disk space of the OS
disks, then you partition the OS into a smaller section of the OS disks and
assign the remaining space to some pool.  But doing that partitioning scheme
can be complex, and if you're not careful, risky.  I don't advise it unless
you truly have your back against a wall for more disk space.
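
A minimal sketch of the usual layout (device names hypothetical): let the
installer mirror the two OS disks as rpool, then build the data pool yourself:

# installer: rpool = mirror c0t0d0 c0t1d0
zpool create tank raidz c0t2d0 c0t3d0 c0t4d0 c0t5d0 c0t6d0
zpool add tank raidz c0t7d0 c0t8d0 c0t9d0 c0t10d0 c0t11d0   # grow later, one vdev at a time
zfs create tank/media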


> Why does resilvering take so long in raidz anyway?

There are some really long and sometimes complex threads in this mailing
list discussing that.  Fundamentally ... First of all, it's not always true.
It depends on your usage behavior and the type of disks you're using.  But
the "typical" usage includes reading & writing a lot of files, essentially
randomly over time, creating and deleting snapshots, using spindle disks, so
the "typical" usage behavior does have a resilver performance problem.

The root cause of the problem is that ZFS does not resilver the whole
disk...  It only resilvers the used portions of the disk.  Sounds like a
performance enhancer, right?  It would be, if the disks were mostly empty
... or if ZFS were resilvering a partial disk, in order according to disk
layout.  Unfortunately, it resilvers in the temporal order the blocks were
written, and usually a disk is significantly full (say, 50% or more), so the
disks have to thrash all around, performing all sorts of random reads, until
eventually all the used parts have been read in essentially random order.

It's worse on raidzN than on mirrors, because the number of items which must
be read is higher in raidzN, assuming you're using larger vdevs and therefore
more items exist scattered about inside each vdev.  You therefore have a
higher number of things which must be randomly read before you reach
completion.
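
You can watch this happen on a live resilver; something like the following
(pool name assumed) makes the random-read thrash pretty obvious:

zpool status -v tank     # shows resilver progress, rate, and estimated time to go
iostat -xn 5             # busy disks doing lots of small scattered reads at a few MB/s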

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Resilver/scrub times?

2010-12-20 Thread Carsten Aulbert
Hi

On Sunday 19 December 2010 11:12:32 Tobias Lauridsen wrote:
> Sorry to bring the old one up, but I think it is better than making a new
> one?  Is there someone who has some resilver times from a raidz1/2 pool
> with 5TB+ data on it?

if you just look at the discussion over the past day (or week), you will see
that the resilver time depends on the amount of writes to the system while
resilvering. On an idle system you might be able to guesstimate it by taking
the disk size and the number of IOPS of the disk and the system into account;
usually a couple of hours should be about right.
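
A back-of-envelope version of that guesstimate, just to show the shape of it
(every number below is an assumption, not a measurement):

# 1 TB of data in ~128 KB blocks      -> ~8 million blocks to touch
# at ~125 random reads/sec per disk   -> 8,000,000 / 125 = 64,000 s, ~18 hours
# the same 1 TB read sequentially at 100 MB/s would take ~10,000 s, ~3 hours

which is why a fragmented pool can take many times the idle, sequential
estimate.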

Cheers

Carsten
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A few questions

2010-12-20 Thread Edward Ned Harvey
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Lanky Doodle
> 
> > I believe Oracle is aware of the problem, but most of
> > the core ZFS team has left. And of course, a fix for
> > Oracle Solaris no longer means a fix for the rest of
> > us.
> 
> OK, that is a bit concerning then. As good as ZFS may be, I'm not sure I want
> to commit to a file system that is 'broken' and may not be fully fixed,
> if at all.

ZFS is not "broken."  It does, however, have a weak spot: resilver is very
inefficient.  For example:

On my server, which is made up of 10krpm SATA drives, 1TB each...  My drives
can each sustain 1Gbit/sec sequential read/write.  This means, if I needed
to resilver the entire drive (in a mirror) sequentially, it would take ...
8,000 sec = 133 minutes.  About 2 hours.  In reality, I have ZFS mirrors,
and disks are around 70% full, and resilver takes 12-14 hours.

So although resilver is "broken" by some standards, it is bounded, and you
can limit it to something which is survivable, by using mirrors instead of
raidz.  For most people, even using 5-disk, or 7-disk raidzN will still be
fine.  But you start getting unsustainable if you get up to 21-disk raidz3
for example.
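
(For the curious, the arithmetic behind that sequential floor, taking
1 Gbit/sec as roughly 125 MB/sec:)

# 1 TB / 125 MB/s = 8,000 s = ~133 minutes, a little over 2 hours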

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Intermittent ZFS hang

2010-12-20 Thread Robin Axelsson
I have now upgraded to OpenIndiana b148, which should fix those bugs that you
mentioned. I lost the picture on the monitor, but by ssh'ing in from another
computer the system seems to be running fine.

The problems have become worse now: I get a freeze every time I try to access
the 8-disk raidz2 tank (dedup is not used and never has been). It also takes
considerably longer than before to mount the storage pool during boot up. No
errors are reported by zpool status, but there is one significant difference
since the update; the "iostat -En" command now reports errors, and here's what
it looks like:

c7d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0 
Model: SAMSUNG HD103SJ Revision:  Serial No: #  Size: 1000.20GB 
<1000202305536 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 
Illegal Request: 0 
c9t0d0   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0 
Vendor: ATA  Product: SAMSUNG HD154UI  Revision: 1118 Serial No:  
Size: 1500.30GB <1500301910016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 
Illegal Request: 2 Predictive Failure Analysis: 0 
c9t1d0   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0 
Vendor: ATA  Product: SAMSUNG HD154UI  Revision: 1118 Serial No:  
Size: 1500.30GB <1500301910016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 
Illegal Request: 2 Predictive Failure Analysis: 0 
c9t2d0   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0 
Vendor: ATA  Product: SAMSUNG HD154UI  Revision: 1118 Serial No:  
Size: 1500.30GB <1500301910016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 
Illegal Request: 2 Predictive Failure Analysis: 0 
c9t3d0   Soft Errors: 0 Hard Errors: 35 Transport Errors: 21 
Vendor: ATA  Product: SAMSUNG HD154UI  Revision: 1118 Serial No:  
Size: 1500.30GB <1500301910016 bytes>
Media Error: 30 Device Not Ready: 0 No Device: 5 Recoverable: 0 
Illegal Request: 5 Predictive Failure Analysis: 0 
c9t4d0   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0 
Vendor: ATA  Product: SAMSUNG HD154UI  Revision: 1118 Serial No:  
Size: 1500.30GB <1500301910016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 
Illegal Request: 2 Predictive Failure Analysis: 0 
c9t5d0   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0 
Vendor: ATA  Product: SAMSUNG HD154UI  Revision: 1118 Serial No:  
Size: 1500.30GB <1500301910016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 
Illegal Request: 2 Predictive Failure Analysis: 0 
c9t6d0   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0 
Vendor: ATA  Product: SAMSUNG HD154UI  Revision: 1118 Serial No:  
Size: 1500.30GB <1500301910016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 
Illegal Request: 2 Predictive Failure Analysis: 0 
c9t7d0   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0 
Vendor: ATA  Product: SAMSUNG HD154UI  Revision: 1118 Serial No:  
Size: 1500.30GB <1500301910016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 
Illegal Request: 2 Predictive Failure Analysis: 0 


To summarize (in case you don't view this message in a monospace font): all
drives in the affected storage pool (c9t0d0 - c9t7d0) report 2 Illegal
Requests, except c9t3d0, which reports 5. That one drive (c9t3d0) looks like
the black sheep: it is also reported to have 35 Hard Errors, 21 Transport
Errors and 30 Media Errors. Does this mean the disk is about to give up and
should be replaced? zpool status indicates that it is in the online state and
reports no failures.

Any suggestions on how to proceed with this would be much appreciated.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Interesting/Strange Problem

2010-12-20 Thread artiepen
If it helps anyone who might see this in the future. I still haven't figured it 
out. I ran a dependency checker on the application and even though >I< can 
browse the share that it's located on, the application says that its dlls 
cannot be found even though they are in the same dir as the app.

FWIW, the dependency checker application is called depends. Oddly enough, in 
this program, when the application isn't available, it no longer lists it in 
the Recent menu...
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Resilver/scrub times?

2010-12-20 Thread Tobias Lauridsen
Sorry to bring the old one up, but I think it is better than making a new one?
Is there someone who has some resilver times from a raidz1/2 pool with 5TB+
data on it?
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Disk failed, System not booting

2010-12-20 Thread Albert Frenz
hi there,

i got freenas installed with a raidz1 pool of 3 disks. one of them now failed
and it gives me errors like "Unrecovered read errors: auto reallocate failed" or
"MEDIUM ERROR asc:11,4" and the system won't even boot up. so i bought a
replacement drive, but i am a bit concerned since normally you should detach the
drive via terminal. i can't do it, since it won't boot up. so am i safe, if i
just shut down the machine and replace the drive with the new one and resilver?

thanks in advance
adrian
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Interesting/Strange Problem

2010-12-20 Thread artiepen
To clarify how odd that is: /zpool1/test/share2 is mounted on a web server at 
/mount/point. Going to /mount/point as root and chowning * caused the issue to 
happen with /zpool1/test/share1.

This is reproducible, by the way. I can cause this to happen again, right now 
if I wanted to...

Another thing: I checked the ownership and perms on /zpool1/test/share1. ls -dV
showed no change in the ACLs from what I had set.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A few questions

2010-12-20 Thread Mark Sandrock

On Dec 18, 2010, at 12:23 PM, Lanky Doodle wrote:

> Now this is getting really complex, but can you have server failover in ZFS, 
> much like DFS-R in Windows - you point clients to a clustered ZFS namespace 
> so if a complete server failed nothing is interrupted.

This is the purpose of an Amber Road dual-head cluster (7310C, 7410C, etc.) -- 
not only the storage pool fails over,
but also the server IP address fails over, so that NFS, etc. shares remain
active when one storage head goes down.

Amber Road uses ZFS, but the clustering and failover are not related to the 
filesystem type.

Mark
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Interesting/Strange Problem

2010-12-20 Thread artiepen
Sure, but it's really straightforward:
0,5,10,15,20,25,30,35,40,45,50,55 * * * * chown -R user1:group1
/zpool1/test/share2/* 2> /dev/null ; chmod -R g+w /zpool1/test/share2/* 2>
/dev/null

Here's the thing: There's no way that it was a hard/soft link. I know what 
those are and I haven't linked anything from those filesystems.

When I was trying to troubleshoot this I discovered that on the system that was 
>mounting< the NFS share I could change the permissions at the mount point 
(which correlated to /share2) and it would mess up the CIFS share. Yes, setting 
permissions on the >mounting< system would cause the problem to happen. 

To clarify how odd that is: /zpool1/test/share2 is mounted on a web server at 
/mount/point. Going to /mount/point as root and chowning * caused the issue to 
happen with /zpool1/test/share1.

This is reproducible, by the way. I can cause this to happen again, right now 
if I wanted to...

Another thing: I checked the ownership and perms on /zpool1/test/share1. ls -dV
showed no change in the ACLs from what I had set.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A few questions

2010-12-20 Thread Phil Harman

On 20/12/2010 13:59, Richard Elling wrote:
> On Dec 20, 2010, at 2:42 AM, Phil Harman  wrote:
> 
>>> Why does resilvering take so long in raidz anyway?
>> 
>> Because it's broken. There were some changes a while back that made
>> it more broken.
> 
> "broken" is the wrong term here. It functions as designed and correctly
> resilvers devices. Disagreeing with the design is quite different than
> proving a defect.

It might be the wrong term in general, but I think it does apply in the
budget home media server context of this thread. I think we can agree
that ZFS currently doesn't play well on cheap disks. I think we can also
agree that the performance of ZFS resilvering is known to be suboptimal
under certain conditions.

For a long time at Sun, the rule was "correctness is a constraint,
performance is a goal". However, in the real world, performance is often
also a constraint (just as a quick but erroneous answer is a wrong
answer, so also, a slow but correct answer can also be "wrong").

Then one brave soul at Sun once ventured that "if Linux is faster, it's
a Solaris bug!" and to his surprise, the idea caught on. I later went on
to tell people that ZFS delivered RAID "where I = inexpensive", so I'm
just a little frustrated when that promise becomes less respected over
time. First it was USB drives (which I agreed with), now it's SATA (and
I'm not so sure).

>> There has been a lot of discussion, anecdotes and some data on this
>> list.
> 
> "slow because I use devices with poor random write(!) performance"
> is very different than "broken."

Again, context is everything. For example, if someone was building a
business critical NAS appliance from consumer grade parts, I'd be the
first to say "are you nuts?!"

>> The resilver doesn't do a single pass of the drives, but uses a
>> "smarter" temporal algorithm based on metadata.
> 
> A design that only does a single pass does not handle the temporal
> changes. Many RAID implementations use a mix of spatial and temporal
> resilvering and suffer with that design decision.

Actually, it's easy to see how a combined spatial and temporal approach
could be implemented to an advantage for mirrored vdevs.

>> However, the current implementation has difficulty finishing the job
>> if there's a steady flow of updates to the pool.
> 
> Please define current. There are many releases of ZFS, and
> many improvements have been made over time. What has not
> improved is the random write performance of consumer-grade
> HDDs.

I was led to believe this was not yet fixed in Solaris 11, and that
there are therefore doubts about what Solaris 10 update may see the fix,
if any.

>> As far as I'm aware, the only way to get bounded resilver times is to
>> stop the workload until resilvering is completed.
> 
> I know of no RAID implementation that bounds resilver times
> for HDDs. I believe it is not possible. OTOH, whether a resilver
> takes 10 seconds or 10 hours makes little difference in data
> availability. Indeed, this is why we often throttle resilvering
> activity. See previous discussions on this forum regarding the
> dueling RFEs.

I don't share your disbelief or "little difference" analysis. If it is
true that no current implementation succeeds, isn't that a great
opportunity to change the rules? Wasn't resilver time vs availability
a major factor in Adam Leventhal's paper introducing the need for
RAIDZ3?

The appropriateness or otherwise of resilver throttling depends on the
context. If I can tolerate further failures without data loss (e.g.
RAIDZ2 with one failed device, or RAIDZ3 with two failed devices), or if
I can recover business critical data in a timely manner, then great. But
there may come a point where I would rather take a short term
performance hit to close the window on total data loss.

>> The problem exists for mirrors too, but is not as marked because
>> mirror reconstruction is inherently simpler.
> 
> Resilver time is bounded by the random write performance of
> the resilvering device. Mirroring or raidz make no difference.

This only holds in a quiesced system.

>> I believe Oracle is aware of the problem, but most of the core ZFS
>> team has left. And of course, a fix for Oracle Solaris no longer
>> means a fix for the rest of us.
> 
> Some "improvements" were made post-b134 and pre-b148.

That is, indeed, good news.

>  -- richard
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A few questions

2010-12-20 Thread Lanky Doodle
Thanks relling.

I suppose at the end of the day any file system/volume manager has its flaws,
so perhaps it's better to look at the positives of each and decide based on
them.

So, back to my question above, is there a deciding argument *against* putting
data on the install volume (rpool)? Forget about mirroring for a sec;

1) Select 3 disks during install creating raidz1. Create a further 4x 3 drive 
raidz1's, giving me a 10TB rpool with no spare disks

2) Select 5 disks during install creating raidz1. Create a further 2x 5 drive
raidz1's giving me a 12TB rpool with no spare disks

3) Select 7 disks during install creating raidz1. Create a further 7 drive 
raidz1 giving me 12TB rpool with 1 spare disk

As there is no space gain between 2) and 3) there is no point going for 3), 
other than having a spare disk, but resilver times would be slower.

So it comes down to 1) and 2). Neither offers spare disks, but 1) would offer
faster resilver times with up to 5 simultaneous disk failures and 2) would offer
2TB extra space with up to 3 simultaneous disk failures.

FYI, I am using Samsung SpinPoint F2's, which have the variable RPM speeds 
(http://www.scan.co.uk/products/1tb-samsung-hd103si-ecogreen-f2-sata-3gb-s-32mb-cache-89-ms-ncq)

I may wait at least until I get the next 4 drives in (I actually have 6 at the
mo, not 5), taking me to 10, before migrating to ZFS, so there's plenty of time
to think about it, and hopefully time for them to fix resilvering! ;-)

Thanks again...
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Faulted SSDs

2010-12-20 Thread Paul Piscuc
Here is a part of "fmdump -eV" :

Dec 19 2010 03:02:47.919024953 ereport.fs.zfs.probe_failure
nvlist version: 0
class = ereport.fs.zfs.probe_failure
ena = 0x4bd7543b8cf1
detector = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = zfs
pool = 0x9e0b8e0f936d08c6
vdev = 0xf3c1ec665a2f2e9a
(end detector)

pool = zpool
pool_guid = 0x9e0b8e0f936d08c6
pool_context = 0
pool_failmode = continue
vdev_guid = 0xf3c1ec665a2f2e9a
vdev_type = disk
vdev_path = /dev/dsk/c2t3d0s0
vdev_devid = id1,s...@sata_adata_ssd_s599_60045/a
prev_state = 0x0
__ttl = 0x1
__tod = 0x4d0de657 0x36c73539

There are also similar errors, regarding /p...@0,0/pci8086,3...@1f,2/d...@3,0
with the same error name.


On Mon, Dec 20, 2010 at 4:52 PM, Richard Elling wrote:

> NexentaStor logs are in /var/log. But the real information of
> interest is in the FMA ereports. fmdump -eV is your friend.
>
>  -- richard
>
> On Dec 20, 2010, at 6:39 AM, Paul Piscuc 
> wrote:
>
> > Hi,
> >
> > The problem seems to be solved with a zpool clear. It is not clear what
> > generated the issue, and I cannot locate what caused it, because a reboot
> > seems to have deleted all logs :| . I have issued several grep's under
> > /var/log, /var and now under /, but I couldn't find any record. Also, I thought
> > that Nexenta might have rotated the logs, but I couldn't find any archive.
> >
> > Anyways, that was rather strange, and hopefully, was a temporary issue.
> If it happens again, I'll make sure that I won't reboot the system.
> >
> > Thx alot for all your help.
> >
> > Paul
> > ___
> > zfs-discuss mailing list
> > zfs-discuss@opensolaris.org
> > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Faulted SSDs

2010-12-20 Thread Richard Elling
NexentaStor logs are in /var/log. But the real information of
interest is in the FMA ereports. fmdump -eV is your friend.
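
A couple of quick examples (standard fmdump/fmadm usage):

fmdump -e          # one-line summary of recent ereports, to spot the noisy device
fmdump -eV | less  # full detail for each ereport
fmadm faulty       # anything FMA has actually diagnosed as a fault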

 -- richard

On Dec 20, 2010, at 6:39 AM, Paul Piscuc  wrote:

> Hi,
> 
> The problem seems to be solved with a zpool clear. It is not clear what
> generated the issue, and I cannot locate what caused it, because a reboot
> seems to have deleted all logs :| . I have issued several grep's under
> /var/log, /var and now under /, but I couldn't find any record. Also, I thought
> that Nexenta might have rotated the logs, but I couldn't find any archive.
> 
> Anyways, that was rather strange, and hopefully, was a temporary issue. If it 
> happens again, I'll make sure that I won't reboot the system.
> 
> Thx alot for all your help.
> 
> Paul
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Faulted SSDs

2010-12-20 Thread Paul Piscuc
Hi,

The problem seems to be solved with a zpool clear. It is not clear
what generated the issue, and I cannot locate what caused it, because a
reboot seems to have deleted all logs :| . I have issued several grep's
under /var/log, /var and now under /, but I couldn't find any record. Also, I
thought that Nexenta might have rotated the logs, but I couldn't find any
archive.

Anyways, that was rather strange, and hopefully, was a temporary issue. If
it happens again, I'll make sure that I won't reboot the system.

Thx alot for all your help.

Paul
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Faulted SSDs

2010-12-20 Thread Richard Elling
Also check your email. NexentaStor sends an email message
describing the actions taken when this occurs. If you did not
setup email for the appliance, then look in the NMS log.

 -- richard

On Dec 20, 2010, at 5:33 AM, Khushil Dep  wrote:

> Check the dmesg and system logs for any output concerning those devices
> 
> re-seat one then the other just in case too.
> 
> ---
> W. A. Khushil Dep - khushil@gmail.com -  07905374843
> 
> Visit my blog at http://www.khushil.com/
> 
> 
> 
> 
> 
> 
> On 20 December 2010 13:10, Paul Piscuc  wrote:
> Hi, this is current setup that I have been doing tests on:
> 
> NAMESTATE READ WRITE CKSUM
> zpool   ONLINE   0 0 0
>   raidz1-0  ONLINE   0 0 0
> c2t0d0  ONLINE   0 0 0
> c2t1d0  ONLINE   0 0 0
> c2t2d0  ONLINE   0 0 0
> cache
>   c2t3d0FAULTED  0 0 0  too many errors
>   c2t4d0FAULTED  0 0 0  too many errors
> 
> I would like to mention that this box uses Nexenta Community Edition, the 
> cache disks are SSDs (ADATA AS599S-64GM-C ), and it is functional for about 1 
> month. The cache disks are mirrored.
> I wouldn't mind a faulted disk, but those two are 64GB SSDs. Could you point 
> me in the right direction to see what happened? Or to what generated the error?
> 
> P.S. The system wasn't under huge loads 
> 
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
> 
> 
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A few questions

2010-12-20 Thread Richard Elling

On Dec 20, 2010, at 2:42 AM, Phil Harman  wrote:

>> Why does resilvering take so long in raidz anyway?
> 
> Because it's broken. There were some changes a while back that made it more 
> broken.

"broken" is the wrong term here. It functions as designed and correctly 
resilvers devices. Disagreeing with the design is quite different than
proving a defect.

> There has been a lot of discussion, anecdotes and some data on this list. 

"slow because I use devices with poor random write(!) performance"
is very different than "broken."

> The resilver doesn't do a single pass of the drives, but uses a "smarter" 
> temporal algorithm based on metadata.

A design that only does a single pass does not handle the temporal
changes. Many RAID implementations use a mix of spatial and temporal
resilvering and suffer with that design decision.

> However, the current implementation has difficulty finishing the job if there's
> a steady flow of updates to the pool.

Please define current. There are many releases of ZFS, and
many improvements have been made over time. What has not
improved is the random write performance of consumer-grade
HDDs.

> As far as I'm aware, the only way to get bounded resilver times is to stop 
> the workload until resilvering is completed.

I know of no RAID implementation that bounds resilver times
for HDDs. I believe it is not possible. OTOH, whether a resilver
takes 10 seconds or 10 hours makes little difference in data
availability. Indeed, this is why we often throttle resilvering
activity. See previous discussions on this forum regarding the
dueling RFEs.

> The problem exists for mirrors too, but is not as marked because mirror 
> reconstruction is inherently simpler.

Resilver time is bounded by the random write performance of
the resilvering device. Mirroring or raidz make no difference.

> I believe Oracle is aware of the problem, but most of the core ZFS team has 
> left. And of course, a fix for Oracle Solaris no longer means a fix for the 
> rest of us.

Some "improvements" were made post-b134 and pre-b148.
 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Faulted SSDs

2010-12-20 Thread Khushil Dep
Check the dmesg and system logs for any output concerning those devices

re-seat one then the other just in case too.
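
If re-seating brings them back, a rough sketch of clearing or re-adding the
cache devices (names taken from your zpool status):

zpool clear zpool c2t3d0      # retry the faulted L2ARC device
zpool clear zpool c2t4d0
# or drop and re-add them; cache devices hold no pool data, so this is safe
zpool remove zpool c2t3d0 c2t4d0
zpool add zpool cache c2t3d0 c2t4d0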

---
W. A. Khushil Dep - khushil@gmail.com -  07905374843

Visit my blog at http://www.khushil.com/






On 20 December 2010 13:10, Paul Piscuc  wrote:

> Hi, this is current setup that I have been doing tests on:
>
> NAMESTATE READ WRITE CKSUM
> zpool   ONLINE   0 0 0
>   raidz1-0  ONLINE   0 0 0
> c2t0d0  ONLINE   0 0 0
> c2t1d0  ONLINE   0 0 0
> c2t2d0  ONLINE   0 0 0
> cache
>   c2t3d0FAULTED  0 0 0  too many errors
>   c2t4d0FAULTED  0 0 0  too many errors
>
> I would like to mention that this box uses Nexenta Community Edition, the
> cache disks are SSDs (ADATA AS599S-64GM-C ), and it is functional for
> about 1 month. The cache disks are mirrored.
> I wouldn't mind a faulted disk, but those two are 64GB SSDs. Could you
> point me in the right direction to see what happened? Or to what generated
> the error?
>
> P.S. The system wasn't under huge loads
>
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>
>
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A few questions

2010-12-20 Thread Joerg Schilling
Phil Harman  wrote:

> Changes to the resilvering implementation don't necessarily require 
> changes to the on disk format (although they could). Of course, there 
> might be an issue moving a pool mid-resilver from one implementation to 
> another.

We seem to be coming to a similar problem as with UFS 20 years ago. At that time,
Sun enhanced the UFS on-disk format but the *BSDs did not follow this change,
even though the format change was "documented" in the related include files.

For future ZFS development, there may be a need to allow one implementation to
implement on-disk versions 1..21 + 24 and another implementation to support
on-disk versions 1..23 + 25.

These thoughts of course are void in case that Oracle continues the OSS 
decisions for Solaris and other Solaris variants can import the code related to
recent enhancements.



Jörg

-- 
 EMail:jo...@schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
   j...@cs.tu-berlin.de(uni)  
   joerg.schill...@fokus.fraunhofer.de (work) Blog: 
http://schily.blogspot.com/
 URL:  http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Faulted SSDs

2010-12-20 Thread Paul Piscuc
Hi, this is current setup that I have been doing tests on:

NAMESTATE READ WRITE CKSUM
zpool   ONLINE   0 0 0
  raidz1-0  ONLINE   0 0 0
c2t0d0  ONLINE   0 0 0
c2t1d0  ONLINE   0 0 0
c2t2d0  ONLINE   0 0 0
cache
  c2t3d0FAULTED  0 0 0  too many errors
  c2t4d0FAULTED  0 0 0  too many errors

I would like to mention that this box uses Nexenta Community Edition, the
cache disks are SSDs (ADATA AS599S-64GM-C ), and it is functional for about
1 month. The cache disks are mirrored.
I wouldn't mind a faulted disk, but those two are 64GB SSDs. Could you point
me in the right direction to see what happened? Or to what generated the
error?

P.S. The system wasn't under huge loads
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A few questions

2010-12-20 Thread Phil Harman

On 20/12/2010 11:29, Lanky Doodle wrote:

>> I believe Oracle is aware of the problem, but most of
>> the core ZFS team has left. And of course, a fix for
>> Oracle Solaris no longer means a fix for the rest of
>> us.
> 
> OK, that is a bit concerning then. As good as ZFS may be, I'm not sure I want
> to commit to a file system that is 'broken' and may not be fully fixed, if at
> all.
> 
> Hmnnn...


My home server is still running snv_82, and my iMac is running Apple's 
last public beta release for Leopard. The way I see it, the on-disk 
format is sound, and the basic "always consistent on disk" promise seems 
to be worth something. My files are read-mostly, and performance isn't 
an issue for me. ZFS has protected my data for several years now in the 
face of various hardware issues. I'll upgrade my NAS appliance to 
OpenSolaris snv_134b sometime soon, but as far as I can tell, I can't 
use Oracle Solaris 11 Express for licensing reasons (I have backups of 
business data). I'll be watching Illumos with interest, but snv_82 has 
served me well for 3 years, so I figure snv_134b probably has quite a 
lot of useful life left in it, and maybe then btrfs will be ready for
prime time?

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A few questions

2010-12-20 Thread Phil Harman

On 20/12/2010 11:03, Deano wrote:

> Hi,
> Which brings up an interesting question...
> 
> IF it were fixed in for example illumos or freebsd is there a plan for how
> to handle possible incompatible zfs implementations?
> 
> Currently the basic version numbering only works as it implies only one
> stream of development; now with multiple possible streams, does ZFS need to
> move to a feature bit system, or are we going to have to have forks or
> multiple incompatible versions?
> 
> Thanks,
> Deano


Changes to the resilvering implementation don't necessarily require 
changes to the on disk format (although they could). Of course, there 
might be an issue moving a pool mid-resilver from one implementation to 
another.


With arguably considerably more ZFS expertise outside Oracle than in, 
there's a good chance the community will get to a fix first. It would 
then be interesting to see whether NIH prevails, or perhaps even a new 
spirit of "share and share alike".


"You may say I'm a dreamer ..."
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] a single nfs file system shared out twice with different permissions

2010-12-20 Thread Darren J Moffat

On 18/12/2010 07:09, Geoff Nordli wrote:

> I am trying to configure a system where I have two different NFS shares
> which point to the same directory.  The idea is if you come in via one path,
> you will have read-only access and can't delete any files, and if you come in
> via the 2nd path, then you will have read/write access.


That sounds very similar to what you would do with Trusted Extensions. 
The read/write label would be a higher classification than the read-only 
one - since you can read down, can't see higher and need to be equal to 
modify.


For more information on Trusted Extensions start with these resources:


Oracle Solaris 11 Express Trusted Extensions Collection

http://docs.sun.com/app/docs/coll/2580.1?l=en

OpenSolaris Security Community pages on TX:

http://hub.opensolaris.org/bin/view/Community+Group+security/tx
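
(As an aside, if per-client rather than per-path control turns out to be
enough, plain NFS share options get most of the way there; hostnames below
are made up:)

zfs set sharenfs='rw=edithost,ro=webhost' tank/data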

--
Darren J Moffat
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A few questions

2010-12-20 Thread Lanky Doodle
> I believe Oracle is aware of the problem, but most of
> the core ZFS team has left. And of course, a fix for
> Oracle Solaris no longer means a fix for the rest of
> us.

OK, that is a bit concerning then. As good as ZFS may be, I'm not sure I want
to commit to a file system that is 'broken' and may not be fully fixed, if at
all.

Hmnnn...
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A few questions

2010-12-20 Thread Deano
Hi,
Which brings up an interesting question... 

IF it were fixed in for example illumos or freebsd is there a plan for how
to handle possible incompatible zfs implementations?

Currently the basic version numbering only works as it implies only one
stream of development; now with multiple possible streams, does ZFS need to
move to a feature bit system, or are we going to have to have forks or
multiple incompatible versions?

Thanks,
Deano

-Original Message-
From: zfs-discuss-boun...@opensolaris.org
[mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Phil Harman
Sent: 20 December 2010 10:43
To: Lanky Doodle
Cc: zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] A few questions

> Why does resilvering take so long in raidz anyway?

Because it's broken. There were some changes a while back that made it more
broken.

There has been a lot of discussion, anecdotes and some data on this list. 

The resilver doesn't do a single pass of the drives, but uses a "smarter"
temporal algorithm based on metadata.

However, the current implementation has difficulty finishing the job if
there's a steady flow of updates to the pool.

As far as I'm aware, the only way to get bounded resilver times is to stop
the workload until resilvering is completed.

The problem exists for mirrors too, but is not as marked because mirror
reconstruction is inherently simpler.

I believe Oracle is aware of the problem, but most of the core ZFS team has
left. And of course, a fix for Oracle Solaris no longer means a fix for the
rest of us.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A few questions

2010-12-20 Thread Phil Harman
> Why does resilvering take so long in raidz anyway?

Because it's broken. There were some changes a while back that made it more 
broken.

There has been a lot of discussion, anecdotes and some data on this list. 

The resilver doesn't do a single pass of the drives, but uses a "smarter" 
temporal algorithm based on metadata.

However, the current implementation has difficulty finishing the job if there's a
steady flow of updates to the pool.

As far as I'm aware, the only way to get bounded resilver times is to stop the 
workload until resilvering is completed.

The problem exists for mirrors too, but is not as marked because mirror 
reconstruction is inherently simpler.

I believe Oracle is aware of the problem, but most of the core ZFS team has 
left. And of course, a fix for Oracle Solaris no longer means a fix for the 
rest of us.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A few questions

2010-12-20 Thread Lanky Doodle
Oh, does anyone know if resilvering efficiency is improved or fixed in Solaris
11 Express, as that is what I'm using?
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A few questions

2010-12-20 Thread Lanky Doodle
Thanks Edward.

I do agree about mirrored rpool (equivalent to Windows OS volume); not doing it 
goes against one of my principles when building enterprise servers.

Is there any argument against using the rpool for all data storage as well as 
being the install volume?

Say for example I chucked 15x 1TB disks in there and created a mirrored rpool 
during installation, using 2 disks. If I added another 6 mirrors (12 disks) to 
it that would give me an rpool of 7TB. The 15th disk being a spare.

Or, say I selected 3 disks during install, does this create a 3 way mirrored 
rpool or does it give you the option of creating raidz? If so, I could then 
create a further 4x 3 drive raidz's, giving me a 10TB rpool.

Or, I could use 2 smaller disks (say 80GB) for the rpool, then create 4x 3 
drive raidz's, giving me an 8TB rpool. Again this gives me a spare disk.

Any of these 3 should keep resilvering times to a minimum, compared to, say, one
big raidz2 of 13 disks.

Why does resilvering take so long in raidz anyway?
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss