RE: ZFS - moving from a zraid1 to zraid2 pool with 1.5tb disks

2011-01-05 Thread Chris Forgeron
Yup, but the second layout (a stripe of two raidz1's) can achieve slightly better 
performance, particularly on a system that has a lot of load. There are a number 
of blog articles that discuss that in more detail than I care to get into here. 
Of course, that's a bit of a moot point, as you're not going to load a 9-drive 
system as heavily as a 48-drive system, but.. 

In that example, the first (raidz2) would be a bit safer, as it can survive any 2 
drives failing. The latter (2 raidz1's) would die if both failing drives are 
within the same raidz1 vdev. 

It all comes down to that final decision on how much risk you want to take 
with your data, what your budget is, and what your performance requirements 
are. 

I'm starting to settle into a stripe of 6 vdevs that are each a 5-disk raidz1, 
with two hot spares kicking about, and a collection of small SSDs adding up to 
either 500 GB or 1 TB of L2ARC. A bit more risk, but I'm also planning on 
having an entirely redundant (yet slower) SAN device that will get a daily ZFS 
send, so my worst nightmare is losing back to yesterday's data - which I can stand. 
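
Roughly, as a sketch only (hypothetical device names: 30 data disks da0-da29, 
two spares da30/da31, the SSDs as ada0/ada1), that layout would be built along 
these lines:

# zpool create tank \
    raidz1 da0 da1 da2 da3 da4 \
    raidz1 da5 da6 da7 da8 da9 \
    raidz1 da10 da11 da12 da13 da14 \
    raidz1 da15 da16 da17 da18 da19 \
    raidz1 da20 da21 da22 da23 da24 \
    raidz1 da25 da26 da27 da28 da29 \
    spare da30 da31
# zpool add tank cache ada0 ada1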

Oh - I am also a fan of buying drives at different times or from different 
suppliers.. I have seen entire 4- and 8-drive arrays fail within a month of the 
first drive going. It's always really fun when you were too slack to handle the 
first drive failure, the second one puts you in a tight spot the next week, and 
then the third one dies while you're madly trying to do data recovery.. :-)

Really, in a big enough array, I like to swap out older drives for newer ones 
every now and then and repurpose the old ones - just to keep the dreaded complete 
failure at bay. Things you learn to do with cheap SATA drives..
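
In ZFS terms that rotation is just a per-disk replace and resilver - a sketch, 
with hypothetical names for the pool and disks:

# zpool replace tank da7 da32
# zpool status tank
(wait for the resilver to finish before swapping the next disk)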


-Original Message-
From: owner-freebsd-sta...@freebsd.org 
[mailto:owner-freebsd-sta...@freebsd.org] On Behalf Of Damien Fleuriot
Sent: Wednesday, January 05, 2011 5:55 PM
To: Chris Forgeron
Cc: freebsd-stable@freebsd.org
Subject: Re: ZFS - moving from a zraid1 to zraid2 pool with 1.5tb disks

Well actually...

raidz2 (9 drives):
- 7x 1.5 TB of data = 10.5 TB
- 2 parity drives

2x raidz1 (4 + 5 drives):
- 3x 1.5 TB of data = 4.5 TB
- 4x 1.5 TB of data = 6 TB, total 10.5 TB
- 2 parity drives, split across the two raidz1 arrays

So really, in both cases 2 parity drives and the same storage...

---
Fleuriot Damien

On 5 Jan 2011, at 16:55, Chris Forgeron  wrote:

> First off, raidz2 and raidz1 with copies=2 are not the same thing. 
> 
> raidz2 will give you two copies of parity instead of just one. It also 
> guarantees that this parity is on different drives. You can sustain 2 drive 
> failures without data loss. 
> 
> raidz1 with copies=2 will give you two copies of all your files, but there is 
> no guarantee that they are on different drives, and you can still only 
> sustain 1 drive failure.
> 
> You'll have better space efficiency with raidz2 if you're using 9 drives. 
> 
> If I were you, I'd use your 9 disks as one big raidz, or better yet, get 10 
> disks and make a stripe of two 5-disk raidz1's for the best performance. 
> 
> Save your SSD for the L2ARC (cache) or ZIL; you'll get better speed that way 
> than by throwing it away on a boot drive. 
> 
> --
> 
> 
> -Original Message-
> From: owner-freebsd-sta...@freebsd.org 
> [mailto:owner-freebsd-sta...@freebsd.org] On Behalf Of Damien Fleuriot
> Sent: January-05-11 5:01 AM
> To: Damien Fleuriot
> Cc: freebsd-stable@freebsd.org
> Subject: Re: ZFS - moving from a zraid1 to zraid2 pool with 1.5tb disks
> 
> Hi again List,
> 
> I'm not so sure about using raidz2 anymore; I'm concerned about the performance.
> 
> Basically I have 9x 1.5T sata drives.
> 
> raidz2 and 2x raidz1 will provide the same capacity.
> 
> Are there any cons against using 2x raidz1 instead of 1x raidz2 ?
> 
> I plan on using an SSD for the OS, 40-64 GB, with 15 GB for the system 
> itself and some spare.
> 
> Is it worth using the free space for cache ? ZIL ? both ?
> 
> @jean-yves : didn't you experience problems recently when using both ?
> 
> ---
> Fleuriot Damien
> 
> On 3 Jan 2011, at 16:08, Damien Fleuriot  wrote:
> 
>> 
>> 
>> On 1/3/11 2:17 PM, Ivan Voras wrote:
>>> On 12/30/10 12:40, Damien Fleuriot wrote:
>>> 
 I am concerned that in the event a drive fails, I won't be able to 
 repair the disks in time before another actually fails.
>>> 
>>> An old trick to avoid that is to buy drives from different series or 
>>> manufacturers (the theory is that identical drives tend to fail at 
>>> the same time), but this may not be applicable if you have 5 drives 
>>> in a volume :) Still, you can try playing with RAIDZ levels and 
>>> probabilities.
>>> 
>> 
>> That's sound advice, although one also hears that they should get 
>> devices from the same vendor for maximum compatibility -.-
>> 
>> 
>> Ah well, next time ;)
>> 
>> 
>> A piece of advice I shall heed though is using 1% less capacity than 
>> what the disks really provide, in case one day I have to swap a drive 
>> and its replacement is a few kbytes smaller (thus preventing a rebuild).

Re: ZFS - moving from a zraid1 to zraid2 pool with 1.5tb disks

2011-01-05 Thread Artem Belevich
On Wed, Jan 5, 2011 at 1:55 PM, Damien Fleuriot  wrote:
> Well actually...
>
> raidz2 (9 drives):
> - 7x 1.5 TB of data = 10.5 TB
> - 2 parity drives
>
> 2x raidz1 (4 + 5 drives):
> - 3x 1.5 TB of data = 4.5 TB
> - 4x 1.5 TB of data = 6 TB, total 10.5 TB
> - 2 parity drives, split across the two raidz1 arrays
>
> So really, in both cases 2 parity drives and the same storage...

In the second case you get better performance, but lose some data
protection. It's still raidz1, and you can't guarantee survival in
all cases of two drives failing: if two drives fail in the same vdev,
your entire pool will be gone. Granted, it's better than a single-vdev
raidz1, but it's *not* as good as raidz2.
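
For illustration only - a sketch assuming the nine drives show up as da0-da8 
(hypothetical names) - the two layouts being compared would be created roughly 
like this:

# zpool create tank raidz2 da0 da1 da2 da3 da4 da5 da6 da7 da8
  (one raidz2 vdev: any 2 of the 9 drives may fail)

# zpool create tank raidz1 da0 da1 da2 da3 raidz1 da4 da5 da6 da7 da8
  (two raidz1 vdevs striped: at most 1 failure per vdev is survivable)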

--Artem


Re: ZFS - moving from a zraid1 to zraid2 pool with 1.5tb disks

2011-01-05 Thread Damien Fleuriot
Well actually...

raidz2 (9 drives):
- 7x 1.5 TB of data = 10.5 TB
- 2 parity drives

2x raidz1 (4 + 5 drives):
- 3x 1.5 TB of data = 4.5 TB
- 4x 1.5 TB of data = 6 TB, total 10.5 TB
- 2 parity drives, split across the two raidz1 arrays

So really, in both cases 2 parity drives and the same storage...

---
Fleuriot Damien

On 5 Jan 2011, at 16:55, Chris Forgeron  wrote:

> First off, raidz2 and raidz1 with copies=2 are not the same thing. 
> 
> raidz2 will give you two copies of parity instead of just one. It also 
> guarantees that this parity is on different drives. You can sustain 2 drive 
> failures without data loss. 
> 
> raidz1 with copies=2 will give you two copies of all your files, but there is 
> no guarantee that they are on different drives, and you can still only 
> sustain 1 drive failure.
> 
> You'll have better space efficiency with raidz2 if you're using 9 drives. 
> 
> If I were you, I'd use your 9 disks as one big raidz, or better yet, get 10 
> disks and make a stripe of two 5-disk raidz1's for the best performance. 
> 
> Save your SSD for the L2ARC (cache) or ZIL; you'll get better speed that way 
> than by throwing it away on a boot drive. 
> 
> --
> 
> 
> -Original Message-
> From: owner-freebsd-sta...@freebsd.org 
> [mailto:owner-freebsd-sta...@freebsd.org] On Behalf Of Damien Fleuriot
> Sent: January-05-11 5:01 AM
> To: Damien Fleuriot
> Cc: freebsd-stable@freebsd.org
> Subject: Re: ZFS - moving from a zraid1 to zraid2 pool with 1.5tb disks
> 
> Hi again List,
> 
> I'm not so sure about using raidz2 anymore; I'm concerned about the performance.
> 
> Basically I have 9x 1.5T sata drives.
> 
> raidz2 and 2x raidz1 will provide the same capacity.
> 
> Are there any cons against using 2x raidz1 instead of 1x raidz2 ?
> 
> I plan on using an SSD for the OS, 40-64 GB, with 15 GB for the system 
> itself and some spare.
> 
> Is it worth using the free space for cache ? ZIL ? both ?
> 
> @jean-yves : didn't you experience problems recently when using both ?
> 
> ---
> Fleuriot Damien
> 
> On 3 Jan 2011, at 16:08, Damien Fleuriot  wrote:
> 
>> 
>> 
>> On 1/3/11 2:17 PM, Ivan Voras wrote:
>>> On 12/30/10 12:40, Damien Fleuriot wrote:
>>> 
 I am concerned that in the event a drive fails, I won't be able to 
 repair the disks in time before another actually fails.
>>> 
>>> An old trick to avoid that is to buy drives from different series or 
>>> manufacturers (the theory is that identical drives tend to fail at 
>>> the same time), but this may not be applicable if you have 5 drives 
>>> in a volume :) Still, you can try playing with RAIDZ levels and 
>>> probabilities.
>>> 
>> 
>> That's sound advice, although one also hears that they should get 
>> devices from the same vendor for maximum compatibility -.-
>> 
>> 
>> Ah well, next time ;)
>> 
>> 
>> A piece of advice I shall heed though is using 1% less capacity than 
>> what the disks really provide, in case one day I have to swap a drive 
>> and its replacement is a few kbytes smaller (thus preventing a rebuild).


[releng_8 tinderbox] failure on i386/pc98

2011-01-05 Thread FreeBSD Tinderbox
TB --- 2011-01-05 20:39:14 - tinderbox 2.6 running on freebsd-stable.sentex.ca
TB --- 2011-01-05 20:39:14 - starting RELENG_8 tinderbox run for i386/pc98
TB --- 2011-01-05 20:39:14 - cleaning the object tree
TB --- 2011-01-05 20:39:38 - cvsupping the source tree
TB --- 2011-01-05 20:39:38 - /usr/bin/csup -z -r 3 -g -L 1 -h cvsup.sentex.ca 
/tinderbox/RELENG_8/i386/pc98/supfile
TB --- 2011-01-05 20:40:22 - building world
TB --- 2011-01-05 20:40:22 - MAKEOBJDIRPREFIX=/obj
TB --- 2011-01-05 20:40:22 - PATH=/usr/bin:/usr/sbin:/bin:/sbin
TB --- 2011-01-05 20:40:22 - TARGET=pc98
TB --- 2011-01-05 20:40:22 - TARGET_ARCH=i386
TB --- 2011-01-05 20:40:22 - TZ=UTC
TB --- 2011-01-05 20:40:22 - __MAKE_CONF=/dev/null
TB --- 2011-01-05 20:40:22 - cd /src
TB --- 2011-01-05 20:40:22 - /usr/bin/make -B buildworld
>>> World build started on Wed Jan  5 20:40:23 UTC 2011
>>> Rebuilding the temporary build tree
>>> stage 1.1: legacy release compatibility shims
>>> stage 1.2: bootstrap tools
>>> stage 2.1: cleaning up the object tree
>>> stage 2.2: rebuilding the object tree
>>> stage 2.3: build tools
>>> stage 3: cross tools
>>> stage 4.1: building includes
>>> stage 4.2: building libraries
>>> stage 4.3: make dependencies
>>> stage 4.4: building everything
[...]
gzip -cn /src/usr.bin/ncplogin/ncplogout.1 > ncplogout.1.gz
===> usr.bin/netstat (all)
cc -O2 -pipe  -fno-strict-aliasing -DIPSEC -DSCTP -DINET6 -DNETGRAPH -DIPX 
-std=gnu99 -fstack-protector -Wsystem-headers -Werror -Wall -Wno-format-y2k -W 
-Wno-unused-parameter -Wstrict-prototypes -Wmissing-prototypes -Wpointer-arith 
-Wno-uninitialized -Wno-pointer-sign -c /src/usr.bin/netstat/if.c
cc -O2 -pipe  -fno-strict-aliasing -DIPSEC -DSCTP -DINET6 -DNETGRAPH -DIPX 
-std=gnu99 -fstack-protector -Wsystem-headers -Werror -Wall -Wno-format-y2k -W 
-Wno-unused-parameter -Wstrict-prototypes -Wmissing-prototypes -Wpointer-arith 
-Wno-uninitialized -Wno-pointer-sign -c /src/usr.bin/netstat/inet.c
cc1: warnings being treated as errors
/src/usr.bin/netstat/inet.c: In function 'protopr':
/src/usr.bin/netstat/inet.c:463: warning: format '%6u' expects type 'unsigned 
int', but argument 2 has type 'uint64_t'
/src/usr.bin/netstat/inet.c:463: warning: format '%6u' expects type 'unsigned 
int', but argument 3 has type 'uint64_t'
*** Error code 1

Stop in /src/usr.bin/netstat.
*** Error code 1

Stop in /src/usr.bin.
*** Error code 1

Stop in /src.
*** Error code 1

Stop in /src.
*** Error code 1

Stop in /src.
TB --- 2011-01-05 21:37:40 - WARNING: /usr/bin/make returned exit code  1 
TB --- 2011-01-05 21:37:40 - ERROR: failed to build world
TB --- 2011-01-05 21:37:40 - 2501.15 user 373.52 system 3506.43 real


http://tinderbox.freebsd.org/tinderbox-releng_8-RELENG_8-i386-pc98.full


[releng_8 tinderbox] failure on i386/i386

2011-01-05 Thread FreeBSD Tinderbox
TB --- 2011-01-05 20:34:44 - tinderbox 2.6 running on freebsd-stable.sentex.ca
TB --- 2011-01-05 20:34:44 - starting RELENG_8 tinderbox run for i386/i386
TB --- 2011-01-05 20:34:44 - cleaning the object tree
TB --- 2011-01-05 20:35:15 - cvsupping the source tree
TB --- 2011-01-05 20:35:15 - /usr/bin/csup -z -r 3 -g -L 1 -h cvsup.sentex.ca 
/tinderbox/RELENG_8/i386/i386/supfile
TB --- 2011-01-05 20:36:12 - building world
TB --- 2011-01-05 20:36:12 - MAKEOBJDIRPREFIX=/obj
TB --- 2011-01-05 20:36:12 - PATH=/usr/bin:/usr/sbin:/bin:/sbin
TB --- 2011-01-05 20:36:12 - TARGET=i386
TB --- 2011-01-05 20:36:12 - TARGET_ARCH=i386
TB --- 2011-01-05 20:36:12 - TZ=UTC
TB --- 2011-01-05 20:36:12 - __MAKE_CONF=/dev/null
TB --- 2011-01-05 20:36:12 - cd /src
TB --- 2011-01-05 20:36:12 - /usr/bin/make -B buildworld
>>> World build started on Wed Jan  5 20:36:13 UTC 2011
>>> Rebuilding the temporary build tree
>>> stage 1.1: legacy release compatibility shims
>>> stage 1.2: bootstrap tools
>>> stage 2.1: cleaning up the object tree
>>> stage 2.2: rebuilding the object tree
>>> stage 2.3: build tools
>>> stage 3: cross tools
>>> stage 4.1: building includes
>>> stage 4.2: building libraries
>>> stage 4.3: make dependencies
>>> stage 4.4: building everything
[...]
gzip -cn /src/usr.bin/ncplogin/ncplogout.1 > ncplogout.1.gz
===> usr.bin/netstat (all)
cc -O2 -pipe  -fno-strict-aliasing -DIPSEC -DSCTP -DINET6 -DNETGRAPH -DIPX 
-std=gnu99 -fstack-protector -Wsystem-headers -Werror -Wall -Wno-format-y2k -W 
-Wno-unused-parameter -Wstrict-prototypes -Wmissing-prototypes -Wpointer-arith 
-Wno-uninitialized -Wno-pointer-sign -c /src/usr.bin/netstat/if.c
cc -O2 -pipe  -fno-strict-aliasing -DIPSEC -DSCTP -DINET6 -DNETGRAPH -DIPX 
-std=gnu99 -fstack-protector -Wsystem-headers -Werror -Wall -Wno-format-y2k -W 
-Wno-unused-parameter -Wstrict-prototypes -Wmissing-prototypes -Wpointer-arith 
-Wno-uninitialized -Wno-pointer-sign -c /src/usr.bin/netstat/inet.c
cc1: warnings being treated as errors
/src/usr.bin/netstat/inet.c: In function 'protopr':
/src/usr.bin/netstat/inet.c:463: warning: format '%6u' expects type 'unsigned 
int', but argument 2 has type 'uint64_t'
/src/usr.bin/netstat/inet.c:463: warning: format '%6u' expects type 'unsigned 
int', but argument 3 has type 'uint64_t'
*** Error code 1

Stop in /src/usr.bin/netstat.
*** Error code 1

Stop in /src/usr.bin.
*** Error code 1

Stop in /src.
*** Error code 1

Stop in /src.
*** Error code 1

Stop in /src.
TB --- 2011-01-05 21:34:25 - WARNING: /usr/bin/make returned exit code  1 
TB --- 2011-01-05 21:34:25 - ERROR: failed to build world
TB --- 2011-01-05 21:34:25 - 2527.38 user 365.25 system 3581.07 real


http://tinderbox.freebsd.org/tinderbox-releng_8-RELENG_8-i386-i386.full


Re: gstripe/gpart problems.

2011-01-05 Thread Clifton Royston
On Wed, Jan 05, 2011 at 11:36:59AM +0200, Daniel Braniss wrote:
> Hi Clifton,
> I was getting very frustrated yesterday, hence the cryptic message; your
> response requires some background :-)
> The box is a Sun Fire X2200, which has bays for 2 disks (we have several of
> these). Before the latest upgrade, the 2 disks were 'raided' via 'nVidia
> MediaShield' and appeared as ar0. When I upgraded to 8.2, it disappeared,
> since I had ATA_CAM in the kernel config file. So I started fiddling with
> gstripe, which 'recovered' the data.
> Next, since the kernel boot kept complaining about GEOM errors (and not
> wanting to mislead the operators), I cleaned up the data and started from
> scratch.
> The machine boots diskless, but I like to keep a bootable root partition just
> in case.
> The process was the same in every case: first the stripe, then gpart the
> stripe.

  Thanks, that makes it very clear why things are as they are.  Good to know
that the booting issues are covered via diskless boot.

  I had never thought about being able to recover a RAID stripe using
gstripe.  That's a very interesting capability!

  Assuming that FreeBSD considers partitioning a stripe to be valid in
principle - and you give reasons it should - then there may be a geom/driver
interaction bug to investigate here if the geom layer is refusing to write a
stripe-oriented partition to the raw drive.

  -- Clifton

-- 
Clifton Royston  --  clift...@iandicomputing.com / clift...@lava.net
   President  - I and I Computing * http://www.iandicomputing.com/
 Custom programming, network design, systems and network consulting services


RE: ZFS - moving from a zraid1 to zraid2 pool with 1.5tb disks

2011-01-05 Thread Chris Forgeron
First off, raidz2 and raidz1 with copies=2 are not the same thing. 

raidz2 will give you two copies of parity instead of just one. It also 
guarantees that this parity is on different drives. You can sustain 2 drive 
failures without data loss. 

raidz1 with copies=2 will give you two copies of all your files, but there is 
no guarantee that they are on different drives, and you can still only sustain 
1 drive failure.

You'll have better space efficiency with raidz2 if you're using 9 drives. 

If I were you, I'd use your 9 disks as one big raidz, or better yet, get 10 
disks and make a stripe of two 5-disk raidz1's for the best performance. 

Save your SSD for the L2ARC (cache) or ZIL; you'll get better speed that way 
than by throwing it away on a boot drive. 
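
As a sketch only (assuming the pool is called tank and the SSD partitions left 
over after the OS are ada2p2 and ada2p3 - hypothetical names), adding them 
later is a one-liner each:

# zpool add tank cache ada2p2
# zpool add tank log ada2p3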

--


-Original Message-
From: owner-freebsd-sta...@freebsd.org 
[mailto:owner-freebsd-sta...@freebsd.org] On Behalf Of Damien Fleuriot
Sent: January-05-11 5:01 AM
To: Damien Fleuriot
Cc: freebsd-stable@freebsd.org
Subject: Re: ZFS - moving from a zraid1 to zraid2 pool with 1.5tb disks

Hi again List,

I'm not so sure about using raidz2 anymore; I'm concerned about the performance.

Basically I have 9x 1.5T sata drives.

raidz2 and 2x raidz1 will provide the same capacity.

Are there any cons against using 2x raidz1 instead of 1x raidz2 ?

I plan on using an SSD for the OS, 40-64 GB, with 15 GB for the system itself 
and some spare.

Is it worth using the free space for cache ? ZIL ? both ?

@jean-yves : didn't you experience problems recently when using both ?

---
Fleuriot Damien

On 3 Jan 2011, at 16:08, Damien Fleuriot  wrote:

> 
> 
> On 1/3/11 2:17 PM, Ivan Voras wrote:
>> On 12/30/10 12:40, Damien Fleuriot wrote:
>> 
>>> I am concerned that in the event a drive fails, I won't be able to 
>>> repair the disks in time before another actually fails.
>> 
>> An old trick to avoid that is to buy drives from different series or 
>> manufacturers (the theory is that identical drives tend to fail at 
>> the same time), but this may not be applicable if you have 5 drives 
>> in a volume :) Still, you can try playing with RAIDZ levels and 
>> probabilities.
>> 
> 
> That's sound advice, although one also hears that they should get 
> devices from the same vendor for maximum compatibility -.-
> 
> 
> Ah well, next time ;)
> 
> 
> A piece of advice I shall heed though is using 1% less capacity than 
> what the disks really provide, in case one day I have to swap a drive 
> and its replacement is a few kbytes smaller (thus preventing a rebuild).


Re: NFSv4 - how to set up at FreeBSD 8.1 ?

2011-01-05 Thread Rick Macklem
> Yes, to access the file volumes via any version of NFS, they need to
> be exported. (I don't think it would make sense to allow access to all
> of the server's data without limitations for NFSv4?)
> 
> What is different (and makes it confusing for folks familiar with
> NFSv2,3)
> is the fact that it is a single "mount tree" for NFSv4 that has to be
> rooted
> somewhere.
> Solaris10 - always roots it at "/" but somehow works around the ZFS case,
>   so any exported share can be mounted with the same path used by NFSv2,3.
> Linux - Last I looked (which was a couple of years ago), it exported a single
>   volume for NFSv4 and the rest of the server's volumes could only be
>   accessed via NFSv2,3. (I don't know if they've changed this yet?)
> 
> So, I chose to allow a little more flexibility than Solaris10 and
> allow
> /etc/exports to set the location of the "mount root". I didn't
> anticipate
> the "glitch" that ZFS introduced (where all ZFS volumes in the mount
> path
> must be exported for mount to work) because it does its own exporting.
> (Obviously, the glitch/inconsistency needs to be resolved at some
> point.)
> 
Perhaps it would help to show what goes on the wire when a mount is done.
# mount -t nfs -o nfsv4 server:/usr/home /mnt

For NFSv2,3 there will be a Mount RPC with /usr/home as the argument. This
goes directly to mountd and then mountd decides if it is ok and replies with
the file handle (FH) for /usr/home if it is.

For NFSv4, the client will do a compound RPC that looks something like this:
(The exact structure is up to the client implementor.)

PutRootFH <-- set the position to the "root mount location" as specified by the V4: line
Lookup usr
Lookup home
GetFH <-- return the file handle at this location

As such, there can only be one "root mount location" and at least Lookup
operations must work for all elements of the path from there to the client's
mount point. (For non-ZFS, it currently allows Lookup plus a couple of others 
that
some clients use during mounting to work for non-exported file systems, so that
setting "root mount location" == "/" works without exporting the entire file
server's tree.)

For all other operations, the file system must be exported just like for 
NFSv2,3.

Hope this helps, rick


Re: NFSv4 - how to set up at FreeBSD 8.1 ?

2011-01-05 Thread Rick Macklem
> Hi
> 
> On 5 January 2011 12:09, Rick Macklem  wrote:
> 
> > You can also do the following:
> > For /etc/exports
> > V4: /
> > /usr/home -maproot=root -network 192.168.183.0 -mask 255.255.255.0
> >
> > Then mount:
> > # mount_nfs -o nfsv4 192.168.183.131:/usr/home /marek_nfs4/
> > (But only if the file system for "/" is ufs and not zfs and,
> > admittedly
> > there was a debate that has to be continued someday that might make
> > it
> > necessary to export "/" as well for ufs like zfs requires.)
> >
> > rick
> > ps: And some NFSv4 clients can cross server mount points, unlike
> > NFSv2, 3.
> >
> 
> I've done that (exporting V4: /)
> 
> but then when I mount a sub ZFS filesystem (e.g. /pool/backup/sites/m)
> it appears empty on the client.
> 
> If I export /pool/backup/sites/m, then I see the content of the
> directory.
> 
> Most of the sub-directories in /pool are actually mounted ZFS file systems.
> 
> It is something I expected with NFSv3 .. but not with NFSv4.
> 
Yes, to access the file volumes via any version of NFS, they need to
be exported. (I don't think it would make sense to allow access to all
of the server's data without limitations for NFSv4?)

What is different (and makes it confusing for folks familiar with NFSv2,3)
is the fact that it is a single "mount tree" for NFSv4 that has to be rooted
somewhere.
Solaris10 - always roots it at "/" but somehow works around the ZFS case,
  so any exported share can be mounted with the same path used by NFSv2,3.
Linux - Last I looked (which was a couple of years ago), it exported a single
  volume for NFSv4 and the rest of the server's volumes could only be
  accessed via NFSv2,3. (I don't know if they've changed this yet?)

So, I chose to allow a little more flexibility than Solaris10 and allow
/etc/exports to set the location of the "mount root". I didn't anticipate
the "glitch" that ZFS introduced (where all ZFS volumes in the mount path
must be exported for mount to work) because it does its own exporting.
(Obviously, the glitch/inconsistency needs to be resolved at some point.)
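
To make the glitch concrete with a sketch (dataset names taken from the
/pool/backup/sites/m example above; the flags are illustrative, not a tested
config): with ZFS, every dataset along the NFSv4 path currently has to be
exported (via /etc/exports or the sharenfs property) for the client mount to
succeed - in /etc/exports terms, something like:

/pool                 -network 192.168.183.0 -mask 255.255.255.0
/pool/backup          -network 192.168.183.0 -mask 255.255.255.0
/pool/backup/sites    -network 192.168.183.0 -mask 255.255.255.0
/pool/backup/sites/m  -network 192.168.183.0 -mask 255.255.255.0
V4: /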

rick


Re: NFSv4 - how to set up at FreeBSD 8.1 ?

2011-01-05 Thread Rick Macklem
> Rick Macklem  wrote:
> 
> > ... one of the fundamental principals for NFSv2, 3 was a stateless
> > server ...
> 
> Only as long as UDP transport was used. Any NFS implementation that
> used TCP for transport had thereby abandoned the stateless server
> principle, since a TCP connection itself requires that state be
> maintained on both ends.
> 
You've seen the responses w.r.t. what stateless server referred to
already. But you might find this "quirk" interesting...

Early in the NFSv4 design, I suggested that handling of the server
state (opens/locks/...) might be tied to the TCP connections (NFSv4
doesn't use UDP). Let's just say the idea "flew like a lead balloon" -
lots of responses along the lines of "NFS should be separate from the
RPC transport layer", etc.

Then, several years later, they came up with Sessions for NFSv4.1, which
does RPC transport management in a very NFSv4.1-specific way, including
stuff like changing the RPC semantics to exactly-once... Although very
different from what I had envisioned, in a sense Sessions does tie state
handling (ordering of locking operations, for example) to RPC transport
management as I currently understand it.

rick


Re: NFSv4 - how to set up at FreeBSD 8.1 ?

2011-01-05 Thread Rick Macklem
> On Wednesday, January 05, 2011 5:55:53 am per...@pluto.rain.com wrote:
> > Rick Macklem  wrote:
> >
> > > ... one of the fundamental principals for NFSv2, 3 was a stateless
> > > server ...
> >
> > Only as long as UDP transport was used. Any NFS implementation that
> > used TCP for transport had thereby abandoned the stateless server
> > principle, since a TCP connection itself requires that state be
> > maintained on both ends.
> 
> Not filesystem cache coherency state, only socket state. And even NFS
> UDP
> mounts maintain their own set of "socket" state to manage retries and
> retransmits for UDP RPCs. The filesystem is equally incoherent for
> both UDP
> and TCP NFSv[23] mounts. TCP did not change any of that.
> 
Unfortunately even NFSv4 doesn't maintain cache coherency in general. The state
it maintains/recovers after a server crash consists of opens/locks/delegations,
but the opens are Windows-like open share locks (can't remember the Windows/Samba
term for them) and not POSIX-like opens. NFSv4 does tie cache coherency to
file locking, so that clients will get a coherent view of file data for byte
ranges they lock.

The term stateless server refers to the fact that the server doesn't know 
anything
about the file handling state in the client that needs to be recovered after
a server crash (opens, locks, ...). When an NFSv2,3 server is rebooted, it
normally knows nothing about what clients are mounted, what clients have files
open, etc and just services RPCs as they come in. The design avoided the
complexity of recovery after a crash but results in a non-POSIX compliant
file system that can't do a good job of cache coherency, knows nothing about
file locks, etc. (Sun did add a separate file locking protocol called the
NLM or rpc.lockd if you prefer, but that protocol design was fundamentally
flawed imho and, as such, using it is in the "your mileage may vary" category.)

Further, since retries of non-idempotent RPCs would cause weird failures if the
server kept no information about previous operations, "soft state" in the form of
a cache of recent RPCs (typically called the Duplicate Request Cache or DRC
these days) was added, to avoid performing a non-idempotent operation
twice.
crash/reboot but some vendors with non-volatile RAM hardware may choose to
do so in order to provide "closer to correct" behaviour after a server
crash/reboot.

rick


Re: NFSv4 - how to set up at FreeBSD 8.1 ?

2011-01-05 Thread Rick Macklem
> > You can also do the following:
> > For /etc/exports
> > V4: /
> > /usr/home -maproot=root -network 192.168.183.0 -mask 255.255.255.0
> 
> Not in my configuration - '/' and '/usr' are different partitions
> (both UFS)
> 
Hmm. Since entire volumes are exported for NFSv4, I can't remember if
exporting a subtree of the volume works (I think it does, but??).

However, I do know that if you change the /etc/exports for the above to:

/usr -maproot=root -network 192.168.183.0 -mask 255.255.255.0
V4: /

(Note that I also moved the V4: line to the end, because at one time it was
required to be at the end and I can't remember if that restriction is still
enforced. Always check /var/log/messages after starting mountd with a modified
/etc/exports and look for any messages related to problems with it. In other
words, exporting the volume's mount point and putting the V4: line at the end
are changes that "might be required?". If you take a look at mountd.c, you'll
understand why I have trouble remembering exactly what works and what doesn't. :-)

then for the above situation:
# mount -t nfs -o nfsv4 server:/ /mnt
- will fail because "/" isn't exported
however
# mount -t nfs -o nfsv4 server:/usr /mnt
- should work. If it doesn't work, it is not because /etc/exports is
  wrong.

A small number of NFSv4 ops are allowed on non-exported UFS partitions
so that "mount" can traverse the tree down to the mount point, but that
mount point must be exported. When I did this I did not realize that ZFS
did its own exporting and, as such, traversal of non-exported ZFS volumes
doesn't work, because ZFS doesn't allow any operations on non-exported
volumes.

At some point, there needs to be a debate w.r.t. inconsistent behaviour.
The easiest fix is to disable the capability of traversal of non-exported
UFS volumes. The downside of this is that it is harder to configure the
single (sub)tree on the server that is needed for NFSv4.

Have fun with it, rick


Re: NFSv4 - how to set up at FreeBSD 8.1 ?

2011-01-05 Thread John Baldwin
On Wednesday, January 05, 2011 5:55:53 am per...@pluto.rain.com wrote:
> Rick Macklem  wrote:
> 
> > ... one of the fundamental principals for NFSv2, 3 was a stateless
> > server ...
> 
> Only as long as UDP transport was used.  Any NFS implementation that
> used TCP for transport had thereby abandoned the stateless server
> principle, since a TCP connection itself requires that state be
> maintained on both ends.

Not filesystem cache coherency state, only socket state.  And even NFS UDP 
mounts maintain their own set of "socket" state to manage retries and 
retransmits for UDP RPCs.  The filesystem is equally incoherent for both UDP 
and TCP NFSv[23] mounts.  TCP did not change any of that.

-- 
John Baldwin


Re: NFSv4 - how to set up at FreeBSD 8.1 ?

2011-01-05 Thread Marek Salwerowicz

You can also do the following:
For /etc/exports
V4: /
/usr/home -maproot=root -network 192.168.183.0 -mask 255.255.255.0


Not in my configuration - '/' and '/usr' are different partitions (both UFS)

--
Marek Salwerowicz



Re: NFSv4 - how to set up at FreeBSD 8.1 ?

2011-01-05 Thread perryh
Rick Macklem  wrote:

> ... one of the fundamental principals for NFSv2, 3 was a stateless
> server ...

Only as long as UDP transport was used.  Any NFS implementation that
used TCP for transport had thereby abandoned the stateless server
principle, since a TCP connection itself requires that state be
maintained on both ends.


Re: gstripe/gpart problems.

2011-01-05 Thread Daniel Braniss
> On Tue, Jan 04, 2011 at 04:21:31PM +0200, Daniel Braniss wrote:
> > Hi,
> > I have 2 ada disks striped:
> > 
> > # gstripe list
> > Geom name: s1
> > State: UP
> > Status: Total=2, Online=2
> > Type: AUTOMATIC
> > Stripesize: 65536
> > ID: 2442772675
> > Providers:
> > 1. Name: stripe/s1
> >Mediasize: 1000215674880 (932G)
> >Sectorsize: 512
> >Stripesize: 65536
> >Stripeoffset: 0
> >Mode: r0w0e0
> > Consumers:
> > 1. Name: ada0
> >Mediasize: 500107862016 (466G)
> >Sectorsize: 512
> >Mode: r0w0e0
> >Number: 0
> > 2. Name: ada1
> >Mediasize: 500107862016 (466G)
> >Sectorsize: 512
> >Mode: r0w0e0
> >Number: 1
> > 
> > boot complains:
> > 
> > GEOM_STRIPE: Device s1 created (id=2442772675).
> > GEOM_STRIPE: Disk ada0 attached to s1.
> > GEOM: ada0: corrupt or invalid GPT detected.
> > GEOM: ada0: GPT rejected -- may not be recoverable.
> > GEOM_STRIPE: Disk ada1 attached to s1.
> > GEOM_STRIPE: Device s1 activated.
> > 
> > # gpart show
> > =>34  1953546173  stripe/s1  GPT  (932G)
> >   34 128  1  freebsd-boot  (64K)
> >  162  1953546045 - free -  (932G)
> > # gpart show
> > =>34  1953546173  stripe/s1  GPT  (932G)
> >   34 128  1  freebsd-boot  (64K)
> >  162  1953546045 - free -  (932G)
> > 
> > # gpart add -t freebsd-ufs -s 20g stripe/s1
> > GEOM: ada0: corrupt or invalid GPT detected.
> > GEOM: ada0: GPT rejected -- may not be recoverable.
> > stripe/s1p2 added
> > # gpart show
> > =>34  1953546173  stripe/s1  GPT  (932G)
> >   34 128  1  freebsd-boot  (64K)
> >  16241943040  2  freebsd-ufs  (20G)
> > 41943202  1911603005 - free -  (912G)
> > 
> > if I go the MBR road, all seems ok, but as soon as I try to write
> > the boot block (boot0cfg -B /dev/stripe/s1) again the kernel
> > starts to complain about corrupted GEOM too.
> 
> So are you trying to partition the drives and then stripe the
> partitions within the drives, or are you trying to partition the
> stripe?
> 
> It seems here as though you might be trying to first partition the
> drives (not clear on that) then stripe the whole drives - which will
> mean the partition info is wrong for the resulting striped drive set -
> and then repartition the striped drive set, and neither is ending up
> valid.
> 
> If what you are intending is to partition after striping the raw
> drives, then you are doing the right steps, but when the geom layer
> tries to look at the info on the individual drives as at boot, it will
> find it invalid.  If the gpart layer is actually refusing to write
> partition info to the drives that is wrong for the drives taken
> individually, that would account for your problems.
> 
> One valid order to do things in would be partition the drives with
> gpart, creating identical sets of partitions on both drives, then
> stripe the partitions created within them (syntax not exact):
>  
> gpart add -t freebsd-ufs0 -s 10g ada0
> gpart add -t freebsd-ufs1 -s 10g ada1
> gstripe label freebsd-ufs freebsd-ufs0 freebsd-ufs1
> 
> That would give you a 20GB stripe, with valid partition info on each
> drive.
> 
> If this will be your boot drive, depending on how much needs to be read
> from the drive before the geom_stripe kernel module gets loaded, I
> would think there could also be a problem booting from the drive.  This
> is not like gmirroring two drives or partitions, where the info read
> from either disk early in boot will be identical, and identical (except
> for the last block of the partition) to what the OS sees later after
> the mirror is formed.
> 
> I assume you're bearing in mind that if you lose either drive to a
> hardware fault you lose the whole thing, and consider the risk worth
> the potential speed/size gain.
>   -- Clifton 

Hi Clifton,
I was getting very frustrated yesterday, hence the cryptic message; your
response requires some background :-)
The box is a Sun Fire X2200, which has bays for 2 disks (we have several of
these). Before the latest upgrade, the 2 disks were 'raided' via 'nVidia
MediaShield' and appeared as ar0. When I upgraded to 8.2, it disappeared,
since I had ATA_CAM in the kernel config file. So I started fiddling with
gstripe, which 'recovered' the data.
Next, since the kernel boot kept complaining about GEOM errors (and not
wanting to mislead the operators), I cleaned up the data and started from
scratch.
The machine boots diskless, but I like to keep a bootable root partition just
in case.
The process was the same in every case: first the stripe, then gpart the
stripe.
BTW, I know that if I lose a disk I lose everything, and that I won't be able
to boot from it (I can always boot via the net and mount the root locally, or
some other combination - USB, etc.).
But you gave me some ideas and will start experimenting soon.

thanks,
danny
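
For reference, a sketch of the partition-first order Clifton describes above,
assuming GPT and that the disks are ada0/ada1 (the sizes and stripe name are
placeholders, not a tested recipe):

# gpart create -s gpt ada0
# gpart create -s gpt ada1
# gpart add -t freebsd-ufs -s 10g ada0
# gpart add -t freebsd-ufs -s 10g ada1
(the two adds create ada0p1 and ada1p1)
# gstripe label -v s1 /dev/ada0p1 /dev/ada1p1
# newfs /dev/stripe/s1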



Re: ZFS - moving from a zraid1 to zraid2 pool with 1.5tb disks

2011-01-05 Thread Damien Fleuriot
Hi again List,

I'm not so sure about using raidz2 anymore; I'm concerned about the performance.

Basically I have 9x 1.5T sata drives.

raidz2 and 2x raidz1 will provide the same capacity.

Are there any cons against using 2x raidz1 instead of 1x raidz2 ?

I plan on using an SSD for the OS, 40-64 GB, with 15 GB for the system itself 
and some spare.

Is it worth using the free space for cache ? ZIL ? both ?

@jean-yves : didn't you experience problems recently when using both ?
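
For concreteness, one way such an SSD could be carved up - purely a sketch with
hypothetical names (device ada2, pool tank) and sizes, GPT assumed:

# gpart create -s gpt ada2
# gpart add -t freebsd-ufs -s 15g ada2
# gpart add -t freebsd-zfs -s 4g ada2
# gpart add -t freebsd-zfs ada2
(that yields ada2p1 for the system, ada2p2 for a log device, ada2p3 for cache)
# zpool add tank log ada2p2
# zpool add tank cache ada2p3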

---
Fleuriot Damien

On 3 Jan 2011, at 16:08, Damien Fleuriot  wrote:

> 
> 
> On 1/3/11 2:17 PM, Ivan Voras wrote:
>> On 12/30/10 12:40, Damien Fleuriot wrote:
>> 
>>> I am concerned that in the event a drive fails, I won't be able to
>>> repair the disks in time before another actually fails.
>> 
>> An old trick to avoid that is to buy drives from different series or
>> manufacturers (the theory is that identical drives tend to fail at the
>> same time), but this may not be applicable if you have 5 drives in a
>> volume :) Still, you can try playing with RAIDZ levels and probabilities.
>> 
> 
> That's sound advice, although one also hears that they should get
> devices from the same vendor for maximum compatibility -.-
> 
> 
> Ah well, next time ;)
> 
> 
> A piece of advice I shall heed though is using 1% less capacity than
> what the disks really provide, in case one day I have to swap a drive
> and its replacement is a few kbytes smaller (thus preventing a rebuild).