RE: ZFS - moving from a zraid1 to zraid2 pool with 1.5tb disks
Yup, but the second set (a stripe of 2 raidz1's) can achieve slightly better performance, particularly on a system that has a lot of load. There are a number of blog articles that discuss that in more detail than I care to get into here. Of course, that's a bit of a moot point, as you're not going to load a 9 drive system as heavily as a 48 drive system.

In that example, the first (raidz2) would be a bit safer, as it can survive any 2 drives failing. The latter (2 raidz1's) would die if those two failing drives are within 1 raidz1 vdev. It all comes down to that final decision on how much risk you want to take with your data, what your budget is, and what your performance requirements are.

I'm starting to settle into a stripe of 6 vdevs that are each a 5 disk raidz1, with two hot spares kicking about, and a collection of small SSDs adding up to either 500G or 1TB of SSD L2ARC. A bit more risk, but I'm also planning on having an entirely redundant (yet slower) SAN device that will get a daily ZFS send, so my worst nightmare is being left with yesterday's data - which I can stand.

Oh - I am also a fan of buying drives at different time periods or from different suppliers. I have seen entire 4 and 8 drive arrays fail within a month of the first drives going. Always really fun when you were too slack to handle the first drive failure, the second one put you in a tight spot the next week, and then the third one dies while you're madly trying to do data recovery. :-) Really, in a big enough array, I like to swap out older drives for newer ones every now and then and repurpose the old - just to keep the dreaded complete failure at bay. Things you learn to do with cheap SATA drives.
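For concreteness, a pool shaped like the one described above could be created in one command. This is a sketch only: the pool name "tank" and the da0-da33 device names (30 pool disks, 2 spares, 2 SSDs) are placeholders, not taken from the thread.

```shell
# Hypothetical layout: 6 x 5-disk raidz1 vdevs striped, plus 2 hot spares.
zpool create tank \
    raidz1 da0  da1  da2  da3  da4  \
    raidz1 da5  da6  da7  da8  da9  \
    raidz1 da10 da11 da12 da13 da14 \
    raidz1 da15 da16 da17 da18 da19 \
    raidz1 da20 da21 da22 da23 da24 \
    raidz1 da25 da26 da27 da28 da29 \
    spare  da30 da31
# The small SSDs become read cache (L2ARC):
zpool add tank cache da32 da33
```

The daily ZFS send to the slower SAN is what covers the extra risk the raidz1 vdevs carry relative to raidz2.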
-Original Message- From: owner-freebsd-sta...@freebsd.org [mailto:owner-freebsd-sta...@freebsd.org] On Behalf Of Damien Fleuriot Sent: Wednesday, January 05, 2011 5:55 PM To: Chris Forgeron Cc: freebsd-stable@freebsd.org Subject: Re: ZFS - moving from a zraid1 to zraid2 pool with 1.5tb disks Well actually... raidz2: - 7x 1.5 tb = 10.5tb - 2 parity drives raidz1: - 3x 1.5 tb = 4.5 tb - 4x 1.5 tb = 6 tb , total 10.5tb - 2 parity drives in split thus different raidz1 arrays So really, in both cases 2 different parity drives and same storage... --- Fleuriot Damien On 5 Jan 2011, at 16:55, Chris Forgeron wrote: > First off, raidz2 and raidz1 with copies=2 are not the same thing. > > raidz2 will give you two copies of parity instead of just one. It also > guarantees that this parity is on different drives. You can sustain 2 drive > failures without data loss. > > raidz1 with copies=2 will give you two copies of all your files, but there is > no guarantee that they are on different drives, and you can still only > sustain 1 drive failure. > > You'll have better space efficiency with raidz2 if you're using 9 drives. > > If I were you, I'd use your 9 disks as one big raidz, or better yet, get 10 > disks, and make a stripe of 2 5 disk raidz's for the best performance. > > Save your SSD drive for the L2ARC (cache) or ZIL, you'll get better speed > that way instead of throwing it away on a boot drive. > > -- > > > -Original Message- > From: owner-freebsd-sta...@freebsd.org > [mailto:owner-freebsd-sta...@freebsd.org] On Behalf Of Damien Fleuriot > Sent: January-05-11 5:01 AM > To: Damien Fleuriot > Cc: freebsd-stable@freebsd.org > Subject: Re: ZFS - moving from a zraid1 to zraid2 pool with 1.5tb disks > > Hi again List, > > I'm not so sure about using raidz2 anymore, I'm concerned for the performance. > > Basically I have 9x 1.5T sata drives. > > raidz2 and 2x raidz1 will provide the same capacity. > > Are there any cons against using 2x raidz1 instead of 1x raidz2 ? 
> > I plan on using a SSD drive for the OS, 40-64gb, with 15 for the system > itself and some spare. > > Is it worth using the free space for cache ? ZIL ? both ? > > @jean-yves : didn't you experience problems recently when using both ? > > --- > Fleuriot Damien > > On 3 Jan 2011, at 16:08, Damien Fleuriot wrote: > >> >> >> On 1/3/11 2:17 PM, Ivan Voras wrote: >>> On 12/30/10 12:40, Damien Fleuriot wrote: >>> I am concerned that in the event a drive fails, I won't be able to repair the disks in time before another actually fails. >>> >>> An old trick to avoid that is to buy drives from different series or >>> manufacturers (the theory is that identical drives tend to fail at >>> the same time), but this may not be applicable if you have 5 drives >>> in a volume :) Still, you can try playing with RAIDZ levels and >>> probabilities. >>> >> >> That's sound advice, although one also hears that they should get >> devices from the same vendor for maximum compatibility -.- >> >> >> Ah well, next time ;) >> >> >> A piece of advice I shall heed though is using 1% less capacity than >> what the disks really provide, in case one day I have to swap a drive >> and its replacement is a few kbytes smaller (thus preventing a rebuild).
Re: ZFS - moving from a zraid1 to zraid2 pool with 1.5tb disks
On Wed, Jan 5, 2011 at 1:55 PM, Damien Fleuriot wrote: > Well actually... > > raidz2: > - 7x 1.5 tb = 10.5tb > - 2 parity drives > > raidz1: > - 3x 1.5 tb = 4.5 tb > - 4x 1.5 tb = 6 tb , total 10.5tb > - 2 parity drives in split thus different raidz1 arrays > > So really, in both cases 2 different parity drives and same storage... In the second case you get better performance, but lose some data protection. It's still raidz1, and you can't guarantee functionality in all cases of two drives failing: if two drives fail in the same vdev, your entire pool will be gone. Granted, it's better than single-vdev raidz1, but it's *not* as good as raidz2. --Artem ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"
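Artem's point can be quantified with a little arithmetic. Assuming the 4-disk + 5-disk raidz1 split discussed in the thread and two simultaneous random drive failures, the pool dies whenever both failed disks land in the same vdev. The back-of-the-envelope model below is mine, not from the thread:

```shell
# Ways to pick 2 disks from the same raidz1 vdev: C(4,2) + C(5,2).
pairs_same_vdev=$(( 4*3/2 + 5*4/2 ))
# Ways to pick any 2 of the 9 disks: C(9,2).
pairs_total=$(( 9*8/2 ))
# 16 of the 36 possible two-disk failures (~44%) destroy the
# striped-raidz1 pool; a 9-disk raidz2 survives all 36.
echo "$pairs_same_vdev of $pairs_total"
```

So the two layouts trade roughly a 44% chance of surviving a double failure against a 100% chance, for the same usable space.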
Re: ZFS - moving from a zraid1 to zraid2 pool with 1.5tb disks
Well actually... raidz2: - 7x 1.5 tb = 10.5tb - 2 parity drives raidz1: - 3x 1.5 tb = 4.5 tb - 4x 1.5 tb = 6 tb , total 10.5tb - 2 parity drives in split thus different raidz1 arrays So really, in both cases 2 different parity drives and same storage... --- Fleuriot Damien On 5 Jan 2011, at 16:55, Chris Forgeron wrote: > First off, raidz2 and raidz1 with copies=2 are not the same thing. > > raidz2 will give you two copies of parity instead of just one. It also > guarantees that this parity is on different drives. You can sustain 2 drive > failures without data loss. > > raidz1 with copies=2 will give you two copies of all your files, but there is > no guarantee that they are on different drives, and you can still only > sustain 1 drive failure. > > You'll have better space efficiency with raidz2 if you're using 9 drives. > > If I were you, I'd use your 9 disks as one big raidz, or better yet, get 10 > disks, and make a stripe of 2 5 disk raidz's for the best performance. > > Save your SSD drive for the L2ARC (cache) or ZIL, you'll get better speed > that way instead of throwing it away on a boot drive. > > -- > > > -Original Message- > From: owner-freebsd-sta...@freebsd.org > [mailto:owner-freebsd-sta...@freebsd.org] On Behalf Of Damien Fleuriot > Sent: January-05-11 5:01 AM > To: Damien Fleuriot > Cc: freebsd-stable@freebsd.org > Subject: Re: ZFS - moving from a zraid1 to zraid2 pool with 1.5tb disks > > Hi again List, > > I'm not so sure about using raidz2 anymore, I'm concerned for the performance. > > Basically I have 9x 1.5T sata drives. > > raidz2 and 2x raidz1 will provide the same capacity. > > Are there any cons against using 2x raidz1 instead of 1x raidz2 ? > > I plan on using a SSD drive for the OS, 40-64gb, with 15 for the system > itself and some spare. > > Is it worth using the free space for cache ? ZIL ? both ? > > @jean-yves : didn't you experience problems recently when using both ? 
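The capacity claim above checks out. A quick sanity check, working in tenths of a TB to keep the arithmetic integral (the 4-disk/5-disk grouping is my reading of the 3x + 4x data-disk split):

```shell
# Usable space with 1.5 TB disks, expressed in tenths of a TB.
raidz2_capacity=$(( (9 - 2) * 15 ))                 # one 9-disk raidz2
raidz1_capacity=$(( (4 - 1) * 15 + (5 - 1) * 15 ))  # 4-disk + 5-disk raidz1
echo "$raidz2_capacity $raidz1_capacity"            # both 105, i.e. 10.5 TB
```

Identical usable capacity, which is why the decision comes down to performance versus failure tolerance rather than space.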
> > --- > Fleuriot Damien > > On 3 Jan 2011, at 16:08, Damien Fleuriot wrote: > >> >> >> On 1/3/11 2:17 PM, Ivan Voras wrote: >>> On 12/30/10 12:40, Damien Fleuriot wrote: >>> I am concerned that in the event a drive fails, I won't be able to repair the disks in time before another actually fails. >>> >>> An old trick to avoid that is to buy drives from different series or >>> manufacturers (the theory is that identical drives tend to fail at >>> the same time), but this may not be applicable if you have 5 drives >>> in a volume :) Still, you can try playing with RAIDZ levels and >>> probabilities. >>> >> >> That's sound advice, although one also hears that they should get >> devices from the same vendor for maximum compatibility -.- >> >> >> Ah well, next time ;) >> >> >> A piece of advice I shall heed though is using 1% less capacity than >> what the disks really provide, in case one day I have to swap a drive >> and its replacement is a few kbytes smaller (thus preventing a rebuild).
[releng_8 tinderbox] failure on i386/pc98
TB --- 2011-01-05 20:39:14 - tinderbox 2.6 running on freebsd-stable.sentex.ca TB --- 2011-01-05 20:39:14 - starting RELENG_8 tinderbox run for i386/pc98 TB --- 2011-01-05 20:39:14 - cleaning the object tree TB --- 2011-01-05 20:39:38 - cvsupping the source tree TB --- 2011-01-05 20:39:38 - /usr/bin/csup -z -r 3 -g -L 1 -h cvsup.sentex.ca /tinderbox/RELENG_8/i386/pc98/supfile TB --- 2011-01-05 20:40:22 - building world TB --- 2011-01-05 20:40:22 - MAKEOBJDIRPREFIX=/obj TB --- 2011-01-05 20:40:22 - PATH=/usr/bin:/usr/sbin:/bin:/sbin TB --- 2011-01-05 20:40:22 - TARGET=pc98 TB --- 2011-01-05 20:40:22 - TARGET_ARCH=i386 TB --- 2011-01-05 20:40:22 - TZ=UTC TB --- 2011-01-05 20:40:22 - __MAKE_CONF=/dev/null TB --- 2011-01-05 20:40:22 - cd /src TB --- 2011-01-05 20:40:22 - /usr/bin/make -B buildworld >>> World build started on Wed Jan 5 20:40:23 UTC 2011 >>> Rebuilding the temporary build tree >>> stage 1.1: legacy release compatibility shims >>> stage 1.2: bootstrap tools >>> stage 2.1: cleaning up the object tree >>> stage 2.2: rebuilding the object tree >>> stage 2.3: build tools >>> stage 3: cross tools >>> stage 4.1: building includes >>> stage 4.2: building libraries >>> stage 4.3: make dependencies >>> stage 4.4: building everything [...] 
gzip -cn /src/usr.bin/ncplogin/ncplogout.1 > ncplogout.1.gz ===> usr.bin/netstat (all) cc -O2 -pipe -fno-strict-aliasing -DIPSEC -DSCTP -DINET6 -DNETGRAPH -DIPX -std=gnu99 -fstack-protector -Wsystem-headers -Werror -Wall -Wno-format-y2k -W -Wno-unused-parameter -Wstrict-prototypes -Wmissing-prototypes -Wpointer-arith -Wno-uninitialized -Wno-pointer-sign -c /src/usr.bin/netstat/if.c cc -O2 -pipe -fno-strict-aliasing -DIPSEC -DSCTP -DINET6 -DNETGRAPH -DIPX -std=gnu99 -fstack-protector -Wsystem-headers -Werror -Wall -Wno-format-y2k -W -Wno-unused-parameter -Wstrict-prototypes -Wmissing-prototypes -Wpointer-arith -Wno-uninitialized -Wno-pointer-sign -c /src/usr.bin/netstat/inet.c cc1: warnings being treated as errors /src/usr.bin/netstat/inet.c: In function 'protopr': /src/usr.bin/netstat/inet.c:463: warning: format '%6u' expects type 'unsigned int', but argument 2 has type 'uint64_t' /src/usr.bin/netstat/inet.c:463: warning: format '%6u' expects type 'unsigned int', but argument 3 has type 'uint64_t' *** Error code 1 Stop in /src/usr.bin/netstat. *** Error code 1 Stop in /src/usr.bin. *** Error code 1 Stop in /src. *** Error code 1 Stop in /src. *** Error code 1 Stop in /src. TB --- 2011-01-05 21:37:40 - WARNING: /usr/bin/make returned exit code 1 TB --- 2011-01-05 21:37:40 - ERROR: failed to build world TB --- 2011-01-05 21:37:40 - 2501.15 user 373.52 system 3506.43 real http://tinderbox.freebsd.org/tinderbox-releng_8-RELENG_8-i386-pc98.full
[releng_8 tinderbox] failure on i386/i386
TB --- 2011-01-05 20:34:44 - tinderbox 2.6 running on freebsd-stable.sentex.ca TB --- 2011-01-05 20:34:44 - starting RELENG_8 tinderbox run for i386/i386 TB --- 2011-01-05 20:34:44 - cleaning the object tree TB --- 2011-01-05 20:35:15 - cvsupping the source tree TB --- 2011-01-05 20:35:15 - /usr/bin/csup -z -r 3 -g -L 1 -h cvsup.sentex.ca /tinderbox/RELENG_8/i386/i386/supfile TB --- 2011-01-05 20:36:12 - building world TB --- 2011-01-05 20:36:12 - MAKEOBJDIRPREFIX=/obj TB --- 2011-01-05 20:36:12 - PATH=/usr/bin:/usr/sbin:/bin:/sbin TB --- 2011-01-05 20:36:12 - TARGET=i386 TB --- 2011-01-05 20:36:12 - TARGET_ARCH=i386 TB --- 2011-01-05 20:36:12 - TZ=UTC TB --- 2011-01-05 20:36:12 - __MAKE_CONF=/dev/null TB --- 2011-01-05 20:36:12 - cd /src TB --- 2011-01-05 20:36:12 - /usr/bin/make -B buildworld >>> World build started on Wed Jan 5 20:36:13 UTC 2011 >>> Rebuilding the temporary build tree >>> stage 1.1: legacy release compatibility shims >>> stage 1.2: bootstrap tools >>> stage 2.1: cleaning up the object tree >>> stage 2.2: rebuilding the object tree >>> stage 2.3: build tools >>> stage 3: cross tools >>> stage 4.1: building includes >>> stage 4.2: building libraries >>> stage 4.3: make dependencies >>> stage 4.4: building everything [...] 
gzip -cn /src/usr.bin/ncplogin/ncplogout.1 > ncplogout.1.gz ===> usr.bin/netstat (all) cc -O2 -pipe -fno-strict-aliasing -DIPSEC -DSCTP -DINET6 -DNETGRAPH -DIPX -std=gnu99 -fstack-protector -Wsystem-headers -Werror -Wall -Wno-format-y2k -W -Wno-unused-parameter -Wstrict-prototypes -Wmissing-prototypes -Wpointer-arith -Wno-uninitialized -Wno-pointer-sign -c /src/usr.bin/netstat/if.c cc -O2 -pipe -fno-strict-aliasing -DIPSEC -DSCTP -DINET6 -DNETGRAPH -DIPX -std=gnu99 -fstack-protector -Wsystem-headers -Werror -Wall -Wno-format-y2k -W -Wno-unused-parameter -Wstrict-prototypes -Wmissing-prototypes -Wpointer-arith -Wno-uninitialized -Wno-pointer-sign -c /src/usr.bin/netstat/inet.c cc1: warnings being treated as errors /src/usr.bin/netstat/inet.c: In function 'protopr': /src/usr.bin/netstat/inet.c:463: warning: format '%6u' expects type 'unsigned int', but argument 2 has type 'uint64_t' /src/usr.bin/netstat/inet.c:463: warning: format '%6u' expects type 'unsigned int', but argument 3 has type 'uint64_t' *** Error code 1 Stop in /src/usr.bin/netstat. *** Error code 1 Stop in /src/usr.bin. *** Error code 1 Stop in /src. *** Error code 1 Stop in /src. *** Error code 1 Stop in /src. TB --- 2011-01-05 21:34:25 - WARNING: /usr/bin/make returned exit code 1 TB --- 2011-01-05 21:34:25 - ERROR: failed to build world TB --- 2011-01-05 21:34:25 - 2527.38 user 365.25 system 3581.07 real http://tinderbox.freebsd.org/tinderbox-releng_8-RELENG_8-i386-i386.full
Re: gstripe/gpart problems.
On Wed, Jan 05, 2011 at 11:36:59AM +0200, Daniel Braniss wrote: > Hi Clifton, > I was getting very frustrated yesterday, hence the cryptic message; your > response requires some background :-) > the box is a Sun Fire X2200, which has bays for 2 disks (we have several of > these). > before the latest upgrade, the 2 disks were 'raided' via 'nVidia MediaShield' and > appeared as ar0. when I upgraded to 8.2, it disappeared, since I had ATA_CAM in the > kernel config file. So I started fiddling with gstripe, which 'recovered' the data. > Next, since the kernel boot kept complaining about GEOM errors (and not wanting to > mislead the operators) I cleaned up the data and started from scratch. > the machine boots diskless, but I like to keep a root bootable partition just > in case. > the process was in every case the same: first the stripe, then gpart the > stripe. Thanks, that makes it very clear why things are as they are. Good to know that the booting issues are covered via diskless boot. I had never thought about being able to recover a RAID stripe using gstripe. That's a very interesting capability! Assuming that FreeBSD considers partitioning a stripe to be valid in principle - and you give reasons it should - then there may be a geom/driver interaction bug to investigate here if the geom layer is refusing to write a stripe-oriented partition to the raw drive. -- Clifton -- Clifton Royston -- clift...@iandicomputing.com / clift...@lava.net President - I and I Computing * http://www.iandicomputing.com/ Custom programming, network design, systems and network consulting services
RE: ZFS - moving from a zraid1 to zraid2 pool with 1.5tb disks
First off, raidz2 and raidz1 with copies=2 are not the same thing. raidz2 will give you two copies of parity instead of just one. It also guarantees that this parity is on different drives. You can sustain 2 drive failures without data loss. raidz1 with copies=2 will give you two copies of all your files, but there is no guarantee that they are on different drives, and you can still only sustain 1 drive failure. You'll have better space efficiency with raidz2 if you're using 9 drives. If I were you, I'd use your 9 disks as one big raidz, or better yet, get 10 disks, and make a stripe of 2 5 disk raidz's for the best performance. Save your SSD drive for the L2ARC (cache) or ZIL, you'll get better speed that way instead of throwing it away on a boot drive. -- -Original Message- From: owner-freebsd-sta...@freebsd.org [mailto:owner-freebsd-sta...@freebsd.org] On Behalf Of Damien Fleuriot Sent: January-05-11 5:01 AM To: Damien Fleuriot Cc: freebsd-stable@freebsd.org Subject: Re: ZFS - moving from a zraid1 to zraid2 pool with 1.5tb disks Hi again List, I'm not so sure about using raidz2 anymore, I'm concerned for the performance. Basically I have 9x 1.5T sata drives. raidz2 and 2x raidz1 will provide the same capacity. Are there any cons against using 2x raidz1 instead of 1x raidz2 ? I plan on using a SSD drive for the OS, 40-64gb, with 15 for the system itself and some spare. Is it worth using the free space for cache ? ZIL ? both ? @jean-yves : didn't you experience problems recently when using both ? --- Fleuriot Damien On 3 Jan 2011, at 16:08, Damien Fleuriot wrote: > > > On 1/3/11 2:17 PM, Ivan Voras wrote: >> On 12/30/10 12:40, Damien Fleuriot wrote: >> >>> I am concerned that in the event a drive fails, I won't be able to >>> repair the disks in time before another actually fails. 
>> >> An old trick to avoid that is to buy drives from different series or >> manufacturers (the theory is that identical drives tend to fail at >> the same time), but this may not be applicable if you have 5 drives >> in a volume :) Still, you can try playing with RAIDZ levels and >> probabilities. >> > > That's sound advice, although one also hears that they should get > devices from the same vendor for maximum compatibility -.- > > > Ah well, next time ;) > > > A piece of advice I shall heed though is using 1% less capacity than > what the disks really provide, in case one day I have to swap a drive > and its replacement is a few kbytes smaller (thus preventing a rebuild).
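Chris's closing suggestion - splitting the SSD between the OS and ZFS rather than spending it all on a boot drive - could look like the following. This is a sketch only: the device name da9, the partition sizes, and the pool name "tank" are assumptions, not details from the thread.

```shell
# Hypothetical 64 GB SSD: OS partition, small ZIL (slog), rest as L2ARC.
gpart create -s GPT da9
gpart add -t freebsd-ufs -s 15g da9   # system + spare room
gpart add -t freebsd-zfs -s 4g  da9   # separate intent log
gpart add -t freebsd-zfs        da9   # remaining space for read cache
zpool add tank log   da9p2
zpool add tank cache da9p3
```

One design note: losing an L2ARC device is harmless, while on the pool versions current at the time losing a dedicated log device could make the pool unimportable, which is worth weighing before committing a single SSD to both roles.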
Re: NFSv4 - how to set up at FreeBSD 8.1 ?
> Yes, to access the file volumes via any version of NFS, they need to > be exported. (I don't think it would make sense to allow access to all > of the server's data without limitations for NFSv4?) > > What is different (and makes it confusing for folks familiar with > NFSv2,3) > is the fact that it is a single "mount tree" for NFSv4 that has to be > rooted > somewhere. > Solaris10 - always roots it at "/" but somehow works around the ZFS > case, > so any exported share can be mounted with the same path used > by NFSv2,3. > Linux - Last I looked (which was a couple of years ago), it exported a > single volume for NFSv4 and the rest of the server's volumes > could only be accessed via NFSv2,3. (I don't know if they've > changed this yet?) > > So, I chose to allow a little more flexibility than Solaris10 and > allow > /etc/exports to set the location of the "mount root". I didn't > anticipate > the "glitch" that ZFS introduced (where all ZFS volumes in the mount > path > must be exported for mount to work) because it does its own exporting. > (Obviously, the glitch/inconsistency needs to be resolved at some > point.) > Perhaps it would help to show what goes on the wire when a mount is done. # mount -t nfs -o nfsv4 server:/usr/home /mnt For NFSv2,3 there will be a Mount RPC with /usr/home as the argument. This goes directly to mountd and then mountd decides if it is ok and replies with the file handle (FH) for /usr/home if it is. For NFSv4, the client will do a compound RPC that looks something like this: (The exact structure is up to the client implementor.) PutRootFH <-- set the position to the "root mount location" as specified by the V4: line Lookup usr Lookup home GetFH <-- return the file handle at this location As such, there can only be one "root mount location" and at least Lookup operations must work for all elements of the path from there to the client's mount point.
(For non-ZFS, it currently allows Lookup plus a couple of others that some clients use during mounting to work for non-exported file systems, so that setting "root mount location" == "/" works without exporting the entire file server's tree.) For all other operations, the file system must be exported just like for NFSv2,3. Hope this helps, rick
Re: NFSv4 - how to set up at FreeBSD 8.1 ?
> Hi > > On 5 January 2011 12:09, Rick Macklem wrote: > > > You can also do the following: > > For /etc/exports > > V4: / > > /usr/home -maproot=root -network 192.168.183.0 -mask 255.255.255.0 > > > > Then mount: > > # mount_nfs -o nfsv4 192.168.183.131:/usr/home /marek_nfs4/ > > (But only if the file system for "/" is ufs and not zfs and, > > admittedly > > there was a debate that has to be continued someday that might make > > it > > necessary to export "/" as well for ufs like zfs requires.) > > > > rick > > ps: And some NFSv4 clients can cross server mount points, unlike > > NFSv2, 3. > > > > I've done that (exporting V4: /) > > but then when I mount a sub zfs filesystem (e.g. /pool/backup/sites/m) > then it appears empty on the client. > > If I export /pool/backup/sites/m , then I see the content of the > directory. > > Most of the sub-directory in /pool are actually zfs file system > mounted. > > It is something I expected with NFSv3 .. but not with nfs v4. > Yes, to access the file volumes via any version of NFS, they need to be exported. (I don't think it would make sense to allow access to all of the server's data without limitations for NFSv4?) What is different (and makes it confusing for folks familiar with NFSv2,3) is the fact that it is a single "mount tree" for NFSv4 that has to be rooted somewhere. Solaris10 - always roots it at "/" but somehow works around the ZFS case, so any exported share can be mounted with the same path used by NFSv2,3. Linux - Last I looked (which was a couple of years ago), it exported a single volume for NFSv4 and the rest of the server's volumes could only be accessed via NFSv2,3. (I don't know if they've changed this yet?) So, I chose to allow a little more flexibility than Solaris10 and allow /etc/exports to set the location of the "mount root". I didn't anticipate the "glitch" that ZFS introduced (where all ZFS volumes in the mount path must be exported for mount to work) because it does its own exporting. 
(Obviously, the glitch/inconsistency needs to be resolved at some point.) rick
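Until that inconsistency is resolved, the practical workaround for the ZFS case Rick describes is to export every ZFS filesystem along the NFSv4 path, not just the leaf. A hypothetical /etc/exports for the /pool/backup/sites/m example mentioned earlier in the thread (the network and options are assumptions, reused from the other examples in this thread):

```
/pool/backup          -network 192.168.183.0 -mask 255.255.255.0
/pool/backup/sites    -network 192.168.183.0 -mask 255.255.255.0
/pool/backup/sites/m  -network 192.168.183.0 -mask 255.255.255.0
V4: /
```

With each intermediate ZFS filesystem exported, the client-side traversal during `mount -t nfs -o nfsv4` can reach the leaf instead of seeing it empty.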
Re: NFSv4 - how to set up at FreeBSD 8.1 ?
> Rick Macklem wrote: > > > ... one of the fundamental principles for NFSv2, 3 was a stateless > > server ... > > Only as long as UDP transport was used. Any NFS implementation that > used TCP for transport had thereby abandoned the stateless server > principle, since a TCP connection itself requires that state be > maintained on both ends. > You've seen the responses w.r.t. what stateless server referred to already. But you might find this "quirk" interesting... Early in the NFSv4 design, I suggested that handling of the server state (opens/locks/...) might be tied to the TCP connections (NFSv4 doesn't use UDP). Let's just say the idea "flew like a lead balloon". Lots of responses along the lines of "NFS should be separate from the RPC transport layer", etc. Then, several years later, they came up with Sessions for NFSv4.1, which does RPC transport management in a very NFSv4.1-specific way, including stuff like changing the RPC semantics to exactly-once... Although very different from what I had envisioned, in a sense Sessions does tie state handling (ordering of locking operations, for example) to RPC transport management, as I currently understand it. rick
Re: NFSv4 - how to set up at FreeBSD 8.1 ?
> On Wednesday, January 05, 2011 5:55:53 am per...@pluto.rain.com wrote: > > Rick Macklem wrote: > > > > > ... one of the fundamental principles for NFSv2, 3 was a stateless > > > server ... > > > > Only as long as UDP transport was used. Any NFS implementation that > > used TCP for transport had thereby abandoned the stateless server > > principle, since a TCP connection itself requires that state be > > maintained on both ends. > > Not filesystem cache coherency state, only socket state. And even NFS > UDP > mounts maintain their own set of "socket" state to manage retries and > retransmits for UDP RPCs. The filesystem is equally incoherent for > both UDP > and TCP NFSv[23] mounts. TCP did not change any of that. > Unfortunately even NFSv4 doesn't maintain cache coherency in general. The state it maintains/recovers after a server crash is opens/locks/delegations, but the opens are Windows-like open share locks (can't remember the Windows/Samba term for them) and not POSIX-like opens. NFSv4 does tie cache coherency to file locking, so that clients will get a coherent view of file data for byte ranges they lock. The term stateless server refers to the fact that the server doesn't know anything about the file handling state in the client that needs to be recovered after a server crash (opens, locks, ...). When an NFSv2,3 server is rebooted, it normally knows nothing about what clients are mounted, what clients have files open, etc and just services RPCs as they come in. The design avoided the complexity of recovery after a crash but results in a non-POSIX compliant file system that can't do a good job of cache coherency, knows nothing about file locks, etc. (Sun did add a separate file locking protocol called the NLM or rpc.lockd if you prefer, but that protocol design was fundamentally flawed imho and, as such, using it is in the "your mileage may vary" category.)
Further, since without any information about previous operations, retries of non-idempotent RPCs would cause weird failures, "soft state" in the form of a cache of recent RPCs (typically called the Duplicate Request Cache or DRC these days) was added, to avoid performing the non-idempotent operation twice. A server is not required to retain the contents of a DRC after a crash/reboot but some vendors with non-volatile RAM hardware may choose to do so in order to provide "closer to correct" behaviour after a server crash/reboot. rick
Re: NFSv4 - how to set up at FreeBSD 8.1 ?
> > You can also do the following: > > For /etc/exports > > V4: / > > /usr/home -maproot=root -network 192.168.183.0 -mask 255.255.255.0 > > Not in my configuration - '/' and '/usr' are different partitions > (both UFS) > Hmm. Since entire volumes are exported for NFSv4, I can't remember if exporting a subtree of the volume works (I think it does, but??). However, I do know that if you change the /etc/exports for the above to: (note I also moved the V4: line to the end because at one time it was required to be at the end and I can't remember if that restriction is still enforced. Always check /var/log/messages after starting mountd with a modified /etc/exports and look for any messages related to problems with /etc/exports.) In other words, exporting the volume's mount point and putting the V4: line at the end are changes that "might be required?". If you take a look at mountd.c, you'll understand why I have trouble remembering exactly what works and what doesn't. :-) /usr -maproot=root -network 192.168.183.0 -mask 255.255.255.0 V4: / then for the above situation: # mount -t nfs -o nfsv4 server:/ /mnt - will fail because "/" isn't exported however # mount -t nfs -o nfsv4 server:/usr /mnt - should work. If it doesn't work, it is not because /etc/exports is wrong. A small number of NFSv4 ops are allowed on non-exported UFS partitions so that "mount" can traverse the tree down to the mount point, but that mount point must be exported. When I did this I did not realize that ZFS did its own exporting and, as such, traversal of non-exported ZFS volumes doesn't work, because ZFS doesn't allow any operations on the non-exported volumes to work. At some point, there needs to be a debate w.r.t. inconsistent behaviour. The easiest fix is to disable the capability of traversal of non-exported UFS volumes. The downside of this is that it is harder to configure the single (sub)tree on the server that is needed for NFSv4.
Have fun with it, rick
Re: NFSv4 - how to set up at FreeBSD 8.1 ?
On Wednesday, January 05, 2011 5:55:53 am per...@pluto.rain.com wrote: > Rick Macklem wrote: > > > ... one of the fundamental principles for NFSv2, 3 was a stateless > > server ... > > Only as long as UDP transport was used. Any NFS implementation that > used TCP for transport had thereby abandoned the stateless server > principle, since a TCP connection itself requires that state be > maintained on both ends. Not filesystem cache coherency state, only socket state. And even NFS UDP mounts maintain their own set of "socket" state to manage retries and retransmits for UDP RPCs. The filesystem is equally incoherent for both UDP and TCP NFSv[23] mounts. TCP did not change any of that. -- John Baldwin
Re: NFSv4 - how to set up at FreeBSD 8.1 ?
You can also do the following: For /etc/exports V4: / /usr/home -maproot=root -network 192.168.183.0 -mask 255.255.255.0 Not in my configuration - '/' and '/usr' are different partitions (both UFS) -- Marek Salwerowicz
Re: NFSv4 - how to set up at FreeBSD 8.1 ?
Rick Macklem wrote:
> ... one of the fundamental principles for NFSv2, 3 was a stateless
> server ...

Only as long as UDP transport was used. Any NFS implementation that used TCP for transport had thereby abandoned the stateless server principle, since a TCP connection itself requires that state be maintained on both ends.
Re: gstripe/gpart problems.
> On Tue, Jan 04, 2011 at 04:21:31PM +0200, Daniel Braniss wrote:
> > Hi,
> > I have 2 ada disks striped:
> >
> > # gstripe list
> > Geom name: s1
> > State: UP
> > Status: Total=2, Online=2
> > Type: AUTOMATIC
> > Stripesize: 65536
> > ID: 2442772675
> > Providers:
> > 1. Name: stripe/s1
> >    Mediasize: 1000215674880 (932G)
> >    Sectorsize: 512
> >    Stripesize: 65536
> >    Stripeoffset: 0
> >    Mode: r0w0e0
> > Consumers:
> > 1. Name: ada0
> >    Mediasize: 500107862016 (466G)
> >    Sectorsize: 512
> >    Mode: r0w0e0
> >    Number: 0
> > 2. Name: ada1
> >    Mediasize: 500107862016 (466G)
> >    Sectorsize: 512
> >    Mode: r0w0e0
> >    Number: 1
> >
> > boot complains:
> >
> > GEOM_STRIPE: Device s1 created (id=2442772675).
> > GEOM_STRIPE: Disk ada0 attached to s1.
> > GEOM: ada0: corrupt or invalid GPT detected.
> > GEOM: ada0: GPT rejected -- may not be recoverable.
> > GEOM_STRIPE: Disk ada1 attached to s1.
> > GEOM_STRIPE: Device s1 activated.
> >
> > # gpart show
> > =>        34  1953546173  stripe/s1  GPT  (932G)
> >           34         128          1  freebsd-boot  (64K)
> >          162  1953546045             - free -  (932G)
> >
> > # gpart add -t freebsd-ufs -s 20g stripe/s1
> > GEOM: ada0: corrupt or invalid GPT detected.
> > GEOM: ada0: GPT rejected -- may not be recoverable.
> > stripe/s1p2 added
> > # gpart show
> > =>        34  1953546173  stripe/s1  GPT  (932G)
> >           34         128          1  freebsd-boot  (64K)
> >          162    41943040          2  freebsd-ufs  (20G)
> >     41943202  1911603005             - free -  (912G)
> >
> > if I go the MBR road, all seems OK, but as soon as I try to write
> > the boot block (boot0cfg -B /dev/stripe/s1) the kernel again
> > starts to complain about corrupted GEOM.
>
> So are you trying to partition the drives and then stripe the
> partitions within the drives, or are you trying to partition the
> stripe?
> It seems here as though you might be trying to first partition the
> drives (not clear on that), then stripe the whole drives - which will
> mean the partition info is wrong for the resulting striped drive set -
> and then repartition the striped drive set, and neither ends up
> valid.
>
> If what you intend is to partition after striping the raw
> drives, then you are doing the right steps, but when the geom layer
> looks at the info on the individual drives, as at boot, it will
> find it invalid. If the gpart layer is actually refusing to write
> partition info to the drives which is wrong for the drives taken
> individually, that would account for your problems.
>
> One valid order to do things in would be to partition the drives with
> gpart, creating identical sets of partitions on both drives, then
> stripe the partitions created within them (syntax not exact):
>
> gpart add -t freebsd-ufs0 -s 10g ada0
> gpart add -t freebsd-ufs1 -s 10g ada1
> gstripe label freebsd-ufs freebsd-ufs0 freebsd-ufs1
>
> That would give you a 20GB stripe, with valid partition info on each
> drive.
>
> If this will be your boot drive, then depending on how much needs to be
> read from the drive before the geom_stripe kernel module gets loaded, I
> would think there could also be a problem booting from the drive. This
> is not like gmirroring two drives or partitions, where the info read
> from either disk early in boot will be identical, and identical (except
> for the last block of the partition) to what the OS sees later after
> the mirror is formed.
>
> I assume you're bearing in mind that if you lose either drive to a
> hardware fault you lose the whole thing, and consider the risk worth
> the potential speed/size gain.
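The partition-then-stripe order described above ("syntax not exact") might look like this in exact gpart/gstripe syntax (a sketch only; the ada0/ada1 device names, the 10g size, and the st0 stripe name are assumptions taken from the example):

```shell
# Create a GPT scheme and one 10 GB UFS partition on each disk
gpart create -s gpt ada0
gpart create -s gpt ada1
gpart add -t freebsd-ufs -s 10g ada0   # creates ada0p1
gpart add -t freebsd-ufs -s 10g ada1   # creates ada1p1

# Stripe the two partitions (not the whole disks), which gives
# /dev/stripe/st0 - a 20 GB device - without clobbering the per-disk
# partition tables
gstripe label -s 65536 st0 ada0p1 ada1p1
newfs /dev/stripe/st0
```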
> -- Clifton

Hi Clifton,

I was getting very frustrated yesterday, hence the cryptic message; your response requires some background :-)

The box is a Sun Fire X2200, which has bays for 2 disks (we have several of these). Before the latest upgrade, the 2 disks were 'raided' via 'nVidia MediaShield' and appeared as ar0. When I upgraded to 8.2, it disappeared, since I had ATA_CAM in the kernel config file. So I started fiddling with gstripe, which 'recovered' the data.

Next, since the kernel boot kept complaining about GEOM errors (and not wanting to mislead the operators), I cleaned up the data and started from scratch. The machine boots diskless, but I like to keep a root bootable partition just in case. The process was in every case the same: first the stripe, then gpart the stripe.

BTW, I know that if I lose a disk I lose everything, and also that I won't be able to boot from it (I can always boot via the net and mount the root locally, or some other combination - USB, etc.). But you gave me some ideas and will start experimenting soon.

thanks,
danny
Re: ZFS - moving from a zraid1 to zraid2 pool with 1.5tb disks
Hi again List,

I'm not so sure about using raidz2 anymore; I'm concerned about the performance.

Basically I have 9x 1.5T SATA drives. raidz2 and 2x raidz1 will provide the same capacity. Are there any cons against using 2x raidz1 instead of 1x raidz2?

I plan on using an SSD drive for the OS, 40-64 GB, with 15 for the system itself and some spare. Is it worth using the free space for cache? ZIL? Both?

@jean-yves: didn't you experience problems recently when using both?

---
Fleuriot Damien

On 3 Jan 2011, at 16:08, Damien Fleuriot wrote:
> On 1/3/11 2:17 PM, Ivan Voras wrote:
>> On 12/30/10 12:40, Damien Fleuriot wrote:
>>> I am concerned that in the event a drive fails, I won't be able to
>>> repair the disks in time before another actually fails.
>>
>> An old trick to avoid that is to buy drives from different series or
>> manufacturers (the theory is that identical drives tend to fail at the
>> same time), but this may not be applicable if you have 5 drives in a
>> volume :) Still, you can try playing with RAIDZ levels and probabilities.
>
> That's sound advice, although one also hears that they should get
> devices from the same vendor for maximum compatibility -.-
>
> Ah well, next time ;)
>
> A piece of advice I shall heed, though, is using 1% less capacity than
> what the disks really provide, in case one day I have to swap a drive
> and its replacement is a few kbytes smaller (thus preventing a rebuild).
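The two 9-drive layouts being weighed here can be written out as zpool commands (a sketch only; the pool name "tank" and the da0-da8 device names are assumptions):

```shell
# Option 1: one 9-disk raidz2 vdev.
# 7 disks of usable capacity; survives ANY 2 drives failing.
zpool create tank raidz2 da0 da1 da2 da3 da4 da5 da6 da7 da8

# Option 2: a stripe of a 4-disk and a 5-disk raidz1 vdev.
# Also 7 disks of usable capacity, but the pool is lost if 2 drives
# fail inside the SAME raidz1 vdev.
zpool create tank \
    raidz1 da0 da1 da2 da3 \
    raidz1 da4 da5 da6 da7 da8
```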