[zfs-discuss] zpool status output confusion
I get the following output when I run zpool status, but I am a little confused about why c9t8d0 is indented less than the rest of the disks in the pool. What does it mean?

$ zpool status blmpool
  pool: blmpool
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        blmpool     ONLINE       0     0     0
          raidz2    ONLINE       0     0     0
            c9t0d0  ONLINE       0     0     0
            c9t1d0  ONLINE       0     0     0
            c9t3d0  ONLINE       0     0     0
            c9t4d0  ONLINE       0     0     0
            c9t5d0  ONLINE       0     0     0
            c9t6d0  ONLINE       0     0     0
            c9t7d0  ONLINE       0     0     0
          c9t8d0    ONLINE       0     0     0
Re: [zfs-discuss] zpool status output confusion
On 27 May, 2010 - Per Jorgensen sent me these 1,0K bytes:
> I get the following output when I run zpool status, but I am a little confused about why c9t8d0 is indented less than the rest of the disks in the pool. What does it mean?

Because someone forced it in without redundancy (or created the pool that way). Your pool is in bad shape, as c9t8d0 is without redundancy; if it fails, your pool is toast. zpool history should at least be able to tell you when it happened.

/Tomas
--
Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
Re: [zfs-discuss] zpool status output confusion
On May 27, 2010, at 12:37 PM, Per Jorgensen wrote:
> I get the following output when I run zpool status, but I am a little confused about why c9t8d0 is indented less than the rest of the disks in the pool. What does it mean?

It means that it is another top-level vdev in your pool. Basically you have two top-level vdevs: one is your raidz2 vdev containing 7 disks, and the other is a single-disk top-level vdev, c9t8d0. I guess it was added like this: zpool add -f blmpool c9t8d0. Without -f it would have complained about mismatched replication levels. You can check the pool history to see exactly when it was done.
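For reference, a quick way to confirm this from the pool history (output format varies by build; the grep pattern below is only an example):

$ zpool history blmpool
$ zpool history blmpool | egrep 'create|add'

The command that added c9t8d0 as its own top-level vdev should show up there with a timestamp.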
Re: [zfs-discuss] creating a fast ZIL device for $200
Neil Perrin wrote:
> Yes, I agree this seems very appealing. I have investigated and observed similar results. Just allocating larger intent log blocks but only writing to, say, the first half of them has shown the same effect. Despite the impressive results, we have not pursued this further, mainly because of its maintainability. There is quite a variance between drives, so, as mentioned, feedback profiling of the device is needed in the working system. The layering of the Solaris IO subsystem doesn't provide the necessary feedback, and the ZIL code is layered on the SPA/DMU. Still, it should be possible. Good luck!

Thanks :) Though I hoped to get a different answer. An integration into the ZFS code would be much more elegant, but of course in a few years the necessity for this optimization will be gone, when SSDs are cheap, fast and reliable. There seems to be some interest in this idea here. Would it make sense to start a project for it? Currently I'm implementing a driver as a proof of concept, but I'm in need of a lot of discussion about algorithms and concepts, and maybe some code reviews. Can I count on some support from here?

--Arne
Re: [zfs-discuss] zpool status output confusion
Thanks for the quick responses, and yes, the history shows just what you said :( Is there a way I can get c9t8d0 out of the pool, or how do I get the pool back to optimal redundancy?
Re: [zfs-discuss] Ideal SATA/SAS Controllers for ZFS
On Thu, May 27, 2010 at 2:39 AM, Marc Bevand m.bev...@gmail.com wrote: Hi, Brandon High bhigh at freaks.com writes: I only looked at the Megaraid that he mentioned, which has a PCIe 1.0 4x interface, or 1000MB/s. You mean x8 interface (theoretically plugged into that x4 slot below...) The board also has a PCIe 1.0 4x electrical slot, which is 8x physical. If the card was in the PCIe slot furthest from the CPUs, then it was only running 4x. The tests were done connecting both cards to the PCIe 2.0 x8 slot#6 that connects directly to the Intel 5520 chipset. I totally ignored the differences between PCIe 1.0 and 2.0. My fault. If Giovanni had put the Megaraid in this slot, he would have seen an even lower throughput, around 600MB/s: This slot is provided by the ICH10R which, as you can see on http://www.supermicro.com/manuals/motherboard/5500/MNL-1062.pdf, is connected to the northbridge through a DMI link, an Intel-proprietary PCIe 1.0 x4 link. The ICH10R supports a Max_Payload_Size of only 128 bytes on the DMI link: http://www.intel.com/Assets/PDF/datasheet/320838.pdf And as per my experience (http://opensolaris.org/jive/thread.jspa?threadID=54481&tstart=45) a 128-byte MPS allows using just about 60% of the theoretical PCIe throughput, that is, for the DMI link: 250MB/s * 4 links * 60% = 600MB/s. Note that the PCIe x4 slot supports a larger, 256-byte MPS, but this is irrelevant as the DMI link will be the bottleneck anyway due to the smaller MPS. A single 3Gbps link provides in theory 300MB/s usable after 8b-10b encoding, but practical throughput numbers are closer to 90% of this figure, or 270MB/s. 6 disks per link means that each disk gets allocated 270/6 = 45MB/s. ... except that a SFF-8087 connector contains four 3Gbps connections. Yes, four 3Gbps links, but 24 disks per SFF-8087 connector. That's still 6 disks per 3Gbps (according to Giovanni, his LSI HBA was connected to the backplane with a single SFF-8087 cable). Correct. The backplane on the SC646E1 only has one SFF-8087 cable to the HBA. It may depend on how the drives were connected to the expander. You're assuming that all 18 are on 3 channels, in which case moving drives around could help performance a bit. True, I assumed this and, frankly, this is probably what he did by using adjacent drive bays... A more optimal solution would be to spread the 18 drives in a 5+5+4+4 config so that the 2 most congested 3Gbps links are shared by only 5 drives, instead of 6, which would boost the throughput by 6/5 = 1.2x. Which would change my first overall 810MB/s estimate to 810*1.2 = 972MB/s. The chassis has 4 columns of 6 disks. The 18 disks I was testing were all on columns #1 #2 #3. Column #0 still has a pair of SSDs and more disks which I haven't used in this test. I'll try to move things around to make use of the 4 port multipliers and test again. SuperMicro is going to release a 6Gb/s backplane that uses the LSI SAS2X36 chipset in the near future, I've been told. Good thing this is still a lab experiment. Thanks very much for the invaluable help! -- Giovanni
Re: [zfs-discuss] zpool status output confusion
On 05/27/10 09:16 PM, Per Jorgensen wrote:
> Thanks for the quick responses, and yes, the history shows just what you said :( Is there a way I can get c9t8d0 out of the pool, or how do I get the pool back to optimal redundancy?

No, you will have to destroy the pool and start over. Or, if that isn't an option, attach a mirror device to c9t8d0.

--
Ian.
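To illustrate Ian's suggestion (the disk name c9t9d0 below is hypothetical; use whatever spare disk you actually have):

# turn the lone c9t8d0 top-level vdev into a two-way mirror
$ zpool attach blmpool c9t8d0 c9t9d0
$ zpool status blmpool    # c9t9d0 resilvers, then that vdev shows up as a mirror

This doesn't give you raidz2-level protection on that vdev, but it does remove the single point of failure.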
Re: [zfs-discuss] creating a fast ZIL device for $200
(resent because of mail problems)

Edward Ned Harvey wrote:
>> From: sensille [mailto:sensi...@gmx.net]
>> The only thing I'd like to point out is that ZFS doesn't do random writes on a slog, but nearly linear writes. This might even be hurting performance more than random writes, because you always hit the worst case of one full rotation.
>
> Um ... I certainly have a doubt about this. My understanding is that hard disks are already optimized for sustained sequential throughput. I have a really hard time believing Seagate, WD, etc. designed their drives such that you read/write one track, then pause and wait for a full rotation, then read/write one track, and wait again, and so forth. This would limit the drive to approx 50% duty cycle, and the market is very competitive. Yes, I am really quite sure, without any knowledge at all, that the drive mfgrs are intelligent enough to map the logical blocks in such a way that sequential reads/writes which are larger than a single track will not suffer such a huge penalty. Just a small penalty to jump up one track, and wait for a few degrees of rotation, not 360 degrees.

I'm afraid you got me wrong here. Of course the drives are optimized for sequential reads/writes. If you give the drive a single read or write that is larger than one track, the drive acts exactly as you described. The same holds if you give the drive multiple smaller consecutive reads/writes in advance (NCQ/TCQ) so that the drive can coalesce them into one big op. But this is not what happens in the case of ZFS/ZIL with a single application. The application requests a synchronous op. This request goes down into ZFS, which in turn allocates a ZIL block, writes it to the disk and issues a cache flush. Only after the cache flush completes can ZFS acknowledge the op to the application. Now the application can issue the next op, for which ZFS will again allocate a ZIL block, probably immediately after the previous one. It writes the block and issues a flush. But in the meantime the head has traveled some sectors down the track. To physically write the block, the drive of course has to wait until the sector is under the head again, which means waiting nearly one full rotation. If ZFS had chosen a block appropriately further down the track, the probability would have been high that the head had not yet passed it and could write without a big rotational delay.
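As a rough back-of-the-envelope illustration of the penalty being described (generic rotation-rate figures, not measurements of any particular drive):

# worst case: every synchronous ZIL write waits one full platter rotation
for rpm in 7200 10000 15000; do
  awk -v rpm=$rpm 'BEGIN { rot = 60000 / rpm;
    printf "%5d RPM: %.2f ms/rotation -> at most ~%d sync writes/s\n",
           rpm, rot, int(1000 / rot) }'
done

So even a 15k RPM drive tops out at roughly 250 single-threaded synchronous writes per second if each commit eats a full rotation, which is why choosing a block further down the track matters so much.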
Re: [zfs-discuss] creating a fast ZIL device for $200
(resent because of received bounce)

Edward Ned Harvey wrote:
>> From: sensille [mailto:sensi...@gmx.net]
>
> So this brings me back to the question I indirectly asked in the middle of a much longer previous email - Is there some way, in software, to detect the current position of the head? If not, then I only see two possibilities: Either you have some previous knowledge (or assumptions) about the drive geometry, rotation speed, and wall clock time passed since the last write completed, and use this (possibly vague or inaccurate) info to make your best guess what available blocks are accessible with minimum latency next ...

That is my approach currently, and it works quite well. I obtain the prior knowledge through a special measuring process run before first using the disk. To keep the driver in sync with the disk during idle times it issues dummy ops at regular intervals, say 20 per second.

> or else some sort of new hardware behavior would be necessary. Possibly a special type of drive, which always assumes a command to write to a magical block number actually means write to the next available block or something like that ... or reading from a magical block actually tells you the position of the head or something like that...

That would be nice. But what would be much nicer is a drive with an extremely small setup time. Current drives need the command 0.4-0.7 ms in advance, depending on manufacturer and drive type.
Re: [zfs-discuss] creating a fast ZIL device for $200
On 5/27/2010 10:33 AM, sensille wrote:
> That would be nice. But what would be much nicer is a drive with an extremely small setup time. Current drives need the command 0.4-0.7 ms in advance, depending on manufacturer and drive type.

Technology like the DDRdrive X1 (which is well beyond $200) doesn't have this problem. The setup times for that kind of hardware are measured in usec. (I.e. measured in PCI cycles.)

- Garrett
[zfs-discuss] Windows file versioning integration?
Hi all

Since Windows Server 2003 or so, Windows has had some file versioning support usable from the client side by checking a file's properties. Is it somehow possible to use this functionality with ZFS snapshots?

--
Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
r...@karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented intelligibly. It is an elementary imperative for all pedagogues to avoid excessive use of idioms of foreign origin. In most cases, adequate and relevant synonyms exist in Norwegian.
Re: [zfs-discuss] Windows file versioning integration?
On May 27, 2010, at 6:32 AM, Roy Sigurd Karlsbakk wrote:
> Is it somehow possible to use this functionality with ZFS snapshots?

Yes, there is some integration with VSS and snapshots. But a more complete and full-featured solution looks like: http://www.nexenta.com/corp/applications/delorean

-- richard

--
Richard Elling
rich...@nexenta.com +1-760-896-4422
ZFS and NexentaStor training, Rotterdam, July 13-15, 2010
http://nexenta-rotterdam.eventbrite.com/
Re: [zfs-discuss] Windows file versioning integration?
> Is it somehow possible to use this functionality with ZFS snapshots?

http://blogs.sun.com/amw/entry/using_the_previous_versions_tab ;)

--
iMx
i...@streamvia.net
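The blog post above describes the Previous Versions tab working against ZFS snapshots via the in-kernel CIFS server; a minimal sketch of the server side (dataset and share names are made up, and the SMB service must already be enabled):

# share a dataset over SMB and keep snapshots around; Windows clients can
# then browse older copies via Properties -> Previous Versions
$ zfs create -o sharesmb=name=public tank/public
$ zfs snapshot tank/public@monday
$ zfs snapshot tank/public@tuesday

Regular snapshots, for example from the zfs-auto-snapshot service, should work the same way.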
Re: [zfs-discuss] zfs/lofi/share panic
Hi Frank,

On 24/05/10 16:52 -0400, Frank Middleton wrote:
> Many many moons ago, I submitted a CR into bugs about a highly reproducible panic that occurs if you try to re-share a lofi-mounted image. That CR has AFAIK long since disappeared - I even forget what it was called. This server is used for doing network installs. Let's say you have a 64-bit iso lofi-mounted and shared. You do the install, and then wish to switch to a 32-bit iso. You unshare, umount, delete the loopback, and then lofiadm the new iso, mount it and then share it. Panic, every time. Is this such a rare use-case that no one is interested? I have the backtrace and cores if anyone wants them, although such were submitted with the original CR. This is pretty frustrating since you start to run out of ideas for mountpoint names after a while unless you forget and get the panic. FWIW (even on a freshly booted system after a panic):
>
> # lofiadm zyzzy.iso
> /dev/lofi/1
> # mount -F hsfs /dev/lofi/1 /mnt
> mount: /dev/lofi/1 is already mounted or /mnt is busy
> # mount -O -F hsfs /dev/lofi/1 /mnt
> # share /mnt
>
> If you unshare /mnt and then do this again, it will panic. This has been a bug since before OpenSolaris came out. It doesn't happen if the iso is originally on UFS, but UFS really isn't an option any more. FWIW the dataset containing the isos has the sharenfs attribute set, although it doesn't have to be actually mounted by any remote NFS client for this panic to occur. Suggestions for a workaround most welcome!

The bug (6798273) has been closed as incomplete with the following note: "I cannot reproduce any issue with the given testcase on b137." So you should test this with b137 or a newer build. There have been some extensive changes going into the treeclimb_* functions, so the bug is probably fixed, or will be in the near future. Let us know if you can still reproduce the panic on a recent build.

thanks
-jan
Re: [zfs-discuss] zfs/lofi/share panic
On 5/27/2010 2:45 PM, Jan Kryl wrote:
> Let us know if you can still reproduce the panic on a recent build.

I don't know if the code path is similar enough, but you should also try it like this:

# mount -F hsfs zyzzy.iso /mnt

For many builds now, (Open)Solaris hasn't needed the 'lofiadm' step for ISOs (and possibly other filesystem types that can be guessed). I now put ISOs (for installs, just like you) directly in my /etc/vfstab.

-Kyle
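For what it's worth, a vfstab entry for that kind of direct ISO mount might look like the following (the ISO path and mount point are just examples):

#device to mount                       device to fsck  mount point  FS type  fsck pass  mount at boot  options
/export/isos/sol-nv-b130-x86-dvd.iso   -               /isos/b130   hsfs     -          yes            ro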
Re: [zfs-discuss] zfs/lofi/share panic
Jan Kryl wrote:
> Let us know if you can still reproduce the panic on a recent build.

The most recent build available outside of Oracle is still 134, or am I missing something?

--
Carson
Re: [zfs-discuss] zfs/lofi/share panic
On 5/27/2010 12:21 PM, Carson Gaspar wrote:
> The most recent build available outside of Oracle is still 134, or am I missing something?

That's the latest binary build. It is possible to build something newer yourself, but doing so will take some unusual effort.

- Garrett
[zfs-discuss] nfs share of nested zfs directories?
I was wondering if there is a special option to share out a set of nested directories? Currently if I share out a directory with /pool/mydir1/mydir2 on a system, mydir1 shows up, and I can see mydir2, but nothing in mydir2. mydir1 and mydir2 are each a zfs filesystem, each shared with the proper sharenfs permissions. Did I miss a browse or traverse option somewhere? - Cassandra Unix Administrator From a little spark may burst a mighty flame. -Dante Alighieri
Re: [zfs-discuss] nfs share of nested zfs directories?
I share filesystems all the time this way, and have never had this problem. My first guess would be a problem with NFS or directory permissions. You are using NFS, right?

- Garrett
Re: [zfs-discuss] nfs share of nested zfs directories?
- Cassandra Pugh cp...@pppl.gov wrote:
> mydir1 and mydir2 are each a zfs filesystem, each shared with the proper sharenfs permissions.

is mydir2 on a separate filesystem/dataset?

--
Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
r...@karlsbakk.net
http://blogg.karlsbakk.net/
Re: [zfs-discuss] zfs/lofi/share panic
> FWIW (even on a freshly booted system after a panic):
> # lofiadm zyzzy.iso
> /dev/lofi/1
> # mount -F hsfs /dev/lofi/1 /mnt
> mount: /dev/lofi/1 is already mounted or /mnt is busy
> # mount -O -F hsfs /dev/lofi/1 /mnt
> # share /mnt
> If you unshare /mnt and then do this again, it will panic. This has been a bug since before OpenSolaris came out.

I just tried this with a UFS-based filesystem, just for a lark.

r...@aequitas:/# mkdir /testfs
r...@aequitas:/# mount -F ufs -o noatime,nologging /dev/dsk/c0d1s0 /testfs
r...@aequitas:/# ls -l /testfs/sol\-nv\-b130\-x86\-dvd.iso
-rw-r--r-- 1 root root 3818782720 Feb 5 16:02 /testfs/sol-nv-b130-x86-dvd.iso
r...@aequitas:/# lofiadm -a /testfs/sol-nv-b130-x86-dvd.iso
May 27 21:08:58 aequitas pseudo: pseudo-device: lofi0
May 27 21:08:58 aequitas genunix: lofi0 is /pseudo/l...@0
May 27 21:08:58 aequitas rootnex: xsvc0 at root: space 0 offset 0
May 27 21:08:58 aequitas genunix: xsvc0 is /x...@0,0
May 27 21:08:58 aequitas pseudo: pseudo-device: devinfo0
May 27 21:08:58 aequitas genunix: devinfo0 is /pseudo/devi...@0
/dev/lofi/1
r...@aequitas:/# mount -F hsfs -o ro /dev/lofi/1 /mnt
r...@aequitas:/# share -F nfs -o nosub,nosuid,sec=sys,ro,anon=0 /mnt

Then at a Sol 10 server:

# uname -a
SunOS jupiter 5.10 Generic_142900-11 sun4u sparc SUNW,Sun-Fire-480R
# dfshares aequitas
RESOURCE        SERVER    ACCESS  TRANSPORT
aequitas:/mnt   aequitas  -       -
# mount -F nfs -o bg,intr,nosuid,ro,vers=4 aequitas:/mnt /mnt
# ls /mnt
Copyright  autorun.inf  JDS-THIRDPARTYLICENSEREADME  autorun.sh  License  boot
README.txt  installer  Solaris_11  sddtool  Sun_HPC_ClusterTools
# umount aequitas:/mnt
# dfshares aequitas
RESOURCE        SERVER    ACCESS  TRANSPORT
aequitas:/mnt   aequitas  -       -

Then back at the snv_138 box I unshare and re-share and ... nothing bad happens.

r...@aequitas:/# unshare /mnt
r...@aequitas:/# share -F nfs -o nosub,nosuid,sec=sys,ro,anon=0 /mnt
r...@aequitas:/# unshare /mnt
r...@aequitas:/#

Guess I must now try this with a ZFS fs under that iso file.

--
Dennis Clarke
dcla...@opensolaris.ca - Email related to the open source Solaris
dcla...@blastwave.org - Email related to open source for Solaris
Re: [zfs-discuss] nfs share of nested zfs directories?
Cassandra,

Which Solaris release is this? This is working for me between a Solaris 10 server and an OpenSolaris client. Nested mount points can be tricky, and I'm not sure if you are looking for the mirror-mount feature (not available in the Solaris 10 release), where new directory contents become accessible on the client automatically. See the examples below.

Thanks,

Cindy

On the server:

# zpool create pool c1t3d0
# zfs create pool/myfs1
# cp /usr/dict/words /pool/myfs1/file.1
# zfs create -o mountpoint=/pool/myfs1/myfs2 pool/myfs2
# ls /pool/myfs1
file.1  myfs2
# cp /usr/dict/words /pool/myfs1/myfs2/file.2
# ls /pool/myfs1/myfs2/
file.2
# zfs set sharenfs=on pool/myfs1
# zfs set sharenfs=on pool/myfs2
# share
-  /pool/myfs1        rw
-  /pool/myfs1/myfs2  rw

On the client:

# ls /net/t2k-brm-03/pool/myfs1
file.1  myfs2
# ls /net/t2k-brm-03/pool/myfs1/myfs2
file.2
# mount -F nfs t2k-brm-03:/pool/myfs1 /mnt
# ls /mnt
file.1  myfs2
# ls /mnt/myfs2
file.2

On the server:

# touch /pool/myfs1/myfs2/file.3

On the client:

# ls /mnt/myfs2
file.2  file.3
Re: [zfs-discuss] Ideal SATA/SAS Controllers for ZFS
On Wed, May 26, 2010 at 6:09 PM, Giovanni Tirloni gtirl...@sysdroid.com wrote: On Wed, May 26, 2010 at 9:22 PM, Brandon High bh...@freaks.com wrote: I'd wager it's the PCIe x4. That's about 1000MB/s raw bandwidth, about 800MB/s after overhead. Makes perfect sense. I was calculating the bottlenecks using the full-duplex bandwidth, so the one-way bottleneck wasn't apparent. Actually both of you guys are wrong :-) The Supermicro X8DTi mobo and LSISAS9211-4i HBA are both PCIe 2.0 compatible, so the max theoretical PCIe x4 throughput is 4GB/s aggregate, or 2GB/s in each direction, well above the 800MB/s bottleneck observed by Giovanni. This bottleneck is actually caused by the backplane: Supermicro E1 chassis like Giovanni's (SC846E1) include port multipliers that degrade performance by putting 6 disks behind a single 3Gbps link. A single 3Gbps link provides in theory 300MB/s usable after 8b-10b encoding, but practical throughput numbers are closer to 90% of this figure, or 270MB/s. 6 disks per link means that each disk gets allocated 270/6 = 45MB/s. So with 18 disks striped, this gives a max usable throughput of 18*45 = 810MB/s, which matches exactly what Giovanni observed. QED! -mrb
Re: [zfs-discuss] gang blocks at will?
on 27/05/2010 07:11 Jeff Bonwick said the following:
> You can set metaslab_gang_bang to (say) 8k to force lots of gang block allocations.

Bill, Jeff, thanks a lot! This helped to reproduce the issue and find the bug. Just in case: http://www.freebsd.org/cgi/query-pr.cgi?pr=bin/144214

On May 25, 2010, at 11:42 PM, Andriy Gapon wrote:
> I am working on improving some ZFS-related bits in the FreeBSD boot chain. At the moment it seems that things work mostly fine except for a case where the boot code needs to read gang blocks. We have some reports from users about failures, but unfortunately their pools are not available for testing anymore and I cannot reproduce the issue at will. I am sure that the (Open)Solaris GRUB version has been properly tested, including the above environment. Could you please help me with ideas on how to create a pool/filesystem/file that would have gang blocks with high probability? Perhaps there are some pre-made test pool images available? Or some specialized tool? Thanks a lot!

--
Andriy Gapon
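In case it helps anyone else reproduce gang-block allocations, the tunable Jeff mentions can be set roughly like this on an OpenSolaris test box (the 8K value comes from his suggestion; the mdb incantation is a sketch, and this should only be done on a scratch system):

# on a live kernel (64-bit write; affects new allocations only):
echo 'metaslab_gang_bang/Z 0t8192' | mdb -kw

# or persistently via /etc/system, followed by a reboot:
# set zfs:metaslab_gang_bang = 8192

Blocks larger than that threshold are then (probabilistically) forced to be allocated as gang blocks, which makes the boot-code path easy to exercise.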
Re: [zfs-discuss] zpool status output confusion
On Thu, May 27, 2010 at 2:16 AM, Per Jorgensen p...@combox.dk wrote:
> Is there a way I can get c9t8d0 out of the pool, or how do I get the pool back to optimal redundancy?

It's not possible to remove vdevs right now. When the mythical bp_rewrite shows up, then you can. For now, the only thing you can do to save your pool is attach another disk (or two) as a mirror.

-B
--
Brandon High : bh...@freaks.com
Re: [zfs-discuss] nfs share of nested zfs directories?
On Thu, May 27, 2010 at 1:02 PM, Cassandra Pugh cp...@pppl.gov wrote:
> Currently if I share out a directory with /pool/mydir1/mydir2 on a system, mydir1 shows up, and I can see mydir2, but nothing in mydir2.

What kind of client are you mounting on? Linux clients don't properly follow nested exports.

-B
--
Brandon High : bh...@freaks.com
Re: [zfs-discuss] nfs share of nested zfs directories?
Some tips…

(1) Do a zfs mount -a and a zfs share -a, just in case something didn't get shared out correctly (though that's supposed to automatically happen, I think).

(2) The Solaris automounter (i.e. in a NIS environment) does not seem to automatically mount descendent filesystems (i.e. if the NIS automounter has a map for /public pointing to myserver:/mnt/zfs/public, but on myserver I create a descendent filesystem in /mnt/zfs/public/folder1, browsing to /public/folder1 on another computer will just show an empty directory all the time). If you're in that sort of environment, you need to add another map on NIS.

(3) Try using /net mounts. If you're not aware of how this works, you can browse to /net/<computer name> to see all the NFS mounts. On Solaris, /net *will* automatically mount descendent filesystems (unlike NIS).
Re: [zfs-discuss] nfs share of nested zfs directories?
On 5/27/2010 9:30 PM, Reshekel Shedwitz wrote:
> (2) The Solaris automounter (i.e. in a NIS environment) does not seem to automatically mount descendent filesystems [...]

The automounter behaves the same regardless of whether NIS is involved or not (or LDAP, for that matter). The automounter can be configured with local files, and that won't change its behavior. The behavior you're describing has been the behavior of all flavors of NFS since it was born, and it doesn't have anything to do with the automounter - it was by design. No automounter I'm aware of is capable of learning on its own that 'folder1' is a new filesystem (not a new directory) and mounting it. So this isn't limited to Solaris.

> If you're in that sort of environment, you need to add another map on NIS.

Your example doesn't specify whether /public is a direct or indirect mount; being in / kind of implies it's direct, and those mounts can be more limiting (more so in the past), so most admins avoid using the auto.direct map for these reasons. If the example were /import/public, with /import being defined by the auto.import map, then the solution to this problem is not an entirely new entry in the map for /import/public/folder1, but to convert the entry for /import/public into a hierarchical mount entry, explicitly specifying the folder1 sub-mount. A hierarchical mount can even mount folder1 from a different server than public came from. In the past (SunOS 4 and early Solaris timeframe) hierarchical mounts had some limitations (mainly issues with unmounting them) that made people wary of them. Most if not all of those have been eliminated. In general the Solaris automounter is very reliable and flexible and can be configured to do almost anything you want. Recent Linux automounters (autofs4??) have come very close to the Solaris ones, though earlier ones had some missing features, buggy features, and some different interpretations of the maps. But the issue described in this thread is not an automounter issue; it's a design issue of NFS - at least for all versions of NFS before v4. Version 4 has a feature that others have mentioned, called mirror mounts, that tries to pass along the information required for the client to re-create the sub-mount - even if the original fileserver mounted the sub-filesystem from another server! It's a cool feature, but NFSv4 support in clients isn't complete yet, so specifying the full hierarchical mount tree in the automount maps is still required (see the sketch after this message for what such a map entry looks like).

> (3) Try using /net mounts. [...] On Solaris, /net *will* automatically mount descendent filesystems (unlike NIS).

In general /net mounts are a bad idea. While the automounter will basically scan the output of 'showmount -e' for everything the server exports and mount it all, that's not exactly what you always want. It will only pick up sub-filesystems that are explicitly shared (which NFSv4 might also only do, I'm not sure), and it will miss branches of the tree if they are mounted from another server. Also, most automounters that I'm aware of will only mount all the exported filesystems at the time of the access to /net/hostname, and (unless it's unused long enough to be unmounted) will miss all changes in what is exported on the server until the mount is triggered again. On top of that, /net/hostname mounts encourage embedding the hostname of the server in config files, scripts, and binaries (-R path for shared libraries), and that's not good, since you then can't move a filesystem from one host to another: you need to maintain that /net/hostname path forever, or edit many files and recompile programs. (If I recall correctly, this was once used as one of the arguments against shared libraries by some.) Because of this, by using /net/hostname you give up one of the biggest benefits of the automounter - redirection. By making an auto.import map that has an entry for 'public', you allow yourself to clone public to a new server and modify the map to (over time, as it is unmounted and remounted) migrate the clients to the new server. Lastly, using /net also disables the load-sharing and failover abilities of read-only automounts, since you are by definition limiting yourself to one hostname.

That was longer than I expected, but hopefully it will help some. :)

-Kyle
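A hierarchical map entry of the kind described above might look roughly like this (the map name, server and paths are hypothetical):

# auto.import map (NIS or /etc/auto_import): key, then offset / location pairs
public  /          myserver:/mnt/zfs/public \
        /folder1   myserver:/mnt/zfs/public/folder1

With that entry, a client walking into /import/public/folder1 triggers both mounts, so the contents of the nested filesystem are visible without relying on mirror mounts.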
Re: [zfs-discuss] nfs share of nested zfs directories?
Brandon High wrote:
> What kind of client are you mounting on? Linux clients don't properly follow nested exports.

This behavior is not limited to Linux clients, nor to NFS shares. I've seen it with Windows (SMB) clients and CIFS shares. The CIFS version is referenced here: Nested ZFS Filesystems in a CIFS Share http://mail.opensolaris.org/pipermail/cifs-discuss/2008-June/000358.html http://bugs.opensolaris.org/view_bug.do?bug_id=6582165 Is there any commonality besides the observed behaviors?
Re: [zfs-discuss] Ideal SATA/SAS Controllers for ZFS
On Fri, May 28, 2010 at 00:56, Marc Bevand m.bev...@gmail.com wrote:
> Giovanni Tirloni gtirloni at sysdroid.com writes:
>> The chassis has 4 columns of 6 disks. The 18 disks I was testing were all on columns #1 #2 #3.
> Good, so this confirms my estimations. I know you said the current ~810 MB/s are amply sufficient for your needs. Spreading the 18 drives across all 4 port multipliers

The Supermicro SC846E1 cases don't contain multiple (SATA) port multipliers; they contain a single SAS expander, which shares bandwidth among the controllers and drives, so no column- or row-based limitations should be present. That backplane has two 8087 ports, IIRC: one labeled for the host, and one for a downstream chassis. I don't think there's actually any physical or logical difference between the upstream and downstream ports, so you might consider connecting two cables (ideally from two SAS controllers, with multipath) and seeing if that goes any faster.

Giovanni: When you say you saturated the system with a RAID-0 device, what do you mean? I think the suggested benchmark (read from all the disks independently, using dd or some other sequential-transfer mechanism like vdbench) would be more interesting in terms of finding the limiting bus bandwidth than a ZFS-based or hardware-RAID-based benchmark. Inter-disk synchronization, checksums and such can put a damper on ZFS performance, so simple read-sequentially-from-disk can often deliver surprising results. Note that such results aren't always useful (after all, the goal is to run ZFS on the hardware, not dd!), but they may indicate whether a certain component of the system is or is not to blame.

Will
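A minimal sketch of the raw-read test Will is describing (device names are placeholders; adjust the list and counts for the system under test):

# read sequentially from each disk in parallel, bypassing ZFS, and watch the
# aggregate throughput in another terminal with iostat
for d in c9t0d0 c9t1d0 c9t2d0 c9t3d0; do
  dd if=/dev/rdsk/${d}p0 of=/dev/null bs=1024k count=4096 &
done
wait
# meanwhile: iostat -xnz 5

If the per-disk numbers drop as more disks are added, the shared link (expander, HBA or PCIe slot) is the limit rather than the disks themselves.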