[zfs-discuss] make zfs(1M) use literals when displaying properties in scripted mode
As the topic says, this makes the zfs command use literals when it is asked to list things in scripted mode (i.e., zfs list -H). This is useful if you want the sizes of things as raw values. I have no ONNV systems to build this on, so I am unable to demonstrate it, but I would really like to see this (or something like it) integrated. Alternatively I could add a new flag to zfs list that toggles this behaviour. Comments? Suggestions?

diff -r fb422f16cbd0 usr/src/cmd/zfs/zfs_main.c
--- a/usr/src/cmd/zfs/zfs_main.c	Tue Sep 30 14:29:46 2008 -0700
+++ b/usr/src/cmd/zfs/zfs_main.c	Wed Oct 01 10:57:27 2008 +1000
@@ -1695,7 +1695,7 @@
 		right_justify = B_FALSE;
 		if (pl->pl_prop != ZPROP_INVAL) {
 			if (zfs_prop_get(zhp, pl->pl_prop, property,
-			    sizeof (property), NULL, NULL, 0, B_FALSE) != 0)
+			    sizeof (property), NULL, NULL, 0, scripted) != 0)
 				propstr = "-";
 			else
 				propstr = property;

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] make zfs(1M) use literals when displaying properties in scripted mode
A better solution (one that wouldn't break backwards compatibility) would be to add the '-p' option (parseable output) from 'zfs get' to the 'zfs list' command as well.

- Eric

On Wed, Oct 01, 2008 at 03:59:27PM +1000, David Gwynne wrote:
> As the topic says, this makes the zfs command use literals when it is
> asked to list things in scripted mode (i.e., zfs list -H). This is
> useful if you want the sizes of things as raw values. I have no ONNV
> systems to build this on, so I am unable to demonstrate it, but I would
> really like to see this (or something like it) integrated.
> Alternatively I could add a new flag to zfs list that toggles this
> behaviour. Comments? Suggestions?
> [...]

-- 
Eric Schrock, Fishworks			http://blogs.sun.com/eschrock
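For script consumers the difference is easy to show. Below is a minimal sketch of why parseable byte values matter, using fabricated sample lines in the NAME&lt;TAB&gt;USED shape that a parseable `zfs list -H -o name,used` would emit (the `-p` flag for `zfs list` is only being proposed in this thread, so the dataset names and byte counts are illustrative):

```shell
# Fabricated sample of parseable, tab-separated `zfs list` output:
# sizes as raw byte counts instead of "1.0G"-style human-readable values.
sample_output='pool1/home	1073741824
pool1/www	536870912
pool1/mail	268435456'

# Summing raw byte values is trivial; doing the same with "1.0G"/"512M"
# strings would require parsing unit suffixes first.
total=$(printf '%s\n' "$sample_output" | awk -F'\t' '{ sum += $2 } END { print sum }')
echo "total bytes: $total"   # prints "total bytes: 1879048192"
```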
[zfs-discuss] Weird ZFS recv / NFS export problem
Hello all,

In the setup I am trying to build, I want snapshots of a file system replicated from host replsource to host repltarget, and from there NFS-mounted on host nfsclient to access the snapshots directly:

replsource# zfs create pool1/nfsw
replsource# mkdir /pool1/nfsw/lala
replsource# zfs snapshot pool1/[EMAIL PROTECTED]
replsource# zfs send pool1/[EMAIL PROTECTED] | \
    ssh repltarget zfs receive -d pool1

(A pool1 exists on repltarget as well.)

repltarget# zfs set sharenfs=ro=nfsclient pool1/nfsw

nfsclient# mount repltarget:/pool1/nfsw/.zfs/snapshot /mnt/nfsw/
nfsclient# cd /mnt/nfsw/snap1
nfsclient# access ./lala
access(./lala, R_OK | X_OK) == 0

So far, so good. But now I see the following:

(Wait a bit, for instance 3 minutes, then replicate another snapshot.)

replsource# zfs snapshot pool1/[EMAIL PROTECTED]
replsource# zfs send -i pool1/[EMAIL PROTECTED] pool1/[EMAIL PROTECTED] | \
    ssh repltarget zfs receive pool1/nfsw

(The PWD of the shell on nfsclient is still /mnt/nfsw/snap1.)

nfsclient# access ./lala
access(./lala, R_OK | X_OK) == -1

(If you think that is surprising, watch this:)

nfsclient# ls /mnt/nfsw
snap1 snap2
nfsclient# access ./lala
access(./lala, R_OK | X_OK) == 0

The access program does exactly the access(2) call illustrated in its output. The weird thing is that a directory can be accessed, then not accessed after the exported file system on repltarget has been updated by a zfs recv, then again be accessed after an ls of the mounted directory. In a snoop I see that, when the access(2) fails, the nfsclient gets a "Stale NFS file handle" response, which gets translated to an ENOENT.

My problem is that the application accessing the contents inside the NFS-mounted snapshot cannot find the content any more after the filesystem on repltarget has been updated. Is this a known problem? More importantly, is there a known workaround?

All machines are running SunOS 5.10 Generic_127128-11 i86pc. If more information would be helpful, I'll gladly provide it.
Regards, Juergen.
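The `access` program in the transcript is just a thin wrapper around access(2). The same probe can be approximated from the shell with test(1), which checks the same R_OK/X_OK permissions; a sketch using a throwaway local directory as a stand-in for the NFS-mounted one:

```shell
# Approximate `access ./lala` (R_OK | X_OK) with test(1).
# A throwaway local directory stands in for the NFS-mounted snapshot dir.
dir=$(mktemp -d)
mkdir "$dir/lala"

# test -r / test -x perform the same readability/searchability probe
# that access(2) with R_OK | X_OK does.
if [ -r "$dir/lala" ] && [ -x "$dir/lala" ]; then
    result=0
else
    result=-1
fi
echo "access(./lala, R_OK | X_OK) == $result"

rm -rf "$dir"
```

Running this in a loop against the real mount, before and after the zfs recv, reproduces the 0 / -1 / 0 sequence from the transcript.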
Re: [zfs-discuss] Weird ZFS recv / NFS export problem
Jürgen,

> In a snoop I see that, when the access(2) fails, the nfsclient gets a
> "Stale NFS file handle" response, which gets translated to an ENOENT.

What happens if you use the noac NFS mount option on the client? I wouldn't recommend using it in production environments unless you really need to, but this looks like an NFS client caching issue.

Is this an NFSv3 or NFSv4 mount? What happens if you use one or the other? Please provide nfsstat -m output.

Nils
[zfs-discuss] zpool unimportable (corrupt zpool metadata??) but no zdb -l device problems
Hi,

I am running snv_90. I have a pool that is 6x1TB, configured as raidz. After a computer crash (root is NOT on the pool, only data), the pool showed FAULTED status. I exported it and tried to reimport it, with the following result:

# zpool import
  pool: ztank
    id: 12125153257763159358
 state: FAULTED
status: The pool metadata is corrupted.
action: The pool cannot be imported due to damaged devices or data.
        The pool may be active on another system, but can be imported
        using the '-f' flag.
   see: http://www.sun.com/msg/ZFS-8000-72
config:

        ztank       FAULTED  corrupted data
          raidz1    ONLINE
            c1t6d0  ONLINE
            c1t5d0  ONLINE
            c1t4d0  ONLINE
            c1t3d0  ONLINE
            c1t2d0  ONLINE
            c1t1d0  ONLINE

I searched Google and ran zdb -l for every pool device. Results follow below... to me it appears that all disks are OK and zdb can see the zpool structure on each of them (at least this is how I interpret the messages), yet zpool still says the zpool metadata is corrupt :-(

Any ideas as to what I might be able to do to salvage the data? Restoring from backup is not an option (yes, I know :() - as this is a personal project, I had hoped the raidz would be enough :-(

The output for each of the disks is more or less identical; all labels are accessible.
# zdb -l /dev/dsk/c1t6d0s0

LABEL 0

    version=10
    name='ztank'
    state=0
    txg=207161
    pool_guid=12125153257763159358
    hostid=628051022
    hostname='zfssrv'
    top_guid=763279656890868029
    guid=10947029755543026189
    vdev_tree
        type='raidz'
        id=0
        guid=763279656890868029
        nparity=1
        metaslab_array=14
        metaslab_shift=35
        ashift=9
        asize=6001149345792
        is_log=0
        children[0]
            type='disk' id=0 guid=10947029755543026189
            path='/dev/dsk/c1t1d0s0' devid='id1,[EMAIL PROTECTED]/a'
            phys_path='/[EMAIL PROTECTED],0/pci1000,[EMAIL PROTECTED]/[EMAIL PROTECTED],0:a'
            whole_disk=1 DTL=193
        children[1]
            type='disk' id=1 guid=2640926618230776740
            path='/dev/dsk/c1t2d0s0' devid='id1,[EMAIL PROTECTED]/a'
            phys_path='/[EMAIL PROTECTED],0/pci1000,[EMAIL PROTECTED]/[EMAIL PROTECTED],0:a'
            whole_disk=1 DTL=192
        children[2]
            type='disk' id=2 guid=8982722125061616789
            path='/dev/dsk/c1t3d0s0' devid='id1,[EMAIL PROTECTED]/a'
            phys_path='/[EMAIL PROTECTED],0/pci1000,[EMAIL PROTECTED]/[EMAIL PROTECTED],0:a'
            whole_disk=1 DTL=191
        children[3]
            type='disk' id=3 guid=7263648809970512976
            path='/dev/dsk/c1t4d0s0' devid='id1,[EMAIL PROTECTED]/a'
            phys_path='/[EMAIL PROTECTED],0/pci1000,[EMAIL PROTECTED]/[EMAIL PROTECTED],0:a'
            whole_disk=1 DTL=190
        children[4]
            type='disk' id=4 guid=5275414937202266822
            path='/dev/dsk/c1t5d0s0' devid='id1,[EMAIL PROTECTED]/a'
            phys_path='/[EMAIL PROTECTED],0/pci1000,[EMAIL PROTECTED]/[EMAIL PROTECTED],0:a'
            whole_disk=1 DTL=189
        children[5]
            type='disk' id=5 guid=8503895341004279533
            path='/dev/dsk/c1t6d0s0' devid='id1,[EMAIL PROTECTED]/a'
            phys_path='/[EMAIL PROTECTED],0/pci1000,[EMAIL PROTECTED]/[EMAIL PROTECTED],0:a'
            whole_disk=1 DTL=188

LABEL 1

    version=10
    name='ztank'
    state=0
    txg=207161
    pool_guid=12125153257763159358
    hostid=628051022
    hostname='zfssrv'
    top_guid=763279656890868029
    guid=10947029755543026189
    vdev_tree
        type='raidz'
        id=0
        guid=763279656890868029
        nparity=1
        metaslab_array=14
        metaslab_shift=35
        ashift=9
        asize=6001149345792
        is_log=0
        children[0]
            type='disk' id=0 guid=10947029755543026189
            path='/dev/dsk/c1t1d0s0' devid='id1,[EMAIL PROTECTED]/a'
            phys_path='/[EMAIL PROTECTED],0/pci1000,[EMAIL PROTECTED]/[EMAIL PROTECTED],0:a'
            whole_disk=1 DTL=193
        children[1]
            type='disk' id=1
Re: [zfs-discuss] Quantifying ZFS reliability
>>> On Tue, 30 Sep 2008, Robert Thurlow wrote:
>>>> Modern NFS runs over a TCP connection, which includes its own data
>>>> validation. This surely helps.
>>>
>>> Less than we'd sometimes like :-) The TCP checksum isn't very strong,
>>> and we've seen corruption tied to a broken router, where the Ethernet
>>> checksum was recomputed on bad data, and the TCP checksum didn't
>>> help. It sucked.
>>
>> TCP does not see the router. The TCP and ethernet checksums are at
>> completely different levels. Routers do not pass ethernet packets;
>> they pass IP packets. Your statement does not make technical sense.

I think he was referring to a broken VLAN switch. But even then, any active component will take bits from the wire, check the MAC, change what needs changing, and redo the MAC and other checksums that needed changes. The whole packet lives in the memory of the switch/router, and if that memory is broken the packet will be sent damaged.

Casper
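The point about the TCP checksum being weak can be made concrete: it is a 16-bit ones'-complement sum over the segment's 16-bit words, so any corruption that merely swaps whole 16-bit words (as reordered or misassembled buffers in a broken switch can do) yields an identical checksum. A small sketch, with shell arithmetic standing in for the real header computation:

```shell
# Ones'-complement 16-bit sum, as TCP computes it over 16-bit words.
cksum16() {
    sum=0
    for w in "$@"; do
        sum=$(( sum + w ))
        # fold any carry out of bit 15 back in (ones'-complement addition)
        sum=$(( (sum & 0xFFFF) + (sum >> 16) ))
    done
    echo $(( (~sum) & 0xFFFF ))
}

# Two different payloads: the same two words in swapped order.
a=$(cksum16 0xABCD 0x1234)
b=$(cksum16 0x1234 0xABCD)
echo "checksum A = $a, checksum B = $b"   # identical despite different data
```

Addition is commutative, so the corruption goes completely undetected at the TCP layer - which is why the block-level checksums in ZFS (and krb5i/IPsec on the wire) still matter.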
Re: [zfs-discuss] Quantifying ZFS reliability
[EMAIL PROTECTED] wrote:
> I think he was referring to a broken VLAN switch. But even then, any
> active component will take bits from the wire, check the MAC, change
> what needs changing, and redo the MAC and other checksums that needed
> changes. The whole packet lives in the memory of the switch/router, and
> if that memory is broken the packet will be sent damaged.

Which is why you need a strong end-to-end network checksum for iSCSI. I recommend that IPsec AH (at the least, but in many cases ESP) be deployed. If you care enough about your data to set checksum=sha256 for the ZFS datasets, then make sure you care enough to set up IPsec and use HMAC-SHA256 for on-the-wire integrity protection too.

-- 
Darren J Moffat
Re: [zfs-discuss] zpool unimportable (corrupt zpool metadata??) but no zdb -l device problems
An update to the above: I tried to run zdb -e on the pool id, and here's the result:

# zdb -e 12125153257763159358
zdb: can't open 12125153257763159358: I/O error

NB: zdb seems to recognize the ID, because running it with an incorrect ID gives a different error:

# zdb -e 12125153257763159354
zdb: can't open 12125153257763159354: No such file or directory

Also, zdb -e with the ID of the syspool works:

# zdb -e 8843238790372298114
Uberblock
        magic = 00bab10c
        version = 10
        txg = 317369
        guid_sum = 14131844542001965925
        timestamp = 1222857640 UTC = Wed Oct  1 12:40:40 2008

Dataset mos [META], ID 0, cr_txg 4, 2.76M, 244 objects
Dataset 8843238790372298114/export/home [ZPL], ID 60, cr_txg 721, 1.21G, 55 objects
Dataset 8843238790372298114/export [ZPL], ID 54, cr_txg 718, 19.0K, 5 objects
Dataset 8843238790372298114/swap [ZVOL], ID 28, cr_txg 15, 519M, 3 objects
Dataset 8843238790372298114/ROOT/snv_90 [ZPL], ID 48, cr_txg 710, 6.85G, 254748 objects
Dataset 8843238790372298114/ROOT [ZPL], ID 22, cr_txg 12, 18.0K, 4 objects
Dataset 8843238790372298114/dump [ZVOL], ID 34, cr_txg 18, 512M, 3 objects
Dataset 8843238790372298114 [ZPL], ID 5, cr_txg 4, 39.5K, 13 objects

etc. etc.

Any ideas? Could this be a hardware problem? I have no idea what to do next :-(

Thanks for your help!
Vasile
Re: [zfs-discuss] ZFS, NFS and Auto Mounting
On Wed, Oct 1, 2008 at 3:42 AM, Douglas R. Jones [EMAIL PROTECTED] wrote:
> ...
> 3) Next I created another file system called dpool/GroupWS/Integration.
> Its mount point was inherited from GroupWS and is
> /mnt/zfs1/GroupWS/Integration. Essentially I only allowed the new file
> system to inherit from its parent.
> 4) I changed the auto.ws map thusly:
> Integration  chekov:/mnt/zfs1/GroupWS/
> Upgrades     chekov:/mnt/zfs1/GroupWS/
> cstools      chekov:/mnt/zfs1/GroupWS/
> com          chekov:/mnt/zfs1/GroupWS
>
> Now the odd behavior. You will notice that the directories Upgrades and
> cstools are just that: directories in GroupWS. You can cd /ws/cstools
> from any server without a problem. Perform an ls and you see what you
> expect to see. Now the rub. If on chekov one does a cd /ws/Integration,
> you end up in chekov:/mnt/zfs1/GroupWS/Integration and everything is
> great. Do a cd to /ws/com and everything is fine. You can do a cd to
> Integration and everything is fine. But. If you go to another server
> and do a cd /ws/Integration, all is well. However, if you do a cd to
> /ws/com and then a cd Integration, Integration is EMPTY!! Any ideas?

Well, I guess you're running Solaris 10 and not OpenSolaris/SXCE. I think the term is "mirror mounts"; it works just fine on my SXCE boxes.

Until then, the way we got around this was to not make the new filesystem a child. So instead of:

/mnt/zfs1/GroupWS
/mnt/zfs1/GroupWS/Integration

create

/mnt/zfs1/GroupWS
/mnt/zfs1/Integration

and use that for the Integration mountpoint. Then in GroupWS, 'ln -s ../Integration .'. That way, if you look at Integration in /ws/com, you get to something that exists.

-- 
-Peter Tribble
http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/
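The workaround layout can be sketched with plain directories standing in for the ZFS datasets (the real setup would use `zfs create` with explicit mountpoints; the mktemp root below is purely illustrative):

```shell
# Integration as a *sibling* of GroupWS rather than a child dataset,
# with a relative symlink inside GroupWS pointing across to it.
root=$(mktemp -d)
mkdir "$root/GroupWS" "$root/Integration"

# the 'ln -s ../Integration .' step, run from inside GroupWS:
( cd "$root/GroupWS" && ln -s ../Integration . )

# A client that lands in GroupWS (e.g. via /ws/com) and does
# `cd Integration` now follows the relative link to the sibling
# mount instead of descending into an empty stub directory.
readlink "$root/GroupWS/Integration"   # prints "../Integration"

rm -rf "$root"
```

The key detail is that the link is relative, so it resolves on the NFS client within whatever path the automounter served, rather than pointing at a server-local absolute path.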
Re: [zfs-discuss] Quantifying ZFS reliability
Tim [EMAIL PROTECTED] wrote:
>> Hmm ... well, there is a considerable price difference, so unless
>> someone says I'm horribly mistaken, I now want to go back to Barracuda
>> ES 1TB 7200 drives. By the way, how many of those would saturate a
>> single (non-trunked) Gig ethernet link? Workload: NFS sharing of
>> software and homes. I think 4 disks should be about enough to saturate
>> it?
>
> SAS has far greater performance, and if your workload is extremely
> random, will have a longer MTBF. SATA drives suffer badly on random
> workloads.

The SATA Barracuda ST310003 I recently bought has a quoted MTBF of 136 years. If you find yourself comparing MTBF values in the 100-year range, you are probably doing something wrong.

Jörg

-- 
EMail: [EMAIL PROTECTED] (home) Jörg Schilling D-13353 Berlin
       [EMAIL PROTECTED] (uni)  [EMAIL PROTECTED] (work)
Blog:  http://schily.blogspot.com/
URL:   http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
Re: [zfs-discuss] Oracle DB sequential dump questions
Louwtjie Burger [EMAIL PROTECTED] wrote:
> Server: T5120 on 10 U5
> Storage: Internal 8 drives on SAS HW RAID (R5)
> Oracle: ZFS fs, recordsize=8K and atime=off
> Tape: LTO-4 (half height) on SAS interface.
>
> Dumping a large file from memory using tar to LTO yields 44 MB/s ... I
> suspect the CPU cannot push more since it's a single thread doing all
> the work.

What is the native speed of the LTO? And if you are talking about tar, it is unclear which tar implementation you are referring to. Sun tar is not very fast; GNU tar is not very fast. Star is optimized for best speed, so I recommend checking star.

The standard blocksize of tar (10 kB) is not optimal for tape drives. If you want speed and best portability of the tapes, use a block size of 63 kB; if you want best speed, use 256 kB as the blocksize. I recommend using:

star -c -time bs=256k f=/dev/rmt/ files...

Star should be able to give you the native LTO speed.

Jörg
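The blocking-factor effect described above can be seen with any tar: the factor is counted in 512-byte units, so `-b 126` gives 63 kB records and `-b 512` gives 256 kB records, and the archive is always padded out to whole records. A sketch writing to a regular file instead of a tape (GNU or BSD tar assumed; star's `bs=256k` is the equivalent spelling):

```shell
# Create a small test file and archive it with a 256 kB blocking factor:
# 512 blocks * 512 bytes = 262144 bytes per record.
workdir=$(mktemp -d)
dd if=/dev/zero of="$workdir/data" bs=1024 count=10 2>/dev/null

tar -b 512 -cf "$workdir/out.tar" -C "$workdir" data

# The archive is padded to a whole number of 256 kB records, which is
# what keeps a streaming tape drive fed with full, fixed-size writes.
size=$(wc -c < "$workdir/out.tar")
echo "archive size: $size bytes ($(( size / 262144 )) record(s))"
```

On a real LTO drive the large fixed record size is what prevents the drive from shoe-shining between small writes.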
Re: [zfs-discuss] Segmentation fault / core dump with recursive send/recv
Bob Friesenhahn [EMAIL PROTECTED] wrote:
> On Tue, 30 Sep 2008, BJ Quinn wrote:
>> True, but a search for "zfs segmentation fault" returns 500 bugs. It's
>> possible one of those is related to my issue, but it would take all
>> day to find out. If it's not flaky or unstable, I'd like to try
>> upgrading to the newest kernel first, unless my Linux mindset is truly
>> out of place here, or if it's not relatively easy to do. Are these
>> kernels truly considered stable? How would I upgrade?
>
> Linux and Solaris are quite different when it comes to kernel
> strategies. Linux documents and stabilizes its kernel interfaces

Linux does not implement stable kernel interfaces. It may be that there is an intention to do so, but I've seen problems on Linux resulting from self-incompatibility on a regular basis.

Jörg
Re: [zfs-discuss] Segmentation fault / core dump with recursive send/recv
The next stable (as in Fedora or Ubuntu releases) OpenSolaris version will be 2008.11. In my case I found 2008.05 simply unusable (my main interest is xen/xVM), but upgrading to the latest available build with OpenSolaris's pkg (similar to apt-get) fixed the problem. If you installed the original OS 2008.05, upgrading is somewhat harder because it requires some additional steps (see the OpenSolaris website for details). Once you're running a current build, upgrading is just a simple command. In OpenSolaris, when you upgrade, you keep your old version as well, so you can easily roll back if something goes wrong.

On 10/1/08, BJ Quinn [EMAIL PROTECTED] wrote:
> True, but a search for "zfs segmentation fault" returns 500 bugs. It's
> possible one of those is related to my issue, but it would take all day
> to find out. If it's not flaky or unstable, I'd like to try upgrading
> to the newest kernel first, unless my Linux mindset is truly out of
> place here, or if it's not relatively easy to do. Are these kernels
> truly considered stable? How would I upgrade?
Re: [zfs-discuss] Quantifying ZFS reliability
On Wed, Oct 01, 2008 at 01:03:28AM +0200, Ahmed Kamal wrote:
> Hmm ... well, there is a considerable price difference, so unless
> someone says I'm horribly mistaken, I now want to go back to Barracuda
> ES 1TB 7200 drives. By the way, how many of those would saturate a
> single (non-trunked) Gig ethernet link? Workload: NFS sharing of
> software and homes. I think 4 disks should be about enough to saturate
> it?

You keep mentioning that you plan on using NFS, and everyone seems to keep ignoring the fact that in order to make NFS performance reasonable you're really going to want a couple of very fast slog devices. Since I don't have the right amount of money to afford a very fast slog device, I can't speak to which one has the best price/performance ratio, but there are tons of options out there.

-brian

-- 
"Coding in C is like sending a 3 year old to do groceries. You gotta
tell them exactly what you want or you'll end up with a cupboard full of
pop tarts and pancake mix." -- IRC User (http://www.bash.org/?841435)
Re: [zfs-discuss] Oracle DB sequential dump questions
Carson Gaspar [EMAIL PROTECTED] wrote:
> Louwtjie Burger wrote:
>> Dumping a large file from memory using tar to LTO yields 44 MB/s ... I
>> suspect the CPU cannot push more since it's a single thread doing all
>> the work.
>>
>> Dumping oracle db files from the filesystem yields ~25 MB/s. The
>> interesting bit (apart from it being a rather slow speed) is the fact
>> that the speed fluctuates on the disk side but stays constant to the
>> tape. I see spikes of up to 50-60 MB/s over 5 seconds, while the tape
>> continues to push its steady 25 MB/s.
>
> Does your tape drive compress (most do)? If so, you may be seeing
> compressible vs. incompressible data effects.

HW compression in the tape drive usually increases the speed of the drive.

Jörg
Re: [zfs-discuss] Quantifying ZFS reliability
David Magda [EMAIL PROTECTED] wrote:
> On Sep 30, 2008, at 19:09, Tim wrote:
>> SAS has far greater performance, and if your workload is extremely
>> random, will have a longer MTBF. SATA drives suffer badly on random
>> workloads.
>
> Well, since you can probably afford more SATA drives for the purchase
> price, you can put them in a striped-mirror setup, and that may help
> things. If your disks are cheap you can afford to buy more of them
> (space, heat, and power notwithstanding).

SATA and SAS disks are usually based on the same drive mechanism, so the seek times are most likely identical. Some SATA disks support tagged command queueing and others do not. I would assume that there is no speed difference between SATA with command queueing and SAS.

Jörg
Re: [zfs-discuss] Quantifying ZFS reliability
Toby Thain wrote:
> ZFS allows the architectural option of separate storage without losing
> end-to-end protection, so the distinction is still important. Of course
> this means ZFS itself runs on the application server, but so what?

The OP in question is not running his network clients on Solaris or OpenSolaris or FreeBSD or MacOS X, but rather a collection of Linux workstations. Unless there's been a recent port of ZFS to Linux, that makes for a big "what". Given that NFS, as implemented in his client systems, provides no end-to-end reliability, the only data protection ZFS has any control over begins after the write() is issued by the NFS server process.

--Joe
Re: [zfs-discuss] Quantifying ZFS reliability
Ian Collins wrote:
> I think you'd be surprised how large an organisation can migrate most,
> if not all, of their application servers to zones on one or two
> Thumpers. Isn't that the reason for buying in server appliances?

That assumes the application servers can coexist in the only 16GB available on a Thumper, and the only 8GHz of total CPU core speed, and it ignores the fact that the system controller is a massive single point of failure for both the applications and the storage. You may have a different opinion as to what a large organization is, but the reality is that the Thumper series is good for some things in a large enterprise, and not good for others.

--Joe
Re: [zfs-discuss] Quantifying ZFS reliability
On Wed, Oct 1, 2008 at 8:52 AM, Brian Hechinger [EMAIL PROTECTED] wrote:
> You keep mentioning that you plan on using NFS, and everyone seems to
> keep ignoring the fact that in order to make NFS performance reasonable
> you're really going to want a couple of very fast slog devices.

+1 for the slog devices - make them 15k RPM SAS.

Also, the OP has not stated how his Linux clients intend to use this fileserver. In particular, we need to understand how many IOPS (I/O ops/sec) are required, whether the typical workload is sequential (large or small file) or random, and the percentage of read to write operations. Often a mix of different ZFS configs is required to provide a complete and flexible solution. Here is a rough generalization:

- For large-file sequential I/O with high reliability, go raidz2 with 6 disks minimum and use SATA disks.
- For workloads with random I/O patterns where you need lots of IOPS, use a ZFS multi-way mirror and 15k RPM SAS disks. For example, a 3-way mirror will distribute reads across 3 drives, so you'll see 3x (single disk) IOPS for reads and 1x IOPS for writes. Consider 4-or-more-way mirrors for heavy random-read workloads.

Usually it makes sense to configure more than one ZFS pool and then use the zpool that is appropriate for each specific workload. This diversity also future-proofs your fileserver, because it's very difficult to predict how your usage patterns will change a year down the road [1]. Bear in mind that, in the future, you may wish to replace disks with SSDs (or add SSDs) to this fileserver when the pricing is more reasonable. So only spend what you absolutely need to meet today's requirements. You can always push newer/bigger/better/faster devices in down the road, which gives you a more flexible fileserver as your needs evolve. This is a huge strength of ZFS.

Feel free to email me off-list if you want more specific recommendations.

[1] On a 10-disk system we have:
a) a 5-disk RAIDZ pool
b) a 3-way mirror (pool)
c) a 2-way mirror (pool)
If I were to do it again, I'd make a) a 6-disk RAIDZ2 config to take advantage of the higher reliability provided by that config.

Regards,
-- 
Al Hopper Logical Approach Inc, Plano, TX [EMAIL PROTECTED]
Voice: 972.379.2133 Timezone: US CDT
OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007
http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/
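The mirror generalization above works out as simple arithmetic. The per-disk figure below is an assumption (roughly what a 15k RPM SAS drive sustains for small random I/O), not a number from the thread:

```shell
# Back-of-the-envelope random IOPS for an N-way ZFS mirror.
# per_disk is an assumed figure for a single 15k RPM SAS drive.
per_disk=180

for n in 2 3 4; do
    read_iops=$(( n * per_disk ))   # reads are distributed across all N sides
    write_iops=$per_disk            # every write must go to every side
    echo "${n}-way mirror: ~${read_iops} read IOPS, ~${write_iops} write IOPS"
done
```

So widening a mirror scales read throughput nearly linearly while write IOPS stay flat, which is why multi-way mirrors suit read-heavy random workloads and raidz/raidz2 suits sequential ones.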
Re: [zfs-discuss] Quantifying ZFS reliability
Moore, Joe wrote:
> Given the fact that NFS, as implemented in his client systems, provides
> no end-to-end reliability, the only data protection that ZFS has any
> control over is after the write() is issued by the NFS server process.

NFS can provide on-the-wire protection if you enable Kerberos support. There are usually three Kerberos options: krb5 (sometimes called krb5a), which is authentication only; krb5i, which is authentication plus integrity provided by the RPCSEC_GSS layer; and krb5p, which is authentication plus integrity plus encrypted data. I have personally seen krb5i NFS mounts catch problems when a router was causing failures that the TCP checksum didn't catch.

-- 
Darren J Moffat
Re: [zfs-discuss] Quantifying ZFS reliability
On Wed, Oct 1, 2008 at 9:34 AM, Moore, Joe [EMAIL PROTECTED] wrote:
> You may have a different opinion as to what a large organization is,
> but the reality is that the Thumper series is good for some things in a
> large enterprise, and not good for others.

Agreed. My biggest issue with the Thumper is that all the disks are 7,200RPM SATA and have limited IOPS. I'd like to see Thumper configurations offered that allow a user-chosen mixture of SAS and SATA drives with 7,200 and 15k RPM spindle speeds. And yes, I agree: you need as much RAM in the box as you can afford. ZFS loves lots and lots of RAM, and your users will love the performance that large-memory ZFS boxes provide. Didn't they just offer a Thumper with more RAM recently?

-- 
Al Hopper
[zfs-discuss] Sidebar re ABI stability (was Segmentation fault / core dump)
[EMAIL PROTECTED] wrote:
> Linux does not implement stable kernel interfaces. It may be that there
> is an intention to do so, but I've seen problems on Linux resulting
> from self-incompatibility on a regular basis.

To be precise, Linus tries hard to prevent ABI changes in the system call interfaces exported from the kernel, but the glibc team has defeated him in the past. For example, they accidentally started returning ENOTSUP from getgid when one had a library version mismatch (!). Sun stabilizes both library and system call interfaces: I used to work on that with David J. Brown's team, back when I was an employee.

--dave (who's a contractor) c-b
-- 
David Collier-Brown    | Always do right. This will gratify
Sun Microsystems, Toronto | some people and astonish the rest
[EMAIL PROTECTED]         | -- Mark Twain
cell: (647) 833-9377, bridge: (877) 385-4099 code: 506 9191#
Re: [zfs-discuss] Quantifying ZFS reliability
Darren J Moffat wrote:
> NFS can provide on-the-wire protection if you enable Kerberos support
> (krb5 is authentication only; krb5i adds integrity via the RPCSEC_GSS
> layer; krb5p adds encrypted data). I have personally seen krb5i NFS
> mounts catch problems when a router was causing failures that the TCP
> checksum didn't catch.

No doubt, additional layers of data protection are available. I don't know the state of RPCSEC on Linux, so I can't comment on that; certainly your experience brings valuable insight into this discussion. It is also recommended (when iSCSI is an appropriate transport) to run over IPsec in ESP mode to ensure data-packet-content consistency. Certainly NFS over IPsec/ESP would be more resistant to on-the-wire corruption. Either of these would give better data reliability than plain NFS, just as ZFS on the back end gives better data reliability than, for example, UFS or ext3.

--Joe
Re: [zfs-discuss] Quantifying ZFS reliability
On 10/01/08 10:46, Al Hopper wrote: On Wed, Oct 1, 2008 at 9:34 AM, Moore, Joe [EMAIL PROTECTED] wrote: Ian Collins wrote: I think you'd be surprised how large an organisation can migrate most, if not all, of their application servers to zones on one or two Thumpers. Isn't that the reason for buying in server appliances? Assuming that the application servers can coexist in the only 16GB available on a Thumper, and the only 8GHz of CPU core speed, and the fact that the system controller is a massive single point of failure for both the applications and the storage. You may have a difference of opinion as to what a large organization is, but the reality is that the Thumper series is good for some things in a large enterprise, and not good for others. Agreed. My biggest issue with the Thumper is that all the disks are 7,200RPM SATA and have limited IOPS. I'd like to see the Thumper configurations offered allowing a user-chosen mixture of SAS and SATA drives with 7,200 and 15K RPM spindle speeds. And yes - I agree - you need as much RAM in the box as you can afford; ZFS loves lots and lots of RAM and your users will love the performance that large-memory ZFS boxes provide. Didn't they just offer a Thumper with more RAM recently? The X4540 has twice the DIMM slots and number of cores. It also uses an LSI disk controller. Still 48 SATA disks @ 7200 rpm. You can build a thumper using any rack mount server you like and the J4200/J4400 JBOD arrays. Then you can mix and match drive types (SATA and SAS). The server portion could have as many as 16/32 cores and 32/64 DIMM slots (the X4450/X4640). You'll use up a little more rack space but the drives will be serviceable without shutting down the system. I think Thumper/Thor fills a specific role (maximum disk density in a minimum chassis). I doubt that it will change much.
-- Matt Sweeney Systems Engineer Sun Microsystems 585-368-5930 x29097 desk 585-727-0573 cell
Re: [zfs-discuss] ZSF Solaris
On Tue, 30 Sep 2008, Al Hopper wrote: I *suspect* that there might be something like a hash table that is degenerating into a singly linked list as the root cause of this issue. But this is only my WAG. That seems to be a reasonable conclusion. BTW, my million-file test directory uses this sort of file naming, but it has only been written once. When making data multi-access safe, it is often easiest to mark old data entries as unused while retaining the allocation. At some later time, when it is convenient to do so, these old entries may be made available for reuse. It seems like your algorithm is causing the directory size to grow quite large, with many stale entries. Another possibility is that the directory is becoming fragmented due to the limitations of block size. The original directory was contiguous, but the updated directory is now fragmented. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] ZSF Solaris
On Wed, 1 Oct 2008, Ian Collins wrote: A million files in ZFS is no big deal: But how similar were your file names? The file names are like:

image.dpx[000]
image.dpx[001]
image.dpx[002]
image.dpx[003]
image.dpx[004]
...

So they will surely trip up Al Hopper's bad algorithm. It is pretty common that images arranged in sequences have the common part up front so that sorting works. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] Quantifying ZFS reliability
On Wed, 1 Oct 2008, Tim wrote: I think you'd be surprised how large an organisation can migrate most, if not all, of their application servers to zones on one or two Thumpers. Isn't that the reason for buying in server appliances? I think you'd be surprised how quickly they'd be fired for putting that much risk into their enterprise. There is the old saying that "No one gets fired for buying IBM." If one buys an IBM system which runs 30 isolated instances of Linux, all of which are used for mission critical applications, is this a similar risk to consolidating storage on a Thumper, since we are really talking about just one big system? In what way is consolidating on Sun/Thumper more or less risky to an enterprise than consolidating on a big IBM server with many subordinate OS instances? Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] ZSF Solaris
On Wed, 1 Oct 2008, Ram Sharma wrote: So for storing 1 million MyISAM tables (MyISAM being a good performer when it comes to not very large data), I need to save 3 million data files in a single folder on disk. This is the way MyISAM saves data. I will never need to do an ls on this folder. This folder (~database) will be used just by the MySQL engine to execute my SQL queries and fetch me results. As long as you do not need to list the files in the directory, I think that you will be ok with zfs. First access:

% ptime ls -l 'image.dpx[666]'
-r--r--r-- 8001 bfriesen home 12754944 Jun 16 2005 image.dpx[666]
real 0.023
user 0.000
sys  0.002

Second access:

% ptime ls -l 'image.dpx[666]'
-r--r--r-- 8001 bfriesen home 12754944 Jun 16 2005 image.dpx[666]
real 0.003
user 0.000
sys  0.002

Access to a file in a small directory:

% ptime ls -l .zprofile
-rwxr-xr-x 1 bfriesen home 236 Dec 30 2007 .zprofile
real 0.003
user 0.000
sys  0.002

Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] Sidebar re ABI stability (was Segmentation fault / core dump)
[EMAIL PROTECTED] wrote: Linux does not implement stable kernel interfaces. It may be that there is an intention to do so, but I've seen problems on Linux resulting from self-incompatibility on a regular basis. To be precise, Linus tries hard to prevent ABI changes in the system call interfaces exported from the kernel, but the glibc team has defeated him in the past. For example, they accidentally started returning ENOTSUP from getgid when one had a library version mismatch (!). Sun stabilizes both library and system call interfaces: I used to work on that with David J. Brown's team, back when I was an employee. We don't stabilize the layer between libc and the kernel; e.g., look at the changes in the thread libraries in Solaris (between 9 and 10, for one). Of course, the system call interface will look the same, but only in the C library entry points defined, not how they are implemented in the library and the calls between libc and the kernel. Casper
Re: [zfs-discuss] Quantifying ZFS reliability
On Wed, Oct 1, 2008 at 9:18 AM, Joerg Schilling [EMAIL PROTECTED] wrote: David Magda [EMAIL PROTECTED] wrote: On Sep 30, 2008, at 19:09, Tim wrote: SAS has far greater performance, and if your workload is extremely random, will have a longer MTBF. SATA drives suffer badly on random workloads. Well, if you can probably afford more SATA drives for the purchase price, you can put them in a striped-mirror setup, and that may help things. If your disks are cheap you can afford to buy more of them (space, heat, and power notwithstanding). SATA and SAS disks are usually based on the same drive mechanism. The seek times are most likely identical. Some SATA disks support tagged command queueing and others do not. I would assume that there is no speed difference between SATA with command queueing and SAS. Jörg Ummm, no. SATA and SAS seek times are not even in the same universe. They most definitely do not use the same mechanics inside. Whoever told you that rubbish is an outright liar. --Tim
Re: [zfs-discuss] Quantifying ZFS reliability
On Wed, Oct 1, 2008 at 10:28 AM, Bob Friesenhahn [EMAIL PROTECTED] wrote: On Wed, 1 Oct 2008, Tim wrote: I think you'd be surprised how large an organisation can migrate most, if not all, of their application servers to zones on one or two Thumpers. Isn't that the reason for buying in server appliances? I think you'd be surprised how quickly they'd be fired for putting that much risk into their enterprise. There is the old saying that "No one gets fired for buying IBM." If one buys an IBM system which runs 30 isolated instances of Linux, all of which are used for mission critical applications, is this a similar risk to consolidating storage on a Thumper, since we are really talking about just one big system? In what way is consolidating on Sun/Thumper more or less risky to an enterprise than consolidating on a big IBM server with many subordinate OS instances? Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ Are you honestly trying to compare a Thumper's reliability to an IBM mainframe? Please tell me that's a joke... We can start at redundant, hot-swappable components and go from there. The Thumper can't even hold a candle to Sun's own older SPARC platforms. It's not even in the same game as the IBM mainframes. --Tim
Re: [zfs-discuss] Quantifying ZFS reliability
Ummm, no. SATA and SAS seek times are not even in the same universe. They most definitely do not use the same mechanics inside. Whoever told you that rubbish is an outright liar. Which particular disks are you guys talking about? I'm thinking you guys are talking about the same 3.5" w/ the same RPM, right? We're not comparing 10K/2.5" SAS drives against 7.2K/3.5" SATA devices, are we? Casper
Re: [zfs-discuss] Quantifying ZFS reliability
On Wed, Oct 1, 2008 at 11:20 AM, [EMAIL PROTECTED] wrote: Ummm, no. SATA and SAS seek times are not even in the same universe. They most definitely do not use the same mechanics inside. Whoever told you that rubbish is an outright liar. Which particular disks are you guys talking about? I'm thinking you guys are talking about the same 3.5" w/ the same RPM, right? We're not comparing 10K/2.5" SAS drives against 7.2K/3.5" SATA devices, are we? Casper I'm talking about 10k and 15k SAS drives, which is what the OP was talking about from the get-go. Apparently this is yet another case of subsequent posters completely ignoring the topic and taking us off on tangents that have nothing to do with the OP's problem. --Tim
Re: [zfs-discuss] Oracle DB sequential dump questions
Joerg Schilling wrote: Carson Gaspar [EMAIL PROTECTED] wrote: Louwtjie Burger wrote: Dumping a large file from memory using tar to LTO yields 44 MB/s ... I suspect the CPU cannot push more since it's a single thread doing all the work. Dumping oracle db files from the filesystem yields ~25 MB/s. The interesting bit (apart from it being a rather slow speed) is the fact that the speed fluctuates from the disk area, but stays constant to the tape. I see up to 50-60 MB/s spikes over 5 seconds, while the tape continues to push its steady 25 MB/s. ... Does your tape drive compress (most do)? If so, you may be seeing compressible vs. uncompressible data effects. HW compression in the tape drive usually increases the speed of the drive. Yes. Which is exactly what I was saying. The tar data might be more compressible than the DB, and thus faster. Shall I draw you a picture, or are you too busy shilling for star at every available opportunity? -- Carson
Re: [zfs-discuss] Quantifying ZFS reliability
On Wed, October 1, 2008 10:18, Joerg Schilling wrote: SATA and SAS disks are usually based on the same drive mechanism. The seek times are most likely identical. Some SATA disks support tagged command queueing and others do not. I would assume that there is no speed difference between SATA with command queueing and SAS. I guess the meaning in my e-mail wasn't clear: because SAS drives are generally more expensive on a per-unit basis, for a given budget you can buy fewer of them than SATA drives. To get the same storage capacity with SAS drives as with SATA drives, you'd probably have to put the SAS drives in a RAID-5/6/Z configuration to be more space efficient. However, by doing this you'd be losing spindles, and therefore IOPS. With SATA drives, since you can buy more for the same budget, you could put them in a RAID-10 configuration. While the individual disk may be slower, you'd have more spindles in the zpool, so that should help with the IOPS.
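The capacity-vs-spindles trade-off above is easy to put numbers on. A back-of-envelope sketch (mine, not from the thread — the drive counts, prices, and per-drive IOPS figures are made-up assumptions, and "one raidz vdev random-reads like one drive" is only a rough rule of thumb):

```python
# Rough model: fixed budget buys either 12 cheap SATA spindles or
# 6 expensive SAS spindles. Compare usable capacity (in drive-units)
# and approximate random-read IOPS for the two layouts.

def striped_mirrors(n_drives: int, iops_per_drive: int):
    """Pool of 2-way mirrors: half the raw capacity, but every
    spindle can serve reads independently."""
    return n_drives // 2, n_drives * iops_per_drive

def single_raidz(n_drives: int, parity: int, iops_per_drive: int):
    """One raidz/RAID-5/6 vdev: capacity is n - parity drives, but
    small random reads of one vdev behave roughly like one drive."""
    return n_drives - parity, iops_per_drive

# Hypothetical: 12 SATA drives (~80 random IOPS each) vs
# 6 SAS drives (~180 IOPS each) for about the same money.
sata_cap, sata_iops = striped_mirrors(12, 80)
sas_cap, sas_iops = single_raidz(6, 2, 180)
print(sata_cap, sata_iops)   # more spindles: more IOPS despite slower disks
print(sas_cap, sas_iops)
```

Under these toy assumptions the mirrored SATA pool comes out ahead on both usable space and aggregate random-read IOPS, which is the point being made above.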
Re: [zfs-discuss] Quantifying ZFS reliability
On Wed, 1 Oct 2008, Joerg Schilling wrote: SATA and SAS disks usually base on the same drive mechanism. The seek times are most likely identical. This must be some sort of urban legend. While the media composition and drive chassis are similar, the rest of the product clearly differs. The seek times for typical SAS drives are clearly much better, and the typical drive rotates much faster. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] ZFS, NFS and Auto Mounting
pt == Peter Tribble [EMAIL PROTECTED] writes: pt I think the term is mirror mounts. he doesn't need them---he's using the traditional automounter, like we all used to use before this newfangled mirror-mounts baloney. There were no mirror mounts with the old UFS NFSv3 setup that he inherited, and it worked fine. Maybe mirror mounts are breaking the automounter? I think someone who knows the automounter better than I could explain it, but one thing you might try is to make the server's and client's filesystems similarly nested. Right now you have:

/ws/com          /mnt/.../GroupWS
/ws/Integration  /mnt/.../GroupWS/Integration
/ws/cstools      /mnt/.../GroupWS/cstools
/ws/Upgrades     /mnt/.../GroupWS/Upgrades

so /ws/{Integration,cstools,Upgrades} are descendants of /ws/com on the server, but not on the client. This may break some assumption that the automounter needs to function---an assumption which I don't have enough experience and wit to state quickly and explicitly, but suspect might exist. Why not change to:

/ws/com          /mnt/.../GroupWS/com
/ws/Integration  /mnt/.../GroupWS/Integration
/ws/cstools      /mnt/.../GroupWS/cstools
/ws/Upgrades     /mnt/.../GroupWS/Upgrades

or:

/ws/com              /mnt/.../GroupWS
/ws/com/Integration  /mnt/.../GroupWS/Integration
/ws/com/cstools      /mnt/.../GroupWS/cstools
/ws/com/Upgrades     /mnt/.../GroupWS/Upgrades

and update the auto.ws map to match whichever you pick.
[zfs-discuss] query: why does zfs boot in 10/08 not support flash archive jumpstart
With much excitement I have been reading the new features coming into Solaris 10 in 10/08 and am eager to start playing with ZFS root. However, one thing which struck me as strange and somewhat annoying is that it appears in the FAQs and documentation that it's not possible to do a ZFS root install using jumpstart and flash archives? I predominantly do my installs using flash archives as it saves massive amounts of time in the install process and gives me consistency between builds. Really I am just curious why it isn't supported, and what the intention is for supporting it and when? Cheers, Adrian -- This message posted from opensolaris.org
Re: [zfs-discuss] Quantifying ZFS reliability
t == Tim [EMAIL PROTECTED] writes: t So what would be that the application has to run on Solaris. t And requires a LUN to function. ITYM requires two LUNs, or else when your filesystem becomes corrupt after a crash the sysadmin will get blamed for it. Maybe you can deduplicate the ZFS mirror LUNs on the storage back-end or something.
Re: [zfs-discuss] Quantifying ZFS reliability
On Wed, Oct 1, 2008 at 11:53 AM, Ahmed Kamal [EMAIL PROTECTED] wrote: Thanks for all the opinions everyone, my current impression is: - I do need as much RAM as I can afford (16GB looks good enough for me) Depends on both the workload and the amount of storage behind it. From your descriptions, though, I think you'll be ok. - SAS disks offer better IOPS and better MTBF than SATA. But SATA offers enough performance for me (to saturate a gig link), and its MTBF is around 100 years, which is I guess good enough for me too. If I wrap 5 or 6 SATA disks in a raidz2 that should give me enough protection and performance. It seems I will go with SATA then for now. I hope for all practical purposes the raidz2 array of say 6 SATA drives is very well protected for, say, the next 10 years! (If not, please tell me) ***If you have a sequential workload. It's not a blanket "SATA is fast enough." - This will mainly be used for NFS sharing. Everyone is saying it will have bad performance. My question is, how bad is bad? Is it worse than a plain Linux server sharing NFS over 4 SATA disks, using a crappy 3ware raid card with caching disabled? coz that's what I currently have. Is it, say, worse than a Linux box sharing over soft raid? Whoever is saying that is being dishonest. NFS is plenty fast for most workloads. There are very, VERY few workloads in the enterprise that are I/O bound; they are almost all IOPS bound. - If I will be using 6 SATA disks in raidz2, I understand to improve performance I can add a 15k SAS drive as a ZIL device, is this correct? Is the ZIL device per pool? Do I lose any flexibility by using it? Does it become a SPOF, say? Typically how much percentage improvement should I expect to get from such a ZIL device? ZILs come with their own fun. Isn't there still the issue of losing the entire pool if you lose the ZIL? And you can't get it back without extensive, ugly work?
Re: [zfs-discuss] Quantifying ZFS reliability
On Tue, Sep 30, 2008 at 09:54:04PM -0400, Miles Nordin wrote: ok, I get that S3 went down due to corruption, and that the network checksums I mentioned failed to prevent the corruption. The missing piece is: belief that the corruption occurred on the network rather than somewhere else. Their post-mortem sounds to me as though a bit flipped inside the memory of one server could be spread via this ``gossip'' protocol to infect the entire cluster. The replication and spreadability of the data makes their cluster into a many-terabyte gamma ray detector. A bit flipped inside an end of an end-to-end system will not be detected by that system. So the CPU, memory and memory bus of an end have to be trusted and so require their own corruption detection mechanisms (e.g., ECC memory). In the S3 case it sounds like there's a lot of networking involved, and that they weren't providing integrity protection for the gossip protocol. Given a two-bit-flip-that-passed-all-Ethernet-and-TCP-CRCs event that we had within Sun a few years ago (much alluded to elsewhere in this thread), and which happened in one faulty switch, I would suspect the switch. Also, years ago when 100Mbps Ethernet first came on the market I saw lots of bad cat-5 wiring issues, where a wire would go bad and start introducing errors just a few months into its useful life. I don't trust the networking equipment -- I prefer end-to-end protection. Just because you have to trust that the ends behave correctly doesn't mean that you should have to trust everything in the middle too. Nico --
Re: [zfs-discuss] query: why does zfs boot in 10/08 not support flash archive jumpstart
It was something we couldn't get into the release due to insufficient resources. I'd like to see it implemented in the future. Lori Adrian Saul wrote: With much excitement I have been reading the new features coming into Solaris 10 in 10/08 and am eager to start playing with ZFS root. However, one thing which struck me as strange and somewhat annoying is that it appears in the FAQs and documentation that it's not possible to do a ZFS root install using jumpstart and flash archives? I predominantly do my installs using flash archives as it saves massive amounts of time in the install process and gives me consistency between builds. Really I am just curious why it isn't supported, and what the intention is for supporting it and when? Cheers, Adrian
Re: [zfs-discuss] Quantifying ZFS reliability
On Wed, 1 Oct 2008, [EMAIL PROTECTED] wrote: To get the same storage capacity with SAS drives as with SATA drives, you'd probably have to put the SAS drives in a RAID-5/6/Z configuration to be more space efficient. However, by doing this you'd be losing spindles, and therefore IOPS. With SATA drives, since you can buy more for the same budget, you could put them in a RAID-10 configuration. While the individual disk may be slower, you'd have more spindles in the zpool, so that should help with the IOPS. I will agree with that except to point out that there are many applications which require performance but not a huge amount of storage. For many critical applications, even 10s of gigabytes is a lot of storage. Based on this, I would say that most applications where SAS is desirable are the ones which demand the most reliability and performance, whereas the applications where SATA is desirable are the ones which place a priority on bulk storage capacity. If you are concerned about total storage capacity and you are also specifying SAS for performance/reliability for critical data, then it is likely that there is something wrong with your plan for storage and how the data is distributed. There is a reason why when you go to the store you see tack hammers, construction hammers, and sledge hammers. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] Quantifying ZFS reliability
Tim [EMAIL PROTECTED] wrote: Ummm, no. SATA and SAS seek times are not even in the same universe. They most definitely do not use the same mechanics inside. Whoever told you that rubbish is an outright liar. It is extremely unlikely that two drives from the same manufacturer and with the same RPM differ in seek times if you compare a SAS variant with a SATA variant. Jörg -- EMail: [EMAIL PROTECTED] (home) Jörg Schilling D-13353 Berlin [EMAIL PROTECTED] (uni) [EMAIL PROTECTED] (work) Blog: http://schily.blogspot.com/ URL: http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
Re: [zfs-discuss] Oracle DB sequential dump questions
Carson Gaspar [EMAIL PROTECTED] wrote: Yes. Which is exactly what I was saying. The tar data might be more compressible than the DB, thus be faster. Shall I draw you a picture, or are you too busy shilling for star at every available opportunity? If you never compared Sun tar speed with star speed, it would not help if you drew pictures. Jörg -- EMail: [EMAIL PROTECTED] (home) Jörg Schilling D-13353 Berlin [EMAIL PROTECTED] (uni) [EMAIL PROTECTED] (work) Blog: http://schily.blogspot.com/ URL: http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
Re: [zfs-discuss] Quantifying ZFS reliability
cd == Casper Dik [EMAIL PROTECTED] writes: cd The whole packet lives in the memory of the switch/router and cd if that memory is broken the packet will be sent damaged. that's true, but by algorithmically modifying the checksum to match your TTL decrementing and MAC-address label-swapping, rather than recomputing it from scratch, it's possible for an L2 or even L3 switch to avoid ``splitting the protection domain''. It'll still send the damaged packet, but with a wrong FCS, so it'll just get dropped by the next input port and eventually retransmitted. This is what 802.1d suggests. I suspect one reason the IP/UDP/TCP checksums were specified as simple checksums rather than CRCs like the Ethernet L2 FCS is that it's really easy and obvious how to algorithmically modify them. sounds like they are not good enough though, because unless this broken router that Robert and Darren saw was doing NAT, yeah, it should not have touched the TCP/UDP checksum. BTW which router was it, or you can't say because you're in the US? :) I would expect any cost-conscious router or switch manufacturer to use the same Ethernet MAC ASICs as desktops, so the checksums would likely be computed right before transmission using the ``offload'' feature of the Ethernet chip, but of course we can't tell because they're all proprietary. Eventually I bet it will become commonplace for Ethernet MACs to do IPsec offload, so we'll have to remember the ``avoid splitting the protection domain'' idea when that starts happening.
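The ``algorithmically modifying the checksum'' trick Miles describes is exactly the incremental-update arithmetic of RFC 1624 over the RFC 1071 ones-complement sum. A minimal sketch (the toy 4-byte "header" and helper names are mine; real IP headers are of course longer):

```python
def cksum16(data: bytes) -> int:
    """RFC 1071 Internet checksum: ones-complement sum of 16-bit words."""
    if len(data) % 2:
        data += b"\x00"
    s = 0
    for i in range(0, len(data), 2):
        s += (data[i] << 8) | data[i + 1]
        s = (s & 0xFFFF) + (s >> 16)   # fold the carry back in
    return ~s & 0xFFFF

def incr_update(hc: int, old_word: int, new_word: int) -> int:
    """RFC 1624 incremental update: HC' = ~(~HC + ~m + m').
    Lets a router patch the checksum after, e.g., a TTL decrement
    without re-summing the whole header."""
    s = (~hc & 0xFFFF) + (~old_word & 0xFFFF) + new_word
    while s >> 16:
        s = (s & 0xFFFF) + (s >> 16)
    return ~s & 0xFFFF

hdr = bytes([0x40, 0x01, 0x00, 0x10])      # toy header, first word = 0x4001
patched = bytes([0x3F, 0x01, 0x00, 0x10])  # "TTL" byte decremented: 0x40 -> 0x3F
assert incr_update(cksum16(hdr), 0x4001, 0x3F01) == cksum16(patched)
```

Because only the changed word enters the arithmetic, a memory error elsewhere in the buffered packet never touches the checksum computation, which is the "don't split the protection domain" property.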
Re: [zfs-discuss] ZFS, NFS and Auto Mounting
On Wed, Oct 01, 2008 at 01:12:08PM -0400, Miles Nordin wrote: pt == Peter Tribble [EMAIL PROTECTED] writes: pt I think the term is mirror mounts. he doesn't need them---he's using the traditional automounter, like we all used to use before this newfangled mirror mounts baloney. Oh man, I *love* mirror mounts -- they're *not* baloney. There were no mirror mounts with the old UFS NFSv3 setup that he inherited, and it worked fine. Maybe mirror mounts are breaking the automounter? Doubtful. There is a race condition in mirror mounts that can cause one of several threads racing to trigger a mirror mount to get an error. Usually you see that when running dmake. Otherwise mirror mounts work perfectly. I think someone who knows the automounter better than I could explain it, but one thing you might try is to make the server's and client's filesystems similarly nested. Right now you have: Yes, the OP needs hierarchical automount map entries. Nico --
Re: [zfs-discuss] Quantifying ZFS reliability
On Wed, 1 Oct 2008, Joerg Schilling wrote: Ummm, no. SATA and SAS seek times are not even in the same universe. They most definitely do not use the same mechanics inside. Whoever told you that rubbish is an outright liar. It is extremely unlikely that two drives from the same manufacturer and with the same RPM differ in seek times if you compare a SAS variant with a SATA variant. I did find a manufacturer (Seagate) which does offer a SAS variant of what is normally a SATA drive. Is this the specific product you are talking about? The interface itself is perhaps not all that important, but drive vendors have traditionally built SCSI-based products on high-performance hardware with a focus on reliability, while ATA-based products are built on low- or medium-performance hardware with a focus on cost. There is very little overlap between these distinct product lines. It is rare to find similarity between the specification sheets. It is quite rare to find similar rotation rates or seek times. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] Quantifying ZFS reliability
Ahmed Kamal wrote: Thanks for all the opinions everyone, my current impression is: - I do need as much RAM as I can afford (16GB looks good enough for me) - SAS disks offer better IOPS and better MTBF than SATA. But SATA offers enough performance for me (to saturate a gig link), and its MTBF is around 100 years, which is I guess good enough for me too. If I wrap 5 or 6 SATA disks in a raidz2 that should give me enough protection and performance. It seems I will go with SATA then for now. I hope for all practical purposes the raidz2 array of say 6 SATA drives is very well protected for, say, the next 10 years! (If not, please tell me) OK, so what the specs don't tell you is how MTBF changes over time. It is very common to see an MTBF quoted, but you will almost never see it described as a function of age. Rather, you will see something in the specs about expected service lifetime, and how the environment can decrease the service lifetime (read: decrease the MTBF over time more rapidly). I've not seen a consumer-grade disk spec with 10 years of expected service life -- some are 5 years. In other words, as time goes by, you should plan to replace them. A more lengthy discussion of this, and why we measure field reliability in other ways, see: http://blogs.sun.com/relling/entry/using_mtbf_and_time_dependent - This will mainly be used for NFS sharing. Everyone is saying it will have bad performance. My question is, how bad is bad? Is it worse than a plain Linux server sharing NFS over 4 SATA disks, using a crappy 3ware raid card with caching disabled? coz that's what I currently have. Is it, say, worse than a Linux box sharing over soft raid? - If I will be using 6 SATA disks in raidz2, I understand to improve performance I can add a 15k SAS drive as a ZIL device, is this correct? Is the ZIL device per pool? Do I lose any flexibility by using it? Does it become a SPOF, say? Typically how much percentage improvement should I expect to get from such a ZIL device?
See the best practices guide: http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide -- richard
Re: [zfs-discuss] Quantifying ZFS reliability
Bob Friesenhahn [EMAIL PROTECTED] wrote: On Wed, 1 Oct 2008, Joerg Schilling wrote: SATA and SAS disks usually base on the same drive mechanism. The seek times are most likely identical. This must be some sort of urban legend. While the media composition and drive chassis is similar, the rest of the product clearly differs. The seek times for typical SAS drives are clearly much better, and the typical drive rotates much faster. Did you recently look at spec files from drive manufacturers? If you look at drives in the same category, the difference between a SATA and a SAS disk is only the firmware and the way the drive mechanism has been selected. Another difference is that SAS drives may have two SAS interfaces instead of the single SATA interface found in SATA drives. IOPS depend on seek times, latency times, and probably on disk cache size. If you have a drive with a 1 ms seek time, the seek time is not really important. What's important is the rotational latency, which is 4 ms for a 7200 rpm drive and only 2 ms for a 15000 rpm drive. People who talk about SAS usually forget that they try to compare 15000 rpm SAS drives with 7200 rpm SATA drives. There are faster SATA drives, but these drives consume more power. Jörg -- EMail: [EMAIL PROTECTED] (home) Jörg Schilling D-13353 Berlin [EMAIL PROTECTED] (uni) [EMAIL PROTECTED] (work) Blog: http://schily.blogspot.com/ URL: http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
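Joerg's 4 ms / 2 ms figures follow directly from the spindle speed: average rotational latency is half of one revolution. A quick check of the arithmetic (my own, not from the thread):

```python
def avg_rotational_latency_ms(rpm: float) -> float:
    """Average rotational latency: on average the head waits half a
    revolution for the target sector to come around."""
    ms_per_revolution = 60_000.0 / rpm   # 60,000 ms per minute
    return ms_per_revolution / 2

print(round(avg_rotational_latency_ms(7200), 2))   # ~4.17 ms
print(avg_rotational_latency_ms(15000))            # 2.0 ms
```

So the quoted 4 ms for 7200 rpm is a slight round-down of ~4.17 ms, and 2 ms for 15000 rpm is exact.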
[zfs-discuss] making sense of arcstat.pl output
I'm using Neelakanth's arcstat tool to troubleshoot performance problems with a ZFS filer we have, sharing home directories to a CentOS frontend Samba box. Output shows an arc target size of 1G, which I find odd, since I haven't tuned the arc, and the system has 4G of RAM. prstat -a tells me that userland processes are only using about 200-300mb of RAM, and even if Solaris is eating 1GB, that still leaves quite a lot of RAM not being used by the arc. I would believe that this was due to low workload, but I see that 'arcsz' matches 'c', which makes me think the system is hitting a bottleneck/wall of some kind. Any thoughts on further troubleshooting appreciated. Blake -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
Miles Nordin wrote:
> sounds like they are not good enough though, because unless this
> broken router that Robert and Darren saw was doing NAT, yeah, it
> should not have touched the TCP/UDP checksum.

I believe we proved that the problem bit flips were such that the TCP checksum was the same, so the original checksum still appeared correct.

> BTW which router was it, or you can't say because you're in the US? :)

I can't remember; it was aging at the time.

Rob T
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
On Wed, Oct 01, 2008 at 12:22:56PM -0500, Tim wrote:
>> - This will mainly be used for NFS sharing. Everyone is saying it
>>   will have bad performance. My question is, how bad is bad? Is it
>>   worse than a plain Linux server sharing NFS over 4 SATA disks,
>>   using a crappy 3ware RAID card with caching disabled? coz that's
>>   what I currently have. Is it, say, worse than a Linux box sharing
>>   over soft RAID?
>
> Whoever is saying that is being dishonest. NFS is plenty fast for most
> workloads. There are very, VERY few workloads in the enterprise that
> are I/O bound; they are almost all IOPS bound.

NFS is bad for workloads that involve lots of operations that NFS requires to be synchronous and which the application doesn't parallelize. Things like open(2) and close(2), for example, which means applications like tar(1). The solution is to get a fast slog device. (Or to use an NFS server that violates the synchrony requirement.)

Nico
--
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
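[Editorial note: a back-of-the-envelope sketch of the synchronous-operation cost described above. All numbers here are assumptions for illustration -- the per-operation count and the latencies are made up, not measurements from this thread.]

```python
# Extracting many small files over NFS pays synchronous server round
# trips per file (create/open, data commit, close).  Assumed figures:
files = 10_000
sync_ops_per_file = 3          # assumed ops requiring a server commit
no_slog_latency_ms = 5.0       # assumed sync-write latency to spinning disk
slog_latency_ms = 0.5          # assumed latency with a fast slog device

def total_seconds(latency_ms):
    """Total wall-clock time spent waiting on synchronous commits."""
    return files * sync_ops_per_file * latency_ms / 1000

print(total_seconds(no_slog_latency_ms))  # 150.0 seconds without a slog
print(total_seconds(slog_latency_ms))     # 15.0 seconds with one
```

The point is that the workload is latency-bound, not bandwidth-bound: cutting per-commit latency by 10x cuts the tar run by 10x, regardless of how fast the disks stream.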
Re: [zfs-discuss] ZFS, NFS and Auto Mounting
On Wed, Oct 01, 2008 at 01:30:45PM +0100, Peter Tribble wrote:
> On Wed, Oct 1, 2008 at 3:42 AM, Douglas R. Jones <[EMAIL PROTECTED]> wrote:
>> Any ideas?
>
> Well, I guess you're running Solaris 10 and not OpenSolaris/SXCE. I
> think the term is mirror mounts. It works just fine on my SXCE boxes.
> Until then, the way we got round this was to not make the new
> filesystem a child. So instead of:
>
>   /mnt/zfs1/GroupWS
>   /mnt/zfs1/GroupWS/Integration
>
> create:
>
>   /mnt/zfs1/GroupWS
>   /mnt/zfs1/Integration
>
> and use that for the Integration mountpoint. Then in GroupWS,
> 'ln -s ../Integration .'.

No, that's not the workaround. The problem is that the automounter -hosts map does a MOUNT call once to get the list of exports from the server, and that means that filesystems added since the first mount via /net will not be visible. Mirror mounts solve *that* problem. And they fix the poster's problem as well. The poster isn't using the -hosts automount map, so his workaround is to create hierarchical automount map entries. See automount(1M). The symlink trick works, but hierarchical automount map entries work better.

Nico
--
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
On Wed, Oct 1, 2008 at 12:51 PM, Joerg Schilling <[EMAIL PROTECTED]> wrote:
> Did you recently look at spec files from drive manufacturers? If you
> look at drives in the same category, the difference between a SATA and
> a SAS disk is only the firmware and the way the drive mechanism has
> been selected. Another difference is that SAS drives may have two SAS
> interfaces instead of the single SATA interface found in the SATA
> drives. IOPS depend on seek times, latency times and probably on disk
> cache size. If you have a drive with a 1 ms seek time, the seek time
> is not really important. What's important is the latency time, which
> is 4 ms for a 7200 rpm drive and only 2 ms for a 15000 rpm drive.
> People who talk about SAS usually forget that they try to compare
> 15000 rpm SAS drives with 7200 rpm SATA drives. There are faster SATA
> drives but these drives consume more power.

That's because the faster SATA drives cost just as much money as their SAS counterparts for less performance and none of the advantages SAS brings, such as dual ports. Not to mention none of them can be dual sourced, making it a non-starter in the enterprise.

--Tim
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Segmentation fault / core dump with recursive send/recv
The problem could be in the zfs command or in the kernel. Run pstack on the core dump and search the bug database for the functions it lists. If you can't find a bug that matches your situation and your stack, file a new bug and attach the core. If the engineers find a duplicate bug, they'll just close it as a duplicate, and the bug database will show a pointer to the original bug. David -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
On Wed, Oct 01, 2008 at 11:54:55AM -0600, Robert Thurlow wrote:
> Miles Nordin wrote:
>> sounds like they are not good enough though, because unless this
>> broken router that Robert and Darren saw was doing NAT, yeah, it
>> should not have touched the TCP/UDP checksum.
>
> I believe we proved that the problem bit flips were such that the TCP
> checksum was the same, so the original checksum still appeared correct.

The bit flips came in pairs, IIRC. I forget the details, but it's probably buried somewhere in my (and many others') e-mail.

>> BTW which router was it, or you can't say because you're in the US? :)
>
> I can't remember; it was aging at the time.

I can't remember either -- it was a few years ago.
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] one step forward - pinging Lukas pool: ztankKarwacki (kangurek)
on the advice of Okana in the freenode.net #opensolaris channel I tried to run the latest OpenSolaris LiveCD and import the pool. No luck; however, I tried the trick in Lukas's post that allowed him to import the pool, and I had a beginning of luck. By doing the mdb wizardry he indicated I was able to run zpool import with the following result:

  pool: ztank
    id: whatever
 state: ONLINE
status: The pool was last accessed by another system.
   see: http://www.sun.com/msg/ZFS-8000-EY
config:

        ztank       ONLINE
          raidz1    ONLINE
            c4t0d0  ONLINE
            c4t1d0  ONLINE
            c4t2d0  ONLINE
            c4t3d0  ONLINE
            c4t4d0  ONLINE
            c4t5d0  ONLINE

HOWEVER. When I attempt again to import using 'zdb -e ztank' I still get

  zdb: can't open ztank: I/O error

and 'zpool import -f', whilst it starts and seems to access the disks sequentially, stops at the 3rd one (not sure which precisely - it spins it up and the process stops right there), and the system will not reboot when asked to (shutdown -g0 -y -i5). So there's some slight progress here. I would really appreciate ideas from you guys! Thanks

Vasile
-- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] making sense of arcstat.pl output
Blake Irvin wrote:
> I'm using Neelakanth's arcstat tool to troubleshoot performance
> problems with a ZFS filer we have, sharing home directories to a
> CentOS frontend Samba box. Output shows an arc target size of 1G,
> which I find odd, since I haven't tuned the arc, and the system has
> 4G of RAM. prstat -a tells me that userland processes are only using
> about 200-300mb of RAM, and even if Solaris is eating 1GB, that still
> leaves quite a lot of RAM not being used by the arc. I would believe
> that this was due to low workload, but I see that 'arcsz' matches 'c',
> which makes me think the system is hitting a bottleneck/wall of some
> kind. Any thoughts on further troubleshooting appreciated.

It doesn't sound like you have a memory shortfall. Please start with the ZFS best practices guide:
http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide
Many of the recommendations for NFS will also apply to other file sharing protocols, such as CIFS.
-- richard
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
On Wed, 1 Oct 2008, Joerg Schilling wrote:
> Did you recently look at spec files from drive manufacturers?

Yes.

> If you look at drives in the same category, the difference between a
> SATA and a

The problem is that these drives (SAS / SATA) are generally not in the same category, so your comparison does not make sense. There is very little overlap between the exotic sports car class and the family minivan class. In some very few cases we see some transition vehicles such as station wagons in a sport form factor. Most drive vendors try to make sure that the drives are in truly distinct classes in order to preserve the profit margins on the more expensive drives. In some cases we see SAS interfaces fitted to drives which are fundamentally SATA-class drives, but such products are rare.

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] making sense of arcstat.pl output
I think I need to clarify a bit. I'm wondering why arc size is staying so low, when i have 10 nfs clients and about 75 smb clients accessing the store via resharing (on one of the 10 linux nfs clients) of the zfs/nfs export. Or is it normal for the arc target and arc size to match? Of note, I didn't see these performance issues until the box had been up for about a week, probably enough time for weekly (roughly) windows reboots and profile syncs across multiple clients to force the arc to fill. I have read through and follow the advice on the tuning guide, but still see Windows users with roaming profiles getting very slow profile syncs. This makes me think that zfs isn't handling the random i/o generated by a profile sync very well. Well, at least that's what I'm thinking when I see an arc size of 1G, there is at least another free gig of memory, and the clients syncing more than a gig of data fairly often. I will return to studying the tuning guide, though, to make sure I've not missed some key bit. It's not unlikely that I'm missing something fundamental about how zfs should behave in this scenario. cheers, Blake -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS, NFS and Auto Mounting
Douglas R. Jones wrote:
> 4) I change the auto.ws map thusly:
>   Integration	chekov:/mnt/zfs1/GroupWS/
>   Upgrades	chekov:/mnt/zfs1/GroupWS/
>   cstools	chekov:/mnt/zfs1/GroupWS/
>   com		chekov:/mnt/zfs1/GroupWS

This is standard NFS behavior (prior to NFSv4). Child filesystems have to be mounted on the NFS client explicitly. As someone else mentioned, NFSv4 has a feature called 'mirror mounts' that is supposed to automate this for you. For now try this:

  Integration	chekov:/mnt/zfs1/GroupWS/
  Upgrades	chekov:/mnt/zfs1/GroupWS/
  cstools	chekov:/mnt/zfs1/GroupWS/
  com	/		chekov:/mnt/zfs1/GroupWS \
	/Integration	chekov:/mnt/zfs1/GroupWS/Integration

Note the \ line continuation character. The last 2 lines are really all one line. If you had had 'Integration' on its own ufs or ext2fs filesystem in the past, but still mounted below 'GroupWS', you would have seen this in the past. It's not a ZFS thing, or a Solaris thing.

-Kyle
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS performance degradation when backups are running
You might want to also try toggling the Nagle TCP setting to see if that helps with your workload:

  ndd -get /dev/tcp tcp_naglim_def    (save that value; default is 4095)
  ndd -set /dev/tcp tcp_naglim_def 1

If no (or a negative) difference, set it back to the original value:

  ndd -set /dev/tcp tcp_naglim_def 4095    (or whatever it was)

-- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] make zfs(1M) use literals when displaying properties in scripted mode
On Tue, Sep 30, 2008 at 11:09:05PM -0700, Eric Schrock wrote:
> A better solution (one that wouldn't break backwards compatibility)
> would be to add the '-p' option (parseable output) from 'zfs get' to
> the 'zfs list' command as well.

yes, that makes sense to me. thanks for pointing out the -p in zfs get; it means i can get the numbers i need on s10 without having to do crazy stuff to get a custom zfs binary.

here's an updated diff that implements -p on zfs list. thanks to james mcpherson for both fixing and testing this for me.

diff -r 4fa3bfcd83d7 -r dbe864e2cc70 usr/src/cmd/zfs/zfs_main.c
--- a/usr/src/cmd/zfs/zfs_main.c	Wed Oct 01 00:06:47 2008 -0700
+++ b/usr/src/cmd/zfs/zfs_main.c	Thu Oct 02 07:26:16 2008 +1000
@@ -1623,6 +1623,7 @@
 typedef struct list_cbdata {
 	boolean_t	cb_first;
 	boolean_t	cb_scripted;
+	boolean_t	cb_literal;
 	zprop_list_t	*cb_proplist;
 } list_cbdata_t;
@@ -1672,7 +1673,8 @@
  * to the described layout.
  */
 static void
-print_dataset(zfs_handle_t *zhp, zprop_list_t *pl, boolean_t scripted)
+print_dataset(zfs_handle_t *zhp, zprop_list_t *pl, boolean_t scripted,
+    boolean_t literal)
 {
 	boolean_t first = B_TRUE;
 	char property[ZFS_MAXPROPLEN];
@@ -1695,7 +1697,7 @@
 		right_justify = B_FALSE;
 		if (pl->pl_prop != ZPROP_INVAL) {
 			if (zfs_prop_get(zhp, pl->pl_prop, property,
-			    sizeof (property), NULL, NULL, 0, B_FALSE) != 0)
+			    sizeof (property), NULL, NULL, 0, literal) != 0)
 				propstr = "-";
 			else
 				propstr = property;
@@ -1742,7 +1744,7 @@
 		cbp->cb_first = B_FALSE;
 	}

-	print_dataset(zhp, cbp->cb_proplist, cbp->cb_scripted);
+	print_dataset(zhp, cbp->cb_proplist, cbp->cb_scripted, cbp->cb_literal);

 	return (0);
 }
@@ -1752,6 +1754,7 @@
 {
 	int c;
 	boolean_t scripted = B_FALSE;
+	boolean_t literal = B_FALSE;
 	static char default_fields[] =
 	    "name,used,available,referenced,mountpoint";
 	int types = ZFS_TYPE_FILESYSTEM | ZFS_TYPE_VOLUME;
@@ -1764,10 +1767,13 @@
 	int flags = ZFS_ITER_PROP_LISTSNAPS | ZFS_ITER_ARGS_CAN_BE_PATHS;

 	/* check options */
-	while ((c = getopt(argc, argv, ":o:rt:Hs:S:")) != -1) {
+	while ((c = getopt(argc, argv, ":o:prt:Hs:S:")) != -1) {
 		switch (c) {
 		case 'o':
 			fields = optarg;
+			break;
+		case 'p':
+			literal = B_TRUE;
 			break;
 		case 'r':
 			flags |= ZFS_ITER_RECURSE;
@@ -1855,6 +1861,7 @@
 	    != 0)
 		usage(B_FALSE);

+	cb.cb_literal = literal;
 	cb.cb_scripted = scripted;
 	cb.cb_first = B_TRUE;
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
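[Editorial note: a sketch of why the raw values matter to a script consuming `zfs list -H` output. The sample lines below are illustrative, not real pool data; the column order follows the default name,used,available,referenced,mountpoint fields.]

```python
def parse_used(line):
    """Pull the 'used' column out of one tab-separated 'zfs list -H' line."""
    name, used, avail, refer, mountpoint = line.split("\t")
    return int(used)  # only works if the value is a literal byte count

# With -p, sizes come back as raw byte counts and parse cleanly:
print(parse_used("tank/home\t1073741824\t53687091200\t24576\t/tank/home"))
# prints 1073741824

# Without -p, human-readable suffixes make the value useless as a number:
try:
    parse_used("tank/home\t1G\t50.0G\t24K\t/tank/home")
except ValueError:
    print("can't parse '1G' as an integer")
```

Parsing the suffixed form would mean reimplementing zfs's own rounding in every script, and losing precision to boot, which is exactly what the -p flag avoids.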
Re: [zfs-discuss] Quantifying ZFS reliability
On Wed, 2008-10-01 at 11:54 -0600, Robert Thurlow wrote:
>> like they are not good enough though, because unless this broken
>> router that Robert and Darren saw was doing NAT, yeah, it should not
>> have touched the TCP/UDP checksum.

NAT was not involved.

> I believe we proved that the problem bit flips were such that the TCP
> checksum was the same, so the original checksum still appeared correct.

That's correct. The pattern we found in corrupted data was that there would be two offsetting bit flips: a 0->1 flip was followed 256 or 512 or 1024 bytes later by a 1->0 flip, or vice versa. (It was always the same bit; in the cases I analyzed, the corrupted files contained C source code and the bit flips were obvious.) Under the 16-bit one's-complement checksum used by TCP, these two changes cancel each other out and the resulting packet has the same checksum.

>> BTW which router was it, or you can't say because you're in the US? :)
>
> I can't remember; it was aging at the time.

To use excruciatingly precise terminology, I believe the switch in question is marketed as a combo L2 bridge/L3 router, but in this case may have been acting as a bridge rather than a router. After we noticed the data corruption we looked at TCP counters on hosts on that subnet and noticed a high rate of failed checksums, so clearly the TCP checksum was catching *most* of the corrupted packets; we just didn't look at the counters until after we saw data corruption.

- Bill
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
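[Editorial note: the failure mode Bill describes -- paired, offsetting flips of the same bit slipping past TCP's 16-bit one's-complement checksum -- can be reproduced in a few lines. This is a minimal sketch of the RFC 1071 sum over a synthetic buffer, not the actual switch failure; offsets and values are made up.]

```python
def cksum16(data):
    """16-bit one's-complement checksum over big-endian words (RFC 1071)."""
    s = 0
    for i in range(0, len(data), 2):
        hi = data[i] << 8
        lo = data[i + 1] if i + 1 < len(data) else 0
        s += hi | lo
    while s >> 16:                 # fold carries back into the low 16 bits
        s = (s & 0xFFFF) + (s >> 16)
    return (~s) & 0xFFFF

orig = bytearray(2048)
orig[100] = 0x41                   # some payload byte with bit 0 set

flipped = bytearray(orig)
flipped[100] ^= 0x01               # a 1->0 flip of bit 0
flipped[100 + 512] ^= 0x01         # offsetting 0->1 flip 512 bytes later;
                                   # even offset keeps the same word weight

detected = bytearray(orig)
detected[100 + 512] ^= 0x01        # by contrast, a lone flip

assert flipped != orig
assert cksum16(flipped) == cksum16(orig)     # paired flips cancel out
assert cksum16(detected) != cksum16(orig)    # a single flip is caught
print("paired flips pass the checksum; lone flips don't")
```

Because a flip at byte N and an opposite flip of the same bit at N + 512 add and subtract the same value from the running sum, the checksum is unchanged -- which is exactly why ZFS's stronger end-to-end checksums caught what TCP missed.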
Re: [zfs-discuss] ZFS, NFS and Auto Mounting
First of all let me thank each and every one of you who helped with this issue. Your responses were not only helpful but insightful as well. I have been around Unix for a long time, but only recently have I had the opportunity to do some real-world admin work (they laid off or had quit those who were doing this before me); I am just a code jockey. Anyway, the answer turned out to be hierarchical automounting. I really did not know the difference between direct and hierarchical before. What eventually worked was demonstrated by Kyle. In the end, the auto.ws map looks like:

  Integration	chekov:/mnt/zfs1/GroupWS/
  Upgrades	chekov:/mnt/zfs1/GroupWS/
  cstools	chekov:/mnt/zfs1/GroupWS/
  com	/		chekov:/mnt/zfs1/GroupWS \
	/Integration	chekov:/mnt/zfs1/GroupWS/Integration

And it appears to be working fine once I restarted autofs. Thanks again for the help!

Doug
-- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] making sense of arcstat.pl output
Blake Irvin wrote:
> I think I need to clarify a bit. I'm wondering why arc size is staying
> so low, when i have 10 nfs clients and about 75 smb clients accessing
> the store via resharing (on one of the 10 linux nfs clients) of the
> zfs/nfs export. Or is it normal for the arc target and arc size to
> match? Of note, I didn't see these performance issues until the box
> had been up for about a week, probably enough time for weekly
> (roughly) windows reboots and profile syncs across multiple clients
> to force the arc to fill.

In any case, the ARC size is not an indicator of a memory shortfall. The next time it happens, look at the scan rate in vmstat for an indication of memory shortfall. Then proceed to debug accordingly. An excellent book on this topic is the Solaris Performance and Tools companion to Solaris Internals.

> I have read through and follow the advice on the tuning guide, but
> still see Windows users with roaming profiles getting very slow
> profile syncs. This makes me think that zfs isn't handling the random
> i/o generated by a profile sync very well. Well, at least that's what
> I'm thinking when I see an arc size of 1G, there is at least another
> free gig of memory, and the clients syncing more than a gig of data
> fairly often.

By default, the ARC leaves 1 GByte of memory free. This may or may not be appropriate for your system, which is why there are some tuning suggestions in various places. There is also an issue with the decision to cache versus flush for writes, and the interaction with write throttles. Roch did a nice writeup on changes in this area. You may be running into this, but IMHO it shouldn't appear to be a memory shortfall. Check Roch's blog to see if the symptoms are similar:
http://blogs.sun.com/roch/entry/the_new_zfs_write_throttle
-- richard
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZSF Solaris
On 1-Oct-08, at 1:56 AM, Ram Sharma wrote:
> Hi Guys, Thanks for so many good comments. Perhaps I got even more
> than what I asked for! I am targeting 1 million users for my
> application. My DB will be on a Solaris machine. And the reason I am
> making one table per user is that it will be a simple design as
> compared to keeping all the data in a single table.

You have a green light from ZFS experts, but there is no way you'd get that schema past a good DBA. This design will fail you long before you get near a million users.

--Toby

> In that case I need to worry about things like horizontal
> partitioning, which in turn will require a higher level of management.
> So for storing 1 million MyISAM tables (MyISAM being a good performer
> when it comes to not very large data), I need to save 3 million data
> files in a single folder on disk. This is the way MyISAM saves data.
> I will never need to do an ls on this folder. This folder (~database)
> will be used just by the MySQL engine to execute my SQL queries and
> fetch me results. And now that ZFS allows me to do this easily, I
> believe I can go forward with this design easily. Correct me if I am
> missing something.
> -- This message posted from opensolaris.org
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Oracle DB sequential dump questions
Carson Gaspar <[EMAIL PROTECTED]> writes:
> Joerg Schilling wrote:
>> Carson Gaspar <[EMAIL PROTECTED]> wrote:
>>> Louwtjie Burger wrote:
>>>> Dumping a large file from memory using tar to LTO yields 44 MB/s
>>>> ... I suspect the CPU cannot push more since it's a single thread
>>>> doing all the work. Dumping oracle db files from filesystem yields
>>>> ~ 25 MB/s. The interesting bit (apart from it being a rather slow
>>>> speed) is the fact that the speed fluctuates from the disk area..
>>>> but stays constant to the tape. I see up to 50-60 MB/s spikes over
>>>> 5 seconds, while the tape continues to push it's steady 25 MB/s.
>>>
>>> ... Does your tape drive compress (most do)? If so, you may be
>>> seeing compressible vs. uncompressible data effects.
>>
>> HW compression in the tape drive usually increases the speed of the
>> drive.
>
> Yes. Which is exactly what I was saying. The tar data might be more
> compressible than the DB, thus be faster. Shall I draw you a picture,
> or are you too busy shilling for star at every available opportunity?

Sheesh, calm down, man.

Boyd
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
Tim <tim at tcsac.net> writes:
> That's because the faster SATA drives cost just as much money as their
> SAS counterparts for less performance and none of the advantages SAS
> brings such as dual ports.

SAS drives are far from always being the best choice, because absolute IOPS or throughput numbers do not matter. What only matters in the end is (TB, throughput, or IOPS) per (dollar, Watt, or Rack Unit). 7500rpm (SATA) drives clearly provide the best TB/$, throughput/$, and IOPS/$. You can't argue against that. To paraphrase what was said earlier in this thread, to get the best IOPS out of $1000, spend your money on 10 7500rpm (SATA) drives instead of 3 or 4 15000rpm (SAS) drives. Similarly, for the best IOPS/RU, 15000rpm drives have the advantage. Etc.

-marc
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
Marc Bevand wrote:
> SAS drives are far from always being the best choice, because absolute
> IOPS or throughput numbers do not matter. What only matters in the end
> is (TB, throughput, or IOPS) per (dollar, Watt, or Rack Unit). 7500rpm
> (SATA) drives clearly provide the best TB/$, throughput/$, and IOPS/$.
> You can't argue against that. To paraphrase what was said earlier in
> this thread, to get the best IOPS out of $1000, spend your money on 10
> 7500rpm (SATA) drives instead of 3 or 4 15000rpm (SAS) drives.
> Similarly, for the best IOPS/RU, 15000rpm drives have the advantage.
> Etc.
> -marc

Be very careful about that. 73GB SAS drives aren't that expensive, so you can get 6 x 73GB 15k SAS drives for the same amount as 11 x 250GB SATA drives (per Sun list pricing for J4200 drives). SATA doesn't always win the IOPS/$. Remember, a SAS drive can provide more than 2x the number of IOPS a SATA drive can. Likewise, throughput on a 15k drive can be roughly 2x a 7.2k drive, depending on I/O load.

--
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss