Re: [zfs-discuss] Petabytes on a budget - blog
On Sat, Sep 5, 2009 at 12:30 AM, Marc Bevand wrote:
> Tim Cook cook.ms> writes:
> >
> > What's the point of arguing what the back-end can do anyways? This is bulk data storage. Their MAX input is ~100MB/sec. The backend can more than satisfy that. Who cares at that point whether it can push 500MB/s or 5000MB/s? It's not a database processing transactions. It only needs to be able to push as fast as the front-end can go.
> > --Tim
>
> True, what they have is sufficient to match GbE speed. But internal I/O throughput matters for resilvering RAID arrays, scrubbing, local data analysis/processing, etc. In their case they have 3 15-drive RAID6 arrays per pod. If their layout is optimal they put 5 drives on the PCI bus (to minimize this number) & 10 drives behind PCI-E links per array, so this means the PCI bus's ~100MB/s practical bandwidth is shared by 5 drives, so 20MB/s per (1.5TB-)drive, so it is going to take a minimum of 20.8 hours to resilver one of their arrays.
>
> -mrb

But none of that matters. The data is replicated at a higher layer, combined with RAID-6. They'd have to see a triple disk failure across multiple arrays at the same time. They aren't concerned with performance; the home users they're backing up aren't ever going to get anything remotely close to GigE speeds. The absolute BEST case scenario *MIGHT* push 20Mbit if the end user is lucky enough to have FiOS or DOCSIS 3.0 in their area, and has large files with a clean link. Even rebuilding two failed disks, that setup will push 2MB/sec all day long.

--Tim
Re: [zfs-discuss] Petabytes on a budget - blog
Tim Cook cook.ms> writes:
>
> What's the point of arguing what the back-end can do anyways? This is bulk data storage. Their MAX input is ~100MB/sec. The backend can more than satisfy that. Who cares at that point whether it can push 500MB/s or 5000MB/s? It's not a database processing transactions. It only needs to be able to push as fast as the front-end can go.
> --Tim

True, what they have is sufficient to match GbE speed. But internal I/O throughput matters for resilvering RAID arrays, scrubbing, local data analysis/processing, etc. In their case they have 3 15-drive RAID6 arrays per pod. If their layout is optimal they put 5 drives on the PCI bus (to minimize this number) & 10 drives behind PCI-E links per array, so this means the PCI bus's ~100MB/s practical bandwidth is shared by 5 drives, so 20MB/s per (1.5TB-)drive, so it is going to take a minimum of 20.8 hours to resilver one of their arrays.

-mrb
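For reference, the 20.8-hour figure is just the stated assumptions worked through (a full 1.5 TB drive rewritten at a sustained 20 MB/s, decimal units):

    1.5 TB / 20 MB/s = 1,500,000 MB / 20 MB/s = 75,000 s ≈ 20.8 hours

A real resilver that also has to compete with client I/O would take longer still.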
[zfs-discuss] incremental send/recv larger than sum of snapshots?
I've been sending daily incrementals off-site for a while now, but recently they failed, so I had to send an incremental covering a number of snapshots. I expected the incremental to be approximately the sum of the snapshots, but it seems to be considerably larger and still going.

The source machine is nv72 and the destination is nv99. I send/recv with this command:

/usr/sbin/zfs send -i tank/v...@2009-08-15 tank/v...@2009-08-26 | bzip2 -c | ssh offsite-computer "bzcat | /usr/sbin/zfs recv -F tank/vm"

The sum of the 11 days of snapshots is about 100G, but I see the remote computer registering over 130G. I'm pushing this over a single T1, so the process has been running for about a week. Is this expected? If so, is there any way I can calculate how much data will need to be transferred?

Here is a snippet of zfs list on the source:

tank/v...@2009-08-14   8.46G      -   440G  -
tank/v...@2009-08-15   7.49G      -   440G  -
tank/v...@2009-08-16   7.42G      -   440G  -
tank/v...@2009-08-17   7.45G      -   441G  -
tank/v...@2009-08-18   11.0G      -   538G  -
tank/v...@2009-08-19   11.1G      -   479G  -
tank/v...@2009-08-20   11.1G      -   479G  -
tank/v...@2009-08-21   7.61G      -   480G  -
tank/v...@2009-08-22   6.45G      -   481G  -
tank/v...@2009-08-23   7.31G      -   481G  -
tank/v...@2009-08-24   9.66G      -   481G  -
tank/v...@2009-08-25   10.1G      -   481G  -
tank/v...@2009-08-26   12.5G      -   481G  -

And the remote:

tank/v...@2009-08-14    8.46G     -   440G  -
tank/v...@2009-08-15    9.38G     -   440G  -
tank/vm/%2009-08-26      136G  867G   475G  /tank/vm/%2009-08-26
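One crude way to see how big an incremental stream really is, before committing it to a slow link, is to run the same send locally and just count the bytes. This reads the whole stream, so it takes a while; the snapshot names below assume the dataset is tank/vm, as the recv target in the command above suggests:

    # measure the uncompressed size of the incremental stream
    /usr/sbin/zfs send -i tank/vm@2009-08-15 tank/vm@2009-08-26 | wc -c

Note that the per-snapshot USED column in 'zfs list' only counts blocks unique to each snapshot, so summing it is at best a rough lower bound on the stream size.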
Re: [zfs-discuss] Pulsing write performance
On Sep 4, 2009, at 10:02 PM, David Magda wrote:

> On Sep 4, 2009, at 21:44, Ross Walker wrote:
>
>> Though I have only heard good comments from my ESX admins since moving the VMs off iSCSI and on to ZFS over NFS, so it can't be that bad.
>
> What's your pool configuration? Striped mirrors? RAID-Z with SSDs? Other?

Striped mirrors off an NVRAM-backed controller (Dell PERC 6/E). RAID-Z isn't the best for many VMs as the whole vdev acts as a single disk for random I/O.

-Ross
Re: [zfs-discuss] Motherboard for home zfs/solaris file server
On Thu, Sep 3, 2009 at 4:57 AM, Karel Gardas wrote:
> Hello,
> your "(open)solaris for ECC support (which seems to have been dropped from 200906)" is a misunderstanding. OS 2009.06 also supports ECC as 2005 did. Just install it and use my updated ecccheck.pl script to get informed about errors. Also you might verify that Solaris' memory scrubber is really running if you are that curious:
> http://developmentonsolaris.wordpress.com/2009/03/06/how-to-make-sure-memory-scrubber-is-running/
> Karel

Is there something that needs to be done on the Solaris side for memscrub scans to occur? I'm running snv_118, with a Supermicro board running ECC memory and AMD Opteron CPUs. It would appear it's doing a lot of nothing.

Aug  8 03:56:23 fserv unix: [ID 950921 kern.info] cpu0: x86 (chipid 0x0 AuthenticAMD 40F13 family 15 model 65 step 3 clock 2010 MHz)
Aug  8 03:56:23 fserv unix: [ID 950921 kern.info] cpu0: Dual-Core AMD Opteron(tm) Processor 2212

r...@fserv:~# isainfo -v
64-bit amd64 applications
        tscp ahf cx16 sse3 sse2 sse fxsr amd_3dnowx amd_3dnow amd_mmx mmx cmov amd_sysc cx8 tsc fpu
32-bit i386 applications
        tscp ahf cx16 sse3 sse2 sse fxsr amd_3dnowx amd_3dnow amd_mmx mmx cmov amd_sysc cx8 tsc fpu

r...@fserv:~# echo "memscrub_scans_done/U" | mdb -k
memscrub_scans_done:
memscrub_scans_done:            0
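If it helps, a couple of other scrubber variables can be poked the same way. The names below are the ones used by the x86 memscrub code; treat them as an assumption and cross-check against the blog post above:

    # is the scrubber disabled, and how long is one scrub interval?
    echo "disable_memscrub/X" | mdb -k
    echo "memscrub_period_sec/D" | mdb -k

If disable_memscrub is non-zero, or the period is longer than the box has been up, zero completed scans would be the expected result.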
Re: [zfs-discuss] Pulsing write performance
On Sep 4, 2009, at 21:44, Ross Walker wrote:

> Though I have only heard good comments from my ESX admins since moving the VMs off iSCSI and on to ZFS over NFS, so it can't be that bad.

What's your pool configuration? Striped mirrors? RAID-Z with SSDs? Other?
Re: [zfs-discuss] Pulsing write performance
On Fri, 4 Sep 2009, Ross Walker wrote:

I have yet to see a read happen during the write flush either. That impacts my application since it needs to read in order to proceed, and it does a similar amount of writes as it does reads. The ARC makes it hard to tell if they are satisfied from cache or blocked due to writes. The existing prefetch bug makes it doubly hard. :-)

First I complained about the blocking reads, and then I complained about the blocking writes (presumed responsible for the blocking reads), and now I am waiting for working prefetch in order to feed my hungry application.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Re: [zfs-discuss] Pulsing write performance
On Sep 4, 2009, at 8:59 PM, Bob Friesenhahn wrote:

> On Fri, 4 Sep 2009, Ross Walker wrote:
>> I guess one can find a silver lining in any grey cloud, but for myself I'd just rather see a more linear approach to writes. Anyway I have never seen any reads happen during these write flushes.
>
> I have yet to see a read happen during the write flush either. That impacts my application since it needs to read in order to proceed, and it does a similar amount of writes as it does reads. The ARC makes it hard to tell if they are satisfied from cache or blocked due to writes.

I suppose if you have the hardware to go sync, that might be the best bet. That and limiting the write cache.

Though I have only heard good comments from my ESX admins since moving the VMs off iSCSI and on to ZFS over NFS, so it can't be that bad.

-Ross
Re: [zfs-discuss] Pulsing write performance
On Fri, 4 Sep 2009, Ross Walker wrote:

> I guess one can find a silver lining in any grey cloud, but for myself I'd just rather see a more linear approach to writes. Anyway I have never seen any reads happen during these write flushes.

I have yet to see a read happen during the write flush either. That impacts my application since it needs to read in order to proceed, and it does a similar amount of writes as it does reads.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Re: [zfs-discuss] PMP support in Opensolaris
On Fri, Sep 4, 2009 at 1:12 PM, Nigel Smith wrote:
> Let us know if you can get the port multipliers working..
>
> But remember, there is a problem with ZFS raidz in that release, so be careful:

I saw that, so I think I'll be waiting until snv_124 to update. The system that I'm thinking of using currently only has mirrored vdevs, however, so it shouldn't be any risk.

Something like one of the following seems reasonable to add a few drives to an existing system, although eSATA just seems like a bad idea for a number of reasons:
http://www.newegg.com/Product/Product.aspx?Item=N82E16816132016
http://www.newegg.com/Product/Product.aspx?Item=N82E16816111057

A good use that I can see is combining an Intel D945GCLF2 board with a case that has more than 2 drive bays, using an internal PMP. One of the systems I have is an Atom board in a small Chenbro 2-bay case, which gives surprisingly good performance and is . There is a 4-bay version available, but lack of SATA ports on the motherboard kept me from using it.
http://www.cooldrives.com/siseata5pomu.html
http://www.newegg.com/Product/Product.aspx?Item=N82E16811123122

-B
--
Brandon High : bh...@freaks.com
Re: [zfs-discuss] Pulsing write performance
On Sep 4, 2009, at 5:25 PM, Scott Meilicke wrote:

> I only see the blocking while load testing, not during regular usage, so I am not so worried. I will try the kernel settings to see if that helps if/when I see the issue in production.
>
> For what it is worth, here is the pattern I see when load testing NFS (iometer, 60% random, 65% read, 8k chunks, 32 outstanding I/Os):
>
> data01   59.6G  20.4T     46     24   757K  3.09M
> data01   59.6G  20.4T     39     24   593K  3.09M
> data01   59.6G  20.4T     45     25   687K  3.22M
> data01   59.6G  20.4T     45     23   683K  2.97M
> data01   59.6G  20.4T     33     23   492K  2.97M
> data01   59.6G  20.4T     16     41   214K  1.71M
> data01   59.6G  20.4T      3  2.36K  53.4K  30.4M
> data01   59.6G  20.4T      1  2.23K  20.3K  29.2M
> data01   59.6G  20.4T      0  2.24K  30.2K  28.9M
> data01   59.6G  20.4T      0  1.93K  30.2K  25.1M
> data01   59.6G  20.4T      0  2.22K      0  28.4M
> data01   59.7G  20.4T     21    295   317K  4.48M
> data01   59.7G  20.4T     32     12   495K  1.61M
> data01   59.7G  20.4T     35     25   515K  3.22M
> data01   59.7G  20.4T     36     11   522K  1.49M
> data01   59.7G  20.4T     33     24   508K  3.09M
>
> LSI SAS HBA, 3 x 5 disk raidz, Dell 2950, 16GB RAM.

With that setup you'll see at most 3x the IOPS of the underlying disk type, not really the kind of setup for a 60% random workload. Assuming 2TB SATA drives, the max IOPS would be around 240. Now if it were mirror vdevs you'd get 7x, or 560 IOPS.

Is this for VMware or data warehousing?

You'll also need an SSD drive in the mix if you're not using a controller with NVRAM write-back, especially when sharing over NFS.

I guess since it's 15 drives it's an MD1000; I might have gone with the newer 2.5" drive enclosure as it holds 24 over 15, and most SSDs come in 2.5". Since you got it already, invest in a PERC 6/E with 512MB of cache and stick it in the other PCIe 8x slot.

-Ross
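To spell out the arithmetic behind those numbers (assuming the usual rule of thumb of roughly 80 random IOPS per 7200 RPM SATA drive): each raidz vdev delivers about the random-read IOPS of a single member drive, so 3 vdevs ≈ 3 × 80 = 240 IOPS, while the same 15 drives arranged as 7 mirror pairs (plus a spare) ≈ 7 × 80 = 560 IOPS.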
Re: [zfs-discuss] Pulsing write performance
On Sep 4, 2009, at 6:33 PM, Bob Friesenhahn wrote:

> On Fri, 4 Sep 2009, Scott Meilicke wrote:
>> I only see the blocking while load testing, not during regular usage, so I am not so worried. I will try the kernel settings to see if that helps if/when I see the issue in production.
>
> The flipside of the "pulsing" is that the deferred writes diminish contention for precious read IOPS, and quite a few programs have a habit of updating/rewriting a file over and over again. If the file is completely asynchronously rewritten once per second and zfs writes a transaction group every 30 seconds, then 29 of those updates avoided consuming write IOPS. Another benefit is that if zfs has more data in hand to write, then it can do a much better job of avoiding fragmentation, avoid unnecessary COW by diminishing short tail writes, and achieve more optimum write patterns.

I guess one can find a silver lining in any grey cloud, but for myself I'd just rather see a more linear approach to writes.

Anyway, I have never seen any reads happen during these write flushes.

-Ross
Re: [zfs-discuss] Pulsing write performance
On Fri, 4 Sep 2009, Scott Meilicke wrote:

> I only see the blocking while load testing, not during regular usage, so I am not so worried. I will try the kernel settings to see if that helps if/when I see the issue in production.

The flipside of the "pulsing" is that the deferred writes diminish contention for precious read IOPS, and quite a few programs have a habit of updating/rewriting a file over and over again. If the file is completely asynchronously rewritten once per second and zfs writes a transaction group every 30 seconds, then 29 of those updates avoided consuming write IOPS.

Another benefit is that if zfs has more data in hand to write, then it can do a much better job of avoiding fragmentation, avoid unnecessary COW by diminishing short tail writes, and achieve more optimum write patterns.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Re: [zfs-discuss] zfs compression algorithm : jpeg ??
On Fri, 4 Sep 2009, Louis-Frédéric Feuillette wrote:

> JPEG2000 uses arithmetic encoding to do the final compression step. Arithmetic encoding has a higher compression rate (in general) than gzip-9, lzjb or others. There is an opensource implementation of jpeg2000 called jasper[1]. Jasper is the reference implementation for jpeg2000, meaning that all other jpeg2000 programs must verify their output against that of jasper (kinda).

Jasper is incredibly slow and consumes a large amount of memory. Other JPEG2000 programs are validated by how many times faster they are than Jasper. :-)

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Re: [zfs-discuss] check a zfs rcvd file
On 09/04/09 10:17, dick hoogendijk wrote:

> Lori Alt wrote:
>> The -u option to zfs recv (which was just added to support flash archive installs, but it's useful for other reasons too) suppresses all mounts of the received file systems. So you can mount them yourself afterward in whatever order is appropriate, or do a 'zfs mount -a'.
>
> You misunderstood my problem. It is very convenient that the filesystems are not mounted. I only wish they could stay that way! Alas, they ARE mounted (even if I don't want them to be) when I *reboot* the system. And THAT's when things get ugly. I then have different zfs filesystems using the same mountpoints! The backed-up ones have the same mountpoints as their origin :-/
>
> -> The only way to stop it is to *export* the "backup" zpool OR to change *manually* the zfs prop "canmount=noauto" in all backed-up snapshots/filesystems. As I understand it, I cannot give this "canmount=noauto" to the zfs receive command.
>
> # zfs send -Rv rp...@0909 | zfs receive -Fdu backup/snaps

There is an RFE to allow zfs recv to assign properties, but I'm not sure whether it would help in your case. I would have thought that "canmount=noauto" would have already been set on the sending side, however. In that case, the property should be preserved when the stream is received. But if for some reason you're not setting that property on the sending side, but want it set on the receiving side, you might have to write a script to set the properties for all those datasets after they are received.

lori
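A minimal sketch of such a script, using the backup/snaps name from the example above (canmount is not a recursive property, so it has to be set per dataset):

    # set canmount=noauto on every filesystem under the received backup tree
    for fs in $(zfs list -H -o name -t filesystem -r backup/snaps); do
        zfs set canmount=noauto "$fs"
    done

Run it on the receiving side after each receive; only filesystems take the property, snapshots do not.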
Re: [zfs-discuss] zfs compression algorithm : jpeg ??
On Fri, Sep 04, 2009 at 01:41:15PM -0700, Richard Elling wrote:
> On Sep 4, 2009, at 12:23 PM, Len Zaifman wrote:
> > We have groups generating terabytes a day of image data from lab instruments and saving them to an X4500.
>
> Wouldn't it be easier to compress at the application, or between the application and the archiving file system?

Especially when it comes to reading the images back! ZFS compression is transparent. You can't write uncompressed data then read back compressed data. And compression is at the block level, not for the whole file, so even if you could read it back compressed, it wouldn't be in a useful format.

Most people want to transfer data compressed, particularly images. So compressing at the application level in this case seems best to me.

Nico
--
Re: [zfs-discuss] Pulsing write performance
I only see the blocking while load testing, not during regular usage, so I am not so worried. I will try the kernel settings to see if that helps if/when I see the issue in production.

For what it is worth, here is the pattern I see when load testing NFS (iometer, 60% random, 65% read, 8k chunks, 32 outstanding I/Os):

data01   59.6G  20.4T     46     24   757K  3.09M
data01   59.6G  20.4T     39     24   593K  3.09M
data01   59.6G  20.4T     45     25   687K  3.22M
data01   59.6G  20.4T     45     23   683K  2.97M
data01   59.6G  20.4T     33     23   492K  2.97M
data01   59.6G  20.4T     16     41   214K  1.71M
data01   59.6G  20.4T      3  2.36K  53.4K  30.4M
data01   59.6G  20.4T      1  2.23K  20.3K  29.2M
data01   59.6G  20.4T      0  2.24K  30.2K  28.9M
data01   59.6G  20.4T      0  1.93K  30.2K  25.1M
data01   59.6G  20.4T      0  2.22K      0  28.4M
data01   59.7G  20.4T     21    295   317K  4.48M
data01   59.7G  20.4T     32     12   495K  1.61M
data01   59.7G  20.4T     35     25   515K  3.22M
data01   59.7G  20.4T     36     11   522K  1.49M
data01   59.7G  20.4T     33     24   508K  3.09M

LSI SAS HBA, 3 x 5 disk raidz, Dell 2950, 16GB RAM.

-Scott
Re: [zfs-discuss] zfs compression algorithm : jpeg ??
On Fri, 2009-09-04 at 13:41 -0700, Richard Elling wrote:
> On Sep 4, 2009, at 12:23 PM, Len Zaifman wrote:
> > We have groups generating terabytes a day of image data from lab instruments and saving them to an X4500.
>
> Wouldn't it be easier to compress at the application, or between the application and the archiving file system?

Preamble: I am actively doing research into image set compression, specifically jpeg2000, so this is my point of reference.

I think it would be easier to compress at the application level. I would suggest getting the image from the source, then using lossless jpeg2000 compression on it, and saving the result to an uncompressed ZFS pool.

JPEG2000 uses arithmetic encoding to do the final compression step. Arithmetic encoding has a higher compression rate (in general) than gzip-9, lzjb or others. There is an opensource implementation of jpeg2000 called jasper[1]. Jasper is the reference implementation for jpeg2000, meaning that all other jpeg2000 programs must verify their output against that of jasper (kinda).

Saving the jpeg2000 image to an uncompressed ZFS filesystem will be the fastest thing. Since jpeg2000 is already compressed, trying to compress it will not yield any storage space reduction; in fact it may _increase_ the size of the data stored on disk. Since good compression algorithms produce random-looking data, you can see why running on a compressed pool would be bad for performance.

[1] http://www.ece.uvic.ca/~mdadams/jasper

On a side note, if you want to know how arithmetic encoding works, Wikipedia[2] has a really nice explanation. Suffice it to say, in theory (without considering implementation details) arithmetic encoding can encode _any_ data at the rate of data_entropy*num_of_symbols + data_symbol_table. In practice this doesn't happen due to floating point overflows and some other issues.

[2] http://en.wikipedia.org/wiki/Arithmetic_coding

--
Louis-Frédéric Feuillette
Re: [zfs-discuss] Pulsing write performance
On Sep 4, 2009, at 4:33 PM, Scott Meilicke wrote:

> Yes, I was getting confused. Thanks to you (and everyone else) for clarifying. Sync or async, I see the txg flushing to disk starve read IO.

Well, try the kernel setting and see how it helps.

Honestly though, if you can say it's all sync writes with certainty and IO is still blocking, you need a better storage sub-system, or an additional pool.

-Ross
Re: [zfs-discuss] zfs compression algorithm : jpeg ??
On Sep 4, 2009, at 12:23 PM, Len Zaifman wrote:

> We have groups generating terabytes a day of image data from lab instruments and saving them to an X4500.

Wouldn't it be easier to compress at the application, or between the application and the archiving file system?

> We have tried
>   lzjb:    compressratio = 1.13 in 11 hours, 1.3 TB -> 1.1 TB
>   gzip-9:  compressratio = 1.68 in > 37 hours, 1.3 TB -> 0.75 TB
>
> The filesystem performance was noticeably laggy (ie ls took > 10 seconds) while gzip-9 compression was used.
>
> Do you have any idea if lossless jpeg compression is being planned for ZFS? We envisage that of the 1.3 TB, > 0.8 TB will be images, and if we could get better or equivalent compression with jpeg lossless compression, with less impact on the filesystem than gzip-9 compression, that would be worthwhile, if it worked.

I don't know of anyone working on that specific compression scheme, but I've put together some thoughts on the subject of adding a new compressor to ZFS. Perhaps others could comment?
http://richardelling.blogspot.com/2009/08/justifying-new-compression-algorithms.html

-- richard
Re: [zfs-discuss] Pulsing write performance
Yes, I was getting confused. Thanks to you (and everyone else) for clarifying. Sync or async, I see the txg flushing to disk starve read IO.

Scott
Re: [zfs-discuss] Pulsing write performance
On Sep 4, 2009, at 2:22 PM, Scott Meilicke wrote:

> So, I just re-read the thread, and you can forget my last post. I had thought the argument was that the data were not being written to disk twice (assuming no separate device for the ZIL), but it was just explaining to me that the data are not read from the ZIL to disk, but rather from memory to disk.
>
> I need more coffee...

I think you're confusing ARC write-back with the ZIL, and it isn't the sync writes that are blocking IO, it's the async writes that have been cached and are now being flushed.

Just tell the ARC to cache less IO for your hardware with the kernel config Bob mentioned way back.

-Ross
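The thread doesn't restate the exact tunable here; on builds of this vintage the usual knob for capping how much dirty data gets batched into each transaction group is zfs_write_limit_override, so a sketch of what is being suggested might look like this in /etc/system (the 512MB value is only an example to size against your own hardware):

    * cap the amount of data ZFS will accept per transaction group
    set zfs:zfs_write_limit_override = 0x20000000

A reboot is needed for /etc/system changes to take effect; on systems that have the variable it can also be adjusted live with mdb -kw.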
Re: [zfs-discuss] PMP support in Opensolaris
Hi Brandon

To answer your question, all you need to do is look up those bug numbers:
http://bugs.opensolaris.org/view_bug.do?bug_id=6422924
http://bugs.opensolaris.org/view_bug.do?bug_id=6691950
..and you see the fix should be in release snv_122.

You're in luck, as the OpenSolaris dev repository was updated to snv_122 yesterday:
http://mail.opensolaris.org/pipermail/opensolaris-announce/2009-September/001256.html
http://pkg.opensolaris.org/dev/en/index.shtml

Let us know if you can get the port multipliers working..

But remember, there is a problem with ZFS raidz in that release, so be careful:
http://mail.opensolaris.org/pipermail/zfs-discuss/2009-September/031434.html

Regards
Nigel Smith
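For anyone following along, pulling a build from the dev repository is roughly the following (shown as an illustration; the publisher name and URL are the standard ones for a 2009.06-era install):

    # point the image at the dev repository and update
    pfexec pkg set-publisher -O http://pkg.opensolaris.org/dev/ opensolaris.org
    pfexec pkg image-update

The update creates a new boot environment, so you can boot back into the old one if snv_122 gives you trouble.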
[zfs-discuss] PMP support in Opensolaris
On Wed, Sep 2, 2009 at 4:56 PM, David Magda wrote:
> Said support was committed only two to three weeks ago:
>
>> PSARC/2009/394 SATA Framework Port Multiplier Support
>> 6422924 sata framework has to support port multipliers
>> 6691950 ahci driver needs to support SIL3726/4726 SATA port multiplier

When is this going to show up in the repo at http://pkg.opensolaris.org/dev/ ? Is it already there? Sorry if it's a dumb question, but I'm not sure where to look, so the release process is a bit opaque to me.

-B
--
Brandon High : bh...@freaks.com
[zfs-discuss] zfs compression algorithm : jpeg ??
We have groups generating terabytes a day of image data from lab instruments and saving them to an X4500. We have tried

  lzjb:    compressratio = 1.13 in 11 hours, 1.3 TB -> 1.1 TB
  gzip-9:  compressratio = 1.68 in > 37 hours, 1.3 TB -> 0.75 TB

The filesystem performance was noticeably laggy (ie ls took > 10 seconds) while gzip-9 compression was used.

Do you have any idea if lossless jpeg compression is being planned for ZFS? We envisage that of the 1.3 TB, > 0.8 TB will be images, and if we could get better or equivalent compression with jpeg lossless compression, with less impact on the filesystem than gzip-9 compression, that would be worthwhile, if it worked.
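For reference, the two runs above presumably correspond to setting the compression property on the dataset holding the images, something like the following (the dataset name is only illustrative):

    # lzjb is the lightweight default algorithm; gzip-9 trades CPU time for ratio
    zfs set compression=lzjb tank/images
    zfs set compression=gzip-9 tank/images
    zfs get compressratio tank/images

Note the property only affects blocks written after it is set, so comparing algorithms requires rewriting the data each time.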
Re: [zfs-discuss] Pulsing write performance
Scott Meilicke wrote:
> I am still not buying it :) I need to research this to satisfy myself. I can understand that the writes come from memory to disk during a txg write for async, and that is the behavior I see in testing. But for sync, data must be committed, and a SSD/ZIL makes that faster because you are writing to the SSD/ZIL, and not to spinning disk. Eventually that data on the SSD must get to spinning disk.

But the txg (which may contain more data than just the sync data that was written to the ZIL) is still written from memory. Just because the sync data was written to the ZIL doesn't mean it's not still in memory.

-Kyle

> To the books I go!
> -Scott
Re: [zfs-discuss] Petabytes on a budget - blog
On Fri, Sep 4, 2009 at 5:36 AM, Marc Bevand wrote:
> Marc Bevand gmail.com> writes:
> >
> > So in conclusion, my SBNSWAG (scientific but not so wild-ass guess) is that the max I/O throughput when reading from all the disks on 1 of their storage pods is about 1000MB/s.
>
> Correction: the SiI3132 are on x1 (not x2) links, so my guess as to the aggregate throughput when reading from all the disks is:
> 3*150+100 = 550MB/s.
> (150MB/s is 60% of the max theoretical 250MB/s bandwidth of an x1 link)
>
> And if they tuned MAX_PAYLOAD_SIZE to allow the 3 PCI-E SATA cards to exploit closer to the max theoretical bandwidth of an x1 PCI-E link, it would be:
> 3*250+100 = 850MB/s.
>
> -mrb

What's the point of arguing what the back-end can do anyways? This is bulk data storage. Their MAX input is ~100MB/sec. The backend can more than satisfy that. Who cares at that point whether it can push 500MB/s or 5000MB/s? It's not a database processing transactions. It only needs to be able to push as fast as the front-end can go.

--Tim
Re: [zfs-discuss] Pulsing write performance
So, I just re-read the thread, and you can forget my last post. I had thought the argument was that the data were not being written to disk twice (assuming no separate device for the ZIL), but it was just explaining to me that the data are not read from the ZIL to disk, but rather from memory to disk.

I need more coffee...
Re: [zfs-discuss] Pulsing write performance
Doh! I knew that, but then forgot...

So, for the case of no separate device for the ZIL, the ZIL lives on the disk pool. In which case, the data are written to the pool twice during a sync:

1. To the ZIL (on disk)
2. From RAM to disk during the txg commit

If this is correct (and my history in this thread is not so good, so...), would that then explain some sort of pulsing write behavior for sync write operations?
Re: [zfs-discuss] Pulsing write performance
Scott Meilicke wrote:
> So what happens during the txg commit?
>
> For example, if the ZIL is a separate device, SSD for this example, does it not work like:
>
> 1. A sync operation commits the data to the SSD
> 2. A txg commit happens, and the data from the SSD are written to the spinning disk

#1 is correct. #2 is incorrect. The TXG commit goes from memory into the main pool. The SSD data is simply left there in case something bad happens before the TXG commit succeeds. Once it succeeds, then the SSD data can be overwritten. The only time you need to read from a ZIL device is if a crash occurs and you need those blocks to repair the pool.

Eric
Re: [zfs-discuss] Understanding when (and how) ZFS will use spare disks
This sounds like the same behavior as OpenSolaris 2009.06. I had several disks recently go UNAVAIL, and the spares did not take over. But as soon as I physically removed a disk, the spare started replacing the removed disk. It seems UNAVAIL is not the same as the disk not being there. I wish the spare *would* take over in these cases, since the pool is degraded.

-Scott
Re: [zfs-discuss] Pulsing write performance
I am still not buying it :) I need to research this to satisfy myself.

I can understand that the writes come from memory to disk during a txg write for async, and that is the behavior I see in testing. But for sync, data must be committed, and a SSD/ZIL makes that faster because you are writing to the SSD/ZIL, and not to spinning disk. Eventually that data on the SSD must get to spinning disk.

To the books I go!

-Scott
Re: [zfs-discuss] Pulsing write performance
On Fri, 4 Sep 2009, Scott Meilicke wrote:

> So what happens during the txg commit?
>
> For example, if the ZIL is a separate device, SSD for this example, does it not work like:
>
> 1. A sync operation commits the data to the SSD
> 2. A txg commit happens, and the data from the SSD are written to the spinning disk
>
> So this is two writes, correct?

From past descriptions, the slog is basically a list of pending write system calls. The only time the slog is read is after a reboot. Otherwise, the slog is simply updated as write operations proceed.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Re: [zfs-discuss] Pulsing write performance
So what happens during the txg commit?

For example, if the ZIL is a separate device, SSD for this example, does it not work like:

1. A sync operation commits the data to the SSD
2. A txg commit happens, and the data from the SSD are written to the spinning disk

So this is two writes, correct?

-Scott
[zfs-discuss] Understanding when (and how) ZFS will use spare disks
We have a number of shared spares configured in our ZFS pools, and we're seeing weird issues where spares don't get used under some circumstances. We're running Solaris 10 U6 using pools made up of mirrored vdevs, and what I've seen is:

* if ZFS detects enough checksum errors on an active disk, it will automatically pull in a spare.

* if the system reboots without some of the disks available (so that half of the mirrored pairs drop out, for example), spares will *not* get used. ZFS recognizes that the disks are not there; they are marked as UNAVAIL and the vdevs (and pools) as DEGRADED, but it doesn't try to use spares. (This is in a SAN environment where half of all of the mirrors come from one controller and half come from another one.)

All of this makes me think that I don't understand how ZFS spares really work, and under what circumstances they'll get used. Does anyone know if there's a writeup of this somewhere?

(What I've gathered so far from reading zfs-discuss archives is that ZFS spares are not handled automatically in the kernel code but are instead deployed to pools by a fmd ZFS management module[*], doing more or less 'zpool replace ' (presumably through an internal code path, since 'zpool history' doesn't seem to show spare deployment). Is this correct?)

Also, searching turns up some old zfs-discuss messages suggesting that not bringing in spares in response to UNAVAIL disks was a bug that's now fixed in at least OpenSolaris. If so, does anyone know if the fix has made it into S10 U7 (or is planned or available as a patch)?

Thanks in advance.

- cks

[*: http://blogs.sun.com/eschrock/entry/zfs_hot_spares suggests that it is 'zfs-retire', which is separate from 'zfs-diagnosis'.]
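In the meantime, when the retire agent doesn't kick in, a spare can be attached by hand with essentially the same command the agent would issue (device names below are purely illustrative):

    # replace the UNAVAIL disk with one of the pool's configured spares
    zpool replace tank c3t2d0 c5t1d0
    zpool status tank

Once the resilver completes, the spare shows as INUSE; detaching the failed device with 'zpool detach' then makes the spare a permanent member of the mirror.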
Re: [zfs-discuss] Pulsing write performance
On 09/04/09 09:54, Scott Meilicke wrote:
> Roch Bourbonnais wrote:
>> "100% random writes produce around 200 IOPS with a 4-6 second pause around every 10 seconds."
>>
>> This indicates that the bandwidth you're able to transfer through the protocol is about 50% greater than the bandwidth the pool can offer to ZFS. Since this is not sustainable, you see here ZFS trying to balance the 2 numbers.
>
> When I have tested using 50% reads, 60% random using iometer over NFS, I can see the data going straight to disk due to the sync nature of NFS. But I also see writes coming to a standstill every 10 seconds or so, which I have attributed to the ZIL dumping to disk. Therefore I conclude that it is the process of dumping the ZIL to disk that (mostly?) blocks writes during the dumping.

The ZIL does not work like that. It is not a journal. Under a typical write load, write transactions are batched and written out in a group transaction (txg). This txg sync occurs every 30s under light load, but more frequently or continuously under heavy load.

When writing synchronous data (eg NFS), the transactions get written immediately to the intent log and are made stable. When the txg later commits, the intent log blocks containing those committed transactions can be freed.

So as you can see, there is no periodic dumping of the ZIL to disk. What you are probably observing is the periodic txg commit.

Hope that helps: Neil.
Re: [zfs-discuss] check a zfs rcvd file
Lori Alt wrote:
> The -u option to zfs recv (which was just added to support flash archive installs, but it's useful for other reasons too) suppresses all mounts of the received file systems. So you can mount them yourself afterward in whatever order is appropriate, or do a 'zfs mount -a'.

You misunderstood my problem. It is very convenient that the filesystems are not mounted. I only wish they could stay that way! Alas, they ARE mounted (even if I don't want them to be) when I *reboot* the system. And THAT's when things get ugly. I then have different zfs filesystems using the same mountpoints! The backed-up ones have the same mountpoints as their origin :-/

-> The only way to stop it is to *export* the "backup" zpool OR to change *manually* the zfs prop "canmount=noauto" in all backed-up snapshots/filesystems. As I understand it, I cannot give this "canmount=noauto" to the zfs receive command.

# zfs send -Rv rp...@0909 | zfs receive -Fdu backup/snaps

--
Dick Hoogendijk -- PGP/GnuPG key: 01D2433D
+ http://nagual.nl/ | SunOS 10u7 5/09 | OpenSolaris 2010.02 B121
+ All that's really worth doing is what we do for others (Lewis Carrol)
Re: [zfs-discuss] Archiving and Restoring Snapshots
On Sep 3, 2009, at 10:32 PM, Tim Cook wrote:

> On Fri, Sep 4, 2009 at 12:17 AM, Ross wrote:
>> Hi Richard,
>>
>> Actually, reading your reply has made me realise I was overlooking something when I talked about tar, star, etc... How do you backup a ZFS volume? That's something traditional tools can't do. Are snapshots the only way to create a backup or archive of those?

Below the application, dd would do it. But if you want incrementals, then either use the application's backup scheme or zfs send.

>> Personally I'm quite happy with snapshots - we have a ZFS system at work that's replicating all of its data to an offsite ZFS store using snapshots. Using ZFS as a backup store is something I'm quite happy with, it's just storing just a snapshot file that makes me nervous.
>
> The correct answer is ndmp. Whether Sun will ever add it to opensolaris is another subject entirely though.

Available since b78, with source integrated in b102.
http://www.opensolaris.org/os/project/ndmp/

But NDMP is just part of an overall data management architecture...
-- richard
[zfs-discuss] question about my hardware choice
Hi zfs cognoscenti,

a few quick questions about my hardware choice (a bit late, since the box is up already):

A 3U Supermicro chassis with 16x SATA/SAS hotplug
Supermicro X8DDAi (2x Xeon QC 1.26 GHz S1366, 24 GByte RAM, IPMI)
2x LSI SAS3081E-R
16x WD2002FYPS

Right now I'm running Solaris 10 5/09 (Oracle doesn't support OpenSolaris, unfortunately). I would like to run Oracle in a zone/container, and use the rest for random storage and network serving. My questions:

* does the hardware choice make sense? Particularly the LSI host adapters. Should I change anything hardware-side?
* what kind of zfs layout would you recommend if I want to run Oracle in a container?
* should I put some SSD (e.g. Intel 80 GByte 2nd gen) into the system if I can, or doesn't Solaris 10 5/09 zfs support it?
* is there a reason speaking against containers and Oracle?
* how many hot spares would you suggest?

Thanks.

--
Eugen* Leitl <leitl> http://leitl.org
__
ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org
8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE
Re: [zfs-discuss] check a zfs rcvd file
On 09/04/09 09:41, dick hoogendijk wrote:
> Lori Alt wrote:
>> The -n option does some verification. It verifies that the record headers distributed throughout the stream are syntactically valid. Since each record header contains a length field which allows the next header to be found, one bad header will cause the processing of the stream to abort. But it doesn't verify the content of the data associated with each record. So, storing the stream in a zfs received filesystem is the better option.
>
> Alas, it also is the most difficult one. Storing to a file with "zfs send -Rv" is easy. The result is just a file, and if you reboot the system all is OK. However, if I "zfs receive -Fdu" into a zfs filesystem I'm in trouble when I reboot the system. I get confusion on mountpoints!
>
> Let me explain: Some time ago I backed up my rpool and my /export ; /export/home to /backup/snaps (with zfs receive -Fdu). All's OK because the newly created zfs FS's stay unmounted 'till the next reboot(!). When I rebooted my system (due to a kernel upgrade) the system would not boot, because it had mounted the zfs FS "backup/snaps/export" on /export and "backup/snaps/export/home" on /export/home. The system itself had those FS's too, of course. So, there was a mix up.
>
> It would be nice if the backup FS's would not be mounted (canmount=noauto), but I cannot give this option when I create the zfs send | receive, can I? And giving this option later on is very difficult, because "canmount" is NOT recursive! And I don't want to set it manually on all those backed-up FS's.
>
> I wonder how other people overcome this mountpoint issue.

The -u option to zfs recv (which was just added to support flash archive installs, but it's useful for other reasons too) suppresses all mounts of the received file systems. So you can mount them yourself afterward in whatever order is appropriate, or do a 'zfs mount -a'.

lori
Re: [zfs-discuss] Pulsing write performance
Roch Bourbonnais wrote:
> "100% random writes produce around 200 IOPS with a 4-6 second pause around every 10 seconds."
>
> This indicates that the bandwidth you're able to transfer through the protocol is about 50% greater than the bandwidth the pool can offer to ZFS. Since this is not sustainable, you see here ZFS trying to balance the 2 numbers.

When I have tested using 50% reads, 60% random using iometer over NFS, I can see the data going straight to disk due to the sync nature of NFS. But I also see writes coming to a standstill every 10 seconds or so, which I have attributed to the ZIL dumping to disk. Therefore I conclude that it is the process of dumping the ZIL to disk that (mostly?) blocks writes during the dumping.

I do agree with Bob and others that suggest making the size of the dump smaller will mask this behavior, and that seems like a good idea, although I have not yet tried and tested it myself.

-Scott
Re: [zfs-discuss] check a zfs rcvd file
Lori Alt wrote:
> The -n option does some verification. It verifies that the record headers distributed throughout the stream are syntactically valid. Since each record header contains a length field which allows the next header to be found, one bad header will cause the processing of the stream to abort. But it doesn't verify the content of the data associated with each record. So, storing the stream in a zfs received filesystem is the better option.

Alas, it also is the most difficult one. Storing to a file with "zfs send -Rv" is easy. The result is just a file, and if you reboot the system all is OK. However, if I "zfs receive -Fdu" into a zfs filesystem I'm in trouble when I reboot the system. I get confusion on mountpoints!

Let me explain: Some time ago I backed up my rpool and my /export ; /export/home to /backup/snaps (with zfs receive -Fdu). All's OK because the newly created zfs FS's stay unmounted 'till the next reboot(!). When I rebooted my system (due to a kernel upgrade) the system would not boot, because it had mounted the zfs FS "backup/snaps/export" on /export and "backup/snaps/export/home" on /export/home. The system itself had those FS's too, of course. So, there was a mix up.

It would be nice if the backup FS's would not be mounted (canmount=noauto), but I cannot give this option when I create the zfs send | receive, can I? And giving this option later on is very difficult, because "canmount" is NOT recursive! And I don't want to set it manually on all those backed-up FS's.

I wonder how other people overcome this mountpoint issue.

--
Dick Hoogendijk -- PGP/GnuPG key: 01D2433D
+ http://nagual.nl/ | SunOS 10u7 5/09 | OpenSolaris 2010.02 b122
+ All that's really worth doing is what we do for others (Lewis Carrol)
Re: [zfs-discuss] Change the volblocksize of a ZFS volume
stuart anderson writes:

> > > > Question: Is there a way to change the volume blocksize, say via 'zfs snapshot send/receive'? As I see things, this isn't possible as the target volume (including property values) gets overwritten by 'zfs receive'.
> > >
> > > By default, properties are not received. To pass properties, you need to use the -R flag.
> >
> > I have tried that, and while it works for properties like compression, I have not found a way to preserve a non-default volblocksize across zfs send | zfs receive. The zvol created on the receive side is always defaulting to 8k. Is there a way to do this?
>
> I spoke too soon. More particularly, during the zfs send/recv processes the receiving side reports 8k, but once the receive is done the volblocksize is reporting the expected value as sent with -R.
>
> Hopefully, this is just a reporting bug during an active receive.
>
> Note, this was observed with s10u7 (x86).

Sounds like so. I would be very surprised if one would be able to change the volblocksize of a zvol through send/receive (with or without -R). It's an immutable property of the zvol.

-r

> Thanks.
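Since volblocksize is fixed at creation time, the only place it can be chosen is on the 'zfs create' command line; for illustration (names and sizes are just placeholders):

    # create a zvol with a non-default 64K block size
    zfs create -V 100G -o volblocksize=64K tank/newvol
    zfs get volblocksize tank/newvol

Copying data into such a pre-created volume would then have to happen at the block or application level (dd, for instance) rather than via zfs receive, since a received stream recreates the volume with the sender's block size.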
Re: [zfs-discuss] Pulsing write performance
"100% random writes produce around 200 IOPS with a 4-6 second pause around every 10 seconds. " This indicates that the bandwidth you're able to transfer through the protocol is about 50% greater than the bandwidth the pool can offer to ZFS. Since, this is is not sustainable, you see here ZFS trying to balance the 2 numbers. -r David Bond writes: > Hi, > > I was directed here after posting in CIFS discuss (as i first thought that > it could be a CIFS problem). > > I posted the following in CIFS: > > When using iometer from windows to the file share on opensolaris > svn101 and svn111 I get pauses every 5 seconds of around 5 seconds > (maybe a little less) where no data is transfered, when data is > transfered it is at a fair speed and gets around 1000-2000 iops with 1 > thread (depending on the work type). The maximum read response time is > 200ms and the maximum write response time is 9824ms, which is very > bad, an almost 10 seconds delay in being able to send data to the > server. > This has been experienced on 2 test servers, the same servers have > also been tested with windows server 2008 and they havent shown this > problem (the share performance was slightly lower than CIFS, but it > was consistent, and the average access time and maximums were very > close. > > > I just noticed that if the server hasnt hit its target arc size, the > pauses are for maybe .5 seconds, but as soon as it hits its arc > target, the iops drop to around 50% of what it was and then there are > the longer pauses around 4-5 seconds. and then after every pause the > performance slows even more. So it appears it is definately server > side. > > This is with 100% random io with a spread of 33% write 66% read, 2KB > blocks. over a 50GB file, no compression, and a 5.5GB target arc > size. > > > > Also I have just ran some tests with different IO patterns and 100 > sequencial writes produce and consistent IO of 2100IOPS, except when > it pauses for maybe .5 seconds every 10 - 15 seconds. > > 100% random writes produce around 200 IOPS with a 4-6 second pause > around every 10 seconds. > > 100% sequencial reads produce around 3700IOPS with no pauses, just > random peaks in response time (only 16ms) after about 1 minute of > running, so nothing to complain about. > > 100% random reads produce around 200IOPS, with no pauses. > > So it appears that writes cause a problem, what is causing these very > long write delays? > > A network capture shows that the server doesnt respond to the write > from the client when these pauses occur. > > Also, when using iometer, the initial file creation doesnt have and > pauses in the creation, so it might only happen when modifying > files. > > Any help on finding a solution to this would be really appriciated. > > David > -- > This message posted from opensolaris.org > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Petabytes on a budget - blog
Marc Bevand gmail.com> writes:
>
> So in conclusion, my SBNSWAG (scientific but not so wild-ass guess) is that the max I/O throughput when reading from all the disks on 1 of their storage pods is about 1000MB/s.

Correction: the SiI3132 are on x1 (not x2) links, so my guess as to the aggregate throughput when reading from all the disks is:
3*150+100 = 550MB/s.
(150MB/s is 60% of the max theoretical 250MB/s bandwidth of an x1 link)

And if they tuned MAX_PAYLOAD_SIZE to allow the 3 PCI-E SATA cards to exploit closer to the max theoretical bandwidth of an x1 PCI-E link, it would be:
3*250+100 = 850MB/s.

-mrb
Re: [zfs-discuss] Petabytes on a budget - blog
Bill Moore sun.com> writes:
>
> Moving on, modern high-capacity SATA drives are in the 100-120MB/s range. Let's call it 125MB/s for easier math. A 5-port port multiplier (PM) has 5 links to the drives, and 1 uplink. SATA-II speed is 3Gb/s, which after all the framing overhead, can get you 300MB/s on a good day. So 3 drives can more than saturate a PM. 45 disks (9 backplanes at 5 disks + PM each) in the box won't get you more than about 21 drives worth of performance, tops. So you leave at least half the available drive bandwidth on the table, in the best of circumstances. That also assumes that the SiI controllers can push 100% of the bandwidth coming into them, which would be 300MB/s * 2 ports = 600MB/s, which is getting close to a 4x PCIe-gen2 slot.

Wrong. The theoretical bandwidth of an x4 PCI-E v2.0 slot is 2GB/s per direction (5Gbit/s before 8b-10b encoding per lane, times 0.8, times 4), amply sufficient to deal with 600MB/s. However they don't have this kind of slot; they have x2 PCI-E v1.0 slots (500MB/s per direction). Moreover the SiI3132 default to a MAX_PAYLOAD_SIZE of 128 bytes, therefore my guess is that each 2-port SATA card is only able to provide 60% of the theoretical throughput[1], or about 300MB/s. Then they have 3 such cards: total throughput of 900MB/s.

Finally the 4th SATA card (with 4 ports) is in a 32-bit 33MHz PCI slot (not PCI-E). In practice such a bus can only provide a usable throughput of about 100MB/s (out of 133MB/s theoretical).

All the bottlenecks are obviously the PCI-E links and the PCI bus. So in conclusion, my SBNSWAG (scientific but not so wild-ass guess) is that the max I/O throughput when reading from all the disks on 1 of their storage pods is about 1000MB/s. This is poor compared to a Thumper for example, but the most important factor for them was GB/$, not GB/sec. And they did a terrific job at that!

> And I'd re-iterate what myself and others have observed about SiI and silent data corruption over the years.

Irrelevant, because it seems they have built fault-tolerance higher in the stack, à la Google. Commodity hardware + reliable software = great combo.

[1] http://blog.backblaze.com/2009/09/01/petabytes-on-a-budget-how-to-build-cheap-cloud-storage/

-mrb
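To spell out the link-rate arithmetic (standard PCI-E figures, nothing specific to this chassis): a v2.0 lane signals at 5 Gbit/s and 8b/10b coding leaves 80% as payload, so 5 × 0.8 = 4 Gbit/s = 500 MB/s per lane, and an x4 slot is 4 × 500 MB/s = 2 GB/s per direction. A v1.0 lane signals at 2.5 Gbit/s, so 2.5 × 0.8 = 250 MB/s per lane, i.e. 500 MB/s for an x2 slot and 250 MB/s for the x1 links assumed in the later correction; 60% efficiency with 128-byte payloads is where the ~300 MB/s per card (or ~150 MB/s per x1 link) figures come from.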
Re: [zfs-discuss] ARC limits not obeyed in OSol 2009.06
Do you have the zfs primarycache property on this release? If so, you could set it to 'metadata' or 'none'.

     primarycache=all | none | metadata

         Controls what is cached in the primary cache (ARC). If this property is set to "all", then both user data and metadata is cached. If this property is set to "none", then neither user data nor metadata is cached. If this property is set to "metadata", then only metadata is cached. The default value is "all".

-r

Udo Grabowski writes:
> Hi,
> we've capped arcsize via set zfs:zfs_arc_max = 0x20000000 in /etc/system to 512 MB, since ARC still does not release memory when applications need it (this is another bug). But this hard limit is not obeyed; instead, when traversing all files in a large and deep directory, we see the values below (arc started with 300 MB). After a while, the machine (Ultra 20 M2 with 6GB) swaps and then, hours later, freezes completely (even no reaction on a quick push of the power button, no ping, no mouse, have to hard reset). arc_summary shows clearly that limits are not what they are supposed to be. If this is working as intended, then the intention must be changed. As poorly as ARC is working now, it's absolutely necessary that a hard limit is indeed a hard limit for ARC. Please fix this. Is there anything I can do to really limit or switch off the ARC completely? It's breaking our production work often since we've installed OSol (we came from SXDE 1/08 which worked better); we must find a way to stop this problem as fast as possible!
>
> arcstat:
>     Time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz  c
> 13:22:16  95M 23M 24 10M 14 12M 64 22M 24 963M 536M
> 13:22:17  2K 256 10796 177 15 2229 965M 536M
> 13:22:18  2K 490 22 119 10 371 38 482 22 970M 536M
> 13:22:19  4K 214 4 1506643 1403 971M 536M
> 13:22:20  2K 427 19574 370 37 419 19 971M 536M
> 13:22:21  1K 208 19 103 17 105 21 202 19 971M 536M
>
> 13:23:16  1K 481 27808 401 47 478 27 1G 536M
> 13:23:17  2K 255 11 125 10 130 13 218 10 1G 536M
> and counting...
>
> arc_summary:
> System Memory:
>          Physical RAM:  6134 MB
>          Free Memory :  1739 MB
>          LotsFree:      95 MB
>
> ZFS Tunables (/etc/system):
>          set zfs:zfs_arc_max = 0x20000000
>
> ARC Size:
>          Current Size:             1357 MB (arcsize)
>          Target Size (Adaptive):   512 MB (c)
>          Min Size (Hard Limit):    191 MB (zfs_arc_min)
>          Max Size (Hard Limit):    512 MB (zfs_arc_max)
>
> ARC Size Breakdown:
>          Most Recently Used Cache Size:    93%  479 MB (p)
>          Most Frequently Used Cache Size:   6%   32 MB (c-p)
>
> ARC Efficency:
>          Cache Access Total:        97131108
>          Cache Hit Ratio:   75%     7321       [Defined State for buffer]
>          Cache Miss Ratio:  24%     23886667   [Undefined State for Buffer]
>          REAL Hit Ratio:    67%     65874421   [MRU/MFU Hits Only]
>
>          Data Demand Efficiency:    66%
>          Data Prefetch Efficiency:   8%
>
>          CACHE HITS BY CACHE LIST:
>            Anon:                      --%  Counter Rolled.
>            Most Recently Used:        15%  11463028 (mru)        [ Return Customer ]
>            Most Frequently Used:      74%  54411393 (mfu)        [ Frequent Customer ]
>            Most Recently Used Ghost:  10%  7537123 (mru_ghost)   [ Return Customer Evicted, Now Back ]
>            Most Frequently Used Ghost: 19% 14619417 (mfu_ghost)  [ Frequent Customer Evicted, Now Back ]
>          CACHE HITS BY DATA TYPE:
>            Demand Data:                3%  2716192
>            Prefetch Data:              0%  3506
>            Demand Metadata:           86%  63089419
>            Prefetch Metadata:         10%  7435324
>          CACHE MISSES BY DATA TYPE:
>            Demand Data:                5%  1365132
>            Prefetch Data:              0%  36544
>            Demand Metadata:           40%  9664064
>            Prefetch Metadata:         53%  12820927
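For completeness, the property Roch is referring to is set per dataset, e.g. (the dataset name here is just an example):

    # stop caching file data in the ARC for this dataset, keep metadata
    zfs set primarycache=metadata tank/data
    zfs get primarycache tank/data

On builds that also have the secondarycache property, the same syntax controls what an L2ARC device is allowed to hold.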
Re: [zfs-discuss] How to find poor performing disks
Scott Lawson writes:
> Also you may wish to look at the output of 'iostat -xnce 1' as well.
>
> You can post those to the list if you have a specific problem.
>
> You want to be looking for error counts increasing and specifically 'asvc_t' for the service times on the disks. A higher number for asvc_t may help to isolate poorly performing individual disks.

I blast the pool with dd, and look for drives that are *always* active, while others in the same group have completed their transaction group and get no more activity. Within a group, drives should be getting the same amount of data per 5 seconds (zfs_txg_synctime), and the ones that are always active are the ones slowing you down. If whole groups are unbalanced, that's a sign that they have different amounts of free space, and the expectation is that you will be gated by the speed of the group that needs to catch up.

-r

> Scott Meilicke wrote:
> > You can try:
> >
> > zpool iostat pool_name -v 1
> >
> > This will show you IO on each vdev at one second intervals. Perhaps you will see different IO behavior on any suspect drive.
> >
> > -Scott
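A rough sketch of that kind of test (file and pool names are only examples; point it at a scratch file, not live data):

    # stream writes at the pool, and watch per-disk activity in another terminal
    dd if=/dev/zero of=/tank/scratch/ddtest bs=1024k count=16384 &
    iostat -xn 1

Disks in the same vdev that stay near 100% busy (%b) long after their neighbours have gone idle are the ones to look at.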
Re: [zfs-discuss] Read about ZFS backup - Still confused
Let me explain what I have and you decide if it's what you're looking for.

I run a home NAS based on ZFS (due to hardware issues I am using FreeBSD 7.2 as my OS, but all the data is on ZFS). This system has multiple uses. I have about 10 users and 4 HTPCs connected via gigabit. I have ZFS filesystems for Video, Audio and Data. I have no problem using it for my main iTunes library or storing downloaded and recorded video. Each user also has their own share to store data and backups.

The system itself is made up of 3 raidz vdevs right now, each with 4 1TB hard drives, so I have about 9 TB total space. Having a setup like this sort of changes how you do things. I have several computers, but all the stuff I care about is on the NAS.

I am very happy with ZFS for this purpose. I originally used a Linux backend with mdadm and xfs, but I am very much in love with my new system. I love the ability to clone and snapshot and I use it often. It's already saved me from human error on 2 occasions. It's also very fast. I'm using cheap parts and have seen speeds over 250 MB/s, although I get around 30 MB/s per client average with Samba. For streaming music and video it has never shuddered or skipped. I have mostly 720p video but a large amount of 1080p as well. It's not uncommon to have 3 HTPCs streaming at the same time and 2 people using the network for other stuff. I'm very happy with it.

I'm SURE you can find a method to backup/restore your data with ZFS. Just think of it more as a backend solution. You'll still probably use whatever method you're used to for transferring data, although I use a combination of samba/nfs and even FTP. If you're used to tar, no need to stop using it. You might also look at rsync. You could set up a ZFS filesystem on the NAS and set up rsync on your client, then set up automatic snapshots on the ZFS machine. This way you'd have multiple methods of restoring (you could just dump back the latest rsync, or you could clone one of the older snapshots and dump THAT back).

On Thu, Sep 3, 2009 at 4:58 PM, Cork Smith wrote:
> Let me try rephrasing this. I would like the ability to restore so my system mirrors its state at the time when I backed it up, given the old hard drive is now a door stop.
>
> Cork
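As a concrete sketch of that last suggestion (hostnames, pool and dataset names are made up for illustration):

    # on the client: push the data to its share on the NAS
    rsync -a --delete /home/cork/ nas:/tank/backups/cork/

    # on the NAS: keep a dated snapshot of what was just received
    zfs snapshot tank/backups/cork@$(date +%Y-%m-%d)

Restoring the latest copy is an rsync in the other direction; restoring an older state means copying out of the .zfs/snapshot directory (or a clone) of the snapshot you want.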
[zfs-discuss] one time passwords - apache infrastructure incident report 8/28/2009
Hi,

Just been reading about the apache.org incident report for 8/28/2009 ( https://blogs.apache.org/infra/entry/apache_org_downtime_report ).

The use of Solaris and ZFS on the European server was interesting, including the recovery. However, what I found more interesting was the use of one-time passwords, which is supported by FreeBSD ( http://www.freebsd.org/doc/en/books/handbook/one-time-passwords.html ).

Could or should this technology be incorporated into OpenSolaris?