Re: [zfs-discuss] Zpool Import Hanging
On Jan 17, 2011, at 8:22 PM, Repetski, Stephen wrote:

> On Mon, Jan 17, 2011 at 22:08, Ian Collins <i...@ianshome.com> wrote:
>> On 01/18/11 04:00 PM, Repetski, Stephen wrote:
>>> Hi All, I believe this has been asked before, but I wasn't able to find much information on the subject. Long story short, I was moving data around on a storage zpool of mine and a "zfs destroy <filesystem>" hung (or so I thought). This pool has had dedup turned on at times while imported; it's running on a Nexenta Core 3.0.1 box (snv_134f). The first time the machine was rebooted, it hung at the "Loading ZFS filesystems" line after loading the kernel; I booted the box with all drives unplugged and exported the pool. The machine was rebooted, and now the pool is hanging on import (zpool import -Fn Nalgene). I'm using "0t2761::pid2proc|::walk thread|::findstack | mdb -k" to try to view what the import process is doing. I'm not a hard-core ZFS/Solaris dev, so I don't know if I'm reading the output correctly, but it appears that ZFS is continuing to delete a snapshot/filesystem from before (reading from the top down).
>>
>> What does "zpool iostat <pool> 10" show? If you have a lot of deduped data and not a lot of RAM (or a cache device), it can take a very long time to destroy a filesystem. You will see lots of reads and not many writes if this is happening. -- Ian.
>
> zpool iostat itself hangs, but iostat does show me one drive in particular causing some issues - http://pastebin.com/6rJG3qV9 - %w and %b drop to ~50 and ~90, respectively, when mdb shows ZFS doing some deduplication work (http://pastebin.com/EMPYy5Rr). As you said, the pool is mostly reading data and not writing much. I should be able to switch that drive to another controller (currently on a PCI SATA adapter) and see what iostat reports then.

%w should be near 0 in most cases. Until you solve that problem, everything will be slow.
-- richard
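For reference, the commands under discussion, cleaned up into runnable form (pool name Nalgene and PID 2761 are from the original report; adjust to your own pool and import-process PID):

    # Watch pool-level I/O every 10 seconds (this will hang if the pool is hung):
    zpool iostat Nalgene 10

    # OS-level per-device stats; these keep working even when zpool iostat hangs:
    iostat -xn 10

    # Inspect what the import process is doing inside the kernel:
    echo "0t2761::pid2proc | ::walk thread | ::findstack -v" | mdb -k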
[zfs-discuss] incorrect vdev added to pool
Hi. I have a pool with a raidz2 vdev. Today I accidentally added a single drive to the pool as a new top-level vdev, so part of the pool now has no redundancy. Is there a way to remove the vdev and replace it with a new raidz2 vdev? If not, what can I do for damage control and add some redundancy to the single-drive vdev? Thanks.
Re: [zfs-discuss] configuration
With two drives it makes more sense to use a mirror than a raidz configuration. You will have the same amount of usable space, and mirroring gives you better performance, AFAIK.
Re: [zfs-discuss] Status of zpool remove in raidz and non-redundant stripes
I second that. This is exactly what happened to me. There is a bug (ID 4852783) in state "6-Fix Understood", but it has been unchanged since February 2010.
Re: [zfs-discuss] configuration
You can also make 250 GB slices (partitions) on each drive and create a RAIDZ of 3x250GB, plus a mirror of the remaining 2x1750GB (one or more pools). A mirror has better performance for write operations; raidz should be faster for reads.

Regards
-- Piotr Tarnowski /DrFugazi/ http://www.drfugazi.eu.org/
[zfs-discuss] Surprise Thread Preemptions
Hi, I would like to know which threads will be preempted by which on my OpenSolaris machine. I ran a multithreaded program "myprogram" with 32 threads on my 24-core Solaris machine. I made sure that each thread of my program has the same priority (priority zero), so that we can reduce priority inversions (saving preemptions -- system overhead). However, I ran the following script whoprempt.d to see who preempted myprogram threads and got the output below. Unlike what I expected, myprogram threads are preempted (2796 times -- last line of the output) by threads of the same myprogram. Could anyone explain why this happens, please?

DTrace script:
==============
#pragma D option quiet

sched:::preempt
{
        self->preempt = 1;
}

sched:::remain-cpu
/self->preempt/
{
        self->preempt = 0;
}

sched:::off-cpu
/self->preempt/
{
        /*
         * If we were told to preempt ourselves, see who we ended up giving
         * the CPU to.
         */
        @[stringof(args[1]->pr_fname), args[0]->pr_pri, execname,
            curlwpsinfo->pr_pri] = count();
        self->preempt = 0;
}

END
{
        printf("%30s %3s %2s %30s %3s %5s\n",
            "PREEMPTOR", "PRI", "||", "PREEMPTED", "PRI", "#");
        printa("%30s %3d || %30s %3d %5@d\n", @);
}

Output:
=======
     PREEMPTOR PRI ||  PREEMPTED PRI     #
        dtrace   0 ||  myprogram   0     1
        dtrace  50 ||  myprogram   0     1
         sched  -1 ||  myprogram   0     1
     myprogram   0 ||     dtrace   0     1
          ...
          nscd  59 ||  myprogram   0     4
      sendmail  59 ||  myprogram   0     4
         sched  60 ||  myprogram   0    92
         sched  98 ||  myprogram   0   272
         sched  99 ||  myprogram   0  2110
     myprogram   0 ||  myprogram   0  2796
[zfs-discuss] Is my bottleneck RAM?
Hi guys, sorry in advance if this is somewhat a lowly question. I've recently built a zfs test box based on nexentastor with 4x Samsung 2TB drives connected via SATA-II in a raidz1 configuration, with dedup enabled, compression off, and pool version 23. From running bonnie++ I get the following results:

Version 1.03b       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
nexentastor      4G 60582  54 20502   4 12385   3 53901  57 105290 10 429.8   1

                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  7181  29 +++++ +++ +++++ +++ 21477  97 +++++ +++ +++++ +++

nexentastor,4G,60582,54,20502,4,12385,3,53901,57,105290,10,429.8,1,16,7181,29,+++++,+++,+++++,+++,21477,97,+++++,+++,+++++,+++

I'd expect more than 105290 K/s sequential read as a peak for a single drive, let alone a striped set. The system has a relatively decent CPU, but only 2GB of memory; do you think increasing this to 4GB would noticeably affect the performance of my zpool? The memory is only DDR1. Thanks in advance.
[zfs-discuss] kernel panic on USB disk power loss
I was copying a filesystem using zfs send | zfs receive and inadvertently unplugged the power to the USB disk that was the destination. Much to my horror this caused the system to panic. I recovered fine on rebooting, but it *really* unnerved me, and I can't find anything about this online. I would expect it to trash the copy operation, but the panic seemed a bit extreme.

It's an Ultra 20 running Solaris 10 Generic_137112-02. I've got a copy of U8 I'm planning to install, as the U9 license seems to prohibit my using it.

Suggestions? I'd like to understand what happened and why the system went down. Thanks, Reg
[zfs-discuss] configuration
Hello, I'm going to build a home server. The system is deployed on an 8 GB USB flash drive. I have two identical 2 TB HDDs and a 250 GB one. Could you please recommend a ZFS configuration for this set of drives?

1) pool1: mirror 2TB x 2; pool2: 250 GB (or maybe add this drive to pool1???)
2) pool1: mirror 2TB x 2 + cache/log 250 GB
Re: [zfs-discuss] HP ProLiant N36L
I've successfully installed NexentaStor 3.0.4 on this microserver using PXE. Works like a charm.
Re: [zfs-discuss] incorrect vdev added to pool
On 01/15/11 11:32 PM, Gal Buki wrote:
> Hi. I have a pool with a raidz2 vdev. Today I accidentally added a single drive to the pool. I now have a pool that partially has no redundancy, as this vdev is a single drive. Is there a way to remove the vdev

Not at the moment, as far as I know.

> and replace it with a new raidz2 vdev? If not what can I do to do damage control and add some redundancy to the single drive vdev?

I think you should be able to attach another disk to it to make them into a mirror. (Make sure you attach, and not add.)
-- Andrew
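A minimal sketch of the attach-versus-add distinction Andrew describes (pool and device names hypothetical; c2t3d0 is the stray single-disk vdev, c2t4d0 a spare disk):

    # What caused the problem: "add" creates another non-redundant top-level vdev
    #   zpool add tank c2t4d0
    # The fix Andrew suggests: "attach" turns the single disk into a two-way mirror
    zpool attach tank c2t3d0 c2t4d0
    zpool status tank   # watch the new mirror resilver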
[zfs-discuss] Request for comments: L2ARC, ZIL, RAM, and slow storage
Hi all. This is just an off-the-cuff idea at the moment, but I would like to sound it out.

Consider the situation where someone has a large amount of off-site data storage (of the order of 100s of TB or more) behind a slow network link. My idea is that this could be used to build the main vdevs for a ZFS pool. On top of this, an array of disks (of the order of TBs to 10s of TB) is available locally, which can be used as L2ARC. There are also smaller, faster arrays (of the order of 100s of GB) which, in my mind, could be used as a ZIL.

In this theoretical situation, in-play read data is kept on the L2ARC and can be accessed about as fast as if that array were just used as the main pool vdevs. Written data goes to the ZIL, and is then sent down the slow link to the offsite storage. Rarely used data is still available as if on site (it shows up in the same file structure), but is effectively archived to the offsite storage.

Now, here comes the problem. According to what I have read, the maximum size for the ZIL is approx 50% of the physical memory in the system, which would be too small for this particular situation. Also, you cannot mirror the L2ARC, which would have dire performance consequences in the case of a disk failure in the L2ARC. I also believe (correct me if I am wrong) that the L2ARC is invalidated on reboot, so it would have to warm up again. And finally, if the network link were to die, I am assuming the entire zpool would become unavailable. This is a setup I can see many use cases for, but it introduces too many failure modes.

What I would like to see is an extension to ZFS's hierarchical storage environment, such that an additional layer can be put behind the main pool vdevs as an archive store (i.e. it goes [ARC] -> [L2ARC/ZIL] -> [main] -> [archive]). Infrequently used files/blocks could be pushed into this storage, but appear to be available as normal. It would, for example, allow old snapshot data to be pushed down, as this is very rarely going to be used, or files which must be archived for legal reasons. It would also utilise the available bandwidth more efficiently, as only data specifically sent to it would need transferring. In the case where the archive storage becomes unavailable, there would be a number of possible actions (e.g. error on access, block on access, make the files disappear temporarily).

I know there are already solutions out there which do similar jobs. The company I work for uses one which pushes archive data to a tape stacker and pulls it back when accessed. But I think this is a ripe candidate for becoming part of the ZFS stack. So, what does everyone think?

Rgds, Karl
Re: [zfs-discuss] (Fletcher+Verification) versus (Sha256+No Verification)
> ...If this is a general rule, maybe it will be worth considering using SHA512 truncated to 256 bits to get more speed...

Doesn't it need more investigation whether truncating 512 bits to 256 bits gives security equivalent to a plain 256-bit hash? Maybe truncation will introduce some bias?
Re: [zfs-discuss] Is my bottleneck RAM?
On Jan 15, 2011, at 4:21 PM, Michael Armstrong wrote:
> Hi guys, sorry in advance if this is somewhat a lowly question, I've recently built a zfs test box based on nexentastor with 4x samsung 2tb drives connected via SATA-II in a raidz1 configuration with dedup enabled, compression off and pool version 23.
> [bonnie++ output snipped -- see the original post above]
> The system has a relatively decent CPU, however only 2GB memory, do you think increasing this to 4GB would noticeably affect performance of my zpool? The memory is only DDR1.

2GB or 4GB of RAM + dedup is a recipe for pain. Do yourself a favor: turn off dedup and enable compression.
-- richard
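For completeness, Richard's suggestion in command form, assuming the pool is named tank (note both properties affect only data written from now on; existing blocks stay deduped/uncompressed until rewritten):

    zfs set dedup=off tank
    zfs set compression=on tank
    zfs get dedup,compression tank   # verify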
Re: [zfs-discuss] (Fletcher+Verification) versus (Sha256+No Verification)
Totally off topic: Very interesting. Did you produce some papers on this? Where do you work? Seems like a very fun place to work at!

BTW, I thought about this; what do you say? Assume I want to compress data and I succeed in doing so, and then I transfer the compressed data. So all the information I transferred is the compressed data. But then you don't count all the implicit information: knowledge about which algorithm was used, which number system, laws of math, etc. So there is lots of other information that is implicit in compress/decompress -- not just the data.

So, if you add the data and all the implicit information, you get a certain bit size X. Do this again on the same set of data with another algorithm and you get another bit size Y. If you use less implicit information (a simple algorithm relying on simple math), will X be smaller than if you use lots of implicit information (an advanced algorithm relying on a large body of advanced math)? What can you say about the numbers X and Y? Advanced math requires many math books that you need to transfer as well.
Re: [zfs-discuss] Surprise Thread Preemptions
Big subject! You haven't said what your 32 threads are doing, or how you gave them the same priority, or what scheduler class they are running in.

However, you only have 24 VCPUs and (I assume) 32 active threads, so Solaris will try to share resources evenly, and yes, it will preempt one of your threads to run another. The preemption behaviour, including the time a thread is allowed to run without interruption, will depend on the scheduling class and parameters of each thread. If you want to reduce preemption, you can move threads to the FX class, set an absolute priority, and tune the time quantum. What you are seeing is expected.

Hope this helps, Phil

p.s. if you need any more help with this, please feel free to contact me offline.

On 18/01/2011 06:13, Kishore Kumar Pusukuri wrote:
> Hi, I would like to know which threads will be preempted by which on my OpenSolaris machine. [...]
> [quoted DTrace script and output snipped -- see the original post above]
Re: [zfs-discuss] Is my bottleneck RAM?
I've since turned off dedup and added another 3 drives, and results have improved to around 148388K/sec on average. Would turning on compression make things more CPU-bound and improve performance further?

On 18 Jan 2011, at 15:07, Richard Elling wrote:
> [quoted thread snipped]
> 2GB or 4GB of RAM + dedup is a recipe for pain. Do yourself a favor: turn off dedup and enable compression. -- richard
Re: [zfs-discuss] (Fletcher+Verification) versus (Sha256+No Verification)
On Tue, Jan 18, 2011 at 07:16:04AM -0800, Orvar Korvar wrote:
> BTW, I thought about this. What do you say? Assume I want to compress data and I succeed in doing so. [...] Advanced math requires many math books that you need to transfer as well.

Just as the laws of thermodynamics preclude perpetual motion machines, so do they preclude infinite, loss-less data compression. Yes, thermodynamics and information theory are linked, amazingly enough.

Data compression algorithms work by identifying certain types of patterns, then replacing the input with notes such as "pattern 1 is ... and appears at offsets 12345 and 1234567" (I'm simplifying a lot). Data that has few or no observable patterns (observable by the compression algorithm in question) will not compress, but will expand if you insist on compressing it -- randomly-generated data (e.g., the output of /dev/urandom) will not compress at all. Even the one bit needed to indicate whether a file is compressed or not means expansion when you fail to compress and store the original instead of the compressed version.

Data compression reduces repetition, thus making it harder to further compress already-compressed data. Try it yourself: build a pipeline of all the compression tools you have, and see how many rounds of compression you can apply to typical data before further compression fails.

Nico --
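One possible form of the experiment Nico suggests, using whatever text file is handy as input (the file names here are just examples):

    # The first round compresses well; later rounds stall or grow the file.
    gzip -c  /var/adm/messages  > /tmp/round1.gz
    gzip -c  /tmp/round1.gz     > /tmp/round2.gz    # compressing compressed data
    bzip2 -c /tmp/round2.gz     > /tmp/round3.bz2
    ls -l /var/adm/messages /tmp/round1.gz /tmp/round2.gz /tmp/round3.bz2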
Re: [zfs-discuss] HP ProLiant N36L
On Mon, Jan 17, 2011 at 02:19:23AM -0800, Trusty Twelve wrote:
> I've successfully installed NexentaStor 3.0.4 on this microserver using PXE. Works like a charm.

I've got 5 of them today, and for some reason NexentaCore 3.0.1 b134 was unable to write to the disks (whether the internal USB or the 4x SATA). Known problem? Should I go to stable, or try NexentaStor instead? (I'd rather keep my options open with Nexenta Core and napp-it.)

-- Eugen* Leitl <leitl> http://leitl.org
ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org
8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE
Re: [zfs-discuss] Is my bottleneck RAM?
On Tue, 2011-01-18 at 15:11 +0000, Michael Armstrong wrote:
> I've since turned off dedup, added another 3 drives and results have improved to around 148388K/sec on average, would turning on compression make things more CPU bound and improve performance further?
> [rest of quoted thread snipped]

Compression will help speed things up (I/O, that is), presuming that you're not already CPU-bound, which it doesn't seem you are.

If you want dedup, you pretty much are required to buy an SSD for L2ARC *and* get more RAM. These days, I really don't recommend running ZFS as a fileserver without a bare minimum of 4GB of RAM (8GB for anything other than light use), even with dedup turned off.

-- 
Erik Trimble
Java System Support
Mailstop: usca22-317
Phone: x67195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)
Re: [zfs-discuss] Is my bottleneck RAM?
Thanks everyone, I think over time I'm going to update the system to include an SSD for sure. Memory may come later, though. Thanks for everyone's responses.

Erik Trimble <erik.trim...@oracle.com> wrote:
> [quoted thread snipped]
Re: [zfs-discuss] Is my bottleneck RAM?
You can't really do that. Adding an SSD for L2ARC will help a bit, but L2ARC storage also consumes RAM to maintain a table of what's in the L2ARC. Using 2GB of RAM with an SSD-based L2ARC (even without dedup) likely won't help you much versus not having the SSD. If you're going to turn on dedup, you need at least 8GB of RAM to go with the SSD.

-Erik

On Tue, 2011-01-18 at 18:35 +0000, Michael Armstrong wrote:
> Thanks everyone, I think over time I'm going to update the system to include an SSD for sure. Memory may come later, though. Thanks for everyone's responses.
> [rest of quoted thread snipped]
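Rough arithmetic behind Erik's point, assuming the commonly quoted figure of roughly 200 bytes of ARC per L2ARC record (the exact header size varies by release):

    # Hypothetical 80GB SSD used as L2ARC:
    #   64KB average records: 80GB / 64KB ~= 1.3M records
    #                         1.3M x 200B ~= 250MB of RAM just for headers
    #   8KB average records:  80GB / 8KB  ~= 10M records
    #                         10M  x 200B ~= 2GB of RAM -- the whole system here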
Re: [zfs-discuss] configuration
On Mon, Jan 17, 2011 at 6:19 AM, Piotr Tarnowski <drfug...@drfugazi.eu.org> wrote:
> You can also make 250 GB slices (partitions) and create RAIDZ 3x250GB and mirror 2x1750GB (one or more).

This configuration doesn't make a lot of sense for redundancy, since it doesn't provide any. It will have poor performance caused by excessive disk seeks as well. The only time it would make sense is if you're planning on replacing each slice with a separate drive.

> Mirror has better performance for write operations, Raidz should be faster for read.

ZFS mirrors will read off of both sides of the mirror, as in a stripe.

-B
-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] Is my bottleneck RAM?
Ah ok, I won't be using dedup anyway, I just wanted to try it. I'll be adding more RAM though; I guess you can't have too much. Thanks.

Erik Trimble <erik.trim...@oracle.com> wrote:
> You can't really do that. Adding an SSD for L2ARC will help a bit, but L2ARC storage also consumes RAM to maintain a table of what's in the L2ARC. [...]
> [rest of quoted thread snipped]
[zfs-discuss] How well does zfs mirror handle temporary disk offlines?
Sorry if this is well known.. I tried a bunch of googles, but didn't get anywhere useful. The closest I came was http://mail.opensolaris.org/pipermail/zfs-discuss/2009-April/028090.html but that doesn't answer my question, below, regarding zfs mirror recovery. Details of our needs follow.

We are normally very into redundancy. Pretty much all our SAN storage is dual ported, along with all our production hosts: two completely redundant paths to storage, two independent SANs. However, we are now encountering a need for tier 3 storage, aka "not that important, we're going to go cheap on it" ;-) That being said, we'd still like to make it as reliable and robust as possible. So I was wondering just how robust it would be to do ZFS mirroring across the two SANs.

My specific question is: how easily does ZFS handle *temporary* SAN disconnects to one side of the mirror? What if the outage is only 60 seconds? 3 minutes? 10 minutes? An hour? If we have 2x1TB drives in a simple zfs mirror and one side goes temporarily offline, will zfs attempt to resync **1 TB** when it comes back? Or does it have enough intelligence to say, "oh hey, I know this disk, and I know [these bits] are still good, so I just need to resync [that bit]"?
Re: [zfs-discuss] How well does zfs mirror handle temporary disk offlines?
On 1/18/2011 2:46 PM, Philip Brown wrote:
> My specific question is, how easily does ZFS handle *temporary* SAN disconnects to one side of the mirror? What if the outage is only 60 seconds? 3 minutes? 10 minutes? an hour?

Depends on the multipath drivers and the failure mode. For example, if the link drops completely at the host HBA connection, some failover drivers will mark the path down immediately, which will propagate up the stack faster than an intermittent connection or something farther downstream failing.

> If we have 2x1TB drives, in a simple zfs mirror, if one side goes temporarily offline, will zfs attempt to resync **1 TB** when it comes back? Or does it have enough intelligence to say, oh hey I know this disk, and I know [these bits] are still good, so I just need to resync [that bit]?

My understanding is yes, though I can't find the reference for this. (I'm sure someone else will find it in short order.)
Re: [zfs-discuss] How well does zfs mirror handle temporary disk offlines?
On Tue, 2011-01-18 at 14:51 -0500, Torrey McMahon wrote:
> [quoted thread snipped]

ZFS's ability to handle short-term interruptions depends heavily on the underlying device driver. If the device driver reports the device as dead/missing/etc. at any point, then ZFS is going to require a "zpool replace" action before it re-accepts the device. If the underlying driver simply stalls, then it's more graceful (and no user interaction is required).

As far as what the resync does: ZFS does smart resilvering, in that it compares what the good side of the mirror has against what the bad side has, and only copies the differences over to sync them up. This is one of ZFS's great strengths, in that most other RAID systems can't do this.

-- 
Erik Trimble
Java System Support
Re: [zfs-discuss] How well does zfs mirror handle temporary disk offlines?
Erik Trimble wrote:
> [quoted thread snipped]

No idea how well it will reconnect the device, but we had an X4500 that would randomly boot up with one or two disks missing. Reboot again and one or two other disks would be missing. While we were troubleshooting this problem it happened dozens and dozens of times, and zfs had no trouble with it as far as I could tell. It would only resilver the data that changed while that drive was offline. We had no data loss.

Thank you, Chris Banal
Re: [zfs-discuss] How well does zfs mirror handle temporary disk offlines?
On Tue, 2011-01-18 at 14:51 -0500, Torrey McMahon wrote:
> ZFS's ability to handle short-term interruptions depends heavily on the underlying device driver. If the device driver reports the device as dead/missing/etc. at any point, then ZFS is going to require a "zpool replace" action before it re-accepts the device. [...]
> As far as what the resync does: ZFS does smart resilvering [...]

Hmm. Well, we're talking fibre, so we're very concerned with the recovery mode when the fibre drivers have marked it as failed (except it hasn't really failed; we've just had a switch drop out).

I THINK what you are saying is that we could, in this situation, do:

   zpool replace (old drive) (new drive)

and then your "smart recovery" should do the limited resilvering only, even for potentially long outages. Is that what you are saying?
Re: [zfs-discuss] HP ProLiant N36L
I've installed NexentaStor on an 8GB USB stick without any problems, so try NexentaStor instead of NexentaCore...
Re: [zfs-discuss] How well does zfs mirror handle temporary disk offlines?
On Tue, 2011-01-18 at 13:34 -0800, Philip Brown wrote:
> I THINK what you are saying is that we could, in this situation, do: zpool replace (old drive) (new drive) and then your "smart recovery" should do the limited resilvering only, even for potentially long outages. Is that what you are saying?

Yes. It will always look at the replaced drive to see if it was a prior member of the mirror, and do smart resilvering if possible. If the device path stays the same (which, hopefully, it should), you can even do:

   zpool replace (old device) (old device)

-- 
Erik Trimble
Java System Support
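Putting Erik's answer into concrete commands (pool and device names hypothetical; zpool clear or zpool online is often enough if the device merely shows FAULTED/UNAVAIL after the path returns, with replace as the fallback):

    zpool status tank            # identify the affected device
    zpool clear tank c3t0d0      # clear the errors once the path is back
    zpool online tank c3t0d0     # or explicitly bring the device online
    zpool replace tank c3t0d0    # in-place replace if ZFS still rejects it
    zpool status tank            # watch the differential resilver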
Re: [zfs-discuss] configuration
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Trusty Twelve
> Hello, I'm going to build home server. System is deployed on 8 GB USB flash drive. I have two identical 2 TB HDD and 250 GB one. Could you please recommend me ZFS configuration for the set of my hard drives? 1) pool1: mirror 2tb x 2 pool2: 250 gb (or maybe add this drive to pool1???) 2) pool1: mirror 2tb x 2 + cache/log 250 gb

I recommend option 3:
   mirror 2tb x 2
   disconnect the 250G
   disconnect the 8G flash drive
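In command form, with hypothetical device names for the two 2TB drives:

    # Option 3: just a two-way mirror. A spinning 250GB disk is no faster
    # than the mirror itself, so it adds little as a cache or log device.
    zpool create tank mirror c0t1d0 c0t2d0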
Re: [zfs-discuss] Request for comments: L2ARC, ZIL, RAM, and slow storage
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Karl Wagner
> Consider the situation where someone has a large amount of off-site data storage (of the order of 100s of TB or more). They have a slow network link to this storage. My idea is that this could be used to build the main vdevs for a ZFS pool. On top of this, an array of disks (of the order of TBs to 10s of TB) is available locally, which can be used as L2ARC. There are also smaller, faster arrays (of the order of 100s of GB) which, in my mind, could be used as a ZIL.
> Now, in this theoretical situation, in-play read data is kept on the L2ARC, and can be accessed about as fast as if this array was just used as the main pool vdevs. Written data goes to the ZIL, and is then sent down the slow link to the offsite storage. Rarely used data is still available as if on site (shows up in the same file structure), but is effectively archived to the offsite storage.
> Now, here comes the problem. According to what I have read, the maximum size for the ZIL is approx 50% of the physical memory in the system, which would

Here's the bigger problem: you seem to be thinking of the ZIL as a write buffer. This is not the case. The ZIL only allows sync writes to become async writes, which are buffered in RAM. Depending on your system, it will refuse to buffer more than 5 sec or 30 sec of async writes, and your async writes are still going to be slow.

Also, the L2ARC is not persistent, and there is a maximum fill rate (which I don't know much about). So populating the L2ARC might not happen as fast as you want, and every time you reboot it will have to be repopulated.

If at all possible, instead of using the remote storage as the primary storage, you can use the remote storage to receive incremental periodic snapshots, and that would perform optimally, because the remote storage is then isolated from rapid volatile changes. The zfs send | zfs receive datastreams will be full of large sequential blocks, not small random IO. Most likely you will gain performance by enabling both compression and dedup, but of course that depends on the nature of your data.

> And finally, if the network link was to die, I am assuming the entire ZPool would become unavailable.

The behavior in this situation is configurable via "failmode". The default is "wait", which essentially pauses the filesystem until the disks become available again. Unfortunately, until the disks become available again, the system can become... pretty undesirable to use, and possibly require a power cycle. You can also use "panic" or "continue", which you can read about in the zpool manpage if you want.

> vdevs as an archive store (i.e. it goes [ARC] -> [L2ARC/ZIL] -> [main] -> [archive]). Infrequently used files/blocks could

You're pretty much describing precisely what I'm suggesting: using zfs send | zfs receive. I suppose the difference between what you're suggesting and what I'm suggesting is the separation of two pools versus misrepresenting the remote storage as part of the local pool, etc. That's a pretty major architectural change.
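A minimal sketch of the snapshot-shipping approach described above (pool, dataset, and host names hypothetical):

    # One-time full copy to the remote pool
    zfs snapshot tank/data@monday
    zfs send tank/data@monday | ssh remotehost zfs receive backup/data

    # Thereafter, ship only the deltas -- large sequential blocks, no random IO
    zfs snapshot tank/data@tuesday
    zfs send -i @monday tank/data@tuesday | ssh remotehost zfs receive backup/data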
Re: [zfs-discuss] How well does zfs mirror handle temporary disk offlines?
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Erik Trimble
> As far as what the resync does: ZFS does smart resilvering, in that it compares what the good side of the mirror has against what the bad side has, and only copies the differences over to sync them up. This is one of ZFS's great strengths, in that most other RAID systems can't do this.

It's also one of ZFS's great weaknesses. It's a strength as long as not much data has changed, or the changes were highly sequential in nature, or the drives in the pool have extremely high IOPS (SSDs etc.), because then resilvering just the changed parts can be done very quickly -- much quicker than resilvering the whole drive sequentially as a typical hardware raid would do.

However, as is often the case, a large percentage of the drive may have changed, in essentially random order. There are many situations where something like 3% of the drive has changed, yet the resilver takes 100% as long as rewriting the entire drive sequentially would have taken. With 10% of the drive changed, a ZFS resilver might be 4x slower than sequentially overwriting the entire disk as a hardware raid would have done. Ultimately, your performance depends entirely on your usage patterns, your pool configuration, and the type of hardware.

To the OP: If you've got one device on one SAN mirrored to another device on another SAN, you're probably only expecting very brief outages on either SAN. As such, you probably won't see any large percentage of the online SAN change, and when the temporarily failed SAN comes back online, you can probably expect a very fast resilver.