Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
Hi Jeroen,

Have you tried the DDRdrive from Christopher George? It looks to me like a much better fit for your application than the F20, and it would not hurt to check it out. You appear to need a product with low *latency*, and a RAM-based cache will be a much better performer than any solution based solely on flash. Let us know (on the list) how this works out for you.

Regards,

-- Al Hopper
Logical Approach Inc, Plano, TX
a...@logical-approach.com
Voice: 214.233.5089  Timezone: US CDT
OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007
http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] To slice, or not to slice
On Apr 2, 2010, at 2:29 PM, Edward Ned Harvey wrote:
> I've also heard that the risk for unexpected failure of your pool is higher
> if/when you reach 100% capacity. I've heard that you should always create a
> small ZFS filesystem within a pool, and give it some reserved space, along
> with the filesystem that you actually plan to use in your pool. Anyone care
> to offer any comments on that?

How do you define "failure" in this context? I am not aware of a data-loss failure when a pool is nearly full. However, all file systems will experience performance degradation for write operations as they become full.

-- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com
Re: [zfs-discuss] dedup and memory/l2arc requirements
On Apr 2, 2010, at 2:03 PM, Miles Nordin wrote:
>> "re" == Richard Elling writes:
>
> re> # ptime zdb -S zwimming
> re> Simulated DDT histogram:
> re> refcnt  blocks  LSIZE  PSIZE  DSIZE  blocks  LSIZE  PSIZE  DSIZE
> re> Total    2.63M   277G   218G   225G   3.22M   337G   263G   270G
>
> re> in-core size = 2.63M * 250 = 657.5 MB
>
> Thanks, that is really useful! It'll probably make the difference
> between trying dedup and not, for me.
>
> It is not working for me yet. It got to this point in prstat:
>
>   6754 root 2554M 1439M sleep 60 0 0:03:31 1.9% zdb/106
>
> and then ran out of memory:
>
>   $ pfexec ptime zdb -S tub
>   out of memory -- generating core dump

This is annoying. By default, zdb is compiled as a 32-bit executable, and it can be a memory hog. Compiling it yourself is too painful for most folks :-(

> I might add some swap I guess. I will have to try it on another
> machine with more RAM and less pool, and see how the size of the zdb
> image compares to the calculated size of the DDT needed. So long as zdb
> is the same or a little smaller than the DDT it predicts, the tool's
> still useful; just sometimes it will report ``DDT too big but not sure
> by how much'' by coredumping/thrashing instead of finishing.

In my experience, more swap doesn't help break through the 2GB memory barrier. As zdb is an intentionally unsupported tool, methinks a recompile may be required (or write your own).

-- richard
Re: [zfs-discuss] To slice, or not to slice
On Fri, Apr 2, 2010 at 2:29 PM, Edward Ned Harvey wrote:
> I've also heard that the risk for unexpected failure of your pool is
> higher if/when you reach 100% capacity. I've heard that you should always
> create a small ZFS filesystem within a pool, and give it some reserved
> space, along with the filesystem that you actually plan to use in your
> pool. Anyone care to offer any comments on that?

I think you can just create a dataset with a reservation to avoid the issue. As I understand it, ZFS doesn't automatically set aside a few percent of reserved space the way UFS does.

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] To slice, or not to slice
On Fri, Apr 2, 2010 at 2:23 PM, Edward Ned Harvey wrote:
> There is some question about performance. Is there any additional
> overhead caused by using a slice instead of the whole physical device?

ZFS will disable the disk's write cache when it's not working with whole disks, which may reduce performance. You can turn the cache back on, however. I don't remember the exact incantation to do so, but "format -e" springs to mind.

> And finally, if anyone has experience doing this, any process
> recommendations? That is ... My next task is to go read documentation
> again, to refresh my memory from years ago, about the difference between
> "format," "partition," "label," "fdisk," because those terms don't have
> the same meaning that they do in other OSes... And I don't know clearly
> right now, which one(s) I want to do, in order to create the large slice
> of my disks.

The whole partition-vs-slice thing is a bit fuzzy to me, so take this with a grain of salt. You can create partitions using fdisk, or slices using format. The BIOS and other operating systems (Windows, Linux, etc.) will be able to recognize partitions, while they won't be able to make sense of slices. If you need to boot from the drive or share it with another OS, then partitions are the way to go. If it's exclusive to Solaris, then you can use slices. You can (but shouldn't) use slices and partitions from the same device (e.g. c5t0d0s0 and c5t0d0p0).

-B
Re: [zfs-discuss] To slice, or not to slice
On 04/03/10 10:23 AM, Edward Ned Harvey wrote:
> Momentarily, I will begin scouring the omniscient interweb for information,
> but I'd like to know a little bit of what people would say here. The
> question is to slice, or not to slice, disks before using them in a zpool.

Not.

> One reason to slice comes from recent personal experience. One disk of a
> mirror dies. Replaced under contract with an identical disk. Same model
> number, same firmware. Yet when it's plugged into the system, for an
> unknown reason, it appears 0.001 GB smaller than the old disk, and is
> therefore unable to attach and un-degrade the mirror. It seems logical this
> problem could have been avoided if the device added to the pool originally
> had been a slice somewhat smaller than the whole physical device. Say, a
> slice of 28G out of the 29G physical disk. Because later, when I get the
> infinitesimally smaller disk, I can always slice 28G out of it to use as
> the mirror device.

What build were you running? That should have been addressed by CR 6844090, which went into build 117.

> There is some question about performance. Is there any additional overhead
> caused by using a slice instead of the whole physical device? There is
> another question about performance. One of my colleagues said he saw some
> literature on the internet somewhere, saying ZFS behaves differently for
> slices than it does on physical devices, because it doesn't assume it has
> exclusive access to that physical device, and therefore caches or buffers
> differently ... or something like that.

It's well documented. ZFS won't attempt to enable the drive's write cache unless it has the whole physical device. See
http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#Storage_Pools

-- Ian.
Re: [zfs-discuss] RAID-Z with Permanent errors detected in files
> I guess it will then remain a mystery how this happened, since I'm very
> careful when engaging the commands and I'm sure that I didn't miss the
> "raidz" parameter.

You can know for sure by calling "zpool history".

Robert

-- This message posted from opensolaris.org
Re: [zfs-discuss] To slice, or not to slice
This might be unrelated, but along similar lines ... I've also heard that the risk for unexpected failure of your pool is higher if/when you reach 100% capacity. I've heard that you should always create a small ZFS filesystem within a pool, and give it some reserved space, along with the filesystem that you actually plan to use in your pool. Anyone care to offer any comments on that?

From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Edward Ned Harvey
Sent: Friday, April 02, 2010 5:23 PM
To: zfs-discuss@opensolaris.org
Subject: [zfs-discuss] To slice, or not to slice

Momentarily, I will begin scouring the omniscient interweb for information, but I'd like to know a little bit of what people would say here. The question is to slice, or not to slice, disks before using them in a zpool.

One reason to slice comes from recent personal experience. One disk of a mirror dies. Replaced under contract with an identical disk. Same model number, same firmware. Yet when it's plugged into the system, for an unknown reason, it appears 0.001 GB smaller than the old disk, and is therefore unable to attach and un-degrade the mirror. It seems logical this problem could have been avoided if the device added to the pool originally had been a slice somewhat smaller than the whole physical device. Say, a slice of 28G out of the 29G physical disk. Because later, when I get the infinitesimally smaller disk, I can always slice 28G out of it to use as the mirror device.

There is some question about performance. Is there any additional overhead caused by using a slice instead of the whole physical device?

There is another question about performance. One of my colleagues said he saw some literature on the internet somewhere, saying ZFS behaves differently for slices than it does on physical devices, because it doesn't assume it has exclusive access to that physical device, and therefore caches or buffers differently ... or something like that.
Any other pros/cons people can think of?

And finally, if anyone has experience doing this, any process recommendations? That is ... My next task is to go read documentation again, to refresh my memory from years ago, about the difference between "format," "partition," "label," "fdisk," because those terms don't have the same meaning that they do in other OSes. And I don't know clearly right now, which one(s) I want to do, in order to create the large slice of my disks.
[zfs-discuss] To slice, or not to slice
Momentarily, I will begin scouring the omniscient interweb for information, but I'd like to know a little bit of what people would say here. The question is to slice, or not to slice, disks before using them in a zpool.

One reason to slice comes from recent personal experience. One disk of a mirror dies. Replaced under contract with an identical disk. Same model number, same firmware. Yet when it's plugged into the system, for an unknown reason, it appears 0.001 GB smaller than the old disk, and is therefore unable to attach and un-degrade the mirror. It seems logical this problem could have been avoided if the device added to the pool originally had been a slice somewhat smaller than the whole physical device. Say, a slice of 28G out of the 29G physical disk. Because later, when I get the infinitesimally smaller disk, I can always slice 28G out of it to use as the mirror device.

There is some question about performance. Is there any additional overhead caused by using a slice instead of the whole physical device?

There is another question about performance. One of my colleagues said he saw some literature on the internet somewhere, saying ZFS behaves differently for slices than it does on physical devices, because it doesn't assume it has exclusive access to that physical device, and therefore caches or buffers differently ... or something like that.

Any other pros/cons people can think of?

And finally, if anyone has experience doing this, any process recommendations? That is ... My next task is to go read documentation again, to refresh my memory from years ago, about the difference between "format," "partition," "label," "fdisk," because those terms don't have the same meaning that they do in other OSes. And I don't know clearly right now, which one(s) I want to do, in order to create the large slice of my disks.
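For anyone taking the 28G-out-of-29G approach above, the sizing arithmetic can be sketched like this (a hypothetical Python sketch, not from the thread; the 512-byte sector size and the 1 GB safety margin are assumptions you would adjust for your own drives):

```python
SECTOR = 512  # bytes per sector; an assumption, typical for these drives

def slice_sectors(disk_bytes, margin_gb=1):
    """Sectors for a slice rounded down to a whole decimal GB, minus a
    safety margin, so a marginally smaller replacement disk still fits."""
    whole_gb = disk_bytes // 10**9           # capacity in whole decimal GB
    usable_gb = max(whole_gb - margin_gb, 0) # leave headroom for variance
    return (usable_gb * 10**9) // SECTOR

# A nominal "29 GB" disk that actually reports 29.1e9 bytes
# yields a 28 GB slice:
print(slice_sectors(29_100_000_000))  # 54687500
```

The same function run against the replacement disk tells you whether an identical slice will fit, regardless of the exact raw capacity.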
Re: [zfs-discuss] dedup and memory/l2arc requirements
> "re" == Richard Elling writes: re> # ptime zdb -S zwimming Simulated DDT histogram: re> refcnt blocks LSIZE PSIZE DSIZE blocks LSIZE PSIZE DSIZE re> Total2.63M277G218G225G3.22M337G263G 270G re>in-core size = 2.63M * 250 = 657.5 MB Thanks, that is really useful! It'll probably make the difference between trying dedup and not, for me. It is not working for me yet. It got to this point in prstat: 6754 root 2554M 1439M sleep 600 0:03:31 1.9% zdb/106 and then ran out of memory: $ pfexec ptime zdb -S tub out of memory -- generating core dump I might add some swap I guess. I will have to try it on another machine with more RAM and less pool, and see how the size of the zdb image compares to the calculated size of DDT needed. So long as zdb is the same or a little smaller than the DDT it predicts, the tool's still useful, just sometimes it will report ``DDT too big but not sure by how much'', by coredumping/thrashing instead of finishing. pgprpk9HSdr61.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On Fri, Apr 2 at 11:14, Tirso Alonso wrote:
>> If my new replacement SSD with identical part number and firmware is 0.001
>> GB smaller than the original and hence unable to mirror, what's to prevent
>> the same thing from happening to one of my 1TB spindle disk mirrors?
>
> There is a standard for sizes that many manufacturers use (IDEMA LBA1-02):
>
> LBA count = 97696368 + (1953504 * (Desired Capacity in GB - 50.0))
>
> Sizes should match exactly if the manufacturer follows the standard.
>
> See:
> http://opensolaris.org/jive/message.jspa?messageID=393336#393336
> http://www.idema.org/_smartsite/modules/local/data_file/show_file.php?cmd=download&data_file_id=1066

The problem is that the standard only applies to devices that are >= 50GB in size, and the X25 in question is only 32GB.

That being said, I'd be skeptical of either the sourcing of the parts, or else some other configuration feature on the drives (like an HPA or DCO) that is changing the capacity. It's possible one of these is in effect.

--eric

-- Eric D. Mudama
edmud...@mail.bounceswoosh.org
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On Fri, Apr 2, 2010 at 10:08 AM, Kyle McDonald wrote:
> On 4/2/2010 8:08 AM, Edward Ned Harvey wrote:
>>> I know it is way after the fact, but I find it best to coerce each
>>> drive down to the whole GB boundary using format (create a Solaris
>>> partition just up to the boundary). Then if you ever get a drive a
>>> little smaller, it still should fit.
>>
>> It seems like it should be unnecessary. It seems like extra work. But
>> based on my present experience, I reached the same conclusion.
>>
>> If my new replacement SSD with identical part number and firmware is
>> 0.001 GB smaller than the original and hence unable to mirror, what's to
>> prevent the same thing from happening to one of my 1TB spindle disk
>> mirrors? Nothing. That's what.
>
> Actually, it's my experience that Sun (and other vendors) do exactly that
> for you when you buy their parts - at least for rotating drives; I have no
> experience with SSDs.
>
> The Sun disk label shipped on all the drives is set up to make the drive
> the standard size for that Sun part number. They have to do this since
> they (for many reasons) have many sources (different vendors, even
> different parts from the same vendor) for the actual disks they use for a
> particular Sun part number.
>
> This isn't new; I believe IBM, EMC, HP, etc. all do it for the same
> reasons. I'm a little surprised that the engineers would suddenly stop
> doing it only on SSDs. But who knows.
>
> -Kyle

If I were forced to ignorantly cast a stone, it would be into Intel's lap (if the SSDs indeed came directly from Sun). Sun's "normal" drive vendors have been in this game for decades and know the expectations. Intel, on the other hand, may not have quite the same QC in place yet.

--Tim
[zfs-discuss] ZFS behavior under limited resources
I am trying to see how ZFS behaves under resource starvation - corner cases in embedded environments. I see some very strange behavior. Any help/explanation would really be appreciated.

My current setup is:
- OpenSolaris 111b (iSCSI seems to be broken in 132 - unable to get multiple connections/multipathing)
- iSCSI storage array capable of:
  20 MB/s random writes @ 4k and 70 MB/s random reads @ 4k
  150 MB/s random writes @ 128k and 180 MB/s random reads @ 128k
  180+ MB/s for sequential reads and writes at both 4k and 128k
- 8 Intel CPUs and 12 GB of RAM (Dell PowerEdge 610)
- ARC size limited to 512MB (hard limit). No L2 cache.

In both tests below, the file system size is about 300 GB. This file system contains a single directory with about 15,000 files totalling 200 GB (so the file system is 2/3 full). The tests are run within the same directory.

Test 1: Random writes @ 4k to 1000 1MB files (1000 threads, 1 per file).

First I observe that the ARC size grows (momentarily) above the 512 MB limit (via kstat and arcstat.pl).
Q: It seems that zfs:zfs_arc_max is not really a hard limit?

I tried setting primarycache to none, metadata, and all. The I/O reported is similar in the NONE and METADATA cases (17 MB/s), while when set to ALL, I/O is 3-4 times less (4-5 MB/s).
Q: Any explanation would be useful.

In this test I observe that backend I/O is on average 132 MB/s for reads and 51 MB/s for writes.
Q: Why is more read than written?

Test 2: Random writes @ 4k to 10,000 1MB files (10,000 threads, 1 per file).
- The ARC size now goes to 1 GB during the entire test (way above the hard limit)
- ::memstat reports that ZFS grew from the original 430 MB to about 1.5 GB
Q: Does the mdb ::memstat report include the ARC?
Q: On the backend I see 170 MB/s reads and 0.5 MB/s writes - what is happening here?

SOME sample output ...
---
> ::memstat
Page Summary            Pages      MB   %Tot
Kernel                 800933    3128    25%
ZFS File Data          394450    1540    13%
Anon                   128909     503     4%
Exec and libs            4172      16     0%
Page cache              14749      57     0%
Free (cachelist)        21884      85     1%
Free (freelist)       1776079    6937    57%
Total                 3141176   12270
Physical              3141175   12270
--
System Memory:
  Physical RAM: 12270 MB
  Free Memory:   6966 MB
  LotsFree:       191 MB

ZFS Tunables (/etc/system):
  set zfs:zfs_prefetch_disable = 1
  set zfs:zfs_arc_max = 0x20000000
  set zfs:zfs_arc_min = 0x10000000

ARC Size:
  Current Size:            669 MB (arcsize)
  Target Size (Adaptive):  512 MB (c)
  Min Size (Hard Limit):   256 MB (zfs_arc_min)
  Max Size (Hard Limit):   512 MB (zfs_arc_max)

ARC Size Breakdown:
  Most Recently Used Cache Size:    6%   32 MB (p)
  Most Frequently Used Cache Size: 93%  480 MB (c-p)

ARC Efficiency:
  Cache Access Total:    47002757
  Cache Hit Ratio:   52% 24657634  [Defined State for buffer]
  Cache Miss Ratio:  47% 22345123  [Undefined State for Buffer]
  REAL Hit Ratio:    52% 24657634  [MRU/MFU Hits Only]
  Data Demand Efficiency:    36%
  Data Prefetch Efficiency:  DISABLED (zfs_prefetch_disable)

  CACHE HITS BY CACHE LIST:
    Anon:                       --%  Counter Rolled.
    Most Recently Used:         13%  3420349 (mru)        [Return Customer]
    Most Frequently Used:       86%  21237285 (mfu)       [Frequent Customer]
    Most Recently Used Ghost:   16%  4057965 (mru_ghost)  [Return Customer Evicted, Now Back]
    Most Frequently Used Ghost: 31%  7837353 (mfu_ghost)  [Frequent Customer Evicted, Now Back]

  CACHE HITS BY DATA TYPE:
    Demand Data:        31%  7793822
    Prefetch Data:       0%  0
    Demand Metadata:    68%  16863812
    Prefetch Metadata:   0%  0

  CACHE MISSES BY DATA TYPE:
    Demand Data:        60%  13573358
    Prefetch Data:       0%  0
    Demand Metadata:    39%  8771406
    Prefetch Metadata:   0%  359
-
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
> "enh" == Edward Ned Harvey writes: enh> If you have zpool less than version 19 (when ability to remove enh> log device was introduced) and you have a non-mirrored log enh> device that failed, you had better treat the situation as an enh> emergency. Ed the log device removal support is only good for adding a slog to try it out, then changing your mind and removing the slog (which was not possible before). It doesn't change the reliability situation one bit: pools with dead slogs are not importable. There've been threads on this for a while. It's well-discussed because it's an example of IMHO broken process of ``obviously a critical requirement but not technically part of the original RFE which is already late,'' as well as a dangerous pitfall for ZFS admins. I imagine the process works well in other cases to keep stuff granular enough that it can be prioritized effectively, but in this case it's made the slog feature significantly incomplete for a couple years and put many production systems in a precarious spot, and the whole mess was predicted before the slog feature was integrated. >> The on-disk log (slog or otherwise), if I understand right, can >> actually make the filesystem recover to a crash-INconsistent >> state enh> You're speaking the opposite of common sense. Yeah, I'm doing it on purpose to suggest that just guessing how you feel things ought to work based on vague notions of economy isn't a good idea. enh> If disabling the ZIL makes the system faster *and* less prone enh> to data corruption, please explain why we don't all disable enh> the ZIL? I said complying with fsync can make the system recover to a state not equal to one you might have hypothetically snapshotted in a moment leading up to the crash. Elsewhere I might've said disabling the ZIL does not make the system more prone to data corruption, *iff* you are not an NFS server. 
If you are, disabling the ZIL can lead to lost writes if an NFS server reboots and an NFS client does not, which can definitely cause app-level data corruption. Disabling the ZIL breaks the D requirement of ACID databases, which might screw up apps that replicate or keep databases on several separate servers in sync, and it might lead to lost mail on an MTA. But because, unlike non-COW filesystems, it costs nothing extra for ZFS to preserve write ordering even without fsync(), AIUI you will not get corrupted application-level data by disabling the ZIL. You just get missing data that the app has a right to expect should be there.

The dire warnings written by kernel developers in the wikis of ``don't EVER disable the ZIL'' are totally ridiculous and inappropriate IMO. I think they probably just worked really hard to write the ZIL piece of ZFS, and don't want people telling their brilliant code to fuckoff just because it makes things a little slower. So we get all this ``enterprise'' snobbery and so on.

``Crash consistent'' is a technical term, not a common-sense term, and I may have used it incorrectly:

http://oraclestorageguy.typepad.com/oraclestorageguy/2007/07/why-emc-technol.html

With a system that loses power on which fsync() had been in use, the files getting fsync()'ed will probably recover to more recent versions than the rest of the files, which means the recovered state achieved by yanking the cord couldn't have been emulated by cloning a snapshot and not actually having lost power. However, the app calling fsync() will expect this, so it's not supposed to lead to application-level inconsistency.
If you test your app's recovery ability in just that way, by cloning snapshots of filesystems on which the app is actively writing and then seeing if the app can recover the clone, then you're unfortunately not testing the app quite hard enough if fsync() is involved, so yeah, I guess disabling the ZIL might in theory make incorrectly-written apps less prone to data corruption. Likewise, no testing of the app on ZFS will be aggressive enough to make the app powerfail-proof on a non-COW POSIX system, because ZFS keeps more ordering than the API actually guarantees to the app. I'm repeating myself though. I wish you'd just read my posts with at least paragraph granularity instead of picking out individual sentences and discarding everything that seems too complicated or too awkwardly stated.

I'm basing this all on the ``common sense'' that to do otherwise, fsync() would have to completely ignore its file descriptor argument. It'd have to copy the entire in-memory ZIL to the slog and behave the same as 'lockfs -fa', which I think would perform too badly compared to non-ZFS filesystems' fsync()s, and would lead to emphatic performance advice like ``segregate files that get lots of fsync()s into separate ZFS datasets from files that get high write bandwidth,'' and we don't have advice like that in the blogs/lists/wikis, which makes me think it's not beneficial (the benefit would be dramat
Re: [zfs-discuss] is this pool recoverable?
Thanks, that worked!! It needed "-Ff". The pool has been recovered with minimal loss of data.
Re: [zfs-discuss] dedup and memory/l2arc requirements
On Apr 1, 2010, at 5:39 PM, Roy Sigurd Karlsbakk wrote:
> Hi all
>
> I've been told (on #opensolaris, irc.freenode.net) that opensolaris needs a
> lot of memory and/or l2arc for dedup to function properly. How much memory
> or l2arc should I get for a 12TB zpool (8x2TB in RAIDz2), and then, how
> much for 125TB (after RAIDz2 overhead)? Is there a function into which I
> can plug my recordsize and volume size to get the appropriate numbers?

You can estimate the amount of disk space needed for the deduplication table, and the expected deduplication ratio, by using "zdb -S poolname" on your existing pool. Be patient; for an existing pool with lots of objects, this can take some time to run.

# ptime zdb -S zwimming
Simulated DDT histogram:

bucket              allocated                       referenced
______   ______________________________   ______________________________
refcnt   blocks   LSIZE   PSIZE   DSIZE   blocks   LSIZE   PSIZE   DSIZE
------   ------   -----   -----   -----   ------   -----   -----   -----
     1    2.27M    239G    188G    194G    2.27M    239G    188G    194G
     2     327K   34.3G   27.8G   28.1G     698K   73.3G   59.2G   59.9G
     4    30.1K   2.91G   2.10G   2.11G     152K   14.9G   10.6G   10.6G
     8    7.73K    691M    529M    529M    74.5K   6.25G   4.79G   4.80G
    16      673   43.7M   25.8M   25.9M    13.1K    822M    492M    494M
    32      197   12.3M   7.02M   7.03M    7.66K    480M    269M    270M
    64       47   1.27M    626K    626K    3.86K    103M   51.2M   51.2M
   128       22    908K    250K    251K    3.71K    150M   40.3M   40.3M
   256        7    302K     48K   53.7K    2.27K   88.6M   17.3M   19.5M
   512        4    131K   7.50K   7.75K    2.74K    102M   5.62M   5.79M
    2K        1      2K      2K      2K    3.23K   6.47M   6.47M   6.47M
    8K        1    128K      5K      5K    13.9K   1.74G   69.5M   69.5M
 Total    2.63M    277G    218G    225G    3.22M    337G    263G    270G

dedup = 1.20, compress = 1.28, copies = 1.03, dedup * compress / copies = 1.50

real     8:02.391932786
user     1:24.231855093
sys        15.193256108

In this file system, 2.63 million blocks are allocated. The in-core size of a DDT entry is approximately 250 bytes. So the math is pretty simple:

in-core size = 2.63M * 250 = 657.5 MB

If your dedup ratio is 1.0, then this number will scale linearly with size. If the dedup ratio is > 1.0, then this number will not scale linearly; it will be less. So you can use the linear scale as a worst-case approximation.
-- richard
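The rule of thumb above is simple multiplication, and it helps to have it as a one-liner you can rerun with your own block count (a sketch of Richard's estimate; the 250-bytes-per-entry figure is his approximation, not an exact on-disk structure size):

```python
DDT_ENTRY_BYTES = 250  # approximate in-core bytes per DDT entry (estimate)

def ddt_in_core_mb(allocated_blocks):
    """Worst-case in-core DDT size in MB (decimal) for a pool with the
    given number of allocated (unique) blocks."""
    return allocated_blocks * DDT_ENTRY_BYTES / 1e6

# The zwimming example: 2.63M allocated blocks
print(ddt_in_core_mb(2.63e6))  # 657.5
```

Plug in the "Total" allocated-blocks figure from your own "zdb -S" output to size RAM or L2ARC before enabling dedup.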
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
> If my new replacement SSD with identical part number and firmware is 0.001
> GB smaller than the original and hence unable to mirror, what's to prevent
> the same thing from happening to one of my 1TB spindle disk mirrors?

There is a standard for sizes that many manufacturers use (IDEMA LBA1-02):

LBA count = 97696368 + (1953504 * (Desired Capacity in GB - 50.0))

Sizes should match exactly if the manufacturer follows the standard.

See:
http://opensolaris.org/jive/message.jspa?messageID=393336#393336
http://www.idema.org/_smartsite/modules/local/data_file/show_file.php?cmd=download&data_file_id=1066
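As a sanity check, the IDEMA formula quoted above can be evaluated directly (a hypothetical sketch; 512-byte LBAs assumed, and as noted elsewhere in the thread, the formula only applies to drives of 50 GB and up):

```python
def idema_lba_count(capacity_gb):
    """LBA count per the IDEMA LBA1-02 formula for a drive of the given
    nominal capacity in decimal GB."""
    if capacity_gb < 50:
        raise ValueError("IDEMA LBA1-02 applies to drives of 50 GB and up")
    return 97696368 + 1953504 * (capacity_gb - 50)

# A nominal 1000 GB (1 TB) drive:
lbas = idema_lba_count(1000)
print(lbas, lbas * 512)  # 1953525168 LBAs, ~1.0 TB in bytes
```

Two conforming "1 TB" drives from different vendors should both report exactly this LBA count, which is why same-size replacement normally just works.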
Re: [zfs-discuss] is this pool recoverable?
On Fri, 2 Apr 2010, Patrick Tiquet wrote:
> I tried booting with b134 to attempt to recover the pool. I attempted with
> one disk of the mirror. Zpool tells me to use -F for import, fails, but
> then tells me to use -f, which also fails and tells me to use -F again.
> Any thoughts?

It looks like it wants you to use both -f and -F at the same time. I don't see that you tried that. Good luck.

Bob

-- Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On 02/04/2010 16:04, casper@sun.com wrote:
> sync() is actually *async*, and returning from sync() says nothing about

To clarify - in the case of ZFS, sync() is actually synchronous.

-- Robert Milkowski
http://milek.blogspot.com
Re: [zfs-discuss] is this pool recoverable?
I tried booting with b134 to attempt to recover the pool. I attempted with one disk of the mirror. Zpool tells me to use -F for import, fails, but then tells me to use -f, which also fails and tells me to use -F again. Any thoughts?

j...@opensolaris:~# zpool import
  pool: atomfs
    id: 1344695315736882
 state: FAULTED
status: The pool was last accessed by another system.
action: The pool cannot be imported due to damaged devices or data.
        The pool may be active on another system, but can be imported using
        the '-f' flag.
   see: http://www.sun.com/msg/ZFS-8000-EY
config:

        atomfs      FAULTED  corrupted data
          mirror-0  FAULTED  corrupted data
            c4t5d0  ONLINE
            c9d0    UNAVAIL  cannot open

j...@opensolaris:~# zpool import -f
  pool: atomfs
    id: 1344695315736882
 state: FAULTED
status: The pool was last accessed by another system.
action: The pool cannot be imported due to damaged devices or data.
        The pool may be active on another system, but can be imported using
        the '-f' flag.
   see: http://www.sun.com/msg/ZFS-8000-EY
config:

        atomfs      FAULTED  corrupted data
          mirror-0  FAULTED  corrupted data
            c4t5d0  ONLINE
            c9d0    UNAVAIL  cannot open

j...@opensolaris:~# zpool import -f 1344695315736882 newpool
cannot import 'atomfs' as 'newpool': one or more devices is currently
unavailable. Recovery is possible, but will result in some data loss.
Returning the pool to its state as of March 12, 2010 09:08:29 AM PST
should correct the problem. Recovery can be attempted by executing
'zpool import -F atomfs'. A scrub of the pool is strongly recommended
after recovery.

j...@opensolaris:~# zpool import -F atomfs
cannot import 'atomfs': pool may be in use from other system, it was last
accessed by blue (hostid: 0x82aa00) on Fri Mar 12 09:08:29 2010
use '-f' to import anyway

j...@opensolaris:~# zpool status
no pools available

j...@opensolaris:~# zpool import -f 1344695315736882
cannot import 'atomfs': one or more devices is currently unavailable.
Recovery is possible, but will result in some data loss.
Returning the pool to its state as of March 12, 2010 09:08:29 AM PST should correct the problem. Recovery can be attempted by executing 'zpool import -F atomfs'. A scrub of the pool is strongly recommended after recovery. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [install-discuss] Installing Opensolaris without ZFS?
I doubt it. ZFS is meant to be used for large systems, in which memory is not an issue.

Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
r...@karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented intelligibly. It is an elementary imperative for all pedagogues to avoid excessive use of idioms of foreign origin. In most cases, adequate and relevant synonyms exist in Norwegian.

- "ольга крыжановская" wrote:
> Are there plans to reduce the memory usage of ZFS in the near future?
>
> Olga
>
> 2010/4/2 Alan Coopersmith :
> > ольга крыжановская wrote:
> >> Does Opensolaris have an option to install without ZFS, i.e. use UFS
> >> for root like SXCE did?
> >
> > No. beadm & pkg image-update rely on ZFS functionality for the root
> > filesystem.
> >
> > --
> > -Alan Coopersmith- alan.coopersm...@oracle.com
> > Oracle Solaris Platform Engineering: X Window System
>
> --
> , __ ,
> { \/`o;-Olga Kryzhanovska -;o`\/ }
> .'-/`-/ olga.kryzhanov...@gmail.com \-`\-'.
> `'-..-| / Solaris/BSD//C/C++ programmer \ |-..-'`
> /\/\ /\/\
> `--` `--`
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] dedup and memory/l2arc requirements
Hi all

I've been told (on #opensolaris, irc.freenode.net) that OpenSolaris needs a lot of memory and/or L2ARC for dedup to function properly. How much memory or L2ARC should I get for a 12TB zpool (8x2TB in RAIDz2), and then, how much for 125TB (after RAIDz2 overhead)? Is there a function into which I can plug my recordsize and volume size to get the appropriate numbers?

Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
r...@karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented intelligibly. It is an elementary imperative for all pedagogues to avoid excessive use of idioms of foreign origin. In most cases, adequate and relevant synonyms exist in Norwegian.
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
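[Editor's sketch] The rule of thumb used with "zdb -S" elsewhere in this thread is roughly 250 bytes of in-core DDT per unique block. As a hedged sketch of the arithmetic Roy is asking for (the 250-byte figure and the worst-case assumption that every block is unique and full recordsize are approximations, not official ZFS sizing guidance):

```python
def ddt_ram_bytes(pool_bytes, recordsize=128 * 1024, bytes_per_entry=250):
    """Worst-case in-core DDT size: one ~250-byte entry per unique block.

    Assumes every block is unique and written at full recordsize;
    both assumptions are pessimistic, so treat this as an upper bound.
    """
    blocks = pool_bytes // recordsize
    return blocks * bytes_per_entry

tib = 1024 ** 4
print(ddt_ram_bytes(12 * tib) / 1024 ** 3)  # GiB for 12 TiB of data -> 23.4375
```

By this estimate, a 12 TB pool at the default 128 KiB recordsize needs on the order of 23 GiB of RAM/L2ARC in the worst case; pools with larger average block sizes, or real duplication, need less.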
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On Fri, Apr 2, 2010 at 8:03 AM, Edward Ned Harvey wrote: >> > Seriously, all disks configured WriteThrough (spindle and SSD disks >> > alike) >> > using the dedicated ZIL SSD device, very noticeably faster than >> > enabling the >> > WriteBack. >> >> What do you get with both SSD ZIL and WriteBack disks enabled? >> >> I mean if you have both why not use both? Then both async and sync IO >> benefits. > > Interesting, but unfortunately false. Soon I'll post the results here. I > just need to package them in a way suitable to give the public, and stick it > on a website. But I'm fighting IT fires for now and haven't had the time > yet. > > Roughly speaking, the following are approximately representative. Of course > it varies based on tweaks of the benchmark and stuff like that. > Stripe 3 mirrors write through: 450-780 IOPS > Stripe 3 mirrors write back: 1030-2130 IOPS > Stripe 3 mirrors write back + SSD ZIL: 1220-2480 IOPS > Stripe 3 mirrors write through + SSD ZIL: 1840-2490 IOPS > > Overall, I would say WriteBack is 2-3 times faster than naked disks. SSD > ZIL is 3-4 times faster than naked disk. And for some reason, having the > WriteBack enabled while you have SSD ZIL actually hurts performance by > approx 10%. You're better off to use the SSD ZIL with disks in Write > Through mode. > > That result is surprising to me. But I have a theory to explain it. When > you have WriteBack enabled, the OS issues a small write, and the HBA > immediately returns to the OS: "Yes, it's on nonvolatile storage." So the > OS quickly gives it another, and another, until the HBA write cache is full. > Now the HBA faces the task of writing all those tiny writes to disk, and the > HBA must simply follow orders, writing a tiny chunk to the sector it said it > would write, and so on. The HBA cannot effectively consolidate the small > writes into a larger sequential block write. 
But if you have the WriteBack > disabled, and you have a SSD for ZIL, then ZFS can log the tiny operation on > SSD, and immediately return to the process: "Yes, it's on nonvolatile > storage." So the application can issue another, and another, and another. > ZFS is smart enough to aggregate all these tiny write operations into a > single larger sequential write before sending it to the spindle disks. Hmm, when you did the write-back test was the ZIL SSD included in the write-back? What I was proposing was write-back only on the disks, and ZIL SSD with no write-back. Not all operations hit the ZIL, so it would still be nice to have the non-ZIL operations return quickly. -Ross ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
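[Editor's sketch] The rough shape of IOPS numbers like the ones Edward quotes can be reproduced with a crude fsync micro-benchmark. This is only an illustration (the parameters are arbitrary, and it is not the benchmark Edward used):

```python
import os
import tempfile
import time

def sync_write_iops(n=200, size=4096):
    """Time n write()+fsync() pairs on a temp file and return IOPS.

    fsync() forces each small write to stable storage (on ZFS this
    is the path the ZIL accelerates), so the result approximates
    small synchronous-write IOPS on whatever backs the temp dir.
    """
    buf = b"\0" * size
    fd, path = tempfile.mkstemp()
    try:
        t0 = time.perf_counter()
        for _ in range(n):
            os.write(fd, buf)
            os.fsync(fd)  # block until the write is durable
        return n / (time.perf_counter() - t0)
    finally:
        os.close(fd)
        os.unlink(path)
```

Running this against a pool with and without a slog (and with the HBA cache toggled) is a quick way to sanity-check results like those above.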
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On Apr 2, 2010, at 5:08 AM, Edward Ned Harvey wrote: >> I know it is way after the fact, but I find it best to coerce each >> drive down to the whole GB boundary using format (create Solaris >> partition just up to the boundary). Then if you ever get a drive a >> little smaller it still should fit. > > It seems like it should be unnecessary. It seems like extra work. But > based on my present experience, I reached the same conclusion. > > If my new replacement SSD with identical part number and firmware is 0.001 > Gb smaller than the original and hence unable to mirror, what's to prevent > the same thing from happening to one of my 1TB spindle disk mirrors? > Nothing. That's what. > > I take it back. Me. I am to prevent it from happening. And the technique > to do so is precisely as you've said. First slice every drive to be a > little smaller than actual. Then later if I get a replacement device for > the mirror, that's slightly smaller than the others, I have no reason to > care. However, I believe there are some downsides to letting ZFS manage just a slice rather than an entire drive, but perhaps those do not apply as significantly to SSD devices? Thanks -- Stuart Anderson ander...@ligo.caltech.edu http://www.ligo.caltech.edu/~anderson ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On Fri, 2 Apr 2010, Edward Ned Harvey wrote:

> were taking place at the same time. That is, if two processes both complete
> a write operation at the same time, one in sync mode and the other in async
> mode, then it is guaranteed the data on disk will never have the async data
> committed before the sync data.
>
> Based on this understanding, if you disable ZIL, then there is no guarantee
> about order of writes being committed to disk. Neither of the above
> guarantees is valid anymore. Sync writes may be completed out of order.
> Async writes that supposedly happened after sync writes may be committed to
> disk before the sync writes.

You seem to be assuming that Solaris is an incoherent operating system. With ZFS, the filesystem in memory is coherent, and transaction groups are constructed in simple chronological order (capturing combined changes up to that point in time), without regard to SYNC options. The only possible exception to the coherency is for memory mapped files, where the mapped memory is a copy of data (originally) from the ZFS ARC and needs to be reconciled with the ARC if an application has dirtied it. This differs from UFS and the way Solaris worked prior to Solaris 10.

Synchronous writes are not "faster" than asynchronous writes. If you drop heavy and light objects from the same height, they fall at the same rate. This was proven long ago.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On Fri, 2 Apr 2010, Edward Ned Harvey wrote:

> So you're saying that while the OS is building txg's to write to disk, the
> OS will never reorder the sequence in which individual write operations get
> ordered into the txg's. That is, an application performing a small sync
> write, followed by a large async write, will never have the second
> operation flushed to disk before the first. Can you support this belief in
> any way?

I am like a "pool" or "tank" of regurgitated zfs knowledge. I simply pay attention when someone who really knows explains something (e.g. Neil Perrin, as Casper referred to) so I can regurgitate it later. I try to do so faithfully. If I had behaved this way in school, I would have been a good student. Sometimes I am wrong, or the design has somewhat changed since the original information was provided.

There are indeed popular filesystems (e.g. Linux EXT4) which write data to disk in a different order than chronologically requested, so it is good that you are paying attention to these issues.

While, in the slog-based recovery scenario, it is possible for a TXG to be generated which lacks async data, this only happens after a system crash; if all of the critical data is written as a sync request, it will be faithfully preserved.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On Fri, Apr 2, 2010 at 16:24, Edward Ned Harvey wrote:
>> The purpose of the ZIL is to act like a fast "log" for synchronous
>> writes. It allows the system to quickly confirm a synchronous write
>> request with the minimum amount of work.
>
> Bob and Casper and some others clearly know a lot here. But I'm hearing
> conflicting information, and don't know what to believe. Does anyone here
> work on ZFS as an actual ZFS developer for Sun/Oracle? Can claim "I can
> answer this question, I wrote that code, or at least have read it?"
>
> Questions to answer would be:
>
> Is a ZIL log device used only by sync() and fsync() system calls? Is it
> ever used to accelerate async writes?

sync() will tell the filesystems to flush writes to disk. sync() will not use the ZIL; it will just start a new TXG, and could return before the writes are done. fsync() is what you are interested in.

> Suppose there is an application which sometimes does sync writes, and
> sometimes async writes. In fact, to make it easier, suppose two processes
> open two files, one of which always writes asynchronously, and one of which
> always writes synchronously. Suppose the ZIL is disabled. Is it possible
> for writes to be committed to disk out-of-order? Meaning, can a large block
> async write be put into a TXG and committed to disk before a small sync
> write to a different file is committed to disk, even though the small sync
> write was issued by the application before the large async write? Remember,
> the point is: ZIL is disabled. Question is whether the async could
> possibly be committed to disk before the sync.

Writes from a TXG will not be used until the whole TXG is committed to disk. Everything from a half-written TXG will be ignored after a crash. This means that the order of writes within a TXG is not important. The only way to do a sync write without the ZIL is to start a new TXG after the write. That costs a lot, so we have the ZIL for sync writes.
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
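[Editor's sketch] The sync vs. async distinction being debated can be made concrete from the application side. A minimal, hedged illustration (the file names are arbitrary; os.O_DSYNC is the POSIX flag as exposed by Python on Solaris and Linux; whether the synchronous write actually goes through a ZIL depends on the filesystem underneath):

```python
import os
import tempfile

tmpdir = tempfile.mkdtemp()

# Synchronous handle: O_DSYNC makes every write block until the data
# is on stable storage (on ZFS that is the path served by the ZIL).
sync_fd = os.open(os.path.join(tmpdir, "sync.dat"),
                  os.O_WRONLY | os.O_CREAT | os.O_DSYNC, 0o644)

# Asynchronous handle: writes return once cached in memory and reach
# disk whenever the next transaction group (or page flush) commits.
async_fd = os.open(os.path.join(tmpdir, "async.dat"),
                   os.O_WRONLY | os.O_CREAT, 0o644)

os.write(sync_fd, b"ledger entry")   # blocks until durable
os.write(async_fd, b"scratch data")  # returns immediately

os.close(sync_fd)
os.close(async_fd)
```

An application mixing both handles is exactly the scenario Edward describes: with the ZIL disabled, both kinds of writes simply ride the next TXG.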
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On 4/2/2010 8:08 AM, Edward Ned Harvey wrote:
>> I know it is way after the fact, but I find it best to coerce each
>> drive down to the whole GB boundary using format (create Solaris
>> partition just up to the boundary). Then if you ever get a drive a
>> little smaller it still should fit.
>
> It seems like it should be unnecessary. It seems like extra work. But
> based on my present experience, I reached the same conclusion.
>
> If my new replacement SSD with identical part number and firmware is 0.001
> Gb smaller than the original and hence unable to mirror, what's to prevent
> the same thing from happening to one of my 1TB spindle disk mirrors?
> Nothing. That's what.

Actually, it's my experience that Sun (and other vendors) do exactly that for you when you buy their parts - at least for rotating drives; I have no experience with SSDs. The Sun disk label shipped on all the drives is set up to make the drive the standard size for that Sun part number. They have to do this since they (for many reasons) have many sources (diff. vendors, even diff. parts from the same vendor) for the actual disks they use for a particular Sun part number. This isn't new; I believe IBM, EMC, HP, etc. all do it also for the same reasons. I'm a little surprised that the engineers would suddenly stop doing it only on SSDs. But who knows.

-Kyle

> I take it back. Me. I am to prevent it from happening. And the technique
> to do so is precisely as you've said. First slice every drive to be a
> little smaller than actual. Then later if I get a replacement device for
> the mirror, that's slightly smaller than the others, I have no reason to
> care.
>
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
>Questions to answer would be:
>
>Is a ZIL log device used only by sync() and fsync() system calls? Is it
>ever used to accelerate async writes?

There are quite a few "sync" writes, specifically when you mix in the NFS server.

>Suppose there is an application which sometimes does sync writes, and
>sometimes async writes. In fact, to make it easier, suppose two processes
>open two files, one of which always writes asynchronously, and one of which
>always writes synchronously. Suppose the ZIL is disabled. Is it possible
>for writes to be committed to disk out-of-order? Meaning, can a large block
>async write be put into a TXG and committed to disk before a small sync
>write to a different file is committed to disk, even though the small sync
>write was issued by the application before the large async write? Remember,
>the point is: ZIL is disabled. Question is whether the async could
>possibly be committed to disk before the sync.

From what I quoted from the other discussion, it seems that later writes cannot be committed in an earlier TXG than your sync write or other earlier writes.

>I make the assumption that an uberblock is the term for a TXG after it is
>committed to disk. Correct?

The "uberblock" is the "root of all the data". All the data in a ZFS pool is referenced by it; after the txg is in stable storage, then the uberblock is updated.

>At boot time, or "zpool import" time, what is taken to be "the current
>filesystem?" The latest uberblock? Something else?

The current "zpool" and the filesystems as referenced by the last uberblock.

>My understanding is that enabling a dedicated ZIL device guarantees sync()
>and fsync() system calls block until the write has been committed to
>nonvolatile storage, and attempts to accelerate by using a physical device
>which is faster or more idle than the main storage pool. My understanding
>is that this provides two implicit guarantees: (1) sync writes are always
>guaranteed to be committed to disk in order, relevant to other sync writes.
>(2) In the event of OS halting or ungraceful shutdown, sync writes committed
>to disk are guaranteed to be equal or greater than the async writes that
>were taking place at the same time. That is, if two processes both complete
>a write operation at the same time, one in sync mode and the other in async
>mode, then it is guaranteed the data on disk will never have the async data
>committed before the sync data.

sync() is actually *async*, and returning from sync() says nothing about stable storage. After fsync() returns, it signals that all the data is in stable storage (except if you disable the ZIL), or, apparently, in Linux when the write caches for your disks are enabled (the default for PC drives). ZFS doesn't care about the write cache; it makes sure it is flushed. (There's fsync() and open(..., O_DSYNC|O_SYNC).)

>Based on this understanding, if you disable ZIL, then there is no guarantee
>about order of writes being committed to disk. Neither of the above
>guarantees is valid anymore. Sync writes may be completed out of order.
>Async writes that supposedly happened after sync writes may be committed to
>disk before the sync writes.
>
>Somebody, (Casper?) said it before, and now I'm starting to realize ... This
>is also true of the snapshots. If you disable your ZIL, then there is no
>guarantee your snapshots are consistent either. Rolling back doesn't
>necessarily gain you anything.
>
>The only way to guarantee consistency in the snapshot is to always
>(regardless of ZIL enabled/disabled) give priority for sync writes to get
>into the TXG before async writes.
>
>If the OS does give priority for sync writes going into TXG's before async
>writes (even with ZIL disabled), then after spontaneous ungraceful reboot,
>the latest uberblock is guaranteed to be consistent.

I believe that the writes are still ordered, so the consistency you want is actually delivered even without the ZIL enabled.

Casper
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
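[Editor's sketch] Casper's distinction between sync() and fsync() is easy to demonstrate from user space. A hedged illustration (POSIX only requires sync() to schedule the flush, so only the fsync() line is a durability guarantee):

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
os.write(fd, b"payload")

os.sync()     # schedules a flush of all filesystems; POSIX allows it
              # to return before anything has reached stable storage
os.fsync(fd)  # blocks until this file's data is durable

os.close(fd)
```

This is why "the application called sync()" is not, by itself, an argument that the data survived a crash; fsync() (or O_DSYNC) is what carries the guarantee.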
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
> The purpose of the ZIL is to act like a fast "log" for synchronous > writes. It allows the system to quickly confirm a synchronous write > request with the minimum amount of work. Bob and Casper and some others clearly know a lot here. But I'm hearing conflicting information, and don't know what to believe. Does anyone here work on ZFS as an actual ZFS developer for Sun/Oracle? Can claim "I can answer this question, I wrote that code, or at least have read it?" Questions to answer would be: Is a ZIL log device used only by sync() and fsync() system calls? Is it ever used to accelerate async writes? Suppose there is an application which sometimes does sync writes, and sometimes async writes. In fact, to make it easier, suppose two processes open two files, one of which always writes asynchronously, and one of which always writes synchronously. Suppose the ZIL is disabled. Is it possible for writes to be committed to disk out-of-order? Meaning, can a large block async write be put into a TXG and committed to disk before a small sync write to a different file is committed to disk, even though the small sync write was issued by the application before the large async write? Remember, the point is: ZIL is disabled. Question is whether the async could possibly be committed to disk before the sync. I make the assumption that an uberblock is the term for a TXG after it is committed to disk. Correct? At boot time, or "zpool import" time, what is taken to be "the current filesystem?" The latest uberblock? Something else? My understanding is that enabling a dedicated ZIL device guarantees sync() and fsync() system calls block until the write has been committed to nonvolatile storage, and attempts to accelerate by using a physical device which is faster or more idle than the main storage pool. My understanding is that this provides two implicit guarantees: (1) sync writes are always guaranteed to be committed to disk in order, relevant to other sync writes. 
(2) In the event of OS halting or ungraceful shutdown, sync writes committed to disk are guaranteed to be equal or greater than the async writes that were taking place at the same time. That is, if two processes both complete a write operation at the same time, one in sync mode and the other in async mode, then it is guaranteed the data on disk will never have the async data committed before the sync data. Based on this understanding, if you disable ZIL, then there is no guarantee about order of writes being committed to disk. Neither of the above guarantees is valid anymore. Sync writes may be completed out of order. Async writes that supposedly happened after sync writes may be committed to disk before the sync writes. Somebody, (Casper?) said it before, and now I'm starting to realize ... This is also true of the snapshots. If you disable your ZIL, then there is no guarantee your snapshots are consistent either. Rolling back doesn't necessarily gain you anything. The only way to guarantee consistency in the snapshot is to always (regardless of ZIL enabled/disabled) give priority for sync writes to get into the TXG before async writes. If the OS does give priority for sync writes going into TXG's before async writes (even with ZIL disabled), then after spontaneous ungraceful reboot, the latest uberblock is guaranteed to be consistent. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
> Only a broken application uses sync writes > sometimes, and async writes at other times. Suppose there is a virtual machine, with virtual processes inside it. Some virtual process issues a sync write to the virtual OS, meanwhile another virtual process issues an async write. Then the virtual OS will sometimes issue sync writes and sometimes async writes to the host OS. Are you saying this makes qemu, and vbox, and vmware "broken applications?" ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
> >Dude, don't be so arrogant. Acting like you know what I'm talking > about > >better than I do. Face it that you have something to learn here. > > You may say that, but then you post this: Acknowledged. I read something arrogant, and I replied even more arrogant. That was dumb of me. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
>So you're saying that while the OS is building txg's to write to disk, the
>OS will never reorder the sequence in which individual write operations get
>ordered into the txg's. That is, an application performing a small sync
>write, followed by a large async write, will never have the second operation
>flushed to disk before the first. Can you support this belief in any way?

The question is not how the writes are ordered but whether an earlier write can be in a later txg. A transaction group is committed atomically.

In http://arc.opensolaris.org/caselog/PSARC/2010/108/mail I ask a similar question to make sure I understand it correctly, and the answer was ("> = Casper"; the answers are from Neil Perrin):

> Is there a partial order defined for all filesystem operations?

File system operations will be written in order for all settings of the sync flag.

> Specifically, will ZFS guarantee that when fsync()/O_DATA happens on a
> file,

(I assume by O_DATA you meant O_DSYNC.)

> that later transactions will not be in an earlier transaction group?
> (Or is this already the case?)

This is already the case.

So what I assumed was true, but what you made me doubt, was apparently still true: later transactions cannot be committed in an earlier txg.

>If that's true, if there's no increased risk of data corruption, then why
>doesn't everybody just disable their ZIL all the time on every system?

For an application running on the file server, there is no difference. When the system panics, you know that data might be lost. The application also dies. (The snapshot and the last valid uberblock are equally valid.)

But for an application on an NFS client, without the ZIL data will be lost while the NFS client believes the data is written, and it will not try again. With the ZIL, when the NFS server says that data is written, then it is actually on stable storage.

>The reason to have a sync() function in C/C++ is so you can ensure data is
>written to disk before you move on. It's a blocking call, that doesn't
>return until the sync is completed. The only reason you would ever do this
>is if order matters. If you cannot allow the next command to begin until
>after the previous one was completed. Such is the situation with databases
>and sometimes virtual machines.

So the question is: when will your data be invalid?

What happens with the data when the system dies before the fsync() call? What happens with the data when the system dies after the fsync() call? What happens with the data when the system dies after more I/O operations?

With the ZIL disabled, you call fsync() but you may encounter data from before the call to fsync(). That could happen anyway, so I assume you can actually recover from that situation.

Casper
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
>> > http://nfs.sourceforge.net/ >> >> I think B4 is the answer to Casper's question: > >We were talking about ZFS, and under what circumstances data is flushed to >disk, in what way "sync" and "async" writes are handled by the OS, and what >happens if you disable ZIL and lose power to your system. > >We were talking about C/C++ sync and async. Not NFS sync and async. I don't think so. http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg36783.html (This discussion was started, I think, in the context of NFS performance) Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
> If you have zpool less than version 19 (when ability to remove log > device > was introduced) and you have a non-mirrored log device that failed, you > had > better treat the situation as an emergency. > Instead, do "man zpool" and look for "zpool > remove." > If it says "supports removing log devices" then you had better use it > to > remove your log device. If it says "only supports removing hotspares > or > cache" then your zpool is lost permanently. I take it back. If you lost your log device on a zpool which is less than version 19, then you *might* have a possible hope if you migrate your disks to a later system. You *might* be able to "zpool import" on a later version of OS. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
> ZFS recovers to a crash-consistent state, even without the slog, > meaning it recovers to some state through which the filesystem passed > in the seconds leading up to the crash. This isn't what UFS or XFS > do. > > The on-disk log (slog or otherwise), if I understand right, can > actually make the filesystem recover to a crash-INconsistent state (a You're speaking the opposite of common sense. If disabling the ZIL makes the system faster *and* less prone to data corruption, please explain why we don't all disable the ZIL? ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
> hello > > i have had this problem this week. our zil ssd died (apt slc ssd 16gb). > because we had no spare drive in stock, we ignored it. > > then we decided to update our nexenta 3 alpha to beta, exported the > pool and made a fresh install to have a clean system and tried to > import the pool. we only got a error message about a missing drive. > > we googled about this and it seems there is no way to acces the pool > !!! > (hope this will be fixed in future) > > we had a backup and the data are not so important, but that could be a > real problem. > you have a valid zfs3 pool and you cannot access your data due to > missing zil. If you have zpool less than version 19 (when ability to remove log device was introduced) and you have a non-mirrored log device that failed, you had better treat the situation as an emergency. Normally you can find your current zpool version by doing "zpool upgrade," but you cannot now if you're in this failure state. Do not attempt "zfs send" or "zfs list" or any other zpool or zfs command. Instead, do "man zpool" and look for "zpool remove." If it says "supports removing log devices" then you had better use it to remove your log device. If it says "only supports removing hotspares or cache" then your zpool is lost permanently. If you are running Solaris, take it as given, you do not have zpool version 19. If you are running Opensolaris, I don't know at which point zpool 19 was introduced. Your only hope is to "zpool remove" the log device. Use tar or cp or something, to try and salvage your data out of there. Your zpool is lost and if it's functional at all right now, it won't stay that way for long. Your system will soon hang, and then you will not be able to import your pool. Ask me how I know. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
> > I am envisioning a database, which issues a small sync write, > followed by a > > larger async write. Since the sync write is small, the OS would > prefer to > > defer the write and aggregate into a larger block. So the > possibility of > > the later async write being committed to disk before the older sync > write is > > a real risk. The end result would be inconsistency in my database > file. > > Zfs writes data in transaction groups and each bunch of data which > gets written is bounded by a transaction group. The current state of > the data at the time the TXG starts will be the state of the data once > the TXG completes. If the system spontaneously reboots then it will > restart at the last completed TXG so any residual writes which might > have occured while a TXG write was in progress will be discarded. > Based on this, I think that your ordering concerns (sync writes > getting to disk "faster" than async writes) are unfounded for normal > file I/O. So you're saying that while the OS is building txg's to write to disk, the OS will never reorder the sequence in which individual write operations get ordered into the txg's. That is, an application performing a small sync write, followed by a large async write, will never have the second operation flushed to disk before the first. Can you support this belief in any way? If that's true, if there's no increased risk of data corruption, then why doesn't everybody just disable their ZIL all the time on every system? The reason to have a sync() function in C/C++ is so you can ensure data is written to disk before you move on. It's a blocking call, that doesn't return until the sync is completed. The only reason you would ever do this is if order matters. If you cannot allow the next command to begin until after the previous one was completed. Such is the situation with databases and sometimes virtual machines. 
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
> > http://nfs.sourceforge.net/
>
> I think B4 is the answer to Casper's question:

We were talking about ZFS, and under what circumstances data is flushed to disk; in what way "sync" and "async" writes are handled by the OS; and what happens if you disable the ZIL and lose power to your system.

We were talking about C/C++ sync and async, not NFS sync and async. I don't think anything relating to NFS is the answer to Casper's question, or else Casper was simply jumping context by asking it. Don't get me wrong, I have no objection to his question; it's just that the conversation has derailed, and now people are talking about NFS sync/async instead of what happens when a C/C++ application does sync/async writes with the ZIL disabled.
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
When we use one vmod, both machines finish in about 6min45; zilstat maxes out at about 4200 IOPS. Using four vmods it takes about 6min55, and zilstat maxes out at 2200 IOPS.

Can you try 4 concurrent tars to four different ZFS filesystems (same pool)?

-r
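A throwaway harness for the suggested test might look like the sketch below. Paths and the archive are invented here; on the real system each target directory would be a separate ZFS filesystem in the same pool (e.g. tank/fs1 through tank/fs4), and you would watch zilstat while it runs:

```shell
#!/bin/sh
# Sketch: run four tar extractions concurrently, one per target directory.
# Plain temp directories stand in for the four ZFS filesystems.
set -e
WORK=$(mktemp -d)

# Build a small archive to extract (stand-in for the real test data).
mkdir "$WORK/src"
echo hello > "$WORK/src/file.txt"
tar cf "$WORK/test.tar" -C "$WORK" src

# Extract into four targets in parallel.
for i in 1 2 3 4; do
  mkdir "$WORK/fs$i"
  tar xf "$WORK/test.tar" -C "$WORK/fs$i" &
done
wait    # all four extractions run concurrently

ls "$WORK"/fs*/src/file.txt
rm -rf "$WORK"
```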
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
> I know it is way after the fact, but I find it best to coerce each
> drive down to a whole-GB boundary using format (create a Solaris
> partition just up to the boundary). Then if you ever get a drive a
> little smaller, it still should fit.

It seems like it should be unnecessary. It seems like extra work. But based on my present experience, I reached the same conclusion. If my new replacement SSD with identical part number and firmware is 0.001 GB smaller than the original, and hence unable to mirror, what's to prevent the same thing from happening to one of my 1TB spindle disk mirrors? Nothing. That's what. I take it back: me. I am to prevent it from happening. And the technique to do so is precisely as you've said. First slice every drive to be a little smaller than actual. Then later, if I get a replacement device for the mirror that's slightly smaller than the others, I have no reason to care.
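For the archives, the workflow being described is roughly the following. This is a hypothetical sketch: the device names (c1t2d0, c1t3d0), pool name, and the 930GB figure are all made up, and the exact slicing steps are done interactively inside format:

```
# Hypothetical sketch: label each disk with a slice a little smaller than
# the raw device, then mirror the slices instead of the whole disks.
format -e c1t2d0     # partition -> modify: size slice 0 to, say, 930GB
format -e c1t3d0     # repeat so both sides of the mirror match
zpool create tank mirror c1t2d0s0 c1t3d0s0
```

A later replacement that is a fraction of a GB smaller than the raw originals still fits the slice.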
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
> > Seriously, all disks configured WriteThrough (spindle and SSD disks alike)
> > using the dedicated ZIL SSD device: very noticeably faster than enabling
> > the WriteBack.
>
> What do you get with both SSD ZIL and WriteBack disks enabled?
>
> I mean, if you have both, why not use both? Then both async and sync IO
> benefit.

Interesting, but unfortunately false. Soon I'll post the results here. I just need to package them in a way suitable to give the public, and stick them on a website. But I'm fighting IT fires for now and haven't had the time yet. Roughly speaking, the following are approximately representative. Of course it varies based on tweaks of the benchmark and stuff like that.

Stripe of 3 mirrors, write through: 450-780 IOPS
Stripe of 3 mirrors, write back: 1030-2130 IOPS
Stripe of 3 mirrors, write back + SSD ZIL: 1220-2480 IOPS
Stripe of 3 mirrors, write through + SSD ZIL: 1840-2490 IOPS

Overall, I would say WriteBack is 2-3 times faster than naked disks. SSD ZIL is 3-4 times faster than naked disks. And for some reason, having WriteBack enabled while you have the SSD ZIL actually hurts performance by approx 10%. You're better off using the SSD ZIL with the disks in WriteThrough mode.

That result is surprising to me, but I have a theory to explain it. When you have WriteBack enabled, the OS issues a small write, and the HBA immediately returns to the OS: "Yes, it's on nonvolatile storage." So the OS quickly gives it another, and another, until the HBA write cache is full. Now the HBA faces the task of writing all those tiny writes to disk, and the HBA must simply follow orders, writing each tiny chunk to the sector it said it would write, and so on. The HBA cannot effectively consolidate the small writes into a larger sequential block write. But if you have WriteBack disabled, and you have an SSD for ZIL, then ZFS can log the tiny operation on the SSD and immediately return to the process: "Yes, it's on nonvolatile storage."
So the application can issue another, and another, and another. ZFS is smart enough to aggregate all these tiny write operations into a single larger sequential write before sending it to the spindle disks.

Long story short, the evidence suggests that if you have an SSD ZIL, you're better off without WriteBack on the HBA. And I conjecture the reason is that ZFS can buffer writes better than the HBA can.
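The configuration being recommended, a dedicated slog with the HBA in write-through, is set up with zpool itself. A hypothetical sketch (pool and device names are made up):

```
# Hypothetical sketch: dedicate the SSD as a separate intent log (slog)
# device, leaving the HBA's write cache off for all members.
zpool add tank log c2t0d0
zpool status tank     # the SSD appears under a separate "logs" section
```

From then on, small synchronous writes land on the SSD log while ZFS aggregates them into large sequential writes to the mirrors at TXG commit.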
Re: [zfs-discuss] how can I remove files when the file system is full?
> On opensolaris? Did you try deleting any old BEs?

Don't forget to "zfs destroy rp...@snapshot"

In fact, you might start with destroying snapshots ... if there are any occupying space.
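A sketch of that workflow; the snapshot name here is hypothetical, and you should check what each snapshot is pinning before destroying it:

```
# List snapshots with the space each would free, largest last.
zfs list -t snapshot -o name,used -s used

# Then destroy the ones you can live without (name is made up).
zfs destroy rpool/ROOT/opensolaris@2010-03-01
```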
Re: [zfs-discuss] how can I remove files when the file system is full?
Thanks, Brandon. Now that the issue has gone away, I was able to recover my host.

-Eiji

> On Thu, Apr 1, 2010 at 1:39 PM, Eiji Ota < eiji@oracle.com > wrote:
>
> > Thanks. It worked, but the fs still says it's full. Is this normal, and will
> > I get some space back eventually (if I continue this)?
>
> You may need to destroy some snapshots before the space becomes available.
> "zfs list -t snapshot" will show approximately how much space will be freed
> for each snapshot.
>
> -B
> --
> Brandon High : bh...@freaks.com
Re: [zfs-discuss] how can I remove files when the file system is full?
Thanks. It worked, but the fs still says it's full. Is this normal, and will I get some space back eventually (if I continue this)?

# cat /dev/null >./messages.1
# cat /dev/null >./messages.0
# df -kl
Filesystem                     1K-blocks    Used Available Use% Mounted on
rpool/ROOT/opensolaris           4976123 4976123         0 100% /    <== available space is still 0
swap                            14218704     244  14218460   1% /etc/svc/volatile
/usr/lib/libc/libc_hwcap2.so.1   4976123 4976123         0 100% /lib/libc.so.1
swap                            14218600     140  14218460   1% /tmp
swap                            14218472      12  14218460   1% /var/run

-Eiji

> On Thu, Apr 1, 2010 at 12:46 PM, Eiji Ota < eiji@oracle.com > wrote:
>
> > # cd /var/adm
> > # rm messages.?
> > rm: cannot remove `messages.0': No space left on device
> > rm: cannot remove `messages.1': No space left on device
>
> I think doing "cat /dev/null > /var/adm/messages.1" will work.
>
> -B
> --
> Brandon High : bh...@freaks.com
Re: [zfs-discuss] how can I remove files when the file system is full?
On 04/ 1/10 01:46 PM, Eiji Ota wrote:

> During the IPS upgrade, the file system got full, and now I cannot do anything to recover it.
>
> # df -kl
> Filesystem                     1K-blocks    Used Available Use% Mounted on
> rpool/ROOT/opensolaris           4976642 4976642         0 100% /
> swap                            14217564     244  14217320   1% /etc/svc/volatile
> /usr/lib/libc/libc_hwcap2.so.1   4976642 4976642         0 100% /lib/libc.so.1
> swap                            14217460     140  14217320   1% /tmp
> swap                            14217344      24  14217320   1% /var/run
>
> # cd /var/adm
> # rm messages.?
> rm: cannot remove `messages.0': No space left on device
> rm: cannot remove `messages.1': No space left on device
>
> Likely a similar issue was reported a few years ago:
> http://opensolaris.org/jive/thread.jspa?messageID=241580
>
> However, my system is on snv_133. Is there any way to work around the situation? This is really critical, since after IPS gets the file system full, customers seem unable to recover.
>
> Thanks,
> -Eiji

On opensolaris? Did you try deleting any old BEs?

-tim
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
Robert Milkowski writes:

> On 01/04/2010 20:58, Jeroen Roodhart wrote:
> >
> > > I'm happy to see that it is now the default and I hope this will cause the
> > > Linux NFS client implementation to be faster for conforming NFS servers.
> >
> > Interesting thing is that apparently defaults on Solaris and Linux are
> > chosen such that one can't signal the desired behaviour to the other. At
> > least we didn't manage to get a Linux client to asynchronously mount a
> > Solaris (ZFS backed) NFS export...
>
> Which is to be expected, as it is not the NFS client which requests the
> behavior but rather the NFS server.
> Currently on Linux you can export a share as sync (the default) or async,
> while on Solaris you can't currently force an NFS server to start working
> in async mode.

True, and there is an entrenched misconception (not you) that this is a ZFS-specific problem, which it's not. It's really an NFS protocol feature. It can be circumvented using zil_disable, which therefore reinforces the misconception. It's further reinforced by testing an NFS server on disk drives with WCE=1 and a filesystem other than ZFS.

All the fast options cause the NFS client to become inconsistent after a server reboot. Whatever was being done in the moments prior to the server reboot will need to be wiped out by users if they are told that the server did reboot. That's manageable for home use, not for the enterprise.

-r

> --
> Robert Milkowski
> http://milek.blogspot.com
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
> On 01/04/2010 20:58, Jeroen Roodhart wrote:
> >
> > > I'm happy to see that it is now the default and I hope this will cause the
> > > Linux NFS client implementation to be faster for conforming NFS servers.
> >
> > Interesting thing is that apparently defaults on Solaris and Linux are
> > chosen such that one can't signal the desired behaviour to the other. At
> > least we didn't manage to get a Linux client to asynchronously mount a
> > Solaris (ZFS backed) NFS export...
>
> Which is to be expected, as it is not the NFS client which requests the
> behavior but rather the NFS server.
> Currently on Linux you can export a share as sync (the default) or async,
> while on Solaris you can't currently force an NFS server to start working
> in async mode.

The other part of the issue is that the Solaris clients have been developed against a "sync" server: the client does more write-behind and continues caching the non-acked data. The Linux client has been developed against an "async" server and has some catching up to do.

Casper
Re: [zfs-discuss] [install-discuss] Installing Opensolaris without ZFS?
[removing all lists except zfs-discuss, as this is really pertinent only there]

ольга крыжановская wrote:
> Are there plans to reduce the memory usage of ZFS in the near future?
>
> Olga
>
> 2010/4/2 Alan Coopersmith:
> > ольга крыжановская wrote:
> > > Does Opensolaris have an option to install without ZFS, i.e. use UFS
> > > for root like SXCE did?
> >
> > No. beadm & pkg image-update rely on ZFS functionality for the root
> > filesystem.
> >
> > --
> > -Alan Coopersmith  alan.coopersm...@oracle.com
> > Oracle Solaris Platform Engineering: X Window System

The vast majority of ZFS memory consumption is for caching, which can be manually reduced if it's impinging on your applications. See the tuning guide for more info: http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide

As pointed out elsewhere, these tuning parameters are generally high-water marks: ZFS will return RAM to the system if it's needed by applications. So, in your original problem, the likelihood is /not/ that ZFS is consuming RAM and not releasing it, but rather that your other apps are overloading the system. That said, there are certain minimum allocations that can't be reduced and must be held in RAM, but they're generally not significant.

UFS's memory usage is really not measurably different from ZFS's, so far as I can measure from a kernel standpoint. It's all the caching that makes ZFS look like a RAM pig. One thing though: taking away all of ZFS's caching hurts performance more than removing all of UFS's file cache, because ZFS stores more than simple data in its file cache (ARC).

Realistically speaking, I can't see running ZFS on a machine with less than 1GB of RAM. I also can't see modifying ZFS to work well in such circumstances, as (a) ZFS isn't targeted at such limited platforms and (b) you'd seriously compromise a major chunk of performance trying to make it fit. These days, 4GB is really more of a minimum for a 64-bit machine/OS in any case.
I certainly would be interested in seeing what a large L2ARC cache would mean for a reduction in RAM footprint; on one hand, having an L2ARC requires ARC (i.e. DRAM) allocations for each entry in the L2ARC, but on the other hand, it would reduce or eliminate storage of the actual data and metadata in DRAM. Anyone up for running tests on a box with, say, 512MB of RAM and a 10GB+ L2ARC (on, say, an SSD)?

--
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)
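For reference, the manual cap mentioned above is applied via /etc/system. A hypothetical fragment, with an illustrative 512MB figure (the value is in bytes; consult the Evil Tuning Guide before using anything like this in production):

```
* Hypothetical /etc/system fragment: cap the ZFS ARC at 512 MB.
* 0x20000000 bytes = 536870912 = 512 MB; figure is illustrative only.
set zfs:zfs_arc_max = 0x20000000
```

A reboot is required for /etc/system changes to take effect.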
Re: [zfs-discuss] [install-discuss] Installing Opensolaris without ZFS?
Are there plans to reduce the memory usage of ZFS in the near future?

Olga

2010/4/2 Alan Coopersmith:
> ольга крыжановская wrote:
> > Does Opensolaris have an option to install without ZFS, i.e. use UFS
> > for root like SXCE did?
>
> No. beadm & pkg image-update rely on ZFS functionality for the root
> filesystem.
>
> --
> -Alan Coopersmith  alan.coopersm...@oracle.com
> Oracle Solaris Platform Engineering: X Window System

--
Olga Kryzhanovska, olga.kryzhanov...@gmail.com
Solaris/BSD//C/C++ programmer
[zfs-discuss] ARC Tail
Greetings all,

Can anyone help me figure out the size of the ARC "tail", i.e., the portion of the ARC that the l2arc feed thread reads from before pages are evicted from the ARC? Is the size of this tail proportional to the total ARC size? To the L2ARC device size? Is it tunable?

Your feedback is highly appreciated.

--
Abdullah Al-Dahlawi
PhD Candidate
George Washington University
Department of Electrical & Computer Engineering

Check The Fastest 500 Super Computers Worldwide
http://www.top500.org/list/2009/11/100
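Not an answer to the sizing question, but the relevant counters can at least be watched while the feed thread runs. A Solaris-only sketch, reading the arcstats kstat (statistic names as published by that kstat):

```
# Sketch: observe ARC size alongside L2ARC size and the DRAM cost of
# L2ARC headers while the l2arc feed thread is active.
kstat -p zfs:0:arcstats:size \
      zfs:0:arcstats:l2_size \
      zfs:0:arcstats:l2_hdr_size
```

Sampling these in a loop while the L2ARC warms gives a rough picture of how the feed keeps pace with eviction.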