Re: [zfs-discuss] dedup and memory/l2arc requirements
I might add some swap I guess. I will have to try it on another machine with more RAM and less pool, and see how the size of the zdb image compares to the calculated size of DDT needed. So long as zdb is the same or a little smaller than the DDT it predicts, the tool's still useful; it's just that sometimes it will report "DDT too big, but not sure by how much" by coredumping/thrashing instead of finishing. In my experience, more swap doesn't help break through the 2GB memory barrier. As zdb is an intentionally unsupported tool, methinks a recompile may be required (or write your own). I guess this tool might not work too well, then, with 20TiB in 47M files? Best regards roy -- Roy Sigurd Karlsbakk (+47) 97542685 r...@karlsbakk.net http://blogg.karlsbakk.net/ -- In all pedagogy it is essential that the curriculum be presented intelligibly. It is an elementary imperative for all pedagogues to avoid excessive use of idioms of foreign origin. In most cases, adequate and relevant synonyms exist in Norwegian. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
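If the 2GB ceiling comes from accidentally running a 32-bit zdb binary, it may be worth checking which ISA is actually being executed before resorting to a recompile. A minimal sketch, assuming the standard Solaris isaexec layout on x86 ('tank' is a placeholder pool name; verify the paths exist on your build):

# Is the zdb on the PATH 32- or 64-bit?
file /usr/sbin/zdb /usr/sbin/amd64/zdb
# Invoke the 64-bit binary directly for a larger address space:
/usr/sbin/amd64/zdb -S tank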
Re: [zfs-discuss] dedup and memory/l2arc requirements
You can estimate the amount of disk space needed for the deduplication table and the expected deduplication ratio by using zdb -S poolname on your existing pool. This is all good, but it doesn't work too well for planning. Is there a rule of thumb I can use for a general overview? Say I want 125TB space and I want to dedup that for backup use. Dedup will probably be quite efficient, as long as the alignment matches. By the way, is there a way to auto-align data for dedup in the backup case? Or does zfs do this by itself? Best regards roy -- Roy Sigurd Karlsbakk (+47) 97542685 r...@karlsbakk.net http://blogg.karlsbakk.net/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Question about large pools
Hi all From http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide I read: "Avoid creating a RAIDZ, RAIDZ-2, RAIDZ-3, or a mirrored configuration with one logical device of 40+ devices. See the sections below for examples of redundant configurations." What do they mean by this? 40+ devices in a single raidz[123] set, or 40+ devices in a pool regardless of raidz[123] sets? Best regards roy ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
The only way to guarantee consistency in the snapshot is to always (regardless of ZIL enabled/disabled) give priority for sync writes to get into the TXG before async writes. If the OS does give priority to sync writes going into TXGs before async writes (even with ZIL disabled), then after a spontaneous ungraceful reboot, the latest uberblock is guaranteed to be consistent. This is what Jeff Bonwick says in the "zil synchronicity" ARC case: "What I mean is that the barrier semantic is implicit even with no ZIL at all. In ZFS, if event A happens before event B, and you lose power, then what you'll see on disk is either nothing, A, or both A and B. Never just B. It is impossible for us not to have at least barrier semantics." So there's no chance that a *later* async write will overtake an earlier sync *or* async write. Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Question about large pools
On 02/04/2010 05:45, Roy Sigurd Karlsbakk wrote: Hi all From http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide I read: "Avoid creating a RAIDZ, RAIDZ-2, RAIDZ-3, or a mirrored configuration with one logical device of 40+ devices. See the sections below for examples of redundant configurations." What do they mean by this? 40+ devices in a single raidz[123] set, or 40+ devices in a pool regardless of raidz[123] sets? It means: try to avoid a single RAID-Z group with 40+ disk drives. Creating several smaller groups in one pool is perfectly fine. So, for example, on x4540 servers try to avoid creating a pool with a single RAID-Z3 group made of 44 disks; rather, create 4 RAID-Z2 groups, each made of 11 disks, all of them in a single pool. -- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
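For concreteness, a pool built the way Robert describes could be created in one command. A sketch with hypothetical controller/target device names (adjust to whatever your enclosure presents):

# Four 11-disk RAID-Z2 groups in a single pool:
zpool create tank \
  raidz2 c0t0d0 c0t1d0 c0t2d0 c0t3d0 c0t4d0 c0t5d0 c0t6d0 c0t7d0 c0t8d0 c0t9d0 c0t10d0 \
  raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 c1t6d0 c1t7d0 c1t8d0 c1t9d0 c1t10d0 \
  raidz2 c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0 c2t6d0 c2t7d0 c2t8d0 c2t9d0 c2t10d0 \
  raidz2 c3t0d0 c3t1d0 c3t2d0 c3t3d0 c3t4d0 c3t5d0 c3t6d0 c3t7d0 c3t8d0 c3t9d0 c3t10d0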
Re: [zfs-discuss] bit-flipping in RAM...
Haven't the ZFS data-corruption researchers been in touch with Jeff Bonwick and the ZFS team? -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] is this pool recoverable?
Patrick, I'm happy that you were able to recover your pool. Your original zpool status says that this pool was last accessed on another system, which I believe is what caused the pool to fail, particularly if it was accessed simultaneously from two systems. It is important that the cause of the original pool failure is identified, to prevent it from happening again. This rewind pool recovery is a last-ditch effort and might not recover all broken pools. Thanks, Cindy On 04/02/10 12:32, Patrick Tiquet wrote: Thanks, that worked!! It needed -Ff The pool has been recovered with minimal loss in data. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On 04/02/10 08:24, Edward Ned Harvey wrote: The purpose of the ZIL is to act like a fast log for synchronous writes. It allows the system to quickly confirm a synchronous write request with the minimum amount of work. Bob and Casper and some others clearly know a lot here. But I'm hearing conflicting information, and don't know what to believe. Does anyone here work on ZFS as an actual ZFS developer for Sun/Oracle? Can anyone claim "I can answer this question; I wrote that code, or at least have read it"?

I'm one of the ZFS developers. I wrote most of the zil code. Still, I don't have all the answers. There are a lot of knowledgeable people on this alias. I usually monitor this alias and sometimes chime in when there's some misinformation being spread, but sometimes the volume is just too high. Since I started this reply there have been 20 new posts on this thread alone! Questions to answer would be:

Is a ZIL log device used only by sync() and fsync() system calls?

- The intent log (separate device(s) or not) is only used by fsync, O_DSYNC, O_SYNC, O_RSYNC. NFS commits are seen by ZFS as fsyncs. Note that sync(1M) and sync(2) do not use the intent log; they force transaction group (txg) commits on all pools. So zfs goes beyond the requirement for sync(), which only requires that the writing be scheduled, but not necessarily completed, before returning. The zfs interpretation is rather expensive, but the minimal one seemed broken, so we fixed it.

Is it ever used to accelerate async writes?

- The zil is not used to accelerate async writes.

Suppose there is an application which sometimes does sync writes, and sometimes async writes. In fact, to make it easier, suppose two processes open two files, one of which always writes asynchronously, and one of which always writes synchronously. Suppose the ZIL is disabled. Is it possible for writes to be committed to disk out of order? Meaning, can a large block async write be put into a TXG and committed to disk before a small sync write to a different file is committed to disk, even though the small sync write was issued by the application before the large async write? Remember, the point is: ZIL is disabled. Question is whether the async could possibly be committed to disk before the sync.

- Threads can be pre-empted in the OS at any time. So even though thread A issued W1 before thread B issued W2, the order is not guaranteed to arrive at ZFS as W1, W2. Multi-threaded applications have to handle this. If this was a single thread issuing W1 then W2, then yes, the order is guaranteed regardless of whether W1 or W2 are synchronous or asynchronous. Of course, if the system crashes then the async operations might not be there.

I make the assumption that an uberblock is the term for a TXG after it is committed to disk. Correct?

- Kind of. The uberblock contains the root of the txg.

At boot time, or zpool import time, what is taken to be the current filesystem? The latest uberblock? Something else?

- A txg is for the whole pool, which can contain many filesystems. The latest txg defines the current state of the pool and of each individual fs.

My understanding is that enabling a dedicated ZIL device guarantees sync() and fsync() system calls block until the write has been committed to nonvolatile storage, and attempts to accelerate by using a physical device which is faster or more idle than the main storage pool.

- Correct (except replace sync() with O_DSYNC, etc). This also assumes hardware that, for example, correctly handles the flushing of its caches.
My understanding is that this provides two implicit guarantees: (1) sync writes are always guaranteed to be committed to disk in order, relative to other sync writes. (2) In the event of an OS halt or ungraceful shutdown, sync writes committed to disk are guaranteed to be equal or greater than the async writes that were taking place at the same time. That is, if two processes both complete a write operation at the same time, one in sync mode and the other in async mode, then it is guaranteed the data on disk will never have the async data committed before the sync data.

- The ZIL doesn't make such guarantees. It's the DMU that handles transactions and their grouping into txgs. It ensures that writes are committed in order by its transactional nature. The function of the zil is merely to ensure that synchronous operations are stable and replayed after a crash/power failure onto the latest txg.

Based on this understanding, if you disable ZIL, then there is no guarantee about the order of writes being committed to disk. Neither of the above guarantees is valid anymore. Sync writes may be completed out of order. Async writes that supposedly happened after sync writes may be committed to disk before the sync writes.

- No, disabling the ZIL does not disable the DMU.

Somebody (Casper?) said it before, and now I'm starting to realize ... This is also true of the snapshots. If you
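One way to see for yourself which processes actually trigger intent-log commits is to watch zil_commit() with DTrace. A rough sketch; the function name comes from the OpenSolaris ZFS source and fbt probes are not a stable interface, so verify it exists on your build (dtrace -l -n 'fbt::zil_commit:entry') first:

# Count ZIL commits per process; press ^C to print the tally:
dtrace -n 'fbt::zil_commit:entry { @[execname] = count(); }'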
[zfs-discuss] To slice, or not to slice
Momentarily, I will begin scouring the omniscient interweb for information, but I'd like to know a little bit of what people would say here. The question is to slice, or not to slice, disks before using them in a zpool. One reason to slice comes from recent personal experience. One disk of a mirror dies. Replaced under contract with an identical disk. Same model number, same firmware. Yet when it's plugged into the system, for an unknown reason, it appears 0.001 Gb smaller than the old disk, and therefore unable to attach and un-degrade the mirror. It seems logical this problem could have been avoided if the device added to the pool originally had been a slice somewhat smaller than the whole physical device. Say, a slice of 28G out of the 29G physical disk. Because later when I get the infinitesimally smaller disk, I can always slice 28G out of it to use as the mirror device. There is some question about performance. Is there any additional overhead caused by using a slice instead of the whole physical device? There is another question about performance. One of my colleagues said he saw some literature on the internet somewhere, saying ZFS behaves differently for slices than it does on physical devices, because it doesn't assume it has exclusive access to that physical device, and therefore caches or buffers differently ... or something like that. Any other pros/cons people can think of? And finally, if anyone has experience doing this, and process recommendations? That is ... My next task is to go read documentation again, to refresh my memory from years ago, about the difference between format, partition, label, fdisk, because those terms don't have the same meaning that they do in other OSes... And I don't know clearly right now, which one(s) I want to do, in order to create the large slice of my disks. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] To slice, or not to slice
One reason to slice comes from recent personal experience. One disk of a mirror dies. Replaced under contract with an identical disk. Same model number, same firmware. Yet when it's plugged into the system, for an unknown reason, it appears 0.001 Gb smaller than the old disk, and therefore unable to attach and un-degrade the mirror. It seems logical this problem could have been avoided if the device added to the pool originally had been a slice somewhat smaller than the whole physical device. Say, a slice of 28G out of the 29G physical disk. Because later when I get the infinitesimally smaller disk, I can always slice 28G out of it to use as the mirror device. What build were you running? That should have been addressed by CR 6844090, which went into build 117. I'm running solaris, but that's irrelevant. The storagetek array controller itself reports the new disk as infinitesimally smaller than the one which I want to mirror. Even before the drive is given to the OS, that's the way it is. Sun X4275 server. BTW, I'm still degraded. Haven't found an answer yet, and am considering breaking all my mirrors, to create a new pool on the freed disks, and using partitions on those disks, for the sake of rebuilding my pool using partitions on all disks. The aforementioned performance problem is not as scary to me as running in degraded redundancy. It's well documented: ZFS won't attempt to enable the drive's cache unless it has the whole physical device. See http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#Storage_Pools Nice. Thank you. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] To slice, or not to slice
- Edward Ned Harvey solar...@nedharvey.com skrev: What build were you running? That should have been addressed by CR 6844090, which went into build 117. I'm running solaris, but that's irrelevant. The storagetek array controller itself reports the new disk as infinitesimally smaller than the one which I want to mirror. Even before the drive is given to the OS, that's the way it is. Sun X4275 server. BTW, I'm still degraded. Haven't found an answer yet, and am considering breaking all my mirrors, to create a new pool on the freed disks, and using partitions on those disks, for the sake of rebuilding my pool using partitions on all disks. The aforementioned performance problem is not as scary to me as running in degraded redundancy. I would return the drive to get a bigger one before doing something as drastic as that. There might have been a hiccup in the production line, and that's not your fault. roy ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] To slice, or not to slice
And finally, if anyone has experience doing this, any process recommendations? That is, my next task is to go read documentation again, to refresh my memory from years ago, about the difference between format, partition, label, fdisk, because those terms don't have the same meaning that they do in other OSes. And I don't know clearly right now which one(s) I want to use in order to create the large slice on my disks. The whole partition vs. slice thing is a bit fuzzy to me, so take this with a grain of salt. You can create partitions using fdisk, or slices using format. The BIOS and other operating systems (windows, linux, etc) will be able to recognize partitions, while they won't be able to make sense of slices. If you need to boot from the drive or share it with another OS, then partitions are the way to go. If it's exclusive to solaris, then you can use slices. You can (but shouldn't) use slices and partitions from the same device (eg: c5t0d0s0 and c5t0d0p0). Oh, I managed to find a really good answer to this question. Several sources all say to do precisely the same procedure, and when I did it on a test system, it worked perfectly. Simple and easy to repeat. So I think this is the gospel method to create the slices, if you're going to create slices: http://docs.sun.com/app/docs/doc/806-4073/6jd67r9hu and http://www.solarisinternals.com/wiki/index.php/ZFS_Troubleshooting_Guide#Replacing.2FRelabeling_the_Root_Pool_Disk ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
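A compact version of the label-copying trick those documents describe: once one disk carries the slice layout you want, you can stamp the same VTOC onto its partner instead of walking the format menus again. Hypothetical device names; both disks must carry SMI labels:

# Copy the slice table from the first mirror half to the second:
prtvtoc /dev/rdsk/c1t0d0s2 | fmthard -s - /dev/rdsk/c1t1d0s2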
Re: [zfs-discuss] To slice, or not to slice
Oh, I managed to find a really good answer to this question. Several sources all say to do precisely the same procedure, and when I did it on a test system, it worked perfectly. Simple and easy to repeat. So I think this is the gospel method to create the slices, if you're going to create Seems like a clumsy workaround for a hardware problem. It will also disable the drives' cache, which is not a good idea. Why not just get a new drive? Best regards roy -- Roy Sigurd Karlsbakk (+47) 97542685 r...@karlsbakk.net http://blogg.karlsbakk.net/ -- In all pedagogy it is essential that the curriculum be presented intelligibly. It is an elementary imperative for all pedagogues to avoid excessive use of idioms of foreign origin. In most cases, adequate and relevant synonyms exist in Norwegian. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] To slice, or not to slice
On Apr 2, 2010, at 2:29 PM, Edward Ned Harvey wrote: I've also heard that the risk for unexpected failure of your pool is higher if/when you reach 100% capacity. I've heard that you should always create a small ZFS filesystem within a pool, and give it some reserved space, along with the filesystem that you actually plan to use in your pool. Anyone care to offer any comments on that? Define failure in this context? I am not aware of a data loss failure when near full. However, all file systems will experience performance degradation for write operations as they become full. To tell the truth, I'm not exactly sure. Because I've never lost any ZFS pool or filesystem. I only have it deployed on 3 servers, and only one of those gets heavy use. It only filled up once, and it didn't have any problem. So I'm only trying to understand the great beyond, that which I have never known myself. Learn from other peoples' experience, preventively. Yes, I do embrace a lot of voodoo and superstition in doing sysadmin, but that's just cuz stuff ain't perfect, and I've seen so many things happen that were supposedly not possible. (Not talking about ZFS in that regard... yet.) Well, unless you count the issue I'm having right now, with two identical disks appearing as different sizes... But I don't think that's a zfs problem. I recall some discussion either here or on opensolaris-discuss or opensolaris-help, where at least one or a few people said they had some sort of problem or problems, and they were suspicious about the correlation between it happening, and the disk being full. I also recall talking to some random guy at a conference who said something similar. But it's all vague. I really don't know. And I have nothing concrete. Hence the post asking for peoples' comments. Somebody might relate something they experienced less vague than what I know. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
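The reserved-filesystem trick mentioned above is cheap to set up: keep a few gigabytes that the main datasets cannot consume, so you can free space instantly if the pool ever hits 100%. A minimal sketch ('tank' and the size are placeholders):

# Reserve 5G of slack that the rest of the pool can't touch:
zfs create -o reservation=5G tank/slack
# If the pool ever fills up, hand the slack back:
zfs set reservation=none tank/slack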
Re: [zfs-discuss] To slice, or not to slice
I would return the drive to get a bigger one before doing something as drastic as that. There might have been a hiccup in the production line, and that's not your fault. Yeah, but I already have 2 of the replacement disks, both doing the same thing. One has firmware newer than my old disk (so originally I thought that was the cause, and requested another replacement disk). But then we got a replacement disk which is identical in every way to the failed disk ... but it still appears smaller for some reason. So this happened on my SSD. What's to prevent it from happening on one of the spindle disks in the future? Nothing that I know of ... So far, the idea of slicing seems to be the only preventive or corrective measure. Hence, wondering what pros/cons people would describe, beyond what I've already thought up myself. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] is this pool recoverable?
Your original zpool status says that this pool was last accessed on another system, which I believe is what caused the pool to fail, particularly if it was accessed simultaneously from two systems. The message "last accessed on another system" is the normal behavior if the pool is ungracefully offlined for some reason, and then you boot back up again on the same system. I learned that by using a pool on an external disk, and accidentally knocking out the power cord of the external disk. The system hung. I power cycled, couldn't boot normally. Had to boot failsafe, and got the above message while trying to import. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
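For reference, the recovery path Patrick used earlier in the thread boils down to a forced import, optionally with rewind recovery ('tank' is a placeholder pool name; -F discards the last few transactions, so treat it as a last resort):

# Force-import a pool marked as last accessed by another system:
zpool import -f tank
# If that fails with corruption errors, try rewind recovery:
zpool import -F -f tank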
Re: [zfs-discuss] To slice, or not to slice
On Sat, 3 Apr 2010, Edward Ned Harvey wrote: I would return the drive to get a bigger one before doing something as drastic as that. There might have been a hiccup in the production line, and that's not your fault. Yeah, but I already have 2 of the replacement disks, both doing the same thing. One has firmware newer than my old disk (so originally I thought that was the cause, and requested another replacement disk). But then we got a replacement disk which is identical in every way to the failed disk ... but it still appears smaller for some reason. So this happened on my SSD. What's to prevent it from happening on one of the spindle disks in the future? Nothing that I know of ... Just keep in mind that this has been fixed in OpenSolaris for some time, and will surely be fixed in Solaris 10, if not already. The annoying issue is that you probably need to add all of the vdev devices using an OS which already has the fix. I don't know if it can repair a slightly overly-large device. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] To slice, or not to slice
On Fri, Apr 2, 2010 at 4:05 PM, Edward Ned Harvey guacam...@nedharvey.com wrote: [snip] Your experience is exactly why I suggested ZFS start doing some right sizing if you will. Chop off a bit from the end of any disk so that we're guaranteed to be able to replace drives from different manufacturers. The excuse being "no reason to, Sun drives are always of identical size". If your drives did indeed come from Sun, their response is clearly not true. Regardless, I guess I still think it should be done. Figure out the greatest variation we've seen from drives that are supposedly of the exact same size, and chop it off the end of every disk. I'm betting it's no more than 1GB, and probably less than that. When we're talking about a 2TB drive, I'm willing to give up a gig to be guaranteed I won't have any issues when it comes time to swap it out. --Tim ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] L2ARC Workingset Size
On 02 April, 2010 - Abdullah Al-Dahlawi sent me these 128K bytes: Hi all I ran a workload that reads/writes within 10 files; each file is 256M, i.e. (10 * 256M = 2.5GB total Dataset Size). I have set the ARC max size to 1 GB in the /etc/system file. In the worst case, let us assume that the whole dataset is hot, meaning my workingset size = 2.5GB. My SSD flash size = 8GB and is being used for L2ARC. No slog is used in the pool. My File system record size = 8K, meaning 2.5% of 8GB is used for the L2ARC directory in ARC, which ultimately means that available ARC is 1024M - 204.8M = 819.2M Available ARC (Am I Right?)

Seems about right.

Now the Question ... After running the workload for 75 minutes, I have noticed that the L2ARC device has grown to 6 GB !!!

No, 6GB of the area has been touched by Copy on Write; not all of it is in use anymore though.

What is in L2ARC beyond my 2.5GB Workingset ?? something else has been added to L2ARC

[ snip lots of data ] This is your last one:

module: zfs    instance: 0
name:   arcstats    class: misc
        c                1073741824
        c_max            1073741824
        c_min            134217728
        [...]
        l2_size          2632226304
        l2_write_bytes   6486009344
        [...]
        p                775528448

Roughly 6GB has been written to the device, and slightly less than 2.5GB is actually in use.

/Tomas -- Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/ |- Student at Computing Science, University of Umeå `- Sysadmin at {cs,acc}.umu.se ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
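To watch those two counters without dumping the whole arcstats block, kstat can be asked for individual statistics directly (standard kstat(1M) syntax; the instance number may differ on your system):

# Print just the L2ARC size and cumulative bytes written:
kstat -p zfs:0:arcstats:l2_size zfs:0:arcstats:l2_write_bytes
# Repeat every 10 seconds, with timestamps, to watch growth:
kstat -p -T d zfs:0:arcstats:l2_size 10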
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
Hi Al, Have you tried the DDRdrive from Christopher George cgeo...@ddrdrive.com? Looks to me like a much better fit for your application than the F20? It would not hurt to check it out. Looks to me like you need a product with low *latency* - and a RAM based cache would be a much better performer than any solution based solely on flash. Let us know (on the list) how this works out for you. Well, I did look at it, but at that time there was no Solaris support yet. Right now it seems there is only a beta driver? I kind of remember that if you'd want reliable fallback to nvram, you'd need a UPS feeding the card. I could be very wrong there, but the product documentation isn't very clear on this (at least to me ;) ) Also, we'd kind of like to have a SnOracle supported option. But yeah, on paper it does seem it could be an attractive solution... With kind regards, Jeroen -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] L2ARC Workingset Size
Hi Tomas Thanks for the clarification. If I understood you right, you mean that 6 GB (including my 2.5GB of files) has been written to the device and still occupies space on the device !!! This is fair enough for this case, since most of my files ended up in L2ARC. Great ... But this brings two related questions. 1. What is really in L2ARC ... is it my old workingset data files that have been updated but are still in L2ARC? Or something else? Metadata? 2. More importantly, what if my workingset were larger than 2.5GB (say 5GB)? I guess my L2ARC device would be filled completely before all my workingset transfers to the L2ARC device !!! Thanks On Sat, Apr 3, 2010 at 4:31 PM, Tomas Ögren st...@acc.umu.se wrote: [snip] Roughly 6GB has been written to the device, and slightly less than 2.5GB is actually in use. /Tomas -- Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/ |- Student at Computing Science, University of Umeå `- Sysadmin at {cs,acc}.umu.se -- Abdullah Al-Dahlawi PhD Candidate George Washington University Department of Electrical & Computer Engineering Check The Fastest 500 Super Computers Worldwide http://www.top500.org/list/2009/11/100 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
Well, I did look at it, but at that time there was no Solaris support yet. Right now it seems there is only a beta driver? Correct, we just completed functional validation of the OpenSolaris driver. Our focus has now turned to performance tuning and benchmarking. We expect to formally introduce the DDRdrive X1 to the ZFS community later this quarter. It is our goal to focus exclusively on the dedicated ZIL device market going forward. I kind of remember that if you'd want reliable fallback to nvram, you'd need a UPS feeding the card. Currently, a dedicated external UPS is required for correct operation. Based on community feedback, we will be offering automatic backup/restore prior to release. This guarantees the UPS will only be required for 60 secs to successfully back up the drive contents on a host power or hardware failure. Dutifully, on the next reboot, the restore will occur prior to the OS loading, for seamless non-volatile operation. Also, we have heard loud and clear the requests for an internal power option. It is our intention that the X1 will be the first in a family of products all dedicated to ZIL acceleration, for not only OpenSolaris but also Solaris 10 and FreeBSD. Also, we'd kind of like to have a SnOracle supported option. Although a much smaller company, we believe our singular focus and absolute passion for ZFS and the potential of Hybrid Storage Pools will serve our customers well. We are actively designing our soon-to-be-available support plans. Your voice will be heard; please email directly at cgeorge at ddrdrive dot com for requests, comments and/or questions. Thanks, Christopher George Founder/CTO www.ddrdrive.com -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] To slice, or not to slice
On 03/04/2010 19:24, Tim Cook wrote: [snip] Your experience is exactly why I suggested ZFS start doing some right sizing if you will. Chop off a bit from the end of any disk so that we're guaranteed to be able to replace drives from different manufacturers. [snip] When we're talking about a 2TB drive, I'm willing to give up a gig to be guaranteed I won't have any issues when it comes time to swap it out. That's what OpenSolaris has been doing, more or less, for some time now. Look in the archives of this mailing list for more information. -- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On 1 apr 2010, at 06.15, Stuart Anderson wrote: Assuming you are also using a PCI LSI HBA from Sun that is managed with a utility called /opt/StorMan/arcconf and reports itself as the amazingly informative model number Sun STK RAID INT what worked for me was to run, arcconf delete (to delete the pre-configured volume shipped on the drive) arcconf create (to create a new volume) Just to sort things out (or not? :-): I more than agree that this product is highly confusing, but I don't think there is anything LSI in or about that card. I believe it is an Adaptec card, developed, manufactured and supported by Intel for Adaptec, licensed (or something) to StorageTek, and later included in Sun machines (since Sun bought StorageTek, I suppose). Now we could add Oracle to this name dropping inferno, if we would want to. I am not sure why they (Sun) put those in there, they don't seem very fast or smart or anything. /ragge ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
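For anyone who lands here with the same controller, the sequence Stuart describes looks roughly like the sketch below. The controller number, logical-drive numbers, and the exact create syntax vary by arcconf version, so treat this as an outline and check arcconf's built-in help before running anything destructive:

# List the controller's logical drives first:
/opt/StorMan/arcconf getconfig 1 ld
# Delete the pre-configured volume shipped on the drive:
/opt/StorMan/arcconf delete 1 logicaldrive 0
# Re-create a simple single-disk volume (verify the channel/device
# numbers and syntax against your arcconf version's help output):
/opt/StorMan/arcconf create 1 logicaldrive max volume 0 4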
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On 2 apr 2010, at 22.47, Neil Perrin wrote: [snip] Threads can be pre-empted in the OS at any time. So even though thread A issued W1 before thread B issued W2, the order is not guaranteed to arrive at ZFS as W1, W2. Multi-threaded applications have to handle this. If this was a single thread issuing W1 then W2, then yes, the order is guaranteed regardless of whether W1 or W2 are synchronous or asynchronous. Of course, if the system crashes then the async operations might not be there. Could you please clarify this last paragraph a little: Do you mean that this is in the case that you have the ZIL enabled and the txg for W1 and W2 hasn't been committed, so that upon reboot the ZIL is replayed, and therefore only the sync writes are eventually there? If, let's say, W1 is an async small write, W2 is a sync small write, W1 arrives at zfs before W2, and W2 arrives before the txg is committed, will both writes always be in the txg on disk? If so, it would mean that zfs itself never buffers up async writes into larger blobs to write at a later txg, correct? I take it that ZIL enabled or not does not make any difference here (we pretend the system did _not_ crash), correct? Thanks! /ragge ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] To slice, or not to slice
On Sat, Apr 3, 2010 at 6:53 PM, Robert Milkowski mi...@task.gda.pl wrote: [snip] That's what OpenSolaris has been doing, more or less, for some time now. Look in the archives of this mailing list for more information. Since when? It isn't doing it on any of my drives, build 134, and judging by the OP's issues, it isn't doing it for him either... I try to follow this list fairly closely, and I've never seen anyone at Sun/Oracle say they were going to start doing it after I was shot down the first time. --Tim ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] To slice, or not to slice
On Sat, Apr 3, 2010 at 7:50 PM, Tim Cook t...@cook.ms wrote: [snip] Since when? It isn't doing it on any of my drives, build 134, and judging by the OP's issues, it isn't doing it for him either... I try to follow this list fairly closely, and I've never seen anyone at Sun/Oracle say they were going to start doing it after I was shot down the first time. Oh... and after 15 minutes of searching for everything from 'right-sizing' to 'block reservation' to 'replacement disk smaller size fewer blocks' etc. etc., I don't see a single thread on it.
--Tim ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Problems with zfs and a STK RAID INT SAS HBA
Hello, Maybe this question should be put on another list, but since there are a lot of people here using all kinds of HBAs, this could be right anyway; I have a X4150 running snv_134. It was shipped with a STK RAID INT adaptec/intel/storagetek/sun SAS HBA. When running the card in copyback write cache mode, I got horrible performance (with zfs), much worse than with copyback disabled (which I believe should mean it does write-through), when tested with filebench. This could actually be expected, depending on how good or bad the card is, but I am still not sure about what to expect. It logs some errors, as shown with fmdump -e(V). It is most often a pci bridge error (I think), about five to ten times an hour, and occasionally a problem with accessing a mode page on the disks for enabling/disabling the write cache, one error for each disk, about every three hours. I don't believe the two have to be related. I am not sure if the PCI-PCI bridge is on the RAID board itself or in the host. I haven't seen this problem on other more or less identical machines running sol10. Is this a known software problem, or do I have faulty hardware? Thanks! /ragge --

% fmdump -e
...
Apr 04 01:21:53.2244 ereport.io.pci.fabric
Apr 04 01:30:00.6999 ereport.io.pci.fabric
Apr 04 01:30:23.4647 ereport.io.scsi.cmd.disk.dev.uderr
Apr 04 01:30:23.4651 ereport.io.scsi.cmd.disk.dev.uderr
...

% fmdump -eV
Apr 04 2010 01:21:53.224492765 ereport.io.pci.fabric
nvlist version: 0
        class = ereport.io.pci.fabric
        ena = 0xd6a00a43be800c01
        detector = (embedded nvlist)
                nvlist version: 0
                version = 0x0
                scheme = dev
                device-path = /p...@0,0/pci8086,2...@4
        (end detector)
        bdf = 0x20
        device_id = 0x25f8
        vendor_id = 0x8086
        rev_id = 0xb1
        dev_type = 0x40
        pcie_off = 0x6c
        pcix_off = 0x0
        aer_off = 0x100
        ecc_ver = 0x0
        pci_status = 0x10
        pci_command = 0x147
        pci_bdg_sec_status = 0x0
        pci_bdg_ctrl = 0x3
        pcie_status = 0x0
        pcie_command = 0x2027
        pcie_dev_cap = 0xfc1
        pcie_adv_ctl = 0x0
        pcie_ue_status = 0x0
        pcie_ue_mask = 0x10
        pcie_ue_sev = 0x62031
        pcie_ue_hdr0 = 0x0
        pcie_ue_hdr1 = 0x0
        pcie_ue_hdr2 = 0x0
        pcie_ue_hdr3 = 0x0
        pcie_ce_status = 0x0
        pcie_ce_mask = 0x0
        pcie_rp_status = 0x0
        pcie_rp_control = 0x7
        pcie_adv_rp_status = 0x0
        pcie_adv_rp_command = 0x7
        pcie_adv_rp_ce_src_id = 0x0
        pcie_adv_rp_ue_src_id = 0x0
        remainder = 0x0
        severity = 0x1
        __ttl = 0x1
        __tod = 0x4bb7cd91 0xd617cdd
...
Apr 04 2010 01:30:23.464768275 ereport.io.scsi.cmd.disk.dev.uderr
nvlist version: 0
        class = ereport.io.scsi.cmd.disk.dev.uderr
        ena = 0xde0cd54f84201c01
        detector = (embedded nvlist)
                nvlist version: 0
                version = 0x0
                scheme = dev
                device-path = /p...@0,0/pci8086,2...@4/pci108e,2...@0/d...@5,0
                devid = id1,s...@tsun_stk_raid_intea4b6f24
        (end detector)
        driver-assessment = fail
        op-code = 0x1a
        cdb = 0x1a 0x0 0x8 0x0 0x18 0x0
        pkt-reason = 0x0
        pkt-state = 0x1f
        pkt-stats = 0x0
        stat-code = 0x0
        un-decode-info = sd_get_write_cache_enabled: Mode Sense caching page code mismatch 0
        un-decode-value =
        __ttl = 0x1
        __tod = 0x4bb7cf8f 0x1bb3cd13
...
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
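A quick way to get a feel for how often each error class is firing is to count them straight off fmdump's two-column output (plain shell; adjust the field if your fmdump prints extra columns):

# Count events per ereport class:
fmdump -e | awk 'NR > 1 { n[$NF]++ } END { for (c in n) print n[c], c }' | sort -rn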
Re: [zfs-discuss] To slice, or not to slice
On Apr 3, 2010, at 5:56 PM, Tim Cook wrote: On Sat, Apr 3, 2010 at 7:50 PM, Tim Cook t...@cook.ms wrote: Your experience is exactly why I suggested ZFS start doing some right sizing if you will. Chop off a bit from the end of any disk so that we're guaranteed to be able to replace drives from different manufacturers. The excuse being no reason to, Sun drives are always of identical size. If your drives did indeed come from Sun, their response is clearly not true. Regardless, I guess I still think it should be done. Figure out what the greatest variation we've seen from drives that are supposedly of the exact same size, and chop it off the end of every disk. I'm betting it's no more than 1GB, and probably less than that. When we're talking about a 2TB drive, I'm willing to give up a gig to be guaranteed I won't have any issues when it comes time to swap it out. that's what open solaris is doing more or less for some time now. look in the archives of this mailing list for more information. -- Robert Milkowski http://milek.blogspot.com Since when? It isn't doing it on any of my drives, build 134, and judging by the OP's issues, it isn't doing it for him either... I try to follow this list fairly closely and I've never seen anyone at Sun/Oracle say they were going to start doing it after I was shot down the first time. --Tim Oh... and after 15 minutes of searching for everything from 'right-sizing' to 'block reservation' to 'replacement disk smaller size fewer blocks' etc. etc. I don't see a single thread on it. CR 6844090, zfs should be able to mirror to a smaller disk http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6844090 b117, June 2009 -- richard ZFS storage and performance consulting at http://www.RichardElling.com ZFS training on deduplication, NexentaStor, and NAS performance Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] To slice, or not to slice
On Apr 2, 2010, at 2:05 PM, Edward Ned Harvey wrote: Momentarily, I will begin scouring the omniscient interweb for information, but I’d like to know a little bit of what people would say here. The question is to slice, or not to slice, disks before using them in a zpool. One reason to slice comes from recent personal experience. One disk of a mirror dies. Replaced under contract with an identical disk. Same model number, same firmware. Yet when it’s plugged into the system, for an unknown reason, it appears 0.001 Gb smaller than the old disk, and therefore unable to attach and un-degrade the mirror. It seems logical this problem could have been avoided if the device added to the pool originally had been a slice somewhat smaller than the whole physical device. Say, a slice of 28G out of the 29G physical disk. Because later when I get the infinitesimally smaller disk, I can always slice 28G out of it to use as the mirror device. If the HBA is configured for RAID mode, then it will reserve some space on disk for its metadata. This occurs no matter what type of disk you attach. There is some question about performance. Is there any additional overhead caused by using a slice instead of the whole physical device? No. There is another question about performance. One of my colleagues said he saw some literature on the internet somewhere, saying ZFS behaves differently for slices than it does on physical devices, because it doesn’t assume it has exclusive access to that physical device, and therefore caches or buffers differently … or something like that. Any other pros/cons people can think of? If the disk is only used for ZFS, then it is ok to enable volatile disk write caching if the disk also supports write cache flush requests. If the disk is shared with UFS, then it is not ok to enable volatile disk write caching. -- richard And finally, if anyone has experience doing this, and process recommendations? That is … My next task is to go read documentation again, to refresh my memory from years ago, about the difference between “format,” “partition,” “label,” “fdisk,” because those terms don’t have the same meaning that they do in other OSes… And I don’t know clearly right now, which one(s) I want to do, in order to create the large slice of my disks. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ZFS storage and performance consulting at http://www.RichardElling.com ZFS training on deduplication, NexentaStor, and NAS performance Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
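Whether the volatile write cache Richard mentions actually got enabled can be checked per disk from format's expert mode. A sketch of the interactive session, from memory of format(1M); the cache menu is only offered for SCSI/SAS disks, so verify against your system:

# format -e            (pick the disk from the menu, then:)
#   format> cache
#   cache> write_cache
#   write_cache> display   (reports current state; 'enable' turns it on)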
Re: [zfs-discuss] To slice, or not to slice
On Sat, Apr 3, 2010 at 9:52 PM, Richard Elling richard.ell...@gmail.com wrote: [snip] CR 6844090, zfs should be able to mirror to a smaller disk http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6844090 b117, June 2009 -- richard Unless the bug description is incomplete, that's talking about adding a mirror to an existing drive. Not about replacing a failed drive in an existing vdev that could be raid-z#. I'm almost positive I had an issue post-b117 with replacing a failed drive in a raid-z2 vdev. I'll have to see if I can dig up a system to test the theory on. --Tim ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] dedup and memory/l2arc requirements
On Apr 1, 2010, at 9:34 PM, Roy Sigurd Karlsbakk wrote: You can estimate the amount of disk space needed for the deduplication table and the expected deduplication ratio by using zdb -S poolname on your existing pool. This is all good, but it doesn't work too well for planning. Is there a rule of thumb I can use for a general overview? If you know the average record size for your workload, then you can calculate the average number of records when given the total space. This should get you in the ballpark. Say I want 125TB space and I want to dedup that for backup use. Dedup will probably be quite efficient, as long as the alignment matches. By the way, is there a way to auto-align data for dedup in the backup case? Or does zfs do this by itself? ZFS does not change alignment. -- richard ZFS storage and performance consulting at http://www.RichardElling.com ZFS training on deduplication, NexentaStor, and NAS performance Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
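To turn Richard's rule of thumb into numbers: divide the pool size by the average record size to get the count of unique blocks, then multiply by the per-entry DDT cost. The ~320 bytes per in-core entry used below is a commonly cited ballpark, not a guarantee; once a pool exists, check against what zdb -D reports for it:

# 125 TiB of 128 KiB records -> ~1.05e9 entries -> ~312 GiB of DDT:
echo '125 * 2^40 / 2^17 * 320 / 2^30' | bc
# The same data at an 8 KiB recordsize needs 16x that, roughly 5 TiB,
# which is why record size dominates DDT planning.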
Re: [zfs-discuss] To slice, or not to slice
On Apr 3, 2010, at 8:00 PM, Tim Cook wrote:

> [snip: full quote of the right-sizing exchange above]
>
> Unless the bug description is incomplete, that's talking about attaching a mirror to an existing drive, not about replacing a failed drive in an existing vdev that could be raidz-#. I'm almost positive I had an issue post-b117 with replacing a failed drive in a raidz2 vdev.

It is the same code. That said, I have experimented with various cases, and I have not found prediction of the tolerable size difference to be easy.

> I'll have to see if I can dig up a system to test the theory on.

Works fine.
# ramdiskadm -a rd1 100m
/dev/ramdisk/rd1
# ramdiskadm -a rd2 100m
/dev/ramdisk/rd2
# ramdiskadm -a rd3 100m
/dev/ramdisk/rd3
# ramdiskadm -a rd4 99900k
/dev/ramdisk/rd4
# zpool create -o cachefile=none zwimming raidz /dev/ramdisk/rd1 /dev/ramdisk/rd2 /dev/ramdisk/rd3
# zpool status zwimming
  pool: zwimming
 state: ONLINE
 scrub: none requested
config:

        NAME                  STATE     READ WRITE CKSUM
        zwimming              ONLINE       0     0     0
          raidz1-0            ONLINE       0     0     0
            /dev/ramdisk/rd1  ONLINE       0     0     0
            /dev/ramdisk/rd2  ONLINE       0     0     0
            /dev/ramdisk/rd3  ONLINE       0     0     0

errors: No known data errors
# zpool replace zwimming /dev/ramdisk/rd3 /dev/ramdisk/rd4
# zpool status zwimming
  pool: zwimming
 state: ONLINE
 scrub: resilver completed after 0h0m with 0 errors on Sat Apr  3 20:08:35 2010
config:

        NAME                  STATE     READ WRITE CKSUM
        zwimming              ONLINE       0     0     0
          raidz1-0            ONLINE       0     0     0
            /dev/ramdisk/rd1  ONLINE       0     0     0
            /dev/ramdisk/rd2  ONLINE       0     0     0
            /dev/ramdisk/rd4  ONLINE       0     0     0  45K resilvered

errors: No known data errors
 -- richard
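[If you want to see the size ZFS actually recorded for each vdev, and so how much headroom a replacement device needs, the cached pool config shows it. A quick check, with a hypothetical pool name:]

    # asize is the usable vdev size after labels and alignment
    zdb -C tank | grep asize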
Re: [zfs-discuss] L2ARC Workingset Size
On Apr 1, 2010, at 9:41 PM, Abdullah Al-Dahlawi wrote:

> Hi all. I ran a workload that reads and writes within 10 files, each file 256M, i.e. 10 * 256M = 2.5GB total dataset size. I have set the ARC max size to 1 GB in the /etc/system file. In the worst case, let us assume that the whole dataset is hot, meaning my workingset size = 2.5GB. My SSD flash size = 8GB and is being used for L2ARC; no slog is used in the pool. My file system record size = 8K, meaning 2.5% of 8GB is used for the L2ARC directory in ARC, which ultimately means the available ARC is 1024M - 204.8M = 819.2M (am I right?). This is the worst case. Now the question ... after running the workload for 75 minutes, I have noticed that the L2ARC device has grown to 6 GB!

You're not interpreting the values properly, see below.

> What is in L2ARC beyond my 2.5GB workingset? Something else has been added to L2ARC.

ZFS is COW, so modified data is written to disk and to the L2ARC.

> Here is a 5-minute interval of zpool iostat: [snip]
> Also, a full ZFS kstat for a 5-minute interval: [snip]
>
> module: zfs  instance: 0
> name: arcstats  class: misc
>   c                          1073741824
>   c_max                      1073741824

Max ARC size is limited to 1 GB.

>   c_min                      134217728
>   crtime                     28.083178473
>   data_size                  955407360
>   deleted                    966956
>   demand_data_hits           843880
>   demand_data_misses         452182
>   demand_metadata_hits       68572
>   demand_metadata_misses     5737
>   evict_skip                 82548
>   hash_chain_max             18
>   hash_chains                61732
>   hash_collisions            1444874
>   hash_elements              329553
>   hash_elements_max          329561
>   hdr_size                   46553328
>   hits                       978241
>   l2_abort_lowmem            0
>   l2_cksum_bad               0
>   l2_evict_lock_retry        0
>   l2_evict_reading           0
>   l2_feeds                   4738
>   l2_free_on_write           184
>   l2_hdr_size                17024784

The size of the L2ARC headers is approximately 17 MB.

>   l2_hits                    252839
>   l2_io_error                0
>   l2_misses                  203767
>   l2_read_bytes              2071482368
>   l2_rw_clash                13
>   l2_size                    2632226304

Currently, there is approximately 2.5 GB in the L2ARC.

>   l2_write_bytes             6486009344

The total amount of data written to the L2ARC since boot is 6+ GB.

>   l2_writes_done             4127
>   l2_writes_error            0
>   l2_writes_hdr_miss         21
>   l2_writes_sent             4127
>   memory_throttle_count      0
>   mfu_ghost_hits             120524
>   mfu_hits                   500516
>   misses                     468227
>   mru_ghost_hits             61398
>   mru_hits                   412112
>   mutex_miss                 511
>   other_size                 56325712
>   p                          775528448
>   prefetch_data_hits         50804
>   prefetch_data_misses       7819
>   prefetch_metadata_hits     14985
>   prefetch_metadata_misses   2489
>   recycle_miss               13096
>   size                       1073830768

The ARC size is 1 GB.

The best way to understand these in detail is to look at the source, which is nicely commented. The L2ARC design is described near
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/arc.c#3590
 -- richard
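[Rather than wading through the full dump, kstat can pull just the two counters that matter for the distinction Richard is drawing, current L2ARC contents versus cumulative bytes written. A minimal example:]

    # current L2ARC contents vs. total bytes ever written to it
    kstat -p zfs:0:arcstats:l2_size zfs:0:arcstats:l2_write_bytes

    # watch the cumulative counter grow at 5-second intervals
    kstat -p zfs:0:arcstats:l2_write_bytes 5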
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On Apr 3, 2010, at 5:47 PM, Ragnar Sundblad wrote:

> On 2 apr 2010, at 22.47, Neil Perrin wrote:
>
>>> Suppose there is an application which sometimes does sync writes, and sometimes async writes. In fact, to make it easier, suppose two processes open two files, one of which always writes asynchronously, and one of which always writes synchronously. Suppose the ZIL is disabled. Is it possible for writes to be committed to disk out of order? Meaning, can a large async write be put into a TXG and committed to disk before a small sync write to a different file is committed, even though the small sync write was issued by the application before the large async write? Remember, the point is: the ZIL is disabled. The question is whether the async write could possibly be committed to disk before the sync write.
>>
>> Threads can be pre-empted in the OS at any time, so even though thread A issued W1 before thread B issued W2, the order is not guaranteed to arrive at ZFS as W1, W2. Multi-threaded applications have to handle this. If a single thread issued W1 and then W2, then yes, the order is guaranteed, regardless of whether W1 or W2 are synchronous or asynchronous. Of course, if the system crashes, the async operations might not be there.
>
> Could you please clarify this last paragraph a little: do you mean that this is the case when you have the ZIL enabled and the txg for W1 and W2 hasn't been committed, so that upon reboot the ZIL is replayed, and therefore only the sync writes are eventually there?

Yes. The ZIL needs to be replayed on import after an unclean shutdown.

> If, let's say, W1 is a small async write, W2 is a small sync write, W1 arrives at ZFS before W2, and W2 arrives before the txg is committed, will both writes always be in the txg on disk?

Yes.

> If so, it would mean that ZFS itself never buffers up async writes into larger blurbs to write in a later txg, correct?

Correct.

> I take it that ZIL enabled or not does not make any difference here (pretending the system did _not_ crash), correct?

For an import following a clean shutdown, there are no transactions in the ZIL to apply. For async-only workloads, there are no transactions in the ZIL to apply. Do not assume that power outages are the only cause of unclean shutdowns.
 -- richard
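[For anyone who wants to reproduce the "ZIL disabled" scenario being discussed, a sketch of the classic approach on builds that still have the global zil_disable tunable. This pokes a live kernel variable and only takes effect for datasets mounted afterwards, so it is strictly a test-rig technique, never something for production:]

    # disable the ZIL globally, then remount the test dataset
    echo zil_disable/W0t1 | mdb -kw

    # restore it when the experiment is done
    echo zil_disable/W0t0 | mdb -kw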