Re: [zfs-discuss] Re: recovered state after system crash
kyusun Chang wrote:
> Does ZFS recover all file system transactions which it returned with
> success since the last commit of the TxG? That would imply that the ZIL
> must flush log records for each successful file system transaction before
> it returns to the caller, so that it can replay the filesystem
> transactions.

Only synchronous transactions (those forced by O_DSYNC or fsync()) are written to the intent log.

> Could you help me clarify "writing 'synchronous' transactions to the
> log"? Assume a scenario where new subdirectories D1 and D2 (a child of
> D1) have been created, then new files F1 in D1 and F2 in D2 have been
> created, and after some writes to F1 and F2, fsync(F1) was issued. Also
> assume a file F3 in another part of the file system is being modified.
> To recover F1, the creation of D1 and D2 must be recovered. It would be
> painful to find and log the relevant information at the time of fsync()
> to recover them.

The ZIL will write log records to stable storage for all directory creations and the data for F1, but not the data for F2 or F3. See the code in zil_commit_writer() for the exact details:

http://cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/zil.c#938

> It means that 1) ZFS needs to log EVERY (vs. only "synchronous") file
> system transaction in order to replay (i.e., redo onto the on-disk state
> of the last committed TxG), since one cannot predict when fsync() will be
> requested for which file -- i.e., ZFS logs them all in memory, but
> flushes only on a synchronous transaction? That also means ZFS logs user
> data for every write()? 2) If the accumulated log records up to fsync(F1)
> (since the last fsync()) are flushed to disk for replay at a subsequent
> recovery, does ZFS recover the consistent file system state at the point
> in time of the latest fsync(), including all successful file system
> transactions up to that point that have nothing to do with F1, e.g., F3,
> before the crash? Or am I missing something?
So the actual code logs everything except writes, setattr, ACLs, and truncates for *other* files. This has undergone some change over time and may continue to change.

> I presume that a flush of the log also occurs at every write() to a file
> opened with O_DSYNC. Otherwise, it should be the same as the fsync()
> case.

Correct.

> Are there any other synchronization requests that force out the
> in-memory log?

There are others: O_RSYNC, O_SYNC, sync(1M).

> As a side question, does ZFS log atime updates (and does a snapshot
> copy-on-write for them)?

I don't think atime updates are logged as transactions. Not sure about snapshots COW-ing.

> Again, thank you for your time.

So what are your concerns here? Correctness or performance?

This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
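The D1/D2/F1 scenario discussed above can be reproduced from the shell as a rough sketch. Note that fsync() itself isn't reachable portably from a shell script, so sync(1M) stands in for fsync(F1) here; it forces strictly more out to stable storage than fsync(F1) would. All paths are invented for illustration.

```shell
# Sketch of the scenario: directory creations plus writes to two files.
cd "$(mktemp -d)" || exit 1
mkdir -p D1/D2                  # directory creations: the ZIL logs these
printf 'f1 data' > D1/F1        # data for F1: forced to the log by the sync below
printf 'f2 data' > D1/D2/F2     # data for F2: would NOT be forced out by fsync(F1)
sync                            # stand-in for fsync(F1); sync(1M) also forces the log
```

On a real pool, only the directory creations and the F1 data would need to reach the intent log for fsync(F1) to return.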
[zfs-discuss] Re: Re: simple Raid-Z question
> No one has said that you can't increase the size of a zpool. What can't
> be increased is the size of a RAID-Z vdev (except by increasing the size
> of all of the components of the RAID-Z). You have created additional
> RAID-Z vdevs and added them to the pool.

If the following is nonsense, please bear with me being a newbie...

Suppose you partition each of the n disks into m equally sized partitions, make m raidz vdevs, and finally create one pool from them all. When you then wish to add another disk, you first partition it like the previous ones, then:

for i in 1:m {
    remove raidz vdev number i from the pool
    destroy raidz vdev number i
    create a raidz vdev from partition i of each of the n+1 disks
    add the new raidz vdev to the pool
}

This would demand that 1/m-th of the available disk space be free, but it would make it possible to add additional disks without backing up and restoring the entire data set. It would also make it possible to add disks of different sizes, since some of the raidz vdevs could include partitions from more disks than others.

The interesting question is: what would the performance hit be for pooling m raidz vdevs built from partitions instead of using one regular raidz? Does the driver in some way coalesce requests for the same disk, which would be defeated when multiple vdevs in the same pool come from the same disk?

Cheers
Anders
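As a sketch, the initial layout Anders describes would look something like this for n=3 disks and m=2 slices. Device and slice names are hypothetical, and the growth loop relies on raidz vdev removal, which ZFS does not support, so the whole scheme remains hypothetical:

```shell
# n=3 disks, m=2 slices each; every raidz vdev takes one slice per disk.
# Device/slice names are made up for illustration.
zpool create tank \
    raidz c1t0d0s0 c1t1d0s0 c1t2d0s0 \
    raidz c1t0d0s1 c1t1d0s1 c1t2d0s1
# The grow loop would then remove and rebuild one vdev at a time --
# but ZFS has no raidz vdev removal today, so that step cannot be done.
```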
[zfs-discuss] Re: Optimal strategy (add or replace disks) to build a cheap and raidz?
What brand is your 8-port SATA controller? I want a SATA controller too, but I've heard that Solaris is picky about the model; not all controllers work. Does yours?
[zfs-discuss] External eSata ZFS raid possible?
Is this possible? I want an external case with 4 HDs in it, each with an individual eSATA cable. I plug in this case only when needed (each HD plugs into my PC separately), do something like "zpool import" and use my ZFS raid. When done, I unplug it after "zpool export" or a similar command. This way I save energy, have less noise, etc. I want it only as a safe storage place; in normal work I don't need access to all the data, just for backup. I have a 250 GB HD as a system disk and temporary store. Would this scenario be possible? Here is the case: http://www.stardom.com.tw/sohotank%20st5610.htm
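The plug-in/unplug workflow maps directly onto zpool import/export (the pool name "backup" is made up):

```shell
# Power up the external case, then bring the pool online:
zpool import backup
# ... run the backup ...
zpool export backup    # flush and cleanly detach the pool before powering off
```

Export flushes everything and marks the pool as exported, so the disks can be unplugged safely.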
Re: [zfs-discuss] Issue with adding existing EFI disks to a zpool
Mario Goebbels wrote:
> do it". So I added the disk using the zero slice notation (c0d0s0), as
> suggested for performance reasons. I checked the pool status and noticed
> however that the pool size didn't raise.

I believe you got this wrong. You should have given ZFS the whole disk -- c0d0, not a slice. When presented a whole disk, it EFI-labels it and turns on the write cache.

-Manoj
Re: [zfs-discuss] Motley group of discs?
Lee Fyock wrote:
> least this year. I'd like to favor available space over performance, and
> be able to swap out a failed drive without losing any data.

Lee Fyock later wrote:
> In the meantime, I'd like to hang out with the system and drives I have.
> As "mike" said, my understanding is that zfs would provide error
> correction until a disc fails, if the setup is properly done. That's the
> setup for which I'm requesting a recommendation.

ZFS always lets you know if the data you are requesting has gone bad. If you have redundancy, it provides error correction as well.

> Money isn't an issue here, but neither is creating an optimal zfs
> system. I'm curious what the right zfs configuration is for the system I
> have.

You obviously have the option of making a giant pool of all the disks, and what you get is dynamic striping. But if a disk goes toast, the data on it is gone. If you plan to back up important data elsewhere and data loss is something you can live with, this might be a good choice.

The next option is to mirror (/raidz) disks. If you mirror a 200 GB disk with a 250 GB one, you will get only 200 GB of redundant storage. If a disk goes for a toss, all of your data is safe, but you lose disk space. Mirroring the 600 GB disk with a stripe of 160+200+250 would have been nice, but I believe this is not possible with ZFS (yet?).

There is a third option: create a giant pool of all the disks and set copies=2. ZFS will then create two copies of all the data blocks. That is pretty good redundancy, but depending on how full your disks are, the copies may or may not be on different disks. In other words, this does not guarantee that *all* of your data is safe if, say, your 600 GB disk dies. But it might be 'good enough'. From what I understand your requirements to be, this just might be your best choice.

A periodic scrub would also be a good thing to do. The earlier you detect a flaky disk, the better...

Hope this helps.
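The third option above is a one-liner per pool, and the periodic scrub is too (pool name "tank" is invented):

```shell
zfs set copies=2 tank    # store two copies of every data block
zpool scrub tank         # verify every checksum in the pool; run periodically
zpool status -v tank     # watch scrub progress and any errors it turns up
```

Note that copies=2 only applies to blocks written after the property is set, so it is best set before loading data.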
-Manoj
[zfs-discuss] Re: Very Large Filesystems
> What's the maximum filesystem size you've used in a production
> environment? How did the experience come out?

I have a 26 TB pool that will be upgraded to 39 TB in the next couple of months. This is the back end for backup images. The ease of managing this sort of expanding storage is a little bit of wonderful. I remember the pain of managing 10 TB of A5200s back in 2000, and this is a welcome sight.
[zfs-discuss] Re: ZFS and Oracle db production deployment
Why did you choose to deploy the database on ZFS?
- On-disk consistency was big -- one of our datacenters was having power problems and the systems would sometimes drop live. I had a couple of instances of data errors with VxVM/VxFS and we had to restore from tape.
- zfs snapshot saves us many hours with our larger databases when the DBAs need a backup for patching. Our maintenance window goes from 8 hours to 2 or 3 because we don't have to waste time waiting on I/O from an rsync or dump, or waiting for that same I/O in a restore if something goes south.
- Ease of maintenance. Our storage creation has gone from a significant part of the install process to mostly being scripted.

What features of ZFS are you using?
- quotas/reservations
- snapshots
- recordsize
- zfs send/receive
- looking forward to using ditto blocks

What tuning was done during ZFS setup?
- number of devices
- setting the ARC cache low is crucial
- moving logfiles off to UFS/directio
- datafile recordsize set to 8k
- making /backup a separate pool

How big are the databases?
- 30, 70, 100, 350 and 350 -- all growing.
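A sketch of the datafile-recordsize and separate-backup-pool settings listed above (pool, filesystem and device names are invented):

```shell
# Datafiles get an 8 KB recordsize to match Oracle's db_block_size:
zfs create -o recordsize=8k tank/oradata
# /backup lives on its own pool:
zpool create backup c3t0d0 c3t1d0
zfs set mountpoint=/backup backup
# Capping the ARC is done in /etc/system, e.g. (value is site-specific):
#   set zfs:zfs_arc_max=0x100000000
```

recordsize only affects files written after it is set, so set it before creating the datafiles.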
Re: [zfs-discuss] Issue with adding existing EFI disks to a zpool
On May 5, 2007, at 09:34, Mario Goebbels wrote:
> I spent all day yesterday evacuating my data from one of the Windows
> disks so that I could add it to the pool. Using mount-ntfs, it's a pain
> due to its slowness. But once I finished, I thought "Cool, let's do it".
> So I added the disk using the zero slice notation (c0d0s0), as suggested
> for performance reasons. I checked the pool status and noticed however
> that the pool size didn't grow.
>
> After a short panic (myself, not the kernel), I remembered that I had
> partitioned this disk as an EFI disk in Windows (mostly just because).
> c0d0s0 was the emergency, boot or whatever partition automatically
> created according to the recommended EFI partitioning scheme. So it
> added the minimal space of that partition to the pool. The real
> whole-disk partition was c0d0s1. Since there's no device removal in ZFS
> yet, I had to replace slice 0 with slice 1, since destroying the pool
> was out of the question.
>
> a) ZFS would have added EFI labels anyway. Will ZFS figure things out
> for itself, or did I lose write cache control because I didn't
> explicitly specify s0, though this is an EFI disk already?

Yes, if you add the whole device to the pool -- that is, use c0t0d0 instead of c0t0d0s0. In this case, ZFS creates a large partition on s0 starting at sector 34 and encompassing the entire disk. If you need to check the write cache, use "format -e", then cache, write_cache, display.

> b) I don't remember it mentioned anywhere in the documentation. If a) is
> indeed an issue, it should be mentioned that you have to unlabel EFI
> disks before adding.

Removing an EFI label is a little trickier. You can replace the EFI label with an SMI label if the disk is below 1 TB (format -e, then l) and then "dd if=/dev/zero of=/dev/dsk/c0t0d0s2 bs=512 count=1" to remove the SMI label. Or you could attempt to access the entire disk (c0t0d0) with dd and zero out the first 17 KB and the last 8 MB, but you'd have to get the 8 MB offset from the VTOC.
You know you've got an empty label if you get stderr entries at the top of the format output, or syslog messages along the lines of "corrupt label - bad magic number".

Jonathan
Re: [zfs-discuss] Optimal strategy (add or replace disks) to build a cheap and raidz?
Harold Ancell wrote:
> At 04:41 AM 5/5/2007, Christian Rost wrote:
>> My question now: is the second way reasonable, or am I missing some
>> things? Anything else to consider?

Mirroring is the simplest way to expand in size and performance.

> Pardon me for jumping into a group I just joined, but I sense you are
> asking sort of a "philosophy of buying" question, and I have a different
> one that you may find useful, plus a question I'd like to confirm from
> my reading and searching of this list so far:
>
> For what and when to buy, I observe two things: at some point you HAVE
> to buy something; with disks exceeding Moore's Law (aren't they at about
> a doubling every 12 months instead of 18?), you're going to feel some
> pain afterwards *whenever* you purchase, as prices continue to plummet.
> From that, many say buy what you have to buy when you have to, although
> that isn't so useful if growing a RAID-Z is difficult.

Disk prices remain constant; disk densities change. With 500 GB disks in the $120 range, they are on the way out, so they are likely to be optimally priced. But they may not be available next year. If you mirror, then it is a no-brainer: just add two.

> And now the useful (I hope) observation: try plotting the price
> performance of parts like this. When you do so, you'll generally find a
> "knee" where it shoots up dramatically for the last increment(s) of
> performance. When I buy e.g. processors, I pick one that is just before
> the beginning of this knee, and for me (your mileage will vary :-), I
> suffer the least "buyer's remorse" afterwards.

"Buyer's remorse" for buying computer gear? We might have to revoke your geek license :-)

> The last time I checked and plotted this out, the knee is between 500
> and 750 GB for 7200.10 Seagate drives, and we can be pretty sure this
> won't change until sometime after 1 TB disks are widely adopted---and
> 500 GB makes the math simple ^_^.
>
> BTW, is it really true that there are no PCI or PCIe multiple-SATA host
> adaptors (at ANY reasonable price, doesn't have to be "budget") that are
> really solid under OpenSolaris right now? This would indeed seem to be a
> very big problem in the context of what ZFS and especially RAID-Z/Z2
> have to offer. I know I've selected OpenSolaris primarily based on "pick
> the software you want to run, and then buy the platform that best
> supports it", that software being ZFS (plus I just plain like Solaris,
> and don't particularly like Linux, even if I still curse the BSD -> AT&T
> change of 4.x to 5.x :-).

Look for gear based on the LSI 106x or Marvell chips (see marvell88sx). These are used in Sun products, such as Thumper.

 -- richard
Re: [zfs-discuss] Optimal strategy (add or replace disks) to build a cheap and raidz?
On Sat, May 05, 2007 at 02:41:28AM -0700, Christian Rost wrote:
> - Buying "cheap" 8x250 GB SATA disks at first and replacing them from
> time to time by 750 GB or bigger disks. Disadvantage: At the end I've
> bought 8x250 GB + 8x750 GB hard disks.

Look at it this way: the amount you spend on 750 GB disks now will be equal to the amount you would spend on 250 GB disks now and 750 GB disks later, after the prices drop. So for the cost of the 750 GB disks today, you end up with a set of both 250 GB *AND* 750 GB disks. Maybe those 250 GB disks can be used elsewhere. Maybe buying more controllers and more disk enclosures and *ADDING* the 750 GB disks to the pool would be something of a reasonable cost in the future; then you would essentially have 1 TB worth of space per disk pair.

The beauty of ZFS (IMHO) is that you only need to keep disks the same size within a vdev, not within the pool. So a raidz vdev of 250 GB disks and a raidz vdev of 750 GB disks will happily work in a single pool.

To showcase ZFS at work, I set up a zpool with three vdevs: two 146 GB disks in a mirror, five 36 GB disks in a raidz and five 76 GB disks in a raidz. Everyone was completely impressed. ;)

-brian

ps: does mixing raidz and mirrors in a single pool have any performance degradation associated with it? Is ZFS smart enough to know the read/write characteristics of a mirror vs. a raidz and try to take advantage of that? Just curious.

--
"Perl can be fast and elegant as much as J2EE can be fast and elegant. In the hands of a skilled artisan, it can and does happen; it's just that most of the shit out there is built by people who'd be better suited to making sure that my burger is cooked thoroughly." -- Jonathan Patschke
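Brian's demo pool can be written out like so (controller/target numbers are invented):

```shell
# One pool, three differently shaped vdevs: a 2x146G mirror, a 5x36G raidz
# and a 5x76G raidz. Sizes need to match within a vdev, not across the pool.
zpool create demo \
    mirror c1t0d0 c1t1d0 \
    raidz c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 \
    raidz c3t0d0 c3t1d0 c3t2d0 c3t3d0 c3t4d0
zpool status demo
```

Writes are dynamically striped across all three vdevs.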
Re: [zfs-discuss] Optimal strategy (add or replace disks) to build a cheap and raidz?
At 04:41 AM 5/5/2007, Christian Rost wrote:
> My question now: is the second way reasonable, or am I missing some
> things? Anything else to consider?

Pardon me for jumping into a group I just joined, but I sense you are asking sort of a "philosophy of buying" question, and I have a different one that you may find useful, plus a question I'd like to confirm from my reading and searching of this list so far:

For what and when to buy, I observe two things: at some point you HAVE to buy something; with disks exceeding Moore's Law (aren't they at about a doubling every 12 months instead of 18?), you're going to feel some pain afterwards *whenever* you purchase, as prices continue to plummet. From that, many say buy what you have to buy when you have to, although that isn't so useful if growing a RAID-Z is difficult.

And now the useful (I hope) observation: try plotting the price performance of parts like this. When you do so, you'll generally find a "knee" where it shoots up dramatically for the last increment(s) of performance. When I buy e.g. processors, I pick one that is just before the beginning of this knee, and for me (your mileage will vary :-), I suffer the least "buyer's remorse" afterwards.

The last time I checked and plotted this out, the knee is between 500 and 750 GB for 7200.10 Seagate drives, and we can be pretty sure this won't change until sometime after 1 TB disks are widely adopted---and 500 GB makes the math simple ^_^.

BTW, is it really true that there are no PCI or PCIe multiple-SATA host adaptors (at ANY reasonable price, doesn't have to be "budget") that are really solid under OpenSolaris right now? This would indeed seem to be a very big problem in the context of what ZFS and especially RAID-Z/Z2 have to offer. I know I've selected OpenSolaris primarily based on "pick the software you want to run, and then buy the platform that best supports it", that software being ZFS (plus I just plain like Solaris, and don't particularly like Linux, even if I still curse the BSD -> AT&T change of 4.x to 5.x :-).

For now, I'm going to buy a board with 4 SATA ports and use two 10K.7 SCSI drives for my system disks, "play" with RAID-Z with 4 500 GB drives on the former, and mirroring and ZFS in general with the latter, and hope that by the time I NEED to build additional larger SATA arrays or mirrors, the above adaptor issue has been resolved.

- Harold
Re: [zfs-discuss] Re: Motley group of discs?
On 5-May-07, at 2:07 AM, MC wrote:
> That's a lot of talking without an answer :)
>
>> internal EIDE 320GB (boot drive), internal 250, 200 and 160 GB drives,
>> and an external USB 2.0 600 GB drive. So, what's the best zfs
>> configuration in this situation?
>
> RAIDZ uses disk space like RAID5. So the best you could do here for
> redundant space is (160 * 4 or 5) - 160, and then use the remaining
> space as non-redundant or mirrored.
>
> If you want to play with opensolaris and zfs you can do so easily with a
> vmware or parallels virtual machine. It sounds like that is all you want
> to do right now.

He can't, on the hardware in question: the machine is a G4. Lee is apparently anticipating the integration of ZFS with OS X 10.5. I would agree that, while he waits, he should rustle up a spare PC and install Solaris.

--Toby
[zfs-discuss] Re: Optimal strategy (add or replace disks) to build a cheap and raidz?
You could estimate how long it will take for ZFS to get the feature you need, and then buy enough space so that you don't run out before then. Alternatively, Linux mdadm DOES support growing a RAID-5 array by adding devices, so you could use that instead.
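For comparison, the mdadm reshape mentioned above looks roughly like this on Linux (device names are invented; check mdadm(8) before trusting the exact flags):

```shell
mdadm /dev/md0 --add /dev/sde1           # add the new disk to the array
mdadm --grow /dev/md0 --raid-devices=5   # reshape the RAID-5 onto 5 devices
# afterwards, grow the filesystem on top (e.g. resize2fs /dev/md0)
```

The reshape runs online but can take many hours on large disks.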
[zfs-discuss] Optimal strategy (add or replace disks) to build a cheap and raidz?
Hello,

I have an 8-port SATA controller and I don't want to spend the money for 8 x 750 GB SATA disks right now. I'm thinking about an optimal way of building a growing raidz pool without losing any data. As far as I know, there are two ways to achieve this:

- Adding 750 GB disks from time to time. But this would lead to multiple groups with multiple redundancy/parity disks; I would not reach the maximum capacity of 7x750 GB at the end.

- Buying "cheap" 8x250 GB SATA disks at first and replacing them from time to time with 750 GB or bigger disks. Disadvantage: at the end I've bought 8x250 GB + 8x750 GB hard disks.

My question now: is the second way reasonable, or am I missing something? Anything else to consider?
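The replace-in-place strategy (the second way) can be sketched like this; once every disk in the raidz vdev has been replaced with a bigger one, the vdev's capacity grows. Device names are hypothetical:

```shell
# Swap the 250 GB disks for 750 GB disks one at a time, waiting for each
# resilver to finish before pulling the next disk:
zpool replace tank c1t0d0 c2t0d0   # old 250 GB disk -> new 750 GB disk
zpool status tank                  # wait for the resilver to complete
# Repeat for the remaining seven disks. Once the last small disk is gone,
# the raidz capacity grows (an export/import may be needed on older bits).
```

Replacing one disk at a time keeps the raidz redundant throughout the upgrade.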