[zfs-discuss] SSDs with a SCSI SCA interface?
Hey folks. I've looked around quite a bit, and I can't find something like this: I have a bunch of older systems which use Ultra320 SCA hot-swap connectors for their internal drives. (e.g. v20z and similar) I'd love to be able to use modern flash SSDs with these systems, but I have yet to find someone who makes anything that would fit the bill. I need either: (a) a SSD with an Ultra160/320 parallel interface (I can always find an interface adapter, so I'm not particular about whether it's a 68-pin or SCA) (b) a SAS or SATA to UltraSCSI adapter (preferably with a SCA interface) -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] b128a available w/deduplication
> Dennis Clarke wrote:
>>> FYI,
>>> OpenSolaris b128a is available for download or image-update from the
>>> dev repository. Enjoy.
>>
>> I thought that dedupe has been out for weeks now ?
>
> The source has, yes. But what Richard was referring to was the
> respun build now available via IPS.

Oh, sorry. Thought I had missed something. I hadn't :-)

I'm now on version 22 for ZFS and am not even entirely sure what that is:

# uname -a
SunOS europa 5.11 snv_129 sun4u sparc SUNW,UltraAX-i2
# zpool upgrade -v
This system is currently running ZFS pool version 22.

The following versions are supported:

VER  DESCRIPTION
---  --------------------------------------------------------
 1   Initial ZFS version
 2   Ditto blocks (replicated metadata)
 3   Hot spares and double parity RAID-Z
 4   zpool history
 5   Compression using the gzip algorithm
 6   bootfs pool property
 7   Separate intent log devices
 8   Delegated administration
 9   refquota and refreservation properties
 10  Cache devices
 11  Improved scrub performance
 12  Snapshot properties
 13  snapused property
 14  passthrough-x aclinherit
 15  user/group space accounting
 16  stmf property support
 17  Triple-parity RAID-Z
 18  Snapshot user holds
 19  Log device removal
 20  Compression using zle (zero-length encoding)
 21  Deduplication
 22  Received properties

For more information on a particular version, including supported releases, see:
http://www.opensolaris.org/os/community/zfs/version/N
Where 'N' is the version number.

HOWEVER, that URL no longer works for N > 19 and in fact, the entire URL has changed to:
http://hub.opensolaris.org/bin/view/Community+Group+zfs/22

-- Dennis Clarke dcla...@opensolaris.ca <- Email related to the open source Solaris dcla...@blastwave.org <- Email related to open source for Solaris ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] file concatenation with ZFS copy-on-write
Nicolas Williams wrote: On Thu, Dec 03, 2009 at 12:44:16PM -0800, Per Baatrup wrote: if any of f2..f5 have different block sizes from f1 This restriction does not sound so bad to me if this only refers to changes to the blocksize of a particular ZFS filesystem or copying between different ZFSes in the same pool. This can properly be managed with a "-f" switch on the userlan app to force the copy when it would fail. Why expose such details? If you have dedup on and if the file blocks and sizes align then cat f1 f2 f3 f4 f5 > f6 will do the right thing and consume only space for new metadata. I think Per's concern was not only with the space consumed but also the effort involved in the process (think large files); if I read his emails correctly, he'd like what amounts to a metadata-only manipulation, so that the data blocks of what were originally 5 files end up in one file; the traditional concat operation will cause all the data to be read and written back, at which point dedup kicks in, so most of the processing effort has already been spent by then. (Per, please correct/comment) Michael -- Michael Schuster  http://blogs.sun.com/recursion Recursion, n.: see 'Recursion' ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] mpt errors on snv 127
I eventually performed a few more tests, adjusting some zfs tuning options which had no effect, and trying the itmpt driver which someone had said would work, and regardless my system would always freeze quite rapidly in snv 127 and 128a. Just to double check my hardware, I went back to the opensolaris 2009.06 release version, and everything is working fine. The system has been running a few hours and copied a lot of data and not had any trouble, mpt syslog events, or iostat errors. One thing I found interesting, and I don't know if it's significant or not, is that under the recent builds and under 2009.06, I had run "echo '::interrupts' | mdb -k" to check the interrupts used. (I don't have the printout handy for snv 127+, though). I have a dual port gigabit Intel 1000 P PCI-e card, which shows up as e1000g0 and e1000g1. In snv 127+, each of my e1000g devices shares an IRQ with my mpt devices (mpt0, mpt1) on the IRQ listing, whereas in opensolaris 2009.06, all 4 devices are on different IRQs. I don't know if this is significant, but most of my testing when I encountered errors was data transfer via the network, so it could have potentially been interfering with the mpt drivers when it was on the same IRQ. The errors did seem to be less frequent when the server I was copying from was linked at 100 instead of 1000 (one of my tests), but that is as likely to be a result of the slower zpool throughput as it is to be related to the network traffic. I'll probably stay with 2009.06 for now since it works fine for me, but I can try a newer build again once some more progress is made in this area and people want to see if its fixed (this machine is mainly to backup another array so it's not too big a deal to test later when the mpt drivers are looking better and wipe again in the event of problems) Chad On Tue, Dec 01, 2009 at 03:06:31PM -0800, Chad Cantwell wrote: > To update everyone, I did a complete zfs scrub, and it it generated no errors > in iostat, and I have 4.8T of > data on the filesystem so it was a fairly lengthy test. The machine also has > exhibited no evidence of > instability. If I were to start copying a lot of data to the filesystem > again though, I'm sure it would > generate errors and crash again. > > Chad > > > On Tue, Dec 01, 2009 at 12:29:16AM -0800, Chad Cantwell wrote: > > Well, ok, the msi=0 thing didn't help after all. A few minutes after my > > last message a few errors showed > > up in iostat, and then in a few minutes more the machine was locked up > > hard... Maybe I will try just > > doing a scrub instead of my rsync process and see how that does. > > > > Chad > > > > > > On Tue, Dec 01, 2009 at 12:13:36AM -0800, Chad Cantwell wrote: > > > I don't think the hardware has any problems, it only started having > > > errors when I upgraded OpenSolaris. > > > It's still working fine again now after a reboot. Actually, I reread one > > > of your earlier messages, > > > and I didn't realize at first when you said "non-Sun JBOD" that this > > > didn't apply to me (in regards to > > > the msi=0 fix) because I didn't realize JBOD was shorthand for an > > > external expander device. Since > > > I'm just using baremetal, and passive backplanes, I think the msi=0 fix > > > should apply to me based on > > > what you wrote earlier, anyway I've put > > > set mpt:mpt_enable_msi = 0 > > > now in /etc/system and rebooted as it was suggested earlier. I've > > > resumed my rsync, and so far there > > > have been no errors, but it's only been 20 minutes or so. 
I should have > > > a good idea by tomorrow if this > > > definitely fixed the problem (since even when the machine was not > > > crashing it was tallying up iostat errors > > > fairly rapidly) > > > > > > Thanks again for your help. Sorry for wasting your time if the > > > previously posted workaround fixes things. > > > I'll let you know tomorrow either way. > > > > > > Chad > > > > > > On Tue, Dec 01, 2009 at 05:57:28PM +1000, James C. McPherson wrote: > > > > Chad Cantwell wrote: > > > > >After another crash I checked the syslog and there were some different > > > > >errors than the ones > > > > >I saw previously during operation: > > > > ... > > > > > > > > >Nov 30 20:59:13 the-vault LSI PCI device (1000,) not > > > > >supported. > > > > ... > > > > >Nov 30 20:59:13 the-vault mpt_config_space_init failed > > > > ... > > > > >Nov 30 20:59:15 the-vault mpt_restart_ioc failed > > > > > > > > > > > > >Nov 30 21:33:02 the-vault fmd: [ID 377184 daemon.error] SUNW-MSG-ID: > > > > >PCIEX-8000-8R, TYPE: Fault, VER: 1, SEVERITY: Major > > > > >Nov 30 21:33:02 the-vault EVENT-TIME: Mon Nov 30 21:33:02 PST 2009 > > > > >Nov 30 21:33:02 the-vault PLATFORM: System-Product-Name, CSN: > > > > >System-Serial-Number, HOSTNAME: the-vault > > > > >Nov 30 21:33:02 the-vault SOURCE: eft, REV: 1.16 > > > > >Nov 30 21:33:02 the-vault EVENT-ID: > > > > >7886cc0d-4760-60b2-e0
Re: [zfs-discuss] ZIL corrupt, not recoverable even with logfix
It was created on AMD64 FreeBSD with 8.0RC2 (which was version 13 of ZFS, iirc). At some point I knocked it out (export) somehow; I don't remember doing so intentionally. So I can't do commands like zpool replace since there are no pools. It says it was last used by the FreeBSD box, but the FreeBSD box does not show it with the "zpool status" command. I'm going down tomorrow to work on it again, and I'm going to try 8.0 Release AMD64 FreeBSD (I've already tried i386 AMD64 FreeBSD 8.0 Release) and OpenSolaris dev-127. I was just hoping there was some way I'm missing to mount it read-only (I have tried "zpool import -f -o readonly=yes" but that doesn't work either.) -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] b128a available w/deduplication
On Fri, Dec 4 at 1:12, Dennis Clarke wrote: FYI, OpenSolaris b128a is available for download or image-update from the dev repository. Enjoy. I thought that dedupe has been out for weeks now ? Dedupe has been out, but there were some accounting issues scheduled to be fixed in 128. --eric -- Eric D. Mudama edmud...@mail.bounceswoosh.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] b128a available w/deduplication
Dennis Clarke wrote: FYI, OpenSolaris b128a is available for download or image-update from the dev repository. Enjoy. I thought that dedupe has been out for weeks now ? The source has, yes. But what Richard was referring to was the respun build now available via IPS. cheers, James C. McPherson -- Senior Kernel Software Engineer, Solaris Sun Microsystems http://blogs.sun.com/jmcp http://www.jmcp.homeunix.com/blog ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] b128a available w/deduplication
> FYI, > OpenSolaris b128a is available for download or image-update from the > dev repository. Enjoy. I thought that dedupe has been out for weeks now ? Dennis ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] b128a available w/deduplication
FYI, OpenSolaris b128a is available for download or image-update from the dev repository. Enjoy. -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Supermicro AOC-USAS-L8i
On Thu, Dec 3, 2009 at 8:02 PM, steven wrote: > It will work in a standard 8x or 16x slot. The bracket is backward. Not one > for subtlety, I took the bracket off, grabbed some pliers, and reversed all > the bends. Not exactly ideal... but I was then able to get it in the case > and get some screw tension on it to hold it snugly to the case. > > I had some problems with getting the card to initialize at first. One MB > would simply not allow me to run the card in the x16 slot, even with onboard > video, even with a generic pci video card. > > Another motherboard I had, an asus-- don't recall the model, would allow it > to work. I am using an old Geforce2 PCI card for video. > > I recently picked up a pair of Intel SASUC8I. I was able to flash them with the LSI IT firmware for the 3081, and they appear to work just fine. I haven't done extensive testing, but booting off a livecd, it sees the disks just fine, and loads a driver for them. -- --Tim ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Supermicro AOC-USAS-L8i
It will work in a standard 8x or 16x slot. The bracket is backward. Not one for subtlety, I took the bracket off, grabbed some pliers, and reversed all the bends. Not exactly ideal... but I was then able to get it in the case and get some screw tension on it to hold it snugly to the case. I had some problems with getting the card to initialize at first. One MB would simply not allow me to run the card in the x16 slot, even with onboard video, even with a generic pci video card. Another motherboard I had, an asus-- don't recall the model, would allow it to work. I am using an old Geforce2 PCI card for video. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] possible mega_sas issue sol10u8 (Re: Workaround for mpt timeouts in snv_127)
Tru Huynh wrote: follow up, another crash today. On Mon, Nov 30, 2009 at 11:35:07AM +0100, Tru Huynh wrote: 1) OS SunOS xargos.bis.pasteur.fr 5.10 Generic_141445-09 i86pc i386 i86pc You should be logging a support call for this issue. James C. McPherson -- Senior Kernel Software Engineer, Solaris Sun Microsystems http://blogs.sun.com/jmcp http://www.jmcp.homeunix.com/blog ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] L2ARC in clusters
Robert Milkowski wrote: Robert Milkowski wrote: Robert Milkowski wrote: Hi, When deploying ZFS in cluster environment it would be nice to be able to have some SSDs as local drives (not on SAN) and when pool switches over to the other node zfs would pick up the node's local disk drives as L2ARC. To better clarify what I mean lets assume there is a 2-node cluster with 1sx 2540 disk array. Now lets put 4x SSDs in each node (as internal/local drives). Now lets assume one zfs pool would be created on top of a lun exported from 2540. Now 4x local SSDs could be added as L2ARC but because they are not visible on a 2nd node when cluster does failover it should be able to pick up the ssd's which are local to the other node. L2ARC doesn't contain any data which is critical to pool so it doesn't have to be shared between node. SLOG would be a whole different story and generally it wouldn't be possible. But L2ARC should be. Perhaps a scenario like below should be allowed: node-1# zpool add mysql cache node-1-ssd1 node-1-ssd2 node1-ssd3 node-1-ssd4 node-1# zpool export mysql node-2# zpool import mysql node-2# zpool add mysql cache node-2-ssd1 node-2-ssd2 node2-ssd3 node-2-ssd4 This is assuming that pool can be imported when some of its slog devices are not accessible. That way the pool always would have some L2ARC/SSDs not accessible but would provide L2ARC cache on each node with local SSDs. Actually it looks like it already works like that! A pool imports with its cache device unavailable just fine. Then I added another cache device. And I can still import it with the first one available but not the 2nd one. zpool status complains of course but other than that it seems to be working fine. Any thought? Ooo. That's a scenario I hadn't thought about. Right now, I'm doing something similar on the cheap: I have an iSCSI LUN (big ass SATA Raidz2) mounted on host A, and am using a spare 15k SAS drive locally as the L2ARC. When I export it and import it to another host, with a identical disk in the same location (.e.g. c1t1d0), I've done a 'zpool remove/add', since they write different ZFS signatures on the cache drive. Works like a champ. Given that I want to use the same device location (e.g. c1t1d0) on both hosts, is there a way I can somehow add both as cache devices, and have ZFS tell them apart by the ID signature? That is, on Host A, I do this: # zpool create tank cache c1t1d0 # zpool export tank Then, on Host B, I'm currently doing: # zpool import tank # zpool remove tank c1t1d0 # zpool add tank cache c1t1d0 I'd obviously like to figure some way that I don't need to do the 'zpool add/remove' Robert's idea looks great, but I'm assuming that all the SSD devices have different drive locations. What I need is some way of telling ZFS to use a device X as a cache device, based on it's ZFS signature, rather than it's physical device location, as that location might (in the past) be used by another vdev. Theoretically, I'd like to do something like this: hostA# zpool create tank hostA# zpool add tank cache c1t1d0 hostA# zpool export tank hostB# zpool import tank hostB# zpool add tank cache c1t1d0 And from then on, I just import/export between the two hosts, and it auto-picks the correct c1t1d0 drive. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
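For illustration, the failover-time dance Erik describes can be scripted today roughly as follows (pool name "tank" and cache device c1t1d0 are taken from his example; adjust to the real names):

zpool import tank
zpool remove tank c1t1d0        # drop the cache vdev that carries the other node's SSD label
zpool add tank cache c1t1d0     # re-add the same path, now pointing at this node's local SSD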
Re: [zfs-discuss] file concatenation with ZFS copy-on-write
On Thu, Dec 03, 2009 at 12:44:16PM -0800, Per Baatrup wrote: > >any of f1..f5's last blocks are partial > Does this mean that f1,f2,f3,f4 needs to be exact multiplum of the ZFS > blocksize? This is a severe restriction that will fail unless in very > special cases. Is this related to the disk format or is it > restriction in the implrmentation? (do you know where to look in the > source code?). I'm sure it's related to the FS structure. How do you find a particular point in a file quickly? You don't read up to that point, you want to go to it directly. To do so, you have to know how the file is indexed. If every block contains the same amount of data, this is a simple math equation. If some blocks have more or less data, then you have to keep track of them and their size. I doubt ZFS has any space or ability to include non-full blocks in the middle of a file. -- Darren ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
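To make the "simple math equation" concrete, here is the lookup for a file made of equal-sized blocks, with example values only:

offset=1300000        # byte we want to read
blocksize=131072      # fixed block size (128k)
echo "record $((offset / blocksize)), byte $((offset % blocksize)) within that record"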
Re: [zfs-discuss] zpool import - device names not always updated?
Thank you Cindy for your reply! On 3 dec 2009, at 18.35, Cindy Swearingen wrote: > A bug might exist but you are building a pool based on the ZFS > volumes that are created in another pool. This configuration > is not supported and possible deadlocks can occur. I had absolutely no idea that ZFS volumes weren't supported as ZFS containers. Where can I find information about what is and what isn't supported for ZFS volumes? > If you can retry this example without building a pool on another > pool, like using files to create a pool and can reproduce this, > then please let me know. I retried it with files instead, and it then worked exactly as expected. (Also, it no longer magically remembered the locations of previously found volumes in other directories during import, with or without the sleeps.) I don't know if it is of interest to anyone, but I'll include the reworked file-based test below. /ragge

#!/bin/bash
set -e
set -x
mkdir /d
mkfile 1g /d/f1
mkfile 1g /d/f2
zpool create pool mirror /d/f1 /d/f2
zpool status pool
zpool export pool
mkdir /d/subdir1
mkdir /d/subdir2
mv /d/f1 /d/subdir1/
mv /d/f2 /d/subdir2/
zpool import -d /d/subdir1
zpool import -d /d/subdir2
zpool import -d /d/subdir1 -d /d/subdir2 pool
zpool status pool
# cleanup - remove the "# DELETEME_" part
# DELETEME_zpool destroy pool
# DELETEME_rm -rf /d

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Quota information from nfs mounting linux client
We are using zfs (solaris 10u9) to serve disk to a couple of hundred linux clients via nfs. We would like users on the linux clients to be able to monitor their disk space on the zfs file system. They do not have shell accounts on the fileserver. Is the quota information on the fileserver (user and group) available to be read by a user program without privileged access on a remote host (the linux client)? Where would documentation be? Thanks. Sent from my BlackBerry device This e-mail may contain confidential, personal and/or health information (information which may be subject to legal restrictions on use, retention and/or disclosure) for the sole use of the intended recipient. Any review or distribution by anyone other than the person for whom it was originally intended is strictly prohibited. If you have received this e-mail in error, please contact the sender and delete all copies. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
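For reference, the per-user and per-group numbers live in ZFS properties on the server side; a rough sketch of the queries follows (dataset "tank/home" and user "alice" are made-up examples, and these commands still require access to the fileserver, so they show what data exists rather than how to expose it to unprivileged NFS clients):

zfs userspace tank/home                           # per-user space used and quota, if any
zfs get userused@alice,userquota@alice tank/home  # the same numbers for a single user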
Re: [zfs-discuss] L2ARC in clusters
Robert Milkowski wrote: Robert Milkowski wrote: Robert Milkowski wrote: When deploying ZFS in cluster environment it would be nice to be able to have some SSDs as local drives (not on SAN) and when pool switches over to the other node zfs would pick up the node's local disk drives as L2ARC. Any thought? The 7310/7410 uses this type of configuration, so obviously it works. When in doubt, just think What Would Fishworks Do? Wes Felter ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] file concatenation with ZFS copy-on-write
> > Isn't this only true if the file sizes are such that the concatenated > > blocks are perfectly aligned on the same zfs block boundaries they used > > before? This seems unlikely to me. > > Yes that would be the case. While eagerly awaiting b128 to appear in IPS, I have been giving this issue (block size and alignment vs dedup) some thought recently. I have a different, but sufficiently similar, scenario where the effectiveness of dedup will depend heavily on this factor. For this case, though, the alignment question for short tails is relatively easily dealt with. The key is that the record size of the file is "up to 128k" and may be shorter depending on various circumstances, such as the write pattern used. To simplify, let us assume that the original files were all written quickly and sequentially, that is that they have n 128k blocks, plus a shorter tail. When concatenating them, it should be sufficient to write out the target file in 128k chunks from the source, then the first tail, then issue an fsync before moving on to the chunks from the second file. If the source files were not written in this pattern (e.g. log files, accumulating small varying-size writes), the best thing to do is to rewrite those "in place" as well, with the same pattern as being written to the joined file. This can also have an improvement on compression efficiency, by allowing larger block sizes than the original. Issues/questions: * This is an optimistic method of alignment, is there any mechanism to get stronger results - ie, to know the size of each record of the original, or to produce specific record size/alignment on output? * There's already the very useful seek interface for finding holes and data, perhaps something similar is useful here. Or a direct io related option to read, that can return short reads only up to the end of the current record? * Perhaps a pause of some kind (to wait for the txg to close) is also necessary, to ensure the tail doesn't get combined with new data and reblocked? -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
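A crude sketch of the rewrite pattern described above, assuming 128k-recordsize datasets and accepting a plain sync as a blunt stand-in for a per-file fsync (f1..f5 and f6 are example names):

rm -f f6
for f in f1 f2 f3 f4 f5; do
    dd if="$f" bs=128k >> f6 2>/dev/null   # full 128k writes, then the short tail
    sync                                   # let the tail settle before the next file starts
done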
Re: [zfs-discuss] file concatenation with ZFS copy-on-write
On Thu, Dec 03, 2009 at 12:44:16PM -0800, Per Baatrup wrote: > >if any of f2..f5 have different block sizes from f1 > > This restriction does not sound so bad to me if this only refers to > changes to the blocksize of a particular ZFS filesystem or copying > between different ZFSes in the same pool. This can properly be managed > with a "-f" switch on the userlan app to force the copy when it would > fail. Why expose such details? If you have dedup on and if the file blocks and sizes align then cat f1 f2 f3 f4 f5 > f6 will do the right thing and consume only space for new metadata. If the file blocks and sizes do not align then cat f1 f2 f3 f4 f5 > f6 will still work correctly. Or do you mean that you want a way to do that cat ONLY if it would consume no new space for data? (That might actually be a good justification for a ZFS cat command, though I think, too, that one could script it.) > >any of f1..f5's last blocks are partial > > Does this mean that f1,f2,f3,f4 needs to be exact multiplum of the ZFS > blocksize? This is a severe restriction that will fail unless in very > special cases. Say f1 is 1MB, f2 is 128KB, f3 is 510 bytes, f4 is 514 bytes, and f5 is 10MB, and the recordsize for their containing datasets is 128KB, then the new file will consume 10MB + 128KB more than f1..f5 did, but 1MB + 128KB will be de-duplicated. This is not really "a severe restriction". To make ZFS do better than that would require much extra metadata and complexity in the filesystem that users who don't need to do space-efficient file concatenation (most users, that is) won't want to pay for. > Is this related to the disk format or is it restriction in the > implrmentation? (do you know where to look in the source code?). Both. > >...but also ZFS most likely could not do any better with any other, more > >specific non-dedup solution > > Properly lots of I/O traffic, digest calculation+lookups, could be > saved as we already know it will be a duplicate. (In our case the > files are gigabyte sizes) ZFS hashes, and records hashes of blocks, not sub-blocks. Look at my above example. To efficiently dedup the concatenation of the 10MB of f5 would require being able to have something like "sub-block pointers". Alternatively, if you want a concatenation-specific feature ZFS would have to have a metadata notion of concatentation, but then the Unix way of concatenating files couldn't be used for this since the necessary context is lost in the I/O redirection. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
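Nico's "one could script it" might look roughly like this: concatenate only if every input except the last is an exact multiple of the dataset recordsize (128k assumed here), so the result can dedup fully against the originals (f1..f5 and f6 are example names):

rs=131072
ok=1
for f in f1 f2 f3 f4; do                      # every input except the last
    sz=$(ls -l "$f" | awk '{print $5}')
    [ $((sz % rs)) -ne 0 ] && ok=0
done
if [ $ok -eq 1 ]; then
    cat f1 f2 f3 f4 f5 > f6                   # tails align, dedup can reuse the data blocks
else
    echo "tails would not align; cat would allocate new data blocks" >&2
fi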
Re: [zfs-discuss] file concatenation with ZFS copy-on-write
>Btw. I would be surprised to hear that this can be implemented >with current APIs; I agree. However it looks like an opportunity to dive into the Z-source code. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] file concatenation with ZFS copy-on-write
>if any of f2..f5 have different block sizes from f1 This restriction does not sound so bad to me if this only refers to changes to the blocksize of a particular ZFS filesystem or copying between different ZFSes in the same pool. This can probably be managed with a "-f" switch on the userland app to force the copy when it would otherwise fail. >any of f1..f5's last blocks are partial Does this mean that f1,f2,f3,f4 need to be exact multiples of the ZFS blocksize? This is a severe restriction that will fail except in very special cases. Is this related to the disk format or is it a restriction in the implementation? (Do you know where to look in the source code?) >...but also ZFS most likely could not do any better with any other, more specific non-dedup solution Probably lots of I/O traffic and digest calculation+lookups could be saved, as we already know it will be a duplicate. (In our case the files are gigabytes in size.) --Per -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Petabytes on a budget - blog
Just thought I would let everybody know I saw one at a local ISP yesterday. They hadn't started testing; the metal had only arrived the day before and they were waiting for the drives to arrive. They had also changed the design to give it more network capacity. I will try to find out more as the customer progresses. >Interesting blog: >http://blog.backblaze.com/2009/09/01/petabytes-on-a-budget-how-to-build-cheap-cloud-storage/ > -- Trevor Pretty | Technical Account Manager | T: +64 9 639 0652 | M: +64 21 666 161 Eagle Technology Group Ltd. Gate D, Alexandra Park, Greenlane West, Epsom Private Bag 93211, Parnell, Auckland www.eagle.co.nz This email is confidential and may be legally privileged. If received in error please destroy and immediately notify us. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] EON ZFS Storage 0.59.5 based on snv 125 released!
Embedded Operating system/Networking (EON), RAM based live ZFS NAS appliance is released on Genunix! Many thanks to Al Hopper and Genunix.org for download hosting and serving the opensolaris community. EON ZFS storage is available in 32/64-bit CIFS and Samba versions:

EON 64-bit x86 CIFS ISO image version 0.59.5 based on snv_125
* eon-0.595-125-64-cifs.iso
* MD5: a21c0b6111803f95c29e421af96ee016
* Size: ~90Mb
* Released: Thursday 3-December-2009

EON 64-bit x86 Samba ISO image version 0.59.5 based on snv_125
* eon-0.595-125-64-smb.iso
* MD5: 4678298f0152439867d218987c3ec20e
* Size: ~103Mb
* Released: Thursday 3-December-2009

EON 32-bit x86 CIFS ISO image version 0.59.5 based on snv_125
* eon-0.595-125-32-cifs.iso
* MD5: 4b76893c3363d46fad34bf7d0c23548c
* Size: ~57Mb
* Released: Thursday 3-December-2009

EON 32-bit x86 Samba ISO image version 0.59.5 based on snv_125
* eon-0.595-125-32-smb.iso
* MD5: f478a8ea9228f16dc1bd93adae03d200
* Size: ~70Mb
* Released: Thursday 3-December-2009

EON 64-bit x86 CIFS ISO image version 0.59.5 based on snv_125 (NO HTTP)
* eon-0.595-125-64-cifs-min.iso
* MD5: c7b9ec5c487302c1aa97363eb440fe00
* Size: ~85Mb
* Released: Thursday 3-December-2009

EON 64-bit x86 Samba ISO image version 0.59.5 based on snv_125 (NO HTTP)
* eon-0.595-125-64-smb-min.iso
* MD5: a33f34506f05070ffc554de7beaafd4d
* Size: ~98Mb
* Released: Thursday 3-December-2009

New/Changes/Fixes:
- removed iscsitgd and replaced it with COMSTAR (iscsit, stmf)
- added SUNWhd to image vs being in the binary kit.
- added rsync to image vs being in the binary kit.
- added nge, yge and yukonx drivers.
- added (/etc/inet/hosts, /etc/default/init) to /mnt/eon0/.backup (TIMEZONE and hostname change fix)
- fixed typo entry /mnt/eon0/.exec zpool -a to zpool import -a
- eon rebooting at grub (since snv_122) in ESXi, Fusion and various versions of VMware workstation. This is related to bug 6820576. Workaround: at grub press e and add "-B disable-pcieb=true" on the end of the kernel line.
-- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] file concatenation with ZFS copy-on-write
Per, Per Baatrup schrieb: Roland, Clearly an extension of "cp" would be very nice when managing large files. Today we are relying heavily on snapshots for this, but this requires disipline on storing files in separate zfs'es avioding to snapshot too many files that changes frequently. The reason I was speaking about "cat" in stead of "cp" is that in addition to copying a single file I would like also to concatenate several files into a single file. Can this be accomplished with your "(z)cp"? No - "zcp" is a simpler case than what you proposed, and thats why I pointed it out as a discussion case. ( And it is clearly NOT the same as 'ln'. ) Btw. I would be surprised to hear that this can be implemented with current APIs; you would need a call like (my fantasy here) "write_existing_block()" where the data argument is not a pointer to a buffer in memory but instead a reference to an already existing data block in the pool. Based on such a call ( and a corresponding one for read that returns those references in the pool ) IMHO an implementation of the commands would be straight forward ( the actual work would be in the implementation of those calls ). This can certainly been done - I just doubt it already exists. -- Roland -- ** Roland Rambau Platform Technology Team Principal Field Technologist Global Systems Engineering Phone: +49-89-46008-2520 Mobile:+49-172-84 58 129 Fax: +49-89-46008- mailto:roland.ram...@sun.com ** Sitz der Gesellschaft: Sun Microsystems GmbH, Sonnenallee 1, D-85551 Kirchheim-Heimstetten Amtsgericht München: HRB 161028; Geschäftsführer: Thomas Schröder, Wolfgang Engels, Wolf Frenkel Vorsitzender des Aufsichtsrates: Martin Häring *** UNIX * /bin/sh FORTRAN ** ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] possible mega_sas issue sol10u8 (Re: Workaround for mpt timeouts in snv_127)
follow up, another crash today. On Mon, Nov 30, 2009 at 11:35:07AM +0100, Tru Huynh wrote: > 1) OS > SunOS xargos.bis.pasteur.fr 5.10 Generic_141445-09 i86pc i386 i86pc > > it's only sharing though NFS v3 to linux clients running > 20x CentOS-5 x86_64 2.6.18-164.6.1.el5 x86_64/i386 > 78x CentOS-3 x86_64/ia32e/i386 > > 2) usual logs: > /var/adm/messages > -> nothing still empty > > 3) fmdump -ev > /var/fm/fmd/errlog is empty same > 7) not tried yet > reboot -d to force a dump failed (not returned from sync) reboot -dfn failed at 98% of the dump (I could not catch the reason, screen blanked too fast) > > 9) from the #irc channel, I will keep a screen running with: [...@xargos ~]$ ps -ef UID PID PPID CSTIME TTY TIME CMD root 0 0 0 Nov 29 ? 3:16 sched root 1 0 0 Nov 29 ? 0:00 /sbin/init root 2 0 0 Nov 29 ? 0:00 pageout root 3 0 0 Nov 29 ? 20:04 fsflush root 154 1 0 Nov 29 ? 0:00 /usr/lib/picl/picld root 7 1 0 Nov 29 ? 0:04 /lib/svc/bin/svc.startd root 9 1 0 Nov 29 ? 0:08 /lib/svc/bin/svc.configd daemon 152 1 0 Nov 29 ? 0:03 /usr/lib/crypto/kcfd tru 2258 2226 0 Nov 30 pts/7 0:00 /usr/bin/bash root 409 408 0 Nov 29 ? 0:00 /usr/sadm/lib/smc/bin/smcboot root 142 1 0 Nov 29 ? 0:01 /usr/lib/sysevent/syseventd root 429 1 0 Nov 29 ? 0:00 sh /opt/MegaRaidStorageManager/Framework/startup.sh root57 1 0 Nov 29 ? 0:00 /sbin/dhcpagent root64 1 0 Nov 29 ? 0:00 devfsadmd root 208 1 0 Nov 29 ? 0:00 /lib/svc/method/iscsid daemon 306 1 0 Nov 29 ? 0:00 /usr/sbin/rpcbind root 146 1 0 Nov 29 ? 0:12 /usr/sbin/nscd root 2228 2226 0 Nov 30 pts/2 0:08 zpool iostat -v 60 root 332 7 0 Nov 29 ? 0:00 /usr/lib/saf/sac -t 300 root 145 1 0 Nov 29 ? 0:00 /usr/lib/power/powerd root 226 1 0 Nov 29 ? 0:10 /usr/lib/inet/xntpd root 394 332 0 Nov 29 ? 0:00 /usr/lib/saf/ttymon root 262 1 0 Nov 29 ? 0:00 /usr/sbin/cron root 366 1 0 Nov 29 ? 0:00 /usr/lib/utmpd noaccess 673 1 0 Nov 29 ? 3:04 /usr/java/bin/java -server -Xmx128m -XX:+UseParallelGC -XX:ParallelGCThreads=4 root 349 7 0 Nov 29 console 0:00 /usr/lib/saf/ttymon -g -d /dev/console -l console -m ldterm,ttcompat -h -p xarg daemon 315 1 0 Nov 29 ? 0:00 /usr/lib/nfs/statd daemon 317 1 0 Nov 29 ? 0:01 /usr/lib/nfs/nfsmapid root 552 1 0 Nov 29 ? 0:01 /usr/sfw/sbin/snmpd daemon 324 1 0 Nov 29 ? 0:00 /usr/lib/nfs/lockd root 431 1 0 Nov 29 ? 0:05 /usr/sbin/syslogd tru 695 689 0 Nov 29 pts/1 0:00 -bash root 367 1 0 Nov 29 ? 0:00 /usr/lib/autofs/automountd root 365 1 0 Nov 29 ? 0:02 /usr/lib/inet/inetd start root 369 367 0 Nov 29 ? 0:01 /usr/lib/autofs/automountd root 430 429 0 Nov 29 ? 3:26 ../jre/bin/java -classpath ../jre/lib/rt.jar:../jre/lib/jsse.jar:../jre/lib/jce root 408 1 0 Nov 29 ? 0:00 /usr/sadm/lib/smc/bin/smcboot root 410 408 0 Nov 29 ? 0:00 /usr/sadm/lib/smc/bin/smcboot root 2234 2226 0 Nov 30 pts/5 0:15 intrstat 60 tru 2236 2226 0 Nov 30 pts/6 0:06 vmstat 60 tru 689 688 0 Nov 29 ? 0:01 /usr/lib/ssh/sshd root 594 1 0 Nov 29 ? 4:25 /usr/sbin/lsi_mrdsnmpagent -c /etc/sma/snmp/snmpd.conf tru 2232 2226 0 Nov 30 pts/4 0:13 prstat 60 tru 2225 695 0 Nov 30 pts/1 0:01 screen root 443 1 0 Nov 29 ? 0:00 /usr/lib/ssh/sshd root 688 443 0 Nov 29 ? 0:00 /usr/lib/ssh/sshd root 541 1 0 Nov 29 ? 0:03 /usr/lib/sendmail -bd -q15m -C /etc/mail/local.cf smmsp 537 1 0 Nov 29 ? 0:00 /usr/lib/sendmail -Ac -q15m root 2565 1 0 Nov 30 ? 0:06 /usr/local/bin/mrmonitord tru 2226 2225 0 Nov 30 ? 0:05 screen tru 3988 3982 0 15:33:51 pts/11 0:00 prstat root 498 1 0 Nov 29 ? 0:00 /usr/sbin/vold -f /etc/vold.conf root 509 1 0 Nov 29 ? 
0:04 /usr/lib/fm/fmd/fmd tru 2230 2226 0 Nov 30 pts/3 0:06 iostat -xn 60 tru 3967 3966 0 15:33:36 ? 0:00 /usr/lib/ssh/sshd root 522 1 0 Nov 29 ? 0:00 /usr/lib/nfs/mountd daemon 524 1 0 Nov 29 ? 9:45 /usr/lib/
Re: [zfs-discuss] file concatenation with ZFS copy-on-write
On Thu, Dec 03, 2009 at 03:57:28AM -0800, Per Baatrup wrote: > I would like to to concatenate N files into one big file taking > advantage of ZFS copy-on-write semantics so that the file > concatenation is done without actually copying any (large amount of) > file content. > cat f1 f2 f3 f4 f5 > f15 > Is this already possible when source and target are on the same ZFS > filesystem? > > Am looking into the ZFS source code to understand if there are > sufficient (private) interfaces to make a simple "zcat -o f15 f1 f2 > f3 f4 f5" userland application in C code. Does anybody have advice on > this? There have been plenty of answers already. Quite aside from dedup, the fact that all blocks in a file must have the same uncompressed size means that if any of f2..f5 have different block sizes from f1, or any of f1..f5's last blocks are partial then ZFS could not perform this concatenation as efficiently as you wish. In other words: dedup _is_ what you're looking for... ...but also ZFS most likely could not do any better with any other, more specific non-dedup solution. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] L2ARC re-uses new device if it is in the same "place"
Hi, mi...@r600:/rpool/tmp# zpool status test pool: test state: ONLINE scrub: none requested config: NAME STATE READ WRITE CKSUM test ONLINE 0 0 0 /rpool/tmp/f1 ONLINE 0 0 0 errors: No known data errors lets add a cache device: mi...@r600:/rpool/tmp# zfs create -V 100m rpool/tmp/ssd2 mi...@r600:/rpool/tmp# zpool add test cache /dev/zvol/dsk/rpool/tmp/ssd2 mi...@r600:/rpool/tmp# zpool status test pool: test state: ONLINE scrub: none requested config: NAMESTATE READ WRITE CKSUM testONLINE 0 0 0 /rpool/tmp/f1 ONLINE 0 0 0 cache /dev/zvol/dsk/rpool/tmp/ssd2 ONLINE 0 0 0 errors: No known data errors mi...@r600:/rpool/tmp# now lets export the pool, re-create the zvol and then import the pool again: mi...@r600:/rpool/tmp# zpool export test mi...@r600:/rpool/tmp# zfs destroy rpool/tmp/ssd2 mi...@r600:/rpool/tmp# zfs create -V 100m rpool/tmp/ssd2 mi...@r600:/rpool/tmp# zpool import -d /rpool/tmp/ test mi...@r600:/rpool/tmp# zpool status test pool: test state: ONLINE scrub: none requested config: NAMESTATE READ WRITE CKSUM testONLINE 0 0 0 /rpool/tmp/f1 ONLINE 0 0 0 cache /dev/zvol/dsk/rpool/tmp/ssd2 ONLINE 0 0 0 errors: No known data errors mi...@r600:/rpool/tmp# No complaint here... I'm not entirely sure that it should behave that way - in some circumstances it could be risky. For example what if zvol/ssd/disk which is used on one server as a cache device has the same path on another server and then a pool is imported there? Would l2arc just blindly start using it as a cache device and overwriting some other data? Shouldn't l2arc devices have a label/signature or at least use uuid of a disk and during import be checked if it is the same device? Or maybe it does and there is some other issue here with re-creating zvol... btw: x86, snv_127 -- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
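As a data point for the label/signature question, the vdev labels that do exist on a cache device can be inspected with zdb; the path below is the example zvol from the test above:

zdb -l /dev/zvol/dsk/rpool/tmp/ssd2    # dumps the vdev labels, including pool and vdev GUIDs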
Re: [zfs-discuss] L2ARC in clusters
Robert Milkowski wrote: Robert Milkowski wrote: Hi, When deploying ZFS in cluster environment it would be nice to be able to have some SSDs as local drives (not on SAN) and when pool switches over to the other node zfs would pick up the node's local disk drives as L2ARC. To better clarify what I mean lets assume there is a 2-node cluster with 1sx 2540 disk array. Now lets put 4x SSDs in each node (as internal/local drives). Now lets assume one zfs pool would be created on top of a lun exported from 2540. Now 4x local SSDs could be added as L2ARC but because they are not visible on a 2nd node when cluster does failover it should be able to pick up the ssd's which are local to the other node. L2ARC doesn't contain any data which is critical to pool so it doesn't have to be shared between node. SLOG would be a whole different story and generally it wouldn't be possible. But L2ARC should be. Perhaps a scenario like below should be allowed: node-1# zpool add mysql cache node-1-ssd1 node-1-ssd2 node1-ssd3 node-1-ssd4 node-1# zpool export mysql node-2# zpool import mysql node-2# zpool add mysql cache node-2-ssd1 node-2-ssd2 node2-ssd3 node-2-ssd4 This is assuming that pool can be imported when some of its slog devices are not accessible. That way the pool always would have some L2ARC/SSDs not accessible but would provide L2ARC cache on each node with local SSDs. Actually it looks like it already works like that! A pool imports with its cache device unavailable just fine. Then I added another cache device. And I can still import it with the first one available but not the 2nd one. zpool status complains of course but other than that it seems to be working fine. Any thought? -- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] file concatenation with ZFS copy-on-write
On Thu, Dec 03, 2009 at 09:36:23AM -0800, Per Baatrup wrote: > The reason I was speaking about "cat" in stead of "cp" is that in > addition to copying a single file I would like also to concatenate > several files into a single file. Can this be accomplished with your > "(z)cp"? Unless you have special data formats, I think it's unlikely that the last ZFS block in the file will be exactly full. But to append without copying, you'd need some way of ignoring a portion of the data in a non-final ZFS block and stitching together the bytestream. I don't think that's possible with the ZFS layout. -- Darren ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] L2ARC in clusters
Robert Milkowski wrote: Robert Milkowski wrote: Hi, When deploying ZFS in cluster environment it would be nice to be able to have some SSDs as local drives (not on SAN) and when pool switches over to the other node zfs would pick up the node's local disk drives as L2ARC. To better clarify what I mean lets assume there is a 2-node cluster with 1sx 2540 disk array. Now lets put 4x SSDs in each node (as internal/local drives). Now lets assume one zfs pool would be created on top of a lun exported from 2540. Now 4x local SSDs could be added as L2ARC but because they are not visible on a 2nd node when cluster does failover it should be able to pick up the ssd's which are local to the other node. L2ARC doesn't contain any data which is critical to pool so it doesn't have to be shared between node. SLOG would be a whole different story and generally it wouldn't be possible. But L2ARC should be. Perhaps a scenario like below should be allowed: node-1# zpool add mysql cache node-1-ssd1 node-1-ssd2 node1-ssd3 node-1-ssd4 node-1# zpool export mysql node-2# zpool import mysql node-2# zpool add mysql cache node-2-ssd1 node-2-ssd2 node2-ssd3 node-2-ssd4 This is assuming that pool can be imported when some of its slog devices are not accessible. That way the pool always would have some L2ARC/SSDs not accessible but would provide L2ARC cache on each node with local SSDs. btw: mi...@r600:/rpool/tmp# mkfile 200m f1 mi...@r600:/rpool/tmp# mkfile 100m s1 mi...@r600:/rpool/tmp# zpool create test /rpool/tmp/f1 mi...@r600:/rpool/tmp# zpool status test pool: test state: ONLINE scrub: none requested config: NAME STATE READ WRITE CKSUM test ONLINE 0 0 0 /rpool/tmp/f1 ONLINE 0 0 0 errors: No known data errors mi...@r600:/rpool/tmp# zpool add test cache /rpool/tmp/s1 cannot add to 'test': cache device must be a disk or disk slice mi...@r600:/rpool/tmp# is there a reason why a cache device can't be set-up on a file like for other vdevs? mi...@r600:/rpool/tmp# zfs create -V 100m rpool/tmp/ssd1 mi...@r600:/rpool/tmp# zpool add test cache /dev/zvol/rdsk/rpool/tmp/ssd1 cannot use '/dev/zvol/rdsk/rpool/tmp/ssd1': must be a block device or regular file mi...@r600:/rpool/tmp# zpool add test cache /dev/zvol/dsk/rpool/tmp/ssd1 mi...@r600:/rpool/tmp# So when I try to add a cache device on-top of a file I get an error that a cache device must be a disk or a disk slice, so when I try to add a cache device on a rdsk I get an error that it bust be a block device or regular file which suggest a regular file should work... (dsk works fine). -- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] L2ARC in clusters
Robert Milkowski wrote: Hi, When deploying ZFS in cluster environment it would be nice to be able to have some SSDs as local drives (not on SAN) and when pool switches over to the other node zfs would pick up the node's local disk drives as L2ARC. To better clarify what I mean lets assume there is a 2-node cluster with 1sx 2540 disk array. Now lets put 4x SSDs in each node (as internal/local drives). Now lets assume one zfs pool would be created on top of a lun exported from 2540. Now 4x local SSDs could be added as L2ARC but because they are not visible on a 2nd node when cluster does failover it should be able to pick up the ssd's which are local to the other node. L2ARC doesn't contain any data which is critical to pool so it doesn't have to be shared between node. SLOG would be a whole different story and generally it wouldn't be possible. But L2ARC should be. Perhaps a scenario like below should be allowed: node-1# zpool add mysql cache node-1-ssd1 node-1-ssd2 node1-ssd3 node-1-ssd4 node-1# zpool export mysql node-2# zpool import mysql node-2# zpool add mysql cache node-2-ssd1 node-2-ssd2 node2-ssd3 node-2-ssd4 This is assuming that pool can be imported when some of its slog devices are not accessible. That way the pool always would have some L2ARC/SSDs not accessible but would provide L2ARC cache on each node with local SSDs. btw: mi...@r600:/rpool/tmp# mkfile 200m f1 mi...@r600:/rpool/tmp# mkfile 100m s1 mi...@r600:/rpool/tmp# zpool create test /rpool/tmp/f1 mi...@r600:/rpool/tmp# zpool status test pool: test state: ONLINE scrub: none requested config: NAME STATE READ WRITE CKSUM test ONLINE 0 0 0 /rpool/tmp/f1 ONLINE 0 0 0 errors: No known data errors mi...@r600:/rpool/tmp# zpool add test cache /rpool/tmp/s1 cannot add to 'test': cache device must be a disk or disk slice mi...@r600:/rpool/tmp# is there a reason why a cache device can't be set-up on a file like for other vdevs? -- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] L2ARC in clusters
Hi, When deploying ZFS in a cluster environment it would be nice to be able to have some SSDs as local drives (not on SAN), and when the pool switches over to the other node zfs would pick up that node's local disk drives as L2ARC. To better clarify what I mean, let's assume there is a 2-node cluster with 1x 2540 disk array. Now let's put 4x SSDs in each node (as internal/local drives). Now let's assume one zfs pool would be created on top of a lun exported from the 2540. Now the 4x local SSDs could be added as L2ARC, but because they are not visible on the 2nd node, when the cluster does a failover it should be able to pick up the SSDs which are local to the other node. L2ARC doesn't contain any data which is critical to the pool, so it doesn't have to be shared between nodes. SLOG would be a whole different story and generally it wouldn't be possible. But L2ARC should be. -- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] file concatenation with ZFS copy-on-write
Roland, Clearly an extension of "cp" would be very nice when managing large files. Today we are relying heavily on snapshots for this, but this requires discipline in storing files in separate zfs'es, avoiding snapshotting too many files that change frequently. The reason I was speaking about "cat" instead of "cp" is that in addition to copying a single file I would also like to concatenate several files into a single file. Can this be accomplished with your "(z)cp"? --Per -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zpool import - device names not always updated?
Hi Ragnar, A bug might exist but you are building a pool based on the ZFS volumes that are created in another pool. This configuration is not supported and possible deadlocks can occur. If you can retry this example without building a pool on another pool, like using files to create a pool and can reproduce this, then please let me know. Thanks, Cindy On 12/01/09 17:57, Ragnar Sundblad wrote: It seems that device names aren't always updated when importing pools if devices have moved. I am not sure if this is only an cosmetic issue or if it could actually be a real problem - could it lead to the device not being found at a later import? /ragge (This is on snv_127.) I ran the following script: #!/bin/bash set -e set -x zfs create -V 1G rpool/vol1 zfs create -V 1G rpool/vol2 zpool create pool mirror /dev/zvol/dsk/rpool/vol1 /dev/zvol/dsk/rpool/vol2 zpool status pool zpool export pool zfs create rpool/subvol1 zfs create rpool/subvol2 zfs rename rpool/vol1 rpool/subvol1/vol1 zfs rename rpool/vol2 rpool/subvol2/vol2 zpool import -d /dev/zvol/dsk/rpool/subvol1 sleep 1 zpool import -d /dev/zvol/dsk/rpool/subvol2 sleep 1 zpool import -d /dev/zvol/dsk/rpool/subvol1 pool zpool status pool And got the output below. I have annotated it with ### remarks. # bash zfs-test.bash + zfs create -V 1G rpool/vol1 + zfs create -V 1G rpool/vol2 + zpool create pool mirror /dev/zvol/dsk/rpool/vol1 /dev/zvol/dsk/rpool/vol2 + zpool status pool pool: pool state: ONLINE scrub: none requested config: NAME STATE READ WRITE CKSUM pool ONLINE 0 0 0 mirror-0ONLINE 0 0 0 /dev/zvol/dsk/rpool/vol1 ONLINE 0 0 0 /dev/zvol/dsk/rpool/vol2 ONLINE 0 0 0 errors: No known data errors + zpool export pool + zfs create rpool/subvol1 + zfs create rpool/subvol2 + zfs rename rpool/vol1 rpool/subvol1/vol1 + zfs rename rpool/vol2 rpool/subvol2/vol2 + zpool import -d /dev/zvol/dsk/rpool/subvol1 pool: pool id: 13941781561414544058 state: DEGRADED status: One or more devices are missing from the system. action: The pool can be imported despite missing or damaged devices. The fault tolerance of the pool may be compromised if imported. see: http://www.sun.com/msg/ZFS-8000-2Q config: pool DEGRADED mirror-0DEGRADED /dev/zvol/dsk/rpool/subvol1/vol1 ONLINE /dev/zvol/dsk/rpool/vol2 UNAVAIL cannot open ### Note that it can't find vol2 - which is expected. + sleep 1 ### The sleep here seems to be necessary for vol1 to magically be ### found in the next zpool import. + zpool import -d /dev/zvol/dsk/rpool/subvol2 pool: pool id: 13941781561414544058 state: ONLINE action: The pool can be imported using its name or numeric identifier. config: pool ONLINE mirror-0ONLINE /dev/zvol/dsk/rpool/vol1 ONLINE /dev/zvol/dsk/rpool/subvol2/vol2 ONLINE ### Note that it says vol1 is ONLINE, under it's old path, though it actually has moved + sleep 1 + zpool import -d /dev/zvol/dsk/rpool/subvol1 pool + zpool status pool pool: pool state: ONLINE scrub: none requested config: NAME STATE READ WRITE CKSUM pool ONLINE 0 0 0 mirror-0ONLINE 0 0 0 /dev/zvol/dsk/rpool/subvol1/vol1 ONLINE 0 0 0 /dev/zvol/dsk/rpool/vol2 ONLINE 0 0 0 errors: No known data errors ### Note that vol2 has it old path shown! 
### Interestingly, if you then + zpool export pool + zpool import -d /dev/zvol/dsk/rpool/subvol2 pool ### vol2's path gets updated too: + zpool status pool pool: pool state: ONLINE scrub: none requested config: NAME STATE READ WRITE CKSUM pool ONLINE 0 0 0 mirror-0ONLINE 0 0 0 /dev/zvol/dsk/rpool/subvol1/vol1 ONLINE 0 0 0 /dev/zvol/dsk/rpool/subvol2/vol2 ONLINE 0 0 0 errors: No known data errors ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZIL corrupt, not recoverable even with logfix
Was the zpool originally created by a FreeBSD operating system or by an OpenSolaris operating system? And what version of FreeBSD, SXCE, or OpenSolaris Indiana was it originally created by? The reason I'm asking this is because there are different versions of ZFS in different versions of OpenSolaris, so if you take a newer-version zpool and try to mount it on an older version of OpenSolaris, it won't mount. The last time I tried it, a long time ago, ZFS in FreeBSD was pretty unstable and still under heavy development, which was the sole reason I migrated my storage server with my important data on it to OpenSolaris, and it has been rock solid stable since. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
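One quick way to see whether a pool-version mismatch is in play is to compare versions on each OS that is supposed to import the pool, for example:

zpool upgrade -v | head -1    # pool version this kernel is running/supports
zpool import                  # scans for importable pools; a pool of a newer version
                              # should be reported as incompatible rather than imported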
Re: [zfs-discuss] file concatenation with ZFS copy-on-write
Michael, michael schuster schrieb: Roland Rambau wrote: gang, actually a simpler version of that idea would be a "zcp": if I just cp a file, I know that all blocks of the new file will be duplicates; so the cp could take full advantage for the dedup without a need to check/read/write anz actual data I think they call it 'ln' ;-) and that even works on ufs. quite similar but with a critical difference: with hard links any modifications through either link are seen by both links, since it stays a single file (note that editors like vi do an implicit cp, they do NOT update the original file ) That "zcp" ( actually it should be just a feature of 'cp' ) would be blockwise copy-on-write. It would have exactly the same semantics as cp but just avoid any data movement, since we can easily predict what the effect of a cp followed by a dedup should be. -- Roland -- ** Roland Rambau Platform Technology Team Principal Field Technologist Global Systems Engineering Phone: +49-89-46008-2520 Mobile:+49-172-84 58 129 Fax: +49-89-46008- mailto:roland.ram...@sun.com ** Sitz der Gesellschaft: Sun Microsystems GmbH, Sonnenallee 1, D-85551 Kirchheim-Heimstetten Amtsgericht München: HRB 161028; Geschäftsführer: Thomas Schröder, Wolfgang Engels, Wolf Frenkel Vorsitzender des Aufsichtsrates: Martin Häring *** UNIX * /bin/sh FORTRAN ** ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
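A two-minute illustration of the distinction being drawn here (file names are examples):

echo one > f1
ln f1 f_link        # hard link: f1 and f_link are the same file
cp f1 f_copy        # copy: an independent file
echo two >> f_link
cat f1              # prints "one" and "two" - the write is visible through f1 as well
cat f_copy          # still prints just "one"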
Re: [zfs-discuss] Separate Zil on HDD ?
On 12/03/09 09:21, mbr wrote: Hello, Bob Friesenhahn wrote: On Thu, 3 Dec 2009, mbr wrote: What about the data that were on the ZILlog SSD at the time of failure, is a copy of the data still in the machines memory from where it can be used to put the transaction to the stable storage pool? The intent log SSD is used as 'write only' unless the system reboots, in which case it is used to support recovery. The system memory is used as the write path in the normal case. Once the data is written to the intent log, then the data is declared to be written as far as higher level applications are concerned. thank you Bob for the clarification. So I don't need a mirrored ZILlog for security reasons, all the information is still in memory and will be used from there by default if only the ZILlog SSD fails. Mirrored log devices are advised to improve reliablity. As previously mentioned, if during writing a log device fails or is temporarily full then we use the main pool devices to chain the log blocks. If we get read errors when trying to replay the intent log (after a crash/power fail) then the admin is given the option to ignore the log and continue or somehow fix the device (eg re-attach) and then retry. Multiple log devices would provide extra reliability here. We do not look in memory for the log records if we can't get the records from the log blocks. If the intent log SSD fails and the system spontaneously reboots, then data may be lost. I can live with the data loss as long as the machine comes up with the faulty ZILlog SSD but otherwise without probs and with a clean zpool. The log records are not required for consistency of the pool (it's not a journal). Has the following error no consequences? Bug ID 6538021 Synopsis Need a way to force pool startup when zil cannot be replayed State 3-Accepted (Yes, that is a problem) Link http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6538021 Er that bug should probably be closed as a duplicate. We now have this functionality. Michael. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Separate Zil on HDD ?
On Thu, 3 Dec 2009, mbr wrote: Has the following error no consequences? Bug ID 6538021 Synopsis Need a way to force pool startup when zil cannot be replayed State 3-Accepted (Yes, that is a problem) Link http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6538021 I don't know the status of this, but it does make sense to require the user to explicitly choose to corrupt/lose data in the storage pool. It could be that the log device is just temporarily missing and can be restored, so zfs should not do this by default. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] file concatenation with ZFS copy-on-write
On Thu, 3 Dec 2009, Jason King wrote: Well it could be done in a way such that it could be fs-agnostic (perhaps extending /bin/cat with a new flag such as -o outputfile, or detecting if stdout is a file vs tty, though corner cases might get tricky). If a particular fs supported such a feature, it could take advantage of it, but if it didn't, it could fall back to doing a read+append. Sort of like how mv figures out if the source & target are the same or different filesystems and acts accordingly. The most common way that I concatenate files into a larger file is by using a utility such as 'tar', which outputs a different format. I rarely use 'cat' to concatenate files. If it is desired to concatenate files in a way which works best for deduplication, then a tar-like format can be invented which takes care to always start new file output on a filesystem block boundary. With zfs deduplication this should be faster and take less space than compressing the entire result as long as the output is stored in the same pool. If output is written to a destination filesystem which uses a different block size, then the ideal block size will be that of the destination filesystem so that large archive files can still be usefully deduplicated. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
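To make the block-boundary idea concrete, here is a minimal sketch (not an existing tool; the 128K recordsize, file names, and output path are assumptions) that pads each input out to the next 128K boundary as it is appended, so every original file starts on a ZFS block boundary in the output. A real tar-like format would also record the original sizes so the padding could be stripped on extraction:

#!/bin/sh
# concatenate inputs, padding each one to the next 128K boundary (assumed recordsize)
BS=131072
OUT=/tank/data/archive.cat
: > "$OUT"
for f in f1 f2 f3 f4 f5; do
    cat "$f" >> "$OUT"
    size=`ls -l "$OUT" | awk '{print $5}'`
    pad=`expr \( $BS - $size % $BS \) % $BS`
    if [ "$pad" -gt 0 ]; then
        dd if=/dev/zero bs=1 count=$pad >> "$OUT" 2>/dev/null
    fi
done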
Re: [zfs-discuss] Separate Zil on HDD ?
Hello, Bob Friesenhahn wrote: On Thu, 3 Dec 2009, mbr wrote: What about the data that were on the ZILlog SSD at the time of failure, is a copy of the data still in the machine's memory from where it can be used to put the transaction to the stable storage pool? The intent log SSD is used as 'write only' unless the system reboots, in which case it is used to support recovery. The system memory is used as the write path in the normal case. Once the data is written to the intent log, then the data is declared to be written as far as higher level applications are concerned. Thank you, Bob, for the clarification. So I don't need a mirrored ZILlog for security reasons; all the information is still in memory and will be used from there by default if only the ZILlog SSD fails. If the intent log SSD fails and the system spontaneously reboots, then data may be lost. I can live with the data loss as long as the machine comes up with the faulty ZILlog SSD but otherwise without probs and with a clean zpool. Has the following error no consequences? Bug ID 6538021 Synopsis Need a way to force pool startup when zil cannot be replayed State 3-Accepted (Yes, that is a problem) Link http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6538021 Michael. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] file concatenation with ZFS copy-on-write
On Thu, Dec 3, 2009 at 9:58 AM, Bob Friesenhahn wrote: > On Thu, 3 Dec 2009, Erik Ableson wrote: >> >> Much depends on the contents of the files. Fixed size binary blobs that >> align nicely with 16/32/64k boundaries, or variable sized text files. > > Note that the default zfs block size is 128K and so that will therefore be > the default dedup block size. > > Most files are less than 128K and occupy a short tail block so concatenating > them will not usually enjoy the benefits of deduplication. > > It is not wise to riddle zfs with many special-purpose features since zfs > would then be encumbered by these many features, which tend to defeat future > improvements. Well it could be done in a way such that it could be fs-agnostic (perhaps extending /bin/cat with a new flag such as -o outputfile, or detecting if stdout is a file vs tty, though corner cases might get tricky). If a particular fs supported such a feature, it could take advantage of it, but if it didn't, it could fall back to doing a read+append. Sort of like how mv figures out if the source & target are the same or different filesystems and acts accordingly. There are a few use cases I've encountered where having this would have been _very_ useful (usually when trying to get large crashdumps to Sun quickly). In general, it would allow one to manipulate very large files by breaking them up into smaller subsets while still having the end result be a single file. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
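As a tiny illustration of the stdout-detection part of that idea, a wrapper could branch on whether standard output is a terminal before choosing a strategy; everything below is plain shell, and the fs-aware fast path is purely hypothetical (today both branches simply fall back to an ordinary cat):

#!/bin/sh
# hypothetical cat wrapper: choose a strategy based on where stdout goes
if [ -t 1 ]; then
    exec cat "$@"   # stdout is a terminal: stream normally
fi
# stdout is redirected (likely a file): an fs-aware block-sharing path
# could be attempted here; absent one, fall back to an ordinary read+append
exec cat "$@"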
Re: [zfs-discuss] file concatenation with ZFS copy-on-write
Bob Friesenhahn wrote: On Thu, 3 Dec 2009, Erik Ableson wrote: Much depends on the contents of the files. Fixed size binary blobs that align nicely with 16/32/64k boundaries, or variable sized text files. Note that the default zfs block size is 128K and so that will therefore be the default dedup block size. Most files are less than 128K and occupy a short tail block so concatenating them will not usually enjoy the benefits of deduplication. Most? I think that is a bit of a sweeping statement. I know of some environments where "most" files are multiple gigabytes in size and others where 1K is the upper bound of the file sizes. So I don't think you can say at all that "Most" files are < 128K. -- Darren J Moffat ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] file concatenation with ZFS copy-on-write
On Thu, 3 Dec 2009, Erik Ableson wrote: Much depends on the contents of the files. Fixed size binary blobs that align nicely with 16/32/64k boundaries, or variable sized text files. Note that the default zfs block size is 128K and so that will therefore be the default dedup block size. Most files are less than 128K and occupy a short tail block so concatenating them will not usually enjoy the benefits of deduplication. It is not wise to riddle zfs with many special-purpose features since zfs would then be encumbered by these many features, which tend to defeat future improvements. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
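For anyone checking this on their own datasets, the block size in question is the recordsize property, which is per filesystem and can be inspected or changed (a change only affects blocks written afterwards); the dataset name below is hypothetical:

# zfs get recordsize tank/data
# zfs set recordsize=64K tank/data

Also note that a file smaller than the recordsize is stored in a single block sized to the file, which is why the short-tail-block point above matters for small files.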
Re: [zfs-discuss] file concatenation with ZFS copy-on-write
michael schuster wrote: Roland Rambau wrote: gang, actually a simpler version of that idea would be a "zcp": if I just cp a file, I know that all blocks of the new file will be duplicates; so the cp could take full advantage of the dedup without a need to check/read/write any actual data I think they call it 'ln' ;-) and that even works on ufs. Michael +1 More and more it sounds like an optimization that will either A. not add much over dedup or B. have value only in specific situations - and completely misbehave in other situations (even the same situations after passage of time) Why not just make a special-purpose application (completely user-land) for it? I know, 'ln' is remotely kin of this idea, but 'ln' is POSIX and people know what to expect. What you'd practically need to do is whip up a vfs layer that exposes the underlying blocks of a filesystem and possibly name them by their SHA256 or MD5 hash. Then you'd need (another?) vfs abstraction that allows 'virtual' files to be assembled from these blocks in multiple independent chains. I know there is already a fuse implementation of the first vfs driver (the name evades me, but I think it was something like chunkfs[1]) and one could at least whip up a reasonable read-only Proof-of-Concept of the second part. The reason _I_ wouldn't do that is because I'm already happy with e.g.:
mkfifo /var/run/my_part_collector
(while true; do cat /local/data/my_part_* > /var/run/my_part_collector; done)&
wc -l /var/run/my_part_collector
The equivalent of this could be (better) expressed in C, perl or any language of your choice. I believe this is all POSIX. [1] The reason this exists is obviously for backup and synchronization implementations: it will make it possible to back up files using rsync when the encryption key is not available to the backup process (with an ECB-mode crypto algorithm); it should make it 'simple' to synchronize one's large monolithic files with e.g. Amazon S3 cloud storage etc. etc. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] file concatenation with ZFS copy-on-write
Per Baatrup wrote: Actually 'ln -s source target' would not be the same "zcp source target" as writing to the source file after the operation would change the target file as well where as for "zcp" this would only change the source file due to copy-on-write semantics of ZFS. I actually was thinking of creating a hard link (without the -s option), but your point is valid for hard and soft links. cheers Michael -- Michael Schuster http://blogs.sun.com/recursion Recursion, n.: see 'Recursion' ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Separate Zil on HDD ?
On Thu, 3 Dec 2009, mbr wrote: What about the data that were on the ZILlog SSD at the time of failure, is a copy of the data still in the machine's memory from where it can be used to put the transaction to the stable storage pool? The intent log SSD is used as 'write only' unless the system reboots, in which case it is used to support recovery. The system memory is used as the write path in the normal case. Once the data is written to the intent log, then the data is declared to be written as far as higher level applications are concerned. If the intent log SSD fails and the system spontaneously reboots, then data may be lost. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] file concatenation with ZFS copy-on-write
Actually 'ln -s source target' would not be the same as "zcp source target", since writing to the source file after the operation would change the target file as well, whereas for "zcp" this would only change the source file due to the copy-on-write semantics of ZFS. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] file concatenation with ZFS copy-on-write
Bob Friesenhahn wrote: On Thu, 3 Dec 2009, Darren J Moffat wrote: The answer to this is likely deduplication which ZFS now has. The reason dedup should help here is that after the 'cat' f15 will be made up of blocks that match the blocks of f1 f2 f3 f4 f5. Copy-on-write isn't what helps you here it is dedup. Isn't this only true if the file sizes are such that the concatenated blocks are perfectly aligned on the same zfs block boundaries they used before? This seems unlikely to me. Yes, that would be the case. -- Darren J Moffat ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] file concatenation with ZFS copy-on-write
Roland Rambau wrote: gang, actually a simpler version of that idea would be a "zcp": if I just cp a file, I know that all blocks of the new file will be duplicates; so the cp could take full advantage of the dedup without a need to check/read/write any actual data I think they call it 'ln' ;-) and that even works on ufs. Michael -- Michael Schuster http://blogs.sun.com/recursion Recursion, n.: see 'Recursion' ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] file concatenation with ZFS copy-on-write
gang, actually a simpler version of that idea would be a "zcp": if I just cp a file, I know that all blocks of the new file will be duplicates; so the cp could take full advantage of the dedup without a need to check/read/write any actual data -- Roland Per Baatrup wrote: "dedup" operates on the block level leveraging the existing ZFS checksums. Read "What to dedup: Files, blocks, or bytes" here http://blogs.sun.com/bonwick/entry/zfs_dedup The trick should be that the zcat userland app already knows that it will generate duplicate files so data reads and writes could be avoided altogether. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] file concatenation with ZFS copy-on-write
"zcat" was my acronym for a special ZFS aware version of "cat" and the name was obviously a big mistake as I did not know it was an existing command and simply forgot to check. Should rename if to "zfscat" or something similar? -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] file concatenation with ZFS copy-on-write
Per Baatrup wrote: "dedup" operates on the block level leveraging the existing FFS checksums. Read "What to dedup: Files, blocks, or bytes" here http://blogs.sun.com/bonwick/entry/zfs_dedup The trick should be that the zcat userland app already knows that it will generate duplicate files so data read and writes could be avoided all together. you'd probably be better off avoiding "zcat" - it's been in use since almost forever, from the man-page: zcat The zcat utility writes to standard output the uncompressed form of files that have been compressed using compress. It is the equivalent of uncompress-c. Input files are not affected. :-) cheers Michael -- Michael Schusterhttp://blogs.sun.com/recursion Recursion, n.: see 'Recursion' ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] file concatenation with ZFS copy-on-write
"dedup" operates on the block level leveraging the existing FFS checksums. Read "What to dedup: Files, blocks, or bytes" here http://blogs.sun.com/bonwick/entry/zfs_dedup The trick should be that the zcat userland app already knows that it will generate duplicate files so data read and writes could be avoided all together. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] file concatenation with ZFS copy-on-write
On 3 Dec 2009, at 13:29, Bob Friesenhahn wrote: On Thu, 3 Dec 2009, Darren J Moffat wrote: The answer to this is likely deduplication which ZFS now has. The reason dedup should help here is that after the 'cat' f15 will be made up of blocks that match the blocks of f1 f2 f3 f4 f5. Copy-on-write isn't what helps you here it is dedup. Isn't this only true if the file sizes are such that the concatenated blocks are perfectly aligned on the same zfs block boundaries they used before? This seems unlikely to me. It's also worth noting that if the block alignment works out for the dedup, the actual write traffic will be trivial, consisting only of pointer references, so the heavy lifting will be the read operations. Much depends on the contents of the files. Fixed size binary blobs that align nicely with 16/32/64k boundaries, or variable sized text files. Regards, Erik Ableson ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Separate Zil on HDD ?
Hello, Edward Ned Harvey wrote: Yes, I have SSD for ZIL. Just one SSD. 32G. But if this is the problem, then you'll have the same poor performance on the local machine that you have over NFS. So I'm curious to see if you have the same poor performance locally. The ZIL does not need to be reliable; if it fails, the ZIL will begin writing to the main storage, and performance will suffer until the new SSD is put into production. I am also planning to install an SSD as ZILlog. Is it really true that there are no problems if the ZILlog fails and there is no mirror of the ZILlog? What about the data that were on the ZILlog SSD at the time of failure, is a copy of the data still in the machine's memory from where it can be used to put the transaction to the stable storage pool? What if the machine reboots after the SSD has failed? The ZFS Best Practices Guide recommends mirroring the log: http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#Storage_Pool_Performance_Considerations Mirroring the log device is highly recommended. Protecting the log device by mirroring will allow you to access the storage pool even if a log device has failed. Failure of the log device may cause the storage pool to be inaccessible if you are running the Solaris Nevada release prior to build 96 and a release prior to the Solaris 10 10/09 release. For more information, see CR 6707530. http://bugs.opensolaris.org/view_bug.do?bug_id=6707530 No probs with that if I use Sol10U8? Regards, Michael. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] file concatenation with ZFS copy-on-write
On Thu, 3 Dec 2009, Darren J Moffat wrote: The answer to this is likely deduplication which ZFS now has. The reason dedup should help here is that after the 'cat' f15 will be made up of blocks that match the blocks of f1 f2 f3 f4 f5. Copy-on-write isn't what helps you here it is dedup. Isn't this only true if the file sizes are such that the concatenated blocks are perfectly aligned on the same zfs block boundaries they used before? This seems unlikely to me. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
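One way to sanity-check the alignment concern for a concrete set of files is to compare each file's size against the dataset's recordsize; a rough sketch, where the dataset and file names are assumptions and zfs get -p is used to obtain the raw byte value:

#!/bin/sh
# report which inputs do not end on a recordsize boundary (hypothetical names)
RS=`zfs get -Hp -o value recordsize tank/data`
for f in f1 f2 f3 f4; do
    size=`ls -l "$f" | awk '{print $5}'`
    if [ `expr $size % $RS` -ne 0 ]; then
        echo "$f: $size bytes, not a multiple of $RS"
    fi
done

Any file flagged here (other than the last one concatenated) would break the alignment for everything that follows it.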
Re: [zfs-discuss] file concatenation with ZFS copy-on-write
Peter Tribble wrote: On Thu, Dec 3, 2009 at 12:08 PM, Darren J Moffat wrote: Per Baatrup wrote: I would like to concatenate N files into one big file taking advantage of ZFS copy-on-write semantics so that the file concatenation is done without actually copying any (large amount of) file content. cat f1 f2 f3 f4 f5 > f15 Is this already possible when source and target are on the same ZFS filesystem? Am looking into the ZFS source code to understand if there are sufficient (private) interfaces to make a simple "zcat -o f15 f1 f2 f3 f4 f5" userland application in C code. Does anybody have advice on this? The answer to this is likely deduplication which ZFS now has. The reason dedup should help here is that after the 'cat' f15 will be made up of blocks that match the blocks of f1 f2 f3 f4 f5. Is that likely to happen? dedup is at the block level, so the blocks in f2 will only match the same data in f15 if they're aligned, which is only going to happen if f1 ends on a block boundary. Correct, you will only get the maximum benefit if the source files end on a block boundary. Which is why I said "likely deduplication". Besides, you still have to read all the data off the disk, manipulate it, and write it all back. Yep. -- Darren J Moffat ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] file concatenation with ZFS copy-on-write
On Thu, Dec 3, 2009 at 12:08 PM, Darren J Moffat wrote: > Per Baatrup wrote: >> >> I would like to to concatenate N files into one big file taking advantage >> of ZFS copy-on-write semantics so that the file concatenation is done >> without actually copying any (large amount of) file content. >> cat f1 f2 f3 f4 f5 > f15 >> Is this already possible when source and target are on the same ZFS >> filesystem? >> >> Am looking into the ZFS source code to understand if there are sufficient >> (private) interfaces to make a simple "zcat -o f15 f1 f2 f3 f4 f5" >> userland application in C code. Does anybody have advice on this? > > The answer to this is likely deduplication which ZFS now has. > > The reason dedup should help here is that after the 'cat' f15 will be made > up of blocks that match the blocks of f1 f2 f3 f4 f5. Is that likely to happen? dedup is at the block level, so the blocks in f2 will only match the same data in f15 if they're aligned, which is only going to happen if f1 ends on a block boundary. Besides, you still have to read all the data off the disk, manipulate it, and write it all back. -- -Peter Tribble http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] file concatenation with ZFS copy-on-write
Per Baatrup wrote: I would like to concatenate N files into one big file taking advantage of ZFS copy-on-write semantics so that the file concatenation is done without actually copying any (large amount of) file content. cat f1 f2 f3 f4 f5 > f15 Is this already possible when source and target are on the same ZFS filesystem? Am looking into the ZFS source code to understand if there are sufficient (private) interfaces to make a simple "zcat -o f15 f1 f2 f3 f4 f5" userland application in C code. Does anybody have advice on this? The answer to this is likely deduplication which ZFS now has. The reason dedup should help here is that after the 'cat' f15 will be made up of blocks that match the blocks of f1 f2 f3 f4 f5. Copy-on-write isn't what helps you here; it is dedup. -- Darren J Moffat ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
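For anyone trying this out, dedup is enabled per dataset and its effect shows up at the pool level; a minimal sketch with hypothetical pool and dataset names (the pool must be at a version that supports dedup, i.e. version 21 or later):

# zfs set dedup=on tank/data
# cat f1 f2 f3 f4 f5 > /tank/data/f15
# zpool get dedupratio tank

When the block sizes and boundaries line up as discussed in the replies above, the dedup ratio rises instead of the pool consuming new space for f15's data blocks.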
[zfs-discuss] file concatenation with ZFS copy-on-write
I would like to concatenate N files into one big file taking advantage of ZFS copy-on-write semantics so that the file concatenation is done without actually copying any (large amount of) file content. cat f1 f2 f3 f4 f5 > f15 Is this already possible when source and target are on the same ZFS filesystem? Am looking into the ZFS source code to understand if there are sufficient (private) interfaces to make a simple "zcat -o f15 f1 f2 f3 f4 f5" userland application in C code. Does anybody have advice on this? TIA Per -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Separate Zil on HDD ?
On Wed, Dec 02, 2009 at 03:57:47AM -0800, Brian McKerr wrote: > I previously had a linux NFS server that I had mounted 'ASYNC' and, as one > would expect, NFS performance was pretty good getting close to 900gb/s. Now > that I have moved to opensolaris, NFS performance is not very good, I'm > guessing mainly due to the 'SYNC' nature of NFS. I've seen various threads > and most point at 2 options; > > 1. Disable the ZIL > 2. Add independent log device/s We have experienced the same performance penalty using NFS over ZFS. The issue is indeed caused by the synchronous nature of NFS over ZFS. More precisely, it is caused by the fact that ZFS promises correct behaviour while e.g. a linux NFS server (using async) does not. The issue is described in great detail at http://blogs.sun.com/roch/entry/nfs_and_zfs_a_fine If you want the same behaviour as you had with your Linux NFS server, you can disable the ZIL. Doing so should give the same guarantees as the linux NFS service. The big issue with disabling the ZIL is that it is system-wide. Although it could be an acceptable tradeoff for one filesystem, it is not necessarily a good system-wide setting. That is why I think the option to disable the ZIL should be per-filesystem (which I think should be possible because a ZIL is actually kept per-filesystem). As for adding HDDs as ZIL devices, I'd advise against it. We have tried this and the performance decreased. Using SSDs as the ZIL is probably the way to go. A final option is to accept the situation as it is, arguing that you have traded performance for increased reliability. Regards, Auke -- Auke Folkerts University of Amsterdam ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
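For completeness, the system-wide switch being referred to on builds of that era was the zil_disable tunable. A sketch of the usual way it was set, offered only as an illustration of why the setting is global: it turns off synchronous semantics for every dataset on the host and is generally discouraged:

# echo "set zfs:zil_disable = 1" >> /etc/system   # takes effect at the next boot
# reboot

The tunable was reportedly only consulted when a dataset is mounted, so even a live change (e.g. via mdb -kw) required a remount or reboot before it had any effect.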