Re: [zfs-discuss] [dtrace-discuss] How to drill down cause of cross-calls in the kernel? (output provided)
Hey Jim - There's something we're missing here. There does not appear to be enough ZFS write activity to cause the system to pause regularly. Were you able to capture a kernel profile during the pause period? Thanks, /jim

Jim Leonard wrote:

The only thing that jumps out at me is the ARC size - 53.4GB, or most of your 64GB of RAM. This in-and-of-itself is not necessarily a bad thing - if there are no other memory consumers, let ZFS cache data in the ARC. But if something is coming along to flush dirty ARC pages periodically

The workload is a set of 50 python processes, each receiving a stream of data via TCP/IP. The processes run until they notice something interesting in the stream (sorry I can't be more specific), then they connect to a server via TCP/IP and issue a command or two. Log files are written that take up about 50M per day per process. It's relatively low-traffic.

I found what looked to be an applicable bug: CR 6699438 "zfs induces crosscall storm under heavy mapped sequential read workload", but the stack signature for the above bug is different than yours, and it doesn't sound like your workload is doing mmap'd sequential reads. That said, I would be curious to know if your workload used mmap(), versus read/write?

I asked and they couldn't say. It's python so I think it's unlikely.

For the ZFS folks just seeing this, here's the stack frame:

unix`xc_do_call+0x8f
unix`xc_wait_sync+0x36
unix`x86pte_invalidate_pfn+0x135
unix`hat_pte_unmap+0xa9
unix`hat_unload_callback+0x109
unix`hat_unload+0x2a
unix`segkmem_free_vn+0x82
unix`segkmem_zio_free+0x10
genunix`vmem_xfree+0xee
genunix`vmem_free+0x28
genunix`kmem_slab_destroy+0x80
genunix`kmem_slab_free+0x1be
genunix`kmem_magazine_destroy+0x54
genunix`kmem_depot_ws_reap+0x4d
genunix`taskq_thread+0xbc
unix`thread_start+0x8

Let's see what the fsstat and zpool iostat data looks like when this starts happening..

Both are unremarkable, I'm afraid. Here's the fsstat from when it starts happening:

 new  name   name  attr  attr lookup rddir  read read  write write
 file remov  chng   get   set    ops   ops   ops bytes    ops bytes
    0     0     0    75     0      0     0     0     0     10 1.25M zfs
    0     0     0    83     0      0     0     0     0      7  896K zfs
    0     0     0    78     0      0     0     0     0     13 1.62M zfs
    0     0     0   229     0      0     0     0     0     29 3.62M zfs
    0     0     0   217     0      0     0     0     0     28 3.37M zfs
    0     0     0   212     0      0     0     0     0     26 3.03M zfs
    0     0     0   151     0      0     0     0     0     18 2.07M zfs
    0     0     0   184     0      0     0     0     0     31 3.41M zfs
    0     0     0   187     0      0     0     0     0     32 2.74M zfs
    0     0     0   219     0      0     0     0     0     24 2.61M zfs
    0     0     0   222     0      0     0     0     0     29 3.29M zfs
    0     0     0   206     0      0     0     0     0     29 3.26M zfs
    0     0     0   205     0      0     0     0     0     19 2.26M zfs

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
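One way to get both the cross-call origins and the kernel profile Jim asked about is a pair of DTrace one-liners; this is a generic sketch (the 30-second window is arbitrary), not something taken from the system in question:

# aggregate the kernel stacks that issue cross-calls during a pause window
dtrace -n 'sysinfo:::xcalls { @[stack()] = count(); } tick-30s { exit(0); }'

# kernel profile at 997 Hz over the same window (arg0 != 0 means the sample landed in the kernel)
dtrace -n 'profile-997 /arg0/ { @[stack()] = count(); } tick-30s { exit(0); }'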
Re: [zfs-discuss] ZFS pool replace single disk with raidz
On Fri, 25 Sep 2009, Ryan Hirsch wrote: I have a zpool named rtank. I accidentally attached a single drive to the pool. I am an idiot I know :D Now I want to replace this single drive with a raidz group. Below is the pool setup and what I tried:

I think that the best you will be able to do is to turn this single drive into a mirror. It seems that this sort of human error occurs pretty often and there is not yet a way to properly fix it.

Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
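For reference, turning the stray top-level disk into a mirror is a single zpool attach; a minimal sketch assuming a spare disk is available (c6d0 is an illustrative device name):

# attach a second disk to the lone c5d0 vdev, making it a 2-way mirror
zpool attach rtank c5d0 c6d0
# watch the resilver and confirm the new mirror vdev
zpool status rtank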
Re: [zfs-discuss] Which directories must be part of rpool?
On Sep 25, 2009, at 19:39, Frank Middleton wrote: /var/tmp is a strange beast. It can get quite large, and be a serious bottleneck if mapped to a physical disk and used by any program that synchronously creates and deletes large numbers of files. I have had no problems mapping /var/tmp to /tmp. Hopefully a guru will step in here and explain why this is a bad idea, but so far no problems... The contents of /var/tmp can be expected to survive between boots (e.g., /var/tmp/vi.recover); /tmp is nuked on power cycles (because it's just memory/swap): /tmp: A directory made available for applications that need a place to create temporary files. Applications shall be allowed to create files in this directory, but shall not assume that such files are preserved between invocations of the application. http://www.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap10.html If a program is creating and deleting large numbers of files, and those files aren't needed between reboots, then it really should be using /tmp. Similar definition for Linux FWIW: http://www.pathname.com/fhs/pub/fhs-2.3.html ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] periodic slow responsiveness
On 09/25/09 16:19, Bob Friesenhahn wrote: On Fri, 25 Sep 2009, Ross Walker wrote: Problem is most SSD manufactures list sustained throughput with large IO sizes, say 4MB, and not 128K, so it is tricky buying a good SSD that can handle the throughput. Who said that the slog SSD is written to in 128K chunks? That seems wrong to me. Previously we were advised that the slog is basically a log of uncommitted system calls so the size of the data chunks written to the slog should be similar to the data sizes in the system calls. Log blocks are variable in size dependent on what needs to be committed. The minimum size is 4KB and the max 128KB. Log records are aggregated and written together as much as possible. Neil. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] ZFS pool replace single disk with raidz
I have a zpool named rtank. I accidentally attached a single drive to the pool. I am an idiot I know :D Now I want to replace this single drive with a raidz group. Below is the pool setup and what I tried:

        NAME        STATE     READ WRITE CKSUM
        rtank       ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c4t0d0  ONLINE       0     0     0
            c4t1d0  ONLINE       0     0     0
            c4t2d0  ONLINE       0     0     0
            c4t3d0  ONLINE       0     0     0
            c4t4d0  ONLINE       0     0     0
            c4t5d0  ONLINE       0     0     0
            c4t6d0  ONLINE       0     0     0
            c4t7d0  ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c3t0d0  ONLINE       0     0     0
            c3t1d0  ONLINE       0     0     0
            c3t2d0  ONLINE       0     0     0
            c3t3d0  ONLINE       0     0     0
            c3t4d0  ONLINE       0     0     0
            c3t5d0  ONLINE       0     0     0
          c5d0      ONLINE       0     0     0   <--- single drive in the pool, not in any raidz

$ pfexec zpool replace rtank c5d0 raidz c3t6d0 c3t7d0 c3t8d0 c3t9d0 c3t10d0 c3t11d0
too many arguments

$ zpool upgrade -v
This system is currently running ZFS pool version 18.

Is what I am trying to do possible? If so what am I doing wrong? Thanks. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] periodic slow responsiveness
On Sep 25, 2009, at 6:19 PM, Bob Friesenhahn > wrote: On Fri, 25 Sep 2009, Ross Walker wrote: Problem is most SSD manufactures list sustained throughput with large IO sizes, say 4MB, and not 128K, so it is tricky buying a good SSD that can handle the throughput. Who said that the slog SSD is written to in 128K chunks? That seems wrong to me. Previously we were advised that the slog is basically a log of uncommitted system calls so the size of the data chunks written to the slog should be similar to the data sizes in the system calls. Are these not broken into recordsize chunks? -Ross ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Which directories must be part of rpool?
On Fri, 2009-09-25 at 14:39 -0600, Lori Alt wrote: > The list of datasets in a root pool should look something like this: ... > rpool/swap I've had success with putting swap into other pools. I believe others have, as well. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
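For anyone who wants to try the same thing, swap on a zvol in a data pool is only a couple of commands; a minimal sketch with illustrative names and size (add a matching /etc/vfstab entry if it should survive a reboot):

zfs create -V 4G tank/swap              # size is illustrative
swap -a /dev/zvol/dsk/tank/swap
swap -l                                 # confirm the new swap device is listed
# /etc/vfstab entry for persistence:
# /dev/zvol/dsk/tank/swap  -  -  swap  -  no  -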
Re: [zfs-discuss] Which directories must be part of rpool?
on Fri Sep 25 2009, Glenn Lagasse wrote: > The question you're asking can't easily be answered. Sun doesn't test > configs like that. If you really want to do this, you'll pretty much > have to 'try it and see what breaks'. And you get to keep both pieces > if anything breaks. Heh, that doesn't sound like much fun. I have a VM I can experiment with, but I don't want to do this badly enough to take that risk. > There's very little you can safely move in my experience. /export > certainly. Anything else, not really (though ymmv). I tried to create > a seperate zfs dataset for /usr/local. That worked some of the time, > but it also screwed up my system a time or two during > image-updates/package installs. That's hard to imagine. My OpenSolaris installation didn't come with a /usr/local directory. How can mounting a filesystem from a non-root pool under /usr possibly mess anything up? > On my 2010.02/123 system I see: > > bin Symlink to /usr/bin > boot/ > dev/ > devices/ > etc/ > export/ Safe to move, not tied to the 'root' system Good to know. > kernel/ > lib/ > media/ > mnt/ > net/ > opt/ > platform/ > proc/ > rmdisk/ > root/ Could probably move root's homedir I don't think I'd risk it. > rpool/ > sbin/ > system/ > tmp/ > usr/ > var/ > > Other than /export, everything else is considered 'part of the root > system'. Thus part of the root pool. > > Really, if you can't add a mirror for your root pool, then make backups > of your root pool (left as an exercise to the reader) and store the > non-system specific bits (/export) on you're raidz2 pool. Yeah, that's my fallback. Actually, that along with copies=2 on my root pool, which I might well do anyhow. But you people are making a pretty strong case for making the effort to figure out how to do the mirror thing. Thanks, all, for the feedback. -- Dave Abrahams BoostPro Computing http://www.boostpro.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
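The copies=2 fallback mentioned above is a single property; a small sketch — note it only applies to blocks written after the property is set, and it guards against localized corruption, not against losing the whole disk:

zfs set copies=2 rpool
zfs get -r copies rpool      # child datasets inherit unless overridden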
Re: [zfs-discuss] selecting zfs BE from OBP
Ah yes. Thanks Cindy! donour On Sep 25, 2009, at 10:37 AM, Cindy Swearingen wrote: Hi Donour, You would use the boot -L syntax to select the ZFS BE to boot from, like this: ok boot -L Rebooting with command: boot -L Boot device: /p...@8,60/SUNW,q...@4/f...@0,0/ d...@w2104cf7fa6c7,0:a File and args: -L 1 zfs1009BE 2 zfs10092BE Select environment to boot: [ 1 - 2 ]: 2 Then copy and paste the boot string that is provided: To boot the selected entry, invoke: boot [] -Z rpool/ROOT/zfs10092BE Program terminated {0} ok boot -Z rpool/ROOT/zfs10092BE See this pointer as well: http://docs.sun.com/app/docs/doc/819-5461/ggpco?a=view Cindy On 09/25/09 11:09, Donour Sizemore wrote: Can you select the LU boot environment from sparc obp, if the filesystem is zfs? With ufs, you simply invoke 'boot [slice]'. thanks donour ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] White box server for OpenSolaris
From a product standpoint, expanding the variety available in the Storage 7000 (Amber Road) line is somewhere I think we'd (Sun) make bank on. Things like: [ for the home/very small business market ] Mini-Tower sized case, 4-6 3.5" HS SATA-only bays (to take the X2200-style spud bracket drives), 2 CF slots (for boot), single-socket, with 4 DIMMs, and a built-in ILOM. /maybe/ a x4 PCI-E slot, but maybe not. [ for the small business/branch office with no racks] Mid-tower case, 4-bay 2.5" HS area, 6-8 bay 3.5" HS area, single socket, 4/6 DIMMs, ILOM. (2) x4 or x8 PCI-E slots too. (I'd probably go with Socket AM3, with ECC, of course) I'd sell them in both fully loaded with the Amber Road software (and mandatory Service Contract), and no-OS Loaded, no-Service Contract appliance versions. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Which directories must be part of rpool?
On 09/25/09 04:44 PM, Lori Alt wrote:

rpool
rpool/ROOT
rpool/ROOT/snv_124 (or whatever version you're running)
rpool/ROOT/snv_124/var (you might not have this)
rpool/ROOT/snv_121 (or whatever other BEs you still have)
rpool/dump
rpool/export
rpool/export/home
rpool/swap

Unless your machine is so starved for physical memory that you couldn't possibly install anything, AFAIK you can always boot without dump and swap, so even if your data pool can't be mounted, you should be OK. I've done many a reboot and pkg image-update with dump and swap inaccessible. Of course with no dump, you won't get, well, a dump, after a panic...

Having /usr/local (IIRC this doesn't even exist in a straight OpenSolaris install) in a shared space on your data pool is quite useful if you have more than one machine, unless you have multiple architectures. Then it turns into the /opt problem. Hiving off /opt does not seem to prevent booting, and having it on a data pool doesn't seem to prevent upgrade installs. The big problem with putting /opt on a shared pool is when multiple hosts have different /opts. Using legacy mounts seems to be the only way around this. Do the gurus have a technical explanation why putting /opt in a different pool shouldn't work?

/var/tmp is a strange beast. It can get quite large, and be a serious bottleneck if mapped to a physical disk and used by any program that synchronously creates and deletes large numbers of files. I have had no problems mapping /var/tmp to /tmp. Hopefully a guru will step in here and explain why this is a bad idea, but so far no problems...

A 32GB SSD is marginal for a root pool, so shrinking it as much as possible makes a lot of sense until bigger SSDs become cost effective (not long from now I imagine). But if you already have a 16GB or 32GB SSD, or a dedicated boot disk <= 32GB, then you can be SOL unless you are very careful to empty /var/pkg/download, which doesn't seem to get emptied even if you set the magic flag.

HTH -- Frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
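On the /var/pkg/download point, the IPS download cache can usually be told to clean up after itself; a hedged sketch — the property name here is from memory for OpenSolaris builds of this era, so confirm it with pkg property first:

pkg property | grep flush                           # confirm the property exists on your build
pkg set-property flush-content-cache-on-success True
rm -rf /var/pkg/download/*                          # reclaim space already used by the cache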
Re: [zfs-discuss] periodic slow responsiveness
rswwal...@gmail.com said:
> Yes, but if it's on NFS you can just figure out the workload in MB/s and use
> that as a rough guideline.

I wonder if that's the case. We have an NFS server without NVRAM cache (X4500), and it gets huge MB/sec throughput on large-file writes over NFS. But it's painfully slow on the "tar extract lots of small files" test, where many, tiny, synchronous metadata operations are performed.

> I did a similar test with a 512MB BBU controller and saw no difference with
> or without the SSD slog, so I didn't end up using it.
>
> Does your BBU controller ignore the ZFS flushes?

I believe it does (it would be slow otherwise). It's the Sun StorageTek internal SAS RAID HBA.

Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
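When every device behind a pool really does have nonvolatile (battery-backed) write cache, the usual workaround of this era was to stop ZFS from issuing cache flushes at all; a sketch — this is unsafe if any pool device has a volatile cache:

# /etc/system (takes effect on the next boot)
set zfs:zfs_nocacheflush = 1

# or toggle it live for testing
echo zfs_nocacheflush/W0t1 | mdb -kw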
Re: [zfs-discuss] periodic slow responsiveness
On Fri, 25 Sep 2009, Ross Walker wrote: Problem is most SSD manufactures list sustained throughput with large IO sizes, say 4MB, and not 128K, so it is tricky buying a good SSD that can handle the throughput. Who said that the slog SSD is written to in 128K chunks? That seems wrong to me. Previously we were advised that the slog is basically a log of uncommitted system calls so the size of the data chunks written to the slog should be similar to the data sizes in the system calls. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] periodic slow responsiveness
On Fri, 25 Sep 2009, Richard Elling wrote: By default, the txg commit will occur when 1/8 of memory is used for writes. For 30 GBytes, that would mean a main memory of only 240 Gbytes... feasible for modern servers. Ahem. We were advised that 7/8s of memory is currently what is allowed for writes. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Which directories must be part of rpool?
* David Magda (dma...@ee.ryerson.ca) wrote: > On Sep 25, 2009, at 16:39, Glenn Lagasse wrote: > > >There's very little you can safely move in my experience. /export > >certainly. Anything else, not really (though ymmv). I tried to > >create > >a seperate zfs dataset for /usr/local. That worked some of the time, > >but it also screwed up my system a time or two during > >image-updates/package installs. > > I'd be very surprised (disappointed?) if /usr/local couldn't be > detached from the rpool. Given that in many cases it's an NFS mount, > I'm curious to know why it would need to be part of the rpool. If it > is a 'dependency' I would consider that a bug. It can be detached, however one issue I ran in to was packages which installed into /usr/local caused problems when those packages were upgraded. Essentially what occurred was that /usr/local was created on the root pool and upon reboot caused the filesystem service to go into maintenance because it couldn't mount the zfs /usr/local dataset on top of the filled /usr/local root pool location. I didn't have time to investigate into it fully. At that point, spinning /usr/local off into it's own zfs dataset just didn't seem worth the hassle. Others mileage may vary. -- Glenn ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] periodic slow responsiveness
On Fri, Sep 25, 2009 at 5:47 PM, Marion Hakanson wrote: > j...@jamver.id.au said: >> For a predominantly NFS server purpose, it really looks like a case of the >> slog has to outperform your main pool for continuous write speed as well as >> an instant response time as the primary criterion. Which might as well be a >> fast (or group of fast) SSDs or 15kRPM drives with some NVRAM in front of >> them. > > I wonder if you ran Richard Elling's "zilstat" while running your > workload. That should tell you how much ZIL bandwidth is needed, > and it would be interesting to see if its stats match with your > other measurements of slog-device traffic. Yes, but if it's on NFS you can just figure out the workload in MB/s and use that as a rough guideline. Problem is most SSD manufactures list sustained throughput with large IO sizes, say 4MB, and not 128K, so it is tricky buying a good SSD that can handle the throughput. > I did some filebench and "tar extract over NFS" tests of J4400 (500GB, > 7200RPM SATA drives), with and without slog, where slog was using the > internal 2.5" 10kRPM SAS drives in an X4150. These drives were behind > the standard Sun/Adaptec internal RAID controller, 256MB battery-backed > cache memory, all on Solaris-10U7. > > We saw slight differences on filebench oltp profile, and a huge speedup > for the "tar extract over NFS" tests with the slog present. Granted, the > latter was with only one NFS client, so likely did not fill NVRAM. Pretty > good results for a poor-person's slog, though: > http://acc.ohsu.edu/~hakansom/j4400_bench.html I did a smiliar test with a 512MB BBU controller and saw no difference with or without the SSD slog, so I didn't end up using it. Does your BBU controller ignore the ZFS flushes? > Just as an aside, and based on my experience as a user/admin of various > NFS-server vendors, the old Prestoserve cards, and NetApp filers, seem > to get very good improvements with relatively small amounts of NVRAM > (128K, 1MB, 256MB, etc.). None of the filers I've seen have ever had > tens of GB of NVRAM. They don't hold on to the cache for a long time, just as long as it takes to write it all to disk. -Ross ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] cannot hold 'xxx': pool must be upgraded
Chris Kirby wrote: On Sep 25, 2009, at 2:43 PM, Robert Milkowski wrote: Chris Kirby wrote: On Sep 25, 2009, at 11:54 AM, Robert Milkowski wrote: That's useful information indeed. I've filed this CR: 6885860 zfs send shouldn't require support for snapshot holds Sorry for the trouble, please look for this to be fixed soon. Thank you. btw: how do you want to fix it? Do you want to acquire a snapshot hold but continue anyway if it is not possible (only in case whene error is ENOTSUP I think)? Or do you want to get rid of it entirely? In this particular case, we should make sure the pool version supports snapshot holds before trying to request (or release) any. We still want to acquire the temporary holds if we can, since that prevents a race with zfs destroy. That case is becoming more common with automated snapshots and their associated retention policies. Yeah, this makes sense. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
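For readers on a pool that is already at version 18 or later, the hold mechanism being discussed looks like this; a generic sketch with illustrative names:

zpool get version tank             # holds require pool version >= 18
zfs hold keep tank/fs@snap         # place a user hold tagged 'keep'
zfs holds tank/fs@snap             # list holds; a held snapshot can't be destroyed
zfs release keep tank/fs@snap      # drop the hold again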
Re: [zfs-discuss] periodic slow responsiveness
j...@jamver.id.au said: > For a predominantly NFS server purpose, it really looks like a case of the > slog has to outperform your main pool for continuous write speed as well as > an instant response time as the primary criterion. Which might as well be a > fast (or group of fast) SSDs or 15kRPM drives with some NVRAM in front of > them. I wonder if you ran Richard Elling's "zilstat" while running your workload. That should tell you how much ZIL bandwidth is needed, and it would be interesting to see if its stats match with your other measurements of slog-device traffic. I did some filebench and "tar extract over NFS" tests of J4400 (500GB, 7200RPM SATA drives), with and without slog, where slog was using the internal 2.5" 10kRPM SAS drives in an X4150. These drives were behind the standard Sun/Adaptec internal RAID controller, 256MB battery-backed cache memory, all on Solaris-10U7. We saw slight differences on filebench oltp profile, and a huge speedup for the "tar extract over NFS" tests with the slog present. Granted, the latter was with only one NFS client, so likely did not fill NVRAM. Pretty good results for a poor-person's slog, though: http://acc.ohsu.edu/~hakansom/j4400_bench.html Just as an aside, and based on my experience as a user/admin of various NFS-server vendors, the old Prestoserve cards, and NetApp filers, seem to get very good improvements with relatively small amounts of NVRAM (128K, 1MB, 256MB, etc.). None of the filers I've seen have ever had tens of GB of NVRAM. Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
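zilstat is a DTrace-based script published by Richard Elling rather than a bundled command; a loose sketch of how it is typically run — the path and argument syntax here are assumptions, so check the script's own usage text:

./zilstat.ksh 10 6     # six 10-second samples of ZIL write activity while the workload runs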
Re: [zfs-discuss] periodic slow responsiveness
On Fri, Sep 25, 2009 at 1:39 PM, Richard Elling wrote: > On Sep 25, 2009, at 9:14 AM, Ross Walker wrote: > >> On Fri, Sep 25, 2009 at 11:34 AM, Bob Friesenhahn >> wrote: >>> >>> On Fri, 25 Sep 2009, Ross Walker wrote: As a side an slog device will not be too beneficial for large sequential writes, because it will be throughput bound not latency bound. slog devices really help when you have lots of small sync writes. A RAIDZ2 with the ZIL spread across it will provide much >>> >>> Surely this depends on the origin of the large sequential writes. If the >>> origin is NFS and the SSD has considerably more sustained write bandwidth >>> than the ethernet transfer bandwidth, then using the SSD is a win. If >>> the SSD accepts data slower than the ethernet can deliver it (which seems to >>> be this particular case) then the SSD is not helping. >>> >>> If the ethernet can pass 100MB/second, then the sustained write >>> specification for the SSD needs to be at least 100MB/second. Since data >>> is buffered in the Ethernet,TCP/IP,NFS stack prior to sending it to ZFS, the >>> SSD should support write bursts of at least double that or else it will >>> not be helping bulk-write performance. >> >> Specifically I was talking NFS as that was what the OP was talking >> about, but yes it does depend on the origin, but you also assume that >> NFS IO goes over only a single 1Gbe interface when it could be over >> multiple 1Gbe interfaces or a 10Gbe interface or even multple 10Gbe >> interfaces. You also assume the IO recorded in the ZIL is just the raw >> IO when there is also meta-data or multiple transaction copies as >> well. >> >> Personnally I still prefer to spread the ZIL across the pool and have >> a large NVRAM backed HBA as opposed to an slog which really puts all >> my IO in one basket. If I had a pure NVRAM device I might consider >> using that as an slog device, but SSDs are too variable for my taste. > > Back of the envelope math says: > 10 Gbe = ~1 GByte/sec of I/O capacity > > If the SSD can only sink 70 MByte/s, then you will need: > int(1000/70) + 1 = 15 SSDs for the slog > > For capacity, you need: > 1 GByte/sec * 30 sec = 30 GBytes Where did the 30 seconds come in here? The amount of time to hold cache depends on how fast you can fill it. > Ross' idea has merit, if the size of the NVRAM in the array is 30 GBytes > or so. I'm thinking you can do less if you don't need to hold it for 30 seconds. > Both of the above assume there is lots of memory in the server. > This is increasingly becoming easier to do as the memory costs > come down and you can physically fit 512 GBytes in a 4u server. > By default, the txg commit will occur when 1/8 of memory is used > for writes. For 30 GBytes, that would mean a main memory of only > 240 Gbytes... feasible for modern servers. > > However, most folks won't stomach 15 SSDs for slog or 30 GBytes of > NVRAM in their arrays. So Bob's recommendation of reducing the > txg commit interval below 30 seconds also has merit. Or, to put it > another way, the dynamic sizing of the txg commit interval isn't > quite perfect yet. [Cue for Neil to chime in... :-)] I'm sorry did I miss something Bob said about the txg commit interval? I looked back and didn't see it, maybe it was off-list? -Ross ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] White box server for OpenSolaris
On Fri, Sep 25, 2009 at 10:56 PM, Toby Thain wrote: > > On 25-Sep-09, at 2:58 PM, Frank Middleton wrote: > >> On 09/25/09 11:08 AM, Travis Tabbal wrote: >>> >>> ... haven't heard if it's a known >>> bug or if it will be fixed in the next version... >> >> Out of courtesy to our host, Sun makes some quite competitive >> X86 hardware. I have absolutely no idea how difficult it is >> to buy Sun machines retail, > > Not very difficult. And there is try and buy. Indeed, at least in Spain and in Italy I had no problem buying workstations. Recently I owned both Sun Ultra 20 M2 and Ultra 24. I had a great feeling with them and price seemed very competitive to me, compared to offers of other mainstream hardware providers. > > People overestimate the cost of Sun, and underestimate the real value of > "fully integrated". +1. People like "fully integration" when it comes, for example, to Apple, iPods and iPhones. When it comes, just to make another example..., to Solaris, ZFS, ECC memory and so forth (do you remember those posts some time ago?), they quickly forget. > > --Toby > >> but it seems they might be missing >> out on an interesting market - robust and scalable SOHO servers >> for the DYI gang ... >> >> Cheers -- Frank >> >> >> ___ >> zfs-discuss mailing list >> zfs-discuss@opensolaris.org >> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss > > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss > -- Ελευθερία ή θάνατος "Programming today is a race between software engineers striving to build bigger and better idiot-proof programs, and the Universe trying to produce bigger and better idiots. So far, the Universe is winning." GPG key: 1024D/FD2229AF ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] periodic slow responsiveness
On Fri, Sep 25, 2009 at 5:24 PM, James Lever wrote: > > On 26/09/2009, at 1:14 AM, Ross Walker wrote: > >> By any chance do you have copies=2 set? > > No, only 1. So the double data going to the slog (as reported by iostat) is > still confusing me and clearly potentially causing significant harm to my > performance. Weird then, I thought that would be an easy explaination. >> Also, try setting zfs_write_limit_override equal to the size of the >> NVRAM cache (or half depending on how long it takes to flush): >> >> echo zfs_write_limit_override/W0t268435456 | mdb -kw > > That’s an interesting concept. All data still appears to go via the slog > device, however, under heavy load my responsive to a new write is typically > below 2s (a few outliers at about 3.5s) and a read (directory listing of a > non-cached entry) is about 2s. > > What will this do once it hits the limit? Will streaming writes now be sent > directly to a txg and streamed to the primary storage devices? (that is > what I would like to see happen). It's sets the max size of a txg to the given size. When it hits that number it flushes to disk. >> As a side an slog device will not be too beneficial for large >> sequential writes, because it will be throughput bound not latency >> bound. slog devices really help when you have lots of small sync >> writes. A RAIDZ2 with the ZIL spread across it will provide much >> higher throughput then an SSD. An example of a workload that benefits >> from an slog device is ESX over NFS, which does a COMMIT for each >> block written, so it benefits from an slog, but a standard media >> server will not (but an L2ARC would be beneficial). >> >> Better workload analysis is really what it is about. > > > It seems that it doesn’t matter what the workload is if the NFS pipe can > sustain more continuous throughput the slog chain can support. Only on large sequentials, small sync IO should benefit from the slog. > I suppose some creative use of the logbias setting might assist this > situation and force all potentially heavy writers directly to the primary > storage. This would, however, negate any benefit for having a fast, low > latency device for those filesystems for the times when it is desirable (any > large batch of small writes, for example). > > Is there a way to have a dynamic, auto logbias type setting depending on the > transaction currently presented to the server such that if it is clearly a > large streaming write it gets treated as logbias=throughput and if it is a > small transaction it gets treated as logbias=latency? (i.e. such that NFS > transactions can be effectively treated as if it was local storage but > minorly breaking the benefits of the txg scheduling). I'll leave that to the Sun guys to answer. -Ross ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] periodic slow responsiveness
On 26/09/2009, at 1:14 AM, Ross Walker wrote: By any chance do you have copies=2 set? No, only 1. So the double data going to the slog (as reported by iostat) is still confusing me and clearly potentially causing significant harm to my performance. Also, try setting zfs_write_limit_override equal to the size of the NVRAM cache (or half depending on how long it takes to flush): echo zfs_write_limit_override/W0t268435456 | mdb -kw That’s an interesting concept. All data still appears to go via the slog device, however, under heavy load my responsive to a new write is typically below 2s (a few outliers at about 3.5s) and a read (directory listing of a non-cached entry) is about 2s. What will this do once it hits the limit? Will streaming writes now be sent directly to a txg and streamed to the primary storage devices? (that is what I would like to see happen). As a side an slog device will not be too beneficial for large sequential writes, because it will be throughput bound not latency bound. slog devices really help when you have lots of small sync writes. A RAIDZ2 with the ZIL spread across it will provide much higher throughput then an SSD. An example of a workload that benefits from an slog device is ESX over NFS, which does a COMMIT for each block written, so it benefits from an slog, but a standard media server will not (but an L2ARC would be beneficial). Better workload analysis is really what it is about. It seems that it doesn’t matter what the workload is if the NFS pipe can sustain more continuous throughput the slog chain can support. I suppose some creative use of the logbias setting might assist this situation and force all potentially heavy writers directly to the primary storage. This would, however, negate any benefit for having a fast, low latency device for those filesystems for the times when it is desirable (any large batch of small writes, for example). Is there a way to have a dynamic, auto logbias type setting depending on the transaction currently presented to the server such that if it is clearly a large streaming write it gets treated as logbias=throughput and if it is a small transaction it gets treated as logbias=latency? (i.e. such that NFS transactions can be effectively treated as if it was local storage but minorly breaking the benefits of the txg scheduling). On 26/09/2009, at 3:39 AM, Richard Elling wrote: Back of the envelope math says: 10 Gbe = ~1 GByte/sec of I/O capacity If the SSD can only sink 70 MByte/s, then you will need: int(1000/70) + 1 = 15 SSDs for the slog For capacity, you need: 1 GByte/sec * 30 sec = 30 GBytes Ross' idea has merit, if the size of the NVRAM in the array is 30 GBytes or so. At this point, enter the fusionIO cards or similar devices. Unfortunately there does not seem to be anything on the market with infinitely fast write capacity (memory speeds) that is also supported under OpenSolaris as a slog device. I think this is precisely what I (and anybody running a general purpose NFS server) need for a general purpose slog device. Both of the above assume there is lots of memory in the server. This is increasingly becoming easier to do as the memory costs come down and you can physically fit 512 GBytes in a 4u server. By default, the txg commit will occur when 1/8 of memory is used for writes. For 30 GBytes, that would mean a main memory of only 240 Gbytes... feasible for modern servers. However, most folks won't stomach 15 SSDs for slog or 30 GBytes of NVRAM in their arrays. 
So Bob's recommendation of reducing the txg commit interval below 30 seconds also has merit. Or, to put it another way, the dynamic sizing of the txg commit interval isn't quite perfect yet. [Cue for Neil to chime in... :-)] How does reducing the txg commit interval really help? WIll data no longer go via the slog once it is streaming to disk? or will data still all be pushed through the slog regardless? For a predominantly NFS server purpose, it really looks like a case of the slog has to outperform your main pool for continuous write speed as well as an instant response time as the primary criterion. Which might as well be a fast (or group of fast) SSDs or 15kRPM drives with some NVRAM in front of them. Is there also a way to throttle synchronous writes to the slog device? Much like the ZFS write throttling that is already implemented, so that there is a gap for new writers to enter when writing to the slog device? (or is this the norm and includes slog writes?) cheers, James ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
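The per-dataset logbias property already gives a coarse, static version of this; a sketch with illustrative dataset names:

zfs set logbias=throughput tank/streams   # bulk/streaming writers bypass the slog
zfs set logbias=latency tank/homes        # small sync writes keep using the slog
zfs get logbias tank/streams tank/homes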
Re: [zfs-discuss] Which directories must be part of rpool?
On Sep 25, 2009, at 16:39, Glenn Lagasse wrote: There's very little you can safely move in my experience. /export certainly. Anything else, not really (though ymmv). I tried to create a seperate zfs dataset for /usr/local. That worked some of the time, but it also screwed up my system a time or two during image-updates/package installs. I'd be very surprised (disappointed?) if /usr/local couldn't be detached from the rpool. Given that in many cases it's an NFS mount, I'm curious to know why it would need to be part of the rpool. If it is a 'dependency' I would consider that a bug. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Which directories must be part of rpool?
Hi David, I believe /opt is an essential file system as it contains software that is maintained by the packaging system. In fact anywhere you install software via pkgadd probably should be in the BE under /rpool/ROOT/bename AFIK it should not even be split from root in the BE under zfs boot (only /var is supported) other wise LU breaks. I have sub directories of /opt like /aop/app which does not contain software installed via pkgadd. I also split off /var/core and /var/crash. Unfortunately when you need to boot -F and import the pool for maintenance it doesn't mount /var causing directory /var/core and /var/crash to be created in the root file system. The system then reboots but when you do a lucreate, or lumount it fails due to /var/core and /var/crash existing on the / file system causing the mount of /var to fail in the ABE. I have found it a bit problematic to split of file systems from / under zfs boot and still have LU work properly. I haven't tried putting split off file systems as apposed to application file systems on a different pool but I believe there may be mount ordering issues with mounting dependent file systems from different pools where the parent file system are not part of the BE or legacy mounts. It is not possible to mount a vxfs file system under a non legacy zone root file system due to ordering issues with mounting on boot (legacy is done before automatic zfs mounts). Perhaps u7 addressed some of there issues as I believe it is now allowable to have zone root file system on a non root pool. These are just my experiences and I'm sure others can give more definitive answers. Perhaps its easier to get some bigger disks. Thanks Peter 2009/9/25 David Abrahams : > > on Fri Sep 25 2009, Cindy Swearingen wrote: > >> Hi David, >> >> All system-related components should remain in the root pool, such as >> the components needed for booting and running the OS. > > Yes, of course. But which *are* those? > >> If you have datasets like /export/home or other non-system-related >> datasets in the root pool, then feel free to move them out. > > Well, for example, surely /opt can be moved? > >> Moving OS components out of the root pool is not tested by us and I've >> heard of one example recently of breakage when usr and var were moved >> to a non-root RAIDZ pool. >> >> It would be cheaper and easier to buy another disk to mirror your root >> pool then it would be to take the time to figure out what could move out >> and then possibly deal with an unbootable system. >> >> Buy another disk and we'll all sleep better. > > Easy for you to say. There's no room left in the machine for another disk. > > -- > Dave Abrahams > BoostPro Computing > http://www.boostpro.com > > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss > ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS ARC vs Oracle cache
Hi,

Definitely large SGA, small ARC. In fact, it's best to disable the ARC altogether for the Oracle filesystems. Blocks in the db_cache (Oracle cache) can be used "as is", while cached data from the ARC needs significant CPU processing before it's inserted back into the db_cache. Not to mention that blocks in the db_cache can remain dirty for longer periods, saving disk writes.

But definitely:
- separate redo disk (preferably a dedicated disk/pool)
- your ZFS filesystem needs to match the Oracle block size (8 KB default)

With your configuration, and assuming nothing else (but the Oracle database server) on the system, a db_cache size in the 70 GiB range would be perfectly acceptable. Don't forget to set pga_aggregate_target to something reasonable too, like 20 GiB.

Christo Kutrovsky Senior DBA The Pythian Group I Blog at: www.pythian.com/news -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
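A sketch of how those recommendations map onto dataset properties — the names are illustrative, and primarycache=metadata (or =none, to keep data out of the ARC entirely) is the usual lever for the "disable the ARC for data" advice:

zfs create -o recordsize=8k -o primarycache=metadata tank/oradata   # match db_block_size
zfs create tank/oraredo                                             # redo on its own dataset (ideally its own pool/disks)
zfs get recordsize,primarycache tank/oradata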
Re: [zfs-discuss] White box server for OpenSolaris
On 25-Sep-09, at 2:58 PM, Frank Middleton wrote: On 09/25/09 11:08 AM, Travis Tabbal wrote: ... haven't heard if it's a known bug or if it will be fixed in the next version... Out of courtesy to our host, Sun makes some quite competitive X86 hardware. I have absolutely no idea how difficult it is to buy Sun machines retail, Not very difficult. And there is try and buy. People overestimate the cost of Sun, and underestimate the real value of "fully integrated". --Toby but it seems they might be missing out on an interesting market - robust and scalable SOHO servers for the DYI gang ... Cheers -- Frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Which directories must be part of rpool?
* David Abrahams (d...@boostpro.com) wrote: > > on Fri Sep 25 2009, Cindy Swearingen wrote: > > > Hi David, > > > > All system-related components should remain in the root pool, such as > > the components needed for booting and running the OS. > > Yes, of course. But which *are* those? > > > If you have datasets like /export/home or other non-system-related > > datasets in the root pool, then feel free to move them out. > > Well, for example, surely /opt can be moved? Don't be so sure. > > Moving OS components out of the root pool is not tested by us and I've > > heard of one example recently of breakage when usr and var were moved > > to a non-root RAIDZ pool. > > > > It would be cheaper and easier to buy another disk to mirror your root > > pool then it would be to take the time to figure out what could move out > > and then possibly deal with an unbootable system. > > > > Buy another disk and we'll all sleep better. > > Easy for you to say. There's no room left in the machine for another disk. The question you're asking can't easily be answered. Sun doesn't test configs like that. If you really want to do this, you'll pretty much have to 'try it and see what breaks'. And you get to keep both pieces if anything breaks. There's very little you can safely move in my experience. /export certainly. Anything else, not really (though ymmv). I tried to create a seperate zfs dataset for /usr/local. That worked some of the time, but it also screwed up my system a time or two during image-updates/package installs. On my 2010.02/123 system I see: bin Symlink to /usr/bin boot/ dev/ devices/ etc/ export/ Safe to move, not tied to the 'root' system kernel/ lib/ media/ mnt/ net/ opt/ platform/ proc/ rmdisk/ root/ Could probably move root's homedir rpool/ sbin/ system/ tmp/ usr/ var/ Other than /export, everything else is considered 'part of the root system'. Thus part of the root pool. Really, if you can't add a mirror for your root pool, then make backups of your root pool (left as an exercise to the reader) and store the non-system specific bits (/export) on you're raidz2 pool. Cheers, -- Glenn ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
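For the backup left as an exercise, a recursive snapshot plus send is often enough; a hedged sketch (paths are illustrative, and restoring still means following the documented root pool recovery procedure):

zfs snapshot -r rpool@backup
zfs send -R rpool@backup | gzip > /tank/backups/rpool.backup.gz    # or pipe to zfs receive on another pool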
Re: [zfs-discuss] Which directories must be part of rpool?
I have no idea why that last mail lost its line feeds. Trying again: On 09/25/09 13:35, David Abrahams wrote: Hi, Since I don't even have a mirror for my root pool "rpool," I'd like to move as much of my system as possible over to my raidz2 pool, "tank." Can someone tell me which parts need to stay in rpool in order for the system to work normally? Thanks. The list of datasets in a root pool should look something like this: rpool rpool/ROOT rpool/ROOT/snv_124 (or whatever version you're running) rpool/ROOT/snv_124/var (you might not have this) rpool/ROOT/snv_121 (or whatever other BEs you still have) rpool/dump rpool/export rpool/export/home rpool/swap plus any other datasets you might have added. Datasets you've added in addition to the above (unless they are zone roots under rpool/ROOT/ ) can be moved to another pool. Anything you have in /export or /export/ home can be moved to another pool. Everything else needs to stay in the root pool. Yes, there are contents of the above datasets that could be moved and your system would still run (you'd have to play with mount points or symlinks to get them included in the Solaris name space), but such a configuration would be non-standard, unsupported, and probably not upgradeable. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] extremely slow writes (with good reads)
Oh, for the record, the drives are 1.5TB SATA, in a 4+1 raidz-1 config. All the drives are on the same LSI 150-6 PCI controller card, and the M/B is a generic something or other with a triple-core, and 2GB RAM. Paul 3:34pm, Paul Archer wrote: Since I got my zfs pool working under solaris (I talked on this list last week about moving it from linux & bsd to solaris, and the pain that was), I'm seeing very good reads, but nada for writes. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Which directories must be part of rpool?
On 09/25/09 13:35, David Abrahams wrote: Hi, Since I don't even have a mirror for my root pool "rpool," I'd like to move as much of my system as possible over to my raidz2 pool, "tank." Can someone tell me which parts need to stay in rpool in order for the system to work normally? Thanks. The list of datasets in a root pool should look something like this: rpool rpool/ROOT rpool/ROOT/snv_124 (or whatever version you're running) rpool/ROOT/snv_124/var (you might not have this) rpool/ROOT/snv_121 (or whatever other BEs you still have) rpool/dump rpool/export rpool/export/home rpool/swap plus any other datasets you might have added. Datasets you've added in addition to the above (unless they are zone roots under rpool/ROOT/ ) can be moved to another pool. Anything you have in /export or /export/ home can be moved to another pool. Everything else needs to stay in the root pool. Yes, there are contents of the above datasets that could be moved and your system would still run (you'd have to play with mount points or symlinks to get them included in the Solaris name space), but such a configuration would be non-standard, unsupported, and probably not upgradeable. lori ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] extremely slow writes (with good reads)
Since I got my zfs pool working under solaris (I talked on this list last week about moving it from linux & bsd to solaris, and the pain that was), I'm seeing very good reads, but nada for writes.

Reads:

r...@shebop:/data/dvds# rsync -aP young_frankenstein.iso /tmp
sending incremental file list
young_frankenstein.iso
^C 1032421376  20%   86.23MB/s    0:00:44

Writes:

r...@shebop:/data/dvds# rsync -aP /tmp/young_frankenstein.iso yf.iso
sending incremental file list
young_frankenstein.iso
^C   68976640   6%    2.50MB/s    0:06:42

This is pretty typical of what I'm seeing.

r...@shebop:/data/dvds# zpool status -v
  pool: datapool
 state: ONLINE
status: The pool is formatted using an older on-disk format. The pool can still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'. Once this is done, the pool will no longer be accessible on older software versions.
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        datapool    ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c2d0s0  ONLINE       0     0     0
            c3d0s0  ONLINE       0     0     0
            c4d0s0  ONLINE       0     0     0
            c6d0s0  ONLINE       0     0     0
            c5d0s0  ONLINE       0     0     0

errors: No known data errors

  pool: syspool
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        syspool     ONLINE       0     0     0
          c0d1s0    ONLINE       0     0     0

errors: No known data errors

(This is while running an rsync from a remote machine to a ZFS filesystem)

r...@shebop:/data/dvds# iostat -xn 5
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
   11.1    4.8  395.8  275.9  5.8  0.1  364.7    4.3   2   5 c0d1
    9.8   10.9  514.3  346.4  6.8  1.4  329.7   66.7  68  70 c5d0
    9.8   10.9  516.6  346.4  6.7  1.4  323.1   66.2  67  70 c6d0
    9.7   10.9  491.3  346.3  6.7  1.4  324.7   67.2  67  70 c3d0
    9.8   10.9  519.9  346.3  6.8  1.4  326.7   67.2  68  71 c4d0
    9.8   11.0  493.5  346.6  3.6  0.8  175.3   37.9  38  41 c2d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t0d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0d1
   64.6   12.6 8207.4  382.1 32.8  2.0  424.7   25.9 100 100 c5d0
   62.2   12.2 7203.2  370.1 27.9  2.0  375.1   26.7  99 100 c6d0
   53.2   11.8 5973.9  390.2 25.9  2.0  398.8   30.5  98  99 c3d0
   49.4   10.6 5398.2  389.8 30.2  2.0  503.7   33.3  99 100 c4d0
   45.2   12.8 5431.4  337.0 14.3  1.0  247.3   17.9  52  52 c2d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t0d0

Any ideas?

Paul ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
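Per-vdev and per-disk views over the same interval would help separate a pool-layout problem from a shared-controller bottleneck (the follow-up elsewhere in the thread notes all drives sit on one PCI LSI card); a generic sketch:

zpool iostat -v datapool 5     # per-vdev ops and bandwidth while the slow rsync runs
iostat -xnz 5                  # per-disk service times; -z hides idle devices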
Re: [zfs-discuss] Which directories must be part of rpool?
on Fri Sep 25 2009, Cindy Swearingen wrote: > Hi David, > > All system-related components should remain in the root pool, such as > the components needed for booting and running the OS. Yes, of course. But which *are* those? > If you have datasets like /export/home or other non-system-related > datasets in the root pool, then feel free to move them out. Well, for example, surely /opt can be moved? > Moving OS components out of the root pool is not tested by us and I've > heard of one example recently of breakage when usr and var were moved > to a non-root RAIDZ pool. > > It would be cheaper and easier to buy another disk to mirror your root > pool then it would be to take the time to figure out what could move out > and then possibly deal with an unbootable system. > > Buy another disk and we'll all sleep better. Easy for you to say. There's no room left in the machine for another disk. -- Dave Abrahams BoostPro Computing http://www.boostpro.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Which directories must be part of rpool?
Hi David, All system-related components should remain in the root pool, such as the components needed for booting and running the OS. If you have datasets like /export/home or other non-system-related datasets in the root pool, then feel free to move them out. Moving OS components out of the root pool is not tested by us and I've heard of one example recently of breakage when usr and var were moved to a non-root RAIDZ pool. It would be cheaper and easier to buy another disk to mirror your root pool then it would be to take the time to figure out what could move out and then possibly deal with an unbootable system. Buy another disk and we'll all sleep better. Cindy On 09/25/09 13:35, David Abrahams wrote: Hi, Since I don't even have a mirror for my root pool "rpool," I'd like to move as much of my system as possible over to my raidz2 pool, "tank." Can someone tell me which parts need to stay in rpool in order for the system to work normally? Thanks. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] cannot hold 'xxx': pool must be upgraded
On Sep 25, 2009, at 2:43 PM, Robert Milkowski wrote: Chris Kirby wrote: On Sep 25, 2009, at 11:54 AM, Robert Milkowski wrote: That's useful information indeed. I've filed this CR: 6885860 zfs send shouldn't require support for snapshot holds Sorry for the trouble, please look for this to be fixed soon. Thank you. btw: how do you want to fix it? Do you want to acquire a snapshot hold but continue anyway if it is not possible (only in case whene error is ENOTSUP I think)? Or do you want to get rid of it entirely? In this particular case, we should make sure the pool version supports snapshot holds before trying to request (or release) any. We still want to acquire the temporary holds if we can, since that prevents a race with zfs destroy. That case is becoming more common with automated snapshots and their associated retention policies. -Chris ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] cannot hold 'xxx': pool must be upgraded
Chris Kirby wrote: On Sep 25, 2009, at 11:54 AM, Robert Milkowski wrote: Hi, I have a zfs send command failing for some reason... # uname -a SunOS 5.11 snv_123 i86pc i386 i86pc Solaris # zfs send -R -I archive-1/archive/x...@rsync-2009-06-01_07:45--2009-06-01_08:50 archive-1/archive/x...@rsync-2009-09-01_07:45--2009-09-01_07:59 >/dev/null cannot hold 'archive-1/archive/x...@rsync-2009-06-01_07:45--2009-06-01_08:50': pool must be upgraded cannot hold 'archive-1/archive/x...@rsync-2009-07-01_07:45--2009-07-01_07:59': pool must be upgraded cannot hold 'archive-1/archive/x...@rsync-2009-08-01_07:45--2009-08-01_10:14': pool must be upgraded cannot hold 'archive-1/archive/x...@rsync-2009-09-01_07:45--2009-09-01_07:59': pool must be upgraded # zfs list -r -t all archive-1/archive/ NAME USED AVAIL REFER MOUNTPOINT archive-1/archive/ 65.6G 7.69T 8.69G /archive-1/archive/ archive-1/archive/x...@rsync-2009-04-21_14:52--2009-04-21_15:13 11.9G - 12.0G - archive-1/archive/x...@rsync-2009-05-01_07:45--2009-05-01_08:06 12.0G - 12.1G - archive-1/archive/x...@rsync-2009-06-01_07:45--2009-06-01_08:50 12.2G - 12.3G - archive-1/archive/x...@rsync-2009-07-01_07:45--2009-07-01_07:59 8.26G - 8.37G - archive-1/archive/x...@rsync-2009-08-01_07:45--2009-08-01_10:14 12.6G - 12.7G - archive-1/archive/x...@rsync-2009-09-01_07:45--2009-09-01_07:59 0 - 8.69G - The pool is at version 14 and all file systems are at version 3. Ahhh... if -R is provided zfs send now calls zfs_hold_range() which later fails in dsl_dataset_user_hold_check() as it checks if dataset is not below SPA_VERSION_USERREFS which is defined as SPA_VERSION_18 and in my case it is 14 so it fails. But I don't really want to upgrade to version 18 as then I won't be able to reboot back to snv_111b (which supports up-to version 14 only). I guess if I would use libzfs from older build it would work as keeping a user hold is not really required... I can understand why it was introduced I'm just unhappy that I can't do zfs send -R -I now without upgrading a pool Probably no point sending the email, as I was looking at the code and dtracing while writing it, but since I've written it I will post it. Maybe someone will find it useful. Robert, That's useful information indeed. I've filed this CR: 6885860 zfs send shouldn't require support for snapshot holds Sorry for the trouble, please look for this to be fixed soon. Thank you. btw: how do you want to fix it? Do you want to acquire a snapshot hold but continue anyway if it is not possible (only in case whene error is ENOTSUP I think)? Or do you want to get rid of it entirely? -- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Best way to convert checksums
I didn't want my question to lead to an answer, but perhaps I should have put more information. My idea is to copy the file system with one of the following: cp -rp zfs send | zfs receive tar cpio But I don't know what would be the best. Then I would do a "diff -r" on them before deleting the old. I don't know the "obscure" (for me) secondary things like attributes, links, extended modes, etc. Thanks again. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
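Of the options listed, send/receive is the one that preserves ZFS-level attributes and lets the new copy pick up the new checksum as it is written; a sketch, assuming the goal is to rewrite data under a new checksum property, that the target dataset does not exist yet, and that all names are illustrative:

zfs set checksum=sha256 tank                               # new blocks under tank use the new checksum
zfs snapshot tank/olddata@migrate
zfs send tank/olddata@migrate | zfs receive tank/newdata   # received blocks are written (and checksummed) fresh
zfs get checksum tank/newdata
diff -r /tank/olddata /tank/newdata                        # spot-check before destroying the old dataset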
[zfs-discuss] Which directories must be part of rpool?
Hi, Since I don't even have a mirror for my root pool "rpool," I'd like to move as much of my system as possible over to my raidz2 pool, "tank." Can someone tell me which parts need to stay in rpool in order for the system to work normally? Thanks. -- Dave Abrahams BoostPro Computing http://www.boostpro.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] New to ZFS: One LUN, multiple zones
2009/9/24 Robert Milkowski
> Mike Gerdts wrote:
>> On Wed, Sep 23, 2009 at 7:32 AM, bertram fukuda wrote:
>>> Thanks for the info Mike.
>>> Just so I'm clear, you suggest: 1) create a single zpool from my LUN, 2) create a single ZFS filesystem, 3) create 2 zones in the ZFS filesystem. Sound right?
>>
>> Correct
>
> Well, I would actually recommend creating a dedicated zfs file system for each zone (which zoneadm should do for you anyway). The reason is that it is then much easier to get information on how much storage each zone is using, you can set a quota or reservation for storage for each zone independently, you can easily clone each zone, snapshot it, etc.

Another thing: if you will use Live Upgrade (and as I understand it, "pkg image-update" does that seamlessly), then besides putting each zone on its own filesystem you should also add another two datasets to be delegated to the zones, where they can store their data. This ensures that during LU you don't boot up with slightly old data in the zones. For example, this could be very important on mail servers, so you don't "forget" new mail in spool directories that arrived after the new boot environment was created but before the reboot.

> --
> Robert Milkowski
> http://milek.blogspot.com
>
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
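A rough sketch of the layout described above, using hypothetical names (tank is the pool built on the LUN, web1 is one of the two zones, and the zone is assumed to already be configured; only the storage pieces and the dataset delegation are shown):

# zfs create tank/zones
# zfs create tank/zones/web1
# zfs set quota=20g tank/zones/web1
# zfs create tank/delegated
# zfs create tank/delegated/web1
# zonecfg -z web1
zonecfg:web1> add dataset
zonecfg:web1:dataset> set name=tank/delegated/web1
zonecfg:web1:dataset> end
zonecfg:web1> exit

The zonepath would point at /tank/zones/web1. With this layout, zfs list shows per-zone usage at a glance, and quotas, reservations, snapshots and clones can be applied to each zone independently, as described above.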
Re: [zfs-discuss] White box server for OpenSolaris
On 09/25/09 11:08 AM, Travis Tabbal wrote: ... haven't heard if it's a known bug or if it will be fixed in the next version...

Out of courtesy to our host, I'll note that Sun makes some quite competitive x86 hardware. I have absolutely no idea how difficult it is to buy Sun machines retail, but it seems they might be missing out on an interesting market - robust and scalable SOHO servers for the DIY gang - certainly OEMs like us recommend them, although there doesn't seem to be a single-box file+application server in the lineup, which might be a disadvantage to some. Also, assuming Oracle keeps the product line going, we plan to give them a serious look when we finally have to replace those sturdy old SPARCs. Unfortunately there aren't entry-level SPARCs in the lineup, but sadly there probably isn't a big enough market to justify them, and small developers don't need the big iron.

It would be interesting to hear from Sun whether they have any specific recommendations for the use of Suns in the DIY SOHO market; AFAIK it is the profits from hardware that go a long way toward supporting Sun's support of FOSS that we are all benefiting from, and there's a good bet that OpenSolaris will run well on Sun hardware :-) Cheers -- Frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Cloning Systems using zpool
The whole pool. Although you can choose to exclude individual datasets from the flar when creating it. lori On 09/25/09 12:03, Peter Pickford wrote: Hi Lori, Is the u8 flash support for the whole root pool or an individual BE using live upgrade? Thanks Peter 2009/9/24 Lori Alt : On 09/24/09 15:54, Peter Pickford wrote: Hi Cindy, Wouldn't touch /reconfigure mv /etc/path_to_inst* /var/tmp/ regenerate all device information? It might, but it's hard to say whether that would accomplish everything needed to move a root file system from one system to another. I just got done modifying flash archive support to work with zfs root on Solaris 10 Update 8. For those not familiar with it, "flash archives" are a way to clone full boot environments across multiple machines. The S10 Solaris installer knows how to install one of these flash archives on a system and then do all the customizations to adapt it to the local hardware and local network environment. I'm pretty sure there's more to the customization than just a device reconfiguration. So feel free to hack together your own solution. It might work for you, but don't assume that you've come up with a completely general way to clone root pools. lori AFIK zfs doesn't care about the device names it scans for them it would only affect things like vfstab. I did a restore from a E2900 to V890 and is seemed to work Created the pool and zfs recieve. I would like to be able to have a zfs send of a minimal build and install it in an abe and activate it. I tried that is test and it seems to work. It seems to work but IM just wondering what I may have missed. I saw someone else has done this on the list and was going to write a blog. It seems like a good way to get a minimal install on a server with reduced downtime. Now if I just knew how to run the installer in and abe without there being an OS there already that would be cool too. Thanks Peter 2009/9/24 Cindy Swearingen : Hi Peter, I can't provide it because I don't know what it is. Even if we could provide a list of items, tweaking the device informaton if the systems are not identical would be too difficult. cs On 09/24/09 12:04, Peter Pickford wrote: Hi Cindy, Could you provide a list of system specific info stored in the root pool? Thanks Peter 2009/9/24 Cindy Swearingen : Hi Karl, Manually cloning the root pool is difficult. We have a root pool recovery procedure that you might be able to apply as long as the systems are identical. I would not attempt this with LiveUpgrade and manually tweaking. http://www.solarisinternals.com/wiki/index.php/ZFS_Troubleshooting_Guide#Complete_Solaris_ZFS_Root_Pool_Recovery The problem is that the amount system-specific info stored in the root pool and any kind of device differences might be insurmountable. Solaris 10 ZFS/flash archive support is available with patches but not for the Nevada release. The ZFS team is working on a split-mirrored-pool feature and that might be an option for future root pool cloning. If you're still interested in a manual process, see the steps below attempted by another community member who moved his root pool to a larger disk on the same system. This is probably more than you wanted to know... 
Cindy

# zpool create -f altrpool c1t1d0s0
# zpool set listsnapshots=on rpool
# SNAPNAME=`date +%Y%m%d`
# zfs snapshot -r rpool/r...@$snapname
# zfs list -t snapshot
# zfs send -R rp...@$snapname | zfs recv -vFd altrpool
# installboot -F zfs /usr/platform/`uname -i`/lib/fs/zfs/bootblk /dev/rdsk/c1t1d0s0
for x86 do
# installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c1t1d0s0
Set the bootfs property on the root pool BE.
# zpool set bootfs=altrpool/ROOT/zfsBE altrpool
# zpool export altrpool
# init 5
remove source disk (c1t0d0s0) and move target disk (c1t1d0s0) to slot0
-insert solaris10 dvd
ok boot cdrom -s
# zpool import altrpool rpool
# init 0
ok boot disk1

On 09/24/09 10:06, Karl Rossing wrote:

I would like to clone the configuration on a v210 with snv_115. The current pool looks like this:

-bash-3.2$ /usr/sbin/zpool status
  pool: rpool
 state: ONLINE
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        rpool         ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c1t0d0s0  ONLINE       0     0     0
            c1t1d0s0  ONLINE       0     0     0

errors: No known data errors

After I run zpool detach rpool c1t1d0s0, how can I remount c1t1d0s0 to /tmp/a so that I can make the changes I need prior to removing the drive and putting it into the new v210?

I suppose I could lucreate -n new_v210, lumount new_v210, edit what I need to, luumount new_v210, luactivate new_v210, zpool detach rpool c1t1d0s0 and then luactivate the original boot environment.

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___
[zfs-discuss] Best way to convert checksums
What is the "Best" way to convert the checksums of an existing ZFS file system from one checksum to another? To me "Best" means safest and most complete. My zpool is 39% used, so there is plenty of space available. Thanks. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Cloning Systems using zpool
Hi Lori, Is the u8 flash support for the whole root pool or an individual BE using live upgrade? Thanks Peter 2009/9/24 Lori Alt : > On 09/24/09 15:54, Peter Pickford wrote: > > Hi Cindy, > > Wouldn't > > touch /reconfigure > mv /etc/path_to_inst* /var/tmp/ > > regenerate all device information? > > > It might, but it's hard to say whether that would accomplish everything > needed to move a root file system from one system to another. > > I just got done modifying flash archive support to work with zfs root on > Solaris 10 Update 8. For those not familiar with it, "flash archives" are a > way to clone full boot environments across multiple machines. The S10 > Solaris installer knows how to install one of these flash archives on a > system and then do all the customizations to adapt it to the local hardware > and local network environment. I'm pretty sure there's more to the > customization than just a device reconfiguration. > > So feel free to hack together your own solution. It might work for you, but > don't assume that you've come up with a completely general way to clone root > pools. > > lori > > AFIK zfs doesn't care about the device names it scans for them > it would only affect things like vfstab. > > I did a restore from a E2900 to V890 and is seemed to work > > Created the pool and zfs recieve. > > I would like to be able to have a zfs send of a minimal build and > install it in an abe and activate it. > I tried that is test and it seems to work. > > It seems to work but IM just wondering what I may have missed. > > I saw someone else has done this on the list and was going to write a blog. > > It seems like a good way to get a minimal install on a server with > reduced downtime. > > Now if I just knew how to run the installer in and abe without there > being an OS there already that would be cool too. > > Thanks > > Peter > > 2009/9/24 Cindy Swearingen : > > > Hi Peter, > > I can't provide it because I don't know what it is. > > Even if we could provide a list of items, tweaking > the device informaton if the systems are not identical > would be too difficult. > > cs > > On 09/24/09 12:04, Peter Pickford wrote: > > > Hi Cindy, > > Could you provide a list of system specific info stored in the root pool? > > Thanks > > Peter > > 2009/9/24 Cindy Swearingen : > > > Hi Karl, > > Manually cloning the root pool is difficult. We have a root pool recovery > procedure that you might be able to apply as long as the > systems are identical. I would not attempt this with LiveUpgrade > and manually tweaking. > > > http://www.solarisinternals.com/wiki/index.php/ZFS_Troubleshooting_Guide#Complete_Solaris_ZFS_Root_Pool_Recovery > > The problem is that the amount system-specific info stored in the root > pool and any kind of device differences might be insurmountable. > > Solaris 10 ZFS/flash archive support is available with patches but not > for the Nevada release. > > The ZFS team is working on a split-mirrored-pool feature and that might > be an option for future root pool cloning. > > If you're still interested in a manual process, see the steps below > attempted by another community member who moved his root pool to a > larger disk on the same system. > > This is probably more than you wanted to know... 
> > Cindy > > > > # zpool create -f altrpool c1t1d0s0 > # zpool set listsnapshots=on rpool > # SNAPNAME=`date +%Y%m%d` > # zfs snapshot -r rpool/r...@$snapname > # zfs list -t snapshot > # zfs send -R rp...@$snapname | zfs recv -vFd altrpool > # installboot -F zfs /usr/platform/`uname -i`/lib/fs/zfs/bootblk > /dev/rdsk/c1t1d0s0 > for x86 do > # installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c1t1d0s0 > Set the bootfs property on the root pool BE. > # zpool set bootfs=altrpool/ROOT/zfsBE altrpool > # zpool export altrpool > # init 5 > remove source disk (c1t0d0s0) and move target disk (c1t1d0s0) to slot0 > -insert solaris10 dvd > ok boot cdrom -s > # zpool import altrpool rpool > # init 0 > ok boot disk1 > > On 09/24/09 10:06, Karl Rossing wrote: > > > I would like to clone the configuration on a v210 with snv_115. > > The current pool looks like this: > > -bash-3.2$ /usr/sbin/zpool status pool: rpool > state: ONLINE > scrub: none requested > config: > > NAME STATE READ WRITE CKSUM > rpool ONLINE 0 0 0 > mirror ONLINE 0 0 0 > c1t0d0s0 ONLINE 0 0 0 > c1t1d0s0 ONLINE 0 0 0 > > errors: No known data errors > > After I run zpool detach rpool c1t1d0s0, how can I remount c1t1d0s0 to > /tmp/a so that I can make the changes I need prior to removing the drive > and > putting it into the new v210. > > I supose I could lucreate -n new_v210, lumount new_v210, edit what I > need > to, luumount new_v210, luactivate new_v210, zpool detach rpool c1t1d0s0 > and > then luactivate the original boot environment. > > > ___ > zfs-discuss ma
Re: [zfs-discuss] ZFS flar image.
Hi Peter, Do you have any notes on what you did to restore a sendfile to an existing BE? I'm interested in creating a 'golden image' and restoring it into a new BE on a running system as part of a hardening project. Thanks Peter

2009/9/14 Peter Karlsson :
> Hi Greg,
>
> We did a hack along those lines when we installed 100 Ultra 27s that were used during J1, but we automated the process by using AI to install a bootstrap image that had an SMF service that pulled over the zfs sendfile, created a new BE and received the sendfile into the new BE. Worked fairly OK; there were a few things that we had to run a few scripts to fix, but by and large it was smooth. I really need to get that blog entry done :)
>
> /peter
>
> Greg Mason wrote:
>>
>> As an alternative, I've been taking a snapshot of rpool on the golden system, sending it to a file, and creating a boot environment from the archived snapshot on target systems. After fiddling with the snapshots a little, I then either appropriately anonymize the system or provide it with its identity. When it boots up, it's ready to go.
>>
>> The only downfall to my method is that I still have to run the full OpenSolaris installer, and I can't exclude anything in the archive.
>>
>> Essentially, it's a poor man's flash archive.
>>
>> -Greg
>>
>> cindy.swearin...@sun.com wrote:
>>>
>>> Hi RB,
>>>
>>> We have a draft of the ZFS/flar image support here:
>>>
>>> http://opensolaris.org/os/community/zfs/boot/flash/
>>>
>>> Make sure you review the Solaris OS requirements.
>>>
>>> Thanks,
>>>
>>> Cindy
>>>
>>> On 09/14/09 11:45, RB wrote: Is it possible to create a flar image of a ZFS root filesystem to install it on other machines?
>>>
>>> ___
>>> zfs-discuss mailing list
>>> zfs-discuss@opensolaris.org
>>> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>>
>> ___
>> zfs-discuss mailing list
>> zfs-discuss@opensolaris.org
>> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
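A very rough outline of the receive-into-a-new-BE idea discussed above, assuming an OpenSolaris-style root pool named rpool and a send file golden.zfs taken from a root dataset (all names hypothetical); the property fix-ups and activation details vary by build, so treat this as a sketch rather than a tested recipe:

# zfs receive rpool/ROOT/golden < /var/tmp/golden.zfs
# zfs set canmount=noauto rpool/ROOT/golden
# zfs set mountpoint=/ rpool/ROOT/golden
# beadm activate golden
# init 6

Any host-specific identity (hostname, network configuration, keys) still has to be fixed up before or at first boot, as Greg describes above.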
Re: [zfs-discuss] periodic slow responsiveness
On Sep 25, 2009, at 9:14 AM, Ross Walker wrote: On Fri, Sep 25, 2009 at 11:34 AM, Bob Friesenhahn wrote: On Fri, 25 Sep 2009, Ross Walker wrote: As a side an slog device will not be too beneficial for large sequential writes, because it will be throughput bound not latency bound. slog devices really help when you have lots of small sync writes. A RAIDZ2 with the ZIL spread across it will provide much Surely this depends on the origin of the large sequential writes. If the origin is NFS and the SSD has considerably more sustained write bandwidth than the ethernet transfer bandwidth, then using the SSD is a win. If the SSD accepts data slower than the ethernet can deliver it (which seems to be this particular case) then the SSD is not helping. If the ethernet can pass 100MB/second, then the sustained write specification for the SSD needs to be at least 100MB/second. Since data is buffered in the Ethernet,TCP/IP,NFS stack prior to sending it to ZFS, the SSD should support write bursts of at least double that or else it will not be helping bulk-write performance. Specifically I was talking NFS as that was what the OP was talking about, but yes it does depend on the origin, but you also assume that NFS IO goes over only a single 1Gbe interface when it could be over multiple 1Gbe interfaces or a 10Gbe interface or even multple 10Gbe interfaces. You also assume the IO recorded in the ZIL is just the raw IO when there is also meta-data or multiple transaction copies as well. Personnally I still prefer to spread the ZIL across the pool and have a large NVRAM backed HBA as opposed to an slog which really puts all my IO in one basket. If I had a pure NVRAM device I might consider using that as an slog device, but SSDs are too variable for my taste. Back of the envelope math says: 10 Gbe = ~1 GByte/sec of I/O capacity If the SSD can only sink 70 MByte/s, then you will need: int(1000/70) + 1 = 15 SSDs for the slog For capacity, you need: 1 GByte/sec * 30 sec = 30 GBytes Ross' idea has merit, if the size of the NVRAM in the array is 30 GBytes or so. Both of the above assume there is lots of memory in the server. This is increasingly becoming easier to do as the memory costs come down and you can physically fit 512 GBytes in a 4u server. By default, the txg commit will occur when 1/8 of memory is used for writes. For 30 GBytes, that would mean a main memory of only 240 Gbytes... feasible for modern servers. However, most folks won't stomach 15 SSDs for slog or 30 GBytes of NVRAM in their arrays. So Bob's recommendation of reducing the txg commit interval below 30 seconds also has merit. Or, to put it another way, the dynamic sizing of the txg commit interval isn't quite perfect yet. [Cue for Neil to chime in... :-)] -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
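The same back-of-the-envelope numbers as a quick shell calculation, purely to make the arithmetic above easy to rerun with different figures; the 70 MByte/s sustained SSD write rate, the ~1 GByte/s of 10 GbE capacity, and the 30 second txg interval are the assumptions already stated in the discussion:

echo $(( 1000 / 70 + 1 ))   # SSDs needed to absorb ~1 GByte/s -> 15
echo $(( 1 * 30 ))          # GBytes of slog/NVRAM to cover a 30 s txg at 1 GByte/s -> 30
echo $(( 30 * 8 ))          # GBytes of RAM at which the default 1/8 write limit reaches 30 GBytes -> 240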
Re: [zfs-discuss] selecting zfs BE from OBP
Hi Donour, You would use the boot -L syntax to select the ZFS BE to boot from, like this: ok boot -L Rebooting with command: boot -L Boot device: /p...@8,60/SUNW,q...@4/f...@0,0/d...@w2104cf7fa6c7,0:a File and args: -L 1 zfs1009BE 2 zfs10092BE Select environment to boot: [ 1 - 2 ]: 2 Then copy and paste the boot string that is provided: To boot the selected entry, invoke: boot [] -Z rpool/ROOT/zfs10092BE Program terminated {0} ok boot -Z rpool/ROOT/zfs10092BE See this pointer as well: http://docs.sun.com/app/docs/doc/819-5461/ggpco?a=view Cindy On 09/25/09 11:09, Donour Sizemore wrote: Can you select the LU boot environment from sparc obp, if the filesystem is zfs? With ufs, you simply invoke 'boot [slice]'. thanks donour ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] selecting zfs BE from OBP
Can you select the LU boot environment from sparc obp, if the filesystem is zfs? With ufs, you simply invoke 'boot [slice]'. thanks donour ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] cannot hold 'xxx': pool must be upgraded
On Sep 25, 2009, at 11:54 AM, Robert Milkowski wrote: Hi, I have a zfs send command failing for some reason... # uname -a SunOS 5.11 snv_123 i86pc i386 i86pc Solaris # zfs send -R -I archive-1/archive/ x...@rsync-2009-06-01_07:45--2009-06-01_08:50 archive-1/archive/ x...@rsync-2009-09-01_07:45--2009-09-01_07:59 >/dev/null cannot hold 'archive-1/archive/ x...@rsync-2009-06-01_07:45--2009-06-01_08:50': pool must be upgraded cannot hold 'archive-1/archive/ x...@rsync-2009-07-01_07:45--2009-07-01_07:59': pool must be upgraded cannot hold 'archive-1/archive/ x...@rsync-2009-08-01_07:45--2009-08-01_10:14': pool must be upgraded cannot hold 'archive-1/archive/ x...@rsync-2009-09-01_07:45--2009-09-01_07:59': pool must be upgraded # zfs list -r -t all archive-1/archive/ NAME USED AVAIL REFER MOUNTPOINT archive-1/archive/ 65.6G 7.69T 8.69G /archive-1/archive/ archive-1/archive/x...@rsync-2009-04-21_14:52--2009-04-21_15:13 11.9G - 12.0G - archive-1/archive/x...@rsync-2009-05-01_07:45--2009-05-01_08:06 12.0G - 12.1G - archive-1/archive/x...@rsync-2009-06-01_07:45--2009-06-01_08:50 12.2G - 12.3G - archive-1/archive/x...@rsync-2009-07-01_07:45--2009-07-01_07:59 8.26G - 8.37G - archive-1/archive/x...@rsync-2009-08-01_07:45--2009-08-01_10:14 12.6G - 12.7G - archive-1/archive/x...@rsync-2009-09-01_07:45--2009-09-01_07:59 0 - 8.69G - The pool is at version 14 and all file systems are at version 3. Ahhh... if -R is provided zfs send now calls zfs_hold_range() which later fails in dsl_dataset_user_hold_check() as it checks if dataset is not below SPA_VERSION_USERREFS which is defined as SPA_VERSION_18 and in my case it is 14 so it fails. But I don't really want to upgrade to version 18 as then I won't be able to reboot back to snv_111b (which supports up-to version 14 only). I guess if I would use libzfs from older build it would work as keeping a user hold is not really required... I can understand why it was introduced I'm just unhappy that I can't do zfs send -R -I now without upgrading a pool Probably no point sending the email, as I was looking at the code and dtracing while writing it, but since I've written it I will post it. Maybe someone will find it useful. Robert, That's useful information indeed. I've filed this CR: 6885860 zfs send shouldn't require support for snapshot holds Sorry for the trouble, please look for this to be fixed soon. -Chris ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] NLM_DENIED_NOLOCKS Solaris 10u5 X4500
Try nfs-disc...@opensolaris.org -- richard On Sep 25, 2009, at 7:28 AM, Chris Banal wrote: This was previously posed to the sun-managers mailing list but the only reply I received recommended I post here at well. We have a production Solaris 10u5 / ZFS X4500 file server which is reporting NLM_DENIED_NOLOCKS immediately for any nfs locking request. The lockd does not appear to be busy so is it possible we have hit some sort of limit on the number of files that can be locked? Are there any items to check before restarting lockd / statd. This appears to have at least temporarily cleared up the issue. Thanks, Chris ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] cannot hold 'xxx': pool must be upgraded
Hi, I have a zfs send command failing for some reason... # uname -a SunOS 5.11 snv_123 i86pc i386 i86pc Solaris # zfs send -R -I archive-1/archive/x...@rsync-2009-06-01_07:45--2009-06-01_08:50 archive-1/archive/x...@rsync-2009-09-01_07:45--2009-09-01_07:59 >/dev/null cannot hold 'archive-1/archive/x...@rsync-2009-06-01_07:45--2009-06-01_08:50': pool must be upgraded cannot hold 'archive-1/archive/x...@rsync-2009-07-01_07:45--2009-07-01_07:59': pool must be upgraded cannot hold 'archive-1/archive/x...@rsync-2009-08-01_07:45--2009-08-01_10:14': pool must be upgraded cannot hold 'archive-1/archive/x...@rsync-2009-09-01_07:45--2009-09-01_07:59': pool must be upgraded # zfs list -r -t all archive-1/archive/ NAME USED AVAIL REFER MOUNTPOINT archive-1/archive/ 65.6G 7.69T 8.69G /archive-1/archive/ archive-1/archive/x...@rsync-2009-04-21_14:52--2009-04-21_15:13 11.9G - 12.0G - archive-1/archive/x...@rsync-2009-05-01_07:45--2009-05-01_08:06 12.0G - 12.1G - archive-1/archive/x...@rsync-2009-06-01_07:45--2009-06-01_08:50 12.2G - 12.3G - archive-1/archive/x...@rsync-2009-07-01_07:45--2009-07-01_07:59 8.26G - 8.37G - archive-1/archive/x...@rsync-2009-08-01_07:45--2009-08-01_10:14 12.6G - 12.7G - archive-1/archive/x...@rsync-2009-09-01_07:45--2009-09-01_07:59 0 - 8.69G - The pool is at version 14 and all file systems are at version 3. Ahhh... if -R is provided zfs send now calls zfs_hold_range() which later fails in dsl_dataset_user_hold_check() as it checks if dataset is not below SPA_VERSION_USERREFS which is defined as SPA_VERSION_18 and in my case it is 14 so it fails. But I don't really want to upgrade to version 18 as then I won't be able to reboot back to snv_111b (which supports up-to version 14 only). I guess if I would use libzfs from older build it would work as keeping a user hold is not really required... I can understand why it was introduced I'm just unhappy that I can't do zfs send -R -I now without upgrading a pool Probably no point sending the email, as I was looking at the code and dtracing while writing it, but since I've written it I will post it. Maybe someone will find it useful. -- Robert Milkowski http://milek.blogspot.com -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
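For anyone else who hits this before the fix for CR 6885860 arrives: per the analysis above, the hold is only taken on the -R (replication) path, so one possible workaround sketch is to check the pool version and, if upgrading is not an option, send each file system's incrementals individually without -R. Dataset and snapshot names below are hypothetical:

# zpool get version archive-1
# zpool upgrade -v                          (lists what each pool version adds)
# zfs send -I tank/fs@snap1 tank/fs@snap4 > /var/tmp/fs.incr

This loses the recursion and property replication that -R provides, so it is only a stopgap.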
Re: [zfs-discuss] Help! System panic when pool imported
Assertion failures indicate bugs. You might try another version of the OS. In general, they are easy to search for in the bugs database. A quick search reveals http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6822816 but that doesn't look like it will help you. I suggest filing a new bug at the very least. http://en.wikipedia.org/wiki/Assertion_(computing) -- richard On Sep 24, 2009, at 10:21 PM, Albert Chin wrote: Running snv_114 on an X4100M2 connected to a 6140. Made a clone of a snapshot a few days ago: # zfs snapshot a...@b # zfs clone a...@b tank/a # zfs clone a...@b tank/b The system started panicing after I tried: # zfs snapshot tank/b...@backup So, I destroyed tank/b: # zfs destroy tank/b then tried to destroy tank/a # zfs destroy tank/a Now, the system is in an endless panic loop, unable to import the pool at system startup or with "zpool import". The panic dump is: panic[cpu1]/thread=ff0010246c60: assertion failed: 0 == zap_remove_int(mos, ds_prev->ds_phys->ds_next_clones_obj, obj, tx) (0x0 == 0x2), file: ../../common/fs/zfs/dsl_dataset.c, line: 1512 ff00102468d0 genunix:assfail3+c1 () ff0010246a50 zfs:dsl_dataset_destroy_sync+85a () ff0010246aa0 zfs:dsl_sync_task_group_sync+eb () ff0010246b10 zfs:dsl_pool_sync+196 () ff0010246ba0 zfs:spa_sync+32a () ff0010246c40 zfs:txg_sync_thread+265 () ff0010246c50 unix:thread_start+8 () We really need to import this pool. Is there a way around this? We do have snv_114 source on the system if we need to make changes to usr/src/uts/common/fs/zfs/dsl_dataset.c. It seems like the "zfs destroy" transaction never completed and it is being replayed, causing the panic. This cycle continues endlessly. -- albert chin (ch...@thewrittenword.com) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS flar image.
On 09/25/09 09:59, RB wrote:

I tried to install the flar image using the method explained in this link http://opensolaris.org/os/community/zfs/boot/flash/ I installed the 119534-15 patch on the box whose flar image was required, then created a flar image using flarcreate -n zfs_flar /flar_dir/zfs_flar.flar. I then installed the 124630-26 patch on the miniroot of the Solaris 05/09 (update 7) net install image. This was done by unpacking the miniroot using root_archive, patching the miniroot using patchadd and then repacking it.

Profile for the jumpstart:
install_type flash_install
archive_location nfs ://zfs_flar.flar
partitioning explicit

But the jumpstart fails with the following error:
Executing SolStart preinstall phase...
Executing begin script "install_begin"...
Begin script install_begin execution completed.
Processing profile - Opening Flash archive
ERROR: Could not mount ://zfs_flar.flar
ERROR: Flash installation failed
Solaris installation program exited.

Any clues what could be wrong?

I don't know. There are all kinds of reasons an NFS mount might fail. One thing you could do is boot the system from the install image and then escape out of the install after going through all the configuration steps (i.e. the questions about name server, routers, etc.). Then try to do an explicit NFS mount of the flar location (onto /mnt or a temporary mount point created in /tmp). If it fails, that may be the source of your problem.

lori ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
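A sketch of the manual check Lori suggests, run after escaping out of the installer (the install server name and share below are hypothetical, standing in for the elided archive_location; only the flar file name comes from the profile above):

# mkdir /tmp/flar
# mount -F nfs installserver:/export/flash /tmp/flar
# ls -l /tmp/flar/zfs_flar.flar

If the mount or the ls fails here, the jumpstart failure likely points at the NFS path or permissions rather than the flash archive itself.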
Re: [zfs-discuss] periodic slow responsiveness
On Fri, Sep 25, 2009 at 11:34 AM, Bob Friesenhahn wrote: > On Fri, 25 Sep 2009, Ross Walker wrote: >> >> As a side an slog device will not be too beneficial for large >> sequential writes, because it will be throughput bound not latency >> bound. slog devices really help when you have lots of small sync >> writes. A RAIDZ2 with the ZIL spread across it will provide much > > Surely this depends on the origin of the large sequential writes. If the > origin is NFS and the SSD has considerably more sustained write bandwidth > than the ethernet transfer bandwidth, then using the SSD is a win. If the > SSD accepts data slower than the ethernet can deliver it (which seems to be > this particular case) then the SSD is not helping. > > If the ethernet can pass 100MB/second, then the sustained write > specification for the SSD needs to be at least 100MB/second. Since data is > buffered in the Ethernet,TCP/IP,NFS stack prior to sending it to ZFS, the > SSD should support write bursts of at least double that or else it will not > be helping bulk-write performance. Specifically I was talking NFS as that was what the OP was talking about, but yes it does depend on the origin, but you also assume that NFS IO goes over only a single 1Gbe interface when it could be over multiple 1Gbe interfaces or a 10Gbe interface or even multple 10Gbe interfaces. You also assume the IO recorded in the ZIL is just the raw IO when there is also meta-data or multiple transaction copies as well. Personnally I still prefer to spread the ZIL across the pool and have a large NVRAM backed HBA as opposed to an slog which really puts all my IO in one basket. If I had a pure NVRAM device I might consider using that as an slog device, but SSDs are too variable for my taste. -Ross ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS flar image.
I tried to install the flar image using the method explained in this link http://opensolaris.org/os/community/zfs/boot/flash/ I installed the 119534-15 patch on the box whose flar image was required, then created a flar image using flarcreate -n zfs_flar /flar_dir/zfs_flar.flar. I then installed the 124630-26 patch on the miniroot of the Solaris 05/09 (update 7) net install image. This was done by unpacking the miniroot using root_archive, patching the miniroot using patchadd and then repacking it.

Profile for the jumpstart:
install_type flash_install
archive_location nfs ://zfs_flar.flar
partitioning explicit

But the jumpstart fails with the following error:
Executing SolStart preinstall phase...
Executing begin script "install_begin"...
Begin script install_begin execution completed.
Processing profile - Opening Flash archive
ERROR: Could not mount ://zfs_flar.flar
ERROR: Flash installation failed
Solaris installation program exited.

Any clues what could be wrong? Thanks. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] periodic slow responsiveness
On Fri, 25 Sep 2009, Ross Walker wrote: As a side an slog device will not be too beneficial for large sequential writes, because it will be throughput bound not latency bound. slog devices really help when you have lots of small sync writes. A RAIDZ2 with the ZIL spread across it will provide much Surely this depends on the origin of the large sequential writes. If the origin is NFS and the SSD has considerably more sustained write bandwidth than the ethernet transfer bandwidth, then using the SSD is a win. If the SSD accepts data slower than the ethernet can deliver it (which seems to be this particular case) then the SSD is not helping. If the ethernet can pass 100MB/second, then the sustained write specification for the SSD needs to be at least 100MB/second. Since data is buffered in the Ethernet,TCP/IP,NFS stack prior to sending it to ZFS, the SSD should support write bursts of at least double that or else it will not be helping bulk-write performance. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] periodic slow responsiveness
On Thu, Sep 24, 2009 at 11:29 PM, James Lever wrote: > > On 25/09/2009, at 11:49 AM, Bob Friesenhahn wrote: > > The commentary says that normally the COMMIT operations occur during > close(2) or fsync(2) system call, or when encountering memory pressure. If > the problem is slow copying of many small files, this COMMIT approach does > not help very much since very little data is sent per file and most time is > spent creating directories and files. > > The problem appears to be slog bandwidth exhaustion due to all data being > sent via the slog creating a contention for all following NFS or locally > synchronous writes. The NFS writes do not appear to be synchronous in > nature - there is only a COMMIT being issued at the very end, however, all > of that data appears to be going via the slog and it appears to be inflating > to twice its original size. > For a test, I just copied a relatively small file (8.4MB in size). Looking > at a tcpdump analysis using wireshark, there is a SETATTR which ends with a > V3 COMMIT and no COMMIT messages during the transfer. > iostat output that matches looks like this: > slog write of the data (17MB appears to hit the slog) [snip] > then a few seconds later, the transaction group gets flushed to primary > storage writing nearly 11.4MB which is inline with raid Z2 (expect around > 10.5MB; 8.4/8*10): [snip] > So I performed the same test with a much larger file (533MB) to see what it > would do, being larger than the NVRAM cache in front of the SSD. Note that > after the second second of activity the NVRAM is full and only allowing in > about the sequential write speed of the SSD (~70MB/s). [snip] > Again, the slog wrote about double the file size (1022.6MB) and a few > seconds later, the data was pushed to the primary storage (684.9MB with an > expectation of 666MB = 533MB/8*10) so again about the right number hit the > spinning platters. [snip] > Can anybody explain what is going on with the slog device in that all data > is being shunted via it and why about double the data size is being written > to it per transaction? By any chance do you have copies=2 set? That will make 2 transactions of 1. Also, try setting zfs_write_limit_override equal to the size of the NVRAM cache (or half depending on how long it takes to flush): echo zfs_write_limit_override/W0t268435456 | mdb -kw Set the PERC flush interval to say 1 second. As a side an slog device will not be too beneficial for large sequential writes, because it will be throughput bound not latency bound. slog devices really help when you have lots of small sync writes. A RAIDZ2 with the ZIL spread across it will provide much higher throughput then an SSD. An example of a workload that benefits from an slog device is ESX over NFS, which does a COMMIT for each block written, so it benefits from an slog, but a standard media server will not (but an L2ARC would be beneficial). Better workload analysis is really what it is about. -Ross ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
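Two quick checks related to the suggestions above, assuming the pool is called tank (hypothetical): the first shows whether copies is set to something other than 1 anywhere in the pool, the second reads the current value of zfs_write_limit_override before overwriting it with the /W command shown above (the variable is a 64-bit value on recent builds, hence /E; adjust if your build differs):

# zfs get -r copies tank
# echo zfs_write_limit_override/E | mdb -k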
Re: [zfs-discuss] White box server for OpenSolaris
> I am after suggestions of motherboard, CPU and ram. > Basically I want ECC ram and at least two PCI-E x4 > channels. As I want to run 2 x AOC-USAS_L8i cards > for 16 drives. Asus M4N82 Deluxe. I have one running with 2 USAS-L8i cards just fine. I don't have all the drives loaded in yet, but the cards are detected and they can use the drives I do have attached. I currently have 8GB of ECC RAM on the board and it's working fine. The ECC options in the BIOS are enabled and it reports the ECC is enabled at boot. It has 3 PCIe x16 slots, I have a graphics card in the other slot, and an Intel e1000g card in the PCIe x1 slot. The onboard peripherals all work, with the exception of the onboard AHCI ports being buggy in b123 under xVM. Not sure what that's all about, I posted in the main discussion board but haven't heard if it's a known bug or if it will be fixed in the next version. It would be nice as my boot drives are on that controller. 2009.06 works fine though. CPU is a Phenom II X3 720. Probably overkill for fileserver duties, but I also want to do some VMs for other things, thus the bug I found with the xVM updates. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Collecting hardware configurations (was Re: White box server for OpenSolaris)
The opensolaris.org site will be transitioning to a wiki-based site soon, as described here: http://www.opensolaris.org/os/about/faq/site-transition-faq/ I think it would be best to use the new site to collect this information because it will be much easier for community members to contribute. I'll provide a heads up when the transition, which has been delayed, is complete. Cindy On 09/25/09 03:31, Eugen Leitl wrote: On Fri, Sep 25, 2009 at 10:18:15AM +0100, Tim Foster wrote: I don't have enough experience myself in terms of knowing what's the best hardware on the market, but from time to time, I do think about upgrading my system at home, and would really appreciate a zfs-community-recommended configuration to use. Any takers? I'm willing to contribute (zfs on Opensolaris, mostly Supermicro boxes and FreeNAS (FreeBSD 7.2, next 8.x probably)). Is there a wiki for that somewhere? ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] NLM_DENIED_NOLOCKS Solaris 10u5 X4500
This was previously posed to the sun-managers mailing list, but the only reply I received recommended I post here as well. We have a production Solaris 10u5 / ZFS X4500 file server which is reporting NLM_DENIED_NOLOCKS immediately for any NFS locking request. The lockd does not appear to be busy, so is it possible we have hit some sort of limit on the number of files that can be locked? Are there any items to check before restarting lockd / statd? Restarting them appears to have at least temporarily cleared up the issue. Thanks, Chris ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] White box server for OpenSolaris
It does seem to come up regularly... perhaps someone with access could throw up a page under the ZFS community with the conclusions (and periodic updates as appropriate).. On Fri, Sep 25, 2009 at 3:32 AM, Erik Trimble wrote: > Nathan wrote: >> >> While I am about to embark on building a home NAS box using OpenSolaris >> with ZFS. >> >> Currently I have a chassis that will hold 16 hard drives, although not in >> caddies - down time doesn't bother me if I need to switch a drive, probably >> could do it running anyways just a bit of a pain. :) >> >> I am after suggestions of motherboard, CPU and ram. Basically I want ECC >> ram and at least two PCI-E x4 channels. As I want to run 2 x AOC-USAS_L8i >> cards for 16 drives. >> >> I want something with a bit of guts but over the top. I know the HCL is >> there but I want to see what other people are using in their solutions. >> > > Go back and look through the archives for this list. We just had this > discussion last month. Let's not rehash it again, as it seems to get redone > way too often. > > > > -- > Erik Trimble > Java System Support > Mailstop: usca22-123 > Phone: x17195 > Santa Clara, CA > > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss > ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Strange issue with ZFS and du (snv_118 and snv_123)
Hi Guys, maybe someone has some time to take a look at my issue, I didn't find a answer using the search. Here we go: I was running a backup of a directory located on a ZFS pool named TimeMachine, before I started the job, I checked the size of the directory called NFS, and du -h or du -s was telling me 25GB for /TimeMachine/NFS. So I started the job after a while I was very surprised that the backup app requested a new tape, and the a new one again, so in total 3 tapes (lto1) for 25GB ??!?! After that I checked the directory again: r...@fileserver:/TimeMachine/NFS# du -h . 25G r...@fileserver:/TimeMachine/NFS# du -s 25861519 r...@fileserver:/TimeMachine/NFS# ls -lh total 25G -rw-r--r-- 1 root root 232G 2009-09-25 14:04 nfs.tar zfs list TimeMachine/NFS NAME USED AVAIL REFER MOUNTPOINT TimeMachine/NFS 24.7G 818G 24.7G /TimeMachine/NFS Also, if I use nautilus under Gnome, he also tells me that Directory NFS used 232GB and not 24.7GB as du and zfs list reports to me ?!?! Same if I mount that share (AFP) from a Mac and via NFS, still got 232GB used for TimeMachine/NFS. r...@fileserver:/Data/nfs_org# ls -lh total 232G -rw-r--r-- 1 root root 232G 2009-09-24 17:57 nfs.tar r...@fileserver:/Data/nfs_org# du -h . 232G. r...@fileserver:/Data/nfs_org# I've upgraded from snv_118 to snv_123 but still the same. I also copy the contend of the directory to another ZFS spool, removed the org content and copy it back again, but I still get an incorrect value! pool: TimeMachine state: ONLINE scrub: none requested config: NAMESTATE READ WRITE CKSUM TimeMachine ONLINE 0 0 0 raidz1ONLINE 0 0 0 c4t1d0 ONLINE 0 0 0 c5t0d0 ONLINE 0 0 0 c6t0d0 ONLINE 0 0 0 c5t1d0 ONLINE 0 0 0 c6t1d0 ONLINE 0 0 0 errors: No known data errors r...@fileserver:/TimeMachine/NFS# zfs get all TimeMachine NAME PROPERTY VALUE SOURCE TimeMachine type filesystem - TimeMachine creation Sat Feb 28 17:48 2009 - TimeMachine used 2.76T - TimeMachine available 818G - TimeMachine referenced1.24T - TimeMachine compressratio 1.00x - TimeMachine mounted yes- TimeMachine quota none default TimeMachine reservation none default TimeMachine recordsize128K default TimeMachine mountpoint/TimeMachine default TimeMachine sharenfs offdefault TimeMachine checksum on default TimeMachine compression offdefault TimeMachine atime on default TimeMachine devices on default TimeMachine exec on default TimeMachine setuidon default TimeMachine readonly offdefault TimeMachine zoned offdefault TimeMachine snapdir hidden default TimeMachine aclmode groupmask default TimeMachine aclinheritrestricted default TimeMachine canmount on default TimeMachine shareiscsioffdefault TimeMachine xattr on default TimeMachine copies1 default TimeMachine version 3 - TimeMachine utf8only off- TimeMachine normalization none - TimeMachine casesensitivity sensitive - TimeMachine vscan offdefault TimeMachine nbmandoffdefault TimeMachine sharesmb offdefault TimeMachine refquota none default TimeMachine refreservationnone default TimeMachine primarycache alldefault TimeMachine secondarycachealldefault TimeMachine usedbysnapshots 0 - TimeMachine usedbydataset 1.24T - TimeMachine usedbychildren1.53T - TimeMachine usedbyrefreservation 0 - TimeMachine logbias latencydefault r...@fileserver:/TimeMachine/NFS# r...@fileserver:/TimeMachine/NFS# zfs list TimeMachine NAME USED AVAIL REFER MOUNTPOINT
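One thing worth ruling out here (offered as a guess, not a diagnosis): du and zfs list report the blocks actually allocated on disk, while ls -l and Nautilus report the logical file length, so a sparse nfs.tar could legitimately show 232G in ls and ~25G in du even with compression off. Comparing the allocated and logical sizes directly makes this visible:

# ls -ls /TimeMachine/NFS/nfs.tar      (first column is the allocated size in blocks)
# du -k /TimeMachine/NFS/nfs.tar

If the allocated size is far smaller than the length, the tar file contains holes, and backup software reading it at its full logical size would be consistent with the three LTO1 tapes.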
Re: [zfs-discuss] Collecting hardware configurations (was Re: White box server for OpenSolaris)
On Fri, Sep 25, 2009 at 10:18:15AM +0100, Tim Foster wrote: > I don't have enough experience myself in terms of knowing what's the > best hardware on the market, but from time to time, I do think about > upgrading my system at home, and would really appreciate a > zfs-community-recommended configuration to use. > > Any takers? I'm willing to contribute (zfs on Opensolaris, mostly Supermicro boxes and FreeNAS (FreeBSD 7.2, next 8.x probably)). Is there a wiki for that somewhere? -- Eugen* Leitl http://leitl.org";>leitl http://leitl.org __ ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Collecting hardware configurations (was Re: White box server for OpenSolaris)
On Fri, 2009-09-25 at 01:32 -0700, Erik Trimble wrote: > Go back and look through the archives for this list. We just had this > discussion last month. Let's not rehash it again, as it seems to get > redone way too often. You know, this seems like such a common question to the list, would we (the zfs community) be interested in coming up with a rolling set of 'recommended' systems that home users could use as a reference, rather than requiring people to trawl through the archives each time? Perhaps a few tiers, with as many user-submitted systems per-tier as we get. * small boot disk + 2 or 3 disks, low power, quiet, small media server * medium boot disk + 3 - 9 disks, home office, larger media server * large boot disk + 9 or more disks, thumper-esque and keep them up to date as new hardware becomes available, with a bit of space on a website somewhere to manage them. These could either be off-the-shelf dedicated NAS systems, or build-to-order machines, but getting their configuration & last-known-price would be useful. I don't have enough experience myself in terms of knowing what's the best hardware on the market, but from time to time, I do think about upgrading my system at home, and would really appreciate a zfs-community-recommended configuration to use. Any takers? cheers, tim ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] White box server for OpenSolaris
Nathan wrote: While I am about to embark on building a home NAS box using OpenSolaris with ZFS. Currently I have a chassis that will hold 16 hard drives, although not in caddies - down time doesn't bother me if I need to switch a drive, probably could do it running anyways just a bit of a pain. :) I am after suggestions of motherboard, CPU and ram. Basically I want ECC ram and at least two PCI-E x4 channels. As I want to run 2 x AOC-USAS_L8i cards for 16 drives. I want something with a bit of guts but over the top. I know the HCL is there but I want to see what other people are using in their solutions. Go back and look through the archives for this list. We just had this discussion last month. Let's not rehash it again, as it seems to get redone way too often. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] White box server for OpenSolaris
I am about to embark on building a home NAS box using OpenSolaris with ZFS. Currently I have a chassis that will hold 16 hard drives, although not in caddies - downtime doesn't bother me if I need to switch a drive, and I could probably do it running anyway, just a bit of a pain. :) I am after suggestions for motherboard, CPU and RAM. Basically I want ECC RAM and at least two PCI-E x4 slots, as I want to run 2 x AOC-USAS-L8i cards for 16 drives. I want something with a bit of guts but not over the top. I know the HCL is there, but I want to see what other people are using in their solutions. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] periodic slow responsiveness
>On Fri, 25 Sep 2009, James Lever wrote:
>> NFS Version 3 introduces the concept of "safe asynchronous writes."
>
>Being "safe" then requires a responsibility level on the client which is often not present. For example, if the server crashes, and then the client crashes, how does the client resend the uncommitted data? If the client had a non-volatile storage cache, then it would be able to responsibly finish the writes that failed.

If the client crashes, it is clear that "work will be lost" up to the point of the last successful commit. Beyond supporting the NFSv3 commit operation and resending the missing operations, nothing more is required of the client: if the client crashes, we know that uncommitted operations may be dropped on the floor.

>The commentary says that normally the COMMIT operations occur during close(2) or fsync(2) system call, or when encountering memory pressure. If the problem is slow copying of many small files, this COMMIT approach does not help very much since very little data is sent per file and most time is spent creating directories and files.

Indeed; the commit is mostly to make sure that the pipe between the server and the client can be filled for write operations.

Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
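If it would help to see how often clients actually issue COMMITs without resorting to packet captures, a small DTrace sketch using the nfsv3 provider (assuming that provider is available on the server's build; probe and argument names are those of the standard nfsv3 provider):

# dtrace -n 'nfsv3:::op-commit-start { @commits[args[0]->ci_remote] = count(); }'

Let it run during a copy and interrupt it with Ctrl-C; it prints a per-client count of COMMIT operations, which makes it easy to see whether commits arrive only at close/fsync time, as described above, or far more frequently.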