[zfs-discuss] Using a zvol from your rpool as zil for another zpool
We have a server with a couple of X-25Es and a bunch of larger SATA disks. To save space, we want to install Solaris 10 (our install is only about 1.4GB) on the X-25Es and use the remaining space on the SSDs as ZIL for a zpool created from the SATA drives. Currently we do this by installing the OS using SVM+UFS (to mirror the OS between the two SSDs) and then using the remaining space on a slice as ZIL for the larger SATA-based zpool. However, SVM+UFS is more annoying to work with as far as LiveUpgrade is concerned. We'd love to use a ZFS root, but that requires that the entire SSD be dedicated to the rpool, leaving no space for ZIL. Or does it? It appears that we could do:

# zfs create -V 24G rpool/zil

on our rpool and then:

# zpool add satapool log /dev/zvol/dsk/rpool/zil

(I realize 24G is probably far more than a ZIL device will ever need.) As rpool is mirrored, this would also take care of redundancy for the ZIL. This lets us have a nifty ZFS rpool for simplified LiveUpgrades and a fast SSD-based ZIL for our SATA zpool as well... What are the downsides to doing this? Will there be a noticeable performance hit? I know I've seen this discussed here before, but wasn't able to come up with the right search terms... Thanks, Ray ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
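For anyone trying the same thing, the whole sequence is just the two commands from the post plus a status check (pool and device names as above; the 24G size is arbitrary):

# zfs create -V 24G rpool/zil
# zpool add satapool log /dev/zvol/dsk/rpool/zil
# zpool status satapool     (the zvol should now appear under a "logs" section)

One caveat worth checking first: removing a log device again requires a pool version that supports log removal (zpool version 19 or later), so on older releases treat the addition as effectively permanent and try it on a scratch pool before committing.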
Re: [zfs-discuss] Dedup RAM requirements, vs. L2ARC?
On 07/01/10 22:33, Erik Trimble wrote: On 7/1/2010 9:23 PM, Geoff Nordli wrote: Hi Erik. Are you saying the DDT will automatically look to be stored in an L2ARC device if one exists in the pool, instead of using ARC? Or is there some sort of memory pressure point where the DDT gets moved from ARC to L2ARC? Thanks, Geoff Good question, and I don't know. My educated guess is the latter (initially stored in ARC, then moved to L2ARC as size increases). Anyone? The L2ARC just holds blocks that have been evicted from the ARC due to memory pressure. The DDT is no different from any other object (e.g. a file). So when looking for a block, ZFS checks first in the ARC, then the L2ARC, and if neither succeeds, reads from the main pool. - Anyone. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
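If you want to watch this behaviour rather than guess, the generic ARC kstats give a rough picture (no dedup-specific counters are assumed here, just the standard ARC/L2ARC statistics):

# kstat -n arcstats | egrep 'size|l2_hits|l2_misses'

Comparing "size" against "l2_size" while a dedup-heavy workload runs shows how much of the working set, DDT included, has been evicted to the cache device, and the l2_hits/l2_misses counters show whether those evicted blocks are actually being found there.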
Re: [zfs-discuss] zfs - filesystem versus directory
I created a zpool called 'data' from 7 disks and created ZFS filesystems on the pool for each Xen VM. I can choose to recursively snapshot all of 'data', or I can snapshot the individual 'directories' (really filesystems). If you use mkdir instead, I don't believe you can snapshot/restore at that level. Malachi de Ælfweald http://www.google.com/profiles/malachid On Thu, Jul 1, 2010 at 9:12 PM, Peter Taps wrote: > Folks, > > While going through a quick tutorial on zfs, I came across a way to create > zfs filesystem within a filesystem. For example: > > # zfs create mytest/peter > > where mytest is a zpool filesystem. > > When done this way, the new filesystem has the mount point as > /mytest/peter. > > When does it make sense to create such a filesystem versus just creating a > directory? > > # mkdir mytest/peter > > Thank you in advance for your help. > > Regards, > Peter > -- > This message posted from opensolaris.org > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss > ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
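To make the granularity point concrete, a small sketch (dataset names are made up):

# zfs create data/vm01
# zfs create data/vm02
# zfs snapshot -r data@nightly          (snapshots every filesystem under 'data')
# zfs snapshot data/vm01@pre-upgrade    (snapshots only the one VM)
# zfs rollback data/vm01@pre-upgrade    (rolls back one VM without touching the others)

With plain mkdir directories inside a single filesystem, the snapshot/rollback boundary is the whole filesystem, not the individual directory.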
Re: [zfs-discuss] Dedup RAM requirements, vs. L2ARC?
On 7/1/2010 9:23 PM, Geoff Nordli wrote: Hi Erik. Are you saying the DDT will automatically look to be stored in an L2ARC device if one exists in the pool, instead of using ARC? Or is there some sort of memory pressure point where the DDT gets moved from ARC to L2ARC? Thanks, Geoff Good question, and I don't know. My educated guess is the latter (initially stored in ARC, then moved to L2ARC as size increases). Anyone? -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Dedup RAM requirements, vs. L2ARC?
> Actually, I think the rule-of-thumb is 270 bytes/DDT entry. It's 200
> bytes of ARC for every L2ARC entry.
>
> DDT doesn't count for this ARC space usage
>
> E.g.: I have 1TB of 4k files that are to be deduped, and it turns
> out that I have about a 5:1 dedup ratio. I'd also like to see how much
> ARC usage I eat up with a 160GB L2ARC.
>
> (1) How many entries are there in the DDT:
>     1TB of 4k files means there are 2^30 files (about 1 billion).
>     However, at a 5:1 dedup ratio, I'm only actually storing
>     20% of that, so I have about 214 million blocks.
>     Thus, I need a DDT of about 270 * 214 million =~ 58GB in size.
> (2) My L2ARC is 160GB in size, but I'm using 58GB for the DDT. Thus,
>     I have 102GB free for use as a data cache.
>     102GB / 4k =~ 27 million blocks can be stored in the
>     remaining L2ARC space.
>     However, 27 million blocks take up:
>     200 * 27 million =~ 5.5GB of space in ARC.
>     Thus, I'd better have at least 5.5GB of RAM allocated
>     solely for L2ARC reference pointers, and no other use.
>
Hi Erik. Are you saying the DDT will automatically look to be stored in an L2ARC device if one exists in the pool, instead of using ARC? Or is there some sort of memory pressure point where the DDT gets moved from ARC to L2ARC? Thanks, Geoff -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
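For a pool that already has dedup enabled, the actual DDT size (rather than a rule-of-thumb estimate) can be read with zdb; the pool name below is a placeholder:

# zdb -DD tank

The output lists, per DDT object, the number of entries and their on-disk and in-core sizes, which is the number to compare against the ARC and L2ARC you have available.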
[zfs-discuss] zfs - filesystem versus directory
Folks, While going through a quick tutorial on zfs, I came across a way to create a zfs filesystem within a filesystem. For example: # zfs create mytest/peter where mytest is a zpool filesystem. When done this way, the new filesystem has the mount point /mytest/peter. When does it make sense to create such a filesystem versus just creating a directory? # mkdir mytest/peter Thank you in advance for your help. Regards, Peter -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] confused about lun alignment
doh! It turns out the host in question is actually a Solaris 10 update 6 host. It appears that a Solaris 10 update 8 host actually sets the start sector to 256. So, to simplify the question: if I'm using ZFS with an EFI label and the full disk, do I even need to worry about LUN alignment? I was all excited to do some benchmarking, and now it appears that zfs will just do the right thing (assuming the right OS update). Thanks, Deet. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
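One quick way to confirm what a given host actually did, without going back into format, is to print the label (device name taken from the original post below; substitute your own, and point prtvtoc at slice 0 if the whole-disk path doesn't resolve):

# prtvtoc /dev/rdsk/c5t6006016070B01E0052374B8B4CAFDE11d0

The "First Sector" value for partition 0 shows whether the pool data starts at 34, 256, or wherever you set it.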
Re: [zfs-discuss] zpool on raw disk. Do I need to format?
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- > boun...@opensolaris.org] On Behalf Of Peter Taps > > I am learning more about zfs storage. It appears, zfs pool can be > created on a raw disk. There is no need to create any partitions, etc. > on the disk. Does this mean there is no need to run "format" on a raw > disk? No need to format. Also, confusingly, the term "format" doesn't mean the same here as it does in other situations. "format" is what you use in Solaris to create partitions (slices) and a few other operations. Without any partitions or slices, the disk will be called something like c8t1d0. But with a partition, it might be c8t1d0p0 or c8t1d0s0. Usually people will do as Cindy said: "zpool create MyData c8t1d0". But depending on who you ask, there is possibly some benefit to creating slices. Specifically, if you want to mirror or replace a drive with a new drive that's not precisely the same size. In later versions of zpool (later than what's currently available in osol 2009.06 or solaris 10) they have applied a patch which works around the "slightly different disk size" problem and eliminates any possible problem with using the whole disk. So generally speaking, you're advised to use the whole disk, the easy way, as Cindy mentioned. ;-) > I have added a new disk to my system. It shows up as > /dev/rdsk/c8t1d0s0. Do I need to format it before I convert it to zfs > storage? Or, can I simply use it as: Again, drop the "s0" from the above. The "s0" means the first slice on the disk. zpool create mypool c8t1d0 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
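For the mirroring case mentioned above, the whole-disk style looks like this (the second and third disk names are invented for illustration):

# zpool create mypool mirror c8t1d0 c8t2d0
# zpool replace mypool c8t2d0 c8t3d0     (later, swap in a replacement disk)

It is this replace/attach step where the "slightly different disk size" problem historically bit people using whole disks, which is what the patch mentioned above works around.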
Re: [zfs-discuss] Help destroying phantom clone (zfs filesystem)
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- > boun...@opensolaris.org] On Behalf Of Alxen4 > > It looks like I have some leftovers of old clones that I cannot delete: > > Clone name is tank/WinSrv/Latest > > I'm trying: > > zfs destroy -f -R tank/WinSrv/Latest > cannot unshare 'tank/WinSrv/Latest': path doesn't exist: unshare(1M) > failed > > Please help me to get rid of this garbage. This may not be what you're experiencing, but I recently had a similar experience. If you "zdb -d poolname" you'll see a bunch of stuff. If you grep for a percent % character, I'm not certain what it means, but I know on my systems, it is a temporary clone that only exists while a "zfs receive" incremental is in progress. If the incremental receive is interrupted, the % is supposed to go away, but if the receive is interrupted due to system crash, then of course, it remains, and prevents other things from being destroyed. As a reproducible thing: If you are receiving an incremental zfs send, and you power cycle the system, one of these things will remain and prevent you from receiving future incrementals. The solution is to simply "zfs destroy" the thing which contains the "%" ... and then the regular incremental works again. But you have a different symptom from what I had. So your problem might be different too. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
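In other words, the cleanup amounts to something like this (the dataset name in the second command is purely hypothetical; destroy whatever name the grep actually turns up on your pool):

# zdb -d tank | grep %
# zfs destroy tank/WinSrv/Latest/%recv

If nothing containing a percent sign shows up, the leftover-receive-clone theory doesn't apply and the original problem is something else.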
Re: [zfs-discuss] Checksum errors with SSD.
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- > boun...@opensolaris.org] On Behalf Of Benjamin Grogg
>
> When I scrub my pool I got a lot of checksum errors :
>
> NAME      STATE     READ WRITE CKSUM
> rpool     DEGRADED     0     0     5
>   c8d0s0  DEGRADED     0     0    71  too many errors
>
> Any hints?

What's the confusion? Replace the drive. If you think it's a false positive (the drive is not actually failing), then zpool clear (or online, or whatever, until the pool looks normal again) and then scrub. If the errors come back, something in that path is definitely failing: possibly the SATA cable that connects to it, or the controller, but 99% of the time it's the drive. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
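Spelled out, the retest cycle is (pool name from the original post):

# zpool clear rpool
# zpool scrub rpool
# zpool status -v rpool     (recheck the CKSUM column once the scrub finishes)

Running fmdump -eV, as suggested elsewhere in this thread, before and after the scrub also shows whether new checksum ereports are still arriving.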
Re: [zfs-discuss] zpool on raw disk. Do I need to format?
Awesome. Thank you, Cindy. Regards, Peter -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] confused about lun alignment
Folks. My env is Solaris 10 update 8 amd64. Does LUN alignment matter when I'm creating zpools on disks (LUNs) with EFI labels and giving zpool the entire disk? I recently read some Sun/Oracle docs and blog posts about adjusting the starting sector for partition 0 (in format -e) to address LUN alignment problems. However, they never mention whether this matters for ZFS or not. In this scenario I have a LUN provisioned from an STK 2540 and I put an EFI label on the disk. Then I modified partition 0 to start at sector 64 (tried 256 as well) and saved those changes. Once I do a zpool create, the first sector gets set to 34. Here is the format output with the adjusted sector:

partition> pr
Current partition table (original):
Total disk sectors available: 31440861 + 16384 (reserved sectors)

Part      Tag    Flag     First Sector     Size     Last Sector
  0        usr    wm              256   14.99GB        31440860
  1 unassigned    wm                0         0               0
  2 unassigned    wm                0         0               0
  3 unassigned    wm                0         0               0
  4 unassigned    wm                0         0               0
  5 unassigned    wm                0         0               0
  6 unassigned    wm                0         0               0
  7 unassigned    wm                0         0               0
  8   reserved    wu         31440861    8.00MB        31457244

Now create the zpool:

zpool create testsector c5t6006016070B01E0052374B8B4CAFDE11d0

Partition table after zpool create:

partition> pr
Current partition table (original):
Total disk sectors available: 31440861 + 16384 (reserved sectors)

Part      Tag    Flag     First Sector     Size     Last Sector
  0        usr    wm               34   14.99GB        31440861

I know I'm missing something here and any tips would be appreciated. TIA, Deet. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zpool import hangs indefinitely (retry post in parts; too long?)
Victor, A little more info on the crash, from the messages file is attached here. I have also decompressed the dump with savecore to generate unix.0, vmcore.0, and vmdump.0. Jun 30 19:39:10 HL-SAN unix: [ID 836849 kern.notice] Jun 30 19:39:10 HL-SAN ^Mpanic[cpu3]/thread=ff0017909c60: Jun 30 19:39:10 HL-SAN genunix: [ID 335743 kern.notice] BAD TRAP: type=e (#pf Page fault) rp=ff0017909790 addr=0 occurred in module "" due to a NULL pointer dereference Jun 30 19:39:10 HL-SAN unix: [ID 10 kern.notice] Jun 30 19:39:10 HL-SAN unix: [ID 839527 kern.notice] sched: Jun 30 19:39:10 HL-SAN unix: [ID 753105 kern.notice] #pf Page fault Jun 30 19:39:10 HL-SAN unix: [ID 532287 kern.notice] Bad kernel fault at addr=0x0 Jun 30 19:39:10 HL-SAN unix: [ID 243837 kern.notice] pid=0, pc=0x0, sp=0xff0017909880, eflags=0x10002 Jun 30 19:39:10 HL-SAN unix: [ID 211416 kern.notice] cr0: 8005003b cr4: 6f8 Jun 30 19:39:10 HL-SAN unix: [ID 624947 kern.notice] cr2: 0 Jun 30 19:39:10 HL-SAN unix: [ID 625075 kern.notice] cr3: 336a71000 Jun 30 19:39:10 HL-SAN unix: [ID 625715 kern.notice] cr8: c Jun 30 19:39:10 HL-SAN unix: [ID 10 kern.notice] Jun 30 19:39:10 HL-SAN unix: [ID 592667 kern.notice]rdi: 282 rsi:15809 rdx: ff03edb1e538 Jun 30 19:39:10 HL-SAN unix: [ID 592667 kern.notice]rcx:5 r8:0 r9: ff03eb2d6a00 Jun 30 19:39:10 HL-SAN unix: [ID 592667 kern.notice]rax: 202 rbx:0 rbp: ff0017909880 Jun 30 19:39:10 HL-SAN unix: [ID 592667 kern.notice]r10: f80d16d0 r11:4 r12:0 Jun 30 19:39:10 HL-SAN unix: [ID 592667 kern.notice]r13: ff03e21bca40 r14: ff03e1a0d7e8 r15: ff03e21bcb58 Jun 30 19:39:10 HL-SAN unix: [ID 592667 kern.notice]fsb:0 gsb: ff03e25fa580 ds: 4b Jun 30 19:39:10 HL-SAN unix: [ID 592667 kern.notice] es: 4b fs:0 gs: 1c3 Jun 30 19:39:10 HL-SAN unix: [ID 592667 kern.notice]trp:e err: 10 rip:0 Jun 30 19:39:10 HL-SAN unix: [ID 592667 kern.notice] cs: 30 rfl:10002 rsp: ff0017909880 Jun 30 19:39:10 HL-SAN unix: [ID 266532 kern.notice] ss: 38 Jun 30 19:39:10 HL-SAN unix: [ID 10 kern.notice] Jun 30 19:39:10 HL-SAN genunix: [ID 655072 kern.notice] ff0017909670 unix:die+dd () Jun 30 19:39:10 HL-SAN genunix: [ID 655072 kern.notice] ff0017909780 unix:trap+177b () Jun 30 19:39:10 HL-SAN genunix: [ID 655072 kern.notice] ff0017909790 unix:cmntrap+e6 () Jun 30 19:39:10 HL-SAN genunix: [ID 802836 kern.notice] ff0017909880 0 () Jun 30 19:39:10 HL-SAN genunix: [ID 655072 kern.notice] ff00179098a0 unix:debug_enter+38 () Jun 30 19:39:10 HL-SAN genunix: [ID 655072 kern.notice] ff00179098c0 unix:abort_sequence_enter+35 () Jun 30 19:39:10 HL-SAN genunix: [ID 655072 kern.notice] ff0017909910 kbtrans:kbtrans_streams_key+102 () Jun 30 19:39:10 HL-SAN genunix: [ID 655072 kern.notice] ff0017909940 conskbd:conskbdlrput+e7 () Jun 30 19:39:10 HL-SAN genunix: [ID 655072 kern.notice] ff00179099b0 unix:putnext+21e () Jun 30 19:39:10 HL-SAN genunix: [ID 655072 kern.notice] ff00179099f0 kbtrans:kbtrans_queueevent+7c () Jun 30 19:39:10 HL-SAN genunix: [ID 655072 kern.notice] ff0017909a20 kbtrans:kbtrans_queuepress+7c () Jun 30 19:39:10 HL-SAN genunix: [ID 655072 kern.notice] ff0017909a60 kbtrans:kbtrans_untrans_keypressed_raw+46 () Jun 30 19:39:10 HL-SAN genunix: [ID 655072 kern.notice] ff0017909a90 kbtrans:kbtrans_processkey+32 () Jun 30 19:39:10 HL-SAN genunix: [ID 655072 kern.notice] ff0017909ae0 kbtrans:kbtrans_streams_key+175 () Jun 30 19:39:10 HL-SAN genunix: [ID 655072 kern.notice] ff0017909b10 kb8042:kb8042_process_key+40 () Jun 30 19:39:10 HL-SAN genunix: [ID 655072 kern.notice] ff0017909b50 kb8042:kb8042_received_byte+109 () Jun 30 19:39:10 
HL-SAN genunix: [ID 655072 kern.notice] ff0017909b80 kb8042:kb8042_intr+6a () Jun 30 19:39:10 HL-SAN genunix: [ID 655072 kern.notice] ff0017909bb0 i8042:i8042_intr+c5 () Jun 30 19:39:11 HL-SAN genunix: [ID 655072 kern.notice] ff0017909c00 unix:av_dispatch_autovect+7c () Jun 30 19:39:11 HL-SAN genunix: [ID 655072 kern.notice] ff0017909c40 unix:dispatch_hardint+33 () Jun 30 19:39:11 HL-SAN genunix: [ID 655072 kern.notice] ff00183552f0 unix:switch_sp_and_call+13 () Jun 30 19:39:11 HL-SAN genunix: [ID 655072 kern.notice] ff0018355340 unix:do_interrupt+b8 () Jun 30 19:39:11 HL-SAN genunix: [ID 655072 kern.notice] ff0018355350 unix:_interrupt+b8 () Jun 30 19:39:11 HL-SAN genunix: [ID 655072 kern.notice] ff00183554a0 unix:htable_steal+198 () Jun 30 19:39:11 HL-SAN genunix: [ID 655072 kern.notice] ff0018355510 unix:htable_alloc+248 () Jun 30 19:39:11 HL-SAN genunix: [ID 655072 kern.notice] ff00183555c0
Re: [zfs-discuss] zpool on raw disk. Do I need to format?
Even easier, use the zpool create command to create a pool on c8t1d0, using the whole disk. Try this: # zpool create MyData c8t1d0 cs On 07/01/10 16:01, Peter Taps wrote: Folks, I am learning more about zfs storage. It appears, zfs pool can be created on a raw disk. There is no need to create any partitions, etc. on the disk. Does this mean there is no need to run "format" on a raw disk? I have added a new disk to my system. It shows up as /dev/rdsk/c8t1d0s0. Do I need to format it before I convert it to zfs storage? Or, can I simply use it as: # zfs create MyData /dev/rdsk/c8t1d0s0 Thank you in advance for your help. Regards, Peter ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] zpool on raw disk. Do I need to format?
Folks, I am learning more about zfs storage. It appears, zfs pool can be created on a raw disk. There is no need to create any partitions, etc. on the disk. Does this mean there is no need to run "format" on a raw disk? I have added a new disk to my system. It shows up as /dev/rdsk/c8t1d0s0. Do I need to format it before I convert it to zfs storage? Or, can I simply use it as: # zfs create MyData /dev/rdsk/c8t1d0s0 Thank you in advance for your help. Regards, Peter -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs destroy hangs machine if snapshot exists- workaround found
On 7/1/2010 12:23 PM, Lo Zio wrote: Thanks roy, I read a lot around and also was thinking it was a dedup-related problem. Although I did not find any indication of how many RAM is enough, and never find something saying "Do not use dedup, it will definitely crash your server". I'm using a Dell Xeon with 4 Gb of RAM, maybe it is not an uber-server but it works really well (when it is not hung, I mean). Do you have an idea about the optimal config to have 1,5T of available space in 10 datasets (5 deduped), and 10 rotating snapshots? Thanks Take a look at the archives for these threads: Dedup RAM requirements, vs. L2ARC? http://mail.opensolaris.org/pipermail/zfs-discuss/2010-June/042661.html Dedup performance hit http://mail.opensolaris.org/pipermail/zfs-discuss/2010-June/042235.html 4GB of RAM is likely to be *way* too small to run dedup with your setup. You almost certainly need a SSD for L2ARC, and probably at least 2x the RAM. The "hangs" you see are likely the Dedup Table being built on-the-fly from the datasets, which is massively I/O intensive. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
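For anyone trying to put a number on their own pool before enabling dedup, zdb can simulate it (the pool name is a placeholder; note that the walk itself takes a while and uses a fair amount of memory):

# zdb -S tank

The summary gives the simulated dedup ratio and the number of blocks that would land in the DDT; multiplying that block count by the ~270 bytes/entry figure quoted below gives a rough idea of the RAM/L2ARC the table would need.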
Re: [zfs-discuss] zfs destroy hangs machine if snapshot exists- workaround found
- Original Message -
> Thanks roy, I read a lot around and also was thinking it was a
> dedup-related problem, although I did not find any indication of how
> much RAM is enough, and never found anything saying "Do not use dedup,
> it will definitely crash your server". I'm using a Dell Xeon with 4 GB
> of RAM; maybe it is not an uber-server but it works really well (when
> it is not hung, I mean).
> Do you have an idea about the optimal config to have 1.5T of available
> space in 10 datasets (5 deduped), and 10 rotating snapshots?
> Thanks

Erik Trimble had a post on this today, pasted below. Look through that, but seriously, with 4 gigs of RAM and no L2ARC, dedup is of no use.

roy

Actually, I think the rule-of-thumb is 270 bytes/DDT entry. It's 200 bytes of ARC for every L2ARC entry.

DDT doesn't count for this ARC space usage

E.g.: I have 1TB of 4k files that are to be deduped, and it turns out that I have about a 5:1 dedup ratio. I'd also like to see how much ARC usage I eat up with a 160GB L2ARC.

(1) How many entries are there in the DDT:
    1TB of 4k files means there are 2^30 files (about 1 billion).
    However, at a 5:1 dedup ratio, I'm only actually storing 20% of that, so I have about 214 million blocks.
    Thus, I need a DDT of about 270 * 214 million =~ 58GB in size.

(2) My L2ARC is 160GB in size, but I'm using 58GB for the DDT. Thus, I have 102GB free for use as a data cache.
    102GB / 4k =~ 27 million blocks can be stored in the remaining L2ARC space.
    However, 27 million blocks take up: 200 * 27 million =~ 5.5GB of space in ARC.
    Thus, I'd better have at least 5.5GB of RAM allocated solely for L2ARC reference pointers, and no other use.

-- Vennlige hilsener / Best regards roy -- Roy Sigurd Karlsbakk (+47) 97542685 r...@karlsbakk.net http://blogg.karlsbakk.net/ -- I all pedagogikk er det essensielt at pensum presenteres intelligibelt. Det er et elementært imperativ for alle pedagoger å unngå eksessiv anvendelse av idiomer med fremmed opprinnelse. I de fleste tilfeller eksisterer adekvate og relevante synonymer på norsk. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs destroy hangs machine if snapshot exists- workaround found
Thanks roy, I read a lot around and also was thinking it was a dedup-related problem, although I did not find any indication of how much RAM is enough, and never found anything saying "Do not use dedup, it will definitely crash your server". I'm using a Dell Xeon with 4 GB of RAM; maybe it is not an uber-server but it works really well (when it is not hung, I mean). Do you have an idea about the optimal config to have 1.5T of available space in 10 datasets (5 deduped), and 10 rotating snapshots? Thanks -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] NexentaStor Community edition 3.0.3 released
Hello, this may not apply to your machine; my setup differs from yours in two ways: * OpenSolaris instead of Nexenta * DL585G1 instead of your DL380G4. Here's my problem: a reproducible crash after a certain time (1:30h in my case). Explanation: the HP machine has enterprise features (ECC RAM) and performs scrubbing of the RAM, just as you can scrub ZFS disks; with the 4 AMD dual-core CPUs, the memory is divided into 4 chunks, and when the scrubber hits a hole, the machine crashes without so much as a crash dump. Solution: add the following to /etc/system:

set snooping=1
set pcplusmp:apic_panic_on_nmi=1
set cpu_ms.AuthenticAMD.15:ao_scrub_policy = 1
set cpu_ms.AuthenticAMD.15:ao_scrub_rate_dcache = 0
set cpu_ms.AuthenticAMD.15:ao_scrub_rate_l2cache = 0
set mc-amd:mc_no_attach=1
set disable_memscrub = 1

Best regards, Oliver -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Mix SAS and SATA drives?
- Original Message - > > As the 15k drives are faster seek-wise (and possibly faster for > > linear I/O), you may want to separate them into different VDEVs or > > even pools, but then, it's quite impossible to give a "correct" > > answer unless knowing what it's going to be used for. > > > Mostly database duty. Then use the 15k drives in a striped mirror - they'll perform quite a bit better that way :p Vennlige hilsener / Best regards roy -- Roy Sigurd Karlsbakk (+47) 97542685 r...@karlsbakk.net http://blogg.karlsbakk.net/ -- I all pedagogikk er det essensielt at pensum presenteres intelligibelt. Det er et elementært imperativ for alle pedagoger å unngå eksessiv anvendelse av idiomer med fremmed opprinnelse. I de fleste tilfeller eksisterer adekvate og relevante synonymer på norsk. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
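A sketch of what "striped mirror" looks like with pairs split across the two JBODs/controllers (as the original poster plans below; controller and target numbers here are invented):

# zpool create dbpool \
    mirror c1t0d0 c2t0d0 \
    mirror c1t1d0 c2t1d0 \
    mirror c1t2d0 c2t2d0

Each vdev pairs a disk on one controller with its twin on the other, so a failed controller degrades the mirrors but keeps the pool importable; adding further "mirror cXtNd0 cYtNd0" pairs widens the stripe and raises IOPS roughly linearly.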
Re: [zfs-discuss] Mix SAS and SATA drives?
> As the 15k drives are faster seek-wise (and possibly faster for linear I/O),
> you may want to separate them into different VDEVs or even pools, but then,
> it's quite impossible to give a "correct" answer without knowing what it's
> going to be used for.

Mostly database duty.

> Also, using 10+ drives in a single VDEV is not really recommended - use fewer
> drives, lose more space, but get a faster system with fewer drives in each VDEV.
> 15x 750GB SATA drives in a single VDEV will give you about 120 IOPS for the
> whole VDEV at most. Using smaller VDEVs will help speed things up and will give
> you better protection against faulty drives (silent or noisy). Once more - see
> http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide

With that much space we can afford mirroring everything. We'll put every disk in a pair across separate JBODs and controllers. Thanks a lot! ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Mix SAS and SATA drives?
- Original Message - > Another question... > We're building a ZFS NAS/SAN out of the following JBODs we already > own: > > > 2x 15x 1000GB SATA > 3x 15x 750GB SATA > 2x 12x 600GB SAS 15K > 4x 15x 300GB SAS 15K > > > That's a lot of spindles we'd like to benefit from, but our assumption > is that we should split these in two separate pools, one for SATA > drives and one for SAS 15K drives. Are we right?

As the 15k drives are faster seek-wise (and possibly faster for linear I/O), you may want to separate them into different VDEVs or even pools, but then, it's quite impossible to give a "correct" answer without knowing what it's going to be used for. Also, using 10+ drives in a single VDEV is not really recommended - use fewer drives, lose more space, but get a faster system with fewer drives in each VDEV. 15x 750GB SATA drives in a single VDEV will give you about 120 IOPS for the whole VDEV at most. Using smaller VDEVs will help speed things up and will give you better protection against faulty drives (silent or noisy). Once more - see http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide

Vennlige hilsener / Best regards roy -- Roy Sigurd Karlsbakk (+47) 97542685 r...@karlsbakk.net http://blogg.karlsbakk.net/ -- I all pedagogikk er det essensielt at pensum presenteres intelligibelt. Det er et elementært imperativ for alle pedagoger å unngå eksessiv anvendelse av idiomer med fremmed opprinnelse. I de fleste tilfeller eksisterer adekvate og relevante synonymer på norsk. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Mix SAS and SATA drives?
Sorry for the formatting, that's:

2x 15x 1000GB SATA
3x 15x 750GB SATA
2x 12x 600GB SAS 15K
4x 15x 300GB SAS 15K

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Mix SAS and SATA drives?
Another question... We're building a ZFS NAS/SAN out of the following JBODs we already own:

2x 15x 1000GB SATA
3x 15x 750GB SATA
2x 12x 600GB SAS 15K
4x 15x 300GB SAS 15K

That's a lot of spindles we'd like to benefit from, but our assumption is that we should split these in two separate pools, one for SATA drives and one for SAS 15K drives. Are we right? Thanks ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Expected throughput
Hi! We've put 28x 750GB SATA drives in a RAIDZ2 pool (a single vdev) and we get about 80MB/s in sequential read or write. We're running local tests on the server itself (no network involved). Is that what we should be expecting? It seems slow to me.

Please read the ZFS best practices guide at http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide To summarise, putting 28 disks in a single vdev is nothing you would do if you want performance. You'll end up with as many IOPS as a single drive can do. Split it up into smaller (<10 disk) vdevs and try again. If you need high performance, put them in a striped mirror (aka RAID1+0).

A little addition - for 28 drives, I guess I'd choose four vdevs with seven drives each in raidz2. You'll lose space, but it'll be four times faster, and four times safer. Better safe than sorry.

Vennlige hilsener / Best regards roy -- Roy Sigurd Karlsbakk (+47) 97542685 r...@karlsbakk.net http://blogg.karlsbakk.net/ -- I all pedagogikk er det essensielt at pensum presenteres intelligibelt. Det er et elementært imperativ for alle pedagoger å unngå eksessiv anvendelse av idiomer med fremmed opprinnelse. I de fleste tilfeller eksisterer adekvate og relevante synonymer på norsk. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
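Written out for 28 disks, that four-by-seven raidz2 layout would be created in one go (device names are invented; substitute your own):

# zpool create tank \
    raidz2 c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0 c2t6d0 \
    raidz2 c2t7d0 c2t8d0 c2t9d0 c2t10d0 c2t11d0 c2t12d0 c2t13d0 \
    raidz2 c2t14d0 c2t15d0 c2t16d0 c2t17d0 c2t18d0 c2t19d0 c2t20d0 \
    raidz2 c2t21d0 c2t22d0 c2t23d0 c2t24d0 c2t25d0 c2t26d0 c2t27d0

That trades eight disks of capacity for parity, but gives four vdevs' worth of IOPS and tolerates two failed drives per vdev.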
Re: [zfs-discuss] Supermicro AOC-USAS-L8i
> On a slightly different but related topic, anyone have advice on how > to connect up my drives? I've got room for 20 pool drives in the case. > I'll have two AOC-USAS-L8i cards along with cables to connect 16 SATA2 > drives. The motherboard has 6 SATA2 connectors plus 2 SATA3 > connectors. I was planning to use the SATA3 connectors for the boot > drives (hopefully mirrored ZFS). Initially I'll have three 2TB drives > with the intention of using raidz1 on them. Phase 2 will be three more > 2TB drives. Phase 3 will be three 1.5TB drives. (I already have most > of the drives for phase 2 and 3.) I'll probably fill the rest with a > JBOD pool of assorted drives I have lying around.

I'd try to grab more drives of the same size to be able to do RAIDz2. It's quite common that a drive fails and there are 'silent' errors (not detectable by the drive's CRC, but detectable by ZFS) during resilver. This will give you data corruption. If you have RAIDz2, ZFS will figure that out.

> I'm specifically wondering if I'll see better performance by spreading > the raidz1 array devices across the controllers or if I should keep an > array to a single controller as much as possible.

This has been debated a lot. If you put a VDEV (or pool) on a single controller and that controller dies, your pool will be unavailable until the controller is replaced. If you spread the disks over more controllers, you stand a better chance of keeping the data accessible until you replace the controller. Again, raidz2 will help out there as well.

Vennlige hilsener / Best regards roy -- Roy Sigurd Karlsbakk (+47) 97542685 r...@karlsbakk.net http://blogg.karlsbakk.net/ -- I all pedagogikk er det essensielt at pensum presenteres intelligibelt. Det er et elementært imperativ for alle pedagoger å unngå eksessiv anvendelse av idiomer med fremmed opprinnelse. I de fleste tilfeller eksisterer adekvate og relevante synonymer på norsk. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs destroy hangs machine if snapshot exists- workaround found
- Original Message - > I also have this problem, with 134 if I delete big snapshots the > server hangs only responding to ping. > I also have the ZVOL issue. > Any news about having them solved? > In my case this is a big problem since I'm using osol as a file > server... Are you using dedup? If so, that's the reason - dedup isn't ready for production yet unless you have sufficient RAM/L2ARC Vennlige hilsener / Best regards roy -- Roy Sigurd Karlsbakk (+47) 97542685 r...@karlsbakk.net http://blogg.karlsbakk.net/ -- I all pedagogikk er det essensielt at pensum presenteres intelligibelt. Det er et elementært imperativ for alle pedagoger å unngå eksessiv anvendelse av idiomer med fremmed opprinnelse. I de fleste tilfeller eksisterer adekvate og relevante synonymer på norsk. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Expected throughput
Hi! We've put 28x 750GB SATA drives in a RAIDZ2 pool (a single vdev) and we get about 80MB/s in sequential read or write. We're running local tests on the server itself (no network involved). Is that what we should be expecting? It seems slow to me.

Please read the ZFS best practices guide at http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide To summarise, putting 28 disks in a single vdev is nothing you would do if you want performance. You'll end up with as many IOPS as a single drive can do. Split it up into smaller (<10 disk) vdevs and try again. If you need high performance, put them in a striped mirror (aka RAID1+0).

Vennlige hilsener / Best regards roy -- Roy Sigurd Karlsbakk (+47) 97542685 r...@karlsbakk.net http://blogg.karlsbakk.net/ -- I all pedagogikk er det essensielt at pensum presenteres intelligibelt. Det er et elementært imperativ for alle pedagoger å unngå eksessiv anvendelse av idiomer med fremmed opprinnelse. I de fleste tilfeller eksisterer adekvate og relevante synonymer på norsk. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS on external iSCSI storage
> > The best would be to export the drives in JBOD style, one "array" per drive. If you rely on the Promise RAID, you won't be able to recover from "silent" errors. I'm in the process of moving from a NexSAN RAID to a JBOD-like style just because of that (we had data corruption on the RAID, stuff the NexSAN box didn't detect, as the drives didn't detect it). As this NexSAN box can't export true JBOD, I export each pair of drives as a stripe, and use them in VDEVs in RAIDz2s and a spare or two. While not as good as local drives, it'll be better than trusting the 'hardware' RAID on the box (which has proven to be unable to detect silent errors).

> Yup, we just did the same thing with our PAC storage units. Created 48 Raid 0 stripes with 1 drive each. We originally tried 4 drives per Raid 0 stripe but we ran into massive performance hits.

In case someone comes across NexSAN SATABOY boxes, keep in mind that they can neither do JBOD nor make more than 14 RAID sets. I ended up exporting three drives for each RAID-0 set and then used RAIDz2 across them together with a spare.

Vennlige hilsener / Best regards roy -- Roy Sigurd Karlsbakk (+47) 97542685 r...@karlsbakk.net http://blogg.karlsbakk.net/ -- I all pedagogikk er det essensielt at pensum presenteres intelligibelt. Det er et elementært imperativ for alle pedagoger å unngå eksessiv anvendelse av idiomer med fremmed opprinnelse. I de fleste tilfeller eksisterer adekvate og relevante synonymer på norsk. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Expected throughput
Hi! We've put 28x 750GB SATA drives in a RAIDZ2 pool (a single vdev) and we get about 80MB/s in sequential read or write. We're running local tests on the server itself (no network involved). Is that what we should be expecting? It seems slow to me. Thanks___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS on external iSCSI storage
- Original Message - > I'm new with ZFS, but I have had good success using it with raw > physical disks. One of my systems has access to an iSCSI storage > target. The underlying physical array is in a propreitary disk storage > device from Promise. So the question is, when building a OpenSolaris > host to store its data on an external iSCSI device, is there anything > conceptually wrong with creating a raidz pool from a group of "raw" > LUNs carved from the iSCSI device?

The best would be to export the drives in JBOD style, one "array" per drive. If you rely on the Promise RAID, you won't be able to recover from "silent" errors. I'm in the process of moving from a NexSAN RAID to a JBOD-like style just because of that (we had data corruption on the RAID, stuff the NexSAN box didn't detect, as the drives didn't detect it). As this NexSAN box can't export true JBOD, I export each pair of drives as a stripe, and use them in VDEVs in RAIDz2s and a spare or two. While not as good as local drives, it'll be better than trusting the 'hardware' RAID on the box (which has proven to be unable to detect silent errors).

Vennlige hilsener / Best regards roy -- Roy Sigurd Karlsbakk (+47) 97542685 r...@karlsbakk.net http://blogg.karlsbakk.net/ -- I all pedagogikk er det essensielt at pensum presenteres intelligibelt. Det er et elementært imperativ for alle pedagoger å unngå eksessiv anvendelse av idiomer med fremmed opprinnelse. I de fleste tilfeller eksisterer adekvate og relevante synonymer på norsk. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Help destroying phantom clone (zfs filesystem)
It looks like I have some leftovers of old clones that I cannot delete: Clone name is tank/WinSrv/Latest I'm trying: zfs destroy -f -R tank/WinSrv/Latest cannot unshare 'tank/WinSrv/Latest': path doesn't exist: unshare(1M) failed Please help me to get rid of this garbage. Thanks a lot. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Checksum errors with SSD.
Hi Benjamin, I'm not familiar with this disk but you can see the fmstat output that disk, system event, and zfs-related diagnostics are on overtime about something and its probably this disk. You can get further details from fmdump -eV and you will probably see lots of checksum errors on this disk. You might review some of the h/w diagnostic recommendations in this wiki: http://www.solarisinternals.com/wiki/index.php/ZFS_Troubleshooting_Guide I would recommend replacing the disk, soon, or figure out what other issue might be causing problems for this disk. Thanks, Cindy Benjamin Grogg wrote: Dear Forum I use a KINGSTON SNV125-S2/30GB SSD on a ASUS M3A78-CM Motherboard (AMD SB700 Chipset). SATA Type (in BIOS) is SATA Os : SunOS homesvr 5.11 snv_134 i86pc i386 i86pc When I scrub my pool I got a lot of checksum errors : NAMESTATE READ WRITE CKSUM rpool DEGRADED 0 0 5 c8d0s0DEGRADED 0 071 too many errors zpool clear rpool works after a scrub I have again the same situation. fmstat looks like this : module ev_recv ev_acpt wait svc_t %w %b open solve memsz bufsz cpumem-retire0 0 0.00.0 0 0 0 0 0 0 disk-transport 0 0 0.0 1541.1 0 0 0 032b 0 eft 1 0 0.04.7 0 0 0 0 1.2M 0 ext-event-transport 3 0 0.02.1 0 0 0 0 0 0 fabric-xlate 0 0 0.00.0 0 0 0 0 0 0 fmd-self-diagnosis 6 0 0.00.0 0 0 0 0 0 0 io-retire0 0 0.00.0 0 0 0 0 0 0 sensor-transport 0 0 0.0 37.3 0 0 0 032b 0 snmp-trapgen 3 0 0.01.1 0 0 0 0 0 0 sysevent-transport 0 0 0.0 2836.3 0 0 0 0 0 0 syslog-msgs 3 0 0.02.7 0 0 0 0 0 0 zfs-diagnosis 91 77 0.0 28.9 0 0 2 1 336b 280b zfs-retire 10 0 0.0 387.9 0 0 0 0 620b 0 fmadm looks like this : --- -- - TIMEEVENT-ID MSG-ID SEVERITY --- -- - Jun 30 16:37:28 806072e5-7cd6-efc1-c89d-d40bce4adf72 ZFS-8000-GHMajor Host: homesvr Platform: System-Product-Name Chassis_id : System-Serial-Number Product_sn : Fault class : fault.fs.zfs.vdev.checksum Affects : zfs://pool=rpool/vdev=f7dad7554a72b3bc faulted but still in service Problem in : zfs://pool=rpool/vdev=f7dad7554a72b3bc faulted but still in service In /var/adm/messages I don't have any abnormal issues. I can put the SSD also on a other SATA-Port but without success. My other HDD runs smoothly : NAMESTATE READ WRITE CKSUM tankONLINE 0 0 0 mirror-0 ONLINE 0 0 0 c4d1ONLINE 0 0 0 c5d0ONLINE 0 0 0 iostat gives me following : c4d1 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0 Model: WDC WD10EVDS-63 Revision: Serial No: WD-WCAV592 Size: 1000.20GB <1000202305536 bytes> Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 Illegal Request: 0 c5d0 Soft Errors: 981 Hard Errors: 0 Transport Errors: 981 Model: Hitachi HDS7210 Revision: Serial No: JP2921HQ0 Size: 1000.20GB <1000202305536 bytes> Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 Illegal Request: 0 c8d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0 Model: KINGSTON SSDNOW Revision: Serial No: 30PM10I Size: 30.02GB <30016659456 bytes> Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 Any hints? Best regards and many thanks for your help! Benjamin ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] ZFS on external iSCSI storage
I'm new with ZFS, but I have had good success using it with raw physical disks. One of my systems has access to an iSCSI storage target. The underlying physical array is in a propreitary disk storage device from Promise. So the question is, when building a OpenSolaris host to store its data on an external iSCSI device, is there anything conceptually wrong with creating a raidz pool from a group of "raw" LUNs carved from the iSCSI device? Thanks for your advice. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] optimal ZFS filesystem layout on JBOD
Joachim Worringen wrote: > Greetings, > > we are running a few databases of currently 200GB > (growing) in total for data warehousing: > - new data via INSERTs for (up to) millions of rows > per day; sometimes with UPDATEs > - most data in a single table (=> 10 to 100s of > millions of rows) > - queries SELECT subsets of this table via an index > - for effective parallelisation, queries create > (potentially large) non-temporary tables which are > deleted at the end of the query => lots of simple > INSERTs and SELECTs during queries > - large transactions: they may contain millions of > INSERTs/UPDATEs > - running version PostgreSQL 8.4.2 > > We are moving all this to a larger system - the > hardware is available, therefore fixed: > - Sun X4600 (16 cores, 64GB) > - external SAS JBOD with 24 2,5" slots: > o 18x SAS 10k 146GB drives > o 2x SAS 10k 73GB drives > o 4x Intel SLC 32GB SATA SSD > JBOD connected to Adaptec SAS HBA with BBU > - Internal storage via on-board RAID HBA: > o 2x 73GB SAS 10k for OS (RAID1) > o 2x Intel SLC 32GB SATA SSD for ZIL (RAID1) (?) > - OS will be Solaris 10 to have ZFS as filesystem > (and dtrace) > - 10GigE towards client tier (currently, another > X4600 with 32cores and 64GB) > > What would be the optimal storage/ZFS layout for > this? I checked solarisinternals.com and some > PostgreSQL resources and came to the following > concept - asking for your comments: > - run the JBOD without HW-RAID, but let all > redundancy be done by ZFS for maximum flexibility > - create separate ZFS pools for tablespaces (data, > index, temp) and WAL on separate devices (LUNs): > - use the 4 SSDs in the JBOD as Level-2 ARC cache > (can I use a single cache for all pools?) w/o > redundancy > - use the 2 SSDs connected to the on-board HBA as > RAID1 for ZFS ZIL > > Potential issues that I see: > - the ZFS ZIL will not benefit from a BBU (as it is > connected to the backplane, driven by the > onboard-RAID), and might be too small (32GB for ~2TB > of data with lots of writes)? > - the pools on the JBOD might have the wrong size for > the tablespaces - like: using the 2 73GB drives as > RAID 1 for temp might become too small, but adding a > 146GB drive might not be a good idea? > - with 20 spindles, does it make sense at all to use > dedicated devices for the tabelspaces, or will the > load be distributed well enough across the spindles > anyway? > > thanks for any comments & suggestions, > > Joachim I'll chime in based on some tuning experience I had under UFS with Pg 7.x coupled with some experience with ZFS, but not experience with later Pg on ZFS. Take this with a grain of salt. Pg loves to push everything to the WAL and then dribble the changes back to the datafiles when convenient. At a checkpoint, all of the changes are flushed in bulk to the tablespace. Since the changes to WAL and disk are synchronous, ZIL is used, which I believe translates to all data being written four times under ZFS: once to WAL ZIL, then to WAL, then to tablespace ZIL, then to tablespace. For writes, I would break WAL into it's own pool and then put an SSD ZIL mirror on that. It would allow all writes to be nearly instant to WAL and would keep the ZIL needs to the size of the WAL, which probably won't exceed the size of your SSD. The ZIL on WAL will especially help with large index updates which can cause cascading b-tree splits and result in large amounts of small syncronous I/O, bringing Pg to a crawl. 
Checkpoints will still slow things down when the data is flushed to the tablespace pool, but that will happen with coalesced writes, so iops would be less of a concern. For reads, I would either keep indexes and tables on the same pool and back them with as much L2ARC as needed for the working set, or if you lack sufficient L2ARC, break the indexes into their own pool and L2ARC those instead, because index reads generally are more random and heavily used, at least for well tuned queries. Full table scans for well-vacuumed tables are generally sequential in nature, so table iops again are less of a concern. If you have to break the indexes into their own pool for dedicated SSD L2ARC, you might consider adding some smaller or short-stroked 15K drives for L2ARC on the table pool. For geometry, find the redundancy that you need, e.g. +1, +2 or +3, then decide which is more important, space or iops. If L2ARC and ZIL reduce your need for iops, then go with RAIDZ[123]. If you still need the iops, pile a bunch of [123]-way mirrors together. Yes, I would avoid HW raid and run pure JBOD and would be tempted to keep temp tables on the index or table pool. Like I said above, take this with a grain of salt and feel free to throw out, disagree with or lampoon me for anything that does not resonate with you. Whatever you do, make sure you stress-test the configuration with production-size data and workloads before you deploy it. Good luck, Marty -- This message posted from
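A rough zpool translation of that advice, with entirely made-up device names (c3* standing in for the SAS spindles, c4t0/c4t1 for the on-board SSDs, c4t2/c4t3 for two of the JBOD SSDs) and deliberately small vdev counts just to show the shape, not the sizing:

# zpool create wal mirror c3t0d0 c3t1d0 log mirror c4t0d0 c4t1d0
# zpool create pgdata mirror c3t2d0 c3t3d0 mirror c3t4d0 c3t5d0 \
      mirror c3t6d0 c3t7d0 cache c4t2d0 c4t3d0

WAL gets its own small pool with a mirrored SSD slog; tables and indexes share a striped-mirror pool with SSD L2ARC. How many mirror pairs go into pgdata, and whether the indexes get split into their own pool, is the working-set question the reply leaves to testing.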
[zfs-discuss] Checksum errors with SSD.
Dear Forum I use a KINGSTON SNV125-S2/30GB SSD on a ASUS M3A78-CM Motherboard (AMD SB700 Chipset). SATA Type (in BIOS) is SATA Os : SunOS homesvr 5.11 snv_134 i86pc i386 i86pc When I scrub my pool I got a lot of checksum errors : NAMESTATE READ WRITE CKSUM rpool DEGRADED 0 0 5 c8d0s0DEGRADED 0 071 too many errors zpool clear rpool works after a scrub I have again the same situation. fmstat looks like this : module ev_recv ev_acpt wait svc_t %w %b open solve memsz bufsz cpumem-retire0 0 0.00.0 0 0 0 0 0 0 disk-transport 0 0 0.0 1541.1 0 0 0 032b 0 eft 1 0 0.04.7 0 0 0 0 1.2M 0 ext-event-transport 3 0 0.02.1 0 0 0 0 0 0 fabric-xlate 0 0 0.00.0 0 0 0 0 0 0 fmd-self-diagnosis 6 0 0.00.0 0 0 0 0 0 0 io-retire0 0 0.00.0 0 0 0 0 0 0 sensor-transport 0 0 0.0 37.3 0 0 0 032b 0 snmp-trapgen 3 0 0.01.1 0 0 0 0 0 0 sysevent-transport 0 0 0.0 2836.3 0 0 0 0 0 0 syslog-msgs 3 0 0.02.7 0 0 0 0 0 0 zfs-diagnosis 91 77 0.0 28.9 0 0 2 1 336b 280b zfs-retire 10 0 0.0 387.9 0 0 0 0 620b 0 fmadm looks like this : --- -- - TIMEEVENT-ID MSG-ID SEVERITY --- -- - Jun 30 16:37:28 806072e5-7cd6-efc1-c89d-d40bce4adf72 ZFS-8000-GHMajor Host: homesvr Platform: System-Product-Name Chassis_id : System-Serial-Number Product_sn : Fault class : fault.fs.zfs.vdev.checksum Affects : zfs://pool=rpool/vdev=f7dad7554a72b3bc faulted but still in service Problem in : zfs://pool=rpool/vdev=f7dad7554a72b3bc faulted but still in service In /var/adm/messages I don't have any abnormal issues. I can put the SSD also on a other SATA-Port but without success. My other HDD runs smoothly : NAMESTATE READ WRITE CKSUM tankONLINE 0 0 0 mirror-0 ONLINE 0 0 0 c4d1ONLINE 0 0 0 c5d0ONLINE 0 0 0 iostat gives me following : c4d1 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0 Model: WDC WD10EVDS-63 Revision: Serial No: WD-WCAV592 Size: 1000.20GB <1000202305536 bytes> Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 Illegal Request: 0 c5d0 Soft Errors: 981 Hard Errors: 0 Transport Errors: 981 Model: Hitachi HDS7210 Revision: Serial No: JP2921HQ0 Size: 1000.20GB <1000202305536 bytes> Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 Illegal Request: 0 c8d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0 Model: KINGSTON SSDNOW Revision: Serial No: 30PM10I Size: 30.02GB <30016659456 bytes> Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 Any hints? Best regards and many thanks for your help! Benjamin -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] NexentaStor Community edition 3.0.3 released
On Jul 1, 2010, at 10:39, Pasi Kärkkäinen wrote: basicly 5-30 seconds after login prompt shows up on the console the server will reboot due to kernel crash. the error seems to be about the broadcom nic driver.. Is this a known bug? Please contact Nexenta via their support infrastructure (web site, forums, lists?): http://www.nexentastor.org/ http://www.nexenta.com/corp/support Using ZFS- and OpenSolaris-specific lists isn't appropriate. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Announce: zfsdump
> From: Asif Iqbal [mailto:vad...@gmail.com] > > currently to speed up the zfs send| zfs recv I am using mbuffer. It > moves the data > lot faster than using netcat (or ssh) as the transport method

Yup, this works because network and disk latency can both be variable. So without buffering, your data stream must instantaneously go at the speed of whichever is slower: the disk or the network. But when you use buffering, you're able to go as fast as the network at all times. You remove the effect of transient disk latency.

> that is why I thought may be transport it like axel does better than > wget. > axel let you create multiple pipes, so you get the data multiple times > faster > than with wget.

If you're using axel to download something from the internet, the reason it's faster than wget is that your data stream is competing against all the other users of the internet to get something from that server across some WAN. Inherently, all the routers and servers on the internet treat each data stream fairly (except when explicitly configured to be unfair). So when you axel some file from the internet using multiple threads, instead of wget'ing with a single thread, you're unfairly hogging the server and WAN bandwidth between your site and the remote site, slowing down everyone else on the internet who is running with only 1 thread each. Assuming your zfs send backup is staying local, on a LAN, you almost certainly do not want to do that. If your zfs send is going across the WAN ... maybe you do want to multithread the datastream. But you'd better ensure it's encrypted. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
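For reference, the mbuffer pipeline being discussed usually looks something like this (hostname, port, and buffer sizes are placeholders, and exact options vary between mbuffer versions):

on the receiving host:  # mbuffer -s 128k -m 1G -I 9090 | zfs receive -F tank/backup
on the sending host:    # zfs send tank/data@snap | mbuffer -s 128k -m 1G -O receiver:9090

Note that this runs in the clear, which is fine on a trusted LAN but reinforces the point above about encrypting anything that crosses a WAN.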
Re: [zfs-discuss] NexentaStor Community edition 3.0.3 released
On Tue, Jun 15, 2010 at 10:57:53PM +0530, Anil Gulecha wrote: > Hi All, > > On behalf of NexentaStor team, I'm happy to announce the release of > NexentaStor Community Edition 3.0.3. This release is the result of the > community efforts of Nexenta Partners and users. > > Changes over 3.0.2 include > * Many fixes to ON/ZFS backported to b134. > * Multiple bug fixes in the appliance. > > With the addition of many new features, NexentaStor CE is the *most > complete*, and feature-rich gratis unified storage solution today. > > Quick Summary of Features > - > * ZFS additions: Deduplication (based on OpenSolaris b134). > * Free for upto 12 TB of *used* storage > * Community edition supports easy upgrades > * Many new features in the easy to use management interface. > * Integrated search > > Grab the iso from > http://www.nexentastor.org/projects/site/wiki/CommunityEdition > > If you are a storage solution provider, we invite you to join our > growing social network at http://people.nexenta.com. > Hey, I tried installing Nexenta 3.0.3 on an old HP DL380G4 server, and it installed OK, but it crashes all the time. Basically 5-30 seconds after the login prompt shows up on the console the server will reboot due to a kernel crash. The error seems to be about the Broadcom NIC driver. Is this a known bug? See the screenshots for the kernel error message: http://pasik.reaktio.net/nexenta/nexenta303-crash02.jpg http://pasik.reaktio.net/nexenta/nexenta303-crash01.jpg -- Pasi ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] optimal ZFS filesystem layout on JBOD
Greetings, we are running a few databases of currently 200GB (growing) in total for data warehousing:
- new data via INSERTs for (up to) millions of rows per day; sometimes with UPDATEs
- most data in a single table (=> 10 to 100s of millions of rows)
- queries SELECT subsets of this table via an index
- for effective parallelisation, queries create (potentially large) non-temporary tables which are deleted at the end of the query => lots of simple INSERTs and SELECTs during queries
- large transactions: they may contain millions of INSERTs/UPDATEs
- running version PostgreSQL 8.4.2

We are moving all this to a larger system - the hardware is available, therefore fixed:
- Sun X4600 (16 cores, 64GB)
- external SAS JBOD with 24 2,5" slots:
  o 18x SAS 10k 146GB drives
  o 2x SAS 10k 73GB drives
  o 4x Intel SLC 32GB SATA SSD
- JBOD connected to Adaptec SAS HBA with BBU
- Internal storage via on-board RAID HBA:
  o 2x 73GB SAS 10k for OS (RAID1)
  o 2x Intel SLC 32GB SATA SSD for ZIL (RAID1) (?)
- OS will be Solaris 10 to have ZFS as filesystem (and dtrace)
- 10GigE towards client tier (currently, another X4600 with 32 cores and 64GB)

What would be the optimal storage/ZFS layout for this? I checked solarisinternals.com and some PostgreSQL resources and came to the following concept - asking for your comments:
- run the JBOD without HW-RAID, but let all redundancy be done by ZFS for maximum flexibility
- create separate ZFS pools for tablespaces (data, index, temp) and WAL on separate devices (LUNs)
- use the 4 SSDs in the JBOD as Level-2 ARC cache (can I use a single cache for all pools?) w/o redundancy
- use the 2 SSDs connected to the on-board HBA as RAID1 for ZFS ZIL

Potential issues that I see:
- the ZFS ZIL will not benefit from a BBU (as it is connected to the backplane, driven by the onboard RAID), and might be too small (32GB for ~2TB of data with lots of writes)?
- the pools on the JBOD might have the wrong size for the tablespaces - like: using the 2 73GB drives as RAID 1 for temp might become too small, but adding a 146GB drive might not be a good idea?
- with 20 spindles, does it make sense at all to use dedicated devices for the tablespaces, or will the load be distributed well enough across the spindles anyway?

thanks for any comments & suggestions,

Joachim -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs destroy hangs machine if snapshot exists- workaround found
I also have this problem: with 134, if I delete big snapshots the server hangs, responding only to ping. I also have the ZVOL issue. Any news about having them solved? In my case this is a big problem since I'm using osol as a file server... Thanks -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] dedup accounting anomaly / dedup experiments
Hello list, I wanted to test deduplication a little and did an experiment. My question was: can I dedupe infinitely, or is there an upper limit? For that I did a very basic test:
- I created a ramdisk pool (1GB)
- enabled dedup, and
- wrote zeros to it (in one single file) until an error was returned.

The size of the pool was 1046 MB; I was able to write 62 GB to it before it said "no space left on device". The block size was 128k, so I was able to write 507,000 blocks to the pool. With the device full, I see the following:
1) zfs list reports that no space is left (AVAIL=0)
2) zpool reports that the dedup factor was ~507,000x
3) zpool also reports that 8.6 MB of space were allocated in the pool (0% used)

So for me it looks like something is broken in ZFS accounting with dedup:
- the zpool and zfs free-space reports do not align
- the real deduplication factor was not 507,000 (that would mean I could have written 507,000 x 1GB = a lot to the pool)
- calculating 1046 MB / 507,000 = 2.1 KB: somehow, for each 128k block, 2.1 KB of data has been written (assuming zfs list is correct). What is this? Metadata? Meaning that I have approx 1.6% of metadata in ZFS (1/(128k/2.1k))?

I repeated the same thing for a recordsize of 32k. The funny thing is:
- also 60 GB could be written before "no space left"
- 31 MB of space were allocated in the pool (zpool list)

The version of the pool is 25. During the experiment I could nicely see:
- that performance on a ramdisk is CPU bound, doing ~125 MB/sec per core
- that performance scales nearly linearly with added CPU cores (125 MB/s for 1 core, 253 MB/s for 2 cores, 408 MB/s for 4 cores)
- that the upper size of the deduplication table is blocks * ~150 bytes, independent of the dedup factor
- that the DDT does not grow for deduplicatable blocks (zdb -D)
- that performance goes down by a factor of ~4 when switching from an allocation policy of "closest" to "best fit" (as the pool fills, the rate drops from 250 MB/s to 67 MB/s). I suspect even worse results for spinning media because of the head movements (>10x slowdown).

Anyone know why the dedup factor is wrong? Any insights on what has actually been written (compressed metadata, deduped metadata, etc.) would be greatly appreciated.

Regards, Robert -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
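For anyone wanting to repeat the experiment, the setup is roughly the following (the ramdisk and pool names are made up, and the exact ramdiskadm/dd invocations are my assumption of what was used, not a transcript):

# ramdiskadm -a rd1 1g
# zpool create ramtank /dev/ramdisk/rd1
# zfs set dedup=on ramtank
# dd if=/dev/zero of=/ramtank/zeros bs=128k     (runs until "no space left on device")
# zpool list ramtank ; zfs list -o space ramtank ; zdb -DD ramtank

Comparing the zpool and zfs numbers at the end is exactly where the accounting discrepancy described above shows up, and the zdb -DD output shows what the DDT itself is consuming.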
Re: [zfs-discuss] Supermicro AOC-USAS-L8i
> I plan on removing the second USAS-L8i and connect > all 16 drives to the > first USAS-L8i when I need more storage capacity. I > have no doubt that > it will work as intended. I will report to the list > otherwise. I'm a little late to the party here. First, I'd like to thank those pioneers who came before me and found this board works fine. I have two of them on order for a server I'm putting together. I'm curious about this connecting all 16 drives bit. I read that the controller will handle up to 122 drives, but short of a lot of extra, expensive hardware, I have no idea how that is accomplished. How are you intending to connect 16 drives to the two SAS ports on the controller? On a slightly different but related topic, anyone have advice on how to connect up my drives? I've got room for 20 pool drives in the case. I'll have two AOC-USAS-L8i cards along with cables to connect 16 SATA2 drives. The motherboard has 6 SATA2 connectors plus 2 SATA3 connectors. I was planning to use the SATA3 connectors for the boot drives (hopefully mirrored ZFS). Initially I'll have three 2TB drives with the intention of using raidz1 on them. Phase 2 will be three more 2TB drives. Phase 3 will be three 1.5TB drives. (I already have most of the drives for phase 2 and 3.) I'll probably fill the rest with a JBOD pool of assorted drives I have lying around. I'm specifically wondering if I'll see better performance by spreading the raidz1 array devices across the controllers or if I should keep an array to a single controller as much as possible. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss