[zfs-discuss] Snapshot size as reported by the USED property
I create snapshots on my datasets quite frequently. My understanding of the USED property of a snapshot is that it indicates the amount of data that was written to the dataset after the snapshot was taken. But now I'm seeing a snapshot with USED == 0 where there was definitely write activity after it was taken. Is my understanding wrong, or am I seeing something that is not supposed to happen? I am on NCP3.

Thanks,
Peter
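[Editor's note: a minimal sketch, with hypothetical pool and file names, of why a snapshot can report USED near 0 even after later writes. A snapshot's USED only counts blocks unique to that snapshot, i.e. the space that would be freed by destroying it; new data that is still live in the filesystem is charged to the filesystem, not to the snapshot.]

  zfs create tank/demo
  mkfile 100m /tank/demo/a                 # 100M written before the snapshot
  zfs snapshot tank/demo@s1
  mkfile 100m /tank/demo/b                 # write activity after the snapshot
  zfs list -t snapshot -o name,used,referenced tank/demo@s1
      # USED stays near 0: file b is still live, so no block is unique to s1
  rm /tank/demo/a
  zfs list -t snapshot -o name,used,referenced tank/demo@s1
      # USED jumps to ~100M: a's blocks are now held only by the snapshot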
[zfs-discuss] pool died during scrub
I have a bunch of sol10U8 boxes with ZFS pools, almost all raidz2 8-disk stripes. They're all Supermicro-based with retail LSI cards. I've noticed a tendency for things to go a little bonkers during the weekly scrub (they all scrub over the weekend), and that's when I'll lose a disk here and there. OK, fine, that's sort of the point, and they're SATA drives so things happen. I've never lost a pool though, until now. This is Not Fun.

  > ::status
  debugging crash dump vmcore.0 (64-bit) from ny-fs4
  operating system: 5.10 Generic_142901-10 (i86pc)
  panic message: BAD TRAP: type=e (#pf Page fault) rp=fe80007cb850 addr=28
    occurred in module zfs due to a NULL pointer dereference
  dump content: kernel pages only

  > $C
  fe80007cb960 vdev_is_dead+2()
  fe80007cb9a0 vdev_mirror_child_select+0x65()
  fe80007cba00 vdev_mirror_io_start+0x44()
  fe80007cba30 zio_vdev_io_start+0x159()
  fe80007cba60 zio_execute+0x6f()
  fe80007cba90 zio_wait+0x2d()
  fe80007cbb40 arc_read_nolock+0x668()
  fe80007cbbd0 dmu_objset_open_impl+0xcf()
  fe80007cbc20 dsl_pool_open+0x4e()
  fe80007cbcc0 spa_load+0x307()
  fe80007cbd00 spa_open_common+0xf7()
  fe80007cbd10 spa_open+0xb()
  fe80007cbd30 pool_status_check+0x19()
  fe80007cbd80 zfsdev_ioctl+0x1b1()
  fe80007cbd90 cdev_ioctl+0x1d()
  fe80007cbdb0 spec_ioctl+0x50()
  fe80007cbde0 fop_ioctl+0x25()
  fe80007cbec0 ioctl+0xac()
  fe80007cbf10 _sys_sysenter_post_swapgs+0x14b()

    pool: srv
      id: 9515618289022845993
   state: UNAVAIL
  status: One or more devices are missing from the system.
  action: The pool cannot be imported. Attach the missing devices and try again.
     see: http://www.sun.com/msg/ZFS-8000-6X
  config:

          srv                        UNAVAIL  missing device
            raidz2                   ONLINE
              c2t5000C5001F2CCE1Fd0  ONLINE
              c2t5000C5001F34F5FAd0  ONLINE
              c2t5000C5001F48D399d0  ONLINE
              c2t5000C5001F485EC3d0  ONLINE
              c2t5000C5001F492E42d0  ONLINE
              c2t5000C5001F48549Bd0  ONLINE
              c2t5000C5001F370919d0  ONLINE
              c2t5000C5001F484245d0  ONLINE
            raidz2                   ONLINE
              c2t5F000B5C8187d0      ONLINE
              c2t5F000B5C8157d0      ONLINE
              c2t5F000B5C9101d0      ONLINE
              c2t5F000B5C8167d0      ONLINE
              c2t5F000B5C9120d0      ONLINE
              c2t5F000B5C9151d0      ONLINE
              c2t5F000B5C9170d0      ONLINE
              c2t5F000B5C9180d0      ONLINE
            raidz2                   ONLINE
              c2t5000C50010A88E76d0  ONLINE
              c2t5000C5000DCD308Cd0  ONLINE
              c2t5000C5001F1F456Dd0  ONLINE
              c2t5000C50010920E06d0  ONLINE
              c2t5000C5001F20C81Fd0  ONLINE
              c2t5000C5001F3C7735d0  ONLINE
              c2t5000C500113BC008d0  ONLINE
              c2t5000C50014CD416Ad0  ONLINE

          Additional devices are known to be part of this pool, though their
          exact configuration cannot be determined.

All of this would be ok... except THOSE ARE THE ONLY DEVICES THAT WERE PART OF THE POOL. How can it be missing a device that didn't exist?

A 'zpool import -fF' results in the kernel panic above. This also creates /etc/zfs/zpool.cache.tmp, which then results in the pool being imported, which leads to a continuous reboot/panic cycle. I can't obviously use b134 to import the pool without logs, since that would imply upgrading the pool first, which is hard to do if it's not imported. My zdb skills are lacking - zdb -l only gets you so far and that's it. (Where the heck are the other options to zdb even written down, besides in the code?)

OK, so this isn't the end of the world, but it's 15TB of data I'd really rather not have to re-copy across a 100Mbit line. What concerns me more is that ZFS would do this in the first place - it's not supposed to corrupt itself!!
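[Editor's note: one place to start before another import attempt is read-only inspection with zdb. The sketch below is hedged - the flag behavior is from memory of the zdb source rather than a manpage, and the device name is just one member from the config above - but dumping the labels and the on-disk config is the usual way to see what actually survives on disk.]

  zdb -l /dev/rdsk/c2t5000C5001F2CCE1Fd0s0   # dump the four vdev labels from one member
  zdb -e -C srv                              # print the on-disk config of the unimported pool
  zdb -e -u srv                              # print the currently active uberblock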
Re: [zfs-discuss] zfs list discrepancy after adding a new vdev to pool
On Saturday, August 28, 2010 06:04:17 am Mattias Pantzare wrote:
> On Sat, Aug 28, 2010 at 02:54, Darin Perusich darin.perus...@cognigencorp.com wrote:
> > Hello All,
> >
> > I'm sure this has been discussed previously but I haven't been able to find an answer to this. I've added another raidz1 vdev to an existing storage pool and the increased available storage isn't reflected in the 'zfs list' output. Why is this?
> >
> > The system in question is running Solaris 10 5/09 s10s_u7wos_08, kernel Generic_139555-08. The system does not have the latest patches, which might be the cure. Thanks!
>
> I think you have to explain your problem more, 392G is more than 196G?

This is actually the wrong output; it was the end of a LONG day. Here's the correct output.

  zpool create datapool raidz1 c1t50060E800042AA70d0 c1t50060E800042AA70d1

  zpool list
  NAME       SIZE   USED  AVAIL   CAP  HEALTH  ALTROOT
  datapool   398G   191K   398G    0%  ONLINE  -

  zfs list
  NAME       USED  AVAIL  REFER  MOUNTPOINT
  datapool    91K   196G     1K  /datapool

  zpool add datapool raidz c1t50060E800042AA70d2 c1t50060E800042AA70d3

  zpool list
  NAME       SIZE   USED  AVAIL   CAP  HEALTH  ALTROOT
  datapool   796G   231K   796G    0%  ONLINE  -

  zfs list
  NAME       USED  AVAIL  REFER  MOUNTPOINT
  datapool   111K   392G    18K  /datapool

--
Darin Perusich
Unix Systems Administrator
Cognigen Corporation
395 Youngs Rd.
Williamsville, NY 14221
Phone: 716-633-3463
Email: darin...@cognigencorp.com
Re: [zfs-discuss] zfs list discrepancy after adding a new vdev to pool
On Saturday, August 28, 2010 12:27:36 am Edho P Arief wrote:
> On Sat, Aug 28, 2010 at 7:54 AM, Darin Perusich darin.perus...@cognigencorp.com wrote:
> > Hello All,
> >
> > I'm sure this has been discussed previously but I haven't been able to find an answer to this. I've added another raidz1 vdev to an existing storage pool and the increased available storage isn't reflected in the 'zfs list' output. Why is this?
>
> you must do zpool export followed by zpool import

I tried this but it didn't have any effect.

--
Darin Perusich
Unix Systems Administrator
Cognigen Corporation
395 Youngs Rd.
Williamsville, NY 14221
Phone: 716-633-3463
Email: darin...@cognigencorp.com
Re: [zfs-discuss] VM's on ZFS - 7210
As I said, please by all means try it and post your benchmarks for the first hour, first day, first week, and then the first month. The data will be of interest to you. On a subjective basis, if you feel that an SSD is working just fine as your ZIL, run with it. Good luck!
Re: [zfs-discuss] zfs list discrepancy after adding a new vdev to pool
On Saturday, August 28, 2010 05:56:27 am Tomas Ögren wrote:
> On 27 August, 2010 - Darin Perusich sent me these 2,1K bytes:
> > Hello All,
> >
> > I'm sure this has been discussed previously but I haven't been able to find an answer to this. I've added another raidz1 vdev to an existing storage pool and the increased available storage isn't reflected in the 'zfs list' output. Why is this?
> >
> > The system in question is running Solaris 10 5/09 s10s_u7wos_08, kernel Generic_139555-08. The system does not have the latest patches, which might be the cure. Thanks!
> >
> > Here's what I'm seeing.
> >
> >   zpool create datapool raidz1 c1t50060E800042AA70d0 c1t50060E800042AA70d1
>
> Just fyi, this is an inefficient variant of a mirror. More cpu required and lower performance.

This is a testing setup; the production pool is currently 1 raidz1 vdev split across 6 disks. Thanks for the heads up though.

--
Darin Perusich
Unix Systems Administrator
Cognigen Corporation
395 Youngs Rd.
Williamsville, NY 14221
Phone: 716-633-3463
Email: darin...@cognigencorp.com
Re: [zfs-discuss] zfs list discrepancy after adding a new vdev to pool
This is a FAQ: "Why doesn't the space that is reported by the zpool list command and the zfs list command match?"
http://hub.opensolaris.org/bin/view/Community+Group+zfs/faq
 -- richard

On Aug 30, 2010, at 5:47 AM, Darin Perusich wrote:
> [...] I've added another raidz1 vdev to an existing storage pool and the increased available storage isn't reflected in the 'zfs list' output. Why is this? [...]
> This is actually the wrong output; it was the end of a LONG day. Here's the correct output. [...]

--
OpenStorage Summit, October 25-27, Palo Alto, CA
http://nexenta-summit2010.eventbrite.com
ZFS and performance consulting
http://www.RichardElling.com
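[Editor's note: for this particular layout the numbers also work out by hand. A rough sketch of the arithmetic, assuming ~199G of raw space per LUN: zpool list reports raw vdev space including parity, while zfs list reports space available to datasets after parity, and a 2-disk raidz1 spends one disk's worth of space on parity.]

  echo "$((2 * 199))G raw, $((2 * 199 - 199))G usable"      # one 2-disk raidz1: 398G / ~199G (shown as ~196G)
  echo "$((4 * 199))G raw, $((4 * 199 - 2 * 199))G usable"  # after the second vdev: 796G / ~398G (shown as ~392G)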
Re: [zfs-discuss] Postmortem - file system recovered [SEC=UNCLASSIFIED]
I am afraid I can't describe the exact procedure that eventually fixed the file system as I merely observed it while Victor was logged into my system. I am quoting from the explanation he provided, but if he reads this perhaps he could add whatever details seem pertinent.
Re: [zfs-discuss] pool died during scrub
On Mon, 30 Aug 2010, Jeff Bacon wrote:
> All of this would be ok... except THOSE ARE THE ONLY DEVICES THAT WERE PART OF THE POOL. How can it be missing a device that didn't exist?

The device(s) in question are probably the logs you refer to here:

> I can't obviously use b134 to import the pool without logs, since that would imply upgrading the pool first, which is hard to do if it's not imported.

The stack trace you show is indicative of a memory corruption that may have gotten out to disk. In other words, ZFS wrote data to RAM, RAM was corrupted, then the checksum was calculated and the result was written out. Do you have a core dump from the panic? Also, what kind of DRAM does this system use?

If you're lucky, then there's no corruption and instead it's a stale config that's causing the problem. Try removing /etc/zfs/zpool.cache and then doing a 'zpool import -a'.
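[Editor's note: a hedged sketch of that suggestion, with one extra wrinkle for this case - since two configs named 'srv' show up (the stale log-only one and the real pool), importing by the numeric pool id from the 'zpool import' listing avoids picking up the wrong one. The id below is taken from the output posted in this thread; treat it as a starting point, not a guaranteed fix.]

  mv /etc/zfs/zpool.cache /etc/zfs/zpool.cache.bad   # set the stale cache aside rather than deleting it
  zpool import                                       # list importable pools and their numeric ids
  zpool import 9515618289022845993                   # import the real 'srv' by id, not by the ambiguous name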
Re: [zfs-discuss] pool died during scrub
> > All of this would be ok... except THOSE ARE THE ONLY DEVICES THAT WERE PART OF THE POOL. How can it be missing a device that didn't exist?
>
> The device(s) in question are probably the logs you refer to here:

There is a log, with a different GUID, from another pool from long ago. It isn't valid. I clipped that:

  ny-fs4(71)# zpool import
    pool: srv
      id: 6111323963551805601
   state: UNAVAIL
  status: The pool was last accessed by another system.
  action: The pool cannot be imported due to damaged devices or data.
     see: http://www.sun.com/msg/ZFS-8000-EY
  config:

          srv            UNAVAIL  insufficient replicas
          logs
          srv            UNAVAIL  insufficient replicas
            mirror       ONLINE
              c3t0d0s4   ONLINE    <- box doesn't even have a c3
              c0t0d0s4   ONLINE    <- what it's looking at - leftover from who knows what

    pool: srv
      id: 9515618289022845993
   state: UNAVAIL
  status: One or more devices are missing from the system.
  action: The pool cannot be imported. Attach the missing devices and try again.
     see: http://www.sun.com/msg/ZFS-8000-6X
  config:
  ...

> > I can't obviously use b134 to import the pool without logs, since that would imply upgrading the pool first, which is hard to do if it's not imported.
>
> The stack trace you show is indicative of a memory corruption that may have gotten out to disk. In other words, ZFS wrote data to ram, ram was corrupted, then the checksum was calculated and the result was written out.

Now this worries me. Granted, the box works fairly hard, but ... no ECC events to IPMI that I can see. It's possible that the controller ka-futzed somehow... but then presumably there should be SOME valid data to go back to here somewhere? The one fairly unusual item about this box is that it has another pool with 12 15k SAS drives, which holds a MySQL database that gets fairly well thrashed on a permanent basis.

> Do you have a core dump from the panic? Also, what kind of DRAM does this system use?

It has 12 4GB DDR3-1066 ECC REG DIMMs. I can regenerate the panic on command (try to import the pool with -F and it will go back into reboot-loop mode). I pulled the stack from a core dump.

> If you're lucky, then there's no corruption and instead it's a stale config that's causing the problem. Try removing /etc/zfs/zpool.cache and then doing a 'zpool import -a'.

Not nearly that lucky. It won't import. If it goes into reboot mode, the only thing you can do is go to single-user, remove the cache, and reboot so it forgets about the pool.

(Please, no rumblings from the peanut gallery about the evils of SATA or SAS/SATA encapsulation. This is the only box in this mode. The MySQL database is an RTG stats database whose loss is not the end of the world. The dataset is replicated at two other sites; this is a local copy - just that it's 15TB, and as I said, recovery is, well, time-consuming and therefore not the preferred option. Real Production Boxes - slowly coming on line - are all using the SuperMicro E26 dual-port backplane with 2TB Constellation SAS drives on paired LSI 9211-8is, with the aforementioned ECC REG RAM, and I'm trying to figure out how to either
  -- get my hands on SAS SSDs (of which there appears to be one, the new OCZ Vertex 2 Pro), or
  -- install interposers in front of SATA SSDs so at least the controllers aren't dealing with SATA encapsulation - the big challenge being, of all things, the form factor and the tray.
I think I'm going to yank the SAS drives out and migrate them so that they're on a separate backplane and controller.)
[zfs-discuss] Intermittent ZFS hang
Howdy,

We're having a ZFS performance issue over here that I was hoping you guys could help me troubleshoot. We have a ZFS pool made up of 24 disks, arranged into 7 raid-z devices of 4 disks each. We're using it as an iSCSI back-end for VMware and some Oracle RAC clusters. Under normal circumstances performance is very good, both in benchmarks and under real-world use. Every couple of days, however, I/O seems to hang for anywhere between several seconds and several minutes. The hang seems to be a complete stop of all write I/O. The following zpool iostat illustrates:

  pool0       2.47T  5.13T    120      0   293K      0
  pool0       2.47T  5.13T    127      0   308K      0
  pool0       2.47T  5.13T    131      0   322K      0
  pool0       2.47T  5.13T    144      0   347K      0
  pool0       2.47T  5.13T    135      0   331K      0
  pool0       2.47T  5.13T    122      0   295K      0
  pool0       2.47T  5.13T    135      0   330K      0

While this is going on our VMs all hang, as do any zfs create commands or attempts to touch/create files in the zfs pool from the local system. After several minutes the system un-hangs and we see very high write rates before things return to normal across the board.

Some more information about our configuration: We're running OpenSolaris snv_134. ZFS is at version 22. Our disks are 15k RPM 300GB Seagate Cheetahs, mounted in Promise J610S Dual enclosures, hanging off a Dell SAS 5/E controller. We'd tried out most of this configuration previously on OpenSolaris 2009.06 without running into this problem. The only thing that's new, aside from the newer OpenSolaris/ZFS, is a set of four SSDs configured as log disks. At first we blamed de-dupe, but we've disabled that. Next we suspected the SSD log disks, but we've seen the problem with those removed, as well.

Has anyone seen anything like this before? Are there any tools we can use to gather information during the hang which might be useful in determining what's going wrong?

Thanks for any insights you may have.

-Charles
Re: [zfs-discuss] pool scrub clean, filesystem broken
I've posted a post-mortem followup thread: http://opensolaris.org/jive/thread.jspa?threadID=133472
Re: [zfs-discuss] Terrible ZFS performance on a Dell 1850 w/ PERC 4e/Si (Sol10U6)
I have the same problem you do; ZFS performance under Solaris 10 U8 is horrible. When you say passthrough mode, do you mean non-RAID configuration? And if so, could you tell me how you configured it? The best I can manage is to configure each physical drive as a RAID 0 array and then export that as a logical drive. All tips/suggestions are appreciated.
Re: [zfs-discuss] Intermittent ZFS hang
Charles,

Did you check for any HW issues reported during the hangs? fmdump -ev and the like?

..Remco

On 8/30/10 6:02 PM, Charles J. Knipe wrote:
> Howdy, We're having a ZFS performance issue over here that I was hoping you guys could help me troubleshoot. We have a ZFS pool made up of 24 disks, arranged into 7 raid-z devices of 4 disks each. We're using it as an iSCSI back-end for VMware and some Oracle RAC clusters. [...]
> Has anyone seen anything like this before? Are there any tools we can use to gather information during the hang which might be useful in determining what's going wrong?
Re: [zfs-discuss] Intermittent ZFS hang
Charles,

Is it just ZFS that hangs (or appears to slow down or block), or does the whole system hang? A couple of questions:

  What does iostat show during the time period of the slowdown?
  What does mpstat show during the time of the slowdown?

You can look at the metadata statistics by running the following:

  echo ::arc | mdb -k

When looking at a ZFS problem, I usually like to gather:

  echo ::spa | mdb -k
  echo ::zio_state | mdb -k

I suspect you could drill down more with dtrace or lockstat to see where the slowdown is happening.

Dave

On 08/30/10 11:02, Charles J. Knipe wrote:
> Howdy, We're having a ZFS performance issue over here that I was hoping you guys could help me troubleshoot. [...]
> Has anyone seen anything like this before? Are there any tools we can use to gather information during the hang which might be useful in determining what's going wrong? Thanks for any insights you may have. -Charles
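[Editor's note: a small collection loop (plain sh, hypothetical output directory) that grabs all of the above in one go once the hang starts, so the samples line up in time.]

  d=/var/tmp/hang.$$; mkdir -p $d
  n=0
  while [ $n -lt 6 ]; do                       # six samples, each ~10 seconds
      date                        >> $d/times
      iostat -xnz 1 5             >> $d/iostat
      mpstat 1 5                  >> $d/mpstat
      echo ::arc       | mdb -k   >> $d/arc
      echo ::spa       | mdb -k   >> $d/spa
      echo ::zio_state | mdb -k   >> $d/zio_state
      n=$((n + 1))
  done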
Re: [zfs-discuss] Intermittent ZFS hang
David,

Thanks for your reply. Answers to your questions are below.

> Is it just ZFS that hangs (or appears to slow down or block), or does the whole system hang?

Only the ZFS storage is affected. Any attempt to write to it blocks until the issue passes. Other than that the system behaves normally. I have not, as far as I remember, tried writing to the root pool while this is going on; I'll have to check that next time. I suspect the problem is likely limited to a single pool.

> What does iostat show during the time period of the slowdown?
> What does mpstat show during the time of the slowdown?
>
> You can look at the metadata statistics by running the following:
>   echo ::arc | mdb -k
> When looking at a ZFS problem, I usually like to gather:
>   echo ::spa | mdb -k
>   echo ::zio_state | mdb -k

I will plan to dump information from all of these sources next time I can catch it in the act. Any other diag commands you think might be useful?

> I suspect you could drill down more with dtrace or lockstat to see where the slowdown is happening.

I'm brand new to DTrace. I'm doing some reading now toward being in a position to ask intelligent questions.

-Charles
Re: [zfs-discuss] 4k block alignment question (X-25E)
On Mon, Aug 30 at 15:05, Ray Van Dolson wrote:
> I want to fix (as much as is possible) a misalignment issue with an X-25E that I am using for both OS and as an slog device. This is on x86 hardware running Solaris 10U8.
>
> Partition table looks as follows:
>
>   Part      Tag    Flag     Cylinders         Size            Blocks
>     0       root    wm       1 -  1306       10.00GB    (1306/0/0)  20980890
>     1 unassigned    wu       0                0         (0/0/0)            0
>     2     backup    wm       0 -  3886       29.78GB    (3887/0/0)  62444655
>     3 unassigned    wu    1307 -  3886       19.76GB    (2580/0/0)  41447700
>     4 unassigned    wu       0                0         (0/0/0)            0
>     5 unassigned    wu       0                0         (0/0/0)            0
>     6 unassigned    wu       0                0         (0/0/0)            0
>     7 unassigned    wu       0                0         (0/0/0)            0
>     8       boot    wu       0 -     0        7.84MB    (1/0/0)        16065
>     9 unassigned    wu       0                0         (0/0/0)            0
>
> And here is fdisk:
>
>              Total disk size is 3890 cylinders
>              Cylinder size is 16065 (512 byte) blocks
>
>                                                Cylinders
>       Partition   Status    Type          Start   End   Length    %
>       =========   ======    ============  =====   ===   ======   ===
>           1       Active    Solaris           1  3889    3889    100
>
> Slice 0 is where the OS lives and slice 3 is our slog. As you can see from the fdisk partition table (and from the slice view), the OS partition starts on cylinder 1 -- which is not 4k aligned. I don't think there is much I can do to fix this without reinstalling. However, I'm most concerned about the slog slice and would like to recreate its partition such that it begins on cylinder 1312.
>
> So a few questions:
>
> - Would making s3 be 4k block aligned help even though s0 is not?
> - Do I need to worry about 4k block aligning the *end* of the slice? E.g., instead of ending s3 on cylinder 3886, end it on 3880 instead?
>
> Thanks,
> Ray

Do you specifically have benchmark data indicating unaligned or aligned+offset access on the X25-E is significantly worse than aligned access? I'd thought the tier1 SSDs didn't have problems with these workloads.

--eric

--
Eric D. Mudama
edmud...@mail.bounceswoosh.org
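[Editor's note: a back-of-the-envelope check of the cylinder choice in the quoted question (editor's arithmetic, so verify before repartitioning). 4 KiB is 8 512-byte sectors, and a cylinder here is 16065 sectors, which is 1 modulo 8, so a slice starting at cylinder c sits at an offset of c mod 8 sectors relative to the start of the Solaris fdisk partition. The fdisk partition itself starts at cylinder 1, which shifts everything by one more sector modulo 8 relative to the start of the disk, if that offset needs to be counted as well.]

  echo $((16065 % 8))            # 1 -> each additional cylinder shifts the offset by 1 sector (mod 8)
  echo $(((1307 * 16065) % 8))   # 3 -> slice 3 as it stands is not 4k aligned
  echo $(((1312 * 16065) % 8))   # 0 -> a start at cylinder 1312 is aligned relative to the Solaris partition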
Re: [zfs-discuss] 4k block alignment question (X-25E)
On Mon, Aug 30, 2010 at 03:37:52PM -0700, Eric D. Mudama wrote:
> On Mon, Aug 30 at 15:05, Ray Van Dolson wrote:
> > I want to fix (as much as is possible) a misalignment issue with an X-25E that I am using for both OS and as an slog device. This is on x86 hardware running Solaris 10U8. [...]
>
> Do you specifically have benchmark data indicating unaligned or aligned+offset access on the X25-E is significantly worse than aligned access? I'd thought the tier1 SSDs didn't have problems with these workloads.

I've been experiencing heavy Device Not Ready errors with this configuration, and thought perhaps it could be exacerbated by the block alignment issue. See this thread[1]. So this would be a troubleshooting step to attempt to further isolate the problem -- by eliminating the 4k alignment issue as a factor.

Just want to make sure I set up the alignment as optimally as possible.

Ray

[1] http://markmail.org/message/5rmfzvqwlmosh2oh
Re: [zfs-discuss] 4k block alignment question (X-25E)
Comment below...

On Aug 30, 2010, at 3:42 PM, Ray Van Dolson wrote:
> [...]
> Slice 0 is where the OS lives and slice 3 is our slog. As you can see from the fdisk partition table (and from the slice view), the OS partition starts on cylinder 1 -- which is not 4k aligned.

To get to a fine alignment, you need an EFI label. However, Solaris does not (yet) support booting from EFI-labeled disks. The older SMI labels are all cylinder aligned, which gives you a 1/4 chance of alignment.

> I've been experiencing heavy Device Not Ready errors with this configuration, and thought perhaps it could be exacerbated by the block alignment issue. See this thread[1]. So this would be a troubleshooting step to attempt to further isolate the problem -- by eliminating the 4k alignment issue as a factor.

In my experience, port expanders with SATA drives do not handle the high I/O rate that can be generated by a modest server. We are still trying to get to the bottom of these issues, but they do not appear to be related to the OS, mpt driver, ZIL use, or alignment.
 -- richard

> [1] http://markmail.org/message/5rmfzvqwlmosh2oh

--
OpenStorage Summit, October 25-27, Palo Alto, CA
http://nexenta-summit2010.eventbrite.com
ZFS and performance consulting
http://www.RichardElling.com
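[Editor's note: an illustration of the EFI point, with a hypothetical pool and device name, and only workable for a non-boot device. Handing ZFS a whole disk rather than a slice makes it write the EFI label itself, which sidesteps the SMI cylinder rounding entirely - an option for a dedicated slog device, though not for a disk that also has to boot.]

  zpool add mypool log c1t1d0    # whole disk (no slice suffix) -> ZFS writes an EFI label itself
  zpool status mypool            # the log vdev shows up as the whole device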
Re: [zfs-discuss] 4k block alignment question (X-25E)
On Mon, Aug 30, 2010 at 03:56:42PM -0700, Richard Elling wrote:
> > Slice 0 is where the OS lives and slice 3 is our slog. As you can see from the fdisk partition table (and from the slice view), the OS partition starts on cylinder 1 -- which is not 4k aligned.
>
> To get to a fine alignment, you need an EFI label. However, Solaris does not (yet) support booting from EFI-labeled disks. The older SMI labels are all cylinder aligned, which gives you a 1/4 chance of alignment.

Yep... our other boxes similar to this one are using whole disks as ZIL, so we're able to use EFI. The Device Not Ready errors happen there too (the SSDs are on an expander) but only at between 5-15 errors per day (vs. the 500 per hour on the split OS/slog setup).

> > I've been experiencing heavy Device Not Ready errors with this configuration, and thought perhaps it could be exacerbated by the block alignment issue. [...]
>
> In my experience, port expanders with SATA drives do not handle the high I/O rate that can be generated by a modest server. We are still trying to get to the bottom of these issues, but they do not appear to be related to the OS, mpt driver, ZIL use, or alignment.

Very interesting. We've been looking at Nexenta as we haven't been able to reproduce our issues on OpenSolaris -- I was hoping this meant NexentaStor wouldn't have the issue.

In any case -- any thoughts on whether or not I'll be helping anything if I change my slog slice starting cylinder to be 4k aligned even though slice 0 isn't?

Thanks,
Ray
Re: [zfs-discuss] 4k block alignment question (X-25E)
On Tue, Aug 31, 2010 at 6:03 AM, Ray Van Dolson rvandol...@esri.com wrote:
> In any case -- any thoughts on whether or not I'll be helping anything if I change my slog slice starting cylinder to be 4k aligned even though slice 0 isn't?

Some people claim that, due to how ZFS works, there will be a performance hit as long as the reported sector size differs from the physical sector size. This thread[1] has the discussion of what happens and how to handle such drives on FreeBSD.

[1] http://marc.info/?l=freebsd-fs&m=126976001214266&w=2

--
O ascii ribbon campaign - stop html mail - www.asciiribbon.org
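[Editor's note: for reference, the workaround that FreeBSD thread converges on (reproduced from memory, so check the thread itself) is a gnop(8) shim that makes the provider report 4 KiB sectors, so the pool is created with ashift=12; the shim can be discarded afterwards because the ashift is recorded in the pool. Device and pool names below are hypothetical.]

  gnop create -S 4096 /dev/ada0     # temporary provider that reports 4 KiB sectors
  zpool create tank /dev/ada0.nop   # pool (or new vdev) created with ashift=12
  zpool export tank
  gnop destroy /dev/ada0.nop        # shim no longer needed; the ashift is stored in the pool
  zpool import tank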
Re: [zfs-discuss] 4k block alignment question (X-25E)
On Tue, Aug 31 at 6:12, Edho P Arief wrote:
> On Tue, Aug 31, 2010 at 6:03 AM, Ray Van Dolson rvandol...@esri.com wrote:
> > In any case -- any thoughts on whether or not I'll be helping anything if I change my slog slice starting cylinder to be 4k aligned even though slice 0 isn't?
>
> Some people claim that, due to how ZFS works, there will be a performance hit as long as the reported sector size differs from the physical sector size. This thread[1] has the discussion of what happens and how to handle such drives on FreeBSD.
>
> [1] http://marc.info/?l=freebsd-fs&m=126976001214266&w=2

Yes, but that's for a 4k rotating drive, which has a much different latency profile than an SSD. I was wondering if anyone had benchmarking showing this alignment mattered on the latest SSDs. My guess is no, but I have no data.

--
Eric D. Mudama
edmud...@mail.bounceswoosh.org
Re: [zfs-discuss] 4k block alignment question (X-25E)
On Mon, Aug 30, 2010 at 04:12:48PM -0700, Edho P Arief wrote:
> On Tue, Aug 31, 2010 at 6:03 AM, Ray Van Dolson rvandol...@esri.com wrote:
> > In any case -- any thoughts on whether or not I'll be helping anything if I change my slog slice starting cylinder to be 4k aligned even though slice 0 isn't?
>
> Some people claim that, due to how ZFS works, there will be a performance hit as long as the reported sector size differs from the physical sector size. This thread[1] has the discussion of what happens and how to handle such drives on FreeBSD.
>
> [1] http://marc.info/?l=freebsd-fs&m=126976001214266&w=2

Thanks for the pointer -- these posts seem to reference data disks within the pool rather than disks being used for slog. Perhaps some of the same issues could arise, but I'm not sure that variable stripe sizing in a RAIDZ pool would change how the ZIL / slog devices are addressed.

I'm sure someone will correct me if I'm wrong on that...

Ray
Re: [zfs-discuss] 4k block alignment question (X-25E)
> I was wondering if anyone had benchmarking showing this alignment mattered on the latest SSDs. My guess is no, but I have no data.

I don't believe there can be any doubt that a Flash-based SSD (tier1 or not) is negatively affected by partition misalignment. The penalty is intrinsic to the asymmetric erase/program operations flash requires and the resulting read-modify-write needed to service an unaligned write. This is detailed in the following vendor benchmarking guidelines (SF-1500 controller):

http://www.smartm.com/files/salesLiterature/storage/AN001_Benchmark_XceedIOPSSATA_Apr2010_.pdf

Highlight from the link - "Proper partition alignment is one of the most critical attributes that can greatly boost the I/O performance of an SSD due to reduced read-modify-write operations."

It should be noted that the above only applies to Flash-based SSDs; an NVRAM-based SSD does *not* suffer the same fate, as its performance is not bound by, and does not vary with, partition (mis)alignment.

Best regards,

Christopher George
Founder/CTO
www.ddrdrive.com