Re: [zfs-discuss] ZFS Hard link space savings
On Sun, Jun 12, 2011 at 5:28 PM, Nico Williams wrote:
> On Sun, Jun 12, 2011 at 4:14 PM, Scott Lawson wrote:
> > I have an interesting question that may or may not be answerable from
> > some internal ZFS semantics.
>
> This is really standard Unix filesystem semantics.
>
> > [...]
> >
> > So total storage used is around ~7.5MB due to the hard linking taking
> > place on each store.
> >
> > If hard linking capability had been turned off, this same message would
> > have used 1500 x 2MB = 3GB worth of storage.
> >
> > My question is: are there any simple ways of determining the space
> > savings on each of the stores from the usage of hard links? [...]
>
> But... you just did! :) It's: number of hard links * (file size +
> sum(size of link names and/or directory slot size)). For sufficiently
> large files (say, larger than one disk block) you could approximate
> that as: number of hard links * file size. The key is the number of
> hard links, which will typically vary, but for e-mails that go to all
> users, well, you know the number of links then is the number of users.
>
> You could write a script to do this -- just look at the size and
> hard-link count of every file in the store, apply the above formula,
> add up the inflated sizes, and you're done.
>
> Nico
>
> PS: Is it really the case that Exchange still doesn't deduplicate
> e-mails? Really? It's much simpler to implement dedup in a mail
> store than in a filesystem...

MS has had SIS since Exchange 4.0. They dumped it in Exchange 2010 because
it was a huge source of their small random I/Os. In an effort to make
Exchange more "storage friendly" (i.e. more of a large sequential I/O
profile), they've done away with SIS. The defense for it is that you can
buy more "cheap" storage for less money than you'd save with SIS and 15k
rpm disks. Whether that's factual, I suppose, is for the reader to decide.

--Tim
Re: [zfs-discuss] Impact of L2ARC device failure and SSD recommendations
On 6/12/11 7:25 PM, "Richard Elling" wrote:
>> Here's the timeline:
>>
>> - The Intel X25-M was marked "FAULTED" Monday evening, 6pm. This was
>> not detected by NexentaStor.
>
> Is the volume-check runner enabled? All of the check runner results are
> logged in the report database and sent to the system administrator via
> email. I will assume that you have configured email for delivery, as it
> is a required step in the installation procedure.
>
> In any case, a disk declared FAULTED is no longer used by ZFS, except
> when a pool is cleared. The volume-check runner can do this on your
> behalf, if it is configured to do so. See Data Management -> Runners ->
> volume-check. And, of course, these actions are recorded in the logs and
> report database.
>
> -- richard

I checked seven of my NexentaStor installations (3.0.4 and 3.0.5). Six of
them had the disk-check fault trigger disabled by default. volume-check is
enabled on all and is set to run hourly. Email notification is configured,
and I actively receive other alerts (DDT table, auto-sync) and reports.

--
Edmund White
ewwh...@mac.com
847-530-1605
Re: [zfs-discuss] Impact of L2ARC device failure and SSD recommendations
On Jun 12, 2011, at 5:04 PM, Edmund White wrote:
> On 6/12/11 6:18 PM, "Jim Klimov" wrote:
>> 2011-06-12 23:57, Richard Elling wrote:
>>> How long should it wait? Before you answer, read through the thread:
>>> http://lists.illumos.org/pipermail/developer/2011-April/001996.html
>>> Then add your comments :-)
>>> -- richard
>>
>> But the point of my previous comment was that, according to the
>> original poster, after a while his disk did get marked as "faulted" or
>> "offlined". IF this happened during the system's initial uptime, but
>> it froze anyway, it is a problem.
>>
>> What I do not know is whether he rebooted the box within the 5 minutes
>> set aside for the timeout, or if some other processes gave up during
>> the 5 minutes of no I/O and effectively hung the system.
>>
>> If it is somehow the latter - that the inaccessible drive did (lead
>> to) hang(ing) the system past any set I/O retry timeouts - that is a
>> bug, I think.
>
> Here's the timeline:
>
> - The Intel X25-M was marked "FAULTED" Monday evening, 6pm. This was not
> detected by NexentaStor.

Is the volume-check runner enabled? All of the check runner results are
logged in the report database and sent to the system administrator via
email. I will assume that you have configured email for delivery, as it is
a required step in the installation procedure.

In any case, a disk declared FAULTED is no longer used by ZFS, except when
a pool is cleared. The volume-check runner can do this on your behalf, if
it is configured to do so. See Data Management -> Runners -> volume-check.
And, of course, these actions are recorded in the logs and report database.

> - The storage system performance diminished at 9am the next morning.
> Intermittent spikes in system load (of the VMs hosted on the unit).

This is consistent with reset storms.

> - By 11am, the Nexenta interface and console were unresponsive and the
> virtual machines dependent on the underlying storage stalled completely.

Also consistent with reset storms.

> - At 12pm, I gained physical access to the server, but I could not
> acquire console access (shell or otherwise). I did see the FMA error
> output on the screen indicating the actual device FAULT time.
> - I powered the system off, removed the Intel X25-M, and powered back
> on. The VMs picked up where they left off and the system stabilized.
>
> The total impact to end-users was 3 hours of either poor performance or
> straight downtime.

Yes, this is consistent with reset storms. Older Intel SSDs are not the
only devices that handle this poorly. In my experience a number of SATA
devices are poorly designed :-(

-- richard
Re: [zfs-discuss] ZFS Hard link space savings
On 13/06/11 11:36 AM, Jim Klimov wrote:
> Some time ago I wrote a script to find any "duplicate" files and replace
> them with hardlinks to one inode. Apparently this is only good for the
> same files which don't change separately in the future, such as distro
> archives. I can send it to you offlist, but it would be slow in your
> case because it is not quite the tool for the job (it will start by
> calculating checksums of all of your files ;) )
>
> What you might want to do and script up yourself is a recursive listing
> "find /var/opt/SUNWmsqsr/store/partition... -ls". This would print you
> the inode numbers, file sizes and link counts. Pipe it through something
> like this:
>
>   find ... -ls | awk '{print $1" "$4" "$7}' | sort | uniq
>
> And you'd get 3 columns - inode, count, size.
>
> My AWK math is a bit rusty today, so I present a monster-script like
> this to multiply and sum up the values:
>
>   ( find ... -ls | awk '{print $1" "$4" "$7}' | sort | uniq | \
>     awk '{ print $2"*"$3"+\\" }'; echo 0 ) | bc

This looks something like what I thought would have to be done; I was just
looking to see if there was something tried and tested before I had to
invent something. I was really hoping that in zdb there might have been
some magic information I could have tapped into... ;)

> Can be done cleaner, i.e. in a Perl one-liner, and if you have many
> values, that would probably complete faster too. But as a prototype this
> would do.
>
> HTH,
> //Jim
>
> PS: Why are you replacing the cool Sun Mail? Is it about Oracle
> licensing and the now-required purchase and support cost?

Yes, it is mostly about cost. We had Sun Mail for our staff and students,
with 20,000+ students on it up until Christmas time as well. We have now
migrated them to M$ Live@EDU. This leaves us with the 1500 staff who all
like to use LookOut. The Sun connector for LookOut is a bit flaky at best,
and the Oracle licensing cost for Messaging and Calendar starts at 10,000
users plus, so it is now rather expensive for what mailboxes we have left.
M$ also heavily discounts Exchange CALs to Edu, and Oracle is not as
friendly as Sun was with their JES licensing. So it is bye-bye Sun
Messaging Server for us.

> 2011-06-13 1:14, Scott Lawson writes:
>> Hi All,
>>
>> I have an interesting question that may or may not be answerable from
>> some internal ZFS semantics. [...]
>>
>> Regards,
>>
>> Scott.
Re: [zfs-discuss] Impact of L2ARC device failure and SSD recommendations
On 6/12/11 6:18 PM, "Jim Klimov" wrote:
> 2011-06-12 23:57, Richard Elling wrote:
>> How long should it wait? Before you answer, read through the thread:
>> http://lists.illumos.org/pipermail/developer/2011-April/001996.html
>> Then add your comments :-)
>> -- richard
>
> But the point of my previous comment was that, according to the original
> poster, after a while his disk did get marked as "faulted" or
> "offlined". IF this happened during the system's initial uptime, but it
> froze anyway, it is a problem.
>
> What I do not know is whether he rebooted the box within the 5 minutes
> set aside for the timeout, or if some other processes gave up during the
> 5 minutes of no I/O and effectively hung the system.
>
> If it is somehow the latter - that the inaccessible drive did (lead to)
> hang(ing) the system past any set I/O retry timeouts - that is a bug, I
> think.

Here's the timeline:

- The Intel X25-M was marked "FAULTED" Monday evening, 6pm. This was not
detected by NexentaStor.
- The storage system performance diminished at 9am the next morning:
intermittent spikes in system load (of the VMs hosted on the unit).
- By 11am, the Nexenta interface and console were unresponsive and the
virtual machines dependent on the underlying storage stalled completely.
- At 12pm, I gained physical access to the server, but I could not acquire
console access (shell or otherwise). I did see the FMA error output on the
screen indicating the actual device FAULT time.
- I powered the system off, removed the Intel X25-M, and powered back on.
The VMs picked up where they left off and the system stabilized.

The total impact to end-users was 3 hours of either poor performance or
straight downtime.

--
Edmund White
ewwh...@mac.com
Re: [zfs-discuss] ZFS Hard link space savings
On 13/06/11 10:28 AM, Nico Williams wrote:
> On Sun, Jun 12, 2011 at 4:14 PM, Scott Lawson wrote:
>> I have an interesting question that may or may not be answerable from
>> some internal ZFS semantics.
>
> This is really standard Unix filesystem semantics.

I understand this; just wanting to see if there is any easy way before I
trawl through 10 million little files.. ;)

>> [...]
>>
>> So total storage used is around ~7.5MB due to the hard linking taking
>> place on each store.
>>
>> If hard linking capability had been turned off, this same message would
>> have used 1500 x 2MB = 3GB worth of storage.
>>
>> My question is: are there any simple ways of determining the space
>> savings on each of the stores from the usage of hard links? [...]
>
> But... you just did! :) It's: number of hard links * (file size +
> sum(size of link names and/or directory slot size)). For sufficiently
> large files (say, larger than one disk block) you could approximate
> that as: number of hard links * file size. The key is the number of
> hard links, which will typically vary, but for e-mails that go to all
> users, well, you know the number of links then is the number of users.

Yes, this number varies based on the number of recipients, so it could be
as many as all users on a store.

> You could write a script to do this -- just look at the size and
> hard-link count of every file in the store, apply the above formula,
> add up the inflated sizes, and you're done.

Looks like I will have to; I was just looking for a tried and tested
method before I have to create my own one if possible, and for an easy
option before I have to sit down and develop and test a script. I have
resigned from my current job of 9 years and finish in 15 days, and have a
heck of a lot of documentation and knowledge transfer I need to do around
other UNIX systems, so I am running very short on time...

> Nico
>
> PS: Is it really the case that Exchange still doesn't deduplicate
> e-mails? Really? It's much simpler to implement dedup in a mail
> store than in a filesystem...

As a side note, Exchange 2002 and Exchange 2007 do do this. But apparently
M$ decided in Exchange 2010 that they no longer wished to do this and
dropped the capability. Bizarre to say the least, but it may come down to
changes they have made in the underlying store technology..
Re: [zfs-discuss] ZFS Hard link space savings
2011-06-13 2:28, Nico Williams writes:
> PS: Is it really the case that Exchange still doesn't deduplicate
> e-mails? Really? It's much simpler to implement dedup in a mail
> store than in a filesystem...

That's especially strange, because NTFS has hardlinks and softlinks...
Not that Microsoft provided any tools for using that, but there are
third-party programs like CygWin "ls" and the FAR File Manager.

Well, enough off-topicing ;)

//Jim
Re: [zfs-discuss] ZFS Hard link space savings
Some time ago I wrote a script to find any "duplicate" files and replace
them with hardlinks to one inode. Apparently this is only good for the
same files which don't change separately in the future, such as distro
archives. I can send it to you offlist, but it would be slow in your case
because it is not quite the tool for the job (it will start by calculating
checksums of all of your files ;) )

What you might want to do and script up yourself is a recursive listing
"find /var/opt/SUNWmsqsr/store/partition... -ls". This would print you the
inode numbers, file sizes and link counts. Pipe it through something like
this:

  find ... -ls | awk '{print $1" "$4" "$7}' | sort | uniq

And you'd get 3 columns - inode, count, size.

My AWK math is a bit rusty today, so I present a monster-script like this
to multiply and sum up the values:

  ( find ... -ls | awk '{print $1" "$4" "$7}' | sort | uniq | \
    awk '{ print $2"*"$3"+\\" }'; echo 0 ) | bc

Can be done cleaner, i.e. in a Perl one-liner, and if you have many
values, that would probably complete faster too. But as a prototype this
would do.

HTH,
//Jim

PS: Why are you replacing the cool Sun Mail? Is it about Oracle licensing
and the now-required purchase and support cost?

2011-06-13 1:14, Scott Lawson writes:
> Hi All,
>
> I have an interesting question that may or may not be answerable from
> some internal ZFS semantics. [...]
>
> Regards,
>
> Scott.

--
Jim Klimov (Климов Евгений)
CTO, JSC COS&HT (ЗАО "ЦОС и ВТ")
+7-903-7705859 (cellular)  mailto:jimkli...@cos.ru
CC: ad...@cos.ru, jimkli...@mail.ru

() ascii ribbon campaign - against html mail
/\                       - against microsoft attachments
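A minimal sketch of the Perl one-liner Jim alludes to, assuming the same
"find -ls" column layout (inode in field 1, link count in field 4, size in
field 7); the store path is illustrative:

  find /var/opt/SUNWmsqsr/store/partition -type f -ls | \
    perl -ane 'next if $seen{$F[0]}++;   # count each inode only once
               $sum += $F[3] * $F[6];    # links * size = "inflated" bytes
               END { print "$sum\n" }'

Unlike the bc prototype, this streams in a single pass and dedupes inodes
in a hash, so it should stay fast even with millions of entries.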
Re: [zfs-discuss] Impact of L2ARC device failure and SSD recommendations
On Jun 12, 2011, at 4:18 PM, Jim Klimov wrote:
> 2011-06-12 23:57, Richard Elling wrote:
>> How long should it wait? Before you answer, read through the thread:
>> http://lists.illumos.org/pipermail/developer/2011-April/001996.html
>> Then add your comments :-)
>> -- richard
>
> Interesting thread. I did not quite get the resentment against a tunable
> value instead of a hard-coded #define, though.

Tunables are evil. They increase complexity and lead to local
optimizations that interfere with systemic optimizations.

> Especially if we might want to somehow tune it per-device, i.e. CDROM,
> enterprise SAS and some commodity drive or a USB stick (or a VMware
> emulated HDD, as Ceri pointed out) might all be plugged into the same
> box and require different timeouts only the sysadmin might know about
> (the numeric values per-device). So I'd rather go with some hardcoded
> default and many tuned lines in sd.conf, probably.

yuck. I'd rather have my eye poked out with a sharp stick.

> But the point of my previous comment was that, according to the original
> poster, after a while his disk did get marked as "faulted" or
> "offlined". IF this happened during the system's initial uptime, but it
> froze anyway, it is a problem.
>
> What I do not know is whether he rebooted the box within the 5 minutes
> set aside for the timeout, or if some other processes gave up during the
> 5 minutes of no I/O and effectively hung the system.

Not likely. Much more likely that that which you were expecting was
blocked.

> If it is somehow the latter - that the inaccessible drive did (lead to)
> hang(ing) the system past any set I/O retry timeouts - that is a bug, I
> think.
>
> But maybe I'm just too annoyed with my box hanging with a more-or-less
> reproducible scenario, and now I'm barking up any tree that looks like a
> system freeze related to I/O ;)

Yep, a common reaction. I think we can be more creative...
-- richard
Re: [zfs-discuss] Impact of L2ARC device failure and SSD recommendations
2011-06-12 23:57, Richard Elling wrote:
> How long should it wait? Before you answer, read through the thread:
> http://lists.illumos.org/pipermail/developer/2011-April/001996.html
> Then add your comments :-)
> -- richard

Interesting thread. I did not quite get the resentment against a tunable
value instead of a hard-coded #define, though.

Especially if we might want to somehow tune it per-device, i.e. CDROM,
enterprise SAS and some commodity drive or a USB stick (or a VMware
emulated HDD, as Ceri pointed out) might all be plugged into the same box
and require different timeouts only the sysadmin might know about (the
numeric values per-device). So I'd rather go with some hardcoded default
and many tuned lines in sd.conf, probably.

But the point of my previous comment was that, according to the original
poster, after a while his disk did get marked as "faulted" or "offlined".
IF this happened during the system's initial uptime, but it froze anyway,
it is a problem.

What I do not know is whether he rebooted the box within the 5 minutes set
aside for the timeout, or if some other processes gave up during the 5
minutes of no I/O and effectively hung the system.

If it is somehow the latter - that the inaccessible drive did (lead to)
hang(ing) the system past any set I/O retry timeouts - that is a bug, I
think.

But maybe I'm just too annoyed with my box hanging with a more-or-less
reproducible scenario, and now I'm barking up any tree that looks like a
system freeze related to I/O ;)

//Jim
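For reference, per-device tuning along the lines Jim suggests might look
something like this in sd.conf on an illumos-derived system, assuming the
sd-config-list mechanism and its retries-* properties; the vendor/model
strings and values here are purely illustrative:

  # Vendor ID is 8 characters, space-padded, followed by a model prefix.
  # Values are illustrative, not recommendations.
  sd-config-list =
      "ATA     INTEL SSDSA2M", "retries-timeout:1,retries-busy:1",
      "SEAGATE ST373455SS",    "retries-timeout:10";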
Re: [zfs-discuss] ZFS Hard link space savings
On Sun, Jun 12, 2011 at 4:14 PM, Scott Lawson wrote:
> I have an interesting question that may or may not be answerable from
> some internal ZFS semantics.

This is really standard Unix filesystem semantics.

> [...]
>
> So total storage used is around ~7.5MB due to the hard linking taking
> place on each store.
>
> If hard linking capability had been turned off, this same message would
> have used 1500 x 2MB = 3GB worth of storage.
>
> My question is: are there any simple ways of determining the space
> savings on each of the stores from the usage of hard links? [...]

But... you just did! :) It's: number of hard links * (file size +
sum(size of link names and/or directory slot size)). For sufficiently
large files (say, larger than one disk block) you could approximate that
as: number of hard links * file size. The key is the number of hard links,
which will typically vary, but for e-mails that go to all users, well, you
know the number of links then is the number of users.

You could write a script to do this -- just look at the size and hard-link
count of every file in the store, apply the above formula, add up the
inflated sizes, and you're done.

Nico

PS: Is it really the case that Exchange still doesn't deduplicate e-mails?
Really? It's much simpler to implement dedup in a mail store than in a
filesystem...
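A rough sketch of such a script: one pass of find -ls piped into awk,
counting each inode once and applying Nico's approximation (the store path
is illustrative):

  #!/bin/sh
  # Inflated size = nlink * size, summed once per inode ($1=inode,
  # $4=link count, $7=size in "find -ls" output).
  find /store1 -type f -ls | awk '
      !seen[$1]++ {
          actual   += $7        # each inode charged once
          inflated += $4 * $7   # as if every link were a full copy
      }
      END {
          printf "actual:   %d bytes\n", actual
          printf "inflated: %d bytes\n", inflated
          printf "saved:    %d bytes\n", inflated - actual
      }'

One caveat: st_nlink counts links anywhere on the filesystem, so this
assumes (as in the message-store layout described here) that all links to
a given file live under the tree being walked.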
[zfs-discuss] ZFS Hard link space savings
Hi All,

I have an interesting question that may or may not be answerable from some
internal ZFS semantics.

I have a Sun Messaging Server which has 5 ZFS based email stores. The Sun
Messaging Server uses hard links to link identical messages together.
Messages are stored in standard SMTP MIME format, so the binary
attachments are included in the message ASCII. Each individual message is
stored in a separate file.

So, as an example, if a user sends an email with a 2MB attachment to the
staff mailing list and there are 3 staff stores with 500 users on each, it
will generate a space usage like:

/store1 = 1 x 2MB + 499 x 1KB
/store2 = 1 x 2MB + 499 x 1KB
/store3 = 1 x 2MB + 499 x 1KB

So total storage used is around ~7.5MB due to the hard linking taking
place on each store.

If hard linking capability had been turned off, this same message would
have used 1500 x 2MB = 3GB worth of storage.

My question is: are there any simple ways of determining the space savings
on each of the stores from the usage of hard links? The reason I ask is
that our educational institute wishes to migrate these stores to M$
Exchange 2010, which doesn't do message single-instancing. I need to try
and project what the storage requirement will be on the new target
environment.

If anyone has any ideas, be it ZFS based or any useful scripts that could
help here, I am all ears.

I may post this to Sun Managers as well to see if anyone there might have
any ideas on this as well.

Regards,

Scott.
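As a quick cross-check of any script: du already charges each inode only
once per traversal, so the gap between du's figure and the per-link sum is
roughly the hard-link savings. A sketch, with an illustrative store path:

  # Space actually charged (hard links counted once):
  du -sk /store1

  # Size if every hard link were a separate copy (nlink * size per inode):
  find /store1 -type f -ls | awk '!s[$1]++ { t += $4 * $7 } END { print t }'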
Re: [zfs-discuss] Impact of L2ARC device failure and SSD recommendations
On Jun 11, 2011, at 9:26 AM, Jim Klimov wrote:
> 2011-06-11 19:15, Pasi Kärkkäinen writes:
>> On Sat, Jun 11, 2011 at 08:35:19AM -0500, Edmund White wrote:
>>> I've had two incidents where performance tanked suddenly, leaving the
>>> VM guests and Nexenta SSH/Web consoles inaccessible and requiring a
>>> full reboot of the array to restore functionality. In both cases, it
>>> was the Intel X25-M L2ARC SSD that failed or was "offlined".
>>> NexentaStor failed to alert me on the cache failure, however the
>>> general ZFS FMA alert was visible on the (unresponsive) console
>>> screen.
>>>
>>> The "zpool status" output showed:
>>>
>>>   cache
>>>     c6t5001517959467B45d0 FAULTED 2 542 0 too many errors
>>>
>>> This did not trigger any alerts from within Nexenta.
>>>
>>> I was under the impression that an L2ARC failure would not impact the
>>> system. But in this case, it was the culprit. I've never seen any
>>> recommendations to RAID L2ARC for resiliency. Removing the bad SSD
>>> entirely from the server got me back running, but I'm concerned about
>>> the impact of the device failure and the lack of notification from
>>> NexentaStor.
>>
>> IIRC recently there was discussion on this list about a firmware bug on
>> the Intel X25 SSDs causing them to fail under high disk I/O with "reset
>> storms".
>
> Even if so, this does not forgive ZFS hanging - especially if it
> detected the drive failure, and especially if this drive is not required
> for redundant operation.

How long should it wait? Before you answer, read through the thread:
http://lists.illumos.org/pipermail/developer/2011-April/001996.html
Then add your comments :-)
-- richard
Re: [zfs-discuss] Impact of L2ARC device failure and SSD recommendations
On Jun 11, 2011, at 6:35 AM, Edmund White wrote:
> Posted in greater detail at Server Fault -
> http://serverfault.com/q/277966/13325

Replied in greater detail at same.

> I have an HP ProLiant DL380 G7 system running NexentaStor. The server
> has 36GB RAM, 2 LSI 9211-8i SAS controllers (no SAS expanders), 2 SAS
> system drives, 12 SAS data drives, a hot-spare disk, an Intel X25-M
> L2ARC cache and a DDRdrive PCI ZIL accelerator. This system serves NFS
> to multiple VMware hosts. I also have about 90-100GB of deduplicated
> data on the array.
>
> I've had two incidents where performance tanked suddenly, leaving the VM
> guests and Nexenta SSH/Web consoles inaccessible and requiring a full
> reboot of the array to restore functionality.

The reboot is your decision; the software will, eventually, recover.

> In both cases, it was the Intel X25-M L2ARC SSD that failed or was
> "offlined". NexentaStor failed to alert me on the cache failure, however
> the general ZFS FMA alert was visible on the (unresponsive) console
> screen.

NexentaStor fault triggers run in addition to the existing FMA and syslog
services.

> The "zpool status" output showed:
>
>   cache
>     c6t5001517959467B45d0 FAULTED 2 542 0 too many errors
>
> This did not trigger any alerts from within Nexenta.

The NexentaStor volume-check runner looks for zpool status error messages.
Check your configuration for the runner schedule; by default it is hourly.

> I was under the impression that an L2ARC failure would not impact the
> system.

With all due respect, that is a naive assumption. Any system failure can
impact the system. The worst kinds of failures are those that impact
performance. In this case, the broken SSD firmware causes very slow
response to I/O requests. It does not return an error code that says "I'm
broken"; it just responds very slowly, perhaps after other parts of the
system ask it to reset and retry a few times.

> But in this case, it was the culprit. I've never seen any
> recommendations to RAID L2ARC for resiliency. Removing the bad SSD
> entirely from the server got me back running, but I'm concerned about
> the impact of the device failure and the lack of notification from
> NexentaStor.

We have made some improvements in notification for this type of failure in
the 3.1 release. Why? Because we have seen a large number of these errors
from various disk and SSD manufacturers recently. You will notice that
Nexenta does not support these SSDs behind SAS expanders for this very
reason. At the end of the day, the resolution is to get the device fixed
or replaced. Contact your hardware provider for details.

> What's the current best-choice SSD for L2ARC cache applications these
> days? It seems as though the Intel units are no longer well-regarded.

No device is perfect. Some have better firmware, components, or design
than others. YMMV.
-- richard
Re: [zfs-discuss] ZFS receive checksum mismatch
On Jun 11, 2011, at 5:46 AM, Edward Ned Harvey wrote:
>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
>> boun...@opensolaris.org] On Behalf Of Jim Klimov
>>
>> See FEC suggestion from another poster ;)
>
> Well, of course, all storage mediums have built-in hardware FEC. At
> least disk & tape for sure. But naturally you can't always trust it
> blindly...
>
> If you simply want to layer on some more FEC, there must be some
> standard generic FEC utilities out there, right?
>   zfs send | fec > /dev/...
> Of course this will inflate the size of the data stream somewhat, but
> improves the reliability...

The problem is that many FEC algorithms are good at correcting a few bits.
For example, disk drives tend to correct somewhere on the order of 8 bytes
per block. Tapes can correct more bytes per block. I've collected a large
number of error reports showing the bitwise analysis of data corruption
we've seen in ZFS, and there is only one case where a stuck bit was
detected. Most of the corruptions I see are multiple bytes, and many are
zero-filled. In other words, if you are expecting to use FEC and FEC only
corrects a few bits, you might be disappointed.
-- richard
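For what it's worth, one generic way to layer block-level FEC onto a send
stream is the par2 utility, whose Reed-Solomon recovery blocks repair whole
damaged blocks rather than individual bits, which better matches the
multi-byte, zero-filled corruption described above. A sketch, with
illustrative names and a redundancy level you would tune yourself:

  # Capture the stream, then add ~10% recovery data alongside it:
  zfs send tank/fs@snap > /backup/fs.zfs
  par2 create -r10 /backup/fs.zfs

  # Later: verify/repair the stream before receiving it:
  par2 repair /backup/fs.zfs.par2
  zfs receive tank/fs2 < /backup/fs.zfs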
Re: [zfs-discuss] Tuning disk failure detection?
On May 10, 2011, at 9:18 AM, Ray Van Dolson wrote:
> We recently had a disk fail on one of our whitebox (SuperMicro) ZFS
> arrays (Solaris 10 U9).
>
> The disk began throwing errors like this:
>
>   May 5 04:33:44 dev-zfs4 scsi: [ID 243001 kern.warning] WARNING:
>   /pci@0,0/pci8086,3410@9/pci15d9,400@0 (mpt_sas0):
>   May 5 04:33:44 dev-zfs4 mptsas_handle_event_sync: IOCStatus=0x8000,
>   IOCLogInfo=0x31110610

These are commonly seen when hardware is having difficulty and devices are
being reset.

> And errors for the drive were incrementing in iostat -En output.
> Nothing was seen in fmdump.

That is unusual, because the ereports are sent along with the code that
increments the error counters in sd. Are you sure you ran "fmdump -e" as
root or with appropriate privileges?

> Unfortunately, it took about three hours for ZFS (or maybe it was MPT)
> to decide the drive was actually dead:
>
>   May 5 07:41:06 dev-zfs4 scsi: [ID 107833 kern.warning] WARNING:
>   /scsi_vhci/disk@g5000c5002cbc76c0 (sd4):
>   May 5 07:41:06 dev-zfs4 drive offline
>
> During these three hours the I/O performance on this server was pretty
> bad and caused issues for us. Once the drive "failed" completely, ZFS
> pulled in a spare and all was well.
>
> My question is -- is there a way to tune the MPT driver or even ZFS
> itself to be more/less aggressive on what it sees as a "failure"
> scenario?

The mpt driver is closed source; contact the source author for such
details. mpt_sas is open source, but the decision to retire a device for
Solaris-derived OSes is made via the Fault Management Architecture (FMA)
agents. Many of these have tunable algorithms, but AFAIK they are only
documented in source. That said, there are failure modes that do not fit
the current algorithms very well. Feel free to propose alternatives.

> I suppose this would have been handled differently / better if we'd
> been using real Sun hardware?

Maybe, maybe not. These are generic conditions and can be seen on all
sorts of hardware under a wide variety of failure conditions.
-- richard

> Our other option is to watch better for log entries similar to the
> above and either alert someone or take some sort of automated action.
> I'm hoping there's a better way to tune this via driver or ZFS settings,
> however.
>
> Thanks,
> Ray
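Following Richard's point about privileges, the ereport log can be
inspected directly; the flags below are standard fmdump options, and the
date matches Ray's syslog excerpt:

  # Full error-report (ereport) log, verbose, with privileges:
  pfexec fmdump -eV

  # Narrow to the day of the incident (ddmonyy form):
  pfexec fmdump -e -t 05May11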
Re: [zfs-discuss] zpool import crashes SX11 trying to recover a corrupted zpool
Did you try a read-only import as well? I THINK it goes like this:

  zpool import -o ro -o cachefile=none -F -f badpool

Did you manage to capture any error output? For example, is it an option
for you to set up a serial console and copy-paste the error text from the
serial terminal on another machine?

As far as I know there are no other releases of Solaris 11 yet, and since
the code is now developed behind closed doors (due to be opened after the
Solaris 11 OS public GA release), no other implementations of ZFS match
its v31 features. In part (another part being licensing) this is why the
open community at large shunned the S11X release - its ZFS version is not
yet interoperable (unlike the options with OpenIndiana on different
kernels and FreeBSD, all supporting zpool v28; not so sure about
Linux-FUSE, but maybe it is close too), and you can't look at the source
code to see what might go wrong, or build a debug version which would not
kernel-panic.

Speaking of which, you might try to force dumping the kernel memory to a
dedicated volume/slice/partition so as to stacktrace it with the kernel
debugger. Maybe that would yield something...

//Jim
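Jim's kernel-dump suggestion, sketched, assuming a dedicated dump zvol
already exists; paths are illustrative:

  # Point crash dumps at a dedicated zvol and pick a savecore directory:
  pfexec dumpadm -d /dev/zvol/dsk/rpool/dump
  pfexec dumpadm -s /var/crash/`hostname`

  # After the panic, extract the dump and open it in the debugger:
  pfexec savecore
  cd /var/crash/`hostname` && pfexec mdb unix.0 vmcore.0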
Re: [zfs-discuss] Impact of L2ARC device failure and SSD recommendations
On Sat, Jun 11, 2011 at 08:26:34PM +0400, Jim Klimov wrote:
> 2011-06-11 19:15, Pasi Kärkkäinen writes:
>> On Sat, Jun 11, 2011 at 08:35:19AM -0500, Edmund White wrote:
>>> I've had two incidents where performance tanked suddenly, leaving the
>>> VM guests and Nexenta SSH/Web consoles inaccessible and requiring a
>>> full reboot of the array to restore functionality. In both cases, it
>>> was the Intel X25-M L2ARC SSD that failed or was "offlined".
>>> NexentaStor failed to alert me on the cache failure, however the
>>> general ZFS FMA alert was visible on the (unresponsive) console
>>> screen.
>>>
>>> The "zpool status" output showed:
>>>
>>>   cache
>>>     c6t5001517959467B45d0 FAULTED 2 542 0 too many errors
>>>
>>> This did not trigger any alerts from within Nexenta.
>>>
>>> I was under the impression that an L2ARC failure would not impact the
>>> system. But in this case, it was the culprit. I've never seen any
>>> recommendations to RAID L2ARC for resiliency. Removing the bad SSD
>>> entirely from the server got me back running, but I'm concerned about
>>> the impact of the device failure and the lack of notification from
>>> NexentaStor.
>>
>> IIRC recently there was discussion on this list about a firmware bug on
>> the Intel X25 SSDs causing them to fail under high disk I/O with "reset
>> storms".
>
> Even if so, this does not forgive ZFS hanging - especially if it
> detected the drive failure, and especially if this drive is not required
> for redundant operation.
>
> I've seen similar bad behaviour on my oi_148a box when I tested USB
> flash devices as L2ARC caches and occasionally they died by slightly
> moving out of the USB socket due to vibration or whatever reason ;)
>
> Similarly, this oi_148a box hung upon loss of the SATA connection to a
> drive in the raidz2 disk set due to unreliable cable connectors, while
> it should have stalled I/Os to that pool but otherwise the system should
> have remained responsive (tested failmode=continue and failmode=wait on
> different occasions).
>
> So I can relate - these things happen, they do annoy, and I hope they
> will be fixed sometime soon so that ZFS matches its docs and promises ;)

True, definitely sounds like a bug in ZFS as well..

-- Pasi
Re: [zfs-discuss] Q: pool didn't expand. why? can I force it?
Indeed it was! Thanks!!
Re: [zfs-discuss] Q: pool didn't expand. why? can I force it?
On Sun, Jun 12, 2011 at 3:54 AM, Johan Eliasson
<johan.eliasson.j...@gmail.com> wrote:
> I replaced a smaller disk in my tank2, so now they're all 2TB. But look,
> zfs still thinks it's a pool of 1.5 TB disks:
>
> nebol@filez:~# zpool list tank2
> NAME    SIZE   ALLOC  FREE   CAP  DEDUP  HEALTH  ALTROOT
> tank2   5.44T  4.20T  1.24T  77%  1.00x  ONLINE  -
>
> nebol@filez:~# zpool status tank2
>   pool: tank2
>  state: ONLINE
>  scrub: none requested
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         tank2       ONLINE       0     0     0
>           raidz1-0  ONLINE       0     0     0
>             c8t0d0  ONLINE       0     0     0
>             c8t1d0  ONLINE       0     0     0
>             c8t2d0  ONLINE       0     0     0
>             c8t3d0  ONLINE       0     0     0
>
> errors: No known data errors
>
> [...]
>
> So the question is, why didn't it expand? And can I fix it?

Autoexpand is likely turned off.
http://download.oracle.com/docs/cd/E19253-01/819-5461/githb/index.html

--Tim
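If so, something along these lines should pick up the new capacity; pool
and device names are taken from the question, and "zpool online -e"
expands a device that is already online:

  # Let future replacements grow the vdev automatically:
  zpool set autoexpand=on tank2

  # Expand the already-replaced disks in place:
  zpool online -e tank2 c8t0d0 c8t1d0 c8t2d0 c8t3d0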
[zfs-discuss] Q: pool didn't expand. why? can I force it?
I replaced a smaller disk in my tank2, so now they're all 2TB. But look,
zfs still thinks it's a pool of 1.5 TB disks:

nebol@filez:~# zpool list tank2
NAME    SIZE   ALLOC  FREE   CAP  DEDUP  HEALTH  ALTROOT
tank2   5.44T  4.20T  1.24T  77%  1.00x  ONLINE  -

nebol@filez:~# zpool status tank2
  pool: tank2
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank2       ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            c8t0d0  ONLINE       0     0     0
            c8t1d0  ONLINE       0     0     0
            c8t2d0  ONLINE       0     0     0
            c8t3d0  ONLINE       0     0     0

errors: No known data errors

and:

   6. c8t0d0
      /pci@0,0/pci8086,29f1@1/pci8086,32c@0/pci11ab,11ab@1/disk@0,0
   7. c8t1d0
      /pci@0,0/pci8086,29f1@1/pci8086,32c@0/pci11ab,11ab@1/disk@1,0
   8. c8t2d0
      /pci@0,0/pci8086,29f1@1/pci8086,32c@0/pci11ab,11ab@1/disk@2,0
   9. c8t3d0

So the question is, why didn't it expand? And can I fix it?