Re: [zfs-discuss] How many disk in one pool
Here's an example of a ZFS-based product you can buy with a large number of disks in the volume: http://www.aberdeeninc.com/abcatg/petarack.htm
360 3TB drives: a full petabyte of storage (1080TB) in a single rack, under a single namespace or volume.

On Sat, Oct 6, 2012 at 11:48 AM, Richard Elling wrote:
> On Oct 5, 2012, at 1:57 PM, Albert Shih wrote:
>
>> Hi all,
>>
>> I'm currently running ZFS under FreeBSD. I've a question about how many
>> disks I «can» have in one pool.
>>
>> At this moment I'm running one server (FreeBSD 9.0) with 4 MD1200
>> (Dell) enclosures, meaning 48 disks. I've configured the pool with 4 raidz2
>> vdevs (one on each MD1200).
>>
>> From what I understand I can add more MD1200s. But if I lose one MD1200
>> for any reason I lose the entire pool.
>>
>> In your experience, what's the «limit»? 100 disks?
>
> I can't speak for current FreeBSD, but I've seen more than 400
> disks (HDDs) in a single pool.
>
> -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup
Reducing the record size would negatively impact performance. For the rationale, see the section titled "Match Average I/O Block Sizes" in my blog post on filesystem caching: http://www.thezonemanager.com/2009/03/filesystem-cache-optimization.html

Brad

Brad Diggs | Principal Sales Consultant | 972.814.3698
eMail: brad.di...@oracle.com
Tech Blog: http://TheZoneManager.com
LinkedIn: http://www.linkedin.com/in/braddiggs

On Dec 29, 2011, at 8:08 AM, Robert Milkowski wrote:

Try reducing recordsize to 8K or even less *before* you put any data. This can potentially improve your dedup ratio and keep it higher after you start modifying data.

From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Brad Diggs
Sent: 28 December 2011 21:15
To: zfs-discuss discussion list
Subject: Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup

As promised, here are the findings from my testing. I created 6 directory server instances where the first instance has roughly 8.5GB of data. Then I initialized the remaining 5 instances from a binary backup of the first instance. Then I rebooted the server to start off with an empty ZFS cache. The following table shows the increased L1ARC size, increased search rate performance, and increased CPU% busy while starting and applying load to each successive directory server instance. The L1ARC cache grew a little with each additional instance but largely stayed the same size. Likewise, the ZFS dedup ratio remained the same because no data on the directory server instances was changing.

However, once I started modifying the data of the replicated directory server topology, the caching efficiency quickly diminished. The following table shows that the delta for each instance increased by roughly 2GB after only 300k changes. I suspect the divergence in data as seen by ZFS most likely occurs because deduplication occurs at the block level rather than at the byte level. When a write is sent to one directory server instance, the exact same write is propagated to the other 5 instances and therefore should be considered a duplicate. However, this was not the case. There could be other reasons for the divergence as well.

The two key takeaways from this exercise were as follows. There is tremendous caching potential through the use of ZFS deduplication. However, the current block-level deduplication does not benefit directory as much as it perhaps could if deduplication occurred at the byte level rather than the block level. It could very well be that byte-level deduplication doesn't work much better; until that option is available, we won't know for sure.

Regards,
Brad

Brad Diggs | Principal Sales Consultant
Tech Blog: http://TheZoneManager.com
LinkedIn: http://www.linkedin.com/in/braddiggs

On Dec 12, 2011, at 10:05 AM, Brad Diggs wrote:

Thanks everyone for your input on this thread. It sounds like there is sufficient weight behind the affirmative that I will include this methodology in my performance analysis test plan. If the performance testing goes well, I will share some of the results when we conclude in the January/February timeframe.

Regarding the great dd use case provided earlier in this thread, the L1 and L2 ARC detect streaming reads such as those from dd and prevent them from populating the cache. See my previous blog post at the link below for a way around this protective caching control of ZFS.

http://www.thezonemanager.com/2010/02/directory-data-priming-strategies.html

Thanks again!
Brad

Brad Diggs | Principal Sales Consultant
Tech Blog: http://TheZoneManager.com
LinkedIn: http://www.linkedin.com/in/braddiggs

On Dec 8, 2011, at 4:22 PM, Mark Musante wrote:

You can see the original ARC case here:
http://arc.opensolaris.org/caselog/PSARC/2009/557/20091013_lori.alt

On 8 Dec 2011, at 16:41, Ian Collins wrote:

On 12/9/11 12:39 AM, Darren J Moffat wrote:

On 12/07/11 20:48, Mertol Ozyoney wrote:

Unfortunately the answer is no. Neither the L1 nor L2 cache is dedup aware. The only vendor I know of that can do this is NetApp. In fact, most of our functions, like replication, are not dedup aware. For example, technically it's possible to optimize our replication so that it does not send data chunks if a data chunk with the same checksum exists on the target, without enabling dedup on target and source.

We already do that with 'zfs send -D':

    -D  Perform dedup processing on the stream. Deduplicated streams
        cannot be received on systems that do not support the stream
        deduplication feature.

Is there any more published information on how this feature works?

-- Ian.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup
S11 FCS.

Brad

Brad Diggs | Principal Sales Consultant | 972.814.3698
eMail: brad.di...@oracle.com
Tech Blog: http://TheZoneManager.com
LinkedIn: http://www.linkedin.com/in/braddiggs

On Dec 29, 2011, at 8:11 AM, Robert Milkowski wrote:

And these results are from S11 FCS I assume. On older builds or Illumos-based distros I would expect the L1 ARC to grow much bigger.

From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Brad Diggs
Sent: 28 December 2011 21:15
To: zfs-discuss discussion list
Subject: Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup

As promised, here are the findings from my testing. I created 6 directory server instances where the first instance has roughly 8.5GB of data. [...]

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup
Jim,

You are spot on. I was hoping that the writes would be close enough to identical that there would be a high ratio of duplicate data, since I use the same record size, page size, compression algorithm, etc. However, that was not the case. The main thing that I wanted to prove, though, was that if the data is the same, the L1 ARC only caches the data that was actually written to storage. That is a really cool thing! I am sure there will be future study on this topic as it applies to other scenarios.

With regards to directory engineering investing any energy into optimizing ODSEE DS to more effectively leverage this caching potential, that won't happen. OUD far outperforms ODSEE. That said, OUD may get some focus in this area. However, time will tell on that one.

For now, I hope everyone benefits from the little that I did validate.

Have a great day!
Brad

Brad Diggs | Principal Sales Consultant
Tech Blog: http://TheZoneManager.com
LinkedIn: http://www.linkedin.com/in/braddiggs

On Dec 29, 2011, at 4:45 AM, Jim Klimov wrote:

Thanks for running and publishing the tests :)
A comment on your testing technique follows, though.

2011-12-29 1:14, Brad Diggs wrote:
> As promised, here are the findings from my testing. I created 6
> directory server instances ...
> However, once I started modifying the data of the replicated directory
> server topology, the caching efficiency quickly diminished. The
> following table shows that the delta for each instance increased by
> roughly 2GB after only 300k of changes.
> I suspect the divergence in data as seen by ZFS deduplication most
> likely occurs because deduplication occurs at the block level rather
> than at the byte level. When a write is sent to one directory server
> instance, the exact same write is propagated to the other 5 instances
> and therefore should be considered a duplicate. However this was not
> the case. There could be other reasons for the divergence as well.

Hello, Brad,

If you tested with Sun DSEE (and I have no reason to believe other descendants of the iPlanet Directory Server would work differently under the hood), then there are two factors hindering your block-dedup gains:

1) The data is stored in the backend BerkeleyDB binary file. In Sun DSEE7 and/or in ZFS this could also be compressed data. Since ZFS dedups unique blocks, including same data at same offsets, it is quite unlikely you'd get the same data often enough. For example, each database might position the same userdata blocks at different offsets due to garbage collection or whatever other optimisation the DB might think of, making on-disk blocks different and undedupable.

You might look at whether it is possible to tune the database to write in sector-sized to min-block-sized (512b/4096b) records and consistently use the same DSEE compression (or lack thereof) - in this case you might get more identical blocks and win with dedup. But you'll likely lose with compression, especially of the empty sparse structure which a database initially is.

2) During replication each database actually becomes unique. There are hidden records with an "ns" prefix which mark when the record was created and replicated, who initiated it, etc. Timestamps in the data already warrant uniqueness ;)

This might be an RFE for the DSEE team though - to keep such volatile metadata separately from userdata. Then your DS instances would more likely dedup well after replication, and unique metadata would be stored separately and stay unique. You might even keep it in a different dataset with no dedup, then... :)

---

So, at the moment, this expectation does not hold true: "When a write is sent to one directory server instance, the exact same write is propagated to the other five instances and therefore should be considered a duplicate." These writes are not exact.

HTH,
//Jim Klimov

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
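Jim's point (1), that identical user data landing at different file offsets defeats block-level dedup, can be illustrated with a short sketch (my own illustration; the 4 KiB block size and the one-byte shift are arbitrary assumptions, not DSEE or ZFS behavior):

```python
import hashlib
import random

BLOCK = 4096  # assumed block size for the illustration

def block_hashes(data):
    """Checksum each aligned fixed-size block, the granularity block-level dedup works at."""
    return {hashlib.sha256(data[i:i + BLOCK]).hexdigest()
            for i in range(0, len(data), BLOCK)}

random.seed(42)
payload = bytes(random.getrandbits(8) for _ in range(BLOCK * 32))  # 128 KiB of "user data"

aligned = block_hashes(payload)
# The same user data written one byte later in the backend file,
# e.g. because a preceding database record grew by one byte:
shifted = block_hashes(b"\x00" + payload)

print(len(aligned))            # 32 unique blocks in the original layout
print(len(aligned & shifted))  # 0 blocks shared, so nothing dedups
```

Every block checksum changes even though the user data is byte-for-byte identical, which matches the divergence Brad observed.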
Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup
Thanks everyone for your input on this thread. It sounds like there is sufficient weight behind the affirmative that I will include this methodology in my performance analysis test plan. If the performance testing goes well, I will share some of the results when we conclude in the January/February timeframe.

Regarding the great dd use case provided earlier in this thread, the L1 and L2 ARC detect streaming reads such as those from dd and prevent them from populating the cache. See my previous blog post at the link below for a way around this protective caching control of ZFS.

http://www.thezonemanager.com/2010/02/directory-data-priming-strategies.html

Thanks again!
Brad

Brad Diggs | Principal Sales Consultant
Tech Blog: http://TheZoneManager.com
LinkedIn: http://www.linkedin.com/in/braddiggs

On Dec 8, 2011, at 4:22 PM, Mark Musante wrote:

You can see the original ARC case here:
http://arc.opensolaris.org/caselog/PSARC/2009/557/20091013_lori.alt

On 8 Dec 2011, at 16:41, Ian Collins wrote:

On 12/9/11 12:39 AM, Darren J Moffat wrote:

On 12/07/11 20:48, Mertol Ozyoney wrote:

Unfortunately the answer is no. Neither the L1 nor L2 cache is dedup aware. The only vendor I know of that can do this is NetApp. In fact, most of our functions, like replication, are not dedup aware. For example, technically it's possible to optimize our replication so that it does not send data chunks if a data chunk with the same checksum exists on the target, without enabling dedup on target and source.

We already do that with 'zfs send -D':

    -D  Perform dedup processing on the stream. Deduplicated streams
        cannot be received on systems that do not support the stream
        deduplication feature.

Is there any more published information on how this feature works?

-- Ian.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Improving L1ARC cache efficiency with dedup
Hello,

I have a hypothetical question regarding ZFS deduplication. Does the L1ARC cache benefit from deduplication, in the sense that the L1ARC will only need to cache one copy of the deduplicated data versus many copies? Here is an example:

Imagine that I have a server with 2TB of RAM and a PB of disk storage. On this server I create a single 1TB data file that is full of unique data. Then I make 9 copies of that file, giving each file a unique name and location within the same ZFS zpool. If I start up 10 application instances where each application reads all of its own copy of the data, will the L1ARC contain only the deduplicated data, or will it cache separate copies of the data from each file? In simpler terms, will the L1ARC require 10TB of RAM or just 1TB of RAM to cache all 10 1TB files' worth of data?

My hope is that since the data only physically occupies 1TB of storage via deduplication, the L1ARC will also only require 1TB of RAM for the data.

Note that I know the deduplication table will use the L1ARC as well. However, the focus of my question is on how the L1ARC would benefit from a data caching standpoint.

Thanks in advance!
Brad

Brad Diggs | Principal Sales Consultant
Tech Blog: http://TheZoneManager.com
LinkedIn: http://www.linkedin.com/in/braddiggs

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
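The arithmetic in the question can be phrased as a toy model (my own sketch of the hoped-for behavior, not a statement about how the ARC is actually implemented): if cached data is keyed by the deduplicated on-disk block rather than by file, ten logical copies cost one copy of cache.

```python
# Toy model: a file is a list of block IDs; dedup makes all 10 copies
# reference the same on-disk blocks.
original = [f"block-{i}" for i in range(1000)]   # stand-in for the unique 1TB file
copies = [list(original) for _ in range(10)]     # 10 logical copies, same blocks

# A dedup-aware cache stores each unique block once...
dedup_aware_entries = len({blk for f in copies for blk in f})
# ...while a dedup-unaware cache stores one entry per logical reference.
naive_entries = sum(len(f) for f in copies)

print(dedup_aware_entries)  # 1000  ("1TB" of cache)
print(naive_entries)        # 10000 ("10TB" of cache)
```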
Re: [zfs-discuss] OpenIndiana | ZFS | scrub | network | awful slow
3GB per TB would be a better ballpark estimate.

On Wed, Jun 15, 2011 at 8:17 PM, Daniel Carosone wrote:
> On Wed, Jun 15, 2011 at 07:19:05PM +0200, Roy Sigurd Karlsbakk wrote:
>>
>> Dedup is known to require a LOT of memory and/or L2ARC, and 24GB isn't
>> really much with 34TBs of data.
>
> The fact that your second system lacks the l2arc cache device is absolutely
> your prime suspect.
>
> --
> Dan.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
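Spelling that ballpark out against the numbers in this thread (3GB of dedup metadata per TB of pool; a rule of thumb from the discussion, not an exact sizing):

```python
# Back-of-the-envelope dedup memory check using the figures from this thread.
pool_tb = 34             # pool size mentioned above
ram_gb = 24              # RAM in the struggling system
ballpark_gb_per_tb = 3   # rough estimate quoted above

needed_gb = pool_tb * ballpark_gb_per_tb
print(needed_gb)           # 102 GB wanted for dedup metadata
print(needed_gb > ram_gb)  # True: 24 GB falls far short without an L2ARC
```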
Re: [zfs-discuss] Disk space size, used, available mismatch
Thank you for your insight. This is a system that was handed down to me when another sysadmin left for greener pastures. There were no quotas set on the system. I used zfs destroy to free up some space and did put a quota on it. I still have 0 free space available; I think this is due to the quota limit. Before I rebooted I had about a 68GB boot pool. After the zfs destroy I had about 1.7GB free. I put a 66.5GB quota on it, which I am hitting, so services will not start up. I don't want to saw off the tree branch I am sitting on, so I am reluctant to increase the quota too much. Here are some questions I have:

1) zfs destroy did free up a snapshot, but it is still showing up in lustatus. How do I correct this?

2) This system is installed with everything under / so the ETL team can fill up root without bounds. What are the best practices for separating filesystems in ZFS so I can bound the ETL team without affecting the OS?

3) I have captured all the critical data onto SAN disk and am thinking about jumpstarting the host cleanly. That way I will have a known baseline to start with. Does anyone have any suggestions here?

4) We deal with very large data sets. These usually live just in Oracle, but this host is for ETL and Informatica processing. What would be a good quota to set so I have a back door onto the system to take care of problems?

Thanks for your feedback.
-- This message posted from opensolaris.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] A few questions
> As for certified systems, It's my understanding that Nexenta themselves don't
> "certify" anything. They have systems which are recommended and supported by
> their network of VARs.

The certified solutions listed on Nexenta's website were certified by Nexenta.

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] ZFS Administration Console
I am new to OpenSolaris and I have been reading about and seeing screenshots of the ZFS Administration Console. Every post I can find about it is from about two years ago. Is this option no longer available on OpenSolaris, and if it is available, how do I set it up and use it? -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Dedup relationship between pool and filesystem
For de-duplication to perform well you need to be able to fit the de-dup table in memory. Is a good rule of thumb for the needed RAM: size = (pool capacity / average block size) * 270 bytes? Or perhaps size / expected_dedup_ratio? And if you limit de-dup to certain datasets in the pool, how would this calculation change? -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
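Written out, the rule of thumb in the question looks like this (a sketch of the formula as quoted; the 10 TiB pool, 64 KiB average block size, and the dedup-ratio divisor are illustrative assumptions, not ZFS documentation):

```python
def ddt_ram_bytes(capacity_bytes, avg_block_bytes, dedup_ratio=1.0, entry_bytes=270):
    """Estimate dedup-table RAM: one ~270-byte entry per unique block.

    Dividing by the expected dedup ratio models the idea that only
    unique blocks need DDT entries (the open question in this post).
    """
    unique_blocks = capacity_bytes / avg_block_bytes / dedup_ratio
    return unique_blocks * entry_bytes

TiB = 1 << 40
GiB = 1 << 30

# Hypothetical 10 TiB pool with a 64 KiB average block size:
print(ddt_ram_bytes(10 * TiB, 64 * 1024) / GiB)                 # ~42.2 GiB
print(ddt_ram_bytes(10 * TiB, 64 * 1024, dedup_ratio=2) / GiB)  # ~21.1 GiB
```

Limiting dedup to certain datasets would shrink `capacity_bytes` to just the space those datasets consume, which is one plausible way the calculation changes.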
Re: [zfs-discuss] zfs compression with Oracle - anyone implemented?
Ed,

See my answers inline:

"I don't think your question is clear. What do you mean 'on oracle backed by storage luns?'"

We'll be using LUNs from a storage array rather than ZFS-controlled disks. The LUNs are mapped to the db server and from there initialized under ZFS.

"Do you mean 'on oracle hardware?'"

On Sun/Oracle x86 hardware.

"Do you mean you plan to run oracle database on the server, with ZFS under the database?"

Yes.

Generally speaking, you can enable compression on any zfs filesystem; the cpu overhead is not very big, and the compression level is not very strong by default. However, if the data you have is generally incompressible, any overhead is a waste.

-- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] ZFS on solid state as disk rather than L2ARC...
Has anyone done much testing of just using solid state devices (F20 or F5100) as pool disks for ZFS? Are there any concerns with running in this mode versus using solid state devices for L2ARC cache?

Second, has anyone done this sort of testing with MLC-based solid state drives? What has your experience been?

Thanks in advance!
Brad

Brad Diggs | Principal Security Sales Consultant
Oracle North America Technology Organization
16000 Dallas Parkway, Dallas, TX 75248
Tech Blog: http://TheZoneManager.com
LinkedIn: http://www.linkedin.com/in/braddiggs

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] zfs compression with Oracle - anyone implemented?
Hi! I've been scouring the forums and web for admins/users who have deployed zfs with compression enabled on Oracle backed by storage array luns. Any problems with cpu/memory overhead? -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Dedup - Does "on" imply "sha256?"
Correct, but presumably "for a limited time only". I would think that over time, as the technology improves, the default would change. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] OpenStorage Summit
Just wanted to make a quick announcement that there will be an OpenStorage Summit in Palo Alto, CA in late October. The conference should have a lot of good OpenSolaris talks, with ZFS experts such as Bill Moore, Adam Leventhal, and Ben Rockwood already planning to give presentations. The conference is open to other storage solutions, and we also expect participation from FreeNAS, OpenFiler, and Lustre, for example. There will be presentations on SSDs, ZFS basics, performance tuning, etc. The agenda is still being formed, as we are hoping to get more presentation proposals from the community. To submit a proposal, send an email to summit2...@nexenta.com. For additional details or to take advantage of early bird registration, go to http://nexenta-summit2010.eventbrite.com. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Possible to save custom properties on a zfs file system?
Peter - Here is an example, where the company myco wants to add a property "myprop" to a file system "myfs" contained within the pool "mypool". zfs set myco:myprop=11 mypool/myfs On Mon, Aug 2, 2010 at 1:45 PM, Peter Taps wrote: > Folks, > > I need to store some application-specific settings for a ZFS filesystem. Is > it possible to extend a ZFS filesystem and add additional properties? > > Thank you in advance for your help. > > Regards, > Peter > -- > This message posted from opensolaris.org > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss > ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] hybrid drive: flash and platters
Hello,

As an avid fan of applying flash technologies to the storage stratum, I researched the DmCache project (maintained here). It appears that the DmCache project is quite a bit behind L2ARC but headed in the right direction.

I found the lwn article very interesting, as it is effectively a Linux application of the L2ARC idea to improve MySQL performance. I had proposed the same idea in my blog post titled "Filesystem Cache Optimization Strategies". The net there is that if you can cache the data in the filesystem cache, you can improve overall performance by reducing the I/O to disk. I had hoped to have someone do some benchmarking of MySQL on a cache-optimized server with F20 PCIe flash cards but never got around to it.

So, if you want to get all of the caching benefits of DmCache, just run your app on Solaris 10 today. ;-)

Have a great day!
Brad

Brad Diggs | Principal Security Sales Consultant | +1.972.814.3698
Oracle North America Technology Organization
16000 Dallas Parkway, Dallas, TX 75248
eMail: brad.di...@oracle.com
Tech Blog: http://TheZoneManager.com
LinkedIn: http://www.linkedin.com/in/braddiggs

On May 21, 2010, at 8:00 PM, David Magda wrote:

Seagate is planning on releasing a disk that's part spinning rust and part flash: http://www.theregister.co.uk/2010/05/21/seagate_momentus_xt/

The design will have the flash be transparent to the operating system, but I wish they would have some way to access the two components separately. ZFS could certainly make use of it, and Linux is also working on a capability:

http://kernelnewbies.org/KernelProjects/DmCache
http://lwn.net/Articles/385442/

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] replaced disk...copy back completed but spare is in use
Thanks! -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] replaced disk...copy back completed but spare is in use
I yanked a disk from the test pool to simulate failure and test hot spare failover - everything seemed fine until the copy back completed. The hot spare is still showing as in use... do we need to remove the spare from the pool to get it to detach?

# zpool status
  pool: ZPOOL.TEST
 state: ONLINE
 scrub: resilver completed after 7h55m with 0 errors on Tue May 4 16:33:33 2010
config:

        NAME                        STATE   READ WRITE CKSUM
        ZPOOL.TEST                  ONLINE     0     0     0
          mirror                    ONLINE     0     0     0
            c10t5000C5001A3B6695d0  ONLINE     0     0     0
            c10t5000C5001A3CED7Fd0  ONLINE     0     0     0
            c10t5000C5001A5A45C1d0  ONLINE     0     0     0
          mirror                    ONLINE     0     0     0
            c10t5000C5001A6B2300d0  ONLINE     0     0     0
            c10t5000C5001A6BC6C6d0  ONLINE     0     0     0
            c10t5000C5001A6C3439d0  ONLINE     0     0     0
          mirror                    ONLINE     0     0     0
            c10t5000C5001A6F177Bd0  ONLINE     0     0     0
            c10t5000C5001A6FDB0Bd0  ONLINE     0     0     0
            c10t5000C5001A6FFF86d0  ONLINE     0     0     0
          mirror                    ONLINE     0     0     0
            c10t5000C5001A39D7BEd0  ONLINE     0     0     0
            c10t5000C5001A60BED0d0  ONLINE     0     0     0
            c10t5000C5001A70D8AAd0  ONLINE     0     0     0
          mirror                    ONLINE     0     0     0
            c10t5000C5001A70D9B0d0  ONLINE     0     0     0
            c10t5000C5001A70D89Ed0  ONLINE     0     0     0
            c10t5000C5001A70D719d0  ONLINE     0     0     0
          mirror                    ONLINE     0     0     0
            c10t5000C5001A700E07d0  ONLINE     0     0     0
            c10t5000C5001A701A12d0  ONLINE     0     0     0
            c10t5000C5001A701CD0d0  ONLINE     0     0     0
          mirror                    ONLINE     0     0     0
            c10t5000C5001A702c10Ed0 ONLINE     0     0     0
            c10t5000C5001A702C8Ed0  ONLINE     0     0     0
            c10t5000C5001A703D23d0  ONLINE     0     0     0
          mirror                    ONLINE     0     0     0
            c10t5000C5001A703FADd0  ONLINE     0     0     0
            c10t5000C5001A707D86d0  ONLINE     0     0     0
            c10t5000C5001A707EDCd0  ONLINE     0     0     0
          mirror                    ONLINE     0     0     0
            c10t5000C5001A7013D4d0  ONLINE     0     0     0
            c10t5000C5001A7013E6d0  ONLINE     0     0     0
            c10t5000C5001A7013FDd0  ONLINE     0     0     0
          mirror                    ONLINE     0     0     0
            c10t5000C5001A7021ADd0  ONLINE     0     0     0
            c10t5000C5001A7028B6d0  ONLINE     0     0     0
            c10t5000C5001A7029A2d0  ONLINE     0     0     0
          mirror                    ONLINE     0     0     0
            c10t5000C5001A7036F4d0  ONLINE     0     0     0
            c10t5000C5001A7053ADd0  ONLINE     0     0     0
            spare                   ONLINE 6.05M     0     0
              c10t5000C5001A7069CAd0  ONLINE   0     0     0  171G resilvered
              c10t5000C5001A703651d0  ONLINE   0     0     0
          mirror                    ONLINE     0     0     0
            c10t5000C5001A70104Dd0  ONLINE     0     0     0
            c10t5000C5001A70126Fd0  ONLINE     0     0     0
            c10t5000C5001A70183Cd0  ONLINE     0     0     0
          mirror                    ONLINE     0     0     0
            c10t5000C5001A70296Cd0  ONLINE     0     0     0
            c10t5000C5001A70395Ed0  ONLINE     0     0     0
            c10t5000C5001A70587Dd0  ONLINE     0     0     0
          mirror                    ONLINE     0     0     0
            c10t5000C5001A70704Ad0  ONLINE     0     0     0
            c10t5000C5001A70830Ed0  ONLINE     0     0     0
            c10t5000C5001A701563d0  ONLINE     0     0     0
          mirror                    ONLINE     0     0     0
            c10t5000C5001A702542d0  ONLINE     0     0     0
            c10t5000C5001A702625d0  ONLINE     0     0     0
            c10t5000C5001A703374d0  ONLINE     0     0     0
        logs
          mirror                    ONLINE     0     0     0
            c1t3d0                  ONLINE     0     0     0
            c1t4d0                  ONLINE     0     0     0
        cache
          c1t1d0                    ONLINE     0     0     0
          c1t2d0                    ONLINE     0     0     0
        spares
          c10t5000C5001A703651d0    INUSE   currently in use
          c10t5000C50
Re: [zfs-discuss] Solaris 10 default caching segmap/vpm size
The reason I asked was just to understand how those attributes play with ufs/vxfs... -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Solaris 10 default caching segmap/vpm size
What's the default size of the file system cache for Solaris 10 x86, and can it be tuned? I read various posts on the subject and it's confusing. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] not showing data in L2ARC or ZIL
thanks - :) -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] not showing data in L2ARC or ZIL
Hmm, so that means read requests are being fulfilled by the ARC? Am I correct in assuming that because the ARC is satisfying the read requests, the zpool and L2ARC are barely touched?
[zfs-discuss] not showing data in L2ARC or ZIL
I'm not seeing any data being populated in the L2ARC or ZIL SSDs with a J4500 (48 x 500GB SATA drives).

# zpool iostat -v
                               capacity     operations    bandwidth
pool                         used  avail   read  write   read  write
POOL                        2.71T  4.08T     35    492  1.06M  5.67M
  mirror                     185G   279G      2     30  72.5K   327K
    c10t5000C5001A3B6695d0      -      -      0      4  24.5K   327K
    c10t5000C5001A3CED7Fd0      -      -      0      4  24.5K   327K
    c10t5000C5001A5A45C1d0      -      -      0      5  24.5K   327K
  mirror                     185G   279G      2     30  72.8K   327K
    c10t5000C5001A6B2300d0      -      -      0      5  24.6K   327K
    c10t5000C5001A6BC6C6d0      -      -      0      5  24.6K   327K
    c10t5000C5001A6C3439d0      -      -      0      5  24.6K   327K
  mirror                     185G   279G      2     30  72.6K   327K
    c10t5000C5001A6F177Bd0      -      -      0      4  24.4K   327K
    c10t5000C5001A6FDB0Bd0      -      -      0      4  24.7K   327K
    c10t5000C5001A6FFF86d0      -      -      0      5  24.5K   327K
  mirror                     185G   279G      2     30  72.4K   327K
    c10t5000C5001A39D7BEd0      -      -      0      4  24.6K   327K
    c10t5000C5001A60BED0d0      -      -      0      4  24.6K   327K
    c10t5000C5001A70D8AAd0      -      -      0      4  24.4K   327K
  mirror                     185G   279G      2     30  72.5K   327K
    c10t5000C5001A70D9B0d0      -      -      0      5  24.6K   327K
    c10t5000C5001A70D89Ed0      -      -      0      5  24.6K   327K
    c10t5000C5001A70D719d0      -      -      0      5  24.5K   327K
  mirror                     185G   279G      2     30  72.5K   327K
    c10t5000C5001A700E07d0      -      -      0      4  24.7K   327K
    c10t5000C5001A701A12d0      -      -      0      5  24.5K   327K
    c10t5000C5001A701CD0d0      -      -      0      5  24.4K   327K
  mirror                     185G   279G      2     30  72.4K   327K
    c10t5000C5001A702c10Ed0     -      -      0      4  24.4K   327K
    c10t5000C5001A702C8Ed0      -      -      0      4  24.5K   327K
    c10t5000C5001A703D23d0      -      -      0      4  24.6K   327K
  mirror                     185G   279G      2     30  72.4K   327K
    c10t5000C5001A703FADd0      -      -      0      4  24.4K   327K
    c10t5000C5001A707D86d0      -      -      0      4  24.5K   327K
    c10t5000C5001A707EDCd0      -      -      0      4  24.5K   327K
  mirror                     185G   279G      2     30  72.7K   327K
    c10t5000C5001A7013D4d0      -      -      0      4  24.5K   327K
    c10t5000C5001A7013E6d0      -      -      0      4  24.6K   327K
    c10t5000C5001A7013FDd0      -      -      0      4  24.5K   327K
  mirror                     185G   279G      2     30  72.6K   327K
    c10t5000C5001A7021ADd0      -      -      0      4  24.6K   327K
    c10t5000C5001A7028B6d0      -      -      0      4  24.5K   327K
    c10t5000C5001A7029A2d0      -      -      0      4  24.5K   327K
  mirror                     185G   279G      2     30  72.6K   327K
    c10t5000C5001A7036F4d0      -      -      0      4  24.5K   327K
    c10t5000C5001A7053ADd0      -      -      0      5  24.5K   327K
    c10t5000C5001A7069CAd0      -      -      0      5  24.6K   327K
  mirror                     185G   279G      2     30  72.5K   327K
    c10t5000C5001A70104Dd0      -      -      0      4  24.6K   327K
    c10t5000C5001A70126Fd0      -      -      0      4  24.5K   327K
    c10t5000C5001A70183Cd0      -      -      0      5  24.5K   327K
  mirror                     185G   279G      2     30  72.7K   327K
    c10t5000C5001A70296Cd0      -      -      0      4  24.6K   327K
    c10t5000C5001A70395Ed0      -      -      0      5  24.5K   327K
    c10t5000C5001A70587Dd0      -      -      0      5  24.7K   327K
  mirror                     186G   278G      2     30  72.2K   327K
    c10t5000C5001A70704Ad0      -      -      0      4  24.4K   327K
    c10t5000C5001A70830Ed0      -      -      0      4  24.5K   327K
    c10t5000C5001A701563d0      -      -      0      5  24.3K   327K
  mirror                     185G   279G      2     30  72.2K   327K
    c10t5000C5001A702542d0      -      -      0      4  24.5K   327K
    c10t5000C5001A702625d0      -      -      0      4  24.4K   327K
    c10t5000C5001A703374d0      -      -      0      4  24.4K   327K
  mirror                      236K  29.5G      0     37      0   909K
    c1t3d0                      -      -      0     37      0   909K
    c1t4d0                      -      -      0     37      0   909K
cache                           -      -      -      -      -      -
  c1t1d0                    29.7G     8M      6     21   175K  1.13M
  c1t2d0                    29.7G     8M      6     21   175K  1.13M
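One way to watch whether the L2ARC is actually being fed and hit is through the arcstats kstat. A sketch - the statistic names below (l2_size, l2_hits, l2_feeds) assume a build recent enough to ship the L2ARC counters:

```
# L2ARC counters live under zfs:0:arcstats; a growing l2_size means the
# cache devices are filling, and l2_hits shows reads served from them.
kstat -p zfs:0:arcstats:l2_size zfs:0:arcstats:l2_hits zfs:0:arcstats:l2_feeds
```

If l2_size stays near zero while the ARC hit ratio is high, the working set simply fits in RAM and the L2ARC has nothing to do.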
Re: [zfs-discuss] zpool import -F hangs system
What build are you on? zpool import hangs for me on b134.

On Wed, Apr 21, 2010 at 9:21 AM, John Balestrini wrote:
> Howdy All,
>
> I have a raidz pool that hangs the system when importing. I attempted a
> pfexec zpool import -F pool1 (which has been importing for two days with no
> result), but it doesn't seem to get anywhere and makes the system mostly
> non-responsive -- existing logins continue to work, new logins never
> complete, and running any zpool or zfs command will hang the session. Zdb
> commands seem to function OK. Apparently the pool has some corruption that
> causes havoc. I'd like to attempt to roll back to an older txg, but the
> descriptions of how to do it only cover working with a single vdev
> -- this one has three.
>
> Any ideas, pointers or help would be greatly appreciated.
>
> Thanks,
>
> -- John
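For picking a txg to roll back to, the uberblock history kept in each device's vdev labels can be listed with zdb. A sketch - the device path is a placeholder, and the flags assume a zdb recent enough to support label/uberblock dumps:

```
# Dump the labels and the uberblock array (txg numbers and timestamps)
# for one device; repeat for a device from each top-level vdev and look
# for a txg that is present and consistent across all of them.
zdb -lu /dev/dsk/c0t0d0s0
```

The most recent txg that appears intact on every vdev is the natural rollback candidate.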
Re: [zfs-discuss] mpxio load-balancing...it doesn't work??
I'm wondering if the author is talking about "cache mirroring", where the cache is mirrored between both controllers. If that is the case, is he saying that for every write to the active controller, a second write is issued on the passive controller to keep the cache mirrored?
[zfs-discuss] mpxio load-balancing...it doesn't work??
I had always thought that with mpxio, IO requests are load-balanced across your storage ports, but this article http://christianbilien.wordpress.com/2007/03/23/storage-array-bottlenecks/ has got me thinking that's not true.

"The available bandwidth is 2 or 4Gb/s (200 or 400MB/s - FC frames are 10 bytes long -) per port. As load balancing software (Powerpath, MPXIO, DMP, etc.) are most of the times used both for redundancy and load balancing, I/Os coming from a host can take advantage of an aggregated bandwidth of two ports. However, reads can use only one path, but writes are duplicated, i.e. a host write ends up as one write on each host port."

Is this true?
Re: [zfs-discuss] j4500 cache flush
Marion - do you happen to know which SAS HBA it applies to?
[zfs-discuss] j4500 cache flush
Since the J4500 doesn't have an internal SAS controller, would it be safe to say that ZFS cache flushes are handled by the host's SAS HBA?
[zfs-discuss] naming zfs disks
Is there any way to assign a unique name or ID to a disk that is part of a zpool?
Re: [zfs-discuss] Oracle Performance - ZFS vs UFS
Don't use raidz for the raid type - go with a striped set
Re: [zfs-discuss] Instructions for ignoring ZFS write cache flushing on intelligent arrays
We're running 10/09 on the dev box, but 11/06 is prod/QA.
Re: [zfs-discuss] Instructions for ignoring ZFS write cache flushing on intelligent arrays
Cindy, it does not list our SAN (LSI/STK/NetApp)... I'm confused about disabling the cache from the wiki entries. Should we disable it globally by turning off ZFS cache syncs via "echo zfs_nocacheflush/W0t1 | mdb -kw", or per storage device via the sd.conf method, where the array ignores cache flushes from ZFS? Brad
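For the sd.conf route, the idea (per the ZFS Evil Tuning Guide era documentation) is to tell the sd driver that the array's cache is non-volatile, so SYNCHRONIZE CACHE commands are suppressed for that device only, rather than disabling flushes system-wide. A sketch - the vendor/product inquiry string below is a made-up placeholder (the vendor field is padded to 8 characters), and the cache-nonvolatile property assumes an sd driver recent enough to support it:

```
* /kernel/drv/sd.conf
* "VENDOR  PRODUCT" below is hypothetical - substitute your array's
* inquiry string and verify the property name for your release.
sd-config-list = "LSI     INF-01-00", "cache-nonvolatile:true";
```

The per-device approach is generally safer than zfs_nocacheflush because local disks with volatile caches keep flushing normally.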
[zfs-discuss] compression ratio
With the default compression scheme (LZJB), how does one calculate the compression ratio or the amount compressed ahead of time when allocating storage?
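There's no formula that predicts LZJB's ratio ahead of time - it depends entirely on the data - but it can be measured empirically by loading a representative sample into a compressed dataset and reading back the compressratio property. A sketch (pool and dataset names are made up):

```
# Create a throwaway compressed dataset, load sample data, measure.
zfs create -o compression=on tank/sample
cp -r /path/to/representative/data /tank/sample/
zfs get compressratio tank/sample
```

Sizing from a sample like this is only as good as how representative the sample is.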
Re: [zfs-discuss] Instructions for ignoring ZFS write cache flushing on intelligent arrays
Hi! So after reading through this thread and checking the bug report... do we still need to tell ZFS to disable cache flushes? set zfs:zfs_nocacheflush=1
Re: [zfs-discuss] x4500...need input and clarity on striped/mirrored configuration
Did you buy the SSDs directly from Sun? I've heard there may be vendor-specific firmware for the X25-E.
Re: [zfs-discuss] x4500...need input and clarity on striped/mirrored configuration
I was reading your old posts about load-shares http://opensolaris.org/jive/thread.jspa?messageID=294580 . So between raidz and load-share "striping": raidz stripes a file system block evenly across each device, but with load sharing the file system block is written to a vdev that's not filled up (slab??), and then each following file system block continues filling up the 1MB slab until it's full before moving on to the next one? Richard, can you comment? :)
Re: [zfs-discuss] x4500...need input and clarity on striped/mirrored configuration
"Zfs does not do striping across vdevs, but its load share approach will write based on (roughly) a round-robin basis, but will also prefer a less loaded vdev when under a heavy write load, or will prefer to write to an empty vdev rather than write to an almost full one."

I'm trying to visualize this... can you elaborate or give an ASCII example? So with the syntax below, load sharing is implemented?

zpool create testpool disk1 disk2 disk3
Re: [zfs-discuss] x4500...need input and clarity on striped/mirrored configuration
@hortnon - ASM is not within the scope of this project.
[zfs-discuss] x4500...need input and clarity on striped/mirrored configuration
Can anyone recommend an optimal and redundant striped configuration for an X4500? We'll be using it for an OLTP (Oracle) database and will need the best performance. Is it also true that reads will be load-balanced across the mirrors? Is this considered a RAID 1+0 configuration?

zpool create -f testpool \
  mirror c0t0d0 c1t0d0 mirror c4t0d0 c6t0d0 mirror c0t1d0 c1t1d0 \
  mirror c4t1d0 c5t1d0 mirror c6t1d0 c7t1d0 mirror c0t2d0 c1t2d0 \
  mirror c4t2d0 c5t2d0 mirror c6t2d0 c7t2d0 mirror c0t3d0 c1t3d0 \
  mirror c4t3d0 c5t3d0 mirror c6t3d0 c7t3d0 mirror c0t4d0 c1t4d0 \
  mirror c4t4d0 c6t4d0 mirror c0t5d0 c1t5d0 mirror c4t5d0 c5t5d0 \
  mirror c6t5d0 c7t5d0 mirror c0t6d0 c1t6d0 mirror c4t6d0 c5t6d0 \
  mirror c6t6d0 c7t6d0 mirror c0t7d0 c1t7d0 mirror c4t7d0 c5t7d0 \
  mirror c6t7d0 c7t7d0 mirror c7t0d0 c7t4d0

Is it even possible to do a RAID 0+1?

zpool create -f testpool \
  c0t0d0 c4t0d0 c0t1d0 c4t1d0 c6t1d0 c0t2d0 c4t2d0 c6t2d0 \
  c0t3d0 c4t3d0 c6t3d0 c0t4d0 c4t4d0 c0t5d0 c4t5d0 c6t5d0 \
  c0t6d0 c4t6d0 c6t6d0 c0t7d0 c4t7d0 c6t7d0 c7t0d0 \
  mirror c1t0d0 c6t0d0 c1t1d0 c5t1d0 c7t1d0 c1t2d0 c5t2d0 c7t2d0 \
  c1t3d0 c5t3d0 c7t3d0 c1t4d0 c6t4d0 c1t5d0 c5t5d0 c7t5d0 \
  c1t6d0 c5t6d0 c7t6d0 c1t7d0 c5t7d0 c7t7d0 c7t4d0
Re: [zfs-discuss] x4500/x4540 does the internal controllers have a bbu?
Richard, "Yes, write cache is enabled by default, depending on the pool configuration." Is it enabled for a striped (mirrored) zpool? I'm asking because of a concern I've read on this forum about a problem with SSDs (and disks) where, if a power outage occurs, any data in the cache that hasn't been flushed to disk is lost.
Re: [zfs-discuss] x4500/x4540 does the internal controllers have a bbu?
"(Caching isn't the problem; ordering is.)" Weird - I was reading about a problem where, with SSDs (Intel X25-E), if the power goes out and the data in the cache has not been flushed, you would lose data. Could you elaborate on "ordering"?
[zfs-discuss] x4500/x4540 does the internal controllers have a bbu?
Has anyone worked with an X4500/X4540 and know if the internal RAID controllers have a BBU? I'm concerned that we won't be able to turn off the write cache on the internal HDs and SSDs to prevent data corruption in case of a power failure.
Re: [zfs-discuss] raidz stripe size (not stripe width)
Hi Adam,

From your picture, it looks like the data is distributed evenly (with the exception of parity) across each spindle, then wrapping around again (final 4K) - is this one single write operation or two?

| P | D00 | D01 | D02 | D03 | D04 | D05 | D06 | D07 |  <- one write op??
| P | D08 | D09 | D10 | D11 | D12 | D13 | D14 | D15 |  <- one write op??

For a stripe configuration, is this what it would look like for 8K?

| D00 D01 D02 D03 D04 D05 D06 D07 D08 |
| D09 D10 D11 D12 D13 D14 D15 D16 D17 |
[zfs-discuss] raidz stripe size (not stripe width)
If an 8K file system block is written to a 9-disk raidz vdev, how is the data distributed (written) across all the devices in the vdev, given that a ZFS write is one continuous IO operation? Is it distributed evenly (1.125KB per device)?
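As a back-of-the-envelope check (my sketch, assuming the block splits evenly across the data disks and ignoring sector-size rounding): a 9-disk raidz1 has 8 data disks, so an 8K block contributes 1K of data per data disk, plus one disk's worth of parity:

```shell
# 8K file system block on a 9-disk raidz1: 8 data disks + 1 parity disk
block=8192
disks=9
parity=1
per_disk=$((block / (disks - parity)))
echo "data per disk: ${per_disk} bytes"            # 1024
echo "total written: $((per_disk * disks)) bytes"  # 9216, incl. parity
```

In practice ZFS pads writes to full device sectors, so the real on-disk layout can deviate from this idealized split.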
Re: [zfs-discuss] raidz vs raid5 clarity needed
@ross

"If the write doesn't span the whole stripe width then there is a read of the parity chunk, write of the block and a write of the parity chunk which is the write hole penalty/vulnerability, and is 3 operations (if the data spans more than 1 chunk then it is written in parallel so you can think of it as one operation; if the data doesn't fill any given chunk then a read of the existing data chunk is necessary to fill in the missing data, making it 4 operations). No other operation on the array can execute while this is happening."

I thought that with raid5, for a new FS block write, the previous block is read in, then the parity is read, the parity is written/updated, and then the new block is written (2 reads, 2 writes)??

"Yes, reads are exactly like writes on the raidz vdev, no other operation, read or write, can execute while this is happening. This is where the problem lies, and is felt hardest with random IOs."

Ah - so with a random read workload, a read IO cannot be executed in multiple streams or simultaneously until the current IO has completed with raidz. Was the thought process behind this to mitigate the write hole issue, or for performance (a write is a single IO instead of the 3 or 4 IOs with raid5)?
[zfs-discuss] raidz vs raid5 clarity needed
Hi! I'm attempting to understand the pros/cons of raid5 vs raidz after running into a performance issue with Oracle on ZFS (http://opensolaris.org/jive/thread.jspa?threadID=120703&tstart=0). I would appreciate some feedback on what I've understood so far:

WRITES
raid5 - A FS block is written on a single disk (or multiple disks, depending on the size of the data???)
raidz - A FS block is written in a dynamic stripe (depending on the size of the data?) across n devices (minus parity).

READS
raid5 - IO count depends on how many disks the FS block was written to (data crosses two disks = 2 IOs??)
raidz - A single read spans n devices (minus parity). (1 single IO??)

NEGATIVES
raid5 - Write hole penalty: if the system crashes in the middle of a write, with the block updated before or after updating parity, data is corrupt.
      - Overhead (read previous block, read parity, update parity, write block).
      - No checksumming of data!
      - Slow sequential read performance.
raidz - Bound by the IOPS of the slowest device, since blocks are striped. Bad for small random reads.

POSITIVES
raid5 - Good for random reads (between raid5 and raidz!) since blocks are not striped across the sum of the disks.
raidz - Good for sequential reads and writes since data is striped across the sum of the devices.
      - No write hole penalty!
Re: [zfs-discuss] repost - high read iops
@relling

"For small, random read IOPS the performance of a single, top-level vdev is

performance = performance of a disk * (N / (N - P))

where
N = number of disks in the vdev
P = number of parity devices in the vdev"

performance of a disk => is this a rough estimate of the disk's IOPS rating?

"For example, using 5 disks @ 100 IOPS we get something like:
2-disk mirror: 200 IOPS
4+1 raidz: 125 IOPS
3+2 raidz2: 167 IOPS
2+3 raidz3: 250 IOPS"

So if the rated IOPS on our disks is 133:

133 * 12/(12-1) = 133 * 12/11 = 145
11+1 raidz: 145 IOPS?

If that's the rate for an 11+1 raidz vdev, then why is iostat showing about 700 combined IOPS (reads/writes) per disk?

   r/s     w/s  Mr/s  Mw/s  wait   actv  wsvc_t  asvc_t  %w   %b  device
   0.0     0.0   0.0   0.0   0.0    0.0     0.0     0.0   0    0  c0
   0.0     0.0   0.0   0.0   0.0    0.0     0.0     0.0   0    0  c0t0d0
1402.2  7805.3   2.7  36.2   0.2   54.9     0.0     6.0   0  940  c1
  10.8     1.0   0.1   0.0   0.0    0.1     0.0     7.0   0    7  c1t0d0
 117.1   640.7   0.2   1.8   0.0    4.5     0.0     5.9   1   76  c1t1d0
 116.9   638.2   0.2   1.7   0.0    4.6     0.0     6.1   1   78  c1t2d0
 116.4   639.1   0.2   1.8   0.0    4.6     0.0     6.0   1   78  c1t3d0
 116.6   638.1   0.2   1.7   0.0    4.6     0.0     6.1   1   77  c1t4d0
 113.2   638.0   0.2   1.8   0.0    4.6     0.0     6.1   1   77  c1t5d0
 116.6   635.3   0.2   1.7   0.0    4.5     0.0     6.0   1   76  c1t6d0
 116.2   637.8   0.2   1.8   0.0    4.7     0.0     6.2   1   79  c1t7d0
 115.3   636.7   0.2   1.8   0.0    4.4     0.0     5.8   1   77  c1t8d0
 115.4   637.8   0.2   1.8   0.0    4.5     0.0     5.9   1   77  c1t9d0
 114.8   635.0   0.2   1.8   0.0    4.3     0.0     5.7   1   76  c1t10d0
 114.9   639.9   0.2   1.8   0.0    4.7     0.0     6.2   1   78  c1t11d0
 115.1   638.7   0.2   1.8   0.0    4.4     0.0     5.9   1   77  c1t12d0
   1.6   140.0   0.0  15.1   0.0    0.6     0.0     4.4   0    8  c1t13d0
   1.3     9.1   0.0   0.1   0.0    0.0     0.0     1.0   0    0  c1t14d0
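Plugging the numbers into Richard's formula with integer math (a sketch; real drives obviously vary around their rated figure):

```shell
# small random read IOPS of one top-level vdev:
#   performance = disk_iops * (N / (N - P))
disk_iops=133
n=12   # disks in the vdev (11 data + 1 parity)
p=1    # parity disks
echo "vdev IOPS: $((disk_iops * n / (n - p)))"   # 145 for an 11+1 raidz
```

Note this estimates the whole vdev's small random read throughput, so per-disk iostat numbers well above it usually mean the workload isn't purely small random reads (or caching is absorbing part of it).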
Re: [zfs-discuss] repost - high read iops
@eric

"As a general rule of thumb, each vdev has the random performance roughly the same as a single member of that vdev. Having six RAIDZ vdevs in a pool should give roughly the performance as a stripe of six bare drives, for random IO."

It sounds like we'd need 16 vdevs striped in a pool to at least get the performance of 15 drives, plus another 16 mirrored for redundancy. If we are bounded in IOPS by the vdev, would it make sense to go with the bare minimum of drives (3) per vdev?

"This winds up looking similar to RAID10 in layout, in that you're striping across a lot of disks that each consists of a mirror, though the checksumming rules are different. Performance should also be similar, though it's possible RAID10 may give slightly better random read performance at the expense of some data quality guarantees, since I don't believe RAID10 normally validates checksums on returned data if the device didn't return an error. In normal practice, RAID10 and a pool of mirrored vdevs should benchmark against each other within your margin of error."

That's interesting - so traditional RAID10 doesn't have checksumming of returned data built in, unlike a ZFS pool of mirrors.
Re: [zfs-discuss] repost - high read iops
@ross

"Because each write of a raidz is striped across the disks the effective IOPS of the vdev is equal to that of a single disk. This can be improved by utilizing multiple (smaller) raidz vdevs which are striped, but not by mirroring them."

So for random reads, would a raid5 layout perform better, since the FS blocks are written to individual disks instead of striped? With ZFS's raid10-style layout (a pool of mirrors), would we still get data protection and checksumming?

"How many luns are you working with now? 15? Is the storage direct attached or is it coming from a storage server that may have the physical disks in a raid configuration already? If direct attached, create a pool of mirrors. If it's coming from a storage server where the disks are in a raid already, just create a striped pool and set copies=2."

We're not using a SAN but a Sun X4270 with sixteen SAS drives (two dedicated to the OS, two for SSDs, and an 11+1 raidz). There are a total of seven datasets in a single pool.
Re: [zfs-discuss] repost - high read iops
Thanks for the suggestion! I have heard mirrored vdev configurations are preferred for Oracle, but what's the difference between a pool of mirrored vdevs and a raid10 setup? We tested a ZFS stripe configuration before with 15 disks and our tester was extremely happy with the performance. After talking to our tester, she doesn't feel comfortable with the current raidz setup.
Re: [zfs-discuss] repost - high read iops
"This doesn't make sense to me. You've got 32 GB, why not use it? Artificially limiting the memory use to 20 GB seems like a waste of good money."

I'm having a hard time convincing the DBAs to increase the size of the SGA to 20GB, because their philosophy is that, no matter what, you'll eventually have to hit disk to pick up data that's not stored in cache (ARC or L2ARC). The typical database server in our environment holds over 3TB of data. If performance does not improve, we'll possibly have to change the raid layout from raidz to raid10.
Re: [zfs-discuss] repost - high read iops
"Try an SGA more like 20-25 GB. Remember, the database can cache more effectively than any file system underneath. The best I/O is the I/O you don't have to make."

We'll be turning the SGA size up from 4GB to 16GB. The ARC size will be reduced from 8GB to 4GB.

"This can be a red herring. Judging by the number of IOPS below, it has not improved. At this point, I will assume you are using disks that have NCQ or CTQ (eg most SATA and all FC/SAS drives). If you only issue one command at a time, you effectively disable NCQ and thus cannot take advantage of its efficiencies."

Here's another sample of the data, taken at another time after the number of concurrent IOs was changed from 10 to 1. We're using Seagate Savvio 10K SAS drives... I could not find info on whether the drives support NCQ. What's the recommended value for concurrent IOs?

   r/s     w/s  Mr/s  Mw/s  wait   actv  wsvc_t  asvc_t  %w   %b  device
   0.0     0.0   0.0   0.0   0.0    0.0     0.0     0.0   0    0  c0
   0.0     0.0   0.0   0.0   0.0    0.0     0.0     0.0   0    0  c0t0d0
1402.2  7805.3   2.7  36.2   0.2   54.9     0.0     6.0   0  940  c1
  10.8     1.0   0.1   0.0   0.0    0.1     0.0     7.0   0    7  c1t0d0
 117.1   640.7   0.2   1.8   0.0    4.5     0.0     5.9   1   76  c1t1d0
 116.9   638.2   0.2   1.7   0.0    4.6     0.0     6.1   1   78  c1t2d0
 116.4   639.1   0.2   1.8   0.0    4.6     0.0     6.0   1   78  c1t3d0
 116.6   638.1   0.2   1.7   0.0    4.6     0.0     6.1   1   77  c1t4d0
 113.2   638.0   0.2   1.8   0.0    4.6     0.0     6.1   1   77  c1t5d0
 116.6   635.3   0.2   1.7   0.0    4.5     0.0     6.0   1   76  c1t6d0
 116.2   637.8   0.2   1.8   0.0    4.7     0.0     6.2   1   79  c1t7d0
 115.3   636.7   0.2   1.8   0.0    4.4     0.0     5.8   1   77  c1t8d0
 115.4   637.8   0.2   1.8   0.0    4.5     0.0     5.9   1   77  c1t9d0
 114.8   635.0   0.2   1.8   0.0    4.3     0.0     5.7   1   76  c1t10d0
 114.9   639.9   0.2   1.8   0.0    4.7     0.0     6.2   1   78  c1t11d0
 115.1   638.7   0.2   1.8   0.0    4.4     0.0     5.9   1   77  c1t12d0
   1.6   140.0   0.0  15.1   0.0    0.6     0.0     4.4   0    8  c1t13d0
   1.3     9.1   0.0   0.1   0.0    0.0     0.0     1.0   0    0  c1t14d0
Re: [zfs-discuss] repost - high read iops
Richard - the L2ARC is c1t13d0. What tools can be used to show the L2ARC stats?

raidz1     2.68T   580G    543    453  4.22M  3.70M
  c1t1d0       -      -    258    102   689K   358K
  c1t2d0       -      -    256    103   684K   354K
  c1t3d0       -      -    258    102   690K   359K
  c1t4d0       -      -    260    103   687K   354K
  c1t5d0       -      -    255    101   686K   358K
  c1t6d0       -      -    263    103   685K   354K
  c1t7d0       -      -    259    101   689K   358K
  c1t8d0       -      -    259    103   687K   354K
  c1t9d0       -      -    260    102   689K   358K
  c1t10d0      -      -    263    103   686K   354K
  c1t11d0      -      -    260    102   687K   359K
  c1t12d0      -      -    263    104   684K   354K
c1t14d0     396K  29.5G      0     65      7  3.61M
cache          -      -      -      -      -      -
  c1t13d0  29.7G  11.1M    157     84  3.93M  6.45M

We've added 16GB to the box, bringing the overall total to 32GB. arc_max is set to 8GB:

set zfs:zfs_arc_max = 8589934592

arc_summary output:

ARC Size:
  Current Size:             8192 MB (arcsize)
  Target Size (Adaptive):   8192 MB (c)
  Min Size (Hard Limit):    1024 MB (zfs_arc_min)
  Max Size (Hard Limit):    8192 MB (zfs_arc_max)

ARC Size Breakdown:
  Most Recently Used Cache Size:    39%  3243 MB (p)
  Most Frequently Used Cache Size:  60%  4948 MB (c-p)

ARC Efficiency:
  Cache Access Total:   154663786
  Cache Hit Ratio:      41%  64221251  [Defined State for buffer]
  Cache Miss Ratio:     58%  90442535  [Undefined State for Buffer]
  REAL Hit Ratio:       41%  64221251  [MRU/MFU Hits Only]
  Data Demand Efficiency:    38%
  Data Prefetch Efficiency:  DISABLED (zfs_prefetch_disable)

CACHE HITS BY CACHE LIST:
  Anon:                        --%  Counter Rolled.
  Most Recently Used:          17%  8906 (mru)            [ Return Customer ]
  Most Frequently Used:        82%  53102345 (mfu)        [ Frequent Customer ]
  Most Recently Used Ghost:    14%  9427708 (mru_ghost)   [ Return Customer Evicted, Now Back ]
  Most Frequently Used Ghost:   6%  4344287 (mfu_ghost)   [ Frequent Customer Evicted, Now Back ]

CACHE HITS BY DATA TYPE:
  Demand Data:        84%  5108
  Prefetch Data:       0%  0
  Demand Metadata:    15%  9777143
  Prefetch Metadata:   0%  0

CACHE MISSES BY DATA TYPE:
  Demand Data:        96%  87542292
  Prefetch Data:       0%  0
  Demand Metadata:     3%  2900243
  Prefetch Metadata:   0%  0

Also disabled file-level prefetch and vdev cache max:

set zfs:zfs_prefetch_disable = 1
set zfs:zfs_vdev_cache_max = 0x1

After reading about some issues with concurrent IOs, I tweaked the setting down from 35 to 1, and it reduced the response times greatly (8ms -> 2ms):

set zfs:zfs_vdev_max_pending = 1

It did increase the actv... I'm still unsure about the side-effects here:

   r/s    w/s  Mr/s  Mw/s  wait   actv  wsvc_t  asvc_t  %w    %b  device
   0.0    0.0   0.0   0.0   0.0    0.0     0.0     0.0   0     0  c0
   0.0    0.0   0.0   0.0   0.0    0.0     0.0     0.0   0     0  c0t0d0
2295.2  398.7   4.2   7.2   0.0   18.6     0.0     6.9   0  1084  c1
   0.0    0.8   0.0   0.0   0.0    0.0     0.0     0.1   0     0  c1t0d0
 190.3   22.9   0.4   0.0   0.0    1.5     0.0     7.0   0    87  c1t1d0
 180.9   20.6   0.3   0.0   0.0    1.7     0.0     8.5   0    95  c1t2d0
 195.0   43.0   0.3   0.2   0.0    1.6     0.0     6.8   0    93  c1t3d0
 193.2   21.7   0.4   0.0   0.0    1.5     0.0     6.8   0    88  c1t4d0
 195.7   34.8   0.3   0.1   0.0    1.7     0.0     7.5   0    97  c1t5d0
 186.8   20.6   0.3   0.0   0.0    1.5     0.0     7.3   0    88  c1t6d0
 188.4   21.0   0.4   0.0   0.0    1.6     0.0     7.7   0    91  c1t7d0
 189.6   21.2   0.3   0.0   0.0    1.6     0.0     7.4   0    91  c1t8d0
 193.8   22.6   0.4   0.0   0.0    1.5     0.0     7.1   0    91  c1t9d0
 192.6   20.8   0.3   0.0   0.0    1.4     0.0     6.8   0    88  c1t10d0
 195.7   22.2   0.3   0.0   0.0    1.5     0.0     6.7   0    88  c1t11d0
 184.7   20.3   0.3   0.0   0.0    1.4     0.0     6.8   0    84  c1t12d0
   7.3   82.4   0.1   5.5   0.0    0.0     0.0     0.2   0     1  c1t13d0
   1.3   23.9   0.0   1.3   0.0    0.0     0.0     0.2   0     0  c1t14d0

I'm still in talks with the DBA about raising the SGA from 4GB to 6GB to see if it'll help. The changes that showed a lot of improvement were disabling file/device-level prefetch and reducing concurrent IOs from 35 to 1 (tried 10, but it didn't help much).

Is there anything else that could be tweaked to increase write performance? Record sizes are set accordingly: 8K, and 128K for the redo logs.
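For reference, the per-dataset recordsize split mentioned above would be set something like this (dataset names are hypothetical, and recordsize only affects files written after the change, so set it before loading data):

```
zfs set recordsize=8k tank/oradata    # match the Oracle db_block_size
zfs set recordsize=128k tank/oralogs  # redo logs stream sequentially
```

Matching recordsize to the database block size avoids read-modify-write amplification on random 8K updates.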
[zfs-discuss] repost - high read iops
repost - sorry for CCing the other forums.

I'm running into an issue where there seems to be a high number of read IOPS hitting the disks, and physical free memory is fluctuating between 200MB -> 450MB out of 16GB total. We have the L2ARC configured on a 32GB Intel X25-E SSD and the slog on another 32GB X25-E SSD. According to our tester, Oracle writes are extremely slow (high latency). Below is a snippet of iostat:

   r/s    w/s  Mr/s  Mw/s  wait   actv  wsvc_t  asvc_t  %w    %b  device
   0.0    0.0   0.0   0.0   0.0    0.0     0.0     0.0   0     0  c0
   0.0    0.0   0.0   0.0   0.0    0.0     0.0     0.0   0     0  c0t0d0
4898.3   34.2  23.2   1.4   0.1  385.3     0.0    78.1   0  1246  c1
   0.0    0.8   0.0   0.0   0.0    0.0     0.0    16.0   0     1  c1t0d0
 401.7    0.0   1.9   0.0   0.0   31.5     0.0    78.5   1   100  c1t1d0
 421.2    0.0   2.0   0.0   0.0   30.4     0.0    72.3   1    98  c1t2d0
 403.9    0.0   1.9   0.0   0.0   32.0     0.0    79.2   1   100  c1t3d0
 406.7    0.0   2.0   0.0   0.0   33.0     0.0    81.3   1   100  c1t4d0
 414.2    0.0   1.9   0.0   0.0   28.6     0.0    69.1   1    98  c1t5d0
 406.3    0.0   1.8   0.0   0.0   32.1     0.0    79.0   1   100  c1t6d0
 404.3    0.0   1.9   0.0   0.0   31.9     0.0    78.8   1   100  c1t7d0
 404.1    0.0   1.9   0.0   0.0   34.0     0.0    84.1   1   100  c1t8d0
 407.1    0.0   1.9   0.0   0.0   31.2     0.0    76.6   1   100  c1t9d0
 407.5    0.0   2.0   0.0   0.0   33.2     0.0    81.4   1   100  c1t10d0
 402.8    0.0   2.0   0.0   0.0   33.5     0.0    83.2   1   100  c1t11d0
 408.9    0.0   2.0   0.0   0.0   32.8     0.0    80.3   1   100  c1t12d0
   9.6   10.8   0.1   0.9   0.0    0.4     0.0    20.1   0    17  c1t13d0
   0.0   22.7   0.0   0.5   0.0    0.5     0.0    22.8   0    33  c1t14d0

Is this an indicator that we need more physical memory? From http://blogs.sun.com/brendan/entry/test, the order in which a read request is satisfied is:

1) ARC
2) vdev cache of L2ARC devices
3) L2ARC devices
4) vdev cache of disks
5) disks

Using arc_summary.pl, we determined that prefetch was not helping much, so we disabled it.

CACHE HITS BY DATA TYPE:
  Demand Data:        22%  158853174
  Prefetch Data:      17%  123009991  <--- not helping???
  Demand Metadata:    60%  437439104
  Prefetch Metadata:   0%  2446824

The write IOPS started to kick in more, and latency was reduced on the spinning disks:

   r/s    w/s  Mr/s  Mw/s  wait   actv  wsvc_t  asvc_t  %w    %b  device
   0.0    0.0   0.0   0.0   0.0    0.0     0.0     0.0   0     0  c0
   0.0    0.0   0.0   0.0   0.0    0.0     0.0     0.0   0     0  c0t0d0
1629.0  968.0  17.4   7.3   0.0   35.9     0.0    13.8   0  1088  c1
   0.0    1.9   0.0   0.0   0.0    0.0     0.0     1.7   0     0  c1t0d0
 126.7   67.3   1.4   0.2   0.0    2.9     0.0    14.8   0    90  c1t1d0
 129.7   76.1   1.4   0.2   0.0    2.8     0.0    13.7   0    90  c1t2d0
 128.0   73.9   1.4   0.2   0.0    3.2     0.0    16.0   0    91  c1t3d0
 128.3   79.1   1.3   0.2   0.0    3.6     0.0    17.2   0    92  c1t4d0
 125.8   69.7   1.3   0.2   0.0    2.9     0.0    14.9   0    89  c1t5d0
 128.3   81.9   1.4   0.2   0.0    2.8     0.0    13.1   0    89  c1t6d0
 128.1   69.2   1.4   0.2   0.0    3.1     0.0    15.7   0    93  c1t7d0
 128.3   80.3   1.4   0.2   0.0    3.1     0.0    14.7   0    91  c1t8d0
 129.2   69.3   1.4   0.2   0.0    3.0     0.0    15.2   0    90  c1t9d0
 130.1   80.0   1.4   0.2   0.0    2.9     0.0    13.6   0    89  c1t10d0
 126.2   72.6   1.3   0.2   0.0    2.8     0.0    14.2   0    89  c1t11d0
 129.7   81.0   1.4   0.2   0.0    2.7     0.0    12.9   0    88  c1t12d0
  90.4   41.3   1.0   4.0   0.0    0.2     0.0     1.2   0     6  c1t13d0
   0.0   24.3   0.0   1.2   0.0    0.0     0.0     0.2   0     0  c1t14d0

Is it true that if your MFU stats go over 50%, more memory is needed?

CACHE HITS BY CACHE LIST:
  Anon:                        10%  74845266              [ New Customer, First Cache Hit ]
  Most Recently Used:          19%  140478087 (mru)       [ Return Customer ]
  Most Frequently Used:        65%  475719362 (mfu)       [ Frequent Customer ]
  Most Recently Used Ghost:     2%  20785604 (mru_ghost)  [ Return Customer Evicted, Now Back ]
  Most Frequently Used Ghost:   1%  9920089 (mfu_ghost)   [ Frequent Customer Evicted, Now Back ]

CACHE HITS BY DATA TYPE:
  Demand Data:        22%  158852935
  Prefetch Data:      17%  123009991
  Demand Metadata:    60%  437438658
  Prefetch Metadata:   0%  2446824

My theory is that since there's not enough memory for the ARC to cache data, reads fall through to the L2ARC, and when the data isn't there either, the disks have to service the request. This contention between reads and writes causes the service times to inflate.

uname: 5.10 Generic_141445-09 i86pc i386 i86pc
Sun Fire X4270: 11+1 raidz (SAS)
l2arc: Intel X25-E
slog: Intel X25-E

Thoughts?
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Have you considered running your script with ZFS pre-fetching disabled altogether, to see if the results are consistent between runs?

Brad

Brad Diggs
Senior Directory Architect
Virtualization Architect
xVM Technology Lead
Sun Microsystems, Inc.
Phone x52957/+1 972-992-0002
Mail bradley.di...@sun.com
Blog http://TheZoneManager.com
Blog http://BradDiggs.com

On Jul 15, 2009, at 9:59 AM, Bob Friesenhahn wrote:

On Wed, 15 Jul 2009, Ross wrote:

> Yes, that makes sense. For the first run, the pool has only just been
> mounted, so the ARC will be empty, with plenty of space for prefetching.

I don't think that this hypothesis is quite correct. If you use 'zpool iostat' to monitor the read rate while reading a large collection of files with total size far larger than the ARC, you will see that there is no fall-off in read performance once the ARC becomes full. The performance problem occurs when there is still metadata cached for a file but the file data has since been expunged from the cache. The implication here is that zfs speculates that the file data will be in the cache if the metadata is cached, and this results in a cache miss as well as disabling the file read-ahead algorithm. You would not want to do read-ahead on data that you already have in a cache.

Recent OpenSolaris seems to take a 2X performance hit rather than the 4X hit that Solaris 10 takes. This may be due to optimization of the existing algorithms rather than a related design improvement.

> I wonder if there is any tuning that can be done to counteract this? Is
> there any way to tell ZFS to bias towards prefetching rather than
> preserving data in the ARC? That may provide better performance for
> scripts like this, or for random access workloads.

Recent zfs development focus has been on how to keep prefetch from damaging applications like databases, where prefetch causes more data to be read than is needed.
Since OpenSolaris now apparently includes an option setting which blocks file data caching and prefetch, this seems to open the door for use of more aggressive prefetch in the normal mode. In summary, I agree with Richard Elling's hypothesis (which is the same as my own).

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
You might want to have a look at my blog on filesystem cache tuning... It will probably help you to avoid memory contention between the ARC and your apps. http://www.thezonemanager.com/2009/03/filesystem-cache-optimization.html

Brad

Brad Diggs
Senior Directory Architect
Virtualization Architect
xVM Technology Lead
Sun Microsystems, Inc.
Phone x52957/+1 972-992-0002
Mail bradley.di...@sun.com
Blog http://TheZoneManager.com
Blog http://BradDiggs.com

On Jul 4, 2009, at 2:48 AM, Phil Harman wrote:

ZFS doesn't mix well with mmap(2). This is because ZFS uses the ARC instead of the Solaris page cache, but mmap() uses the latter. So if anyone maps a file, ZFS has to keep the two caches in sync.

cp(1) uses mmap(2). When you use cp(1) it brings pages of the files it copies into the Solaris page cache. As long as they remain there, ZFS will be slow for those files, even if you subsequently use read(2) to access them. If you reboot, your cpio(1) tests will probably go fast again, until someone uses mmap(2) on the files again. I think tar(1) uses read(2), but from my iPod I can't be sure. It would be interesting to see how tar(1) performs if you run that test before cp(1) on a freshly rebooted system. I have done some work with the ZFS team towards a fix, but it is only currently in OpenSolaris.

The other thing that slows you down is that ZFS only flushes to disk every 5 seconds if there are no synchronous writes. It would be interesting to see iostat -xnz 1 while you are running your tests. You may find the disks are writing very efficiently for one second in every five.

Hope this helps,
Phil
blogs.sun.com/pgdh

Sent from my iPod

On 4 Jul 2009, at 05:26, Bob Friesenhahn wrote:

On Fri, 3 Jul 2009, Bob Friesenhahn wrote:

> Copy Method                            Data Rate
> ==
> cpio -pdum                             75 MB/s
> cp -r                                  32 MB/s
> tar -cf - . | (cd dest && tar -xf -)   26 MB/s

It seems that the above should be amended. Running the cpio based copy again results in zpool iostat only reporting a read bandwidth of 33 MB/second.
The system seems to get slower and slower as it runs.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
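Phil's distinction between cp(1)'s mmap(2) path and tar/cpio's read(2) path can be sketched as follows. This is a hedged illustration only: the file names and sizes are invented, and Python's mmap merely models the access pattern, not Solaris page-cache/ARC coherency.

```python
import mmap
import os
import tempfile

def copy_via_read(src, dst, bufsize=128 * 1024):
    # read(2)/write(2) loop -- on ZFS this read path is served from the ARC
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while chunk := fin.read(bufsize):
            fout.write(chunk)

def copy_via_mmap(src, dst):
    # mmap(2) the source -- on Solaris these pages go through the page
    # cache, which ZFS must then keep coherent with the ARC
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        with mmap.mmap(fin.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            fout.write(mm)

with tempfile.TemporaryDirectory() as d:
    src = os.path.join(d, "src")
    with open(src, "wb") as f:
        f.write(os.urandom(1 << 20))  # 1 MiB of throwaway test data
    copy_via_read(src, os.path.join(d, "via_read"))
    copy_via_mmap(src, os.path.join(d, "via_mmap"))
    with open(os.path.join(d, "via_read"), "rb") as f1, \
         open(os.path.join(d, "via_mmap"), "rb") as f2:
        same = f1.read() == f2.read()
    print("copies identical:", same)
```

Both produce identical copies; the performance difference Phil describes comes entirely from which kernel cache the source pages land in.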
Re: [zfs-discuss] zpool import hangs
Hi Victor,

Yes, you may access the system via ssh. Please contact me at bar001 at uark dot edu and I will reply with details of how to connect.

Thanks,
Brad
Re: [zfs-discuss] zpool import hangs
Hi Victor,

'zdb -e -bcsv -t 2435913 tank' ran for about a week with no output. We had yet another brownout and then the computer shut down (have a UPS on the way). A few days before that I started the following commands, which also had no output:

zdb -e -bcsv -t 2435911 tank
zdb -e -bcsv -t 2435897 tank

I've given up on these because I don't think they'll finish...should I try again? Right now I am trying the following commands, which so far have no output:

zdb -e -bcsvL -t 2435913 tank
zdb -e -bsvL -t 2435913 tank
zdb -e -bb -t 2435913 tank

'zdb -e - -t 2435913 tank' has output and is very long...is there anything I should be looking for? Without -t 243... this command failed on dmu_read; now it just keeps going forever. Your help is much appreciated.

Thanks,
Brad
Re: [zfs-discuss] zpool import hangs
Hi Victor,

Sorry it took a while for me to reply; I was traveling and had limited network access. 'zdb -e -bcsv -t 2435913 tank' has been running for a few days with no output...want to try something else? Here's the output of 'zdb -e -u -t 2435913 tank':

Uberblock
    magic = 00bab10c
    version = 4
    txg = 2435911
    guid_sum = 16655261404755214374
    timestamp = 1240287900 UTC = Mon Apr 20 23:25:00 2009

Thanks,
Brad
Re: [zfs-discuss] zpool import hangs
Hi Victor,

Here's the output of 'zdb -e -bcsvL tank' (similar to above but with -c).

Thanks,
Brad

Traversing all blocks to verify checksums ...

zdb_blkptr_cb: Got error 50 reading <0, 11, 0, 0> [L0 packed nvlist] 4000L/4000P DVA[0]=<0:2500014000:4000> DVA[1]=<0:4400014000:4000> fletcher4 uncompressed LE contiguous birth=2435914 fill=1 cksum=2cdaaa4db0:2b105dbdf910e:14b020cadaf6:1720f5444d0b5366 -- skipping
zdb_blkptr_cb: Got error 50 reading <0, 12, 0, 0> [L0 bplist] 4000L/4000P DVA[0]=<0:258000:4000> DVA[1]=<0:448000:4000> fletcher4 uncompressed LE contiguous birth=2435914 fill=1 cksum=16b3e12568:143f7c0b11757:91395667fbce35f:9f686628032bddf2 -- skipping
zdb_blkptr_cb: Got error 50 reading <0, 24, 0, 0> [L0 SPA space map] 1000L/1000P DVA[0]=<0:2500024000:1000> DVA[1]=<0:440002c000:1000> fletcher4 uncompressed LE contiguous birth=2435914 fill=1 cksum=76284dc2e9:1438efeb0fb9b:1d2f57253c8d409:d4a881948152382b -- skipping
zdb_blkptr_cb: Got error 50 reading <0, 30, 0, 4> [L0 SPA space map] 1000L/1000P DVA[0]=<0:256000:1000> DVA[1]=<0:446000:1000> fletcher4 uncompressed LE contiguous birth=2435914 fill=1 cksum=5d6df356b5:eed102f1beb0:15280d72b8c8588:5925604865323b6b -- skipping
zdb_blkptr_cb: Got error 50 reading <0, 35, 0, 0> [L0 SPA space map] 1000L/1000P DVA[0]=<0:257000:1000> DVA[1]=<0:447000:1000> fletcher4 uncompressed LE contiguous birth=2435914 fill=1 cksum=335f09c578:8b7b18876592:c58c7a556ca72b:c22e91ead638c69c -- skipping

Error counts:

    errno  count
       50      5

block traversal size 431585053184 != alloc 431585209344 (unreachable 156160)

    bp count:         4078410
    bp logical:    433202894336  avg: 106218
    bp physical:   431089822720  avg: 105700  compression: 1.00
    bp allocated:  431585053184  avg: 105821  compression: 1.00
    SPA allocated: 431585209344  used: 57.75%

    Blocks  LSIZE  PSIZE  ASIZE    avg   comp  %Total  Type
         -      -      -      -      -      -       -  deferred free
         1    512    512     1K     1K   1.00    0.00  object directory
         3  1.50K  1.50K  3.00K     1K   1.00    0.00  object array
         1    16K    16K    32K    32K   1.00    0.00  packed nvlist
         -      -      -      -      -      -       -  packed nvlist size
       114  13.9M  1.05M  2.11M  18.9K  13.25    0.00  bplist
         -      -      -      -      -      -       -  bplist header
         -      -      -      -      -      -       -  SPA space map header
     1.36K  6.57M  3.88M  7.94M  5.82K   1.69    0.00  SPA space map
         -      -      -      -      -      -       -  ZIL intent log
     49.4K   791M   220M   442M  8.95K   3.60    0.11  DMU dnode
         4     4K  2.50K  7.50K  1.88K   1.60    0.00  DMU objset
         -      -      -      -      -      -       -  DSL directory
         2     1K     1K     2K     1K   1.00    0.00  DSL directory child map
         1    512    512     1K     1K   1.00    0.00  DSL dataset snap map
         2     1K     1K     2K     1K   1.00    0.00  DSL props
         -      -      -      -      -      -       -  DSL dataset
         -      -      -      -      -      -       -  ZFS znode
         -      -      -      -      -      -       -  ZFS V0 ACL
     3.77M   403G   401G   401G   106K   1.00   99.87  ZFS plain file
     72.4K   114M  49.0M  98.8M  1.37K   2.32    0.02  ZFS directory
         1    512    512     1K     1K   1.00    0.00  ZFS master node
         3  19.5K  1.50K  3.00K     1K  13.00    0.00  ZFS delete queue
         -      -      -      -      -      -       -  zvol object
         -      -      -      -      -      -       -  zvol prop
         -      -      -      -      -      -       -  other uint8[]
         -      -      -      -      -      -       -  other uint64[]
         -      -      -      -      -      -       -  other ZAP
         -      -      -      -      -      -       -  persistent error log
         1   128K  5.00K  10.0K  10.0K  25.60    0.00  SPA history
         -      -      -      -      -      -       -  SPA history offsets
         -      -      -      -      -      -       -  Pool properties
         -      -      -      -      -      -       -  DSL permissions
         -      -      -      -      -      -       -  ZFS ACL
         -      -      -      -      -      -       -  ZFS SYSACL
         -      -      -      -      -      -       -  FUID table
         -      -      -      -      -      -       -  FUID table size
         -      -      -      -      -      -       -  DSL dataset next clones
         -      -      -      -      -      -       -  scrub work queue
     3.89M   403G   401G   402G   103K   1.00  100.00  Total

                 capacity     operations     bandwidth        errors
description
Re: [zfs-discuss] zpool import hangs
Here's the output of 'zdb -e -bsvL tank' (without -c) in case it helps. I'll post with -c if it finishes.

Thanks,
Brad

Traversing all blocks ...

block traversal size 431585053184 != alloc 431585209344 (unreachable 156160)

    bp count:         4078410
    bp logical:    433202894336  avg: 106218
    bp physical:   431089822720  avg: 105700  compression: 1.00
    bp allocated:  431585053184  avg: 105821  compression: 1.00
    SPA allocated: 431585209344  used: 57.75%

    Blocks  LSIZE  PSIZE  ASIZE    avg   comp  %Total  Type
         -      -      -      -      -      -       -  deferred free
         1    512    512     1K     1K   1.00    0.00  object directory
         3  1.50K  1.50K  3.00K     1K   1.00    0.00  object array
         1    16K    16K    32K    32K   1.00    0.00  packed nvlist
         -      -      -      -      -      -       -  packed nvlist size
       114  13.9M  1.05M  2.11M  18.9K  13.25    0.00  bplist
         -      -      -      -      -      -       -  bplist header
         -      -      -      -      -      -       -  SPA space map header
     1.36K  6.57M  3.88M  7.94M  5.82K   1.69    0.00  SPA space map
         -      -      -      -      -      -       -  ZIL intent log
     49.4K   791M   220M   442M  8.95K   3.60    0.11  DMU dnode
         4     4K  2.50K  7.50K  1.88K   1.60    0.00  DMU objset
         -      -      -      -      -      -       -  DSL directory
         2     1K     1K     2K     1K   1.00    0.00  DSL directory child map
         1    512    512     1K     1K   1.00    0.00  DSL dataset snap map
         2     1K     1K     2K     1K   1.00    0.00  DSL props
         -      -      -      -      -      -       -  DSL dataset
         -      -      -      -      -      -       -  ZFS znode
         -      -      -      -      -      -       -  ZFS V0 ACL
     3.77M   403G   401G   401G   106K   1.00   99.87  ZFS plain file
     72.4K   114M  49.0M  98.8M  1.37K   2.32    0.02  ZFS directory
         1    512    512     1K     1K   1.00    0.00  ZFS master node
         3  19.5K  1.50K  3.00K     1K  13.00    0.00  ZFS delete queue
         -      -      -      -      -      -       -  zvol object
         -      -      -      -      -      -       -  zvol prop
         -      -      -      -      -      -       -  other uint8[]
         -      -      -      -      -      -       -  other uint64[]
         -      -      -      -      -      -       -  other ZAP
         -      -      -      -      -      -       -  persistent error log
         1   128K  5.00K  10.0K  10.0K  25.60    0.00  SPA history
         -      -      -      -      -      -       -  SPA history offsets
         -      -      -      -      -      -       -  Pool properties
         -      -      -      -      -      -       -  DSL permissions
         -      -      -      -      -      -       -  ZFS ACL
         -      -      -      -      -      -       -  ZFS SYSACL
         -      -      -      -      -      -       -  FUID table
         -      -      -      -      -      -       -  FUID table size
         -      -      -      -      -      -       -  DSL dataset next clones
         -      -      -      -      -      -       -  scrub work queue
     3.89M   403G   401G   402G   103K   1.00  100.00  Total

                     capacity     operations    bandwidth        errors
description        used avail    read  write   read  write  read write cksum
tank               402G  294G     463      0  1.27M      0     0     0     1
  mirror           402G  294G     463      0  1.27M      0     0     0     4
    /dev/dsk/c2d0p0               69      0  4.05M      0     0     0     4
    /dev/dsk/c1d0p0               67      0  3.96M      0     0     0     4
Re: [zfs-discuss] zpool import hangs
Hi Victor,

zdb -e -bcsvL tank (let this go for a few hours...no output. I will let it go overnight)

zdb -e -u tank

Uberblock
    magic = 00bab10c
    version = 4
    txg = 2435914
    guid_sum = 16655261404755214374
    timestamp = 1240517036 UTC = Thu Apr 23 15:03:56 2009

Thanks for your help,
Brad
Re: [zfs-discuss] zpool import hangs
Hello,

I've run into a problem with zpool import that, as far as I can tell, is very similar to the following thread: http://opensolaris.org/jive/thread.jspa?threadID=70205&tstart=15

The suggested solution was to use a later build of OpenSolaris (b99 or later), but that did not work. I've tried the following versions of Solaris without success:

Solaris 10 u4 (original system)
Solaris 10 u6
OpenSolaris 2008.11
OpenSolaris 2008.11 b99
SXCE b113

Any help with this will be greatly appreciated...my last backup was four months ago, so a lot of my thesis work will be lost. I mistakenly thought a mirrored zpool on new drives would be good enough for a while.

So here's what happened: we had a power outage one day, and as soon as I tried to boot the server again it entered an endless reboot cycle. So I thought the OS drive (not mirrored) had become corrupted, and reinstalled the OS. Then when I try zpool import it hangs forever. I even left it going for a couple of days in case it was trying to correct corrupted data. The same thing happens no matter what version of Solaris I use. The symptoms and diagnostic results (see below) seem very similar to the post above, but the solution doesn't work. Please let me know if you need any other information.

Thanks,
Brad

bash-3.2# zpool import
  pool: tank
    id: 4410438565134310480
 state: ONLINE
status: The pool is formatted using an older on-disk version.
action: The pool can be imported using its name or numeric identifier, though
        some features will not be available without an explicit 'zpool upgrade'.
config:

        tank        ONLINE
          mirror    ONLINE
            c2d0p0  ONLINE
            c1d0p0  ONLINE

bash-3.2# zpool import tank
cannot import 'tank': pool may be in use from other system
use '-f' to import anyway

bash-3.2# zpool import -f tank
(then it hangs here forever, can't be killed)
(the following commands were performed while this was running)

bash-3.2# fmdump -eV
TIME                           CLASS
May 27 2009 22:22:55.308533986 ereport.fs.zfs.checksum
nvlist version: 0
        class = ereport.fs.zfs.checksum
        ena = 0xd22e37db9000401
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = zfs
                pool = 0x3d350681ea839c50
                vdev = 0x85cc302105002c5d
        (end detector)
        pool = tank
        pool_guid = 0x3d350681ea839c50
        pool_context = 0
        pool_failmode = wait
        vdev_guid = 0x85cc302105002c5d
        vdev_type = disk
        vdev_path = /dev/dsk/c2d0p0
        vdev_devid = id1,c...@ast3750640as=5qd3myrh/q
        parent_guid = 0x8fb729a008f16e65
        parent_type = mirror
        zio_err = 50
        zio_offset = 0x2500407000
        zio_size = 0x1000
        zio_objset = 0x0
        zio_object = 0x23
        zio_level = 0
        zio_blkid = 0x0
        __ttl = 0x1
        __tod = 0x4a1e038f 0x1263dae2

(and many others just like this)

bash-3.2# echo "0t3735::pid2proc|::walk thread|::findstack -v" | mdb -k
stack pointer for thread df6ed220: e1ca8c54
  e1ca8c94 swtch+0x188()
  e1ca8ca4 cv_wait+0x53(e08442aa, e084426c, , f9c305ac)
  e1ca8ce4 txg_wait_synced+0x90(e0844100, 252b4c, 0, 0)
  e1ca8d34 spa_config_update_common+0x88(d6c429c0, 0, 0, e1ca8d68)
  e1ca8d84 spa_import_common+0x3cf()
  e1ca8db4 spa_import+0x18(dbfcf000, e3c0d018, 0, f9c65810)
  e1ca8de4 zfs_ioc_pool_import+0xcd(dbfcf000, 0, 0)
  e1ca8e14 zfsdev_ioctl+0x124()
  e1ca8e44 cdev_ioctl+0x31(2d8, 5a02, 80418d0, 13, dfde91f8, e1ca8f00)
  e1ca8e74 spec_ioctl+0x6b(d7a593c0, 5a02, 80418d0, 13, dfde91f8, e1ca8f00)
  e1ca8ec4 fop_ioctl+0x49(d7a593c0, 5a02, 80418d0, 13, dfde91f8, e1ca8f00)
  e1ca8f84 ioctl+0x171()
  e1ca8fac sys_sysenter+0x106()

bash-3.2# echo "::threadlist -v" | mdb -k
d4ed8dc0 fec1f5580 0 60 d5033604
  PC: _resume_from_idle+0xb1    THREAD: txg_sync_thread()
  stack pointer for thread d4ed8dc0: d4ed8ba8
    swtch+0x188()
    cv_wait+0x53()
    zio_wait+0x55()
    vdev_uberblock_sync_list+0x19e()
    vdev_config_sync+0x11c()
    spa_sync+0x5a5()
    txg_sync_thread+0x308()
    thread_start+8()

(just chose one seemingly relevant thread from the long list)

bash-3.2# zdb -e -bb tank
Traversing all blocks to verify nothing leaked ...
Assertion failed: space_map_load(&msp->ms_map, &zdb_space_map_ops, 0x0, &msp->ms_smo, spa->spa_meta_objset) == 0, file ../zdb.c, line 1420, function zdb_leak_init
Abort (core dumped)

bash-3.2# zdb -e - tank
Dataset mos [META], ID 0, cr_txg 4, 10.3M, 137 objects, rootbp [L0 DMU objset] 400L/400P DVA[0]=<0:255c00:400> DVA[1]=<0:445c00:400> DVA[2]=<0:633000:400> fletcher4 uncompressed LE contiguous birth=2435914 fill=137 cksum=5224494b4:4524146f316:1d44c6f4690ea:84ef3bd0c105a0

    Object
Re: [zfs-discuss] Data size grew.. with compression on
I've run into this too... I believe the issue is that the block size/allocation unit in ZFS is much larger than the default on older filesystems (ufs, ext2, ext3). The result is that if you have lots of files smaller than the block size, they take up more total space on the filesystem because each occupies at least one full block. See the 'recordsize' ZFS filesystem property, though re-reading the man pages, I'm not 100% sure that tuning this property will have the intended effect.

BP

> I rsynced an 11gb pile of data from a remote linux machine to a zfs
> filesystem with compression turned on.
>
> The data appears to have grown in size rather than been compressed.
>
> Many, even most of the files are formats that are already compressed,
> such as mpg jpg avi and several others. But also many text files
> (*.html) are in there. So didn't expect much compression but also
> didn't expect the size to grow.
>
> I realize these are different filesystems that may report
> differently. Reiserfs on the linux machine and zfs on osol.
>
> in bytes:
>
> Osol:  11542196307
> linux: 11525114469
> =
>        17081838
>
> Or (if I got the math right) about 16.29 MB bigger on the zfs side
> with compression on.

--
bpl...@cs.umd.edu
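The allocation-granularity effect BP describes can be modeled with a quick sketch. Caveats: the file sizes below are invented, 128K is only ZFS's *default* recordsize, and real ZFS actually allocates a single smaller block for files below the recordsize, so this is a worst-case illustration of the "round up to a whole block" argument rather than ZFS's exact behavior.

```python
# Back-of-the-envelope model: each file is rounded up to a whole allocation
# unit, so many small files inflate apparent usage. 4K stands in for an
# ext3-style block; 128K is the ZFS default recordsize.
def allocated(file_sizes, block):
    """Total bytes consumed when each file is rounded up to whole blocks."""
    return sum(((size + block - 1) // block) * block for size in file_sizes)

files = [500, 2_000, 10_000, 150_000]   # bytes; mostly small files (invented)

ext3_like = allocated(files, 4 * 1024)      # 4K blocks
zfs_like = allocated(files, 128 * 1024)     # 128K records, worst case

print(f"4K blocks:    {ext3_like} bytes")
print(f"128K records: {zfs_like} bytes")
```

Under this model the same four files consume several times more space at 128K granularity, which is the shape of the growth the original poster saw.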
Re: [zfs-discuss] Any way to set casesensitivity=mixed on the main pool?
If you have an older Solaris release using ZFS and Samba, and you upgrade to a version with CIFS support, how do you ensure the file systems/pools have casesensitivity=mixed?
Re: [zfs-discuss] Raidz1 faulted with single bad disk. Requesting
Yes. I have disconnected the bad disk and booted with nothing in the slot, and also with a known-good replacement disk on the same SATA port. Doesn't change anything. Running 2008.11 on the box and 2008.11 snv_101b_rc2 on the LiveCD. I'll give it a shot booting from the latest build and see if that makes any kind of difference. Thanks for the suggestions.

Brad

> Just a thought, but have you physically disconnected
> the bad disk? It's not unheard of for a bad disk to
> cause problems with others.
>
> Failing that, it's the "corrupted data" bit that's
> worrying me, it sounds like you may have other
> corruption on the pool (always a risk with single
> parity raid), but I'm worried that it's not giving
> you any more details as to what's wrong.
>
> Also, what version of OpenSolaris are you running?
> Could you maybe try booting off a CD of the latest
> build? There are often improvements in the way ZFS
> copes with errors, so it's worth a try. I don't
> think it's likely to help, but I wouldn't discount
> it.
Re: [zfs-discuss] Raidz1 faulted with single bad disk. Requesting assistan
I do, thank you. The disk that went out sounds like it had a head crash or some such - loud clicking shortly after spin-up, then it spins down and gives me nothing. BIOS doesn't even detect it properly to do a firmware update.

> Do you know 7200.11 has firmware bugs?
>
> Go to seagate website to check.
Re: [zfs-discuss] Raidz1 faulted with single bad disk. Requesting
r...@opensolaris:~# zpool import -f tank
internal error: Bad exchange descriptor
Abort (core dumped)

Hoping someone has seen that before... the Google is seriously letting me down on that one.

> I guess you could try 'zpool import -f'. This is a
> pretty odd status, I think. I'm pretty sure raidz1
> should survive a single disk failure.
>
> Perhaps a more knowledgeable list member can explain.
Re: [zfs-discuss] Raidz1 faulted with single bad disk. Requesting assistan
Any ideas on this? It looks like a potential bug to me, or there is something that I'm not seeing. Thanks again!
Re: [zfs-discuss] Raidz1 faulted with single bad disk. Requesting assistance.
> I've seen reports of a recent Seagate firmware update
> bricking drives again.
>
> What's the output of 'zpool import' from the LiveCD?
> It sounds like more than 1 drive is dropping off.

r...@opensolaris:~# zpool import
  pool: tank
    id: 16342816386332636568
 state: FAULTED
status: The pool was last accessed by another system.
action: The pool cannot be imported due to damaged devices or data. The pool
        may be active on another system, but can be imported using the '-f' flag.
   see: http://www.sun.com/msg/ZFS-8000-EY
config:

        tank        FAULTED   corrupted data
          raidz1    DEGRADED
            c6t0d0  ONLINE
            c6t1d0  ONLINE
            c6t2d0  ONLINE
            c6t3d0  UNAVAIL   cannot open
            c6t4d0  ONLINE

  pool: rpool
    id: 9891756864015178061
 state: ONLINE
status: The pool was last accessed by another system.
action: The pool can be imported using its name or numeric identifier and the '-f' flag.
   see: http://www.sun.com/msg/ZFS-8000-EY
config:

        rpool       ONLINE
          c3d0s0    ONLINE
Re: [zfs-discuss] Raidz1 faulted with single bad disk. Requesting assistance.
> I would get a new 1.5 TB and make sure it has the new
> firmware and replace c6t3d0 right away - even if
> someone here comes up with a magic solution, you
> don't want to wait for another drive to fail.

The replacement disk showed up today but I'm unable to replace the one marked UNAVAIL:

r...@blitz:~# zpool replace tank c6t3d0
cannot open 'tank': pool is unavailable

> I would in this case also immediately export the pool (to prevent any
> write attempts) and see about a firmware update for the failed drive
> (probably need windows for this).

While I didn't export first, I did boot with a livecd and tried to force the import with that:

r...@opensolaris:~# zpool import -f tank
internal error: Bad exchange descriptor
Abort (core dumped)

Hopefully someone on this list understands what situation I am in and how to resolve it. Again, many thanks in advance for any suggestions you all have to offer.
Re: [zfs-discuss] Raidz1 p
Sure, and thanks for the quick reply.

Controller: Supermicro AOC-SAT2-MV8 plugged into a 64-bit PCI-X 133 bus
Drives: 5 x Seagate 7200.11 1.5TB disks for the raidz1. Single 36GB Western Digital 10krpm Raptor as system disk; its mirror mate is installed but not yet attached.
Motherboard: Tyan Thunder K8W S2885 (dual AMD CPU) with 1GB ECC RAM

Anything else I can provide? (thanks again)
[zfs-discuss] Raidz1 p
Greetings! I lost one out of five disks on a machine with a raidz1 and I'm not sure exactly how to recover from it. The pool is marked as FAULTED, which I certainly wasn't expecting with only one bum disk.

r...@blitz:/# zpool status -v tank
  pool: tank
 state: FAULTED
status: One or more devices could not be opened. There are insufficient
        replicas for the pool to continue functioning.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-3C
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        FAULTED      0     0     1  corrupted data
          raidz1    DEGRADED     0     0     6
            c6t0d0  ONLINE       0     0     0
            c6t1d0  ONLINE       0     0     0
            c6t2d0  ONLINE       0     0     0
            c6t3d0  UNAVAIL      0     0     0  cannot open
            c6t4d0  ONLINE       0     0     0

Any recovery guidance I may gain from the esteemed experts of this group would be extremely appreciated. I recently migrated to opensolaris + zfs on the impassioned advice of a coworker, and will lose some data that has been modified since the move but not yet backed up. Many thanks in advance...
Re: [zfs-discuss] Aggregate Pool I/O
Well, if I do fsstat on the mountpoint of every filesystem in the ZFS pool, then I guess my aggregate number for read and write bandwidth should equal the aggregate numbers for the pool? Yes? The downside is that fsstat has the same granularity issue as zpool iostat. What I'd really like is nread and nwrite numbers instead of r/s and w/s. That way, if I miss some polls I can smooth out the results.

'kstat -c disk sd:::' is interesting, but seems to be only for locally-attached disks, right? I am using iSCSI, although soon I will also have pools with local disks. For device data, I'd really like the per-pool and per-pool-per-device breakdowns provided by zpool iostat, if only it weren't summarized in a 5-character field. Perhaps I should simply be asking for sample code that accesses libzfs.

I have rolled my own cron scheduler so I can have sub-second queries. Thanks for the info!
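The point about wanting cumulative nread/nwrite counters instead of pre-computed rates can be sketched as follows. The sampling source here is a stand-in (on Solaris you would read the byte counters from kstat, e.g. the sd disk kstats); the tracker just shows why since-boot counters survive missed polls.

```python
import time

# Why cumulative counters (bytes since boot) beat pre-computed rates for
# trending: if a poll is missed, the next delta still averages correctly
# over the longer interval. Feed in any monotonically increasing byte count.
class ThroughputTracker:
    def __init__(self):
        self.last_bytes = None
        self.last_time = None

    def sample(self, cumulative_bytes, now=None):
        """Return bytes/sec since the previous sample, or None on first call."""
        now = time.monotonic() if now is None else now
        rate = None
        if self.last_bytes is not None:
            rate = (cumulative_bytes - self.last_bytes) / (now - self.last_time)
        self.last_bytes, self.last_time = cumulative_bytes, now
        return rate

tracker = ThroughputTracker()
tracker.sample(1_000_000, now=0.0)          # first poll: baseline only
print(tracker.sample(1_600_000, now=5.0))   # 600000 bytes over 5s -> 120000.0
# A missed poll just widens the interval; the average stays correct:
print(tracker.sample(3_600_000, now=25.0))  # 2000000 bytes over 20s -> 100000.0
```

With a current-rate source like the default zpool iostat output, the missed interval's data would simply be lost; with cumulative counters it is folded into the next delta.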
[zfs-discuss] Aggregate Pool I/O
I'd like to track a server's ZFS pool I/O throughput over time. What's a good data source to use for this? I like zpool iostat for this, but if I poll at two points in time I would get a number since boot (e.g. 1.2M) and a current number (e.g. 1.3K). If I use the current number, then I've lost data between polling intervals. But if I use the number since boot, it's not precise enough to be useful.

Is there a kstat equivalent to the I/O since boot? Some other good data source? And then is there a similar kstat equivalent to iostat? Would both data values then allow me to trend file I/O versus physical disk I/O? Thanks.
Re: [zfs-discuss] zpool add dumping core
> Are you sure this isn't a case of CR 6433264 which was fixed
> long ago, but arrived in patch 118833-36 to Solaris 10?

It certainly looks similar, but this system already had 118833-36 when the error occurred, so if this bug is truly fixed, it must be something else. Then again, I wasn't adding spares, I was adding a raidz1 group, so maybe it was patched for adding spares but not other vdevs. I looked at the bug ID but couldn't tell if there was a simple test I could perform to determine if this was the same or a related bug, or something completely new. The error message is the same, except for the reported line number. Here's some mdb output similar to what was in the original bug report:

r...@kronos:/ # mdb core
Loading modules: [ libumem.so.1 libnvpair.so.1 libuutil.so.1 libc.so.1 libavl.so.1 libsysevent.so.1 ld.so.1 ]
> $c
libc.so.1`_lwp_kill+8(6, 0, ff1c3058, ff12bed8, , 6)
libc.so.1`abort+0x110(ffbfb760, 1, 0, fcba0, ff1c13d8, 0)
libc.so.1`_assert+0x64(213a8, 213d8, 277, 8d990, fc8bc, 32008)
0x1afe8(11, 0, 1a2d78, dff40, 16f2a400, 4)
0x1b028(8df60, 8cfd0, 0, 0, 0, 4)
make_root_vdev+0x9c(abe48, 0, 1, 0, 8df60, 8cfd0)
0x1342c(8, abe48, 0, 7, 0, ffbffdca)
main+0x154(9, ffbffce4, 9, 3, 33400, ffbffdc6)
_start+0x108(0, 0, 0, 0, 0, 0)

I'm happy to further poke at the core file or provide other data if anyone's interested...
Re: [zfs-discuss] zpool add dumping core
Problem solved... after the resilvers completed, the status reported that the pool needed an upgrade. I ran zpool upgrade -a, and after that completed, with no resilvering going on, the zpool add ran successfully.

I would like to suggest, however, that the behavior be fixed: the command should report something more intelligent, either "cannot add to pool during resilver" or "cannot add to pool until the pool is upgraded", whichever is correct, instead of dumping core.
[zfs-discuss] zpool add dumping core
I'm trying to add some additional devices to my existing pool, but it's not working. I'm adding a raidz group of 5 300 GB drives, but the command always fails:

r...@kronos:/ # zpool add raid raidz c8t8d0 c8t13d0 c7t8d0 c3t8d0 c5t8d0
Assertion failed: nvlist_lookup_string(cnv, "path", &path) == 0, file zpool_vdev.c, line 631
Abort (core dumped)

The disks all work, and were labeled easily using 'format' after zfs and other tools refused to look at them. Creating a UFS filesystem on them with newfs runs with no issues, but I can't add them to the existing zpool. I can use the same devices to create a NEW zpool without issue. I fully patched up this system after encountering this problem; no change.

The zpool to which I am adding them is fairly large and in a degraded state (three resilvers running: one that never seems to complete, and two related to trying to add these new disks), but I didn't think that should prevent me from adding another vdev. For those who suggest waiting 20 minutes for the resilver to finish: it's been estimating less than 30 minutes for the last 12 hours, and we're running out of space, so I wanted to add the new devices sooner rather than later. Can anyone help? Extra details below:

r...@kronos:/ # uname -a
SunOS kronos 5.10 Generic_137137-09 sun4u sparc SUNW,Sun-Fire-480R

r...@kronos:/ # smpatch analyze
137276-01 SunOS 5.10: uucico patch
122470-02 Gnome 2.6.0: GNOME Java Help Patch
121430-31 SunOS 5.8 5.9 5.10: Live Upgrade Patch
121428-11 SunOS 5.10: Live Upgrade Zones Support Patch

r...@kronos:patch # zpool list
NAME   SIZE   USED   AVAIL  CAP  HEALTH    ALTROOT
raid   4.32T  4.23T  92.1G  97%  DEGRADED  -

r...@kronos:patch # zpool status
  pool: raid
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors. Sufficient replicas exist for the pool to continue functioning in a degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device repaired.
 scrub: resilver in progress for 12h22m, 97.25% done, 0h20m to go
config:

        NAME            STATE     READ WRITE CKSUM
        raid            DEGRADED     0     0     0
          raidz1        ONLINE       0     0     0
            c9t0d0      ONLINE       0     0     0
            c6t0d0      ONLINE       0     0     0
            c2t0d0      ONLINE       0     0     0
            c4t0d0      ONLINE       0     0     0
            c10t0d0     ONLINE       0     0     0
          raidz1        ONLINE       0     0     0
            c9t1d0      ONLINE       0     0     0
            c6t1d0      ONLINE       0     0     0
            c2t1d0      ONLINE       0     0     0
            c4t1d0      ONLINE       0     0     0
            c10t1d0     ONLINE       0     0     0
          raidz1        ONLINE       0     0     0
            c9t3d0      ONLINE       0     0     0
            c6t3d0      ONLINE       0     0     0
            c2t3d0      ONLINE       0     0     0
            c4t3d0      ONLINE       0     0     0
            c10t3d0     ONLINE       0     0     0
          raidz1        DEGRADED     0     0     0
            c9t4d0      ONLINE       0     0     0
            spare       DEGRADED     0     0     0
              c5t13d0   ONLINE       0     0     0
              c6t4d0    FAULTED      0 12.3K     0  too many errors
            c2t4d0      ONLINE       0     0     0
            c4t4d0      ONLINE       0     0     0
            c10t4d0     ONLINE       0     0     0
          raidz1        DEGRADED     0     0     0
            c9t5d0      ONLINE       0     0     0
            spare       DEGRADED     0     0     0
              replacing    DEGRADED  0     0     0
                c6t5d0s0/o UNAVAIL   0     0     0  cannot open
                c6t5d0     ONLINE    0     0     0
              c11t13d0  ONLINE       0     0     0
            c2t5d0      ONLINE       0     0     0
            c4t5d0      ONLINE       0     0     0
            c10t5d0     ONLINE       0     0     0
          raidz1        ONLINE       0     0     0
            c5t9d0      ONLINE       0     0     0
            c7t9d0      ONLINE       0     0     0
            c3t9d0      ONLINE       0     0     0
            c8t9d0      ONLINE       0     0     0
            c11t9d0     ONLINE       0     0     0
          raidz1        ONLINE       0     0     0
            c5t10d0     ONLINE       0     0     0
            c7t10d0     ONLINE       0     0     0
            c3t10d0     ONLINE       0     0     0
            c8t10d0     ONLINE       0     0     0
            c11t10d0    ONLINE       0     0     0
          raidz1        ONLINE       0     0     0
            c5t11d0     ONLINE       0     0     0
Re: [zfs-discuss] ZFS filesystem creation during JumpStart
Thanks for the response, Peter. However, I'm not looking to create a different boot environment (bootenv). I'm actually looking for a way within JumpStart to separate out the ZFS filesystems of a new installation, to have better control over quotas and reservations for applications that usually run rampant later. In particular, I would like better control over the following (e.g. the ability to explicitly create them at install time):

rpool/opt  - /opt
rpool/usr  - /usr
rpool/var  - /var
rpool/home - /home

Of the above, /home can easily be created post-install, but the others need the flexibility of being explicitly called out in the JumpStart profile from the initial install to provide better ZFS accounting/controls.
[zfs-discuss] ZFS filesystem creation during JumpStart
Does anyone know of a way to specify the creation of ZFS filesystems for a ZFS root pool during a JumpStart installation? For example, creating the following during the install:

Filesystem     Mountpoint
rpool/var      /var
rpool/var/tmp  /var/tmp
rpool/home     /home

The creation of separate filesystems allows the use of quotas/reservations via ZFS, whereas these are not created/protected during a JumpStart install with a ZFS root.
Re: [zfs-discuss] ZFS and SAN
> - on a sun cluster, luns are seen on both nodes. Can
> we prevent mistakes like creating a pool on already
> assigned luns ? for example, veritas wants a "force"
> flag. With ZFS i can do :
> node1: zpool create X add lun1 lun2
> node2: zpool create Y add lun1 lun2
> and then, results are unexpected, but pool X will
> never switch again ;-) resource and zone are dead.

For our iSCSI SAN, we use iSNS to put LUNs into separate discovery domains (a default domain plus one domain per host). So, as part of creating or expanding a pool, we first move the LUNs into the appropriate host's domain. The create would fail on node2 because it wouldn't have visibility to the LUNs. Would that address your issue?
Re: [zfs-discuss] Can't rm file when "No space left on device"...
Great point. Hadn't thought of it in that way. I haven't tried truncating a file prior to trying to remove it. Either way, though, I think it is a bug if you can't remove a file once the filesystem fills up.

Brad

On Thu, 2008-06-05 at 21:13 -0600, Keith Bierman wrote:
> On Jun 5, 2008, at 8:58 PM 6/5/, Brad Diggs wrote:
>
> > Hi Keith,
> >
> > Sure you can truncate some files but that effectively corrupts
> > the files in our case and would cause more harm than good. The
> > only files in our volume are data files.
>
> So an rm is ok, but a truncation is not?
> Seems odd to me, but if that's your constraint so be it.
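For cases where destroying a file's contents is acceptable, the truncate-before-remove workaround discussed above can be sketched as follows (a generic POSIX-shell illustration using a temp file, not specific to ZFS):

```shell
# Illustration of the truncate-then-remove workaround discussed above.
# On a full copy-on-write filesystem, rm may need to allocate space to
# update metadata; truncating first releases the file's data blocks
# (destroying its contents) so the subsequent rm can proceed.

f=$(mktemp)                 # stand-in for a large file on a full pool
printf 'some data' > "$f"

: > "$f"                    # truncate to zero length, freeing data blocks
[ ! -s "$f" ] && echo "truncated"

rm -f "$f"                  # the remove itself now needs (almost) no space
[ ! -e "$f" ] && echo "removed"
```

Note that, as the thread points out, even truncation can fail to free space while a snapshot still references the file's old blocks.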
[zfs-discuss] Can't rm file when "No space left on device"...
Hello,

A customer recently brought to my attention that ZFS can get into a situation where the filesystem is full but no files can be removed. The workaround is to remove a snapshot; then you should have enough free space to remove a file. Here is a sample series of commands to reproduce the problem.

# mkfile 1g /tmp/disk.raw
# zpool create -f zFullPool /tmp/disk.raw
# sz=`df -k /zFullPool | awk '{ print $2 }' | tail -1`
# mkfile $((${sz}-1024))k /zFullPool/f1
# zfs snapshot [EMAIL PROTECTED]
# sz=`df -k /zFullPool | awk '{ print $2 }' | tail -1`
# mkfile ${sz}k /zFullPool/f2
/zFullPool/f2: initialized 401408 of 1031798784 bytes: No space left on device
# df -k /zFullPool
Filesystem    kbytes     used  avail capacity  Mounted on
zFullPool    1007659  1007403      0     100%  /zFullPool
# rm -f /zFullPool/f1
# ls -al /zFullPool
total 2014797
drwxr-xr-x   2 root  sys           4 Jun  4 12:15 .
drwxr-xr-x  31 root  root      18432 Jun  4 12:14 ..
-rw--T       1 root  root 1030750208 Jun  4 12:15 f1
-rw---       1 root  root 1031798784 Jun  4 12:15 f2
# rm -f /zFullPool/f2
# ls -al /zFullPool
total 2014797
drwxr-xr-x   2 root  sys           4 Jun  4 12:15 .
drwxr-xr-x  31 root  root      18432 Jun  4 12:14 ..
-rw--T       1 root  root 1030750208 Jun  4 12:15 f1
-rw---       1 root  root 1031798784 Jun  4 12:15 f2

At this point, the only way in which I can free up sufficient space to remove either file is to first remove the snapshot.

# zfs destroy [EMAIL PROTECTED]
# rm -f /zFullPool/f1
# ls -al /zFullPool
total 1332
drwxr-xr-x   2 root  sys           3 Jun  4 12:17 .
drwxr-xr-x  31 root  root      18432 Jun  4 12:14 ..
-rw---       1 root  root 1031798784 Jun  4 12:15 f2

Is there an existing bug on this that is going to address enabling the removal of a file without the prerequisite removal of a snapshot?

Thanks in advance,
Brad
Re: [zfs-discuss] Shrinking a zpool?
Solaris 10 update 5 was released 05/2008, but no zpool shrink :-( Any update?
[zfs-discuss] How do you determine the zfs_vdev_cache_size current value?
How do you ascertain the current zfs vdev cache size (e.g. zfs_vdev_cache_size) via mdb, kstat, or any other command?

Thanks in advance,
Brad
--
The Zone Manager
http://TheZoneManager.COM
http://opensolaris.org/os/project/zonemgr
[zfs-discuss] RFE: Start with desired end state in mind...
I love the send and receive feature of zfs. However, the one feature it lacks is that I can't specify, on the receive end, how I want the destination zfs filesystem to be created before receiving the data being sent. For example, let's say that I would like to do a compression study to determine which level of the gzip algorithm would save the most space for my data. One of the easiest ways to do that, locally or remotely, would be to use send/receive like so:

zfs snapshot zpool/[EMAIL PROTECTED]
gz=1
while [ ${gz} -le 9 ]
do
  zfs send zpool/[EMAIL PROTECTED] | \
    zfs receive -o compression=gzip-${gz} zpool/gz${gz}data
  zfs list zpool/gz${gz}data
  gz=$((gz + 1))
done
zfs destroy zpool/[EMAIL PROTECTED]

(The -o option to zfs receive shown above is exactly the capability being requested.)

Another example: let's assume that the zfs encryption feature were available today, and that I have a filesystem with compression and encryption enabled. I want to duplicate that exact zfs filesystem on another system through send/receive. Today the receive feature does not give me the ability to specify the desired end-state configuration of the destination zfs filesystem before receiving the data. I think that would be a great feature.

Just some food for thought.

Thanks in advance,
Brad
[zfs-discuss] Is gzip planned to be in S10U5?
Hello,

Is the gzip compression algorithm planned to be in Solaris 10 Update 5?

Thanks in advance,
Brad
--
The Zone Manager
http://TheZoneManager.COM
http://opensolaris.org/os/project/zonemgr
Re: [zfs-discuss] UFS on zvol Cache Questions...
Hello Darren,

Please find responses inline below...

On Fri, 2008-02-08 at 10:52 +, Darren J Moffat wrote:
> Brad Diggs wrote:
> > I would like to use ZFS but with ZFS I cannot prime the cache
> > and I don't have the ability to control what is in the cache
> > (e.g. like with the directio UFS option).
>
> Why do you believe you need that at all ?

My application is directory server. The #1 resource that directory needs to make maximum use of is RAM. To do that, I want to control every aspect of RAM utilization, both to safely use as much RAM as possible AND to avoid contention among the things trying to use RAM.

Consider the following example. A customer has a 50M-entry directory. The sum of the data (db3 files) is approximately 60GB. However, there is another 2GB for the root filesystem, 30GB for the changelog, 1GB for the transaction logs, and 10GB for the informational logs. The system on which directory server will run has only 64GB of RAM, and is configured with the following partitions:

FS     Used(GB)  Description
/      2         root
/db    60        directory data
/logs  41        changelog, txn logs, and info logs
swap   10        system swap

I prefer to keep the directory db cache and entry cache relatively small, so the db cache is 2GB and the entry cache is 100M. This leaves roughly 63GB of RAM for my 60GB of directory data and Solaris. The only way to ensure that the directory data (/db) is the only thing in the filesystem cache is to set directio on / (root) and /logs.

> What do you do to "prime" the cache with UFS

cd /db
for i in `find . -name '*.db3'`
do
  dd if="${i}" of=/dev/null
done

> and what benefit do you think it is giving you ?

Priming the directory server data into the filesystem cache reduces LDAP response time for data in the cache. This can mean the difference between sub-millisecond response time and response time on the order of tens or hundreds of milliseconds, depending on the underlying storage speed. For telcos in particular, minimal response time is paramount.

Another common scenario is when we do benchmark bakeoffs against another vendor's product. If the data isn't pre-primed, then LDAP response time and throughput will be artificially degraded until the data is primed into either the filesystem cache or the directory (db or entry) cache. Priming via LDAP operations can take many hours or even days, depending on the number of entries in the directory server; priming the same data via dd takes minutes to hours, depending on the size of the files. As you know, in benchmarking scenarios time is the most limited resource we typically have, so priming via dd is much preferred.

Lastly, to achieve optimal use of available RAM, we use directio for the root (/) and other non-data filesystems. This makes certain that the only data in the filesystem cache is the directory data.

> Have you tried just using ZFS and found it doesn't perform as you need
> or are you assuming it won't because it doesn't have directio ?

We have done extensive testing with ZFS and love it. The three areas lacking for our use cases are as follows:

* No ability to control what is in the cache, e.g. no directio.
* No absolute upper boundary on the amount of RAM consumed by ZFS. I know the ARC has a control that seems to work well; however, the ARC is only part of ZFS's RAM consumption.
* No ability to rapidly prime the ZFS cache with the data that I want in the cache.

I hope that helps give understanding to where I am coming from!

Brad
[zfs-discuss] UFS on zvol Cache Questions...
Hello,

I have a unique deployment scenario where the marriage of a ZFS zvol and UFS seems like a perfect match. Here is the list of feature requirements for my use case:

* snapshots
* rollback
* copy-on-write
* ZFS-level redundancy (mirroring, raidz, ...)
* compression
* filesystem cache control (control what's in and out)
* priming the filesystem cache (dd if=file of=/dev/null)
* control of the upper boundary of RAM consumed by the filesystem, to avoid contention between the filesystem cache and my application

Before zfs came along, I could achieve all but rollback, copy-on-write, and compression through UFS plus some volume manager. I would like to use ZFS, but with ZFS I cannot prime the cache and I don't have the ability to control what is in the cache (e.g. like with the directio UFS option). If I create a ZFS zvol and format it as a UFS filesystem, it seems like I get the best of both worlds.

Can anyone poke holes in this strategy? I think the biggest risk factor is if the ZFS zvol still uses the ARC. If so, I may be double-dipping on the filesystem cache: the UFS filesystem uses some RAM, and the ZFS zvol uses some RAM for filesystem cache. Is this a true statement, or does the zvol use a minimal amount of system RAM?

Lastly, if I were to try this scenario, does anyone know how to monitor the RAM consumed by the zvol and UFS? e.g. Is there a dtrace script for monitoring ZFS or UFS memory consumption?

Thanks in advance,
Brad
Re: [zfs-discuss] ZFS quota
> OK, you asked for "creative" workarounds... here's one (though it requires
> that the filesystem be briefly unmounted, which may be deal-killing):

That is, indeed, creative. :) And yes, the unmount makes it impractical in my environment. I ended up going back to rsync, because we had more and more complaints as the snapshots accumulated, but am now just rsyncing to another system, which in turn runs snapshots on the backup copy. It's still time- and I/O-consuming, and the users can't recover their own files, but at least I'm not eating up 200% of the otherwise-necessary space on the expensive new hardware RAID and fielding daily over-quota (when not really over-quota) complaints.

Thanks for the suggestion. Looking forward to the new feature...

BP

>
> zfs create pool/realfs
> zfs set quota=1g pool/realfs
>
> again:
> zfs umount pool/realfs
> zfs rename pool/realfs pool/oldfs
> zfs snapshot pool/[EMAIL PROTECTED]
> zfs clone pool/[EMAIL PROTECTED] pool/realfs
> zfs set quota=1g pool/realfs (6364688 would be useful here)
> zfs set quota=none pool/oldfs
> zfs promote pool/oldfs
> zfs destroy pool/backupfs
> zfs rename pool/oldfs pool/backupfs
> backup pool/[EMAIL PROTECTED]
> sleep $backupinterval
> goto again
>
> FYI, we are working on "fs-only" quotas.
>
> --matt

--
[EMAIL PROTECTED]
Re: [zfs-discuss] ZFS quota
Just wanted to voice another request for this feature. On a previous Solaris 10/ZFS system, I was forced to rsync whole filesystems and snapshot the backup copy to prevent the snapshots from negatively impacting users. This obviously reduces the available space on the system by over half. It also robs you of lots of I/O bandwidth while all that data is rsyncing, and means that users can't see their snapshots; only a sysadmin with access to the backup copy can.

We've got a new system that isn't doing the rsync, and users very quickly discovered over-quota problems when their directories appeared empty and deleting files didn't help. They required sysadmin intervention to increase their filesystem quotas to accommodate the snapshots as well as their real data. Trying to anticipate the space required for the snapshots and granting that as quota is more or less hopeless; it also gives users that much more rope with which to hang themselves with massive snapshots.

I hate to start rsyncing again, but may be forced to; policing snapshot space consumption is getting painful, but the online snapshot feature is too valuable to discard altogether. If there are other creative solutions, I'm all ears...
Re: [zfs-discuss] Re: ZFS - Use h/w raid or not? Thoughts. Considerations.
> > At the moment, I'm hearing that using h/w raid under my zfs may be
> > better for some workloads, and the h/w hot spare would be nice to
> > have across multiple raid groups, but the checksum capabilities in
> > zfs are basically nullified with single/multiple h/w LUNs,
> > resulting in "reduced protection." Therefore, it sounds like I
> > should be strongly leaning towards not using the hardware raid in
> > external disk arrays and using them like a JBOD.

> The big reasons for continuing to use hw raid are speed, in some cases,
> and heterogeneous environments where you can't farm out non-raid
> protected LUNs and raid protected LUNs from the same storage array. In
> some cases the array will require a raid protection setting, like the
> 99x0, before you can even start farming out storage.

Just a data point -- I've had miserable luck with ZFS JBOD drives failing. They consistently wedge my machines (Ultra-45, E450, V880, using SATA and SCSI drives) when one of the drives fails. The system recovers okay and without data loss after a reboot, but a total drive failure (when a drive stops talking to the system) is not handled well. Therefore I would recommend a hardware raid for high-availability applications.

Note, it's not clear that this is a ZFS problem. I suspect it's a Solaris, hardware controller, or driver problem, so this may not be an issue if you find a controller that doesn't freak out on a drive failure.

BP
--
[EMAIL PROTECTED]
[zfs-discuss] Re: Puzzling ZFS behavior with COMPRESS option
Did you find a resolution to this issue?
[zfs-discuss] Re: ZFS over NFS extra slow?
Write cache was enabled on all the ZFS drives, but disabling it gave a negligible speed improvement (FWIW, the pool has 50 drives):

(write cache on)
/bin/time tar xf /tmp/vbulletin_3-6-4.tar
real 51.6
user  0.0
sys   1.0

(write cache off)
/bin/time tar xf /tmp/vbulletin_3-6-4.tar
real 49.2
user  0.0
sys   1.0

...this is a production system, so I attribute the 2-second (4%) difference more to variable system activity than to the write cache. I suppose I could test with larger samples, but since this is still ten times slower than I want, I think this effectively discounts the disk write cache as anything significant.
[zfs-discuss] Re: ZFS over NFS extra slow?
Ah, thanks -- reading that thread did a good job of explaining what I was seeing; I was going nuts trying to isolate the problem. Is work being done to improve this performance? 100% of my users come in over NFS, and that's a huge hit. Even on single large files, writes are slower by a factor of 2 to 10 compared to copying via scp or onto a non-ZFS filesystem.

Thanks!
[zfs-discuss] ZFS over NFS extra slow?
I had a user report extreme slowness on a ZFS filesystem mounted over NFS over the weekend. After some extensive testing, the extreme slowness appears to occur only when a ZFS filesystem is mounted over NFS. One example is doing a 'gtar xzvf php-5.2.0.tar.gz'...

Over NFS onto a ZFS filesystem, this takes:

real    5m12.423s
user    0m0.936s
sys     0m4.760s

Locally on the server (to the same ZFS filesystem):

real    0m4.415s
user    0m1.884s
sys     0m3.395s

The same job over NFS to a UFS filesystem:

real    1m22.725s
user    0m0.901s
sys     0m4.479s

The same job locally on the server to the same UFS filesystem:

real    0m10.150s
user    0m2.121s
sys     0m4.953s

This is easily reproducible even with single large files, but the multiple small files seem to best illustrate the awful sync latency between each file. Any idea why ZFS over NFS is so bad? I saw the threads that talk about an fsync penalty, but they don't seem relevant, since local ZFS performance is quite good.
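The per-file latency suspected above can be probed directly with a small-file micro-benchmark. A generic sketch (hypothetical target directory; run against an NFS-mounted ZFS path versus a local path, where each create/close over NFS pays the commit cost):

```shell
# Hypothetical micro-benchmark in the spirit of the gtar test above:
# create many tiny files and time it. On NFS-to-ZFS each CREATE/close
# is committed synchronously, so this isolates per-file latency from
# raw bandwidth. Point d at the filesystem under test.

d=$(mktemp -d)              # stand-in for the NFS or local test path
start=$(date +%s)
i=0
while [ $i -lt 500 ]; do
  echo x > "$d/f$i"
  i=$((i + 1))
done
end=$(date +%s)
count=$(ls "$d" | wc -l)
echo "created $count small files in $((end - start))s"
rm -rf "$d"
```

Comparing the elapsed time on the two mounts gives a per-file latency figure, which is more diagnostic than a single large-file copy.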