Re: [zfs-discuss] Running on Dell hardware?
'Tim Cook' wrote: [... snip ... ] Dell requires Dell branded drives as of roughly 8 months ago. I don't think there was ever an H700 firmware released that didn't require this. I'd bet you're going to waste a lot of money to get a drive the system refuses to recognize. This should no longer be an issue as Dell has abandoned that practice because of customer pressure. --Tim ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss -- Med venlig hilsen / Best Regards Henrik Johansen hen...@scannet.dk Tlf. 75 53 35 00 ScanNet Group A/S ScanNet ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Running on Dell hardware?
'Edward Ned Harvey' wrote:
>> From: Henrik Johansen [mailto:hen...@scannet.dk]
>>
>> The 10g models are stable - especially the R905's are real workhorses.
>
> You would generally consider all your machines stable now? Can you easily pdsh to all those machines?

Yes - the only problem child has been 1 R610 (the other 2 that we have in production have not shown any signs of trouble)

> kstat | grep current_cstate ; kstat | grep supported_max_cstates
>
> I'd really love to see if "some current_cstate is higher than supported_max_cstates" is an accurate indicator of system instability.

Here's a little sample from different machines :

R610 #1
current_cstate:        3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 0
supported_max_cstates: 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

R610 #2
current_cstate:        3 0 3 3 3 3 3 3 3 3 3 3 3 3 3 3
supported_max_cstates: 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

PE2900
current_cstate:        1 1 0 1 1 0 1 1
supported_max_cstates: 1 1 1 1 1 1 1 1

PER905
current_cstate:        1 1 1 1 0 1 1 1 1 1 1 1 1 0 1 1
supported_max_cstates: 0 0 0 0 0 0
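For anyone who wants to run the same comparison across a whole farm, a small sketch along these lines should do the trick (untested here - the pdsh group name is just an example, and the kstat statistics are the ones shown above):

    #!/bin/sh
    # cstate-check.sh - flag a host whose CPUs report a deeper C-state than
    # the advertised maximum.  Push it out and run it via e.g.
    #   pdsh -g dellfarm 'sh /tmp/cstate-check.sh'
    cur=`kstat -p -s current_cstate | awk '{ print $2 }' | sort -n | tail -1`
    max=`kstat -p -s supported_max_cstates | awk '{ print $2 }' | sort -n | tail -1`
    if [ -n "$cur" ] && [ -n "$max" ] && [ "$cur" -gt "$max" ]; then
            echo "`hostname`: SUSPECT - current_cstate=$cur > supported_max_cstates=$max"
    else
            echo "`hostname`: ok - current_cstate=$cur supported_max_cstates=$max"
    fi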
Re: [zfs-discuss] Running on Dell hardware?
'Edward Ned Harvey' wrote:
> I have a Dell R710 which has been flaky for some time. It crashes about once per week. I have literally replaced every piece of hardware in it, and reinstalled Sol 10u9 fresh and clean. I am wondering if other people out there are using Dell hardware, with what degree of success, and in what configuration?

We are running (Open)Solaris on lots of 10g servers (PE2900, PE1950, PE2950, R905) and some 11g (R610 and soon some R815) with both PERC and non-PERC controllers and lots of MD1000's. The 10g models are stable - especially the R905's are real workhorses.

We have had only one 11g server (R610) which caused trouble. The box froze at least once a week - after replacing almost the entire box I switched from the old iscsitgt to COMSTAR and the box has been stable since. Go figure ... I might add that none of these machines use the onboard Broadcom NICs.

> The failure seems to be related to the PERC 6/i. For some period around the time of the crash, the system still responds to ping, and anything currently in memory or running from remote storage continues to function fine. But new processes that require the local storage ... such as inbound ssh etc., or even physical login at the console ... those are all hosed. And eventually the system stops responding to ping. As soon as the problem starts, the only recourse is a power cycle. I can't seem to reproduce the problem reliably, but it does happen regularly. Yesterday it happened several times in one day, but sometimes it will go 2 weeks without a problem. Again, just wondering what other people are using, and experiencing. To see if any more clues can be found to identify the cause.

-- Med venlig hilsen / Best Regards Henrik Johansen hen...@scannet.dk Tlf. 75 53 35 00 ScanNet Group A/S ScanNet

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] I/O statistics for each file system
On 05/17/10 03:05 PM, eXeC001er wrote:
Perfect! I found info about kstat for Perl. Where can I find the meaning of each field?

Most of them can be found here under the section "I/O kstat": http://docs.sun.com/app/docs/doc/819-2246/kstat-3kstat?a=view

r...@atom:~# kstat stmf:0:stmf_lu_io_ff00d1c2a8f8 1274100947
module: stmf                            instance: 0
name:   stmf_lu_io_ff00d1c2a8f8         class:    io
        crtime          2333040.65018394
        nread           9954962
        nwritten        5780992
        rcnt            0
        reads           599
        rlastupdate     2334856.48028583
        rlentime        2.792307252
        rtime           2.453258966
        snaptime        2335022.3396771
        wcnt            0
        wlastupdate     2334856.43951113
        wlentime        0.103487047
        writes          510
        wtime           0.069508209

2010/5/17 Henrik Johansen <hen...@scannet.dk>:

Hi,

On 05/17/10 01:57 PM, eXeC001er wrote:
Good, but this utility is used to view statistics for mounted FS. How can I view statistics for an iSCSI shared FS?

fsstat(1M) relies on certain kstat counters for its operation - last I checked I/O against zvols does not update those counters. If you are using newer builds and COMSTAR you can use the stmf kstat counters to get I/O details per target and per LUN.

Thanks.

2010/5/17 Darren J Moffat <darr...@opensolaris.org>:

On 17/05/2010 12:41, eXeC001er wrote:
I know that I can view statistics for the pool (zpool iostat). I want to view statistics for each file system on the pool. Is it possible?

See fsstat(1M)

-- Darren J Moffat

-- Med venlig hilsen / Best Regards Henrik Johansen hen...@scannet.dk Tlf. 75 53 35 00 ScanNet Group A/S ScanNet

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
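If all you want is a quick throughput figure per LU instead of the raw counters, sampling those kstats twice and diffing them is enough. A rough sketch (the 10 second interval is arbitrary and this is not polished beyond the basics):

    #!/bin/sh
    # Rough read/write throughput per COMSTAR LU, derived from the stmf
    # nread/nwritten kstats sampled INTERVAL seconds apart.
    INTERVAL=10
    snap() {
            kstat -p -m stmf -s nread
            kstat -p -m stmf -s nwritten
    }
    snap | sort > /tmp/stmf.$$.a
    sleep $INTERVAL
    snap | sort > /tmp/stmf.$$.b
    # join on the kstat name and print the per-second delta as KB/s
    join /tmp/stmf.$$.a /tmp/stmf.$$.b | nawk -v i=$INTERVAL \
            '{ printf "%-60s %10.1f KB/s\n", $1, ($3 - $2) / i / 1024 }'
    rm -f /tmp/stmf.$$.a /tmp/stmf.$$.b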
Re: [zfs-discuss] I/O statistics for each file system
Hi,

On 05/17/10 01:57 PM, eXeC001er wrote:
Good, but this utility is used to view statistics for mounted FS. How can I view statistics for an iSCSI shared FS?

fsstat(1M) relies on certain kstat counters for its operation - last I checked I/O against zvols does not update those counters. If you are using newer builds and COMSTAR you can use the stmf kstat counters to get I/O details per target and per LUN.

Thanks.

2010/5/17 Darren J Moffat <darr...@opensolaris.org>:

On 17/05/2010 12:41, eXeC001er wrote:
I know that I can view statistics for the pool (zpool iostat). I want to view statistics for each file system on the pool. Is it possible?

See fsstat(1M)

-- Darren J Moffat

-- Med venlig hilsen / Best Regards Henrik Johansen hen...@scannet.dk Tlf. 75 53 35 00 ScanNet Group A/S ScanNet

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [indiana-discuss] future of OpenSolaris
On 02/22/10 09:52 PM, Tim Cook wrote: On Mon, Feb 22, 2010 at 2:21 PM, Jacob Ritorto mailto:jacob.rito...@gmail.com>> wrote: Since it seems you have absolutely no grasp of what's happening here, Coming from the guy proclaiming the sky is falling without actually having ANY official statement whatsoever to back up that train of thought. perhaps it would be best for you to continue to sit idly by and let this happen. Thanks helping out with the crude characterisations though. Idly let what happen? The unconfirmed death of opensolaris that you've certified for us all without any actual proof? Well - the lack of support subscriptions *is* a death sentence for OpenSolaris in many companies and I believe that this is what the OP complained about. Do you understand that the OpenSolaris page has a sunset in it and the Solaris page doesn't? I understand previous versions of every piece of software Oracle sells have Sunset pages, yes. If you read the page I sent you, it clearly states that every release of Opensolaris gets 5 years of support from GA. That doesn't mean they aren't releasing another version. That doesn't mean they're ending the opensolaris project. That doesn't mean they are no longer selling support for it. Had you actually read the link I posted, you'd have figured that out. Sun provides contractual support on the OpenSolaris OS for up to five years from the product's first General Availability (GA) date as described <http://www.sun.com/service/eosl/eosl_opensolaris.html>. OpenSolaris Package Updates are released approximately every 6 months. OpenSolaris Subscriptions entitle customers during the term of the Customer's Subscription contract to receive support on their current version of OpenSolaris, as well as receive individual Package Updates and OpenSolaris Support Repository Package Updates when made commercially available by Sun. Sun may require a Customer to download and install Package Updates or OpenSolaris OS Updates that have been released since Customer's previous installation of OpenSolaris, particularly when fixes have already been Have you spent enough (any) time trying to renew your contracts only to see that all mentions of OpenSolaris have been deleted from the support pages over the last few days? Can you tell me which Oracle rep you've spoken to who confirmed the cancellation of Opensolaris? It's funny, nobody I've talked to seems to have any idea what you're talking about. So please, a name would be wonderful so I can direct my inquiry to this as-of-yet unnamed source. I have spoken to our local Oracle sales office last week because I wanted to purchase a new OpenSolaris support contract - I was informed that this was no longer possible and that Oracle is unable to provide paid support for OpenSolaris at this time. This, specifically, is what has been yanked out from under me and my company. This represents years of my and my team's effort and investment. Again, without some sort of official word, nothing has changed... I take the official Oracle website to be rather ... official ? Lets recap, shall we ? a) Almost every trace of OpenSolaris Support subscriptions vanished from the official website within the last 14 days. b) An Oracle sales rep informed me personally last week that I could no longer purchase support subscriptions for OpenSolaris. Please, do me a favor and call your local Oracle rep and ask for an Opensolaris Support subscription quote and let us know how it goes ... It says right here those contracts are for both solaris AND opensolaris. 
http://www.sun.com/service/subscriptions/index.jsp Click Sun System Service Plans <http://www.sun.com/service/serviceplans/sunspectrum/index.jsp>: http://www.sun.com/service/serviceplans/sunspectrum/index.jsp Sun System Service Plans for Solaris Sun System Service Plans for the Solaris Operating System provide integrated hardware and* Solaris OS (or OpenSolaris OS)* support service coverage to help keep your systems running smoothly. This single price, complete system approach is ideal for companies running Solaris on Sun hardware. Sun System Service Plans != (Open)Solaris Support subscriptions But thank you for the scare chicken little. --Tim -- Med venlig hilsen / Best Regards Henrik Johansen hen...@scannet.dk Tlf. 75 53 35 00 ScanNet Group A/S ScanNet ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] future of OpenSolaris
On 02/22/10 03:35 PM, Jacob Ritorto wrote:
On 02/22/10 09:19, Henrik Johansen wrote:
On 02/22/10 02:33 PM, Jacob Ritorto wrote:
On 02/22/10 06:12, Henrik Johansen wrote:

Well - one thing that makes me feel a bit uncomfortable is the fact that you no longer can buy OpenSolaris Support subscriptions. Almost every trace of it has vanished from the Sun/Oracle website and a quick call to our local Sun office confirmed that they apparently no longer sell them.

I was actually very startled to see that since we're using it in production here. After digging through the web for hours, I found that OpenSolaris support is now included in Solaris support. This is a win for us because we never know if a particular box, especially a dev box, is going to remain Solaris or OpenSolaris for the duration of a support purchase and now we're free to mix and mingle. If you refer to the Solaris support web page (png attached if the mailing list allows), you'll see that OpenSolaris is now officially part of the deal and is no longer being treated as a second-class support offering.

That would be *very* nice indeed. I have checked the URL in your screenshot but I am getting a different result (png attached). Oh well - I'll just have to wait and see.

Confirmed your finding, Henrik. This is a showstopper for us as the higher-ups are already quite leery of Sun/Oracle and the future of Solaris. I'm calling Oracle to see if I can get some answers. The SUSE folks recently took a big chunk of our UNIX business here and OpenSolaris was my main tool in battling that. For us, the loss of OpenSolaris and its support likely indicates the end of Solaris altogether.

Well - I too am reluctant to put more OpenSolaris boxes into production until this matter has been resolved.

-- Med venlig hilsen / Best Regards Henrik Johansen hen...@scannet.dk Tlf. 75 53 35 00 ScanNet Group A/S ScanNet

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] future of OpenSolaris
On 02/22/10 12:00 PM, Michael Ramchand wrote:
I think Oracle have been quite clear about their plans for OpenSolaris. They have publicly said they plan to continue to support it and the community. They're just a little distracted right now because they are in the process of on-boarding many thousand Sun employees, and trying to get them feeling happy, comfortable and at home in their new surroundings so that they can start making money again. The silence means that you're in a queue and they forgot to turn the "hold" music on. Have patience. :-)

Well - one thing that makes me feel a bit uncomfortable is the fact that you no longer can buy OpenSolaris Support subscriptions. Almost every trace of it has vanished from the Sun/Oracle website and a quick call to our local Sun office confirmed that they apparently no longer sell them.

On 02/22/10 09:22, Eugen Leitl wrote:
Oracle's silence is starting to become a bit ominous. What are the future options for zfs, should OpenSolaris be left dead in the water by Suracle? I have no insight into who the core zfs developers are (have any been fired by Sun even prior to the merger?), and who's paying them. Assuming a worst case scenario, what would be the best candidate for a fork? Nexenta? Debian already included FreeBSD as a kernel flavor into its fold, it seems Nexenta could also be a good candidate. Maybe anyone in the know could provide a short blurb on what the state is, and what the options are.

-- Med venlig hilsen / Best Regards Henrik Johansen hen...@scannet.dk Tlf. 75 53 35 00 ScanNet Group A/S ScanNet

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Large scale ZFS deployments out there (>200 disks)
On 01/29/10 07:36 PM, Richard Elling wrote:
On Jan 29, 2010, at 12:45 AM, Henrik Johansen wrote:
On 01/28/10 11:13 PM, Lutz Schumann wrote:

While thinking about ZFS as the next generation filesystem without limits I am wondering if the real world is ready for this kind of incredible technology ... I'm actually speaking of hardware :) ZFS can handle a lot of devices. Once the import bug (http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6761786) is fixed it should be able to handle a lot of disks.

That was fixed in build 125.

I want to ask the ZFS community and users what large scale deployments are out there. How many disks? How much capacity? Single pool or many pools on a server? How does resilver work in those environments? How do you back up? What is the experience so far? Major headaches? It would be great if large scale users would share their setups and experiences with ZFS.

The largest ZFS deployment that we have is currently comprised of 22 Dell MD1000 enclosures (330 750 GB Nearline SAS disks). We have 3 head nodes and use one zpool per node, comprised of rather narrow (5+2) RAIDZ2 vdevs. This setup is exclusively used for storing backup data.

This is an interesting design. It looks like a good use of hardware and redundancy for backup storage. Would you be able to share more of the details? :-)

Each head node (Dell PE 2900's) has 3 PERC 6/E controllers (LSI 1078 based) with 512 MB cache each. The PERC 6/E supports both load-balancing and path failover so each controller has 2 SAS connections to a daisy-chained group of 3 MD1000 enclosures.

The RAIDZ2 vdev layout was chosen because it gives a reasonable performance vs space ratio and it maps nicely onto the 15 disk MD1000's ( 2 x (5+2) + 1 ). There is room for improvement in the design (fewer disks per controller, faster PCI Express slots, etc) but performance is good enough for our current needs.

Resilver times could be better - I am sure that this will improve once we upgrade from S10u9 to 2010.03.

Nit: Solaris 10 u9 is 10/03 or 10/04 or 10/05, depending on what you read. Solaris 10 u8 is 11/09.

One of the things that I am missing in ZFS is the ability to prioritize background operations like scrub and resilver. All our disks are idle during daytime and I would love to be able to take advantage of this, especially during resilver operations.

Scrub I/O is given the lowest priority and is throttled. However, I am not sure that the throttle is in Solaris 10, because that source is not publicly available. In general, you will not notice a resource cap until the system utilization is high enough that the cap is effective. In other words, if the system is mostly idle, the scrub consumes the bulk of the resources.

That's not what I am seeing - resilver operations crawl even when the pool is idle.

This setup has been running for about a year with no major issues so far. The only hiccups we've had were all HW related (no fun in firmware upgrading 200+ disks).

ugh.
-- richard

-- Med venlig hilsen / Best Regards Henrik Johansen hen...@scannet.dk Tlf. 75 53 35 00 ScanNet Group A/S ScanNet

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
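To make the layout concrete for the archives: per 15-slot MD1000 the ( 2 x (5+2) + 1 ) scheme works out to something like the following - illustrative only, the c#t#d# names are placeholders and not our actual device paths:

    # one MD1000: two 7-disk raidz2 vdevs plus one hot spare
    zpool create backup01 \
        raidz2 c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0 c2t6d0 \
        raidz2 c2t7d0 c2t8d0 c2t9d0 c2t10d0 c2t11d0 c2t12d0 c2t13d0 \
        spare c2t14d0

    # every further enclosure is added the same way:
    # zpool add backup01 raidz2 <7 disks> raidz2 <7 disks> spare <1 disk>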
Re: [zfs-discuss] Large scale ZFS deployments out there (>200 disks)
On 01/28/10 11:13 PM, Lutz Schumann wrote:

While thinking about ZFS as the next generation filesystem without limits I am wondering if the real world is ready for this kind of incredible technology ... I'm actually speaking of hardware :) ZFS can handle a lot of devices. Once the import bug (http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6761786) is fixed it should be able to handle a lot of disks.

That was fixed in build 125.

I want to ask the ZFS community and users what large scale deployments are out there. How many disks? How much capacity? Single pool or many pools on a server? How does resilver work in those environments? How do you back up? What is the experience so far? Major headaches? It would be great if large scale users would share their setups and experiences with ZFS.

The largest ZFS deployment that we have is currently comprised of 22 Dell MD1000 enclosures (330 750 GB Nearline SAS disks). We have 3 head nodes and use one zpool per node, comprised of rather narrow (5+2) RAIDZ2 vdevs. This setup is exclusively used for storing backup data.

Resilver times could be better - I am sure that this will improve once we upgrade from S10u9 to 2010.03.

One of the things that I am missing in ZFS is the ability to prioritize background operations like scrub and resilver. All our disks are idle during daytime and I would love to be able to take advantage of this, especially during resilver operations.

This setup has been running for about a year with no major issues so far. The only hiccups we've had were all HW related (no fun in firmware upgrading 200+ disks).

Will you? :)

Thanks, Robert

-- Med venlig hilsen / Best Regards Henrik Johansen hen...@scannet.dk Tlf. 75 53 35 00 ScanNet Group A/S ScanNet

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Pulsing write performance
Ross Walker wrote:
On Aug 27, 2009, at 4:30 AM, David Bond wrote:

Hi,

I was directed here after posting in CIFS discuss (as I first thought that it could be a CIFS problem). I posted the following in CIFS:

When using iometer from Windows to the file share on OpenSolaris svn101 and svn111 I get pauses every 5 seconds of around 5 seconds (maybe a little less) where no data is transferred; when data is transferred it is at a fair speed and gets around 1000-2000 IOPS with 1 thread (depending on the work type). The maximum read response time is 200ms and the maximum write response time is 9824ms, which is very bad - an almost 10 second delay in being able to send data to the server. This has been experienced on 2 test servers; the same servers have also been tested with Windows Server 2008 and they haven't shown this problem (the share performance was slightly lower than CIFS, but it was consistent, and the average access time and maximums were very close).

I just noticed that if the server hasn't hit its target ARC size, the pauses are for maybe .5 seconds, but as soon as it hits its ARC target, the IOPS drop to around 50% of what they were and then there are the longer pauses of around 4-5 seconds. And then after every pause the performance slows even more. So it appears it is definitely server-side.

This is with 100% random IO with a spread of 33% write 66% read, 2KB blocks, over a 50GB file, no compression, and a 5.5GB target ARC size.

Also I have just run some tests with different IO patterns and 100% sequential writes produce a consistent IO of 2100 IOPS, except when it pauses for maybe .5 seconds every 10 - 15 seconds. 100% random writes produce around 200 IOPS with a 4-6 second pause around every 10 seconds. 100% sequential reads produce around 3700 IOPS with no pauses, just random peaks in response time (only 16ms) after about 1 minute of running, so nothing to complain about. 100% random reads produce around 200 IOPS, with no pauses.

So it appears that writes cause a problem. What is causing these very long write delays? A network capture shows that the server doesn't respond to the write from the client when these pauses occur. Also, when using iometer, the initial file creation doesn't have any pauses, so it might only happen when modifying files.

Any help on finding a solution to this would be really appreciated.

What version? And system configuration?

I think it might be the issue where ZFS/ARC write caches more than the underlying storage can handle writing in a reasonable time. There is a parameter to control how much is write cached, I believe it is zfs_write_override.

You should be able to disable the write throttle mechanism altogether with the undocumented zfs_no_write_throttle tunable. I never got around to testing this though ...

-Ross

-- Med venlig hilsen / Best Regards Henrik Johansen hen...@scannet.dk Tlf. 75 53 35 00 ScanNet Group A/S ScanNet

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
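For the record, the usual way to flip a tunable like that is an /etc/system entry or a live poke with mdb. Treat this strictly as an untested sketch (as said above, the effect was never verified here, and the tunable only exists on builds that ship the write throttle):

    * /etc/system - disable the ZFS write throttle (takes effect after a reboot)
    set zfs:zfs_no_write_throttle = 1

    # or experiment on the live kernel (reverts on reboot):
    echo 'zfs_no_write_throttle/W 0t1' | mdb -kw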
Re: [zfs-discuss] Pool iscsi /zfs performance in opensolaris 0906
Joseph L. Casale wrote:

Quick snippet from zpool iostat :

  mirror     1.12G   695G      0      0      0      0
    c8t12d0      -      -      0      0      0      0
    c8t13d0      -      -      0      0      0      0
  c7t2d0        4K  29.0G      0  1.56K      0   200M
  c7t3d0        4K  29.0G      0  1.58K      0   202M

The disks on c7 are both Intel X25-E

Henrik,
So the SATA disks are in the MD1000 behind the PERC 6/E - how have you configured/attached the 2 SSD slogs and the L2ARC drive? If I understand you, you have used 14 of the 15 slots in the MD, so I assume you have the 3 SSD's in the R905; what controller are they running on?

The internal PERC 6/i controller - but I've had them on the PERC 6/E during other test runs since I have a couple of spare MD1000's at hand. Both controllers work well with the SSD's.

Thanks! jlc

-- Med venlig hilsen / Best Regards Henrik Johansen hen...@scannet.dk Tlf. 75 53 35 00 ScanNet Group A/S ScanNet

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Pool iscsi /zfs performance in opensolaris 0906
Ross Walker wrote: On Aug 5, 2009, at 2:49 AM, Henrik Johansen wrote: Ross Walker wrote: On Aug 4, 2009, at 8:36 PM, Carson Gaspar wrote: Ross Walker wrote: I get pretty good NFS write speeds with NVRAM (40MB/s 4k sequential write). It's a Dell PERC 6/e with 512MB onboard. ... there, dedicated slog device with NVRAM speed. It would be even better to have a pair of SSDs behind the NVRAM, but it's hard to find compatible SSDs for these controllers, Dell currently doesn't even support SSDs in their RAID products :-( Isn't the PERC 6/e just a re-branded LSI? LSI added SSD support recently. Yes, but the LSI support of SSDs is on later controllers. Sure that's not just a firmware issue ? My PERC 6/E seems to support SSD's : # ./MegaCli -AdpAllInfo -a2 | grep -i ssd Enable Copyback to SSD on SMART Error : No Enable SSD Patrol Read : No Allow SSD SAS/SATA Mix in VD : No Allow HDD/SSD Mix in VD : No Controller info :Versions Product Name: PERC 6/E Adapter Serial No : FW Package Build: 6.0.3-0002 Mfg. Data Mfg. Date : 06/08/07 Rework Date : 06/08/07 Revision No : Battery FRU : N/A Image Versions in Flash: FW Version : 1.11.82-0473 BIOS Version : NT13-2 WebBIOS Version: 1.1-32-e_11-Rel Ctrl-R Version : 1.01-010B Boot Block Version : 1.00.00.01-0008 I currently have 2 x Intel X25-E (32 GB) as dedicated slogs and 1 x Intel X25-M (80 GB) for the L2ARC behind a PERC 6/i on my Dell R905 testbox. So far there have been no problems with them. Really? Now you have my interest. Two questions, did you get the X25 from Dell? Are you using it with a hot-swap carrier? Knowing that these will work would be great news. Those disks are not from Dell as they were incapable of delivering Intel SSD's. Just out of curiosity - do they have to be from Dell ? I have tested the Intel SSD's on various Dell servers - they work out-of-the-box with both their 2.5" and 3.5" trays (the 3.5" trays do require a SATA interposer which is included with all SATA disks ordered from them). -Ross -- Med venlig hilsen / Best Regards Henrik Johansen hen...@scannet.dk Tlf. 75 53 35 00 ScanNet Group A/S ScanNet ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Pool iscsi /zfs performance in opensolaris 0906
Ross Walker wrote: On Aug 5, 2009, at 3:09 AM, Henrik Johansen wrote: Ross Walker wrote: On Aug 4, 2009, at 10:22 PM, Bob Friesenhahn > wrote: On Tue, 4 Aug 2009, Ross Walker wrote: Are you sure that it is faster than an SSD? The data is indeed pushed closer to the disks, but there may be considerably more latency associated with getting that data into the controller NVRAM cache than there is into a dedicated slog SSD. I don't see how, as the SSD is behind a controller it still must make it to the controller. If you take a look at 'iostat -x' output you will see that the system knows about a queue for each device. If it was any other way, then a slow device would slow down access to all of the other devices. If there is concern about lack of bandwidth (PCI- E?) to the controller, then you can use a separate controller for the SSDs. It's not bandwidth. Though with a lot of mirrors that does become a concern. Well the duplexing benefit you mention does hold true. That's a complex real-world scenario that would be hard to benchmark in production. But easy to see the effects of. I actually meant to say, hard to bench out of production. Tests done by others show a considerable NFS write speed advantage when using a dedicated slog SSD rather than a controller's NVRAM cache. I get pretty good NFS write speeds with NVRAM (40MB/s 4k sequential write). It's a Dell PERC 6/e with 512MB onboard. I get 47.9 MB/s (60.7 MB/s peak) here too (also with 512MB NVRAM), but that is not very good when the network is good for 100 MB/s. With an SSD, some other folks here are getting essentially network speed. In testing with ram disks I was only able to get a max of around 60MB/ s with 4k block sizes, with 4 outstanding. I can do 64k blocks now and get around 115MB/s. I just ran some filebench microbenchmarks against my 10 Gbit testbox which is a Dell R905, 4 x 2.5 Ghz AMD Quad Core CPU's and 64 GB RAM. My current pool is comprised of 7 mirror vdevs (SATA disks), 2 Intel X25-E as slogs and 1 Intel X25-M for the L2ARC. The pool is a MD1000 array attached to a PERC 6/E using 2 SAS cables. The nic's are ixgbe based. Here are the numbers : Randomwrite benchmark - via 10Gbit NFS : IO Summary: 4483228 ops, 73981.2 ops/s, (0/73981 r/w) 578.0mb/s, 44us cpu/op, 0.0ms latency Randomread benchmark - via 10Gbit NFS : IO Summary: 7663903 ops, 126467.4 ops/s, (126467/0 r/w) 988.0mb/s, 5us cpu/op, 0.0ms latency The real question is if these numbers can be trusted - I am currently preparing new test runs with other software to be able to do a comparison. Yes, need to make sure it is sync io as NFS clients can still choose to use async and work out of their own cache. Quick snipped from zpool iostat : mirror 1.12G 695G 0 0 0 0 c8t12d0 - - 0 0 0 0 c8t13d0 - - 0 0 0 0 c7t2d04K 29.0G 0 1.56K 0 200M c7t3d04K 29.0G 0 1.58K 0 202M The disks on c7 are both Intel X25-E -Ross -- Med venlig hilsen / Best Regards Henrik Johansen hen...@scannet.dk Tlf. 75 53 35 00 ScanNet Group A/S ScanNet ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Pool iscsi /zfs performance in opensolaris 0906
Ross Walker wrote: On Aug 4, 2009, at 10:17 PM, James Lever wrote: On 05/08/2009, at 11:41 AM, Ross Walker wrote: What is your recipe for these? There wasn't one! ;) The drive I'm using is a Dell badged Samsung MCCOE50G5MPQ-0VAD3. So the key is the drive needs to have the Dell badging to work? I called my rep about getting a Dell badged SSD and he told me they didn't support those in MD series enclosures so therefore were unavailable. If the Dell branded SSD's are Samsung's then you might want to search the archives - if I remember correctly there were mentionings of less-than-desired performance using them but I cannot recall the details. Maybe it's time for a new account rep. -Ross ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss -- Med venlig hilsen / Best Regards Henrik Johansen hen...@scannet.dk Tlf. 75 53 35 00 ScanNet Group A/S ScanNet ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Pool iscsi /zfs performance in opensolaris 0906
Ross Walker wrote: On Aug 4, 2009, at 10:22 PM, Bob Friesenhahn > wrote: On Tue, 4 Aug 2009, Ross Walker wrote: Are you sure that it is faster than an SSD? The data is indeed pushed closer to the disks, but there may be considerably more latency associated with getting that data into the controller NVRAM cache than there is into a dedicated slog SSD. I don't see how, as the SSD is behind a controller it still must make it to the controller. If you take a look at 'iostat -x' output you will see that the system knows about a queue for each device. If it was any other way, then a slow device would slow down access to all of the other devices. If there is concern about lack of bandwidth (PCI-E?) to the controller, then you can use a separate controller for the SSDs. It's not bandwidth. Though with a lot of mirrors that does become a concern. Well the duplexing benefit you mention does hold true. That's a complex real-world scenario that would be hard to benchmark in production. But easy to see the effects of. I actually meant to say, hard to bench out of production. Tests done by others show a considerable NFS write speed advantage when using a dedicated slog SSD rather than a controller's NVRAM cache. I get pretty good NFS write speeds with NVRAM (40MB/s 4k sequential write). It's a Dell PERC 6/e with 512MB onboard. I get 47.9 MB/s (60.7 MB/s peak) here too (also with 512MB NVRAM), but that is not very good when the network is good for 100 MB/s. With an SSD, some other folks here are getting essentially network speed. In testing with ram disks I was only able to get a max of around 60MB/ s with 4k block sizes, with 4 outstanding. I can do 64k blocks now and get around 115MB/s. I just ran some filebench microbenchmarks against my 10 Gbit testbox which is a Dell R905, 4 x 2.5 Ghz AMD Quad Core CPU's and 64 GB RAM. My current pool is comprised of 7 mirror vdevs (SATA disks), 2 Intel X25-E as slogs and 1 Intel X25-M for the L2ARC. The pool is a MD1000 array attached to a PERC 6/E using 2 SAS cables. The nic's are ixgbe based. Here are the numbers : Randomwrite benchmark - via 10Gbit NFS : IO Summary: 4483228 ops, 73981.2 ops/s, (0/73981 r/w) 578.0mb/s, 44us cpu/op, 0.0ms latency Randomread benchmark - via 10Gbit NFS : IO Summary: 7663903 ops, 126467.4 ops/s, (126467/0 r/w) 988.0mb/s, 5us cpu/op, 0.0ms latency The real question is if these numbers can be trusted - I am currently preparing new test runs with other software to be able to do a comparison. There is still bus and controller plus SSD latency. I suppose one could use a pair of disks as an slog mirror, enable NVRAM just for those and let the others do write-through with their disk caches But this encounters the problem that when the NVRAM becomes full then you hit the wall of synchronous disk write performance. With the SSD slog, the write log can be quite large and disk writes are then done in a much more efficient ordered fashion similar to non- sync writes. Yes, you have a point there. So, what SSD disks do you use? -Ross ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss -- Med venlig hilsen / Best Regards Henrik Johansen hen...@scannet.dk Tlf. 75 53 35 00 ScanNet Group A/S ScanNet ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
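In case anyone wants to reproduce these numbers: the workloads above are just the stock filebench personalities driven interactively, roughly like this (paths and sizes are examples, and variable names may differ a bit between filebench versions):

    # filebench
    filebench> load randomwrite
    filebench> set $dir=/tank10g/fbtest
    filebench> set $filesize=50g
    filebench> run 60

The randomread run is the same thing with 'load randomread'.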
Re: [zfs-discuss] Pool iscsi /zfs performance in opensolaris 0906
Ross Walker wrote: On Aug 4, 2009, at 8:36 PM, Carson Gaspar wrote: Ross Walker wrote: I get pretty good NFS write speeds with NVRAM (40MB/s 4k sequential write). It's a Dell PERC 6/e with 512MB onboard. ... there, dedicated slog device with NVRAM speed. It would be even better to have a pair of SSDs behind the NVRAM, but it's hard to find compatible SSDs for these controllers, Dell currently doesn't even support SSDs in their RAID products :-( Isn't the PERC 6/e just a re-branded LSI? LSI added SSD support recently. Yes, but the LSI support of SSDs is on later controllers. Sure that's not just a firmware issue ? My PERC 6/E seems to support SSD's : # ./MegaCli -AdpAllInfo -a2 | grep -i ssd Enable Copyback to SSD on SMART Error : No Enable SSD Patrol Read : No Allow SSD SAS/SATA Mix in VD : No Allow HDD/SSD Mix in VD : No Controller info : Versions Product Name: PERC 6/E Adapter Serial No : FW Package Build: 6.0.3-0002 Mfg. Data Mfg. Date : 06/08/07 Rework Date : 06/08/07 Revision No : Battery FRU : N/A Image Versions in Flash: FW Version : 1.11.82-0473 BIOS Version : NT13-2 WebBIOS Version: 1.1-32-e_11-Rel Ctrl-R Version : 1.01-010B Boot Block Version : 1.00.00.01-0008 I currently have 2 x Intel X25-E (32 GB) as dedicated slogs and 1 x Intel X25-M (80 GB) for the L2ARC behind a PERC 6/i on my Dell R905 testbox. So far there have been no problems with them. -Ross ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss -- Med venlig hilsen / Best Regards Henrik Johansen hen...@scannet.dk Tlf. 75 53 35 00 ScanNet Group A/S ScanNet ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] surprisingly poor performance
# time tar xf zeroes.tar

real    8m7.176s
user    0m0.438s
sys     0m5.754s

While this was running, I was looking at the output of zpool iostat fastdata 10 to see how it was going and was surprised to see the seemingly low IOPS.

Have you tried running this locally on your OpenSolaris box - just to get an idea of what it could deliver in terms of speed? Which NFS version are you using?

jam...@scalzi:~$ zpool iostat fastdata 10

               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
fastdata    10.0G  2.02T      0    312    268  3.89M
fastdata    10.0G  2.02T      0    818      0  3.20M
fastdata    10.0G  2.02T      0    811      0  3.17M
fastdata    10.0G  2.02T      0    860      0  3.27M

Strangely, when I added a second SSD as a second slog, it made no difference to the write operations. I'm not sure where to go from here, these results are appalling (about 3x the time of the old system with 8x 10kRPM spindles) even with two Enterprise SSDs as separate log devices.

cheers,
James

-- Med venlig hilsen / Best Regards Henrik Johansen hen...@scannet.dk Tlf. 75 53 35 00 ScanNet Group A/S ScanNet

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Best controller card for 8 SATA drives ?
Erik Ableson wrote: The problem I had was with the single raid 0 volumes (miswrote RAID 1 on the original message) This is not a straight to disk connection and you'll have problems if you ever need to move disks around or move them to another controller. Would you mind explaining exactly what issues or problems you had ? I have moved disks around several controllers without problems. You must remember however to create the RAID 0 lun throught LSI's megaraid CLI tool and / or to clear any foreign config before the controller will expose the disk(s) to the OS. The only real problem that I can think of is that you cannot use the autoreplace functionality of recent ZFS versions with these controllers. I agree that the MD1000 with ZFS is a rocking, inexpensive setup (we have several!) but I'd recommend using a SAS card with a true JBOD mode for maximum flexibility and portability. If I remember correctly, I think we're using the Adaptec 3085. I've pulled 465MB/s write and 1GB/s read off the MD1000 filled with SATA drives. Cordialement, Erik Ableson +33.6.80.83.58.28 Envoyé depuis mon iPhone On 23 juin 2009, at 21:18, Henrik Johansen wrote: Kyle McDonald wrote: Erik Ableson wrote: Just a side note on the PERC labelled cards: they don't have a JBOD mode so you _have_ to use hardware RAID. This may or may not be an issue in your configuration but it does mean that moving disks between controllers is no longer possible. The only way to do a pseudo JBOD is to create broken RAID 1 volumes which is not ideal. It won't even let you make single drive RAID 0 LUNs? That's a shame. We currently have 90+ disks that are created as single drive RAID 0 LUNs on several PERC 6/E (LSI 1078E chipset) controllers and used by ZFS. I can assure you that they work without any problems and perform very well indeed. In fact, the combination of PERC 6/E and MD1000 disk arrays has worked so well for us that we are going to double the number of disks during this fall. The lack of portability is disappointing. The trade-off though is battery backed cache if the card supports it. -Kyle Cordialement, Erik Ableson +33.6.80.83.58.28 Envoyé depuis mon iPhone On 23 juin 2009, at 04:33, "Eric D. Mudama" > wrote: > On Mon, Jun 22 at 15:46, Miles Nordin wrote: >>>>>>> "edm" == Eric D Mudama writes: >> >> edm> We bought a Dell T610 as a fileserver, and it comes with an >> edm> LSI 1068E based board (PERC6/i SAS). >> >> which driver attaches to it? >> >> pciids.sourceforge.net says this is a 1078 board, not a 1068 board. >> >> please, be careful. There's too much confusion about these cards. > > Sorry, that may have been confusing. We have the cheapest storage > option on the T610, with no onboard cache. I guess it's called the > "Dell SAS6i/R" while they reserve the PERC name for the ones with > cache. I had understood that they were basically identical except for > the cache, but maybe not. > > Anyway, this adapter has worked great for us so far. > > > snippet of prtconf -D: > > > i86pc (driver name: rootnex) >pci, instance #0 (driver name: npe) >pci8086,3411, instance #6 (driver name: pcie_pci) >pci1028,1f10, instance #0 (driver name: mpt) >sd, instance #1 (driver name: sd) >sd, instance #6 (driver name: sd) >sd, instance #7 (driver name: sd) >sd, instance #2 (driver name: sd) >sd, instance #4 (driver name: sd) >sd, instance #5 (driver name: sd) > > > For this board the mpt driver is being used, and here's the prtconf > -pv info: > > > Node 0x1f >assigned-addresses: > 81020010..fc00..0100.83020014.. > df2ec000..4000.8302001c. 
> .df2f..0001 >reg: > 0002.....01020010....0100.03020014....4000.0302001c. > ...0001 >compatible: 'pciex1000,58.1028.1f10.8' + 'pciex1000,58.1028.1f10' > + 'pciex1000,58.8' + 'pciex1000,58' + 'pciexclass,01' + > 'pciexclass,0100' + 'pci1000,58.1028.1f10.8' + > 'pci1000,58.1028.1f10' + 'pci1028,1f10' + 'pci1000,58.8' + > 'pci1000,58' + 'pciclass, 01' + 'pciclass,0100' >model: 'SCSI bus controller' >power-consumption: 0001.0001 >devsel-speed: >interrupts: 0001 >subsystem-vendor-id: 1028 >subsystem-
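For reference, the single-drive RAID 0 dance with MegaCli looks roughly like this. The enclosure/slot/adapter numbers are made up for the example - check -PDList first, and double-check the option spelling against your MegaCli version, as this is from memory:

    # list physical disks so you know the enclosure:slot IDs
    ./MegaCli -PDList -aALL | egrep 'Enclosure Device|Slot Number'

    # clear any foreign config a moved disk may carry
    ./MegaCli -CfgForeign -Clear -aALL

    # expose enclosure 32, slot 4 as a single-drive RAID 0 LD on adapter 0
    ./MegaCli -CfgLdAdd -r0 [32:4] -a0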
Re: [zfs-discuss] Large zpool design considerations
Chris Cosby wrote: >I'm going down a bit of a different path with my reply here. I know that all >shops and their need for data are different, but hear me out. > >1) You're backing up 40TB+ of data, increasing at 20-25% per year. That's >insane. Perhaps it's time to look at your backup strategy no from a hardware >perspective, but from a data retention perspective. Do you really need that >much data backed up? There has to be some way to get the volume down. If >not, you're at 100TB in just slightly over 4 years (assuming the 25% growth >factor). If your data is critical, my recommendation is to go find another >job and let someone else have that headache. Well, we are talking about backup for ~900 servers that are in production. Our retention period is 14 days for stuff like web servers, and 3 weeks for SQL and such. We could deploy deduplication but it makes me a wee bit uncomfortable to blindly trust our backup software. >2) 40TB of backups is, at the best possible price, 50-1TB drives (for spares >and such) - $12,500 for raw drive hardware. Enclosures add some money, as do >cables and such. For mirroring, 90-1TB drives is $22,500 for the raw drives. >In my world, I know yours is different, but the difference in a $100,000 >solution and a $75,000 solution is pretty negligible. The short description >here: you can afford to do mirrors. Really, you can. Any of the parity >solutions out there, I don't care what your strategy, is going to cause you >more trouble than you're ready to deal with. Good point. I'll take that into consideration. >I know these aren't solutions for you, it's just the stuff that was in my >head. The best possible solution, if you really need this kind of volume, is >to create something that never has to resilver. Use some nifty combination >of hardware and ZFS, like a couple of somethings that has 20TB per container >exported as a single volume, mirror those with ZFS for its end-to-end >checksumming and ease of management. > >That's my considerably more than $0.02 > >On Thu, Jul 3, 2008 at 11:56 AM, Bob Friesenhahn < >[EMAIL PROTECTED]> wrote: > >> On Thu, 3 Jul 2008, Don Enrique wrote: >> > >> > This means that i potentially could loose 40TB+ of data if three >> > disks within the same RAIDZ-2 vdev should die before the resilvering >> > of at least one disk is complete. Since most disks will be filled i >> > do expect rather long resilvering times. >> >> Yes, this risk always exists. The probability of three disks >> independently dying during the resilver is exceedingly low. The chance >> that your facility will be hit by an airplane during resilver is >> likely higher. However, it is true that RAIDZ-2 does not offer the >> same ease of control over physical redundancy that mirroring does. >> If you were to use 10 independent chassis and split the RAIDZ-2 >> uniformly across the chassis then the probability of a similar >> calamity impacting the same drives is driven by rack or facility-wide >> factors (e.g. building burning down) rather than shelf factors. >> However, if you had 10 RAID arrays mounted in the same rack and the >> rack falls over on its side during resilver then hope is still lost. >> >> I am not seeing any options for you here. ZFS RAIDZ-2 is about as >> good as it gets and if you want everything in one huge pool, there >> will be more risk. Perhaps there is a virtual filesystem layer which >> can be used on top of ZFS which emulates a larger filesystem but >> refuses to split files across pools. 
>> >> In the future it would be useful for ZFS to provide the option to not >> load-share across huge VDEVs and use VDEV-level space allocators. >> >> Bob >> == >> Bob Friesenhahn >> [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ >> GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ >> >> ___ >> zfs-discuss mailing list >> zfs-discuss@opensolaris.org >> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss >> > > > >-- >chris -at- microcozm -dot- net >=== Si Hoc Legere Scis Nimium Eruditionis Habes -- Med venlig hilsen / Best Regards Henrik Johansen [EMAIL PROTECTED] ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Large zpool design considerations
[Richard Elling] wrote: > Don Enrique wrote: >> Hi, >> >> I am looking for some best practice advice on a project that i am working on. >> >> We are looking at migrating ~40TB backup data to ZFS, with an annual data >> growth of >> 20-25%. >> >> Now, my initial plan was to create one large pool comprised of X RAIDZ-2 >> vdevs ( 7 + 2 ) >> with one hotspare per 10 drives and just continue to expand that pool as >> needed. >> >> Between calculating the MTTDL and performance models i was hit by a rather >> scary thought. >> >> A pool comprised of X vdevs is no more resilient to data loss than the >> weakest vdev since loss >> of a vdev would render the entire pool unusable. >> > > Yes, but a raidz2 vdev using enterprise class disks is very reliable. That's nice to hear. >> This means that i potentially could loose 40TB+ of data if three disks >> within the same RAIDZ-2 >> vdev should die before the resilvering of at least one disk is complete. >> Since most disks >> will be filled i do expect rather long resilvering times. >> >> We are using 750 GB Seagate (Enterprise Grade) SATA disks for this project >> with as much hardware >> redundancy as we can get ( multiple controllers, dual cabeling, I/O >> multipathing, redundant PSUs, >> etc.) >> > > nit: SATA disks are single port, so you would need a SAS implementation > to get multipathing to the disks. This will not significantly impact the > overall availability of the data, however. I did an availability > analysis of > thumper to show this. > http://blogs.sun.com/relling/entry/zfs_raid_recommendations_space_vs Yeah, I read your blog. Very informative indeed. I am using SAS HBA cards and SAS enclosures with SATA disks so I should be fine. >> I could use multiple pools but that would make data management harder which >> in it self is a lengthy >> process in our shop. >> >> The MTTDL figures seem OK so how much should i need to worry ? Anyone having >> experience from >> this kind of setup ? >> > > I think your design is reasonable. We'd need to know the exact > hardware details to be able to make more specific recommendations. > -- richard Well, my choice of hardware is kind of limited by 2 things : 1. We are a 100% Dell shop. 2. We already have lots of enclosures that i would like to reuse for my project. The HBA cards are SAS 5/E (LSI SAS1068 chipset) cards, the enclosures are Dell MD1000 diskarrays. > -- Med venlig hilsen / Best Regards Henrik Johansen [EMAIL PROTECTED] ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
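For the archives, the first-order approximation most people use for a double-parity group (assuming independent failures and ignoring unrecoverable read errors, so treat it as a rough sketch rather than gospel) is:

    MTTDL ~= MTTF^3 / ( N * (N-1) * (N-2) * MTTR^2 )

where N is the number of disks in the raidz2 vdev (9 in the 7+2 layout above), MTTF is the per-disk mean time to failure and MTTR is the resilver time. The MTTR^2 term is why long resilver times on large, nearly full SATA disks hurt far more than the raw disk count does.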
Re: [zfs-discuss] zfs data corruption
> I'm just interested in understanding how zfs determined there was data > corruption when I have checksums disabled and there were no > non-retryable read errors reported in the messages file. If the metadata is corrupt, how is ZFS going to find the data blocks on disk? > > I don't believe it was a real disk read error because of the > > absence of evidence in /var/adm/messages. It's not safe to jump to this conclusion. Disk drivers that support FMA won't log error messages to /var/adm/messages. As more support for I/O FMA shows up, you won't see random spew in the messages file any more. -j ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS Performance Issue
> Is deleting the old files/directories in the ZFS file system > sufficient or do I need to destroy/recreate the pool and/or file > system itself? I've been doing the former. The former should be sufficient, it's not necessary to destroy the pool. -j ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS Performance Issue
> -Still playing with 'recsize' values but it doesn't seem to be doing > much...I don't think I have a good understand of what exactly is being > written...I think the whole file might be overwritten each time > because it's in binary format. The other thing to keep in mind is that the tunables like compression and recsize only affect newly written blocks. If you have a bunch of data that was already laid down on disk and then you change the tunable, this will only cause new blocks to have the new size. If you experiment with this, make sure all of your data has the same blocksize by copying it over to the new pool once you've changed the properties. > -Setting zfs_nocacheflush, though got me drastically increased > throughput--client requests took, on average, less than 2 seconds > each! > > So, in order to use this, I should have a storage array, w/battery > backup, instead of using the internal drives, correct? zfs_nocacheflush should only be used on arrays with a battery backed cache. If you use this option on a disk, and you lose power, there's no guarantee that your write successfully made it out of the cache. A performance problem when flushing the cache of an individual disk implies that there's something wrong with the disk or its firmware. You can disable the write cache of an individual disk using format(1M). When you do this, ZFS won't lose any data, whereas enabling zfs_nocacheflush can lead to problems. I'm attaching a DTrace script that will show the cache-flush times per-vdev. Remove the zfs_nocacheflush tuneable and re-run your test while using this DTrace script. If one particular disk takes longer than the rest to flush, this should show us. In that case, we can disable the write cache on that particular disk. Otherwise, we'll need to disable the write cache on all of the disks. The script is attached as zfs_flushtime.d Use format(1M) with the -e option to adjust the write_cache settings for SCSI disks. -j #!/usr/sbin/dtrace -Cs /* * CDDL HEADER START * * The contents of this file are subject to the terms of the * Common Development and Distribution License (the "License"). * You may not use this file except in compliance with the License. * * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE * or http://www.opensolaris.org/os/licensing. * See the License for the specific language governing permissions * and limitations under the License. * * When distributing Covered Code, include this CDDL HEADER in each * file and include the License file at usr/src/OPENSOLARIS.LICENSE. * If applicable, add the following below this CDDL HEADER, with the * fields enclosed by brackets "[]" replaced with your own identifying * information: Portions Copyright [] [name of copyright owner] * * CDDL HEADER END */ /* * Copyright 2008 Sun Microsystems, Inc. All rights reserved. * Use is subject to license terms. */ #define DKIOC (0x04 << 8) #define DKIOCFLUSHWRITECACHE(DKIOC|34) fbt:zfs:vdev_disk_io_start:entry /(args[0]->io_cmd == DKIOCFLUSHWRITECACHE) && (self->traced == 0)/ { self->traced = args[0]; self->start = timestamp; } fbt:zfs:vdev_disk_ioctl_done:entry /args[0] == self->traced/ { @a[stringof(self->traced->io_vd->vdev_path)] = quantize(timestamp - self->start); self->start = 0; self->traced = 0; } ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
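To illustrate the recsize point above with a concrete (made-up) dataset name - the property only applies to blocks written after the change, so existing files keep their old block size until they are rewritten:

    zfs set recordsize=16k tank/appdata
    zfs get recordsize tank/appdata

    # rewrite the file so it picks up the new record size
    cp /tank/appdata/data.bin /tank/appdata/data.bin.new
    mv /tank/appdata/data.bin.new /tank/appdata/data.bin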
Re: [zfs-discuss] mdb ::memstat including zfs buffer details?
> ZFS data buffers are attached to zvp; however, we still keep > metadata in the crashdump. At least right now, this means that > cached ZFS metadata has kvp as its vnode. > >Still, it's better than what you get currently. I absolutely agree. At one point, we discussed adding another vp for the metadata. IIRC, this was in the context of moving all of ZFS's allocations outside of the cage. There's no reason why you couldn't do the same to make counting of buffers more understandable, though. -j ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] mdb ::memstat including zfs buffer details?
>I don't think it should be too bad (for ::memstat), given that (at >least in Nevada), all of the ZFS caching data belongs to the "zvp" >vnode, instead of "kvp". ZFS data buffers are attached to zvp; however, we still keep metadata in the crashdump. At least right now, this means that cached ZFS metadata has kvp as its vnode. -j ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
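For anyone following along at home, the summary being discussed is the ::memstat dcmd, which can be run against the live kernel or a saved crash dump:

    # live system
    echo ::memstat | mdb -k

    # post-mortem, from the crash dump directory
    mdb unix.0 vmcore.0
    > ::memstat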
Re: [zfs-discuss] Fileserver performance tests
> statfile1          988ops/s   0.0mb/s    0.0ms/op   22us/op-cpu
> deletefile1        991ops/s   0.0mb/s    0.0ms/op   48us/op-cpu
> closefile2         997ops/s   0.0mb/s    0.0ms/op    4us/op-cpu
> readfile1          997ops/s 139.8mb/s    0.2ms/op  175us/op-cpu
> openfile2          997ops/s   0.0mb/s    0.0ms/op   28us/op-cpu
> closefile1        1081ops/s   0.0mb/s    0.0ms/op    6us/op-cpu
> appendfilerand1    982ops/s  14.9mb/s    0.1ms/op   91us/op-cpu
> openfile1          982ops/s   0.0mb/s    0.0ms/op   27us/op-cpu
>
> IO Summary: 8088 ops 8017.4 ops/s, (997/982 r/w) 155.6mb/s, 508us cpu/op, 0.2ms
>
> I expected to see some higher numbers really...
> a simple "time mkfile 16g lala" gave me something like 280Mb/s.

mkfile isn't an especially realistic test for performance. You'll note that the fileserver workload is performing stats, deletes, closes, reads, opens, and appends. Mkfile is a write benchmark.

You might consider trying the singlestreamwrite benchmark, if you're looking for a single-threaded write performance test.

-j

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
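Loading that workload follows the same pattern as the fileserver one (the directory is an example, and variable names may differ by filebench version):

    # filebench
    filebench> load singlestreamwrite
    filebench> set $dir=/fastdata/fbtest
    filebench> run 60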
Re: [zfs-discuss] Direct I/O ability with zfs?
> But note that, for ZFS, the win with direct I/O will be somewhat > less. That's because you still need to read the page to compute > its checksum. So for direct I/O with ZFS (with checksums enabled), > the cost is W:LPS, R:2*LPS. Is saving one page of writes enough to > make a difference? Possibly not. It's more complicated than that. The kernel would be verifying checksums on buffers in a user's address space. For this to work, we have to map these buffers into the kernel and simultaneously arrange for these pages to be protected from other threads in the user's address space. We discussed some of the VM gymnastics required to properly implement this back in January: http://mail.opensolaris.org/pipermail/zfs-discuss/2007-January/thread.html#36890 -j ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS/WAFL lawsuit
It's Columbia Pictures vs. Bunnell: http://www.eff.org/legal/cases/torrentspy/columbia_v_bunnell_magistrate_order.pdf The Register syndicated a Security Focus article that summarizes the potential impact of the court decision: http://www.theregister.co.uk/2007/08/08/litigation_data_retention/ -j On Thu, Sep 06, 2007 at 08:14:56PM +0200, [EMAIL PROTECTED] wrote: > > > >It really is a shot in the dark at this point, you really never know what > >will happen in court (take the example of the recent court decision that > >all data in RAM be held for discovery ?!WHAT, HEAD HURTS!?). But at the > >end of the day, if you waited for a sure bet on any technology or > >potential patent disputes you would not implement anything, ever. > > > Do you have a reference for "all data in RAM most be held". I guess we > need to build COW RAM as well. > > Casper > > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Extremely long creat64 latencies on higly utilized zpools
You might also consider taking a look at this thread: http://mail.opensolaris.org/pipermail/zfs-discuss/2007-July/041760.html Although I'm not certain, this sounds a lot like the other pool fragmentation issues. -j On Wed, Aug 15, 2007 at 01:11:40AM -0700, Yaniv Aknin wrote: > Hello friends, > > I've recently seen a strange phenomenon with ZFS on Solaris 10u3, and was > wondering if someone may have more information. > > The system uses several zpools, each a bit under 10T, each containing one zfs > with lots and lots of small files (way too many, about 100m files and 75m > directories). > > I have absolutely no control over the directory structure and believe me I > tried to change it. > > Filesystem usage patterns are create and read, never delete and never rewrite. > > When volumes approach 90% usage, and under medium/light load (zpool iostat > reports 50mb/s and 750iops reads), some creat64 system calls take over 50 > seconds to complete (observed with 'truss -D touch'). When doing manual > tests, I've seen similar times on unlink() calls (truss -D rm). > > I'd like to stress this happens on /some/ of the calls, maybe every 100th > manual call (I scripted the test), which (along with normal system > operations) would probably be every 10,000th or 100,000th call. > > Other system parameters (memory usage, loadavg, process number, etc) appear > nominal. The machine is an NFS server, though the crazy latencies were > observed both local and remote. > > What would you suggest to further diagnose this? Has anyone seen trouble with > high utilization and medium load? (with or without insanely high filecount?) > > Many thanks in advance, > - Yaniv > > > This message posted from opensolaris.org > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
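To quantify the outliers without waiting for truss to catch one, a DTrace one-liner along these lines (untested here, adjust as needed) will build a latency distribution for every creat64 on the box:

  # dtrace -n '
  syscall::creat64:entry { self->ts = timestamp; }
  syscall::creat64:return /self->ts/
  {
          @["creat64 latency (ns)"] = quantize(timestamp - self->ts);
          self->ts = 0;
  }'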
Re: [zfs-discuss] is send/receive incremental
You can do it either way. Eric Kustarz has a good explanation of how to set up incremental send/receive on your laptop. The description is on his blog: http://blogs.sun.com/erickustarz/date/20070612 The technique he uses is applicable to any ZFS filesystem. -j On Wed, Aug 08, 2007 at 04:44:16PM -0600, Peter Baumgartner wrote: > >I'd like to send a backup of my filesystem offsite nightly using zfs >send/receive. Are those done incrementally so only changes move or >would a full copy get shuttled across everytime? >-- >Pete > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
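In outline, you send one full stream and then ship only the deltas between successive snapshots. A minimal sketch (pool, dataset and host names are placeholders):

  # zfs snapshot tank/home@monday
  # zfs send tank/home@monday | ssh backuphost zfs receive backup/home

  ...the next night...

  # zfs snapshot tank/home@tuesday
  # zfs send -i tank/home@monday tank/home@tuesday | ssh backuphost zfs receive backup/home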
Re: [zfs-discuss] si3124 controller problem and fix (fwd)
In an attempt to speed up progress on some of the si3124 bugs that Roger reported, I've created a workspace with the fixes for:

6565894 sata drives are not identified by si3124 driver
6566207 si3124 driver loses interrupts.

I'm attaching a driver which contains these fixes as well as a diff of the changes I used to produce them. I don't have access to a si3124 chipset, unfortunately. Would somebody be able to review these changes and try the new driver on a si3124 card?

Thanks,

-j

On Tue, Jul 17, 2007 at 02:39:00AM -0700, Nigel Smith wrote:
> You can see the status of bug here:
>
> http://bugs.opensolaris.org/view_bug.do?bug_id=6566207
>
> Unfortunately, it's showing no progress since 20th June.
>
> This fix really could do to be in place for S10u4 and snv_70.
> Thanks
> Nigel Smith
>
> This message posted from opensolaris.org
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

si3124.tar.gz
Description: application/tar-gz

--- usr/src/uts/common/io/sata/adapters/si3124/si3124.c ---

Index: usr/src/uts/common/io/sata/adapters/si3124/si3124.c
--- /ws/onnv-clone/usr/src/uts/common/io/sata/adapters/si3124/si3124.c	Mon Nov 13 23:20:01 2006
+++ /export/johansen/si-fixes/usr/src/uts/common/io/sata/adapters/si3124/si3124.c	Tue Jul 17 14:37:17 2007
@@ -22,11 +22,11 @@
 /*
  * Copyright 2006 Sun Microsystems, Inc.  All rights reserved.
  * Use is subject to license terms.
  */

-#pragma ident	"@(#)si3124.c	1.4	06/11/14 SMI"
+#pragma ident	"@(#)si3124.c	1.5	07/07/17 SMI"

 /*
  * SiliconImage 3124/3132 sata controller driver
@@ -381,11 +381,11 @@
 extern struct mod_ops mod_driverops;

 static struct modldrv modldrv = {
	&mod_driverops,		/* driverops */
-	"si3124 driver v1.4",
+	"si3124 driver v1.5",
	&sictl_dev_ops,		/* driver ops */
 };

 static struct modlinkage modlinkage = {
	MODREV_1,
@@ -2808,10 +2808,13 @@
 	si_portp = si_ctlp->sictl_ports[port];
 	mutex_enter(&si_portp->siport_mutex);

 	/* Clear Port Reset. */
 	ddi_put32(si_ctlp->sictl_port_acc_handle,
+	    (uint32_t *)PORT_CONTROL_SET(si_ctlp, port),
+	    PORT_CONTROL_SET_BITS_PORT_RESET);
+	ddi_put32(si_ctlp->sictl_port_acc_handle,
 	    (uint32_t *)PORT_CONTROL_CLEAR(si_ctlp, port),
 	    PORT_CONTROL_CLEAR_BITS_PORT_RESET);

 	/*
 	 * Arm the interrupts for: Cmd completion, Cmd error,
@@ -3509,16 +3512,16 @@
 		    port);

 		if (port_intr_status & INTR_COMMAND_COMPLETE) {
 			(void) si_intr_command_complete(si_ctlp, si_portp,
 			    port);
-		}
-
+		} else {
 		/* Clear the interrupts */
 		ddi_put32(si_ctlp->sictl_port_acc_handle,
 		    (uint32_t *)(PORT_INTERRUPT_STATUS(si_ctlp, port)),
 		    port_intr_status & INTR_MASK);
+		}

 	/*
 	 * Note that we did not clear the interrupt for command
 	 * completion interrupt. Reading of slot_status takes care
 	 * of clearing the interrupt for command completion case.

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS performance and memory consumption
> But now I have another question. > How 8k blocks will impact on performance ? When tuning recordsize for things like databases, we try to recommend that the customer's recordsize match the I/O size of the database record. I don't think that's the case in your situation. ZFS is clever enough that changes to recordsize only affect new blocks written to the filesystem. If you're seeing metaslab fragmentation problems now, changing your recordsize to 8k is likely to increase your performance. This is because you're out of 128k metaslabs, so using a smaller size lets you make better use of the remaining space. This also means you won't have to iterate through all of the used 128k metaslabs looking for a free one. If you're asking, "How does setting the recordsize to 8k affect performance when I'm not encountering fragmentation," I would guess that there would be some reduction. However, you can adjust the recordsize once you encounter this problem with the default size. -j ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
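The property change itself is just the following (the dataset name is a placeholder, and as noted above it only affects blocks written after the change):

  # zfs set recordsize=8k tank/db
  # zfs get recordsize tank/db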
Re: [zfs-discuss] si3124 controller problem and fix (fwd)
> it's been assigned CR 6566207 by Linda Bernal. Basically, if you look > at si_intr and read the comments in the code, the bug is pretty > obvious. > > si3124 driver's interrupt routine is incorrectly coded. The ddi_put32 > that clears the interrupts should be enclosed in an "else" block, > thereby making it consistent with the comment just below. Otherwise, > you would be double clearing the interrupts, thus losing pending > interrupts. > > Since this is a simple fix, there's really no point dealing it as a > contributor. The bug report for 6566207 states that the submitter is an OpenSolaris contributor who wishes to work on the fix. If this is not the case, we should clarify this CR so it doesn't languish. It's still sitting in the dispatched state (hasn't been accepted by anyone). -j ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: [storage-discuss] NCQ performance
> When sequential I/O is done to the disk directly there is no performance > degradation at all. All filesystems impose some overhead compared to the rate of raw disk I/O. It's going to be hard to store data on a disk unless some kind of filesystem is used. All the tests that Eric and I have performed show regressions for multiple sequential I/O streams. If you have data that shows otherwise, please feel free to share. > [I]t does not take any additional time in ldi_strategy(), > bdev_strategy(), mv_rw_dma_start(). In some instance it actually > takes less time. The only thing that sometimes takes additional time > is waiting for the disk I/O. Let's be precise about what was actually observed. Eric and I saw increased service times for the I/O on devices with NCQ enabled when running multiple sequential I/O streams. Everything that we observed indicated that it actually took the disk longer to service requests when many sequential I/Os were queued. -j ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
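You can watch this on your own setup by checking per-device service times while the streams run; with multiple sequential readers going, something like the following and an eye on the actv and asvc_t columns is enough to see it:

  # iostat -xnz 5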
Re: [zfs-discuss] Lots of overhead with ZFS - what am I doing wrong?
Marko, Matt and I discussed this offline some more and he had a couple of ideas about double-checking your hardware. It looks like your controller (or disks, maybe?) is having trouble with multiple simultaneous I/Os to the same disk. It looks like prefetch aggravates this problem. When I asked Matt what we could do to verify that it's the number of concurrent I/Os that is causing performance to be poor, he had the following suggestions: set zfs_vdev_{min,max}_pending=1 and run with prefetch on, then iostat should show 1 outstanding io and perf should be good. or turn prefetch off, and have multiple threads reading concurrently, then iostat should show multiple outstanding ios and perf should be bad. Let me know if you have any additional questions. -j On Wed, May 16, 2007 at 11:38:24AM -0700, [EMAIL PROTECTED] wrote: > At Matt's request, I did some further experiments and have found that > this appears to be particular to your hardware. This is not a general > 32-bit problem. I re-ran this experiment on a 1-disk pool using a 32 > and 64-bit kernel. I got identical results: > > 64-bit > == > > $ /usr/bin/time dd if=/testpool1/filebench/testfile of=/dev/null bs=128k > count=1 > 1+0 records in > 1+0 records out > > real 20.1 > user0.0 > sys 1.2 > > 62 Mb/s > > # /usr/bin/time dd if=/dev/dsk/c1t3d0 of=/dev/null bs=128k count=1 > 1+0 records in > 1+0 records out > > real 19.0 > user0.0 > sys 2.6 > > 65 Mb/s > > 32-bit > == > > /usr/bin/time dd if=/testpool1/filebench/testfile of=/dev/null bs=128k > count=1 > 1+0 records in > 1+0 records out > > real 20.1 > user0.0 > sys 1.7 > > 62 Mb/s > > # /usr/bin/time dd if=/dev/dsk/c1t3d0 of=/dev/null bs=128k count=1 > 1+0 records in > 1+0 records out > > real 19.1 > user0.0 > sys 4.3 > > 65 Mb/s > > -j > > On Wed, May 16, 2007 at 09:32:35AM -0700, Matthew Ahrens wrote: > > Marko Milisavljevic wrote: > > >now lets try: > > >set zfs:zfs_prefetch_disable=1 > > > > > >bingo! > > > > > > r/sw/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device > > > 609.00.0 77910.00.0 0.0 0.80.01.4 0 83 c0d0 > > > > > >only 1-2 % slower then dd from /dev/dsk. Do you think this is general > > >32-bit problem, or specific to this combination of hardware? > > > > I suspect that it's fairly generic, but more analysis will be necessary. > > > > >Finally, should I file a bug somewhere regarding prefetch, or is this > > >a known issue? > > > > It may be related to 6469558, but yes please do file another bug report. > > I'll have someone on the ZFS team take a look at it. > > > > --matt > > ___ > > zfs-discuss mailing list > > zfs-discuss@opensolaris.org > > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
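For the record, those tunables are normally set in /etc/system; the names are internal and may change between builds, so treat this as a sketch:

  * one outstanding I/O per vdev, prefetch left on
  set zfs:zfs_vdev_min_pending = 1
  set zfs:zfs_vdev_max_pending = 1

  * or, for the second experiment, leave the pending limits alone
  * set zfs:zfs_prefetch_disable = 1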
Re: [zfs-discuss] Lots of overhead with ZFS - what am I doing wrong?
At Matt's request, I did some further experiments and have found that this appears to be particular to your hardware. This is not a general 32-bit problem. I re-ran this experiment on a 1-disk pool using a 32 and 64-bit kernel. I got identical results: 64-bit == $ /usr/bin/time dd if=/testpool1/filebench/testfile of=/dev/null bs=128k count=1 1+0 records in 1+0 records out real 20.1 user0.0 sys 1.2 62 Mb/s # /usr/bin/time dd if=/dev/dsk/c1t3d0 of=/dev/null bs=128k count=1 1+0 records in 1+0 records out real 19.0 user0.0 sys 2.6 65 Mb/s 32-bit == /usr/bin/time dd if=/testpool1/filebench/testfile of=/dev/null bs=128k count=1 1+0 records in 1+0 records out real 20.1 user0.0 sys 1.7 62 Mb/s # /usr/bin/time dd if=/dev/dsk/c1t3d0 of=/dev/null bs=128k count=1 1+0 records in 1+0 records out real 19.1 user0.0 sys 4.3 65 Mb/s -j On Wed, May 16, 2007 at 09:32:35AM -0700, Matthew Ahrens wrote: > Marko Milisavljevic wrote: > >now lets try: > >set zfs:zfs_prefetch_disable=1 > > > >bingo! > > > > r/sw/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device > > 609.00.0 77910.00.0 0.0 0.80.01.4 0 83 c0d0 > > > >only 1-2 % slower then dd from /dev/dsk. Do you think this is general > >32-bit problem, or specific to this combination of hardware? > > I suspect that it's fairly generic, but more analysis will be necessary. > > >Finally, should I file a bug somewhere regarding prefetch, or is this > >a known issue? > > It may be related to 6469558, but yes please do file another bug report. > I'll have someone on the ZFS team take a look at it. > > --matt > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Lots of overhead with ZFS - what am I doing wrong?
> >*sata_hba_list::list sata_hba_inst_t satahba_next | ::print > >sata_hba_inst_t satahba_dev_port | ::array void* 32 | ::print void* | > >::grep ".!=0" | ::print sata_cport_info_t cport_devp.cport_sata_drive | > >::print -a sata_drive_info_t satadrv_features_support satadrv_settings > >satadrv_features_enabled > This gives me "mdb: failed to dereference symbol: unknown symbol > name". You may not have the SATA module installed. If you type: ::modinfo ! grep sata and don't get any output, your sata driver is attached some other way. My apologies for the confusion. -K ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Lots of overhead with ZFS - what am I doing wrong?
> Each drive is freshly formatted with one 2G file copied to it. How are you creating each of these files? Also, would you please include a the output from the isalist(1) command? > These are snapshots of iostat -xnczpm 3 captured somewhere in the > middle of the operation. Have you double-checked that this isn't a measurement problem by measuring zfs with zpool iostat (see zpool(1M)) and verifying that outputs from both iostats match? > single drive, zfs file >r/sw/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device > 258.30.0 33066.60.0 33.0 2.0 127.77.7 100 100 c0d1 > > Now that is odd. Why so much waiting? Also, unlike with raw or UFS, kr/s / > r/s gives 256K, as I would imagine it should. Not sure. If we can figure out why ZFS is slower than raw disk access in your case, it may explain why you're seeing these results. > What if we read a UFS file from the PATA disk and ZFS from SATA: >r/sw/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device > 792.80.0 44092.90.0 0.0 1.80.02.2 1 98 c1d0 > 224.00.0 28675.20.0 33.0 2.0 147.38.9 100 100 c0d0 > > Now that is confusing! Why did SATA/ZFS slow down too? I've retried this a > number of times, not a fluke. This could be cache interference. ZFS and UFS use different caches. How much memory is in this box? > I have no idea what to make of all this, except that it ZFS has a problem > with this hardware/drivers that UFS and other traditional file systems, > don't. Is it a bug in the driver that ZFS is inadvertently exposing? A > specific feature that ZFS assumes the hardware to have, but it doesn't? Who > knows! This may be a more complicated interaction than just ZFS and your hardware. There are a number of layers of drivers underneath ZFS that may also be interacting with your hardware in an unfavorable way. If you'd like to do a little poking with MDB, we can see the features that your SATA disks claim they support. As root, type mdb -k, and then at the ">" prompt that appears, enter the following command (this is one very long line): *sata_hba_list::list sata_hba_inst_t satahba_next | ::print sata_hba_inst_t satahba_dev_port | ::array void* 32 | ::print void* | ::grep ".!=0" | ::print sata_cport_info_t cport_devp.cport_sata_drive | ::print -a sata_drive_info_t satadrv_features_support satadrv_settings satadrv_features_enabled This should show satadrv_features_support, satadrv_settings, and satadrv_features_enabled for each SATA disk on the system. The values for these variables are defined in: http://cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/sys/sata/impl/sata.h this is the relevant snippet for interpreting these values: /* * Device feature_support (satadrv_features_support) */ #define SATA_DEV_F_DMA 0x01 #define SATA_DEV_F_LBA280x02 #define SATA_DEV_F_LBA480x04 #define SATA_DEV_F_NCQ 0x08 #define SATA_DEV_F_SATA10x10 #define SATA_DEV_F_SATA20x20 #define SATA_DEV_F_TCQ 0x40/* Non NCQ tagged queuing */ /* * Device features enabled (satadrv_features_enabled) */ #define SATA_DEV_F_E_TAGGED_QING0x01/* Tagged queuing enabled */ #define SATA_DEV_F_E_UNTAGGED_QING 0x02/* Untagged queuing enabled */ /* * Drive settings flags (satdrv_settings) */ #define SATA_DEV_READ_AHEAD 0x0001 /* Read Ahead enabled */ #define SATA_DEV_WRITE_CACHE0x0002 /* Write cache ON */ #define SATA_DEV_SERIAL_FEATURES0x8000 /* Serial ATA feat. enabled */ #define SATA_DEV_ASYNCH_NOTIFY 0x2000 /* Asynch-event enabled */ This may give us more information if this is indeed a problem with hardware/drivers supporting the right features. 
-j ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
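To make that concrete, suppose a disk reported satadrv_features_support = 0x2e (a made-up sample value); using the defines above, it decodes as:

  0x2e = 0x02 | 0x04 | 0x08 | 0x20
       = SATA_DEV_F_LBA28 | SATA_DEV_F_LBA48 | SATA_DEV_F_NCQ | SATA_DEV_F_SATA1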
Re: [zfs-discuss] Re: Lots of overhead with ZFS - what am I doing wrong?
Marko, I tried this experiment again using 1 disk and got nearly identical times: # /usr/bin/time dd if=/dev/dsk/c0t0d0 of=/dev/null bs=128k count=1 1+0 records in 1+0 records out real 21.4 user0.0 sys 2.4 $ /usr/bin/time dd if=/test/filebench/testfile of=/dev/null bs=128k count=1 1+0 records in 1+0 records out real 21.0 user0.0 sys 0.7 > [I]t is not possible for dd to meaningfully access multiple-disk > configurations without going through the file system. I find it > curious that there is such a large slowdown by going through file > system (with single drive configuration), especially compared to UFS > or ext3. Comparing a filesystem to raw dd access isn't a completely fair comparison either. Few filesystems actually lay out all of their data and metadata so that every read is a completely sequential read. > I simply have a small SOHO server and I am trying to evaluate which OS to > use to keep a redundant disk array. With unreliable consumer-level hardware, > ZFS and the checksum feature are very interesting and the primary selling > point compared to a Linux setup, for as long as ZFS can generate enough > bandwidth from the drive array to saturate single gigabit ethernet. I would take Bart's recommendation and go with Solaris on something like a dual-core box with 4 disks. > My hardware at the moment is the "wrong" choice for Solaris/ZFS - PCI 3114 > SATA controller on a 32-bit AthlonXP, according to many posts I found. Bill Moore lists some controller recommendations here: http://mail.opensolaris.org/pipermail/zfs-discuss/2006-March/016874.html > However, since dd over raw disk is capable of extracting 75+MB/s from this > setup, I keep feeling that surely I must be able to get at least that much > from reading a pair of striped or mirrored ZFS drives. But I can't - single > drive or 2-drive stripes or mirrors, I only get around 34MB/s going through > ZFS. (I made sure mirror was rebuilt and I resilvered the stripes.) Maybe this is a problem with your controller? What happens when you have two simultaneous dd's to different disks running? This would simulate the case where you're reading from the two disks at the same time. -j ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
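For the two-stream test, something like the following (device names and counts are placeholders, substitute your own second disk) run while watching iostat would simulate reading from both disks at once:

  # dd if=/dev/dsk/c0d0 of=/dev/null bs=128k count=10000 &
  # dd if=/dev/dsk/c0d1 of=/dev/null bs=128k count=10000 &
  # iostat -xnz 5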
Re: [zfs-discuss] Re: Lots of overhead with ZFS - what am I doing wrong?
This certainly isn't the case on my machine. $ /usr/bin/time dd if=/test/filebench/largefile2 of=/dev/null bs=128k count=1 1+0 records in 1+0 records out real1.3 user0.0 sys 1.2 # /usr/bin/time dd if=/dev/dsk/c0t0d0 of=/dev/null bs=128k count=1 1+0 records in 1+0 records out real 22.3 user0.0 sys 2.2 This looks like 56 MB/s on the /dev/dsk and 961 MB/s on the pool. My pool is configured into a 46 disk RAID-0 stripe. I'm going to omit the zpool status output for the sake of brevity. > What I am seeing is that ZFS performance for sequential access is > about 45% of raw disk access, while UFS (as well as ext3 on Linux) is > around 70%. For workload consisting mostly of reading large files > sequentially, it would seem then that ZFS is the wrong tool > performance-wise. But, it could be just my setup, so I would > appreciate more data points. This isn't what we've observed in much of our performance testing. It may be a problem with your config, although I'm not an expert on storage configurations. Would you mind providing more details about your controller, disks, and machine setup? -j ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Re: gzip compression throttles system?
A couple more questions here. [mpstat] > CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl > 00 0 3109 3616 316 1965 17 48 45 2450 85 0 15 > 10 0 3127 3797 592 2174 17 63 46 1760 84 0 15 > CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl > 00 0 3051 3529 277 2012 14 25 48 2160 83 0 17 > 10 0 3065 3739 606 1952 14 37 47 1530 82 0 17 > CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl > 00 0 3011 3538 316 2423 26 16 52 2020 81 0 19 > 10 0 3019 3698 578 2694 25 23 56 3090 83 0 17 > > # lockstat -kIW -D 20 sleep 30 > > Profiling interrupt: 6080 events in 31.341 seconds (194 events/sec) > > Count indv cuml rcnt nsec Hottest CPU+PILCaller > --- > 2068 34% 34% 0.00 1767 cpu[0] deflate_slow > 1506 25% 59% 0.00 1721 cpu[1] longest_match > 1017 17% 76% 0.00 1833 cpu[1] mach_cpu_idle > 454 7% 83% 0.00 1539 cpu[0] fill_window > 215 4% 87% 0.00 1788 cpu[1] pqdownheap What do you have zfs compresison set to? The gzip level is tunable, according to zfs set, anyway: PROPERTY EDIT INHERIT VALUES compression YES YES on | off | lzjb | gzip | gzip-[1-9] You still have idle time in this lockstat (and mpstat). What do you get for a lockstat -A -D 20 sleep 30? Do you see anyone with long lock hold times, long sleeps, or excessive spinning? The largest numbers from mpstat are for interrupts and cross calls. What does intrstat(1M) show? Have you run dtrace to determine the most frequent cross-callers? #!/usr/sbin/dtrace -s sysinfo:::xcalls { @a[stack(30)] = count(); } END { trunc(@a, 30); } is an easy way to do this. -j ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
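On the compression side, if it does turn out to be raw gzip cost, checking the current setting and trying one of the cheaper levels is a quick experiment (the dataset name is a placeholder):

  # zfs get compression tank/fs
  # zfs set compression=gzip-1 tank/fs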
Re: [zfs-discuss] Re: Help me understand ZFS caching
Tony: > Now to another question related to Anton's post. You mention that > directIO does not exist in ZFS at this point. Are their plan's to > support DirectIO; any functionality that will simulate directIO or > some other non-caching ability suitable for critical systems such as > databases if the client still wanted to deploy on filesystems. I would describe DirectIO as the ability to map the application's buffers directly for disk DMAs. You need to disable the filesystem's cache to do this correctly. Having the cache disabled is an implementation requirement for this feature. Based upon this definition, are you seeking the ability to disable the filesystem's cache or the ability to directly map application buffers for DMA? -j ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Bottlenecks in building a system
Adam: > Hi, hope you don't mind if I make some portions of your email public in > a reply--I hadn't seen it come through on the list at all, so it's no > duplicate to me. I don't mind at all. I had hoped to avoid sending the list a duplicate e-mail, although it looks like my first post never made it here. > > I suspect that if you have a bottleneck in your system, it would be due > > to the available bandwidth on the PCI bus. > > Mm. yeah, it's what I was worried about, too (mostly through ignorance > of the issues), which is why I was hoping HyperTransport and PCIe were > going to give that data enough room on the bus. > But after others expressed the opinion that the Areca PCIe cards were > overkill, I'm now looking to putting some PCI-X cards on a different > (probably slower) motherboard. I dug up a copy of the S2895 block diagram and asked Bill Moore about it. He said that you should be able to get about 700mb/s off of each of the PCI-X channels and that you only need 100mb/s to saturate a GigE link. He also observed that the RAID card you were using was unnecessary and would probably hamper performance. He recommended non-RAID SATA cards based upon the Marvell chipset. Here's the e-mail trail on this list where he discusses Marvell SATA cards in a bit more detail: http://mail.opensolaris.org/pipermail/zfs-discuss/2006-March/016874.html It sounds like if getting disk -> network is the concern, you'll have plenty of bandwidth, assuming you have a reasonable controller card. > > Caching isn't going to be a huge help for writes, unless there's another > > thread reading simultaneously from the same file. > > > > Prefetch will definitely use the additional RAM to try to boost the > > performance of sequential reads. However, in the interest of full > > disclosure, there is a pathology that we've seen where the number of > > sequential readers exceeds the available space in the cache. In this > > situation, sometimes the competing prefetches for the different streams > > will cause more temporally favorable data to be evicted from the cache > > and performance will drop. The workaround right now is just to disable > > prefetch. We're looking into more comprehensive solutions. > > Interesting. So noted. I will expect to have to test thoroughly. If you run across this problem and are willing to let me debug on your system, shoot me an e-mail. We've only seen this in a couple of situations and it was combined with another problem where we were seeing excessive overhead for kcopyout. It's unlikely, but possible that you'll hit this. -K ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Bottlenecks in building a system
Adam: > Does anyone have a clue as to where the bottlenecks are going to be with > this: > > 16x hot swap SATAII hard drives (plus an internal boot drive) > Tyan S2895 (K8WE) motherboard > Dual GigE (integral nVidia ports) > 2x Areca 8-port PCIe (8-lane) RAID drivers > 2x AMD Opteron 275 CPUs (2.2GHz, dual core) > 8 GiB RAM > The supplier is used to shipping Linux servers in this 3U chassis, but > hasn't dealt with Solaris. He originally suggested 2GiB RAM, but I hear > things about ZFS getting RAM hungry after a while. ZFS is opportunistic when it comes to using free memory for caching. I'm not sure what exactly you've heard. > I guess my questions are: > - Does anyone out there have a clue where the potential bottlenecks > might be? What's your workload? Bart is subscribed to this list, but he has a famous saying, "One experiment is worth a thousand expert opinions." Without knowing what you're trying to do with this box, it's going to be hard to offer any useful advice. However, you'll learn the most by getting one of these boxes and running your workload. If you have problems, Solaris has a lot of tools that we can use to diagnose the problem. Then we can improve the performance and everybody wins. > - If I focused on simple streaming IO, would giving the server less RAM > have an impact on performance? The more RAM you can give your box, the more of it ZFS will use for caching. If your workload doesn't benefit from caching, then the impact on performance won't be large. Could you be more specific about what the filesystem's consumers are doing when they're performing "simple streaming IO?" > - I had assumed four cores would be better than the two faster (3.0GHz) > single-core processors the vendor originally suggested. Agree? I suspect that this is correct. ZFS does many steps in its I/O path asynchronously and they execute in the context of different threads. Four cores are probably better than two. Of course experimentation could prove me wrong here, too. :) -j ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: C'mon ARC, stay small...
> I've been seeing this failure to cap on a number of (Solaris 10 update > 2 and 3) machines since the script came out (arc hogging is a huge > problem for me, esp on Oracle). This is probably a red herring, but my > v490 testbed seemed to actually cap on 3 separate tests, but my t2000 > testbed doesn't even pretend to cap - kernel memory (as identified in > Orca) sails right to the top, leaves me maybe 2GB free on a 32GB > machine and shoves Oracle data into swap. What method are you using to cap this memory? Jim and I just discussed the required steps for doing this by hand using MDB. > This isn't as amusing as one Stage and one Production Oracle machine > which have 128GB and 96GB respectively. Sending in 92GB core dumps to > support is an impressive gesture taking 2-3 days to complete. This is solved by CR 4894692, which is in snv_56 and s10u4. -j ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
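For the archives, the by-hand MDB procedure looks roughly like this; the addresses come from your own box, and the sizes below (a 1GB cap) are only an example:

  # mdb -kw
  > arc::print -a c c_max p
  (note the address printed next to each member)
  > <addr of c_max>/Z 0x40000000
  > <addr of c>/Z 0x40000000
  > <addr of p>/Z 0x20000000

The /Z format writes an 8-byte value and only works because mdb was started with -w.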
Re: [zfs-discuss] C'mon ARC, stay small...
I suppose I should have been more forward about making my last point. If the arc_c_max isn't set in /etc/system, I don't believe that the ARC will initialize arc.p to the correct value. I could be wrong about this; however, next time you set c_max, set c to the same value as c_max and set p to half of c. Let me know if this addresses the problem or not. -j > >How/when did you configure arc_c_max? > Immediately following a reboot, I set arc.c_max using mdb, > then verified reading the arc structure again. > >arc.p is supposed to be > >initialized to half of arc.c. Also, I assume that there's a reliable > >test case for reproducing this problem? > > > Yep. I'm using a x4500 in-house to sort out performance of a customer test > case that uses mmap. We acquired the new DIMMs to bring the > x4500 to 32GB, since the workload has a 64GB working set size, > and we were clobbering a 16GB thumper. We wanted to see how doubling > memory may help. > > I'm trying clamp the ARC size because for mmap-intensive workloads, > it seems to hurt more than help (although, based on experiments up to this > point, it's not hurting a lot). > > I'll do another reboot, and run it all down for you serially... > > /jim > > >Thanks, > > > >-j > > > >On Thu, Mar 15, 2007 at 06:57:12PM -0400, Jim Mauro wrote: > > > >> > >>>ARC_mru::print -d size lsize > >>> > >>size = 0t10224433152 > >>lsize = 0t10218960896 > >> > >>>ARC_mfu::print -d size lsize > >>> > >>size = 0t303450112 > >>lsize = 0t289998848 > >> > >>>ARC_anon::print -d size > >>> > >>size = 0 > >> > >>So it looks like the MRU is running at 10GB... > >> > >>What does this tell us? > >> > >>Thanks, > >>/jim > >> > >> > >> > >>[EMAIL PROTECTED] wrote: > >> > >>>This seems a bit strange. What's the workload, and also, what's the > >>>output for: > >>> > >>> > >>> > ARC_mru::print size lsize > ARC_mfu::print size lsize > > > >>>and > >>> > >>> > ARC_anon::print size > > > >>>For obvious reasons, the ARC can't evict buffers that are in use. > >>>Buffers that are available to be evicted should be on the mru or mfu > >>>list, so this output should be instructive. > >>> > >>>-j > >>> > >>>On Thu, Mar 15, 2007 at 02:08:37PM -0400, Jim Mauro wrote: > >>> > >>> > FYI - After a few more runs, ARC size hit 10GB, which is now 10X c_max: > > > > > >arc::print -tad > > > > > { > . . . > c02e29e8 uint64_t size = 0t10527883264 > c02e29f0 uint64_t p = 0t16381819904 > c02e29f8 uint64_t c = 0t1070318720 > c02e2a00 uint64_t c_min = 0t1070318720 > c02e2a08 uint64_t c_max = 0t1070318720 > . . . > > Perhaps c_max does not do what I think it does? > > Thanks, > /jim > > > Jim Mauro wrote: > > > >Running an mmap-intensive workload on ZFS on a X4500, Solaris 10 11/06 > >(update 3). All file IO is mmap(file), read memory segment, unmap, > >close. > > > >Tweaked the arc size down via mdb to 1GB. I used that value because > >c_min was also 1GB, and I was not sure if c_max could be larger than > >c_minAnyway, I set c_max to 1GB. > > > >After a workload run: > > > > > >>arc::print -tad > >> > >> > >{ > >. . . > >c02e29e8 uint64_t size = 0t3099832832 > >c02e29f0 uint64_t p = 0t16540761088 > >c02e29f8 uint64_t c = 0t1070318720 > >c02e2a00 uint64_t c_min = 0t1070318720 > >c02e2a08 uint64_t c_max = 0t1070318720 > >. . . > > > >"size" is at 3GB, with c_max at 1GB. > > > >What gives? I'm looking at the code now, but was under the impression > >c_max would limit ARC growth. 
Granted, it's not a factor of 10, and > >it's certainly much better than the out-of-the-box growth to 24GB > >(this is a 32GB x4500), so clearly ARC growth is being limited, but it > >still grew to 3X c_max. > > > >Thanks, > >/jim > >___ > >zfs-discuss mailing list > >zfs-discuss@opensolaris.org > >http://mail.opensolaris.org/mailman/listinfo/zfs-discuss > > > > > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss > > > >>___ > >>zfs-discuss mailing list > >>zfs-discuss@opensolaris.org > >>http://mail.opensolaris.org/mailman/listinfo/zfs-discuss > >> > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___
Re: [zfs-discuss] C'mon ARC, stay small...
Something else to consider, depending upon how you set arc_c_max, you may just want to set arc_c and arc_p at the same time. If you try setting arc_c_max, and then setting arc_c to arc_c_max, and then set arc_p to arc_c / 2, do you still get this problem? -j On Thu, Mar 15, 2007 at 05:18:12PM -0700, [EMAIL PROTECTED] wrote: > Gar. This isn't what I was hoping to see. Buffers that aren't > available for eviction aren't listed in the lsize count. It looks like > the MRU has grown to 10Gb and most of this could be successfully > evicted. > > The calculation for determining if we evict from the MRU is in > arc_adjust() and looks something like: > > top_sz = ARC_anon.size + ARC_mru.size > > Then if top_sz > arc.p and ARC_mru.lsize > 0 we evict the smaller of > ARC_mru.lsize and top_size - arc.p > > In your previous message it looks like arc.p is > (ARC_mru.size + > ARC_anon.size). It might make sense to double-check these numbers > together, so when you check the size and lsize again, also check arc.p. > > How/when did you configure arc_c_max? arc.p is supposed to be > initialized to half of arc.c. Also, I assume that there's a reliable > test case for reproducing this problem? > > Thanks, > > -j > > On Thu, Mar 15, 2007 at 06:57:12PM -0400, Jim Mauro wrote: > > > > > > > ARC_mru::print -d size lsize > > size = 0t10224433152 > > lsize = 0t10218960896 > > > ARC_mfu::print -d size lsize > > size = 0t303450112 > > lsize = 0t289998848 > > > ARC_anon::print -d size > > size = 0 > > > > > > > So it looks like the MRU is running at 10GB... > > > > What does this tell us? > > > > Thanks, > > /jim > > > > > > > > [EMAIL PROTECTED] wrote: > > >This seems a bit strange. What's the workload, and also, what's the > > >output for: > > > > > > > > >>ARC_mru::print size lsize > > >>ARC_mfu::print size lsize > > >> > > >and > > > > > >>ARC_anon::print size > > >> > > > > > >For obvious reasons, the ARC can't evict buffers that are in use. > > >Buffers that are available to be evicted should be on the mru or mfu > > >list, so this output should be instructive. > > > > > >-j > > > > > >On Thu, Mar 15, 2007 at 02:08:37PM -0400, Jim Mauro wrote: > > > > > >>FYI - After a few more runs, ARC size hit 10GB, which is now 10X c_max: > > >> > > >> > > >> > > >>>arc::print -tad > > >>> > > >>{ > > >>. . . > > >> c02e29e8 uint64_t size = 0t10527883264 > > >> c02e29f0 uint64_t p = 0t16381819904 > > >> c02e29f8 uint64_t c = 0t1070318720 > > >> c02e2a00 uint64_t c_min = 0t1070318720 > > >> c02e2a08 uint64_t c_max = 0t1070318720 > > >>. . . > > >> > > >>Perhaps c_max does not do what I think it does? > > >> > > >>Thanks, > > >>/jim > > >> > > >> > > >>Jim Mauro wrote: > > >> > > >>>Running an mmap-intensive workload on ZFS on a X4500, Solaris 10 11/06 > > >>>(update 3). All file IO is mmap(file), read memory segment, unmap, close. > > >>> > > >>>Tweaked the arc size down via mdb to 1GB. I used that value because > > >>>c_min was also 1GB, and I was not sure if c_max could be larger than > > >>>c_minAnyway, I set c_max to 1GB. > > >>> > > >>>After a workload run: > > >>> > > arc::print -tad > > > > >>>{ > > >>>. . . > > >>> c02e29e8 uint64_t size = 0t3099832832 > > >>> c02e29f0 uint64_t p = 0t16540761088 > > >>> c02e29f8 uint64_t c = 0t1070318720 > > >>> c02e2a00 uint64_t c_min = 0t1070318720 > > >>> c02e2a08 uint64_t c_max = 0t1070318720 > > >>>. . . > > >>> > > >>>"size" is at 3GB, with c_max at 1GB. > > >>> > > >>>What gives? I'm looking at the code now, but was under the impression > > >>>c_max would limit ARC growth. 
Granted, it's not a factor of 10, and > > >>>it's certainly much better than the out-of-the-box growth to 24GB > > >>>(this is a 32GB x4500), so clearly ARC growth is being limited, but it > > >>>still grew to 3X c_max. > > >>> > > >>>Thanks, > > >>>/jim > > >>>___ > > >>>zfs-discuss mailing list > > >>>zfs-discuss@opensolaris.org > > >>>http://mail.opensolaris.org/mailman/listinfo/zfs-discuss > > >>> > > >>___ > > >>zfs-discuss mailing list > > >>zfs-discuss@opensolaris.org > > >>http://mail.opensolaris.org/mailman/listinfo/zfs-discuss > > >> > > ___ > > zfs-discuss mailing list > > zfs-discuss@opensolaris.org > > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] C'mon ARC, stay small...
Gar. This isn't what I was hoping to see. Buffers that aren't available for eviction aren't listed in the lsize count. It looks like the MRU has grown to 10Gb and most of this could be successfully evicted. The calculation for determining if we evict from the MRU is in arc_adjust() and looks something like: top_sz = ARC_anon.size + ARC_mru.size Then if top_sz > arc.p and ARC_mru.lsize > 0 we evict the smaller of ARC_mru.lsize and top_size - arc.p In your previous message it looks like arc.p is > (ARC_mru.size + ARC_anon.size). It might make sense to double-check these numbers together, so when you check the size and lsize again, also check arc.p. How/when did you configure arc_c_max? arc.p is supposed to be initialized to half of arc.c. Also, I assume that there's a reliable test case for reproducing this problem? Thanks, -j On Thu, Mar 15, 2007 at 06:57:12PM -0400, Jim Mauro wrote: > > > > ARC_mru::print -d size lsize > size = 0t10224433152 > lsize = 0t10218960896 > > ARC_mfu::print -d size lsize > size = 0t303450112 > lsize = 0t289998848 > > ARC_anon::print -d size > size = 0 > > > > So it looks like the MRU is running at 10GB... > > What does this tell us? > > Thanks, > /jim > > > > [EMAIL PROTECTED] wrote: > >This seems a bit strange. What's the workload, and also, what's the > >output for: > > > > > >>ARC_mru::print size lsize > >>ARC_mfu::print size lsize > >> > >and > > > >>ARC_anon::print size > >> > > > >For obvious reasons, the ARC can't evict buffers that are in use. > >Buffers that are available to be evicted should be on the mru or mfu > >list, so this output should be instructive. > > > >-j > > > >On Thu, Mar 15, 2007 at 02:08:37PM -0400, Jim Mauro wrote: > > > >>FYI - After a few more runs, ARC size hit 10GB, which is now 10X c_max: > >> > >> > >> > >>>arc::print -tad > >>> > >>{ > >>. . . > >> c02e29e8 uint64_t size = 0t10527883264 > >> c02e29f0 uint64_t p = 0t16381819904 > >> c02e29f8 uint64_t c = 0t1070318720 > >> c02e2a00 uint64_t c_min = 0t1070318720 > >> c02e2a08 uint64_t c_max = 0t1070318720 > >>. . . > >> > >>Perhaps c_max does not do what I think it does? > >> > >>Thanks, > >>/jim > >> > >> > >>Jim Mauro wrote: > >> > >>>Running an mmap-intensive workload on ZFS on a X4500, Solaris 10 11/06 > >>>(update 3). All file IO is mmap(file), read memory segment, unmap, close. > >>> > >>>Tweaked the arc size down via mdb to 1GB. I used that value because > >>>c_min was also 1GB, and I was not sure if c_max could be larger than > >>>c_minAnyway, I set c_max to 1GB. > >>> > >>>After a workload run: > >>> > arc::print -tad > > >>>{ > >>>. . . > >>> c02e29e8 uint64_t size = 0t3099832832 > >>> c02e29f0 uint64_t p = 0t16540761088 > >>> c02e29f8 uint64_t c = 0t1070318720 > >>> c02e2a00 uint64_t c_min = 0t1070318720 > >>> c02e2a08 uint64_t c_max = 0t1070318720 > >>>. . . > >>> > >>>"size" is at 3GB, with c_max at 1GB. > >>> > >>>What gives? I'm looking at the code now, but was under the impression > >>>c_max would limit ARC growth. Granted, it's not a factor of 10, and > >>>it's certainly much better than the out-of-the-box growth to 24GB > >>>(this is a 32GB x4500), so clearly ARC growth is being limited, but it > >>>still grew to 3X c_max. 
> >>> > >>>Thanks, > >>>/jim > >>>___ > >>>zfs-discuss mailing list > >>>zfs-discuss@opensolaris.org > >>>http://mail.opensolaris.org/mailman/listinfo/zfs-discuss > >>> > >>___ > >>zfs-discuss mailing list > >>zfs-discuss@opensolaris.org > >>http://mail.opensolaris.org/mailman/listinfo/zfs-discuss > >> > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] C'mon ARC, stay small...
This seems a bit strange. What's the workload, and also, what's the output for: > ARC_mru::print size lsize > ARC_mfu::print size lsize and > ARC_anon::print size For obvious reasons, the ARC can't evict buffers that are in use. Buffers that are available to be evicted should be on the mru or mfu list, so this output should be instructive. -j On Thu, Mar 15, 2007 at 02:08:37PM -0400, Jim Mauro wrote: > > FYI - After a few more runs, ARC size hit 10GB, which is now 10X c_max: > > > > arc::print -tad > { > . . . >c02e29e8 uint64_t size = 0t10527883264 >c02e29f0 uint64_t p = 0t16381819904 >c02e29f8 uint64_t c = 0t1070318720 >c02e2a00 uint64_t c_min = 0t1070318720 >c02e2a08 uint64_t c_max = 0t1070318720 > . . . > > Perhaps c_max does not do what I think it does? > > Thanks, > /jim > > > Jim Mauro wrote: > >Running an mmap-intensive workload on ZFS on a X4500, Solaris 10 11/06 > >(update 3). All file IO is mmap(file), read memory segment, unmap, close. > > > >Tweaked the arc size down via mdb to 1GB. I used that value because > >c_min was also 1GB, and I was not sure if c_max could be larger than > >c_minAnyway, I set c_max to 1GB. > > > >After a workload run: > >> arc::print -tad > >{ > >. . . > > c02e29e8 uint64_t size = 0t3099832832 > > c02e29f0 uint64_t p = 0t16540761088 > > c02e29f8 uint64_t c = 0t1070318720 > > c02e2a00 uint64_t c_min = 0t1070318720 > > c02e2a08 uint64_t c_max = 0t1070318720 > >. . . > > > >"size" is at 3GB, with c_max at 1GB. > > > >What gives? I'm looking at the code now, but was under the impression > >c_max would limit ARC growth. Granted, it's not a factor of 10, and > >it's certainly much better than the out-of-the-box growth to 24GB > >(this is a 32GB x4500), so clearly ARC growth is being limited, but it > >still grew to 3X c_max. > > > >Thanks, > >/jim > >___ > >zfs-discuss mailing list > >zfs-discuss@opensolaris.org > >http://mail.opensolaris.org/mailman/listinfo/zfs-discuss > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] understanding zfs/thunoer "bottlenecks"?
> it seems there isn't an algorithm in ZFS that detects sequential write > in traditional fs such as ufs, one would trigger directio. There is no directio for ZFS. Are you encountering a situation in which you believe directio support would improve performance? If so, please explain. -j ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS multi-threading
> Would the logic behind ZFS take full advantage of a heavily multicored > system, such as on the Sun Niagara platform? Would it utilize of the > 32 concurrent threads for generating its checksums? Has anyone > compared ZFS on a Sun Tx000, to that of a 2-4 thread x64 machine? Pete and I are working on resolving ZFS scalability issues with Niagara and StarCat right now. I'm not sure if any official numbers about ZFS performance on Niagara have been published. As far as concurrent threads generating checksums goes, the system doesn't work quite the way you have postulated. The checksum is generated in the ZIO_STAGE_CHECKSUM_GENERATE pipeline stage for writes, and verified in the ZIO_STAGE_CHECKSUM_VERIFY pipeline stage for reads. Whichever thread happens to advance the pipeline to the checksum generate stage is the thread that will actually perform the work. ZFS does not break the work of the checksum into chunks and have multiple CPUs perform the computation. However, it is possible to have concurrent writes simultaneously in the checksum_generate stage. More details about this can be found in zfs/zio.c and zfs/sys/zio_impl.h -j ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
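One way to see how that work spreads across a Niagara's hardware threads is simply to count checksum-generate entries per CPU for a while; a rough, untested sketch (assuming the function name hasn't changed in your build):

  # dtrace -n 'fbt:zfs:zio_checksum_generate:entry { @[cpu] = count(); }'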
Re: [zfs-discuss] Re: ZFS direct IO
> And this feature is independant on whether or not the data is > DMA'ed straight into the user buffer. I suppose so, however, it seems like it would make more sense to configure a dataset property that specifically describes the caching policy that is desired. When directio implies different semantics for different filesystems, customers are going to get confused. > The other feature, is to avoid a bcopy by DMAing full > filesystem block reads straight into user buffer (and verify > checksum after). The I/O is high latency, bcopy adds a small > amount. The kernel memory can be freed/reuse straight after > the user read completes. This is where I ask, how much CPU > is lost to the bcopy in workloads that benefit from DIO ? Right, except that if we try to DMA into user buffers with ZFS there's a bunch of other things we need the VM to do on our behalf to protect the integrity of the kernel data that's living in user pages. Assume you have a high-latency I/O and you've locked some user pages for this I/O. In a pathological case, when another thread tries to access the locked pages and then also blocks, it does so for the duration of the first thread's I/O. At that point, it seems like it might be easier to accept the cost of the bcopy instead of blocking another thread. I'm not even sure how to assess the impact of VM operations required to change the permissions on the pages before we start the I/O. > The quickest return on investement I see for the directio > hint would be to tell ZFS to not grow the ARC when servicing > such requests. Perhaps if we had an option that specifies not to cache data from a particular dataset, that would suffice. I think you've filed a CR along those lines already (6429855)? -j ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: ZFS direct IO
> Note also that for most applications, the size of their IO operations > would often not match the current page size of the buffer, causing > additional performance and scalability issues. Thanks for mentioning this, I forgot about it. Since ZFS's default block size is configured to be larger than a page, the application would have to issue page-aligned block-sized I/Os. Anyone adjusting the block size would presumably be responsible for ensuring that the new size is a multiple of the page size. (If they would want Direct I/O to work...) I believe UFS also has a similar requirement, but I've been wrong before. -j ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: ZFS direct IO
> Basically speaking - there needs to be some sort of strategy for > bypassing the ARC or even parts of the ARC for applications that > may need to advise the filesystem of either: > 1) the delicate nature of imposing additional buffering for their > data flow > 2) already well optimized applications that need more adaptive > cache in the application instead of the underlying filesystem or > volume manager This advice can't be sensibly delivered to ZFS via a Direct I/O mechanism. Anton's characterization of Direct I/O as, "an optimization which allows data to be transferred directly between user data buffers and disk, without a memory-to-memory copy," is concise and accurate. Trying to intuit advice from this is unlikely to be useful. It would be better to develop a separate mechanism for delivering advice about the application to the filesystem. (fadvise, perhaps?) A DIO implementation for ZFS is more complicated than UFS and adversely impacts well optimized applications. I looked into this late last year when we had a customer who was suffering from too much bcopy overhead. Billm found another workaround instead of bypassing the ARC. The challenge for implementing DIO for ZFS is in dealing with access to the pages mapped by the user application. Since ZFS has to checksum all of its data, the user's pages that are involved in the direct I/O cannot be written to by another thread during the I/O. If this policy isn't enforced, it is possible for the data written to or read from disk to be different from their checksums. In order to protect the user pages while a DIO is in progress, we want support from the VM that isn't presently implemented. To prevent a page from being accessed by another thread, we have to unmap the TLB/PTE entries and lock the page. There's a cost associated with this, as it may be necessary to cross-call other CPUs. Any thread that accesses the locked pages will block. While it's possible lock pages in the VM today, there isn't a neat set of interfaces the filesystem can use to maintain the integrity of the user's buffers. Without an experimental prototype to verify the design, it's impossible to say whether overhead of manipulating the page permissions is more than the cost of bypassing the cache. What do you see as potential use cases for ZFS Direct I/O? I'm having a hard time imagining a situation in which this would be useful to a customer. The application would probably have to be single-threaded, and if not, it would have to be pretty careful about how its threads access buffers involved in I/O. -j ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Limit ZFS Memory Utilization
Robert: > Better yet would be if memory consumed by ZFS for caching (dnodes, > vnodes, data, ...) would behave similar to page cache like with UFS so > applications will be able to get back almost all memory used for ZFS > caches if needed. I believe that a better response to memory pressure is a long-term goal for ZFS. There's also an effort in progress to improve the caching algorithms used in the ARC. -j ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Question: ZFS + Block level SHA256 ~= almost free CAS Squishing?
> > Note that you'd actually have to verify that the blocks were the same; > > you cannot count on the hash function. If you didn't do this, anyone > > discovering a collision could destroy the colliding blocks/files. > > Given that nobody knows how to find sha256 collisions, you'd of course > need to test this code with a weaker hash algorithm. > > (It would almost be worth it to have the code panic in the event that a > real sha256 collision was found) The novel discovery of a sha256 collision will be lost on any administrator whose system panics. Imagine how much this will annoy the first customer who accidentally discovers a reproducible test-case. Perhaps generating an FMA error report would be more appropriate? -j ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS and savecore
This is CR: 4894692 caching data in heap inflates crash dump I have a fix which I am testing now. It still needs review from Matt/Mark before it's eligible for putback, though. -j On Fri, Nov 10, 2006 at 02:40:40PM -0800, Thomas Maier-Komor wrote: > Hi, > > I'm not sure if this is the right forum, but I guess this topic will > be bounced into the right direction from here. > > With ZFS using as much physical memory as it can get, dumps and > livedumps via 'savecore -L' are huge in size. I just tested it on my > workstation and got a 1.8G vmcore file, when dumping only kernel > pages. > > Might it be possible to add an extension that would make it possible, > to support dumping without the whole ZFS cache? I guess this would > make kernel live dumps smaller again, as they used to be... > > Any comments? > > Cheers, > Tom > > > This message posted from opensolaris.org > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: slow reads question...
Harley:
> Old 36GB drives:
>
> | # time mkfile -v 1g zeros-1g
> | zeros-1g 1073741824 bytes
> |
> | real    2m31.991s
> | user    0m0.007s
> | sys     0m0.923s
>
> Newer 300GB drives:
>
> | # time mkfile -v 1g zeros-1g
> | zeros-1g 1073741824 bytes
> |
> | real    0m8.425s
> | user    0m0.010s
> | sys     0m1.809s

This is a pretty dramatic difference.  What type of drives were your
old 36GB drives?

> I am wondering if there is something other than capacity
> and seek time which has changed between the drives.  Would a
> different scsi command set or features have this dramatic a
> difference?

I'm hardly the authority on hardware, but there are a couple of
possibilities.  Your newer drives may have a write cache.  It's also
quite likely that the newer drives have a faster rotational speed and
better seek times.

If you subtract the usr + sys time from the real time in these
measurements, I suspect the result is the amount of time you were
actually waiting for the I/O to finish.  In the first case you spent
~99% of your total time waiting for stuff to happen, whereas in the
second case it was only ~78% of your overall time.

-j
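If you want to pin down what actually changed between the two
generations of drives, a couple of stock tools will tell you most of
it (the device name below is made up):

    # iostat -En c1t0d0       (vendor, product and firmware rev per disk)
    # iostat -xn 5            (run alongside mkfile; watch asvc_t and %b)

On SCSI disks, the cache menu in format(1M) expert mode (format -e)
will also show whether the write cache is enabled.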
Re: [zfs-discuss] Re: slow reads question...
Harley:
> I had tried other sizes with much the same results, but
> hadn't gone as large as 128K.  With bs=128K, it gets worse:
>
> | # time dd if=zeros-10g of=/dev/null bs=128k count=102400
> | 81920+0 records in
> | 81920+0 records out
> |
> | real    2m19.023s
> | user    0m0.105s
> | sys     0m8.514s

I may have done my math wrong, but if we assume that the real time is
the actual amount of time we spent performing the I/O (which may be
incorrect), haven't you done better here?

In this case you pushed 81920 128k records in ~139 seconds -- approx
75437 KB/sec.

Using ZFS with an 8k bs, you pushed 102400 8k records in ~68 seconds --
approx 12047 KB/sec.

Using the raw device, you pushed 102400 8k records in ~23 seconds --
approx 35617 KB/sec.

I may have missed something here, but isn't this newest number the
highest performance so far?  What does iostat(1M) say about your disk
read performance?

> Is there any other info I can provide which would help?

Are you just trying to measure ZFS's read performance here?  It might
be interesting to change your outfile (of=) argument and see if we're
actually running into some other performance problem.  If you change
it to of=/tmp/zeros, does performance improve or degrade?  Likewise,
if you write the file out to another disk (UFS, ZFS, whatever), does
this improve performance?

-j
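A quick way to sanity-check those numbers and to see what the disks
are doing while dd runs (the arithmetic just reproduces the figures
above):

    # echo '81920 * 128 / 139' | bc      (KB moved / elapsed sec, ~75437 KB/s)
    # dd if=zeros-10g of=/dev/null bs=128k &
    # iostat -xn 5                       (kr/s per device should roughly agree)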
[zfs-discuss] Re: slow reads question...
ZFS uses a 128k record size by default.  If you change dd to use
bs=128k, do you observe any performance improvement?

> | # time dd if=zeros-10g of=/dev/null bs=8k count=102400
> | 102400+0 records in
> | 102400+0 records out
>
> | real    1m8.763s
> | user    0m0.104s
> | sys     0m1.759s

It's also worth noting that this dd used less system and user time
than the read from the raw device, yet took longer in "real" time.
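If it helps, the record size in use is visible as a dataset property;
a quick check (the pool/dataset name here is made up):

    # zfs get recordsize tank/fs         (defaults to 128K)
    # time dd if=zeros-10g of=/dev/null bs=128k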
[zfs-discuss] Re: Memory Usage
> 1) You should be able to limit your cache max size by
>    setting arc.c_max.  It's currently initialized to be
>    phys-mem-size - 1GB.

Mark's assertion that this is not a best practice is something of an
understatement.  ZFS was designed so that users/administrators
wouldn't have to configure tunables to achieve optimal system
performance.  ZFS performance is still a work in progress.

The problem with adjusting arc.c_max is that its definition may change
from one release to another.  It's an internal kernel variable; its
existence isn't guaranteed.  There are also no guarantees about what a
future arc.c_max might mean.  It's possible that future
implementations may change the definition such that reducing c_max has
other unintended consequences.

Unfortunately, at the present time this is probably the only way to
limit the cache size.  Mark and I are working on strategies to make
sure that ZFS is a better citizen when it comes to memory usage and
performance.  Mark has recently made a number of changes which should
help ZFS reduce its memory footprint.  However, until these changes
and others make it into a production build, we're going to have to
live with this inadvisable approach for adjusting the cache size.

-j
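For anyone who decides to live with the inadvisable approach anyway,
the usual poke is done with mdb(1) on the running kernel.  This is only
a sketch: the symbol, its layout, and the consequences can change
between builds (which is the whole point of the warning above), the
change does not survive a reboot, and the 0x20000000 (512MB) cap is an
arbitrary example:

    # mdb -kw
    > arc::print -a c_max            (note the address printed for c_max)
    > <address>/Z 0x20000000         (write the new cap at that address)
    > $q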