Re: [zfs-discuss] ZFS very slow under xVM
Martin,

This is a shot in the dark, but this seems to be an IO scheduling issue. Since I am late to this thread: what are the characteristics of the IO — mostly reads, appending writes, read-modify-write, sequential or random, a single large file or multiple files? And if we are talking about writes, have you tracked whether any IO is aged much beyond 30 seconds?

If we were talking about Xen by itself, I am sure there is some type of scheduler involvement that COULD slow down your IO, due to fairness or some specified weight against other processes/threads/tasks. Can you boost the scheduling of the IO task — by making it realtime, giving it a niceness, etc. — in an experimental environment and compare stats? Whether this is the bottleneck of your problem would take a closer examination of the various metrics of the system.

Mitchell Erblich
-

Martin wrote:
>
> > The behaviour of ZFS might vary between invocations, but I don't think that
> > is related to xVM. Can you get the results to vary when just booting under
> > "bare metal"?
>
> It pretty consistently displays good IO (approx 60 MB/s - 80 MB/s) for about
> 10-20 seconds, then always drops to approx 2.5 MB/s for virtually all of the
> rest of the output. It always displays this when running under xVM/Xen with
> Dom0, and never on bare metal when xVM/Xen isn't booted.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
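The "give it a niceness" experiment suggested above can be tried from a few lines of Python. This is a hypothetical sketch only: it adjusts the niceness of the current process (de-prioritizing is always permitted; raising priority usually needs privileges), which is the simplest stand-in for re-scheduling an IO-heavy task.

```python
import os

# Inspect and (if permitted) change the niceness of the current process.
# Lowering priority (a higher nice value) is normally always allowed;
# a negative nice (boosting) would typically raise PermissionError.
before = os.getpriority(os.PRIO_PROCESS, 0)
try:
    os.setpriority(os.PRIO_PROCESS, 0, before + 1)  # de-prioritize by one step
except PermissionError:
    pass
after = os.getpriority(os.PRIO_PROCESS, 0)
print("nice: %d -> %d" % (before, after))
```

On Solaris the equivalent experiment would more likely use priocntl(1) to move the process into the RT class, but the idea — change the scheduling parameters, re-run the workload, compare stats — is the same.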
Re: [zfs-discuss] [docs-discuss] Introduction to Operating Systems
http://www.sun.com/software/solaris/ds/zfs.jsp

Solaris ZFS: The Most Advanced File System on the Planet

Anyone who has ever lost important files, run out of space on a partition, spent weekends adding new storage to servers, tried to grow or shrink a file system, or experienced data corruption knows that there is room for improvement in file systems and volume managers. The Solaris Zettabyte File System (ZFS) is designed from the ground up to meet the emerging needs of a general-purpose file system that spans the desktop to the data center.

Mitchell Erblich
Ex-Sun Eng
--

Alan Coopersmith wrote:
>
> Lisa Shepherd wrote:
> > "Zettabyte File System" is the formal, expanded name of the file system and
> > "ZFS" is its abbreviation. In most Sun manuals, the name is expanded at
> > first use and the abbreviation used the rest of the time. Though I was
> > surprised to find that the Solaris ZFS System Administration Guide, which I
> > would consider the main source of ZFS information, doesn't seem to have
> > "Zettabyte" anywhere in it. Anyway, both names are official and correct,
> > but since "Zettabyte" is such a mouthful, "ZFS" is what gets used most of
> > the time.
>
> How current is that? I thought that while "Zettabyte File System"
> was the original name, use of it was dropped a couple of years ago and
> ZFS became the only name. I don't see "Zettabyte" appearing anywhere
> in the ZFS community pages.
>
> --
> -Alan Coopersmith- [EMAIL PROTECTED]
> Sun Microsystems, Inc. - X Window System Engineering
Re: [zfs-discuss] Re: Re: Mac OS X "Leopard" to use ZFS
Toby Thain, et al,

I am guessing here, but presumably it is just to be able to access the FS data locally, without the headaches of verifying FS consistency, write caches, etc.

Mitchell Erblich

Toby Thain wrote:
>
> On 13-Jun-07, at 1:14 PM, Rick Mann wrote:
>
> >> From (http://www.informationweek.com/news/showArticle.jhtml;?
> >> articleID=199903525)
> >
> > ... Croll explained, "ZFS is not the default file system for
> > Leopard. We are exploring it as a file system option for high-end
> > storage systems with really large storage. As a result, we have
> > included ZFS -- a read-only copy of ZFS -- in Leopard."
>
> I don't get it. What possible use is "read only" ZFS?
>
> So that people can see if their FC array can be mounted on Leopard
> beta? I must be missing the point here.
>
> --Toby
Re: [zfs-discuss] Re: ZFS Apple WWDC Keynote Absence
Group,

Isn't Apple's strength really the non-compute-intensive personal computer / small business environment? I.e., plug and play.

Thus, even though ZFS is able to work as the default FS, should it be the default FS for the small-system environment, where your average user wants it to just work and cares less about administration issues? In that first case, ZFS, IMO, almost needs to catch configuration mistakes itself, whereas the larger business environment can have one or more dedicated admins who are more adept at config and tuning issues.

Mitchell Erblich
-

Rich Teer wrote:
>
> On Tue, 12 Jun 2007, Robert Smicinski wrote:
>
> > Apple's strength is the desktop, Sun's is the datacenter.
>
> Agreed, to a large extent.
>
> > There's no need to have ZFS on the desktop, just as there's no need
> > to have HFS+ in the datacenter.
>
> I strongly disagree with the first clause of that sentence. There's
> no reason why one wouldn't want to have mirrored file systems on a
> workstation, or make use of snapshots and clones. All three of those
> features are supplied rather handily by ZFS.
>
> ZFS isn't just about easily creating massive pools of data, although
> admittedly that is the first feature most people mention.
>
> > There is a need to improve ZFS in the datacenter, however, and I wish
> > Sun had invested their time in getting dynamic LUN expansion going
> > instead of working on a port to OS/X.
>
> I have no insider knowledge, but I don't think Sun invested much time
> in this. I believe Apple's engineers did most of the work.
>
> --
> Rich Teer, SCSA, SCNA, SCSECA, OGB member
>
> CEO,
> My Online Home Inventory
>
> Voice: +1 (250) 979-1638
> URLs: http://www.rite-group.com/rich
> http://www.myonlinehomeinventory.com
Re: [zfs-discuss] Optimal strategy (add or replace disks) to build acheap and raidz?
Group,

MOST people want a system to just work, without doing ANYTHING, when they turn it on. So yes, the thought of buying another drive and installing it in a brand-new system would seem insane to this group of buyers.

Mitchell Erblich
--

Richard Elling wrote:
>
> Harold Ancell wrote:
> > I checked Dell.com, and their "we want you to buy this higher end home
> > machine" offer has a 320GB stock drive, but highlights a 500GB just
> > below it with a bolded "Dell recommended for photos, music, and
> > games!", for an extra 120 US$, about 10% of the machine's price.
> >
> > I'll bet a lot of people take them up on that.
>
> Interestingly, I was online recently comparing Dell, Frys.com, and Apple
> Store prices (for a research project). For a sampling of products exactly
> the same, Dell generally had the worst prices, Frys the best, and Apple
> more often matched Frys than Dell. Specifically, for a 500 GByte disk,
> Dell was asking $189 versus $129 at Frys. I couldn't directly compare
> Apple's price because they don't sell raw disks, they sell "modules" which
> cost 2x the price of an external drive listed right next to the modules --
> go figure.
>
> As the old saying goes, it pays to shop around.
> -- richard
Re: [zfs-discuss] gzip compression throttles system?
Darren Moffat,

Yes and no. An earlier statement within this discussion was whether gzip is appropriate for .wav files. This just gets a relative time to compress, and the relative sizes of the files after compression.

My assumption is that gzip will run as a user app in one environment. The normal read/write syscalls then take a user buffer, so it is hard to believe that the .wav file won't be read one user buffer at a time. Yes, it could be mmap'ed, but then it would have to be unmapped — too many syscalls, I think, for the app. Sorry, I haven't looked at it for a while.

Overall, I am just trying to guess at the read-ahead delay versus the user buffer versus the internal FS. The internal FS should take it basically one FS block at a time (or do multiple blocks in parallel), while the user app takes it anywhere from one buffer to one page size, 8k, at a time. So, due to reading one buffer at a time in a loop, with a context switch from kernel to user each time, I would expect the gzip app to be slower.

So, my first step is to keep it simple (KISS) and ask the group "what happens if" we do this simple comparison? How many bytes/sec are compressed? Are they approximately the same speed? Do you end up with the same size file?

Mitchell Erblich
--

Darren J Moffat wrote:
>
> Erblichs wrote:
> > So, my first order would be to take 1GB or 10GB .wav files
> > AND time both the kernel implementation of gzip and the
> > user application. Approx the same times MAY indicate
> > that the kernel implementation gzip funcs should
> > be treated maybe more as interactive scheduling
> > threads and that it is too high and blocks other
> > threads or processes from executing.
>
> If you just run gzip(1) against the files you are operating on the whole
> file, so you only incur startup costs once and are thus doing quite a
> different compression to operating on a block level. A fairer
> comparison would be to build a userland program that compresses and then
> writes to disk in ZFS blocksize chunks; that way you are compressing the
> same sizes of data and doing the startup every time just like zio has to do.
>
> --
> Darren J Moffat
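Darren's suggested comparison can be sketched in a few lines. This is a hypothetical illustration only: zlib stands in for ZFS's in-kernel gzip code, 128 KiB stands in for the ZFS recordsize, and the repetitive sample buffer is made up — real .wav data would behave differently.

```python
import time
import zlib

# 4 MiB of a repeating 256-byte pattern, purely as illustrative input.
data = bytes(range(256)) * (4 * 1024 * 1024 // 256)

def compress_whole(buf, level=6):
    # One compressor over the whole stream, like running gzip(1) once.
    return zlib.compress(buf, level)

def compress_chunks(buf, level=6, chunk=128 * 1024):
    # A fresh compressor per chunk, paying setup cost and losing history
    # at every block boundary, roughly like zio compressing each record.
    return [zlib.compress(buf[off:off + chunk], level)
            for off in range(0, len(buf), chunk)]

t0 = time.perf_counter(); whole = compress_whole(data); t1 = time.perf_counter()
chunks = compress_chunks(data); t2 = time.perf_counter()
print("whole-file : %7d bytes in %.4fs" % (len(whole), t1 - t0))
print("per-chunk  : %7d bytes in %.4fs" % (sum(map(len, chunks)), t2 - t1))
```

The per-chunk total is typically somewhat larger, since each block restarts with an empty dictionary and its own header — which is exactly why timing gzip(1) over a whole file is not a fair proxy for block-level compression.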
Re: [zfs-discuss] gzip compression throttles system?
Ian Collins,

My two free cents...

If the gzip were in application space, most gzip implementations support (maybe with a recompile) a less extensive/expensive deflate setting that would consume fewer CPU cycles.

Secondly, if the file objects are being written locally, the writes to disk are being done asynchronously and shouldn't really delay other processes and slow down the system.

So, my first order would be to take 1GB or 10GB .wav files AND time both the kernel implementation of gzip and the user application. Approximately the same times MAY indicate that the kernel gzip funcs are being treated more like interactive scheduling threads — that their priority is too high and they block other threads or processes from executing.

Mitchell Erblich
Sr Software Engineer

Ian Collins wrote:
>
> I just had a quick play with gzip compression on a filesystem and the
> result was the machine grinding to a halt while copying some large
> (.wav) files to it from another filesystem in the same pool.
>
> The system became very unresponsive, taking several seconds to echo
> keystrokes. The box is a maxed out AMD QuadFX, so it should have plenty
> of grunt for this.
>
> Comments?
>
> Ian
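The "cheaper deflate" trade-off above can be illustrated with zlib's compression levels. A minimal sketch, assuming zlib levels 1 and 9 as stand-ins for a "light" versus "expensive" gzip setting; the pseudo-random buffer is a crude stand-in for already-dense audio data like .wav content.

```python
import random
import time
import zlib

# 1 MiB of pseudo-random bytes: nearly incompressible, like dense audio.
random.seed(0)
data = bytes(random.getrandbits(8) for _ in range(1 << 20))

for level in (1, 9):
    t0 = time.perf_counter()
    out = zlib.compress(data, level)
    dt = time.perf_counter() - t0
    print("level %d: %d -> %d bytes in %.4fs" % (level, len(data), len(out), dt))
```

On data like this, the expensive level burns far more CPU for little or no size gain — which is the heart of the question of whether gzip-9 is appropriate for .wav files at all.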
Re: [zfs-discuss] Very Large Filesystems
Joerg,

Do you really think that ANY FS actually needs to support more FS objects? If that were an issue, why not create more FSs? A multi-TB FS SHOULD support 100MB- to GB-size FS objects, which IMO is the more common use. I have seen this a lot in video environments. The largest that I have personally seen is in excess of 64TB.

I would assume that normal FS ops that search or display an extremely large number of FS objects are going to be difficult to use. Just try placing 10k+ FS objects/files within a directory and then listing that directory.

As for backup/restore type ops, I would assume that a smaller granularity of specified paths/directories would be more common, due to user error and not wanting to disturb other directories.

Mitchell Erblich
-

Joerg Schilling wrote:
>
> Yaniv Aknin <[EMAIL PROTECTED]> wrote:
>
> > Following my previous post across several mailing lists regarding
> > multi-tera volumes with small files on them, I'd be glad if people could
> > share real life numbers on large filesystems and their experience with
> > them. I'm slowly coming to a realization that regardless of theoretical
> > filesystem capabilities (1TB, 32TB, 256TB or more), more or less across the
> > enterprise filesystem arena people are recommending to keep practical
> > filesystems up to 1TB in size, for manageability and recoverability.
>
> UFS is limited to 2**31 inodes and this also limits the filesystem size.
> On Berlios we have a mixture of small and large files and the average file
> size is 100 kB. This would still give you a limit of 200 TB which is more
> than UFS allows you.
>
> I would guess that the recommendations are rather oriented on the backup —
> on backup speed and on the size of the backup media.
>
> Jörg
>
> --
> EMail:[EMAIL PROTECTED] (home) Jörg Schilling D-13353 Berlin
>[EMAIL PROTECTED](uni)
>[EMAIL PROTECTED] (work) Blog: http://schily.blogspot.com/
> URL: http://cdrecord.berlios.de/old/private/ ftp://ftp.berlios.de/pub/schily
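The "10k+ files in one directory" experiment above is easy to try. A throwaway sketch — the file count and naming scheme are arbitrary, and the timing is only meaningful relative to the same test on a different filesystem:

```python
import os
import tempfile
import time

# Create many empty files in one directory, then time a full listing.
with tempfile.TemporaryDirectory() as d:
    for i in range(10000):
        open(os.path.join(d, "f%05d" % i), "w").close()
    t0 = time.perf_counter()
    names = os.listdir(d)
    dt = time.perf_counter() - t0
print("listed %d entries in %.4fs" % (len(names), dt))
```

The raw listing is usually fast; the pain shows up in tools that stat or sort every entry (ls -l, backup scans), which is the usability point being made.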
Re: [zfs-discuss] cow performance penalty
Ming,

Let's take a pro example with a minimal performance tradeoff. All FSs that modify a disk block, IMO, do a full disk-block read before anything else. If doing an extending write and moving to a larger block size with COW, you give yourself the ability to write a single block, versus having to fill the original block and also needing to write the next block. The "performance loss" is the additional latency to transfer more bytes within the larger block on the next access. This pro doesn't just benefit the end of the file but also both ends of a hole within the file.

In addition, the next non-recent IO op that accesses the disk block will be able to perform a single seek. Also, if we allow ourselves to dynamically increase the size of the block while we still have direct access to the blocks, we can delay moving to the additional latencies of going through an indirect block.

So, this has a performance benefit, in addition to removing the case where an OS panic occurs in the middle of writing the disk block and we lose both the original and the full next iteration of the file. After the write completes, we should be able to update the FS's node data struct.

Mitchell Erblich
Ex-Sun kernel engineer, who proposed and implemented this in a limited release of UFS many years ago.
--

Ming Zhang wrote:
>
> Hi All
>
> I wonder if any one have idea about the performance loss caused by COW
> in ZFS? If you have to read old data out before write it to some other
> place, it involve disk seek.
>
> Thanks
>
> Ming
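The read-modify-write cost under discussion can be put into back-of-envelope numbers. This is a toy model, not ZFS's actual IO path: it simply assumes one full-block read plus one full new-block write per COW update, and ignores caching, coalescing, and metadata.

```python
# Bytes moved when a tiny modification hits one COW block.
def cow_io_bytes(block_size, bytes_modified=1):
    read = block_size           # read the whole original block
    write = block_size          # write the whole new block elsewhere
    total = read + write
    return total, total / bytes_modified  # (bytes moved, amplification)

for bs in (8 * 1024, 64 * 1024):
    total, amp = cow_io_bytes(bs)
    print("block %6d: %6d bytes moved, %.0fx amplification" % (bs, total, amp))
```

The model makes the trade-off concrete: larger blocks amortize seeks for streaming IO but multiply the cost of small in-place-style updates.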
Re: [zfs-discuss] Re: ZFS+NFS on storedge 6120 (sun t4)
Spencer,

Summary: I am not sure that v4 would have a significant advantage over v3 or v2 in all environments. I just believe it can have a significant advantage (with no or minimal drawbacks), and one should use it if at all possible, to verify that it is not the bottleneck.

So, no, I cannot say that NFSv3 has the same performance as v4. I believe that at worst v4 does not perform below v3, and at best it performs up to 2x or more over v3.

The assumptions are:
- v4 is being actively worked on,
- v3 is stable but no major changes are being done on it,
- leases,
- better data caching (delegations and client callbacks),
- state behaviour,
- compound NFS requests (procs) to remove the sequential RTT of individual NFS requests,
- significantly improved lookups for pathing (multi-lookup) and later attr requests — I am sure that the attr calls are/were a significant percentage of NFS ops,
- etc...

** I am not telling Spencer this; he should already know it.

So, with the compound procs in v4, the increased latencies of some of the ops might have a different congestion-type behaviour (it scales better under more environments and allows the IO bandwidth to be more of an issue). So, yes, my assumption is that NFSv4 has a good possibility of significantly outperforming v3. Either way, I know of no degradation in any op moving to v4.

So, again, if we are tuning a setup, I would rather see what ZFS does with v4, knowing that a few performance holes were closed or almost closed versus v3. I don't think this is specific to Sun; it would apply to all NFSv4 environments.

** Yes, however, even when the public (Paw, Spencer, etc.) NFSv4 paper was done, the SFS work was stated as not yet done.

LASTLY, I would also be interested in the actual timing of the different TCP segments: to see whether ACKs are constantly in the pipeline between the dst and src, or whether "slow-start restart" behaviour is occurring. It is also theoretical that with delayed ACKs at the dst the number of ACKs is reduced, which reduces the bandwidth (IO ops) on subsequent data bursts. Also, is Allman's ABC (Appropriate Byte Counting) being used in the TCP implementation?

Mitchell Erblich

Spencer Shepler wrote:
>
> On Apr 21, 2007, at 9:46 AM, Andy Lubel wrote:
>
> > so what you are saying is that if we were using NFS v4 things
> > should be dramatically better?
>
> I certainly don't support this assertion (if it was being made).
>
> NFSv4 does have some advantages from the perspective of enabling
> more aggressive file data caching; that will enable NFSv4 to
> outperform NFSv3 in some specific workloads. In general, however,
> NFSv4 performs similarly to NFSv3.
>
> Spencer
>
> > do you think this applies to any NFS v4 client or only Suns?
> >
> > -Original Message-
> > From: [EMAIL PROTECTED] on behalf of Erblichs
> > Sent: Sun 4/22/2007 4:50 AM
> > To: Leon Koll
> > Cc: zfs-discuss@opensolaris.org
> > Subject: Re: [zfs-discuss] Re: ZFS+NFS on storedge 6120 (sun t4)
> >
> > Leon Koll,
> >
> > As a knowledgeable outsider I can say something.
> >
> > The benchmark (SFS) page specifies NFSv3/v2 support, so I question
> > whether you ran NFSv4. I would expect a major change in
> > performance just from the version 4 NFS version and ZFS.
> >
> > The benchmark seems to stress your configuration enough that
> > the latency to service NFS ops increases to the point of non
> > serviced NFS requests. However, you don't know what is the
> > byte count per IO op. Reads are bottlenecked against rtt of
> > the connection and writes are normally sub 1K with a later
> > commit. However, many ops are probably just file handle
> > verifications which again are limited to your connection
> > rtt (round trip time). So, my initial guess is that the number
> > of NFS threads is somewhat related to the number of non
> > state (v4 now has state) per file handle ops. Thus, if a 64k
> > ZFS block is being modified by 1 byte, COW would require a
> > 64k byte read, 1 byte modify, and then allocation of another
> > 64k block. So, for every write op, you COULD be writing a
> > full ZFS block.
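The compound-request argument above reduces to simple round-trip arithmetic. A toy model with made-up numbers, ignoring server processing time and any client-side pipelining: N dependent ops paid one RTT each versus one compound carrying all N.

```python
# Toy latency model for sequential RPCs vs one NFSv4-style COMPOUND.
def sequential_time_ms(n_ops, rtt_ms):
    # v2/v3 style: each op waits for the previous reply.
    return n_ops * rtt_ms

def compound_time_ms(n_ops, rtt_ms):
    # v4 style: one round trip carries the whole chain of ops.
    return rtt_ms

n, rtt = 5, 2.0  # e.g. lookup+lookup+lookup+getattr+read over a 2 ms link
print("sequential: %.1f ms, compound: %.1f ms"
      % (sequential_time_ms(n, rtt), compound_time_ms(n, rtt)))
```

The model only captures wire latency, which is why the gain shows up most on lookup/getattr-heavy workloads over higher-RTT links, and much less when the server's own service time dominates.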
Re: [zfs-discuss] Re: ZFS+NFS on storedge 6120 (sun t4)
Leon Koll,

As a knowledgeable outsider I can say something.

The benchmark (SFS) page specifies NFSv3/v2 support, so I question whether you ran NFSv4. I would expect a major change in performance just from moving to NFS version 4 with ZFS.

The benchmark seems to stress your configuration enough that the latency to service NFS ops increases to the point of unserviced NFS requests. However, you don't know the byte count per IO op. Reads are bottlenecked against the RTT of the connection, and writes are normally sub-1K with a later commit. However, many ops are probably just file handle verifications, which again are limited to your connection RTT (round trip time). So, my initial guess is that the number of NFS threads is somewhat related to the number of non-state (v4 now has state) per-file-handle ops. Thus, if a 64k ZFS block is being modified by 1 byte, COW would require a 64k read, a 1-byte modify, and then allocation of another 64k block. So, for every write op, you COULD be writing a full ZFS block.

This COW philosophy works best with extending delayed writes, etc., where later reads make the trade-off of increased latency for the larger block on a read op versus being able to minimize the number of seeks on the write and read — basically increasing the block size from, say, 8k to 64k. Thus, your read latency goes up just to get the data off the disk while minimizing the number of seeks, and dropping the read-ahead logic for the needed 8k-to-64k file offset.

I do NOT know that "THAT" 4000 IO ops load would match your maximal load, nor that your actual load would never increase past 2000 IO ops. Secondly, jumping from 2000 to 4000 seems too big a jump for your environment; going to 2500 or 3000 might be more appropriate. Lastly wrt the benchmark, some remnants (NFS and/or ZFS and/or benchmark) seem to remain that have a negative impact.

Lastly, my guess is that this NFS and the benchmark are stressing small partial-block writes, and that is probably one of the worst-case scenarios for ZFS. So, my guess is the proper analogy is trying to kill a gnat with a sledgehammer. Each write IO op really needs to be equal to a full-size ZFS block to get the full benefit of ZFS on a per-byte basis.

Mitchell Erblich
Sr Software Engineer
-

Leon Koll wrote:
>
> Welcome to the club, Andy...
>
> I tried several times to attract the attention of the community to the
> dramatic performance degradation (about 3 times) of NFS/ZFS vs. ZFS/UFS
> combination - without any result:
> [1] http://www.opensolaris.org/jive/thread.jspa?messageID=98592
> [2] http://www.opensolaris.org/jive/thread.jspa?threadID=24015
>
> Just look at the two graphs in my posting dated August, 2006
> (http://napobo3.blogspot.com/2006/08/spec-sfs-bencmark-of-zfsufsvxfs.html)
> to see how bad the situation was and, unfortunately, this situation
> wasn't changed much recently:
> http://photos1.blogger.com/blogger/7591/428/1600/sfs.1.png
>
> I don't think the storage array is a source of the problems you reported.
> It's somewhere else...
>
> -- leon
Re: [zfs-discuss] ZFS and Linux
Joerg Schilling,

Stepping back into the tech discussion: if we want a port of ZFS to Linux to begin, SHOULD the kitchen-sink approach be abandoned for the 1.0 release? For later releases, dropped functionality could be added back in.

Suggested 1.0 requirements
--
1) No NFS export support.
2) Basic local FS support (all vnode ops and VFS ops reviewed).
3) Identify any FSs (with source availability) that are common between Linux and SunOS and use those as porting guides.
4) Identify all Sun DDI/DKI calls that have no Linux equivalents.
5) Identify what ZFS apps need supporting.
6) Identify any/all libraries that are needed for the ZFS apps.
7) Identify and acquire as many ZFS validation tests as possible.
8) Can we/should we assume that the Sun ZFS docs will suffice as the main reference, and identify any and all diffs in a supplementary doc?
9) Create a one-pager on the mmap() diffs.
10) Identify whether lookuppathname should be ported over to Linux, and whether a "ships-in-the-night" approach would cause more problems.

Mitchell Erblich
Sr Software Engineer
-

Joerg Schilling wrote:
>
> Erblichs <[EMAIL PROTECTED]> wrote:
>
> > Joerg Schilling,
> >
> > Putting the license issues aside for a moment.
>
> I was trying to point people to the fact that the biggest problems are
> technical problems and that the license discussion was done the wrong way.
>
> > If there is "INTEREST" in ZFS within Linux, should
> > a small Linux group be formed to break down ZFS into
> > easily portable sections and non-portable sections,
> > and get a realistic time/effort assessment as to what is
> > needed to get it done.
>
> Going back to the technical stuff:
>
> - The NFS export interface from Linux is weird and needs
>   adaptation
>
> - Linux still has the outdated "namei" interface instead of
>   the more than 20 year old lookuppathname() interface
>   from SunOS.
>
> - The mmap interface is extremely different
>
> In general, the problem on Linux is that the Linux "vfs"
> interface is a low-level interface, so it is most likely easier
> to adapt a Linux FS to the Solaris vfs interface than vice versa.
>
> There is nothing like the clean global vfsops and vnodeops of Solaris,
> but a lot of small interfaces.
>
> > Assuming there is interest and usage, if ported, I
> > would assume that someone/some group would make sure
> > that the code is resynced on a periodic basis.
>
> I also assume that the same people who are interested in a port
> will do the maintenance...
>
> > I know a FS from Veritas and SGI were reviewed in
> > these manners. The Veritas FS originally was
> > developed using Sun's VFS layer.
> >
> > So, if the license issues are removed, I am sure
> > that ZFS could be ported over to Linux. It is just
> > time and effort...
>
> I am sure it could be done, but Linux people cannot assume that
> Sun will do it ;-)
>
> Jörg
Re: [zfs-discuss] ZFS on the desktop
Rich Teer,

I have a perfect app for the masses: a hi-def video/audio server for the hi-def TV and audio setup. I would think the average person would want access to 1000s of DVDs/CDs within a small box, versus taking up a full wall — yes, assuming the quality was there.

Extrapolating the cost of drives, this is reality today for the few, but give it 1.5 years and it is for the masses. Wouldn't this sell enough boxes to make it the next killer app?

Mitchell Erblich
Sr Software Engineer

Rich Teer wrote:
>
> On Tue, 17 Apr 2007, Toby Thain wrote:
>
> > The killer feature for me is checksumming and self-healing.
>
> Same here. I think anyone who dismisses ZFS as being inappropriate for
> desktop use ("who needs access to Petabytes of space in their desktop
> machine?!") doesn't get it. (A close 2nd for me personally is the
> ease of creating mirrors, but granted that's on my servers rather than
> my desktop.)
>
> --
> Rich Teer, SCSA, SCNA, SCSECA, OGB member
Re: [zfs-discuss] ZFS and Linux
Joerg Schilling,

Putting the license issues aside for a moment: if there is "INTEREST" in ZFS within Linux, should a small Linux group be formed to break ZFS down into easily portable sections and non-portable sections, and get a realistic time/effort assessment of what is needed to get it done?

Assuming there is interest and usage, if ported, I would assume that someone or some group would make sure the code is resynced on a periodic basis.

I know an FS from Veritas and one from SGI were reviewed in these manners. The Veritas FS was originally developed using Sun's VFS layer.

So, if the license issues are removed, I am sure that ZFS could be ported over to Linux. It is just time and effort...

Mitchell Erblich
Ex-Sun Kernel Engineer

Joerg Schilling wrote:
>
> Nicolas Williams <[EMAIL PROTECTED]> wrote:
>
> > Sigh. We have devolved. Every thread on OpenSolaris discuss lists
> > seems to devolve into a license discussion.
>
> It is funny to see that in our case, the technical problems (those caused
> by the fact that Linux implements a different VFS interface layer) are
> creating a much bigger problem than the license issue does.
>
> > I have seen mailing list posts (I'd have to search again) that indicate
> > [that some believe] that even dynamic linking via dlopen() qualifies as
> > making a derivative.
>
> There is no single place in the GPL that mentions the term "linking".
> For this reason, the GPL FAQ from the FSF is wrong, as it is based on the
> term "linking".
>
> There is no difference whether you link statically or dynamically.
> Whether using GPLd code from a non-GPLd program creates a "derived work"
> thus cannot depend on whether you link against it or not. If a GPLd program
> however "uses" a non-GPLd library, this is definitely not a problem, or
> every GPLd program linked against the libc from HP-UX would be a problem.
>
> > If true, that would mean that one could not distribute an OpenSolaris
> > distribution containing a GPLed PAM module. Or perhaps, because in that
> > case the header files needed to make the linking possible are not GPLed,
> > the linking-makes-derivatives argument would not apply.
>
> If the GPLd PAM module just implements a well known plug-in interface,
> a program that uses this module cannot be a derivative of the GPLd code.
>
> Jörg
Re: [zfs-discuss] Re: ZFS for Linux (NO LICENSE talk, please)
Toby Thain,

I am sure someone will devise a method of subdividing the FS and running a background fsck and/or checksums on the different file objects, etc., before this becomes an issue. :)

Mitchell Erblich
-

Toby Thain wrote:
>
> > It seems that there are other reasons for the Linux kernel folks for not
> > liking ZFS.
>
> I certainly don't understand why they ignore it.
>
> How can one have a "Storage and File Systems Workshop" in 2007
> without ZFS dominating the agenda??
> http://lwn.net/Articles/226351/
>
> That "long fscks" should be a hot topic, given the state of the art,
> is just bizarre.
>
> --Toby
Re: [zfs-discuss] Re: ZFS for Linux (NO LISCENCE talk, please)
Group, Did Joerg Schilling bring up a bigger issue within this discussion thread? > And it seems that you missunderstand the way the Linux kernel is developed. > If _you_ started a ZFS project for Linux, _you_ would need to maintain it too > or otherwise it would not be kept up to date. Note that it is a well known > fact that a lot of the non-mainstream parts of the linux kernel sources > do not work although they _are_ part of the linux kernel source tree. Whose job is it to "clean" or declare for removal kernel sources that "do not work"? Mitchell Erblich --- Joerg Schilling wrote: > > "David R. Litwin" <[EMAIL PROTECTED]> wrote: > > > If you refer to the licensing, yes. Coding-wise, I have no idea exept > > to say that I would be VERY surprised if ZFS can not be ported to > > Linux, especially since there already > > exists the FUSE project. > > So if you are interested in this project, I would encourage you to just start > with the code... > > > > ZFS is not part of the Linux Kernel. Only if you declare ZFS a "part of > > > Linux", you will observe the license conflict. > > > > > > And, as brought up elsewhere, ZFS would have to be a part of the > > Kernel -- or else some persons would have to employ Herculean > > attention to make sure ZFS was upgraded with the kernel. if some one > > were > > willing to do this, a swift resolution MAY ba possible. > > The fact that someone may put the ZFS sources in the Linux source tree > does not make it a part of that software > > And it seems that you missunderstand the way the Linux kernel is developed. > If _you_ started a ZFS project for Linux, _you_ would need to maintain it too > or otherwise it would not be kept up to date. Note that it is a well known > fact that a lot of the non-mainstream parts of the linux kernel sources > do not work although they _are_ part of the linux kernel source tree. > > Creating a port does not mean that you may forget about it once you believe > that > you are ready. 
> > > The GPL is talking about "works" and there is no problem to use GPL code > > > together with code under other licenses as long as this is mere > > > aggregation > > > (like creating a driver for Linux) instead of creating a "derived work". > > > > > > It seems that there are other reasons for the Linux kernel folks for not > > > liking ZFS. > > > > > > Indeed? What are these reasons? I want to have every thing in the open. > > This is something you would need to ask the Linux kernel folks > > Jörg > > -- > EMail:[EMAIL PROTECTED] (home) Jörg Schilling D-13353 Berlin >[EMAIL PROTECTED](uni) >[EMAIL PROTECTED] (work) Blog: http://schily.blogspot.com/ > URL: http://cdrecord.berlios.de/old/private/ ftp://ftp.berlios.de/pub/schily > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Gzip compression for ZFS
My two cents. Assuming that you may pick a specific compression algorithm, most algorithms offer different levels/percentages of deflation/inflation, which affects the time to compress and/or inflate with respect to CPU capacity. Secondly, if I can add an additional item, would anyone want to be able to encrypt the data instead of compressing it, or to combine encryption with compression? Third, if data were compressed within a file object, should a reader be made aware that the data being read is compressed, or should he just read garbage? Should a field in the znode be read so that already-compressed data is decompressed transparently? Fourth, if you take 8k, expect to allocate 8k of disk block storage for it, and compress it to 7k, are you really saving 1k? Or are you just creating an additional 1k of internal fragmentation? It is possible that moving 7k of data across your "SCSI" type interface may give you faster read/write performance. But that comes after the additional latency of the compress on the async write, and it adds real latency on the current block read. So, what are you really gaining? Fifth and hopefully last, should the znode have a new length field that keeps the uncompressed length for POSIX compatibility? I am assuming large file support, where a process that is not large-file aware should not even be able to open the file. With the additional field (uncompressed size) the file may lie on the boundary for the large file open requirements. Really last..., why not just compress the data stream before writing it out to disk? Then you can at least run file(1) on it and identify the type of compression... Mitchell Erblich - Darren Reed wrote: > > From: "Darren J Moffat" <[EMAIL PROTECTED]> > ... > > The other problem is that you basically need a global unique registry > > anyway so that compress algorithm 1 is always lzjb, 2 is gzip, 3 is > > etc etc. Similarly for crypto and any other transform. 
> > I've two thoughts on that: > 1) if there is to be a registry, it should be hosted by OpenSolaris >and be open to all and > > 2) there should be provision for a "private number space" so that >people can implement their own whatever so long as they understand >that the filesystem will not work if plugged into something else. > > Case in point for (2), if I wanted to make a bzip2 version of ZFS at > home then I should be able to and in doing so chose a number for it > that I know will be safe for my playing at home. I shouldn't have > to come to zfs-discuss@opensolaris.org to "pick a number." > > Darren > > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
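The "are you really saving 1k?" question above is just block-rounding arithmetic, and it can be sketched in a few lines. This is a hedged illustration with hypothetical sizes, not a description of how ZFS actually allocates: compressing an 8K record down to 7K only saves disk space if the allocator can hand out units smaller than the full block; with whole-block allocation the savings turn into internal fragmentation.

```python
# Sketch of the internal-fragmentation question raised above (hypothetical
# numbers): a compressed payload still occupies a whole number of
# allocation units on disk.

def allocated_bytes(payload, alloc_unit):
    """Round a payload size up to the allocation unit."""
    units = -(-payload // alloc_unit)          # ceiling division
    return units * alloc_unit

record = 8192          # logical block size (8K)
compressed = 7168      # 7K after compression

# Whole-block allocation: nothing saved, 1K of internal fragmentation.
whole = allocated_bytes(compressed, 8192)

# Sub-block (512-byte sector) allocation: the 1K saving is real.
sectors = allocated_bytes(compressed, 512)

print(whole)     # 8192
print(sectors)   # 7168
```

The only remaining benefit in the whole-block case is the one the post concedes: fewer bytes crossing the "SCSI" type interface, paid for with compression latency.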
Re: [zfs-discuss] Re: Layout for multiple large streaming writes.
To the original poster, FYI, Accessing RAID drives at a constant "~70-75%" probably does not leave enough excess capacity for degraded mode. A normal rule of thumb is 50 to 60% constant utilization, so that the excess capacity can be absorbed in degraded mode. An "old" rule of thumb for estimating MTBF: if you have 100 drives and each single drive is rated at 30,000 hours (> 3 years), then a drive failure is expected about every 300 hours (roughly every 12.5 days). Thus, excess capacity always needs to be present to allow the time to reconstruct the RAID, the ability to reconstruct it within a limited timeframe, and to minimize any significantly increased latencies for normal processing. Mitchell Erblich - Richard Elling wrote: > > > I have a setup with a T2000 SAN attached to 90 500GB SATA drives > > presented as individual luns to the host. We will be sending mostly > > large streaming writes to the filesystems over the network (~2GB/file) > > in 5/6 streams per filesystem. Data protection is pretty important, but > > we need to have at most 25% overhead for redundancy. > > > > Some options I'm considering are: > > 10 x 7+2 RAIDZ2 w/ no hotspares > > 7 x 10+2 RAIDZ2 w/ 6 spares > > > > Does any one have advice relating to the performance or reliability to > > either of these? We typically would swap out a bad drive in 4-6 hrs and > > we expect the drives to be fairly full most of the time ~70-75% fs > > utilization. > > What drive manufacturer & model? > What is the SAN configuration? More nodes on a loop can significantly > reduce performance as loop arbitration begins to dominate. This problem > can be reduced by using multiple loops or switched fabric, assuming the > drives support fabrics. > > The data availability should be pretty good with raidz2. Having hot spares > will be better than not, but with a 4-6 hour (assuming 24x7 operations) > replacement time there isn't an overwhelming need for hot spares -- double > parity and fast repair time is a good combination. 
We do worry more > about spares when the operations are not managed 24x7 or if you wish > to save money by deferring repairs to a regularly scheduled service > window. In my blog about this, I used a 24 hour logistical response > time and see about an order of magnitude difference in the MTTDL. > http://blogs.sun.com/relling/entry/zfs_raid_recommendations_space_performance > > In general, you will have better performance with more sets, so the > 10-set config will outperform the 7-set config. > -- richard > > > This message posted from opensolaris.org > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
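The rule-of-thumb arithmetic in the note above can be sketched directly: with N independent drives each rated at a given MTBF, some drive in the population is expected to fail about every MTBF/N hours. The 30,000-hour figure is the one quoted in the rule of thumb; the 90-drive count comes from the original poster's configuration (this is a back-of-envelope sketch, not a real reliability model, which would also account for rebuild windows and correlated failures).

```python
# Back-of-envelope failure-interval estimate for a population of drives,
# as discussed in the thread above.

def expected_hours_between_failures(drive_mtbf_hours, n_drives):
    """Mean time between any-drive failures across the population."""
    return drive_mtbf_hours / n_drives

# 90 drives at a (hypothetical) 30,000-hour per-drive MTBF:
interval = expected_hours_between_failures(30_000, 90)
print(round(interval, 1))   # 333.3 hours -- roughly two weeks
```

This is why a 4-6 hour drive swap plus double parity, as Richard notes, already covers most of the risk: the repair window is tiny compared to the expected interval between failures.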
Re: [zfs-discuss] writes lost with zfs !
Ayaz Anjum and others, I think once you move to NFS over TCP in a client-server environment, the chance of lost data is significantly higher than from just disconnecting a cable. Scenario: before a client generates a delayed write from its volatile DRAM client cache, the client reboots; and/or an asynchronous or delayed write is done with no error on the write, and the error is missed on the close because the programmer didn't perform an fsync on the fd before the close and/or didn't expect that a close might fail; and/or the TCP connection is lost and the data is not transferred. Thus, I know of very few FSs that can guarantee against data loss. What most modern FSs try to prevent is data corruption and FS corruption... However, I am surprised that you seem to indicate that no hardware indication was present that some form of hardware degradation/failure had occurred. Mitchell Erblich On 11-Mar-07, at 11:12 PM, Ayaz Anjum wrote: > > HI ! > > Well as per my actual post, i created a zfs file as part of Sun > cluster HAStoragePlus, and then disconned the FC cable, since there > was no active IO hence the failure of disk was not detected, then i > touched a file in the zfs filesystem, and it went fine, only after > that when i did sync then the node panicked and zfs filesystem is > failed over to other node. On the othernode the file i touched is > not there in the same zfs file system hence i am saying that data > is lost. I am planning to deploy zfs in a production NFS > environment with above 2TB of Data where users are constantly > updating file. Hence my concerns about data integrity. I believe Robert and Darren have offered sufficient explanations: You cannot be assured of committed data unless you've sync'd it. You are only risking data loss if your users and/or applications assume data is committed without seeing a completed sync, which would be a design error. This applies to any filesystem. --Toby > Please explain. 
> > thaks > > Ayaz Anjum > > > > Darren Dunham <[EMAIL PROTECTED]> > Sent by: [EMAIL PROTECTED] > 03/12/2007 05:45 AM > > To > zfs-discuss@opensolaris.org > cc > Subject > Re: Re[2]: [zfs-discuss] writes lost with zfs ! > > > > > > > I have some concerns here, from my experience in the past, > touching a > > file ( doing some IO ) will cause the ufs filesystem to failover, > unlike > > zfs where it did not ! Why the behaviour of zfs different than ufs ? > > UFS always does synchronous metadata updates. So a 'touch' that > creates > a file is going to require a metadata write. > > ZFS writes may not necessarily hit the disk until a transaction group > flush. > > > is not this compromising data integrity ? > > It should not. Is there a scenario that you are worried about? > > -- > Darren Dunham > [EMAIL PROTECTED] > Senior Technical Consultant TAOShttp:// > www.taos.com/ > Got some Dr Pepper? San Francisco, CA bay > area > < This line left intentionally blank to confuse you. > > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss > > > > > > > > > > -- > ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
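The failure mode described above (an error surfacing only at close because the program never fsync'd) has a standard defensive pattern: force the data to stable storage with fsync(2) before closing, so write-back errors are reported while you can still act on them. A minimal sketch, using POSIX file descriptors (the path is hypothetical):

```python
# Sketch of the "fsync before close" discipline discussed in this thread:
# data is only known committed once fsync() returns successfully. Checking
# the result of close() alone is not sufficient on NFS-like filesystems.

import os

def write_durably(path, data):
    """Write data and force it to stable storage before close."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)        # write-back errors surface here, not at close
    finally:
        os.close(fd)        # close() can still fail, but fsync already ran

write_durably("/tmp/zfs_discuss_demo.txt", b"committed\n")
```

An application that skips the fsync is making exactly the assumption Toby calls a design error: that data is committed without ever seeing a completed sync.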
Re: [zfs-discuss] Re: Re: Efficiency when reading the same file blocks
Toby Thain, No, physical location referred to the exact location, and logical to the rest of my info. But what I might not have made clear was the use of fragments. There are two types of fragments: one is the partial use of a logical disk block, and the other, which I was also trying to refer to, is the moving of modified sections of the file. The first was well used in the Joy FFS implementation, where a FS and drive tended to have a high cost-per-byte overhead and were fairly small. Now, let's make this perfectly clear. If a FS object is large and written "somewhat" in sequence as a stream of bytes, and random FS logical blocks or physical blocks are then modified, the new FS object will be written less sequentially, which CAN decrease read performance. Sorry, I tend to care less about write performance, because writes tend to be async, without threads blocking waiting for their operation to complete. This will happen MOST as the FS fills and less optimal locations in the FS are found for the COW blocks. The same problem happens with memory on OSs that support multiple page sizes, where a well-used system may not be able to allocate large page sizes due to fragmentation. Yes, this is an overloaded term... :) Thus, FS performance may suffer even if there are just a lot of 1-byte changes to frequently accessed FS objects. If this occurs, either keep a larger FS, clean out the FS more frequently, or back up, clean up, and then restore to get newly sequential FS objects. Mitchell Erblich - Toby Thain wrote: > > On 28-Feb-07, at 6:43 PM, Erblichs wrote: > > > ZFS Group, > > > > My two cents.. > > > > Currently, in my experience, it is a waste of time to try to > > guarantee "exact" location of disk blocks with any FS. > > ? Sounds like you're confusing logical location with physical > location, throughout this post. > > I'm sure Roch meant logical location. 
> > --T > > > > > A simple reason exception is bad blocks, a neighboring block > > will suffice. > > > > Second, current disk controllers have logic that translates > > and you can't be sure outside of the firmware where the > > disk block actually is. Yes, I wrote code in this area before. > > > > Third, some FSs, do a Read-Modify-Write, where the write is > > NOT, NOT, NOT overwriting the original location of the read. > ... ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Re: Efficiency when reading the same file blocks
ZFS Group, My two cents.. Currently, in my experience, it is a waste of time to try to guarantee the "exact" location of disk blocks with any FS. A simple exception is bad blocks, where a neighboring block will suffice. Second, current disk controllers have translation logic, so outside of the firmware you can't be sure where the disk block actually is. Yes, I have written code in this area before. Third, some FSs do a read-modify-write, where the write is NOT, NOT, NOT overwriting the original location of the read. Why? For a couple of reasons. One is that the original read may have existed in a fragment. Some do it for FS consistency, to allow the write to become a partial write in some circumstances (e.g. a crash); the second file block location then allows for FS consistency and the ability to recover the original contents. No overwrite. Another reason is that sometimes we are filling a hole within a FS object window, from a base addr to a new offset. The ability to concatenate allows us to reduce the number of future seeks and small reads/writes, at the cost of a slightly longer transfer time for the larger theoretical disk block. Thus, the tradeoff is: we accept that we waste some FS space, we may not fully optimize the location of the disk block, and we have larger single-large-block read and write latencies, but... we seek less, the per-byte overhead is less, we can order our writes so that we again seek less, our writes can be delayed (assuming we might write multiple times and then commit on close) to minimize the number of actual write operations, we can prioritize our reads over our writes to decrease read latency, etc. The bottom line is that performance may suffer if we do a lot of small random read-modify-writes within FS objects that use a very large disk block. Since the actual CHANGE to the file is small, each small write outside of a delayed-write window will consume at least 1 disk block. 
However, some writes are to FS objects that are write-through, and thus each small write will consume a new disk block. Mitchell Erblich - Roch - PAE wrote: > > Jeff Davis writes: > > > On February 26, 2007 9:05:21 AM -0800 Jeff Davis > > > But you have to be aware that logically sequential > > > reads do not > > > necessarily translate into physically sequential > > > reads with zfs. zfs > > > > I understand that the COW design can fragment files. I'm still trying to > understand how that would affect a database. It seems like that may be bad > for performance on single disks due to the seeking, but I would expect that > to be less significant when you have many spindles. I've read the following > blogs regarding the topic, but didn't find a lot of details: > > > > http://blogs.sun.com/bonwick/entry/zfs_block_allocation > > http://blogs.sun.com/realneel/entry/zfs_and_databases > > > > > > Here is my take on this: > > DB updates (writes) are mostly governed by the synchronous > write code path which for ZFS means the ZIL performance. > It's already quite good in that it aggregates multiple > updates into few I/Os. Some further improvements are in the > works. COW, in general, simplify greatly write code path. > > DB reads in a transaction workloads are mostly random. If > the DB is not cacheable the performance will be that of a > head seek no matter what FS is used (since we can't guess in > advance where to seek, COW nature does not help nor hinders > performance). > > DB reads in a decision workloads can benefit from good > prefetching (since here we actually know where the next > seeks will be). 
> > -r > > > This message posted from opensolaris.org > > ___ > > zfs-discuss mailing list > > zfs-discuss@opensolaris.org > > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss > > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
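The cost claimed above, that each small write outside a delayed-write window consumes at least one full disk block, is easy to quantify. A rough illustration with hypothetical numbers (not a model of ZFS's actual allocator): with copy-on-write and a large block size, every touched block is rewritten whole, so scattered 1-byte edits can cost orders of magnitude more I/O than the bytes actually changed.

```python
# Write amplification for small random modifications under whole-block
# copy-on-write, as discussed in the post above.

def cow_bytes_written(modified_offsets, block_size):
    """Bytes rewritten when each touched block must be written out anew."""
    touched_blocks = {off // block_size for off in modified_offsets}
    return len(touched_blocks) * block_size

# Ten 1-byte edits scattered across a file, 128K blocks (hypothetical):
offsets = [i * 1_000_000 for i in range(10)]
print(cow_bytes_written(offsets, 128 * 1024))   # 1310720 bytes for 10 bytes changed
```

Edits that land in the same block, or that arrive inside one delayed-write window, are coalesced and cost only one block between them, which is exactly the benefit the delayed-write scheme is buying.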
Re: [zfs-discuss] Implementing fbarrier() on ZFS
Jeff Bonwick, Do you agree that there is a major tradeoff in "builds up a wad of transactions in memory"? We lose the changes if we have an unstable environment. Thus, I don't quite understand why a 2-phase approach to commits isn't done. First, take the transactions as they come and do a minimal amount of delayed writing. If the number of transactions builds up, then convert to the delayed-write scheme. The assumption is that not all ZFS envs are write-heavy versus write-once, read-many type accesses. My assumption is that attribute/meta reading outweighs all other accesses. Wouldn't this approach allow minimal outstanding transactions and favor read access? Yes, the assumption is that once the "wad" is started, the amount of writing could be substantial, and thus the amount of bandwidth available for reading is reduced. This would then allow more N states to be available. Right? Second, there are multiple uses of "then" (then pushes, then flushes all disk..., then writes the new uberblock, then flushes the caches again), which seems to have some level of possible parallelism that should reduce the latency from the start to the final write. Or did you just say that for simplicity's sake? Mitchell Erblich --- Jeff Bonwick wrote: > > Toby Thain wrote: > > I'm no guru, but would not ZFS already require strict ordering for its > > transactions ... which property Peter was exploiting to get "fbarrier()" > > for free? > > Exactly. Even if you disable the intent log, the transactional nature > of ZFS ensures preservation of event ordering. Note that disk caches > don't come into it: ZFS builds up a wad of transactions in memory, > then pushes them out as a transaction group. That entire group will > either commit or not. ZFS writes all the new data to new locations, > then flushes all disk write caches, then writes the new uberblock, > then flushes the caches again. 
Thus you can lose power at any point > in the middle of committing transaction group N, and you're guaranteed > that upon reboot, everything will either be at state N or state N-1. > > I agree about the usefulness of fbarrier() vs. fsync(), BTW. The cool > thing is that on ZFS, fbarrier() is a no-op. It's implicit after > every system call. > > Jeff > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
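Jeff's "state N or state N-1" guarantee can be captured in a toy model: updates accumulate in memory and are applied as a single atomic group, with the generation ("uberblock") advancing only after the whole group lands. This is purely an illustration of the invariant being discussed, not how ZFS is actually implemented; all names here are invented for the sketch.

```python
# Toy model of transaction-group commit: on-"disk" state is always either
# group N or group N-1, never a partial mix. Illustration only -- not the
# real ZFS mechanism.

class TxGroup:
    def __init__(self):
        self.stable = {}      # "on disk" state (last committed group)
        self.pending = {}     # wad of transactions built up in memory
        self.generation = 0   # committed group number

    def write(self, key, value):
        self.pending[key] = value          # buffered, not yet durable

    def commit(self):
        self.stable.update(self.pending)   # push the whole group at once
        self.pending.clear()
        self.generation += 1               # "uberblock" advances last

    def crash(self):
        self.pending.clear()               # uncommitted wad is lost

tg = TxGroup()
tg.write("a", 1); tg.write("b", 2)
tg.commit()                  # now at state N
tg.write("a", 99)
tg.crash()                   # power lost mid-group
print(tg.stable["a"], tg.generation)   # 1 1
```

The model also makes Erblich's objection concrete: everything in `pending` at crash time is gone, which is the tradeoff the intent log (ZIL) exists to cover for synchronous writes.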
Re: [zfs-discuss] Re: Heavy writes freezing system
Rainer Heilke, You have 1/4 of the memory that the E2900 system is capable of (192GB, I think). Secondly, output from fsstat(1M) could be helpful. Run this command repeatedly and check whether the values change over time. Mitchell Erblich --- Rainer Heilke wrote: > > > What hardware is used? Sparc? x86 32-bit? x86 > > 64-bit? > > How much RAM is installed? > > Which version of the OS? > > Sorry, this is happening on two systems (test and production). They're both > Solaris 10, Update 2. Test is a V880 with 8 CPU's and 32GB, production is an > E2900 with 12 dual-core CPU's and 48GB. > > > Did you already try to monitor kernel memory usage, > > while writing to zfs? Maybe the kernel is running > > out of > > free memory? (I've bugs like 6483887 in mind, > > "without direct management, arc ghost lists can run > > amok") > > We haven't seen serious kernel memory usage that I know of (I'll be honest--I > came into this problem late). > > > For a live system: > > > > echo ::kmastat | mdb -k > > echo ::memstat | mdb -k > > I can try this if the DBA group is willing to do another test, thanks. > > > In case you've got a crash dump for the hung system, > > you > > can try the same ::kmastat and ::memstat commands > > using the > > kernel crash dumps saved in directory > > /var/crash/`hostname` > > > > # cd /var/crash/`hostname` > > # mdb -k unix.1 vmcore.1 > > ::memstat > > ::kmastat > > The system doesn't actually crash. It also doesn't freeze _completely_. While > I call it a freeze (best name for it), it actually just slows down > incredibly. It's like the whole system bogs down like molasses in January. > Things happen, but very slowly. > > Rainer > > > This message posted from opensolaris.org > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Limit ZFS Memory Utilization
Hey guys, Due to long URL lookups, the DNLC was pushed to variable-sized entries. The hit rate was dropping because of "name too long" misses. This was done long ago, while I was at Sun, under a bug reported by me. I don't know your usage, but you should attempt to estimate the amount of memory used with the default size. Yes, this is after you start tracking your DNLC hit rate and make sure it doesn't significantly drop if ncsize is decreased. You also may wish to increase the size and again check the hit rate. Yes, it is possible that your access is random enough that no changes will affect the hit rate. 2nd item.. Bonwick's memory allocators, I think, still have the ability to limit the size of each slab. The issue is that some parts of the code expect memory allocations not to fail (SLEEP allocations). This can result in extended SLEEPs, but it can be done. If your company makes changes to your local source and then rebuilds, it is possible to pre-allocate a fixed number of objects per cache and then use NOSLEEP allocations with return values that indicate retry or failure. 3rd.. And this could be the most important: the memory cache allocators are lazy in freeing memory when it is not needed by anyone else. Thus, unfreed memory is effectively used as a cache to remove the latencies of on-demand memory allocations. This artificially keeps memory usage high, but should have minimal latencies to realloc when necessary. Also, it is possible to make mods to increase the level of memory garbage collection, after some watermark code is added, to minimize repeated allocs and frees. Mitchell Erblich "Jason J. W. Williams" wrote: > > Hi Robert, > > We've got the default ncsize. I didn't see any advantage to increasing > it outside of NFS serving...which this server is not. For speed the > X4500 is showing to be a killer MySQL platform. Between the blazing > fast procs and the sheer number of spindles, its perfromance is > tremendous. 
If MySQL cluster had full disk-based support, scale-out > with X4500s a-la Greenplum would be terrific solution. > > At this point, the ZFS memory gobbling is the main roadblock to being > a good database platform. > > Regarding the paging activity, we too saw tremendous paging of up to > 24% of the X4500s CPU being used for that with the default arc_max. > After changing it to 4GB, we haven't seen anything much over 5-10%. > > Best Regards, > Jason > > On 1/10/07, Robert Milkowski <[EMAIL PROTECTED]> wrote: > > Hello Jason, > > > > Thursday, January 11, 2007, 12:36:46 AM, you wrote: > > > > JJWW> Hi Robert, > > > > JJWW> Thank you! Holy mackerel! That's a lot of memory. With that type of a > > JJWW> calculation my 4GB arc_max setting is still in the danger zone on a > > JJWW> Thumper. I wonder if any of the ZFS developers could shed some light > > JJWW> on the calculation? > > > > JJWW> That kind of memory loss makes ZFS almost unusable for a database > > system. > > > > > > If you leave ncsize with default value then I belive it won't consume > > that much memory. > > > > > > JJWW> I agree that a page cache similar to UFS would be much better. Linux > > JJWW> works similarly to free pages, and it has been effective enough in the > > JJWW> past. Though I'm equally unhappy about Linux's tendency to grab every > > JJWW> bit of free RAM available for filesystem caching, and then cause > > JJWW> massive memory thrashing as it frees it for applications. > > > > Page cache won't be better - just better memory control for ZFS caches > > is strongly desired. Unfortunately from time to time ZFS makes servers > > to page enormously :( > > > > > > -- > > Best regards, > > Robertmailto:[EMAIL PROTECTED] > >http://milek.blogspot.com > > > > > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
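The DNLC tuning advice above revolves around one number, the hit rate, and the arithmetic is simple. On Solaris the raw counters come from `kstat -n dnlcstats`; the sketch below just shows the computation on sample figures (the counter values are hypothetical), which is what you would track before and after changing ncsize.

```python
# Hit-rate arithmetic for the DNLC tuning discussed above. Real counters
# come from `kstat -n dnlcstats` on Solaris; the numbers here are
# hypothetical sample values.

def dnlc_hit_rate(hits, misses):
    """Fraction of name lookups satisfied from the cache."""
    total = hits + misses
    return hits / total if total else 0.0

sample = {"hits": 970_000, "misses": 30_000}
rate = dnlc_hit_rate(sample["hits"], sample["misses"])
print(f"{rate:.1%}")   # 97.0%
```

The methodology Erblich describes is then: record the rate at the default ncsize, shrink or grow ncsize, and only keep the change if the rate does not drop significantly.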
Re: [zfs-discuss] I/O patterns during a "zpool replace": whywritetothe disk being replaced?
Bill Sommerfeld, Sorry. However, I am trying to explain what I think is happening on your system and why I consider this normal. Most of the reads for a FS "replace" are normally at the block level. To copy a FS, some level of reading MUST be done from the orig_dev. At what level, and whether it is recorded as a normal vnode read/mmap op for the direct and indirect blocks, is another story. But it is being done; it is just not recorded in the FS stats. Read stats are normally used for normal FS object access requests. Secondly, maybe starting with the uberblock(?), the rest of the metadata is probably being read. And because of the normal async access of FSs, it would not surprise me if each znode's access time field were then updated. Remember that unless you are just touching a FS low-level (file) object, all writes are preceded by at least 1 read. Mitchell Erblich Bill Sommerfeld wrote: > > On Thu, 2006-11-09 at 19:18 -0800, Erblichs wrote: > > Bill Sommerfield, > > Again, that's not how my name is spelled. > > > With some normal sporadic read failure, accessing > > the whole spool may force repeated reads for > > the replace. 
> > please look again at the iostat I posted: > > capacity operationsbandwidth > poolused avail read write read write > - - - - - - - > z 306G 714G 1.43K658 23.5M 1.11M > raidz1109G 231G 1.08K392 22.3M 497K > replacing - - 0 1012 0 5.72M > c1t4d0 - - 0753 0 5.73M > c1t5d0 - - 0790 0 5.72M > c2t12d0- -339177 9.46M 149K > c2t13d0- -317177 9.08M 149K > c3t12d0- -330181 9.27M 147K > c3t13d0- -352180 9.45M 146K > raidz1100G 240G117101 373K 225K > c1t3d0 - - 65 33 3.99M 64.1K > c2t10d0- - 60 44 3.77M 63.2K > c2t11d0- - 62 42 3.87M 63.4K > c3t10d0- - 63 42 3.88M 62.3K > c3t11d0- - 65 35 4.06M 61.8K > raidz1 96.2G 244G234164 768K 415K > c1t2d0 - -129 49 7.85M 112K > c2t8d0 - -133 54 8.05M 112K > c2t9d0 - -132 56 8.08M 113K > c3t8d0 - -132 52 8.01M 113K > c3t9d0 - -132 49 8.16M 112K > > there were no (zero, none, nada, zilch) reads directed to the failing > device. there were a lot of WRITES to the failing device; in fact, the > the same volume of data was being written to BOTH the failing device and > the new device. > > > So, I was thinking that a read access > > that could ALSO be updating the znode. This newer > > time/date stamp is causing alot of writes. > > that's not going to be significant as a source of traffic; again, look > at the above iostat, which was representative of the load throughout the > resilver. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] I/O patterns during a "zpool replace": why writetothe disk being replaced?
Bill Sommerfield, Because, first, I have seen a lot of I/O occur while a snapshot is being aged out of a system. I don't think that during the resilvering process accesses (reads, writes) are completely stopped to the orig_dev. I expect at least some meta reads are going on. With some normal sporadic read failure, accessing the whole spool may force repeated reads for the replace. So, I was thinking of a read access that could ALSO be updating the znode. This newer time/date stamp would be causing a lot of writes. Depending on how the FS meta and blocks are being accessed, the orig_dev may also have some normal writes until it is offlined. Mitchell Erblich - Bill Sommerfeld wrote: > > On Wed, 2006-11-08 at 01:54 -0800, Erblichs wrote: > > > > Bill Sommerfield, > > that's not how my name is spelled > > > > Are their any existing snaps? > no. why do you think this would matter? > > > > Can you have any scripts that may be > > removing aged files? > no; there was essentially no other activity on the pool other than the > "replace". > > why do you think this would matter? > > - Bill
Re: [zfs-discuss] I/O patterns during a "zpool replace": why write tothe disk being replaced?
Bill Sommerfield, Are there any existing snaps? Do you have any scripts that may be removing aged files? Mitchell Erblich -- Bill Sommerfeld wrote: > > On a v40z running snv_51, I'm doing a "zpool replace z c1t4d0 c1t5d0". > > (so, why am I doing the replace? The outgoing disk has been reporting > read errors sporadically but with increasing frequency over time..) > > zpool iostat -v shows writes going to the old (outgoing) disk as well as > to the replacement disk. Is this intentional? > > Seems counterintuitive as I'd think you'd want to touch a suspect disk > as little as possible and as nondestructively as possible... > > A representative snapshot from "zpool iostat -v" : > > capacity operationsbandwidth > poolused avail read write read write > - - - - - - - > z 306G 714G 1.43K658 23.5M 1.11M > raidz1109G 231G 1.08K392 22.3M 497K > replacing - - 0 1012 0 5.72M > c1t4d0 - - 0753 0 5.73M > c1t5d0 - - 0790 0 5.72M > c2t12d0- -339177 9.46M 149K > c2t13d0- -317177 9.08M 149K > c3t12d0- -330181 9.27M 147K > c3t13d0- -352180 9.45M 146K > raidz1100G 240G117101 373K 225K > c1t3d0 - - 65 33 3.99M 64.1K > c2t10d0- - 60 44 3.77M 63.2K > c2t11d0- - 62 42 3.87M 63.4K > c3t10d0- - 63 42 3.88M 62.3K > c3t11d0- - 65 35 4.06M 61.8K > raidz1 96.2G 244G234164 768K 415K > c1t2d0 - -129 49 7.85M 112K > c2t8d0 - -133 54 8.05M 112K > c2t9d0 - -132 56 8.08M 113K > c3t8d0 - -132 52 8.01M 113K > c3t9d0 - -132 49 8.16M 112K > > - Bill > > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] thousands of ZFS file systems
Hi, My suggestion is to direct any command output that may print thousands of lines to a file. I have not tried that number of FSs. So, my first suggestion is to have a lot of physical memory installed. The second item that would concern me is path translation going through a lot of mount points. I think I remember that in some old code there was a limit of 256 mount points through a path. I don't know if it still exists. Mitchell Erblich - > Rafael Friedlander wrote: > > Hi, > > An IT organization needs to implement highly available file server, > using Solaris 10, SunCluster, NFS and Samba. We are talking about > thousands, even 10s of thousands of ZFS file systems. > > Is this doable? Should I expect any impact on performance or stability > due to the fact I'll have that many mounted filesystems, with > everything implied from that fact ('df | wc -l' with thousands of > lines of result, for instance)? > > Thanks, > > Rafael. > -- > > --- > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] copying a large file..
Hi, How much time is a "long time"? Second, had a snapshot been taken after the file was created? Are the src and dst directories in the same slice? What other work was being done at the time of the move? Were there numerous files in the src or dst directories? How much physical memory is in your system? Does an equivalent move take a drastically shorter amount of time if done right after a reboot? Mitchell Erblich -- Pavan Reddy wrote: > > 'mv' command took very long time to copy a large file from one ZFS directory > to another. The directories share the same pool and file system. I had a 385 > MB file in one directory and wanted to move that to a different directory. > It took long time to move. Any particular reasons? There is no raid involved. > > -Pavan > > > This message posted from opensolaris.org > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ENOSPC : No space on file deletion
Matthew, et al, You haven't identified a solution / workaround. There is one "large" file within the FS and snapshot that has been backed up. They wish to remove this large file, and the system is preventing this because of an additional reference from the snapshot. For some good reason of their own, they do not wish to remove the entire snapshot. Then how do you FORCIBLY remove a file that fails with a no-space error? Are you telling me that there is no way to access a single file within the snapshot and remove it? Mitchell Erblich Matthew Ahrens wrote: > > Erblichs wrote: > > Now the stupid question.. > > If the snapshot is identical to the FS, I can't > > remove files from the FS because of the snapshot > > and removing files from the snapshot only removes > > a reference to the file and leaves the memory. > > > > So, how do I do a atomic file removes on both the > > original and the snapshot(s). Yes, I am assuming that > > I have backed up the file offline. > > > > Can I request a possible RFE to be able to force a > > file remove from the original FS and if found elsewhere > > remove that location too IFF a ENOSPC would fail the > > original rm? > > No, you can not remove files from snapshots. Snapshots can not be > changed. If you are out of space because of snapshots, you can always > 'zfs destroy' the snapshot :-) > > --matt ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] ENOSPC : No space on file deletion
Hey guys, I think I know what is going on. A set of files was attempted to be deleted on a FS that had almost consumed its reservation. It failed because one or more snapshots hold references to these files and the snaps needed to allocate FS space. Thus the no-space error. Now the stupid question.. If the snapshot is identical to the FS, I can't remove files from the FS because of the snapshot, and removing files from the snapshot only removes a reference to the file and leaves the memory. So, how do I do an atomic file remove on both the original and the snapshot(s)? Yes, I am assuming that I have backed up the file offline. Can I request a possible RFE to be able to force a file remove from the original FS and, if the file is found elsewhere, remove that location too IFF an ENOSPC would fail the original rm? Thanks, Mitchell Erblich ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Self-tuning recordsize
Group, et al, I don't understand. If the problem is systemic, based on the number of continually dirty pages and the stress of cleaning those pages, then it is FS independent, because any number of different installed FSs can equally consume pages. Thus, solving the problem on a per-FS basis seems to me a band-aid approach. Then why doesn't the OS determine that a dangerously high watermark number of pages is continually being paged out (we have swapped, and a large percentage of available pages is always dirty, based on recent past history) and thus: * force the writes to a set of predetermined pages (limit the number of pages for I/O), * schedule I/O for these pages immediately, not waiting until the pages are needed and found dirty (hopefully a percentage of these pages will be cleaned and be immediately available if needed in the near future). Yes, the OS could redirect the I/O as direct I/O without using the page cache, but the assumption is that these procs are behaving as multiple readers and need the cached page data in the near future. Thus, turning off caching because these pages CAN totally consume the cache removes the multiple readers' reason to cache the data in the first place, thus... 
* guarantee that heartbeats are always regular by preserving 5 to 20% of pages for exec / text, * limit the number of interrupts being generated by the network so that low-level SCSI interrupts can page and not be starved (something the white paper did not mention) (yes, this will cause the loss of UDP-based data, but we need to generate some form of backpressure / explicit congestion event), * if the files coming in from the network were TCP based, hopefully a segment would be dropped and act as backpressure to the originator of the data, * if the files are being read from the FS, then a max I/O rate should be determined based on the number of pages that are clean and ready to accept FS data, * etc. Thus, tuning whether the page cache should be used for writes or reads should allow one set of processes not to adversely affect the operation of other processes. And any OS should slow down the dirty I/O pages only for those specific processes, with other processes' work being unaffected by the I/O issues. Mitchell Erblich - Richard Elling - PAE wrote: > > Roch wrote: > > Oracle will typically create it's files with 128K writes > > not recordsize ones. > > Blast from the past... > http://www.sun.com/blueprints/0400/ram-vxfs.pdf > > -- richard > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Self-tuning recordsize
Nico, Yes, I agree. But single random large reads and writes would also benefit from a large record size, so I didn't try to make that distinction. However, I "guess" that the best random large reads & writes would fall within single filesystem record sizes. No, I haven't reviewed whether the holes (disk block space) tend to be multiples of record size, page size, or .. Would a write of recordsize that didn't fall on a record-size boundary write into 2 filesystem blocks / records? However, would extremely large record sizes, say 1MB (or more) (what is the limit?), open up write-atomicity or file-corruption issues? Would record sizes like these be equal to multiple-track writes? Also, because of the "disk block" allocation strategy, I wasn't too sure that any form of multiple disk block contiguousness still applied with ZFS with smaller record sizes.. Yes, to minimize seek and rotational latencies and help with read-ahead and "write-behind"... Oh, but once writes have begun to the file, in the past, this has frozen the recordsize. So "self-tuning" or adjustments NEED to be decided, probably at the create of the FS object. OR some type of copy mechanism needs to be done to a new file with a different record size at a later time, when the default or past record size was determined to be significantly incorrect. Yes, I assume that many reads / writes will occur in the future that will amortize the copy cost. So, yes group... I am still formulating the "best" algorithm for this. ZFS applies a lot of knowledge gained from UFS (page list stuff, chksum stuff, large-file awareness/support), but adds a new twist to things.. Mitchell Erblich -- Nicolas Williams wrote: > > On Fri, Oct 13, 2006 at 09:22:53PM -0700, Erblichs wrote: > > For extremely large files (25 to 100GBs), that are accessed > > sequentially for both read & write, I would expect 64k or 128k. 
> > Large files accessed sequentially don't need any special heuristic for > record size determination: just use the filesystem's record size and be > done. The bigger the record size, the better -- a form of read ahead. > > Nico > -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Self-tuning recordsize
Group, I am not sure I agree with the 8k size. Since "recordsize" is based on the size of filesystem blocks for large files, my first consideration is what the max size of the file object will be. For extremely large files (25 to 100GBs) that are accessed sequentially for both read & write, I would expect 64k or 128k. Putpage functions attempt to grab a number of pages off the vnode and place their modified contents within disk blocks. Thus, if disk blocks are larger, fewer of them are needed, which can result in more efficient operations. However, any small change to the filesystem block results in the entire filesystem block being accessed, so small accesses to the block are very inefficient. Lastly, access to a larger block will occupy the media for a longer period of continuous time, possibly creating a larger latency than necessary for another unrelated op. Hope this helps... Mitchell Erblich --- Nicolas Williams wrote: > > On Fri, Oct 13, 2006 at 08:30:27AM -0700, Matthew Ahrens wrote: > > Jeremy Teo wrote: > > >Would it be worthwhile to implement heuristics to auto-tune > > >'recordsize', or would that not be worth the effort? > > > > It would be really great to automatically select the proper recordsize > > for each file! How do you suggest doing so? 
> > I would suggest the following: > > - on file creation start with record size = 8KB (or some such smallish > size), but don't record this on-disk yet > > - keep the record size at 8KB until the file exceeds some size, say, > .5MB, at which point the most common read size, if there were enough > reads, or the most common write size otherwise, should be used to > derive the actual file record size (rounding up if need be) > > - if the selected record size != 8KB then re-write the file with the > new record size > > - record the file's selected record size in an extended attribute > > - on truncation keep the existing file record size > > - on open of non-empty files without associated file record size stick > to the original approach (growing the file block size up to the FS > record size, defaulting to 128KB) > > I think we should create a namespace for Solaris-specific extended > attributes. > > The file record size attribute should be writable, but changes in record > size should only be allowed when the file is empty or when the file data > is in one block. E.g., writing "8KB" to a file's RS EA when the file's > larger than 8KB or consists of more than one block should appear to > succeed, but a subsequent read of the RS EA should show the previous > record size. > > This approach might lead to the creation of new tunables for controlling > the heuristic (e.g., which heuristic, initial RS, file size at which RS > will be determined, default RS when none can be determined). > > Nico > -- > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] zfs_vfsops.c : zfs_vfsinit() : line 1179: Src inspection
Group, If there is a bad vfs ops template, why wouldn't you just return(error) instead of trying to create the vnode ops template? My suggestion is, after the cmn_err(), return(error); Mitchell Erblich --- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] single memory allocation in the ZFS intent log
Group, This example is done with a single-threaded app. It is NOT NOT NOT intended to show any level of thread-safe coding. It is ONLY used to show that it is significantly cheaper to grab pre-allocated objects than to allocate the objects on demand. Thus, grabbing 64-byte chunks off a free list and placing them back on can be done with this simple base code even when dealing with 1Gb/sec interfaces. Under extreme circumstances, the normal on-demand allocator can sleep if it needs to coalesce memory or steal it from another's cache. Mitchell Erblich - Frank Hofmann wrote: > > On Thu, 5 Oct 2006, Erblichs wrote: > > > Casper Dik, > > > > After my posting, I assumed that a code question should be > > directed to the ZFS code alias, so I apologize to the people > > show don't read code. However, since the discussion is here, > > I will post a code proof here. Just use "time program" to get > > a generic time frame. It is under 0.1 secs for 500k loops > > (each loop does removes a obj and puts it back). > > > > It is just to be used as a proof of concept that a simple > > pre-alloc'ed set of objects can be accessed so much faster > > than allocating and assigning them. > > Ok, could you please explain how is this piece (and all else, for that > matter): > > /* > * Get a node structure from the freelist > */ > struct node * > node_getnode() > { > struct node *node; > > if ((node = nodefree) == NULL) /* "shouldn't happen" */ > printf("out of nodes"); > > nodefree = node->node_next; > node->node_next = NULL; > > return (node); > } > > is multithread-safe ? > > Best wishes, > FrankH. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] single memory allocation in the ZFS intent log
Casper Dik, After my posting, I assumed that a code question should be directed to the ZFS code alias, so I apologize to the people who don't read code. However, since the discussion is here, I will post a code proof here. Just use "time program" to get a generic time frame. It is under 0.1 secs for 500k loops (each loop removes an obj and puts it back). It is just to be used as a proof of concept that a simple pre-alloc'ed set of objects can be accessed much faster than allocating and assigning them. To support the change to the intent log objects, I would suggest first identifying the number of objects normally allocated and using that as a working set of objects. A time element is also needed to identify when objects should be released from the free list back to the memory cache. Yes, the initial thought of having per-CPU allocs is coded in, which would allow multiple simultaneous accesses to a freelist per CPU. This should remove most of the mutex code necessary for scalability. Objective --- What this app code proves is that pre-alloc'ed items can be removed off a simple free list and stored back on a free list. This is just an inception proof that shows "fast access" to a working set of objects. The time to make one chunk alloc, place all of the pieces on a free list, and then perform 500k ops of removals and insertions is probably somewhere 50 to 1000x faster than even the best memory allocators allocating/retrieving 500k items. If a dynamic list of nodes is required, the chunk alloc should be changed. This quick piece of app code runs in less than 0.1 sec with 500k retrieve and store ops. This is fast enough to grab 64-byte chunks dealing with even a 1Gb Eth link. Even though this code is simplified, it indicates that kmem_allocs would have the same benefits even without sleeping. - The code does a single pre-alloc and then breaks up the alloc into N node pieces. It takes each piece and places it on a free list in the init section. 
The assumption here is that we have a fixed, reasonable number of items. If the number of items is dynamic, the init could easily alloc a number of nodes, then use watermarks to alloc into and free from the free list as the nodes are used. If the logic is used to deal with kmem ops, then any free nodes could be returned to memory as excess nodes accumulate on the free list. This type of logic is normally used when a special program project requires non-standard interfaces to guarantee a HIGH level of performance. The main has a hard-coded 500k loops, each of which allocs one node and then frees it. Thus, 500k equivalent allocs would need to be done. This ran in 0.02 to 0.35 secs on a 1.3GHz laptop Linux box. - It is my understanding that Bonwick's new allocator was created to remove fragmentation. And yes, it also allows the OS to reduce the overhead of dealing with mem objects of processes that are freed and alloc'ed frequently. When the system gets low on memory, it steals freed objects that are being cached. However, even with no SLEEPING, I have yet to see it perform as fast as simple retrieves and stores. Years ago, the amount of memory on a system was limited due to its expense. This is no longer the case. Some/most processes/threads could have a decent increase in performance if the amount of work done on a working set of objects is minimized. Up to this workset, I propose that an almost guaranteed level of performance could be achieved. With the comment that any type of functionality that has merit should get an API, so multiple processes / threads can use that functionality. Years ago I was in the process of doing that when I left a company with an ARC group. It was to add a layer of working-set mem objects that would have "fast access" properties. I will ONLY GUARANTEE that X working-set objects, once freed to the FREE LIST, can be re-allocated without waiting for the objects. 
Any count beyond that working set has the same underlying properties. Except that if I KNOW the number on my freelist has dropped to a small value, I could pre-request more objects. The latency of retrieving these objects could thus be minimized. This logic then removes on-demand memory allocations, so any WAIT time MAY not affect the other parts of the process that need more
Re: [zfs-discuss] single memory allocation in the ZFS intent log
Casper Dik, Yes, I am familiar with Bonwick's slab allocator and tried it for a wirespeed test of 64-byte pieces for 1Gb, then 100Mb, and lastly 10Mb Eths. My results were not encouraging. I assume it has improved over time. First, let me ask what happens to the FS if the allocs in the intent log code are sleeping, waiting for memory. IMO, the general problems with memory allocators are: - getting memory from a "cache" of one's own size/type is orders of magnitude more expensive than just getting some off one's own freelist, - there is a built-in latency to recuperate/steal memory from other processes, - this stealing forces a sleep and context switches, - the amount of time to sleep is indeterminate with a single call per struct. How long can you sleep for? 100ms or 250ms or more? - no process can guarantee a working set. In the days when memory was expensive, maybe a global sharing mechanism made sense, but when the amount of memory is somewhat plentiful and cheap, *** it then makes sense to use a 2-stage implementation: preallocation of a working set, and then normal allocation with the added latency. So, it makes sense to pre-allocate a working set of allocs with a single alloc call, break up the alloc into the needed sizes, and then alloc from your own free list -> if that freelist then empties, maybe then take the extra overhead of the kmem call. Consider this an expected cost of exceeding a certain watermark. But otherwise, I bet if I give you some code for the pre-alloc, 10 allocs from the freelist can be done versus one kmem_alloc call, and at least 100 to 10k allocs if a sleep occurs on your side. Actually, I think it is so bad, why don't you time 1 kmem_free versus grabbing elements off the freelist? However, don't trust me; I will drop a snapshot of the code to you tomorrow if you want, and you can make a single-CPU benchmark comparison. 
Your multiple-CPU issue forces me to ask: is it a common occurrence that 2 or more CPUs are simultaneously requesting memory for the intent log? If it is, then there should be a freelist of a low-watermark set of elements per CPU. However, one thing at a time.. So, do you want that code? It will be a single alloc of X units, which are then placed on a freelist. You then time how long it takes to remove Y elements from the freelist versus 1 kmem_alloc with a NO_SLEEP arg, and report the numbers. Then I would suggest the call with the smallest sleep possible. How many allocs can then be done? 25k, 35k, more? Oh, the reason we aren't timing the initial kmem_alloc call for the freelist is that I expect it to occur during init, which would not proceed until the memory is alloc'ed. Mitchell Erblich [EMAIL PROTECTED] wrote: > > > at least one location: > > > > When adding a new dva node into the tree, a kmem_alloc is done with > > a KM_SLEEP argument. > > > > thus, this process thread could block waiting for memory. > > > > I would suggest adding a pre-allocated pool of dva nodes. > > This is how the Solaris memory allocator works. It keeps pools of > "pre-allocated" nodes about until memory conditions are low. > > > When a new dva node is needed, first check this pre-allocated > > pool and allocate from their. > > There are two reasons why this is a really bad idea: > > - the system will run out of memory even sooner if people > start building their own free-lists > > - a single freelist does not scale; at two CPUs it becomes > the allocation bottleneck (I've measured and removed two > such bottlenecks from Solaris 9) > > You might want to learn about how the Solaris memory allocator works; > it pretty much works like you want, except that it is all part of the > framework. And, just as in your case, it does run out some times but > a private freelist does not help against that. > > > Why? This would eliminate a possible sleep condition if memory > > is not immediately available. 
The pool would add a working > > set of dva nodes that could be monitored. Per alloc latencies > > could be amortized over a chunk allocation. > > That's how the Solaris memory allocator already works. > > Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] single memory allocation in the ZFS intent log
group, In at least one location: when adding a new dva node into the tree, a kmem_alloc is done with a KM_SLEEP argument. Thus, this process thread could block waiting for memory. I would suggest adding a pre-allocated pool of dva nodes. When a new dva node is needed, first check this pre-allocated pool and allocate from there. Why? This would eliminate a possible sleep condition if memory is not immediately available. The pool would add a working set of dva nodes that could be monitored. Per-alloc latencies could be amortized over a chunk allocation. Lastly, if memory is scarce, a long time may pass before this node could be allocated into the tree. If the number is monitored, it is possible that restricted operations could be done until the intent log has decreased in size. I can supply untested code within 24 hours if wanted. Mitchell Erblich ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss