Re: svn commit: r270759 - in head/sys: cddl/compat/opensolaris/kern cddl/compat/opensolaris/sys cddl/contrib/opensolaris/uts/common/fs/zfs vm
On Wednesday, September 10, 2014 11:29:21 PM Slawa Olhovchenkov wrote:
> On Wed, Sep 10, 2014 at 02:29:30PM -0400, John Baldwin wrote:
> > > Application must change behaviour when reach limit (run GC, switch
> > > algorithm and etc.).
> > > But mmap of big data file for scaning and processing don't touch this
> > > limit.
> > > Must be mmap of some temoprary file touch this limit? I don't know.
> > > Must be MAP_ANON touch this limit? I think yes.
> > > How to make distinction of mmap data file for processing and mmap
> > > temporary file for bypass this limit? I don't know.
> >
> > Consider also mmap() of shm_open(SHM_ANON).
>
> No, shm limited separatly. mmap of shm_open(SHM_ANON) don't need to adjust
> this limit.

It would be another trivial way of escaping RLIMIT_DATA, however.

> > > PS: question about jemalloc. How I can stop jemalloc to give back
> > > memory to OS for some application?
> >
> > You would have to hack pages_purge() in chunk_mmap.c to always return
> > true and not call madvise().
>
> And got memory leak?

No, that doesn't leak memory. jemalloc considers that address range free
regardless and will always reuse it for a future allocation. This is just a
hint to let the VM system take back the backing pages because they are
currently not in use by the application; if the address is reused, the
current contents of the pages are not important (they can be thrown away).

> I am talk about change behaviour from periodic mmap and madvise memory
> to only mmap when need and leave for future use when freed by
> aplication. I think (may be wrong -- please correct) in some cases
> mmap/madvise will be too often.

The madvise() does not cause a future mmap(). You would have to munmap() to
require a future mmap(). Instead, the madvise() can result in the page being
released back to the system. Later, when that address is reused by jemalloc,
the process will fault on the address, requiring the VM to locate a new page.
Disabling the madvise() only prevents that fault from occurring. (However,
leaving all the pages around and marked as in-use may increase memory
pressure and cause you to swap, and you would still have to fault the page
back in in the future, but at greater expense.)

--
John Baldwin
___ svn-src-all@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/svn-src-all To unsubscribe, send any mail to "svn-src-all-unsubscr...@freebsd.org"
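The madvise()-as-a-hint behaviour described above can be sketched with Python's mmap module. This is a minimal illustration, not jemalloc's actual code: Linux's MADV_DONTNEED stands in for the MADV_FREE hint discussed in the thread, and `purge_demo` is a hypothetical name. The point is that after the hint the address range stays mapped (no later mmap() is needed), while the backing pages are given up and come back zero-filled on the next touch.

```python
import mmap

PAGE = mmap.PAGESIZE

def purge_demo():
    # Private anonymous mapping standing in for a jemalloc chunk.
    m = mmap.mmap(-1, 4 * PAGE, flags=mmap.MAP_PRIVATE)
    m[0:5] = b"hello"                 # dirty the first page
    # The hint: the kernel may reclaim the backing pages, but the address
    # range stays mapped, so no later mmap() is needed to reuse it.
    m.madvise(mmap.MADV_DONTNEED)     # Linux stand-in for FreeBSD's MADV_FREE
    # Touching the range again faults in fresh zero-filled pages.
    reused = bytes(m[0:5])
    m.close()
    return reused

print(purge_demo())  # b'\x00\x00\x00\x00\x00' on Linux
```

Disabling the hint, as suggested for pages_purge(), would keep the old contents resident — at the cost of the memory pressure described above.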
Re: svn commit: r270759 - in head/sys: cddl/compat/opensolaris/kern cddl/compat/opensolaris/sys cddl/contrib/opensolaris/uts/common/fs/zfs vm
On Wed, Sep 10, 2014 at 02:29:30PM -0400, John Baldwin wrote:
> > Application must change behaviour when reach limit (run GC, switch
> > algorithm and etc.).
> > But mmap of big data file for scaning and processing don't touch this
> > limit.
> > Must be mmap of some temoprary file touch this limit? I don't know.
> > Must be MAP_ANON touch this limit? I think yes.
> > How to make distinction of mmap data file for processing and mmap
> > temporary file for bypass this limit? I don't know.
>
> Consider also mmap() of shm_open(SHM_ANON).

No, shm is limited separately. mmap of shm_open(SHM_ANON) doesn't need to
adjust this limit.

> > And again, most application don't handle correctly NULL of malloc().
> > I think this is depreciate of all.
>
> This is true, but it seems Firefox might, and you could set RLIMIT_AS
> for Firefox in particular so it gets adequate feedback. I do think if
> you can determine the "correct" value for RLIMIT_AS for Firefox that
> mmap(MAP_ANON) would fail and result in NULL from malloc().

I did try limiting Firefox with ulimit -v some time ago; Firefox just
crashed. It didn't run GC or anything else.

> > PS: question about jemalloc. How I can stop jemalloc to give back
> > memory to OS for some application?
>
> You would have to hack pages_purge() in chunk_mmap.c to always return true
> and not call madvise().

And got memory leak?

I am talking about changing the behaviour from periodic mmap and madvise of
memory to mmap only when needed, leaving freed memory for future use. I
think (may be wrong -- please correct) in some cases mmap/madvise will be
too often.
Re: svn commit: r270759 - in head/sys: cddl/compat/opensolaris/kern cddl/compat/opensolaris/sys cddl/contrib/opensolaris/uts/common/fs/zfs vm
On Wednesday, September 10, 2014 08:30:58 PM Slawa Olhovchenkov wrote:
> On Wed, Sep 10, 2014 at 10:44:46AM -0400, John Baldwin wrote:
> > On Tuesday, September 09, 2014 02:25:18 PM Slawa Olhovchenkov wrote:
> > > On Mon, Sep 08, 2014 at 11:17:51AM -0400, John Baldwin wrote:
> > > > On Sunday, September 07, 2014 04:56:49 PM Slawa Olhovchenkov wrote:
> > > > > PS: very bad that 'data limit' don't anymore reflect application
> > > > > memory consumer. and very small application can adapt to 'no memory'
> > > > > from system.
> > > >
> > > > You can use RLIMIT_AS instead of RLIMIT_DATA. jemalloc can also be
> > > > configured to use sbrk(), though I think there's no way to prevent it
> > > > from falling back to mmap(MAP_ANON) if sbrk() fails.
> > >
> > > Formally you are right. Realy this is scorn. And I don't know how
> > > rightly. May be account in RLIMIT_DATA MAP_ANON? Anyway firefox don't
> > > run GC if malloc fail and this is bad.
> >
> > It was not intended as scorn, simply stating what options you have
>
> Hmm, this is may bad english and try to translate some russian sentence.
> I am try to do some tuning for satisfaction me.
> You proposal do some limiting, but don't delight. Ok.
>
> > available as alternatives to what you are asking for. The Firefox
> > behavior certainly seems unfortunate. :( However, using RLIMIT_AS
> > for that should work as the mmap() call malloc() makes will fail
> > causing malloc() to fail. However, you need RLIMIT_AS to be a bit
> > larger than what you would set RLIMIT_DATA to.
> >
> > As to having MAP_ANON honor RLIMIT_DATA, I'm not sure how well that
> > would work. I assume that was discussed as an option in the previous
> > threads on the topic of RLIMIT_DATA being useless with jemalloc.
> > I don't recall it though. Perhaps Alan can speak to whether or not
> > that is a good idea?
>
> What is sense of RLIMIT_DATA (as I understand)?
> This is signaling to application of memory consumption.

Yes, traditionally it only affected malloc() and not other portions of the
address space like the stack (RLIMIT_STACK) or anything in shared libraries
(including text/data/bss).

> Limiting RSS don't visible to application.

Yes, RLIMIT_RSS is not often useful because it is not synchronous with the
RSS changing.

> Application must change behaviour when reach limit (run GC, switch
> algorithm and etc.).
> But mmap of big data file for scaning and processing don't touch this
> limit.
> Must be mmap of some temoprary file touch this limit? I don't know.
> Must be MAP_ANON touch this limit? I think yes.
> How to make distinction of mmap data file for processing and mmap
> temporary file for bypass this limit? I don't know.

Consider also mmap() of shm_open(SHM_ANON).

> And again, most application don't handle correctly NULL of malloc().
> I think this is depreciate of all.

This is true, but it seems Firefox might, and you could set RLIMIT_AS
for Firefox in particular so it gets adequate feedback. I do think if
you can determine the "correct" value for RLIMIT_AS for Firefox that
mmap(MAP_ANON) would fail and result in NULL from malloc().

> PS: question about jemalloc. How I can stop jemalloc to give back
> memory to OS for some application?

You would have to hack pages_purge() in chunk_mmap.c to always return true
and not call madvise().

--
John Baldwin
Re: svn commit: r270759 - in head/sys: cddl/compat/opensolaris/kern cddl/compat/opensolaris/sys cddl/contrib/opensolaris/uts/common/fs/zfs vm
On Wed, Sep 10, 2014 at 10:44:46AM -0400, John Baldwin wrote:
> On Tuesday, September 09, 2014 02:25:18 PM Slawa Olhovchenkov wrote:
> > On Mon, Sep 08, 2014 at 11:17:51AM -0400, John Baldwin wrote:
> > > On Sunday, September 07, 2014 04:56:49 PM Slawa Olhovchenkov wrote:
> > > > PS: very bad that 'data limit' don't anymore reflect application
> > > > memory consumer. and very small application can adapt to 'no memory'
> > > > from system.
> > >
> > > You can use RLIMIT_AS instead of RLIMIT_DATA. jemalloc can also be
> > > configured to use sbrk(), though I think there's no way to prevent it
> > > from falling back to mmap(MAP_ANON) if sbrk() fails.
> >
> > Formally you are right. Realy this is scorn. And I don't know how
> > rightly. May be account in RLIMIT_DATA MAP_ANON? Anyway firefox don't
> > run GC if malloc fail and this is bad.
>
> It was not intended as scorn, simply stating what options you have

Hmm, this is may bad english and try to translate some russian sentence.
I am try to do some tuning for satisfaction me.
You proposal do some limiting, but don't delight. Ok.

> available as alternatives to what you are asking for. The Firefox
> behavior certainly seems unfortunate. :( However, using RLIMIT_AS
> for that should work as the mmap() call malloc() makes will fail
> causing malloc() to fail. However, you need RLIMIT_AS to be a bit
> larger than what you would set RLIMIT_DATA to.
>
> As to having MAP_ANON honor RLIMIT_DATA, I'm not sure how well that
> would work. I assume that was discussed as an option in the previous
> threads on the topic of RLIMIT_DATA being useless with jemalloc.
> I don't recall it though. Perhaps Alan can speak to whether or not
> that is a good idea?

What is sense of RLIMIT_DATA (as I understand)?
This is signaling to application of memory consumption.
Limiting RSS don't visible to application.
Application must change behaviour when reach limit (run GC, switch
algorithm and etc.).
But mmap of big data file for scaning and processing don't touch this
limit.
Must be mmap of some temoprary file touch this limit? I don't know.
Must be MAP_ANON touch this limit? I think yes.
How to make distinction of mmap data file for processing and mmap
temporary file for bypass this limit? I don't know.

And again, most application don't handle correctly NULL of malloc().
I think this is depreciate of all.

PS: question about jemalloc. How I can stop jemalloc to give back
memory to OS for some application?
Re: svn commit: r270759 - in head/sys: cddl/compat/opensolaris/kern cddl/compat/opensolaris/sys cddl/contrib/opensolaris/uts/common/fs/zfs vm
On Tuesday, September 09, 2014 02:25:18 PM Slawa Olhovchenkov wrote:
> On Mon, Sep 08, 2014 at 11:17:51AM -0400, John Baldwin wrote:
> > On Sunday, September 07, 2014 04:56:49 PM Slawa Olhovchenkov wrote:
> > > PS: very bad that 'data limit' don't anymore reflect application
> > > memory consumer. and very small application can adapt to 'no memory'
> > > from system.
> >
> > You can use RLIMIT_AS instead of RLIMIT_DATA. jemalloc can also be
> > configured to use sbrk(), though I think there's no way to prevent it
> > from falling back to mmap(MAP_ANON) if sbrk() fails.
>
> Formally you are right. Realy this is scorn. And I don't know how
> rightly. May be account in RLIMIT_DATA MAP_ANON? Anyway firefox don't
> run GC if malloc fail and this is bad.

It was not intended as scorn, simply stating what options you have
available as alternatives to what you are asking for. The Firefox
behavior certainly seems unfortunate. :( However, using RLIMIT_AS
for that should work as the mmap() call malloc() makes will fail,
causing malloc() to fail. However, you need RLIMIT_AS to be a bit
larger than what you would set RLIMIT_DATA to.

As to having MAP_ANON honor RLIMIT_DATA, I'm not sure how well that
would work. I assume that was discussed as an option in the previous
threads on the topic of RLIMIT_DATA being useless with jemalloc.
I don't recall it though. Perhaps Alan can speak to whether or not
that is a good idea?

--
John Baldwin
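The suggestion above — cap RLIMIT_AS a bit above the intended data limit so that mmap(MAP_ANON), and therefore malloc(), fails with feedback the application can see — can be tried from a script. The following is a hedged, Linux-only sketch (the helper names and the 128 MiB headroom are arbitrary choices for illustration, not anything from the thread):

```python
import resource

def vmsize_bytes():
    # Current virtual address-space size of this process (Linux /proc).
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("VmSize:"):
                return int(line.split()[1]) * 1024
    raise RuntimeError("VmSize not found")

def alloc_under_as_limit(limit_bytes, alloc_bytes):
    # Temporarily cap RLIMIT_AS, attempt one big allocation, then restore.
    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, hard))
    try:
        try:
            buf = bytearray(alloc_bytes)  # large allocations are mmap-backed
            return "ok" if len(buf) == alloc_bytes else "short"
        except MemoryError:
            return "failed"  # the script-level view of malloc() returning NULL
    finally:
        resource.setrlimit(resource.RLIMIT_AS, (soft, hard))

headroom = 128 * 2**20  # leave room for the interpreter itself
print(alloc_under_as_limit(vmsize_bytes() + headroom, 512 * 2**20))  # failed
print(alloc_under_as_limit(vmsize_bytes() + headroom, 2**20))        # ok
```

This also shows why the limit must be "a bit larger" than the data limit you have in mind: everything already mapped (text, libraries, stack) counts against RLIMIT_AS, so a cap with no headroom kills the process before the allocator even runs.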
Re: svn commit: r270759 - in head/sys: cddl/compat/opensolaris/kern cddl/compat/opensolaris/sys cddl/contrib/opensolaris/uts/common/fs/zfs vm
On Mon, Sep 08, 2014 at 11:17:51AM -0400, John Baldwin wrote:
> On Sunday, September 07, 2014 04:56:49 PM Slawa Olhovchenkov wrote:
> > PS: very bad that 'data limit' don't anymore reflect application
> > memory consumer. and very small application can adapt to 'no memory'
> > from system.
>
> You can use RLIMIT_AS instead of RLIMIT_DATA. jemalloc can also be
> configured to use sbrk(), though I think there's no way to prevent it
> from falling back to mmap(MAP_ANON) if sbrk() fails.

Formally you are right. Realy this is scorn. And I don't know how
rightly. May be account in RLIMIT_DATA MAP_ANON? Anyway firefox don't
run GC if malloc fail and this is bad.
Re: svn commit: r270759 - in head/sys: cddl/compat/opensolaris/kern cddl/compat/opensolaris/sys cddl/contrib/opensolaris/uts/common/fs/zfs vm
On Sunday, September 07, 2014 04:56:49 PM Slawa Olhovchenkov wrote:
> PS: very bad that 'data limit' don't anymore reflect application
> memory consumer. and very small application can adapt to 'no memory'
> from system.

You can use RLIMIT_AS instead of RLIMIT_DATA. jemalloc can also be
configured to use sbrk(), though I think there's no way to prevent it
from falling back to mmap(MAP_ANON) if sbrk() fails.

--
John Baldwin
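The distinction the rest of the thread keeps returning to — sbrk() growth is charged against RLIMIT_DATA, while mmap(MAP_ANON) traditionally is charged only against RLIMIT_AS — can be modelled with a toy accounting class. This is purely illustrative (the class, method names, and numbers are made up), but it captures why a jemalloc-style mmap-based allocator escapes RLIMIT_DATA:

```python
class Process:
    """Toy model of traditional BSD resource accounting."""

    def __init__(self, rlimit_data, rlimit_as):
        self.rlimit_data = rlimit_data
        self.rlimit_as = rlimit_as
        self.data = 0   # brk/sbrk-grown data segment
        self.vsize = 0  # total mapped address space

    def sbrk(self, n):
        # Data-segment growth honors both RLIMIT_DATA and RLIMIT_AS.
        if self.data + n > self.rlimit_data or self.vsize + n > self.rlimit_as:
            return False  # models sbrk() failing with ENOMEM
        self.data += n
        self.vsize += n
        return True

    def mmap_anon(self, n):
        # mmap(MAP_ANON) traditionally honors only RLIMIT_AS.
        if self.vsize + n > self.rlimit_as:
            return False
        self.vsize += n
        return True

p = Process(rlimit_data=64, rlimit_as=256)
print(p.sbrk(128))       # False: RLIMIT_DATA stops an sbrk-based malloc
print(p.mmap_anon(128))  # True: ...but not an mmap-based one
print(p.mmap_anon(256))  # False: only RLIMIT_AS eventually does
```

This is exactly the trade-off in the advice above: with jemalloc on mmap, RLIMIT_DATA becomes a no-op and RLIMIT_AS is the only cap that still bites.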
Re: svn commit: r270759 - in head/sys: cddl/compat/opensolaris/kern cddl/compat/opensolaris/sys cddl/contrib/opensolaris/uts/common/fs/zfs vm
On 2014-09-03 21:18, Steven Hartland wrote:
> - Original Message - From: "Andriy Gapon"
>
>> on 03/09/2014 23:22 Nikolai Lifanov said the following:
>>> On 09/03/14 15:22, John Baldwin wrote:
>>>> On Wednesday, September 03, 2014 11:05:04 AM Nikolai Lifanov wrote:
>>>>> On 09/03/14 04:09, Steven Hartland wrote:
>>>>>> I'm looking to MFC this change so wanted to check if
>>>>>> anyone had an final feedback / objections?
>>>>>>
>>>>>> I know we currently have Alan's feedback on changing
>>>>>> the #ifdef __i386__ to #ifndef UMA_MD_SMALL_ALLOC
>>>>>> which sounds sensible but waiting Peter to comment on.
>>>>>>
>>>>>> Regards
>>>>>> Steve
>>>>>
>>>>> I have no technical input, but this change improves ARC usefulness
>>>>> for me quite a bit. I would like to see the improvement in 10-STABLE.
>>>>
>>>> Can you verify that the current 10-STABLE (as of today) with all the
>>>> various pagedaemon fixes still has ARC issues for your workload?
>>>
>>> It doesn't have any issues, but I noticed the improvement on CURRENT.
>>> I observed that just after this change, my package builder is much
>>> more likely to retain MFU and not evict useful things from there (the
>>> port tree) after large builds.
>>> However, I run a lot more 10.0-RELEASE than CURRENT and I would like
>>> to see this improvement release-bound.
>>>
>>> I would be happy to test this on 10-STABLE if you think that this is
>>> relevant.
>>
>> As noted before, unfortunately, this commit (plus its fixups) contains
>> at least two related but distinct changes. So, to separate the wheat
>> from the chaff, could you please try to comment out the following block
>> in sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c, function
>> arc_reclaim_needed:
>>
>>	if (kmem_free_count() < zfs_arc_free_target) {
>>		DTRACE_PROBE2(arc__reclaim_freetarget, uint64_t,
>>		    kmem_free_count(), uint64_t, zfs_arc_free_target);
>>		return (1);
>>	}
>>
>> Alternatively, I think that the same effect can be achieved by setting
>> sysctl vfs.zfs.arc_free_target to the same value as
>> vm.stats.vm.v_free_min.
>
> Thats correct that would achieve the same thing.
>
>> It's interesting to me whether you would still see the better
>> performance or if that improvement would be undone.
>
> Indeed that would be interesting, but we might find that its quite
> memory size dependent given the scaling so confirming HW details would
> be nice too.
>
> I'd also be interested to know who wins the free race between the VM
> and ARC when using that value.
>
> For those following this thread but not the review, I've added some
> additional information there which you might be interested in:
> https://reviews.freebsd.org/D702
>
> Regards
> Steve

I had time to re-test both the "stock" condition after the improvements and
the condition in which vfs.zfs.arc_free_target=vm.stats.vm.v_free_min. It
seems that MFU is more likely to be reduced in the second case.

- Nikolai Lifanov
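For readers skimming past the kernel excerpt: the quoted check is just a threshold comparison, so the sysctl equivalence being tested is easy to see in a sketch. The page counts below are hypothetical, chosen only to show why lowering vfs.zfs.arc_free_target to v_free_min makes the early-reclaim path fire far less often:

```python
def arc_reclaim_needed(kmem_free_count, zfs_arc_free_target):
    # Mirror of the quoted check: reclaim as soon as the number of pages
    # the kernel counts as free drops below the ARC free target.
    return kmem_free_count < zfs_arc_free_target

# Hypothetical page counts, for illustration only.
v_free_min = 38_000        # stand-in for vm.stats.vm.v_free_min
default_target = 170_000   # stand-in for a default vfs.zfs.arc_free_target

print(arc_reclaim_needed(100_000, default_target))  # True: default reclaims
print(arc_reclaim_needed(100_000, v_free_min))      # False: tuned value doesn't
```

With the tuned value the ARC only shrinks once the VM itself is genuinely short of pages, which matches the "stock vs. tuned" comparison being re-tested here.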
Re: svn commit: r270759 - in head/sys: cddl/compat/opensolaris/kern cddl/compat/opensolaris/sys cddl/contrib/opensolaris/uts/common/fs/zfs vm
On Fri, Sep 05, 2014 at 12:23:26PM +0300, Andriy Gapon wrote:
> on 04/09/2014 04:18 Steven Hartland said the following:
> > Indeed that would be interesting, but we might find that its quite
> > memory size dependent given the scaling so confirming HW details would
> > be nice too.
> >
> > I'd also be interested to know who wins the free race between the VM
> > and ARC when using that value.
>
> BTW, I've written a small silly program that tests for a problem that
> affected me in the distant past:
> http://people.freebsd.org/~avg/arc-vs-swap.c
>
> It gobbles almost all of the memory and then just sits on it never
> accessing it again. At the same time it repeatedly reads blocks of data
> from a large file. The idea is that eventually the unused memory should
> be pushed out to the swap and the ARC is allowed to grow to accommodate
> for the data being read.
>
> I run this program on a freshly booted system without any other
> applications. Prior to r270759 the system behaves as expected. Although
> the pace of shifting balance between the ARC and the swap-backed pages
> is quite slow.
> After r270759 and with the default tuning the ARC always sits at its
> minimum size. To me this is a regression.
>
> To summarize: I really appreciate the improvements that you are making
> here https://reviews.freebsd.org/D702
> Thanks!
>
> P.S.
> I wish there was an easy way to make the page cache and the ARC aware of
> each other.

I think there is no single way for any workload. For some workloads ARC is
preferred; for some, RSS is preferred. Maybe we need some tunable for an
elasticity factor between ARC and RSS?

PS: very bad that 'data limit' don't anymore reflect application
memory consumer. and very small application can adapt to 'no memory'
from system.
Re: svn commit: r270759 - in head/sys: cddl/compat/opensolaris/kern cddl/compat/opensolaris/sys cddl/contrib/opensolaris/uts/common/fs/zfs vm
on 04/09/2014 04:18 Steven Hartland said the following:
> Indeed that would be interesting, but we might find that its quite memory
> size dependent given the scaling so confirming HW details would be nice
> too.
>
> I'd also be interested to know who wins the free race between the VM and
> ARC when using that value.

BTW, I've written a small silly program that tests for a problem that
affected me in the distant past: http://people.freebsd.org/~avg/arc-vs-swap.c

It gobbles almost all of the memory and then just sits on it, never
accessing it again. At the same time it repeatedly reads blocks of data
from a large file. The idea is that eventually the unused memory should be
pushed out to the swap and the ARC is allowed to grow to accommodate the
data being read.

I run this program on a freshly booted system without any other
applications. Prior to r270759 the system behaves as expected, although the
pace of shifting balance between the ARC and the swap-backed pages is quite
slow. After r270759 and with the default tuning the ARC always sits at its
minimum size. To me this is a regression.

To summarize: I really appreciate the improvements that you are making here
https://reviews.freebsd.org/D702
Thanks!

P.S.
I wish there was an easy way to make the page cache and the ARC aware of
each other.

--
Andriy Gapon
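Since arc-vs-swap.c is linked but not shown, here is a hedged Python sketch of the workload it describes (function name and sizes are invented; the real program gobbles nearly all of RAM and streams a file larger than memory): hold a block of anonymous memory that is never touched again while repeatedly streaming a file, and let the VM decide whether the idle pages or the file cache win.

```python
import os
import tempfile

def arc_vs_swap(mem_bytes, file_bytes, block=4096, passes=2):
    # Idle anonymous memory: touched once at allocation, then never again.
    # In the real test this should eventually be pushed out to swap.
    idle = bytearray(mem_bytes)
    assert len(idle) == mem_bytes

    total_read = 0
    with tempfile.NamedTemporaryFile() as f:
        f.write(os.urandom(file_bytes))
        f.flush()
        # Repeatedly stream the file; this is the I/O the ARC (or page
        # cache) should be allowed to grow to accommodate.
        for _ in range(passes):
            f.seek(0)
            while chunk := f.read(block):
                total_read += len(chunk)
    return total_read

# Tiny sizes so the sketch runs anywhere; scale mem_bytes up to ~all of RAM
# and file_bytes past RAM to reproduce the actual experiment.
print(arc_vs_swap(1 << 20, 1 << 16))
```

The observable in the real test is external (ARC size and swap usage over time), not the function's return value; the sketch only captures the memory/IO pattern being exercised.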
Re: svn commit: r270759 - in head/sys: cddl/compat/opensolaris/kern cddl/compat/opensolaris/sys cddl/contrib/opensolaris/uts/common/fs/zfs vm
On 09/03/14 21:18, Steven Hartland wrote:
> - Original Message - From: "Andriy Gapon"
>
>> on 03/09/2014 23:22 Nikolai Lifanov said the following:
>>> On 09/03/14 15:22, John Baldwin wrote:
>>>> On Wednesday, September 03, 2014 11:05:04 AM Nikolai Lifanov wrote:
>>>>> On 09/03/14 04:09, Steven Hartland wrote:
>>>>>> I'm looking to MFC this change so wanted to check if
>>>>>> anyone had an final feedback / objections?
>>>>>>
>>>>>> I know we currently have Alan's feedback on changing
>>>>>> the #ifdef __i386__ to #ifndef UMA_MD_SMALL_ALLOC
>>>>>> which sounds sensible but waiting Peter to comment on.
>>>>>>
>>>>>> Regards
>>>>>> Steve
>>>>>
>>>>> I have no technical input, but this change improves ARC usefulness
>>>>> for me quite a bit. I would like to see the improvement in 10-STABLE.
>>>>
>>>> Can you verify that the current 10-STABLE (as of today) with all the
>>>> various pagedaemon fixes still has ARC issues for your workload?
>>>
>>> It doesn't have any issues, but I noticed the improvement on CURRENT.
>>> I observed that just after this change, my package builder is much
>>> more likely to retain MFU and not evict useful things from there (the
>>> port tree) after large builds.
>>> However, I run a lot more 10.0-RELEASE than CURRENT and I would like
>>> to see this improvement release-bound.
>>>
>>> I would be happy to test this on 10-STABLE if you think that this is
>>> relevant.
>>
>> As noted before, unfortunately, this commit (plus its fixups) contains
>> at least two related but distinct changes. So, to separate the wheat
>> from the chaff, could you please try to comment out the following block
>> in sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c, function
>> arc_reclaim_needed:
>>
>>	if (kmem_free_count() < zfs_arc_free_target) {
>>		DTRACE_PROBE2(arc__reclaim_freetarget, uint64_t,
>>		    kmem_free_count(), uint64_t, zfs_arc_free_target);
>>		return (1);
>>	}
>>
>> Alternatively, I think that the same effect can be achieved by setting
>> sysctl vfs.zfs.arc_free_target to the same value as
>> vm.stats.vm.v_free_min.
>
> Thats correct that would achieve the same thing.
>
>> It's interesting to me whether you would still see the better
>> performance or if that improvement would be undone.
>
> Indeed that would be interesting, but we might find that its quite
> memory size dependent given the scaling so confirming HW details would
> be nice too.
>
> I'd also be interested to know who wins the free race between the VM and
> ARC when using that value.
>
> For those following this thread but not the review, I've added some
> additional information there which you might be interested in:
> https://reviews.freebsd.org/D702
>
> Regards
> Steve

Just an update: I'm in the middle of testing this. I have to finish a
large bulk build to observe the behavior one way or another.

I have 32G of physical memory and 2x16G dedicated swap SSDs (L2ARC wasn't
very useful, but I should probably retest this) on this machine. My ARC is
usually at 14G with ~5G of MFU full of things I benefit from keeping there
(port trees, base jails). Builds themselves happen in tmpfs and I usually
have around 1.5G - 4G "Free" memory (unless building something like pypy).

- Nikolai Lifanov
Re: svn commit: r270759 - in head/sys: cddl/compat/opensolaris/kern cddl/compat/opensolaris/sys cddl/contrib/opensolaris/uts/common/fs/zfs vm
- Original Message - From: "Andriy Gapon"

> on 03/09/2014 23:22 Nikolai Lifanov said the following:
>> On 09/03/14 15:22, John Baldwin wrote:
>>> On Wednesday, September 03, 2014 11:05:04 AM Nikolai Lifanov wrote:
>>>> On 09/03/14 04:09, Steven Hartland wrote:
>>>>> I'm looking to MFC this change so wanted to check if
>>>>> anyone had an final feedback / objections?
>>>>>
>>>>> I know we currently have Alan's feedback on changing
>>>>> the #ifdef __i386__ to #ifndef UMA_MD_SMALL_ALLOC
>>>>> which sounds sensible but waiting Peter to comment on.
>>>>>
>>>>> Regards
>>>>> Steve
>>>>
>>>> I have no technical input, but this change improves ARC usefulness
>>>> for me quite a bit. I would like to see the improvement in 10-STABLE.
>>>
>>> Can you verify that the current 10-STABLE (as of today) with all the
>>> various pagedaemon fixes still has ARC issues for your workload?
>>
>> It doesn't have any issues, but I noticed the improvement on CURRENT.
>> I observed that just after this change, my package builder is much more
>> likely to retain MFU and not evict useful things from there (the port
>> tree) after large builds.
>> However, I run a lot more 10.0-RELEASE than CURRENT and I would like
>> to see this improvement release-bound.
>>
>> I would be happy to test this on 10-STABLE if you think that this is
>> relevant.
>
> As noted before, unfortunately, this commit (plus its fixups) contains
> at least two related but distinct changes. So, to separate the wheat
> from the chaff, could you please try to comment out the following block
> in sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c, function
> arc_reclaim_needed:
>
>	if (kmem_free_count() < zfs_arc_free_target) {
>		DTRACE_PROBE2(arc__reclaim_freetarget, uint64_t,
>		    kmem_free_count(), uint64_t, zfs_arc_free_target);
>		return (1);
>	}
>
> Alternatively, I think that the same effect can be achieved by setting
> sysctl vfs.zfs.arc_free_target to the same value as
> vm.stats.vm.v_free_min.

Thats correct that would achieve the same thing.

> It's interesting to me whether you would still see the better
> performance or if that improvement would be undone.

Indeed that would be interesting, but we might find that its quite memory
size dependent given the scaling so confirming HW details would be nice too.

I'd also be interested to know who wins the free race between the VM and
ARC when using that value.

For those following this thread but not the review, I've added some
additional information there which you might be interested in:
https://reviews.freebsd.org/D702

Regards
Steve
Re: svn commit: r270759 - in head/sys: cddl/compat/opensolaris/kern cddl/compat/opensolaris/sys cddl/contrib/opensolaris/uts/common/fs/zfs vm
on 03/09/2014 23:22 Nikolai Lifanov said the following:
> On 09/03/14 15:22, John Baldwin wrote:
>> On Wednesday, September 03, 2014 11:05:04 AM Nikolai Lifanov wrote:
>>> On 09/03/14 04:09, Steven Hartland wrote:
>>>> I'm looking to MFC this change so wanted to check if
>>>> anyone had an final feedback / objections?
>>>>
>>>> I know we currently have Alan's feedback on changing
>>>> the #ifdef __i386__ to #ifndef UMA_MD_SMALL_ALLOC
>>>> which sounds sensible but waiting Peter to comment on.
>>>>
>>>> Regards
>>>> Steve
>>>
>>> I have no technical input, but this change improves ARC usefulness for
>>> me quite a bit. I would like to see the improvement in 10-STABLE.
>>
>> Can you verify that the current 10-STABLE (as of today) with all the
>> various pagedaemon fixes still has ARC issues for your workload?
>
> It doesn't have any issues, but I noticed the improvement on CURRENT. I
> observed that just after this change, my package builder is much more
> likely to retain MFU and not evict useful things from there (the port
> tree) after large builds.
> However, I run a lot more 10.0-RELEASE than CURRENT and I would like to
> see this improvement release-bound.
>
> I would be happy to test this on 10-STABLE if you think that this is
> relevant.

As noted before, unfortunately, this commit (plus its fixups) contains at
least two related but distinct changes. So, to separate the wheat from the
chaff, could you please try to comment out the following block in
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c, function
arc_reclaim_needed:

	if (kmem_free_count() < zfs_arc_free_target) {
		DTRACE_PROBE2(arc__reclaim_freetarget, uint64_t,
		    kmem_free_count(), uint64_t, zfs_arc_free_target);
		return (1);
	}

Alternatively, I think that the same effect can be achieved by setting
sysctl vfs.zfs.arc_free_target to the same value as vm.stats.vm.v_free_min.

It's interesting to me whether you would still see the better performance
or if that improvement would be undone.

Thanks!

--
Andriy Gapon
Re: svn commit: r270759 - in head/sys: cddl/compat/opensolaris/kern cddl/compat/opensolaris/sys cddl/contrib/opensolaris/uts/common/fs/zfs vm
On 09/03/14 15:22, John Baldwin wrote:
> On Wednesday, September 03, 2014 11:05:04 AM Nikolai Lifanov wrote:
>> On 09/03/14 04:09, Steven Hartland wrote:
>>> I'm looking to MFC this change so wanted to check if
>>> anyone had an final feedback / objections?
>>>
>>> I know we currently have Alan's feedback on changing
>>> the #ifdef __i386__ to #ifndef UMA_MD_SMALL_ALLOC
>>> which sounds sensible but waiting Peter to comment on.
>>>
>>> Regards
>>> Steve
>>
>> I have no technical input, but this change improves ARC usefulness for
>> me quite a bit. I would like to see the improvement in 10-STABLE.
>
> Can you verify that the current 10-STABLE (as of today) with all the
> various pagedaemon fixes still has ARC issues for your workload?

It doesn't have any issues, but I noticed the improvement on CURRENT. I
observed that just after this change, my package builder is much more
likely to retain MFU and not evict useful things from there (the port
tree) after large builds.
However, I run a lot more 10.0-RELEASE than CURRENT and I would like to
see this improvement release-bound.

I would be happy to test this on 10-STABLE if you think that this is
relevant.

- Nikolai Lifanov
Re: svn commit: r270759 - in head/sys: cddl/compat/opensolaris/kern cddl/compat/opensolaris/sys cddl/contrib/opensolaris/uts/common/fs/zfs vm
On Wednesday, September 03, 2014 11:05:04 AM Nikolai Lifanov wrote:
> On 09/03/14 04:09, Steven Hartland wrote:
>> I'm looking to MFC this change so wanted to check if
>> anyone had an final feedback / objections?
>>
>> I know we currently have Alan's feedback on changing
>> the #ifdef __i386__ to #ifndef UMA_MD_SMALL_ALLOC
>> which sounds sensible but waiting Peter to comment on.
>>
>> Regards
>> Steve
>
> I have no technical input, but this change improves ARC usefulness for
> me quite a bit. I would like to see the improvement in 10-STABLE.

Can you verify that the current 10-STABLE (as of today) with all the
various pagedaemon fixes still has ARC issues for your workload?

--
John Baldwin
Re: svn commit: r270759 - in head/sys: cddl/compat/opensolaris/kern cddl/compat/opensolaris/sys cddl/contrib/opensolaris/uts/common/fs/zfs vm
on 03/09/2014 17:58 Andriy Gapon said the following: > on 03/09/2014 15:17 Steven Hartland said the following: >> - Original Message - From: "Andriy Gapon" >> >>> on 03/09/2014 11:09 Steven Hartland said the following: I'm looking to MFC this change so wanted to check if anyone had an final feedback / objections? >>> >>> I think that your changes went in a bit prematurely (little review), so >>> perhaps >>> MFC would be premature as well. >> >> Its a change which really needs to make it into 10.1 IMO > > I think that this is the arguable point. > As I've mentioned before I have not noticed, perhaps through the fault of my > own, any reports that users need this change after Alan's pagedaemon fix(es). > Also, there is no confirmation yet that after this change ARC does not give up > its buffers too easily. > >> due to its impact on users so I don't really want to hold >> off too long. >> >> If anyone has any substantiated reason to then off course >> I'll hold off. Based on our parallel conversation I feel a need to clarify my position. The commit in question has multiple changes in it: 1. removal of KVA check, which was later correctly restored for i386 2. addition of DTrace probes 3. zfs_arc_free_target check and all the support code for it So, #1 plus later fixes is obviously correct. #2 is useful and I like it. #3 is what I have great doubts about. All of what I said in the previous emails applies to #3 exclusively. -- Andriy Gapon ___ svn-src-all@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/svn-src-all To unsubscribe, send any mail to "svn-src-all-unsubscr...@freebsd.org"
Re: svn commit: r270759 - in head/sys: cddl/compat/opensolaris/kern cddl/compat/opensolaris/sys cddl/contrib/opensolaris/uts/common/fs/zfs vm
On 09/03/14 04:09, Steven Hartland wrote: > I'm looking to MFC this change so wanted to check if > anyone had any final feedback / objections? > > I know we currently have Alan's feedback on changing > the #ifdef __i386__ to #ifndef UMA_MD_SMALL_ALLOC > which sounds sensible but we're waiting on Peter to comment. > >Regards >Steve I have no technical input, but this change improves ARC usefulness for me quite a bit. I would like to see the improvement in 10-STABLE. - Nikolai Lifanov
Re: svn commit: r270759 - in head/sys: cddl/compat/opensolaris/kern cddl/compat/opensolaris/sys cddl/contrib/opensolaris/uts/common/fs/zfs vm
on 03/09/2014 15:17 Steven Hartland said the following: > - Original Message - From: "Andriy Gapon" > >> on 03/09/2014 11:09 Steven Hartland said the following: >>> I'm looking to MFC this change so wanted to check if >>> anyone had any final feedback / objections? >> >> I think that your changes went in a bit prematurely (little review), so >> perhaps >> MFC would be premature as well. > > It's a change which really needs to make it into 10.1 IMO I think that this is an arguable point. As I've mentioned before I have not noticed, perhaps through a fault of my own, any reports that users need this change after Alan's pagedaemon fix(es). Also, there is no confirmation yet that after this change the ARC does not give up its buffers too easily. > due to its impact on users so I don't really want to hold > off too long. > > If anyone has any substantiated reason then of course > I'll hold off. -- Andriy Gapon
Re: svn commit: r270759 - in head/sys: cddl/compat/opensolaris/kern cddl/compat/opensolaris/sys cddl/contrib/opensolaris/uts/common/fs/zfs vm
- Original Message - From: "Andriy Gapon" on 03/09/2014 11:09 Steven Hartland said the following: I'm looking to MFC this change so wanted to check if anyone had any final feedback / objections? I think that your changes went in a bit prematurely (little review), so perhaps MFC would be premature as well. It's a change which really needs to make it into 10.1 IMO due to its impact on users, so I don't really want to hold off too long. If anyone has any substantiated reason then of course I'll hold off. Regards Steve
Re: svn commit: r270759 - in head/sys: cddl/compat/opensolaris/kern cddl/compat/opensolaris/sys cddl/contrib/opensolaris/uts/common/fs/zfs vm
on 03/09/2014 11:09 Steven Hartland said the following: > I'm looking to MFC this change so wanted to check if > anyone had any final feedback / objections? I think that your changes went in a bit prematurely (little review), so perhaps MFC would be premature as well. > I know we currently have Alan's feedback on changing > the #ifdef __i386__ to #ifndef UMA_MD_SMALL_ALLOC > which sounds sensible but we're waiting on Peter to comment. > >Regards >Steve > -- Andriy Gapon
Re: svn commit: r270759 - in head/sys: cddl/compat/opensolaris/kern cddl/compat/opensolaris/sys cddl/contrib/opensolaris/uts/common/fs/zfs vm
- Original Message - From: "Andriy Gapon" on 02/09/2014 20:43 Steven Hartland said the following: - Original Message - From: "Andriy Gapon" And the newly added kmem_foo() functions probably do not belong in cddl/compat/opensolaris as Solaris / illumos does not have those functions. They could be moved but their current location keeps all the kmem related functions neatly in one place. Spreading them around IMO would just make things hard to find. BTW, here is some of my old WIP that completely removed the pre-existing kmem_* functions and made the related code much closer to that in illumos. Unfortunately, I will now have a hard time merging my changes with your change. https://github.com/avg-I/freebsd/compare/wip/hc/kmem_size-memguard-fix (esp. commits e0cf2f7 and becf087) Looking good. I'm all for eliminating the differences between the two code bases, as that will make things easier to maintain in the future, so it's something we should definitely try to get in for 11. Regards Steve
Re: svn commit: r270759 - in head/sys: cddl/compat/opensolaris/kern cddl/compat/opensolaris/sys cddl/contrib/opensolaris/uts/common/fs/zfs vm
on 02/09/2014 20:43 Steven Hartland said the following: > - Original Message - From: "Andriy Gapon" >> And the newly added kmem_foo() functions probably do not belong in >> cddl/compat/opensolaris as Solaris / illumos does not have those functions. > > They could be moved but their current location keeps all the kmem > related functions neatly in one place. Spreading them around IMO > would just make things hard to find. BTW, here is some of my old WIP that completely removed the pre-existing kmem_* functions and made the related code much closer to that in illumos. Unfortunately, I will now have a hard time merging my changes with your change. https://github.com/avg-I/freebsd/compare/wip/hc/kmem_size-memguard-fix (esp. commits e0cf2f7 and becf087) >> I think that in this case e.g. vm_cnt.v_free_target can just be >> used directly by the FreeBSD-specific ARC code. > > It could, but as above keeping everything in one place makes it easy to > find and hence easy to MFC, as this area has seen changes which would require > all those fields to be renamed. It also means if the logic for free pages > changes in the future there's only one place it needs to be changed. > > For those interested there's also an open review on additional > changes in this area: https://reviews.freebsd.org/D702 -- Andriy Gapon
Re: svn commit: r270759 - in head/sys: cddl/compat/opensolaris/kern cddl/compat/opensolaris/sys cddl/contrib/opensolaris/uts/common/fs/zfs vm
I'm looking to MFC this change so wanted to check if anyone had any final feedback / objections? I know we currently have Alan's feedback on changing the #ifdef __i386__ to #ifndef UMA_MD_SMALL_ALLOC which sounds sensible but we're waiting on Peter to comment. Regards Steve
Re: svn commit: r270759 - in head/sys: cddl/compat/opensolaris/kern cddl/compat/opensolaris/sys cddl/contrib/opensolaris/uts/common/fs/zfs vm
On 09/02/2014 14:09, Steven Hartland wrote: > - Original Message - From: "Alan Cox" >> On 09/02/2014 12:34, Steven Hartland wrote: >>> - Original Message - From: "Alan Cox" The patch actually makes two completely orthogonal changes at once, and one of those changes has no connection to the page daemon. I suspect that is why some people have said that their issues were not addressed by the page daemon fix. Prior to this patch, we were limiting the ARC size to 3/4 the kmem map/arena/object size on all architectures, including 64-bit, uma_small_alloc-enabled architectures where such a limit makes no sense. Consequently, some people were complaining, "Why is 1/4 of my memory going unused?" >>> >>> This is exactly the problem which lead me into investigating the issue. >>> >> >> Is there any evidence that anything other than eliminating the KVA limit >> is needed on machines where the page daemon isn't broken? > > In "my" direct experience, I would have to say no, I can't speak for > others. > >>> It should be noted that for i386, as requested by Peter, this >>> limitation >>> has been re-applied. >>> >> >> Unlike Solaris, we run on a few 32-bit architectures, besides i386, that >> don't have a direct map or a full 4GB address space for the kernel. So, >> for FreeBSD, this needs to be more general than just i386. I would >> suggest using '#ifndef UMA_MD_SMALL_ALLOC' as being the closest thing >> that we have to what you want here. This check will allow any machine, >> including 32-bit machines that allocate some kernel memory through a >> direct map, to opt out of the kmem map/arena/object limit. > > I'm not and don't profess to be an expert in this domain as you > know ;-) So I'm to defer to you guys, Peter would you agree with > this? > >> Finally, the Solaris KVA check is written to avoid the possibility of >> integer overflow. However, the FreeBSD version is not. 
For example, >> suppose that I set up an i386 machine with something approaching a >> 2GB/2GB user/kernel split, 3 * kmem_size() could overflow. > > The returns of said functions are uint64_t, so I'm not sure where > the overflow would be when working with 2GB values? > Ah, I didn't see the uint64_t. I only looked at the calculation and assumed vm_offset_t was being used. So, yes, you are safe from overflow. > We seem to be missing some of the following, which doesn't impact > us directly but may affect understanding of the "current" illumos > code as it's not in the tree: > https://github.com/illumos/illumos-gate/commit/94dd93aee32d1616436eb51fb7b58094b9a8d3e8 > > >Regards >Steve > >
Re: svn commit: r270759 - in head/sys: cddl/compat/opensolaris/kern cddl/compat/opensolaris/sys cddl/contrib/opensolaris/uts/common/fs/zfs vm
- Original Message - From: "Alan Cox" On 09/02/2014 12:34, Steven Hartland wrote: - Original Message - From: "Alan Cox" The patch actually makes two completely orthogonal changes at once, and one of those changes has no connection to the page daemon. I suspect that is why some people have said that their issues were not addressed by the page daemon fix. Prior to this patch, we were limiting the ARC size to 3/4 the kmem map/arena/object size on all architectures, including 64-bit, uma_small_alloc-enabled architectures where such a limit makes no sense. Consequently, some people were complaining, "Why is 1/4 of my memory going unused?" This is exactly the problem which lead me into investigating the issue. Is there any evidence that anything other than eliminating the KVA limit is needed on machines where the page daemon isn't broken? In "my" direct experience, I would have to say no, I can't speak for others. It should be noted that for i386, as requested by Peter, this limitation has been re-applied. Unlike Solaris, we run on a few 32-bit architectures, besides i386, that don't have a direct map or a full 4GB address space for the kernel. So, for FreeBSD, this needs to be more general than just i386. I would suggest using '#ifndef UMA_MD_SMALL_ALLOC' as being the closest thing that we have to what you want here. This check will allow any machine, including 32-bit machines that allocate some kernel memory through a direct map, to opt out of the kmem map/arena/object limit. I'm not and don't profess to be an expert in this domain as you know ;-) So I'm to defer to you guys, Peter would you agree with this? Finally, the Solaris KVA check is written to avoid the possibility of integer overflow. However, the FreeBSD version is not. For example, suppose that I setup an i386 machine with something approaching a 2GB/2GB user/kernel split, 3 * kmem_size() could overflow. 
The returns of said functions are uint64_t, so I'm not sure where the overflow would be when working with 2GB values? We seem to be missing some of the following, which doesn't impact us directly but may affect understanding of the "current" illumos code as it's not in the tree: https://github.com/illumos/illumos-gate/commit/94dd93aee32d1616436eb51fb7b58094b9a8d3e8 Regards Steve
Re: svn commit: r270759 - in head/sys: cddl/compat/opensolaris/kern cddl/compat/opensolaris/sys cddl/contrib/opensolaris/uts/common/fs/zfs vm
On 09/02/2014 12:34, Steven Hartland wrote: > - Original Message - From: "Alan Cox" > >> On 09/02/2014 11:01, John Baldwin wrote: >>> On Saturday, August 30, 2014 1:37:43 pm Peter Wemm wrote: On Saturday 30 August 2014 02:03:42 Steven Hartland wrote: I'm very disappointed in the attention to detail and errors in the commit. I'm almost at the point where I want to ask for the whole thing to be backed out. >>> I would not be too supportive of that. This PR has been open for a >>> long, long time with many users using patches from it in production >>> loads that were greatly improved by the changes and clamoring on the >>> lists multiple times to get someone to look at it. avg@ contributed >>> quite a bit of time to diagnose this with Karl early on, but other >>> developers aside from Steven did not. It also was not hard to >>> explain to Karl the meaning of 'cache + free' in the bug follow-ups >>> itself (though I believe avg@ had tried this before and it didn't >>> sink in that time for some reason). >>> >>> I know Steven has since committed a fix, but if there are still >>> concerns, I think it would be best to not just revert this entirely >>> but to spend some time fixing the remaining issues. Clearly this >>> issue affects a lot of users and the earlier fixes to pagedaemon >>> were not sufficient to fix their issues alone. >>> >> >> The patch actually makes two completely orthogonal changes at once, and >> one of those changes has no connection to the page daemon. I suspect >> that is why some people have said that their issues were not addressed >> by the page daemon fix. >> >> Prior to this patch, we were limiting the ARC size to 3/4 the kmem >> map/arena/object size on all architectures, including 64-bit, >> uma_small_alloc-enabled architectures where such a limit makes no >> sense. Consequently, some people were complaining, "Why is 1/4 of my >> memory going unused?" > > This is exactly the problem which lead me into investigating the issue. 
> Is there any evidence that anything other than eliminating the KVA limit is needed on machines where the page daemon isn't broken? > It should be noted that for i386, as requested by Peter, this limitation > has been re-applied. > Unlike Solaris, we run on a few 32-bit architectures, besides i386, that don't have a direct map or a full 4GB address space for the kernel. So, for FreeBSD, this needs to be more general than just i386. I would suggest using '#ifndef UMA_MD_SMALL_ALLOC' as being the closest thing that we have to what you want here. This check will allow any machine, including 32-bit machines that allocate some kernel memory through a direct map, to opt out of the kmem map/arena/object limit. Finally, the Solaris KVA check is written to avoid the possibility of integer overflow. However, the FreeBSD version is not. For example, suppose that I set up an i386 machine with something approaching a 2GB/2GB user/kernel split, 3 * kmem_size() could overflow.
Re: svn commit: r270759 - in head/sys: cddl/compat/opensolaris/kern cddl/compat/opensolaris/sys cddl/contrib/opensolaris/uts/common/fs/zfs vm
- Original Message - From: "Andriy Gapon" on 02/09/2014 19:01 John Baldwin said the following: I know Steven has since committed a fix, but if there are still concerns, I think it would be best to not just revert this entirely but to spend some time fixing the remaining issues. Clearly this issue affects a lot of users and the earlier fixes to pagedaemon were not sufficient to fix their issues alone. I am not sure that that is the case. I could have very well missed evidence of that, but then I'd appreciate a pointer or two to such reports. I am certainly sure that a large number of reports about the "ZFS vs swapping" issue appeared after the pagedaemon problem was introduced. I have also missed any "theoretical" justification for the patch. That is, an explanation of how the patch interacts with the pagedaemon and improves things. The empirical evidence could be insufficient, because it's easy to tilt the balance such that the ARC gives in too easily. But people who were affected by the opposite problem could be different from people who would be affected by the new problem, because of differences in system characteristics such as amount of RAM, workload patterns, working set sizes, etc. Having said that, I do not ask for the changes to be reverted, but I'll probably get back after I have sufficient time to look at the patch and also to test its effect on my systems. This might not be very soon though. P.S. I think that there was no technical reason to initialize the newly introduced parameters via the SYSINIT mechanism. I created the SYSINIT as the values are not available early enough for the sysctl. I think that the initialization could just be done in arc_init. I thought I'd tested this when doing the initial implementation but it was a long week, so I just did a quick re-test and confirmed arc_init is called before the pagedaemon is initialised, so it can't be used in this case.
On a side note, it would be nice if the ARC sysctls, which are currently set up via arc_init, were cleaned up, as currently the limits aren't enforced correctly when manually changed, and they are also split across two locations, making them hard to follow. Something I intend to look at when I get some free time. And the newly added kmem_foo() functions probably do not belong in cddl/compat/opensolaris as Solaris / illumos does not have those functions. They could be moved but their current location keeps all the kmem related functions neatly in one place. Spreading them around IMO would just make things hard to find. I think that in this case e.g. vm_cnt.v_free_target can just be used directly by the FreeBSD-specific ARC code. It could, but as above keeping everything in one place makes it easy to find and hence easy to MFC, as this area has seen changes which would require all those fields to be renamed. It also means if the logic for free pages changes in the future there's only one place it needs to be changed. For those interested there's also an open review on additional changes in this area: https://reviews.freebsd.org/D702 Regards Steve
Re: svn commit: r270759 - in head/sys: cddl/compat/opensolaris/kern cddl/compat/opensolaris/sys cddl/contrib/opensolaris/uts/common/fs/zfs vm
- Original Message - From: "Alan Cox" On 09/02/2014 11:01, John Baldwin wrote: On Saturday, August 30, 2014 1:37:43 pm Peter Wemm wrote: On Saturday 30 August 2014 02:03:42 Steven Hartland wrote: I'm very disappointed in the attention to detail and errors in the commit. I'm almost at the point where I want to ask for the whole thing to be backed out. I would not be too supportive of that. This PR has been open for a long, long time with many users using patches from it in production loads that were greatly improved by the changes and clamoring on the lists multiple times to get someone to look at it. avg@ contributed quite a bit of time to diagnose this with Karl early on, but other developers aside from Steven did not. It also was not hard to explain to Karl the meaning of 'cache + free' in the bug follow-ups itself (though I believe avg@ had tried this before and it didn't sink in that time for some reason). I know Steven has since committed a fix, but if there are still concerns, I think it would be best to not just revert this entirely but to spend some time fixing the remaining issues. Clearly this issue affects a lot of users and the earlier fixes to pagedaemon were not sufficient to fix their issues alone. The patch actually makes two completely orthogonal changes at once, and one of those changes has no connection to the page daemon. I suspect that is why some people have said that their issues were not addressed by the page daemon fix. Prior to this patch, we were limiting the ARC size to 3/4 the kmem map/arena/object size on all architectures, including 64-bit, uma_small_alloc-enabled architectures where such a limit makes no sense. Consequently, some people were complaining, "Why is 1/4 of my memory going unused?" This is exactly the problem which lead me into investigating the issue. It should be noted that for i386, as requested by Peter, this limitation has been re-applied. 
Regards Steve
Re: svn commit: r270759 - in head/sys: cddl/compat/opensolaris/kern cddl/compat/opensolaris/sys cddl/contrib/opensolaris/uts/common/fs/zfs vm
On 09/02/2014 11:01, John Baldwin wrote: > On Saturday, August 30, 2014 1:37:43 pm Peter Wemm wrote: >> On Saturday 30 August 2014 02:03:42 Steven Hartland wrote: >> I'm very disappointed in the attention to detail and errors in the commit. >> I'm almost at the point where I want to ask for the whole thing to be backed >> out. > I would not be too supportive of that. This PR has been open for a long, > long > time with many users using patches from it in production loads that were > greatly improved by the changes and clamoring on the lists multiple times to > get someone to look at it. avg@ contributed quite a bit of time to diagnose > this with Karl early on, but other developers aside from Steven did not. It > also was not hard to explain to Karl the meaning of 'cache + free' in the bug > follow-ups itself (though I believe avg@ had tried this before and it didn't > sink in that time for some reason). > > I know Steven has since committed a fix, but if there are still concerns, I > think it would be best to not just revert this entirely but to spend some > time > fixing the remaining issues. Clearly this issue affects a lot of users and > the earlier fixes to pagedaemon were not sufficient to fix their issues alone. > The patch actually makes two completely orthogonal changes at once, and one of those changes has no connection to the page daemon. I suspect that is why some people have said that their issues were not addressed by the page daemon fix. Prior to this patch, we were limiting the ARC size to 3/4 the kmem map/arena/object size on all architectures, including 64-bit, uma_small_alloc-enabled architectures where such a limit makes no sense. Consequently, some people were complaining, "Why is 1/4 of my memory going unused?" ___ svn-src-all@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/svn-src-all To unsubscribe, send any mail to "svn-src-all-unsubscr...@freebsd.org"
Re: svn commit: r270759 - in head/sys: cddl/compat/opensolaris/kern cddl/compat/opensolaris/sys cddl/contrib/opensolaris/uts/common/fs/zfs vm
on 02/09/2014 19:01 John Baldwin said the following: > I know Steven has since committed a fix, but if there are still concerns, I > think it would be best to not just revert this entirely but to spend some > time > fixing the remaining issues. Clearly this issue affects a lot of users and > the earlier fixes to pagedaemon were not sufficient to fix their issues alone. I am not sure that that is the case. I could have very well missed an evidence of that, but then I'd appreciate a pointer or two to such reports. I am certainly sure that a large number of reports about "ZFS vs swapping" issue appeared after the pagedaemon problem was introduced. I have also missed any "theoretical" justification for the patch. That is, an explanation of how the patch interacts with the pagedaemon and improves things. The empirical evidence could be insufficient, because it's easy to tilt the balance such that the ARC gives in too easily. But people who were affected by the opposite problem could be different from people who would be affected by the new problem, because of differences in system characteristics such as amount of RAM, workload patterns, working set sizes, etc. Having said that, I do not ask for the changes to be reverted, but I'll probably get back after I have sufficient time to look at the patch and also to test its effect on my systems. This might not be very soon though. P.S. I think that there was no technical reason to initialize the newly introduced parameters via SYSINIT mechanism. I think that the initialization could be just done in arc_init. And the newly added kmem_foo() functions probably do not belong in cddl/compat/opensolaris as Solaris / illumos does not have those functions. I think that in this case e.g. vm_cnt.v_free_target can just be used directly by the FreeBSD-specific ARC code. 
-- Andriy Gapon
Re: svn commit: r270759 - in head/sys: cddl/compat/opensolaris/kern cddl/compat/opensolaris/sys cddl/contrib/opensolaris/uts/common/fs/zfs vm
On Saturday, August 30, 2014 1:37:43 pm Peter Wemm wrote: > On Saturday 30 August 2014 02:03:42 Steven Hartland wrote: > I'm very disappointed in the attention to detail and errors in the commit. > I'm almost at the point where I want to ask for the whole thing to be backed > out. I would not be too supportive of that. This PR has been open for a long, long time with many users using patches from it in production loads that were greatly improved by the changes and clamoring on the lists multiple times to get someone to look at it. avg@ contributed quite a bit of time to diagnose this with Karl early on, but other developers aside from Steven did not. It also was not hard to explain to Karl the meaning of 'cache + free' in the bug follow-ups itself (though I believe avg@ had tried this before and it didn't sink in that time for some reason). I know Steven has since committed a fix, but if there are still concerns, I think it would be best to not just revert this entirely but to spend some time fixing the remaining issues. Clearly this issue affects a lot of users and the earlier fixes to pagedaemon were not sufficient to fix their issues alone. -- John Baldwin ___ svn-src-all@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/svn-src-all To unsubscribe, send any mail to "svn-src-all-unsubscr...@freebsd.org"
Re: svn commit: r270759 - in head/sys: cddl/compat/opensolaris/kern cddl/compat/opensolaris/sys cddl/contrib/opensolaris/uts/common/fs/zfs vm
- Original Message - From: "Peter Wemm" snip... > I'm aware of the values involved, and as I said what you're > proposing > was more akin to where I started, but I was informed that it had > already > been tested and didn't work well. And Karl also said that his tests are on machines that have no v_cache, so he's not testing the scenario. The code, as written, is wrong. It's as simple as that. The logic is wrong. You've introduced dead code. Your code changes introduce a scenario that CAUSES one of the very problems you're using as a justification for the changes. Your own testers have admitted that they don't test the scenario that the problem exists with. Whoa, hold on there, I'm trying to react to feedback and come to a proper solution. I've already said that my initial version was very much like what I believe you're requesting it's changed to, but I reacted to feedback on that. Now if that feedback was inaccurate or misguided I'm sorry, and I'm thankful for your input as the domain expert. The PR was created quite some time ago, and it wasn't till I encountered a related issue in a live system the other weekend that it got some attention. It's clearly an important issue that needs resolving so let's work together to come up with a fix everyone's happy with. > > Also, what about the magic numbers here: > > u_int zfs_arc_free_target = (1 << 19); /* default before > > pagedaemon > > init only */ > > That is just a total fallback case and should never be triggered > unless, > as the comment states, the pagedaemon isn't initialised. > > > That's half a million pages, or 2GB of physical ram on a 4K page > > size > > system > > How is this going to work on early boot in the machines in the > > cluster > > with > > less than 2GB of ram? > > It's there to ensure that the ARC doesn't run wild for the few > milliseconds > / seconds before the pagedaemon is initialised. > > We can change the value no problem, what would you suggest 1<<16 aka > 256MB?
Please stop picking magic numbers out of thin air. You are working with the file system and VM - critical parts of the system. This is NOT the place to be screwing around with things you don't understand. alc@ was trying to be polite. Please help me out here, I'm trying to do the right thing so I'm looking to you guys for advice. > Thanks for all the feedback, it's great to have my understanding of > how things work in this area confirmed by those who know. > > Hopefully we'll be able to get to the bottom of this with everyone's > help and get a solid fix for these issues that have plagued 10 into > 10.1 :) I'm very disappointed in the attention to detail and errors in the commit. I'm almost at the point where I want to ask for the whole thing to be backed out. I'm not disagreeing with anyone, I'm simply trying to acquire the information on how to update this in a way that works out best for users, so sweeping statements like this just make me feel bad for trying to help :( I'm more than happy to address any concerns people have. I've kicked this off with a webrev https://reviews.freebsd.org/D700 Regards Steve
Re: svn commit: r270759 - in head/sys: cddl/compat/opensolaris/kern cddl/compat/opensolaris/sys cddl/contrib/opensolaris/uts/common/fs/zfs vm
On Saturday 30 August 2014 02:03:42 Steven Hartland wrote:
> - Original Message -
> From: "Peter Wemm"
>
> > On Friday 29 August 2014 21:42:15 Steven Hartland wrote:
> >
> > If this function returns non-zero, ARC is given back:
> >
> > static int
> > arc_reclaim_needed(void)
> > {
> > 	if (kmem_free_count() < zfs_arc_free_target) {
> > 		return (1);
> > 	}
> >
> > 	/*
> > 	 * Cooperate with pagedaemon when it's time for it to scan
> > 	 * and reclaim some pages.
> > 	 */
> > 	if (vm_paging_needed()) {
> > 		return (1);
> > 	}
> >
> > ie: if v_free (ignoring v_cache free pages) gets below the threshold,
> > stop everything and discard ARC pages.
> >
> > The vm_paging_needed() code is a NO-OP at this point. It can never
> > return true. Consider:
> >
> > vm_cnt.v_free_target = 4 * vm_cnt.v_free_min + vm_cnt.v_free_reserved;
> > vs
> > vm_pageout_wakeup_thresh = (vm_cnt.v_free_min / 10) * 11;
> >
> > zfs_arc_free_target defaults to vm_cnt.v_free_target, which is 400% of
> > v_free_min, and compares it against the smaller v_free pool.
> >
> > vm_paging_needed() compares the total free pool (v_free + v_cache)
> > against the smaller wakeup threshold - 110% of v_free_min.
> >
> > Comparing a larger value against a smaller target than the previous
> > test will never succeed unless you manually change the arc_free_target
> > sysctl.
>
> I'm aware of the values involved, and as I said what you're proposing
> was more akin to where I started, but I was informed that it had already
> been tested and didn't work well.

And Karl also said that his tests are on machines that have no v_cache, so
he's not testing the scenario.

The code, as written, is wrong. It's as simple as that. The logic is wrong.
You've introduced dead code. Your code changes introduce a scenario that
CAUSES one of the very problems you're using as a justification for the
changes. Your own testers have admitted that they don't test the scenario
that the problem exists with.
> > Also, what about the magic numbers here:
> > u_int zfs_arc_free_target = (1 << 19); /* default before pagedaemon
> > init only */
>
> That is just a total fall back case and should never be triggered unless,
> as the comment states, the pagedaemon isn't initialised.
>
> > That's half a million pages, or 2GB of physical ram on a 4K page size
> > system.
> > How is this going to work on early boot in the machines in the cluster
> > with less than 2GB of ram?
>
> It's there to ensure that ARC doesn't run wild for the few milliseconds
> / seconds before pagedaemon is initialised.
>
> We can change the value no problem; what would you suggest, 1<<16 aka
> 256MB?

Please stop picking magic numbers out of thin air. You are working with the
file system and VM - critical parts of the system. This is NOT the place to
be screwing around with things you don't understand. alc@ was trying to be
polite.

> Thanks for all the feedback, it's great to have my understanding of
> how things work in this area confirmed by those who know.
>
> Hopefully we'll be able to get to the bottom of this with everyone's
> help and get a solid fix for these issues that have plagued 10 into
> 10.1 :)

I'm very disappointed in the attention to detail and errors in the commit.
I'm almost at the point where I want to ask for the whole thing to be backed
out.

--
Peter Wemm - pe...@wemm.org; pe...@freebsd.org; pe...@yahoo-inc.com; KI6FJV
UTF-8: for when a ' or ... just won't do…
Re: svn commit: r270759 - in head/sys: cddl/compat/opensolaris/kern cddl/compat/opensolaris/sys cddl/contrib/opensolaris/uts/common/fs/zfs vm
- Original Message - From: "Peter Wemm"

On Friday 29 August 2014 21:42:15 Steven Hartland wrote:
> - Original Message -
> From: "Peter Wemm"
>
> > On Friday 29 August 2014 20:51:03 Steven Hartland wrote:
> snip..
> > > Does Karl's explanation as to why this doesn't work above change
> > > your mind?
> >
> > Actually no, I would expect the code as committed would *cause* the
> > undesirable behavior that Karl described.
> >
> > ie: access a few large files and cause them to reside in cache. Say
> > 50GB or so on a 200G ram machine. We now have the state where:
> >
> > v_cache = 50GB
> > v_free = 1MB
> >
> > The rest of the vm system looks at vm_paging_needed(), which is: do
> > we have enough "v_cache + v_free"? Since there's 50.001GB free, the
> > answer is no. It'll let v_free run right down to v_free_min because
> > of the giant pool of v_cache just sitting there, waiting to be used.
> >
> > The zfs change, as committed, will ignore all the free memory in the
> > form of v_cache.. and will be freaking out about how low v_free is
> > getting and will be sacrificing ARC in order to put more memory into
> > the v_free pool.
> >
> > As long as ARC keeps sacrificing itself this way, the free pages in
> > the v_cache pool won't get used. When ARC finally runs out of pages
> > to give up to v_free, the kernel will start using the free pages from
> > v_cache. Eventually it'll run down that v_cache free pool and arc
> > will be in a bare minimum state while this is happening.
> >
> > Meanwhile, ZFS ARC will be crippled. This has consequences - it does
> > RCU like things from ARC to keep fragmentation under control. With
> > ARC crippled, fragmentation will increase because there's less
> > opportunistic gathering of data from ARC.
> > Granted, you have to get things freed from active/inactive to the
> > cache state, but once it's there, depending on the workload, it'll
> > mess with ARC.
>
> There's already a vm_paging_needed() check in there below so this will
> already be dealt with will it not?

No. If you read the code that you changed, you won't get that far. The
v_free test comes before vm_paging_needed(), and if the v_free test triggers
then ARC will return pages and not look at the rest of the function.

Sorry, I should have phrased that question better: prior to the change
vm_paging_needed() was the top test (ignoring needfree), but it was still
causing performance issues.

Surely with all the other return-1 cases triggering at what should be a
higher level of free memory we should never have seen the performance
issues, but users were reporting it a lot. So there's still some mystery
surrounding why this was happening?

If this function returns non-zero, ARC is given back:

static int
arc_reclaim_needed(void)
{
	if (kmem_free_count() < zfs_arc_free_target) {
		return (1);
	}

	/*
	 * Cooperate with pagedaemon when it's time for it to scan
	 * and reclaim some pages.
	 */
	if (vm_paging_needed()) {
		return (1);
	}

ie: if v_free (ignoring v_cache free pages) gets below the threshold, stop
everything and discard ARC pages.

The vm_paging_needed() code is a NO-OP at this point. It can never return
true. Consider:

vm_cnt.v_free_target = 4 * vm_cnt.v_free_min + vm_cnt.v_free_reserved;
vs
vm_pageout_wakeup_thresh = (vm_cnt.v_free_min / 10) * 11;

zfs_arc_free_target defaults to vm_cnt.v_free_target, which is 400% of
v_free_min, and compares it against the smaller v_free pool.

vm_paging_needed() compares the total free pool (v_free + v_cache) against
the smaller wakeup threshold - 110% of v_free_min.

Comparing a larger value against a smaller target than the previous test
will never succeed unless you manually change the arc_free_target sysctl.
I'm aware of the values involved, and as I said what you're proposing was
more akin to where I started, but I was informed that it had already been
tested and didn't work well. So as I'm sure you'll appreciate, given that
information I opted to trust the real life tests.

Now it's totally possible there could be something in the tests that was
skewing the result, but as it still indicated the performance issue was
still there, whereas using the current values it wasn't, I opted for that
vs. what I believed was the more technically correct value.

Now you've confirmed what I initially thought should be the correct values
are indeed so; I've asked Karl to retest, so we can confirm that any of the
changes that went into stable/10 after that point haven't changed this
behaviour. Hope that makes sense?

Also, what about the magic numbers here:
u_int zfs_arc_free_target = (1 << 19); /* default before pagedaemon
init only */

That is just a total fall back case and should never be triggered
Re: svn commit: r270759 - in head/sys: cddl/compat/opensolaris/kern cddl/compat/opensolaris/sys cddl/contrib/opensolaris/uts/common/fs/zfs vm
On Friday 29 August 2014 21:42:15 Steven Hartland wrote:
> - Original Message -
> From: "Peter Wemm"
>
> > On Friday 29 August 2014 20:51:03 Steven Hartland wrote:
> snip..
> > > Does Karl's explanation as to why this doesn't work above change
> > > your mind?
> >
> > Actually no, I would expect the code as committed would *cause* the
> > undesirable behavior that Karl described.
> >
> > ie: access a few large files and cause them to reside in cache. Say
> > 50GB or so on a 200G ram machine. We now have the state where:
> >
> > v_cache = 50GB
> > v_free = 1MB
> >
> > The rest of the vm system looks at vm_paging_needed(), which is: do
> > we have enough "v_cache + v_free"? Since there's 50.001GB free, the
> > answer is no. It'll let v_free run right down to v_free_min because
> > of the giant pool of v_cache just sitting there, waiting to be used.
> >
> > The zfs change, as committed, will ignore all the free memory in the
> > form of v_cache.. and will be freaking out about how low v_free is
> > getting and will be sacrificing ARC in order to put more memory into
> > the v_free pool.
> >
> > As long as ARC keeps sacrificing itself this way, the free pages in
> > the v_cache pool won't get used. When ARC finally runs out of pages
> > to give up to v_free, the kernel will start using the free pages from
> > v_cache. Eventually it'll run down that v_cache free pool and arc
> > will be in a bare minimum state while this is happening.
> >
> > Meanwhile, ZFS ARC will be crippled. This has consequences - it does
> > RCU like things from ARC to keep fragmentation under control. With
> > ARC crippled, fragmentation will increase because there's less
> > opportunistic gathering of data from ARC.
> >
> > Granted, you have to get things freed from active/inactive to the
> > cache state, but once it's there, depending on the workload, it'll
> > mess with ARC.
> There's already a vm_paging_needed() check in there below so this will
> already be dealt with will it not?

No. If you read the code that you changed, you won't get that far. The
v_free test comes before vm_paging_needed(), and if the v_free test triggers
then ARC will return pages and not look at the rest of the function.

If this function returns non-zero, ARC is given back:

static int
arc_reclaim_needed(void)
{
	if (kmem_free_count() < zfs_arc_free_target) {
		return (1);
	}

	/*
	 * Cooperate with pagedaemon when it's time for it to scan
	 * and reclaim some pages.
	 */
	if (vm_paging_needed()) {
		return (1);
	}

ie: if v_free (ignoring v_cache free pages) gets below the threshold, stop
everything and discard ARC pages.

The vm_paging_needed() code is a NO-OP at this point. It can never return
true. Consider:

vm_cnt.v_free_target = 4 * vm_cnt.v_free_min + vm_cnt.v_free_reserved;
vs
vm_pageout_wakeup_thresh = (vm_cnt.v_free_min / 10) * 11;

zfs_arc_free_target defaults to vm_cnt.v_free_target, which is 400% of
v_free_min, and compares it against the smaller v_free pool.

vm_paging_needed() compares the total free pool (v_free + v_cache) against
the smaller wakeup threshold - 110% of v_free_min.

Comparing a larger value against a smaller target than the previous test
will never succeed unless you manually change the arc_free_target sysctl.

Also, what about the magic numbers here:
u_int zfs_arc_free_target = (1 << 19); /* default before pagedaemon
init only */

That's half a million pages, or 2GB of physical ram on a 4K page size
system. How is this going to work on early boot in the machines in the
cluster with less than 2GB of ram?

--
Peter Wemm - pe...@wemm.org; pe...@freebsd.org; pe...@yahoo-inc.com; KI6FJV
UTF-8: for when a ' or ... just won't do…
Re: svn commit: r270759 - in head/sys: cddl/compat/opensolaris/kern cddl/compat/opensolaris/sys cddl/contrib/opensolaris/uts/common/fs/zfs vm
- Original Message - From: "Peter Wemm"

On Friday 29 August 2014 20:51:03 Steven Hartland wrote:
snip..
> Does Karl's explanation as to why this doesn't work above change your
> mind?

Actually no, I would expect the code as committed would *cause* the
undesirable behavior that Karl described.

ie: access a few large files and cause them to reside in cache. Say 50GB or
so on a 200G ram machine. We now have the state where:

v_cache = 50GB
v_free = 1MB

The rest of the vm system looks at vm_paging_needed(), which is: do we have
enough "v_cache + v_free"? Since there's 50.001GB free, the answer is no.
It'll let v_free run right down to v_free_min because of the giant pool of
v_cache just sitting there, waiting to be used.

The zfs change, as committed, will ignore all the free memory in the form of
v_cache.. and will be freaking out about how low v_free is getting and will
be sacrificing ARC in order to put more memory into the v_free pool.

As long as ARC keeps sacrificing itself this way, the free pages in the
v_cache pool won't get used. When ARC finally runs out of pages to give up
to v_free, the kernel will start using the free pages from v_cache.
Eventually it'll run down that v_cache free pool and arc will be in a bare
minimum state while this is happening.

Meanwhile, ZFS ARC will be crippled. This has consequences - it does RCU
like things from ARC to keep fragmentation under control. With ARC
crippled, fragmentation will increase because there's less opportunistic
gathering of data from ARC.

Granted, you have to get things freed from active/inactive to the cache
state, but once it's there, depending on the workload, it'll mess with ARC.

There's already a vm_paging_needed() check in there below so this will
already be dealt with will it not?

Regards
Steve
Re: svn commit: r270759 - in head/sys: cddl/compat/opensolaris/kern cddl/compat/opensolaris/sys cddl/contrib/opensolaris/uts/common/fs/zfs vm
On Friday 29 August 2014 20:51:03 Steven Hartland wrote:
> > On Friday 29 August 2014 11:54:42 Alan Cox wrote:
> snip...
> > > > With the patch we confirmed that both RAM usage and performance
> > > > for those seeing that issue are resolved, with no reported
> > > > regressions.
> > > >
> > > >> (I should know better than to fire a reply off before full fact
> > > >> checking, but this commit worries me..)
> > > >
> > > > Not a problem, it's great to know people pay attention to changes,
> > > > and raise their concerns. Always better to have a discussion about
> > > > potential issues than to wait for a problem to occur.
> > > >
> > > > Hopefully the above gives you some peace of mind, but if you still
> > > > have any concerns I'm all ears.
> > >
> > > You didn't really address Peter's initial technical issue. Peter
> > > correctly observed that cache pages are just another flavor of free
> > > pages. Whenever the VM system is checking the number of free pages
> > > against any of the thresholds, it always uses the sum of
> > > v_cache_count and v_free_count. So, to anyone familiar with the VM
> > > system, like Peter, what you've done, which is to derive a threshold
> > > from v_free_target but only compare v_free_count to that threshold,
> > > looks highly suspect.
> > I think I'd like to see something like this:
> >
> > Index: cddl/compat/opensolaris/kern/opensolaris_kmem.c
> > ===================================================================
> > --- cddl/compat/opensolaris/kern/opensolaris_kmem.c (revision 270824)
> > +++ cddl/compat/opensolaris/kern/opensolaris_kmem.c (working copy)
> > @@ -152,7 +152,8 @@
> >  kmem_free_count(void)
> >  {
> > -	return (vm_cnt.v_free_count);
> > +	/* "cache" is just a flavor of free pages in FreeBSD */
> > +	return (vm_cnt.v_free_count + vm_cnt.v_cache_count);
> >  }
> >
> >  u_int
>
> This has apparently already been tried and the response from Karl was:
>
> - No, because memory in "cache" is subject to being either reallocated
> - or freed. When I was developing this patch that was my first
> - impression as well and how I originally coded it, and it turned out
> - to be wrong.
> -
> - The issue here is that you have two parts of the system contending
> - for RAM -- the VM system generally, and the ARC cache. If the ARC
> - cache frees space before the VM system activates and starts pruning
> - then you wind up with the ARC pinned at the minimum after some period
> - of time, because it releases "early."
>
> I've asked him if he would retest just to be sure.
>
> > The rest of the system looks at the "big picture"; it would be happy
> > to let the "free" pool run quite a way down so long as there's "cache"
> > pages available to satisfy the free space requirements. This would
> > lead ZFS to mistakenly sacrifice ARC for no reason. I'm not sure how
> > big a deal this is, but I can't imagine many scenarios where I want
> > ARC to be discarded in order to save some effectively free pages.
>
> From Karl's response from the original PR (above) it seems like this
> causes unexpected behaviour due to the two systems being separate.
> > > That said, I can easily believe that your patch works better than
> > > the existing code, because it is closer in spirit to my
> > > interpretation of what the Solaris code does. Specifically, I
> > > believe that the Solaris code starts trimming the ARC before the
> > > Solaris page daemon starts writing dirty pages to secondary storage.
> > > Now, you've made FreeBSD do the same. However, you've expressed it
> > > in a way that looks broken.
> > >
> > > To wrap up, I think that you can easily write this in a way that
> > > simultaneously behaves like Solaris and doesn't look wrong to a VM
> > > expert.
> > >
> > > > Out of interest would it be possible to update machines in the
> > > > cluster to see how their workload reacts to the change?
> >
> > I'd like to see the free vs cache thing resolved first but it's going
> > to be tricky to get a comparison.
>
> Does Karl's explanation as to why this doesn't work above change your
> mind?

Actually no, I would expect the code as committed would *cause* the
undesirable behavior that Karl described.

ie: access a few large files and cause them to reside in cache. Say 50GB or
so on a 200G ram machine. We now have the state where:

v_cache = 50GB
v_free = 1MB

The rest of the vm system looks at vm_paging_needed(), which is: do we have
enough "v_cache + v_free"? Since there's 50.001GB free, the answer is no.
It'll let v_free run right down to v_free_min because of the giant pool of
v_cache just sitting there, waiting to be used.

The zfs change, as committed, will ignore all the free memory in the form of
v_cache.. and will be freaking out about how low v_free is getting and will
be sacrificing ARC in order to put more memory into the
Re: svn commit: r270759 - in head/sys: cddl/compat/opensolaris/kern cddl/compat/opensolaris/sys cddl/contrib/opensolaris/uts/common/fs/zfs vm
- Original Message - From: "Steven Hartland"

- Original Message - From: "Alan Cox"

You didn't really address Peter's initial technical issue. Peter correctly
observed that cache pages are just another flavor of free pages. Whenever
the VM system is checking the number of free pages against any of the
thresholds, it always uses the sum of v_cache_count and v_free_count. So,
to anyone familiar with the VM system, like Peter, what you've done, which
is to derive a threshold from v_free_target but only compare v_free_count
to that threshold, looks highly suspect.

That said, I can easily believe that your patch works better than the
existing code, because it is closer in spirit to my interpretation of what
the Solaris code does. Specifically, I believe that the Solaris code starts
trimming the ARC before the Solaris page daemon starts writing dirty pages
to secondary storage. Now, you've made FreeBSD do the same. However,
you've expressed it in a way that looks broken.

To wrap up, I think that you can easily write this in a way that
simultaneously behaves like Solaris and doesn't look wrong to a VM expert.

More details in my last reply on this, but in short it seems this has
already been tried and it didn't work. I'd be interested in what domain
experts think about why that is?

In the mean time I've asked Karl to see if he could retest with this change
to confirm counting cache pages along with free does indeed still cause a
problem.

For those that want to catch up on what has already been tested see the
original PR:
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=187594

Copying in Karl's response from the PR for easy access:

I can give it a shot on my test system, but I will note that for my
production machine right now that would be a no-op as there are no cache
pages in use. Getting that on the production system with a high level of
load over the holiday is rather unlikely due to it being Labor Day, so all
I will have are my synthetics.
I do have some time over the long weekend to run that test and should be
able to do so if you think it's useful, but I am wary of that approach being
correct in the general case due to my previous experience.

My commentary on that discussion point and reasoning originally is here
(http://lists.freebsd.org/pipermail/freebsd-fs/2014-March/019084.html);
along that thread you'll see a vmstat output with the current paradigm
showing that cache doesn't grow without boundary, as many suggested it
would if I didn't include those pages as "free" and thus available for ARC
to attempt to invade (effectively trying to force the VM system to evict
them from the cache list by having it wake up first.) Indeed, up at comment
#31 is the patch on that production system that has been up for about four
months, and you can see only a few pages are in the cached state. That is
fairly typical.

What is the expected reason for concern about including them in the free
count for ARC (when they're not in terms of what wakes up the VM system as
a whole)? If the expected area of concern is that cache pages will grow
without boundary (or close to it), the evidence from both my production
machine and others strongly suggests that's incorrect.

My work suggests strongly that the most-likely to be correct behavior for
the majority of workloads is achieved when both the ARC paring routine and
VM page-cleaning system wake up at the same time. That occurs when
vm_cnt.v_free_count is invaded. Attempting to bias the outcome to force
either the VM system or the ARC cache to do something first appears to
increase the circumstances under which bad behavior occurs.
Re: svn commit: r270759 - in head/sys: cddl/compat/opensolaris/kern cddl/compat/opensolaris/sys cddl/contrib/opensolaris/uts/common/fs/zfs vm
- Original Message - From: "Alan Cox"

You didn't really address Peter's initial technical issue. Peter correctly
observed that cache pages are just another flavor of free pages. Whenever
the VM system is checking the number of free pages against any of the
thresholds, it always uses the sum of v_cache_count and v_free_count. So,
to anyone familiar with the VM system, like Peter, what you've done, which
is to derive a threshold from v_free_target but only compare v_free_count
to that threshold, looks highly suspect.

That said, I can easily believe that your patch works better than the
existing code, because it is closer in spirit to my interpretation of what
the Solaris code does. Specifically, I believe that the Solaris code starts
trimming the ARC before the Solaris page daemon starts writing dirty pages
to secondary storage. Now, you've made FreeBSD do the same. However,
you've expressed it in a way that looks broken.

To wrap up, I think that you can easily write this in a way that
simultaneously behaves like Solaris and doesn't look wrong to a VM expert.

More details in my last reply on this, but in short it seems this has
already been tried and it didn't work. I'd be interested in what domain
experts think about why that is?

In the mean time I've asked Karl to see if he could retest with this change
to confirm counting cache pages along with free does indeed still cause a
problem.

For those that want to catch up on what has already been tested see the
original PR:
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=187594

Regards
Steve
Re: svn commit: r270759 - in head/sys: cddl/compat/opensolaris/kern cddl/compat/opensolaris/sys cddl/contrib/opensolaris/uts/common/fs/zfs vm
On Friday 29 August 2014 11:54:42 Alan Cox wrote:
snip...
> > Others have also confirmed that even with r265945 they can still
> > trigger the performance issue.
> >
> > In addition, without it we still have loads of RAM sat there unused;
> > in my particular experience we have 40GB of 192GB sitting there
> > unused, and that was with a stable build from last weekend.
>
> The Solaris code only imposed this limit on 32-bit machines where the
> available kernel virtual address space may be much less than the
> available physical memory. Previously, FreeBSD imposed this limit on
> both 32-bit and 64-bit machines. Now, it imposes it on neither. Why
> continue to do this differently from Solaris?

My understanding is these limits were totally different on Solaris; see the
#ifdef sun block in arc_reclaim_needed() for details. I actually started at
matching the Solaris flow but this had already been tested and proved not
to work as well as the current design.

Since the question was asked below, we don't have zfs machines in the
cluster running i386. We can barely get them to boot as it is due to kva
pressure. We have to reduce/cap physical memory and change the user/kernel
virtual split from 3:1 to 2.5:1.5. We do run zfs on small amd64 machines
with 2G of ram, but I can't imagine it working on the 10G i386 PAE machines
that we have.

> > With the patch we confirmed that both RAM usage and performance for
> > those seeing that issue are resolved, with no reported regressions.
> >
> >> (I should know better than to fire a reply off before full fact
> >> checking, but this commit worries me..)
> >
> > Not a problem, it's great to know people pay attention to changes,
> > and raise their concerns. Always better to have a discussion about
> > potential issues than to wait for a problem to occur.
> >
> > Hopefully the above gives you some peace of mind, but if you still
> > have any concerns I'm all ears.
> You didn't really address Peter's initial technical issue. Peter
> correctly observed that cache pages are just another flavor of free
> pages. Whenever the VM system is checking the number of free pages
> against any of the thresholds, it always uses the sum of v_cache_count
> and v_free_count. So, to anyone familiar with the VM system, like
> Peter, what you've done, which is to derive a threshold from
> v_free_target but only compare v_free_count to that threshold, looks
> highly suspect.

I think I'd like to see something like this:

Index: cddl/compat/opensolaris/kern/opensolaris_kmem.c
===================================================================
--- cddl/compat/opensolaris/kern/opensolaris_kmem.c (revision 270824)
+++ cddl/compat/opensolaris/kern/opensolaris_kmem.c (working copy)
@@ -152,7 +152,8 @@
 kmem_free_count(void)
 {
-	return (vm_cnt.v_free_count);
+	/* "cache" is just a flavor of free pages in FreeBSD */
+	return (vm_cnt.v_free_count + vm_cnt.v_cache_count);
 }

 u_int

This has apparently already been tried and the response from Karl was:

- No, because memory in "cache" is subject to being either reallocated
- or freed. When I was developing this patch that was my first
- impression as well and how I originally coded it, and it turned out
- to be wrong.
-
- The issue here is that you have two parts of the system contending for
- RAM -- the VM system generally, and the ARC cache. If the ARC cache
- frees space before the VM system activates and starts pruning then you
- wind up with the ARC pinned at the minimum after some period of time,
- because it releases "early."

I've asked him if he would retest just to be sure.

The rest of the system looks at the "big picture"; it would be happy to
let the "free" pool run quite a way down so long as there's "cache" pages
available to satisfy the free space requirements. This would lead ZFS to
mistakenly sacrifice ARC for no reason.
I'm not sure how big a deal this is, but I can't imagine many scenarios
where I want ARC to be discarded in order to save some effectively free
pages.

From Karl's response from the original PR (above) it seems like this causes
unexpected behaviour due to the two systems being separate.

> That said, I can easily believe that your patch works better than the
> existing code, because it is closer in spirit to my interpretation of
> what the Solaris code does. Specifically, I believe that the Solaris
> code starts trimming the ARC before the Solaris page daemon starts
> writing dirty pages to secondary storage. Now, you've made FreeBSD do
> the same. However, you've expressed it in a way that looks broken.
>
> To wrap up, I think that you can easily write this in a way that
> simultaneously behaves like Solaris and doesn't look wrong to a VM
> expert.
>
> > Out of interest would it be possible to update machines in the
> > cluster to see how their workload reacts to the change?
>
> I'd like to see the fr
Re: svn commit: r270759 - in head/sys: cddl/compat/opensolaris/kern cddl/compat/opensolaris/sys cddl/contrib/opensolaris/uts/common/fs/zfs vm
On Friday 29 August 2014 11:54:42 Alan Cox wrote:
> On 08/29/2014 03:32, Steven Hartland wrote:
> >> On Thursday 28 August 2014 17:30:17 Alan Cox wrote:
> >> > On 08/28/2014 16:15, Matthew D. Fuller wrote:
> >> > > On Thu, Aug 28, 2014 at 10:11:39PM +0100 I heard the voice of
> >> > > Steven Hartland, and lo! it spake thus:
> >> > >> Its very likely applicable to stable/9 although I've never used
> >> > >> 9 myself, we jumped from 9 direct to 10.
> >> > >
> >> > > This is actually hitting two different issues from the two bugs:
> >> > >
> >> > > - 191510 is about "ARC isn't greedy enough" on huge-memory
> >> > > machines, and from the osreldate that bug was filed on 9.2, so
> >> > > presumably is applicable.
> >> > >
> >> > > - 187594 is about "ARC is too greedy" (probably mostly on
> >> > > not-so-huge machines) and starves/drives the rest of the system
> >> > > into swap. That I believe came about as a result of some
> >> > > unrelated change in the 10.x stream that upset the previous
> >> > > balance between ARC and the rest of the VM, so isn't a problem
> >> > > on 9.x.
> >> >
> >> > 10.0 had a bug in the page daemon that was fixed in 10-STABLE about
> >> > three months ago (r265945). The ARC was not the only thing
> >> > affected by this bug.
> >>
> >> I'm concerned about potential unintended consequences of this change.
> >>
> >> Before, arc reclaim was driven by vm_paging_needed(), which was:
> >>
> >> vm_paging_needed(void)
> >> {
> >> 	return (vm_cnt.v_free_count + vm_cnt.v_cache_count <
> >> 	    vm_pageout_wakeup_thresh);
> >> }
> >>
> >> Now it's ignoring the v_cache_count and looking exclusively at
> >> v_free_count.
> >> "cache" pages are free pages that just happen to have known contents.
> >> If I > >> read this change right, zfs arc will now discard checksummed cache > >> pages to > > > >> make room for non-checksummed pages: > > That test is still there so if it needs to it will still trigger. > > > > However that often a lower level as vm_pageout_wakeup_thresh is only 110% > > of min free, where as zfs_arc_free_target is based of target free > > which is > > 4 * (min free + reserved). > > > >> + if (kmem_free_count() < zfs_arc_free_target) { > >> + return (1); > >> + } > >> ... > >> +kmem_free_count(void) > >> +{ > >> + return (vm_cnt.v_free_count); > >> +} > >> > >> This seems like a pretty substantial behavior change. I'm concerned > >> that it > >> doesn't appear to count all the forms of "free" pages. > >> > >> I haven't seen the problems with the over-aggressive ARC since the > >> page daemon > >> bug was fixed. It's been working fine under pretty abusive loads in > >> the freebsd > >> cluster after that fix. > > > > Others have also confirmed that even with r265945 they can still trigger > > performance issue. > > > > In addition without it we still have loads of RAM sat their unused, in my > > particular experience we have 40GB of 192GB sitting their unused and that > > was with a stable build from last weekend. > > The Solaris code only imposed this limit on 32-bit machines where the > available kernel virtual address space may be much less than the > available physical memory. Previously, FreeBSD imposed this limit on > both 32-bit and 64-bit machines. Now, it imposes it on neither. Why > continue to do this differently from Solaris? Since the question was asked below, we don't have zfs machines in the cluster running i386. We can barely get them to boot as it is due to kva pressure. We have to reduce/cap physical memory and change the user/kernel virtual split from 3:1 to 2.5:1.5. We do run zfs on small amd64 machines with 2G of ram, but I can't imagine it working on the 10G i386 PAE machines that we have. 
> > With the patch we confirmed that both RAM usage and performance for those > > seeing that issue are resolved, with no reported regressions. > > > >> (I should know better than to fire a reply off before full fact > >> checking, but > >> this commit worries me..) > > > > Not a problem, its great to know people pay attention to changes, and > > raise > > their concerns. Always better to have a discussion about potential issues > > than to wait for a problem to occur. > > > > Hopefully the above gives you some piece of mind, but if you still > > have any > > concerns I'm all ears. > > You didn't really address Peter's initial technical issue. Peter > correctly observed that cache pages are just another flavor of free > pages. Whenever the VM system is checking the number of free pages > against any of the thresholds, it always uses the sum of v_cache_count > and v_free_count. So, to anyone familiar with the VM system, like > Peter, what you've done, which is to derive a threshold from > v_free_target but only compare v_free_count to
Re: svn commit: r270759 - in head/sys: cddl/compat/opensolaris/kern cddl/compat/opensolaris/sys cddl/contrib/opensolaris/uts/common/fs/zfs vm
On 08/29/2014 03:32, Steven Hartland wrote:
>> On Thursday 28 August 2014 17:30:17 Alan Cox wrote:
>>> On 08/28/2014 16:15, Matthew D. Fuller wrote:
>>>> On Thu, Aug 28, 2014 at 10:11:39PM +0100 I heard the voice of
>>>> Steven Hartland, and lo! it spake thus:
>>>>> Its very likely applicable to stable/9 although I've never used 9
>>>>> myself, we jumped from 9 direct to 10.
>>>>
>>>> This is actually hitting two different issues from the two bugs:
>>>>
>>>> - 191510 is about "ARC isn't greedy enough" on huge-memory machines,
>>>>   and from the osreldate that bug was filed on 9.2, so presumably is
>>>>   applicable.
>>>>
>>>> - 187594 is about "ARC is too greedy" (probably mostly on not-so-huge
>>>>   machines) and starves/drives the rest of the system into swap. That
>>>>   I believe came about as a result of some unrelated change in the
>>>>   10.x stream that upset the previous balance between ARC and the rest
>>>>   of the VM, so isn't a problem on 9.x.
>>>
>>> 10.0 had a bug in the page daemon that was fixed in 10-STABLE about
>>> three months ago (r265945). The ARC was not the only thing affected by
>>> this bug.
>>
>> I'm concerned about potential unintended consequences of this change.
>>
>> Before, arc reclaim was driven by vm_paging_needed(), which was:
>>
>> vm_paging_needed(void)
>> {
>>     return (vm_cnt.v_free_count + vm_cnt.v_cache_count <
>>         vm_pageout_wakeup_thresh);
>> }
>>
>> Now it's ignoring the v_cache_count and looking exclusively at
>> v_free_count. "cache" pages are free pages that just happen to have
>> known contents. If I read this change right, zfs arc will now discard
>> checksummed cache pages to make room for non-checksummed pages:
>
> That test is still there, so if it needs to it will still trigger.
>
> However that's often at a lower level, as vm_pageout_wakeup_thresh is
> only 110% of min free, whereas zfs_arc_free_target is based off target
> free, which is 4 * (min free + reserved).
>
>> +	if (kmem_free_count() < zfs_arc_free_target) {
>> +		return (1);
>> +	}
>> ...
>> +kmem_free_count(void)
>> +{
>> +	return (vm_cnt.v_free_count);
>> +}
>>
>> This seems like a pretty substantial behavior change. I'm concerned
>> that it doesn't appear to count all the forms of "free" pages.
>>
>> I haven't seen the problems with the over-aggressive ARC since the page
>> daemon bug was fixed. It's been working fine under pretty abusive loads
>> in the freebsd cluster after that fix.
>
> Others have also confirmed that even with r265945 they can still trigger
> the performance issue.
>
> In addition, without it we still have loads of RAM sat there unused; in
> my particular experience we have 40GB of 192GB sitting there unused, and
> that was with a stable build from last weekend.

The Solaris code only imposed this limit on 32-bit machines where the
available kernel virtual address space may be much less than the available
physical memory. Previously, FreeBSD imposed this limit on both 32-bit and
64-bit machines. Now, it imposes it on neither. Why continue to do this
differently from Solaris?

> With the patch we confirmed that both RAM usage and performance for those
> seeing that issue are resolved, with no reported regressions.
>
>> (I should know better than to fire a reply off before full fact
>> checking, but this commit worries me..)
>
> Not a problem, it's great to know people pay attention to changes and
> raise their concerns. Always better to have a discussion about potential
> issues than to wait for a problem to occur.
>
> Hopefully the above gives you some peace of mind, but if you still have
> any concerns I'm all ears.

You didn't really address Peter's initial technical issue. Peter correctly
observed that cache pages are just another flavor of free pages. Whenever
the VM system is checking the number of free pages against any of the
thresholds, it always uses the sum of v_cache_count and v_free_count. So,
to anyone familiar with the VM system, like Peter, what you've done, which
is to derive a threshold from v_free_target but only compare v_free_count
to that threshold, looks highly suspect.

That said, I can easily believe that your patch works better than the
existing code, because it is closer in spirit to my interpretation of what
the Solaris code does. Specifically, I believe that the Solaris code
starts trimming the ARC before the Solaris page daemon starts writing
dirty pages to secondary storage. Now, you've made FreeBSD do the same.
However, you've expressed it in a way that looks broken.

To wrap up, I think that you can easily write this in a way that
simultaneously behaves like Solaris and doesn't look wrong to a VM expert.

> Out of interest would it be possible to update machines in the cluster to
> see how their workload reacts to the change?
>
> Regards
> Steve
Re: svn commit: r270759 - in head/sys: cddl/compat/opensolaris/kern cddl/compat/opensolaris/sys cddl/contrib/opensolaris/uts/common/fs/zfs vm
On 8/28/2014 6:22 PM, Peter Wemm wrote:
[...]
> vm_paging_needed(void)
> {
>     return (vm_cnt.v_free_count + vm_cnt.v_cache_count <
>         vm_pageout_wakeup_thresh);
> }

By the way, there is a signed-to-unsigned comparison here.

-- 
Regards,
Bryan Drewery
Re: svn commit: r270759 - in head/sys: cddl/compat/opensolaris/kern cddl/compat/opensolaris/sys cddl/contrib/opensolaris/uts/common/fs/zfs vm
> On Thursday 28 August 2014 17:30:17 Alan Cox wrote:
>> On 08/28/2014 16:15, Matthew D. Fuller wrote:
>>> On Thu, Aug 28, 2014 at 10:11:39PM +0100 I heard the voice of
>>> Steven Hartland, and lo! it spake thus:
>>>> Its very likely applicable to stable/9 although I've never used 9
>>>> myself, we jumped from 9 direct to 10.
>>>
>>> This is actually hitting two different issues from the two bugs:
>>>
>>> - 191510 is about "ARC isn't greedy enough" on huge-memory machines,
>>>   and from the osreldate that bug was filed on 9.2, so presumably is
>>>   applicable.
>>>
>>> - 187594 is about "ARC is too greedy" (probably mostly on not-so-huge
>>>   machines) and starves/drives the rest of the system into swap. That
>>>   I believe came about as a result of some unrelated change in the
>>>   10.x stream that upset the previous balance between ARC and the rest
>>>   of the VM, so isn't a problem on 9.x.
>>
>> 10.0 had a bug in the page daemon that was fixed in 10-STABLE about
>> three months ago (r265945). The ARC was not the only thing affected by
>> this bug.
>
> I'm concerned about potential unintended consequences of this change.
>
> Before, arc reclaim was driven by vm_paging_needed(), which was:
>
> vm_paging_needed(void)
> {
>     return (vm_cnt.v_free_count + vm_cnt.v_cache_count <
>         vm_pageout_wakeup_thresh);
> }
>
> Now it's ignoring the v_cache_count and looking exclusively at
> v_free_count. "cache" pages are free pages that just happen to have
> known contents. If I read this change right, zfs arc will now discard
> checksummed cache pages to make room for non-checksummed pages:

That test is still there, so if it needs to it will still trigger.

However that's often at a lower level, as vm_pageout_wakeup_thresh is only
110% of min free, whereas zfs_arc_free_target is based off target free,
which is 4 * (min free + reserved).

> +	if (kmem_free_count() < zfs_arc_free_target) {
> +		return (1);
> +	}
> ...
> +kmem_free_count(void)
> +{
> +	return (vm_cnt.v_free_count);
> +}
>
> This seems like a pretty substantial behavior change. I'm concerned that
> it doesn't appear to count all the forms of "free" pages.
>
> I haven't seen the problems with the over-aggressive ARC since the page
> daemon bug was fixed. It's been working fine under pretty abusive loads
> in the freebsd cluster after that fix.

Others have also confirmed that even with r265945 they can still trigger
the performance issue.

In addition, without it we still have loads of RAM sat there unused; in my
particular experience we have 40GB of 192GB sitting there unused, and that
was with a stable build from last weekend.

With the patch we confirmed that both RAM usage and performance for those
seeing that issue are resolved, with no reported regressions.

> (I should know better than to fire a reply off before full fact checking,
> but this commit worries me..)

Not a problem, it's great to know people pay attention to changes and
raise their concerns. Always better to have a discussion about potential
issues than to wait for a problem to occur.

Hopefully the above gives you some peace of mind, but if you still have
any concerns I'm all ears.

Out of interest would it be possible to update machines in the cluster to
see how their workload reacts to the change?

Regards
Steve

___
svn-src-all@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/svn-src-all
To unsubscribe, send any mail to "svn-src-all-unsubscr...@freebsd.org"
Re: svn commit: r270759 - in head/sys: cddl/compat/opensolaris/kern cddl/compat/opensolaris/sys cddl/contrib/opensolaris/uts/common/fs/zfs vm
On Thu, Aug 28, 2014 at 10:11:39PM +0100, Steven Hartland wrote:
> I'll MFC to 9 and possibly 8 as well at that time if relevant.

Although I don't use ZFS on stable/8, your care about our best stable
branch :) is much appreciated.

./danfe
Re: svn commit: r270759 - in head/sys: cddl/compat/opensolaris/kern cddl/compat/opensolaris/sys cddl/contrib/opensolaris/uts/common/fs/zfs vm
On Thursday 28 August 2014 17:30:17 Alan Cox wrote:
> On 08/28/2014 16:15, Matthew D. Fuller wrote:
>> On Thu, Aug 28, 2014 at 10:11:39PM +0100 I heard the voice of
>> Steven Hartland, and lo! it spake thus:
>>> Its very likely applicable to stable/9 although I've never used 9
>>> myself, we jumped from 9 direct to 10.
>>
>> This is actually hitting two different issues from the two bugs:
>>
>> - 191510 is about "ARC isn't greedy enough" on huge-memory machines,
>>   and from the osreldate that bug was filed on 9.2, so presumably is
>>   applicable.
>>
>> - 187594 is about "ARC is too greedy" (probably mostly on not-so-huge
>>   machines) and starves/drives the rest of the system into swap. That
>>   I believe came about as a result of some unrelated change in the
>>   10.x stream that upset the previous balance between ARC and the rest
>>   of the VM, so isn't a problem on 9.x.
>
> 10.0 had a bug in the page daemon that was fixed in 10-STABLE about
> three months ago (r265945). The ARC was not the only thing affected by
> this bug.

I'm concerned about potential unintended consequences of this change.

Before, arc reclaim was driven by vm_paging_needed(), which was:

vm_paging_needed(void)
{
    return (vm_cnt.v_free_count + vm_cnt.v_cache_count <
        vm_pageout_wakeup_thresh);
}

Now it's ignoring the v_cache_count and looking exclusively at
v_free_count. "cache" pages are free pages that just happen to have known
contents. If I read this change right, zfs arc will now discard
checksummed cache pages to make room for non-checksummed pages:

+	if (kmem_free_count() < zfs_arc_free_target) {
+		return (1);
+	}
...
+kmem_free_count(void)
+{
+	return (vm_cnt.v_free_count);
+}

This seems like a pretty substantial behavior change. I'm concerned that
it doesn't appear to count all the forms of "free" pages.

I haven't seen the problems with the over-aggressive ARC since the page
daemon bug was fixed. It's been working fine under pretty abusive loads in
the freebsd cluster after that fix.

(I should know better than to fire a reply off before full fact checking,
but this commit worries me..)

-- 
Peter Wemm - pe...@wemm.org; pe...@freebsd.org; pe...@yahoo-inc.com; KI6FJV
UTF-8: for when a ' or ... just won’t do…
Re: svn commit: r270759 - in head/sys: cddl/compat/opensolaris/kern cddl/compat/opensolaris/sys cddl/contrib/opensolaris/uts/common/fs/zfs vm
On 08/28/2014 16:15, Matthew D. Fuller wrote:
> On Thu, Aug 28, 2014 at 10:11:39PM +0100 I heard the voice of
> Steven Hartland, and lo! it spake thus:
>> Its very likely applicable to stable/9 although I've never used 9
>> myself, we jumped from 9 direct to 10.
>
> This is actually hitting two different issues from the two bugs:
>
> - 191510 is about "ARC isn't greedy enough" on huge-memory machines,
>   and from the osreldate that bug was filed on 9.2, so presumably is
>   applicable.
>
> - 187594 is about "ARC is too greedy" (probably mostly on not-so-huge
>   machines) and starves/drives the rest of the system into swap. That
>   I believe came about as a result of some unrelated change in the
>   10.x stream that upset the previous balance between ARC and the rest
>   of the VM, so isn't a problem on 9.x.

10.0 had a bug in the page daemon that was fixed in 10-STABLE about three
months ago (r265945). The ARC was not the only thing affected by this bug.
Re: svn commit: r270759 - in head/sys: cddl/compat/opensolaris/kern cddl/compat/opensolaris/sys cddl/contrib/opensolaris/uts/common/fs/zfs vm
On Thu, 28 Aug 2014, Steven Hartland wrote:
>> Thank you, and also Karl and Devin for this work!
>>
>> Is this change applicable to stable/9? I suppose, pretty large base of
>> users could benefit from this.
>
> Its very likely applicable to stable/9 although I've never used 9 myself,
> we jumped from 9 direct to 10.
>
> My target for this is to make the cutoff for 10.1 freeze.
>
> I'll MFC to 9 and possibly 8 as well at that time if relevant.

No rush for other stable branches, surely; also, we're at the EoL (EoS at
least) for stable/8 -- but if the changeset is not too harsh I see no
reason not to merge it down there.

As I @work have some loaded machines with ZFS base on at least 9 (while
not really memory-loaded, not more than 32G -- but sometimes rather heavy
loaded with userbase apps), I would be glad to test changesets if it's
feasible.

Thanks again!

-- 
Sincerely, D.Marck [DM5020, MCK-RIPE, DM3-RIPN] [ FreeBSD committer: ma...@freebsd.org ]
*** Dmitry Morozovsky --- D.Marck --- Wild Woozle --- ma...@rinet.ru ***
Re: svn commit: r270759 - in head/sys: cddl/compat/opensolaris/kern cddl/compat/opensolaris/sys cddl/contrib/opensolaris/uts/common/fs/zfs vm
On Thu, Aug 28, 2014 at 10:11:39PM +0100 I heard the voice of
Steven Hartland, and lo! it spake thus:
> Its very likely applicable to stable/9 although I've never used 9
> myself, we jumped from 9 direct to 10.

This is actually hitting two different issues from the two bugs:

- 191510 is about "ARC isn't greedy enough" on huge-memory machines, and
  from the osreldate that bug was filed on 9.2, so presumably is
  applicable.

- 187594 is about "ARC is too greedy" (probably mostly on not-so-huge
  machines) and starves/drives the rest of the system into swap. That I
  believe came about as a result of some unrelated change in the 10.x
  stream that upset the previous balance between ARC and the rest of the
  VM, so isn't a problem on 9.x.

-- 
Matthew Fuller (MF4839) | fulle...@over-yonder.net
Systems/Network Administrator | http://www.over-yonder.net/~fullermd/
On the Internet, nobody can hear you scream.
Re: svn commit: r270759 - in head/sys: cddl/compat/opensolaris/kern cddl/compat/opensolaris/sys cddl/contrib/opensolaris/uts/common/fs/zfs vm
- Original Message - From: "Dmitry Morozovsky" Steve, On Thu, 28 Aug 2014, Steven Hartland wrote: Author: smh Date: Thu Aug 28 19:50:08 2014 New Revision: 270759 URL: http://svnweb.freebsd.org/changeset/base/270759 Log: Refactor ZFS ARC reclaim logic to be more VM cooperative [snip] Credit to Karl Denninger for the original patch on which this update was based. Tested by: dteske MFC after: 1 week Relnotes: yes Sponsored by: Multiplay Thank you, and also Karl and Devin for this work! Is this change applicable to stable/9? I suppose, pretty large base of users could benefit from this. Its very likely applicable to stable/9 although I've never used 9 myself, we jumped from 9 direct to 10. My target for this is to make the cutoff for 10.1 freeze. I'll MFC to 9 and possibly 8 as well at that time if relavent. Regards Steve ___ svn-src-all@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/svn-src-all To unsubscribe, send any mail to "svn-src-all-unsubscr...@freebsd.org"
Re: svn commit: r270759 - in head/sys: cddl/compat/opensolaris/kern cddl/compat/opensolaris/sys cddl/contrib/opensolaris/uts/common/fs/zfs vm
Steve,

On Thu, 28 Aug 2014, Steven Hartland wrote:
> Author: smh
> Date: Thu Aug 28 19:50:08 2014
> New Revision: 270759
> URL: http://svnweb.freebsd.org/changeset/base/270759
>
> Log:
>   Refactor ZFS ARC reclaim logic to be more VM cooperative
[snip]
>   Credit to Karl Denninger for the original patch on which this update
>   was based.
>   Tested by:	dteske
>   MFC after:	1 week
>   Relnotes:	yes
>   Sponsored by:	Multiplay

Thank you, and also Karl and Devin for this work!

Is this change applicable to stable/9? I suppose, pretty large base of
users could benefit from this.

-- 
Sincerely, D.Marck [DM5020, MCK-RIPE, DM3-RIPN] [ FreeBSD committer: ma...@freebsd.org ]
*** Dmitry Morozovsky --- D.Marck --- Wild Woozle --- ma...@rinet.ru ***
svn commit: r270759 - in head/sys: cddl/compat/opensolaris/kern cddl/compat/opensolaris/sys cddl/contrib/opensolaris/uts/common/fs/zfs vm
Author: smh
Date: Thu Aug 28 19:50:08 2014
New Revision: 270759
URL: http://svnweb.freebsd.org/changeset/base/270759

Log:
  Refactor ZFS ARC reclaim logic to be more VM cooperative

  Prior to this change we triggered ARC reclaim when kmem usage passed 3/4
  of the total available, as indicated by vmem_size(kmem_arena, VMEM_ALLOC).

  This could lead to large amounts of unused RAM e.g. on a 192GB machine
  with ARC the only major RAM consumer, 40GB of RAM would remain unused.

  The old method has also been seen to result in extreme RAM usage under
  certain loads, causing poor performance and stalls.

  We now trigger ARC reclaim when the number of free pages drops below the
  value defined by the new sysctl vfs.zfs.arc_free_target, which defaults
  to the value of vm.v_free_target.

  Credit to Karl Denninger for the original patch on which this update was
  based.

  PR:		191510 and 187594
  Tested by:	dteske
  MFC after:	1 week
  Relnotes:	yes
  Sponsored by:	Multiplay

Modified:
  head/sys/cddl/compat/opensolaris/kern/opensolaris_kmem.c
  head/sys/cddl/compat/opensolaris/sys/kmem.h
  head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c
  head/sys/vm/vm_pageout.c

Modified: head/sys/cddl/compat/opensolaris/kern/opensolaris_kmem.c
==============================================================================
--- head/sys/cddl/compat/opensolaris/kern/opensolaris_kmem.c	Thu Aug 28 18:59:39 2014	(r270758)
+++ head/sys/cddl/compat/opensolaris/kern/opensolaris_kmem.c	Thu Aug 28 19:50:08 2014	(r270759)
@@ -126,18 +126,47 @@ kmem_size_init(void *unused __unused)
 }
 SYSINIT(kmem_size_init, SI_SUB_KMEM, SI_ORDER_ANY, kmem_size_init, NULL);
 
-uint64_t
-kmem_size(void)
+/*
+ * The return values from kmem_free_* are only valid once the pagedaemon
+ * has been initialised, before then they return 0.
+ *
+ * To ensure the returns are valid the caller can use a SYSINIT with
+ * subsystem set to SI_SUB_KTHREAD_PAGE and an order of at least
+ * SI_ORDER_SECOND.
+ */
+u_int
+kmem_free_target(void)
 {
 
-	return (kmem_size_val);
+	return (vm_cnt.v_free_target);
+}
+
+u_int
+kmem_free_min(void)
+{
+
+	return (vm_cnt.v_free_min);
+}
+
+u_int
+kmem_free_count(void)
+{
+
+	return (vm_cnt.v_free_count);
+}
+
+u_int
+kmem_page_count(void)
+{
+
+	return (vm_cnt.v_page_count);
 }
 
 uint64_t
-kmem_used(void)
+kmem_size(void)
 {
 
-	return (vmem_size(kmem_arena, VMEM_ALLOC));
+	return (kmem_size_val);
 }
 
 static int

Modified: head/sys/cddl/compat/opensolaris/sys/kmem.h
==============================================================================
--- head/sys/cddl/compat/opensolaris/sys/kmem.h	Thu Aug 28 18:59:39 2014	(r270758)
+++ head/sys/cddl/compat/opensolaris/sys/kmem.h	Thu Aug 28 19:50:08 2014	(r270759)
@@ -66,7 +66,16 @@ typedef struct kmem_cache {
 void *zfs_kmem_alloc(size_t size, int kmflags);
 void zfs_kmem_free(void *buf, size_t size);
 uint64_t kmem_size(void);
-uint64_t kmem_used(void);
+u_int kmem_page_count(void);
+
+/*
+ * The return values from kmem_free_* are only valid once the pagedaemon
+ * has been initialised, before then they return 0.
+ */
+u_int kmem_free_count(void);
+u_int kmem_free_target(void);
+u_int kmem_free_min(void);
+
 kmem_cache_t *kmem_cache_create(char *name, size_t bufsize, size_t align,
     int (*constructor)(void *, void *, int), void (*destructor)(void *, void *),
     void (*reclaim)(void *) __unused, void *private, vmem_t *vmp, int cflags);

Modified: head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c
==============================================================================
--- head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c	Thu Aug 28 18:59:39 2014	(r270758)
+++ head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c	Thu Aug 28 19:50:08 2014	(r270759)
@@ -193,9 +193,6 @@ extern int zfs_prefetch_disable;
  */
 static boolean_t arc_warm;
 
-/*
- * These tunables are for performance analysis.
- */
 uint64_t zfs_arc_max;
 uint64_t zfs_arc_min;
 uint64_t zfs_arc_meta_limit = 0;
@@ -204,6 +201,20 @@ int zfs_arc_shrink_shift = 0;
 int zfs_arc_p_min_shift = 0;
 int zfs_disable_dup_eviction = 0;
 uint64_t zfs_arc_average_blocksize = 8 * 1024;	/* 8KB */
+u_int zfs_arc_free_target = (1 << 19); /* default before pagedaemon init only */
+
+static int sysctl_vfs_zfs_arc_free_target(SYSCTL_HANDLER_ARGS);
+
+#ifdef _KERNEL
+static void
+arc_free_target_init(void *unused __unused)
+{
+
+	zfs_arc_free_target = kmem_free_target();
+}
+SYSINIT(arc_free_target_init, SI_SUB_KTHREAD_PAGE, SI_ORDER_ANY,
+    arc_free_target_init, NULL);
+#endif
 
 TUNABLE_QUAD("vfs.zfs.arc_meta_limit", &zfs_arc_meta_limit);
 SYSCTL_DECL(_vfs_zfs);
@@ -214,6 +225,35 @@ SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, arc_min
 SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, arc_average_blocksize, CTLFLAG_RDTUN,
     &zfs_arc_average_blocksize, 0, "ARC average blocksize");
+
+/*
+ * We