Re: [PATCH RFC] mm/swap: automatic tuning for swapin readahead
On Wed, Nov 14, 2012 at 01:48:18AM -0800, Hugh Dickins wrote: > On Tue, 6 Nov 2012, Shaohua Li wrote: > > On Wed, Oct 24, 2012 at 09:13:56AM +0800, Shaohua Li wrote: > > > On Tue, Oct 23, 2012 at 09:41:00AM -0400, Rik van Riel wrote: > > > > On 10/23/2012 01:51 AM, Shaohua Li wrote: > > > > > > > > >I have no strong point against the global state method. But I'd agree > > > > >making the > > > > >heuristic simple is preferred currently. I'm happy about the patch if > > > > >the '+1' > > > > >is removed. > > > > > > > > Without the +1, how will you figure out when to re-enable readahead? > > > > > > Below code in swapin_nr_pages can recover it. > > > + if (offset == prev_offset + 1 || offset == prev_offset - > > > 1) > > > + pages <<= 1; > > > > > > Not perfect, but should work in some sort. This reminds me to think if > > > pagereadahead flag is really required, hit in swap cache is a more > > > reliable way > > > to count readahead hit, and as Hugh mentioned, swap isn't vma bound. > > > > Hugh, > > ping! Any chance you can check this again? > > I apologize, Shaohua, my slowness must be very frustrating for you, > as it is for me too. Not at all, thanks for looking at it. > Thank you for pointing out how my first patch was reading two pages > instead of one in the random case, explaining its disappointing > performance there: odd how blind I was to that, despite taking stats. > > I did experiment with removing the "+ 1" as you did, it worked well > in the random SSD case, but degraded performance in (all? I forget) > the other cases. > > I failed to rescue the "algorithm" in that patch, and changed it a > week ago for an even simpler one, that has worked well for me so far. > When I sent you a quick private ack to your ping, I was puzzled by its > "too good to be true" initial test results: once I looked into those, > found my rearrangement of the test script had left out a swapoff, > so the supposed harddisk tests were actually swapping to SSD. > > I've finally got around to assembling the results and writing up > some description, starting off from yours. I think I've gone as > far as I can with this, and don't want to hold you up with further > delays: would it be okay if I simply hand this patch over to you now, > to test and expand upon and add your Sign-off and send in to akpm to > replace your original in mmotm - IF you are satisfied with it? I played the patch more. It works as expected in random access case, but in sequential case, it has regression against vanilla, maybe because I'm using a two sockets machine. I explained the reason in below patch and changelog. Below is an addon patch above Hugh's patch. We can apply Hugh's patch first and then this one if it's ok, or just merge them to one patch. AKPM, what's your suggestion? Thanks, Shaohua Subject: mm/swap: improve swapin readahead heuristic swapout always tries to find a cluster to do swap. The cluster is shared by all processes (kswapds, direct page reclaim) who do swap. The result is swapout adjacent memory could cause interleave access pattern to disk. We do aggressive swapin in non-random access case to avoid skip swapin in interleave access pattern. This really isn't the fault of swapin, but before we improve swapout algorithm (for example, give each CPU a swap cluster), aggressive swapin gives better performance for sequential access. With below patch, the heurisic becomes: 1. swapin max_pages pages for any hit 2. 
otherwise swapin last_readahead_pages*3/4 pages Test is done at a two sockets machine (7G memory), so at least 3 tasks are doing swapout (2 kswapd, and one direct page reclaim). sequential test is 1 thread accessing 14G memory, random test is 24 threads random accessing 14G memory. Data is time. RandSeq vanilla 5678434 Hugh2829625 Hugh+belowpatch 2785401 For both rand and seq access, below patch gets good performance. And even slightly better than vanilla in seq. Not quite sure about the reason, but I'd suspect this is because there are some daemons doing small random swap. Signed-off-by: Shaohua Li Cc: Konstantin Khlebnikov Cc: Hugh Dickins Cc: Rik van Riel Cc: Wu Fengguang Cc: Minchan Kim --- mm/swap_state.c | 31 --- 1 file changed, 16 insertions(+), 15 deletions(-) Index: linux/mm/swap_state.c === --- linux.orig/mm/swap_state.c 2012-11-19 09:08:58.171621096 +0800 +++ linux/mm/swap_state.c 2012-11-19 10:01:28.016023822 +0800 @@ -359,6 +359,7 @@ struct page *read_swap_cache_async(swp_e unsigned long swapin_nr_pages(unsigned long offset) { static unsigned long prev_offset; + static atomic_t last_readahead_pages; unsigned int pages, max_pages; max_pages = 1 << ACCESS_ONCE(page_cluster); @@ -366,29 +367,29 @@ unsigned long swapin_nr_pages(u
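For illustration, the two-rule heuristic described in the changelog above can be modelled in a few lines of standalone C. This is only a sketch under the stated assumptions (any readahead hit widens the window back to max_pages; a miss decays it to 3/4 of the last readahead size, never below one page); the names used here are made up for the example and are not the identifiers in the kernel patch.

/*
 * Minimal userspace model of the window-sizing rule described above
 * (hypothetical names; the real code lives in mm/swap_state.c and uses
 * atomics): any readahead hit since the last fault widens the window to
 * the maximum, otherwise the window decays to 3/4 of the previous size.
 */
#include <stdio.h>

#define PAGE_CLUSTER	3			/* vm.page-cluster default */

static unsigned long last_ra_pages = 1;		/* pages read ahead last time */

static unsigned long model_swapin_nr_pages(unsigned int hits)
{
	unsigned long max_pages = 1UL << PAGE_CLUSTER;	/* 8 pages */
	unsigned long pages;

	if (hits > 0) {
		/* any hit: assume sequential access, go back to full width */
		pages = max_pages;
	} else {
		/* no hit: shrink to 3/4 of the last window, at least 1 page */
		pages = last_ra_pages * 3 / 4;
		if (pages < 1)
			pages = 1;
	}

	last_ra_pages = pages;
	return pages;
}

int main(void)
{
	unsigned int hits[] = { 1, 1, 0, 0, 0, 0, 0, 1 };
	unsigned int i;

	for (i = 0; i < sizeof(hits) / sizeof(hits[0]); i++)
		printf("hits=%u -> readahead %lu pages\n",
		       hits[i], model_swapin_nr_pages(hits[i]));
	return 0;
}

Compiled with plain cc, the driver shows a run of misses shrinking the window 8 -> 6 -> 4 -> 3 -> 2 -> 1, and a single hit restoring it to 8.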
Re: [PATCH RFC] mm/swap: automatic tuning for swapin readahead
On Tue, 6 Nov 2012, Shaohua Li wrote: > On Wed, Oct 24, 2012 at 09:13:56AM +0800, Shaohua Li wrote: > > On Tue, Oct 23, 2012 at 09:41:00AM -0400, Rik van Riel wrote: > > > On 10/23/2012 01:51 AM, Shaohua Li wrote: > > > > > > >I have no strong point against the global state method. But I'd agree > > > >making the > > > >heuristic simple is preferred currently. I'm happy about the patch if > > > >the '+1' > > > >is removed. > > > > > > Without the +1, how will you figure out when to re-enable readahead? > > > > Below code in swapin_nr_pages can recover it. > > + if (offset == prev_offset + 1 || offset == prev_offset - 1) > > + pages <<= 1; > > > > Not perfect, but should work in some sort. This reminds me to think if > > pagereadahead flag is really required, hit in swap cache is a more reliable > > way > > to count readahead hit, and as Hugh mentioned, swap isn't vma bound. > > Hugh, > ping! Any chance you can check this again? I apologize, Shaohua, my slowness must be very frustrating for you, as it is for me too. Thank you for pointing out how my first patch was reading two pages instead of one in the random case, explaining its disappointing performance there: odd how blind I was to that, despite taking stats. I did experiment with removing the "+ 1" as you did, it worked well in the random SSD case, but degraded performance in (all? I forget) the other cases. I failed to rescue the "algorithm" in that patch, and changed it a week ago for an even simpler one, that has worked well for me so far. When I sent you a quick private ack to your ping, I was puzzled by its "too good to be true" initial test results: once I looked into those, found my rearrangement of the test script had left out a swapoff, so the supposed harddisk tests were actually swapping to SSD. I've finally got around to assembling the results and writing up some description, starting off from yours. I think I've gone as far as I can with this, and don't want to hold you up with further delays: would it be okay if I simply hand this patch over to you now, to test and expand upon and add your Sign-off and send in to akpm to replace your original in mmotm - IF you are satisfied with it? [PATCH] swap: add a simple detector for inappropriate swapin readahead swapin readahead does a blind readahead, whether or not the swapin is sequential. This may be ok on harddisk, because large reads have relatively small costs, and if the readahead pages are unneeded they can be reclaimed easily - though, what if their allocation forced reclaim of useful pages? But on SSD devices large reads are more expensive than small ones: if the readahead pages are unneeded, reading them in caused significant overhead. This patch adds very simplistic random read detection. Stealing the PageReadahead technique from Konstantin Khlebnikov's patch, avoiding the vma/anon_vma sophistications of Shaohua Li's patch, swapin_nr_pages() simply looks at readahead's current success rate, and narrows or widens its readahead window accordingly. There is little science to its heuristic: it's about as stupid as can be whilst remaining effective. The table below shows elapsed times (in centiseconds) when running a single repetitive swapping load across a 1000MB mapping in 900MB ram with 1GB swap (the harddisk tests had taken painfully too long when I used mem=500M, but SSD shows similar results for that). 
Vanilla is the 3.6-rc7 kernel on which I started; Shaohua denotes his Sep 3 patch in mmotm and linux-next; HughOld denotes my Oct 1 patch which Shaohua showed to be defective; HughNew this Nov 14 patch, with page_cluster as usual at default of 3 (8-page reads); HughPC4 this same patch with page_cluster 4 (16-page reads); HughPC0 with page_cluster 0 (1-page reads: no readahead). HDD for swapping to harddisk, SSD for swapping to VertexII SSD. Seq for sequential access to the mapping, cycling five times around; Rand for the same number of random touches. Anon for a MAP_PRIVATE anon mapping; Shmem for a MAP_SHARED anon mapping, equivalent to tmpfs. One weakness of Shaohua's vma/anon_vma approach was that it did not optimize Shmem: seen below. Konstantin's approach was perhaps mistuned, 50% slower on Seq: did not compete and is not shown below. HDDVanilla Shaohua HughOld HughNew HughPC4 HughPC0 Seq Anon 73921 76210 75611 76904 78191 121542 Seq Shmem73601 73176 73855 72947 74543 118322 Rand Anon 895392 831243 871569 845197 846496 841680 Rand Shmem 1058375 1053486 827935 764955 764376 756489 SSDVanilla Shaohua HughOld HughNew HughPC4 HughPC0 Seq Anon 24634 24198 24673 25107 21614 70018 Seq Shmem24959 24932 25052 25703 22030 69678 Rand Anon43014 26146 28075 25989 26935 25901 Rand Shmem 45349 45215 28249 24268 24138 24332 These tests are, of course, two extremes of a very simple case: under heavier mixed loads I've not yet o
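The accounting that drives this window decision can be pictured with a small standalone model: pages brought in speculatively are tagged, and a later fault that finds a tagged page in the swap cache counts as a readahead hit. The names below are hypothetical; the actual patch reuses the PageReadahead bit and an atomic hit counter inside mm/swap_state.c.

/*
 * Sketch of the readahead-hit accounting described above (assumed
 * behaviour, made-up names): readahead pages are tagged when read, and a
 * fault that finds a tagged page in the swap cache clears the tag and
 * bumps a counter that swapin_nr_pages() later consumes.
 */
#include <stdio.h>
#include <stdbool.h>

#define NPAGES 16

static bool in_swapcache[NPAGES];	/* page present in swap cache? */
static bool is_readahead[NPAGES];	/* tagged as speculative readahead */
static unsigned int swapra_hits;	/* hits since the last window decision */

/* readahead path: page read speculatively, tag it */
static void readahead_one(unsigned int pfn)
{
	in_swapcache[pfn] = true;
	is_readahead[pfn] = true;
}

/* fault path: look up the swap cache, count a hit if it was readahead */
static bool lookup_swap_cache_model(unsigned int pfn)
{
	if (!in_swapcache[pfn])
		return false;
	if (is_readahead[pfn]) {
		is_readahead[pfn] = false;	/* like TestClearPageReadahead() */
		swapra_hits++;
	}
	return true;
}

int main(void)
{
	unsigned int pfn;

	/* 8 pages brought in by one readahead pass */
	for (pfn = 0; pfn < 8; pfn++)
		readahead_one(pfn);

	/* sequential access touches most of them: high hit count */
	for (pfn = 1; pfn < 6; pfn++)
		lookup_swap_cache_model(pfn);

	printf("readahead hits since last window decision: %u\n", swapra_hits);
	return 0;
}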
Re: [PATCH RFC] mm/swap: automatic tuning for swapin readahead
On Wed, Oct 24, 2012 at 09:13:56AM +0800, Shaohua Li wrote:
> On Tue, Oct 23, 2012 at 09:41:00AM -0400, Rik van Riel wrote:
> > On 10/23/2012 01:51 AM, Shaohua Li wrote:
> > >
> > > I have no strong point against the global state method. But I'd agree
> > > making the heuristic simple is preferred currently. I'm happy about the
> > > patch if the '+1' is removed.
> >
> > Without the +1, how will you figure out when to re-enable readahead?
>
> The code below in swapin_nr_pages can recover it.
> +		if (offset == prev_offset + 1 || offset == prev_offset - 1)
> +			pages <<= 1;
>
> Not perfect, but should work to some extent. This reminds me to wonder
> whether the PageReadahead flag is really required: a hit in the swap cache
> is a more reliable way to count readahead hits, and as Hugh mentioned, swap
> isn't vma bound.

Hugh,
ping! Any chance you can check this again?

Thanks,
Shaohua
Re: [PATCH RFC] mm/swap: automatic tuning for swapin readahead
On Tue, Oct 23, 2012 at 09:41:00AM -0400, Rik van Riel wrote:
> On 10/23/2012 01:51 AM, Shaohua Li wrote:
> >
> > I have no strong point against the global state method. But I'd agree
> > making the heuristic simple is preferred currently. I'm happy about the
> > patch if the '+1' is removed.
>
> Without the +1, how will you figure out when to re-enable readahead?

The code below in swapin_nr_pages can recover it.
+		if (offset == prev_offset + 1 || offset == prev_offset - 1)
+			pages <<= 1;

Not perfect, but should work to some extent. This reminds me to wonder
whether the PageReadahead flag is really required: a hit in the swap cache
is a more reliable way to count readahead hits, and as Hugh mentioned, swap
isn't vma bound.

Thanks,
Shaohua
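The recovery path quoted above can be demonstrated with a toy model: once readahead has been throttled down to a single page, a run of faults at adjacent swap offsets doubles the window step by step until it reaches the page-cluster maximum. Everything below is a standalone sketch with made-up names, and the reset-to-one branch for non-adjacent faults is an assumption of the model, not part of the quoted snippet.

/*
 * Toy model of readahead recovery: adjacent swap offsets double the
 * window, anything else keeps it narrow (model assumption), clamped to
 * the page-cluster maximum.
 */
#include <stdio.h>

#define MAX_PAGES (1UL << 3)		/* page_cluster = 3 */

static unsigned long prev_offset;
static unsigned long pages = 1;		/* readahead currently throttled */

static unsigned long next_window(unsigned long offset)
{
	if (offset == prev_offset + 1 || offset == prev_offset - 1)
		pages <<= 1;		/* adjacent fault: widen again */
	else
		pages = 1;		/* non-adjacent fault: stay narrow */
	if (pages > MAX_PAGES)
		pages = MAX_PAGES;
	prev_offset = offset;
	return pages;
}

int main(void)
{
	unsigned long off;

	/* sequential faults at offsets 100..105 re-enable readahead */
	for (off = 100; off <= 105; off++)
		printf("offset %lu -> window %lu pages\n",
		       off, next_window(off));
	return 0;
}

A few adjacent faults are enough to take the window from one page back to the full 8-page maximum.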
Re: [PATCH RFC] mm/swap: automatic tuning for swapin readahead
On 10/23/2012 01:51 AM, Shaohua Li wrote:

> I have no strong point against the global state method. But I'd agree
> making the heuristic simple is preferred currently. I'm happy about the
> patch if the '+1' is removed.

Without the +1, how will you figure out when to re-enable readahead?

-- 
All rights reversed
Re: [PATCH RFC] mm/swap: automatic tuning for swapin readahead
On Mon, Oct 22, 2012 at 10:16:40PM -0700, Hugh Dickins wrote: > On Mon, 22 Oct 2012, Shaohua Li wrote: > > On Tue, Oct 16, 2012 at 08:50:49AM +0800, Shaohua Li wrote: > > > On Mon, Oct 08, 2012 at 03:09:58PM -0700, Hugh Dickins wrote: > > > > On Thu, 4 Oct 2012, Konstantin Khlebnikov wrote: > > > > > > > > > Here results of my test. Workload isn't very realistic, but at least > > > > > it > > > > > threaded: compiling linux-3.6 with defconfig in 16 threads on tmpfs, > > > > > 512mb ram, dualcore cpu, ordinary hard disk. (test script in > > > > > attachment) > > > > > > > > > > average results for ten runs: > > > > > > > > > > RA=3RA=0RA=1RA=2RA=4HughShaohua > > > > > real time 500 542 528 519 500 523 522 > > > > > user time 738 737 735 737 739 737 739 > > > > > sys time 93 93 91 92 96 92 93 > > > > > pgmajfault62918 110533 92454 78221 54342 86601 77229 > > > > > pgpgin2070372 795228 1034046 1471010 3177192 1154532 1599388 > > > > > pgpgout 2597278 2022037 2110020 2350380 2802670 2286671 2526570 > > > > > pswpin462747 138873 202148 310969 739431 232710 341320 > > > > > pswpout 646363 502599 524613 584731 697797 568784 628677 > > > > > > > > > > So, last two columns shows mostly equal results: +4.6% and +4.4% in > > > > > comparison to vanilla kernel with RA=3, but your version shows more > > > > > stable > > > > > results (std-error 2.7% against 4.8%) (all this numbers in huge table > > > > > in > > > > > attachment) > > > > > > > > Thanks for doing this, Konstantin, but I'm stuck for anything much to > > > > say! > > > > Shaohua and I are both about 4.5% bad for this particular test, but I'm > > > > more consistently bad - hurrah! > > > > > > > > I suspect (not a convincing argument) that if the test were just > > > > slightly > > > > different (a little more or a little less memory, SSD instead of hard > > > > disk, diskcache instead of tmpfs), then it would come out differently. > > > > > > > > Did you draw any conclusions from the numbers you found? > > > > > > > > I haven't done any more on this in the last few days, except to verify > > > > that once an anon_vma is judged random with Shaohua's, then it appears > > > > to be condemned to no-readahead ever after. > > > > > > > > That's probably something that a hack like I had in mine would fix, > > > > but that addition might change its balance further (and increase vma > > > > or anon_vma size) - not tried yet. > > > > > > > > All I want to do right now, is suggest to Andrew that he hold Shaohua's > > > > patch back from 3.7 for the moment: I'll send a response to Sep 7th's > > > > mm-commits mail to suggest that - but no great disaster if he ignores > > > > me. > > > > > > Ok, I tested Hugh's patch. My test is a multithread random write workload. > > > With Hugh's patch, 49:28.06elapsed > > > With mine, 43:23.39elapsed > > > There is 12% more time used with Hugh's patch. > > > > > > In the stable state of this workload, SI:SO ratio should be roughly 1:1. > > > With > > > Hugh's patch, it's around 1.6:1, there is still unnecessary swapin. > > > > > > I also tried a workload with seqential/random write mixed, Hugh's patch > > > is 10% > > > bad too. > > > > With below change, the si/so ratio is back to around 1:1 in my workload. > > Guess > > the run time of my test will be reduced too, though I didn't test yet. > > - used = atomic_xchg(&swapra_hits, 0) + 1; > > + used = atomic_xchg(&swapra_hits, 0); > > Thank you for playing and trying that, I haven't found time to revisit it > at all. I'll give that adjustment a go at my end. 
The "+ 1" was for the > target page itself; but whatever works best, there's not much science to it. With '+1', the minimum ra pages is 2 even for a random access. > > > > I'm wondering how could a global counter based method detect readahead > > correctly. For example, if there are a sequential access thread and a random > > access thread, doesn't this method always make wrong decision? > > But only in the simplest cases is the sequentiality of placement on swap > well correlated with the sequentiality of placement in virtual memory. > Once you have a sequential access thread and a random access thread > swapping out at the same time, their pages will be interspersed. > > I'm pretty sure that if you give it more thought than I am giving it > at the moment, you can devise a test case which would go amazingly > faster by your per-vma method than by keeping just this global state. > > But I doubt such a test case would be so realistic as to deserve that > extra sophistication. I do prefer to keep the heuristic as stupid and > unpretentious as possible. I have no strong point against the global state method. But I'd agree making the heuristic simple is preferred currently. I'm happy about the patch if the '+1'
Re: [PATCH RFC] mm/swap: automatic tuning for swapin readahead
On Mon, 22 Oct 2012, Shaohua Li wrote: > On Tue, Oct 16, 2012 at 08:50:49AM +0800, Shaohua Li wrote: > > On Mon, Oct 08, 2012 at 03:09:58PM -0700, Hugh Dickins wrote: > > > On Thu, 4 Oct 2012, Konstantin Khlebnikov wrote: > > > > > > > Here results of my test. Workload isn't very realistic, but at least it > > > > threaded: compiling linux-3.6 with defconfig in 16 threads on tmpfs, > > > > 512mb ram, dualcore cpu, ordinary hard disk. (test script in attachment) > > > > > > > > average results for ten runs: > > > > > > > > RA=3RA=0RA=1RA=2RA=4HughShaohua > > > > real time 500 542 528 519 500 523 522 > > > > user time 738 737 735 737 739 737 739 > > > > sys time93 93 91 92 96 92 93 > > > > pgmajfault 62918 110533 92454 78221 54342 86601 77229 > > > > pgpgin 2070372 795228 1034046 1471010 3177192 1154532 1599388 > > > > pgpgout 2597278 2022037 2110020 2350380 2802670 2286671 2526570 > > > > pswpin 462747 138873 202148 310969 739431 232710 341320 > > > > pswpout 646363 502599 524613 584731 697797 568784 628677 > > > > > > > > So, last two columns shows mostly equal results: +4.6% and +4.4% in > > > > comparison to vanilla kernel with RA=3, but your version shows more > > > > stable > > > > results (std-error 2.7% against 4.8%) (all this numbers in huge table in > > > > attachment) > > > > > > Thanks for doing this, Konstantin, but I'm stuck for anything much to say! > > > Shaohua and I are both about 4.5% bad for this particular test, but I'm > > > more consistently bad - hurrah! > > > > > > I suspect (not a convincing argument) that if the test were just slightly > > > different (a little more or a little less memory, SSD instead of hard > > > disk, diskcache instead of tmpfs), then it would come out differently. > > > > > > Did you draw any conclusions from the numbers you found? > > > > > > I haven't done any more on this in the last few days, except to verify > > > that once an anon_vma is judged random with Shaohua's, then it appears > > > to be condemned to no-readahead ever after. > > > > > > That's probably something that a hack like I had in mine would fix, > > > but that addition might change its balance further (and increase vma > > > or anon_vma size) - not tried yet. > > > > > > All I want to do right now, is suggest to Andrew that he hold Shaohua's > > > patch back from 3.7 for the moment: I'll send a response to Sep 7th's > > > mm-commits mail to suggest that - but no great disaster if he ignores me. > > > > Ok, I tested Hugh's patch. My test is a multithread random write workload. > > With Hugh's patch, 49:28.06elapsed > > With mine, 43:23.39elapsed > > There is 12% more time used with Hugh's patch. > > > > In the stable state of this workload, SI:SO ratio should be roughly 1:1. > > With > > Hugh's patch, it's around 1.6:1, there is still unnecessary swapin. > > > > I also tried a workload with seqential/random write mixed, Hugh's patch is > > 10% > > bad too. > > With below change, the si/so ratio is back to around 1:1 in my workload. Guess > the run time of my test will be reduced too, though I didn't test yet. > - used = atomic_xchg(&swapra_hits, 0) + 1; > + used = atomic_xchg(&swapra_hits, 0); Thank you for playing and trying that, I haven't found time to revisit it at all. I'll give that adjustment a go at my end. The "+ 1" was for the target page itself; but whatever works best, there's not much science to it. > > I'm wondering how could a global counter based method detect readahead > correctly. 
For example, if there are a sequential access thread and a random > access thread, doesn't this method always make wrong decision? But only in the simplest cases is the sequentiality of placement on swap well correlated with the sequentiality of placement in virtual memory. Once you have a sequential access thread and a random access thread swapping out at the same time, their pages will be interspersed. I'm pretty sure that if you give it more thought than I am giving it at the moment, you can devise a test case which would go amazingly faster by your per-vma method than by keeping just this global state. But I doubt such a test case would be so realistic as to deserve that extra sophistication. I do prefer to keep the heuristic as stupid and unpretentious as possible. Especially when I remember how get_swap_page() stripes across swap areas of equal priority: my guess is that nobody uses that feature, and we don't even want to consider it here; but it feels wrong to ignore it if we aim for more cleverness at the readahead end. Hugh -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC] mm/swap: automatic tuning for swapin readahead
On Tue, Oct 16, 2012 at 08:50:49AM +0800, Shaohua Li wrote: > On Mon, Oct 08, 2012 at 03:09:58PM -0700, Hugh Dickins wrote: > > On Thu, 4 Oct 2012, Konstantin Khlebnikov wrote: > > > > > Here results of my test. Workload isn't very realistic, but at least it > > > threaded: compiling linux-3.6 with defconfig in 16 threads on tmpfs, > > > 512mb ram, dualcore cpu, ordinary hard disk. (test script in attachment) > > > > > > average results for ten runs: > > > > > > RA=3RA=0RA=1RA=2RA=4HughShaohua > > > real time 500 542 528 519 500 523 522 > > > user time 738 737 735 737 739 737 739 > > > sys time 93 93 91 92 96 92 93 > > > pgmajfault62918 110533 92454 78221 54342 86601 77229 > > > pgpgin2070372 795228 1034046 1471010 3177192 1154532 1599388 > > > pgpgout 2597278 2022037 2110020 2350380 2802670 2286671 2526570 > > > pswpin462747 138873 202148 310969 739431 232710 341320 > > > pswpout 646363 502599 524613 584731 697797 568784 628677 > > > > > > So, last two columns shows mostly equal results: +4.6% and +4.4% in > > > comparison to vanilla kernel with RA=3, but your version shows more stable > > > results (std-error 2.7% against 4.8%) (all this numbers in huge table in > > > attachment) > > > > Thanks for doing this, Konstantin, but I'm stuck for anything much to say! > > Shaohua and I are both about 4.5% bad for this particular test, but I'm > > more consistently bad - hurrah! > > > > I suspect (not a convincing argument) that if the test were just slightly > > different (a little more or a little less memory, SSD instead of hard > > disk, diskcache instead of tmpfs), then it would come out differently. > > > > Did you draw any conclusions from the numbers you found? > > > > I haven't done any more on this in the last few days, except to verify > > that once an anon_vma is judged random with Shaohua's, then it appears > > to be condemned to no-readahead ever after. > > > > That's probably something that a hack like I had in mine would fix, > > but that addition might change its balance further (and increase vma > > or anon_vma size) - not tried yet. > > > > All I want to do right now, is suggest to Andrew that he hold Shaohua's > > patch back from 3.7 for the moment: I'll send a response to Sep 7th's > > mm-commits mail to suggest that - but no great disaster if he ignores me. > > Ok, I tested Hugh's patch. My test is a multithread random write workload. > With Hugh's patch, 49:28.06elapsed > With mine, 43:23.39elapsed > There is 12% more time used with Hugh's patch. > > In the stable state of this workload, SI:SO ratio should be roughly 1:1. With > Hugh's patch, it's around 1.6:1, there is still unnecessary swapin. > > I also tried a workload with seqential/random write mixed, Hugh's patch is 10% > bad too. With below change, the si/so ratio is back to around 1:1 in my workload. Guess the run time of my test will be reduced too, though I didn't test yet. - used = atomic_xchg(&swapra_hits, 0) + 1; + used = atomic_xchg(&swapra_hits, 0); I'm wondering how could a global counter based method detect readahead correctly. For example, if there are a sequential access thread and a random access thread, doesn't this method always make wrong decision? Thanks, Shaohua -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC] mm/swap: automatic tuning for swapin readahead
On Mon, Oct 08, 2012 at 03:09:58PM -0700, Hugh Dickins wrote: > On Thu, 4 Oct 2012, Konstantin Khlebnikov wrote: > > > Here results of my test. Workload isn't very realistic, but at least it > > threaded: compiling linux-3.6 with defconfig in 16 threads on tmpfs, > > 512mb ram, dualcore cpu, ordinary hard disk. (test script in attachment) > > > > average results for ten runs: > > > > RA=3RA=0RA=1RA=2RA=4HughShaohua > > real time 500 542 528 519 500 523 522 > > user time 738 737 735 737 739 737 739 > > sys time93 93 91 92 96 92 93 > > pgmajfault 62918 110533 92454 78221 54342 86601 77229 > > pgpgin 2070372 795228 1034046 1471010 3177192 1154532 1599388 > > pgpgout 2597278 2022037 2110020 2350380 2802670 2286671 2526570 > > pswpin 462747 138873 202148 310969 739431 232710 341320 > > pswpout 646363 502599 524613 584731 697797 568784 628677 > > > > So, last two columns shows mostly equal results: +4.6% and +4.4% in > > comparison to vanilla kernel with RA=3, but your version shows more stable > > results (std-error 2.7% against 4.8%) (all this numbers in huge table in > > attachment) > > Thanks for doing this, Konstantin, but I'm stuck for anything much to say! > Shaohua and I are both about 4.5% bad for this particular test, but I'm > more consistently bad - hurrah! > > I suspect (not a convincing argument) that if the test were just slightly > different (a little more or a little less memory, SSD instead of hard > disk, diskcache instead of tmpfs), then it would come out differently. > > Did you draw any conclusions from the numbers you found? > > I haven't done any more on this in the last few days, except to verify > that once an anon_vma is judged random with Shaohua's, then it appears > to be condemned to no-readahead ever after. > > That's probably something that a hack like I had in mine would fix, > but that addition might change its balance further (and increase vma > or anon_vma size) - not tried yet. > > All I want to do right now, is suggest to Andrew that he hold Shaohua's > patch back from 3.7 for the moment: I'll send a response to Sep 7th's > mm-commits mail to suggest that - but no great disaster if he ignores me. Ok, I tested Hugh's patch. My test is a multithread random write workload. With Hugh's patch, 49:28.06elapsed With mine, 43:23.39elapsed There is 12% more time used with Hugh's patch. In the stable state of this workload, SI:SO ratio should be roughly 1:1. With Hugh's patch, it's around 1.6:1, there is still unnecessary swapin. I also tried a workload with seqential/random write mixed, Hugh's patch is 10% bad too. Thanks, Shaohua -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC] mm/swap: automatic tuning for swapin readahead
Hugh Dickins wrote: On Thu, 4 Oct 2012, Konstantin Khlebnikov wrote: Here results of my test. Workload isn't very realistic, but at least it threaded: compiling linux-3.6 with defconfig in 16 threads on tmpfs, 512mb ram, dualcore cpu, ordinary hard disk. (test script in attachment) average results for ten runs: RA=3RA=0RA=1RA=2RA=4HughShaohua real time 500 542 528 519 500 523 522 user time 738 737 735 737 739 737 739 sys time93 93 91 92 96 92 93 pgmajfault 62918 110533 92454 78221 54342 86601 77229 pgpgin 2070372 795228 1034046 1471010 3177192 1154532 1599388 pgpgout 2597278 2022037 2110020 2350380 2802670 2286671 2526570 pswpin 462747 138873 202148 310969 739431 232710 341320 pswpout 646363 502599 524613 584731 697797 568784 628677 So, last two columns shows mostly equal results: +4.6% and +4.4% in comparison to vanilla kernel with RA=3, but your version shows more stable results (std-error 2.7% against 4.8%) (all this numbers in huge table in attachment) Thanks for doing this, Konstantin, but I'm stuck for anything much to say! Shaohua and I are both about 4.5% bad for this particular test, but I'm more consistently bad - hurrah! I suspect (not a convincing argument) that if the test were just slightly different (a little more or a little less memory, SSD instead of hard disk, diskcache instead of tmpfs), then it would come out differently. Yes, results depends mostly on tmpfs. Did you draw any conclusions from the numbers you found? Yeah, I have some ideas: Numbers for vanilla kernel shows strong dependence between time and readahead size. Seems like main problem is that tmpfs does not have it's own readahead, it can only rely on swap-in readahead. There are about 25% readahead hits for RA=3. As "pswpin" row shows both your and Shaohua patches makes readahead smaller. Plus tmpfs doesn't keeps copy for clean pages in the swap (unlike to anon pages). On swapin path it always marks page dirty and releases swap-entry. I didn't have any measurements but this particular test definitely re-reads some files multiple times and writes them back to the swap after that. I haven't done any more on this in the last few days, except to verify that once an anon_vma is judged random with Shaohua's, then it appears to be condemned to no-readahead ever after. That's probably something that a hack like I had in mine would fix, but that addition might change its balance further (and increase vma or anon_vma size) - not tried yet. All I want to do right now, is suggest to Andrew that he hold Shaohua's patch back from 3.7 for the moment: I'll send a response to Sep 7th's mm-commits mail to suggest that - but no great disaster if he ignores me. Hugh Numbers from your tests formatted into table for better readability HDD Vanilla Shaohua RA=3RA=0RA=4 SEQ, ANON 73921 76210 75611 121542 77950 SEQ, SHMEM 73601 73176 73855 118322 73534 RND, ANON 895392 831243 871569 841680 863871 RND, SHMEM 1058375 1053486 827935 756489 834804 SDD Vanilla Shaohua RA=3RA=0RA=4 SEQ, ANON 24634 24198 24673 70018 21125 SEQ, SHMEM 24959 24932 25052 69678 21387 RND, ANON 43014 26146 28075 25901 28686 RND, SHMEM 45349 45215 28249 24332 28226 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC] mm/swap: automatic tuning for swapin readahead
On Mon, 8 Oct 2012 15:09:58 -0700 (PDT) Hugh Dickins wrote:

> All I want to do right now, is suggest to Andrew that he hold Shaohua's
> patch back from 3.7 for the moment: I'll send a response to Sep 7th's
> mm-commits mail to suggest that - but no great disaster if he ignores me.

Just in the nick of time.

I'll move swap-add-a-simple-detector-for-inappropriate-swapin-readahead.patch
and swap-add-a-simple-detector-for-inappropriate-swapin-readahead-fix.patch
into the wait-and-see pile.
Re: [PATCH RFC] mm/swap: automatic tuning for swapin readahead
On Thu, 4 Oct 2012, Konstantin Khlebnikov wrote: > Here results of my test. Workload isn't very realistic, but at least it > threaded: compiling linux-3.6 with defconfig in 16 threads on tmpfs, > 512mb ram, dualcore cpu, ordinary hard disk. (test script in attachment) > > average results for ten runs: > > RA=3RA=0RA=1RA=2RA=4HughShaohua > real time 500 542 528 519 500 523 522 > user time 738 737 735 737 739 737 739 > sys time 93 93 91 92 96 92 93 > pgmajfault62918 110533 92454 78221 54342 86601 77229 > pgpgin2070372 795228 1034046 1471010 3177192 1154532 1599388 > pgpgout 2597278 2022037 2110020 2350380 2802670 2286671 2526570 > pswpin462747 138873 202148 310969 739431 232710 341320 > pswpout 646363 502599 524613 584731 697797 568784 628677 > > So, last two columns shows mostly equal results: +4.6% and +4.4% in > comparison to vanilla kernel with RA=3, but your version shows more stable > results (std-error 2.7% against 4.8%) (all this numbers in huge table in > attachment) Thanks for doing this, Konstantin, but I'm stuck for anything much to say! Shaohua and I are both about 4.5% bad for this particular test, but I'm more consistently bad - hurrah! I suspect (not a convincing argument) that if the test were just slightly different (a little more or a little less memory, SSD instead of hard disk, diskcache instead of tmpfs), then it would come out differently. Did you draw any conclusions from the numbers you found? I haven't done any more on this in the last few days, except to verify that once an anon_vma is judged random with Shaohua's, then it appears to be condemned to no-readahead ever after. That's probably something that a hack like I had in mine would fix, but that addition might change its balance further (and increase vma or anon_vma size) - not tried yet. All I want to do right now, is suggest to Andrew that he hold Shaohua's patch back from 3.7 for the moment: I'll send a response to Sep 7th's mm-commits mail to suggest that - but no great disaster if he ignores me. Hugh > > Numbers from your tests formatted into table for better readability > > HDD Vanilla Shaohua RA=3RA=0RA=4 > SEQ, ANON 73921 76210 75611 121542 77950 > SEQ, SHMEM73601 73176 73855 118322 73534 > RND, ANON 895392 831243 871569 841680 863871 > RND, SHMEM1058375 1053486 827935 756489 834804 > > SDD Vanilla Shaohua RA=3RA=0RA=4 > SEQ, ANON 24634 24198 24673 70018 21125 > SEQ, SHMEM24959 24932 25052 69678 21387 > RND, ANON 43014 26146 28075 25901 28686 > RND, SHMEM45349 45215 28249 24332 28226 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC] mm/swap: automatic tuning for swapin readahead
Here results of my test. Workload isn't very realistic, but at least it threaded: compiling linux-3.6 with defconfig in 16 threads on tmpfs, 512mb ram, dualcore cpu, ordinary hard disk. (test script in attachment) average results for ten runs: RA=3RA=0RA=1RA=2RA=4HughShaohua real time 500 542 528 519 500 523 522 user time 738 737 735 737 739 737 739 sys time93 93 91 92 96 92 93 pgmajfault 62918 110533 92454 78221 54342 86601 77229 pgpgin 2070372 795228 1034046 1471010 3177192 1154532 1599388 pgpgout 2597278 2022037 2110020 2350380 2802670 2286671 2526570 pswpin 462747 138873 202148 310969 739431 232710 341320 pswpout 646363 502599 524613 584731 697797 568784 628677 So, last two columns shows mostly equal results: +4.6% and +4.4% in comparison to vanilla kernel with RA=3, but your version shows more stable results (std-error 2.7% against 4.8%) (all this numbers in huge table in attachment) Numbers from your tests formatted into table for better readability HDD Vanilla Shaohua RA=3RA=0RA=4 SEQ, ANON 73921 76210 75611 121542 77950 SEQ, SHMEM 73601 73176 73855 118322 73534 RND, ANON 895392 831243 871569 841680 863871 RND, SHMEM 1058375 1053486 827935 756489 834804 SDD Vanilla Shaohua RA=3RA=0RA=4 SEQ, ANON 24634 24198 24673 70018 21125 SEQ, SHMEM 24959 24932 25052 69678 21387 RND, ANON 43014 26146 28075 25901 28686 RND, SHMEM 45349 45215 28249 24332 28226 Hugh Dickins wrote: On Tue, 2 Oct 2012, Konstantin Khlebnikov wrote: Hugh Dickins wrote: If I boot with mem=900M (and 1G swap: either on hard disk sda, or on Vertex II SSD sdb), and mmap anonymous 1000M (either MAP_PRIVATE, or MAP_SHARED for a shmem object), and either cycle sequentially round that making 5M touches (spaced a page apart), or make 5M random touches, then here are the times in centisecs that I see (but it's only elapsed that I've been worrying about). 3.6-rc7 swapping to hard disk: 124 user6154 system 73921 elapsed -rc7 sda seq 102 user8862 system 895392 elapsed -rc7 sda random 130 user6628 system 73601 elapsed -rc7 sda shmem seq 194 user8610 system 1058375 elapsed -rc7 sda shmem random 3.6-rc7 swapping to SSD: 116 user5898 system 24634 elapsed -rc7 sdb seq 96 user8166 system 43014 elapsed -rc7 sdb random 110 user6410 system 24959 elapsed -rc7 sdb shmem seq 208 user8024 system 45349 elapsed -rc7 sdb shmem random 3.6-rc7 + Shaohua's patch (and FAULT_FLAG_RETRY check in do_swap_page), HDD: 116 user6258 system 76210 elapsed shli sda seq 80 user7716 system 831243 elapsed shli sda random 128 user6640 system 73176 elapsed shli sda shmem seq 212 user8522 system 1053486 elapsed shli sda shmem random 3.6-rc7 + Shaohua's patch (and FAULT_FLAG_RETRY check in do_swap_page), SSD: 126 user5734 system 24198 elapsed shli sdb seq 90 user7356 system 26146 elapsed shli sdb random 128 user6396 system 24932 elapsed shli sdb shmem seq 192 user8006 system 45215 elapsed shli sdb shmem random 3.6-rc7 + my patch, swapping to hard disk: 126 user6252 system 75611 elapsed hugh sda seq 70 user8310 system 871569 elapsed hugh sda random 130 user6790 system 73855 elapsed hugh sda shmem seq 148 user7734 system 827935 elapsed hugh sda shmem random 3.6-rc7 + my patch, swapping to SSD: 116 user5996 system 24673 elapsed hugh sdb seq 76 user7568 system 28075 elapsed hugh sdb random 132 user6468 system 25052 elapsed hugh sdb shmem seq 166 user7220 system 28249 elapsed hugh sdb shmem random Hmm, It would be nice to gather numbers without swapin readahead at all, just to see the the worst possible case for sequential read and the best for random. 
Right, and also interesting to see what happens if we raise page_cluster (more of an option than it was, with your or my patch scaling it down). Run on the same machine under the same conditions: 3.6-rc7 + my patch, swapping to hard disk with page_cluster 0 (no readahead): 136 user 34038 system 121542 elapsed hugh cluster0 sda seq 102 user7928 system 841680 elapsed hugh cluster0 sda random 130 user 34770 system 118322 elapsed hugh cluster0 sda shmem seq 160 user7362 system 756489 elapsed hugh cluster0 sda shmem random 3.6-rc7 + my patch, swapping to SSD with page_cluster 0 (no readahead): 138 user 32230 system 70018 elapsed hugh cluster0 sdb seq 88 user7296 system 25901 elapsed hugh cluster0 sdb random 154 user
Re: [PATCH RFC] mm/swap: automatic tuning for swapin readahead
On Tue, 2 Oct 2012, Konstantin Khlebnikov wrote: > Hugh Dickins wrote: > > > > If I boot with mem=900M (and 1G swap: either on hard disk sda, or > > on Vertex II SSD sdb), and mmap anonymous 1000M (either MAP_PRIVATE, > > or MAP_SHARED for a shmem object), and either cycle sequentially round > > that making 5M touches (spaced a page apart), or make 5M random touches, > > then here are the times in centisecs that I see (but it's only elapsed > > that I've been worrying about). > > > > 3.6-rc7 swapping to hard disk: > > 124 user6154 system 73921 elapsed -rc7 sda seq > > 102 user8862 system 895392 elapsed -rc7 sda random > > 130 user6628 system 73601 elapsed -rc7 sda shmem seq > > 194 user8610 system 1058375 elapsed -rc7 sda shmem random > > > > 3.6-rc7 swapping to SSD: > > 116 user5898 system 24634 elapsed -rc7 sdb seq > > 96 user8166 system 43014 elapsed -rc7 sdb random > > 110 user6410 system 24959 elapsed -rc7 sdb shmem seq > > 208 user8024 system 45349 elapsed -rc7 sdb shmem random > > > > 3.6-rc7 + Shaohua's patch (and FAULT_FLAG_RETRY check in do_swap_page), > > HDD: > > 116 user6258 system 76210 elapsed shli sda seq > > 80 user7716 system 831243 elapsed shli sda random > > 128 user6640 system 73176 elapsed shli sda shmem seq > > 212 user8522 system 1053486 elapsed shli sda shmem random > > > > 3.6-rc7 + Shaohua's patch (and FAULT_FLAG_RETRY check in do_swap_page), > > SSD: > > 126 user5734 system 24198 elapsed shli sdb seq > > 90 user7356 system 26146 elapsed shli sdb random > > 128 user6396 system 24932 elapsed shli sdb shmem seq > > 192 user8006 system 45215 elapsed shli sdb shmem random > > > > 3.6-rc7 + my patch, swapping to hard disk: > > 126 user6252 system 75611 elapsed hugh sda seq > > 70 user8310 system 871569 elapsed hugh sda random > > 130 user6790 system 73855 elapsed hugh sda shmem seq > > 148 user7734 system 827935 elapsed hugh sda shmem random > > > > 3.6-rc7 + my patch, swapping to SSD: > > 116 user5996 system 24673 elapsed hugh sdb seq > > 76 user7568 system 28075 elapsed hugh sdb random > > 132 user6468 system 25052 elapsed hugh sdb shmem seq > > 166 user7220 system 28249 elapsed hugh sdb shmem random > > > > Hmm, It would be nice to gather numbers without swapin readahead at all, just > to see the the worst possible case for sequential read and the best for > random. Right, and also interesting to see what happens if we raise page_cluster (more of an option than it was, with your or my patch scaling it down). 
Run on the same machine under the same conditions: 3.6-rc7 + my patch, swapping to hard disk with page_cluster 0 (no readahead): 136 user 34038 system 121542 elapsed hugh cluster0 sda seq 102 user7928 system 841680 elapsed hugh cluster0 sda random 130 user 34770 system 118322 elapsed hugh cluster0 sda shmem seq 160 user7362 system 756489 elapsed hugh cluster0 sda shmem random 3.6-rc7 + my patch, swapping to SSD with page_cluster 0 (no readahead): 138 user 32230 system 70018 elapsed hugh cluster0 sdb seq 88 user7296 system 25901 elapsed hugh cluster0 sdb random 154 user 33150 system 69678 elapsed hugh cluster0 sdb shmem seq 166 user6936 system 24332 elapsed hugh cluster0 sdb shmem random 3.6-rc7 + my patch, swapping to hard disk with page_cluster 4 (default + 1): 144 user4262 system 77950 elapsed hugh cluster4 sda seq 74 user8268 system 863871 elapsed hugh cluster4 sda random 140 user4880 system 73534 elapsed hugh cluster4 sda shmem seq 160 user7788 system 834804 elapsed hugh cluster4 sda shmem random 3.6-rc7 + my patch, swapping to SSD with page_cluster 4 (default + 1): 124 user4242 system 21125 elapsed hugh cluster4 sdb seq 72 user7680 system 28686 elapsed hugh cluster4 sdb random 122 user4622 system 21387 elapsed hugh cluster4 sdb shmem seq 172 user7238 system 28226 elapsed hugh cluster4 sdb shmem random I was at first surprised to see random significantly faster than sequential on SSD with readahead off, thinking they ought to come out the same. But no, that's a warning on the limitations of the test: with an mmap of 1000M on a machine with mem=900M, the page-by-page sequential is never going to rehit cache, whereas the random has a good chance of finding in memory. Which I presume also accounts for the lower user times throughout for random - but then why not the same for shmem random? I did start off measuring on the laptop with SSD, mmap 1000M mem=500M; but once I transferred to the desktop, I rediscovered just how slow swapping to hard disk can be, couldn't wait days, so made mem=900M. > I'll run some tests too, especially I want to see how it works
Re: [PATCH RFC] mm/swap: automatic tuning for swapin readahead
Great job! I'm glad to see that you like my proof of concept patch. I though that +/-10 logic can switch between border states smoothly. But I have no strong experience in such kind of fuzzy-logic stuff, so it's no surprise that my code fails in some cases. (one note below about numbers) Hugh Dickins wrote: Shaohua, Konstantin, Sorry that it takes me so long to to reply on these swapin readahead bounding threads, but I had to try some things out before jumping in, and only found time to experiment last week. On Thu, 6 Sep 2012, Konstantin Khlebnikov wrote: This patch adds simple tracker for swapin readahread effectiveness, and tunes readahead cluster depending on it. It manage internal state [0..1024] and scales readahead order between 0 and value from sysctl vm.page-cluster (3 by default). Swapout and readahead misses decreases state, swapin and ra hits increases it: Swapin +1 [page fault, shmem, etc... ] Swapout -10 Readahead hit +10 Readahead miss -1 [removing from swapcache unused readahead page] If system is under serious memory pressure swapin readahead is useless, because pages in swap are highly fragmented and cache hit is mostly impossible. In this case swapin only leads to unnecessary memory allocations. But readahead helps to read all swapped pages back to memory if system recovers from memory pressure. This patch inspired by patch from Shaohua Li http://www.spinics.net/lists/linux-mm/msg41128.html mine version uses system wide state rather than per-VMA counters. Signed-off-by: Konstantin Khlebnikov While I appreciate the usefulness of the idea, I do have some issues with both implementations - Shaohua's currently in mmotm and next, and Konstantin's apparently overlooked. Shaohua, things I don't care for in your patch, but none of them thoroughly convincing killers: 1. As Konstantin mentioned (in other words), it dignifies the illusion that swap is somehow structured by vmas, rather than being a global pool allocated by accident of when pages fall to the bottom of lrus. 2. Following on from that, it's unable to extend its optimization to randomly accessed tmpfs files or shmem areas (and I don't want that horrid pseudo-vma stuff in shmem.c to be extended in any way to deal with this - I'd have replaced it years ago by alloc_page_mpol() if I had understood the since-acknowledged-broken mempolicy lifetimes). 3. Although putting swapra_miss into struct anon_vma was a neat memory- saving idea from Konstantin, anon_vmas are otherwise pretty much self- referential, never before holding any control information themselves: I hesitate to extend them in this way. 4. I have not actually performed the test to prove it (tell me if I'm plain wrong), but experience with trying to modify it tells me that if your vma (worse, your anon_vma) is sometimes used for sequential access and sometimes for random (or part of it for sequential and part of it for random), then a burst of randomness will switch readahead off it forever. Konstantin, given that, I wanted to speak up for your version. I admire the way you have confined it to swap_state.c (and without relying upon the FAULT_FLAG_TRIED patch), and make neat use of PageReadahead and lookup_swap_cache(). But when I compared it against vanilla or Shaohua's patch, okay it's comparable (a few percent slower?) than Shaohua's on random, and works on shmem where his fails - but it was 50% slower on sequential access (when testing on this laptop with Intel SSD: not quite the same as in the tests below, which I left your patch out of). 
I thought that's probably due to some off-by-one or other trivial bug in the patch; but when I looked to correct it, I found that I just don't understand what your heuristics are up to, the +1s and -1s and +10s and -10s. Maybe it's an off-by-ten, I haven't a clue. Perhaps, with a trivial bugfix, and comments added, yours will be great. But it drove me to steal some of your ideas, combining with a simple heuristic that even I can understand: patch below. If I boot with mem=900M (and 1G swap: either on hard disk sda, or on Vertex II SSD sdb), and mmap anonymous 1000M (either MAP_PRIVATE, or MAP_SHARED for a shmem object), and either cycle sequentially round that making 5M touches (spaced a page apart), or make 5M random touches, then here are the times in centisecs that I see (but it's only elapsed that I've been worrying about). 3.6-rc7 swapping to hard disk: 124 user6154 system 73921 elapsed -rc7 sda seq 102 user8862 system 895392 elapsed -rc7 sda random 130 user6628 system 73601 elapsed -rc7 sda shmem seq 194 user8610 system 1058375 elapsed -rc7 sda shmem random 3.6-rc7 swapping to SSD: 116 user5898 system 24634 elapsed -rc7 sdb seq 96 user8166 system 43014 elapsed -rc7 sdb random 110 user6410 system 24959 elapsed -rc7 sdb shmem seq 20
Re: [PATCH RFC] mm/swap: automatic tuning for swapin readahead
Shaohua, Konstantin, Sorry that it takes me so long to to reply on these swapin readahead bounding threads, but I had to try some things out before jumping in, and only found time to experiment last week. On Thu, 6 Sep 2012, Konstantin Khlebnikov wrote: > This patch adds simple tracker for swapin readahread effectiveness, and tunes > readahead cluster depending on it. It manage internal state [0..1024] and > scales > readahead order between 0 and value from sysctl vm.page-cluster (3 by > default). > Swapout and readahead misses decreases state, swapin and ra hits increases it: > > Swapin +1 [page fault, shmem, etc... ] > Swapout -10 > Readahead hit +10 > Readahead miss -1 [removing from swapcache unused readahead page] > > If system is under serious memory pressure swapin readahead is useless, > because > pages in swap are highly fragmented and cache hit is mostly impossible. In > this > case swapin only leads to unnecessary memory allocations. But readahead helps > to > read all swapped pages back to memory if system recovers from memory pressure. > > This patch inspired by patch from Shaohua Li > http://www.spinics.net/lists/linux-mm/msg41128.html > mine version uses system wide state rather than per-VMA counters. > > Signed-off-by: Konstantin Khlebnikov While I appreciate the usefulness of the idea, I do have some issues with both implementations - Shaohua's currently in mmotm and next, and Konstantin's apparently overlooked. Shaohua, things I don't care for in your patch, but none of them thoroughly convincing killers: 1. As Konstantin mentioned (in other words), it dignifies the illusion that swap is somehow structured by vmas, rather than being a global pool allocated by accident of when pages fall to the bottom of lrus. 2. Following on from that, it's unable to extend its optimization to randomly accessed tmpfs files or shmem areas (and I don't want that horrid pseudo-vma stuff in shmem.c to be extended in any way to deal with this - I'd have replaced it years ago by alloc_page_mpol() if I had understood the since-acknowledged-broken mempolicy lifetimes). 3. Although putting swapra_miss into struct anon_vma was a neat memory- saving idea from Konstantin, anon_vmas are otherwise pretty much self- referential, never before holding any control information themselves: I hesitate to extend them in this way. 4. I have not actually performed the test to prove it (tell me if I'm plain wrong), but experience with trying to modify it tells me that if your vma (worse, your anon_vma) is sometimes used for sequential access and sometimes for random (or part of it for sequential and part of it for random), then a burst of randomness will switch readahead off it forever. Konstantin, given that, I wanted to speak up for your version. I admire the way you have confined it to swap_state.c (and without relying upon the FAULT_FLAG_TRIED patch), and make neat use of PageReadahead and lookup_swap_cache(). But when I compared it against vanilla or Shaohua's patch, okay it's comparable (a few percent slower?) than Shaohua's on random, and works on shmem where his fails - but it was 50% slower on sequential access (when testing on this laptop with Intel SSD: not quite the same as in the tests below, which I left your patch out of). I thought that's probably due to some off-by-one or other trivial bug in the patch; but when I looked to correct it, I found that I just don't understand what your heuristics are up to, the +1s and -1s and +10s and -10s. Maybe it's an off-by-ten, I haven't a clue. 
Perhaps, with a trivial bugfix, and comments added, yours will be great. But it drove me to steal some of your ideas, combining with a simple heuristic that even I can understand: patch below. If I boot with mem=900M (and 1G swap: either on hard disk sda, or on Vertex II SSD sdb), and mmap anonymous 1000M (either MAP_PRIVATE, or MAP_SHARED for a shmem object), and either cycle sequentially round that making 5M touches (spaced a page apart), or make 5M random touches, then here are the times in centisecs that I see (but it's only elapsed that I've been worrying about). 3.6-rc7 swapping to hard disk: 124 user6154 system 73921 elapsed -rc7 sda seq 102 user8862 system 895392 elapsed -rc7 sda random 130 user6628 system 73601 elapsed -rc7 sda shmem seq 194 user8610 system 1058375 elapsed -rc7 sda shmem random 3.6-rc7 swapping to SSD: 116 user5898 system 24634 elapsed -rc7 sdb seq 96 user8166 system 43014 elapsed -rc7 sdb random 110 user6410 system 24959 elapsed -rc7 sdb shmem seq 208 user8024 system 45349 elapsed -rc7 sdb shmem random 3.6-rc7 + Shaohua's patch (and FAULT_FLAG_RETRY check in do_swap_page), HDD: 116 user6258 system 76210 elapsed shli sda seq 80 user7716 system 831243 elapsed shli sda random 128 user6640 system
[PATCH RFC] mm/swap: automatic tuning for swapin readahead
This patch adds a simple tracker for swapin readahead effectiveness, and tunes
the readahead cluster depending on it. It manages an internal state [0..1024]
and scales the readahead order between 0 and the value of sysctl
vm.page-cluster (3 by default). Swapouts and readahead misses decrease the
state, swapins and readahead hits increase it:

Swapin		+1	[page fault, shmem, etc.]
Swapout		-10
Readahead hit	+10
Readahead miss	-1	[removing an unused readahead page from the swap cache]

If the system is under serious memory pressure, swapin readahead is useless,
because pages in swap are highly fragmented and a cache hit is mostly
impossible. In that case readahead only leads to unnecessary memory
allocations. But readahead helps to read all swapped pages back to memory once
the system recovers from memory pressure.

This patch was inspired by a patch from Shaohua Li
http://www.spinics.net/lists/linux-mm/msg41128.html
but my version uses system-wide state rather than per-VMA counters.

Signed-off-by: Konstantin Khlebnikov
Cc: Shaohua Li
Cc: Rik van Riel
Cc: Minchan Kim
---
 include/linux/page-flags.h |  1 +
 mm/swap_state.c            | 42 +++++++++++++++++++++++++++++++++-----
 2 files changed, 38 insertions(+), 5 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index b5d1384..3657cdc 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -231,6 +231,7 @@ PAGEFLAG(MappedToDisk, mappedtodisk)
 /* PG_readahead is only used for file reads; PG_reclaim is only for writes */
 PAGEFLAG(Reclaim, reclaim) TESTCLEARFLAG(Reclaim, reclaim)
 PAGEFLAG(Readahead, reclaim)		/* Reminder to do async read-ahead */
+	TESTCLEARFLAG(Readahead, reclaim)
 
 #ifdef CONFIG_HIGHMEM
 /*
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 0cb36fb..d6c7a88 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -53,12 +53,31 @@ static struct {
 	unsigned long find_total;
 } swap_cache_info;
 
+#define SWAP_RA_BITS	10
+
+static atomic_t swap_ra_state = ATOMIC_INIT((1 << SWAP_RA_BITS) - 1);
+static int swap_ra_cluster = 1;
+
+static void swap_ra_update(int delta)
+{
+	int old_state, new_state;
+
+	old_state = atomic_read(&swap_ra_state);
+	new_state = clamp(old_state + delta, 0, 1 << SWAP_RA_BITS);
+	if (old_state != new_state) {
+		atomic_set(&swap_ra_state, new_state);
+		swap_ra_cluster = (page_cluster * new_state) >> SWAP_RA_BITS;
+	}
+}
+
 void show_swap_cache_info(void)
 {
 	printk("%lu pages in swap cache\n", total_swapcache_pages);
-	printk("Swap cache stats: add %lu, delete %lu, find %lu/%lu\n",
+	printk("Swap cache stats: add %lu, delete %lu, find %lu/%lu,"
+		" readahead %d/%d\n",
 		swap_cache_info.add_total, swap_cache_info.del_total,
-		swap_cache_info.find_success, swap_cache_info.find_total);
+		swap_cache_info.find_success, swap_cache_info.find_total,
+		1 << swap_ra_cluster, atomic_read(&swap_ra_state));
 	printk("Free swap  = %ldkB\n", nr_swap_pages << (PAGE_SHIFT - 10));
 	printk("Total swap = %lukB\n", total_swap_pages << (PAGE_SHIFT - 10));
 }
@@ -112,6 +131,8 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp_mask)
 	if (!error) {
 		error = __add_to_swap_cache(page, entry);
 		radix_tree_preload_end();
+		/* FIXME weird place */
+		swap_ra_update(-10);	/* swapout, decrease readahead */
 	}
 	return error;
 }
@@ -132,6 +153,8 @@ void __delete_from_swap_cache(struct page *page)
 	total_swapcache_pages--;
 	__dec_zone_page_state(page, NR_FILE_PAGES);
 	INC_CACHE_INFO(del_total);
+	if (TestClearPageReadahead(page))
+		swap_ra_update(-1);	/* readahead miss */
 }
 
 /**
@@ -265,8 +288,11 @@ struct page * lookup_swap_cache(swp_entry_t entry)
 
 	page = find_get_page(&swapper_space, entry.val);
 
-	if (page)
+	if (page) {
 		INC_CACHE_INFO(find_success);
+		if (TestClearPageReadahead(page))
+			swap_ra_update(+10);	/* readahead hit */
+	}
 
 	INC_CACHE_INFO(find_total);
 	return page;
@@ -374,11 +400,14 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 			struct vm_area_struct *vma, unsigned long addr)
 {
 	struct page *page;
-	unsigned long offset = swp_offset(entry);
+	unsigned long entry_offset = swp_offset(entry);
+	unsigned long offset = entry_offset;
 	unsigned long start_offset, end_offset;
-	unsigned long mask = (1UL << page_cluster) - 1;
+	unsigned long mask = (1UL << swap_ra_cluster) - 1;
 	struct blk_plug plug;
 
+	swap_ra_update(+1);	/* swapin, increase readahead */
+
 	/* Read a page_cluster sized and aligned cluster around offset. */
 	start_offset = offset & ~mask;
 	end_o
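To see how the state in the patch above behaves over time, the arithmetic of swap_ra_update() can be reimplemented as a standalone program (a model only, with the atomics and kernel plumbing dropped): heavy swapout pressure at -10 per page quickly collapses the effective cluster towards zero, and a later run of readahead hits at +10 per page restores it.

/*
 * Standalone model of the [0..1024] state and its scaling to a readahead
 * order; names follow the patch, but this is not the kernel code.
 */
#include <stdio.h>

#define SWAP_RA_BITS	10
#define PAGE_CLUSTER	3			/* vm.page-cluster default */

static int swap_ra_state = (1 << SWAP_RA_BITS) - 1;	/* starts near max */
static int swap_ra_cluster;

/* same arithmetic as swap_ra_update() above, minus the atomics */
static void model_update(int delta, const char *event)
{
	swap_ra_state += delta;
	if (swap_ra_state < 0)
		swap_ra_state = 0;
	if (swap_ra_state > (1 << SWAP_RA_BITS))
		swap_ra_state = 1 << SWAP_RA_BITS;
	swap_ra_cluster = (PAGE_CLUSTER * swap_ra_state) >> SWAP_RA_BITS;
	if (event)
		printf("%-14s state=%4d  readahead=%d pages\n",
		       event, swap_ra_state, 1 << swap_ra_cluster);
}

int main(void)
{
	int i;

	/* heavy swapout pressure: add_to_swap_cache() charges -10 per page */
	for (i = 1; i <= 80; i++)
		model_update(-10, i % 10 ? NULL : "after swapouts");

	/* recovery: each readahead hit found in the swap cache adds +10 */
	for (i = 1; i <= 60; i++)
		model_update(+10, i % 10 ? NULL : "after ra hits");

	return 0;
}

Running it shows the readahead window stepping down to a single page under sustained swapout and climbing back to several pages once hits start arriving, which is the behaviour the changelog describes.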