Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state
Hi Fengguang,

On Fri, Oct 26, 2012 at 4:02 PM, Fengguang Wu wrote:
> On Fri, Oct 26, 2012 at 03:47:19PM +0800, Ni zhan Chen wrote:
>> On 10/26/2012 03:36 PM, Fengguang Wu wrote:
>>> [...]
>>
>> Yeah, I mean why will the 4th shrink make the readahead size 0
>> (disabled)? What's the original value of ra->ra_pages? How can we
>> guarantee that the 4th shrink brings the readahead size to 0?
>
> Ah OK, I'm talking about the typical case. The default readahead size
> is 128k, which becomes 0 after '/ 256'. A reasonably good ra size for
> hard disks is 1MB = 256 pages, which also becomes 1 page after 4 errors.
How do you feel about my previous mail on error statistics? In fact I
prefer treating these files independently: do a check for FMODE_RANDOM
before changing the readahead window, to avoid trashing it. If a user
application calls fadvise but the media turns out to be error-prone,
then after one read we know the situation and can set the file to
FMODE_RANDOM; this should solve the issue Dave raised. Did I miss
something?

Thanks in advance.

> Thanks,
> Fengguang

--
Thanks,
Ying Zhu
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state
On 10/26/2012 04:02 PM, Fengguang Wu wrote:
> On Fri, Oct 26, 2012 at 03:47:19PM +0800, Ni zhan Chen wrote:
>> [...]
>>
>> Yeah, I mean why will the 4th shrink make the readahead size 0
>> (disabled)? What's the original value of ra->ra_pages? How can we
>> guarantee that the 4th shrink brings the readahead size to 0?
>
> Ah OK, I'm talking about the typical case. The default readahead size
> is 128k, which becomes 0 after '/ 256'. A reasonably good ra size for
> hard disks is 1MB = 256 pages, which also becomes 1 page after 4 errors.

Then why isn't the default size set to that reasonable size?
Regards,
Chen

> Thanks,
> Fengguang
Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state
On Fri, Oct 26, 2012 at 03:47:19PM +0800, Ni zhan Chen wrote:
> On 10/26/2012 03:36 PM, Fengguang Wu wrote:
>> It's this line in shrink_readahead_size_eio():
>>
>>         ra->ra_pages /= 4;
>
> Yeah, I mean why will the 4th shrink make the readahead size 0
> (disabled)? What's the original value of ra->ra_pages? How can we
> guarantee that the 4th shrink brings the readahead size to 0?

Ah OK, I'm talking about the typical case. The default readahead size
is 128k, which becomes 0 after '/ 256'. A reasonably good ra size for
hard disks is 1MB = 256 pages, which also becomes 1 page after 4 errors.
Thanks,
Fengguang
Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state
On 10/26/2012 03:36 PM, Fengguang Wu wrote:
> On Fri, Oct 26, 2012 at 03:19:57PM +0800, Ni zhan Chen wrote:
>> But from the function shrink_readahead_size_eio() and its caller
>> filemap_fault() I can't find the behavior you mentioned. How did you
>> figure it out?
>
> It's this line in shrink_readahead_size_eio():
>
>         ra->ra_pages /= 4;

Yeah, I mean why will the 4th shrink make the readahead size 0
(disabled)? What's the original value of ra->ra_pages? How can we
guarantee that the 4th shrink brings the readahead size to 0?

Regards,
Chen

> That ra_pages will keep shrinking by 4 on each error. The only way to
> restore it is to reopen the file, or use POSIX_FADV_SEQUENTIAL.
>
> Thanks,
> Fengguang
Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state
On Fri, Oct 26, 2012 at 03:19:57PM +0800, Ni zhan Chen wrote:
> On 10/26/2012 03:09 PM, Fengguang Wu wrote:
>> The original behavior is '/= 4' on each error.
>>
>> After 1 error,  the readahead size is shrunk to 1/4
>> After 2 errors, the readahead size is shrunk to 1/16
>> After 3 errors, the readahead size is shrunk to 1/64
>> After 4 errors, the readahead size is effectively 0 (disabled)
>
> But from the function shrink_readahead_size_eio() and its caller
> filemap_fault() I can't find the behavior you mentioned. How did you
> figure it out?

It's this line in shrink_readahead_size_eio():

        ra->ra_pages /= 4;

That ra_pages will keep shrinking by 4 on each error. The only way to
restore it is to reopen the file, or use POSIX_FADV_SEQUENTIAL.

Thanks,
Fengguang
Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state
On 10/26/2012 03:09 PM, Fengguang Wu wrote:
> On Fri, Oct 26, 2012 at 03:03:12PM +0800, Ni zhan Chen wrote:
>> Why should the number be 1-3?
>
> The original behavior is '/= 4' on each error.
>
> After 1 error,  the readahead size is shrunk to 1/4
> After 2 errors, the readahead size is shrunk to 1/16
> After 3 errors, the readahead size is shrunk to 1/64
> After 4 errors, the readahead size is effectively 0 (disabled)

But from the function shrink_readahead_size_eio() and its caller
filemap_fault() I can't find the behavior you mentioned. How did you
figure it out?

Regards,
Chen
Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state
On Fri, Oct 26, 2012 at 03:03:12PM +0800, Ni zhan Chen wrote:
> On 10/26/2012 02:58 PM, Fengguang Wu wrote:
>> Yes, immediately disabling readahead may hurt IO performance; the
>> original '/ 4' may perform better when only 1-3 IO errors are
>> encountered.
>
> Hi Fengguang,
>
> Why should the number be 1-3?

The original behavior is '/= 4' on each error.

After 1 error,  the readahead size is shrunk to 1/4
After 2 errors, the readahead size is shrunk to 1/16
After 3 errors, the readahead size is shrunk to 1/64
After 4 errors, the readahead size is effectively 0 (disabled)

Thanks,
Fengguang
Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state
On 10/26/2012 02:58 PM, Fengguang Wu wrote:
>>  static void shrink_readahead_size_eio(struct file *filp,
>>                                        struct file_ra_state *ra)
>>  {
>> -       ra->ra_pages /= 4;
>> +       spin_lock(&filp->f_lock);
>> +       filp->f_mode |= FMODE_RANDOM;
>> +       spin_unlock(&filp->f_lock);
>>
>> As the example in the comment above this function shows, the read may
>> still be sequential, and it will waste IO bandwidth if we set
>> FMODE_RANDOM directly.
>
> Yes, immediately disabling readahead may hurt IO performance; the
> original '/ 4' may perform better when only 1-3 IO errors are
> encountered.

Hi Fengguang,

Why should the number be 1-3?

Regards,
Chen

> Thanks,
> Fengguang

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majord...@kvack.org. For more info on Linux MM, see:
http://www.linux-mm.org/ . Don't email: em...@kvack.org
Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state
>  static void shrink_readahead_size_eio(struct file *filp,
>                                        struct file_ra_state *ra)
>  {
> -       ra->ra_pages /= 4;
> +       spin_lock(&filp->f_lock);
> +       filp->f_mode |= FMODE_RANDOM;
> +       spin_unlock(&filp->f_lock);
>
> As the example in the comment above this function shows, the read may
> still be sequential, and it will waste IO bandwidth if we set
> FMODE_RANDOM directly.

Yes, immediately disabling readahead may hurt IO performance; the
original '/ 4' may perform better when only 1-3 IO errors are
encountered.

Thanks,
Fengguang
Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state
On Fri, Oct 26, 2012 at 11:55 AM, Fengguang Wu wrote:
> On Fri, Oct 26, 2012 at 11:38:11AM +0800, YingHang Zhu wrote:
>> On Fri, Oct 26, 2012 at 8:25 AM, Dave Chinner wrote:
>>> [...]
>>> I think we really have to reset file->f_ra.ra_pages here as it is
>>> not a set-and-forget value. e.g. shrink_readahead_size_eio() can
>>> reduce ra_pages as a result of IO errors. Hence if you have had IO
>>> errors, telling the kernel that you are now going to do sequential
>>> IO should reset the readahead to the maximum supported ra_pages value.
>>
>> If we unify file->f_ra.ra_pages with its bdi->ra_pages, then an
>> error-prone device's readahead can be tuned or turned off directly
>> with blockdev, affecting all files that use the device, without
>> bringing in more complexity...
>
> It's not really feasible/convenient for the end users to hand tune
> blockdev readahead size on IO errors. Even many administrators are
> totally unaware of the readahead size parameter.

You are right, so the problem comes down to this: does one file's read
failure affect other files? For rotating disks and discs, a read
failure is likely due to bad sectors, which tend to be consecutive and
won't affect other files' reads. For a tape drive, however, a read
failure usually indicates data corruption, and other files' reads may
fail as well. In other words, should we treat the number of files that
failed to read, and where they failed, as an indicator of the state of
the backing device, or should we treat these files independently? If we
choose the former we can accumulate the statistics and change
bdi->ra_pages; otherwise we can do a check for FMODE_RANDOM before
changing the readahead window. I may have missed something; please
point it out.

Thanks,
Ying Zhu

> Thanks,
> Fengguang
Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state
On Fri, Oct 26, 2012 at 11:51 AM, Ni zhan Chen wrote:
> On 10/26/2012 11:28 AM, YingHang Zhu wrote:
>> On Fri, Oct 26, 2012 at 10:30 AM, Ni zhan Chen wrote:
>>> [...]
>>>
>>> In his patch,
>>>
>>>  static void shrink_readahead_size_eio(struct file *filp,
>>>                                        struct file_ra_state *ra)
>>>  {
>>> -       ra->ra_pages /= 4;
>>> +       spin_lock(&filp->f_lock);
>>> +       filp->f_mode |= FMODE_RANDOM;
>>> +       spin_unlock(&filp->f_lock);
>>>
>>> As the example in the comment above this function shows, the read may
>>> still be sequential, and it will waste IO bandwidth if we set
>>> FMODE_RANDOM directly.
>>
>> I've considered this. On my first try I modified file_ra_state.size
>> and file_ra_state.async_size directly, like
>>
>>     file_ra_state.async_size = 0;
>>     file_ra_state.size /= 4;
>>
>> but as I comment here, we cannot predict whether the bad sectors will
>> trash the readahead window; maybe the sectors following the current
>> one are fine for normal readahead. It's hard to know, and the current
>> approach gives us a chance to slow down gently.
>
> Then where is filp->f_mode |= FMODE_RANDOM checked? Will it influence
> ra->ra_pages?

You can find the relevant information in the function
page_cache_sync_readahead().

Thanks,
Ying Zhu
Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state
On Fri, Oct 26, 2012 at 11:38:11AM +0800, YingHang Zhu wrote:
> On Fri, Oct 26, 2012 at 8:25 AM, Dave Chinner wrote:
>> On Thu, Oct 25, 2012 at 10:58:26AM +0800, Fengguang Wu wrote:
>>> [...]
>>>
>>> --- a/mm/fadvise.c
>>> +++ b/mm/fadvise.c
>>> @@ -87,7 +86,6 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
>>>  		spin_unlock(&file->f_lock);
>>>  		break;
>>>  	case POSIX_FADV_SEQUENTIAL:
>>> -		file->f_ra.ra_pages = bdi->ra_pages * 2;
>>
>> I think we really have to reset file->f_ra.ra_pages here as it is
>> not a set-and-forget value. e.g. shrink_readahead_size_eio() can
>> reduce ra_pages as a result of IO errors. Hence if you have had IO
>> errors, telling the kernel that you are now going to do sequential
>> IO should reset the readahead to the maximum supported ra_pages value.
>
> If we unify file->f_ra.ra_pages with its bdi->ra_pages, then an
> error-prone device's readahead can be tuned or turned off directly
> with blockdev, affecting all files that use the device, without
> bringing in more complexity...

It's not really feasible/convenient for the end users to hand tune
blockdev readahead size on IO errors. Even many administrators are
totally unaware of the readahead size parameter.

Thanks,
Fengguang
Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state
On 10/26/2012 11:28 AM, YingHang Zhu wrote:
> On Fri, Oct 26, 2012 at 10:30 AM, Ni zhan Chen wrote:
>> On 10/26/2012 09:27 AM, Fengguang Wu wrote:
>>> [...]
>>>
>>> Good point!
>>>
>>> but wait, this patch removes file->f_ra.ra_pages in all the other
>>> places too, so there will be no file->f_ra.ra_pages to reset here...
>>
>> In his patch,
>>
>>  static void shrink_readahead_size_eio(struct file *filp,
>>                                        struct file_ra_state *ra)
>>  {
>> -       ra->ra_pages /= 4;
>> +       spin_lock(&filp->f_lock);
>> +       filp->f_mode |= FMODE_RANDOM;
>> +       spin_unlock(&filp->f_lock);
>>
>> As the example in the comment above this function shows, the read may
>> still be sequential, and it will waste IO bandwidth if we set
>> FMODE_RANDOM directly.
>
> I've considered this. On my first try I modified file_ra_state.size
> and file_ra_state.async_size directly, like
>
>     file_ra_state.async_size = 0;
>     file_ra_state.size /= 4;
>
> but as I comment here, we cannot predict whether the bad sectors will
> trash the readahead window; maybe the sectors following the current
> one are fine for normal readahead. It's hard to know, and the current
> approach gives us a chance to slow down gently.

Then where is filp->f_mode |= FMODE_RANDOM checked? Will it influence
ra->ra_pages?

> Thanks,
> Ying Zhu
Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state
On Fri, Oct 26, 2012 at 8:25 AM, Dave Chinner wrote:
> On Thu, Oct 25, 2012 at 10:58:26AM +0800, Fengguang Wu wrote:
>> Hi Chen,
>>
>>> But how can the bdi-related ra_pages reflect different files'
>>> readahead windows? These different files may be read sequentially,
>>> randomly, and so on.
>>
>> It's simple: sequential reads will get ra_pages of readahead size,
>> while random reads will get no readahead at all.
>>
>> Talking about the chunk below, it might hurt someone who explicitly
>> takes advantage of the behavior; however, the ra_pages*2 looks more
>> like a hack than a general solution to me: if the user needs
>> POSIX_FADV_SEQUENTIAL to double the max readahead window size to
>> improve IO performance, then why not just increase bdi->ra_pages and
>> benefit all reads? One may argue that it offers differential behavior
>> to specific applications; however, it may also act as a
>> counter-optimization: if root has already tuned bdi->ra_pages to the
>> optimal size, the doubled readahead size will only cost more memory
>> and perhaps IO latency.
>>
>> --- a/mm/fadvise.c
>> +++ b/mm/fadvise.c
>> @@ -87,7 +86,6 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
>>  		spin_unlock(&file->f_lock);
>>  		break;
>>  	case POSIX_FADV_SEQUENTIAL:
>> -		file->f_ra.ra_pages = bdi->ra_pages * 2;
>
> I think we really have to reset file->f_ra.ra_pages here as it is
> not a set-and-forget value. e.g. shrink_readahead_size_eio() can
> reduce ra_pages as a result of IO errors. Hence if you have had IO
> errors, telling the kernel that you are now going to do sequential
> IO should reset the readahead to the maximum supported ra_pages value.

If we unify file->f_ra.ra_pages with its bdi->ra_pages, then an
error-prone device's readahead can be tuned or turned off directly with
blockdev, affecting all files that use the device, without bringing in
more complexity...

Thanks,
Ying Zhu

> Cheers,
>
> Dave.
> --
> Dave Chinner
> da...@fromorbit.com
Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state
On Fri, Oct 26, 2012 at 10:30 AM, Ni zhan Chen wrote: > On 10/26/2012 09:27 AM, Fengguang Wu wrote: >> >> On Fri, Oct 26, 2012 at 11:25:44AM +1100, Dave Chinner wrote: >>> >>> On Thu, Oct 25, 2012 at 10:58:26AM +0800, Fengguang Wu wrote: Hi Chen, > But how can bdi related ra_pages reflect different files' readahead > window? Maybe these different files are sequential read, random read > and so on. It's simple: sequential reads will get ra_pages readahead size while random reads will not get readahead at all. Talking about the below chunk, it might hurt someone that explicitly takes advantage of the behavior, however the ra_pages*2 seems more like a hack than general solution to me: if the user will need POSIX_FADV_SEQUENTIAL to double the max readahead window size for improving IO performance, then why not just increase bdi->ra_pages and benefit all reads? One may argue that it offers some differential behavior to specific applications, however it may also present as a counter-optimization: if the root already tuned bdi->ra_pages to the optimal size, the doubled readahead size will only cost more memory and perhaps IO latency. --- a/mm/fadvise.c +++ b/mm/fadvise.c @@ -87,7 +86,6 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice) spin_unlock(&file->f_lock); break; case POSIX_FADV_SEQUENTIAL: - file->f_ra.ra_pages = bdi->ra_pages * 2; >>> >>> I think we really have to reset file->f_ra.ra_pages here as it is >>> not a set-and-forget value. e.g. shrink_readahead_size_eio() can >>> reduce ra_pages as a result of IO errors. Hence if you have had io >>> errors, telling the kernel that you are now going to do sequential >>> IO should reset the readahead to the maximum ra_pages value >>> supported >> >> Good point! >> >> but wait this patch removes file->f_ra.ra_pages in all other >> places too, so there will be no file->f_ra.ra_pages to be reset here... 
> > > In his patch, > > > static void shrink_readahead_size_eio(struct file *filp, > struct file_ra_state *ra) > { > - ra->ra_pages /= 4; > + spin_lock(&filp->f_lock); > + filp->f_mode |= FMODE_RANDOM; > + spin_unlock(&filp->f_lock); > > As the example in the comment above this function shows, the read may still > be sequential, and it will waste IO bandwidth if we switch to FMODE_RANDOM > directly. I've considered this. On the first try I modified file_ra_state.size and file_ra_state.async_size directly, like file_ra_state.async_size = 0; file_ra_state.size /= 4; but as I commented here, we cannot predict whether the bad sectors will trash the readahead window; the sectors following the current one may be fine for normal readahead. It's hard to know, so the current approach gives us a chance to slow down softly. Thanks, Ying Zhu > >> >> Thanks, >> Fengguang
Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state
On 10/26/2012 09:27 AM, Fengguang Wu wrote: On Fri, Oct 26, 2012 at 11:25:44AM +1100, Dave Chinner wrote: On Thu, Oct 25, 2012 at 10:58:26AM +0800, Fengguang Wu wrote: Hi Chen, But how can bdi related ra_pages reflect different files' readahead window? Maybe these different files are sequential read, random read and so on. It's simple: sequential reads will get ra_pages readahead size while random reads will not get readahead at all. Talking about the below chunk, it might hurt someone that explicitly takes advantage of the behavior, however the ra_pages*2 seems more like a hack than general solution to me: if the user will need POSIX_FADV_SEQUENTIAL to double the max readahead window size for improving IO performance, then why not just increase bdi->ra_pages and benefit all reads? One may argue that it offers some differential behavior to specific applications, however it may also present as a counter-optimization: if the root already tuned bdi->ra_pages to the optimal size, the doubled readahead size will only cost more memory and perhaps IO latency. --- a/mm/fadvise.c +++ b/mm/fadvise.c @@ -87,7 +86,6 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice) spin_unlock(&file->f_lock); break; case POSIX_FADV_SEQUENTIAL: - file->f_ra.ra_pages = bdi->ra_pages * 2; I think we really have to reset file->f_ra.ra_pages here as it is not a set-and-forget value. e.g. shrink_readahead_size_eio() can reduce ra_pages as a result of IO errors. Hence if you have had io errors, telling the kernel that you are now going to do sequential IO should reset the readahead to the maximum ra_pages value supported Good point! but wait this patch removes file->f_ra.ra_pages in all other places too, so there will be no file->f_ra.ra_pages to be reset here... 
In his patch, static void shrink_readahead_size_eio(struct file *filp, struct file_ra_state *ra) { - ra->ra_pages /= 4; + spin_lock(&filp->f_lock); + filp->f_mode |= FMODE_RANDOM; + spin_unlock(&filp->f_lock); As the example in the comment above this function shows, the read may still be sequential, and it will waste IO bandwidth if we switch to FMODE_RANDOM directly. Thanks, Fengguang
Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state
On 10/26/2012 08:25 AM, Dave Chinner wrote: On Thu, Oct 25, 2012 at 10:58:26AM +0800, Fengguang Wu wrote: Hi Chen, But how can bdi related ra_pages reflect different files' readahead window? Maybe these different files are sequential read, random read and so on. It's simple: sequential reads will get ra_pages readahead size while random reads will not get readahead at all. Talking about the below chunk, it might hurt someone that explicitly takes advantage of the behavior, however the ra_pages*2 seems more like a hack than general solution to me: if the user will need POSIX_FADV_SEQUENTIAL to double the max readahead window size for improving IO performance, then why not just increase bdi->ra_pages and benefit all reads? One may argue that it offers some differential behavior to specific applications, however it may also present as a counter-optimization: if the root already tuned bdi->ra_pages to the optimal size, the doubled readahead size will only cost more memory and perhaps IO latency. --- a/mm/fadvise.c +++ b/mm/fadvise.c @@ -87,7 +86,6 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice) spin_unlock(&file->f_lock); break; case POSIX_FADV_SEQUENTIAL: - file->f_ra.ra_pages = bdi->ra_pages * 2; I think we really have to reset file->f_ra.ra_pages here as it is not a set-and-forget value. e.g. shrink_readahead_size_eio() can reduce ra_pages as a result of IO errors. Hence if you have had io errors, telling the kernel that you are now going to do sequential IO should reset the readahead to the maximum ra_pages value supported Good catch! Cheers, Dave.
Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state
On Fri, Oct 26, 2012 at 11:25:44AM +1100, Dave Chinner wrote: > On Thu, Oct 25, 2012 at 10:58:26AM +0800, Fengguang Wu wrote: > > Hi Chen, > > > > > But how can bdi related ra_pages reflect different files' readahead > > > window? Maybe these different files are sequential read, random read > > > and so on. > > > > It's simple: sequential reads will get ra_pages readahead size while > > random reads will not get readahead at all. > > > > Talking about the below chunk, it might hurt someone that explicitly > > takes advantage of the behavior, however the ra_pages*2 seems more > > like a hack than general solution to me: if the user will need > > POSIX_FADV_SEQUENTIAL to double the max readahead window size for > > improving IO performance, then why not just increase bdi->ra_pages and > > benefit all reads? One may argue that it offers some differential > > behavior to specific applications, however it may also present as a > > counter-optimization: if the root already tuned bdi->ra_pages to the > > optimal size, the doubled readahead size will only cost more memory > > and perhaps IO latency. > > > > --- a/mm/fadvise.c > > +++ b/mm/fadvise.c > > @@ -87,7 +86,6 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, > > loff_t len, int advice) > > spin_unlock(&file->f_lock); > > break; > > case POSIX_FADV_SEQUENTIAL: > > - file->f_ra.ra_pages = bdi->ra_pages * 2; > > I think we really have to reset file->f_ra.ra_pages here as it is > not a set-and-forget value. e.g. shrink_readahead_size_eio() can > reduce ra_pages as a result of IO errors. Hence if you have had io > errors, telling the kernel that you are now going to do sequential > IO should reset the readahead to the maximum ra_pages value > supported Good point! but wait this patch removes file->f_ra.ra_pages in all other places too, so there will be no file->f_ra.ra_pages to be reset here... 
Thanks, Fengguang
Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state
On Thu, Oct 25, 2012 at 10:58:26AM +0800, Fengguang Wu wrote: > Hi Chen, > > > But how can bdi related ra_pages reflect different files' readahead > > window? Maybe these different files are sequential read, random read > > and so on. > > It's simple: sequential reads will get ra_pages readahead size while > random reads will not get readahead at all. > > Talking about the below chunk, it might hurt someone that explicitly > takes advantage of the behavior, however the ra_pages*2 seems more > like a hack than general solution to me: if the user will need > POSIX_FADV_SEQUENTIAL to double the max readahead window size for > improving IO performance, then why not just increase bdi->ra_pages and > benefit all reads? One may argue that it offers some differential > behavior to specific applications, however it may also present as a > counter-optimization: if the root already tuned bdi->ra_pages to the > optimal size, the doubled readahead size will only cost more memory > and perhaps IO latency. > > --- a/mm/fadvise.c > +++ b/mm/fadvise.c > @@ -87,7 +86,6 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t > len, int advice) > spin_unlock(&file->f_lock); > break; > case POSIX_FADV_SEQUENTIAL: > - file->f_ra.ra_pages = bdi->ra_pages * 2; I think we really have to reset file->f_ra.ra_pages here as it is not a set-and-forget value. e.g. shrink_readahead_size_eio() can reduce ra_pages as a result of IO errors. Hence if you have had io errors, telling the kernel that you are now going to do sequential IO should reset the readahead to the maximum ra_pages value supported Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state
Hi Chen, > But how can bdi related ra_pages reflect different files' readahead > window? Maybe these different files are sequential read, random read > and so on. It's simple: sequential reads will get ra_pages readahead size while random reads will not get readahead at all. Talking about the below chunk, it might hurt someone that explicitly takes advantage of the behavior, however the ra_pages*2 seems more like a hack than general solution to me: if the user will need POSIX_FADV_SEQUENTIAL to double the max readahead window size for improving IO performance, then why not just increase bdi->ra_pages and benefit all reads? One may argue that it offers some differential behavior to specific applications, however it may also present as a counter-optimization: if the root already tuned bdi->ra_pages to the optimal size, the doubled readahead size will only cost more memory and perhaps IO latency. --- a/mm/fadvise.c +++ b/mm/fadvise.c @@ -87,7 +86,6 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice) spin_unlock(&file->f_lock); break; case POSIX_FADV_SEQUENTIAL: - file->f_ra.ra_pages = bdi->ra_pages * 2; spin_lock(&file->f_lock); file->f_mode &= ~FMODE_RANDOM; spin_unlock(&file->f_lock); Thanks, Fengguang
Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state
On Thu, Oct 25, 2012 at 10:58 AM, Fengguang Wu wrote: > Hi Chen, > >> But how can bdi related ra_pages reflect different files' readahead >> window? Maybe these different files are sequential read, random read >> and so on. > > It's simple: sequential reads will get ra_pages readahead size while > random reads will not get readahead at all. > > Talking about the below chunk, it might hurt someone that explicitly > takes advantage of the behavior, however the ra_pages*2 seems more > like a hack than general solution to me: if the user will need > POSIX_FADV_SEQUENTIAL to double the max readahead window size for > improving IO performance, then why not just increase bdi->ra_pages and > benefit all reads? One may argue that it offers some differential > behavior to specific applications, however it may also present as a > counter-optimization: if the root already tuned bdi->ra_pages to the > optimal size, the doubled readahead size will only cost more memory > and perhaps IO latency. I agree, we should choose a reasonable solution here. Thanks, Ying Zhu > > --- a/mm/fadvise.c > +++ b/mm/fadvise.c > @@ -87,7 +86,6 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t > len, int advice) > spin_unlock(&file->f_lock); > break; > case POSIX_FADV_SEQUENTIAL: > - file->f_ra.ra_pages = bdi->ra_pages * 2; > spin_lock(&file->f_lock); > file->f_mode &= ~FMODE_RANDOM; > spin_unlock(&file->f_lock); > > Thanks, > Fengguang
Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state
On Thu, Oct 25, 2012 at 10:38 AM, Fengguang Wu wrote: > Hi YingHang, > >> Actually I've talked about it with Fengguang, he advised we should unify the >> ra_pages in struct bdi and file_ra_state and leave the issue that >> spreading data >> across disks as it is. >> Fengguang, what's your opinion about this? > > Yeah the two ra_pages may run out of sync for already opened files, > which could be a problem for long opened files. However as Dave put > it, a device's max readahead size is typically a static value that can > be set at mount time. So, the question is: do you really hurt from the > old behavior that deserves this code change? We could advise the above application to reopen files. As I mentioned previously, many scst users also have this problem: [quote] Note2: you need to restart SCST after you changed read-ahead settings on the target. It is a limitation of the Linux read ahead implementation. It reads RA values for each file only when the file is open and not updates them when the global RA parameters changed. Hence, the need for vdisk to reopen all its files/devices. [/quote] So IMHO it's a functional bug in the kernel that brings inconvenience to the application developers. > > I agree with Dave that the multi-disk case is not a valid concern. In > fact, how can the patch help that case? I mean, if it's two fuse files > lying in two disks, it *was* not a problem at all. If it's one big > file spreading to two disks, it's a too complex scheme to be > practically manageable which I doubt if you have such a setup. Yes this patch does not solve the issue here. I'm just pushing the discussion a little further; in reality we may never meet such a setup. Thanks, Ying Hang > > Thanks, > Fengguang
Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state
Hi YingHang, > Actually I've talked about it with Fengguang, he advised we should unify the > ra_pages in struct bdi and file_ra_state and leave the issue that > spreading data > across disks as it is. > Fengguang, what's your opinion about this? Yeah the two ra_pages may run out of sync for already opened files, which could be a problem for long opened files. However as Dave put it, a device's max readahead size is typically a static value that can be set at mount time. So, the question is: do you really hurt from the old behavior that deserves this code change? I agree with Dave that the multi-disk case is not a valid concern. In fact, how can the patch help that case? I mean, if it's two fuse files lying in two disks, it *was* not a problem at all. If it's one big file spreading to two disks, it's a too complex scheme to be practically manageable which I doubt if you have such a setup. Thanks, Fengguang
Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state
On Thu, Oct 25, 2012 at 10:12 AM, Ni zhan Chen wrote: > On 10/25/2012 10:04 AM, YingHang Zhu wrote: >> >> On Thu, Oct 25, 2012 at 9:50 AM, Dave Chinner wrote: >>> >>> On Thu, Oct 25, 2012 at 08:17:05AM +0800, YingHang Zhu wrote: On Thu, Oct 25, 2012 at 4:19 AM, Dave Chinner wrote: > > On Wed, Oct 24, 2012 at 07:53:59AM +0800, YingHang Zhu wrote: >> >> Hi Dave, >> On Wed, Oct 24, 2012 at 6:47 AM, Dave Chinner >> wrote: >>> >>> On Tue, Oct 23, 2012 at 08:46:51PM +0800, Ying Zhu wrote: Hi, Recently we ran into the bug that an opened file's ra_pages does not synchronize with it's backing device's when the latter is changed with blockdev --setra, the application needs to reopen the file to know the change, >>> >>> or simply call fadvise(fd, POSIX_FADV_NORMAL) to reset the readhead >>> window to the (new) bdi default. >>> which is inappropriate under our circumstances. >>> >>> Which are? We don't know your circumstances, so you need to tell us >>> why you need this and why existing methods of handling such changes >>> are insufficient... >>> >>> Optimal readahead windows tend to be a physical property of the >>> storage and that does not tend to change dynamically. Hence block >>> device readahead should only need to be set up once, and generally >>> that can be done before the filesystem is mounted and files are >>> opened (e.g. via udev rules). Hence you need to explain why you need >>> to change the default block device readahead on the fly, and why >>> fadvise(POSIX_FADV_NORMAL) is "inappropriate" to set readahead >>> windows to the new defaults. >> >> Our system is a fuse-based file system, fuse creates a >> pseudo backing device for the user space file systems, the default >> readahead >> size is 128KB and it can't fully utilize the backing storage's read >> ability, >> so we should tune it. > > Sure, but that doesn't tell me anything about why you can't do this > at mount time before the application opens any files. i.e. 
you've > simply stated the reason why readahead is tunable, not why you need > to be fully dynamic. We store our file system's data on different disks so we need to change ra_pages dynamically according to where the data resides, it can't be fixed at mount time or when we open files. >>> >>> That doesn't make a whole lot of sense to me. let me try to get this >>> straight. >>> >>> There is data that resides on two devices (A + B), and a fuse >>> filesystem to access that data. There is a single file in the fuse >>> fs has data on both devices. An app has the file open, and when the >>> data it is accessing is on device A you need to set the readahead to >>> what is best for device A? And when the app tries to access data for >>> that file that is on device B, you need to set the readahead to what >>> is best for device B? And you are changing the fuse BDI readahead >>> settings according to where the data in the back end lies? >>> >>> It seems to me that you should be setting the fuse readahead to the >>> maximum of the readahead windows the data devices have configured at >>> mount time and leaving it at that >> >> Then it may not fully utilize some device's read IO bandwidth and put too >> much >> burden on other devices. The abstract bdi of fuse and btrfs provides some dynamically changing bdi.ra_pages based on the real backing device. IMHO this should not be ignored. >>> >>> btrfs simply takes into account the number of disks it has for a >>> given storage pool when setting up the default bdi ra_pages during >>> mount. This is basically doing what I suggested above. Same with >>> the generic fuse code - it's simply setting a sensible default value >>> for the given fuse configuration. >>> >>> Neither are dynamic in the sense you are talking about, though. >> >> Actually I've talked about it with Fengguang, he advised we should unify >> the > > > But how can bdi related ra_pages reflect different files' readahead window? 
> Maybe these different files are sequential read, random read and so on. I think you mean the dynamic tuning of the readahead window; that's exactly the job of the readahead algorithm, and it's reflected by file_ra_state.sync_size and file_ra_state.async_size. The ra_pages in struct file_ra_state only means the max readahead ability. Thanks, Ying Zhu > >> ra_pages in struct bdi and file_ra_state and leave the issue that >> spreading data >> across disks as it is. >> Fengguang, what's your opinion about this? >> >> Thanks, >> Ying Zhu >>> >>> Cheers, >>> >>> Dave. >>> -- >>> Dave Chinner >>> da...@fromorbit.com
Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state
On 10/25/2012 10:04 AM, YingHang Zhu wrote: On Thu, Oct 25, 2012 at 9:50 AM, Dave Chinner wrote: On Thu, Oct 25, 2012 at 08:17:05AM +0800, YingHang Zhu wrote: On Thu, Oct 25, 2012 at 4:19 AM, Dave Chinner wrote: On Wed, Oct 24, 2012 at 07:53:59AM +0800, YingHang Zhu wrote: Hi Dave, On Wed, Oct 24, 2012 at 6:47 AM, Dave Chinner wrote: On Tue, Oct 23, 2012 at 08:46:51PM +0800, Ying Zhu wrote: Hi, Recently we ran into the bug that an opened file's ra_pages does not synchronize with it's backing device's when the latter is changed with blockdev --setra, the application needs to reopen the file to know the change, or simply call fadvise(fd, POSIX_FADV_NORMAL) to reset the readhead window to the (new) bdi default. which is inappropriate under our circumstances. Which are? We don't know your circumstances, so you need to tell us why you need this and why existing methods of handling such changes are insufficient... Optimal readahead windows tend to be a physical property of the storage and that does not tend to change dynamically. Hence block device readahead should only need to be set up once, and generally that can be done before the filesystem is mounted and files are opened (e.g. via udev rules). Hence you need to explain why you need to change the default block device readahead on the fly, and why fadvise(POSIX_FADV_NORMAL) is "inappropriate" to set readahead windows to the new defaults. Our system is a fuse-based file system, fuse creates a pseudo backing device for the user space file systems, the default readahead size is 128KB and it can't fully utilize the backing storage's read ability, so we should tune it. Sure, but that doesn't tell me anything about why you can't do this at mount time before the application opens any files. i.e. you've simply stated the reason why readahead is tunable, not why you need to be fully dynamic. 
We store our file system's data on different disks so we need to change ra_pages dynamically according to where the data resides, it can't be fixed at mount time or when we open files. That doesn't make a whole lot of sense to me. let me try to get this straight. There is data that resides on two devices (A + B), and a fuse filesystem to access that data. There is a single file in the fuse fs has data on both devices. An app has the file open, and when the data it is accessing is on device A you need to set the readahead to what is best for device A? And when the app tries to access data for that file that is on device B, you need to set the readahead to what is best for device B? And you are changing the fuse BDI readahead settings according to where the data in the back end lies? It seems to me that you should be setting the fuse readahead to the maximum of the readahead windows the data devices have configured at mount time and leaving it at that Then it may not fully utilize some device's read IO bandwidth and put too much burden on other devices. The abstract bdi of fuse and btrfs provides some dynamically changing bdi.ra_pages based on the real backing device. IMHO this should not be ignored. btrfs simply takes into account the number of disks it has for a given storage pool when setting up the default bdi ra_pages during mount. This is basically doing what I suggested above. Same with the generic fuse code - it's simply setting a sensible default value for the given fuse configuration. Neither are dynamic in the sense you are talking about, though. Actually I've talked about it with Fengguang, he advised we should unify the But how can bdi related ra_pages reflect different files' readahead window? Maybe these different files are sequential read, random read and so on. ra_pages in struct bdi and file_ra_state and leave the issue that spreading data across disks as it is. Fengguang, what's your opinion about this? Thanks, Ying Zhu Cheers, Dave.
-- Dave Chinner da...@fromorbit.com
Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state
On Thu, Oct 25, 2012 at 9:50 AM, Dave Chinner wrote: > On Thu, Oct 25, 2012 at 08:17:05AM +0800, YingHang Zhu wrote: >> On Thu, Oct 25, 2012 at 4:19 AM, Dave Chinner wrote: >> > On Wed, Oct 24, 2012 at 07:53:59AM +0800, YingHang Zhu wrote: >> >> Hi Dave, >> >> On Wed, Oct 24, 2012 at 6:47 AM, Dave Chinner wrote: >> >> > On Tue, Oct 23, 2012 at 08:46:51PM +0800, Ying Zhu wrote: >> >> >> Hi, >> >> >> Recently we ran into the bug that an opened file's ra_pages does not >> >> >> synchronize with it's backing device's when the latter is changed >> >> >> with blockdev --setra, the application needs to reopen the file >> >> >> to know the change, >> >> > >> >> > or simply call fadvise(fd, POSIX_FADV_NORMAL) to reset the readhead >> >> > window to the (new) bdi default. >> >> > >> >> >> which is inappropriate under our circumstances. >> >> > >> >> > Which are? We don't know your circumstances, so you need to tell us >> >> > why you need this and why existing methods of handling such changes >> >> > are insufficient... >> >> > >> >> > Optimal readahead windows tend to be a physical property of the >> >> > storage and that does not tend to change dynamically. Hence block >> >> > device readahead should only need to be set up once, and generally >> >> > that can be done before the filesystem is mounted and files are >> >> > opened (e.g. via udev rules). Hence you need to explain why you need >> >> > to change the default block device readahead on the fly, and why >> >> > fadvise(POSIX_FADV_NORMAL) is "inappropriate" to set readahead >> >> > windows to the new defaults. >> >> Our system is a fuse-based file system, fuse creates a >> >> pseudo backing device for the user space file systems, the default >> >> readahead >> >> size is 128KB and it can't fully utilize the backing storage's read >> >> ability, >> >> so we should tune it. >> > >> > Sure, but that doesn't tell me anything about why you can't do this >> > at mount time before the application opens any files. i.e. 
you've >> > simply stated the reason why readahead is tunable, not why you need >> > to be fully dynamic. >> We store our file system's data on different disks so we need to change >> ra_pages >> dynamically according to where the data resides, it can't be fixed at mount >> time >> or when we open files. > > That doesn't make a whole lot of sense to me. let me try to get this > straight. > > There is data that resides on two devices (A + B), and a fuse > filesystem to access that data. There is a single file in the fuse > fs has data on both devices. An app has the file open, and when the > data it is accessing is on device A you need to set the readahead to > what is best for device A? And when the app tries to access data for > that file that is on device B, you need to set the readahead to what > is best for device B? And you are changing the fuse BDI readahead > settings according to where the data in the back end lies? > > It seems to me that you should be setting the fuse readahead to the > maximum of the readahead windows the data devices have configured at > mount time and leaving it at that Then it may not fully utilize some device's read IO bandwidth and put too much burden on other devices. > >> The abstract bdi of fuse and btrfs provides some dynamically changing >> bdi.ra_pages >> based on the real backing device. IMHO this should not be ignored. > > btrfs simply takes into account the number of disks it has for a > given storage pool when setting up the default bdi ra_pages during > mount. This is basically doing what I suggested above. Same with > the generic fuse code - it's simply setting a sensible default value > for the given fuse configuration. > > Neither are dynamic in the sense you are talking about, though. Actually I've talked about it with Fengguang, he advised we should unify the ra_pages in struct bdi and file_ra_state and leave the issue that spreading data across disks as it is. Fengguang, what's your opinion about this?
Thanks, Ying Zhu > > Cheers, > > Dave. > -- > Dave Chinner > da...@fromorbit.com
Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state
On Thu, Oct 25, 2012 at 08:17:05AM +0800, YingHang Zhu wrote: > On Thu, Oct 25, 2012 at 4:19 AM, Dave Chinner wrote: > > On Wed, Oct 24, 2012 at 07:53:59AM +0800, YingHang Zhu wrote: > >> Hi Dave, > >> On Wed, Oct 24, 2012 at 6:47 AM, Dave Chinner wrote: > >> > On Tue, Oct 23, 2012 at 08:46:51PM +0800, Ying Zhu wrote: > >> >> Hi, > >> >> Recently we ran into the bug that an opened file's ra_pages does not > >> >> synchronize with it's backing device's when the latter is changed > >> >> with blockdev --setra, the application needs to reopen the file > >> >> to know the change, > >> > > >> > or simply call fadvise(fd, POSIX_FADV_NORMAL) to reset the readhead > >> > window to the (new) bdi default. > >> > > >> >> which is inappropriate under our circumstances. > >> > > >> > Which are? We don't know your circumstances, so you need to tell us > >> > why you need this and why existing methods of handling such changes > >> > are insufficient... > >> > > >> > Optimal readahead windows tend to be a physical property of the > >> > storage and that does not tend to change dynamically. Hence block > >> > device readahead should only need to be set up once, and generally > >> > that can be done before the filesystem is mounted and files are > >> > opened (e.g. via udev rules). Hence you need to explain why you need > >> > to change the default block device readahead on the fly, and why > >> > fadvise(POSIX_FADV_NORMAL) is "inappropriate" to set readahead > >> > windows to the new defaults. > >> Our system is a fuse-based file system, fuse creates a > >> pseudo backing device for the user space file systems, the default > >> readahead > >> size is 128KB and it can't fully utilize the backing storage's read > >> ability, > >> so we should tune it. > > > > Sure, but that doesn't tell me anything about why you can't do this > > at mount time before the application opens any files. i.e. 
you've > > simply stated the reason why readahead is tunable, not why you need > > to be fully dynamic. > We store our file system's data on different disks so we need to change > ra_pages > dynamically according to where the data resides, it can't be fixed at mount > time > or when we open files. That doesn't make a whole lot of sense to me. let me try to get this straight. There is data that resides on two devices (A + B), and a fuse filesystem to access that data. There is a single file in the fuse fs has data on both devices. An app has the file open, and when the data it is accessing is on device A you need to set the readahead to what is best for device A? And when the app tries to access data for that file that is on device B, you need to set the readahead to what is best for device B? And you are changing the fuse BDI readahead settings according to where the data in the back end lies? It seems to me that you should be setting the fuse readahead to the maximum of the readahead windows the data devices have configured at mount time and leaving it at that > The abstract bdi of fuse and btrfs provides some dynamically changing > bdi.ra_pages > based on the real backing device. IMHO this should not be ignored. btrfs simply takes into account the number of disks it has for a given storage pool when setting up the default bdi ra_pages during mount. This is basically doing what I suggested above. Same with the generic fuse code - it's simply setting a sensible default value for the given fuse configuration. Neither are dynamic in the sense you are talking about, though. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state
On 10/25/2012 08:17 AM, YingHang Zhu wrote: On Thu, Oct 25, 2012 at 4:19 AM, Dave Chinner wrote: On Wed, Oct 24, 2012 at 07:53:59AM +0800, YingHang Zhu wrote: Hi Dave, On Wed, Oct 24, 2012 at 6:47 AM, Dave Chinner wrote: On Tue, Oct 23, 2012 at 08:46:51PM +0800, Ying Zhu wrote: Hi, Recently we ran into the bug that an opened file's ra_pages does not synchronize with it's backing device's when the latter is changed with blockdev --setra, the application needs to reopen the file to know the change, or simply call fadvise(fd, POSIX_FADV_NORMAL) to reset the readhead window to the (new) bdi default. which is inappropriate under our circumstances. Which are? We don't know your circumstances, so you need to tell us why you need this and why existing methods of handling such changes are insufficient... Optimal readahead windows tend to be a physical property of the storage and that does not tend to change dynamically. Hence block device readahead should only need to be set up once, and generally that can be done before the filesystem is mounted and files are opened (e.g. via udev rules). Hence you need to explain why you need to change the default block device readahead on the fly, and why fadvise(POSIX_FADV_NORMAL) is "inappropriate" to set readahead windows to the new defaults. Our system is a fuse-based file system, fuse creates a pseudo backing device for the user space file systems, the default readahead size is 128KB and it can't fully utilize the backing storage's read ability, so we should tune it. Sure, but that doesn't tell me anything about why you can't do this at mount time before the application opens any files. i.e. you've simply stated the reason why readahead is tunable, not why you need to be fully dynamic. We store our file system's data on different disks so we need to change ra_pages dynamically according to where the data resides, it can't be fixed at mount time or when we open files. 
The abstract bdi of fuse and btrfs provides some dynamically changing bdi.ra_pages based on the real backing device. IMHO this should not be ignored. And how do we tune ra_pages if one big file is distributed across different disks? I think Fengguang Wu can answer these questions. Hi Fengguang, The above third-party application using our file system maintains some long-opened files; we do not have any chance to force them to call fadvise(POSIX_FADV_NORMAL). :( So raise a bug/feature request with the third party. Modifying kernel code because you can't directly modify the application isn't the best solution for anyone. This really is an application problem - the kernel already provides the mechanisms to solve this problem... :/ Thanks for the advice, I will consult the above application's developers for more information. Now, looking at the code itself, should we close the gap between the real device's ra_pages and the file's? The ra_pages is obviously duplicated; otherwise, each time someone runs into this problem they will redo the same work I have done here. Thanks, Ying Zhu Cheers, Dave. -- Dave Chinner da...@fromorbit.com -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majord...@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: em...@kvack.org
Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state
On Thu, Oct 25, 2012 at 4:19 AM, Dave Chinner wrote: > On Wed, Oct 24, 2012 at 07:53:59AM +0800, YingHang Zhu wrote: >> Hi Dave, >> On Wed, Oct 24, 2012 at 6:47 AM, Dave Chinner wrote: >> > On Tue, Oct 23, 2012 at 08:46:51PM +0800, Ying Zhu wrote: >> >> Hi, >> >> Recently we ran into the bug that an opened file's ra_pages does not >> >> synchronize with it's backing device's when the latter is changed >> >> with blockdev --setra, the application needs to reopen the file >> >> to know the change, >> > >> > or simply call fadvise(fd, POSIX_FADV_NORMAL) to reset the readhead >> > window to the (new) bdi default. >> > >> >> which is inappropriate under our circumstances. >> > >> > Which are? We don't know your circumstances, so you need to tell us >> > why you need this and why existing methods of handling such changes >> > are insufficient... >> > >> > Optimal readahead windows tend to be a physical property of the >> > storage and that does not tend to change dynamically. Hence block >> > device readahead should only need to be set up once, and generally >> > that can be done before the filesystem is mounted and files are >> > opened (e.g. via udev rules). Hence you need to explain why you need >> > to change the default block device readahead on the fly, and why >> > fadvise(POSIX_FADV_NORMAL) is "inappropriate" to set readahead >> > windows to the new defaults. >> Our system is a fuse-based file system, fuse creates a >> pseudo backing device for the user space file systems, the default readahead >> size is 128KB and it can't fully utilize the backing storage's read ability, >> so we should tune it. > > Sure, but that doesn't tell me anything about why you can't do this > at mount time before the application opens any files. i.e. you've > simply stated the reason why readahead is tunable, not why you need > to be fully dynamic. 
We store our file system's data on different disks, so we need to change ra_pages dynamically according to where the data resides; it can't be fixed at mount time or when we open files. The abstract bdi of fuse and btrfs provides some dynamically changing bdi.ra_pages based on the real backing device. IMHO this should not be ignored. > >> The above third-party application using our file system maintains >> some long-opened files, we does not have any chances >> to force them to call fadvise(POSIX_FADV_NORMAL). :( > > So raise a bug/feature request with the third party. Modifying > kernel code because you can't directly modify the application isn't > the best solution for anyone. This really is an application problem > - the kernel already provides the mechanisms to solve this > problem... :/ Thanks for the advice, I will consult the above application's developers for more information. Now, looking at the code itself, should we close the gap between the real device's ra_pages and the file's? The ra_pages is obviously duplicated; otherwise, each time someone runs into this problem they will redo the same work I have done here. Thanks, Ying Zhu > > Cheers, > > Dave. > -- > Dave Chinner > da...@fromorbit.com
Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state
On Wed, Oct 24, 2012 at 07:53:59AM +0800, YingHang Zhu wrote: > Hi Dave, > On Wed, Oct 24, 2012 at 6:47 AM, Dave Chinner wrote: > > On Tue, Oct 23, 2012 at 08:46:51PM +0800, Ying Zhu wrote: > >> Hi, > >> Recently we ran into the bug that an opened file's ra_pages does not > >> synchronize with it's backing device's when the latter is changed > >> with blockdev --setra, the application needs to reopen the file > >> to know the change, > > > > or simply call fadvise(fd, POSIX_FADV_NORMAL) to reset the readhead > > window to the (new) bdi default. > > > >> which is inappropriate under our circumstances. > > > > Which are? We don't know your circumstances, so you need to tell us > > why you need this and why existing methods of handling such changes > > are insufficient... > > > > Optimal readahead windows tend to be a physical property of the > > storage and that does not tend to change dynamically. Hence block > > device readahead should only need to be set up once, and generally > > that can be done before the filesystem is mounted and files are > > opened (e.g. via udev rules). Hence you need to explain why you need > > to change the default block device readahead on the fly, and why > > fadvise(POSIX_FADV_NORMAL) is "inappropriate" to set readahead > > windows to the new defaults. > Our system is a fuse-based file system, fuse creates a > pseudo backing device for the user space file systems, the default readahead > size is 128KB and it can't fully utilize the backing storage's read ability, > so we should tune it. Sure, but that doesn't tell me anything about why you can't do this at mount time before the application opens any files. i.e. you've simply stated the reason why readahead is tunable, not why you need to be fully dynamic. > The above third-party application using our file system maintains > some long-opened files, we does not have any chances > to force them to call fadvise(POSIX_FADV_NORMAL). :( So raise a bug/feature request with the third party. 
Modifying kernel code because you can't directly modify the application isn't the best solution for anyone. This really is an application problem - the kernel already provides the mechanisms to solve this problem... :/ Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state
Hi Chen, On Wed, Oct 24, 2012 at 9:02 AM, Ni zhan Chen wrote: > On 10/23/2012 09:41 PM, YingHang Zhu wrote: >> >> Sorry for the annoying, I forgot ccs in the previous mail. >> Thanks, >> Ying Zhu >> Hi Chen, >> >> On Tue, Oct 23, 2012 at 9:21 PM, Ni zhan Chen >> wrote: >>> >>> On 10/23/2012 08:46 PM, Ying Zhu wrote: Hi, Recently we ran into the bug that an opened file's ra_pages does not synchronize with it's backing device's when the latter is changed with blockdev --setra, the application needs to reopen the file to know the change, which is inappropriate under our circumstances. >>> >>> >>> Could you tell me in which function do this synchronize stuff? >> >> With this patch we use bdi.ra_pages directly, so change bdi.ra_pages also >> change an opened file's ra_pages. >>> >>> This bug is also mentioned in scst (generic SCSI target subsystem for Linux)'s README file. This patch tries to unify the ra_pages in struct file_ra_state and struct backing_dev_info. Basically current readahead algorithm will ramp file_ra_state.ra_pages up to bdi.ra_pages once it detects the >>> >>> >>> You mean ondemand readahead algorithm will do this? I don't think so. >>> file_ra_state_init only called in btrfs path, correct? >> >> No, it's also called in do_dentry_open. >>> >>> read mode is sequential. Then all files sharing the same backing device have the same max value bdi.ra_pages set in file_ra_state. >>> >>> >>> why remove file_ra_state? If one file is read sequential and another file >>> is >>> read ramdom, how can use the global bdi.ra_pages to indicate the max >>> readahead window of each file? >> >> This patch does not remove file_ra_state, an file's readahead window >> is determined >> by it's backing device. > > > As Dave said, backing device readahead window doesn't tend to change > dynamically, but file readahead window does, it will change when sequential > read, random read, thrash, interleaved read and so on occur. 
> I agree with what you and Dave said; the point here is not about how the readahead algorithm behaves. It's about those file systems using an abstract bdi instead of the actual devices, so the bdi.ra_pages does not really reflect the backing storage's read IO bandwidth. AFAIK btrfs also has an abstract bdi to manage its many backing devices, so I think btrfs has this problem too. As a further comment, btrfs may spread a file's contents across the managed backing devices, so offsets (0, x) and (x+1, y) may land on different disks with different readahead abilities; in this case the file's max readahead pages should change accordingly. Perhaps in reality we seldom meet such a heterogeneous storage architecture. Thanks, Ying Zhu >>> Applying this means the flags POSIX_FADV_NORMAL and POSIX_FADV_SEQUENTIAL in fadvise will only set file reading mode without signifying the max readahead size of the file. The current approach adds no additional overhead in the read IO path, IMHO is the simplest solution. Any comments are welcome, thanks in advance. >>> >>> Could you show me how you test this patch? >> >> This patch brings no performance gain, just fixes some functional bugs. >> By reading a 500MB file, the default max readahead size of the >> backing device was 128KB, after applying this patch, the read file's >> max ra_pages >> changed when I tuned the device's read ahead size with blockdev.
>>> >>> Thanks, Ying Zhu Signed-off-by: Ying Zhu --- include/linux/fs.h |1 - mm/fadvise.c |2 -- mm/filemap.c | 17 +++-- mm/readahead.c |8 4 files changed, 15 insertions(+), 13 deletions(-) diff --git a/include/linux/fs.h b/include/linux/fs.h index 17fd887..36303a5 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -991,7 +991,6 @@ struct file_ra_state { unsigned int async_size;/* do asynchronous readahead when there are only # of pages ahead */ - unsigned int ra_pages; /* Maximum readahead window */ unsigned int mmap_miss; /* Cache miss stat for mmap accesses */ loff_t prev_pos;/* Cache last read() position */ }; diff --git a/mm/fadvise.c b/mm/fadvise.c index 469491e..75e2378 100644 --- a/mm/fadvise.c +++ b/mm/fadvise.c @@ -76,7 +76,6 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice) switch (advice) { case POSIX_FADV_NORMAL: - file->f_ra.ra_pages = bdi->ra_pages; spin_lock(&file->f_lock); file->f_mode &= ~FMODE_RANDOM; spin_unlock(&file->f_lock); @@ -87,7 +86,6
Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state
On 10/23/2012 09:41 PM, YingHang Zhu wrote: Sorry for the annoying, I forgot ccs in the previous mail. Thanks, Ying Zhu Hi Chen, On Tue, Oct 23, 2012 at 9:21 PM, Ni zhan Chen wrote: On 10/23/2012 08:46 PM, Ying Zhu wrote: Hi, Recently we ran into the bug that an opened file's ra_pages does not synchronize with it's backing device's when the latter is changed with blockdev --setra, the application needs to reopen the file to know the change, which is inappropriate under our circumstances. Could you tell me in which function do this synchronize stuff? With this patch we use bdi.ra_pages directly, so change bdi.ra_pages also change an opened file's ra_pages. This bug is also mentioned in scst (generic SCSI target subsystem for Linux)'s README file. This patch tries to unify the ra_pages in struct file_ra_state and struct backing_dev_info. Basically current readahead algorithm will ramp file_ra_state.ra_pages up to bdi.ra_pages once it detects the You mean ondemand readahead algorithm will do this? I don't think so. file_ra_state_init only called in btrfs path, correct? No, it's also called in do_dentry_open. read mode is sequential. Then all files sharing the same backing device have the same max value bdi.ra_pages set in file_ra_state. why remove file_ra_state? If one file is read sequential and another file is read ramdom, how can use the global bdi.ra_pages to indicate the max readahead window of each file? This patch does not remove file_ra_state, an file's readahead window is determined by it's backing device. As Dave said, backing device readahead window doesn't tend to change dynamically, but file readahead window does, it will change when sequential read, random read, thrash, interleaved read and so on occur. Applying this means the flags POSIX_FADV_NORMAL and POSIX_FADV_SEQUENTIAL in fadivse will only set file reading mode without signifying the max readahead size of the file. 
The current apporach adds no additional overhead in read IO path, IMHO is the simplest solution. Any comments are welcome, thanks in advance. Could you show me how you test this patch? This patch brings no perfmance gain, just fixs some functional bugs. By reading a 500MB file, the default max readahead size of the backing device was 128KB, after applying this patch, the read file's max ra_pages changed when I tuned the device's read ahead size with blockdev. Thanks, Ying Zhu Signed-off-by: Ying Zhu --- include/linux/fs.h |1 - mm/fadvise.c |2 -- mm/filemap.c | 17 +++-- mm/readahead.c |8 4 files changed, 15 insertions(+), 13 deletions(-) diff --git a/include/linux/fs.h b/include/linux/fs.h index 17fd887..36303a5 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -991,7 +991,6 @@ struct file_ra_state { unsigned int async_size;/* do asynchronous readahead when there are only # of pages ahead */ - unsigned int ra_pages; /* Maximum readahead window */ unsigned int mmap_miss; /* Cache miss stat for mmap accesses */ loff_t prev_pos;/* Cache last read() position */ }; diff --git a/mm/fadvise.c b/mm/fadvise.c index 469491e..75e2378 100644 --- a/mm/fadvise.c +++ b/mm/fadvise.c @@ -76,7 +76,6 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice) switch (advice) { case POSIX_FADV_NORMAL: - file->f_ra.ra_pages = bdi->ra_pages; spin_lock(&file->f_lock); file->f_mode &= ~FMODE_RANDOM; spin_unlock(&file->f_lock); @@ -87,7 +86,6 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice) spin_unlock(&file->f_lock); break; case POSIX_FADV_SEQUENTIAL: - file->f_ra.ra_pages = bdi->ra_pages * 2; spin_lock(&file->f_lock); file->f_mode &= ~FMODE_RANDOM; spin_unlock(&file->f_lock); diff --git a/mm/filemap.c b/mm/filemap.c index a4a5260..e7e4409 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -1058,11 +1058,15 @@ EXPORT_SYMBOL(grab_cache_page_nowait); * readahead(R+4...B+3) => bang => read(R+4) => read(R+5) => .. * * It is going insane. 
Fix it by quickly scaling down the readahead size. + * It's hard to estimate how the bad sectors lay out, so to be conservative, + * set the read mode in random. */ static void shrink_readahead_size_eio(struct file *filp, struct file_ra_state *ra) { - ra->ra_pages /= 4; + spin_lock(&filp->f_lock); + filp->f_mode |= FMODE_RANDOM; + spin_unlock(&filp->f_lock); } /** @@ -1527,12 +1531,12 @@ static void do_sync_mmap_readahead(struct vm_area_struct *vma, /* If we don't want any read-ahead, don't bother */ if (VM_Rando
Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state
Hi Dave, On Wed, Oct 24, 2012 at 6:47 AM, Dave Chinner wrote: > On Tue, Oct 23, 2012 at 08:46:51PM +0800, Ying Zhu wrote: >> Hi, >> Recently we ran into the bug that an opened file's ra_pages does not >> synchronize with its backing device's when the latter is changed >> with blockdev --setra, the application needs to reopen the file >> to know the change, > > or simply call fadvise(fd, POSIX_FADV_NORMAL) to reset the readahead > window to the (new) bdi default. > >> which is inappropriate under our circumstances. > > Which are? We don't know your circumstances, so you need to tell us > why you need this and why existing methods of handling such changes > are insufficient... > > Optimal readahead windows tend to be a physical property of the > storage and that does not tend to change dynamically. Hence block > device readahead should only need to be set up once, and generally > that can be done before the filesystem is mounted and files are > opened (e.g. via udev rules). Hence you need to explain why you need > to change the default block device readahead on the fly, and why > fadvise(POSIX_FADV_NORMAL) is "inappropriate" to set readahead > windows to the new defaults. Our system is a fuse-based file system; fuse creates a pseudo backing device for the user space file systems. The default readahead size is 128KB, which can't fully utilize the backing storage's read capability, so we need to tune it. The above third-party application using our file system maintains some long-opened files; we do not have any chance to force them to call fadvise(POSIX_FADV_NORMAL). :( Thanks, Ying Zhu > > Cheers, > > Dave. > -- > Dave Chinner > da...@fromorbit.com
Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state
On Tue, Oct 23, 2012 at 08:46:51PM +0800, Ying Zhu wrote: > Hi, > Recently we ran into the bug that an opened file's ra_pages does not > synchronize with its backing device's when the latter is changed > with blockdev --setra, the application needs to reopen the file > to know the change, or simply call fadvise(fd, POSIX_FADV_NORMAL) to reset the readahead window to the (new) bdi default. > which is inappropriate under our circumstances. Which are? We don't know your circumstances, so you need to tell us why you need this and why existing methods of handling such changes are insufficient... Optimal readahead windows tend to be a physical property of the storage and that does not tend to change dynamically. Hence block device readahead should only need to be set up once, and generally that can be done before the filesystem is mounted and files are opened (e.g. via udev rules). Hence you need to explain why you need to change the default block device readahead on the fly, and why fadvise(POSIX_FADV_NORMAL) is "inappropriate" to set readahead windows to the new defaults. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state
Sorry for the annoying, I forgot ccs in the previous mail. Thanks, Ying Zhu Hi Chen, On Tue, Oct 23, 2012 at 9:21 PM, Ni zhan Chen wrote: > On 10/23/2012 08:46 PM, Ying Zhu wrote: >> >> Hi, >>Recently we ran into the bug that an opened file's ra_pages does not >> synchronize with it's backing device's when the latter is changed >> with blockdev --setra, the application needs to reopen the file >> to know the change, which is inappropriate under our circumstances. > > > Could you tell me in which function do this synchronize stuff? With this patch we use bdi.ra_pages directly, so change bdi.ra_pages also change an opened file's ra_pages. > > >> This bug is also mentioned in scst (generic SCSI target subsystem for >> Linux)'s >> README file. >>This patch tries to unify the ra_pages in struct file_ra_state >> and struct backing_dev_info. Basically current readahead algorithm >> will ramp file_ra_state.ra_pages up to bdi.ra_pages once it detects the > > > You mean ondemand readahead algorithm will do this? I don't think so. > file_ra_state_init only called in btrfs path, correct? No, it's also called in do_dentry_open. > > >> read mode is sequential. Then all files sharing the same backing device >> have the same max value bdi.ra_pages set in file_ra_state. > > > why remove file_ra_state? If one file is read sequential and another file is > read ramdom, how can use the global bdi.ra_pages to indicate the max > readahead window of each file? This patch does not remove file_ra_state, an file's readahead window is determined by it's backing device. > > >>Applying this means the flags POSIX_FADV_NORMAL and >> POSIX_FADV_SEQUENTIAL >> in fadivse will only set file reading mode without signifying the >> max readahead size of the file. The current apporach adds no additional >> overhead in read IO path, IMHO is the simplest solution. >> Any comments are welcome, thanks in advance. > > > Could you show me how you test this patch? 
This patch brings no perfmance gain, just fixs some functional bugs. By reading a 500MB file, the default max readahead size of the backing device was 128KB, after applying this patch, the read file's max ra_pages changed when I tuned the device's read ahead size with blockdev. > > >> >> Thanks, >> Ying Zhu >> >> Signed-off-by: Ying Zhu >> --- >> include/linux/fs.h |1 - >> mm/fadvise.c |2 -- >> mm/filemap.c | 17 +++-- >> mm/readahead.c |8 >> 4 files changed, 15 insertions(+), 13 deletions(-) >> >> diff --git a/include/linux/fs.h b/include/linux/fs.h >> index 17fd887..36303a5 100644 >> --- a/include/linux/fs.h >> +++ b/include/linux/fs.h >> @@ -991,7 +991,6 @@ struct file_ra_state { >> unsigned int async_size;/* do asynchronous readahead when >>there are only # of pages ahead >> */ >> - unsigned int ra_pages; /* Maximum readahead window */ >> unsigned int mmap_miss; /* Cache miss stat for mmap >> accesses */ >> loff_t prev_pos;/* Cache last read() position */ >> }; >> diff --git a/mm/fadvise.c b/mm/fadvise.c >> index 469491e..75e2378 100644 >> --- a/mm/fadvise.c >> +++ b/mm/fadvise.c >> @@ -76,7 +76,6 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, >> loff_t len, int advice) >> switch (advice) { >> case POSIX_FADV_NORMAL: >> - file->f_ra.ra_pages = bdi->ra_pages; >> spin_lock(&file->f_lock); >> file->f_mode &= ~FMODE_RANDOM; >> spin_unlock(&file->f_lock); >> @@ -87,7 +86,6 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, >> loff_t len, int advice) >> spin_unlock(&file->f_lock); >> break; >> case POSIX_FADV_SEQUENTIAL: >> - file->f_ra.ra_pages = bdi->ra_pages * 2; >> spin_lock(&file->f_lock); >> file->f_mode &= ~FMODE_RANDOM; >> spin_unlock(&file->f_lock); >> diff --git a/mm/filemap.c b/mm/filemap.c >> index a4a5260..e7e4409 100644 >> --- a/mm/filemap.c >> +++ b/mm/filemap.c >> @@ -1058,11 +1058,15 @@ EXPORT_SYMBOL(grab_cache_page_nowait); >>* readahead(R+4...B+3) => bang => read(R+4) => read(R+5) => .. >>* >>* It is going insane. 
Fix it by quickly scaling down the readahead >> size. >> + * It's hard to estimate how the bad sectors lay out, so to be >> conservative, >> + * set the read mode in random. >>*/ >> static void shrink_readahead_size_eio(struct file *filp, >> struct file_ra_state *ra) >> { >> - ra->ra_pages /= 4; >> + spin_lock(&filp->f_lock); >> + filp->f_mode |= FMODE_RANDOM; >> + spin_unlock(&filp->f_lock); >> } >> /** >> @@ -1527,12 +1531,12 @@ static void do_sync_mmap_readahead(struct >> vm_area_struct *vma, >> /* If we don't want any read-ahead, don't bother */ >> if (VM_Rando
Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state
On 10/23/2012 08:46 PM, Ying Zhu wrote: Hi, Recently we ran into the bug that an opened file's ra_pages does not synchronize with it's backing device's when the latter is changed with blockdev --setra, the application needs to reopen the file to know the change, which is inappropriate under our circumstances. Could you tell me in which function do this synchronize stuff? This bug is also mentioned in scst (generic SCSI target subsystem for Linux)'s README file. This patch tries to unify the ra_pages in struct file_ra_state and struct backing_dev_info. Basically current readahead algorithm will ramp file_ra_state.ra_pages up to bdi.ra_pages once it detects the You mean ondemand readahead algorithm will do this? I don't think so. file_ra_state_init only called in btrfs path, correct? read mode is sequential. Then all files sharing the same backing device have the same max value bdi.ra_pages set in file_ra_state. why remove file_ra_state? If one file is read sequential and another file is read ramdom, how can use the global bdi.ra_pages to indicate the max readahead window of each file? Applying this means the flags POSIX_FADV_NORMAL and POSIX_FADV_SEQUENTIAL in fadivse will only set file reading mode without signifying the max readahead size of the file. The current apporach adds no additional overhead in read IO path, IMHO is the simplest solution. Any comments are welcome, thanks in advance. Could you show me how you test this patch? 
Thanks, Ying Zhu Signed-off-by: Ying Zhu --- include/linux/fs.h |1 - mm/fadvise.c |2 -- mm/filemap.c | 17 +++-- mm/readahead.c |8 4 files changed, 15 insertions(+), 13 deletions(-) diff --git a/include/linux/fs.h b/include/linux/fs.h index 17fd887..36303a5 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -991,7 +991,6 @@ struct file_ra_state { unsigned int async_size;/* do asynchronous readahead when there are only # of pages ahead */ - unsigned int ra_pages; /* Maximum readahead window */ unsigned int mmap_miss; /* Cache miss stat for mmap accesses */ loff_t prev_pos;/* Cache last read() position */ }; diff --git a/mm/fadvise.c b/mm/fadvise.c index 469491e..75e2378 100644 --- a/mm/fadvise.c +++ b/mm/fadvise.c @@ -76,7 +76,6 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice) switch (advice) { case POSIX_FADV_NORMAL: - file->f_ra.ra_pages = bdi->ra_pages; spin_lock(&file->f_lock); file->f_mode &= ~FMODE_RANDOM; spin_unlock(&file->f_lock); @@ -87,7 +86,6 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice) spin_unlock(&file->f_lock); break; case POSIX_FADV_SEQUENTIAL: - file->f_ra.ra_pages = bdi->ra_pages * 2; spin_lock(&file->f_lock); file->f_mode &= ~FMODE_RANDOM; spin_unlock(&file->f_lock); diff --git a/mm/filemap.c b/mm/filemap.c index a4a5260..e7e4409 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -1058,11 +1058,15 @@ EXPORT_SYMBOL(grab_cache_page_nowait); * readahead(R+4...B+3) => bang => read(R+4) => read(R+5) => .. * * It is going insane. Fix it by quickly scaling down the readahead size. + * It's hard to estimate how the bad sectors lay out, so to be conservative, + * set the read mode in random. 
*/ static void shrink_readahead_size_eio(struct file *filp, struct file_ra_state *ra) { - ra->ra_pages /= 4; + spin_lock(&filp->f_lock); + filp->f_mode |= FMODE_RANDOM; + spin_unlock(&filp->f_lock); } /** @@ -1527,12 +1531,12 @@ static void do_sync_mmap_readahead(struct vm_area_struct *vma, /* If we don't want any read-ahead, don't bother */ if (VM_RandomReadHint(vma)) return; - if (!ra->ra_pages) + if (!mapping->backing_dev_info->ra_pages) return; if (VM_SequentialReadHint(vma)) { - page_cache_sync_readahead(mapping, ra, file, offset, - ra->ra_pages); + page_cache_sync_readahead(mapping, ra, file, offset, + mapping->backing_dev_info->ra_pages); return; } @@ -1550,7 +1554,7 @@ static void do_sync_mmap_readahead(struct vm_area_struct *vma, /* * mmap read-around */ - ra_pages = max_sane_readahead(ra->ra_pages); + ra_pages = max_sane_readahead(mapping->backing_dev_info->ra_pages); ra->start = max_t(long, 0, offset - ra_pages / 2); ra->size = ra_pages; ra->async_size = ra_pages / 4; @@ -1576,7 +1580,8 @@ static void do_async_mmap_readahead(struct vm_area_struct *vma,
[PATCH] mm: readahead: remove redundant ra_pages in file_ra_state
Hi, Recently we ran into the bug that an opened file's ra_pages does not synchronize with its backing device's when the latter is changed with blockdev --setra; the application needs to reopen the file to learn of the change, which is inappropriate under our circumstances. This bug is also mentioned in scst (generic SCSI target subsystem for Linux)'s README file. This patch tries to unify the ra_pages in struct file_ra_state and struct backing_dev_info. Basically, the current readahead algorithm will ramp file_ra_state.ra_pages up to bdi.ra_pages once it detects that the read mode is sequential. Then all files sharing the same backing device have the same max value bdi.ra_pages set in file_ra_state. Applying this means the flags POSIX_FADV_NORMAL and POSIX_FADV_SEQUENTIAL in fadvise will only set the file reading mode without signifying the max readahead size of the file. The current approach adds no additional overhead in the read IO path and IMHO is the simplest solution. Any comments are welcome, thanks in advance. 
Thanks, Ying Zhu Signed-off-by: Ying Zhu --- include/linux/fs.h |1 - mm/fadvise.c |2 -- mm/filemap.c | 17 +++-- mm/readahead.c |8 4 files changed, 15 insertions(+), 13 deletions(-) diff --git a/include/linux/fs.h b/include/linux/fs.h index 17fd887..36303a5 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -991,7 +991,6 @@ struct file_ra_state { unsigned int async_size;/* do asynchronous readahead when there are only # of pages ahead */ - unsigned int ra_pages; /* Maximum readahead window */ unsigned int mmap_miss; /* Cache miss stat for mmap accesses */ loff_t prev_pos;/* Cache last read() position */ }; diff --git a/mm/fadvise.c b/mm/fadvise.c index 469491e..75e2378 100644 --- a/mm/fadvise.c +++ b/mm/fadvise.c @@ -76,7 +76,6 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice) switch (advice) { case POSIX_FADV_NORMAL: - file->f_ra.ra_pages = bdi->ra_pages; spin_lock(&file->f_lock); file->f_mode &= ~FMODE_RANDOM; spin_unlock(&file->f_lock); @@ -87,7 +86,6 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice) spin_unlock(&file->f_lock); break; case POSIX_FADV_SEQUENTIAL: - file->f_ra.ra_pages = bdi->ra_pages * 2; spin_lock(&file->f_lock); file->f_mode &= ~FMODE_RANDOM; spin_unlock(&file->f_lock); diff --git a/mm/filemap.c b/mm/filemap.c index a4a5260..e7e4409 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -1058,11 +1058,15 @@ EXPORT_SYMBOL(grab_cache_page_nowait); * readahead(R+4...B+3) => bang => read(R+4) => read(R+5) => .. * * It is going insane. Fix it by quickly scaling down the readahead size. + * It's hard to estimate how the bad sectors lay out, so to be conservative, + * set the read mode in random. 
*/ static void shrink_readahead_size_eio(struct file *filp, struct file_ra_state *ra) { - ra->ra_pages /= 4; + spin_lock(&filp->f_lock); + filp->f_mode |= FMODE_RANDOM; + spin_unlock(&filp->f_lock); } /** @@ -1527,12 +1531,12 @@ static void do_sync_mmap_readahead(struct vm_area_struct *vma, /* If we don't want any read-ahead, don't bother */ if (VM_RandomReadHint(vma)) return; - if (!ra->ra_pages) + if (!mapping->backing_dev_info->ra_pages) return; if (VM_SequentialReadHint(vma)) { - page_cache_sync_readahead(mapping, ra, file, offset, - ra->ra_pages); + page_cache_sync_readahead(mapping, ra, file, offset, + mapping->backing_dev_info->ra_pages); return; } @@ -1550,7 +1554,7 @@ static void do_sync_mmap_readahead(struct vm_area_struct *vma, /* * mmap read-around */ - ra_pages = max_sane_readahead(ra->ra_pages); + ra_pages = max_sane_readahead(mapping->backing_dev_info->ra_pages); ra->start = max_t(long, 0, offset - ra_pages / 2); ra->size = ra_pages; ra->async_size = ra_pages / 4; @@ -1576,7 +1580,8 @@ static void do_async_mmap_readahead(struct vm_area_struct *vma, ra->mmap_miss--; if (PageReadahead(page)) page_cache_async_readahead(mapping, ra, file, - page, offset, ra->ra_pages); + page, offset, + mapping->backing_dev_info->ra_pages); } /** diff --git a/mm/readahead.c b/mm/readahead.c index ea8f8fa..6ea5999 100644 --- a/mm/readahead.c +++ b/mm/readahead.c @@ -27,7 +27,6