Re: [PATCH -v8 4/4] The design document for memory-mapped file times update

2008-01-25 Thread Anton Salikhmetov
2008/1/25, Randy Dunlap <[EMAIL PROTECTED]>:
> On Wed, 23 Jan 2008 02:21:20 +0300 Anton Salikhmetov wrote:
>
> > Add a document, which describes how the POSIX requirements on updating
> > memory-mapped file times are addressed in Linux.
>
> Hi Anton,
>
> Just a few small comments below...
>
> > Signed-off-by: Anton Salikhmetov <[EMAIL PROTECTED]>
> > ---
> >  Documentation/vm/00-INDEX  |2 +
> >  Documentation/vm/msync.txt |  117 
> > 
> >  2 files changed, 119 insertions(+), 0 deletions(-)
> >
> > diff --git a/Documentation/vm/00-INDEX b/Documentation/vm/00-INDEX
> > index 2131b00..2726c8d 100644
> > --- a/Documentation/vm/00-INDEX
> > +++ b/Documentation/vm/00-INDEX
> > @@ -6,6 +6,8 @@ hugetlbpage.txt
> >   - a brief summary of hugetlbpage support in the Linux kernel.
> >  locking
> >   - info on how locking and synchronization is done in the Linux vm 
> > code.
> > +msync.txt
> > + - the design document for memory-mapped file times update
> >  numa
> >   - information about NUMA specific code in the Linux vm.
> >  numa_memory_policy.txt
> > diff --git a/Documentation/vm/msync.txt b/Documentation/vm/msync.txt
> > new file mode 100644
> > index 000..571a766
> > --- /dev/null
> > +++ b/Documentation/vm/msync.txt
> > @@ -0,0 +1,117 @@
> > +
> > + The msync() system call and memory-mapped file times
> > +
> > + Copyright (C) 2008 Anton Salikhmetov
> > +
> > +The POSIX standard requires that any write reference to memory-mapped file
> > +data should result in updating the ctime and mtime for that file. Moreover,
> > +the standard mandates that updated file times should become visible to the
> > +world no later than at the next call to msync().
> > +
> > +Failure to meet this requirement creates difficulties for certain classes
> > +of important applications. For instance, database backup systems fail to
> > +pick up the files modified via the mmap() interface. Also, this is a
> > +security hole, which allows forging file data in such a manner that proving
> > +the fact that file data was modified is not possible.
> > +
> > +Briefly put, this requirement can be stated as follows:
> > +
> > + once the file data has changed, the operating system
> > + should acknowledge this fact by updating file metadata.
> > +
> > +This document describes how this POSIX requirement is addressed in Linux.
> > +
> > +1. Requirements
> > +
> > +1.1) the POSIX standard requires updating ctime and mtime not later
> > +than at the call to msync() with MS_SYNC or MS_ASYNC flags;
> > +
> > +1.2) in existing POSIX implementations, ctime and mtime
> > +get updated not later than at the call to fsync();
> > +
> > +1.3) in existing POSIX implementation, ctime and mtime
> > +get updated not later than at the call to sync(), the "auto-update" 
> > feature;
> > +
> > +1.4) the customers require and the common sense suggests that
> > +ctime and mtime should be updated not later than at the call to munmap()
> > +or exit(), the latter function implying an implicit call to munmap();
> > +
> > +1.5) the (1.1) item should be satisfied if the file is a block device
> > +special file;
> > +
> > +1.6) the (1.1) item should be satisfied for files residing on
> > +memory-backed filesystems such as tmpfs, too.
> > +
> > +The following operating systems were used as the reference platforms
> > +and are referred to as the "existing implementations" above:
> > +HP-UX B.11.31 and FreeBSD 6.2-RELEASE.
> > +
> > +2. Lazy update
> > +
> > +Many attempts before the current version implemented the "lazy update" 
> > approach
> > +to satisfying the requirements given above. Within the latter approach, 
> > ctime
> > +and mtime get updated at last moment allowable.
> > +
> > +Since we don't update the file times immediately, some Flag has to be
> > +used. When up, this Flag means that the file data was modified and
> > +the file times need to be updated as soon as possible.
>
> I would s/up/set/ above and below.
>
> > +Any existing "dirty" flag which, when up, mean that a page has been 
> > written to,
>
> s/mean/means/
>
> > +is not suitable for this purpose. Indeed, msync() called with MS_ASYNC
> > +would have to reset this "dirty" flag after updating ctime and mtime.
> > +The sys_msync() function itself is basically a no-op 

Re: [PATCH -v8 4/4] The design document for memory-mapped file times update

2008-01-25 Thread Anton Salikhmetov
2008/1/25, Randy Dunlap [EMAIL PROTECTED]:
 On Wed, 23 Jan 2008 02:21:20 +0300 Anton Salikhmetov wrote:

  Add a document, which describes how the POSIX requirements on updating
  memory-mapped file times are addressed in Linux.

 Hi Anton,

 Just a few small comments below...

  Signed-off-by: Anton Salikhmetov [EMAIL PROTECTED]
  ---
   Documentation/vm/00-INDEX  |2 +
   Documentation/vm/msync.txt |  117 
  
   2 files changed, 119 insertions(+), 0 deletions(-)
 
  diff --git a/Documentation/vm/00-INDEX b/Documentation/vm/00-INDEX
  index 2131b00..2726c8d 100644
  --- a/Documentation/vm/00-INDEX
  +++ b/Documentation/vm/00-INDEX
  @@ -6,6 +6,8 @@ hugetlbpage.txt
- a brief summary of hugetlbpage support in the Linux kernel.
   locking
- info on how locking and synchronization is done in the Linux vm 
  code.
  +msync.txt
  + - the design document for memory-mapped file times update
   numa
- information about NUMA specific code in the Linux vm.
   numa_memory_policy.txt
  diff --git a/Documentation/vm/msync.txt b/Documentation/vm/msync.txt
  new file mode 100644
  index 000..571a766
  --- /dev/null
  +++ b/Documentation/vm/msync.txt
  @@ -0,0 +1,117 @@
  +
  + The msync() system call and memory-mapped file times
  +
  + Copyright (C) 2008 Anton Salikhmetov
  +
  +The POSIX standard requires that any write reference to memory-mapped file
  +data should result in updating the ctime and mtime for that file. Moreover,
  +the standard mandates that updated file times should become visible to the
  +world no later than at the next call to msync().
  +
  +Failure to meet this requirement creates difficulties for certain classes
  +of important applications. For instance, database backup systems fail to
  +pick up the files modified via the mmap() interface. Also, this is a
  +security hole, which allows forging file data in such a manner that proving
  +the fact that file data was modified is not possible.
  +
  +Briefly put, this requirement can be stated as follows:
  +
  + once the file data has changed, the operating system
  + should acknowledge this fact by updating file metadata.
  +
  +This document describes how this POSIX requirement is addressed in Linux.
  +
  +1. Requirements
  +
  +1.1) the POSIX standard requires updating ctime and mtime not later
  +than at the call to msync() with MS_SYNC or MS_ASYNC flags;
  +
  +1.2) in existing POSIX implementations, ctime and mtime
  +get updated not later than at the call to fsync();
  +
  +1.3) in existing POSIX implementation, ctime and mtime
  +get updated not later than at the call to sync(), the auto-update 
  feature;
  +
  +1.4) the customers require and the common sense suggests that
  +ctime and mtime should be updated not later than at the call to munmap()
  +or exit(), the latter function implying an implicit call to munmap();
  +
  +1.5) the (1.1) item should be satisfied if the file is a block device
  +special file;
  +
  +1.6) the (1.1) item should be satisfied for files residing on
  +memory-backed filesystems such as tmpfs, too.
  +
  +The following operating systems were used as the reference platforms
  +and are referred to as the existing implementations above:
  +HP-UX B.11.31 and FreeBSD 6.2-RELEASE.
  +
  +2. Lazy update
  +
  +Many attempts before the current version implemented the lazy update 
  approach
  +to satisfying the requirements given above. Within the latter approach, 
  ctime
  +and mtime get updated at last moment allowable.
  +
  +Since we don't update the file times immediately, some Flag has to be
  +used. When up, this Flag means that the file data was modified and
  +the file times need to be updated as soon as possible.

 I would s/up/set/ above and below.

  +Any existing dirty flag which, when up, mean that a page has been 
  written to,

 s/mean/means/

  +is not suitable for this purpose. Indeed, msync() called with MS_ASYNC
  +would have to reset this dirty flag after updating ctime and mtime.
  +The sys_msync() function itself is basically a no-op in the MS_ASYNC case.
  +Thereby, the synchronization routines relying upon this dirty flag
  +would lose data. Therefore, a new Flag has to be introduced.
  +
  +The (1.5) item coupled with (1.3) requirement leads to hard work with
  +the block device inodes. Specifically, during writeback it is impossible to
  +tell which block device file was originally mapped. Therefore, we need to
  +traverse the list of active devices associated with the block device 
  inode.
  +This would lead to updating file times for block device files, which were 
  not
  +taking part in the data transfer.
  +
  +Also all versions prior to version 6 failed to correctly process ctime and
  +mtime for files on the memory-backed filesystems such as tmpfs. So the 
  (1.6)
  +requirement was not satisfied.
  +
  +If a write reference has occurred between two consecutive calls to msync

Re: [PATCH -v8 2/4] Update ctime and mtime for memory-mapped files

2008-01-23 Thread Anton Salikhmetov
2008/1/23, Linus Torvalds <[EMAIL PROTECTED]>:
>
>
> On Wed, 23 Jan 2008, Anton Salikhmetov wrote:
> >
> > Update ctime and mtime for memory-mapped files at a write access on
> > a present, read-only PTE, as well as at a write on a non-present PTE.
>
> Ok, this one I'm applying. I agree that it leaves MS_ASYNC not updating
> the file until the next sync actually happens, but I can't really bring
> myself to care at least for an imminent 2.6.24 thing. The file times are
> actually "correct" in the sense that they will now match when the IO is
> done, and my man-page says that MS_ASYNC "schedules the io to be done".
>
> And I think this is better than we have now, and I don't think this part
> is somethign that anybody really disagrees with.
>
> We can (and should) keep the MS_ASYNC issue open.

Thank you!

I have closed the bug #2645, because this patch solves the issue
originally reported.

>
> Linus
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH -v8 3/4] Enable the MS_ASYNC functionality in sys_msync()

2008-01-23 Thread Anton Salikhmetov
2008/1/23, Linus Torvalds <[EMAIL PROTECTED]>:
>
>
> On Wed, 23 Jan 2008, Anton Salikhmetov wrote:
> > +
> > + if (pte_dirty(*pte) && pte_write(*pte)) {
>
> Not correct.
>
> You still need to check "pte_present()" before you can test any other
> bits. For a non-present pte, none of the other bits are defined, and for
> all we know there might be architectures out there that require them to
> be non-dirty.
>
> As it is, you just possibly randomly corrupted the pte.
>
> Yeah, on all architectures I know of, it the pte is clear, neither of
> those tests will trigger, so it just happens to work, but it's still
> wrong. And for a MAP_SHARED mapping, it should be either clear or valid,
> although I can imagine that we might do swap-cache entries for tmpfs or
> something (in which case trying to clear the write-enable bit would
> corrupt the swap entry!).
>
> So the bug might be hard or even impossible to trigger in practice, but
> it's still wrong.
>
> I realize that "page_mkclean_one()" doesn't do this very obviously, but
> it's actually there (it's just hidden in page_check_address()).
>
> Quite frankly, at this point I'm getting *very* tired of this series.
> Especially since you ignored me when I suggested you just revert the
> commit that removed the page table walking - and instead send in a buggy
> patch.
>
> Yes, the VM is hard. I agree. It's nasty. But exactly because it's nasty
> and subtle and horrid, I'm also very anal about it, and I get really
> nervous when somebody touches it without (a) knowing all the rules
> intimately and (b) listening to people who do.
>
> So here's even a patch to get you started. Do this:
>
> git revert 204ec841fbea3e5138168edbc3a76d46747cc987
>
> and then use this appended patch on top of that as a starting point for
> something that compiles and *possibly* works.
>
> And no, I do *not* guarantee that this is right either! I have not tested
> it or thought about it a lot, and S390 tends to be odd about some of these
> things. In particular, I actually suspect that we should possibly do this
> the same way we do
>
> ptep_clear_flush_young()
>
> except we would do "ptep_clear_flush_wrprotect()". So even though this is
> a revert plus a simple patch to make it compile again (we've changed how
> we do dirty bits), I think a patch like this needs testing and other
> people like Nick and Peter to ack it.

I'm very sorry for my bad code which can not pass LKML's review.

I reassigned the bug #2645 to default assignee, Andrew Morton, because
it seems that people start getting tired of my patch series.

Thanks for your support.

>
> Nick? Peter? Testing? Other comments?
>
> Linus
>
> ---
>  mm/msync.c |9 ++---
>  1 files changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/mm/msync.c b/mm/msync.c
> index a30487f..9b0af8f 100644
> --- a/mm/msync.c
> +++ b/mm/msync.c
> @@ -32,6 +32,7 @@ static unsigned long msync_pte_range(struct vm_area_struct 
> *vma, pmd_t *pmd,
>  again:
> pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, );
> do {
> +   pte_t entry;
> struct page *page;
>
> if (progress >= 64) {
> @@ -47,9 +48,11 @@ again:
> page = vm_normal_page(vma, addr, *pte);
> if (!page)
> continue;
> -   if (ptep_clear_flush_dirty(vma, addr, pte) ||
> -   page_test_and_clear_dirty(page))
> -   ret += set_page_dirty(page);
> +   entry = ptep_clear_flush(vma, addr, pte);
> +   entry = pte_wrprotect(entry);
> +   set_pte_at(mm, address, pte, entry);
> +
> +   ret += 1;
> progress += 3;
> } while (pte++, addr += PAGE_SIZE, addr != end);
> pte_unmap_unlock(pte - 1, ptl);
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH -v8 3/4] Enable the MS_ASYNC functionality in sys_msync()

2008-01-23 Thread Anton Salikhmetov
2008/1/23, Miklos Szeredi <[EMAIL PROTECTED]>:
> > > Also, it still doesn't make sense to me why we'd not need to walk the
> > > rmap, it is all the same file after all.
> >
> > It's the same file, but not the same memory map.  It basically depends
> > on how you define msync:
> >
> >  a) sync _file_ on region defined by this mmap/start/end-address
> >  b) sync _memory_region_ defined by start/end-address
>
> My mmap/msync tester program can acually check this as well, with the
> '-f' flag.  Anton, can you try that on the reference platforms?

Here it is:

$ ./a.out file -f
begin   1201085546  1201085546  1200956936
write   1201085546  1201085546  1200956936
mmap1201085546  1201085546  1200956936
b   1201085546  1201085546  1200956936
msync b 1201085550  1201085550  1200956936
c   1201085550  1201085550  1200956936
msync c 1201085552  1201085552  1200956936
d   1201085552  1201085552  1200956936
munmap  1201085552  1201085552  1200956936
close   120108  120108  1200956936
sync120108  120108  1200956936
$ ./a.out file -sf
begin   1201085572  1201085572  1200956936
write   1201085572  1201085572  1200956936
mmap1201085572  1201085572  1200956936
b   1201085572  1201085572  1200956936
msync b 1201085576  1201085576  1200956936
c   1201085576  1201085576  1200956936
msync c 1201085578  1201085578  1200956936
d   1201085578  1201085578  1200956936
munmap  1201085578  1201085578  1200956936
close   1201085581  1201085581  1200956936
sync1201085581  1201085581  1200956936
$ uname -a
FreeBSD td152.testdrive.hp.com 6.2-RELEASE FreeBSD 6.2-RELEASE #0: Fri
Jan 12 11:05:30 UTC 2007
[EMAIL PROTECTED]:/usr/obj/usr/src/sys/SMP  i386
$

>
> Miklos
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH -v8 3/4] Enable the MS_ASYNC functionality in sys_msync()

2008-01-23 Thread Anton Salikhmetov
2008/1/23, Peter Zijlstra <[EMAIL PROTECTED]>:
>
> On Wed, 2008-01-23 at 02:21 +0300, Anton Salikhmetov wrote:
>
> > +static void vma_wrprotect_pmd_range(struct vm_area_struct *vma, pmd_t *pmd,
> > + unsigned long start, unsigned long end)
> > +{
> > + while (start < end) {
> > + spinlock_t *ptl;
> > + pte_t *pte = pte_offset_map_lock(vma->vm_mm, pmd, start, 
> > );
> > +
> > + if (pte_dirty(*pte) && pte_write(*pte)) {
> > + pte_t entry = ptep_clear_flush(vma, start, pte);
> > +
> > + entry = pte_wrprotect(entry);
> > + set_pte_at(vma->vm_mm, start, pte, entry);
> > + }
> > +
> > + pte_unmap_unlock(pte, ptl);
> > + start += PAGE_SIZE;
> > + }
> > +}
>
> You've had two examples on how to write this loop, one from git commit
> 204ec841fbea3e5138168edbc3a76d46747cc987, and one from my draft, but
> this one looks like neither and is much less efficient. Take the lock
> only once per pmd, not once per pte please.
>
> > +static void vma_wrprotect_pud_range(struct vm_area_struct *vma, pud_t *pud,
> > + unsigned long start, unsigned long end)
> > +{
> > + pmd_t *pmd = pmd_offset(pud, start);
> > +
> > + while (start < end) {
> > + unsigned long next = pmd_addr_end(start, end);
> > +
> > + if (!pmd_none_or_clear_bad(pmd))
> > + vma_wrprotect_pmd_range(vma, pmd, start, next);
> > +
> > + ++pmd;
> > + start = next;
> > + }
> > +}
> > +
> > +static void vma_wrprotect_pgd_range(struct vm_area_struct *vma, pgd_t *pgd,
> > + unsigned long start, unsigned long end)
> > +{
> > + pud_t *pud = pud_offset(pgd, start);
> > +
> > + while (start < end) {
> > + unsigned long next = pud_addr_end(start, end);
> > +
> > + if (!pud_none_or_clear_bad(pud))
> > + vma_wrprotect_pud_range(vma, pud, start, next);
> > +
> > + ++pud;
> > + start = next;
> > + }
> > +}
> > +
> > +static void vma_wrprotect(struct vm_area_struct *vma)
> > +{
> > + unsigned long addr = vma->vm_start;
> > + pgd_t *pgd = pgd_offset(vma->vm_mm, addr);
> > +
> > + while (addr < vma->vm_end) {
> > + unsigned long next = pgd_addr_end(addr, vma->vm_end);
> > +
> > + if (!pgd_none_or_clear_bad(pgd))
> > + vma_wrprotect_pgd_range(vma, pgd, addr, next);
> > +
> > + ++pgd;
> > + addr = next;
> > + }
> > +}
>
> I think you want to pass start, end here too, you might not need to
> sweep the whole vma.

Thanks for you feedback, Peter!

I will redesign the vma_wrprotect_pmd_range() routine the way it
acquires the spinlock outside of the loop. I will also rewrite the
vma_wrprotect() function to process only the specified range.

>
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH -v8 4/4] The design document for memory-mapped file times update

2008-01-23 Thread Anton Salikhmetov
2008/1/23, Miklos Szeredi <[EMAIL PROTECTED]>:
> > Ah, OK, this is becuase mmap doesn't actually set up the page tables
> > by default.   Try adding MAP_POPULATE to the flags.
>
> Here's an updated version of the program, with an added a '-r' option,
> that performs a read access on the mapping before doing the write
> (basically equivalent to MAP_POPULATE, but portable).
>
> Please try this on a tmpfs file.
>
> Thanks,
> Miklos
>
> ---
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
>
> static const char *filename;
> static int msync_flag = MS_ASYNC;
> static int msync_fork = 0;
> static int msync_read = 0;
>
> static void print_times(const char *msg)
> {
> struct stat stbuf;
> stat(filename, );
> printf("%s\t%li\t%li\t%li\n", msg, stbuf.st_ctime, stbuf.st_mtime,
>stbuf.st_atime);
> }
>
> static void do_msync(void *addr, int len)
> {
> int res;
> if (!msync_fork) {
> res = msync(addr, len, msync_flag);
> if (res == -1) {
> perror("msync");
> exit(1);
> }
> } else {
> int pid = fork();
> if (pid == -1) {
> perror("fork");
> exit(1);
> }
> if (!pid) {
> int fd = open(filename, O_RDWR);
> if (fd == -1) {
> perror("open");
> exit(1);
> }
> addr = mmap(NULL, 4096, PROT_READ | PROT_WRITE, 
> MAP_SHARED, fd, 0);
> if (addr == MAP_FAILED) {
> perror("mmap");
> exit(1);
> }
> res = msync(addr, len, msync_flag);
> if (res == -1) {
> perror("msync");
> exit(1);
> }
> exit(0);
> }
> wait(NULL);
> }
> }
>
> static void usage(const char *progname)
> {
> fprintf(stderr, "usage: %s filename [-sfr]\n", progname);
> fprintf(stderr, " -s: use MS_SYNC instead of MS_ASYNC\n");
> fprintf(stderr, " -f: fork and perform msync in a different 
> process\n");
> fprintf(stderr, " -r: do a read access before each write access\n");
> exit(1);
> }
>
> int main(int argc, char *argv[])
> {
> int res;
> char *addr;
> char tmp[32];
> int fd;
>
> if (argc < 2)
> usage(argv[0]);
>
> filename = argv[1];
> if (argc > 2) {
> char *s;
> if (argc > 3)
> usage(argv[0]);
> s = argv[2];
> if (s[0] != '-' || !s[1])
> usage(argv[0]);
> for (s++; *s; s++) {
> switch (*s) {
> case 's':
> msync_flag = MS_SYNC;
> break;
> case 'f':
> msync_fork = 1;
> break;
> case 'r':
> msync_read = 1;
> break;
> default:
> usage(argv[0]);
> }
> }
> }
>
> fd = open(filename, O_RDWR | O_TRUNC | O_CREAT, 0666);
> if (fd == -1) {
> perror(filename);
> return 1;
> }
> print_times("begin");
> sleep(1);
> write(fd, "wxyz\n", 4);
> print_times("write");
> sleep(1);
> addr = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
> if (addr == MAP_FAILED) {
> perror("mmap");
> return 1;
> }
> print_times("mmap");
> sleep(1);
>
> if (msync_read) {
> sprintf(tmp, "fetch %c", addr[1]);
> print_times(tmp);
> sleep(1);
> }
> addr[1] = 'b';
> print_times("store b");
> sleep(1);
> do_msync(addr, 4);
> print_times("msync");
> sleep(1);
>
> if (msync_read) {
> sprintf(tmp, "fetch %c", addr[2]);
> print_times(tmp);
> sleep(1);
> }
> addr[2] = 'c';
> print_times("store c");
> sleep(1);
> do_msync(addr, 4);
> print_times("msync");
> sleep(1);
>
> if (msync_read) {
> sprintf(tmp, "fetch %c", addr[3]);
> print_times(tmp);
> sleep(1);
> }
>

Re: [PATCH -v8 4/4] The design document for memory-mapped file times update

2008-01-23 Thread Anton Salikhmetov
2008/1/23, Miklos Szeredi <[EMAIL PROTECTED]>:
> I think it would be more logical to move this patch forward, before
> the two patches which this document is actually describing.
>
> > Add a document, which describes how the POSIX requirements on updating
> > memory-mapped file times are addressed in Linux.
> >
> > Signed-off-by: Anton Salikhmetov <[EMAIL PROTECTED]>
> > ---
> >  Documentation/vm/00-INDEX  |2 +
> >  Documentation/vm/msync.txt |  117 
> > 
> >  2 files changed, 119 insertions(+), 0 deletions(-)
> >
> > diff --git a/Documentation/vm/00-INDEX b/Documentation/vm/00-INDEX
> > index 2131b00..2726c8d 100644
> > --- a/Documentation/vm/00-INDEX
> > +++ b/Documentation/vm/00-INDEX
> > @@ -6,6 +6,8 @@ hugetlbpage.txt
> >   - a brief summary of hugetlbpage support in the Linux kernel.
> >  locking
> >   - info on how locking and synchronization is done in the Linux vm 
> > code.
> > +msync.txt
> > + - the design document for memory-mapped file times update
> >  numa
> >   - information about NUMA specific code in the Linux vm.
> >  numa_memory_policy.txt
> > diff --git a/Documentation/vm/msync.txt b/Documentation/vm/msync.txt
> > new file mode 100644
> > index 000..571a766
> > --- /dev/null
> > +++ b/Documentation/vm/msync.txt
> > @@ -0,0 +1,117 @@
> > +
> > + The msync() system call and memory-mapped file times
> > +
> > + Copyright (C) 2008 Anton Salikhmetov
> > +
> > +The POSIX standard requires that any write reference to memory-mapped file
> > +data should result in updating the ctime and mtime for that file. Moreover,
> > +the standard mandates that updated file times should become visible to the
> > +world no later than at the next call to msync().
> > +
> > +Failure to meet this requirement creates difficulties for certain classes
> > +of important applications. For instance, database backup systems fail to
> > +pick up the files modified via the mmap() interface. Also, this is a
> > +security hole, which allows forging file data in such a manner that proving
> > +the fact that file data was modified is not possible.
> > +
> > +Briefly put, this requirement can be stated as follows:
> > +
> > + once the file data has changed, the operating system
> > + should acknowledge this fact by updating file metadata.
> > +
> > +This document describes how this POSIX requirement is addressed in Linux.
> > +
> > +1. Requirements
> > +
> > +1.1) the POSIX standard requires updating ctime and mtime not later
> > +than at the call to msync() with MS_SYNC or MS_ASYNC flags;
> > +
> > +1.2) in existing POSIX implementations, ctime and mtime
> > +get updated not later than at the call to fsync();
> > +
> > +1.3) in existing POSIX implementation, ctime and mtime
> > +get updated not later than at the call to sync(), the "auto-update" 
> > feature;
> > +
> > +1.4) the customers require and the common sense suggests that
> > +ctime and mtime should be updated not later than at the call to munmap()
> > +or exit(), the latter function implying an implicit call to munmap();
> > +
> > +1.5) the (1.1) item should be satisfied if the file is a block device
> > +special file;
> > +
> > +1.6) the (1.1) item should be satisfied for files residing on
> > +memory-backed filesystems such as tmpfs, too.
> > +
> > +The following operating systems were used as the reference platforms
> > +and are referred to as the "existing implementations" above:
> > +HP-UX B.11.31 and FreeBSD 6.2-RELEASE.
> > +
> > +2. Lazy update
> > +
> > +Many attempts before the current version implemented the "lazy update" 
> > approach
> > +to satisfying the requirements given above. Within the latter approach, 
> > ctime
> > +and mtime get updated at last moment allowable.
> > +
> > +Since we don't update the file times immediately, some Flag has to be
> > +used. When up, this Flag means that the file data was modified and
> > +the file times need to be updated as soon as possible.
> > +
> > +Any existing "dirty" flag which, when up, mean that a page has been 
> > written to,
> > +is not suitable for this purpose. Indeed, msync() called with MS_ASYNC
> > +would have to reset this "dirty" flag after updating ctime and mtime.
> > +The sys_msync() function itself is basically a no-op in the MS_ASYNC case.
> > +Thereby, the synchr

Re: [PATCH -v8 4/4] The design document for memory-mapped file times update

2008-01-23 Thread Anton Salikhmetov
2008/1/23, Miklos Szeredi [EMAIL PROTECTED]:
 I think it would be more logical to move this patch forward, before
 the two patches which this document is actually describing.

  Add a document, which describes how the POSIX requirements on updating
  memory-mapped file times are addressed in Linux.
 
  Signed-off-by: Anton Salikhmetov [EMAIL PROTECTED]
  ---
   Documentation/vm/00-INDEX  |2 +
   Documentation/vm/msync.txt |  117 
  
   2 files changed, 119 insertions(+), 0 deletions(-)
 
  diff --git a/Documentation/vm/00-INDEX b/Documentation/vm/00-INDEX
  index 2131b00..2726c8d 100644
  --- a/Documentation/vm/00-INDEX
  +++ b/Documentation/vm/00-INDEX
  @@ -6,6 +6,8 @@ hugetlbpage.txt
- a brief summary of hugetlbpage support in the Linux kernel.
   locking
- info on how locking and synchronization is done in the Linux vm 
  code.
  +msync.txt
  + - the design document for memory-mapped file times update
   numa
- information about NUMA specific code in the Linux vm.
   numa_memory_policy.txt
  diff --git a/Documentation/vm/msync.txt b/Documentation/vm/msync.txt
  new file mode 100644
  index 000..571a766
  --- /dev/null
  +++ b/Documentation/vm/msync.txt
  @@ -0,0 +1,117 @@
  +
  + The msync() system call and memory-mapped file times
  +
  + Copyright (C) 2008 Anton Salikhmetov
  +
  +The POSIX standard requires that any write reference to memory-mapped file
  +data should result in updating the ctime and mtime for that file. Moreover,
  +the standard mandates that updated file times should become visible to the
  +world no later than at the next call to msync().
  +
  +Failure to meet this requirement creates difficulties for certain classes
  +of important applications. For instance, database backup systems fail to
  +pick up the files modified via the mmap() interface. Also, this is a
  +security hole, which allows forging file data in such a manner that proving
  +the fact that file data was modified is not possible.
  +
  +Briefly put, this requirement can be stated as follows:
  +
  + once the file data has changed, the operating system
  + should acknowledge this fact by updating file metadata.
  +
  +This document describes how this POSIX requirement is addressed in Linux.
  +
  +1. Requirements
  +
  +1.1) the POSIX standard requires updating ctime and mtime not later
  +than at the call to msync() with MS_SYNC or MS_ASYNC flags;
  +
  +1.2) in existing POSIX implementations, ctime and mtime
  +get updated not later than at the call to fsync();
  +
  +1.3) in existing POSIX implementation, ctime and mtime
  +get updated not later than at the call to sync(), the auto-update 
  feature;
  +
  +1.4) the customers require and the common sense suggests that
  +ctime and mtime should be updated not later than at the call to munmap()
  +or exit(), the latter function implying an implicit call to munmap();
  +
  +1.5) the (1.1) item should be satisfied if the file is a block device
  +special file;
  +
  +1.6) the (1.1) item should be satisfied for files residing on
  +memory-backed filesystems such as tmpfs, too.
  +
  +The following operating systems were used as the reference platforms
  +and are referred to as the existing implementations above:
  +HP-UX B.11.31 and FreeBSD 6.2-RELEASE.
  +
  +2. Lazy update
  +
  +Many attempts before the current version implemented the lazy update 
  approach
  +to satisfying the requirements given above. Within the latter approach, 
  ctime
  +and mtime get updated at last moment allowable.
  +
  +Since we don't update the file times immediately, some Flag has to be
  +used. When up, this Flag means that the file data was modified and
  +the file times need to be updated as soon as possible.
  +
  +Any existing dirty flag which, when up, mean that a page has been 
  written to,
  +is not suitable for this purpose. Indeed, msync() called with MS_ASYNC
  +would have to reset this dirty flag after updating ctime and mtime.
  +The sys_msync() function itself is basically a no-op in the MS_ASYNC case.
  +Thereby, the synchronization routines relying upon this dirty flag
  +would lose data. Therefore, a new Flag has to be introduced.
  +
  +The (1.5) item coupled with (1.3) requirement leads to hard work with
  +the block device inodes. Specifically, during writeback it is impossible to
  +tell which block device file was originally mapped. Therefore, we need to
  +traverse the list of active devices associated with the block device 
  inode.
  +This would lead to updating file times for block device files, which were 
  not
  +taking part in the data transfer.
  +
  +Also all versions prior to version 6 failed to correctly process ctime and
  +mtime for files on the memory-backed filesystems such as tmpfs. So the 
  (1.6)
  +requirement was not satisfied.

 Version -8 also fails: for ram backed filesystems page tables are not
 write protected initially, nor after a sync

Re: [PATCH -v8 4/4] The design document for memory-mapped file times update

2008-01-23 Thread Anton Salikhmetov
2008/1/23, Miklos Szeredi [EMAIL PROTECTED]:
  Ah, OK, this is becuase mmap doesn't actually set up the page tables
  by default.   Try adding MAP_POPULATE to the flags.

 Here's an updated version of the program, with an added a '-r' option,
 that performs a read access on the mapping before doing the write
 (basically equivalent to MAP_POPULATE, but portable).

 Please try this on a tmpfs file.

 Thanks,
 Miklos

 ---
 #include stdio.h
 #include stdlib.h
 #include unistd.h
 #include string.h
 #include fcntl.h
 #include sys/mman.h
 #include sys/stat.h
 #include sys/wait.h

 static const char *filename;
 static int msync_flag = MS_ASYNC;
 static int msync_fork = 0;
 static int msync_read = 0;

 static void print_times(const char *msg)
 {
 struct stat stbuf;
 stat(filename, stbuf);
 printf(%s\t%li\t%li\t%li\n, msg, stbuf.st_ctime, stbuf.st_mtime,
stbuf.st_atime);
 }

 static void do_msync(void *addr, int len)
 {
 int res;
 if (!msync_fork) {
 res = msync(addr, len, msync_flag);
 if (res == -1) {
 perror(msync);
 exit(1);
 }
 } else {
 int pid = fork();
 if (pid == -1) {
 perror(fork);
 exit(1);
 }
 if (!pid) {
 int fd = open(filename, O_RDWR);
 if (fd == -1) {
 perror(open);
 exit(1);
 }
 addr = mmap(NULL, 4096, PROT_READ | PROT_WRITE, 
 MAP_SHARED, fd, 0);
 if (addr == MAP_FAILED) {
 perror(mmap);
 exit(1);
 }
 res = msync(addr, len, msync_flag);
 if (res == -1) {
 perror(msync);
 exit(1);
 }
 exit(0);
 }
 wait(NULL);
 }
 }

 static void usage(const char *progname)
 {
 fprintf(stderr, usage: %s filename [-sfr]\n, progname);
 fprintf(stderr,  -s: use MS_SYNC instead of MS_ASYNC\n);
 fprintf(stderr,  -f: fork and perform msync in a different 
 process\n);
 fprintf(stderr,  -r: do a read access before each write access\n);
 exit(1);
 }

 int main(int argc, char *argv[])
 {
 int res;
 char *addr;
 char tmp[32];
 int fd;

 if (argc  2)
 usage(argv[0]);

 filename = argv[1];
 if (argc  2) {
 char *s;
 if (argc  3)
 usage(argv[0]);
 s = argv[2];
 if (s[0] != '-' || !s[1])
 usage(argv[0]);
 for (s++; *s; s++) {
 switch (*s) {
 case 's':
 msync_flag = MS_SYNC;
 break;
 case 'f':
 msync_fork = 1;
 break;
 case 'r':
 msync_read = 1;
 break;
 default:
 usage(argv[0]);
 }
 }
 }

 fd = open(filename, O_RDWR | O_TRUNC | O_CREAT, 0666);
 if (fd == -1) {
 perror(filename);
 return 1;
 }
 print_times(begin);
 sleep(1);
 write(fd, wxyz\n, 4);
 print_times(write);
 sleep(1);
 addr = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
 if (addr == MAP_FAILED) {
 perror(mmap);
 return 1;
 }
 print_times(mmap);
 sleep(1);

 if (msync_read) {
 sprintf(tmp, fetch %c, addr[1]);
 print_times(tmp);
 sleep(1);
 }
 addr[1] = 'b';
 print_times(store b);
 sleep(1);
 do_msync(addr, 4);
 print_times(msync);
 sleep(1);

 if (msync_read) {
 sprintf(tmp, fetch %c, addr[2]);
 print_times(tmp);
 sleep(1);
 }
 addr[2] = 'c';
 print_times(store c);
 sleep(1);
 do_msync(addr, 4);
 print_times(msync);
 sleep(1);

 if (msync_read) {
 sprintf(tmp, fetch %c, addr[3]);
 print_times(tmp);
 sleep(1);
 }
 addr[3] = 'd';
 print_times(store d);
 sleep(1);
 res = munmap(addr, 4);
 if (res == -1) {

Re: [PATCH -v8 3/4] Enable the MS_ASYNC functionality in sys_msync()

2008-01-23 Thread Anton Salikhmetov
2008/1/23, Peter Zijlstra [EMAIL PROTECTED]:

 On Wed, 2008-01-23 at 02:21 +0300, Anton Salikhmetov wrote:

  +static void vma_wrprotect_pmd_range(struct vm_area_struct *vma, pmd_t *pmd,
  + unsigned long start, unsigned long end)
  +{
  + while (start  end) {
  + spinlock_t *ptl;
  + pte_t *pte = pte_offset_map_lock(vma-vm_mm, pmd, start, 
  ptl);
  +
  + if (pte_dirty(*pte)  pte_write(*pte)) {
  + pte_t entry = ptep_clear_flush(vma, start, pte);
  +
  + entry = pte_wrprotect(entry);
  + set_pte_at(vma-vm_mm, start, pte, entry);
  + }
  +
  + pte_unmap_unlock(pte, ptl);
  + start += PAGE_SIZE;
  + }
  +}

 You've had two examples on how to write this loop, one from git commit
 204ec841fbea3e5138168edbc3a76d46747cc987, and one from my draft, but
 this one looks like neither and is much less efficient. Take the lock
 only once per pmd, not once per pte please.

  +static void vma_wrprotect_pud_range(struct vm_area_struct *vma, pud_t *pud,
  + unsigned long start, unsigned long end)
  +{
  + pmd_t *pmd = pmd_offset(pud, start);
  +
  + while (start  end) {
  + unsigned long next = pmd_addr_end(start, end);
  +
  + if (!pmd_none_or_clear_bad(pmd))
  + vma_wrprotect_pmd_range(vma, pmd, start, next);
  +
  + ++pmd;
  + start = next;
  + }
  +}
  +
  +static void vma_wrprotect_pgd_range(struct vm_area_struct *vma, pgd_t *pgd,
  + unsigned long start, unsigned long end)
  +{
  + pud_t *pud = pud_offset(pgd, start);
  +
  + while (start  end) {
  + unsigned long next = pud_addr_end(start, end);
  +
  + if (!pud_none_or_clear_bad(pud))
  + vma_wrprotect_pud_range(vma, pud, start, next);
  +
  + ++pud;
  + start = next;
  + }
  +}
  +
  +static void vma_wrprotect(struct vm_area_struct *vma)
  +{
  + unsigned long addr = vma-vm_start;
  + pgd_t *pgd = pgd_offset(vma-vm_mm, addr);
  +
  + while (addr  vma-vm_end) {
  + unsigned long next = pgd_addr_end(addr, vma-vm_end);
  +
  + if (!pgd_none_or_clear_bad(pgd))
  + vma_wrprotect_pgd_range(vma, pgd, addr, next);
  +
  + ++pgd;
  + addr = next;
  + }
  +}

 I think you want to pass start, end here too, you might not need to
 sweep the whole vma.

Thanks for you feedback, Peter!

I will redesign the vma_wrprotect_pmd_range() routine the way it
acquires the spinlock outside of the loop. I will also rewrite the
vma_wrprotect() function to process only the specified range.




--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH -v8 3/4] Enable the MS_ASYNC functionality in sys_msync()

2008-01-23 Thread Anton Salikhmetov
2008/1/23, Miklos Szeredi [EMAIL PROTECTED]:
   Also, it still doesn't make sense to me why we'd not need to walk the
   rmap, it is all the same file after all.
 
  It's the same file, but not the same memory map.  It basically depends
  on how you define msync:
 
   a) sync _file_ on region defined by this mmap/start/end-address
   b) sync _memory_region_ defined by start/end-address

 My mmap/msync tester program can acually check this as well, with the
 '-f' flag.  Anton, can you try that on the reference platforms?

Here it is:

$ ./a.out file -f
begin   1201085546  1201085546  1200956936
write   1201085546  1201085546  1200956936
mmap1201085546  1201085546  1200956936
b   1201085546  1201085546  1200956936
msync b 1201085550  1201085550  1200956936
c   1201085550  1201085550  1200956936
msync c 1201085552  1201085552  1200956936
d   1201085552  1201085552  1200956936
munmap  1201085552  1201085552  1200956936
close   120108  120108  1200956936
sync120108  120108  1200956936
$ ./a.out file -sf
begin   1201085572  1201085572  1200956936
write   1201085572  1201085572  1200956936
mmap1201085572  1201085572  1200956936
b   1201085572  1201085572  1200956936
msync b 1201085576  1201085576  1200956936
c   1201085576  1201085576  1200956936
msync c 1201085578  1201085578  1200956936
d   1201085578  1201085578  1200956936
munmap  1201085578  1201085578  1200956936
close   1201085581  1201085581  1200956936
sync1201085581  1201085581  1200956936
$ uname -a
FreeBSD td152.testdrive.hp.com 6.2-RELEASE FreeBSD 6.2-RELEASE #0: Fri
Jan 12 11:05:30 UTC 2007
[EMAIL PROTECTED]:/usr/obj/usr/src/sys/SMP  i386
$


 Miklos

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH -v8 3/4] Enable the MS_ASYNC functionality in sys_msync()

2008-01-23 Thread Anton Salikhmetov
2008/1/23, Linus Torvalds [EMAIL PROTECTED]:


 On Wed, 23 Jan 2008, Anton Salikhmetov wrote:
  +
  + if (pte_dirty(*pte)  pte_write(*pte)) {

 Not correct.

 You still need to check pte_present() before you can test any other
 bits. For a non-present pte, none of the other bits are defined, and for
 all we know there might be architectures out there that require them to
 be non-dirty.

 As it is, you just possibly randomly corrupted the pte.

 Yeah, on all architectures I know of, it the pte is clear, neither of
 those tests will trigger, so it just happens to work, but it's still
 wrong. And for a MAP_SHARED mapping, it should be either clear or valid,
 although I can imagine that we might do swap-cache entries for tmpfs or
 something (in which case trying to clear the write-enable bit would
 corrupt the swap entry!).

 So the bug might be hard or even impossible to trigger in practice, but
 it's still wrong.

 I realize that page_mkclean_one() doesn't do this very obviously, but
 it's actually there (it's just hidden in page_check_address()).

 Quite frankly, at this point I'm getting *very* tired of this series.
 Especially since you ignored me when I suggested you just revert the
 commit that removed the page table walking - and instead send in a buggy
 patch.

 Yes, the VM is hard. I agree. It's nasty. But exactly because it's nasty
 and subtle and horrid, I'm also very anal about it, and I get really
 nervous when somebody touches it without (a) knowing all the rules
 intimately and (b) listening to people who do.

 So here's even a patch to get you started. Do this:

 git revert 204ec841fbea3e5138168edbc3a76d46747cc987

 and then use this appended patch on top of that as a starting point for
 something that compiles and *possibly* works.

 And no, I do *not* guarantee that this is right either! I have not tested
 it or thought about it a lot, and S390 tends to be odd about some of these
 things. In particular, I actually suspect that we should possibly do this
 the same way we do

 ptep_clear_flush_young()

 except we would do ptep_clear_flush_wrprotect(). So even though this is
 a revert plus a simple patch to make it compile again (we've changed how
 we do dirty bits), I think a patch like this needs testing and other
 people like Nick and Peter to ack it.

I'm very sorry for my bad code which can not pass LKML's review.

I reassigned the bug #2645 to default assignee, Andrew Morton, because
it seems that people start getting tired of my patch series.

Thanks for your support.


 Nick? Peter? Testing? Other comments?

 Linus

 ---
  mm/msync.c |9 ++---
  1 files changed, 6 insertions(+), 3 deletions(-)

 diff --git a/mm/msync.c b/mm/msync.c
 index a30487f..9b0af8f 100644
 --- a/mm/msync.c
 +++ b/mm/msync.c
 @@ -32,6 +32,7 @@ static unsigned long msync_pte_range(struct vm_area_struct 
 *vma, pmd_t *pmd,
  again:
 pte = pte_offset_map_lock(vma-vm_mm, pmd, addr, ptl);
 do {
 +   pte_t entry;
 struct page *page;

 if (progress = 64) {
 @@ -47,9 +48,11 @@ again:
 page = vm_normal_page(vma, addr, *pte);
 if (!page)
 continue;
 -   if (ptep_clear_flush_dirty(vma, addr, pte) ||
 -   page_test_and_clear_dirty(page))
 -   ret += set_page_dirty(page);
 +   entry = ptep_clear_flush(vma, addr, pte);
 +   entry = pte_wrprotect(entry);
 +   set_pte_at(mm, address, pte, entry);
 +
 +   ret += 1;
 progress += 3;
 } while (pte++, addr += PAGE_SIZE, addr != end);
 pte_unmap_unlock(pte - 1, ptl);

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH -v8 2/4] Update ctime and mtime for memory-mapped files

2008-01-23 Thread Anton Salikhmetov
2008/1/23, Linus Torvalds [EMAIL PROTECTED]:


 On Wed, 23 Jan 2008, Anton Salikhmetov wrote:
 
  Update ctime and mtime for memory-mapped files at a write access on
  a present, read-only PTE, as well as at a write on a non-present PTE.

 Ok, this one I'm applying. I agree that it leaves MS_ASYNC not updating
 the file until the next sync actually happens, but I can't really bring
 myself to care at least for an imminent 2.6.24 thing. The file times are
 actually correct in the sense that they will now match when the IO is
 done, and my man-page says that MS_ASYNC schedules the io to be done.

 And I think this is better than we have now, and I don't think this part
 is somethign that anybody really disagrees with.

 We can (and should) keep the MS_ASYNC issue open.

Thank you!

I have closed the bug #2645, because this patch solves the issue
originally reported.


 Linus

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH -v8 1/4] Massive code cleanup of sys_msync()

2008-01-22 Thread Anton Salikhmetov
Use the PAGE_ALIGN() macro instead of "manual" alignment.
Improve readability of the loop, which traverses the process
memory regions. Make code more symmetric and possibly boost
performance on some RISC CPUs by moving variable assignments.

Signed-off-by: Anton Salikhmetov <[EMAIL PROTECTED]>
---
 mm/msync.c |   76 ---
 1 files changed, 36 insertions(+), 40 deletions(-)

diff --git a/mm/msync.c b/mm/msync.c
index 144a757..60efa36 100644
--- a/mm/msync.c
+++ b/mm/msync.c
@@ -1,85 +1,84 @@
 /*
- * linux/mm/msync.c
+ * The msync() system call.
  *
- * Copyright (C) 1994-1999  Linus Torvalds
+ * Copyright (C) 1994-1999 Linus Torvalds
+ * Copyright (C) 2008 Anton Salikhmetov <[EMAIL PROTECTED]>
  */
 
-/*
- * The msync() system call.
- */
+#include 
 #include 
 #include 
 #include 
-#include 
-#include 
 #include 
+#include 
 
 /*
  * MS_SYNC syncs the entire file - including mappings.
  *
  * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
- * Nor does it marks the relevant pages dirty (it used to up to 2.6.17).
+ * Nor does it mark the relevant pages dirty (it used to up to 2.6.17).
  * Now it doesn't do anything, since dirty pages are properly tracked.
  *
- * The application may now run fsync() to
- * write out the dirty pages and wait on the writeout and check the result.
- * Or the application may run fadvise(FADV_DONTNEED) against the fd to start
- * async writeout immediately.
+ * The application may now run fsync() to write out the dirty pages and
+ * wait on the writeout and check the result. Or the application may run
+ * fadvise(FADV_DONTNEED) against the fd to start async writeout immediately.
  * So by _not_ starting I/O in MS_ASYNC we provide complete flexibility to
  * applications.
  */
 asmlinkage long sys_msync(unsigned long start, size_t len, int flags)
 {
unsigned long end;
-   struct mm_struct *mm = current->mm;
+   int error, unmapped_error;
struct vm_area_struct *vma;
-   int unmapped_error = 0;
-   int error = -EINVAL;
+   struct mm_struct *mm;
 
+   error = -EINVAL;
if (flags & ~(MS_ASYNC | MS_INVALIDATE | MS_SYNC))
goto out;
if (start & ~PAGE_MASK)
goto out;
if ((flags & MS_ASYNC) && (flags & MS_SYNC))
goto out;
+
error = -ENOMEM;
-   len = (len + ~PAGE_MASK) & PAGE_MASK;
+   len = PAGE_ALIGN(len);
end = start + len;
if (end < start)
goto out;
+
error = 0;
+   unmapped_error = 0;
if (end == start)
goto out;
+
/*
 * If the interval [start,end) covers some unmapped address ranges,
 * just ignore them, but return -ENOMEM at the end.
 */
+   mm = current->mm;
down_read(>mmap_sem);
vma = find_vma(mm, start);
-   for (;;) {
+   do {
struct file *file;
 
-   /* Still start < end. */
error = -ENOMEM;
if (!vma)
-   goto out_unlock;
-   /* Here start < vma->vm_end. */
+   break;
if (start < vma->vm_start) {
start = vma->vm_start;
if (start >= end)
-   goto out_unlock;
-   unmapped_error = -ENOMEM;
-   }
-   /* Here vma->vm_start <= start < vma->vm_end. */
-   if ((flags & MS_INVALIDATE) &&
-   (vma->vm_flags & VM_LOCKED)) {
-   error = -EBUSY;
-   goto out_unlock;
+   break;
+   unmapped_error = error;
}
-   file = vma->vm_file;
+
+   error = -EBUSY;
+   if ((flags & MS_INVALIDATE) && (vma->vm_flags & VM_LOCKED))
+   break;
+
+   error = 0;
start = vma->vm_end;
-   if ((flags & MS_SYNC) && file &&
-   (vma->vm_flags & VM_SHARED)) {
+   file = vma->vm_file;
+   if (file && (vma->vm_flags & VM_SHARED) && (flags & MS_SYNC)) {
get_file(file);
up_read(>mmap_sem);
error = do_fsync(file, 0);
@@ -88,16 +87,13 @@ asmlinkage long sys_msync(unsigned long start, size_t len, 
int flags)
goto out;
down_read(>mmap_sem);
vma = find_vma(mm, start);
-   } else {
-   if (start >= end) {
-   error = 0;
-   goto out_unlock;

[PATCH -v8 4/4] The design document for memory-mapped file times update

2008-01-22 Thread Anton Salikhmetov
Add a document, which describes how the POSIX requirements on updating
memory-mapped file times are addressed in Linux.

Signed-off-by: Anton Salikhmetov <[EMAIL PROTECTED]>
---
 Documentation/vm/00-INDEX  |2 +
 Documentation/vm/msync.txt |  117 
 2 files changed, 119 insertions(+), 0 deletions(-)

diff --git a/Documentation/vm/00-INDEX b/Documentation/vm/00-INDEX
index 2131b00..2726c8d 100644
--- a/Documentation/vm/00-INDEX
+++ b/Documentation/vm/00-INDEX
@@ -6,6 +6,8 @@ hugetlbpage.txt
- a brief summary of hugetlbpage support in the Linux kernel.
 locking
- info on how locking and synchronization is done in the Linux vm code.
+msync.txt
+   - the design document for memory-mapped file times update
 numa
- information about NUMA specific code in the Linux vm.
 numa_memory_policy.txt
diff --git a/Documentation/vm/msync.txt b/Documentation/vm/msync.txt
new file mode 100644
index 000..571a766
--- /dev/null
+++ b/Documentation/vm/msync.txt
@@ -0,0 +1,117 @@
+
+   The msync() system call and memory-mapped file times
+
+   Copyright (C) 2008 Anton Salikhmetov
+
+The POSIX standard requires that any write reference to memory-mapped file
+data should result in updating the ctime and mtime for that file. Moreover,
+the standard mandates that updated file times should become visible to the
+world no later than at the next call to msync().
+
+Failure to meet this requirement creates difficulties for certain classes
+of important applications. For instance, database backup systems fail to
+pick up the files modified via the mmap() interface. Also, this is a
+security hole, which allows forging file data in such a manner that proving
+the fact that file data was modified is not possible.
+
+Briefly put, this requirement can be stated as follows:
+
+   once the file data has changed, the operating system
+   should acknowledge this fact by updating file metadata.
+
+This document describes how this POSIX requirement is addressed in Linux.
+
+1. Requirements
+
+1.1) the POSIX standard requires updating ctime and mtime not later
+than at the call to msync() with MS_SYNC or MS_ASYNC flags;
+
+1.2) in existing POSIX implementations, ctime and mtime
+get updated not later than at the call to fsync();
+
+1.3) in existing POSIX implementation, ctime and mtime
+get updated not later than at the call to sync(), the "auto-update" feature;
+
+1.4) the customers require and the common sense suggests that
+ctime and mtime should be updated not later than at the call to munmap()
+or exit(), the latter function implying an implicit call to munmap();
+
+1.5) the (1.1) item should be satisfied if the file is a block device
+special file;
+
+1.6) the (1.1) item should be satisfied for files residing on
+memory-backed filesystems such as tmpfs, too.
+
+The following operating systems were used as the reference platforms
+and are referred to as the "existing implementations" above:
+HP-UX B.11.31 and FreeBSD 6.2-RELEASE.
+
+2. Lazy update
+
+Many attempts before the current version implemented the "lazy update" approach
+to satisfying the requirements given above. Within the latter approach, ctime
+and mtime get updated at last moment allowable.
+
+Since we don't update the file times immediately, some Flag has to be
+used. When up, this Flag means that the file data was modified and
+the file times need to be updated as soon as possible.
+
+Any existing "dirty" flag which, when up, mean that a page has been written to,
+is not suitable for this purpose. Indeed, msync() called with MS_ASYNC
+would have to reset this "dirty" flag after updating ctime and mtime.
+The sys_msync() function itself is basically a no-op in the MS_ASYNC case.
+Thereby, the synchronization routines relying upon this "dirty" flag
+would lose data. Therefore, a new Flag has to be introduced.
+
+The (1.5) item coupled with (1.3) requirement leads to hard work with
+the block device inodes. Specifically, during writeback it is impossible to
+tell which block device file was originally mapped. Therefore, we need to
+traverse the list of "active" devices associated with the block device inode.
+This would lead to updating file times for block device files, which were not
+taking part in the data transfer.
+
+Also all versions prior to version 6 failed to correctly process ctime and
+mtime for files on the memory-backed filesystems such as tmpfs. So the (1.6)
+requirement was not satisfied.
+
+If a write reference has occurred between two consecutive calls to msync()
+with MS_ASYNC, the second call to the latter function should take into
+account the last write reference. The last write reference can not be caught
+if no pagefault occurs. Hence a pagefault needs to be forced. This can be done
+using two different approaches. The first one is to synchronize data even when
+msync() was called with M

[PATCH -v8 2/4] Update ctime and mtime for memory-mapped files

2008-01-22 Thread Anton Salikhmetov
Update ctime and mtime for memory-mapped files at a write access on
a present, read-only PTE, as well as at a write on a non-present PTE.

Signed-off-by: Anton Salikhmetov <[EMAIL PROTECTED]>
---
 mm/memory.c |6 ++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 6dd1cd8..4b0144b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1670,6 +1670,9 @@ gotten:
 unlock:
pte_unmap_unlock(page_table, ptl);
if (dirty_page) {
+   if (vma->vm_file)
+   file_update_time(vma->vm_file);
+
/*
 * Yes, Virginia, this is actually required to prevent a race
 * with clear_page_dirty_for_io() from clearing the page dirty
@@ -2343,6 +2346,9 @@ out_unlocked:
if (anon)
page_cache_release(vmf.page);
else if (dirty_page) {
+   if (vma->vm_file)
+   file_update_time(vma->vm_file);
+
set_page_dirty_balance(dirty_page, page_mkwrite);
put_page(dirty_page);
}
-- 
1.4.4.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH -v8 0/4] Fixing the issue with memory-mapped file times

2008-01-22 Thread Anton Salikhmetov
This is the eighth version of my solution for the bug #2645:

http://bugzilla.kernel.org/show_bug.cgi?id=2645

Since the previous version, the following has changed:

1) based on Linus' comment, a more efficient PTE walker implemented;

2) the design document added to the kernel documentation.

Functional tests successfully passed.

Please comment.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH -v8 3/4] Enable the MS_ASYNC functionality in sys_msync()

2008-01-22 Thread Anton Salikhmetov
Force file times update at the next write reference after
calling the msync() system call with the MS_ASYNC flag.

Signed-off-by: Anton Salikhmetov <[EMAIL PROTECTED]>
---
 mm/msync.c |   92 +--
 1 files changed, 82 insertions(+), 10 deletions(-)

diff --git a/mm/msync.c b/mm/msync.c
index 60efa36..87f990e 100644
--- a/mm/msync.c
+++ b/mm/msync.c
@@ -5,6 +5,7 @@
  * Copyright (C) 2008 Anton Salikhmetov <[EMAIL PROTECTED]>
  */
 
+#include 
 #include 
 #include 
 #include 
@@ -12,6 +13,73 @@
 #include 
 #include 
 
+static void vma_wrprotect_pmd_range(struct vm_area_struct *vma, pmd_t *pmd,
+   unsigned long start, unsigned long end)
+{
+   while (start < end) {
+   spinlock_t *ptl;
+   pte_t *pte = pte_offset_map_lock(vma->vm_mm, pmd, start, );
+
+   if (pte_dirty(*pte) && pte_write(*pte)) {
+   pte_t entry = ptep_clear_flush(vma, start, pte);
+
+   entry = pte_wrprotect(entry);
+   set_pte_at(vma->vm_mm, start, pte, entry);
+   }
+
+   pte_unmap_unlock(pte, ptl);
+   start += PAGE_SIZE;
+   }
+}
+
+static void vma_wrprotect_pud_range(struct vm_area_struct *vma, pud_t *pud,
+   unsigned long start, unsigned long end)
+{
+   pmd_t *pmd = pmd_offset(pud, start);
+
+   while (start < end) {
+   unsigned long next = pmd_addr_end(start, end);
+
+   if (!pmd_none_or_clear_bad(pmd))
+   vma_wrprotect_pmd_range(vma, pmd, start, next);
+
+   ++pmd;
+   start = next;
+   }
+}
+
+static void vma_wrprotect_pgd_range(struct vm_area_struct *vma, pgd_t *pgd,
+   unsigned long start, unsigned long end)
+{
+   pud_t *pud = pud_offset(pgd, start);
+
+   while (start < end) {
+   unsigned long next = pud_addr_end(start, end);
+
+   if (!pud_none_or_clear_bad(pud))
+   vma_wrprotect_pud_range(vma, pud, start, next);
+
+   ++pud;
+   start = next;
+   }
+}
+
+static void vma_wrprotect(struct vm_area_struct *vma)
+{
+   unsigned long addr = vma->vm_start;
+   pgd_t *pgd = pgd_offset(vma->vm_mm, addr);
+
+   while (addr < vma->vm_end) {
+   unsigned long next = pgd_addr_end(addr, vma->vm_end);
+
+   if (!pgd_none_or_clear_bad(pgd))
+   vma_wrprotect_pgd_range(vma, pgd, addr, next);
+
+   ++pgd;
+   addr = next;
+   }
+}
+
 /*
  * MS_SYNC syncs the entire file - including mappings.
  *
@@ -78,16 +146,20 @@ asmlinkage long sys_msync(unsigned long start, size_t len, 
int flags)
error = 0;
start = vma->vm_end;
file = vma->vm_file;
-   if (file && (vma->vm_flags & VM_SHARED) && (flags & MS_SYNC)) {
-   get_file(file);
-   up_read(>mmap_sem);
-   error = do_fsync(file, 0);
-   fput(file);
-   if (error || start >= end)
-   goto out;
-   down_read(>mmap_sem);
-   vma = find_vma(mm, start);
-   continue;
+   if (file && (vma->vm_flags & VM_SHARED)) {
+   if ((flags & MS_ASYNC))
+   vma_wrprotect(vma);
+   if (flags & MS_SYNC) {
+   get_file(file);
+   up_read(>mmap_sem);
+   error = do_fsync(file, 0);
+   fput(file);
+   if (error || start >= end)
+   goto out;
+   down_read(>mmap_sem);
+   vma = find_vma(mm, start);
+   continue;
+   }
}
 
vma = vma->vm_next;
-- 
1.4.4.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH -v8 2/4] Update ctime and mtime for memory-mapped files

2008-01-22 Thread Anton Salikhmetov
Update ctime and mtime for memory-mapped files at a write access on
a present, read-only PTE, as well as at a write on a non-present PTE.

Signed-off-by: Anton Salikhmetov [EMAIL PROTECTED]
---
 mm/memory.c |6 ++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 6dd1cd8..4b0144b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1670,6 +1670,9 @@ gotten:
 unlock:
pte_unmap_unlock(page_table, ptl);
if (dirty_page) {
+   if (vma-vm_file)
+   file_update_time(vma-vm_file);
+
/*
 * Yes, Virginia, this is actually required to prevent a race
 * with clear_page_dirty_for_io() from clearing the page dirty
@@ -2343,6 +2346,9 @@ out_unlocked:
if (anon)
page_cache_release(vmf.page);
else if (dirty_page) {
+   if (vma-vm_file)
+   file_update_time(vma-vm_file);
+
set_page_dirty_balance(dirty_page, page_mkwrite);
put_page(dirty_page);
}
-- 
1.4.4.4

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH -v8 0/4] Fixing the issue with memory-mapped file times

2008-01-22 Thread Anton Salikhmetov
This is the eighth version of my solution for the bug #2645:

http://bugzilla.kernel.org/show_bug.cgi?id=2645

Since the previous version, the following has changed:

1) based on Linus' comment, a more efficient PTE walker implemented;

2) the design document added to the kernel documentation.

Functional tests successfully passed.

Please comment.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH -v8 3/4] Enable the MS_ASYNC functionality in sys_msync()

2008-01-22 Thread Anton Salikhmetov
Force file times update at the next write reference after
calling the msync() system call with the MS_ASYNC flag.

Signed-off-by: Anton Salikhmetov [EMAIL PROTECTED]
---
 mm/msync.c |   92 +--
 1 files changed, 82 insertions(+), 10 deletions(-)

diff --git a/mm/msync.c b/mm/msync.c
index 60efa36..87f990e 100644
--- a/mm/msync.c
+++ b/mm/msync.c
@@ -5,6 +5,7 @@
  * Copyright (C) 2008 Anton Salikhmetov [EMAIL PROTECTED]
  */
 
+#include asm/tlbflush.h
 #include linux/file.h
 #include linux/fs.h
 #include linux/mm.h
@@ -12,6 +13,73 @@
 #include linux/sched.h
 #include linux/syscalls.h
 
+static void vma_wrprotect_pmd_range(struct vm_area_struct *vma, pmd_t *pmd,
+   unsigned long start, unsigned long end)
+{
+   while (start  end) {
+   spinlock_t *ptl;
+   pte_t *pte = pte_offset_map_lock(vma-vm_mm, pmd, start, ptl);
+
+   if (pte_dirty(*pte)  pte_write(*pte)) {
+   pte_t entry = ptep_clear_flush(vma, start, pte);
+
+   entry = pte_wrprotect(entry);
+   set_pte_at(vma-vm_mm, start, pte, entry);
+   }
+
+   pte_unmap_unlock(pte, ptl);
+   start += PAGE_SIZE;
+   }
+}
+
+static void vma_wrprotect_pud_range(struct vm_area_struct *vma, pud_t *pud,
+   unsigned long start, unsigned long end)
+{
+   pmd_t *pmd = pmd_offset(pud, start);
+
+   while (start  end) {
+   unsigned long next = pmd_addr_end(start, end);
+
+   if (!pmd_none_or_clear_bad(pmd))
+   vma_wrprotect_pmd_range(vma, pmd, start, next);
+
+   ++pmd;
+   start = next;
+   }
+}
+
+static void vma_wrprotect_pgd_range(struct vm_area_struct *vma, pgd_t *pgd,
+   unsigned long start, unsigned long end)
+{
+   pud_t *pud = pud_offset(pgd, start);
+
+   while (start  end) {
+   unsigned long next = pud_addr_end(start, end);
+
+   if (!pud_none_or_clear_bad(pud))
+   vma_wrprotect_pud_range(vma, pud, start, next);
+
+   ++pud;
+   start = next;
+   }
+}
+
+static void vma_wrprotect(struct vm_area_struct *vma)
+{
+   unsigned long addr = vma-vm_start;
+   pgd_t *pgd = pgd_offset(vma-vm_mm, addr);
+
+   while (addr  vma-vm_end) {
+   unsigned long next = pgd_addr_end(addr, vma-vm_end);
+
+   if (!pgd_none_or_clear_bad(pgd))
+   vma_wrprotect_pgd_range(vma, pgd, addr, next);
+
+   ++pgd;
+   addr = next;
+   }
+}
+
 /*
  * MS_SYNC syncs the entire file - including mappings.
  *
@@ -78,16 +146,20 @@ asmlinkage long sys_msync(unsigned long start, size_t len, 
int flags)
error = 0;
start = vma-vm_end;
file = vma-vm_file;
-   if (file  (vma-vm_flags  VM_SHARED)  (flags  MS_SYNC)) {
-   get_file(file);
-   up_read(mm-mmap_sem);
-   error = do_fsync(file, 0);
-   fput(file);
-   if (error || start = end)
-   goto out;
-   down_read(mm-mmap_sem);
-   vma = find_vma(mm, start);
-   continue;
+   if (file  (vma-vm_flags  VM_SHARED)) {
+   if ((flags  MS_ASYNC))
+   vma_wrprotect(vma);
+   if (flags  MS_SYNC) {
+   get_file(file);
+   up_read(mm-mmap_sem);
+   error = do_fsync(file, 0);
+   fput(file);
+   if (error || start = end)
+   goto out;
+   down_read(mm-mmap_sem);
+   vma = find_vma(mm, start);
+   continue;
+   }
}
 
vma = vma-vm_next;
-- 
1.4.4.4

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH -v8 1/4] Massive code cleanup of sys_msync()

2008-01-22 Thread Anton Salikhmetov
Use the PAGE_ALIGN() macro instead of manual alignment.
Improve readability of the loop, which traverses the process
memory regions. Make code more symmetric and possibly boost
performance on some RISC CPUs by moving variable assignments.

Signed-off-by: Anton Salikhmetov [EMAIL PROTECTED]
---
 mm/msync.c |   76 ---
 1 files changed, 36 insertions(+), 40 deletions(-)

diff --git a/mm/msync.c b/mm/msync.c
index 144a757..60efa36 100644
--- a/mm/msync.c
+++ b/mm/msync.c
@@ -1,85 +1,84 @@
 /*
- * linux/mm/msync.c
+ * The msync() system call.
  *
- * Copyright (C) 1994-1999  Linus Torvalds
+ * Copyright (C) 1994-1999 Linus Torvalds
+ * Copyright (C) 2008 Anton Salikhmetov [EMAIL PROTECTED]
  */
 
-/*
- * The msync() system call.
- */
+#include linux/file.h
 #include linux/fs.h
 #include linux/mm.h
 #include linux/mman.h
-#include linux/file.h
-#include linux/syscalls.h
 #include linux/sched.h
+#include linux/syscalls.h
 
 /*
  * MS_SYNC syncs the entire file - including mappings.
  *
  * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
- * Nor does it marks the relevant pages dirty (it used to up to 2.6.17).
+ * Nor does it mark the relevant pages dirty (it used to up to 2.6.17).
  * Now it doesn't do anything, since dirty pages are properly tracked.
  *
- * The application may now run fsync() to
- * write out the dirty pages and wait on the writeout and check the result.
- * Or the application may run fadvise(FADV_DONTNEED) against the fd to start
- * async writeout immediately.
+ * The application may now run fsync() to write out the dirty pages and
+ * wait on the writeout and check the result. Or the application may run
+ * fadvise(FADV_DONTNEED) against the fd to start async writeout immediately.
  * So by _not_ starting I/O in MS_ASYNC we provide complete flexibility to
  * applications.
  */
 asmlinkage long sys_msync(unsigned long start, size_t len, int flags)
 {
unsigned long end;
-   struct mm_struct *mm = current-mm;
+   int error, unmapped_error;
struct vm_area_struct *vma;
-   int unmapped_error = 0;
-   int error = -EINVAL;
+   struct mm_struct *mm;
 
+   error = -EINVAL;
if (flags  ~(MS_ASYNC | MS_INVALIDATE | MS_SYNC))
goto out;
if (start  ~PAGE_MASK)
goto out;
if ((flags  MS_ASYNC)  (flags  MS_SYNC))
goto out;
+
error = -ENOMEM;
-   len = (len + ~PAGE_MASK)  PAGE_MASK;
+   len = PAGE_ALIGN(len);
end = start + len;
if (end  start)
goto out;
+
error = 0;
+   unmapped_error = 0;
if (end == start)
goto out;
+
/*
 * If the interval [start,end) covers some unmapped address ranges,
 * just ignore them, but return -ENOMEM at the end.
 */
+   mm = current-mm;
down_read(mm-mmap_sem);
vma = find_vma(mm, start);
-   for (;;) {
+   do {
struct file *file;
 
-   /* Still start  end. */
error = -ENOMEM;
if (!vma)
-   goto out_unlock;
-   /* Here start  vma-vm_end. */
+   break;
if (start  vma-vm_start) {
start = vma-vm_start;
if (start = end)
-   goto out_unlock;
-   unmapped_error = -ENOMEM;
-   }
-   /* Here vma-vm_start = start  vma-vm_end. */
-   if ((flags  MS_INVALIDATE) 
-   (vma-vm_flags  VM_LOCKED)) {
-   error = -EBUSY;
-   goto out_unlock;
+   break;
+   unmapped_error = error;
}
-   file = vma-vm_file;
+
+   error = -EBUSY;
+   if ((flags  MS_INVALIDATE)  (vma-vm_flags  VM_LOCKED))
+   break;
+
+   error = 0;
start = vma-vm_end;
-   if ((flags  MS_SYNC)  file 
-   (vma-vm_flags  VM_SHARED)) {
+   file = vma-vm_file;
+   if (file  (vma-vm_flags  VM_SHARED)  (flags  MS_SYNC)) {
get_file(file);
up_read(mm-mmap_sem);
error = do_fsync(file, 0);
@@ -88,16 +87,13 @@ asmlinkage long sys_msync(unsigned long start, size_t len, 
int flags)
goto out;
down_read(mm-mmap_sem);
vma = find_vma(mm, start);
-   } else {
-   if (start = end) {
-   error = 0;
-   goto out_unlock;
-   }
-   vma = vma-vm_next;
+   continue;
}
-   }
-out_unlock

[PATCH -v8 4/4] The design document for memory-mapped file times update

2008-01-22 Thread Anton Salikhmetov
Add a document, which describes how the POSIX requirements on updating
memory-mapped file times are addressed in Linux.

Signed-off-by: Anton Salikhmetov [EMAIL PROTECTED]
---
 Documentation/vm/00-INDEX  |2 +
 Documentation/vm/msync.txt |  117 
 2 files changed, 119 insertions(+), 0 deletions(-)

diff --git a/Documentation/vm/00-INDEX b/Documentation/vm/00-INDEX
index 2131b00..2726c8d 100644
--- a/Documentation/vm/00-INDEX
+++ b/Documentation/vm/00-INDEX
@@ -6,6 +6,8 @@ hugetlbpage.txt
- a brief summary of hugetlbpage support in the Linux kernel.
 locking
- info on how locking and synchronization is done in the Linux vm code.
+msync.txt
+   - the design document for memory-mapped file times update
 numa
- information about NUMA specific code in the Linux vm.
 numa_memory_policy.txt
diff --git a/Documentation/vm/msync.txt b/Documentation/vm/msync.txt
new file mode 100644
index 000..571a766
--- /dev/null
+++ b/Documentation/vm/msync.txt
@@ -0,0 +1,117 @@
+
+   The msync() system call and memory-mapped file times
+
+   Copyright (C) 2008 Anton Salikhmetov
+
+The POSIX standard requires that any write reference to memory-mapped file
+data should result in updating the ctime and mtime for that file. Moreover,
+the standard mandates that updated file times should become visible to the
+world no later than at the next call to msync().
+
+Failure to meet this requirement creates difficulties for certain classes
+of important applications. For instance, database backup systems fail to
+pick up the files modified via the mmap() interface. Also, this is a
+security hole, which allows forging file data in such a manner that proving
+the fact that file data was modified is not possible.
+
+Briefly put, this requirement can be stated as follows:
+
+   once the file data has changed, the operating system
+   should acknowledge this fact by updating file metadata.
+
+This document describes how this POSIX requirement is addressed in Linux.
+
+1. Requirements
+
+1.1) the POSIX standard requires updating ctime and mtime not later
+than at the call to msync() with MS_SYNC or MS_ASYNC flags;
+
+1.2) in existing POSIX implementations, ctime and mtime
+get updated not later than at the call to fsync();
+
+1.3) in existing POSIX implementation, ctime and mtime
+get updated not later than at the call to sync(), the auto-update feature;
+
+1.4) the customers require and the common sense suggests that
+ctime and mtime should be updated not later than at the call to munmap()
+or exit(), the latter function implying an implicit call to munmap();
+
+1.5) the (1.1) item should be satisfied if the file is a block device
+special file;
+
+1.6) the (1.1) item should be satisfied for files residing on
+memory-backed filesystems such as tmpfs, too.
+
+The following operating systems were used as the reference platforms
+and are referred to as the existing implementations above:
+HP-UX B.11.31 and FreeBSD 6.2-RELEASE.
+
+2. Lazy update
+
+Many attempts before the current version implemented the lazy update approach
+to satisfying the requirements given above. Within the latter approach, ctime
+and mtime get updated at last moment allowable.
+
+Since we don't update the file times immediately, some Flag has to be
+used. When up, this Flag means that the file data was modified and
+the file times need to be updated as soon as possible.
+
+Any existing dirty flag which, when up, mean that a page has been written to,
+is not suitable for this purpose. Indeed, msync() called with MS_ASYNC
+would have to reset this dirty flag after updating ctime and mtime.
+The sys_msync() function itself is basically a no-op in the MS_ASYNC case.
+Thereby, the synchronization routines relying upon this dirty flag
+would lose data. Therefore, a new Flag has to be introduced.
+
+The (1.5) item coupled with (1.3) requirement leads to hard work with
+the block device inodes. Specifically, during writeback it is impossible to
+tell which block device file was originally mapped. Therefore, we need to
+traverse the list of active devices associated with the block device inode.
+This would lead to updating file times for block device files, which were not
+taking part in the data transfer.
+
+Also all versions prior to version 6 failed to correctly process ctime and
+mtime for files on the memory-backed filesystems such as tmpfs. So the (1.6)
+requirement was not satisfied.
+
+If a write reference has occurred between two consecutive calls to msync()
+with MS_ASYNC, the second call to the latter function should take into
+account the last write reference. The last write reference can not be caught
+if no pagefault occurs. Hence a pagefault needs to be forced. This can be done
+using two different approaches. The first one is to synchronize data even when
+msync() was called with MS_ASYNC. This is not acceptable because the current
+design of the sys_msync() routine forbids

Re: [PATCH -v7 2/2] Update ctime and mtime for memory-mapped files

2008-01-21 Thread Anton Salikhmetov
2008/1/22, Linus Torvalds <[EMAIL PROTECTED]>:
>
>
> On Tue, 22 Jan 2008, Anton Salikhmetov wrote:
> >
> >  /*
> > + * Scan the PTEs for pages belonging to the VMA and mark them read-only.
> > + * It will force a pagefault on the next write access.
> > + */
> > +static void vma_wrprotect(struct vm_area_struct *vma)
> > +{
> > + unsigned long addr;
> > +
> > + for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
> > + spinlock_t *ptl;
> > + pgd_t *pgd = pgd_offset(vma->vm_mm, addr);
> > + pud_t *pud = pud_offset(pgd, addr);
> > + pmd_t *pmd = pmd_offset(pud, addr);
> > + pte_t *pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, );
>
> This is extremely expensive over bigger areas, especially sparsely mapped
> ones (it does all the lookups for all four levels over and over and over
> again for eachg page).
>
> I think Peter Zijlstra posted a version that uses the regular kind of
> nested loop (with inline functions to keep the thing nice and clean),
> which gets rid of that.

Thanks for your feedback, Linus!

I will use Peter Zijlstra's version of such an operation in my next
patch series.

>
> [ The sad/funny part is that this is all how we *used* to do msync(), back
>   in the days: we're literally going back to the "pre-cleanup" logic. See
>   commit 204ec841fbea3e5138168edbc3a76d46747cc987: "mm: msync() cleanup"
>   for details ]
>
> Quite frankly, I really think you might be better off just doing a
>
> git revert 204ec841fbea3e5138168edbc3a76d46747cc987
>
> and working from there! I just checked, and it still reverts cleanly, and
> you'd end up with a nice code-base that (a) has gotten years of testing
> and (b) already has the looping-over-the-pagetables code.
>
> Linus
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH -v7 2/2] Update ctime and mtime for memory-mapped files

2008-01-21 Thread Anton Salikhmetov
2008/1/22, Anton Salikhmetov <[EMAIL PROTECTED]>:
> 2008/1/22, Jesper Juhl <[EMAIL PROTECTED]>:
> > Some very pedantic nitpicking below;
> >
> > On 22/01/2008, Anton Salikhmetov <[EMAIL PROTECTED]> wrote:
> > > http://bugzilla.kernel.org/show_bug.cgi?id=2645#c40
> > >
> > > Update file times at write references to memory-mapped files.
> > > Force file times update at the next write reference after
> > > calling the msync() system call with the MS_ASYNC flag.
> > >
> > > Signed-off-by: Anton Salikhmetov <[EMAIL PROTECTED]>
> > > ---
> > >  mm/memory.c |6 ++
> > >  mm/msync.c  |   57 
> > > -
> > >  2 files changed, 50 insertions(+), 13 deletions(-)
> > >
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 6dd1cd8..4b0144b 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -1670,6 +1670,9 @@ gotten:
> > >  unlock:
> > > pte_unmap_unlock(page_table, ptl);
> > > if (dirty_page) {
> > > +   if (vma->vm_file)
> > > +   file_update_time(vma->vm_file);
> > > +
> > > /*
> > >  * Yes, Virginia, this is actually required to prevent a 
> > > race
> > >  * with clear_page_dirty_for_io() from clearing the page 
> > > dirty
> > > @@ -2343,6 +2346,9 @@ out_unlocked:
> > > if (anon)
> > > page_cache_release(vmf.page);
> > > else if (dirty_page) {
> > > +   if (vma->vm_file)
> > > +       file_update_time(vma->vm_file);
> > > +
> > > set_page_dirty_balance(dirty_page, page_mkwrite);
> > > put_page(dirty_page);
> > > }
> > > diff --git a/mm/msync.c b/mm/msync.c
> > > index a4de868..394130d 100644
> > > --- a/mm/msync.c
> > > +++ b/mm/msync.c
> > > @@ -5,6 +5,7 @@
> > >   * Copyright (C) 2008 Anton Salikhmetov <[EMAIL PROTECTED]>
> > >   */
> > >
> > > +#include 
> > >  #include 
> > >  #include 
> > >  #include 
> > > @@ -13,11 +14,37 @@
> > >  #include 
> > >
> > >  /*
> > > + * Scan the PTEs for pages belonging to the VMA and mark them read-only.
> > > + * It will force a pagefault on the next write access.
> > > + */
> > > +static void vma_wrprotect(struct vm_area_struct *vma)
> > > +{
> > > +   unsigned long addr;
> > > +
> > > +   for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) 
> > > {
> >
> > I know it's not the common "Linux Kernel way", but 'addr' could be
> > made to have just 'for' scope here according to C99;
>
> I believe that the C89 style is more common for the Linux kernel, so
> I've used the out-of-scope variable declaration.
>
> >
> >for (unsigned long addr = vma->vm_start; addr < vma->vm_end;
> > addr += PAGE_SIZE) {

By the way, if we're talking "pedantic", then:

>>>

debian:/tmp$ cat c.c
void f()
{
   for (unsigned long i = 0; i < 10; i++)
   continue;
}
debian:/tmp$ gcc -c -pedantic c.c
c.c: In function 'f':
c.c:3: error: 'for' loop initial declaration used outside C99 mode
debian:/tmp$

<<<

No pun intended :)

> >
> >
> > > +   spinlock_t *ptl;
> > > +   pgd_t *pgd = pgd_offset(vma->vm_mm, addr);
> > > +   pud_t *pud = pud_offset(pgd, addr);
> > > +   pmd_t *pmd = pmd_offset(pud, addr);
> > > +   pte_t *pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, 
> > > );
> > > +
> > > +   if (pte_dirty(*pte) && pte_write(*pte)) {
> > > +   pte_t entry = ptep_clear_flush(vma, addr, pte);
> > > +
> > > +   entry = pte_wrprotect(entry);
> > > +   set_pte_at(vma->vm_mm, addr, pte, entry);
> > > +   }
> > > +   pte_unmap_unlock(pte, ptl);
> > > +   }
> > > +}
> > > +
> > > +/*
> > >   * MS_SYNC syncs the entire file - including mappings.
> > >   *
> > > - * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
> > > - * Nor does it mark the relevant pages dirty (it use

Re: [PATCH -v7 2/2] Update ctime and mtime for memory-mapped files

2008-01-21 Thread Anton Salikhmetov
2008/1/22, Jesper Juhl <[EMAIL PROTECTED]>:
> On 22/01/2008, Anton Salikhmetov <[EMAIL PROTECTED]> wrote:
> > 2008/1/22, Jesper Juhl <[EMAIL PROTECTED]>:
> > > Some very pedantic nitpicking below;
> > >
> > > On 22/01/2008, Anton Salikhmetov <[EMAIL PROTECTED]> wrote:
> ...
> > > > +   if (file && (vma->vm_flags & VM_SHARED)) {
> > > > +   if (flags & MS_ASYNC)
> > > > +   vma_wrprotect(vma);
> > > > +   if (flags & MS_SYNC) {
> > >
> > > "else if" ??
> >
> > The MS_ASYNC and MS_SYNC flags are mutually exclusive, that is why I
> > did not use the "else-if" here. Moreover, this function itself checks
> > that they never come together.
> >
>
> I would say that them being mutually exclusive would be a reason *for*
> using "else-if" here.

This check is performed by the sys_msync() function itself in its very
beginning.

We don't need to check it later.

>
> --
> Jesper Juhl <[EMAIL PROTECTED]>
> Don't top-post  http://www.catb.org/~esr/jargon/html/T/top-post.html
> Plain text mails only, please  http://www.expita.com/nomime.html
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH -v7 2/2] Update ctime and mtime for memory-mapped files

2008-01-21 Thread Anton Salikhmetov
2008/1/22, Jesper Juhl <[EMAIL PROTECTED]>:
> Some very pedantic nitpicking below;
>
> On 22/01/2008, Anton Salikhmetov <[EMAIL PROTECTED]> wrote:
> > http://bugzilla.kernel.org/show_bug.cgi?id=2645#c40
> >
> > Update file times at write references to memory-mapped files.
> > Force file times update at the next write reference after
> > calling the msync() system call with the MS_ASYNC flag.
> >
> > Signed-off-by: Anton Salikhmetov <[EMAIL PROTECTED]>
> > ---
> >  mm/memory.c |6 ++
> >  mm/msync.c  |   57 
> > -
> >  2 files changed, 50 insertions(+), 13 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 6dd1cd8..4b0144b 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -1670,6 +1670,9 @@ gotten:
> >  unlock:
> > pte_unmap_unlock(page_table, ptl);
> > if (dirty_page) {
> > +   if (vma->vm_file)
> > +   file_update_time(vma->vm_file);
> > +
> > /*
> >  * Yes, Virginia, this is actually required to prevent a 
> > race
> >  * with clear_page_dirty_for_io() from clearing the page 
> > dirty
> > @@ -2343,6 +2346,9 @@ out_unlocked:
> > if (anon)
> > page_cache_release(vmf.page);
> > else if (dirty_page) {
> > +   if (vma->vm_file)
> > +   file_update_time(vma->vm_file);
> > +
> > set_page_dirty_balance(dirty_page, page_mkwrite);
> > put_page(dirty_page);
> > }
> > diff --git a/mm/msync.c b/mm/msync.c
> > index a4de868..394130d 100644
> > --- a/mm/msync.c
> > +++ b/mm/msync.c
> > @@ -5,6 +5,7 @@
> >   * Copyright (C) 2008 Anton Salikhmetov <[EMAIL PROTECTED]>
> >   */
> >
> > +#include 
> >  #include 
> >  #include 
> >  #include 
> > @@ -13,11 +14,37 @@
> >  #include 
> >
> >  /*
> > + * Scan the PTEs for pages belonging to the VMA and mark them read-only.
> > + * It will force a pagefault on the next write access.
> > + */
> > +static void vma_wrprotect(struct vm_area_struct *vma)
> > +{
> > +   unsigned long addr;
> > +
> > +   for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
>
> I know it's not the common "Linux Kernel way", but 'addr' could be
> made to have just 'for' scope here according to C99;

I believe that the C89 style is more common for the Linux kernel, so
I've used the out-of-scope variable declaration.

>
>for (unsigned long addr = vma->vm_start; addr < vma->vm_end;
> addr += PAGE_SIZE) {
>
>
> > +   spinlock_t *ptl;
> > +   pgd_t *pgd = pgd_offset(vma->vm_mm, addr);
> > +   pud_t *pud = pud_offset(pgd, addr);
> > +   pmd_t *pmd = pmd_offset(pud, addr);
> > +   pte_t *pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, 
> > );
> > +
> > +   if (pte_dirty(*pte) && pte_write(*pte)) {
> > +   pte_t entry = ptep_clear_flush(vma, addr, pte);
> > +
> > +   entry = pte_wrprotect(entry);
> > +   set_pte_at(vma->vm_mm, addr, pte, entry);
> > +   }
> > +   pte_unmap_unlock(pte, ptl);
> > +   }
> > +}
> > +
> > +/*
> >   * MS_SYNC syncs the entire file - including mappings.
> >   *
> > - * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
> > - * Nor does it mark the relevant pages dirty (it used to up to 2.6.17).
> > - * Now it doesn't do anything, since dirty pages are properly tracked.
>
> I think keeping some version of the "up to ..." comments makes sense.
> It documents that we previously had different behaviour.

Earlier I had a request to remove any "changelog-style" comments from the code.

>
> > + * MS_ASYNC does not start I/O. Instead, it marks the relevant pages
> > + * read-only by calling vma_wrprotect(). This is needed to catch the next
> > + * write reference to the mapped region and update the file times
> > + * accordingly.
> >   *
> >   * The application may now run fsync() to write out the dirty pages and
> >   * wait on the writeout and check the result. Or the application may run
> > @@ -77,16 +104,20 @@ asmlinkage long sys_msync(unsigned long start, size_t 
> > len, int flags)
> >

Re: [PATCH -v7 0/2] Fixing the issue with memory-mapped file times

2008-01-21 Thread Anton Salikhmetov
2008/1/22, Jesper Juhl <[EMAIL PROTECTED]>:
> On 22/01/2008, Anton Salikhmetov <[EMAIL PROTECTED]> wrote:
> > This is the seventh version of my solution for the bug #2645:
> >
> > http://bugzilla.kernel.org/show_bug.cgi?id=2645
> >
> > Since the previous version, the following has changed: based on
> > Linus' comment, SMP-safe PTE update implemented.
> >
> > Discussions, which followed my past submissions, showed that it was
> > tempting to approach this problem using very different assumptions.
> > However, many such designs have proved to be incomplete or inefficient.
> >
> > Taking into account the obvious complexity of this problem, I prepared a
> > design document, the purpose of which is twofold. First, it summarizes
> > previous attempts to resolve the ctime/mtime issue. Second, it attempts
> > to prove that what the patches do is necessary and sufficient for mtime
> > and ctime update provided that we start from a certain sane set of
> > requirements. The design document is available via the following link:
> >
> > http://bugzilla.kernel.org/show_bug.cgi?id=2645#c40
> >
> > For the seventh version, comprehensive performance testing was performed.
> > The results of performance tests, including numbers, are available here:
> >
> > http://bugzilla.kernel.org/show_bug.cgi?id=2645#c43
> >
>
> Hi Anton,
>
> I applied your patches here and as far as my own test programs go,
> these patches solve the previously observed problems I saw with mtime
> not getting updated.
>
> Thank you very much for so persistently working on these long standing issues.

Thank you very much for testing these patches!

>
> --
> Jesper Juhl <[EMAIL PROTECTED]>
> Don't top-post  http://www.catb.org/~esr/jargon/html/T/top-post.html
> Plain text mails only, please  http://www.expita.com/nomime.html
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH -v7 0/2] Fixing the issue with memory-mapped file times

2008-01-21 Thread Anton Salikhmetov
This is the seventh version of my solution for the bug #2645:

http://bugzilla.kernel.org/show_bug.cgi?id=2645

Since the previous version, the following has changed: based on
Linus' comment, SMP-safe PTE update implemented.

Discussions, which followed my past submissions, showed that it was
tempting to approach this problem using very different assumptions.
However, many such designs have proved to be incomplete or inefficient.

Taking into account the obvious complexity of this problem, I prepared a
design document, the purpose of which is twofold. First, it summarizes
previous attempts to resolve the ctime/mtime issue. Second, it attempts
to prove that what the patches do is necessary and sufficient for mtime
and ctime update provided that we start from a certain sane set of
requirements. The design document is available via the following link:

http://bugzilla.kernel.org/show_bug.cgi?id=2645#c40

For the seventh version, comprehensive performance testing was performed.
The results of performance tests, including numbers, are available here:

http://bugzilla.kernel.org/show_bug.cgi?id=2645#c43
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH -v7 1/2] Massive code cleanup of sys_msync()

2008-01-21 Thread Anton Salikhmetov
Use the PAGE_ALIGN() macro instead of "manual" alignment.
Improve readability of the loop, which traverses the process
memory regions. Make code more symmetric and possibly boost
performance on some RISC CPUs by moving variable assignments.

Signed-off-by: Anton Salikhmetov <[EMAIL PROTECTED]>
---
 mm/msync.c |   77 
 1 files changed, 36 insertions(+), 41 deletions(-)

diff --git a/mm/msync.c b/mm/msync.c
index 144a757..a4de868 100644
--- a/mm/msync.c
+++ b/mm/msync.c
@@ -1,85 +1,83 @@
 /*
- * linux/mm/msync.c
+ * The msync() system call.
  *
- * Copyright (C) 1994-1999  Linus Torvalds
+ * Copyright (C) 1994-1999 Linus Torvalds
+ * Copyright (C) 2008 Anton Salikhmetov <[EMAIL PROTECTED]>
  */
 
-/*
- * The msync() system call.
- */
+#include 
 #include 
 #include 
 #include 
-#include 
-#include 
 #include 
+#include 
 
 /*
  * MS_SYNC syncs the entire file - including mappings.
  *
  * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
- * Nor does it marks the relevant pages dirty (it used to up to 2.6.17).
+ * Nor does it mark the relevant pages dirty (it used to up to 2.6.17).
  * Now it doesn't do anything, since dirty pages are properly tracked.
  *
- * The application may now run fsync() to
- * write out the dirty pages and wait on the writeout and check the result.
- * Or the application may run fadvise(FADV_DONTNEED) against the fd to start
- * async writeout immediately.
+ * The application may now run fsync() to write out the dirty pages and
+ * wait on the writeout and check the result. Or the application may run
+ * fadvise(FADV_DONTNEED) against the fd to start async writeout immediately.
  * So by _not_ starting I/O in MS_ASYNC we provide complete flexibility to
  * applications.
  */
 asmlinkage long sys_msync(unsigned long start, size_t len, int flags)
 {
unsigned long end;
-   struct mm_struct *mm = current->mm;
+   int error, unmapped_error;
struct vm_area_struct *vma;
-   int unmapped_error = 0;
-   int error = -EINVAL;
+   struct mm_struct *mm;
 
+   error = -EINVAL;
if (flags & ~(MS_ASYNC | MS_INVALIDATE | MS_SYNC))
goto out;
if (start & ~PAGE_MASK)
goto out;
if ((flags & MS_ASYNC) && (flags & MS_SYNC))
goto out;
+
error = -ENOMEM;
-   len = (len + ~PAGE_MASK) & PAGE_MASK;
+   len = PAGE_ALIGN(len);
end = start + len;
if (end < start)
goto out;
-   error = 0;
+
+   error = unmapped_error = 0;
if (end == start)
goto out;
+
/*
 * If the interval [start,end) covers some unmapped address ranges,
 * just ignore them, but return -ENOMEM at the end.
 */
+   mm = current->mm;
down_read(>mmap_sem);
vma = find_vma(mm, start);
-   for (;;) {
+   do {
struct file *file;
 
-   /* Still start < end. */
error = -ENOMEM;
if (!vma)
-   goto out_unlock;
-   /* Here start < vma->vm_end. */
+   break;
if (start < vma->vm_start) {
start = vma->vm_start;
if (start >= end)
-   goto out_unlock;
-   unmapped_error = -ENOMEM;
-   }
-   /* Here vma->vm_start <= start < vma->vm_end. */
-   if ((flags & MS_INVALIDATE) &&
-   (vma->vm_flags & VM_LOCKED)) {
-   error = -EBUSY;
-   goto out_unlock;
+   break;
+   unmapped_error = error;
}
-   file = vma->vm_file;
+
+   error = -EBUSY;
+   if ((flags & MS_INVALIDATE) && (vma->vm_flags & VM_LOCKED))
+   break;
+
+   error = 0;
start = vma->vm_end;
-   if ((flags & MS_SYNC) && file &&
-   (vma->vm_flags & VM_SHARED)) {
+   file = vma->vm_file;
+   if (file && (vma->vm_flags & VM_SHARED) && (flags & MS_SYNC)) {
get_file(file);
up_read(>mmap_sem);
error = do_fsync(file, 0);
@@ -88,16 +86,13 @@ asmlinkage long sys_msync(unsigned long start, size_t len, 
int flags)
goto out;
down_read(>mmap_sem);
vma = find_vma(mm, start);
-   } else {
-   if (start >= end) {
-   error = 0;
-   goto out_u

[PATCH -v7 2/2] Update ctime and mtime for memory-mapped files

2008-01-21 Thread Anton Salikhmetov
http://bugzilla.kernel.org/show_bug.cgi?id=2645#c40

Update file times at write references to memory-mapped files.
Force file times update at the next write reference after
calling the msync() system call with the MS_ASYNC flag.

Signed-off-by: Anton Salikhmetov <[EMAIL PROTECTED]>
---
 mm/memory.c |6 ++
 mm/msync.c  |   57 -
 2 files changed, 50 insertions(+), 13 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 6dd1cd8..4b0144b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1670,6 +1670,9 @@ gotten:
 unlock:
pte_unmap_unlock(page_table, ptl);
if (dirty_page) {
+   if (vma->vm_file)
+   file_update_time(vma->vm_file);
+
/*
 * Yes, Virginia, this is actually required to prevent a race
 * with clear_page_dirty_for_io() from clearing the page dirty
@@ -2343,6 +2346,9 @@ out_unlocked:
if (anon)
page_cache_release(vmf.page);
else if (dirty_page) {
+   if (vma->vm_file)
+   file_update_time(vma->vm_file);
+
set_page_dirty_balance(dirty_page, page_mkwrite);
put_page(dirty_page);
}
diff --git a/mm/msync.c b/mm/msync.c
index a4de868..394130d 100644
--- a/mm/msync.c
+++ b/mm/msync.c
@@ -5,6 +5,7 @@
  * Copyright (C) 2008 Anton Salikhmetov <[EMAIL PROTECTED]>
  */
 
+#include 
 #include 
 #include 
 #include 
@@ -13,11 +14,37 @@
 #include 
 
 /*
+ * Scan the PTEs for pages belonging to the VMA and mark them read-only.
+ * It will force a pagefault on the next write access.
+ */
+static void vma_wrprotect(struct vm_area_struct *vma)
+{
+   unsigned long addr;
+
+   for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
+   spinlock_t *ptl;
+   pgd_t *pgd = pgd_offset(vma->vm_mm, addr);
+   pud_t *pud = pud_offset(pgd, addr);
+   pmd_t *pmd = pmd_offset(pud, addr);
+   pte_t *pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, );
+
+   if (pte_dirty(*pte) && pte_write(*pte)) {
+   pte_t entry = ptep_clear_flush(vma, addr, pte);
+
+   entry = pte_wrprotect(entry);
+   set_pte_at(vma->vm_mm, addr, pte, entry);
+   }
+   pte_unmap_unlock(pte, ptl);
+   }
+}
+
+/*
  * MS_SYNC syncs the entire file - including mappings.
  *
- * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
- * Nor does it mark the relevant pages dirty (it used to up to 2.6.17).
- * Now it doesn't do anything, since dirty pages are properly tracked.
+ * MS_ASYNC does not start I/O. Instead, it marks the relevant pages
+ * read-only by calling vma_wrprotect(). This is needed to catch the next
+ * write reference to the mapped region and update the file times
+ * accordingly.
  *
  * The application may now run fsync() to write out the dirty pages and
  * wait on the writeout and check the result. Or the application may run
@@ -77,16 +104,20 @@ asmlinkage long sys_msync(unsigned long start, size_t len, 
int flags)
error = 0;
start = vma->vm_end;
file = vma->vm_file;
-   if (file && (vma->vm_flags & VM_SHARED) && (flags & MS_SYNC)) {
-   get_file(file);
-   up_read(>mmap_sem);
-   error = do_fsync(file, 0);
-   fput(file);
-   if (error || start >= end)
-   goto out;
-   down_read(>mmap_sem);
-   vma = find_vma(mm, start);
-   continue;
+   if (file && (vma->vm_flags & VM_SHARED)) {
+   if (flags & MS_ASYNC)
+   vma_wrprotect(vma);
+   if (flags & MS_SYNC) {
+   get_file(file);
+   up_read(>mmap_sem);
+   error = do_fsync(file, 0);
+   fput(file);
+   if (error || start >= end)
+   goto out;
+   down_read(>mmap_sem);
+   vma = find_vma(mm, start);
+   continue;
+   }
}
 
vma = vma->vm_next;
-- 
1.4.4.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008-01-21 Thread Anton Salikhmetov
2008/1/21, Peter Staubach <[EMAIL PROTECTED]>:
> Linus Torvalds wrote:
> > On Fri, 18 Jan 2008, Ingo Oeser wrote:
> >
> >> Can we get "if the write to the page hits the disk, the mtime has hit the 
> >> disk
> >> already no less than SOME_GRANULARITY before"?
> >>
> >> That is very important for computer forensics. Esp. in saving your ass!
> >>
> >> Ok, now back again to making that fast :-)
> >>
> >
> > I certainly don't mind it if we have some tighter guarantees, but what I'd
> > want is:
> >
> >  - keep it simple. Let's face it, Linux has never ever given those
> >guarantees before, and it's not is if anybody has really cared. Even
> >now, the issue seems to be more about paper standards conformance than
> >anything else.
> >
> >
>
> I have been working on getting something supported here for
> because I have some very large Wall Street customers who do
> care about getting the mtime updated because their backups
> are getting corrupted.  They are incomplete because although
> their applications update files, they don't get backed up
> because the mtime never changes.
>
> >  - I get worried about people playing around with the dirty bit in
> >particular. We have had some really rather nasty bugs here. Most of
> >which are totally impossible to trigger under normal loads (for
> >example the old random-access utorrent writable mmap issue from about
> >a year ago).
> >
> > So these two issues - the big red danger signs flashing in my brain,
> > coupled with the fact that no application has apparently ever really
> > noticed in the last 15 years - just makes it a case where I'd like each
> > step of the way to be obvious and simple and no larger than really
> > absolutely necessary.
>
> Simple is good.  However, too simple is not good.  I would suggest
> that we implement file time updates which make sense and if they
> happen to follow POSIX, then nifty, otherwise, oh well.

Thank you very much for your support, Peter!

I'm going to submit the design document, the next version of the patch
series, and the performance tests results soon.

>
> Thanx...
>
>ps
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008-01-21 Thread Anton Salikhmetov
2008/1/21, Peter Staubach [EMAIL PROTECTED]:
 Linus Torvalds wrote:
  On Fri, 18 Jan 2008, Ingo Oeser wrote:
 
  Can we get if the write to the page hits the disk, the mtime has hit the 
  disk
  already no less than SOME_GRANULARITY before?
 
  That is very important for computer forensics. Esp. in saving your ass!
 
  Ok, now back again to making that fast :-)
 
 
  I certainly don't mind it if we have some tighter guarantees, but what I'd
  want is:
 
   - keep it simple. Let's face it, Linux has never ever given those
 guarantees before, and it's not is if anybody has really cared. Even
 now, the issue seems to be more about paper standards conformance than
 anything else.
 
 

 I have been working on getting something supported here for
 because I have some very large Wall Street customers who do
 care about getting the mtime updated because their backups
 are getting corrupted.  They are incomplete because although
 their applications update files, they don't get backed up
 because the mtime never changes.

   - I get worried about people playing around with the dirty bit in
 particular. We have had some really rather nasty bugs here. Most of
 which are totally impossible to trigger under normal loads (for
 example the old random-access utorrent writable mmap issue from about
 a year ago).
 
  So these two issues - the big red danger signs flashing in my brain,
  coupled with the fact that no application has apparently ever really
  noticed in the last 15 years - just makes it a case where I'd like each
  step of the way to be obvious and simple and no larger than really
  absolutely necessary.

 Simple is good.  However, too simple is not good.  I would suggest
 that we implement file time updates which make sense and if they
 happen to follow POSIX, then nifty, otherwise, oh well.

Thank you very much for your support, Peter!

I'm going to submit the design document, the next version of the patch
series, and the performance tests results soon.


 Thanx...

ps

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH -v7 0/2] Fixing the issue with memory-mapped file times

2008-01-21 Thread Anton Salikhmetov
This is the seventh version of my solution for the bug #2645:

http://bugzilla.kernel.org/show_bug.cgi?id=2645

Since the previous version, the following has changed: based on
Linus' comment, SMP-safe PTE update implemented.

Discussions, which followed my past submissions, showed that it was
tempting to approach this problem using very different assumptions.
However, many such designs have proved to be incomplete or inefficient.

Taking into account the obvious complexity of this problem, I prepared a
design document, the purpose of which is twofold. First, it summarizes
previous attempts to resolve the ctime/mtime issue. Second, it attempts
to prove that what the patches do is necessary and sufficient for mtime
and ctime update provided that we start from a certain sane set of
requirements. The design document is available via the following link:

http://bugzilla.kernel.org/show_bug.cgi?id=2645#c40

For the seventh version, comprehensive performance testing was performed.
The results of performance tests, including numbers, are available here:

http://bugzilla.kernel.org/show_bug.cgi?id=2645#c43
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH -v7 1/2] Massive code cleanup of sys_msync()

2008-01-21 Thread Anton Salikhmetov
Use the PAGE_ALIGN() macro instead of manual alignment.
Improve readability of the loop, which traverses the process
memory regions. Make code more symmetric and possibly boost
performance on some RISC CPUs by moving variable assignments.

Signed-off-by: Anton Salikhmetov [EMAIL PROTECTED]
---
 mm/msync.c |   77 
 1 files changed, 36 insertions(+), 41 deletions(-)

diff --git a/mm/msync.c b/mm/msync.c
index 144a757..a4de868 100644
--- a/mm/msync.c
+++ b/mm/msync.c
@@ -1,85 +1,83 @@
 /*
- * linux/mm/msync.c
+ * The msync() system call.
  *
- * Copyright (C) 1994-1999  Linus Torvalds
+ * Copyright (C) 1994-1999 Linus Torvalds
+ * Copyright (C) 2008 Anton Salikhmetov [EMAIL PROTECTED]
  */
 
-/*
- * The msync() system call.
- */
+#include linux/file.h
 #include linux/fs.h
 #include linux/mm.h
 #include linux/mman.h
-#include linux/file.h
-#include linux/syscalls.h
 #include linux/sched.h
+#include linux/syscalls.h
 
 /*
  * MS_SYNC syncs the entire file - including mappings.
  *
  * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
- * Nor does it marks the relevant pages dirty (it used to up to 2.6.17).
+ * Nor does it mark the relevant pages dirty (it used to up to 2.6.17).
  * Now it doesn't do anything, since dirty pages are properly tracked.
  *
- * The application may now run fsync() to
- * write out the dirty pages and wait on the writeout and check the result.
- * Or the application may run fadvise(FADV_DONTNEED) against the fd to start
- * async writeout immediately.
+ * The application may now run fsync() to write out the dirty pages and
+ * wait on the writeout and check the result. Or the application may run
+ * fadvise(FADV_DONTNEED) against the fd to start async writeout immediately.
  * So by _not_ starting I/O in MS_ASYNC we provide complete flexibility to
  * applications.
  */
 asmlinkage long sys_msync(unsigned long start, size_t len, int flags)
 {
unsigned long end;
-   struct mm_struct *mm = current-mm;
+   int error, unmapped_error;
struct vm_area_struct *vma;
-   int unmapped_error = 0;
-   int error = -EINVAL;
+   struct mm_struct *mm;
 
+   error = -EINVAL;
if (flags  ~(MS_ASYNC | MS_INVALIDATE | MS_SYNC))
goto out;
if (start  ~PAGE_MASK)
goto out;
if ((flags  MS_ASYNC)  (flags  MS_SYNC))
goto out;
+
error = -ENOMEM;
-   len = (len + ~PAGE_MASK)  PAGE_MASK;
+   len = PAGE_ALIGN(len);
end = start + len;
if (end  start)
goto out;
-   error = 0;
+
+   error = unmapped_error = 0;
if (end == start)
goto out;
+
/*
 * If the interval [start,end) covers some unmapped address ranges,
 * just ignore them, but return -ENOMEM at the end.
 */
+   mm = current-mm;
down_read(mm-mmap_sem);
vma = find_vma(mm, start);
-   for (;;) {
+   do {
struct file *file;
 
-   /* Still start  end. */
error = -ENOMEM;
if (!vma)
-   goto out_unlock;
-   /* Here start  vma-vm_end. */
+   break;
if (start  vma-vm_start) {
start = vma-vm_start;
if (start = end)
-   goto out_unlock;
-   unmapped_error = -ENOMEM;
-   }
-   /* Here vma-vm_start = start  vma-vm_end. */
-   if ((flags  MS_INVALIDATE) 
-   (vma-vm_flags  VM_LOCKED)) {
-   error = -EBUSY;
-   goto out_unlock;
+   break;
+   unmapped_error = error;
}
-   file = vma-vm_file;
+
+   error = -EBUSY;
+   if ((flags  MS_INVALIDATE)  (vma-vm_flags  VM_LOCKED))
+   break;
+
+   error = 0;
start = vma-vm_end;
-   if ((flags  MS_SYNC)  file 
-   (vma-vm_flags  VM_SHARED)) {
+   file = vma-vm_file;
+   if (file  (vma-vm_flags  VM_SHARED)  (flags  MS_SYNC)) {
get_file(file);
up_read(mm-mmap_sem);
error = do_fsync(file, 0);
@@ -88,16 +86,13 @@ asmlinkage long sys_msync(unsigned long start, size_t len, 
int flags)
goto out;
down_read(mm-mmap_sem);
vma = find_vma(mm, start);
-   } else {
-   if (start = end) {
-   error = 0;
-   goto out_unlock;
-   }
-   vma = vma-vm_next;
+   continue;
}
-   }
-out_unlock

[PATCH -v7 2/2] Update ctime and mtime for memory-mapped files

2008-01-21 Thread Anton Salikhmetov
http://bugzilla.kernel.org/show_bug.cgi?id=2645#c40

Update file times at write references to memory-mapped files.
Force file times update at the next write reference after
calling the msync() system call with the MS_ASYNC flag.

Signed-off-by: Anton Salikhmetov [EMAIL PROTECTED]
---
 mm/memory.c |6 ++
 mm/msync.c  |   57 -
 2 files changed, 50 insertions(+), 13 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 6dd1cd8..4b0144b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1670,6 +1670,9 @@ gotten:
 unlock:
pte_unmap_unlock(page_table, ptl);
if (dirty_page) {
+   if (vma-vm_file)
+   file_update_time(vma-vm_file);
+
/*
 * Yes, Virginia, this is actually required to prevent a race
 * with clear_page_dirty_for_io() from clearing the page dirty
@@ -2343,6 +2346,9 @@ out_unlocked:
if (anon)
page_cache_release(vmf.page);
else if (dirty_page) {
+   if (vma-vm_file)
+   file_update_time(vma-vm_file);
+
set_page_dirty_balance(dirty_page, page_mkwrite);
put_page(dirty_page);
}
diff --git a/mm/msync.c b/mm/msync.c
index a4de868..394130d 100644
--- a/mm/msync.c
+++ b/mm/msync.c
@@ -5,6 +5,7 @@
  * Copyright (C) 2008 Anton Salikhmetov [EMAIL PROTECTED]
  */
 
+#include asm/tlbflush.h
 #include linux/file.h
 #include linux/fs.h
 #include linux/mm.h
@@ -13,11 +14,37 @@
 #include linux/syscalls.h
 
 /*
+ * Scan the PTEs for pages belonging to the VMA and mark them read-only.
+ * It will force a pagefault on the next write access.
+ */
+static void vma_wrprotect(struct vm_area_struct *vma)
+{
+   unsigned long addr;
+
+   for (addr = vma-vm_start; addr  vma-vm_end; addr += PAGE_SIZE) {
+   spinlock_t *ptl;
+   pgd_t *pgd = pgd_offset(vma-vm_mm, addr);
+   pud_t *pud = pud_offset(pgd, addr);
+   pmd_t *pmd = pmd_offset(pud, addr);
+   pte_t *pte = pte_offset_map_lock(vma-vm_mm, pmd, addr, ptl);
+
+   if (pte_dirty(*pte)  pte_write(*pte)) {
+   pte_t entry = ptep_clear_flush(vma, addr, pte);
+
+   entry = pte_wrprotect(entry);
+   set_pte_at(vma-vm_mm, addr, pte, entry);
+   }
+   pte_unmap_unlock(pte, ptl);
+   }
+}
+
+/*
  * MS_SYNC syncs the entire file - including mappings.
  *
- * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
- * Nor does it mark the relevant pages dirty (it used to up to 2.6.17).
- * Now it doesn't do anything, since dirty pages are properly tracked.
+ * MS_ASYNC does not start I/O. Instead, it marks the relevant pages
+ * read-only by calling vma_wrprotect(). This is needed to catch the next
+ * write reference to the mapped region and update the file times
+ * accordingly.
  *
  * The application may now run fsync() to write out the dirty pages and
  * wait on the writeout and check the result. Or the application may run
@@ -77,16 +104,20 @@ asmlinkage long sys_msync(unsigned long start, size_t len, 
int flags)
error = 0;
start = vma-vm_end;
file = vma-vm_file;
-   if (file  (vma-vm_flags  VM_SHARED)  (flags  MS_SYNC)) {
-   get_file(file);
-   up_read(mm-mmap_sem);
-   error = do_fsync(file, 0);
-   fput(file);
-   if (error || start = end)
-   goto out;
-   down_read(mm-mmap_sem);
-   vma = find_vma(mm, start);
-   continue;
+   if (file  (vma-vm_flags  VM_SHARED)) {
+   if (flags  MS_ASYNC)
+   vma_wrprotect(vma);
+   if (flags  MS_SYNC) {
+   get_file(file);
+   up_read(mm-mmap_sem);
+   error = do_fsync(file, 0);
+   fput(file);
+   if (error || start = end)
+   goto out;
+   down_read(mm-mmap_sem);
+   vma = find_vma(mm, start);
+   continue;
+   }
}
 
vma = vma-vm_next;
-- 
1.4.4.4

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH -v7 0/2] Fixing the issue with memory-mapped file times

2008-01-21 Thread Anton Salikhmetov
2008/1/22, Jesper Juhl [EMAIL PROTECTED]:
 On 22/01/2008, Anton Salikhmetov [EMAIL PROTECTED] wrote:
  This is the seventh version of my solution for the bug #2645:
 
  http://bugzilla.kernel.org/show_bug.cgi?id=2645
 
  Since the previous version, the following has changed: based on
  Linus' comment, SMP-safe PTE update implemented.
 
  Discussions, which followed my past submissions, showed that it was
  tempting to approach this problem using very different assumptions.
  However, many such designs have proved to be incomplete or inefficient.
 
  Taking into account the obvious complexity of this problem, I prepared a
  design document, the purpose of which is twofold. First, it summarizes
  previous attempts to resolve the ctime/mtime issue. Second, it attempts
  to prove that what the patches do is necessary and sufficient for mtime
  and ctime update provided that we start from a certain sane set of
  requirements. The design document is available via the following link:
 
  http://bugzilla.kernel.org/show_bug.cgi?id=2645#c40
 
  For the seventh version, comprehensive performance testing was performed.
  The results of performance tests, including numbers, are available here:
 
  http://bugzilla.kernel.org/show_bug.cgi?id=2645#c43
 

 Hi Anton,

 I applied your patches here and as far as my own test programs go,
 these patches solve the previously observed problems I saw with mtime
 not getting updated.

 Thank you very much for so persistently working on these long standing issues.

Thank you very much for testing these patches!


 --
 Jesper Juhl [EMAIL PROTECTED]
 Don't top-post  http://www.catb.org/~esr/jargon/html/T/top-post.html
 Plain text mails only, please  http://www.expita.com/nomime.html

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH -v7 2/2] Update ctime and mtime for memory-mapped files

2008-01-21 Thread Anton Salikhmetov
2008/1/22, Jesper Juhl [EMAIL PROTECTED]:
 Some very pedantic nitpicking below;

 On 22/01/2008, Anton Salikhmetov [EMAIL PROTECTED] wrote:
  http://bugzilla.kernel.org/show_bug.cgi?id=2645#c40
 
  Update file times at write references to memory-mapped files.
  Force file times update at the next write reference after
  calling the msync() system call with the MS_ASYNC flag.
 
  Signed-off-by: Anton Salikhmetov [EMAIL PROTECTED]
  ---
   mm/memory.c |6 ++
   mm/msync.c  |   57 
  -
   2 files changed, 50 insertions(+), 13 deletions(-)
 
  diff --git a/mm/memory.c b/mm/memory.c
  index 6dd1cd8..4b0144b 100644
  --- a/mm/memory.c
  +++ b/mm/memory.c
  @@ -1670,6 +1670,9 @@ gotten:
   unlock:
  pte_unmap_unlock(page_table, ptl);
  if (dirty_page) {
  +   if (vma-vm_file)
  +   file_update_time(vma-vm_file);
  +
  /*
   * Yes, Virginia, this is actually required to prevent a 
  race
   * with clear_page_dirty_for_io() from clearing the page 
  dirty
  @@ -2343,6 +2346,9 @@ out_unlocked:
  if (anon)
  page_cache_release(vmf.page);
  else if (dirty_page) {
  +   if (vma-vm_file)
  +   file_update_time(vma-vm_file);
  +
  set_page_dirty_balance(dirty_page, page_mkwrite);
  put_page(dirty_page);
  }
  diff --git a/mm/msync.c b/mm/msync.c
  index a4de868..394130d 100644
  --- a/mm/msync.c
  +++ b/mm/msync.c
  @@ -5,6 +5,7 @@
* Copyright (C) 2008 Anton Salikhmetov [EMAIL PROTECTED]
*/
 
  +#include asm/tlbflush.h
   #include linux/file.h
   #include linux/fs.h
   #include linux/mm.h
  @@ -13,11 +14,37 @@
   #include linux/syscalls.h
 
   /*
  + * Scan the PTEs for pages belonging to the VMA and mark them read-only.
  + * It will force a pagefault on the next write access.
  + */
  +static void vma_wrprotect(struct vm_area_struct *vma)
  +{
  +   unsigned long addr;
  +
  +   for (addr = vma-vm_start; addr  vma-vm_end; addr += PAGE_SIZE) {

 I know it's not the common Linux Kernel way, but 'addr' could be
 made to have just 'for' scope here according to C99;

I believe that the C89 style is more common for the Linux kernel, so
I've used the out-of-scope variable declaration.


for (unsigned long addr = vma-vm_start; addr  vma-vm_end;
 addr += PAGE_SIZE) {


  +   spinlock_t *ptl;
  +   pgd_t *pgd = pgd_offset(vma-vm_mm, addr);
  +   pud_t *pud = pud_offset(pgd, addr);
  +   pmd_t *pmd = pmd_offset(pud, addr);
  +   pte_t *pte = pte_offset_map_lock(vma-vm_mm, pmd, addr, 
  ptl);
  +
  +   if (pte_dirty(*pte)  pte_write(*pte)) {
  +   pte_t entry = ptep_clear_flush(vma, addr, pte);
  +
  +   entry = pte_wrprotect(entry);
  +   set_pte_at(vma-vm_mm, addr, pte, entry);
  +   }
  +   pte_unmap_unlock(pte, ptl);
  +   }
  +}
  +
  +/*
* MS_SYNC syncs the entire file - including mappings.
*
  - * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
  - * Nor does it mark the relevant pages dirty (it used to up to 2.6.17).
  - * Now it doesn't do anything, since dirty pages are properly tracked.

 I think keeping some version of the up to ... comments makes sense.
 It documents that we previously had different behaviour.

Earlier I had a request to remove any changelog-style comments from the code.


  + * MS_ASYNC does not start I/O. Instead, it marks the relevant pages
  + * read-only by calling vma_wrprotect(). This is needed to catch the next
  + * write reference to the mapped region and update the file times
  + * accordingly.
*
* The application may now run fsync() to write out the dirty pages and
* wait on the writeout and check the result. Or the application may run
  @@ -77,16 +104,20 @@ asmlinkage long sys_msync(unsigned long start, size_t 
  len, int flags)
  error = 0;
  start = vma-vm_end;
  file = vma-vm_file;
  -   if (file  (vma-vm_flags  VM_SHARED)  (flags  
  MS_SYNC)) {
  -   get_file(file);
  -   up_read(mm-mmap_sem);
  -   error = do_fsync(file, 0);
  -   fput(file);
  -   if (error || start = end)
  -   goto out;
  -   down_read(mm-mmap_sem);
  -   vma = find_vma(mm, start);
  -   continue;
  +   if (file  (vma-vm_flags  VM_SHARED)) {
  +   if (flags  MS_ASYNC)
  +   vma_wrprotect(vma);
  +   if (flags  MS_SYNC) {

 else if ??

The MS_ASYNC and MS_SYNC flags are mutually exclusive, that is why I
did not use the else

Re: [PATCH -v7 2/2] Update ctime and mtime for memory-mapped files

2008-01-21 Thread Anton Salikhmetov
2008/1/22, Jesper Juhl [EMAIL PROTECTED]:
 On 22/01/2008, Anton Salikhmetov [EMAIL PROTECTED] wrote:
  2008/1/22, Jesper Juhl [EMAIL PROTECTED]:
   Some very pedantic nitpicking below;
  
   On 22/01/2008, Anton Salikhmetov [EMAIL PROTECTED] wrote:
 ...
+   if (file  (vma-vm_flags  VM_SHARED)) {
+   if (flags  MS_ASYNC)
+   vma_wrprotect(vma);
+   if (flags  MS_SYNC) {
  
   else if ??
 
  The MS_ASYNC and MS_SYNC flags are mutually exclusive, that is why I
  did not use the else-if here. Moreover, this function itself checks
  that they never come together.
 

 I would say that them being mutually exclusive would be a reason *for*
 using else-if here.

This check is performed by the sys_msync() function itself in its very
beginning.

We don't need to check it later.


 --
 Jesper Juhl [EMAIL PROTECTED]
 Don't top-post  http://www.catb.org/~esr/jargon/html/T/top-post.html
 Plain text mails only, please  http://www.expita.com/nomime.html

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH -v7 2/2] Update ctime and mtime for memory-mapped files

2008-01-21 Thread Anton Salikhmetov
2008/1/22, Anton Salikhmetov [EMAIL PROTECTED]:
 2008/1/22, Jesper Juhl [EMAIL PROTECTED]:
  Some very pedantic nitpicking below;
 
  On 22/01/2008, Anton Salikhmetov [EMAIL PROTECTED] wrote:
   http://bugzilla.kernel.org/show_bug.cgi?id=2645#c40
  
   Update file times at write references to memory-mapped files.
   Force file times update at the next write reference after
   calling the msync() system call with the MS_ASYNC flag.
  
   Signed-off-by: Anton Salikhmetov [EMAIL PROTECTED]
   ---
mm/memory.c |6 ++
mm/msync.c  |   57 
   -
2 files changed, 50 insertions(+), 13 deletions(-)
  
   diff --git a/mm/memory.c b/mm/memory.c
   index 6dd1cd8..4b0144b 100644
   --- a/mm/memory.c
   +++ b/mm/memory.c
   @@ -1670,6 +1670,9 @@ gotten:
unlock:
   pte_unmap_unlock(page_table, ptl);
   if (dirty_page) {
   +   if (vma-vm_file)
   +   file_update_time(vma-vm_file);
   +
   /*
* Yes, Virginia, this is actually required to prevent a 
   race
* with clear_page_dirty_for_io() from clearing the page 
   dirty
   @@ -2343,6 +2346,9 @@ out_unlocked:
   if (anon)
   page_cache_release(vmf.page);
   else if (dirty_page) {
   +   if (vma-vm_file)
   +   file_update_time(vma-vm_file);
   +
   set_page_dirty_balance(dirty_page, page_mkwrite);
   put_page(dirty_page);
   }
   diff --git a/mm/msync.c b/mm/msync.c
   index a4de868..394130d 100644
   --- a/mm/msync.c
   +++ b/mm/msync.c
   @@ -5,6 +5,7 @@
 * Copyright (C) 2008 Anton Salikhmetov [EMAIL PROTECTED]
 */
  
   +#include asm/tlbflush.h
#include linux/file.h
#include linux/fs.h
#include linux/mm.h
   @@ -13,11 +14,37 @@
#include linux/syscalls.h
  
/*
   + * Scan the PTEs for pages belonging to the VMA and mark them read-only.
   + * It will force a pagefault on the next write access.
   + */
   +static void vma_wrprotect(struct vm_area_struct *vma)
   +{
   +   unsigned long addr;
   +
   +   for (addr = vma-vm_start; addr  vma-vm_end; addr += PAGE_SIZE) 
   {
 
  I know it's not the common Linux Kernel way, but 'addr' could be
  made to have just 'for' scope here according to C99;

 I believe that the C89 style is more common for the Linux kernel, so
 I've used the out-of-scope variable declaration.

 
 for (unsigned long addr = vma-vm_start; addr  vma-vm_end;
  addr += PAGE_SIZE) {

By the way, if we're talking pedantic, then:



debian:/tmp$ cat c.c
void f()
{
   for (unsigned long i = 0; i  10; i++)
   continue;
}
debian:/tmp$ gcc -c -pedantic c.c
c.c: In function 'f':
c.c:3: error: 'for' loop initial declaration used outside C99 mode
debian:/tmp$



No pun intended :)

 
 
   +   spinlock_t *ptl;
   +   pgd_t *pgd = pgd_offset(vma-vm_mm, addr);
   +   pud_t *pud = pud_offset(pgd, addr);
   +   pmd_t *pmd = pmd_offset(pud, addr);
   +   pte_t *pte = pte_offset_map_lock(vma-vm_mm, pmd, addr, 
   ptl);
   +
   +   if (pte_dirty(*pte)  pte_write(*pte)) {
   +   pte_t entry = ptep_clear_flush(vma, addr, pte);
   +
   +   entry = pte_wrprotect(entry);
   +   set_pte_at(vma-vm_mm, addr, pte, entry);
   +   }
   +   pte_unmap_unlock(pte, ptl);
   +   }
   +}
   +
   +/*
 * MS_SYNC syncs the entire file - including mappings.
 *
   - * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
   - * Nor does it mark the relevant pages dirty (it used to up to 2.6.17).
   - * Now it doesn't do anything, since dirty pages are properly tracked.
 
  I think keeping some version of the up to ... comments makes sense.
  It documents that we previously had different behaviour.

 Earlier I had a request to remove any changelog-style comments from the 
 code.

 
   + * MS_ASYNC does not start I/O. Instead, it marks the relevant pages
   + * read-only by calling vma_wrprotect(). This is needed to catch the next
   + * write reference to the mapped region and update the file times
   + * accordingly.
 *
 * The application may now run fsync() to write out the dirty pages and
 * wait on the writeout and check the result. Or the application may run
   @@ -77,16 +104,20 @@ asmlinkage long sys_msync(unsigned long start, 
   size_t len, int flags)
   error = 0;
   start = vma-vm_end;
   file = vma-vm_file;
   -   if (file  (vma-vm_flags  VM_SHARED)  (flags  
   MS_SYNC)) {
   -   get_file(file);
   -   up_read(mm-mmap_sem);
   -   error = do_fsync(file, 0);
   -   fput(file);
   -   if (error || start = end

Re: [PATCH -v7 2/2] Update ctime and mtime for memory-mapped files

2008-01-21 Thread Anton Salikhmetov
2008/1/22, Linus Torvalds [EMAIL PROTECTED]:


 On Tue, 22 Jan 2008, Anton Salikhmetov wrote:
 
   /*
  + * Scan the PTEs for pages belonging to the VMA and mark them read-only.
  + * It will force a pagefault on the next write access.
  + */
  +static void vma_wrprotect(struct vm_area_struct *vma)
  +{
  + unsigned long addr;
  +
  + for (addr = vma-vm_start; addr  vma-vm_end; addr += PAGE_SIZE) {
  + spinlock_t *ptl;
  + pgd_t *pgd = pgd_offset(vma-vm_mm, addr);
  + pud_t *pud = pud_offset(pgd, addr);
  + pmd_t *pmd = pmd_offset(pud, addr);
  + pte_t *pte = pte_offset_map_lock(vma-vm_mm, pmd, addr, ptl);

 This is extremely expensive over bigger areas, especially sparsely mapped
 ones (it does all the lookups for all four levels over and over and over
 again for eachg page).

 I think Peter Zijlstra posted a version that uses the regular kind of
 nested loop (with inline functions to keep the thing nice and clean),
 which gets rid of that.

Thanks for your feedback, Linus!

I will use Peter Zijlstra's version of such an operation in my next
patch series.


 [ The sad/funny part is that this is all how we *used* to do msync(), back
   in the days: we're literally going back to the pre-cleanup logic. See
   commit 204ec841fbea3e5138168edbc3a76d46747cc987: mm: msync() cleanup
   for details ]

 Quite frankly, I really think you might be better off just doing a

 git revert 204ec841fbea3e5138168edbc3a76d46747cc987

 and working from there! I just checked, and it still reverts cleanly, and
 you'd end up with a nice code-base that (a) has gotten years of testing
 and (b) already has the looping-over-the-pagetables code.

 Linus

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008-01-18 Thread Anton Salikhmetov
2008/1/19, Linus Torvalds <[EMAIL PROTECTED]>:
>
>
> On Sat, 19 Jan 2008, Anton Salikhmetov wrote:
> >
> > The page_check_address() function is called from the
> > page_mkclean_one() routine as follows:
>
> .. and the page_mkclean_one() function is totally different.
>
> Lookie here, this is the correct and complex sequence:
>
> > entry = ptep_clear_flush(vma, address, pte);
> > entry = pte_wrprotect(entry);
> > entry = pte_mkclean(entry);
> > set_pte_at(mm, address, pte, entry);
>
> That's a rather expensive sequence, but it's done exactly because it has
> to be done that way. What it does is to
>
>  - *atomically* load the pte entry _and_ clear the old one in memory.
>
>That's the
>
> entry = ptep_clear_flush(vma, address, pte);
>
>thing, and it basically means that it's doing some
>architecture-specific magic to make sure that another CPU that accesses
>the PTE at the same time will never actually modify the pte (because
>it's clear and not valid)
>
>  - it then - while the page table is actually clear and invalid - takes
>the old value and turns it into the new one:
>
> entry = pte_wrprotect(entry);
> entry = pte_mkclean(entry);
>
>  - and finally, it replaces the entry with the new one:
>
> set_pte_at(mm, address, pte, entry);
>
>which takes care to write the new entry in some specific way that is
>atomic wrt other CPU's (ie on 32-bit x86 with a 64-bit page table
>entry it writes the high word first, see the write barriers in
>"native_set_pte()" in include/asm-x86/pgtable-3level.h
>
> Now, compare that subtle and correct thing with what is *not* correct:
>
> if (pte_dirty(*pte) && pte_write(*pte))
> *pte = pte_wrprotect(*pte);
>
> which makes no effort at all to make sure that it's safe in case another
> CPU updates the accessed bit.
>
> Now, arguably it's unlikely to cause horrible problems at least on x86,
> because:
>
>  - we only do this if the pte is already marked dirty, so while we can
>lose the accessed bit, we can *not* lose the dirty bit. And the
>accessed bit isn't such a big deal.
>
>  - it's not doing any of the "be careful about" ordering things, but since
>the really important bits aren't changing, ordering probably won't
>practically matter.
>
> But the problem is that we have something like 24 different architectures,
> it's hard to make sure that none of them have issues.
>
> In other words: it may well work in practice. But when these things go
> subtly wrong, they are *really* nasty to find, and the unsafe sequence is
> really not how it's supposed to be done. For example, you don't even flush
> the TLB, so even if there are no cross-CPU issues, there's probably going
> to be writable entries in the TLB that now don't match the page tables.
>
> Will it matter? Again, probably impossible to see in practice. But ...

Linus, I am very grateful to you for your extremely clear explanation
of the issue I have overlooked!

Back to the msync() issue, I'm going to come back with a new design
for the bug fix.

Thank you once again.

Anton

>
> Linus
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008-01-18 Thread Anton Salikhmetov
2008/1/19, Linus Torvalds <[EMAIL PROTECTED]>:
>
>
> On Sat, 19 Jan 2008, Anton Salikhmetov wrote:
> >
> > Before using pte_wrprotect() the vma_wrprotect() routine uses the
> > pte_offset_map_lock() macro to get the PTE and to acquire the ptl
> > spinlock. Why did you say that this code was not SMP-safe? It should
> > be atomic, I think.
>
> It's atomic WITH RESPECT TO OTHER PEOPLE WHO GET THE LOCK.
>
> Guess how much another x86 CPU cares when it sets the accessed bit in
> hardware?

Thank you very much for taking part in this discussion. Personally,
it's very important to me.  But I'm not sure that I understand which
bit can be lost.

Please let me explain.

The logic for my vma_wrprotect() routine was taken from the
page_check_address() function in mm/rmap.c. Here is a code snippet of
the latter function:

pgd = pgd_offset(mm, address);
if (!pgd_present(*pgd))
return NULL;

pud = pud_offset(pgd, address);
if (!pud_present(*pud))
return NULL;

pmd = pmd_offset(pud, address);
if (!pmd_present(*pmd))
return NULL;

pte = pte_offset_map(pmd, address);
/* Make a quick check before getting the lock */
if (!pte_present(*pte)) {
pte_unmap(pte);
return NULL;
}

ptl = pte_lockptr(mm, pmd);
spin_lock(ptl);
if (pte_present(*pte) && page_to_pfn(page) == pte_pfn(*pte)) {
*ptlp = ptl;
return pte;
}
pte_unmap_unlock(pte, ptl);

The page_check_address() function is called from the
page_mkclean_one() routine as follows:

pte = page_check_address(page, mm, address, );
if (!pte)
goto out;

if (pte_dirty(*pte) || pte_write(*pte)) {
pte_t entry;

flush_cache_page(vma, address, pte_pfn(*pte));
entry = ptep_clear_flush(vma, address, pte);
entry = pte_wrprotect(entry);
entry = pte_mkclean(entry);
set_pte_at(mm, address, pte, entry);
ret = 1;
}

pte_unmap_unlock(pte, ptl);

The write-protection of the PTE is done using the pte_wrprotect()
entity. I intended to do the same during msync() with MS_ASYNC. I
understand that I'm taking a risk of looking a complete idiot now,
however I don't see any difference between the two situations.

I presumed that the code in mm/rmap.c was absolutely correct, that's
why I basically reused the design.

>
> > The POSIX standard requires the ctime and mtime stamps to be updated
> > not later than at the second call to msync() with the MS_ASYNC flag.
>
> .. and that is no excuse for bad code.
>
> Linus
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008-01-18 Thread Anton Salikhmetov
2008/1/18, Linus Torvalds <[EMAIL PROTECTED]>:
>
>
> On Fri, 18 Jan 2008, Anton Salikhmetov wrote:
> >
> > The current solution doesn't hit the performance at all when compared to
> > the competitor POSIX-compliant systems. It is faster and does even more
> > than the POSIX standard requires.
>
> Your current patches have two problems:
>  - they are simply unnecessarily invasive for a relatively simple issue
>  - all versions I've looked at closer are buggy too
>
> Example:
>
> +   if (pte_dirty(*pte) && pte_write(*pte))
> +   *pte = pte_wrprotect(*pte);
>
> Uhhuh. Looks simple enough. Except it does a non-atomic pte access while
> other CPU's may be accessing it and updating it from their hw page table
> walkers. What will happen? Who knows? I can see lost access bits at a
> minimum.
>
> IOW, this isn't simple code. It's code that it is simple to screw up. In
> this case, you really need to use ptep_set_wrprotect(), for example.

Before using pte_wrprotect() the vma_wrprotect() routine uses the
pte_offset_map_lock() macro to get the PTE and to acquire the ptl
spinlock. Why did you say that this code was not SMP-safe? It should
be atomic, I think.


>
> So why not do it in many fewer lines with that simpler vma->dirty flag?

Neither the dirty flag you suggest, nor the AS_MCTIME flag I've
introduced in my previous solutions solve the following problem:

- mmap()
- a write reference
- msync() with MS_ASYNC
- a write reference
- msync() with MS_ASYNC

The POSIX standard requires the ctime and mtime stamps to be updated
not later than at the second call to msync() with the MS_ASYNC flag.

Some other POSIX-compliant operating system such as HP-UX and FreeBSD
satisfy this POSIX requirement. Linux does not.

>
> Linus
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008-01-18 Thread Anton Salikhmetov
2008/1/18, Linus Torvalds <[EMAIL PROTECTED]>:
>
>
> On Fri, 18 Jan 2008, Miklos Szeredi wrote:
> >
> > What I'm saying is that the times could be left un-updated for a long
> > time if program doesn't do munmap() or msync(MS_SYNC) for a long time.
>
> Sure.
>
> But in those circumstances, the programmer cannot depend on the mtime
> *anyway* (because there is no synchronization), so what's the downside?
>
> Let's face it, there's exactly three possible solutions:
>
>  - the insane one: trap EVERY SINGLE instruction that does a write to the
>page, and update mtime each and every time.
>
>This one is so obviously STUPID that it's not even worth discussing
>further, except to say that "yes, there is an 'exact' algorithm, but
>no, we are never EVER going to use it".
>
>  - the non-exact solutions that don't give you mtime updates every time
>a write to the page happens, but give *some* guarantees for things that
>will update it.
>
>This is the one I think we can do, and the only things a programmer can
>impact using it is "msync()" and "munmap()", since no other operations
>really have any thing to do with it in a programmer-visible way (ie a
>normal "sync" operation may happen in the background and has no
>progam-relevant timing information)
>
>Other things *may* or may not update mtime (some filesystems - take
>most networked one as an example - will *always* update mtime on the
>server on writeback, so we cannot ever guarantee that nothing but
>msync/munmap does so), but at least we'll have a minimum set of things
>that people can depend on.
>
>  - the "we don't care at all solutions".
>
>mmap(MAP_WRITE) doesn't really update times reliably after the write
>has happened (but might do it *before* - maybe the mmap() itself does).
>
> Those are the three choices, I think. We currently approximate #3. We
> *can* do #2 (and there are various flavors of it). And even *aiming* for
> #1 is totally insane and stupid.

The current solution doesn't hit the performance at all when compared to
the competitor POSIX-compliant systems. It is faster and does even more
than the POSIX standard requires.

Please see the test results I've sent into the thread "-v6 0/2":

http://lkml.org/lkml/2008/1/18/447

I guess, the current solution is ready to use.

>
> Linus
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH -v6 0/2] Fixing the issue with memory-mapped file times

2008-01-18 Thread Anton Salikhmetov
2008/1/18, Miklos Szeredi <[EMAIL PROTECTED]>:
> > 4. Performance test was done using the program available from the
> > following link:
> >
> > http://bugzilla.kernel.org/attachment.cgi?id=14493
> >
> > Result: the impact of the changes was negligible for files of a few
> > hundred megabytes.
>
> Could you also test with ext4 and post some numbers?  Afaik, ext4 uses
> nanosecond timestamps, so the time updating code would be exercised
> more during the page faults.
>
> What about performance impact on msync(MS_ASYNC)?  Could you please do
> some measurment of that as well?

Did a quick test on an ext4 partition. This is how it looks like:

debian:~/miklos# ./miklos_test /mnt/file
begin   1200662360  1200662360  1200662353
write   1200662361  1200662361  1200662353
mmap1200662361  1200662361  1200662362
b   1200662363  1200662363  1200662362
msync b 1200662363  1200662363  1200662362
c   1200662365  1200662365  1200662362
msync c 1200662365  1200662365  1200662362
d   1200662367  1200662367  1200662362
munmap  1200662367  1200662367  1200662362
close   1200662367  1200662367  1200662362
sync1200662367  1200662367  1200662362
debian:~/miklos# mount | grep /mnt
/root/image.ext4 on /mnt type ext4dev (rw,loop=/dev/loop0)

> What about performance impact on msync(MS_ASYNC)?  Could you please do
> some measurment of that as well?

Following are the results of the measurements. Here are the relevant
portions of the test program:

>>>

#define FILE_SIZE (1024 * 1024 * 512)

p = mmap(0, FILE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

/* Bring the pages in */
for (i = 0; i < FILE_SIZE; i += 4096)
tmp = p[i];

/* Dirty the pages */
for (i = 0; i < FILE_SIZE; i += 4096)
p[i] = i;

/* How long did we spend in msync(MS_ASYNC)? */
gettimeofday(_start, NULL);
msync(p, FILE_SIZE, MS_ASYNC);
gettimeofday(_stop, NULL);

<<<

For reference tests, the following platforms were used:

1. HP-UX B.11.31, PA-RISC 8800 processor (800 MHz, 64 MB), Memory: 4 GB.

2. HP-UX B.11.31, 2 Intel(R) Itanium 2 9000 series processors (1.59 GHz, 18 MB),
   Memory: 15.98 GB.

3. FreeBSD 6.2-RELEASE, Intel(R) Pentium(R) III CPU family 1400MHz, 2 CPUs.
   Memory: 4G.

The tests of my solution were performed using the following platform:

A KVM x86_64 guest OS, current Git kernel. Host system: Intel(R) Core(TM)2
Duo CPU T7300 @ 2.00GHz. Further referred to as "the first case".

The following tables give the time difference between the two calls
to gettimeofday(). The test program was run three times in a raw
with a delay of one second between consecutive runs. On Linux
systems, the following commands were issued prior to running the
tests:

echo 80 >/proc/sys/vm/dirty_ratio
echo 80 >/proc/sys/vm/dirty_background_ratio
echo 3 >/proc/sys/vm/dirty_expire_centisecs
sync

Table 1. Reference platforms.


| | HP-UX/PA-RISC | HP-UX/Itanium | FreeBSD|

| First run   | 263405 usec   | 202283 usec   | 90 SECONDS |

| Second run  | 262253 usec   | 172837 usec   | 90 SECONDS |

| Third run   | 238465 usec   | 238465 usec   | 90 SECONDS |


It looks like FreeBSD is a clear outsider here. Note that FreeBSD
showed an almost liner depencence of the time spent in the
msync(MS_ASYNC) call on the file size.

Table 2. The Qemu system. File size is 512M.

---
|| Before the patch | After the patch |
---
| First run  | 35 usec  |   5852 usec |
---
| Second run | 35 usec  |    usec |
---
| Third run  | 35 usec  |   6330 usec |
---

I think that the data above prove the viability of the solution I
presented. Also, I guess that this bug fix is most probably ready
for getting upstream.

Please apply the sixth version of my solution.

>
> Thanks,
> Miklos
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008-01-18 Thread Anton Salikhmetov
2008/1/18, Peter Zijlstra <[EMAIL PROTECTED]>:
>
> On Fri, 2008-01-18 at 11:15 +0100, Peter Zijlstra wrote:
> > On Fri, 2008-01-18 at 10:51 +0100, Miklos Szeredi wrote:
> >
> > > > diff --git a/mm/msync.c b/mm/msync.c
> > > > index a4de868..a49af28 100644
> > > > --- a/mm/msync.c
> > > > +++ b/mm/msync.c
> > > > @@ -13,11 +13,33 @@
> > > >  #include 
> > > >
> > > >  /*
> > > > + * Scan the PTEs for pages belonging to the VMA and mark them 
> > > > read-only.
> > > > + * It will force a pagefault on the next write access.
> > > > + */
> > > > +static void vma_wrprotect(struct vm_area_struct *vma)
> > > > +{
> > > > + unsigned long addr;
> > > > +
> > > > + for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
> > > > + spinlock_t *ptl;
> > > > + pgd_t *pgd = pgd_offset(vma->vm_mm, addr);
> > > > + pud_t *pud = pud_offset(pgd, addr);
> > > > + pmd_t *pmd = pmd_offset(pud, addr);
> > > > + pte_t *pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, );
> > > > +
> > > > + if (pte_dirty(*pte) && pte_write(*pte))
> > > > + *pte = pte_wrprotect(*pte);
> > > > + pte_unmap_unlock(pte, ptl);
> > > > + }
> > > > +}
> > >
> > > What about ram based filesystems?  They don't start out with read-only
> > > pte's, so I think they don't want them read-protected now either.
> > > Unless this is essential for correct mtime/ctime accounting on these
> > > filesystems (I don't think it really is).  But then the mapping should
> > > start out read-only as well, otherwise the time update will only work
> > > after an msync(MS_ASYNC).
> >
> > page_mkclean() has all the needed logic for this, it also walks the rmap
> > and cleans out all other users, which I think is needed too for
> > consistencies sake:
> >
> > Process A Process B
> >
> > mmap(foo.txt) mmap(foo.txt)
> >
> > dirty page
> >   dirty page
> >
> > msync(MS_ASYNC)
> >
> >   dirty page
> >
> > msync(MS_ASYNC) <--- now what?!
> >
> >
> > So what I would suggest is using the page table walkers from mm, and
> > walks the page range, obtain the page using vm_normal_page() and call
> > page_mkclean(). (Oh, and ensure you don't nest the pte lock :-)
> >
> > All in all, that sounds rather expensive..
>
> Bah, and will break on s390... so we'd need a page_mkclean() variant
> that doesn't actually clear dirty.

So the current version of the functional changes patch takes this into account.

>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH -v6 0/2] Fixing the issue with memory-mapped file times

2008-01-18 Thread Anton Salikhmetov
2008/1/18, Miklos Szeredi <[EMAIL PROTECTED]>:
> > 4. Performance test was done using the program available from the
> > following link:
> >
> > http://bugzilla.kernel.org/attachment.cgi?id=14493
> >
> > Result: the impact of the changes was negligible for files of a few
> > hundred megabytes.
>
> Could you also test with ext4 and post some numbers?  Afaik, ext4 uses
> nanosecond timestamps, so the time updating code would be exercised
> more during the page faults.
>
> What about performance impact on msync(MS_ASYNC)?  Could you please do
> some measurment of that as well?

I'll do the measurements for the MS_ASYNC case and for the Ext4 filesystem.

>
> Thanks,
> Miklos
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH -v6 1/2] Massive code cleanup of sys_msync()

2008-01-18 Thread Anton Salikhmetov
2008/1/18, Miklos Szeredi <[EMAIL PROTECTED]>:
> >   unsigned long end;
> > - struct mm_struct *mm = current->mm;
> > + int error, unmapped_error;
> >   struct vm_area_struct *vma;
> > - int unmapped_error = 0;
> > - int error = -EINVAL;
> > + struct mm_struct *mm;
> >
> > + error = -EINVAL;
>
> I think you may have misunderstood my last comment.  These are OK:
>
> struct mm_struct *mm = current->mm;
> int unmapped_error = 0;
> int error = -EINVAL;
>
> This is not so good:
>
> int error, unmapped_error;
>
> This is the worst:
>
> int error = -EINVAL, unmapped_error = 0;
>
> So I think the original code is fine as it is.
>
> Othewise patch looks OK now.

I moved the initialization of the variables to the code where they are needed.

I don't agree that "int a; int b;" is better than "int a, b".

>
> Miklos
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH -v6 0/2] Fixing the issue with memory-mapped file times

2008-01-18 Thread Anton Salikhmetov
2008/1/18, Miklos Szeredi [EMAIL PROTECTED]:
  4. Performance test was done using the program available from the
  following link:
 
  http://bugzilla.kernel.org/attachment.cgi?id=14493
 
  Result: the impact of the changes was negligible for files of a few
  hundred megabytes.

 Could you also test with ext4 and post some numbers?  Afaik, ext4 uses
 nanosecond timestamps, so the time updating code would be exercised
 more during the page faults.

 What about performance impact on msync(MS_ASYNC)?  Could you please do
 some measurment of that as well?

I'll do the measurements for the MS_ASYNC case and for the Ext4 filesystem.


 Thanks,
 Miklos


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH -v6 1/2] Massive code cleanup of sys_msync()

2008-01-18 Thread Anton Salikhmetov
2008/1/18, Miklos Szeredi [EMAIL PROTECTED]:
unsigned long end;
  - struct mm_struct *mm = current-mm;
  + int error, unmapped_error;
struct vm_area_struct *vma;
  - int unmapped_error = 0;
  - int error = -EINVAL;
  + struct mm_struct *mm;
 
  + error = -EINVAL;

 I think you may have misunderstood my last comment.  These are OK:

 struct mm_struct *mm = current-mm;
 int unmapped_error = 0;
 int error = -EINVAL;

 This is not so good:

 int error, unmapped_error;

 This is the worst:

 int error = -EINVAL, unmapped_error = 0;

 So I think the original code is fine as it is.

 Othewise patch looks OK now.

I moved the initialization of the variables to the code where they are needed.

I don't agree that int a; int b; is better than int a, b.


 Miklos

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008-01-18 Thread Anton Salikhmetov
2008/1/18, Peter Zijlstra [EMAIL PROTECTED]:

 On Fri, 2008-01-18 at 11:15 +0100, Peter Zijlstra wrote:
  On Fri, 2008-01-18 at 10:51 +0100, Miklos Szeredi wrote:
 
diff --git a/mm/msync.c b/mm/msync.c
index a4de868..a49af28 100644
--- a/mm/msync.c
+++ b/mm/msync.c
@@ -13,11 +13,33 @@
 #include linux/syscalls.h
   
 /*
+ * Scan the PTEs for pages belonging to the VMA and mark them 
read-only.
+ * It will force a pagefault on the next write access.
+ */
+static void vma_wrprotect(struct vm_area_struct *vma)
+{
+ unsigned long addr;
+
+ for (addr = vma-vm_start; addr  vma-vm_end; addr += PAGE_SIZE) {
+ spinlock_t *ptl;
+ pgd_t *pgd = pgd_offset(vma-vm_mm, addr);
+ pud_t *pud = pud_offset(pgd, addr);
+ pmd_t *pmd = pmd_offset(pud, addr);
+ pte_t *pte = pte_offset_map_lock(vma-vm_mm, pmd, addr, ptl);
+
+ if (pte_dirty(*pte)  pte_write(*pte))
+ *pte = pte_wrprotect(*pte);
+ pte_unmap_unlock(pte, ptl);
+ }
+}
  
   What about ram based filesystems?  They don't start out with read-only
   pte's, so I think they don't want them read-protected now either.
   Unless this is essential for correct mtime/ctime accounting on these
   filesystems (I don't think it really is).  But then the mapping should
   start out read-only as well, otherwise the time update will only work
   after an msync(MS_ASYNC).
 
  page_mkclean() has all the needed logic for this, it also walks the rmap
  and cleans out all other users, which I think is needed too for
  consistencies sake:
 
  Process A Process B
 
  mmap(foo.txt) mmap(foo.txt)
 
  dirty page
dirty page
 
  msync(MS_ASYNC)
 
dirty page
 
  msync(MS_ASYNC) --- now what?!
 
 
  So what I would suggest is using the page table walkers from mm, and
  walks the page range, obtain the page using vm_normal_page() and call
  page_mkclean(). (Oh, and ensure you don't nest the pte lock :-)
 
  All in all, that sounds rather expensive..

 Bah, and will break on s390... so we'd need a page_mkclean() variant
 that doesn't actually clear dirty.

So the current version of the functional changes patch takes this into account.



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008-01-18 Thread Anton Salikhmetov
2008/1/19, Linus Torvalds [EMAIL PROTECTED]:


 On Sat, 19 Jan 2008, Anton Salikhmetov wrote:
 
  The page_check_address() function is called from the
  page_mkclean_one() routine as follows:

 .. and the page_mkclean_one() function is totally different.

 Lookie here, this is the correct and complex sequence:

  entry = ptep_clear_flush(vma, address, pte);
  entry = pte_wrprotect(entry);
  entry = pte_mkclean(entry);
  set_pte_at(mm, address, pte, entry);

 That's a rather expensive sequence, but it's done exactly because it has
 to be done that way. What it does is to

  - *atomically* load the pte entry _and_ clear the old one in memory.

That's the

 entry = ptep_clear_flush(vma, address, pte);

thing, and it basically means that it's doing some
architecture-specific magic to make sure that another CPU that accesses
the PTE at the same time will never actually modify the pte (because
it's clear and not valid)

  - it then - while the page table is actually clear and invalid - takes
the old value and turns it into the new one:

 entry = pte_wrprotect(entry);
 entry = pte_mkclean(entry);

  - and finally, it replaces the entry with the new one:

 set_pte_at(mm, address, pte, entry);

which takes care to write the new entry in some specific way that is
atomic wrt other CPU's (ie on 32-bit x86 with a 64-bit page table
entry it writes the high word first, see the write barriers in
native_set_pte() in include/asm-x86/pgtable-3level.h

 Now, compare that subtle and correct thing with what is *not* correct:

 if (pte_dirty(*pte)  pte_write(*pte))
 *pte = pte_wrprotect(*pte);

 which makes no effort at all to make sure that it's safe in case another
 CPU updates the accessed bit.

 Now, arguably it's unlikely to cause horrible problems at least on x86,
 because:

  - we only do this if the pte is already marked dirty, so while we can
lose the accessed bit, we can *not* lose the dirty bit. And the
accessed bit isn't such a big deal.

  - it's not doing any of the be careful about ordering things, but since
the really important bits aren't changing, ordering probably won't
practically matter.

 But the problem is that we have something like 24 different architectures,
 it's hard to make sure that none of them have issues.

 In other words: it may well work in practice. But when these things go
 subtly wrong, they are *really* nasty to find, and the unsafe sequence is
 really not how it's supposed to be done. For example, you don't even flush
 the TLB, so even if there are no cross-CPU issues, there's probably going
 to be writable entries in the TLB that now don't match the page tables.

 Will it matter? Again, probably impossible to see in practice. But ...

Linus, I am very grateful to you for your extremely clear explanation
of the issue I have overlooked!

Back to the msync() issue, I'm going to come back with a new design
for the bug fix.

Thank you once again.

Anton


 Linus

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008-01-18 Thread Anton Salikhmetov
2008/1/19, Linus Torvalds [EMAIL PROTECTED]:


 On Sat, 19 Jan 2008, Anton Salikhmetov wrote:
 
  Before using pte_wrprotect() the vma_wrprotect() routine uses the
  pte_offset_map_lock() macro to get the PTE and to acquire the ptl
  spinlock. Why did you say that this code was not SMP-safe? It should
  be atomic, I think.

 It's atomic WITH RESPECT TO OTHER PEOPLE WHO GET THE LOCK.

 Guess how much another x86 CPU cares when it sets the accessed bit in
 hardware?

Thank you very much for taking part in this discussion. Personally,
it's very important to me.  But I'm not sure that I understand which
bit can be lost.

Please let me explain.

The logic for my vma_wrprotect() routine was taken from the
page_check_address() function in mm/rmap.c. Here is a code snippet of
the latter function:

pgd = pgd_offset(mm, address);
if (!pgd_present(*pgd))
return NULL;

pud = pud_offset(pgd, address);
if (!pud_present(*pud))
return NULL;

pmd = pmd_offset(pud, address);
if (!pmd_present(*pmd))
return NULL;

pte = pte_offset_map(pmd, address);
/* Make a quick check before getting the lock */
if (!pte_present(*pte)) {
pte_unmap(pte);
return NULL;
}

ptl = pte_lockptr(mm, pmd);
spin_lock(ptl);
if (pte_present(*pte)  page_to_pfn(page) == pte_pfn(*pte)) {
*ptlp = ptl;
return pte;
}
pte_unmap_unlock(pte, ptl);

The page_check_address() function is called from the
page_mkclean_one() routine as follows:

pte = page_check_address(page, mm, address, ptl);
if (!pte)
goto out;

if (pte_dirty(*pte) || pte_write(*pte)) {
pte_t entry;

flush_cache_page(vma, address, pte_pfn(*pte));
entry = ptep_clear_flush(vma, address, pte);
entry = pte_wrprotect(entry);
entry = pte_mkclean(entry);
set_pte_at(mm, address, pte, entry);
ret = 1;
}

pte_unmap_unlock(pte, ptl);

The write-protection of the PTE is done using the pte_wrprotect()
entity. I intended to do the same during msync() with MS_ASYNC. I
understand that I'm taking a risk of looking a complete idiot now,
however I don't see any difference between the two situations.

I presumed that the code in mm/rmap.c was absolutely correct, that's
why I basically reused the design.


  The POSIX standard requires the ctime and mtime stamps to be updated
  not later than at the second call to msync() with the MS_ASYNC flag.

 .. and that is no excuse for bad code.

 Linus

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008-01-18 Thread Anton Salikhmetov
2008/1/18, Linus Torvalds [EMAIL PROTECTED]:


 On Fri, 18 Jan 2008, Anton Salikhmetov wrote:
 
  The current solution doesn't hit the performance at all when compared to
  the competitor POSIX-compliant systems. It is faster and does even more
  than the POSIX standard requires.

 Your current patches have two problems:
  - they are simply unnecessarily invasive for a relatively simple issue
  - all versions I've looked at closer are buggy too

 Example:

 +   if (pte_dirty(*pte)  pte_write(*pte))
 +   *pte = pte_wrprotect(*pte);

 Uhhuh. Looks simple enough. Except it does a non-atomic pte access while
 other CPU's may be accessing it and updating it from their hw page table
 walkers. What will happen? Who knows? I can see lost access bits at a
 minimum.

 IOW, this isn't simple code. It's code that it is simple to screw up. In
 this case, you really need to use ptep_set_wrprotect(), for example.

Before using pte_wrprotect() the vma_wrprotect() routine uses the
pte_offset_map_lock() macro to get the PTE and to acquire the ptl
spinlock. Why did you say that this code was not SMP-safe? It should
be atomic, I think.



 So why not do it in many fewer lines with that simpler vma-dirty flag?

Neither the dirty flag you suggest, nor the AS_MCTIME flag I've
introduced in my previous solutions solve the following problem:

- mmap()
- a write reference
- msync() with MS_ASYNC
- a write reference
- msync() with MS_ASYNC

The POSIX standard requires the ctime and mtime stamps to be updated
not later than at the second call to msync() with the MS_ASYNC flag.

Some other POSIX-compliant operating system such as HP-UX and FreeBSD
satisfy this POSIX requirement. Linux does not.


 Linus

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH -v6 0/2] Fixing the issue with memory-mapped file times

2008-01-18 Thread Anton Salikhmetov
2008/1/18, Miklos Szeredi [EMAIL PROTECTED]:
  4. Performance test was done using the program available from the
  following link:
 
  http://bugzilla.kernel.org/attachment.cgi?id=14493
 
  Result: the impact of the changes was negligible for files of a few
  hundred megabytes.

 Could you also test with ext4 and post some numbers?  Afaik, ext4 uses
 nanosecond timestamps, so the time updating code would be exercised
 more during the page faults.

 What about performance impact on msync(MS_ASYNC)?  Could you please do
 some measurment of that as well?

Did a quick test on an ext4 partition. This is how it looks like:

debian:~/miklos# ./miklos_test /mnt/file
begin   1200662360  1200662360  1200662353
write   1200662361  1200662361  1200662353
mmap1200662361  1200662361  1200662362
b   1200662363  1200662363  1200662362
msync b 1200662363  1200662363  1200662362
c   1200662365  1200662365  1200662362
msync c 1200662365  1200662365  1200662362
d   1200662367  1200662367  1200662362
munmap  1200662367  1200662367  1200662362
close   1200662367  1200662367  1200662362
sync1200662367  1200662367  1200662362
debian:~/miklos# mount | grep /mnt
/root/image.ext4 on /mnt type ext4dev (rw,loop=/dev/loop0)

 What about performance impact on msync(MS_ASYNC)?  Could you please do
 some measurment of that as well?

Following are the results of the measurements. Here are the relevant
portions of the test program:



#define FILE_SIZE (1024 * 1024 * 512)

p = mmap(0, FILE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

/* Bring the pages in */
for (i = 0; i  FILE_SIZE; i += 4096)
tmp = p[i];

/* Dirty the pages */
for (i = 0; i  FILE_SIZE; i += 4096)
p[i] = i;

/* How long did we spend in msync(MS_ASYNC)? */
gettimeofday(tv_start, NULL);
msync(p, FILE_SIZE, MS_ASYNC);
gettimeofday(tv_stop, NULL);



For reference tests, the following platforms were used:

1. HP-UX B.11.31, PA-RISC 8800 processor (800 MHz, 64 MB), Memory: 4 GB.

2. HP-UX B.11.31, 2 Intel(R) Itanium 2 9000 series processors (1.59 GHz, 18 MB),
   Memory: 15.98 GB.

3. FreeBSD 6.2-RELEASE, Intel(R) Pentium(R) III CPU family 1400MHz, 2 CPUs.
   Memory: 4G.

The tests of my solution were performed using the following platform:

A KVM x86_64 guest OS, current Git kernel. Host system: Intel(R) Core(TM)2
Duo CPU T7300 @ 2.00GHz. Further referred to as the first case.

The following tables give the time difference between the two calls
to gettimeofday(). The test program was run three times in a raw
with a delay of one second between consecutive runs. On Linux
systems, the following commands were issued prior to running the
tests:

echo 80 /proc/sys/vm/dirty_ratio
echo 80 /proc/sys/vm/dirty_background_ratio
echo 3 /proc/sys/vm/dirty_expire_centisecs
sync

Table 1. Reference platforms.


| | HP-UX/PA-RISC | HP-UX/Itanium | FreeBSD|

| First run   | 263405 usec   | 202283 usec   | 90 SECONDS |

| Second run  | 262253 usec   | 172837 usec   | 90 SECONDS |

| Third run   | 238465 usec   | 238465 usec   | 90 SECONDS |


It looks like FreeBSD is a clear outsider here. Note that FreeBSD
showed an almost liner depencence of the time spent in the
msync(MS_ASYNC) call on the file size.

Table 2. The Qemu system. File size is 512M.

---
|| Before the patch | After the patch |
---
| First run  | 35 usec  |   5852 usec |
---
| Second run | 35 usec  |    usec |
---
| Third run  | 35 usec  |   6330 usec |
---

I think that the data above prove the viability of the solution I
presented. Also, I guess that this bug fix is most probably ready
for getting upstream.

Please apply the sixth version of my solution.


 Thanks,
 Miklos


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008-01-18 Thread Anton Salikhmetov
2008/1/18, Linus Torvalds [EMAIL PROTECTED]:


 On Fri, 18 Jan 2008, Miklos Szeredi wrote:
 
  What I'm saying is that the times could be left un-updated for a long
  time if program doesn't do munmap() or msync(MS_SYNC) for a long time.

 Sure.

 But in those circumstances, the programmer cannot depend on the mtime
 *anyway* (because there is no synchronization), so what's the downside?

 Let's face it, there's exactly three possible solutions:

  - the insane one: trap EVERY SINGLE instruction that does a write to the
page, and update mtime each and every time.

This one is so obviously STUPID that it's not even worth discussing
further, except to say that yes, there is an 'exact' algorithm, but
no, we are never EVER going to use it.

  - the non-exact solutions that don't give you mtime updates every time
a write to the page happens, but give *some* guarantees for things that
will update it.

This is the one I think we can do, and the only things a programmer can
impact using it is msync() and munmap(), since no other operations
really have any thing to do with it in a programmer-visible way (ie a
normal sync operation may happen in the background and has no
progam-relevant timing information)

Other things *may* or may not update mtime (some filesystems - take
most networked one as an example - will *always* update mtime on the
server on writeback, so we cannot ever guarantee that nothing but
msync/munmap does so), but at least we'll have a minimum set of things
that people can depend on.

  - the we don't care at all solutions.

mmap(MAP_WRITE) doesn't really update times reliably after the write
has happened (but might do it *before* - maybe the mmap() itself does).

 Those are the three choices, I think. We currently approximate #3. We
 *can* do #2 (and there are various flavors of it). And even *aiming* for
 #1 is totally insane and stupid.

The current solution doesn't hit the performance at all when compared to
the competitor POSIX-compliant systems. It is faster and does even more
than the POSIX standard requires.

Please see the test results I've sent into the thread -v6 0/2:

http://lkml.org/lkml/2008/1/18/447

I guess, the current solution is ready to use.


 Linus

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008-01-17 Thread Anton Salikhmetov
Updating file times at write references to memory-mapped files and
forcing file times update at the next write reference after
calling the msync() system call with the MS_ASYNC flag.

Signed-off-by: Anton Salikhmetov <[EMAIL PROTECTED]>
---
 mm/memory.c |6 ++
 mm/msync.c  |   52 +++-
 2 files changed, 45 insertions(+), 13 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 4bf0b6d..13d5bbf 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1668,6 +1668,9 @@ gotten:
 unlock:
pte_unmap_unlock(page_table, ptl);
if (dirty_page) {
+   if (vma->vm_file)
+   file_update_time(vma->vm_file);
+
/*
 * Yes, Virginia, this is actually required to prevent a race
 * with clear_page_dirty_for_io() from clearing the page dirty
@@ -2341,6 +2344,9 @@ out_unlocked:
if (anon)
page_cache_release(vmf.page);
else if (dirty_page) {
+   if (vma->vm_file)
+   file_update_time(vma->vm_file);
+
set_page_dirty_balance(dirty_page, page_mkwrite);
put_page(dirty_page);
}
diff --git a/mm/msync.c b/mm/msync.c
index a4de868..a49af28 100644
--- a/mm/msync.c
+++ b/mm/msync.c
@@ -13,11 +13,33 @@
 #include 
 
 /*
+ * Scan the PTEs for pages belonging to the VMA and mark them read-only.
+ * It will force a pagefault on the next write access.
+ */
+static void vma_wrprotect(struct vm_area_struct *vma)
+{
+   unsigned long addr;
+
+   for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
+   spinlock_t *ptl;
+   pgd_t *pgd = pgd_offset(vma->vm_mm, addr);
+   pud_t *pud = pud_offset(pgd, addr);
+   pmd_t *pmd = pmd_offset(pud, addr);
+   pte_t *pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, );
+
+   if (pte_dirty(*pte) && pte_write(*pte))
+   *pte = pte_wrprotect(*pte);
+   pte_unmap_unlock(pte, ptl);
+   }
+}
+
+/*
  * MS_SYNC syncs the entire file - including mappings.
  *
- * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
- * Nor does it mark the relevant pages dirty (it used to up to 2.6.17).
- * Now it doesn't do anything, since dirty pages are properly tracked.
+ * MS_ASYNC does not start I/O. Instead, it marks the relevant pages
+ * read-only by calling vma_wrprotect(). This is needed to catch the next
+ * write reference to the mapped region and update the file times
+ * accordingly.
  *
  * The application may now run fsync() to write out the dirty pages and
  * wait on the writeout and check the result. Or the application may run
@@ -77,16 +99,20 @@ asmlinkage long sys_msync(unsigned long start, size_t len, 
int flags)
error = 0;
start = vma->vm_end;
file = vma->vm_file;
-   if (file && (vma->vm_flags & VM_SHARED) && (flags & MS_SYNC)) {
-   get_file(file);
-   up_read(>mmap_sem);
-   error = do_fsync(file, 0);
-   fput(file);
-   if (error || start >= end)
-   goto out;
-   down_read(>mmap_sem);
-   vma = find_vma(mm, start);
-   continue;
+   if (file && (vma->vm_flags & VM_SHARED)) {
+   if (flags & MS_ASYNC)
+   vma_wrprotect(vma);
+   if (flags & MS_SYNC) {
+   get_file(file);
+   up_read(>mmap_sem);
+   error = do_fsync(file, 0);
+   fput(file);
+   if (error || start >= end)
+   goto out;
+   down_read(>mmap_sem);
+   vma = find_vma(mm, start);
+   continue;
+   }
}
 
vma = vma->vm_next;
-- 
1.4.4.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH -v6 0/2] Fixing the issue with memory-mapped file times

2008-01-17 Thread Anton Salikhmetov
This is the sixth version of my solution for the bug #2645:

http://bugzilla.kernel.org/show_bug.cgi?id=2645

New since the previous version:

1) a few cosmetic changes according to the latest feedback
   for the cleanup patch;

2) implementation of the following suggestion by Miklos Szeredi:

http://lkml.org/lkml/2008/1/17/158

These changes were tested as explained below. Please note that all
tests were performed with all recommended kernel debug options
enabled. Also note that the tests were performed on regular files
residing on both an ext3 partition and a tmpfs filesystem. I also
checked the block device case, which worked for me as well.

1. My own unit test:

http://bugzilla.kernel.org/attachment.cgi?id=14430

Result: all test cases passed successfully.

2. Unit test provided by Miklos Szeredi:

http://lkml.org/lkml/2008/1/14/104

Result: this test produced the following output:

debian-64:~# ./miklos_test test
begin   1200598736  1200598736  1200598617
write   1200598737  1200598737  1200598617
mmap1200598737  1200598737  1200598738
b   1200598739  1200598739  1200598738
msync b 1200598739  1200598739  1200598738
c   1200598741  1200598741  1200598738
msync c 1200598741  1200598741  1200598738
d   1200598743  1200598743  1200598738
munmap  1200598743  1200598743  1200598738
close   1200598743  1200598743  1200598738
sync1200598743  1200598743  1200598738
debian-64:~#

3. Regression tests were performed using the following test cases from
the LTP test suite:

msync01
msync02
msync03
msync04
msync05
mmapstress01
mmapstress09
mmapstress10

Result: no regressions were found while running these test cases.

4. Performance test was done using the program available from the
following link:

http://bugzilla.kernel.org/attachment.cgi?id=14493

Result: the impact of the changes was negligible for files of a few
hundred megabytes.

I wonder if these changes can be applied now.

These patches are the result of many fruitful discussions with
Miklos Szeredi, Peter Zijlstra, Rik van Riel, Peter Staubach,
and Jacob Oestergaard. I am grateful to you all for your support
during the days I was working on this long-standing nasty bug.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH -v6 1/2] Massive code cleanup of sys_msync()

2008-01-17 Thread Anton Salikhmetov
Using the PAGE_ALIGN() macro instead of "manual" alignment and
improved readability of the loop traversing the process memory regions.

Signed-off-by: Anton Salikhmetov <[EMAIL PROTECTED]>
---
 mm/msync.c |   77 
 1 files changed, 36 insertions(+), 41 deletions(-)

diff --git a/mm/msync.c b/mm/msync.c
index 144a757..a4de868 100644
--- a/mm/msync.c
+++ b/mm/msync.c
@@ -1,85 +1,83 @@
 /*
- * linux/mm/msync.c
+ * The msync() system call.
  *
- * Copyright (C) 1994-1999  Linus Torvalds
+ * Copyright (C) 1994-1999 Linus Torvalds
+ * Copyright (C) 2008 Anton Salikhmetov <[EMAIL PROTECTED]>
  */
 
-/*
- * The msync() system call.
- */
+#include 
 #include 
 #include 
 #include 
-#include 
-#include 
 #include 
+#include 
 
 /*
  * MS_SYNC syncs the entire file - including mappings.
  *
  * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
- * Nor does it marks the relevant pages dirty (it used to up to 2.6.17).
+ * Nor does it mark the relevant pages dirty (it used to up to 2.6.17).
  * Now it doesn't do anything, since dirty pages are properly tracked.
  *
- * The application may now run fsync() to
- * write out the dirty pages and wait on the writeout and check the result.
- * Or the application may run fadvise(FADV_DONTNEED) against the fd to start
- * async writeout immediately.
+ * The application may now run fsync() to write out the dirty pages and
+ * wait on the writeout and check the result. Or the application may run
+ * fadvise(FADV_DONTNEED) against the fd to start async writeout immediately.
  * So by _not_ starting I/O in MS_ASYNC we provide complete flexibility to
  * applications.
  */
 asmlinkage long sys_msync(unsigned long start, size_t len, int flags)
 {
unsigned long end;
-   struct mm_struct *mm = current->mm;
+   int error, unmapped_error;
struct vm_area_struct *vma;
-   int unmapped_error = 0;
-   int error = -EINVAL;
+   struct mm_struct *mm;
 
+   error = -EINVAL;
if (flags & ~(MS_ASYNC | MS_INVALIDATE | MS_SYNC))
goto out;
if (start & ~PAGE_MASK)
goto out;
if ((flags & MS_ASYNC) && (flags & MS_SYNC))
goto out;
+
error = -ENOMEM;
-   len = (len + ~PAGE_MASK) & PAGE_MASK;
+   len = PAGE_ALIGN(len);
end = start + len;
if (end < start)
goto out;
-   error = 0;
+
+   error = unmapped_error = 0;
if (end == start)
goto out;
+
/*
 * If the interval [start,end) covers some unmapped address ranges,
 * just ignore them, but return -ENOMEM at the end.
 */
+   mm = current->mm;
down_read(>mmap_sem);
vma = find_vma(mm, start);
-   for (;;) {
+   do {
struct file *file;
 
-   /* Still start < end. */
error = -ENOMEM;
if (!vma)
-   goto out_unlock;
-   /* Here start < vma->vm_end. */
+   break;
if (start < vma->vm_start) {
start = vma->vm_start;
if (start >= end)
-   goto out_unlock;
-   unmapped_error = -ENOMEM;
-   }
-   /* Here vma->vm_start <= start < vma->vm_end. */
-   if ((flags & MS_INVALIDATE) &&
-   (vma->vm_flags & VM_LOCKED)) {
-   error = -EBUSY;
-   goto out_unlock;
+   break;
+   unmapped_error = error;
}
-   file = vma->vm_file;
+
+   error = -EBUSY;
+   if ((flags & MS_INVALIDATE) && (vma->vm_flags & VM_LOCKED))
+   break;
+
+   error = 0;
start = vma->vm_end;
-   if ((flags & MS_SYNC) && file &&
-   (vma->vm_flags & VM_SHARED)) {
+   file = vma->vm_file;
+   if (file && (vma->vm_flags & VM_SHARED) && (flags & MS_SYNC)) {
get_file(file);
up_read(>mmap_sem);
error = do_fsync(file, 0);
@@ -88,16 +86,13 @@ asmlinkage long sys_msync(unsigned long start, size_t len, 
int flags)
goto out;
down_read(>mmap_sem);
vma = find_vma(mm, start);
-   } else {
-   if (start >= end) {
-   error = 0;
-   goto out_unlock;
-   }
-   vma = vma->vm_next;
+   con

Re: [PATCH -v5 2/2] Updating ctime and mtime at syncing

2008-01-17 Thread Anton Salikhmetov
2008/1/17, Miklos Szeredi <[EMAIL PROTECTED]>:
> > The do_wp_page() function is called in mm/memory.c after locking PTE.
> > And the file_update_time() routine calls the filesystem operation that can
> > sleep. It's not accepted, I guess.
>
> do_wp_page() is called with the pte lock but drops it, so that's fine.

OK, I agree.

I'll take into account your suggestion to move updating time stamps from
the __set_page_dirty() and __set_page_dirty_nobuffers() routines to
do_wp_page(). Thank you!

>
> Miklos
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH -v5 2/2] Updating ctime and mtime at syncing

2008-01-17 Thread Anton Salikhmetov
2008/1/17, Miklos Szeredi <[EMAIL PROTECTED]>:
> > > I'm not sure this auto-updating is really needed (POSIX doesn't
> > > mandate it).
> >
> > Peter Shtaubach, author of the first solution for this bug,
> > and Jacob Ostergaard, the reporter of this bug, insist the "auto-update"
> > feature to be implemented.
>
> Can they state their reasons for the insistence?
>
> >  1) a base patch: update time just from fsync() and remove_vma()
> >  2) update time on sync(2) as well
> >  3) update time on MS_ASYNC as well
>
> Oh, and the four-liner I posted the other day will give you 1) + 2) +
> even more at a small fraction of the complexity.  And tacking on the
> reprotect code will solve the MS_ASYNC issue just the same.
>
> I agree, that having the timestamp updated on sync() is nice, and that
> trivial patch will give you that, and will also update the timestamp
> at least each 30 seconds if the file is being constantly modified,
> even if no explicit syncing is done.
>
> So maybe it's worth a little effort benchmarking how much that patch
> affects the cost of writing to a page.
>
> You could write a little test program like this (if somebody hasn't
> yet done so):
>
>  - do some preparation:
>
>echo 80 > dirty_ratio
>echo 80 > dirty_background_ratio
>echo 3 > dirty_expire_centisecs
>sync
>
>  - map a large file, one that fits comfortably into free memory
>  - bring the whole file in, by reading a byte from each page
>  - start the timer
>  - write a byte to each page
>  - stop the timer
>
> It would be most interesting to try this on a filesystem supporting
> nanosecond timestamps.  Anyone know which these are?

The do_wp_page() function is called in mm/memory.c after locking PTE.
And the file_update_time() routine calls the filesystem operation that can
sleep. It's not accepted, I guess.

>
> Miklos
> 
>
> Index: linux/mm/memory.c
> ===
> --- linux.orig/mm/memory.c  2008-01-09 21:16:30.0 +0100
> +++ linux/mm/memory.c   2008-01-15 21:16:14.0 +0100
> @@ -1680,6 +1680,8 @@ gotten:
>  unlock:
> pte_unmap_unlock(page_table, ptl);
> if (dirty_page) {
> +   if (vma->vm_file)
> +   file_update_time(vma->vm_file);
> /*
>  * Yes, Virginia, this is actually required to prevent a race
>  * with clear_page_dirty_for_io() from clearing the page dirty
> @@ -2313,6 +2315,8 @@ out_unlocked:
> if (anon)
> page_cache_release(vmf.page);
> else if (dirty_page) {
> +   if (vma->vm_file)
> +   file_update_time(vma->vm_file);
> set_page_dirty_balance(dirty_page, page_mkwrite);
> put_page(dirty_page);
> }
>
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH -v5 2/2] Updating ctime and mtime at syncing

2008-01-17 Thread Anton Salikhmetov
2008/1/17, Miklos Szeredi <[EMAIL PROTECTED]>:
> > 2008/1/17, Miklos Szeredi <[EMAIL PROTECTED]>:
> > > > > 4. Recording the time was the file data changed
> > > > >
> > > > > Finally, I noticed yet another issue with the previous version of my 
> > > > > patch.
> > > > > Specifically, the time stamps were set to the current time of the 
> > > > > moment
> > > > > when syncing but not the write reference was being done. This led to 
> > > > > the
> > > > > following adverse effect on my development system:
> > > > >
> > > > > 1) a text file A was updated by process B;
> > > > > 2) process B exits without calling any of the *sync() functions;
> > > > > 3) vi editor opens the file A;
> > > > > 4) file data synced, file times updated;
> > > > > 5) vi is confused by "thinking" that the file was changed after 3).
> > >
> > > Updating the time in remove_vma() would fix this, no?
> >
> > We need to save modification time. Otherwise, updating time stamps
> > will be confusing the vi editor.
>
> remove_vma() will be called when process B exits, so if the times are
> updated there, and the flag is cleared, the times won't be updated
> later.
>
> > >
> > > > > All these changes to inode.c are unnecessary, I think.
> > > >
> > > > The first part is necessary to account for "remembering" the 
> > > > modification time.
> > > >
> > > > The second part is for handling block device files. I cannot see any 
> > > > other
> > > > sane way to update file times for them.
> > >
> > > Use file_update_time(), which will do the right thing.  It will in
> > > fact do the same thing as write(2) on the device, which is really what
> > > we want.
> > >
> > > Block devices being mapped for write through different device
> > > nodes..., well, I don't think we really need to handle such weird
> > > corner cases 100% acurately.
> >
> > The file_update_time() cannot be used for implementing
> > the "auto-update" feature, because the sync() system call
> > doesn't "know" about the file which was memory-mapped.
>
> I'm not sure this auto-updating is really needed (POSIX doesn't
> mandate it).

Peter Shtaubach, author of the first solution for this bug,
and Jacob Ostergaard, the reporter of this bug, insist the "auto-update"
feature to be implemented.

>
> At least split it out into a separate patch, so it can be considered
> separately on it's own merit.
>
> I think doing the same with the page-table reprotecting in MS_ASYNC is
> also a good idea.  That will leave us with
>
>  1) a base patch: update time just from fsync() and remove_vma()
>  2) update time on sync(2) as well
>  3) update time on MS_ASYNC as well
>
> I'd happily ack the first one, which would solve the most serious
> issues, but have some reservations about the other two.
>
> Miklos
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH -v5 2/2] Updating ctime and mtime at syncing

2008-01-17 Thread Anton Salikhmetov
2008/1/17, Rogier Wolff <[EMAIL PROTECTED]>:
> On Thu, Jan 17, 2008 at 04:16:47PM +0300, Anton Salikhmetov wrote:
> > 2008/1/17, Miklos Szeredi <[EMAIL PROTECTED]>:
> > > > > 4. Recording the time was the file data changed
> > > > >
> > > > > Finally, I noticed yet another issue with the previous version of my 
> > > > > patch.
> > > > > Specifically, the time stamps were set to the current time of the 
> > > > > moment
> > > > > when syncing but not the write reference was being done. This led to 
> > > > > the
> > > > > following adverse effect on my development system:
> > > > >
> > > > > 1) a text file A was updated by process B;
> > > > > 2) process B exits without calling any of the *sync() functions;
> > > > > 3) vi editor opens the file A;
> > > > > 4) file data synced, file times updated;
> > > > > 5) vi is confused by "thinking" that the file was changed after 3).
> > >
> > > Updating the time in remove_vma() would fix this, no?
> >
> > We need to save modification time. Otherwise, updating time stamps
> > will be confusing the vi editor.
>
> If process B exits before vi opens the file, the timestamp should at
> the latest be the time that process B exits. There is no excuse for
> setting the timestamp later than the time that B exits.
>
> If process B no longer modifies the file, but still keeps it mapped
> until after vi starts, then the system can't help the
> situation. Wether or not B acesses those pages is unknown to the
> system. So you get what you deserve.

The exit() system call closes the file memory-mapped file. Therefore,
it's better to catch the close() system call. However, the close() system call
does not trigger unmapping the file. The mapped area for the file may be
used after closing the file. So, we should catch only the unmap() call,
not close() or exit().

>
> Roger.
>
> --
> ** [EMAIL PROTECTED] ** http://www.BitWizard.nl/ ** +31-15-2600998 **
> **Delftechpark 26 2628 XH  Delft, The Netherlands. KVK: 27239233**
> *-- BitWizard writes Linux device drivers for any device you may have! --*
> Q: It doesn't work. A: Look buddy, doesn't work is an ambiguous statement.
> Does it sit on the couch all day? Is it unemployed? Please be specific!
> Define 'it' and what it isn't doing. - Adapted from lxrbot FAQ
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH -v5 2/2] Updating ctime and mtime at syncing

2008-01-17 Thread Anton Salikhmetov
2008/1/17, Miklos Szeredi <[EMAIL PROTECTED]>:
> > > 4. Recording the time was the file data changed
> > >
> > > Finally, I noticed yet another issue with the previous version of my 
> > > patch.
> > > Specifically, the time stamps were set to the current time of the moment
> > > when syncing but not the write reference was being done. This led to the
> > > following adverse effect on my development system:
> > >
> > > 1) a text file A was updated by process B;
> > > 2) process B exits without calling any of the *sync() functions;
> > > 3) vi editor opens the file A;
> > > 4) file data synced, file times updated;
> > > 5) vi is confused by "thinking" that the file was changed after 3).
>
> Updating the time in remove_vma() would fix this, no?

We need to save modification time. Otherwise, updating time stamps
will be confusing the vi editor.

>
> > > All these changes to inode.c are unnecessary, I think.
> >
> > The first part is necessary to account for "remembering" the modification 
> > time.
> >
> > The second part is for handling block device files. I cannot see any other
> > sane way to update file times for them.
>
> Use file_update_time(), which will do the right thing.  It will in
> fact do the same thing as write(2) on the device, which is really what
> we want.
>
> Block devices being mapped for write through different device
> nodes..., well, I don't think we really need to handle such weird
> corner cases 100% acurately.

The file_update_time() cannot be used for implementing
the "auto-update" feature, because the sync() system call
doesn't "know" about the file which was memory-mapped.

>
> Miklos
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH -v5 2/2] Updating ctime and mtime at syncing

2008-01-17 Thread Anton Salikhmetov
2008/1/17, Miklos Szeredi <[EMAIL PROTECTED]>:
> > http://bugzilla.kernel.org/show_bug.cgi?id=2645
> >
> > Changes for updating the ctime and mtime fields for memory-mapped files:
> >
> > 1) a new flag triggering update of the inode data;
> > 2) a new field in the address_space structure for saving modification time;
> > 3) a new helper function to update ctime and mtime when needed;
> > 4) updating time stamps for mapped files in sys_msync() and do_fsync();
> > 5) implementing lazy ctime and mtime update.
>
> OK, the functionality seems to be there now.  As a next step, I think
> you should try to simplify the patch, removing everything, that is not
> strictly necessary.
>
> >
> > Signed-off-by: Anton Salikhmetov <[EMAIL PROTECTED]>
> > ---
> >  fs/buffer.c |3 ++
> >  fs/fs-writeback.c   |2 +
> >  fs/inode.c  |   43 +++--
> >  fs/sync.c   |2 +
> >  include/linux/fs.h  |   13 +-
> >  include/linux/pagemap.h |3 +-
> >  mm/msync.c  |   61 
> > +-
> >  mm/page-writeback.c |   54 ++---
> >  8 files changed, 124 insertions(+), 57 deletions(-)
> >
> > diff --git a/fs/buffer.c b/fs/buffer.c
> > index 7249e01..3967aa7 100644
> > --- a/fs/buffer.c
> > +++ b/fs/buffer.c
> > @@ -701,6 +701,9 @@ static int __set_page_dirty(struct page *page,
> >   if (unlikely(!mapping))
> >   return !TestSetPageDirty(page);
> >
> > + mapping->mtime = CURRENT_TIME;
>
> Why is this needed?  POSIX explicitly states, that the modification
> time can be set to anywhere between the first write and the msync.

I've already described this change in one of my previous emails:

> 4. Recording the time was the file data changed
>
> Finally, I noticed yet another issue with the previous version of my patch.
> Specifically, the time stamps were set to the current time of the moment
> when syncing but not the write reference was being done. This led to the
> following adverse effect on my development system:
>
> 1) a text file A was updated by process B;
> 2) process B exits without calling any of the *sync() functions;
> 3) vi editor opens the file A;
> 4) file data synced, file times updated;
> 5) vi is confused by "thinking" that the file was changed after 3).
>
> This version overcomes this problem by introducing another field into the
> address_space structure. This field is used to "remember" the time of
> dirtying, and then this time value is propagated to the file metadata.
>
> This approach is based upon the following suggestion given by Peter
> Staubach during one of our previous discussions:
>
> http://lkml.org/lkml/2008/1/9/267
>
> > A better architecture would be to arrange for the file times
> > to be updated when the page makes the transition from being
> > unmodified to modified.  This is not straightforward due to
> > the current locking, but should be doable, I think.  Perhaps
> > recording the current time and then using it to update the
> > file times at a more suitable time (no pun intended) might
> > work.
>
> The solution I propose now proves the viability of the latter
> approach.

See:

http://lkml.org/lkml/2008/1/15/202

>
> > + set_bit(AS_MCTIME, >flags);
>
> A bigger problem is that doing this in __set_page_dirty() and friends
> will mean, that the flag will be set for non-mapped writes as well,
> which we definitely don't want.
>
> A better place to put it is do_wp_page and __do_fault, where
> set_page_dirty_balance() is called.

Thanks for your suggestion, I'll consider how I can apply it.

>
> > +
> >   if (TestSetPageDirty(page))
> >   return 0;
> >
> > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > index 300324b..affd291 100644
> > --- a/fs/fs-writeback.c
> > +++ b/fs/fs-writeback.c
> > @@ -243,6 +243,8 @@ __sync_single_inode(struct inode *inode, struct 
> > writeback_control *wbc)
> >
> >   spin_unlock(_lock);
> >
> > + mapping_update_time(mapping);
> > +
>
> I think this is unnecessary.  Rather put the update into remove_vma().

Thanks for your suggestion, I'll consider how I can apply it.

However, I suspect that this is a bit problematic. Let me check it out.

>
> >   ret = do_writepages(mapping, wbc);
> >
> >   /* Don't write the inode if only I_DIRTY_PAGES was set */
> > diff --git a/fs/inode.c b/fs/inode.c
> > index ed35383..edd5bf

Re: [PATCH -v5 1/2] Massive code cleanup of sys_msync()

2008-01-17 Thread Anton Salikhmetov
2008/1/17, Miklos Szeredi <[EMAIL PROTECTED]>:
> > Substantial code cleanup of the sys_msync() function:
> >
> > 1) using the PAGE_ALIGN() macro instead of "manual" alignment;
> > 2) improved readability of the loop traversing the process memory regions.
> >
> > Signed-off-by: Anton Salikhmetov <[EMAIL PROTECTED]>
> > ---
> >  mm/msync.c |   74 
> > +--
> >  1 files changed, 36 insertions(+), 38 deletions(-)
> >
> > diff --git a/mm/msync.c b/mm/msync.c
> > index 144a757..44997bf 100644
> > --- a/mm/msync.c
> > +++ b/mm/msync.c
> > @@ -1,24 +1,22 @@
> >  /*
> > - *   linux/mm/msync.c
> > + * The msync() system call.
> >   *
> > - * Copyright (C) 1994-1999  Linus Torvalds
> > + * Copyright (C) 1994-1999 Linus Torvalds
> > + * Copyright (C) 2008 Anton Salikhmetov <[EMAIL PROTECTED]>
> >   */
> >
> > -/*
> > - * The msync() system call.
> > - */
> > +#include 
> >  #include 
> >  #include 
> >  #include 
> > -#include 
> > -#include 
> >  #include 
> > +#include 
> >
> >  /*
> >   * MS_SYNC syncs the entire file - including mappings.
> >   *
> >   * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
> > - * Nor does it marks the relevant pages dirty (it used to up to 2.6.17).
> > + * Nor does it mark the relevant pages dirty (it used to up to 2.6.17).
> >   * Now it doesn't do anything, since dirty pages are properly tracked.
> >   *
> >   * The application may now run fsync() to
> > @@ -33,8 +31,7 @@ asmlinkage long sys_msync(unsigned long start, size_t 
> > len, int flags)
> >   unsigned long end;
> >   struct mm_struct *mm = current->mm;
> >   struct vm_area_struct *vma;
> > - int unmapped_error = 0;
> > - int error = -EINVAL;
> > + int error = -EINVAL, unmapped_error = 0;
>
> I prefer multi-line variable declarations, especially for ones with an
> initializer.
>
> >
> >   if (flags & ~(MS_ASYNC | MS_INVALIDATE | MS_SYNC))
> >   goto out;
> > @@ -42,62 +39,63 @@ asmlinkage long sys_msync(unsigned long start, size_t 
> > len, int flags)
> >   goto out;
> >   if ((flags & MS_ASYNC) && (flags & MS_SYNC))
> >   goto out;
> > - error = -ENOMEM;
> > - len = (len + ~PAGE_MASK) & PAGE_MASK;
> > +
> > + len = PAGE_ALIGN(len);
> >   end = start + len;
> > - if (end < start)
> > + if (end < start) {
> > + error = -ENOMEM;
>
> The usual style is to have the error assignment outside the
> conditional.  That way is shorter, clearer, as well as possibly
> generating better code.
>
> >   goto out;
> > + }
> > +
> >   error = 0;
> > +
>
> Unnecessary empty line here, these two statements actually belong
> together.
>
> >   if (end == start)
> >   goto out;
> > +
> >   /*
> >* If the interval [start,end) covers some unmapped address ranges,
> >* just ignore them, but return -ENOMEM at the end.
> >*/
> >   down_read(>mmap_sem);
> >   vma = find_vma(mm, start);
> > - for (;;) {
> > + do {
> >   struct file *file;
> >
> > - /* Still start < end. */
> > - error = -ENOMEM;
> > - if (!vma)
> > - goto out_unlock;
> > - /* Here start < vma->vm_end. */
> > + if (!vma) {
> > + error = -ENOMEM;
> > + break;
> > + }
>
> Again, error asignment should be outside the conditional.  This of
> course means, you'll have to set the error back to zero at the end of
> the loop, but that's fine.
>
> >   if (start < vma->vm_start) {
> >   start = vma->vm_start;
> > - if (start >= end)
> > - goto out_unlock;
> > + if (start >= end) {
> > + error = -ENOMEM;
> > + break;
> > + }
>
> Ditto.
>
> >   unmapped_error = -ENOMEM;
> >   }
> > - /* Here vma->vm_start <= start < vma->vm_end. */
> > - if ((flags &a

Re: [PATCH -v5 1/2] Massive code cleanup of sys_msync()

2008-01-17 Thread Anton Salikhmetov
2008/1/17, Miklos Szeredi [EMAIL PROTECTED]:
  Substantial code cleanup of the sys_msync() function:
 
  1) using the PAGE_ALIGN() macro instead of manual alignment;
  2) improved readability of the loop traversing the process memory regions.
 
  Signed-off-by: Anton Salikhmetov [EMAIL PROTECTED]
  ---
   mm/msync.c |   74 
  +--
   1 files changed, 36 insertions(+), 38 deletions(-)
 
  diff --git a/mm/msync.c b/mm/msync.c
  index 144a757..44997bf 100644
  --- a/mm/msync.c
  +++ b/mm/msync.c
  @@ -1,24 +1,22 @@
   /*
  - *   linux/mm/msync.c
  + * The msync() system call.
*
  - * Copyright (C) 1994-1999  Linus Torvalds
  + * Copyright (C) 1994-1999 Linus Torvalds
  + * Copyright (C) 2008 Anton Salikhmetov [EMAIL PROTECTED]
*/
 
  -/*
  - * The msync() system call.
  - */
  +#include linux/file.h
   #include linux/fs.h
   #include linux/mm.h
   #include linux/mman.h
  -#include linux/file.h
  -#include linux/syscalls.h
   #include linux/sched.h
  +#include linux/syscalls.h
 
   /*
* MS_SYNC syncs the entire file - including mappings.
*
* MS_ASYNC does not start I/O (it used to, up to 2.5.67).
  - * Nor does it marks the relevant pages dirty (it used to up to 2.6.17).
  + * Nor does it mark the relevant pages dirty (it used to up to 2.6.17).
* Now it doesn't do anything, since dirty pages are properly tracked.
*
* The application may now run fsync() to
  @@ -33,8 +31,7 @@ asmlinkage long sys_msync(unsigned long start, size_t 
  len, int flags)
unsigned long end;
struct mm_struct *mm = current-mm;
struct vm_area_struct *vma;
  - int unmapped_error = 0;
  - int error = -EINVAL;
  + int error = -EINVAL, unmapped_error = 0;

 I prefer multi-line variable declarations, especially for ones with an
 initializer.

 
if (flags  ~(MS_ASYNC | MS_INVALIDATE | MS_SYNC))
goto out;
  @@ -42,62 +39,63 @@ asmlinkage long sys_msync(unsigned long start, size_t 
  len, int flags)
goto out;
if ((flags  MS_ASYNC)  (flags  MS_SYNC))
goto out;
  - error = -ENOMEM;
  - len = (len + ~PAGE_MASK)  PAGE_MASK;
  +
  + len = PAGE_ALIGN(len);
end = start + len;
  - if (end  start)
  + if (end  start) {
  + error = -ENOMEM;

 The usual style is to have the error assignment outside the
 conditional.  That way is shorter, clearer, as well as possibly
 generating better code.

goto out;
  + }
  +
error = 0;
  +

 Unnecessary empty line here, these two statements actually belong
 together.

if (end == start)
goto out;
  +
/*
 * If the interval [start,end) covers some unmapped address ranges,
 * just ignore them, but return -ENOMEM at the end.
 */
down_read(mm-mmap_sem);
vma = find_vma(mm, start);
  - for (;;) {
  + do {
struct file *file;
 
  - /* Still start  end. */
  - error = -ENOMEM;
  - if (!vma)
  - goto out_unlock;
  - /* Here start  vma-vm_end. */
  + if (!vma) {
  + error = -ENOMEM;
  + break;
  + }

 Again, error asignment should be outside the conditional.  This of
 course means, you'll have to set the error back to zero at the end of
 the loop, but that's fine.

if (start  vma-vm_start) {
start = vma-vm_start;
  - if (start = end)
  - goto out_unlock;
  + if (start = end) {
  + error = -ENOMEM;
  + break;
  + }

 Ditto.

unmapped_error = -ENOMEM;
}
  - /* Here vma-vm_start = start  vma-vm_end. */
  - if ((flags  MS_INVALIDATE) 
  - (vma-vm_flags  VM_LOCKED)) {
  + if ((flags  MS_INVALIDATE)  (vma-vm_flags  VM_LOCKED)) {
error = -EBUSY;

 Ditto2.

  - goto out_unlock;
  + break;
}
  - file = vma-vm_file;
start = vma-vm_end;
  - if ((flags  MS_SYNC)  file 
  - (vma-vm_flags  VM_SHARED)) {
  +
  + file = vma-vm_file;
  + if (file  (vma-vm_flags  VM_SHARED)  (flags  MS_SYNC)) 
  {
get_file(file);
up_read(mm-mmap_sem);
error = do_fsync(file, 0);
fput(file);
  - if (error || start = end)
  + if (error)

 This simplifies, but also does unnecessary down/find_vma/up.

goto out;
down_read(mm-mmap_sem

Re: [PATCH -v5 2/2] Updating ctime and mtime at syncing

2008-01-17 Thread Anton Salikhmetov
2008/1/17, Miklos Szeredi [EMAIL PROTECTED]:
  http://bugzilla.kernel.org/show_bug.cgi?id=2645
 
  Changes for updating the ctime and mtime fields for memory-mapped files:
 
  1) a new flag triggering update of the inode data;
  2) a new field in the address_space structure for saving modification time;
  3) a new helper function to update ctime and mtime when needed;
  4) updating time stamps for mapped files in sys_msync() and do_fsync();
  5) implementing lazy ctime and mtime update.

 OK, the functionality seems to be there now.  As a next step, I think
 you should try to simplify the patch, removing everything, that is not
 strictly necessary.

 
  Signed-off-by: Anton Salikhmetov [EMAIL PROTECTED]
  ---
   fs/buffer.c |3 ++
   fs/fs-writeback.c   |2 +
   fs/inode.c  |   43 +++--
   fs/sync.c   |2 +
   include/linux/fs.h  |   13 +-
   include/linux/pagemap.h |3 +-
   mm/msync.c  |   61 
  +-
   mm/page-writeback.c |   54 ++---
   8 files changed, 124 insertions(+), 57 deletions(-)
 
  diff --git a/fs/buffer.c b/fs/buffer.c
  index 7249e01..3967aa7 100644
  --- a/fs/buffer.c
  +++ b/fs/buffer.c
  @@ -701,6 +701,9 @@ static int __set_page_dirty(struct page *page,
if (unlikely(!mapping))
return !TestSetPageDirty(page);
 
  + mapping-mtime = CURRENT_TIME;

 Why is this needed?  POSIX explicitly states, that the modification
 time can be set to anywhere between the first write and the msync.

I've already described this change in one of my previous emails:

 4. Recording the time was the file data changed

 Finally, I noticed yet another issue with the previous version of my patch.
 Specifically, the time stamps were set to the current time of the moment
 when syncing but not the write reference was being done. This led to the
 following adverse effect on my development system:

 1) a text file A was updated by process B;
 2) process B exits without calling any of the *sync() functions;
 3) vi editor opens the file A;
 4) file data synced, file times updated;
 5) vi is confused by thinking that the file was changed after 3).

 This version overcomes this problem by introducing another field into the
 address_space structure. This field is used to remember the time of
 dirtying, and then this time value is propagated to the file metadata.

 This approach is based upon the following suggestion given by Peter
 Staubach during one of our previous discussions:

 http://lkml.org/lkml/2008/1/9/267

  A better architecture would be to arrange for the file times
  to be updated when the page makes the transition from being
  unmodified to modified.  This is not straightforward due to
  the current locking, but should be doable, I think.  Perhaps
  recording the current time and then using it to update the
  file times at a more suitable time (no pun intended) might
  work.

 The solution I propose now proves the viability of the latter
 approach.

See:

http://lkml.org/lkml/2008/1/15/202


  + set_bit(AS_MCTIME, mapping-flags);

 A bigger problem is that doing this in __set_page_dirty() and friends
 will mean, that the flag will be set for non-mapped writes as well,
 which we definitely don't want.

 A better place to put it is do_wp_page and __do_fault, where
 set_page_dirty_balance() is called.

Thanks for your suggestion, I'll consider how I can apply it.


  +
if (TestSetPageDirty(page))
return 0;
 
  diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
  index 300324b..affd291 100644
  --- a/fs/fs-writeback.c
  +++ b/fs/fs-writeback.c
  @@ -243,6 +243,8 @@ __sync_single_inode(struct inode *inode, struct 
  writeback_control *wbc)
 
spin_unlock(inode_lock);
 
  + mapping_update_time(mapping);
  +

 I think this is unnecessary.  Rather put the update into remove_vma().

Thanks for your suggestion, I'll consider how I can apply it.

However, I suspect that this is a bit problematic. Let me check it out.


ret = do_writepages(mapping, wbc);
 
/* Don't write the inode if only I_DIRTY_PAGES was set */
  diff --git a/fs/inode.c b/fs/inode.c
  index ed35383..edd5bf4 100644
  --- a/fs/inode.c
  +++ b/fs/inode.c
  @@ -1243,8 +1243,10 @@ void touch_atime(struct vfsmount *mnt, struct dentry 
  *dentry)
   EXPORT_SYMBOL(touch_atime);
 
   /**
  - *   file_update_time-   update mtime and ctime time
  - *   @file: file accessed
  + *   inode_update_time   -   update mtime and ctime time
  + *   @inode: inode accessed
  + *   @ts: time when inode was accessed
  + *   @sync: whether to do synchronous update
*
*   Update the mtime and ctime members of an inode and mark the inode
*   for writeback.  Note that this function is meant exclusively for
  @@ -1253,11 +1255,8 @@ EXPORT_SYMBOL(touch_atime);
*   S_NOCTIME inode flag

Re: [PATCH -v5 2/2] Updating ctime and mtime at syncing

2008-01-17 Thread Anton Salikhmetov
2008/1/17, Miklos Szeredi [EMAIL PROTECTED]:
  2008/1/17, Miklos Szeredi [EMAIL PROTECTED]:
 4. Recording the time was the file data changed

 Finally, I noticed yet another issue with the previous version of my 
 patch.
 Specifically, the time stamps were set to the current time of the 
 moment
 when syncing but not the write reference was being done. This led to 
 the
 following adverse effect on my development system:

 1) a text file A was updated by process B;
 2) process B exits without calling any of the *sync() functions;
 3) vi editor opens the file A;
 4) file data synced, file times updated;
 5) vi is confused by thinking that the file was changed after 3).
  
   Updating the time in remove_vma() would fix this, no?
 
  We need to save modification time. Otherwise, updating time stamps
  will be confusing the vi editor.

 remove_vma() will be called when process B exits, so if the times are
 updated there, and the flag is cleared, the times won't be updated
 later.

  
 All these changes to inode.c are unnecessary, I think.
   
The first part is necessary to account for remembering the 
modification time.
   
The second part is for handling block device files. I cannot see any 
other
sane way to update file times for them.
  
   Use file_update_time(), which will do the right thing.  It will in
   fact do the same thing as write(2) on the device, which is really what
   we want.
  
   Block devices being mapped for write through different device
   nodes..., well, I don't think we really need to handle such weird
   corner cases 100% acurately.
 
  The file_update_time() cannot be used for implementing
  the auto-update feature, because the sync() system call
  doesn't know about the file which was memory-mapped.

 I'm not sure this auto-updating is really needed (POSIX doesn't
 mandate it).

Peter Shtaubach, author of the first solution for this bug,
and Jacob Ostergaard, the reporter of this bug, insist the auto-update
feature to be implemented.


 At least split it out into a separate patch, so it can be considered
 separately on it's own merit.

 I think doing the same with the page-table reprotecting in MS_ASYNC is
 also a good idea.  That will leave us with

  1) a base patch: update time just from fsync() and remove_vma()
  2) update time on sync(2) as well
  3) update time on MS_ASYNC as well

 I'd happily ack the first one, which would solve the most serious
 issues, but have some reservations about the other two.

 Miklos

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH -v5 2/2] Updating ctime and mtime at syncing

2008-01-17 Thread Anton Salikhmetov
2008/1/17, Miklos Szeredi [EMAIL PROTECTED]:
   4. Recording the time was the file data changed
  
   Finally, I noticed yet another issue with the previous version of my 
   patch.
   Specifically, the time stamps were set to the current time of the moment
   when syncing but not the write reference was being done. This led to the
   following adverse effect on my development system:
  
   1) a text file A was updated by process B;
   2) process B exits without calling any of the *sync() functions;
   3) vi editor opens the file A;
   4) file data synced, file times updated;
   5) vi is confused by thinking that the file was changed after 3).

 Updating the time in remove_vma() would fix this, no?

We need to save modification time. Otherwise, updating time stamps
will be confusing the vi editor.


   All these changes to inode.c are unnecessary, I think.
 
  The first part is necessary to account for remembering the modification 
  time.
 
  The second part is for handling block device files. I cannot see any other
  sane way to update file times for them.

 Use file_update_time(), which will do the right thing.  It will in
 fact do the same thing as write(2) on the device, which is really what
 we want.

 Block devices being mapped for write through different device
 nodes..., well, I don't think we really need to handle such weird
 corner cases 100% acurately.

The file_update_time() cannot be used for implementing
the auto-update feature, because the sync() system call
doesn't know about the file which was memory-mapped.


 Miklos

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH -v5 2/2] Updating ctime and mtime at syncing

2008-01-17 Thread Anton Salikhmetov
2008/1/17, Miklos Szeredi [EMAIL PROTECTED]:
   I'm not sure this auto-updating is really needed (POSIX doesn't
   mandate it).
 
  Peter Shtaubach, author of the first solution for this bug,
  and Jacob Ostergaard, the reporter of this bug, insist the auto-update
  feature to be implemented.

 Can they state their reasons for the insistence?

   1) a base patch: update time just from fsync() and remove_vma()
   2) update time on sync(2) as well
   3) update time on MS_ASYNC as well

 Oh, and the four-liner I posted the other day will give you 1) + 2) +
 even more at a small fraction of the complexity.  And tacking on the
 reprotect code will solve the MS_ASYNC issue just the same.

 I agree, that having the timestamp updated on sync() is nice, and that
 trivial patch will give you that, and will also update the timestamp
 at least each 30 seconds if the file is being constantly modified,
 even if no explicit syncing is done.

 So maybe it's worth a little effort benchmarking how much that patch
 affects the cost of writing to a page.

 You could write a little test program like this (if somebody hasn't
 yet done so):

  - do some preparation:

echo 80  dirty_ratio
echo 80  dirty_background_ratio
echo 3  dirty_expire_centisecs
sync

  - map a large file, one that fits comfortably into free memory
  - bring the whole file in, by reading a byte from each page
  - start the timer
  - write a byte to each page
  - stop the timer

 It would be most interesting to try this on a filesystem supporting
 nanosecond timestamps.  Anyone know which these are?

The do_wp_page() function is called in mm/memory.c after locking PTE.
And the file_update_time() routine calls the filesystem operation that can
sleep. It's not accepted, I guess.


 Miklos
 

 Index: linux/mm/memory.c
 ===
 --- linux.orig/mm/memory.c  2008-01-09 21:16:30.0 +0100
 +++ linux/mm/memory.c   2008-01-15 21:16:14.0 +0100
 @@ -1680,6 +1680,8 @@ gotten:
  unlock:
 pte_unmap_unlock(page_table, ptl);
 if (dirty_page) {
 +   if (vma-vm_file)
 +   file_update_time(vma-vm_file);
 /*
  * Yes, Virginia, this is actually required to prevent a race
  * with clear_page_dirty_for_io() from clearing the page dirty
 @@ -2313,6 +2315,8 @@ out_unlocked:
 if (anon)
 page_cache_release(vmf.page);
 else if (dirty_page) {
 +   if (vma-vm_file)
 +   file_update_time(vma-vm_file);
 set_page_dirty_balance(dirty_page, page_mkwrite);
 put_page(dirty_page);
 }



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH -v5 2/2] Updating ctime and mtime at syncing

2008-01-17 Thread Anton Salikhmetov
2008/1/17, Rogier Wolff [EMAIL PROTECTED]:
 On Thu, Jan 17, 2008 at 04:16:47PM +0300, Anton Salikhmetov wrote:
  2008/1/17, Miklos Szeredi [EMAIL PROTECTED]:
 4. Recording the time was the file data changed

 Finally, I noticed yet another issue with the previous version of my 
 patch.
 Specifically, the time stamps were set to the current time of the 
 moment
 when syncing but not the write reference was being done. This led to 
 the
 following adverse effect on my development system:

 1) a text file A was updated by process B;
 2) process B exits without calling any of the *sync() functions;
 3) vi editor opens the file A;
 4) file data synced, file times updated;
 5) vi is confused by thinking that the file was changed after 3).
  
   Updating the time in remove_vma() would fix this, no?
 
  We need to save modification time. Otherwise, updating time stamps
  will be confusing the vi editor.

 If process B exits before vi opens the file, the timestamp should at
 the latest be the time that process B exits. There is no excuse for
 setting the timestamp later than the time that B exits.

 If process B no longer modifies the file, but still keeps it mapped
 until after vi starts, then the system can't help the
 situation. Wether or not B acesses those pages is unknown to the
 system. So you get what you deserve.

The exit() system call closes the file memory-mapped file. Therefore,
it's better to catch the close() system call. However, the close() system call
does not trigger unmapping the file. The mapped area for the file may be
used after closing the file. So, we should catch only the unmap() call,
not close() or exit().


 Roger.

 --
 ** [EMAIL PROTECTED] ** http://www.BitWizard.nl/ ** +31-15-2600998 **
 **Delftechpark 26 2628 XH  Delft, The Netherlands. KVK: 27239233**
 *-- BitWizard writes Linux device drivers for any device you may have! --*
 Q: It doesn't work. A: Look buddy, doesn't work is an ambiguous statement.
 Does it sit on the couch all day? Is it unemployed? Please be specific!
 Define 'it' and what it isn't doing. - Adapted from lxrbot FAQ

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH -v5 2/2] Updating ctime and mtime at syncing

2008-01-17 Thread Anton Salikhmetov
2008/1/17, Miklos Szeredi [EMAIL PROTECTED]:
  The do_wp_page() function is called in mm/memory.c after locking PTE.
  And the file_update_time() routine calls the filesystem operation that can
  sleep. It's not accepted, I guess.

 do_wp_page() is called with the pte lock but drops it, so that's fine.

OK, I agree.

I'll take into account your suggestion to move updating time stamps from
the __set_page_dirty() and __set_page_dirty_nobuffers() routines to
do_wp_page(). Thank you!


 Miklos

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH -v6 0/2] Fixing the issue with memory-mapped file times

2008-01-17 Thread Anton Salikhmetov
This is the sixth version of my solution for the bug #2645:

http://bugzilla.kernel.org/show_bug.cgi?id=2645

New since the previous version:

1) a few cosmetic changes according to the latest feedback
   for the cleanup patch;

2) implementation of the following suggestion by Miklos Szeredi:

http://lkml.org/lkml/2008/1/17/158

These changes were tested as explained below. Please note that all
tests were performed with all recommended kernel debug options
enabled. Also note that the tests were performed on regular files
residing on both an ext3 partition and a tmpfs filesystem. I also
checked the block device case, which worked for me as well.

1. My own unit test:

http://bugzilla.kernel.org/attachment.cgi?id=14430

Result: all test cases passed successfully.

2. Unit test provided by Miklos Szeredi:

http://lkml.org/lkml/2008/1/14/104

Result: this test produced the following output:

debian-64:~# ./miklos_test test
begin   1200598736  1200598736  1200598617
write   1200598737  1200598737  1200598617
mmap1200598737  1200598737  1200598738
b   1200598739  1200598739  1200598738
msync b 1200598739  1200598739  1200598738
c   1200598741  1200598741  1200598738
msync c 1200598741  1200598741  1200598738
d   1200598743  1200598743  1200598738
munmap  1200598743  1200598743  1200598738
close   1200598743  1200598743  1200598738
sync1200598743  1200598743  1200598738
debian-64:~#

3. Regression tests were performed using the following test cases from
the LTP test suite:

msync01
msync02
msync03
msync04
msync05
mmapstress01
mmapstress09
mmapstress10

Result: no regressions were found while running these test cases.

4. Performance test was done using the program available from the
following link:

http://bugzilla.kernel.org/attachment.cgi?id=14493

Result: the impact of the changes was negligible for files of a few
hundred megabytes.

I wonder if these changes can be applied now.

These patches are the result of many fruitful discussions with
Miklos Szeredi, Peter Zijlstra, Rik van Riel, Peter Staubach,
and Jacob Oestergaard. I am grateful to you all for your support
during the days I was working on this long-standing nasty bug.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH -v6 1/2] Massive code cleanup of sys_msync()

2008-01-17 Thread Anton Salikhmetov
Using the PAGE_ALIGN() macro instead of manual alignment and
improved readability of the loop traversing the process memory regions.

Signed-off-by: Anton Salikhmetov [EMAIL PROTECTED]
---
 mm/msync.c |   77 
 1 files changed, 36 insertions(+), 41 deletions(-)

diff --git a/mm/msync.c b/mm/msync.c
index 144a757..a4de868 100644
--- a/mm/msync.c
+++ b/mm/msync.c
@@ -1,85 +1,83 @@
 /*
- * linux/mm/msync.c
+ * The msync() system call.
  *
- * Copyright (C) 1994-1999  Linus Torvalds
+ * Copyright (C) 1994-1999 Linus Torvalds
+ * Copyright (C) 2008 Anton Salikhmetov [EMAIL PROTECTED]
  */
 
-/*
- * The msync() system call.
- */
+#include linux/file.h
 #include linux/fs.h
 #include linux/mm.h
 #include linux/mman.h
-#include linux/file.h
-#include linux/syscalls.h
 #include linux/sched.h
+#include linux/syscalls.h
 
 /*
  * MS_SYNC syncs the entire file - including mappings.
  *
  * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
- * Nor does it marks the relevant pages dirty (it used to up to 2.6.17).
+ * Nor does it mark the relevant pages dirty (it used to up to 2.6.17).
  * Now it doesn't do anything, since dirty pages are properly tracked.
  *
- * The application may now run fsync() to
- * write out the dirty pages and wait on the writeout and check the result.
- * Or the application may run fadvise(FADV_DONTNEED) against the fd to start
- * async writeout immediately.
+ * The application may now run fsync() to write out the dirty pages and
+ * wait on the writeout and check the result. Or the application may run
+ * fadvise(FADV_DONTNEED) against the fd to start async writeout immediately.
  * So by _not_ starting I/O in MS_ASYNC we provide complete flexibility to
  * applications.
  */
 asmlinkage long sys_msync(unsigned long start, size_t len, int flags)
 {
unsigned long end;
-   struct mm_struct *mm = current-mm;
+   int error, unmapped_error;
struct vm_area_struct *vma;
-   int unmapped_error = 0;
-   int error = -EINVAL;
+   struct mm_struct *mm;
 
+   error = -EINVAL;
if (flags  ~(MS_ASYNC | MS_INVALIDATE | MS_SYNC))
goto out;
if (start  ~PAGE_MASK)
goto out;
if ((flags  MS_ASYNC)  (flags  MS_SYNC))
goto out;
+
error = -ENOMEM;
-   len = (len + ~PAGE_MASK)  PAGE_MASK;
+   len = PAGE_ALIGN(len);
end = start + len;
if (end  start)
goto out;
-   error = 0;
+
+   error = unmapped_error = 0;
if (end == start)
goto out;
+
/*
 * If the interval [start,end) covers some unmapped address ranges,
 * just ignore them, but return -ENOMEM at the end.
 */
+   mm = current-mm;
down_read(mm-mmap_sem);
vma = find_vma(mm, start);
-   for (;;) {
+   do {
struct file *file;
 
-   /* Still start  end. */
error = -ENOMEM;
if (!vma)
-   goto out_unlock;
-   /* Here start  vma-vm_end. */
+   break;
if (start  vma-vm_start) {
start = vma-vm_start;
if (start = end)
-   goto out_unlock;
-   unmapped_error = -ENOMEM;
-   }
-   /* Here vma-vm_start = start  vma-vm_end. */
-   if ((flags  MS_INVALIDATE) 
-   (vma-vm_flags  VM_LOCKED)) {
-   error = -EBUSY;
-   goto out_unlock;
+   break;
+   unmapped_error = error;
}
-   file = vma-vm_file;
+
+   error = -EBUSY;
+   if ((flags  MS_INVALIDATE)  (vma-vm_flags  VM_LOCKED))
+   break;
+
+   error = 0;
start = vma-vm_end;
-   if ((flags  MS_SYNC)  file 
-   (vma-vm_flags  VM_SHARED)) {
+   file = vma-vm_file;
+   if (file  (vma-vm_flags  VM_SHARED)  (flags  MS_SYNC)) {
get_file(file);
up_read(mm-mmap_sem);
error = do_fsync(file, 0);
@@ -88,16 +86,13 @@ asmlinkage long sys_msync(unsigned long start, size_t len, 
int flags)
goto out;
down_read(mm-mmap_sem);
vma = find_vma(mm, start);
-   } else {
-   if (start = end) {
-   error = 0;
-   goto out_unlock;
-   }
-   vma = vma-vm_next;
+   continue;
}
-   }
-out_unlock:
+
+   vma = vma-vm_next;
+   } while (start  end);
up_read(mm-mmap_sem);
+
 out

[PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008-01-17 Thread Anton Salikhmetov
Updating file times at write references to memory-mapped files and
forcing file times update at the next write reference after
calling the msync() system call with the MS_ASYNC flag.

Signed-off-by: Anton Salikhmetov [EMAIL PROTECTED]
---
 mm/memory.c |6 ++
 mm/msync.c  |   52 +++-
 2 files changed, 45 insertions(+), 13 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 4bf0b6d..13d5bbf 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1668,6 +1668,9 @@ gotten:
 unlock:
pte_unmap_unlock(page_table, ptl);
if (dirty_page) {
+   if (vma-vm_file)
+   file_update_time(vma-vm_file);
+
/*
 * Yes, Virginia, this is actually required to prevent a race
 * with clear_page_dirty_for_io() from clearing the page dirty
@@ -2341,6 +2344,9 @@ out_unlocked:
if (anon)
page_cache_release(vmf.page);
else if (dirty_page) {
+   if (vma-vm_file)
+   file_update_time(vma-vm_file);
+
set_page_dirty_balance(dirty_page, page_mkwrite);
put_page(dirty_page);
}
diff --git a/mm/msync.c b/mm/msync.c
index a4de868..a49af28 100644
--- a/mm/msync.c
+++ b/mm/msync.c
@@ -13,11 +13,33 @@
 #include linux/syscalls.h
 
 /*
+ * Scan the PTEs for pages belonging to the VMA and mark them read-only.
+ * It will force a pagefault on the next write access.
+ */
+static void vma_wrprotect(struct vm_area_struct *vma)
+{
+   unsigned long addr;
+
+   for (addr = vma-vm_start; addr  vma-vm_end; addr += PAGE_SIZE) {
+   spinlock_t *ptl;
+   pgd_t *pgd = pgd_offset(vma-vm_mm, addr);
+   pud_t *pud = pud_offset(pgd, addr);
+   pmd_t *pmd = pmd_offset(pud, addr);
+   pte_t *pte = pte_offset_map_lock(vma-vm_mm, pmd, addr, ptl);
+
+   if (pte_dirty(*pte)  pte_write(*pte))
+   *pte = pte_wrprotect(*pte);
+   pte_unmap_unlock(pte, ptl);
+   }
+}
+
+/*
  * MS_SYNC syncs the entire file - including mappings.
  *
- * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
- * Nor does it mark the relevant pages dirty (it used to up to 2.6.17).
- * Now it doesn't do anything, since dirty pages are properly tracked.
+ * MS_ASYNC does not start I/O. Instead, it marks the relevant pages
+ * read-only by calling vma_wrprotect(). This is needed to catch the next
+ * write reference to the mapped region and update the file times
+ * accordingly.
  *
  * The application may now run fsync() to write out the dirty pages and
  * wait on the writeout and check the result. Or the application may run
@@ -77,16 +99,20 @@ asmlinkage long sys_msync(unsigned long start, size_t len, 
int flags)
error = 0;
start = vma-vm_end;
file = vma-vm_file;
-   if (file  (vma-vm_flags  VM_SHARED)  (flags  MS_SYNC)) {
-   get_file(file);
-   up_read(mm-mmap_sem);
-   error = do_fsync(file, 0);
-   fput(file);
-   if (error || start = end)
-   goto out;
-   down_read(mm-mmap_sem);
-   vma = find_vma(mm, start);
-   continue;
+   if (file  (vma-vm_flags  VM_SHARED)) {
+   if (flags  MS_ASYNC)
+   vma_wrprotect(vma);
+   if (flags  MS_SYNC) {
+   get_file(file);
+   up_read(mm-mmap_sem);
+   error = do_fsync(file, 0);
+   fput(file);
+   if (error || start = end)
+   goto out;
+   down_read(mm-mmap_sem);
+   vma = find_vma(mm, start);
+   continue;
+   }
}
 
vma = vma-vm_next;
-- 
1.4.4.4

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH -v5 2/2] Updating ctime and mtime at syncing

2008-01-16 Thread Anton Salikhmetov
http://bugzilla.kernel.org/show_bug.cgi?id=2645

Changes for updating the ctime and mtime fields for memory-mapped files:

1) a new flag triggering update of the inode data;
2) a new field in the address_space structure for saving modification time;
3) a new helper function to update ctime and mtime when needed;
4) updating time stamps for mapped files in sys_msync() and do_fsync();
5) implementing lazy ctime and mtime update.

Signed-off-by: Anton Salikhmetov <[EMAIL PROTECTED]>
---
 fs/buffer.c |3 ++
 fs/fs-writeback.c   |2 +
 fs/inode.c  |   43 +++--
 fs/sync.c   |2 +
 include/linux/fs.h  |   13 +-
 include/linux/pagemap.h |3 +-
 mm/msync.c  |   61 +-
 mm/page-writeback.c |   54 ++---
 8 files changed, 124 insertions(+), 57 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 7249e01..3967aa7 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -701,6 +701,9 @@ static int __set_page_dirty(struct page *page,
if (unlikely(!mapping))
return !TestSetPageDirty(page);
 
+   mapping->mtime = CURRENT_TIME;
+   set_bit(AS_MCTIME, >flags);
+
if (TestSetPageDirty(page))
return 0;
 
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 300324b..affd291 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -243,6 +243,8 @@ __sync_single_inode(struct inode *inode, struct 
writeback_control *wbc)
 
spin_unlock(_lock);
 
+   mapping_update_time(mapping);
+
ret = do_writepages(mapping, wbc);
 
/* Don't write the inode if only I_DIRTY_PAGES was set */
diff --git a/fs/inode.c b/fs/inode.c
index ed35383..edd5bf4 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1243,8 +1243,10 @@ void touch_atime(struct vfsmount *mnt, struct dentry 
*dentry)
 EXPORT_SYMBOL(touch_atime);
 
 /**
- * file_update_time-   update mtime and ctime time
- * @file: file accessed
+ * inode_update_time   -   update mtime and ctime time
+ * @inode: inode accessed
+ * @ts: time when inode was accessed
+ * @sync: whether to do synchronous update
  *
  * Update the mtime and ctime members of an inode and mark the inode
  * for writeback.  Note that this function is meant exclusively for
@@ -1253,11 +1255,8 @@ EXPORT_SYMBOL(touch_atime);
  * S_NOCTIME inode flag, e.g. for network filesystem where these
  * timestamps are handled by the server.
  */
-
-void file_update_time(struct file *file)
+void inode_update_time(struct inode *inode, struct timespec *ts)
 {
-   struct inode *inode = file->f_path.dentry->d_inode;
-   struct timespec now;
int sync_it = 0;
 
if (IS_NOCMTIME(inode))
@@ -1265,22 +1264,41 @@ void file_update_time(struct file *file)
if (IS_RDONLY(inode))
return;
 
-   now = current_fs_time(inode->i_sb);
-   if (!timespec_equal(>i_mtime, )) {
-   inode->i_mtime = now;
+   if (timespec_compare(>i_mtime, ts) < 0) {
+   inode->i_mtime = *ts;
sync_it = 1;
}
 
-   if (!timespec_equal(>i_ctime, )) {
-   inode->i_ctime = now;
+   if (timespec_compare(>i_ctime, ts) < 0) {
+   inode->i_ctime = *ts;
sync_it = 1;
}
 
if (sync_it)
mark_inode_dirty_sync(inode);
 }
+EXPORT_SYMBOL(inode_update_time);
 
-EXPORT_SYMBOL(file_update_time);
+/*
+ * Update the ctime and mtime stamps after checking if they are to be updated.
+ */
+void mapping_update_time(struct address_space *mapping)
+{
+   if (test_and_clear_bit(AS_MCTIME, >flags)) {
+   struct inode *inode = mapping->host;
+   struct timespec *ts = >mtime;
+
+   if (S_ISBLK(inode->i_mode)) {
+   struct block_device *bdev = inode->i_bdev;
+
+   mutex_lock(>bd_mutex);
+   list_for_each_entry(inode, >bd_inodes, i_devices)
+   inode_update_time(inode, ts);
+   mutex_unlock(>bd_mutex);
+   } else
+   inode_update_time(inode, ts);
+   }
+}
 
 int inode_needs_sync(struct inode *inode)
 {
@@ -1290,7 +1308,6 @@ int inode_needs_sync(struct inode *inode)
return 1;
return 0;
 }
-
 EXPORT_SYMBOL(inode_needs_sync);
 
 int inode_wait(void *word)
diff --git a/fs/sync.c b/fs/sync.c
index 7cd005e..5561464 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -87,6 +87,8 @@ long do_fsync(struct file *file, int datasync)
goto out;
}
 
+   mapping_update_time(mapping);
+
ret = filemap_fdatawrite(mapping);
 
/*
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b3ec4a4..f0d3ced 100644
--- a/include/l

[PATCH -v5 1/2] Massive code cleanup of sys_msync()

2008-01-16 Thread Anton Salikhmetov
Substantial code cleanup of the sys_msync() function:

1) using the PAGE_ALIGN() macro instead of "manual" alignment;
2) improved readability of the loop traversing the process memory regions.

Signed-off-by: Anton Salikhmetov <[EMAIL PROTECTED]>
---
 mm/msync.c |   74 +--
 1 files changed, 36 insertions(+), 38 deletions(-)

diff --git a/mm/msync.c b/mm/msync.c
index 144a757..44997bf 100644
--- a/mm/msync.c
+++ b/mm/msync.c
@@ -1,24 +1,22 @@
 /*
- * linux/mm/msync.c
+ * The msync() system call.
  *
- * Copyright (C) 1994-1999  Linus Torvalds
+ * Copyright (C) 1994-1999 Linus Torvalds
+ * Copyright (C) 2008 Anton Salikhmetov <[EMAIL PROTECTED]>
  */
 
-/*
- * The msync() system call.
- */
+#include 
 #include 
 #include 
 #include 
-#include 
-#include 
 #include 
+#include 
 
 /*
  * MS_SYNC syncs the entire file - including mappings.
  *
  * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
- * Nor does it marks the relevant pages dirty (it used to up to 2.6.17).
+ * Nor does it mark the relevant pages dirty (it used to up to 2.6.17).
  * Now it doesn't do anything, since dirty pages are properly tracked.
  *
  * The application may now run fsync() to
@@ -33,8 +31,7 @@ asmlinkage long sys_msync(unsigned long start, size_t len, 
int flags)
unsigned long end;
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma;
-   int unmapped_error = 0;
-   int error = -EINVAL;
+   int error = -EINVAL, unmapped_error = 0;
 
if (flags & ~(MS_ASYNC | MS_INVALIDATE | MS_SYNC))
goto out;
@@ -42,62 +39,63 @@ asmlinkage long sys_msync(unsigned long start, size_t len, 
int flags)
goto out;
if ((flags & MS_ASYNC) && (flags & MS_SYNC))
goto out;
-   error = -ENOMEM;
-   len = (len + ~PAGE_MASK) & PAGE_MASK;
+
+   len = PAGE_ALIGN(len);
end = start + len;
-   if (end < start)
+   if (end < start) {
+   error = -ENOMEM;
goto out;
+   }
+
error = 0;
+
if (end == start)
goto out;
+
/*
 * If the interval [start,end) covers some unmapped address ranges,
 * just ignore them, but return -ENOMEM at the end.
 */
down_read(>mmap_sem);
vma = find_vma(mm, start);
-   for (;;) {
+   do {
struct file *file;
 
-   /* Still start < end. */
-   error = -ENOMEM;
-   if (!vma)
-   goto out_unlock;
-   /* Here start < vma->vm_end. */
+   if (!vma) {
+   error = -ENOMEM;
+   break;
+   }
if (start < vma->vm_start) {
start = vma->vm_start;
-   if (start >= end)
-   goto out_unlock;
+   if (start >= end) {
+   error = -ENOMEM;
+   break;
+   }
unmapped_error = -ENOMEM;
}
-   /* Here vma->vm_start <= start < vma->vm_end. */
-   if ((flags & MS_INVALIDATE) &&
-   (vma->vm_flags & VM_LOCKED)) {
+   if ((flags & MS_INVALIDATE) && (vma->vm_flags & VM_LOCKED)) {
error = -EBUSY;
-   goto out_unlock;
+   break;
}
-   file = vma->vm_file;
start = vma->vm_end;
-   if ((flags & MS_SYNC) && file &&
-   (vma->vm_flags & VM_SHARED)) {
+
+   file = vma->vm_file;
+   if (file && (vma->vm_flags & VM_SHARED) && (flags & MS_SYNC)) {
get_file(file);
up_read(>mmap_sem);
error = do_fsync(file, 0);
fput(file);
-   if (error || start >= end)
+   if (error)
goto out;
down_read(>mmap_sem);
vma = find_vma(mm, start);
-   } else {
-   if (start >= end) {
-   error = 0;
-   goto out_unlock;
-   }
-   vma = vma->vm_next;
+   continue;
}
-   }
-out_unlock:
+
+   vma = vma->vm_next;
+   } while (start < end);
up_read(>mmap_sem);
+
 out:
-   return error ? : unmapped_error;
+   return error ? error : unmapped_error;
 }
-- 
1.4.4.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH -v5 0/2] Updating ctime and mtime for memory-mapped files

2008-01-16 Thread Anton Salikhmetov
This is the fifth version of my solution for the bug #2645:

http://bugzilla.kernel.org/show_bug.cgi?id=2645

New since the previous version:

1) the case of retouching an already-dirty page pointed out
   by Miklos Szeredi has been correctly addressed;

2) a few cosmetic changes according to the latest feedback;

3) fixed the error of calling a possibly sleeping function
   from an atomic context.

The design for the first item above was suggested by Peter Zijlstra:

> It would require scanning the PTEs and marking them read-only again on
> MS_ASYNC, and some more logic in set_page_dirty() because that currently
> bails out early if the page in question is already dirty.

Miklos' test program now produces the following output for
the repeated calls to msync() with the MS_ASYNC flag:

debian:~/miklos# ./miklos_test file
begin   1200529196  1200529196  1200528798
write   1200529197  1200529197  1200528798
mmap1200529197  1200529197  1200529198
b   1200529197  1200529197  1200529198
msync b 1200529199  1200529199  1200529198
c   1200529199  1200529199  1200529198
msync c 1200529201  1200529201  1200529198
d   1200529201  1200529201  1200529198
munmap  1200529201  1200529201  1200529198
close   1200529201  1200529201  1200529198
sync1200529204  1200529204  1200529198
debian:~/miklos#

Miklos' test program can be found using the following link:

http://lkml.org/lkml/2008/1/14/104
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: mtime updates for mmapped files.

2008-01-16 Thread Anton Salikhmetov
2008/1/16, Rogier Wolff <[EMAIL PROTECTED]>:
>
> Hi,
>
> I  wrote a small app yesterday that updates a file by mmapping the
> file (RW), changing the thing around, and then exiting.
>
> This did not trigger a change in the mtime of the file. Thus rsync
> didn't pick up that the file had changed.
>
> I understand that tracking every change to a RW mmapped file is
> costly, and thus unfeasable, but shouldn't then the close cause a
> mtime update?
>
> The server where this happened is running 2.6.21, so my apologies if
> this has already been corrected.

Unfortunately, this issue has not been fully fixed yet.

My last attempt (http://lkml.org/lkml/2008/1/15/202) to solve
this problem has a couple of drawbacks:

1) calling a possibly sleeping function from atomic context -
I've already corrected this;

2) there's a very special case with retouching the memory-mapped data
after a call to msync() with MS_ASYNC.

I'm still working on the latter case, but I guess that I have found a solution.

If you badly need a quick fix, I can send my working unreleased patch to you.
I reckon that your particular problem will be fixed by this patch.
Let me know if you want that.

However, if your application calls msync() with the MS_ASYNC flag,
it's better to wait a little bit more - I'll release the next version
of my solution
shortly.

>
> Roger.
>
> --
> ** [EMAIL PROTECTED] ** http://www.BitWizard.nl/ ** +31-15-2600998 **
> **Delftechpark 26 2628 XH  Delft, The Netherlands. KVK: 27239233**
> *-- BitWizard writes Linux device drivers for any device you may have! --*
> Q: It doesn't work. A: Look buddy, doesn't work is an ambiguous statement.
> Does it sit on the couch all day? Is it unemployed? Please be specific!
> Define 'it' and what it isn't doing. - Adapted from lxrbot FAQ
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [EMAIL PROTECTED]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: mtime updates for mmapped files.

2008-01-16 Thread Anton Salikhmetov
2008/1/16, Rogier Wolff [EMAIL PROTECTED]:

 Hi,

 I  wrote a small app yesterday that updates a file by mmapping the
 file (RW), changing the thing around, and then exiting.

 This did not trigger a change in the mtime of the file. Thus rsync
 didn't pick up that the file had changed.

 I understand that tracking every change to a RW mmapped file is
 costly, and thus unfeasable, but shouldn't then the close cause a
 mtime update?

 The server where this happened is running 2.6.21, so my apologies if
 this has already been corrected.

Unfortunately, this issue has not been fully fixed yet.

My last attempt (http://lkml.org/lkml/2008/1/15/202) to solve
this problem has a couple of drawbacks:

1) calling a possibly sleeping function from atomic context -
I've already corrected this;

2) there's a very special case with retouching the memory-mapped data
after a call to msync() with MS_ASYNC.

I'm still working on the latter case, but I guess that I have found a solution.

If you badly need a quick fix, I can send my working unreleased patch to you.
I reckon that your particular problem will be fixed by this patch.
Let me know if you want that.

However, if your application calls msync() with the MS_ASYNC flag,
it's better to wait a little bit more - I'll release the next version
of my solution
shortly.


 Roger.

 --
 ** [EMAIL PROTECTED] ** http://www.BitWizard.nl/ ** +31-15-2600998 **
 **Delftechpark 26 2628 XH  Delft, The Netherlands. KVK: 27239233**
 *-- BitWizard writes Linux device drivers for any device you may have! --*
 Q: It doesn't work. A: Look buddy, doesn't work is an ambiguous statement.
 Does it sit on the couch all day? Is it unemployed? Please be specific!
 Define 'it' and what it isn't doing. - Adapted from lxrbot FAQ
 --
 To unsubscribe from this list: send the line unsubscribe linux-kernel in
 the body of a message to [EMAIL PROTECTED]
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 Please read the FAQ at  http://www.tux.org/lkml/

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/2] Updating ctime and mtime for memory-mapped files [try #4]

2008-01-15 Thread Anton Salikhmetov
2008/1/15, Miklos Szeredi <[EMAIL PROTECTED]>:
> > 1. Introduction
> >
> > This is the fourth version of my solution for the bug #2645:
> >
> > http://bugzilla.kernel.org/show_bug.cgi?id=2645
> >
> > Changes since the previous version:
> >
> > 1) the case of retouching an already-dirty page pointed out
> >   by Miklos Szeredi has been addressed;
>
> I'm a bit sceptical, as we've also pointed out, that this is not
> possible without messing with the page tables.
>
> Did you try my test program on the patched kernel?

I just tried your test program. Alas, my assumption appears to be wrong.

Thank you for your comment!

Now I start thinking that it is better not to care about
the MS_ASYNC case whatsoever.

>
> I've refreshed the patch, where we left this issue last time.  It
> should basically have equivalent functionality to your patch, and is a
> lot simpler.  There might be performance issues with it, but it's a
> good starting point.
>
> Miklos
> 
>
> Index: linux/mm/memory.c
> ===
> --- linux.orig/mm/memory.c  2008-01-09 21:16:30.0 +0100
> +++ linux/mm/memory.c   2008-01-15 21:16:14.0 +0100
> @@ -1680,6 +1680,8 @@ gotten:
>  unlock:
> pte_unmap_unlock(page_table, ptl);
> if (dirty_page) {
> +   if (vma->vm_file)
> +   file_update_time(vma->vm_file);
> /*
>  * Yes, Virginia, this is actually required to prevent a race
>  * with clear_page_dirty_for_io() from clearing the page dirty
> @@ -2313,6 +2315,8 @@ out_unlocked:
> if (anon)
> page_cache_release(vmf.page);
> else if (dirty_page) {
> +   if (vma->vm_file)
> +   file_update_time(vma->vm_file);
> set_page_dirty_balance(dirty_page, page_mkwrite);
> put_page(dirty_page);
> }
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/2] Massive code cleanup of sys_msync()

2008-01-15 Thread Anton Salikhmetov
2008/1/15, Christoph Hellwig <[EMAIL PROTECTED]>:
> On Tue, Jan 15, 2008 at 08:28:48PM +0100, Peter Zijlstra wrote:
> > Notice that error is already -EINVAL, so a simple goto should suffice.
>
> Yes, for the start of the function you can basically leave it as-is.
>

OK, I will do as you suggest. Thank you for your answers.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/2] Massive code cleanup of sys_msync()

2008-01-15 Thread Anton Salikhmetov
2008/1/15, Randy Dunlap <[EMAIL PROTECTED]>:
> On Tue, 15 Jan 2008 22:02:54 +0300 Anton Salikhmetov wrote:
>
> > 2008/1/15, Christoph Hellwig <[EMAIL PROTECTED]>:
> > > On Tue, Jan 15, 2008 at 07:02:44PM +0300, Anton Salikhmetov wrote:
>
> > > > @@ -33,71 +34,65 @@ asmlinkage long sys_msync(unsigned long start, 
> > > > size_t len, int flags)
> > > >   unsigned long end;
> > > >   struct mm_struct *mm = current->mm;
> > > >   struct vm_area_struct *vma;
> > > > - int unmapped_error = 0;
> > > > - int error = -EINVAL;
> > > > + int error = 0, unmapped_error = 0;
> > > >
> > > >   if (flags & ~(MS_ASYNC | MS_INVALIDATE | MS_SYNC))
> > > > - goto out;
> > > > + return -EINVAL;
> > > >   if (start & ~PAGE_MASK)
> > > > - goto out;
> > > > + return -EINVAL;
> > >
> > > The goto out for a simple return style is used quite commonly in kernel
> > > code to have a single return statement which makes code maintaince, e.g.
> > > adding locks or allocations simpler.  Not sure that getting rid of it
> > > makes a lot of sense.
> >
> > Sorry, I can't agree. That's what is written in the CodingStyle document:
> >
> > The goto statement comes in handy when a function exits from multiple
> > locations and some common work such as cleanup has to be done.
>
> CodingStyle does not try to cover Everything.  Nor do we want it to.
>
> At any rate, there is a desire for functions to have a single point
> of return, regardless of the amount of cleanup to be done, so I agree
> with Christoph's comments.

Should I replace "return -EINVAL;" statement with the following?

{
error = -EINVAL;
goto out;
}

>
>
> > The second part of requirement does not hold true for the sys_msync() 
> > routine.
> >
> > >
> > > > + file = vma->vm_file;
> > > > + if ((flags & MS_SYNC) && file && (vma->vm_flags & 
> > > > VM_SHARED)) {
> > >
> > > Given that file is assigned just above it would be more readable to test
> > > it first.
> >
> > The second part of my solution changes this code, anyway.
> >
> > >
> > > > + if (error)
> > > > + return error;
> > >
> > > This should be an goto out, returns out of the middle of the function
> > > make reading and maintaining the code not so nice.
> >
> > Sorry, I don't think so. No "common cleanup" is needed here.
>
>
> ---
> ~Randy
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/2] Massive code cleanup of sys_msync()

2008-01-15 Thread Anton Salikhmetov
2008/1/15, Christoph Hellwig <[EMAIL PROTECTED]>:
> On Tue, Jan 15, 2008 at 07:02:44PM +0300, Anton Salikhmetov wrote:
> > +++ b/mm/msync.c
> > @@ -1,24 +1,25 @@
> >  /*
> >   *   linux/mm/msync.c
> >   *
> > + * The msync() system call.
> >   * Copyright (C) 1994-1999  Linus Torvalds
> > + *
> > + * Massive code cleanup.
> > + * Copyright (C) 2008 Anton Salikhmetov <[EMAIL PROTECTED]>
>
> Please don't put the changelog in here, that's what the log in the SCM
> is for.  And while you're at it remove the confusing file name comment.
> This should now look something like:
>
> /*
>  * The msync() system call.
>  *
>  * Copyright (C) 1994-1999  Linus Torvalds
>  * Copyright (C) 2008 Anton Salikhmetov <[EMAIL PROTECTED]>
>  */

Thanks!

I'll take into account your recommendation.

>
> > @@ -33,71 +34,65 @@ asmlinkage long sys_msync(unsigned long start, size_t 
> > len, int flags)
> >   unsigned long end;
> >   struct mm_struct *mm = current->mm;
> >   struct vm_area_struct *vma;
> > - int unmapped_error = 0;
> > - int error = -EINVAL;
> > + int error = 0, unmapped_error = 0;
> >
> >   if (flags & ~(MS_ASYNC | MS_INVALIDATE | MS_SYNC))
> > - goto out;
> > + return -EINVAL;
> >   if (start & ~PAGE_MASK)
> > - goto out;
> > + return -EINVAL;
>
> The goto out for a simple return style is used quite commonly in kernel
> code to have a single return statement which makes code maintaince, e.g.
> adding locks or allocations simpler.  Not sure that getting rid of it
> makes a lot of sense.

Sorry, I can't agree. That's what is written in the CodingStyle document:

The goto statement comes in handy when a function exits from multiple
locations and some common work such as cleanup has to be done.

The second part of requirement does not hold true for the sys_msync() routine.

>
> > + file = vma->vm_file;
> > + if ((flags & MS_SYNC) && file && (vma->vm_flags & VM_SHARED)) 
> > {
>
> Given that file is assigned just above it would be more readable to test
> it first.

The second part of my solution changes this code, anyway.

>
> > + if (error)
> > + return error;
>
> This should be an goto out, returns out of the middle of the function
> make reading and maintaining the code not so nice.

Sorry, I don't think so. No "common cleanup" is needed here.

>
> >   return error ? : unmapped_error;
>
> Care to get rid of this odd GNU extension while you're at it and use
> the proper

I do also think that this GNU extension is not readable,
so I'll take your recommentation into account.

>
> return error ? error : unmapped_error;
>
> ?
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/2] Updating ctime and mtime at syncing

2008-01-15 Thread Anton Salikhmetov
2008/1/15, Christoph Hellwig <[EMAIL PROTECTED]>:
> On Tue, Jan 15, 2008 at 07:02:45PM +0300, Anton Salikhmetov wrote:
> > +/*
> > + * Update the ctime and mtime stamps for memory-mapped block device files.
> > + */
> > +static void bd_inode_update_time(struct inode *inode, struct timespec *ts)
> > +{
> > + struct block_device *bdev = inode->i_bdev;
> > + struct list_head *p;
> > +
> > + if (bdev == NULL)
> > + return;
>
> inode->i_bdev is never NULL for inodes currently beeing written to.
>
> > +
> > + mutex_lock(>bd_mutex);
> > + list_for_each(p, >bd_inodes) {
> > + inode = list_entry(p, struct inode, i_devices);
>
> this should use list_for_each_entry.
>
>

Thank you very much for your recommenations. I'll take them into account.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/2] Updating ctime and mtime at syncing

2008-01-15 Thread Anton Salikhmetov
2008/1/15, Peter Zijlstra <[EMAIL PROTECTED]>:
>
> On Tue, 2008-01-15 at 19:02 +0300, Anton Salikhmetov wrote:
>
> > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > index 3d3848f..53d0e34 100644
> > --- a/mm/page-writeback.c
> > +++ b/mm/page-writeback.c
> > @@ -997,35 +997,39 @@ int __set_page_dirty_no_writeback(struct page *page)
> >   */
> >  int __set_page_dirty_nobuffers(struct page *page)
> >  {
> > - if (!TestSetPageDirty(page)) {
> > - struct address_space *mapping = page_mapping(page);
> > - struct address_space *mapping2;
> > + struct address_space *mapping = page_mapping(page);
> > + struct address_space *mapping2;
> >
> > - if (!mapping)
> > - return 1;
> > + if (!mapping)
> > + return 1;
> >
> > - write_lock_irq(>tree_lock);
> > - mapping2 = page_mapping(page);
> > - if (mapping2) { /* Race with truncate? */
> > - BUG_ON(mapping2 != mapping);
> > - WARN_ON_ONCE(!PagePrivate(page) && 
> > !PageUptodate(page));
> > - if (mapping_cap_account_dirty(mapping)) {
> > - __inc_zone_page_state(page, NR_FILE_DIRTY);
> > - __inc_bdi_stat(mapping->backing_dev_info,
> > - BDI_RECLAIMABLE);
> > - task_io_account_write(PAGE_CACHE_SIZE);
> > - }
> > - radix_tree_tag_set(>page_tree,
> > - page_index(page), PAGECACHE_TAG_DIRTY);
> > - }
> > - write_unlock_irq(>tree_lock);
> > - if (mapping->host) {
> > - /* !PageAnon && !swapper_space */
> > - __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
> > + mapping->mtime = CURRENT_TIME;
> > + set_bit(AS_MCTIME, >flags);
>
> This seems vulnerable to the race we have against truncate, handled by
> the mapping2 magic below. Do we care?
>
> > +
> > + if (TestSetPageDirty(page))
> > + return 0;
> > +
> > + write_lock_irq(>tree_lock);
> > + mapping2 = page_mapping(page);
> > + if (mapping2) {
> > + /* Race with truncate? */
> > + BUG_ON(mapping2 != mapping);
> > + WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
> > + if (mapping_cap_account_dirty(mapping)) {
> > + __inc_zone_page_state(page, NR_FILE_DIRTY);
> > + __inc_bdi_stat(mapping->backing_dev_info,
> > + BDI_RECLAIMABLE);
> > + task_io_account_write(PAGE_CACHE_SIZE);
> >   }
> > - return 1;
> > + radix_tree_tag_set(>page_tree,
> > + page_index(page), PAGECACHE_TAG_DIRTY);
> >   }
> > - return 0;
> > + write_unlock_irq(>tree_lock);
> > +
> > + if (mapping->host)
> > + __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);

The inode gets marked dirty using the same "mapping" variable
as my code does. So, AFAIU, my change does not introduce any new
vulnerabilities. I would nevertherless be grateful to you for a scenario
where the race would be triggered.

> > +
> > + return 1;
> >  }
> >  EXPORT_SYMBOL(__set_page_dirty_nobuffers);
> >
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 0/2] Updating ctime and mtime for memory-mapped files [try #4]

2008-01-15 Thread Anton Salikhmetov
1. Introduction

This is the fourth version of my solution for the bug #2645:

http://bugzilla.kernel.org/show_bug.cgi?id=2645

Changes since the previous version:

1) the case of retouching an already-dirty page pointed out
  by Miklos Szeredi has been addressed;

2) the file metadata are updated using the page modification time
  instead of the time of syncing data;

3) a few small corrections according to the latest feedback.

Brief explanation of these changes as well as some design considerations
are given below.

2. The case of retouching an already-dirtied page

Miklos Szeredi gave the following feedback on the previous version:

> I suspect your patch is ignoring writes after the first msync, but
> then why care about msync at all?  What's so special about the _first_
> msync?  Is it just that most test programs only check this, and not
> what happens if msync is called more than once?  That would be a bug
> in the test cases.

This version adds handling of the case of multiple msync() calls. Before
going on with the explanaion, I'll quote a remark by Peter Zijlstra:

> I must agree, doing the mmap dirty, MS_ASYNC, mmap retouch, MS_ASYNC
> case correctly would need a lot more code which I doubt is worth the
> effort.
>
> It would require scanning the PTEs and marking them read-only again on
> MS_ASYNC, and some more logic in set_page_dirty() because that currently
> bails out early if the page in question is already dirty.

Indeed, the following logic of the __set_pages_dirty_nobuffers() function:

if (!TestSetPageDirty(page)) {
   mapping = page_mapping(page);

   if (!mapping)
   return 1;

   /* critical section */

   if (mapping->host) {
   __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
   set_bit(AS_MCTIME, >flags);
   }
   return 1;
}
return 0;

made it difficult to account for the case of the already-dirty page
retouch after the call to msync(MS_ASYNC).

In this version of my solution, I redesigned the logic of the same
function as follows:

mapping = page_mapping(page);

if (!mapping)
   return 1;

set_bit(AS_MCTIME, >flags);

if (TestSetPageDirty(page))
   return 0;

/* critical section */

if (mapping->host) {
   __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);

return 1;

This allows us to set the AS_MCTIME bit independently of whether the page
had already been dirtied or not. Besides, such change makes the logic of
the topmost "if" in this function straight thus improving readability.
Finally, we already have the __set_page_dirty() routine with almost
identical functionality. My redesign of __set_pages_dirty_nobuffers()
is based on how the __set_page_dirty() routine is implemented.

Miklos gave an example of a scenario, where the previous version of
my solution would fail:

http://lkml.org/lkml/2008/1/14/100

Here is how it looks in the version I am sending now:

 1 page is dirtied through mapping
   => the AS_MCTIME bit turned on
 2 app calls msync(MS_ASYNC)
   => inode's times updated, the AS_MCTIME bit turned off
 3 page is written again through mapping
   => the AS_MCTIME bit turned on again
 4 app calls msync(MS_ASYNC)
   => inode's times updated, the AS_MCTIME bit turned off
 5 ...
 6 page is written back
   => ... by this moment, the either the msync(MS_ASYNC) has
  taken care of updating the file times, or the AS_MCTIME
  bit is on.

I think that the feedback about writes after the first msync(MS_ASYNC)
has thereby been addressed.

3. Updating the time stamps of the block device special files

As for the block device case, let's start from the following assumption:

if the block device data changes, we should do our best to tell the world
that this has happened.

This is how I approach this requirement:

1) if the block device is active, this is done at next *sync() through
  calling the bd_inode_update_time() helper function.

2) if the block device is not active, this is done during the block
  device file deactivation in the unlink_file_vma() routine.

Removing either of these actions would leave a possibility of losing
information about the block device data update. That is why I am keeping
both.

4. Recording the time was the file data changed

Finally, I noticed yet another issue with the previous version of my patch.
Specifically, the time stamps were set to the current time of the moment
when syncing but not the write reference was being done. This led to the
following adverse effect on my development system:

1) a text file A was updated by process B;
2) process B exits without calling any of the *sync() functions;
3) vi editor opens the file A;
4) file data synced, file times updated;
5) vi is confused by "thinking" that the file was changed after 3).

This version overcomes this problem by introducing another field into the
address_space structure. This field is used to "remember" the time of
dirtying, and then this time value is propagated to the file 

[PATCH 1/2] Massive code cleanup of sys_msync()

2008-01-15 Thread Anton Salikhmetov
Substantial code cleanup of the sys_msync() function:

1) using the PAGE_ALIGN() macro instead of "manual" alignment;
2) improved readability of the loop traversing the process memory regions.

Signed-off-by: Anton Salikhmetov <[EMAIL PROTECTED]>
---
 mm/msync.c |   79 
 1 files changed, 37 insertions(+), 42 deletions(-)

diff --git a/mm/msync.c b/mm/msync.c
index 144a757..3270caa 100644
--- a/mm/msync.c
+++ b/mm/msync.c
@@ -1,24 +1,25 @@
 /*
  * linux/mm/msync.c
  *
+ * The msync() system call.
  * Copyright (C) 1994-1999  Linus Torvalds
+ *
+ * Massive code cleanup.
+ * Copyright (C) 2008 Anton Salikhmetov <[EMAIL PROTECTED]>
  */
 
-/*
- * The msync() system call.
- */
+#include 
 #include 
 #include 
 #include 
-#include 
-#include 
 #include 
+#include 
 
 /*
  * MS_SYNC syncs the entire file - including mappings.
  *
  * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
- * Nor does it marks the relevant pages dirty (it used to up to 2.6.17).
+ * Nor does it mark the relevant pages dirty (it used to up to 2.6.17).
  * Now it doesn't do anything, since dirty pages are properly tracked.
  *
  * The application may now run fsync() to
@@ -33,71 +34,65 @@ asmlinkage long sys_msync(unsigned long start, size_t len, 
int flags)
unsigned long end;
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma;
-   int unmapped_error = 0;
-   int error = -EINVAL;
+   int error = 0, unmapped_error = 0;
 
if (flags & ~(MS_ASYNC | MS_INVALIDATE | MS_SYNC))
-   goto out;
+   return -EINVAL;
if (start & ~PAGE_MASK)
-   goto out;
+   return -EINVAL;
if ((flags & MS_ASYNC) && (flags & MS_SYNC))
-   goto out;
-   error = -ENOMEM;
-   len = (len + ~PAGE_MASK) & PAGE_MASK;
+   return -EINVAL;
+
+   len = PAGE_ALIGN(len);
end = start + len;
if (end < start)
-   goto out;
-   error = 0;
+   return -ENOMEM;
if (end == start)
-   goto out;
+   return 0;
+
/*
 * If the interval [start,end) covers some unmapped address ranges,
 * just ignore them, but return -ENOMEM at the end.
 */
down_read(>mmap_sem);
vma = find_vma(mm, start);
-   for (;;) {
+   do {
struct file *file;
 
-   /* Still start < end. */
-   error = -ENOMEM;
-   if (!vma)
-   goto out_unlock;
-   /* Here start < vma->vm_end. */
+   if (!vma) {
+   error = -ENOMEM;
+   break;
+   }
if (start < vma->vm_start) {
start = vma->vm_start;
-   if (start >= end)
-   goto out_unlock;
+   if (start >= end) {
+   error = -ENOMEM;
+   break;
+   }
unmapped_error = -ENOMEM;
}
-   /* Here vma->vm_start <= start < vma->vm_end. */
-   if ((flags & MS_INVALIDATE) &&
-   (vma->vm_flags & VM_LOCKED)) {
+   if ((flags & MS_INVALIDATE) && (vma->vm_flags & VM_LOCKED)) {
error = -EBUSY;
-   goto out_unlock;
+   break;
}
-   file = vma->vm_file;
start = vma->vm_end;
-   if ((flags & MS_SYNC) && file &&
-   (vma->vm_flags & VM_SHARED)) {
+
+   file = vma->vm_file;
+   if ((flags & MS_SYNC) && file && (vma->vm_flags & VM_SHARED)) {
get_file(file);
up_read(>mmap_sem);
error = do_fsync(file, 0);
fput(file);
-   if (error || start >= end)
-   goto out;
+   if (error)
+   return error;
down_read(>mmap_sem);
vma = find_vma(mm, start);
-   } else {
-   if (start >= end) {
-   error = 0;
-   goto out_unlock;
-   }
-   vma = vma->vm_next;
+   continue;
}
-   }
-out_unlock:
+
+   vma = vma->vm_next;
+   } while (start < end);
up_read(>mmap_sem);
-out:
+
return error ? : unmapped_error;
 }
-- 
1.4.4.4

--
To unsu

[PATCH 2/2] Updating ctime and mtime at syncing

2008-01-15 Thread Anton Salikhmetov
http://bugzilla.kernel.org/show_bug.cgi?id=2645

Changes for updating the ctime and mtime fields for memory-mapped files:

1) new flag triggering update of the inode data;
2) new field in the address_space structure for saving modification time;
3) new function to update ctime and mtime for block device files;
4) new helper function to update ctime and mtime when needed;
5) updating time stamps for mapped files in sys_msync() and do_fsync();
6) implementing the feature of auto-updating ctime and mtime;
7) account for the case of retouching an already-dirtied page.

Signed-off-by: Anton Salikhmetov <[EMAIL PROTECTED]>
---
 fs/buffer.c |3 ++
 fs/fs-writeback.c   |2 +
 fs/inode.c  |   53 ++---
 fs/sync.c   |2 +
 include/linux/fs.h  |   13 ++-
 include/linux/pagemap.h |3 +-
 mm/mmap.c   |3 ++
 mm/msync.c  |   29 
 mm/page-writeback.c |   54 +-
 9 files changed, 112 insertions(+), 50 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 7249e01..3967aa7 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -701,6 +701,9 @@ static int __set_page_dirty(struct page *page,
if (unlikely(!mapping))
return !TestSetPageDirty(page);
 
+   mapping->mtime = CURRENT_TIME;
+   set_bit(AS_MCTIME, >flags);
+
if (TestSetPageDirty(page))
return 0;
 
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 300324b..affd291 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -243,6 +243,8 @@ __sync_single_inode(struct inode *inode, struct 
writeback_control *wbc)
 
spin_unlock(_lock);
 
+   mapping_update_time(mapping);
+
ret = do_writepages(mapping, wbc);
 
/* Don't write the inode if only I_DIRTY_PAGES was set */
diff --git a/fs/inode.c b/fs/inode.c
index ed35383..5997046 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1243,8 +1243,9 @@ void touch_atime(struct vfsmount *mnt, struct dentry 
*dentry)
 EXPORT_SYMBOL(touch_atime);
 
 /**
- * file_update_time-   update mtime and ctime time
- * @file: file accessed
+ * inode_update_time   -   update mtime and ctime time
+ * @inode: inode accessed
+ * @ts: time when inode was accessed
  *
  * Update the mtime and ctime members of an inode and mark the inode
  * for writeback.  Note that this function is meant exclusively for
@@ -1253,11 +1254,8 @@ EXPORT_SYMBOL(touch_atime);
  * S_NOCTIME inode flag, e.g. for network filesystem where these
  * timestamps are handled by the server.
  */
-
-void file_update_time(struct file *file)
+void inode_update_time(struct inode *inode, struct timespec *ts)
 {
-   struct inode *inode = file->f_path.dentry->d_inode;
-   struct timespec now;
int sync_it = 0;
 
if (IS_NOCMTIME(inode))
@@ -1265,22 +1263,52 @@ void file_update_time(struct file *file)
if (IS_RDONLY(inode))
return;
 
-   now = current_fs_time(inode->i_sb);
-   if (!timespec_equal(>i_mtime, )) {
-   inode->i_mtime = now;
+   if (timespec_compare(>i_mtime, ts) < 0) {
+   inode->i_mtime = *ts;
sync_it = 1;
}
 
-   if (!timespec_equal(>i_ctime, )) {
-   inode->i_ctime = now;
+   if (timespec_compare(>i_ctime, ts) < 0) {
+   inode->i_ctime = *ts;
sync_it = 1;
}
 
if (sync_it)
mark_inode_dirty_sync(inode);
 }
+EXPORT_SYMBOL(inode_update_time);
+
+/*
+ * Update the ctime and mtime stamps for memory-mapped block device files.
+ */
+static void bd_inode_update_time(struct inode *inode, struct timespec *ts)
+{
+   struct block_device *bdev = inode->i_bdev;
+   struct list_head *p;
+
+   if (bdev == NULL)
+   return;
+
+   mutex_lock(>bd_mutex);
+   list_for_each(p, >bd_inodes) {
+   inode = list_entry(p, struct inode, i_devices);
+   inode_update_time(inode, ts);
+   }
+   mutex_unlock(>bd_mutex);
+}
 
-EXPORT_SYMBOL(file_update_time);
+/*
+ * Update the ctime and mtime stamps after checking if they are to be updated.
+ */
+void mapping_update_time(struct address_space *mapping)
+{
+   if (test_and_clear_bit(AS_MCTIME, >flags)) {
+   if (S_ISBLK(mapping->host->i_mode))
+   bd_inode_update_time(mapping->host, >mtime);
+   else
+   inode_update_time(mapping->host, >mtime);
+   }
+}
 
 int inode_needs_sync(struct inode *inode)
 {
@@ -1290,7 +1318,6 @@ int inode_needs_sync(struct inode *inode)
return 1;
return 0;
 }
-
 EXPORT_SYMBOL(inode_needs_sync);
 
 int inode_wait(void *word)
diff --git a/fs/sync.c b/fs/sync.c
index 7cd005e..556146

Re: [PATCH 2/2] updating ctime and mtime at syncing

2008-01-15 Thread Anton Salikhmetov
2008/1/15, Miklos Szeredi <[EMAIL PROTECTED]>:
> > Thanks for your review, Peter and Miklos!
> >
> > I overlooked this case when AS_MCTIME flag has been turned off and the
> > page is still dirty.
> >
> > On the other hand, the words "shall be marked for update" may be
> > considered as just setting the AS_MCTIME flag, not updating the time
> > stamps.
> >
> > What do you think about calling mapping_update_time() inside of "if
> > (MS_SYNC & flags)"? I suggest such change because the code for
> > analysis of the case you've mentioned above seems impossible to me.
>
> I think that's a good idea.  As a first iteration, just updating the
> mtime/ctime in msync(MS_SYNC) and remove_vma() (called at munmap time)
> would be a big improvement over what we currently have.
>
> I would also recommend, that you drop mapping_update_time() and the
> related functions from the patch, and just use file_update_time()
> instead.

Thank you for your recommendations. I will submit my new solution shortly.

By the way, I've already changed the unlink_file_vma() function.

>
> Miklos
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/2] updating ctime and mtime at syncing

2008-01-15 Thread Anton Salikhmetov
2008/1/15, Miklos Szeredi [EMAIL PROTECTED]:
  Thanks for your review, Peter and Miklos!
 
  I overlooked this case when AS_MCTIME flag has been turned off and the
  page is still dirty.
 
  On the other hand, the words shall be marked for update may be
  considered as just setting the AS_MCTIME flag, not updating the time
  stamps.
 
  What do you think about calling mapping_update_time() inside of if
  (MS_SYNC  flags)? I suggest such change because the code for
  analysis of the case you've mentioned above seems impossible to me.

 I think that's a good idea.  As a first iteration, just updating the
 mtime/ctime in msync(MS_SYNC) and remove_vma() (called at munmap time)
 would be a big improvement over what we currently have.

 I would also recommend, that you drop mapping_update_time() and the
 related functions from the patch, and just use file_update_time()
 instead.

Thank you for your recommendations. I will submit my new solution shortly.

By the way, I've already changed the unlink_file_vma() function.


 Miklos

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 2/2] Updating ctime and mtime at syncing

2008-01-15 Thread Anton Salikhmetov
http://bugzilla.kernel.org/show_bug.cgi?id=2645

Changes for updating the ctime and mtime fields for memory-mapped files:

1) new flag triggering update of the inode data;
2) new field in the address_space structure for saving modification time;
3) new function to update ctime and mtime for block device files;
4) new helper function to update ctime and mtime when needed;
5) updating time stamps for mapped files in sys_msync() and do_fsync();
6) implementing the feature of auto-updating ctime and mtime;
7) account for the case of retouching an already-dirtied page.

Signed-off-by: Anton Salikhmetov [EMAIL PROTECTED]
---
 fs/buffer.c |3 ++
 fs/fs-writeback.c   |2 +
 fs/inode.c  |   53 ++---
 fs/sync.c   |2 +
 include/linux/fs.h  |   13 ++-
 include/linux/pagemap.h |3 +-
 mm/mmap.c   |3 ++
 mm/msync.c  |   29 
 mm/page-writeback.c |   54 +-
 9 files changed, 112 insertions(+), 50 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 7249e01..3967aa7 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -701,6 +701,9 @@ static int __set_page_dirty(struct page *page,
if (unlikely(!mapping))
return !TestSetPageDirty(page);
 
+   mapping-mtime = CURRENT_TIME;
+   set_bit(AS_MCTIME, mapping-flags);
+
if (TestSetPageDirty(page))
return 0;
 
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 300324b..affd291 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -243,6 +243,8 @@ __sync_single_inode(struct inode *inode, struct 
writeback_control *wbc)
 
spin_unlock(inode_lock);
 
+   mapping_update_time(mapping);
+
ret = do_writepages(mapping, wbc);
 
/* Don't write the inode if only I_DIRTY_PAGES was set */
diff --git a/fs/inode.c b/fs/inode.c
index ed35383..5997046 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1243,8 +1243,9 @@ void touch_atime(struct vfsmount *mnt, struct dentry 
*dentry)
 EXPORT_SYMBOL(touch_atime);
 
 /**
- * file_update_time-   update mtime and ctime time
- * @file: file accessed
+ * inode_update_time   -   update mtime and ctime time
+ * @inode: inode accessed
+ * @ts: time when inode was accessed
  *
  * Update the mtime and ctime members of an inode and mark the inode
  * for writeback.  Note that this function is meant exclusively for
@@ -1253,11 +1254,8 @@ EXPORT_SYMBOL(touch_atime);
  * S_NOCTIME inode flag, e.g. for network filesystem where these
  * timestamps are handled by the server.
  */
-
-void file_update_time(struct file *file)
+void inode_update_time(struct inode *inode, struct timespec *ts)
 {
-   struct inode *inode = file-f_path.dentry-d_inode;
-   struct timespec now;
int sync_it = 0;
 
if (IS_NOCMTIME(inode))
@@ -1265,22 +1263,52 @@ void file_update_time(struct file *file)
if (IS_RDONLY(inode))
return;
 
-   now = current_fs_time(inode-i_sb);
-   if (!timespec_equal(inode-i_mtime, now)) {
-   inode-i_mtime = now;
+   if (timespec_compare(inode-i_mtime, ts)  0) {
+   inode-i_mtime = *ts;
sync_it = 1;
}
 
-   if (!timespec_equal(inode-i_ctime, now)) {
-   inode-i_ctime = now;
+   if (timespec_compare(inode-i_ctime, ts)  0) {
+   inode-i_ctime = *ts;
sync_it = 1;
}
 
if (sync_it)
mark_inode_dirty_sync(inode);
 }
+EXPORT_SYMBOL(inode_update_time);
+
+/*
+ * Update the ctime and mtime stamps for memory-mapped block device files.
+ */
+static void bd_inode_update_time(struct inode *inode, struct timespec *ts)
+{
+   struct block_device *bdev = inode-i_bdev;
+   struct list_head *p;
+
+   if (bdev == NULL)
+   return;
+
+   mutex_lock(bdev-bd_mutex);
+   list_for_each(p, bdev-bd_inodes) {
+   inode = list_entry(p, struct inode, i_devices);
+   inode_update_time(inode, ts);
+   }
+   mutex_unlock(bdev-bd_mutex);
+}
 
-EXPORT_SYMBOL(file_update_time);
+/*
+ * Update the ctime and mtime stamps after checking if they are to be updated.
+ */
+void mapping_update_time(struct address_space *mapping)
+{
+   if (test_and_clear_bit(AS_MCTIME, mapping-flags)) {
+   if (S_ISBLK(mapping-host-i_mode))
+   bd_inode_update_time(mapping-host, mapping-mtime);
+   else
+   inode_update_time(mapping-host, mapping-mtime);
+   }
+}
 
 int inode_needs_sync(struct inode *inode)
 {
@@ -1290,7 +1318,6 @@ int inode_needs_sync(struct inode *inode)
return 1;
return 0;
 }
-
 EXPORT_SYMBOL(inode_needs_sync);
 
 int inode_wait(void *word)
diff --git a/fs/sync.c b/fs/sync.c
index 7cd005e..5561464 100644
--- a/fs/sync.c
+++ b

[PATCH 0/2] Updating ctime and mtime for memory-mapped files [try #4]

2008-01-15 Thread Anton Salikhmetov
1. Introduction

This is the fourth version of my solution for the bug #2645:

http://bugzilla.kernel.org/show_bug.cgi?id=2645

Changes since the previous version:

1) the case of retouching an already-dirty page pointed out
  by Miklos Szeredi has been addressed;

2) the file metadata are updated using the page modification time
  instead of the time of syncing data;

3) a few small corrections according to the latest feedback.

Brief explanation of these changes as well as some design considerations
are given below.

2. The case of retouching an already-dirtied page

Miklos Szeredi gave the following feedback on the previous version:

 I suspect your patch is ignoring writes after the first msync, but
 then why care about msync at all?  What's so special about the _first_
 msync?  Is it just that most test programs only check this, and not
 what happens if msync is called more than once?  That would be a bug
 in the test cases.

This version adds handling of the case of multiple msync() calls. Before
going on with the explanaion, I'll quote a remark by Peter Zijlstra:

 I must agree, doing the mmap dirty, MS_ASYNC, mmap retouch, MS_ASYNC
 case correctly would need a lot more code which I doubt is worth the
 effort.

 It would require scanning the PTEs and marking them read-only again on
 MS_ASYNC, and some more logic in set_page_dirty() because that currently
 bails out early if the page in question is already dirty.

Indeed, the following logic of the __set_pages_dirty_nobuffers() function:

if (!TestSetPageDirty(page)) {
   mapping = page_mapping(page);

   if (!mapping)
   return 1;

   /* critical section */

   if (mapping-host) {
   __mark_inode_dirty(mapping-host, I_DIRTY_PAGES);
   set_bit(AS_MCTIME, mapping-flags);
   }
   return 1;
}
return 0;

made it difficult to account for the case of the already-dirty page
retouch after the call to msync(MS_ASYNC).

In this version of my solution, I redesigned the logic of the same
function as follows:

mapping = page_mapping(page);

if (!mapping)
   return 1;

set_bit(AS_MCTIME, mapping-flags);

if (TestSetPageDirty(page))
   return 0;

/* critical section */

if (mapping-host) {
   __mark_inode_dirty(mapping-host, I_DIRTY_PAGES);

return 1;

This allows us to set the AS_MCTIME bit independently of whether the page
had already been dirtied or not. Besides, such change makes the logic of
the topmost if in this function straight thus improving readability.
Finally, we already have the __set_page_dirty() routine with almost
identical functionality. My redesign of __set_pages_dirty_nobuffers()
is based on how the __set_page_dirty() routine is implemented.

Miklos gave an example of a scenario, where the previous version of
my solution would fail:

http://lkml.org/lkml/2008/1/14/100

Here is how it looks in the version I am sending now:

 1 page is dirtied through mapping
   = the AS_MCTIME bit turned on
 2 app calls msync(MS_ASYNC)
   = inode's times updated, the AS_MCTIME bit turned off
 3 page is written again through mapping
   = the AS_MCTIME bit turned on again
 4 app calls msync(MS_ASYNC)
   = inode's times updated, the AS_MCTIME bit turned off
 5 ...
 6 page is written back
   = ... by this moment, the either the msync(MS_ASYNC) has
  taken care of updating the file times, or the AS_MCTIME
  bit is on.

I think that the feedback about writes after the first msync(MS_ASYNC)
has thereby been addressed.

3. Updating the time stamps of the block device special files

As for the block device case, let's start from the following assumption:

if the block device data changes, we should do our best to tell the world
that this has happened.

This is how I approach this requirement:

1) if the block device is active, this is done at next *sync() through
  calling the bd_inode_update_time() helper function.

2) if the block device is not active, this is done during the block
  device file deactivation in the unlink_file_vma() routine.

Removing either of these actions would leave a possibility of losing
information about the block device data update. That is why I am keeping
both.

4. Recording the time was the file data changed

Finally, I noticed yet another issue with the previous version of my patch.
Specifically, the time stamps were set to the current time of the moment
when syncing but not the write reference was being done. This led to the
following adverse effect on my development system:

1) a text file A was updated by process B;
2) process B exits without calling any of the *sync() functions;
3) vi editor opens the file A;
4) file data synced, file times updated;
5) vi is confused by thinking that the file was changed after 3).

This version overcomes this problem by introducing another field into the
address_space structure. This field is used to remember the time of
dirtying, and then this time value is propagated to the file metadata.

This 

[PATCH 1/2] Massive code cleanup of sys_msync()

2008-01-15 Thread Anton Salikhmetov
Substantial code cleanup of the sys_msync() function:

1) using the PAGE_ALIGN() macro instead of manual alignment;
2) improved readability of the loop traversing the process memory regions.

Signed-off-by: Anton Salikhmetov [EMAIL PROTECTED]
---
 mm/msync.c |   79 
 1 files changed, 37 insertions(+), 42 deletions(-)

diff --git a/mm/msync.c b/mm/msync.c
index 144a757..3270caa 100644
--- a/mm/msync.c
+++ b/mm/msync.c
@@ -1,24 +1,25 @@
 /*
  * linux/mm/msync.c
  *
+ * The msync() system call.
  * Copyright (C) 1994-1999  Linus Torvalds
+ *
+ * Massive code cleanup.
+ * Copyright (C) 2008 Anton Salikhmetov [EMAIL PROTECTED]
  */
 
-/*
- * The msync() system call.
- */
+#include linux/file.h
 #include linux/fs.h
 #include linux/mm.h
 #include linux/mman.h
-#include linux/file.h
-#include linux/syscalls.h
 #include linux/sched.h
+#include linux/syscalls.h
 
 /*
  * MS_SYNC syncs the entire file - including mappings.
  *
  * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
- * Nor does it marks the relevant pages dirty (it used to up to 2.6.17).
+ * Nor does it mark the relevant pages dirty (it used to up to 2.6.17).
  * Now it doesn't do anything, since dirty pages are properly tracked.
  *
  * The application may now run fsync() to
@@ -33,71 +34,65 @@ asmlinkage long sys_msync(unsigned long start, size_t len, 
int flags)
unsigned long end;
struct mm_struct *mm = current-mm;
struct vm_area_struct *vma;
-   int unmapped_error = 0;
-   int error = -EINVAL;
+   int error = 0, unmapped_error = 0;
 
if (flags  ~(MS_ASYNC | MS_INVALIDATE | MS_SYNC))
-   goto out;
+   return -EINVAL;
if (start  ~PAGE_MASK)
-   goto out;
+   return -EINVAL;
if ((flags  MS_ASYNC)  (flags  MS_SYNC))
-   goto out;
-   error = -ENOMEM;
-   len = (len + ~PAGE_MASK)  PAGE_MASK;
+   return -EINVAL;
+
+   len = PAGE_ALIGN(len);
end = start + len;
if (end  start)
-   goto out;
-   error = 0;
+   return -ENOMEM;
if (end == start)
-   goto out;
+   return 0;
+
/*
 * If the interval [start,end) covers some unmapped address ranges,
 * just ignore them, but return -ENOMEM at the end.
 */
down_read(mm-mmap_sem);
vma = find_vma(mm, start);
-   for (;;) {
+   do {
struct file *file;
 
-   /* Still start  end. */
-   error = -ENOMEM;
-   if (!vma)
-   goto out_unlock;
-   /* Here start  vma-vm_end. */
+   if (!vma) {
+   error = -ENOMEM;
+   break;
+   }
if (start  vma-vm_start) {
start = vma-vm_start;
-   if (start = end)
-   goto out_unlock;
+   if (start = end) {
+   error = -ENOMEM;
+   break;
+   }
unmapped_error = -ENOMEM;
}
-   /* Here vma-vm_start = start  vma-vm_end. */
-   if ((flags  MS_INVALIDATE) 
-   (vma-vm_flags  VM_LOCKED)) {
+   if ((flags  MS_INVALIDATE)  (vma-vm_flags  VM_LOCKED)) {
error = -EBUSY;
-   goto out_unlock;
+   break;
}
-   file = vma-vm_file;
start = vma-vm_end;
-   if ((flags  MS_SYNC)  file 
-   (vma-vm_flags  VM_SHARED)) {
+
+   file = vma-vm_file;
+   if ((flags  MS_SYNC)  file  (vma-vm_flags  VM_SHARED)) {
get_file(file);
up_read(mm-mmap_sem);
error = do_fsync(file, 0);
fput(file);
-   if (error || start = end)
-   goto out;
+   if (error)
+   return error;
down_read(mm-mmap_sem);
vma = find_vma(mm, start);
-   } else {
-   if (start = end) {
-   error = 0;
-   goto out_unlock;
-   }
-   vma = vma-vm_next;
+   continue;
}
-   }
-out_unlock:
+
+   vma = vma-vm_next;
+   } while (start  end);
up_read(mm-mmap_sem);
-out:
+
return error ? : unmapped_error;
 }
-- 
1.4.4.4

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo

Re: [PATCH 2/2] Updating ctime and mtime at syncing

2008-01-15 Thread Anton Salikhmetov
2008/1/15, Peter Zijlstra [EMAIL PROTECTED]:

 On Tue, 2008-01-15 at 19:02 +0300, Anton Salikhmetov wrote:

  diff --git a/mm/page-writeback.c b/mm/page-writeback.c
  index 3d3848f..53d0e34 100644
  --- a/mm/page-writeback.c
  +++ b/mm/page-writeback.c
  @@ -997,35 +997,39 @@ int __set_page_dirty_no_writeback(struct page *page)
*/
   int __set_page_dirty_nobuffers(struct page *page)
   {
  - if (!TestSetPageDirty(page)) {
  - struct address_space *mapping = page_mapping(page);
  - struct address_space *mapping2;
  + struct address_space *mapping = page_mapping(page);
  + struct address_space *mapping2;
 
  - if (!mapping)
  - return 1;
  + if (!mapping)
  + return 1;
 
  - write_lock_irq(mapping-tree_lock);
  - mapping2 = page_mapping(page);
  - if (mapping2) { /* Race with truncate? */
  - BUG_ON(mapping2 != mapping);
  - WARN_ON_ONCE(!PagePrivate(page)  
  !PageUptodate(page));
  - if (mapping_cap_account_dirty(mapping)) {
  - __inc_zone_page_state(page, NR_FILE_DIRTY);
  - __inc_bdi_stat(mapping-backing_dev_info,
  - BDI_RECLAIMABLE);
  - task_io_account_write(PAGE_CACHE_SIZE);
  - }
  - radix_tree_tag_set(mapping-page_tree,
  - page_index(page), PAGECACHE_TAG_DIRTY);
  - }
  - write_unlock_irq(mapping-tree_lock);
  - if (mapping-host) {
  - /* !PageAnon  !swapper_space */
  - __mark_inode_dirty(mapping-host, I_DIRTY_PAGES);
  + mapping-mtime = CURRENT_TIME;
  + set_bit(AS_MCTIME, mapping-flags);

 This seems vulnerable to the race we have against truncate, handled by
 the mapping2 magic below. Do we care?

  +
  + if (TestSetPageDirty(page))
  + return 0;
  +
  + write_lock_irq(mapping-tree_lock);
  + mapping2 = page_mapping(page);
  + if (mapping2) {
  + /* Race with truncate? */
  + BUG_ON(mapping2 != mapping);
  + WARN_ON_ONCE(!PagePrivate(page)  !PageUptodate(page));
  + if (mapping_cap_account_dirty(mapping)) {
  + __inc_zone_page_state(page, NR_FILE_DIRTY);
  + __inc_bdi_stat(mapping-backing_dev_info,
  + BDI_RECLAIMABLE);
  + task_io_account_write(PAGE_CACHE_SIZE);
}
  - return 1;
  + radix_tree_tag_set(mapping-page_tree,
  + page_index(page), PAGECACHE_TAG_DIRTY);
}
  - return 0;
  + write_unlock_irq(mapping-tree_lock);
  +
  + if (mapping-host)
  + __mark_inode_dirty(mapping-host, I_DIRTY_PAGES);

The inode gets marked dirty using the same mapping variable
as my code does. So, AFAIU, my change does not introduce any new
vulnerabilities. I would nevertherless be grateful to you for a scenario
where the race would be triggered.

  +
  + return 1;
   }
   EXPORT_SYMBOL(__set_page_dirty_nobuffers);
 


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/2] Massive code cleanup of sys_msync()

2008-01-15 Thread Anton Salikhmetov
2008/1/15, Christoph Hellwig [EMAIL PROTECTED]:
 On Tue, Jan 15, 2008 at 07:02:44PM +0300, Anton Salikhmetov wrote:
  +++ b/mm/msync.c
  @@ -1,24 +1,25 @@
   /*
*   linux/mm/msync.c
*
  + * The msync() system call.
* Copyright (C) 1994-1999  Linus Torvalds
  + *
  + * Massive code cleanup.
  + * Copyright (C) 2008 Anton Salikhmetov [EMAIL PROTECTED]

 Please don't put the changelog in here, that's what the log in the SCM
 is for.  And while you're at it remove the confusing file name comment.
 This should now look something like:

 /*
  * The msync() system call.
  *
  * Copyright (C) 1994-1999  Linus Torvalds
  * Copyright (C) 2008 Anton Salikhmetov [EMAIL PROTECTED]
  */

Thanks!

I'll take into account your recommendation.


  @@ -33,71 +34,65 @@ asmlinkage long sys_msync(unsigned long start, size_t 
  len, int flags)
unsigned long end;
struct mm_struct *mm = current-mm;
struct vm_area_struct *vma;
  - int unmapped_error = 0;
  - int error = -EINVAL;
  + int error = 0, unmapped_error = 0;
 
if (flags  ~(MS_ASYNC | MS_INVALIDATE | MS_SYNC))
  - goto out;
  + return -EINVAL;
if (start  ~PAGE_MASK)
  - goto out;
  + return -EINVAL;

 The goto out for a simple return style is used quite commonly in kernel
 code to have a single return statement which makes code maintaince, e.g.
 adding locks or allocations simpler.  Not sure that getting rid of it
 makes a lot of sense.

Sorry, I can't agree. That's what is written in the CodingStyle document:

The goto statement comes in handy when a function exits from multiple
locations and some common work such as cleanup has to be done.

The second part of requirement does not hold true for the sys_msync() routine.


  + file = vma-vm_file;
  + if ((flags  MS_SYNC)  file  (vma-vm_flags  VM_SHARED)) 
  {

 Given that file is assigned just above it would be more readable to test
 it first.

The second part of my solution changes this code, anyway.


  + if (error)
  + return error;

 This should be an goto out, returns out of the middle of the function
 make reading and maintaining the code not so nice.

Sorry, I don't think so. No common cleanup is needed here.


return error ? : unmapped_error;

 Care to get rid of this odd GNU extension while you're at it and use
 the proper

I do also think that this GNU extension is not readable,
so I'll take your recommentation into account.


 return error ? error : unmapped_error;

 ?


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/2] Updating ctime and mtime at syncing

2008-01-15 Thread Anton Salikhmetov
2008/1/15, Christoph Hellwig [EMAIL PROTECTED]:
 On Tue, Jan 15, 2008 at 07:02:45PM +0300, Anton Salikhmetov wrote:
  +/*
  + * Update the ctime and mtime stamps for memory-mapped block device files.
  + */
  +static void bd_inode_update_time(struct inode *inode, struct timespec *ts)
  +{
  + struct block_device *bdev = inode-i_bdev;
  + struct list_head *p;
  +
  + if (bdev == NULL)
  + return;

 inode-i_bdev is never NULL for inodes currently beeing written to.

  +
  + mutex_lock(bdev-bd_mutex);
  + list_for_each(p, bdev-bd_inodes) {
  + inode = list_entry(p, struct inode, i_devices);

 this should use list_for_each_entry.



Thank you very much for your recommenations. I'll take them into account.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


  1   2   >